ML - MU - Unit - 2 - Supervised Learning-Classification Techniques

Department of Computer Engineering
Machine Learning (01CE0715) - Sem 7, 4 Credits
Unit #2 & 3: Supervised Learning
Prof. Urvi Bhatt


Course Outcomes

After completion of this course, students will be able to:

 Understand machine-learning concepts.
 Understand and implement Classification concepts.
 Understand and analyse the different Regression algorithms.
 Apply the concept of Unsupervised Learning.
 Apply the concepts of Artificial Neural Networks.


Topics - Supervised Learning

Classification Techniques:
 Naive Bayes Classification
 Fitting Multivariate Bernoulli Distribution
 Gaussian Distribution and Multinomial Distribution
 K-Nearest Neighbours
 Decision Tree
 Random Forest
 Ensemble Learning
 Support Vector Machines
 Evaluation Measures for Classification Techniques: Confusion Matrix, Accuracy, Precision, Recall, F1 Score, Threshold, AUC-ROC

Regression Techniques:
 Basic concepts and applications of Regression
 Simple Linear Regression - Gradient Descent and Normal Equation Method
 Multiple Linear Regression
 Non-Linear Regression
 Linear Regression with Regularization
 Overfitting and Underfitting
 Hyperparameter tuning
 Evaluation metrics for Regression Techniques: MSE, RMSE, MAE, R2
Classification Techniques
Introduction, Applications
What is Classification

 Classification is a supervised machine learning process that predicts the class of input data based on the algorithm's training data.

 Classification is the process of predicting the class of given data points.

 Classification is a supervised machine learning method where the model tries to predict the correct label of a given input data.

 In classification, the model is fully trained using the training data, and then it is evaluated on test data before being used to perform prediction on new, unseen data.

 For example, a spam detection machine learning algorithm would aim to classify emails as either "spam" or "not spam."

 Classes are sometimes called targets, labels or categories.

 Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).
Applications of Classification

 Image Recognition: object detection, facial recognition, and autonomous driving.
 Object Detection in Autonomous Vehicles: enables autonomous vehicles to identify and respond to pedestrians, traffic signs, and other vehicles.
 Face Recognition: identify and authenticate individuals based on facial features, with applications in security systems, access control, and surveillance.
 Disease Diagnosis: classify diseases or predict the likelihood of certain conditions, assisting in medical diagnosis.
 Land Cover Classification in Remote Sensing: classify land cover types (e.g., forests, urban areas, water bodies) in satellite or aerial imagery, aiding environmental monitoring, urban planning, and natural resource management.
 Handwriting Recognition: classify handwritten characters or text, with applications in optical character recognition (OCR) systems and digitizing handwritten documents.
 Document Classification: categorize documents, such as news articles, legal documents, or customer support tickets, into relevant categories, facilitating efficient document retrieval and organization.
 Social Media Text Classification: extract valuable insights from social media data.
 Predicting Loan Defaults: predict loan defaults, assisting financial institutions in managing risk and making informed lending decisions.
 Voice Recognition: classify spoken words or phrases, enabling applications like voice assistants, transcription services, and speaker identification.
 Language Identification: classify text data into different languages, aiding language identification tasks, multilingual analysis, and machine translation.
 Sentiment Analysis: classify text data (e.g., customer reviews, social media posts) to determine sentiment (positive, negative, neutral) and understand public opinion and brand perception.
 Email Spam Filtering: classify emails as either spam or non-spam, helping to filter unwanted or malicious emails.
 Toxic Comment Classification: classify text comments as toxic or non-toxic, helping to identify and moderate harmful or abusive content on online platforms.
 Fraud Detection: identify fraudulent transactions or activities, playing a critical role in financial institutions, e-commerce platforms, and security systems.
 Stock Market Prediction: classify stocks as buy, sell, or hold based on historical market data and indicators.
 Recommendation Systems: predict user preferences and classify items or content to provide personalized recommendations, improving user experience in e-commerce, streaming platforms, and content curation.
 Intrusion Detection in Cybersecurity: classify network intrusions and cyber threats, distinguishing between normal network traffic and malicious activities.
Classification Techniques (Categorical Data) Algorithms

 Linear Models:
   Logistic Regression: Binomial Logistic Regression, Multinomial Logistic Regression, Ordinal Logistic Regression
   Support Vector Machine (SVM)

 Non-linear Models:
   K-Nearest Neighbours (KNN)
   Decision Tree
   Naïve Bayes: Multivariate Bernoulli Distribution, Gaussian Distribution, Multinomial Distribution
   Random Forest
   Ensemble Methods: Gradient Boosting
Learners in Classification Problems

In classification problems, there are two types of learners:

 Lazy Learners:
   A lazy learner first stores the training dataset and waits until it receives the test dataset.
   In the lazy learner case, classification is done on the basis of the most related data stored in the training dataset.
   It takes less time in training but more time for predictions.
   Example: K-NN algorithm, Case-based reasoning.

 Eager Learners:
   Eager learners develop a classification model based on a training dataset before receiving a test dataset.
   Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction.
   Example: Logistic Regression, Support Vector Machine, Decision Trees, Naïve Bayes, ANN.
Lazy Learners VS Eager Learners
Logistic Regression
Introduction, Type of Logistic Regression, Sigmoid Function,
Example, Advantage & Disadvantage
Logistic Regression

 Logistic regression is one of the most popular machine learning algorithms, and it comes under the supervised learning technique.

 It is used for predicting the categorical dependent variable using a given set of independent variables.

 Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.

 Logistic regression is a calculation used to predict a binary outcome: either something happens, or it does not. This can be exhibited as Yes/No, Pass/Fail, Alive/Dead, etc.

 The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1.
Contd.

 In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).

 The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.
Logistic Function (Sigmoid Function)

 The sigmoid function is a mathematical function used to map the predicted values to probabilities.

 It maps any real value into another value within a range of 0 and 1.

 The value of the logistic regression must be between 0 and 1, and it cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function.

 In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
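 The sigmoid itself (used in the worked examples that follow) maps the linear score z, e.g. z = b0 + b1·X, to a probability:

  σ(z) = 1 / (1 + e^(-z))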
Types of Logistic Regression

On the basis of the categories, logistic regression can be classified into three types:

1. Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.

2. Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".

3. Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "Low", "Medium", or "High".
DEMO - Logistic Regression
 https://colab.research.google.com/drive/1Z5o4ZYibatAsE2_WlHz0bJyboSiy4lUk?usp=sharing
Logistic Regression Example:

 A simple example with one feature X and one binary target variable Y. Here's the dataset:

  X   Y
  2   0
  3   0
  5   1
  6   1

 Model: z = -7 + 2·X

Apply logistic regression to answer the following questions:

 Q-1: For X = 4, what is the predicted probability of Y?
 Q-2: At least which value of X is required to get P(Y = 1) = 89%?

Ans:

 Q-1: For X = 4, the predicted probability of Y is 0.7311.
 Q-2: X ≈ 4.5454 is required to get P(Y = 1) = 89%.
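A minimal Python sketch (standard library only, not part of the original slides) that reproduces these two answers using the model z = -7 + 2·X given above:

import math

def sigmoid(z):
    # Map the linear score z to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = -7.0, 2.0                             # intercept and slope from z = -7 + 2*X

# Q-1: predicted probability of Y = 1 when X = 4
print(round(sigmoid(b0 + b1 * 4), 4))          # 0.7311

# Q-2: smallest X with P(Y = 1) >= 0.89
# Invert the sigmoid: z = ln(p / (1 - p)), then solve b0 + b1*X = z for X
z_required = math.log(0.89 / 0.11)
print(round((z_required - b0) / b1, 4))        # ~4.5454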
Example:

 Dataset:

  X   Y
  2   0
  3   0
  5   1
  6   1

 Given a logistic regression model defined by: z = -5 + 1.5·X

Let's answer the following questions:

 Q-1: What is the predicted probability of Y for X = 6?
 Q-2: At least what value of X is required to get Y with 95% probability?

Ans:

 Q-1: The predicted probability of Y when X = 6 is approximately 0.982 (or 98.2%).
 Q-2: X needs to be at least approximately 5.296 to get a probability of Y being 1 at 95% or higher.
Advantages and Disadvantages of Logistic Regression

Advantages of the Logistic Regression Algorithm:

 Logistic regression is easy to implement and interpret, and very efficient to train.

 Logistic regression performs well when the dataset is linearly separable.

 Logistic regression not only gives a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative).

Disadvantages of the Logistic Regression Algorithm:

 If the number of observations is less than the number of features, logistic regression should not be used; otherwise, it may lead to overfitting.

 The main limitation of logistic regression is the assumption of linearity between the dependent variable and the independent variables. In the real world, the data is rarely linearly separable.
Support Vector
Machine (SVM)
Introduction, Type of SVM, Different Kernel Functions,
Advantage & Disadvantage
SVM

 SVM is a powerful supervised algorithm that works best on small but complex datasets.

 Support Vector Machine, abbreviated as SVM, can be used for both regression and classification tasks, but generally it works best on classification problems.

 It is a supervised machine learning algorithm in which we try to find a hyperplane that best separates the two classes.

 The SVM algorithm can be used for face detection, image classification, text categorization, etc.
 Example:

Contd.
Note: Don't get confused between SVM and logistic regression.

 Both algorithms try to find the best hyperplane, but the main difference is that logistic regression is a probabilistic approach whereas the support vector machine is based on statistical approaches.

 SVM works best when the dataset is small and complex. It is usually advisable to first use logistic regression and see how it performs; if it fails to give good accuracy, you can go for SVM without any kernel.
Types of Support Vector Machine (SVM) Algorithms

 Linear SVM: Only when the data is perfectly linearly separable can we use Linear SVM. Perfectly linearly separable means that the data points can be classified into 2 classes by using a single straight line (if 2D).

 Non-Linear SVM: When the data is not linearly separable, we can use Non-Linear SVM: when the data points cannot be separated into 2 classes by using a straight line (if 2D), we use some advanced techniques like kernel tricks to classify them. In most real-world applications we do not find linearly separable data points, hence we use the kernel trick to solve them.
Linear SVM

 Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider image (a). As this is a 2-D space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes; consider image (b).
Contd.

 Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.

 The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors.

 The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.

 The hyperplane with maximum margin is called the optimal hyperplane.

 The dimensions of the hyperplane depend on the features present in the dataset:
   if there are 2 features (as shown in the image), the hyperplane will be a straight line;
   if there are 3 features, the hyperplane will be a 2-dimensional plane.
Non-Linear SVM

 If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the image.

 So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions, x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

  z = x² + y²

 By adding the third dimension, the sample space will become as shown in image (a).

 So now, SVM will divide the datasets into classes in the following way. Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it to 2-D space with z = 1, it will become as shown in image (b).
Different Kernel Functions

 Kernels in Support Vector Machine:
   Linear Kernel
   Polynomial Kernel
   Sigmoid Kernel
   RBF Kernel
Advantages and Disadvantages of SVM

Advantages of the SVM Algorithm:

 It works really well with a clear margin of separation.

 It is effective in high-dimensional spaces.

 It is effective in cases where the number of dimensions is greater than the number of samples.

Disadvantages of the SVM Algorithm:

 It doesn't perform well when we have a large dataset because the required training time is higher.

 It also doesn't perform very well when the dataset has more noise, i.e., the target classes are overlapping.
DEMO - SVM
 https://colab.research.google.com/drive/1Yj4t10E7jqxLkGTJeNc-aV7_8epgY3iR?usp=sharing
K- Nearest Neighbours
(KNN)
Introduction, Algorithm, Example
K-Nearest Neighbours (KNN)

 The K-NN algorithm compares a new data entry to the values in a given data set (with different classes or categories).

 Based on its closeness or similarity to a given range (K) of neighbors, the algorithm assigns the new data to a class or category in the data set (training data).

 It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
Why do we need a K-NN Algorithm?

 Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie?

 To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the diagram.
KNN Algorithm

The K-NN working can be explained on the basis of the below algorithm:

 Step #1 - Assign a value to K.

 Step #2 - Calculate the distance between the new data entry and all other existing data entries (use the Euclidean distance formula). Arrange them in ascending order.

 Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.

 Step #4 - Assign the new data entry to the majority class among the nearest neighbors.
K-Nearest Neighbors Classifier and Model Example With Diagrams

 This data is given: the graph represents a data set consisting of two classes, red and blue.

 A new data entry has been introduced to the data set. This is represented by the green point in the graph. Apply KNN to classify the green point as red or blue.
 Step #1 - Assign a value to K.
  The value of K is 3.

 Step #2 - Calculate the distance between the new data entry and all other existing data entries. Arrange them in ascending order.
 Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.
  The algorithm will only consider the 3 nearest neighbors to the green point (the new entry). This is represented in the graph.

 Step #4 - Assign the new data entry to the majority class among the nearest neighbors.
  Out of the 3 nearest neighbors in the diagram, the majority class is red, so the new entry will be assigned to that class. The new data entry has been classified as red.
How to Choose the Value of K in the K-NN Algorithm

 There is no particular way of choosing the value of K, but here are some common conventions to keep in mind:
   Choosing a very low value will most likely lead to inaccurate predictions.
   The commonly used value of K is 5.
   Always use an odd number as the value of K.
K-Nearest Neighbors Classifier and Model Example With a Data Set

  BRIGHTNESS   SATURATION   CLASS
  40           20           Red
  50           50           Blue
  60           90           Blue
  10           25           Red
  70           70           Blue
  60           10           Red
  25           80           Blue

The table above represents our data set. We have two columns, Brightness and Saturation. Each row in the table has a class of either Red or Blue. Let's assume the value of K is 5. We introduce a new data entry:

  BRIGHTNESS   SATURATION   CLASS
  20           35           ?

Apply KNN to find the class of the new entry.
Contd.

 Step #1 - Assign a value to K.
  Here the value of K is 5 (given).

 Step #2 - Calculate the distance between the new data entry and all other existing data entries. Arrange them in ascending order.
  Consider Distance = Euclidean distance.

 To know its class, we have to calculate the distance from the new entry to the other entries in the data set using the Euclidean distance formula.

 Here's the formula:

  d = √((X₂ - X₁)² + (Y₂ - Y₁)²)

 Where:
   X₂ = new entry's brightness (20).
   X₁ = existing entry's brightness.
   Y₂ = new entry's saturation (35).
   Y₁ = existing entry's saturation.


  BRIGHTNESS (X₁)   SATURATION (Y₁)   CLASS   DISTANCE
  40                20                Red     d1 = √((20 - 40)² + (35 - 20)²) = √(400 + 225) = √625 = 25
  50                50                Blue    d2 = √((20 - 50)² + (35 - 50)²) = √(900 + 225) = √1125 = 33.54
  60                90                Blue    d3 = √((20 - 60)² + (35 - 90)²) = √(1600 + 3025) = √4625 = 68.01
  10                25                Red     ??
  70                70                Blue    ??
  60                10                Red     ??
  25                80                Blue    ??
  New Entry: X₂ = 20, Y₂ = 35                 ??
Contd.

  BRIGHTNESS   SATURATION   CLASS   DISTANCE
  40           20           Red     25
  50           50           Blue    33.54
  60           90           Blue    68.01
  10           25           Red     14.14
  70           70           Blue    61.03
  60           10           Red     47.17
  25           80           Blue    45.28
  20           35           ??      New Entry

Let's rearrange the distances in ascending order:

  BRIGHTNESS   SATURATION   CLASS   DISTANCE
  10           25           Red     14.14
  40           20           Red     25
  50           50           Blue    33.54
  25           80           Blue    45.28
  60           10           Red     47.17
  70           70           Blue    61.03
  60           90           Blue    68.01
  20           35           ??      New Entry
 Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.
  Since we chose 5 as the value of K, we'll only consider the first five rows. That is:

  BRIGHTNESS   SATURATION   CLASS   DISTANCE    CONSIDER
  10           25           Red     14.14       Yes
  40           20           Red     25          Yes
  50           50           Blue    33.54       Yes
  25           80           Blue    45.28       Yes
  60           10           Red     47.17       Yes
  70           70           Blue    61.03       No
  60           90           Blue    68.01       No
  20           35           ??      New Entry   -
 Step #4 - Assign the new data entry to the majority class among the nearest neighbors.
  As you can see above, the majority class within the 5 nearest neighbors of the new entry is Red (3 Red vs. 2 Blue). Therefore, we'll classify the new entry as Red.

Contd.

  BRIGHTNESS   SATURATION   CLASS   DISTANCE    CONSIDER
  10           25           Red     14.14       Yes
  40           20           Red     25          Yes
  50           50           Blue    33.54       Yes
  25           80           Blue    45.28       Yes
  60           10           Red     47.17       Yes
  20           35           Red     New Entry   -
 Here's the updated table (the given old table plus the newly classified entry):

  BRIGHTNESS   SATURATION   CLASS
  40           20           Red
  50           50           Blue
  60           90           Blue
  10           25           Red
  70           70           Blue
  60           10           Red
  25           80           Blue
  20           35           Red
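A short from-scratch Python sketch (standard library only) that reproduces this worked example; the data, K = 5, and the new entry (20, 35) are exactly the values used above:

import math
from collections import Counter

# (brightness, saturation, class) rows from the data set above
data = [(40, 20, "Red"), (50, 50, "Blue"), (60, 90, "Blue"), (10, 25, "Red"),
        (70, 70, "Blue"), (60, 10, "Red"), (25, 80, "Blue")]
new_entry = (20, 35)
k = 5

def euclidean(p, q):
    # Step #2 distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Steps #2 and #3: sort all rows by distance to the new entry and keep the K closest
neighbors = sorted(data, key=lambda row: euclidean(new_entry, row[:2]))[:k]

# Step #4: majority vote among the K nearest neighbors
print(Counter(label for _, _, label in neighbors).most_common(1)[0][0])   # Red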
Example:

 Apply KNN to find the class of a new entry. (Take K = 5)
Example:

 Apply KNN to find the class of a new entry. (Take K = 1, 2, 5)

 ANS:
Advantages and Disadvantages of KNN

Advantages of the KNN Algorithm:

 It is simple to implement.

 It is robust to noisy training data.

 It can be more effective if the training data is large.

Disadvantages of the KNN Algorithm:

 It always needs to determine the value of K, which may sometimes be complex.

 The computation cost is high because of calculating the distance between the data points for all the training samples.
DEMO - KNN
 https://colab.research.google.com/drive/1Yj4t10E7jqxLkGTJeNc-aV7_8epgY3iR?usp=sharing
Naive Bayes Classifier
Introduction, Algorithm, Types of Distribution, Naive Bayes Classifier Approach with Example (For a Single Feature and For Multiple Features)
Naive Bayes Classifier

 Naive Bayes is a statistical classification technique based on Bayes' Theorem.

 It is one of the simplest supervised learning algorithms.

 The Naive Bayes classifier is a fast, accurate and reliable algorithm.

 Naive Bayes classifiers have high accuracy and speed on large datasets.
Assumptions of Naive Bayes

 Feature independence: The features of the data are conditionally independent of each other, given the class label.

 Continuous features are normally distributed: If a feature is continuous, then it is assumed to be normally distributed within each class.

 Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to have a multinomial distribution within each class.

 Features are equally important: All features are assumed to contribute equally to the prediction of the class label.

 No missing data: The data should not contain any missing values.
Multivariate Bernoulli Distribution, Gaussian Distribution and Multinomial Distribution in Naive Bayes

 Type of Data:
   Multivariate Bernoulli: Binary (0 or 1) for each feature
   Gaussian: Continuous data
   Multinomial: Categorical data with multiple classes

 Example Use Case:
   Multivariate Bernoulli: Email spam detection (presence or absence of words)
   Gaussian: Predicting house prices based on features
   Multinomial: Text classification (frequency of words)

 Parameters:
   Multivariate Bernoulli: Probability of each feature being 1
   Gaussian: Mean and variance for each feature
   Multinomial: Probability of each category

 Assumption:
   Multivariate Bernoulli: Features are independent given the class
   Gaussian: Features are normally distributed
   Multinomial: Features are categorical and independent given the class

 Naive Bayes Application:
   Multivariate Bernoulli: Bernoulli Naive Bayes classifier for binary features
   Gaussian: Gaussian Naive Bayes classifier
   Multinomial: Multinomial Naive Bayes classifier

 Distribution:
   Multivariate Bernoulli: Each feature follows a Bernoulli distribution
   Gaussian: Each feature follows a Gaussian (Normal) distribution
   Multinomial: Each feature follows a Multinomial distribution

 Output:
   Multivariate Bernoulli: Probability of binary feature occurrence
   Gaussian: Probability density of continuous features
   Multinomial: Probability of categorical feature occurrence
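As an illustrative sketch (assuming scikit-learn; the tiny arrays below are made up only to show the expected data shape for each variant), each of the three distributions above has its own estimator:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Bernoulli NB: binary features (e.g., word present / absent)
X_bin = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))

# Gaussian NB: continuous features (e.g., measurements)
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [6.5, 3.0]])
print(GaussianNB().fit(X_cont, y).predict([[6.0, 3.1]]))

# Multinomial NB: count features (e.g., word frequencies)
X_cnt = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [1, 3, 3]])
print(MultinomialNB().fit(X_cnt, y).predict([[0, 2, 2]]))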
Bayes' Theorem

 Bayes' theorem provides a way of computing the posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:

  P(c|x) = [ P(x|c) * P(c) ] / P(x)

 Where:
   P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
   P(c) is the prior probability of the class.
   P(x|c) is the likelihood, which is the probability of the predictor given the class.
   P(x) is the prior probability of the predictor.


Naive Bayes  First Approach (In case of a single feature)
Classifier
Approach  Second Approach (In case of multiple features)
First Approach (in case of a single feature)

 Step 1: Convert the data set into a frequency table.

 Step 2: Create a likelihood table by finding the probabilities.

 Step 3: Apply the Bayes formula and calculate the posterior probability for each target value.

 Step 4: See which class has a higher probability; the input belongs to the higher-probability class.
Example:

 Apply the Naive Bayes Classifier to the weather/play dataset and check the following statements:

   Players will play if the weather is Overcast. Is this statement correct?

   Players will play if the weather is Sunny. Is this statement correct?

   Players will play if the weather is Rainy. Is this statement correct?
 Step 1: Convert the data set into a frequency table

  Weather    No   Yes
  Overcast   0    4
  Sunny      3    2
  Rainy      2    3
  Total:     5    9
Step 2: Create a likelihood table by finding the probabilities

 Likelihood Table 1:

  Weather    No   Yes
  Overcast   0    4    P(Overcast) = 4/14 = 0.29
  Sunny      3    2    P(Sunny)    = 5/14 = 0.36
  Rainy      2    3    P(Rainy)    = 5/14 = 0.36
  Total:     5    9
  P(No) = 5/14 = 0.36          P(Yes) = 9/14 = 0.64

 Likelihood Table 2:

  Weather    No   Yes   Posterior probabilities (No)    Posterior probabilities (Yes)
  Overcast   0    4     P(Overcast | No) = 0/5 = 0      P(Overcast | Yes) = 4/9 = 0.44
  Sunny      3    2     P(Sunny | No)    = 3/5 = 0.6    P(Sunny | Yes)    = 2/9 = 0.22
  Rainy      2    3     P(Rainy | No)    = 2/5 = 0.4    P(Rainy | Yes)    = 3/9 = 0.33
  Total:     5    9
Step 3: Apply the Bayes formula and calculate the posterior probability for each target value

Now suppose you want to calculate the probability of playing (and of not playing) when the weather is overcast.

 Probability of playing:
  P(Yes | Overcast) = [ P(Overcast | Yes) * P(Yes) ] / P(Overcast)
                    = (0.44 * 0.64) / 0.29 = 0.98

 Probability of not playing:
  P(No | Overcast) = [ P(Overcast | No) * P(No) ] / P(Overcast)
                   = (0 * 0.36) / 0.29 = 0
 Step 4: See which class has a higher probability; the input belongs to the higher-probability class.

  P(Yes | Overcast) = 0.98
  P(No | Overcast) = 0

 Ans: Yes, players will play if the weather is overcast. This statement is correct.

Contd.

 Players will play if the weather is Sunny. Is this statement correct?
  Ans: ???

 Players will play if the weather is Rainy. Is this statement correct?
  Ans: ???
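A small Python sketch (standard library only) that reproduces the overcast calculation from the frequency and likelihood tables above:

# Counts from the frequency table above (14 records in total)
counts = {"Yes": {"Overcast": 4, "Sunny": 2, "Rainy": 3},
          "No":  {"Overcast": 0, "Sunny": 3, "Rainy": 2}}
total = 14

def posterior(weather, play):
    # Bayes: P(play | weather) = P(weather | play) * P(play) / P(weather)
    n_play = sum(counts[play].values())
    likelihood = counts[play][weather] / n_play
    prior = n_play / total
    evidence = (counts["Yes"][weather] + counts["No"][weather]) / total
    return likelihood * prior / evidence

print(posterior("Overcast", "Yes"))   # 1.0 (the slides get 0.98 because they round intermediate values)
print(posterior("Overcast", "No"))    # 0.0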
Example:

  Fruit ID   Color (Feature)   Class
  1          Red               Apple
  2          Red               Apple
  3          Green             Apple
  4          Orange            Orange
  5          Orange            Orange
  6          Green             Apple

 Apply the Naive Bayes Classifier to classify whether a fruit is "Apple" or "Orange" if a new fruit is "Red" in color (our single feature).
Contd.

 Step 1: Convert the data set into a frequency table

  Color    Apple   Orange
  Red      2       0
  Green    2       0
  Orange   0       2
  Total:   4       2
Step 2: Create a likelihood table by finding the probabilities

 Likelihood Table 1:

  Color    Apple   Orange
  Red      2       0       P(Red)    = 2/6 = 0.33
  Green    2       0       P(Green)  = 2/6 = 0.33
  Orange   0       2       P(Orange) = 2/6 = 0.33
  Total:   4       2
  P(Apple) = 4/6 = 0.67            P(Orange) = 2/6 = 0.33

 Likelihood Table 2:

  Color    Apple   Orange   Posterior probabilities (Apple)   Posterior probabilities (Orange)
  Red      2       0        P(Red | Apple)    = 2/4 = 0.5     P(Red | Orange)    = 0/2 = 0
  Green    2       0        P(Green | Apple)  = 2/4 = 0.5     P(Green | Orange)  = 0/2 = 0
  Orange   0       2        P(Orange | Apple) = 0/4 = 0       P(Orange | Orange) = 2/2 = 1
  Total:   4       2
Step 3: Apply the Bayes formula and calculate the posterior probability for each target value

Now suppose you want to calculate the probability of Apple (and of Orange) when the color is Red.

 Probability of Apple:
  P(Apple | Red) = [ P(Red | Apple) * P(Apple) ] / P(Red)
                 = (0.5 * 0.67) / 0.33 ≈ 1

 Probability of Orange:
  P(Orange | Red) = [ P(Red | Orange) * P(Orange) ] / P(Red)
                  = (0 * 0.33) / 0.33 = 0
 Step 4: See which class has a higher probability; the input belongs to the higher-probability class.

  P(Apple | Red) = 1
  P(Orange | Red) = 0

 Ans: If the new fruit is "Red" in color, then it is classified as Apple.
Example

  Person ID   Favorite Drink (Feature)   Class
  1           Soda                       Teen
  2           Coffee                     Adult
  3           Soda                       Teen
  4           Juice                      Teen
  5           Coffee                     Adult
  6           Soda                       Teen
  7           Coffee                     Adult
  8           Juice                      Teen
  9           Soda                       Teen
  10          Coffee                     Adult

 Apply the Naive Bayes Classifier to classify whether a new person whose favorite drink is "Juice" is a Teen or an Adult.
Ans:

 Prior Probabilities:
  P(Teen) = 0.6
  P(Adult) = 0.4

 Likelihoods:
  P(Favorite Drink = Soda | Teen)   = 4/6 = 0.67
  P(Favorite Drink = Coffee | Teen) = 0/6 = 0.0
  P(Favorite Drink = Juice | Teen)  = 2/6 = 0.33
  P(Favorite Drink = Soda | Adult)   = 0/4 = 0.0
  P(Favorite Drink = Coffee | Adult) = 4/4 = 1.0
  P(Favorite Drink = Juice | Adult)  = 0/4 = 0.0

 Classification: For a new person whose favorite drink is "Juice", the person is classified as "Teen" based on the higher posterior probability.
Example

  Age Group   Buys Computer?
  Youth       No
  Youth       No
  Middle      Yes
  Senior      Yes
  Senior      Yes
  Senior      No
  Middle      Yes
  Youth       No
  Youth       Yes
  Senior      Yes

 Apply Naive Bayes: given a new instance with an input feature (e.g., Age Group = Youth), calculate the probability of each class.
Answer:

 P(Yes) = 6/10 = 0.6 and P(No) = 4/10 = 0.4

 P(Youth | Yes) = 1/6 and P(Youth | No) = 3/4, so
  P(Youth | Yes) * P(Yes) = (1/6) * 0.6 = 0.1
  P(Youth | No) * P(No)   = (3/4) * 0.4 = 0.3

 P(Buys Computer = Yes | Age Group = Youth) = 0.1 / (0.1 + 0.3) = 0.25

 P(Buys Computer = No | Age Group = Youth) = 0.3 / (0.1 + 0.3) = 0.75
Second Approach (in case of multiple features)

 Step 1: Convert the data set into frequency tables (according to the number of inputs and outputs).

 Step 2: Create likelihood tables by finding the probabilities (according to the number of inputs and outputs).

 Step 3: Apply the Bayes formula and calculate the posterior probability for each target value.

 Step 4: See which class has a higher probability; the input belongs to the higher-probability class.
Second Approach (in case of multiple features)

  Email ID   Free (Feature 1)   Win (Feature 2)   Class
  1          Yes                Yes               Spam
  2          Yes                No                Spam
  3          No                 Yes               Not Spam
  4          Yes                Yes               Not Spam
  5          No                 No                Not Spam
  6          Yes                Yes               Spam
  7          No                 Yes               Spam
  8          No                 No                Not Spam
  9          Yes                No                Not Spam
  10         Yes                Yes               Spam

 Apply the Naive Bayes Classifier to find the class of a new email (Free = Yes, Win = No).
Example

 Step 1: Convert the data set into frequency tables

  Free (Feature 1)   Spam   Not Spam
  Yes                4      2
  No                 1      3
  Total              5      5

  Win (Feature 2)    Spam   Not Spam
  Yes                4      2
  No                 1      3
  Total              5      5
Step 2: Create likelihood tables by finding the probabilities

 P(Spam) = 5/10 = 0.5
 P(Not Spam) = 5/10 = 0.5

  Free (Feature 1)   Spam   Not Spam   Posterior probabilities (Spam)     Posterior probabilities (Not Spam)
  Yes                4      2          P(Free=Yes | Spam) = 4/5 = 0.8     P(Free=Yes | Not Spam) = 2/5 = 0.4
  No                 1      3          P(Free=No | Spam)  = 1/5 = 0.2     P(Free=No | Not Spam)  = 3/5 = 0.6
  Total              5      5

  Win (Feature 2)    Spam   Not Spam   Posterior probabilities (Spam)     Posterior probabilities (Not Spam)
  Yes                4      2          P(Win=Yes | Spam) = 4/5 = 0.8      P(Win=Yes | Not Spam) = 2/5 = 0.4
  No                 1      3          P(Win=No | Spam)  = 1/5 = 0.2      P(Win=No | Not Spam)  = 3/5 = 0.6
  Total              5      5
Step 3: Apply the Bayes formula and calculate the posterior probability for each target value

 Calculate the probability of Spam for the new email (Free = Yes, Win = No):
  P(Spam | Free=Yes, Win=No) ∝ P(Free=Yes | Spam) * P(Win=No | Spam) * P(Spam)
                             = 0.8 * 0.2 * 0.5 = 0.08

 Calculate the probability of Not Spam for the new email (Free = Yes, Win = No):
  P(Not Spam | Free=Yes, Win=No) ∝ P(Free=Yes | Not Spam) * P(Win=No | Not Spam) * P(Not Spam)
                                 = 0.4 * 0.6 * 0.5 = 0.12
 Step 4: See which class has a higher probability; the input belongs to the higher-probability class.

  P(Spam | Free=Yes, Win=No) = 0.08
  P(Not Spam | Free=Yes, Win=No) = 0.12

  Since 0.12 > 0.08, we classify the new email as "Not Spam".

 Ans: The class of the new email (Free = Yes, Win = No) is Not Spam.
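A brief Python sketch (standard library only) that reproduces this two-feature calculation; the counts and priors are copied from the likelihood tables above:

# (feature value, class) -> count, taken from the frequency tables above
free = {("Yes", "Spam"): 4, ("Yes", "Not Spam"): 2, ("No", "Spam"): 1, ("No", "Not Spam"): 3}
win  = {("Yes", "Spam"): 4, ("Yes", "Not Spam"): 2, ("No", "Spam"): 1, ("No", "Not Spam"): 3}
prior = {"Spam": 0.5, "Not Spam": 0.5}
n_per_class = 5

def score(free_val, win_val, cls):
    # Naive Bayes score: P(Free | class) * P(Win | class) * P(class)
    return (free[(free_val, cls)] / n_per_class) * (win[(win_val, cls)] / n_per_class) * prior[cls]

for cls in ("Spam", "Not Spam"):
    print(cls, round(score("Yes", "No", cls), 2))   # Spam 0.08, Not Spam 0.12 -> "Not Spam"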
DEMO - Naive Bayes Classifier
 https://colab.research.google.com/drive/1Yj4t10E7jqxLkGTJeNc-aV7_8epgY3iR?usp=sharing
Advantages & Disadvantages of the Naive Bayes Classifier

Advantages of the Naïve Bayes Classifier:

 Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.

 It can be used for binary as well as multi-class classification.

 It performs well in multi-class predictions as compared to the other algorithms.

 It is the most popular choice for text classification problems.

Disadvantages of the Naïve Bayes Classifier:

 Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Decision Tree

Introduction, Decision Tree Terminologies, Algorithm - Attribute Selection Measures (Information Gain & Entropy, Gini Index), ID3 Algorithm & Examples, Advantages & Disadvantages
Decision Tree

 Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems.

 It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.

 It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

 It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
Contd.

 It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
Decision Tree Terminologies

 Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.

 Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after reaching a leaf node.

 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.

 Branch/Sub-Tree: A tree formed by splitting the tree.

 Pruning: Pruning is the process of removing unwanted branches from the tree.

 Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.
Algorithm

 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

 Step-3: Divide S into subsets that contain possible values for the best attribute.

 Step-4: Generate the decision tree node, which contains the best attribute.

 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; call the final node a leaf node.
Attribute Selection Measures

 While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM).

 By this measurement, we can easily select the best attribute for the nodes of the tree.

 There are two popular techniques for ASM:
   Information Gain & Entropy
   Gini Index
Information Gain

 Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.

 It calculates how much information a feature provides us about a class.

 According to the value of information gain, we split the node and build the decision tree.

 A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:

  IG = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy

 Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:

  Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

 Where:
   S = total number of samples
   P(yes) = probability of yes
   P(no) = probability of no
Gini Index

 The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.

 An attribute with a low Gini index should be preferred as compared to one with a high Gini index.

  Gini Index = 1 - Σj (Pj)²
ID3 Algorithm:

 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM): consider the feature with the highest value of Information Gain as the best feature.

 Step-3: Divide S into subsets that contain possible values for the best attribute.

 Step-4: Generate the decision tree node, which contains the best attribute.

 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; call the final node a leaf node.
Example:

  Temperature   Humidity   Play
  T             T          Yes
  T             T          No
  F             T          Yes
  F             F          No
  T             F          Yes
  T             F          No
  T             F          No

 Ans: ??????
Final Tree:
DEMO - Decision Tree
 https://colab.research.google.com/drive/1Yj4t10E7jqxLkGTJeNc-aV7_8epgY3iR?usp=sharing
Advantages & Disadvantages of Decision Trees

Advantages of the Decision Tree:

 It is simple to understand as it follows the same process which a human follows while making any decision in real life.

 It can be very useful for solving decision-related problems.

 It helps to think about all the possible outcomes for a problem.

 There is less requirement for data cleaning compared to other algorithms.

Disadvantages of the Decision Tree:

 The decision tree contains lots of layers, which makes it complex.

 It may have an overfitting issue, which can be resolved using the Random Forest algorithm.

 For more class labels, the computational complexity of the decision tree may increase.
Ensemble Methods
Bagging – Random Forest Algorithm, Boosting – XGBoost
Ensemble Methods

 Ensemble simply means combining multiple models.

 Thus, a collection of models is used to make predictions rather than an individual model.

 Ensemble uses two types of methods: Bagging and Boosting.
Bagging

 Bagging, also known as Bootstrap Aggregation, serves as the ensemble technique in the Random Forest algorithm. Here are the steps involved in Bagging:

   Selection of Subset: Bagging starts by choosing a random sample, or subset, from the entire dataset.

   Bootstrap Sampling: Each model is then created from these samples, called Bootstrap Samples, which are taken from the original data with replacement. This process is known as row sampling.

   Bootstrapping: The step of row sampling with replacement is referred to as bootstrapping.

   Independent Model Training: Each model is trained independently on its corresponding Bootstrap Sample. This training process generates results for each model.

   Majority Voting: The final output is determined by combining the results of all models through majority voting. The most commonly predicted outcome among the models is selected.

   Aggregation: This step, which involves combining all the results and generating the final output based on majority voting, is known as aggregation.
Random Forest

 Step 1: In the Random Forest model, a subset of data points and a subset of features is selected for constructing each decision tree. Simply put, n random records and m features are taken from a data set having k records.

 Step 2: Individual decision trees are constructed for each sample.

 Step 3: Each decision tree will generate an output.

 Step 4: The final output is considered based on majority voting (for classification) or averaging (for regression).
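A minimal scikit-learn sketch (assuming sklearn is installed; the built-in iris data is only a stand-in) of the Random Forest classifier described by these steps:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each built on a bootstrap sample and a random subset of features;
# the final class is decided by majority voting across the trees
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(round(rf.score(X_test, y_test), 3))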
Contd.

 Example: Consider a fruit basket as the data, as shown in the figure. Now n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree will generate an output, as shown in the figure. The final output is considered based on majority voting. In the figure, you can see that the majority of the decision trees give the output "apple" rather than "banana", so the final output is taken as an apple.
Boosting

 Boosting is one of the techniques that uses the concept of ensemble learning.

 A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output.

 It is done by building a model using weak models in series.

 There are several boosting algorithms: AdaBoost (Adaptive Boosting), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting Machine (XGBM/XGBoost).
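An illustrative sketch of boosting (assuming scikit-learn and using its GradientBoostingClassifier; XGBoost is a separate library with a similar fit/predict interface, so this only stands in for the idea, not the exact XGBoost API):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees (weak learners) are added in series, each correcting the errors of the previous ones
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
gbm.fit(X_train, y_train)
print(round(gbm.score(X_test, y_test), 3))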
Contd.

 You've built a linear regression model that gives you a decent 77% accuracy on the validation dataset.

 Next, you decide to expand your portfolio by building a k-Nearest Neighbour (KNN) model and a decision tree model on the same dataset. These models give you an accuracy of 62% and 89% on the validation set, respectively.

 It's obvious that all three models work in completely different ways. For instance, the linear regression model tries to capture linear relationships in the data, while the decision tree model attempts to capture the non-linearity in the data.

 How about, instead of using any one of these models for making the final predictions, we use a combination of all of these models?

 I'm thinking of an average of the predictions from these models. By doing this, we would be able to capture more information from the data. A sketch of this idea is shown below.
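A minimal scikit-learn sketch of this averaging idea (a soft-voting ensemble over three different classifiers); logistic regression stands in for the linear model here, and the dataset and scores are illustrative, not the 77%/62%/89% quoted above:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "soft" voting averages the predicted class probabilities of the three models
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=5000)),
                ("knn", KNeighborsClassifier()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft")
ensemble.fit(X_train, y_train)
print(round(ensemble.score(X_test, y_test), 3))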
DEMO - Ensemble Learning (Random Forest, XGBoost)
 https://colab.research.google.com/drive/1Yj4t10E7jqxLkGTJeNc-aV7_8epgY3iR?usp=sharing
Any
Queries..?? Thank you
