Classification Based Machine Learning Algorithms
Md Main Uddin Rony,
Software Engineer
1
What is Classification?
Classification is a data mining task of predicting the value of a
categorical variable (target or class)
This is done by building a model based on one or more numerical
and/or categorical variables (predictors, attributes or features)
It is considered an instance of supervised learning
The corresponding unsupervised procedure is known as clustering
2
Classification Based Algorithms
Four main groups of classification algorithms are:
● Frequency Table
- ZeroR
- OneR
- Naive Bayesian
- Decision Tree
● Covariance Matrix
- Linear Discriminant Analysis
- Logistic Regression
● Similarity Functions
- K Nearest Neighbours
● Others
- Artificial Neural Network
- Support Vector Machine
3
4
Naive Bayes Classifier
● Works based on Bayes’ theorem
● Why is it called Naive?
- Because it assumes that the presence of a particular feature
in a class is unrelated to the presence of any other feature
● Easy to build
● Useful for very large data sets
Bayes’ Theorem
The theorem can be stated mathematically as follows:
P(A | B) = P(B | A) P(A) / P(B)
P(A) and P(B) are the probabilities of observing A and B without regard
to each other, also known as prior probabilities.
P(A | B), a conditional (posterior) probability, is the probability of
observing event A given that B is true.
P(B | A) is the conditional probability of observing event B given that
A is true.
So, how does the Naive Bayes classifier work based on this?
5
How Does Naive Bayes Work?
● Let D be a training set of tuples, each represented by an n-dimensional
attribute vector X = (x1, x2, …, xn)
● Suppose that there are m classes, C1, C2, …, Cm. Given a tuple X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned
on X. That is, the Naive Bayesian classifier predicts that tuple X belongs to the class Ci
if and only if P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i
● By Bayes’ theorem, P(Ci | X) = P(X | Ci) P(Ci) / P(X)
● Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized
6
How Does Naive Bayes Work? (Contd.)
● To reduce computation in evaluating P(X | Ci), the naive assumption of
class-conditional independence is made. This presumes that the attributes’ values are
conditionally independent of one another, given the class label of the tuple (i.e., that
there are no dependence relationships among the attributes). This assumption is
called class-conditional independence.
● Thus, P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × ⋯ × P(xn | Ci)
7
How Does Naive Bayes Work?
(Hands-on Calculation)
Given all the previous patients’ symptoms and diagnoses,
does the patient with the following symptoms have the flu?
8
chills runny nose headache fever flu?
Y N Mild Y N
Y Y No N Y
Y N Strong Y Y
N Y Mild Y Y
N N No N N
N Y Strong Y Y
N Y Strong N N
Y Y Mild Y Y
chills runny nose headache fever flu?
Y N Mild N ?
How Does Naive Bayes Work?
(Hands-on Calculation, Contd.)
First, we compute all the individual probabilities
conditioned on the target attribute (flu).
9
P(Flu=Y) 0.625 P(Flu=N) 0.375
P(chills=Y|flu=Y) 0.6 P(chills=Y|flu=N) 0.333
P(chills=N|flu=Y) 0.4 P(chills=N|flu=N) 0.666
P(runny nose=Y|flu=Y) 0.8 P(runny nose=Y|flu=N) 0.333
P(runny nose=N|flu=Y) 0.2 P(runny nose=N|flu=N) 0.666
P(headache=Mild|flu=Y) 0.4 P(headache=Mild|flu=N) 0.333
P(headache=No|flu=Y) 0.2 P(headache=No|flu=N) 0.333
P(headache=Strong|flu=Y) 0.4 P(headache=Strong|flu=N) 0.333
P(fever=Y|flu=Y) 0.8 P(fever=Y|flu=N) 0.333
P(fever=N|flu=Y) 0.2 P(fever=N|flu=N) 0.666
How Does Naive Bayes Work?
(Hands-on Calculation, Contd.)
And then decide:
P(flu=Y | given attributes) ∝ P(chills=Y|flu=Y) · P(runny
nose=N|flu=Y) · P(headache=Mild|flu=Y) · P(fever=N|flu=Y) · P(flu=Y)
= 0.6 * 0.2 * 0.4 * 0.2 * 0.625
= 0.006
VS
P(flu=N | given attributes) ∝ P(chills=Y|flu=N) · P(runny
nose=N|flu=N) · P(headache=Mild|flu=N) · P(fever=N|flu=N) · P(flu=N)
= 0.333 * 0.666 * 0.333 * 0.666 * 0.375
= 0.0184
Since 0.0184 > 0.006, the Naive Bayes classifier predicts that the
patient doesn’t have the flu.
10
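To double-check the arithmetic, here is a minimal Python sketch of the same calculation. The data is copied from the table above; the helper names (`score`, `cond`, and so on) are ours, not from any library.

```python
from collections import Counter, defaultdict

# Training data from the slide: (chills, runny nose, headache, fever) -> flu
rows = [
    ("Y", "N", "Mild",   "Y", "N"),
    ("Y", "Y", "No",     "N", "Y"),
    ("Y", "N", "Strong", "Y", "Y"),
    ("N", "Y", "Mild",   "Y", "Y"),
    ("N", "N", "No",     "N", "N"),
    ("N", "Y", "Strong", "Y", "Y"),
    ("N", "Y", "Strong", "N", "N"),
    ("Y", "Y", "Mild",   "Y", "Y"),
]
features = ["chills", "runny nose", "headache", "fever"]

# Priors: P(flu=Y) and P(flu=N)
class_counts = Counter(r[-1] for r in rows)
priors = {c: n / len(rows) for c, n in class_counts.items()}

# Conditional counts for P(feature=value | flu=c)
cond = defaultdict(Counter)
for *values, label in rows:
    for feat, val in zip(features, values):
        cond[(feat, label)][val] += 1

def score(x, label):
    """Unnormalised posterior: P(flu=label) * product of P(x_i | flu=label)."""
    p = priors[label]
    for feat, val in zip(features, x):
        p *= cond[(feat, label)][val] / class_counts[label]
    return p

query = ("Y", "N", "Mild", "N")   # the unknown patient
scores = {label: score(query, label) for label in priors}
print(scores)   # roughly Y: 0.006, N: 0.0185 (the slide's 0.0184 uses rounded factors)
print("prediction:", max(scores, key=scores.get))   # 'N' -> no flu
```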
The Decision
Tree Classifier
11
Decision Tree
● A decision tree builds classification or regression models in the form of a
tree structure
● It breaks down a dataset into smaller and smaller subsets while at the
same time an associated decision tree is incrementally developed.
● The final result is a tree with decision nodes and leaf nodes.
- A decision node has two or more branches
- Leaf node represents a classification or decision
● The topmost decision node in a tree, which corresponds to the best
predictor, is called the root node
● Decision trees can handle both categorical and numerical data
12
The example set we will work on...
13
Outlook Temp Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
Sunny Mild Normal False Yes
Rainy Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Sunny Mild High True No
So, our tree looks like this...
14
How it works
● The core algorithm for building decision trees is called ID3 and was
developed by J. R. Quinlan
● ID3 uses Entropy and Information Gain to construct a
decision tree
15
Entropy
● A decision tree is built top-down from a root node and
involves partitioning the data into subsets that contain
instances with similar values (homogeneous)
● The ID3 algorithm uses entropy to calculate the homogeneity
of a sample
● If the sample is completely homogeneous, the entropy is
zero, and if the sample is equally divided, it has an entropy
of one
16
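As a quick check of the two properties above, here is a small self-contained entropy helper (our own function, not from any library):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a sample described by its class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([14, 0]))   # completely homogeneous sample -> 0.0
print(entropy([7, 7]))    # equally divided sample        -> 1.0
print(entropy([9, 5]))    # the Play Golf target (9 Yes, 5 No) -> about 0.94
```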
Compute Two Types of Entropy
● To build a decision tree, we need to calculate two types of
entropy using frequency tables, as follows:
● a) Entropy using the frequency table of one attribute
(Entropy of the Target): E(S) = Σ -p(i) log2 p(i), summed over the classes i
17
● b) Entropy using the frequency table of two attributes:
E(T, X) = Σ P(c) E(c), summed over the values c of attribute X
18
Information Gain
● The information gain is based on the decrease
in entropy after a dataset is split on an attribute
● Constructing a decision tree is all about finding the
attribute that returns the highest information
gain (i.e., the most homogeneous branches)
19
Example
● Step 1: Calculate the entropy of the target:
Entropy(PlayGolf) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94
20
Example
● Step 2: The dataset is then split on
the different attributes.
The entropy for each branch is
calculated.
● Then it is added proportionally to
get the total entropy for the split.
● The resulting entropy is subtracted
from the entropy before the split.
● The result is the Information Gain,
or decrease in entropy
21
Example
22
Example
● Step 3: Choose the attribute with the largest information gain as the decision
node
23
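A sketch of Steps 1 to 3 on the Play Golf table, using our own `entropy` and `information_gain` helpers; for this data the printed gains single out Outlook as the decision node.

```python
from math import log2
from collections import Counter, defaultdict

# The Play Golf table: (Outlook, Temp, Humidity, Windy) -> Play Golf
data = [
    ("Rainy","Hot","High","False","No"),     ("Rainy","Hot","High","True","No"),
    ("Overcast","Hot","High","False","Yes"), ("Sunny","Mild","High","False","Yes"),
    ("Sunny","Cool","Normal","False","Yes"), ("Sunny","Cool","Normal","True","No"),
    ("Overcast","Cool","Normal","True","Yes"),("Rainy","Mild","High","False","No"),
    ("Rainy","Cool","Normal","False","Yes"), ("Sunny","Mild","Normal","False","Yes"),
    ("Rainy","Mild","Normal","True","Yes"),  ("Overcast","Mild","High","True","Yes"),
    ("Overcast","Hot","Normal","False","Yes"),("Sunny","Mild","High","True","No"),
]
attributes = ["Outlook", "Temp", "Humidity", "Windy"]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    """Gain = entropy before the split minus the weighted entropy after it."""
    before = entropy([r[-1] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr_index]].append(r[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

gains = {a: information_gain(data, i) for i, a in enumerate(attributes)}
print(gains)   # Outlook ~0.247, Temp ~0.029, Humidity ~0.152, Windy ~0.048
print("decision node:", max(gains, key=gains.get))   # Outlook
```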
Example
● Step 4a: A branch with entropy of 0 is a leaf node.
24
Example
● Step 4b: A branch with entropy more than 0 needs further splitting.
25
Example
● Step 5: The ID3 algorithm is run
recursively on the non-leaf
branches, until all data is classified.
26
Decision
Tree to
Decision
Rules
● A decision tree can easily be transformed into a
set of rules by mapping the paths from the root node
to the leaf nodes one by one
27
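For comparison, scikit-learn can build a tree and print its root-to-leaf paths as rules. This is only a sketch, not the slides' method: the categorical columns are one-hot encoded, and sklearn's CART with the entropy criterion approximates, but is not identical to, ID3, so the printed tree may differ in detail from the one on the slides.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame(
    [("Rainy","Hot","High",False,"No"),     ("Rainy","Hot","High",True,"No"),
     ("Overcast","Hot","High",False,"Yes"), ("Sunny","Mild","High",False,"Yes"),
     ("Sunny","Cool","Normal",False,"Yes"), ("Sunny","Cool","Normal",True,"No"),
     ("Overcast","Cool","Normal",True,"Yes"),("Rainy","Mild","High",False,"No"),
     ("Rainy","Cool","Normal",False,"Yes"), ("Sunny","Mild","Normal",False,"Yes"),
     ("Rainy","Mild","Normal",True,"Yes"),  ("Overcast","Mild","High",True,"Yes"),
     ("Overcast","Hot","Normal",False,"Yes"),("Sunny","Mild","High",True,"No")],
    columns=["Outlook", "Temp", "Humidity", "Windy", "PlayGolf"],
)

# One-hot encode the categorical predictors; sklearn trees need numeric input.
X = pd.get_dummies(df[["Outlook", "Temp", "Humidity", "Windy"]])
y = df["PlayGolf"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Each root-to-leaf path printed here corresponds to one decision rule.
print(export_text(tree, feature_names=list(X.columns)))
```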
Any idea about
Random Forest??
After all, Forests are made of trees….
28
K Nearest Neighbors
Classification
29
k-NN Algorithm
● K nearest neighbors is a simple algorithm that stores all available cases
and classifies new cases based on a similarity measure (e.g., distance
functions)
● KNN has been used in statistical estimation and pattern recognition
since the early 1970s
● A case is classified by a majority vote of its neighbors, with the case being
assigned to the class most common amongst its K nearest neighbors
measured by a distance function
● If k = 1, then what will it do? (See the sketch below.)
30
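A minimal sketch of the idea with our own helper function (no library): keep all training cases, sort them by distance to the query, and take a majority vote among the K nearest.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: feature_vector."""
    # Sort training cases by distance to the query and keep the K nearest.
    nearest = sorted(train, key=lambda case: dist(case[0], query))[:k]
    # Majority vote among their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: two numeric predictors, two classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_predict(train, (1.1, 0.9), k=3))  # -> "A"
```

With k = 1 the vote reduces to copying the label of the single nearest case, which answers the question above.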
Diagram
31
Distance measures for continuous variables
32
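The distance formulas on this slide are shown as an image; assuming the usual choices for continuous variables, they can be written as:

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, q=3):
    # q = 1 gives Manhattan distance, q = 2 gives Euclidean distance.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```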
How many
neighbors?
● Choosing the optimal value for K is best
done by first inspecting the data
● In general, a large K value is more
precise as it reduces the overall noise
but there is no guarantee
● Cross-validation is another way to
retrospectively determine a good K
value by using an independent dataset
to validate the K value
● Historically, the optimal K for most
datasets has been between 3 and 10, which
produces much better results than 1NN
33
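A hedged sketch of the cross-validation idea mentioned above, using scikit-learn's KNeighborsClassifier; the bundled Iris data and the 1 to 10 candidate range are placeholders, not part of the original slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate K with 5-fold cross-validation and keep the best one.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 11)
}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```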
Example
● Consider the following data concerning credit default. Age and Loan are
two numerical variables (predictors) and Default is the target
34
Example
● We can now use the training set to classify an
unknown case (Age=48 and Loan=$142,000)
using Euclidean distance.
● If K=1 then the nearest neighbor is the last
case in the training set with Default=Y
● D = Sqrt[(48-33)^2 + (142000-150000)^2] =
8000.01 >> Default=Y
● With K=3, there are two Default=Y and one
Default=N out of three closest neighbors. The
prediction for the unknown case is again
Default=Y
35
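The quoted distance can be verified directly. Only this one computation is reproduced here, since the full Age/Loan table lives in the slide image; the nearest case's values (Age=33, Loan=$150,000) are taken from the slide text.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

unknown = (48, 142_000)   # Age=48, Loan=$142,000 (the case to classify)
nearest = (33, 150_000)   # the K=1 neighbour quoted on the slide (Default=Y)

print(round(euclidean(unknown, nearest), 2))  # 8000.01 -> predict Default=Y
```

Note how the loan amount dominates the distance because of its scale, which is exactly what the next slide addresses.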
Standardized
Distance
● One major drawback in calculating distance
measures directly from the training set is in the
case where variables have different
measurement scales or there is a mixture of
numerical and categorical variables.
● For example, if one variable is based on annual
income in dollars and the other is based on
age in years, then income will have a much
higher influence on the distance calculated.
● One solution is to standardize the training set
36
Standardized
Distance
Using the standardized distance on the same
training set, the unknown case returned a different
neighbor which is not a good sign of robustness.
37
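One common way to standardize, assumed here since the exact formula is in the slide image, is min-max scaling each variable to [0, 1] using the training set's minimum and maximum before computing distances; the ranges below are hypothetical, for illustration only.

```python
def min_max_scale(value, lo, hi):
    """Rescale a value to [0, 1] given the training-set minimum and maximum."""
    return (value - lo) / (hi - lo)

# Hypothetical training-set ranges, for illustration only.
age_lo, age_hi = 20, 60
loan_lo, loan_hi = 20_000, 220_000

unknown = (min_max_scale(48, age_lo, age_hi), min_max_scale(142_000, loan_lo, loan_hi))
case = (min_max_scale(33, age_lo, age_hi), min_max_scale(150_000, loan_lo, loan_hi))

# Euclidean distance on the scaled features.
d = sum((a - b) ** 2 for a, b in zip(unknown, case)) ** 0.5
print(round(d, 3))  # both variables now contribute on a comparable scale
```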
Some Confusions...
What will happen if k is a multiple of the
number of categories (label types)?
What will happen if k = 1?
What will happen if we set k equal
to the dataset size?
38
Acknowledgements...
Contents are borrowed from…
1. Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber,
and Jian Pei
2. Naive Bayes Example (YouTube video) by Francisco Iacobelli
(https://www.youtube.com/watch?v=ZAfarappAO0)
3. Predicting the Future: Classification, presented by Dr. Noureddin Sadawi
(https://github.com/nsadawi/DataMiningSlides/blob/master/Slides.pdf)
39
Questions??
40
Then Thanks...
41