
Classification (Bayes, Lazy)

Data Mining* (CSC521)


Dr M Muzammal

*The instructor thanks Dr Jae-Gil Lee for sharing the lecture slides.
Contents

• Decision Tree Induction


• Bayes Classification
• Support Vector Machines (SVM)
• Ensemble Methods
Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:
      P(C | A) = P(A, C) / P(A)
      P(A | C) = P(A, C) / P(C)
• Bayes theorem:
      P(C | A) = P(A | C) P(C) / P(A)
Rule of Multiplication

The probability that Events A and B both occur is equal to the probability that
Event A occurs times the probability that Event B occurs, given that A has
occurred.
P(A ∩ B) = P(A) P(B|A)

Example: Rule of Multiplication
• An urn contains 6 red marbles and 4 black marbles. Two marbles are
drawn without replacement from the urn. What is the probability that both of the
marbles are black?
• Let A = the event that the first marble is black, and
• let B = the event that the second marble is black
      P(A) = 4/10 (4 out of 10 marbles in the urn are black)
      P(B|A) = 3/9 (3 out of 9 marbles in the urn are black now)

      P(A ∩ B) = P(A) P(B|A) = (4/10) × (3/9) = 12/90 = 2/15
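A quick check of this arithmetic in Python, as a minimal sketch using the standard fractions module:

```python
from fractions import Fraction

# Rule of multiplication for the urn example: 6 red + 4 black marbles,
# two draws without replacement, both black.
p_first_black = Fraction(4, 10)              # P(A): 4 of the 10 marbles are black
p_second_black_given_first = Fraction(3, 9)  # P(B|A): 3 of the remaining 9 are black

p_both_black = p_first_black * p_second_black_given_first
print(p_both_black)   # 2/15
```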
Example (1): Bayes Theorem
• A doctor knows that meningitis causes stiff neck 50% of the time → P(S|M) = 0.5
• Prior probability of any patient having meningitis is 1/50,000 → P(M) = 1/50000
• Prior probability of any patient having stiff neck is 1/20 → P(S) = 1/20
• If a patient has stiff neck, what's the probability he/she has meningitis? → P(M|S)?

      P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
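The same calculation as a small Python sketch:

```python
# Bayes' theorem for the meningitis example: P(M|S) = P(S|M) P(M) / P(S)
p_s_given_m = 0.5
p_m = 1 / 50_000
p_s = 1 / 20

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)   # ≈ 0.0002
```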
Example (2): Bayes Theorem
• Anka is getting married tomorrow, at an outdoor ceremony in the Hills. In
recent years, it has rained only 5 days each year. Unfortunately, the
weatherman has predicted rain for tomorrow. When it actually rains, the
weatherman correctly forecasts rain 90% of the time. When it doesn't rain,
he incorrectly forecasts rain 10% of the time. What is the probability that it
will rain on the day of Anka's wedding?
Example (2): Bayes Theorem
• The sample space is defined by two mutually exclusive events – it rains or it
does not rain. Additionally, a third event occurs when the weatherman predicts rain.
• Event A1: It rains on Anka's wedding.
• Event A2: It does not rain on Anka's wedding.
• Event B: The weatherman predicts rain.
Example (2): Bayes Theorem
• P(A1) = 5/365 = 0.0136985      [It rains 5 days out of the year.]
• P(A2) = 360/365 = 0.9863014    [It does not rain 360 days out of the year.]
• P(B|A1) = 0.9                  [When it rains, the weatherman predicts rain 90% of the time.]
• P(B|A2) = 0.1                  [When it does not rain, the weatherman predicts rain 10% of the time.]
Example (2): Bayes Theorem
• Compute P(A1 | B), the probability that it will rain on the day of Anka's wedding,
given a forecast for rain by the weatherman:

      P(A1 | B) = P(A1) P(B|A1) / [ P(A1) P(B|A1) + P(A2) P(B|A2) ]
                = (0.0137 × 0.9) / (0.0137 × 0.9 + 0.9863 × 0.1)
                ≈ 0.111

Even when the weatherman predicts rain, it rains only about 11% of the time.
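The same total-probability calculation as a small Python sketch:

```python
# Weatherman example: P(A1|B) = P(A1) P(B|A1) / (P(A1) P(B|A1) + P(A2) P(B|A2))
p_rain = 5 / 365               # P(A1)
p_dry = 360 / 365              # P(A2)
p_forecast_given_rain = 0.9    # P(B|A1)
p_forecast_given_dry = 0.1     # P(B|A2)

p_forecast = p_rain * p_forecast_given_rain + p_dry * p_forecast_given_dry  # P(B)
p_rain_given_forecast = p_rain * p_forecast_given_rain / p_forecast
print(round(p_rain_given_forecast, 3))   # ≈ 0.111
```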
Bayesian Classifiers (1/2)
• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2, …, An)
  • The goal is to predict the class C
  • Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?
Bayesian Classifiers (2/2)
• Approach
  • Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes
theorem

      P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

  • Choose the value of C that maximizes P(C | A1, A2, …, An)
  • Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), since
P(A1, A2, …, An) is constant for all classes
• How to estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes)

      P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

• We can estimate P(Ai | Cj) for all Ai and Cj
• A new point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal
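The decision rule as a minimal Python sketch; the priors and cond_probs structures here are hypothetical placeholders, not from the slides:

```python
import math

# priors:     {class_label: P(class)}
# cond_probs: {class_label: [f_0, f_1, ...]} where f_i(value) returns P(A_i = value | class)
def naive_bayes_predict(record, priors, cond_probs):
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)                  # work in log space to avoid underflow
        for i, value in enumerate(record):
            p = cond_probs[c][i](value)          # P(A_i = value | C = c)
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```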
How to Estimate Probabilities from Data? (1/3)
• For discrete attributes:

      P(Ai | Ck) = |Aik| / Nc

  • where |Aik| is the number of instances that have attribute value Ai and belong to class Ck,
and Nc is the number of instances in class Ck
  • e.g., P(Status=Married | No) = 4/7

Training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Evade):

  Tid  Refund  Marital Status  Taxable Income  Evade
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single           70K            No
   4   Yes     Married         120K            No
   5   No      Divorced         95K            Yes
   6   No      Married          60K            No
   7   Yes     Divorced        220K            No
   8   No      Single           85K            Yes
   9   No      Married          75K            No
  10   No      Single           90K            Yes
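A minimal Python sketch of this counting, using records that mirror the table above (an illustration, not from the slides):

```python
# Estimating P(Ai | Ck) by relative frequency within each class.
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def p_attr_given_class(attr_index, value, cls):
    in_class = [r for r in records if r[-1] == cls]          # N_c instances
    matching = [r for r in in_class if r[attr_index] == value]  # |A_ik| instances
    return len(matching) / len(in_class)

print(p_attr_given_class(1, "Married", "No"))   # 4/7 ≈ 0.571
```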
How to Estimate Probabilities from Data? (2/3)

• For continuous attributes:


• Assume the attribute follows a normal distribution
• Use data to estimate parameters of the distribution (e.g., mean and standard
deviation)
• Once the probability distribution is known, we can use it to estimate the conditional
probability P(Ai | C)

• e.g., see the next page


How to Estimate Probabilities from Data? (3/3)
• Normal distribution:

      P(Ai | cj) = 1 / √(2π σij²) · exp( −(Ai − μij)² / (2 σij²) )

  • One for each (Ai, cj) pair
• e.g., (Income, Class=No), using the training data above
  • If Class=No:
    • sample mean = 110
    • sample variance = 2975

      P(Income = 120 | No) = 1 / (√(2π) × 54.54) · exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
An Example
Given a test record: X = (Refund = No, Marital Status = Married, Income = 120K)

Naïve Bayes Classifier (estimated from the training data):
  P(Refund=Yes | No) = 3/7                 P(Refund=No | No) = 4/7
  P(Refund=Yes | Yes) = 0/3                P(Refund=No | Yes) = 3/3
  P(Marital Status=Single | No) = 2/7      P(Marital Status=Single | Yes) = 2/3
  P(Marital Status=Divorced | No) = 1/7    P(Marital Status=Divorced | Yes) = 1/3
  P(Marital Status=Married | No) = 4/7     P(Marital Status=Married | Yes) = 0/3
  Taxable Income:
    If Class=No:  sample mean = 110, sample variance = 2975
    If Class=Yes: sample mean = 90,  sample variance = 25

• P(X | Class=No) = P(Refund=No | Class=No)
                    × P(Married | Class=No)
                    × P(Income=120K | Class=No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024

• P(X | Class=Yes) = P(Refund=No | Class=Yes)
                     × P(Married | Class=Yes)
                     × P(Income=120K | Class=Yes)
                   = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X|No) P(No) > P(X|Yes) P(Yes), i.e., 0.0024 × 7/10 > 0 × 3/10,
therefore P(No|X) > P(Yes|X)  =>  Class = No
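Reproducing this worked example in Python, plugging in the probabilities stated above:

```python
# Test record X = (Refund=No, Married, Income=120K)
p_refund_no_given_no   = 4 / 7
p_married_given_no     = 4 / 7
p_income_120_given_no  = 0.0072     # Gaussian with mean 110, variance 2975 (see above)

p_refund_no_given_yes  = 3 / 3
p_married_given_yes    = 0 / 3
p_income_120_given_yes = 1.2e-9     # Gaussian with mean 90, variance 25

p_x_given_no  = p_refund_no_given_no * p_married_given_no * p_income_120_given_no
p_x_given_yes = p_refund_no_given_yes * p_married_given_yes * p_income_120_given_yes

p_no, p_yes = 7 / 10, 3 / 10        # class priors from the training table
print(p_x_given_no * p_no, p_x_given_yes * p_yes)   # ≈ 0.00165 vs 0.0  =>  Class = No
```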
M-Estimate of Conditional Probability
• If one of the conditional probabilities is zero, then the entire expression
becomes zero
• Probability estimation:

      Original:    P(Ai | C) = Nic / Nc
      Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
      m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)

  where c is the number of classes, p is the prior probability, and m is a parameter
M-Estimate
• The basic idea for estimating conditional probabilities is that the prior probability p can be
estimated from an unconditional sample
  • If we don't have any knowledge of p, assume the attribute is uniformly distributed over all possible
values
• Interpretation of m
  • A higher value of m means that we are more confident in the prior probability p
  • m controls the balance between the relative frequency and the prior probability:

      P(Ai | C) = (Nc / (Nc + m)) · (Nic / Nc) + (m / (Nc + m)) · p

• An example
  • We assume p = 1 / number of attribute values = 1/3
  • The m value is arbitrary, and we will use m = 4
  • P(Marital Status = Married | Yes) = (0 + 4 × 1/3) / (3 + 4) = 4/21 ≈ 0.19 (instead of 0!)
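A small Python sketch of the m-estimate, applied to the zero probability from the earlier example (p = 1/3 and m = 4 as above):

```python
def m_estimate(n_ic, n_c, p, m):
    """m-estimate of P(Ai | C): (N_ic + m*p) / (N_c + m)."""
    return (n_ic + m * p) / (n_c + m)

# P(Marital Status = Married | Yes) was 0/3 in the earlier example.
print(m_estimate(0, 3, 1 / 3, 4))   # 4/21 ≈ 0.19, instead of 0
```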
Summary of Naïve Bayes
• Robust to isolated noise points
• Able to handle missing values by ignoring the instance during probability
estimate calculations
• Robust to irrelevant attributes
  • If Xi is an irrelevant attribute, P(Xi | Y) becomes almost uniformly distributed
• Independence assumption may not hold for some attributes


Classification (SVM)
Ice Breaking
Measuring Happiness Using Wearable Technology
Amount and direction of movement in three dimensions and with
high resolution (50 times a second, or once every 20 ms)
History and Applications
• Proposed by Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis' statistical learning theory in 1960s
• Characteristics: training can be slow, but accuracy is high owing to its ability
to model complex nonlinear decision boundaries (margin maximization)
• Usage: classification and numeric prediction
• Applications:
  • handwritten digit recognition, object recognition, speaker identification,
benchmarking time-series prediction tests
Intuition (1/2)
[Figure: data points with two candidate decision boundaries B1 and B2]
Find a linear hyperplane (decision boundary) that will separate the data

Intuition (2/2)
[Figure: the same data with boundaries B1 and B2]
Which one is better? How do you define better?

Basic Idea
[Figure: boundaries B1 and B2 with their margins (b11/b12 and b21/b22) and the support vectors marked]
Find the hyperplane that maximizes the margin
→ B1 is better than B2
When Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its
associated class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find
the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., the maximum
marginal hyperplane (MMH)
Formalization (1/3)
• A separating hyperplane: w · x – b = 0
  • w: a normal vector
  • b: a scalar value (bias)
• Two parallel hyperplanes:
  • w · x – b = 1
  • w · x – b = -1
• → Maximize the margin 2/||w|| (i.e., minimize ||w||), subject to
  • w · xi – b ≥ 1 for xi of the first class
  • w · xi – b ≤ -1 for xi of the second class
  → yi (w · xi – b) ≥ 1 for all i
Formalization (2/3)
• Primal form:

      minimize (1/2) ||w||²  subject to  yi (w · xi – b) ≥ 1 for all i

  → Substituting ||w|| with (1/2) ||w||² for mathematical convenience
• This becomes a constrained (convex) quadratic optimization problem
Soft Margin (1/2)

• If there exists no hyperplane that can split


the "yes" and "no" examples, the soft
margin method will choose a hyperplane
that splits the examples as cleanly as
possible, while still maximizing the
distance to the nearest cleanly split
examples
→ Allow mislabeled examples
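A rough Python sketch of a soft-margin linear SVM trained by subgradient descent on the hinge loss. This is not the constrained QP from the formalization slides, but it minimizes the same soft-margin objective approximately; the toy data and learning-rate choices are assumptions for illustration:

```python
import numpy as np

def linear_svm_sgd(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via subgradient descent on the hinge loss.

    Minimizes 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i - b)); y must be in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w - b)
            if margin < 1:                       # point violates the (soft) margin
                w -= lr * (w - C * y[i] * X[i])
                b -= lr * (C * y[i])
            else:
                w -= lr * w                      # only the regularizer contributes
    return w, b

# Toy usage: two well-separated Gaussian blobs, labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = linear_svm_sgd(X, y)
pred = np.sign(X @ w - b)
print((pred == y).mean())   # training accuracy; should be 1.0 on this toy data
```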
When Linearly Inseparable
[Figure: a linearly separable dataset vs. a dataset that is not linearly separable]
Projecting data that is not linearly separable into a higher
dimensional space can make it linearly separable
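A tiny illustration of the projection idea in Python, using hypothetical circular data: points inside vs. outside a circle are not linearly separable in two dimensions, but adding a squared-radius feature makes a linear rule sufficient in the projected space:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)    # label: inside the unit circle?

Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])     # project (x1, x2) -> (x1, x2, x1^2 + x2^2)
pred = np.where(Z[:, 2] < 1.0, 1, -1)                     # a linear rule on the new coordinate
print((pred == y).mean())                                 # 1.0: separable after projection
```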
Why Is SVM Effective on High Dimensional
Data?
• The complexity of a trained classifier is characterized by the number of support
vectors rather than the dimensionality of the data
• The support vectors are the essential or critical training examples —they lie
closest to the decision boundary (MMH)
• If all other training examples are removed and the training is repeated, the
same separating hyperplane would be found
• The number of support vectors found can be used to compute an (upper)
bound on the expected error rate of the SVM classifier, which is independent of
the data dimensionality
• Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
Multi-Class SVM
• Reduce the single multiclass problem into multiple binary classification
problems
• Two methods to build the binary classifiers:
  • Between one of the labels and the rest (one-versus-all)
  • Between every pair of classes (one-versus-one)
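A minimal one-versus-rest wrapper sketch in Python; train_binary is a placeholder for any binary classifier that returns a scoring function (for example, the linear SVM sketch above):

```python
import numpy as np

def one_vs_rest_train(X, y, train_binary):
    """Train one binary scorer per class label (one-versus-all).

    train_binary(X, y_pm) must return a function f(X) -> real-valued scores,
    where y_pm is the label vector recoded to {+1, -1}.
    """
    return {c: train_binary(X, np.where(y == c, 1, -1)) for c in np.unique(y)}

def one_vs_rest_predict(models, X):
    labels = list(models)
    scores = np.column_stack([models[c](X) for c in labels])  # one score column per class
    return np.array(labels)[scores.argmax(axis=1)]            # class with the largest score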
LIBSVM
• A library for Support Vector Machines
• Provides interfaces for many programming languages, including Java,
MATLAB, R, Python, and C#
• Developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/
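For a quick start in Python, scikit-learn's SVC classifier is built on LIBSVM; a minimal usage sketch on synthetic data (dataset and parameter choices are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)      # RBF kernel, soft-margin parameter C
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))    # test accuracy
print(len(clf.support_))            # number of support vectors found
```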
SVM Related Links
• http://www.svms.org/
• http://www.support-vector-machines.org/
• http://www.kernel-machines.org/
Thank You!
Questions?
