ML-09-naive-bayes-classifier

This document discusses the Naïve Bayes classifier, a probabilistic framework for classification problems that assumes conditional independence among attributes given the class. It explains Bayes theorem, how to compute posterior probabilities, and the implications of the conditional independence assumption for classification accuracy. It also highlights the advantages and disadvantages of the Naïve Bayes classifier, including its robustness to noise and its handling of missing values, while noting potential performance degradation due to correlated attributes.

CS 60050

Machine Learning

Naïve Bayes Classifier

Some slides taken from course materials of Tan, Steinbach, Kumar


Bayes Classifier

● A probabilistic framework for solving classification problems
● An approach for modeling probabilistic relationships between the attribute set and the class variable
  – It may not be possible to predict the class label of a test record with certainty, even if it has attributes identical to some training records
  – Reason: noisy data, or the presence of factors that affect the class but are not included in the analysis
Probability Basics

● P(A = a, C = c): joint probability that random variables A and C take the values a and c respectively
● P(A = a | C = c): conditional probability that A takes the value a, given that C has taken the value c

  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)
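To keep these definitions concrete, here is a tiny numeric illustration in Python (the joint distribution values are invented for this sketch):

```python
# Illustrative joint distribution over binary A and C: P(A=a, C=c)
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

p_a1 = joint[(1, 0)] + joint[(1, 1)]   # marginal P(A=1) = 0.7
p_c1_given_a1 = joint[(1, 1)] / p_a1   # P(C=1 | A=1) = P(A=1, C=1) / P(A=1)
print(p_c1_given_a1)                   # ≈ 0.571
```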
Bayes Theorem

● Bayes theorem:

  P(C | A) = P(A | C) P(C) / P(A)

● P(C) is known as the prior probability
● P(C | A) is known as the posterior probability
Example of Bayes Theorem

● Given:
  – A doctor knows that meningitis causes a stiff neck 50% of the time
  – The prior probability of any patient having meningitis is 1/50,000
  – The prior probability of any patient having a stiff neck is 1/20

● If a patient has a stiff neck, what is the probability that he/she has meningitis?

  P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
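A minimal Python sketch of this computation (the variable names are illustrative, not from the slides):

```python
# Bayes theorem: P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5      # P(stiff neck | meningitis)
p_m = 1 / 50_000       # prior P(meningitis)
p_s = 1 / 20           # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)     # 0.0002
```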
Bayesian Classifiers

● Consider each attribute and the class label as random variables

● Given a record with attributes (A1, A2, …, An):
  – The goal is to predict the class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)

● Can we estimate P(C | A1, A2, …, An) directly from data?
Bayesian Classifiers

● Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:

  P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

  – Here P(C | A1, A2, …, An) is the posterior probability, P(A1, A2, …, An | C) is the class-conditional probability, P(C) is the prior probability, and the denominator P(A1, A2, …, An) is the evidence
  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Since the evidence P(A1, A2, …, An) is the same for every value of C, this is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)

● How to estimate P(A1, A2, …, An | C)?


Naïve Bayes Classifier

● Assumes all attributes Ai are conditionally independent, given the class C:
  – P(A1, A2, …, An | C) = P(A1 | C) P(A2 | C) … P(An | C)
  – We can estimate P(Ai | Cj) for all Ai and Cj from the training data
  – A new point is classified as Cj if P(Cj) Π P(Ai | Cj) is maximal
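As an illustrative sketch (not from the slides), the decision rule can be written in a few lines of Python, assuming the priors and conditional probability tables have already been estimated:

```python
import math

def naive_bayes_predict(record, priors, cond_probs):
    """Return the class maximizing P(C) * prod_i P(Ai | C).

    record:     dict attribute -> value
    priors:     dict class -> P(C)
    cond_probs: dict (attribute, value, class) -> P(Ai = value | C)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # Sum logs instead of multiplying to avoid numerical underflow
        score = math.log(prior)
        for attr, value in record.items():
            p = cond_probs.get((attr, value, c), 0.0)
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```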
Conditional independence: basics

● Let X, Y, Z denote three sets of random variables
● The variables in X are said to be conditionally independent of the variables in Y, given Z, if

  P(X | Y, Z) = P(X | Z)

● An example:
  – People's reading skill tends to increase with the length of their arm
  – Explanation: both increase with a person's age
  – If age is given, arm length and reading skill are (conditionally) independent
Conditional independence: basics

● If X and Y are conditionally independent given Z:

  P(X, Y | Z) = P(X, Y, Z) / P(Z)
              = [P(X, Y, Z) / P(Y, Z)] × [P(Y, Z) / P(Z)]
              = P(X | Y, Z) × P(Y | Z)
              = P(X | Z) × P(Y | Z)     (by conditional independence)

  Hence P(X, Y | Z) = P(X | Z) × P(Y | Z)

● This is exactly the NB assumption:
  P(A1, A2, …, An | C) = P(A1 | C) P(A2 | C) … P(An | C)
How to Estimate Probabilities from Data?

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class):

  Tid  Refund  Marital Status  Taxable Income  Evade
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

● Class priors: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10

● For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances that have attribute value Ai and belong to class Ck
  – Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
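The counting above is easy to reproduce in code. A minimal Python sketch (the data mirrors the table; all names are illustrative):

```python
from collections import Counter

# Training records from the table: (Refund, Marital Status, Evade)
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

# Class priors: P(C) = Nc / N
class_counts = Counter(evade for _, _, evade in records)
priors = {c: n_c / len(records) for c, n_c in class_counts.items()}
print(priors)  # {'No': 0.7, 'Yes': 0.3}

# Discrete conditional, e.g. P(Status=Married | No) = |Aik| / Nc
n_married_no = sum(1 for _, status, evade in records
                   if status == "Married" and evade == "No")
print(n_married_no / class_counts["No"])  # 4/7 ≈ 0.571
```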
How to Estimate Probabilities from Data?

● For continuous attributes, two options:
  – Discretize the range into bins
    · One ordinal attribute per bin
  – Probability density estimation:
    · Assume the attribute follows a Gaussian (normal) distribution
    · Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
    · Once the probability distribution is known, use it to estimate the conditional probability P(Ai | c)
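For the binning option, here is a brief sketch (the bin edges are invented for illustration); each resulting bin index is then treated like any other discrete attribute value:

```python
import bisect

# Discretize Taxable Income (in K) into bins at illustrative cut points
bin_edges = [80, 100, 150]
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# bisect_right maps each value to the index of the bin it falls into
bins = [bisect.bisect_right(bin_edges, v) for v in incomes]
print(bins)  # [2, 2, 0, 2, 1, 0, 3, 1, 0, 1]
```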
How to Estimate Probabilities from Data?

● Normal distribution, with one distribution per (Ai, cj) pair:

  P(Ai | cj) = (1 / √(2π σij²)) · exp( −(Ai − µij)² / (2 σij²) )

● For (Income, Class=No), using the training table:
  – sample mean = 110, sample variance = 2975

  P(Income = 120 | No) = (1 / √(2π · 2975)) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072
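A quick numeric check of this value (a sketch; the mean and variance come from the slide):

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Income for Class=No: sample mean 110, sample variance 2975
print(gaussian_pdf(120, 110, 2975))  # ≈ 0.0072
```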
A complete example

Example of Naïve Bayes Classifier

Given a test record:

  X = (Refund = No, Marital Status = Married, Income = 120K)

Training data: the same table as above. The Naïve Bayes classifier estimates:

  P(Refund=Yes | No) = 3/7            P(Refund=No | No) = 4/7
  P(Refund=Yes | Yes) = 0             P(Refund=No | Yes) = 1
  P(Marital Status=Single | No) = 2/7
  P(Marital Status=Divorced | No) = 1/7
  P(Marital Status=Married | No) = 4/7
  P(Marital Status=Single | Yes) = 2/3
  P(Marital Status=Divorced | Yes) = 1/3
  P(Marital Status=Married | Yes) = 0

  For Taxable Income:
    If Class=No:  sample mean = 110, sample variance = 2975
    If Class=Yes: sample mean = 90,  sample variance = 25
Example of Naïve Bayes Classifier

Using these estimates for the test record X = (Refund = No, Marital Status = Married, Income = 120K):

● P(X | Class=No) = P(Refund=No | Class=No)
                    × P(Married | Class=No)
                    × P(Income=120K | Class=No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024

● P(X | Class=Yes) = P(Refund=No | Class=Yes)
                     × P(Married | Class=Yes)
                     × P(Income=120K | Class=Yes)
                   = 1 × 0 × 1.2×10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), it follows that P(No | X) > P(Yes | X)
=> Class = No
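Putting the whole example together, here is a self-contained Python sketch that reproduces this classification (the values follow the slides; the helper names are illustrative):

```python
import math

def gaussian_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

priors = {"No": 7 / 10, "Yes": 3 / 10}

# Conditional probabilities estimated from the training table
cond = {
    ("Refund", "No", "No"): 4 / 7,       ("Refund", "No", "Yes"): 1.0,
    ("Marital", "Married", "No"): 4 / 7, ("Marital", "Married", "Yes"): 0.0,
}
income_params = {"No": (110, 2975), "Yes": (90, 25)}  # (mean, variance)

x = {"Refund": "No", "Marital": "Married", "Income": 120}

scores = {}
for c in priors:
    mean, var = income_params[c]
    likelihood = (cond[("Refund", x["Refund"], c)]
                  * cond[("Marital", x["Marital"], c)]
                  * gaussian_pdf(x["Income"], mean, var))
    scores[c] = priors[c] * likelihood

print(scores)                       # {'No': ≈0.0016, 'Yes': 0.0}
print(max(scores, key=scores.get))  # 'No'
```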
Naïve Bayes Classifier

● If one of the conditional probabilities is zero, the entire product becomes zero
● Alternative probability estimates:

  Original:    P(Ai | C) = Nic / Nc
  Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)

  where c is the number of classes, p is a prior probability, and m is a parameter
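A minimal sketch of these three estimators in Python (parameter names mirror the formulas above):

```python
def original_estimate(n_ic, n_c):
    # P(Ai | C) = Nic / Nc — zero whenever the count Nic is zero
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    # P(Ai | C) = (Nic + 1) / (Nc + c) — never exactly zero
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    # P(Ai | C) = (Nic + m*p) / (Nc + m) — p is a prior, m weights it
    return (n_ic + m * p) / (n_c + m)

# e.g., P(Refund=Yes | Yes) was 0/3 with the original estimate:
print(laplace_estimate(0, 3, c=2))  # 0.2 instead of 0
```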
Naïve Bayes: Pros and Cons

● Robust to isolated noise points
● Can handle missing values by ignoring the instance during probability estimate calculations
● Robust to irrelevant attributes
● The independence assumption may not hold for some attributes
  – The presence of correlated attributes can degrade the performance of the NB classifier
Example with correlated attribute

● Two attributes A, B and class Y (all binary)
● Prior probabilities:
  – P(Y=0) = P(Y=1) = 0.5
● Class-conditional probabilities of A:
  – P(A=0 | Y=0) = 0.4,  P(A=1 | Y=0) = 0.6
  – P(A=0 | Y=1) = 0.6,  P(A=1 | Y=1) = 0.4
● The class-conditional probabilities of B are the same as those of A
● B is perfectly correlated with A when Y=0, but is independent of A when Y=1
Example with correlated attribute

● We need to classify a record with A=0, B=0
● Under the NB assumption:

  P(Y=0 | A=0, B=0) = P(A=0, B=0 | Y=0) P(Y=0) / P(A=0, B=0)
                    = P(A=0 | Y=0) P(B=0 | Y=0) P(Y=0) / P(A=0, B=0)
                    = (0.4 × 0.4 × 0.5) / P(A=0, B=0) = 0.08 / P(A=0, B=0)

  P(Y=1 | A=0, B=0) = P(A=0, B=0 | Y=1) P(Y=1) / P(A=0, B=0)
                    = P(A=0 | Y=1) P(B=0 | Y=1) P(Y=1) / P(A=0, B=0)
                    = (0.6 × 0.6 × 0.5) / P(A=0, B=0) = 0.18 / P(A=0, B=0)

● Since 0.18 > 0.08, the NB prediction is Y=1
Example with correlated attribute

● We need to classify a record with A=0, B=0
● In reality, since B is perfectly correlated with A when Y=0, we have P(A=0, B=0 | Y=0) = P(A=0 | Y=0):

  P(Y=0 | A=0, B=0) = P(A=0, B=0 | Y=0) P(Y=0) / P(A=0, B=0)
                    = P(A=0 | Y=0) P(Y=0) / P(A=0, B=0)
                    = (0.4 × 0.5) / P(A=0, B=0) = 0.2 / P(A=0, B=0)

● Since 0.2 > 0.18, the prediction should have been Y=0
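A short Python sketch (illustrative, not from the slides) makes the gap concrete by comparing the naive factorization with the true joint probability:

```python
# P(A=0 | Y=y); B has the same class-conditional probabilities as A
p_a0 = {0: 0.4, 1: 0.6}

# Naive Bayes scores for A=0, B=0 (the shared denominator P(A=0, B=0) cancels)
nb_score = {y: p_a0[y] * p_a0[y] * 0.5 for y in (0, 1)}
print(nb_score)    # {0: 0.08, 1: 0.18} -> NB predicts Y=1

# True scores: when Y=0, B always equals A, so P(A=0, B=0 | Y=0) = P(A=0 | Y=0);
# when Y=1, A and B really are independent
true_score = {0: p_a0[0] * 0.5, 1: p_a0[1] * p_a0[1] * 0.5}
print(true_score)  # {0: 0.2, 1: 0.18} -> the correct prediction is Y=0
```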


Other Bayesian classifiers

● If it is suspected that the attributes are correlated, other techniques such as Bayesian Belief Networks (BBNs) can be used
● A BBN uses a graphical model (network) to capture prior knowledge in a particular domain and the causal dependencies among variables
