Statistical Inference INF312 - Is - Lecture 03 - Part 3

The document discusses Bayesian classification, emphasizing its probabilistic prediction capabilities based on Bayes' Theorem. It explains the naive Bayes classifier, which simplifies computations by assuming attribute independence, and provides examples of how to calculate probabilities for classifying data. Additionally, it includes a solved example demonstrating the application of Bayes' Theorem in predicting bone fractures using bone mineral density measurements.

Bayesian Classification: Why?

◼ A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
◼ Foundation: Based on Bayes’ Theorem.
◼ Performance: A simple Bayesian classifier, the naïve Bayesian
classifier, has performance comparable to decision tree and
selected neural network classifiers
◼ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
◼ Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
1
Bayes’ Theorem: Basics
◼ Total Probability Theorem: P(B) = Σ_{i=1}^{M} P(B|A_i) P(A_i)

◼ Bayes’ Theorem: P(H|X) = P(X|H) P(H) / P(X)
◼ Let X be a data sample (“evidence”): class label is unknown
◼ Let H be a hypothesis that X belongs to class C
◼ Classification is to determine P(H|X) (i.e., the posterior probability): the
probability that the hypothesis holds given the observed data sample X
◼ P(H) (prior probability): the initial probability
◼ E.g., X will buy computer, regardless of age, income, …

◼ P(X): probability that sample data is observed


◼ P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
◼ E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
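
As a quick illustration (not part of the original slides), the minimal Python sketch below plugs numbers into Bayes’ Theorem; the prior, likelihood, and evidence values are purely hypothetical.

```python
# Minimal sketch of Bayes' Theorem: P(H|X) = P(X|H) * P(H) / P(X)
# All numbers below are hypothetical, chosen only to illustrate the formula.

def posterior(likelihood, prior, evidence):
    """Return P(H|X) given P(X|H), P(H), and P(X)."""
    return likelihood * prior / evidence

p_h = 0.4          # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.3  # P(X|H): probability of this attribute profile among buyers
p_x = 0.2          # P(X): overall probability of observing this attribute profile

print(posterior(p_x_given_h, p_h, p_x))  # 0.3 * 0.4 / 0.2 = 0.6
```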
2
Prediction Based on Bayes’ Theorem
◼ Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem

P(H|X) = P(X|H) P(H) / P(X)
◼ Informally, this can be viewed as
posterior = likelihood × prior / evidence
◼ Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
◼ Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost

3
Classification Is to Derive the Maximum Posteriori
◼ Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum a posteriori probability, i.e., the
maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
◼ Since P(X) is constant for all classes, only
P(X|Ci) P(Ci)
needs to be maximized
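
The point that P(X) can be ignored is easy to check numerically. The sketch below (with hypothetical class scores) shows that dividing every score by the same evidence term P(X) does not change which class attains the maximum.

```python
# Hypothetical unnormalized scores P(X|Ci) * P(Ci) for three classes.
scores = {"C1": 0.028, "C2": 0.007, "C3": 0.012}
p_x = sum(scores.values())  # P(X) by the total probability theorem

# Normalized posteriors P(Ci|X) = P(X|Ci) * P(Ci) / P(X)
posteriors = {c: s / p_x for c, s in scores.items()}

# The argmax is identical with or without dividing by P(X).
print(max(scores, key=scores.get))       # C1
print(max(posteriors, key=posteriors.get))  # C1
```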

4
Naïve Bayes Classifier
◼ A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X|Ci) = ∏_{k=1}^{n} P(x_k|Ci) = P(x_1|Ci) × P(x_2|Ci) × ... × P(x_n|Ci)
◼ This greatly reduces the computation cost: only the class distribution
needs to be counted
◼ If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
◼ If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x−μ)² / (2σ²))
and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
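
For continuous attributes, the Gaussian density above can be coded directly. A minimal sketch follows; the mean and standard deviation used here are made-up class statistics, purely for illustration.

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used for continuous attributes."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical class statistics for a continuous attribute such as age:
mu_ci, sigma_ci = 38.0, 7.5  # estimated from the tuples of class Ci
print(gaussian(35.0, mu_ci, sigma_ci))  # P(age = 35 | Ci) under the Gaussian assumption
```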
5
Naïve Bayes Classifier: Training Dataset
Example:
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’


Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
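
To make the counts on the following slides reproducible, the sketch below encodes the 14 training tuples from the table as a Python list (variable names are illustrative, not from the original slides).

```python
# The 14 training tuples: (age, income, student, credit_rating, buys_computer)
dataset = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]

# The tuple to classify:
X = ("<=30", "medium", "yes", "fair")
```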
6
Naïve Bayes Classifier: An Example
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’

◼ Compute P(Ci) for each class:


◼ P(C1) = P(buys_computer = “yes”) = 9/14 = 0.643

◼ P(C2) = P(buys_computer = “no”) = 5/14 = 0.357
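
A short sketch verifying the two priors from the class-label counts (9 “yes” and 5 “no” out of 14 tuples):

```python
from collections import Counter

# Class labels of the 14 training tuples, in table order (9 "yes", 5 "no").
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors["yes"])  # 9/14 ≈ 0.643
print(priors["no"])   # 5/14 ≈ 0.357
```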

7
Naïve Bayes Classifier: An Example
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’

◼ Compute P(X|Ci) for each class:

P(X|C1) = P(x1|C1) × P(x2|C1) × P(x3|C1) × … × P(xn|C1)

P(X|C2) = P(x1|C2) × P(x2|C2) × P(x3|C2) × … × P(xn|C2)

8
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
Age Buys Computer Count Total Conditional Probability (fraction, decimal)
<= 30 Yes 2 9 (2/9) 0.222222222
<= 30 No 3 5 (3/5) 0.6
31-40 Yes 4 9 (4/9) 0.444444444
31-40 No 0 5 (0/5) 0
> 40 Yes 3 9 (3/9) 0.333333333
> 40 No 2 5 (2/5) 0.4

P(Age <= 30| Buys Computer = Yes) 0.222222222


P(Age <= 30| Buys Computer = No) 0.6
P(Age Between 31 and 40| Buys Computer = Yes) 0.444444444
P(Age Between 31 and 40| Buys Computer = No) 0
P(Age > 40| Buys Computer = Yes) 0.333333333
P(Age > 40| Buys Computer = No) 0.4
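
The same counting can be done in a few lines of Python. The sketch below reproduces the conditional probabilities for the age attribute; the same function applies unchanged to the income, student, and credit_rating tables on the next slides.

```python
from collections import Counter

# (age, buys_computer) pairs for the 14 training tuples, in table order.
pairs = [("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"),
         (">40", "yes"), (">40", "no"), ("31..40", "yes"), ("<=30", "no"),
         ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31..40", "yes"),
         ("31..40", "yes"), (">40", "no")]

def conditional_table(pairs):
    """P(value | class) for every (attribute value, class) combination seen in pairs."""
    joint = Counter(pairs)                       # counts of (value, class)
    class_totals = Counter(c for _, c in pairs)  # counts of each class
    return {(v, c): n / class_totals[c] for (v, c), n in joint.items()}

table = conditional_table(pairs)
print(table[("<=30", "yes")])    # 2/9 ≈ 0.222
print(table[("<=30", "no")])     # 3/5 = 0.6
print(table[("31..40", "yes")])  # 4/9 ≈ 0.444
```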

9
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
Income Buys Computer Count Total Conditional Probability (fraction, decimal)
High Yes 2 9 (2/9) 0.222222222
High No 2 5 (2/5) 0.4
Medium Yes 4 9 (4/9) 0.444444444
Medium No 2 5 (2/5) 0.4
Low Yes 3 9 (3/9) 0.333333333
Low No 1 5 (1/5) 0.2

P(Income = High| Buys Computer = Yes) 0.222222222


P(Income = High| Buys Computer = No) 0.4
P(Income = Medium| Buys Computer = Yes) 0.444444444
P(Income = Medium| Buys Computer = No) 0.4
P(Income = Low| Buys Computer = Yes) 0.333333333
P(Income = Low| Buys Computer = No) 0.2

10
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)

Student Buys Computer Count Total Conditional Probability (fraction, decimal)


Yes Yes 6 9 (6/9) 0.666666667
Yes No 1 5 (1/5) 0.2
No Yes 3 9 (3/9) 0.333333333
No No 4 5 (4/5) 0.8

P(Student = Yes| Buys Computer = Yes) 0.666666667


P(Student = Yes| Buys Computer = No) 0.2
P(Student = No| Buys Computer = Yes) 0.333333333
P(Student = No| Buys Computer = No) 0.8

11
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)

Credit Rating Buys Computer Count Total Conditional Probability (fraction, decimal)
Fair Yes 6 9 (6/9) 0.666666667
Fair No 2 5 (2/5) 0.4
Excellent Yes 3 9 (3/9) 0.333333333
Excellent No 3 5 (3/5) 0.6

P(Credit Rating = Fair| Buys Computer = Yes) 0.666666667


P(Credit Rating = Fair| Buys Computer = No) 0.4
P(Credit Rating = Excellent| Buys Computer = Yes) 0.333333333
P(Credit Rating = Excellent| Buys Computer = No) 0.6

12
Naïve Bayes Classifier: An Example
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’

◼ Compute P(X|Ci) for each class

P(X|C1) = P(X|buys_computer = “yes”)

= 0.222 x 0.444 x 0.667 x 0.667 = 0.044

P(X|C2) = P(X|buys_computer = “no”)

= 0.6 x 0.4 x 0.2 x 0.4 = 0.019

13
Naïve Bayes Classifier: An Example
Class:
C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’

◼ Compute P(X|Ci) * P(Ci) for each class

P(X|C1) * P(C1) = 0.044 * 0.643 = 0.028

P(X|C2) * P(C2) = 0.019 * 0.357 = 0.007

◼ Decision

P(X|C1) * P(C1) > P(X|C2) * P(C2)

X belongs to (C1)
Therefore, X belongs to class (“buys_computer = yes”)
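
Putting the pieces together, the sketch below recomputes the two scores from the conditional probabilities and priors derived on the previous slides and confirms the decision (a verification sketch, not part of the original slides).

```python
import math

# Conditional probabilities for X = (age <=30, income = medium, student = yes,
# credit_rating = fair), taken from the tables on the previous slides.
cond_yes = [2/9, 4/9, 6/9, 6/9]   # P(age|yes), P(income|yes), P(student|yes), P(credit|yes)
cond_no  = [3/5, 2/5, 1/5, 2/5]   # the same four conditionals for the "no" class

prior_yes, prior_no = 9/14, 5/14  # class priors from slide 7

score_yes = math.prod(cond_yes) * prior_yes  # ≈ 0.044 * 0.643 ≈ 0.028
score_no  = math.prod(cond_no)  * prior_no   # ≈ 0.019 * 0.357 ≈ 0.007

print("buys_computer = yes" if score_yes > score_no else "buys_computer = no")
```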
14
Solved Example on Bayes Theorem
◼ Researchers investigated the effectiveness of using the
Hologic Sahara Sonometer, a portable device that
measures bone mineral density (BMD) in the ankle, in
predicting a fracture. They used a Hologic estimated
bone mineral density value of 0.57 as a cutoff. The
results of the investigation yielded the following data
(counts of subjects by test result and confirmed fracture status):

Test Result      Confirmed Fracture (+D)   No Confirmed Fracture (−D)
Positive (+T)    214                       670
Negative (−T)    73                        330
Total            287                       1000

15
Solved Example on Bayes Theorem
a) Calculate the sensitivity of using a BMD value of 0.57
as a cutoff value for predicting fracture.
b) Calculate the specificity of using a BMD value of 0.57
as a cutoff value for predicting fracture.
c) If it is estimated that 10 percent of the U.S.
population have a confirmed bone fracture, what is
the predictive value positive of using a BMD value of
0.57 as a cutoff value for predicting fracture? That is,
we wish to estimate the probability that a subject
with a positive test at the 0.57 BMD cutoff has a
confirmed bone fracture.

16
Solved Example on Bayes Theorem

a) Sensitivity = P(+T | +D) = 214/287 = 0.7456 = 74.56%

b) Specificity = P(−T | −D) = 330/1000 = 0.33 = 33%

c) Predictive Value Positive:
P(+D | +T) = P(+T | +D) · P(+D) / P(+T)

17
Solved Example on Bayes Theorem

c) Predictive Value Positive


P(+T) = P(+T|+D)P(+D) + P(+T|−D)P(−D)
= (214/287)(0.1) + (670/1000)(0.9) = 0.6776
◼ P(+D|+T) = P(+T|+D) · P(+D) / P(+T) = (0.7456 × 0.1) / 0.6776 = 0.11
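
The same calculation in a few lines of Python, using the sensitivity, specificity, and 10% prevalence from the example:

```python
sensitivity = 214 / 287   # P(+T | +D) ≈ 0.7456
specificity = 330 / 1000  # P(-T | -D) = 0.33
prevalence  = 0.10        # P(+D): assumed prevalence of confirmed fracture

# Total probability: P(+T) = P(+T|+D)P(+D) + P(+T|-D)P(-D)
p_test_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' Theorem: P(+D|+T) = P(+T|+D)P(+D) / P(+T)
ppv = sensitivity * prevalence / p_test_pos
print(round(p_test_pos, 4))  # 0.6776
print(round(ppv, 2))         # 0.11
```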

18
