Lesson 3.3 - Supervised Learning: Bayesian Classification
November 5, 2018
Bayesian Classification
Roadmap
Introduction
Bayes Theorem
Naive Bayesian Classification
References
Introduction
Bayesian classifier
A statistical classifier that performs probabilistic prediction, i.e., it predicts class membership probabilities, such as the probability that a given instance belongs to a particular class.
Foundation
Based on Bayes' theorem.
Performance
A simple Bayesian classifier, the naive Bayesian classifier, exhibits accuracy and speed comparable to decision tree and selected neural network classifiers when applied to large databases.
Incremental
Each training example can incrementally increase/decrease the
probability that a hypothesis is correct
Popular methods
Naive Bayesian classifier
Bayesian belief networks
Bayes Theorem
Let:
X: a data sample whose class label is unknown
H: a hypothesis, e.g., that X belongs to class C
P(H|X): the probability that instance X belongs to class C, given that we know the attribute description of X (this is what the classifier determines)
P(H): the prior probability of H
P(X): the probability that the sample data is observed
P(X|H): the probability of observing X conditioned on H
Bayes' theorem relates these quantities:
P(H|X) = P(X|H) P(H) / P(X)
Example.
Suppose the data samples describe customers by age and income, and let H be the hypothesis that a customer will buy a computer. Consider a customer X who is 35 years old and earns $40,000.
P(H|X): the probability that customer X will buy a computer, given that we know X's age and income.
P(H): the probability that any given customer will buy a computer, regardless of age and income.
P(X): the probability that a person from our set of customers is 35 years old and earns $40,000.
P(X|H): the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
Practical difficulty
Requires initial knowledge of many probabilities, at significant computational cost.
In the next section, we will look at how Bayes’ theorem is used in the
naive Bayesian classifier.
Note
The assumptions that all attributes are equally important and independent are never correct in real-life datasets.
Naive Bayesian Classification
Let D be a training set of instances, each described by an n-dimensional attribute vector X = (x1, x2, ..., xn), and let there be m classes C1, C2, ..., Cm. The classifier predicts that X belongs to the class Ci with the highest posterior probability P(Ci|X), i.e., the class maximizing P(X|Ci)P(Ci). Assuming class-conditional independence (the naive assumption), the likelihood factorizes:
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
If Ak is categorical:
P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
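To make the counting concrete, here is a minimal Python sketch of this estimate (the function and variable names are my own, not from the lesson):

    def categorical_conditional_prob(D, class_label, attr_index, value):
        # Estimate P(xk|Ci): the fraction of class-Ci tuples in D
        # whose attribute Ak (at attr_index) equals the given value.
        in_class = [x for x, c in D if c == class_label]   # tuples of Ci in D
        if not in_class:
            return 0.0
        matches = sum(1 for x in in_class if x[attr_index] == value)
        return matches / len(in_class)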
If Ak is continuous-valued:
P(xk|Ci) is usually computed based on a Gaussian distribution with mean µ and standard deviation σ:
g(x, µ, σ) = (1 / (√(2π) · σ)) · e^(−(x − µ)² / (2σ²))

with sample mean and standard deviation

µ = (1/n) Σ xᵢ                    (sum over i = 1, ..., n)
σ = √( (1/(n − 1)) Σ (xᵢ − µ)² )
µCi and σCi are the mean and standard deviation, respectively, of the values of attribute Ak for training instances of class Ci, so that P(xk|Ci) = g(xk, µCi, σCi).
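A minimal sketch of these estimates in Python (names are hypothetical; only the standard library is used):

    import math

    def gaussian(x, mu, sigma):
        # g(x, mu, sigma): Gaussian (normal) density
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    def fit(values):
        # Sample mean and standard deviation (n - 1 denominator)
        n = len(values)
        mu = sum(values) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
        return mu, sigma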
Example.
Let X = (35, $40,000), where A1 and A2 are the attributes age and income, and let the class label attribute be buys_computer. The associated class label for X is yes (i.e., buys_computer = yes).
For attribute age and this class, we have µ = 38 years and σ = 12. Plug these quantities, along with x1 = 35 for our instance X, into g(x, µ, σ) to estimate P(age = 35 | buys_computer = yes).
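Carrying out that computation (a quick self-contained check in Python; the values µ = 38, σ = 12, x1 = 35 are the ones above):

    import math

    mu, sigma, x1 = 38.0, 12.0, 35.0
    p = math.exp(-((x1 - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    print(round(p, 4))   # 0.0322, i.e. P(age = 35 | buys_computer = yes) ≈ 0.032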
In other words, the predicted class label is the class Ci for which
P (X |Ci )P (Ci ) is the maximum.
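Putting the pieces together, a minimal Python sketch of training and of this argmax rule for categorical attributes (the dataset layout and names are assumptions, not from the lesson):

    from collections import Counter, defaultdict

    def train(D):
        # D: list of (attribute_tuple, class_label) pairs.
        priors = Counter(c for _, c in D)        # class counts
        counts = defaultdict(Counter)            # (class, attr index) -> value counts
        for x, c in D:
            for k, v in enumerate(x):
                counts[(c, k)][v] += 1
        return priors, counts, len(D)

    def predict(x, priors, counts, n_total):
        # Return the class Ci maximizing P(X|Ci) * P(Ci).
        best_class, best_score = None, -1.0
        for c, n_c in priors.items():
            score = n_c / n_total                # P(Ci)
            for k, v in enumerate(x):
                score *= counts[(c, k)][v] / n_c # P(xk|Ci); zero if value unseen
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    # Usage: priors, counts, n = train(weather_data)
    #        predict(("sunny", "cool", "high", "true"), priors, counts, n)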
Example 1: AllElectronics
[The AllElectronics training data, 14 customer tuples described by age, income, student, and credit_rating with class label buys_computer, and the derived probabilities were shown here as figures.]
For the instance X = (age <= 30, income = medium, student = yes, credit_rating = fair):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
P(X|buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
Similarly,
P(X|buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X|buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X|buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007
Therefore, the naive Bayesian classifier predicts buys_computer = yes for instance X.
Example:
For the attribute-value pair student = yes of X, we need two counts:
the number of customers who are students and for which buys_computer = yes, which contributes to P(X|buys_computer = yes)
the number of customers who are students and for which buys_computer = no, which contributes to P(X|buys_computer = no)
But if there are no training instances representing students for the class buys_computer = no, then P(student = yes|buys_computer = no) = 0.
Plugging this zero value into the product for P(X|Ci) would return a zero probability for P(X|Ci), no matter what the other probabilities are.
Laplacian correction (Laplace estimator)
To avoid the zero-probability problem, pretend each attribute value occurs one more time than it actually does: add 1 to each count. If an attribute has q distinct values, the denominator grows by q, so the corrected estimates remain valid probabilities.
Example:
Suppose that for the class buys_computer = yes, a training database D contains 1,000 instances:
0 instances with income = low,
990 instances with income = medium, and
10 instances with income = high.
The probabilities of these events are 0 (from 0/1,000), 0.990 (from 990/1,000), and 0.010 (from 10/1,000).
Using the Laplacian correction for the three quantities, we pretend that we have 1 more instance for each income value.
The corrected probability estimates are:
P(income = low) = 1/1,003 ≈ 0.001
P(income = medium) = 991/1,003 ≈ 0.988
P(income = high) = 11/1,003 ≈ 0.011
The corrected estimates are close to the uncorrected ones, but the zero probability is avoided.
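A short Python sketch of the correction, using the counts from the example (the helper name is hypothetical):

    def laplace_corrected(counts):
        # Add 1 to each count; the denominator grows by the number
        # of distinct attribute values, so the estimates still sum to 1.
        total = sum(counts.values()) + len(counts)
        return {v: (n + 1) / total for v, n in counts.items()}

    print(laplace_corrected({"low": 0, "medium": 990, "high": 10}))
    # {'low': 0.000997..., 'medium': 0.988..., 'high': 0.01097...}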
Weather Problem
Outlook          P(·|yes)  P(·|no)      Temperature   P(·|yes)  P(·|no)
  sunny             2/9       3/5          hot             2/9      2/5
  overcast          4/9       0/5          mild            4/9      2/5
  rainy             3/9       2/5          cool            3/9      1/5

Humidity         P(·|yes)  P(·|no)      Windy         P(·|yes)  P(·|no)
  high              3/9       4/5          false           6/9      2/5
  normal            6/9       1/5          true            3/9      3/5

Play:  P(yes) = 9/14,  P(no) = 5/14
E.g.
P (outlook = sunny|play = yes) = 2/9
P(windy = true|play = no) = 3/5
A new day:
Outlook Temperature Humidity Windy Play
sunny cool high true ?
Bayes rule
For the new day X, Bayes' rule gives P[yes|X] = P[X|yes] × P[yes] / P[X], where the naive independence assumption factorizes P[X|yes] over the four attributes:
P[yes|X] ∝ P[Outlook = sunny|yes] × P[Temperature = cool|yes] × P[Humidity = high|yes] × P[Windy = true|yes] × P[yes]
= 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053
Likewise, P[no|X] ∝ 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206.
Normalizing so that the two posteriors sum to 1 (this divides out P[X]):
P[yes|X] ≈ 0.0053 / (0.0053 + 0.0206) ≈ 0.205
P[no|X] ≈ 0.0206 / (0.0053 + 0.0206) ≈ 0.795
so the prediction for the new day is play = no.
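The same computation as a small Python sketch (the probabilities are read off the table above):

    # Unnormalized scores for the new day X = (sunny, cool, high, true)
    score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
    score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

    evidence = score_yes + score_no                      # plays the role of P[X]
    print(score_yes / evidence)                          # ≈ 0.205 -> P[yes|X]
    print(score_no / evidence)                           # ≈ 0.795 -> P[no|X]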
Missing values
If the instance to be classified is missing an attribute value, that attribute is simply omitted from the product; the calculation uses only the attributes whose values are known.
Numeric attributes
Numeric attributes are handled as described earlier: within each class, their values are assumed to follow a Gaussian distribution, summarized by a per-class mean µ and standard deviation σ.
Outlook        yes  no      Windy        yes  no      Play   yes  no
  sunny          2    3       false        6    2              9    5
  overcast       4    0       true         3    3
  rainy          3    2

Temperature (yes): 83, 70, 68, 64, 69, 75, 75, 72, 81     µ = 73,   σ = 6.2
Temperature (no):  85, 80, 65, 72, 71                     µ = 74.6, σ = 7.9
Humidity (yes):    86, 96, 80, 65, 70, 80, 70, 90, 75     µ = 79.1, σ = 10.2
Humidity (no):     85, 90, 70, 95, 91                     µ = 86.2, σ = 9.7

Conditional probabilities for the categorical attributes:
  sunny 2/9 | 3/5,   overcast 4/9 | 0/5,   rainy 3/9 | 2/5
  false 6/9 | 2/5,   true 3/9 | 3/5
  P(play = yes) = 9/14,  P(play = no) = 5/14
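As a quick check, the per-class statistics can be reproduced with Python's statistics module (values taken from the table):

    from statistics import mean, stdev   # stdev uses the (n - 1) denominator

    temp_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
    print(mean(temp_yes), round(stdev(temp_yes), 1))   # prints: 73 6.2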
A new day:
[The new day's attribute values and the resulting likelihood computation were shown here as a figure.]
Missing values
During training, an instance with a missing value for a numeric attribute is simply not included in the calculation of that attribute's mean and standard deviation for its class.
Advantages
Easy to implement
Good results obtained in most cases
Disadvantages
The assumption of class-conditional independence causes a loss of accuracy
In practice, dependencies exist among variables
How to deal with these dependencies? Bayesian belief networks
References