Lecture 5 Bayesian Classification
Bayesian Classification
A statistical classifier
Probabilistic prediction: predicts class membership probabilities
Based on Bayes’ Theorem
Naive Bayesian classifier
Comparable performance with decision tree and selected neural network classifiers
Good accuracy and speed when applied to large databases
Incremental: each new training example can update the estimated probabilities without rebuilding the model from scratch
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X
P(H|X) is the posterior probability of H conditioned on X
Bayesian Theorem
For a data sample X and hypothesis H, Bayes’ theorem gives
P(H|X) = P(X|H) P(H) / P(X)
where P(H) is the prior probability of H, P(X|H) is the likelihood of X given H, and P(X) is the probability of X
Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum a posteriori class, i.e., the class Ci with maximal P(Ci|X)
This can be derived from Bayes’ theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naïve Bayesian Classifier
Since P(X) is constant for all classes, only
P(X|Ci) P(Ci)
needs to be maximized
If the class prior probabilities are not known, the classes are often assumed equally likely, and only P(X|Ci) is maximized
A simplified assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
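To make the counting concrete, here is a minimal Python sketch of a naive Bayesian classifier for categorical attributes, assuming each tuple is given as a list of attribute values; the names train and classify and the data layout are illustrative assumptions, not from the lecture.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Estimate P(Ci) and the counts needed for P(xk | Ci) from training tuples."""
    class_counts = Counter(labels)                 # number of tuples per class Ci
    cond_counts = defaultdict(Counter)             # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond_counts[(c, k)][value] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def classify(x, priors, cond_counts, class_counts):
    """Return the class Ci maximizing P(Ci) * prod_k P(xk | Ci)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            # relative-frequency estimate of P(xk | Ci); an unseen value gives 0
            score *= cond_counts[(c, k)][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```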
Derivation of Naïve Bayes Classifier
If attribute Ak is categorical, P(xk|Ci) is the number of tuples of class Ci having value xk for Ak, divided by the number of tuples in Ci
If Ak is continuous-valued, P(xk|Ci) is usually estimated with a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)),  and P(xk|Ci) = g(xk, μCi, σCi)
Example
Classes:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data sample to classify:
X = (age = ‘<=30’, income = ‘medium’, student = ‘yes’, credit_rating = ‘fair’)
Training data:
age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
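As a quick check, the arithmetic above can be reproduced in a few lines of Python; the probabilities are copied from this slide rather than re-estimated from the table.

```python
# Reproduce the slide's computation for X = (age<=30, income=medium, student=yes, credit_rating=fair)
p_yes, p_no = 9 / 14, 5 / 14                 # class priors P(Ci)

px_yes = (2/9) * (4/9) * (6/9) * (6/9)       # P(X | buys_computer = "yes") ≈ 0.044
px_no  = (3/5) * (2/5) * (1/5) * (2/5)       # P(X | buys_computer = "no")  ≈ 0.019

print(px_yes * p_yes, px_no * p_no)          # ≈ 0.028 vs ≈ 0.007, so predict "yes"
```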
Avoiding the 0-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the overall predicted probability becomes zero:
P(X|Ci) = ∏(k=1..n) P(xk|Ci)
Ex. Suppose a dataset with 1000 tuples of class Ci has income = low (0 tuples), income = medium (990 tuples), and income = high (10 tuples)
Use the Laplacian correction (or Laplace estimator): add 1 to each count
The corrected estimates become P(income = low | Ci) = 1/1003, P(income = medium | Ci) = 991/1003, and P(income = high | Ci) = 11/1003, so no estimate is zero
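A small sketch of the correction for the income example above (the counts are the ones given in the example; the dictionary layout is only for illustration):

```python
counts = {"low": 0, "medium": 990, "high": 10}    # value counts within the class (1000 tuples)
k = len(counts)                                   # number of distinct attribute values
smoothed = {v: (c + 1) / (1000 + k) for v, c in counts.items()}
print(smoothed)   # low: 1/1003, medium: 991/1003, high: 11/1003 -- no estimate is zero
```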
Naïve Bayesian Classifier
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables
E.g., hospital patient data: profile attributes (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
Bayesian Belief Networks
Models dependencies between variables
Defined by two components:
Directed Acyclic Graph
Conditional Probability Table (CPT) for each variable
Bayesian belief network allows a subset of the
variables to be conditionally independent
Bayesian Belief Networks
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of the joint probability distribution: P(x1, …, xn) = ∏(i=1..n) P(xi | Parents(xi))
Bayesian Belief Network: An Example
Family The conditional probability table
History
Smoker (CPT) for variable LungCancer:
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)
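A sketch of how the CPT is used to evaluate the joint distribution for this network; the numeric CPT entries below are invented for illustration, since the slide does not list them.

```python
# Hypothetical CPT values -- not from the lecture, chosen only to illustrate the computation.
p_fh = 0.1                                        # P(FamilyHistory)
p_s  = 0.3                                        # P(Smoker)
p_lc = {(True, True): 0.8, (True, False): 0.5,    # P(LungCancer | FH, S), one entry
        (False, True): 0.7, (False, False): 0.1}  # per parent configuration

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), since FH and S have no parents."""
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    return p * (p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)])

print(joint(True, True, True))   # 0.1 * 0.3 * 0.8 = 0.024
```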
Training Bayesian Networks
Several scenarios:
Given both the network structure and all variables observable:
learn only the CPTs
Network structure known, some hidden variables: gradient
descent (greedy hill-climbing) method, analogous to neural
network learning
Network structure unknown, all variables observable: search
through the model space to reconstruct network topology
Unknown structure, all hidden variables: No good algorithms
known for this purpose
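For the first scenario (known structure, all variables observable), each CPT entry can be estimated by simple counting; a minimal sketch with made-up observations for the FH/S/LC network follows.

```python
from collections import Counter

# Toy fully-observed data; the rows are invented for illustration only.
data = [
    {"FH": True,  "S": True,  "LC": True},
    {"FH": True,  "S": True,  "LC": False},
    {"FH": False, "S": True,  "LC": True},
    {"FH": False, "S": False, "LC": False},
]

def estimate_cpt(data, child, parents):
    """Estimate P(child | parents) as count(parent config, child value) / count(parent config)."""
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    joint_counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    return {(cfg, val): n / parent_counts[cfg] for (cfg, val), n in joint_counts.items()}

print(estimate_cpt(data, "LC", ["FH", "S"]))
```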