Lecture 5 Bayesian Classification

Bayesian classification is a statistical classification method that uses Bayes' theorem. It can be used to predict class membership probabilities. The naive Bayesian classifier assumes conditional independence between attributes. It performs comparably to decision trees and neural networks. Bayesian belief networks are graphical models that can represent dependencies between attributes using a directed acyclic graph and conditional probability tables. They allow modeling of conditional independence relationships.


Bayesian Classification

1
Bayesian Classification
 A statistical classifier
 Probabilistic prediction: predicts class membership probabilities
 Based on Bayes’ Theorem
 Naive Bayesian classifier
 Comparable performance with decision tree and selected neural network classifiers
 Accuracy and speed are good when applied to large databases
 Incremental: probability estimates can be updated as new training data arrives

2
Bayesian Classification

 Naïve Bayesian Classifier
 Class conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes
 Simplifies computations
 Bayesian Belief Networks
 Graphical models
 Represent dependencies among subsets of attributes

3
Bayes’ Theorem: Basics
 Let X be a data sample whose class label is unknown
 Let H be the hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
 P(H|X) is the posterior probability
 P(H): the prior probability, i.e., the initial probability of H
 P(X): the probability that the sample data is observed
 P(X|H): the probability of observing sample X, given that the hypothesis holds
 Example: X is a round, red fruit; H is the hypothesis that X is an apple

4
Bayes’ Theorem

 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

        P(H|X) = P(X|H) P(H) / P(X)
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
 Practical difficulty: requires initial knowledge of many probabilities and incurs significant computational cost
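
As a minimal numeric illustration of Bayes’ theorem (a sketch only; the three probabilities below are hypothetical values for the round-red-fruit/apple example, not taken from the lecture):

# Minimal sketch of Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X).
# The numbers are hypothetical, chosen only to illustrate the formula.

def posterior(p_x_given_h, p_h, p_x):
    """Return P(H|X) from the likelihood P(X|H), prior P(H), and evidence P(X)."""
    return p_x_given_h * p_h / p_x

# H: "the fruit is an apple", X: "the fruit is round and red" (assumed values)
p_h = 0.20          # prior: assume 20% of fruits are apples
p_x_given_h = 0.90  # assume 90% of apples are round and red
p_x = 0.30          # assume 30% of all fruits are round and red

print(posterior(p_x_given_h, p_h, p_x))  # 0.6 = P(apple | round and red)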

5
Naïve Bayesian Classifier
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum a posteriori probability, i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem

        P(Ci|X) = P(X|Ci) P(Ci) / P(X)

6
Naïve Bayesian Classifier
 Since P(X) is constant for all classes, only

        P(X|Ci) P(Ci)

needs to be maximized
 If all classes are assumed equally likely, it suffices to maximize P(X|Ci)
 A simplifying assumption: attributes are conditionally independent (i.e., no dependence relations between attributes):

        P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
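
For the data sample used in the later example, X = (age <= 30, income = medium, student = yes, credit_rating = fair), this product expands to:

        P(X|Ci) = P(age <= 30 | Ci) × P(income = medium | Ci) × P(student = yes | Ci) × P(credit_rating = fair | Ci)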

7
Derivation of Naïve Bayes Classifier

 This greatly reduces the computation cost: only the class distribution needs to be counted
 If Ak is categorical, P(xk|Ci) = sik / si, where sik is the number of tuples in Ci having value xk for Ak and si is the number of training tuples belonging to Ci
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

        P(xk|Ci) = g(xk, μCi, σCi), where

        g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x − μ)² / (2σ²))
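
A minimal sketch of the Gaussian density above as it would be used for a continuous attribute; the attribute name and the class statistics are hypothetical, not taken from the lecture’s dataset:

import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma), used to estimate P(xk|Ci) for a continuous attribute."""
    return (1.0 / (math.sqrt(2.0 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical example: within class Ci, attribute "age" has mean 38 and std dev 12
# (in practice these statistics are estimated from the training tuples of Ci).
print(gaussian(30.0, 38.0, 12.0))  # density value used as P(age = 30 | Ci)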

8
Example

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Training data:

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

9
Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
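
A minimal sketch that reproduces the hand computation above by counting directly over the 14 training tuples; the tuple encoding is my own, but the resulting values match the slide (0.028 for “yes” vs. 0.007 for “no”):

# Each tuple: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")  # the sample X to classify

for c in ("yes", "no"):
    tuples_c = [t for t in data if t[4] == c]
    prior = len(tuples_c) / len(data)            # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):                # P(X|Ci) = product of P(xk|Ci)
        likelihood *= sum(1 for t in tuples_c if t[k] == value) / len(tuples_c)
    print(c, round(likelihood * prior, 3))       # yes 0.028, no 0.007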

10
Avoiding the 0-Probability Problem
 Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
        P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci)
 Ex. Suppose a dataset with 1000 tuples, in which income = low (0), income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator)
 Adding 1 to each case:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their
“uncorrected” counterparts
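
A minimal sketch of the Laplacian correction on the income counts above: add 1 to each value’s count, then renormalize over the corrected total of 1000 + 3 tuples:

# Raw counts of income values among the 1000 tuples of the class in question.
counts = {"low": 0, "medium": 990, "high": 10}

# Laplacian correction: add 1 to each count, then divide by the corrected total.
total = sum(counts.values()) + len(counts)                    # 1000 + 3 = 1003
smoothed = {value: (c + 1) / total for value, c in counts.items()}

print(smoothed)  # low: 1/1003 ~ 0.001, medium: 991/1003 ~ 0.988, high: 11/1003 ~ 0.011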

11
Naïve Bayesian Classifier
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumes class conditional independence, which can cause a loss of accuracy
 In practice, dependencies exist among variables
 E.g., in hospital patient data: profile attributes (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by the Naïve Bayesian Classifier

12
Bayesian Belief Networks
 Models dependencies between variables
 Defined by two components
 Directed Acyclic Graph
 Conditional Probability Table (CPT) for each variable
 Bayesian belief network allows a subset of the
variables to be conditionally independent

13
Bayesian Belief Networks
 A graphical model of causal relationships
 Represents dependency among the variables
 Gives a specification of joint probability distribution

 Nodes: random variables
 Links: dependency
 Example (figure): nodes X, Y, Z, P, where X and Y are the parents of Z, and Y is the parent of P
 No dependency between Z and P
 Has no loops or cycles
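
Given that example structure (assuming X and Y themselves have no parents), the joint distribution factorizes over each node’s parents:

        P(X, Y, Z, P) = P(X) · P(Y) · P(Z | X, Y) · P(P | Y)

This is a special case of the general product formula given on the next slide.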

14
Bayesian Belief Network: An Example
 A belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea (figure); FamilyHistory and Smoker are the parents of LungCancer
 The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC       0.8       0.5        0.7        0.1
~LC      0.2       0.5        0.3        0.9

 The CPT shows the conditional probability for each possible combination of values of its parents
 Derivation of the probability of a particular combination of values of X from the CPT:

        P(x1, ..., xn) = ∏_{i=1}^{n} P(xi | Parents(Yi))
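
A minimal sketch of using the CPT above: it stores P(LungCancer | FamilyHistory, Smoker) exactly as given on the slide and looks up one factor of the joint-probability product. The CPTs of the other variables are not given in the lecture, so only the LungCancer factor is evaluated here:

# CPT for LungCancer, indexed by the parent combination (FamilyHistory, Smoker).
cpt_lung_cancer = {
    (True, True): 0.8,    # (FH, S)
    (True, False): 0.5,   # (FH, ~S)
    (False, True): 0.7,   # (~FH, S)
    (False, False): 0.1,  # (~FH, ~S)
}

def p_lung_cancer(lc, fh, s):
    """P(LungCancer = lc | FamilyHistory = fh, Smoker = s): one factor of the product formula."""
    p_true = cpt_lung_cancer[(fh, s)]
    return p_true if lc else 1.0 - p_true

print(p_lung_cancer(True, True, True))    # 0.8 = P(LC | FH, S)
print(p_lung_cancer(False, False, True))  # 0.3 = P(~LC | ~FH, S)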

15
Training Bayesian Networks
 Several scenarios:
 Given both the network structure and all variables observable:
learn only the CPTs
 Network structure known, some hidden variables: gradient
descent (greedy hill-climbing) method, analogous to neural
network learning
 Network structure unknown, all variables observable: search
through the model space to reconstruct network topology
 Unknown structure, all hidden variables: No good algorithms
known for this purpose

16
