Lecture 4 - Bayesian Decision Theory

The document discusses Bayesian Decision Theory, focusing on pattern classification, a priori probabilities, and Bayes' formula for decision-making. It covers concepts such as class conditional probability, Bayes risk, and discriminant functions for minimum-error-rate classification. Additionally, it explores various cases of classification with different covariance structures and the implications for decision boundaries.


Dr. Huynh Trung Hieu


• Bayesian Decision Theory
◦ Fundamental statistical approach to the problem of pattern classification
• State of nature
◦ Unpredictable; we do not know which class will appear next
• A priori probability
◦ The a priori probabilities sum to one: P(ω1) + P(ω2) = 1
◦ Decision rule
▪ Provided that only a priori information is given,
▪ Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
• With a lot of data we can build a histogram. Let us build one for "Antenna Length" for now…
• Class-conditional probability density function, p(x|ωi)
◦ The probability density function for x given that the state of nature is ωi
• Bayes formula
◦ Combines the prior probability and the class-conditional probability
◦ Choose the index i that maximizes P(ωi|x)

P(ωi|x) = p(x|ωi) P(ωi) / p(x) = p(x|ωi) P(ωi) / Σj p(x|ωj) P(ωj)

Posterior = (Likelihood × Prior) / Evidence
(the evidence is only a scale factor and does not affect the decision)
• Bayes decision rule (for minimizing the probability of error)
• When the value x is 14, choosing ω1 gives a low probability of error.
• So, decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
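A minimal Python sketch of this decision rule follows (it is not from the original slides). The Gaussian class-conditional densities for p(x|ω1) and p(x|ω2), the equal priors, and the test value x = 14 are illustrative assumptions chosen so that ω1 wins at x = 14.

import numpy as np

def gauss_pdf(x, mu, sigma):
    # univariate normal density N(mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = {"w1": 0.5, "w2": 0.5}                          # P(w1), P(w2) (assumed)
likelihoods = {"w1": lambda x: gauss_pdf(x, 17.0, 2.5),  # assumed p(x|w1)
               "w2": lambda x: gauss_pdf(x, 9.0, 2.5)}   # assumed p(x|w2)

def posteriors(x):
    # Bayes formula: P(w_i|x) = p(x|w_i) P(w_i) / p(x); the evidence only normalizes
    joint = {w: likelihoods[w](x) * priors[w] for w in priors}
    evidence = sum(joint.values())
    return {w: v / evidence for w, v in joint.items()}

x = 14.0
post = posteriors(x)
decision = max(post, key=post.get)    # decide the class with the larger posterior
print(post, "->", decision)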
• Generalization in four ways
◦ By allowing the use of more than one feature
▪ Replace the scalar x with a feature vector x
◦ By allowing more than two states of nature
◦ By allowing actions other than merely deciding the state of nature
▪ Such as rejection in close cases
◦ By introducing a loss function more general than the probability of error
▪ Introduction of a cost function
• Bayes risk (R: the minimum overall risk)
◦ A posteriori probability

P(ωj|x) = p(x|ωj) P(ωj) / p(x) = p(x|ωj) P(ωj) / Σj=1..c p(x|ωj) P(ωj)

◦ Conditional risk (minimizing it for every x minimizes the overall risk R)

R(αi|x) = Σj=1..c λ(αi|ωj) P(ωj|x)

◦ The action αi for which R(αi|x) is minimum is selected
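A short sketch of selecting the minimum-risk action (not from the slides); the 2×2 loss matrix and the posterior values are made-up illustrative numbers.

import numpy as np

# lam[i, j] = lambda(a_i | w_j): loss for taking action a_i when the true class is w_j
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
posteriors = np.array([0.3, 0.7])    # P(w1|x), P(w2|x) for some observed x (assumed)

risks = lam @ posteriors             # R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x)
best = int(np.argmin(risks))         # Bayes decision: pick the minimum-risk action
print(risks, "-> action", best + 1)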


• Two-category classification: λij = λ(αi|ωj)

R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)

◦ Decide ω1 if

(λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x)

• Two-category classification – cont'd
◦ Assumption: λ12 > λ22 and λ21 > λ11
◦ Decide ω1 if

(λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2)

◦ Likelihood ratio: decide ω1 if

p(x|ω1) / p(x|ω2) > (λ12 − λ22) P(ω2) / [(λ21 − λ11) P(ω1)]
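The likelihood-ratio test can be sketched as below (not from the slides); the Gaussian densities, the loss values λij, and the priors are illustrative assumptions.

import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0     # assumed loss matrix entries lambda_ij
P1, P2 = 0.6, 0.4                           # assumed priors P(w1), P(w2)

theta = (l12 - l22) * P2 / ((l21 - l11) * P1)    # decision threshold on the likelihood ratio

def decide(x):
    ratio = gauss_pdf(x, 5.0, 1.0) / gauss_pdf(x, 8.0, 1.5)   # p(x|w1) / p(x|w2), assumed models
    return "w1" if ratio > theta else "w2"

print(theta, decide(6.0), decide(7.5))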
• Likelihood ratio: example
• Zero-one loss: λ(αi|ωj) = 1 − δij (0 for a correct decision, 1 otherwise)
• Conditional risk
◦ Minimizing the average probability of error is equivalent to maximizing the a posteriori probability:

R(αi|x) = Σj=1..c λ(αi|ωj) P(ωj|x) = Σj≠i P(ωj|x) = 1 − P(ωi|x)
• Discriminant functions gi(x)
• Classifier with discriminant functions
◦ Assign x to class ωi if

gi(x) > gj(x) for all j ≠ i

◦ In the Bayes classifier,

gi(x) = −R(αi|x)

• For minimum-error-rate classification
◦ Many kinds of discriminant functions are possible:

gi(x) = P(ωi|x) = p(x|ωi) P(ωi) / Σj=1..c p(x|ωj) P(ωj)
gi(x) = p(x|ωi) P(ωi)
gi(x) = ln p(x|ωi) + ln P(ωi)

◦ Two-category case: decide ω1 if

g(x) = g1(x) − g2(x) > 0, i.e. ln[p(x|ω1)/p(x|ω2)] + ln[P(ω1)/P(ω2)] > 0
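A small sketch (not from the slides) of a minimum-error-rate classifier built from the log-form discriminant gi(x) = ln p(x|ωi) + ln P(ωi); the three univariate Gaussian class models and their priors are assumed for illustration.

import numpy as np

def log_gauss(x, mu, sigma):
    # ln p(x|w_i) for an assumed univariate Gaussian class model
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)

classes = [            # (mu_i, sigma_i, P(w_i)) -- illustrative values
    (2.0, 1.0, 0.3),
    (5.0, 1.5, 0.5),
    (9.0, 1.0, 0.2),
]

def classify(x):
    g = [log_gauss(x, mu, s) + np.log(prior) for mu, s, prior in classes]
    return int(np.argmax(g))      # index i maximizing g_i(x)

print([classify(x) for x in (1.0, 4.5, 8.0)])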
• Reminder:
◦ Expectation of a scalar function f(x)
▪ Continuous: E[f(x)] = ∫ f(x) p(x) dx
▪ Discrete: E[f(x)] = Σx∈D f(x) P(x)
◦ Univariate density

p(x) = (1 / (√(2π) σ)) exp[−(1/2) ((x − μ)/σ)²],  p(x) ~ N(μ, σ²)

μ = E[x] = ∫ x p(x) dx
σ² = E[(x − μ)²] = ∫ (x − μ)² p(x) dx  (variance)

◦ Entropy (continuous distribution)

H[p(x)] = −∫ p(x) ln p(x) dx

◦ Multivariate density (x is a d-dimensional feature vector)

p(x) ~ N(μ, Σ)
p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ)]

μ = E[x] = ∫ x p(x) dx
Σ = E[(x − μ)(x − μ)ᵀ] = ∫ (x − μ)(x − μ)ᵀ p(x) dx  (covariance matrix)

◦ Components of Σ
μi = E[xi]
σij = E[(xi − μi)(xj − μj)]
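The sketch below (not from the slides) evaluates the multivariate normal density and checks the definitions of μ and Σ with sample estimates; the parameter values are illustrative.

import numpy as np

def mvn_pdf(x, mu, Sigma):
    # N(mu, Sigma) density: exp[-0.5 (x-mu)^T Sigma^-1 (x-mu)] / ((2 pi)^(d/2) |Sigma|^(1/2))
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm_const

mu = np.array([1.0, 2.0])                 # illustrative mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # illustrative covariance matrix
print(mvn_pdf(np.array([1.5, 1.5]), mu, Sigma))

# Sample estimates of the mean vector and covariance matrix
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=5000)
print(X.mean(axis=0))                     # approximates mu = E[x]
print(np.cov(X, rowvar=False))            # approximates Sigma = E[(x-mu)(x-mu)^T]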
• Central limit theorem
▪ The sum of a large number of independent random variables tends toward a Gaussian distribution
• Linear combination of a multivariate density
◦ A is a d-by-k matrix:

Given p(x) ~ N(μ, Σ)  ⇒  p(y = Aᵀx) ~ N(Aᵀμ, AᵀΣA)

◦ If k = 1 and A = a (a projection onto a line in the direction of vector a):

p(y = aᵀx) ~ N(aᵀμ, aᵀΣa)

◦ Whitening transform: an eigenvector decomposition (EVD) is always possible

Σ = ΦΛΦᵀ,  Φ: orthogonal matrix of eigenvectors,  Λ: diagonal matrix of eigenvalues
Aw = ΦΛ^(−1/2)
p(y = Awᵀx) ~ N(Awᵀμ, AwᵀΣAw) = N(Awᵀμ, Λ^(−1/2) Φᵀ Σ Φ Λ^(−1/2)) = N(Awᵀμ, I)
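A sketch of the whitening transform using numpy's eigendecomposition (not from the slides); the covariance matrix Σ below is an arbitrary example.

import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])               # illustrative covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)         # Sigma = Phi diag(eigvals) Phi^T
A_w = Phi @ np.diag(eigvals ** -0.5)         # whitening matrix A_w = Phi Lambda^(-1/2)

print(A_w.T @ Sigma @ A_w)                   # ~ identity matrix

# Applying it to samples x ~ N(0, Sigma) gives y = A_w^T x with covariance ~ I
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=10000)
Y = X @ A_w                                  # each row is y^T = (A_w^T x)^T
print(np.cov(Y, rowvar=False))               # ~ identity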


• An example of a transformed multivariate Gaussian distribution
• Discriminant function (d.f.) for minimum error rate

gi(x) = ln p(x|ωi) + ln P(ωi)

◦ For a Gaussian class-conditional density, the d.f. is easily evaluated:

p(x|ωi) = (1 / ((2π)^(d/2) |Σi|^(1/2))) exp[−(1/2) (x − μi)ᵀ Σi⁻¹ (x − μi)]

⇒ gi(x) = −(1/2) (x − μi)ᵀ Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
• Case 1: Σi = σ²I
◦ The features are statistically independent and have the same variance regardless of their class

Σi = σ²I,  |Σi| = σ^(2d),  Σi⁻¹ = (1/σ²) I

◦ This yields a linear discriminant function:

gi(x) = −||x − μi||² / (2σ²) + ln P(ωi) = −(xᵀx − 2μiᵀx + μiᵀμi) / (2σ²) + ln P(ωi)
      = (1/σ²) μiᵀx + [−μiᵀμi / (2σ²) + ln P(ωi)] = wiᵀx + wi0

(the quadratic term xᵀx is the same for every class and can be dropped)

◦ Two-category problem: hyperplane decision boundary

g1(x) = w1ᵀx + w10,  g2(x) = w2ᵀx + w20
g(x) = g1(x) − g2(x) = (w1 − w2)ᵀx + (w10 − w20) = wᵀx + w0 = 0
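A sketch of the Case-1 linear discriminant with weights wi = μi/σ² and wi0 = −μiᵀμi/(2σ²) + ln P(ωi) (not from the slides); the means, the shared σ², and the priors are illustrative assumptions.

import numpy as np

sigma2 = 1.5                                            # assumed shared variance sigma^2
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # assumed class means mu_1, mu_2
priors = [0.6, 0.4]                                     # assumed P(w1), P(w2)

def g(i, x):
    # g_i(x) = w_i^T x + w_i0
    w_i = means[i] / sigma2
    w_i0 = -means[i] @ means[i] / (2 * sigma2) + np.log(priors[i])
    return w_i @ x + w_i0

x = np.array([1.0, 2.0])
print("decide w1" if g(0, x) > g(1, x) else "decide w2")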
• Case 1 – cont'd
◦ For equal prior probabilities, the term ln P(ωi) can be ignored:

gi(x) = −||x − μi||² / (2σ²) + ln P(ωi)  →  −||x − μi||² / (2σ²)  ∝  −||x − μi||²

(Minimum-distance classifier: assign x to the category of the nearest mean)

• Case 1 – cont'd
◦ For various prior probabilities
• Case 2: Σi = Σ
◦ The features may be statistically dependent, but the covariance matrix is the same for all classes
◦ This again yields a linear discriminant function:

gi(x) = −(1/2) (x − μi)ᵀ Σ⁻¹ (x − μi) + ln P(ωi)
      = −(xᵀΣ⁻¹x − 2μiᵀΣ⁻¹x + μiᵀΣ⁻¹μi) / 2 + ln P(ωi)
      = μiᵀΣ⁻¹x − μiᵀΣ⁻¹μi / 2 + ln P(ωi) = wiᵀx + wi0

(the term xᵀΣ⁻¹x is the same for every class and can be dropped)

• Case 2 – cont'd
◦ Two-category problem: hyperplane decision boundary
◦ For equal prior probabilities, ignore ln P(ωi):

gi(x) = −(1/2) (x − μi)ᵀ Σ⁻¹ (x − μi)

(Squared-Mahalanobis-distance classifier: for equal a priori probabilities, assign x to the class whose mean is nearest in Mahalanobis distance)
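A sketch of the Case-2 discriminant with a shared covariance matrix (for equal priors it reduces to the squared-Mahalanobis-distance rule); the means and Σ below are illustrative assumptions, not values from the slides.

import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                          # assumed shared covariance
Sigma_inv = np.linalg.inv(Sigma)
means = [np.array([0.0, 0.0]), np.array([2.5, 1.0])]    # assumed class means
priors = [0.5, 0.5]                                     # assumed equal priors

def g(i, x):
    # g_i(x) = -0.5 (x - mu_i)^T Sigma^-1 (x - mu_i) + ln P(w_i)
    diff = x - means[i]
    return -0.5 * diff @ Sigma_inv @ diff + np.log(priors[i])

x = np.array([1.0, 0.8])
print(np.argmax([g(0, x), g(1, x)]))    # index of the class with the larger g_i(x)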
• Case 2 – example
• The boundary is not orthogonal to the line between the class means (see Eq. 64)


• Case 3: Σi = arbitrary
◦ The covariance matrices are different for each category
◦ Quadratic discriminant function:

gi(x) = −(1/2) (x − μi)ᵀ Σi⁻¹ (x − μi) − (1/2) ln |Σi| + ln P(ωi)
      = −(xᵀΣi⁻¹x − 2μiᵀΣi⁻¹x + μiᵀΣi⁻¹μi) / 2 − (1/2) ln |Σi| + ln P(ωi)
      = xᵀ(−Σi⁻¹/2)x + μiᵀΣi⁻¹x − μiᵀΣi⁻¹μi / 2 − (1/2) ln |Σi| + ln P(ωi)
      = xᵀWi x + wiᵀx + wi0

◦ Two-category problem: hyperquadric decision boundary (hyperplane, pairs of hyperplanes, hyperspheres, hyperparaboloids, ...)
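A sketch of the Case-3 quadratic discriminant with per-class Wi = −Σi⁻¹/2, wi = Σi⁻¹μi, and wi0 = −μiᵀΣi⁻¹μi/2 − ln|Σi|/2 + ln P(ωi) (not from the slides); the per-class means, covariances, and priors are illustrative assumptions.

import numpy as np

params = [  # (mu_i, Sigma_i, P(w_i)) -- illustrative values
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 0.5]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.8], [0.8, 2.0]]), 0.5),
]

def g(i, x):
    # quadratic discriminant g_i(x) = x^T W_i x + w_i^T x + w_i0
    mu, Sigma, prior = params[i]
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    w = Sinv @ mu
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 1.5])
print(np.argmax([g(0, x), g(1, x)]))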
• Case 3 – 2D example
• Case 3 – boundaries for four categories
◦ Rather complex boundary
• A posteriori probability for discrete features
◦ The class-conditional probabilities are now probability masses P(x|ωi), not densities p(x|ωi):

gi(x) = P(ωi|x) = P(x|ωi) P(ωi) / P(x),  P(x) = Σj=1..c P(x|ωj) P(ωj)

• Binary-valued components

x = (x1, ..., xd),  xi ∈ {0, 1}

◦ Class 1: pi = P(xi = 1 | ω1),  P(x|ω1) = Πi=1..d pi^xi (1 − pi)^(1−xi)
◦ Class 2: qi = P(xi = 1 | ω2),  P(x|ω2) = Πi=1..d qi^xi (1 − qi)^(1−xi)
• Binary trial – cont'd
◦ Likelihood ratio

P(x|ω1) / P(x|ω2) = Πi=1..d (pi/qi)^xi ((1 − pi)/(1 − qi))^(1−xi)

◦ Linear discriminant function

g(x) = Σi=1..d [xi ln(pi/qi) + (1 − xi) ln((1 − pi)/(1 − qi))] + ln(P(ω1)/P(ω2))
     = Σi=1..d xi ln[pi(1 − qi) / (qi(1 − pi))] + Σi=1..d ln[(1 − pi)/(1 − qi)] + ln(P(ω1)/P(ω2))
     = wᵀx + w0

◦ If pi = qi for every i,

g(x) = ln(P(ω1)/P(ω2))

Remark: there is a large range of positions for the decision boundary in the discrete case.
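A sketch of this binary-feature linear discriminant with wi = ln[pi(1 − qi)/(qi(1 − pi))] (not from the slides); the Bernoulli parameters pi, qi and the priors are made-up values for illustration.

import numpy as np

p = np.array([0.8, 0.6, 0.3])     # assumed P(x_i = 1 | w1)
q = np.array([0.2, 0.5, 0.7])     # assumed P(x_i = 1 | w2)
P1, P2 = 0.5, 0.5                 # assumed priors

w = np.log(p * (1 - q) / (q * (1 - p)))                  # weight vector w
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)  # bias term w0

def decide(x):
    # decide w1 when g(x) = w^T x + w0 > 0
    return "w1" if w @ x + w0 > 0 else "w2"

print(decide(np.array([1, 1, 0])), decide(np.array([0, 0, 1])))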
• Bayesian belief nets (also called causal networks or belief nets)
◦ Node: a state variable with associated probabilities
▪ Discrete or continuous
◦ Parent: a node that influences another node
◦ Child: a node that is influenced
◦ See Example 4 and pp. 60-61.


APPENDIX
Steps to construct a Naïve Bayes classifier
• Handle data
• Summarize data
• Make predictions
• Evaluate accuracy
• Handle Data
◦ Load the data file. The data is in CSV format without a header line or any quotes
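A sketch of this first step (not from the slides), assuming a hypothetical headerless file named data.csv whose values, including the class label in the last column, are numeric, and an arbitrary 67/33 train/test split.

import csv
import random

def load_csv(filename):
    # read a headerless, unquoted CSV; each row becomes a list of floats
    with open(filename, newline="") as f:
        return [[float(v) for v in row] for row in csv.reader(f) if row]

def split_dataset(rows, train_ratio=0.67, seed=0):
    # shuffle and split into training and test sets
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

train, test = split_dataset(load_csv("data.csv"))   # "data.csv" is a placeholder name
print(len(train), len(test))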
