Lecture 4 - Bayesian Decision Theory
P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)} = \frac{p(x \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)}

Posterior = (Likelihood × Prior) / Evidence
(the evidence only normalizes the posteriors and is not important for the decision)
Bayes decision rule (for minimizing the probability of error):
Decide \omega_i if P(\omega_i \mid x) > P(\omega_j \mid x) for all j \ne i, where

P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}, \qquad j = 1, \dots, c
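A minimal numerical sketch of this rule (the likelihood and prior values below are made up for illustration); the class with the largest posterior is chosen:

import numpy as np

# Hypothetical likelihoods p(x | w_j) evaluated at one observation x,
# and priors P(w_j), for c = 3 classes (illustrative numbers only).
likelihoods = np.array([0.05, 0.20, 0.10])   # p(x | w_1), p(x | w_2), p(x | w_3)
priors      = np.array([0.50, 0.30, 0.20])   # P(w_1), P(w_2), P(w_3)

evidence   = np.sum(likelihoods * priors)     # p(x) = sum_j p(x | w_j) P(w_j)
posteriors = likelihoods * priors / evidence  # P(w_j | x) by Bayes formula

decision = np.argmax(posteriors)              # Bayes rule: pick the largest posterior
print(posteriors, "-> decide class", decision + 1)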
◦ Conditional risk (minimizing it for every x minimizes the overall risk R)

R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)

For the zero-one loss this becomes

R(\alpha_i \mid x) = \sum_{j \ne i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)
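A small sketch of the minimum-risk decision for an arbitrary loss matrix; the loss values and posteriors below are illustrative only:

import numpy as np

# Loss matrix lambda(alpha_i | w_j): rows = actions, columns = true classes.
# Illustrative values; the zero-one loss would be 1 - np.eye(2).
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])
posteriors = np.array([0.3, 0.7])   # P(w_1 | x), P(w_2 | x), assumed given

cond_risk = loss @ posteriors       # R(alpha_i | x) = sum_j lambda(i | j) P(w_j | x)
action = np.argmin(cond_risk)       # choose the action with minimum conditional risk
print(cond_risk, "-> take action", action + 1)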
Discriminant functions g_i(x)
Classifier with discriminant functions
◦ Assign x to class \omega_i if g_i(x) > g_j(x) for all j \ne i
◦ For minimum risk, take g_i(x) = -R(\alpha_i \mid x)
For minimum error rate classification
◦ Many kinds of discriminant functions are possible (any monotonically increasing function of the posterior gives the same decision; see the forms below), e.g.

g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)}
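Equivalent choices giving the same decision regions include, for example:

g_i(x) = P(\omega_i \mid x), \qquad g_i(x) = p(x \mid \omega_i)\, P(\omega_i), \qquad g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)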
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}\right], \qquad p(x) \sim N(\mu, \sigma^2)

\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx \quad \text{(mean)}

\sigma^2 = E[(x - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx \quad \text{(variance)}
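A minimal numerical sketch of this density (the parameter values μ = 1, σ = 2 are chosen arbitrarily):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Sanity check: the density integrates to ~1, has mean ~mu and variance ~sigma^2.
x = np.linspace(-10, 10, 20001)
p = gaussian_pdf(x, mu=1.0, sigma=2.0)
dx = x[1] - x[0]
print((p * dx).sum(), (x * p * dx).sum(), ((x - 1.0) ** 2 * p * dx).sum())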
◦ Components of the mean vector μ and covariance matrix Σ

\mu_i = E[x_i]

\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]
Central limit theorem
The sum of a large number of independent random variables tends toward a Gaussian distribution.
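An empirical illustration of this (a quick numerical check, not part of the lecture):

import numpy as np

# The standardized sum of many independent uniforms looks approximately Gaussian:
# mean ~0, standard deviation ~1, and about 0.3% of samples beyond 3 standard deviations.
rng = np.random.default_rng(0)
s = rng.uniform(size=(100000, 30)).sum(axis=1)   # sum of 30 uniforms per sample
z = (s - s.mean()) / s.std()
print(z.mean(), z.std(), (np.abs(z) > 3).mean())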
Linear combination of multivariate normal variables
◦ A is a d-by-k matrix.

Given p(x) \sim N(\mu, \Sigma), then p(y = A^{T} x) \sim N(A^{T}\mu,\; A^{T}\Sigma A)

◦ If k = 1, i.e. A = a (a projection onto a line in the direction of the vector a):

p(y = a^{T} x) \sim N(a^{T}\mu,\; a^{T}\Sigma a)
◦ Whitening transform: an eigenvector decomposition (EVD) of Σ is always possible,

\Sigma = \Phi \Lambda \Phi^{T}

where Φ is the orthogonal matrix of eigenvectors of Σ and Λ is the diagonal matrix of its eigenvalues; the whitening transform A_w = \Phi \Lambda^{-1/2} gives A_w^{T} \Sigma A_w = I.
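A small numpy sketch of the whitening transform; the covariance values are made up:

import numpy as np

# An illustrative covariance matrix Sigma (symmetric positive definite).
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])

eigvals, Phi = np.linalg.eigh(Sigma)     # Sigma = Phi diag(eigvals) Phi^T
A_w = Phi @ np.diag(eigvals ** -0.5)     # whitening transform A_w = Phi Lambda^{-1/2}

print(A_w.T @ Sigma @ A_w)               # ~ identity: the whitened covariance is I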
p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left[-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right]

Taking g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i) gives

g_i(x) = -\frac{1}{2}(x - \mu_i)^{T}\Sigma_i^{-1}(x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)
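A sketch of this general discriminant; the means, covariances and priors below are illustrative only:

import numpy as np

def discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    return (-0.5 * maha
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Illustrative two-class comparison at a single point x.
x = np.array([1.0, 2.0])
g1 = discriminant(x, np.array([0.0, 0.0]), np.eye(2), prior=0.6)
g2 = discriminant(x, np.array([3.0, 3.0]), 2 * np.eye(2), prior=0.4)
print("decide class", 1 if g1 > g2 else 2)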
Case 1: \Sigma_i = \sigma^2 I
◦ The features are statistically independent and have the same variance regardless of their class

|\Sigma_i| = \sigma^{2d}, \qquad \Sigma_i^{-1} = (1/\sigma^2)\, I

g_i(x) = \frac{1}{\sigma^2}\,\mu_i^{T} x + \left[-\frac{\mu_i^{T}\mu_i}{2\sigma^2} + \ln P(\omega_i)\right] = w_i^{T} x + w_{i0}
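A minimal sketch of the Case 1 linear discriminant with made-up means and priors:

import numpy as np

def linear_discriminant_case1(x, mu, sigma2, prior):
    """Case 1 (Sigma_i = sigma^2 I): g_i(x) = w_i^T x + w_i0."""
    w  = mu / sigma2                              # w_i = mu_i / sigma^2
    w0 = -mu @ mu / (2 * sigma2) + np.log(prior)  # w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(w_i)
    return w @ x + w0

x = np.array([0.5, 1.0])
g1 = linear_discriminant_case1(x, np.array([0.0, 0.0]), sigma2=1.0, prior=0.5)
g2 = linear_discriminant_case1(x, np.array([2.0, 2.0]), sigma2=1.0, prior=0.5)
print("decide class", 1 if g1 > g2 else 2)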
Case 2 – cont’d (\Sigma_i = \Sigma, a common covariance for all classes)
◦ Two-category problem: hyperplane decision boundary
◦ For equal prior probabilities => ignore \ln P(\omega_i):

g_i(x) = -\frac{1}{2}(x - \mu_i)^{T}\Sigma^{-1}(x - \mu_i)

(a minimum-distance classifier based on the squared Mahalanobis distance, for equal a priori probabilities)
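A sketch of the Mahalanobis minimum-distance rule for Case 2; the shared covariance and means are illustrative:

import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                                   # shared covariance (Case 2)
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]

x = np.array([1.5, 0.5])
# Equal priors: assign x to the class with the smallest squared Mahalanobis distance.
decision = int(np.argmin([mahalanobis_sq(x, m, Sigma) for m in means]))
print("decide class", decision + 1)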
Case 2 – example
The decision boundary is not orthogonal to the line between the class means.
For discrete features the class-conditional probabilities P(x \mid \omega_i) (not densities p(x \mid \omega_i)) are used:

g_i(x) = P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}, \qquad
P(x) = \sum_{j=1}^{c} P(x \mid \omega_j)\, P(\omega_j)
Binary-valued components (assumed conditionally independent given the class)

x = (x_1, \dots, x_d), \qquad x_i \in \{0, 1\}

◦ Class 1:

p_i = P(x_i = 1 \mid \omega_1), \qquad P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i}

◦ Class 2:

q_i = P(x_i = 1 \mid \omega_2), \qquad P(x \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1 - q_i)^{1 - x_i}
Binary trial – cont’d
◦ Likelihood ratio

\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} = \prod_{i=1}^{d} \left(\frac{p_i}{q_i}\right)^{x_i} \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - x_i}

◦ Linear discriminant function (log likelihood ratio plus log prior ratio)

g(x) = \sum_{i=1}^{d} \left[x_i \ln\frac{p_i}{q_i} + (1 - x_i)\ln\frac{1 - p_i}{1 - q_i}\right] + \ln\frac{P(\omega_1)}{P(\omega_2)}

= \sum_{i=1}^{d} x_i \ln\frac{p_i(1 - q_i)}{q_i(1 - p_i)} + \sum_{i=1}^{d}\ln\frac{1 - p_i}{1 - q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)} = w^{T}x + w_0

◦ If p_i = q_i for all i:

g(x) = \ln\frac{P(\omega_1)}{P(\omega_2)}

Remark: there is a large range of positions for the decision boundary in the discrete case.
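A sketch of this binary-feature discriminant; the p_i, q_i and priors below are illustrative:

import numpy as np

# Illustrative Bernoulli parameters for d = 3 binary features.
p = np.array([0.8, 0.6, 0.3])   # p_i = P(x_i = 1 | w_1)
q = np.array([0.4, 0.5, 0.7])   # q_i = P(x_i = 1 | w_2)
prior1, prior2 = 0.5, 0.5

w  = np.log(p * (1 - q) / (q * (1 - p)))                            # w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)    # w_0

x = np.array([1, 0, 1])
g = w @ x + w0
print("decide class", 1 if g > 0 else 2)                            # g(x) > 0 => decide w_1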
Bayesian belief nets (also called causal networks or belief nets)
◦ Node: represents a state (a variable) with associated probabilities; it may be discrete or continuous
◦ Parent: a node that influences another node
◦ Child: a node that is influenced by its parent(s)
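A minimal sketch of how such a net could be represented in code; the node names and probability values are made up:

# Each node stores its parents and a conditional probability table P(node = 1 | parents).
# Structure and numbers are purely illustrative.
belief_net = {
    "Rain":      {"parents": [],       "cpt": {(): 0.2}},                # P(Rain = 1)
    "Sprinkler": {"parents": ["Rain"], "cpt": {(0,): 0.4, (1,): 0.01}},  # P(Sprinkler = 1 | Rain)
    "WetGrass":  {"parents": ["Rain", "Sprinkler"],
                  "cpt": {(0, 0): 0.0, (0, 1): 0.9,
                          (1, 0): 0.8, (1, 1): 0.99}},                   # P(WetGrass = 1 | Rain, Sprinkler)
}
print(belief_net["Sprinkler"]["cpt"][(1,)])   # P(Sprinkler = 1 | Rain = 1)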