Bayesian Classification
Sridhar Mahadevan
[email protected]
University of Massachusetts
Outline
Classification problem
Bayesian Decision Theory: Minimum risk formalization
Linear discriminant analysis (LDA)
Bayesian classification using multivariate normal distributions
Classification Problem
Given training examples (x_1, y_1), ..., (x_n, y_n), where each x_i is an input feature vector and each label y_i belongs to a finite set of classes {c_1, ..., c_k}, the goal is to learn a rule that assigns the correct class to a new input x.
Classification: Geometrical View
Figure: two classes of points separated by the hyperplane <w, x> + b = 0; the margin is the distance from the separating hyperplane to the closest training points.
Many Approaches
Parametric models:
Linear discriminant analysis (LDA)
Bayesian classifiers
Logistic regression
Nonparametric models:
Decision trees
k-nearest neighbor method
Support vector machines
Classification as Probabilistic Inference
Posterior = (Likelihood x Prior) / Evidence

P(c_i | X) = P(X | c_i) P(c_i) / P(X),  where the evidence is  P(X) = sum_{j=1}^{c} P(X | c_j) P(c_j)
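As a concrete illustration (a minimal sketch, not from the slides), the posterior can be computed exactly as above; the two-class setup, Gaussian likelihoods, and parameter values below are hypothetical.

import numpy as np
from scipy.stats import norm

# Hypothetical 1-D example: two classes with Gaussian class-conditional densities
priors = np.array([0.5, 0.5])                      # P(c_1), P(c_2)
means, stds = np.array([11.0, 13.0]), np.array([1.0, 1.0])

def posteriors(x):
    lik = norm.pdf(x, loc=means, scale=stds)       # likelihoods p(x | c_i)
    joint = lik * priors                           # p(x | c_i) P(c_i)
    return joint / joint.sum()                     # divide by the evidence P(x)

print(posteriors(11.8))                            # argmax gives the Bayes decision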
Class Conditional Densities
Figure: class-conditional densities p(x | c_i) for two classes, plotted over x (roughly 9 to 15); each curve shows how likely an observation x is under that class.
Posterior Densities
Figure: posterior probabilities P(c_i | x) for the same two classes over the same range of x; at each x the two posteriors sum to 1.
Minimum Risk Classification
Let λ_ij denote the loss incurred by deciding class i when the true class is j. The conditional risks of the two decisions are

R(α_1 | x) = λ_11 P(c_1 | x) + λ_12 P(c_2 | x)
R(α_2 | x) = λ_21 P(c_1 | x) + λ_22 P(c_2 | x)

Minimum risk rule: choose class 1 if R(α_1 | x) < R(α_2 | x), i.e. if

(λ_11 - λ_21) P(c_1 | x) < (λ_22 - λ_12) P(c_2 | x)

Using Bayes rule, we can reformulate this as

(λ_11 - λ_21) P(x | c_1) P(c_1) < (λ_22 - λ_12) P(x | c_2) P(c_2)
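A small sketch (not from the slides) of the minimum-risk rule written for a general loss matrix; the loss values and posteriors are hypothetical.

import numpy as np

# loss[i, j] = loss of deciding class i when the true class is j (hypothetical values)
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])
post = np.array([0.7, 0.3])         # posteriors P(c_j | x) for some observed x

risk = loss @ post                  # R(alpha_i | x) = sum_j loss[i, j] P(c_j | x)
decision = int(np.argmin(risk))     # choose the action with minimum conditional risk
print(risk, decision)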
Likelihood Ratio
Figure: the likelihood ratio p(x | c_1) / p(x | c_2) plotted against x; comparing it to the thresholds a and b partitions the input space into the decision regions R_1 and R_2 shown below the curve.
Discriminant Functions
A discriminant function is any function that enables successful classification: assign x to the class c_i whose discriminant value is largest.
For each class c_i, define the discriminant function as g_i(x).
Examples:
g_i(x) = P(c_i | x) (Bayesian posterior distribution)
g_i(x) = P(x | c_i) P(c_i) (unnormalized posterior)
g_i(x) = ln P(x | c_i) + ln P(c_i)
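A tiny check (not in the slides) that the three example discriminants yield the same decision, since they differ only by a monotone transformation and a factor that does not depend on the class; the numbers are made up.

import numpy as np

lik    = np.array([0.05, 0.20, 0.10])            # hypothetical p(x | c_i) at one x
priors = np.array([0.3, 0.3, 0.4])               # hypothetical P(c_i)

g_post   = lik * priors / np.sum(lik * priors)   # g_i = P(c_i | x)
g_unnorm = lik * priors                          # g_i = p(x | c_i) P(c_i)
g_log    = np.log(lik) + np.log(priors)          # g_i = ln p(x | c_i) + ln P(c_i)

print(np.argmax(g_post), np.argmax(g_unnorm), np.argmax(g_log))   # all agree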
Linear Discriminant Analysis
LDA finds a linear transformation of the input X that results in the maximum discrimination among classes.
Define Y = l^T X, where X is a p-dimensional column vector, l is a p-dimensional column vector of weights, and Y is a scalar.
Define μ_i = E(X | c_i) as the conditional mean of the input data from class c_i.
Define μ_{Y_i} = E(Y | c_i) as the conditional mean of the projected input data from class c_i.
Goal: find the l such that the distance between the means of the projected data is as large as possible, and its variance is as small as possible.
The mean of the projected data from class c_i is μ_{Y_i} = E(l^T X | c_i) = l^T μ_i, and its variance involves the covariance of X:

Var(Y) = Var(l^T X) = l^T Cov(X) l = l^T Σ l
LDA: Formalization
The optimization objective of LDA can now be formalized as maximizing the ratio

J(l) = (squared distance between projected means) / (variance of Y)
     = (μ_{Y_1} - μ_{Y_2})^2 / σ_Y^2
     = (l^T μ_1 - l^T μ_2)^2 / (l^T Σ l)
     = l^T (μ_1 - μ_2)(μ_1 - μ_2)^T l / (l^T Σ l)
LDA Solution
We can solve the optimization problem using Lagrange multipliers (setting the denominator to 1):

J(l, λ) = l^T (μ_1 - μ_2)(μ_1 - μ_2)^T l - λ (l^T Σ l - 1)

∂J/∂l = 2 (μ_1 - μ_2)(μ_1 - μ_2)^T l - 2 λ Σ l

Setting the partial derivative to 0, we get the generalized eigenvalue problem:

(μ_1 - μ_2)(μ_1 - μ_2)^T l = λ Σ l
LDA Solution
Notice that (μ_1 - μ_2)(μ_1 - μ_2)^T l = α (μ_1 - μ_2), for the scalar α = (μ_1 - μ_2)^T l, is a vector that lies in the direction μ_1 - μ_2.
With this insight, we can finally express Fisher's linear discriminant function as

l = Σ^{-1} (μ_1 - μ_2)

So, the projected data Y can be written as

Y = l^T X = (μ_1 - μ_2)^T Σ^{-1} X
In practice, the parameters are estimated from the training data: the class means m_j = (1/n_j) Σ_{x_i ∈ c_j} x_i and the pooled within-class covariance Σ̂ = Σ_j Σ_{x_i ∈ c_j} (x_i - m_j)(x_i - m_j)^T replace μ_j and Σ in the formulas above.
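A minimal numpy sketch (not from the slides) that estimates the class means and pooled covariance from data and forms the Fisher direction l = Σ^{-1}(μ_1 - μ_2); the toy data and variable names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))            # toy samples from class 1
X2 = rng.normal([2.0, 1.0], 1.0, size=(100, 2))            # toy samples from class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                  # class means
S = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)    # pooled covariance estimate

l = np.linalg.solve(S, m1 - m2)                            # Fisher direction l = S^{-1}(m1 - m2)
Y1, Y2 = X1 @ l, X2 @ l                                    # projected 1-D data for each class
print(l, Y1.mean(), Y2.mean())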
IRIS Dataset
Figure: pairwise scatterplots of the IRIS features.

Class means of the petal measurements (one row per species):

Petal.L.   Petal.W.
4.150000   1.2863636
1.484615   0.2346154
5.437037   2.0259259
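For comparison (a sketch assuming scikit-learn is available; not part of the original slides), LDA can be fit to the IRIS data directly, and the fitted per-class petal means correspond to the table above.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

lda = LinearDiscriminantAnalysis()
Z = lda.fit_transform(X, y)          # project onto the discriminant directions
print(lda.means_[:, 2:])             # per-class means of Petal.Length, Petal.Width
print(lda.score(X, y))               # training accuracy of the LDA classifier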
Discriminant Functions: Multivariate Gaussians
Multivariate Gaussian density:

p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) * exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) )
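A quick sketch (not from the slides) that evaluates this density directly and checks it against scipy's multivariate normal; the mean, covariance, and query point are made up.

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, -1.0])

d = len(mu)
diff = x - mu
norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
p_direct = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(p_direct, multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # the two agree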
Equal Diagonal
Covariances
1
i =
1
, |i |
2
= 2d
-2
p(x|i)
0.4
0.15
1
0
P(2)=.5
0.1
0.05
1
0.3
R2
0.2
-1
P(2)=.5
0.1
P(1)=.5
x
-2
R1
P(1)=.5
R2
P(2)=.5
R2
R1
-2
P(1)=.5 R1
-2
-2
-1
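To make the simplification explicit (a worked step under the Σ_i = σ^2 I assumption, not copied from the slides), the Gaussian discriminant reduces to a linear function of x:

g_i(x) = \ln p(x \mid c_i) + \ln P(c_i)
       = -\frac{\lVert x - \mu_i \rVert^2}{2\sigma^2} - \frac{d}{2}\ln(2\pi\sigma^2) + \ln P(c_i)
       = \frac{1}{\sigma^2}\mu_i^T x - \frac{1}{2\sigma^2}\mu_i^T\mu_i + \ln P(c_i) + \text{const}

The term -x^T x / (2σ^2) and the normalization constant are identical for every class and can be dropped, so the decision boundaries are hyperplanes.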
Equal Arbitrary Covariances
If all classes share the same (arbitrary) covariance Σ, the discriminant becomes

g_i(x) = -(1/2) (x - μ_i)^T Σ^{-1} (x - μ_i) + ln P(c_i)
       = μ_i^T Σ^{-1} x + w_{i0} + ln P(c_i),   where  w_{i0} = -(1/2) μ_i^T Σ^{-1} μ_i

(the quadratic term -(1/2) x^T Σ^{-1} x is common to all classes and has been dropped), so g_i(x) is again linear in x.

Figure: two Gaussian classes with equal, arbitrary covariances; the linear decision boundary between R_1 and R_2 shifts away from the more probable class as the priors change (panels with P(c_1) = P(c_2) = 0.5 and with P(c_1) = 0.1, P(c_2) = 0.9).
Arbitrary Covariances
In the general case, each class has its own covariance Σ_i and the discriminant is quadratic in x:

g_i(x) = -(1/2) x^T Σ_i^{-1} x + μ_i^T Σ_i^{-1} x + w_{i0},

where w_{i0} collects the terms that do not depend on x: w_{i0} = -(1/2) μ_i^T Σ_i^{-1} μ_i - (1/2) ln |Σ_i| + ln P(c_i).
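A compact sketch (not from the slides) of this general Gaussian classifier: estimate one mean, covariance, and prior per class and classify by the largest quadratic discriminant; the helper names and toy data are hypothetical.

import numpy as np

def fit_gaussian_classifier(X, y):
    # Estimate (mean, covariance, prior) for each class label
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def discriminant(x, mu, Sigma, prior):
    # g_i(x) = -1/2 (x - mu)^T Sigma^{-1} (x - mu) - 1/2 ln|Sigma| + ln P(c_i)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior))

def predict(x, params):
    return max(params, key=lambda c: discriminant(x, *params[c]))

# Toy 2-D data: two classes with different covariances
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], [1.0, 0.5], size=(50, 2)),
               rng.normal([3.0, 2.0], [0.5, 2.0], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(predict(np.array([2.5, 1.5]), fit_gaussian_classifier(X, y)))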