Two-Class Discriminant Analysis
On many occasions, we will be faced with the problem of determining to
which of two populations a given observation belongs. For example, we might be
interested in determining whether a patient has a certain disease given the results
of a blood test, or whether a bank customer will repay a loan given the information
available about him/her.
In what follows, we discuss two-class discriminant analysis. We will focus our
attention mainly on linear discriminants in which the separation between classes
is given by hyperplanes. When we know the distribution of the classes, their
prior probabilities and the cost of misclassifying an observation, we can build the
Bayesian discriminant.
Bayes Discriminant Analysis
Let X = [X1 , . . . , Xp ]T be a p-dimensional observation belonging to population
c1 or to population c2 . Our objective will be to determine a decision function
d : Rp → {c1 , c2 }, defined by
$$d(X) = \begin{cases} c_1 & \text{if } X \in R_1, \\ c_2 & \text{if } X \in R_2 = R_1^c. \end{cases}$$
To do this, we assume that we know the density functions f1 (X) and f2 (X)
of these two populations, and their prior probabilities p1 and p2 , respectively
(p1 + p2 = 1).
Knowing this information, we can use Bayes’ theorem to calculate the posterior
probabilities. These are given by:
$$k_1(X) = P(c_1 \mid X) = \frac{f_1(X)\,p_1}{f_1(X)\,p_1 + f_2(X)\,p_2},$$
$$k_2(X) = P(c_2 \mid X) = \frac{f_2(X)\,p_2}{f_1(X)\,p_1 + f_2(X)\,p_2}.$$
Using the previous equations, we assign X to class c1 if k1(X) ≥ k2(X).
Example. A veterinary clinic has recently treated a Persian cat and they forgot
to record its sex. They know that the weight of females follows a Normal distribution with mean 4.1 kg and standard deviation 3.8 kg, while the weight of males follows a Normal distribution with mean 5 kg and standard deviation 4 kg. If one out of every three cats they care for is female and the weight of this cat is 4.7 kg, is it more likely that the cat is male or female?
Given that one out of every three cats they care for is female, the prior
probabilities for female and male are p1 = 1/3 and p2 = 2/3, respectively.
The weight of the female cats follows a Normal distribution with mean 4.1 kg and standard deviation 3.8 kg. Therefore
$$f_1(X) = \frac{1}{3.8\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X - 4.1}{3.8}\right)^2}.$$
Similarly, the weight of the male cats follows a Normal distribution with mean 5 kg and standard deviation 4 kg. Hence
$$f_2(X) = \frac{1}{4\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X - 5}{4}\right)^2}.$$
The weight of this cat is 4.7 kg. Therefore
$$k_1(4.7) = \frac{\frac{1}{3}\, f_1(4.7)}{\frac{1}{3}\, f_1(4.7) + \frac{2}{3}\, f_2(4.7)} \approx 0.343$$
and
$$k_2(4.7) = \frac{\frac{2}{3}\, f_2(4.7)}{\frac{1}{3}\, f_1(4.7) + \frac{2}{3}\, f_2(4.7)} \approx 0.657$$
Since k1(4.7) < k2(4.7), it is more likely that the cat is male.
The source code to obtain the previous result is
# Density of the female weight distribution: Normal with mean 4.1 and sd 3.8
f1 = function(x) {
  1 / (3.8 * sqrt(2 * pi)) * exp(-1 / 2 * ((x - 4.1) / 3.8)^2)
}

# Density of the male weight distribution: Normal with mean 5 and sd 4
f2 = function(x) {
  1 / (4 * sqrt(2 * pi)) * exp(-1 / 2 * ((x - 5) / 4)^2)
}

# Posterior probabilities for a cat weighing 4.7 kg
k1 = 1 / 3 * f1(4.7) / (1 / 3 * f1(4.7) + 2 / 3 * f2(4.7))
k2 = 2 / 3 * f2(4.7) / (1 / 3 * f1(4.7) + 2 / 3 * f2(4.7))
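The densities f1 and f2 above can equivalently be evaluated with R's built-in dnorm function; as a quick check (the name k1_dnorm is only illustrative):

# Posterior for the female class using dnorm(x, mean, sd) from base R
k1_dnorm = 1 / 3 * dnorm(4.7, 4.1, 3.8) /
  (1 / 3 * dnorm(4.7, 4.1, 3.8) + 2 / 3 * dnorm(4.7, 5, 4))   # approximately 0.343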
Sometimes the cost of misclassifying an observation may depend on its true
class. For example, it is more costly to classify a sick patient as healthy and not
provide adequate treatment than to classify a healthy one as sick and have him/her
receive a treatment that is harmless to him/her. Therefore, in certain problems it
is necessary to include a cost matrix such as the following:
                  Choice
               c1          c2
Truth   c1     0           L(1, 2)
        c2     L(2, 1)     0
where there is no loss if the observation is correctly classified.
In these situations, the expected error of classifying X as c1 when the true class
is c2 is
$$\epsilon_1 = k_2(X)\,L(2,1)$$
and the expected error of classifying it as c2 when the true class is c1 is
$$\epsilon_2 = k_1(X)\,L(1,2).$$
In this scenario, X ∈ R1 if the expected error from wrongly classifying X as c1 is no greater than the expected error from wrongly classifying it as c2:
\begin{align*}
X \in R_1 &\iff \epsilon_1 \le \epsilon_2 \\
&\iff k_2(X)\,L(2,1) \le k_1(X)\,L(1,2) \\
&\iff f_2(X)\,p_2\,L(2,1) \le f_1(X)\,p_1\,L(1,2) \\
&\iff \frac{f_1(X)}{f_2(X)} \ge \frac{p_2\,L(2,1)}{p_1\,L(1,2)}
\end{align*}
Thus, we have proven the following theorem.
Theorem. The Bayes solution to the classification problem is given by the region
$$R_1 = \left\{ X \;:\; \frac{f_1(X)}{f_2(X)} \ge \frac{p_2\,L(2,1)}{p_1\,L(1,2)} \right\}.$$
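As a rough illustration of how the theorem can be applied in R, the following sketch classifies the 4.7 kg cat from the example when misclassification costs are taken into account; the cost values L21 and L12 are purely illustrative assumptions, not taken from the text:

# Bayes rule with costs: assign to c1 when the likelihood ratio exceeds the threshold
L21 = 5                                   # assumed cost L(2,1): choosing c1 when the truth is c2
L12 = 1                                   # assumed cost L(1,2): choosing c2 when the truth is c1
p1 = 1 / 3; p2 = 2 / 3                    # priors from the cat example
ratio = dnorm(4.7, 4.1, 3.8) / dnorm(4.7, 5, 4)
threshold = (p2 * L21) / (p1 * L12)
decision = if (ratio >= threshold) "c1" else "c2"   # here c1 = female, c2 = male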
Next, we analyze the Bayesian solution in the particular case that the densities
are Normal.
Discrimination between two Normal populations with the same
variance-covariance matrix
If f1 and f2 are Normal densities with the same variance-covariance matrix Σ, that is,
$$f_1(X) = \frac{1}{\sqrt{(2\pi)^p\,|\Sigma|}}\; e^{-\frac{1}{2}(X-\mu_1)^T \Sigma^{-1} (X-\mu_1)},$$
$$f_2(X) = \frac{1}{\sqrt{(2\pi)^p\,|\Sigma|}}\; e^{-\frac{1}{2}(X-\mu_2)^T \Sigma^{-1} (X-\mu_2)}.$$
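For reference, a direct implementation of this p-dimensional density in R could look as follows (a sketch; the function name dmvnormal is our own, not a base R function):

# p-dimensional Normal density with mean vector mu and covariance matrix Sigma
dmvnormal = function(x, mu, Sigma) {
  p = length(mu)
  as.numeric(1 / sqrt((2 * pi)^p * det(Sigma)) *
    exp(-1 / 2 * t(x - mu) %*% solve(Sigma) %*% (x - mu)))
}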
The classification region is given by
\begin{align*}
X \in R_1 &\iff \frac{f_1(X)}{f_2(X)} \ge c, \quad \text{where } c = \frac{p_2\,L(2,1)}{p_1\,L(1,2)} \\
&\iff e^{-\frac{1}{2}(X-\mu_1)^T\Sigma^{-1}(X-\mu_1) + \frac{1}{2}(X-\mu_2)^T\Sigma^{-1}(X-\mu_2)} \ge c \\
&\iff -\frac{1}{2}(X-\mu_1)^T\Sigma^{-1}(X-\mu_1) + \frac{1}{2}(X-\mu_2)^T\Sigma^{-1}(X-\mu_2) \ge \ln(c) \\
&\iff -(X-\mu_1)^T\Sigma^{-1}(X-\mu_1) + (X-\mu_2)^T\Sigma^{-1}(X-\mu_2) \ge 2\ln(c) \\
&\iff -X^T\Sigma^{-1}X + X^T\Sigma^{-1}\mu_1 + \mu_1^T\Sigma^{-1}X - \mu_1^T\Sigma^{-1}\mu_1 \\
&\qquad\quad + X^T\Sigma^{-1}X - X^T\Sigma^{-1}\mu_2 - \mu_2^T\Sigma^{-1}X + \mu_2^T\Sigma^{-1}\mu_2 \ge 2\ln(c) \\
&\iff \underbrace{X^T\Sigma^{-1}(\mu_1-\mu_2)}_{\text{real number}} + \underbrace{(\mu_1^T-\mu_2^T)\Sigma^{-1}X}_{\text{real number}} - \mu_1^T\Sigma^{-1}\mu_1 + \mu_2^T\Sigma^{-1}\mu_2 \ge 2\ln(c) \\
&\iff X^T\Sigma^{-1}(\mu_1-\mu_2) + X^T\Sigma^{-1}(\mu_1-\mu_2) - \mu_1^T\Sigma^{-1}\mu_1 + \mu_2^T\Sigma^{-1}\mu_2 \ge 2\ln(c) \\
&\iff 2\,X^T\Sigma^{-1}(\mu_1-\mu_2) - \mu_1^T\Sigma^{-1}\mu_1 + \mu_2^T\Sigma^{-1}\mu_2 \ge 2\ln(c) \\
&\iff X^T\Sigma^{-1}(\mu_1-\mu_2) - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 \ge \ln(c)
\end{align*}
Remark. The previous expression depends on the discriminant
$$l = \Sigma^{-1}(\mu_1 - \mu_2).$$
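A small numerical sketch of this decision rule in R, built around the discriminant l; the means, covariance matrix, observation and threshold below are all illustrative values, not taken from the text:

mu1 = c(0, 0); mu2 = c(2, 1)                  # illustrative class means
Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2)   # illustrative common covariance matrix
c0 = 1                                        # threshold c = p2*L(2,1) / (p1*L(1,2)), assumed equal to 1
x = c(1.5, 0.2)                               # observation to classify
Sinv = solve(Sigma)
l = Sinv %*% (mu1 - mu2)                      # discriminant l = Sigma^{-1}(mu1 - mu2)
score = t(x) %*% l -
  1 / 2 * t(mu1) %*% Sinv %*% mu1 + 1 / 2 * t(mu2) %*% Sinv %*% mu2
in_R1 = as.numeric(score) >= log(c0)          # TRUE means x is classified as c1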
Theorem. The vector l has the property that it maximizes the function
$$\varphi(d) = \frac{\left(E_{c_1}(d^T X) - E_{c_2}(d^T X)\right)^2}{V(d^T X)},$$
where $E_{c_1}(d^T X)$ and $E_{c_2}(d^T X)$ are the expected values of the projection $d^T X$ in each class, respectively.
Proof.
\begin{align*}
\varphi(d) &= \frac{\left(E_{c_1}(d^T X) - E_{c_2}(d^T X)\right)^2}{V(d^T X)} \\
&= \frac{\left(d^T(\mu_1 - \mu_2)\right)^2}{d^T \Sigma d} \\
&= \frac{\left(d^T(\mu_1 - \mu_2)\right)\left(d^T(\mu_1 - \mu_2)\right)^T}{d^T \Sigma d} \\
&= \frac{d^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T d}{d^T \Sigma d}
\end{align*}
Since $\varphi(kd) = \varphi(d)$ for any scalar $k \neq 0$, we may restrict the search to directions d with $d^T \Sigma d = 1$. Using Lagrange multipliers on the problem
$$\max_{d} \; d^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T d \quad \text{subject to} \quad d^T \Sigma d = 1,$$
we get
$$\max_{\lambda, d} \; d^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T d - \lambda\left(d^T \Sigma d - 1\right).$$
Differentiating with respect to d and setting the derivative to zero,
\begin{align*}
2(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T d - 2\lambda\,\Sigma d &= 0 \\
(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T d &= \lambda\,\Sigma d \\
\frac{1}{\lambda}\,\Sigma^{-1}(\mu_1 - \mu_2)\underbrace{(\mu_1 - \mu_2)^T d}_{\text{real number}} &= d \\
\underbrace{\frac{1}{\lambda}\,(\mu_1 - \mu_2)^T d}_{\text{real number}}\;\Sigma^{-1}(\mu_1 - \mu_2) &= d \\
k \cdot \Sigma^{-1}(\mu_1 - \mu_2) &= d
\end{align*}
where $k = \frac{1}{\lambda}(\mu_1 - \mu_2)^T d$ is a scalar, therefore
$$d \propto \Sigma^{-1}(\mu_1 - \mu_2).$$
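A quick numerical check in R that the direction Σ⁻¹(µ1 − µ2) indeed maximizes φ; the means and covariance matrix are the same illustrative values used in the previous sketch:

# phi(d): squared between-class separation of the projection, divided by its variance
phi = function(d, mu1, mu2, Sigma) {
  as.numeric(t(d) %*% (mu1 - mu2))^2 / as.numeric(t(d) %*% Sigma %*% d)
}
mu1 = c(0, 0); mu2 = c(2, 1); Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2)
d_opt = solve(Sigma) %*% (mu1 - mu2)
phi(d_opt, mu1, mu2, Sigma)     # the maximal value of phi
phi(c(1, 1), mu1, mu2, Sigma)   # any other direction gives a value no larger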
Fisher Discriminant Analysis
Let {X1 , · · · , Xn } be n observations of dimension p belonging to one of two classes
c1 or c2 . The number of observations of each class is n1 and n2 , respectively, where
n1 + n2 = n. Suppose now that the two classes have p-dimensional means m1 and
m2 defined by
$$m_i = \frac{1}{n_i} \sum_{n \in c_i} X_n.$$
We would like to find a linear combination of predictors, w ∈ Rp , that maximizes
the separation between the classes and also minimizes the within-class variance in
the projected data. The Fisher criterion is defined as the ratio of the between-class
variance to the within-class variance and is given by
$$J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2},$$
where $m_i = w^T m_i$ is the projected mean of class $c_i$ (the same symbol is used, with a slight abuse of notation, for the class mean and its projection) and
$$s_i^2 = \sum_{n \in c_i} \left(w^T X_n - m_i\right)^2$$
is the within-class scatter of the projected data.
The numerator can be written as
\begin{align*}
(m_2 - m_1)^2 &= \left(w^T m_2 - w^T m_1\right)^2 \\
&= \left(w^T(m_2 - m_1)\right)^2 \\
&= w^T(m_2 - m_1)\left(w^T(m_2 - m_1)\right)^T \\
&= w^T(m_2 - m_1)(m_2 - m_1)^T w \\
&= w^T S_B w
\end{align*}
where $S_B = (m_2 - m_1)(m_2 - m_1)^T$ is the between-class scatter matrix.
Similarly,
\begin{align*}
s_1^2 &= \sum_{n \in c_1}\left(w^T X_n - m_1\right)^2 \\
&= \sum_{n \in c_1}\left(w^T X_n - w^T m_1\right)^2 \\
&= \sum_{n \in c_1}\left(w^T(X_n - m_1)\right)^2 \\
&= \sum_{n \in c_1}\left(w^T(X_n - m_1)\right)\left(w^T(X_n - m_1)\right)^T \\
&= \sum_{n \in c_1} w^T(X_n - m_1)(X_n - m_1)^T w \\
&= w^T \left(\sum_{n \in c_1}(X_n - m_1)(X_n - m_1)^T\right) w \\
&= w^T S_1 w
\end{align*}
The same reasoning applies to $s_2^2$. Therefore, the Fisher criterion can be rewritten as
$$J(w) = \frac{w^T S_B w}{w^T S_W w},$$
where $S_W = S_1 + S_2$.
Differentiating with respect to w and setting the derivative to zero,
\begin{align*}
\frac{(2 S_B w)\, w^T S_W w - w^T S_B w\, (2 S_W w)}{\left(w^T S_W w\right)^2} &= 0 \\
(S_B w)\underbrace{w^T S_W w}_{\text{real number}} - \underbrace{w^T S_B w}_{\text{real number}}\,(S_W w) &= 0 \\
(S_B w) - \underbrace{\frac{w^T S_B w}{w^T S_W w}}_{\lambda}\,(S_W w) &= 0 \\
(S_B w) - \lambda\,(S_W w) &= 0 \\
S_B w &= \lambda\, S_W w
\end{align*}
If $S_W$ has full rank, then
$$S_W^{-1} S_B w = \lambda w,$$
so $w$ is an eigenvector of $S_W^{-1} S_B$.
Remark.
\begin{align*}
S_B w &= (m_2 - m_1)\underbrace{(m_2 - m_1)^T w}_{\alpha} = \alpha\,(m_2 - m_1) \\
S_B w &= \lambda\, S_W w \\
\lambda\, S_W w &= \alpha\,(m_2 - m_1) \\
w &\propto S_W^{-1}(m_2 - m_1)
\end{align*}
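To make the construction concrete, the following R sketch computes Fisher's direction on simulated two-class data; the sample sizes, dimension and distributions are illustrative assumptions:

set.seed(1)
X1 = matrix(rnorm(40, mean = 0), ncol = 2)      # 20 observations of class c1 in R^2
X2 = matrix(rnorm(40, mean = 2), ncol = 2)      # 20 observations of class c2 in R^2
m1 = colMeans(X1); m2 = colMeans(X2)            # class means
S1 = t(sweep(X1, 2, m1)) %*% sweep(X1, 2, m1)   # within-class scatter of c1
S2 = t(sweep(X2, 2, m2)) %*% sweep(X2, 2, m2)   # within-class scatter of c2
SW = S1 + S2                                    # total within-class scatter
w = solve(SW) %*% (m2 - m1)                     # Fisher direction, proportional to SW^{-1}(m2 - m1)
proj1 = X1 %*% w; proj2 = X2 %*% w              # projections of each class onto w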