LDA Tutorial
Dimensionality Reduction: Linear Discriminant Analysis (LDA)
Aly A. Farag
Shireen Y. Elhabian
CVIP Lab
University of Louisville
www.cvip.uofl.edu
October 2, 2008
Outline
LDA objective
Recall PCA
Now LDA
LDA Two Classes
Counter example
LDA C-Classes
Illustrative Example
LDA Objective
The objective of LDA is to perform dimensionality reduction while preserving as much of the class-discriminatory information as possible.
So what? PCA performs dimensionality reduction as well.

Recall PCA
In PCA, we did not care whether the dataset represents features from one or more classes, i.e. the discrimination power was not taken into consideration.
Given n sample vectors, the eigenvalues and eigenvectors were computed for the data covariance matrix Sx. The new basis vectors are the eigenvectors with the highest eigenvalues, where the number of those vectors was our choice.
Thus, using the new basis, we can project the dataset onto a lower-dimensional space with a more powerful data representation.
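To make the PCA recap concrete, here is a minimal NumPy sketch (the original authors work in MATLAB; the function name pca_project and the choice of k are illustrative assumptions, not from the slides):

import numpy as np

def pca_project(X, k):
    """Project m-dimensional samples (one per column of X) onto the k
    eigenvectors of the covariance matrix with the largest eigenvalues."""
    mu = X.mean(axis=1, keepdims=True)        # sample mean
    Xc = X - mu                               # center the data
    Sx = Xc @ Xc.T / (X.shape[1] - 1)         # covariance matrix Sx
    eigvals, eigvecs = np.linalg.eigh(Sx)     # eigh: Sx is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]                 # new basis: top-k eigenvectors
    return W.T @ Xc                           # k-dimensional representation

Note that nothing in this computation looks at class labels, which is exactly the point LDA addresses next.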
Now LDA
Consider a pattern classification problem where we have C classes, e.g. sea bass, tuna, salmon.
Each class has Ni m-dimensional samples, where i = 1, 2, ..., C.
Hence we have a set of m-dimensional samples {x1, x2, ..., xNi} belonging to class i.
We stack these samples from the different classes into one big, fat matrix X such that each column represents one sample.
We seek to obtain a transformation of X to Y through projecting the samples in X onto a hyperplane of dimension C-1.
Let's see what this means.
LDA Two Classes
In the two-class case, we seek to obtain a scalar y by projecting the samples x onto a line:

y = w^T x,  where  x = [x_1, \dots, x_m]^T  and  w = [w_1, \dots, w_m]^T

In order to find a good projection vector, we need to define a measure of separation between the projections.
The mean vector of each class in x-space, and its projection onto y-space, are

\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x
and
\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i

We could then choose the distance between the projected means as our objective function:

J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T \mu_1 - w^T \mu_2| = |w^T (\mu_1 - \mu_2)|
However, the distance between the projected means is not a very good measure on its own, since it does not account for the variability within each class.
Fisher's solution is to normalize the difference between the means by a measure of the within-class scatter. For each class, the scatter (an equivalent of the variance) of the projected samples is

\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2

\tilde{s}_i^2 measures the variability within class \omega_i after projecting it onto the y-space.
Thus \tilde{s}_1^2 + \tilde{s}_2^2 measures the variability within the two classes at hand after projection; hence it is called the within-class scatter of the projected samples.
In the original x-space, we define

S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T
and
S_W = S_1 + S_2

where S_i is the covariance matrix of class i (up to a scale factor) and S_W is called the within-class scatter matrix.
The scatter of the projected samples can be expressed in terms of the scatter matrices in x-space:

\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2
            = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2
            = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w
            = w^T S_i w

\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S_W w = \tilde{S}_W

where \tilde{S}_W is the within-class scatter of the projected samples y.
Similarly, the difference between the projected means (in y-space) can be expressed in terms of the means in the original feature space (x-space):

(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2
                                  = w^T \underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B} w
                                  = w^T S_B w = \tilde{S}_B

where S_B is the between-class scatter matrix of the original samples and \tilde{S}_B is the between-class scatter of the projected samples.
Since S_B is the outer product of two vectors, its rank is at most one.
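As a quick sanity check (not part of the original slides), a short NumPy sketch can verify both identities numerically on synthetic two-class data; the random data and every variable name here are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(2, 50))                               # class 1 samples (columns)
X2 = rng.normal(size=(2, 60)) + np.array([[3.0], [2.0]])    # class 2 samples (columns)
w  = rng.normal(size=2)                                     # any projection direction

mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)
S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T            # scatter of class 1
S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T            # scatter of class 2
SW = S1 + S2                                                # within-class scatter matrix
SB = np.outer(mu1 - mu2, mu1 - mu2)                         # between-class scatter matrix

y1, y2 = w @ X1, w @ X2                                     # projected samples
s1_tilde = np.sum((y1 - y1.mean())**2)                      # projected scatter of class 1
s2_tilde = np.sum((y2 - y2.mean())**2)                      # projected scatter of class 2

print(np.isclose(s1_tilde + s2_tilde, w @ SW @ w))          # True: within-class identity
print(np.isclose((y1.mean() - y2.mean())**2, w @ SB @ w))   # True: between-class identity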
The Fisher linear discriminant is defined as the linear function w^T x that maximizes the ratio of between-class to within-class scatter:

J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}

To find the maximum, we set the derivative of J(w) to zero:

\frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0
\;\Rightarrow\;
(w^T S_W w)\, 2 S_B w - (w^T S_B w)\, 2 S_W w = 0

Dividing by 2 w^T S_W w:

S_B w - \frac{w^T S_B w}{w^T S_W w}\, S_W w = 0
\;\Rightarrow\;
S_B w - J(w)\, S_W w = 0
\;\Rightarrow\;
S_W^{-1} S_B w - J(w)\, w = 0

This yields the generalized eigenvalue problem

S_W^{-1} S_B w = \lambda w,  where \lambda = J(w) is a scalar,

and since S_B w is always in the direction of (\mu_1 - \mu_2), the solution is

w^* = \arg\max_w J(w) = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1}(\mu_1 - \mu_2)
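A minimal NumPy sketch of the two-class solution; the function name fisher_lda_2class and the column-wise data layout are assumptions of this sketch, not notation from the slides:

import numpy as np

def fisher_lda_2class(X1, X2):
    """Two-class Fisher LDA. X1 and X2 hold one m-dimensional sample per column.
    Returns the unit-norm projection direction w* = SW^-1 (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)
    S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T   # class 1 scatter
    S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T   # class 2 scatter
    SW = S1 + S2                                       # within-class scatter matrix
    w = np.linalg.solve(SW, mu1 - mu2)                 # closed form: SW^-1 (mu1 - mu2)
    return w / np.linalg.norm(w)                       # only the direction matters

The same direction (up to sign) is the eigenvector of S_W^{-1} S_B associated with its only nonzero eigenvalue, since S_B has rank one.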
Example: compute the Linear Discriminant projection for the following two-dimensional dataset.
Samples for class 1: X1 = (x1, x2) = {(4,2), (2,4), (2,3), (3,6), (4,4)}
Samples for class 2: X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}

Figure: scatter plot of the two classes in the (x1, x2) feature plane.
The class means are

\mu_1 = \frac{1}{N_1}\sum_{x\in\omega_1} x = \frac{1}{5}\left(\begin{bmatrix}4\\2\end{bmatrix}+\begin{bmatrix}2\\4\end{bmatrix}+\begin{bmatrix}2\\3\end{bmatrix}+\begin{bmatrix}3\\6\end{bmatrix}+\begin{bmatrix}4\\4\end{bmatrix}\right) = \begin{bmatrix}3\\3.8\end{bmatrix}

\mu_2 = \frac{1}{N_2}\sum_{x\in\omega_2} x = \frac{1}{5}\left(\begin{bmatrix}9\\10\end{bmatrix}+\begin{bmatrix}6\\8\end{bmatrix}+\begin{bmatrix}9\\5\end{bmatrix}+\begin{bmatrix}8\\7\end{bmatrix}+\begin{bmatrix}10\\8\end{bmatrix}\right) = \begin{bmatrix}8.4\\7.6\end{bmatrix}
The class covariance matrices (computed here as sample covariances, i.e. dividing by N_i - 1 = 4) are

S_1 = \frac{1}{N_1-1}\sum_{x\in\omega_1}(x-\mu_1)(x-\mu_1)^T = \begin{bmatrix}1 & -0.25\\ -0.25 & 2.2\end{bmatrix}

S_2 = \frac{1}{N_2-1}\sum_{x\in\omega_2}(x-\mu_2)(x-\mu_2)^T = \begin{bmatrix}2.3 & -0.05\\ -0.05 & 3.3\end{bmatrix}

and the within-class scatter matrix is

S_W = S_1 + S_2 = \begin{bmatrix}1 & -0.25\\ -0.25 & 2.2\end{bmatrix} + \begin{bmatrix}2.3 & -0.05\\ -0.05 & 3.3\end{bmatrix} = \begin{bmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{bmatrix}
The between-class scatter matrix is

S_B = (\mu_1-\mu_2)(\mu_1-\mu_2)^T = \left(\begin{bmatrix}3\\3.8\end{bmatrix}-\begin{bmatrix}8.4\\7.6\end{bmatrix}\right)\left(\begin{bmatrix}3\\3.8\end{bmatrix}-\begin{bmatrix}8.4\\7.6\end{bmatrix}\right)^T = \begin{bmatrix}-5.4\\-3.8\end{bmatrix}\begin{bmatrix}-5.4 & -3.8\end{bmatrix} = \begin{bmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{bmatrix}
The LDA projection is then obtained as the solution of the generalized eigenvalue problem

S_W^{-1} S_B w = \lambda w  \;\Rightarrow\;  \left| S_W^{-1} S_B - \lambda I \right| = 0

\left|\begin{bmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{bmatrix}^{-1}\begin{bmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{bmatrix} - \lambda\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}\right|
= \left|\begin{bmatrix}0.3045 & 0.0166\\ 0.0166 & 0.1827\end{bmatrix}\begin{bmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{bmatrix} - \lambda I\right|
= \begin{vmatrix}9.2213-\lambda & 6.489\\ 4.2339 & 2.9794-\lambda\end{vmatrix} = 0

(9.2213-\lambda)(2.9794-\lambda) - 6.489 \times 4.2339 = 0
\lambda^2 - 12.2007\,\lambda = 0  \;\Rightarrow\;  \lambda(\lambda - 12.2007) = 0
\lambda_1 = 0,\quad \lambda_2 = 12.2007
Hence the two eigenvectors satisfy

\begin{bmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{bmatrix} w_1 = \lambda_1 w_1 = 0 \cdot w_1
and
\begin{bmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{bmatrix} w_2 = \lambda_2 w_2 = 12.2007\, w_2

Thus

w_1 = \begin{bmatrix}0.5755\\ -0.8178\end{bmatrix}
and
w_2 = \begin{bmatrix}0.9088\\ 0.4173\end{bmatrix} = w^*
Or, using the closed-form solution directly:

w^* = S_W^{-1}(\mu_1 - \mu_2) = \begin{bmatrix}0.3045 & 0.0166\\ 0.0166 & 0.1827\end{bmatrix}\begin{bmatrix}3 - 8.4\\ 3.8 - 7.6\end{bmatrix} = \begin{bmatrix}-1.7077\\ -0.7840\end{bmatrix} \;\propto\; \begin{bmatrix}0.9088\\ 0.4173\end{bmatrix}
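The numbers above can be reproduced in a few lines of NumPy (the original material uses MATLAB; this is an equivalent sketch, using the sample covariance with N_i - 1 exactly as the example does):

import numpy as np

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float).T    # class 1, one sample per column
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float).T  # class 2, one sample per column

mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)       # [3, 3.8] and [8.4, 7.6]
S1, S2 = np.cov(X1), np.cov(X2)                   # np.cov divides by N - 1, as in the example
SW = S1 + S2                                      # [[3.3, -0.3], [-0.3, 5.5]]
SB = np.outer(mu1 - mu2, mu1 - mu2)               # [[29.16, 20.52], [20.52, 14.44]]

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
print(eigvals)                                    # 12.2007 and ~0 (order may vary)
w_star = eigvecs[:, np.argmax(eigvals)]           # about [0.9088, 0.4173], up to sign
print(w_star)
print(np.linalg.solve(SW, mu1 - mu2))             # closed form, proportional to the same direction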
LDA - Projection
Figure: classes' PDFs p(y|ω_i) using the LDA projection vector with the other eigenvalue = 8.8818e-016 (numerically zero), shown alongside the original (x1, x2) scatter.
LDA - Projection
Figure: classes' PDFs p(y|ω_i) using the LDA projection vector with the highest eigenvalue = 12.2007, shown alongside the original (x1, x2) scatter.
LDA C-Classes
Now, we have C classes instead of just two.
We are now seeking (C-1) projections [y_1, y_2, ..., y_{C-1}] by means of (C-1) projection vectors w_i.
The w_i can be arranged by columns into a projection matrix W = [w_1 | w_2 | ... | w_{C-1}] such that:

y_i = w_i^T x  and  y = W^T x

where

x_{m\times 1} = \begin{bmatrix}x_1\\ \vdots\\ x_m\end{bmatrix},
\quad
y_{(C-1)\times 1} = \begin{bmatrix}y_1\\ \vdots\\ y_{C-1}\end{bmatrix},
\quad
W_{m\times(C-1)} = [\,w_1 \,|\, w_2 \,|\, \dots \,|\, w_{C-1}\,]
LDA C-Classes
If we have n feature vectors, we can stack them into one matrix as follows:

Y = W^T X

where

X_{m\times n} = \begin{bmatrix} x_1^1 & x_1^2 & \dots & x_1^n \\ \vdots & \vdots & & \vdots \\ x_m^1 & x_m^2 & \dots & x_m^n \end{bmatrix}
\quad\text{and}\quad
Y_{(C-1)\times n} = \begin{bmatrix} y_1^1 & y_1^2 & \dots & y_1^n \\ \vdots & \vdots & & \vdots \\ y_{C-1}^1 & y_{C-1}^2 & \dots & y_{C-1}^n \end{bmatrix}
LDA C-Classes
Recall the two-class case, where the within-class scatter was computed as S_W = S_1 + S_2.
This can be generalized to the C-class case as:

S_W = \sum_{i=1}^{C} S_i,
\quad\text{where}\quad
S_i = \sum_{x\in\omega_i}(x-\mu_i)(x-\mu_i)^T
\quad\text{and}\quad
\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x

Figure: example of two-dimensional features (m = 2) with three classes (C = 3), showing the per-class scatters S_{w1}, S_{w2}, S_{w3} in the (x1, x2) plane.
LDA C-Classes
Recall the two-class case, where the between-class scatter was computed as S_B = (\mu_1-\mu_2)(\mu_1-\mu_2)^T.
For the C-class case it becomes:

S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T

where

\mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i \mu_i
\quad\text{and}\quad
\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x

and N is the total number of samples.

Figure: example of two-dimensional features (m = 2) with three classes (C = 3), showing the between-class scatters S_{B1}, S_{B2}, S_{B3} together with the within-class scatters in the (x1, x2) plane.
LDA C-Classes
Similarly, we can define the mean vectors for the projected samples y as:

\tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y
\quad\text{and}\quad
\tilde{\mu} = \frac{1}{N}\sum_{\forall y} y

while the scatter matrices for the projected samples y will be:

\tilde{S}_W = \sum_{i=1}^{C}\tilde{S}_i = \sum_{i=1}^{C}\sum_{y\in\omega_i}(y-\tilde{\mu}_i)(y-\tilde{\mu}_i)^T

\tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T
LDA C-Classes
As in the two-class case, the projected scatter matrices can be written in terms of the original ones:

\tilde{S}_W = W^T S_W W
\quad\text{and}\quad
\tilde{S}_B = W^T S_B W

Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter.
Since the projection is no longer a scalar (it has C-1 dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function:

J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}
LDA C-Classes
Recall that in the two-class case the optimal projection solved

S_W^{-1} S_B w = \lambda w,  where \lambda = J(w) is a scalar.

For the C-class case we have (C-1) projection vectors, hence the eigenvalue problem generalizes to:

S_W^{-1} S_B w_i = \lambda_i w_i,  where \lambda_i = J(w_i) is a scalar and i = 1, 2, ..., C-1.

Thus, it can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem:

S_W^{-1} S_B W^* = \lambda W^*,  where \lambda = J(W^*) is a scalar.
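A compact NumPy sketch of the C-class case (the helper name lda_fit and the column-wise data layout are assumptions of this sketch): build S_W and S_B as above, then keep the leading C-1 eigenvectors of S_W^{-1} S_B.

import numpy as np

def lda_fit(X, labels):
    """Multi-class LDA. X holds one m-dimensional sample per column, labels one
    class id per column. Returns W (m x (C-1)), the eigenvectors of SW^-1 SB
    associated with the largest eigenvalues."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    m = X.shape[0]
    mu = X.mean(axis=1)                                          # overall mean
    SW = np.zeros((m, m))
    SB = np.zeros((m, m))
    for c in classes:
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1)
        SW += (Xc - mu_c[:, None]) @ (Xc - mu_c[:, None]).T      # within-class scatter
        SB += Xc.shape[1] * np.outer(mu_c - mu, mu_c - mu)       # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))    # SW^-1 SB
    order = np.argsort(eigvals.real)[::-1][:len(classes) - 1]    # keep the C-1 leading directions
    return eigvecs[:, order].real

A sample x is then projected with y = W.T @ x, giving its (C-1)-dimensional discriminant coordinates.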
Illustration 3 Classes
Let's generate a dataset for each class to simulate the three classes shown.

Dataset Generation
With \mu = \begin{bmatrix}5\\5\end{bmatrix}, the class means are

\mu_1 = \mu + \begin{bmatrix}3\\3.5\end{bmatrix},
\quad
\mu_2 = \mu + \begin{bmatrix}2.5\\5\end{bmatrix},
\quad
\mu_3 = \mu + \begin{bmatrix}7\\7\end{bmatrix}

and the class covariance matrices are

S_1 = \begin{bmatrix}5 & 1\\ 1 & 3\end{bmatrix},
\quad
S_2 = \begin{bmatrix}4 & 0\\ 0 & 4\end{bmatrix},
\quad
S_3 = \begin{bmatrix}3.5 & 1\\ 1 & 2.5\end{bmatrix}

Figure: the three simulated classes in the (x1, x2) plane.
In MATLAB
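The original slide shows the MATLAB session used for this step; an equivalent NumPy sketch of the data generation (reusing the lda_fit helper sketched earlier, and assuming 1000 samples per class since the slides do not state Ni) could be:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([5.0, 5.0])
means = [mu + np.array([3.0, 3.5]), mu + np.array([2.5, 5.0]), mu + np.array([7.0, 7.0])]
covs  = [np.array([[5.0, 1.0], [1.0, 3.0]]),
         np.array([[4.0, 0.0], [0.0, 4.0]]),
         np.array([[3.5, 1.0], [1.0, 2.5]])]

samples_per_class = 1000                      # assumed, not stated in the slides
X = np.hstack([rng.multivariate_normal(m, S, samples_per_class).T
               for m, S in zip(means, covs)]) # 2 x 3000, one sample per column
labels = np.repeat([0, 1, 2], samples_per_class)

W = lda_fit(X, labels)                        # 2 x (C-1) = 2 x 2 projection matrix
Y = W.T @ X                                   # projections onto the two discriminant directions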
It's Working
Figure: scatter of the generated samples of the three classes (X1 - the first feature, X2 - the second feature).
To compute the LDA projection for the generated data, we form

S_W = \sum_{i=1}^{3} S_i,
\quad\text{where}\quad
S_i = \sum_{x\in\omega_i}(x-\mu_i)(x-\mu_i)^T
\quad\text{and}\quad
\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x

S_B = \sum_{i=1}^{3} N_i(\mu_i-\mu)(\mu_i-\mu)^T,
\quad\text{where}\quad
\mu = \frac{1}{N}\sum_{\forall x} x

and then solve the eigenvalue problem for S_W^{-1} S_B.

Figure: the generated three-class data in the (X1, X2) feature plane (X1 - the first feature).
Projection y = W^T x
Along the first projection vector.
Figure: classes' PDFs p(y|ω_i) using the first projection vector with eigenvalue = 4508.2089.
Projection y = W^T x
Along the second projection vector.
Figure: classes' PDFs p(y|ω_i) using the second projection vector with eigenvalue = 1878.8511.
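The PDF plots above were produced in MATLAB; a rough Python equivalent (histogram estimates of p(y|ω_i), reusing Y and labels from the generation sketch above) might look like:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for k, ax in enumerate(axes):                   # one panel per projection vector
    for c in range(3):
        ax.hist(Y[k, labels == c], bins=30, density=True, alpha=0.5, label=f"class {c + 1}")
    ax.set_xlabel("y")
    ax.set_ylabel("p(y|class)")
    ax.set_title(f"projection vector {k + 1}")
    ax.legend()
plt.show()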
Which is Better?!!!
Apparently, the projection vector with the highest eigenvalue provides higher discrimination power between the classes.
Figure: side-by-side classes' PDFs p(y|ω_i) using the first projection vector (eigenvalue = 4508.2089) and the second projection vector (eigenvalue = 1878.8511).
PCA vs LDA
Limitations of LDA
LDA produces at most C-1 feature projections. If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features.
LDA assumes unimodal Gaussian class distributions. If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure of the data, which may be needed for classification.
Limitations of LDA
LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data.
Thank You