LDA Tutorial
Dimensionality Reduction: Linear Discriminant Analysis (LDA)
Aly A. Farag
Shireen Y. Elhabian
CVIP Lab
University of Louisville
www.cvip.uofl.edu
October 2, 2008
Outline
LDA objective
Recall PCA
Now LDA
LDA Two Classes
Counter example
LDA C-Classes
Illustrative Example
LDA Objective
The objective of LDA is to perform dimensionality reduction while preserving as much of the class-discriminatory information as possible.
So what? PCA performs dimensionality reduction as well.

Recall PCA
In PCA, we did not care whether the dataset represents features from one or more classes, i.e. the discrimination power was not taken into consideration.
Given n sample vectors, the eigenvalues and eigenvectors were computed for the data covariance matrix Sx. The new basis vectors are the eigenvectors with the highest eigenvalues, where the number of those vectors was our choice.
Thus, using the new basis, we can project the dataset onto a lower-dimensional space with a more powerful data representation.
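To make the PCA recap concrete, here is a minimal NumPy sketch (the original authors work in MATLAB; the function name pca_project and the choice of k are illustrative assumptions, not from the slides):

import numpy as np

def pca_project(X, k):
    """Project m-dimensional samples (one per column of X) onto the k
    eigenvectors of the covariance matrix with the largest eigenvalues."""
    mu = X.mean(axis=1, keepdims=True)        # sample mean
    Xc = X - mu                               # center the data
    Sx = Xc @ Xc.T / (X.shape[1] - 1)         # covariance matrix Sx
    eigvals, eigvecs = np.linalg.eigh(Sx)     # eigh: Sx is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]                 # new basis: top-k eigenvectors
    return W.T @ Xc                           # k-dimensional representation

Note that nothing in this computation looks at class labels, which is exactly the point LDA addresses next.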
Now LDA
Consider a pattern classification problem where we have C classes, e.g. sea bass, tuna, salmon.
Each class has Ni m-dimensional samples, where i = 1, 2, ..., C.
Hence we have a set of m-dimensional samples {x1, x2, ..., xNi} belonging to class i.
We stack these samples from the different classes into one big, fat matrix X such that each column represents one sample.
We seek to obtain a transformation of X to Y through projecting the samples in X onto a hyperplane of dimension C-1.
Let's see what this means.
LDA Two Classes
In the two-class case, we seek to obtain a scalar y by projecting the samples x onto a line:

y = w^T x,  where  x = [x_1, \dots, x_m]^T  and  w = [w_1, \dots, w_m]^T

In order to find a good projection vector, we need to define a measure of separation between the projections.
The mean vector of each class in x-space, and its projection onto y-space, are

\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x
and
\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i

We could then choose the distance between the projected means as our objective function:

J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T \mu_1 - w^T \mu_2| = |w^T (\mu_1 - \mu_2)|
However, the distance between the projected means is not a very good measure on its own, since it does not account for the variability within each class.
Fisher's solution is to normalize the difference between the means by a measure of the within-class scatter. For each class, the scatter (an equivalent of the variance) of the projected samples is

\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2

\tilde{s}_i^2 measures the variability within class \omega_i after projecting it onto the y-space.
Thus \tilde{s}_1^2 + \tilde{s}_2^2 measures the variability within the two classes at hand after projection; hence it is called the within-class scatter of the projected samples.
In the original x-space, we define

S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T
and
S_W = S_1 + S_2

where S_i is the covariance matrix of class i (up to a scale factor) and S_W is called the within-class scatter matrix.
The scatter of the projected samples can be expressed in terms of the scatter matrices in x-space:

\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2
            = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2
            = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w
            = w^T S_i w

\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S_W w = \tilde{S}_W

where \tilde{S}_W is the within-class scatter of the projected samples y.
Similarly, the difference between the projected means (in y-space) can be expressed in terms of the means in the original feature space (x-space):

(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2
                                  = w^T \underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B} w
                                  = w^T S_B w = \tilde{S}_B

where S_B is the between-class scatter matrix of the original samples and \tilde{S}_B is the between-class scatter of the projected samples.
Since S_B is the outer product of two vectors, its rank is at most one.
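As a quick sanity check (not part of the original slides), a short NumPy sketch can verify both identities numerically on synthetic two-class data; the random data and every variable name here are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(2, 50))                               # class 1 samples (columns)
X2 = rng.normal(size=(2, 60)) + np.array([[3.0], [2.0]])    # class 2 samples (columns)
w  = rng.normal(size=2)                                     # any projection direction

mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)
S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T            # scatter of class 1
S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T            # scatter of class 2
SW = S1 + S2                                                # within-class scatter matrix
SB = np.outer(mu1 - mu2, mu1 - mu2)                         # between-class scatter matrix

y1, y2 = w @ X1, w @ X2                                     # projected samples
s1_tilde = np.sum((y1 - y1.mean())**2)                      # projected scatter of class 1
s2_tilde = np.sum((y2 - y2.mean())**2)                      # projected scatter of class 2

print(np.isclose(s1_tilde + s2_tilde, w @ SW @ w))          # True: within-class identity
print(np.isclose((y1.mean() - y2.mean())**2, w @ SB @ w))   # True: between-class identity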
The Fisher linear discriminant is defined as the linear function w^T x that maximizes the ratio of between-class to within-class scatter:

J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}

To find the maximum, we set the derivative of J(w) to zero:

\frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0
\;\Rightarrow\;
(w^T S_W w)\, 2 S_B w - (w^T S_B w)\, 2 S_W w = 0

Dividing by 2 w^T S_W w:

S_B w - \frac{w^T S_B w}{w^T S_W w}\, S_W w = 0
\;\Rightarrow\;
S_B w - J(w)\, S_W w = 0
\;\Rightarrow\;
S_W^{-1} S_B w - J(w)\, w = 0

This yields the generalized eigenvalue problem

S_W^{-1} S_B w = \lambda w,  where \lambda = J(w) is a scalar,

and since S_B w is always in the direction of (\mu_1 - \mu_2), the solution is

w^* = \arg\max_w J(w) = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1}(\mu_1 - \mu_2)
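A minimal NumPy sketch of the two-class solution; the function name fisher_lda_2class and the column-wise data layout are assumptions of this sketch, not notation from the slides:

import numpy as np

def fisher_lda_2class(X1, X2):
    """Two-class Fisher LDA. X1 and X2 hold one m-dimensional sample per column.
    Returns the unit-norm projection direction w* = SW^-1 (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)
    S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T   # class 1 scatter
    S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T   # class 2 scatter
    SW = S1 + S2                                       # within-class scatter matrix
    w = np.linalg.solve(SW, mu1 - mu2)                 # closed form: SW^-1 (mu1 - mu2)
    return w / np.linalg.norm(w)                       # only the direction matters

The same direction (up to sign) is the eigenvector of S_W^{-1} S_B associated with its only nonzero eigenvalue, since S_B has rank one.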
Example: compute the Linear Discriminant projection for the following two-dimensional dataset.
Samples for class 1: X1 = (x1, x2) = {(4,2), (2,4), (2,3), (3,6), (4,4)}
Samples for class 2: X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}

Figure: scatter plot of the two classes in the (x1, x2) feature plane.
The class means are

\mu_1 = \frac{1}{N_1}\sum_{x\in\omega_1} x = \frac{1}{5}\left(\begin{bmatrix}4\\2\end{bmatrix}+\begin{bmatrix}2\\4\end{bmatrix}+\begin{bmatrix}2\\3\end{bmatrix}+\begin{bmatrix}3\\6\end{bmatrix}+\begin{bmatrix}4\\4\end{bmatrix}\right) = \begin{bmatrix}3\\3.8\end{bmatrix}

\mu_2 = \frac{1}{N_2}\sum_{x\in\omega_2} x = \frac{1}{5}\left(\begin{bmatrix}9\\10\end{bmatrix}+\begin{bmatrix}6\\8\end{bmatrix}+\begin{bmatrix}9\\5\end{bmatrix}+\begin{bmatrix}8\\7\end{bmatrix}+\begin{bmatrix}10\\8\end{bmatrix}\right) = \begin{bmatrix}8.4\\7.6\end{bmatrix}
The class covariance matrices (computed here as sample covariances, i.e. dividing by N_i - 1 = 4) are

S_1 = \frac{1}{N_1-1}\sum_{x\in\omega_1}(x-\mu_1)(x-\mu_1)^T = \begin{bmatrix}1 & -0.25\\ -0.25 & 2.2\end{bmatrix}

S_2 = \frac{1}{N_2-1}\sum_{x\in\omega_2}(x-\mu_2)(x-\mu_2)^T = \begin{bmatrix}2.3 & -0.05\\ -0.05 & 3.3\end{bmatrix}

and the within-class scatter matrix is

S_W = S_1 + S_2 = \begin{bmatrix}1 & -0.25\\ -0.25 & 2.2\end{bmatrix} + \begin{bmatrix}2.3 & -0.05\\ -0.05 & 3.3\end{bmatrix} = \begin{bmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{bmatrix}
The between-class scatter matrix is

S_B = (\mu_1-\mu_2)(\mu_1-\mu_2)^T = \left(\begin{bmatrix}3\\3.8\end{bmatrix}-\begin{bmatrix}8.4\\7.6\end{bmatrix}\right)\left(\begin{bmatrix}3\\3.8\end{bmatrix}-\begin{bmatrix}8.4\\7.6\end{bmatrix}\right)^T = \begin{bmatrix}-5.4\\-3.8\end{bmatrix}\begin{bmatrix}-5.4 & -3.8\end{bmatrix} = \begin{bmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{bmatrix}
The LDA projection is then obtained as the solution of the generalized eigenvalue problem

S_W^{-1} S_B w = \lambda w  \;\Rightarrow\;  \left| S_W^{-1} S_B - \lambda I \right| = 0

\left|\begin{bmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{bmatrix}^{-1}\begin{bmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{bmatrix} - \lambda\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}\right|
= \left|\begin{bmatrix}0.3045 & 0.0166\\ 0.0166 & 0.1827\end{bmatrix}\begin{bmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{bmatrix} - \lambda I\right|
= \begin{vmatrix}9.2213-\lambda & 6.489\\ 4.2339 & 2.9794-\lambda\end{vmatrix} = 0

(9.2213-\lambda)(2.9794-\lambda) - 6.489 \times 4.2339 = 0
\lambda^2 - 12.2007\,\lambda = 0  \;\Rightarrow\;  \lambda(\lambda - 12.2007) = 0
\lambda_1 = 0,\quad \lambda_2 = 12.2007
Hence the two eigenvectors satisfy

\begin{bmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{bmatrix} w_1 = \lambda_1 w_1 = 0 \cdot w_1
and
\begin{bmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{bmatrix} w_2 = \lambda_2 w_2 = 12.2007\, w_2

Thus

w_1 = \begin{bmatrix}0.5755\\ -0.8178\end{bmatrix}
and
w_2 = \begin{bmatrix}0.9088\\ 0.4173\end{bmatrix} = w^*
Or, using the closed-form solution directly:

w^* = S_W^{-1}(\mu_1 - \mu_2) = \begin{bmatrix}0.3045 & 0.0166\\ 0.0166 & 0.1827\end{bmatrix}\begin{bmatrix}3 - 8.4\\ 3.8 - 7.6\end{bmatrix} = \begin{bmatrix}-1.7077\\ -0.7840\end{bmatrix} \;\propto\; \begin{bmatrix}0.9088\\ 0.4173\end{bmatrix}
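The numbers above can be reproduced in a few lines of NumPy (the original material uses MATLAB; this is an equivalent sketch, using the sample covariance with N_i - 1 exactly as the example does):

import numpy as np

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float).T    # class 1, one sample per column
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float).T  # class 2, one sample per column

mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)       # [3, 3.8] and [8.4, 7.6]
S1, S2 = np.cov(X1), np.cov(X2)                   # np.cov divides by N - 1, as in the example
SW = S1 + S2                                      # [[3.3, -0.3], [-0.3, 5.5]]
SB = np.outer(mu1 - mu2, mu1 - mu2)               # [[29.16, 20.52], [20.52, 14.44]]

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
print(eigvals)                                    # 12.2007 and ~0 (order may vary)
w_star = eigvecs[:, np.argmax(eigvals)]           # about [0.9088, 0.4173], up to sign
print(w_star)
print(np.linalg.solve(SW, mu1 - mu2))             # closed form, proportional to the same direction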
LDA - Projection
Figure: classes' PDFs p(y|ω_i) using the LDA projection vector with the other eigenvalue = 8.8818e-016 (numerically zero), shown alongside the original (x1, x2) scatter.
LDA - Projection
Figure: classes' PDFs p(y|ω_i) using the LDA projection vector with the highest eigenvalue = 12.2007, shown alongside the original (x1, x2) scatter.
LDA C-Classes
Now, we have C classes instead of just two.
We are now seeking (C-1) projections [y_1, y_2, ..., y_{C-1}] by means of (C-1) projection vectors w_i.
The w_i can be arranged by columns into a projection matrix W = [w_1 | w_2 | ... | w_{C-1}] such that:

y_i = w_i^T x  and  y = W^T x

where

x_{m\times 1} = \begin{bmatrix}x_1\\ \vdots\\ x_m\end{bmatrix},
\quad
y_{(C-1)\times 1} = \begin{bmatrix}y_1\\ \vdots\\ y_{C-1}\end{bmatrix},
\quad
W_{m\times(C-1)} = [\,w_1 \,|\, w_2 \,|\, \dots \,|\, w_{C-1}\,]
LDA C-Classes
If we have n feature vectors, we can stack them into one matrix as follows:

Y = W^T X

where

X_{m\times n} = \begin{bmatrix} x_1^1 & x_1^2 & \dots & x_1^n \\ \vdots & \vdots & & \vdots \\ x_m^1 & x_m^2 & \dots & x_m^n \end{bmatrix}
\quad\text{and}\quad
Y_{(C-1)\times n} = \begin{bmatrix} y_1^1 & y_1^2 & \dots & y_1^n \\ \vdots & \vdots & & \vdots \\ y_{C-1}^1 & y_{C-1}^2 & \dots & y_{C-1}^n \end{bmatrix}
LDA C-Classes
Recall the two-class case, where the within-class scatter was computed as S_W = S_1 + S_2.
This can be generalized to the C-class case as:

S_W = \sum_{i=1}^{C} S_i,
\quad\text{where}\quad
S_i = \sum_{x\in\omega_i}(x-\mu_i)(x-\mu_i)^T
\quad\text{and}\quad
\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x

Figure: example of two-dimensional features (m = 2) with three classes (C = 3), showing the per-class scatters S_{w1}, S_{w2}, S_{w3} in the (x1, x2) plane.
LDA C-Classes
Recall the two-class case, where the between-class scatter was computed as S_B = (\mu_1-\mu_2)(\mu_1-\mu_2)^T.
For the C-class case it becomes:

S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T

where

\mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i \mu_i
\quad\text{and}\quad
\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x

and N is the total number of samples.

Figure: example of two-dimensional features (m = 2) with three classes (C = 3), showing the between-class scatters S_{B1}, S_{B2}, S_{B3} together with the within-class scatters in the (x1, x2) plane.
LDA C-Classes
Similarly, we can define the mean vectors for the projected samples y as:

\tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y
\quad\text{and}\quad
\tilde{\mu} = \frac{1}{N}\sum_{\forall y} y

while the scatter matrices for the projected samples y will be:

\tilde{S}_W = \sum_{i=1}^{C}\tilde{S}_i = \sum_{i=1}^{C}\sum_{y\in\omega_i}(y-\tilde{\mu}_i)(y-\tilde{\mu}_i)^T

\tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T
LDA C-Classes
As in the two-class case, the projected scatter matrices can be written in terms of the original ones:

\tilde{S}_W = W^T S_W W
\quad\text{and}\quad
\tilde{S}_B = W^T S_B W

Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter.
Since the projection is no longer a scalar (it has C-1 dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function:

J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}
LDA C-Classes
Recall that in the two-class case the optimal projection solved

S_W^{-1} S_B w = \lambda w,  where \lambda = J(w) is a scalar.

For the C-class case we have (C-1) projection vectors, hence the eigenvalue problem generalizes to:

S_W^{-1} S_B w_i = \lambda_i w_i,  where \lambda_i = J(w_i) is a scalar and i = 1, 2, ..., C-1.

Thus, it can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem:

S_W^{-1} S_B W^* = \lambda W^*,  where \lambda = J(W^*) is a scalar.
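A compact NumPy sketch of the C-class case (the helper name lda_fit and the column-wise data layout are assumptions of this sketch): build S_W and S_B as above, then keep the leading C-1 eigenvectors of S_W^{-1} S_B.

import numpy as np

def lda_fit(X, labels):
    """Multi-class LDA. X holds one m-dimensional sample per column, labels one
    class id per column. Returns W (m x (C-1)), the eigenvectors of SW^-1 SB
    associated with the largest eigenvalues."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    m = X.shape[0]
    mu = X.mean(axis=1)                                          # overall mean
    SW = np.zeros((m, m))
    SB = np.zeros((m, m))
    for c in classes:
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1)
        SW += (Xc - mu_c[:, None]) @ (Xc - mu_c[:, None]).T      # within-class scatter
        SB += Xc.shape[1] * np.outer(mu_c - mu, mu_c - mu)       # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))    # SW^-1 SB
    order = np.argsort(eigvals.real)[::-1][:len(classes) - 1]    # keep the C-1 leading directions
    return eigvecs[:, order].real

A sample x is then projected with y = W.T @ x, giving its (C-1)-dimensional discriminant coordinates.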
Illustration 3 Classes
Let's generate a dataset for each class to simulate the three classes shown.

Dataset Generation
With \mu = \begin{bmatrix}5\\5\end{bmatrix}, the class means are

\mu_1 = \mu + \begin{bmatrix}3\\3.5\end{bmatrix},
\quad
\mu_2 = \mu + \begin{bmatrix}2.5\\5\end{bmatrix},
\quad
\mu_3 = \mu + \begin{bmatrix}7\\7\end{bmatrix}

and the class covariance matrices are

S_1 = \begin{bmatrix}5 & 1\\ 1 & 3\end{bmatrix},
\quad
S_2 = \begin{bmatrix}4 & 0\\ 0 & 4\end{bmatrix},
\quad
S_3 = \begin{bmatrix}3.5 & 1\\ 1 & 2.5\end{bmatrix}

Figure: the three simulated classes in the (x1, x2) plane.
In MATLAB
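The original slide shows the MATLAB session used for this step; an equivalent NumPy sketch of the data generation (reusing the lda_fit helper sketched earlier, and assuming 1000 samples per class since the slides do not state Ni) could be:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([5.0, 5.0])
means = [mu + np.array([3.0, 3.5]), mu + np.array([2.5, 5.0]), mu + np.array([7.0, 7.0])]
covs  = [np.array([[5.0, 1.0], [1.0, 3.0]]),
         np.array([[4.0, 0.0], [0.0, 4.0]]),
         np.array([[3.5, 1.0], [1.0, 2.5]])]

samples_per_class = 1000                      # assumed, not stated in the slides
X = np.hstack([rng.multivariate_normal(m, S, samples_per_class).T
               for m, S in zip(means, covs)]) # 2 x 3000, one sample per column
labels = np.repeat([0, 1, 2], samples_per_class)

W = lda_fit(X, labels)                        # 2 x (C-1) = 2 x 2 projection matrix
Y = W.T @ X                                   # projections onto the two discriminant directions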
It's Working
Figure: scatter of the generated samples of the three classes (X1 - the first feature, X2 - the second feature).
To compute the LDA projection for the generated data, we form

S_W = \sum_{i=1}^{3} S_i,
\quad\text{where}\quad
S_i = \sum_{x\in\omega_i}(x-\mu_i)(x-\mu_i)^T
\quad\text{and}\quad
\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x

S_B = \sum_{i=1}^{3} N_i(\mu_i-\mu)(\mu_i-\mu)^T,
\quad\text{where}\quad
\mu = \frac{1}{N}\sum_{\forall x} x

and then solve the eigenvalue problem for S_W^{-1} S_B.

Figure: the generated three-class data in the (X1, X2) feature plane (X1 - the first feature).
Projection y = W^T x
Along the first projection vector.
Figure: classes' PDFs p(y|ω_i) using the first projection vector with eigenvalue = 4508.2089.
Projection y = W^T x
Along the second projection vector.
Figure: classes' PDFs p(y|ω_i) using the second projection vector with eigenvalue = 1878.8511.
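The PDF plots above were produced in MATLAB; a rough Python equivalent (histogram estimates of p(y|ω_i), reusing Y and labels from the generation sketch above) might look like:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for k, ax in enumerate(axes):                   # one panel per projection vector
    for c in range(3):
        ax.hist(Y[k, labels == c], bins=30, density=True, alpha=0.5, label=f"class {c + 1}")
    ax.set_xlabel("y")
    ax.set_ylabel("p(y|class)")
    ax.set_title(f"projection vector {k + 1}")
    ax.legend()
plt.show()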
Which is Better?!!!
Apparently, the projection vector with the highest eigenvalue provides higher discrimination power between the classes.
Figure: side-by-side classes' PDFs p(y|ω_i) using the first projection vector (eigenvalue = 4508.2089) and the second projection vector (eigenvalue = 1878.8511).
PCA vs LDA
Limitations of LDA
LDA produces at most C-1 feature projections. If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features.
LDA assumes unimodal Gaussian class distributions. If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure of the data, which may be needed for classification.
Limitations of LDA
LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data.
Thank You