Lec05 DimensionalityReduction
Pattern Classification
Lecture 05: Component and Discriminant Analysis
Kundan Kumar
https://github.com/erkundanec/PatternClassification
Topics to be covered
Dimensionality Problem
Dimensionality/Feature Reduction
Principal Component Analysis
Linear Discriminant Analysis
Fisher Linear Discriminant
Multiple Discriminant Analysis
Feature Selection
Dimensionality Problem
Introduction
Figure: Two three-dimensional distributions have non-overlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace (here, the two-dimensional x1, x2 subspace or the one-dimensional x1 subspace) there can be greater overlap of the projected distributions, and hence a greater, non-zero Bayes error.
Problems of Dimensionality
Figure: The "training data" (black dots) were selected from a quadratic function plus Gaussian noise, i.e., f(x) = ax^2 + bx + c + ε, where p(ε) ~ N(0, σ^2). The 10th-degree polynomial shown fits the data perfectly, but we desire instead the second-order function f(x), since it would lead to better predictions for new samples.

In fitting the points in the figure, then, we might consider beginning with a high-order polynomial (e.g., 10th order) and successively smoothing or simplifying our model.
Problem of Dimensionality
All of the commonly used classifiers can suffer from the curse of dimensionality.
While an exact relationship between the probability of error, the number of training
samples, the number of features, and the number of parameters is very difficult to
establish, some guidelines have been suggested.
It is generally accepted that using at least ten times as many training samples per
class as the number of features (n/d > 10) is a good practice.
Feature/Dimensionality Reduction
One way of coping with the problem of high dimensionality is to reduce the
dimensionality by combining features.
Issues in feature/dimensionality reduction:
Linear vs. non-linear transformations.
Use of class labels or not (depends on the availability of training data).
Linear combinations are particularly attractive because they are simple to compute
and are analytically tractable.
Linear methods project the high-dimensional data onto a lower dimensional space.
Advantages of these projections include
reduced complexity in estimation and classification,
ability to visually examine the multivariate data in two or three dimensions.
Principal Component Analysis

Consider representing the data by a line through the sample mean in the direction of a unit vector e:

x = m + a e,

where the scalar a corresponds to the (signed) distance of the point x from the mean m.

Writing xk = m + ak e, we can find the optimal set of coefficients ak by minimizing the squared-error criterion function

J1(a1, ..., an, e) = Σ_{k=1}^{n} ||(m + ak e) − xk||^2.

Recognizing that ||e|| = 1, partially differentiating with respect to ak, and setting the derivative to zero, we obtain

ak = e^t (xk − m).

Geometrically, this result merely says that we obtain a least-squares solution by projecting the vector xk onto the line in the direction of e that passes through the sample mean.
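As a quick illustration (a minimal sketch, not part of the original slides; the sample matrix and the direction e below are arbitrary choices), the least-squares coefficients ak are obtained by centering the samples and taking inner products with the unit vector e:

```python
# Minimal sketch: least-squares projection of samples onto a line through the mean.
# The sample matrix X and the unit direction e below are illustrative assumptions.
import numpy as np

X = np.array([[1., 2.], [3., 4.], [4., 3.], [5., 5.], [7., 5.]])  # one sample per row
m = X.mean(axis=0)                    # sample mean
e = np.array([1., 1.]) / np.sqrt(2)   # any unit vector defining the line's direction

a = (X - m) @ e                       # a_k = e^t (x_k - m) for every sample
reconstruction = m + np.outer(a, e)   # points m + a_k e on the line closest to each x_k
print(a)
```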
Minimizing the squared-error criterion with respect to e is equivalent to maximizing e^t S e subject to the constraint ||e|| = 1, where S = Σ_{k=1}^{n} (xk − m)(xk − m)^t is the scatter matrix. Using a Lagrange multiplier λ, let

u = e^t S e − λ(e^t e − 1).

Then

∂u/∂e = 2 S e − 2 λ e,

and setting this gradient to zero gives

S e = λ e.

In particular, because e^t S e = λ e^t e = λ, it follows that to maximize e^t S e we should select the eigenvector corresponding to the largest eigenvalue of the scatter matrix.
This result extends directly from a one-dimensional to a d′-dimensional projection (where d′ ≤ d),

x = m + Σ_{i=1}^{d′} ai ei.

The corresponding squared-error criterion is minimized when the vectors e1, e2, ..., e_{d′} are the d′ eigenvectors of the scatter matrix having the largest eigenvalues.

Because the scatter matrix is real and symmetric, these eigenvectors are orthogonal. The coefficients ai are the components of x in that basis, and are called the principal components.

In matrix form, ai = A^t (xi − m), where e1, e2, ..., e_{d′} are the bases for the subspace (stored as the columns of A) and ai is the projection of xi onto that subspace.
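The derivation translates directly into a few lines of NumPy. The sketch below is an illustration rather than code from the lecture: it forms the scatter matrix, keeps the d′ eigenvectors with the largest eigenvalues as the columns of A, and projects the centered data onto that subspace.

```python
# Minimal PCA sketch via the scatter matrix S = sum_k (x_k - m)(x_k - m)^t.
import numpy as np

def pca(X, d_prime):
    """Return the d' leading eigenvectors (columns of A) and the projected data."""
    m = X.mean(axis=0)
    Xc = X - m                            # centered samples, one per row
    S = Xc.T @ Xc                         # scatter matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)  # ascending eigenvalues for symmetric S
    A = eigvecs[:, ::-1][:, :d_prime]     # keep the d' largest-eigenvalue eigenvectors
    a = Xc @ A                            # principal components a = A^t (x - m)
    return A, a

# Example with random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
A, a = pca(X, d_prime=2)
print(A.shape, a.shape)   # (4, 2) (100, 2)
```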
Example to be solved

Question: Given the following sets of feature vectors belonging to two classes ω1 and ω2, each Gaussian distributed:

(1, 2)^t, (3, 4)^t, (4, 3)^t, (5, 5)^t, (7, 5)^t ∈ ω1

(6, 2)^t, (9, 4)^t, (7, 3)^t, (11, 4)^t, (13, 6)^t ∈ ω2

Find the direction of the line of projection that best represents the data in a one-dimensional feature space.

[Figure: scatter plot of the ten samples in the x1-x2 plane.]
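One way to solve this example numerically (a hedged sketch, not the lecture's worked solution): pool all ten samples, form the scatter matrix, and take the eigenvector with the largest eigenvalue as the direction that best represents the data.

```python
# Sketch: best 1-D representation direction for the pooled example data.
import numpy as np

X = np.array([[1, 2], [3, 4], [4, 3], [5, 5], [7, 5],     # omega_1
              [6, 2], [9, 4], [7, 3], [11, 4], [13, 6]],  # omega_2
             dtype=float)

m = X.mean(axis=0)
S = (X - m).T @ (X - m)            # scatter matrix of the pooled data
eigvals, eigvecs = np.linalg.eigh(S)
e = eigvecs[:, -1]                 # eigenvector with the largest eigenvalue
print(m, e)                        # direction of the best-fitting line through m
```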
Figure: Scatter plot of the iris data. Diagonal cells show the histogram for each feature. Other cells show scatter plots of pairs of features x1, x2, x3, x4 in top-down and left-right order. Red, green, and blue points represent samples from the setosa, versicolor, and virginica classes, respectively.
Figure: Scatter plot of the projection of the iris data onto the first two and the first three principal axes (1st, 2nd, and 3rd eigenvectors e1, e2, e3), with the Iris-setosa, Iris-versicolor, and Iris-virginica classes shown in different colors.
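The projection shown in the figure can be reproduced with a short script. The sketch below assumes scikit-learn and matplotlib, which the lecture does not prescribe.

```python
# Sketch: project the iris data onto its first two principal axes.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
Z = PCA(n_components=2).fit_transform(X)   # rows are [a_1, a_2] for each sample

for label, name in enumerate(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']):
    plt.scatter(Z[y == label, 0], Z[y == label, 1], label=name)
plt.xlabel('1st eigenvector, e1')
plt.ylabel('2nd eigenvector, e2')
plt.legend()
plt.show()
```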
Figure: Projection of the same set of samples onto two different lines in the directions marked w. The figure on the right shows greater separation between the red and black projected points.
Fisher Linear Discriminant

Suppose we have a set of n d-dimensional samples x1, ..., xn, n1 in the subset D1 labeled ω1 and n2 in the subset D2 labeled ω2. If we form the linear combination of the components of x,

y = w^t x,

we obtain a corresponding set of n projected samples y1, y2, ..., yn divided into the subsets Y1 and Y2. Geometrically, each yi is the projection of the corresponding xi onto a line in the direction of w.

We want the projections falling onto this line to be well separated, not thoroughly intermingled. The figure above illustrates the effect of choosing two different values of w for a two-dimensional example. It should be abundantly clear that if the original distributions are multimodal and highly overlapping, even the "best" w is unlikely to provide adequate separation, and thus this method will be of little use.

We now turn to the matter of finding the best such direction w, one we hope will enable accurate classification. A measure of the separation between the projected points is the difference of the sample means. If mi is the d-dimensional sample mean given by

mi = (1/ni) Σ_{x∈Di} x, (73)

then the sample mean for the projected points is given by

m̃i = (1/ni) Σ_{y∈Yi} y = (1/ni) Σ_{x∈Di} w^t x = w^t mi. (74)

It follows that the distance between the projected means is

|m̃1 − m̃2| = |w^t (m1 − m2)|, (75)

and that we can make this difference as large as we wish merely by scaling w. Of course, to obtain good separation of the projected data we really want the difference between the means to be large relative to some measure of the standard deviations for each class. Rather than forming sample variances, we define the scatter for the projected samples labeled ωi by

s̃i^2 = Σ_{y∈Yi} (y − m̃i)^2. (76)

Thus, (1/n)(s̃1^2 + s̃2^2) is an estimate of the variance of the pooled data, and s̃1^2 + s̃2^2 is called the total within-class scatter of the projected samples. The Fisher linear discriminant employs the linear function w^t x for which the criterion function

J(w) = |m̃1 − m̃2|^2 / (s̃1^2 + s̃2^2) (77)

is maximum (and independent of ||w||). This is called Fisher's linear discriminant, with the geometric interpretation that the best projection makes the difference between the projected means as large as possible relative to the variance.

While the w maximizing J(·) leads to the best separation between the two projected sets (in the sense just described), we will also need a threshold criterion before we have a true classifier. We first consider how to find the optimal w, and later turn to the issue of thresholds.

To obtain J(·) as an explicit function of w, we define the scatter matrices Si and the within-class scatter matrix SW by

Si = Σ_{x∈Di} (x − mi)(x − mi)^t (78)

and

SW = S1 + S2. (79)

Then the scatter of the projected samples can be written

s̃i^2 = Σ_{x∈Di} (w^t x − w^t mi)^2 = Σ_{x∈Di} w^t (x − mi)(x − mi)^t w = w^t Si w, (80)

so the sum of these scatters is

s̃1^2 + s̃2^2 = w^t SW w. (81)

Similarly, the separation of the projected means obeys

(m̃1 − m̃2)^2 = (w^t m1 − w^t m2)^2 = w^t (m1 − m2)(m1 − m2)^t w = w^t SB w, (82)

where the between-class scatter matrix is

SB = (m1 − m2)(m1 − m2)^t. (83)

SW is proportional to the sample covariance matrix for the pooled d-dimensional data; it is symmetric and positive semidefinite. SB is also symmetric and positive semidefinite, but because it is the outer product of two vectors, its rank is at most 1.
In terms of SB and SW, the criterion function can be written as

J(w) = |m̃1 − m̃2|^2 / (s̃1^2 + s̃2^2) = (w^t SB w) / (w^t SW w).

The w maximizing J(·) satisfies the generalized eigenvalue problem

SB w = λ SW w,

or, when SW is nonsingular,

SW^{-1} SB w = λ w.

In this particular case it is unnecessary to solve for the eigenvalues and eigenvectors of SW^{-1} SB, because SB w is always in the direction of m1 − m2.
Since the scale factor for w is immaterial, we can immediately write the solution for the w that optimizes J(·):

w = SW^{-1} (m1 − m2). (87)

Note that SW is usually nonsingular if n > d, so this inverse exists in practice.

Thus, we have obtained w for Fisher's linear discriminant: the linear function yielding the maximum ratio of between-class scatter to within-class scatter. (The solution w given by Eq. 87 is sometimes called the canonical variate.) The classification problem has thus been converted from a d-dimensional problem to a hopefully more manageable one-dimensional one. This mapping is many-to-one, and in theory cannot possibly reduce the minimum achievable error rate if we have a very large training set. In general, however, one is willing to sacrifice some of the theoretically attainable performance for the advantages of working in one dimension. All that remains is to find the threshold, i.e., the point along the one-dimensional subspace separating the projected points.

When the conditional densities p(x|ωi) are multivariate normal with equal covariance matrices, the optimal threshold can be computed directly.
Example to be solved

Question: Given the following sets of feature vectors belonging to two classes ω1 and ω2, each Gaussian distributed:

(1, 2)^t, (3, 4)^t, (4, 3)^t, (5, 5)^t, (7, 5)^t ∈ ω1

(6, 2)^t, (9, 4)^t, (7, 3)^t, (11, 4)^t, (13, 6)^t ∈ ω2

Find the direction of the line of projection that best separates the data in a one-dimensional feature space.

[Figure: scatter plot of the ten samples in the x1-x2 plane.]
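A minimal sketch of how the Fisher direction for this example can be computed from w = SW^{-1}(m1 − m2); NumPy is an assumed tool choice, and the code is an illustration rather than the lecture's solution.

```python
# Sketch: Fisher linear discriminant direction for the two-class example above.
import numpy as np

X1 = np.array([[1, 2], [3, 4], [4, 3], [5, 5], [7, 5]], dtype=float)    # omega_1
X2 = np.array([[6, 2], [9, 4], [7, 3], [11, 4], [13, 6]], dtype=float)  # omega_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)          # class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2                          # within-class scatter matrix

w = np.linalg.solve(SW, m1 - m2)      # w = SW^{-1} (m1 - m2)
w /= np.linalg.norm(w)                # scale is immaterial; normalize for readability
print(w)
```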
Multiple Discriminant Analysis

Figure: Three three-dimensional distributions are projected onto two-dimensional subspaces, described by normal vectors W1 and W2. Informally, multiple discriminant methods seek the optimum such subspace, i.e., the one with the greatest separation of the projected distributions for a given total within-scatter matrix; here this is the subspace associated with W1.
For the c-class problem, the natural generalization of Fisher's linear discriminant involves c − 1 discriminant functions. Thus, the projection is from a d-dimensional space to a (c − 1)-dimensional space, and it is tacitly assumed that d ≥ c.

The generalization for the within-class scatter matrix is obvious:

SW = Σ_{i=1}^{c} Si, (90)

where, as before,

Si = Σ_{x∈Di} (x − mi)(x − mi)^t (91)

and

mi = (1/ni) Σ_{x∈Di} x. (92)

The proper generalization for SB is not quite so obvious. Suppose that we define a total mean vector m and a total scatter matrix ST by

m = (1/n) Σ_x x = (1/n) Σ_{i=1}^{c} ni mi (93)

and

ST = Σ_x (x − m)(x − m)^t. (94)
Then it follows that

ST = Σ_{i=1}^{c} Σ_{x∈Di} (x − mi + mi − m)(x − mi + mi − m)^t
   = Σ_{i=1}^{c} Σ_{x∈Di} (x − mi)(x − mi)^t + Σ_{i=1}^{c} Σ_{x∈Di} (mi − m)(mi − m)^t
   = SW + Σ_{i=1}^{c} ni (mi − m)(mi − m)^t. (95)

It is natural to define this second term as a general between-class scatter matrix, so that the total scatter is the sum of the within-class scatter and the between-class scatter:

SB = Σ_{i=1}^{c} ni (mi − m)(mi − m)^t (96)
and

ST = SW + SB. (97)

If we check the two-class case, we find that the resulting between-class scatter matrix is n1 n2/n times our previous definition.*
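The decomposition ST = SW + SB is easy to verify numerically. The sketch below uses randomly generated three-class data purely for illustration; it is not part of the original slides.

```python
# Sketch: numerical check that the total scatter equals within- plus between-class scatter.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = np.repeat([0, 1, 2], 20)               # three classes, 20 samples each

m = X.mean(axis=0)
ST = (X - m).T @ (X - m)                   # total scatter matrix

SW = np.zeros((3, 3))
SB = np.zeros((3, 3))
for k in np.unique(y):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    SW += (Xk - mk).T @ (Xk - mk)
    SB += len(Xk) * np.outer(mk - m, mk - m)

print(np.allclose(ST, SW + SB))            # True
```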
The projection from a d-dimensional space to a (c − 1)-dimensional space is accomplished by c − 1 discriminant functions

yi = wi^t x,   i = 1, ..., c − 1. (98)

If the yi are viewed as components of a vector y and the weight vectors wi are viewed as the columns of a d-by-(c − 1) matrix W, then the projection can be written as a single matrix equation

y = W^t x. (99)

* We could redefine SB for the two-class case to obtain complete consistency, but there should be no misunderstanding of our usage.
The sample mean vectors and scatter matrices for the projected samples are

m̃i = (1/ni) Σ_{y∈Yi} y (100)

m̃ = (1/n) Σ_{i=1}^{c} ni m̃i (101)

S̃W = Σ_{i=1}^{c} Σ_{y∈Yi} (y − m̃i)(y − m̃i)^t (102)

and

S̃B = Σ_{i=1}^{c} ni (m̃i − m̃)(m̃i − m̃)^t. (103)

It is a straightforward matter to show that

S̃W = W^t SW W (104)

and

S̃B = W^t SB W. (105)
These equations show how the within-class and between-class scatter matrices are transformed by the projection to the lower-dimensional space. What we seek is a transformation matrix W that in some sense maximizes the ratio of the between-class scatter to the within-class scatter. A simple scalar measure of scatter is the determinant of the scatter matrix. The determinant is the product of the eigenvalues, and hence is the product of the "variances" in the principal directions, thereby measuring the square of the hyperellipsoidal scattering volume.

Multiple Discriminant Analysis: Solution

Using this measure, we obtain the criterion function

J(W) = |S̃B| / |S̃W| = |W^t SB W| / |W^t SW W|. (106)

The problem of finding a rectangular matrix W that maximizes J(·) is tricky, though fortunately it turns out that the solution is relatively simple. The columns of an optimal W are the generalized eigenvectors that correspond to the largest eigenvalues in

SB wi = λi SW wi. (107)

A few observations about this solution are in order. First, if SW is nonsingular, this can be converted to a conventional eigenvalue problem as before. However, this is actually undesirable, since it requires an unnecessary computation of the inverse of SW.
Multiple Discriminant Analysis: Observation

If SW is nonsingular, the generalized eigenvalue problem can be converted to a conventional eigenvalue problem as before, but computing the inverse of SW is expensive. Instead, one can find the eigenvalues as the roots of the characteristic polynomial

|SB − λi SW| = 0 (108)

and then solve

(SB − λi SW) wi = 0 (109)

directly for the eigenvectors wi.

Because SB is the sum of c matrices of rank one or less, and because only c − 1 of these are independent, SB is of rank c − 1 or less. Thus, no more than c − 1 of the eigenvalues are nonzero, and the desired weight vectors correspond to these nonzero eigenvalues.

If the within-class scatter is isotropic, the eigenvectors are merely the eigenvectors of SB, and the eigenvectors with nonzero eigenvalues span the space spanned by the vectors mi − m. In this special case the columns of W can be found simply by applying the Gram-Schmidt orthonormalization procedure to the c − 1 vectors mi − m, i = 1, ..., c − 1.

Finally, we observe that in general the solution for W is not unique. The allowable transformations include rotating and scaling the axes in various ways. These are all linear transformations from a (c − 1)-dimensional space to a (c − 1)-dimensional space, however, and do not change things in any significant way; in particular, they leave the criterion function J(W) invariant and the classifier unchanged.

If we have very little data, we would tend to project to a subspace of low dimension, while if there is more data, we can use a higher dimension.
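As an illustration (a sketch using SciPy and the iris data, neither of which is supplied with the lecture), the projection matrix W can be obtained by solving the generalized eigenproblem SB wi = λi SW wi and keeping the c − 1 leading eigenvectors.

```python
# Sketch: multiple discriminant analysis on the iris data via the generalized
# eigenproblem SB w = lambda SW w.
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)           # n x d data, class labels 0..c-1
classes = np.unique(y)
d, c = X.shape[1], len(classes)

m = X.mean(axis=0)                          # total mean vector
SW = np.zeros((d, d))                       # within-class scatter
SB = np.zeros((d, d))                       # between-class scatter
for k in classes:
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    SW += (Xk - mk).T @ (Xk - mk)
    SB += len(Xk) * np.outer(mk - m, mk - m)

# eigh solves the symmetric generalized eigenproblem and returns eigenvalues in
# ascending order, so the last c-1 columns are the leading discriminant directions.
eigvals, eigvecs = eigh(SB, SW)
W = eigvecs[:, -(c - 1):]                   # d x (c-1) projection matrix
Y = (X - m) @ W                             # samples projected to the (c-1)-dim subspace
print(Y.shape)                              # (150, 2)
```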
Feature Selection
[Figure: Class-conditional histograms of the four iris features (sepal length, sepal width, petal length, petal width) for the setosa, versicolor, and virginica classes.]

Figure: Scatter plot of the iris data. Off-diagonal cells show scatter plots of pairs of features x1, x2, x3, x4.
Feature Selection Examples

Sequential forward selection:

1. First, the best single feature is selected.

2. Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected.

3. The procedure continues, adding one feature at a time, until the desired number of features has been selected.

[Figure: features selected in order by sequential forward selection in a remote-sensing example: DEM::ELEVATION, IKONOS2::BAND1, IKONOS2::BAND3, IKONOS2::BAND2, IKONOS2::BAND4, IKONOS2_GABOR4::FINE90DEG, IKONOS2_GABOR4::COARSE90DEG, IKONOS2_GABOR4::FINE0DEG, IKONOS2_GABOR4::COARSE0DEG, AERIAL_GABOR1::FINE0DEG, IKONOS2_GABOR1::COARSE0DEG.]
Sequential backward selection:

First, the criterion function is computed for all d features.

Then, each feature is deleted one at a time, the criterion function is computed for all subsets with d − 1 features, and the worst feature is discarded.

Next, each feature among the remaining d − 1 is deleted one at a time, and the worst feature is discarded.

The procedure continues, discarding one feature at a time, until the desired number of features remains.

[Figure: order in which features were discarded by sequential backward selection in the same example: AERIAL_GABOR2::FINE0DEG, AERIAL::BAND2, AERIAL::BAND3, IKONOS3::BAND2, AERIAL_GABOR1::COARSE0DEG, AERIAL_GABOR2::COARSE90DEG, AERIAL_GABOR1::FINE90DEG, IKONOS2_GABOR4::FINE0DEG, IKONOS2::BAND4, IKONOS2_GABOR1::FINE0DEG, IKONOS2_GABOR1::COARSE90DEG, IKONOS2_GABOR1::COARSE0DEG, IKONOS2_GABOR4::COARSE0DEG, IKONOS2::BAND2, IKONOS3::BAND1, IKONOS2_GABOR1::FINE90DEG, IKONOS3::BAND4, IKONOS2_GABOR4::COARSE90DEG, AERIAL_GABOR2::FINE90DEG, AERIAL_GABOR1::FINE0DEG, AERIAL_GABOR2::COARSE0DEG, IKONOS2_GABOR4::FINE90DEG.]
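A minimal sketch of sequential forward selection, assuming cross-validated accuracy of a 3-nearest-neighbor classifier as the criterion function; this is an illustrative assumption, not the criterion used in the remote-sensing example above.

```python
# Sketch: greedy sequential forward selection with a cross-validation criterion.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sequential_forward_selection(X, y, n_select):
    """Greedily add the feature that most improves the criterion."""
    remaining = list(range(X.shape[1]))
    selected = []
    clf = KNeighborsClassifier(n_neighbors=3)
    while len(selected) < n_select:
        scores = []
        for j in remaining:
            feats = selected + [j]
            score = cross_val_score(clf, X[:, feats], y, cv=5).mean()
            scores.append((score, j))
        best_score, best_j = max(scores)      # best single addition this round
        selected.append(best_j)
        remaining.remove(best_j)
        print(f"selected feature {best_j}, criterion = {best_score:.3f}")
    return selected

X, y = load_iris(return_X_y=True)
print(sequential_forward_selection(X, y, n_select=2))
```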
Summary
The choice between feature reduction and feature selection depends on the
application domain and the specific training data.
Feature selection leads to savings in computational costs and the selected features
retain their original physical interpretation.
Feature reduction based on transformations may provide better discriminative ability, but the new features may not have a clear physical meaning.
Assignment Problem
Question:
(a) Given the following sets of feature vectors belonging to two classes ω1 and ω2, each Gaussian distributed:
(1, 2)t , (3, 5)t , (4, 3)t , (5, 6)t , (7, 5)t ∈ ω1
(6, 2)t , (9, 4)t , (10, 1)t , (12, 3)t , (13, 6)t ∈ ω2
The vectors are projected onto a line so that each feature vector is represented by a single feature. Find the direction of the line of projection that maintains the separability of the two classes.
(b) Assuming the mean of the projected points belonging to ω1 to be the origin of the projection line, identify the point on the projection line that optimally separates the two classes. Assume the classes are equally probable and that the projected features also follow Gaussian distributions.
References
[1] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, 2012.