
Pattern Classification
Lecture 05: Component and Discriminant Analysis

Kundan Kumar
https://github.com/erkundanec/PatternClassification

© 2020 Kundan Kumar, All Rights Reserved



Topics to be covered

- Dimensionality Problem
- Dimensionality/Feature reduction
- Principal component analysis
- Linear discriminant analysis
- Fisher linear discriminant
- Multiple discriminant analysis
- Feature selection


Dimensionality Problem


Introduction

- In practical multicategory applications, it is not unusual to encounter problems involving tens or hundreds of features.
- Intuitively, it may seem that each feature is useful for at least some of the discriminations.
- In general, if the performance obtained with a given set of features is inadequate, it is natural to consider adding new features.
- Even though increasing the number of features increases the complexity of the classifier, this may be acceptable for improved performance.


Introduction

Figure: Two three-dimensional distributions have non-overlapping densities, so the Bayes error vanishes in the (x1, x2, x3) space. There is a non-zero Bayes error in the one-dimensional x1 space or the two-dimensional (x1, x2) space, because the projected densities overlap.

Problems of Dimensionality

- Unfortunately, it has frequently been observed in practice that, beyond a certain point, adding new features leads to worse rather than better performance.
- This is called the curse of dimensionality.
- There are two issues that we must be careful about:
  - How is the classification accuracy affected by the dimensionality (relative to the amount of training data)?
  - How is the complexity of the classifier affected by the dimensionality?


Problems of Dimensionality

- Potential reasons for an increase in error include
  - wrong assumptions in model selection,
  - estimation errors due to the finite number of training samples for high-dimensional observations (overfitting).
- Potential solutions include
  - reducing the dimensionality,
  - simplifying the estimation.


Problems of Dimensionality

- Dimensionality can be reduced by
  - redesigning the features,
  - selecting an appropriate subset among the existing features,
  - combining existing features.
- Estimation can be simplified by
  - assuming equal covariance for all classes (for the Gaussian case),
  - using regularization,
  - using prior information and a Bayes estimate,
  - using heuristics such as conditional independence,
  - etc.

Problem of Dimensionality

- A tenth-degree polynomial fits the given training data perfectly, but we do not expect that a tenth-degree polynomial is required here; a straight line could easily be superior.
- In general, reliable interpolation or extrapolation cannot be obtained unless the solution is overdetermined, i.e., there are more points than function parameters to be set.

Figure: The "training data" (black dots) were selected from a quadratic function plus Gaussian noise, i.e., f(x) = ax^2 + bx + c + ε where p(ε) ∼ N(0, σ^2). The 10th-degree polynomial shown fits the data perfectly, but we desire instead the second-order function f(x), since it would lead to better predictions for new samples.

Problem of Dimensionality

- All of the commonly used classifiers can suffer from the curse of dimensionality.
- While an exact relationship between the probability of error, the number of training samples, the number of features, and the number of parameters is very difficult to establish, some guidelines have been suggested.
- It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is good practice.


Feature/Dimensionality Reduction


Component Analysis and Discriminants

- One way of coping with the problem of high dimensionality is to reduce the dimensionality by combining features.
- Issues in feature/dimensionality reduction:
  - linear vs. non-linear transformations,
  - use of class labels or not (depends on the availability of training data).
- Linear combinations are particularly attractive because they are simple to compute and analytically tractable.
- Linear methods project the high-dimensional data onto a lower-dimensional space.
- Advantages of these projections include
  - reduced complexity in estimation and classification,
  - the ability to visually examine the multivariate data in two or three dimensions.


Component Analysis and Discriminants

- Given x ∈ R^d, the goal is to find a linear transformation A that gives y = A^T x, y ∈ R^{d'}, where d' < d.
- Two classical approaches for finding optimal linear transformations are:
  - Principal Component Analysis (PCA): seeks a projection that best represents the data in a least-squares sense.
  - Multiple Discriminant Analysis (MDA): seeks a projection that best separates the data in a least-squares sense.


Principal Component Analysis


Principal Component Analysis

- Given x_1, x_2, ..., x_n ∈ R^d, the goal is to find a d'-dimensional subspace in which the reconstruction error of the x_i is minimized.
- Define the squared-error criterion function J_0(x_0) by
  J_0(x_0) = \sum_{k=1}^{n} \| x_0 - x_k \|^2
  and seek the value of x_0 that minimizes J_0.
- It is simple to show that the solution to this problem is given by x_0 = m, where m is the sample mean
  m = (1/n) \sum_{k=1}^{n} x_k


Principal Component Analysis


- This can be easily verified by writing
  J_0(x_0) = \sum_{k=1}^{n} \| (x_0 - m) - (x_k - m) \|^2
           = \sum_{k=1}^{n} \| x_0 - m \|^2 - 2 \sum_{k=1}^{n} (x_0 - m)^T (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2
           = \sum_{k=1}^{n} \| x_0 - m \|^2 - 2 (x_0 - m)^T \sum_{k=1}^{n} (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2
           = \sum_{k=1}^{n} \| x_0 - m \|^2 + \sum_{k=1}^{n} \| x_k - m \|^2
  where the cross term vanishes because \sum_{k=1}^{n} (x_k - m) = 0.
- Since the second sum is independent of x_0, the expression is obviously minimized by the choice x_0 = m.

Principal Component Analysis

- The sample mean is a zero-dimensional representation of the data set. It is simple, but it does not reveal any of the variability in the data.
- A one-dimensional representation is obtained by projecting the data onto a line running through the sample mean.
- Let e be a unit vector in the direction of the line. Then the equation of the line is
  x = m + a e
  where the real scalar a corresponds to the distance of the point x from the mean m.
- Writing x_k ≈ m + a_k e, we can find the optimal set of coefficients a_k by minimizing the squared-error criterion function.


Principal Component Analysis


- The squared-error criterion function is
  J_1(a_1, a_2, ..., a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2
                             = \sum_{k=1}^{n} \| a_k e - (x_k - m) \|^2
                             = \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2 \sum_{k=1}^{n} a_k e^T (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2
- Recognizing that \|e\| = 1, partially differentiating with respect to a_k, and setting the derivative to zero, we obtain
  a_k = e^T (x_k - m)
- Geometrically, this result merely says that we obtain a least-squares solution by projecting the vector x_k onto the line in the direction of e that passes through the sample mean.

Principal Component Analysis


- The solution to the problem involves the scatter matrix S defined by
  S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^T
- The scatter matrix is n times the sample covariance matrix.
- Substituting a_k = e^T (x_k - m) into the criterion function,
  J_1(e) = \sum_{k=1}^{n} a_k^2 - 2 \sum_{k=1}^{n} a_k^2 + \sum_{k=1}^{n} \| x_k - m \|^2
         = - \sum_{k=1}^{n} [ e^T (x_k - m) ]^2 + \sum_{k=1}^{n} \| x_k - m \|^2
         = - \sum_{k=1}^{n} e^T (x_k - m)(x_k - m)^T e + \sum_{k=1}^{n} \| x_k - m \|^2
         = - e^T S e + \sum_{k=1}^{n} \| x_k - m \|^2


Principal Component Analysis

- The resulting criterion function is
  J_1(e) = - e^T S e + \sum_{k=1}^{n} \| x_k - m \|^2
  so minimizing J_1(e) is equivalent to maximizing e^T S e.
- Use Lagrange multipliers to maximize e^T S e subject to the constraint \|e\| = 1.
- Letting λ be the undetermined multiplier, we differentiate
  u = e^T S e - λ (e^T e - 1)
  ∂u/∂e = 2 S e - 2 λ e
  and setting the derivative to zero gives
  S e = λ e
- In particular, because e^T S e = λ e^T e = λ, it follows that to maximize e^T S e we select the eigenvector corresponding to the largest eigenvalue of the scatter matrix.
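
As a minimal numerical sketch of this result (NumPy; the helper name first_principal_direction is my own, not from the lecture), the best one-dimensional projection direction is simply the leading eigenvector of the scatter matrix:

```python
import numpy as np

def first_principal_direction(X):
    """Return the sample mean m and the unit eigenvector e of the scatter
    matrix with the largest eigenvalue (the best 1-D projection direction)."""
    m = X.mean(axis=0)                    # sample mean
    Xc = X - m                            # centered data
    S = Xc.T @ Xc                         # scatter matrix S = sum_k (x_k - m)(x_k - m)^T
    eigvals, eigvecs = np.linalg.eigh(S)  # S is symmetric; eigenvalues in ascending order
    e = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    return m, e

# Projection coefficients: a_k = e^T (x_k - m), i.e. a = (X - m) @ e
```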


Principal Component Analysis

- To find the best one-dimensional projection of the data (best in the least-sum-of-squared-error sense), we project the data onto a line through the sample mean in the direction of the eigenvector of the scatter matrix having the largest eigenvalue.
- This result can be readily extended from a 1-D to a d'-dimensional projection:
  x = m + \sum_{i=1}^{d'} a_i e_i
  where d' ≤ d.


Principal Component Analysis

- It is not difficult to show that the criterion function
  J_{d'} = \sum_{k=1}^{n} \| ( m + \sum_{i=1}^{d'} a_{ki} e_i ) - x_k \|^2
  is minimized when the vectors e_1, e_2, ..., e_{d'} are the d' eigenvectors of the scatter matrix having the largest eigenvalues.
- Because the scatter matrix is real and symmetric, these eigenvectors are orthogonal.
- The coefficients a_i are the components of x in that basis, and are called the principal components.


Principal Component Analysis


- Given x_1, x_2, ..., x_n ∈ R^d, the goal is to find a d'-dimensional subspace in which the reconstruction error of the x_i is minimized.
- The squared-error criterion function J_0(x_0) is minimized by selecting x_0 = m, where m is the sample mean.
- The sample mean is a zero-dimensional representation of the data set. It is simple, but it does not reveal any of the variability in the data.
- We therefore consider at least a one-dimensional representation of the data by choosing
  x = m + a e
  and computing the optimal value of a such that the squared-error criterion function J_1 is minimized.
- We obtained the solution
  a_k = e^T (x_k - m)

Principal Component Analysis

- Given x_1, x_2, ..., x_n ∈ R^d, the goal is to find a d'-dimensional subspace in which the reconstruction error of the x_i is minimized.
- The criterion function for the reconstruction error can be defined in the least-squares sense as
  J_{d'} = \sum_{k=1}^{n} \| ( m + \sum_{i=1}^{d'} a_{ki} e_i ) - x_k \|^2
  where e_1, e_2, ..., e_{d'} are the basis vectors of the subspace (stored as the columns of A) and a_i is the projection of x_i onto that subspace.


Principal Component Analysis

- It can be shown that J_{d'} is minimized when e_1, e_2, ..., e_{d'} are the eigenvectors corresponding to the d' largest eigenvalues of the scatter matrix
  S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^T
- The coefficients a = (a_1, ..., a_{d'})^T are called the principal components.
- When the eigenvectors are sorted in descending order of the corresponding eigenvalues, the greatest variance of the data lies along the first principal component, the second greatest variance along the second component, and so on.
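
Extending the earlier sketch to d' components (again NumPy; the function name pca is my own), the top-d' eigenvectors of the scatter matrix give the basis and the projections give the principal components:

```python
import numpy as np

def pca(X, d_prime):
    """Project X (n x d) onto its d'-dimensional principal subspace.

    Returns the mean m, the basis E (d x d'), the principal components A (n x d'),
    and the fraction of total scatter carried by each eigenvalue."""
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc                          # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # sort descending by eigenvalue
    E = eigvecs[:, order[:d_prime]]        # top-d' eigenvectors as columns
    A = Xc @ E                             # a_k = E^T (x_k - m), one row per sample
    var_fraction = eigvals[order] / eigvals.sum()
    return m, E, A, var_fraction

# Reconstruction in the subspace: X_hat = m + A @ E.T
```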


Example to be solved
Question: Given the following sets of feature vectors belonging to two classes ω1 and ω2, each Gaussian distributed:
  (1, 2)^T, (3, 4)^T, (4, 3)^T, (5, 5)^T, (7, 5)^T ∈ ω1
  (6, 2)^T, (9, 4)^T, (7, 3)^T, (11, 4)^T, (13, 6)^T ∈ ω2
Find the direction of the line of projection that best represents the data in a one-dimensional feature space.

Figure: Scatter plot of the two classes in the (x1, x2) plane.
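
A quick numerical sketch for this exercise (NumPy; note that PCA ignores the class labels, so the two classes are simply pooled):

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [4, 3], [5, 5], [7, 5],      # omega_1
              [6, 2], [9, 4], [7, 3], [11, 4], [13, 6]],   # omega_2
             dtype=float)

m = X.mean(axis=0)
S = (X - m).T @ (X - m)               # scatter matrix of the pooled data
eigvals, eigvecs = np.linalg.eigh(S)
e1 = eigvecs[:, -1]                   # direction that best represents the data
print("sample mean:", m)
print("best projection direction e1:", e1)
```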

Examples: Iris dataset representation


- The "Iris" dataset is a very famous dataset used for data analysis problems (classification, feature reduction, and many more).
- It is available in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Iris.
- The Iris dataset contains measurements for 150 iris flowers from three different species:
  - Iris-setosa (n1 = 50)
  - Iris-versicolor (n2 = 50)
  - Iris-virginica (n3 = 50)
- The four features in the Iris dataset are:
  - sepal length in cm
  - sepal width in cm
  - petal length in cm
  - petal width in cm


Examples: Iris data representation

Figure: Scatter plot of the iris data. Diagonal cells show the histogram for each feature. Other cells show scatter plots of pairs of features x1, x2, x3, x4 in top-down and left-right order. Red, green, and blue points represent samples from the setosa, versicolor, and virginica classes, respectively.

Examples: Iris data representation


Figure: Scatter plot of the projection of the iris data onto the first two and the first three principal axes (e1, e2, e3), with the Iris-setosa, Iris-versicolor, and Iris-virginica classes shown in different colors.
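
A minimal sketch that reproduces such a projection, assuming scikit-learn is available (the calls below are standard scikit-learn; the figure itself was not generated from this snippet):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)    # 150 samples, 4 features, 3 classes
pca = PCA(n_components=3)            # keep the first three principal axes
Z = pca.fit_transform(X)             # centered data projected onto the principal axes

print("explained variance ratio:", pca.explained_variance_ratio_)
# Z[:, :2] gives the 2-D projection; Z gives the 3-D projection, colored by y when plotted.
```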


Linear Discriminant Analysis


Fisher Linear Discriminant


- PCA seeks directions that are efficient for representation; discriminant analysis seeks directions that are efficient for discrimination.

Figure: Projection of the same set of samples onto two different lines in the directions marked as w. The figure on the right shows greater separation between the red and black projected points.

Fisher Linear Discriminant

- Suppose x_1, x_2, ..., x_n ∈ R^d are divided into two subsets D_1 (n_1 samples) and D_2 (n_2 samples) corresponding to the classes ω1 and ω2, respectively. The goal is to find a projection onto a line, defined as
  y = w^T x
  such that the points corresponding to D_1 and D_2 are well separated.
- The corresponding set of n projected samples y_1, y_2, ..., y_n is divided into the subsets Y_1 and Y_2.

Fisher Linear Discriminant

- A measure of the separation between the projected points is the difference of the sample means. If m_i = (1/n_i) \sum_{x ∈ D_i} x is the d-dimensional sample mean of class ω_i, then the mean of the projected points is
  m̃_i = (1/n_i) \sum_{y ∈ Y_i} y = w^T m_i
  and the distance between the projected means is |m̃_1 - m̃_2| = |w^T (m_1 - m_2)|.
- Rather than forming sample variances, we define the scatter of the projected samples labeled ω_i by
  s̃_i^2 = \sum_{y ∈ Y_i} (y - m̃_i)^2
  so that s̃_1^2 + s̃_2^2 is the total within-class scatter of the projected samples.
- The criterion function for the best separation can then be defined as
  J(w) = |m̃_1 - m̃_2|^2 / (s̃_1^2 + s̃_2^2)
- This is called Fisher's linear discriminant, with the geometric interpretation that the best projection makes the difference between the projected means as large as possible relative to the within-class scatter.

Fisher Linear Discriminant

- To obtain J(·) as an explicit function of w, we define the scatter matrices S_i and the within-class scatter matrix S_W:
  S_i = \sum_{x ∈ D_i} (x - m_i)(x - m_i)^T
  S_W = S_1 + S_2
  S_W is proportional to the sample covariance matrix for the pooled d-dimensional data; it is symmetric and positive semidefinite.
- Then we can write
  s̃_i^2 = \sum_{x ∈ D_i} (w^T x - w^T m_i)^2 = w^T S_i w
  so the sum of the projected scatters is
  s̃_1^2 + s̃_2^2 = w^T S_W w
- Similarly, the separation of the projected means obeys
  (m̃_1 - m̃_2)^2 = (w^T m_1 - w^T m_2)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w
  where the between-class scatter matrix is
  S_B = (m_1 - m_2)(m_1 - m_2)^T

Fisher Linear Discriminant


- Then, the criterion function becomes
  J(w) = |m̃_1 - m̃_2|^2 / (s̃_1^2 + s̃_2^2) = (w^T S_B w) / (w^T S_W w)
- This expression is well known in mathematical physics as the generalized Rayleigh quotient.
- A vector w that maximizes J(·) must satisfy
  S_B w = λ S_W w,   i.e.,   S_W^{-1} S_B w = λ w
- In this particular case, it is unnecessary to solve for the eigenvalues and eigenvectors of S_W^{-1} S_B, because S_B w is always in the direction of m_1 - m_2.

Fisher Linear Discriminant


- Since the scale factor for w is immaterial, we can immediately write the solution that optimizes J(·):
  w = S_W^{-1} (m_1 - m_2)
- Note that S_W is symmetric and positive semidefinite, and it is usually nonsingular if n > d. S_B is also symmetric and positive semidefinite, but its rank is at most 1.
- Thus, we have obtained w for Fisher's linear discriminant: the linear function yielding the maximum ratio of between-class scatter to within-class scatter.
- The classification problem has thus been converted from a d-dimensional one to a hopefully more manageable one-dimensional one; all that remains is to find a threshold along the projection direction.

Example to be solved
Question: Given the following sets of feature vectors belonging to two classes ω1 and ω2, each Gaussian distributed:
  (1, 2)^T, (3, 4)^T, (4, 3)^T, (5, 5)^T, (7, 5)^T ∈ ω1
  (6, 2)^T, (9, 4)^T, (7, 3)^T, (11, 4)^T, (13, 6)^T ∈ ω2
Find the direction of the line of projection that best separates the data in a one-dimensional feature space.

Figure: Scatter plot of the two classes in the (x1, x2) plane.
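
A quick numerical sketch of the Fisher direction for this exercise, computed as w = S_W^{-1}(m_1 - m_2) (NumPy; variable names are my own):

```python
import numpy as np

X1 = np.array([[1, 2], [3, 4], [4, 3], [5, 5], [7, 5]], dtype=float)     # omega_1
X2 = np.array([[6, 2], [9, 4], [7, 3], [11, 4], [13, 6]], dtype=float)   # omega_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)        # class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2                        # within-class scatter matrix

w = np.linalg.solve(SW, m1 - m2)    # Fisher direction, w = SW^{-1} (m1 - m2)
w /= np.linalg.norm(w)              # normalize; the scale of w is immaterial
print("Fisher direction w:", w)
```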


Multiple Discriminant Analysis


Multiple Discriminant Analysis

Figure: Three three-dimensional distributions are projected onto two-dimensional subspaces, described by the normal vectors W1 and W2. Informally, multiple discriminant methods seek the optimum such subspace, i.e., the one with the greatest separation of the projected distributions for a given total within-class scatter.

Multiple Discriminant Analysis

- For the c-class problem, the natural generalization of Fisher's linear discriminant involves c − 1 discriminant functions. Thus, the projection is from a d-dimensional space to a (c − 1)-dimensional space, and it is tacitly assumed that d ≥ c.
- The generalization of the within-class scatter matrix is obvious:
  S_W = \sum_{i=1}^{c} S_i
  where, as before,
  S_i = \sum_{x ∈ D_i} (x - m_i)(x - m_i)^T   and   m_i = (1/n_i) \sum_{x ∈ D_i} x
- The proper generalization for S_B is not quite so obvious. Suppose that we define a total mean vector m and a total scatter matrix S_T by
  m = (1/n) \sum_{x} x = (1/n) \sum_{i=1}^{c} n_i m_i
  S_T = \sum_{x} (x - m)(x - m)^T

Multiple Discriminant Analysis

- Then we can write
  S_T = \sum_{i=1}^{c} \sum_{x ∈ D_i} (x - m_i + m_i - m)(x - m_i + m_i - m)^T
      = \sum_{i=1}^{c} \sum_{x ∈ D_i} (x - m_i)(x - m_i)^T + \sum_{i=1}^{c} \sum_{x ∈ D_i} (m_i - m)(m_i - m)^T
      = S_W + \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T
- It is natural to define this second term as a general between-class scatter matrix, so that the total scatter is the sum of the within-class scatter and the between-class scatter:
  S_T = S_W + S_B
  where
  S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T
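
A small helper sketch for building these matrices from labeled data (NumPy; the function name scatter_matrices is my own):

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (SW) and between-class (SB) scatter matrices for data X (n x d)
    with integer class labels y; SW + SB equals the total scatter matrix ST."""
    m = X.mean(axis=0)                    # total mean vector
    d = X.shape[1]
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        SW += (Xc - mc).T @ (Xc - mc)     # adds the class scatter matrix S_i
        diff = (mc - m).reshape(-1, 1)
        SB += len(Xc) * (diff @ diff.T)   # n_i (m_i - m)(m_i - m)^T
    return SW, SB
```
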
Multiple Discriminant Analysis

- If we check the two-class case, we find that the resulting between-class scatter matrix is n_1 n_2 / n times our previous definition.
- The projection from a d-dimensional space to a (c − 1)-dimensional space is accomplished by c − 1 discriminant functions
  y_i = w_i^T x,   i = 1, ..., c − 1
- If the y_i are viewed as the components of a vector y and the weight vectors w_i are viewed as the columns of a d-by-(c − 1) matrix W, then the projection can be written as a single matrix equation
  y = W^T x



Multiple Discriminant Analysis


- The samples x_1, x_2, ..., x_n project to a corresponding set of samples y_1, y_2, ..., y_n, which can be described by their own mean vectors and scatter matrices. Thus, if we define
  m̃_i = (1/n_i) \sum_{y ∈ Y_i} y
  m̃ = (1/n) \sum_{i=1}^{c} n_i m̃_i
  S̃_W = \sum_{i=1}^{c} \sum_{y ∈ Y_i} (y - m̃_i)(y - m̃_i)^T
  S̃_B = \sum_{i=1}^{c} n_i (m̃_i - m̃)(m̃_i - m̃)^T

Multiple Discriminant Analysis

- It is a straightforward matter to show that
  S̃_W = W^T S_W W   and   S̃_B = W^T S_B W
- These equations show how the within-class and between-class scatter matrices are transformed by the projection to the lower-dimensional space.
- What we seek is a transformation matrix W that in some sense maximizes the ratio of the between-class scatter to the within-class scatter.

Multiple Discriminant Analysis: Solution

- A simple scalar measure of scatter is the determinant of the scatter matrix: the determinant is the product of the eigenvalues, and hence the product of the "variances" in the principal directions, thereby measuring the square of the hyperellipsoidal scattering volume. Using this measure, we obtain the criterion function
  J(W) = |S̃_B| / |S̃_W| = |W^T S_B W| / |W^T S_W W|
- The problem of finding a rectangular matrix W that maximizes J(·) is tricky, though fortunately the solution is relatively simple.
- The columns of an optimal W are the generalized eigenvectors that correspond to the largest eigenvalues in
  S_B w_i = λ_i S_W w_i

Multiple Discriminant Analysis: Observation

the solution is relatively simple. The columns of
an optimal W are the generalized eigenvectors that correspond to the largest eigen-
values in W
(SB − λi S1 W )wi = 0 (109)
W2
directly for the eigenvectors wi . Because SB wi = SBλiisSW the
wisum
. of c matrices of rank one(107) or
less, and
If SWAisfew because only
nonsingular, c −
this 1 of these
can this are independent,
be converted to in S is
a conventional of rank c − 1 or less. Thus,
ifeigenvalue problem as
 B
no more thanobservations
c − 1 of about
the eigenvalues solution
are are
nonzero, order.
and First,
the desiredSW weight
is non-singular,
vectors
before.
Figure
this can4.28:
be Three three-dimensional distributions are projected onto two-dimensional
correspond toconverted
these nonzero to a conventional
eigenvalues. eigenvalue problem
If the within-class as before.
scatter However,the
is isotropic, this
subspaces,
is actually described
undesirable, by a
sincenormal
it vectors
requires an w and w . Informally, multiple discrimi-
 Computationare
eigenvectors of merely
the inversethe eigenvectors of SB , and the eigenvectors with nonzeroof
of SW is expensive. unnecessary
1 2 computation of the inverse
nant
S . methods
Instead, seek
one canthefindoptimum
the such
eigenvalues subspace, i.e., the one with the greatest sepa-
W
eigenvalues
Instead, one span thefindspace spanned by as the as the roots
vectors mofi −of m.
the characteristic
In this special polynomial
case the
thecan thedistributions
eigenvalues for the rootstotal the characteristic polynomial

ration of projected a given within-scatter matrix, here as
columns of W can be found simply by applying the Gram-Schmidt orthonormalization
associated with w .
procedure to the c −1 1 vectors mi −|Sm, B −i λ= i S1, | =c0− 1. Finally, we observe that(108)
W ..., in
general
andthen the
thensolvesolution
solve for W is not unique. The allowable transformations include
 and
rotating and scaling the axes in various ways. These are all linear transformations
from a (c − 1)-dimensional space to(SaB(c−−λi1)-dimensional SW )wi = 0 space, however, and do not (109)
change things in any significant way; in particular, they leave the criterion function
directly for theand
J(W) eigenvectors w . Because S is the sum of c matrices of rank one or
directlyinvariant the classifier
for the eigenvectors wi .i unchanged. B
less,
If we and because
have very only − 1 ofwe
little cdata, these
wouldare tend
independent,
to project rank c −of
SBtoisaofsubspace 1 or
lowless. Thus,
dimen-
no more
sion, while than c −is1more
if there
45/56 of the eigenvalues
data, we can
Kundan are anonzero,
use
Kumar and the desired
higher dimension, weight
as we shall vectors
explore
Pattern Classification
Dimensionality Problem Component Analysis PCA LDA Feature Selection References
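
A hedged sketch of this step, assuming SciPy is available: scipy.linalg.eigh solves the symmetric generalized eigenproblem S_B w = λ S_W w directly, without forming S_W^{-1} (the helper name mda_projection is my own, and S_W is assumed positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(SW, SB, n_components):
    """Columns of W are the generalized eigenvectors of S_B w = lambda S_W w,
    sorted by decreasing eigenvalue (at most c - 1 are meaningful)."""
    eigvals, eigvecs = eigh(SB, SW)           # generalized problem; no explicit inverse of SW
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_components]]   # d x n_components projection matrix W

# Usage sketch, with c the number of classes:
#   SW, SB = scatter_matrices(X, y); W = mda_projection(SW, SB, c - 1); Y = X @ W
```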

Feature Selection


Feature Selection

- Feature reduction uses a linear or non-linear combination of features.
- An alternative to feature reduction is feature selection, which reduces dimensionality by selecting subsets of the existing features.
- Benefits of performing feature selection:
  - avoiding the curse of dimensionality,
  - reducing the computational cost,
  - improving accuracy,
  - avoiding overfitting.
- The first step in feature selection is to define a criterion function, which is often a function of the classification error.
- Note that the use of classification error in the criterion function makes feature selection procedures dependent on the specific classifier used.


Feature Selection

- The most straightforward approach would require
  - examining all \binom{d}{m} possible subsets of size m,
  - selecting the subset that performs best according to the criterion function.
- The number of subsets grows combinatorially, making an exhaustive search impractical.
- There are two main types of feature selection algorithms:
  - wrapper feature selection methods,
  - filter feature selection methods.
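
As a minimal wrapper-style sketch (assuming scikit-learn is available; the classifier and the number of selected features below are arbitrary illustrative choices, not from the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Wrapper method: greedily add the feature that most improves cross-validated
# accuracy of the chosen classifier (direction="backward" removes features instead).
sfs = SequentialFeatureSelector(knn, n_features_to_select=2, direction="forward")
sfs.fit(X, y)
print("selected feature mask:", sfs.get_support())
```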


Examples: Iris data representation


Figure: Histogram plot of the Iris features (sepal length, sepal width, petal length, petal width), with the setosa, versicolor, and virginica classes shown separately.

Examples: Iris data representation

Figure: Scatter plot of the iris data. Off-diagonal cells show scatter plots of pairs of features x1, x2, x3, x4.

Feature Selection
Examples
- Sequential forward selection:
  1. First, the best single feature is selected.
  2. Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected.
  3. Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected.
  4. This procedure continues until all or a predefined number of features are selected.

Figure: Results of sequential forward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%) and the y-axis shows the features added at each iteration.

Feature Selection
Examples
- Sequential backward selection:
  1. First, the criterion function is computed for all d features.
  2. Then, each feature is deleted one at a time, the criterion function is computed for all subsets with d − 1 features, and the worst feature is discarded.
  3. Next, each feature among the remaining d − 1 is deleted one at a time, and the worst feature is discarded to form a subset with d − 2 features.
  4. This procedure continues until one feature or a predefined number of features are left.

Figure: Results of sequential backward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%) and the y-axis shows the features removed at each iteration.
Summary

 The choice between feature reduction and feature selection depends on the
application domain and the specific training data.
 Feature selection leads to savings in computational costs and the selected features
retain their original physical interpretation.
 Feature reduction with transformations may provide a better discriminative ability
but these new features may not have a clear physical meaning.


Assignment Problem

Question:
(a) Given the following sets of feature vectors belonging to two classes ω1 and ω2, each Gaussian distributed:
  (1, 2)^T, (3, 5)^T, (4, 3)^T, (5, 6)^T, (7, 5)^T ∈ ω1
  (6, 2)^T, (9, 4)^T, (10, 1)^T, (12, 3)^T, (13, 6)^T ∈ ω2
The vectors are projected onto a line so that each feature vector is represented by a single feature. Find the direction of the line of projection that best maintains the separability of the two classes.
(b) Assuming the mean of the projected points belonging to ω1 to be the origin of the projection line, identify the point on the projection line that optimally separates the two classes. Assume the classes are equally probable and that the projected features also follow Gaussian distributions.


References

[1] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, 2012.
