

Generalized Fisher Score for Feature Selection

Quanquan Gu, Zhenhui Li, Jiawei Han
Dept. of Computer Science, University of Illinois at Urbana-Champaign
Urbana, IL 61801, US
[email protected] [email protected] [email protected]

Abstract

Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to its score under the Fisher criterion, which leads to a suboptimal subset of features. In this paper, we present a generalized Fisher score to jointly select features. It aims at finding a subset of features which maximizes the lower bound of the traditional Fisher score. The resulting feature selection problem is a mixed integer programming, which can be reformulated as a quadratically constrained linear programming (QCLP). It is solved by a cutting plane algorithm, in each iteration of which a multiple kernel learning problem is solved alternately by multivariate ridge regression and projected gradient descent. Experiments on benchmark data sets indicate that the proposed method outperforms Fisher score as well as many other state-of-the-art feature selection methods.

1 Introduction

High-dimensional data in the input space is usually not good for classification due to the curse of dimensionality [15]. It significantly increases the time and space complexity for processing the data. Moreover, in the presence of many irrelevant and/or redundant features, learning methods tend to over-fit and become less interpretable. A common way to resolve this problem is feature selection [21, 5], which reduces the dimensionality by selecting a subset of features from the input feature set. It is often used to reduce the computational cost and remove irrelevant and redundant features for problems with high-dimensional data.

Generally speaking, feature selection methods can be categorized into three families: filter-based, wrapper-based and embedded methods [5]. Filter-based methods rank the features as a pre-processing step prior to the learning algorithm, and select those features with high ranking scores. Wrapper-based methods score the features using the learning algorithm that will ultimately be employed. Embedded methods combine feature selection with the learning algorithm. The design of an embedded method is tightly coupled with a specific learning algorithm, which in turn limits its application to other learning algorithms. In our study, we focus on filter-based methods for supervised feature selection.

Filter-based feature selection is usually cast as a binary selection of features which maximizes some performance criterion. In the past decade, a number of performance criteria have been proposed for filter-based feature selection, such as mutual information [10], Fisher score [15], ReliefF [17], Laplacian score [6], the Hilbert Schmidt Independence Criterion (HSIC) [18] and the Trace Ratio criterion [12], among which Fisher score is one of the most widely used criteria for supervised feature selection due to its generally good performance. In detail, given a set of d features, denoted by S, the goal of filter-based feature selection is to choose a subset of m < d features, denoted by T, which maximizes some criterion F,

    T^* = \arg\max_{T \subseteq S} F(T), \quad \mathrm{s.t.}\ |T| = m,    (1)

where |·| is the cardinality of a set. Eq.(1) is a combinatorial optimization problem, and finding its global optimal solution is NP-hard. One common heuristic approach addresses this issue by first computing a score for each feature independently according to the criterion F, and then selecting the top-m ranked features with high scores. However, the features selected by this heuristic algorithm are suboptimal. On the one hand, since the heuristic algorithm computes the score of each feature individually, it neglects combinations of features, i.e., evaluating two or more features together. For example, it could be the case that the scores of feature a and
feature b are both low, but the score of the combination ab is very high. In this case, the heuristic algorithm will discard features a and b, although they should be selected. On the other hand, it cannot handle redundant features. For instance, the scores of feature a and feature b are very high, but they are highly correlated. In this case, the heuristic algorithm will select both feature a and feature b, while either of them could be eliminated without loss in the subsequent learning performance. In fact, many studies have shown that removing redundant features can result in performance improvement [1, 23, 14].

In this paper, in order to overcome the above problems, we present a generalized Fisher score for feature selection. Rather than selecting each feature individually, the proposed method selects a subset of features simultaneously. It aims to find a subset of features which maximizes the lower bound of the traditional Fisher score. It is able to consider combinations of features, and to eliminate redundant features. The resulting feature selection problem is a mixed integer programming, which is further reformulated as a quadratically constrained linear programming (QCLP) [3]. It can be solved by a cutting plane algorithm [8], in each iteration of which a multiple kernel learning problem is solved alternately by multivariate ridge regression and projected gradient descent [16]. Experiments on benchmark data sets indicate that the proposed method outperforms many state-of-the-art feature selection methods.

The remainder of this paper is organized as follows. In Section 2, we briefly review Fisher score. We present the generalized Fisher score in Section 3. The experiments on benchmark data sets are demonstrated in Section 4. Finally, we draw a conclusion in Section 5.

1.1 Notation

The generic problem of supervised feature selection is as follows. Given a data set {(x_i, y_i)}_{i=1}^n where x_i ∈ R^d and y_i ∈ {1, 2, . . . , c}, we aim to find a feature subset of size m which contains the most informative features. We use X = [x_1, x_2, . . . , x_n] ∈ R^{d×n} to represent the data matrix, and x^j denotes the j-th row of X. 1 is a vector of all ones with an appropriate length, 0 is a vector of all zeros, and I is an identity matrix of appropriate size. Without loss of generality, we assume that X has been centered with zero mean, i.e., \sum_{i=1}^n x_i = 0.

2 A Brief Review of Fisher Score

In this section, we briefly review Fisher score [15] for feature selection, and discuss its shortcomings.

The key idea of Fisher score is to find a subset of features such that, in the data space spanned by the selected features, the distances between data points in different classes are as large as possible, while the distances between data points in the same class are as small as possible. In particular, given the selected m features, the input data matrix X ∈ R^{d×n} reduces to Z ∈ R^{m×n}. Then the Fisher score is computed as follows,

    F(Z) = \mathrm{tr}\left\{ \tilde{S}_b (\tilde{S}_t + \gamma I)^{-1} \right\},    (2)

where γ is a positive regularization parameter, \tilde{S}_b is the between-class scatter matrix, and \tilde{S}_t is the total scatter matrix, which are defined as

    \tilde{S}_b = \sum_{k=1}^{c} n_k (\tilde{\mu}_k - \tilde{\mu})(\tilde{\mu}_k - \tilde{\mu})^T,
    \tilde{S}_t = \sum_{i=1}^{n} (z_i - \tilde{\mu})(z_i - \tilde{\mu})^T,    (3)

where \tilde{\mu}_k and n_k are the mean vector and size of the k-th class respectively in the reduced data space, i.e., Z, and \tilde{\mu} = \frac{1}{n}\sum_{k=1}^{c} n_k \tilde{\mu}_k is the overall mean vector of the reduced data. Since \tilde{S}_t is usually singular, we add a perturbation term γI to make it positive definite.

Since there are \binom{d}{m} candidate Z's out of X, the feature selection problem is a combinatorial optimization problem and very challenging. To alleviate the difficulty, the widely used heuristic strategy [15, 6] is to compute a score for each feature independently according to the criterion F. In other words, it only considers x^j ∈ R^{1×n}, so that there are only \binom{d}{1} = d candidates. Specifically, let \mu_k^j and \sigma_k^j be the mean and standard deviation of the k-th class, corresponding to the j-th feature, and let \mu^j and \sigma^j denote the mean and standard deviation of the whole data set corresponding to the j-th feature. Then the Fisher score of the j-th feature is computed as

    F(x^j) = \frac{\sum_{k=1}^{c} n_k (\mu_k^j - \mu^j)^2}{(\sigma^j)^2},    (4)

where (\sigma^j)^2 = \sum_{k=1}^{c} n_k (\sigma_k^j)^2. After computing the Fisher score for each feature, this strategy selects the top-m ranked features with the largest scores. Because the score of each feature is computed independently, the features selected by this heuristic algorithm are suboptimal. More importantly, as mentioned before, the heuristic algorithm fails to select those features which have relatively low individual scores but a very high score when they are combined together as a whole. In addition, it cannot handle redundant features. This motivates us to propose a generalized Fisher score which can resolve these problems.
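To make the heuristic concrete, the following is a minimal NumPy sketch (not part of the original paper) of the per-feature Fisher score in Eq.(4); the function name and the small constant added to the denominator are our own choices.

    import numpy as np

    def fisher_scores(X, y):
        """Per-feature Fisher score of Eq.(4).

        X is the d x n data matrix (features on rows, samples on columns)
        and y is a length-n integer array of class labels. Returns a
        length-d array of scores; larger means more discriminative alone.
        """
        d = X.shape[0]
        mu = X.mean(axis=1)              # overall mean mu^j of each feature
        num = np.zeros(d)                # sum_k n_k (mu_k^j - mu^j)^2
        den = np.zeros(d)                # (sigma^j)^2 = sum_k n_k (sigma_k^j)^2
        for k in np.unique(y):
            Xk = X[:, y == k]
            nk = Xk.shape[1]
            num += nk * (Xk.mean(axis=1) - mu) ** 2
            den += nk * Xk.var(axis=1)
        return num / (den + 1e-12)       # small constant guards constant features

    # The heuristic strategy described above keeps the top-m ranked features:
    # selected = np.argsort(-fisher_scores(X, y))[:m]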
3 The Proposed Method

In this section, we first present an equivalent formulation of Fisher score, based on which we present our method.

3.1 Equivalent Formulation of Fisher Score

We introduce an indicator variable p, where p = (p_1, . . . , p_d)^T and p_i ∈ {0, 1}, i = 1, . . . , d, to represent whether a feature is selected or not. In order to indicate that m features are selected, we constrain p by p^T 1 = m. Then the Fisher score in Eq.(2) can be equivalently formulated as follows,

    F(p) = \mathrm{tr}\left\{ (\mathrm{diag}(p) S_b\, \mathrm{diag}(p)) (\mathrm{diag}(p)(S_t + \gamma I)\mathrm{diag}(p))^{-1} \right\},
    \mathrm{s.t.}\ p \in \{0,1\}^d,\ p^T 1 = m,    (5)

where diag(p) is a diagonal matrix whose diagonal elements are the p_i's, and S_b and S_t are the between-class scatter matrix and total scatter matrix respectively, which are defined as

    S_b = \sum_{k=1}^{c} n_k (\mu_k - \mu)(\mu_k - \mu)^T,
    S_t = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T,    (6)

where \mu_k and n_k are the mean vector and size of the k-th class respectively in the input data space, i.e., X, and \mu = \frac{1}{n}\sum_{k=1}^{c} n_k \mu_k is the overall mean vector of the original data.

3.2 Generalized Fisher Score

However, the problem in Eq.(5) is still not easy to maximize due to its combinatorial nature. In this paper, we turn to maximizing its lower bound as follows.

Theorem 3.1. The optimal value of the objective function in Eq.(5) is lower bounded by the optimal value of the objective function in the following problem,

    F(W, p) = \mathrm{tr}\left\{ (W^T \mathrm{diag}(p) S_b\, \mathrm{diag}(p) W) (W^T \mathrm{diag}(p)(S_t + \gamma I)\mathrm{diag}(p) W)^{-1} \right\},
    \mathrm{s.t.}\ p \in \{0,1\}^d,\ p^T 1 = m,    (7)

where W ∈ R^{d×c}.

Proof. The key is to prove that for any feasible p, the objective function of Eq.(5) is lower bounded by the objective function of Eq.(7). The detailed proof will be included in the longer version of this paper.

We call the feature selection criterion in Eq.(7) the Generalized Fisher Score. It is easy to show that, given p, Eq.(7) can be seen as Regularized Discriminant Analysis (RDA) [15] in the reduced feature space, i.e., diag(p)X, which is a Rayleigh quotient problem [4] (also known as a ratio trace problem) and can be solved by eigen-decomposition. However, when p is also a variable, the problem is difficult to solve. A recent study [22] established the relationship between RDA and the multivariate linear regression problem, which provides a regression-based solution for RDA. This motivates us to solve the problem in Eq.(7) in a similar manner. In the following, we present a theorem which establishes the equivalence relationship between the problem in Eq.(7) and the problem in Eq.(8).

Theorem 3.2. The optimal p that maximizes the problem in Eq.(7) is the same as the optimal p that minimizes the following problem,

    \min_{p, W}\ \frac{1}{2}\| X^T \mathrm{diag}(p) W - H \|_F^2 + \frac{\gamma}{2}\| W \|_F^2,
    \mathrm{s.t.}\ p \in \{0,1\}^d,\ p^T 1 = m,    (8)

where H = [h_1, . . . , h_c] ∈ R^{n×c}, and h_k is a column vector whose i-th entry is given by

    h_{ik} = \begin{cases} \sqrt{\frac{n}{n_k}} - \sqrt{\frac{n_k}{n}}, & \text{if } y_i = k, \\ -\sqrt{\frac{n_k}{n}}, & \text{otherwise.} \end{cases}    (9)

Proof. The sketch of the proof is: for any feasible W, the optimal p that maximizes the problem in Eq.(7) is the same as the optimal p that minimizes the problem in Eq.(8). It is built upon Lemma 4.1 in [22]. The detailed proof will be included in the longer version of this paper.

Note that the above theorem holds under the condition that X is centered with zero mean. It is interesting to note that, in general, the optimal W for the optimization problem in Eq.(7) is different from the optimal W for the problem in Eq.(8).

3.3 The Dual Problem

According to Theorem 3.2, we can solve the multivariate-ridge-regression-like problem in Eq.(8) instead of the ratio trace problem in Eq.(7). Let

    U = X^T \mathrm{diag}(p) W - H.    (10)

The optimization problem in Eq.(8) is equivalent to the following optimization problem,

    \min_{p, W}\ \frac{1}{2}\|U\|_F^2 + \frac{\gamma}{2}\|W\|_F^2,
    \mathrm{s.t.}\ U = X^T \mathrm{diag}(p) W - H,\ p \in \{0,1\}^d,\ p^T 1 = m.    (11)
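Both the regression problem in Eq.(8) and its constrained form in Eq.(11) rely on the label-encoding matrix H defined in Eq.(9). Purely as an illustration (the function name is ours, not the authors'), H can be constructed as follows, assuming the labels are coded 1, . . . , c:

    import numpy as np

    def build_H(y, c):
        """Construct H = [h_1, ..., h_c] in R^{n x c} according to Eq.(9).

        y is a length-n integer array with labels in {1, ..., c}. Entry (i, k)
        equals sqrt(n/n_k) - sqrt(n_k/n) if y_i = k, and -sqrt(n_k/n) otherwise.
        """
        n = len(y)
        H = np.zeros((n, c))
        for k in range(1, c + 1):
            nk = np.sum(y == k)
            H[:, k - 1] = -np.sqrt(nk / n)                      # default: y_i != k
            H[y == k, k - 1] = np.sqrt(n / nk) - np.sqrt(nk / n)
        return H

    # With X centered over samples, this H is the regression target of Eq.(8):
    # minimize 0.5*||X^T diag(p) W - H||_F^2 + 0.5*gamma*||W||_F^2.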
We consider the dual problem of Eq.(11). The Lagrangian function of Eq.(11) is as follows,

    L = \frac{1}{2}\|U\|_F^2 + \frac{\gamma}{2}\|W\|_F^2 - \mathrm{tr}\left(V^T (X^T \mathrm{diag}(p) W - H - U)\right),    (12)

where V is the Lagrangian multiplier. Taking the derivatives of L with respect to U and W and setting them to zero, we obtain

    \frac{\partial L}{\partial U} = U + V = 0,
    \frac{\partial L}{\partial W} = \gamma W - \mathrm{diag}(p) X V = 0.    (13)

It follows that

    U = -V, \qquad W = \frac{1}{\gamma}\mathrm{diag}(p) X V.    (14)

Thus we obtain the following dual problem of Eq.(11),

    \min_{p} \max_{V}\ \mathrm{tr}(V^T H) - \frac{1}{2}\mathrm{tr}\left(V^T \left(\frac{1}{\gamma} X^T \mathrm{diag}(p) X + I\right) V\right),
    \mathrm{s.t.}\ p \in \{0,1\}^d,\ p^T 1 = m.    (15)

For notational simplicity, we denote the objective function in Eq.(15) as

    f(V, p) = \mathrm{tr}(V^T H) - \frac{1}{2}\mathrm{tr}\left(V^T \left(\frac{1}{\gamma}\sum_{j=1}^{d} p_j K_j + I\right) V\right),    (16)

where K_j = (x^j)^T x^j, and we define

    P = \{p \mid p \in \{0,1\}^d,\ p^T 1 = m\}.    (17)

Then Eq.(15) is simplified as

    \min_{p \in P} \max_{V} f(V, p).    (18)

By interchanging the order of \min_{p \in P} and \max_{V} in Eq.(18), we obtain

    \max_{V} \min_{p \in P} f(V, p).    (19)

According to the minimax theorem [9], the optimal value of the objective function in Eq.(18) is an upper bound of that in Eq.(19).

The problem in Eq.(19) is a convex-concave optimization problem, and therefore its optimal solution is the saddle point of f(V, p) subject to the constraint in Eq.(17). Let (V^*, p^*) be the optimum of Eq.(19). For any feasible p and V, we have

    f(V, p^*) \le f(V^*, p^*) \le f(V^*, p).    (20)

To solve the problem in Eq.(19), one possible solution is to relax the indicator variables p_i to [0, 1] and transform it into a multiple kernel learning problem [16]. It involves d base kernels, where d is the number of features. However, when the number of features is very large, e.g., thousands, even the state-of-the-art multiple kernel learning methods [16, 20] cannot handle it. Borrowing the idea used in [11, 19], we introduce an additional variable θ ∈ R; then the problem in Eq.(19) can be reformulated equivalently as follows,

    \max_{V} \max_{\theta}\ -\theta, \quad \mathrm{s.t.}\ \theta \ge -f(V, p^t),\ p^t \in P.    (21)

Note that each p^t ∈ P corresponds to one constraint, so the above optimization problem has \binom{d}{m} constraints. The optimization problem in Eq.(21) is a Quadratically Constrained Linear Programming (QCLP) [3].

Taking the dual of the inner maximization problem in Eq.(21), we obtain the following problem,

    \max_{V} \min_{\lambda \in \Lambda} \sum_{t=1}^{|P|} \lambda_t f(V, p^t) = \min_{\lambda \in \Lambda} \max_{V} \sum_{t=1}^{|P|} \lambda_t f(V, p^t),    (22)

where \Lambda = \{\lambda \mid \sum_t \lambda_t = 1,\ \lambda_t \ge 0\}. The equality holds due to the fact that the objective function is concave in V and convex in λ.

3.4 Alternating Minimization

Eq.(22) can be seen as a multiple kernel learning problem [16]. Following the technique used in state-of-the-art multiple kernel learning [16], we optimize Eq.(22) in an alternating way. In particular, we alternately solve for V given the kernel weights λ, and update the kernel weights λ with V fixed. Let g(\lambda, V) = \sum_{t=1}^{|P|} \lambda_t f(V, p^t). The gradient of g(\lambda, V) with respect to \lambda_t is calculated as

    \nabla_{\lambda_t} g(\lambda, V) = -\frac{1}{2\gamma} \sum_{j=1}^{d} \mathrm{tr}(V^T p_j^t K_j V).    (23)

Then we use projected gradient descent to update λ. The gradient of g(\lambda, V) with respect to V is

    \nabla_V g(\lambda, V) = H - \left(\frac{1}{\gamma}\sum_{t=1}^{|P|}\sum_{j=1}^{d} \lambda_t p_j^t K_j + I\right) V.    (24)

So V has a closed-form solution,

    V = \left(\frac{1}{\gamma}\sum_{t=1}^{|P|} \lambda_t \sum_{j=1}^{d} p_j^t K_j + I\right)^{-1} H,    (25)

which can be obtained by solving a linear system.
One can also solve W in the primal and then compute V based on Eq.(10). The advantage is that the primal problem is a multivariate ridge regression, which can be solved very efficiently via iterative conjugate gradient type algorithms such as LSQR [13].

3.5 Cutting Plane Acceleration

Given P, the above multiple kernel learning problem has optimization variables (V, λ) with \binom{d}{m} constraints, which is impractical to solve. Fortunately, the cutting plane technique [8] enables us to deal with this problem: it keeps a polynomial-sized subset Ω of working constraints and computes the optimal solution to Eq.(22) subject to the constraints in Ω. In detail, the algorithm adds the most violated constraint in Eq.(21) into Ω in each iteration. In this way, a successively strengthening approximation of the original problem is solved, and the algorithm terminates when no constraint in Eq.(21) is violated.

The remaining question is how to find the most violated constraint in each iteration. Since the feasibility of a constraint is measured by the corresponding value of θ, the most violated constraint is the one with the largest θ. Hence, it can be calculated as follows,

    \arg\max_{p \in P} -f(V, p) = \arg\max_{p \in P} \mathrm{tr}\left(V^T \sum_{j=1}^{d} p_j K_j V\right) = \arg\max_{p \in P} \sum_{j=1}^{d} s_j p_j,    (26)

where s_j = x^j V V^T (x^j)^T. Note that its optimal solution can be obtained by first sorting the s_j and then setting the p_j corresponding to the m largest values to 1 and the rest to 0.

We summarize the algorithm for solving the problem in Eq.(22) in Algorithm 1. Note that the final selected features are the union of the features corresponding to each constraint p^t ∈ Ω_T.

Algorithm 1 Generalized Fisher Score for Feature Selection
    Input: C and m;
    Output: V and Ω;
    Initialize V = (1/n) 1_n 1_c^T and t = 1;
    Find the most violated constraint p^1, and set Ω^1 = {p^1};
    repeat
        Initialize λ = (1/t) 1;
        repeat
            Solve for V using Eq.(25) under the current λ;
            Solve for λ using gradient descent as in Eq.(23);
        until convergence
        Find the most violated constraint p^{t+1} and set Ω^{t+1} = Ω^t ∪ {p^{t+1}};
        t = t + 1;
    until convergence

3.6 Theoretical Analysis

The convergence property of Algorithm 1 is stated in the following theorem.

Theorem 3.3. Let (V^*, θ^*) be the global optimal solution of Eq.(21), l_t = \max_{1 \le j \le t} \min_{V} -f(V, p^j) and u_t = \min_{1 \le j \le t} \max_{p \in P} -f(V^j, p). Then

    l_t \le \theta^* \le u_t.    (27)

Moreover, as the number of iterations t increases, the sequence {l_t} is monotonically increasing and the sequence {u_t} is monotonically decreasing.

In each outer iteration, the algorithm needs to find the most violated p and solve a multiple kernel learning problem. The most violated p can be found exactly by taking the m largest of the d coefficients s_j, which takes only O(m log d) time. To solve the multiple kernel learning problem, in each inner iteration we need to solve one multivariate ridge regression problem, which can be solved efficiently by LSQR [13] and scales linearly in the number of training samples n. Hence the time complexity of the multiple kernel learning step is proportional to the complexity of ridge regression. In summary, the total time complexity of the proposed method is O(T(cns + s log m)), where T is the number of iterations needed to converge, s is the average number of nonzero features among the training samples, and c is the number of classes. In our experiments, 10 outer iterations usually suffice for convergence. Thus, the proposed method is computationally very efficient.

4 Experiments

In our experiments, we empirically evaluate the effectiveness of the proposed method. We compare the proposed method to the state-of-the-art feature selection methods: Fisher Score [15], Laplacian Score [6], the Hilbert Schmidt Independence Criterion (HSIC) [18] and the Trace Ratio criterion [12]. Note that the Trace Ratio criterion can use either Fisher-score-like or Laplacian-score-like criteria, so we use Trace Ratio (FS) and Trace Ratio (LS) to represent them respectively. After feature selection, the 1-Nearest Neighbor classifier is used for
classification. The implementations of Laplacian score and the Trace Ratio criterion are downloaded from the authors' websites. For HSIC, we use a linear kernel on both data and labels. The parameters of the compared methods are tuned according to their original papers.

4.1 UCI Data Sets

In the first part of our experiments, we use a subset of the UCI machine learning benchmark data sets [2]: ionosphere, sonar, protein and soybean.

Table 1 summarizes the characteristics of the data sets used in our experiments. All data sets are standardized to be zero-mean and normalized by standard deviation for each dimension. This normalization is also applied to the data used in the rest of our experiments.

Table 1: Description of the UCI data sets
    datasets      #samples   #features   #classes
    ionosphere    351        34          2
    sonar         208        60          2
    protein       116        20          6
    soybean       307        35          19

We randomly choose 50% of the data for training and the rest for testing. Since the training samples are randomly chosen, we repeat this process 20 times and calculate the average result. The number of selected features is set to be 50% of the dimensionality of the data. Note that the number of selected features in GFS is controlled indirectly by m; we gradually increase m to reach the chosen number of selected features. The regularization parameter γ in GFS is tuned by 5-fold cross validation on the training set by searching the grid {50, 100, 200, . . . , 500}. This parameter tuning approach is used throughout our experiments.

The classification results of the feature selection methods are summarized in Table 2. We observe that the proposed generalized Fisher score outperforms the other feature selection methods consistently on all the data sets. The improvement arises from two aspects: (1) GFS is able to consider combinations of features; and (2) it can handle redundant features. We analyze this in more detail in the next part.

In addition, it is very interesting to find that there is no significant difference between Fisher score (or Laplacian score) and the corresponding Trace Ratio criterion. This is consistent with the observation in [24]. It is not surprising, because the Trace Ratio criterion and the Ratio Trace criterion essentially optimize quite similar objective functions. As far as we know, there is no theoretical evidence which supports that one of these two criteria is superior to the other.

Furthermore, although the performance of Fisher score is not as good as that of the proposed method, it is comparable to and even much better than the other feature selection methods on 3 out of 4 data sets. This indicates that Fisher score is still among the state-of-the-art methods. It also implies the superiority of the Fisher criterion for feature selection over the other criteria.

4.2 Face Recognition

In the second part of our experiments, we evaluate the proposed method on the ORL face recognition data set (http://www.cl.cam.ac.uk/Research/DTG/attarchive:pub/data). It contains 10 images for each of the 40 human subjects, which were taken at different times, varying the lighting, facial expressions and facial details. The original images (with 256 gray levels) have size 92 × 112, and are resized to 32 × 32 for efficiency. For each person, we randomly choose 5 images for training and the rest for testing. We repeat this experiment 20 times and calculate the average result.

The face recognition results of the feature selection methods when the number of selected features is 100 are shown in Table 3.

Table 3: Recognition results on the ORL data set when the number of selected features is 100
    Methods            Acc
    HSIC               74.47±3.08
    Fisher Score       86.92±2.76
    Laplacian Score    77.10±2.88
    Trace Ratio(FS)    86.78±3.65
    Trace Ratio(LS)    77.03±2.93
    GFS                88.78±2.82

As can be seen, generalized Fisher score outperforms the other feature selection methods. To take a closer look at the performance with respect to the number of selected features, we plot the recognition accuracy with respect to the number of selected features of all the feature selection methods on the ORL data set in Figure 1. Since the number of selected features for GFS is controlled by m, we increase m gradually from 1 with step size 1 and obtain an increasing number of selected features.

We can see that with only a very small number of features, generalized Fisher score can achieve significantly better results than the other methods. This can be interpreted from two aspects: (1) GFS selects features simultaneously, which considers the discriminative combination of features. For example, suppose feature combination ab has a very high score, while
features a and b have relatively low scores individually. Then GFS can select ab at an early stage, while Fisher score as well as the other feature selection methods would select a and b only at a very late stage (i.e., after quite a lot of low-score features have been selected); and (2) GFS is able to discard redundant features, and as a result it can select as many non-redundant features as possible at an early stage.

Table 2: Classification results on the UCI data sets when 50% of the data are used for training and the number of selected features is set to be 50% of the dimensionality of the data.
    Methods            ionosphere    sonar         protein       soybean
    HSIC               87.97±2.15    81.70±3.61    59.39±8.71    86.54±4.23
    Fisher Score       87.97±1.96    81.31±3.48    67.63±6.77    76.31±3.28
    Laplacian Score    83.29±2.10    80.87±3.51    67.19±6.64    78.10±3.77
    Trace Ratio(FS)    88.23±2.32    81.36±3.19    67.72±6.52    76.54±3.80
    Trace Ratio(LS)    83.66±2.48    81.07±3.50    68.60±6.05    77.91±3.34
    GFS                89.14±2.02    82.33±3.97    69.21±5.87    87.06±2.50

[Figure 1 (plot omitted): Recognition accuracy (%) versus the number of selected features (10-100) on the ORL data set, for HSIC, Fisher score, Laplacian score, Trace Ratio(FS), Trace Ratio(LS) and GFS.]

[Figure 2 (images omitted): The selected features (pixels) by (a) Fisher score and (b) GFS on the ORL data set.]

In order to give an intuitive picture, we display the first 100 features selected by Fisher score and by the proposed GFS in Figure 2. It is shown that the distribution of the features (pixels) selected by Fisher score is highly skewed: most of them lie in the non-face region, which implies that the features selected by Fisher score are not discriminative. In contrast, the features selected by GFS are distributed widely across the face region. Additionally, the selected features (pixels) are asymmetric and hence non-redundant. Since the face image is roughly axially symmetric, one pixel in a pair of axially symmetric pixels is redundant given that the other one is selected. Furthermore, the pixels selected by GFS are mostly around the eyebrows, the corners of the eyes, the nose and the mouth, which, in our experience, are more discriminative for distinguishing face images of different people than the features selected by Fisher score. This is why GFS outperforms Fisher score.

4.3 Digit Recognition

In the third part of our experiments, we evaluate the proposed method on the USPS handwritten digit recognition data set [7]. A popular subset (http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/data.html) containing 2007 16 × 16 handwritten digit images is used in our experiments. We randomly choose 50% of the data for training and the rest for testing. This process is repeated 20 times.

Table 4 summarizes the digit recognition results of the feature selection methods when the number of selected features is 100, while Figure 3 depicts the classification accuracy with respect to the number of selected features of all the feature selection methods on the USPS data set.

Table 4: Recognition results on the USPS data set when the number of selected features is 100
    Methods            Acc
    HSIC               85.61±1.06
    Fisher Score       91.85±0.82
    Laplacian Score    83.90±1.05
    Trace Ratio(FS)    91.83±0.81
    Trace Ratio(LS)    83.33±1.51
    GFS                92.69±1.16

Again, generalized Fisher score performs the best on this data set. Furthermore, GFS obtains very good results even when the number of selected features is very small. For example, with only 10 features, GFS achieves an accuracy of about 80%, while the original Fisher score only reaches roughly 50% accuracy.

[Figure 3 (plot omitted): Recognition accuracy (%) versus the number of selected features (10-100) on the USPS data set, for HSIC, Fisher score, Laplacian score, Trace Ratio(FS), Trace Ratio(LS) and GFS.]

This again strengthens the advantage of the proposed method.

5 Conclusion

In this paper, we presented a generalized Fisher score for feature selection. It finds a subset of features jointly, which maximizes the lower bound of the traditional Fisher score. The resulting feature selection problem is a mixed integer programming, which is reformulated as a quadratically constrained linear programming (QCLP). It can be solved by a cutting plane algorithm, in each iteration of which a multiple kernel learning problem is solved by multivariate ridge regression and projected gradient descent. Experiments on benchmark data sets indicate that the proposed method outperforms many state-of-the-art feature selection methods.

Acknowledgments

The work was supported in part by NSF IIS-09-05215, U.S. Air Force Office of Scientific Research MURI award FA9550-08-1-0265, and the U.S. Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (NS-CTA). We thank the anonymous reviewers for their helpful comments.

References

[1] A. Appice, M. Ceci, S. Rawles, and P. A. Flach. Redundant feature elimination for multi-class problems. In ICML, 2004.
[2] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[4] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, third edition, 1996.
[5] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.
[6] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In NIPS, 2005.
[7] J. J. Hull. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):550-554, 1994.
[8] J. E. Kelley. The cutting-plane method for solving convex programs. Journal of the SIAM, 8:703-712, 1960.
[9] S.-J. Kim and S. Boyd. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM J. on Optimization, 19:1344-1367, November 2008.
[10] D. Koller and M. Sahami. Toward optimal feature selection. In ICML, pages 284-292, 1996.
[11] Y. Li, I. Tsang, J. Kwok, and Z. Zhou. Tighter and convex maximum margin clustering. In AISTATS, 2009.
[12] F. Nie, S. Xiang, Y. Jia, C. Zhang, and S. Yan. Trace ratio criterion for feature selection. In AAAI, pages 671-676, 2008.
[13] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw., 8:43-71, March 1982.
[14] H. Peng, F. Long, and C. H. Q. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell., 27(8):1226-1238, 2005.
[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2001.
[16] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, pages 775-782, 2007.
[17] M. Robnik-Sikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1-2):23-69, 2003.
[18] L. Song, A. J. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In ICML, pages 823-830, 2007.
[19] M. Tan, L. Wang, and I. W. Tsang. Learning sparse SVM for feature selection on very high dimensional datasets. In ICML, pages 1047-1054, 2010.
[20] Z. Xu, R. Jin, I. King, and M. R. Lyu. An extended level method for efficient multiple kernel learning. In NIPS, pages 1825-1832, 2008.
[21] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In ICML, pages 412-420, 1997.
[22] J. Ye, S. Ji, and J. Chen. Learning the kernel matrix in discriminant analysis via quadratically constrained quadratic programming. In KDD, pages 854-863, 2007.
[23] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205-1224, 2004.
[24] Z. Zhao, L. Wang, and H. Liu. Efficient spectral feature selection with minimum redundancy. In AAAI, 2010.