Generalized Fisher Score for Feature Selection
are asymmetric and hence non-redundant. Since the face image is roughly axially symmetric, one pixel in a pair of axially symmetric pixels is redundant given that the other one is selected. Furthermore, the pixels selected by GFS are mostly around the eyebrows, the corners of the eyes, the nose and the mouth, which, in our experience, are more discriminative for distinguishing face images of different people than the features selected by Fisher score. This is why GFS outperforms Fisher score.
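For reference, the baseline ranking methods score each feature independently. Assuming the standard definition of the per-feature Fisher score, with n_k the size of class k, \mu_k^j and \sigma_k^j the mean and standard deviation of class k on the j-th feature, and \mu^j the overall mean of that feature, the score of feature j is

F(j) = \frac{\sum_{k=1}^{c} n_k (\mu_k^j - \mu^j)^2}{\sum_{k=1}^{c} n_k (\sigma_k^j)^2}.

On roughly axially symmetric face images, a pixel and its mirrored counterpart have nearly identical class-conditional means and variances, so they receive nearly identical scores; a ranking that evaluates one feature at a time therefore tends to keep both, even though either one is redundant given the other.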
Figure 1: Recognition results of the feature selection methods (HSIC, Fisher score, Laplacian score, Trace Ratio(FS), Trace Ratio(LS), and GFS) with respect to the number of selected features on the ORL data set.

Suppose, for example, that the combination ab of two features a and b has a high score, while a and b individually have relatively low scores. Then GFS can select ab at an early stage, whereas Fisher score, as well as the other feature selection methods, would select a and b only at a very late stage (i.e., not until quite a lot of low-score features have already been selected); and (2) GFS is able to discard redundant features and can therefore select as many non-redundant features as possible at an early stage.
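To make this concrete, the self-contained NumPy sketch below is an illustration of ours, not an experiment from the paper: the synthetic Gaussian data, the covariance values, and the use of the combination a − b as the "joint" score are all assumptions. It shows two features whose individual Fisher scores are modest while the Fisher score of their combination is an order of magnitude higher, and it shows that an exact duplicate of a feature receives exactly the same individual score, so a one-feature-at-a-time ranking cannot recognize it as redundant.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_score(x, y):
    """Standard per-feature Fisher score: between-class over within-class variance."""
    classes = np.unique(y)
    mu = x.mean()
    num = sum(x[y == c].size * (x[y == c].mean() - mu) ** 2 for c in classes)
    den = sum(x[y == c].size * x[y == c].var() for c in classes)
    return num / den

# Illustrative synthetic data (not from the paper): two classes whose means differ
# along both features, but whose within-class noise is strongly correlated.
n = 2000
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
X1 = rng.multivariate_normal([1.0, -1.0], cov, size=n)
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

print(f"individual score of a: {fisher_score(X[:, 0], y):.2f}")
print(f"individual score of b: {fisher_score(X[:, 1], y):.2f}")

# Score of the combination a - b, along which the correlated noise largely cancels.
z = X @ np.array([1.0, -1.0])
print(f"score of the combination ab: {fisher_score(z, y):.2f}")

# An exact (hence redundant) copy of feature a gets exactly the same individual
# score, so a per-feature ranking would happily select both.
print(f"score of an exact copy of a: {fisher_score(X[:, 0].copy(), y):.2f}")
```

With these settings the two individual scores come out around 0.25 each, while the combination scores around 10, which is the sense in which features that look weak one at a time can be strong as a pair.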
4.3 Digit Recognition

In the third part of our experiments, we evaluate the proposed method on the USPS handwritten digit recognition data set [7]. A popular subset containing 2007 16 × 16 handwritten digit images is used in our experiments. We randomly choose 50% of the data for training and the rest for testing. This process is repeated 20 times. Table 4 summarizes the digit recognition results of the feature selection methods when the number of selected features is 100, while Figure 3 depicts the classification accuracy with respect to the number of selected features of all the feature selection methods on the USPS data set.
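As a rough sketch of the evaluation protocol described above (random 50/50 train/test splits repeated 20 times, accuracy measured for a fixed number of selected features), the snippet below runs the per-feature Fisher-score baseline. It is only an approximation: scikit-learn's small 8 × 8 digits data set stands in for the USPS subset, and the 1-nearest-neighbour classifier and the choice of 30 features are our assumptions, since the exact classifier is not restated in this section.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def fisher_scores(X, y):
    """Per-feature Fisher scores (between-class over within-class variance)."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += Xc.shape[0] * (Xc.mean(axis=0) - overall) ** 2
        den += Xc.shape[0] * Xc.var(axis=0)
    return num / (den + 1e-12)  # guard against constant features

# Stand-in data set: 8x8 digits instead of the 16x16 USPS subset used in the paper.
X, y = load_digits(return_X_y=True)

n_repeats, k = 20, 30  # 20 random 50/50 splits, keep the top-30 features
accs = []
for seed in range(n_repeats):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    top_k = np.argsort(fisher_scores(X_tr, y_tr))[::-1][:k]  # rank on training data only
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr[:, top_k], y_tr)
    accs.append(clf.score(X_te[:, top_k], y_te))

print(f"mean accuracy with {k} features over {n_repeats} splits: {np.mean(accs):.3f}")
```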
Figure 3: Recognition results of the feature selection methods (HSIC, Fisher score, Laplacian score, Trace Ratio(FS), Trace Ratio(LS), and GFS) with respect to the number of selected features on the USPS data set.

This again demonstrates the advantage of the proposed method.

5 Conclusion
In this paper, we presented a generalized Fisher score for feature selection. It finds a subset of features jointly, which maximizes the lower bound of the traditional Fisher score. The resulting feature selection problem is a mixed integer program, which is reformulated as a quadratically constrained linear program (QCLP). It can be solved by a cutting plane algorithm, in each iteration of which a multiple kernel learning problem is solved by multivariate ridge regression and projected gradient descent. Experiments on benchmark data sets indicate that the proposed method outperforms many state-of-the-art feature selection methods.
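As a purely generic illustration of the cutting plane idea [8] mentioned above (Kelley's method on a toy one-dimensional convex problem, not the paper's QCLP solver or its multiple kernel learning subproblem), the sketch below repeatedly tightens a piecewise-linear lower bound built from gradients and re-solves a small linear program until the lower bound meets the best objective value found.

```python
import numpy as np
from scipy.optimize import linprog

# Toy convex objective and its gradient; a cutting plane method only needs
# function values and (sub)gradients at the points it has visited.
def f(x):
    return (x - 2.0) ** 2 + 1.0

def grad(x):
    return 2.0 * (x - 2.0)

lo, hi = -10.0, 10.0      # box constraint on x
x_curr, cuts = -8.0, []   # starting point and the accumulated cuts
best = np.inf

for it in range(50):
    fx, gx = f(x_curr), grad(x_curr)
    best = min(best, fx)
    # Cut: t >= f(x_i) + g_i * (x - x_i)  <=>  g_i * x - t <= g_i * x_i - f(x_i)
    cuts.append((gx, gx * x_curr - fx))

    # Minimize t over (x, t) subject to all cuts and the box on x.
    A_ub = np.array([[g, -1.0] for g, _ in cuts])
    b_ub = np.array([b for _, b in cuts])
    res = linprog(c=[0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(lo, hi), (None, None)])
    x_curr, lower_bound = res.x[0], res.x[1]

    if best - lower_bound < 1e-6:  # lower bound has caught up with the best value
        break

print(f"iterations: {it + 1}, x ~ {x_curr:.4f}, f(x) ~ {f(x_curr):.6f}")
```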
Acknowledgments

The work was supported in part by NSF IIS-09-05215, U.S. Air Force Office of Scientific Research MURI award FA9550-08-1-0265, and the U.S. Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (NS-CTA). We thank the anonymous reviewers for their helpful comments.
References

[1] A. Appice, M. Ceci, S. Rawles, and P. A. Flach. Redundant feature elimination for multi-class problems. In ICML, 2004.
[2] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[4] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, third edition, 1996.
[7] J. J. Hull. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):550-554, 1994.
[8] J. E. Kelley. The cutting plane method for solving convex programs. Journal of the SIAM, 8:703-712, 1960.
[9] S.-J. Kim and S. Boyd. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM Journal on Optimization, 19:1344-1367, November 2008.
[10] D. Koller and M. Sahami. Toward optimal feature selection. In ICML, pages 284-292, 1996.
[11] Y. Li, I. Tsang, J. Kwok, and Z. Zhou. Tighter and convex maximum margin clustering. In AISTATS, 2009.
[12] F. Nie, S. Xiang, Y. Jia, C. Zhang, and S. Yan. Trace ratio criterion for feature selection. In AAAI, pages 671-676, 2008.
[13] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw., 8:43-71, March 1982.
[14] H. Peng, F. Long, and C. H. Q. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell., 27(8):1226-1238, 2005.
[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience Publication, 2001.
[16] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, pages 775-782, 2007.
[17] M. Robnik-Sikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1-2):23-69, 2003.
[18] L. Song, A. J. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In ICML, pages 823-830, 2007.
[19] M. Tan, L. Wang, and I. W. Tsang. Learning sparse SVM for feature selection on very high dimensional datasets. In ICML, pages 1047-1054, 2010.
[20] Z. Xu, R. Jin, I. King, and M. R. Lyu. An extended level method for efficient multiple kernel learning. In NIPS, pages 1825-1832, 2008.
[21] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In ICML, pages 412-420, 1997.
[22] J. Ye, S. Ji, and J. Chen. Learning the kernel matrix in discriminant analysis via quadratically constrained quadratic programming. In KDD, pages 854-863, 2007.
[23] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205-1224, 2004.
[24] Z. Zhao, L. Wang, and H. Liu. Efficient spectral feature selection with minimum redundancy. In AAAI, 2010.