Topic: A survey of machine learning based methodologies for feature selection for gene
expression classification.
Dataset:
The dataset is available from Rosetta Inpharmatics (here). It contains DNA microarray expression profiles of primary breast tumours from 295 patients, from which we select features using the methods listed below. The authors of [2] used this dataset to identify a 70-gene signature for breast cancer metastasis; we derive our own signature genes using feature selection. We select 55 samples from this dataset as training samples and 70 more as test samples. These numbers are chosen so that the training and test sets contain equal numbers of positive and negative samples.
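For concreteness, a minimal sketch of how such a class-balanced split could be drawn is shown below; the placeholder label array and the exact per-class counts are illustrative assumptions, not the authors' actual preprocessing.

```python
# Sketch of a class-balanced train/test split (label array and counts are illustrative).
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=295)          # placeholder for the 295 metastasis labels

pos = np.where(labels == 1)[0]                 # metastasis-positive sample indices
neg = np.where(labels == 0)[0]                 # metastasis-negative sample indices
rng.shuffle(pos)
rng.shuffle(neg)

# 55 training samples and 70 test samples, keeping the classes (roughly) balanced
train_idx = np.concatenate([pos[:28], neg[:27]])
test_idx = np.concatenate([pos[28:63], neg[27:62]])
```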
Feature Selection Methods: There are several machine learning methods that can be used for feature (gene) selection. Feature selection typically has two components: 1) feature ranking and 2) selecting the number of top-ranked features to use. We evaluate five different feature ranking methods in this paper for gene selection.
A: Feature ranking:
For the purposes of this report the following feature ranking methods have been tested. A brief description of each method is given below.
1) SVM-RFE: This is a popular method for feature selection where the ranking is based on the margin contribution of each feature in a high-dimensional space [3]. A linear SVM is trained on the entire dataset, and blocks of features with low individual margin contributions are ranked lower than those with higher contributions.
Inputs: Training examples X0 = [x1, x2, ..., xk, ..., xl]
Class labels Y = [y1, y2, ..., yk, ..., yl]
Initialize:
Subset of surviving features s = [1, 2, ..., n]
Feature ranked list r = [ ]
Repeat until s = [ ]:
Restrict training examples to surviving feature indices: X = X0(:, s)
Train the classifier: a = SVM-train(X, Y)
Compute the weight vector: w = sum_k a_k y_k x_k
Compute the ranking criteria: c_i = (w_i)^2 for all i
Find the feature with the smallest ranking criterion: f = argmin(c)
Update the ranked list: r = [s(f), r]
Eliminate that feature: s = s(1:f-1, f+1:length(s))
Output: Feature ranked list r
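As an illustration, the same recursive elimination can be run with scikit-learn's RFE wrapper around a linear SVM. This is only a minimal sketch: the placeholder data, the elimination step size, and the number of features kept are illustrative choices, not those used in this report.

```python
# Minimal SVM-RFE sketch using scikit-learn (step size and target count are illustrative).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(55, 1000))      # placeholder for the gene expression matrix
y_train = rng.choice([-1, 1], size=55)     # placeholder metastasis labels

svc = LinearSVC(C=1.0, max_iter=10000)
# step=0.1 removes 10% of the remaining features per iteration (block elimination)
rfe = RFE(estimator=svc, n_features_to_select=100, step=0.1)
rfe.fit(X_train, y_train)

ranking = np.argsort(rfe.ranking_)         # feature indices ordered by RFE rank
selected = np.where(rfe.support_)[0]       # the surviving (rank-1) features
```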
3) Linear correlation based feature selection: The idea here is to take one feature at a time and calculate the Pearson correlation coefficient of that feature with respect to the class labels. This is defined as follows:
For each feature $X$ and label vector $Y$:
$$\mathrm{Pearson}(X,Y)=\frac{\sum_i \left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_i \left(X_i-\bar{X}\right)^2 \; \sum_i \left(Y_i-\bar{Y}\right)^2}}$$
A feature with a higher (absolute) correlation coefficient is considered better and therefore receives a lower rank number, i.e. it is placed nearer the top of the ranking.
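A minimal sketch of this ranking is given below, assuming X is the samples-by-genes training matrix and y the label vector (both names are placeholders):

```python
# Sketch of correlation-based feature ranking (variable names are illustrative).
import numpy as np

def pearson_rank(X, y):
    """Rank features by absolute Pearson correlation with the class labels."""
    Xc = X - X.mean(axis=0)                      # centre each gene
    yc = y - y.mean()                            # centre the labels
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = num / (den + 1e-12)                   # guard against constant genes
    return np.argsort(-np.abs(corr))             # best-correlated features first

# ranking = pearson_rank(X_train, y_train)
```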
4) $\ell_0$-norm minimization: Given features $X$ and labels $Y$, we attempt to solve the following minimization problem:
$$\min_{w,b}\ \ell_0(w) \quad \text{subject to}\ \ y_k \left(w \cdot x_k + b\right) \ge 1 \ \ \forall k$$
Note that the $\ell_0$ norm simply counts the number of non-zero components of the vector $w$. This problem is known to be NP-hard, so we attempt to solve it by approximating the $\ell_0$ norm with a smooth function. One such approximation, given in [5], is
$$\ell_0(w) \approx \sum_i \log\left(\epsilon + |w_i|\right)$$
This method finds the weights directly, and the ranking is based on those weights. The idea in $\ell_0$-norm minimization is to perform multivariate feature selection while keeping $w$ extremely sparse at the same time.
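One way to carry out this minimization in practice is the multiplicative feature-rescaling procedure of [5]: repeatedly train a linear SVM and shrink each feature by its learned weight. The sketch below follows that spirit; the iteration count, tolerance, and use of scikit-learn's LinearSVC are illustrative assumptions.

```python
# Sketch of approximate zero-norm minimization via iterative feature rescaling
# (in the spirit of [5]; iteration count and tolerance are illustrative).
import numpy as np
from sklearn.svm import LinearSVC

def zero_norm_rank(X, y, n_iter=20, tol=1e-6):
    z = np.ones(X.shape[1])                     # multiplicative scaling per feature
    for _ in range(n_iter):
        svc = LinearSVC(C=1.0, max_iter=10000)
        svc.fit(X * z, y)                       # train on the rescaled features
        z = z * np.abs(svc.coef_.ravel())       # shrink features with small weights
        z[z < tol] = 0.0                        # features driven to zero are dropped
    return np.argsort(-z)                       # rank features by their final scaling

# ranking = zero_norm_rank(X_train, y_train)
```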
5) Feature Selection via Concave minimization (FSV): This is another method [6] that approximates the $\ell_0$ norm above. Here the $\ell_0$ norm is approximated as
$$\ell_0(w) \approx \sum_i \left(1 - \exp\left(-\alpha |w_i|\right)\right)$$
The idea is the same as in [5]: the sparse weighting leaves only a few components of $w$ with appreciable weight, and ranking is done according to these weights.
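The original FSV algorithm solves this by successive linear approximation of the concave penalty. As a rough illustration only, the sketch below substitutes an iteratively reweighted L1-penalised linear SVM; the hyperparameters alpha and C and the feature-rescaling trick are assumptions for this sketch, not the linear-programming formulation of [6].

```python
# Rough FSV-style ranking via iteratively reweighted L1 (hyperparameters are illustrative).
import numpy as np
from sklearn.svm import LinearSVC

def fsv_rank(X, y, alpha=5.0, n_iter=10):
    w = np.ones(X.shape[1])
    for _ in range(n_iter):
        v = alpha * np.exp(-alpha * np.abs(w))          # linearize the concave penalty
        v = np.maximum(v, 1e-8)                         # numerical floor
        svc = LinearSVC(penalty='l1', dual=False, C=1.0, max_iter=10000)
        svc.fit(X / v, y)                               # weighted L1 via feature rescaling
        w = svc.coef_.ravel() / v                       # weights on the original scale
    return np.argsort(-np.abs(w))                       # rank by final weight magnitude

# ranking = fsv_rank(X_train, y_train)
```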
D: Evaluation protocol:
A support vector classifier [7] with a linear kernel is used to evaluate and document the accuracy of each method on the test set. The accuracy obtained is contrasted with direct classification without feature selection. In addition, the top 500 features selected by each method are evaluated for consistency: a consistency table is built and used to pick out features selected by several methods. Finally, an SVM is trained on the features that have been consistently selected by more than one method, and the accuracy of this classifier is reported.
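A minimal sketch of this protocol is given below; the ranking names, the placeholder rankings, and the choice of k are illustrative assumptions.

```python
# Sketch of the evaluation and consistency protocol (placeholder rankings and k).
import numpy as np
from sklearn.svm import LinearSVC

def top_k_accuracy(ranking, k, X_train, y_train, X_test, y_test):
    """Test accuracy of a linear SVC trained on the top-k ranked features."""
    top = ranking[:k]
    svc = LinearSVC(C=1.0, max_iter=10000).fit(X_train[:, top], y_train)
    return svc.score(X_test[:, top], y_test)

def consistency(rankings, k=500):
    """Count, for each feature, how many methods placed it in their top k."""
    counts = {}
    for ranking in rankings.values():
        for g in ranking[:k]:
            counts[g] = counts.get(g, 0) + 1
    return counts

# Placeholder rankings; in practice these come from the methods described above.
rng = np.random.default_rng(0)
rankings = {name: rng.permutation(24481) for name in ("svm_rfe", "corr", "l0")}
counts = consistency(rankings, k=500)
consensus = [g for g, c in counts.items() if c > 1]    # selected by more than one method
```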
Results:
Table 1: Test accuracy on the test set of a linear SVC using features selected on the training set

Method                   | Naïve (no feature selection) | SVM-RFE | Relief | Correlation | L0-norm minimization (Weston) | FSV
Accuracy in testing      | 54%                          | 56%     | 58%    | 66%         | 56%                           | 66%
CV accuracy on test set  | 60%                          | 62%     | 59%    | 65%         | 60%                           | 65%
No. of features selected | 24481                        | 102     | 115    | 116         | 43                            | 60
Combining Features
In our case 1890 features were selected in total. The Manhattan plot is shown below.
We find that three genes are selected by all three classifiers, and about 133 genes are marked as relevant by more than two classifiers. An accuracy of approximately 64% was obtained in testing for a linear SV classifier trained using these 133 genes.
Table 3: The markers selected by more than 3 classifiers and the genes they correspond to
are presented in the table.
We can see from Table 1 that all of our feature selection methods select features that increase classification accuracy. In the case of FSV and correlation, the accuracy increases by 12 percentage points (from 54% to 66%), a relative improvement of more than 20% over the naïve classification accuracy.
However, Table 2 shows that the methods are not very consistent: they select largely different sets of features. This presents a major challenge from the standpoint of biological interpretation.
The combined selection approach, whose results are presented in Table 3, provides one solution to this problem. Some of the genes in the table have also been implicated in breast cancer by several other studies [1][8].
References:
[1] L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):484–485, 2002.
[2] M.J. van de Vijver et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347:1999–2009, 2002.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.
[4] M. Robnik-Šikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53:23–69, 2003.
[5] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439–1461, 2003.
[6] P.S. Bradley and O.L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98). Also available as Mathematical Programming Technical Report 98-03, February 1998.
[7] B.E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM, 1992.
[8] S. Chiu, C. Chen, and T. Lin. Using support vector regression to model the correlation between the clinical metastases time and gene expression profile for breast cancer. Artificial Intelligence in Medicine, 44(3):221–231.