
Advanced Computational Biology Project Paper:

Topic: A survey of machine learning-based methodologies for feature selection for gene
expression classification.

Goal/Rationale: To evaluate the effectiveness and consistency of different feature selection
methods applied to a gene expression dataset pertaining to breast cancer metastasis.

Dataset:
The dataset is available from Rosetta Inpharmatics. It contains DNA microarray expression
profiles of primary breast tumours from 295 patients, from which we select features using the
methods listed below. The authors of [2] used this dataset to identify a 70-gene signature for
breast cancer metastasis. We will derive our own signature genes using feature selection. We
select 55 samples from this dataset as training samples and 70 more as test samples; these
numbers are chosen to ensure equal numbers of positive and negative samples in both the
training and test sets.

Feature Selection Methods: Several machine learning methods can be used for feature/gene
selection. Feature selection typically has two components: 1) feature ranking and 2) selecting
the number of top-ranked features to use. We evaluate five different feature ranking methods
in this paper for gene selection.

A: Feature ranking:
For the purposes of this report the following feature ranking methods have been tested. A brief
description of each method is given below.

1) SVM-RFE: This is a popular method for feature selection where ranking is based on each
feature's contribution to the margin in a high-dimensional space [3]. A linear SVM is trained on
the full feature set, and features (or blocks of features) whose weights contribute little to the
margin are ranked lower than those contributing more, then eliminated and the SVM retrained.
Inputs: training examples X0 = [x1, x2, ..., xk, ..., xl]; class labels Y = [y1, y2, ..., yk, ..., yl]
Initialize: subset of surviving features s = [1, 2, ..., n]; feature ranked list r = []
Repeat until s = []:
    Restrict training examples to surviving feature indices: X = X0(:, s)
    Train the classifier: α = SVM-train(X, Y)
    Compute the weight vector of dimension length(s): w = Σ_k α_k y_k x_k
    Compute the ranking criterion: c_i = (w_i)^2 for all i
    Find the feature with the smallest ranking criterion: f = argmin(c)
    Update the feature ranked list: r = [s(f), r]
    Eliminate the feature with the smallest ranking criterion: s = s(1:f-1, f+1:length(s))
Output: feature ranked list r
Algorithm 1: SVM-RFE
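
As a concrete illustration, here is a minimal Python sketch of Algorithm 1 using scikit-learn's LinearSVC. It is a simplified version for exposition (the function name is ours), not the exact implementation used in this project; for ~24,000 genes one would eliminate blocks of features per iteration rather than one at a time.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y):
    """Rank features by recursively eliminating the one with the
    smallest squared SVM weight, as in Algorithm 1 [3]."""
    surviving = list(range(X.shape[1]))     # s: surviving feature indices
    ranked = []                             # r: feature ranked list
    while surviving:
        clf = LinearSVC().fit(X[:, surviving], y)
        c = clf.coef_.ravel() ** 2          # ranking criterion c_i = w_i^2
        f = int(np.argmin(c))               # feature with smallest criterion
        ranked.insert(0, surviving.pop(f))  # r = [s(f), r]; shrink s
    return ranked                           # most relevant feature first
```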
2. Relief is a multivariate feature selection algorithm first presented in [4]. The idea is that a
feature's relevance should grow when it separates vectors from different classes and shrink
when it separates vectors of the same class. The algorithm is described below.

Assign initial weights J to all features: (J(X1), ..., J(Xn)). (Here X1, ..., Xn are the columns of
the feature matrix, each of whose rows is a feature vector denoted x with label Y(x).)
For a randomly selected sample vector x with Y(x) = +, select the nearest sample x+ for which
Y(x+) = + and the nearest sample x- from the set of samples for which Y(x-) = -.
Update J(Xr) ← J(Xr) + µ(|Xr(x) − Xr(x-)| − |Xr(x) − Xr(x+)|), where µ is of the order 1/m
(m = total number of samples) and Xr(x) denotes the value of feature Xr in sample x.
Iterate.
Algorithm 2: Relief
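
A minimal Python sketch of the Relief update above, assuming binary labels and L1 nearest-neighbour distances (the function name and defaults are our own choices):

```python
import numpy as np

def relief(X, y, n_iter=200, seed=0):
    """Minimal Relief sketch [4]: reward features that differ on the
    nearest miss and agree on the nearest hit."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    J = np.zeros(n)                       # feature weights J(X1), ..., J(Xn)
    mu = 1.0 / m                          # update step, of the order 1/m
    for _ in range(n_iter):
        i = rng.integers(m)               # randomly selected sample x
        d = np.abs(X - X[i]).sum(axis=1)  # distances from x to all samples
        d[i] = np.inf                     # exclude x itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))   # nearest same class
        miss = np.argmin(np.where(y != y[i], d, np.inf))  # nearest other class
        J += mu * (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit]))
    return J                              # higher weight = more relevant
```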

3. Linear correlation based feature selection: The idea here is to take one feature at a time
and calculate the Pearson correlation coefficient of that feature with respect to the class labels.
For each feature X and label vector Y this is defined as:

\mathrm{Pearson}(X, Y) = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2}}

A feature with a higher (absolute) correlation coefficient is considered better and is therefore
ranked nearer the top of the list.
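
The per-feature computation vectorizes easily; a sketch (variable names ours):

```python
import numpy as np

def correlation_ranking(X, y):
    """Rank features by the absolute Pearson correlation of each
    column of X with the label vector y."""
    Xc = X - X.mean(axis=0)               # centre each feature column
    yc = y - y.mean()
    num = Xc.T @ yc                       # numerator: per-feature covariance
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = num / den                         # Pearson coefficient per feature
    return np.argsort(-np.abs(r))         # most correlated features first
```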

4. L0-norm minimization: Given features X and labels Y, we attempt to solve the following
minimization problem:

\min_w \|w\|_0 \quad \text{subject to} \quad y_k (w \cdot x_k + b) \ge 1 \ \text{for all } k

Note that the l0-norm simply counts the non-zero components of the vector w. This problem is
known to be NP-hard, so we attempt to solve it by approximating the l0-norm with a smooth
function. One such function, given in [5], is

\|w\|_0 \approx \sum_i \log(\epsilon + |w_i|)

This method finds the weights directly, and the ranking is based on those weights. The idea in
l0-norm minimization is to perform multivariate feature selection while obtaining an extremely
sparse w at the same time.
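
One common way to attack the smoothed objective, in the spirit of the multiplicative-update scheme of [5], is to repeatedly train a linear SVM and rescale each feature by the magnitude of its weight, so that weakly weighted features shrink toward zero. The sketch below is our own simplification of that idea, not the exact algorithm of [5]:

```python
import numpy as np
from sklearn.svm import LinearSVC

def l0_approx_ranking(X, y, n_iter=10):
    """Approximate zero-norm feature ranking via multiplicative
    rescaling (simplified sketch in the spirit of [5])."""
    z = np.ones(X.shape[1])               # per-feature scale factors
    for _ in range(n_iter):
        clf = LinearSVC().fit(X * z, y)   # train on rescaled features
        z *= np.abs(clf.coef_.ravel())    # shrink weakly weighted features
        z /= z.max() + 1e-12              # normalise to avoid underflow
    return np.argsort(-z)                 # largest surviving scale first
```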

5. Feature Selection via Concave Minimization (FSV): This is another method [6] that tries to
approximate the l0-norm above. Here the l0-norm is approximated as

\|w\|_0 \approx \sum_i \left(1 - e^{-\alpha |w_i|}\right)

The idea is the same as in [5]: the sparse weighting leaves very few features with a
non-negligible weight, and ranking is done according to the weights.
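
For intuition, both surrogates penalise any clearly non-zero weight by roughly the same amount, which is what drives w toward sparsity. A tiny numeric illustration (ε and α chosen arbitrarily here):

```python
import numpy as np

w = np.array([0.0, 0.01, 0.5, 2.0])         # example weight vector
eps, alpha = 1e-3, 5.0
log_term = np.log(eps + np.abs(w))          # Weston et al. surrogate [5]
exp_term = 1 - np.exp(-alpha * np.abs(w))   # FSV surrogate [6]
print(log_term.round(2))  # large negative only for near-zero weights
print(exp_term.round(2))  # ~0 for zero weights, saturates near 1 otherwise
```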

B: Selecting the number of features using the ranking obtained:

Typically a cross-validation or bootstrap heuristic is used to select the total number of features
for the final classification, based on the training set. In our case we use bootstrapping to select
the number of features for each of the feature ranking methods described above. The bootstrap
procedure is as follows (a sketch in code is given after the list):
1. Randomly divide the training dataset into bootstrap-train and bootstrap-test subsets.
2. Generate a ranking using the bootstrap-train subset.
3. Calculate classification accuracies on the bootstrap-test subset by successively adding
ranked features.
4. Record the number of features that gave the best classification rate.
5. Repeat the entire procedure k times (we used k = 5).
6. Use the average number of features over the repeated runs.
Heuristic 1: Selecting the number of features from a ranked list
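
A minimal Python sketch of this heuristic. Here rank_fn stands for any of the ranking functions above; the even split, the linear SVC, and the cap on the scanned feature count are assumptions on our part:

```python
import numpy as np
from sklearn.svm import LinearSVC

def choose_n_features(X, y, rank_fn, k=5, max_feats=200, seed=0):
    """Bootstrap heuristic: average, over k random splits, the number
    of top-ranked features that maximises held-out accuracy."""
    rng = np.random.default_rng(seed)
    best_counts = []
    for _ in range(k):
        idx = rng.permutation(len(y))
        tr, te = idx[: len(y) // 2], idx[len(y) // 2:]
        ranking = rank_fn(X[tr], y[tr])        # rank on bootstrap-train only
        accs = []
        for n in range(1, min(max_feats, len(ranking)) + 1):
            cols = list(ranking[:n])           # successively add features
            clf = LinearSVC().fit(X[tr][:, cols], y[tr])
            accs.append(clf.score(X[te][:, cols], y[te]))
        best_counts.append(int(np.argmax(accs)) + 1)
    return round(np.mean(best_counts))         # averaged feature count
```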

C: Combining selected features:

The heuristic used for combining the features selected by the five methods is as follows (a
counting sketch is given below):
1. Compile a list of all features selected by any of the five methods.
2. For each feature index in the list, count how many methods selected that feature/gene.
For this dataset, 1890 unique feature IDs were selected across the five methods; the resulting
Manhattan plot of selection counts is shown in the figure in the Results section.
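
The counting step is straightforward; a sketch where selected_per_method is a list of feature-ID sets, one per ranking method (names ours):

```python
from collections import Counter

def consensus_counts(selected_per_method):
    """For every feature ID, count how many methods selected it."""
    counts = Counter(f for sel in selected_per_method for f in sel)
    return counts  # e.g. [f for f, c in counts.items() if c > 2] -> signature
```

Plotting counts[f] against the feature index f yields the Manhattan plot referred to above.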

D: Evaluation protocol:
A support vector classifier [7] with a linear kernel has been used to evaluate and document the
accuracy of each method on the test set. The accuracy obtained is contrasted with direct
classification without feature selection. The top 500 features selected by each method are also
evaluated for consistency: a consistency table is built and used to pick out features selected by
multiple methods. Finally, an SVM is trained on the features consistently selected by more than
one method, and the accuracy of this classifier is reported.
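
Wiring the pieces together, a hedged sketch of this final step: X_train, y_train, X_test, y_test stand for the 55/70 split from the Dataset section, counts comes from the consensus sketch above, and all names are our own.

```python
from sklearn.svm import LinearSVC

# Consensus signature: features selected by more than one method.
signature = sorted(f for f, c in counts.items() if c > 1)

clf = LinearSVC().fit(X_train[:, signature], y_train)
print("test accuracy:", clf.score(X_test[:, signature], y_test))
```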
Results:
Table 1: Test accuracy on the test set of a linear SVC using features selected on the training set

Method                   | Naïve (no feature selection) | SVM-RFE | Relief | Correlation | L0-norm minimization (Weston) | FSV
Accuracy in testing      | 54%                          | 56%     | 58%    | 66%         | 56%                           | 66%
CV accuracy on test set  | 60%                          | 62%     | 59%    | 65%         | 60%                           | 65%
No. of features selected | 24481                        | 102     | 115    | 116         | 43                            | 60

Table 2: Consistency matrix for feature selection
(Entry denotes the number of features in the top 500 commonly selected by both methods)

            | SVM-RFE | Relief | Correlation | L0-Weston | FSV
SVM-RFE     | -       | 48     | 118         | 261       | 63
Relief      | 48      | -      | 32          | 39        | 14
Correlation | 118     | 32     | -           | 120       | 43
L0-Weston   | 261     | 39     | 120         | -         | 72
FSV         | 63      | 14     | 43          | 72        | -

Combining Features

In our case 1890 unique features were selected. The Manhattan plot is shown below.

We find that three genes are selected by all five methods, and about 133 genes are selected as
relevant by more than two methods. An accuracy of ~64% was obtained on the test set for a
linear SV classifier trained using these 133 genes.
Table 3: The markers selected by more than three methods and the genes they correspond to.

Marker          | Gene (symbol)
X99142          | ALDOC
Contig20600_RC  | CTSL2
NM_002411       | GAD1
NM_019845       | IGFBP5
AJ009936        | IGSF1
Contig55111_RC  | KRT86
D83781          | LHX1
NM_020144       | SCGB2A2
Contig38615_RC  | RNASE1
NM_021069       | TAT
NM_002933       | TSPAN8
AL133566        | SORBS2
NM_012415       | LMO4
NM_005165       | NR1I2
NM_004616       | CXCL14
Contig27749_RC  | CIB2
NM_005568       | NUP160
NM_004887       | RAD54B
NM_006383       | PCSK4
AL080199        | ELOVL2
AF055033        | RPRM
NM_006769       | PAPOLB
Contig27882_RC  |
NM_000353       |
NM_001333       |
Contig51558_RC  |
Contig43791_RC  |
NM_000817       |
NM_001555       |

Conclusion and Perspectives

We can see from Table 1 that all of our feature selection methods select features which increase
classification accuracy. In the case of FSV and correlation, the test accuracy rises from 54% to
66%, i.e. by 12 percentage points, roughly a 22% relative improvement over the naïve
classification accuracy.

However, Table 2 shows that the methods are not very consistent; they select largely different
features. This presents a major challenge from the standpoint of biological interpretation.

The combined selection approach, whose results are presented in Table 3, provides one solution
to this problem. Some of the genes in the table have also been implicated in breast cancer by
several other studies [1][8].
References:

[1] van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL,
van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C,
Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of
breast cancer. Nature. 2002 Jan 31;415(6871):484-5.

[2] M. J. van de Vijver et al. A gene-expression signature as a predictor of survival in breast
cancer. New England Journal of Medicine, 347:1999-2009, 2002.

[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification
using support vector machines. Machine Learning, 46:389-422, 2002.

[4] M. Robnik-Sikonja and I. Kononenko. Theoretical and empirical analysis of Relief and
ReliefF. Machine Learning, 53:23-69, 2003.

[5] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear
models and kernel methods. Journal of Machine Learning Research, 3:1439-1461, March 2003.

[6] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and
support vector machines. In Machine Learning: Proceedings of the Fifteenth International
Conference (ICML '98). Mathematical Programming Technical Report 98-03, February 1998.

[7] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers.
In Fifth Annual Workshop on Computational Learning Theory, pages 144-152. ACM, 1992.

[8] S. Chiu, C. Chen, and T. Lin. Using support vector regression to model the correlation
between the clinical metastases time and gene expression profile for breast cancer. Artificial
Intelligence in Medicine, 44(3):221-231.
