Topic: A survey of machine learning based methodologies for feature selection for gene
expression classification.
Dataset:
The dataset is available from Rosetta Inpharmatics (here). It contains DNA microarray expression profiles of primary breast tumours from 295 patients, from which we select features using the methods listed below. The authors of [2] used this dataset to identify a 70-gene signature for breast cancer metastasis; we derive our own signature genes using feature selection. We select 55 samples from this dataset as training samples and 70 more as test samples. These numbers are chosen so that the training and test sets contain equal numbers of positive and negative samples.
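For concreteness, a minimal sketch of how such a class-balanced split could be drawn is shown below; the placeholder label array and the exact per-class counts are illustrative assumptions, not the authors' actual preprocessing.

```python
# Sketch of a class-balanced train/test split (label array and counts are illustrative).
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=295)          # placeholder for the 295 metastasis labels

pos = np.where(labels == 1)[0]                 # metastasis-positive sample indices
neg = np.where(labels == 0)[0]                 # metastasis-negative sample indices
rng.shuffle(pos)
rng.shuffle(neg)

# 55 training samples and 70 test samples, keeping the classes (roughly) balanced
train_idx = np.concatenate([pos[:28], neg[:27]])
test_idx = np.concatenate([pos[28:63], neg[27:62]])
```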
Feature Selection Methods: There are several machine learning methods that can be used for feature (gene) selection. Feature selection typically has two components: 1) feature ranking and 2) selecting the number of top-ranked features to use. We evaluate five different feature ranking methods in this paper for gene selection.
A: Feature ranking:
For the purposes of this report the following feature ranking methods have been tested. A brief description of each method is given below.
1) SVM-RFE: This is a popular method for feature selection where the ranking is based on the margin contribution of each feature in a high-dimensional space [3]. A linear SVM is trained on the entire dataset, and blocks of features with low individual margin contributions are ranked lower than those with higher contributions.
Inputs: Training examples X0 = [x1, x2, ..., xk, ..., xl]
Class labels Y = [y1, y2, ..., yk, ..., yl]
Initialize:
Subset of surviving features s = [1, 2, ..., n]
Feature ranked list r = [ ]
Repeat until s = [ ]:
Restrict training examples to surviving feature indices: X = X0(:, s)
Train the classifier: a = SVM-train(X, Y)
Compute the weight vector: w = sum_k a_k y_k x_k
Compute the ranking criteria: c_i = (w_i)^2 for all i
Find the feature with the smallest ranking criterion: f = argmin(c)
Update the ranked list: r = [s(f), r]
Eliminate that feature: s = s(1:f-1, f+1:length(s))
Output: Feature ranked list r
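As an illustration, the same recursive elimination can be run with scikit-learn's RFE wrapper around a linear SVM. This is only a minimal sketch: the placeholder data, the elimination step size, and the number of features kept are illustrative choices, not those used in this report.

```python
# Minimal SVM-RFE sketch using scikit-learn (step size and target count are illustrative).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(55, 1000))      # placeholder for the gene expression matrix
y_train = rng.choice([-1, 1], size=55)     # placeholder metastasis labels

svc = LinearSVC(C=1.0, max_iter=10000)
# step=0.1 removes 10% of the remaining features per iteration (block elimination)
rfe = RFE(estimator=svc, n_features_to_select=100, step=0.1)
rfe.fit(X_train, y_train)

ranking = np.argsort(rfe.ranking_)         # feature indices ordered by RFE rank
selected = np.where(rfe.support_)[0]       # the surviving (rank-1) features
```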
3) Linear correlation based feature selection: The idea here is to take one feature at a time and calculate the Pearson correlation coefficient of that feature with respect to the class labels. This is defined as follows:
For each feature $X$ and label vector $Y$:
$$\mathrm{Pearson}(X,Y)=\frac{\sum_i \left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_i \left(X_i-\bar{X}\right)^2 \; \sum_i \left(Y_i-\bar{Y}\right)^2}}$$
A feature with a higher (absolute) correlation coefficient is considered better and therefore receives a lower rank number, i.e. it is placed nearer the top of the ranking.
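A minimal sketch of this ranking is given below, assuming X is the samples-by-genes training matrix and y the label vector (both names are placeholders):

```python
# Sketch of correlation-based feature ranking (variable names are illustrative).
import numpy as np

def pearson_rank(X, y):
    """Rank features by absolute Pearson correlation with the class labels."""
    Xc = X - X.mean(axis=0)                      # centre each gene
    yc = y - y.mean()                            # centre the labels
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = num / (den + 1e-12)                   # guard against constant genes
    return np.argsort(-np.abs(corr))             # best-correlated features first

# ranking = pearson_rank(X_train, y_train)
```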
4) $\ell_0$-norm minimization: Given features $X$ and labels $Y$, we attempt to solve the following minimization problem:
$$\min_{w,b}\ \ell_0(w) \quad \text{subject to}\ \ y_k \left(w \cdot x_k + b\right) \ge 1 \ \ \forall k$$
Note that the $\ell_0$ norm simply counts the number of non-zero components of the vector $w$. This problem is known to be NP-hard, so we attempt to solve it by approximating the $\ell_0$ norm with a smooth function. One such approximation, given in [5], is
$$\ell_0(w) \approx \sum_i \log\left(\epsilon + |w_i|\right)$$
This method finds the weights directly, and the ranking is based on those weights. The idea in $\ell_0$-norm minimization is to perform multivariate feature selection while keeping $w$ extremely sparse at the same time.
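One way to carry out this minimization in practice is the multiplicative feature-rescaling procedure of [5]: repeatedly train a linear SVM and shrink each feature by its learned weight. The sketch below follows that spirit; the iteration count, tolerance, and use of scikit-learn's LinearSVC are illustrative assumptions.

```python
# Sketch of approximate zero-norm minimization via iterative feature rescaling
# (in the spirit of [5]; iteration count and tolerance are illustrative).
import numpy as np
from sklearn.svm import LinearSVC

def zero_norm_rank(X, y, n_iter=20, tol=1e-6):
    z = np.ones(X.shape[1])                     # multiplicative scaling per feature
    for _ in range(n_iter):
        svc = LinearSVC(C=1.0, max_iter=10000)
        svc.fit(X * z, y)                       # train on the rescaled features
        z = z * np.abs(svc.coef_.ravel())       # shrink features with small weights
        z[z < tol] = 0.0                        # features driven to zero are dropped
    return np.argsort(-z)                       # rank features by their final scaling

# ranking = zero_norm_rank(X_train, y_train)
```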
5) Feature Selection via Concave minimization (FSV): This is another method [6] that approximates the $\ell_0$ norm above. Here the $\ell_0$ norm is approximated as
$$\ell_0(w) \approx \sum_i \left(1 - \exp\left(-\alpha |w_i|\right)\right)$$
The idea is the same as in [5]: the sparse weighting leaves only a few components of $w$ with appreciable weight, and ranking is done according to these weights.
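The original FSV algorithm solves this by successive linear approximation of the concave penalty. As a rough illustration only, the sketch below substitutes an iteratively reweighted L1-penalised linear SVM; the hyperparameters alpha and C and the feature-rescaling trick are assumptions for this sketch, not the linear-programming formulation of [6].

```python
# Rough FSV-style ranking via iteratively reweighted L1 (hyperparameters are illustrative).
import numpy as np
from sklearn.svm import LinearSVC

def fsv_rank(X, y, alpha=5.0, n_iter=10):
    w = np.ones(X.shape[1])
    for _ in range(n_iter):
        v = alpha * np.exp(-alpha * np.abs(w))          # linearize the concave penalty
        v = np.maximum(v, 1e-8)                         # numerical floor
        svc = LinearSVC(penalty='l1', dual=False, C=1.0, max_iter=10000)
        svc.fit(X / v, y)                               # weighted L1 via feature rescaling
        w = svc.coef_.ravel() / v                       # weights on the original scale
    return np.argsort(-np.abs(w))                       # rank by final weight magnitude

# ranking = fsv_rank(X_train, y_train)
```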
D: Evaluation protocol:
A support vector classifier [7] with a linear kernel is used to evaluate and document the accuracy of each method on the test set. The accuracy obtained is contrasted with direct classification without feature selection. In addition, the top 500 features selected by each method are evaluated for consistency: a consistency table is built and used to pick out features selected by several methods. Finally, an SVM is trained on the features that have been consistently selected by more than one method, and the accuracy of this classifier is reported.
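A minimal sketch of this protocol is given below; the ranking names, the placeholder rankings, and the choice of k are illustrative assumptions.

```python
# Sketch of the evaluation and consistency protocol (placeholder rankings and k).
import numpy as np
from sklearn.svm import LinearSVC

def top_k_accuracy(ranking, k, X_train, y_train, X_test, y_test):
    """Test accuracy of a linear SVC trained on the top-k ranked features."""
    top = ranking[:k]
    svc = LinearSVC(C=1.0, max_iter=10000).fit(X_train[:, top], y_train)
    return svc.score(X_test[:, top], y_test)

def consistency(rankings, k=500):
    """Count, for each feature, how many methods placed it in their top k."""
    counts = {}
    for ranking in rankings.values():
        for g in ranking[:k]:
            counts[g] = counts.get(g, 0) + 1
    return counts

# Placeholder rankings; in practice these come from the methods described above.
rng = np.random.default_rng(0)
rankings = {name: rng.permutation(24481) for name in ("svm_rfe", "corr", "l0")}
counts = consistency(rankings, k=500)
consensus = [g for g, c in counts.items() if c > 1]    # selected by more than one method
```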
Results:
Table 1: Test accuracy on the test set of a linear SVC using features selected on the training set

Method                   | Naïve (no feature selection) | SVM-RFE | Relief | Correlation | L0-norm minimization (Weston) | FSV
Accuracy in testing      | 54%                          | 56%     | 58%    | 66%         | 56%                           | 66%
CV accuracy on test set  | 60%                          | 62%     | 59%    | 65%         | 60%                           | 65%
No. of features selected | 24481                        | 102     | 115    | 116         | 43                            | 60
Combining Features
In our case 1890 features were selected in total. The Manhattan plot is shown below.
We find that three genes are selected by all three classifiers, and about 133 genes are marked as relevant by more than two classifiers. An accuracy of approximately 64% was obtained in testing for a linear SV classifier trained using these 133 genes.
Table 3: The markers selected by more than 3 classifiers and the genes they correspond to
are presented in the table.
We can see from Table 1 that all of our feature selection methods select features that increase classification accuracy. In the case of FSV and correlation, the accuracy increases by 12 percentage points (from 54% to 66%), a relative improvement of more than 20% over the naïve classification accuracy.
However, Table 2 shows that the methods are not very consistent: they select largely different sets of features. This presents a major challenge from the standpoint of biological interpretation.
The combined selection approach, whose results are presented in Table 3, provides one solution to this problem. Some of the genes in the table have also been implicated in breast cancer by several other studies [1][8].
References:
[1] L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):484–485, 2002.
[2] M.J. van de Vijver et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347:1999–2009, 2002.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.
[4] M. Robnik-Šikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53:23–69, 2003.
[5] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439–1461, 2003.
[6] P.S. Bradley and O.L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98). Also available as Mathematical Programming Technical Report 98-03, February 1998.
[7] B.E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM, 1992.
[8] S. Chiu, C. Chen, and T. Lin. Using support vector regression to model the correlation between the clinical metastases time and gene expression profile for breast cancer. Artificial Intelligence in Medicine, 44(3):221–231.