
Proceedings of International Joint Conference on Neural Networks, San Jose, California, USA, July 31 – August 5, 2011

On the Behavior of Feature Selection Methods Dealing with Noise and Relevance over Synthetic Scenarios
V. Bolón-Canedo, N. Sánchez-Maroño and A. Alonso-Betanzos

Abstract— Adequate identification of relevant features is fundamental in real-world scenarios. The problem is especially important when the datasets have a much larger number of features than samples. However, in most cases, the relevant features in real datasets are unknown. In this paper several synthetic datasets are employed to test the effectiveness of different feature selection methods over different artificial classification scenarios, such as altered features (noise), the presence of a growing number of irrelevant features and a small ratio between the number of samples and the number of features. Six filters and two embedded methods are tested over five synthetic datasets, so as to be able to choose a robust and noise-tolerant method, paving the way for its application to real datasets in the classification domain.

I. INTRODUCTION

THE advent of high dimensionality problems has brought an interesting challenge for machine learning researchers. In the past few years, several datasets with a high number of samples and/or a high number of input features have become publicly available on the Internet. Machine learning gets particularly difficult when there are many features and very few samples, since the search space will be sparsely populated and the model will not be able to correctly distinguish between the relevant data and the noise [1]. The identification of discriminant features is of fundamental and practical interest in these scenarios where the number of features is very large. Moreover, the presence of noise in those datasets complicates the learning, since in a small set it is not possible to distinguish between a special case and a spurious piece of information.

Machine learning can take advantage of feature selection methods to confront these problems. Feature selection (FS) is the process of detecting the relevant features and discarding the irrelevant ones, with the goal of obtaining a subset of features that describes the given problem properly, with a minimum degradation of performance [2]. Feature selection techniques can be divided into three groups: filter, wrapper and embedded methods [2]. While wrapper models involve optimizing a predictor as part of the selection process, filter models rely on the general characteristics of the training data to select features with independence of any predictor. The embedded methods generally use machine learning models for classification, and then an optimal subset or ranking of features is built by the classifier algorithm. By using the learning machine as a black box, wrappers are remarkably universal and simple and tend to give better performance results than the other techniques, but they have the risk of overfitting when the sample size is small.

In real-world applications, there are several situations that can hinder the process of selection, such as the presence of irrelevant features or noise in the inputs. To check the behavior of the different FS methods against these phenomena, it is necessary to use artificial synthetic datasets whose relevant features are already known, since demonstrating improvement on synthetic scenarios can be more convincing than doing so in real situations where the true solution is completely unknown. This work is focused on studying the behavior of FS methods with respect to relevance and noise. Altered features, also known as noise, are relevant features that are randomly corrupted. Therefore, to test the effectiveness of FS methods under noise, it becomes necessary to contrast the ability of the algorithms to select the relevant features and discard the irrelevant ones without permitting the altered inputs to obstruct this process.

There is a vast body of feature selection methods (based on different metrics) in the literature. The proliferation of feature selection algorithms, however, has not brought about a general methodology that allows for intelligent selection from existing algorithms. Therefore, one of the problems that has to be faced when dealing with feature selection is which method is the most adequate for each case, and this is not an easy question to solve. In order to make a correct choice, a user not only needs to know the domain well, but is also expected to understand the technical details of the available algorithms [3]. Therefore, the more algorithms available, the more challenging it is to choose a suitable one for a given application. The aim of this work is to infer which feature selection methods detect relevance successfully and are more tolerant to noise in the inputs. For this purpose, several feature selection methods will be applied over several synthetic problems to test their ability to deal with different percentages of altered attributes while increasing the number of irrelevant features. One of the most challenging dataset types in the feature selection community are those having a reduced number of samples and a large number of features, such as microarray or SNP (Single Nucleotide Polymorphism) datasets. We are interested in testing the feature selection methods over hard scenarios, therefore a reduced number of samples was generated. This study will help us to choose which methods are more robust to noise and to irrelevant features, to be applied later in real classification domains.

V. Bolón-Canedo, N. Sánchez-Maroño and A. Alonso-Betanzos are with the Department of Computer Science, University of A Coruña, Spain (email: {vbolon, nsanchez, ciamparo}@udc.es). This work was supported by the Spanish Ministerio de Ciencia e Innovación under project TIN 2009-02402, partially supported by the European Union ERDF.

II. BACKGROUND

As stated in the Introduction, choosing the most adequate FS method in a given scenario is not a trivial question. Thus, there are several previous works based on comparing FS methods [3], [4], [5], [6], [7], [8], [9], some of which were performed over artificially generated data. The use of synthetic data is endorsed by the fact that it provides a controlled scenario where the desired result is known and an FS method can be properly evaluated. Besides, it is possible to alter the data in order to add more irrelevant features, or noise in the input, so as to test the strengths and weaknesses of the existing FS methods. A drawback of these synthetic scenarios is that it remains to be demonstrated whether the results can be extrapolated to real datasets. However, when comparing FS methods, the first step should be over artificial data, for two reasons [10]:

• Controlled experiments can be developed by systematically varying chosen experimental conditions. This makes it easier to draw useful conclusions.
• The main advantage of synthetic scenarios is the knowledge of the set of optimal variables that must be selected, thus the degree of closeness to any of these solutions can be assessed in a confident way. Moreover, this kind of data allows full control of the experimental conditions, permitting the injection of the desired amount of relevance, irrelevance, redundancy and noise, as well as problem difficulty.

In this work we are interested in testing FS methods with respect to noise and relevance over artificial data. Molina et al. [4] evaluated several algorithms by taking into account the amount of relevance, irrelevance and redundancy on sample datasets, whereas Yu et al. [11] developed a correlation-based method for relevance and redundancy analysis. Regarding the noise problem, Liu et al. [12] added noise attributes to existing datasets in order to test the effectiveness of their method Chi2, whilst Sánchez-Maroño et al. [13] checked the tolerance of filter methods to noise in the output. Also, the widely-used synthetic dataset called Monk3 [14] contains noise in the class label. To the best of the authors' knowledge, there is no research in the literature about the alteration of the input variables, which is in fact a very interesting situation, because it is very likely in real-life datasets.

III. FEATURE SELECTION METHODS STUDIED

Regarding the relationship between a feature selection algorithm and the inductive learning method used to infer a model, one can distinguish among three different approaches:

• Filters, which rely on the general characteristics of the training data and carry out the feature selection process as a pre-processing step, with independence of the induction algorithm.
• Wrappers, which involve optimizing a predictor as part of the selection process.
• Embedded methods, which perform feature selection in the process of training and are usually specific to given learning machines.

Filters are faster than wrappers and this fact results in a better generalization ability, which is a critical issue when the number of samples is small. On the other hand, embedded methods implement the same idea as wrappers, but proceed more efficiently by directly optimizing a two-part objective function with a goodness-of-fit term and a penalty for a large number of variables [15].

In this work, wrappers will not be studied because we are interested in methods that can be used when the sample size is small (notice that in this situation wrappers have the risk of overfitting). Moreover, we are looking for methods independent of the induction algorithm in order to draw general conclusions.

There are a large number of FS methods in the literature, with very different behaviors. Several algorithms return a subset of the whole set of features, while others return an ordered ranking of the features, and then a threshold is required to choose a subset. The FS methods studied are described below, and all of them are available in the Weka tool environment [16] or in Matlab [17]. The first three provide a subset of features, while the last five provide an ordered ranking of the features. The last two methods belong to the family of embedded methods, and the remaining ones are filters. These FS methods were selected because they are based on different metrics and are able to deal with different problems, so they form a heterogeneous suite of methods suitable for a broad study.

• Correlation-based Feature Selection (CFS) is a simple filter algorithm that ranks feature subsets according to a correlation-based heuristic evaluation function [18]. Theoretically, irrelevant features should be ignored and redundant features should be screened out.
• The Consistency-based Filter [19] evaluates the worth of a subset of features by the level of consistency in the class values when the training instances are projected onto the subset of attributes.
• The INTERACT algorithm [20] is based on symmetrical uncertainty (SU). The authors stated that this method can handle feature interaction and efficiently selects relevant features. The first part of the algorithm requires a threshold, but since the second part searches for the best subset of features, it is considered a subset filter.
• Information Gain [21] is one of the most common attribute evaluation methods. This filter provides an ordered ranking of all the features, and then a threshold is required.
• ReliefF [22] is an extension of the original Relief algorithm [23] that adds the ability to deal with multiclass problems and is also more robust and capable of dealing with incomplete and noisy data. This method may be applied in all situations, has low bias, includes interaction among features and may capture local dependencies which other methods miss.
• The mRMR (minimum Redundancy Maximum Relevance) method [24] selects features that have the highest relevance with respect to the target class and are also minimally redundant, i.e., it selects features that are maximally dissimilar to each other. Both optimization criteria (Maximum-Relevance and Minimum-Redundancy) are based on mutual information.
• SVM-RFE (Recursive Feature Elimination for Support Vector Machines) was introduced by Guyon et al. in [25]. This embedded method performs feature selection by iteratively training an SVM classifier with the current set of features and removing the least important feature indicated by the SVM.
• FS-P (Feature Selection - Perceptron) [26] is an embedded method that uses a simple linear perceptron as the classification model and ranks the importance of the features based on the weights returned by the perceptron.
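The experiments in this paper rely on the Weka [16] and Matlab [17] implementations of these methods. As an illustration only, the sketch below shows how two of the rankers can be approximated with scikit-learn on a toy binary problem: a univariate mutual-information ranking as a stand-in for Information Gain, and recursive feature elimination with a linear SVM as a stand-in for SVM-RFE. The toy data, the parameter values and the use of scikit-learn are assumptions of this sketch, not part of the original study.

# Illustrative sketch only (the paper's experiments used Weka/Matlab): approximating
# an Information Gain style ranking and SVM-RFE with scikit-learn on toy binary data.
import numpy as np
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 25))            # 50 samples, 25 binary features
y = (X[:, 0] & X[:, 1]) | (X[:, 2] & X[:, 3])    # only features 0-3 are relevant

# Univariate ranking by mutual information with the class (Information Gain style).
ig_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print("InfoGain-style top 4:", np.argsort(ig_scores)[::-1][:4])

# SVM-RFE: iteratively drop the feature with the smallest linear-SVM weight.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=4, step=1)
rfe.fit(X, y)
print("SVM-RFE selected:", np.flatnonzero(rfe.support_))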

IV. SYNTHETIC DATASETS

The aim of this work is to check the behavior of a suite of FS methods over a controlled experimental scenario with respect to relevance and noise. Altered features, also referred to in this work as noise (N), are relevant features that are randomly corrupted. Different levels of noise can be added to the data and, in the case of binary variables, adding noise means assigning an incorrect value to the relevant features. The alteration of the inputs can hinder the process of feature selection, therefore the tolerance to noise of an FS method has to be measured by contrasting the ability of the algorithm to deal with relevance and irrelevance.

• Relevance: Relevant features (R) must be selected by the FS algorithm, since they have an influence on the output and their role cannot be assumed by any other subset of features.
• Irrelevance: Irrelevant features (I) must be discarded by the FS algorithm, since they have no influence at all on the output. The number of irrelevant features has an influence on the difficulty of the problem, and a common practice is to generate these variables randomly.

In order to determine the effectiveness of each FS method with respect to the concepts explained above, five widely-used synthetic datasets were employed, specifically CorrAL, CorrAL-100, XOR-100, Led-25 and Led-100. Over the CorrAL dataset [28] and the XOR-100 dataset, the behavior of the methods with respect to relevance will be checked, in terms of selecting the relevant features and discarding the irrelevant ones. These datasets have difficulties such as correlation with the output (CorrAL) or non-linearity (XOR-100). Notice that these datasets do not contain noise.

After these experiments, the LED dataset [27] will be used to test the behavior of the feature selection methods when noise is added to the inputs. It has to be noted that CorrAL-100, XOR-100 and Led-100 have an added difficulty: the number of features is significantly higher than the number of samples, which hinders the correct selection of the relevant features, as mentioned in the Introduction.

A. The LED Problem

The LED problem [27] is a simple classification task that consists of identifying the digit that a seven-segment display is representing, given its active LEDs. A simple scheme of the seven segments used can be seen in [29]. The classification task to be solved is described by seven binary attributes (X = {s0, s1, s2, s3, s4, s5, s6}) and the ten possible classes available (C = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}). A 1 in the si attribute indicates that the i-th LED is active, and a 0 indicates that it is not.

Two versions of this LED problem will be used: the first one, Led-25, adds 17 irrelevant attributes (with random binary values) and the second one, Led-100, adds 92 irrelevant attributes. Both versions contain 50 samples. The small number of samples was chosen because we are interested in dealing with datasets with a high number of features and a small sample size. Besides, different levels of noise (altered inputs) have been added to the attributes of these two versions of the LED dataset: 2%, 6%, 10%, 15% and 20%. In this manner, the tolerance of the tested feature selection methods to different levels of noise will be checked. Note that, as the attributes take binary values, adding noise means assigning an incorrect value to the relevant features.
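A minimal sketch of how Led-25/Led-100 style data with altered inputs can be generated is given below. The segment patterns, the function name and the bit-flipping scheme are assumptions of this sketch; the authors' exact generator is not described beyond the text above.

# Minimal sketch (assumed generator, not the authors' code) of Led-25 / Led-100 style
# data: 7 relevant binary segment features, random irrelevant features, and noise
# injected by flipping relevant bits with a given probability.
import numpy as np

# Standard seven-segment patterns for digits 0-9; the segment ordering s0..s6 is assumed.
SEGMENTS = np.array([
    [1, 1, 1, 1, 1, 1, 0],  # 0
    [0, 1, 1, 0, 0, 0, 0],  # 1
    [1, 1, 0, 1, 1, 0, 1],  # 2
    [1, 1, 1, 1, 0, 0, 1],  # 3
    [0, 1, 1, 0, 0, 1, 1],  # 4
    [1, 0, 1, 1, 0, 1, 1],  # 5
    [1, 0, 1, 1, 1, 1, 1],  # 6
    [1, 1, 1, 0, 0, 0, 0],  # 7
    [1, 1, 1, 1, 1, 1, 1],  # 8
    [1, 1, 1, 1, 0, 1, 1],  # 9
])

def make_led(n_samples=50, n_irrelevant=17, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 10, size=n_samples)            # digit class
    relevant = SEGMENTS[y].copy()                       # 7 relevant features
    flip = rng.random(relevant.shape) < noise           # altered inputs (noise)
    relevant = np.where(flip, 1 - relevant, relevant)   # assign an incorrect value
    irrelevant = rng.integers(0, 2, size=(n_samples, n_irrelevant))
    return np.hstack([relevant, irrelevant]), y

X25, y25 = make_led(n_irrelevant=17, noise=0.10)    # Led-25 with 10% altered inputs
X100, y100 = make_led(n_irrelevant=92, noise=0.10)  # Led-100 with 10% altered inputs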
B. CorrAL

The CorrAL dataset [28] has six binary features (i.e., f1, f2, f3, f4, f5, f6), and its class value is (f1 ∧ f2) ∨ (f3 ∧ f4). Feature f5 is irrelevant and f6 is correlated to the class label by 75%.

CorrAL-100 [30] was constructed by adding 93 irrelevant binary features to the previous CorrAL dataset. The data for the added features were generated randomly. Both datasets (CorrAL and CorrAL-100) have 32 samples, formed by considering all possible values of the four relevant features and the correlated one (2^5). The correct behavior for a given feature selection method is to select the four relevant features and to discard both the irrelevant one and the correlated one. The correlated feature is redundant if the four relevant features are selected and, besides, it is correlated to the class label by 75%, so if one applies a classifier after the feature selection process, an error of 25% will be obtained.

C. XOR-100

XOR-100 [30] has 2 relevant binary features, 97 irrelevant binary features (randomly generated) and one binary class feature. The features f1 and f2 are correlated with the class value through the XOR operation (i.e., class equals f1 ⊕ f2). Like the Led datasets, XOR-100 has 50 samples. This is a hard dataset for feature selection because of the small ratio between the number of samples and the number of features, and due to its non-linearity (unlike the CorrAL dataset, which is a multivariate dataset).
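The sketch below illustrates one possible way to build CorrAL-style and XOR-100-style data as described above. It does not reproduce the exact instance enumeration of the original CorrAL [28]; in particular, the way the correlated feature is made to agree with the class on 75% of the samples is an assumption.

# Minimal sketch (assumed construction, not the original datasets) of CorrAL-style
# and XOR-100-style data.
import numpy as np
from itertools import product

def make_corral(n_extra_irrelevant=0, seed=0):
    rng = np.random.default_rng(seed)
    base = np.array(list(product([0, 1], repeat=4)))      # all 16 combinations of f1..f4
    rel = np.vstack([base, base])                          # 32 samples
    y = (rel[:, 0] & rel[:, 1]) | (rel[:, 2] & rel[:, 3])  # class = (f1 AND f2) OR (f3 AND f4)
    f5 = rng.integers(0, 2, size=32)                       # irrelevant feature
    f6 = y.copy()                                          # correlated feature...
    flip = rng.choice(32, size=8, replace=False)
    f6[flip] = 1 - f6[flip]                                # ...agreeing with the class 75% of the time
    extra = rng.integers(0, 2, size=(32, n_extra_irrelevant))
    return np.hstack([rel, f5[:, None], f6[:, None], extra]), y

def make_xor100(n_samples=50, n_irrelevant=97, seed=0):
    rng = np.random.default_rng(seed)
    f12 = rng.integers(0, 2, size=(n_samples, 2))          # the two relevant features
    y = f12[:, 0] ^ f12[:, 1]                              # class = f1 XOR f2
    irrelevant = rng.integers(0, 2, size=(n_samples, n_irrelevant))
    return np.hstack([f12, irrelevant]), y

X_corral, y_corral = make_corral()                         # CorrAL (6 features)
X_corral100, y_c100 = make_corral(n_extra_irrelevant=93)   # CorrAL-100
X_xor, y_xor = make_xor100()                               # XOR-100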
To sum up, Table I lists the relevant features that should be selected by a given FS method for each dataset.

TABLE I
Features that should be selected by the methods tested

 Dataset       Relevant Features
 CorrAL        1,2,3,4
 CorrAL-100    1,2,3,4
 XOR-100       1,2
 Led-25        1,2,3,4,5,6,7
 Led-100       1,2,3,4,5,6,7

V. EXPERIMENTAL SETTINGS

Eight different FS methods were tested in this study. In order to check their behavior over synthetic data, a 10-fold cross validation was executed. In this manner, each method selects a subset of features in each fold. After finishing the 10-fold cross validation, each FS method returns a unique subset of features, formed by joining the subsets selected at each fold. This final subset is the one that will be shown in the tables of the next section.
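As a sketch of this protocol, the snippet below runs a generic selector on each of the 10 folds and joins the per-fold selections into a single subset. The use of the training partition of each fold, the KFold settings and the placeholder selector (a mutual-information ranking) are assumptions, not details taken from the paper.

# Sketch of the evaluation protocol: select features on each fold, then report the union.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.feature_selection import mutual_info_classif

def select_top_k(X, y, k):
    # Placeholder selector: any of the eight studied methods could be plugged in here.
    scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    return set(np.argsort(scores)[::-1][:k])

def union_over_folds(X, y, k, n_splits=10, seed=0):
    selected = set()
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        selected |= select_top_k(X[train_idx], y[train_idx], k)
    return sorted(selected)

# e.g. with the Led-25 sketch above: union_over_folds(X25, y25, k=7)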

In order to present the results clearly, those tables show the relevant features selected ("Relevant"), the number of irrelevant features selected ("Irr. No.") and a percentage of success ("% suc."). This scoring measure (1) was defined so as to be able to compare the different methods tested, and it is an attempt to reward the selection of relevant features and to penalize the inclusion of irrelevant features, considering two situations:

1) The solution is incomplete: there are relevant features lacking.
2) The solution is incorrect: there are some irrelevant features.

%suc. = (Rs/Rt − α × Is/It) × 100,    (1)

Besides, a restriction on the behavior of the measure was established: choosing an irrelevant feature is better than missing a relevant one. Therefore, the term α was introduced to obey this restriction, and it is defined as α = min{1/2, Rt/It}, where Rs is the number of relevant features selected, Rt is the total number of relevant features, Is is the number of irrelevant features selected and It is the total number of irrelevant features.
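The measure can be transcribed directly into code. In the sketch below, the sets of relevant and irrelevant feature indices are passed explicitly; the function name and interface are illustrative, not taken from the paper.

# Success measure of Eq. (1): %suc = (Rs/Rt - alpha * Is/It) * 100, alpha = min(1/2, Rt/It).
def success_score(selected, relevant, irrelevant):
    selected, relevant, irrelevant = set(selected), set(relevant), set(irrelevant)
    Rs = len(selected & relevant)     # relevant features selected
    Rt = len(relevant)                # total number of relevant features
    Is = len(selected & irrelevant)   # irrelevant features selected
    It = len(irrelevant)              # total number of irrelevant features
    alpha = min(0.5, Rt / It)
    return (Rs / Rt - alpha * Is / It) * 100

# Example: selecting all four relevant features of CorrAL plus the correlated feature,
# and counting features 5 and 6 as the irrelevant ones, gives (1 - 0.5 * 1/2) * 100 = 75,
# which matches the 75% reported for CFS on CorrAL in Table III.
print(success_score({1, 2, 3, 4, 6}, relevant={1, 2, 3, 4}, irrelevant={5, 6}))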
A. The influence of the cardinality

One of the main questions arising when dealing with FS methods is: how many features should be selected? CFS, Consistency-based and INTERACT provide a subset of features and release the user from choosing the cardinality. However, this study also includes ranker methods (Information Gain, ReliefF, mRMR, SVM-RFE and FS-P) that return an ordered ranking of the features, and so a threshold is required.

Since we are working on a controlled scenario, we know a priori the optimal features to be selected. A cardinality study was performed over the two versions of the Led dataset, as it has the highest number of relevant features, making it easier to extract conclusions. The behavior of the ranker methods was tested by forcing them to return 2, 4, 6, 7 and 10 features, where 7 is the optimal number. It is important to remember that these ranker methods are forced to return a given number of features for each fold in the 10-fold cross-validation, and therefore the cardinality of the final selected subset could be larger.

In Figure 1 one can see, in terms of %suc (1), the results obtained over the datasets Led-25 and Led-100, respectively. In the subsequent section, both datasets will be tested with different percentages of altered features (noise). However, for the sake of brevity, only two cases were considered in this cardinality study: no alteration (0% noise) and 10% of altered features. It has to be noted that choosing an irrelevant feature in Led-100 has a lower impact than in Led-25, according to (1). The overall impression for Led-25 (Fig. 1 (a) and (b)) is that the highest percentages of success are achieved with the optimal number of features (7). For Led-100 (Fig. 1 (c) and (d)) the results improve when selecting 10 features with respect to selecting 7 features, at the expense of selecting a higher number of irrelevant features, as can be seen in Table II.

[Fig. 1. Cardinality results for Led-25 and Led-100: % success versus number of features (2, 4, 6, 7, 10) for InfoGain, ReliefF, mRMR, SVM-RFE and FS-P. Panels: (a) Led-25 (0% noise), (b) Led-25 (10% noise), (c) Led-100 (0% noise), (d) Led-100 (10% noise).]

TABLE II
Results for Led-100 (without noise and with 10% of noise). "Irr. No." represents the number of irrelevant features selected. Remark that these results are calculated after the union of the subsets of the 10-fold CV.

                     Led-100 (0% noise)        Led-100 (10% noise)
 #    Method         Relevant    Irr. No.      Relevant    Irr. No.
 7    InfoGain       1-7         0             1,2,4,5,7   6
 7    ReliefF        1-5,7       3             1,2,4,5,7   4
 7    mRMR           1-5,7       3             1-5,7       5
 7    SVM-RFE        2-7         18            2,4,7       35
 7    FS-P           1-6         3             1,3-6       10
 10   InfoGain       1-7         4             1,2,4,5,7   9
 10   ReliefF        1-7         9             1-5,7       12
 10   mRMR           1-7         11            1-7         15
 10   SVM-RFE        1-7         23            1-5,7       40
 10   FS-P           1-7         10            1-3,5,6     17

It seems clear that selecting a smaller number of features than the optimal one leads to poor results.

Between selecting 7 (the optimal number) or 10 features, the authors have opted for forcing the ranker methods to select a number of features equal to the number of relevant features. The reason is that, considering the results obtained above, selecting a higher number of features than the optimal one does not always lead to an improvement, while it always implies an increase in the number of irrelevant features selected. Besides, this is a theoretical study in which the goal is to check the behavior of different feature selection methods under controlled scenarios, so the purpose is to know which ranker method behaves better when it has the chance of selecting the correct number of features.

VI. EXPERIMENTAL RESULTS

In this section the results obtained after testing the eight feature selection methods over the different datasets will be presented, in order to be able to draw conclusions about their behavior and adequacy to deal with noise and relevance. As a first step, we will contrast the ability of the feature selection methods with respect to relevance, irrelevance, correlation and non-linearity over the datasets CorrAL, CorrAL-100 and XOR-100. After those experiments, the feature selection methods will be tested over the two versions of the Led dataset, observing their behavior when different levels of noise are added to the data.

A. CorrAL

Two versions of this dataset will be used: CorrAL and CorrAL-100. Over these datasets, the ability of the feature selection methods to select the 4 relevant features and discard the irrelevant ones is checked, plus the capacity of detecting the correlated feature and not selecting it. It is necessary to recall that the correlated feature is redundant if the 4 relevant features are selected and, besides, it is correlated to the class label by 75%. Therefore, when selecting the 4 relevant features, it should be discarded. In this case, the ranker methods were forced to select 4 features, since that is the number of relevant features. Although they return 4 features for each fold in the 10-fold cross-validation, after joining all the subsets, the cardinality of the subsets that appear in the tables of this section could be larger.

Table III shows the results obtained over the datasets CorrAL and CorrAL-100. The relevant features for these datasets can be consulted in Table I. The results obtained over CorrAL show that all the methods except InfoGain and FS-P were able to detect the irrelevant feature and discard it. On the other hand, INTERACT did not select one of the relevant features. It has to be noted that none of the feature selection methods could discard the correlated feature, and the number of irrelevant features shown in the table includes the correlated feature, i.e., CFS over CorrAL selected the correlated feature, which is counted as irrelevant, so its number of irrelevant features is 1.

In order to complicate the CorrAL dataset, 93 irrelevant features were added to form the CorrAL-100 dataset. As can be seen in Table III, all the filters (first six rows) selected the relevant features, whilst the embedded methods (SVM-RFE and FS-P) exhibit the worst percentages of success, because they are unable to select the four relevant features. Again, all the methods selected the correlated feature. Correlation is a complex question to deal with, and none of the methods checked could solve this problem, at least with a level of correlation as high as in this case (75%).

TABLE III
Results for CorrAL and CorrAL-100. "Rel." shows the relevant features selected, "C" indicates if the correlated feature is selected, "Irr." means the number of irrelevant features selected and "% suc." represents the percentage of success.

                       CorrAL                      CorrAL-100
 Method         Rel.    C   Irr.   %suc.     Rel.    C   Irr.   %suc.
 CFS            1-4     X   1      75        1-4     X   6      99
 Consistency    1-4     X   1      75        1-4     X   6      99
 INTERACT       1-3     X   1      50        1-4     X   8      99
 InfoGain       1-4     X   2      50        1-4     X   11     99
 ReliefF        1-4     X   1      75        1-4     X   11     99
 mRMR           1-4     X   1      75        1-4     X   8      99
 SVM-RFE        1-4     X   1      75        1,3,4   X   15     74
 FS-P           1-4     X   2      50        3,4     X   7      50

B. XOR

This is a complex dataset, because it is a non-linear problem that contains only 2 relevant features, while the remaining 97 are irrelevant. Thus, the ability of the feature selection methods to deal with relevance and irrelevance will be checked over a non-linear scenario with a small ratio between the number of samples and the number of features.

TABLE IV
Results for XOR-100. "Rel." shows the relevant features selected, "Irr. No." means the number of irrelevant features selected and "% suc." represents the percentage of success.

                    XOR-100
 Method         Rel.    Irr. No.   % suc.
 CFS            1       4          50
 Consistency    1       4          50
 INTERACT       1       3          50
 InfoGain       –       5          0
 ReliefF        1,2     5          99
 mRMR           1       9          50
 SVM-RFE        –       15         0
 FS-P           –       9          0

One can see that, in this scenario, it is complicated for the FS methods to detect and select the relevant features. Only ReliefF is able to choose the relevant features, moreover with a low percentage of irrelevant ones (5%). On the contrary, InfoGain, SVM-RFE and FS-P cannot select any of the relevant features. This result is dreadful and could lead to an incorrect classification of samples when applying a machine learning algorithm. The rest of the methods only select one of the relevant features, which would not be enough for a classifier to deduce the XOR operation, leading, as in the previous case, to high levels of error.

C. Led

After studying the behavior of the feature selection methods over CorrAL, CorrAL-100 and XOR-100, the best method seems to be ReliefF, since it achieved the highest levels of success in all three datasets. ReliefF is a good option when one is interested in detecting irrelevance, as one very important advantage of this method is the rapid assessment of irrelevant features with a principled approach; however, it does not make a good discrimination among redundant features [31] (that is why it selected the correlated feature in both versions of CorrAL, where the correlated feature is redundant to the set of relevant features).

The question now is to know how these feature selection methods behave when the scenario gets complicated by adding different levels of noise. It must be noticed that, as the attributes take binary values, adding noise means assigning an incorrect value to the relevant features. For this purpose, several levels of noise (altered inputs) have been added to the attributes of the two versions of the Led dataset.

[Fig. 2. Noise study over Led-25 and Led-100: % success versus level of noise (0%, 2%, 6%, 10%, 15%, 20%) for CFS, Consistency, INTERACT, InfoGain, ReliefF, mRMR, SVM-RFE and FS-P. Panels: (a) Led-25, (b) Led-100.]

In Figure 2 one can see the effect of the level of noise on the percentage of success (1) achieved by the different FS methods, for both Led-25 and Led-100, and more detailed results can be found in Table V. As we would have expected, the percentage of success decreases when the level of noise increases, and worse performances were obtained over Led-100, since it has a higher number of irrelevant features and, therefore, its complexity is higher. It is curious that in some cases the percentage of success improved with higher levels of noise (for example, in Led-25 from 10% to 15% of noise), but this fact can be explained by the distribution of the folds in the 10-fold cross validation and the random generation of the noise.

Several conclusions can be drawn from the graphs in Figure 2. InfoGain achieves very good results when the level of noise is low or there is no noise (0%, 2% and 6%), but its performance plummets with higher levels of noise, especially in Led-100, where its percentage of success gets close to 10%. On the contrary, mRMR appears to be very resistant to noise; in fact, it achieves the same percentage of success with 20% of noise as with no noise, in both the Led-25 and Led-100 datasets. Good behaviors can also be found with ReliefF and the embedded method FS-P. The subset filters (CFS, Consistency and INTERACT) show a mediocre performance, quite affected by high levels of noise. Finally, the embedded method SVM-RFE does not seem a good option when dealing with this type of scenario. Especially in Led-100, its percentage of success falls sharply from 6% of noise onwards.

The opposite behaviors of Information Gain and mRMR are curious, bearing in mind that both come from the Information Theory field. However, this fact can be explained because Information Gain is a univariate measure that considers the entropy between a given feature and the class label. On the other hand, mRMR takes into account the mutual information among the features. The latter is a multivariate measure, thus a better behavior was expected when spurious information is present in the data.
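As an illustration of the difference discussed above, the sketch below implements a simple greedy mRMR-style selection (the "relevance minus mean redundancy" form) on discrete data using mutual-information estimates. It is not the implementation of [24] used in the experiments, and the estimator choices are assumptions.

# Greedy mRMR-style selection (relevance minus mean redundancy) for discrete features.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    relevance = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    selected = [int(np.argmax(relevance))]          # start with the most relevant feature
    while len(selected) < k:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        # Penalize features that share information with those already selected.
        def score(j):
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            return relevance[j] - redundancy
        selected.append(max(candidates, key=score))
    return selected

# e.g. on the Led-25 sketch above: mrmr(X25, y25, k=7)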
Regarding the results shown in Table V, it is interesting to note that the subset filters (CFS, Consistency and INTERACT) select the lowest number of irrelevant features in all cases, at the expense of discarding some of the relevant ones, especially with high levels of noise. Subset filters generate candidate feature subsets based on a certain search strategy, and each candidate subset is evaluated by an evaluation measure and compared to the previous one. If a new subset turns out to be better, it replaces the previous best subset, until a given stopping criterion is satisfied. This kind of method has proven effective to some extent in removing irrelevant features [11], as happened in this work.

TABLE V
Results for the Led-25 and Led-100 datasets with different levels of noise (N) in the inputs.

                            Led-25                              Led-100
 N(%)  Method        Relevant     Irr. No.  % suc.     Relevant     Irr. No.  % suc.
 0     CFS           1-5,7        0         86         1-5,7        0         86
 0     Consistency   1-5          0         71         1-5          0         71
 0     INTERACT      1-5,7        0         86         1-5,7        1         86
 0     InfoGain      1-7          0         100        1-7          0         100
 0     ReliefF       1-7          1         98         1-5,7        3         85
 0     mRMR          1-7          2         95         1-5,7        3         85
 0     SVM-RFE       1-7          7         83         2-7          18        84
 0     FS-P          1-7          3         93         1-6          3         85
 2     CFS           1-5,7        0         86         1-5,7        0         86
 2     Consistency   1-5          0         71         1-5          0         71
 2     INTERACT      1-5,7        0         86         1-5,7        0         86
 2     InfoGain      1-7          0         100        1-7          0         100
 2     ReliefF       1-7          2         95         1-7          4         99
 2     mRMR          1-7          2         95         1-5,7        5         88
 2     SVM-RFE       1-7          9         78         3-7          23        70
 2     FS-P          1-7          3         93         1-6          5         85
 6     CFS           1-5,7        0         86         1-7          0         100
 6     Consistency   1-5,7        0         86         1-5,7        0         86
 6     INTERACT      1-5,7        0         86         1-7          0         100
 6     InfoGain      1-7          2         95         1-7          2         99
 6     ReliefF       1-5,7        1         83         1-7          5         99
 6     mRMR          1-7          1         98         1-7          5         99
 6     SVM-RFE       1-7          11        73         1,2,4-7      30        83
 6     FS-P          1-7          3         93         1-6          11        85
 10    CFS           1,2,4,5,7    0         71         1,2,4,5,7    1         71
 10    Consistency   1,2,4,5,7    0         71         1,2,4,5,7    1         71
 10    INTERACT      1,2,4,5,7    0         71         1,2,4,5,7    1         71
 10    InfoGain      1,2,4,5,7    3         64         1,2,4,5,7    6         71
 10    ReliefF       1-5,7        4         76         1,2,4,5,7    4         71
 10    mRMR          1-5,7        2         81         1-5,7        5         85
 10    SVM-RFE       1-7          13        69         2,4,7        35        40
 10    FS-P          1-7          4         90         1,3-6        10        71
 15    CFS           1,2,4,5,7    0         71         1,4,5,7      0         57
 15    Consistency   1,2,4-7      0         86         1,4,5,7      0         57
 15    INTERACT      1,4,5,7      0         57         1,4,5,7      0         57
 15    InfoGain      1,4,5,7      5         45         1,4,5,7      5         57
 15    ReliefF       1-7          3         93         1-7          4         99
 15    mRMR          1-7          2         95         1-7          5         99
 15    SVM-RFE       2-7          49        -3         5,7          32        26
 15    FS-P          1-7          5         88         1-3,5,7      13        70
 20    CFS           1,5          0         29         1,2          0         29
 20    Consistency   1,5          1         26         1,5          0         29
 20    INTERACT      1            0         14         1,2,5        0         43
 20    InfoGain      1,3,4,7      5         45         1            9         14
 20    ReliefF       1-5,7        4         76         1,2,4,5,7    11        71
 20    mRMR          1-7          2         95         1-5,7        13        85
 20    SVM-RFE       3,5-7        14        23         5,7          42        25
 20    FS-P          1-6          8         66         1-3,5-7      14        85

VII. CONCLUSIONS

In this paper, eight feature selection methods were applied over five synthetic datasets aimed at studying their performance with respect to relevance and under noise situations. The classification scenarios were complicated by adding irrelevant features or by a small ratio between the number of samples and the number of features. To sum up, the feature selection methods were tested against these phenomena:

• Altered inputs (noise).
• A high number of irrelevant features vs. a small number of relevant features.
• Non-linear relations.
• A small ratio between the number of samples and the number of features.

To test the effectiveness of the studied methods, an evaluation measure was introduced in order to reward the selection of the relevant features and to penalize the inclusion of the irrelevant ones.

The presence of noise is an important problem to deal with, since it can be expected in real scenarios and its presence in the datasets complicates the learning, especially in a small set where it is not possible to distinguish between a special case and a spurious piece of information. In this work, we have altered the relevant inputs by assigning them an incorrect value, hindering the feature selection process.

As a first stage, the ability of the feature selection methods with respect to relevance and irrelevance was tested over the datasets CorrAL, CorrAL-100 and XOR-100. The best results were achieved by ReliefF, according to the evaluation measure that we presented. An important advantage of this method is the rapid assessment of irrelevant features.

To see the response of the feature selection methods dealing with altered inputs, different levels of noise were added to the attributes.

The most stable method with respect to noise turned out to be mRMR, although ReliefF and the embedded method FS-P also proved to be robust to this phenomenon. Information Gain achieved excellent performance when the level of noise was low, but it worsened when the level of noise increased. On the other hand, the embedded method SVM-RFE did not seem to be a good choice when dealing with noise.

In light of the reported results, several conclusions and suggestions can be expounded regarding the use of one or another feature selection method in a given situation.

• In complete ignorance of the particulars of the data, the authors suggest using ReliefF. It detects relevance in a satisfactory way, even in complex datasets such as XOR-100, and it is tolerant to noise. Besides, it is a filter method with the implicit advantage of its low computational cost.
• When dealing with altered inputs in the data, the best option is to use mRMR, since in these experiments it has proved to be very robust to noise.
• In a situation where the goal is to select the smallest number of irrelevant features (even at the expense of selecting fewer relevant features), we suggest employing one of the subset filters (CFS, Consistency or INTERACT). These methods have the advantage of releasing the user from the task of deciding how many features to choose.
• In a scenario affected by low levels of noise (up to 6%), we suggest the use of the filter Information Gain.
• In general, the authors suggest the use of filters, due to the fact that they carry out the feature selection process with independence of the induction algorithm and they are faster than embedded methods. With these datasets, the experiments involving filters take on the order of minutes, whilst the experiments using embedded methods take on the order of hours. However, in case of using an embedded method, we suggest using FS-P.

As future work, we plan to carry out an extensive study on the identification of redundant features, as well as using datasets with a larger number of features (and also non-binary ones). The final step will be to work with real datasets in order to demonstrate over real scenarios the conclusions extracted from this theoretical study.

REFERENCES

[1] F. Provost, Distributed Data Mining: Scaling up and Beyond. Journal of Advances in Distributed and Parallel Knowledge Discovery, pp. 3-27, 2000
[2] I. Guyon, S. Gunn, M. Nikravesh and L. Zadeh, Feature Extraction. Foundations and Applications. Springer, 2006
[3] H. Liu and L. Yu, Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering, pp. 491-502, 2005
[4] L.C. Molina, L. Belanche and A. Nebot, Feature Selection Algorithms: A Survey and Experimental Evaluation. In Proc. of the 2002 IEEE International Conference on Data Mining, pp. 306, 2002
[5] D.W. Aha and R.L. Bankert, A comparative evaluation of sequential feature selection algorithms. In Proc. of the 5th International Workshop on Artificial Intelligence and Statistics, pp. 1-7, 1995
[6] J. Doak, An Evaluation of Feature Selection Methods and their Application to Computer Security. Technical Report CSE-92-18, University of California, Department of Computer Science, 1992
[7] A.K. Jain and D. Zongker, Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 153-158, 2002
[8] M. Kudo and J. Sklansky, A Comparative Evaluation of Medium and Large-scale Feature Selectors for Pattern Classifiers. In Proc. of the 1st International Workshop on Statistical Techniques in Pattern Recognition, pp. 91-96, 1997
[9] H. Liu and R. Setiono, Scalable Feature Selection for Large Sized Databases. In Proc. of the 4th World Conference on Machine Learning, pp. 101-106, 1998
[10] L.A. Belanche and F.F. González, Review and Evaluation of Feature Selection Algorithms in Synthetic Problems. http://arxiv.org/abs/1101.2320 [Last access: April 2011]
[11] L. Yu and H. Liu, Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, vol. 5, pp. 1205-1224, 2004
[12] H. Liu and R. Setiono, Chi2: Feature selection and discretization of numeric attributes. In Proc. of the 7th International Conference on Tools with Artificial Intelligence, pp. 388-391, 2002
[13] N. Sánchez-Maroño, A. Alonso-Betanzos and M. Tombilla-Sanromán, Filter methods for feature selection: a comparative study. In Proc. of the 8th International Conference on Intelligent Data Engineering and Automated Learning, pp. 178-187, 2007
[14] S.B. Thrun et al., The Monk's problems: A performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University, 1991
[15] I. Guyon and A. Elisseeff, An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003
[16] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2005. http://www.cs.waikato.ac.nz/ml/weka/ [Last access: April 2011]
[17] The Mathworks, Matlab Tutorial. 1998. http://www.mathworks.com/academia/student_center/tutorials/ [Last access: April 2011]
[18] M.A. Hall, Correlation-based Feature Selection for Machine Learning. PhD thesis, University of Waikato, Hamilton, New Zealand, 1999
[19] M. Dash and H. Liu, Consistency-based Search in Feature Selection. Artificial Intelligence, vol. 151, no. 1-2, pp. 155-176, 2003
[20] Z. Zhao and H. Liu, Searching for Interacting Features. In Proc. of the International Joint Conference on Artificial Intelligence, pp. 1156-1167, 1991
[21] M.A. Hall and L.A. Smith, Practical Feature Subset Selection for Machine Learning. Journal of Computer Science, vol. 98, pp. 4-6, 1998
[22] I. Kononenko, Estimating Attributes: Analysis and Extensions of RELIEF. In Proc. of the European Conference on Machine Learning, pp. 171-182, 1994
[23] K. Kira and L. Rendell, A Practical Approach to Feature Selection. In Proc. of the 9th International Workshop on Machine Learning, pp. 249-256, 1992
[24] H. Peng, F. Long and C. Ding, Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005
[25] I. Guyon, J. Weston, S.M.D. Barnhill and V. Vapnik, Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002
[26] M. Mejía-Lavalle, E. Sucar and G. Arroyo, Feature Selection with a Perceptron Neural Net. In Proc. of the International Workshop on Feature Selection for Data Mining, pp. 131-135, 2006
[27] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees. Wadsworth International Group, 1984
[28] G.H. John, R. Kohavi and K. Pfleger, Irrelevant Features and the Subset Selection Problem. In Proc. of the 11th International Conference on Machine Learning, pp. 121-129, 1994
[29] X. Llora and D.E. Goldberg, Minimal Achievable Error in the LED Problem. IlliGAL Report No. 2002015, 2002
[30] G. Kim, Y. Kim, H. Lim and H. Kim, An MLP-based Feature Subset Selection for HIV-1 Protease Cleavage Site Analysis. Artificial Intelligence in Medicine, vol. 48, pp. 83-89, 2010
[31] M. Dash and H. Liu, Feature selection for classification. Intelligent Data Analysis, vol. 1, no. 3, pp. 131-156, 1997
