
Wrapper-based feature selection: how important is the wrapped classifier?

Dražen Bajer, Mario Dudjak, Bruno Zorić
Faculty of Electrical Engineering, Computer Science and Information Technology Osijek
Osijek, Croatia
[email protected], [email protected], [email protected]

Abstract—Wrapper-based feature (subset) selection is a frequently used approach for dataset dimensionality reduction, especially when dealing with classification problems. The choice of wrapper is at the forefront of these approaches, whilst the choice of the classifier is typically based on its simplicity, so as to reduce the computational cost. Since the search is guided by the selected classifier, the same one is also later used for independent testing. This raises the question of how well such feature subsets are suited for other types of classifiers. In other words, can one classifier be used for finding feature subsets that are also effective for others? An investigation into this matter was performed by testing and analysing the utility of subsets found by one classifier with respect to other classifiers. It hints at the importance of classifier choice, since some models, whilst used inside the wrapper, can solely conform the dataset to themselves, whilst others are less susceptible to this issue. Consequently, an insight into the robustness of the employed classifiers was gained as well.

Index Terms—classification, differential evolution, dimensionality reduction, feature selection, wrapper methods

I. INTRODUCTION

Feature (subset) selection (FS) [1], [2] is becoming an indispensable tool for many tasks in machine learning, like classification and regression. This is primarily due to the ever increasing sizes of datasets that must be dealt with. The size of datasets is increasing not only in terms of the number of available instances but also in terms of the features/attributes describing each instance or sample. The latter is arguably the more difficult to handle, since features that are redundant or even detrimental to the behaviour of the modelling algorithm are typically present. The prior removal of such features, which is basically the task behind FS, may yield benefits in many cases, both in terms of model complexity and performance. The FS problem can be tackled in a few distinct ways, with wrapper methods being a frequent choice, although they are computationally more expensive than their counterparts. Their ability to determine comparatively small subsets whilst retaining or improving model performance is what makes up for the cost. The reason behind both the performance and the cost of wrapper methods is that the search for a reduced feature set (or feature subset) is directly guided by model performance. As a rule, the same modelling algorithm that is eventually used for independent testing is used in the wrapper [3]. Consequently, the characteristics of the feature subset, as well as the computational cost of obtaining it, are dependent on the modelling algorithm of choice (assuming the whole search space is explored). Reviewing the literature, this choice seems to be based mostly on the computational cost associated with the modelling algorithm, regardless of the wrapper. Accordingly, simpler algorithms are preferred. However, this may lead to an overall performance that is notably lower than what could possibly be achieved by employing a more sophisticated algorithm. This raises the question whether subsets found by one algorithm are suitable for another, or whether at least a trade-off between computational cost and performance can be achieved by employing a cost-effective algorithm in the search process whose results are then utilised by a more complex one. Whilst seeking an answer, the danger of overfitting, a drawback frequently associated with the utilisation of wrapper methods, should be kept in mind.

The main objective of this paper was to provide an insight into the possible answers to the aforementioned question with respect to classification problems. To this end, an experimental investigation on datasets of various characteristics and involving several distinct classifiers was conducted. The obtained results were analysed from a number of perspectives relevant to the FS problem. Consequently, some light was shed onto the robustness of the considered classifiers to varying feature subsets as well.

The rest of the paper is organised as follows. Feature selection is briefly introduced in Sect. II, followed by a literature overview of wrapper methods. The main part of the paper, the experimental analysis, is given in Sect. III, where the devised methodology, experiment setup and the obtained results are reported. Finally, in Sect. IV the drawn conclusions along with possible avenues for future work are stated.

II. BACKGROUND

Should a large number of features be needed in order to describe the problem at hand, dataset handling could be rendered difficult, as stated earlier. This holds especially true for problems where data is gathered without considering a specific classification problem (e.g. aggregated customer attributes) [4]. Apart from the imposed temporal and spatial complexity, the sheer number of included features does not guarantee improved performance, as some could prove to be irrelevant or even detrimental [2].

Therefore, FS is used, and the methods for achieving it can be grouped into four categories: filter, wrapper, hybrid and embedded methods.

A. Feature selection

Filter methods rely upon some intrinsic data property (e.g. from information theory) and use it to rank the features [1]. A predefined number of highest-ranking features is then selected to carry out the realisation of the (classification) model. They work independently of classifiers and are generally fast and widely applicable. This independence comes at the cost of lower classification quality when compared to other approaches [5]. Wrapper methods, on the other hand, use the classifier output in order to guide the search for good feature subsets. They are called wrappers as they utilise a search mechanism that incorporates (i.e. wraps around) a classifier, which is treated as a black box. They usually lead to better subsets in terms of classification quality [5], but are computationally expensive, and the resulting subset could be biased towards the classifier guiding the search [1], [6]. Hybrid methods are a combination of filters and wrappers, attempting to alleviate the disadvantages of both by using a filter as a preprocessing step and then applying a wrapper for fine-tuning [7]. They are able to find good subsets, but some of the features filtered out could have proven useful in combination with other features, thus hampering the wrapper. The question of an appropriate filter, wrapper and classifier combination is also raised. Embedded FS methods are incorporated into the classifier itself [1], e.g. random forests, where trees are created using subsets of features, or artificial neural networks zeroing out the weights of the input layer. As this is performed during model training, it makes the training more difficult, and the problem of manipulating large datasets still remains. This myriad of available approaches is unsurprising, as FS is a complex task. This notion is emphasised further by the fact that features cannot be considered independently: a seemingly irrelevant feature can, in combination with others, prove to be important for the overall outcome [8]. Therefore, wrapper-based methods represent a common choice, as they are able to circumvent this issue. A brief sketch contrasting the first two categories follows below.
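As a hedged illustration of that contrast (not code from this paper), the following Python sketch ranks features with a filter criterion and, alternatively, scores an entire candidate subset through the wrapped classifier; mutual information and 3-fold cross-validation are illustrative choices only.

```python
# Illustrative filter vs. wrapper contrast (assumptions: scikit-learn,
# mutual information as the filter criterion, CV accuracy as the wrapper score).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

def filter_top_k(X, y, k):
    # Filter: rank features once by an intrinsic property and keep the
    # k highest-ranking ones; no classifier is involved in the ranking.
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:k]

def wrapper_score(clf, X, y, subset):
    # Wrapper: the (black-box) classifier itself judges a candidate subset;
    # a search mechanism calls this repeatedly for different subsets.
    return cross_val_score(clf, X[:, subset], y, cv=3).mean()
```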
B. Literature overview of wrapper-based feature selection

A number of experimental studies have been conducted in diverse application domains exploring combinations of wrapper-based FS approaches and classification algorithms. In these studies, the cooperation between different classifiers outside and inside the wrapper was also tested, albeit circumstantially.

The most common classifier used to compare different search methods for the wrapper is the k-nearest-neighbour (k-NN) classifier with k = 1 [9]–[11], likely due to its simplicity and the lack of tuning needed. According to the results obtained on a disease diagnosis problem [11], wrapper-based search utilising 1-NN for evaluation leads to a substantial improvement in performance for the same classifier outside of the wrapper. Meanwhile, the performance of the support vector machine (SVM), Bayesian network and multilayer perceptron (MLP) classifiers slightly deteriorated after the FS process. A similar effect can be observed in [12], where the 5-NN classifier was used within the wrapper, and the performance enhancement of several other classifiers could not match that attained when the same classifier was used on the subset for external testing. In addition to the influence it has on FS, the value of the parameter k in the k-NN algorithm is interesting to note, since the k = 5 setting is used very often in the literature [12]–[14]. Given that no explanation for this choice of parameter value is commonly provided, it can be assumed that these works simply apply the default settings that are common in many machine learning libraries and toolkits (like Weka and Scikit-learn).

For problems stemming from the biomedical domain, the naive Bayes (NB) classifier is a common choice within the wrapper, mainly due to its low computational cost and simplicity [12], [15], [16]. However, from a performance standpoint, an exacerbating effect on the performance of the SVM and k-NN classifiers is apparent, regardless of the wrapper used. The feature subsets found with the NB classifier are, apart from NB itself, an especially good fit for the MLP classifier. The latter is rarely utilised within the wrapper. On one dataset for predicting student performance [14], MLP was shown to improve the performance of several other classifiers [5-NN, NB, and decision tree (DT)], but the analysis was not comprehensive enough to draw concrete conclusions.

A sensible choice of the internal classifier could perhaps alleviate the tendency to overfit on the validation subset of the data, but as the choice of the wrapper also plays an important role, it is difficult to analyse them independently. However, the SVM classifier is well known in the literature for its exceptional generalisation capabilities [17]. In addition, it is extensively used within the recursive feature elimination (RFE-SVM) method in medical applications [2], [18]. The application of this method generally outperforms competing filter and embedded approaches to FS and effectively suits linear classifiers [SVM and logistic regression (LR)] and the k-NN algorithm. Although the RFE-SVM wrapper induces varying effects on the performance of other algorithms depending on the problem, in none of the studies conducted did these algorithms outperform the aforementioned classifiers.

The effects of selecting different classifiers inside and outside the search are scarcely explored in the literature. The work in [3] investigated differences between feature subsets found when employing the greedy forward step-wise wrapper along with several standard classifiers, namely k-NN, LR, MLP, NB and SVM. The results obtained on two datasets demonstrated a high level of dissimilarity between the resulting feature subsets, with the choice of classifier having a greater impact on diversity than the choice of the performance metric. However, besides showing that different classifiers yield dissimilar subsets, no insight on their effect on model performance was provided. On a similar note, the study in [6] analysed the impact of a wrapped classifier on the sizes of the feature subsets found, with the DT and NB algorithms selecting considerably more features than the SVM and k-NN algorithms. Yet, this was demonstrated on a single dataset.
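Returning to the RFE-SVM method mentioned above, it can be sketched with the scikit-learn implementation; the toy data and the target subset size are illustrative assumptions, not settings from [2] or [18].

```python
# Minimal RFE-SVM sketch (assumptions: scikit-learn, synthetic data,
# 10 retained features chosen arbitrarily for illustration).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=60, random_state=0)
# RFE ranks features by the weights of the fitted estimator and prunes
# the weakest ones step by step; a linear kernel exposes such weights.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
rfe.fit(X, y)
selected = rfe.support_  # boolean mask of the retained features
```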

Aside from the considered studies, it is questionable how well particular classifiers respond to FS. For example, a number of wrappers along with several classifiers were tested on a software defect prediction problem [19], and it turned out that the FS procedure mostly improves the classification performance of the k-NN and NB algorithms, but hardly that of LR and DT. Furthermore, a large experimental study on 400 datasets [4] investigated whether FS improves classification performance at all. Although datasets consisting of thoughtfully selected features were used, the distinction between the aftereffects on the classifiers was clear-cut. Since it significantly reduces the dimensionality of the problem, FS has proven to be a beneficial preprocessing technique prior to the application of the k-NN algorithm. Likewise, FS improved the performance of the NB and MLP classifiers. While the former is known to be sensitive to feature correlation, the latter should be able to discern on its own which features are relevant; however, results from the study show that FS significantly improves its performance in many cases. On the contrary, ensembles (AdaBoost and random forest) and SVM have shown to be the least fitting algorithms for FS. Given their effectiveness and ability to provide diverse solutions, ensembles have also been considered as subset evaluators within the wrapper [20]. On account of consuming substantially more computation time, FS approaches based on ensemble algorithms obtained better performance in comparison to wrappers utilising standard classifiers.

A review of the rather scarce literature on this issue provides insights into quite contradictory findings for individual classifiers. It is clear that the effects of the FS method are problem-dependent, but very few datasets have been used in the experimental studies conducted to draw well-founded conclusions. However, the assumption that can be made is that some algorithms are more sensitive to the use of different classifiers within wrappers (like SVM), while for some FS improves performance consistently (like k-NN). In addition, the question is to what extent a trade-off between performance and computational cost can be attained using a simple classifier, such as k-NN or NB.

TABLE I
CHARACTERISTICS OF THE DATASETS COMPRISING THE TEST BED

A  | name                          | #features | #samples | #classes
1  | Sonar                         | 60        | 208      | 2
2  | QSAR biodegradation           | 41        | 1055     | 2
3  | Connectionist Bench           | 60        | 208      | 2
4  | Dermatology                   | 34        | 358      | 6
5  | Hill-Valley                   | 100       | 1212     | 2
6  | Image Segmentation            | 19        | 210      | 7
7  | Ionosphere                    | 34        | 351      | 2
8  | Libras Movement               | 90        | 360      | 15
9  | Musk (Version 1)              | 166       | 476      | 2
10 | Parkinsons                    | 22        | 195      | 2
11 | Urban Land Cover              | 147       | 675      | 9
12 | Statlog (Vehicle Silhouettes) | 18        | 846      | 4
13 | LSVT Voice Rehabilitation     | 310       | 126      | 2
14 | Wine                          | 13        | 178      | 3

TABLE II
LIST OF CLASSIFIERS EMPLOYED

Label | name                   | parameters
1-NN  | k-nearest neighbours   | k = 1
5-NN  | k-nearest neighbours   | k = 5
DT    | decision tree          | Gini impurity for split quality, CART
GNB   | Gaussian naive Bayes   | —
RF    | random forest          | Gini impurity for split quality, n_est. = 100, max_feat. = √(#feat.)
SVM   | support vector machine | RBF kernel, C = 1.0, γ = 10

[Fig. 1. Concept of the utilised methodology. The dataset is split into train, validation and test subsets. The DE wrapper (population initialisation followed by a search loop of mutation, crossover and selection) evaluates candidate feature subsets with the selected classifier (one of 1-NN, 5-NN, GNB, DT, RF, SVM) and outputs the found feature subset, which is then evaluated with all classifiers.]

III. EXPERIMENTAL ANALYSIS

A. Methodology
In the interest of investigating the utility of feature subsets found by the wrapper using one classifier with respect to other types of classifiers, a comprehensive experimental analysis was conducted. To this end, several classifiers were employed, as well as diverse datasets. All of the datasets shown in Table I were taken from the UCI repository [21], apart from the A1 dataset, which was taken from the KEEL repository [22]. The considered classifiers are given in Table II along with the selected hyperparameters. It should be noted that two instances of the k-NN classifier were employed (1-NN and 5-NN), since both, as mentioned previously, are frequently utilised as the classifier of choice in the literature.

In order to mitigate the impact of the wrapper on the results, only a single wrapper was used throughout the experiments, whilst performing multiple runs. The DE algorithm as in [9] was employed as the wrapper, since it proved to be an overall good choice for the FS task by displaying highly competitive performance and stability against a number of other bio-inspired optimisers. Each of the considered classifiers was in turn employed in the wrapper to find a promising feature subset (reduced feature set). This was done using the training and validation splits of the data. The wrapper evaluated the quality of the generated candidate subsets using the classifier guiding the search on the validation split, in order to decrease the chance of overfitting.

Once the wrapper exhausted a preset number of attempts to create better feature subset candidates, the best among them was denoted as the subset found for that run and was then evaluated. The found subset was independently evaluated with each of the classifiers, thus determining its general utility and its behaviour in cases when different classifiers are utilised for guiding the search and for using the found subset. Model performance, both during the search, on the validation subset of the dataset, and during independent testing, on the test subset of the dataset, was measured by the standard F-score (the utilisation of classification accuracy as a popular performance measure might be ill-advised due to the class imbalance exhibited by some of the considered datasets). The aforementioned methodology is illustrated in Fig. 1. It should be noted that stratified holdout was used for evaluation, where the dataset is split into three disjoint subsets – one each for training, validation and independent testing – whilst keeping the class ratios the same as in the original dataset. Due to the stochastic nature of DE, a single split was calculated for each dataset and was kept the same for all considered classifiers and runs. The described approach to evaluation also enables a safe application of statistical tests when comparisons are conducted on both the validation and test subsets of the data.
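A minimal sketch of this evaluation protocol follows (illustrative, not the authors' code; it assumes scikit-learn, and the macro-averaged F-score is an assumption for the multi-class datasets, as the paper does not specify the averaging).

```python
# Sketch of the evaluation protocol: stratified 0.5/0.25/0.25 holdout
# and an F-score evaluation of a candidate subset on the validation split.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def stratified_three_way_split(X, y, seed=0):
    # Carve off 50% for training, then split the remainder evenly into
    # validation and test, keeping class ratios equal in every subset.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.5, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

def subset_fscore(clf, mask, train, eval_split):
    # Quality of one candidate subset: fit on the training split using only
    # the selected columns, then score on the validation (or test) split.
    (X_tr, y_tr), (X_ev, y_ev) = train, eval_split
    if not mask.any():  # an empty subset cannot form a model
        return 0.0
    model = clone(clf).fit(X_tr[:, mask], y_tr)
    return f1_score(y_ev, model.predict(X_ev[:, mask]), average="macro")
```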
B. Setup

For each classifier and dataset combination, 30 independent runs were executed. Each run of the wrapper (DE algorithm) was terminated when a preset number of function (subset quality) evaluations, NFEsmax = 10000, was reached. This relatively large number of evaluations (smaller values are typically used in the literature) was chosen as an additional means to mitigate the influence of the wrapper. Population initialisation was performed uniformly at random in [0, 1]^m (anew in each run). Parameter values frequently used in the literature were chosen for DE, i.e. CR = 0.9, F = 0.5 and NP = 50. Further, the datasets were preprocessed by normalising their features into the [0, 1] range through scaling, in order to mitigate the varying value ranges of the features (which might especially impact classifiers utilising metric functions). The standard split ratio of 0.5 : 0.25 : 0.25 for training, validation and testing, respectively, was used. The seed for the random number generator was kept fixed for all of the classifiers to obtain deterministic behaviour during fitting.
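With these settings, the wrapper can be sketched as a canonical DE/rand/1/bin loop. The real-to-binary mapping (feature j is selected iff x_j > 0.5) is a common convention assumed here; the exact details of [9] may differ.

```python
# Hedged sketch of a DE/rand/1/bin wrapper with the stated settings
# (NP = 50, F = 0.5, CR = 0.9, NFEs_max = 10000); not the authors' code.
import numpy as np

def de_wrapper(fitness, m, NP=50, F=0.5, CR=0.9, nfes_max=10000, seed=None):
    # fitness takes a boolean mask over the m features and returns the
    # validation F-score of the corresponding subset (to be maximised).
    rng = np.random.default_rng(seed)
    pop = rng.random((NP, m))                        # init uniformly in [0, 1]^m
    fit = np.array([fitness(x > 0.5) for x in pop])
    nfes = NP
    while nfes < nfes_max:
        for i in range(NP):
            choices = [j for j in range(NP) if j != i]
            r1, r2, r3 = rng.choice(choices, size=3, replace=False)
            mutant = pop[r1] + F * (pop[r2] - pop[r3])   # rand/1 mutation
            mutant = np.clip(mutant, 0.0, 1.0)
            cross = rng.random(m) < CR                    # binomial crossover
            cross[rng.integers(m)] = True                 # force >= 1 gene
            trial = np.where(cross, mutant, pop[i])
            f_trial = fitness(trial > 0.5)                # threshold at 0.5
            nfes += 1
            if f_trial >= fit[i]:                         # greedy selection
                pop[i], fit[i] = trial, f_trial
            if nfes >= nfes_max:
                break
    return pop[int(np.argmax(fit))] > 0.5                 # best subset mask
```

Combined with the earlier sketch, the fitness handed to the wrapper would be, e.g., fitness = lambda mask: subset_fscore(clf, mask, (X_tr, y_tr), (X_val, y_val)).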
C. Results and discussion

As stated in the beginning, the feature subsets found with the wrapper method are dependent on the utilised classifier (referred to as the wrapped classifier). Thus, different classifiers should yield different subsets (as was demonstrated in [3] on a couple of datasets). This is illustrated in Fig. 2, where the prevalence of particular features across multiple runs with respect to the various classifiers is shown. In order to assess the effect of such differences in subsets on other classifiers (referred to as tested classifiers), their average performance on the test subsets across the test bed was analysed and compared. The results of this are concisely presented in Table III. Shown are the average ranks obtained by application of the Friedman rank sum test, as well as the rejected null hypotheses (at a significance level of α = 0.05) associated with the pair-wise comparisons. The p-values from all pair-wise comparisons have been corrected with the Bergmann-Hommel post-hoc test. It should be noted that a lower average rank is better; the best ranks are marked with an asterisk to facilitate readability.

TABLE III
AVERAGE RANKS FROM THE FRIEDMAN TEST FOR TESTED CLASSIFIERS

Wrapped    | Tested classifier
classifier | 1-NN  | 5-NN  | DT    | GNB   | RF    | SVM
1-NN       | 2.29  | 3.21  | 4.29  | 4.93  | 1.93* | 4.36
             Rej. H0: 1-NN vs {DT, GNB, SVM}; RF vs {DT, GNB, SVM}
5-NN       | 2.64  | 2.86  | 4.43  | 4.93  | 1.93* | 4.21
             Rej. H0: GNB vs {1-NN, 5-NN, RF}; RF vs {DT, SVM}
DT         | 3.00  | 3.43  | 3.71  | 5.07  | 1.64* | 4.14
             Rej. H0: RF vs {DT, GNB, SVM}; 1-NN vs GNB
GNB        | 2.86  | 3.64  | 4.71  | 3.93  | 1.50* | 4.36
             Rej. H0: RF vs {5-NN, DT, GNB, SVM}
RF         | 2.57  | 3.29  | 4.07  | 4.93  | 1.71* | 4.43
             Rej. H0: RF vs {DT, GNB, SVM}; 1-NN vs GNB
SVM        | 2.57  | 3.43  | 5.00  | 5.07  | 2.00* | 2.93
             Rej. H0: GNB vs {1-NN, RF, SVM}; DT vs {1-NN, RF, SVM}

From the reported results it is clearly apparent that the RF classifier performed the best overall, regardless of the classifier that was used in the wrapper. This is perhaps not that surprising, considering that it is the most complex of the used classifiers. Moreover, it represents an ensemble and internally performs further feature selection (an embedded FS method). However, this complexity is also clearly reflected in its computational cost, as can be seen in Fig. 3, where the average execution times of the classifiers (not of a wrapper run) on the validation subsets are given (virtually the same as on the test subsets, since they were of equal size). Due to its substantially higher computational cost compared to the other classifiers, RF might not be a reasonable choice for the wrapped classifier, i.e. to be used in the wrapper. Different hyperparameters could perhaps be considered in order to reduce this computational cost, should it still be used in the wrapper. As for the remaining classifiers, GNB yielded the overall worst performance, closely followed by the DT classifier, with SVM not trailing far behind. Based on the pair-wise comparisons, the only classifier that did not show a statistically significant difference from RF was the 1-NN classifier, which is perhaps unexpected considering their differences in complexity. The 5-NN classifier was not far behind 1-NN in terms of overall performance, which is unsurprising, since it is the same classifier with a different value of its hyperparameter. A different perspective on the performance, but also the robustness, of the considered classifiers is provided by the results in Table IV, where the average ranks of the classifiers depending on the one used for the search (the wrapped classifier) are shown. This provides an insight into how sensitive each classifier is to varying feature subsets.
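In outline, the rank-based comparison above can be reproduced as follows (an illustration: SciPy provides only the omnibus Friedman test, while the Bergmann-Hommel corrected pair-wise comparisons require a dedicated post-hoc package; the score matrix here is a placeholder).

```python
# Outline of the Friedman rank comparison (placeholder data, not results).
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

fscores = np.random.rand(14, 6)        # placeholder: datasets x classifiers
stat, p = friedmanchisquare(*fscores.T)              # omnibus test
avg_ranks = rankdata(-fscores, axis=1).mean(axis=0)  # lower rank = better
```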

[Fig. 2. Differences in the feature subsets. Panels (a)–(n) correspond to datasets A1–A14; each panel shows, for the 1-NN, 5-NN, DT, GNB, RF and SVM classifiers, the prevalence (0.0–1.0) of each feature across the runs.]

Again, the RF classifier proved to be the least sensitive to the classifier used in the wrapper, as no null hypothesis was rejected in the pair-wise comparisons. The opposite can be observed for the GNB and SVM classifiers, where it is apparent that only subsets found with themselves being wrapped were suitable. Moreover, GNB seems to be the worst option, since the subsets found using it in the wrapper also yielded the worst performance on the 1-NN and 5-NN classifiers. This less than competitive performance may be attributed to its assumption of feature independence (rarely the case), which directly affects the subsets found. However, its low computational cost (also attributable to the assumption of feature independence), as depicted by the average execution times shown in Fig. 3, is likely the reason behind its widespread use. This supports the earlier premise about the preference for employing simpler classifiers in the wrapper, but also the concern that this may lead to inferior models, as opposed to the utilisation of even slightly more complex algorithms (or at least ones that do not assume feature independence). Roughly put, the results presented in Tables III and IV suggest that each classifier is able to determine the subsets that fit itself the best, which is intuitively expected.

TABLE IV
AVERAGE RANKS FROM THE FRIEDMAN TEST FOR SEARCH (WRAPPED) CLASSIFIERS

Tested     | Wrapped classifier
classifier | 1-NN  | 5-NN  | DT    | GNB   | RF    | SVM
1-NN       | 2.36* | 3.43  | 4.00  | 4.50  | 3.79  | 2.93
             Rej. H0: 1-NN vs GNB
5-NN       | 2.50* | 2.93  | 3.79  | 4.64  | 3.96  | 3.18
             Rej. H0: 1-NN vs GNB
DT         | 4.43  | 4.00  | 2.14* | 3.71  | 3.36  | 3.36
             Rej. H0: 1-NN vs DT
GNB        | 3.93  | 3.93  | 4.07  | 1.36* | 3.54  | 4.18
             Rej. H0: GNB vs {1-NN, 5-NN, DT, RF, SVM}
RF         | 3.86  | 3.82  | 3.18  | 3.64  | 2.64* | 3.86
             Rej. H0: —
SVM        | 3.96  | 3.54  | 3.96  | 3.82  | 4.25  | 1.46*
             Rej. H0: SVM vs {1-NN, 5-NN, DT, GNB, RF}

The results in Table V show the performance differences between the considered classifiers from this perspective (FR denotes the average rank from the Friedman test). Again, the superiority of the RF classifier is clearly apparent. It should be noted, however, that only the overall performance difference between it and both DT and GNB is statistically significant. Also, the computational cost for this level of performance must not be neglected.

TABLE V
AVERAGE F-SCORE (TESTED/WRAPPED CLASSIFIER)

A   | 1-NN/1-NN | 5-NN/5-NN | DT/DT | GNB/GNB | RF/RF | SVM/SVM
A1  | 0.727 | 0.646 | 0.676 | 0.625 | 0.745 | 0.702
A2  | 0.772 | 0.807 | 0.804 | 0.774 | 0.842 | 0.796
A3  | 0.727 | 0.639 | 0.675 | 0.634 | 0.739 | 0.716
A4  | 0.936 | 0.942 | 0.863 | 0.949 | 0.961 | 0.925
A5  | 0.638 | 0.562 | 0.584 | 0.374 | 0.622 | 0.403
A6  | 0.821 | 0.857 | 0.823 | 0.814 | 0.868 | 0.842
A7  | 0.859 | 0.861 | 0.853 | 0.906 | 0.912 | 0.931
A8  | 0.827 | 0.668 | 0.586 | 0.673 | 0.777 | 0.776
A9  | 0.833 | 0.784 | 0.776 | 0.786 | 0.863 | 0.745
A10 | 0.850 | 0.867 | 0.816 | 0.677 | 0.870 | 0.865
A11 | 0.762 | 0.778 | 0.772 | 0.839 | 0.810 | 0.765
A12 | 0.701 | 0.741 | 0.700 | 0.554 | 0.701 | 0.755
A13 | 0.734 | 0.667 | 0.581 | 0.460 | 0.758 | 0.396
A14 | 0.932 | 0.948 | 0.913 | 0.962 | 0.934 | 0.969
FR  | 3.46  | 3.36  | 4.64  | 4.36  | 1.68* | 3.50
Rej. H0: RF vs {DT, GNB}

Just how much the fitting, i.e. the search for a feature subset, affects other classifiers is illustrated in Fig. 4, where the average ranks obtained by applying the Friedman test throughout the search process on the validation subsets are shown. It can be easily observed that each classifier is able to attain a reduced feature set that improves its own performance on the validation subsets. Yet, in doing so, the performance of the other classifiers seems to decline. In accordance with the previously presented results (Table IV), some classifiers are more sensitive to this than others. As the search progresses, the differences in average ranks typically become large, which is to be expected, since better and better fitting subsets are found. These graphs might be a first indicator of overfitting. A further indication of the occurrence of overfitting is provided by the data in Fig. 5, where the differences in terms of performance on the validation and test subsets are shown, as well as the differences in terms of the reduced and full feature sets. As can be discerned, only the differences involving the test subset and the full feature set attain negative values, whereas the ones involving the validation subset and the reduced feature set are always non-negative and, in all but a few instances (a negative slope marked in red), considerably greater. Especially concerning are the differences between the reduced and full feature sets with respect to the test subsets, where negative values are not uncommon for most of the classifiers (SVM being the exception). This is certainly undesirable, and in some instances renders FS detrimental to the generalisation ability of the model. The number of allowed function evaluations (NFEsmax = 10000) might be held accountable for this. Thus, the comparatively small numbers of evaluations frequently used in the literature (see, e.g., [23]–[25]), a thousand or even fewer, can be interpreted as an attempt to avoid overfitting (as a rule, no elaboration on the setting of NFEsmax is given in those studies).

[Fig. 3. Average execution time (in milliseconds, log-scale) of the 1-NN, 5-NN, DT, GNB, RF and SVM classifiers on the validation subset of each dataset A1–A14.]

In adherence to the principle of Occam's razor, reducing the feature set as much as possible whilst improving the model performance is the ultimate goal of FS. Thus, the amount of reduction of the feature set is something that should always be considered, at least as a secondary objective. A vague insight into this aspect of the FS task was already provided by Fig. 2, where differences in the attained subsets with respect to the wrapped classifier can be discerned. A clearer perspective on their effect on the subset sizes, i.e. the amount of reduction, is given by the results in Table VI (the dimensionality of the datasets is given in parentheses to facilitate the interpretation). The aforementioned differences are also reflected by the amounts of reduction. The SVM classifier stands out as the best performer, with the greatest average set reductions on most datasets. In conjunction with the results in Table IV and Fig. 3, it seems to be a good overall choice to be used in the wrapper. Most of the other classifiers (tested classifiers) did not suffer notable performance penalties when it was used as the wrapped classifier, and its computational cost is comparatively low – especially on datasets containing a smaller number of samples.

TABLE VI
AVERAGE FEATURE SET REDUCTION (IN PERCENT)

A         | 1-NN | 5-NN | DT   | GNB  | RF   | SVM
A1 (60)   | 48.8 | 50.6 | 51.5 | 62.9 | 53.9 | 68.8
A2 (41)   | 54.2 | 51.7 | 50.6 | 53.3 | 49.2 | 69.6
A3 (60)   | 47.9 | 49.2 | 53.7 | 60.9 | 53.1 | 67.7
A4 (34)   | 36.2 | 40.0 | 56.1 | 45.8 | 37.7 | 74.1
A5 (100)  | 57.5 | 55.8 | 52.4 | 53.5 | 60.1 | 74.2
A6 (19)   | 57.4 | 50.7 | 49.6 | 56.7 | 53.2 | 45.3
A7 (34)   | 57.3 | 69.8 | 57.7 | 56.0 | 68.6 | 63.9
A8 (90)   | 55.2 | 50.4 | 54.0 | 54.9 | 51.9 | 73.8
A9 (166)  | 50.8 | 55.2 | 52.0 | 53.5 | 52.6 | 76.1
A10 (22)  | 47.6 | 51.1 | 50.8 | 74.2 | 72.9 | 48.3
A11 (147) | 51.7 | 52.4 | 50.7 | 57.6 | 54.2 | 86.9
A12 (18)  | 44.1 | 41.7 | 40.9 | 45.9 | 50.9 | 44.4
A13 (310) | 51.8 | 51.6 | 51.8 | 51.3 | 51.8 | 50.3
A14 (13)  | 34.6 | 44.6 | 47.7 | 30.8 | 29.5 | 35.1
FR        | 4.07 | 3.79 | 4.14 | 3.14 | 3.43 | 2.43*
Rej. H0: —

Moreover, another favourable aspect to the choice of SVM is presented by the results in Table VII, where the run-to-run consistency of the found feature subsets is shown. Given are the values of the calculated adjusted similarity measure (ASM) [26], which is a measure of stability. The consistency of subsets found in multiple runs is important for stochastic wrappers, as mentioned in [9], from the perspective of feature interpretability and for offering an intuition about their interactions. The values of ASM lie in the [−1, 1] range, and greater values indicate a higher stability, i.e. consistency.

TABLE VII
ADJUSTED SIMILARITY MEASURE

A   | 1-NN  | 5-NN  | DT    | GNB   | RF    | SVM
A1  | 0.100 | 0.125 | 0.082 | 0.285 | 0.142 | 0.135
A2  | 0.184 | 0.188 | 0.141 | 0.238 | 0.188 | 0.336
A3  | 0.107 | 0.122 | 0.064 | 0.279 | 0.125 | 0.144
A4  | 0.188 | 0.174 | 0.284 | 0.181 | 0.154 | 0.576
A5  | 0.199 | 0.186 | 0.040 | 0.044 | 0.116 | 0.164
A6  | 0.274 | 0.273 | 0.127 | 0.358 | 0.323 | 0.262
A7  | 0.122 | 0.260 | 0.169 | 0.114 | 0.245 | 0.327
A8  | 0.120 | 0.104 | 0.066 | 0.101 | 0.073 | 0.127
A9  | 0.089 | 0.068 | 0.029 | 0.114 | 0.067 | 0.135
A10 | 0.231 | 0.141 | 0.267 | 0.405 | 0.365 | 0.189
A11 | 0.093 | 0.082 | 0.045 | 0.127 | 0.058 | 0.138
A12 | 0.220 | 0.454 | 0.286 | 0.436 | 0.257 | 0.433
A13 | 0.033 | 0.036 | 0.015 | 0.016 | 0.023 | -0.001
A14 | 0.213 | 0.321 | 0.354 | 0.331 | 0.141 | 0.246
FR  | 3.71  | 3.36  | 4.79  | 2.64  | 3.86  | 2.64
Rej. H0: DT vs {GNB, SVM}
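The exact ASM formula is defined in [26] and is not reproduced here; as a loosely related illustration of run-to-run consistency, the sketch below computes the mean pairwise Jaccard similarity of the subsets found across runs (a simpler proxy, assuming subsets are represented as boolean masks).

```python
# Proxy for run-to-run subset consistency (illustration only; not the ASM).
import numpy as np
from itertools import combinations

def mean_pairwise_jaccard(masks):
    # masks: list of boolean arrays, one selected-feature mask per run.
    sims = [np.logical_and(a, b).sum() / max(int(np.logical_or(a, b).sum()), 1)
            for a, b in combinations(masks, 2)]
    return float(np.mean(sims))
```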

[Fig. 4. Average ranks from the Friedman test during search. Panels (a) 1-NN, (b) 5-NN, (c) DT, (d) GNB, (e) RF and (f) SVM correspond to the wrapped classifier; each shows the average rank of every tested classifier on the validation subsets as a function of the fraction of NFEsmax expended (0.01–1).]

[Fig. 5. F-score differences per classifier and feature set: (a) difference between the F-score on the validation and test subsets; (b) difference between the F-score with the reduced and the full feature set.]

Due to the stochastic nature of the employed wrapper and the multimodality of the FS problem, highly consistent feature subsets cannot be expected to be found in multiple algorithm runs. Nevertheless, stability should not be neglected when developing FS approaches. The study in [9] showed a varying stability for different wrappers, whilst the results shown here (Table VII) suggest that the classifier used in the wrapper itself plays a significant role in the stability as well – perhaps an even greater role than the wrapper.

IV. CONCLUSION

This paper attempted to give an insight into whether one classifier used in the wrapper can provide subsets that are effective for other classifiers as well. To this end, an experimental analysis involving classifiers with different characteristics and diverse datasets was conducted. The obtained results do not provide a simple and clear-cut answer, but rather suggestions for achieving a trade-off between computational cost and model performance. In summary, each of the considered classifiers was able to find feature subsets that fit itself the best, whilst incurring performance penalties for the others, confirming the premise that using wrappers adapts the found subsets to the classifier they wrap. However, it was found that not all are equally sensitive to this and that some yield subsets that are more versatile and beneficial than others. This sheds some light on the question posed in the paper title – the choice of wrapped classifier is a crucial element in wrapper-based FS.

In the scope of the considered classifiers, SVM demonstrated some favourable behaviour from the perspective of the classifier to be employed in the wrapper. Namely, it attained the overall smallest feature subsets with the highest stability, i.e. run-to-run consistency of the found subsets, all whilst incurring a modest computational cost. However, it proved to be one of the most sensitive (along with GNB) when other classifiers were employed in the wrapper. Conversely, RF showed itself to be the least sensitive, offering the best classification performance regardless of the classifier used in the wrapper. Yet, its substantial computational cost is enough to make it a poor choice for the wrapped classifier. Altogether, a combination of these two might yield a promising trade-off between cost and performance on a wide range of datasets.

Some of the reported results suggest the occurrence of overfitting. A certain amount of overfitting was certainly to be expected, as were the aforementioned feature subsets biased towards the classifier used in the wrapper. A possible remedy might be a simple reduction of the allowed number of evaluations, since relatively low numbers of evaluations are typical in the literature when bio-inspired optimisers are used as wrappers. However, this directly impacts the extent of the search that can be performed and may possibly lead to the omission of promising solutions. Nevertheless, the benefit of limiting the extent of the search is argued in [27]. An alternative path to alleviating the occurrence of overfitting, albeit slightly, might be the utilisation of a different evaluation procedure altogether, like k-fold cross-validation. Yet, a straightforward application may be ill-advised: the final evaluation on an external/independent test subset is crucial for avoiding overoptimistic results, as stressed in e.g. [10], [28]. Hence, maintaining a separate data subset for the final evaluation and performing k-fold cross-validation inside the wrapper might be a viable and simple approach (see the sketch below). Whatever the approach, it is necessary to combat overfitting; otherwise, the feature set reduction may come at the expense of the generalisation ability of the model (as opposed to using the full feature set).
A6 0.274 0.273 0.127 0.358 0.323 0.262
R EFERENCES
A7 0.122 0.260 0.169 0.114 0.245 0.327 [1] U. Stańczyk, “Feature evaluation by filter, wrapper, and embedded
A8 0.120 0.104 0.066 0.101 0.073 0.127 approaches,” in Feature Selection for Data and Pattern Recognition,
A9 0.089 0.068 0.029 0.114 0.067 0.135 2015, pp. 29–44.
A10 0.231 0.141 0.267 0.405 0.365 0.189 [2] C. A. Kumar, M. Sooraj, and S. Ramakrishnan, “A comparative
A11 0.093 0.082 0.045 0.127 0.058 0.138 performance evaluation of supervised feature selection algorithms on
A12 0.220 0.454 0.286 0.436 0.257 0.433 microarray datasets,” Procedia Comput. Sci., vol. 115, pp. 209–217,
A13 0.033 0.036 0.015 0.016 0.023 -0.001 2017.
A14 0.213 0.321 0.354 0.331 0.141 0.246 [3] R. Wald, T. M. Khoshgoftaar, and A. Napolitano, “How the choice of
FR 3.71 3.36 4.79 2.64 3.86 2.64 wrapper learner and performance metric affects subset evaluation,” in
Rej. H0 DT vs {GNB, SVM} Proc. ICTAI’13, 2013, pp. 426–432.
[4] M. J. Post, P. van der Putten, and J. N. van Rijn, “Does feature selection
improve classification? a large scale experiment in openml,” in Int. Symp.
Intell. Data Anal., 2016, pp. 158–170.
the best, whilst incurring performance penalties for the others [5] G. Martinović, D. Bajer, and B. Zorić, “A differential evolution approach
to dimensionality reduction for classification needs,” Int. J. Appl. Math.
confirming the premise that using wrappers adapts the found Comput. Sci., vol. 24, no. 1, pp. 111–122, 2014.
subsets to the classifier they wrap. However, it was found that [6] K. Chrysostomou, S. Y. Chen, and X. Liu, “Combining multiple clas-
not all are equally sensitive to this and that some yield subsets sifiers for wrapper feature selection,” Int. J. Data Min. Model. Manag.,
vol. 1, no. 1, pp. 91–102, 2008.
that are more versatile and beneficial than others. This sheds [7] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, “Hybrid feature selection by
some light on the question posed in the paper title – the choice combining filters and wrappers,” Expert Syst. Appl., vol. 38, no. 7, pp.
of wrapped classifier is a crucial element in wrapper-based FS. 8144–8150, 2011.
[8] I. Guyon and A. Elisseeff, “An introduction to variable and feature
In scope of the considered classifiers, SVM demonstrated selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.
some favourable behaviour from the perspective of the clas- [9] D. Bajer, B. Zorić, M. Dudjak, and G. Martinović, “Evaluation and
sifier to be employed in the wrapper. Namely, it attained the analysis of bio-inspired optimisation algorithms for feature selection,”
in Proc. Informatics’19, 2019, pp. 18–25.
overall smallest feature subsets with the highest stability i.e. [10] J. Reunanen, “Overfitting in making comparisons between variable
run-to-run consistency of found subsets all whilst incurring a selection methods,” J. Mach. Learn. Res., vol. 3, pp. 1371–1382, 2003.
modest computational cost. However, it proved to be one of the [11] M. S. Wibawa, H. A. Nugroho, and N. A. Setiawan, “Performance
evaluation of combined feature selection and classification methods
most sensitive (along with GNB) when other classifiers were in diagnosing parkinson disease based on voice feature,” in Proc.
employed in the wrapper. Conversely, RF showed to be the ICSITech’15, 2015, pp. 126–131.
least sensitive by offering the best classification performance, [12] N. Suchetha, A. Nikhil, and P. Hrudya, “Comparing the wrapper
feature selection evaluators on twitter sentiment classification,” in Proc.
regardless of the classifier used in the wrapper. Yet, its ICCIDS’19, 2019, pp. 1–6.
substantial computational cost is enough to make it a poor [13] H. Lang, J. Zhang, X. Zhang, and J. Meng, “Ship classification in sar
choice for the wrapped classifier. Altogether, a combination image by joint feature and classifier selection,” IEEE Geosci. Remote.
S., vol. 13, no. 2, pp. 212–216, 2015.
of the these two might yield a promising trade-off between [14] H. Turabieh, “Hybrid machine learning classifiers to predict student
cost and performance on a wide range of datasets. performance,” in Proc. ICTCS’19, 2019, pp. 1–6.

[15] A. A. Shanab, T. M. Khoshgoftaar, and R. Wald, “Evaluation of wrapper-based feature selection using hard, moderate, and easy bioinformatics data,” in Proc. BIBE’14, 2014, pp. 149–155.
[16] M. Fajila and M. Jahan, “The effect of evolutionary algorithm in gene subset selection for cancer classification,” Int. J. Mod. Educ. Comput. Sci., vol. 7, pp. 60–66, 2018.
[17] P. Bartlett and J. Shawe-Taylor, “Generalization performance of support vector machines and other pattern classifiers,” Advances in Kernel Methods—Support Vector Learning, pp. 43–54, 1999.
[18] E. Hemphill, J. Lindsay, C. Lee, I. I. Măndoiu, and C. E. Nelson, “Feature selection and classifier performance on diverse biological datasets,” in BMC Bioinform., vol. 15, no. S13, 2014, p. S4.
[19] A. O. Balogun, S. Basri, S. J. Abdulkadir, and A. S. Hashim, “Performance analysis of feature selection methods in software defect prediction: A search method approach,” Appl. Sci., vol. 9, no. 13, p. 2764, 2019.
[20] R. Panthong and A. Srivihok, “Wrapper feature subset selection for dimension reduction based on ensemble learning algorithm,” Procedia Comput. Sci., vol. 72, pp. 162–169, 2015.
[21] K. Bache and M. Lichman, “UCI machine learning repository,” 2013, http://archive.ics.uci.edu/ml.
[22] J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework,” J. Mult.-Valued Log. Soft Comput., vol. 17, no. 2–3, pp. 255–287, 2011.
[23] S. Arora and P. Anand, “Binary butterfly optimization approaches for feature selection,” Expert Syst. Appl., vol. 116, pp. 147–160, 2019.
[24] J. P. Papa, G. H. Rosa, A. N. de Souza, and L. C. Afonso, “Feature selection through binary brain storm optimization,” Comput. & Elec. Eng., vol. 72, pp. 468–481, 2018.
[25] E. Emary, H. M. Zawbaa, and A. E. Hassanien, “Binary grey wolf optimization approaches for feature selection,” Neurocomput., vol. 172, no. C, pp. 371–381, 2016.
[26] U. M. Khaire and R. Dhanalakshmi, “Stability of feature selection algorithm: A review,” J. King Saud Univ. - Comp. Inf. Sci., 2019.
[27] J. Loughrey and P. Cunningham, “Overfitting in wrapper-based feature subset selection: The harder you try the worse it gets,” in Proc. SGAI’04, 2004, pp. 33–43.
[28] P. Smialowski, D. Frishman, and S. Kramer, “Pitfalls of supervised feature selection,” Bioinform., vol. 26, no. 3, pp. 440–443, 2010.
