
Neurocomputing 169 (2015) 187–195


Kernel methods for heterogeneous feature selection


Jérôme Paul, Roberto D'Ambrosio, Pierre Dupont
Université catholique de Louvain – ICTEAM/Machine Learning Group, Place Sainte Barbe 2 bte L5.02.01, B-1348 Louvain-la-Neuve, Belgium

Article history: Received 30 June 2014; received in revised form 12 December 2014; accepted 29 December 2014; available online 16 April 2015.

Abstract: This paper introduces two feature selection methods to deal with heterogeneous data that include continuous and categorical variables. We propose to plug a dedicated kernel that handles both kinds of variables into a Recursive Feature Elimination procedure using either a non-linear SVM or Multiple Kernel Learning. These methods are shown to offer state-of-the-art performances on a variety of high-dimensional classification tasks.

© 2015 Elsevier B.V. All rights reserved.

Keywords: Heterogeneous feature selection; Kernel methods; Mixed data; Multiple kernel learning; Support vector machine; Recursive feature elimination

1. Introduction

Feature selection is an important preprocessing step in machine learning and data mining, as increasingly more data are available and problems with hundreds or thousands of features have become common. Such high dimensional data appear in many areas, such as gene expression array analysis, text processing of internet documents, and economic forecasting. Feature selection allows domain experts to interpret a decision model by reducing the number of variables to analyze. It also reduces training and classification times as well as measurement and storage requirements.

To the best of our knowledge, little effort has been dedicated to developing feature selection methods tailored for datasets with both categorical and numerical values. Such heterogeneous data are found in several applications. For instance, in the medical domain, high dimensional continuous feature sets (e.g. gene expression data) are typically considered along with a few clinical features. These features can be continuous (e.g. blood pressure) or categorical (e.g. sex, smoker vs non-smoker). To highlight important variables, a naive approach would transform heterogeneous data into either fully continuous or fully categorical variables before applying any standard feature selection algorithm. To get a continuous dataset, categorical variables can be encoded as numerical values. The specific choice of such numerical values is however arbitrary: it introduces an artificial order between the feature values and can lead to largely different distance measures between instances [1].

A standard approach relies on a multivariate numerical encoding, such as the disjunctive encoding, to represent categorical variables. For instance, a feature having 3 possible categories could be encoded by considering 3 new features instead: (1, 0, 0), (0, 1, 0) and (0, 0, 1). However, such encodings need specific approaches, such as group lasso [2], to correctly handle feature selection at the granularity of the original features.

The discretization of continuous features is a common alternative to represent categorical and numerical features in a similar space. Such an approach comes at the price of making the selection highly sensitive to the specific discretization [1].

A natural alternative would consider tree ensemble methods such as Random Forests (RF), since they can be grown from both types of variables and perform an embedded selection. RF were however shown to bias the selection towards variables with many values [3]. The cForest method has been introduced to correct this bias [3], but its computational time is drastically increased and becomes prohibitive when dealing with thousands of features.²

In this paper we propose two kernel based methods for feature selection. They are conceptually similar to disjunctive encoding, while keeping the original features throughout the whole selection process.

Corresponding author: Jérôme Paul. E-mail addresses: [email protected] (J. Paul), [email protected] (R. D'Ambrosio), [email protected] (P. Dupont).
1 http://www.ucl.ac.be/mlg/.
2 In each node of each tree of the forest, a conditional independence permutation test needs to be performed to select the best variable, instead of a simple Gini evaluation.
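The encoding pitfall discussed in the introduction can be made concrete with a small sketch (an illustrative example with a made-up three-valued feature, not taken from the paper): an arbitrary ordinal encoding makes some pairs of categories artificially closer than others, while a disjunctive (one-hot) encoding keeps all distinct categories equidistant.

```python
import math

# A hypothetical categorical feature with three values.
categories = ["red", "green", "blue"]

# Arbitrary ordinal encoding: red=0, green=1, blue=2.
ordinal = {c: float(i) for i, c in enumerate(categories)}

# Disjunctive (one-hot) encoding: one indicator per category.
onehot = {c: [1.0 if i == j else 0.0 for j in range(len(categories))]
          for i, c in enumerate(categories)}

def euclid(u, v):
    """Euclidean distance between scalars or equal-length vectors."""
    u = u if isinstance(u, list) else [u]
    v = v if isinstance(v, list) else [v]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Ordinal encoding induces an artificial order: d(red, blue) = 2 * d(red, green).
print(euclid(ordinal["red"], ordinal["green"]))  # 1.0
print(euclid(ordinal["red"], ordinal["blue"]))   # 2.0

# One-hot encoding keeps every pair of distinct categories equidistant (sqrt(2)).
print(euclid(onehot["red"], onehot["green"]))
print(euclid(onehot["red"], onehot["blue"]))
```

This is exactly the distortion referred to above: any distance-based selection criterion sees "red" as closer to "green" than to "blue" under the ordinal encoding, even though no such order exists in the data.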

http://dx.doi.org/10.1016/j.neucom.2014.12.098

In both approaches, the selection is performed by the Recursive Feature Elimination (RFE) [4] mechanism, which iteratively ranks variables according to their importances. We propose to extract those feature importances from two different kernel methods, the Support Vector Machine (SVM) and Multiple Kernel Learning (MKL), with a dedicated heterogeneous kernel. We use the clinical kernel [5], which handles both kinds of features in classification tasks.

The remainder of this document is organized as follows. Section 2 describes the two proposed methods. Section 3 briefly presents the competing approaches we compare to in our experiments. The experimental setting is presented in Section 4. Results are discussed in Section 5. Finally, Section 6 concludes this work.

2. Material and methods

This section presents the different building blocks that compose our two heterogeneous feature selection methods. Recursive Feature Elimination (RFE), the main feature selection mechanism, is presented in Section 2.1. It internally uses a global variable ranking for both continuous and categorical features. This ranking is extracted from two kernel methods (Support Vector Machine and Multiple Kernel Learning) that use a dedicated heterogeneous kernel called the clinical kernel (Section 2.2). Section 2.3 details how to obtain a feature ranking from a non-linear SVM. Finally, Section 2.4 sketches Multiple Kernel Learning, which offers an alternative way to rank variables with the clinical kernel.

2.1. Recursive feature elimination

RFE [4] is an embedded backward elimination strategy that iteratively builds a feature ranking by removing the least important features of a classification model at each step. Following [6], a fixed proportion of 20% of the features is dropped at each iteration. The benefit of such a fixed proportion is that the actual number of features removed at each step gradually decreases until being rounded to 1, allowing a finer ranking of the most important features. This iterative process continues until all variables are ranked; the number of iterations thus depends automatically on the total number p of features to be ranked. RFE is most commonly used in combination with a linear SVM from which feature weights are extracted, but it can be used with any classification model from which individual feature importances can be deduced. A general pseudo-code for RFE is given in Algorithm 1.

Algorithm 1. Recursive Feature Elimination.

    R ← empty ranking
    F ← set of all features
    while F is not empty do
        train a classifier m using F
        extract variable importances from m
        remove the 20% least important features from F
        put those features on top of R
    end
    return R

2.2. Clinical kernel

The so-called clinical kernel proposed in [5] was shown to outperform a linear kernel for classifying heterogeneous data. It averages univariate subkernels [7] defined for each feature:

    k(x_i, x_j) = \frac{1}{p} \sum_{f=1}^{p} k_f(x_{if}, x_{jf})    (1)

    k_f(a, b) = \begin{cases} I(a = b) & \text{if } f \text{ is categorical} \\ \frac{(\max_f - \min_f) - |a - b|}{\max_f - \min_f} & \text{if } f \text{ is continuous} \end{cases}    (2)

where x_i is a data point in p dimensions, x_{if} is the value of x_i for feature f, I is the indicator function, a and b are scalars, and max_f and min_f are the maximum and minimum values observed for feature f, respectively. One can note that summing kernels simply amounts to concatenating variables in the kernel-induced space.

Given two data points, the subkernel values lie between 0, when the feature values are farthest apart, and 1, when they are identical, similar to the Gaussian kernel. The clinical kernel is basically an unweighted average of overlap kernels [8] for categorical features and triangular kernels [9,10] for continuous features. The overlap kernel can also be seen as a rescaled l1-norm on a disjunctive encoding of the categorical variables. The clinical kernel assigns the same importance to each original variable. We show here the benefit of adapting this kernel for heterogeneous feature selection.

2.3. Feature importance from non-linear Support Vector Machines

The Support Vector Machine (SVM) [11] is a well-known algorithm that is widely used to solve classification problems. It looks for the largest-margin hyperplane that separates samples of different classes. In the case of a linear SVM, one can measure feature importances by looking at their respective weights in the hyperplane. When dealing with a non-linear SVM, we can instead look at the variation in margin size 1/||w||. Since the larger the margin, the lower the generalization error (at least in terms of bound), a feature that does not decrease the margin size much is not deemed important for generalization purposes. So, in order to measure feature importances with a non-linear SVM, one can look at the influence on the margin of removing a particular feature [12].

The margin is inversely proportional to

    W^2(\alpha) = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j) = \|w\|^2    (3)

where α_i and α_j are the dual variables of a SVM, y_i and y_j the labels of x_i and x_j, respectively, out of n training examples, and k a kernel. Therefore, the importance of a particular feature f can be approximated, without re-estimating α, by the following formula:

    J_{SVM}(f) = \left| W^2(\alpha) - W^2_{(-f)}(\alpha) \right|    (4)

    W^2_{(-f)}(\alpha) = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i^{-f}, x_j^{-f})    (5)

where x_i^{-f} is the ith training example without the feature f. In Eq. (5), the α's are kept identical to those in Eq. (3). This is a computationally efficient approximation originally proposed in [12]. The feature importance is thus evaluated with respect to the separating hyperplane in the current feature space, and hence the current decision function.

Updating k(x_i, x_j) to k(x_i^{-f}, x_j^{-f}) is efficient and straightforward with the clinical kernel (Section 2.2): there is no need to recompute the sum of all subkernels; one only has to remove k_f (Eq. (2)) and normalize accordingly. Removing one such subkernel is equivalent to removing features in the projected space, which is similar to what is done with a linear kernel.

In this work, we propose to combine the J_SVM feature importance (Eq. (4)) with the RFE mechanism in order to provide a full ranking of the features. This method will be referred to as RFESVM.

2.4. Feature importance from Multiple Kernel Learning

MKL [13] learns an appropriate linear combination of M basis kernels, each one possibly associated with a specific input variable, as well as a discriminant function. The resulting kernel is a weighted combination of different input kernels:

    k(x_i, x_j) = \sum_{m=1}^{M} \mu_m k_m(x_i, x_j) \quad \text{s.t.} \quad \mu_m \geq 0    (6)

Summing kernels is equivalent to concatenating the respective feature maps ψ_1, …, ψ_M induced by those kernels. The associated decision function f(x) is a generalized linear model in the induced space:

    f(x) = \sum_{m=1}^{M} \sqrt{\mu_m} \, w_m^T \psi_m(x) + b    (7)

where μ_m, w_m and ψ_m are respectively the kernel weight, the feature weights and the explicit feature map corresponding to the mth kernel, and b is a bias term. Those parameters are estimated by minimizing the following objective:

    \operatorname*{argmin}_{w, b, \mu \geq 0} \; C \sum_{i=1}^{n} \ell(f(x_i), y_i) + \frac{1}{2} \sum_{m=1}^{M} \|w_m\|_2^2 \quad \text{such that} \quad \|\mu\|_2^2 \leq 1    (8)

where C > 0 and ℓ denotes the hinge loss ℓ(f(x), y) = max{0, 1 − y f(x)}. We note that the kernel weight vector μ is l2-regularized, in contrast to MKL approaches using sparsity-inducing norms [14]. Indeed, non-sparse MKL has been shown to be more effective on various computational biology problems [15]. It is also more convenient in our context, since we interpret |μ_m| as a feature importance measure and look for a full ranking of all features.

In this work, we adapt the clinical kernel (Eq. (2)) with MKL to learn a non-uniform combination of the basis kernels, each one associated with a single feature. As we can see in Eq. (7), μ_f reflects the influence of kernel k_f in the decision function [13]. μ_f can thus be seen as the importance J_MKL(f) of feature f.

The combination of RFE with this feature importance extracted from MKL will be referred to as RFEMKL. It specifically uses the kernel weights |μ_f| as feature importance values to eliminate, at each iteration, a prescribed fraction of the least relevant features.

3. Competing approaches

This section presents the three competing methods we compare to in the experiments: Random Forest [16] and two variants of Hybrid Feature Selection [1].

The Random Forest (RF) algorithm builds an ensemble of T decision trees, each one grown on a bootstrap sample of the dataset. The subset of data points used to build a particular tree forms its bag; the remaining points form its out-of-bag set. To compute variable importances, Breiman [16] proposes a permutation test: the out-of-bag samples are used to estimate how much the predictive performances of the RF decrease when permuting a particular variable. The bigger the drop in accuracy, the higher the variable importance. In order to obtain a good and stable feature selection from RF, a large ensemble of 10,000 trees (RF10000) is considered, according to the analysis in [17].

An alternative method performs a greedy forward selection, aggregating separate rankings for each type of variable into a global ranking [1]. The authors report improved results over those of the method proposed in [18], which is based on neighborhood relationships between heterogeneous samples. Out of a total of p variables, categorical and continuous features are first ranked independently. Mutual information (MI) was originally proposed for those rankings, but a reliable estimate of MI is difficult to obtain whenever fewer samples than dimensions are available. Instead, we use the p-values of a t-test to rank continuous features and of a Fisher exact test for categorical ones. The two feature rankings are then combined into a global ranking by iteratively adding the first categorical or continuous variable that maximizes the predictive performance of a Naive Bayes or a 5-NN classifier (consistently with the choices made in [1]). The NN classifier uses the Heterogeneous Euclidean-Overlap Metric [19] between pairs of instances, defined as follows:

    d(x_i, x_j) = \sqrt{\sum_{f=1}^{p} d_f(x_{if}, x_{jf})^2}    (9)

    d_f(a, b) = \begin{cases} I(a \neq b) & \text{if } f \text{ is categorical} \\ \frac{|a - b|}{\max_f - \min_f} & \text{if } f \text{ is continuous} \end{cases}    (10)

    d_f(a, b) = 1 - k_f(a, b)    (11)

This metric is closely related to the clinical kernel (Eq. (2)). For each feature, d_f takes value 0 for identical points and value 1 for points that are farthest apart in that dimension. We refer to these approaches as HFSNB and HFS5NN in the sequel.

4. Experiments

In order to compare the five feature selection methods, we report predictive performances of classifiers built on the selected variables, as well as quality measures on those feature sets. A statistical analysis is also performed to assess whether there are significant differences between the performances of the various methods. This section presents the experimental protocol, the various evaluation metrics and the datasets that we use in our experiments.

4.1. Experimental protocol

When a sufficient amount of data is available, 10-fold cross-validation (10-CV) provides a reliable estimate of model performances [20]. However, it may lead to inaccurate estimates on small-sized datasets, due to a higher variability in the different folds. We thus make use of a resampling strategy consisting of 200 random splits of the data into training (90%) and test (10%) sets. Such a protocol has the same training/test proportions as 10-CV but benefits from a larger number of tests. It also keeps the training size sufficiently large so as to report performances close enough to those of a model estimated on the whole available data.

For each data partition, the training set is used to rank features and to build predictive models using different numbers of features. The ranking is recorded and predictive performances are measured while classifying the test set. Average predictive performances are reported over all test folds, and the stability of the various signature sizes is computed from the 200 feature rankings. The average number of selected categorical features is also computed for each signature size. This number does not reflect a specific performance value of the feature selection methods but rather gives some insight into how they deal with the selection of heterogeneous variables.

Whenever a SVM is trained with the clinical kernel, the regularization parameter is fixed to a predefined value estimated from preliminary experiments on independent datasets. Such a value is set to 0.1 for the feature selection itself and to 10 when learning a final classifier on the selected features.

4.2. Performance metrics

Predictive performances are reported here in terms of the balanced classification rate (BCR), which is the average of sensitivity and specificity. This metric is particularly popular in the medical domain and, unlike AUC, easily generalizes to multi-class problems with unbalanced priors. For binary classification, it is defined as follows:

    BCR = \frac{1}{2} \left( \frac{TP}{P} + \frac{TN}{N} \right)    (12)

where TP (resp. TN) is the number of true positives (resp. negatives) and P (resp. N) the number of positive (resp. negative) samples in the dataset.

Selection stability is assessed here through Kuncheva's index (KI) [21], which measures to which extent K sets of s selected features share common elements:

    KI(\{S_1, \ldots, S_K\}) = \frac{2}{K(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \frac{|S_i \cap S_j| - \frac{s^2}{p}}{s - \frac{s^2}{p}}    (13)

where p is the total number of features and s²/p is a correction for the random chance that two feature sets S_i and S_j share common features. KI takes values in (−1, 1]. A value of 0 indicates a random selection; the larger the KI, the larger the number of commonly selected features.

In order to globally compare the five feature selection methods, a Friedman statistical test [22] is performed across all datasets and all feature set sizes. A low p-value indicates that there is indeed a difference between the various algorithm performances. In that case, a Nemenyi post-hoc test [22] is performed to find out which methods perform significantly differently from the others.

4.3. Datasets

We report results on 7 binary classification datasets, briefly described in Table 1 in terms of number of features and class priors. The Arrhythmia [23] dataset aims at distinguishing between the presence or absence of cardiac arrhythmia from features extracted from electrocardiograms. The Bands [23] dataset tackles the problem of band (grooves) detection on cylinders engraved by rotogravure printing; it consists of physical measurements and technical printing specifications. The task associated with the Heart [23] dataset is to detect the presence of a heart disease in the patient; variables come from clinical measurements. The Hepatitis [23] dataset is about predicting survival to hepatitis from clinical variables. The goal of the Housing [24] dataset is to evaluate the median value of owner-occupied homes from local statistics; the two classes are defined by a cutoff at $20,000. The Rheumagene [25] dataset aims at diagnosing arthritis at a very early stage of the disease; genomic variables are provided along with 3 clinical variables. Finally, the van't Veer [26] dataset tackles a breast cancer prognosis problem; this very high dimensional dataset consists of genomic features from microarray analysis and seven clinical variables, two of them being categorical.

Table 1. Datasets overview.

    Name              Continuous features   Categorical features   Class priors
    Arrhythmia [23]   198                   64                     245/185
    Bands [23]        20                    14                     312/228
    Heart [23]        6                     7                      164/139
    Hepatitis [23]    6                     13                     32/123
    Housing [24]      15                    2                      215/291
    Rheumagene [25]   100                   3                      28/21
    van't Veer [26]   4353                  2                      44/33

5. Results and discussion

We compare here RFEMKL and RFESVM to HFSNB, HFS5NN and a RF of 10,000 trees on 7 real-life datasets, resulting in more than 7000 experiments. These methods essentially provide a ranking of the features, without defining specific feature weights.³ Predictive performances can then be assessed on a common basis for all techniques by selecting all features up to a prescribed rank and estimating a classifier restricted to those features. We use here a non-linear SVM with the clinical kernel reduced to the selected features as the final classifier. Other final classifiers, such as RF, Naive Bayes or 5-NN, offer similar predictive performances and are not reported here.

We first compare all selection techniques across all feature set sizes and datasets to give a general view of the performances. Choosing a specific number of features is indeed often left to the final user who, for instance, might favor the greater interpretability of a reduced feature set at the price of some decrease in predictive performance. Our second analysis focuses on a fixed number of features offering a good trade-off between predictive performances and sparsity.

Fig. 1 reports the statistical analysis across all datasets and all feature set sizes using a Friedman test, followed by a Nemenyi post-hoc test. Figs. 2–5 report more detailed results: they show the predictive performance, the stability of the feature selection and the average number of selected categorical features for each signature size of each dataset.

The Friedman test [22] can be seen as a non-parametric equivalent of the repeated-measures ANOVA. It tests whether the methods significantly differ based on their average ranks. In our experiments, it shows significant differences between the predictive performances of the 5 feature selection methods across all datasets and all feature set sizes (p-value < 10⁻⁶). According to the Nemenyi post-hoc test (see Fig. 1, left), RFEMKL is best ranked (i.e. it has the lowest mean rank) and performs significantly better than HFS5NN and RFESVM, which appear at the end of the ranking. Our data do not show significant differences between the predictive performances of RFEMKL, RF10000 and HFSNB. A Friedman test on the feature selection stability also shows highly significant differences (p-value < 10⁻²⁹) between the 5 feature selection approaches. According to a Nemenyi post-hoc test (see Fig. 1, right), our RFE approaches are at the bottom of the ranking. RFEMKL is however not significantly less stable than HFSNB and RF10000. In addition, the two HFS approaches may have the natural advantage that they are based on filter methods, which are more stable than embedded methods [27]. Moreover, the RFs had to be run with a very large number of trees (10,000) to provide a stable feature selection [17]. This leads to increased computational times and heavier models, especially on datasets with a higher number of instances.

3 Feature weights are used at each RFE iteration, but those weights need not be comparable globally across iterations.
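Eqs. (12) and (13) are straightforward to compute; below is a minimal sketch with illustrative values (not taken from the paper's experiments).

```python
def bcr(tp, tn, p, n):
    """Eq. (12): balanced classification rate,
    the average of sensitivity (TP/P) and specificity (TN/N)."""
    return 0.5 * (tp / p + tn / n)

def kuncheva_index(signatures, p):
    """Eq. (13): Kuncheva's stability index for K feature sets of
    equal size s drawn among p features, with the chance-overlap
    correction s^2 / p applied to each pairwise intersection."""
    K = len(signatures)
    s = len(signatures[0])
    total = 0.0
    for i in range(K - 1):
        for j in range(i + 1, K):
            shared = len(set(signatures[i]) & set(signatures[j]))
            total += (shared - s * s / p) / (s - s * s / p)
    return 2.0 * total / (K * (K - 1))

print(bcr(tp=40, tn=30, p=50, n=50))           # 0.7
print(kuncheva_index([{1, 2, 3}] * 5, p=100))  # 1.0 (identical signatures)
```

With identical signatures the index reaches its maximum of 1, while signatures overlapping only by chance give values near 0, matching the interpretation given above.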

[Figure 1: two critical difference diagrams (BCR and KI), x-axis: mean rank.]

Fig. 1. Nemenyi critical difference diagrams [22]: comparison of the predictive performances (BCR) and stability (KI) of the five algorithms over all signature sizes of all datasets. Horizontal black lines group together methods whose mean ranks do not differ significantly. CD represents the rank difference needed to have a 95% confidence that method performances are significantly different.

[Figure 2: six panels plotting BCR, KI and the number of selected categorical features against the number of features, for the Arrhythmia and Bands datasets.]

Fig. 2. Predictive performances (BCR), feature selection stability (KI) and number of selected categorical features for each signature size of the Arrhythmia and Bands datasets. The dashed line defines the minimal number of features to select without losing much in predictive performances (see text).

[Figure 3: six panels plotting BCR, KI and the number of selected categorical features against the number of features, for the Heart and Hepatitis datasets.]

Fig. 3. Predictive performances (BCR), feature selection stability (KI) and number of selected categorical features for each signature size of the Heart and Hepatitis datasets. The dashed line defines the minimal number of features to select without losing much in predictive performances (see text).

On the Arrhythmia and Bands datasets, the 200 resamplings require 1.5 times more CPU time with RF10000 (single-core implementation in the randomForest R-package [28]) than with the RFE methods (in the Shogun [29] implementation of MKL and SVM). On the Housing dataset, the RF implementation is 5 times slower than the RFE methods.⁴

The top left graph of Fig. 2 shows the predictive performances of the five methods on the Arrhythmia dataset. We can see that RFEMKL and RF10000 perform best, since they avoid selecting categorical features, which happen to be noisy on this dataset (Fig. 2, left). The bottom right plot of Fig. 4 reports the average number of categorical features among the selected features for the Rheumagene dataset. It shows that all methods but RFESVM and HFS5NN select two categorical variables first, leading to already good predictive performances with very few selected variables (top right graph of Fig. 4). The third categorical variable is actually never selected, since it happens to convey very little information to predict the class label.⁵ On the van't Veer dataset, the HFS approaches tend to keep selecting the two categorical variables even when the feature selection is very aggressive (Fig. 5, bottom). They show a peak in predictive performances when 5 features are kept (Fig. 5, left). However, the best predictive performance (Fig. 5, left) is obtained with RFEMKL, which selects one of the two categorical variables. It also corresponds to a very good feature selection stability, as shown in the right graph of Fig. 5. Finally, on the three high dimensional datasets (Arrhythmia, Rheumagene and van't Veer), RFESVM is significantly less stable.

4 Specifically, CPU times were measured on a 2.60 GHz machine with 8 GB of RAM. On this dataset, RFEMKL, RFESVM and RF10000 took respectively 23 min, 26 min and 114 min to run.
5 Out of 49 samples (28 negative, 21 positive), this variable takes value '0' 46 times and '1' only 3 times.

[Figure 4: six panels plotting BCR, KI and the number of selected categorical features against the number of features, for the Housing and Rheumagene datasets.]

Fig. 4. Predictive performances (BCR), feature selection stability (KI) and number of selected categorical features for each signature size of the Housing and Rheumagene datasets. The dashed line defines the minimal number of features to select without losing much in predictive performances (see text).

We further analyze below the various feature selection methods for a fixed number of selected features. One could indeed be interested in selecting a feature set as small as possible, with only a marginal decrease in predictive performances. For each dataset, we choose the smallest feature set size such that the BCR of RFEMKL lies in the 95% confidence interval of the best RFEMKL predictive performance. Those signature sizes are highlighted in Figs. 2–5 by vertical dashed lines. A Friedman test on those predictive performances finds significant differences (p-value of 0.008). A Nemenyi post-hoc test (Fig. 6, left) shows that the two best ranked methods, RF10000 and RFEMKL, perform significantly better than RFESVM in terms of BCR. Feature selection stabilities also significantly differ according to a Friedman test (p-value of 0.02). Fig. 6 illustrates that the ranking among the five methods is the same for stability and BCR. Those results on a fixed number of features show that RFEMKL and RF10000 are the two best performing methods, without significant differences between them, but at a larger computational cost for the latter.

6. Conclusion and perspectives

We introduce two heterogeneous feature selection techniques that can deal with continuous and categorical features. They combine Recursive Feature Elimination with variable importances extracted from MKL (RFEMKL) or a non-linear SVM (RFESVM). These methods use a dedicated kernel combining continuous and categorical variables. Experiments show that RFEMKL produces state-of-the-art predictive performances and is as good as competing methods in terms of feature selection stability. It offers results similar to Random Forests with smaller computational times. RFESVM performs worse than RFEMKL. It also seems less efficient

Fig. 5. Predictive performances (BCR), feature selection stability (KI) and number of selected categorical features for each signature size of the van't Veer dataset. The dashed line defines the minimal number of features to select without losing much in predictive performance (see text).
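The dashed-line rule (the smallest signature size whose BCR stays within the 95% confidence interval of the best mean BCR) can be sketched as follows. This is an illustrative reading of the rule: the function name, the normal-approximation confidence interval, and all BCR values are placeholders, not the paper's data or code.

```python
import numpy as np

def minimal_signature_size(sizes, bcr_runs):
    """Smallest signature size whose mean BCR stays within the 95%
    confidence interval of the best mean BCR (illustrative only)."""
    means = {s: float(np.mean(bcr_runs[s])) for s in sizes}
    best = max(means, key=means.get)
    runs = np.asarray(bcr_runs[best], dtype=float)
    # Normal-approximation 95% CI around the best mean BCR
    half_width = 1.96 * runs.std(ddof=1) / np.sqrt(len(runs))
    eligible = [s for s in sizes if means[s] >= means[best] - half_width]
    return min(eligible)

sizes = [10, 5, 2, 1]
bcr_runs = {10: [0.84, 0.86], 5: [0.85, 0.85],
            2: [0.78, 0.80], 1: [0.70, 0.72]}
print(minimal_signature_size(sizes, bcr_runs))  # → 5
```

Here size 5 is kept because its mean BCR equals the best one, while sizes 2 and 1 fall below the interval's lower bound.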


Fig. 6. Nemenyi critical difference diagrams [22]: comparison of the predictive performances (BCR) and stability (KI) of the five algorithms for one small signature size in each dataset. Horizontal black lines group together methods whose mean ranks do not differ significantly. CD represents the rank difference needed to have a 95% confidence that the methods' performances are significantly different.
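The critical difference in these diagrams follows Demšar [22]: with k methods compared over N datasets, CD = q_α · √(k(k+1)/(6N)). A hedged sketch of the whole procedure, Friedman test followed by the Nemenyi CD (the scores below are placeholders, not the paper's results; q_0.05 = 2.728 is the tabulated value for k = 5):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows are datasets, columns are methods (placeholder BCR values,
# not the paper's results).
scores = np.array([
    [0.85, 0.83, 0.86, 0.78, 0.80],
    [0.80, 0.76, 0.81, 0.74, 0.75],
    [0.66, 0.63, 0.67, 0.60, 0.62],
    [0.88, 0.85, 0.87, 0.82, 0.83],
])
n_datasets, k = scores.shape

# Friedman test: do the k methods differ over the N datasets?
stat, p_value = friedmanchisquare(*scores.T)

# Mean rank of each method (rank 1 = best, i.e. highest BCR)
mean_ranks = rankdata(-scores, axis=1).mean(axis=0)

# Nemenyi critical difference (Demsar 2006); q_0.05 = 2.728 for k = 5
q_alpha = 2.728
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

print(f"Friedman p = {p_value:.3f}, CD = {cd:.2f}")
```

Two methods whose mean ranks differ by less than `cd` are joined by a horizontal line in the diagram, i.e. they cannot be declared significantly different at the 95% level.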

in terms of prediction and stability than competing approaches, even though not significantly different from all competitors.
The two kernel-based methods proposed here are among the few existing selection methods that specifically tackle heterogeneous features. Yet, we plan in our future work to improve their stability, possibly by resorting to an ensemble procedure [6].
We observed that the proposed methods run faster than the competing approaches on various datasets. Those differences would be worth reassessing in a further study relying on parallel implementations.

Acknowledgments

Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Equipements de Calcul Intensif en Fédération Wallonie Bruxelles (CECI) funded by the Fonds de la Recherche Scientifique de Belgique (FRS-FNRS).

References

[1] G. Doquire, M. Verleysen, An hybrid approach to feature selection for mixed categorical and continuous data, in: International Conference on Knowledge Discovery and Information Retrieval (KDIR 2011), 2011, pp. 394–401.
[2] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc.: Ser. B Stat. Methodol. 68 (1) (2006) 49–67. http://dx.doi.org/10.1111/j.1467-9868.2005.00532.x.
[3] C. Strobl, A.-L. Boulesteix, A. Zeileis, T. Hothorn, Bias in random forest variable importance measures: illustrations, sources and a solution, BMC Bioinform. 8 (1) (2007) 25. http://dx.doi.org/10.1186/1471-2105-8-25.

[4] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1–3) (2002) 389–422. http://dx.doi.org/10.1023/A:1012487302797.
[5] A. Daemen, B. De Moor, Development of a kernel function for clinical data, in: Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2009), 2009, pp. 5913–5917. http://dx.doi.org/10.1109/IEMBS.2009.5334847.
[6] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, Y. Saeys, Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics 26 (3) (2010) 392–398. http://dx.doi.org/10.1093/bioinformatics/btp630.
[7] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, NY, USA, 2004.
[8] B. Vanschoenwinkel, B. Manderick, Appropriate kernel functions for support vector machine learning with sequences of symbolic data, in: J. Winkler, M. Niranjan, N. Lawrence (Eds.), Deterministic and Statistical Methods in Machine Learning, Lecture Notes in Computer Science, vol. 3635, Springer, Berlin, Heidelberg, 2005, pp. 256–280. http://dx.doi.org/10.1007/11559887_16.
[9] M.G. Genton, Classes of kernels for machine learning: a statistics perspective, J. Mach. Learn. Res. 2 (2002) 299–312.
[10] C. Berg, J.P.R. Christensen, P. Ressel, Harmonic Analysis on Semigroups, Springer-Verlag, New York, 1984.
[11] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, ACM, New York, NY, USA, 1992, pp. 144–152. http://dx.doi.org/10.1145/130385.130401.
[12] I. Guyon, Feature Extraction: Foundations and Applications, vol. 207, Springer-Verlag, Berlin Heidelberg, 2006.
[13] G. Lanckriet, T. De Bie, N. Cristianini, M. Jordan, W. Noble, A statistical framework for genomic data fusion, Bioinformatics 20 (16) (2004) 2626–2635. http://dx.doi.org/10.1093/bioinformatics/bth294.
[14] F. Bach, G. Lanckriet, M. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 41–48.
[15] M. Kloft, U. Brefeld, P. Laskov, K.-R. Müller, A. Zien, S. Sonnenburg, Efficient and accurate lp-norm multiple kernel learning, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems, vol. 22, Curran Associates, Inc., New York, 2009, pp. 997–1005.
[16] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32. http://dx.doi.org/10.1023/A:1010933404324.
[17] J. Paul, M. Verleysen, P. Dupont, The stability of feature selection and class prediction from ensemble tree classifiers, in: ESANN 2012, 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2012, pp. 263–268.
[18] Q. Hu, J. Liu, D. Yu, Mixed feature selection based on granulation and approximation, Knowl. Based Syst. 21 (4) (2008) 294–304. http://dx.doi.org/10.1016/j.knosys.2007.07.001.
[19] R. Wilson, T. Martinez, Improved heterogeneous distance functions, J. Artif. Intell. Res. 6 (1997) 1–34.
[20] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI'95, vol. 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1995, pp. 1137–1143.
[21] L. Kuncheva, A stability index for feature selection, in: AIAP'07: Proceedings of the 25th Conference on IASTED International Multi-Conference, ACTA Press, Anaheim, CA, USA, 2007, pp. 390–395.
[22] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[23] A. Frank, A. Asuncion, UCI Machine Learning Repository, 2010. URL: 〈http://archive.ics.uci.edu/ml〉.
[24] F. Leisch, E. Dimitriadou, mlbench: Machine Learning Benchmark Problems, R package version 2.1-1 (2010).
[25] I. Focant, D. Hernandez-Lobato, J. Ducreux, P. Durez, A. Toukap, D. Elewaut, F. Houssiau, P. Dupont, B. Lauwerys, Feasibility of a molecular diagnosis of arthritis based on the identification of specific transcriptomic profiles in knee synovial biopsies, Arthritis Rheum. 63 (2011) 751.
[26] L. van 't Veer, H. Dai, M. van de Vijver, Y. He, A. Hart, M. Mao, H. Peterse, K. van der Kooy, M. Marton, A. Witteveen, G. Schreiber, R. Kerkhoven, C. Roberts, P. Linsley, R. Bernards, S. Friend, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (6871) (2002) 530–536. http://dx.doi.org/10.1038/415530a.
[27] A.-C. Haury, P. Gestraud, J.-P. Vert, The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures, PLoS One 6 (12) (2011) e28210. http://dx.doi.org/10.1371/journal.pone.0028210.
[28] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (3) (2002) 18–22. URL: 〈http://CRAN.R-project.org/doc/Rnews/〉.
[29] S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, V. Franc, The SHOGUN machine learning toolbox, J. Mach. Learn. Res. 11 (2010) 1799–1802.

Jérôme Paul is a Ph.D. student in the Machine Learning Group of the Université catholique de Louvain. He received a master's degree in computer engineering in 2010. His research topics include feature selection and classification from heterogeneous data, with a special focus on high-dimensional biomedical data.

Roberto D'Ambrosio received an M.S. in Biomedical Engineering from the University Campus Bio-Medico of Rome in 2010. In 2014 he received a joint Ph.D. from the University Campus Bio-Medico of Rome and the University of Nice, in Biomedical Engineering and in Sciences and Technologies of Information and Communication, respectively. He is a post-doctoral researcher at the Université catholique de Louvain. His interests cover the areas of pattern recognition and machine learning. His research focuses on imbalanced datasets, estimation of posterior probabilities in classification tasks, stochastic gradient descent algorithms and feature selection.

Pierre Dupont received an M.S. in Electrical Engineering from the Université catholique de Louvain, Belgium, in 1988, and a Ph.D. in Computer Science from l'École Nationale Supérieure des Télécommunications, Paris, in 1996. In 1996–1997, he was a post-doctoral researcher at Carnegie Mellon University, Pittsburgh, USA. Since 2001, Pierre Dupont has been a Professor at the Université catholique de Louvain, where he co-supervises the Machine Learning Group. His current research interests include novel machine learning methods to tackle real problems arising in computational biology, bio-statistics and medical research, feature selection and dimensionality reduction, high-throughput data analysis and biomarker identification.
