Kernel methods for heterogeneous feature selection
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article history:
Received 30 June 2014
Received in revised form 12 December 2014
Accepted 29 December 2014
Available online 16 April 2015

Abstract

This paper introduces two feature selection methods to deal with heterogeneous data that include continuous and categorical variables. We propose to plug a dedicated kernel that handles both kinds of variables into a Recursive Feature Elimination procedure using either a non-linear SVM or Multiple Kernel Learning. These methods are shown to offer state-of-the-art performances on a variety of high-dimensional classification tasks.

© 2015 Elsevier B.V. All rights reserved.
Keywords:
Heterogeneous feature selection
Kernel methods
Mixed data
Multiple kernel learning
Support vector machine
Recursive feature elimination
http://dx.doi.org/10.1016/j.neucom.2014.12.098
188 J. Paul et al. / Neurocomputing 169 (2015) 187–195
process. In both approaches, the selection is performed by the Recursive Feature Elimination (RFE) [4] mechanism that iteratively ranks variables according to their importances. We propose to extract those feature importances from two different kernel methods: the Support Vector Machine (SVM) and Multiple Kernel Learning (MKL), with a dedicated heterogeneous kernel. We use the clinical kernel [5], which handles both kinds of features in classification tasks.

The remainder of this document is organized as follows. Section 2 describes the two proposed methods. Section 3 briefly presents the competing approaches we compare to in our experiments. The experimental setting is presented in Section 4. Results are discussed in Section 5. Finally, Section 6 concludes this work.

2. Material and methods

This section presents the different building blocks that compose our two heterogeneous feature selection methods. Recursive Feature Elimination (RFE), the main feature selection mechanism, is presented in Section 2.1. It internally uses a global variable ranking for both continuous and categorical features. This ranking is extracted from two kernel methods (Support Vector Machine and Multiple Kernel Learning) that use a dedicated heterogeneous kernel called the clinical kernel (Section 2.2). Section 2.3 details how to obtain a feature ranking from a non-linear SVM. Finally, Section 2.4 sketches Multiple Kernel Learning, which offers an alternative way to rank variables with the clinical kernel.

2.1. Recursive feature elimination

RFE [4] is an embedded backward elimination strategy that iteratively builds a feature ranking by removing the least important features of a classification model at each step. Following [6], a fixed proportion of 20% of the features is dropped at each iteration. The benefit of such a fixed proportion is that the actual number of features removed at each step gradually decreases until it is rounded to 1, allowing a finer ranking of the most important features. This iterative process is pursued until all variables are ranked, so the number of iterations automatically depends on the total number p of features to be ranked. RFE is most commonly used in combination with a linear SVM from which feature weights are extracted. However, it can be used with any classification model from which individual feature importances can be deduced. A general pseudo-code for RFE is given in Algorithm 1.

Algorithm 1. Recursive Feature Elimination.

  R ← empty ranking
  F ← set of all features
  while F ≠ ∅ do
    train a model on the features in F and compute the importance of each f ∈ F
    E ← the 20% least important features of F (at least one)
    F ← F \ E;  prepend E to R
  return R

2.2. The clinical kernel

The clinical kernel averages univariate subkernels [7] defined for each feature:

  k(x_i, x_j) = (1/p) Σ_{f=1}^{p} k_f(x_{if}, x_{jf})    (1)

  k_f(a, b) = I(a = b)                                          if f is categorical
  k_f(a, b) = ((max_f − min_f) − |a − b|) / (max_f − min_f)     if f is continuous    (2)

where x_i is a data point in p dimensions, x_{if} is the value of x_i for feature f, I is the indicator function, a and b are scalars, and max_f and min_f are the maximum and minimum values observed for feature f, respectively. One can note that summing kernels simply amounts to concatenating variables in the kernel-induced space.

Given two data points, the subkernel values lie between 0, when the feature values are farthest apart, and 1 when they are identical, similarly to the Gaussian kernel. The clinical kernel is basically an unweighted average of overlap kernels [8] for categorical features and triangular kernels [9,10] for continuous features. The overlap kernel can also be seen as a rescaled l1-norm on a disjunctive encoding of the categorical variables. The clinical kernel assigns the same importance to each original variable. We show here the benefit of adapting this kernel for heterogeneous feature selection.

2.3. Feature importance from non-linear Support Vector Machines

The Support Vector Machine (SVM) [11] is a well-known algorithm that is widely used to solve classification problems. It looks for the largest-margin hyperplane that distinguishes between samples of different classes. In the case of a linear SVM, one can measure the feature importances by looking at their respective weights in the hyperplane. When dealing with a non-linear SVM, we can instead look at the variation in margin size 1/‖w‖. Since the larger the margin, the lower the generalization error (at least in terms of bound), a feature that does not decrease the margin size much is not deemed important for generalization purposes. So, in order to measure feature importances with a non-linear SVM, one can look at the influence on the margin of removing a particular feature [12].

The margin is inversely proportional to

  W²(α) = Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j) = ‖w‖²    (3)

where α_i and α_j are the dual variables of a SVM, y_i and y_j the labels of x_i and x_j, respectively, out of n training examples, and k a kernel. Therefore, the importance of a particular feature f can be approximated without re-estimating α by the following formula:

  J_SVM(f) = |W²(α) − W²_(−f)(α)|    (4)

where W²_(−f)(α) is computed as in Eq. (3) on the data with feature f removed:

  W²_(−f)(α) = Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i^(−f), x_j^(−f))    (5)
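Eqs. (1)–(4) translate directly into a few lines of NumPy. The sketch below is ours, not the authors' implementation: `clinical_kernel`, `W2` and `j_svm` are illustrative names, and the dual variables `alpha` are assumed to come from any trained SVM solver.

```python
import numpy as np

def clinical_kernel(xi, xj, is_cat, lo, hi):
    """Clinical kernel of Eqs. (1)-(2): mean of univariate subkernels.
    is_cat: boolean mask of categorical features; lo/hi: per-feature
    min/max observed on the training data (for continuous features)."""
    den = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    cont = ((hi - lo) - np.abs(xi - xj)) / den          # triangular kernel
    k = np.where(is_cat, (xi == xj).astype(float), cont)  # overlap kernel if categorical
    return k.mean()

def W2(alpha, y, X, kernel):
    """Squared inverse margin of Eq. (3): sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    n = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    ay = alpha * y
    return ay @ K @ ay

def j_svm(f, alpha, y, X, is_cat, lo, hi):
    """Importance of feature f (Eq. (4)): margin variation when f is removed,
    without re-estimating alpha."""
    keep = np.arange(X.shape[1]) != f
    full = W2(alpha, y, X, lambda a, b: clinical_kernel(a, b, is_cat, lo, hi))
    wo = W2(alpha, y, X[:, keep],
            lambda a, b: clinical_kernel(a, b, is_cat[keep], lo[keep], hi[keep]))
    return abs(full - wo)
```

As expected from Eq. (2), the subkernel mean equals 1 for identical points and 0 for points that are farthest apart on every feature.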
In this work, we propose to combine the J_SVM feature importance (Eq. (4)) with the RFE mechanism in order to provide a full ranking of the features. This method will be referred to as RFESVM.

2.4. Feature importance from Multiple Kernel Learning

MKL [13] learns an appropriate linear combination of M basis kernels, each one possibly associated to a specific input variable, as well as a discriminant function. The resulting kernel is a weighted combination of different input kernels:

  k(x_i, x_j) = Σ_{m=1}^{M} μ_m k_m(x_i, x_j)   s.t. μ_m ≥ 0    (6)

Summing kernels is equivalent to concatenating the respective feature maps ψ_1, …, ψ_M induced by those kernels. The associated decision function f(x) is a generalized linear model in the induced space:

  f(x) = Σ_{m=1}^{M} √μ_m w_m^T ψ_m(x) + b    (7)

where μ_m, w_m and ψ_m are respectively the kernel weight, the feature weights and the explicit feature map corresponding to the m-th kernel, and b is a bias term. Those parameters are estimated by minimizing the following objective:

  argmin_{w,b,μ≥0}  C Σ_{i=1}^{n} ℓ(f(x_i), y_i) + (1/2) Σ_{m=1}^{M} ‖w_m‖²₂   such that ‖μ‖²₂ ≤ 1    (8)

where C > 0 and ℓ denotes the hinge loss ℓ(f(x), y) = max{0, 1 − y f(x)}. We note that the kernel weight vector μ is l2-regularized, in contrast to MKL approaches using sparsity-inducing norms [14]. Indeed, non-sparse MKL has been shown to be more effective on various computational biology problems [15]. It is also more convenient in our context, since we interpret |μ_m| as a feature importance measure and look for a full ranking of all features.

In this work, we adapt the clinical kernel (Eq. (2)) with MKL to learn a non-uniform combination of the basis kernels, each one associated to a single feature. As we can see in Eq. (7), μ_f reflects the influence of kernel k_f in the decision function [13]. μ_f can thus be seen as the importance J_MKL(f) of feature f.

The combination of RFE with this feature importance extracted from MKL will be referred to as RFEMKL. It specifically uses the kernel weights |μ_f| as feature importance values to eliminate, at each iteration, a prescribed fraction of the least relevant features.

3. Competing approaches

This section presents the three competing methods we compare to in the experiments: Random Forest [16] and two variants of Hybrid Feature Selection [1].

The Random Forest (RF) algorithm builds an ensemble of T decision trees, each one grown on a bootstrap sample of the dataset. The subset of data points that are used to build a particular tree forms its bag; the remaining set of points is its out-of-bag. To compute variable importances, Breiman [16] proposes a permutation test. It uses the out-of-bag samples to estimate how much the predictive performances of the RF decrease when permuting a particular variable. The bigger the drop in accuracy, the higher the variable importance. In order to obtain a good and stable feature selection from RF, a large ensemble of 10,000 trees (RF10000) is considered, according to the analysis in [17].

An alternative method performs a greedy forward selection aggregating separate rankings for each type of variable into a global ranking [1]. The authors report improved results over those of the method proposed in [18], which is based on neighborhood relationships between heterogeneous samples. Out of a total of p variables, categorical and continuous features are first ranked independently. Mutual information (MI) was originally proposed for those rankings, but a reliable estimate of MI is difficult to obtain whenever fewer samples than dimensions are available. Instead, we use the p-values of a t-test to rank continuous features and of a Fisher exact test for categorical ones. The two feature rankings are then combined into a global ranking by iteratively adding the first categorical or continuous variable that maximizes the predictive performance of a Naive Bayes or a 5-NN classifier (consistently with the choices made in [1]). The NN classifier uses the Heterogeneous Euclidean-Overlap Metric [19] between pairs of instances:

  d(x_i, x_j) = sqrt( Σ_{f=1}^{p} d_f(x_{if}, x_{jf})² )    (9)

  d_f(a, b) = I(a ≠ b)                      if f is categorical
  d_f(a, b) = |a − b| / (max_f − min_f)     if f is continuous    (10)

  d_f(a, b) = 1 − k_f(a, b)    (11)

This metric is closely related to the clinical kernel (Eq. (2)). For each feature, d_f takes value 0 for identical points and value 1 for points that are farthest apart in that dimension. We refer to these approaches as HFSNB and HFS5NN in the sequel.

4. Experiments

In order to compare the five feature selection methods, we report the predictive performances of classifiers built on the selected variables, as well as quality measures on those feature sets. A statistical analysis is also performed to assess whether there are significant differences between the performances of the various methods. This section presents the experimental protocol, the various evaluation metrics and the datasets that we use in our experiments.

4.1. Experimental protocol

When a sufficient amount of data is available, 10-fold cross-validation (10-CV) provides a reliable estimate of model performances [20]. However, it may lead to inaccurate estimates on small-sized datasets, due to a higher variability between the different folds. We thus make use of a resampling strategy consisting of 200 random splits of the data into training (90%) and test (10%) sets. Such a protocol has the same training/test proportions as 10-CV but benefits from a larger number of tests. It also keeps the training size sufficiently large to report performances close enough to those of a model estimated on the whole available data.

For each data partition, the training set is used to rank features and to build predictive models using different numbers of features. The ranking is recorded and predictive performances are measured while classifying the test set. Average predictive performances are reported over all test folds, and the stability of various signature sizes is computed from the 200 feature rankings. The average number of selected categorical features is also computed for each signature size. This number does not reflect a specific performance value of the feature selection methods but rather gives some insight into how they deal with the selection of heterogeneous variables.
Fig. 2. Predictive performances (BCR), feature selection stability (KI) and number of selected categorical features for each signature size of the Arrhythmia and Bands datasets. The dashed line defines the minimal number of features to select without losing much in predictive performances (see text).
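BCR in these figures denotes the balanced classification rate. Assuming the usual definition (the mean of the per-class accuracies, which is robust to class imbalance), a minimal sketch:

```python
import numpy as np

def bcr(y_true, y_pred):
    """Balanced classification rate: average of per-class accuracies.
    Assumes the standard definition of BCR (not stated in this excerpt)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(accs))
```

For instance, always predicting the majority class of an imbalanced binary problem yields a BCR of 0.5, whereas plain accuracy would look much better.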
Fig. 3. Predictive performances (BCR), feature selection stability (KI) and number of selected categorical features for each signature size of the Heart and Hepatitis datasets. The dashed line defines the minimal number of features to select without losing much in predictive performances (see text).
200 resamplings require 1.5 times more CPU time with RF10000 (single-core implementation in the randomForest R package [28]) than with the RFE methods (in the Shogun [29] implementation of MKL and SVM). On the Housing dataset, the RF implementation is 5 times slower than the RFE methods.⁴

The top left graph of Fig. 2 shows the predictive performances of the five methods on the Arrhythmia dataset. We can see that RFEMKL and RF10000 perform best, since they avoid selecting categorical features, which happen to be noisy on this dataset (Fig. 2, left). The bottom right plot of Fig. 4 reports the average number of categorical features among the selected features for the Rheumagene dataset. It shows that all but RFESVM and HFS5NN select two categorical variables first, leading to already good predictive performances with very few selected variables (top right graph of Fig. 4). The third categorical variable is actually never selected, since it happens to convey very little information to predict the class label.⁵ On the van't Veer dataset, the HFS approaches tend to keep selecting the two categorical variables even when the feature selection is very aggressive (Fig. 5, bottom). They show a peak in predictive performances when 5 features are kept (Fig. 5, left). However, the best predictive performance (Fig. 5, left) is obtained with RFEMKL, which selects one of the two categorical variables. It also corresponds to a very good feature selection stability, as shown in the right graph of Fig. 5. Finally, on the three high-dimensional datasets (Arrhythmia, Rheumagene and van't Veer), RFESVM is significantly less stable.

⁴ Specifically, CPU times were measured on a 2.60 GHz machine with 8 GB of RAM. On this dataset, RFEMKL, RFESVM and RF10000 took respectively 23 min, 26 min and 114 min to run.
⁵ Out of 49 samples (28 negative, 21 positive), this variable takes value '0' 46 times and '1' only 3 times.
Fig. 4. Predictive performances (BCR), feature selection stability (KI) and number of selected categorical features for each signature size of the Housing and Rheumagene datasets. The dashed line defines the minimal number of features to select without losing much in predictive performances (see text).
We further analyze below the various feature selection methods for a fixed number of selected features. One could indeed be interested in selecting a feature set as small as possible with only a marginal decrease in predictive performances. For each dataset, we choose the smallest feature set size such that the BCR of RFEMKL lies in the 95% confidence interval of the best RFEMKL predictive performance. Those signature sizes are highlighted in Figs. 2–5 by vertical dashed lines. A Friedman test on those predictive performances finds significant differences (p-value of 0.008). A Nemenyi post-hoc test (Fig. 6, left) shows that the two best-ranked methods, RF10000 and RFEMKL, perform significantly better than RFESVM in terms of BCR. Feature selection stabilities also differ significantly according to a Friedman test (p-value of 0.02). Fig. 6 illustrates that the ranking among the five methods is the same for stability and BCR. Those results on a fixed number of features show that RFEMKL and RF10000 are the two best performing methods, without significant differences between them, but at a larger computational cost for the latter.

6. Conclusion and perspectives

We introduce two heterogeneous feature selection techniques that can deal with continuous and categorical features. They combine Recursive Feature Elimination with variable importances extracted from MKL (RFEMKL) or a non-linear SVM (RFESVM). These methods use a dedicated kernel combining continuous and categorical variables. Experiments show that RFEMKL produces state-of-the-art predictive performances and is as good as competing methods in terms of feature selection stability. It offers results similar to Random Forests with smaller computational times. RFESVM performs worse than RFEMKL. It also seems less efficient
[Fig. 5 appears here: predictive performances (BCR), feature selection stability (KI) and number of selected categorical features for each signature size of the van't Veer dataset, for the five methods (RFEMKL, RFESVM, RF10000, HFSNB, HFS5NN).]
Fig. 6. Nemenyi critical difference diagrams [22]: comparison of the predictive performances (BCR) and stability (KI) of the five algorithms for one small signature size in each dataset. Horizontal black lines group together methods whose mean ranks do not differ significantly. CD represents the rank difference needed to have a 95% confidence that the performances of two methods are significantly different.
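The diagrams in Fig. 6 start from a Friedman test on per-dataset scores, followed by the Nemenyi post-hoc test. A minimal NumPy sketch of the Friedman statistic, following the formulation popularized by [22] (ties are ignored for brevity; the function and its inputs are illustrative, not the paper's values):

```python
import numpy as np

def friedman_statistic(perf):
    """Friedman chi-square statistic: perf is an (N datasets x k methods)
    array of scores. Methods are ranked per dataset (rank 1 = best score);
    the statistic grows when mean ranks deviate from the no-difference
    value (k+1)/2. Ties are not averaged in this sketch."""
    N, k = perf.shape
    ranks = np.zeros_like(perf, dtype=float)
    for i in range(N):
        order = (-perf[i]).argsort()          # best method first
        r = np.empty(k)
        r[order] = np.arange(1, k + 1)        # assign ranks 1..k
        ranks[i] = r
    mean_ranks = ranks.mean(axis=0)
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(mean_ranks ** 2)
                                     - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks
```

With the statistic in hand, the p-value follows from a chi-square distribution with k − 1 degrees of freedom, and the Nemenyi CD compares the mean ranks pairwise.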
in terms of prediction and stability than the competing approaches, even though not significantly different from all competitors.

The two kernel-based methods proposed here are among the few existing selection methods that specifically tackle heterogeneous features. Yet, we plan in our future work to improve their stability, possibly by resorting to an ensemble procedure [6].

We observed that the proposed methods run faster than the competing approaches on various datasets. Those differences would be worth reassessing in a further study relying on parallel implementations.

Acknowledgments

Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Equipements de Calcul Intensif en Fédération Wallonie Bruxelles (CECI) funded by the Fonds de la Recherche Scientifique de Belgique (FRS-FNRS).

References

[1] G. Doquire, M. Verleysen, An hybrid approach to feature selection for mixed categorical and continuous data, in: International Conference on Knowledge Discovery and Information Retrieval (KDIR 2011), 2011, pp. 394–401.
[2] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc.: Ser. B Stat. Methodol. 68 (1) (2006) 49–67.
[3] C. Strobl, A.-L. Boulesteix, A. Zeileis, T. Hothorn, Bias in random forest variable importance measures: illustrations, sources and a solution, BMC Bioinform. 8 (1) (2007) 25.
[4] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1–3) (2002) 389–422.
[5] A. Daemen, B. De Moor, Development of a kernel function for clinical data, in: Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2009), 2009, pp. 5913–5917.
[6] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, Y. Saeys, Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics 26 (3) (2010) 392–398.
[7] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, NY, USA, 2004.
[8] B. Vanschoenwinkel, B. Manderick, Appropriate kernel functions for support vector machine learning with sequences of symbolic data, in: J. Winkler, M. Niranjan, N. Lawrence (Eds.), Deterministic and Statistical Methods in Machine Learning, Lecture Notes in Computer Science, vol. 3635, Springer, Berlin, Heidelberg, 2005, pp. 256–280.
[9] M.G. Genton, Classes of kernels for machine learning: a statistics perspective, J. Mach. Learn. Res. 2 (2002) 299–312.
[10] C. Berg, J.P.R. Christensen, P. Ressel, Harmonic Analysis on Semigroups, Springer-Verlag, New York, 1984.
[11] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92), ACM, New York, NY, USA, 1992, pp. 144–152.
[12] I. Guyon, Feature Extraction: Foundations and Applications, vol. 207, Springer-Verlag, Berlin, Heidelberg, 2006.
[13] G. Lanckriet, T. De Bie, N. Cristianini, M. Jordan, W. Noble, A statistical framework for genomic data fusion, Bioinformatics 20 (16) (2004) 2626–2635.
[14] F. Bach, G. Lanckriet, M. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 41–48.
[15] M. Kloft, U. Brefeld, P. Laskov, K.-R. Müller, A. Zien, S. Sonnenburg, Efficient and accurate lp-norm multiple kernel learning, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems, vol. 22, Curran Associates, Inc., New York, 2009, pp. 997–1005.
[16] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[17] J. Paul, M. Verleysen, P. Dupont, The stability of feature selection and class prediction from ensemble tree classifiers, in: ESANN 2012, 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2012, pp. 263–268.
[18] Q. Hu, J. Liu, D. Yu, Mixed feature selection based on granulation and approximation, Knowl. Based Syst. 21 (4) (2008) 294–304.
[19] R. Wilson, T. Martinez, Improved heterogeneous distance functions, J. Artif. Intell. Res. 6 (1997) 1–34.
[20] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), vol. 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1995, pp. 1137–1143.
[21] L. Kuncheva, A stability index for feature selection, in: AIAP'07: Proceedings of the 25th IASTED International Multi-Conference, ACTA Press, Anaheim, CA, USA, 2007, pp. 390–395.
[22] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[23] A. Frank, A. Asuncion, UCI Machine Learning Repository, 2010. URL: ⟨http://archive.ics.uci.edu/ml⟩.
[24] F. Leisch, E. Dimitriadou, mlbench: Machine Learning Benchmark Problems, R package version 2.1-1, 2010.
[25] I. Focant, D. Hernandez-Lobato, J. Ducreux, P. Durez, A. Toukap, D. Elewaut, F. Houssiau, P. Dupont, B. Lauwerys, Feasibility of a molecular diagnosis of arthritis based on the identification of specific transcriptomic profiles in knee synovial biopsies, Arthritis Rheum. 63 (2011) 751.
[26] L. van 't Veer, H. Dai, M. van de Vijver, Y. He, A. Hart, M. Mao, H. Peterse, K. van der Kooy, M. Marton, A. Witteveen, G. Schreiber, R. Kerkhoven, C. Roberts, P. Linsley, R. Bernards, S. Friend, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (6871) (2002) 530–536.
[27] A.-C. Haury, P. Gestraud, J.-P. Vert, The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures, PLoS One 6 (12) (2011) e28210.
[28] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (3) (2002) 18–22.
[29] S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, V. Franc, The Shogun machine learning toolbox, J. Mach. Learn. Res. 11 (2010) 1799–1802.

Jérôme Paul is a Ph.D. student in the Machine Learning Group of the Université catholique de Louvain. He received a master's degree in computer engineering in 2010. His research topics include feature selection and classification from heterogeneous data, with a special focus on high-dimensional biomedical data.

Roberto D'Ambrosio received an M.S. in Biomedical Engineering from the University Campus Bio-Medico of Rome in 2010. In 2014 he received a joint Ph.D. from the University Campus Bio-Medico of Rome and the University of Nice, in Biomedical Engineering and in Sciences and Technologies of Information and Communication, respectively. He is a post-doctoral researcher at the Université catholique de Louvain. His interests cover the areas of pattern recognition and machine learning. His research focuses on imbalanced datasets, the estimation of posterior probabilities in classification tasks, stochastic gradient descent algorithms and feature selection.

Pierre Dupont received an M.S. in Electrical Engineering from the Université catholique de Louvain, Belgium, in 1988, and a Ph.D. in Computer Science from l'Ecole Nationale Supérieure des Télécommunications, Paris, in 1996. In 1996–1997 he was a post-doctoral researcher at Carnegie Mellon University, Pittsburgh, USA. Since 2001, Pierre Dupont has been a Professor at the Université catholique de Louvain, where he co-supervises the Machine Learning Group. His current research interests include novel machine learning methods to tackle real problems arising in computational biology, bio-statistics and medical research, feature selection and dimensionality reduction, high-throughput data analysis and biomarker identification.