Conformal Recursive Feature Elimination
Marcos López-De-Castro, Alberto Garcı́a-Galindo, Rubén Armañanzas
Abstract
Unlike traditional statistical methods, Conformal Prediction (CP)
allows for the determination of valid and accurate confidence levels
associated with individual predictions based only on exchangeability
of the data. We here introduce a new feature selection method that
takes advantage of the CP framework. Our proposal, named Confor-
mal Recursive Feature Elimination (CRFE), identifies and recursively
removes features that increase the non-conformity of a dataset. We
also present an automatic stopping criterion for CRFE, as well as a
new index to measure consistency between subsets of features. CRFE
selections are compared to the classical Recursive Feature Elimination
(RFE) method on several multiclass datasets by using multiple parti-
tions of the data. The results show that CRFE clearly outperforms
RFE in half of the datasets, while achieving similar performance in
the rest. The automatic stopping criterion provides subsets of effec-
tive and non-redundant features without computing any classification
performance.
1 Introduction
The curse of dimensionality is a well-known issue in the field of statistical
learning theory. In recent years, the amount of high-dimensional and multi-
modal data has become a challenge, from healthcare [1,6] to physics [27], as
∗ Marcos López-De-Castro, Alberto Garcı́a-Galindo and Rubén Armañanzas are with DATAI - Institute of Data Science and Artificial Intelligence, Universidad de Navarra, Pamplona, Spain and Tecnun School of Engineering, Universidad de Navarra, Donostia-San Sebastian, Spain. Email: {mlopezdecas, agarciagali, rarmananzas}@unav.es
Corresponding author: [email protected]
Under review
a consequence of technological advances and the advent of big data. Feature selection methods are techniques developed for dimensionality reduction that select an optimal subset of features without altering their original meaning [16]. These methods help prediction algorithms run faster and increase their efficiency. They are categorized as filters, wrappers, and embedded methods, although mixed approaches have also been proposed [8, 13]. Search strategies have been developed to explore the feature space efficiently [7]. Two outstanding techniques are sequential forward and sequential backward selection, which respectively add or remove features until the required number of features is reached. The Recursive Feature Elimination (RFE) method is a popular feature selection method proposed by Guyon et al. based on a backward elimination policy [14, 31]. RFE was originally developed for support vector machines and operates by recursively removing the features, or sets of features, that least decrease the margin of separation between classes.
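For reference, the RFE baseline is readily available in scikit-learn. The following minimal sketch illustrates it; the synthetic data generation, the linear SVM estimator, and the target of 10 features are illustrative assumptions, not choices taken from this paper.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic multiclass problem; sizes are illustrative choices only.
X, y = make_classification(n_samples=350, n_features=35, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Recursively drop the feature whose weight contributes least to the margin.
selector = RFE(estimator=LinearSVC(max_iter=5000), n_features_to_select=10, step=1)
selector.fit(X, y)
print("Selected feature indices:", list(selector.get_support(indices=True)))
```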
samples is updated, similarly to how feature weights are updated in the original RFE. The removed features are those associated with the highest strangeness. We compared CRFE and RFE results on four different multiclass datasets. CRFE outperformed RFE in two of the four datasets and achieved the same performance and consistency as RFE in the remaining two. We also present an automatic stopping criterion and a new consistency index. The automatic stopping criterion stops the recursive elimination of features, avoiding computation over the full dataset.
2 Conformal prediction
The conformal prediction framework rests on two assumptions: randomness and exchangeability [24]. A finite sequence (z_1, ..., z_n) of random variables is said to be i.i.d., i.e., the randomness assumption holds, if its elements are generated independently from the same unknown probability distribution Q. A finite sequence (z_1, ..., z_n) of random variables is said to be exchangeable if for any permutation π of the set of indices {1, ..., n} the joint probability distribution is invariant, Q(z_1, ..., z_n) = Q(z_{π(1)}, ..., z_{π(n)}). The randomness assumption implies exchangeability. The next step when performing conformal prediction is the construction of a non-conformity measure, defined as a measurable function A,

$$A : \mathcal{Z} \times \mathcal{Z}^{n} \to \mathbb{R}, \qquad (z, \{z_1, \dots, z_n\}) \mapsto \alpha = f(h(x), y), \tag{1}$$

where h(·) is a predictive rule learned from {z_1, ..., z_n} and f(·,·) is a function that quantifies the difference between the prediction h(x) and a label y. Monotonic transformations of f(·,·) have no impact on the predictions.
Nevertheless, the proper selection of the estimator h(·) has a significant effect on conformal prediction efficiency [23]. Let us assume a set of samples of the form

$$\{z_i\}_{i=1}^{n} = \{(x_i, y_i)\}_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y} = \mathcal{Z}. \tag{2}$$

A confidence predictor Γ^ϵ is defined as the prediction set that contains the true label of a new sample with confidence 1 − ϵ, ϵ ∈ [0, 1],

$$\Gamma^{\epsilon}(x_{n+1}) = \{\, y \in \mathcal{Y} : p_{(x_{n+1}, y)} > \epsilon \,\}, \tag{4}$$
where the p-values p_{(x_{n+1}, y)} are usually derived following a transductive or an inductive approach. When transductive inference is followed, all computations have to start from scratch. In particular, for every test sample the transductive inference needs to learn |Y| prediction rules h(·) to build the confidence sets (4). Inductive conformal prediction was developed to deal with this computational inefficiency [21]. The inductive approach splits the set of samples into a training set {z_1, ..., z_l} and a calibration set {z_{l+1}, ..., z_n}. A general prediction rule h(·) is learned from the training set and the p-values in (4) are computed exclusively from the calibration set as

$$p_{(x_{n+1}, y)} = \frac{|\{\, i = l+1, \dots, n : \alpha_i \geq \alpha_{n+1}^{y} \,\}| + 1}{n - l + 1}. \tag{5}$$
The general rule induced from the training set incorporates the information about the dataset, eliminating the need to start from scratch when the confidence sets for new test samples are derived. Hence, inductive inference is nearly as computationally efficient as the underlying prediction rule. The inductive approach sacrifices prediction efficiency for computational efficiency due to the train-calibration split.
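As an illustration of the inductive procedure, the following sketch computes p-values as in (5) and the corresponding conformal sets. The specific non-conformity score (the negative decision value of the candidate class under a linear one-vs-rest model) and the integer label encoding 0, ..., m−1 are assumptions made for this example only.

```python
import numpy as np

def nonconformity(model, X, y):
    # Negative signed score of the candidate class: larger values mean stranger samples.
    scores = model.decision_function(X)               # shape (n_samples, n_classes)
    return -scores[np.arange(len(y)), y]

def conformal_p_values(model, X_cal, y_cal, x_new, labels):
    alpha_cal = nonconformity(model, X_cal, y_cal)
    p_values = {}
    for y in labels:
        alpha_new = nonconformity(model, x_new.reshape(1, -1), np.array([y]))[0]
        # Eq. (5): fraction of calibration samples at least as strange as the candidate.
        p_values[y] = (np.sum(alpha_cal >= alpha_new) + 1) / (len(alpha_cal) + 1)
    return p_values

def prediction_set(p_values, epsilon):
    # Labels whose p-value exceeds the significance level form the conformal set (4).
    return {y for y, p in p_values.items() if p > epsilon}

# Usage (illustrative): model = LinearSVC().fit(X_train, y_train), then
# p = conformal_p_values(model, X_cal, y_cal, x_test, labels=range(m))
# and prediction_set(p, epsilon=0.1).
```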
3 Methodology
In this section CRFE is presented. First, we adapt the mathematical tools developed by Bellotti et al. [5] to the multiclass classification setting, where the potential of conformal prediction becomes particularly relevant. This adaptation ties non-conformities to features. After this, CRFE is presented. Finally, we propose a stopping criterion based on the behavior of the selected features' non-conformities.
discriminant function. Let us start by assuming that the set of samples Z in (2) is composed of l features and m different classes, i.e.,

$$x_i \in \mathcal{X} = (X_1, \dots, X_l), \qquad y_i \in \mathcal{Y} = \{y^1, \dots, y^m\},$$

where X_j ∈ ℝ. We follow the One vs All (OVA) approach [31] to extend [5] to a multiclass scenario. Let us define the function θ as

$$\theta(y_i, y^k) = \begin{cases} \;\;\,1 & \text{if } y_i = y^k \\ -1 & \text{if } y_i \neq y^k \end{cases} \tag{6}$$

where k ∈ {1, ..., m}, such that we can derive m new datasets to train m linear models in the OVA approach:
Similarly to the weighted average method [26], we extend the binary non-conformity measure to the multiclass problem as

$$\alpha_i = A(z_i, \mathcal{D}) = \lambda \tilde{A}(z_i^k, Z^k, h^k) + \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} \tilde{A}(z_i^r, Z^r, h^r), \tag{9}$$

with $k \in \{1, \dots, m\}$, $\lambda \in [0, 1]$, and $\lambda' = \frac{1 - \lambda}{m - 1}$, where α_i is the non-conformity measure of the sample z_i.

Let S ⊆ X = (X_1, ..., X_l) denote a subset of features with |S| = t. If we select a subset of t features S from the original set of features X, the expression defined in (8) turns into
which is linearly separable by features [5] (i.e., we can isolate by features the terms that are feature dependent).
The OVA approach of the problem defined in (9) is still linearly separable. Considering the non-conformity measure (8), the expression in (9) is expanded as

$$\begin{aligned} A(z, Z, S) ={}& -\lambda\,\theta(y, y^k)\Big(\sum_{j \in S} w_j^k x_j + b^k\Big) - \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} \theta(y, y^r)\Big(\sum_{j \in S} w_j^r x_j + b^r\Big) \\ ={}& -\sum_{j \in S}\Big[\lambda\,\theta(y, y^k)\, w_j^k + \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} \theta(y, y^r)\, w_j^r\Big] x_j - \gamma, \end{aligned}$$

where the constant $\gamma = \lambda\,\theta(y, y^k)\, b^k + \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} \theta(y, y^r)\, b^r$.
Our non-conformity function can then be separated by features,

$$A(z, \mathcal{D}, S) = \sum_{j \in S} \phi(z, \mathcal{D}, j),$$

where

$$\beta_j = -\left[\lambda\, w_j^k \sum_{i=1}^{n} \theta(y_i, y^k)\, x_{ij} + \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} w_j^r \sum_{i=1}^{n} \theta(y_i, y^r)\, x_{ij}\right] - \sum_{i=1}^{n} \gamma_i. \tag{11}$$
1. Train the classifier.
2. Compute the β-measures in (11) on the calibration set.
3. Remove the feature with the highest β value.
4. Retrain with the new subset or stop if the stopping criterion is met.
$$\delta A_i^{k,j} = \delta A(z_i^k, Z^k, h^k) = \alpha_i^k - \alpha_i^j = -y_i\Big(\sum_{p=1}^{l} w_p^k x_{ip} + b\Big) + y_i\Big(\sum_{\substack{s=1 \\ s \neq j}}^{l} w_s^k x_{is} + b\Big) = -y_i w_j^k x_{ij},$$

where α_i^j is the non-conformity measure computed without taking into account the feature j. Over the whole calibration set, the variation is

$$\delta A^j = \sum_{i=1}^{n} \delta A_i^{k,j} = \sum_{i=1}^{n} -y_i w_j^k x_{ij}, \tag{12}$$

so

$$\delta A^j = \beta_j. \tag{13}$$
The result in (13) strengthens the interpretation of β_j as the non-conformity associated with a feature and justifies that eliminating the feature X_j with the highest associated non-conformity is what contributes the most to reducing the global non-conformity of the calibration set. Our goal is to update the predictive rule h(·) in (1) in order to produce more representative non-conformity measures. Moreover, when comparing the proposed algorithm with the original RFE method, only one additional step is included: computing expression (11). This additional step adds a linear computational complexity of O(lmn), or O(ln) if there are only two classes, over the complexity of the RFE algorithm.
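To make the extra step concrete, a minimal sketch of computing the β-measures and performing one elimination step follows. It assumes the m OVA linear models are stacked into a weight matrix W of shape (classes × features), that labels are encoded as integers 0, ..., m−1, and it omits the feature-independent γ terms of Expression (11); these are assumptions of this sketch, not part of the original formulation.

```python
import numpy as np

def beta_measures(W, X_cal, y_cal, lam=0.5):
    """Per-feature non-conformity, a sketch of Expression (11) without the gamma terms.
    W: (m, l) stacked OVA weights; X_cal: (n, l); y_cal: integer labels in {0, ..., m-1}."""
    m, l = W.shape
    n = len(y_cal)
    lam_prime = (1.0 - lam) / (m - 1)
    theta = -np.ones((n, m))
    theta[np.arange(n), y_cal] = 1.0          # theta(y_i, y^r) = +1 only for the true class
    beta = np.zeros(l)
    for j in range(l):
        own = lam * theta[np.arange(n), y_cal] * W[y_cal, j] * X_cal[:, j]
        rest = lam_prime * ((theta * W[:, j]).sum(axis=1)
                            - theta[np.arange(n), y_cal] * W[y_cal, j]) * X_cal[:, j]
        beta[j] = -(own + rest).sum()
    return beta

def crfe_step(W, X_cal, y_cal, remaining, lam=0.5):
    """One CRFE iteration: drop the remaining feature with the largest beta value."""
    beta = beta_measures(W[:, remaining], X_cal[:, remaining], y_cal, lam)
    worst = remaining[int(np.argmax(beta))]
    return [f for f in remaining if f != worst], beta
```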
3.3 Stopping criterion
The optimal number of features to be selected is not a straightforward prob-
lem when feature selection is performed. We consider two stopping criteria in
our study. The first criterion (i) involves fixing a certain number of features
and leave the proposed method to iterate up to an specified number of fea-
tures. In cases where the optimal number of features is known this criterion
is useful. Moreover, when the dataset is not too large, this criterion allows
the exploration of all possible subset sizes and analyze performance met-
rics in order to discover an optimal number of features. However, for large
datasets, this approach can become impractical. The second stopping crite-
rion (ii) is a novel approach that takes advantage of the relative β-measures
variation during the recursive elimination method. The underlying idea be-
hind this criterion is as follows: when a feature Xj is removed from the set of
remaining features, the resulting non-conformity associated to the new set
should have decreased. We found that the mean of the β-measures decreases
with constant rate until an exponential decay is observed. We propose to
stop the selection process before reaching the exponential behavior. Because
the transition between the constant and the exponential rate involves a de-
celeration, we use the second derivative of the mean, which remains close
to zero until the exponential behavior starts. The exponential regime starts
when the value of the second derivative exceeds at least three times the
standard deviation of the set of values corresponding to the previous second
derivatives. The standard deviation can be computed based on the k-latest
values to enhance computational performance. The full method is presented
in Algorithm 1.
Algorithm 1 CRFE and the β-based stopping criterion.
Input: Dtrain = (Xtrain, Ytrain); training set
Input: Dcal = (Xcal, Ycal); calibration set
Input: σ; confidence factor, a value ≥ 3 is recommended
Input: ψ; length of the set used to compute the standard deviation
Initialize array to zero: beta_means
Initialize array to zero: beta_num_der
do
  w, b ← train(Dtrain)
  β ← compute(Dcal, w, b)
  j ← index(max(β))
  β̄ ← compute_mean(β)
  beta_means ← append_to_array(β̄)
  x ← array_of_integers(0, len(beta_means))
  β′′ ← derive(derive(beta_means, x), x)
  β′′_σ ← compute_std(beta_num_der, ψ)
  beta_num_der ← append_to_array(β′′)
  if β′′ > σ · β′′_σ then stop
  remove feature j from Dtrain and Dcal
while features remain
Output: selected subset of features
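A Python sketch of this loop is given below. The helper `train` (which fits the OVA linear models and returns their stacked weight matrix) and `beta_measures` (such as the sketch in the previous section) are assumed to be supplied by the user, and the numerical differentiation via `np.gradient` is an implementation choice rather than the paper's prescribed routine.

```python
import numpy as np

def crfe_with_stopping(train, beta_measures, X_tr, y_tr, X_cal, y_cal, sigma=3.0, psi=10):
    """Recursive elimination driven by the beta-based stopping criterion (Algorithm 1 sketch)."""
    remaining = list(range(X_tr.shape[1]))
    beta_means, beta_num_der = [], []
    while len(remaining) > 1:
        W = train(X_tr[:, remaining], y_tr)                       # train the classifier
        beta = beta_measures(W, X_cal[:, remaining], y_cal)      # beta on the calibration set
        j = int(np.argmax(beta))                                  # most non-conforming feature
        beta_means.append(float(np.mean(beta)))
        if len(beta_means) >= 3:
            d2 = np.gradient(np.gradient(beta_means))[-1]         # numerical second derivative
            if len(beta_num_der) >= psi:
                std = float(np.std(beta_num_der[-psi:]))
                if std > 0 and abs(d2) > sigma * std:             # exponential regime detected
                    break
            beta_num_der.append(d2)
        remaining.pop(j)                                          # retrain with the new subset
    return remaining
```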
4 Experimental settings
4.1 Datasets
Four publicly available databases were used as a test-bed: one synthetic and three real-world datasets. Their characteristics are summarized in Table 4.1.

Dataset | Samples | Classes | Features | Distribution of classes (%) | Reference
Synthetic | 350 | 4 | 35 | (25.0, 25.0, 25.0, 25.0) | [22]
Coronary artery | 899 | 4 | 32 | (44.9, 21.2, 14.5, 19.3) | [17]
Dermatology | 366 | 6 | 34 | (30.6, 16.7, 19.7, 13.4, 14.2, 5.5) | [17]
Myocardial infarction | 1700 | 8 | 104 | (84.1, 6.5, 1.1, 3.2, 1.3, 0.7, 1.6, 1.6) | [12]

Data pre-processing involved (i) one-hot class encoding and cleaning of features, (ii) checking for missing values; if more than 25% of the values were missing, the feature was removed, (iii) imputation of missing values, and (iv) standardization of the data to avoid scale biases.
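A minimal sketch of this pipeline follows. The use of pandas, KNNImputer with 5 neighbours, and StandardScaler mirrors the descriptions in this paper and its supplement, but the exact implementation details (the helper name `preprocess` and the assumption that all features are numeric) are illustrative assumptions.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, target: str, missing_threshold: float = 0.25):
    y = pd.get_dummies(df[target])                                  # (i) one-hot class encoding
    X = df.drop(columns=[target])
    keep = X.columns[X.isna().mean() <= missing_threshold]          # (ii) drop features >25% missing
    X = X[keep]
    X = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(X),    # (iii) impute missing values
                     columns=keep)
    X = pd.DataFrame(StandardScaler().fit_transform(X),             # (iv) standardize the data
                     columns=keep)
    return X, y
```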
to the theoretical coverage 1 − ϵ. For k test samples,

$$\mathrm{Cov} = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\{\, y_i \in \Gamma^{\epsilon}(x_i) \,\}.$$
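The empirical coverage can be computed directly from the conformal sets. A minimal sketch, assuming `prediction_sets` is a list of label sets such as those produced by the earlier sketch:

```python
import numpy as np

def empirical_coverage(prediction_sets, y_true):
    # Fraction of test samples whose true label falls inside its conformal set.
    return float(np.mean([y in s for y, s in zip(y_true, prediction_sets)]))
```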
differences between more than two sets. Let S_1, ..., S_n be subsets of features such that |S_1| = ... = |S_n|. The multi-set Jaccard consistency index is defined as

$$I_J = \frac{|S_1 \cap \dots \cap S_n|}{|S_1 \cup \dots \cup S_n|}. \tag{14}$$

We propose an extension that is more suitable than (14) to compare ensembles of n sets of features. Let us consider |S_1| = ... = |S_n| and K = {n/2 + 1, ..., n}; the new index I_W is defined as

$$I_W = \sum_{j \in K} \omega_j P_j, \quad \text{such that} \quad \omega_j = \frac{j}{\sum_{k \in K} k}, \tag{15}$$
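A sketch of both indices is given below. The multi-set Jaccard index (14) follows directly from its definition; for the weighted index (15), the proportions P_j are not fully defined in this excerpt, so the sketch assumes P_j to be the fraction of features in the union that appear in at least j of the n subsets, which is an illustrative guess rather than the paper's definition.

```python
def jaccard_consistency(subsets):
    """Multi-set Jaccard index (14); `subsets` is a list of equally sized feature sets."""
    return len(set.intersection(*subsets)) / len(set.union(*subsets))

def weighted_consistency(subsets):
    """Sketch of the weighted index (15) under an assumed definition of P_j."""
    n = len(subsets)
    union = set.union(*subsets)
    counts = {f: sum(f in s for s in subsets) for f in union}
    K = range(n // 2 + 1, n + 1)
    weights = {j: j / sum(K) for j in K}
    P = {j: sum(c >= j for c in counts.values()) / len(union) for j in K}  # assumed P_j
    return sum(weights[j] * P[j] for j in K)
```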
times by shuffling the data, and the results were averaged across all itera-
tions. In each iteration, all the random numbers were generated and fixed
at the beginning, always preserving the same split between test, calibration,
and training sets for both selection methods.
10, with null uncertainty value at that cardinality.
Note how the feature subsets selected by CRFE were equally or more
consistent than those selected by RFE. The peaks of consistency, especially
in Figures 3a-(b), 3b-(b), were observed around the optimal subset sizes
predicted in Figure 1 for CRFE. This suggests that CRFE was able to find, independently of the data split, the subsets of features that report better performance metrics. We highlight how the newly proposed consistency index in Expression (15) was able to reflect the peaks in Figures 3a-(b), 3c-(b), 3d-(a), and 3d-(b), in contrast to the Jaccard index. These peaks contributed to detecting hidden consistent subsets. Moreover, the new index reflected consistency at the last stages of the recursive elimination process, while the Jaccard index ignored those sizes. In Figures 3a-(a) and 3d-(a) there were noticeable differences between those consistency measures. Those values may indicate that there was a considerable number of features that, although not consistently selected in all the 20 iterations, were frequently chosen by RFE. For the coronary artery and dermatology datasets, both methods showed a similar behavior for larger feature subsets. For smaller subsets, CRFE showed more consistent selections than RFE.
Supplementary Equation S6.1 revealed that these coincidences were robust against random coincidence. On the other hand, the synthetic and myocardial infarction datasets revealed poor agreement on the selected features. For these two datasets, the similarity values suggested that agreement by chance was relevant.
Table 2: Average size for subsets of features selected by both stopping crite-
ria: RFECV for RFE and β-based for CRFE. Results inside parentheses are
the averaged inefficiency and certainty scores for the selected subset size.
stopping criteria resulted in subset sizes that deviated significantly from
the optimal subsets detected in Figure 1d. However, the inefficiency and
certainty metrics in Table 5.3 were better when the new stopping criterion
was used.
The β-based automatic stopping criterion was designed to determine the optimal subset of features without computing performance scores for all possible feature cardinalities or combinations. We observed that variability in the data can slightly influence its performance. However, the frequency of features included in the subsets selected by the new criterion showed that this method was able to identify the most significant features to effectively reduce non-conformity.
6 Conclusions
Conformal prediction stands out as one of the most effective methods for uncertainty quantification due to its robust theoretical guarantees. In this study, we introduce an alternative to the classical Recursive Feature Elimination method that takes advantage of the conformal prediction framework. The proposed method, named Conformal Recursive Feature Elimination (CRFE), is a feature selection technique that builds upon the SMFS algorithm [5], but extends it to recursively remove features. We demonstrate that, when a feature is removed, the variation in the non-conformity measure is equivalent to eliminating the non-conformity associated with that particular feature. Additionally, we introduce a novel consistency index and an automatic stopping criterion based on the non-conformity associated with the features. To evaluate the effectiveness of CRFE, we compared its performance against RFE on four multiclass datasets. The results indicate several benefits of using CRFE when conformal prediction is applied, as well as in classical performance metrics when no conformal prediction is performed. The consistency tests based on stability indexes reported that CRFE achieved at least the same level of consistency as RFE. The proposed automatic stopping criterion for CRFE is based on the non-conformity values of each feature and outperformed the accuracy-based RFE stopping criterion. However, the feature selection method proposed in this work depends on a linear separability condition when the selection is performed, i.e., CRFE relies on computing separation hyperplanes between classes. Future developments will investigate how to extend the method to nonlinear classifiers. We also plan to explore class-conditional conformal prediction as a mechanism to better adapt to imbalanced problems. Finally, an open source library with the implementation is released.
Acknowledgments
This work was supported by the Gobierno de Navarra through the ANDIA
2021 program (grant no. 0011-3947-2021-000023) and the ERA PerMed
JTC2022 PORTRAIT project (grant no. 0011-2750-2022-000000).
Repository
An open source library that implements CRFE, including the datasets used in this work, the post-processed data, and the code can be found at https://github.com/digital-medicine-research-group-UNAV/CRFE.
A Supplementary Material
A.1 Introduction
The present document provides Supplementary Material for the paper entitled Conformal Recursive Feature Elimination. The results presented here were derived from the same datasets as in the main document. Single-prediction performance metrics, i.e., accuracy, recall, and precision, as well as per-class precision, recall, and F1 metrics, were calculated in supplementary Sections A.3, A.4, and A.5 for the last three described datasets. Precision is defined as the proportion of predicted positives that are truly positive. Recall, also known as sensitivity, is the proportion of positives that are correctly classified. The F1 score is the harmonic mean of precision and recall. The plots supporting the similarity study are presented in supplementary Section A.6, and the additional material related to the study of the β-based stopping criterion is provided in supplementary Section A.7.
A.3 Coronary artery disease dataset
The coronary artery disease dataset [17] comprises data collected from Cleveland (303 samples), Hungary (294), Switzerland (123), and Long Beach VA (200). The original study [11] warned about a potential bias in the test groups because the noninvasive samples were not withheld from the treating physician. To classify disease, a cardiologist diagnosed samples based only on the angiogram results. According to these criteria, a coronary artery was significantly diseased if the luminal diameter reduction exceeded 50%.
The classes in the dataset are as follows: class 0 stands for no disease and comprises 404 samples of the total, i.e. 44.93%; class 1 stands for patients that have at least one diseased artery and is composed of 191 samples (21.24%); class 2 stands for those patients that have a single-vessel disease and is composed of 130 samples (14.46%); class 3 stands for those that have a double-vessel disease and is composed of 132 samples (14.68%); and class 4 stands for those that have a triple-vessel disease and is composed of 42 samples (4.67%). The original database comprised 75 features, but name, SS number, dates of medical tests, and patient number were excluded. Any feature with more than 25% of missing values was also removed. The remaining features are listed and numbered in Supplementary Table S4. We encourage the reader to check the publicly available repository for detailed information on the specific features considered. Missing data were imputed by a k-NN algorithm with 5 neighbors. Samples were standardized to avoid scale biases.
which stands for the time when ST-segment depression was noted in the ECG; met, which is true or false depending on whether a threshold on the Metabolic Equivalent (MET) was achieved during exercise testing; trestbpd, which stands for the resting blood pressure; xhypo, which stands for exercise-induced hypotension; and lvx3, which was not explained in either the dataset documentation or the original work, but which we decided to include in the dataset because we postulate that it stands for some type of pacing interval with the left ventricle, which is relevant medical information.
inflammatory mononuclear infiltrate was almost irrelevant for CRFE.
0 Erythema | 18 Parakeratosis
1 Scaling | 19 Clubbing of the rete ridges
2 Definite borders | 20 Elongation of the rete ridges
3 Itching | 21 Thinning of the suprapapillary epidermis
4 Koebner phenomenon | 22 Spongiform pustule
5 Polygonal papules | 23 Munro microabscess
6 Follicular papules | 24 Focal hypergranulosis
7 Oral mucosal involvement | 25 Disappearance of the granular layer
8 Knee and elbow involvement | 26 Vacuolisation and damage of basal layer
9 Scalp involvement | 27 Spongiosis
10 Family history | 28 Saw-tooth appearance of retes
11 Melanin incontinence | 29 Follicular horn plug
12 Eosinophils in the infiltrate | 30 Perifollicular parakeratosis
13 PNL infiltrate | 31 Inflammatory mononuclear infiltrate
14 Fibrosis of the papillary dermis | 32 Band-like infiltrate
15 Exocytosis | 33 Age
16 Acanthosis |
17 Hyperkeratosis |
A.5 Myocardial infarction complications
The motivation behind the myocardial infarction complications dataset [12] was the continuing spread of the disease, especially among the urban population of developed countries, as well as the differences between patients in the course of the disease. The feature with the patient ID was removed because it did not contain relevant information; other features were also removed due to missing data (features with more than 25% of missing values); and features numbered from 113 to 124 were removed because they were provided as potential targets. The names of the 104 remaining features are shown and numbered in Table 6. We refer the reader to the publicly available repository for specific information about the features. Missing data were imputed by a k-NN algorithm with 5 neighbours. The objective is to predict the feature called Lethal outcome (LET IS), distributed as: class alive (84.06%), class cardiogenic shock (6.47%), class pulmonary edema (1.06%), class myocardial rupture (3.18%), class progress of congestive heart failure (1.35%), class thromboembolism (0.71%), class asystole (1.59%), and class ventricular fibrillation (1.59%).
Number Feature label Number Feature label Number Feature label
0 AGE 35 K SH POST 70 n p ecg p 12
1 SEX 36 MP TP POST 71 fibr ter 01
2 INF ANAM 37 SVT POST 72 fibr ter 02
3 STENOK AN 38 GT POST 73 fibr ter 03
4 FK STENOK 39 FIB G POST 74 fibr ter 05
5 IBS POST 40 ant im 75 fibr ter 06
6 GB 41 lat im 76 fibr ter 07
7 SIM GIPERT 42 inf im 77 fibr ter 08
8 DLIT AG 43 post im 78 GIPO K
9 ZSN A 44 IM PG P 79 K BLOOD
10 nr11 45 ritm ecg p 01 80 GIPER Na
11 nr01 46 ritm ecg p 02 81 Na BLOOD
12 nr02 47 ritm ecg p 04 82 ALT BLOOD
13 nr03 48 ritm ecg p 06 83 AST BLOOD
14 nr04 49 ritm ecg p 07 84 L BLOOD
15 nr07 50 ritm ecg p 08 85 ROE
16 nr08 51 n r ecg p 01 86 TIME B S
17 np01 52 n r ecg p 02 87 R AB 1 n
18 np04 53 n r ecg p 03 88 R AB 2 n
19 np05 54 n r ecg p 04 89 R AB 3 n
20 np07 55 n r ecg p 05 90 NITR S
21 np08 56 n r ecg p 06 91 NA R 1 n
22 np09 57 n r ecg p 08 92 NA R 2 n
23 np10 58 n r ecg p 09 93 NA R 3 n
24 endocr 01 59 n r ecg p 10 94 NOT NA 1 n
25 endocr 02 60 n p ecg p 01 95 NOT NA 2 n
26 endocr 03 61 n p ecg p 03 96 NOT NA 3 n
27 zab leg 01 62 n p ecg p 04 97 LID S n
28 zab leg 02 63 n p ecg p 05 98 B BLOK S n
29 zab leg 03 64 n p ecg p 06 99 ANT CA S n
30 zab leg 04 65 n p ecg p 07 100 GEPAR S n
31 zab leg 06 66 n p ecg p 08 101 ASP S n
32 S AD ORIT 67 n p ecg p 09 102 TIKL S n
33 D AD ORIT 68 n p ecg p 10 103 TRENT S n
34 O L POST 69 n p ecg p 11
Figures 7a and 7b show that the subsets selected by RFE preserved fea-
tures relevant for distinguishing between all classes. CRFE quickly attempts
to select relevant features for classes alive and cardiogenic shock. This sug-
gests that a heavy imbalance in classes may affect the performance of the
proposed method.
Results presented in Figure 4 of the main paper showed that the most relevant features, i.e., those present in at least 80% of the selected subsets, were: sex; S AD ORIT (the systolic blood pressure); ritm ecg p 01, which represents whether the ECG rhythm at the time of admission to hospital is sinus or not; TIME B S, which represents the time elapsed from the beginning of the attack of CHD to arrival at the hospital; ANT CA S n, which stands for the use of calcium channel blockers in the ICU; and finally ASP S n, which indicates whether acetylsalicylic acid was used in the ICU. In accordance with supplementary Figures 7a and 7b, these features were postulated to be the most relevant to predict whether the outcome is fatal or not.
where A and B are two sets such that |A| = |B|. Through this index, we compared the two subsets of features selected by both methods that (i) were selected using the same random seed and (ii) had the same size. The average and the standard deviation are provided in Supplementary Figure 8a. The novel index defined in Equation (15) of the main document was suitable to compare subsets of features produced by different feature selection methods having the same cardinality. The new index did not need to be averaged over the multiple runs because it takes into account all the subsets generated at the same time. This was possible because of the n/2 + 1 condition, which is the minimum number of common features required to ensure that a feature is present in at least two subsets generated from two different feature selection methods. See Supplementary Figure 8a. The last index used to study similarity is the Kuncheva index [18]. This index is also based on the cardinality of the intersection between sets of elements, but introduces a correction for agreements by chance. It is defined as

$$I_K = \frac{r s - \kappa^2}{\kappa (s - \kappa)}, \tag{17}$$
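A small sketch of the Kuncheva index (17) follows, assuming r = |A ∩ B|, κ the common subset size, and s the total number of features in the dataset; these symbol meanings follow the standard definition of the index, since they are not spelled out in this excerpt.

```python
def kuncheva_index(A, B, s):
    """Kuncheva consistency index (17) for two equally sized feature subsets A and B."""
    kappa = len(A)
    assert len(B) == kappa, "both subsets must have the same cardinality"
    r = len(set(A) & set(B))
    return (r * s - kappa ** 2) / (kappa * (s - kappa))
```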
for sizes between 13 and 14 features, as well as between 17 and 20. According to the main document, sizes ranging from 17 to 25 yield superior results. Lastly, Figure 9-(d) exhibits a higher presence of subset sizes of 7 and 9 features.
References
[1] Javier Andreu-Perez, Carmen C. Y. Poon, Robert D. Merrifield,
Stephen T. C. Wong, and Guang-Zhong Yang. Big Data for Health.
IEEE J. Biomed. Health Inform., 19(4):1193–1208, 2015.
[5] Tony Bellotti, Zhiyuan Luo, and Alex Gammerman. Strangeness Min-
imisation Feature Selection with Confidence Machines. In Emilio Cor-
chado, Hujun Yin, Vicente Botti, and Colin Fyfe, editors, Intelligent
Data Engineering and Automated Learning IDEAL 2006, Lecture Notes
in Computer Science, pages 978–985, Berlin, Heidelberg, 2006. Springer.
[6] Visar Berisha et al. Digital medicine and the curse of dimensionality.
NPJ Digit. Med., 4(1):153, October 2021.
[10] Robert Detrano et al. International application of a new probability
algorithm for the diagnosis of coronary artery disease. Amer. J. of
Cardiol., 64(5):304–310, 1989.
[13] Isabelle Guyon, Masoud Nikravesh, Steve Gunn, Lotfi A. Zadeh, and
Janusz Kacprzyk, editors. Feature Extraction: Foundations and Appli-
cations. Springer, Berlin, Heidelberg, 2006.
[14] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vap-
nik. Gene Selection for Cancer Classification using Support Vector
Machines. Mach. Learn., 46(1):389–422, 2002.
[15] H.Altay Güvenir, Gülşen Demiröz, and Nilsel İlter. Learning differ-
ential diagnosis of erythemato-squamous diseases using voting feature
intervals. Artif. Intell. Med., 13(3):147–165, 1998.
[17] Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. UCI machine
learning repository.
[20] Trishan Panch, Heather Mattie, and Leo Anthony Celi. The inconve-
nient truth about AI in healthcare. NPJ Digit. Med., 2(1):1–3, Aug
2019.
[22] F. Pedregosa et al. Scikit-learn: Machine learning in Python. J. Mach.
Learn. Res., 12:2825–2830, 2011.
[27] Alex Wright. Big data meets big science. Comm. ACM, 57(7):13–15,
Jul. 2014.
[28] Chen Xu and Yao Xie. Conformal prediction for time series. IEEE
Trans. Pattern Anal. Mach. Intell., pages 1–22, 2023.
[29] Meng Yang, Ilia Nouretdinov, Zhiyuan Luo, and Alex Gammerman.
Feature selection by conformal predictor. In Lazaros Iliadis, Ilias Ma-
glogiannis, and Harris Papadopoulos, editors, Artificial Intelligence Ap-
plications and Innovations, pages 439–448, Berlin, Heidelberg, 2011.
Springer Berlin Heidelberg.
[30] Shuang Zhou, Evgueni Smirnov, Gijs Schoenmakers, Ralf Peeters, and
Tao Jiang. Conformal feature-selection wrappers for instance transfer.
In Alex Gammerman, Vladimir Vovk, Zhiyuan Luo, Evgueni Smirnov,
and Ralf Peeters, editors, Proceedings of the Seventh Workshop on Con-
formal and Probabilistic Prediction and Applications, volume 91, pages
96–113. Proceedings of Machine Learning Research, 11–13 Jun 2018.
[31] Xin Zhou and David P. Tuck. MSVM-RFE: extensions of SVM-RFE for
multiclass gene selection on DNA microarray data. Bioinf., 23(9):1106–
1114, 2007.
[Figure: (a) RFE + conformal prediction and (b) CRFE + conformal prediction. Score (coverage, inefficiency, certainty, uncertainty, mistrust) versus number of features, shown for each dataset.]
[Figure: (a) overall performance metrics, (b) precision stratified by classes, (c) recall stratified by classes, and (d) F1 stratified by classes, versus number of features, for (a) RFE and (b) CRFE.]
[Figure: (a) RFE consistency and (b) CRFE consistency. Consistency scores (I_W, I_J) versus number of features, shown for each dataset.]
Figure 4: The frequency with which each feature was included in the optimal
subset of features selected by CRFE using the β-based stopping criterion.
Note the maximum corresponds to 50 independent runs. The percentages
of features always discarded (not shown) were 0%, 28.2%, 11.7%, and 13.5%
of the total sets for synthetic, coronary artery disease, dermatology, and
myocardial datasets, respectively.
[Figure: overall performance metrics and per-class precision, recall, and F1 versus number of features, for (a) RFE and (b) CRFE.]
[Figure: overall performance metrics (accuracy, precision, recall) and per-class precision, recall, and F1 (classes 0-5) versus number of features, for (a) RFE and (b) CRFE.]
[Figure: overall performance metrics and per-class precision, recall, and F1 versus number of features, for (a) RFE and (b) CRFE.]
[Figure: similarity scores versus number of features for (a) synthetic, (b) coronary artery, (c) dermatology, and (d) myocardial infarction datasets.]
[Figure: frequency of the selected subset sizes for (a) synthetic, (b) coronary artery disease, (c) dermatology, and (d) myocardial datasets.]