Conformal Recursive Feature Elimination


Marcos López-De-Castro, Alberto García-Galindo, Rubén Armañanzas

May 31, 2024


arXiv:2405.19429v1 [cs.CV] 29 May 2024

Abstract
Unlike traditional statistical methods, Conformal Prediction (CP)
allows for the determination of valid and accurate confidence levels
associated with individual predictions based only on exchangeability
of the data. We here introduce a new feature selection method that
takes advantage of the CP framework. Our proposal, named Confor-
mal Recursive Feature Elimination (CRFE), identifies and recursively
removes features that increase the non-conformity of a dataset. We
also present an automatic stopping criterion for CRFE, as well as a
new index to measure consistency between subsets of features. CRFE
selections are compared to the classical Recursive Feature Elimination
(RFE) method on several multiclass datasets by using multiple parti-
tions of the data. The results show that CRFE clearly outperforms
RFE in half of the datasets, while achieving similar performance in
the rest. The automatic stopping criterion provides subsets of effec-
tive and non-redundant features without computing any classification
performance.

Keywords: Conformal Prediction, Feature Selection, Recursive elimination.

1 Introduction
The curse of dimensionality is a well-known issue in the field of statistical
learning theory. In recent years, the amount of high-dimensional and
multimodal data has become a challenge, from healthcare [1,6] to physics [27],
as a consequence of technological advances and the advent of big data. Feature
selection methods are techniques developed for dimensionality reduction that
select an optimal subset of features without altering their original meaning
[16]. The use of these methods helps prediction algorithms to run faster and
to increase their efficiency. These techniques are categorized as filters,
wrappers, and embedded methods, although mixed approaches have also been
proposed [8, 13]. Search strategies have been developed to explore the feature
space efficiently [7]. Two outstanding techniques are sequential forward and
sequential backward selection, which respectively add and remove features
until the required number of features is reached. The Recursive Feature
Elimination (RFE) method is a popular feature selection method proposed by
Guyon et al. based on a backward elimination policy [14, 31]. RFE was
originally developed for support vector machines and works by recursively
removing the features, or sets of features, that least decrease the margin of
separation between classes.

Marcos López-De-Castro, Alberto García-Galindo and Rubén Armañanzas are with
DATAI - Institute of Data Science and Artificial Intelligence, Universidad de
Navarra, Pamplona, Spain and Tecnun School of Engineering, Universidad de
Navarra, Donostia-San Sebastian, Spain. Email: {mlopezdecas, agarciagali,
rarmananzas}@unav.es
Corresponding: [email protected]
Under review

In this work, we introduce a novel recursive feature selection method
based on the RFE algorithm for the Conformal Prediction framework. When
predictions are performed in high-risk scenarios, confidence levels concern-
ing individual predictions are sought [19, 20]. Conformal Prediction pro-
vides a powerful and innovative framework able to offer non-asymptotic
theoretical guarantees for uncertainty quantification in individual predic-
tions [2, 4, 26, 28]. This new framework is independent of the learning rule
implemented and only requires the Independent and Identically Distributed
(i.i.d.) nature of data. Consequently, Conformal Prediction is the appropri-
ate tool for high-risk machine learning applications. Few methods to accom-
plish feature selection within this framework have been proposed. One of
these, developed by Bellotti et al. [5], is ‘Strangeness minimization Feature
Selection’ (SMFS). The idea behind the SMFS method is to create a ranking
of features and choose those that contribute less to global strangeness. Yang
et al. proposed the ‘Average Confidence Maximization’ (ACM) method [29].
It is based on computing the confidence level by taking into account only in-
dividual features. The feature that maximizes the confidence is selected and
then the process is repeated by adding features until achieving the required
confidence level. Cherubin et al. [9] used the i.i.d. assumption to select
relevant features in an anomaly detection problem and Zhou et al. presented a
conformal feature-selection wrapper for instance transfer [30]. To the best
of our knowledge, no other relevant feature selection methods for Conformal
Prediction have been developed.

We propose Conformal Recursive Feature Elimination (CRFE), a new
feature selection method that combines strangeness with the backward
recursive elimination policy. As features are removed, the strangeness between
samples is updated, similarly to how feature weights are updated in the
original RFE. The features removed are associated with higher strangeness.
We compared CRFE and RFE results on four different multiclass datasets.
CRFE outperformed RFE results in two of the four datasets and achieved the
same performance and consistency as RFE in the remaining two. We also present
an automatic stopping criterion and a new consistency index. The automatic
stopping criterion stops the recursive elimination of features, avoiding full
dataset computation.

The rest of the manuscript is organized as follows. Section 2 briefly
introduces the conformal prediction framework. Section 3 develops the tools
needed to support the multi-class scenario, presents the recursive algorithm
and the stopping criteria. Experimental comparison is described in Section
4 and results are detailed and discussed in Section 5. Section 6 concludes
the paper.

2 Conformal prediction
The conformal prediction framework rests on two assumptions: randomness and
exchangeability [24]. A finite sequence (z1 , ..., zn ) of random variables is
said to be i.i.d., i.e., to satisfy the randomness assumption, if they are
generated independently from the same unknown probability distribution Q. A
finite sequence (z1 , ..., zn ) of random variables is said to be exchangeable
if, for any permutation π of the index set {1, ..., n}, the joint probability
distribution is invariant: Q(z1 , ..., zn ) = Q(zπ(1) , ..., zπ(n) ). The
randomness assumption implies exchangeability. The construction of a
non-conformity measure is the next step when conformal prediction is
performed. A non-conformity measure is defined as a measurable function A,

$$A : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}, \qquad (z, \{z_1, \dots, z_n\}) \mapsto \alpha,$$

that quantifies the degree of strangeness of a sample z with respect to a bag
of samples {z1 , ..., zn }. The larger α is, the stranger z is with respect to
{z1 , ..., zn }. The function A must be invariant with respect to any ordering
of {z1 , ..., zn }. When a classification task is conducted, a natural choice for
the non-conformity function is

A(z, {z1 , ..., zn }) = f (y, h(x)), (1)

where h(·) is a predictive rule learned from {z1 , ..., zn } and f (·, ·) a func-
tion which quantifies the difference between the prediction h(x) and a label
y. Monotonic transformations of f (·, ·) have no impact on the predictions.

Nevertheless, the proper selection of the estimator h(·) has a significant ef-
fect in conformal prediction efficiency [23]. Let’s assume a set of samples of
the form
$$\{z_i\}_{i=1}^{n} = \{(x_i, y_i)\}_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y} = \mathcal{Z}. \tag{2}$$
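A confidence predictor Γϵ is defined as the prediction interval that will
contain the true label of a new sample with a confidence level 1 − ϵ, where
ϵ ∈ [0, 1] is the significance level,

$$P\left(y_{n+1} \in \Gamma^{\epsilon}(x_{n+1})\right) \geq 1 - \epsilon. \tag{3}$$

The prediction interval is then defined as

$$\Gamma^{\epsilon}(x_{n+1}) = \{\, y \in \mathcal{Y} \mid p_{(x_{n+1}, y)} > \epsilon \,\}, \tag{4}$$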

where the p-values p(xn+1 ,y) are usually derived following a transductive or
an inductive approach. When transductive inference is followed, all compu-
tations have to start from scratch. In particular, for every test sample the
transductive inference needs to learn |Y| prediction rules h(·) to build the
confidence sets (4). Inductive conformal prediction was developed to deal
with this computational inefficiency [21]. The inductive approach splits the
set of samples into a training set {z1 , ..., zl } and a calibration set
{zl+1 , ..., zn }. A general prediction rule h(·) is learned from the training
set and the p-values in (4) are computed exclusively from the calibration set
as

$$p_{(x_{n+1}, y)} = \frac{\left|\{\, i = l+1, \dots, n+1 \mid \alpha_i \geq \alpha_{n+1}^{y} \,\}\right| + 1}{n - l + 1}. \tag{5}$$

The general rule induced from the training set incorporates the information
about the dataset, eliminating the need to start from scratch when the con-
fidence sets for the new test samples are derived. Hence, inductive inference
is almost as computationally efficient as the implemented prediction rule
is. The inductive approach sacrifices prediction efficiency for computational
efficiency due to the train-calibration split.
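
As an illustration of the inductive procedure, the following minimal sketch computes p-values and a prediction set from precomputed non-conformity scores. It assumes the common convention in which the count runs over the calibration scores and the added 1 accounts for the test sample; the function names `icp_p_values` and `prediction_set` and the toy scores are hypothetical, not taken from the paper.

```python
def icp_p_values(cal_scores, test_scores):
    """Return one p-value per candidate label for a single test sample.

    cal_scores:  non-conformity scores of the calibration samples.
    test_scores: non-conformity score of the test sample under each
                 candidate label y (one entry per label).
    """
    n_cal = len(cal_scores)
    p_values = []
    for alpha_test in test_scores:
        # Calibration samples at least as strange as the test sample.
        n_ge = sum(1 for alpha in cal_scores if alpha >= alpha_test)
        p_values.append((n_ge + 1) / (n_cal + 1))
    return p_values


def prediction_set(p_values, epsilon):
    """Indices of the labels whose p-value exceeds the significance level."""
    return [y for y, p in enumerate(p_values) if p > epsilon]


# Toy example with four calibration scores and two candidate labels.
cal = [0.1, 0.4, 0.5, 0.9]
p = icp_p_values(cal, test_scores=[0.3, 1.0])
print(p)                        # [0.8, 0.2]
print(prediction_set(p, 0.25))  # [0]
```

Only the calibration scores are needed at prediction time, which is why the prediction rule does not have to be retrained for each test sample.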

3 Methodology
In this section CRFE is presented. First, we adapt the mathematical tools
developed by Bellotti et al. [5] to fit multi-class classification, where the
potential of conformal prediction becomes particularly relevant. This adap-
tation ties non-conformities with features. After this, CRFE is presented.
Finally, we propose a stopping criterion based on the behavior of the selected
features' non-conformities.

3.1 The multiclass adaptation


Our feature selection method is model-dependent. It is suitable for those
classifiers whose decision function can be expressed in terms of a linear
discriminant function. Let us start by assuming that the set of samples Z in
(2) is composed of l features and m different classes, i.e.,

$$x_i \in \mathcal{X} = (X_1, \dots, X_l), \qquad y_i \in \mathcal{Y} = \{y^1, \dots, y^m\},$$

where Xj ∈ R. We follow the One vs All (OVA) approach [31] for extending
[5] to a multiclass scenario. Let us define the function θ as

$$\theta(y_i, y^k) = \begin{cases} 1 & \text{if } y_i = y^k, \\ -1 & \text{if } y_i \neq y^k, \end{cases} \tag{6}$$

where k ∈ {1, ..., m}, such that we can derive m new datasets to train m
linear models in the OVA approach:

$$Z^k = \{z_i^k\}_{i=1}^{n} = \{(x_i, \theta(y_i, y^k))\}_{i=1}^{n}. \tag{7}$$

When training the classifier on each dataset, hk = h(Z k ), m weight vectors
wk defining the separation hyperplanes are produced. In the original binary
classification task, a reasonable non-conformity function for this kind of
classifier was proposed by Vovk et al. [26]:

$$\alpha_i^{\text{binary}} = \tilde{A}(z_i^k, Z^k, h_k) = -\theta(y_i, y^k) \Big( \sum_{j=1}^{l} w_j^k x_{ij} + b \Big). \tag{8}$$
Similarly to the weighted average method [26], we extend the binary non-
conformity measure to the multiclass problem as

$$\alpha_i = A(z_i, D) = \lambda \tilde{A}(z_i^k, Z^k, h_k) + \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} \tilde{A}(z_i^r, Z^r, h_r), \tag{9}$$

with k ∈ {1, ..., m}, λ ∈ [0, 1], and λ′ = (1 − λ)/(m − 1),
where αi is the non-conformity measure of the sample zi .
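
To make the weighting in (9) concrete, the sketch below evaluates the multiclass non-conformity of one sample from the m OVA decision values. It is only an illustration under stated assumptions: k is taken to be the index of the label assigned to the sample, so that θ equals +1 for model k and −1 for every other model, and the function name `multiclass_nonconformity` and the toy decision values are hypothetical.

```python
def multiclass_nonconformity(decision_values, k, lam=0.5):
    """Weighted OVA non-conformity of one sample, following Eq. (9).

    decision_values: list with the m OVA outputs sum_j w_j^r x_j + b^r.
    k:   index of the label assigned to the sample (assumption: theta is
         +1 for model k and -1 for all other models, as in Eq. (6)).
    lam: weight lambda in [0, 1] given to the model of class k.
    """
    m = len(decision_values)
    lam_prime = (1.0 - lam) / (m - 1)
    # Binary non-conformity of model k: -theta * f_k with theta = +1.
    own = -decision_values[k]
    # Binary non-conformities of the other models: -theta * f_r with theta = -1.
    others = sum(decision_values[r] for r in range(m) if r != k)
    return lam * own + lam_prime * others


# Toy sample whose OVA models favour class 0.
f = [2.0, -1.0, -3.0]
print(multiclass_nonconformity(f, k=0))  # -2.0
print(multiclass_nonconformity(f, k=1))  # 0.25
```

In this toy case the label supported by the OVA models receives the smallest non-conformity, which is the behaviour expected from the measure.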
Let S ⊆ X = (X1 , ..., Xl ) be a subset of features with |S| = t. If we select
such a subset of t features S from the original set of features X , the
expression defined in (8) turns into

$$\alpha_i^{S;\,\text{binary}} = \tilde{A}(z_i^k, Z^k, h_k) = -\theta(y_i, y^k) \Big( \sum_{j \in S} w_j^k x_{ij} + b \Big), \tag{10}$$

which is linearly separable by features [5] (i.e., we can isolate by features
the terms that are feature dependent).

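The OVA approach of the problem defined in (9) is still linearly separable.
Considering the non-conformity measure (8), the expression in (9) is
expanded as

$$\begin{aligned}
A(z, Z, S) &= -\lambda \theta(y, y^k) \Big( \sum_{j \in S} w_j^k x_j + b^k \Big) - \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} \theta(y, y^r) \Big( \sum_{j \in S} w_j^r x_j + b^r \Big) \\
&= -\sum_{j \in S} \Big[ \lambda \theta(y, y^k) w_j^k + \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} \theta(y, y^r) w_j^r \Big] x_j - \gamma,
\end{aligned}$$

where the constant $\gamma = \lambda \theta(y, y^k) b^k + \lambda' \sum_{r=1, r \neq k}^{m} \theta(y, y^r) b^r$.
Our non-conformity function can then be separated by features,

$$A(z, D, S) = \sum_{j \in S} \phi(z, D, j),$$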

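satisfying the condition of linearity. As a consequence, we can define the
β-measures as

$$\beta_j = \sum_{i=1}^{n} \phi(z_i, D, j),$$

where

$$\beta_j = -\Big[ \lambda \, w_j^k \sum_{i=1}^{n} \theta(y_i, y^k) x_{ij} + \lambda' \sum_{\substack{r=1 \\ r \neq k}}^{m} w_j^r \sum_{i=1}^{n} \theta(y_i, y^r) x_{ij} \Big] - \sum_{i=1}^{n} \gamma_i. \tag{11}$$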

3.2 Conformal recursive feature elimination


Following the idea that ‘a good feature ranking criterion is not always a good
feature subset criterion’ [14], the feature selection method we present is based
on a recursive backward elimination policy. We propose the recursive
removal of the most non-conformal feature, re-training the classifier in
each iteration until a stopping criterion is met. The recursive algorithm can
be summarized as follows:

1. Train the classifier.

2. Compute βj for each feature.

3. Remove the feature Xj with the highest βj .

4. Retrain with the new subset or stop if the stopping criterion is met.
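
The four steps above can be sketched as follows for the binary case with a fixed target number of features. This is a hedged illustration rather than the implementation used in the experiments: the linear model is a ridge-regularised least-squares classifier used as a stand-in for the linear SVM, labels are assumed to be coded as −1/+1, βj is computed as −wj Σi yi xij on the calibration set, and the names `fit_linear` and `crfe_binary` are hypothetical.

```python
import numpy as np


def fit_linear(X, y, ridge=1e-3):
    """Stand-in linear classifier: ridge least squares on -1/+1 labels."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    coef = np.linalg.solve(A, Xb.T @ y)
    return coef[:-1], coef[-1]                     # weights w, bias b


def crfe_binary(X_train, y_train, X_cal, y_cal, n_keep):
    """Recursively drop the feature with the highest beta_j."""
    remaining = list(range(X_train.shape[1]))
    while len(remaining) > n_keep:
        # Step 1: train the classifier on the remaining features.
        w, _ = fit_linear(X_train[:, remaining], y_train)
        # Step 2: beta_j = -w_j * sum_i y_i x_ij over the calibration set.
        beta = -w * (y_cal[:, None] * X_cal[:, remaining]).sum(axis=0)
        # Step 3: remove the most non-conformal feature.
        del remaining[int(np.argmax(beta))]
        # Step 4: the loop retrains with the new subset or stops.
    return remaining


# Toy data: only feature 0 carries information about the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
selected = crfe_binary(X[:200], y[:200], X[200:], y[200:], n_keep=1)
print(selected)
```

An informative feature gets a strongly negative βj because its weight and the label-weighted feature sum share the same sign, so noise features are the first to be eliminated.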

Next, we show that the variation in the non-conformity function is
proportional to βj when the feature Xj is eliminated. We use the binary
framework (8) without loss of generality. Let us define δAj as the variation
of the linear non-conformity function resulting from the removal of feature j
in a particular calibration set:

$$\begin{aligned}
\delta A_i^{k,j} = \delta A^{j}(z_i^k, Z^k, h_k) = \alpha_i - \alpha_i^{j}
&= -y_i \Big( \sum_{p=1}^{l} w_p^k x_{ip} + b \Big) + y_i \Big( \sum_{\substack{s=1 \\ s \neq j}}^{l} w_s^k x_{is} + b \Big) \\
&= -y_i w_j x_{ij},
\end{aligned}$$

where α_i^j is the non-conformity measure without taking into account the
feature j. In the whole calibration set, the variation will be:

$$\delta A^{j} = \sum_{i}^{n} \delta A_i^{k,j} = \sum_{i}^{n} -y_i w_j x_{ij}. \tag{12}$$

Moreover, since wj does not depend on the sample index i,

$$\sum_{i}^{n} -y_i w_j x_{ij} = -w_j \sum_{i}^{n} y_i x_{ij},$$

so

$$\delta A^{j} = \beta_j. \tag{13}$$
The result in (13) strengthens the interpretation of βj as the non-conformity
associated with a feature and justifies the fact that eliminating the feature Xj
with the highest associated non-conformity is what contributes the most to
reducing the global non-conformity of the calibration set. Our goal is to
update the predictive rule h(·) (1) in order to produce more representative
non-conformity measures. Moreover, when comparing the proposed algo-
rithm with the original RFE method, only one additional step is included:
computing expression (11). This additional step adds a linear computational
complexity of O(lmn), or O(ln) if we have only two classes, over the RFE
algorithm complexity.
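
The identity in (13) can be checked numerically. The sketch below assumes a binary problem with −1/+1 labels, random calibration data and the weights of an already trained linear model held fixed while feature j is dropped; all variable names and values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 50, 4
X = rng.normal(size=(n, l))          # calibration samples
y = rng.choice([-1.0, 1.0], size=n)  # binary labels
w = rng.normal(size=l)               # weights of a trained linear model
b = 0.3                              # bias of the same model


def total_nonconformity(features):
    """Sum over the calibration set of alpha_i = -y_i (sum_j w_j x_ij + b)."""
    return np.sum(-y * (X[:, features] @ w[features] + b))


j = 2
kept = [f for f in range(l) if f != j]
# Variation of the global non-conformity when feature j is removed.
delta = total_nonconformity(list(range(l))) - total_nonconformity(kept)
# beta_j computed directly from the weight and the calibration data.
beta_j = -w[j] * np.sum(y * X[:, j])
print(np.isclose(delta, beta_j))     # True
```

Computing βj for all features requires a single pass over the calibration matrix, in line with the linear cost stated above.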

3.3 Stopping criterion
The optimal number of features to be selected is not a straightforward prob-
lem when feature selection is performed. We consider two stopping criteria in
our study. The first criterion (i) involves fixing a certain number of features
and letting the proposed method iterate until the specified number of features
is reached. In cases where the optimal number of features is known this
criterion is useful. Moreover, when the dataset is not too large, this criterion
allows the exploration of all possible subset sizes and the analysis of
performance metrics in order to discover an optimal number of features.
However, for large
datasets, this approach can become impractical. The second stopping crite-
rion (ii) is a novel approach that takes advantage of the relative β-measures
variation during the recursive elimination method. The underlying idea be-
hind this criterion is as follows: when a feature Xj is removed from the set of
remaining features, the resulting non-conformity associated to the new set
should have decreased. We found that the mean of the β-measures decreases
at a constant rate until an exponential decay is observed. We propose to
stop the selection process before reaching the exponential behavior. Because
the transition between the constant and the exponential rate involves a de-
celeration, we use the second derivative of the mean, which remains close
to zero until the exponential behavior starts. The exponential regime starts
when the value of the second derivative exceeds a multiple, at least three, of the
standard deviation of the set of values corresponding to the previous second
derivatives. The standard deviation can be computed based on the k-latest
values to enhance computational performance. The full method is presented
in Algorithm 1.

Algorithm 1 CRFE and β-based stopping criterion.
Input: Dtrain = (Xtrain , Ytrain ); training set
Input: Dcal = (Xcal , Ycal ); calibration set
Input: σ; confidence factor, a value ≥ 3 is recommended
Input: ψ; length of the set used to compute the standard deviation
Initialize array to zero: beta_means
Initialize array to zero: beta_num_der

Do
    w, b ← train(Dtrain)
    β ← compute(Dcal, w, b)
    j ← index(max(β))

    // Stopping criterion evaluation
    If len(beta_means) > ψ:
        delete the oldest element of beta_means

    β̄ ← mean(β)
    beta_means ← append(beta_means, β̄)
    x ← array of integers (0, len(beta_means))
    β̄′′ ← derive(derive(beta_means, x), x)
    β′′σ ← std(beta_num_der)
    beta_num_der ← append(beta_num_der, β̄′′)

    If abs(β̄′′) > abs(σ β′′σ):
        stopping_criterion_is_met ← true

    delete feature j from Xtrain
    delete feature j from Xcal

Until stopping_criterion_is_met = true

Output: Set of selected features
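
The stopping rule can be illustrated in isolation on a sequence of mean β values. The sketch below is a simplification under stated assumptions: the second derivative is approximated by a backward second difference instead of two nested numerical derivatives, at least two previous values are required before the standard deviation is used, and the function name `stop_index` and the toy sequence are hypothetical.

```python
import statistics


def stop_index(beta_means, sigma=3.0, window=None):
    """Return the first iteration whose second difference exceeds
    sigma times the standard deviation of the previous ones, or None.

    beta_means: mean of the beta-measures at each elimination step.
    sigma:      multiple of the standard deviation used as threshold.
    window:     if given, only the latest `window` second differences
                are used to compute the standard deviation.
    """
    second_diffs = []
    for t in range(2, len(beta_means)):
        # Backward second difference as a proxy for the second derivative.
        d2 = beta_means[t] - 2 * beta_means[t - 1] + beta_means[t - 2]
        history = second_diffs if window is None else second_diffs[-window:]
        if len(history) >= 2 and abs(d2) > sigma * statistics.stdev(history):
            return t
        second_diffs.append(d2)
    return None


# Roughly constant decrease followed by an abrupt drop at the last step.
means = [10.0, 9.0, 8.1, 7.0, 6.1, 5.0, 4.0, 1.0]
print(stop_index(means))  # 7
```

Restricting the history with `window` plays the role of ψ in Algorithm 1: the threshold then adapts to the most recent behaviour of the sequence.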

4 Experimental settings
4.1 Datasets
Four publicly available databases were used as test-bed: one synthetic and
three real-world datasets. Their characteristics are summarized in Table 1.
Data preprocessing involved (i) one-hot class encoding and cleaning of
features, (ii) checking for missing values; if more than 25% of the values
were missing, the feature was removed, (iii) imputation of missing values,
and (iv) standardization of the data to avoid scale biases.

Dataset                 Samples  Classes  Features  Distribution of classes (%)                 Reference
Synthetic               350      4        35        (25.0, 25.0, 25.0, 25.0)                    [22]
Coronary artery         899      4        32        (44.9, 21.2, 14.5, 19.3)                    [17]
Dermatology             366      6        34        (30.6, 16.7, 19.7, 13.4, 14.2, 5.5)         [17]
Myocardial infarction   1700     8        104       (84.1, 6.5, 1.1, 3.2, 1.3, 0.7, 1.6, 1.6)   [12]

Table 1: Datasets description after processing the data.

• Synthetic dataset. We considered a synthetic dataset to control
informative versus noisy features. The designed dataset comprises 10
informative and 25 noise features. We included 4 classes distributed
equally among samples, except for 5% of the samples, which were assigned
randomly.

• Coronary artery disease dataset. This dataset comes from a real-world
clinical problem focused on coronary artery disease diagnosis
[10, 17]. Class 0 represents the absence of disease, whereas classes 1,
2, 3, and 4 represent the degree of artery disease. Due to the limited
number of samples in class 4, we merged samples from classes 3 and
4.

• Dermatology dataset. The third dataset aims to determine the type
of erythemato-squamous disease that a group of patients are suffering
from [15, 17]. Included classes are Psoriasis, Seborrheic Dermatitis,
Lichen Planus, Pityriasis Rosea, Chronic Dermatitis, and Pityriasis
Rubra Pilaris.

• Myocardial infarction dataset. The last dataset deals with
myocardial infarction [12]. The proposed classification task is
to predict if a patient will survive and, if not, the cause of death. The
classes considered were: alive, cardiogenic shock, pulmonary edema,
myocardial rupture, progress of congestive heart failure, thromboembolism,
asystole, and ventricular fibrillation.

We encourage the reader to review the Supplementary Material for
additional details.

4.2 Performance evaluation


Conformal prediction performance, i.e., set prediction results, was evaluated
using the following metrics [25]:

• Coverage: Empirical coverage measures the percentage of test samples
for which the true class falls in the prediction set. It must be close
to the theoretical coverage 1 − ϵ. For k test samples,

$$\mathrm{Cov} = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\{ y_i \in \Gamma_i^{\epsilon}(x_i) \}.$$

• Inefficiency (N-Score): This performance score measures the average
size of the prediction sets:

$$\mathrm{Ineff} = \frac{1}{k} \sum_{i=1}^{k} |\Gamma_i^{\epsilon}(x_i)|.$$

• Certainty: This score is defined as the percentage of test samples
correctly classified and with a prediction set of size 1:

$$\mathrm{Cert} = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\{ \Gamma_i^{\epsilon}(x_i) = \{y_i\} \}.$$

• Uncertainty: Proportion of test samples for which the prediction set
includes all classes:

$$\mathrm{Uncert} = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\{ \Gamma_i^{\epsilon}(x_i) = \mathcal{Y} \}.$$

• Mistrust: Proportion of test samples without prediction, i.e., all the
classes are too strange compared to the unclassified sample, and the
sample cannot be classified with the specified confidence in any of the
available classes:

$$\mathrm{Mist} = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\{ \Gamma_i^{\epsilon}(x_i) = \emptyset \}.$$

The performance assessment of classical single-predictions was conducted
by the following well-known performance metrics: accuracy, precision and
recall. Per class precision, recall and macro-F1 were also reported (see
Supplementary Material).
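
The five set-prediction metrics defined in this section can be computed directly from the prediction sets. The following minimal sketch assumes that each prediction set is given as a Python set of class labels; the function name `set_prediction_metrics` and the toy data are hypothetical.

```python
def set_prediction_metrics(pred_sets, y_true, all_labels):
    """Coverage, inefficiency, certainty, uncertainty and mistrust.

    pred_sets:  one set of predicted labels per test sample.
    y_true:     true label of each test sample.
    all_labels: the full label space Y.
    """
    k = len(y_true)
    all_labels = set(all_labels)
    sets = [set(s) for s in pred_sets]
    return {
        # True label inside the prediction set.
        "coverage": sum(y in s for s, y in zip(sets, y_true)) / k,
        # Average prediction set size.
        "inefficiency": sum(len(s) for s in sets) / k,
        # Singleton set containing the true label.
        "certainty": sum(s == {y} for s, y in zip(sets, y_true)) / k,
        # Prediction set equal to the whole label space.
        "uncertainty": sum(s == all_labels for s in sets) / k,
        # Empty prediction set.
        "mistrust": sum(len(s) == 0 for s in sets) / k,
    }


pred_sets = [{0}, {0, 1}, {0, 1, 2}, set()]
y_true = [0, 2, 1, 1]
metrics = set_prediction_metrics(pred_sets, y_true, all_labels={0, 1, 2})
print(metrics)
```

For this toy input the coverage is 0.5, the inefficiency is 1.5, and certainty, uncertainty and mistrust are each 0.25.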

4.3 Consistency evaluation


A feature selection method should converge to similar subsets of features
under random splits of a dataset. Ideally, the most informative features
will be the ones selected more frequently. A standard measure to assess the
consistency between two sets of features with the same number of elements is
the Jaccard index [3]. The Jaccard index can be easily modified to consider
differences between more than two sets. Let S1 , ..., Sn be subsets of features
such that |S1 | = ... = |Sn |. The multi-set Jaccard consistency index is defined
as

$$I_J = \frac{|S_1 \cap \dots \cap S_n|}{|S_1 \cup \dots \cup S_n|}. \tag{14}$$

We propose an extension that is more suitable than (14) to compare
ensembles of n sets of features. Let us consider |S1 | = ... = |Sn | and K =
(n/2 + 1, ..., n); the new index IW is defined as

$$I_W = \sum_{j \in K} \omega_j P_j, \quad \text{such that} \quad \omega_j = \frac{j}{\sum_{k \in K} k}, \tag{15}$$

where Pj is the fraction of features that are common to at least j ∈ K
subsets of features. This adaptation can be seen as a weighted Jaccard
index. Unlike the expression in (14), the new index does not require that
all generated subsets have a common set of features. A feature must be
only present in at least n/2 + 1 of the subsets. The consistency measured
by index IW increases if a feature becomes more prevalent across diverse
subsets. The proposed index ranges between 0 and 1, with higher values
indicating more similarity between subsets, and it was conceived to detect
the consistent subsets that the Jaccard index cannot detect.
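
Both consistency indexes can be sketched in a few lines. The example below is an illustration under two assumptions that the text does not spell out: Pj is normalised by the common subset size |S1|, and n/2 is taken as integer division when n is odd. The function names `jaccard_multiset` and `weighted_consistency` and the toy subsets are hypothetical.

```python
from collections import Counter


def jaccard_multiset(subsets):
    """Multi-set Jaccard index of Eq. (14)."""
    sets = [set(s) for s in subsets]
    return len(set.intersection(*sets)) / len(set.union(*sets))


def weighted_consistency(subsets):
    """Weighted index I_W of Eq. (15).

    Assumption: P_j is the number of features present in at least j
    subsets divided by the common subset size.
    """
    sets = [set(s) for s in subsets]
    n = len(sets)
    size = len(sets[0])                       # all subsets have equal size
    counts = Counter(f for s in sets for f in s)
    K = range(n // 2 + 1, n + 1)              # j = n/2 + 1, ..., n
    norm = sum(K)
    total = 0.0
    for j in K:
        p_j = sum(1 for c in counts.values() if c >= j) / size
        total += (j / norm) * p_j             # omega_j * P_j
    return total


subsets = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 6}]
print(jaccard_multiset(subsets))      # 1/6: only feature 1 is in all subsets
print(weighted_consistency(subsets))  # 10/21: feature 2 also contributes
```

In this example feature 2 appears in three of the four subsets, so it raises the weighted index while leaving the Jaccard index unchanged.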

4.4 Experimental design


Although our feature selection proposal is based on conformal prediction
notions, if the inductive approach is followed, then our method can be de-
scribed as conformal-agnostic. This implies that, once the feature selection
method identifies the optimal subset of features, it is possible to implement
either a traditional machine learning workflow, which provides point
predictions, or a conformal prediction-based pipeline, which offers prediction
intervals. We followed the inductive inference scheme to also optimize com-
putational cost. The conducted performance comparison between CRFE
and RFE methods was evaluated by the performance of the selected subsets
of features in both the conformal prediction and traditional single prediction
frameworks. When conformal prediction was used, we fixed the confidence
level 1 − ϵ at 0.9. Test splits always included 25% of the original dataset and
the remaining samples were equally split into training and calibration sets.
The parametric classifier chosen was an SVM with a linear kernel, implemented
through the Scikit-learn library [22]. The comparison scheme between both
methods followed these steps: (i) a random seed was established; (ii)
both feature selection methods were run, selecting subsets of features for
all possible sizes; and, (iii) predictions were inferred by both conformal and
classical classifiers using the subsets of selected features. To carry out a fair,
unbiased, and statistically robust comparison, the scheme was repeated 20
times by shuffling the data, and the results were averaged across all
iterations. In each iteration, all the random numbers were generated and fixed
at the beginning, always preserving the same split between test, calibration,
and training sets for both selection methods.
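
The splitting scheme described above can be sketched as follows. This is a minimal illustration that assumes a plain random shuffle (whether the original splits were stratified is not stated here); the function name `split_indices` is hypothetical.

```python
import random


def split_indices(n_samples, seed, test_fraction=0.25):
    """Return (train, calibration, test) index lists for one repetition.

    The test split takes `test_fraction` of the samples and the rest is
    divided equally between training and calibration.
    """
    rng = random.Random(seed)          # one fixed seed per repetition
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_test = round(test_fraction * n_samples)
    test, rest = idx[:n_test], idx[n_test:]
    half = len(rest) // 2
    return rest[:half], rest[half:], test


# Both selection methods receive exactly the same split for a given seed.
train, cal, test = split_indices(100, seed=0)
print(len(train), len(cal), len(test))  # 37 38 25
```

Calling the function with the same seed for CRFE and RFE reproduces the same partition, and looping over 20 seeds gives the repetitions that are averaged.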

5 Results and discussion


5.1 Performance analysis
The performance comparison for set predictions between RFE and CRFE
methods is presented in Figures 1a, 1b, 1c, and 1d for the synthetic, coronary
artery, dermatology, and myocardial infarction datasets, respectively.
We observed that CRFE outperformed RFE in the synthetic and the
myocardial infarction datasets. Similar performance for both methods was
observed in the coronary artery and the dermatology datasets.

On the synthetic dataset, CRFE outperformed RFE performance scores.
The inefficiency metric, i.e., the average length of the prediction sets, is
clearly smaller when the CRFE method was applied than in the case of
RFE. Moreover, certainty and uncertainty scores in Figure 1a-(b) support
the fact that CRFE was able to find the features that minimize the
strangeness between samples, without losing classification performance. On
the other hand, Figure 1a-(a) shows how the performance scores started
worsening faster when RFE was applied. In this dataset the number of
features known to be informative was set to 10. On
average, at least 75% of the informative features were constantly present
when the CRFE method was used, whereas RFE captured less than 14% of
the informative features across all runs.

The performance comparison on the coronary artery disease dataset
showed a similar trend for both feature selection methods. The uncertainty
and inefficiency scores displayed a pronounced growth in Figure 1b-(a) for
subsets with fewer than 9 features, whereas the same behaviour was
observed in Figure 1b-(b) for subsets with fewer than 6 features.
Therefore, the optimal subset proposed by RFE had a cardinality
of 9 to 10, whereas the optimal subset proposed by CRFE had 5 to 6 features.

Results for the dermatology dataset showed similar performance by both
feature selection methods, see Figures 1c-(a) and 1c-(b). Optimal cardinalities
were found for subsets in the range of 15 to 20. However, CRFE found
a second optimal subset lowering this range to just 5 features. This second
optimal set scored a mistrust of 0. Finally, on the myocardial infarction
dataset, a clear difference between both methods was observed. Figure 1d-(a)
shows an optimal point for the RFE performance close to subsets of size
10, with null uncertainty value at that cardinality.

The single-prediction performance metrics also showed the advantage of
CRFE when classical machine learning prediction was implemented. Figure 2
shows the results for single-prediction performance metrics both for
RFE and CRFE feature selection methods on the synthetic dataset. CRFE
was able to preserve overall scores while removing features down to subset
sizes of 10, which corresponds to the number of known predictive features.
Class-stratified results showed a similar behavior. Results for the remaining
datasets are provided in the Supplementary Material.

5.2 Consistency analysis


We also analyzed how consistent the features selected by CRFE were in
comparison to RFE. Figure 3 presents the results obtained by the Jaccard
and the novel consistency indexes described in Expressions (14) and (15),
respectively. These indexes quantify the level of consistency between the
subsets of features selected by both RFE and CRFE for each of the 20
iterations.

Note how the feature subsets selected by CRFE were equally or more
consistent than those selected by RFE. The peaks of consistency, especially
in Figures 3a-(b), 3b-(b), were observed around the optimal subset sizes
predicted in Figure 1 for CRFE. This suggests that CRFE was able to,
independently of the data split, find the subset of features which reports
better performance metrics. We highlight how the newly proposed consis-
tency index in Expression (15) was able to reflect the peaks in Figures 3a-(b),
3c-(b), 3d-(a), and 3d-(b), in contrast to the Jaccard index. These peaks
helped to detect hidden consistent subsets. Moreover, the new index
reflected consistency at the last stages of the recursive elimination process
while Jaccard ignored those sizes. In Figures 3a-(a) and 3d-(a) there are no-
ticeable differences between those consistency measures. Those values may
indicate that there was a considerable number of features that, although
not consistently selected in all the 20 iterations, were frequently chosen by
RFE. For coronary artery and dermatology datasets, both showed a similar
behavior for larger feature subsets. For smaller subsets, CRFE showed more
consistent selections than RFE.

We also analyzed the differences between the features selected by both
methods when the random seeds were the same, i.e., similarity. We compared
the subsets of features selected by both methods with (i) the same
cardinality and (ii) selected in the same iteration, see results in
Supplementary Figure S4. The feature subsets selected had greater similarity
for the coronary artery and dermatology datasets. The Kuncheva index defined in
Supplementary Equation S6.1 revealed that these coincidences were robust
against random coincidence. On the other hand, the synthetic and myocardial
infarction datasets revealed poor agreement on the selected features. For
these two datasets the similarity values suggested that the agreement by
chance was relevant.

5.3 β-based stopping criterion


The β-based stopping criterion introduced in Section 3.3 was tested in all
four datasets. We compared the results of the new criterion versus the
accuracy-based RFE stopping criterion. The RFE accuracy-based rule is a
well-established method implemented through a cross-validation pipeline. We
tested the RFE-accuracy and the β-based stopping criteria by running 50
different random splits of the datasets. We fixed the excess σ, i.e., the
number of times the second derivative must exceed the standard deviation to
stop the recursive method, to 5 in Algorithm 1. Results are presented in
Table 2. The distribution of selected features by the β-based stopping
criterion is shown in Figure 4. The corresponding distributions of subset sizes
are included in Supplementary Figure S5.

Dataset                 RFE: average size (Inefficiency, Certainty)   CRFE: average size (Inefficiency, Certainty)
Synthetic               34 (0.19, 0.39)                               11 (0.19, 0.39)
Coronary artery         16 (0.53, 0.10)                               8 (0.49, 0.17)
Dermatology             31 (0.04, 0.81)                               16 (0.04, 0.72)
Myocardial infarction   1 (0.16, 0.02)                                17 (0.32, 0.17)

Table 2: Average size for subsets of features selected by both stopping
criteria: RFECV for RFE and β-based for CRFE. Results inside parentheses are
the averaged inefficiency and certainty scores for the selected subset size.

In the synthetic dataset, the optimal size found by the RFE-accuracy
criterion significantly differed from the most frequent subset size proposed by
the β-based criterion. Results in Figure 4-(a) show that the distribution of the
most frequently selected features was consistent with the known informative
features. For the coronary artery dataset, 9 and 10 were the subset sizes most
commonly observed when the β-based criterion was used, closely followed
by sizes of 6 and 7 elements, see Supplementary Figure S5-(b). Figure 4-(b)
shows that there were 6 features consistently selected in at least 80% of the
experiments. These findings are in line with the results in Figures 1b and 3b.
In the dermatology dataset, the RFE method discarded only one feature,
whereas the new criterion frequently stopped at three different subset sizes:
13, 20, and 25. Two of these sizes, 20 and 25 features, selected by the
β-based criterion were consistent with the optimal size shown in Figure 1c-(b).
The RFE-accuracy based criterion was far from any optimal subset size from
results shown in Figure 1c-(a). In the myocardial infarction dataset, both
stopping criteria resulted in subset sizes that deviated significantly from
the optimal subsets detected in Figure 1d. However, the inefficiency and
certainty metrics in Table 2 were better when the new stopping criterion
was used.
The β-based automatic stopping criterion was designed to determine the
optimal subset of features without computing performance scores for all
possible feature cardinalities or combinations. We observed that variability
in the data can slightly influence its performance. However, the frequency
of features included in the subsets selected by the new criterion showed that
this method was able to identify the most significant features to effectively
reduce non-conformity.
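
The frequency analysis behind Figure 4 can be sketched in a few lines of Python. The helper names and the toy runs below are hypothetical; the 80% threshold is the one used in the text.

```python
from collections import Counter

def selection_frequency(subsets):
    """Fraction of runs in which each feature index was selected."""
    counts = Counter(f for subset in subsets for f in set(subset))
    n_runs = len(subsets)
    return {feature: count / n_runs for feature, count in counts.items()}

def consistently_selected(subsets, threshold=0.8):
    """Features present in at least `threshold` of the runs."""
    freq = selection_frequency(subsets)
    return sorted(f for f, p in freq.items() if p >= threshold)

# Five hypothetical runs, each returning the indices of the selected features.
runs = [[1, 2, 3], [1, 2], [1, 3], [1, 2, 4], [1, 2]]
stable = consistently_selected(runs, threshold=0.8)
```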

6 Conclusions
Conformal prediction stands out as one of the most effective methods for
uncertainty quantification due to its robust theoretical guarantees. In this
study, we introduce an alternative to the classical Recursive Feature
Elimination method that takes advantage of the conformal prediction
framework. The proposed method, named Conformal Recursive Feature
Elimination (CRFE), is a feature selection technique that builds upon the
SMFS algorithm [5], extended to recursively remove features. We demonstrate
that, by removing a feature, the variation in the non-conformity measure
is equivalent to eliminating the non-conformity associated with that partic-
ular feature. Additionally, we introduce a novel consistency index and an
automatic stopping criterion based on the non-conformity associated with
the features. To evaluate the effectiveness of CRFE, we compared its per-
formance against RFE on four multiclass datasets. The results indicate
several benefits of using CRFE when conformal prediction is applied, as
well as in classical performance metrics when no conformal prediction is
done. The consistency tests based on stability indexes reported that CRFE
achieved at least the same level of consistency as RFE. The proposed au-
tomatic stopping criterion for CRFE is based on the non-conformity values
of each feature and outperformed the accuracy-based RFE stopping crite-
rion. However, the feature selection method proposed in this work depends
on the linear separability condition when the selection is performed, i.e.,
CRFE relies on computing separation hyperplanes between classes. Future
developments will investigate how to expand to nonlinear classifiers. We also
plan to explore the class-conditional conformal prediction as a mechanism
to better adapt to imbalanced problems. Finally, an open source library
with the implementation is released.

Acknowledgments
This work was supported by the Gobierno de Navarra through the ANDIA
2021 program (grant no. 0011-3947-2021-000023) and the ERA PerMed
JTC2022 PORTRAIT project (grant no. 0011-2750-2022-000000).

Repository
An open source library that implements CRFE, including the datasets used
in this work, the post-processed data, and the code, can be found at
https://github.com/digital-medicine-research-group-UNAV/CRFE.

A Supplementary Material
A.1 Introduction
The present document provides Supplementary Material for the paper
entitled Conformal Recursive Feature Elimination. The results presented here
were derived from the same datasets as in the main document.
Single-prediction performance metrics, i.e., accuracy, recall, and precision,
as well as per-class precision, recall, and F1 metrics, were calculated in
supplementary Sections A.3, A.4, and A.5 for the last three described
datasets. Precision is defined as the proportion of predicted positives that
are truly positive. Recall, also known as sensitivity, is the proportion of
positives that are correctly classified. The F1 score is the harmonic mean
of precision and recall. The plots supporting the similarity study are
shown in supplementary Section A.6, and the additional material related
to the study of the β-based stopping criterion is provided in supplementary
Section A.7.
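
As a small worked example of these definitions, the per-class metrics can be computed with scikit-learn. The labels below are made up for illustration and do not come from any dataset in this work.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# average=None returns one value per class instead of a single summary.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)
```

For class 1, three samples are predicted as class 1 and two of them are correct, so precision is 2/3; both true class 1 samples are recovered, so recall is 1 and the F1 score is 0.8.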

A.2 Artificial dataset


The synthetic dataset was generated using the make_classification() function
from the scikit-learn library. The specific parameters used are detailed in
Supplementary Table 3.

n_samples        350      n_redundant             1
n_informative    10       n_clusters_per_class    1
n_classes        4        class_sep               1.5
random_state     12345    flip_y                  0.05

Table 3: Parameters used to generate the synthetic dataset.
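
The parameters in Table 3 map directly onto a scikit-learn call, sketched below. The total number of features is not listed in the table, so `N_FEATURES` is a placeholder assumption and should be replaced by the value used in the released repository.

```python
from sklearn.datasets import make_classification

N_FEATURES = 35  # assumption: n_features is not given in Supplementary Table 3

X, y = make_classification(
    n_samples=350,
    n_features=N_FEATURES,
    n_informative=10,
    n_redundant=1,
    n_classes=4,
    n_clusters_per_class=1,
    class_sep=1.5,
    flip_y=0.05,
    random_state=12345,
)
```

Features that are neither informative nor redundant are filled with noise by make_classification, which is what makes the known informative features a useful ground truth for feature selection.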

A.3 Coronary artery disease dataset
The coronary artery disease dataset [17] comprises data collected from
Cleveland (303), Hungary (294), Switzerland (123) and Long Beach VA (200).
The original study [11] warned about a potential bias in the test groups
because the noninvasive samples were not withheld from the treating
physician. To classify diseases, a cardiologist diagnosed samples based only
on the angiogram results. According to his criteria, a coronary artery was
significantly diseased if the luminal diameter reduction exceeded 50%.

The dataset contains five classes. Class 0 stands for no disease and has 404
samples (44.93% of the total); class 1 stands for patients that have at
least one diseased artery and is composed of 191 samples (21.24%); class 2
stands for patients that have a single-vessel disease and is composed
of 130 samples (14.46%); class 3 stands for patients that have a
double-vessel disease and is composed of 132 samples (14.68%); and class 4
stands for patients that have a triple-vessel disease and is composed of 42
samples (4.67%). The original database comprised 75 features, but name, SS
number, medical test dates and patient's number were excluded. Any feature
that had more than 25% of missing values was also removed. The remaining
features are listed and numbered in Supplementary Table S4. We encourage the
reader to see the publicly available repository for detailed information on
the specific features considered. Missing data were imputed by a k-NN
algorithm with 5 neighbors. Samples were standardized to avoid scale biases.
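
The preprocessing steps described above (dropping features with more than 25% missing values, k-NN imputation with 5 neighbors, and standardization) can be sketched as follows. The function name and the toy data frame are hypothetical, and the sketch fits the imputer and scaler on all rows for brevity; in a train-test protocol they would normally be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

def preprocess(df, max_missing=0.25, n_neighbors=5):
    """Drop sparse features, impute the rest with k-NN, then standardize."""
    # Keep only features whose fraction of missing values is within the limit.
    keep = df.columns[df.isna().mean() <= max_missing]
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(df[keep])
    scaled = StandardScaler().fit_transform(imputed)
    return pd.DataFrame(scaled, columns=keep, index=df.index)

# Toy data: feature "b" has 50% missing values and is therefore removed.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    "c": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0],
})
clean = preprocess(toy)
```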

Number  Feature label    Number  Feature label    Number  Feature label
0       age              13      proto            26      rldv5e
1       sex              14      thaldur          27      lvx1
2       cp               15      thaltime         28      lvx2
3       trestbps         16      met              29      lvx3
4       htn              17      thalach          30      lvx4
5       chol             18      thalrest         31      lvf
6       fbs              19      tpeakbp
7       restecg          20      tpeakbpd
8       dig              21      dummy
9       prop             22      trestbpd
10      nitr             23      exang
11      pro              24      xhypo
12      diuretic         25      oldpeak

Table 4: Features in coronary artery disease dataset. Features considered
after data pre-processing.

Supplementary Figures 5a and 5b support the results presented in Section
5 of the main document. Results presented in Figure 4 of the main document
showed that the features selected by the β-based stopping criterion in more
than 80% of the cases were: cp, which stands for chest pain type; thaltime,
which stands for the time when ST measure depression was noted in the
ECG; met, which is true or false depending on whether a threshold on
Metabolic Equivalent (MET) is achieved during exercise testing; trestbpd,
which stands for the resting blood pressure; xhypo, which stands for
exercise-induced hypotension; and lvx3, which was explained neither in the
dataset documentation nor in the original work, but which we decided to keep
in the dataset because we postulate that it stands for some type of pacing
interval with the left ventricle, which is relevant medical information.

A.4 Dermatology dataset


The dermatology dataset poses a classification problem with 6 classes
[17]. These classes correspond to differential diagnoses of
erythemato-squamous disease: the Psoriasis class has 112 samples (30.60%),
the Seborrheic Dermatitis class 61 (16.67%), the Lichen Planus class 72
(19.67%), the Pityriasis Rosea class 49 (13.39%), the Chronic Dermatitis
class 52 (14.21%), and the Pityriasis Rubra Pilaris class 20 (5.46%).
Firstly, patients were evaluated through 12 clinical features. Subsequently,
22 histopathological features were annotated from a biopsy. These features
are listed in Table 5. The original work [15] warned that, while some
samples exhibited the typical histopathological features of the disease,
others did not. The original work reported an accuracy score of 99.2%, but
no other performance metric was provided. In this work, both RFE and CRFE
consistently maintained accuracy, precision, and recall scores above 95%
until the recursive elimination of features began to deteriorate the
results. The optimal subset of features appears to be around 8 to 12
elements for RFE, see supplementary Figure 6a. The optimal subset size for
CRFE, according to the results shown in supplementary Figure 6b, seems to
range between 15 and 20 elements. Figures 6b-(b), 6b-(c) and 6b-(d) showed
that CRFE was able to provide the necessary information to narrow down the
intervals of relevant features by class. On the other hand, the results
presented by RFE in Figures 6a-(b), 6a-(c) and 6a-(d) indicated that the
scores associated with each class start getting worse for the same subset
sizes. However, the scores provided by CRFE identified sets of features
responsible for the deterioration of the class-performance scores.

The original work proposed the use of a genetic algorithm to determine
the relevance of each feature. The most relevant features found by the
original work were the koebner and inflammatory mononuclear infiltrate
features, while the features acanthosis, follicular horn plug, munro
microabscess, and age were found to be the least relevant. Results presented
in Figure 4 of the main document partially agree with this. The least
relevant features proposed by the original paper were also selected as the
least relevant by CRFE. Koebner was also selected as one of the most
relevant. However, inflammatory mononuclear infiltrate was almost
irrelevant for CRFE.

Number  Feature label                      Number  Feature label
0       Erythema                           18      Parakeratosis
1       Scaling                            19      Clubbing of the rete ridges
2       Definite borders                   20      Elongation of the rete ridges
3       Itching                            21      Thinning of the suprapapillary epidermis
4       Koebner phenomenon                 22      Spongiform pustule
5       Polygonal papules                  23      Munro microabscess
6       Follicular papules                 24      Focal hypergranulosis
7       Oral mucosal involvement           25      Disappearance of the granular layer
8       Knee and elbow involvement         26      Vacuolisation and damage of basal layer
9       Scalp involvement                  27      Spongiosis
10      Family history                     28      Saw-tooth appearance of retes
11      Melanin incontinence               29      Follicular horn plug
12      Eosinophils in the infiltrate      30      Perifollicular parakeratosis
13      PNL infiltrate                     31      Inflammatory mononuclear infiltrate
14      Fibrosis of the papillary dermis   32      Band-like infiltrate
15      Exocytosis                         33      Age
16      Acanthosis
17      Hyperkeratosis

Table 5: Features in Dermatology dataset. Features considered after data
pre-processing. The features labeled from 0 to 10, and feature 33, are
clinical, whereas those from 11 to 32 are histopathological.

A.5 Myocardial infarction complications
The motivation behind the myocardial infarction complications dataset [12]
was the continuous spread of the disease, with special importance in the
urban population of developed countries, as well as the differences between
patients in the course of the disease. The feature with the patient ID was
removed because it did not contain relevant information, other features were
removed due to missing data (features with more than 25% of missing values),
and features numbered from 113 to 124 were removed because they were
provided as potential targets. The names of the 104 remaining features are
shown and numbered in Table 6. We refer the reader to the publicly available
repository for specific information about the features. Missing data were
imputed by a k-NN algorithm with 5 neighbours. The objective is to predict
the feature called Lethal outcome (LET IS), distributed as: class alive
(84.06%), class cardiogenic shock (6.47%), class pulmonary edema (1.06%),
class myocardial rupture (3.18%), class progress of congestive heart failure
(1.35%), class thromboembolism (0.71%), class asystole (1.59%) and class
ventricular fibrillation (1.59%).

Number  Feature label    Number  Feature label    Number  Feature label
0       AGE              35      K SH POST        70      n p ecg p 12
1       SEX              36      MP TP POST       71      fibr ter 01
2       INF ANAM         37      SVT POST         72      fibr ter 02
3       STENOK AN        38      GT POST          73      fibr ter 03
4       FK STENOK        39      FIB G POST       74      fibr ter 05
5       IBS POST         40      ant im           75      fibr ter 06
6       GB               41      lat im           76      fibr ter 07
7       SIM GIPERT       42      inf im           77      fibr ter 08
8       DLIT AG          43      post im          78      GIPO K
9       ZSN A            44      IM PG P          79      K BLOOD
10      nr11             45      ritm ecg p 01    80      GIPER Na
11      nr01             46      ritm ecg p 02    81      Na BLOOD
12      nr02             47      ritm ecg p 04    82      ALT BLOOD
13      nr03             48      ritm ecg p 06    83      AST BLOOD
14      nr04             49      ritm ecg p 07    84      L BLOOD
15      nr07             50      ritm ecg p 08    85      ROE
16      nr08             51      n r ecg p 01     86      TIME B S
17      np01             52      n r ecg p 02     87      R AB 1 n
18      np04             53      n r ecg p 03     88      R AB 2 n
19      np05             54      n r ecg p 04     89      R AB 3 n
20      np07             55      n r ecg p 05     90      NITR S
21      np08             56      n r ecg p 06     91      NA R 1 n
22      np09             57      n r ecg p 08     92      NA R 2 n
23      np10             58      n r ecg p 09     93      NA R 3 n
24      endocr 01        59      n r ecg p 10     94      NOT NA 1 n
25      endocr 02        60      n p ecg p 01     95      NOT NA 2 n
26      endocr 03        61      n p ecg p 03     96      NOT NA 3 n
27      zab leg 01       62      n p ecg p 04     97      LID S n
28      zab leg 02       63      n p ecg p 05     98      B BLOK S n
29      zab leg 03       64      n p ecg p 06     99      ANT CA S n
30      zab leg 04       65      n p ecg p 07     100     GEPAR S n
31      zab leg 06       66      n p ecg p 08     101     ASP S n
32      S AD ORIT        67      n p ecg p 09     102     TIKL S n
33      D AD ORIT        68      n p ecg p 10     103     TRENT S n
34      O L POST         69      n p ecg p 11

Table 6: Myocardial infarction dataset. Features considered after data
pre-processing.

Figures 7a and 7b show that the subsets selected by RFE preserved fea-
tures relevant for distinguishing between all classes. CRFE quickly attempts
to select relevant features for classes alive and cardiogenic shock. This sug-
gests that a heavy imbalance in classes may affect the performance of the
proposed method.

Results presented in Figure 4 of the main paper showed that the most
relevant features, i.e., those present in at least 80% of the selected
subsets, were: sex; S AD ORIT (the systolic blood pressure); ritm ecg p 01,
which indicates whether the ECG rhythm at the time of admission to hospital
is sinus or not; TIME B S, which represents the time elapsed from the
beginning of the attack of CHD to the hospital; ANT CA S n, which stands for
the use of calcium channel blockers in the ICU; and finally ASP S n, which
indicates whether acetylsalicylic acid was used in the ICU. According to
supplementary Figures 7a and 7b, these features were postulated to be the
most relevant to predict whether or not the outcome is fatal.

A.6 Similarity study


We show the plots supporting the similarity study between the subsets of
features selected by RFE and CRFE. Three consistency indexes were used
for the study. Firstly, we define the standard Jaccard index as

$$I_J = \frac{|A \cap B|}{|A \cup B|}, \tag{16}$$

where A and B are two sets such that |A| = |B|. Through this index, we
compared the two subsets of features selected by both methods that (i) were
selected using the same random seed and (ii) had the same size. The average
and the standard deviation are provided in Supplementary Figure 8a. The
novel index defined in Equation (15) of the main document was suitable to
compare subsets of features produced by different feature selection methods
having the same cardinality. The new index did not need to be averaged
through the multiple runs because it takes into account all the subsets
generated at the same time. This was possible because of the n/2 + 1
condition, which is the minimum number of common features required to ensure
that a feature is present in at least two subsets generated from two
different feature selection methods. See Supplementary Figure 8a. The last
index to study similarity is the Kuncheva index [18]. This index is also
based on the cardinality of the intersection between sets of elements, but
introduces a correction for agreements by chance. It is defined as:

$$I_K = \frac{rs - \kappa^2}{\kappa(s - \kappa)}, \tag{17}$$

where r = |A ∩ B|, |A| = |B| = κ and 0 ≤ κ ≤ |X| = s. The index was used
in the same way as Jaccard in Figure 8a. The averaged results and standard
deviations are shown in Figure 8b.
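
Equations (16) and (17) translate into a few lines of Python. The function names are hypothetical, and the sketch additionally requires 0 < κ < s so that the denominator of the Kuncheva index is non-zero.

```python
def jaccard_index(a, b):
    """Jaccard index of two non-empty feature subsets, Equation (16)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kuncheva_index(a, b, s):
    """Kuncheva index of two equal-sized subsets drawn from s features, Equation (17)."""
    a, b = set(a), set(b)
    kappa = len(a)
    if len(b) != kappa:
        raise ValueError("both subsets must have the same cardinality")
    if not 0 < kappa < s:
        raise ValueError("the index is undefined for kappa = 0 or kappa = s")
    r = len(a & b)  # number of features shared by both subsets
    return (r * s - kappa**2) / (kappa * (s - kappa))

# Two subsets of 3 features out of s = 10, sharing 2 features.
subset_rfe, subset_crfe = {0, 1, 2}, {1, 2, 3}
i_j = jaccard_index(subset_rfe, subset_crfe)
i_k = kuncheva_index(subset_rfe, subset_crfe, s=10)
```

Here the Jaccard index is 2/4 = 0.5, while the Kuncheva index is (2·10 − 9)/(3·7) = 11/21, slightly above 0.5 because two random subsets of 3 out of 10 features are expected to share fewer than 2 features by chance.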

A.7 Stopping criteria


Figure 9 shows the distribution of subset sizes when using the β-based
stopping criterion in 50 independent runs. Figure 9-(a) shows that the most
frequent subsets were made of a single feature. Figure 9-(b) shows that the
subset sizes produced for the coronary artery disease dataset range from 8
to 10 features, corresponding with the desirable size based on the analysis
conducted in the main document. In Figure 9-(c), more counts were measured
for sizes between 13 and 14 features, as well as between 17 and 20.
According to the main document, sizes ranging from 17 to 25 yield superior
results. Lastly, Figure 9-(d) exhibits a higher presence of sizes of 7 and 9
features.

References
[1] Javier Andreu-Perez, Carmen C. Y. Poon, Robert D. Merrifield,
Stephen T. C. Wong, and Guang-Zhong Yang. Big Data for Health.
IEEE J. Biomed. Health Inform., 19(4):1193–1208, 2015.

[2] Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction:
A gentle introduction. Found. Trends in Mach. Learn., 16(4):494–591,
2023.

[3] Ruben Armañanzas et al. Peakbin selection in mass spectrometry data
using a consensus approach with estimation of distribution algorithms.
IEEE/ACM Trans. Comput. Biol. and Bioinf., 8(3):760–774, May 2011.

[4] Vineeth Balasubramanian, Shen-Shyang Ho, and Vladimir Vovk.
Conformal Prediction for Reliable Machine Learning: Theory, Adaptations
and Applications. Morgan Kaufmann, San Francisco, CA, USA, 2014.

[5] Tony Bellotti, Zhiyuan Luo, and Alex Gammerman. Strangeness Min-
imisation Feature Selection with Confidence Machines. In Emilio Cor-
chado, Hujun Yin, Vicente Botti, and Colin Fyfe, editors, Intelligent
Data Engineering and Automated Learning IDEAL 2006, Lecture Notes
in Computer Science, pages 978–985, Berlin, Heidelberg, 2006. Springer.

[6] Visar Berisha et al. Digital medicine and the curse of dimensionality.
NPJ Digit. Med., 4(1):153, October 2021.

[7] Concha Bielza and Pedro Larrañaga. Data-Driven Computational
Neuroscience: Machine Learning and Statistical Models. Cambridge Univ.
Press, Cambridge, U.K., 2020.

[8] Girish Chandrashekar and Ferat Sahin. A survey on feature selection
methods. Comput. Elect. Eng., 40(1):16–28, 2014.

[9] Giovanni Cherubin, Adrian Baldwin, and Jonathan Griffin.
Exchangeability martingales for selecting features in anomaly detection. In
Alex Gammerman, Vladimir Vovk, Zhiyuan Luo, Evgueni Smirnov, and Ralf
Peeters, editors, Proceedings of the Seventh Workshop on Conformal
and Probabilistic Prediction and Applications, volume 91, pages 157–170.
Proceedings of Machine Learning Research, 11–13 Jun 2018.

[10] Robert Detrano et al. International application of a new probability
algorithm for the diagnosis of coronary artery disease. Amer. J. of
Cardiol., 64(5):304–310, 1989.

[11] Robert Detrano et al. International application of a new probability
algorithm for the diagnosis of coronary artery disease. The American
Journal of Cardiology, 64(5):304–310, 1989.

[12] S.E. Golovenkin et al. Myocardial infarction complications database,
2020.

[13] Isabelle Guyon, Masoud Nikravesh, Steve Gunn, Lotfi A. Zadeh, and
Janusz Kacprzyk, editors. Feature Extraction: Foundations and Appli-
cations. Springer, Berlin, Heidelberg, 2006.

[14] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vap-
nik. Gene Selection for Cancer Classification using Support Vector
Machines. Mach. Learn., 46(1):389–422, 2002.

[15] H. Altay Güvenir, Gülşen Demiröz, and Nilsel İlter. Learning
differential diagnosis of erythemato-squamous diseases using voting feature
intervals. Artif. Intell. Med., 13(3):147–165, 1998.

[16] A. Jain and D. Zongker. Feature selection: evaluation, application, and
small sample performance. IEEE Trans. Pattern Anal. Mach. Intell.,
19(2):153–158, February 1997.

[17] Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. UCI machine
learning repository.

[18] Ludmila I. Kuncheva. A stability index for feature selection. In
Proceedings of the 25th IASTED International Multi-Conference: Artificial
Intelligence and Applications, AIAP'07, pages 390–395, USA, 2007. ACTA
Press.

[19] Ilia Nouretdinov et al. Machine learning classification with confidence:
application of transductive conformal predictors to MRI-based diagnostic
and prognostic markers in depression. Neuroimage, 56(2):809–813, 2011.

[20] Trishan Panch, Heather Mattie, and Leo Anthony Celi. The inconve-
nient truth about AI in healthcare. NPJ Digit. Med., 2(1):1–3, Aug
2019.

[21] Harris Papadopoulos. Inductive conformal prediction: Theory and
application to neural networks. In Paula Fritzsche, editor, Tools in
Artificial Intelligence, chapter 18, pages 315–318. IntechOpen, Rijeka, 2008.

[22] F. Pedregosa et al. Scikit-learn: Machine learning in Python. J. Mach.
Learn. Res., 12:2825–2830, 2011.

[23] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction.
J. Mach. Learn. Res., 9(12):371–421, 2008.

[24] Paolo Toccaceli. Introduction to conformal predictors. Pattern
Recognit., 124:108507, 2022.

[25] Vladimir Vovk, Valentina Fedorova, Ilia Nouretdinov, and Alexander
Gammerman. Criteria of efficiency for conformal prediction. In Alexander
Gammerman, Zhiyuan Luo, Jesús Vega, and Vladimir Vovk, editors,
Conformal and Probabilistic Prediction with Applications, pages
23–39, Cham, 2016. Springer.

[26] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic
Learning in a Random World. Springer, Cham, 2nd edition, 2022.

[27] Alex Wright. Big data meets big science. Comm. ACM, 57(7):13–15,
Jul. 2014.

[28] Chen Xu and Yao Xie. Conformal prediction for time series. IEEE
Trans. Pattern Anal. Mach. Intell., pages 1–22, 2023.

[29] Meng Yang, Ilia Nouretdinov, Zhiyuan Luo, and Alex Gammerman.
Feature selection by conformal predictor. In Lazaros Iliadis, Ilias Ma-
glogiannis, and Harris Papadopoulos, editors, Artificial Intelligence Ap-
plications and Innovations, pages 439–448, Berlin, Heidelberg, 2011.
Springer Berlin Heidelberg.

[30] Shuang Zhou, Evgueni Smirnov, Gijs Schoenmakers, Ralf Peeters, and
Tao Jiang. Conformal feature-selection wrappers for instance transfer.
In Alex Gammerman, Vladimir Vovk, Zhiyuan Luo, Evgueni Smirnov,
and Ralf Peeters, editors, Proceedings of the Seventh Workshop on Con-
formal and Probabilistic Prediction and Applications, volume 91, pages
96–113. Proceedings of Machine Learning Research, 11–13 Jun 2018.

[31] Xin Zhou and David P. Tuck. MSVM-RFE: extensions of SVM-RFE for
multiclass gene selection on DNA microarray data. Bioinf., 23(9):1106–
1114, 2007.

[Figure 1 plots not reproduced. Panels: (a) Synthetic dataset, (b) Coronary
artery dataset, (c) Dermatology dataset, (d) Myocardial infarction dataset.
Each panel contains two plots, (a) RFE + conformal prediction and (b) CRFE +
conformal prediction. x-axis: Num. of Features; y-axis: Score; curves:
coverage, inefficiency, certainty, uncertainty, mistrust.]

Figure 1: Set prediction performance metrics. The results are averaged
over 20 train-test iterations for each feature selection method. Standard
deviation is provided as upper and lower intervals. Plots a-(a), b-(a),
c-(a), and d-(a) show results by the RFE method, whereas Plots a-(b), b-(b),
c-(b), and d-(b) present results by CRFE.

[Figure 2 plots not reproduced. Panels: (a) RFE, (b) CRFE. Each panel
contains four plots: (a) overall performance metrics, (b) precision
stratified by classes, (c) recall stratified by classes, (d) F1 stratified
by classes. x-axis: Num. of Features; y-axis: Score; curves in plot (a):
accuracy, precision, recall; curves in plots (b), (c), (d): class 0, class 1,
class 2, class 3.]

Figure 2: Single-prediction performance metrics for the synthetic dataset.
Standard deviations are provided as upper and lower intervals. Plots a-(a)
and b-(a) show accuracy, precision, and recall performance metrics achieved
by subsets of features selected by RFE and CRFE, respectively. Plots
a-(b),(c),(d) and b-(b),(c),(d) show precision, recall and F1 score
performance by class achieved by RFE and CRFE, respectively.

[Figure 3 plots not reproduced. Panels: (a) Synthetic dataset, (b) Coronary
artery dataset, (c) Dermatology dataset, (d) Myocardial infarction dataset.
Each panel contains two plots, (a) RFE consistency and (b) CRFE consistency.
x-axis: Num. of Features; y-axis: Score; curves: I_W, I_J.]

Figure 3: Consistency analysis across the 20 iterations. Plots a-(a), b-(a),
c-(a) and d-(a) show consistency results for RFE, whereas Plots a-(b), b-(b),
c-(b) and d-(b) present CRFE consistency results. The Jaccard index I_J and
the newly proposed consistency index I_W are shown.

[Figure 4 plots not reproduced. Panels: (a) Synthetic dataset, (b) Coronary
artery disease dataset, (c) Dermatology dataset, (d) Myocardial dataset.
x-axis: Feature number; y-axis: Frequency (0 to 50).]

Figure 4: The frequency with which each feature was included in the optimal
subset of features selected by CRFE using the β-based stopping criterion.
Note the maximum corresponds to 50 independent runs. The percentages
of features always discarded (not shown) were 0%, 28.2%, 11.7%, and 13.5%
of the total sets for synthetic, coronary artery disease, dermatology, and
myocardial datasets, respectively.

[Figure 5 plots not reproduced. Panels: (a) RFE, (b) CRFE. Each panel
contains four plots: (a) overall performance metrics, (b) precision
stratified by classes, (c) recall stratified by classes, (d) F1 stratified
by classes. x-axis: Num. of Features; y-axis: Score; curves in plot (a):
accuracy, precision, recall; curves in plots (b), (c), (d): one per class.]

Figure 5: Single-prediction performance metrics for the coronary artery
disease dataset. Standard deviations are provided as upper and lower
intervals. Plots a-(a) and b-(a) show accuracy, precision, and recall
performance metrics achieved by subsets of features selected by RFE and
CRFE, respectively. Plots a-(b),(c),(d) and b-(b),(c),(d) show precision,
recall, and F1 score performance by class achieved by RFE and CRFE,
respectively.

[Figure 6, panels (a) RFE and (b) CRFE. Each panel contains four plots: (a) overall performance metrics (accuracy, precision, recall); (b) precision stratified by classes; (c) recall stratified by classes; (d) F1 stratified by classes (classes 0 to 5). Axes: Score versus Num. of Features.]

Figure 6: Single-prediction performance metrics for the dermatology dataset. Standard deviations are provided as upper and lower intervals. Plots a-(a) and b-(a) show the accuracy, precision, and recall achieved by the subsets of features selected by RFE and CRFE, respectively. Plots a-(b),(c),(d) and b-(b),(c),(d) show the per-class precision, recall, and F1 score achieved by RFE and CRFE, respectively.

[Figure 7, panels (a) RFE and (b) CRFE. Each panel contains four plots: (a) overall performance metrics (accuracy, precision, recall); (b) precision stratified by classes; (c) recall stratified by classes; (d) F1 stratified by classes (classes 0 to 7). Axes: Score versus Num. of Features.]

Figure 7: Single-prediction performance metrics for the myocardial infarction complications dataset. Standard deviations are provided as upper and lower intervals. Plots a-(a) and b-(a) show the accuracy, precision, and recall achieved by the subsets of features selected by RFE and CRFE, respectively. Plots a-(b),(c),(d) and b-(b),(c),(d) show the per-class precision, recall, and F1 score achieved by RFE and CRFE, respectively.

[Figure 8, panels (a) Jaccard and weighted Jaccard indexes and (b) Kuncheva index. Each panel contains four plots, one per dataset: (a) synthetic; (b) coronary artery; (c) dermatology; (d) myocardial infarction. Horizontal axis: Num. of Features. Vertical axis in panel (a): Score.]

Figure 8: Study of the common features selected by both feature selection methods for each dataset. The subsets of features compared, using the Jaccard and new indexes in Plot a and the Kuncheva index in Plot b, were those generated with the same random numbers and the same size, but by different feature selection methods.
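As an illustration of how two equally sized feature subsets can be compared, the following is a minimal sketch that assumes the standard textbook definitions of the Jaccard index and of Kuncheva's consistency index. It does not reproduce the weighted (new) index defined in the main document, and the function names and example subsets are hypothetical, not taken from the authors' code.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two feature subsets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty subsets are identical.
        return 1.0
    return len(a & b) / len(union)


def kuncheva_index(a, b, n_features):
    """Kuncheva's consistency index for two subsets of equal size k.

    (r * n - k^2) / (k * (n - k)), where r = |A ∩ B| and n is the
    total number of features. Defined only for 0 < k < n.
    """
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both subsets must have the same size")
    if not 0 < k < n_features:
        raise ValueError("subset size must satisfy 0 < k < n_features")
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))


# Hypothetical subsets of size 4 chosen by two methods out of 10 features.
rfe_subset = [0, 1, 2, 3]
crfe_subset = [2, 3, 4, 5]

jaccard = jaccard_index(rfe_subset, crfe_subset)
kuncheva = kuncheva_index(rfe_subset, crfe_subset, n_features=10)
print(f"Jaccard: {jaccard:.3f}, Kuncheva: {kuncheva:.3f}")
```

Unlike the Jaccard index, Kuncheva's index corrects for the overlap expected by chance, so it can take negative values when two subsets share fewer features than random selection would produce.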

[Figure 9: four histograms, one per dataset: (a) synthetic dataset; (b) coronary artery disease dataset; (c) dermatology dataset; (d) myocardial dataset. Axes: Frequency versus Size.]

Figure 9: Distribution of the sizes of the subsets of features selected by the β-based stopping criterion in Subsection IV-E of the main document.
