1 Introduction

Due to their potential to enhance automated decision-making, machine learning (ML) algorithms have increasingly been adopted in many areas of today’s society, with important economic and societal implications. These algorithms, which usually operate as black boxes, have demonstrated the potential to achieve expert-level performance in a wide range of tasks. From medical diagnosis (Topol, 2019) to financial markets (Henrique, Sobreiro, & Kimura, 2019) or candidate screening (Liem et al., 2018), the tendency to support decisions with data-driven algorithms has increased significantly in recent years. However, as the integration of ML into safety-critical domains has become more widespread, so has the awareness of the ethical concerns about its misuse (Lo Piano, 2020; Eubanks, 2018). As a matter of fact, despite the common misconception that ML models may be free from human biases, recent findings have demonstrated that algorithms can inherit and even amplify patterns of discrimination based on ethnicity (Buolamwini & Gebru, 2018), gender (Bolukbasi, Chang, Zou, Saligrama, & Kalai, 2016), or age (Díaz, Johnson, Lazar, Piper, & Gergle, 2018). This was the case for algorithms that reproduced racial bias when assessing healthcare patient needs (Obermeyer, Powers, Vogeli, & Mullainathan, 2019) or that yielded disproportionate false positive rates when identifying black defendants in criminal recidivism prediction systems (Angwin, Larson, Mattu, & Kirchner, 2022). It becomes evident that the harmful effects derived from the deployment of biased ML solutions constitute a major risk for certain populations. For this reason, there has been growing interest among ML practitioners in developing technical solutions to address the challenge of discriminatory predictions, leading to the advent of a new field coined algorithmic fairness (see Pessach & Shmueli, 2022; Berk et al., 2023 for recent reviews on the topic).

Simultaneously, the integration of ML systems into high-risk settings also faces other substantial barriers. For instance, one of the main drawbacks of traditional ML algorithms is that they provide bare predictions with no guarantees, rather than reliable levels of confidence for individual assessments. Failing to quantify uncertainty becomes a particularly relevant issue in scenarios where the use of ML may involve consequential decisions about individuals, such as healthcare (Kompa, Snoek, & Beam, 2021; Banerji, Chakraborti, Harbron, & MacArthur, 2023) or autonomous driving (Shafaei, Kugele, Osman, & Knoll, 2018). In these contexts, conveying the predictive uncertainty of an ML model is essential and should be a key requirement for conferring reliability on the adoption of this technology.

Over recent years, significant progress in reliable ML has been achieved through the use of conformal prediction (Vovk, Gammerman, & Shafer, 2005; Shafer & Vovk, 2008; Angelopoulos & Bates, 2023). This framework provides distribution-free statistical methods to construct rigorous prediction sets that wrap any pre-trained ML model and are guaranteed to cover the ground truth label with a user-specified probability. It provides a natural way to evaluate the reliability of a particular prediction by means of a range of likely outcomes, thus informing decision-makers and end-users about the limitations of the ML model.

Despite being promising for addressing the problem of uncertainty quantification, a major limitation of conformal prediction sets is that their coverage guarantees are exclusively marginal. This may not be a sufficient requirement, potentially leading to untrustworthy prediction sets in settings where fairness is a concern. As a motivational example, consider a clinical predictive system making a cardiac disease diagnosis based on an MRI assessment. Suppose that the system predicts sets containing the actual diagnosis with a probability of at least 90% on average across the entire patient population. However, a deeper performance evaluation reveals that the algorithm exceeds the target coverage for the majority group, corresponding to Caucasian patients, at the expense of failing to predict sets containing the real diagnosis for African American patients. Here, the practical utility of the predictive system becomes limited, failing to provide the same level of confidence for all the groups.

In the present study, we address the problem of algorithmic fairness in the context of uncertainty quantification by means of conformal prediction sets. We argue that the uncertainty quantification capabilities of a conformal predictor should guarantee coverage of the ground truth label with the same level of confidence regardless of an individual’s sensitive attribute, such as race, gender, or socio-economic status. In particular, we consider a conformal predictor fair if it achieves ‘equalized coverage’ (Romano, Barber, Sabatti, & Candès, 2020), i.e., produces prediction sets with controlled coverage for all the demographic groups given by a sensitive attribute, thereby ensuring equitable decisions. In principle, achieving equalized coverage guarantees is possible by running the calibration process in each of the groups separately, which has been formalized through Mondrian conformal prediction (Vovk, Gammerman, & Shafer, 2005). However, due to the data splitting step, this procedure may fail when sample size imbalance occurs, resulting in limited calibration samples for specific groups and, consequently, very large uninformative prediction sets.

The trade-off between conditional validity and the efficiency of a conformal predictor is the focus of the present study. We argue that, since marginal validity is automatically granted, we should find the best possible underlying ML model with which to induce an efficient conformal predictor while retaining coverage guarantees for sensitive groups in need of fair outcomes. Here, we propose to tackle this challenge through a multi-objective optimization scheme. Specifically, we use evolutionary learning to tune the hyperparameter configuration of an ML classifier and calibrate the resulting model using the conformal prediction procedure. As a result, our meta-learning algorithm produces a repertoire of Pareto optimal conformal predictors, bridging the gap between efficiency and equalized coverage guarantees. The integration of multi-objective evolutionary learning in the development of confidence predictors with fairness guarantees presents a novel methodology with clear advantages. Firstly, it produces, in a single execution, a repertoire of different modeling alternatives from which to choose depending on the policy to be adopted by stakeholders or end-users. It also provides a deep inspection of how the conflicting criteria are related for a specific problem. To the best of our knowledge, multi-objective optimization methods have not been previously combined with conformal prediction to achieve both efficient prediction sets and conditional coverage guarantees. We demonstrate the potential of such a combination when seeking fair conformal predictors in four different real-world problems.

The paper is organized as follows. Section 2 introduces the background regarding algorithmic fairness measures, conformal prediction and general multi-objective optimization. In Sect. 3, we present the meta-algorithm for optimizing both fairness and efficiency, producing a Pareto set of conformal predictors. Section 4 presents the experimental design setup. Results are discussed in Sect. 5, including a case study demonstrating how our methodology can be used in practice. Lastly, Sect. 6 includes conclusions and highlights potential future work.

1.1 Related work

Conditional coverage. Within the conformal prediction literature, conditional validity has been formalized through different notions depending on the type of information the coverage is conditioned on (Vovk, 2013). For example, in a common classification task, where the response is discrete, we say that a conformal predictor satisfies class-conditional coverage if the validity constraint holds for every class (Löfström, Boström, Linusson, & Johansson, 2015):

$$\begin{aligned} \mathbb {P} \, \bigl (y_{\text {test}} \in \Gamma (x_{\text {test}}) \ \vert \ y_{\text {test}} = y\bigr ) \ge 1 - \alpha , \ \ \text {for all} \ y \in \mathcal {Y} \end{aligned}$$
(1)

Alternatively, we may also be interested in coverage guarantees for a particular new test sample with specific feature values. In this case, a conformal predictor satisfies object-conditional coverage (Vovk, 2013):

$$\begin{aligned} \mathbb {P} \, \bigl (y_{\text {test}} \in \Gamma (x_{\text {test}}) \ \vert \ x_{\text {test}} = x\bigr ) \ge 1 - \alpha , \ \ \text {for all} \ x \in \mathcal {X} \end{aligned}$$
(2)

Unfortunately, Vovk (2013), Lei and Wasserman (2014), and Foygel Barber et al. (2021) showed that achieving object-conditional coverage is impossible without making distributional assumptions or producing prediction sets of infinite size. A related, but weaker, notion of object-conditional coverage can be framed by requiring coverage guarantees with respect to some predefined disjoint groups of the feature space. In settings where fairness is a concern, these groups can be defined on the basis of certain legally protected sensitive attributes to ensure no discrimination (e.g., age, gender, race, or disability). In these cases, if a conformal predictor retains validity for every demographic group \(a \in \mathcal {A}\), it fulfills equalized coverage (Romano, Barber, Sabatti, & Candès, 2020):

$$\begin{aligned} \mathbb {P} \, \bigl (y_{\text {test}} \in \Gamma (x_{\text {test}}) \ \vert \ a_{\text {test}} = a\bigr ) \ge 1 - \alpha , \ \ \text {for all} \ a \in \mathcal {A} \end{aligned}$$
(3)

Equalized coverage has been introduced as a seminal uncertainty-aware fairness notion and has been pursued by Lu et al. (2022) to ensure fair prediction sets in two clinical use cases. Approximating the conditional coverage property in its different notions has been one of the main areas of research around the conformal prediction framework in recent years. This has been addressed by designing novel non-conformity measures (Romano, Patterson, & Candès, 2019; Romano, Sesia, & Candès, 2020; Angelopoulos et al., 2022) or by adjusting the calibration procedure itself (Chernozhukov, Wüthrich, & Zhu, 2021; Guan, 2022; Bastani et al., 2022; Jung, Noarov, Ramalingam, & Roth, 2023; Ding, Angelopoulos, Bates, Jordan, & Tibshirani, 2023). Our work pursues an alternative direction by exploring how the hyperparameter configuration of the underlying ML model can impact the equalized coverage of the resulting prediction sets.
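To make these notions concrete, the empirical counterparts of Eqs. 1-3 can be estimated from held-out data as in the following minimal sketch (array names and toy values are hypothetical).

```python
import numpy as np

def empirical_coverage(y, pred_sets, condition_on=None):
    """Empirical coverage of set-valued predictions.

    y            : (n,) array of ground truth labels.
    pred_sets    : length-n list of sets, the prediction set of each sample.
    condition_on : optional (n,) array to condition on (class labels for
                   Eq. 1, sensitive-attribute values for Eq. 3).
    """
    covered = np.array([yi in s for yi, s in zip(y, pred_sets)], dtype=float)
    if condition_on is None:
        return covered.mean()                      # marginal coverage
    return {v: covered[condition_on == v].mean()   # conditional coverage
            for v in np.unique(condition_on)}

# Hypothetical toy example with two classes and a binary sensitive attribute.
y = np.array([0, 1, 1, 0, 1, 0])
pred_sets = [{0}, {0, 1}, {1}, {1}, {0}, {0, 1}]
a = np.array(["priv", "priv", "priv", "unpriv", "unpriv", "unpriv"])
print(empirical_coverage(y, pred_sets))        # marginal
print(empirical_coverage(y, pred_sets, y))     # class-conditional (Eq. 1)
print(empirical_coverage(y, pred_sets, a))     # equalized coverage (Eq. 3)
```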

Uncertainty-aware loss functions. Recent studies in the conformal prediction literature have proposed including uncertainty-aware loss functions in the learning of efficient prediction sets. Colombo and Vovk (2020) introduced a loss function to train both point and conformal predictors based on the observed fuzziness of the p-values, with favorable results in terms of efficiency compared to traditional learning. Alternatively, Stutz et al. (2022) proposed an end-to-end conformal training through the optimization of a set size loss function designed to minimize the expected inefficiency. Similarly, Einbinder et al. (2022) developed a procedure based on the optimization of a combined loss function formed by a traditional accuracy-based component and an uncertainty-aware component to simultaneously seek high performance and mitigate overconfidence, respectively. In the regression setting, Chen et al. (2021) framed the prediction interval learning task as an empirical constrained optimization problem. Beyond the conformal framework, and thereby losing the finite-sample guarantees, many other works have also sought to reduce predictive uncertainty with proper loss functions in different applications (see, for example, Tabarisaadi et al., 2024; Moon et al., 2020; Dawood et al., 2023).

2 Background

To provide a contextual background with regard to our methodology, we first present how algorithmic fairness has been formalized through different statistical definitions. Then, we briefly introduce the (split) conformal prediction procedure and its Mondrian version to automatically achieve conditional validity. Finally, we describe the multi-objective optimization problem.

2.1 Algorithmic fairness measures

Defining fairness from an algorithmic perspective is a complicated task since it does not arise as a technical property of ML systems, but is a value-driven concept with roots in ethics and law. In fact, a large body of research has focused on how to concisely formalize different notions of fairness into quantitative definitions (Verma & Rubin, 2018). In general, fairness measures have been categorized into individual and group fairness. Individual fairness requires similar individuals to receive similar treatment from the model (i.e., to obtain similar predictions) (Dwork, Hardt, Pitassi, Reingold, & Zemel, 2012). Group fairness (also known as statistical fairness) posits that statistical measures of treatment or error should be similar at a group level, where the groups are defined by sensitive attributes protected by anti-discrimination laws, including gender, race, and age. Group fairness has been the most popular approach in the literature, with several measures aiming to capture what it means to be fair. These measures can be divided into the following three broad categories.

  • Demographic parity (Kamiran & Calders, 2009; Dwork, Hardt, Pitassi, Reingold, & Zemel, 2012). Inspired by the US legal literature, this criterion, also known as statistical parity, requires the prediction to be independent of the sensitive attribute. In other words, it demands that the outcome rates should be similar across the considered groups. A disadvantage of this criterion is that a perfectly accurate model can be considered unfair when the sensitive attribute is related to the outcome feature.

  • Equalized odds (Hardt, Price, Price, & Srebro, 2016). Equalized odds was proposed to overcome the limitations of demographic parity, requiring parity between groups in terms of the true positive rate (TPR) and the true negative rate (TNR). Depending on the context in which the task is framed and how the predictions affect individuals, stakeholders may only be interested in balancing false positive rates (FPRs) or false negative rates (FNRs). In such scenarios, the equalized odds criterion can be relaxed, leading to equality of opportunity (balanced FNRs) (Hardt, Price, Price, & Srebro, 2016) or predictive equality (balanced FPRs) (Corbett-Davies, Pierson, Feller, Goel, & Huq, 2017).

  • Predictive rate parity (Chouldechova, 2017). Finally, when working with score-based predictions, predictive rate parity states that, for any given score, all groups should have a similar probability of the outcome being true. As can be noted, this notion is particularly useful when the scores need to be interpreted as real probabilities.

In practice, it may be necessary to prioritize which fairness criterion to employ depending on the problem being addressed. This has become an important issue since it has been shown that some measures, such as demographic parity and equalized odds, cannot be simultaneously fulfilled under certain conditions (Chouldechova, 2017).

2.2 Conformal prediction

Let us initially assume a supervised classification setting and an available set of training samples in which each sample \(z_i\) is formed by an input \(x_i \in \mathcal {X}\) and a class label \(y_i \in \mathcal {Y}\), for a discrete label set \(\mathcal {Y}\). The general goal of conformal prediction is to construct a set-valued predictor \(\Gamma : \mathcal {X} \rightarrow 2^\mathcal {Y}\) which ensures validity at a desired confidence level while being as efficient as possible.

A central component of conformal prediction is the definition of a non-conformity measure \(\mathcal {S}: \mathcal {X} \times \mathcal {Y} \rightarrow \mathbb {R}\) used to quantify the degree of relative strangeness of a sample z with respect to a collection of samples \(\{ z_1,..., z_k\}\). The non-conformity measure is a design choice and although it can be any measurable function, it is usually based on the output of an ML model, also called the underlying model. Formally:

$$\begin{aligned} \mathcal {S} (z , \{z_1, \dots , z_k \}) = \Delta (y, h(x)), \end{aligned}$$
(4)

where \(h: \mathcal {X} \rightarrow \mathcal {Y}\) is a supervised pattern-recognition model learned on \(\{z_1, \dots , z_k \}\) and \(\Delta : \mathcal {Y} \times \mathcal {Y} \rightarrow \mathbb {R}\) is a function of dissimilarity between the true label y and the prediction h(x). In this way, the strangeness of a sample z is directly measured through the ability of the model learned on \(\{z_1, \dots , z_k \}\) to accurately predict the true label.

Although originally proposed in an online learning setting, conformal prediction is commonly used under a computationally efficient modification known as split conformal prediction (also referred to as inductive conformal prediction) (Papadopoulos, Proedrou, Vovk, & Gammerman, 2002; Lei, G’Sell, Rinaldo, Tibshirani, & Wasserman, 2018), which is the variant we adopt in this work. With this approach, we start by randomly splitting the original training set into a proper training set \(\{ z_1,..., z_n\}\) and a calibration set \(\{z_{n+1},..., z_m\}\). The proper training set is employed to learn a single predictive model. This predictive model is then used to compute the non-conformity measure for the remaining calibration samples. At this point, for a new instance \(x_{m+1}\), we can tentatively complete it with a hypothetical value \(\bar{y} \in \mathcal {Y}\), compute the non-conformity measure of \((x_{m+1}, \bar{y})\), and compare the resulting non-conformity score with the calibration scores. Therefore, given

$$\begin{aligned} \begin{aligned} s_i&= \mathcal {S}(z_i , \{z_1, \dots , z_n \}) \ \ \ \ \ \ i = n+1, \dots , m \\ s_{m+1}&= \mathcal {S}(z_{m+1} , \{z_1, \dots , z_n \}), \end{aligned} \end{aligned}$$
(5)

we can compute a p value for each label as follows:

$$\begin{aligned} p_{\bar{y}} = \dfrac{|\{ i = n+1, \dots , m+1 : s_i \ge s_{m+1} \}|}{m - n + 1} \end{aligned}$$
(6)

For a given user-specified significance level \(\alpha \in [ 0, 1 ]\), so that the target coverage is \(1 - \alpha\), the prediction set for \(x_{m+1}\) will include all labels \(\bar{y}\) whose p values are greater than \(\alpha\):

$$\begin{aligned} \Gamma _{\alpha } (x_{m+1}) = \{ \bar{y} \in \mathcal {Y} \, | \, p_{\bar{y}}> \alpha \} \end{aligned}$$
(7)

As long as the data generation process is exchangeable, the prediction sets generated by a conformal predictor offer finite-sample marginal validity guarantees (Vovk, Gammerman, & Shafer, 2005):

$$\begin{aligned} \mathbb {P} \, \bigl (y_{m+1} \in \Gamma (x_{m+1}) \bigr ) \ge 1 - \alpha \end{aligned}$$
(8)
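As an illustration, the split conformal procedure of Eqs. 5-8 can be sketched as follows; `score_fn` stands for the non-conformity measure of Eq. 4 computed with the model learned on the proper training set, all names are hypothetical, and the "+ 1" term accounts for the test score itself in the numerator of Eq. 6.

```python
import numpy as np

def split_conformal_set(x_new, labels, score_fn, cal_scores, alpha):
    """Prediction set for a single test input via the p-values of Eq. 6.

    labels     : iterable over the label space Y.
    score_fn   : callable (x, y) -> non-conformity score (Eq. 4).
    cal_scores : array with the calibration non-conformity scores (Eq. 5).
    alpha      : significance level; the target coverage is 1 - alpha (Eq. 8).
    """
    pred_set = set()
    for y_bar in labels:
        s_new = score_fn(x_new, y_bar)
        # Proportion of calibration scores (plus the test score itself) that
        # are at least as non-conforming as the candidate pair (Eq. 6).
        p_val = (np.sum(cal_scores >= s_new) + 1) / (len(cal_scores) + 1)
        if p_val > alpha:                          # Eq. 7
            pred_set.add(y_bar)
    return pred_set
```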

2.2.1 Mondrian conformal prediction

The coverage guarantees of the standard conformal prediction procedure are only satisfied on average. To mitigate this issue, Mondrian conformal prediction (Vovk, Gammerman, & Shafer, 2005) produces prediction sets with finite-sample guarantees in a category-wise manner. Mondrian conformal prediction mainly relies on splitting the calibration set into categories predefined by a taxonomy and performing the calibration step on each category independently. Formally, a taxonomy is a measurable function \(\kappa : \mathcal {X} \times \mathcal {Y} \rightarrow \mathcal {K}\) that maps each sample \(z_i\) to a specific category \(\kappa _i \in \mathcal {K}\), where \(\mathcal {K}\) is usually a discrete space.

As in the standard conformal prediction procedure, the Mondrian extension can be approached in a similar computationally efficient manner. Specifically, given a taxonomy \(\kappa\), the split Mondrian conformal prediction computes a prediction set by means of a \(\kappa\)-conditional p value for each hypothetical value \(\bar{y} \in \mathcal {Y}\), which is given by

$$\begin{aligned} p_{\bar{y}} = \dfrac{|\{ i = n+1, \dots , m+1 : \kappa _i = \kappa _{m+1} , s_i \ge s_{m+1} \}|}{|\{ i = n+1, \dots , m+1 : \kappa _i = \kappa _{m+1} \}|}, \end{aligned}$$
(9)

where \(s_i\) and \(\kappa _i\) are the non-conformity score and the category related to the ith calibration sample, respectively.

The prediction sets produced by a Mondrian conformal prediction satisfy validity conditioned on each of the categories defined by the chosen taxonomy. A special case is the taxonomy \(\kappa : \mathcal {X} \times \mathcal {Y} \rightarrow \mathcal {Y}\), which maps each sample to the label space. Here, the resulting prediction sets fulfill Eq. 1, an essential property in imbalanced classification scenarios. For the sake of this study, we consider the taxonomy \(\kappa : \mathcal {X} \times \mathcal {Y} \rightarrow \mathcal {A}\) to induce Mondrian conformal predictors which ensure Eq. 3 and produce prediction sets with equalized coverage guarantees.
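A sketch of the corresponding group-conditional calibration, in which the taxonomy maps each sample to its sensitive-attribute value so that the resulting sets target Eq. 3, is given below (hypothetical names).

```python
import numpy as np

def mondrian_conformal_set(x_new, a_new, labels, score_fn,
                           cal_scores, cal_groups, alpha):
    """Prediction set calibrated only on the calibration samples sharing the
    sensitive-attribute value of the test sample (Eq. 9)."""
    group_scores = cal_scores[cal_groups == a_new]   # category of the test sample
    pred_set = set()
    for y_bar in labels:
        s_new = score_fn(x_new, y_bar)
        p_val = (np.sum(group_scores >= s_new) + 1) / (len(group_scores) + 1)
        if p_val > alpha:
            pred_set.add(y_bar)
    return pred_set
```

Note that when a group has few calibration samples, `len(group_scores)` is small and the p-values become coarse, which typically inflates the prediction sets; this is the efficiency issue, discussed in Sect. 1, that motivates the optimization scheme of Sect. 3.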

2.3 Multi-objective optimization and evolutionary algorithms

In traditional optimization, the goal is to find a single optimal solution that maximizes or minimizes a specific objective function. In many real-world scenarios, however, decision-makers face situations where multiple, often conflicting, objectives need to be considered simultaneously to achieve optimal policies. Multi-objective optimization (Gunantara, 2018) provides a decision-making paradigm for problems in which such conflicting objectives are present (Armañanzas & Lozano, 2005). It can be formulated as follows:

$$\begin{aligned} \min _x f(x) = \Bigl (f_1(x), f_2(x), ..., f_n(x) \Bigr ) \ \ \text {s.t.} \ \ x \in \Omega , \ \ f: \Omega \rightarrow \Psi \end{aligned}$$
(10)

where \(f_i(x)\), for \(i = 1,..., n\), are the objective functions to be minimized, \(\Omega \subset \mathbb {R}^m\) is the feasible solution space, and \(\Psi \subset \mathbb {R}^n\) is the feasible objective space.

When simultaneously optimizing multiple objective functions, there is generally no single solution that optimizes all of them at once. Therefore, we need to define a notion of optimality between feasible solutions. This optimality has been framed through the concept of dominance. In this sense, a solution \(x \in \Omega\) dominates another solution \(x^{\prime } \in \Omega\) (denoted as \(x \prec x^{\prime }\)) when it is better or equally good in all the objective functions and strictly better in at least one of them, formally:

$$\begin{aligned} \begin{aligned} f_i(x)&\le f_i(x^{\prime }), \ \ \forall i \in \{ 1, ..., n\} \\ f_j(x)&< f_j(x^{\prime }), \ \, \exists j \in \{ 1, ..., n\} \end{aligned} \end{aligned}$$
(11)

Under the concept of dominance, the set of feasible solutions that are not dominated by any other is called the Pareto optimal set; its image in the objective space is known as the Pareto front. Approximating the Pareto front is the main goal in a multi-objective optimization problem, and it can be approached through different methods. Here, we focus on multi-objective evolutionary algorithms (MOEAs) (Emmerich & Deutz, 2018), a family of bio-inspired heuristic algorithms well-suited to solving such problems. MOEAs borrow concepts from natural evolution and employ the dominance relation to guide the search towards the Pareto optimal set. Specifically, we chose the well-established Non-dominated Sorting Genetic Algorithm II (NSGA-II) (Deb, Pratap, Agarwal, & Meyarivan, 2002) as the MOEA to tackle our optimization problem. NSGA-II finds a diverse set of Pareto optimal solutions using a combination of techniques, including non-dominated sorting, crowding distance assignment to preserve diversity among solutions, and genetic operators encompassing selection, crossover, and mutation.
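For reference, the dominance relation of Eq. 11 and the extraction of the non-dominated solutions can be sketched as follows (both objectives to be minimized; the example values are hypothetical). NSGA-II relies on a faster sorting routine, so this quadratic filter is only meant to illustrate the definition.

```python
import numpy as np

def dominates(f_x, f_y):
    """True if the solution with objective vector f_x dominates f_y (Eq. 11)."""
    f_x, f_y = np.asarray(f_x), np.asarray(f_y)
    return bool(np.all(f_x <= f_y) and np.any(f_x < f_y))

def nondominated(objectives):
    """Indices of the non-dominated solutions (the approximated Pareto front)."""
    objectives = [np.asarray(f) for f in objectives]
    return [i for i, f_i in enumerate(objectives)
            if not any(dominates(f_j, f_i)
                       for j, f_j in enumerate(objectives) if j != i)]

# (efficiency, unfairness) pairs of five hypothetical conformal predictors
points = [(1.9, 0.5), (1.8, 2.0), (2.1, 0.4), (1.8, 2.5), (2.0, 1.0)]
print(nondominated(points))   # -> [0, 1, 2]
```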

3 Methods

3.1 Multi-objective design

The multi-objective optimization problem is formulated in terms of two objective functions: \(f_1\), which quantifies the efficiency of the confidence predictors, and \(f_2\), which measures the unfairness of the prediction sets. Both are formalized as metrics to be minimized.

Let us denote \(y \in \mathcal {Y}\) as the label feature to be predicted; \(\Gamma (x) \in 2^{\mathcal {Y}}\) the prediction set generated by a conformal predictor; and \(a \in \mathcal {A}\) the sensitive attribute on which to ensure a fair prediction.

  • Efficiency (\(f_1\)). Conformal prediction sets represent a suitable mechanism for estimating predictive uncertainty. The informativeness of these prediction sets is usually measured using efficiency metrics. Several criteria have been introduced to measure the efficiency of set-valued predictions (Vovk, Nouretdinov, Fedorova, Petej, & Gammerman, 2017). In our experiments, we focus on the widely used average prediction set size (also known as the N criterion) as the function to measure efficiency, mathematically:

    $$\begin{aligned} f_1 = \dfrac{1}{n} \sum _{i=1}^n |\Gamma (x_i)|{.} \end{aligned}$$
    (12)
  • Fairness (\(f_2\)). Most of the fairness criteria proposed to date evaluate point predictions (Verma & Rubin, 2018), which makes them unsuitable for our uncertainty-based scenario. For this reason, we propose a set-valued loss function based on equalized coverage (Romano, Barber, Sabatti, & Candès, 2020) to quantify the unfairness of the prediction sets generated by a conformal predictor. This loss function measures how far the empirical coverage in the unprivileged group(s) is, on average, from the empirical coverage in the privileged group (multiplied by 100 to convert it to a percentage), mathematically:

    $$\begin{aligned} f_2 = 100 \times \dfrac{1}{| \mathcal {A}^{\textsf {unpriv}} |} \sum _{a \in \mathcal {A}^{\textsf {unpriv}}} \big | \widehat{\text {Cov}}_{\textsf {priv}} - \widehat{\text {Cov}}_a \big |, \end{aligned}$$
    (13)

    where \(\widehat{\text {Cov}}_{\textsf {priv}}\) and \(\widehat{\text {Cov}}_a\) denote the empirical coverages in the privileged group and the unprivileged group \(a \in \mathcal {A}^{\textsf {unpriv}}\), respectively. For a given demographic group, the empirical coverage is computed as follows:

    $$\begin{aligned} \widehat{\text {Cov}}_a = \dfrac{1}{|\mathcal {D}^a|} \sum _{i \in \mathcal {D}^a} \mathbb {I} \, \big ( y_i \in \Gamma (x_i) \big ), \end{aligned}$$
    (14)

    where \(\mathcal {D}^{a}\) includes all the individuals belonging to the group \(a \in \mathcal {A}\) and \(\mathbb {I}\) is the indicator function that is 1 when its argument is true and 0 otherwise.

Based on \(f_1\) and \(f_2\), we can state that a conformal predictor \(\Gamma _1\) dominates another conformal predictor \(\Gamma _2\), if \(\Gamma _1\) is more efficient than \(\Gamma _2\) and \(\Gamma _1\) is at least equally as fair as \(\Gamma _2\), or vice versa.
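Both objective functions can be computed directly from the prediction sets evaluated on held-out data; the sketch below (with hypothetical function and array names) mirrors Eqs. 12-14.

```python
import numpy as np

def efficiency(pred_sets):
    """f1: average prediction set size, the N criterion (Eq. 12)."""
    return np.mean([len(s) for s in pred_sets])

def unfairness(y, pred_sets, a, privileged):
    """f2: mean absolute coverage gap, in percentage points, between the
    privileged group and every unprivileged group (Eqs. 13-14)."""
    covered = np.array([yi in s for yi, s in zip(y, pred_sets)], dtype=float)
    coverage = {g: covered[a == g].mean() for g in np.unique(a)}   # Eq. 14
    gaps = [abs(coverage[privileged] - coverage[g])
            for g in coverage if g != privileged]
    return 100.0 * np.mean(gaps)                                    # Eq. 13
```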

Special attention should be paid to conformal predictors whose prediction set size systematically matches the cardinality of the label space (i.e., \(|\Gamma (x)| = |\mathcal {Y}|\)). In such cases, the computed prediction sets are the same for all the groups of the sensitive attribute, yielding perfect equalized coverage and, consequently, becoming non-dominated Pareto solutions. However, these prediction sets have no value in practice since they do not provide any kind of information about the prediction.

3.2 Meta-learning and calibration algorithm

The meta-learning algorithm is based on the NSGA-II algorithm (Deb, Pratap, Agarwal, & Meyarivan, 2002) and the split conformal prediction procedure (Papadopoulos, Proedrou, Vovk, & Gammerman, 2002; Lei, G’Sell, Rinaldo, Tibshirani, & Wasserman, 2018). The pseudo-code of the complete procedure is presented in Algorithm 1.

Algorithm 1 Meta-algorithm for fair and efficient prediction sets

Specifically, we start by assuming an available set of samples which constitutes our meta-algorithm development dataset \(\mathcal {D}_{\textsf {dev}}\). This dataset is partitioned into three different types of sets, each of which has a specific purpose: (a) the proper training set \(\mathcal {D}_{\textsf {train}}\), to learn the underlying models; (b) the calibration set \(\mathcal {D}_{\textsf {cal}}\), to calibrate the models and build the confidence predictors using the split conformal prediction procedure (see Subsect. 2.2); and, (c) the validation set \(\mathcal {D}_{\textsf {val}}\), to evaluate the quality of the prediction sets by quantifying the fitness of each individual in terms of efficiency and fairness (see Subsect. 3.1).

Our meta-algorithm is based on an evolutionary learning process that is designed to induce a Pareto set of conformal predictors by optimizing the hyperparameter configuration of an ML algorithm. In the context of genetic algorithms, each candidate solution is encoded by an individual within a population. In our meta-algorithm, each jth individual in the kth population, denoted as \(I_{kj}\), corresponds to a single conformal predictor. In turn, each individual is encoded with a specific gene, which in our case is given by a particular configuration \(g_{kj} \in \Omega\), for some hyperparameter space \(\Omega\).

The evolution process starts by generating a random first population \(P_1\) of N individuals that constitutes the initial set of potential solutions to our multi-objective problem. To generate \(P_1\), each specific conformal predictor is induced using the split conformal prediction procedure as follows. Given a randomly generated configuration \(g_{1j} \in \Omega\), the proper training set \(\mathcal {D}_{\textsf {train}}\) is employed to learn a classifier \(h_{1j}(x)\). Next, \(h_{1j}(x)\) is used to calculate the non-conformity scores on the calibration set \(\mathcal {D}_{\textsf {cal}}\) and induce a conformal predictor \(\Gamma ^\alpha _{1j} (x)\). Finally, the prediction sets are computed for a given significance level \(\alpha \in [ 0, 1]\) and evaluated on the validation set \(\mathcal {D}_{\textsf {val}}\), allowing the quality of the potential solution to be assessed.
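The induction and evaluation of a single individual can be sketched as follows, using a decision tree and the probability-based non-conformity measure later specified in Sect. 4.2 for concreteness; the configuration, data splits, and privileged-group label are hypothetical, and the actual implementation relies on the crepes library and also covers logistic regression and random forest.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def induce_and_evaluate(config, X_tr, y_tr, X_cal, y_cal,
                        X_val, y_val, a_val, privileged, alpha=0.1):
    """Induce one conformal predictor from a hyperparameter configuration and
    return its fitness (f1, f2) on the validation set (simplified sketch)."""
    # 1. Learn the underlying classifier on the proper training set.
    h = DecisionTreeClassifier(**config, random_state=0).fit(X_tr, y_tr)

    # 2. Calibration: non-conformity scores 1 - h(x)_y on the calibration set.
    proba_cal = h.predict_proba(X_cal)
    cols = np.searchsorted(h.classes_, y_cal)
    cal_scores = 1.0 - proba_cal[np.arange(len(y_cal)), cols]

    # 3. Prediction sets on the validation set (Eqs. 6-7).
    pred_sets = []
    for row in h.predict_proba(X_val):
        s_candidates = 1.0 - row
        p_vals = (np.sum(cal_scores[:, None] >= s_candidates, axis=0) + 1) \
                 / (len(cal_scores) + 1)
        pred_sets.append(set(h.classes_[p_vals > alpha]))

    # 4. Fitness: efficiency (f1) and equalized-coverage gap (f2).
    covered = np.array([yi in s for yi, s in zip(y_val, pred_sets)], dtype=float)
    f1 = np.mean([len(s) for s in pred_sets])
    cov = {g: covered[a_val == g].mean() for g in np.unique(a_val)}
    f2 = 100.0 * np.mean([abs(cov[privileged] - cov[g])
                          for g in cov if g != privileged])
    return f1, f2
```

For instance, `induce_and_evaluate({"max_depth": 5, "min_samples_split": 20}, ...)` would return the objective pair of one candidate conformal predictor, which the evolutionary procedure then uses to rank and evolve the population.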

Once all of the N individuals of \(P_1\) have been evaluated, the algorithm generates an offspring population \(Q_2\) of conformal predictors, which represents the next generation of potential solutions based on procedures borrowed from biological evolution. \(Q_2\) is generated using binary tournament selection and genetic operators (see Subsect. 4.2 for a detailed description of the evolution parameters). The offspring population is then evaluated and combined with the parent population following an elitist ranking based on non-dominance and the crowding distance. The best N members among the parent and offspring populations are selected to form the next population. This process is iterated until a predefined number of generations G is reached or a stopping criterion is met. At the end of the evolutionary process, the meta-algorithm returns the collection of non-dominated conformal predictors of the last computed generation. As a result of the multi-objective optimization, our meta-algorithm produces a set of valid conformal predictors that explore the boundaries of equalized coverage while retaining the informativeness of the prediction sets.

The induction of fair and efficient set predictors comes with benefits regarding the inherent uncertainty quantification capabilities of conformal prediction. Note that generating prediction sets with equalized coverage ensures the same level of confidence across demographic groups. This allows valid prediction sets to be produced regardless of an individual's sensitive attribute, while naturally yielding unbiased uncertainty estimates. Furthermore, this objective is pursued by seeking optimally informative prediction sets in terms of efficiency.

4 Experimental setup

4.1 Datasets

Our methodology was tested using four publicly available benchmark datasets with different characteristics from real-world domains including income prediction, criminal recidivism assessment, hospital readmission, and nursery applications.

  • Adult income (F. Ding, Hardt, Miller, & Schmidt, 2021). This dataset is one of the most commonly used benchmarks in the algorithmic fairness literature. It contains demographic information about U.S. citizens in 1994 and is designed to predict yearly income. Despite being naturally framed as a regression task, this dataset has been widely approached as a binary classification problem to assess whether an individual earns more or less than $50K per year. In our case, we borrowed a reconstructed version of the original dataset and treated it as a multiclass problem. The prediction task was configured according to whether an individual earns less than $20K, between $20K and $50K, or more than $50K per year. Gender was considered the sensitive attribute and male the privileged group.

  • COMPAS (Angwin, Larson, Mattu, & Kirchner, 2022). This dataset is related to the COMPAS algorithm, a criminal justice predictive tool developed in the U.S. to assess the recidivism risk score of a defendant. Due to its widespread use and the controversies surrounding it, the COMPAS dataset has served as a critical benchmark for evaluating the fairness of ML models in predicting criminal recidivism. In our experiments, we selected a subsample of the original input features, including: sex, age, race, juv_fel_count, juv_misd_count, juv_other_count, priors_count, c_charge_degree, decile_score, and score_text. Race was considered the sensitive attribute and white the privileged group. In this case, all unprivileged groups were combined into one group due to sample size considerations.

  • Diabetes (Strack et al., 2014). This clinical dataset encompasses admission records from a decade of medical care information from 130 U.S. hospitals and integrated delivery networks. The predictive task was framed as discerning the readmission status of a patient diagnosed with diabetes whose hospital stay lasted 1 to 14 days. Specifically, we considered readmission before 30 days, readmission after 30 days, or no readmission as our class labels. In this dataset, we set race as the sensitive attribute and white as the privileged group.

  • Nursery (Olave, Rajkovic, & Bohanec, 1989). This dataset contains information on nursery school applications. The predictive task here was to rank applicants based on family information such as parents’ occupation or the number of children. We considered four different classes: children who are not recommended, very recommended, prioritized, and specifically prioritized to join the nursery. As in Romano, Bates, & Candès (2020), we used financial status as the sensitive attribute, where applicants with a convenient status were considered as the privileged group.

Note that the selection of the privileged group is an arbitrary choice. In this study, the group with the larger sample size was chosen as the privileged group in all the datasets. Each dataset was initially preprocessed when needed to satisfy the classifiers’ requirements by means of missing value imputation and categorical feature encoding. The summary of each dataset after preprocessing is shown in Table 1.

Table 1 Description of datasets, including number of samples (n), number of features (p), cardinality of the label space (\(|\mathcal {Y}|\)), label balance, cardinality of the sensitive attribute (\(|\mathcal {A}|\)) and sensitive attribute balance

4.2 Parameter setup

To carry out our experiments, we employed a common template in order to build the conformal predictors using the proposed meta-algorithm. For each dataset, the testing protocol used was a hold-out procedure with a 50/25/25 random split into training, validation, and test sets, respectively. For calibration purposes, we further partitioned our training set into 80% proper training samples and 20% calibration samples. Each split was performed ensuring that the proportion of both the class label and the sensitive attribute was preserved.
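One way to obtain such splits with scikit-learn is to stratify jointly on the class label and the sensitive attribute, as in the sketch below (hypothetical array names; the 50/25/25 and 80/20 proportions follow the text).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_dataset(X, y, a, seed=0):
    """50/25/25 hold-out split into training, validation, and test sets, plus an
    80/20 split of the training portion into proper training and calibration."""
    strata = np.array([f"{yi}_{ai}" for yi, ai in zip(y, a)])   # joint key
    X_tr, X_tmp, y_tr, y_tmp, a_tr, a_tmp, s_tr, s_tmp = train_test_split(
        X, y, a, strata, test_size=0.5, stratify=strata, random_state=seed)
    X_val, X_te, y_val, y_te, a_val, a_te = train_test_split(
        X_tmp, y_tmp, a_tmp, test_size=0.5, stratify=s_tmp, random_state=seed)
    X_prop, X_cal, y_prop, y_cal, a_prop, a_cal = train_test_split(
        X_tr, y_tr, a_tr, test_size=0.2, stratify=s_tr, random_state=seed)
    return ((X_prop, y_prop, a_prop), (X_cal, y_cal, a_cal),
            (X_val, y_val, a_val), (X_te, y_te, a_te))
```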

We proposed to optimize the hyperparameter settings of three different state-of-the-art underlying classifiers on which to subsequently build the conformal predictors: logistic regression (Cox, 1958), decision tree (Breiman, Friedman, Stone, & Olshen, 1984), and random forest (Breiman, 2001). These algorithms represent different modeling paradigms: logistic regression is an explainable model that assumes a linear effect of the features; the decision tree learns a tree-like structure to make decisions based on feature splits; and the random forest is an ensemble built on bootstrap aggregation of multiple decision trees. These classifiers were induced using the scikit-learn library (Pedregosa et al., 2011).

The hyperparameter search space for each algorithm is shown in Table 2. The search space for the logistic regression algorithm is fixed, whereas, for the tree-based algorithms, the search space is adapted to better suit each dataset's characteristics, as proposed in Valdivia et al. (2021). We set the upper bound of the min_samples_split search space to 10% of the training set size of the considered dataset. Regarding the max_depth and max_leaf_nodes hyperparameters, we first learned an initial model with no limits on depth or number of leaf nodes, so that it could grow as deep and as wide as needed. For the decision tree, the upper bounds of the max_depth and max_leaf_nodes hyperparameters were given by the actual depth and number of leaves of this initial tree, respectively. For the random forest algorithm, the upper bounds of both hyperparameters were given by the average values across an entire ensemble of 300 trees.
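The derivation of these data-dependent upper bounds can be sketched as follows (a simplified illustration assuming scikit-learn estimators; function and variable names are hypothetical).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def tree_search_space_bounds(X_train, y_train, n_trees=300, seed=0):
    """Upper bounds for max_depth and max_leaf_nodes derived from models trained
    without depth or leaf limits, plus the min_samples_split upper bound."""
    # Decision tree: bounds given by an unconstrained initial tree.
    tree = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    dt_bounds = {"max_depth": tree.get_depth(),
                 "max_leaf_nodes": tree.get_n_leaves()}

    # Random forest: bounds given by the averages over an unconstrained ensemble.
    forest = RandomForestClassifier(n_estimators=n_trees,
                                    random_state=seed).fit(X_train, y_train)
    rf_bounds = {"max_depth": int(np.mean([t.get_depth()
                                           for t in forest.estimators_])),
                 "max_leaf_nodes": int(np.mean([t.get_n_leaves()
                                                for t in forest.estimators_]))}

    # min_samples_split: upper bound set to 10% of the training set size.
    max_min_samples_split = int(0.1 * len(X_train))
    return dt_bounds, rf_bounds, max_min_samples_split
```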

Table 2 Hyperparameter search space for each of the considered ML classifiers

In order to compare the performance of the meta-algorithm with respect to out-of-the-box strategies, we also evaluated standard and Mondrian conformal predictors built on classifiers induced with default hyperparameter settings. Specifically, hyperparameter configurations were left as default values as implemented in scikit-learn, except for the maximum depth of the tree-based classifiers, which was set to 10.

To calibrate the underlying classifiers, we considered the hinge loss (also known as the least ambiguous set-valued classifier; Sadinle, Lei, & Wasserman, 2019) as the non-conformity measure, formally:

$$\begin{aligned} \Delta \, (y_i, h(x_i)) = 1 - h(x_i)_{y_i}, \end{aligned}$$
(15)

where \(h(x_i)_{y_i}\) is the estimated score provided by the ML model h for the sample \(x_i\) and the label \(y_i\). To construct the conformal predictors, we used the implementation from the crepes library (Boström, 2022).
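For a probabilistic classifier, the hinge scores of Eq. 15 can be obtained directly from the predicted class probabilities; the minimal sketch below assumes a fitted scikit-learn classifier and is only illustrative, since the experiments rely on the crepes implementation.

```python
import numpy as np

def hinge_scores(model, X, y):
    """Non-conformity scores 1 - h(x)_y (Eq. 15) for labeled samples."""
    proba = model.predict_proba(X)
    cols = np.searchsorted(model.classes_, y)     # column of each true label
    return 1.0 - proba[np.arange(len(y)), cols]
```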

For the NSGA-II, we configured a population size of \(N = 50\) and a maximum of \(G = 200\) generations. To account for possible early convergence of the meta-algorithm, we defined the following stopping criterion: if neither of the two objective functions improves on average after 10 generations, the evolutionary learning stops and returns the optimal Pareto set of conformal predictors found in the last generation. Those conformal predictors obtained throughout the evolutionary procedure whose prediction set size matched the cardinality of the label space were automatically discarded to ensure that they were not included in the non-dominated solutions.

For the multi-objective evolutionary learning, we adapted the original Python code implementation from Valdivia et al. (2021). The offspring populations were generated using the following genetic operators:

  • Elitist selection. The selection procedure begins by creating a set of non-dominated fronts \(\mathcal {F} = \{ \mathcal {F}_1, \mathcal {F}_2,... \}\) using the fast-non-dominated sorting algorithm (Deb, Pratap, Agarwal, & Meyarivan, 2002). Then, individuals are selected according to their rank in \(\mathcal {F}\) until a new population of size N is completed. To break ties between solutions on the same front, individuals are further sorted and chosen using the crowding distance, which measures how densely solutions are distributed within the same front.

  • Crossover. The crossover operation between two different parent individuals \(I_{k1}\) and \(I_{k2}\) is defined by the simulated binary crossover operator (Deb & Agrawal, 1995). This operation generates two child individuals \(I_{k+1, 1}\) and \(I_{k+1, 2}\) that inherit the parents' hyperparameters depending on a crossover probability \(p_c\). The inheritance is based on two given parameters \(u^{\prime }, u^{\prime \prime } \sim \mathcal {U}(0, 1)\). If \(u^{\prime } \le p_c\), the children inherit the same hyperparameter configuration as the parents. Otherwise, the children's hyperparameters are based on a weighted combination of the parents' hyperparameters given by the following expressions (a sketch of this operator and the mutation below is given after this list):

    $$\begin{aligned} g_{k+1, 1}= & \dfrac{g_{k1} + g_{k2}}{2} + u^{\prime \prime } \dfrac{|g_{k1} - g_{k2}|}{2} \end{aligned}$$
    (16)
    $$\begin{aligned} g_{k+1, 2}= & \dfrac{g_{k1} + g_{k2}}{2} - u^{\prime \prime } \dfrac{|g_{k1} - g_{k2}|}{2}, \end{aligned}$$
    (17)

    In our experiments, we set \(p_c = 0.9\).

  • Mutation. Each of the individuals mutates its hyperparameter configuration according to a polynomial mutation (Liagkouras & Metaxiotis, 2013) and a parameter \(\mu\). Specifically, given \(u^{\prime }, u^{\prime \prime } \sim \mathcal {U}(0, 1)\), the ith hyperparameter to be mutated for an individual \(I_{kj}\) is randomly chosen over the whole set of hyperparameters, and is updated as follows:

    $$\begin{aligned} hp_i = \left\{ \begin{array}{ll} hp_i + \delta \cdot (hp_i - \min (hp_i)) & \text {{if }} u^{\prime } < 0.5 \\ \\ hp_i + \delta \cdot (\max (hp_i) - hp_i) & \text {{if }} u^{\prime } \ge 0.5, \end{array} \right. \end{aligned}$$
    (18)

    where \(\min (hp_i)\) and \(\max (hp_i)\) denote the minimum and maximum values in the hyperparameter space \(\Omega\) for \(hp_i\), respectively, and \(\delta\) is given by

    $$\begin{aligned} \delta = \left\{ \begin{array}{ll} -1 + 2u^{{\prime \prime }^{\frac{1}{\mu + 1}}} & \text {{if }} u^{\prime \prime } \le 0.5\\ \\ 1 - 2(1 - u^{\prime \prime })^{\frac{1}{\mu + 1}} & \text {{if }} u^{\prime \prime }> 0.5 \end{array} \right. \end{aligned}$$
    (19)

    In our experiments, we set \(\mu = 5\).
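A sketch of the weighted combination of Eqs. 16-17 and the polynomial mutation of Eqs. 18-19, following the expressions above as written, is given below; whether the combination or a direct copy of the parents is applied depends on \(u^{\prime }\) and \(p_c\) as described in the crossover item, and the final clipping to the search space is an added assumption rather than a detail taken from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sbx_children(g1, g2, u2):
    """Child configurations as the weighted combinations of Eqs. 16-17."""
    g1, g2 = np.asarray(g1, float), np.asarray(g2, float)
    mid, half_gap = (g1 + g2) / 2.0, np.abs(g1 - g2) / 2.0
    return mid + u2 * half_gap, mid - u2 * half_gap

def polynomial_mutation(g, lower, upper, mu=5):
    """Mutate one randomly chosen hyperparameter following Eqs. 18-19."""
    g = np.asarray(g, float).copy()
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    i = rng.integers(len(g))                      # hyperparameter to mutate
    u1, u2 = rng.uniform(), rng.uniform()
    if u2 <= 0.5:                                 # Eq. 19 as written
        delta = -1.0 + 2.0 * u2 ** (1.0 / (mu + 1))
    else:
        delta = 1.0 - 2.0 * (1.0 - u2) ** (1.0 / (mu + 1))
    if u1 < 0.5:                                  # Eq. 18
        g[i] = g[i] + delta * (g[i] - lower[i])
    else:
        g[i] = g[i] + delta * (upper[i] - g[i])
    return np.clip(g, lower, upper)               # keep within the search space
```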

Our study adheres to the principles of research reproducibility by making both the code and data accessible in a public repository: https://github.com/digital-medicine-research-group-UNAV/fairsets-moho

5 Results and discussion

5.1 Analysis of marginal validity

To assess marginal validity in the repertoire of conformal predictor solutions generated by our algorithm, we evaluated empirical coverage rates (i.e., the fraction of ground truth labels contained in the prediction sets) averaged across 10 independent runs. For conformal predictors to be valid, the empirical miscoverage should not exceed, on average, the significance level chosen by the user. The empirical coverage values on the test sets of the Adult Income, COMPAS, Diabetes, and Nursery datasets are presented in Table 3 for the 0.05, 0.10, and 0.20 significance levels.

Table 3 Empirical coverage values for each of the datasets

The empirical coverage values from the Pareto set of conformal predictors found by the meta-algorithm practically match the predefined confidence levels for all datasets. Hence, we can confirm that our method produces conformal predictors with marginal validity guarantees. This is expected since our meta-algorithm, although based on stochastic multi-objective optimization, includes a conformal calibration step which provides statistical guarantees for the learned classifiers.

5.2 Analysis of efficiency-fairness trade-off

We continue by analyzing the overall results of the produced Pareto set of optimal conformal predictors in terms of the proposed efficiency and fairness metrics averaged across 10 independent runs. In this analysis, we set the significance level to 0.10. For each run, the obtained solutions were initially sorted based on their efficiency. As the set of solutions is Pareto optimal, sorting the solutions by efficiency automatically sorts them in reverse order of fairness. For this reason, we then computed the minimum, the 25th percentile (Q1), the 50th percentile (Q2), the 75th percentile (Q3), and the maximum of the efficiency values along with the complementary percentiles for the fairness dimension, thereby characterizing each Pareto front. Once the percentiles were computed for each Pareto front, the average distribution of each percentile was estimated over the 10 runs. To account for possible overfitting of our meta-algorithm, we reported average percentiles in both the validation set used to guide the evolutionary procedure and the test set never seen by the meta-algorithm. The average distributions of the obtained solutions in the Adult Income, COMPAS, Diabetes, and Nursery datasets are reported in Tables 4, 5, 6, and 7, respectively. The results obtained by standard and Mondrian conformal predictors with default underlying classifier hyperparameters are also included.

Table 4 Efficiency-fairness trade-off for the Adult income dataset
Table 5 Efficiency-fairness trade-off for the COMPAS dataset
Table 6 Efficiency-fairness trade-off for the Diabetes dataset
Table 7 Efficiency-fairness trade-off for the Nursery dataset

From the empirical results, we can state that our meta-algorithm was able to produce a collection of Pareto solutions that explored the boundaries of efficiency and fairness, evidencing the inherent trade-off between these two criteria. When comparing the validation and test columns, we observe that the efficiency obtained in validation was maintained in test with practically no loss. This was not the case for the fairness values, where larger differences were observed and test scores were slightly worse than validation scores. Additionally, the spread of the solutions found shows that the stability of the fairness estimates was quite sensitive to the data splits. This was especially evident in the case of the COMPAS problem, where the low sample size significantly impacted the variability of the results.

When analyzing the score distributions, we can assess how much efficiency needs to be sacrificed to improve the fairness of the prediction sets. This can be done, for example, by comparing the metric values at the 25th percentile (Q1) and 75th percentile (Q3) positions. For the Adult dataset, the efficiency lost in test sets was 3.4%, 1.4%, and 3.4%, whereas the gain in fairness was 59.2%, 46.7%, and 28.7% when using the logistic regression, decision tree, and random forest algorithms, respectively. In the case of the COMPAS dataset, the efficiency lost in test sets was 0.3%, 2.0%, and 3.0%, while the gain in fairness was 23.9%, 44.1%, and 22.1% for the same three algorithms, respectively. For the Diabetes dataset, it should be noted that our meta-algorithm barely needed to compromise efficiency (less than 1%) while still leaving some room for improvement on fairness; in this case, the gain in fairness was similar across algorithms: 15.9%, 27.4%, and 26.6%, respectively. Finally, for the Nursery dataset, the efficiency sacrifice in test sets was 1.7%, 1.3%, and 0.4%, while the improvement in fairness was 52.6%, 51.7%, and 48.4% when using logistic regression, decision tree, and random forest, respectively. As can be seen, our methodology was able to significantly improve the equalized coverage properties with a comparatively small loss in the prediction sets' informativeness.

By comparing the results of our meta-algorithm with standard and Mondrian conformal predictors using default hyperparameters, we can observe that in many cases our approach produced solutions that dominated the out-of-the-box predictors. That is, for each problem, our methodology was able to find conformal predictors with better performance in one of the objective functions without being worse in the other when guiding the hyperparameter optimization through evolutionary learning. This was the case, for example, for the Nursery dataset (see Table 7), in which the optimization produced substantially better solutions than the baseline predictors. Even when this did not happen, our meta-algorithm was still able to produce solutions that, although less efficient, retrieved fairer prediction sets. This was the case, for example, for the tree-based classifiers in the Adult dataset (see Table 4), where the meta-algorithm was able to find solutions with statistically better fairness than the standard and Mondrian predictors, but at the expense of retrieving less informative prediction sets.

In the COMPAS and the Diabetes datasets (see Tables 5 and 6), note that the optimization of the logistic regression failed to produce statistically better solutions compared to conformal predictors induced with default hyperparameters. In the particular case of COMPAS, this could be due to the spread of the solutions and the limited impact of the hyperparameters on the quality of the induced conformal predictors. Note that, in this dataset, the benefits of our meta-algorithm were observed in the optimization of the decision tree classifier. In this particular scenario, the prediction sets computed with the default hyperparameters obtained equalized coverage. However, this was achieved by producing prediction sets whose width practically matched the cardinality of the label feature space, thereby losing their practical utility.

For a clearer interpretation of the efficiency-fairness trade-off, we visually presented the Pareto optimal solutions, including the whole set of solutions found by the meta-algorithm across all the runs and the average Pareto front. As detailed in Valdivia et al. (2021), the average Pareto front plots the summarized performance of our method for each problem and offers a very suitable representation for gaining insights into how the meta-algorithm balances the conflicting objective functions. It is computed as follows. We first take the rounded mean of the number of different solutions produced by the meta-algorithm across runs, denoted n, which represents how many alternatives the procedure was able to generate on average. Similar to the procedure used in the results tables, we then compute n equally distributed percentiles of the efficiency metric along with the corresponding complementary percentiles of the fairness scores, with linear interpolation between adjacent positions. Finally, the average Pareto front is determined by the mean value of each percentile over the 10 runs. The average Pareto fronts for the Adult dataset using the logistic regression, decision tree, and random forest are shown in Figs. 1, 2 and 3, respectively.
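The construction of the average Pareto front can be sketched as follows, assuming each run yields an array of (efficiency, unfairness) pairs for its non-dominated solutions; names are hypothetical and details such as the exact interpolation may differ from the original implementation of Valdivia et al. (2021).

```python
import numpy as np

def average_pareto_front(runs):
    """Average Pareto front over several runs.

    runs : list of arrays of shape (k_r, 2) holding the (efficiency, unfairness)
           values of the non-dominated solutions found in each run.
    """
    n = int(round(np.mean([len(r) for r in runs])))   # mean number of solutions
    q = np.linspace(0, 100, n)                        # equally spaced percentiles
    fronts = []
    for r in runs:
        r = np.asarray(r, dtype=float)
        eff = np.percentile(r[:, 0], q)               # efficiency percentiles
        unf = np.percentile(r[:, 1], 100 - q)         # complementary percentiles
        fronts.append(np.column_stack([eff, unf]))
    return np.mean(fronts, axis=0)                    # mean over the runs
```

The resulting n points are what is plotted as the average Pareto front in Figs. 1, 2 and 3.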

Fig. 1 Local optimal solutions obtained using logistic regression as the underlying classifier for the Adult problem in validation (left) and test (right) sets. Grey dots represent all the solutions found by the meta-algorithm across all the runs, whereas blue dots represent the average Pareto front (Color figure online)

Fig. 2 Local optimal solutions obtained using the decision tree as underlying classifier for the Adult problem in validation (left) and test (right) sets. Grey dots represent all the solutions found by the meta-algorithm across all the runs, whereas orange dots represent the average Pareto front (Color figure online)

Fig. 3 Local optimal solutions obtained using the random forest as underlying classifier for the Adult problem in validation (left) and test (right) sets. Grey dots represent all the solutions found by the meta-algorithm across all the runs, whereas green dots represent the average Pareto front (Color figure online)

By inspecting the average Pareto fronts, we can assess both the number of different solutions the meta-algorithm was able to generate and the shape of the efficiency-fairness trade-off, providing a deeper understanding of how our methodology tackled the optimization in practice. For the logistic regression, the efficiency-fairness relation in test remained approximately linear in the range [1.00, 2.50] of unfairness (i.e., the range [1.88, 2.01] of efficiency). However, the trade-off became more pronounced for fairer solutions. That is, improving the equalized coverage properties of the prediction sets beyond this range involved a higher sacrifice in their efficiency. This was not the case for the tree-based classifiers, since the efficiency-fairness relation for both learners remained linear across the whole solution space. In fact, optimizing the random forest hyperparameters was the strategy that yielded the highest number of different solutions for this problem. As can be observed, the shapes of the Pareto fronts were similar in both validation and test sets.

To better understand the overall behavior of the three considered classifiers, Fig. 4 presents a comparison of the average Pareto fronts obtained in test for the Adult (left) and the COMPAS (right) problems. For the Adult problem, the optimization of the tree-based models produced more efficient prediction sets than the logistic regression classifier. In fact, the decision tree solutions virtually dominated all the logistic regression alternatives, with significantly better performance for the efficiency metric while retaining similar levels of equalized coverage. When comparing the tree-based models, the random forest was not effective at generating any solutions with an unfairness of less than \(1.5\%\), whereas the optimization of the decision tree classifier resulted in conformal predictors with near-optimal equalized coverage guarantees. It is worth mentioning that, despite producing solutions with worse fairness properties for comparable efficiency levels, the optimization of the random forest yielded a set of non-dominated solutions with better efficiency results. In the case of the COMPAS problem, a clear dominance was observed in the average Pareto fronts. Here, both the logistic regression and the decision tree classifiers fully dominated the random forest algorithm. That is, for every solution found by the tuning of the random forest, another solution using either of the other two alternatives can be selected that dominates it. When comparing the logistic regression with the decision tree, the former was able to generate non-dominated solutions that outperformed those produced by the latter. However, it should be noted that the tuning of the decision tree classifier produced twice as many solutions as the logistic regression optimization. Furthermore, it included a broader range of the solution space in both the fairness and the efficiency dimensions, thereby allowing more extreme solutions to be chosen if needed.

Fig. 4 Comparison between average Pareto fronts using logistic regression (blue dots), decision tree (orange dots), and random forest (green dots) for the Adult (left) and the COMPAS (right) problems in test set (Color figure online)

Fig. 5 Comparison between average Pareto fronts using logistic regression (blue dots), decision tree (orange dots), and random forest (green dots) for the Diabetes (left) and the Nursery (right) problems in test set (Color figure online)

Similarly, Fig. 5 presents the average Pareto fronts obtained in test sets for the Diabetes (left) and Nursery (right) problems. For the Diabetes problem, the random forest clearly dominated both the decision tree and the logistic regression, with the latter two showing nearly identical Pareto fronts. The random forest, despite producing solutions with similar fairness, retrieved more efficient prediction sets. Note that the random forest was also able to generate fairer solutions. By inspecting the Pareto shape, it is interesting to observe that constructing these non-dominated solutions had an impact in terms of efficiency. Finally, in the case of the Nursery problem, a clear dominance hierarchy emerged among the algorithms evaluated. In this dataset, all the classifiers were able to generate solutions with similar levels of fairness, but they retrieved substantially different prediction set efficiencies. The logistic regression was clearly dominated by the solutions produced by the tree-based models. At the same time, the optimization of the random forest dominated the tuning of the decision tree classifier. Our methodology was able to produce fair conformal predictors with practically no loss in efficiency. This was particularly evident in the optimization of the tree-based classifiers, where the fairness-efficiency ratio remained practically constant across almost the whole solution space.

As can be seen from the analysis, the differences in the constructed average Pareto fronts underscore the need for a systematic exploration of several classifiers, as different algorithms exhibit distinct strengths and weaknesses with regard to capturing the trade-offs among the efficiency and fairness functions. The experimental results also indicate that the hyperparameter configuration of an ML classifier can significantly impact the informativeness of the prediction sets and their equalized coverage properties. Additional results for 0.20 and 0.05 significance levels are included in the Supplementary Material.

5.3 Case study: guiding optimal policies in decision-making

In practice, providing a repertoire of modeling alternatives allows the selection of a solution that best fits the policy of the stakeholders. To illustrate a practical use of our meta-algorithm, we consider here a sample use case. Let us take the income prediction task and suppose we are asked to develop a reliable predictive system, one that is as fair as possible in terms of equalized coverage while retaining sufficiently informative prediction sets. Turning to the average results of the different conformal predictors returned by our meta-algorithm (see Fig. 4), we could argue that the decision tree classifier produced the most suitable collection of solutions. Despite yielding worse equalized coverage compared to the logistic regression, it retrieved much more efficient prediction sets. In addition, although the random forest was able to produce solutions with higher efficiency, the decision tree induced substantially fairer assessments.

Since we are interested in an optimal solution in terms of equalized coverage, we can select the fairest conformal predictor built on the decision tree classifier. The performance of the chosen solution averaged across 10 independent runs, both overall and broken down by the groups defined by the sensitive attribute, is reported in Table 8. As can be seen, the selected Pareto solution produced valid prediction sets in both demographic groups, satisfying the required equalized coverage. That is, it produces prediction sets that cover the actual income with a confidence of 90%, regardless of the individual's gender. It should be noted that the reported solution was selected on the validation set, thus providing clear evidence of the robustness of our meta-algorithm on unseen data.

Table 8 An optimal solution with equalized coverage for the Adult dataset

As previously mentioned, the prediction set size produced by a conformal predictor offers a suitable way to communicate predictive uncertainty. When comparing the average prediction set size in each demographic group separately, we can observe a substantial difference. The selected solution produces, on average, wider prediction sets for the male group than for the female group. Although this could initially be perceived as a form of unfairness in itself, we argue that this is the conformal predictor's way of quantifying the limits of its predictive performance across different subgroups. In fact, the utility of uncertainty quantification methods such as conformal prediction lies in making these limitations explicit to the user by means of the prediction set width, while retaining ground truth coverage with statistical guarantees.

6 Conclusions

Algorithmic fairness and the development of reliable predictive tools have become requirements as data-driven systems have permeated safety-critical contexts. In this work, we tackle the development of fair ML models with predictive uncertainty quantification, and introduce a procedure that returns a collection of valid confidence predictors balancing efficiency and fairness. Our methodological contribution lies in framing the construction of optimal prediction sets with guarantees as a multi-objective optimization problem addressed with the NSGA-II algorithm and conformal prediction. Specifically, we optimize the hyperparameters of an ML classifier and calibrate the resulting model with the conformal prediction procedure to learn non-dominated conformal classifiers in terms of average prediction set size and equalized coverage, while ensuring marginal validity guarantees. Through the experimental evaluation, we showed the utility of the meta-algorithm on real-world datasets from different domains, thus exploring the limits of efficiency and fairness when these criteria are simultaneously optimized. This allowed us to produce a wide range of different conformal predictors with optimal properties in terms of a Pareto front, thus evidencing the inherent trade-off between these objective functions. We covered the optimization of three different state-of-the-art ML classifiers: logistic regression, decision tree, and random forest. We demonstrated that, depending on the classifier to be optimized and the problem to be addressed, the Pareto front can exhibit different shapes, illustrating how the meta-algorithm is able to trade off the efficiency and fairness of a conformal predictor. In most cases, our methodology was able to produce solutions that dominated standard and Mondrian conformal predictors induced with default classifier hyperparameters or, at least, to generate fairer prediction sets than out-of-the-box strategies.

As a matter of fact, one of the core strengths of our proposal is that it obtains a wide repertoire of different modeling alternatives in a single run from which the optimal one can be selected depending on the policy to be adopted. We also showed an example of the practical utility of our optimization procedure by presenting a case study on the income prediction problem.

For the purpose of this study, our meta-algorithm was framed to optimize two specific efficiency and fairness criteria. However, our proposal is a completely versatile framework that allows a wide range of criteria to be optimized. For example, we could take any of the set-valued efficiency criteria introduced in Vovk et al. (2017), or the recently proposed equal opportunity of coverage to ensure valid prediction sets for more fine-grained demographic groups (Wang, Cheng, Guo, Liu, & Yu, 2023). Additional work should also include the testing of novel non-conformity measures such as adaptive prediction sets (Romano, Sesia, & Candès, 2020) or its regularized extension proposed for many-class classification problems (Angelopoulos, Bates, Jordan, & Malik, 2021). It is worth noting that our methodology has only been tested on four real-world benchmark datasets, so further experimentation on additional problems where algorithmic fairness is needed should be considered.

Finally, our study highlights that the integration of conformal prediction in the development of fair ML can play a key role in building the foundations of a trustworthy ML based on equitable decisions and statistical guarantees.