Label Noise Robustness of Conformal Prediction
Bat-Sheva Einbinder, Shai Feldman, Stephen Bates, Anastasios N. Angelopoulos, Asaf Gendler, Yaniv Romano
Abstract
We study the robustness of conformal prediction, a powerful tool for uncertainty quan-
tification, to label noise. Our analysis tackles both regression and classification problems,
characterizing when and how it is possible to construct uncertainty sets that correctly cover
the unobserved noiseless ground truth labels. We further extend our theory and formulate
the requirements for correctly controlling a general loss function, such as the false negative
proportion, with noisy labels. Our theory and experiments suggest that conformal pre-
diction and risk-controlling techniques with noisy labels attain conservative risk over the
clean ground truth labels whenever the noise is dispersive and increases variability. In other
adversarial cases, we can also correct for noise of bounded size in the conformal prediction
algorithm in order to ensure achieving the correct risk of the ground truth labels without
score or data regularity.
Keywords: conformal prediction, risk control, uncertainty quantification, label noise,
distribution shift
© 2024 Bat-Sheva Einbinder, Shai Feldman, Stephen Bates, Anastasios N. Angelopoulos, Asaf Gendler, Yaniv Romano.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1549.html.
1. Introduction
In most supervised classification and regression tasks, one would assume the provided labels
reflect the ground truth. In reality, this assumption is often violated (Cheng et al., 2022; Xu
et al., 2019; Yuan et al., 2018; Lee and Barber, 2022; Cauchois et al., 2022). For example,
doctors labeling the same medical image may have different subjective opinions about the
diagnosis, leading to variability in the ground truth label itself. To quote Abdalla and Fine
(2023): “Noise in labeling schemas and gold label annotations are pervasive in medical
imaging classification and affect downstream clinical deployment.” In other settings, such
variability may arise due to sensor noise, data entry mistakes, the subjectivity of a human
annotator, or many other sources. In other words, the labels we use to train machine learning
(ML) models may often be noisy, in the sense that they are not necessarily the ground truth.
Consequently, this can result in the formulation of invalid data-driven conclusions.
The above discussion emphasizes the critical need for making reliable predictions in real-
world scenarios, especially when dealing with imperfect training data. An effective way to
enhance the reliability of ML models is to quantify their prediction uncertainty. Conformal
prediction (Vovk et al., 2005, 1999; Angelopoulos and Bates, 2023) is a generic uncertainty
quantification tool that transforms the output of any ML model into prediction sets that
are guaranteed to cover the future, unknown test labels with high probability. This guar-
antee holds for any data distribution and sample size, under the sole assumption that the
training and test data are i.i.d. However, it remains unclear whether valid uncertainty quantification can be obtained via conformal prediction in the presence of noisy labels, since the noise breaks the i.i.d. assumption. In this paper, we aim to address this specific challenge and precisely
characterize under what conditions conformal methods, applied to noisy data, would yield
prediction sets guaranteed to cover the unseen clean, ground truth label. Additionally, we
analyze the effect of label noise on risk-controlling techniques, which extend the conformal
prediction approach to construct uncertainty sets with a guaranteed control over a general
risk function (Bates et al., 2021; Angelopoulos et al., 2021, 2024). We also show that our
theory can be applied to online settings, e.g., to quantify prediction uncertainty for time
series data with noisy labels. Overall, we analyze the behavior of conformal prediction and
risk-controlling methods for several common loss functions and noise models, highlighting
their built-in robustness to dispersive, variability-increasing noise and vulnerability to ad-
versarial noise. Adversarial noise might reduce the uncertainty of the response variable,
potentially causing the prediction sets to be too small and achieve a low coverage rate. We
note that a summary of this paper and its key contributions can be found in (Feldman
et al., 2023a).
Consider a calibration data set of i.i.d. observations {(Xi , Yi )}ni=1 sampled from an arbitrary
unknown distribution PXY . Here, Xi ∈ Rp is the feature vector that contains p features for
the i-th sample, and Yi denotes its response, which can be discrete for classification tasks
or continuous for regression tasks. Given the calibration data set, an i.i.d. test data point
(X_test, Y_test), and a pre-trained model f̂, conformal prediction constructs a set Ĉ_clean(X_test) that contains the unknown test response, Y_test, with high probability, e.g., 90%. We refer to Ĉ_clean(X_test) as 'clean' to underscore that the prediction set is formed by utilizing samples from the clean data distribution. That is, for a user-specified level α ∈ (0, 1),
P(Y_test ∈ Ĉ_clean(X_test)) ≥ 1 − α.    (1)
This property is called marginal coverage, where the probability is defined over the calibra-
tion and test data.
In the setting of label noise, we only observe the corrupted labels Ỹi = g(Yi ) for some
corruption function g : Y × [0, 1] → Y, so the i.i.d. assumption and marginal coverage
guarantee are invalidated. The corruption is random; we will always take the second argu-
ment of g to be a random seed U uniformly distributed on [0, 1]. To ease notation, we leave
the second argument implicit henceforth. Nonetheless, using the noisy calibration data,
we seek to form a prediction set Ĉ_noisy(X_test) that covers the clean, uncorrupted test label, Y_test. More precisely, our goal is to delineate when it is possible to provide guarantees of the form
P(Y_test ∈ Ĉ_noisy(X_test)) ≥ 1 − α,    (2)
where the probability is taken jointly over the calibration data, test data, and corruption
function (this will be the case for the remainder of the paper).
More formally, we use the model fˆ to construct a score function, s : X ×Y → R, which is
engineered to be large when the model is uncertain and small otherwise. We will introduce
different score functions for both classification and regression as needed in the following
subsections. Abbreviate the scores on each calibration data point as si = s(Xi , Yi ) for each
i = 1, ..., n. Conformal prediction tells us that we can achieve a marginal coverage guarantee by picking q̂_clean = s_(⌈(n+1)(1−α)⌉), the ⌈(n + 1)(1 − α)⌉-smallest of the calibration scores, and constructing the prediction sets as
Ĉ_clean(X_test) = {y ∈ Y : s(X_test, y) ≤ q̂_clean}.
In this paper, we do not allow ourselves access to the calibration labels, only their noisy
versions, Ỹ_1, . . . , Ỹ_n, so we cannot calculate q̂_clean. Instead, we can calculate the noisy quantile q̂_noisy as the ⌈(n + 1)(1 − α)⌉-smallest of the noisy scores, s̃_i = s(X_i, Ỹ_i).
The main formal question of our work is whether the resulting prediction set, Ĉ_noisy(X_test) = {y : s(X_test, y) ≤ q̂_noisy}, covers the clean label as in (2). We refer to this general recipe as Recipe 1 for future reference.
This recipe produces prediction sets that cover the noisy label at the desired coverage rate
(Vovk et al., 2005; Angelopoulos and Bates, 2023):
P(Ỹ_test ∈ Ĉ_noisy(X_test)) ≥ 1 − α.
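Recipe 1 itself is a short procedure; as an illustration only (our sketch, not the authors' code; the score function, model, and data below are placeholders), it can be written as:

```python
import numpy as np

def recipe_1(score_fn, x_cal, y_cal_noisy, alpha):
    """Return q_hat_noisy, the ceil((n+1)(1-alpha))-smallest noisy calibration score
    (assumes n is large enough for this rank to exist)."""
    scores = np.array([score_fn(x, y) for x, y in zip(x_cal, y_cal_noisy)])
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
    return np.sort(scores)[k - 1]

def prediction_set(score_fn, x_test, candidate_labels, q_hat_noisy):
    """C_noisy(x_test): all candidate labels whose score is below the threshold."""
    return [y for y in candidate_labels if score_fn(x_test, y) <= q_hat_noisy]

# toy usage with a residual-magnitude score and a stand-in model f_hat(x) = 2x
rng = np.random.default_rng(0)
f_hat = lambda x: 2.0 * x
score = lambda x, y: abs(y - f_hat(x))
x_cal = rng.uniform(size=500)
y_cal_noisy = 2.0 * x_cal + rng.normal(scale=0.3, size=500)   # noisy calibration labels
q_hat = recipe_1(score, x_cal, y_cal_noisy, alpha=0.1)
print(len(prediction_set(score, 0.5, np.linspace(-1.0, 4.0, 501), q_hat)))
```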
Classification
Figure 1: Marginal coverage of prediction sets calibrated with noisy and clean labels, relative to the nominal 90% level, together with two illustrative test examples: an image with true label "Car" (noisy-calibrated set {Car, Ship, Cat}; clean-calibrated set {Car}) and an image with true label "Cat" (noisy-calibrated set {Cat, Dog}; clean-calibrated set {Cat}).
Regression
In this section, we present a real-world application with a continuous response, using Aes-
thetic Visual Analysis (AVA) data set, first presented by Murray et al. (2012). This data
set contains pairs of images and their aesthetic scores in the range of 1 to 10, obtained by
approximately 200 annotators. Following Kao et al. (2015); Talebi and Milanfar (2018);
Murray et al. (2012), the task is to predict the average aesthetic score of a given test image.
Therefore, we consider the average aesthetic score taken over all annotators as the clean,
ground truth response. The noisy response is the average aesthetic score taken over 10
randomly selected annotators only.
We examine the performance of conformal prediction using two different scores: the CQR
score (Romano et al., 2019), defined in (4) and the residual magnitude score (Papadopoulos
et al., 2002; Lei et al., 2018) defined in (5). In our experiments, we set the function û(x) of
the residual magnitude score as û(x) = 1. We follow Talebi and Milanfar (2018) and take
a transfer learning approach to fit the predictive model using a VGG-16 model pretrained
on ImageNet data set. Details regarding the training strategy are in Appendix B.2.
Figure 2 portrays the marginal coverage and average interval length achieved using
CQR and residual magnitude scores. As a point of reference, this figure also presents the
performance of the two conformal methods when calibrated with a clean calibration set; as
expected, the two perfectly attain 90% coverage. By contrast, when calibrating the same
predictive models with a noisy calibration set, the resulting prediction intervals tend to be
wider and to over-cover the average aesthetic scores.
Thus far, we have found in empirical experiments that conservative coverage is obtained
in the presence of label noise. In the following sections, our objective is to establish condi-
tions that formally guarantee label-noise robustness.
Figure 2: Marginal coverage (left) and average interval length (right) on the AVA data set for the residual magnitude and CQR scores, calibrated with noisy and clean labels; the dashed line marks the nominal 90% coverage level.
We begin the theoretical analysis with a general statement, showing that Recipe 1 produces
valid prediction sets whenever the noisy score distribution stochastically dominates the clean
score distribution. The intuition is that the noise distribution ‘spreads out’ the distribution
of the score function such that q̂noisy is (stochastically) larger than q̂clean .
Theorem 1 Let Ĉ_noisy be constructed as in Recipe 1, and suppose that the noisy score s̃_test = s(X_test, Ỹ_test) stochastically dominates the clean score s_test = s(X_test, Y_test), i.e., P(s̃_test ≤ t) ≤ P(s_test ≤ t) for all t. Then,
P(Y_test ∈ Ĉ_noisy(X_test)) ≥ 1 − α.
Furthermore, for any u satisfying P(s̃_test ≤ t) + u ≥ P(s_test ≤ t) for all t,
P(Y_test ∈ Ĉ_noisy(X_test)) ≤ 1 − α + 1/(n + 1) + u.
Figure 3 illustrates the idea behind Theorem 1, demonstrating that when the noisy score
distribution stochastically dominates the clean score distribution, then q̂noisy ≥ q̂clean and
thus uncertainty sets calibrated using noisy labels are more conservative. In practice, how-
ever, one does not have access to such a figure, since the scores of the clean labels are
unknown. This gap requires an individual analysis for every task and its noise setup, which
emphasizes the complexity of this study. In general, for most commonly used score func-
tions the stochastic dominance assumption holds when the noise is dispersive, meaning it
“flattens” the density of Y | X, e.g., when Var(Ỹ | X) > Var(Y | X). In the following
subsections, we present example setups in classification and regression tasks under which
this stochastic dominance holds, and conformal prediction with noisy labels succeeds in covering the true, noiseless label.
Figure 3: Clean (green) and noisy (red) non-conformity scores under dispersive corruption.
The purpose of these examples is to illustrate simple and
intuitive statistical settings where Theorem 1 holds. Under the hood, all setups given in the
following subsections are applications of Theorem 1. Though the noise can be adversarially
designed to violate these assumptions and cause under-coverage (as in the impossibility re-
sult in Proposition 3), the evidence presented here suggests that in the majority of practical
settings, conformal prediction can be applied without modification. The proof is given in
Appendix A.1.
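To make this concrete, the following simulation (our own illustrative sketch, not part of the paper) compares the calibrated quantiles and the empirical score CDFs under dispersive additive noise; in this regime q̂_noisy typically exceeds q̂_clean, in line with Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 2000, 0.1
x = rng.uniform(size=n)
y_clean = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
y_noisy = y_clean + 0.3 * rng.normal(size=n)          # dispersive additive noise

f_hat = lambda v: np.sin(2 * np.pi * v)               # stand-in point predictor
s_clean = np.abs(y_clean - f_hat(x))                  # clean residual scores
s_noisy = np.abs(y_noisy - f_hat(x))                  # noisy residual scores

k = int(np.ceil((n + 1) * (1 - alpha)))
q_clean, q_noisy = np.sort(s_clean)[k - 1], np.sort(s_noisy)[k - 1]
print(f"q_clean={q_clean:.3f}, q_noisy={q_noisy:.3f}")  # expect q_noisy >= q_clean

# empirical check of stochastic dominance: P(s_noisy <= t) <= P(s_clean <= t) for all t
t_grid = np.linspace(0.0, s_noisy.max(), 50)
ecdf = lambda s: (s[:, None] <= t_grid).mean(axis=0)
print("dominance holds (up to noise):", bool(np.all(ecdf(s_noisy) <= ecdf(s_clean) + 1e-3)))
```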
2.3.2 Regression
In this section, we analyze a regression task where the labels are continuous-valued and the
corruption function is additive:
g^add(y) = y + Z,    (3)
for some independent noise sample Z.
We first analyze the setting where the noise Z is symmetric around 0 and the density
of Y | X is symmetric unimodal. We also assume that the estimated prediction interval
contains the true median of Ỹ | X = x, which is a very weak assumption about the
fitted model. The following proposition states that such an interval achieves a conservative
coverage rate over the clean labels.
Importantly, the corruption function may depend on the feature vector, as stated next.
Furthermore, Proposition 1 applies for any non-conformity score function, and specifically for the two popular scores we focus on. The CQR score, developed by Romano et al. (2019), measures the distance of an estimated interval's endpoints from the corresponding label y, formally defined as:
s^CQR(x, y) = max{f̂_lower(x) − y, y − f̂_upper(x)}.    (4)
Above, f̂_lower and f̂_upper are the estimated lower and upper interval endpoints, e.g., obtained by fitting a quantile regression model to approximate the α/2 and 1 − α/2 conditional quantiles of Ỹ | X (Koenker and Bassett, 1978). The residual magnitude (RM) score (Papadopoulos et al., 2002; Lei et al., 2018) assesses the normalized prediction error:
s^RM(x, y) = |y − f̂(x)| / û(x),    (5)
where f̂ is the fitted point predictor and û is a normalization function.
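For concreteness, a minimal sketch of the two scores and of the noisy calibration step (our illustration; the interval and point predictors are assumed to come from any fitted model):

```python
import numpy as np

def cqr_score(y, lower, upper):
    """CQR score (4): signed distance of y from the estimated interval [lower, upper]."""
    return np.maximum(lower - y, y - upper)

def rm_score(y, y_hat, u_hat=1.0):
    """Residual magnitude score (5), normalized by u_hat (here u_hat(x) = 1)."""
    return np.abs(y - y_hat) / u_hat

def calibrate_and_build(lower_test, upper_test, noisy_scores, alpha):
    """Same recipe as before, with noisy CQR scores; returns the inflated interval."""
    n = len(noisy_scores)
    q_hat = np.sort(noisy_scores)[int(np.ceil((n + 1) * (1 - alpha))) - 1]
    return lower_test - q_hat, upper_test + q_hat
```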
2.3.3 Classification
In this section, we formulate the conditions under which conformal prediction is robust to
label noise in a K-class classification setting where the labels take one of K values, i.e.,
Y ∈ {1, 2, ..., K}. In a nutshell, robustness is guaranteed when the corruption function
transforms the labels’ distribution towards a uniform distribution while maintaining the
same ranking of labels. Such corruption increases the prediction uncertainty, which drives
conformal methods to construct conservative uncertainty sets to achieve the nominal cov-
erage level on the observed noisy labels. We now formalize this intuition. We begin by
defining the following noise models:
• Uniform noise: A noise model that fulfills the following for all x ∈ X:
1. ∀i ∈ {1, ..., K}: |P(Ỹ_test = i | X_test = x) − 1/K| ≤ |P(Y_test = i | X_test = x) − 1/K|,
2. ∀i, j ∈ {1, ..., K}: P(Y_test = i | X_test = x) ≤ P(Y_test = j | X_test = x) ⟺ P(Ỹ_test = i | X_test = x) ≤ P(Ỹ_test = j | X_test = x).
• Random flip: A corruption function that keeps the label with probability 1 − ε and otherwise replaces it with a uniformly drawn label:
g^flip(y) = y with probability 1 − ε, and g^flip(y) = Y′ otherwise, where Y′ ∼ Unif{1, ..., K}.    (6)
Figure 4: Clean (green) and noisy (red) class probabilities under dispersive corruption.
Proposition 2 Let Ĉ_noisy be constructed as in Recipe 1. Then, the coverage rate achieved over the clean labels is upper bounded by:
P(Y_test ∈ Ĉ_noisy(X_test)) ≤ 1 − α + 1/(n + 1) + (1/2) Σ_{i=1}^{K} |P(Ỹ_test = i) − P(Y_test = i)|.
Further suppose that Ĉ_noisy contains the most likely labels, i.e., for all x ∈ X, i ∈ Ĉ_noisy(x), and j ∉ Ĉ_noisy(x): P(Ỹ_test = i | X_test = x) ≥ P(Ỹ_test = j | X_test = x). If the noise follows the uniform noise model, then the coverage rate is guaranteed to be valid:
1 − α ≤ P(Y_test ∈ Ĉ_noisy(X_test)).
We emphasize that the key contribution of Proposition 2 is the lower bound, as it guar-
antees a valid coverage rate. We also note that Barber et al. (2023) provides a sharper
upper bound for the coverage rate, which also relies on the TV-distance between Y_test and Ỹ_test.
Nevertheless, the lower bound in Proposition 2 is tighter than the lower bound described
in Barber et al. (2023). Figure 4 visualizes the essence of Proposition 2, showing that as
the corruption increases the label uncertainty, the prediction sets generated by conformal
prediction get larger, yielding a conservative coverage rate. It should be noted that under the above proposition, achieving valid coverage requires only knowledge of the noisy conditional distribution P(Ỹ | X), so a model trained on a large amount of noisy data should
approximately have the desired coverage rate. Moreover, one should notice that without the
assumptions in Proposition 2, even the oracle model is not guaranteed to achieve valid cov-
erage, as we will show in Section 4.1. For a more general analysis of classification problems
where the noise model is an arbitrary confusion matrix, see Appendix A.3.2.
We now turn to examine two conformity scores and show that applying them with noisy
data leads to conservative coverage, as a result of Proposition 2. The adaptive prediction
sets (APS) score, first introduced by Romano et al. (2020), is defined as
s^APS(x, y) = Σ_{y′ ∈ Y} π̂_{y′}(x) I{π̂_{y′}(x) > π̂_y(x)} + π̂_y(x) · U′,
where I is the indicator function, π̂_y(x) is the estimated conditional probability P(Y = y | X = x), and U′ ∼ Unif(0, 1). To make the score non-random, a variant of the above with U′ = 1 is often used. The APS score is one of two popular conformal methods for classification. The other score, from Vovk et al. (2005); Lei et al. (2013), is referred to as the homogeneous prediction sets (HPS) score, s^HPS(x, y) = 1 − π̂_y(x), for some classifier π̂_y(x) ∈ [0, 1]. The
next corollary states that with access to an oracle classifier’s ranking, conformal prediction
covers the noiseless test label.
Corollary 1 Let Ĉ_noisy(X_test) be constructed as in Recipe 1 with either the APS or HPS score function, with any classifier that ranks the classes in the same order as the oracle classifier π̂_y(x) = P(Ỹ = y | X = x). Then,
1 − α ≤ P(Y_test ∈ Ĉ_noisy(X_test)) ≤ 1 − α + 1/(n + 1) + (1/2) Σ_{i=1}^{K} |P(Ỹ_test = i) − P(Y_test = i)|.
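For reference, the two score functions can be computed as follows (our sketch; probs stands in for any estimated class-probability vector π̂(x)):

```python
import numpy as np

def hps_score(probs, y):
    """HPS score: one minus the estimated probability of the candidate class y."""
    return 1.0 - probs[y]

def aps_score(probs, y, u=1.0):
    """APS score: mass of classes ranked strictly above y, plus u * probs[y];
    u = 1 gives the deterministic variant mentioned in the text."""
    return probs[probs > probs[y]].sum() + u * probs[y]

probs = np.array([0.55, 0.25, 0.15, 0.05])     # example estimated class probabilities
print(hps_score(probs, 1), aps_score(probs, 1))
```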
We now turn to examine the specific random flip noise model in which the noisy label is randomly flipped an ε fraction of the time. This noise model is well-studied in the literature;
see, for example, (Aslam and Decatur, 1996; Angluin and Laird, 1988; Ma et al., 2018; Jenni
and Favaro, 2018; Jindal et al., 2016; Yuan et al., 2018).
Corollary 2 Let Ĉ_noisy(X_test) be constructed as in Recipe 1 with the corruption function g^flip and either the APS or HPS score function, with any classifier that ranks the classes in the same order as the oracle classifier π̂_y(x) = P(Ỹ = y | X = x). Then,
1 − α ≤ P(Y_test ∈ Ĉ_noisy(X_test)) ≤ 1 − α + 1/(n + 1) + ε(K − 1)/K.
Crucially, the above corollaries apply with any score function that preserves the order of
the estimated classifier, which emphasizes the generality of our theory. All proofs are given
in Appendix A.3.1. Moreover, although this is not our main focus, in Appendix A.3.3 we
investigate the inflation of the prediction set size in the specific case of random flip noise
with APS scores and the oracle model. Table 1 summarizes all different settings we examine
with their corresponding bounds. Finally, we note that in Section 3.3.1 we extend the above
analysis to multi-label classification, where there may be multiple labels that correspond to
the same sample.
Though the coverage guarantee holds in many realistic cases, conformal prediction may
generate uncertainty sets that fail to cover the true outcome. Indeed, in the general case,
conformal prediction produces invalid prediction sets, and must be adjusted to account
for the size of the noise. The following proposition states that for any nontrivial noise
distribution, there exists a score function that breaks naïve conformal prediction.
Table 1: Summary of coverage bounds for different scores and noise models
Proposition 3 (Coverage is impossible in the general case.) Take any Ỹ that is not equal to Y in distribution. Then there exists a score function s that yields
P(Y_test ∈ Ĉ_noisy(X_test)) < P(Y_test ∈ Ĉ(X_test)),
for Ĉ_noisy constructed using noisy samples and Ĉ constructed with clean samples.
The above proposition says that for any noise distribution, there exists an adversarially cho-
sen score function that will disrupt coverage. Furthermore, as we discuss in Appendix A.4,
with a noise of a sufficient magnitude, it is possible to get arbitrarily bad violations of
coverage. In Appendix A.4 we state an additional impossibility result in which we claim
that for any given score function following some conditions, there is an adversarial noise
that invalidates the coverage.
Next, we discuss how to adjust the threshold of conformal prediction to account for
noise of a known size, as measured by total variation (TV) distance from the clean label.
Corollary 3 (Corollary of Barber et al. (2023)) Let Ỹ be any random variable satisfying D_TV(Y, Ỹ) ≤ ε. Take α′ = α − 2ε(n + 1)/n. Letting Ĉ_noisy(X_test) be the output of Recipe 1 with any score function at level α′ yields
P(Y_test ∈ Ĉ_noisy(X_test)) ≥ 1 − α.
We discuss this strategy more in Appendix A.4—the algorithm implied by Corollary 3 may
not be particularly useful, as the TV distance is a badly behaved quantity that is also
difficult to estimate, especially since the clean labels are inaccessible.
As a final note, if the noise is bounded in TV norm, then the coverage is also not too
conservative.
P(Y_test ∈ Ĉ_noisy(X_test)) < 1 − α + 1/(n + 1) + nξ/(n + 1).
Up to this point, we have focused on the miscoverage loss, and the analysis conducted thus
far applies exclusively to this metric. However, in real-world applications, it is often desired
to control metrics other than the binary loss L^miscoverage(y, C) = 1{y ∉ C}, where C is
a set of predicted labels. Examples of such alternative losses include the F1-score or the
false negative rate. The latter is particularly relevant for high-dimensional response Y , as in
tasks like multi-label classification or image segmentation. To address this need, researchers
have developed extensions of the conformal framework that go beyond the miscoverage loss,
providing a rigorous risk control guarantee for general loss functions (Bates et al., 2021;
Angelopoulos et al., 2021, 2024).
Similarly to the conformal prediction algorithm, in the risk-control setting we post-
process the predictions of a model fˆ to create a prediction set Cbλ (Xtest ) with a parameter
λ that determines its level of conservativeness: higher values yield larger and nested sets,
in the sense that Ĉ_λ2(·) ⊆ Ĉ_λ1(·) for λ2 ≤ λ1. For instance, if f̂_y(x) is an estimator of the conditional probability of Y | X = x, then the prediction sets can be defined as Ĉ_λ(X_test) = {y : f̂_y(X_test) ≥ 1 − λ}. To measure the quality of Ĉ_λ(X_test), we consider a loss function
L(Ytest , Cbλ (Xtest )) which we require to be non-increasing as a function of λ. This is analogous
to conformal prediction in which the quantile of the scores, q̂, encodes the prediction set
sizes, and the error measure is simply the miscoverage loss. In the conformal prediction
framework, we use a holdout set {(Xi , Yi )}ni=1 in order to calibrate q̂clean and achieve valid
coverage over a new test point. Likewise, in the conformal risk control settings we aim to
use the observed losses {L(Y_i, Ĉ_λ(X_i))}_{i=1}^{n} derived from the calibration set to find a calibrated threshold λ̂_clean that will control the risk of an unseen test point at a pre-specified level α:
E[L(Y_test, Ĉ_{λ̂_clean}(X_test))] ≤ α.
See Appendix C.1 for the conformal risk control procedure. Analogously to conformal pre-
diction, these methods produce valid sets under the i.i.d. assumption, but their guarantees
do not hold in the presence of label noise. Provided a noisy calibration set, {(Xi , Ỹi )}ni=1 ,
the parameter λ̂noisy is constructed using the noisy losses {L(Ỹi , Cbλ (Xi ))}ni=1 , and therefore
the risk of a new clean test point is not guaranteed to be controlled.
Our main goal is to delineate when it is possible to provide a risk control guarantee of
the form
E[L(Y_test, Ĉ_{λ̂_noisy}(X_test))] ≤ α.
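As a rough sketch of the calibration step (ours, simplified from the conformal risk control rule of Angelopoulos et al. (2024); the λ grid, the loss bound B = 1, and the set construction are illustrative assumptions), λ̂_noisy can be taken as the smallest λ whose adjusted empirical risk over the noisy calibration losses falls below α:

```python
import numpy as np

def fnr_loss(true_labels, pred_set):
    """False negative rate: fraction of the true positive labels missing from the set."""
    true_labels = set(true_labels)
    return 1.0 - len(true_labels & set(pred_set)) / max(len(true_labels), 1)

def prediction_set(probs, lam):
    """Nested sets: larger lambda yields a larger set (threshold 1 - lambda)."""
    return np.flatnonzero(probs >= 1.0 - lam)

def calibrate_lambda_noisy(cal_probs, cal_noisy_labels, alpha, loss_bound=1.0):
    """Smallest lambda whose adjusted empirical risk on the noisy labels is <= alpha."""
    n = len(cal_probs)
    for lam in np.linspace(0.0, 1.0, 201):       # scan from small sets to large sets
        risks = [fnr_loss(y, prediction_set(p, lam))
                 for p, y in zip(cal_probs, cal_noisy_labels)]
        if (n * np.mean(risks) + loss_bound) / (n + 1) <= alpha:
            return lam
    return 1.0                                    # fall back to the full label set
```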
Multi-label Classification
In this section, we analyze the robustness of conformal risk control (Angelopoulos et al.,
2024) to label noise in a multi-label classification setting. We use the MS COCO data set
(Lin et al., 2014), in which the input image may contain up to K = 80 positive labels, i.e.,
Y ⊆ {1, 2, ..., K}. In the following experiment, we consider the annotations in this data
set as ground-truth labels. We have collected noisy labels from individual annotators who
annotated 117 images in total. On average, the annotators missed or mistakenly added
approximately 1.75 labels from each image. See Appendix B.5 for additional details about
this experimental setup and data collection.
We fit a TResNet (Ridnik et al., 2021) model on 100k clean samples and calibrate it
using 105 noisy samples with conformal risk control, as introduced in (Angelopoulos et al.,
2024, Section 3.2), to control the false-negative rate (FNR) at different levels. The FNR is
the ratio of positive labels of Y that are missed in a prediction set C, formally defined as:
L^FNR(Y, C) = 1 − |Y ∩ C| / |Y|.    (7)
We measure the FNR obtained over the clean and noisy test sets which contain 40k and 12
samples, respectively. Figure 5 displays the results, showing that the uncertainty sets attain
valid risk even though they were calibrated using corrupted data. In the following section,
we aim to explain these results and find the conditions under which label-noise robustness
is guaranteed.
Figure 5: FNR on MS COCO data set, achieved over noisy (red) and clean (green) test
sets. The calibration scheme is applied with noisy annotations. Results are averaged over
2000 trials.
In this section, we study the conditions under which conformal risk control is robust to
label noise in a multi-label classification setting. Recall that each sample contains up to
K positive labels, i.e., Y ⊆ {1, 2, ..., K}. Here, we assume a vector-flip noise model in which a binary random variable ε_i flips the i-th label with probability P(ε_i = 1):
g^vector-flip(y)_i = y_i(1 − ε_i) + (1 − y_i)ε_i,  i = 1, ..., K.    (8)
Above, y_i is an indicator that takes the value 1 if the i-th label is present in y, and 0 otherwise. Notice that the random variable ε_i takes the value 1 if the i-th label in y is flipped and 0 otherwise. We further assume that the noise is not adversarial, i.e., P(ε_i = 1) < 0.5
for all i ∈ {1, 2, ..., K}. We now show that a valid FNR risk is guaranteed in the presence
of label noise under the following assumptions.
Proposition 4 Let Ĉ_noisy(X_test) be a prediction set that contains the most likely labels, in the sense of Proposition 2, and controls the FNR risk of the noisy labels at level α. Assume the multi-label noise model g^vector-flip. If
1. the clean response Y_test | X_test = x is deterministic for every x ∈ X, and
2. the number of positive labels, i.e., |Ỹ_test| given X_test = x, is a constant for all x ∈ X,
then
E[L^FNR(Y_test, Ĉ_noisy(X_test))] ≤ α.
The proof and other additional theoretical results are provided in Appendices A.5.1- A.5.4.
Importantly, the determinism assumption on Y | X = x is reasonable as it is simply sat-
isfied when the noiseless response is defined as the consensus outcome. Nevertheless, this
assumption may not always hold in practice. Thus, in the next section, we propose alterna-
tive requirements for the validity of the FNR risk and demonstrate them in a segmentation
setting.
3.3.2 Segmentation
In segmentation tasks, the goal is to assign labels to every pixel in an input image such that
pixels with similar characteristics share the same label. For example, tumor segmentation
can be applied to identify polyps in medical images. Here, the response is a binary matrix Y ∈ {0, 1}^{W×H} whose (i, j) entry takes the value 1 if the corresponding pixel contains the object of interest and 0 otherwise. The uncertainty is represented by a prediction set C ⊆ {1, ..., W} ×
{1, ..., H} that includes pixels that are likely to contain the object. Similarly to the multi-
label classification problem, here we assume a vector flip noise model g^vector-flip that flips the (i, j) pixel in Y, denoted as Y_{i,j}, with probability P(ε_{i,j} = 1). We now show that the
prediction sets constructed using a noisy calibration set are guaranteed to have conservative
FNR if the clean response matrix Y and the noise variable are independent given X.
Proposition 5 Let Ĉ_noisy(X_test) be a prediction set that contains the most likely pixels, in the sense of Proposition 2, and controls the FNR risk of the noisy labels at level α. Suppose that:
1. The elements of the clean response matrix are independent of each other given X_test. That is, (Y_test)_{i,j} | X_test = x is independent of (Y_test)_{m,n} | X_test = x for all (i, j) ≠ (m, n) ∈ {1, ..., W} × {1, ..., H} and x ∈ X.
2. For a given input X_test, the noise level is the same for all response elements.
3. The noise variable is independent of Y_test | X_test = x, and the noises of different labels are independent of each other given X_test, similarly to condition 1.
Then,
E[L^FNR(Y_test, Ĉ_noisy(X_test))] ≤ α.
We note that a stronger version of this proposition that allows dependence between the
elements in Ytest is given in Appendix A.5.1, and the proof is in Appendix A.5.5. The
advantage of Proposition 5 over Proposition 4 is that here, the response matrix Ytest and
the number of positive labels are allowed to be stochastic. For this reason, we believe that
Proposition 5 is more suited for segmentation tasks, even though Proposition 4 applies in
segmentation settings as well.
This section studies the general regression setting in which Y ∈ R takes continuous values
and the loss function is an arbitrary function L(y, Cbnoisy (x)) ∈ R. Here, our objective is
to find tight bounds for the risk over the clean labels using the risk observed over the
corrupted labels. The main result of this section accomplishes this goal while making
minimal assumptions on the loss function, the noise model, and the data distribution.
Proposition 6 Let Ĉ_noisy(X_test) be a prediction interval that controls the risk at level α := E[L(Ỹ_test, Ĉ_noisy(X_test))]. Suppose that the second derivative of the loss L(y; C) is bounded for all y, C: q ≤ ∂²L(y; C)/∂y² ≤ Q for some q, Q ∈ R. If the labels are corrupted by the function g^add from (3) with a noise Z that satisfies E[Z] = 0, then
α − (1/2) Q · Var(Z) ≤ E[L(Y_test, Ĉ_noisy(X_test))] ≤ α − (1/2) q · Var(Z).
If we further assume that L is convex, then we obtain valid risk:
E[L(Y_test, Ĉ_noisy(X_test))] ≤ α.
The proof is detailed in Appendix A.6.1. Remarkably, Proposition 6 applies for any pre-
dictive model, calibration scheme, distribution of Y | X, and loss function that is twice
differentiable. The only requirement is that the noise must be additive and with zero mean.
We now demonstrate this result on a smooth approximation of the miscoverage loss, formulated as:
L^sm(y, [a, b]) = 2 / (1 + exp(−(2(y − a)/(b − a) − 1)²)) − 1.    (9)
Corollary 5 In the setting of Proposition 6, applied with the smooth miscoverage loss L^sm, we have
α − (1/2) Q · Var(Z) ≤ E[L^sm(Y_test, Ĉ_noisy(X_test))] ≤ α − (1/2) q · Var(Z),
where q = E_X[min_y ∂²L^sm(y, Ĉ_noisy(X))/∂y²] and Q = E_X[max_y ∂²L^sm(y, Ĉ_noisy(X))/∂y²] are known constants.
We now build upon Corollary 5 and establish a lower bound for the coverage rate achieved
by intervals calibrated with corrupted labels.
Proposition 7 Let Ĉ_noisy(X_test) be a prediction interval. Suppose that the labels are corrupted by the function g^add from (3) with a noise Z that satisfies E[Z] = 0. Then,
In Appendix A.6.2 we give additional details about this result as well as formulate a stronger
version of it that provides a tighter coverage bound. Finally, in Appendix A.6.3 we provide
an additional miscoverage bound which is more informative and tight for smooth densities
of Y | X = x. Table 2 summarizes all different risk-control settings with their corresponding
bounds. Lastly, in Appendix A.6.4 we analyze label-noise robustness in settings where the
response Y is a matrix, as in image-to-image regression tasks.
In this section, we focus on an online learning setting and show that all theoretical results
presented thus far also apply to the online framework. Here, the data is given as a stream
(Xt , Ỹt )t∈N in a sequential fashion. Crucially, we have access only to the noisy labels Ỹt ,
and the clean labels Yt are unavailable throughout the entire learning process. At time
stamp t ∈ N, our goal is to construct a prediction set Ĉ^t given all previously observed noisy samples (X_{t′}, Ỹ_{t′})_{t′=1}^{t−1}, along with the test feature vector X_t, that achieves a long-range risk controlled at a user-specified level α, i.e.,
R(Ĉ) = lim_{T→∞} (1/T) Σ_{t=1}^{T} L_t(Y_t, Ĉ^t_noisy(X_t)) = α.    (10)
Table 2: Summary of coverage bounds for different risk control tasks and different noise models
Importantly, in this online learning setting, the loss function L_t might be time-dependent and may vary throughout the learning process. Several calibration schemes have been developed that generate uncertainty sets with statistical guarantees in online settings, in the sense of (10). A popular approach is adaptive conformal inference (ACI), proposed by Gibbs and Candes (2021), an online calibration scheme that constructs prediction sets with a pre-specified coverage rate, in the sense of (10) with the miscoverage loss. Rolling risk control (Rolling RC) (Feldman et al., 2023b) extends ACI by providing guaranteed control of a general risk that may go beyond the binary loss. The main idea behind both these methods is to tune the calibration
parameter that controls the size of the prediction set according to the coverage or risk level
achieved in the past. See Appendix C.2 and Appendix C.3 for more details on ACI and RRC,
respectively. Yet, the guarantees of these approaches are invalidated when applied using
corrupted data. Nonetheless, we argue that uncertainty sets constructed using corrupted
data attain conservative risk in online settings under the requirements for offline label-noise
robustness presented thus far.
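To fix ideas, here is a minimal sketch of the ACI-style update that underlies both methods (our illustration, not the authors' implementation; the toy data stream, the step size γ, and the score are assumptions):

```python
import numpy as np

def aci_update(alpha_t, err_t, target_alpha, gamma=0.01):
    """One ACI step: alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t)."""
    return alpha_t + gamma * (target_alpha - err_t)

# toy online loop over a stream of noisy labels (illustrative data and score)
rng = np.random.default_rng(3)
target, alpha_t, past_scores, errs = 0.1, 0.1, [1.0], []
for t in range(2000):
    x_t = rng.uniform()
    y_t_noisy = np.sin(2 * np.pi * x_t) + 0.3 * rng.normal()   # observed noisy label
    s_t = abs(y_t_noisy - np.sin(2 * np.pi * x_t))             # residual score
    q_t = np.quantile(past_scores, min(max(1.0 - alpha_t, 0.0), 1.0))
    err = float(s_t > q_t)                                     # miscoverage on noisy label
    errs.append(err)
    alpha_t = aci_update(alpha_t, err, target)
    past_scores.append(s_t)
print("long-run miscoverage on the noisy labels:", np.mean(errs))
```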
Proposition 8 Suppose that the online calibration scheme controls the long-range risk over the noisy labels at level α, in the sense of (10), and that at every time stamp t ∈ N the expected loss over the clean label is no larger than the expected loss over the noisy label, i.e., E[L_t(Y_t, Ĉ^t_noisy(X_t))] ≤ E[L_t(Ỹ_t, Ĉ^t_noisy(X_t))]. Then,
lim_{T→∞} (1/T) Σ_{t=1}^{T} L_t(Y_t, Ĉ^t_noisy(X_t)) ≤ α.
The proof is given in Appendix A.7.1. In words, Proposition 8 states that if the expected
loss at every timestamp is conservative, then the risk over long-range windows in time
is guaranteed to be valid. Practically, this proposition shows that all theoretical results
presented thus far apply also in the online learning setting. We now demonstrate this result
in two settings: online classification and segmentation and show that valid risk is obtained
under the assumptions of Proposition 2 and Proposition 5, respectively.
Corollary 6 (Valid risk in online classification settings) Suppose that the distributions of Y_t | X_t and Ỹ_t | X_t satisfy the assumptions in Proposition 2 for all t ∈ N. If Ĉ^t_noisy(x) contains the most likely labels, in the sense of Proposition 2, for every t ∈ N and x ∈ X, then
lim_{T→∞} (1/T) Σ_{t=1}^{T} 1{Y_t ∉ Ĉ^t_noisy(X_t)} ≤ α.
Corollary 7 (Valid risk in online segmentation settings) Suppose that the distributions of Y_t | X_t and Ỹ_t | X_t satisfy the assumptions in Proposition 5 for all t ∈ N. If Ĉ^t_noisy(x) contains the most likely labels, in the sense of Proposition 5, for every t ∈ N and x ∈ X, then
lim_{T→∞} (1/T) Σ_{t=1}^{T} L^FNR(Y_t, Ĉ^t_noisy(X_t)) ≤ α.
Finally, in Appendix A.7.2 we analyze the effect of label noise on the miscoverage counter
loss (Feldman et al., 2023b). This loss assesses conditional validity in online settings by
counting occurrences of consecutive miscoverage events. In a nutshell, Proposition A.7
claims that with access to corrupted labels, the miscoverage counter is valid when the
miscoverage risk is valid. This is an interesting result, as it connects the validity of the
miscoverage counter to the validity of the miscoverage loss, where the latter is guaranteed
under the conditions established in Section 2.3.
4. Experiments
In this section, we focus on multi-class classification problems, where we study the validity
of conformal prediction using different types of label noise distributions, described below.
Class-independent noise. This noise model, which we call uniform flip, randomly flips the ground truth label into a different one with probability ε. Notice that this noise
model slightly differs from the random flip g flip from (6), since in this uniform flip setting,
a label cannot be flipped to the original label. Nonetheless, Proposition 2 states that the
coverage achieved by an oracle classifier is guaranteed to increase in this setting as well.
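A sketch of this corruption (our code; ε and the label space are illustrative):

```python
import numpy as np

def uniform_flip(labels, eps, num_classes, rng=None):
    """Flip each label to a *different* class, chosen uniformly, with probability eps."""
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels).copy()
    flip = rng.random(len(labels)) < eps
    # an offset in {1, ..., K-1} guarantees the new label differs from the old one
    offsets = rng.integers(1, num_classes, size=int(flip.sum()))
    labels[flip] = (labels[flip] + offsets) % num_classes
    return labels

y = np.tile(np.arange(5), 200)
y_noisy = uniform_flip(y, eps=0.3, num_classes=5, rng=np.random.default_rng(0))
print("observed flip rate:", (y_noisy != y).mean())
```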
Figure 6: Effect of label noise on synthetic multi-class classification data. Performance of conformal prediction sets with target coverage 1 − α = 90%, using a noisy training set and a noisy calibration set, under four settings: clean labels, uniform flip, rare-to-most-frequent class flip, and confusion matrix noise. Top: Marginal coverage; Bottom: Average size of predicted sets. The results are evaluated over 100 independent experiments and the gray bar represents the interquartile range.
The APS score tends to be more robust to label noise than HPS, which emphasizes the role of the score function.
In Appendix B.3 we provide additional experiments with adversarial noise models that
more aggressively reduce the coverage rate. Such adversarial cases are more pathological
and less likely to occur in real-world settings, unless facing a malicious attacker.
4.2 Regression
Figure 7: Marginal coverage (left) and average interval length (right) on simulated regression data under symmetric heavy-tailed, symmetric light-tailed, asymmetric, and biased noise, as a function of the noise magnitude c ∈ {0, 0.01, 0.1, 1.0}; the dashed lines indicate the nominal coverage level and the clean interval length.
where X̄ is the mean of the vector X, and Pois(λ) is the Poisson distribution with mean
λ. Both η1 and η2 are i.i.d. standard Gaussian variables, and U is a uniform random
variable on [0, 1]. The right-most term in (11) creates a few but large outliers. Figure 17
in the appendix illustrates the effect of the noise models discussed earlier on data sampled
from (11).
We apply conformal prediction with the CQR score (Romano et al., 2019) for each
noise model as follows. First, we fit a quantile random forest model on 8, 000 noisy train-
ing points; we then calibrate the model using 2, 000 fresh noisy samples; and, lastly, test
the performance on additional 5, 000 clean, ground truth samples. The results are sum-
marized in Figures 7 and 8. Observe how the prediction intervals tend to be conservative under the symmetric (both light- and heavy-tailed), asymmetric, and dispersive corruption models. Intuitively, this is because these noise models increase the
variability of Y ; in Proposition 1 we prove this formally for any symmetric independent noise
model, whereas here we show this result holds more generally even for response-dependent
noise. By contrast, the prediction intervals constructed under the biased and contractive
corruption models tend to under-cover the response variable. This should not surprise us:
following Figure 17(c), the biased noise shifts the data ‘upwards’, and, consequently, the
prediction intervals are undesirably pushed towards the positive quadrants. Analogously,
the contractive corruption model pushes the data towards the mean, leading to intervals
that are too narrow. Figure 20 in the appendix illustrates the scores achieved when using
the different noise models and the 90%’th empirical quantile of the CQR scores. This figure
supports the behavior witnessed in Figures 7, 18 and 19: over-coverage is achieved when
q̂noisy is larger than q̂clean , and under-coverage is obtained when q̂noisy is smaller.
In Appendix B.4 we study the effect of the predictive model on the coverage property,
for all noise models. To this end, we repeat similar experiments to the ones presented
above, however, we now fit the predictive model on clean training data; the calibration
data remains noisy. We also provide an additional adversarial noise model that reduces
the coverage rate, but is unlikely to appear in real-world settings. Figures 18 and 19 in
the appendix depict a similar behaviour for most noise models, except the biased noise
for which the coverage requirement is not violated. This can be explained by the improved
estimation of the low and high conditional quantiles, as these are fitted on clean data and
thus less biased.
In this section, we analyze conformal risk control in a multi-label classification task. For this
purpose, we use the CIFAR-100N data set (Wei et al., 2022), which contains 50K colored
images. Each image belongs to one of a hundred fine classes that are grouped into twenty
mutually exclusive coarse super-classes. Furthermore, every image has a noisy and a clean
label, where the noise rate of the fine categories is 40% and of the coarse categories is 25%.
We turn this single-label classification task into a multi-label classification task by merging
four random images into a 2 by 2 grid. Every image is used once in each position of the
grid, and therefore this new data set consists of 50K images, where each is composed of four
sub-images and thus associated with up to four labels. Figure 9 displays a visualization of
this new variant of the CIFAR-100N data set.
Figure 8: Marginal coverage (left) and average interval length (right) on simulated regression data under contractive and dispersive noise, calibrated with noisy (red) and clean (green) labels; the dashed line marks the nominal coverage level.
Figure 9: An example image from the multi-label variant of the CIFAR-100N data set, composed of four sub-images. True labels: baby, mushroom, tulip, bee. Noisy labels: baby, mushroom, sweet pepper, bee.
We fit a TResNet (Ridnik et al., 2021) model on 40k noisy samples and calibrate it using
2K noisy samples with conformal risk control, as outlined in (Angelopoulos et al., 2024,
Section 3.2). We control the false-negative rate (FNR) defined in (7) at different levels and
measure the FNR obtained over clean and noisy versions of the test set, which contains 8k
samples. We conducted this experiment twice: once with the fine-classed labels and once
with the super-classed labels. Figure 10 presents the results in both settings, showing that
the risk obtained over the clean labels is valid for every nominal level. Importantly, this
corruption setting violates the assumptions of Proposition 4, as the positive label count
may vary across different noise instantiations. This experiment reveals that valid risk can
be achieved in the presence of label noise even when the corruption model violates the
requirements of our theory.
Figure 10: FNR achieved over noisy (red) and clean (green) test sets of the CIFAR-100N
data set. Left: fine labels. Right: coarse labels. The calibration scheme is applied using
noisy annotations in all settings. Results are averaged over 50 trials.
4.4 Segmentation
In this section, we adopt a common artificial label corruption methodology, following Zhao and Gomes (2021); Kumar et al. (2020), and analyze three corruption setups that are
special cases of the vector-flip noise model from (8): independent, dependent, and partial.
In the independent setting, each pixel’s label is flipped with probability β, independently
of the others. In the dependent setting, however, two rectangles in the image are entirely
flipped with probability β, and the other pixels are flipped independently with probability
β. Finally, in the partial noise setting, only one rectangle in the image is flipped with
probability β, and the other pixels are unchanged.
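A sketch of the three corruption setups (ours; the rectangle positions and sizes are arbitrary illustrations, and the mask is assumed to be a 2-D array of 0/1 values):

```python
import numpy as np

def corrupt_mask(mask, beta, mode="independent", rng=None):
    """Flip entries of a 0/1 segmentation mask under the three corruption setups."""
    rng = rng or np.random.default_rng()
    noisy = mask.copy()
    h, w = mask.shape
    rects = [(h // 8, w // 8, h // 4, w // 4), (h // 2, w // 2, h // 4, w // 4)]
    in_rect = np.zeros_like(mask, dtype=bool)
    for r, c, rh, rw in rects:
        in_rect[r:r + rh, c:c + rw] = True

    if mode == "independent":            # every pixel flips independently w.p. beta
        flips = rng.random(mask.shape) < beta
        noisy[flips] = 1 - noisy[flips]
    elif mode == "dependent":            # rectangles flip as blocks, the rest independently
        flips = (rng.random(mask.shape) < beta) & ~in_rect
        noisy[flips] = 1 - noisy[flips]
        for r, c, rh, rw in rects:
            if rng.random() < beta:
                noisy[r:r + rh, c:c + rw] = 1 - noisy[r:r + rh, c:c + rw]
    elif mode == "partial":              # only one rectangle may flip, nothing else
        r, c, rh, rw = rects[0]
        if rng.random() < beta:
            noisy[r:r + rh, c:c + rw] = 1 - noisy[r:r + rh, c:c + rw]
    return noisy
```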
Figure 11: FNR on a polyp segmentation data set, achieved over noisy (red) and clean
(green) test sets. Left: independent noise. Middle: dependent noise. Right: partial noise.
The predictive model is calibrated using noisy annotations in all settings, where the noise
level is set to β = 0.1. Results are averaged over 1000 trials.
We experiment on a polyp segmentation task, pooling data from several polyp data sets:
Kvasir, CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. We consider the annotations given
in the data as ground-truth labels and artificially corrupt them according to the corruption
setups described above, to generate noisy labels. We use PraNet (Fan et al., 2020) as a base
model and fit it over 1450 noisy samples. Then, we calibrate it using 500 noisy samples with
conformal risk control, as outlined in (Angelopoulos et al., 2024, Section 3.2) to control the
false-negative rate (FNR) from (7) at different levels. Finally, we evaluate the FNR over
clean and noisy versions of the test set, which contains 298 samples, and report the results
in Figure 11. This figure indicates that conformal risk control is robust to label noise, as
the constructed prediction sets achieve conservative risk in all experimented noise settings.
This is not a surprise, as it is guaranteed by Propositions 5 and A.4.
We now turn to demonstrate the coverage rate bounds derived in Section 3.3.3 on real and
synthetic regression data sets. We examine two real benchmarks: meps 19 and bio used
in (Romano et al., 2019), and one synthetic data set that was generated from a bimodal
density function with a sharp slope, as visualized in Figure 12. The simulated data was de-
liberately designed to invalidate the assumptions of our label-noise robustness requirements
in Proposition 1. Consequently, prediction intervals that do not cover the two peaks of
the density function might undercover the true outcome, even if the noise is dispersive.
Therefore, this gap calls for our distribution-free risk bounds from Section 3.3.3, which are
applicable in this setup, in contrast to Proposition 1. The former approach can be used to
assess the worst risk level that may be obtained in practice. We consider the labels given
Figure 12: Visualization of the marginal density function of the adversarial synthetic data.
in the real and synthetic data sets as ground truth and artificially corrupt them according
to the additive noise model (3). The added noise is independently sampled from a normal
distribution with mean zero and variance 0.1Var(Y ).
For each data set and nominal risk level α, we fit a quantile regression model on 12K samples and learn the α/2 and 1 − α/2 conditional quantiles of the noisy labels. Then, we
calibrate its outputs using another 12K samples of the data with conformal risk control,
by Angelopoulos et al. (2024), to control the smooth miscoverage (9) at level α. Finally, we
evaluate the performance on the test set which consists of 6K samples. We also compute
the smooth miscoverage risk bounds according to Corollary 5 with a noise variance set to
0.1.
Figure 13 presents the risk bound along with the smooth miscoverage obtained over the
clean and noisy versions of the test set. This figure indicates that conformal risk control
generates invalid uncertainty sets when applied on the simulated noisy data, as we antic-
ipated. Additionally, this figure shows that the proposed risk bounds are valid and tight,
meaning that these are informative and effective. Moreover, this figure highlights the main
advantage of the proposed risk bounds: their validity is universal across all distributions
of the response variable and the noise component. Lastly, we note that in Appendix B.7
we repeat this experiment with the miscoverage loss and display the miscoverage bound
derived from Corollary 7.
Figure 13: Smooth miscoverage rate achieved over noisy (red) and clean (green) test sets.
The calibration scheme is applied using noisy annotations to control the smooth miscoverage
level. Results are averaged over 10 random splits of the calibration and test sets.
This section studies the effect of label noise on uncertainty quantification methods in an
online learning setting, as formulated in Section 3.4. We experiment on a depth estimation
task (Geiger et al., 2013), where the objective is to predict a depth map given a colored
image. In other words, X ∈ RW ×H×3 is an input RGB image of size W × H and Y ∈ RW ×H
is its corresponding depth map. We consider the original depth values given in this data
as ground truth and artificially corrupt them according to the additive noise model defined
in (3) to produce noisy labels. Specifically, we add to each depth pixel an independent
random noise drawn from a normal distribution with zero mean and 0.7 variance. Here,
the depth uncertainty of the (i, j) pixel is represented by a prediction interval C^{i,j}(X) ⊆ R.
Ideally, the estimated intervals should contain the correct depth values at a pre-specified
level 1 − α. In this high-dimensional setting, this requirement is formalized as controlling
the image miscoverage loss, defined as:
L^im(Y, C(X)) = (1/(WH)) Σ_{i=1}^{W} Σ_{j=1}^{H} 1{Y^{i,j} ∉ C^{i,j}(X)}.    (12)
In words, the image miscoverage loss measures the proportion of depth values that were not
covered in a given image.
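A direct implementation of (12) (our sketch; the inputs are assumed to be NumPy arrays of matching shape):

```python
import numpy as np

def image_miscoverage(depth, lower, upper):
    """Image miscoverage (12): fraction of pixels whose depth value falls outside
    its per-pixel interval [lower, upper]."""
    outside = (depth < lower) | (depth > upper)
    return float(outside.mean())
```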
For this purpose, we employ the calibration scheme Rolling RC (Feldman et al., 2023b),
which constructs uncertainty sets in an online setting with a valid risk guarantee in the sense
of (10). We follow the experimental protocol outlined in (Feldman et al., 2023b, Section
4.2) and apply Rolling RC with an exponential stretching to control the image miscoverage
loss at different levels on the observed, noisy, labels. We use LeReS (Yin et al., 2021) as a
base model, which was pre-trained on a clean training set that corresponds to timestamps
1,...,6000. We continue training it and updating the calibration scheme in an online fashion
on the following 2000 timestamps. We consider these samples, indexed by 6001 to 8000,
as a validation set and use it to choose the calibration’s hyperparameters, as explained in
(Feldman et al., 2023b, Section 4.2). Finally, we continue the online procedure on the test
samples whose indexes correspond to 8001 to 10000, and measure the performance on the
clean and noisy versions of this test set.
Figure 14 displays the risk achieved by this technique over the clean and corrupted
labels. This figure indicates that Rolling RC attains valid image miscoverage over the
unknown noiseless labels. This is not a surprise, as it is supported by Proposition A.6
which guarantees conservative image miscoverage under label noise in an offline setting,
and Proposition 8 which states that the former result applies to an online learning setting
as well.
Figure 14: Image miscoverage achieved by Rolling RC. Results are averaged over 10 random
trials.
5. Discussion
Label noise independently of conformal prediction has been well-studied, especially for
training more robust predictive models; see, for example (Angluin and Laird, 1988; Frénay
and Verleysen, 2013; Tanno et al., 2019; Algan and Ulusoy, 2020; Kumar et al., 2020).
Recently, there has been a body of work studying the statistical properties of conformal
prediction (Lei et al., 2018; Barber, 2020) and its performance under deviations from ex-
changeability (Tibshirani et al., 2019; Podkopaev and Ramdas, 2021). This line of work is
relevant to us since the label noise setting violates the exchangeability assumption between
the training and testing data, where the latter is clean while the former is noisy. However,
these works cannot be applied in our setting since they assume covariate or label shift only.
The work by Barber et al. (2023) refers to any distribution shift, and we build upon some of
its results in our general disclaimer in Section 2.4. Another relevant work is (Farinhas et al.,
2024), which studies the effect of a general distribution shift on the obtained risk of risk-
controlling techniques, thus extending the work of Barber et al. (2023) to more general loss functions than the miscoverage loss. Additionally, Angelopoulos et al. (2024) analyzes their proposed risk-controlling method in covariate shift and general distribution shift
settings. Close works to ours include (Stutz et al., 2023), which analyzes the performance of conformal prediction under ambiguous ground truth, and Cauchois et al. (2022), which studies conformal prediction with weak supervision, a setting that could be interpreted as a type
of noisy label.
Lastly, a follow-up work to ours has recently been published (Sesia et al., 2023). Sim-
ilarly to our work, it begins by studying the effect of label noise on the coverage achieved
by standard conformal prediction. Then, an explicit factor that depicts the inflation or
deflation of the coverage is estimated to adjust the desired coverage rate. The theoretical
analysis requires no assumptions on the contamination process, but in order to achieve an
applicable method that can automatically adapt to label noise, some mild assumptions on
the relation between the clean and observable labels are used. Indeed, the presented experi-
ments demonstrate less conservative coverage compared to the standard method. However,
there are two key distinctions between this work and ours. It focuses on controlling the
coverage rate in classification tasks, whilst our analysis extends to regression tasks, general
risk control, and online settings. Additionally, it aims to modify the calibration algorithm
to account for label noise, whereas our goal is to test the limits of conformal prediction in
the label noise setting and reveal the conditions on the scores, noise models, and predictive
models, under which the standard algorithm remains valid despite the presence of noisy
labels.
Our work raises many new questions. First, one can try to define a score function that is more robust to label noise, continuing the line of Gendler et al. (2021); Frénay and Verleysen
(2013); Cheng et al. (2022). Second, an important remaining question is how to achieve
exact risk control on the clean labels using minimal information about the noise model.
Lastly, it would be interesting to analyze the robustness of alternative conformal methods
such as cross-conformal and jackknife+ (Vovk, 2015; Barber et al., 2021) that do not require
data-splitting.
Acknowledgments
Y.R., A.G., B.E., and S.F. were supported by the ISRAEL SCIENCE FOUNDATION
(grant No. 729/21). Y.R. thanks the Career Advancement Fellowship, Technion, for provid-
ing research support. A.N.A. was supported by the National Science Foundation Graduate
Research Fellowship Program under Grant No. DGE 1752814. S.F. thanks Aviv Adar, Idan
Aviv, Ofer Bear, Tsvi Bekker, Yoav Bourla, Yotam Gilad, Dor Sirton, and Lia Tabib for
annotating the MS COCO data set.
Note that the probability is only taken over s̃_test. Since q̂_noisy is constant (measurable) with respect to this probability, we have that, for any α ∈ (0, 1),
P(s_test ≤ q̂_noisy) ≥ P(s̃_test ≤ q̂_noisy) ≥ 1 − α,
where the last step follows from the standard conformal guarantee applied to the noisy scores. This implies that Y_test ∈ Ĉ_noisy(X_test) with probability at least 1 − α, completing the proof of the lower bound.
Regarding the upper bound, by the same argument,
P(s_test ≤ q̂_noisy) ≤ P(s̃_test ≤ q̂_noisy) + u ≤ 1 − α + 1/(n + 1) + u.
A.2 Regression
Theorem A.1 Suppose an additive noise model g^add with a noise that has mean 0. Denote the prediction interval as C(x) = [a_x, b_x]. If for all x ∈ X the density of Y | X = x is peaked inside the interval, in the sense that for all ε ≥ 0,
f_{Y|X=x}(b_x + ε) ≤ f_{Y|X=x}(b_x − ε)  and  f_{Y|X=x}(a_x − ε) ≤ f_{Y|X=x}(a_x + ε),
then P(Y ∈ C(X) | X = x) ≥ P(Ỹ ∈ C(X) | X = x) for all x ∈ X.
Proof For ease of notation, we omit the conditioning on X = x. In other words, we treat Y as Y | X = x for some x ∈ X. We begin by showing P(Ỹ ≤ b) ≤ P(Y ≤ b).
P(Ỹ ≤ b)
= P(Y + ε ≤ b)
= P(Y + ε ≤ b | ε ≥ 0)P(ε ≥ 0) + P(Y + ε ≤ b | ε ≤ 0)P(ε ≤ 0)
= (1/2)·P(Y + ε ≤ b | ε ≥ 0) + (1/2)·P(Y + ε ≤ b | ε ≤ 0)
= (1/2)·P(Y ≤ b − ε | ε ≥ 0) + (1/2)·P(Y ≤ b + ε | ε ≥ 0)
= (1/2)·E_{ε≥0}[P(Y ≤ b − ε) + P(Y ≤ b + ε)]
= P(Y ≤ b) + (1/2)·E_{ε≥0}[P(Y ≤ b − ε) − P(Y ≤ b) + P(Y ≤ b + ε) − P(Y ≤ b)]
= P(Y ≤ b) + (1/2)·E_{ε≥0}[P(Y ≤ b + ε) − P(Y ≤ b) − (P(Y ≤ b) − P(Y ≤ b − ε))]
= P(Y ≤ b) + (1/2)·E_{ε≥0}[P(b ≤ Y ≤ b + ε) − P(b − ε ≤ Y ≤ b)]
≤ P(Y ≤ b) + (1/2)·E_{ε≥0}[0]
= P(Y ≤ b).
The last inequality follows from the assumption that ∀ε ≥ 0 : fY (b + ε) ≤ fY (b − ε). The
proof for P(Ỹ ≥ a) ≤ P(Y ≥ a) is similar and hence omitted. We get that:
P(Ỹ ∈ C(x))
= P(a ≤ Ỹ ≤ b)
= P(Ỹ ≤ b) − P(Ỹ < a)
≤ P(Y ≤ b) − P(Y < a)   (follows from the above)
= P(a ≤ Y ≤ b)
= P(Y ∈ C(x)).
A.3 Classification
Theorem A.2 Suppose that Cbnoisy (x) ⊆ {1, ..., K} is a prediction set. Denote β := P(Ỹ ∈
Cbnoisy (x) | X = x). First, the coverage rate achieved over the clean labels is upper bounded
by:
P(Y ∈ Ĉ_noisy(x) | X = x) ≤ β + (1/2) Σ_{i=1}^K |P(Ỹ = i | X = x) − P(Y = i | X = x)|.
Proof First, for ease of notation, we omit the conditioning on x and consider a prediction
set C ⊆ {1, ..., K}. Notice that if C = ∅ the proposition is trivially satisfied.
∀i ∈ {1, ..., K}: i ≤ g ⟺ δ_i ≥ 0, and i > g ⟺ δ_i ≤ 0.

Σ_{i=1}^m δ_i = Σ_{i=1}^g δ_i + Σ_{i=g+1}^m δ_i ≥ Σ_{i=1}^g δ_i + Σ_{i=g+1}^K δ_i = Σ_{i=1}^K δ_i = 0.
By taking the expectation over X ∼ PX we obtain the desired marginal coverage bounds.
The only non-trivial transition is marginalizing the TV-distance, which follows from the
integral absolute value inequality:
∫_{x∈X} |P(Ỹ = i | X = x) − P(Y = i | X = x)| dP_X(x) ≥ | ∫_{x∈X} ( P(Ỹ = i | X = x) − P(Y = i | X = x) ) dP_X(x) | = |P(Ỹ = i) − P(Y = i)|.
In the random flip setting, |δ_i| = |P(Y = i)ε − ε/K| = ε |P(Y = i) − 1/K|. Therefore:

(1/2) Σ_{i=1}^K |δ_i| = (ε/2) Σ_{i=1}^K |P(Y = i) − 1/K|
= (ε/2) [ Σ_{i=1}^g |P(Y = i) − 1/K| + Σ_{i=g+1}^K |P(Y = i) − 1/K| ]
= (ε/2) [ Σ_{i=1}^g ( P(Y = i) − 1/K ) + Σ_{i=g+1}^K ( 1/K − P(Y = i) ) ]
= (ε/2) [ Σ_{i=1}^g P(Y = i) − g/K + (K − g)/K − Σ_{i=g+1}^K P(Y = i) ]
≤ (ε/2) [ 1 − g/K + (K − g)/K − 0 ]
≤ (ε/2) [ (K − g)/K + (K − g)/K ]
≤ ε (K − g)/K
≤ ε (K − 1)/K.
Thus, we get:
P(Ỹ ∈ C) ≤ P(Y ∈ C) ≤ P(Ỹ ∈ C) + (1/2) Σ_{i=1}^K |P(Ỹ = i) − P(Y = i)| ≤ P(Ỹ ∈ C) + ε (K − 1)/K,
which concludes the proof.
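As a quick numerical check of the bound ε(K − 1)/K, here is a short sketch (not from the paper); the class distribution and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
K, eps = 10, 0.2

# Illustrative class distribution; any probability vector works here.
p_clean = rng.dirichlet(np.ones(K))
# Random flip: with probability eps the label is replaced by a uniform class.
p_noisy = (1 - eps) * p_clean + eps / K

tv = 0.5 * np.abs(p_noisy - p_clean).sum()
bound = eps * (K - 1) / K
print(f"TV distance: {tv:.4f}, bound eps*(K-1)/K: {bound:.4f}")
assert tv <= bound + 1e-12
```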
The confusion matrix noise model is more realistic than the random flip. However,
there exists a score function that causes conformal prediction to fail for any non-identity
confusion matrix. We define the corruption model as follows: consider a matrix T in which
(T )i,j = P(Ỹ = j | Y = i).
g^confusion(y) =
  1  w.p. T_{1,y}
  ⋮
  K  w.p. T_{K,y}.
Proposition A.1 Let Cbnoisy be constructed as in Recipe 1 with any score function s and
the corruption function g confusion . Then,
P( Ytest ∈ Ĉ_noisy(Xtest) ) ≥ 1 − α.
Σ_{j'=1}^K P(Ỹ = j' | Y = j) P(s̃ ≤ t | Ỹ = j') ≥ P(s ≤ t | Y = j).
P(s̃ ≤ t) = E[ P(s̃ ≤ t | Ỹ) ] = Σ_{j=1}^K Σ_{j'=1}^K w_j P(Ỹ = j' | Y = j) P(s̃ ≤ t | Ỹ = j').
We can write

P(s̃ ≤ t) − P(s ≤ t) = Σ_{j=1}^K Σ_{j'=1}^K w_j P(Ỹ = j' | Y = j) P(s̃ ≤ t | Ỹ = j') − Σ_{j=1}^K w_j P(s ≤ t | Y = j)
= Σ_{j=1}^K w_j P(s ≤ t | Y = j) ( Σ_{j'=1}^K P(Ỹ = j' | Y = j) · P(s̃ ≤ t | Ỹ = j') / P(s ≤ t | Y = j) − 1 ).
The stochastic dominance condition holds uniformly over all choices of base probabilities
wj if and only if for all j ∈ [K],
Σ_{j'=1}^K P(Ỹ = j' | Y = j) P(s̃ ≤ t | Ỹ = j') ≥ P(s ≤ t | Y = j).
Notice that the left-hand side of the above display is a convex mixture of the quantiles
P(s̃ ≤ t | Ỹ = j 0 ) for j 0 ∈ [K]. Thus, the necessary and sufficient condition is for the noise
distribution P(Ỹ = j 0 | Y = j) to place sufficient mass on the classes j 0 whose quantiles
are larger than P(s ≤ t | Y = j). But of course, without assumptions on the model and
score, the latter is unknown, so it is impossible to say which noise distributions will preserve
coverage.
When using the non-random APS scores with the oracle model, the prediction set size for
a given x can be expressed as
k* = min{ k : Σ_{j=1}^k π_(j)(x) ≥ 1 − α },

where π_y(x) = P(Y = y | X = x), the desired coverage level is 1 − α, and π_(1)(x) ≥ π_(2)(x) ≥ ··· ≥ π_(K)(x) are the order statistics of π_1(x), π_2(x), ..., π_K(x). Under the random flip noise, the noisy conditional class probabilities are given by

π̃_y(x) = (1 − ε) π_y(x) + ε/K,

where K is the number of labels and ε is the fraction of flipped labels. Therefore, the noisy set size is given by:

k*,noisy = min{ k : Σ_{j=1}^k π̃_(j)(x) ≥ 1 − α }
= min{ k : Σ_{j=1}^k π_(j)(x) + ε ( k/K − Σ_{j=1}^k π_(j)(x) ) ≥ 1 − α } ≥ k*.

As a result, we get k*,noisy ≥ k*, where the term that controls the inflation of the noisy set size, ε ( k/K − Σ_{j=1}^k π_(j)(x) ), is non-positive, and is a function of the noise level and the oracle conditional class probabilities.
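The set-size inflation can be checked directly from a vector of oracle class probabilities; below is a small sketch (not from the paper), with the class distribution, noise level ε, and coverage level chosen for illustration.

```python
import numpy as np

def aps_oracle_set_size(pi, alpha):
    """Size of the non-random APS set built from class probabilities pi."""
    order = np.sort(pi)[::-1]                 # pi_(1) >= pi_(2) >= ...
    cumsum = np.cumsum(order)
    return int(np.argmax(cumsum >= 1 - alpha) + 1)

rng = np.random.default_rng(2)
K, eps, alpha = 10, 0.2, 0.1
pi = rng.dirichlet(np.ones(K))                # illustrative oracle probabilities
pi_noisy = (1 - eps) * pi + eps / K           # random flip noise

k_clean = aps_oracle_set_size(pi, alpha)
k_noisy = aps_oracle_set_size(pi_noisy, alpha)
print(f"k* = {k_clean}, k*_noisy = {k_noisy}")
assert k_noisy >= k_clean                     # set inflation under label noise
```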
Since Y and Ỹ are not equal in distribution, we know that the set A is nonempty and P(Ỹ ∈ A) = δ1 > P(Y ∈ A) = δ2 ≥ 0.
The adversarial choice of score function will be s(x, y) = 1 {y ∈ Ac }; it puts high mass
wherever the ground truth label is more likely than the noisy label. The crux of the
argument is that this design makes the quantile smaller when it is computed on the noisy
data than when it is computed on clean data, as we next show.
Begin by noticing that, because s(x, y) is binary, q̂_clean is also binary, and therefore
q̂_clean ≥ t ⟺ q̂_clean = 1 for any t ∈ (0, 1]. Furthermore, q̂_clean = 1 if and only if |E ∩ A| < ⌈(n + 1)(1 − α)⌉.
Thus, these events are the same, and for any t ∈ (0, 1],

P(q̂_clean ≥ t) = P( |E ∩ A| < ⌈(n + 1)(1 − α)⌉ ).

By the definition of A, we have that P( |E ∩ A| < ⌈(n + 1)(1 − α)⌉ ) > P( |Ẽ ∩ A| < ⌈(n + 1)(1 − α)⌉ ). Chaining the inequalities, we get

P(q̂_clean ≥ t) > P( |Ẽ ∩ A| < ⌈(n + 1)(1 − α)⌉ ) = P(q̂ ≥ t).
Since sn+1 is measurable with respect to E and Ẽ, we can plug it in for t, yielding the
conclusion.
Remark A.3 In the above argument, if one further assumes continuity of the (ground
truth) score function and P(Ỹ ∈ A) = P(Y ∈ A) + ρ for
ρ = inf{ ρ' > 0 : BinomCDF(n, δ1, ⌈(n + 1)(1 − α)⌉ − 1) + 1/n < BinomCDF(n, δ2 + ρ', ⌈(n + 1)(1 − α)⌉ − 1) },

then

P(s_{n+1} ≤ q̂) < 1 − α.
In other words, the noise must have some sufficient magnitude in order to disrupt coverage.
Next, we provide an additional impossibility result, similar to Proposition 3.
Proposition A.2 Consider the score s(x, y) = |f̂(x) − y| and the noise model Ỹ = βY for β ∈ (0, 1). Assume
further that f̂(x) = E[Ỹ | X = x]. Then there exists α0 ∈ [0, 1] such that for any α satisfying
1 − α ≥ 1 − α0: P(Y ∈ Ĉ_noisy(X)) ≤ 1 − α.

Proof Write µ_x := E[Y | X = x], so that f̂(x) = βµ_x and the score is s = |f̂(x) − Y| = |βµ_x − Y|.
We will show that F_s̃(t) ≥ F_s(t) for all t ∈ R. Notice that we show this conditional on X (we do
not write the conditioning explicitly to enhance clarity).
Lastly, we restore the conditioning on X in the notation. Denote α0 = inf_{x∈X} α0(x).
Suppose that α ∈ [0, 1] satisfies 1 − α ≥ 1 − α0 . Denote by q̂noisy the 1 − α quantile of the
noisy scores s̃, marginally on x ∈ X . Then, we obtain:
1 − α = P(Ỹ ∈ Cnoisy (X)) = P(s̃ ≤ q̂noisy ) ≥ P(s ≤ q̂noisy ) = P(Y ∈ Cnoisy (X)),
Proof

TV( N(0, τ²), N(0, τ² + σ²) ) = (1/2) ∫_{−∞}^{∞} | φ_{τ²}(x) − φ_{τ²+σ²}(x) | dx → 1 as τ → 0,

where φ_v denotes the density of N(0, v).
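The limit can also be verified numerically; the following sketch (not from the paper) evaluates the total-variation distance on a fine grid for shrinking τ, with the grid range and σ chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

def tv_gaussians(tau, sigma, lo=-20.0, hi=20.0, num=400_001):
    """TV distance between N(0, tau^2) and N(0, tau^2 + sigma^2), computed as
    0.5 * integral of |p - q| on a uniform grid (simple Riemann sum)."""
    grid, dx = np.linspace(lo, hi, num, retstep=True)
    p = norm.pdf(grid, scale=tau)
    q = norm.pdf(grid, scale=np.sqrt(tau**2 + sigma**2))
    return 0.5 * np.sum(np.abs(p - q)) * dx

for tau in (1.0, 0.3, 0.1, 0.01):
    print(tau, round(tv_gaussians(tau, sigma=1.0), 4))   # tends to 1 as tau -> 0
```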
In this section, we prove all FNR robustness propositions. Here, we suppose that the
response Y ∈ {0, 1}^n is a binary vector of size n ∈ N, where Y_i = 1 indicates that the i-th
label is present. We further suppose a vector-flip noise model from (8), where ε is a binary
random vector of size n as well. These notations apply for segmentation tasks as well, by
flattening the response matrix into a vector. The prediction set C(X) ⊆ {1, ..., n} contains
a subset of the labels. We begin by providing additional theoretical results and then turn
to the proofs.
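For concreteness, here is a minimal sketch (not from the paper) of the false-negative proportion of a prediction set for a binary label vector under the vector-flip corruption; the noise rate and the prediction set are illustrative assumptions.

```python
import numpy as np

def fnr_loss(y, pred_set):
    """False-negative proportion: fraction of present labels (y_i = 1)
    that are missed by the prediction set."""
    present = np.flatnonzero(y == 1)
    if present.size == 0:
        return 0.0
    missed = np.setdiff1d(present, np.fromiter(pred_set, dtype=int))
    return missed.size / present.size

rng = np.random.default_rng(3)
n = 20
y = rng.binomial(1, 0.3, size=n)          # clean binary response vector
eps = rng.binomial(1, 0.1, size=n)        # vector-flip noise indicators
y_noisy = np.abs(y - eps)                 # flip the i-th label where eps_i = 1

pred_set = {i for i in range(n) if rng.random() < 0.7}  # an arbitrary prediction set
print(fnr_loss(y, pred_set), fnr_loss(y_noisy, pred_set))
```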
Proposition A.4 Let Ĉ_noisy(Xtest) be a prediction set that controls the FNR risk of the
noisy labels at level α. Suppose that

1. The prediction set contains the most likely labels in the sense that for all x ∈ X,
k ∈ Ĉ_noisy(x), i ∉ Ĉ_noisy(x) and m ∈ N:

P( Y_k = 1 | Σ_{j≠i,k} Y_j = m, X = x ) ≥ P( Y_i = 1 | Σ_{j≠i,k} Y_j = m, X = x ).

2. For a given input X = x, the noise level of all response elements is the same, i.e., for
all i ≠ j ∈ {1, ..., K}: P(ε_i = 1 | X = x) = P(ε_j = 1 | X = x).

4. The noises of different labels are independent of each other given X, i.e., ε_i ⊥⊥ ε_j |
X = x for all i ≠ j ∈ {1, ..., K} and x ∈ X.

Then

E[ L_FNR( Ytest, Ĉ_noisy(Xtest) ) ] ≤ α.
In this section, we formulate and prove a general lemma that is used to prove label-noise
robustness with the false-negative rate loss.
Lemma A.4 Suppose that x ∈ X is an input variable and C(x) is a prediction set. Denote
by β_i the noise level at the i-th element: β_i := P(ε_i = 1 | X = x). We define:

e_{k,i}(y) := β_i y_k E[ 1 / Σ_{j=1}^n Ỹ_j | ε_i = 1, Y = y ].
Proof For ease of notation, we omit the conditioning on X = x. That is, we take some
x ∈ X and treat Y as Y | X = x. We also denote the prediction set as C = C(x). Denote:
Without loss of generality, we assume that C = {1, ..., p}. We now compute the expectation
of each term separately.

E[ (Σ_i ε_i)(Σ_{k=1}^p Y_k) / ( (Σ_j Y_j)(Σ_j (Y_j − 2ε_j Y_j + ε_j)) ) | Y = y ]
= Σ_i E[ ε_i (Σ_{k=1}^p Y_k) / ( (Σ_j Y_j)(Σ_j (Y_j − 2ε_j Y_j + ε_j)) ) | Y = y ]
= Σ_i ( (Σ_{k=1}^p y_k) / Σ_j y_j ) E[ ε_i / Σ_j (Y_j − 2ε_j Y_j + ε_j) | Y = y ]
= Σ_i ( (Σ_{k=1}^p y_k) / Σ_j y_j ) E[ ε_i / Σ_j (Y_j − 2ε_j Y_j + ε_j) | ε_i = 1, Y = y ] P(ε_i = 1)
= ( 1 / Σ_j y_j ) Σ_i β_i (Σ_{k=1}^p y_k) E[ 1 / ( (1 − Y_i) + Σ_{j≠i} (Y_j − 2ε_j Y_j + ε_j) ) | ε_i = 1, Y = y ]
= ( 1 / Σ_j y_j ) Σ_i Σ_{k=1}^p β_i y_k E[ 1 / ( (1 − Y_i) + Σ_{j≠i} (Y_j − 2ε_j Y_j + ε_j) ) | ε_i = 1, Y = y ].

E[ (Σ_i Y_i)(Σ_{k=1}^p ε_k) / ( (Σ_j Y_j)(Σ_j (Y_j − 2ε_j Y_j + ε_j)) ) | Y = y ]
= Σ_i E[ Y_i (Σ_{k=1}^p ε_k) / ( (Σ_j Y_j)(Σ_j (Y_j − 2ε_j Y_j + ε_j)) ) | Y = y ]
= Σ_i ( y_i / Σ_j y_j ) E[ (Σ_{k=1}^p ε_k) / Σ_j (Y_j − 2ε_j Y_j + ε_j) | Y = y ]
= Σ_i ( y_i / Σ_j y_j ) Σ_{k=1}^p E[ ε_k / Σ_j (Y_j − 2ε_j Y_j + ε_j) | Y = y ]
= Σ_i ( y_i / Σ_j y_j ) Σ_{k=1}^p β_k E[ ε_k / Σ_j (Y_j − 2ε_j Y_j + ε_j) | ε_k = 1, Y = y ]
= Σ_i ( y_i / Σ_j y_j ) Σ_{k=1}^p β_k E[ 1 / ( (1 − Y_k) + Σ_{j≠k} (Y_j − 2ε_j Y_j + ε_j) ) | ε_k = 1, Y = y ]
= ( 1 / Σ_j y_j ) Σ_i Σ_{k=1}^p β_k y_i E[ 1 / ( (1 − Y_k) + Σ_{j≠k} (Y_j − 2ε_j Y_j + ε_j) ) | ε_k = 1, Y = y ].
E[δ] = E[ Σ_{i=p+1}^n Σ_{k=1}^p ( e_{k,i}(Y) − e_{i,k}(Y) ) / Σ_j Y_j ] = Σ_{i=p+1}^n Σ_{k=1}^p E[ ( e_{k,i}(Y) − e_{i,k}(Y) ) / Σ_j Y_j ] ≥ 0.
Above, the last inequality follows from the assumption of this lemma. We now marginalize
the above result to obtain valid marginal risk:
E[δ] = E_X[ E_{Ỹ,Y | X=x}[ δ | X = x ] ] ≥ E_X[0] = 0.
We now compute E[ ( e_{k,i}(Y) − e_{i,k}(Y) ) / Σ_j Y_j ] for k < i:

E[ ( e_{k,i}(Y) − e_{i,k}(Y) ) / Σ_j Y_j ]
= E[ ( β_i Y_k E[ 1 / ( (1 − Y_i) + Σ_{j≠i} (Y_j − 2ε_j Y_j + ε_j) ) | Y ] − β_k Y_i E[ 1 / ( (1 − Y_k) + Σ_{j≠k} (Y_j − 2ε_j Y_j + ε_j) ) | Y ] ) / Σ_j Y_j ]
= Σ_{y∈Y} P(Y = y) [ ( β_i y_k / Σ_j y_j ) E[ 1 / ( (1 − y_i) + Σ_{j≠i} (y_j − 2ε_j y_j + ε_j) ) | Y = y ] − ( β_k y_i / Σ_j y_j ) E[ 1 / ( (1 − y_k) + Σ_{j≠k} (y_j − 2ε_j y_j + ε_j) ) | Y = y ] ]
= Σ_{y∈Y: y_i ≠ y_k} P(Y = y) ( β_i y_k / Σ_j y_j ) E[ 1 / ( (1 − y_i) + Σ_{j≠i} (y_j − 2ε_j y_j + ε_j) ) | Y = y ]
  − Σ_{y∈Y: y_i ≠ y_k} P(Y = y) ( β_k y_i / Σ_j y_j ) E[ 1 / ( (1 − y_k) + Σ_{j≠k} (y_j − 2ε_j y_j + ε_j) ) | Y = y ]
= Σ_{y∈Y: y_k = 1, y_i = 0} P(Y = y) ( β_i / Σ_j y_j ) E[ 1 / ( 2 − ε_k + Σ_{j≠i,k} (y_j − 2ε_j y_j + ε_j) ) | Y = y ]
  − Σ_{y∈Y: y_i = 1, y_k = 0} P(Y = y) ( β_k / Σ_j y_j ) E[ 1 / ( 2 − ε_i + Σ_{j≠k,i} (y_j − 2ε_j y_j + ε_j) ) | Y = y ].
To simplify the equation, denote by y* the vector y with indexes i, k swapped, that is:

y*_j = y_j for j ≠ i, k;   y*_k = y_i;   y*_i = y_k.

Notice that β_k E[ 1 / ( 2 − ε_i + Σ_{j≠k,i} (y_j − 2ε_j y_j + ε_j) ) | Y = y ] = β_i E[ 1 / ( 2 − ε_k + Σ_{j≠k,i} (y_j − 2ε_j y_j + ε_j) ) | Y = y ], since ε_i and ε_k are equal in distribution and these variables are independent of Y_i and of ε_j for j ≠ i, k. Further denote γ_m = ( β / Σ_j y_j ) E[ 1 / ( 2 − ε_k + Σ_{j≠i,k} (y_j − 2ε_j y_j + ε_j) ) | Y = y ] for Σ_{j≠i,k} y_j = m and y_i ≠ y_k. Notice that γ_m is well defined, as β E[ 1 / ( 2 − ε_k + Σ_{j≠i,k} (y_j − 2ε_j y_j + ε_j) ) | Y = y ] has the same value for every y such that Σ_{j≠i,k} y_j = m, since Y and ε are independent. Also, Σ_j y_j = m + 1 if Σ_{j≠i,k} y_j = m and y_i ≠ y_k. Lastly, we denote γ'_m = γ_m P( Σ_{j≠i,k} Y_j = m ). We continue computing
E[ ( e_{k,i}(Y) − e_{i,k}(Y) ) / Σ_j Y_j ] for k < i:

E[ ( e_{k,i}(Y) − e_{i,k}(Y) ) / Σ_j Y_j ]
= Σ_{y∈Y: y_k = 1, y_i = 0} ( P(Y = y) − P(Y = y*) ) ( β / Σ_j y_j ) E[ 1 / ( 2 − ε_k + Σ_{j≠i,k} (y_j − 2ε_j y_j + ε_j) ) | Y = y ]
= Σ_m Σ_{y∈Y: y_k = 1, y_i = 0, Σ_{j≠i,k} y_j = m} ( P(Y = y) − P(Y = y*) ) γ_m
= Σ_m γ_m Σ_{y∈Y: y_k = 1, y_i = 0, Σ_{j≠i,k} y_j = m} [ P(Y = y) − P(Y = y*) ]
= Σ_m γ_m [ Σ_{y∈Y: y_k = 1, y_i = 0, Σ_{j≠i,k} y_j = m} P(Y = y) − Σ_{y∈Y: y_i = 1, y_k = 0, Σ_{j≠i,k} y_j = m} P(Y = y) ]
= Σ_m γ_m [ P( Y_k = 1, Y_i = 0, Σ_{j≠i,k} Y_j = m ) − P( Y_k = 0, Y_i = 1, Σ_{j≠i,k} Y_j = m ) ]
= Σ_m γ'_m [ P( Y_k = 1, Y_i = 0 | Σ_{j≠i,k} Y_j = m ) − P( Y_k = 0, Y_i = 1 | Σ_{j≠i,k} Y_j = m ) ]
= Σ_m γ'_m [ P( Y_k = 1, Y_i = 0 | Σ_{j≠i,k} Y_j = m ) + P( Y_k = 1, Y_i = 1 | Σ_{j≠i,k} Y_j = m ) − P( Y_k = 1, Y_i = 1 | Σ_{j≠i,k} Y_j = m ) − P( Y_k = 0, Y_i = 1 | Σ_{j≠i,k} Y_j = m ) ]
= Σ_m γ'_m [ P( Y_k = 1 | Σ_{j≠i,k} Y_j = m ) − P( Y_i = 1 | Σ_{j≠i,k} Y_j = m ) ].

We assume that for all m: P( Y_k = 1 | Σ_{j≠i,k} Y_j = m ) ≥ P( Y_i = 1 | Σ_{j≠i,k} Y_j = m ), and therefore:

E[ ( e_{k,i}(Y) − e_{i,k}(Y) ) / Σ_j Y_j ] = Σ_m γ'_m [ P( Y_k = 1 | Σ_{j≠i,k} Y_j = m ) − P( Y_i = 1 | Σ_{j≠i,k} Y_j = m ) ] ≥ 0.
According to Lemma A.4, the above concludes the proof.
Since Σ_{j=1}^n Ỹ_j is assumed to be a constant, e_{k,i}(y) and e_{i,k}(y) may differ only in the values
of y_k, y_i and β_k, β_i. We go over all four combinations of y_k and y_i.
A.5.5 Segmentation
Proof [Proof of Proposition 5] This proposition is a special case of Proposition A.4 and
thus valid risk follows directly from this result.
δ := L(Ỹ) − L(Y) = L(Y + ε) − L(Y) = εL'(Y) + (1/2) ε² L''(ξ).

We now develop each term separately. Since ε ⊥⊥ Y it follows that ε ⊥⊥ L'(Y):

E[εL'(Y)] = E[ε] E[L'(Y)] = 0 · E[L'(Y)] = 0.

We get that:

E[δ] = E[ εL'(Y) + (1/2) ε² L''(ξ) ] = 0 + E[ (1/2) ε² L''(ξ) ].

Therefore:

(1/2) q Var[ε] ≤ E[δ] ≤ (1/2) Q Var[ε]
⇒ (1/2) q Var[ε] ≤ E[ L(Ỹ, Ĉ(X)) − L(Y, Ĉ(X)) ] ≤ (1/2) Q Var[ε].
We now turn to consider the conditioning on X = x and obtain marginalized bounds by
taking the expectation over all X:
α − (1/2) Q Var[ε] ≤ E[ L(Y, Ĉ(X)) ] ≤ α − (1/2) q Var[ε],
where α := E[ L(Ỹ, Ĉ(X)) ], q := E_X[q_X], and Q := E_X[Q_X]. Additionally, if L(y, C(x)) is
convex for all x ∈ X, then q_x ≥ 0, and we get a conservative risk over the clean labels:

E[ L(Y, Ĉ(X)) ] ≤ E[ L(Ỹ, Ĉ(X)) ] = α.
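The second-order bounds above can be sanity-checked by simulation; the sketch below (not from the paper) uses the smooth convex loss L(y) = y² + sin(y), whose second derivative lies in [1, 3], with all distributional choices being illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def loss(y):
    return y**2 + np.sin(y)   # smooth convex loss with 1 <= L'' <= 3

q, Q = 1.0, 3.0               # bounds on the second derivative of `loss`
n = 2_000_000
y = rng.normal(size=n)                       # clean responses
eps = rng.normal(scale=0.3, size=n)          # independent, mean-zero noise
delta = loss(y + eps) - loss(y)              # noisy-minus-clean loss gap

lower = 0.5 * q * np.var(eps)
upper = 0.5 * Q * np.var(eps)
print(lower, delta.mean(), upper)
assert lower - 1e-3 <= delta.mean() <= upper + 1e-3
```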
In this section we prove Proposition 7 and show how to obtain tight coverage bounds. First,
we define a parameterized smoothed miscoverage loss:
L^sm_{d,c}(y, [a, b]) = 2 / ( 1 + exp( −d ( (2(y − a)/(b − a) − 1)² )^c ) ).
Above, c, d ∈ R are parameters that affect the loss function. We first relate the smoothed
miscoverage to the standard miscoverage function by evaluating the loss at the endpoints of the interval:

h(d) := L^sm_{d,c}(a, [a, b]) = L^sm_{d,c}(b, [a, b]) = 2 / ( 1 + exp( −d ( (2(a − a)/(b − a) − 1)² )^c ) ) = 2 / ( 1 + exp( −d ((−1)²)^c ) ) = 2 / (1 + e^{−d}).
Therefore, h(d) is a function that depends only on d. We now denote the minimal second derivative
of the smoothed loss by:

q_x(c, d) = min_y ∂²/∂y² L^sm_{c,d}(y, C(x)).

Importantly, q_x(c, d) can be empirically computed by sweeping over all y ∈ R and computing
the second derivative of L^sm_{c,d} at each of them.
We obtain an upper bound for the miscoverage of C(x) by applying Markov's inequality
using (13):

P(Y ∉ C(X) | X = x) = P( L^sm_{d,c}(Y, C(X)) ≥ h(d) | X = x ) ≤ E[ L^sm_{c,d}(Y, C(X)) | X = x ] / h(d).   (14)
Finally, we combine (14) and (15) and derive the following miscoverage bound:

P(Y ∉ C(X) | X = x) ≤ ( E[ L^sm_{c,d}(Ỹ, C(X)) | X = x ] − (1/2) q_x(c, d) Var[ε] ) / h(d).
Lastly, we take the expectation over all X to obtain a marginal coverage bound:

P(Y ∈ C(X)) ≥ 1 − ( E[ L^sm_{c,d}(Ỹ, C(X)) ] − (1/2) E_X[ q_x(c, d) ] Var[ε] ) / h(d).   (16)
Crucially, all variables in (16) are empirically computable, so the above lower bound is
available in practice. Additionally, the parameters c, d can be tuned over a validation set to
obtain tighter bounds.
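The following sketch (not from the paper) shows one way the quantities entering the miscoverage bound could be computed empirically for a fixed interval [a, b]; the loss parameters, the interval, the sweep grid, and the noise variance are illustrative assumptions.

```python
import numpy as np

def smooth_miscoverage(y, a, b, c=1.0, d=5.0):
    """Smoothed miscoverage loss L^sm_{d,c}(y, [a, b]) as defined above."""
    z = (2.0 * (y - a) / (b - a) - 1.0) ** 2
    return 2.0 / (1.0 + np.exp(-d * z**c))

def h(d):
    return 2.0 / (1.0 + np.exp(-d))   # value of the loss at the interval endpoints

def min_second_derivative(a, b, c, d, step=1e-4):
    """q_x(c, d): minimal second derivative of the smoothed loss, estimated by
    a finite-difference sweep over a (truncated) grid of y values."""
    grid = np.linspace(a - 2 * (b - a), b + 2 * (b - a), 20_001)
    f = lambda y: smooth_miscoverage(y, a, b, c, d)
    second = (f(grid + step) - 2 * f(grid) + f(grid - step)) / step**2
    return second.min()

def miscoverage_upper_bound(y_noisy, a, b, noise_var, c=1.0, d=5.0):
    """Empirical version of the bound for a fixed interval [a, b]."""
    smooth_risk = smooth_miscoverage(y_noisy, a, b, c, d).mean()
    q = min_second_derivative(a, b, c, d)
    return (smooth_risk - 0.5 * q * noise_var) / h(d)

rng = np.random.default_rng(5)
y_noisy = rng.normal(size=10_000)     # illustrative noisy responses
print(miscoverage_upper_bound(y_noisy, a=-1.5, b=1.5, noise_var=0.25))
```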
Proposition A.5 Suppose that C(x) is a prediction interval. Under the additive noise
model g^add from (3), if the PDF of Y | X = x is K_x-Lipschitz, then:
Therefore:

P(Y ∈ C(X) | X = x) = ∫_{y∈C(x)} f_{Y|X=x}(y) dy ≥ ∫_{y∈C(x)} ( f_{Ỹ|X=x}(y) − K_x E[|Z|] ) dy.
In this section, we analyze the setting where the response variable is a matrix Y ∈ R^{W×H}.
Here, the uncertainty is represented by a prediction interval C^{i,j}(X) for each pixel i, j in
the response image, and our goal is to control the image miscoverage loss (12), defined as:

L^im(Y, C(X)) = (1/(WH)) Σ_{i=1}^W Σ_{j=1}^H 1{ Y^{i,j} ∉ C^{i,j}(X) }.
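For reference, here is a minimal sketch (not from the paper) of the image miscoverage loss for per-pixel intervals; the image size and the intervals are illustrative assumptions.

```python
import numpy as np

def image_miscoverage(y, lower, upper):
    """L^im(Y, C(X)): fraction of pixels whose response falls outside the
    per-pixel interval C^{i,j}(X) = [lower[i, j], upper[i, j]]."""
    outside = (y < lower) | (y > upper)
    return outside.mean()

rng = np.random.default_rng(6)
W, H = 32, 32
y = rng.normal(size=(W, H))                       # clean response image
pred = rng.normal(size=(W, H))                    # illustrative point prediction
lower, upper = pred - 2.0, pred + 2.0             # per-pixel intervals
print(image_miscoverage(y, lower, upper))
```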
While this loss can be controlled under an i.i.d assumption by applying the methods pro-
posed by Angelopoulos et al. (2021, 2024), these techniques may produce invalid uncertainty
sets in the presence of label noise. We now show that conservative image miscoverage risk
is obtained under the assumptions of Theorem A.1.
Proposition A.6 Suppose that each element of the response matrix is corrupted according
to an additive noise model g add with a noise that has mean 0. Suppose that for every pixel
i, j of the response matrix, the prediction interval C i,j (X) and the conditional distribution
of Y i,j | X = x satisfy the assumptions of Theorem A.1 for all x ∈ X . Then, we obtain
valid conditional image miscoverage:
Proof For ease of notation, we suppose that Y is a random vector of length k and C^i is
the prediction interval constructed for the i-th element in Y. Suppose that x ∈ X. Under
the assumptions of Theorem A.1, we get that for all i ∈ {1, ..., k}:

P( Y^i ∉ C^i(x) | X = x ) ≤ P( Ỹ^i ∉ C^i(x) | X = x ).
Therefore:
E[ L^im(Y, C(X)) | X = x ] = E[ (1/k) Σ_{i=1}^k 1{ Y^i ∉ C^i(X) } | X = x ]
= (1/k) Σ_{i=1}^k E[ 1{ Y^i ∉ C^i(X) } | X = x ]
= (1/k) Σ_{i=1}^k P( Y^i ∉ C^i(X) | X = x )
≤ (1/k) Σ_{i=1}^k P( Ỹ^i ∉ C^i(X) | X = x )
= (1/k) Σ_{i=1}^k E[ 1{ Ỹ^i ∉ C^i(X) } | X = x ]
= E[ (1/k) Σ_{i=1}^k 1{ Ỹ^i ∉ C^i(X) } | X = x ]
= E[ L^im(Ỹ, C(X)) | X = x ].
E_{Y_t | X_t = x}[ L_t(Y_t, Ĉ_t(X_t)) | X_t = x ] ≤ α_t.
Draw T uniformly from [0, 1, ..., ∞]. Then, from the law of total expectation, it follows
that:
E_T[ L_T(Y_T, Ĉ_T(X_T)) ] = E_T[ E_{X_T | T=t}[ E_{Y_T | X_T = x}[ L_T(Y_T, Ĉ_T(X_T)) | X_T = x ] | T = t ] ]
≤ E_T[ E_{X_T | T=t}[ α_T | T = t ] ]
= E_T[ α_T ]
= α.

lim_{T→∞} (1/T) Σ_{t=0}^T L_t(Y_t, Ĉ_t(X_t)) = E_T[ L_T(Y_T, Ĉ_T(X_T)) ] ≤ α.
In this section, we suppose an online learning setting, where the data {(x_t, y_t)}_{t=1}^∞ is given
as a stream. The miscoverage counter loss (Feldman et al., 2023b) counts the number of
consecutive miscoverage events that occurred until the timestamp t. Formally, given a series
of prediction sets {Ĉ_t(x_t)}_{t=1}^T and a series of labels {y_t}_{t=1}^T, the miscoverage counter at
timestamp t is defined as:

L^MC_t(y_t, Ĉ_t(x_t)) = L^MC_{t−1}(y_{t−1}, Ĉ_{t−1}(x_{t−1})) + 1 if y_t ∉ Ĉ_t(x_t), and 0 otherwise.
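A minimal sketch (not from the paper) of the miscoverage counter defined above:

```python
from typing import Sequence, Tuple

def miscoverage_counter(labels: Sequence[float],
                        intervals: Sequence[Tuple[float, float]]) -> list:
    """Miscoverage counter L^MC_t: number of consecutive miscoverage events
    up to (and including) time t, reset to 0 whenever y_t is covered."""
    counts, run = [], 0
    for y, (lo, hi) in zip(labels, intervals):
        run = run + 1 if (y < lo or y > hi) else 0
        counts.append(run)
    return counts

# Example: the third and fourth labels fall outside their intervals.
print(miscoverage_counter([0.1, 0.5, 2.0, 3.0, 0.2],
                          [(0, 1)] * 5))   # -> [0, 0, 1, 2, 0]
```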
We now show that a conservative miscoverage counter risk is obtained in the presence of
label noise. If the miscoverage counter risk of the noisy labels is controlled at level α, then the
miscoverage counter risk of the clean labels is controlled at level α:

lim_{T→∞} (1/T) Σ_{t=1}^T E[ L^MC_t(Y_t, Ĉ_t(X_t)) ] ≤ α.
Notice that the conditions of Proposition A.7 follow from Theorem A.2 in classification
tasks or from Theorem A.1 in regression tasks. In other words, we are guaranteed to obtain
valid miscoverage counter risk when the requirements of these theorems are satisfied. We
now demonstrate this result in a regression setting.
Corollary A.8 Suppose that the noise model and the conditional distributions of the clean
Yt | Xt , Xt−1 , Yt−1 and noisy Ỹt | Xt , Xt−1 , Ỹt−1 labels satisfy the assumptions of Theo-
rem A.1 for all t ∈ N. Then, the miscoverage counter risk of the clean labels is more conservative
than that of the noisy labels:
lim_{T→∞} (1/T) Σ_{t=1}^T E[ L^MC_t(Y_t, Ĉ_t(X_t)) ] ≤ lim_{T→∞} (1/T) Σ_{t=1}^T E[ L^MC_t(Ỹ_t, Ĉ_t(X_t)) ].
P[ L^MC_t(Y_t, Ĉ_t(X_t)) = k ] ≤ P[ L^MC_t(Ỹ_t, Ĉ_t(X_t)) = k ].

P[ L^MC_t(Y_t, Ĉ_t(X_t)) = k | X_t = x_t ] = P[ Y_t ∉ Ĉ_t(X_t) | X_t = x_t ]
≤ P[ Ỹ_t ∉ Ĉ_t(X_t) | X_t = x_t ]
= P[ L^MC_t(Ỹ_t, Ĉ_t(X_t)) = k | X_t = x_t ].
Inductive step: suppose that the statement is correct for t, k. We now show for k + 1:
P[ L^MC_t(Y_t, Ĉ_t(X_t)) = k + 1 ]
= P[ L^MC_t(Y_t, Ĉ_t(X_t)) = k + 1 | L^MC_{t−1}(Y_{t−1}, Ĉ_{t−1}(X_{t−1})) = k ] P[ L^MC_{t−1}(Y_{t−1}, Ĉ_{t−1}(X_{t−1})) = k ]
= P[ Y_t ∉ Ĉ_t(X_t) | L^MC_{t−1}(Y_{t−1}, Ĉ_{t−1}(X_{t−1})) = k ] P[ L^MC_{t−1}(Y_{t−1}, Ĉ_{t−1}(X_{t−1})) = k ]
≤ P[ Ỹ_t ∉ Ĉ_t(X_t) | L^MC_{t−1}(Ỹ_{t−1}, Ĉ_{t−1}(X_{t−1})) = k ] P[ L^MC_{t−1}(Ỹ_{t−1}, Ĉ_{t−1}(X_{t−1})) = k ]
= P[ L^MC_t(Ỹ_t, Ĉ_t(X_t)) = k + 1 ].
Inductive step 2: suppose that the statement is correct for t, k. We now show for t + 1:
P[ L^MC_{t+1}(Y_{t+1}, Ĉ_{t+1}(X_{t+1})) = k ]
= P[ L^MC_{t+1}(Y_{t+1}, Ĉ_{t+1}(X_{t+1})) = k | L^MC_t(Y_t, Ĉ_t(X_t)) = k − 1 ] P[ L^MC_t(Y_t, Ĉ_t(X_t)) = k − 1 ]
= P[ Y_{t+1} ∉ Ĉ_{t+1}(X_{t+1}) | L^MC_t(Y_t, Ĉ_t(X_t)) = k − 1 ] P[ L^MC_t(Y_t, Ĉ_t(X_t)) = k − 1 ]
≤ P[ Ỹ_{t+1} ∉ Ĉ_{t+1}(X_{t+1}) | L^MC_t(Ỹ_t, Ĉ_t(X_t)) = k − 1 ] P[ L^MC_t(Ỹ_t, Ĉ_t(X_t)) = k − 1 ]
= P[ L^MC_{t+1}(Ỹ_{t+1}, Ĉ_{t+1}(X_{t+1})) = k ].
Finally, we compute the miscoverage counter risk over the time horizon:
lim_{T→∞} (1/T) Σ_{t=1}^T E[ L^MC_t(Y_t, Ĉ_t(X_t)) ] = lim_{T→∞} (1/T) Σ_{t=1}^T Σ_{k=1}^∞ k P[ L^MC_t(Y_t, Ĉ_t(X_t)) = k ]
≤ lim_{T→∞} (1/T) Σ_{t=1}^T Σ_{k=1}^∞ k P[ L^MC_t(Ỹ_t, Ĉ_t(X_t)) = k ]
= lim_{T→∞} (1/T) Σ_{t=1}^T E[ L^MC_t(Ỹ_t, Ĉ_t(X_t)) ]
≤ α.
Here we present additional results of the classification experiment with CIFAR-10H ex-
plained in Section 2.2, but first provide further details about the data set and training
procedure. The CIFAR-10H data set contains the same 10,000 images as CIFAR-10, but
with labels from a single annotator instead of a majority vote of 50 annotators. We fine-
tune a ResNet18 model pre-trained on the clean training set of CIFAR-10, which contains
50,000 samples. Then we randomly select 2,000 observations from CIFAR-10H for cali-
bration. The test set contains the remaining 8,000 samples, but with CIFAR-10 labels.
We apply conformal prediction with the APS score. The marginal coverage achieved when
using noisy and clean calibration sets is depicted in Figure 1. This figure shows that (i)
we obtain the exact desired coverage when using the clean calibration set; and (ii) when
calibrating on noisy data, the constructed prediction sets over-cover the clean test labels.
Figure 15 shows that the average prediction sets are larger when using noisy data for
calibration, which leads to higher coverage levels.
Herein, we provide additional details regarding the training of the predictive models for
the real-world regression task. As explained in Section 2.2, we use a VGG-16 model—pre-
trained on the ImageNet data set—whose last (deepest) fully connected layer is removed.
Then, we feed the output of the VGG-16 model to a linear fully connected layer to predict
the response. We train two different models: a quantile regression model for CQR and
a classic regression model for conformal with residual magnitude score. Both models are
trained on 34,000 noisy samples, calibrated on 7,778 noisy holdout points, and tested on
7,778 clean samples. We train the quantile regression model for 70 epochs using the SGD
optimizer with a batch size of 128 and an initial learning rate of 0.001.
Figure 15: Effect of label noise on CIFAR-10. Distribution of average prediction set
sizes over 30 independent experiments evaluated on CIFAR-10H test data using noisy and
clean labels for calibration. Other details are as in Figure 1.
The learning rate is decayed every 20 epochs exponentially with a rate of 0.95 and a frequency of 10. We apply dropout
regularization with a rate of 0.2 to avoid overfitting. We train the classic regression model
for 70 epochs using the Adam optimizer with a batch size of 128 and an initial learning rate
of 0.00005, decayed every 10 epochs exponentially with a rate of 0.95 and a frequency of 10.
The dropout rate in this case is 0.5.
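For reference, here is a hedged PyTorch sketch of the quantile-regression training configuration described above; the model, data loader, pinball-loss head, and the exact decay schedule are assumptions and may differ from the paper's implementation.

```python
import torch
import torch.nn as nn

# `model` and `train_loader` are placeholders assumed to be defined elsewhere
# (e.g., a truncated VGG-16 feature extractor followed by a linear head).
def pinball_loss(pred, target, quantile):
    diff = target - pred
    return torch.mean(torch.maximum(quantile * diff, (quantile - 1) * diff))

def train_quantile_model(model, train_loader, quantiles=(0.05, 0.95), epochs=70):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    # Approximation of the decay schedule described above (exponential decay
    # with rate 0.95); the paper's exact schedule may differ.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.95)
    for _ in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            out = model(x)                        # out[:, k] estimates the k-th quantile
            loss = sum(pinball_loss(out[:, k], y, q)
                       for k, q in enumerate(quantiles))
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```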
In contrast with the noise distributions presented in Section 4.1, here we construct adver-
sarial noise models to intentionally reduce the coverage rate.
1. Most frequent confusion: we extract from the confusion matrix the pair of classes
with the highest probability of being confused with each other, and switch their
labels until reaching a total switching probability of ε. In cases where switching between the
most common pair is not enough to reach ε, we proceed by flipping the labels of the
second most confused pair of labels, and so on.
2. Wrong to right: wrong predictions during calibration cause larger prediction sets
at test time. Hence, making the model believe it makes fewer mistakes than it
actually does during calibration can lead to under-coverage at test time. Here,
we first observe the model predictions over the calibration set, and then switch the
labels only of points that were misclassified. We switch the label to the class that is
most likely to be the correct class according to the model, hence making the model
think it was correct. We switch a suitable number of labels in order to reach a
total switching probability of ε (this noise model assumes there are enough wrong
predictions in order to do so).
3. Optimal adversarial: we describe here an algorithm for building the worst possible
label noise for a specific model using a specific non-conformity score. This noise
decreases the calibration threshold the most and, as a result, causes significant under-coverage.
Figure 16: Effect of label noise on synthetic multi-class classification data. Per-
formance of conformal prediction sets with target coverage 1 − α = 90%, using a noisy
training set and a noisy calibration set with adversarial noise models (wrong to right, optimal
adversarial HPS, optimal adversarial APS). Left: Marginal coverage; Right: Average size of
predicted sets. The results are evaluated over 100 independent experiments.
In these experiments, we apply the same settings as described in Section 4.1 and present the
results in Figure 16. We can see that the optimal adversarial noise causes the largest
decrease in coverage as one would expect. The most-frequent-confusion noise decreases
the neural network coverage to approximately 89%. The wrong-to-right noise decreases
the coverage to around 85% with the HPS score and to around 87% with the APS score.
This gap is expected as this noise directly reduces the HPS score. We can see that the
optimal worst-case noise for each score function reduces the coverage to around 85% when
using that score. This is in fact the maximal theoretically possible decrease in coverage,
which strengthens the optimality of our iterative algorithm.
Here we first illustrate in Figure 17 the data we generate in the synthetic regression exper-
iment from Section 4.2 and the different corruptions we apply.
Figure 17: Illustration of the generated data with different corruptions. (a): Clean
samples. (b): Samples with symmetric heavy-tailed noise. (c): Samples with asymmetric
noise. (d): Samples with biased noise. Noise magnitude is set to 0.1. (e): Samples with
contractive noise. (f): Samples with dispersive noise.
In Section 4.2 we apply some realistic noise models and examine the performance of
conformal prediction with the CQR score using noisy training and calibration sets. Here we
conduct additional experiments with the same settings; however, we train the models
on clean data instead of noisy data. Moreover, we apply an additional adversarial noise
model that differs from those presented in Section 4.2 in the sense that it is designed to
intentionally reduce the coverage level.
Wrong to right: an adversarial noise that depends on the underlying trained regression
model. In order to construct the noisy calibration set we switch 7% of the responses as
follows: we randomly swap between outputs that are not included in the interval predicted
by the model and outputs that are included.
Figures 18 and 19 depict the marginal coverage and interval length achieved when ap-
plying the different noise models. We see that the adversarial wrong to right noise model
reduces the coverage rate to approximately 83%. Moreover, these results are similar to
those achieved in Section 4.2, except for the conservative coverage attained using biased
noise, which can be explained by the more accurate estimates of the lower and upper quantiles.
[Figures 18 and 19: marginal coverage and interval length, shown as a function of the noise magnitude c for the asymmetric and biased noise models, and across the clean, contractive, dispersive, and wrong-to-right noise models, with the nominal level marked.]
Lastly, in order to explain the over-coverage or under-coverage achieved for some of the
different noise models, as depicted in Figures 7 and 8, we present in Figure 20 the CQR
scores and their 90% empirical quantile. Over-coverage is achieved when the noisy scores
are larger than the clean ones, for example in the symmetric heavy-tailed case, and
under-coverage is achieved when the noisy scores are smaller.
Figure 20: Illustration of the CQR scores and their 90% empirical quantiles (q_clean, q_noisy). (a): Clean training and calibration sets.
(b): Symmetric heavy-tailed noise. (c): Biased noise. Noise magnitude is set to 0.1. (d):
Contractive noise. Other details are as in Figure 7.
B.5 The Multi-label Classification Experiment with the COCO Data Set
Here, we provide the full details about the experimental setup of the real multi-label cor-
ruptions from Section 3.2. We asked 9 annotators to annotate 117 images. Each annotator
labeled approximately 15 images separately, except for two annotators who labeled 15 im-
ages together as a pair. The annotators were asked to label each image in under 30 seconds, although
this request was not enforced. Figure 21 presents the number of labels that are missed or
mistakenly added to each image. On average, 1.25 labels were missed and 0.5 were
mistakenly added, meaning that each image contains a total of 1.75 label mistakes on aver-
age.
We repeat the experimental protocol detailed in Section 4.5 with the same data sets and
analyze the miscoverage bounds formulated in Section 3.3.3. Here, we apply conformalized
quantile regression to control the miscoverage rate at different levels using a noisy calibration
set. Furthermore, we choose the miscoverage bound hyperparameters, c, d, from (16) by a
grid search over the calibration set with the objective of tightening the miscoverage bound.
Figure 23 displays the miscoverage rate achieved on the clean and noisy versions of the
test set, along with the miscoverage bound. Importantly, in contrast to Theorem A.1, this
bound requires no assumptions on the distribution of Y | X = x or on the distribution of
Figure 22: FNR on MS COCO data set, achieved over noisy (red) and clean (green) test
sets. The calibration scheme is applied with noisy annotations. Results are averaged over
50 trials.
the noise ε. The only requirement is that the noise is independent of the response variable
and has mean 0. This advantage compensates for the looseness of the bound.
Figure 23: Miscoverage rate achieved over noisy (red) and clean (green) test sets. The cali-
bration scheme is applied using noisy annotations to control the miscoverage level. Results
are averaged over 10 random splits of the calibration and test sets.
Here we provide pseudo-code of the conformal risk control algorithm, following Angelopoulos et al. (2024).
The prediction set Ĉ_λ̂(Xtest) produced by Algorithm 1 satisfies E[L(Ytest, Ĉ_λ̂(Xtest))] ≤
α; for the proof, see Angelopoulos et al. (2024).
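As a reference for Algorithm 1, here is a minimal sketch of the conformal risk control calibration step in the spirit of Angelopoulos et al. (2024), assuming a loss that is non-increasing in the threshold λ and bounded above by B; the synthetic losses in the usage example are illustrative.

```python
import numpy as np

def conformal_risk_control(losses_by_lambda, lambdas, alpha, B=1.0):
    """Pick the smallest lambda whose inflated empirical risk is below alpha.

    losses_by_lambda: array of shape (n_calibration, n_lambdas), where entry
    [i, j] is the loss L(y_i, C_{lambdas[j]}(x_i)); the loss is assumed to be
    non-increasing in lambda and bounded by B.
    """
    n = losses_by_lambda.shape[0]
    risk = losses_by_lambda.mean(axis=0)
    adjusted = (n * risk + B) / (n + 1)          # conformal correction
    valid = np.flatnonzero(adjusted <= alpha)
    if valid.size == 0:
        return lambdas[-1]                       # fall back to the largest lambda
    return lambdas[valid[0]]

# Usage sketch with synthetic losses that are monotone in lambda.
rng = np.random.default_rng(7)
lambdas = np.linspace(0, 1, 101)
losses = np.clip(rng.uniform(size=(500, 1)) - lambdas[None, :], 0, 1)
print(conformal_risk_control(losses, lambdas, alpha=0.1))
```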
Below, we provide pseudo-code of ACI, following (Gibbs and Candes, 2021, Algorithm 1). At each time step t:
1. Obtain Y_t.
2. Compute err_t = 1{Y_t ∉ Ĉ_t(X_t)}.
3. Update α_{t+1} = α_t + γ(α − err_t).
Output: Uncertainty sets Ĉ_t(X_t) for each time step t = n_2, ..., T.
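A minimal sketch of the ACI loop above, assuming absolute-residual conformity scores and a rolling calibration window; the score choice and window size are illustrative assumptions, not necessarily those used in the experiments.

```python
import numpy as np

def aci(scores_stream, predictions_stream, alpha=0.1, gamma=0.01, window=500):
    """Adaptive conformal inference: track alpha_t and update it with the
    miscoverage indicator err_t, as in the pseudo-code above."""
    alpha_t = alpha
    cal_scores = []                              # rolling window of past scores
    intervals, errs = [], []
    for score, pred in zip(scores_stream, predictions_stream):
        if cal_scores:
            level = np.clip(1 - alpha_t, 0, 1)
            q = np.quantile(cal_scores, level)
        else:
            q = np.inf                           # no calibration data yet
        intervals.append((pred - q, pred + q))   # C_t(X_t) for residual scores
        err = float(score > q)                   # 1{Y_t not in C_t(X_t)}
        errs.append(err)
        alpha_t = alpha_t + gamma * (alpha - err)
        cal_scores.append(score)
        cal_scores = cal_scores[-window:]
    return intervals, errs
```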
Here we provide pseudo-code of Rolling RC, first developed by Feldman et al. (2023b).
References
Mohamed Abdalla and Benjamin Fine. Hurdles to artificial intelligence deployment: Noise
in schemas and “gold” labels. Radiology: Artificial Intelligence, 5(2):e220056, 2023.
Görkem Algan and Ilkay Ulusoy. Label noise types and their effects on deep learning. arXiv
preprint arXiv:2003.10471, 2020.
Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle intro-
duction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023. ISSN
1935-8237.
Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and
Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control.
arXiv preprint, 2021. arXiv:2110.01052.
Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster.
Conformal risk control. In The Twelfth International Conference on Learning Represen-
tations, 2024.
Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):
343–370, 1988.
Javed A Aslam and Scott E Decatur. On the sample complexity of noise-tolerant learning.
Information Processing Letters, 57(4):189–195, 1996.
Rina Foygel Barber. Is distribution-free inference possible for binary regression? Electronic
Journal of Statistics, 14(2):3487 – 3524, 2020.
Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. Pre-
dictive inference with the jackknife+. The Annals of Statistics, 49(1):486 – 507, 2021.
Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Con-
formal prediction beyond exchangeability. The Annals of Statistics, 51(2):816 – 845,
2023.
Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan.
Distribution-free, risk-controlling prediction sets. Journal of the ACM, 68(6), September
2021. ISSN 0004-5411.
Ruairidh M Battleday, Joshua C Peterson, and Thomas L Griffiths. Capturing human cat-
egorization of natural images by combining deep networks and cognitive models. Nature
Communications, 11(1):1–14, 2020.
Maxime Cauchois, Suyash Gupta, Alnur Ali, and John Duchi. Predictive inference with
weak supervision. arXiv preprint arXiv:2201.08315, 2022.
Chen Cheng, Hilal Asi, and John Duchi. How many labelers do you have? a closer look at
gold-standard labels. arXiv preprint arXiv:2206.12041, 2022.
Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling
Shao. Pranet: Parallel reverse attention network for polyp segmentation. In International
conference on medical image computing and computer-assisted intervention, pages 263–
273. Springer, 2020.
António Farinhas, Chrysoula Zerva, Dennis Thomas Ulmer, and Andre Martins. Non-
exchangeable conformal risk control. In The Twelfth International Conference on Learn-
ing Representations, 2024. URL https://openreview.net/forum?id=j511LaqEeP.
Shai Feldman, Liran Ringel, Stephen Bates, and Yaniv Romano. Achieving risk control
in online learning settings. Transactions on Machine Learning Research, 2023b. ISSN
2835-8856.
Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey.
IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2013.
Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics:
The kitti dataset. International Journal of Robotics Research, 2013.
Asaf Gendler, Tsui-Wei Weng, Luca Daniel, and Yaniv Romano. Adversarially robust
conformal prediction. In International Conference on Learning Representations, 2021.
Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift.
In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances
in Neural Information Processing Systems, 2021.
Simon Jenni and Paolo Favaro. Deep bilevel learning. In Proceedings of the European
Conference on Computer Vision, pages 618–633, 2018.
Ishan Jindal, Matthew Nokleby, and Xuewen Chen. Learning deep networks from noisy
labels with dropout regularization. In 2016 IEEE 16th International Conference on Data
Mining, pages 967–972. IEEE, 2016.
Yueying Kao, Chong Wang, and Kaiqi Huang. Visual aesthetic quality assessment with a
regression model. In 2015 IEEE International Conference on Image Processing, pages
1583–1587. IEEE, 2015.
Roger Koenker and Gilbert Bassett. Regression quantiles. Econometrica: Journal of the
Econometric Society, pages 33–50, 1978.
Himanshu Kumar, Naresh Manwani, and PS Sastry. Robust learning of multi-label classifiers
under label noise. In Proceedings of the ACM IKDD CoDS and COMAD, pages 90–97.
2020.
Yonghoon Lee and Rina Foygel Barber. Binary classification with corrupted labels. Elec-
tronic Journal of Statistics, 16(1):1367 – 1392, 2022.
Jing Lei, James Robins, and Larry Wasserman. Distribution-free prediction sets. Journal
of the American Statistical Association, 108(501):278–287, 2013.
Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman.
Distribution-free predictive inference for regression. Journal of the American Statisti-
cal Association, 113(523):1094–1111, 2018.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In
European Conference on Computer Vision, pages 740–755. Springer, 2014.
Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi
Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In
International Conference on Machine Learning, pages 3355–3364. PMLR, 2018.
Naila Murray, Luca Marchesotti, and Florent Perronnin. Ava: A large-scale database for
aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern
Recognition, pages 2408–2415. IEEE, 2012.
Harris Papadopoulos, Kostas Proedrou, Vladimir Vovk, and Alex Gammerman. Induc-
tive confidence machines for regression. In Machine Learning: European Conference on
Machine Learning, pages 345–356, 2002.
Tal Ridnik, Hussam Lawen, Asaf Noy, Emanuel Ben Baruch, Gilad Sharir, and Itamar
Friedman. TResNet: High performance GPU-dedicated architecture. In Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1400–1409,
2021.
Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression.
In Advances in Neural Information Processing Systems, volume 32, pages 3543–3553.
2019.
Yaniv Romano, Matteo Sesia, and Emmanuel Candès. Classification with valid and adaptive
coverage. In Advances in Neural Information Processing Systems, volume 33, pages 3581–
3591, 2020.
Matteo Sesia, YX Wang, and Xin Tong. Adaptive conformal classification with noisy labels.
arXiv preprint arXiv:2309.05092, 2023.
Pulkit Singh, Joshua C Peterson, Ruairidh M Battleday, and Thomas L Griffiths. End-to-
end deep prototype and exemplar models for predicting human behavior. arXiv preprint
arXiv:2007.08723, 2020.
David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan
Cemgil, and Arnaud Doucet. Conformal prediction under ambiguous ground truth. arXiv
preprint arXiv:2307.09302, 2023.
Hossein Talebi and Peyman Milanfar. NIMA: Neural image assessment. IEEE Transactions
on Image Processing, 27(8):3998–4011, 2018.
Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Con-
formal prediction under covariate shift. In Advances in Neural Information Processing
Systems, volume 32, pages 2530–2540. 2019.
Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random
World. Springer, New York, NY, USA, 2005.
Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, and Yang Liu. Learning
with noisy labels revisited: A study using real-world human annotations. In International
Conference on Learning Representations, 2022.
Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_DMI: A novel information-theoretic
loss function for training deep nets robust to label noise. Advances in Neural Information
Processing Systems, 32, 2019.
Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chun-
hua Shen. Learning to recover 3d scene shape from a single image. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213,
2021.
Bodi Yuan, Jianyu Chen, Weidong Zhang, Hung-Shuo Tai, and Sara McMains. Iterative
cross learning on noisy labels. In IEEE Winter Conference on Applications of Computer
Vision, pages 757–765. IEEE, 2018.
Wenting Zhao and Carla Gomes. Evaluating multi-label classifiers with noisy labels. arXiv
preprint arXiv:2102.08427, 2021.