Uncertainty Notes
Contents

I Foundations
1 Basic Setting and Definitions
3 Calibration
3.1 Introduction to Calibration
3.2 Calibrating a Model f
3.3 Quantile Calibration
3.4 Sequential Prediction
3.4.1 Sequential (Mean) Calibration
3.4.2 Sequential Quantile Calibration
4 Multigroup Guarantees
4.1 Group Conditional Mean Consistency
4.2 Group Conditional Quantile Consistency
4.2.1 A More Direct Approach to Group Conditional Guarantees
4.2.1.1 Generalization
4.3 Multicalibration: Group Conditional Calibration
4.4 Quantile Multicalibration
4.5 Out of Sample Generalization
4.5.1 Mean Multicalibration
4.5.2 Quantile Multicalibration
4.6 Sequential Prediction
4.6.1 A Bucketed Calibration Definition
4.6.2 Achieving Bucketed Calibration
4.6.3 Obtaining Bucketed Quantile Multicalibration
II Applications
7 Conformal Prediction
7.1 Prediction Sets and Nonconformity Scores
7.1.1 Non-Conformity Scores
7.2 A Weak Guarantee: Marginal Coverage in Expectation
7.3 Dataset Conditional Bounds
7.4 Dataset and Group Conditional Bounds
7.5 Multivalid Bounds
7.6 Sequential Conformal Prediction
7.6.1 Sequential Marginal Coverage Guarantees
7.6.2 Sequential Multivalid Guarantees
Bibliography
I Foundations

1 Basic Setting and Definitions

Uncertain: Modern Topics in Uncertainty Estimation (Incomplete Working Draft)
The Batch Setting
In the batch setting, we are given a batch or sample D ∈ Z^n of n datapoints sampled i.i.d. from an underlying distribution D. We will want algorithms that use the sample D to learn something useful about the distribution D.

We will sometimes treat a sample D as if it is a distribution: sampling from it, taking expectations over it, etc. When we do this, we are identifying D with the discrete distribution that places weight 1/n on each example (x, y) ∈ D. For example, we can compute the squared error of a predictor f over a sample D, which evaluates to:

B(f, D) = (1/n) Σ_{(x,y)∈D} (f(x) − y)²
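As a concrete illustration, the empirical squared error above is a one-line computation; the predictor and the toy sample here are illustrative stand-ins, not anything from the text.

```python
def squared_error(f, sample):
    """B(f, D) = (1/n) * sum over (x, y) in D of (f(x) - y)^2."""
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)

# A toy sample of (x, y) pairs, identified with the uniform distribution
# over its three points.
sample = [(0.1, 0.0), (0.4, 1.0), (0.9, 1.0)]
constant_half = lambda x: 0.5
print(squared_error(constant_half, sample))  # 0.25 for this sample
```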
2 A Simple Goal: Marginal Estimation

CONTENTS
2.1 Means
2.2 Quantiles
2.2.1 Generalizing From Data
2.3 Sequential Prediction
References and Further Reading
2.1 Means

Recall that in a regression setting in which Y ⊆ [0, 1], our goal is to learn a model f* such that f*(x) = E_{y∼D(x)}[y], i.e. one that correctly captures the conditional label mean for each x ∈ X. Of course it's not clear how to do this (or even how to test whether we have succeeded), but we can begin with a minimal sanity check: marginal mean consistency.

Definition 2 A model f : X → [0, 1] has marginal mean consistency error α if:

|E_{x∼D_X}[f(x)] − E_{(x,y)∼D}[y]| = α
Marginal mean consistency is a weak condition: it depends on f only through an unconditional expectation E[f(x)], rather than constraining the behavior of f conditional on any property of x. In other words, it's just an average over all inputs to f. f* satisfies marginal mean consistency, so if our model f does not, then our model f must not be f*. Of course, failure to satisfy marginal mean consistency is easy to fix. Let:

∆ = E_{(x,y)∼D}[y] − E_{x∼D_X}[f(x)]   and   f̂(x) = f(x) + ∆.

Then:

E_{x∼D_X}[f̂(x)] = E_{x∼D_X}[f(x)] + ∆ = E_{(x,y)∼D}[y]

as desired.
What is less obvious is that fˆ is more accurate than f — as measured by
its squared error.
Lemma 2.1.2 Fix any distribution D, let f : X → [0, 1] be any model, let ∆ = E_{(x,y)∼D}[y] − E_{x∼D_X}[f(x)], and let f̂(x) = f(x) + ∆. Then over the distribution D:

B(f̂, D) = B(f, D) − ∆²

Proof 2 We can directly compute:

B(f, D) − B(f̂, D) = E_{(x,y)∼D}[(f(x) − y)² − (f̂(x) − y)²]
= E_{(x,y)∼D}[f(x)² − 2f(x)y + y² − f̂(x)² + 2f̂(x)y − y²]
= E_{(x,y)∼D}[−2∆f(x) − ∆² + 2∆y]
= 2∆ E_{(x,y)∼D}[y − f(x)] − ∆²
= 2∆² − ∆²
= ∆²

where the third equality expands f̂(x)² = f(x)² + 2∆f(x) + ∆², and the second to last equality uses E_{(x,y)∼D}[y − f(x)] = ∆.
So not only is it easy to fix a model that does not satisfy marginal mean
consistency, it is always in our interest to do so if we care about accuracy: the
fix is strictly accuracy improving (as measured by squared error).
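Lemma 2.1.2 holds exactly on an empirical distribution, so it can be checked numerically; the sample and base predictor below are toys chosen for illustration.

```python
def squared_error(f, sample):
    # B(f, D) over a sample identified with the uniform distribution on it.
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)

sample = [(0.0, 0.1), (0.2, 0.5), (0.7, 0.9), (1.0, 0.7)]
f = lambda x: 0.5 * x

# Delta = E[y] - E[f(x)] on the sample; f_hat shifts f by Delta.
delta = (sum(y for _, y in sample) - sum(f(x) for x, _ in sample)) / len(sample)
f_hat = lambda x: f(x) + delta

gap = squared_error(f, sample) - squared_error(f_hat, sample)
assert abs(gap - delta ** 2) < 1e-12  # B(f_hat) = B(f) - Delta^2, exactly
```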
2.2 Quantiles

Rather than asking for a model that matches the mean of a distribution marginally, we can ask for a model that matches a target quantile of a distribution marginally. For simplicity, we will assume that all conditional label distributions D(x) are continuous. A value τ is a q-quantile of a label distribution if:

Pr_y[y ≤ τ] = q

Analogously to marginal mean consistency, a model f : X → [0, 1] has marginal quantile consistency error α with respect to target quantile q if:

|Pr_{(x,y)∼D}[y ≤ f(x)] − q| = α

To reason about quantiles it is useful to work with the pinball loss, which for a target quantile q is L_q(τ, y) = max{q(y − τ), (1 − q)(τ − y)}. Observe that for q = 1/2, this is simply (a scaling of) the absolute value difference function: L_{1/2}(τ, y) = (1/2)|τ − y|. Just as the constant that minimizes the Brier score on a distribution is its mean, the constant that minimizes the pinball loss for a target quantile q is a q-quantile:
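This last fact can be seen empirically: minimizing average pinball loss over a grid of candidate constants lands at an empirical q-quantile. The label values below are an illustrative toy distribution.

```python
def pinball(q, tau, y):
    # L_q(tau, y) = q*(y - tau) if y > tau, else (1 - q)*(tau - y)
    return q * (y - tau) if y > tau else (1 - q) * (tau - y)

ys = [0.05, 0.2, 0.35, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]
q = 0.7

# Minimize the empirical pinball loss over a fine grid of candidate constants.
grid = [i / 1000 for i in range(1001)]
tau_star = min(grid, key=lambda t: sum(pinball(q, t, y) for y in ys))

# At the minimizer, roughly a q fraction of the labels lie at or below it.
frac_below = sum(y <= tau_star for y in ys) / len(ys)
assert abs(frac_below - q) <= 0.1  # up to empirical discreteness
```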
Lemma 2.2.1 For any continuous distribution over y and any 0 ≤ q ≤ 1, any constant τ minimizing the expected pinball loss E_y[L_q(τ, y)] is a q-quantile.

We will also make use of a Lipschitz condition on label distributions: for all τ′ ≥ τ,

Pr_{y∼D(x)}[y ≤ τ′] − Pr_{y∼D(x)}[y ≤ τ] ≤ ρ(τ′ − τ)
The above definition is actually somewhat stronger than we need right now
— we don’t need the Lipschitz condition simultaneously for each conditional
label distribution D(x), but only marginally over the whole distribution —
but this stronger condition will be useful for us later on.
In particular, under this condition, shifting f to f̂ = f + ∆ improves the pinball loss (as also stated in Lemma 2.2.3 below):

PB_q(f̂) ≤ PB_q(f) − α²/(2ρ)

and

PB_q(f) ≤ PB_q(f̂) + |∆|α − α²/(2ρ)
Proof 4 As in the proof of Lemma 2.2.1, we can compute:

dPB_q(f(x) + τ)/dτ = E_{x∼D_X}[ d E_{y∼D(x)}[L_q(f(x) + τ, y)] / dτ ]
= E_{x∼D_X}[ Pr_{y∼D(x)}[y ≤ f(x) + τ] − q ]
= Pr_{(x,y)∼D}[y ≤ f(x) + τ] − q

Writing F(τ) = Pr_{(x,y)∼D}[y ≤ f(x) + τ], marginal quantile consistency of f̂ gives F(∆) = q, and the consistency error of f gives F(0) = q − α (in the case that ∆ > 0).
FIGURE 2.1: Upper and lower bounding the local area under the curve when ∆ > 0. Here F(τ) = Pr_{(x,y)∼D}[y ≤ f(x) + τ].
By the Lipschitz condition, the area under the CDF between τ = 0 and τ = ∆ is maximized if F increases at a linear rate ρ from value q − α at τ = 0 up to value q at τ = α/ρ, and then remains constant at q from τ = α/ρ to τ = ∆. Hence:

∫₀^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ ≤ (∆ − α/ρ)·q + (α/ρ)·(q − α) + (α/ρ)·(α/2)
= q∆ − α²/(2ρ)

Combining with the above, we have that:

PB_q(f̂) − PB_q(f) ≤ q|∆| − α²/(2ρ) − q|∆| = −α²/(2ρ)
Next we can lower bound the area under the CDF. Again by the Lipschitz condition, the smallest area under the CDF that respects the Lipschitz condition arises if the CDF remains constant taking value q − α from τ = 0 to τ = ∆ − α/ρ before increasing at a linear rate to q from τ = ∆ − α/ρ to τ = ∆.
FIGURE 2.2: Upper and lower bounding the local area under the curve when ∆ < 0. Here F(τ) = Pr_{(x,y)∼D}[y ≤ f(x) + τ], with F(∆) = q and F(0) = q + α.
This yields:

PB_q(f̂) − PB_q(f) ≥ q|∆| − |∆|α + α²/(2ρ) − q|∆| = −|∆|α + α²/(2ρ)
In the remaining case in which ∆ < 0, our worst cases are reversed (we need
to maximize the area under the curve to lower bound the integral and minimize
the area under the curve to upper bound the integral). Once again, the CDF
that minimizes the area under the curve subject to the Lipschitz constraint
behaves as follows (See figure 2.2): The CDF remains constant at q between
τ = ∆ and τ = −α/ρ, before increasing as quickly as possible at a linear rate
up to value q + α between τ = −α/ρ and τ = 0. In this case we have that:
∫₀^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ ≤ −( q·(|∆| − α/ρ) + (α/ρ)·q + (α/ρ)·(α/2) )
= −q|∆| − α²/(2ρ)
Again combining with the above we get that:

PB_q(f̂) − PB_q(f) ≤ −q|∆| − α²/(2ρ) + q|∆| = −α²/(2ρ)
Finally, the CDF that maximizes the area under the curve subject to the Lipschitz constraint increases at a linear rate from τ = ∆ to τ = ∆ + α/ρ from value q to value q + α, and then remains constant at q + α from τ = ∆ + α/ρ to τ = 0. Computing the area under this curve, we get:

∫₀^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ ≥ −( (α/ρ)·(q + α/2) + (q + α)·(|∆| − α/ρ) )
= α²/(2ρ) − |∆|q − |∆|α
Together with the above we have that:

PB_q(f̂) − PB_q(f) ≥ α²/(2ρ) − |∆|q − |∆|α + q|∆| = α²/(2ρ) − |∆|α
which completes the proof of the lemma.
Recall the Lipschitz condition on the conditional label distributions: for all τ′ ≥ τ,

Pr_{y∼D(x)}[y ≤ τ′] − Pr_{y∼D(x)}[y ≤ τ] ≤ ρ(τ′ − τ)
Lemma 2.2.3 Fix any ρ, α > 0. Fix any distribution over labeled examples D that is (ρ, α/ρ)-Lipschitz. Fix any model f : X → [0, 1] that has marginal consistency error α with respect to target quantile q, and let f̂(x) = f(x) + ∆ with ∆ chosen such that f̂ satisfies marginal quantile consistency for quantile q. Then:

PB_q(f̂) ≤ PB_q(f) − α²/(2ρ)

and

PB_q(f) ≤ PB_q(f̂) + |∆|α − α²/(2ρ)
|E_{(x,y)∼D}[f̂(x) − y]| = |E_{(x,y)∼D}[f(x) + ∆ − y]|
= |∆ − E_{(x,y)∼D}[y − f(x)]|
≤ √(2 log(2/δ) / n)

where the last inequality holds with probability 1 − δ, as established by Hoeffding's inequality.
Theorem 2 Fix any model f and distribution D, and let D ∼ D^n consist of n samples drawn i.i.d. from D. Let ∆ be such that f̂(x) = f(x) + ∆ has marginal quantile consistency error α′ with respect to some target quantile q on the sample D. Then with probability 1 − δ over the draw of D, f̂ has marginal quantile consistency error at most α with respect to target quantile q on the distribution D, for:

α ≤ α′ + √(log(2/δ) / (2n))
Proof 6 This is an application of the DKW (Dvoretzky–Kiefer–Wolfowitz) inequality (Theorem 49), which we quote here at its first use:

Let D be any distribution over Z and let D ∼ D^n consist of n points sampled i.i.d. from D. Let F(c) = Pr_{(x,y)∼D}[y ≤ c] denote the CDF of the label distribution induced by the distribution D, and let F̂_D(c) = (1/n) Σ_{(x,y)∈D} 1[y ≤ c] denote the CDF of the empirical label distribution induced by the sample D. Then for every t > 0:

Pr[ sup_{c∈R} |F(c) − F̂_D(c)| ≥ t ] ≤ 2 exp(−2nt²)
We apply this inequality to the distribution over "labels" y − f(x), with t = √(log(2/δ)/(2n)): with probability 1 − δ, the empirical and true probabilities of every event of the form y − f(x) ≤ c differ by at most t. By assumption, on the sample D we have:

|Pr_{(x,y)∼D}[y ≤ f(x) + ∆] − q| = α′

and hence, over the distribution D, with probability 1 − δ:

|Pr_{(x,y)∼D}[y ≤ f̂(x)] − q| = |Pr_{(x,y)∼D}[y ≤ f(x) + ∆] − q| ≤ α′ + √(log(2/δ)/(2n))
The result is that we can simply proceed as if our sample is our underlying distribution when we aim for marginal consistency: our marginal consistency error on the underlying distribution is guaranteed to be larger than our empirical marginal consistency error by at most ϵ with probability 1 − δ, whenever n ≥ Ω(log(1/δ)/ϵ²).
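One way to realize the premise of Theorem 2 is to choose the shift ∆ as an empirical q-quantile of the residuals y − f(x), so that f̂ = f + ∆ has (near-)zero marginal quantile consistency error on the sample; DKW then transfers this to the distribution. The `quantile_shift` helper, toy data, and base model below are illustrative, not from the text.

```python
import math

def quantile_shift(f, sample, q):
    # Delta = an empirical q-quantile of the residuals y - f(x).
    residuals = sorted(y - f(x) for x, y in sample)
    k = max(0, math.ceil(q * len(residuals)) - 1)  # index of a q-quantile
    return residuals[k]

sample = [(x / 20, (x / 20) ** 2) for x in range(20)]  # toy labeled data
f = lambda x: 0.5 * x
q = 0.9

delta = quantile_shift(f, sample, q)
f_hat = lambda x: f(x) + delta
empirical = sum(y <= f_hat(x) for x, y in sample) / len(sample)
assert abs(empirical - q) <= 1 / len(sample)  # consistency up to discreteness
```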
Let:

∆̄ = (1/T) Σ_{t=1}^T (ŷ_t − f(x̂_t))

Plugging in the definitions of ∆ and ŷ_t we have that:

|∆ − ∆̄| ≤ α
Note also that since the (x̂_t, ŷ_t) are sampled i.i.d. from D, we have that:

E[∆̄] = E_{(x,y)∼D}[y − f(x)]

By Hoeffding's inequality:

Pr[ |∆̄ − E_{(x,y)∼D}[y − f(x)]| ≥ ϵ ] ≤ 2 exp(−2Tϵ²/4)

The right hand side is at most δ when we have:

ϵ ≥ √(2 log(2/δ) / T)
We therefore have that with probability 1 − δ:

|E_{(x,y)∼D}[f̂(x) − y]| = |E_{(x,y)∼D}[f(x) + ∆ − y]|
≤ |E_{(x,y)∼D}[f(x) + ∆̄ − y]| + α
= |E_{(x,y)∼D}[f(x) + E[∆̄] + (∆̄ − E[∆̄]) − y]| + α
= |∆̄ − E[∆̄]| + α
≤ α + √(2 log(2/δ) / T)
as desired.
Next, we’ll see a simple algorithm that can guarantee marginal mean con-
sistency with error on the order of O(1/T ) on any sequence of length T —
i.e. without assuming that the data points come from a distribution. The
algorithm will be silly on its face as a prediction algorithm — always predict-
ing that today’s outcome will be equal to yesterday’s outcome. Its excellent
performance (as measured by marginal mean consistency) tells us something
about the weakness of marginal guarantees.
Algorithm 1 Online-Marginal-Mean-Predictor
Let y0 = 0
for t = 1 to T do
Observe xt (and ignore it!)
Predict pt = yt−1
Observe yt .
If we imagine using this algorithm to predict weather, then what it does is
the following: If it rained yesterday, it predicts a 100% chance of rain today.
Otherwise it predicts a 0% chance of rain. And yet:
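Its marginal mean consistency error telescopes: Σ_{t=1}^T (p_t − y_t) = y_0 − y_T, so the error is at most 1/T on any label sequence. A minimal simulation of Algorithm 1 (the label sequence below is an arbitrary toy):

```python
def online_marginal_mean(ys):
    """Run Algorithm 1 on a label sequence; return the average of p_t - y_t."""
    prev, total = 0.0, 0.0  # y_0 = 0
    for y in ys:
        p = prev           # predict yesterday's label
        total += p - y
        prev = y           # observe today's label
    return total / len(ys)

ys = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
err = abs(online_marginal_mean(ys))
assert err <= 1 / len(ys)  # telescoping: sum of (p_t - y_t) = y_0 - y_T
```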
Theorem 5 Suppose we have an algorithm A that when run against any ad-
versary for T rounds generates a transcript π that satisfies marginal quantile
consistency with error at most α for some target quantile q. Suppose we have
some model f : X → [0, 1] and a data distribution D, and consider the following procedure to simulate an adversary: at each round t, draw a fresh example (x̂_t, ŷ_t) ∼ D and feed it to A, obtaining predictions p_1, . . . , p_T. Let ∆ be sampled uniformly at random from {p_1, . . . , p_T}. Then with probability 1 − δ, the randomized model f̂(x) = f(x) + ∆ has marginal quantile consistency error at most α′ = α + √(2 ln(2/δ)/T). In other words:

|Pr_{(x,y)∼D,∆}[y ≤ f(x) + ∆] − q| ≤ α′
Let D′ be the label distribution induced by outputting the label y − f(x) for (x, y) ∼ D and let F denote its CDF: F(c) = Pr_{y′∼D′}[y′ ≤ c]. We want to be able to say that (1/T) Σ_{t=1}^T F(p_t) ≈ q, but we have a problem: the indicators 1[y_t ≤ p_t] are not independent random variables even though the y_t are, since each p_t is potentially chosen as a function of all previous labels y_1, . . . , y_{t−1}. Hence we cannot apply Hoeffding's inequality. But all is not lost! We will need Azuma's inequality (Theorem 48), which we quote here before its first use:
Let X_1, . . . , X_n be random variables (not necessarily independent) bounded such that for each i, |X_i| ≤ c_i. Let X_{<i} denote the prefix X_1, X_2, . . . , X_{i−1}. Then for all t > 0:

Pr[ |Σ_{i=1}^n X_i − Σ_{i=1}^n E[X_i | X_{<i}]| ≥ t ] ≤ 2 exp( −t² / (2 Σ_{i=1}^n c_i²) )
Combining this with our guarantee of marginal quantile consistency, with probability 1 − δ we have that:

|(1/T) Σ_{t=1}^T F(p_t) − q| ≤ α + √(2 ln(2/δ) / T)
Finally we can compute:

|Pr_{(x,y)∼D,∆}[y ≤ f(x) + ∆] − q| = |(1/T) Σ_{t=1}^T Pr_{(x,y)∼D}[y ≤ f(x) + p_t] − q|
= |(1/T) Σ_{t=1}^T F(p_t) − q|
≤ α + √(2 ln(2/δ) / T)

where the last inequality holds with probability 1 − δ over the draws of {(x_t, y_t)}_{t=1}^T.
Next, we give our algorithm for making predictions that satisfy online
marginal quantile consistency for any target quantile q against any adversar-
ially chosen sequence of examples. The algorithm takes as input a “learning
rate” parameter η, and can be viewed and analyzed as online gradient descent
on the pinball loss. But the specific form of the resulting update also lends
itself to a very simple analysis showing that its quantile error tends to 0 at
a rate of 1/T , just as our algorithm for obtaining marginal mean consistency
does.
Algorithm 2 Online-Marginal-Quantile-Predictor(q, η)
Let p1 = 0
for t = 1 to T do
Observe xt (and ignore it!)
Predict pt
Observe yt .
Let pt+1 = pt + η(q − 1[yt ≤ pt ])
Proof 10 Examining the update rule p_{t+1} = p_t + η(q − 1[y_t ≤ p_t]) and solving for 1[y_t ≤ p_t], we see:

1[y_t ≤ p_t] = q − (p_{t+1} − p_t)/η

Next observe that for all t, |p_t − p_{t+1}| ≤ η, and since y_t, q ∈ [0, 1], if p_t ≥ 1 it must be that 1[y_t ≤ p_t] = 1 and hence p_{t+1} ≤ p_t; similarly if p_t ≤ 0 then p_{t+1} ≥ p_t. Hence we must have for all t that:

−η ≤ p_t ≤ 1 + η

So, summing the update rule and using p_1 = 0:

|(1/T) Σ_{t=1}^T 1[y_t ≤ p_t] − q| = |p_{T+1}| / (ηT) ≤ (1 + η)/(ηT)
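A minimal simulation of Online-Marginal-Quantile-Predictor(q, η); the telescoping argument in Proof 10 makes the (1 + η)/(ηT) bound hold deterministically on any sequence with labels in [0, 1]. The label stream below is synthetic uniform noise.

```python
import random

def online_quantile(ys, q, eta):
    """Run Algorithm 2; return the fraction of rounds with y_t <= p_t."""
    p, hits = 0.0, 0
    for y in ys:
        hit = y <= p
        hits += hit
        p = p + eta * (q - hit)  # gradient step on the pinball loss
    return hits / len(ys)

rng = random.Random(0)
ys = [rng.random() for _ in range(10000)]
q, eta = 0.8, 0.1
coverage = online_quantile(ys, q, eta)
assert abs(coverage - q) <= (1 + eta) / (eta * len(ys)) + 1e-12
```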
Remark 2.3.1 In fact, there is an even simpler algorithm that can guarantee
marginal quantile consistency against an adversary, with error tending to 0 at
a rate of 1/T . For a q fraction of rounds, predict pt = 1, and for a 1−q fraction
of rounds predict pt = 0. Because we know that yt ∈ [0, 1], we have that on the
q fraction of rounds for which pt = 1, 1[yt ≤ pt ] = 1 and for the remaining
1 − q fraction of rounds, 1[yt ≤ pt ] = 0. Hence we can satisfy marginal
quantile consistency in an entirely data independent way, which should make
us suspicious of marginal guarantees and make us ask for something stronger.
3 Calibration

CONTENTS
3.1 Introduction to Calibration
3.2 Calibrating a Model f
3.3 Quantile Calibration
3.4 Sequential Prediction
3.4.1 Sequential (Mean) Calibration
3.4.2 Sequential Quantile Calibration
References and Further Reading
The marginal guarantees we saw in Chapter 2 were easy to obtain, but ex-
tremely weak. In this chapter we’ll see one way to go beyond marginal guar-
antees, by making calibrated predictions. Calibration on its own is also quite
weak, but not as weak as a marginal guarantee, and should be thought of as
one step up in terms of a “sanity check” intended to falsify whether we have
learned the true conditional label distribution.
Definition 9 (Squared Error) The squared error (also known as Brier score) of a predictor f is:

B(f, D) = E_{(x,y)∼D}[(f(x) − y)²]
On the other hand, if we want our predictions f (x) to have the same
probabilistic semantics as f ∗ (x) — namely that they be a prediction about the
expected value of y, then we might want that f (x) be calibrated. Calibration
asks that the predictions of f be correct conditional on its own predictions:
Roughly, that E_{(x,y)∼D}[y | f(x) = v] = v for all v. So that the conditioning event makes sense, we will restrict attention to functions f that have a range of finite cardinality, and study average calibration error. Let R(f) = {f(x) : x ∈ X} denote the range of f, and let m = |R(f)| denote the cardinality of f's range. We will assume m < ∞.
Definition 10 (Average Calibration Error) The average calibration error of a predictor f on a distribution D is:

K1(f, D) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · |v − E_{(x,y)∼D}[y | f(x) = v]|

The squared and maximum variants are defined analogously:

K2(f, D) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · (v − E_{(x,y)∼D}[y | f(x) = v])²

K∞(f, D) = max_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · |v − E_{(x,y)∼D}[y | f(x) = v]|

When the distribution D is clear from context we will sometimes elide it and simply write K1(f), K2(f), K∞(f), etc.

Sometimes it will be more convenient to work with one of these quantities over another, but they are closely related to one another:
Lemma 3.1.1 For any predictor f : X → [0, 1],

K2(f) ≤ K1(f) ≤ √(K2(f))

K∞(f) ≤ K1(f) ≤ m·K∞(f)
Unlike squared error, which we may never be able to drive to zero (because
of inherent unpredictability), we can in principle drive calibration error to zero:
observe that K2 (f ∗ ) = 0.
In fact, f ∗ also minimizes the squared error over the set of all functions
because f ∗ (x) minimizes squared error point-wise per prediction x:
Lemma 3.1.2 Fix any distribution on labels DY . Let v ∗ = EDY [y] denote the
true label expectation, and let v̂ = v ∗ + ∆ for some ∆ ̸= 0. Then:
Proof 12

E_{y∼D_Y}[(v̂ − y)²] − E_{y∼D_Y}[(v* − y)²] = E_{y∼D_Y}[(v* + ∆)² − 2v̂y − (v*)² + 2v*y]
= E_{y∼D_Y}[2v*∆ + ∆² − 2(v* + ∆)y + 2v*y]
= E_{y∼D_Y}[2v*∆ + ∆² − 2∆y]
= 2v*∆ + ∆² − 2v*∆
= ∆²
Algorithm 3 Calibrate(f, α, D)
Let f_0 = f and t = 0.
while K2(f_t, D) ≥ α do
  Let:
  v_t ∈ arg max_{v∈R(f_t)} Pr_{(x,y)∼D}[f_t(x) = v] · (v − E_{(x,y)∼D}[y | f_t(x) = v])²
  Let v′_t = E_{(x,y)∼D}[y | f_t(x) = v_t], and define f_{t+1}(x) = v′_t if f_t(x) = v_t and f_{t+1}(x) = f_t(x) otherwise.
  Let t = t + 1.
Output f_t.
Proof 13 Observe that at each round before the algorithm halts, since K2(f_t) ≥ α and the sum defining K2 has at most m terms, we must have that:

∆_t ≡ Pr_{(x,y)∼D}[f_t(x) = v_t] · (v_t − E_{(x,y)∼D}[y | f_t(x) = v_t])² ≥ α/m

Let D(v_t) = D | (f_t(x) = v_t) be the distribution that results from conditioning on the event that f_t(x) = v_t and let D(v̄_t) = D | (f_t(x) ≠ v_t) be the distribution that results from conditioning on the event that f_t(x) ≠ v_t. We have that:

B(f_t, D) − B(f_{t+1}, D)
= Pr[f_t(x) = v_t](B(f_t, D(v_t)) − B(f_{t+1}, D(v_t))) + Pr[f_t(x) ≠ v_t](B(f_t, D(v̄_t)) − B(f_{t+1}, D(v̄_t)))
= Pr[f_t(x) = v_t](B(f_t, D(v_t)) − B(f_{t+1}, D(v_t)))
= Pr[f_t(x) = v_t](v_t − v′_t)²
= ∆_t
≥ α/m

Here the second to last equality follows from Lemma 3.1.2. Since for any model f : X → [0, 1], B(f, D) ≤ 1, and for any model f_T, B(f_T, D) ≥ 0, the algorithm must halt after at most T ≤ m/α many rounds. Since each iteration decreases squared error, it must be that B(f_T, D) ≤ B(f, D).
In fact, this argument is wasteful, although its form will be useful for us
later when we investigate stronger forms of calibration. However for simple
calibration, there is a simple one-shot algorithm that obtains perfect calibra-
tion and decreases squared error by exactly the amount of the calibration error
of the original model.
Algorithm 4 One-Shot-Calibrate(f, D)
For each v ∈ R(f) let c(v) = E_{(x,y)∼D}[y | f(x) = v].
Output the model f̂ defined as f̂(x) = c(f(x)).
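On an empirical sample, One-Shot-Calibrate amounts to replacing each prediction value with the average label on its level set. The two-valued toy model and data below are illustrative; note that two distinct level sets may map to the same calibrated value, which is exactly the P(v) case handled in the proof that follows.

```python
from collections import defaultdict

def one_shot_calibrate(f, sample):
    # c(v) = E[y | f(x) = v] on the empirical distribution.
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in sample:
        sums[f(x)] += y
        counts[f(x)] += 1
    c = {v: sums[v] / counts[v] for v in counts}
    return lambda x: c[f(x)]

sample = [(0, 0.0), (1, 1.0), (2, 1.0), (3, 0.0), (4, 1.0), (5, 1.0)]
f = lambda x: 0.5 if x < 3 else 0.9  # a miscalibrated two-valued model

f_hat = one_shot_calibrate(f, sample)
# Both level sets have mean label 2/3, so both collapse to the value 2/3.
assert abs(f_hat(0) - 2 / 3) < 1e-12 and abs(f_hat(4) - 2 / 3) < 1e-12
```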
Proof 14 Consider any level set of f̂: S(v) = {x : f̂(x) = v}. By definition, for all x ∈ S(v), we must have f(x) = v′ such that c(v′) = v, i.e. such that c(f(x)) = E_{(x,y)∼D}[y | f(x) = v′] = v. Let P(v) = {v′ : c(v′) = v}. We have that:

E_{(x,y)}[y | x ∈ S(v)] = ( Σ_{v′∈P(v)} Pr[f(x) = v′] · c(v′) ) / ( Σ_{v′∈P(v)} Pr[f(x) = v′] ) = v

Hence:

K2(f̂) = Σ_{v∈R(f̂)} Pr[f̂(x) = v] · (v − E[y | x ∈ S(v)])² = 0
Next, observe that we can decompose the squared error of both f and f̂ according to the level sets of f, which form a partition of X:

B(f, D) − B(f̂, D) = Σ_{v∈R(f)} Pr[f(x) = v] · E[(v − y)² − (c(v) − y)² | f(x) = v]
= Σ_{v∈R(f)} Pr[f(x) = v] · (v − c(v))²
= K2(f)

where the second equality follows from Lemma 3.1.2.
Thus we see that mis-calibrated models can always be improved: they can
be efficiently updated to have no calibration error, and in performing this
simple update, their squared error is improved by an amount equal to their
initial calibration error. This also shows that squared error can be decomposed
into two terms: calibration error, and the remainder (which is sometimes called
refinement error), and that the part corresponding to calibration error can
always be removed.
We will eventually be interested in calibrating predictors using a finite
sample of data from a distribution (rather than giving our algorithm the ability
to directly and exactly compute expectations on the distribution), which will
require proving generalization theorems. But we will defer this to Chapter 4,
when we will prove such theorems for more demanding notions of calibration.
Algorithm 5 One-Shot-Quantile-Calibrate(f, q, D)
For each v ∈ R(f) let c(v) be such that Pr_{(x,y)∼D}[y ≤ c(v) | f(x) = v] = q.
Output the model f̂ defined as f̂(x) = c(f(x)).
Theorem 9 For any function f, any target quantile value q ∈ [0, 1], and any ρ-Lipschitz distribution D, One-Shot-Quantile-Calibrate(f, q, D) (Algorithm 5) outputs a model f̂ such that Q2(f̂) = 0 and PB_q(f̂) ≤ PB_q(f) − (1/(2ρ))·Q2(f).
Proof 15 Consider any level set of f̂: S(v) = {x : f̂(x) = v}. By definition, for all x ∈ S(v), we must have f(x) = v′ such that c(v′) = v, i.e. such that c(f(x)) satisfies Pr_{(x,y)∼D}[y ≤ c(f(x)) | f(x) = v′] = q. Let P(v) = {v′ : c(v′) = v}. We have that:

Pr_{(x,y)}[y ≤ v | x ∈ S(v)] = ( Σ_{v′∈P(v)} Pr[f(x) = v′] · Pr[y ≤ v | f(x) = v′] ) / ( Σ_{v′∈P(v)} Pr[f(x) = v′] ) = q

Hence:

Q2(f̂) = Σ_{v∈R(f̂)} Pr[f̂(x) = v] · (q − Pr[y ≤ v | x ∈ S(v)])² = 0
Next, observe that we can decompose the pinball loss of both f and f̂ according to the level sets of f, which form a partition of X:

PB_q(f, D) − PB_q(f̂, D) = E[L_q(f(x), y)] − E[L_q(f̂(x), y)]
= Σ_{v∈R(f)} Pr[f(x) = v] · E[L_q(f(x), y) − L_q(f̂(x), y) | f(x) = v]
≥ Σ_{v∈R(f)} Pr[f(x) = v] · (Pr[y ≤ f(x) | f(x) = v] − q)² / (2ρ)
= (1/(2ρ)) · Q2(f)

where the inequality follows from Lemma 2.2.2.
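On an empirical sample, the quantile analogue is equally simple: set c(v) to an empirical q-quantile of the labels on each level set. The constant toy model and data below are illustrative, and a discrete empirical quantile stands in for the exact conditional quantile.

```python
import math
from collections import defaultdict

def one_shot_quantile_calibrate(f, sample, q):
    buckets = defaultdict(list)
    for x, y in sample:
        buckets[f(x)].append(y)
    c = {}
    for v, ys in buckets.items():
        ys.sort()
        c[v] = ys[max(0, math.ceil(q * len(ys)) - 1)]  # empirical q-quantile
    return lambda x: c[f(x)]

sample = [(i, i / 10) for i in range(10)]
f = lambda x: 0.3  # a constant (and so certainly miscalibrated) model
f_hat = one_shot_quantile_calibrate(f, sample, q=0.8)

covered = sum(y <= f_hat(x) for x, y in sample) / len(sample)
assert abs(covered - 0.8) <= 0.1  # calibrated up to empirical discreteness
```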
Proof 16 Since the terms in the squared mean calibration error corresponding to predictions p ≠ p_{s+1} do not change, we can compute the change in calibration error round by round.
Next, our plan is to show that for every transcript π ≤s there is a distribu-
tion over subsequent predictions ps+1 such that for every possible realization
of ys+1 , Eps+1 [∆s+1 (ps+1 , ys+1 )] is small. If we can show this, then the algo-
rithm that consists of playing this randomized strategy at each round will have
small expected calibration loss, which we can conclude simply by summing the
terms ∆s (ps , ys ) from s = 1 to T .
Towards this end, define:

∆¹_{s+1}(p_{s+1}, y_{s+1}) = ( 2V_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (y_{s+1} − p_{s+1})

With this notation, Lemma 3.4.1 states that:

∆_{s+1}(p_{s+1}, y_{s+1}) ≤ ∆¹_{s+1}(p_{s+1}, y_{s+1}) + 1/n(π^{≤s}, p_{s+1}).
Lemma 3.4.2 Fix any partial transcript π^{≤s}. Consider the distribution over p_{s+1} that we can sample from as follows:

1. If V_s^1(π^{≤s}) ≥ 0: Predict p_{s+1} = 1 with probability 1.
2. If V_s^0(π^{≤s}) ≤ 0: Predict p_{s+1} = 0 with probability 1.
3. Otherwise: Find a p ∈ {0, 1/m, . . . , (m−1)/m} such that V_s^p(π^{≤s}) ≥ 0 and V_s^{p+1/m}(π^{≤s}) ≤ 0. Compute q ∈ [0, 1] such that:

q · V_s^p(π^{≤s})/√(n(π^{≤s}, p)) + (1 − q) · V_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)) = 0

Predict p_{s+1} = p with probability q and predict p_{s+1} = p + 1/m with probability 1 − q.

This distribution has the property that for every y_{s+1} ∈ [0, 1]:

E_{p_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] ≤ 2/m
Proof 17 We bound E_{p_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] separately in each of the three cases.

Case 1:
In this case, V_s^1(π^{≤s}) ≥ 0 and p_{s+1} = 1. Note that since y_{s+1} ∈ [0, 1], we must have that (y_{s+1} − p_{s+1}) ≤ 0 and so for all y_{s+1} ∈ [0, 1]:

∆¹_{s+1}(p_{s+1}, y_{s+1}) = ( 2V_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (y_{s+1} − p_{s+1}) ≤ 0

Case 2:
In this case, V_s^0(π^{≤s}) ≤ 0 and p_{s+1} = 0. Note that since y_{s+1} ∈ [0, 1], we must have that (y_{s+1} − p_{s+1}) ≥ 0 and so for all y_{s+1} ∈ [0, 1]:

∆¹_{s+1}(p_{s+1}, y_{s+1}) = ( 2V_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (y_{s+1} − p_{s+1}) ≤ 0

Case 3:
First we observe that in this case, V_s^0(π^{≤s}) ≥ 0 and V_s^1(π^{≤s}) ≤ 0. Hence there must exist some adjacent pair p, p + 1/m ∈ [1/m] such that V_s^p(π^{≤s}) ≥ 0 and V_s^{p+1/m}(π^{≤s}) ≤ 0, so the algorithm is well defined. Recall that q ∈ [0, 1] is such that q · V_s^p(π^{≤s})/√(n(π^{≤s}, p)) + (1 − q) · V_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)) = 0. We can compute:
Algorithm 6 Online-Calibrated-Predictor(m)
for t = 1 to T do
  Observe x_t (and ignore it!)
  if V_{t−1}^1(π^{<t}) ≥ 0 then
    Predict p_t = 1.
  else if V_{t−1}^0(π^{<t}) ≤ 0 then
    Predict p_t = 0.
  else
    Select p ∈ {0, 1/m, . . . , (m−1)/m} such that V_{t−1}^p(π^{<t}) ≥ 0 and V_{t−1}^{p+1/m}(π^{<t}) ≤ 0.
    Compute q ∈ [0, 1] such that:
    q · V_{t−1}^p(π^{<t})/√(n(π^{<t}, p)) + (1 − q) · V_{t−1}^{p+1/m}(π^{<t})/√(n(π^{<t}, p + 1/m)) = 0
    Predict p_t = p with probability q and predict p_t = p + 1/m with probability 1 − q.
  Observe y_t
  Let π^{<t+1} = π^{<t} ∘ (x_t, p_t, y_t)

where in the last step, we take the maximum over all length T transcripts π̃ = {(x̃_1, p̃_1, ỹ_1), . . . , (x̃_T, p̃_T, ỹ_T)}.
We now take the expectation of both sides (over the randomness of the algorithm's predictions p_t) and apply Lemma 3.4.2:

E[K̂2(π)] ≤ Σ_{t=1}^T E_{p_t,y_t}[∆¹_t(p_t, y_t) | π^{<t}] + max_π̃ Σ_{t=1}^T 1/n(π̃^{≤t−1}, p̃_t)
≤ 2T/m + max_π̃ Σ_{t=1}^T 1/n(π̃^{≤t−1}, p̃_t)
It remains to bound max_π̃ Σ_{t=1}^T 1/n(π̃^{≤t−1}, p̃_t). To do this, we observe that whenever p̃_t = p, then we must have that n(π̃^{≤t}, p) = n(π̃^{≤t−1}, p) + 1. Hence for any transcript π̃ we can write:

Σ_{t=1}^T 1/n(π̃^{≤t−1}, p̃_t) = Σ_{p∈[1/m]} Σ_{t : p̃_t = p} 1/n(π̃^{≤t−1}, p)
= Σ_{p∈[1/m]} Σ_{k=1}^{n(π̃,p)−1} 1/k
≤ (m + 1) Σ_{k=1}^{T/m} 1/k
= (m + 1) · H_{T/m}
≤ (m + 1) · (log(T/m) + 1)

Here H_k denotes the k'th Harmonic number; the first inequality uses the fact that the m + 1 counts n(π̃, p) sum to T, and that a sum of harmonic numbers with a fixed total is maximized when the counts are balanced.
Combining these bounds we find that:

E[K2(π)] = E[K̂2(π)/T] ≤ 2/m + ((m + 1)/T) · (log(T/m) + 1)
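The bookkeeping behind Algorithm 6 is light: for each grid value p, track V[p], the running sum of y_s − p_s over rounds where p_s = p, and the count n[p]. The sketch below runs it against i.i.d. Bernoulli labels for illustration (the guarantee holds even against adversaries); treating n(·, p) as max(count, 1) on a value's first use is an assumption made here to avoid division by zero, a detail the text handles implicitly.

```python
import random

def online_calibrated_predictor(labels, m, rng):
    grid = [i / m for i in range(m + 1)]
    V = {p: 0.0 for p in grid}  # V[p] = sum of (y_s - p_s) over rounds with p_s = p
    n = {p: 0 for p in grid}
    preds = []
    for y in labels:
        if V[1.0] >= 0:
            p = 1.0
        elif V[0.0] <= 0:
            p = 0.0
        else:
            # find adjacent grid points with V[p] >= 0 >= V[p + 1/m]
            i = max(i for i in range(m) if V[grid[i]] >= 0)
            a = V[grid[i]] / max(n[grid[i]], 1) ** 0.5
            b = V[grid[i + 1]] / max(n[grid[i + 1]], 1) ** 0.5
            q = 0.5 if a == b else -b / (a - b)  # q*a + (1-q)*b = 0
            p = grid[i] if rng.random() < q else grid[i + 1]
        preds.append(p)
        V[p] += y - p
        n[p] += 1
    return preds

rng = random.Random(1)
labels = [float(rng.random() < 0.4) for _ in range(5000)]
preds = online_calibrated_predictor(labels, m=10, rng=rng)
assert all(p in {i / 10 for i in range(11)} for p in preds)
```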
Lemma 3.4.3 Fix any q ∈ [0, 1], any partial transcript π^{≤s}, and any triple (p_{s+1}, x_{s+1}, y_{s+1}) of potential outcomes for the next round. Let π^{≤s+1} = π^{≤s} ∘ (p_{s+1}, x_{s+1}, y_{s+1}) be the corresponding continuation of the transcript. Redefine:

∆_{s+1}(p_{s+1}, y_{s+1}) = Q̂2(π^{≤s+1}) − Q̂2(π^{≤s})

to be the increase in the (unnormalized) squared quantile calibration error that results from the transcript continuation. Then we have that:

∆_{s+1}(p_{s+1}, y_{s+1}) ≤ ( 2W_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (q − 1[y_{s+1} ≤ p_{s+1}]) + 1/n(π^{≤s}, p_{s+1})
Next, our plan is to show that for every transcript π ≤s there is a distribu-
tion over subsequent predictions ps+1 such that for every possible ρ-Lipschitz
distribution over ys+1 , Eps+1 ,ys+1 [∆s+1 (ps+1 , ys+1 )] is small. Note that here
we are deviating from our derivation of mean calibration algorithms, in that
we are requiring that the label ys+1 be drawn from a ρ-Lipschitz distribution,
and we are taking the expectation over ys+1 as well as ps+1 . If we can show
this, then the algorithm that consists of playing this randomized strategy at
each round will have small expected quantile calibration loss, which we can
conclude simply by summing the terms ∆s (ps , ys ) from s = 1 to T .
Towards this end, define:

∆¹_{s+1}(p_{s+1}, y_{s+1}) = ( 2W_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (q − 1[y_{s+1} ≤ p_{s+1}])
Lemma 3.4.4 Fix any partial transcript π^{≤s}. Consider the distribution over p_{s+1} that we can sample from as follows:

1. If W_s^1(π^{≤s}) ≥ 0: Predict p_{s+1} = 1 with probability 1.
2. If W_s^0(π^{≤s}) ≤ 0: Predict p_{s+1} = 0 with probability 1.
3. Otherwise: Find a p ∈ {0, 1/m, . . . , (m−1)/m} such that W_s^p(π^{≤s}) ≥ 0 and W_s^{p+1/m}(π^{≤s}) ≤ 0. Compute b ∈ [0, 1] such that:

b · W_s^p(π^{≤s})/√(n(π^{≤s}, p)) + (1 − b) · W_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)) = 0

Predict p_{s+1} = p with probability b and predict p_{s+1} = p + 1/m with probability 1 − b.

This distribution has the property that for every ρ-Lipschitz distribution over y_{s+1} ∈ [0, 1]:

E_{p_{s+1},y_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] ≤ 2ρ/m
Proof 20 We bound E_{p_{s+1},y_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] separately in each of the three cases.

Case 1:
In this case, W_s^1(π^{≤s}) ≥ 0 and p_{s+1} = 1. Note that since q, y_{s+1} ∈ [0, 1], we must have that (q − 1[y_{s+1} ≤ p_{s+1}]) ≤ 0 and so for all y_{s+1} ∈ [0, 1]:

∆¹_{s+1}(p_{s+1}, y_{s+1}) = ( 2W_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (q − 1[y_{s+1} ≤ p_{s+1}]) ≤ 0

Case 2:
In this case, W_s^0(π^{≤s}) ≤ 0 and p_{s+1} = 0. Note that since q, y_{s+1} ∈ [0, 1], we must have that if y_{s+1} > 0 (which occurs with probability 1 if it is drawn from a continuous distribution), (q − 1[y_{s+1} ≤ p_{s+1}]) ≥ 0 and so for all q, y_{s+1} ∈ (0, 1]:

∆¹_{s+1}(p_{s+1}, y_{s+1}) = ( 2W_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (q − 1[y_{s+1} ≤ p_{s+1}]) ≤ 0

Case 3:
First we observe that in this case, W_s^0(π^{≤s}) ≥ 0 and W_s^1(π^{≤s}) ≤ 0. Hence there must exist some adjacent pair p, p + 1/m ∈ [1/m] such that W_s^p(π^{≤s}) ≥ 0 and W_s^{p+1/m}(π^{≤s}) ≤ 0, so the algorithm is well defined. Recall that b ∈ [0, 1] is such that b · W_s^p(π^{≤s})/√(n(π^{≤s}, p)) + (1 − b) · W_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)) = 0. We can compute:
    Predict p_t = p with probability b and predict p_t = p + 1/m with probability 1 − b.
  Observe y_t
  Let π^{<t+1} = π^{<t} ∘ (x_t, p_t, y_t)

where in the last step, we take the maximum over all length T transcripts π̃ = {(x̃_1, p̃_1, ỹ_1), . . . , (x̃_T, p̃_T, ỹ_T)}.
We now take the expectation of both sides (over the randomness of the algorithm's predictions p_t) and apply Lemma 3.4.4:

E[Q̂2(π)] ≤ Σ_{t=1}^T E_{p_t,y_t}[∆¹_t(p_t, y_t) | π^{<t}] + max_π̃ Σ_{t=1}^T 1/n(π̃^{≤t−1}, p̃_t)
≤ 2ρT/m + max_π̃ Σ_{t=1}^T 1/n(π̃^{≤t−1}, p̃_t)

It remains to bound max_π̃ Σ_{t=1}^T 1/n(π̃^{≤t−1}, p̃_t). To do this, we observe that whenever p̃_t = p, then we must have that n(π̃^{≤t}, p) = n(π̃^{≤t−1}, p) + 1. Hence for any transcript π̃, the same harmonic number bound as in the mean case applies.
4 Multigroup Guarantees

CONTENTS
4.1 Group Conditional Mean Consistency
4.2 Group Conditional Quantile Consistency
4.2.1 A More Direct Approach to Group Conditional Guarantees
4.2.1.1 Generalization
4.3 Multicalibration: Group Conditional Calibration
4.4 Quantile Multicalibration
4.5 Out of Sample Generalization
4.5.1 Mean Multicalibration
4.5.2 Quantile Multicalibration
4.6 Sequential Prediction
4.6.1 A Bucketed Calibration Definition
4.6.2 Achieving Bucketed Calibration
4.6.3 Obtaining Bucketed Quantile Multicalibration
References and Further Reading
Marginal guarantees are easy to obtain, but very weak. We saw one way of
strengthening those guarantees: calibration. But on its own calibration is also
quite weak. Obtaining it in the adversarial sequential prediction setting was
non-trivial, but we could obtain it in the batch setting with a simple constant
predictor fˆ(x) = E(x,y)∼D [y] that just predicts the mean of the marginal label
distribution. Moreover, all of the techniques we’ve seen so far entirely ignore
the features x and depend only on the labels y! We’ll now consider a different
way to strengthen marginal guarantees, first on its own, and then together
with calibration. We will call these multi-group guarantees, and they ask for
guarantees that hold conditional on the features x in various ways.
Let G ⊆ 2X denote a collection of groups or subsets of the data domain
X . We will represent groups using their indicator functions: so g ∈ G is represented as a function g : X → {0, 1}, where g(x) = 1 denotes that x ∈ X
is a member of group g, and g(x) = 0 denotes that x is not a member of g.
Given an example x ∈ X , we will write G(x) = {g ∈ G : g(x) = 1} to denote
the set of groups that x is a member of. At a high level, our aim will be to
obtain guarantees like mean consistency (and eventually calibration) not just
Uncertain: Modern Topics in Uncertainty Estimation (INCOMPLETE WORKING DRAFT)
marginally, but conditionally on g(x) = 1 for every g ∈ G for some large set
G.
Notice that our requirement smoothly becomes less demanding as the measure
of the group g grows smaller, allowing us to ask for stronger guarantees for
groups for which we will have more data. We have parameterized things so
that the scaling is at the right rate: the error within a sub-group g increases
at a rate of $1/\sqrt{\mu(g)}$, which is the same rate at which the error of our best
estimate of $\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid g(x)=1]$ from the data will necessarily increase.
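This $1/\sqrt{\mu(g)}$ rate is easy to check empirically. The following is a small simulation sketch (the Bernoulli labels, sample sizes, and group measures are illustrative assumptions, not taken from the text): each sample falls in group $g$ with probability $\mu(g)$, and we measure the root-mean-squared error of the empirical in-group label mean.

```python
import math
import random

def group_mean_rmse(n, mu_g, trials=2000, seed=0):
    """Estimate the RMSE of the empirical in-group mean of y, when each of n
    samples lands in group g with probability mu_g and in-group labels are
    Bernoulli(1/2), so the target conditional mean is 0.5."""
    rng = random.Random(seed)
    sq_errs = []
    for _ in range(trials):
        k = sum(1 for _ in range(n) if rng.random() < mu_g)  # in-group count
        if k == 0:
            sq_errs.append(0.25)  # no in-group data: error of a blind guess
            continue
        est = sum(1 for _ in range(k) if rng.random() < 0.5) / k
        sq_errs.append((est - 0.5) ** 2)
    return math.sqrt(sum(sq_errs) / trials)

# Shrinking mu(g) by a factor of 4 should roughly double the estimation
# error, matching the 1/sqrt(mu(g)) rate.
err_large_group = group_mean_rmse(n=4000, mu_g=0.4)
err_small_group = group_mean_rmse(n=4000, mu_g=0.1)
ratio = err_small_group / err_large_group
```

The observed ratio is close to $\sqrt{0.4/0.1}=2$, as the parameterization in the text anticipates.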
We will now show how to update a model f that does not satisfy group
conditional mean consistency to one that does, using a sequence of “patches”
that are similar to how we obtained calibration. Just as in the examples we
have seen thus far, these patches will be accuracy improving, and so we will
quickly converge to a group conditional mean consistent model.
Algorithm 8 GroupShift(f, α, G)
Let $f_0 = f$ and $t = 0$.
while $f_t$ does not satisfy α-approximate group conditional mean consistency w.r.t. G: do
Let:
$$g_t\in\arg\max_{g\in\mathcal G}\;\mu(g)\left(\mathbb{E}_{(x,y)\sim\mathcal D}[f_t(x)\mid g(x)=1]-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid g(x)=1]\right)^2$$
$$\Delta_t=\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid g_t(x)=1]-\mathbb{E}_{(x,y)\sim\mathcal D}[f_t(x)\mid g_t(x)=1]$$
Let $f_{t+1}=h(x,f_t;g_t,\Delta_t)$ and $t=t+1$.
Output $f_t$.
Lemma 4.1.1 Fix any model $f_t : X \to [0, 1]$ and group $g_t : X \to \{0, 1\}$. Let
$$\Delta_t=\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid g_t(x)=1]-\mathbb{E}_{(x,y)\sim\mathcal D}[f_t(x)\mid g_t(x)=1]$$
and
$$f_{t+1}=h(x,f_t;g_t,\Delta_t)$$
(i.e. the update performed at round t of Algorithm 8). Then:
$$B(f_t)-B(f_{t+1})=\mu(g_t)\,\Delta_t^2$$
Theorem 12 Given any model f , any collection of groups G, and any α > 0
Algorithm 8 (GroupShift) halts after T ≤ 1/α many rounds and outputs a
model fT that satisfies α-approximate group conditional mean consistency.
Moreover, if the algorithm runs for T rounds, then B(fT ) ≤ B(f ) − T α.
Proof 23 At any round T at which the algorithm halts, by the stopping condition of the algorithm it must be that $f_T$ satisfies α-approximate group conditional mean consistency. It remains to bound T and $B(f_T)$.
Consider any intermediate round t < T of the algorithm. We know since the algorithm has not halted that:
$$\max_{g\in\mathcal G}\mu(g)\cdot\left(\mathbb{E}_{(x,y)\sim\mathcal D}[f_t(x)\mid g(x)=1]-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid g(x)=1]\right)^2\ge\alpha$$
and hence that:
$$\mu(g_t)\cdot\Delta_t^2\ge\alpha$$
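The GroupShift loop is easy to run on an empirical sample. Below is a minimal sketch (the data, group structure, and stopping tolerance are illustrative assumptions): it repeatedly finds the group with the largest weighted squared gap $\mu(g)\cdot(\mathbb{E}[f\mid g]-\mathbb{E}[y\mid g])^2$ and shifts the group's predictions by that gap, mirroring Algorithm 8.

```python
def group_shift(preds, y, groups, alpha):
    """Minimal empirical sketch of the GroupShift patching loop (Algorithm 8).
    preds: list of predictions f(x_i); y: list of labels; groups: dict mapping
    a group name to a list of boolean membership indicators. Repeatedly finds
    the group maximizing mu(g) * (E[f|g] - E[y|g])^2 and shifts the group's
    predictions by the gap, until every group's weighted squared gap is
    below alpha."""
    f = list(preds)
    n = len(y)
    while True:
        worst, worst_val, worst_gap = None, alpha, 0.0
        for name, member in groups.items():
            idx = [i for i in range(n) if member[i]]
            if not idx:
                continue
            mu = len(idx) / n
            gap = sum(f[i] for i in idx) / len(idx) - sum(y[i] for i in idx) / len(idx)
            if mu * gap * gap >= worst_val:
                worst, worst_val, worst_gap = name, mu * gap * gap, gap
        if worst is None:
            return f
        for i in range(n):
            if groups[worst][i]:
                f[i] -= worst_gap  # patch: shift the group toward E[y | g(x)=1]

# Two disjoint groups: one patch per group achieves exact consistency.
y = [0.1, 0.9, 0.4, 0.8, 0.2, 0.6]
groups = {"g1": [True, True, True, False, False, False],
          "g2": [False, False, False, True, True, True]}
f1 = group_shift([0.5] * 6, y, groups, alpha=1e-9)
```

With overlapping groups the loop may patch the same group more than once, but as the progress lemma shows, each patch decreases squared error, so it still terminates.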
Algorithm 9 QuantileGroupShift(f, α, G, q)
Let $f_0 = f$ and $t = 0$.
while $f_t$ does not satisfy α-approximate group conditional quantile consistency w.r.t. target quantile q and G: do
Let:
$$g_t\in\arg\max_{g\in\mathcal G}\;\mu(g)\left(\Pr_{(x,y)\sim\mathcal D}[y\le f_t(x)\mid g(x)=1]-q\right)^2$$
$$\Delta_t=\arg\min_{\Delta}\left(\Pr_{(x,y)\sim\mathcal D}[y\le f_t(x)+\Delta\mid g_t(x)=1]-q\right)^2$$
Let $f_{t+1}=h(x,f_t;g_t,\Delta_t)$ and $t=t+1$.
Output $f_t$.
$$PB_q(f_t)-PB_q(f_{t+1})=\Pr[g_t(x)=1]\cdot\mathbb{E}_{(x,y)\sim\mathcal D}\left[L_q(f_t(x),y)-L_q(f_{t+1}(x),y)\mid g_t(x)=1\right]\ge\mu(g_t)\cdot\frac{\alpha}{2\rho\,\mu(g_t)}=\frac{\alpha}{2\rho}$$
where the inequality follows from Lemma 2.2.2 applied to the conditional distribution D|(g_t(x) = 1), which must also be ρ-smooth.
Applying this bound iteratively, we have that for every T, $PB_q(f_T)\le PB_q(f)-T\cdot\frac{\alpha}{2\rho}$. Since when f(x) and y are bounded in [0, 1] we have $PB_q(f)\le 1$ and $PB_q(f_T)\ge 0$, the total number of iterations that the algorithm runs for is bounded by:
$$T\le\frac{2\rho\,PB_q(f)}{\alpha}\le\frac{2\rho}{\alpha}$$
Then the final model fT that is output can be seen to have the form:
$$f_T(x)=\hat f(x;\lambda)\equiv f(x)+\sum_{g\in\mathcal G}\lambda_g\cdot g(x)$$
Algorithm 10 Simple-Group-Conditional(f, G)
Let λ∗ be a solution to the optimization problem:
$$\underset{\lambda}{\text{Minimize}}\quad\mathbb{E}_{(x,y)\sim\mathcal D}\left[\left(\hat f(x;\lambda)-y\right)^2\right]$$
Such that:
$$\hat f(x;\lambda)\equiv f(x)+\sum_{g\in\mathcal G}\lambda_g\cdot g(x)$$
Output fˆ(x; λ∗ )
Algorithm 11 Simple-Quantile-Group-Conditional(f, G, q)
Let λ∗ be a solution to the optimization problem:
$$\underset{\lambda}{\text{Minimize}}\quad\mathbb{E}_{(x,y)\sim\mathcal D}\left[L_q\!\left(\hat f(x;\lambda),y\right)\right]$$
Such that:
$$\hat f(x;\lambda)\equiv f(x)+\sum_{g\in\mathcal G}\lambda_g\cdot g(x)$$
Output fˆ(x; λ∗ )
Theorem 14 Fix any model f : X → [0, 1] and class of groups G. The model
fˆ(x; λ∗ ) output by Algorithm 10 satisfies perfect (i.e. 0-approximate) group
conditional mean consistency. Moreover, if fT is the model output by Algo-
rithm 8, then B(fˆ(·; λ∗ )) ≤ B(fT ).
Proof 25 Suppose fˆ(x; λ∗) does not satisfy group conditional mean consistency. Then there must be a group g ∈ G such that:
$$\left(\mathbb{E}_{(x,y)\sim\mathcal D}[\hat f(x;\lambda^*)\mid g(x)=1]-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid g(x)=1]\right)^2>0$$
In this case, patching fˆ(x; λ∗) on group g exactly as in Algorithm 8 strictly decreases its squared error, and the patched model is again of the form fˆ(x; λ̂) for a feasible λ̂, contradicting the optimality of λ∗; the argument is identical to the quantile case in Proof 26 below.
Theorem 15 Fix any model f : X → [0, 1], target quantile q, and class of
groups G. The model fˆ(x; λ∗ ) output by Algorithm 11 satisfies perfect (i.e.
0-approximate) group conditional quantile consistency with respect to q and
G. Moreover, if fT is the model output by Algorithm 9, then P Bq (fˆ(·; λ∗ )) ≤
P Bq (fT ).
Proof 26 Suppose fˆ(x; λ∗) does not satisfy group conditional quantile consistency. Then there must be a group g ∈ G such that:
$$\left(\Pr_{(x,y)\sim\mathcal D}[y\le\hat f(x;\lambda^*)\mid g(x)=1]-q\right)^2>0$$
Let
$$\Delta=\arg\min_{\Delta'}\left(\Pr_{(x,y)\sim\mathcal D}[y\le\hat f(x;\lambda^*)+\Delta'\mid g(x)=1]-q\right)^2$$
In this case, by Lemma 2.2.1 applied to the distribution D|(g(x) = 1), the model obtained by applying the same patch as in the update rule in Algorithm 9, i.e. $f'(x)=h(x,\hat f(x;\lambda^*),g,\Delta)$, is such that $PB_q(f')<PB_q(\hat f(\cdot;\lambda^*))$. But this is a contradiction to the optimality of λ∗: let λ̂ be the vector such that for all g′ ≠ g, λ̂g′ = λ∗g′ and such that λ̂g = λ∗g + ∆. We can write f′ as f′(x) = fˆ(x; λ̂). Since λ̂ is a feasible solution to the optimization problem in Algorithm 11, by the optimality of λ∗ we must have $PB_q(\hat f(\cdot;\hat\lambda))\ge PB_q(\hat f(\cdot;\lambda^*))$, a contradiction.
Similarly, since $f_T$ can be represented as fˆ(x; λ) for some λ, we have $PB_q(f_T)\ge PB_q(\hat f(\cdot;\lambda^*))$.
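Algorithms 10 and 11 are stated as abstract optimization problems. For the squared-error version, a simple coordinate-descent sketch suffices (the data and groups below are illustrative assumptions): each pass sets $\lambda_g$ so the patched model is exactly mean consistent on g given the other weights, and at the optimum of this convex quadratic the first-order conditions are exactly group conditional mean consistency, as Theorem 14 asserts.

```python
def simple_group_conditional(preds, y, groups, sweeps=200):
    """Coordinate descent on the group weights lambda_g for the squared-error
    objective of Algorithm 10. Each update adds the in-group residual mean to
    lambda_g, which is the exact minimizer in that coordinate; for this convex
    quadratic the sweeps converge to the optimum, whose first-order conditions
    are exactly group conditional mean consistency."""
    n = len(y)
    lam = {name: 0.0 for name in groups}
    f = list(preds)
    for _ in range(sweeps):
        for name, member in groups.items():
            idx = [i for i in range(n) if member[i]]
            if not idx:
                continue
            step = sum(y[i] - f[i] for i in idx) / len(idx)
            lam[name] += step
            for i in idx:
                f[i] += step
    return f, lam

# Two overlapping groups: coordinate descent still converges, and the output
# is mean consistent on both groups simultaneously.
y = [1.0, 1.0, 0.0, 0.0]
groups = {"g1": [True, True, False, False], "g2": [True, False, True, False]}
f_out, lam = simple_group_conditional([0.5] * 4, y, groups)
gap1 = sum(f_out[i] for i in (0, 1)) / 2 - sum(y[i] for i in (0, 1)) / 2
gap2 = sum(f_out[i] for i in (0, 2)) / 2 - sum(y[i] for i in (0, 2)) / 2
```

For disjoint groups a single sweep already solves the problem; overlap is what makes the joint optimization (and the multigroup guarantee) nontrivial.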
4.2.1.1 Generalization
What about out of sample guarantees? That is, what happens if we run Algorithms 10 and 11 on the empirical distribution over a dataset D ∼ 𝒟ⁿ?
Our generalization theorem will depend on the norm of the solution λ∗
output by our algorithms, so it will be helpful for us to study a regularized
version of these simple algorithms that is guaranteed to output a solution of
small norm.
Algorithm 12 Simple-Group-Conditional-Regularized(f, G, D, η)
Let λ∗ be a solution to the optimization problem:
$$\underset{\lambda}{\text{Minimize}}\quad\mathbb{E}_{(x,y)\sim D}\left[\left(\hat f(x;\lambda)-y\right)^2\right]+\eta\|\lambda\|_1$$
Such that:
$$\hat f(x;\lambda)\equiv f(x)+\sum_{g\in\mathcal G}\lambda_g\cdot g(x)$$
Output fˆ(x; λ∗ )
$$\le 1+\eta\|\lambda_0\|_1=1$$
Thus fˆ(x; λ0) has lower objective value than fˆ(x; λ∗), contradicting the optimality of λ∗.
Ok — so Algorithm 12 produces solutions of small norm. How many small norm solutions are there anyhow? Obviously there are continuously many, so we need a more refined way to ask this question. To do this, let's define an ϵ-net.
Definition 21 Let B(C, d) = {x ∈ Rd : ||x||1 ≤ C} denote the d dimensional
ℓ1 ball of radius C. Let Nϵ (C, d) ⊂ B(C, d) be some finite subset of the ball.
We say that Nϵ (C, d) is an ℓ1 ϵ-net for B(C, d) if for every x ∈ B(C, d) there
is an x′ ∈ Nϵ (C, d) such that ||x − x′ ||1 ≤ ϵ.
and in particular:
$$|N_\epsilon|\le\frac{V^d_{C+\epsilon/2}}{V^d_{\epsilon/2}}\le\left(\frac{C+\frac{\epsilon}{2}}{\frac{\epsilon}{2}}\right)^d=\left(\frac{2C}{\epsilon}+1\right)^d$$
where $V^d_r$ denotes the volume of a d-dimensional ℓ1 ball of radius r.
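Finite ε-nets of this kind are easy to exhibit directly. The sketch below (the radius, dimension, and spacing are illustrative choices, and the grid net it builds is larger than the volume-argument bound, which counts a smallest net) constructs an ℓ1 ε-net for the ℓ1 ball by intersecting a grid with the ball, then checks the covering property on random points:

```python
import itertools
import random

def l1_grid_net(C, d, eps):
    """An explicit l1 eps-net for B(C, d): grid points of spacing eps/d that
    lie inside the ball. Rounding each coordinate of a ball point toward zero
    stays inside the ball and moves the point by less than d*(eps/d) = eps in
    l1 distance, so this finite set is indeed an eps-net."""
    step = eps / d
    k = int(C / step)
    axis = [i * step for i in range(-k, k + 1)]
    return [p for p in itertools.product(axis, repeat=d)
            if sum(abs(c) for c in p) <= C]

def covered(net, x, eps):
    return any(sum(abs(a - b) for a, b in zip(p, x)) <= eps for p in net)

random.seed(1)
C, d, eps = 1.0, 2, 0.25
net = l1_grid_net(C, d, eps)
all_covered = True
for _ in range(200):
    x = [random.uniform(-C, C) for _ in range(d)]
    s = sum(abs(c) for c in x)
    if s > C:
        x = [c * C / s for c in x]  # project onto the l1 ball
    all_covered = all_covered and covered(net, x, eps)
```

The grid construction is wasteful compared to the $\left(\frac{2C}{\epsilon}+1\right)^d$ bound for an optimal net, but it makes the finiteness of ε-nets concrete.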
Lemma 4.2.2 Let λ, λ′ ∈ B(C, |G|) be such that ||λ − λ′||1 ≤ ϵ. Then for all x, y, we have that:
$$\left|(f(x;\lambda)-y)^2-(f(x;\lambda')-y)^2\right|=\left|f(x;\lambda)^2-f(x;\lambda')^2+2y\left(f(x;\lambda')-f(x;\lambda)\right)\right|$$
$$=\left|\left(f(x;\lambda)-f(x;\lambda')\right)\left(f(x;\lambda)+f(x;\lambda')-2y\right)\right|\le\|\lambda-\lambda'\|_1\left(\|\lambda\|_1+\|\lambda'\|_1\right)\le 2\epsilon C$$
Lemma 4.2.3 Fix any finite subset S ⊂ B(C, |G|) and any δ > 0. Let D ∼
Dn consist of n samples (x, y). Then with probability 1 − δ, for every λ ∈ S:
$$\left|\mathbb{E}_{(x,y)\sim\mathcal D}\left[(f(x;\lambda)-y)^2\right]-\mathbb{E}_{(x,y)\sim D}\left[(f(x;\lambda)-y)^2\right]\right|\le(C+1)^2\sqrt{\frac{\ln\frac{2|S|}{\delta}}{2n}}$$
Proof 30 First observe that since λ ∈ B(C, |G|), we have that for all x, −C ≤ f(x;λ) ≤ C + 1, and thus for all x, y the squared loss satisfies $0\le(f(x;\lambda)-y)^2\le(C+1)^2$. We can therefore apply Hoeffding's inequality to conclude that for any fixed λ ∈ B(C, d):
$$\Pr_{D\sim\mathcal D^n}\left[\left|\mathbb{E}_{(x,y)\sim\mathcal D}\left[(f(x;\lambda)-y)^2\right]-\mathbb{E}_{(x,y)\sim D}\left[(f(x;\lambda)-y)^2\right]\right|\ge t\right]\le 2\exp\left(\frac{-2nt^2}{(C+1)^4}\right)$$
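The Hoeffding step can be sanity-checked numerically. In this sketch (the uniform loss distribution and parameter values are illustrative assumptions), the empirical frequency of large deviations of a sample mean of variables bounded in an interval of width $(C+1)^2$ sits far below the stated bound:

```python
import math
import random

random.seed(3)
C, n, t, trials = 1.0, 400, 0.15, 2000
width = (C + 1) ** 2                 # losses bounded in [0, (C+1)^2]
hoeffding_bound = 2 * math.exp(-2 * n * t ** 2 / width ** 2)
exceed = 0
for _ in range(trials):
    mean = sum(width * random.random() for _ in range(n)) / n
    if abs(mean - width / 2) >= t:   # deviation from the true mean, width/2
        exceed += 1
deviation_freq = exceed / trials
```

Hoeffding's inequality is distribution-free, so for this particular (uniform) distribution the true deviation probability is much smaller than the bound; the bound is what lets the argument cover every λ at once.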
Proof 32 Let
$$\hat\lambda=\arg\min_{\lambda}\;\mathbb{E}_{(x,y)\sim\mathcal D}\left[\left(\hat f(x;\lambda)-y\right)^2\right]+\eta\|\lambda\|_1$$
i.e. the true minimizer of the regularized objective function over 𝒟. We know from Lemma 4.2.1 that λ∗, λ̂ ∈ B(1/η, |G|). Hence from Theorem 17 and the fact that λ∗ minimizes the objective function on D, we have that with probability 1 − δ:
the result of applying a patch operation, where λ′g′ = λ∗g′ for all g′ ≠ g and λ′g = λ∗g + ∆. By Lemma 4.1.1, we have that B(f(x, λ∗), D) − B(f(x, λ′), D) > α. This will contradict the optimality of λ̂ above if we have that:
$$\alpha>\eta\left(\|\lambda'\|_1-\|\lambda^*\|_1\right)+4\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
To avoid the contradiction we must have that:
$$\alpha\le\eta\left(\|\lambda'\|_1-\|\lambda^*\|_1\right)+4\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
$$\le\eta|\Delta|+4\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
$$\le\eta\sqrt{\frac{\alpha}{\mu(g)}}+4\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
$$\le\eta+4\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
where the second to last inequality follows from the fact that $|\Delta|=\sqrt{\frac{\alpha}{\mu(g)}}$ and the last inequality follows from the assumption that µ(g) ≥ α.
Algorithm 13 Simple-Quantile-Group-Conditional-Regularized(f, G, q, η)
Let λ∗ be a solution to the optimization problem:
$$\underset{\lambda}{\text{Minimize}}\quad\mathbb{E}_{(x,y)\sim D}\left[L_q\!\left(\hat f(x;\lambda),y\right)\right]+\eta\|\lambda\|_1$$
Such that:
$$\hat f(x;\lambda)\equiv f(x)+\sum_{g\in\mathcal G}\lambda_g\cdot g(x)$$
Output fˆ(x; λ∗ )
The basic strategy is the same, and so we highlight only the differences. Since pinball loss is also bounded within [0, 1] when f(x), y ∈ [0, 1], we continue to have that solutions output by Algorithm 13 are norm bounded:

Lemma 4.2.4 Let f : X → [0, 1] be any model with range [0, 1], let G be any set of groups, let D be any distribution over labelled examples, and let η > 0. Then Simple-Quantile-Group-Conditional-Regularized(f, G, D, η) (Algorithm 13) outputs a model fˆ(x, λ∗) with:
$$\|\lambda^*\|_1\le\frac{1}{\eta}$$
We get an even better Lipschitz bound on the loss function:
Lemma 4.2.5 Let λ, λ′ ∈ B(C, |G|) be such that ||λ − λ′||1 ≤ ϵ. Then for all x, y, and for all q ∈ [0, 1] we have that:
$$\left|L_q(f(x;\lambda),y)-L_q(f(x;\lambda'),y)\right|\le\epsilon$$
Similarly, since |Lq (f (x; λ), y)| ≤ C + 1 (rather than (C + 1)2 ) for λ ∈
B(C, |G|), we get a uniform convergence bound that is improved over our
version for squared loss by a factor of (C + 1):
Lemma 4.2.6 Fix any q ∈ [0, 1], any finite subset S ⊂ B(C, |G|) and any
δ > 0. Let D ∼ Dn consist of n samples (x, y). Then with probability 1 − δ,
for every λ ∈ S:
$$\left|\mathbb{E}_{(x,y)\sim\mathcal D}[L_q(f(x;\lambda),y)]-\mathbb{E}_{(x,y)\sim D}[L_q(f(x;\lambda),y)]\right|\le(C+1)\sqrt{\frac{\ln\frac{2|S|}{\delta}}{2n}}$$
$$\left|\mathbb{E}_{(x,y)\sim\mathcal D}[L_q(f(x;\lambda),y)]-\mathbb{E}_{(x,y)\sim D}[L_q(f(x;\lambda),y)]\right|\le(C+1)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln\left(1+\frac{2C}{\epsilon}\right)}{2n}}+2\epsilon$$
In particular, choosing $\epsilon=\frac{C}{\sqrt n}$ gives:
$$\left|\mathbb{E}_{(x,y)\sim\mathcal D}[L_q(f(x;\lambda),y)]-\mathbb{E}_{(x,y)\sim D}[L_q(f(x;\lambda),y)]\right|\le 2(C+1)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
We can now obtain our generalization theorem for quantiles — but we’ll
need one more assumption. Recall that we have already been assuming that
our label distributions have CDFs that are ρ-Lipschitz, which means that
they have CDFs F such that F (τ ) − F (τ ′ ) ≤ ρ(τ − τ ′ ). To prove our next
generalization theorem, we’ll also have to assume that the label distributions
are not too flat — that is, that they are σ-anti-Lipschitz:
Theorem 20 Fix any δ > 0 and model f : X → [0, 1]. Let G be any collection of groups. Let D ∼ 𝒟ⁿ consist of n samples (x, y) from a distribution 𝒟 that is ρ-Lipschitz and σ-anti-Lipschitz. Then with probability 1 − δ, the model fˆ(x, λ∗) output by Simple-Quantile-Group-Conditional-Regularized(f, G, D, η) (Algorithm 13) satisfies α-approximate group conditional quantile consistency on 𝒟 whenever $\min_{g\in\mathcal G}\mu(g)\ge\alpha$ for:
$$\alpha\le\frac{2\eta\rho}{\sigma}+8\rho\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
i.e. the true minimizer of the regularized objective function over 𝒟. We know from Lemma 4.2.4 that λ∗, λ̂ ∈ B(1/η, |G|). Hence from Theorem 19 and the fact that λ∗ minimizes the objective function on D, we have that with probability 1 − δ:
$$\mathbb{E}_{(x,y)\sim\mathcal D}[L_q(\hat f(x;\lambda^*),y)]+\eta\|\lambda^*\|_1\le\mathbb{E}_{(x,y)\sim D}[L_q(\hat f(x;\lambda^*),y)]+\eta\|\lambda^*\|_1+2\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
$$\le\mathbb{E}_{(x,y)\sim D}[L_q(\hat f(x;\hat\lambda),y)]+\eta\|\hat\lambda\|_1+2\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
$$\le\mathbb{E}_{(x,y)\sim\mathcal D}[L_q(\hat f(x;\hat\lambda),y)]+\eta\|\hat\lambda\|_1+4\left(\frac{1}{\eta}+1\right)\sqrt{\frac{\ln\frac{2}{\delta}+|\mathcal G|\ln(1+2\sqrt n)}{2n}}$$
Let α be the minimum value such that f(x; λ∗) satisfies α-approximate group conditional quantile consistency on 𝒟. In other words, there exists a group g such that:
$$\mu(g)\cdot\left(\Pr_{\mathcal D}[y\le\hat f(x;\lambda^*)\mid g(x)=1]-q\right)^2=\alpha$$
Definition 23 Fix any model f : X → [0, 1] and group g : X → {0, 1}. The
average squared calibration error of f on g is:
$$K_2(f,g,\mathcal D)=\sum_{v\in R(f)}\Pr_{(x,y)\sim\mathcal D}[f(x)=v\mid g(x)=1]\left(v-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f(x)=v,g(x)=1]\right)^2$$
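Definition 23 translates directly into code. This is a minimal empirical sketch (the predictors, labels, and group below are illustrative assumptions): it buckets in-group samples by the predicted value v and accumulates the weighted squared gaps.

```python
from collections import defaultdict

def k2_calibration_error(preds, y, member):
    """Empirical version of Definition 23: the average squared calibration
    error of f on group g,
        K2(f, g) = sum_v Pr[f(x)=v | g(x)=1] * (v - E[y | f(x)=v, g(x)=1])^2,
    computed from the samples restricted to the group."""
    in_g = [(p, yi) for p, yi, m in zip(preds, y, member) if m]
    buckets = defaultdict(list)
    for p, yi in in_g:
        buckets[p].append(yi)
    total = 0.0
    for v, ys in buckets.items():
        total += (len(ys) / len(in_g)) * (v - sum(ys) / len(ys)) ** 2
    return total

# A predictor whose level sets match the in-group label means scores 0;
# a constant over-prediction scores its squared gap.
y = [0, 1, 0, 1, 1, 1]
member = [True] * 6
calibrated = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
miscalibrated = [0.9] * 6
k2_good = k2_calibration_error(calibrated, y, member)
k2_bad = k2_calibration_error(miscalibrated, y, member)
```

Note that the error is an average over the group's level sets, which is what lets the multicalibration algorithm below target the single worst (value, group) cell.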
Proof 34 Since by construction $f_{t+1}(x)=f_t(x)$ for every x such that either $g_t(x)=0$ or $f_t(x)\ne v_t$, we have that:
$$B(f_t)-B(f_{t+1})=\mu_t(v_t,g_t)\cdot\mathbb{E}_{(x,y)\sim\mathcal D}\left[(f_t(x)-y)^2-(f_{t+1}(x)-y)^2\mid g_t(x)=1,f_t(x)=v_t\right]$$
where the final equality follows from Lemma 3.1.2 and the fact that by definition $v_t'=\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v_t,g_t(x)=1]$.
For any value v ∈ [0, 1] let Round(v; m) = arg minv′ ∈[1/m] |v − v ′ | denote the
closest grid point to v in [1/m]. For a model f : X → [0, 1], let Round(f ; m)
denote the function f ′ (x) = Round(f (x); m) that simply rounds the output of
f to the nearest grid point of [1/m].
Algorithm 15 Multicalibrate(f, α, G, D)
Let $m=\frac{1}{\alpha}$.
Let $f_0=\text{Round}(f;m)$ and $t=0$.
while $f_t$ is not α-approximately multicalibrated with respect to G: do
Let:
$$(v_t,g_t)\in\arg\max_{(v,g)\in R(f_t)\times\mathcal G}\;\Pr_{(x,y)\sim\mathcal D}[f_t(x)=v,g(x)=1]\left(v-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]\right)^2$$
Let $\tilde v_t=\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v_t,g_t(x)=1]$ and $v_t'=\text{Round}(\tilde v_t;m)$.
Let $f_{t+1}=h(x;f_t,v_t\to v_t',g_t)$ and $t=t+1$.
Output $f_t$.
Proof 35 Let $\tilde f_{t+1}=h(x;f_t,v_t\to\tilde v_t,g_t)$ be the hypothetical update that would have resulted had we not rounded $\tilde v_t$ in step t of the algorithm. This is the update that would have resulted from a step of Algorithm 14, and so we can apply Lemma 4.3.1 to bound $B(f_t)-B(\tilde f_{t+1})$.
And so it remains to upper bound $(B(f_{t+1})-B(\tilde f_{t+1}))$. Let $\Delta=\tilde v_t-v_t'$ and note that since $v_t'=\text{Round}(\tilde v_t;m)$ we have that $|\Delta|\le\frac{1}{2m}$. We can calculate:
$$B(f_{t+1})-B(\tilde f_{t+1})=\mu_t(v_t,g_t)\cdot\mathbb{E}_{(x,y)\sim\mathcal D}\left[(v_t'-y)^2-(\tilde v_t-y)^2\mid g_t(x)=1,f_t(x)=v_t\right]=\mu_t(v_t,g_t)\,\Delta^2\le\frac{\mu_t(v_t,g_t)}{4m^2}$$
Here the second equality follows from the fact that $\tilde v_t=\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v_t,g_t(x)=1]$. Combining with the above we have:
$$\sum_{v\in R(f_t)}\mu_t(g,v)\left(v-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]\right)^2\ge\alpha$$
Remark 4.3.1 Theorem 21 bounds $B(f_t)-B(f_0)$. But recall that $f_0$ results from rounding the outputs of f to the nearest multiple of 1/m, which might increase f's squared error by as much as $\frac{1}{m}+\frac{1}{4m^2}=\alpha+\frac{\alpha^2}{4}$ if f was very poorly calibrated at the outset. Taking this into account we can also conclude that:
$$B(f_T)<B(f)-T\frac{\alpha^2}{4}+\alpha+\frac{\alpha^2}{4}.$$
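The full Multicalibrate loop, including the rounding to the grid [1/m], can be sketched on an empirical sample. In this sketch (the data, groups, and α are illustrative assumptions, and the loop triggers on the single worst (group, value) cell having weighted squared gap at least α, a simplification of the formal per-group condition):

```python
def round_to_grid(v, m):
    """Round v to the nearest point of the grid {0, 1/m, ..., 1}."""
    return min(range(m + 1), key=lambda i: abs(v - i / m)) / m

def multicalibrate(preds, y, groups, alpha):
    """Sketch of the Multicalibrate patching loop (Algorithm 15) on an
    empirical sample, with grid width m = 1/alpha. Repeatedly finds the
    (group, value) cell with the largest weighted squared calibration gap
    and resets that cell to the rounded in-cell label mean."""
    m = int(round(1 / alpha))
    n = len(y)
    f = [round_to_grid(p, m) for p in preds]
    while True:
        best, best_val = None, alpha
        for name, member in groups.items():
            for v in set(f):
                idx = [i for i in range(n) if member[i] and f[i] == v]
                if not idx:
                    continue
                cell_mean = sum(y[i] for i in idx) / len(idx)
                val = (len(idx) / n) * (v - cell_mean) ** 2
                if val >= best_val:
                    best, best_val = (name, v, cell_mean), val
        if best is None:
            return f
        name, v, target = best
        for i in range(n):
            if groups[name][i] and f[i] == v:
                f[i] = round_to_grid(target, m)

y = [1, 1, 1, 1, 0, 0, 0, 0]
groups = {"g1": [True] * 4 + [False] * 4, "gAll": [True] * 8}
f_out = multicalibrate([0.5] * 8, y, groups, alpha=0.1)
```

On this example the loop patches the (g1, 0.5) cell up to 1.0, then the remaining (gAll, 0.5) cell down to 0.0, exactly the two accuracy-improving moves the potential argument charges for.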
$$Q_2(f,g)=\sum_{v\in R(f)}\Pr_{(x,y)\sim\mathcal D}[f(x)=v\mid g(x)=1]\left(q-\Pr_{(x,y)\sim\mathcal D}[y\le v\mid f(x)=v,g(x)=1]\right)^2$$
We say that a model f is α-approximately quantile multicalibrated with respect
to a collection of groups G and q if for every group g ∈ G:
$$Q_2(f,g)\le\frac{\alpha}{\mu(g)}.$$
We will use the same kind of group value patches that we used for mean
multicalibration, as well as the same rounding procedure. We get the following
algorithm:
Algorithm 16 QuantileMulticalibrate(f, α, q, G, ρ)
Let $m=\frac{\rho^2}{2\alpha}$.
Let $f_0=\text{Round}(f;m)$ and $t=0$.
while $f_t$ is not α-approximately quantile multicalibrated with respect to G and q: do
Let:
$$(v_t,g_t)\in\arg\max_{(v,g)\in R(f_t)\times\mathcal G}\;\Pr_{(x,y)\sim\mathcal D}[f_t(x)=v,g(x)=1]\left(q-\Pr_{(x,y)\sim\mathcal D}[y\le v\mid f_t(x)=v,g(x)=1]\right)^2$$
Let $\tilde v_t$ be such that $\Pr_{(x,y)\sim\mathcal D}[y\le\tilde v_t\mid f_t(x)=v_t,g_t(x)=1]=q$ and let $v_t'=\text{Round}(\tilde v_t;m)$.
Let $f_{t+1}=h(x;f_t,v_t\to v_t',g_t)$ and $t=t+1$.
Output $f_t$.
Let $\tilde f_{t+1}=h(x;f_t,v_t\to\tilde v_t,g_t)$ be the hypothetical update that would have resulted had we not rounded $\tilde v_t$ in step t of the algorithm. Since $\Pr_{(x,y)\sim\mathcal D}[y\le\tilde v_t\mid f_t(x)=v_t,g_t(x)=1]=q$, we can apply Lemma 2.2.2 to the conditional distribution $\mathcal D|(g_t(x)=1,f_t(x)=v_t)$:
$$PB_q(f_t)-PB_q(\tilde f_{t+1})=\Pr[g_t(x)=1,f_t(x)=v_t]\cdot\mathbb{E}_{(x,y)\sim\mathcal D}\left[L_q(f_t(x),y)-L_q(\tilde f_{t+1}(x),y)\mid g_t(x)=1,f_t(x)=v_t\right]$$
$$\ge\mu(g_t,v_t)\cdot\frac{\alpha}{2\rho m\,\mu(g_t,v_t)}=\frac{\alpha}{2m\rho}$$
We also have that:
$$PB_q(f_t)-PB_q(f_{t+1})=\left(PB_q(f_t)-PB_q(\tilde f_{t+1})\right)-\left(PB_q(f_{t+1})-PB_q(\tilde f_{t+1})\right)$$
And so it remains to upper bound $(PB_q(f_{t+1})-PB_q(\tilde f_{t+1}))$. Let $\Delta=\tilde v_t-v_t'$ and note that since $v_t'=\text{Round}(\tilde v_t;m)$ we have that $|\Delta|\le\frac{1}{2m}$. From Lemma 2.2.2 we have that:
$$PB_q(f_t)-PB_q(f_{t+1})\ge\frac{\alpha}{2m\rho}-\frac{\rho}{8m^2}=\frac{\alpha^2}{2\rho^3}$$
Here we use the fact that $m=\frac{\rho^2}{2\alpha}$.
With this progress lemma, we can state the final guarantee for Algorithm
16.
Theorem 22 Fix any model f : X → [0, 1], α > 0, q ∈ [0, 1], G, and ρ. If the distribution D is ρ-Lipschitz, then Algorithm 16 (QuantileMulticalibrate) runs for T rounds and outputs a model $f_T$ that is α-approximately quantile multicalibrated with respect to G and q. Moreover:
$$T\le\frac{2\rho^3}{\alpha^2}$$
and $PB_q(f_T)\le PB_q(f_0)-T\frac{\alpha^2}{2\rho^3}$.
Proof 38 Lemma 4.4.1 tells us that at any intermediate round t < T of the algorithm, we have that:
$$PB_q(f_t)-PB_q(f_{t+1})\ge\frac{\alpha^2}{2\rho^3}$$
Applying this repeatedly we have that:
$$PB_q(f_T)\le PB_q(f_0)-T\frac{\alpha^2}{2\rho^3}$$
For labels in [0, 1], we have that $PB_q(f_0)\le 1$ and $PB_q(f_T)\ge 0$. Hence we must have that $T\le\frac{2\rho^3}{\alpha^2}$.
Theorem 23 Fix any model $f_t$ : X → [0, 1], any v ∈ R($f_t$), and any group g ∈ G. Let D ∼ 𝒟ⁿ consist of n points drawn i.i.d. from 𝒟. Then with probability 1 − δ:
$$\left|\mu_t(g,v,D)\left(v-\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]\right)^2-\mu_t(g,v,\mathcal D)\left(v-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]\right)^2\right|$$
$$\le 46\sqrt{\frac{3\mu_t(g,v,D)\ln(8/\delta)}{n}}+\frac{135\ln(8/\delta)}{n}\in O\left(\sqrt{\frac{\mu_t(g,v,D)\ln(1/\delta)}{n}}+\frac{\ln(1/\delta)}{n}\right)$$
Proof 39 This will be a long slog. We will beat each term into submission
using Chernoff bounds in sequence, and then combine the resulting bounds.
First we argue that with high probability, µt (g, v, D) and µt (g, v, D) must
be close.
Lemma 4.5.1 Fix any model $f_t$ : X → [0, 1], any v ∈ R($f_t$), and any group g ∈ G. Let D ∼ 𝒟ⁿ consist of n points drawn i.i.d. from 𝒟. Then with probability 1 − δ:
$$\left|\mu_t(g,v,D)-\mu_t(g,v,\mathcal D)\right|\le\sqrt{\frac{3\ln(2/\delta)\,\mu_t(g,v,\mathcal D)}{n}}$$
Proof 40 We can write:
$$\mu_t(g,v,D)=\frac{1}{n}\sum_{(x,y)\in D}\mathbb{1}[g(x)=1,f_t(x)=v]$$
so that $n\mu_t(g,v,D)$ is a sum of i.i.d. Bernoulli random variables with mean $\mu_t(g,v,\mathcal D)$, and we can apply a multiplicative Chernoff bound:
$$\Pr_{D\sim\mathcal D^n}\left[\left|n\mu_t(g,v,D)-n\mu_t(g,v,\mathcal D)\right|\ge\eta\,n\mu_t(g,v,\mathcal D)\right]\le 2\exp\left(-\frac{\mu_t(g,v,\mathcal D)\,\eta^2 n}{3}\right)$$
Plugging in $\eta=\sqrt{\frac{3\ln(2/\delta)}{n\,\mu_t(g,v,\mathcal D)}}$ yields:
$$\Pr_{D\sim\mathcal D^n}\left[\left|n\mu_t(g,v,D)-n\mu_t(g,v,\mathcal D)\right|\ge\sqrt{3\ln(2/\delta)\,n\,\mu_t(g,v,\mathcal D)}\right]\le\delta$$
and
$$\left|n\mu_t(g,v,D)-n\mu_t(g,v,\mathcal D)\right|\le\sqrt{3\ln(4/\delta)\,n\,\mu_t(g,v,\mathcal D)}$$
Therefore we have that with probability 1 − δ:
$$\left|\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]\right|\le 45\sqrt{\frac{3\ln(4/\delta)}{n\,\mu_t(g,v,\mathcal D)}}$$
Proof 42 We compute:
$$\left(v-\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]\right)^2-\left(v-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]\right)^2$$
$$=2v\left(\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]-\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]\right)$$
$$\quad+\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]^2-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]^2$$
Phew. Let's finish this. For brevity, write $\bar y_D=\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]$ and $\bar y_{\mathcal D}=\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]$. Applying Lemma 4.5.1, we have that with probability 1 − δ/2:
$$\mu_t(g,v,D)\left(v-\bar y_D\right)^2\le\left(\mu_t(g,v,\mathcal D)+\sqrt{\frac{3\ln(2/\delta)\,\mu_t(g,v,\mathcal D)}{n}}\right)\left(v-\bar y_D\right)^2$$
There are two cases to consider. The first case is when $\mu_t(g,v,\mathcal D)<\frac{12\ln(8/\delta)}{n}$. In this case, since $(v-\bar y_D)^2\le 1$ we have:
$$\mu_t(g,v,D)\left(v-\bar y_D\right)^2\le\frac{12\ln(8/\delta)}{n}+\sqrt{\frac{3\ln(2/\delta)\,\mu_t(g,v,\mathcal D)}{n}}$$
In the remaining case, we can apply Lemma 4.5.3 to continue and conclude that with probability 1 − δ:
$$\mu_t(g,v,D)\left(v-\bar y_D\right)^2\le\left(\mu_t(g,v,\mathcal D)+\sqrt{\frac{3\ln(2/\delta)\,\mu_t(g,v,\mathcal D)}{n}}\right)\left(\left|v-\bar y_{\mathcal D}\right|+45\sqrt{\frac{3\ln(8/\delta)}{n\,\mu_t(g,v,\mathcal D)}}\right)^2$$
Expanding the product and collecting terms yields:
$$\mu_t(g,v,D)\left(v-\bar y_D\right)^2\le\mu_t(g,v,\mathcal D)\left(v-\bar y_{\mathcal D}\right)^2+46\sqrt{\frac{3\mu_t(g,v,\mathcal D)\ln(8/\delta)}{n}}+\frac{135\ln(8/\delta)}{n}$$
The reverse direction follows the same way:
$$\mu_t(g,v,D)\left(v-\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]\right)^2\ge\mu_t(g,v,\mathcal D)\left(v-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]\right)^2-46\sqrt{\frac{3\mu_t(g,v,D)\ln(8/\delta)}{n}}-\frac{135\ln(8/\delta)}{n}$$
which finally gives us our theorem.
Recapping where we are, we have shown that for a single model $f_t$, group g, and value v, the quantities $\mu_t(g,v,D)\left(v-\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]\right)^2$ evaluated in-sample are close to the corresponding quantities out of sample. But we need a corresponding statement for every group g ∈ G, every v ∈ [1/m], and every model f that might be output by Algorithm 15. Our solution to this will simply be to count all possible combinations of g, v, and f, but in order to do this, we need to understand how many distinct models might be output by Algorithm 15.
Lemma 4.5.4 Fix any model f : X → [0, 1], any finite collection of groups G,
and any α > 0. Then there is a set of models C such that for every distribution
D (which might be the empirical distribution over an arbitrary dataset), the
model ft output by Multicalibrate(f, α, G, D) is such that ft ∈ C, and:
$$|C|\le\left(\frac{|\mathcal G|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+1}$$
Proof 43 Given a run of Multicalibrate(f, α, G, D) (Algorithm 15) for T rounds, let $\pi=\{(v_t,v_t',g_t)\}_{t=1}^T$ denote the record of the quantities $(v_t,v_t',g_t)$ selected by the algorithm at each round t. Let $\pi^{<t}=\{(v_{t'},v'_{t'},g_{t'})\}_{t'=1}^{t-1}$ denote the prefix of this transcript up through round t − 1. Observe that once we fix $\pi^{<t}$ we have also fixed the model $f_t$ that is defined at the start of round t (independently of the distribution D). Thus to count models that might be output by Multicalibrate(f, α, G, D), it suffices to count transcripts.
We let C denote the set of all models defined by transcripts $\pi^{<T}$ for all $T\le\frac{4}{\alpha^2}$. Since we know from Theorem 21 that Algorithm 15 halts after at most $T\le\frac{4}{\alpha^2}$ many rounds, the models output by Algorithm 15 must be contained in C as claimed. It remains to count the set of transcripts of length $T\le\frac{4}{\alpha^2}$. At each round t, there are m = 1/α possible choices for $v_t$, m = 1/α possible choices for $v_t'$, and |G| possible choices for $g_t$. Hence the number of transcripts of length T is $\left(\frac{|\mathcal G|}{\alpha^2}\right)^T$. Thus we have:
$$|C|\le\sum_{T=0}^{4/\alpha^2}\left(\frac{|\mathcal G|}{\alpha^2}\right)^T\le\left(\frac{|\mathcal G|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+1}$$
Theorem 24 Fix any model f : X → [0, 1], any finite collection of groups G, any α > 0 and any δ > 0. Let D ∼ 𝒟ⁿ consist of n points drawn i.i.d. from 𝒟. Then with probability 1 − δ, for every model $f_t$ that might be output by Multicalibrate(f, α, G, D), every group g ∈ G, and every value v ∈ R($f_t$):
$$\left|\mu_t(g,v,D)\left(v-\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]\right)^2-\mu_t(g,v,\mathcal D)\left(v-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]\right)^2\right|$$
$$\le 46\sqrt{\frac{3\mu_t(g,v,D)\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{n}}+\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{n}\in O\left(\frac{1}{\alpha}\sqrt{\frac{\mu_t(g,v,D)\ln\frac{|\mathcal G|}{\alpha\delta}}{n}}\right)$$
Proof 44 From Theorem 23 we have that for any δ′ > 0 and any single triple $(f_t, g, v)$:
$$\left|\mu_t(g,v,D)\left(v-\mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v,g(x)=1]\right)^2-\mu_t(g,v,\mathcal D)\left(v-\mathbb{E}_{(x,y)\sim\mathcal D}[y\mid f_t(x)=v,g(x)=1]\right)^2\right|\le 46\sqrt{\frac{3\mu_t(g,v,D)\ln(8/\delta')}{n}}+\frac{135\ln(8/\delta')}{n}$$
We now count the number of triples quantified over in our theorem. Lemma 4.5.4 tells us that the number of models $f_t$ that might be output is at most $\left(\frac{|\mathcal G|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+1}$. The number of groups g ∈ G is |G|, and the number of values v ∈ R($f_t$) is by construction m = 1/α. Hence the number of triples is at most:
$$\left(\frac{|\mathcal G|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+1}\cdot|\mathcal G|\cdot\frac{1}{\alpha}\le\left(\frac{|\mathcal G|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+2}$$
Theorem 25 Fix any model f : X → [0, 1], any finite collection of groups G,
any α > 0 and any δ > 0. Let D ∼ Dn consist of n points drawn i.i.d. from
D. Then with probability 1 − δ, the model ft : X → [0, 1] that is output by
Multicalibrate(f, α, G, D) (Algorithm 15) is α′ approximately multicalibrated
with respect to G and D for:
$$\alpha'\le\alpha+\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{\alpha n}+46\sqrt{\frac{3\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{\alpha n}}$$
$$\in O\left(\alpha+\frac{\ln\frac{|\mathcal G|}{\alpha^2\delta}}{\alpha^3 n}+\sqrt{\frac{\ln\frac{|\mathcal G|}{\alpha^2\delta}}{\alpha^3 n}}\right)$$
Remark 4.5.1 Choosing α to optimize the bound from Theorem 25, we get a
model ft that is α′ approximately multicalibrated with respect to G and D for:
$$\alpha'=\tilde O\left(\left(\frac{\ln\frac{|\mathcal G|}{\delta}}{n}\right)^{1/5}\right)$$
$$\le 46\sqrt{\frac{3\mu_t(g,v,D)\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{n}}+\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{n}$$
From Theorem 21 we know that (with probability 1), µ(g, D)·K2 (ft , g, D) ≤
α for every g ∈ G.
Combining these bounds we have:
$$\mu(g,\mathcal D)\cdot K_2(f_t,g,\mathcal D)\le\alpha+\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{\alpha n}+\sum_{v\in R(f_t)}46\sqrt{\frac{3\mu_t(g,v,D)\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{n}}$$
$$\le\alpha+\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{\alpha n}+46\sqrt{\frac{3\mu_t(g,D)\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{\alpha n}}$$
$$\le\alpha+\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{\alpha n}+46\sqrt{\frac{3\left(\frac{4}{\alpha^2}+2\right)\ln\frac{8|\mathcal G|}{\alpha^2\delta}}{\alpha n}}$$
where here we have used the fact that $|R(f_t)|=\frac{1}{\alpha}$, and that because $\sqrt\cdot$ is a concave function, the final sum is maximized when $\mu_t(g,v,D)=\alpha\mu_t(g,D)$ for each v.
Theorem 26 Fix any model f : X → [0, 1], any finite collection of groups G,
any α > 0 and any δ > 0. Let D ∼ Dn consist of n points drawn i.i.d. from
a ρ-Lipschitz distribution D. Then with probability 1 − δ, the model ft : X →
[0, 1] that is output by QuantileMulticalibrate(f, α, q, G, D) (Algorithm 16) is
Remark 4.5.2 Choosing α to optimize the bound from Theorem 26, we get a model $f_t$ that is α′ approximately quantile multicalibrated with respect to G and D for:
$$\alpha'=\tilde O\left(\left(\frac{\rho^3\ln\frac{\rho^4|\mathcal G|}{\delta}}{n}\right)^{1/5}\right)$$
The quantity C∞(f, m, g, D) is defined analogously. This is identical to our definition of K∞ except that the condition that f(x) = v has been replaced with the condition that f(x) ∈ B(i). We can give a corresponding definition in the sequential setting:
$$\max_{g\in\mathcal G,i\in[m]}\left|\sum_{t\in S(\pi,g,i)}(p_t-y_t)\right|\le\alpha T$$
For a prefix $\pi^{\le s}$ of the transcript, let
$$V_s^{g,i}=\sum_{t\in S(\pi^{\le s},g,i)}(p_t-y_t)$$
denote the cumulative difference between the predictions $p_t$ and the outcomes $y_t$ on the subsequence of $\pi^{\le s}$ corresponding to examples from group g and predictions in bucket i.
Fixing a parameter η ∈ [0, 21 ], define a surrogate calibration loss function
at round s as:
$$L_s(\pi^{\le s})=\sum_{g\in\mathcal G,\,i\in[m]}\left(\exp\left(\eta V_s^{g,i}\right)+\exp\left(-\eta V_s^{g,i}\right)\right).$$
We will leave η unspecified for now, and choose it later to optimize our bounds.
Recall that what we really want to do is upper bound $\max_{g\in\mathcal G,i\in[m]}|V_T^{g,i}|$, which corresponds to our calibration loss. Observe that this "soft-max style" function allows us to tightly upper bound our calibration loss:

Observation 4.6.1 For any transcript $\pi_T$, and any η ∈ [0, 1/2], we have that:
$$\max_{g\in\mathcal G,i\in[m]}\left|V_T^{g,i}\right|\le\frac{1}{\eta}\ln(L_T)\le\max_{g\in\mathcal G,i\in[m]}\left|V_T^{g,i}\right|+\frac{\ln(2|\mathcal G|m)}{\eta}.$$
To see the first inequality, note that for the maximizing pair (g, i) we have $\exp\left(\eta|V_T^{g,i}|\right)\le\exp\left(\eta V_T^{g,i}\right)+\exp\left(-\eta V_T^{g,i}\right)\le L_T$, so that $\eta\max_{g,i}|V_T^{g,i}|\le\ln(L_T)$. Dividing by η gives the inequality. In the other direction we have that:
$$\frac{1}{\eta}\ln(L_T)=\frac{1}{\eta}\ln\left(\sum_{g\in\mathcal G,i\in[m]}\exp\left(\eta V_T^{g,i}\right)+\exp\left(-\eta V_T^{g,i}\right)\right)\le\frac{1}{\eta}\ln\left(2|\mathcal G|m\cdot\max_{g\in\mathcal G,i\in[m]}\exp\left(\eta\left|V_T^{g,i}\right|\right)\right)=\frac{\ln(2|\mathcal G|m)}{\eta}+\max_{g\in\mathcal G,i\in[m]}\left|V_T^{g,i}\right|$$
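Observation 4.6.1 is a generic fact about this soft-max construction, and is easy to verify numerically. In this sketch, the cell count, η, and the range of the V values are illustrative assumptions:

```python
import math
import random

def surrogate(V, eta):
    """Soft-max surrogate: sum over cells of exp(eta*V) + exp(-eta*V)."""
    return sum(math.exp(eta * v) + math.exp(-eta * v) for v in V)

random.seed(0)
num_cells = 40                    # plays the role of |G| * m
eta = 0.1
V = [random.uniform(-30, 30) for _ in range(num_cells)]
max_abs_v = max(abs(v) for v in V)
scaled_log_loss = math.log(surrogate(V, eta)) / eta
slack = math.log(2 * num_cells) / eta
```

The two inequalities then read `max_abs_v <= scaled_log_loss <= max_abs_v + slack`: the scaled log-surrogate pins down the worst cell up to an additive $\ln(2|\mathcal G|m)/\eta$ term.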
So we now feel free to study the analytically nicer surrogate loss function. Just as in our derivation of algorithms promising (regular) calibration guarantees against an adversary, we will be interested in bounding the increase in our surrogate loss function from round to round.
Our first step is to bound $\Delta_{s+1}(\pi^{\le s+1})$ in terms of a quantity that is linear in $p_{s+1}$ and $y_{s+1}$.
Lemma 4.6.1 Fix any partial transcript $\pi^{\le s+1}=\pi^{\le s}\circ(x_{s+1},p_{s+1},y_{s+1})$ such that $p_{s+1}\in B(i)$. Then for any η ≤ 1, we have that:
$$\Delta_{s+1}(\pi^{\le s+1})\le\eta(y_{s+1}-p_{s+1})\cdot C_s^i(x_{s+1})+2\eta^2 L_s(\pi^{\le s})$$
where:
$$C_s^i(x_{s+1})=\sum_{g\in\mathcal G(x_{s+1})}\left(\exp\left(\eta V_s^{g,i}\right)-\exp\left(-\eta V_s^{g,i}\right)\right)$$
Proof 47 Observe that our surrogate loss function is a sum of terms, each defined by a group g ∈ G and a bucket i ∈ [m], and that at round s + 1, the change in surrogate loss can be written as a sum over only those groups in G($x_{s+1}$) and the bucket i such that $p_{s+1}\in B(i)$, since all other terms in the sum are unchanged:
$$\Delta_{s+1}(\pi^{\le s+1})=L_{s+1}-L_s=\sum_{g\in\mathcal G(x_{s+1})}\left(\exp\left(\eta V_{s+1}^{g,i}\right)-\exp\left(\eta V_s^{g,i}\right)\right)+\left(\exp\left(-\eta V_{s+1}^{g,i}\right)-\exp\left(-\eta V_s^{g,i}\right)\right)$$
$$=\sum_{g\in\mathcal G(x_{s+1})}\exp\left(\eta V_s^{g,i}\right)\left(\exp\left(\eta(y_{s+1}-p_{s+1})\right)-1\right)+\exp\left(-\eta V_s^{g,i}\right)\left(\exp\left(-\eta(y_{s+1}-p_{s+1})\right)-1\right)$$
$$\le\sum_{g\in\mathcal G(x_{s+1})}\exp\left(\eta V_s^{g,i}\right)\left(\eta(y_{s+1}-p_{s+1})+2\eta^2\right)+\exp\left(-\eta V_s^{g,i}\right)\left(-\eta(y_{s+1}-p_{s+1})+2\eta^2\right)$$
$$=\eta(y_{s+1}-p_{s+1})\sum_{g\in\mathcal G(x_{s+1})}\left(\exp\left(\eta V_s^{g,i}\right)-\exp\left(-\eta V_s^{g,i}\right)\right)+2\eta^2\sum_{g\in\mathcal G(x_{s+1})}\left(\exp\left(\eta V_s^{g,i}\right)+\exp\left(-\eta V_s^{g,i}\right)\right)$$
Here the inequality follows from the fact that $|\eta(y_{s+1}-p_{s+1})|\le\eta$ and that for |x| ≤ 1, exp(x) ≤ 1 + x + x².
Our goal is to find a strategy for the learner's choice of $p_{s+1}$, as a function of both $\pi^{\le s}$ and $x_{s+1}$, that will guarantee that $\mathbb{E}_{p_{s+1}}[\Delta_{s+1}(\pi^{\le s}\circ(x_{s+1},p_{s+1},y_{s+1}))]$ is small no matter how the adversary chooses $y_{s+1}$, where $B^{-1}(p_{s+1})=i$ is the bucket such that $p_{s+1}\in B(i)$. We do this in cases.
Algorithm 17 Online-Multicalibrated-Predictor(G, m, r, η)
for t = 1 to T do
Observe $x_t$ and compute
$$C_{t-1}^i(x_t)=\sum_{g\in\mathcal G(x_t)}\left(\exp\left(\eta V_{t-1}^{g,i}\right)-\exp\left(-\eta V_{t-1}^{g,i}\right)\right)$$
for all i ∈ [m].
If the $C_{t-1}^i(x_t)$ all have the same sign, predict the corresponding endpoint $p_t\in\{0,1\}$. Otherwise, select i∗ ∈ [m] such that $C_{t-1}^{i^*}(x_t)\cdot C_{t-1}^{i^*+1}(x_t)\le 0$ and compute q ∈ [0, 1] such that:
$$q\cdot C_{t-1}^{i^*}(x_t)+(1-q)\cdot C_{t-1}^{i^*+1}(x_t)=0$$
Predict $p_t=\frac{i^*}{m}-\frac{1}{rm}$ with probability q and predict $p_t=\frac{i^*}{m}$ with probability 1 − q.
Observe $y_t$
Let $\pi^{<t+1}=\pi^{<t}\circ(x_t,p_t,y_t)$
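One step of this predictor can be sketched in code. This is a sketch, not the full algorithm: the case structure and sign conventions mirror the quantile version that appears below as Algorithm 18, and the state V, the groups, and the parameters m, r, η are all illustrative assumptions:

```python
import math
import random

def online_multicalibrated_step(V, x_groups, m, r, eta, rng):
    """One prediction step in the style of Algorithms 17/18. V[(g, i)] holds
    the running calibration deficit of cell (g, i); x_groups are the groups
    containing the current example. We compute the weights C^i and, when C
    changes sign between buckets i* and i*+1, randomize between two nearby
    predictions so the expected linear term of the surrogate-loss increase
    vanishes."""
    def c(i):
        return sum(math.exp(eta * V[(g, i)]) - math.exp(-eta * V[(g, i)])
                   for g in x_groups)
    if c(m) < 0:
        return 1.0      # all relevant cells pull in one direction
    if c(1) > 0:
        return 0.0      # all relevant cells pull in the other direction
    i_star = next(i for i in range(1, m) if c(i) * c(i + 1) <= 0)
    denom = abs(c(i_star)) + abs(c(i_star + 1))
    p = abs(c(i_star + 1)) / denom if denom > 0 else 0.5
    lo = i_star / m - 1.0 / (r * m)   # lands in bucket i*
    hi = i_star / m                    # lands in bucket i* + 1
    return lo if rng.random() < p else hi

rng = random.Random(0)
groups, m, r, eta = ["a", "b"], 2, 10, 0.5
V = {(g, i): (-2.0 if i == 1 else 2.0) for g in groups for i in (1, 2)}
pt = online_multicalibrated_step(V, groups, m, r, eta, rng)
V_skewed = {(g, i): -3.0 for g in groups for i in (1, 2)}
pt_skewed = online_multicalibrated_step(V_skewed, groups, m, r, eta, rng)
```

In the mixed-sign state the step randomizes between the two endpoints of the sign-change bucket (here 0.45 and 0.5); when every cell's deficit has the same sign it jumps to an endpoint of [0, 1].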
Let's now analyze the expected calibration loss of Algorithm 17. We start by analyzing the expected surrogate loss:
Lemma 4.6.3 Fix any set of groups G, m, r ≥ 0 and 0 ≤ η ≤ 1. Fix any adversary, which together with Online-Multicalibrated-Predictor(G, m, r, η) (Algorithm 17) fixes a distribution on transcripts π. We have that:
$$\mathbb{E}_\pi[L_T(\pi)]\le 2|\mathcal G|m\cdot\exp\left(\frac{T\eta}{rm}+2T\eta^2\right)$$
Proof 49 Consider the final round T. From Lemma 4.6.2, we have that for all $\pi^{<T}$, $x_T$, $y_T$:

$$\mathbb{E}_\pi[\alpha]\le(2+\epsilon)\sqrt{\frac{2\ln(2|\mathcal G|m)}{T}}$$
Proof 50 Recall that (α, m)-multicalibration corresponds to the requirement that $\max_{g\in\mathcal G,i\in[m]}|V_T^{g,i}|\le\alpha T$. Hence we need to show that:
$$\mathbb{E}_\pi\left[\max_{g\in\mathcal G,i\in[m]}\left|V_T^{g,i}\right|\right]\le\frac{T}{rm}+2\sqrt{2T\ln(2|\mathcal G|m)}$$
We can compute:
$$\exp\left(\eta\,\mathbb{E}_\pi\left[\max_{g\in\mathcal G,i\in[m]}\left|V_T^{g,i}\right|\right]\right)\le\mathbb{E}_\pi\left[\exp\left(\eta\max_{g\in\mathcal G,i\in[m]}\left|V_T^{g,i}\right|\right)\right]=\mathbb{E}_\pi\left[\max_{g\in\mathcal G,i\in[m]}\exp\left(\eta\left|V_T^{g,i}\right|\right)\right]$$
$$\le\mathbb{E}_\pi\left[\max_{g\in\mathcal G,i\in[m]}\left(\exp\left(\eta V_T^{g,i}\right)+\exp\left(-\eta V_T^{g,i}\right)\right)\right]\le\mathbb{E}_\pi\left[\sum_{g\in\mathcal G,i\in[m]}\exp\left(\eta V_T^{g,i}\right)+\exp\left(-\eta V_T^{g,i}\right)\right]=\mathbb{E}_\pi[L_T(\pi)]\le 2|\mathcal G|m\cdot\exp\left(\frac{T\eta}{rm}+2T\eta^2\right)$$
where the first inequality follows from Jensen's inequality and the convexity of exp(x), and the last inequality follows from Lemma 4.6.3. Taking the log of both sides and dividing by η gives:
$$\mathbb{E}_\pi\left[\max_{g\in\mathcal G,i\in[m]}\left|V_T^{g,i}\right|\right]\le\frac{\ln(2|\mathcal G|m)}{\eta}+\frac{T}{rm}+2T\eta$$
Plugging in our chosen value of η completes the proof.
Insert high probability bound and online to offline reduction
$$\max_{g\in\mathcal G,i\in[m]}\left|\sum_{t\in S(\pi,g,i)}\left(\mathbb{1}[y_t\le p_t]-q\right)\right|\le\alpha T$$
and correspondingly redefine
$$V_s^{g,i}=\sum_{t\in S(\pi^{\le s},g,i)}\left(\mathbb{1}[y_t\le p_t]-q\right).$$
The surrogate loss function is defined exactly as before:
$$L_s(\pi^{\le s})=\sum_{g\in\mathcal G,\,i\in[m]}\left(\exp\left(\eta V_s^{g,i}\right)+\exp\left(-\eta V_s^{g,i}\right)\right).$$
When the transcript π ≤s is clear from context, we will simply write Ls .
We can prove a direct analogue of Lemma 4.6.1 for our new quantile surrogate loss function. All we used previously about the $V_s^{g,i}$ quantities was that they were sums of terms bounded between [−1, 1], which remains true in our quantile reformulation.
Lemma 4.6.4 Fix any partial transcript π ≤s+1 = π ≤s ◦ (xs+1 , ps+1 , ys+1 )
such that ps+1 ∈ B(i). Then for any η ≤ 1, we have that:
$$\Delta_{s+1}(\pi^{\le s+1})\le\eta\left(\mathbb{1}[y_{s+1}\le p_{s+1}]-q\right)\cdot C_s^i(x_{s+1})+2\eta^2 L_s(\pi^{\le s})$$
where:
$$C_s^i(x_{s+1})=\sum_{g\in\mathcal G(x_{s+1})}\left(\exp\left(\eta V_s^{g,i}\right)-\exp\left(-\eta V_s^{g,i}\right)\right)$$
Taking the expectation over the algorithm's randomization of the linear term, we can compute:
$$\mathbb{E}_{p_{s+1},y_{s+1}}\left[\eta\left(\mathbb{1}[y_{s+1}\le p_{s+1}]-q\right)\cdot C_s^{B^{-1}(p_{s+1})}(x_{s+1})\right]$$
$$=p\cdot\eta\left(\Pr_{y_{s+1}}\left[y_{s+1}\le\frac{i^*}{m}-\frac{1}{rm}\right]-q\right)\cdot C_s^{i^*}(x_{s+1})+(1-p)\cdot\eta\left(\Pr_{y_{s+1}}\left[y_{s+1}\le\frac{i^*}{m}\right]-q\right)\cdot C_s^{i^*+1}(x_{s+1})$$
$$\le p\cdot\eta\left(\Pr_{y_{s+1}}\left[y_{s+1}\le\frac{i^*}{m}\right]+\frac{1}{\rho rm}-q\right)\cdot C_s^{i^*}(x_{s+1})+(1-p)\cdot\eta\left(\Pr_{y_{s+1}}\left[y_{s+1}\le\frac{i^*}{m}\right]-q\right)\cdot C_s^{i^*+1}(x_{s+1})$$
$$=\eta\,p\,\frac{1}{\rho rm}\,C_s^{i^*}(x_{s+1})\le\frac{\eta}{\rho rm}L_s(\pi^{\le s})$$
where the last equality uses the choice of p, for which $p\cdot C_s^{i^*}(x_{s+1})+(1-p)\cdot C_s^{i^*+1}(x_{s+1})=0$.
With our new Lemma 3.4.4 in hand, the rest follows as before:
Algorithm 18 Online-Quantile-Multicalibrated-Predictor(G, m, r, η)
  for t = 1 to T do
    Observe x_t and compute

      C_{t−1}^i(x_t) = Σ_{g∈G(x_t)} (exp(ηV_{t−1}^{g,i}) − exp(−ηV_{t−1}^{g,i}))

    for all i ∈ [m], with V_{t−1}^{g,i} defined as in Definition 34.
    if C_{t−1}^m(x_t) < 0 then
      Predict p_t = 1.
    else if C_{t−1}^1(x_t) > 0 then
      Predict p_t = 0.
    else
      Select i* ∈ [m] such that C_{t−1}^{i*}(x_t) · C_{t−1}^{i*+1}(x_t) ≤ 0.
      Compute p ∈ [0, 1] such that:

        p · C_{t−1}^{i*}(x_t) + (1 − p) · C_{t−1}^{i*+1}(x_t) = 0

      Predict p_t = i*/m − 1/(rm) with probability p and predict p_t = i*/m with probability 1 − p.
    Observe y_t
    Let π^{<t+1} = π^{<t} ◦ (x_t, p_t, y_t)
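As a sanity check of the control flow above, here is a minimal Python sketch of Algorithm 18. All names are our own, and the bucket bookkeeping is simplified relative to the formal development; it is a sketch under those assumptions, not a faithful implementation.

```python
import math
import random

def online_quantile_multicalibrated_predictor(xs, ys, groups, m, r, eta, q):
    """Simplified sketch of Algorithm 18.

    groups: list of indicator functions g(x) -> bool.
    Maintains the sums V[g][i] of (1[y_t <= p_t] - q) per group/bucket and
    at each round predicts a (randomized) grid quantile p_t."""
    V = [[0.0] * (m + 2) for _ in groups]  # V[g][i], buckets i = 1..m+1
    preds = []
    for x, y in zip(xs, ys):
        active = [j for j, g in enumerate(groups) if g(x)]

        def C(i):  # C^i_{t-1}(x_t)
            return sum(math.exp(eta * V[j][i]) - math.exp(-eta * V[j][i])
                       for j in active)

        if C(m) < 0:
            p = 1.0
        elif C(1) > 0:
            p = 0.0
        else:
            # find i* with C(i*) * C(i*+1) <= 0, then randomize between
            # the two adjacent grid points so the expected C value is zero
            i_star = next(i for i in range(1, m + 1) if C(i) * C(i + 1) <= 0)
            ci, cj = C(i_star), C(i_star + 1)
            prob = cj / (cj - ci) if cj != ci else 0.5  # p*ci + (1-p)*cj = 0
            if random.random() < prob:
                p = i_star / m - 1.0 / (r * m)
            else:
                p = i_star / m
        preds.append(p)
        # bucket containing p_t, then the surrogate-loss update
        i = min(m, max(1, math.ceil(p * m))) if p > 0 else 1
        for j in active:
            V[j][i] += (1.0 if y <= p else 0.0) - q
    return preds
```

The randomization between adjacent grid points is what lets the analysis treat the expected update as (approximately) zero in the chosen bucket.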
CONTENTS
5.1 Beyond Means and Quantiles ............................. 89
5.2 Beyond Calibration ..................................... 89

6
Multicalibration for Real Valued Functions:
When Does Multicalibration Imply Accuracy?

CONTENTS
6.1 Beyond Groups .......................................... 91
6.2 Algorithmically Reducing Multicalibration to Regression  95
6.3 Weak Learning, Multicalibration, and Boosting .......... 98
References and Further Reading ............................ 104
Since g(x) is binary, we can equivalently rewrite this multicalibration condition as the requirement that for every g ∈ G and v ∈ R(f):

E_{(x,y)∼D} [g(x)(y − v) | f(x) = v] = 0

Although this is an equivalent condition to ask for when g is binary (i.e.
a group indicator function), it now makes sense to ask for this condition even
if g is an arbitrary real valued function g : X → R. We will use this as our
more general definition of multicalibration with respect to an arbitrary class
of real valued functions. We will have to define approximate versions of this
condition, and we will again use an ℓ2-error variant:
Definition 35 (Multicalibration With Respect to Real Valued Functions)
Fix a distribution D ∈ ∆Z and a model f : X → [0, 1]. Let H be an arbitrary collection of real valued functions h : X → R. We say that f is α-approximately multicalibrated with respect to D and H if for every h ∈ H:

K_2(f, h, D) = Σ_{v∈R(f)} Pr_{(x,y)∼D} [f(x) = v] (E_{(x,y)∼D} [h(x)(y − v) | f(x) = v])² ≤ α
Note that there is an asymmetry between Lemma 6.1.1 and Lemma 6.1.2. Lemma
6.1.1 implies that if h has improved squared error compared to f on one of its
level-sets, then the multicalibration condition fails for h on this level-set. On
the other hand, Lemma 6.1.2 says that if h fails the multicalibration condition
on some level-set v of f, then there is a function h′ = v + ηh(x) that improves
on the squared error of f on level-set v. We can remove this asymmetry by
assuming that H is closed under affine transformations:

h′(x) ≡ ah(x) + b ∈ H

Most natural classes of regression functions are closed under affine transformations: linear functions, polynomials of any fixed degree d, regression trees,
etc.
For classes of functions H that are closed under affine transformation, the
relationship becomes symmetric:
Lemma 6.1.3 Suppose H is closed under affine transformation. Fix a model
f : X → R and a levelset v ∈ R(f ) of f . Then:
1. If f is calibrated and there exists an h ∈ H such that
Proof 54 The first part follows from Lemma 6.1.1 using h′ = h. The second
part follows from Lemma 6.1.2 using h′ = v + ηh(x), where h′ ∈ H by the
assumption that H is closed under affine transformations.
— i.e. the function that outputs the closest grid-point in [1/m] to the function
value ht (x).
Theorem 30 Fix any distribution D ∈ ∆Z, any model f : X → [0, 1], any
α < 1, any class of real valued functions H that is closed under affine transformations, and a squared error regression oracle A_H for H. For any bound
B > 0 let:

H_B = {h ∈ H : h(x)² ≤ B}

be the set of functions in H with squared magnitude bounded by B. Then
RegressionMulticalibrate(f, α, A_H, D, B) (Algorithm 19) halts after at most
T ≤ 2B/α many iterations and outputs a model f_{T−1} such that f_{T−1} is α-approximately multicalibrated with respect to D and H_B.
Remark 6.2.1 Note the form of this theorem — we do not promise multi-
calibration at approximation parameter α for all of H, but only for HB —
i.e. those functions in H satisfying a bound on their squared value. This is
necessary, since H is closed under affine transformations. To see this, note
that if E[h(x)(y − v)] ≥ α, then it must be that E[c · h(x)(y − v)] ≥ c · α.
Since h′ (x) = ch(x) is also in H by assumption, approximate multicalibration
bounds must always also be paired with a bound on the norm of the functions
for which we promise those bounds.
err_{T−1} − err_T
  = E_{(x,y)∼D} [(f_{T−1}(x) − y)² − (f_T(x) − y)²]
  = E_{(x,y)∼D} [(f_{T−1}(x) − y)² − (f̃_T(x) − y)²] + E_{(x,y)∼D} [(f̃_T(x) − y)² − (f_T(x) − y)²]
  > α/B + E_{(x,y)∼D} [(f̃_T(x) − y)² − (f_T(x) − y)²]
  > α/B − 1/m
  ≥ α/(2B)

where the last inequality follows from the fact that m ≥ 2B/α.
The second inequality follows from the fact that for every pair (x, y):

(f̃_T(x) − y)² − (f_T(x) − y)² ≥ −1/m

To see this we consider two cases. Since y ∈ [0, 1], if f̃_T(x) > 1 or f̃_T(x) < 0
then the Round operation decreases squared error and we have (f̃_T(x) − y)² −
(f_T(x) − y)² ≥ 0. In the remaining case we have f_T(x) ∈ [0, 1] and ∆ =
f̃_T(x) − f_T(x) is such that |∆| ≤ 1/(2m). In this case we can compute:

(f̃_T(x) − y)² − (f_T(x) − y)² = (f_T(x) + ∆ − y)² − (f_T(x) − y)²
  = 2∆(f_T(x) − y) + ∆²
  ≥ −2|∆| + ∆²
  ≥ −1/m
First, let's pause to interpret this condition and explain why it is "weak". It is
helpful to recall that f*(x) is the Bayes optimal predictor for squared error
— it minimizes squared error over D among the set of all possible functions (we
proved this in Lemma 3.1.2). The weak learning condition requires that for
every restriction of D to some subset S ⊂ X of its domain, if the Bayes optimal
predictor performs better than a constant predictor in terms of squared error,
then there must be some h ∈ H that also performs better than a constant
predictor. This is a weak learning assumption because it might be that f*(x)
performs much better than a constant predictor, but that the best h ∈ H
performs only a little bit better than a constant predictor on S — this situation
is still consistent with our assumption.
Nevertheless, we will show that the weak learning assumption is enough
(together with our Algorithm 19 for multicalibration with respect to real valued functions H) to boost the weak learners in H to a strong learner f —
i.e. a model f that performs as well as the optimal model f* with respect to
squared error. In fact, the weak learning condition on H is both necessary and
sufficient for multicalibration of f with respect to H to imply Bayes optimality
of f. Our "boosting algorithm" will simply be our multicalibration algorithm!
First we define what we mean when we say that multicalibration with
respect to H implies Bayes optimality. Note that f ∗ (x) is multicalibrated
with respect to any set of functions, so it is not enough to require that there
exist Bayes optimal functions f that are multicalibrated with respect to H.
Instead, we have to require that every function that is multicalibrated with
respect to H is Bayes optimal:
Definition 39 Fix a distribution D ∈ ∆Z. We say that multicalibration with
respect to H implies Bayes optimality over D if for every f : X → R that is
multicalibrated with respect to D and H, we have that f(x) = f*(x),
where f*(x) = E_{y∼D(x)} [y] is the function that has minimum squared error
over the set of all functions.
Thus by the weak learning assumption there must exist some h ∈ H such that:
E[(v − y)2 − (h(x) − y)2 |x ∈ S] = E[(f (x) − y)2 − (h(x) − y)2 |f (x) = v] > 0
By Lemma 6.1.3, there must therefore exist some h′ ∈ H such that:
E [h′ (x)(y − v)|f (x) = v] > 0
(x,y)∼D
Since H does not satisfy the weak learning assumption over D, there must
exist some set S ⊆ X with Pr[x ∈ S] > 0 such that

E_{(x,y)∼D} [(f*(x) − y)² | x ∈ S] < min_{c∈R} E_{(x,y)∼D} [(c − y)² | x ∈ S]

Let c(S) = E_{(x,y)∼D} [y | x ∈ S]. We define f(x) as follows:

f(x) = { f*(x)   x ∉ S
       { c(S)    x ∈ S

We first observe that it must be that v = c(S). If this were not the case,
by definition of f we would have that:

But this contradicts our assumption that H violates the weak learning condition on S, which completes the proof.
where f*(x) = E_{y∼D(x)} [y] is the function that minimizes squared error over
D.
≤ Pr_{(x,y)∼D} [x ∈ S] + (1 − Pr_{(x,y)∼D} [x ∈ S])γ

≥ γ²

We recall that |err̃_T − err_T| ≤ 1/m = γ²/2, and so we can conclude that

err_{T−1} − err_T ≥ γ²/2

which contradicts the fact that the algorithm halted at round T, completing the
proof.
Applications
7
Conformal Prediction
CONTENTS
7.1 Prediction Sets and Nonconformity Scores . . . . . . . . . . . . . . . . . . . . . . 107
7.1.1 Non-Conformity Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2 A Weak Guarantee: Marginal Coverage in Expectation . . . . . . . . . 110
7.3 Dataset Conditional Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4 Dataset and Group Conditional Bounds . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5 Multivalid Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.6 Sequential Conformal Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.6.1 Sequential Marginal Coverage Guarantees . . . . . . . . . . . . . . 119
7.6.2 Sequential Multivalid Guarantees . . . . . . . . . . . . . . . . . . . . . . . 120
References and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Thus far we have restricted our attention to regression problems (in which the
label domain Y = R), and have focused on estimating distributional quantities
of conditional label distributions, like means and quantiles. In this chapter,
we introduce a much more general framework for uncertainty quantification
that reduces a very general uncertainty quantification problem to the problem
of one dimensional quantile estimation. As a result, we will be able to draw
on our development of powerful quantile estimation techniques to give an
analogously powerful set of results for a much more general problem.
Given a target miscoverage rate δ, the goal is to produce prediction sets T(x) ⊆ Y satisfying:

Pr[y ∈ T(x)] ≈ 1 − δ
FIGURE 7.1
Images x about which we might have uncertainty about their labels y.
We leave unspecified for now what distribution this probability is taken over,
because we will consider a spectrum of guarantees of increasing strength, mir-
roring our treatment of mean and quantile estimation. For example, we can
ask for marginal guarantees, group conditional guarantees, calibrated guar-
antees, or ask for guarantees that hold empirically on adversarially chosen
sequences. Prediction sets can take different forms: when we are facing a regression problem (Y = R) it is natural (but not necessary) for a prediction
set to take the form of an interval: T(x) = [a, b] for some a < b ∈ R. On the
other hand, for a multiclass classification problem (when Y is some unordered
discrete set), prediction sets correspond to subsets of labels — e.g. we might
have T(x) = {Blueberry Muffin, Chihuahua} for x representing images from
Figure 7.1.
Prediction sets are a very attractive way to quantify uncertainty: their
size represents a quantitative degree of uncertainty. For example, if T(x) is a
singleton, this represents certainty at the specified 1 − δ level in a particular
point prediction. But the contents of the set also provide insight into where
the uncertainty lies. For example, in a classification problem, there might be a
high degree of uncertainty in the specific label, but a well crafted prediction
set might nevertheless tell us that our uncertainty is concentrated in a region
that corresponds to the same downstream action. Say, in a computer vision
setting, we might be unsure of the breed of dog in front of us—so T(x) contains
half a dozen different labels, corresponding to different breeds—but despite
this uncertainty in the specifics, this prediction set gives us a high degree of
confidence in what action to take—apply the brakes.
The main difficulty with thinking about producing prediction sets is that
they are very high dimensional objects: in a k-label multiclass classification
setting, there are 2^k different prediction sets. The main idea in conformal prediction is to reduce this high dimensional problem to a one dimensional one by
means of a non-conformity score function s : X × Y → R. For example, given
a point predictor h : X → R, a simple non-conformity score is:

s(x, y) = |h(x) − y|

which simply measures the deviation of the label y from the point prediction
h(x).
Any non-conformity score function s can be used to parameterize a (now
one dimensional) family of prediction sets Ts : X × R → 2^Y as follows:

Ts(x, τ) = {ŷ ∈ Y : s(x, ŷ) ≤ τ}

The prediction set Ts(x, τ) simply contains all labels ŷ that would produce a
nonconformity score of at most τ when paired with x. In the
case of our simple regression running example, this would simply correspond
to the interval centered at the point prediction h(x) that has width 2τ:
Ts(x, τ) = [h(x) − τ, h(x) + τ]. Although simple, a clear disadvantage of this
non-conformity score is that for fixed τ, every prediction interval Ts(x, τ) has
the same width — so for methods that use a fixed value of τ — which roughly
speaking are those methods that promise only marginal coverage — the prediction intervals do not give us any insight into which examples we have more
uncertainty about compared to others.
There are many other non-conformity scores that are in wide use. For
example, rather than training a regression model h that aims to predict the
mean of the conditional label distribution DY(x) (as linear regression does),
we could train quantile regression models h_{δ/2}(x), h_{1−δ/2}(x) that try to
predict the δ/2 and 1 − δ/2 quantiles of the conditional label distribution
DY(x) instead. Then a natural non-conformity score would be:

s(x, y) = max{h_{δ/2}(x) − y, y − h_{1−δ/2}(x)}
This score starts with the candidate interval that directly arises from the
quantile regression method [h_{δ/2}(x), h_{1−δ/2}(x)], and measures how far the
label y is from the interval — taking a positive value when y falls outside of the
interval and a negative value when it falls inside. If the interval is correct, then
the 1 − δ quantile of the nonconformity score distribution will be 0 — picking
threshold τ = 0 will get the target marginal coverage. But if the interval in-
duced by the quantile regression method is not correct, then choosing different
thresholds τ can systematically widen or shorten the prediction interval by τ
on each end: Ts (x, τ ) = [hδ/2 (x) − τ, h1−δ/2 (x) + τ ]. This non-conformity score
has the advantage that even for a fixed value of τ , the prediction intervals
Ts (x, τ ) can have very different widths, depending on the predictions of the
models hδ/2 (x) and h1−δ/2 (x).
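The two regression non-conformity scores discussed above are one-liners in code. A sketch in Python (function names are ours; h, h_lo, and h_hi stand in for pretrained point and quantile regressors):

```python
def absolute_residual_score(h, x, y):
    """s(x, y) = |h(x) - y|: induces the symmetric interval of
    width 2*tau centered at the point prediction h(x)."""
    return abs(h(x) - y)

def cqr_score(h_lo, h_hi, x, y):
    """Quantile-regression style score: the distance of y from the
    candidate interval [h_lo(x), h_hi(x)], negative when y is inside."""
    return max(h_lo(x) - y, y - h_hi(x))

def interval_from_score(h_lo, h_hi, x, tau):
    """Prediction set Ts(x, tau) for the quantile score: widen (or, for
    negative tau, shrink) each end of the candidate interval by tau."""
    return (h_lo(x) - tau, h_hi(x) + tau)
```

Note that even at a fixed tau, interval_from_score produces intervals whose widths vary with x through the quantile regressors, which is exactly the advantage described above.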
What about for multi-class classification problems, in which DY (x) is a
discrete distribution over k possible labels? To build intuition, suppose we
were given the true conditional distribution over labels given x: For each
label ŷ ∈ [k], p∗x (ŷ) = Pr[y = ŷ|x]. Let πp∗x be the permutation on labels
that puts them in decreasing sorted order by their underlying probability: so
p∗x (πp∗x (1)) ≥ p∗x (πp∗x (2)) ≥ . . . ≥ p∗x (πp∗x (k)). How would we find the smallest
prediction set that contains the true label with probability at least 1 − δ?
We would greedily start adding labels to our prediction set in order of their
probabilities (highest probability to lowest) until the cumulative probability
of the labels in our prediction set exceeded 1 − δ. To say this more formally, for
each t ∈ [k], let C(t, p*_x) = Σ_{i=1}^t p*_x(π_{p*_x}(i)) denote the cumulative probability
of the top t labels in likelihood sorted order. We would choose the prediction
set:

T(x) = {ŷ : C(π_{p*_x}^{−1}(ŷ), p*_x) ≤ 1 − δ}

Of course, we do not have access to the true conditional label probabilities p*_x, but given a trained model that outputs estimated probabilities p_x, we can define the analogous non-conformity score:

s(x, y) = C(π_{p_x}^{−1}(y), p_x)
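A sketch of the cumulative-probability score and the greedy prediction set it induces, assuming (our choice for illustration) that the predicted distribution is given as a dict from labels to probabilities:

```python
def cumulative_probability_score(p, y):
    """s(x, y) = C(pi^{-1}(y), p): the total probability mass of all
    labels at least as likely as y under the predicted distribution p."""
    order = sorted(p, key=p.get, reverse=True)  # labels, most likely first
    total = 0.0
    for label in order:
        total += p[label]
        if label == y:
            return total
    raise KeyError(y)

def prediction_set(p, delta):
    """Greedy set: all labels whose cumulative probability in likelihood
    sorted order is at most 1 - delta."""
    return {label for label in p
            if cumulative_probability_score(p, label) <= 1 - delta}
```

With the true conditional distribution this recovers the greedy construction described above; with an estimated distribution, it is exactly the score that conformal calibration then corrects.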
Algorithm 20 SplitConformal(D, s, δ)
  Let τ be the smallest value such that:

    Σ_{i=1}^n 1[s(x_i, y_i) ≤ τ] ≥ (1 − δ)(n + 1)
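Algorithm 20 amounts to taking an order statistic of the calibration scores. A minimal sketch:

```python
import math

def split_conformal_threshold(scores, delta):
    """Smallest tau with #{i : s_i <= tau} >= (1 - delta)(n + 1)
    (Algorithm 20): the ceil((1 - delta)(n + 1))-th smallest score."""
    n = len(scores)
    k = math.ceil((1 - delta) * (n + 1))
    if k > n:
        return float("inf")  # too little calibration data: trivial set
    return sorted(scores)[k - 1]
```

The (n + 1) in place of n is what accounts for the fresh test point in the exchangeability argument below.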
In fact, the only property we will use about the distribution from which D and
(x, y) are jointly drawn is that it is exchangeable, which means permutation
invariant — we will not need the stronger property that the points are drawn
i.i.d.
Proof 58 (Proof of Theorem 33) Because we have assumed that the non-conformity score distribution on s(x, y) is continuous, with probability 1, there
are no ties amongst the non-conformity scores in D — i.e. for all i ≠ j,
s(x_i, y_i) ≠ s(x_j, y_j). Renumber the datapoints in D in increasing order of their
nonconformity scores — i.e. such that s(x_1, y_1) < s(x_2, y_2) < . . . < s(x_n, y_n).
Let i* be the unique index such that s(x_{i*}, y_{i*}) = τ; by definition, i* = ⌈(1 − δ)(n + 1)⌉.
Imagine the dataset D′ = D ∪ {(x, y)} containing n + 1 elements. Consider
the event y ∈ T_D(x). This occurs exactly when s(x, y) < τ, which is exactly
the event that the pair (x, y) occurs before the pair (x_{i*}, y_{i*}) when we sort the
n + 1 points in D′ by their non-conformity scores. But since all of the points
in D′ are exchangeable, by symmetry it must be that point (x, y) will have rank
that is uniformly random in {1, 2, . . . , n + 1} when put in sorted order within
D′. Thus the event that y ∈ T_D(x) is the event that (x, y) has rank at most
i*, which gives:

Pr_{D,(x,y)} [y ∈ T_D(x)] = i*/(n + 1)
  = ⌈(1 − δ)(n + 1)⌉/(n + 1)
  ≥ (1 − δ)(n + 1)/(n + 1)
  = 1 − δ
Algorithm 21 HighProbabilitySplitConformal(D, s, δ, γ)
  Let τ be the smallest value such that:

    (1/n) Σ_{i=1}^n 1[s(x_i, y_i) ≤ τ] ≥ (1 − δ) + √(log(2/γ)/(2n))
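Algorithm 21 differs from Algorithm 20 only in the inflated empirical quantile level. A sketch with the same caveats as above:

```python
import math

def high_probability_threshold(scores, delta, gamma):
    """Sketch of Algorithm 21: inflate the empirical quantile level by the
    DKW-style term sqrt(log(2/gamma) / (2n)), aiming for coverage that
    holds conditionally on the dataset with probability 1 - gamma."""
    n = len(scores)
    level = (1 - delta) + math.sqrt(math.log(2 / gamma) / (2 * n))
    k = math.ceil(level * n)
    if k > n:
        return float("inf")  # n too small for the requested gamma
    return sorted(scores)[k - 1]
```

The inflation term shrinks like 1/√n, so for large calibration sets the threshold is only slightly more conservative than the split conformal one.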
Pr_{(x,y)∼D} [y ∈ T_D(x) | g(x) = 1] = 1 − δ.

So: T_D(x) will have group conditional coverage guarantees if and only if
f(x) has group conditional marginal quantile consistency guarantees. We know
how to do this using Algorithm 13! We can apply it here:
Algorithm 22 GroupSplitConformal(D, s, G, δ, γ, ρ, σ, η)
  Let q = 1 − δ.
  Let λ* be a solution to the optimization problem:

    Minimize_λ E_{(x,y)∼D} [ L_q(f̂(x; λ), s(x, y)) ] + η‖λ‖₁

  Such that:

    f̂(x; λ) ≡ Σ_{g∈G} λ_g · g(x)
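The optimization in Algorithm 22 can be solved by any convex optimization routine; the following is a plain subgradient-descent sketch (a hypothetical solver of our own, not the method the text assumes):

```python
def pinball_loss(pred, s, q):
    """L_q(pred, s): the pinball (quantile) loss at target quantile q."""
    return max(q * (s - pred), (q - 1) * (s - pred))

def fit_group_quantile_model(group_vectors, scores, q, eta_l1,
                             steps=2000, lr=0.01):
    """Hypothetical subgradient solver for Algorithm 22's objective:
    minimize the average pinball loss of f(x; lam) = sum_g lam_g * g(x)
    plus the l1 penalty eta_l1 * ||lam||_1, over calibration scores.
    group_vectors[i] is the 0/1 vector (g(x_i))_{g in G}."""
    d = len(group_vectors[0])
    n = len(scores)
    lam = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for gx, s in zip(group_vectors, scores):
            pred = sum(l * g for l, g in zip(lam, gx))
            # subgradient of the pinball loss with respect to pred
            gpred = -q if s > pred else (1 - q)
            for j in range(d):
                grad[j] += gpred * gx[j] / n
        for j in range(d):
            # subgradient of the l1 penalty
            grad[j] += eta_l1 * ((lam[j] > 0) - (lam[j] < 0))
            lam[j] -= lr * grad[j]
    return lam
```

pinball_loss is the objective being minimized; the solver works directly with its subgradient. With a single all-ones group and no penalty, the fitted coefficient is just the empirical q-quantile of the scores.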
1 − δ − √(α/µ(g)) ≤ Pr_{(x,y)∼D} [y ∈ T_D^f(x) | g(x) = 1] ≤ 1 − δ + √(α/µ(g))

for:

α ≤ 2ηρ/σ + (8ρ/η) √((ln(2/γ) + |G| ln(1 + 2√n) + 1)/(2n))

Choosing η to minimize this expression gives:

α ≤ O( √(1/σ) · ((ρ(ln(1/γ) + |G| ln n))/n)^{1/4} )
Proof 60 By Theorem 20, with probability 1 − γ, the function f(x; λ*) satisfies
α-approximate group conditional marginal quantile consistency on D on the
set of groups g ∈ G with µ(g) ≥ α and target quantile q = 1 − δ for:

α ≤ 2ηρ/σ + (8ρ/η) √((ln(2/γ) + |G| ln(1 + 2√n) + 1)/(2n))

This means that for every group g ∈ G with µ(g) ≥ α:

(Pr[s(x, y) ≤ f(x; λ*) | g(x) = 1] − (1 − δ))² ≤ α/µ(g)
or equivalently:

(1 − δ) − √(α/µ(g)) ≤ Pr[s(x, y) ≤ f(x; λ*) | g(x) = 1] ≤ (1 − δ) + √(α/µ(g))

The result then follows from the fact that y ∈ T_D^f(x) exactly when s(x, y) ≤ f(x; λ*).
Once again, we see that TDf will satisfy multivalid coverage guarantees if
and only if f satisfies quantile multicalibration for target quantile q = 1 − δ.
And once again, we know how to find such an f — use Algorithm 16!
Algorithm 23 MultivalidSplitConformal(D, s, G, δ, α, γ, ρ)
  Let m = ρ²/(2α), q = 1 − δ.
  Let f_0(x) = 0 for all x and t = 0.
  while f_t is not α-approximately quantile multicalibrated with respect to G and q do
    Let:

      (v_t, g_t) ∈ argmax_{(v,g)∈R(f_t)×G} Pr_{(x,y)∼D} [f_t(x) = v, g(x) = 1] (q − Pr_{(x,y)∼D} [y ≤ v | f_t(x) = v, g(x) = 1])²
Proof 61 By Theorem 26, we have that with probability 1 − γ, the final function f_t satisfies α′-approximate quantile multicalibration for target quantile
1 − δ, groups G, and:

α′ ≤ α + 42 √((3ρ² ln(4π²T²/(3γ)) + T ln(4ρ|G|/α²))/(2αn))

This means that for every group g ∈ G and v ∈ R(f_t):

Pr_{(x,y)∼D} [f_t(x) = v | g(x) = 1] (Pr_{(x,y)∼D} [s(x, y) ≤ f_t(x) | g(x) = 1, f_t(x) = v] − (1 − δ))²
  ≤ Q_2(f_t, g) ≤ α′/µ(g)

Dividing through and taking the square root we obtain:

|Pr_{(x,y)∼D} [s(x, y) ≤ f_t(x) | g(x) = 1, f_t(x) = v] − (1 − δ)| ≤ √(α′/Pr_{(x,y)∼D} [g(x) = 1, f_t(x) = v])

The theorem then follows since y ∈ T_D^{f_t}(x) exactly when s(x, y) ≤ f_t(x).
Algorithm 24 Adversarial-Marginal-Conformal-Predictor(δ, η, T)
  Let q = 1 − δ + (1 + η)/(ηT) and p_1 = 0
  for t = 1 to T do
    Obtain non-conformity score s_t and observe x_t.
    Predict

      T_t(x_t) = {ŷ : s_t(x_t, ŷ) ≤ p_t}

    Observe y_t.
    Let p_{t+1} = p_t + η(q − 1[s_t(x_t, y_t) ≤ p_t])

q − (1 + η)/(ηT) ≤ (1/T) Σ_{t=1}^T 1[s_t(x_t, y_t) ≤ p_t] ≤ q + (1 + η)/(ηT)
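The update rule in Algorithm 24 is a one-line online gradient step on the threshold. A sketch (names ours):

```python
def adversarial_marginal_thresholds(score_seq, delta, eta, T):
    """Sketch of Algorithm 24: maintain a threshold p_t on non-conformity
    scores and update it by p_{t+1} = p_t + eta * (q - 1[s_t <= p_t]),
    with q = 1 - delta + (1 + eta) / (eta * T).
    score_seq holds the realized scores s_t(x_t, y_t)."""
    q = 1 - delta + (1 + eta) / (eta * T)
    p = 0.0
    thresholds, covered = [], []
    for s in score_seq[:T]:
        thresholds.append(p)          # T_t(x_t) = {y : s_t(x_t, y) <= p}
        hit = 1.0 if s <= p else 0.0  # whether y_t landed in T_t(x_t)
        covered.append(hit)
        p += eta * (q - hit)
    return thresholds, covered
```

Misses push the threshold up by η·q and hits pull it down by η·(1 − q), so the long-run hit frequency is driven toward q.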
Algorithm 25 Online-Multivalid-Conformal-Predictor(G, m, r, η, δ)
  Let q = 1 − δ.
  for t = 1 to T do
    Obtain non-conformity score s_t, observe x_t, and compute

      C_{t−1}^i(x_t) = Σ_{g∈G(x_t)} (exp(ηV_{t−1}^{g,i}) − exp(−ηV_{t−1}^{g,i}))

    for all i ∈ [m], with V_{t−1}^{g,i} defined as in Definition 34, with the indicator 1[s_t(x_t, y_t) ≤ p_t] in place of 1[y_t ≤ p_t].
    if C_{t−1}^m(x_t) < 0 then
      Select p_t = 1.
    else if C_{t−1}^1(x_t) > 0 then
      Select p_t = 0.
    else
      Select i* ∈ [m] such that C_{t−1}^{i*}(x_t) · C_{t−1}^{i*+1}(x_t) ≤ 0.
      Compute p ∈ [0, 1] such that:

        p · C_{t−1}^{i*}(x_t) + (1 − p) · C_{t−1}^{i*+1}(x_t) = 0

      Select p_t = i*/m − 1/(rm) with probability p and select p_t = i*/m with probability 1 − p.
    Predict:

      T_t(x_t) = {ŷ : s_t(x_t, ŷ) ≤ p_t}

    Observe y_t
    Let π^{<t+1} = π^{<t} ◦ (x_t, p_t, y_t)
Theorem 38 Fix any set of groups G, m, r ≥ 0 and q ∈ [0, 1]. Let
η = √(log(2|G|m)/(2T)) < 1. Fix δ > 0. Fix any adversary who is constrained for
each t to playing label distributions such that the induced distribution on
non-conformity scores s_t(x_t, y_t) is ρ-Lipschitz, which together with Online-Multivalid-Conformal-Predictor(G, m, r, η, δ) (Algorithm 25) fixes a distribution on transcripts π. We have that with probability 1 − γ over the randomness
of π, for every group g ∈ G and every bucket i ∈ [m]:

1 − δ − α/µ_π(g, i) ≤ Pr_{(x_t,T_t(x_t),y_t)∼π} [y_t ∈ T_t(x_t) | p_t ∈ B_m(i), g(x_t) = 1] ≤ 1 − δ + α/µ_π(g, i)

where

µ_π(g, i) = Pr_{(x_t,T_t(x_t),y_t)∼π} [p_t ∈ B_m(i), g(x_t) = 1]

and

α ≤ 1/(ρrm) + 4 √(2 ln(2|G|m/γ)/T)
8
Distribution Shift

CONTENTS
8.1 Likelihood Ratio Reweighting ........................... 123
8.2 Multicalibration under Distribution Shift .............. 126
8.3 Why Calibration Under Distribution Shift is Useful ..... 128
References and Further Reading ............................ 132
hope to do well on Dt , then this relationship had better be similar on both
distributions — in this chapter we will assume that it is the same.
So, two distributions that have the same conditional label distributions
differ in the relative frequency with which different feature vectors x appear,
but agree on how labels are distributed conditional on features — so there is
some fixed “truth” that we can hope to learn.
Lemma 8.1.1 Suppose D^s and D^t have the same conditional label distribution D^s_Y(x) = D^t_Y(x) = D_Y(x). Fix any S ⊆ X. For any function
F : X × Y → R, we have:

Pr_{(x,y)∼D^s} [x ∈ S] E_{(x,y)∼D^s} [w_{s→t}(x) F(x, y) | x ∈ S] = Pr_{(x,y)∼D^t} [x ∈ S] E_{(x,y)∼D^t} [F(x, y) | x ∈ S]
Remark 8.1.2 Observe that by the law of total probability, for any collection
of sets {S_1, . . . , S_k} that partition X, we have that:

Σ_{i=1}^k Pr_{(x,y)∼D^s} [x ∈ S_i] e(h, w_{s→t}, S_i) = e(h, w_{s→t})
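The reweighting identity underlying this chapter (expectations over D^t can be computed as w-weighted expectations over D^s) can be sanity-checked numerically, assuming, for illustration only, that the likelihood ratio w_{s→t} is known exactly:

```python
def reweighted_expectation(samples_s, w, F):
    """Estimate E_{D^t}[F(x, y)] from source samples as the empirical
    average of w_{s->t}(x) * F(x, y) over D^s, where w is the likelihood
    ratio D^t_X(x) / D^s_X(x) (assumed known here for illustration)."""
    return sum(w(x) * F(x, y) for x, y in samples_s) / len(samples_s)
```

For example, with x uniform on {0, 1} under D^s, target frequencies (0.2, 0.8) under D^t, and y = x, the weights are w(0) = 0.4 and w(1) = 1.6, and the reweighted mean label recovers the target value 0.8.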
The next lemma shows that if we can estimate w_{s→t} closely in total variation distance (as measured in expectation over the source distribution D^s),
then we can closely approximate expectations over D^t.
Lemma 8.1.2 Suppose Ds and Dt have the same conditional label distribu-
tion. Fix any S ⊆ X . For any function F : X × Y → R, and any function
h : X → R, we have:
Recall that we know from Lemma 3.1.1 that K_1(f, h, D) ≤ √(K_2(f, h, D)).
Thus, we can use Algorithm 19 — which guarantees that K_2(f, h, D) ≤ α′ for
all h ∈ H — to obtain α-approximate L_1 multicalibration by setting α′ = α².
α ≥ K_1(f, h, D^s)
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^s} [f(x) = v] |E_{(x,y)∼D^s} [h(x)(y − v) | f(x) = v]|
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^s} [f(x) = v] |E_{(x,y)∼D^s} [w_{s→t}(x) · (h(x)/w_{s→t}(x)) · (y − v) | f(x) = v]|
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^t} [f(x) = v] |E_{(x,y)∼D^t} [(h(x)/w_{s→t}(x)) · (y − v) | f(x) = v]|
  = K_1(f, h/w_{s→t}, D^t)

Here the second to last equality follows from applying Lemma 8.1.1 to each
term:

Pr_{(x,y)∼D^s} [f(x) = v] E_{(x,y)∼D^s} [w_{s→t}(x) · (h(x)/w_{s→t}(x)) · (y − v) | f(x) = v]

using S = {x : f(x) = v} and F(x, y) = (h(x)/w_{s→t}(x)) · (y − v).
Corollary 8.2.1 Suppose D^s and D^t have the same conditional label distribution, and suppose f is α-approximately L_1-multicalibrated with respect to
D^s and H. Then if w_{s→t} ∈ H, f has at most α L_1-calibration error on D^t:

K_1(f, D^t) ≤ α

Proof Taking h = w_{s→t} in the derivation above gives α ≥ K_1(f, w_{s→t}/w_{s→t}, D^t) = K_1(f, D^t).
α ≥ K_1(f, h*, D^s)
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^s} [f(x) = v] |E_{(x,y)∼D^s} [h*(x)(y − v) | f(x) = v]|
  ≥ Σ_{v∈R(f)} Pr_{(x,y)∼D^t} [f(x) = v] |E_{(x,y)∼D^t} [(y − v) | f(x) = v]|
    − Σ_{v∈R(f)} Pr_{(x,y)∼D^s} [f(x) = v] · e(h*, w_{s→t}, {x : f(x) = v})
  ≥ K_1(f, D^t) − e(h*, w_{s→t})

where the first inequality follows from applying Lemma 8.1.2 to each term,
choosing S_v = {x : f(x) = v}, F(x, y) = (y − v), and the fact that since
y, v ∈ [0, 1], max_y |y − v| ≤ 1. The final line follows from the observation that
the collection {S_v}_{v∈R(f)} forms a partition of X.
|E_{x∼D^t_X} [f(x)] − E_{(x,y)∼D^t} [y]| ≤ α
≥ Σ_{v∈R(f)} (Pr_{x∼D^t_X} [f(x) = v] E_{(x,y)∼D^t} [y | f(x) = v] − Pr_{x∼D^t_X} [f(x) = v] · v)
= E_{(x,y)∼D^t} [y] − E_{x∼D^t_X} [f(x)]
If our label space is binary Y = {0, 1}, then we can go beyond this, and
estimate the cost of acting on any policy depending on the predictions of f .
Definition 45 Fix an action space A and a model f : X → [0, 1]. A policy
of f is any mapping ρ : [0, 1] → A that chooses an action ρ(f (x)) ∈ A as a
function of the prediction f (x).
We can evaluate the cost of a policy using a loss function:
Definition 46 Fixing an action space A, a loss function ℓ : A × {0, 1} → R
maps action/label pairs to a real valued loss. Given a distribution D and a
predictor f : X → [0, 1], the expected cost of a policy ρ is:
ℓ(ρ, f, D) = E [ℓ(ρ(f (x)), y)]
(x,y)∼D
Observe that for the Bayes optimal predictor f*(x) = E_{y∼D_Y(x)} [y], the
f*-estimated cost of ρ, ℓ̃(ρ, f*, D_X), is equal to its true expected cost
ℓ(ρ, f*, D).
We now show that the same is true if f is not Bayes optimal, but merely
calibrated.
Theorem 40 Fix an action space A, a loss function ℓ : A × {0, 1} → R, a
policy ρ, a distribution D, and a predictor f : X → R. Let:

C = max_{a∈A} (ℓ(a, 0) + ℓ(a, 1))

Then if K_1(f, D) ≤ α:

|ℓ(ρ, f, D) − ℓ̃(ρ, f, D_X)| ≤ C · α
Proof 70 Let k_v = Pr_{x∼D_X} [f(x) = v] · |E_{(x,y)∼D} [y | f(x) = v] − v|, so that Σ_{v∈R(f)} k_v = K_1(f, D) ≤ α. We can compute:

ℓ(ρ, f, D)
  = E_{(x,y)∼D} [ℓ(ρ(f(x)), y)]
  = Σ_{v∈R(f)} Pr_{x∼D_X} [f(x) = v] E_{(x,y)∼D} [ℓ(ρ(f(x)), y) | f(x) = v]
  = Σ_{v∈R(f)} Pr_{x∼D_X} [f(x) = v] (ℓ(ρ(v), 1) Pr_{(x,y)∼D} [y = 1 | f(x) = v] + ℓ(ρ(v), 0) Pr_{(x,y)∼D} [y = 0 | f(x) = v])
  = Σ_{a∈A} ℓ(a, 1) Σ_{v:ρ(v)=a} Pr_{x∼D_X} [f(x) = v] E_{(x,y)∼D} [y | f(x) = v]
    + Σ_{a∈A} ℓ(a, 0) Σ_{v:ρ(v)=a} Pr_{x∼D_X} [f(x) = v] (1 − E_{(x,y)∼D} [y | f(x) = v])
  ≤ Σ_{a∈A} ℓ(a, 1) Σ_{v:ρ(v)=a} (Pr_{x∼D_X} [f(x) = v] v + k_v)
    + Σ_{a∈A} ℓ(a, 0) Σ_{v:ρ(v)=a} (Pr_{x∼D_X} [f(x) = v] (1 − v) + k_v)
  = Σ_{v∈R(f)} Pr_{x∼D_X} [f(x) = v] (v ℓ(ρ(v), 1) + (1 − v) ℓ(ρ(v), 0)) + Σ_{v∈R(f)} k_v (ℓ(ρ(v), 1) + ℓ(ρ(v), 0))
  = ℓ̃(ρ, f, D_X) + Σ_{v∈R(f)} k_v (ℓ(ρ(v), 1) + ℓ(ρ(v), 0))
  ≤ ℓ̃(ρ, f, D_X) + C · α
Lemma 8.3.2 Fix any distribution D and any predictor f : X → [0, 1] such
that K_1(f, D) ≤ α. Consider the policy ρ* defined above. For any other policy
ρ : [0, 1] → {0, 1}, we have:

ℓ^{0,1}(ρ*, f, D) ≤ ℓ^{0,1}(ρ, f, D) + 2α

ℓ^{0,1}(ρ, f, D) ≥ ℓ̃^{0,1}(ρ, f, D_X) − α
  ≥ ℓ̃^{0,1}(ρ*, f, D_X) − α
  ≥ ℓ^{0,1}(ρ*, f, D) − 2α

Here the first and last inequalities follow from Theorem 40. The middle inequality follows from the fact that pointwise (for each value v):

v · ℓ^{0,1}(ρ(v), 1) + (1 − v) · ℓ^{0,1}(ρ(v), 0) = { 1 − v   ρ(v) = 1
                                                  { v       ρ(v) = 0

is minimized by setting ρ(v) = 1 when v ≥ 1/2 and ρ(v) = 0 when v < 1/2, which
is what the policy ρ*(v) does.
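The threshold policy ρ* and its f-estimated 0/1 loss are easy to write down explicitly (a sketch; names are ours):

```python
def optimal_policy_01(v):
    """rho*(v) for 0/1 loss: act on the more likely label when the
    forecast v = Pr[y = 1 | f(x) = v] is (approximately) calibrated."""
    return 1 if v >= 0.5 else 0

def estimated_01_loss(v, action):
    """f-estimated 0/1 loss of playing `action` on level set v:
    v * l(action, 1) + (1 - v) * l(action, 0), with l(a, y) = 1[a != y]."""
    return v * (action != 1) + (1 - v) * (action != 0)
```

The pointwise optimality used in the middle inequality above is exactly the statement that estimated_01_loss(v, optimal_policy_01(v)) is no larger than the estimated loss of the other action, for every v.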
9
Sufficient Statistics for Optimization

CONTENTS
9.1 Omnipredictors: Sufficient Statistics for Unconstrained
    Optimization ........................................... 133
9.2 Sufficient Statistics for Constrained Optimization ..... 139
    9.2.1 Convex Optimization .............................. 140
    9.2.2 f-estimated Optimization ......................... 141
    9.2.3 Solving Optimization Problems Without Labelled Data 142
References and Further Reading ............................ 144

ℓ(ρ, f, D) ≈ ℓ̃(ρ, f, D_X)
Another nice property of the f -estimated loss is that we can find the policy
that optimizes it without needing to know anything about the underlying
distribution. Specifically, if we have a loss function ℓ in mind, we can choose
the policy ρ∗ℓ that pointwise optimizes the f -estimated cost ℓ̃:
ρ∗ℓ (v) = arg min (vℓ(a, 1) + (1 − v)ℓ(a, 0))
a∈A
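Computing ρ*_ℓ requires no data at all, only the prediction v and the loss function. A sketch over a finite action set (names ours):

```python
def optimal_policy(loss, v, actions):
    """rho*_l(v) = argmin_a (v * loss(a, 1) + (1 - v) * loss(a, 0)):
    the policy that pointwise optimizes the f-estimated loss, computed
    from the prediction v alone (no samples from D required)."""
    return min(actions, key=lambda a: v * loss(a, 1) + (1 - v) * loss(a, 0))
```

For example, for squared loss ℓ(a, y) = (a − y)² over a grid of actions in [0, 1], the pointwise minimizer is (up to the grid resolution) the prediction v itself.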
If f is calibrated, then the policy ρ*_ℓ has the smallest expected loss (as
measured by ℓ) of any policy that is a function of f. This statement generalizes what we proved in Lemma 8.3.2 in the special case of 0/1 loss and has
essentially the same proof.
Lemma 9.1.1 Fix any distribution D, loss function ℓ : A × {0, 1} → R, and
any predictor f : X → [0, 1] such that K1 (f, D) ≤ α. Let:
C = max (ℓ(a, 0) + ℓ(a, 1))
a∈A
Consider the policy ρ∗ℓ defined above. For any other policy ρ : [0, 1] → A, we
have:
ℓ(ρ∗ℓ , f, D) ≤ ℓ(ρ, f, D) + 2Cα
ℓ(ρ, f, D) ≥ ℓ̃(ρ, f, D_X) − Cα
  ≥ ℓ̃(ρ*_ℓ, f, D_X) − Cα
  ≥ ℓ(ρ*_ℓ, f, D) − 2Cα

Here the first and last inequalities follow from Theorem 40. The middle inequality follows from the fact that by definition, ρ*_ℓ(v) is the minimizer of:

v · ℓ(ρ(v), 1) + (1 − v) · ℓ(ρ(v), 0)
Going forward, we consider the special case of A = [0, 1] and aim to show
that if f is multicalibrated with respect to a class of real valued functions H,
then for any convex loss function ℓ, the policy ρ*_ℓ has optimal loss not just
compared to other policies ρ of f, but compared to any h ∈ H. Note that
functions h : X → [0, 1] are functions of x directly, rather than functions of
f(x), and so Lemma 9.1.1 does not imply that ρ*_ℓ(f(x)) has lower loss
than h(x). But Lemma 9.1.1 does point in the direction of our proof strategy:
we will show that if f is (approximately) multicalibrated with respect to H
then in fact every h ∈ H is (almost) dominated by a policy ρ(f(x)). Thus for
any loss function satisfying the conditions of our theorem, we can do (almost)
as well as any h ∈ H by playing the optimal policy for the f-estimated loss
ρ*_ℓ. As mentioned, our results will apply to any convex loss function:
Definition 51 A loss function ℓ : [0, 1] × {0, 1} → R is convex in its first
argument if for all v, v ′ , α ∈ [0, 1] and for all y ∈ {0, 1}:
ℓ(αv + (1 − α)v ′ , y) ≤ α · ℓ(v, y) + (1 − α) · ℓ(v ′ , y)
A direct consequence of convexity that we will make use of is called Jensen’s
inequality:
Claim 9.1.1 (Jensen’s Inequality) Fix any loss function ℓ : [0, 1] ×
{0, 1} → R that is convex in its first argument. For any y ∈ {0, 1} and for
any distribution P ∈ ∆[0, 1], we have:
E [ℓ(v, y)] ≥ ℓ E [v], y
v∼P v∼P
How closely we can relate the performance of a model h(x) to the performance of a policy ρ(f(x)) will depend both on the multicalibration error
that f has on the models in H and on how much small errors in prediction
are magnified by the loss function ℓ, which we will measure by its Lipschitz
constant:
Definition 52 A loss function ℓ : [0, 1] × {0, 1} → R is L-Lipschitz in its first
argument if for all v, v′ and for all y ∈ {0, 1}:

|ℓ(v, y) − ℓ(v′, y)| ≤ L · |v − v′|
Lemma 9.1.2 Fix any distribution D and class of real valued functions H.
Suppose that f is α-approximately L_1 multicalibrated with respect to D and
H (as defined in Definition 44) and α-approximately L_1-calibrated. Then for
any h ∈ H:

Σ_{v∈R(f)} Pr[f(x) = v] v(1 − v) |E_{(x,y)∼D} [h(x) | f(x) = v, y = 1] − E_{(x,y)∼D} [h(x) | f(x) = v, y = 0]| ≤ 2α
Proof 73 Let k_v = Pr_{(x,y)∼D} [f(x) = v] · |Pr_{(x,y)∼D} [y = 1 | f(x) = v] − v|, so that Σ_{v∈R(f)} k_v = K_1(f, D) ≤ α. We can compute:

α ≥ K_1(f, h, D)
  = Σ_{v∈R(f)} Pr_{(x,y)∼D} [f(x) = v] |E_{(x,y)∼D} [h(x)(y − v) | f(x) = v]|
  = Σ_{v∈R(f)} Pr_{(x,y)∼D} [f(x) = v] |Pr_{(x,y)∼D} [y = 1 | f(x) = v] E_{(x,y)∼D} [h(x) | f(x) = v, y = 1] (1 − v)
    − Pr_{(x,y)∼D} [y = 0 | f(x) = v] E_{(x,y)∼D} [h(x) | f(x) = v, y = 0] v|
  ≥ Σ_{v∈R(f)} Pr_{(x,y)∼D} [f(x) = v] (v E_{(x,y)∼D} [h(x) | f(x) = v, y = 1] (1 − v)
    − (1 − v) E_{(x,y)∼D} [h(x) | f(x) = v, y = 0] v) − Σ_{v∈R(f)} k_v ((1 − v) + v)
  = Σ_{v∈R(f)} Pr_{(x,y)∼D} [f(x) = v] v(1 − v) (E_{(x,y)∼D} [h(x) | f(x) = v, y = 1]
    − E_{(x,y)∼D} [h(x) | f(x) = v, y = 0]) − Σ_{v∈R(f)} k_v
  ≥ Σ_{v∈R(f)} Pr_{(x,y)∼D} [f(x) = v] v(1 − v) (E_{(x,y)∼D} [h(x) | f(x) = v, y = 1]
    − E_{(x,y)∼D} [h(x) | f(x) = v, y = 0]) − α
With this lemma in hand, we can prove the main theorem of this section:
that if f is multicalibrated with respect to H, then for any convex Lipschitz
loss function ℓ, the policy ρ∗ℓ obtains loss nearly as good as the loss of the
best h ∈ H. Thus, once we train f , we can use it to optimize any such loss
function ℓ and have performance guarantees relative to H, rather than needing
to solve a fresh optimization problem for each new loss function. The proof
strategy is just as we have already laid out: show that the loss for any h ∈ H
is comparable to the loss of some policy ρ of f , and therefore only higher than
the loss of the best policy ρ∗ℓ for ℓ.
Then the loss of the policy ρ∗ℓ is almost as low as the loss of any h ∈ H: for every h ∈ H, the f-estimated loss of ρ∗ℓ is at most E_{(x,y)∼D}[ℓ(h(x), y)] + (4 + 4L)α.
Proof 74 Let H(v, y) = E_{(x,y)∼D}[h(x) | f(x) = v, y]. We can calculate:
\begin{align*}
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]
&= \sum_{v\in R(f)} \Pr[f(x)=v] \Bigl( \Pr[y=1\mid f(x)=v]\,\mathbb{E}[\ell(h(x),1)\mid f(x)=v, y=1] + \Pr[y=0\mid f(x)=v]\,\mathbb{E}[\ell(h(x),0)\mid f(x)=v, y=0] \Bigr) \\
&\ge \sum_{v\in R(f)} \Pr[f(x)=v] \Bigl( \Pr[y=1\mid f(x)=v]\,\ell(H(v,1),1) + \Pr[y=0\mid f(x)=v]\,\ell(H(v,0),0) \Bigr) \\
&\ge \sum_{v\in R(f)} \Pr[f(x)=v] \Bigl( v\,\ell(H(v,1),1) + (1-v)\,\ell(H(v,0),0) \Bigr) - 2\sum_{v\in R(f)} \Pr[f(x)=v]\,k_v \\
&\ge \sum_{v\in R(f)} \Pr[f(x)=v] \Bigl( v\,\ell(H(v,1),1) + (1-v)\,\ell(H(v,0),0) \Bigr) - 2\alpha
\end{align*}
Here the first inequality is Jensen's inequality (Claim 9.1.1), and the second uses the calibration of f (replacing each conditional label probability by v costs at most k_v per term, since ℓ takes values in [0, 1]).
Now let ρ be the policy defined by ρ(v) = H(v, 0) for v < 1/2 and ρ(v) = H(v, 1) for v ≥ 1/2. Let's compare the loss of this policy ρ with the loss of h. Continuing our derivation above, we find:
\begin{align*}
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]
&\ge \sum_{v\in R(f)} \Pr[f(x)=v] \Bigl( v\,\ell(H(v,1),1) + (1-v)\,\ell(H(v,0),0) \Bigr) - 2\alpha \\
&\ge \sum_{v < 1/2} \Pr[f(x)=v] \Bigl( v\,\ell(H(v,0),1) + (1-v)\,\ell(H(v,0),0) - vL\,|H(v,0) - H(v,1)| \Bigr) \\
&\quad + \sum_{v \ge 1/2} \Pr[f(x)=v] \Bigl( v\,\ell(H(v,1),1) + (1-v)\,\ell(H(v,1),0) - (1-v)L\,|H(v,0) - H(v,1)| \Bigr) - 2\alpha \\
&= \tilde{\ell}(\rho, f, \mathcal{D}_X) - 2\alpha - L \sum_{v\in R(f)} \Pr[f(x)=v]\,\min(v, 1-v)\,|H(v,0) - H(v,1)| \\
&\ge \tilde{\ell}(\rho, f, \mathcal{D}_X) - 2\alpha - 2L \sum_{v\in R(f)} \Pr[f(x)=v]\,v(1-v)\,|H(v,0) - H(v,1)| \\
&\ge \tilde{\ell}(\rho, f, \mathcal{D}_X) - (2 + 4L)\alpha \\
&\ge \tilde{\ell}(\rho^*_\ell, f, \mathcal{D}_X) - (2 + 4L)\alpha \\
&\ge \ell(\rho^*_\ell, f, \mathcal{D}) - (4 + 4L)\alpha
\end{align*}
Here, in the third-to-last line, we have applied Lemma 9.1.2, which tells us that:
\[
\sum_{v\in R(f)} \Pr[f(x)=v]\,v(1-v)\,|H(v,0) - H(v,1)| \le 2\alpha
\]
In the second-to-last line we have used the fact that ρ∗ℓ is the minimizer of ℓ̃(ρ, f, D_X) among all policies ρ. In the final line we have applied Theorem 40 to relate the f-estimated loss ℓ̃ to the true loss ℓ, using the fact that K1(f, D) ≤ α, and that C = max_{a∈A}(ℓ(a, 0) + ℓ(a, 1)) is at most 2 since we have assumed that ℓ takes values in [0, 1].
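To make the role of the policy ρ∗ℓ concrete, here is a small grid-search sketch (our own illustration; the function names and grid resolution are assumptions, not from the text). Given a prediction v = f(x), the optimal policy simply minimizes the f-estimated loss v · ℓ(a, 1) + (1 − v) · ℓ(a, 0) over actions a, once per loss function, with no retraining of f:

```python
def optimal_policy(v, loss, grid_size=1001):
    """rho*_ell(v): the action a in [0, 1] minimizing the f-estimated
    loss v * loss(a, 1) + (1 - v) * loss(a, 0), by grid search."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda a: v * loss(a, 1) + (1 - v) * loss(a, 0))

squared = lambda a, y: (a - y) ** 2
absolute = lambda a, y: abs(a - y)

# For squared loss the optimal action is v itself; for absolute loss
# it is the majority label 1[v >= 1/2].
assert abs(optimal_policy(0.3, squared) - 0.3) < 1e-9
assert optimal_policy(0.3, absolute) == 0.0
assert optimal_policy(0.8, absolute) == 1.0
```

The same trained f is reused for every loss; only the cheap one-dimensional minimization changes.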
If there is any solution P that satisfies all of the constraints, we say that the
optimization problem is feasible. We write P ∗ for the solution that minimizes
the objective function while satisfying the constraints, and write OPT(H) =
Eh∼P ∗ ,(x,y)∼D [ℓ(h(x), y)] for the objective value of an optimal feasible solution.
\[
L(P, \lambda) = \mathop{\mathbb{E}}_{h\sim P,\,(x,y)\sim\mathcal{D}}[\ell(h(x), y)] + \sum_{j=1}^{k} \lambda_j \mathop{\mathbb{E}}_{h\sim P,\,(x,y)\sim\mathcal{D}}\bigl[\ell_j(h(x), y) \mid g_j(x)=1,\, y\in S_j\bigr]
\]
and

2. \( \lambda^* \in \mathop{\arg\max}_{\lambda \in \mathbb{R}^k} L(P^*, \lambda) \)
\begin{align*}
L(P^*, \lambda^*) &= \mathop{\mathbb{E}}_{h\sim P^*,\,(x,y)\sim\mathcal{D}}[\ell(h(x), y)] + \sum_{j=1}^{k} \lambda^*_j \mathop{\mathbb{E}}_{h\sim P^*,\,(x,y)\sim\mathcal{D}}\bigl[\ell_j(h(x), y)\mid g_j(x)=1,\, y\in S_j\bigr] \\
&= \mathrm{OPT} + \sum_{j=1}^{k} \lambda^*_j \mathop{\mathbb{E}}_{h\sim P^*,\,(x,y)\sim\mathcal{D}}\bigl[\ell_j(h(x), y)\mid g_j(x)=1,\, y\in S_j\bigr] \\
&= \mathrm{OPT}
\end{align*}
Here the second equality follows from the fact that P∗ is an optimal solution to the optimization problem, and the last equality follows from complementary slackness.
9.2.2 f -estimated Optimization
The optimization problems we have defined (and their corresponding La-
grangians) are defined as expectations over both x and y — so in order to
evaluate a solution P (or to solve for one), we need access to labelled exam-
ples. Just as we did in Section 9.1 for unconstrained optimization, given a
model f : X → R that purports to encode f (x) = E[y|x], we can define an
f -estimated optimization problem whose definition only involves expectations
taken over features x ∼ DX .
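For intuition, the f-estimated objective can be evaluated from unlabeled features alone. A synthetic sketch (our own; it assumes f is exactly the conditional label expectation, so the f-estimate should track the true labeled loss):

```python
import random

random.seed(5)

# With a model f for E[y|x], the f-estimated loss of h needs only
# unlabeled features: E_x[f(x) * loss(h(x), 1) + (1 - f(x)) * loss(h(x), 0)].
def loss(a, y):
    return (a - y) ** 2

f = lambda x: x                 # assume f(x) = E[y|x] exactly
h = lambda x: 0.5 * x + 0.2     # some model to evaluate

xs = [random.random() for _ in range(100_000)]
labeled = [(x, 1 if random.random() < f(x) else 0) for x in xs]

true_loss = sum(loss(h(x), y) for x, y in labeled) / len(labeled)
f_estimated = sum(f(x) * loss(h(x), 1) + (1 - f(x)) * loss(h(x), 0)
                  for x in xs) / len(xs)
assert abs(true_loss - f_estimated) < 0.01
```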
Definition 56 Fix an (H, G, C)-convex minimization problem with objective ℓ and linear constraints defined by {(ℓ_j, g_j, S_j)}_{j=1}^{k}, and fix a model f : X → [0, 1]. The corresponding f-estimated optimization problem is obtained by replacing each expectation over labels y with its f-estimated analogue: the objective becomes E_{h∼P, x∼D_X}[f(x)ℓ(h(x), 1) + (1 − f(x))ℓ(h(x), 0)], and each constraint is f-estimated in the same way.
If there is any solution P that satisfies all of the constraints, we say that the optimization problem is feasible. We write P̃∗ for the solution that minimizes the objective function while satisfying the constraints, and write
\[
\widetilde{\mathrm{OPT}} = \mathop{\mathbb{E}}_{h\sim\tilde{P}^*,\, x\sim\mathcal{D}_X}\bigl[f(x)\ell(h(x),1) + (1-f(x))\ell(h(x),0)\bigr]
\]
for the objective value of an optimal feasible solution.
We can similarly define the f -estimated Lagrangian:
Definition 57 Fix an f -estimated (H, G, C)-convex minimization problem
with linear constraints, defined by (ℓ, {(ℓj , gj , Sj )}kj=1 ). Partition the con-
straints such that C0 = {j ∈ [k] : Sj = {0}}, C1 = {j ∈ [k] : Sj = {1}},
and C01 = {j ∈ [k] : Sj = {0, 1}}.
The corresponding f-estimated Lagrangian is the function L̃ : H × R^k_{≥0} → R defined as:
\[
\begin{aligned}
\tilde{L}(P, \lambda) ={}& \mathop{\mathbb{E}}_{h\sim P,\, x\sim\mathcal{D}_X}\bigl[f(x)\ell(h(x),1) + (1-f(x))\ell(h(x),0)\bigr] \\
&+ \sum_{j\in C_0} \lambda_j \mathop{\mathbb{E}}_{h\sim P,\, x\sim\mathcal{D}_X}\bigl[\ell_j(h(x),0) \mid g_j(x)=1\bigr] + \sum_{j\in C_1} \lambda_j \mathop{\mathbb{E}}_{h\sim P,\, x\sim\mathcal{D}_X}\bigl[\ell_j(h(x),1) \mid g_j(x)=1\bigr] \\
&+ \sum_{j\in C_{01}} \lambda_j \mathop{\mathbb{E}}_{h\sim P,\, x\sim\mathcal{D}_X}\bigl[f(x)\ell_j(h(x),1) + (1-f(x))\ell_j(h(x),0) \mid g_j(x)=1\bigr]
\end{aligned}
\]
Lemma 9.2.1 Fix any model f : X → [0, 1] and any f -estimated (Hall , G, C)-
convex optimization problem with linear constraints. Let h ∈ Hall be an op-
timal solution to the problem. Then h(x) can be written as a policy of f (x):
h(x) = ρ(f (x)) for some ρ : [0, 1] → [0, 1].
Proof 76
Proof 77 First we argue about the objective value. From Theorem 42, we
know that there exists a λ̃∗ such that (P̃ ∗ , λ̃∗ ) form an optimal primal/dual
pair for the corresponding f-estimated Lagrangian L̃. From Corollary 9.2.1 we know that L̃(P̃∗, λ̃∗) = OPT̃. Similarly, let P∗ be an optimal solution to
the original (H, G, C)-optimization problem. We know from Theorem 42 that
there exists a λ∗ such that (P ∗ , λ∗ ) form an optimal primal/dual pair for the
corresponding Lagrangian L, and from Corollary 9.2.1 that L(P ∗ , λ∗ ) = OPT.
We also know from Theorem 40, since f is α-approximately calibrated with respect to H, that the f-estimated loss of any solution is close to its true loss:
\[
\bigl|\tilde{\ell}(P, f, \mathcal{D}_X) - \ell(P, \mathcal{D})\bigr| \le C\alpha
\]
Thus we can calculate:
\[
\widetilde{\mathrm{OPT}} = \tilde{L}(\tilde{P}^*, \tilde{\lambda}^*) \le \cdots
\]
Lemma 9.2.2 Fix any distribution D ∈ ∆Z, any class of group indicator functions G containing functions g : X → {0, 1}, and any class of real valued functions H containing functions h : X → R. For each g ∈ G, define D_g = D|g(x) = 1 to be the distribution conditional on g(x) = 1. Suppose a model f : X → R is α-approximately L1-multicalibrated with respect to D and G · H. Then for every g ∈ G and h ∈ H:
\[
K_1(f, h, \mathcal{D}_g) \le \frac{\alpha}{\mu(g, \mathcal{D})}
\]
Proof 78 We can calculate:
\[
K_1(f, h, \mathcal{D}_g) = \sum_{v\in R(f)} \Pr[f(x)=v \mid g(x)=1]\,\Bigl|\mathbb{E}\bigl[h(x)(y-v)\mid g(x)=1, f(x)=v\bigr]\Bigr|
\]
These results derive from the work of Gopalan et al. [2022b], who called such models “omnipredictors”. Gopalan et al. [2022b] use
a slightly different notion of calibration than we do (based on partitions of the
feature space and covariance), but if a model satisfies our notion of multicali-
bration and is also calibrated, then it also satisfies the covariance based notion
and vice versa. Gopalan et al. [2022a] give an incomparable omniprediction
theorem — they show that group conditional mean consistency together with
(marginal) calibration is sufficient to be competitive with any h ∈ H on any
Lipschitz loss function ℓ (no longer requiring convexity of ℓ or full multical-
ibration) — but in general this requires group conditional mean consistency
with respect to all level sets of functions in H, rather than just with H itself.
10
Ensembling, Model Multiplicity, and the
Reference Class Problem
CONTENTS
10.1 Reference Classes and Model Multiplicity . . . . . . . . . . . . . . . . . . . . . . . 147
10.2 Model Ensembling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.3 Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
References and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
f∗(x) = E_{y∼D(x)}[y] minimizes the expected Brier score:
\[
f^* \in \mathop{\arg\min}_{f:\mathcal{X}\to[0,1]} \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\bigl[(f(x)-y)^2\bigr]
\]
Hence if we have two models such that B(f1 ) < B(f2 ), this falsifies the hy-
pothesis that f2 = f ∗ — i.e. it cannot be the case that f2 represents the true
individual probabilities, and gives us an empirical (and practical!) justification
for adopting model f1 rather than model f2 .
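This falsification test is directly computable from data. A minimal synthetic sketch (our own illustration, assuming a world where Pr[y = 1 | x] = x):

```python
import random

random.seed(1)

def brier(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Synthetic world: x ~ U[0, 1] and Pr[y = 1 | x] = x.
xs = [random.random() for _ in range(50_000)]
data = [(x, 1 if random.random() < x else 0) for x in xs]

f1 = lambda x: x      # the true conditional label expectation
f2 = lambda x: 0.5    # a constant model

# B(f1) < B(f2) falsifies the hypothesis that f2 = f*.
assert brier(f1, data) < brier(f2, data)
```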
The “model multiplicity” problem refers to the worry that there may be
multiple models f1 , f2 that are equally accurate (such that B(f1 ) = B(f2 ))
that disagree in their predictions. In this case, accuracy gives us no basis
on which to reject either model, and yet if f1 (Bob) is very different from
f2 (Bob), what basis do we have to act on our predictions? Are we justified in
denying Bob life insurance if it seems unprofitable according to the individual
probability assigned by f2 but seems profitable according to the individual
probability assigned by f1 ?
This can indeed be a problem if the models f are chosen to optimize
accuracy in some fixed class. But as we will see, the situation cannot arise if
the parties proposing their models are willing to update (and improve!) their
models in the face of evidence that can be found in the data before them and
in the competing models that are proposed! The updates needed will be of
exactly the same simple “patch” form that we have studied when deriving
algorithms for multicalibration and group conditional mean consistency.
Definition 61 Fix any two models f1 , f2 : X → [0, 1] and any ϵ > 0. Define
the sets:
Uϵ> (f1 , f2 ) = {x ∈ Uϵ (f1 , f2 ) : f1 (x) > f2 (x)}
Uϵ< (f1 , f2 ) = {x ∈ Uϵ (f1 , f2 ) : f1 (x) < f2 (x)}
Based on these sets, for • ∈ {>, <} and i ∈ {1, 2} define the quantities:
\[
v_*^\bullet = \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\bigl[y \mid x \in U_\epsilon^\bullet(f_1, f_2)\bigr] \qquad v_i^\bullet = \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\bigl[f_i(x) \mid x \in U_\epsilon^\bullet(f_1, f_2)\bigr]
\]
Lemma 10.2.1 Fix any two models f1, f2 : X → [0, 1] and any ϵ > 0. If the fraction of points on which f1 and f2 have an ϵ disagreement has mass µ(Uϵ(f1, f2)) = α, then for some • ∈ {>, <} and some i ∈ {1, 2}, we have that:
\[
\mu\bigl(U_\epsilon^\bullet(f_1, f_2)\bigr)\cdot(v_*^\bullet - v_i^\bullet)^2 \ge \frac{\alpha\epsilon^2}{8}
\]
Proof 79 Since Uϵ(f1, f2) can be written as the disjoint union Uϵ(f1, f2) = Uϵ>(f1, f2) ∪ Uϵ<(f1, f2), we must have that for at least one value of • ∈ {>, <}:
\[
\mu\bigl(U_\epsilon^\bullet(f_1, f_2)\bigr) \ge \frac{\alpha}{2}.
\]
Since the points in Uϵ•(f1, f2) are ϵ-separated, we must have that |v1• − v2•| ≥ ϵ. Therefore, for at least one of i ∈ {1, 2} we must have that:
\[
|v_i^\bullet - v_*^\bullet| \ge \frac{\epsilon}{2}
\]
Combining these two claims, we must have that:
\[
\mu\bigl(U_\epsilon^\bullet(f_1, f_2)\bigr)\cdot(v_i^\bullet - v_*^\bullet)^2 \ge \frac{\alpha}{2}\cdot\frac{\epsilon^2}{4} = \frac{\alpha\epsilon^2}{8}
\]
Let's consider the significance of this lemma. Most basically, if we have two models f1 and f2 that disagree substantially, this lemma gives an easily constructible set (Uϵ>(f1^{t1}, f2^{t2}) or Uϵ<(f1^{t1}, f2^{t2})) that falsifies either the assertion that f1 encodes true conditional label expectations or the assertion that f2 does. And not only does it falsify that at least one of f1 or f2 is a “correct” model — it provides a directly actionable way to improve one of the models. Recall Lemma 4.1.1, which we proved when analyzing an algorithm for guaranteeing group conditional mean consistency, and which we reproduce here:
Lemma 10.2.2 Fix any model ft : X → [0, 1] and group g : X → {0, 1}. Let
\[
\Delta_t = \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid g(x) = 1] - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[f_t(x) \mid g(x) = 1]
\]
and
\[
f_{t+1} = h(x, f_t; g, \Delta_t) \qquad\text{where}\qquad h(x, f; g, \Delta) = \begin{cases} f(x) + \Delta & g(x) = 1 \\ f(x) & \text{otherwise} \end{cases}
\]
Then:
\[
B(f_t) - B(f_{t+1}) = \mu(g)\cdot\Delta_t^2
\]
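The identity in this lemma holds exactly on an empirical distribution, which makes it easy to check numerically (an illustrative sketch with synthetic data; the variable names are ours):

```python
import random

random.seed(2)

# Synthetic population: Pr[y = 1 | x] = x, group g = {x < 0.5}.
data = [(x, 1 if random.random() < x else 0)
        for x in (random.random() for _ in range(100_000))]

f = lambda x: 0.5                     # initial model
g = lambda x: 1 if x < 0.5 else 0     # group indicator

group = [(x, y) for x, y in data if g(x)]
mu = len(group) / len(data)                            # mu(g)
delta = (sum(y for _, y in group) / len(group)
         - sum(f(x) for x, _ in group) / len(group))   # E[y|g] - E[f(x)|g]

f_next = lambda x: f(x) + delta if g(x) else f(x)      # the patch h(x, f; g, delta)

def brier(model):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

improvement = brier(f) - brier(f_next)
assert abs(improvement - mu * delta ** 2) < 1e-9       # B(f) - B(f') = mu(g) * delta^2
```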
Algorithm 26 Reconcile(f1, f2, α, ϵ)
Let t = t1 = t2 = 0 and f1^{t1} = f1, f2^{t2} = f2.
Let m = ⌈2/(√α ϵ)⌉.
while µ(Uϵ(f1^{t1}, f2^{t2})) ≥ α do:
For each • ∈ {>, <} and i ∈ {1, 2}, let:
\[
v_*^\bullet = \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\bigl[y \mid x \in U_\epsilon^\bullet(f_1^{t_1}, f_2^{t_2})\bigr] \qquad v_i^\bullet = \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\bigl[f_i^{t_i}(x) \mid x \in U_\epsilon^\bullet(f_1^{t_1}, f_2^{t_2})\bigr]
\]
Let:
\[
(i_t, \bullet_t) = \mathop{\arg\max}_{i\in\{1,2\},\,\bullet\in\{>,<\}} \mu\bigl(U_\epsilon^\bullet(f_1^{t_1}, f_2^{t_2})\bigr)\cdot(v_*^\bullet - v_i^\bullet)^2
\]
Let:
\[
g_t(x) = \begin{cases} 1 & x \in U_\epsilon^{\bullet_t}(f_1^{t_1}, f_2^{t_2}) \\ 0 & \text{otherwise} \end{cases}
\]
Let:
\[
\tilde{\Delta}_t = \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid g_t(x)=1] - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\bigl[f_{i_t}^{t_{i_t}}(x) \mid g_t(x)=1\bigr] \qquad \Delta_t = \mathrm{Round}(\tilde{\Delta}_t; m)
\]
Update f_{i_t}^{t_{i_t}+1} = h(x, f_{i_t}^{t_{i_t}}; g_t, Δ_t), and increment t_{i_t} and t.
Output (f1^{t1}, f2^{t2}).
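The algorithm is simple enough to run directly on an empirical distribution. The sketch below is our own illustration (it patches with the exact empirical bias rather than the rounded value, and all names are hypothetical):

```python
import random

random.seed(3)

def reconcile(f1, f2, data, alpha=0.01, eps=0.1):
    """Empirical sketch of Reconcile: while the two models eps-disagree on
    at least an alpha fraction of points, patch the model whose average
    prediction is further from the average label on one side of the
    disagreement region."""
    f1, f2 = dict(f1), dict(f2)   # models as tables x -> prediction
    n = len(data)
    while True:
        regions = {
            ">": [(x, y) for x, y in data if f1[x] - f2[x] >= eps],
            "<": [(x, y) for x, y in data if f2[x] - f1[x] >= eps],
        }
        if sum(len(r) for r in regions.values()) < alpha * n:
            return f1, f2
        best = None
        for pts in regions.values():
            if not pts:
                continue
            v_star = sum(y for _, y in pts) / len(pts)   # average label
            for f in (f1, f2):
                v_i = sum(f[x] for x, _ in pts) / len(pts)
                gain = len(pts) / n * (v_star - v_i) ** 2
                if best is None or gain > best[0]:
                    best = (gain, pts, f, v_star - v_i)
        if best is None or best[0] == 0:
            return f1, f2
        _, pts, f, delta = best
        for x, _ in pts:                                 # the patch update
            f[x] += delta

# Synthetic check: f2 is nearly accurate, f1 is a constant; they
# disagree everywhere by 0.4 before reconciliation.
xs = list(range(2000))
data = [(x, 1 if random.random() < (0.9 if x % 2 else 0.1) else 0) for x in xs]
f1 = {x: 0.5 for x in xs}
f2 = {x: 0.9 if x % 2 else 0.1 for x in xs}

g1, g2 = reconcile(f1, f2, data)
disagree = sum(abs(g1[x] - g2[x]) >= 0.1 for x in xs)
assert disagree < 0.01 * len(xs)
```

Each patch strictly improves the Brier score of the patched model, so the loop must terminate, just as in the analysis.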
Proof 80 By Lemma 10.2.1, for each round t < T we must have that:
\[
\max_{i\in\{1,2\},\,\bullet\in\{>,<\}} \mu\bigl(U_\epsilon^\bullet(f_1^{t_1}, f_2^{t_2})\bigr)\cdot(v_*^\bullet - v_i^\bullet)^2 \ge \frac{\alpha\epsilon^2}{8}
\]
Hence, writing f̃_{i_t}^{t_{i_t}+1} for the model patched with the exact (unrounded) value Δ̃_t, Lemma 10.2.2 gives:
\[
B\bigl(f_{i_t}^{t_{i_t}}\bigr) - B\bigl(\tilde{f}_{i_t}^{t_{i_t}+1}\bigr) \ge \frac{\alpha\epsilon^2}{8}
\]
We can now compute:
\[
\hat{\Delta} = \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid g_t(x)=1] - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\bigl[f_{i_t}^{t_{i_t}}(x)\mid g_t(x)=1\bigr] - \Delta_t
\]
Third, by definition of the Round operation, |Δ̂| ≤ 1/(2m). Therefore we can again apply Lemma 10.2.2 to conclude that:
\[
B\bigl(f_{i_t}^{t_{i_t}+1}\bigr) - B\bigl(\tilde{f}_{i_t}^{t_{i_t}+1}\bigr) = \mu(g_t)\,\hat{\Delta}^2 \le \frac{1}{4m^2}
\]
Combining this with our initial calculation lets us conclude that:
\[
B\bigl(f_{i_t}^{t_{i_t}}\bigr) - B\bigl(f_{i_t}^{t_{i_t}+1}\bigr) \ge \frac{\alpha\epsilon^2}{8} - \frac{1}{4m^2} \ge \frac{\alpha\epsilon^2}{16}
\]
Here we are using the fact that we have set m ≥ 2/(√α ϵ). Applying this lemma for each of the T1 and T2 updates of f1 and f2 respectively, we get that B(f1^{T1}) ≤ B(f1) − T1 · αϵ²/16 and B(f2^{T2}) ≤ B(f2) − T2 · αϵ²/16. Since Brier scores are non-negative, we conclude that T1 ≤ B(f1) · 16/(αϵ²) and T2 ≤ B(f2) · 16/(αϵ²). Thus T = T1 + T2 ≤ (B(f1) + B(f2)) · 16/(αϵ²).
Finally, the halting condition of the algorithm implies that µ(Uϵ(f1^{T1}, f2^{T2})) < α.
Thus if we start with any two models that have substantial disagreement, we
are guaranteed to be able to efficiently produce strictly improved models that
almost agree almost everywhere. In particular, we can never be in a position
in which we have two equally accurate but unimprovable models that have
substantial disagreements: in this case, we can always improve the models.
The only time we can have substantial model disagreement is if we refuse to
improve the models even in the face of efficiently verifiable and actionable
evidence that one of the models is suboptimal and improvable.
We observe that any pair of models that have gone through the “Reconcile” process must also produce very similar estimates of the conditional label probability on any subset of the data space with sufficiently large probability mass.
Corollary 10.2.1 Let E ⊂ X be any subset of the data space. Let f1 and f2 be any two models that have been output by Algorithm 26 (Reconcile) with parameters ϵ and α. Let:
\[
p_1(E) = \sum_{x\in E}\frac{\mu(x)\cdot f_1(x)}{\mu(E)} \qquad\text{and}\qquad p_2(E) = \sum_{x\in E}\frac{\mu(x)\cdot f_2(x)}{\mu(E)}
\]
We can calculate:
\[
\mu(E)\,\bigl|p_1(E) - p_2(E)\bigr| = \Bigl|\sum_{x\in E}\mu(x)\,(f_1(x)-f_2(x))\Bigr| = \Bigl|\sum_{x\in E\cap U_\epsilon(f_1,f_2)}\mu(x)\,(f_1(x)-f_2(x)) + \sum_{x\in E\cap S_\epsilon(f_1,f_2)}\mu(x)\,(f_1(x)-f_2(x))\Bigr|
\]
two possible values for i_t, two possible values for •_t, and m + 1 possible choices for Δ_t. Hence the number of transcripts of length T is (4(m + 1))^T. Thus we have:
\[
|C| \le \sum_{T=0}^{32/(\alpha\epsilon^2)} (4(m+1))^T \le (4(m+1))^{\frac{32}{\alpha\epsilon^2}+1}
\]
3. For any δ > 0, with probability at least 1 − δ over the randomness of D ∼ D^n:
\[
\mu\bigl(U_\epsilon(f_1^{T_1}, f_2^{T_2})\bigr) < \alpha + \sqrt{\frac{\left(\frac{32}{\alpha\epsilon^2}+1\right)\log\left(\frac{8\left(\lceil\frac{2}{\sqrt{\alpha}\,\epsilon}\rceil+1\right)}{\delta}\right)}{n}}.
\]
Remark 10.3.1 Theorem 45 tells us that the guarantees we proved for Algo-
rithm 26 in Theorem 44 (when we assumed direct access to the distribution
D) continue to hold when all we have access to is a finite sample of n points
from the data distribution, with additional error terms that tend to zero as n
grows large. How large is large? If we want the final disagreement region to
have mass at most 2α (i.e. we want the third conclusion of Theorem 45 to tell
us that µ(Uϵ (f1T1 , f2T2 )) < 2α), then solving for n in the error bound, we find
that it suffices to have n samples for n on the order of:
\[
n \in \tilde{O}\!\left(\frac{\log(1/\delta)}{\alpha^3\epsilon^2}\right)
\]
where the Õ(·) notation hides logarithmic terms in 1/α and 1/ϵ.
This is a remarkably small amount of data: we would need ≈ log(1/δ)/(αϵ²) samples just to estimate the conditional label expectation Pr[y = 1|x ∈ S] for a conditional event S with µ(S) = α up to error ϵ with probability 1 − δ (or for two parties with disjoint samples to agree on this conditional label expectation up to error ϵ). Theorem 45 tells us that in fact two parties can be made to agree on a 1 − α fraction of points up to error ϵ with only a factor of Õ(1/α²) more data.
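A back-of-the-envelope version of this calculation (our own sketch; the constants follow the error term of Theorem 45 as stated, so treat the output as an order-of-magnitude estimate):

```python
from math import ceil, log, sqrt

def samples_needed(alpha, eps, delta):
    """Roughly solve sqrt(c / n) <= alpha for n, where sqrt(c / n) is the
    Theorem 45 error term on the disagreement mass, so that the final
    disagreement region has mass at most 2 * alpha."""
    m = ceil(2 / (sqrt(alpha) * eps))
    c = (32 / (alpha * eps ** 2) + 1) * log(8 * (m + 1) / delta)
    return ceil(c / alpha ** 2)   # n >= c / alpha^2: O~(log(1/delta) / (alpha^3 eps^2))
```

Shrinking α increases the requirement both through the exponent term and through the 1/α² outside, recovering the α³ dependence in the remark.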
Proof 83 (Proof of Theorem 45) The bound on T follows directly from
Theorem 44 without modification. We focus on bounding the Brier scores and
the uncertainty region for the resulting models.
Consider any pair of models f1, f2. Given a finite dataset D we write (x, y) ∼ D to denote uniformly sampling a single datapoint from D. We start by comparing the empirical quantity Pr_{(x,y)∼D}[x ∈ Uϵ(f1, f2)] with its distributional counterpart Pr_{(x,y)∼𝒟}[x ∈ Uϵ(f1, f2)]. We have that:
\[
\Pr_{(x,y)\sim D}[x\in U_\epsilon(f_1,f_2)] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[x_i \in U_\epsilon(f_1,f_2)]
\]
Since 1[x_i ∈ Uϵ(f1, f2)] ∈ [0, 1] and
\[
\mathop{\mathbb{E}}_{D\sim\mathcal{D}^n}\Bigl[\Pr_{(x,y)\sim D}[x\in U_\epsilon(f_1,f_2)]\Bigr] = \Pr_{(x,y)\sim\mathcal{D}}[x\in U_\epsilon(f_1,f_2)]
\]
we can apply Hoeffding's inequality (Theorem 46) to conclude that for every η > 0:
\[
\Pr_{D\sim\mathcal{D}^n}\Bigl[\Bigl|\Pr_{(x,y)\sim D}[x\in U_\epsilon(f_1,f_2)] - \Pr_{(x,y)\sim\mathcal{D}}[x\in U_\epsilon(f_1,f_2)]\Bigr| \ge \eta\Bigr] \le 2\exp\bigl(-2\eta^2 n\bigr)
\]
Let C be the set of pairs of models guaranteed in the statement of Lemma 10.3.1. Recall that Lemma 10.3.1 guarantees us that |C| ≤ (4(m + 1))^{32/(αϵ²)+1}. We can apply the union bound to all pairs of models (f1, f2) ∈ C to conclude that with probability at least 1 − 2|C| exp(−2η²n) (over the randomness of D) we have that for every pair (f1, f2) ∈ C:
\[
\Bigl|\Pr_{(x,y)\sim D}[x\in U_\epsilon(f_1,f_2)] - \Pr_{(x,y)\sim\mathcal{D}}[x\in U_\epsilon(f_1,f_2)]\Bigr| \le \eta
\]
Choosing
\[
\eta = \sqrt{\frac{\log\frac{2|C|}{\delta}}{2n}}
\]
we get that with probability 1 − δ over the draw of D, for every pair (f1, f2) ∈ C:
\[
\Bigl|\Pr_{(x,y)\sim D}[x\in U_\epsilon(f_1,f_2)] - \Pr_{(x,y)\sim\mathcal{D}}[x\in U_\epsilon(f_1,f_2)]\Bigr| \le \sqrt{\frac{\log\frac{2|C|}{\delta}}{2n}} \le \sqrt{\frac{\left(\frac{32}{\alpha\epsilon^2}+1\right)\log\left(\frac{8\left(\lceil\frac{2}{\sqrt{\alpha}\,\epsilon}\rceil+1\right)}{\delta}\right)}{n}}
\]
where the final inequality follows from plugging in our bound on |C| and the definition of m.
Because we know from Theorem 44 that the models f1^{T1}, f2^{T2} output by Algorithm 26 satisfy Pr_{(x,y)∼D}[x ∈ Uϵ(f1^{T1}, f2^{T2})] ≤ α, we can conclude that with probability 1 − δ:
\[
\Pr_{(x,y)\sim\mathcal{D}}\bigl[x\in U_\epsilon(f_1^{T_1}, f_2^{T_2})\bigr] \le \alpha + \sqrt{\frac{\left(\frac{32}{\alpha\epsilon^2}+1\right)\log\left(\frac{8\left(\lceil\frac{2}{\sqrt{\alpha}\,\epsilon}\rceil+1\right)}{\delta}\right)}{n}}
\]
We can bound the Brier score of the resulting models in exactly the same way. For any fixed model f : X → [0, 1], we can write the empirical Brier score (i.e. the Brier score as evaluated over D) as:
\[
B_D(f) = \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2
\]
Since (f(x_i) − y_i)² ∈ [0, 1] and E_{D∼𝒟^n}[B_D(f)] = B(f), we can apply Hoeffding's inequality (Theorem 46) exactly as before to conclude that for every pair of models (f1^{T1}, f2^{T2}) ∈ C, with probability 1 − δ:
\[
\bigl|B_D(f_1^{T_1}) - B(f_1^{T_1})\bigr| \le \sqrt{\frac{\left(\frac{16}{\alpha\epsilon^2}+1\right)\log\left(\frac{16\left(\lceil\frac{2}{\sqrt{\alpha}\,\epsilon}\rceil+1\right)}{\delta}\right)}{n}}
\]
Observe that the same holds true for the original pair of models (f1, f2), since (f1, f2) ∈ C (they correspond to the models output after transcripts of length 0). We further know from Theorem 44 that:
\[
B_D(f_1^{T_1}) \le B_D(f_1) - T_1\cdot\frac{\alpha\epsilon^2}{16} \qquad\text{and}\qquad B_D(f_2^{T_2}) \le B_D(f_2) - T_2\cdot\frac{\alpha\epsilon^2}{16}
\]
Instantiating these bounds for the four models {f1, f2, f1^{T1}, f2^{T2}}, and setting δ ← δ/4 so that we can union bound over all four models, we have that with probability 1 − δ we simultaneously have:
\[
B(f_1^{T_1}) \le B(f_1) - T_1\cdot\frac{\alpha\epsilon^2}{16} + 2\sqrt{\frac{\left(\frac{16}{\alpha\epsilon^2}+1\right)\log\left(\frac{64\left(\lceil\frac{2}{\sqrt{\alpha}\,\epsilon}\rceil+1\right)}{\delta}\right)}{n}}
\]
\[
B(f_2^{T_2}) \le B(f_2) - T_2\cdot\frac{\alpha\epsilon^2}{16} + 2\sqrt{\frac{\left(\frac{16}{\alpha\epsilon^2}+1\right)\log\left(\frac{64\left(\lceil\frac{2}{\sqrt{\alpha}\,\epsilon}\rceil+1\right)}{\delta}\right)}{n}}
\]
Bibliography
Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under dis-
tribution shift. Advances in Neural Information Processing Systems, 34:
1660–1672, 2021.
Ira Globus-Harris, Declan Harrison, Michael Kearns, Aaron Roth, and Jessica
Sorrell. Multicalibration as boosting for regression. 2023.
Parikshit Gopalan, Lunjia Hu, Michael P Kim, Omer Reingold, and Udi
Wieder. Loss minimization through the lens of outcome indistinguishability.
arXiv preprint arXiv:2210.08649, 2022a.
Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and
Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Com-
puter Science Conference (ITCS 2022). Schloss Dagstuhl-Leibniz-Zentrum
für Informatik, 2022b.
Varun Gupta, Christopher Jung, Georgy Noarov, Mallesh M Pai, and Aaron
Roth. Online multivalid learning: Means, moments, and prediction intervals.
In 13th Innovations in Theoretical Computer Science Conference (ITCS
2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.
Sergiu Hart. Calibrated forecasts: The minimax proof. 2020. URL http:
//www.ma.huji.ac.il/~hart/papers/calib-minmax.pdf.
Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum.
Multicalibration: Calibration for the (computationally-identifiable) masses.
In International Conference on Machine Learning, pages 1939–1948. PMLR,
2018.
Christopher Jung, Changhwa Lee, Mallesh Pai, Aaron Roth, and Rakesh
Vohra. Moment multicalibration for uncertainty estimation. In Conference
on Learning Theory, pages 2634–2678. PMLR, 2021.
Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth.
Batch multivalid conformal prediction. arXiv preprint arXiv:2209.15145,
2022.
Michael P Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-
box post-processing for fairness in classification. In Proceedings of the 2019
AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254, 2019.
Michael P Kim, Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer
Reingold. Universal adaptability: Target-independent inference that com-
petes with propensity scoring. Proceedings of the National Academy of
Sciences, 119(4):e2108097119, 2022.
Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry
Wasserman. Distribution-free predictive inference for regression. Journal
of the American Statistical Association, 113(523):1094–1111, 2018.
Theorem 46 (Hoeffding's inequality) Let S_n = Σ_{i=1}^n X_i be a sum of independent random variables with each X_i ∈ [a_i, b_i]. Then for every t > 0:
\[
\Pr\bigl[|S_n - \mathbb{E}[S_n]| \ge t\bigr] \le 2\exp\!\left(\frac{-2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)
\]
For sums of independent [0, 1]-valued random variables, a multiplicative Chernoff bound also holds: for every η ∈ (0, 1):
\[
\Pr\bigl[|S_n - \mathbb{E}[S_n]| \ge \eta\,\mathbb{E}[S_n]\bigr] \le 2\exp\!\left(-\frac{\mathbb{E}[S_n]\,\eta^2}{3}\right)
\]
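As an illustrative check of the additive bound (our own sketch), we can simulate Bernoulli sums, for which each term lies in [0, 1] so that Σ(b_i − a_i)² = n:

```python
import random
from math import exp

random.seed(4)

n, t, trials = 100, 10, 20_000
exceed = 0
for _ in range(trials):
    s = sum(random.random() < 0.5 for _ in range(n))   # S_n for Bernoulli(1/2)
    if abs(s - n / 2) >= t:                            # |S_n - E[S_n]| >= t
        exceed += 1

empirical = exceed / trials
bound = 2 * exp(-2 * t ** 2 / n)                       # Hoeffding bound
assert empirical <= bound
```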