
Aaron Roth

Uncertain: Modern Topics in Uncertainty Estimation

INCOMPLETE WORKING DRAFT
Contents

I Foundations

1 Basic Setting and Definitions

2 A Simple Goal: Marginal Estimation
2.1 Means
2.2 Quantiles
2.2.1 Generalizing From Data
2.3 Sequential Prediction

3 Calibration
3.1 Introduction to Calibration
3.2 Calibrating a Model f
3.3 Quantile Calibration
3.4 Sequential Prediction
3.4.1 Sequential (Mean) Calibration
3.4.2 Sequential Quantile Calibration

4 Multigroup Guarantees
4.1 Group Conditional Mean Consistency
4.2 Group Conditional Quantile Consistency
4.2.1 A More Direct Approach to Group Conditional Guarantees
4.2.1.1 Generalization
4.3 Multicalibration: Group Conditional Calibration
4.4 Quantile Multicalibration
4.5 Out of Sample Generalization
4.5.1 Mean Multicalibration
4.5.2 Quantile Multicalibration
4.6 Sequential Prediction
4.6.1 A Bucketed Calibration Definition
4.6.2 Achieving Bucketed Calibration
4.6.3 Obtaining Bucketed Quantile Multicalibration

5 Beyond Means, Quantiles, and Calibration
5.1 Beyond Means and Quantiles
5.2 Beyond Calibration

6 Multicalibration for Real Valued Functions: When Does Multicalibration Imply Accuracy?
6.1 Beyond Groups
6.2 Algorithmically Reducing Multicalibration to Regression
6.3 Weak Learning, Multicalibration, and Boosting

II Applications

7 Conformal Prediction
7.1 Prediction Sets and Nonconformity Scores
7.1.1 Non-Conformity Scores
7.2 A Weak Guarantee: Marginal Coverage in Expectation
7.3 Dataset Conditional Bounds
7.4 Dataset and Group Conditional Bounds
7.5 Multivalid Bounds
7.6 Sequential Conformal Prediction
7.6.1 Sequential Marginal Coverage Guarantees
7.6.2 Sequential Multivalid Guarantees

8 Distribution Shift
8.1 Likelihood Ratio Reweighting
8.2 Multicalibration under Distribution Shift
8.3 Why Calibration Under Distribution Shift is Useful

9 Sufficient Statistics for Optimization
9.1 Omnipredictors: Sufficient Statistics for Unconstrained Optimization
9.2 Sufficient Statistics for Constrained Optimization
9.2.1 Convex Optimization
9.2.2 f-estimated Optimization
9.2.3 Solving Optimization Problems Without Labelled Data

10 Ensembling, Model Multiplicity, and the Reference Class Problem
10.1 Reference Classes and Model Multiplicity
10.2 Model Ensembling
10.3 Sample Complexity

Bibliography

A Useful Probabilistic Inequalities


Part I

Foundations
1 Basic Setting and Definitions


We will consider prediction tasks over a domain Z = X × Y. Here X represents the feature domain and Y represents the label domain. Depending on the setting, the label domain might be real valued (Y = R)—the regression setting, binary valued (Y = {0, 1})—the binary classification setting, or consist of some larger finite unordered set—the multiclass classification setting. Sometimes we will consider the regression setting in which the label domain is rescaled to the unit interval Y = [0, 1].
We will sometimes assume the existence of a distribution D ∈ ∆Z. Given such a distribution, we will write D_X ∈ ∆X to denote the marginal distribution over features induced by D. We will write D_Y(x) ∈ ∆Y to denote the conditional distribution over labels induced by D when we condition on a particular feature vector x. D_Y(x) captures all of the information about the label that is contained by the feature vector x, and is frequently the object that we are trying to approximate with our models and uncertainty quantification. A model is just some function f : X → [0, 1], and our (typically unattainable) goal is to find a model f∗ that has the property that for all x ∈ X, f∗(x) = E_{y∼D_Y(x)}[y] is the conditional label expectation given x.
Suppose we try and solve this problem and come up with some model f .
How can we evaluate whether f is any good? If we are in a regression setting
and our goal is purely prediction, we might evaluate f via its squared error —
i.e. the expected (squared) deviation of its prediction from the true label. This
is the objective we would minimize if we were solving (e.g.) a least squares
regression problem:

Definition 1 (Squared Error) The squared error (also known as Brier score) of a predictor f on a distribution D is:

B(f, D) = E_{(x,y)∼D}[(f(x) − y)²]

We will sometimes elide the distribution D when it is clear from context.

The Batch Setting
In the batch setting, we are given a batch or sample of n datapoints D sampled i.i.d. from D, which we will write as D ∈ Z^n. We will want algorithms that use D to learn something useful about D.
We will sometimes treat a sample D as if it is a distribution: sampling from
it, taking expectations over it, etc. When we do this, we are identifying D with
the discrete distribution that places weight 1/n on each example (x, y) ∈ D.
For example, we can compute the squared error of a predictor over a sample
D which evaluates to:
B(f, D) = (1/n) Σ_{(x,y)∈D} (f(x) − y)²
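As a concrete illustration, here is a minimal Python sketch (our own, not from the text; the function and variable names are hypothetical) of evaluating the empirical squared error of a model over a sample:

import numpy as np

def brier_score(f, X, y):
    # Empirical squared error B(f, D) of model f over a sample D = (X, y),
    # identifying the sample with the uniform distribution on its n points.
    return np.mean((f(X) - y) ** 2)

# Example: a constant predictor evaluated on synthetic binary labels.
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))
y = rng.binomial(1, 0.3, size=1000)
print(brier_score(lambda X: np.full(len(X), 0.5), X, y))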

The Sequential Setting


In the sequential setting, data is revealed to the algorithm one example at
a time, and the algorithm must make predictions before learning the label
of each point. We will not always assume that the data is drawn from a
distribution — often we will assume nothing about the data generation process
at all, which might even be adversarial. In such cases our goals will pertain
to the empirical performance of the predictions. At a high level the setting
proceeds as follows, in rounds t ∈ {1, . . . , T}:
1. The adversary chooses a (distribution over) feature vector x_t ∈ X and label y_t ∈ Y. The realized feature vector x_t is shown to the learner, but not the label.
2. The learner makes some prediction p_t.
3. Finally the learner observes the realized label y_t.
Here the prediction p_t could be anything — it could try to predict the label itself, or the label mean (in the case in which Y ⊂ R), or it could be a prediction set. We’ll be more specific about our goals as we proceed.
An interaction for T rounds generates a transcript π, which just encodes the sequence of examples and predictions across the T rounds: π = {(x_t, p_t, y_t)}_{t=1}^T.
We might assume that (x_t, y_t) ∼ D are drawn i.i.d. from some unknown distribution D — or more generally, that the examples are drawn from an exchangeable distribution, which just means that their probability is permutation invariant. But more frequently we will assume that the sequence of examples is arbitrary, and will evaluate the performance of algorithms by considering empirical measures of accuracy as evaluated on the transcript π, in the worst case over adversaries.
2 A Simple Goal: Marginal Estimation

CONTENTS
2.1 Means
2.2 Quantiles
2.2.1 Generalizing From Data
2.3 Sequential Prediction
References and Further Reading

We introduce the problem of learning a model that is faithful to the distribution in some formal sense with a goal that is extremely weak — too weak, on its face — in part as a straw man that will focus our attention on how we can meaningfully ask for stronger guarantees. Our initial aim will only be to find a model that matches the mean of a distribution. But we will also see that marginal guarantees like these are widely used for estimating other properties of a distribution — especially quantiles. We will talk about this at length when we get to conformal prediction, and we will think about ways in which we can strengthen those guarantees as well.

2.1 Means
Recall that in a regression setting in which Y ⊆ [0, 1], our goal is to learn a model f∗ such that f∗(x) = E_{y∼D(x)}[y] — i.e. that correctly captures the conditional label mean for each x ∈ X. Of course it's not clear how to do this (or even to test if we have succeeded), but we can begin with a minimal sanity check: marginal mean consistency:

Definition 2 A model f : X → [0, 1] has marginal mean consistency error α if:

| E_{x∼D_X}[f(x)] − E_{(x,y)∼D}[y] | = α

If α = 0 we’ll just say that f satisfies marginal mean consistency.

This minimal sanity check is an example of a marginal guarantee because it depends on f only through an unconditional expectation E[f(x)], rather than constraining the behavior of f conditional on any property of x. In other words, it's just an average over all inputs to f. f∗ satisfies marginal mean consistency, so if our model f does not, this means that our model f must not be f∗. Of course, failure to satisfy marginal mean consistency is easy to fix. Let:

∆ = E_{(x,y)∼D}[y] − E_{x∼D_X}[f(x)]   and   f̂(x) = f(x) + ∆.

It is easy to see that f̂ satisfies marginal mean consistency:


Lemma 2.1.1 f̂(x) = f(x) + ∆ satisfies marginal mean consistency.

Proof 1

E_{x∼D_X}[f̂(x)] = E_{x∼D_X}[f(x)] + ∆
                = E_{x∼D_X}[f(x)] + E_{(x,y)∼D}[y] − E_{x∼D_X}[f(x)]
                = E_{(x,y)∼D}[y]

as desired.
What is less obvious is that f̂ is more accurate than f — as measured by its squared error.

Lemma 2.1.2 Fix any distribution D, let f : X → [0, 1] be any model, let ∆ = E_{(x,y)∼D}[y] − E_{x∼D_X}[f(x)], and let f̂(x) = f(x) + ∆. Then over the distribution D:

B(f̂, D) = B(f, D) − ∆²
Proof 2 We can directly compute:

B(f, D) − B(f̂, D) = E_{(x,y)∼D}[(f(x) − y)² − (f̂(x) − y)²]
  = E_{(x,y)∼D}[f(x)² − 2f(x)y + y² − f̂(x)² + 2f̂(x)y − y²]
  = E_{(x,y)∼D}[f(x)² − 2f(x)y − (f(x) + ∆)² + 2(f(x) + ∆)y]
  = E_{(x,y)∼D}[f(x)² − 2f(x)y − f(x)² − 2∆f(x) − ∆² + 2f(x)y + 2∆y]
  = E_{(x,y)∼D}[−2∆f(x) − ∆² + 2∆y]
  = 2∆ E_{(x,y)∼D}[y − f(x)] − ∆²
  = 2∆² − ∆²
  = ∆²
So not only is it easy to fix a model that does not satisfy marginal mean
consistency, it is always in our interest to do so if we care about accuracy: the
fix is strictly accuracy improving (as measured by squared error).
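To make this concrete, here is a small self-contained Python sketch (our own illustration, not from the text) that applies the additive patch f̂ = f + ∆ to a model with marginal mean consistency error, and checks that the squared error drops by exactly ∆² on the sample, as Lemma 2.1.2 predicts:

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.uniform(size=n)
y = rng.binomial(1, 0.2 + 0.6 * x)      # E[y|x] = 0.2 + 0.6x

f = lambda x: 0.5 * x                   # a deliberately biased model
delta = np.mean(y) - np.mean(f(x))      # Delta = E[y] - E[f(x)]
f_hat = lambda x: f(x) + delta          # the patched model

brier = lambda g: np.mean((g(x) - y) ** 2)
print("consistency error of f_hat:", abs(np.mean(f_hat(x)) - np.mean(y)))  # 0
print("B(f) - B(f_hat):", brier(f) - brier(f_hat))  # equals delta**2 on the sample
print("delta^2:        ", delta ** 2)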

2.2 Quantiles
Rather than asking for a model that matches the mean of a distribution marginally, we can ask for a model that matches a target quantile of a distribution marginally. For simplicity, we will assume that all marginal label distributions D(x) are continuous.

Definition 3 Fix any 0 ≤ q ≤ 1. τ is a q-quantile of a label distribution if:

Pr_y[y ≤ τ] = q

We also write Q(τ) = q.

Once again, our goal might be to produce a model f : X → [0, 1] that on each input x outputs a value f(x) that is a q-quantile of the conditional label distribution D(x). This will generally be impossible, but we can define marginal quantile consistency as a simple sanity check analogue of marginal mean consistency.

Definition 4 A model f : X → [0, 1] has marginal quantile consistency error α with respect to a target quantile q if:

| Pr_{(x,y)∼D}[y ≤ f(x)] − q | = α

If α = 0 we’ll say that f satisfies marginal quantile consistency for target quantile q.
Just as the Brier score will play a key role in our analysis of models that
aim to match distributional means, pinball loss will play a key role in our
analysis of models that aim to match distributional quantiles.

Definition 5 The pinball loss function for target quantile q is:

L_q(τ, y) = (y − τ)q         if y > τ
            (τ − y)(1 − q)   if y ≤ τ

Given a data distribution D and a function f : X → [0, 1], write:

PB_q(f, D) = E_{(x,y)∼D}[L_q(f(x), y)]

We will sometimes elide the distribution D when it is clear from context.
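As an illustration, a minimal Python sketch (our own; names are hypothetical) of the pinball loss and the induced empirical objective PB_q:

import numpy as np

def pinball_loss(tau, y, q):
    # L_q(tau, y): (y - tau) * q if y > tau, else (tau - y) * (1 - q)
    return np.where(y > tau, (y - tau) * q, (tau - y) * (1 - q))

def pinball_objective(f, X, y, q):
    # Empirical PB_q(f, D) over a sample D = (X, y)
    return np.mean(pinball_loss(f(X), y, q))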

Observe that for q = 1/2, this is simply (a scaling of) the absolute value difference function: L_{1/2}(τ, y) = (1/2)|τ − y|. Just as the constant that minimizes the Brier score on a distribution is its mean, the constant that minimizes the pinball loss for a target quantile q is a q-quantile:
Lemma 2.2.1 For any continuous distribution over y and any 0 ≤ q ≤ 1,

τ_q = argmin_{τ∈[0,1]} E_y[L_q(τ, y)]

is a q-quantile.

Proof 3 Since the distribution is continuous, this is a continuous convex function and takes its minimum at a point that has a (sub)derivative equal to 0. Thus we can calculate a (sub)derivative of the function:

d E_y[L_q(τ, y)] / dτ = E_y[(1 − q)·1[y ≤ τ] − q·1[y > τ]]
                      = E_y[1[y ≤ τ] − q]
                      = Pr_y[y ≤ τ] − q

Thus by inspection we see that there is a sub-derivative that takes value 0 for exactly the values τ_q for which Pr_y[y ≤ τ_q] = q — i.e. q-quantiles of the distribution.

It will also be useful for us later on to have an analogue of Lemma 2.1.2 for pinball loss — i.e. a lemma that says that if we start with a model f that is far from satisfying marginal quantile consistency, and then apply a shift so that it does, then we make quantifiable progress in terms of reducing the expected pinball loss for the function.

Suppose that f : X → [0, 1] has marginal quantile consistency error α. Let ∆ ∈ R be such that Pr_{(x,y)∼D}[y ≤ f(x) + ∆] = q. Such a value ∆ is guaranteed to exist since we have assumed that the conditional label distributions D(x) are continuous, and so Pr_{(x,y)∼D}[y ≤ f(x) + ∆] is a continuous monotonically increasing function taking values in the full range [0, 1]. Let f̂(x) = f(x) + ∆. By construction, f̂ satisfies marginal quantile consistency with respect to target quantile q. It also has improved pinball loss. But in order to claim that the pinball loss has improved by an amount that we can bound away from 0, we will need to assume that the conditional label distributions have bounded probability density — or equivalently that their cumulative distribution functions are Lipschitz-continuous.

Definition 6 A conditional label distribution D(x) is ρ-Lipschitz continuous (or just ρ-Lipschitz) if for all 0 ≤ τ ≤ τ′ ≤ 1:

Pr_{y∼D(x)}[y ≤ τ′] − Pr_{y∼D(x)}[y ≤ τ] ≤ ρ(τ′ − τ)

A distribution over labelled examples D is ρ-Lipschitz if for each x ∈ X, D(x) is ρ-Lipschitz.

The above definition is actually somewhat stronger than we need right now
— we don’t need the Lipschitz condition simultaneously for each conditional
label distribution D(x), but only marginally over the whole distribution —
but this stronger condition will be useful for us later on.

Lemma 2.2.2 Fix any distribution over labeled examples D that is ρ-Lipschitz. Fix any model f : X → [0, 1] that has marginal consistency error α with respect to target quantile q, and let f̂(x) = f(x) + ∆ with ∆ chosen such that f̂ satisfies marginal quantile consistency for quantile q. Then:

PB_q(f̂) ≤ PB_q(f) − α²/(2ρ)

and

PB_q(f) ≤ PB_q(f̂) + |∆|α − α²/(2ρ)
Proof 4 As in the proof of Lemma 2.2.1, we can compute:

d PB_q(f(x) + τ) / dτ = E_{x∼D_X}[ d E_{y∼D(x)}[L_q(f(x) + τ, y)] / dτ ]
                      = E_{x∼D_X}[ Pr_{y∼D(x)}[y ≤ f(x) + τ] − q ]
                      = Pr_{(x,y)∼D}[y ≤ f(x) + τ] − q

We can now compute:

PB_q(f̂) − PB_q(f) = PB_q(f(x) + ∆) − PB_q(f(x))
  = ∫_0^∆ ( d PB_q(f(x) + τ) / dτ ) dτ
  = ∫_0^∆ ( Pr_{(x,y)∼D}[y ≤ f(x) + τ] − q ) dτ
  = ∫_0^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ − |∆|q    if ∆ ≥ 0
    ∫_0^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ + |∆|q    if ∆ < 0

Pr_{(x,y)∼D}[y ≤ f(x) + τ] is a non-negative function that is increasing in τ, and so if ∆ < 0 (i.e. if initially f(x) is over-predicting the q’th quantile), then we have that ∫_0^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ evaluates to the negative of the area under the CDF of the distribution between f(x) + ∆ and f(x). Similarly the integral takes positive value if ∆ > 0 and corresponds to the area under the CDF between f(x) and f(x) + ∆.

First we consider the case in which ∆ > 0. We need to bound ∫_0^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ. Here we will use the Lipschitz condition to upper bound the maximum possible area under the CDF. The worst case is that the CDF of the label distribution increases as
FIGURE 2.1: Upper and lower bounding the local area under the curve when ∆ > 0. Here F(τ) = Pr_{(x,y)∼D}[y ≤ f(x) + τ].

quickly as possible at a linear rate from q − α to q between τ = 0 and τ = α/ρ, and then maintains a constant value at q from τ = α/ρ to τ = ∆ (see Figure 2.1). Calculating the area under this worst case curve, we have:

∫_0^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ ≤ (∆ − α/ρ)·q + (α/ρ)(q − α) + (α/ρ)·(α/2)
                                    = q∆ − qα/ρ + qα/ρ − α²/ρ + α²/(2ρ)
                                    = q∆ − α²/(2ρ)

Combining with the above, we have that:

PB_q(f̂) − PB_q(f) ≤ q|∆| − α²/(2ρ) − q|∆| = −α²/(2ρ)
Next we can lower bound the area under the CDF. Again by the Lipschitz condition, the smallest area under the CDF that respects the Lipschitz condition arises if the CDF remains constant taking value q − α from τ = 0 to τ = ∆ − α/ρ before increasing at a linear rate to q from τ = ∆ − α/ρ to τ = ∆.

FIGURE 2.2: Upper and lower bounding the local area under the curve when ∆ < 0. Here F(τ) = Pr_{(x,y)∼D}[y ≤ f(x) + τ].

See Figure 2.1. In this case the area is:

∫_0^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ ≥ ∆(q − α) + α²/(2ρ)
                                    = ∆q − ∆α + α²/(2ρ)

Combining with the above we have that:

PB_q(f̂) − PB_q(f) ≥ |∆|q − |∆|α + α²/(2ρ) − |∆|q = −|∆|α + α²/(2ρ)
In the remaining case in which ∆ < 0, our worst cases are reversed (we need
to maximize the area under the curve to lower bound the integral and minimize
the area under the curve to upper bound the integral). Once again, the CDF
that minimizes the area under the curve subject to the Lipschitz constraint
behaves as follows (See figure 2.2): The CDF remains constant at q between
τ = ∆ and τ = −α/ρ, before increasing as quickly as possible at a linear rate
up to value q + α between τ = −α/ρ and τ = 0. In this case we have that:
∫_0^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ ≤ −( |∆|q + α²/(2ρ) )
                                    = −q|∆| − α²/(2ρ)
Again combining with the above we get that:

PB_q(f̂) − PB_q(f) ≤ −q|∆| − α²/(2ρ) + q|∆| = −α²/(2ρ)
Finally, the CDF that maximizes the area under the curve subject to the Lipschitz constraint increases at a linear rate from τ = ∆ to τ = ∆ + α/ρ from value q to value q + α, and then remains constant at q + α from τ = ∆ + α/ρ to τ = 0. Computing the area under this curve, we get:
∫_0^∆ Pr_{(x,y)∼D}[y ≤ f(x) + τ] dτ ≥ −( α²/(2ρ) + |∆|q + α(|∆| − α/ρ) )
                                    = α²/(2ρ) − |∆|q − |∆|α

Together with the above we have that:

PB_q(f̂) − PB_q(f) ≥ α²/(2ρ) − |∆|q − |∆|α + q|∆| = α²/(2ρ) − |∆|α

which completes the proof of the lemma.

So once again, if a model fails to satisfy marginal quantile consistency, it is easy to fix the model, and once again, doing so is accuracy improving — this time as measured via pinball loss.
We’ll make one more observation: we proved Lemma 2.2.2 under the assumption that the underlying distribution D was ρ-Lipschitz. But eventually, when we want to apply similar arguments to algorithms run on data sampled from some underlying distribution, we will face the problem that the empirical distribution over a finite dataset is discrete, and hence cannot be Lipschitz at fine enough resolutions. But we observe that Lemma 2.2.2 actually only speaks to updates ∆ that are applied to fix marginal consistency error of scale α — and if the underlying distribution is ρ-Lipschitz, we must have that |∆| ≥ α/ρ. So we really only require that the underlying distribution is Lipschitz at scales larger than α/ρ. Here we define the condition we need — Lipschitzness only at large enough scales:
Definition 7 Fix ρ, r > 0. A conditional label distribution D(x) is (ρ, r)-Lipschitz continuous (or just (ρ, r)-Lipschitz) if for all 0 ≤ τ ≤ τ′ ≤ 1 such that τ′ − τ ≥ r:

Pr_{y∼D(x)}[y ≤ τ′] − Pr_{y∼D(x)}[y ≤ τ] ≤ ρ(τ′ − τ)

A distribution over labelled examples D is (ρ, r)-Lipschitz if for each x ∈ X, D(x) is (ρ, r)-Lipschitz.
Using the insight that our proof of Lemma 2.2.2 really only used the condition of (ρ, α/ρ)-Lipschitz continuity, we have the following lemma:

Lemma 2.2.3 Fix any ρ, α > 0. Fix any distribution over labeled examples D that is (ρ, α/ρ)-Lipschitz. Fix any model f : X → [0, 1] that has marginal consistency error α with respect to target quantile q, and let f̂(x) = f(x) + ∆ with ∆ chosen such that f̂ satisfies marginal quantile consistency for quantile q. Then:

PB_q(f̂) ≤ PB_q(f) − α²/(2ρ)

and

PB_q(f) ≤ PB_q(f̂) + |∆|α − α²/(2ρ)
2.2.1 Generalizing From Data


Thus far we have been acting as if we have direct access to the data distribution D, and in particular, given a fixed model f, can compute the quantity ∆ such that f̂(x) = f(x) + ∆ satisfies our marginal consistency desideratum with respect to either means or quantiles. But of course we generally will not have direct access to D, and will instead have only a sample D ∼ D^n of n points drawn i.i.d. from D. What we will generally do (now, and for more complex algorithms in later chapters) is run our algorithms on the empirical distribution over our sample D, and then prove that the guarantees that our algorithms have on D carry over (with small loss) to D.

Theorem 1 Fix any model f and distribution D, and let D ∼ D^n consist of n samples drawn i.i.d. from D. Let ∆ be such that f̂(x) = f(x) + ∆ satisfies marginal mean consistency on D. Then with probability 1 − δ over the draw of D, f̂ has marginal mean consistency error at most α on D, for:

α ≤ √(2 log(2/δ) / n)
Proof 5 This is an application of Hoeffding’s inequality (Theorem 46) which we quote here in its first use:

Let X_1, . . . , X_n be independent random variables bounded such that for each i, a_i ≤ X_i ≤ b_i. Let S_n = Σ_{i=1}^n X_i denote their sum. Then for all t > 0:

Pr[ |S_n − E[S_n]| ≥ t ] ≤ 2 exp( −2t² / Σ_{i=1}^n (b_i − a_i)² )

In our case, we have that ∆ = (1/n) Σ_{(x,y)∈D} (y − f(x)), and each term (1/n)(y − f(x)) is bounded such that:

−1/n ≤ (1/n)(y − f(x)) ≤ 1/n
We also have that E_D[∆] = E_{(x,y)∼D}[y − f(x)]. Thus we can apply Hoeffding’s inequality to conclude that:

Pr[ |∆ − E_{(x,y)∼D}[y − f(x)]| ≥ t ] ≤ 2 exp( −nt²/2 )

Setting the right hand side to be at most δ and solving for t, we find that:

Pr[ |∆ − E_{(x,y)∼D}[y − f(x)]| ≥ √(2 log(2/δ)/n) ] ≤ δ

Finally, recall that by definition, f̂ has marginal mean consistency error:

| E_{(x,y)∼D}[f̂(x) − y] | = | E_{(x,y)∼D}[f(x) + ∆ − y] |
                          = | ∆ − E_{(x,y)∼D}[y − f(x)] |
                          ≤ √(2 log(2/δ)/n)

where the last inequality holds with probability 1 − δ, as established by Hoeffding’s inequality.
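As a sanity check, here is a short Python simulation of Theorem 1 (our own, with hypothetical names; we use a large finite sample as a stand-in for the population D): we compute ∆ on a sample of size n and compare the resulting consistency error to the √(2 log(2/δ)/n) bound:

import numpy as np

rng = np.random.default_rng(2)
n, delta_conf = 10_000, 0.05
f = lambda x: 0.4 * x                    # an arbitrary fixed model

# Stand-in "population": x ~ U[0,1], y ~ Bernoulli(0.3 + 0.5x)
x_pop = rng.uniform(size=1_000_000)
y_pop = rng.binomial(1, 0.3 + 0.5 * x_pop)

# Draw a sample of n points and pick the shift that fixes mean consistency on it.
idx = rng.choice(len(x_pop), size=n, replace=False)
x_s, y_s = x_pop[idx], y_pop[idx]
Delta = np.mean(y_s - f(x_s))

err = abs(np.mean(f(x_pop) + Delta - y_pop))     # error on the population proxy
bound = np.sqrt(2 * np.log(2 / delta_conf) / n)
print(f"error = {err:.4f}, bound = {bound:.4f}")  # error below bound w.h.p.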
Theorem 2 Fix any model f and distribution D, and let D ∼ D^n consist of n samples drawn i.i.d. from D. Let ∆ be such that f̂(x) = f(x) + ∆ has marginal quantile consistency error α′ with respect to some target quantile q on D. Then with probability 1 − δ over the draw of D, f̂ has marginal quantile consistency error at most α with respect to target quantile q on D, for:

α ≤ α′ + √(log(2/δ) / (2n))
Proof 6 This is an application of the DKW (Dvoretzky–Kiefer–Wolfowitz) inequality (Theorem 49) which we quote here in its first use:

Let D be any distribution and let D ∼ D^n consist of n points sampled i.i.d. from D. Let F(c) = Pr_{(x,y)∼D}[y ≤ c] denote the CDF of the label distribution induced by D, and let F̂_D(c) = (1/n) Σ_{(x,y)∈D} 1[y ≤ c] denote the CDF of the empirical label distribution induced by D. Then for every t > 0:

Pr[ sup_{c∈R} |F(c) − F̂_D(c)| ≥ t ] ≤ 2 exp(−2nt²)

Consider the distribution D′ which is derived from D by replacing the label y of each example (x, y) with the label y′ = y − f(x). We apply the DKW inequality to this distribution. By definition, ∆ is chosen such that

| Pr_{(x,y)∼D}[y ≤ f(x) + ∆] − q | = α′

Rearranging, this is:

| F̂_D(∆) − q | = α′

Applying the DKW inequality with t = √(log(2/δ)/(2n)), we have that with probability 1 − δ:

| F̂_D(∆) − F(∆) | ≤ √(log(2/δ)/(2n))

And so |F(∆) − q| ≤ α′ + √(log(2/δ)/(2n)). Expanding out the definition of F(∆) we have that:

α′ + √(log(2/δ)/(2n)) ≥ | Pr_{(x,y)∼D}[y − f(x) ≤ ∆] − q |
                      = | Pr_{(x,y)∼D}[y ≤ f(x) + ∆] − q |
                      = | Pr_{(x,y)∼D}[y ≤ f̂(x)] − q |

The result is that we can simply proceed as if our sample is our underlying distribution when we aim for marginal consistency — and our marginal consistency error on the underlying distribution is guaranteed to be larger than our empirical marginal consistency error by at most ϵ with probability 1 − δ, whenever n ≥ Ω(log(1/δ)/ϵ²).

2.3 Sequential Prediction


What about when we are in a sequential prediction setting, and there is no distribution? Even when examples are selected by an adversary, we can still talk about marginal mean and quantile consistency (and all of the other distributional measures that we will introduce in later chapters). We will always evaluate these guarantees ex-post, over the empirical distribution over the transcript.
Definition 8 Fix a transcript π = {(x_t, p_t, y_t)}_{t=1}^T consisting of examples (x_t, y_t) ∈ X × Y and predictions p_t ∈ [0, 1]. The transcript satisfies marginal mean consistency with error α if:

| (1/T) Σ_{t=1}^T p_t − (1/T) Σ_{t=1}^T y_t | = α

It satisfies marginal quantile consistency with respect to a target quantile q and error α if:

| (1/T) Σ_{t=1}^T 1[y_t ≤ p_t] − q | = α
Our goal in sequential prediction settings will generally be to derive algorithms that guarantee that against any adversary, they generate a transcript π that with high probability (or with certainty) satisfies some notion of statistical consistency. In general, solving these problems in the adversarial sequential setting is only more difficult than solving them in the batch, distributional setting, in a formal sense: if we have algorithms that promise consistency guarantees in the sequential setting, then by running them on data that is in fact drawn i.i.d. from some distribution, we can also obtain the same guarantees in the batch setting. The reverse is not true — the sequential setting is generally strictly harder.
Here we give a warm-up version of this style of theorem, just for marginal mean consistency. In this case, it is easy to more directly get the same kinds of out of sample guarantees — but later we will see more sophisticated versions of this kind of online-to-offline reduction. It will be an application of Hoeffding’s inequality, which we state as Theorem 46.
Theorem 3 Suppose we have an algorithm A that when run against any adversary for T rounds generates a transcript π that satisfies marginal mean consistency with error at most α. Suppose we have some model f : X → [0, 1] and a data distribution D, and consider the following procedure to simulate an adversary. At each round t we:
1. Sample (x̂_t, ŷ_t) ∼ D
2. Feed algorithm A the sample (x_t, y_t) = (x̂_t, ŷ_t − f(x̂_t))
This results in some transcript π = {(x_t, p_t, y_t)}_{t=1}^T. Let ∆ = (1/T) Σ_{t=1}^T p_t and let f̂(x) = f(x) + ∆. Then for any δ > 0, with probability 1 − δ, f̂ satisfies marginal mean consistency with error α′ for:

α′ ≤ α + √(2 log(2/δ)/T)
Proof 7 Since π is promised to satisfy marginal mean consistency with error at most α, we know that:

| (1/T) Σ_{t=1}^T p_t − (1/T) Σ_{t=1}^T y_t | ≤ α

Let:

∆̄ = (1/T) Σ_{t=1}^T (ŷ_t − f(x̂_t))

Plugging in the definitions of ∆ and y_t we have that:

| ∆ − ∆̄ | ≤ α

Note also that since (x̂_t, ŷ_t) are sampled i.i.d. from D, we have that:

E[∆̄] = E_{(x,y)∼D}[y − f(x)]

We can now apply Hoeffding’s inequality (Theorem 46) to the quantity ∆̄ = (1/T) Σ_{t=1}^T (ŷ_t − f(x̂_t)). Each term in the sum is bounded between −1/T ≤ (1/T)(ŷ_t − f(x̂_t)) ≤ 1/T and so we have for any ϵ > 0:

Pr[ |∆̄ − E_{(x,y)∼D}[y − f(x)]| ≥ ϵ ] ≤ 2 exp( −Tϵ²/2 )

The right hand side is at most δ when we have:

ϵ ≥ √(2 log(2/δ)/T)

We therefore have that with probability 1 − δ:

| E_{(x,y)∼D}[f̂(x) − y] | = | E_{(x,y)∼D}[f(x) + ∆ − y] |
  ≤ | E_{(x,y)∼D}[f(x) + ∆̄ − y] | + α
  = | E_{(x,y)∼D}[f(x) + E[∆̄] + (∆̄ − E[∆̄]) − y] | + α
  = | ∆̄ − E[∆̄] | + α
  ≤ α + √(2 log(2/δ)/T)

as desired.

Next, we’ll see a simple algorithm that can guarantee marginal mean consistency with error on the order of O(1/T) on any sequence of length T — i.e. without assuming that the data points come from a distribution. The algorithm will be silly on its face as a prediction algorithm — always predicting that today’s outcome will be equal to yesterday’s outcome. Its excellent performance (as measured by marginal mean consistency) tells us something about the weakness of marginal guarantees.

Algorithm 1 Online-Marginal-Mean-Predictor
Let y0 = 0
for t = 1 to T do
Observe xt (and ignore it!)
Predict pt = yt−1
Observe yt .
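A direct Python rendering of this predictor (a minimal sketch; the class and method names are our own):

class OnlineMarginalMeanPredictor:
    # Predict that today's outcome equals yesterday's outcome.
    def __init__(self):
        self.last_y = 0.0          # y_0 = 0

    def predict(self, x_t):
        return self.last_y         # the context x_t is ignored

    def observe(self, y_t):
        self.last_y = y_t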
If we imagine using this algorithm to predict weather, then what it does is
the following: If it rained yesterday, it predicts a 100% chance of rain today.
Otherwise it predicts a 0% chance of rain. And yet:

Theorem 4 For any sequence of examples of length T, {(x_t, y_t)}_{t=1}^T, Online-Marginal-Mean-Predictor (Algorithm 1) produces a transcript that satisfies marginal mean consistency with error α for α ≤ 1/T.

Proof 8 Using the fact that p_t = y_{t−1} (and y_0 = 0) we compute:

| (1/T) Σ_{t=1}^T p_t − (1/T) Σ_{t=1}^T y_t | = | (1/T) Σ_{t=1}^T y_{t−1} − (1/T) Σ_{t=1}^T y_t |
                                             = (1/T) |y_0 − y_T|
                                             ≤ 1/T
Now let's do the same for quantiles. First we argue that obtaining marginal quantile consistency in the online setting is sufficient to obtain marginal quantile consistency on a distribution, and then show a simple deterministic algorithm for obtaining marginal quantile consistency in the online adversarial setting. There will be one major difference, which is that to convert an online sequence of predictions to an offline quantile predictor, we cannot simply average the predicted quantiles as we did with predicted means (because the relationship between the numeric value of quantiles and their inverse CDF value is not linear). Instead, we will randomize over the sequence of predictions, which will result in an offline randomized quantile predictor.

Theorem 5 Suppose we have an algorithm A that when run against any adversary for T rounds generates a transcript π that satisfies marginal quantile consistency with error at most α for some target quantile q. Suppose we have some model f : X → [0, 1] and a data distribution D, and consider the following procedure to simulate an adversary. At each round t we:

1. Sample (x̂_t, ŷ_t) ∼ D
2. Feed algorithm A the sample (x_t, y_t) = (x̂_t, ŷ_t − f(x̂_t))

This results in some transcript π = {(x_t, p_t, y_t)}_{t=1}^T. Let ∆ be the random variable that takes value in {p_1, . . . , p_T} uniformly at random (i.e. ∆ = p_1 with probability 1/T, ∆ = p_2 with probability 1/T, etc.). Let f̂(x) be the randomized predictor defined as f̂(x) = f(x) + ∆. Then for any δ > 0, with probability 1 − δ, f̂ satisfies marginal quantile consistency with error α′ with respect to target quantile q for:

α′ ≤ α + √(2 ln(2/δ)/T)

In other words:

| Pr_{(x,y)∼D,∆}[y ≤ f(x) + ∆] − q | ≤ α′

Proof 9 Since π is promised to satisfy marginal quantile consistency w.r.t. quantile q with error at most α, we know that:

| (1/T) Σ_{t=1}^T 1[y_t ≤ p_t] − q | ≤ α

Plugging in the definition of y_t we have that:

| (1/T) Σ_{t=1}^T 1[ŷ_t − f(x̂_t) ≤ p_t] − q | ≤ α

Let D′ be the label distribution induced by outputting the label y − f(x) for (x, y) ∼ D and let F denote its CDF: F(c) = Pr_{y∼D′}[y ≤ c]. We want to be able to say that (1/T) Σ_{t=1}^T F(p_t) ≈ q, but we have a problem: the indicators 1[y_t ≤ p_t] are not independent random variables even though the y_t are, since each p_t is potentially chosen as a function of all previous labels y_1, . . . , y_{t−1}. Hence we cannot apply Hoeffding’s inequality. But all is not lost! We will need Azuma’s inequality (Theorem 48) which we quote here before its first use:

Let X_1, . . . , X_n be random variables (not necessarily independent) bounded such that for each i, |X_i| ≤ c_i. Let X_{<i} denote the prefix X_1, X_2, . . . , X_{i−1}. Then for all t > 0:

Pr[ | Σ_{i=1}^n X_i − Σ_{i=1}^n E[X_i | X_{<i}] | ≥ t ] ≤ 2 exp( −t² / (2 Σ_{i=1}^n c_i²) )

Recall that for a sequential prediction algorithm, p_t can be chosen as a function of past examples — but must be independent of the current example y_t. Hence we do have that E_{y_t}[1[y_t ≤ p_t] | y_{<t}] = F(p_t). For us, the random variables are (1/T)·1[y_t ≤ p_t], which are bounded by c_t = 1/T. Thus we can apply Azuma’s inequality with t = √(2 ln(2/δ)/T) to conclude that:

Pr[ | (1/T) Σ_{t=1}^T 1[y_t ≤ p_t] − (1/T) Σ_{t=1}^T F(p_t) | ≥ √(2 ln(2/δ)/T) ] ≤ δ

Combining this with our guarantee of marginal quantile consistency, with probability 1 − δ we have that:

| (1/T) Σ_{t=1}^T F(p_t) − q | ≤ α + √(2 ln(2/δ)/T)
Finally we can compute:

| Pr_{(x,y)∼D,∆}[y ≤ f(x) + ∆] − q | = | (1/T) Σ_{t=1}^T Pr_{(x,y)∼D}[y ≤ f(x) + p_t] − q |
                                     = | (1/T) Σ_{t=1}^T F(p_t) − q |
                                     ≤ α + √(2 ln(2/δ)/T)

where the last inequality holds with probability 1 − δ over the draws of {(x_t, y_t)}_{t=1}^T.

Next, we give our algorithm for making predictions that satisfy online marginal quantile consistency for any target quantile q against any adversarially chosen sequence of examples. The algorithm takes as input a “learning rate” parameter η, and can be viewed and analyzed as online gradient descent on the pinball loss. But the specific form of the resulting update also lends itself to a very simple analysis showing that its quantile error tends to 0 at a rate of 1/T, just as our algorithm for obtaining marginal mean consistency does.

Algorithm 2 Online-Marginal-Quantile-Predictor(q, η)
Let p1 = 0
for t = 1 to T do
Observe xt (and ignore it!)
Predict pt
Observe yt .
Let pt+1 = pt + η(q − 1[yt ≤ pt ])
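A minimal Python sketch of this update rule (our own rendering of Algorithm 2; names are hypothetical):

class OnlineMarginalQuantilePredictor:
    # Online gradient descent on the pinball loss for target quantile q.
    def __init__(self, q, eta):
        self.q, self.eta = q, eta
        self.p = 0.0                         # p_1 = 0

    def predict(self, x_t):
        return self.p                        # the context x_t is ignored

    def observe(self, y_t):
        # p_{t+1} = p_t + eta * (q - 1[y_t <= p_t])
        self.p += self.eta * (self.q - float(y_t <= self.p))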

Theorem 6 For any sequence of examples of length T, any target quantile q ∈ [0, 1] and any update parameter η > 0, Online-Marginal-Quantile-Predictor(q, η) (Algorithm 2) produces a transcript that satisfies marginal quantile consistency with error α for α ≤ (1 + η)/(ηT).

Proof 10 Examining the update rule p_{t+1} = p_t + η(q − 1[y_t ≤ p_t]) and solving for 1[y_t ≤ p_t], we see:

1[y_t ≤ p_t] = q − (p_{t+1} − p_t)/η

So, we can compute:

(1/T) Σ_{t=1}^T 1[y_t ≤ p_t] = q − (1/(ηT)) Σ_{t=1}^T (p_{t+1} − p_t)
                             = q − (p_{T+1} − p_1)/(ηT)
                             = q − p_{T+1}/(ηT)

Next observe that for all t, |p_t − p_{t+1}| ≤ η, and since y_t, q ∈ [0, 1], if p_t ≥ 1 it must be that 1[y_t ≤ p_t] = 1 and hence p_{t+1} < p_t, and similarly if p_t ≤ 0 then p_{t+1} > p_t. Hence we must have for all t that:

−η ≤ p_t ≤ 1 + η

So we have:

| (1/T) Σ_{t=1}^T 1[y_t ≤ p_t] − q | = |p_{T+1}|/(ηT) ≤ (1 + η)/(ηT)

Remark 2.3.1 In fact, there is an even simpler algorithm that can guarantee marginal quantile consistency against an adversary, with error tending to 0 at a rate of 1/T. For a q fraction of rounds, predict p_t = 1, and for a 1 − q fraction of rounds predict p_t = 0. Because we know that y_t ∈ [0, 1], we have that on the q fraction of rounds for which p_t = 1, 1[y_t ≤ p_t] = 1, and for the remaining 1 − q fraction of rounds, 1[y_t ≤ p_t] = 0. Hence we can satisfy marginal quantile consistency in an entirely data independent way, which should make us suspicious of marginal guarantees and make us ask for something stronger.

References and Further Reading


Lemma 2.2.2 (bounding the change in pinball loss as a function of the change in
predicted quantile under a Lipschitz condition on the distribution) is adapted
from Jung et al. [2022]. Algorithm 2 and its analysis are adapted from Gibbs
and Candes [2021], who derive it in the context of conformal prediction (which
we will see later in Chapter 7).
3 Calibration

CONTENTS
3.1 Introduction to Calibration
3.2 Calibrating a Model f
3.3 Quantile Calibration
3.4 Sequential Prediction
3.4.1 Sequential (Mean) Calibration
3.4.2 Sequential Quantile Calibration
References and Further Reading

The marginal guarantees we saw in Chapter 2 were easy to obtain, but extremely weak. In this chapter we’ll see one way to go beyond marginal guarantees, by making calibrated predictions. Calibration on its own is also quite weak, but not as weak as a marginal guarantee, and should be thought of as one step up in terms of a “sanity check” intended to falsify whether we have learned the true conditional label distribution.

3.1 Introduction to Calibration


In this section we’ll focus on regression problems in which the label domain is real valued: Y ⊂ [0, 1]. A natural special case is when we are predicting binary outcomes: Y = {0, 1}, but everything we say holds also for the general real valued case.
In such a setting, we often want to solve the regression problem: that is, to find the model f∗ : X → [0, 1] with the property that for all x ∈ X, f∗(x) = E_{y∼D_Y(x)}[y] is the conditional label expectation given x. Of course, we don’t generally expect to actually find this function (for a variety of reasons), but that’s going to be our goal.
Suppose we try and solve this problem and come up with some model f .
How can we evaluate whether f is any good? If our goal is purely prediction, we
might evaluate f via its squared error — i.e. the expected (squared) deviation
of its prediction from the true label. This is the objective we would minimize
if we were solving (e.g.) a least squares regression problem:

Definition 9 (Squared Error) The squared error (also known as Brier score) of a predictor f is:

B(f) = E_{(x,y)∼D}[(f(x) − y)²]

On the other hand, if we want our predictions f(x) to have the same probabilistic semantics as f∗(x) — namely that they be a prediction about the expected value of y — then we might want f(x) to be calibrated. Calibration asks that the predictions of f be correct conditional on its own predictions: roughly, that E_{(x,y)∼D}[y | f(x) = v] = v for all v. So that the conditioning event makes sense, we will restrict attention to functions f that have a range of finite cardinality, and study average calibration error. Let R(f) = {f(x) : x ∈ X} denote the range of f, and let m = |R(f)| denote the cardinality of f’s range. We will assume m < ∞.
Definition 10 (Average Calibration Error) The average calibration error of a predictor f on a distribution D is:

K₁(f, D) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · | v − E_{(x,y)∼D}[y | f(x) = v] |

The average squared calibration error is:

K₂(f, D) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · ( v − E_{(x,y)∼D}[y | f(x) = v] )²

Finally, we can define a notion of maximum calibration error. Just as with our average notions, we weight by the probability mass of the levelsets to avoid needing to measure quantities over sets with tiny mass:

K∞(f, D) = max_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · | v − E_{(x,y)∼D}[y | f(x) = v] |

When the distribution D is clear from context we will sometimes elide it and simply write K₁(f), K₂(f), K∞(f), etc.
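For intuition, here is a small Python sketch (our own; names are hypothetical) computing these three quantities over a finite sample, identifying the sample with its empirical distribution:

import numpy as np

def calibration_errors(preds, y):
    # Empirical K1, K2, K_inf of predictions with a finite range.
    preds, y = np.asarray(preds), np.asarray(y)
    k1 = k2 = kinf = 0.0
    for v in np.unique(preds):
        mask = preds == v
        mass = mask.mean()                 # Pr[f(x) = v]
        gap = abs(v - y[mask].mean())      # |v - E[y | f(x) = v]|
        k1 += mass * gap
        k2 += mass * gap ** 2
        kinf = max(kinf, mass * gap)
    return k1, k2, kinf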
Sometimes it will be more convenient to work with one of these quantities over another, but they are closely related to one another:

Lemma 3.1.1 For any predictor f : X → [0, 1],

K₂(f) ≤ K₁(f) ≤ √(K₂(f))
K∞(f) ≤ K₁(f) ≤ m·K∞(f)

Proof 11 K₂(f) ≤ K₁(f) follows from the fact that since v and y are bounded in [0, 1], term by term ( v − E_{(x,y)∼D}[y | f(x) = v] )² ≤ | v − E_{(x,y)∼D}[y | f(x) = v] |. K₁(f) ≤ √(K₂(f)) follows from the Cauchy-Schwarz inequality:

( E_v[ 1 · |v − E_{(x,y)∼D}[y | f(x) = v]| ] )² ≤ E_v[1²] · E_v[ ( v − E_{(x,y)∼D}[y | f(x) = v] )² ]

K∞(f) ≤ K₁(f) follows from the fact that a sum of non-negative terms upper bounds a maximum over the terms, and K₁(f) ≤ m·K∞(f) follows from the fact that K₁(f) is a sum of m terms each of which is upper bounded by K∞(f).

Unlike squared error, which we may never be able to drive to zero (because
of inherent unpredictability), we can in principle drive calibration error to zero:
observe that K2 (f ∗ ) = 0.
In fact, f ∗ also minimizes the squared error over the set of all functions
because f ∗ (x) minimizes squared error point-wise per prediction x:

Lemma 3.1.2 Fix any distribution on labels D_Y. Let v∗ = E_{D_Y}[y] denote the true label expectation, and let v̂ = v∗ + ∆ for some ∆ ≠ 0. Then:

E_{y∼D_Y}[(v̂ − y)² − (v∗ − y)²] = ∆²

Proof 12

E_{y∼D_Y}[(v̂ − y)² − (v∗ − y)²] = E_{y∼D_Y}[v̂² − 2v̂y − (v∗)² + 2v∗y]
  = E_{y∼D_Y}[(v∗ + ∆)² − 2v̂y − (v∗)² + 2v∗y]
  = E_{y∼D_Y}[2v∗∆ + ∆² − 2(v∗ + ∆)y + 2v∗y]
  = E_{y∼D_Y}[2v∗∆ + ∆² − 2∆y]
  = 2v∗∆ + ∆² − 2v∗∆
  = ∆²

3.2 Calibrating a Model f


Suppose we are given a model f with large average calibration error K₂(f). Can we fix it? And will fixing it come at the cost of accuracy (say, as measured by squared error B(f))? The answers are “Yes” and “No” respectively! :-)
There is a simple algorithm that takes as input an arbitrary model f and outputs a modified model f̂ such that:
1. f̂ has as low average calibration error as we like: for any α, we can produce f̂ such that K₂(f̂) ≤ α.
2. f̂ has strictly lower squared error than f if f was not already calibrated.
3. The range of f̂ is only smaller than the range of f: |R(f̂)| ≤ |R(f)|.
The basic idea will be to take some intermediate model f_t and then “patch” it if it is not already calibrated, to produce a better model f_{t+1}. We will focus on the simplest possible “patch”, and form our calibrated model by simply stringing them together.

Definition 11 (Value Patch) Given a model f and a pair of values v, v′ ∈ [0, 1], we say that the value patch applied to f with pair (v, v′) is the function:

h(x, f; v → v′) = v′     if f(x) = v
                  f(x)   otherwise

Algorithm 3 Calibrate(f, α, D)
Let f₀ = f and t = 0.
while K₂(f_t, D) ≥ α do
  Let:
    v_t ∈ argmax_{v∈R(f_t)} Pr_{(x,y)∼D}[f_t(x) = v] · ( v − E_{(x,y)∼D}[y | f_t(x) = v] )²
    v′_t = E_{(x,y)∼D}[y | f_t(x) = v_t]
  Let f_{t+1} = h(x; f_t, v_t → v′_t) and t = t + 1.
Output f_t.

We can now analyze the algorithm.

Theorem 7 After T rounds, where T ≤ m/α, Algorithm 3 outputs a model f_T such that K₂(f_T) ≤ α and B(f_T) ≤ B(f).

Proof 13 Observe that at each round before the algorithm halts, since K₂(f_t) ≥ α we must have that:

∆_t ≡ Pr_{(x,y)∼D}[f_t(x) = v_t] · ( v_t − E_{(x,y)∼D}[y | f_t(x) = v_t] )² ≥ α/m

Rearranging, we also have that:

(v_t − v′_t)² = ∆_t / Pr_{(x,y)∼D}[f_t(x) = v_t]

Let D(v_t) = D | (f_t(x) = v_t) be the distribution that results from conditioning on the event that f_t(x) = v_t and let D(v̄_t) = D | (f_t(x) ≠ v_t) be the distribution that results from conditioning on the event that f_t(x) ≠ v_t. We have from Lemma 3.1.2 that:

B(f_t, D) − B(f_{t+1}, D)
  = Pr[f_t(x) = v_t]·(B(f_t, D(v_t)) − B(f_{t+1}, D(v_t))) + Pr[f_t(x) ≠ v_t]·(B(f_t, D(v̄_t)) − B(f_{t+1}, D(v̄_t)))
  = Pr[f_t(x) = v_t]·(B(f_t, D(v_t)) − B(f_{t+1}, D(v_t)))
  = Pr[f_t(x) = v_t]·(v_t − v′_t)²
  = ∆_t
  ≥ α/m

Here the second to last equality follows from Lemma 3.1.2. Since for any model f : X → [0, 1], B(f, D) ≤ 1 and for any model f_T, B(f_T, D) ≥ 0, the algorithm must halt after at most T ≤ m/α many rounds. Since each iteration decreases squared error, it must be that B(f_T, D) ≤ B(f, D).

In fact, this argument is wasteful, although its form will be useful for us later when we investigate stronger forms of calibration. However for simple calibration, there is a simple one-shot algorithm that obtains perfect calibration and decreases squared error by exactly the amount of the calibration error of the original model.

Algorithm 4 One-Shot-Calibrate(f, D)
For each v ∈ R(f) let c(v) = E_{(x,y)∼D}[y | f(x) = v]
Output the model f̂ defined as f̂(x) = c(f(x)).
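A minimal empirical Python sketch of this one-shot recalibration (our own; it uses a sample in place of D, and assumes new predictions take only values seen during calibration, which holds since f has a fixed finite range):

import numpy as np

def one_shot_calibrate(preds_train, y_train):
    # Return a map v -> E[y | f(x) = v], estimated on a sample.
    preds_train, y_train = np.asarray(preds_train), np.asarray(y_train)
    table = {v: y_train[preds_train == v].mean() for v in np.unique(preds_train)}
    return lambda preds: np.array([table[v] for v in preds])

# Usage: c = one_shot_calibrate(f(X), y); calibrated predictions are c(f(X_new)).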

Theorem 8 For any function f, One-Shot-Calibrate(f, D) (Algorithm 4) outputs a model f̂ such that K₂(f̂) = 0 and B(f̂) = B(f) − K₂(f).

Proof 14 Consider any level set of f̂: S(v) = {x : f̂(x) = v}. By definition, for all x ∈ S(v), we must have f(x) = v′ such that c(v′) = v — i.e. such that c(f(x)) = E_{(x,y)∼D}[y | f(x) = v′] = v. Let P(v) = {v′ : c(v′) = v}. We have that

E_{(x,y)}[y | x ∈ S(v)] = ( Σ_{v′∈P(v)} Pr[f(x) = v′]·c(v′) ) / ( Σ_{v′∈P(v)} Pr[f(x) = v′] ) = v

Hence:

K₂(f̂) = Σ_{v∈R(f̂)} Pr[f̂(x) = v] · ( v − E[y | x ∈ S(v)] )² = 0

Next, observe that we can decompose the squared error of both f and f̂ according to the level sets of f, which form a partition of X:

B(f, D) − B(f̂, D) = E[(f(x) − y)²] − E[(f̂(x) − y)²]
  = Σ_{v∈R(f)} Pr[f(x) = v] · E[(f(x) − y)² − (c(f(x)) − y)² | f(x) = v]
  = Σ_{v∈R(f)} Pr[f(x) = v] · E[(v − y)² − (c(v) − y)² | f(x) = v]
  = Σ_{v∈R(f)} Pr[f(x) = v] · (v − c(v))²

where the last equality follows from Lemma 3.1.2. But:

Σ_{v∈R(f)} Pr[f(x) = v]·(v − c(v))² = Σ_{v∈R(f)} Pr[f(x) = v]·(v − E[y | f(x) = v])² = K₂(f)

which completes the proof.

Thus we see that mis-calibrated models can always be improved: they can
be efficiently updated to have no calibration error, and in performing this
simple update, their squared error is improved by an amount equal to their
initial calibration error. This also shows that squared error can be decomposed
into two terms: calibration error, and the remainder (which is sometimes called
refinement error), and that the part corresponding to calibration error can
always be removed.
We will eventually be interested in calibrating predictors using a finite
sample of data from a distribution (rather than giving our algorithm the ability
to directly and exactly compute expectations on the distribution), which will
require proving generalization theorems. But we will defer this to Chapter 4,
when we will prove such theorems for more demanding notions of calibration.

3.3 Quantile Calibration


We can similarly define quantile calibration for a target quantile q, which
asks that a model f produce quantiles f (x) that satisfy marginal quantile
consistency not just overall, but conditional on the value of f (x).

Definition 12 (Average Quantile Calibration Error) The average quantile calibration error with respect to a target quantile q of a predictor f is:

Q₁(f) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · | q − Pr_{(x,y)∼D}[y ≤ v | f(x) = v] |

The average squared quantile calibration error is:

Q₂(f) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · ( q − Pr_{(x,y)∼D}[y ≤ v | f(x) = v] )²

Finally, we can define a notion of maximum quantile calibration error. Just as with our average notions, we weight by the probability mass of the levelsets to avoid needing to measure quantities over sets with tiny mass:

Q∞(f) = max_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · | q − Pr_{(x,y)∼D}[y ≤ v | f(x) = v] |

The relationship between these different measures of quantile calibration error is the same as it is for the corresponding measures of (mean) calibration error: we restate their relationship here without proof, which is identical to the case of mean calibration.

Lemma 3.3.1 For any predictor f : X → [0, 1],

Q₂(f) ≤ Q₁(f) ≤ √(Q₂(f))
Q∞(f) ≤ Q₁(f) ≤ m·Q∞(f)
We now give an analogue to our one-shot mean calibrator. There is also an
analogous iterative version — that we will build on when we study multigroup
guarantees in Chapter 4 — but as with mean calibration, it has no advantages
in this setting.

Algorithm 5 One-Shot-Quantile-Calibrate(f, q, D)
For each v ∈ R(f) let:

c(v) = argmin_{v′} | q − Pr[y ≤ v′ | f(x) = v] |

Output the model f̂ defined as f̂(x) = c(f(x)).

Theorem 9 For any function f, any target quantile value q ∈ [0, 1], and any ρ-Lipschitz distribution D, One-Shot-Quantile-Calibrate(f, q, D) (Algorithm 5) outputs a model f̂ such that Q₂(f̂) = 0 and PB_q(f̂) ≤ PB_q(f) − (1/(2ρ))·Q₂(f).

Proof 15 Consider any level set of f̂: S(v) = {x : f̂(x) = v}. By definition, for all x ∈ S(v), we must have f(x) = v′ such that c(v′) = v — i.e. such that c(f(x)) satisfies Pr_{(x,y)∼D}[y ≤ c(f(x)) | f(x) = v′] = q. Let P(v) = {v′ : c(v′) = v}. We have that

Pr_{(x,y)}[y ≤ v | x ∈ S(v)] = ( Σ_{v′∈P(v)} Pr[f(x) = v′]·Pr[y ≤ v | f(x) = v′] ) / ( Σ_{v′∈P(v)} Pr[f(x) = v′] ) = q

Hence:

Q₂(f̂) = Σ_{v∈R(f̂)} Pr[f̂(x) = v] · ( q − Pr[y ≤ v | x ∈ S(v)] )² = 0

Next, observe that we can decompose the pinball loss of both f and f̂ according to the level sets of f, which form a partition of X:

PB_q(f, D) − PB_q(f̂, D) = E[L_q(f(x), y)] − E[L_q(f̂(x), y)]
  = Σ_{v∈R(f)} Pr[f(x) = v] · E[L_q(f(x), y) − L_q(f̂(x), y) | f(x) = v]
  ≥ Σ_{v∈R(f)} Pr[f(x) = v] · ( Pr[y ≤ f(x) | f(x) = v] − q )² / (2ρ)
  = (1/(2ρ))·Q₂(f)

where the second to last inequality follows from Lemma 2.2.2.

3.4 Sequential Prediction


We now return to the sequential prediction setting, this time to solve a more challenging problem than simply obtaining marginal mean or quantile consistency: our goal will be to obtain empirically calibrated predictions p_t in the worst case over sequences of observations and outcomes (x_t, y_t). We will assume that our prediction algorithm makes predictions in the discrete grid p_t ∈ [1/m] = {0, 1/m, 2/m, . . . , 1}. We begin by defining empirical analogues of our calibration scores K and Q:
Definition 13 (Average Mean and Quantile Calibration Error) Fix any transcript π = {(p₁, x₁, y₁), . . . , (p_T, x_T, y_T)} of length T. For each p ∈ [1/m] let n(π, p) = Σ_{t=1}^T 1[p_t = p] be the number of times that the prediction p_t = p is made over the T rounds of the transcript.

The average squared (mean) calibration error on this transcript is:

K₂(π) = Σ_{p∈[1/m]} (n(π, p)/T) · ( Σ_{t=1}^T 1[p_t = p](y_t − p_t) / n(π, p) )²
      = (1/T) Σ_{p∈[1/m]} ( Σ_{t=1}^T 1[p_t = p](y_t − p_t) / √(n(π, p)) )²

It will be convenient for us to be able to refer to the un-normalized calibration error:

K̂₂(π) = Σ_{p∈[1/m]} ( Σ_{t=1}^T 1[p_t = p](y_t − p_t) / √(n(π, p)) )²

Observe that K₂(π) = (1/T)·K̂₂(π).

For a target quantile q ∈ [0, 1], the average squared quantile calibration error on this transcript is:

Q₂(π) = Σ_{p∈[1/m]} (n(π, p)/T) · ( Σ_{t=1}^T 1[p_t = p](q − 1[y_t ≤ p_t]) / n(π, p) )²
      = (1/T) Σ_{p∈[1/m]} ( Σ_{t=1}^T 1[p_t = p](q − 1[y_t ≤ p_t]) / √(n(π, p)) )²

Similarly, we define the unnormalized quantile calibration error:

Q̂₂(π) = Σ_{p∈[1/m]} ( Σ_{t=1}^T 1[p_t = p](q − 1[y_t ≤ p_t]) / √(n(π, p)) )²

Here any term in the sum in which n(π, p) = 0 evaluates to 0 by convention.
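As an illustration, a small Python sketch (our own; names are hypothetical) of these transcript-level scores:

import numpy as np

def transcript_calibration_errors(p, y, q=0.5):
    # Average squared mean (K2) and quantile (Q2) calibration error of a
    # transcript with predictions p and outcomes y (arrays of length T).
    p, y = np.asarray(p), np.asarray(y)
    T = len(p)
    k2 = q2 = 0.0
    for v in np.unique(p):
        mask = p == v
        n = mask.sum()
        k2 += (np.sum(y[mask] - v)) ** 2 / n        # (residual sum)^2 / n(pi, v)
        q2 += (np.sum(q - (y[mask] <= v))) ** 2 / n
    return k2 / T, q2 / T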


We will derive algorithms that (based on their performance on a transcript of length t − 1 so far, and possibly on the next context x_t) decide on their prediction p_t at round t. After they make their prediction, they learn the true label y_t, and the transcript extends by one round. We write π^{<t} = {(p₁, x₁, y₁), . . . , (p_{t−1}, x_{t−1}, y_{t−1})} to denote a transcript corresponding to rounds 1, . . . , t − 1, and given a record of the prediction, context, and outcome at round t, (p_t, x_t, y_t), write the transcript that is extended by one round as π^{≤t} = π^{<t+1} = π^{<t} ◦ (p_t, x_t, y_t). Similarly given a transcript π of length T we will write π^{≤t} to denote the prefix of this transcript of length t.

3.4.1 Sequential (Mean) Calibration


In deriving an algorithm for guaranteeing sequential mean calibration, it will
be helpful for us to understand how the average squared calibration score
increases from round to round, given the prediction of the algorithm pt and
the outcome yt . It will be useful for us to develop some notation.
Definition 14 Fixing a transcript π of length T, for any s ≤ T and p ∈ [1/m] define the quantity:

V_s^p(π) = Σ_{t=1}^s 1[p_t = p](y_t − p_t) / √(n(π^{≤s}, p))

If n(π, p) = 0 then by convention we define V_s^p(π) = 0.

We observe that for all p, s, π:

|V_s^p(π)| ≤ √(n(π^{≤s}, p))
Our goal is to understand how the calibration error increases from round
to round as a function of the transcript — and once we understand it, give an
algorithm that guarantees that the increase is small. The next lemma bounds
the increase in calibration error between rounds s and s + 1 as a function of
the transcript up through round s + 1.
Lemma 3.4.1 Fix any partial transcript π^{≤s} and any triple (p_{s+1}, x_{s+1}, y_{s+1}) of potential outcomes for the next round. Let π^{≤s+1} = π^{≤s} ◦ (p_{s+1}, x_{s+1}, y_{s+1}) be the corresponding continuation of the transcript. Define:

∆_{s+1}(p_{s+1}, y_{s+1}) = K̂₂(π^{≤s+1}) − K̂₂(π^{≤s})

to be the increase in the (unnormalized) squared calibration error that results from the transcript continuation. Then we have that:

∆_{s+1}(p_{s+1}, y_{s+1}) ≤ ( 2V_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (y_{s+1} − p_{s+1}) + 1/n(π^{≤s}, p_{s+1})

Proof 16 Since the terms in the squared mean calibration error corresponding to predictions p ≠ p_{s+1} do not change, we can compute:

∆_{s+1}(p_{s+1}, y_{s+1}) = K̂₂(π^{≤s+1}) − K̂₂(π^{≤s})
  = ( Σ_{t=1}^{s+1} 1[p_t = p_{s+1}](y_t − p_t) / √(n(π^{≤s+1}, p_{s+1})) )² − ( Σ_{t=1}^{s} 1[p_t = p_{s+1}](y_t − p_t) / √(n(π^{≤s}, p_{s+1})) )²
  ≤ ( Σ_{t=1}^{s+1} 1[p_t = p_{s+1}](y_t − p_t) / √(n(π^{≤s}, p_{s+1})) )² − ( Σ_{t=1}^{s} 1[p_t = p_{s+1}](y_t − p_t) / √(n(π^{≤s}, p_{s+1})) )²
  = ( V_s^{p_{s+1}}(π^{≤s}) + (y_{s+1} − p_{s+1})/√(n(π^{≤s}, p_{s+1})) )² − V_s^{p_{s+1}}(π^{≤s})²
  = 2V_s^{p_{s+1}}(π^{≤s})·(y_{s+1} − p_{s+1})/√(n(π^{≤s}, p_{s+1})) + (y_{s+1} − p_{s+1})²/n(π^{≤s}, p_{s+1})
  ≤ ( 2V_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (y_{s+1} − p_{s+1}) + 1/n(π^{≤s}, p_{s+1})

Next, our plan is to show that for every transcript π^{≤s} there is a distribution over subsequent predictions p_{s+1} such that for every possible realization of y_{s+1}, E_{p_{s+1}}[∆_{s+1}(p_{s+1}, y_{s+1})] is small. If we can show this, then the algorithm that consists of playing this randomized strategy at each round will have small expected calibration loss, which we can conclude simply by summing the terms ∆_s(p_s, y_s) from s = 1 to T.

Towards this end, define:

∆¹_{s+1}(p_{s+1}, y_{s+1}) = ( 2V_s^{p_{s+1}} / √(n(π^{≤s}, p_{s+1})) ) · (y_{s+1} − p_{s+1})

With this notation, Lemma 3.4.1 states that:

∆_{s+1}(p_{s+1}, y_{s+1}) ≤ ∆¹_{s+1}(p_{s+1}, y_{s+1}) + 1/n(π^{≤s}, p_{s+1}).

Here the term 1/n(π^{≤s}, p_{s+1}) evaluates to 0 if n(π^{≤s}, p_{s+1}) = 0.

We next establish a randomized prediction strategy that makes the first term E_{p_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] small in expectation.

Lemma 3.4.2 Fix any partial transcript π ≤s . Consider the distribution over
ps+1 that we can sample from as follows:

1. If Vs1 (π ≤s ) ≥ 0: Predict ps+1 = 1 with probability 1


2. If Vs0 (π ≤s ) ≤ 0: Predict ps+1 = 0 with probability 1.
1 p ≤s
3. Otherwise: Find a p ∈ {0, m , . . . , m−1
m } such that Vs (π )≥0
p+1/m ≤s
and Vs (π ) ≤ 0. Compute q ∈ [0, 1] such that:
p+1/m
V p (π ≤s ) Vs (π ≤s )
q· ps + (1 − q) · q =0
n(π ≤s , p) n(π ≤s , p + m 1
)

1
Predict ps+1 = p with probability q and predict ps+1 = p + m with
probability 1 − q.
This distribution has the property that for every ys+1 ∈ [0, 1]:
2
E [∆1s+1 (ps+1 , ys+1 )] ≤
ps+1 m

Proof 17 We bound E_{p_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] separately in each of the three cases.

Case 1: In this case, V_s^1(π^{≤s}) ≥ 0 and p_{s+1} = 1. Note that since y_{s+1} ∈ [0, 1], we must have that (y_{s+1} − p_{s+1}) ≤ 0 and so for all y_{s+1} ∈ [0, 1]:

    ∆¹_{s+1}(p_{s+1}, y_{s+1}) = (2V_s^{p_{s+1}}(π^{≤s})/√(n(π^{≤s}, p_{s+1}))) · (y_{s+1} − p_{s+1}) ≤ 0

Case 2: In this case, V_s^0(π^{≤s}) ≤ 0 and p_{s+1} = 0. Note that since y_{s+1} ∈ [0, 1], we must have that (y_{s+1} − p_{s+1}) ≥ 0 and so for all y_{s+1} ∈ [0, 1]:

    ∆¹_{s+1}(p_{s+1}, y_{s+1}) = (2V_s^{p_{s+1}}(π^{≤s})/√(n(π^{≤s}, p_{s+1}))) · (y_{s+1} − p_{s+1}) ≤ 0

Case 3: First we observe that in this case, V_s^0(π^{≤s}) ≥ 0 and V_s^1(π^{≤s}) ≤ 0. Hence there must exist some adjacent pair p, p + 1/m ∈ [1/m] such that V_s^p(π^{≤s}) ≥ 0 and V_s^{p+1/m}(π^{≤s}) ≤ 0, so the algorithm is well defined. Recall that q ∈ [0, 1] is such that q · V_s^p(π^{≤s})/√(n(π^{≤s}, p)) + (1 − q) · V_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)) = 0. We can compute:

E_{p_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})]
    ≤ q · (2V_s^p(π^{≤s})/√(n(π^{≤s}, p))) · (y_{s+1} − p) + (1 − q) · (2V_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m))) · (y_{s+1} − p − 1/m)
    = −(1/m) · (1 − q) · (2V_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)))
    ≤ 2/m

Here the last inequality follows from the fact that for all p ∈ [1/m], |V_s^p(π^{≤s})| ≤ √(n(π^{≤s}, p)).

Applying the prediction strategy defined in Lemma 3.4.2 repeatedly gives us an algorithm (Algorithm 6) for making sequential predictions that are calibrated against arbitrary sequences of outcomes.

Algorithm 6 Online-Calibrated-Predictor(m)
for t = 1 to T do
    Observe x_t (and ignore it!)
    if V_{t−1}^1(π^{<t}) ≥ 0 then
        Predict p_t = 1.
    else if V_{t−1}^0(π^{<t}) ≤ 0 then
        Predict p_t = 0.
    else
        Select p ∈ {0, 1/m, ..., (m−1)/m} such that V_{t−1}^p(π^{<t}) ≥ 0 and V_{t−1}^{p+1/m}(π^{<t}) ≤ 0.
        Compute q ∈ [0, 1] such that:
            q · V_{t−1}^p(π^{<t})/√(n(π^{<t}, p)) + (1 − q) · V_{t−1}^{p+1/m}(π^{<t})/√(n(π^{<t}, p + 1/m)) = 0
        Predict p_t = p with probability q and predict p_t = p + 1/m with probability 1 − q.
    Observe y_t
    Let π^{<t+1} = π^{<t} ◦ (x_t, p_t, y_t)
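To make the prediction rule concrete, here is a minimal sketch of Algorithm 6 in Python; the class name, the use of numpy, and the data layout are our own illustrative choices (assuming labels y_t ∈ [0, 1]), not part of the algorithm's specification. It maintains the running sums and counts needed to evaluate V_{t−1}^p(π^{<t}), and samples each prediction from the distribution of Lemma 3.4.2:

import numpy as np

# A minimal sketch of Online-Calibrated-Predictor (Algorithm 6).
class OnlineCalibratedPredictor:
    def __init__(self, m, rng=None):
        self.m = m
        self.grid = np.arange(m + 1) / m      # the grid [1/m]
        self.sums = np.zeros(m + 1)           # running sum of (y_t - p_t) per grid point
        self.counts = np.zeros(m + 1)         # n(pi, p) per grid point
        self.rng = rng or np.random.default_rng()

    def _V(self, i):
        # V_{t-1}^p(pi^{<t}) = sum_t 1[p_t = p](y_t - p_t) / sqrt(n(pi^{<t}, p))
        if self.counts[i] == 0:
            return 0.0                        # convention when n(pi, p) = 0
        return self.sums[i] / np.sqrt(self.counts[i])

    def predict(self):
        # returns the grid index i of the prediction p_t = i / m
        if self._V(self.m) >= 0:              # case 1: V^1 >= 0, predict 1
            return self.m
        if self._V(0) <= 0:                   # case 2: V^0 <= 0, predict 0
            return 0
        for i in range(self.m):               # case 3: adjacent sign change exists
            a, b = self._V(i), self._V(i + 1)
            if a >= 0 >= b:
                # choose q so that q * a + (1 - q) * b = 0
                q = 1.0 if a == b else -b / (a - b)
                return i if self.rng.random() < q else i + 1

    def update(self, i, y):
        # record the outcome y for the grid index i that was just predicted
        self.counts[i] += 1
        self.sums[i] += y - self.grid[i]

Each round then consists of i = predictor.predict(), playing p_t = predictor.grid[i], observing y_t, and calling predictor.update(i, y_t).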

Theorem 10 Against any adaptive adversary, Online-Calibrated-Predictor (Algorithm 6) invoked with the range [1/m] induces a distribution over length T transcripts π such that:

    E_π[K₂(π)] ≤ 2/m + ((m + 1)/T) · (log(T/m) + 1)

In particular, if we choose discretization parameter m = √(2T/log T) then we have:

    E_π[K₂(π)] ≤ O(√(log T / T))

Proof 18 Fix any length T transcript π = {(x₁, p₁, y₁), ..., (x_T, p_T, y_T)}. Since K̂₂(π^{≤0}) = 0 we have the telescoping sum:

    Σ_{t=1}^{T} ∆_t(p_t, y_t) = Σ_{t=1}^{T} (K̂₂(π^{≤t}) − K̂₂(π^{≤t−1})) = K̂₂(π)

From Lemma 3.4.1 we can write this as:

K̂₂(π) = Σ_{t=1}^{T} ∆_t(p_t, y_t)
    ≤ Σ_{t=1}^{T} (∆¹_t(p_t, y_t) + 1/n(π^{≤t−1}, p_t))
    ≤ Σ_{t=1}^{T} ∆¹_t(p_t, y_t) + max_π̃ Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t)

where in the last step, we take the maximum over all length T transcripts π̃ = {(x̃₁, p̃₁, ỹ₁), ..., (x̃_T, p̃_T, ỹ_T)}.

We now take the expectation of both sides (over the randomness of the algorithm's predictions p_t) and apply Lemma 3.4.2:

E[K̂₂(π)] ≤ Σ_{t=1}^{T} E_{p_t, y_t}[∆¹_t(p_t, y_t) | π^{<t}] + max_π̃ Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t)
    ≤ 2T/m + max_π̃ Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t)

It remains to bound max_π̃ Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t). To do this, we observe that whenever p̃_t = p, then we must have that n(π̃^{≤t}, p) = n(π̃^{≤t−1}, p) + 1. Hence for any transcript π̃ we can write:

Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t) = Σ_{p∈[1/m]} Σ_{t: p̃_t = p} 1/n(π̃^{≤t−1}, p)
    = Σ_{p∈[1/m]} Σ_{k=1}^{n(π̃,p)−1} 1/k
    ≤ (m + 1) Σ_{k=1}^{T/m} 1/k
    = (m + 1) · H_{T/m}
    ≤ (m + 1) · (log(T/m) + 1)

Here H_k denotes the k'th Harmonic number. Combining these bounds we find that:

    E_π[K₂(π)] = E[K̂₂(π)/T] ≤ 2/m + ((m + 1)/T) · (log(T/m) + 1)

[TODO: Add high probability bound and an online-to-offline reduction.]

3.4.2 Sequential Quantile Calibration


We can derive an algorithm for making sequential predictions that are quantile calibrated in an analogous way. The derivation is almost identical to our derivation of sequential mean calibration, so we will sketch it, pointing out the parts that differ. As in our batch quantile algorithm, we will now need to assume that the adversary picks continuous distributions over labels at each round rather than allowing her to choose labels deterministically, since obtaining quantile calibration is not in general possible against point mass distributions. Moreover, our quantitative bound will require assuming that the adversary's label distributions are ρ-Lipschitz, and will depend on ρ.
Definition 15 Fix a target quantile q ∈ [0, 1]. Fixing a transcript π of length T, for any s ≤ T and p ∈ [1/m] define the quantity:

    W_s^p(π, q) = (1/√(n(π^{≤s}, p))) · Σ_{t=1}^{s} 1[p_t = p](q − 1[y_t ≤ p_t])

If n(π, p) = 0 then by convention we define W_s^p(π, q) = 0. When q is clear from context we elide it and just write W_s^p(π).

We observe that for all p, s, π:

    |W_s^p(π, q)| ≤ √(n(π^{≤s}, p))

Lemma 3.4.3 Fix any q ∈ [0, 1], any partial transcript π^{≤s}, and any triple (p_{s+1}, x_{s+1}, y_{s+1}) of potential outcomes for the next round. Let π^{≤s+1} = π^{≤s} ◦ (p_{s+1}, x_{s+1}, y_{s+1}) be the corresponding continuation of the transcript. Redefine:

    ∆_{s+1}(p_{s+1}, y_{s+1}) = Q̂₂(π^{≤s+1}) − Q̂₂(π^{≤s})

to be the increase in the (unnormalized) squared quantile calibration error that results from the transcript continuation. Then we have that:

    ∆_{s+1}(p_{s+1}, y_{s+1}) ≤ (2W_s^{p_{s+1}}(π^{≤s})/√(n(π^{≤s}, p_{s+1}))) · (q − 1[y_{s+1} ≤ p_{s+1}]) + 1/n(π^{≤s}, p_{s+1})

The proof is essentially identical to that of Lemma 3.4.1; we include it here for completeness.
Proof 19 Since the terms in the squared quantile calibration error corresponding to predictions p ≠ p_{s+1} do not change, we can compute:

∆_{s+1}(p_{s+1}, y_{s+1}) = Q̂₂(π^{≤s+1}) − Q̂₂(π^{≤s})
    = (Σ_{t=1}^{s+1} 1[p_t = p_{s+1}](q − 1[y_t ≤ p_t]) / √(n(π^{≤s+1}, p_{s+1})))² − (Σ_{t=1}^{s} 1[p_t = p_{s+1}](q − 1[y_t ≤ p_t]) / √(n(π^{≤s}, p_{s+1})))²
    ≤ (Σ_{t=1}^{s+1} 1[p_t = p_{s+1}](q − 1[y_t ≤ p_t]) / √(n(π^{≤s}, p_{s+1})))² − (Σ_{t=1}^{s} 1[p_t = p_{s+1}](q − 1[y_t ≤ p_t]) / √(n(π^{≤s}, p_{s+1})))²
    = (W_s^{p_{s+1}}(π^{≤s}) + (q − 1[y_{s+1} ≤ p_{s+1}])/√(n(π^{≤s}, p_{s+1})))² − W_s^{p_{s+1}}(π^{≤s})²
    = 2W_s^{p_{s+1}}(π^{≤s}) · (q − 1[y_{s+1} ≤ p_{s+1}])/√(n(π^{≤s}, p_{s+1})) + (q − 1[y_{s+1} ≤ p_{s+1}])²/n(π^{≤s}, p_{s+1})
    ≤ (2W_s^{p_{s+1}}(π^{≤s})/√(n(π^{≤s}, p_{s+1}))) · (q − 1[y_{s+1} ≤ p_{s+1}]) + 1/n(π^{≤s}, p_{s+1})

Next, our plan is to show that for every transcript π^{≤s} there is a distribution over subsequent predictions p_{s+1} such that for every possible ρ-Lipschitz distribution over y_{s+1}, E_{p_{s+1}, y_{s+1}}[∆_{s+1}(p_{s+1}, y_{s+1})] is small. Note that here we are deviating from our derivation of mean calibration algorithms, in that we are requiring that the label y_{s+1} be drawn from a ρ-Lipschitz distribution, and we are taking the expectation over y_{s+1} as well as p_{s+1}. If we can show this, then the algorithm that consists of playing this randomized strategy at each round will have small expected quantile calibration loss, which we can conclude simply by summing the terms ∆_s(p_s, y_s) from s = 1 to T.

Towards this end, define:

    ∆¹_{s+1}(p_{s+1}, y_{s+1}) = (2W_s^{p_{s+1}}(π^{≤s})/√(n(π^{≤s}, p_{s+1}))) · (q − 1[y_{s+1} ≤ p_{s+1}])

With this notation, Lemma 3.4.3 states that:

    ∆_{s+1}(p_{s+1}, y_{s+1}) ≤ ∆¹_{s+1}(p_{s+1}, y_{s+1}) + 1/n(π^{≤s}, p_{s+1}).

Here the term 1/n(π^{≤s}, p_{s+1}) is taken to be 0 if n(π^{≤s}, p_{s+1}) = 0.

We next establish a randomized prediction strategy that makes the first term E_{p_{s+1}, y_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] small in expectation for any ρ-Lipschitz distribution over y_{s+1}.

Lemma 3.4.4 Fix any partial transcript π^{≤s}. Consider the distribution over p_{s+1} that we can sample from as follows:

1. If W_s^1(π^{≤s}) ≥ 0: Predict p_{s+1} = 1 with probability 1.

2. If W_s^0(π^{≤s}) ≤ 0: Predict p_{s+1} = 0 with probability 1.

3. Otherwise: Find a p ∈ {0, 1/m, ..., (m−1)/m} such that W_s^p(π^{≤s}) ≥ 0 and W_s^{p+1/m}(π^{≤s}) ≤ 0. Compute b ∈ [0, 1] such that:

       b · W_s^p(π^{≤s})/√(n(π^{≤s}, p)) + (1 − b) · W_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)) = 0

   Predict p_{s+1} = p with probability b and predict p_{s+1} = p + 1/m with probability 1 − b.

This distribution has the property that for every ρ-Lipschitz distribution over y_{s+1} ∈ [0, 1]:

    E_{p_{s+1}, y_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] ≤ 2ρ/m

Proof 20 We bound E_{p_{s+1}, y_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})] separately in each of the three cases.

Case 1: In this case, W_s^1(π^{≤s}) ≥ 0 and p_{s+1} = 1. Note that since q, y_{s+1} ∈ [0, 1], we must have that (q − 1[y_{s+1} ≤ p_{s+1}]) ≤ 0 and so for all y_{s+1} ∈ [0, 1]:

    ∆¹_{s+1}(p_{s+1}, y_{s+1}) = (2W_s^{p_{s+1}}(π^{≤s})/√(n(π^{≤s}, p_{s+1}))) · (q − 1[y_{s+1} ≤ p_{s+1}]) ≤ 0

Case 2: In this case, W_s^0(π^{≤s}) ≤ 0 and p_{s+1} = 0. Note that since q, y_{s+1} ∈ [0, 1], we must have that if y_{s+1} > 0 (which occurs with probability 1 if it is drawn from a continuous distribution), (q − 1[y_{s+1} ≤ p_{s+1}]) ≥ 0 and so for all q, y_{s+1} ∈ (0, 1]:

    ∆¹_{s+1}(p_{s+1}, y_{s+1}) = (2W_s^{p_{s+1}}(π^{≤s})/√(n(π^{≤s}, p_{s+1}))) · (q − 1[y_{s+1} ≤ p_{s+1}]) ≤ 0

Case 3: First we observe that in this case, W_s^0(π^{≤s}) ≥ 0 and W_s^1(π^{≤s}) ≤ 0. Hence there must exist some adjacent pair p, p + 1/m ∈ [1/m] such that W_s^p(π^{≤s}) ≥ 0 and W_s^{p+1/m}(π^{≤s}) ≤ 0, so the algorithm is well defined. Recall that b ∈ [0, 1] is such that b · W_s^p(π^{≤s})/√(n(π^{≤s}, p)) + (1 − b) · W_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)) = 0. We can compute:

E_{p_{s+1}, y_{s+1}}[∆¹_{s+1}(p_{s+1}, y_{s+1})]
    ≤ b · (2W_s^p(π^{≤s})/√(n(π^{≤s}, p))) · (q − Pr[y_{s+1} ≤ p]) + (1 − b) · (2W_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m))) · (q − Pr[y_{s+1} ≤ p + 1/m])
    ≤ b · (2W_s^p(π^{≤s})/√(n(π^{≤s}, p))) · (q − Pr[y_{s+1} ≤ p]) + (1 − b) · (2W_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m))) · (q − Pr[y_{s+1} ≤ p] − ρ/m)
    = −(ρ/m) · (1 − b) · (2W_s^{p+1/m}(π^{≤s})/√(n(π^{≤s}, p + 1/m)))
    ≤ 2ρ/m

Here the second inequality follows from the fact that the distribution over y_{s+1} is assumed to be ρ-Lipschitz, and hence for all p:

    Pr_{y_{s+1}}[y_{s+1} ≤ p + 1/m] ≤ Pr_{y_{s+1}}[y_{s+1} ≤ p] + ρ/m

The last inequality follows from the fact that for all p ∈ [1/m], |W_s^p(π^{≤s})| ≤ √(n(π^{≤s}, p)).

Applying the prediction strategy defined in Lemma 3.4.4 repeatedly gives us an algorithm (Algorithm 7) for making sequential predictions that are quantile calibrated against adversarially chosen (ρ-Lipschitz) sequences of label distributions.
40Uncertain: Modern Topics in Uncertainty EstimationINCOMPLETE WORKING DRAFT
Algorithm 7 Online-Quantile-Calibrated-Predictor(q, m)
for t = 1 to T do
    Observe x_t (and ignore it!)
    if W_{t−1}^1(π^{<t}, q) ≥ 0 then
        Predict p_t = 1.
    else if W_{t−1}^0(π^{<t}, q) ≤ 0 then
        Predict p_t = 0.
    else
        Select p ∈ {0, 1/m, ..., (m−1)/m} such that W_{t−1}^p(π^{<t}, q) ≥ 0 and W_{t−1}^{p+1/m}(π^{<t}, q) ≤ 0.
        Compute b ∈ [0, 1] such that:
            b · W_{t−1}^p(π^{<t}, q)/√(n(π^{<t}, p)) + (1 − b) · W_{t−1}^{p+1/m}(π^{<t}, q)/√(n(π^{<t}, p + 1/m)) = 0
        Predict p_t = p with probability b and predict p_t = p + 1/m with probability 1 − b.
    Observe y_t
    Let π^{<t+1} = π^{<t} ◦ (x_t, p_t, y_t)
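The structural identity with the mean case is easy to see in code. Reusing the illustrative OnlineCalibratedPredictor sketch from Section 3.4.1, only the increment that feeds the running statistic changes, reflecting the definition of W_s^p(π, q); the prediction rule is otherwise unchanged:

# A sketch of Algorithm 7 as a variation on the class sketched above:
# tracking W instead of V replaces the recorded increment (y - p) with
# (q - 1[y <= p]).
class OnlineQuantileCalibratedPredictor(OnlineCalibratedPredictor):
    def __init__(self, m, q, rng=None):
        super().__init__(m, rng)
        self.q = q                            # target quantile

    def update(self, i, y):
        self.counts[i] += 1
        self.sums[i] += self.q - float(y <= self.grid[i])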

Theorem 11 Against any adaptive adversary that chooses a ρ-Lipschitz distribution over y_t at each round t, Online-Quantile-Calibrated-Predictor (Algorithm 7) invoked with quantile q ∈ [0, 1] and the range [1/m] induces a distribution over length T transcripts π such that:

    E_π[Q₂(π)] ≤ 2ρ/m + ((m + 1)/T) · (log(T/m) + 1)

In particular, if we choose discretization parameter m = √(2ρT/log T) then we have:

    E_π[Q₂(π)] ≤ O(√(ρ log T / T))

Proof 21 Fix any length T transcript π = {(x₁, p₁, y₁), ..., (x_T, p_T, y_T)}. Since Q̂₂(π^{≤0}) = 0 we have the telescoping sum:

    Σ_{t=1}^{T} ∆_t(p_t, y_t) = Σ_{t=1}^{T} (Q̂₂(π^{≤t}) − Q̂₂(π^{≤t−1})) = Q̂₂(π)

From Lemma 3.4.3 we can write this as:

Q̂₂(π) = Σ_{t=1}^{T} ∆_t(p_t, y_t)
    ≤ Σ_{t=1}^{T} (∆¹_t(p_t, y_t) + 1/n(π^{≤t−1}, p_t))
    ≤ Σ_{t=1}^{T} ∆¹_t(p_t, y_t) + max_π̃ Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t)

where in the last step, we take the maximum over all length T transcripts π̃ = {(x̃₁, p̃₁, ỹ₁), ..., (x̃_T, p̃_T, ỹ_T)}.

We now take the expectation of both sides (over the randomness of the algorithm's predictions p_t) and apply Lemma 3.4.4:

E[Q̂₂(π)] ≤ Σ_{t=1}^{T} E_{p_t, y_t}[∆¹_t(p_t, y_t) | π^{<t}] + max_π̃ Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t)
    ≤ 2ρT/m + max_π̃ Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t)

It remains to bound max_π̃ Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t). To do this, we observe that whenever p̃_t = p, then we must have that n(π̃^{≤t}, p) = n(π̃^{≤t−1}, p) + 1. Hence for any transcript π̃ we can write:

Σ_{t=1}^{T} 1/n(π̃^{≤t−1}, p̃_t) = Σ_{p∈[1/m]} Σ_{t: p̃_t = p} 1/n(π̃^{≤t−1}, p)
    = Σ_{p∈[1/m]} Σ_{k=1}^{n(π̃,p)−1} 1/k
    ≤ (m + 1) Σ_{k=1}^{T/m} 1/k
    = (m + 1) · H_{T/m}
    ≤ (m + 1) · (log(T/m) + 1)

Here H_k denotes the k'th Harmonic number. Combining these bounds we find that:

    E_π[Q₂(π)] = E[Q̂₂(π)/T] ≤ 2ρ/m + ((m + 1)/T) · (log(T/m) + 1)

[TODO: Add high probability bound and an online-to-offline reduction.]


References and Further Reading
Algorithm 3 (our calibration algorithm) takes inspiration from the multi-
calibration algorithms given in Hébert-Johnson et al. [2018] (which bounds
K∞ (f )) and Gopalan et al. [2022b] (which bounds K1 (f )) (See Chapter 4 for
more on multicalibration).
Calibration in the sequential setting has a long history dating back to
Dawid [1982] and Dawid [1985]. The first algorithm that guaranteed worst-
case calibration in the sequential setting was given by Foster and Vohra [1998]
and alternative derivations were given by Foster [1999], Fudenberg and Levine
[1999], Hart [2020], and others. Algorithm 6 is a variant of the algorithm
given by Foster and Hart [2021] and its generalization to multicalibration
given in Gupta et al. [2022]. Algorithm 7 is a variant of the online quantile
multicalibration algorithm given by Bastani et al. [2022].
4 Multigroup Guarantees

CONTENTS
4.1 Group Conditional Mean Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Group Conditional Quantile Consistency . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 A More Direct Approach to Group Conditional
Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Multicalibration: Group Conditional Calibration . . . . . . . . . . . . . . . . 59
4.4 Quantile Multicalibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 Out of Sample Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.1 Mean Multicalibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.2 Quantile Multicalibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6 Sequential Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6.1 A Bucketed Calibration Definition . . . . . . . . . . . . . . . . . . . . . . 75
4.6.2 Achieving Bucketed Calibration . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6.3 Obtaining Bucketed Quantile Multicalibration . . . . . . . . . 83
References and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Marginal guarantees are easy to obtain, but very weak. We saw one way of
strengthening those guarantees: calibration. But on its own calibration is also
quite weak. Obtaining it in the adversarial sequential prediction setting was
non-trivial, but we could obtain it in the batch setting with a simple constant
predictor fˆ(x) = E(x,y)∼D [y] that just predicts the mean of the marginal label
distribution. Moreover, all of the techniques we’ve seen so far entirely ignore
the features x and depend only on the labels y! We’ll now consider a different
way to strengthen marginal guarantees, first on its own, and then together
with calibration. We will call these multi-group guarantees, and they ask for
guarantees that hold conditional on the features x in various ways.
Let G ⊆ 2^X denote a collection of groups or subsets of the data domain X. We will represent groups using their indicator functions: so g ∈ G is represented as a function g : X → {0, 1}, where g(x) = 1 denotes that x ∈ X is a member of group g, and g(x) = 0 denotes that x is not a member of g.
Given an example x ∈ X , we will write G(x) = {g ∈ G : g(x) = 1} to denote
the set of groups that x is a member of. At a high level, our aim will be to
obtain guarantees like mean consistency (and eventually calibration) not just

marginally, but conditionally on g(x) = 1 for every g ∈ G for some large set
G.

4.1 Group Conditional Mean Consistency


With only a finite number of samples from the distribution, we will not in
general be able to provide group conditional guarantees conditional on groups
that have tiny probability under our distribution, simply because we won’t
have seen very many points from this part of the probability space. So, the
probability mass of a group will be a key parameter for us:
Definition 16 Under a distribution D, a group g : X → {0, 1} has probability mass µ(g) defined as:

    µ(g) = Pr_{x∼D_X}[g(x) = 1]

Definition 17 A model f : X → [0, 1] satisfies α-approximate group conditional mean consistency with respect to a set of groups G ⊆ 2^X if for every g ∈ G:

    (E_{(x,y)∼D}[f(x) | g(x) = 1] − E_{(x,y)∼D}[y | g(x) = 1])² ≤ α/µ(g)

Notice that our requirement smoothly becomes less demanding as the measure of the group g grows smaller, allowing us to ask for stronger guarantees for groups for which we will have more data. We have parameterized things so that the scaling is at the right rate: the error within a sub-group g increases at a rate of 1/√(µ(g)), which is the same rate at which the error of our best estimate of E_{(x,y)∼D}[y | g(x) = 1] from the data will necessarily increase.
We will now show how to update a model f that does not satisfy group
conditional mean consistency to one that does, using a sequence of “patches”
that are similar to how we obtained calibration. Just as in the examples we
have seen thus far, these patches will be accuracy improving, and so we will
quickly converge to a group conditional mean consistent model.

Definition 18 (Group Shift Patch) Given a model f, a shift ∆ ∈ R, and a group g : X → {0, 1}, we say that the group patch applied to f with shift ∆ and group g is the function:

    h(x, f; g, ∆) = f(x) + ∆   if g(x) = 1
    h(x, f; g, ∆) = f(x)       otherwise

Algorithm 8 GroupShift(f, α, G)
Let f₀ = f and t = 0.
while f_t does not satisfy α-approximate group conditional mean consistency w.r.t. G: do
    Let:
        g_t ∈ argmax_{g∈G} µ(g) · (E_{(x,y)∼D}[f_t(x) | g(x) = 1] − E_{(x,y)∼D}[y | g(x) = 1])²
        ∆_t = E_{(x,y)∼D}[y | g_t(x) = 1] − E_{(x,y)∼D}[f_t(x) | g_t(x) = 1]
    Let f_{t+1} = h(x, f_t; g_t, ∆_t) and t = t + 1.
Output f_t.
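As a concrete illustration, here is a sketch of GroupShift run on an empirical sample; the function name and the representation of groups as boolean masks over the data points are our own assumptions:

import numpy as np

# A sketch of GroupShift (Algorithm 8) on an empirical sample: f_vals and y
# are arrays of predictions and labels; group_masks is a list of boolean
# arrays marking membership in each g in G.
def group_shift(f_vals, y, group_masks, alpha, max_iter=10_000):
    f = np.asarray(f_vals, dtype=float).copy()
    for _ in range(max_iter):
        best = None
        for g in group_masks:
            mu = g.mean()                     # empirical mu(g)
            if mu == 0:
                continue
            delta = y[g].mean() - f[g].mean()
            score = mu * delta ** 2           # weighted squared inconsistency
            if best is None or score > best[0]:
                best = (score, g, delta)
        if best is None or best[0] <= alpha:  # alpha-approximate consistency holds
            return f
        _, g, delta = best
        f[g] += delta                         # the group shift patch
    return f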

Lemma 4.1.1 Fix any model f_t : X → [0, 1] and group g_t : X → {0, 1}. Let

    ∆_t = E_{(x,y)∼D}[y | g_t(x) = 1] − E_{(x,y)∼D}[f_t(x) | g_t(x) = 1]

and f_{t+1} = h(x, f_t; g_t, ∆_t) (i.e. the update performed at round t of Algorithm 8). Then:

    B(f_t) − B(f_{t+1}) = µ(g_t) · ∆_t²

Proof 22 By the definition of the patch h(x, f_t; g_t, ∆_t), models f_t and f_{t+1} differ in their predictions only for x such that g_t(x) = 1. Therefore we can calculate:

B(f_t) − B(f_{t+1})
    = Pr[g_t(x) = 0] · E_{(x,y)∼D}[(f_t(x) − y)² − (f_{t+1}(x) − y)² | g_t(x) = 0] + Pr[g_t(x) = 1] · E_{(x,y)∼D}[(f_t(x) − y)² − (f_{t+1}(x) − y)² | g_t(x) = 1]
    = µ(g_t) · E_{(x,y)∼D}[(f_t(x) − y)² − (f_t(x) + ∆_t − y)² | g_t(x) = 1]
    = µ(g_t) · (2∆_t E_{(x,y)∼D}[y − f_t(x) | g_t(x) = 1] − ∆_t²)
    = µ(g_t) · (2∆_t² − ∆_t²)
    = µ(g_t)∆_t²

Theorem 12 Given any model f , any collection of groups G, and any α > 0
Algorithm 8 (GroupShift) halts after T ≤ 1/α many rounds and outputs a
model fT that satisfies α-approximate group conditional mean consistency.
Moreover, if the algorithm runs for T rounds, then B(fT ) ≤ B(f ) − T α.
Proof 23 At any round T at which the algorithm halts, by the stopping condition of the algorithm it must be that f_T satisfies α-approximate group conditional mean consistency. It remains to bound T and B(f_T).

Consider any intermediate round t < T of the algorithm. We know since the algorithm has not halted that:

    max_{g∈G} µ(g) · (E_{(x,y)∼D}[f_t(x) | g(x) = 1] − E_{(x,y)∼D}[y | g(x) = 1])² ≥ α

g_t realizes this maximum, so we must have:

    µ(g_t) · ∆_t² ≥ α

Thus by Lemma 4.1.1, B(f_{t+1}) ≤ B(f_t) − α. Inductively applying this claim gives B(f_T) ≤ B(f) − Tα as desired.

Since (by assumption) y, f(x) ∈ [0, 1], we have that B(f_T) ≥ 0 and B(f) ≤ 1. Thus we must have that T ≤ 1/α.

Unlike marginal mean consistency, group conditional mean consistency is clearly a non-trivial promise: if G = 2^X, the set of all subsets, and α = 0, then the only model satisfying α-approximate group conditional mean consistency with respect to G is the model encoding the true conditional label distributions, f*. For smaller collections of groups G and larger values of α we have a necessarily weaker guarantee, but at least we have a parametric family of guarantees that allows us to interpolate between one satisfied by a (trivial) constant function and one only satisfied by a perfect model.

4.2 Group Conditional Quantile Consistency


Our algorithm and analysis for group conditional mean consistency translate directly to quantiles when we replace the role of the Brier score in our analysis with Pinball loss. First, we can define an analogous notion of group conditional quantile consistency:

Definition 19 A model f : X → [0, 1] satisfies α-approximate group conditional quantile consistency with respect to a target quantile q and set of groups G ⊆ 2^X if for every g ∈ G:

    (Pr_{(x,y)∼D}[y ≤ f(x) | g(x) = 1] − q)² ≤ α/µ(g)

Our algorithm proceeds by applying the same kind of group-shift patches


we used in the case of group conditional mean consistency.

Algorithm 9 QuantileGroupShift(f, α, G, q)
Let f₀ = f and t = 0.
while f_t does not satisfy α-approximate group conditional quantile consistency w.r.t. target quantile q and G: do
    Let:
        g_t ∈ argmax_{g∈G} µ(g) · (Pr_{(x,y)∼D}[y ≤ f_t(x) | g(x) = 1] − q)²
        ∆_t = argmin_∆ (Pr_{(x,y)∼D}[y ≤ f_t(x) + ∆ | g_t(x) = 1] − q)²
    Let f_{t+1} = h(x, f_t; g_t, ∆_t) and t = t + 1.
Output f_t.
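An empirical sketch follows the same pattern as the mean case above. Since Pr[y ≤ f(x) + ∆ | g(x) = 1] = q exactly when ∆ is the q-quantile of the residuals y − f(x) within the group, the patch value can be computed directly (names and the boolean-mask representation are again our own assumptions):

import numpy as np

# A sketch of QuantileGroupShift (Algorithm 9) on an empirical sample.
def quantile_group_shift(f_vals, y, group_masks, q, alpha, max_iter=10_000):
    f = np.asarray(f_vals, dtype=float).copy()
    for _ in range(max_iter):
        best = None
        for g in group_masks:
            mu = g.mean()
            if mu == 0:
                continue
            score = mu * (np.mean(y[g] <= f[g]) - q) ** 2
            if best is None or score > best[0]:
                best = (score, g)
        if best is None or best[0] <= alpha:
            return f
        _, g = best
        # shift by the empirical q-quantile of the residuals within the group
        f[g] += np.quantile(y[g] - f[g], q)
    return f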

Theorem 13 Assume that D is a ρ-Lipschitz continuous probability distribution. Given any model f, any collection of groups G, any target quantile q ∈ [0, 1], and any α > 0, Algorithm 9 (QuantileGroupShift) halts after T rounds where:

    T ≤ 2ρP B_q(f)/α ≤ 2ρ/α.

It outputs a model f_T that satisfies α-approximate group conditional quantile consistency. Moreover, if the algorithm runs for T rounds, then P B_q(f_T) ≤ P B_q(f) − T · α/(2ρ).

Proof 24 If the algorithm halts at round T, then by definition of the halting condition it must be that f_T satisfies α-approximate group conditional quantile consistency with respect to q and G, so it remains to bound T.

If the algorithm has not halted at round t, then by definition it must be that g_t satisfies:

    µ(g_t) · (Pr_{(x,y)∼D}[y ≤ f_t(x) | g_t(x) = 1] − q)² ≥ α

Since D is continuous, it must be that ∆_t is such that:

    Pr_{(x,y)∼D}[y ≤ f_t(x) + ∆_t | g_t(x) = 1] = q

Finally, by the separability of Pinball loss, we have that:

P B_q(f_t) − P B_q(f_{t+1}) = Pr[g_t(x) = 1] · E_{(x,y)∼D}[L_q(f_t(x), y) − L_q(f_{t+1}(x), y) | g_t(x) = 1]
    ≥ µ(g_t) · α/(2ρµ(g_t))
    = α/(2ρ)

where the inequality follows from Lemma 2.2.2 applied to the conditional distribution D | (g_t(x) = 1), which must also be ρ-Lipschitz.

Applying this bound iteratively, we have that for every T, P B_q(f_T) ≤ P B_q(f) − T · α/(2ρ). Since when f(x) and y are bounded in [0, 1] we have P B_q(f) ≤ 1 and P B_q(f_T) ≥ 0, the total number of iterations that the algorithm runs for is bounded by:

    T ≤ 2ρP B_q(f)/α ≤ 2ρ/α

4.2.1 A More Direct Approach to Group Conditional Guarantees
Algorithms 8 and 9 gave us a relatively simple method for obtaining approx-
imate group conditional mean and quantile consistency respectively. These
algorithms will be a useful template for our algorithms for mean and quantile
multicalibration in the next section — but it turns out that for group con-
ditional consistency (without calibration) there is an even simpler algorithm
that gives an even better guarantee. Observe that the “group shift” patches
h(x, ft ; gt , ∆t ) that Algorithms 8 and 9 apply have an extremely simple form:
They add ∆_t to the output of f_t(x) if g_t(x) = 1 and do nothing otherwise. Since
addition is commutative, we can observe that these patches are actually order
invariant! Consider any run of Algorithm 8 or 9 for T rounds, and for each
group g ∈ G define the quantities:
    λ_g = Σ_{t: g_t = g} ∆_t

Then the final model f_T that is output can be seen to have the form:

    f_T(x) = f̂(x; λ) ≡ f(x) + Σ_{g∈G} λ_g · g(x)

So to compute a model satisfying group conditional mean consistency, we can just directly optimize over functions that have this form, which is a |G|-dimensional convex optimization problem, described in Algorithm 10 for group conditional mean consistency and in Algorithm 11 for group conditional quantile consistency. The only difference between the two algorithms is that we minimize the Brier score for mean consistency and the Pinball loss for quantile consistency.

Algorithm 10 Simple-Group-Conditional(f, G)
Let λ* be a solution to the optimization problem:

    Minimize_λ E_{(x,y)∼D}[(f̂(x; λ) − y)²]

such that:

    f̂(x; λ) ≡ f(x) + Σ_{g∈G} λ_g · g(x)

Output f̂(x; λ*)
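On an empirical sample the objective of Algorithm 10 is ordinary least squares in the group-indicator features, so λ* can be computed with a single linear solve. A sketch (the function name and the encoding of G as a 0/1 membership matrix are our own choices):

import numpy as np

# A sketch of Simple-Group-Conditional (Algorithm 10) on a sample: G is an
# (n, |G|) 0/1 matrix with G[i, j] = g_j(x_i). Fitting lambda to the
# residuals y - f(x) by least squares minimizes the Brier score of f_hat.
def simple_group_conditional(f_vals, y, G):
    lam, *_ = np.linalg.lstsq(G, y - f_vals, rcond=None)
    return f_vals + G @ lam, lam              # (f_hat(x; lambda*), lambda*)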

Algorithm 11 Simple-Quantile-Group-Conditional(f, G, q)
Let λ* be a solution to the optimization problem:

    Minimize_λ E_{(x,y)∼D}[L_q(f̂(x; λ), y)]

such that:

    f̂(x; λ) ≡ f(x) + Σ_{g∈G} λ_g · g(x)

Output f̂(x; λ*)
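The quantile version swaps in the Pinball loss, which is convex and piecewise linear in λ. A sketch using a generic derivative-free optimizer, which is reasonable for small |G| (an LP formulation would be the more robust choice; names are ours):

import numpy as np
from scipy.optimize import minimize

def pinball_loss(pred, y, q):
    # L_q(p, y) averaged over the sample
    r = y - pred
    return np.mean(np.maximum(q * r, (q - 1) * r))

# A sketch of Simple-Quantile-Group-Conditional (Algorithm 11).
def simple_quantile_group_conditional(f_vals, y, G, q):
    obj = lambda lam: pinball_loss(f_vals + G @ lam, y, q)
    lam = minimize(obj, np.zeros(G.shape[1]), method="Nelder-Mead").x
    return f_vals + G @ lam, lam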

Theorem 14 Fix any model f : X → [0, 1] and class of groups G. The model
fˆ(x; λ∗ ) output by Algorithm 10 satisfies perfect (i.e. 0-approximate) group
conditional mean consistency. Moreover, if fT is the model output by Algo-
rithm 8, then B(fˆ(·; λ∗ )) ≤ B(fT ).

Proof 25 Suppose f̂(x; λ*) does not satisfy group conditional mean consistency. Then there must be a group g ∈ G such that:

    (E_{(x,y)∼D}[f̂(x; λ*) | g(x) = 1] − E_{(x,y)∼D}[y | g(x) = 1])² > 0

Let ∆ = E_{(x,y)∼D}[y | g(x) = 1] − E_{(x,y)∼D}[f̂(x; λ*) | g(x) = 1] and note that ∆ ≠ 0. In this case, by Lemma 4.1.1, the model obtained by applying the same patch as in the update rule in Algorithm 8, i.e. f′(x) = h(x, f̂(x; λ*); g, ∆), is such that B(f′) < B(f̂(·; λ*)). Now let λ̂ be the vector such that for all g′ ≠ g, λ̂_{g′} = λ*_{g′}, and such that λ̂_g = λ*_g + ∆. We can write f′ as f′(x) = f̂(x; λ̂). Since λ̂ is a feasible solution to the optimization problem in Algorithm 10, the optimality of λ* requires that B(f̂(·; λ̂)) ≥ B(f̂(·; λ*)), a contradiction.

Similarly, since f_T can be represented as f̂(x; λ) for some λ, we have B(f_T) ≥ B(f̂(·; λ*)).

Theorem 15 Fix any model f : X → [0, 1], target quantile q, and class of
groups G. The model fˆ(x; λ∗ ) output by Algorithm 11 satisfies perfect (i.e.
0-approximate) group conditional quantile consistency with respect to q and
G. Moreover, if fT is the model output by Algorithm 9, then P Bq (fˆ(·; λ∗ )) ≤
P Bq (fT ).

Proof 26 Suppose f̂(x; λ*) does not satisfy group conditional quantile consistency. Then there must be a group g ∈ G such that:

    (Pr_{(x,y)∼D}[y ≤ f̂(x; λ*) | g(x) = 1] − q)² > 0

Let ∆ be such that Pr_{(x,y)∼D}[y ≤ f̂(x; λ*) + ∆ | g(x) = 1] = q, and note that ∆ ≠ 0. In this case, by Lemma 2.2.1 applied to the distribution D | (g(x) = 1), the model obtained by applying the same patch as in the update rule in Algorithm 9, i.e. f′(x) = h(x, f̂(x; λ*); g, ∆), is such that P B_q(f′) < P B_q(f̂(·; λ*)). Now let λ̂ be the vector such that for all g′ ≠ g, λ̂_{g′} = λ*_{g′}, and such that λ̂_g = λ*_g + ∆. We can write f′ as f′(x) = f̂(x; λ̂). Since λ̂ is a feasible solution to the optimization problem in Algorithm 11, the optimality of λ* requires that P B_q(f̂(·; λ̂)) ≥ P B_q(f̂(·; λ*)), a contradiction.

Similarly, since f_T can be represented as f̂(x; λ) for some λ, we have P B_q(f_T) ≥ P B_q(f̂(·; λ*)).

4.2.1.1 Generalization
What about out of sample guarantees — i.e. what if we run algorithms 10 and
11 on the empirical distributions on datasets D ∼ Dn ?
Our generalization theorem will depend on the norm of the solution λ∗
output by our algorithms, so it will be helpful for us to study a regularized
version of these simple algorithms that is guaranteed to output a solution of
small norm.

Definition 20 For any vector v ∈ R^d, the ℓ₁ norm is defined as:

    ||v||₁ = Σ_{i=1}^{d} |v_i|

Algorithm 12 Simple-Group-Conditional-Regularized(f, G, D, η)
Let λ* be a solution to the optimization problem:

    Minimize_λ E_{(x,y)∼D}[(f̂(x; λ) − y)²] + η||λ||₁

such that:

    f̂(x; λ) ≡ f(x) + Σ_{g∈G} λ_g · g(x)

Output f̂(x; λ*)
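On an empirical sample this regularized objective is exactly an ℓ₁-penalized least squares (Lasso) problem over the residuals, so an off-the-shelf solver applies. A sketch using scikit-learn, with the caveat that sklearn's Lasso minimizes (1/2n)||r − Gλ||² + a||λ||₁, so its penalty a corresponds to η/2 under the (1/n)-normalized objective above (the function name is our own):

import numpy as np
from sklearn.linear_model import Lasso

# A sketch of Simple-Group-Conditional-Regularized (Algorithm 12): an
# ell_1-regularized least squares fit of lambda to the residuals y - f(x).
def simple_group_conditional_regularized(f_vals, y, G, eta):
    lasso = Lasso(alpha=eta / 2, fit_intercept=False)
    lasso.fit(G, y - f_vals)
    lam = lasso.coef_
    return f_vals + G @ lam, lam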

Algorithm 12 is identical to Algorithm 10, except that its objective func-


tion has been augmented with the regularization term η||λ||1 , where η is a
parameter of the algorithm. The reason to add this regularization term is to
guarantee that the output parameters λ∗ will have small norm:
Lemma 4.2.1 Let f : X → [0, 1] be any model with range [0, 1], let G be any set of groups, let D be any distribution over labelled examples, and let η > 0. Then Simple-Group-Conditional-Regularized(f, G, D, η) (Algorithm 12) outputs a model f̂(x; λ*) with:

    ||λ*||₁ ≤ 1/η
Proof 27 Suppose otherwise, so that ||λ*||₁ > 1/η. Then since squared error is non-negative, we must have that:

    E_{(x,y)∼D}[(f̂(x; λ*) − y)²] + η||λ*||₁ ≥ η||λ*||₁ > 1

On the other hand, consider the candidate solution λ⁰ = 0^{|G|}, the all 0's vector. Since the squared error of f is bounded by 1 (since f has range in [0, 1]), we have that:

    E_{(x,y)∼D}[(f̂(x; λ⁰) − y)²] + η||λ⁰||₁ = E_{(x,y)∼D}[(f(x) − y)²] ≤ 1

Thus f̂(x; λ⁰) has lower objective value than f̂(x; λ*), contradicting the optimality of λ*.
Ok, so Algorithm 12 produces solutions of small norm. How many small norm solutions are there anyhow? Obviously there are continuously many, so we need a more refined way to ask this question. To do this, let's define an ϵ-net.
Definition 21 Let B(C, d) = {x ∈ R^d : ||x||₁ ≤ C} denote the d-dimensional ℓ₁ ball of radius C. Let N_ϵ(C, d) ⊂ B(C, d) be some finite subset of the ball. We say that N_ϵ(C, d) is an ℓ₁ ϵ-net for B(C, d) if for every x ∈ B(C, d) there is an x′ ∈ N_ϵ(C, d) such that ||x − x′||₁ ≤ ϵ.

Theorem 16 There is a finite ℓ₁ ϵ-net for B(C, d) that has cardinality:

    |N_ϵ(C, d)| ≤ (1 + 2C/ϵ)^d

Proof 28 Let N_ϵ ⊂ B(C, d) be a maximal subset of points that are ϵ-separated, i.e. such that for all λ, λ′ ∈ N_ϵ, ||λ − λ′||₁ ≥ ϵ, and such that no other point from B(C, d) can be added to N_ϵ while maintaining this property. Observe that N_ϵ must be an ϵ-net for B(C, d), since if there were any point λ* ∈ B(C, d) such that for all λ ∈ N_ϵ, ||λ − λ*||₁ > ϵ, then λ* could be added to N_ϵ while preserving its ϵ-separation property, which would contradict its maximality.

Consider the union of ℓ₁ balls of radius ϵ/2 centered at each point λ ∈ N_ϵ. Because of the ϵ-separation property of N_ϵ, these balls are disjoint, and so their total volume is the sum of their individual volumes: |N_ϵ| · V^d_{ϵ/2}, where V^d_{ϵ/2} is the volume of a d-dimensional ℓ₁ ball of radius ϵ/2. On the other hand, the union of these balls is contained within a ball of radius C + ϵ/2. Hence:

    |N_ϵ| · V^d_{ϵ/2} ≤ V^d_{C+ϵ/2}

and in particular:

    |N_ϵ| ≤ V^d_{C+ϵ/2} / V^d_{ϵ/2} ≤ ((C + ϵ/2)/(ϵ/2))^d = (2C/ϵ + 1)^d

What good is an ϵ-net? It is useful for two reasons. Since it is finite, we


can use Hoeffding’s inequality together with a union bound to argue that for
every parameter vector λ′ in the net, our in and out-of-sample objective values
are close. And what about for parameter vectors λ that aren’t in the net? We
argue that they take objective value close to the objective value of the closest
parameter vector λ′ in the net.
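To see the scale of these nets in the regime we will use below, instantiate Theorem 16 with d = |G| and ϵ = C/√n:

    |N_ϵ(C, |G|)| ≤ (1 + 2C/ϵ)^{|G|} = (1 + 2√n)^{|G|}

so that ln |N_ϵ| ≤ |G| ln(1 + 2√n). This is exactly the ln(1 + 2√n) term that appears in the bounds that follow.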

Lemma 4.2.2 Let λ, λ′ ∈ B(C, |G|) be such that ||λ − λ′||₁ ≤ ϵ. Then for all x, y, we have that:

    (f(x; λ) − y)² − (f(x; λ′) − y)² ≤ 2ϵC

Proof 29 We can write:

(f(x; λ) − y)² − (f(x; λ′) − y)² = f(x; λ)² − f(x; λ′)² + 2y(f(x; λ′) − f(x; λ))
    = (f(x; λ) − f(x; λ′))(f(x; λ) + f(x; λ′) − 2y)
    ≤ ||λ − λ′||₁ · (||λ||₁ + ||λ′||₁)
    ≤ 2ϵC

We’re almost done. First we argue that if D ∼ Dn , then if n is sufficiently


large, the squared error of any model f (x; λ) for λ ∈ Nϵ (C, |G|) is close under
both D and D. In fact this is true for any finite set S ⊆ B(C, |G|) so long as
n scales with the log of |S|:

Lemma 4.2.3 Fix any finite subset S ⊂ B(C, |G|) and any δ > 0. Let D ∼ D^n consist of n samples (x, y). Then with probability 1 − δ, for every λ ∈ S:

    |E_{(x,y)∼D}[(f(x; λ) − y)²] − E_{(x,y)∼D}[(f(x; λ) − y)²]| ≤ (C + 1)² √(ln(2|S|/δ)/(2n))

Proof 30 First observe that since λ ∈ B(C, |G|), we have that for all x, −C ≤ f(x; λ) ≤ C. Thus for all x, y:

    (f(x; λ) − y)² ≤ (C + 1)²

We can therefore apply Hoeffding's inequality to conclude that for any fixed λ ∈ B(C, d):

    Pr_{D∼D^n}[|E_{(x,y)∼D}[(f(x; λ) − y)²] − E_{(x,y)∼D}[(f(x; λ) − y)²]| ≥ t] ≤ 2 exp(−2nt²/(C + 1)⁴)

The right hand side evaluates to δ if we take:

    t = (C + 1)² √(ln(2/δ)/(2n))

Replacing δ with δ/|S| and union bounding over all λ ∈ S, we have that with probability 1 − δ, simultaneously for every λ ∈ S:

    |E_{(x,y)∼D}[(f(x; λ) − y)²] − E_{(x,y)∼D}[(f(x; λ) − y)²]| ≤ (C + 1)² √(ln(2|S|/δ)/(2n))

We can combine Lemma 4.2.3 (uniform convergence of squared error over points in a finite set) with Lemma 4.2.2 (squared error is Lipschitz), together with the existence of a finite ϵ-net for the ℓ₁ ball (Theorem 16), to obtain a similar claim for all of the (continuously many) vectors λ ∈ B(C, |G|):
Theorem 17 Fix any C, δ, ϵ > 0. Let D ∼ D^n consist of n samples (x, y). Then with probability 1 − δ, for every λ ∈ B(C, |G|):

    |E_{(x,y)∼D}[(f(x; λ) − y)²] − E_{(x,y)∼D}[(f(x; λ) − y)²]| ≤ (C + 1)² √((ln(2/δ) + |G| ln(1 + 2C/ϵ))/(2n)) + 4ϵC

In particular, choosing ϵ = C/√n gives:

    |E_{(x,y)∼D}[(f(x; λ) − y)²] − E_{(x,y)∼D}[(f(x; λ) − y)²]| ≤ 2(C + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
Proof 31 Let N_ϵ = N_ϵ(C, |G|) be an ℓ₁ ϵ-net for B(C, |G|) of size |N_ϵ| ≤ (1 + 2C/ϵ)^{|G|}, which we know exists from Theorem 16. Applying Lemma 4.2.3 with S = N_ϵ, we get that with probability 1 − δ, for every λ ∈ N_ϵ:

    |E_{(x,y)∼D}[(f(x; λ) − y)²] − E_{(x,y)∼D}[(f(x; λ) − y)²]| ≤ (C + 1)² √((ln(2/δ) + |G| ln(1 + 2C/ϵ))/(2n))

Now fix any λ ∈ B(C, |G|) and let λ′ = argmin_{λ̂∈N_ϵ} ||λ̂ − λ||₁. We know from the ϵ-net property that ||λ − λ′||₁ ≤ ϵ. Applying Lemma 4.2.2 twice we can conclude:

|E_{(x,y)∼D}[(f(x; λ) − y)²] − E_{(x,y)∼D}[(f(x; λ) − y)²]|
    ≤ |E_{(x,y)∼D}[(f(x; λ′) − y)²] − E_{(x,y)∼D}[(f(x; λ′) − y)²]| + 4ϵC
    ≤ (C + 1)² √((ln(2/δ) + |G| ln(1 + 2C/ϵ))/(2n)) + 4ϵC
We’re now ready to prove our generalization bound for Algorithm 12.
Theorem 18 Fix any δ > 0 and model f : X → [0, 1]. Let G be any collection of groups. Let D ∼ D^n consist of n samples (x, y). Then with probability 1 − δ, the model f̂(x; λ*) output by Simple-Group-Conditional-Regularized(f, G, D, η) (Algorithm 12) satisfies α-approximate group conditional mean consistency on D whenever min_{g∈G} µ(g) ≥ α for:

    α ≤ η + 4(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

Choosing η to minimize this expression gives:

    α ≤ O(((ln(1/δ) + |G| ln n)/n)^{1/6})

Proof 32 Let

    λ̂ = argmin_λ E_{(x,y)∼D}[(f̂(x; λ) − y)²] + η||λ||₁

i.e. the true minimizer of the regularized objective function over D. We know from Lemma 4.2.1 that λ*, λ̂ ∈ B(1/η, |G|). Hence from Theorem 17 and the fact that λ* minimizes the objective function on D, we have that with probability 1 − δ:

E_{(x,y)∼D}[(f(x; λ*) − y)²] + η||λ*||₁
    ≤ E_{(x,y)∼D}[(f(x; λ*) − y)²] + η||λ*||₁ + 2(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ E_{(x,y)∼D}[(f(x; λ̂) − y)²] + η||λ̂||₁ + 2(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ E_{(x,y)∼D}[(f(x; λ̂) − y)²] + η||λ̂||₁ + 4(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

Let α be the minimum value such that f(x; λ*) satisfies α-approximate group conditional mean consistency on D. In other words, there exists a group g such that:

    µ(g) · (E_D[f(x; λ*) − y | g(x) = 1])² = α

Let ∆ = E_D[y − f(x; λ*) | g(x) = 1], and let h(x, f(x; λ*); g, ∆) = f(x; λ′) be the result of applying a patch operation, where λ′_{g′} = λ*_{g′} for all g′ ≠ g and λ′_g = λ*_g + ∆. By Lemma 4.1.1, we have that B(f(x; λ*), D) − B(f(x; λ′), D) = α. This will contradict the optimality of λ̂ above if we have that:

    α > η(||λ′||₁ − ||λ*||₁) + 4(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

To avoid the contradiction we must have that:

α ≤ η(||λ′||₁ − ||λ*||₁) + 4(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ η|∆| + 4(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ η√(α/µ(g)) + 4(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ η + 4(1/η + 1)² √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

where the second to last inequality follows from the fact that |∆| = √(α/µ(g)) and the last inequality follows from the assumption that µ(g) ≥ α.

We can carry out a similar analysis of a regularized variant of our algorithm


for group conditional quantile consistency:

Algorithm 13 Simple-Quantile-Group-Conditional-Regularized(f, G, q, η)
Let λ* be a solution to the optimization problem:

    Minimize_λ E_{(x,y)∼D}[L_q(f̂(x; λ), y)] + η||λ||₁

such that:

    f̂(x; λ) ≡ f(x) + Σ_{g∈G} λ_g · g(x)

Output f̂(x; λ*)

The basic strategy is the same, and so we highlight only the differences. Since Pinball loss is also bounded within [0, 1] when f(x), y ∈ [0, 1], we continue to have that solutions output by Algorithm 13 are norm bounded:

Lemma 4.2.4 Let f : X → [0, 1] be any model with range [0, 1], let G be any set of groups, let D be any distribution over labelled examples, and let η > 0. Then Simple-Quantile-Group-Conditional-Regularized(f, G, q, η) (Algorithm 13) outputs a model f̂(x; λ*) with:

    ||λ*||₁ ≤ 1/η
We get an even better Lipschitz bound on the loss function:
Lemma 4.2.5 Let λ, λ′ ∈ B(C, |G|) be such that ||λ − λ′ ||1 ≤ ϵ. Then for all
x, y, and for all q ∈ [0, 1] we have that:

|Lq (f (x; λ), y) − Lq (f (x; λ′ ), y)| ≤ ϵ

Similarly, since |L_q(f(x; λ), y)| ≤ C + 1 (rather than (C + 1)²) for λ ∈ B(C, |G|), we get a uniform convergence bound that is improved over our version for squared loss by a factor of (C + 1):

Lemma 4.2.6 Fix any q ∈ [0, 1], any finite subset S ⊂ B(C, |G|), and any δ > 0. Let D ∼ D^n consist of n samples (x, y). Then with probability 1 − δ, for every λ ∈ S:

    |E_{(x,y)∼D}[L_q(f(x; λ), y)] − E_{(x,y)∼D}[L_q(f(x; λ), y)]| ≤ (C + 1) √(ln(2|S|/δ)/(2n))

Combining these two improved lemmas gives a correspondingly improved uniform convergence theorem over all of B(C, |G|):

Theorem 19 Fix any q ∈ [0, 1] and C, δ, ϵ > 0. Let D ∼ D^n consist of n samples (x, y). Then with probability 1 − δ, for every λ ∈ B(C, |G|):

    |E_{(x,y)∼D}[L_q(f(x; λ), y)] − E_{(x,y)∼D}[L_q(f(x; λ), y)]| ≤ (C + 1) √((ln(2/δ) + |G| ln(1 + 2C/ϵ))/(2n)) + 2ϵ

In particular, choosing ϵ = C/√n gives:

    |E_{(x,y)∼D}[L_q(f(x; λ), y)] − E_{(x,y)∼D}[L_q(f(x; λ), y)]| ≤ 2(C + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

We can now obtain our generalization theorem for quantiles — but we’ll
need one more assumption. Recall that we have already been assuming that
our label distributions have CDFs that are ρ-Lipschitz, which means that
they have CDFs F such that F (τ ) − F (τ ′ ) ≤ ρ(τ − τ ′ ). To prove our next
generalization theorem, we’ll also have to assume that the label distributions
are not too flat — that is, that they are σ-anti-Lipschitz:

Definition 22 A CDF F is σ-anti-Lipschitz if for all τ ≥ τ′, we have that F(τ) − F(τ′) ≥ σ(τ − τ′). We say that a distribution D ∈ ∆(X × Y) is σ-anti-Lipschitz if all of its conditional label distributions D_Y(x) have σ-anti-Lipschitz CDFs.

Theorem 20 Fix any δ > 0 and model f : X → [0, 1]. Let G be any collection of groups. Let D ∼ D^n consist of n samples (x, y) from a distribution D that is ρ-Lipschitz and σ-anti-Lipschitz. Then with probability 1 − δ, the model f̂(x; λ*) output by Simple-Quantile-Group-Conditional-Regularized(f, G, q, η) (Algorithm 13) satisfies α-approximate group conditional quantile consistency on D whenever min_{g∈G} µ(g) ≥ α for:

    α ≤ 2ηρ/σ + 8ρ(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

Choosing η to minimize this expression gives:

    α ≤ O((ρ/√σ) · ((ln(1/δ) + |G| ln n)/n)^{1/4})
Proof 33 Let

    λ̂ = argmin_λ E_{(x,y)∼D}[L_q(f̂(x; λ), y)] + η||λ||₁

i.e. the true minimizer of the regularized objective function over D. We know from Lemma 4.2.4 that λ*, λ̂ ∈ B(1/η, |G|). Hence from Theorem 19 and the fact that λ* minimizes the objective function on D, we have that with probability 1 − δ:

E_{(x,y)∼D}[L_q(f̂(x; λ*), y)] + η||λ*||₁
    ≤ E_{(x,y)∼D}[L_q(f̂(x; λ*), y)] + η||λ*||₁ + 2(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ E_{(x,y)∼D}[L_q(f̂(x; λ̂), y)] + η||λ̂||₁ + 2(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ E_{(x,y)∼D}[L_q(f̂(x; λ̂), y)] + η||λ̂||₁ + 4(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

Let α be the minimum value such that f(x; λ*) satisfies α-approximate group conditional quantile consistency on D. In other words, there exists a group g such that:

    µ(g) · (Pr_D[y ≤ f(x; λ*) | g(x) = 1] − q)² = α

Let ∆ be such that Pr_D[y ≤ f(x; λ*) + ∆ | g(x) = 1] = q, and let h(x, f(x; λ*); g, ∆) = f(x; λ′) be the result of applying a patch operation, where λ′_{g′} = λ*_{g′} for all g′ ≠ g and λ′_g = λ*_g + ∆. We have that P B_q(f(x; λ*), D) − P B_q(f(x; λ′), D) ≥ α/(2ρ). This will contradict the optimality of λ̂ above if we have that:

    α/(2ρ) > η(||λ′||₁ − ||λ*||₁) + 4(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

To avoid the contradiction we must have that:

α/(2ρ) ≤ η(||λ′||₁ − ||λ*||₁) + 4(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ η|∆| + 4(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ (η/σ)√(α/µ(g)) + 4(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))
    ≤ η/σ + 4(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

where the second to last inequality follows from the fact that |∆| ≤ (1/σ)√(α/µ(g)) by the anti-Lipschitzness property, and the last inequality follows from the assumption that µ(g) ≥ α. Solving, we get:

    α ≤ 2ηρ/σ + 8ρ(1/η + 1) √((ln(2/δ) + |G| ln(1 + 2√n))/(2n))

4.3 Multicalibration: Group Conditional Calibration


We can go further and simultaneously ask for group conditional mean consis-
tency and calibration. Combined, these two constraints are called multicali-
bration:

Definition 23 Fix any model f : X → [0, 1] and group g : X → {0, 1}. The average squared calibration error of f on g is:

    K₂(f, g, D) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v | g(x) = 1] · (v − E_{(x,y)∼D}[y | f(x) = v, g(x) = 1])²

We say that a model f is α-approximately multicalibrated with respect to a collection of groups G and a distribution D if for every group g ∈ G:

    K₂(f, g, D) ≤ α/µ(g).

When D is clear from context we just write K₂(f, g).

Just as in our previous cases, we will proceed by starting with an initial model which we will patch:

Definition 24 (Group Value Patch) Given a model f : X → [0, 1], a group g : X → {0, 1}, and a pair of values v, v′ ∈ [0, 1], we say that the group value patch applied to f with pair (v, v′) and group g is the function:

    h(x, f; v → v′, g) = v′     if f(x) = v and g(x) = 1
    h(x, f; v → v′, g) = f(x)   otherwise
Algorithm 14 Multicalibrate(f, α, G, D) (First Attempt)
Let f₀ = f and t = 0.
while f_t is not α-approximately multicalibrated with respect to G: do
    Let:
        (v_t, g_t) ∈ argmax_{(v,g)∈R(f_t)×G} Pr_{(x,y)∼D}[f_t(x) = v, g(x) = 1] · (v − E_{(x,y)∼D}[y | f_t(x) = v, g(x) = 1])²
        v′_t = E_{(x,y)∼D}[y | f_t(x) = v_t, g_t(x) = 1]
    Let f_{t+1} = h(x, f_t; v_t → v′_t, g_t) and t = t + 1.
Output f_t.

Definition 25 Fix a model f_t and a group g : X → {0, 1}. For a value v ∈ R(f_t), we write:

    µ_t(v, g, D) = Pr_{(x,y)∼D}[f_t(x) = v, g(x) = 1]

When D is clear from context we just write µ_t(v, g).

Lemma 4.3.1 Fix any intermediate round t of Algorithm 14 (Multicalibrate). We have that:

    B(f_t) − B(f_{t+1}) = µ_t(v_t, g_t) · (v_t − v′_t)²

Proof 34 Since by construction f_{t+1}(x) = f_t(x) for every x such that either g_t(x) = 0 or f_t(x) ≠ v_t, we have that:

B(f_t) − B(f_{t+1}) = µ_t(v_t, g_t) · E_{(x,y)∼D}[(f_t(x) − y)² − (f_{t+1}(x) − y)² | g_t(x) = 1, f_t(x) = v_t]
    = µ_t(v_t, g_t) · E_{(x,y)∼D}[(v_t − y)² − (v′_t − y)² | g_t(x) = 1, f_t(x) = v_t]
    = µ_t(v_t, g_t) · (v_t − v′_t)²

where the final equality follows from Lemma 3.1.2 and the fact that by definition v′_t = E_{(x,y)∼D}[y | f_t(x) = v_t, g_t(x) = 1].

So far everything is mirroring our past derivations, but there is an important difference here that will complicate things (just a little!) in comparison to our analysis of Algorithm 3 (our calibration algorithm, without groups). The issue is that the updates for Multicalibrate can increase the cardinality of the range of our model: i.e. it might be that |R(f_{t+1})| = |R(f_t)| + 1. This is problematic for us since our updates change the function at the granularity of a group intersected with an element of R(f_t), and so as R(f_t) grows, the rate of progress that we make slows down. We will still eventually get to multicalibration (since Σ_{t=0}^{∞} 1/(m + t) is a divergent series for any m), but we might need a lot of updates.

The fix is to realize that we don’t need arbitrary precision to achieve


α-approximate multicalibration — we only need some finite precision that de-
pends on α. If we restrict our updates to an appropriately discretized set of
m finite values, then the range of ft+1 can never grow above m and our prob-
lem is solved. This will also be useful to us when it comes time to argue that
solving the empirical multi-calibration problem on a modestly sized dataset
suffices to solve it out of sample, on the distribution from which the data was
drawn.

Definition 26 Let [1/m] denote the set of m + 1 grid points:

    [1/m] = {0, 1/m, 2/m, ..., (m−1)/m, 1}

For any value v ∈ [0, 1] let Round(v; m) = argmin_{v′∈[1/m]} |v − v′| denote the closest grid point to v in [1/m]. For a model f : X → [0, 1], let Round(f; m) denote the function f′(x) = Round(f(x); m) that simply rounds the output of f to the nearest grid point of [1/m].

Observe that for v′ = Round(v; m) we always have that |v − v′| ≤ 1/(2m).

Algorithm 15 Multicalibrate(f, α, G, D)
Let m = 1/α.
Let f₀ = Round(f; m) and t = 0.
while f_t is not α-approximately multicalibrated with respect to G: do
    Let:
        (v_t, g_t) ∈ argmax_{(v,g)∈R(f_t)×G} Pr_{(x,y)∼D}[f_t(x) = v, g(x) = 1] · (v − E_{(x,y)∼D}[y | f_t(x) = v, g(x) = 1])²
        ṽ_t = E_{(x,y)∼D}[y | f_t(x) = v_t, g_t(x) = 1] and v′_t = Round(ṽ_t; m)
    Let f_{t+1} = h(x, f_t; v_t → v′_t, g_t) and t = t + 1.
Output f_t.
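Here is a sketch of Algorithm 15 on an empirical sample (names and the boolean-mask group representation are our own assumptions). The halting test uses the averaging argument from the analysis below: since |R(f_t)| ≤ m + 1, once every (group, value) cell has weighted error at most α/(m + 1), every group satisfies µ(g) · K₂(f, g) ≤ α:

import numpy as np

# A sketch of Multicalibrate (Algorithm 15) on an empirical sample.
def multicalibrate(f_vals, y, group_masks, alpha, max_iter=100_000):
    m = int(np.ceil(1 / alpha))
    f = np.round(np.asarray(f_vals, dtype=float) * m) / m   # f_0 = Round(f; m)
    for _ in range(max_iter):
        best = None
        for g in group_masks:
            for v in np.unique(f[g]):
                cell = g & (f == v)               # the region {g(x) = 1, f(x) = v}
                score = cell.mean() * (v - y[cell].mean()) ** 2
                if best is None or score > best[0]:
                    best = (score, cell, y[cell].mean())
        if best is None or best[0] <= alpha / (m + 1):
            return f
        _, cell, v_tilde = best
        f[cell] = np.round(v_tilde * m) / m       # rounded group value patch
    return f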

Let's start by proving an approximate variant of Lemma 4.3.1:


Lemma 4.3.2 Fix any intermediate round t of Algorithm 15 (Multicalibrate). We have that:

    B(f_t) − B(f_{t+1}) ≥ µ_t(v_t, g_t) · ((v_t − ṽ_t)² − 1/(4m²))

Proof 35 Let f̃_{t+1} = h(x, f_t; v_t → ṽ_t, g_t) be the hypothetical update that would have resulted had we not rounded ṽ_t in step t of the algorithm. This is the update that would have resulted from a step of Algorithm 14, and so we can apply Lemma 4.3.1 to conclude that:

    B(f_t) − B(f̃_{t+1}) ≥ µ_t(v_t, g_t) · (v_t − ṽ_t)²

We also have that:

B(f_t) − B(f_{t+1}) = (B(f_t) − B(f̃_{t+1})) − (B(f_{t+1}) − B(f̃_{t+1}))
    = µ_t(v_t, g_t) · (v_t − ṽ_t)² − (B(f_{t+1}) − B(f̃_{t+1}))

And so it remains to upper bound (B(f_{t+1}) − B(f̃_{t+1})). Let ∆ = ṽ_t − v′_t and note that since v′_t = Round(ṽ_t; m) we have that |∆| ≤ 1/(2m). We can calculate:

B(f_{t+1}) − B(f̃_{t+1}) = µ_t(v_t, g_t) · E_{(x,y)∼D}[(v′_t − y)² − (ṽ_t − y)² | g_t(x) = 1, f_t(x) = v_t]
    = µ_t(v_t, g_t) · ∆²
    ≤ µ_t(v_t, g_t)/(4m²)

Here the 2nd equality follows from the fact that ṽ_t = E_{(x,y)∼D}[y | f_t(x) = v_t, g_t(x) = 1]. Combining with the above we have:

B(f_t) − B(f_{t+1}) = µ_t(v_t, g_t) · (v_t − ṽ_t)² − (B(f_{t+1}) − B(f̃_{t+1}))
    ≥ µ_t(v_t, g_t) · (v_t − ṽ_t)² − µ_t(v_t, g_t)/(4m²)
    = µ_t(v_t, g_t) · ((v_t − ṽ_t)² − 1/(4m²))

Theorem 21 Given any model f, any collection of groups G, and any 1 ≥ α > 0, Algorithm 15 (Multicalibrate) halts after T < 4/α² many rounds and outputs a model f_T that satisfies α-approximate multicalibration. Moreover, if the algorithm runs for T rounds then B(f_T) < B(f₀) − Tα²/4.

Proof 36 Consider any intermediate round t of the algorithm. If the algorithm has not halted, it must be because f_t is not α-approximately multicalibrated, and so we know that there exists a group g ∈ G such that K₂(f_t, g) > α/µ(g). In other words:

    Σ_{v∈R(f_t)} µ_t(v, g) · (v − E_{(x,y)∼D}[y | f_t(x) = v, g(x) = 1])² ≥ α

We know by construction that |R(f_t)| ≤ m + 1 and so by averaging there must exist some v ∈ R(f_t) such that:

    µ_t(v, g) · (v − E_{(x,y)∼D}[y | f_t(x) = v, g(x) = 1])² ≥ α/(m + 1)

In particular, since (v_t, g_t) are chosen in the algorithm to jointly maximize the left hand side of this quantity, we know that this inequality holds for the pair (v_t, g_t):

    µ_t(v_t, g_t) · (v_t − ṽ_t)² ≥ α/(m + 1)

By Lemma 4.3.2 we have that:

B(f_t) − B(f_{t+1}) ≥ µ_t(v_t, g_t) · ((v_t − ṽ_t)² − 1/(4m²))
    ≥ µ_t(v_t, g_t) · (v_t − ṽ_t)² − 1/(4m²)
    ≥ α/(m + 1) − 1/(4m²)
    = α²/(α + 1) − α²/4
    ≥ α²/2 − α²/4
    = α²/4

Iterating, we therefore have that B(f_T) ≤ B(f₀) − Tα²/4, and since B(f_T) and B(f₀) are bounded in [0, 1], we must have that T ≤ 4/α².

Remark 4.3.1 Theorem 21 bounds B(f_T) − B(f₀). But recall that f₀ results from rounding the outputs of f to the nearest multiple of 1/m, which might increase f's squared error by as much as 1/m + 1/(4m²) = α + α²/4 if f was very poorly calibrated at the outset. Taking this into account we can also conclude that:

    B(f_T) < B(f) − Tα²/4 + α + α²/4.

4.4 Quantile Multicalibration


Similarly, if we let Pinball loss play the role of the Brier score in our analysis, we can derive algorithms for quantile multicalibration:

Definition 27 Fix any model f : X → [0, 1], target quantile q, and group g : X → {0, 1}. The average squared quantile calibration error of f on g is:

    Q₂(f, g) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v | g(x) = 1] · (q − Pr_{(x,y)∼D}[y ≤ v | f(x) = v, g(x) = 1])²

We say that a model f is α-approximately quantile multicalibrated with respect to a collection of groups G and q if for every group g ∈ G:

    Q₂(f, g) ≤ α/µ(g).

We will use the same kind of group value patches that we used for mean
multicalibration, as well as the same rounding procedure. We get the following
algorithm:

Algorithm 16 QuantileMulticalibrate(f, α, q, G, ρ)
Let m = ρ²/(2α).
Let f₀ = Round(f; m) and t = 0.
while f_t is not α-approximately quantile multicalibrated with respect to G and q: do
    Let:
        (v_t, g_t) ∈ argmax_{(v,g)∈R(f_t)×G} Pr_{(x,y)∼D}[f_t(x) = v, g(x) = 1] · (q − Pr_{(x,y)∼D}[y ≤ v | f_t(x) = v, g(x) = 1])²
        ṽ_t = argmin_v |Pr_{(x,y)∼D}[y ≤ v | f_t(x) = v_t, g_t(x) = 1] − q| and v′_t = Round(ṽ_t; m)
    Let f_{t+1} = h(x, f_t; v_t → v′_t, g_t) and t = t + 1.
Output f_t.
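A matching sketch of Algorithm 16 on an empirical sample, with the same conventions as the mean multicalibration sketch above; each flagged cell is repatched to the (rounded) empirical q-quantile of its labels:

import numpy as np

# A sketch of QuantileMulticalibrate (Algorithm 16) on a sample.
def quantile_multicalibrate(f_vals, y, group_masks, q, alpha, rho, max_iter=100_000):
    m = int(np.ceil(rho ** 2 / (2 * alpha)))
    f = np.round(np.asarray(f_vals, dtype=float) * m) / m
    for _ in range(max_iter):
        best = None
        for g in group_masks:
            for v in np.unique(f[g]):
                cell = g & (f == v)
                score = cell.mean() * (q - np.mean(y[cell] <= v)) ** 2
                if best is None or score > best[0]:
                    best = (score, cell)
        if best is None or best[0] <= alpha / (m + 1):
            return f
        _, cell = best
        v_tilde = np.quantile(y[cell], q)         # empirical q-quantile in the cell
        f[cell] = np.round(v_tilde * m) / m
    return f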

Lemma 4.4.1 Fix any intermediate round t of Algorithm 16 (QuantileMulticalibrate) run with parameters α, q, and ρ. If D is ρ-Lipschitz, then we have that:

    P B_q(f_t) − P B_q(f_{t+1}) ≥ α²/(2ρ³)

Proof 37 Since the algorithm has not halted at round t, it must be that f_t is not α-approximately quantile multicalibrated, and hence we know that:

    Pr_{(x,y)∼D}[f_t(x) = v_t, g_t(x) = 1] · (q − Pr_{(x,y)∼D}[y ≤ v_t | f_t(x) = v_t, g_t(x) = 1])² ≥ α/m

Let f̃_{t+1} = h(x, f_t; v_t → ṽ_t, g_t) be the hypothetical update that would have resulted had we not rounded ṽ_t in step t of the algorithm. Since Pr_{(x,y)∼D}[y ≤ ṽ_t | f_t(x) = v_t, g_t(x) = 1] = q, we can apply Lemma 2.2.2 to the distribution D | (f_t(x) = v_t, g_t(x) = 1) (which must also be ρ-Lipschitz) to conclude that:

P B_q(f_t) − P B_q(f̃_{t+1}) = Pr[g_t(x) = 1, f_t(x) = v_t] · E_{(x,y)∼D}[L_q(f_t(x), y) − L_q(f̃_{t+1}(x), y) | g_t(x) = 1, f_t(x) = v_t]
    ≥ µ(g_t, v_t) · α/(2ρmµ(g_t, v_t))
    = α/(2mρ)

We also have that:

P B_q(f_t) − P B_q(f_{t+1}) = (P B_q(f_t) − P B_q(f̃_{t+1})) − (P B_q(f_{t+1}) − P B_q(f̃_{t+1}))
    ≥ α/(2mρ) − (P B_q(f_{t+1}) − P B_q(f̃_{t+1}))

And so it remains to upper bound (P B_q(f_{t+1}) − P B_q(f̃_{t+1})). Let ∆ = ṽ_t − v′_t and note that since v′_t = Round(ṽ_t; m) we have that |∆| ≤ 1/(2m). From Lemma 2.2.2 we have that:

P B_q(f_{t+1}) − P B_q(f̃_{t+1}) = µ(g_t, v_t) · E_{(x,y)∼D}[L_q(v′_t, y) − L_q(ṽ_t, y) | g_t(x) = 1, f_t(x) = v_t]
    ≤ |Q(v′_t) − Q(ṽ_t)| · |∆| − (Q(v′_t) − Q(ṽ_t))²/(2ρ)
    ≤ |Q(v′_t) − Q(ṽ_t)| · 1/(2m) − (Q(v′_t) − Q(ṽ_t))²/(2ρ)
    ≤ ρ/(4m²) − ρ/(8m²)
    = ρ/(8m²)

where the first inequality follows from Lemma 2.2.2, the second follows from the fact that |∆| ≤ 1/(2m), and the third follows from the fact that by ρ-Lipschitzness, we must have that |Q(v′_t) − Q(ṽ_t)| ≤ ρ/(2m).

Putting it all together we get that:

    P B_q(f_t) − P B_q(f_{t+1}) ≥ α/(2mρ) − ρ/(8m²) = α²/(2ρ³)

Here we use the fact that m = ρ²/(2α).

With this progress lemma, we can state the final guarantee for Algorithm
16.

Theorem 22 Fix any model f : X → [0, 1], α > 0, q ∈ [0, 1], G, and ρ. If the distribution D is ρ-Lipschitz, then Algorithm 16 (QuantileMulticalibrate) runs for T rounds and outputs a model f_T that is α-approximately quantile multicalibrated with respect to G and q. Moreover:

    T ≤ 2ρ³/α²

and P B_q(f_T) ≤ P B_q(f₀) − Tα²/(2ρ³).

Proof 38 Lemma 4.4.1 tells us that at any intermediate round t < T of the algorithm, we have that:

    P B_q(f_t) − P B_q(f_{t+1}) ≥ α²/(2ρ³)

Applying this repeatedly we have that:

    P B_q(f_T) ≤ P B_q(f₀) − Tα²/(2ρ³)

For labels in [0, 1], we have that P B_q(f₀) ≤ 1 and P B_q(f_T) ≥ 0. Hence we must have that T ≤ 2ρ³/α².

4.5 Out of Sample Generalization


Thus far we have presented our algorithms for multicalibration as if they have
direct access to the distribution D. In practice, they will not: We will run our
algorithms on a finite sample of n points D ∈ Z n to obtain multicalibration
on the empirical distribution on the sample — but we will want our multicali-
bration guarantees to carry over to some other distribution. In this section, we
will show that if the n points in D were sampled i.i.d. from any distribution
D, then so long as n is sufficiently large, the guarantees of multicalibration
will indeed carry over to D.

4.5.1 Mean Multicalibration


Imagine that we have a distribution D ∈ ∆Z and that we have sampled n points i.i.d. from D to form a dataset D: D ∼ D^n.

Our generalization bounds follow a simple formula: we first argue that for any particular function f_t, if it is multicalibrated on D it is very likely multicalibrated on D as well. We then argue that for any fixed input model, Algorithm 15 can only output a model from a finite (and boundedly large) set, and so we can union bound over all possible output models.

Theorem 23 Fix any model f_t : X → [0, 1], any v ∈ R(f_t), and any group g ∈ G. Let D ∼ D^n consist of n points drawn i.i.d. from D. Then with probability 1 − δ, the weighted squared calibration error term evaluated on the sample is close to its value on the distribution:

|µ_t(g, v, D) · (v − E_{(x,y)∼D}[y | f_t(x) = v, g(x) = 1])² − µ_t(g, v, D) · (v − E_{(x,y)∼D}[y | f_t(x) = v, g(x) = 1])²|
    ≤ 46 √(3µ_t(g, v, D) ln(8/δ)/n) + 135 ln(8/δ)/n
    ∈ O(√(µ_t(g, v, D) ln(1/δ)/n) + ln(1/δ)/n)

Proof 39 This will be a long slog. We will beat each term into submission using Chernoff bounds in sequence, and then combine the resulting bounds. First we argue that with high probability, µ_t(g, v, D) and µ_t(g, v, D) must be close.

Lemma 4.5.1 Fix any model f_t : X → [0, 1], any v ∈ R(f_t), and any group g ∈ G. Let D ∼ D^n consist of n points drawn i.i.d. from D. Then with probability 1 − δ:

    |µ_t(g, v, D) − µ_t(g, v, D)| ≤ √(3 ln(2/δ)µ_t(g, v, D)/n)
Proof 40 We can write
\[
\mu_t(g, v, D) = \frac{1}{n}\sum_{(x,y)\in D} \mathbb{1}[g(x) = 1, f_t(x) = v]
\]
We have both that $0 \leq \mathbb{1}[g(x) = 1, f_t(x) = v] \leq 1$ and that $\mathbb{E}_{D\sim\mathcal{D}^n}[\mu_t(g, v, D)] = \mu_t(g, v, \mathcal{D})$, and so we can apply the Chernoff bound (Theorem 47) to conclude:
\[
\Pr_{D\sim\mathcal{D}^n}\left[|n\mu_t(g, v, D) - n\mu_t(g, v, \mathcal{D})| \geq \eta\, n\mu_t(g, v, \mathcal{D})\right] \leq 2\exp\left(-\frac{n\mu_t(g, v, \mathcal{D})\eta^2}{3}\right)
\]
Plugging in $\eta = \sqrt{\frac{3\ln(2/\delta)}{n\mu_t(g, v, \mathcal{D})}}$ yields:
\[
\Pr_{D\sim\mathcal{D}^n}\left[|n\mu_t(g, v, D) - n\mu_t(g, v, \mathcal{D})| \geq \sqrt{3\ln(2/\delta)n\mu_t(g, v, \mathcal{D})}\right] \leq \delta
\]
Dividing by n yields the lemma.
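As an illustrative sanity check on Lemma 4.5.1 (not part of the formal development), the following Python sketch simulates the empirical mass $\mu_t(g,v,D)$ under hypothetical parameter values and compares its deviation from the population mass to the stated bound.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, n, delta, trials = 0.05, 20_000, 0.05, 2_000

# Each trial: n i.i.d. indicators 1[g(x)=1, f_t(x)=v] with mean mu.
hits = rng.binomial(n, mu, size=trials) / n
deviation = np.abs(hits - mu)

bound = np.sqrt(3 * np.log(2 / delta) * mu / n)
print(f"bound: {bound:.5f}")
print(f"fraction of trials exceeding it: {(deviation > bound).mean():.4f}")
# This should be (well) below delta = 0.05, as Lemma 4.5.1 promises.
```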

We next consider the term: E(x,y)∼D [y|ft (x) = v, g(x) = 1].


Lemma 4.5.2 Fix any model $f_t : \mathcal{X} \to [0,1]$ and any group $g \in \mathcal{G}$. Let $D \sim \mathcal{D}^n$ consist of n points drawn i.i.d. from $\mathcal{D}$. Then with probability $1-\delta$, for any $v \in R(f_t)$ such that $\mu_t(g, v, \mathcal{D}) \geq \frac{12\ln(4/\delta)}{n}$:
\[
\left|\mathop{\mathbb{E}}_{(x,y)\sim D}[y \mid f_t(x) = v, g(x) = 1] - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid f_t(x) = v, g(x) = 1]\right| \leq 5\sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}
\]
Proof 41 We have that:
\[
\mathop{\mathbb{E}}_{(x,y)\sim D}[y \mid f_t(x) = v, g(x) = 1] = \frac{\sum_{(x,y)\in D} y \cdot \mathbb{1}[f_t(x) = v]\mathbb{1}[g(x) = 1]}{n\mu_t(g, v, D)}
\]
Both the numerator and the denominator are i.i.d. sums of random variables bounded in [0, 1]. So, we can apply the Chernoff bound (Theorem 47) to conclude that with probability $1-\delta$ we have simultaneously:
\[
\left|\sum_{(x,y)\in D} y \cdot \mathbb{1}[f_t(x) = v]\mathbb{1}[g(x) = 1] - n\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \cdot \mathbb{1}[f_t(x) = v]\mathbb{1}[g(x) = 1]]\right| \leq \sqrt{3\ln(4/\delta)n\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \cdot \mathbb{1}[f_t(x) = v]\mathbb{1}[g(x) = 1]]} \leq \sqrt{3\ln(4/\delta)n\mu_t(g, v, \mathcal{D})}
\]
and
\[
|n\mu_t(g, v, D) - n\mu_t(g, v, \mathcal{D})| \leq \sqrt{3\ln(4/\delta)n\mu_t(g, v, \mathcal{D})}
\]
Therefore we have that with probability $1-\delta$:
\[
\begin{aligned}
\mathop{\mathbb{E}}_{(x,y)\sim D}[y \mid f_t(x) = v, g(x) = 1]
&= \frac{\sum_{(x,y)\in D} y \cdot \mathbb{1}[f_t(x) = v]\mathbb{1}[g(x) = 1]}{n\mu_t(g, v, D)} \\
&\leq \frac{n\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \cdot \mathbb{1}[f_t(x) = v]\mathbb{1}[g(x) = 1]] + \sqrt{3\ln(4/\delta)n\mu_t(g, v, \mathcal{D})}}{n\mu_t(g, v, \mathcal{D}) - \sqrt{3\ln(4/\delta)n\mu_t(g, v, \mathcal{D})}} \\
&= \frac{n\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \cdot \mathbb{1}[f_t(x) = v]\mathbb{1}[g(x) = 1]] + \sqrt{3\ln(4/\delta)n\mu_t(g, v, \mathcal{D})}}{n\mu_t(g, v, \mathcal{D})\left(1 - \sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}\right)} \\
&= \frac{1}{1 - \sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}}\left(\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid f_t(x) = v, g(x) = 1] + \sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}\right) \\
&\leq \left(1 + 2\sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}\right)\left(\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid f_t(x) = v, g(x) = 1] + \sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}\right) \\
&\leq \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid f_t(x) = v, g(x) = 1] + 3\sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}} + 2\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})} \\
&\leq \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid f_t(x) = v, g(x) = 1] + 5\sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}
\end{aligned}
\]
Here we have applied Lemma 4.5.1 to move between $\mu_t(g,v,D)$ and $\mu_t(g,v,\mathcal{D})$, and have relied on our assumption that $\sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g,v,\mathcal{D})}} \leq \frac{1}{2}$ to apply the inequality $1/(1-x) \leq 1+2x$ for $0 \leq x \leq 1/2$. The last inequality follows because by our assumption on $\mu_t(g,v,\mathcal{D})$, $\frac{3\ln(4/\delta)}{n\mu_t(g,v,\mathcal{D})} \leq 1$ and hence $\frac{3\ln(4/\delta)}{n\mu_t(g,v,\mathcal{D})} \leq \sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g,v,\mathcal{D})}}$.
We can similarly derive the inequality in the reverse direction to conclude that with probability $1-\delta$:
\[
\left|\mathop{\mathbb{E}}_{(x,y)\sim D}[y \mid f_t(x) = v, g(x) = 1] - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid f_t(x) = v, g(x) = 1]\right| \leq 5\sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}
\]

Onwards! We now propagate our error bounds outwards:


Lemma 4.5.3 Fix any model $f_t : \mathcal{X} \to [0,1]$ and any group $g \in \mathcal{G}$. Let $D \sim \mathcal{D}^n$ consist of n points drawn i.i.d. from $\mathcal{D}$. Then with probability $1-\delta$, for any $v \in R(f_t)$ such that $\mu_t(g, v, \mathcal{D}) \geq \frac{12\ln(4/\delta)}{n}$:
\[
\left|\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y \mid f_t(x) = v, g(x) = 1]\right)^2 - \left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y \mid f_t(x) = v, g(x) = 1]\right)^2\right| \leq 45\sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g, v, \mathcal{D})}}
\]

Proof 42 Write $\bar{y}_D = \mathbb{E}_{(x,y)\sim D}[y \mid f_t(x) = v, g(x) = 1]$ and $\bar{y}_{\mathcal{D}} = \mathbb{E}_{(x,y)\sim\mathcal{D}}[y \mid f_t(x) = v, g(x) = 1]$, and abbreviate $\epsilon = \sqrt{\frac{3\ln(4/\delta)}{n\mu_t(g,v,\mathcal{D})}}$. We compute:
\[
\begin{aligned}
(v - \bar{y}_D)^2 - (v - \bar{y}_{\mathcal{D}})^2 &= 2v(\bar{y}_{\mathcal{D}} - \bar{y}_D) + (\bar{y}_D^2 - \bar{y}_{\mathcal{D}}^2) \\
&\leq 2v|\bar{y}_{\mathcal{D}} - \bar{y}_D| + |\bar{y}_D^2 - \bar{y}_{\mathcal{D}}^2| \\
&\leq 10v\epsilon + |\bar{y}_D^2 - \bar{y}_{\mathcal{D}}^2| \\
&\leq 10\epsilon + \left(10\epsilon + (5\epsilon)^2\right) \\
&\leq 45\epsilon
\end{aligned}
\]
Here we have applied Lemma 4.5.2 twice (in the fourth line, via $|\bar{y}_D^2 - \bar{y}_{\mathcal{D}}^2| = |\bar{y}_D - \bar{y}_{\mathcal{D}}|(\bar{y}_D + \bar{y}_{\mathcal{D}}) \leq 5\epsilon(2 + 5\epsilon)$). The last inequality follows because by our assumption on $\mu_t(g,v,\mathcal{D})$, $\epsilon^2 \leq 1$ and hence $\epsilon^2 \leq \epsilon$.

Phew. Let's finish this. Applying Lemma 4.5.1, we have that with probability $1 - \delta/2$:
\[
\mu_t(g,v,D)\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2 \leq \left(\mu_t(g,v,\mathcal{D}) + \sqrt{\frac{3\ln(2/\delta)\mu_t(g,v,\mathcal{D})}{n}}\right)\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2
\]
There are two cases to consider. The first case is when $\mu_t(g,v,\mathcal{D}) < \frac{12\ln(8/\delta)}{n}$. In this case, since $\left(v - \mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2 \leq 1$ we have:
\[
\mu_t(g,v,D)\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2 \leq \frac{12\ln(8/\delta)}{n} + \sqrt{\frac{3\ln(2/\delta)\mu_t(g,v,\mathcal{D})}{n}}
\]
In the remaining case, we can apply Lemma 4.5.3 to continue and conclude that with probability $1-\delta$:
\[
\begin{aligned}
\mu_t(g,v,D)&\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2 \\
&\leq \left(\mu_t(g,v,\mathcal{D}) + \sqrt{\frac{3\ln(2/\delta)\mu_t(g,v,\mathcal{D})}{n}}\right)\left(\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f_t(x)=v, g(x)=1]\right)^2 + 45\sqrt{\frac{3\ln(8/\delta)}{n\mu_t(g,v,\mathcal{D})}}\right) \\
&\leq \mu_t(g,v,\mathcal{D})\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f_t(x)=v, g(x)=1]\right)^2 + \sqrt{\frac{3\ln(2/\delta)\mu_t(g,v,\mathcal{D})}{n}} + 45\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\ln(8/\delta)}{n}} + 45\sqrt{\frac{3\ln(2/\delta)\mu_t(g,v,\mathcal{D})}{n}}\sqrt{\frac{3\ln(8/\delta)}{n\mu_t(g,v,\mathcal{D})}} \\
&\leq \mu_t(g,v,\mathcal{D})\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f_t(x)=v, g(x)=1]\right)^2 + 46\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\ln(8/\delta)}{n}} + \frac{135\ln(8/\delta)}{n}
\end{aligned}
\]
The reverse direction follows the same way:
\[
\mu_t(g,v,D)\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2 \geq \mu_t(g,v,\mathcal{D})\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f_t(x)=v, g(x)=1]\right)^2 - 46\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\ln(8/\delta)}{n}} - \frac{135\ln(8/\delta)}{n}
\]
which finally gives us our theorem.

Recapping where we are, we have shown that for a single model $f_t$, group $g$, and value $v$, the quantities $\mu_t(g,v,D)\left(v - \mathbb{E}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2$ evaluated in-sample are close to the corresponding quantities out of sample. But we need a corresponding statement for every group $g \in \mathcal{G}$, every $v \in [1/m]$, and every model $f$ that might be output by Algorithm 15. Our solution to this will simply be to count all possible combinations of $g$, $v$, and $f$, but in order to do this, we need to understand how many distinct models might be output by Algorithm 15.
Lemma 4.5.4 Fix any model f : X → [0, 1], any finite collection of groups G, and any α > 0. Then there is a set of models C such that for every distribution D (which might be the empirical distribution over an arbitrary dataset), the model $f_t$ output by Multicalibrate(f, α, G, D) is such that $f_t \in C$, and:
\[
|C| \leq \left(\frac{|\mathcal{G}|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+1}
\]
Proof 43 Given a run of Multicalibrate(f, α, G, D) (Algorithm 15) for T rounds, let $\pi = \{(v_t, v_t', g_t)\}_{t=1}^T$ denote the record of the quantities $(v_t, v_t', g_t)$ selected by the algorithm at each round t. Let $\pi^{<t} = \{(v_{t'}, v_{t'}', g_{t'})\}_{t'=1}^{t-1}$ denote the prefix of this transcript up through round $t-1$. Observe that once we fix $\pi^{<t}$ we have also fixed the model $f_t$ that is defined at the start of round t (independently of the distribution D). Thus to count models that might be output by Multicalibrate(f, α, G, D), it suffices to count transcripts.
We let C denote the set of all models defined by transcripts $\pi^{<T}$ for all $T \leq \frac{4}{\alpha^2}$. Since we know from Theorem 21 that Algorithm 15 halts after at most $T \leq \frac{4}{\alpha^2}$ many rounds, the models output by Algorithm 15 must be contained in C as claimed. It remains to count the set of transcripts of length $T \leq \frac{4}{\alpha^2}$. At each round t, there are $m = 1/\alpha$ possible choices for $v_t$, $m = 1/\alpha$ possible choices for $v_t'$, and $|\mathcal{G}|$ possible choices for $g_t$. Hence the number of transcripts of length T is $\left(\frac{|\mathcal{G}|}{\alpha^2}\right)^T$. Thus we have:
\[
|C| \leq \sum_{T=0}^{\frac{4}{\alpha^2}} \left(\frac{|\mathcal{G}|}{\alpha^2}\right)^T \leq \left(\frac{|\mathcal{G}|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+1}
\]

Having counted the number of models that Multicalibrate might output, we can apply our union bound:
Theorem 24 Fix any model f : X → [0, 1], any finite collection of groups G, any α > 0 and any δ > 0. Let $D \sim \mathcal{D}^n$ consist of n points drawn i.i.d. from $\mathcal{D}$. Then with probability $1-\delta$, simultaneously for every model $f_t : \mathcal{X} \to [0,1]$ that can be output by Multicalibrate(f, α, G, D) (Algorithm 15), any group $g \in \mathcal{G}$, and any $v \in R(f_t)$:
\[
\begin{aligned}
&\left|\mu_t(g,v,D)\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2 - \mu_t(g,v,\mathcal{D})\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f_t(x)=v, g(x)=1]\right)^2\right| \\
&\qquad\leq 46\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n}} + \frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n} \;\in\; O\left(\frac{1}{\alpha}\sqrt{\frac{\mu_t(g,v,\mathcal{D})\ln\left(\frac{|\mathcal{G}|}{\alpha\delta}\right)}{n}}\right)
\end{aligned}
\]
Proof 44 From Theorem 23 we have that for any $\delta' > 0$ and any single triple $(f_t, g, v)$:
\[
\left|\mu_t(g,v,D)\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2 - \mu_t(g,v,\mathcal{D})\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f_t(x)=v, g(x)=1]\right)^2\right| \leq 46\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\ln(8/\delta')}{n}} + \frac{135\ln(8/\delta')}{n}
\]
We now count the number of triples quantified over in our theorem. Lemma 4.5.4 tells us that the number of models $f_t$ that might be output is at most $\left(\frac{|\mathcal{G}|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+1}$. The number of groups $g \in \mathcal{G}$ is $|\mathcal{G}|$, and the number of values $v \in R(f_t)$ is by construction $m = \frac{1}{\alpha}$. Hence the number of triples is at most:
\[
\left(\frac{|\mathcal{G}|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+1} \cdot |\mathcal{G}| \cdot \frac{1}{\alpha} \leq \left(\frac{|\mathcal{G}|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+2}
\]
The theorem then follows from invoking Theorem 23 with $\delta' = \delta\Big/\left(\frac{|\mathcal{G}|}{\alpha^2}\right)^{\frac{4}{\alpha^2}+2}$ and then summing the failure probability $\delta'$ over all enumerated triples.
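To get a feel for the sizes involved, here is a quick sketch (with hypothetical values of |G|, α, and δ) that evaluates the transcript-count bound of Lemma 4.5.4 and the resulting per-triple failure probability δ′ in log-space, since the counts themselves overflow a floating point number:

```python
import math

G_size, alpha, delta = 50, 0.1, 0.05

log_models = (4 / alpha**2 + 1) * math.log(G_size / alpha**2)   # ln |C|, Lemma 4.5.4
log_triples = (4 / alpha**2 + 2) * math.log(G_size / alpha**2)  # ln(#triples), upper bound
ln_8_over_delta_prime = math.log(8 / delta) + log_triples        # ln(8/delta')

print(f"ln |C|       ~ {log_models:.0f}")
print(f"ln(8/delta') ~ {ln_8_over_delta_prime:.0f}")
# The union bound over exponentially many triples costs only an additive
# (4/alpha^2 + 2) * ln(|G|/alpha^2) inside the logarithm of the final bound.
```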

We’re now ready to state our final generalization theorem:

Theorem 25 Fix any model f : X → [0, 1], any finite collection of groups G, any α > 0 and any δ > 0. Let $D \sim \mathcal{D}^n$ consist of n points drawn i.i.d. from $\mathcal{D}$. Then with probability $1-\delta$, the model $f_t : \mathcal{X} \to [0,1]$ that is output by Multicalibrate(f, α, G, D) (Algorithm 15) is $\alpha'$ approximately multicalibrated with respect to G and $\mathcal{D}$ for:
\[
\alpha' \leq \alpha + \frac{1}{\alpha}\cdot\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n} + 46\sqrt{\frac{3\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{\alpha n}} \;\in\; O\left(\alpha + \frac{\ln\left(\frac{|\mathcal{G}|}{\alpha^2\delta}\right)}{\alpha^3 n} + \sqrt{\frac{\ln\left(\frac{|\mathcal{G}|}{\alpha^2\delta}\right)}{\alpha^3 n}}\right)
\]

Remark 4.5.1 Choosing α to optimize the bound from Theorem 25, we get a model $f_t$ that is $\alpha'$ approximately multicalibrated with respect to G and $\mathcal{D}$ for:
\[
\alpha' = \tilde{O}\left(\left(\frac{\ln\left(\frac{|\mathcal{G}|}{\delta}\right)}{n}\right)^{1/5}\right)
\]
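As an illustration of Remark 4.5.1, the following sketch (hypothetical n, |G|, and δ) picks α by grid search over the asymptotic form of the Theorem 25 bound, with constants dropped:

```python
import numpy as np

n, G_size, delta = 1_000_000, 100, 0.05

alphas = np.linspace(0.01, 0.5, 500)
log_term = np.log(G_size / (alphas**2 * delta))
# Asymptotic form of the Theorem 25 bound (constants dropped).
bound = alphas + log_term / (alphas**3 * n) + np.sqrt(log_term / (alphas**3 * n))

best = alphas[np.argmin(bound)]
print(f"grid-optimal alpha       ~ {best:.3f}")
print(f"(ln(|G|/delta)/n)^(1/5)  ~ {(np.log(G_size / delta) / n) ** 0.2:.3f}")
# The two agree up to logarithmic factors, matching Remark 4.5.1.
```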

Proof 45 We need to prove that with probability $1-\delta$, for every group $g \in \mathcal{G}$: $K_2(f_t, g, \mathcal{D}) \leq \frac{\alpha'}{\mu(g,\mathcal{D})}$.
Expanding out the definition of $K_2(f_t, g, \mathcal{D})$, this is equivalent to proving that for every $g \in \mathcal{G}$:
\[
\mu(g,\mathcal{D})K_2(f_t, g, \mathcal{D}) = \sum_{v\in R(f_t)} \mu_t(g,v,\mathcal{D})\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f(x)=v, g(x)=1]\right)^2 \leq \alpha'
\]
From Theorem 24 we know that with probability $1-\delta$, for every $v \in R(f_t)$ and $g \in \mathcal{G}$:
\[
\left|\mu_t(g,v,D)\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f_t(x)=v, g(x)=1]\right)^2 - \mu_t(g,v,\mathcal{D})\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f_t(x)=v, g(x)=1]\right)^2\right| \leq 46\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n}} + \frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n}
\]
From Theorem 21 we know that (with probability 1), $\mu(g,D)\cdot K_2(f_t,g,D) \leq \alpha$ for every $g \in \mathcal{G}$.
Combining these bounds we have:
\[
\begin{aligned}
\mu(g,\mathcal{D})K_2(f_t,g,\mathcal{D}) &= \sum_{v\in R(f_t)}\mu_t(g,v,\mathcal{D})\left(v - \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[y\mid f(x)=v,g(x)=1]\right)^2 \\
&\leq \sum_{v\in R(f_t)}\mu_t(g,v,D)\left(v - \mathop{\mathbb{E}}_{(x,y)\sim D}[y\mid f(x)=v,g(x)=1]\right)^2 + \sum_{v\in R(f_t)}\left(46\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n}} + \frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n}\right) \\
&\leq \alpha + \sum_{v\in R(f_t)}\left(46\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n}} + \frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n}\right) \\
&\leq \alpha + \frac{1}{\alpha}\cdot\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n} + 46\sum_{v\in R(f_t)}\sqrt{\frac{3\mu_t(g,v,\mathcal{D})\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n}} \\
&\leq \alpha + \frac{1}{\alpha}\cdot\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n} + 46\sqrt{\frac{3\mu(g,\mathcal{D})\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{\alpha n}} \\
&\leq \alpha + \frac{1}{\alpha}\cdot\frac{135\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{n} + 46\sqrt{\frac{3\left(\frac{4}{\alpha^2}+2\right)\ln\left(\frac{8|\mathcal{G}|}{\alpha^2\delta}\right)}{\alpha n}}
\end{aligned}
\]
Here we have used the fact that $|R(f_t)| = \frac{1}{\alpha}$, and that because $\sqrt{\cdot}$ is a concave function, the final sum is maximized when $\mu_t(g,v,\mathcal{D}) = \alpha\mu(g,\mathcal{D})$ for each $v$.

4.5.2 Quantile Multicalibration

We can prove a very similar generalization bound for quantile multicalibration. We elide the details, which are for the most part similar to the case of mean multicalibration, and state the final theorem:

Theorem 26 Fix any model f : X → [0, 1], any finite collection of groups G, any α > 0 and any δ > 0. Let $D \sim \mathcal{D}^n$ consist of n points drawn i.i.d. from a ρ-Lipschitz distribution $\mathcal{D}$. Then with probability $1-\delta$, the model $f_t : \mathcal{X} \to [0,1]$ that is output by QuantileMulticalibrate(f, α, q, G, D) (Algorithm 16) is $\alpha'$ approximately quantile multicalibrated with respect to target quantile q and G and $\mathcal{D}$ for:
\[
\alpha' = \alpha + 42\sqrt{\frac{3\rho^2\left(\ln\left(\frac{4\pi^2T^2}{3\delta}\right) + T\ln\left(\frac{\rho^4|\mathcal{G}|}{\alpha^2}\right)\right)}{2\alpha n}}
\]

Remark 4.5.2 Choosing α to optimize the bound from Theorem 26, we get a model $f_t$ that is $\alpha'$ approximately quantile multicalibrated with respect to G and $\mathcal{D}$ for:
\[
\alpha' = \tilde{O}\left(\left(\frac{\rho^3\ln\left(\frac{\rho^4|\mathcal{G}|}{\delta}\right)}{n}\right)^{1/5}\right)
\]

4.6 Sequential Prediction

In this section we give algorithms that can promise mean and quantile multicalibration in the sequential setting, against an adversary. It will be more convenient for us to bound ℓ∞ calibration error (K∞) rather than the ℓ2 calibration error that we have been working with. But to avoid trivialities, we will need a slightly modified definition that allows us to drive the calibration error to 0 while keeping m fixed. (Recall that an issue with K∞ and Q∞ as we have defined them is that it is possible to drive them to zero at a rate of 1/m in a trivial manner by taking m to be very large and making sure that we do not make any particular prediction with probability greater than 1/m.)

4.6.1 A Bucketed Calibration Definition

Recall that calibration asks informally that $\mathbb{E}_{(x,y)\sim\mathcal{D}}[y\mid f(x)=v] \approx v$ for all v. To avoid problems with the conditioning event, we have thus far restricted our attention to models f that have bounded range R(f) ⊆ [1/m]. A different solution is to allow f to have arbitrary range in [0, 1], and to condition not on the event that f(x) = v, but rather on the event that f(x) ≈ v, for an appropriate formalization of ≈. We will do this through bucketing:

Definition 28 Let m be a bucket coarseness parameter. For $i \in \{1,\ldots,m-1\}$ let $B_m(i) = \left[\frac{i-1}{m}, \frac{i}{m}\right)$ and let $B_m(m) = \left[\frac{m-1}{m}, 1\right]$. Collectively, the $B_m(i)$ for $i \in [m]$ form a set of m "buckets" of width 1/m that partition the unit interval.

We now give a variant of our ℓ∞ calibration definition in which, informally, the conditioning event f(x) ≈ v is taken to mean "f(x) and v are in the same bucket".
Definition 29 (Bucketed Multicalibration Error in the Distributional Setting) Fix a predictor f : X → [0, 1], a collection of groups G, and a bucket coarseness parameter m. The calibration error of f on a group g with respect to bucketing coarseness m, on distribution D, is defined to be:
\[
C_\infty(f, m, g, \mathcal{D}) = \max_{i\in[m]} \Pr_{(x,y)\sim\mathcal{D}}[f(x) \in B_m(i)\mid g(x)=1]\cdot\left|\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[f(x) - y\mid f(x)\in B_m(i), g(x)=1]\right|
\]
We say that f satisfies (α, m)-multicalibration with respect to G on D if for every g ∈ G:
\[
C_\infty(f, m, g, \mathcal{D}) \leq \frac{\alpha}{\mu(g)}
\]
This is identical to our definition of K∞ except that the condition that f(x) = v has been replaced with the condition that $f(x) \in B_m(i)$. We can give a corresponding definition in the sequential setting:

Definition 30 (Bucketed Multicalibration Error in the Sequential Setting) Fix a collection of groups G, a transcript $\pi = \{(x_1,p_1,y_1),\ldots,(x_T,p_T,y_T)\}$, and a bucket coarseness parameter m. For any $i \in [m]$ and $g \in \mathcal{G}$, let $S(\pi,g,i) = \{t : g(x_t)=1, p_t \in B_m(i)\}$ and $S(\pi,g) = \{t : g(x_t)=1\}$. Let $n(\pi,g,i) = |S(\pi,g,i)|$ and let $n(\pi,g) = |S(\pi,g)|$.
The calibration error of π on a group g with respect to bucketing coarseness m is defined to be:
\[
C_\infty(\pi, m, g) = \max_{i\in[m]} \frac{n(\pi,g,i)}{n(\pi,g)}\cdot\frac{\left|\sum_{t\in S(\pi,g,i)}(p_t - y_t)\right|}{n(\pi,g,i)}
\]
We say that π satisfies (α, m)-multicalibration with respect to G if for every g ∈ G:
\[
C_\infty(\pi, m, g) \leq \frac{\alpha T}{n(\pi, g)}
\]
Expanding out the definitions we find that, equivalently, π satisfies (α, m)-multicalibration with respect to G if:
\[
\max_{g\in\mathcal{G}, i\in[m]}\left|\sum_{t\in S(\pi,g,i)}(p_t - y_t)\right| \leq \alpha T
\]
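The bucketed sequential error of Definition 30 is straightforward to compute from a transcript. A minimal Python sketch (the transcript arrays and the boolean group encoding are hypothetical choices of this sketch):

```python
import numpy as np

def bucket(p, m):
    """Index i in {1,...,m} of the width-1/m bucket B_m(i) containing p."""
    return np.minimum(np.floor(p * m).astype(int) + 1, m)

def max_bucketed_miscalibration(ps, ys, groups, m):
    """max over (g, i) of |sum_{t in S(pi,g,i)} (p_t - y_t)|, cf. Definition 30.

    ps, ys: arrays of length T; groups: boolean array of shape (T, |G|).
    The transcript satisfies (alpha, m)-multicalibration iff this is <= alpha*T.
    """
    ids = bucket(ps, m)
    worst = 0.0
    for g in range(groups.shape[1]):
        for i in range(1, m + 1):
            sel = groups[:, g] & (ids == i)
            worst = max(worst, abs(np.sum(ps[sel] - ys[sel])))
    return worst

# Hypothetical transcript: 1000 rounds, 2 overlapping groups, m = 10 buckets.
rng = np.random.default_rng(2)
ps = rng.uniform(size=1000)
ys = (rng.uniform(size=1000) < ps).astype(float)  # labels consistent with ps
groups = rng.uniform(size=(1000, 2)) < 0.5
print(max_bucketed_miscalibration(ps, ys, groups, m=10))
```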

4.6.2 Achieving Bucketed Calibration

Our goal is to design a sequential prediction algorithm that guarantees (α, m)-multicalibration, with α tending to 0 with T, against any possible sequence of observations. To this end, we define a surrogate loss function that replaces the max in the definition of bucketed multicalibration with a "softmax" function based on sums of exponentials, which is analytically better behaved but is nevertheless a good approximation to the max function.

Definition 31 (Surrogate Loss) For a round s ≤ T and a transcript π, recall that $\pi^{\leq s}$ denotes the length s prefix of π. For a group g ∈ G and a bucket i ∈ [m] let:
\[
V_s^{g,i} = \sum_{t\in S(\pi^{\leq s}, g, i)} (y_t - p_t)
\]
denote the cumulative difference between the outcomes $y_t$ and the predictions $p_t$ on the subsequence of $\pi^{\leq s}$ corresponding to examples from group g and predictions in bucket i.
Fixing a parameter η ∈ [0, ½], define a surrogate calibration loss function at round s as:
\[
L_s(\pi^{\leq s}) = \sum_{g\in\mathcal{G}, i\in[m]} \left(\exp(\eta V_s^{g,i}) + \exp(-\eta V_s^{g,i})\right).
\]
When the transcript $\pi^{\leq s}$ is clear from context, we will simply write $L_s$.

We will leave η unspecified for now, and choose it later to optimize our bounds. Recall that what we really want to do is upper bound $\max_{g\in\mathcal{G}, i\in[m]}|V_T^{g,i}|$, which corresponds to our calibration loss. Observe that this "soft-max style" function allows us to tightly upper bound our calibration loss:

Observation 4.6.1 For any transcript $\pi_T$, and any η ∈ [0, ½], we have that:
\[
\max_{g\in\mathcal{G},i\in[m]} |V_T^{g,i}| \leq \frac{1}{\eta}\ln(L_T) \leq \max_{g\in\mathcal{G},i\in[m]} |V_T^{g,i}| + \frac{\ln(2|\mathcal{G}|m)}{\eta}.
\]

Proof 46 For the first inequality, note that:
\[
\begin{aligned}
\max_{g\in\mathcal{G},i\in[m]} \eta|V_T^{g,i}| &= \ln\left(\exp\left(\max_{g\in\mathcal{G},i\in[m]}\eta|V_T^{g,i}|\right)\right) \\
&= \ln\left(\max_{g\in\mathcal{G},i\in[m]}\exp\left(\eta|V_T^{g,i}|\right)\right) \\
&\leq \ln\left(\max_{g\in\mathcal{G},i\in[m]}\left(\exp(\eta V_T^{g,i}) + \exp(-\eta V_T^{g,i})\right)\right) \\
&\leq \ln\left(\sum_{g\in\mathcal{G},i\in[m]}\left(\exp(\eta V_T^{g,i}) + \exp(-\eta V_T^{g,i})\right)\right) \\
&= \ln(L_T)
\end{aligned}
\]
Dividing by η gives the inequality. In the other direction we have that:
\[
\begin{aligned}
\frac{1}{\eta}\ln(L_T) &= \frac{1}{\eta}\ln\left(\sum_{g\in\mathcal{G},i\in[m]}\left(\exp(\eta V_T^{g,i}) + \exp(-\eta V_T^{g,i})\right)\right) \\
&\leq \frac{1}{\eta}\ln\left(2|\mathcal{G}|m\cdot\max_{g\in\mathcal{G},i\in[m]}\exp\left(\eta|V_T^{g,i}|\right)\right) \\
&= \frac{\ln(2|\mathcal{G}|m)}{\eta} + \max_{g\in\mathcal{G},i\in[m]}|V_T^{g,i}|
\end{aligned}
\]
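A quick numerical check of Observation 4.6.1 (with hypothetical V values): ln(L_T)/η is sandwiched between the maximum absolute deviation and that maximum plus ln(2|G|m)/η.

```python
import numpy as np

rng = np.random.default_rng(3)
G_size, m, eta = 20, 10, 0.1

V = rng.normal(scale=30.0, size=(G_size, m))      # hypothetical V_T^{g,i} values
L = np.sum(np.exp(eta * V) + np.exp(-eta * V))    # surrogate loss L_T

lower = np.max(np.abs(V))
upper = lower + np.log(2 * G_size * m) / eta
print(f"{lower:.2f} <= {np.log(L) / eta:.2f} <= {upper:.2f}")
```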

We are now free to study the analytically nicer surrogate loss function. Just as in our derivation of algorithms promising (regular) calibration guarantees against an adversary, we will be interested in bounding the increase in our surrogate loss function from round to round.

Definition 32 Fix any partial transcript $\pi^{\leq s+1} = \pi^{\leq s} \circ (x_{s+1}, p_{s+1}, y_{s+1})$. Define:
\[
\Delta_{s+1}(\pi^{\leq s+1}) \equiv \Delta_{s+1}(\pi^{\leq s}, (x_{s+1}, p_{s+1}, y_{s+1})) = L_{s+1}(\pi^{\leq s+1}) - L_s(\pi^{\leq s})
\]

Our first step is to bound $\Delta_{s+1}(\pi^{\leq s+1})$ in terms of a quantity that is linear in $p_{s+1}$ and $y_{s+1}$.

Lemma 4.6.1 Fix any partial transcript $\pi^{\leq s+1} = \pi^{\leq s} \circ (x_{s+1}, p_{s+1}, y_{s+1})$ such that $p_{s+1} \in B_m(i)$. Then for any η ≤ 1, we have that:
\[
\Delta_{s+1}(\pi^{\leq s+1}) \leq \eta(y_{s+1} - p_{s+1})\cdot C_s^i(x_{s+1}) + 2\eta^2 L_s(\pi^{\leq s})
\]
where:
\[
C_s^i(x_{s+1}) = \sum_{g\in\mathcal{G}(x_{s+1})}\left(\exp(\eta V_s^{g,i}) - \exp(-\eta V_s^{g,i})\right)
\]
is a constant depending only on $\pi^{\leq s}$ and $x_{s+1}$.

Proof 47 Observe that our surrogate loss function is a sum of terms each defined by a group g ∈ G and a bucket i ∈ [m], and that at round s+1, the change in surrogate loss can be written as a sum over only those groups in $\mathcal{G}(x_{s+1})$, over the bucket i such that $p_{s+1} \in B_m(i)$, since all other terms in the sum cancel out. Therefore we can write:
\[
\begin{aligned}
\Delta_{s+1}(\pi^{\leq s+1}) &= L_{s+1} - L_s \\
&= \sum_{g\in\mathcal{G}(x_{s+1})}\left(\exp(\eta V_{s+1}^{g,i}) - \exp(\eta V_s^{g,i}) + \exp(-\eta V_{s+1}^{g,i}) - \exp(-\eta V_s^{g,i})\right) \\
&= \sum_{g\in\mathcal{G}(x_{s+1})}\left(\exp(\eta V_s^{g,i})\left(\exp(\eta(y_{s+1}-p_{s+1})) - 1\right) + \exp(-\eta V_s^{g,i})\left(\exp(-\eta(y_{s+1}-p_{s+1})) - 1\right)\right) \\
&\leq \sum_{g\in\mathcal{G}(x_{s+1})}\left(\exp(\eta V_s^{g,i})\left(\eta(y_{s+1}-p_{s+1}) + 2\eta^2\right) + \exp(-\eta V_s^{g,i})\left(-\eta(y_{s+1}-p_{s+1}) + 2\eta^2\right)\right) \\
&= \eta(y_{s+1}-p_{s+1})\left(\sum_{g\in\mathcal{G}(x_{s+1})}\exp(\eta V_s^{g,i}) - \sum_{g\in\mathcal{G}(x_{s+1})}\exp(-\eta V_s^{g,i})\right) + 2\eta^2\sum_{g\in\mathcal{G}(x_{s+1})}\left(\exp(\eta V_s^{g,i}) + \exp(-\eta V_s^{g,i})\right) \\
&\leq \eta(y_{s+1}-p_{s+1})\cdot C_s^i(x_{s+1}) + 2\eta^2 L_s(\pi^{\leq s})
\end{aligned}
\]
Here the first inequality follows from the fact that $\eta|y_{s+1}-p_{s+1}| \leq \eta$ and that for $|x| \leq 1$, $\exp(x) \leq 1 + x + x^2$.

Our goal is to find a strategy for the learner's choice of $p_{s+1}$, as a function of both $\pi^{\leq s}$ and $x_{s+1}$, that will guarantee that $\mathbb{E}_{p_{s+1}}\left[\Delta_{s+1}(\pi^{\leq s},(x_{s+1},p_{s+1},y_{s+1}))\right]$ is small for every possible realization of $y_{s+1}$. We measure calibration over m buckets, but we will allow our learner to play from a larger strategy space $\left[\frac{1}{rm}\right] = \{0, \frac{1}{rm}, \ldots, \frac{rm-1}{rm}, 1\}$ for some integer r > 1. In the end we will see that our bounds get better with larger r, but that the algorithm we design has no dependence on r at all in its running time, so we can imagine r to be an arbitrarily large number.

Lemma 4.6.2 Fix any transcript $\pi^{\leq s}$ and any $x_{s+1}$. There is a distribution on predictions $p_{s+1} \in \left[\frac{1}{rm}\right]$ such that for every $y_{s+1} \in [0,1]$:
\[
\mathop{\mathbb{E}}_{p_{s+1}}\left[\Delta_{s+1}(\pi^{\leq s},(x_{s+1},p_{s+1},y_{s+1}))\right] \leq L_s(\pi^{\leq s})\cdot\left(\frac{\eta}{rm} + 2\eta^2\right)
\]
The distribution can be sampled from as follows:
1. If $C_s^i(x_{s+1}) > 0$ for all i then predict $p_{s+1} = 1$.
2. If $C_s^i(x_{s+1}) < 0$ for all i then predict $p_{s+1} = 0$.
3. Otherwise, find $i^* \in [m-1]$ such that $C_s^{i^*}(x_{s+1})C_s^{i^*+1}(x_{s+1}) \leq 0$ and let $q \in [0,1]$ be such that $q\cdot C_s^{i^*}(x_{s+1}) + (1-q)C_s^{i^*+1}(x_{s+1}) = 0$. Predict $p_{s+1} = \frac{i^*}{m} - \frac{1}{rm}$ with probability q and predict $p_{s+1} = \frac{i^*}{m}$ with probability $1-q$.

Proof 48 From Lemma 4.6.1 we have that:
\[
\Delta_{s+1}(\pi^{\leq s+1}) \leq \eta(y_{s+1}-p_{s+1})\cdot C_s^i(x_{s+1}) + 2\eta^2 L_s(\pi^{\leq s})
\]
where $p_{s+1} \in B_m(i)$. So it suffices to prove that:
\[
\mathop{\mathbb{E}}_{p_{s+1}}\left[\eta(y_{s+1}-p_{s+1})\cdot C_s^{B^{-1}(p_{s+1})}(x_{s+1})\right] \leq \frac{\eta}{rm}L_s(\pi^{\leq s})
\]
where $B^{-1}(p_{s+1}) = i$ is the bucket such that $p_{s+1} \in B_m(i)$. We do this in cases.

Case 1: $C_s^i(x_{s+1}) > 0$ for all i: In this case $p_{s+1} = 1$ and we have that $\eta(y_{s+1}-p_{s+1}) \leq 0$. Since $C_s^m(x_{s+1}) > 0$, it follows that $\eta(y_{s+1}-p_{s+1})\cdot C_s^m(x_{s+1}) \leq 0$.

Case 2: $C_s^i(x_{s+1}) < 0$ for all i: In this case $p_{s+1} = 0$ and we have that $\eta(y_{s+1}-p_{s+1}) \geq 0$. Since $C_s^1(x_{s+1}) < 0$, it follows that $\eta(y_{s+1}-p_{s+1})\cdot C_s^1(x_{s+1}) \leq 0$.

Case 3: Everything Else: In the remaining case, observe that since the quantities $C_s^i(x_{s+1})$ are neither all positive nor all negative, there must exist some bucket $i^*$ such that $C_s^{i^*}(x_{s+1})C_s^{i^*+1}(x_{s+1}) \leq 0$. So sampling from the specified distribution is well defined and we can compute:
\[
\begin{aligned}
\mathop{\mathbb{E}}_{p_{s+1}}\left[\eta(y_{s+1}-p_{s+1})\cdot C_s^{B^{-1}(p_{s+1})}(x_{s+1})\right] &= q\cdot\eta\left(y_{s+1}-\left(\frac{i^*}{m}-\frac{1}{rm}\right)\right)\cdot C_s^{i^*}(x_{s+1}) + (1-q)\cdot\eta\left(y_{s+1}-\frac{i^*}{m}\right)\cdot C_s^{i^*+1}(x_{s+1}) \\
&= \eta\left(y_{s+1}-\frac{i^*}{m}\right)\left(qC_s^{i^*}(x_{s+1}) + (1-q)C_s^{i^*+1}(x_{s+1})\right) + \eta q\frac{1}{rm}C_s^{i^*}(x_{s+1}) \\
&= \eta q\frac{1}{rm}C_s^{i^*}(x_{s+1}) \\
&\leq \frac{\eta}{rm}\sum_{g\in\mathcal{G}(x_{s+1})}\left(\exp(\eta V_s^{g,i^*}) - \exp(-\eta V_s^{g,i^*})\right) \\
&\leq \frac{\eta}{rm}L_s(\pi^{\leq s})
\end{aligned}
\]
We now give a concrete algorithm that implements the prediction strategy laid out in Lemma 4.6.2.

Algorithm 17 Online-Multicalibrated-Predictor(G, m, r, η)
for t = 1 to T do
    Observe $x_t$ and compute
    \[
    C_{t-1}^i(x_t) = \sum_{g\in\mathcal{G}(x_t)}\left(\exp(\eta V_{t-1}^{g,i}) - \exp(-\eta V_{t-1}^{g,i})\right)
    \]
    for all i ∈ [m].
    if $C_{t-1}^m(x_t) > 0$ then
        Predict $p_t = 1$.
    else if $C_{t-1}^1(x_t) < 0$ then
        Predict $p_t = 0$.
    else
        Select $i^* \in [m-1]$ such that $C_{t-1}^{i^*}(x_t)\cdot C_{t-1}^{i^*+1}(x_t) \leq 0$.
        Compute $q \in [0,1]$ such that:
        \[
        q\cdot C_{t-1}^{i^*}(x_t) + (1-q)\cdot C_{t-1}^{i^*+1}(x_t) = 0
        \]
        Predict $p_t = \frac{i^*}{m} - \frac{1}{rm}$ with probability q and predict $p_t = \frac{i^*}{m}$ with probability $1-q$.
    Observe $y_t$.
    Let $\pi^{<t+1} = \pi^{<t} \circ (x_t, p_t, y_t)$.

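A minimal Python sketch of Algorithm 17, transcribing the pseudocode directly; the boolean group encoding, the data stream, and the parameter values are hypothetical choices of this sketch, and no attempt is made at efficiency.

```python
import numpy as np

class OnlineMulticalibratedPredictor:
    """Sketch of Online-Multicalibrated-Predictor(G, m, r, eta) (Algorithm 17)."""

    def __init__(self, n_groups, m, r, eta, seed=0):
        self.m, self.r, self.eta = m, r, eta
        self.V = np.zeros((n_groups, m))  # V^{g,i}: running sums of y_t - p_t
        self.rng = np.random.default_rng(seed)

    def _C(self, active):
        # C_{t-1}^i(x_t) for each bucket i, summed over groups g with g(x_t) = 1.
        Va = self.V[active]
        return np.sum(np.exp(self.eta * Va) - np.exp(-self.eta * Va), axis=0)

    def predict(self, active):
        """active: boolean vector with active[g] = 1 iff g(x_t) = 1."""
        C = self._C(active)
        if C[-1] > 0:        # C^m > 0: predict 1
            return 1.0
        if C[0] < 0:         # C^1 < 0: predict 0
            return 0.0
        # Otherwise adjacent buckets whose C values straddle zero must exist.
        i = int(np.nonzero(C[:-1] * C[1:] <= 0)[0][0])   # zero-indexed i* - 1
        denom = C[i + 1] - C[i]
        q = 0.5 if denom == 0 else C[i + 1] / denom      # q C^{i*} + (1-q) C^{i*+1} = 0
        if self.rng.uniform() < q:
            return (i + 1) / self.m - 1.0 / (self.r * self.m)
        return (i + 1) / self.m

    def update(self, active, p, y):
        i = min(int(p * self.m), self.m - 1)  # bucket of p (zero-indexed)
        self.V[active, i] += y - p

# Hypothetical usage on a random stream with 5 overlapping groups:
pred = OnlineMulticalibratedPredictor(n_groups=5, m=10, r=100, eta=0.05)
rng = np.random.default_rng(4)
for _ in range(1000):
    active = rng.uniform(size=5) < 0.5
    p = pred.predict(active)
    y = float(rng.uniform() < 0.3)  # an adversary could instead pick this adaptively
    pred.update(active, p, y)
print(np.abs(pred.V).max())  # max_{g,i} |V_T^{g,i}|; multicalibration requires o(T)
```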
Let's now analyze the expected calibration loss of Algorithm 17. We start by analyzing the expected surrogate loss:
Lemma 4.6.3 Fix any set of groups G, m, r ≥ 0 and 0 ≤ η ≤ 1. Fix any adversary, which together with Online-Multicalibrated-Predictor(G, m, r, η) (Algorithm 17) fixes a distribution on transcripts π. We have that:
\[
\mathop{\mathbb{E}}_{\pi}[L_T(\pi)] \leq 2|\mathcal{G}|m\cdot\exp\left(\frac{T\eta}{rm} + 2T\eta^2\right)
\]

Proof 49 Consider the final round T. From Lemma 4.6.2, we have that for all $\pi^{<T}$, $x_T$, $y_T$:
\[
\begin{aligned}
\mathop{\mathbb{E}}_{p_T}[L_T(\pi^{\leq T})] &= L_{T-1}(\pi^{\leq T-1}) + \mathop{\mathbb{E}}_{p_T}[\Delta_T(\pi^{\leq T-1},(x_T,p_T,y_T))] \\
&\leq L_{T-1}(\pi^{\leq T-1}) + L_{T-1}(\pi^{\leq T-1})\left(\frac{\eta}{rm} + 2\eta^2\right) \\
&= L_{T-1}\left(1 + \frac{\eta}{rm} + 2\eta^2\right) \\
&\leq L_{T-1}\exp\left(\frac{\eta}{rm} + 2\eta^2\right)
\end{aligned}
\]
where the last inequality follows from $1 + x \leq \exp(x)$. Now inductively taking the expectation with respect to $p_{T-1}, p_{T-2},\ldots,p_1$ we get that:
\[
\mathop{\mathbb{E}}_{\pi}[L_T(\pi)] \leq L_0\exp\left(\frac{\eta}{rm} + 2\eta^2\right)^T = 2|\mathcal{G}|m\cdot\exp\left(\frac{T\eta}{rm} + 2T\eta^2\right)
\]
since $L_0 = \sum_{g\in\mathcal{G},i\in[m]}(\exp(0) + \exp(0)) = 2|\mathcal{G}|m$.

We are now ready to state the final guarantee of Algorithm 17.

Theorem 27 Fix any set of groups G, m, r ≥ 0. Let $\eta = \sqrt{\frac{\ln(2|\mathcal{G}|m)}{2T}} < 1$. Fix any adversary, which together with Online-Multicalibrated-Predictor(G, m, r, η) (Algorithm 17) fixes a distribution on transcripts π. We have that π satisfies (α, m)-multicalibration with respect to G where:
\[
\mathop{\mathbb{E}}_{\pi}[\alpha] \leq \frac{1}{rm} + 2\sqrt{\frac{2\ln(2|\mathcal{G}|m)}{T}}
\]
In particular, if we choose $r \geq \frac{\sqrt{T}}{\epsilon m\sqrt{2\ln(2|\mathcal{G}|m)}}$ then we have:
\[
\mathop{\mathbb{E}}_{\pi}[\alpha] \leq (2+\epsilon)\sqrt{\frac{2\ln(2|\mathcal{G}|m)}{T}}
\]
Proof 50 Recall that (α, m)-multicalibration corresponds to the requirement that $\max_{g\in\mathcal{G},i\in[m]}|V_T^{g,i}| \leq \alpha T$. Hence we need to show that:
\[
\mathop{\mathbb{E}}_{\pi}\left[\max_{g\in\mathcal{G},i\in[m]}|V_T^{g,i}|\right] \leq \frac{T}{rm} + 2\sqrt{2T\ln(2|\mathcal{G}|m)}
\]
We can compute:
\[
\begin{aligned}
\exp\left(\eta\mathop{\mathbb{E}}_{\pi}\left[\max_{g\in\mathcal{G},i\in[m]}|V_T^{g,i}|\right]\right) &\leq \mathop{\mathbb{E}}_{\pi}\left[\exp\left(\eta\max_{g\in\mathcal{G},i\in[m]}|V_T^{g,i}|\right)\right] \\
&= \mathop{\mathbb{E}}_{\pi}\left[\max_{g\in\mathcal{G},i\in[m]}\exp\left(\eta|V_T^{g,i}|\right)\right] \\
&\leq \mathop{\mathbb{E}}_{\pi}\left[\max_{g\in\mathcal{G},i\in[m]}\left(\exp(\eta V_T^{g,i}) + \exp(-\eta V_T^{g,i})\right)\right] \\
&\leq \mathop{\mathbb{E}}_{\pi}\left[\sum_{g\in\mathcal{G},i\in[m]}\left(\exp(\eta V_T^{g,i}) + \exp(-\eta V_T^{g,i})\right)\right] \\
&= \mathop{\mathbb{E}}_{\pi}[L_T(\pi)] \\
&\leq 2|\mathcal{G}|m\cdot\exp\left(\frac{T\eta}{rm} + 2T\eta^2\right)
\end{aligned}
\]
where the first inequality follows from Jensen's inequality and the convexity of exp(x), and the last inequality follows from Lemma 4.6.3. Taking the log of both sides and dividing by η gives:
\[
\mathop{\mathbb{E}}_{\pi}\left[\max_{g\in\mathcal{G},i\in[m]}|V_T^{g,i}|\right] \leq \frac{\ln(2|\mathcal{G}|m)}{\eta} + \frac{T}{rm} + 2T\eta
\]
Plugging in our chosen value of η completes the proof.
[TODO: Insert high probability bound and online-to-offline reduction.]

4.6.3 Obtaining Bucketed Quantile Multicalibration

We can analogously define a "bucketed" notion of quantile multicalibration:

Definition 33 (Bucketed Quantile Multicalibration Error in the Sequential Setting) Fix a collection of groups G, a transcript $\pi = \{(x_1,p_1,y_1),\ldots,(x_T,p_T,y_T)\}$, and a bucket coarseness parameter m. The quantile calibration error of π on a group g with respect to bucketing coarseness m and target quantile q is defined to be:
\[
Q_\infty(\pi,m,g) = \max_{i\in[m]}\frac{n(\pi,g,i)}{n(\pi,g)}\cdot\frac{\left|\sum_{t\in S(\pi,g,i)}(\mathbb{1}[y_t\leq p_t] - q)\right|}{n(\pi,g,i)}
\]
We say that π satisfies (α, m)-quantile multicalibration with respect to G if for every g ∈ G:
\[
Q_\infty(\pi,m,g) \leq \frac{\alpha T}{n(\pi,g)}
\]
Expanding out the definitions we find that, equivalently, π satisfies (α, m)-quantile multicalibration with respect to G if:
\[
\max_{g\in\mathcal{G},i\in[m]}\left|\sum_{t\in S(\pi,g,i)}(\mathbb{1}[y_t\leq p_t] - q)\right| \leq \alpha T
\]

The derivation of an algorithm for online quantile multicalibration closely mimics our derivation for mean multicalibration, so we will state lemmas without proof when the proof is exactly analogous, and focus only on the differences.

Definition 34 (Quantile Surrogate Loss) For a round s ≤ T and a transcript π, recall that $\pi^{\leq s}$ denotes the length s prefix of π. For a group g ∈ G and a bucket i ∈ [m], redefine:
\[
V_s^{g,i} = \sum_{t\in S(\pi^{\leq s},g,i)}(\mathbb{1}[y_t\leq p_t] - q).
\]
Fixing a parameter η ∈ [0, ½], continue to define the surrogate calibration loss function at round s as:
\[
L_s(\pi^{\leq s}) = \sum_{g\in\mathcal{G},i\in[m]}\left(\exp(\eta V_s^{g,i}) + \exp(-\eta V_s^{g,i})\right).
\]
When the transcript $\pi^{\leq s}$ is clear from context, we will simply write $L_s$.
We can prove a direct analogue of Lemma 4.6.1 for our new quantile surrogate loss function. All we used previously about the $V_s^{g,i}$ quantities was that they were sums of terms bounded in [−1, 1], which remains true in our quantile reformulation.

Lemma 4.6.4 Fix any partial transcript $\pi^{\leq s+1} = \pi^{\leq s} \circ (x_{s+1},p_{s+1},y_{s+1})$ such that $p_{s+1} \in B_m(i)$. Then for any η ≤ 1, we have that:
\[
\Delta_{s+1}(\pi^{\leq s+1}) \leq \eta(\mathbb{1}[y_{s+1}\leq p_{s+1}] - q)\cdot C_s^i(x_{s+1}) + 2\eta^2 L_s(\pi^{\leq s})
\]
where:
\[
C_s^i(x_{s+1}) = \sum_{g\in\mathcal{G}(x_{s+1})}\left(\exp(\eta V_s^{g,i}) - \exp(-\eta V_s^{g,i})\right)
\]
is a constant depending only on $\pi^{\leq s}$ and $x_{s+1}$.


We now come to the only lemma whose statement and proof change — the bound on how much our surrogate loss changes in expectation when we play according to our multicalibration strategy (which does not change). Our bound now depends on the Lipschitz parameter of the underlying distributions played by the adversary, and holds in expectation both over the randomness of our prediction $p_{s+1}$ and over the adversary's choice of label $y_{s+1}$.

Lemma 4.6.5 Fix any transcript $\pi^{\leq s}$ and any $x_{s+1}$. There is a distribution on predictions $p_{s+1}\in\left[\frac{1}{rm}\right]$ such that for every ρ-Lipschitz distribution over $y_{s+1}\in[0,1]$:
\[
\mathop{\mathbb{E}}_{p_{s+1},y_{s+1}}\left[\Delta_{s+1}(\pi^{\leq s},(x_{s+1},p_{s+1},y_{s+1}))\right] \leq L_s(\pi^{\leq s})\cdot\left(\frac{\eta}{\rho rm} + 2\eta^2\right)
\]
The distribution can be sampled from as follows:
1. If $C_s^i(x_{s+1}) < 0$ for all i then predict $p_{s+1} = 1$.
2. If $C_s^i(x_{s+1}) > 0$ for all i then predict $p_{s+1} = 0$.
3. Otherwise, find $i^*\in[m-1]$ such that $C_s^{i^*}(x_{s+1})C_s^{i^*+1}(x_{s+1})\leq 0$ and let $p\in[0,1]$ be such that $p\cdot C_s^{i^*}(x_{s+1}) + (1-p)C_s^{i^*+1}(x_{s+1}) = 0$. Predict $p_{s+1} = \frac{i^*}{m} - \frac{1}{rm}$ with probability p and predict $p_{s+1} = \frac{i^*}{m}$ with probability $1-p$.
Proof 51 From Lemma 4.6.4 we have that:
\[
\Delta_{s+1}(\pi^{\leq s+1}) \leq \eta(\mathbb{1}[y_{s+1}\leq p_{s+1}] - q)\cdot C_s^i(x_{s+1}) + 2\eta^2 L_s(\pi^{\leq s})
\]
where $p_{s+1}\in B_m(i)$. So it suffices to prove that:
\[
\mathop{\mathbb{E}}_{p_{s+1},y_{s+1}}\left[\eta(\mathbb{1}[y_{s+1}\leq p_{s+1}] - q)\cdot C_s^{B^{-1}(p_{s+1})}(x_{s+1})\right] \leq \frac{\eta}{\rho rm}L_s(\pi^{\leq s})
\]
where $B^{-1}(p_{s+1}) = i$ is the bucket such that $p_{s+1}\in B_m(i)$. We do this in cases.

Case 1: $C_s^i(x_{s+1}) > 0$ for all i: In this case $p_{s+1} = 0$ and we have that $\eta(\mathbb{1}[y_{s+1}\leq p_{s+1}] - q) \leq 0$. Since $C_s^1(x_{s+1}) > 0$, it follows that $\eta(\mathbb{1}[y_{s+1}\leq p_{s+1}] - q)\cdot C_s^1(x_{s+1}) \leq 0$.

Case 2: $C_s^i(x_{s+1}) < 0$ for all i: In this case $p_{s+1} = 1$ and we have that $\eta(\mathbb{1}[y_{s+1}\leq p_{s+1}] - q) \geq 0$. Since $C_s^m(x_{s+1}) < 0$, it follows that $\eta(\mathbb{1}[y_{s+1}\leq p_{s+1}] - q)\cdot C_s^m(x_{s+1}) \leq 0$.

Case 3: Everything Else: In the remaining case, observe that since the quantities $C_s^i(x_{s+1})$ are neither all positive nor all negative, there must exist some bucket $i^*$ such that $C_s^{i^*}(x_{s+1})C_s^{i^*+1}(x_{s+1}) \leq 0$. So sampling from the specified distribution is well defined and we can compute:
\[
\begin{aligned}
&\mathop{\mathbb{E}}_{p_{s+1},y_{s+1}}\left[\eta(\mathbb{1}[y_{s+1}\leq p_{s+1}] - q)\cdot C_s^{B^{-1}(p_{s+1})}(x_{s+1})\right] \\
&= p\cdot\eta\left(\Pr_{y_{s+1}}\left[y_{s+1}\leq \frac{i^*}{m}-\frac{1}{rm}\right] - q\right)\cdot C_s^{i^*}(x_{s+1}) + (1-p)\cdot\eta\left(\Pr_{y_{s+1}}\left[y_{s+1}\leq \frac{i^*}{m}\right] - q\right)\cdot C_s^{i^*+1}(x_{s+1}) \\
&\leq p\cdot\eta\left(\Pr_{y_{s+1}}\left[y_{s+1}\leq \frac{i^*}{m}\right] + \frac{1}{\rho rm} - q\right)\cdot C_s^{i^*}(x_{s+1}) + (1-p)\cdot\eta\left(\Pr_{y_{s+1}}\left[y_{s+1}\leq \frac{i^*}{m}\right] - q\right)\cdot C_s^{i^*+1}(x_{s+1}) \\
&= \eta p\frac{1}{\rho rm}C_s^{i^*}(x_{s+1}) \\
&\leq \frac{\eta}{\rho rm}L_s(\pi^{\leq s})
\end{aligned}
\]
With our new Lemma 4.6.5 in hand, the rest follows as before:
Algorithm 18 Online-Quantile-Multicalibrated-Predictor(G, m, r, η)
for t = 1 to T do
    Observe $x_t$ and compute
    \[
    C_{t-1}^i(x_t) = \sum_{g\in\mathcal{G}(x_t)}\left(\exp(\eta V_{t-1}^{g,i}) - \exp(-\eta V_{t-1}^{g,i})\right)
    \]
    for all i ∈ [m], with $V_{t-1}^{g,i}$ defined as in Definition 34.
    if $C_{t-1}^m(x_t) < 0$ then
        Predict $p_t = 1$.
    else if $C_{t-1}^1(x_t) > 0$ then
        Predict $p_t = 0$.
    else
        Select $i^*\in[m-1]$ such that $C_{t-1}^{i^*}(x_t)\cdot C_{t-1}^{i^*+1}(x_t)\leq 0$.
        Compute $p\in[0,1]$ such that:
        \[
        p\cdot C_{t-1}^{i^*}(x_t) + (1-p)\cdot C_{t-1}^{i^*+1}(x_t) = 0
        \]
        Predict $p_t = \frac{i^*}{m} - \frac{1}{rm}$ with probability p and predict $p_t = \frac{i^*}{m}$ with probability $1-p$.
    Observe $y_t$.
    Let $\pi^{<t+1} = \pi^{<t}\circ(x_t,p_t,y_t)$.
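Relative to the mean-multicalibration sketch given after Algorithm 17, only two things change: the V increments become 1[y_t ≤ p_t] − q, and the signs of the endpoint cases flip. A hypothetical Python subclass of that sketch (it assumes the earlier class and its numpy import are in scope):

```python
import numpy as np

class OnlineQuantileMulticalibratedPredictor(OnlineMulticalibratedPredictor):
    """Sketch of Algorithm 18, reusing the Algorithm 17 sketch above."""

    def __init__(self, n_groups, m, r, eta, q, seed=0):
        super().__init__(n_groups, m, r, eta, seed)
        self.q = q  # target quantile

    def predict(self, active):
        C = self._C(active)
        if C[-1] < 0:   # sign conventions flipped relative to Algorithm 17
            return 1.0
        if C[0] > 0:
            return 0.0
        i = int(np.nonzero(C[:-1] * C[1:] <= 0)[0][0])
        denom = C[i + 1] - C[i]
        p = 0.5 if denom == 0 else C[i + 1] / denom
        if self.rng.uniform() < p:
            return (i + 1) / self.m - 1.0 / (self.r * self.m)
        return (i + 1) / self.m

    def update(self, active, p, y):
        i = min(int(p * self.m), self.m - 1)
        self.V[active, i] += float(y <= p) - self.q  # 1[y_t <= p_t] - q
```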

We get the following final theorem:

Theorem 28 Fix any set of groups G, m, r ≥ 0 and q ∈ [0, 1]. Let $\eta = \sqrt{\frac{\ln(2|\mathcal{G}|m)}{2T}} < 1$. Fix any adversary who is constrained to playing ρ-Lipschitz distributions, which together with Online-Quantile-Multicalibrated-Predictor(G, m, r, η) (Algorithm 18) fixes a distribution on transcripts π. We have that π satisfies (α, m)-quantile multicalibration with respect to G and target quantile q where:
\[
\mathop{\mathbb{E}}_{\pi}[\alpha] \leq \frac{1}{\rho rm} + 2\sqrt{\frac{2\ln(2|\mathcal{G}|m)}{T}}
\]
We can similarly prove a high probability version of this theorem:

Theorem 29 Fix any set of groups G, m, r ≥ 0 and q ∈ [0, 1]. Let $\eta = \sqrt{\frac{\ln(2|\mathcal{G}|m)}{2T}} < 1$. Fix δ > 0. Fix any adversary who is constrained to playing ρ-Lipschitz distributions, which together with Online-Quantile-Multicalibrated-Predictor(G, m, r, η) (Algorithm 18) fixes a distribution on transcripts π. We have that with probability 1 − δ over the randomness of π, π satisfies (α, m)-quantile multicalibration with respect to G and target quantile q where:
\[
\alpha \leq \frac{1}{\rho rm} + 4\sqrt{\frac{2\ln\left(\frac{2|\mathcal{G}|m}{\delta}\right)}{T}}
\]

References and Further Reading


Multicalibration and Group Conditional Mean Consistency (under the name
“multiaccuracy”) were introduced in Hébert-Johnson et al. [2018] using a
slightly different definition, roughly corresponding to what we refer to as our
K∞ metric. Group conditional mean consistency (multiaccuracy) was further
studied in Kim et al. [2019]. Several different multicalibration algorithms have
been given in the literature, including one based on analyzing the Lagrangian
of a linear program Jung et al. [2021], and one based on constructing a branch-
ing program via “split” and “merge” operations Gopalan et al. [2022b], which
controls an ℓ1 variant of multicalibration closely related to our K1 metric. The
algorithm and analysis we give here (which controls multicalibration in the
K2 metric) is based on a variant of the original algorithm given by Hébert-
Johnson et al. [2018] together with a rounding operation. The algorithm we
give for quantile multicalibration and group conditional quantile consistency
is from Jung et al. [2022]. Algorithm 10 — the algorithm for obtaining group
conditional mean consistency with a one-shot minimization of squared error
is due to Parikshit Gopalan. A different analysis than we give here is given in
Gopalan et al. [2022a]. Its quantile analogue (Algorithm 11) was given in Jung
et al. [2022]. The generalization bounds we give are new as far as we know. The
online algorithm for obtaining multicalibration against an adversary is from
Gupta et al. [2022]. The full proof of the generalization theorem giving out
of sample guarantees for our batch quantile multicalibration algorithm can
be found in Jung et al. [2022]. The online algorithm for obtaining quantile
multicalibration against an adversary is an adaptation of Bastani et al. [2022]
to the ℓ∞ setting (Bastani et al. [2022] actually give a bound on an ℓ2 vari-
ant of quantile multicalibration, which is stronger than ℓ∞ multicalibration,
but with a polynomial dependence on |G|. Obtaining a bound on ℓ2 mean or
quantile multicalibration with a logarithmic dependence on |G| remains open
as of this writing.)
5
Beyond Means, Quantiles, and Calibration

CONTENTS
5.1 Beyond Means and Quantiles
5.2 Beyond Calibration

In this chapter we will study how much “beyond” multicalibration we can


go in various ways. We’ll start simple: generalizing what a “group” is. Then
we’ll ask what other distributional properties we can multicalibrate with re-
spect to (beyond means and quantiles). Finally we’ll ask what there is beyond
calibration.

5.1 Beyond Means and Quantiles



5.2 Beyond Calibration

6
Multicalibration for Real Valued Functions: When Does Multicalibration Imply Accuracy?

CONTENTS
6.1 Beyond Groups
6.2 Algorithmically Reducing Multicalibration to Regression
6.3 Weak Learning, Multicalibration, and Boosting
References and Further Reading

In this section we think about how we can view multicalibration as a boosting


algorithm for regression — i.e. a way to take a regression algorithm that has
the capacity to predict in slightly non-trivial ways (better than a constant
function), and produce an ensemble of regression functions that can predict
optimally. Along the way, we will generalize multicalibration over a set of
groups G represented by indicator functions g : X → {0, 1} to multicalibration
over a collection of arbitrary real valued functions H, where each h ∈ H is a
function h : X → R. This generalization will also be useful for us for a number
of other applications of multicalibration.

6.1 Beyond Groups


Our study of multicalibration thus far has been predicated on groups — i.e.
subsets of the feature domain X . We have represented groups by their indicator
functions g, such that g(x) = 1 if x is a member of the group, and g(x) =
0 otherwise. Recall that what we mean by (perfect) multicalibration of a
function f : X → R on a collection of groups G is that for every g ∈ G and
v ∈ R(f ):
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(y - f(x))\mid f(x) = v, g(x) = 1] = 0
\]
Since g(x) is binary, we can equivalently re-write this multicalibration condition as the requirement that for every g ∈ G and v ∈ R(f):
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[g(x)(y - f(x))\mid f(x) = v] = 0
\]

But although this is an equivalent condition to ask for when g is binary (i.e.
a group indicator function), it now makes sense to ask for this condition even
if g is an arbitrary real valued function g : X → R. We will use this as our
more general definition of multicalibration with respect to an arbitrary class
of real valued functions. We will have to define approximate versions of this
condition, and we will again use an ℓ2 -error variant:
Definition 35 (Multicalibration With Respect to Real Valued Functions) Fix a distribution D ∈ ∆Z and a model f : X → [0, 1]. Let H be an arbitrary collection of real valued functions h : X → R. We say that f is α-approximately multicalibrated with respect to D and H if for every h ∈ H:
\[
K_2(f, h, \mathcal{D}) = \sum_{v\in R(f)}\Pr_{(x,y)\sim\mathcal{D}}[f(x) = v]\left(\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[h(x)(y - v)\mid f(x) = v]\right)^2 \leq \alpha
\]

There is a close connection between a failure of a model f to be multicalibrated with respect to a class of functions H and the ability to decrease the squared error of f with a simple update using a function h ∈ H. We summarize the connection with two lemmas, one in each direction. First, suppose H contains a model h that has lower squared error than the best constant prediction on one of the level-sets of a calibrated model f. Then f is not multicalibrated with respect to H. Note that if f is calibrated, it is making the best constant prediction on each of its level-sets, so the condition that h makes predictions with lower squared error than the best constant predictor on a level-set of f is the same as the condition that h makes predictions with lower squared error than f on that level-set.
Lemma 6.1.1 Fix a calibrated model f : X → R. Suppose for some v ∈ R(f) there is an h ∈ H such that:
\[
\mathbb{E}[(f(x) - y)^2 - (h(x) - y)^2\mid f(x) = v] \geq \alpha
\]
Then it must be that:
\[
\mathbb{E}[h(x)(y - v)\mid f(x) = v] \geq \frac{\alpha}{2}
\]
Proof 52 We calculate:
\[
\begin{aligned}
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[h(x)(y - v)\mid f(x) = v] &= \mathbb{E}[h(x)y\mid f(x) = v] - v\mathbb{E}[h(x)\mid f(x) = v] \\
&= \frac{1}{2}\left(2\mathbb{E}[h(x)y\mid f(x) = v] - 2v\mathbb{E}[h(x)\mid f(x) = v]\right) \\
&\geq \frac{1}{2}\left(2\mathbb{E}[h(x)y\mid f(x)=v] - 2v\mathbb{E}[h(x)\mid f(x)=v] - \mathbb{E}[(h(x) - v)^2\mid f(x)=v]\right) \\
&= \frac{1}{2}\mathbb{E}[2h(x)y - h(x)^2 - v^2\mid f(x)=v] \\
&= \frac{1}{2}\mathbb{E}[2h(x)y - h(x)^2 - 2vy + v^2\mid f(x)=v] \\
&= \frac{1}{2}\mathbb{E}[(v - y)^2 - (h(x) - y)^2\mid f(x)=v] \\
&\geq \frac{\alpha}{2}
\end{aligned}
\]
where the third-to-last line follows from adding and subtracting $v^2$ and the fact that because f is calibrated, $v\mathbb{E}[y\mid f(x) = v] = v^2$.
In the reverse direction, we show that if a model f fails to be multicalibrated with respect to a class of functions H, then it is possible to perform a simple update on one of the level-sets of f, using a function h ∈ H that witnesses the failure of multicalibration, that decreases squared error on that level-set.
Lemma 6.1.2 Fix a model f : X → R. Suppose for some v ∈ R(f) there is an h ∈ H such that:
\[
\mathbb{E}[h(x)(y - v)\mid f(x) = v] \geq \alpha
\]
Let $h'(x) = v + \eta h(x)$ for $\eta = \frac{\alpha}{\mathbb{E}[h(x)^2\mid f(x)=v]}$. Then:
\[
\mathbb{E}[(f(x) - y)^2 - (h'(x) - y)^2\mid f(x) = v] \geq \frac{\alpha^2}{\mathbb{E}[h(x)^2\mid f(x) = v]}
\]
Proof 53 We calculate:
\[
\begin{aligned}
\mathbb{E}[(f(x)-y)^2 - (h'(x)-y)^2\mid f(x)=v] &= \mathbb{E}[(v-y)^2 - (v+\eta h(x)-y)^2\mid f(x)=v] \\
&= \mathbb{E}[v^2 - 2vy + y^2 - (v+\eta h(x))^2 + 2y(v+\eta h(x)) - y^2\mid f(x)=v] \\
&= \mathbb{E}[2y\eta h(x) - 2v\eta h(x) - \eta^2h(x)^2\mid f(x)=v] \\
&= \mathbb{E}[2\eta h(x)(y-v) - \eta^2h(x)^2\mid f(x)=v] \\
&\geq 2\eta\alpha - \eta^2\mathbb{E}[h(x)^2\mid f(x)=v] \\
&= \frac{\alpha^2}{\mathbb{E}[h(x)^2\mid f(x)=v]}
\end{aligned}
\]
where the last line follows from the definition of η.
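A numerical illustration of Lemma 6.1.2 (hypothetical data on a single level set): when h correlates with the residual y − v, the patched prediction h′ = v + ηh improves squared error by at least α²/E[h²].

```python
import numpy as np

rng = np.random.default_rng(5)
n, v = 100_000, 0.5

# Hypothetical level set {x : f(x) = v} on which f is miscalibrated via h.
h = rng.normal(size=n)
y = np.clip(v + 0.2 * h + 0.1 * rng.normal(size=n), 0, 1)

alpha = np.mean(h * (y - v))                 # E[h(x)(y - v) | f(x) = v]
eta = alpha / np.mean(h**2)
h_prime = v + eta * h                        # the update from Lemma 6.1.2

gain = np.mean((v - y) ** 2) - np.mean((h_prime - y) ** 2)
print(f"observed gain {gain:.5f} >= guaranteed {alpha**2 / np.mean(h**2):.5f}")
```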

Note that there is an asymmetry between Lemma 6.1.1 and Lemma 6.1.2. Lemma 6.1.1 implies that if h has improved squared error compared to f on one of its level-sets, then h itself fails the multicalibration condition on this level-set. On the other hand, Lemma 6.1.2 says that if h fails the multicalibration condition on some level-set v of f, then there is a function h′ = v + ηh(x) that improves on the squared error of f on level-set v. We can remove this asymmetry by assuming that H is closed under affine transformations.

Definition 36 A class of functions H is closed under affine transformations


if for every a, b ∈ R, if h(x) ∈ H then:

h′ (x) ≡ ah(x) + b ∈ H

Most natural classes of regression functions are closed under affine transfor-
mations: linear functions, polynomials of any fixed degree d, regression trees,
etc.
For classes of functions H that are closed under affine transformation, the
relationship becomes symmetric:
Lemma 6.1.3 Suppose H is closed under affine transformations. Fix a model f : X → R and a level-set v ∈ R(f) of f. Then:
1. If f is calibrated and there exists an h ∈ H such that
\[
\mathbb{E}[(f(x)-y)^2 - (h(x)-y)^2\mid f(x)=v] \geq \alpha
\]
then there exists an h′ ∈ H such that:
\[
\mathbb{E}[h'(x)(y-v)\mid f(x)=v] \geq \frac{\alpha}{2}
\]
2. If there exists an h ∈ H such that:
\[
\mathbb{E}[h(x)(y-v)\mid f(x)=v] \geq \alpha
\]
then there exists an h′ ∈ H such that:
\[
\mathbb{E}[(f(x)-y)^2 - (h'(x)-y)^2\mid f(x)=v] \geq \frac{\alpha^2}{\mathbb{E}[h(x)^2\mid f(x)=v]}
\]

Proof 54 The first part follows from Lemma 6.1.1 using h′ = h. The second part follows from Lemma 6.1.2 using h′ = v + ηh(x), where h′ ∈ H by the assumption that H is closed under affine transformations.

The equivalence between a failure of multicalibration with respect to H and the ability of some h ∈ H to improve on the squared error of f on one of f's level-sets is useful for several reasons. First, it means that we can reduce the problem of finding a model f that is multicalibrated over H to the standard regression problem of finding models h ∈ H that minimize squared error over subsets of the distribution D — a well studied problem for which we have very good algorithms for many classes H. Second, it will allow us to give a simple, intuitive characterization of the properties H must have relative to a data distribution D so that multicalibration with respect to H implies Bayes optimal prediction. Importantly, since we only have to solve regression problems on subsets of the distribution D — for which there is a fixed Bayes optimal predictor — this will make it easy for us to enunciate conditions under which multicalibration implies accuracy; this would be more difficult if we needed to solve regression problems on distributions with different conditional label distributions.

6.2 Algorithmically Reducing Multicalibration to Regression

In this section we give an algorithm for computing predictors that are multicalibrated with respect to a real-valued class of functions H. We will be interested in infinite classes H, so we will need to think about what kind of access we have to this class of functions. What we will assume is that we have access to an algorithm $A_{\mathcal{H}}$ that, given access to a distribution D, solves the squared error regression problem on D over H.
Definition 37 $A_{\mathcal{H}}$ is a squared error regression oracle for a class of real valued functions H if for every D ∈ ∆Z, $A_{\mathcal{H}}(\mathcal{D})$ outputs a function h ∈ H such that:
\[
h \in \mathop{\arg\min}_{h'\in\mathcal{H}}\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(h'(x) - y)^2]
\]
A squared error regression oracle $A_{\mathcal{H}}$ is a very natural object: for example, if H is the class of linear functions, then $A_{\mathcal{H}}$ simply solves a linear regression problem (which has a solution in closed form). Polynomial squared error regression problems can also be solved in closed form. Even for model classes (e.g. regression trees and neural networks) for which the corresponding squared error regression problem is not convex, we have very good heuristics for solving the problem. So assuming that we have a squared error regression oracle for a class H is very reasonable. We now ask: if we have such an oracle, can we leverage it to learn a multicalibrated predictor over H?
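For instance, when H is the class of linear (affine) functions, the oracle has the familiar closed form. A minimal Python sketch (the array-based dataset representation is an assumption of this sketch, not of the text):

```python
import numpy as np

def linear_regression_oracle(X, y):
    """Squared-error regression oracle A_H for H = linear (affine) functions.

    Returns h in argmin_{h in H} E_D[(h(x) - y)^2] over the empirical
    distribution on (X, y), via least squares.
    """
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # affine closure: add intercept
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return lambda Xq: np.hstack([Xq, np.ones((Xq.shape[0], 1))]) @ coef
```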
Algorithm 19 RegressionMulticalibrate(f, α, $A_{\mathcal{H}}$, D, B)
Let $m = \frac{2B}{\alpha}$.
Let $f_0 = \text{Round}(f; m)$, $err_0 = \mathbb{E}_{(x,y)\sim\mathcal{D}}[(f_0(x) - y)^2]$, $err_{-1} = \infty$ and $t = 0$.
while $(err_{t-1} - err_t) \geq \frac{\alpha}{2B}$ do
    for each $v \in [1/m]$ do
        Let $\mathcal{D}_v^{t+1} = \mathcal{D}\,|\,f_t(x) = v$.
        Let $h_v^{t+1} = A_{\mathcal{H}}(\mathcal{D}_v^{t+1})$.
    Let:
    \[
    \tilde{f}_{t+1}(x) = \sum_{v\in[1/m]}\mathbb{1}[f_t(x) = v]\cdot h_v^{t+1}(x) \qquad f_{t+1} = \text{Round}(\tilde{f}_{t+1}, m)
    \]
    Let $err_{t+1} = \mathbb{E}_{(x,y)\sim\mathcal{D}}[(f_{t+1}(x) - y)^2]$ and $t = t+1$.
Output $f_{t-1}$.

Just as in our algorithm for multicalibration over groups G (Algorithm 15), Algorithm 19 rounds its output to the discrete range $[1/m] = \{0, \frac{1}{m},\ldots,\frac{m-1}{m}, 1\}$. We recall that Round(h, m) outputs the function:
\[
\tilde{h}(x) = \mathop{\arg\min}_{v\in[1/m]}|h(x) - v|
\]
— i.e. the function that outputs the closest grid-point in [1/m] to the function value h(x).
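Putting the pieces together, here is a compact Python sketch of Algorithm 19 run on an empirical sample, using the linear-regression oracle sketched above. It represents models by their values on the sample, which suffices for illustration; the pseudocode above remains authoritative.

```python
import numpy as np

def round_to_grid(vals, m):
    """Round(h, m): snap each value to the nearest grid point in [1/m]."""
    return np.clip(np.round(vals * m) / m, 0.0, 1.0)

def regression_multicalibrate(f0_vals, X, y, oracle, alpha, B):
    """Sketch of RegressionMulticalibrate(f, alpha, A_H, D, B) on a sample."""
    m = int(np.ceil(2 * B / alpha))
    f_prev_vals, f_vals = None, round_to_grid(f0_vals, m)
    err_prev, err = np.inf, np.mean((f_vals - y) ** 2)
    while err_prev - err >= alpha / (2 * B):
        f_prev_vals = f_vals.copy()
        f_tilde = f_vals.copy()
        for v in np.unique(f_vals):          # level sets with support in the sample
            level = f_vals == v
            h_v = oracle(X[level], y[level])  # regress on D | f_t(x) = v
            f_tilde[level] = h_v(X[level])
        f_vals = round_to_grid(f_tilde, m)
        err_prev, err = err, np.mean((f_vals - y) ** 2)
    return f_prev_vals  # f_{t-1}: the last model before progress stalled

# Hypothetical usage with the linear oracle sketched earlier:
rng = np.random.default_rng(6)
X = rng.normal(size=(5000, 3))
y = np.clip(0.5 + 0.3 * np.tanh(X[:, 0]), 0, 1)
f_cal = regression_multicalibrate(np.full(5000, 0.5), X, y,
                                  linear_regression_oracle, alpha=0.05, B=4.0)
```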

Theorem 30 Fix any distribution D ∈ ∆Z, any model f : X → [0, 1], any α < 1, any class of real valued functions H that is closed under affine transformations, and a squared error regression oracle $A_{\mathcal{H}}$ for H. For any bound B > 0 let:
\[
\mathcal{H}_B = \{h\in\mathcal{H} : h(x)^2 \leq B\}
\]
be the set of functions in H with squared magnitude bounded by B. Then RegressionMulticalibrate(f, α, $A_{\mathcal{H}}$, D, B) (Algorithm 19) halts after at most $T \leq \frac{2B}{\alpha}$ many iterations and outputs a model $f_{T-1}$ such that $f_{T-1}$ is α-approximately multicalibrated with respect to D and $\mathcal{H}_B$.

Remark 6.2.1 Note the form of this theorem — we do not promise multi-
calibration at approximation parameter α for all of H, but only for HB —
i.e. those functions in H satisfying a bound on their squared value. This is
necessary, since H is closed under affine transformations. To see this, note
that if E[h(x)(y − v)] ≥ α, then it must be that E[c · h(x)(y − v)] ≥ c · α.
Since h′ (x) = ch(x) is also in H by assumption, approximate multicalibration
bounds must always also be paired with a bound on the norm of the functions
for which we promise those bounds.

Remark 6.2.2 The algorithm runs for at most $\frac{2B}{\alpha}$ iterations, and at each iteration needs to make $m + 1 = \frac{2B}{\alpha} + 1$ many calls to the squared error regression oracle $A_{\mathcal{H}}$. Thus to obtain α-approximate multicalibration with respect to $\mathcal{H}_B$, it suffices to make $\frac{4B^2}{\alpha^2} + \frac{2B}{\alpha}$ many oracle calls to a regression oracle for H.
Proof 55 Since $f_0$ takes values in [0, 1] and y ∈ [0, 1], we have $err_0 \leq 1$, and by definition $err_t \geq 0$ for all t. By construction, if the algorithm has not halted at round t it must be that $err_t \leq err_{t-1} - \frac{\alpha}{2B}$, and so the algorithm must halt after at most $T \leq \frac{2B}{\alpha}$ many iterations to avoid a contradiction.
It remains to show that when the algorithm halts at round T, the model $f_{T-1}$ that it outputs is α-approximately multicalibrated with respect to D and $\mathcal{H}_B$. We will show that if this is not the case, then $err_{T-1} - err_T > \frac{\alpha}{2B}$, which contradicts the halting criterion of the algorithm.
will be a contradiction to the halting criterion of the algorithm.
Suppose that $f_{T-1}$ is not α-approximately multicalibrated with respect to D and $\mathcal{H}_B$. This means there must be some $h\in\mathcal{H}_B$ such that:
\[
\sum_{v\in[1/m]}\Pr_{(x,y)\sim\mathcal{D}}[f_{T-1}(x) = v]\left(\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[h(x)(y-v)\mid f_{T-1}(x)=v]\right)^2 > \alpha
\]
For each $v\in[1/m]$ define
\[
\alpha_v = \Pr_{(x,y)\sim\mathcal{D}}[f_{T-1}(x)=v]\left(\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[h(x)(y-v)\mid f_{T-1}(x)=v]\right)^2
\]
So we have $\sum_{v\in[1/m]}\alpha_v > \alpha$.
Applying the second part of Lemma 6.1.3 we learn that for each v, there must be some $h_v\in\mathcal{H}$ such that:
\[
\mathbb{E}[(f_{T-1}(x)-y)^2 - (h_v(x)-y)^2\mid f_{T-1}(x)=v] > \frac{1}{\mathbb{E}[h(x)^2\mid f_{T-1}(x)=v]}\cdot\frac{\alpha_v}{\Pr_{(x,y)\sim\mathcal{D}}[f_{T-1}(x)=v]} \geq \frac{1}{B}\cdot\frac{\alpha_v}{\Pr_{(x,y)\sim\mathcal{D}}[f_{T-1}(x)=v]}
\]
where the last inequality follows from the fact that $h \in \mathcal{H}_B$. Now we can compute:
\[
\begin{aligned}
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f_{T-1}(x)-y)^2 - (\tilde{f}_T(x)-y)^2] &= \sum_{v\in[1/m]}\Pr_{(x,y)\sim\mathcal{D}}[f_{T-1}(x)=v]\,\mathbb{E}[(f_{T-1}(x)-y)^2 - (\tilde{f}_T(x)-y)^2\mid f_{T-1}(x)=v] \\
&= \sum_{v\in[1/m]}\Pr_{(x,y)\sim\mathcal{D}}[f_{T-1}(x)=v]\,\mathbb{E}[(f_{T-1}(x)-y)^2 - (h_v^T(x)-y)^2\mid f_{T-1}(x)=v] \\
&\geq \sum_{v\in[1/m]}\Pr_{(x,y)\sim\mathcal{D}}[f_{T-1}(x)=v]\,\mathbb{E}[(f_{T-1}(x)-y)^2 - (h_v(x)-y)^2\mid f_{T-1}(x)=v] \\
&\geq \sum_{v\in[1/m]}\frac{\alpha_v}{B} \\
&> \frac{\alpha}{B}
\end{aligned}
\]
Here the second equality follows from the definition of $\tilde{f}_T$, and the first inequality follows from the fact that $h_v\in\mathcal{H}$ and that $h_v^T$ minimizes squared error on $\mathcal{D}_v^T$ amongst all $h\in\mathcal{H}$.
Finally we calculate:
\[
\begin{aligned}
err_{T-1} - err_T &= \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f_{T-1}(x)-y)^2 - (f_T(x)-y)^2] \\
&= \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f_{T-1}(x)-y)^2 - (\tilde{f}_T(x)-y)^2] + \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(\tilde{f}_T(x)-y)^2 - (f_T(x)-y)^2] \\
&> \frac{\alpha}{B} + \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(\tilde{f}_T(x)-y)^2 - (f_T(x)-y)^2] \\
&\geq \frac{\alpha}{B} - \frac{1}{m} \\
&\geq \frac{\alpha}{2B}
\end{aligned}
\]
where the last inequality follows from the fact that $m \geq \frac{2B}{\alpha}$.
The second inequality follows from the fact that for every pair (x, y):
\[
(\tilde{f}_T(x)-y)^2 - (f_T(x)-y)^2 \geq -\frac{1}{m}
\]
To see this we consider two cases. Since y ∈ [0, 1], if $\tilde{f}_T(x) > 1$ or $\tilde{f}_T(x) < 0$ then the Round operation decreases squared error and we have $(\tilde{f}_T(x)-y)^2 - (f_T(x)-y)^2 \geq 0$. In the remaining case we have $\tilde{f}_T(x)\in[0,1]$ and $\Delta = \tilde{f}_T(x) - f_T(x)$ is such that $|\Delta| \leq \frac{1}{2m}$. In this case we can compute:
\[
(\tilde{f}_T(x)-y)^2 - (f_T(x)-y)^2 = (f_T(x)+\Delta-y)^2 - (f_T(x)-y)^2 = 2\Delta(f_T(x)-y) + \Delta^2 \geq -2|\Delta| + \Delta^2 \geq -\frac{1}{m}
\]

6.3 Weak Learning, Multicalibration, and Boosting


We now turn from multicalibration to “Boosting”. Our analysis of multical-
ibration algorithms has used squared error as a potential function — so we
know that post-processing a model to make it multicalibrated does not harm
accuracy (as measured by squared error). But when must multicalibration
improve accuracy meaningfully? Can we find conditions on the class H with
respect to which we are multicalibrated such that multicalibration must imply
Bayes optimality? That is what we’ll do now.

Definition 38 Fix a distribution D ∈ ∆Z and a class of functions H. Let $f^*(x) = \mathbb{E}_{y\sim\mathcal{D}(x)}[y]$ denote the true conditional label expectation conditional on x. We say that H satisfies the weak learner condition relative to D if for every S ⊂ X with $\Pr_{x\sim\mathcal{D}_{\mathcal{X}}}[x\in S] > 0$, if:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f^*(x)-y)^2\mid x\in S] < \min_{c\in\mathbb{R}}\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(c-y)^2\mid x\in S]
\]
then there exists an h ∈ H such that:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(h(x)-y)^2\mid x\in S] < \min_{c\in\mathbb{R}}\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(c-y)^2\mid x\in S]
\]

First let's pause to interpret this condition and explain why it is "weak". It is helpful to recall that f*(x) is the Bayes optimal predictor for squared error — it minimizes squared error over D over the set of all possible functions (we proved this in Lemma 3.1.2). The weak learning condition requires that for every restriction of D to some subset S ⊂ X of its domain, if the Bayes optimal predictor performs better than a constant predictor in terms of squared error, then there must be some h ∈ H that also performs better than a constant predictor. This is a weak learning assumption because it might be that f*(x) performs much better than a constant predictor, but that the best h ∈ H performs only a little bit better than a constant predictor on S — this situation is still consistent with our assumption.
Nevertheless, we will show that the weak learning assumption is enough
(together with our Algorithm 19 for multicalibration with respect to real val-
ued functions H) to boost the weak learners in H to a strong learner f —
i.e. a model f that performs as well as the optimal model f ∗ with respect to
squared error. In fact, the weak learning condition on H is both necessary and
sufficient for multicalibration of f with respect to H to imply Bayes optimality
of f . Our “boosting algorithm” will simply be our multicalibration algorithm!
First we define what we mean when we say that multicalibration with
respect to H implies Bayes optimality. Note that f ∗ (x) is multicalibrated
with respect to any set of functions, so it is not enough to require that there
exist Bayes optimal functions f that are multicalibrated with respect to H.
Instead, we have to require that every function that is multicalibrated with
respect to H is Bayes optimal:
Definition 39 Fix a distribution D ∈ ∆Z. We say that multicalibration with respect to H implies Bayes optimality over D if for every f : X → R that is multicalibrated with respect to D and H, we have:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f(x)-y)^2] = \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f^*(x)-y)^2]
\]
where $f^*(x) = \mathbb{E}_{y\sim\mathcal{D}(x)}[y]$ is the function that has minimum squared error over the set of all functions.

Theorem 31 Fix a distribution D ∈ ∆Z. Let H be a class of functions that is closed under affine transformations. Multicalibration with respect to H implies Bayes optimality over D if and only if H satisfies the weak learning condition relative to D.
Proof 56 To avoid measurability issues we assume that models f have a countable range (which is true in particular whenever X is countable) — but this assumption can be avoided with more care.
First we show that if H satisfies the weak learning condition relative to D, then multicalibration with respect to H implies Bayes optimality over D. Suppose not. Then there exists a function f that is multicalibrated with respect to D and H, but is such that:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f(x)-y)^2] > \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f^*(x)-y)^2]
\]
By linearity of expectation we have:
\[
\sum_{v\in R(f)}\Pr[f(x)=v]\cdot\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f(x)-y)^2 - (f^*(x)-y)^2\mid f(x)=v] > 0
\]
In particular there must be some $v\in R(f)$ with $\Pr_{x\sim\mathcal{D}_{\mathcal{X}}}[f(x)=v] > 0$ such that:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f(x)-y)^2\mid f(x)=v] > \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f^*(x)-y)^2\mid f(x)=v]
\]
Let $S = \{x : f(x) = v\}$. Since f is calibrated, we know that:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(v-y)^2\mid x\in S] = \min_{c\in\mathbb{R}}\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(c-y)^2\mid x\in S]
\]
Thus by the weak learning assumption there must exist some h ∈ H such that:
\[
\mathbb{E}[(v-y)^2 - (h(x)-y)^2\mid x\in S] = \mathbb{E}[(f(x)-y)^2 - (h(x)-y)^2\mid f(x)=v] > 0
\]
By Lemma 6.1.3, there must therefore exist some h′ ∈ H such that:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[h'(x)(y-v)\mid f(x)=v] > 0
\]
implying that f is not multicalibrated with respect to D and H, a contradiction.


In the reverse direction, we show that for any H that does not satisfy the
weak learning condition with respect to D, then multicalibration with respect
to H and D does not imply Bayes optimality over D. In particular, we exhibit
a function f such that f is multicalibrated with respect to H and D, but such
that:
E [(f (x) − y)2 ] > E [(f ∗ (x) − y)2 ]
(x,y)∼D (x,y)∼D

Since H does not satisfy the weak learning assumption over D, there must
exist some set S ⊆ X with Pr[x ∈ S] > 0 such that
E [(f ∗ (x) − y)2 |x ∈ S] < min E [(c − y)2 |x ∈ S]
(x,y)∼D c∈R (x,y)∼D
Multicalibration for Real Valued Functions: When Does Multicalibration Imply Accuracy? 101

but for every h ∈ H:

E [(h(x) − y)2 |x ∈ S] ≥ min E [(c − y)2 |x ∈ S]


(x,y)∼D c∈R (x,y)∼D

.
Let c(S) = E(x,y)∼D [y|x ∈ S]. We define f (x) as follows:
(
f ∗ (x) x ̸∈ S
f (x) =
c(S) x∈S

We can calculate that:

E [(f (x) − y)2 ]


(x,y)∼D

= Pr [x ∈ S] E [(c(S) − y)2 |x ∈ S] + Pr [x ̸∈ S] E [(f ∗ (x) − y)2 |x ̸∈ S]


(x,y)∼D (x,y)∼D (x,y)∼D (x,y)∼D

> Pr [x ∈ S] E [(f ∗ (x) − y)2 |x ∈ S] + Pr [x ̸∈ S] E [(f ∗ (x) − y)2 |x ̸∈ S]


(x,y)∼D (x,y)∼D (x,y)∼D (x,y)∼D

= E [(f ∗ (x) − y)2 ]


(x,y)∼D

In other words, f is not Bayes optimal. So if we can demonstrate that f is multicalibrated with respect to H and D we are done. Suppose otherwise. Then there exists some h ∈ H and some v ∈ R(f) such that
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[h(x)(y-v)\mid f(x)=v] > 0
\]
By Lemma 6.1.3, there exists some h′ ∈ H such that:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(h'(x)-y)^2\mid f(x)=v] < \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f(x)-y)^2\mid f(x)=v]
\]
We first observe that it must be that v = c(S). If this were not the case, by definition of f we would have that:
\[
\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(h'(x)-y)^2\mid f(x)=v] < \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[(f^*(x)-y)^2\mid f(x)=v]
\]
which would contradict the Bayes optimality of f*. Having established that v = c(S) we can calculate:

E [(h′ (x) − y)2 |f (x) = c(S)]


(x,y)∼D

= Pr [x ∈ S] E [(h′ (x) − y)2 |x ∈ S] +


(x,y)∼D (x,y)∼D

Pr [x ̸∈ S, f (x) = c(S)] E [(h′ (x) − y)2 |x ̸∈ S, f (x) = c(S)]


(x,y)∼D (x,y)∼D

≥ Pr [x ∈ S] E [(h′ (x) − y)2 |x ∈ S] +


(x,y)∼D (x,y)∼D

Pr [x ̸∈ S, f (x) = c(S)] E [(f (x) − y)2 |x ̸∈ S, f (x) = c(S)]


(x,y)∼D (x,y)∼D
102Uncertain: Modern Topics in Uncertainty EstimationINCOMPLETE WORKING DRAFT
where in the last inequality we have used the fact that by definition, f (x) =
f ∗ (x) for all x ̸∈ S, and so is pointwise Bayes optimal for all x ̸∈ S.
Hence the only way we can have E(x,y)∼D [(h′ (x) − y)2 |f (x) = c(S)] <
E(x,y)∼D [(f (x) − y)2 |f (x) = c(S)] is if:

E [(h′ (x) − y)2 |x ∈ S] < E [(c(S) − y)2 |x ∈ S]


(x,y)∼D (x,y)∼D

But this contradicts our assumption that H violates the weak learning condi-
tion on S, which completes the proof.

Theorem 31 characterizes when exact multicalibration with respect to H implies exact Bayes optimality and vice versa. But our Algorithm 19 only converges to approximate multicalibration over a set of functions H. What can we say about its convergence to approximate Bayes optimality when H satisfies the weak learning condition? To answer this question we'll need a quantitative version of our weak learning condition.

Definition 40 Fix a distribution D ∈ ∆Z and a class of functions H. Let f*(x) = E_{y∼D(x)}[y] denote the conditional label expectation given x. We say that H satisfies the γ-weak learning condition relative to D if for every S ⊆ X with Pr_{x∼D_X}[x ∈ S] > 0, if:

E_{(x,y)∼D}[(f*(x) − y)² | x ∈ S] < min_{c∈R} E_{(x,y)∼D}[(c − y)² | x ∈ S] − γ

then there exists an h ∈ H such that:

E_{(x,y)∼D}[(h(x) − y)² | x ∈ S] < min_{c∈R} E_{(x,y)∼D}[(c − y)² | x ∈ S] − γ

Definition 40 approaches Definition 38 as γ → 0. It says that whenever the Bayes optimal predictor improves over the best constant predictor on a set S by at least some margin γ, then there is some h ∈ H that does so as well. On the one hand, it is weaker than Definition 38 in that it does not require anything of H if the Bayes optimal predictor improves over a constant prediction by less than γ. On the other hand, it is stronger, in that it requires that some h ∈ H improve over a constant predictor on S by margin γ (rather than just infinitesimally) whenever doing so is possible.

Since the γ-weak learning condition does not make any requirements on H on sets for which f*(x) improves over a constant predictor by less than γ, the best we can hope to prove under this assumption is γ-approximate Bayes optimality, which is what we do next.

Theorem 32 Fix any distribution D ∈ ∆Z, any model f : X → [0, 1], any γ > 0, any class of real valued functions H that satisfies the γ-weak learning condition relative to D, and a squared error regression oracle A_H for H. Let α = γ and B = 1/γ (or any pair such that α/B = γ²). Then RegressionMulticalibrate(f, α, A_H, D, B) halts after at most T ≤ 2/γ² many iterations and outputs a model f_{T−1} such that f_{T−1} is 2γ-approximately Bayes optimal over D:

E_{(x,y)∼D}[(f_{T−1}(x) − y)²] ≤ E_{(x,y)∼D}[(f*(x) − y)²] + 2γ

where f*(x) = E_{y∼D(x)}[y] is the function that minimizes squared error over D.

Proof 57 At each round t before the algorithm halts, we have by construction that err_t ≤ err_{t−1} − α/(2B), and since the squared error of f₀ is at most 1, and squared error is non-negative, we must have T ≤ 2B/α = 2/γ².

Now suppose the algorithm halts at round T and outputs f_{T−1}. It must be that err_T > err_{T−1} − γ²/2. Suppose also that f_{T−1} is not 2γ-approximately Bayes optimal:

E_{(x,y)∼D}[(f_{T−1}(x) − y)² − (f*(x) − y)²] > 2γ

We can write this condition as:

Σ_{v∈[1/m]} Pr[f_{T−1}(x) = v] · E_{(x,y)∼D}[(f_{T−1}(x) − y)² − (f*(x) − y)² | f_{T−1}(x) = v] > 2γ

Define the set:

S = {v ∈ [1/m] : E_{(x,y)∼D}[(f_{T−1}(x) − y)² − (f*(x) − y)² | f_{T−1}(x) = v] ≥ γ}

to denote the set of values v in the range of f_{T−1} such that conditional on f_{T−1}(x) = v, f_{T−1} is at least γ-sub-optimal. Since we have both y ∈ [0, 1] and f_{T−1}(x) ∈ [0, 1], for every v we must have that E[(f_{T−1}(x) − y)² − (f*(x) − y)² | f_{T−1}(x) = v] ≤ 1. Therefore we can bound:

2γ < Σ_{v∈[1/m]} Pr[f_{T−1}(x) = v] · E_{(x,y)∼D}[(f_{T−1}(x) − y)² − (f*(x) − y)² | f_{T−1}(x) = v]
   ≤ Pr_{(x,y)∼D}[f_{T−1}(x) ∈ S] + (1 − Pr_{(x,y)∼D}[f_{T−1}(x) ∈ S]) · γ

Solving, we learn that:

Pr_{(x,y)∼D}[f_{T−1}(x) ∈ S] ≥ (2γ − γ)/(1 − γ) ≥ 2γ − γ = γ

Now observe that by the fact that H is assumed to satisfy the γ-weak learning assumption with respect to D, at the final round T of the algorithm, for every v ∈ S we have that h_v^T satisfies:

E_{(x,y)∼D}[(f_{T−1}(x) − y)² − (h_v^T(x) − y)² | f_{T−1}(x) = v] ≥ γ

Let err̃_T = E_{(x,y)∼D}[(f̃_T(x) − y)²]. Therefore we have:

err_{T−1} − err̃_T = Σ_{v∈[1/m]} Pr_{(x,y)∼D}[f_{T−1}(x) = v] · E_{(x,y)∼D}[(f_{T−1}(x) − y)² − (h_v^T(x) − y)² | f_{T−1}(x) = v]
   ≥ Pr_{(x,y)∼D}[f_{T−1}(x) ∈ S] · γ
   ≥ γ²

We recall that |err̃_T − err_T| ≤ 1/m = γ²/2, and so we can conclude that

err_{T−1} − err_T ≥ γ²/2

which contradicts the fact that the algorithm halted at round T, completing the proof.

References and Further Reading


Kim et al. [2019] first studied multi-accuracy (what we call group conditional
mean consistency in this book) for real valued functions, and gave a boosting
like algorithm for obtaining it. Multicalibration with respect to real valued
functions was first studied in Gopalan et al. [2022b] who gave an algorithm
based on “split” and “merge” operations, related to boosting-by-branching-
programs algorithms from the learning theory literature. Burhanpurkar et al.
[2021] first asked what properties of a set of groups G are sufficient to
guarantee that multicalibration with respect to G implies Bayes optimality —
the answer they give (which is sufficient but not necessary) is that G contains
refinements of the levelsets of the optimal predictor f ∗ . This can be viewed
as a “strong learning” assumption in comparison to our “weak learning” as-
sumption. The main results from this chapter, including the multicalibration
algorithm that operates as a reduction to squared error regression, and the
characterization that multicalibration implies Bayes Optimality if and only if
H satisfies the weak learning condition, come from Globus-Harris et al. [2023].
Part II

Applications
7
Conformal Prediction

CONTENTS
7.1 Prediction Sets and Nonconformity Scores . . . . . . . . . . . . . . . . . . . . . . 107
7.1.1 Non-Conformity Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2 A Weak Guarantee: Marginal Coverage in Expectation . . . . . . . . . 110
7.3 Dataset Conditional Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4 Dataset and Group Conditional Bounds . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5 Multivalid Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.6 Sequential Conformal Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.6.1 Sequential Marginal Coverage Guarantees . . . . . . . . . . . . . . 119
7.6.2 Sequential Multivalid Guarantees . . . . . . . . . . . . . . . . . . . . . . . 120
References and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

Thus far we have restricted our attention to regression problems (in which the
label domain Y = R), and have focused on estimating distributional quantities
of conditional label distributions, like means and quantiles. In this chapter,
we introduce a much more general framework for uncertainty quantification
that reduces a very general uncertainty quantification problem to the problem
of one dimensional quantile estimation. As a result, we will be able to draw
on our development of powerful quantile estimation techniques to give an
analogously powerful set of results for a much more general problem.

7.1 Prediction Sets and Nonconformity Scores


Suppose we have a distribution D ∈ ∆Z (although we will also consider the
sequential prediction setting in which there need not be any distribution). Our
goal is to be able to produce prediction sets as a function of observed features x
that are likely to contain the corresponding label y. More specifically, we want
to be able to find a function T : X → 2Y mapping unlabelled examples x to
subsets of labels T (x) that have the property that the true label is contained
within T (x) with some specified level of confidence 1 − δ:

Pr[y ∈ T (x)] ≈ 1 − δ


FIGURE 7.1
Images x whose labels y we might be uncertain about.

We leave unspecified for now what distribution this probability is taken over,
because we will consider a spectrum of guarantees of increasing strength, mir-
roring our treatment of mean and quantile estimation. For example, we can
ask for marginal guarantees, group conditional guarantees, calibrated guar-
antees, or ask for guarantees that hold empirically on adversarially chosen
sequences. Prediction sets can take different forms: when we are facing a regression problem (Y = R) it is natural (but not necessary) for a prediction set to take the form of an interval: T(x) = [a, b] for some a < b ∈ R. On the
other hand, for a multiclass classification problem (when Y is some unordered
discrete set), prediction sets correspond to subsets of labels — e.g. we might
have T (x) = {Blueberry Muffin, Chihuahua} for x representing images from
Figure 7.1.
Prediction sets are a very attractive way to quantify uncertainty: their
size represents a quantitative degree of uncertainty. For example, if T (x) is a
singleton, this represents certainty at the specified 1 − δ level in a particular
point prediction. But the contents of the set also provides insight into where
the uncertainty lies. For example in a classification problem, there might be a
high degree of uncertainty in the specific label, but a well crafted prediction
set might nevertheless tell us that our uncertainty is concentrated in a region
that corresponds to the same downstream action. Say, in a computer vision
setting, we might be unsure of the breed of dog in front of us—so T (x) contains
half a dozen different labels, corresponding to different breeds—but despite
this uncertainty in the specifics, this prediction set gives us a high degree of
confidence in what action to take—apply the brakes.
The main difficulty with thinking about producing prediction sets is that
they are very high dimensional objects: In a k-label multiclass classification
setting, there are 2^k different prediction sets. The main idea in conformal prediction is to reduce these high dimensional prediction sets to one-dimensional objects using a non-conformity score function s : X × Y → R.

7.1.1 Non-Conformity Scores


A “non-conformity score function” s(x, y) is typically built from some model h
for making point predictions. As a running example, let's imagine that we are in
the regression setting (Y = R) and we have solved a linear regression problem
to produce a model h : X → Y that makes point predictions. Intuitively, the
“non-conformity score” s(x, y) is supposed to communicate some measure by
which the label y differs from the prediction of the model h(x). The simplest
(often too simple) non-conformity score in this setting is:

s(x, y) = |h(x) − y|

which simply measures the deviation of the label y from the point prediction
h(x).
Any non-conformity score function s can be used to parameterize a (now
one dimensional) family of prediction sets Ts : X × R → 2Y as follows:

Ts (x, τ ) = {ŷ : s(x, ŷ) ≤ τ }

The prediction set Ts(x, τ) simply contains all labels ŷ that would produce
nonconformity score at most τ when paired with x: s(x, ŷ) ≤ τ . In the
case of our simple regression running example, this would simply correspond
to the interval centered at the point prediction h(x) that has width 2τ :
Ts (x, τ ) = [h(x) − τ, h(x) + τ ]. Although simple, a clear disadvantage of this
non-conformity score is that for fixed τ , every prediction interval T (x, τ ) has
the same width — so for methods that use a fixed value of τ — which roughly
speaking are those methods that promise only marginal coverage — the pre-
diction intervals do not give us any insight into which examples we have more
uncertainty about compared to others.
There are many other non-conformity scores that are in wide use. For
example, rather than training a regression model h that aims to predict the
mean of the conditional label distribution DY (x) (as linear regression does),
we could train quantile regression models h_{δ/2}(x), h_{1−δ/2}(x) that try to
predict the δ/2 and 1 − δ/2 quantiles of the conditional label distribution
DY (x) instead. Then a natural non-conformity score would be:

s(x, y) = max(hδ/2 (x) − y, y − h1−δ/2 (x))

This score starts with the candidate interval that directly arises from the
quantile regression method [hδ/2 (x), h1−δ/2 (x)], and measures how far the la-
bel y is from the interval — taking a positive value when y falls outside of the
interval and a negative value when it falls inside. If the interval is correct, then
the 1 − δ quantile of the nonconformity score distribution will be 0 — picking
threshold τ = 0 will get the target marginal coverage. But if the interval in-
duced by the quantile regression method is not correct, then choosing different
thresholds τ can systematically widen or shorten the prediction interval by τ
on each end: Ts (x, τ ) = [hδ/2 (x) − τ, h1−δ/2 (x) + τ ]. This non-conformity score
has the advantage that even for a fixed value of τ , the prediction intervals
Ts (x, τ ) can have very different widths, depending on the predictions of the
models hδ/2 (x) and h1−δ/2 (x).
What about for multi-class classification problems, in which D_Y(x) is a discrete distribution over k possible labels? To build intuition, suppose we were given the true conditional distribution over labels given x: for each label ŷ ∈ [k], p*_x(ŷ) = Pr[y = ŷ | x]. Let π_{p*_x} be the permutation on labels that puts them in decreasing sorted order by their underlying probability: so p*_x(π_{p*_x}(1)) ≥ p*_x(π_{p*_x}(2)) ≥ . . . ≥ p*_x(π_{p*_x}(k)). How would we find the smallest prediction set that contains the true label with probability at least 1 − δ? We would greedily add labels to our prediction set in order of their probabilities (highest probability to lowest) until the cumulative probability of the labels in our prediction set exceeded 1 − δ. To say this more formally, for each t ∈ [k], let C(t, p*_x) = Σ_{i=1}^t p*_x(π_{p*_x}(i)) denote the cumulative probability of the top t labels in likelihood sorted order. We would choose the prediction set:

T(x) = {ŷ : C(π_{p*_x}^{−1}(ŷ), p*_x) ≤ 1 − δ}

Now suppose we have a method that gives us a score function p_x : Y → [0, 1] for each example x. We might hope that p_x is the true probability distribution over labels, but we have no strong reason to believe that it is. For example, p_x might be the softmax outputs of the final layer of a neural network. We can nevertheless define the same quantities with respect to p_x: π_{p_x} is the permutation that places the labels in descending order according to p_x: p_x(π_{p_x}(1)) ≥ . . . ≥ p_x(π_{p_x}(k)), and C(t, p_x) = Σ_{i=1}^t p_x(π_{p_x}(i)) denotes the cumulative “probability” of the top t labels according to p_x. We can then define a non-conformity score:

s(x, y) = C(π_{p_x}^{−1}(y), p_x)

In the event that p_x really is the true conditional label distribution given x, then using this non-conformity score, the prediction sets T_s(x, τ) = {ŷ : C(π_{p_x}^{−1}(ŷ), p_x) ≤ τ} are the minimum size prediction sets with coverage probability τ — and even if p_x is not the true conditional distribution, there exists some τ that leads to coverage at the target coverage probability.
There are plenty of other non-conformity scores that one could consider.
But for the rest of this chapter, we won’t worry about what the non-conformity
score is — the techniques we discuss will work for any choice of nonconformity
score.
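To make the scores above concrete, here is a minimal sketch in Python (using only numpy). The point model h, the quantile models h_lo and h_hi, and the per-example probability vector p are placeholders for whatever fitted models one has on hand; nothing here is specific to how those models were trained.

```python
import numpy as np

def absolute_residual_score(h, x, y):
    # s(x, y) = |h(x) - y| for a point-prediction model h.
    return np.abs(h(x) - y)

def cqr_score(h_lo, h_hi, x, y):
    # s(x, y) = max(h_lo(x) - y, y - h_hi(x)): negative when y lies inside
    # the candidate interval [h_lo(x), h_hi(x)], positive when it lies outside.
    return np.maximum(h_lo(x) - y, y - h_hi(x))

def cumulative_probability_score(p, y):
    # p: length-k vector of (purported) label probabilities for one example x;
    # y: the integer label. Returns the cumulative mass of all labels at least
    # as probable as y under p, i.e. C(pi_p^{-1}(y), p).
    order = np.argsort(-p)               # labels in decreasing probability
    rank_of = np.empty_like(order)
    rank_of[order] = np.arange(len(p))   # position of each label in that order
    return np.cumsum(p[order])[rank_of[y]]
```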

7.2 A Weak Guarantee: Marginal Coverage in Expectation
In this section we will consider the problem of using a sample of data D ∼ D^n (that we will call a calibration set) to produce prediction sets T(x) that obtain the following coverage guarantee on new samples (x, y) ∼ D that are not contained in D:

1 − δ ≤ Pr_{D∼D^n, (x,y)∼D}[y ∈ T(x)] ≤ 1 − δ + 1/(n + 1)

This is a marginal coverage guarantee because the probability is over x as well as y, and is unconditioned. We call it a marginal guarantee in expectation because the probability is also taken over the calibration set D, and so could be expressed as:

1 − δ ≤ E_{D∼D^n}[ Pr_{(x,y)∼D}[y ∈ T(x)] ] ≤ 1 − δ + 1/(n + 1)

This is in contrast to theorems we will see later that have high probability guarantees over the randomness of the calibration set D. Nevertheless, this guarantee is very simple to obtain, and has a very mild (inverse linear) dependence on n, which makes it attractive.

Algorithm 20 SplitConformal(D, s, δ)
Let τ be the smallest value such that:

Σ_{i=1}^n 1[s(x_i, y_i) ≤ τ] ≥ (1 − δ)(n + 1)

i.e. τ is an empirical ⌈(1 − δ)(n + 1)⌉/n quantile of D.
Output the function:

T_D(x) = {ŷ : s(x, ŷ) ≤ τ}

The algorithm is simple, and given in Algorithm 20. Informally, it takes as input a calibration dataset D (of any size), a non-conformity score s (which must be defined independently of the calibration dataset D), and a target miscoverage rate δ. It computes a threshold τ that comes as close as possible to being an empirical (1 − δ)-quantile of the set of non-conformity scores induced by s on D (up to a bias correction factor of roughly (n + 1)/n), and then outputs the function T(x) ≡ T_s(x, τ) that uses the fixed threshold τ for every example x. As we have done previously in our discussion of quantile estimation, we will assume that the distribution on which we want to compute quantiles (which in this case is the induced distribution on non-conformity scores) is continuous, which simplifies things. Recall that we can always enforce this assumption by adding arbitrarily small amounts of noise from any continuous distribution to the non-conformity scores.
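Concretely, a minimal implementation of Algorithm 20 might look as follows; the function and variable names here are our own, and the usage comment sketches how it would sit in a pipeline.

```python
import numpy as np

def split_conformal_threshold(scores, delta):
    """Algorithm 20: return the smallest calibration score tau whose rank is
    at least ceil((1 - delta) * (n + 1))."""
    scores = np.sort(np.asarray(scores))
    n = len(scores)
    k = int(np.ceil((1 - delta) * (n + 1)))  # rank of the target empirical quantile
    if k > n:
        return np.inf  # n is too small for this delta: predict the full label set
    return scores[k - 1]  # the k-th smallest score (1-indexed)

# Usage sketch: compute scores = s(x_i, y_i) on a held-out calibration set
# (with s trained on separate data), then predict on a new x with
# T(x) = {y_hat : s(x, y_hat) <= tau}.
# tau = split_conformal_threshold(calibration_scores, delta=0.1)
```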

Theorem 33 Fix any distribution D ∈ ∆Z, any 0 ≤ δ ≤ 1 and any non-conformity score s : Z → R. Assume the induced distribution on non-conformity scores s(x, y) for (x, y) ∼ D is continuous. Let D ∼ D^n be a dataset of n points sampled i.i.d. from D. Then for the function T_D(x) output by SplitConformal(D, s, δ) (Algorithm 20) we have that:

1 − δ ≤ E_{D∼D^n}[ Pr_{(x,y)∼D}[y ∈ T_D(x)] ] ≤ 1 − δ + 1/(n + 1)

In fact, the only property we will use about the distribution from which D and
(x, y) are jointly drawn is that it is exchangeable, which means permutation
invariant — we will not need the stronger property that the points are drawn
i.i.d.

Proof 58 (Proof of Theorem 33) Because we have assumed that the non-conformity score distribution on s(x, y) is continuous, with probability 1 there are no ties amongst the non-conformity scores in D — i.e. for all i ≠ j, s(x_i, y_i) ≠ s(x_j, y_j). Renumber the datapoints in D in increasing order of their non-conformity scores — i.e. such that s(x_1, y_1) < s(x_2, y_2) < . . . < s(x_n, y_n). Let i* be the unique index such that s(x_{i*}, y_{i*}) = τ; by construction, i* = ⌈(1 − δ)(n + 1)⌉.

Imagine the dataset D′ = D ∪ {(x, y)} containing n + 1 elements. Consider the event y ∈ T_D(x). This occurs exactly when s(x, y) ≤ τ, which (since ties occur with probability 0) is exactly the event that the pair (x, y) occurs no later than the pair (x_{i*}, y_{i*}) when we sort the n + 1 points in D′ by their non-conformity scores. But since all of the points in D′ are exchangeable, by symmetry the point (x, y) has a rank that is uniformly random in {1, 2, . . . , n + 1} when put in sorted order within D′. Thus the event y ∈ T_D(x) is the event that (x, y) has rank at most i*, which gives:

Pr_{D,(x,y)}[y ∈ T_D(x)] = i*/(n + 1) = ⌈(1 − δ)(n + 1)⌉/(n + 1) ≥ (1 − δ)(n + 1)/(n + 1) = 1 − δ

We can similarly calculate:

Pr_{D,(x,y)}[y ∈ T_D(x)] = i*/(n + 1) = ⌈(1 − δ)(n + 1)⌉/(n + 1) ≤ ((1 − δ)(n + 1) + 1)/(n + 1) = 1 − δ + 1/(n + 1)

which completes the proof.

7.3 Dataset Conditional Bounds


The first way in which we might strengthen Theorem 33 is to give a bound
that holds with high probability over the draw of D ∼ Dn rather than only in
expectation. To do this, all we need to do is find a high probability estimate
for the 1 − δ quantile of the non-conformity score distribution, which is a
problem that we already solved in Chapter 2!

Algorithm 21 HighProbabilitySplitConformal(D, s, δ, γ)
Let τ be the smallest value such that:

(1/n) Σ_{i=1}^n 1[s(x_i, y_i) ≤ τ] ≥ (1 − δ) + √(log(2/γ)/(2n))

Output the function:

T_D(x) = {ŷ : s(x, ŷ) ≤ τ}

Theorem 34 Fix any distribution D ∈ ∆Z, any 0 ≤ δ ≤ 1 and any non-conformity score s : Z → R. Assume the induced distribution on non-conformity scores s(x, y) for (x, y) ∼ D is continuous. Let D ∼ D^n be a dataset of n points sampled i.i.d. from D. Then for the function T_D(x) output by HighProbabilitySplitConformal(D, s, δ, γ) (Algorithm 21) we have that with probability 1 − γ over the draw of D ∼ D^n:

1 − δ ≤ Pr_{(x,y)∼D}[y ∈ T_D(x)] ≤ 1 − δ + 2·√(log(2/γ)/(2n)) + 1/n
Proof 59 By construction, τ is an empirical q-quantile for the empirical distribution of scores s(x, y) over D for:

(1 − δ) + √(log(2/γ)/(2n)) ≤ q ≤ (1 − δ) + √(log(2/γ)/(2n)) + 1/n

From Theorem 2, we have that with probability 1 − γ, τ is therefore a q′-quantile for the distribution of scores s(x, y) over D for:

q − √(log(2/γ)/(2n)) ≤ q′ ≤ q + √(log(2/γ)/(2n))

Combining these two bounds, we have that with probability 1 − γ, τ is a q′-quantile for D such that:

(1 − δ) ≤ q′ ≤ (1 − δ) + 2·√(log(2/γ)/(2n)) + 1/n

Since y ∈ T_D(x) exactly when s(x, y) ≤ τ, we have that Pr[y ∈ T_D(x)] = q′, which completes the proof.
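The only change relative to the sketch of Algorithm 20 above is that we target an inflated empirical quantile, so that by the uniform convergence bound for quantiles (Theorem 2) the threshold is valid with probability 1 − γ over the calibration draw. A hedged sketch:

```python
import numpy as np

def high_prob_split_conformal_threshold(scores, delta, gamma):
    # Algorithm 21: target the empirical quantile
    # (1 - delta) + sqrt(log(2/gamma) / (2n)).
    scores = np.sort(np.asarray(scores))
    n = len(scores)
    q = (1 - delta) + np.sqrt(np.log(2 / gamma) / (2 * n))
    k = int(np.ceil(q * n))
    return np.inf if k > n else scores[k - 1]
```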

7.4 Dataset and Group Conditional Bounds


We can also ask for stronger than marginal guarantees. For example, given
an arbitrary collection of groups G ⊆ 2X , we can ask for group conditional
coverage. That is, we can ask for prediction sets TD (x) that have the property
that for every g ∈ G:

Pr_{(x,y)∼D}[y ∈ T_D(x) | g(x) = 1] = 1 − δ.

To obtain a guarantee like this, it will no longer be sufficient to parameterize T_D(x) with a single threshold τ: instead we will parameterize T_D^f(x) with a function f : X → [0, 1]:

T_D^f(x) = {ŷ : s(x, ŷ) ≤ f(x)}.

So: T_D^f(x) will have group conditional coverage guarantees if and only if f(x) has group conditional marginal quantile consistency guarantees. We know how to do this using Algorithm 13! We can apply it here:

Algorithm 22 GroupSplitConformal(D, s, G, δ, γ, ρ, σ, η)
Let q = 1 − δ.
Let λ* be a solution to the optimization problem:

Minimize_λ  E_{(x,y)∼D}[ L_q( f̂(x; λ), s(x, y) ) ] + η·∥λ∥₁

Such that:

f̂(x; λ) ≡ Σ_{g∈G} λ_g · g(x)

Output the function:

T_D^{f(·;λ*)}(x) = {ŷ : s(x, ŷ) ≤ f(x; λ*)}

Theorem 35 Fix any γ, δ, η > 0. Let G be any collection of groups. Let D ∼ D^n consist of n samples (x, y) from a distribution D that is ρ-Lipschitz and σ-anti-Lipschitz. Then with probability 1 − γ over the draw of D ∼ D^n, the function T_D^f(x) output by GroupSplitConformal(D, s, G, δ, γ, ρ, σ, η) (Algorithm 22) satisfies, for any group g that has mass μ(g) ≥ α:

1 − δ − √(α/μ(g)) ≤ Pr_{(x,y)∼D}[y ∈ T_D^f(x) | g(x) = 1] ≤ 1 − δ + √(α/μ(g))

for:

α ≤ 2ηρ/σ + (1/η)·( 8ρ·√( (ln(2/γ) + |G|·ln(1 + 2√n)) / (2n) ) + 1 )

Choosing η to minimize this expression gives:

α ≤ O( √(ρ/σ) · ( (ln(1/γ) + |G|·ln(n)) / n )^{1/4} )

Proof 60 By Theorem 20, with probability 1 − γ, the function f(x; λ*) satisfies α-approximate group conditional marginal quantile consistency on D on the set of groups g ∈ G with μ(g) ≥ α and target quantile q = 1 − δ for:

α ≤ 2ηρ/σ + (1/η)·( 8ρ·√( (ln(2/γ) + |G|·ln(1 + 2√n)) / (2n) ) + 1 )

This means that for every group g ∈ G with μ(g) ≥ α:

( Pr[s(x, y) ≤ f(x; λ*) | g(x) = 1] − (1 − δ) )² ≤ α/μ(g)

or equivalently:

(1 − δ) − √(α/μ(g)) ≤ Pr[s(x, y) ≤ f(x; λ*) | g(x) = 1] ≤ (1 − δ) + √(α/μ(g))

The result then follows from the fact that y ∈ T_D^f(x) exactly when s(x, y) ≤ f(x; λ*).
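The optimization in Algorithm 22 is ℓ1-penalized quantile (pinball) regression over the span of the group indicator functions. Below is a minimal subgradient-descent sketch, assuming L_q is the standard pinball loss; the step size and iteration count are arbitrary choices, and in practice one would hand this convex problem to an off-the-shelf solver.

```python
import numpy as np

def pinball_subgradient(pred, s, q):
    # Subgradient in pred of L_q(pred, s) = max(q*(s - pred), (q - 1)*(s - pred)):
    # -q where s > pred, and (1 - q) where s < pred.
    return np.where(s > pred, -q, 1 - q)

def group_conformal_lambda(G, s, q, eta, lr=0.01, steps=5000):
    """G: (n, |G|) 0/1 matrix of group memberships g(x_i); s: (n,) calibration
    scores; q = 1 - delta. Returns lambda with f(x) = sum_g lambda_g * g(x)."""
    n, k = G.shape
    lam = np.zeros(k)
    for _ in range(steps):
        pred = G @ lam
        grad = G.T @ pinball_subgradient(pred, s, q) / n + eta * np.sign(lam)
        lam -= lr * grad
    return lam
```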

7.5 Multivalid Bounds


In moving to group conditional coverage, we have started defining our pre-
diction sets TDf (x) not with a single threshold τ , but instead with a function
f (x) that maps each example to a potentially different threshold. In doing
so, we have created the possibility that our coverage guarantees are no longer
threshold calibrated — i.e. that conditional on the threshold f (x) that we
choose, our coverage guarantees may now fail to hold. In fact, without thresh-
old calibration, it is possible to abuse the model and obtain group conditional
coverage without providing any useful information about the data at all. Con-
sider the randomized predictor which predicts the empty prediction set (and
hence definitely fails to cover the label) with probability δ and predicts the full
prediction set (and hence definitely covers the label) with probability 1 − δ.
This predictor has 1 − δ marginal coverage not just overall, but conditional
on membership in each group! And yet it is defined independently of the
data, and so provides no useful insight. But observe that this predictor would
badly fail a threshold calibration test. Conditional on the chosen threshold,
the coverage rate is either 0 or 1, in either case bounded away from the target
1 − δ.
To correct for this, we can ask for coverage guarantees that hold conditional
on both group membership and the value of the chosen threshold, which are
called multivalid coverage guarantees. Specifically, we want that for each g ∈ G
and for each v ∈ R(f ):

Pr_{(x,y)∼D}[y ∈ T_D^f(x) | g(x) = 1, f(x) = v] = 1 − δ

Once again, we see that TDf will satisfy multivalid coverage guarantees if
and only if f satisfies quantile multicalibration for target quantile q = 1 − δ.
And once again, we know how to find such an f — use Algorithm 16!

Algorithm 23 MultivalidSplitConformal(D, s, G, δ, α, γ, ρ)
Let m = ρ²/(2α) and q = 1 − δ.
Let f₀(x) = 0 for all x and t = 0.
while f_t is not α-approximately quantile multicalibrated with respect to G and q do
  Let:

  (v_t, g_t) ∈ argmax_{(v,g)∈R(f_t)×G} Pr_{(x,y)∼D}[f_t(x) = v, g(x) = 1] · ( q − Pr_{(x,y)∼D}[y ≤ v | f_t(x) = v, g(x) = 1] )²

  ṽ_t = argmin_v | Pr_{(x,y)∼D}[y ≤ v | f_t(x) = v_t, g_t(x) = 1] − q |  and  v_t′ = Round(ṽ_t; m)

  Let f_{t+1} = h(x; f_t, v_t → v_t′, g_t) and t = t + 1.
Output the function:

T_D^{f_t}(x) = {ŷ : s(x, ŷ) ≤ f_t(x)}

Theorem 36 Fix any γ, δ, α > 0. Let G be any collection of groups. Let D ∼ D^n consist of n samples (x, y) drawn from a distribution D that is ρ-Lipschitz. With probability 1 − γ over the draw of D ∼ D^n, the function T_D^f output by MultivalidSplitConformal(D, s, G, δ, α, γ, ρ) satisfies, for every group g ∈ G and threshold v ∈ R(f):

1 − δ − √( α′ / Pr_{(x,y)∼D}[g(x) = 1, f(x) = v] ) ≤ Pr_{(x,y)∼D}[y ∈ T_D^f(x) | g(x) = 1, f(x) = v] ≤ 1 − δ + √( α′ / Pr_{(x,y)∼D}[g(x) = 1, f(x) = v] )

for:

α′ = α + 42·√( ( 3ρ²·ln(4π²T²/(3γ)) + T·ln(ρ⁴|G|/α²) ) / (2αn) )

Choosing α to optimize the bound, we get:

α ∈ Õ( ( ρ³·ln(ρ⁴|G|/δ) / n )^{1/5} )

Proof 61 By Theorem 26, we have that with probability 1 − γ, the final function f_t satisfies α′-approximate quantile multicalibration for target quantile 1 − δ, groups G, and:

α′ ≤ α + 42·√( ( 3ρ²·ln(4π²T²/(3γ)) + T·ln(ρ⁴|G|/α²) ) / (2αn) )

This means that for every group g ∈ G and v ∈ R(f_t):

Pr_{(x,y)∼D}[f_t(x) = v | g(x) = 1] · ( Pr_{(x,y)∼D}[s(x, y) ≤ f_t(x) | g(x) = 1, f_t(x) = v] − (1 − δ) )² ≤ Q₂(f_t, g) ≤ α′/μ(g)

Dividing through and taking the square root we obtain:

| Pr_{(x,y)∼D}[s(x, y) ≤ f_t(x) | g(x) = 1, f_t(x) = v] − (1 − δ) | ≤ √( α′ / Pr_{(x,y)∼D}[g(x) = 1, f_t(x) = v] )

The theorem then follows since y ∈ T_D^{f_t}(x) exactly when s(x, y) ≤ f_t(x).

7.6 Sequential Conformal Prediction


So far we have considered the problem of conformal prediction in the batch
setting, in which we have a dataset of labelled examples D that we can use to
train a model that defines a prediction set function T : X → 2Y that we can
later deploy to produce prediction sets T (x). A major advantage of this kind
of approach is that we do not need to observe labels at test time, but a major
disadvantage is that we need to make strong assumptions about the test time
distribution — generally that it is identical to the training distribution, and
that it is distributed independently — or is at least exchangeable.
In this section we apply the techniques we have developed to the sequen-
tial conformal prediction problem, which can be described as the following
interaction between a learner and an adversary. In rounds t ∈ {1, . . . , T }
1. The adversary chooses a feature vector xt ∈ X and a distribution
over labels yt ∈ Y.
2. The learner produces a prediction set Tπ<t (xt ).
3. The learner observes the realized label yt .
This interaction generates a transcript π = {(x1 , Tπ<1 (x1 ), y1 ), . . . , (xT , Tπ<T (xT ), yT )}.
The learner is an algorithm mapping transcript prefixes π <t and feature vec-
tors xt to prediction sets Tπ<t (xt ), and the adversary is a mapping from tran-
script prefixes π <t to pairs of feature vectors and label distributions X × ∆Y.

The adversary may be arbitrary, or we may impose restrictions on the label


distributions that she chooses.
The prediction sets we study will continue to be based on non-conformity
score functions s — but since we no longer require exchangeability, we will
also allow the non-conformity score function st to potentially change at every
round. So, for example, if our non-conformity score function is based on a
model f , we can use a model ft that is retrained on all of the examples
seen so far, at each round — something that breaks the exchangeability of the
non-conformity scores of past and future data by introducing a dependence
between the past data and the non-conformity score.

7.6.1 Sequential Marginal Coverage Guarantees


We can derive algorithms for adversarial sequential conformal prediction from
our algorithms for online sequential quantile prediction. For example, we can
use Algorithm 2 (which promises marginal quantile consistency against an
adversary) to obtain a sequential conformal prediction algorithm with a cor-
responding marginal coverage guarantee against an adversary. To talk about
coverage rates in the sequential setting, we write Pr(xt ,Tt (xt ),yt )∼π [·] to denote
the uniformly random selection of a record (xt , Tt (xt ), yt ) from a transcript of
T records π = {(xt , Tt (xt ), yt )}Tt=1 .

Algorithm 24 Adversarial-Marginal-Conformal-Predictor(δ, η, T)
Let q = 1 − δ + (1 + η)/(ηT) and p₁ = 0
for t = 1 to T do
  Obtain non-conformity score s_t and observe x_t.
  Predict

  T_t(x_t) = {ŷ : s_t(x_t, ŷ) ≤ p_t}

  Observe y_t.
  Let p_{t+1} = p_t + η(q − 1[s_t(x_t, y_t) ≤ p_t])

Theorem 37 Fix any δ, η > 0. Paired with any adversary, Adversarial-Marginal-Conformal-Predictor(δ, η, T) (Algorithm 24) produces a transcript such that:

1 − δ ≤ Pr_{(x_t,T_t(x_t),y_t)∼π}[y_t ∈ T_t(x_t)] ≤ 1 − δ + 2(1 + η)/(ηT)

Proof 62 This is an application of Algorithm 2. Thus we can apply Theorem 6 to conclude that the sequence of thresholds p_t produced satisfies α-approximate marginal quantile consistency with respect to target quantile q and the sequence of non-conformity scores s_t(x_t, y_t) for α ≤ (1 + η)/(ηT). This means that:

q − (1 + η)/(ηT) ≤ (1/T)·Σ_{t=1}^T 1[s_t(x_t, y_t) ≤ p_t] ≤ q + (1 + η)/(ηT)

Plugging in our definition of q and noting that y_t ∈ T_t(x_t) exactly when s_t(x_t, y_t) ≤ p_t completes the proof.
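A minimal sketch of Algorithm 24 as an online loop; `stream` is a placeholder for whatever process supplies a (possibly retrained) score function and the pair (x_t, y_t) at each round.

```python
import numpy as np

def online_marginal_conformal(stream, delta, eta, T):
    q = 1 - delta + (1 + eta) / (eta * T)
    p = 0.0
    covered_history = []
    for score_fn, x, y in stream:
        covered = float(score_fn(x, y) <= p)  # was y in T_t(x)?
        covered_history.append(covered)
        p = p + eta * (q - covered)           # raise p after a miss, lower it after a cover
    return np.mean(covered_history)           # empirical coverage over the transcript
```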

7.6.2 Sequential Multivalid Guarantees


We can similarly ask for multivalid coverage guarantees in the sequential set-
ting — i.e. coverage guarantees that remain valid conditional on both the
predicted threshold pt and on group membership. To do this, we apply Algo-
rithm 18, our algorithm for obtaining quantile multicalibrated predictions in
the sequential setting.

Algorithm 25 Online-Multivalid-Conformal-Predictor(G, m, r, η, δ)
Let q = 1 − δ.
for t = 1 to T do
  Obtain non-conformity score s_t, observe x_t, and compute

  C^i_{t−1}(x_t) = Σ_{g∈G(x_t)} ( exp(η·V^{g,i}_{t−1}) − exp(−η·V^{g,i}_{t−1}) )

  for all i ∈ [m], with V^{g,i}_{t−1} defined as:

  V^{g,i}_{t−1} = Σ_{ℓ∈S(π^{≤t−1},g,i)} (1[s_ℓ(x_ℓ, y_ℓ) ≤ p_ℓ] − q)

  if C^m_{t−1}(x_t) < 0 then
    Select p_t = 1.
  else if C^1_{t−1}(x_t) > 0 then
    Select p_t = 0.
  else
    Select i* ∈ [m] such that C^{i*}_{t−1}(x_t) · C^{i*+1}_{t−1}(x_t) ≤ 0.
    Compute p ∈ [0, 1] such that:

    p · C^{i*}_{t−1}(x_t) + (1 − p) · C^{i*+1}_{t−1}(x_t) = 0

    Select p_t = i*/m − 1/(rm) with probability p and select p_t = i*/m with probability 1 − p.
  Predict:

  T_t(x_t) = {ŷ : s_t(x_t, ŷ) ≤ p_t}

  Observe y_t.
  Let π^{<t+1} = π^{<t} ◦ (x_t, p_t, y_t)

Theorem 38 Fix any set of groups G, any m, r ≥ 0, and q ∈ [0, 1]. Let η = √(log(2|G|m)/(2T)) < 1, and fix δ > 0. Fix any adversary who is constrained for each t to playing label distributions such that the induced distribution on non-conformity scores s_t(x_t, y_t) is ρ-Lipschitz, which together with Online-Multivalid-Conformal-Predictor(G, m, r, η, δ) (Algorithm 25) fixes a distribution on transcripts π. We have that with probability 1 − γ over the randomness of π, for every group g ∈ G and every bucket i ∈ [m]:

1 − δ − α/μ_π(g, i) ≤ Pr_{(x_t,T_t(x_t),y_t)∼π}[y_t ∈ T_t(x_t) | p_t ∈ B_m(i), g(x_t) = 1] ≤ 1 − δ + α/μ_π(g, i)

where

μ_π(g, i) = Pr_{(x_t,T_t(x_t),y_t)∼π}[p_t ∈ B_m(i), g(x_t) = 1]

and

α ≤ 1/(ρrm) + 4·√( 2·ln(2|G|m/γ) / T )

Proof 63 We can apply Theorem 29 to conclude that the sequence of thresholds p_t is (α, m)-approximately quantile multicalibrated with probability 1 − γ, for target quantile q = 1 − δ. Translating this guarantee into our notation, this means that for all buckets i ∈ [m] and groups g ∈ G:

μ_π(g, i) · | q − Pr_{(x_t,T_t(x_t),y_t)∼π}[s_t(x_t, y_t) ≤ p_t | p_t ∈ B_m(i), g(x_t) = 1] | ≤ α

The theorem then follows, since s_t(x_t, y_t) ≤ p_t exactly when y_t ∈ T_t(x_t).

References and Further Reading


See Shafer and Vovk [2008] for a classical introduction to conformal predic-
tion and Angelopoulos and Bates [2021] for an excellent recent survey. The
batch conformal prediction algorithms we present here are variants of “split
conformal prediction” that use a held out calibration set — this general idea
dates back to Papadopoulos et al. [2002] and was studied in detail by Lei
et al. [2018]. Romano et al. [2019] introduced quantile regression as the ba-
sis of a non-conformity score into the conformal prediction literature, and
Angelopoulos et al. [2020] introduced a non-conformity score based on the
soft-max output of a neural network for classification problems and demon-
strated its utility on ImageNet. Romano et al. [2020] introduced the notion
of group-conditional coverage as a fairness goal, and proposed a method for
obtaining it for disjoint groups. Foygel Barber et al. [2020] also studied group
conditional coverage guarantees for intersecting groups, and gave a conserva-
tive approach based on separately calibrating a threshold for each group, and
then on a new example using the largest threshold of any group that the new
example is a member of. The kind of “high probability” dataset conditional
marginal guarantees we present were studied by Park et al. [2019]. Gibbs and
Candes [2021] studied sequential conformal prediction with marginal coverage
guarantees, and derived the algorithm we present here. Gupta et al. [2022],
Bastani et al. [2022], Jung et al. [2022] introduced techniques from multical-
ibration into the conformal prediction literature and showed how to obtain
group-wise and threshold calibrated coverage for arbitrary group structures
in both the batch and sequential settings.
8
Distribution Shift

CONTENTS
8.1 Likelihood Ratio Reweighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.2 Multicalibration under Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . 126
8.3 Why Calibration Under Distribution Shift is Useful . . . . . . . . . . . . 128
References and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

Thus far we have studied prediction in two very different models:

1. In the batch or distributional setting we assume that we have sample


access to a distribution D which we can use to train a model, that
we can then deploy; it has guarantees on new data drawn from the
same distribution.
2. In the sequential adversarial setting we assume data arrives sequen-
tially and can be worst-case/generated by an adversary. But in order
to make progress we assume that we learn the true label after each
prediction.
But what if we want the best of both worlds — to be able to train a model
on data drawn from some distribution D, but then deploy it without test time
labels on new data drawn from some other process, and still have guarantees
about our predictions?
Of course this is impossible in general, but we can say something about it
if we make assumptions about how the data distribution might shift. Suppose
that we get training data from some source distribution Ds , and then evaluate
our model on a test distribution Dt . Can we give guarantees if we assume
something about how Ds and Dt relate to one another?

8.1 Likelihood Ratio Reweighting


Our goal is to learn to make predictions about labels y from examples x. If
we are going to learn about the relationship between x and y on Ds and then

hope to do well on Dt , then this relationship had better be similar on both
distributions — in this chapter we will assume that it is the same.

Definition 41 Two distributions D^s, D^t ∈ ∆Z are said to have the same conditional label distributions if for every x ∈ X, D^s_Y(x) = D^t_Y(x). In other words, the distributions differ only in their marginal distributions on features D^s_X and D^t_X.

So, two distributions that have the same conditional label distributions
differ in the relative frequency with which different feature vectors x appear,
but agree on how labels are distributed conditional on features — so there is
some fixed “truth” that we can hope to learn.

Definition 42 (Likelihood Ratios) For each x ∈ X let p^s(x) = Pr_{D^s_X}[x] and let p^t(x) = Pr_{D^t_X}[x] denote the probability mass/density that the feature distributions D^s_X and D^t_X respectively put on x. The s → t likelihood ratio for a point x is:

w_{s→t}(x) = p^t(x)/p^s(x)
Remark 8.1.1 Observe that:

1/w_{s→t}(x) = p^s(x)/p^t(x) = w_{t→s}(x)

s → t likelihood ratios are useful because they allow us to relate expectations taken over D^s to expectations taken over D^t.

Lemma 8.1.1 Suppose D^s and D^t have the same conditional label distribution D^s_Y(x) = D^t_Y(x) = D_Y(x). Fix any S ⊆ X. For any function F : X × Y → R, we have:

Pr_{(x,y)∼D^s}[x ∈ S] · E_{(x,y)∼D^s}[w_{s→t}(x)·F(x, y) | x ∈ S] = Pr_{(x,y)∼D^t}[x ∈ S] · E_{(x,y)∼D^t}[F(x, y) | x ∈ S]

Proof 64 For simplicity assume the distribution over X is discrete (otherwise repeat the derivation below with sums replaced by integrals). We have:

Pr_{(x,y)∼D^s}[x ∈ S] · E_{(x,y)∼D^s}[w_{s→t}(x)·F(x, y) | x ∈ S]
  = Σ_{x∈S} p^s(x) · w_{s→t}(x) · E_{y∼D_Y(x)}[F(x, y)]
  = Σ_{x∈S} p^s(x) · (p^t(x)/p^s(x)) · E_{y∼D_Y(x)}[F(x, y)]
  = Σ_{x∈S} p^t(x) · E_{y∼D_Y(x)}[F(x, y)]
  = Pr_{(x,y)∼D^t}[x ∈ S] · E_{(x,y)∼D^t}[F(x, y) | x ∈ S]
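As a toy numerical check of Lemma 8.1.1 (taking S = X), consider a source feature distribution N(0, 1), a target feature distribution N(1, 1), and a shared conditional label distribution y | x ∼ N(x, 1); these particular choices are ours, made only so that the likelihood ratio has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500_000

def p_s(x): return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # source features: N(0, 1)
def p_t(x): return np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)  # target features: N(1, 1)
def w_st(x): return p_t(x) / p_s(x)                              # likelihood ratio

F = lambda x, y: y**2                    # any (sufficiently integrable) function of (x, y)

xs = rng.normal(0, 1, N); ys = xs + rng.normal(0, 1, N)   # (x, y) ~ D^s
xt = rng.normal(1, 1, N); yt = xt + rng.normal(0, 1, N)   # (x, y) ~ D^t

print(np.mean(w_st(xs) * F(xs, ys)))  # reweighted source expectation
print(np.mean(F(xt, yt)))             # target expectation; agrees up to Monte Carlo error
```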

Of course, even if we are explicitly given samples from Ds and Dt , we


will not generally know the likelihood ratios ws→t (x). A common approach
is to attempt to learn a function h from some class H that approximates
them well. Since they are a function only of x, this can be done using only unlabelled examples from D^t_X. Suppose we attempt to approximate w_{s→t}(x) using a function h. How should we evaluate our approximation error?
Definition 43 Suppose D^s and D^t have the same conditional label distribution. For a function h : X → R, we write:

e(h, w_{s→t}) = E_{x∼D^s}[|h(x) − w_{s→t}(x)|]

Similarly, for any subset S ⊆ X of the feature space, we write:

e(h, w_{s→t}, S) = E_{x∼D^s}[|h(x) − w_{s→t}(x)| | x ∈ S]

Remark 8.1.2 Observe that by the law of total probability, for any collection of sets {S₁, . . . , S_k} that partition X, we have that:

Σ_{i=1}^k Pr_{(x,y)∼D^s}[x ∈ S_i] · e(h, w_{s→t}, S_i) = e(h, w_{s→t})

The next lemma shows that if we can estimate ws→t closely in total vari-
ation distance (as measured in expectation over the source distribution Ds ),
then we can closely approximate expectations over Dt .
Lemma 8.1.2 Suppose D^s and D^t have the same conditional label distribution. Fix any S ⊆ X. For any function F : X × Y → R, and any function h : X → R, we have:

| Pr_{(x,y)∼D^s}[x ∈ S] · E_{(x,y)∼D^s}[h(x)·F(x, y) | x ∈ S] − Pr_{(x,y)∼D^t}[x ∈ S] · E_{(x,y)∼D^t}[F(x, y) | x ∈ S] |
  ≤ Pr_{(x,y)∼D^s}[x ∈ S] · max_{(x,y)∈Z} |F(x, y)| · e(h, w_{s→t}, S)

Proof 65 We know from Lemma 8.1.1 that we can write:

Pr_{(x,y)∼D^t}[x ∈ S] · E_{(x,y)∼D^t}[F(x, y) | x ∈ S] = Pr_{(x,y)∼D^s}[x ∈ S] · E_{(x,y)∼D^s}[w_{s→t}(x)·F(x, y) | x ∈ S]

So, we can calculate:

| Pr_{(x,y)∼D^s}[x ∈ S] · E_{(x,y)∼D^s}[h(x)·F(x, y) | x ∈ S] − Pr_{(x,y)∼D^t}[x ∈ S] · E_{(x,y)∼D^t}[F(x, y) | x ∈ S] |
  = | Pr_{(x,y)∼D^s}[x ∈ S] · E_{(x,y)∼D^s}[h(x)·F(x, y) | x ∈ S] − Pr_{(x,y)∼D^s}[x ∈ S] · E_{(x,y)∼D^s}[w_{s→t}(x)·F(x, y) | x ∈ S] |
  = Pr_{(x,y)∼D^s}[x ∈ S] · | E_{(x,y)∼D^s}[F(x, y)·(h(x) − w_{s→t}(x)) | x ∈ S] |
  ≤ Pr_{(x,y)∼D^s}[x ∈ S] · max_{(x,y)∈Z} |F(x, y)| · E_{(x,y)∼D^s}[|h(x) − w_{s→t}(x)| | x ∈ S]
  = Pr_{(x,y)∼D^s}[x ∈ S] · max_{(x,y)∈Z} |F(x, y)| · e(h, w_{s→t}, S)

8.2 Multicalibration under Distribution Shift


We’ll now study how multicalibration guarantees change under distribution shift, and how the class of functions H that we are multicalibrated with respect to interacts with the likelihood ratios w_{s→t}(x) defined by the shift. It will be more convenient for us to work with an ℓ1 notion of multicalibration (compared to the ℓ2 notion we gave in Definition 35).

Definition 44 (L1 Multicalibration For Real Valued Functions) Fix a distribution D ∈ ∆Z and a model f : X → [0, 1]. Let H be an arbitrary collection of real valued functions h : X → R. We say that f is α-approximately L1-multicalibrated with respect to D and H if for every h ∈ H:

K₁(f, h, D) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · | E_{(x,y)∼D}[h(x)(y − v) | f(x) = v] | ≤ α

We say that f is α-approximately L1-calibrated with respect to D if:

K₁(f, D) = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · | E_{(x,y)∼D}[(y − v) | f(x) = v] | ≤ α

Recall that we know from Lemma 3.1.1 that K₁(f, h, D) ≤ √(K₂(f, h, D)). Thus, we can use Algorithm 19 — which guarantees that K₂(f, h, D) ≤ α′ for all h ∈ H — to obtain α-approximate L1 multicalibration by setting α′ = α².

Theorem 39 Suppose D^s and D^t have the same conditional label distribution, and suppose f is α-approximately L1-multicalibrated with respect to D^s and H. Then f is also α-approximately L1-multicalibrated with respect to D^t and H_{s→t} where:

H_{s→t} = { h(x)/w_{s→t}(x) : h ∈ H }

Proof 66 Since f is α-approximately L1-multicalibrated with respect to D^s and H, we have that for every h ∈ H:

α ≥ K₁(f, h, D^s)
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^s}[f(x) = v] · | E_{(x,y)∼D^s}[h(x)(y − v) | f(x) = v] |
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^s}[f(x) = v] · | E_{(x,y)∼D^s}[ w_{s→t}(x) · (h(x)/w_{s→t}(x)) · (y − v) | f(x) = v ] |
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^t}[f(x) = v] · | E_{(x,y)∼D^t}[ (h(x)/w_{s→t}(x)) · (y − v) | f(x) = v ] |
  = K₁(f, h/w_{s→t}, D^t)

Here the second to last equality follows from applying Lemma 8.1.1 to each term

Pr_{(x,y)∼D^s}[f(x) = v] · E_{(x,y)∼D^s}[ w_{s→t}(x) · (h(x)/w_{s→t}(x)) · (y − v) | f(x) = v ]

using S = {x : f(x) = v} and F(x, y) = (h(x)/w_{s→t}(x))·(y − v).

Corollary 8.2.1 Suppose Ds and Dt have the same conditional label distri-
bution, and suppose f is α-approximately L1 -multicalibrated with respect to
Ds and H. Then if ws→t ∈ H, f has at most α L1 -calibration error on Dt :

K1 (f, Dt ) ≤ α

Proof 67 We apply Theorem 39. Since by assumption w_{s→t} ∈ H, we can choose h = w_{s→t} and find that:

α ≥ K₁(f, h/w_{s→t}, D^t)
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^t}[f(x) = v] · | E_{(x,y)∼D^t}[(h(x)/w_{s→t}(x))·(y − v) | f(x) = v] |
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^t}[f(x) = v] · | E_{(x,y)∼D^t}[(y − v) | f(x) = v] |
  = K₁(f, D^t)

Similarly, if we are approximately multicalibrated on Ds with respect to a


class H that contains a function h that is close in total variation distance to
ws→t on Ds , then we remain approximately calibrated on Dt .
Lemma 8.2.1 Suppose D^s and D^t have the same conditional label distribution, and suppose f : X → [0, 1] is α-approximately L1-multicalibrated with respect to D^s and H. Then f has L1-calibration error on D^t at most:

K₁(f, D^t) ≤ α + min_{h∈H} e(h, w_{s→t})

Proof 68 Let h* = argmin_{h∈H} e(h, w_{s→t}). Since f is α-approximately L1-multicalibrated with respect to D^s and H we have:

α ≥ K₁(f, h*, D^s)
  = Σ_{v∈R(f)} Pr_{(x,y)∼D^s}[f(x) = v] · | E_{(x,y)∼D^s}[h*(x)(y − v) | f(x) = v] |
  ≥ Σ_{v∈R(f)} Pr_{(x,y)∼D^t}[f(x) = v] · | E_{(x,y)∼D^t}[(y − v) | f(x) = v] | − Σ_{v∈R(f)} Pr_{(x,y)∼D^s}[f(x) = v] · e(h*, w_{s→t}, {x : f(x) = v})
  = K₁(f, D^t) − e(h*, w_{s→t})

where the inequality follows from Lemma 8.1.2 applied to each term

Pr_{(x,y)∼D^t}[f(x) = v] · E_{(x,y)∼D^t}[(y − v) | f(x) = v]

choosing S_v = {x : f(x) = v}, F(x, y) = (y − v), and the fact that since y, v ∈ [0, 1], max_y |y − v| ≤ 1. The final line follows from the observation that the collection {S_v}_{v∈R(f)} forms a partition of X.

8.3 Why Calibration Under Distribution Shift is Useful


On our training distribution we generally have samples of labeled data (x, y) ∼ D^s, and so we can empirically evaluate various quantities of interest. When it comes time to deploy a model, we may have unlabelled examples x ∼ D^t_X from the target distribution, but we may not have labelled examples. But if f is calibrated on D^t, then there are certain things we can do with only unlabelled examples.

One very simple thing we can do is estimate the average value of the label.

Lemma 8.3.1 Suppose f satisfies α-approximate L1 calibration on D^t: K₁(f, D^t) ≤ α. Then:

| E_{x∼D^t_X}[f(x)] − E_{(x,y)∼D^t}[y] | ≤ α

Proof 69 Expanding the definition of K₁ we can write:

α ≥ K₁(f, D^t)
  = Σ_{v∈R(f)} Pr_{x∼D^t_X}[f(x) = v] · | E_{(x,y)∼D^t}[(y − v) | f(x) = v] |
  = Σ_{v∈R(f)} | Pr_{x∼D^t_X}[f(x) = v] · E_{(x,y)∼D^t}[y | f(x) = v] − Pr_{x∼D^t_X}[f(x) = v] · v |
  ≥ Σ_{v∈R(f)} ( Pr_{x∼D^t_X}[f(x) = v] · E_{(x,y)∼D^t}[y | f(x) = v] − Pr_{x∼D^t_X}[f(x) = v] · v )
  = E_{(x,y)∼D^t}[y] − E_{x∼D^t_X}[f(x)]

The symmetric argument bounds E_{x∼D^t_X}[f(x)] − E_{(x,y)∼D^t}[y], which completes the proof.

If our label space is binary Y = {0, 1}, then we can go beyond this, and
estimate the cost of acting on any policy depending on the predictions of f .
Definition 45 Fix an action space A and a model f : X → [0, 1]. A policy
of f is any mapping ρ : [0, 1] → A that chooses an action ρ(f (x)) ∈ A as a
function of the prediction f (x).
We can evaluate the cost of a policy using a loss function:
Definition 46 Fixing an action space A, a loss function ℓ : A × {0, 1} → R
maps action/label pairs to a real valued loss. Given a distribution D and a
predictor f : X → [0, 1], the expected cost of a policy ρ is:
ℓ(ρ, f, D) = E_{(x,y)∼D}[ℓ(ρ(f(x)), y)]

We can estimate the cost of any policy ρ if we have sample access to D


— but this requires samples both of x (to compute ρ(f (x))) and y (to plug
into the second argument of ℓ(·, ·)). What if we only have sample access to
unlabelled examples from DX ? Recall that f (x) purports to estimate EDY (x) [y]
— i.e. the probability that y = 1 conditional on x. So we can attempt to
estimate ℓ(ρ, f, D) taking this as a given, using only unlabelled examples:
Definition 47 Given an action space A, a loss function ℓ : A × Y → R, a
policy ρ, a predictor f : X → R, and a feature distribution DX , the f -estimated
cost of ρ is:
ℓ̃(ρ, f, D_X) = E_{x∼D_X}[ f(x)·ℓ(ρ(f(x)), 1) + (1 − f(x))·ℓ(ρ(f(x)), 0) ]

Observe that for the Bayes optimal predictor — f*(x) = E_{D_Y(x)}[y] — the f*-estimated cost of ρ, ℓ̃(ρ, f*, D_X), is equal to its true expected cost ℓ(ρ, f*, D).

We now show that the same is true if f is not Bayes optimal, but merely calibrated.
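Computing the f-estimated cost requires only unlabelled samples, as the following sketch illustrates; `policy` and `loss` are placeholder callables with the signatures of Definitions 45 and 46.

```python
import numpy as np

def f_estimated_cost(f_vals, policy, loss):
    """The f-estimated cost of Definition 47: f_vals holds f(x) on a sample of
    unlabelled x ~ D_X; policy maps a value v in [0, 1] to an action; loss(a, y)
    is the loss of action a on label y in {0, 1}."""
    return np.mean([v * loss(policy(v), 1) + (1 - v) * loss(policy(v), 0)
                    for v in f_vals])
```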
Theorem 40 Fix an action space A, a loss function ℓ : A × {0, 1} → R, a policy ρ, a distribution D, and a predictor f : X → R. Let:

C = max_{a∈A} ( ℓ(a, 0) + ℓ(a, 1) )

If f is α-approximately L1-calibrated with respect to D, then:

| ℓ(ρ, f, D) − ℓ̃(ρ, f, D_X) | ≤ C · α
Proof 70 Let:

k_v := Pr_{x∼D_X}[f(x) = v] · | E_{(x,y)∼D}[y | f(x) = v] − v |

Since f satisfies α-approximate L1 calibration with respect to D, we know that:

K₁(f, D) = Σ_{v∈R(f)} k_v ≤ α

We can now calculate:

ℓ(ρ, f, D)
  = E_{(x,y)∼D}[ℓ(ρ(f(x)), y)]
  = Σ_{v∈R(f)} Pr_{x∼D_X}[f(x) = v] · E_{(x,y)∼D}[ℓ(ρ(f(x)), y) | f(x) = v]
  = Σ_{v∈R(f)} Pr_{x∼D_X}[f(x) = v] · ( ℓ(ρ(v), 1)·Pr_{(x,y)∼D}[y = 1 | f(x) = v] + ℓ(ρ(v), 0)·Pr_{(x,y)∼D}[y = 0 | f(x) = v] )
  = Σ_{a∈A} ℓ(a, 1) Σ_{v:ρ(v)=a} Pr_{x∼D_X}[f(x) = v] · E_{(x,y)∼D}[y | f(x) = v] + Σ_{a∈A} ℓ(a, 0) Σ_{v:ρ(v)=a} Pr_{x∼D_X}[f(x) = v] · (1 − E_{(x,y)∼D}[y | f(x) = v])
  ≤ Σ_{a∈A} ℓ(a, 1) Σ_{v:ρ(v)=a} ( Pr_{x∼D_X}[f(x) = v]·v + k_v ) + Σ_{a∈A} ℓ(a, 0) Σ_{v:ρ(v)=a} ( Pr_{x∼D_X}[f(x) = v]·(1 − v) + k_v )
  = Σ_{v∈R(f)} Pr_{x∼D_X}[f(x) = v] · ( v·ℓ(ρ(v), 1) + (1 − v)·ℓ(ρ(v), 0) ) + Σ_{v∈R(f)} k_v·( ℓ(ρ(v), 1) + ℓ(ρ(v), 0) )
  ≤ ℓ̃(ρ, f, D_X) + C · α

The other direction is identical.



A simple example of a policy and loss function arises in binary classification. Here, Y = {0, 1} and A = {0, 1}: our goal is, for each example x observed, to predict the true label by selecting a ∈ A such that a = y. The 0/1 loss function, defined as ℓ^{0/1}(a, y) = 1[a ≠ y], measures the frequency with which a given policy makes prediction mistakes.

For any calibrated predictor f (including the true conditional label expectation f = f* = E_{y∼D_Y(x)}[y]), the following policy minimizes 0/1 loss among all policies defined as a function of f:

ρ*(v) = 1 if v ≥ 1/2, and ρ*(v) = 0 if v < 1/2

Lemma 8.3.2 Fix any distribution D and any predictor f : X → [0, 1] such that K₁(f, D) ≤ α. Consider the policy ρ* defined above. For any other policy ρ : [0, 1] → {0, 1}, we have:

ℓ^{0/1}(ρ*, f, D) ≤ ℓ^{0/1}(ρ, f, D) + 2α

Proof 71 Using Theorem 40, the fact that f satisfies α-approximate L1-calibration, and the fact that for all y, ℓ^{0/1}(0, y) + ℓ^{0/1}(1, y) = 1 (so that C = 1), we can calculate:

ℓ^{0/1}(ρ, f, D) ≥ ℓ̃^{0/1}(ρ, f, D_X) − α
  = Σ_{v∈R(f)} Pr_{x∼D_X}[f(x) = v] · ( v·ℓ^{0/1}(ρ(v), 1) + (1 − v)·ℓ^{0/1}(ρ(v), 0) ) − α
  ≥ Σ_{v∈R(f)} Pr_{x∼D_X}[f(x) = v] · ( v·ℓ^{0/1}(ρ*(v), 1) + (1 − v)·ℓ^{0/1}(ρ*(v), 0) ) − α
  = ℓ̃^{0/1}(ρ*, f, D_X) − α
  ≥ ℓ^{0/1}(ρ*, f, D) − 2α

Here the first and last inequalities follow from Theorem 40. The middle inequality follows from the fact that pointwise (for each value v):

v·ℓ^{0/1}(ρ(v), 1) + (1 − v)·ℓ^{0/1}(ρ(v), 0) = 1 − v if ρ(v) = 1, and v if ρ(v) = 0

is minimized by setting ρ(v) = 1 when v ≥ 1/2 and ρ(v) = 0 when v < 1/2, which is what the policy ρ*(v) does.

So if f is calibrated on D, then not only is ρ* the optimal post-processing of f to minimize classification error on D, but we can also estimate the classification error of ρ*(f(x)) on D without the need for any labelled examples from D — since we can estimate ℓ̃^{0/1}(ρ*, f, D_X) using only samples from D_X.
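A minimal sketch of this pipeline: the threshold policy ρ*, and the f-estimated 0/1 error, which for ρ* simplifies pointwise to min(f(x), 1 − f(x)).

```python
import numpy as np

def rho_star(v):
    # The optimal post-processing of a calibrated predictor for 0/1 loss.
    return 1 if v >= 0.5 else 0

def estimated_01_error(f_vals):
    # f-estimated 0/1 loss of rho_star, computable from unlabelled x's alone:
    # on each example, rho_star errs with f-estimated probability min(v, 1 - v).
    return np.mean([min(v, 1 - v) for v in f_vals])
```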
Thus if we have trained a model that is multicalibrated on some distribution D^s with respect to some class of functions H, then for any distribution D^t such that w_{s→t} ∈ H (or is close in statistical distance to some h ∈ H), we can correctly estimate the performance of our model on D^t given access only to unlabelled data from D^t, which is much more commonly available.

References and Further Reading


Multicalibration under distribution shift was studied by Kim et al. [2022], from which this chapter primarily draws — they call the phenomenon “universal adaptability”. Baek et al. [2022] suggest in a different context that predictors that are calibrated out of distribution can be used to evaluate out-of-distribution performance using only unlabelled data.
9
Sufficient Statistics for Optimization

CONTENTS
9.1 Omnipredictors: Sufficient Statistics for Unconstrained
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.2 Sufficient Statistics for Constrained Optimization . . . . . . . . . . . . . . . 139
9.2.1 Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
9.2.2 f -estimated Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.2.3 Solving Optimization Problems Without Labelled Data 142
References and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

In Chapter 8 we started studying the problem of choosing actions a ∈ A to


minimize the expectation of a loss function ℓ : A × Y → R using some policy
ρ : [0, 1] → A that we evaluate as a function of a predictor f : X → [0, 1]. In
the case of binary labels Y = {0, 1}, we saw that if f is calibrated on D, then
for any such policy, we can accurately estimate the loss of a policy ρ(f (x))
using only unlabeled data:

ℓ(ρ, f, D) ≈ ℓ̃(ρ, f, DX )

by “pretending” that Pr[y = 1|f (x) = v] = v — so we can choose the optimal


policy ρ∗ (f (x)) under this fiction, and know that it performs as well as any
other policy that is a function only of f (x).
How strong is this guarantee? It depends. If f (x) = f ∗ (x) = E(x,y)∼D [y|x]
is the true conditional label expectation, then this guarantee means that the
optimal policy ρ∗ that is a function of f (x) is as good as any other pol-
icy, regardless of what information about x it uses. On the other hand, if
f (x) = E(x,y)∼D [y] is simply the (calibrated) constant function, then policies
ρ(f (x)) must also be constant functions, and so have necessarily very weak
performance guarantees. In this chapter we ask under what conditions on f
we can compare the performance of policies ρ(f (x)) to the performance of
policies h(x) that can depend in other ways on the features x. We assume
throughout this chapter that the label space is binary: Y = {0, 1}.


9.1 Omnipredictors: Sufficient Statistics for Unconstrained Optimization
We recall several important definitions from Chapter 8.
Definition 48 Fix an action space A and a model f : X → [0, 1]. A policy
of f is any mapping ρ : [0, 1] → A that chooses an action ρ(f (x)) ∈ A as a
function of the prediction f (x).
We can evaluate the cost of a policy using a loss function:
Definition 49 Fixing an action space A, a loss function ℓ : A × {0, 1} → R
maps action/label pairs to a real valued loss. Given a distribution D and a
predictor f : X → [0, 1], the expected cost of a policy ρ is:
ℓ(ρ, f, D) = E_{(x,y)∼D}[ℓ(ρ(f(x)), y)]

To compute the loss ℓ of a policy we need access to labeled examples


(x, y) ∼ D. But we can estimate the loss of a policy using only unlabelled
examples together with a model f if we “pretend” that f (x) actually encodes
the true conditional label expectation E[y|x]:
Definition 50 Given an action space A, a loss function ℓ : A × Y → R, a
policy ρ, a predictor f : X → R, and a feature distribution DX , the f -estimated
cost of ρ is:
ℓ̃(ρ, f, D_X) = E_{x∼D_X}[ f(x)·ℓ(ρ(f(x)), 1) + (1 − f(x))·ℓ(ρ(f(x)), 0) ]

Another nice property of the f -estimated loss is that we can find the policy
that optimizes it without needing to know anything about the underlying
distribution. Specifically, if we have a loss function ℓ in mind, we can choose
the policy ρ∗ℓ that pointwise optimizes the f -estimated cost ℓ̃:
ρ*_ℓ(v) = argmin_{a∈A} ( v·ℓ(a, 1) + (1 − v)·ℓ(a, 0) )

If f is calibrated, then the policy ρ∗ℓ has the smallest expected loss (as
measured by ℓ) of any policy that is a function of f . This statement general-
izes what we proved in Lemma 8.3.2 in the special case of 0/1 loss and has
essentially the same proof.
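For a finite action space (or a discretization of A = [0, 1]), ρ*_ℓ can be computed by brute force. A small sketch; the action grid and the squared loss below are chosen purely for illustration.

```python
import numpy as np

def optimal_policy(v, loss, actions):
    # rho*_ell(v): the action minimizing the f-estimated loss
    # v * loss(a, 1) + (1 - v) * loss(a, 0) over the candidate actions.
    scores = [v * loss(a, 1) + (1 - v) * loss(a, 0) for a in actions]
    return actions[int(np.argmin(scores))]

actions = np.linspace(0, 1, 101)
sq = lambda a, y: (a - y) ** 2
print(optimal_policy(0.3, sq, actions))  # ~0.3: for squared loss the minimizer is a = v
```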
Lemma 9.1.1 Fix any distribution D, loss function ℓ : A × {0, 1} → R, and
any predictor f : X → [0, 1] such that K1 (f, D) ≤ α. Let:
C = max_{a∈A} ( ℓ(a, 0) + ℓ(a, 1) )

Consider the policy ρ∗ℓ defined above. For any other policy ρ : [0, 1] → A, we
have:
ℓ(ρ∗ℓ , f, D) ≤ ℓ(ρ, f, D) + 2Cα

Proof 72 Using Theorem 40 and the fact that f satisfies α-approximate L1-calibration, we can calculate:

ℓ(ρ, f, D) ≥ ℓ̃(ρ, f, D_X) − Cα
  = Σ_{v∈R(f)} Pr_{x∼D_X}[f(x) = v] · ( v·ℓ(ρ(v), 1) + (1 − v)·ℓ(ρ(v), 0) ) − Cα
  ≥ Σ_{v∈R(f)} Pr_{x∼D_X}[f(x) = v] · ( v·ℓ(ρ*_ℓ(v), 1) + (1 − v)·ℓ(ρ*_ℓ(v), 0) ) − Cα
  = ℓ̃(ρ*_ℓ, f, D_X) − Cα
  ≥ ℓ(ρ*_ℓ, f, D) − 2Cα

Here the first and last inequalities follow from Theorem 40. The middle inequality follows from the fact that by definition, ρ*_ℓ(v) is, for each v, the action a ∈ A minimizing v·ℓ(a, 1) + (1 − v)·ℓ(a, 0).
Going forward, we consider the special case of A = [0, 1] and aim to show that if f is multicalibrated with respect to a class of real valued functions H, then for any convex loss function ℓ, the policy ρ∗ℓ has optimal loss not just compared to other policies ρ of f, but compared to any h ∈ H. Note that functions h : X → [0, 1] are functions of x directly, rather than functions of f(x), and so Lemma 9.1.1 does not imply that ρ∗ℓ(f(x)) has lower loss than h(x). But Lemma 9.1.1 does point in the direction of our proof strategy: we will show that if f is (approximately) multicalibrated with respect to H then in fact every h ∈ H is (almost) dominated by a policy ρ(f(x)). Thus for any loss function satisfying the conditions of our theorem, we can do (almost) as well as any h ∈ H by playing the policy ρ∗ℓ that is optimal for the f-estimated loss. As mentioned, our results will apply to any convex loss function:
Definition 51 A loss function ℓ : [0, 1] × {0, 1} → R is convex in its first
argument if for all v, v ′ , α ∈ [0, 1] and for all y ∈ {0, 1}:
ℓ(αv + (1 − α)v ′ , y) ≤ α · ℓ(v, y) + (1 − α) · ℓ(v ′ , y)
A direct consequence of convexity that we will make use of is called Jensen’s
inequality:
Claim 9.1.1 (Jensen’s Inequality) Fix any loss function ℓ : [0, 1] ×
{0, 1} → R that is convex in its first argument. For any y ∈ {0, 1} and for
any distribution P ∈ ∆[0, 1], we have:
 
E_{v∼P}[ℓ(v, y)] ≥ ℓ( E_{v∼P}[v], y )

How closely we can relate the performance of a model h(x) to the per-
formance of a policy ρ(f (x)) will depend both on the multicalibration error
that f has on the models in H and on how much small errors in prediction
are magnified by the loss function ℓ, which we will measure by its Lipschitz
constant:
Definition 52 A loss function ℓ : [0, 1] × {0, 1} → R is L-Lipschitz in its first
argument if for all v, v ′ and for all y ∈ {0, 1}:

|ℓ(v, y) − ℓ(v ′ , y)| ≤ L · |v − v ′ |

Finally we will prove a useful statement about multicalibration: if f is multicalibrated with respect to H, then for any h ∈ H, its conditional expectation (conditional on the value of f) doesn't change by very much if we additionally condition on the value of the label:

Lemma 9.1.2 Fix any distribution D and class of real valued functions H. Suppose that f is α-approximately L1-multicalibrated with respect to D and H (as defined in Definition 44) and α-approximately L1-calibrated. Then for any h ∈ H:

Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · v(1 − v) · | E_{(x,y)∼D}[h(x)|f(x) = v, y = 1] − E_{(x,y)∼D}[h(x)|f(x) = v, y = 0] | ≤ 2α

Proof 73 Let:

kv = Pr_{(x,y)∼D}[f(x) = v] · | v − E_{(x,y)∼D}[y|f(x) = v] |

By hypothesis, we know that:

α ≥ K1(f, D) = Σ_{v∈R(f)} kv.

We can now compute:

α ≥ K1(f, h, D)
  = Σ_{v∈R(f)} Pr_{(x,y)∼D}[f(x) = v] · | E_{(x,y)∼D}[h(x)(y − v)|f(x) = v] |
  = Σ_{v∈R(f)} Pr[f(x) = v] · | Pr[y = 1|f(x) = v] · E[h(x)|f(x) = v, y = 1] · (1 − v) − Pr[y = 0|f(x) = v] · E[h(x)|f(x) = v, y = 0] · v |
  ≥ Σ_{v∈R(f)} Pr[f(x) = v] · | v · E[h(x)|f(x) = v, y = 1] · (1 − v) − (1 − v) · E[h(x)|f(x) = v, y = 0] · v | − Σ_{v∈R(f)} kv · ((1 − v) + v)
  = Σ_{v∈R(f)} Pr[f(x) = v] · v(1 − v) · | E[h(x)|f(x) = v, y = 1] − E[h(x)|f(x) = v, y = 0] | − Σ_{v∈R(f)} kv
  ≥ Σ_{v∈R(f)} Pr[f(x) = v] · v(1 − v) · | E[h(x)|f(x) = v, y = 1] − E[h(x)|f(x) = v, y = 0] | − α

Rearranging, the sum is at most 2α, which is the claim.
With this lemma in hand, we can prove the main theorem of this section:
that if f is multicalibrated with respect to H, then for any convex Lipschitz
loss function ℓ, the policy ρ∗ℓ obtains loss nearly as good as the loss of the
best h ∈ H. Thus, once we train f , we can use it to optimize any such loss
function ℓ and have performance guarantees relative to H, rather than needing
to solve a fresh optimization problem for each new loss function. The proof
strategy is just as we have already laid out: show that the loss for any h ∈ H
is comparable to the loss of some policy ρ of f , and therefore only higher than
the loss of the best policy ρ∗ℓ for ℓ.

Theorem 41 Let ℓ : [0, 1] × {0, 1} → [0, 1] be a bounded loss function that is both convex and L-Lipschitz in its first argument. Suppose that f is α-approximately L1-calibrated with respect to a distribution D, and α-approximately L1-multicalibrated with respect to D and a class of real valued functions H. That is, both K1(f, D) ≤ α and, for every h ∈ H, K1(f, h, D) ≤ α. Let ρ∗ℓ be the policy that optimizes the f-estimated loss ℓ̃:

ρ∗ℓ(v) = arg min_{a∈A} (v·ℓ(a, 1) + (1 − v)·ℓ(a, 0))

Then the loss of policy ρ∗ℓ is almost as low as the loss of any h ∈ H:

ℓ(ρ∗ℓ, f, D) ≤ E_{(x,y)∼D}[ℓ(h(x), y)] + (4 + 4L)α

Proof 74 Let:

kv = Pr_{(x,y)∼D}[f(x) = v] · | v − E_{(x,y)∼D}[y|f(x) = v] |

By hypothesis, we know that:

α ≥ K1(f, D) = Σ_{v∈R(f)} kv.

Let H(v, 1) = E_{(x,y)∼D}[h(x)|f(x) = v, y = 1] and H(v, 0) = E_{(x,y)∼D}[h(x)|f(x) = v, y = 0]. Using Jensen's inequality and the convexity of ℓ, we can calculate:

E_{(x,y)∼D}[ℓ(h(x), y)]
  = Σ_{v∈R(f)} Pr[f(x) = v] · ( Pr[y = 1|f(x) = v] · E[ℓ(h(x), 1)|f(x) = v, y = 1] + Pr[y = 0|f(x) = v] · E[ℓ(h(x), 0)|f(x) = v, y = 0] )
  ≥ Σ_{v∈R(f)} Pr[f(x) = v] · ( Pr[y = 1|f(x) = v] · ℓ(H(v, 1), 1) + Pr[y = 0|f(x) = v] · ℓ(H(v, 0), 0) )
  ≥ Σ_{v∈R(f)} Pr[f(x) = v] · ( v·ℓ(H(v, 1), 1) + (1 − v)·ℓ(H(v, 0), 0) ) − 2 Σ_{v∈R(f)} kv
  ≥ Σ_{v∈R(f)} Pr[f(x) = v] · ( v·ℓ(H(v, 1), 1) + (1 − v)·ℓ(H(v, 0), 0) ) − 2α

Now consider the policy ρ defined such that:

ρ(v) = H(v, 1) if v ≥ 1/2, and ρ(v) = H(v, 0) if v < 1/2.

Let's compare the loss of this policy ρ with the loss of h. Continuing our derivation above we find:

E_{(x,y)∼D}[ℓ(h(x), y)]
  ≥ Σ_{v∈R(f)} Pr[f(x) = v] · ( v·ℓ(H(v, 1), 1) + (1 − v)·ℓ(H(v, 0), 0) ) − 2α
  ≥ Σ_{v<1/2} Pr[f(x) = v] · ( v·ℓ(H(v, 0), 1) + (1 − v)·ℓ(H(v, 0), 0) − vL|H(v, 0) − H(v, 1)| )
    + Σ_{v≥1/2} Pr[f(x) = v] · ( v·ℓ(H(v, 1), 1) + (1 − v)·ℓ(H(v, 1), 0) − (1 − v)L|H(v, 0) − H(v, 1)| ) − 2α
  = ℓ̃(ρ, f, DX) − 2α − L Σ_{v∈R(f)} Pr[f(x) = v] · min(v, 1 − v) · |H(v, 0) − H(v, 1)|
  ≥ ℓ̃(ρ, f, DX) − 2α − 2L Σ_{v∈R(f)} Pr[f(x) = v] · v(1 − v) · |H(v, 0) − H(v, 1)|
  ≥ ℓ̃(ρ, f, DX) − (2 + 4L)α
  ≥ ℓ̃(ρ∗ℓ, f, DX) − (2 + 4L)α
  ≥ ℓ(ρ∗ℓ, f, D) − (4 + 4L)α

Here, in the third to last line, we have applied Lemma 9.1.2, which tells us that:

Σ_{v∈R(f)} Pr[f(x) = v] · v(1 − v) · |H(v, 0) − H(v, 1)| ≤ 2α

In the second to last line we have used the fact that ρ∗ℓ is the minimizer of
ℓ̃(ρ, f, DX ) among all policies ρ. In the final line we have applied Theorem 40 to
relate the f -estimated loss ℓ̃ to the true loss ℓ, using the fact that K1 (f, D) ≤ α,
and that C = maxa∈A (ℓ(a, 0) + ℓ(a, 1)) is at most 2 since we have assumed
that ℓ takes values in [0, 1].
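
The following small simulation (our own sketch under synthetic-data assumptions) illustrates Theorem 41 in the idealized case α = 0: we let f be the true conditional label expectation, so it is perfectly calibrated and multicalibrated against any H, and check that the single model f post-processed by ρ∗ℓ is competitive for two different losses without any retraining.

import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(size=n)
p = 0.2 + 0.6 * x                      # true conditional label expectation E[y|x]
y = (rng.uniform(size=n) < p).astype(float)
f = p                                  # f is Bayes optimal, hence calibrated (alpha = 0)

grid = np.linspace(0, 1, 101)          # finite action grid standing in for A = [0, 1]

def omnipredict(loss):
    # Apply rho*_l(f(x)) pointwise: argmin_a f(x)*loss(a,1) + (1-f(x))*loss(a,0).
    costs = np.outer(f, loss(grid, 1)) + np.outer(1 - f, loss(grid, 0))
    return grid[np.argmin(costs, axis=1)]

h = 0.3 + 0.4 * x                      # an arbitrary benchmark model h in H
for name, loss in [("squared", lambda a, lab: (a - lab) ** 2),
                   ("absolute", lambda a, lab: np.abs(a - lab))]:
    print(name, loss(omnipredict(loss), y).mean(), "vs h:", loss(h, y).mean())

For squared loss ρ∗ℓ is (up to the grid) the identity, and for absolute loss it thresholds f(x) at 1/2, matching the worked example after Lemma 9.1.1; in both cases its loss is no worse than the benchmark's.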

9.2 Sufficient Statistics for Constrained Optimization


In Section 9.1 we showed that if f is multicalibrated with respect to H, then for any (convex, Lipschitz) loss function ℓ, using the policy ρ∗ℓ of f that is optimal for minimizing the f-estimated loss is almost as good as using the best h ∈ H, in terms of minimizing ℓ over the true data distribution. Note that this was an unconstrained optimization problem, in that there were no restrictions at all on what our policy ρ∗ℓ could look like. In this section, we consider constrained optimization problems. We continue to consider action spaces A = [0, 1] and binary label spaces Y = {0, 1}.
Definition 53 Fix a collection of real valued functions H of the form h : X →
[0, 1], a collection of group indicator functions G of the form g : X → {0, 1},
and a scalar C ∈ R. An (H, G, C)-convex minimization problem with linear
constraints is defined by:
1. An objective function ℓ : [0, 1] × {0, 1} → [−C, C] that is convex
in its first argument,
2. A collection of k constraints j each defined by a loss function
ℓj : [0, 1] × {0, 1} → [−C, C] that is affine in its first argument, a
group indicator function gj ∈ G, and a subset of labels Sj ⊆ {0, 1}.
Together they define the following optimization problem:

arg min_{P∈∆H} E_{h∼P,(x,y)∼D}[ℓ(h(x), y)]

Subject to the constraint that for each j ∈ [k]:

E_{h∼P,(x,y)∼D}[ℓj(h(x), y)|gj(x) = 1, y ∈ Sj] ≤ 0

If there is any solution P that satisfies all of the constraints, we say that the
optimization problem is feasible. We write P∗ for the solution that minimizes the objective function while satisfying the constraints, and write OPT(H) = E_{h∼P∗,(x,y)∼D}[ℓ(h(x), y)] for the objective value of an optimal feasible solution.
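
As a concrete instance of Definition 53 (our own illustrative example; the slack parameter γ is an assumption): take the objective ℓ(a, y) = (a − y)², and for each group gj ∈ G include the pair of constraints with Sj = {0, 1} and affine constraint losses ℓj(a, y) = (a − y) − γ and ℓ′j(a, y) = (y − a) − γ. The resulting program is

arg min_{P∈∆H} E_{h∼P,(x,y)∼D}[(h(x) − y)²]   subject to   | E_{h∼P,(x,y)∼D}[h(x) − y | gj(x) = 1] | ≤ γ for each j,

i.e. it asks for the most accurate randomized model whose predictions are γ-approximately group conditionally mean consistent, in the sense of Chapter 4. Here C can be taken to be 1 + γ.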

9.2.1 Convex Optimization


We now review some facts about convex optimization with linear constraints.
Definition 54 Fix a (H, G, C)-convex minimization problem with linear constraints, defined by (ℓ, {(ℓj, gj, Sj)}_{j=1}^k). The corresponding Lagrangian is the function L : ∆H × R^k_{≥0} → R defined as:

L(P, λ) = E_{h∼P,(x,y)∼D}[ℓ(h(x), y)] + Σ_{j=1}^k λj E_{h∼P,(x,y)∼D}[ℓj(h(x), y)|gj(x) = 1, y ∈ Sj]

Definition 55 Fix a (H, G, C)-convex minimization problem with linear constraints, and let L : ∆H × R^k_{≥0} → R be its Lagrangian. We say that P∗ ∈ ∆H and λ∗ ∈ R^k_{≥0} are an optimal primal/dual pair for L if we have both that:

1. P∗ ∈ arg min_{P∈∆H} L(P, λ∗), and

2. λ∗ ∈ arg max_{λ∈R^k_{≥0}} L(P∗, λ)

We'll state an important theorem in convex optimization here without proof:

Theorem 42 (Strong Duality and Complementary Slackness) Fix a feasible (H, G, C)-convex minimization problem with linear constraints, defined by (ℓ, {(ℓj, gj, Sj)}_{j=1}^k). For every optimal solution P∗, there is a corresponding vector λ∗ such that (P∗, λ∗) form an optimal primal/dual pair. Moreover, every optimal primal/dual pair (P∗, λ∗) satisfies:

1. P∗ is a feasible, optimal solution to the optimization problem, and

2. For every constraint j ∈ [k]:

λ∗j · E_{h∼P∗,(x,y)∼D}[ℓj(h(x), y)|gj(x) = 1, y ∈ Sj] = 0

The second condition is called “Complementary Slackness”.


A simple corollary of Theorem 42 is that the Lagrangian of an optimization
problem takes value OPT when evaluated at an optimal primal/dual pair.

Corollary 9.2.1 Fix a feasible (H, G, C)-convex minimization problem with linear constraints, defined by (ℓ, {(ℓj, gj, Sj)}_{j=1}^k). Let L : ∆H × R^k_{≥0} → R be its corresponding Lagrangian, and let (P∗, λ∗) be an optimal primal/dual pair for L. Then:

L(P∗, λ∗) = OPT.

Proof 75 Using both parts of Theorem 42, we can compute:

L(P∗, λ∗) = E_{h∼P∗,(x,y)∼D}[ℓ(h(x), y)] + Σ_{j=1}^k λ∗j E_{h∼P∗,(x,y)∼D}[ℓj(h(x), y)|gj(x) = 1, y ∈ Sj]
          = OPT + Σ_{j=1}^k λ∗j E_{h∼P∗,(x,y)∼D}[ℓj(h(x), y)|gj(x) = 1, y ∈ Sj]
          = OPT

Here the second to last equality follows from the fact that P∗ is an optimal solution to the optimization problem, and the last equality follows from complementary slackness.
9.2.2 f -estimated Optimization
The optimization problems we have defined (and their corresponding La-
grangians) are defined as expectations over both x and y — so in order to
evaluate a solution P (or to solve for one), we need access to labelled exam-
ples. Just as we did in Section 9.1 for unconstrained optimization, given a
model f : X → R that purports to encode f (x) = E[y|x], we can define an
f -estimated optimization problem whose definition only involves expectations
taken over features x ∼ DX .
Definition 56 Fix an (H, G, C)-convex minimization problem with objective ℓ and linear constraints defined by {(ℓj, gj, Sj)}_{j=1}^k. Fix a model f : X → [0, 1]. The corresponding f-estimated optimization problem is defined as:

arg min_{P∈∆H} E_{h∼P,x∼DX}[f(x)ℓ(h(x), 1) + (1 − f(x))ℓ(h(x), 0)]

Subject to the constraints that for each j ∈ [k] with Sj = {0, 1}:

E_{h∼P,x∼DX}[f(x)ℓj(h(x), 1) + (1 − f(x))ℓj(h(x), 0)|gj(x) = 1] ≤ 0

for each j ∈ [k] with Sj = {1}:

E_{h∼P,x∼DX}[ℓj(h(x), 1)|gj(x) = 1] ≤ 0

and for each j ∈ [k] with Sj = {0}:

E_{h∼P,x∼DX}[ℓj(h(x), 0)|gj(x) = 1] ≤ 0

If there is any solution P that satisfies all of the constraints, we say that the optimization problem is feasible. We write P̃∗ for the solution that minimizes the objective function while satisfying the constraints, and write ÕPT = E_{h∼P̃∗,x∼DX}[f(x)ℓ(h(x), 1) + (1 − f(x))ℓ(h(x), 0)] for the objective value of an optimal feasible solution.
We can similarly define the f -estimated Lagrangian:
Definition 57 Fix an f -estimated (H, G, C)-convex minimization problem
with linear constraints, defined by (ℓ, {(ℓj , gj , Sj )}kj=1 ). Partition the con-
straints such that C0 = {j ∈ [k] : Sj = {0}}, C1 = {j ∈ [k] : Sj = {1}},
and C01 = {j ∈ [k] : Sj = {0, 1}}.
The corresponding f-estimated Lagrangian is the function L̃ : ∆H × R^k_{≥0} → R defined as:

L̃(P, λ) = E_{h∼P,x∼DX}[f(x)ℓ(h(x), 1) + (1 − f(x))ℓ(h(x), 0)]
  + Σ_{j∈C0} λj E_{h∼P,x∼DX}[ℓj(h(x), 0)|gj(x) = 1] + Σ_{j∈C1} λj E_{h∼P,x∼DX}[ℓj(h(x), 1)|gj(x) = 1]
  + Σ_{j∈C01} λj E_{h∼P,x∼DX}[f(x)ℓj(h(x), 1) + (1 − f(x))ℓj(h(x), 0)|gj(x) = 1]

9.2.3 Solving Optimization Problems Without Labelled Data
Our goal is to derive a constrained optimization analogue of our results from
Section 9.1, which were for unconstrained optimization. Namely, we would like
to train a single model f using labelled data from D, such that f is sufficient to
solve a wide variety of downstream constrained optimization problems using
only unlabelled data x ∼ DX and minimal additional computation. The main
idea will be to train a predictor f that is multicalibrated with respect to G,
H, and their corresponding product class:
Definition 58 Fix two classes of functions G and H mapping features to real
numbers. The product class is defined as:

G · H = {g(x) · h(x) : g(x) ∈ G, h(x) ∈ H}

We will argue that if we have such a predictor f, then if we want to solve some (H, G, C)-convex minimization problem with linear constraints, it will
be sufficient to solve its corresponding f -estimated variant in which H has
been replaced by Hall , the set of all functions f : X → [0, 1], which (as we
will see) is a computationally easier task, and one that does not require a
randomized solution.
Definition 59 We write Hall = {f : X → [0, 1]} for the set of all real-valued
functions mapping features to the unit interval.
In particular, the optimal solution h to an f -estimated (Hall , G, C)-
optimization problem will be a policy in the sense that we can write it in
the form h(x) = ρ(f(x)) that depends only on f(x).

Lemma 9.2.1 Fix any model f : X → [0, 1] and any f -estimated (Hall , G, C)-
convex optimization problem with linear constraints. Let h ∈ Hall be an op-
timal solution to the problem. Then h(x) can be written as a policy of f (x):
h(x) = ρ(f (x)) for some ρ : [0, 1] → [0, 1].

Proof 76

Theorem 43 Fix any feasible (H, G, C)-convex minimization problem with linear constraints, defined by (ℓ, {(ℓj, gj, Sj)}_{j=1}^k). Fix any f : X → [0, 1] that is α-approximately L1-calibrated and L1-multicalibrated with respect to H, G, and H · G. Let P̃∗ be an optimal solution to the corresponding f-estimated (Hall, G, C)-optimization problem. Then we have that P̃∗ is approximately optimal according to the original objective function and approximately satisfies the original constraints:
BLAH

Proof 77 First we argue about the objective value. From Theorem 42, we know that there exists a λ̃∗ such that (P̃∗, λ̃∗) form an optimal primal/dual pair for the corresponding f-estimated Lagrangian L̃. From Corollary 9.2.1 we know that L̃(P̃∗, λ̃∗) = ÕPT. Similarly, let P∗ be an optimal solution to the original (H, G, C)-optimization problem. We know from Theorem 42 that there exists a λ∗ such that (P∗, λ∗) form an optimal primal/dual pair for the corresponding Lagrangian L, and from Corollary 9.2.1 that L(P∗, λ∗) = OPT. We also know from Theorem 40, since f is α-approximately calibrated with respect to H, that:

||

Thus we can calculate:

ÕPT = L̃(P̃∗, λ̃∗)

Can prove this if we need it


We recall a piece of notation from our earlier chapters: µ(g, D) =
Pr(x,y)∼D [g(x) = 1].

Lemma 9.2.2 Fix any distribution D ∈ ∆Z, any class of group indicator functions G containing functions g : X → {0, 1}, and any class of real valued functions H containing functions h : X → R. For each g ∈ G, let Dg = D|g(x) = 1 denote the distribution conditional on g(x) = 1. Suppose a model f : X → R is α-approximately L1-multicalibrated with respect to D and G · H. Then for every g ∈ G and h ∈ H:

K1(f, h, Dg) ≤ α / µ(g, D)

Proof 78 By hypothesis we know that for every h ∈ H and g ∈ G:

K1(f, g·h, D) = Σ_{v∈R(f)} Pr[f(x) = v] · | E[g(x)h(x)(y − v)|f(x) = v] | ≤ α

Fix any g ∈ G with µ(g, D) > 0. We can now calculate:

K1(f, h, Dg) = Σ_{v∈R(f)} Pr[f(x) = v|g(x) = 1] · | E[h(x)(y − v)|g(x) = 1, f(x) = v] |

References and Further Reading


The fact that a model f that is multicalibrated with respect to a class of functions H can be post-processed in such a way as to be competitive with any h ∈ H as measured by any convex Lipschitz loss function was proven in Gopalan et al. [2022b], who called such models “omnipredictors”. Gopalan et al. [2022b] use
a slightly different notion of calibration than we do (based on partitions of the
feature space and covariance), but if a model satisfies our notion of multicali-
bration and is also calibrated, then it also satisfies the covariance based notion
and vice versa. Gopalan et al. [2022a] give an incomparable omniprediction
theorem — they show that group conditional mean consistency together with
(marginal) calibration is sufficient to be competitive with any h ∈ H on any
Lipschitz loss function ℓ (no longer requiring convexity of ℓ or full multical-
ibration) — but in general this requires group conditional mean consistency
with respect to all level sets of functions in H, rather than just with H itself.
10
Ensembling, Model Multiplicity, and the
Reference Class Problem

CONTENTS
10.1 Reference Classes and Model Multiplicity . . . . . . . . . . . . . . . . . . . . . . . 147
10.2 Model Ensembling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.3 Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
References and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

Suppose we have a prediction problem of some import: perhaps we are selling life insurance, and we want to predict the probability that particular customers
will die in the next 12 months. We are in a familiar regression setting, in
which we have some space of individuals X and would like a model f : X →
[0, 1], where ideally f (Bob) should have the semantics that “f (Bob) is the
probability that Bob will die in the next 12 months”. But what does this mean?
We are predicting a probability for a single event that will occur or not — there
are to be no repeated trials for which we can measure an empirical frequency.
If I propose a model f1 that purports to assign individual probabilities to
people like Bob, and you propose a different model f2 , how are we to resolve
which model is “better”?

10.1 Reference Classes and Model Multiplicity


Suppose we posit that there are true “individual probabilities” underlying
reality — i.e. that there is in principle some number pBob that represents
the probability that Bob will die in the next 12 months. This is after all
the formalism that has underlain our studies so far: we have been modeling
the world as if there is a distribution D over labelled examples (x, y), and
for each individual x a conditional label distribution DY (x). We still cannot
get access to these individual probabilities through data. Nevertheless, we
know that the function f ∗ encoding the true conditional label distribution

f∗(x) = E_{y∼D(x)}[y] minimizes the expected Brier score:

f∗ ∈ arg min_{f:X→[0,1]} E_{(x,y)∼D}[(f(x) − y)²]

Hence if we have two models such that B(f1 ) < B(f2 ), this falsifies the hy-
pothesis that f2 = f ∗ — i.e. it cannot be the case that f2 represents the true
individual probabilities, and gives us an empirical (and practical!) justification
for adopting model f1 rather than model f2 .
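
To see why Brier score comparisons have this falsifying power, it helps to write out a standard decomposition (a filled-in step, not in the text above). For any model f, conditioning on x and writing f∗(x) = E_{y∼D(x)}[y]:

B(f) = E_{(x,y)∼D}[(f(x) − y)²] = E_x[(f(x) − f∗(x))²] + E_{(x,y)∼D}[(f∗(x) − y)²]

since the cross term 2·E[(f(x) − f∗(x))(f∗(x) − y)] vanishes after conditioning on x. The second term does not depend on f, so f∗ minimizes expected Brier score, and any model whose Brier score can be strictly improved cannot equal f∗.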
The “model multiplicity” problem refers to the worry that there may be
multiple models f1 , f2 that are equally accurate (such that B(f1 ) = B(f2 ))
that disagree in their predictions. In this case, accuracy gives us no basis
on which to reject either model, and yet if f1 (Bob) is very different from
f2 (Bob), what basis do we have to act on our predictions? Are we justified in
denying Bob life insurance if it seems unprofitable according to the individual
probability assigned by f2 but seems profitable according to the individual
probability assigned by f1 ?
This can indeed be a problem if the models f are chosen to optimize
accuracy in some fixed class. But as we will see, the situation cannot arise if
the parties proposing their models are willing to update (and improve!) their
models in the face of evidence that can be found in the data before them and
in the competing models that are proposed! The updates needed will be of
exactly the same simple “patch” form that we have studied when deriving
algorithms for multicalibration and group conditional mean consistency.

10.2 Model Ensembling


Suppose we are given two models f1 , f2 : X → [0, 1]. We will be interested in
regions in which these models disagree substantially in their predictions. We
will define “substantially” by an arbitrarily small discretization parameter ϵ:
Definition 60 Two models f1 and f2 have an ϵ-disagreement on a point x ∈
X if |f1 (x) − f2 (x)| > ϵ.
Let Uϵ (f1 , f2 ) be the set of points on which f1 and f2 have an ϵ-
disagreement:
Uϵ (f1 , f2 ) = {x : |f1 (x) − f2 (x)| > ϵ}
Informally, we will say that f1 and f2 agree on x if they do not have an ϵ-disagreement on x. We will show a quantitative version of the following statement. It must be the case that either

1. f1 and f2 agree on almost all of their predictions, or


2. f1, or f2, or both can be proven from the data to violate a group conditional mean consistency condition on a large set of points. In this case, the falsified model can be patched using our patch operations in a way that improves its accuracy.
The result is that there can be no substantial disagreements about indi-
vidual probabilities by people who are willing to be convinced by the evidence
of the data before them: models which disagree on a substantial fraction of
their predictions witness for each other places in which their predictions are
falsified by the data, and provide the means to correct (and improve) each
other. Thus disagreements can be leveraged to produce improved models, and
this process necessarily converges only when the models agree.
To formalize this, we start by partitioning the set of ϵ-disagreements
Uϵ (f1 , f2 ) into two additional sets that will be important — the set of dis-
agreements on which f1 (x) > f2 (x), and the set of disagreements on which
f1 (x) < f2 (x).

Definition 61 Fix any two models f1 , f2 : X → [0, 1] and any ϵ > 0. Define
the sets:
Uϵ> (f1 , f2 ) = {x ∈ Uϵ (f1 , f2 ) : f1 (x) > f2 (x)}
Uϵ< (f1 , f2 ) = {x ∈ Uϵ (f1 , f2 ) : f1 (x) < f2 (x)}
Based on these sets, for • ∈ {>, <} and i ∈ {1, 2} define the quantities:

v∗• = E_{(x,y)∼D}[y | x ∈ Uϵ•(f1, f2)]   vi• = E_{(x,y)∼D}[fi(x) | x ∈ Uϵ•(f1, f2)]

Lemma 10.2.1 Fix any two models f1 , f2 : X → [0, 1] and any ϵ > 0.
If the fraction of points on which f1 and f2 have an ϵ disagreement has
mass µ(Uϵ(f1, f2)) = α, then for some • ∈ {>, <} and some i ∈ {1, 2}, we have that:

µ(Uϵ•(f1, f2)) · (v∗• − vi•)² ≥ αϵ²/8
Proof 79 Since Uϵ (f1 , f2 ) can be written as the disjoint union:

Uϵ (f1 , f2 ) = Uϵ> (f1 , f2 ) ∪ Uϵ< (f1 , f2 )

we must have that for at least one value of • ∈ {>, <} we have that:
µ(Uϵ•(f1, f2)) ≥ α/2.

Since the predictions of f1 and f2 differ by more than ϵ at every point of Uϵ•(f1, f2), we must have that |v1• − v2•| ≥ ϵ. Therefore, for at least one i ∈ {1, 2} we must have that:

|vi• − v∗•| ≥ ϵ/2

Combining these two claims, we must have that:

µ(Uϵ•(f1, f2)) · (vi• − v∗•)² ≥ (α/2)·(ϵ/2)² = αϵ²/8
Let's consider the significance of this lemma. Most basically, if we have two models f1 and f2 that disagree substantially, this lemma gives an easily constructible set (Uϵ>(f1, f2) or Uϵ<(f1, f2)) that falsifies either the assertion that f1 encodes true conditional label expectations or the assertion that f2 does. And not only does it falsify that at least one of f1 or f2 is a “correct” model: it provides a directly actionable way to improve one of the models. Recall Lemma 4.1.1, which we proved when analyzing an algorithm for guaranteeing group conditional mean consistency, and which we reproduce here:

Lemma 10.2.2 Fix any model ft : X → [0, 1] and group gt : X → {0, 1}. Let

∆t = E_{(x,y)∼D}[y|gt(x) = 1] − E_{(x,y)∼D}[ft(x)|gt(x) = 1]

and

ft+1 = h(x, ft; gt, ∆t)

where h(x, f; g, ∆) = f(x) + ∆ if g(x) = 1, and h(x, f; g, ∆) = f(x) otherwise. Then:

B(ft) − B(ft+1) = µ(gt) · ∆t²

Summarizing, whenever we have two models that have ϵ-disagreements on an α-fraction of points, we can always constructively falsify at least one of the models, and update it to improve its Brier score by at least Ω(αϵ²). We put this all together in Algorithm 26 (Reconcile).

Algorithm 26 Reconcile(f1, f2, α, ϵ)
  Let t = t1 = t2 = 0 and f1^{t1} = f1, f2^{t2} = f2.
  Let m = ⌈2/(√α·ϵ)⌉.
  while µ(Uϵ(f1^{t1}, f2^{t2})) ≥ α do
    For each • ∈ {>, <} and i ∈ {1, 2} let:

      v∗• = E_{(x,y)∼D}[y | x ∈ Uϵ•(f1^{t1}, f2^{t2})]   vi• = E_{(x,y)∼D}[fi^{ti}(x) | x ∈ Uϵ•(f1^{t1}, f2^{t2})]

    Let:
      (it, •t) = arg max_{i∈{1,2}, •∈{>,<}} µ(Uϵ•(f1^{t1}, f2^{t2})) · (v∗• − vi•)²
    Let:
      gt(x) = 1 if x ∈ Uϵ•t(f1^{t1}, f2^{t2}), and gt(x) = 0 otherwise
    Let:
      ∆̃t = E_{(x,y)∼D}[y | gt(x) = 1] − E_{(x,y)∼D}[f_{it}^{t_{it}}(x) | gt(x) = 1]
      ∆t = Round(∆̃t; m)
    Let f_{it}^{t_{it}+1}(x) = h(x, f_{it}^{t_{it}}, gt, ∆t), and update t_{it} ← t_{it} + 1, t ← t + 1.
  Output (f1^{t1}, f2^{t2}).
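
Below is a compact Python sketch of Reconcile run over the empirical distribution (our own rendering; the array representation and helper names are assumptions, and predictions are not clipped back to [0, 1], which the analysis of Lemma 10.2.2 does not require):

import numpy as np

def reconcile(f1, f2, y, alpha, eps):
    # A sketch of Algorithm 26 on the empirical distribution over n points:
    # f1, f2 are numpy arrays of predictions, y the array of binary labels.
    f1, f2 = f1.astype(float), f2.astype(float)
    m = int(np.ceil(2 / (np.sqrt(alpha) * eps)))
    while True:
        U_gt = f1 - f2 > eps             # U_eps^> : f1 substantially above f2
        U_lt = f2 - f1 > eps             # U_eps^< : f1 substantially below f2
        if (U_gt | U_lt).mean() < alpha:
            return f1, f2
        # Select the (i, region) pair witnessing the largest falsification
        # mu(U) * (v_star - v_i)^2, as in the arg max of Algorithm 26.
        best_val, best = -1.0, None
        for g in (U_gt, U_lt):
            if not g.any():
                continue
            for i, f in ((1, f1), (2, f2)):
                val = g.mean() * (y[g].mean() - f[g].mean()) ** 2
                if val > best_val:
                    best_val, best = val, (i, g)
        i, g = best
        f = f1 if i == 1 else f2
        delta = y[g].mean() - f[g].mean()
        delta = np.round(delta * m) / m  # Round(.; m): snap to nearest multiple of 1/m
        f[g] += delta                    # the patch h(x, f; g, delta) of Lemma 10.2.2

Each patch improves the empirical Brier score of the patched model by at least αϵ²/16 (Theorem 44 below), which is what guarantees the loop terminates.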

Theorem 44 For any pair of models f1, f2 : X → [0, 1] and any α, ϵ > 0, Algorithm 26 (Reconcile) runs for T = T1 + T2 many rounds and outputs a pair of models (f1^{T1}, f2^{T2}) such that:

1. T ≤ (B(f1) + B(f2)) · 16/(αϵ²)

2. B(f1^{T1}) ≤ B(f1) − T1·αϵ²/16 and B(f2^{T2}) ≤ B(f2) − T2·αϵ²/16

3. µ(Uϵ(f1^{T1}, f2^{T2})) < α.

Proof 80 By Lemma 10.2.1, at each round t < T we must have that:

max_{i∈{1,2}, •∈{>,<}} µ(Uϵ•(f1^{t1}, f2^{t2})) · (v∗• − vi•)² ≥ αϵ²/8

Write i = it for the model index selected at round t, and let f̃i^{ti+1} = h(x, fi^{ti}, gt, ∆̃t), i.e. the update that would have resulted at round t had the algorithm used the unrounded measurement ∆̃t rather than the rounded measurement ∆t. By Lemma 10.2.2, we have that:

B(fi^{ti}) − B(f̃i^{ti+1}) ≥ αϵ²/8
We can now compute:

B(fi^{ti}) − B(fi^{ti+1}) = (B(fi^{ti}) − B(f̃i^{ti+1})) − (B(fi^{ti+1}) − B(f̃i^{ti+1}))
                          ≥ αϵ²/8 − (B(fi^{ti+1}) − B(f̃i^{ti+1}))

So it remains to upper bound (B(fi^{ti+1}) − B(f̃i^{ti+1})). Let ∆̂ = ∆̃t − ∆t. We make several observations. First, f̃i^{ti+1} = h(x, fi^{ti+1}, gt, ∆̂). Second,

∆̂ = E_{(x,y)∼D}[y|gt(x) = 1] − E_{(x,y)∼D}[fi^{ti}(x)|gt(x) = 1] − ∆t
   = E_{(x,y)∼D}[y|gt(x) = 1] − E_{(x,y)∼D}[fi^{ti+1}(x)|gt(x) = 1]

Third, by definition of the Round operation, |∆̂| ≤ 1/(2m). Therefore we can again apply Lemma 10.2.2 to conclude that:

B(fi^{ti+1}) − B(f̃i^{ti+1}) = µ(gt)·∆̂² ≤ 1/(4m²)

Combining this with our initial calculation lets us conclude that:

B(fi^{ti}) − B(fi^{ti+1}) ≥ αϵ²/8 − 1/(4m²) ≥ αϵ²/16

Here we are using the fact that we have set m ≥ 2/(√α·ϵ). Applying this argument to each of the T1 and T2 updates of f1 and f2 respectively, we get that B(f1^{T1}) ≤ B(f1) − T1·αϵ²/16 and B(f2^{T2}) ≤ B(f2) − T2·αϵ²/16. Since Brier scores are non-negative, we conclude that T1 ≤ B(f1)·16/(αϵ²) and T2 ≤ B(f2)·16/(αϵ²). Thus T = T1 + T2 ≤ (B(f1) + B(f2))·16/(αϵ²). Finally, the halting condition of the algorithm implies that µ(Uϵ(f1^{T1}, f2^{T2})) < α.

Thus if we start with any two models that have substantial disagreement, we
are guaranteed to be able to efficiently produce strictly improved models that
almost agree almost everywhere. In particular, we can never be in a position
in which we have two equally accurate but unimprovable models that have
substantial disagreements: in this case, we can always improve the models.
The only time we can have substantial model disagreement is if we refuse to
improve the models even in the face of efficiently verifiable and actionable
evidence that one of the models is suboptimal and improvable.
We observe that any pair of models that have gone through the “Reconcile” process must also produce very similar probability estimates for any sufficiently likely conditioning event.

Corollary 10.2.1 Let E ⊂ X be any subset of the data space. Let f1 and
f2 be any two models that have been output by Algorithm 26 (Reconcile) with
parameters ϵ and α. Let:
p1(E) = Σ_{x∈E} µ(x)·f1(x) / µ(E)   and   p2(E) = Σ_{x∈E} µ(x)·f2(x) / µ(E)

be the estimates for E[y|x ∈ E] implied by models f1 and f2 respectively. Then:

|p1(E) − p2(E)| ≤ α/µ(E) + ϵ

Proof 81 Let Sϵ(f1, f2) = {x : x ∉ Uϵ(f1, f2)} be the set of points on which f1 and f2 do not have an ϵ-disagreement. Recall that µ(Sϵ(f1, f2)) ≥ 1 − α. We compute:

µ(E)·|p1(E) − p2(E)| = | Σ_{x∈E} µ(x)·(f1(x) − f2(x)) |
  ≤ | Σ_{x∈E∩Uϵ(f1,f2)} µ(x)·(f1(x) − f2(x)) | + | Σ_{x∈E∩Sϵ(f1,f2)} µ(x)·(f1(x) − f2(x)) |
  ≤ α + µ(E ∩ Sϵ(f1, f2))·ϵ
  ≤ α + µ(E)·ϵ

Dividing by µ(E) yields the corollary.

10.3 Sample Complexity


We have once again presented Algorithm 26 as if it has direct access to the distribution 𝒟. Of course in general we do not have access to 𝒟, but rather have access to some set D of n i.i.d. samples from 𝒟. We will typically instead run Algorithm 26 over the empirical distribution over D, i.e. the distribution that puts weight 1/n on each datapoint (xi, yi) ∈ D. We will prove that with high probability over the sample of D, when Algorithm 26 is run over the empirical distribution on D, then its guarantees translate over to the distribution 𝒟 from which D was drawn, with error parameters that go to zero with the size n of the data sample.

We begin by counting the number of potential models f1^{t1}, f2^{t2} that Algorithm 26 might output.

Lemma 10.3.1 Fix any pair of models f1, f2 : X → [0, 1] and any α, ϵ > 0. Then there is a set C of pairs of models of size at most |C| ≤ (4(m + 1))^{32/(αϵ²)+1} such that for any dataset distribution D on which Algorithm 26 is run, the output models (f1^{t1}, f2^{t2}) ∈ C. Here, as in Algorithm 26, m = ⌈2/(√α·ϵ)⌉.

Proof 82 Given a run of Algorithm 26 for T rounds, let π = {(it, •t, ∆t)}_{t=1}^T denote the record of the quantities (it, •t, ∆t) chosen at each round t. Let π^{<t} = {(it′, •t′, ∆t′)}_{t′=1}^{t−1} denote the prefix of this transcript up through round t − 1. Observe that once we fix π^{<t} we have also fixed the models f1^{t1} and f2^{t2} that are defined at the start of round t. To see this, assume the claim holds true at round t. In particular, π^{<t} fixes the disagreement regions Uϵ•(f1^{t1}, f2^{t2}) of these two models, and therefore given the choices (it, •t, ∆t), we have inductively defined the models present at the start of round t + 1.

We let C denote the set of all pairs of models defined by transcripts π^{<T} for all T ≤ 32/(αϵ²). We know from Theorem 44 that Algorithm 26 halts after at most T ≤ (B(f1) + B(f2))·16/(αϵ²) ≤ 32/(αϵ²) many rounds (since Brier scores are at most 1), and hence the models output by Algorithm 26 must be contained in C as claimed. It remains to count the set of transcripts of length T ≤ 32/(αϵ²). At each round t, there are two possible values for it, two possible values for •t, and m + 1 possible choices for ∆t. Hence the number of transcripts of length T is (4(m + 1))^T. Thus we have:

|C| ≤ Σ_{T=0}^{32/(αϵ²)} (4(m + 1))^T ≤ (4(m + 1))^{32/(αϵ²)+1}

We can now argue that if we have a sample of n datapoints D sampled i.i.d. from some unknown distribution 𝒟, and we run Algorithm 26 using the empirical distribution over D, then its guarantees hold also over 𝒟, with error terms that tend to 0 as n grows large.
Theorem 45 Fix any data distribution 𝒟 and consider a run of Algorithm 26 over the empirical distribution over points in a dataset D ∼ 𝒟^n consisting of n points sampled i.i.d. from 𝒟. For any pair of models f1, f2 : X → [0, 1] and any α, ϵ > 0, Algorithm 26 (Reconcile) runs for T = T1 + T2 many rounds and outputs a pair of models (f1^{T1}, f2^{T2}) such that:

1. T ≤ (B_D(f1) + B_D(f2)) · 16/(αϵ²) ≤ 32/(αϵ²), where B_D denotes the Brier score over the empirical distribution on D.

2. For any δ > 0, with probability at least 1 − δ over the randomness of D ∼ 𝒟^n we have that:

B(f1^{T1}) ≤ B(f1) − T1·αϵ²/16 + 2·√( ((16/(αϵ²)) + 1) · log( 64(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n )

and

B(f2^{T2}) ≤ B(f2) − T2·αϵ²/16 + 2·√( ((16/(αϵ²)) + 1) · log( 64(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n )

3. For any δ > 0, with probability at least 1 − δ over the randomness of D ∼ 𝒟^n:

µ(Uϵ(f1^{T1}, f2^{T2})) < α + √( ((32/(αϵ²)) + 1) · log( 8(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n ).

Remark 10.3.1 Theorem 45 tells us that the guarantees we proved for Algorithm 26 in Theorem 44 (when we assumed direct access to the distribution 𝒟) continue to hold when all we have access to is a finite sample of n points from the data distribution, with additional error terms that tend to zero as n grows large. How large is large? If we want the final disagreement region to have mass at most 2α (i.e. we want the third conclusion of Theorem 45 to tell us that µ(Uϵ(f1^{T1}, f2^{T2})) < 2α), then solving for n in the error bound, we find that it suffices to have a number of samples n on the order of:

n ∈ Õ( log(1/δ) / (α³ϵ²) )

where the Õ(·) notation hides logarithmic terms in 1/α and 1/ϵ.

This is a remarkably small amount of data: we would need ≈ log(1/δ)/(αϵ²) samples just to estimate the conditional label expectation Pr[y = 1|x ∈ S] for a conditioning event S with µ(S) = α up to error ϵ with probability 1 − δ (or for two parties with disjoint samples to agree on this conditional label expectation up to error ϵ). Theorem 45 tells us that in fact two parties can be made to agree on a 1 − α fraction of points up to error ϵ with an additional amount of data only on the order of Õ(1/α²).
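
To spell out the arithmetic behind this remark (our own filling-in of the solve, using item 3 of Theorem 45): requiring the error term to be at most α means

√( ((32/(αϵ²)) + 1) · log(8(m + 1)/δ) / n ) ≤ α,   i.e.   n ≥ ((32/(αϵ²)) + 1) · log(8(m + 1)/δ) / α²,

and since m + 1 = O(1/(√α·ϵ)), the logarithm contributes log(1/δ) plus terms logarithmic in 1/α and 1/ϵ, giving n ∈ Õ(log(1/δ)/(α³ϵ²)).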
Proof 83 (Proof of Theorem 45) The bound on T follows directly from Theorem 44 without modification. We focus on bounding the Brier scores and the uncertainty region for the resulting models.

Consider any pair of models f1, f2. Given a finite dataset D we write (x, y) ∼ D to denote uniformly sampling a single datapoint from D. We start by comparing Pr_{(x,y)∼D}[x ∈ Uϵ(f1, f2)] with Pr_{(x,y)∼𝒟}[x ∈ Uϵ(f1, f2)]. We have that:

Pr_{(x,y)∼D}[x ∈ Uϵ(f1, f2)] = (1/n) Σ_{i=1}^n 1[xi ∈ Uϵ(f1, f2)]

Since 1[xi ∈ Uϵ(f1, f2)] ∈ [0, 1] and

E_{D∼𝒟^n}[ Pr_{(x,y)∼D}[x ∈ Uϵ(f1, f2)] ] = Pr_{(x,y)∼𝒟}[x ∈ Uϵ(f1, f2)]

we can apply Hoeffding's inequality (Theorem 46) to conclude that for every η > 0:

Pr_{D∼𝒟^n}[ | Pr_{(x,y)∼D}[x ∈ Uϵ(f1, f2)] − Pr_{(x,y)∼𝒟}[x ∈ Uϵ(f1, f2)] | ≥ η ] ≤ 2 exp(−2η²n)
Let C be the set of pairs of models guaranteed in the statement of Lemma 10.3.1. Recall that Lemma 10.3.1 guarantees us that |C| ≤ (4(m + 1))^{32/(αϵ²)+1}. We can apply the union bound to all pairs of models (f1, f2) ∈ C to conclude that with probability at least 1 − 2|C|exp(−2η²n) (over the randomness of D) we have that for every pair (f1, f2) ∈ C:

| Pr_{(x,y)∼D}[x ∈ Uϵ(f1, f2)] − Pr_{(x,y)∼𝒟}[x ∈ Uϵ(f1, f2)] | ≤ η

Choosing

η = √( log(2|C|/δ) / (2n) )

we get that with probability 1 − δ over the draw of D, for every pair (f1, f2) ∈ C:

| Pr_{(x,y)∼D}[x ∈ Uϵ(f1, f2)] − Pr_{(x,y)∼𝒟}[x ∈ Uϵ(f1, f2)] | ≤ √( log(2|C|/δ) / (2n) ) ≤ √( ((32/(αϵ²)) + 1) · log( 8(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n )

where the final inequality follows from plugging in our bound on |C| and the definition of m.
Because we know from Theorem 44 that the models f1^{T1}, f2^{T2} output by Algorithm 26 satisfy Pr_{(x,y)∼D}[x ∈ Uϵ(f1^{T1}, f2^{T2})] < α, we can conclude that with probability 1 − δ:

Pr_{(x,y)∼𝒟}[x ∈ Uϵ(f1^{T1}, f2^{T2})] < α + √( ((32/(αϵ²)) + 1) · log( 8(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n )
We can bound the Brier score of the resulting models in exactly the same way. For any fixed model f : X → [0, 1], we can write the empirical Brier score (i.e. the Brier score as evaluated over D) as:

B_D(f) = (1/n) Σ_{i=1}^n (f(xi) − yi)²

Since (f(xi) − yi)² ∈ [0, 1] and E_{D∼𝒟^n}[B_D(f)] = B(f), we can apply Hoeffding's inequality (Theorem 46) exactly as before to conclude that for every pair of models (f1^{T1}, f2^{T2}) ∈ C, with probability 1 − δ:

| B_D(f1^{T1}) − B(f1^{T1}) | ≤ √( ((16/(αϵ²)) + 1) · log( 16(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n )

and with probability 1 − δ:

| B_D(f2^{T2}) − B(f2^{T2}) | ≤ √( ((16/(αϵ²)) + 1) · log( 16(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n )

Observe that the same holds true for the original pair of models (f1, f2), since (f1, f2) ∈ C (they correspond to the models output after transcripts of length 0). We further know from Theorem 44 that B_D(f1^{T1}) ≤ B_D(f1) − T1·αϵ²/16 and B_D(f2^{T2}) ≤ B_D(f2) − T2·αϵ²/16.

Instantiating these bounds for the four models {f1, f2, f1^{T1}, f2^{T2}}, and setting δ ← δ/4 so that we can union bound over all four models, we have with probability 1 − δ that we simultaneously have:

B(f1^{T1}) ≤ B(f1) − T1·αϵ²/16 + 2·√( ((16/(αϵ²)) + 1) · log( 64(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n )

and

B(f2^{T2}) ≤ B(f2) − T2·αϵ²/16 + 2·√( ((16/(αϵ²)) + 1) · log( 64(⌈2/(√α·ϵ)⌉ + 1)/δ ) / n )

References and Further Reading


The material from this Chapter is taken from Roth et al. [2022].
Bibliography

Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jiten-
dra Malik. Uncertainty sets for image classifiers using conformal prediction.
In International Conference on Learning Representations, 2020.
Christina Baek, Yiding Jiang, Aditi Raghunathan, and Zico Kolter.
Agreement-on-the-line: Predicting the performance of neural networks un-
der distribution shift. arXiv preprint arXiv:2206.13089, 2022.
Osbert Bastani, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ra-
malingam, and Aaron Roth. Practical adversarial multivalid conformal pre-
diction. arXiv preprint arXiv:2206.01067, 2022.
Maya Burhanpurkar, Zhun Deng, Cynthia Dwork, and Linjun Zhang. Scaf-
folding sets. arXiv preprint arXiv:2111.03135, 2021.
A. P. Dawid. Calibration-Based Empirical Probability. The Annals of Statistics, 13(4):1251–1274, 1985. doi: 10.1214/aos/1176349736. URL https://doi.org/10.1214/aos/1176349736.
A Philip Dawid. The well-calibrated bayesian. Journal of the American Sta-
tistical Association, 77(379):605–610, 1982.
Dean P Foster. A proof of calibration via blackwell’s approachability theorem.
Games and Economic Behavior, 29(1-2):73–78, 1999.
Dean P Foster and Sergiu Hart. Forecast hedging and calibration. Journal of
Political Economy, 129(12):3447–3490, 2021.
Dean P Foster and Rakesh V Vohra. Asymptotic calibration. Biometrika, 85
(2):379–390, 1998.
Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tib-
shirani. The limits of distribution-free conditional predictive inference. In-
formation and Inference: A Journal of the IMA, 2020.
Drew Fudenberg and David K Levine. An easier way to calibrate. Games and
economic behavior, 29(1-2):131–137, 1999.


Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under dis-
tribution shift. Advances in Neural Information Processing Systems, 34:
1660–1672, 2021.
Ira Globus-Harris, Declan Harrison, Michael Kearns, Aaron Roth, and Jessica
Sorrell. Multicalibration as boosting for regression. 2023.
Parikshit Gopalan, Lunjia Hu, Michael P Kim, Omer Reingold, and Udi
Wieder. Loss minimization through the lens of outcome indistinguishability.
arXiv preprint arXiv:2210.08649, 2022a.
Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and
Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Com-
puter Science Conference (ITCS 2022). Schloss Dagstuhl-Leibniz-Zentrum
für Informatik, 2022b.

Varun Gupta, Christopher Jung, Georgy Noarov, Mallesh M Pai, and Aaron
Roth. Online multivalid learning: Means, moments, and prediction intervals.
In 13th Innovations in Theoretical Computer Science Conference (ITCS
2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.

Sergiu Hart. Calibrated forecasts: The minimax proof. 2020. URL http://www.ma.huji.ac.il/~hart/papers/calib-minmax.pdf.
Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum.
Multicalibration: Calibration for the (computationally-identifiable) masses.
In International Conference on Machine Learning, pages 1939–1948. PMLR,
2018.
Christopher Jung, Changhwa Lee, Mallesh Pai, Aaron Roth, and Rakesh
Vohra. Moment multicalibration for uncertainty estimation. In Conference
on Learning Theory, pages 2634–2678. PMLR, 2021.
Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth.
Batch multivalid conformal prediction. arXiv preprint arXiv:2209.15145,
2022.
Michael P Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-
box post-processing for fairness in classification. In Proceedings of the 2019
AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254, 2019.

Michael P Kim, Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer
Reingold. Universal adaptability: Target-independent inference that com-
petes with propensity scoring. Proceedings of the National Academy of
Sciences, 119(4):e2108097119, 2022.

Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry
Wasserman. Distribution-free predictive inference for regression. Journal
of the American Statistical Association, 113(523):1094–1111, 2018.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pages 345–356. Springer, 2002.
Sangdon Park, Osbert Bastani, Nikolai Matni, and Insup Lee. Pac confidence
sets for deep neural networks via calibrated prediction. In International
Conference on Learning Representations, 2019.

Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32, 2019.
Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel Candès.
With malice toward none: Assessing uncertainty via equalized coverage.
Harvard Data Science Review, 2020.

Aaron Roth, Alexander Tolbert, and Scott Weinstein. Reconciling individual probability forecasts. arXiv preprint arXiv:2209.01687, 2022.
Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal
of Machine Learning Research, 9(3), 2008.
A
Useful Probabilistic Inequalities

In this appendix we collect several useful probabilistic inequalities that will be handy in our analyses.

Theorem 46 (Hoeffding's Inequality) Let X1, . . . , Xn be independent random variables bounded such that for each i, ai ≤ Xi ≤ bi. Let Sn = Σ_{i=1}^n Xi denote their sum. Then for all t > 0:

Pr[ |Sn − E[Sn]| ≥ t ] ≤ 2 exp( −2t² / Σ_{i=1}^n (bi − ai)² )
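
As a small numerical illustration of how we typically invert Hoeffding's inequality (our own example; the function name is arbitrary): for n i.i.d. draws in [0, 1], setting the right hand side to δ and solving for t gives a two-sided confidence radius of √(log(2/δ)/(2n)) for the empirical mean.

import numpy as np

def hoeffding_radius(n, delta):
    # Radius r such that |mean - E[mean]| < r with probability >= 1 - delta,
    # for n independent draws bounded in [0, 1] (a_i = 0, b_i = 1).
    return np.sqrt(np.log(2 / delta) / (2 * n))

rng = np.random.default_rng(0)
n, delta = 10_000, 0.05
sample = rng.binomial(1, 0.3, size=n)        # Bernoulli(0.3) draws
r = hoeffding_radius(n, delta)
print(sample.mean() - r, sample.mean() + r)  # interval containing 0.3 w.p. >= 0.95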

Theorem 47 (Chernoff's Bound) Let X1, . . . , Xn be independent random variables bounded such that for each i, 0 ≤ Xi ≤ 1. Let Sn = Σ_{i=1}^n Xi denote their sum. Then for all η > 0:

Pr[ |Sn − E[Sn]| ≥ η·E[Sn] ] ≤ 2 exp( −E[Sn]·η² / 3 )

Theorem 48 (Azuma's Inequality) Let X1, . . . , Xn be random variables (not necessarily independent) bounded such that for each i, |Xi| ≤ ci. Let X<i denote the prefix X1, X2, . . . , Xi−1. Then for all t > 0:

Pr[ | Σ_{i=1}^n Xi − Σ_{i=1}^n E[Xi|X<i] | ≥ t ] ≤ 2 exp( −t² / (2 Σ_{i=1}^n ci²) )

Theorem 49 (The DKW (Dvoretzky–Kiefer–Wolfowitz) inequality) Let 𝒟 ∈ ∆Z be any distribution and let D ∼ 𝒟^n consist of n points sampled i.i.d. from 𝒟. Let F(c) = Pr_{(x,y)∼𝒟}[y ≤ c] denote the CDF of the label distribution induced by 𝒟, and let F̂_D(c) = (1/n) Σ_{(x,y)∈D} 1[y ≤ c] denote the CDF of the empirical label distribution induced by D. Then for every t > 0:

Pr[ sup_{c∈ℝ} |F(c) − F̂_D(c)| ≥ t ] ≤ 2 exp(−2nt²)