Lecture Notes
John Duchi
December 6, 2023
Contents
Lecture Notes on Statistics and Information Theory John Duchi
4 Concentration Inequalities
4.1 Basic tail inequalities
4.1.1 Sub-Gaussian random variables
4.1.2 Sub-exponential random variables
4.1.3 Orlicz norms
4.1.4 First applications of concentration: random projections
4.1.5 A second application of concentration: codebook generation
4.2 Martingale methods
4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities
4.2.2 Examples and bounded differences
4.3 Uniformity and metric entropy
4.3.1 Symmetrization and uniform laws
4.3.2 Metric entropy, coverings, and packings
4.4 Generalization bounds
4.4.1 Finite and countable classes of functions
4.4.2 Large classes
4.4.3 Structural risk minimization and adaptivity
4.5 Technical proofs
4.5.1 Proof of Theorem 4.1.11
4.5.2 Proof of Theorem 4.1.15
4.5.3 Proof of Theorem 4.3.6
4.6 Bibliography
4.7 Exercises
8 Minimax lower bounds: the Le Cam, Fano, and Assouad methods
8.1 Basic framework and minimax risk
8.2 Preliminaries on methods for lower bounds
8.2.1 From estimation to testing
8.2.2 Inequalities between divergences and product distributions
8.2.3 Metric entropy and packing numbers
8.3 Le Cam’s method
8.4 Fano’s method
8.4.1 The classical (local) Fano method
8.4.2 A distance-based Fano method
8.5 Assouad’s method
8.5.1 Well-separated problems
8.5.2 From estimation to multiple binary tests
8.5.3 Example applications of Assouad’s method
8.6 Nonparametric regression: minimax upper and lower bounds
8.6.1 Kernel estimates of the function
8.6.2 Minimax lower bounds on estimation with Assouad’s method
8.7 Global Fano Method
8.7.1 A mutual information bound based on metric entropy
V Appendices
Chapter 1
This set of lecture notes explores some of the (many) connections relating information theory,
statistics, computation, and learning. Signal processing, machine learning, and statistics all revolve
around extracting useful information from signals and data. In signal processing and information
theory, a central question is how to best design signals—and the channels over which they are
transmitted—to maximally communicate and store information, and to allow the most effective
decoding. In machine learning and statistics, by contrast, it is often the case that there is a
fixed data distribution that nature provides, and it is the learner’s or statistician’s goal to recover
information about this (unknown) distribution.
A central aspect of information theory is the discovery of fundamental results: results that
demonstrate that certain procedures are optimal. That is, information theoretic tools allow a
characterization of the attainable results in a variety of communication and statistical settings. As
we explore in these notes in the context of statistical, inferential, and machine learning tasks, this
allows us to develop procedures whose optimality we can certify—no better procedure is possible.
Such results are useful for a myriad of reasons; we would like to avoid making bad decisions or false
inferences, we may realize a task is impossible, and we can explicitly calculate the amount of data
necessary for solving different statistical problems.
In this context, we provide two main high-level examples, one for each of these tasks.
Example 1.1.1 (Source coding): The source coding, or data compression problem, is to
take information from a source, compress it, decompress it, and recover the original message.
Graphically, we have
The question, then, is how to design a compressor (encoder) and decompressor (decoder) that
uses the fewest number of bits to describe a source (or a message) while preserving all the
information, in the sense that the receiver receives the correct message with high probability.
This fewest number of bits is then the information content of the source (signal). ◇
Example 1.1.2: The channel coding, or data transmission problem, is the same as the source
coding problem of Example 1.1.1, except that between the compressor and decompressor is a
source of noise, a channel. In this case, the graphical representation is
Here the question is the maximum number of bits that may be sent per channel use, in
the sense that the receiver may reconstruct the desired message with low probability of error.
Because the channel introduces noise, we require some redundancy, and information theory
studies the exact amount of redundancy and number of bits that must be sent to allow such
reconstruction. ◇
Here, we estimate $\hat{P}$, an empirical version of the distribution $P$ that is easier to describe than
the original signal $X_1, \ldots, X_n$, with the hope that we learn information about the generating
distribution P , or at least describe it efficiently.
In our analogy with channel coding, we make a connection with estimation and inference.
Roughly, the major problem in statistics we consider is as follows: there exists some unknown
function f on a space X that we wish to estimate, and we are able to observe a noisy version
of f (Xi ) for a series of Xi drawn from a distribution P . Recalling the graphical description of
Example 1.1.2, we now have a channel P (Y | f (X)) that gives us noisy observations of f (X) for
each $X_i$, but we may (generally) no longer choose the encoder/compressor. That is, we have
\[ \text{Source } (P) \xrightarrow{X_1, \ldots, X_n} \text{Compressor} \xrightarrow{f(X_1), \ldots, f(X_n)} \text{Channel } P(Y \mid f(X)) \xrightarrow{Y_1, \ldots, Y_n} \text{Decompressor}. \]
Example 1.2.1: A classical example of the statistical paradigm in this lens is the usual linear
regression problem. Here the data $X_i$ belong to $\mathbb{R}^d$, and the compression function $f(x) = \theta^\top x$
for some vector $\theta \in \mathbb{R}^d$. Then the channel is often of the form
\[ Y_i = \underbrace{\theta^\top X_i}_{\text{signal}} + \underbrace{\varepsilon_i}_{\text{noise}}, \]
where $\varepsilon_i \stackrel{\mathrm{iid}}{\sim} N(0, \sigma^2)$ are independent mean zero normal perturbations. The goal is, given a
sequence of pairs (Xi , Yi ), to recover the true θ in the linear model.
In active learning or active sensing scenarios, also known as (sequential) experimental design,
we may choose the sequence Xi so as to better explore properties of θ. Later in the course we
will investigate whether it is possible to improve estimation by these strategies. As one concrete
idea, if we allow infinite power, which in this context corresponds to letting $\|X_i\| \to \infty$
(choosing very “large” vectors $x_i$), then the signal of $\theta^\top X_i$ should swamp any noise and make
estimation easier. ◇
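To make the remark about "large" design vectors concrete, here is a minimal simulation sketch. It assumes a one-dimensional version of the linear model, a uniform design, and a plain least-squares estimate; the function names, the scale factors, and the sample sizes are illustrative choices and not part of the notes.

```python
import random

def ols_slope(xs, ys):
    """Least-squares estimate of theta in the model y = theta * x + noise."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mean_squared_error(scale, theta=2.0, sigma=1.0, n=50, trials=200, seed=0):
    """Average of (theta_hat - theta)^2 when the design points are multiplied by `scale`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [scale * rng.uniform(-1.0, 1.0) for _ in range(n)]
        ys = [theta * x + rng.gauss(0.0, sigma) for x in xs]
        total += (ols_slope(xs, ys) - theta) ** 2
    return total / trials

mse_small = mean_squared_error(scale=1.0)   # modest signal power
mse_large = mean_squared_error(scale=10.0)  # ten times larger design vectors
print(mse_small, mse_large)
```

In this sketch the noise level is fixed, so multiplying the design by 10 should shrink the squared error by roughly a factor of 100, in line with the heuristic that a large signal swamps the noise.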
For the remainder of the class, we explore these ideas in substantially more detail.
Part I of the notes covers what I term “stability” based results. At a high level, this means that
we ask what can be gained by considering situations where individual observations in a sequence
of random variables X1 , . . . , Xn have little effect on various functions of the sequence. We begin
in Chapter 4 with basic concentration inequalities, discussing how sums and related quantities can
converge quickly; while this material is essential for the remainder of the lectures, it does not depend
on particular information-theoretic techniques. We discuss some heuristic applications to problems
in statistical learning—empirical risk minimization—in this section of the notes. We provide a
treatment of more advanced ideas in Chapter 6, including some approaches to concentration via
entropy methods. We then turn in Chapter 5 to carefully investigate generalization and convergence
guarantees—arguing that functions of a sample X1 , . . . , Xn are representative of the full population
P from which the sample is drawn—based on controlling different information-theoretic quantities.
In this context, we develop PAC-Bayesian bounds, and we also use the same framework to present
tools to control generalization and convergence in interactive data analyses. These types of analyses
reflect modern statistics, where one performs some type of data exploration before committing to a
fuller analysis, but which breaks classical statistical approaches, because the analysis now depends
on the sample. Finally, we provide a chapter (Chapter 7) on disclosure limitation and privacy
techniques, all of which repose on different notions of stability in distribution.
Part II studies fundamental limits, using information-theoretic techniques to derive lower bounds
on the possible rates of convergence for various estimation, learning, and other statistical problems.
Part III revisits all of our information theoretic notions from Chapter 2, but instead of sim-
ply giving definitions and a few consequences, provides operational interpretations of the different
information-theoretic quantities, such as entropy. Of course this includes Shannon’s original results
on the relationship between coding and entropy (Chapter 2.4.1), but we also provide an interpreta-
tion of entropy and information as measures of uncertainty in statistical experiments and statistical
learning, which is a perspective typically missing from information-theoretic treatments of entropy
(Chapters TBD). We also relate these ideas to game-playing and maximum likelihood estimation.
Finally, we relate generic divergence measures to questions of optimality and consistency in statisti-
cal and machine learning problems, which allows us to delineate when (at least in asymptotic senses)
it is possible to computationally efficiently learn good predictors and design good experiments.
Chapter 2
In this first introductory chapter, we discuss and review many of the basic concepts of information
theory in an effort to introduce them to readers unfamiliar with the tools. Our presentation is relatively
brisk, as our main goal is to get to the meat of the chapters on applications of the inequalities and
tools we develop, but these provide the starting point for everything in the sequel. One of the
main uses of information theory is to prove what, in an information theorist’s lexicon, are known
as converse results: fundamental limits that guarantee no procedure can improve over a particular
benchmark or baseline. We will give the first of these here to preview more of what is to come,
as these fundamental limits form one of the core connections between statistics and information
theory. The tools of information theory, in addition to their mathematical elegance, also come
with strong operational interpretations: they give quite precise answers and explanations for a
variety of real engineering and statistical phenomena. We will touch on one of these here (the
connection between source coding, or lossless compression, and the Shannon entropy), and much
of the remainder of the book will explore more.
2.1.1 Definitions
Here, we provide the basic definitions of entropy, information, and divergence, assuming the random
variables of interest are discrete or have densities with respect to Lebesgue measure.
Entropy: We begin with a central concept in information theory: the entropy. Let P be a distri-
bution on a finite (or countable) set X , and let p denote the probability mass function associated
with P . That is, if X is a random variable distributed according to P , then P (X = x) = p(x). The
entropy of X (or of P ) is defined as
\[ H(X) := -\sum_x p(x) \log p(x). \]
Because $p(x) \le 1$ for all $x$, it is clear that this quantity is nonnegative. We will show later that if $\mathcal{X}$
is finite, the maximum entropy distribution on X is the uniform distribution, setting p(x) = 1/|X |
for all x, which has entropy log(|X |).
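As a quick numerical illustration of the definition, the following sketch computes the entropy of a finite p.m.f. given as a list of probabilities. The helper name `entropy` and the example distributions are assumptions made for illustration; terms with $p(x) = 0$ are skipped, following the convention $0 \log 0 = 0$.

```python
import math

def entropy(p, base=math.e):
    """Shannon entropy -sum_x p(x) log p(x) of a finite p.m.f. (nats by default)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

uniform = [1.0 / 8] * 8
skewed = [0.7, 0.1, 0.1, 0.05, 0.03, 0.01, 0.005, 0.005]

print(entropy(uniform), math.log(8))  # the uniform distribution attains log|X|
print(entropy(skewed))                # any other distribution on 8 points has less
```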
Later in the class, we provide a number of operational interpretations of the entropy. The
most common interpretation—which forms the beginning of Shannon’s classical information the-
ory [158]—is via the source-coding theorem. We present Shannon’s source coding theorem in
Section 2.4.1, where we show that if we wish to encode a random variable X, distributed according
to $P$, with a $k$-ary string (i.e. each entry of the string takes on one of $k$ values), then the minimal
expected length of the encoding is given by $H(X) = -\sum_x p(x) \log_k p(x)$. Moreover, this is achievable
(to within a length of at most 1 symbol) by using Huffman codes (among many other types of
codes). As an example of this interpretation, we may consider encoding a random variable X with
equi-probable distribution on m items, which has H(X) = log(m). In base-2, this makes sense: we
simply assign an integer to each item and encode each integer with the natural (binary) integer
encoding of length $\lceil \log m \rceil$.
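The source-coding interpretation can be checked on small examples. The sketch below builds binary Huffman codeword lengths with a heap and compares the expected code length with the base-2 entropy; the function names and the two example distributions are illustrative assumptions, and the construction is the textbook Huffman procedure rather than anything specific to these notes.

```python
import heapq
import math

def huffman_lengths(p):
    """Binary Huffman codeword lengths for the probabilities in p."""
    # Heap entries: (subtree probability, tie-breaker, symbols in the subtree).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:  # merging two subtrees adds one bit to each member
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def expected_length(p):
    return sum(pi * li for pi, li in zip(p, huffman_lengths(p)))

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

dyadic = [0.5, 0.25, 0.125, 0.125]
generic = [0.4, 0.3, 0.2, 0.1]
print(expected_length(dyadic), entropy_bits(dyadic))
print(expected_length(generic), entropy_bits(generic))
```

For the dyadic distribution the expected length matches the entropy exactly; for the second distribution it exceeds the entropy by less than one bit.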
We can also define the conditional entropy, which is the amount of information left in a random
variable after observing another. In particular, we define
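\[ H(X \mid Y = y) = -\sum_x p(x \mid y) \log p(x \mid y) \quad \text{and} \quad H(X \mid Y) = \sum_y p(y) H(X \mid Y = y), \]
where $p(x \mid y)$ denotes the conditional probability mass function of $X$ given $Y = y$.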
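A small sketch of the conditional entropy computation, assuming the joint p.m.f. is stored as a nested list `joint[x][y]` (a representation chosen here only for illustration):

```python
import math

def entropy(p):
    """Shannon entropy in nats of a finite p.m.f."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def conditional_entropy(joint):
    """H(X | Y) = sum_y p(y) H(X | Y = y) for joint[x][y] = P(X = x, Y = y)."""
    total = 0.0
    for column in zip(*joint):  # column[x] = p(x, y) for one fixed y
        p_y = sum(column)
        if p_y > 0:
            total += p_y * entropy([p_xy / p_y for p_xy in column])
    return total

independent = [[0.25, 0.25], [0.25, 0.25]]  # Y says nothing about X
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y determines X

print(conditional_entropy(independent))  # equals H(X) = log 2
print(conditional_entropy(identical))    # equals 0
```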
Example 2.1.2 (Bernoulli random variables): Let $h_2(p) = -p \log p - (1 - p) \log(1 - p)$ denote
the binary entropy, which is the entropy of a Bernoulli($p$) random variable. ◇
Example 2.1.4 (A random variable with infinite entropy): While most “reasonable” discrete
random variables have finite entropy, it is possible to construct distributions with infinite
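entropy. Indeed, let $X$ have p.m.f. on $\{2, 3, \ldots\}$ defined by
\[ p(k) = \frac{A^{-1}}{k \log^2 k} \quad \text{where} \quad A = \sum_{k=2}^\infty \frac{1}{k \log^2 k} < \infty, \]
the last sum finite as $\int_2^\infty \frac{1}{x \log^\alpha x} dx < \infty$ if and only if $\alpha > 1$: for $\alpha = 1$, we have $\int_e^x \frac{1}{t \log t} dt = \log \log x$, while for $\alpha > 1$, we have
\[ \frac{d}{dx} (\log x)^{1 - \alpha} = (1 - \alpha) \frac{1}{x \log^\alpha x} \]
so that $\int_e^\infty \frac{1}{t \log^\alpha t} dt = \frac{1}{\alpha - 1}$. To see that the entropy is infinite, note that
\[ H(X) = A^{-1} \sum_{k \ge 2} \frac{\log A + \log k + 2 \log \log k}{k \log^2 k} \ge A^{-1} \sum_{k \ge 2} \frac{\log k}{k \log^2 k} - C = \infty, \]
where $C$ is a finite constant. ◇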
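The two sums in this example behave very differently, which a short numerical sketch can make visible. The code below tracks partial sums of $\sum_k 1/(k \log^2 k)$ (the normalizer) and of $\sum_k 1/(k \log k)$ (the dominant part of the entropy); the function name and the cutoffs are illustrative assumptions, and finite partial sums can only suggest, not prove, the divergence.

```python
import math

def partial_sums(n_max):
    """Partial sums over k = 2..n_max of 1/(k log^2 k) and of 1/(k log k)."""
    normalizer = 0.0
    entropy_like = 0.0
    for k in range(2, n_max + 1):
        log_k = math.log(k)
        normalizer += 1.0 / (k * log_k ** 2)
        entropy_like += 1.0 / (k * log_k)
    return normalizer, entropy_like

for n_max in (10 ** 3, 10 ** 5):
    a_n, e_n = partial_sums(n_max)
    # The normalizer settles down, while the second sum keeps growing like log log n.
    print(n_max, a_n, e_n, math.log(math.log(n_max)))
```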
KL-divergence: Now we define two additional quantities, which are actually much more funda-
mental than entropy: they can always be defined for any distributions and any random variables,
as they measure distance between distributions. Entropy simply makes no sense for non-discrete
random variables, let alone random variables with continuous and discrete components, though it
proves useful for some of our arguments and interpretations.
Before defining these quantities, we recall the definition of a convex function $f : \mathbb{R}^k \to \mathbb{R}$ as any
bowl-shaped function, that is, one satisfying
\[ f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \tag{2.1.1} \]
for all $\lambda \in [0, 1]$, all $x, y$. The function $f$ is strictly convex if the convexity inequality (2.1.1) is
strict for $\lambda \in (0, 1)$ and $x \ne y$. We recall a standard result:
Proposition 2.1.5 (Jensen’s inequality). Let f be convex. Then for any random variable X,
f (E[X]) ≤ E[f (X)].
Moreover, if f is strictly convex, then f (E[X]) < E[f (X)] unless X is constant.
Now we may define and provide a few properties of the KL-divergence. Let P and Q be
distributions defined on a discrete set X . The KL-divergence between them is
\[ D_{\mathrm{kl}}(P \| Q) := \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \]
We observe immediately that Dkl (P ||Q) ≥ 0. To see this, we apply Jensen’s inequality (Propo-
sition 2.1.5) to the function − log and the random variable q(X)/p(X), where X is distributed
according to P :
\[ D_{\mathrm{kl}}(P \| Q) = -E\left[\log \frac{q(X)}{p(X)}\right] \ge -\log E\left[\frac{q(X)}{p(X)}\right] = -\log\left(\sum_x p(x) \frac{q(x)}{p(x)}\right) = -\log(1) = 0. \]
Moreover, as $-\log$ is strictly convex, we have $D_{\mathrm{kl}}(P \| Q) > 0$ unless $P = Q$. Another consequence of
the nonnegativity of the KL-divergence is that whenever the set $\mathcal{X}$ is finite with cardinality $|\mathcal{X}| < \infty$,
for any random variable $X$ supported on $\mathcal{X}$ we have $H(X) \le \log |\mathcal{X}|$. Indeed, letting $m = |\mathcal{X}|$, $Q$
be the uniform distribution on $\mathcal{X}$ so that $q(x) = \frac{1}{m}$, and $X$ have distribution $P$ on $\mathcal{X}$, we have
\[ 0 \le D_{\mathrm{kl}}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = -H(X) - \sum_x p(x) \log q(x) = -H(X) + \log m, \tag{2.1.2} \]
so that H(X) ≤ log m. Thus, the uniform distribution has the highest entropy over all distributions
on the set X .
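The identity (2.1.2) is easy to check numerically. The following sketch assumes finite p.m.f.s stored as lists and assumes $q(x) > 0$ wherever $p(x) > 0$; the helper names and the example distribution are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a finite p.m.f."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_kl(P||Q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 whenever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
m = len(p)
uniform = [1.0 / m] * m

# Identity (2.1.2): D_kl(P || uniform) = log m - H(X), which is nonnegative.
print(kl_divergence(p, uniform), math.log(m) - entropy(p))
```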
Mutual information: Having defined KL-divergence, we may now describe the information
content between two random variables X and Y . The mutual information I(X; Y ) between X and
Y is the KL-divergence between their joint distribution and the product of their marginal distributions.
More mathematically,
\[ I(X; Y) := \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}. \tag{2.1.3} \]
We can rewrite this in several ways. First, using Bayes’ rule, we have p(x, y)/p(y) = p(x | y), so
\[ I(X; Y) = \sum_{x, y} p(y) p(x \mid y) \log \frac{p(x \mid y)}{p(x)} = -\sum_x \sum_y p(y) p(x \mid y) \log p(x) + \sum_y p(y) \sum_x p(x \mid y) \log p(x \mid y) = H(X) - H(X \mid Y). \]
Similarly, we have I(X; Y ) = H(Y ) − H(Y | X), so mutual information can be thought of as the
amount of entropy removed (on average) in X by observing Y . We may also think of mutual infor-
mation as measuring the similarity between the joint distribution of X and Y and their distribution
when they are treated as independent.
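The two expressions for mutual information, the definition (2.1.3) and the difference $H(X) - H(X \mid Y)$, can be compared directly on a small joint table. This is a sketch under the assumption that the joint p.m.f. is a nested list `joint[x][y]`; the names and the example table are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a finite p.m.f."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def conditional_entropy(joint):
    """H(X | Y) for joint[x][y] = P(X = x, Y = y)."""
    total = 0.0
    for column in zip(*joint):
        p_y = sum(column)
        if p_y > 0:
            total += p_y * entropy([p_xy / p_y for p_xy in column])
    return total

def mutual_information(joint):
    """I(X; Y) computed from definition (2.1.3)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(column) for column in zip(*joint)]
    return sum(p_xy * math.log(p_xy / (p_x[x] * p_y[y]))
               for x, row in enumerate(joint)
               for y, p_xy in enumerate(row) if p_xy > 0)

joint = [[0.4, 0.1], [0.1, 0.4]]
marginal_x = [sum(row) for row in joint]
print(mutual_information(joint), entropy(marginal_x) - conditional_entropy(joint))
```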
Comparing the definition (2.1.3) to that for KL-divergence, we see that if $P_{XY}$ is the joint
distribution of $X$ and $Y$, while $P_X$ and $P_Y$ are their marginal distributions (distributions when $X$
and $Y$ are treated independently), then
\[ I(X; Y) = D_{\mathrm{kl}}(P_{XY} \| P_X \times P_Y). \]
Entropies of continuous random variables: For continuous random variables, we may define
an analogue of the entropy known as differential entropy, which for a random variable $X$ with
density $p$ is defined by
\[ h(X) := -\int p(x) \log p(x) \, dx. \tag{2.1.4} \]
Note that the differential entropy may be negative—it is no longer directly a measure of the number
of bits required to describe a random variable X (on average), as was the case for the entropy. We
can similarly define the conditional entropy
\[ h(X \mid Y) = -\int p(y) \int p(x \mid y) \log p(x \mid y) \, dx \, dy. \]
We remark that the conditional differential entropy of X given Y for Y with arbitrary distribution—
so long as X has a density—is
\[ h(X \mid Y) = E\left[-\int p(x \mid Y) \log p(x \mid Y) \, dx\right], \]
where p(x | y) denotes the conditional density of X when Y = y. The KL divergence between
distributions P and Q with densities p and q becomes
\[ D_{\mathrm{kl}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx, \]
and similarly, we have the analogues of mutual information as
\[ I(X; Y) = \int p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy = h(X) - h(X \mid Y) = h(Y) - h(Y \mid X). \]
As we show in the next subsection, we can define the KL-divergence between arbitrary distributions
(and mutual information between arbitrary random variables) more generally without requiring
discrete or continuous distributions. Before investigating these issues, however, we present a few
examples. We also see immediately that for X uniform on a set [a, b], we have h(X) = log(b − a).
Example 2.1.6 (Entropy of normal random variables): The differential entropy (2.1.4) of
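a normal random variable is straightforward to compute. Indeed, for $X \sim N(\mu, \sigma^2)$ we have
$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{1}{2 \sigma^2} (x - \mu)^2)$, so that
\[ h(X) = -\int p(x) \left[\frac{1}{2} \log \frac{1}{2 \pi \sigma^2} - \frac{1}{2 \sigma^2} (x - \mu)^2\right] dx = \frac{1}{2} \log(2 \pi \sigma^2) + \frac{E[(X - \mu)^2]}{2 \sigma^2} = \frac{1}{2} \log(2 \pi e \sigma^2). \]
For a general multivariate Gaussian, where $X \sim N(\mu, \Sigma)$ for a vector $\mu \in \mathbb{R}^n$ and $\Sigma \succ 0$ with
density $p(x) = \frac{1}{(2 \pi)^{n/2} \sqrt{\det(\Sigma)}} \exp(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu))$, we similarly have
\[ h(X) = \frac{1}{2} E\left[n \log(2 \pi) + \log \det(\Sigma) + (X - \mu)^\top \Sigma^{-1} (X - \mu)\right] = \frac{n}{2} \log(2 \pi) + \frac{1}{2} \log \det(\Sigma) + \frac{1}{2} \mathrm{tr}(\Sigma \Sigma^{-1}) = \frac{n}{2} \log(2 \pi e) + \frac{1}{2} \log \det(\Sigma). \]
◇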
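As a sanity check on the closed forms for differential entropy, the sketch below approximates $-\int p \log p$ with a midpoint rule. The integration range, step count, and parameter values are arbitrary illustrative choices, and the function names are assumptions.

```python
import math

def gaussian_density(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def differential_entropy_numeric(density, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of -p(x) log p(x) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        p = density(lo + (i + 0.5) * width)
        if p > 0:
            total -= p * math.log(p) * width
    return total

mu, sigma = 1.0, 2.0
numeric = differential_entropy_numeric(lambda x: gaussian_density(x, mu, sigma),
                                       mu - 12 * sigma, mu + 12 * sigma)
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
print(numeric, closed_form)

# Uniform density on [2, 5]: the differential entropy should be log(5 - 2).
uniform_numeric = differential_entropy_numeric(lambda x: 1.0 / 3.0, 2.0, 5.0)
print(uniform_numeric, math.log(3.0))
```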
Continuing our examples with normal distributions, we may compute the divergence between
two multivariate Gaussian distributions:
Example 2.1.7 (Divergence between Gaussian distributions): Let P be the multivariate
normal $N(\mu_1, \Sigma)$, and $Q$ be the multivariate normal distribution with mean $\mu_2$ and identical
covariance $\Sigma \succ 0$. Then we have that
\[ D_{\mathrm{kl}}(P \| Q) = \frac{1}{2} (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2). \tag{2.1.5} \]
We leave the computation of the identity (2.1.5) to the reader. ◇
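A one-dimensional numerical check of the identity (2.1.5) may be a useful companion to the pencil-and-paper computation. The sketch below integrates $p \log(p/q)$ with a midpoint rule for two normal densities with common variance; the log-ratio is written out analytically to avoid underflow in the tails, and all names and parameter values are illustrative assumptions.

```python
import math

def gaussian_density(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def kl_gaussians_numeric(mu1, mu2, sigma, steps=20_000):
    """Midpoint-rule approximation of D_kl(N(mu1, sigma^2) || N(mu2, sigma^2))."""
    lo = min(mu1, mu2) - 12 * sigma
    hi = max(mu1, mu2) + 12 * sigma
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        # log p(x) - log q(x) for equal variances, written out explicitly
        log_ratio = ((x - mu2) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        total += gaussian_density(x, mu1, sigma) * log_ratio * width
    return total

mu1, mu2, sigma = 0.0, 1.5, 2.0
numeric_kl = kl_gaussians_numeric(mu1, mu2, sigma)
closed_form_kl = (mu1 - mu2) ** 2 / (2 * sigma ** 2)  # scalar case of (2.1.5)
print(numeric_kl, closed_form_kl)
```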
An interesting consequence of Example 2.1.7 is that if a random vector $X$ has a given covariance
$\Sigma \in \mathbb{R}^{n \times n}$, then the multivariate Gaussian with identical covariance has larger differential
entropy. Put another way, differential entropy for random variables with second moments is always
maximized by the Gaussian distribution.
Proposition 2.1.8. Let $X$ be a random vector on $\mathbb{R}^n$ with a density, and assume that $\mathrm{Cov}(X) = \Sigma$.
Then for Z ∼ N(0, Σ), we have
h(X) ≤ h(Z).
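Proof Without loss of generality, we assume that $X$ has mean 0. Let $P$ be the distribution of
$X$ with density $p$, and let $Q$ be multivariate normal with mean 0 and covariance $\Sigma$, with density $q$; let $Z$
have distribution $Q$. Then
\[ D_{\mathrm{kl}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx = -h(X) + \int p(x) \left[\frac{n}{2} \log(2 \pi) + \frac{1}{2} \log \det(\Sigma) + \frac{1}{2} x^\top \Sigma^{-1} x\right] dx = -h(X) + h(Z), \]
because $Z$ has the same covariance as $X$, so that $E[X^\top \Sigma^{-1} X] = \mathrm{tr}(\Sigma^{-1} \Sigma) = E[Z^\top \Sigma^{-1} Z]$. As $0 \le D_{\mathrm{kl}}(P \| Q)$, we have $h(Z) \ge h(X)$ as desired.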
We remark in passing that the fact that Gaussian random variables have the largest entropy has
been used to prove stronger variants of the central limit theorem; see the original results of Barron
[16], as well as later quantitative results on the increase of entropy of normalized sums by Artstein
et al. [9] and Madiman and Barron [134].
Chain rules for information and divergence: As another immediate corollary to the chain
rule for entropy, we see that mutual information also obeys a chain rule:
\[ I(X; Y_1^n) = \sum_{i=1}^n I(X; Y_i \mid Y_1^{i-1}). \]
Indeed, we have
\[ I(X; Y_1^n) = H(Y_1^n) - H(Y_1^n \mid X) = \sum_{i=1}^n \left[H(Y_i \mid Y_1^{i-1}) - H(Y_i \mid X, Y_1^{i-1})\right] = \sum_{i=1}^n I(X; Y_i \mid Y_1^{i-1}). \]
The KL-divergence obeys similar chain rules, making mutual information and KL-divergence mea-
sures useful tools for evaluation of distances and relationships between groups of random variables.
As a second example, suppose that the distribution P = P1 ×P2 ×· · ·×Pn , and Q = Q1 ×· · ·×Qn ,
that is, that P and Q are product distributions over independent random variables Xi ∼ Pi or
Xi ∼ Qi . Then we immediately have the tensorization identity
\[ D_{\mathrm{kl}}(P \| Q) = D_{\mathrm{kl}}(P_1 \times \cdots \times P_n \| Q_1 \times \cdots \times Q_n) = \sum_{i=1}^n D_{\mathrm{kl}}(P_i \| Q_i). \]
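The tensorization identity for product distributions can be verified by brute force on small alphabets. The sketch below enumerates the joint p.m.f. of independent coordinates and compares the joint KL-divergence with the sum of the coordinate-wise divergences; the marginals are arbitrary illustrative choices and the helper names are assumptions.

```python
import itertools
import math

def kl_divergence(p, q):
    """D_kl(P||Q) for finite p.m.f.s; assumes q(x) > 0 whenever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product_distribution(marginals):
    """Joint p.m.f. (flattened) of independent coordinates with the given marginals."""
    return [math.prod(combo) for combo in itertools.product(*marginals)]

p_marginals = [[0.2, 0.8], [0.5, 0.5], [0.1, 0.3, 0.6]]
q_marginals = [[0.4, 0.6], [0.3, 0.7], [0.3, 0.3, 0.4]]

joint_kl = kl_divergence(product_distribution(p_marginals),
                         product_distribution(q_marginals))
sum_of_kls = sum(kl_divergence(p, q) for p, q in zip(p_marginals, q_marginals))
print(joint_kl, sum_of_kls)
```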
We remark in passing that these two identities hold for arbitrary distributions $P_i$ and $Q_i$ or random
variables $X, Y$. As a final tensorization identity, we consider a more general chain rule for KL-divergences,
which will frequently be useful. We abuse notation temporarily, and for random
variables $X$ and $Y$ with distributions $P$ and $Q$, respectively, we denote
\[ D_{\mathrm{kl}}(X \| Y) := D_{\mathrm{kl}}(P \| Q). \]
In analogy to the entropy, we can also define the conditional KL divergence. Let X and Y have
distributions $P_{X|z}$ and $P_{Y|z}$ conditioned on $Z = z$, respectively. Then we define
\[ D_{\mathrm{kl}}(X \| Y \mid Z) = E_Z\left[D_{\mathrm{kl}}(P_{X|Z} \| P_{Y|Z})\right], \]
so that if $Z$ is discrete we have $D_{\mathrm{kl}}(X \| Y \mid Z) = \sum_z p(z) D_{\mathrm{kl}}(P_{X|z} \| P_{Y|z})$. With this notation, we
have the chain rule
\[ D_{\mathrm{kl}}(X_1, \ldots, X_n \| Y_1, \ldots, Y_n) = \sum_{i=1}^n D_{\mathrm{kl}}(X_i \| Y_i \mid X_1^{i-1}), \tag{2.1.6} \]
because (in the discrete case, which—as we discuss presently—is fully general for this purpose) for
distributions PXY and QXY we have
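\[ D_{\mathrm{kl}}(P_{XY} \| Q_{XY}) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{q(x, y)} = \sum_{x, y} p(x) p(y \mid x) \left[\log \frac{p(y \mid x)}{q(y \mid x)} + \log \frac{p(x)}{q(x)}\right] = \sum_x p(x) \log \frac{p(x)}{q(x)} + \sum_x p(x) \sum_y p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}, \]
where the final equality uses that $\sum_y p(y \mid x) = 1$ for all $x$.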
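The two-variable case of the chain rule can be checked on a small example. In this sketch the joint p.m.f.s are nested lists indexed as `joint[x][y]`, all entries are assumed strictly positive so that every conditional is well defined, and the names and tables are illustrative.

```python
import math

def kl_divergence(p, q):
    """D_kl(P||Q) for finite p.m.f.s; assumes q(x) > 0 whenever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chain_rule_terms(p_joint, q_joint):
    """Return (D_kl(P_X || Q_X), sum_x p(x) D_kl(P_{Y|x} || Q_{Y|x}))."""
    p_x = [sum(row) for row in p_joint]
    q_x = [sum(row) for row in q_joint]
    marginal_term = kl_divergence(p_x, q_x)
    conditional_term = 0.0
    for x, (p_row, q_row) in enumerate(zip(p_joint, q_joint)):
        p_cond = [v / p_x[x] for v in p_row]  # p(y | x)
        q_cond = [v / q_x[x] for v in q_row]  # q(y | x)
        conditional_term += p_x[x] * kl_divergence(p_cond, q_cond)
    return marginal_term, conditional_term

p_joint = [[0.3, 0.2], [0.1, 0.4]]
q_joint = [[0.1, 0.3], [0.4, 0.2]]

flat_p = [v for row in p_joint for v in row]
flat_q = [v for row in q_joint for v in row]
marginal_term, conditional_term = chain_rule_terms(p_joint, q_joint)
print(kl_divergence(flat_p, flat_q), marginal_term + conditional_term)
```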
Expanding upon this, we give several tensorization identities, showing how to transform ques-
tions about the joint distribution of many random variables to simpler questions about their
marginals. As a first example, as a consequence of the fact that conditioning decreases
entropy, for any sequence of (discrete or continuous, as appropriate) random
variables, we have
\[ H(X_1, \ldots, X_n) \le \sum_{i=1}^n H(X_i) \quad \text{and} \quad h(X_1, \ldots, X_n) \le \sum_{i=1}^n h(X_i). \]
Both inequalities hold with equality if and only if $X_1, \ldots, X_n$ are mutually independent. (The only
if follows because I(X; Y ) > 0 whenever X and Y are not independent, by Jensen’s inequality and
the fact that Dkl (P ||Q) > 0 unless P = Q.)
We return to information and divergence now. Suppose that random variables $Y_i$ are independent
conditional on $X$, meaning that
\[ P(Y_1 = y_1, \ldots, Y_n = y_n \mid X = x) = \prod_{i=1}^n P(Y_i = y_i \mid X = x). \]
Such scenarios are common—as we shall see—when we make multiple observations from a fixed
distribution parameterized by some X. Then we have the inequality
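\[ I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^n \left[H(Y_i \mid Y_1^{i-1}) - H(Y_i \mid X, Y_1^{i-1})\right] = \sum_{i=1}^n \left[H(Y_i \mid Y_1^{i-1}) - H(Y_i \mid X)\right] \le \sum_{i=1}^n \left[H(Y_i) - H(Y_i \mid X)\right] = \sum_{i=1}^n I(X; Y_i), \tag{2.1.7} \]
where the inequality follows because conditioning reduces entropy.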
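To see the inequality (2.1.7) in a concrete case, consider a uniform bit $X$ observed through several independent noisy copies. The sketch below builds the joint law by enumeration and compares $I(X; Y_1, \ldots, Y_n)$ with $\sum_i I(X; Y_i)$; the flip probability, the number of copies, and the function names are illustrative assumptions.

```python
import itertools
import math

def mutual_information(joint):
    """I(X; Y) in nats for a dict mapping (x, y) to P(X = x, Y = y)."""
    p_x, p_y = {}, {}
    for (x, y), prob in joint.items():
        p_x[x] = p_x.get(x, 0.0) + prob
        p_y[y] = p_y.get(y, 0.0) + prob
    return sum(prob * math.log(prob / (p_x[x] * p_y[y]))
               for (x, y), prob in joint.items() if prob > 0)

def noisy_copies_joint(n, flip):
    """Joint law of (X, (Y_1, ..., Y_n)): X is a uniform bit and, given X,
    each Y_i independently equals X with probability 1 - flip."""
    joint = {}
    for x in (0, 1):
        for ys in itertools.product((0, 1), repeat=n):
            prob = 0.5
            for y in ys:
                prob *= flip if y != x else 1.0 - flip
            joint[(x, ys)] = prob
    return joint

n, flip = 3, 0.1
info_joint = mutual_information(noisy_copies_joint(n, flip))
info_single = mutual_information(noisy_copies_joint(1, flip))
print(info_joint, n * info_single, math.log(2))
```

Here the joint information is capped by $H(X) = \log 2$, so it falls strictly below the sum of the individual informations.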
X → Y → Z.
Proposition 2.1.9. With the above Markov chain, we have I(X; Z) ≤ I(X; Y ).
where we note that the final equality follows because X is independent of Z given Y :
There are related data processing inequalities for the KL-divergence—which we generalize in
the next section—as well. In this case, we may consider a simple Markov chain X → Z. If we
let $P_1$ and $P_2$ be distributions on $X$ and $Q_1$ and $Q_2$ be the induced distributions on $Z$, that is,
$Q_i(A) = \int P(Z \in A \mid x) \, dP_i(x)$, then we have
Dkl (Q1 ||Q2 ) ≤ Dkl (P1 ||P2 ) ,
the basic KL-divergence data processing inequality. A consequence of this is that, for any function
f and random variables X and Y on the same space, we have
Dkl (f (X)||f (Y )) ≤ Dkl (X||Y ) .
We explore these data processing inequalities more when we generalize KL-divergences in the next
section and in the exercises.
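For finite alphabets, the data processing inequality for KL-divergence can be checked directly. In the sketch below a channel is a row-stochastic matrix `channel[x][z] = P(Z = z | X = x)`; this representation, the example numbers, and the helper names are assumptions for illustration.

```python
import math

def kl_divergence(p, q):
    """D_kl(P||Q) for finite p.m.f.s; assumes q(x) > 0 whenever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push_through_channel(p, channel):
    """Distribution of Z when X ~ p and channel[x][z] = P(Z = z | X = x)."""
    num_outputs = len(channel[0])
    return [sum(p[x] * channel[x][z] for x in range(len(p)))
            for z in range(num_outputs)]

p1 = [0.7, 0.2, 0.1]
p2 = [0.2, 0.3, 0.5]
noisy_channel = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
merge_last_two = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # a deterministic function f

for channel in (noisy_channel, merge_last_two):
    q1 = push_through_channel(p1, channel)
    q2 = push_through_channel(p2, channel)
    print(kl_divergence(q1, q2), "<=", kl_divergence(p1, p2))
```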
qA (x) = i when x ∈ Ai .
2.2.2 KL-divergence
In this section, we present the general definition of a KL-divergence, which holds for any pair of
distributions. Let P and Q be distributions on a space X . Now, let A be a finite algebra on X
(as in the previous section, this is equivalent to picking a partition of X and then constructing the
associated algebra), and assume that its atoms are atoms(A). The KL-divergence between P and
Q conditioned on A is
\[ D_{\mathrm{kl}}(P \| Q \mid \mathcal{A}) := \sum_{A \in \mathrm{atoms}(\mathcal{A})} P(A) \log \frac{P(A)}{Q(A)}. \]
That is, we simply sum over the partition of X . Another way to write this is as follows. Let
$q : \mathcal{X} \to \{1, \ldots, m\}$ be a quantizer, and define the sets $A_i = q^{-1}(\{i\})$ to be the pre-images of each
$i$ (i.e. the different quantization regions, or the partition of $\mathcal{X}$ that $q$ induces). Then the quantized
KL-divergence between $P$ and $Q$ is
\[ D_{\mathrm{kl}}(P \| Q \mid q) := \sum_{i=1}^m P(A_i) \log \frac{P(A_i)}{Q(A_i)}. \]
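We may now give the fully general definition of KL-divergence: the KL-divergence between $P$
and $Q$ is defined as
\[ D_{\mathrm{kl}}(P \| Q) := \sup\{D_{\mathrm{kl}}(P \| Q \mid \mathcal{A}) : \mathcal{A} \text{ is a finite algebra on } \mathcal{X}\} = \sup\{D_{\mathrm{kl}}(P \| Q \mid q) : q \text{ quantizes } \mathcal{X}\}. \tag{2.2.1} \]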
This also gives a rigorous definition of mutual information. Indeed, if $X$ and $Y$ are random variables
with joint distribution $P_{XY}$ and marginal distributions $P_X$ and $P_Y$, we simply define
\[ I(X; Y) := D_{\mathrm{kl}}(P_{XY} \| P_X \times P_Y). \]
while if P and Q both have probability mass functions p and q, then—as we see in Exercise 2.6—the
definition (2.2.1) is equivalent to
\[ D_{\mathrm{kl}}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}, \]
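The quantized view of KL-divergence can be explored numerically. The sketch below quantizes the real line into two tails plus equal-width cells and computes the quantized divergence between $N(0, 1)$ and $N(1, 1)$, whose KL-divergence is $1/2$ by the Gaussian formula (2.1.5). The grid, the choice of distributions, and the function names are illustrative assumptions; the nested grids (4, 8, 64 cells) are chosen so that each partition refines the previous one.

```python
import math

def normal_cdf(x, mu=0.0):
    """C.d.f. of N(mu, 1) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0)))

def quantized_kl(num_bins, mu_p=0.0, mu_q=1.0, lo=-6.0, hi=6.0):
    """Quantized KL between N(mu_p, 1) and N(mu_q, 1) for the partition
    (-inf, lo], num_bins equal-width cells on (lo, hi], and (hi, inf)."""
    edges = ([-math.inf]
             + [lo + (hi - lo) * i / num_bins for i in range(num_bins + 1)]
             + [math.inf])
    total = 0.0
    for a, b in zip(edges, edges[1:]):
        p_cell = normal_cdf(b, mu_p) - normal_cdf(a, mu_p)
        q_cell = normal_cdf(b, mu_q) - normal_cdf(a, mu_q)
        if p_cell > 0:
            total += p_cell * math.log(p_cell / q_cell)
    return total

for num_bins in (4, 8, 64):
    print(num_bins, quantized_kl(num_bins))  # increases toward 0.5 as the grid refines
```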
Measure-theoretic definition of KL-divergence: If you have never seen measure theory before,
skim this section; while the notation may be somewhat intimidating, it is fine to always
consider only continuous or fully discrete distributions. We will describe an interpretation that will
mean for our purposes that one never needs to really think about measure theoretic issues.
The general definition (2.2.1) of KL-divergence is equivalent to the following. Let µ be a measure
on X , and assume that P and Q are absolutely continuous with respect to µ, with densities p and
q, respectively. (For example, take µ = P + Q.) Then
\[ D_{\mathrm{kl}}(P \| Q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \, d\mu(x). \tag{2.2.2} \]
The proof of this fact is somewhat involved, requiring the technology of Lebesgue integration. (See
Gray [94, Chapter 5].)
For those who have not seen measure theory, the interpretation of the equality (2.2.2) should be
as follows. When integrating a function $f(x)$, replace $\int f(x) \, d\mu(x)$ with one of two pairs of symbols:
one may simply think of $d\mu(x)$ as $dx$, so that we are performing standard integration $\int f(x) \, dx$, or
one should think of the integral operation $\int f(x) \, d\mu(x)$ as summing the argument of the integral, so
$d\mu(x) = 1$ and $\int f(x) \, d\mu(x) = \sum_x f(x)$. (This corresponds to $\mu$ being “counting measure” on $\mathcal{X}$.)
2.2.3 f -divergences
A more general notion of divergence is the so-called f -divergence, or Ali-Silvey divergence [4, 54]
(see also the alternate interpretations in the article by Liese and Vajda [131]). Here, the definition
is as follows. Let P and Q be probability distributions on the set X , and let f : R+ → R be a
convex function satisfying f (1) = 0. If X is a discrete set, then the f -divergence between P and Q
is
\[ D_f(P \| Q) := \sum_x q(x) f\left(\frac{p(x)}{q(x)}\right). \]
More generally, for any set $\mathcal{X}$ and a quantizer $q : \mathcal{X} \to \{1, \ldots, m\}$, letting $A_i = q^{-1}(\{i\}) = \{x \in
\mathcal{X} \mid q(x) = i\}$ be the partition the quantizer induces, we can define the quantized divergence
\[ D_f(P \| Q \mid q) = \sum_{i=1}^m Q(A_i) f\left(\frac{P(A_i)}{Q(A_i)}\right), \]
and the general definition of an $f$-divergence is (in analogy with the definition (2.2.1) of general
KL divergences)
\[ D_f(P \| Q) := \sup\{D_f(P \| Q \mid q) : q \text{ quantizes } \mathcal{X}\}. \tag{2.2.3} \]
The definition (2.2.3) shows that, any time we have computations involving f -divergences—such
as KL-divergence or mutual information—it is no loss of generality, when performing the compu-
tations, to assume that all distributions have finite discrete support. There is a measure-theoretic
version of the definition (2.2.3) which is frequently easier to use. Assume w.l.o.g. that P and Q are
absolutely continuous with respect to the base measure µ. The f divergence between P and Q is
then
\[ D_f(P \| Q) := \int_{\mathcal{X}} q(x) f\left(\frac{p(x)}{q(x)}\right) d\mu(x). \tag{2.2.4} \]
This definition, it turns out, is not quite as general as we would like—in particular, it is unclear
how we should define the integral for points x such that q(x) = 0. With that in mind, we recall
that the perspective transform (see Appendices B.1.1 and B.3.3) of a function f : R → R is defined
by pers(f )(t, u) = uf (t/u) if u > 0 and by +∞ if u ≤ 0. This function is convex in its arguments
(Proposition B.3.12). In fact, this is not quite enough for the fully correct definition. The closure of
a convex function f is cl f(x) = sup{ℓ(x) | ℓ ≤ f, ℓ affine}, the supremum over all affine functions
that globally lower bound f . Then [104, Proposition IV.2.2.2] the closure of pers(f ) is defined, for
any t0 ∈ int dom f , by
\[
\operatorname{cl} \operatorname{pers}(f)(t, u) =
\begin{cases}
u f(t/u) & \text{if } u > 0 \\
\lim_{\alpha \downarrow 0} \alpha f(t_0 - t + t/\alpha) & \text{if } u = 0 \\
+\infty & \text{if } u < 0.
\end{cases}
\]
(The choice of t0 does not affect the definition.) Then the fully general formula expressing the
f -divergence is
\[
D_f(P \| Q) = \int_{\mathcal{X}} \operatorname{cl} \operatorname{pers}(f)(p(x), q(x)) \, d\mu(x). \tag{2.2.5}
\]
This is what we mean by equation (2.2.4), which we use without comment.
In the exercises, we explore several properties of f -divergences, including the quantized repre-
sentation (2.2.3), showing different data processing inequalities and orderings of quantizers based
on the fineness of their induced partitions. Broadly, f -divergences satisfy essentially the same prop-
erties as KL-divergence, such as data-processing inequalities, and they provide a generalization of
mutual information. We explore f -divergences from additional perspectives later—they are impor-
tant both for optimality in estimation and related to consistency and prediction problems, as we
discuss in Chapter 14.
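As a small numerical illustration of the ordering of quantizers by fineness, the following sketch computes the quantized divergence for three nested partitions of a four-point set. The p.m.f.s, the partitions, and the function names are hypothetical choices made for this illustration, and we use f(t) = t log t as the generator.

```python
import math

def f_kl(t):
    """f(t) = t log t (with 0 log 0 = 0), a convex function with f(1) = 0."""
    return t * math.log(t) if t > 0 else 0.0

def quantized_divergence(p, q, cells, f):
    """Sum of Q(A) f(P(A)/Q(A)) over the cells A of a partition of range(len(p))."""
    total = 0.0
    for cell in cells:
        P_A = sum(p[x] for x in cell)
        Q_A = sum(q[x] for x in cell)
        total += Q_A * f(P_A / Q_A)
    return total

# Hypothetical p.m.f.s on a four-point set
p = [0.4, 0.1, 0.3, 0.2]
q = [0.1, 0.4, 0.25, 0.25]

trivial = [[0, 1, 2, 3]]      # a single cell: the quantizer keeps no information
coarse = [[0], [1, 2, 3]]     # merges three of the four points
fine = [[0], [1], [2], [3]]   # the partition into single points

d_trivial = quantized_divergence(p, q, trivial, f_kl)
d_coarse = quantized_divergence(p, q, coarse, f_kl)
d_fine = quantized_divergence(p, q, fine, f_kl)
print(d_trivial, d_coarse, d_fine)
```

In this instance the divergence grows as the partition is refined, which is the behavior the exercises ask us to prove in general.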
Examples We give several examples of f -divergences here; in Section 8.2.2 we provide a few
examples of their uses as well as providing a few natural inequalities between them.
Example 2.2.1 (KL-divergence): By taking f(t) = t log t, which is convex and satisfies
f(1) = 0, we obtain Df (P ||Q) = Dkl (P ||Q). ♦
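For discrete distributions, this example is easy to check numerically. The sketch below is a minimal illustration, with hypothetical p.m.f.s and function names, and it assumes q(x) > 0 for every x so that the plain sum defining the f -divergence applies.

```python
import math

def f_divergence(p, q, f):
    """Discrete f-divergence: sum over x of q(x) f(p(x)/q(x)); assumes every q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def f_kl(t):
    """f(t) = t log t, with the convention 0 log 0 = 0."""
    return t * math.log(t) if t > 0 else 0.0

# Hypothetical p.m.f.s on a three-point set
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

via_f = f_divergence(p, q, f_kl)
direct = sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)
print(via_f, direct)  # the two values agree up to rounding
```

Passing a different convex f with f(1) = 0 to `f_divergence` gives the other divergences in the examples that follow.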
Example 2.2.3 (Total variation distance): The total variation distance between probability
distributions P and Q defined on a set X is the maximum difference between probabilities they
assign on subsets of X :
\[
\|P - Q\|_{\mathrm{TV}} := \sup_{A \subset \mathcal{X}} |P(A) - Q(A)| = \sup_{A \subset \mathcal{X}} (P(A) - Q(A)), \tag{2.2.6}
\]
where the second equality follows by considering complements P (Ac ) = 1 − P (A). The total
variation distance, as we shall see later, is important for verifying the optimality of different
tests, and appears in the measurement of difficulty of solving hypothesis testing problems. With the
choice f(t) = ½|t − 1|, we obtain the total variation distance, that is, ‖P − Q‖TV = Df (P ||Q).
There are several alternative characterizations, which we provide as Lemma 2.2.4 next; it will
be useful in the sequel when we develop inequalities relating the divergences. ♦
Lemma 2.2.4. Let P, Q be probability measures with densities p, q with respect to a base measure
µ and f(t) = ½|t − 1|. Then
\[
\|P - Q\|_{\mathrm{TV}} = D_f(P \| Q) = \frac{1}{2} \int |p(x) - q(x)| \, d\mu(x)
= \int [p(x) - q(x)]_+ \, d\mu(x) = \int [q(x) - p(x)]_+ \, d\mu(x).
\]
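On a small finite set we can check these characterizations by brute force. The sketch below uses hypothetical p.m.f.s and compares the maximum of P(A) − Q(A) over all subsets A with the three integral (here, sum) forms in the lemma.

```python
from itertools import combinations

# Hypothetical p.m.f.s on a four-point set
p = [0.5, 0.3, 0.2, 0.0]
q = [0.25, 0.25, 0.3, 0.2]
n = len(p)

# Definition: maximize P(A) - Q(A) over all 2^n subsets A
tv_sup = max(
    sum(p[i] - q[i] for i in A)
    for r in range(n + 1)
    for A in combinations(range(n), r)
)

# The three representations from the lemma, with sums in place of integrals
tv_half_l1 = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
tv_pos_part = sum(max(pi - qi, 0.0) for pi, qi in zip(p, q))
tv_neg_part = sum(max(qi - pi, 0.0) for pi, qi in zip(p, q))

print(tv_sup, tv_half_l1, tv_pos_part, tv_neg_part)
```

The maximizing set is A = {x : p(x) > q(x)}, which is why the positive-part form coincides with the supremum.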
Example 2.2.5 (Hellinger distance): The Hellinger distance between probability distributions
P and Q defined on a set X is generated (up to the factor 1/2) by the function
f(t) = (√t − 1)² = t − 2√t + 1. The Hellinger distance is then
\[
d_{\mathrm{hel}}(P, Q)^2 := \frac{1}{2} \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 d\mu(x). \tag{2.2.7}
\]
The non-squared version dhel (P, Q) is indeed a distance between probability measures P and
Q. It is sometimes convenient to rewrite the Hellinger distance in terms of the affinity between
P and Q, as
\[
d_{\mathrm{hel}}(P, Q)^2 = \frac{1}{2} \int \left(p(x) + q(x) - 2\sqrt{p(x) q(x)}\right) d\mu(x) = 1 - \int \sqrt{p(x) q(x)} \, d\mu(x), \tag{2.2.8}
\]
which makes clear that dhel (P, Q) ∈ [0, 1] is on roughly the same scale as the variation distance;
we will say more later. ♦
Example 2.2.6 (χ²-divergence): The χ²-divergence is generated by taking f(t) = (t − 1)²,
so that
\[
D_{\chi^2}(P \| Q) := \int \left(\frac{p(x)}{q(x)} - 1\right)^2 q(x) \, d\mu(x) = \int \frac{p(x)^2}{q(x)} \, d\mu(x) - 1, \tag{2.2.9}
\]
where the equality is immediate because ∫ p dµ = ∫ q dµ = 1. ♦
As in Example 2.2.5, we have ∫ √(p(x)q(x)) dµ(x) = 1 − dhel (P, Q)², so this (along with the repre-
sentation Lemma 2.2.4 for variation distance) implies
\[
\|P - Q\|_{\mathrm{TV}} = \frac{1}{2} \int |p(x) - q(x)| \, d\mu(x) \le d_{\mathrm{hel}}(P, Q) \left(2 - d_{\mathrm{hel}}^2(P, Q)\right)^{1/2}.
\]
For the lower bound on total variation, note that for any a, b ∈ R+ , we have a + b − 2√(ab) ≤ |a − b|
(check the cases a > b and a < b separately); thus
\[
d_{\mathrm{hel}}^2(P, Q) = \frac{1}{2} \int \left[p(x) + q(x) - 2\sqrt{p(x) q(x)}\right] d\mu(x) \le \frac{1}{2} \int |p(x) - q(x)| \, d\mu(x),
\]
as desired.
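The two inequalities just derived sandwich the variation distance between functions of the Hellinger distance. A quick randomized sanity check, on hypothetical random p.m.f.s with a fixed seed, is sketched below; it is not a proof, only a way to catch algebra slips.

```python
import math
import random

def tv(p, q):
    """Total variation distance via half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance, following (2.2.7) with sums for integrals."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def random_pmf(k, rng):
    """A random p.m.f. on k points (normalized uniform weights)."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
violations = 0
for _ in range(1000):
    p, q = random_pmf(5, rng), random_pmf(5, rng)
    h2, t = hellinger_sq(p, q), tv(p, q)
    upper = math.sqrt(h2) * math.sqrt(2.0 - h2)
    # Check d_hel^2 <= TV <= d_hel * sqrt(2 - d_hel^2), with a rounding tolerance
    if not (h2 <= t + 1e-12 and t <= upper + 1e-12):
        violations += 1
print("violations:", violations)
```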
Several important inequalities relate the variation distance to the KL-divergence. We state
two of them in the next proposition, both of which are important enough to justify
their own names.
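Proposition 2.2.8. The total variation distance satisfies the following relationships.
(a) Pinsker’s inequality: for any distributions P and Q,
\[
\|P - Q\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{kl}}(P \| Q). \tag{2.2.10}
\]
(b) The Bretagnolle-Huber inequality: for any distributions P and Q,
\[
\|P - Q\|_{\mathrm{TV}} \le \sqrt{1 - \exp(-D_{\mathrm{kl}}(P \| Q))} \le 1 - \frac{1}{2} \exp(-D_{\mathrm{kl}}(P \| Q)).
\]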
Proof Exercise 2.19 outlines one proof of Pinsker’s inequality using the data processing inequality
(Proposition 2.2.13). We present an alternative via the Cauchy-Schwarz inequality. Using the
definition (2.2.1) of the KL-divergence, we may assume without loss of generality that P and Q are
finitely supported, say with p.m.f.s p1 , . . . , pm and q1 , . . . , qm . Define the negative entropy function
h(p) = Σi pi log pi , where the sum runs over i = 1, . . . , m. Then showing that
Dkl (P ||Q) ≥ 2‖P − Q‖²TV = ½‖p − q‖₁² is equivalent to showing that
\[
h(p) \ge h(q) + \langle \nabla h(q), p - q \rangle + \frac{1}{2} \|p - q\|_1^2, \tag{2.2.11}
\]
because by inspection h(p) − h(q) − ⟨∇h(q), p − q⟩ = Σi pi log(pi /qi ). We do this via a Taylor expansion:
we have
\[
\nabla h(p) = [\log p_i + 1]_{i=1}^m \quad \text{and} \quad \nabla^2 h(p) = \operatorname{diag}([1/p_i]_{i=1}^m).
\]
By Taylor’s theorem, there is some p̃ = (1 − t)p + tq, where t ∈ [0, 1], such that
\[
h(p) = h(q) + \langle \nabla h(q), p - q \rangle + \frac{1}{2} \langle p - q, \nabla^2 h(\tilde{p})(p - q) \rangle.
\]
But looking at the final quadratic, we have for any vector v and any p ≥ 0 satisfying Σi pi = 1
(in particular for p̃ in place of p),
\[
\langle v, \nabla^2 h(p) v \rangle = \sum_{i=1}^m \frac{v_i^2}{p_i} = \|p\|_1 \sum_{i=1}^m \frac{v_i^2}{p_i}
\ge \left( \sum_{i=1}^m \sqrt{p_i} \frac{|v_i|}{\sqrt{p_i}} \right)^2 = \|v\|_1^2,
\]
where the inequality follows from Cauchy-Schwarz applied to the vectors [√pi ]i and [|vi |/√pi ]i .
Thus inequality (2.2.11) holds.
For the claim (b), we use Proposition 2.2.7. Let a = ∫ √(p(x)q(x)) dµ(x) be a shorthand for the
affinity, so that d²hel (P, Q) = 1 − a. Then Proposition 2.2.7 gives
‖P − Q‖TV ≤ √(1 − a) √(1 + a) = √(1 − a²). Now apply Jensen’s inequality to the exponential: we have
\[
\int \sqrt{p(x) q(x)} \, d\mu(x) = \int \sqrt{\frac{q(x)}{p(x)}} \, p(x) \, d\mu(x)
= \int \exp\left(\frac{1}{2} \log \frac{q(x)}{p(x)}\right) p(x) \, d\mu(x)
\ge \exp\left(\frac{1}{2} \int p(x) \log \frac{q(x)}{p(x)} \, d\mu(x)\right)
= \exp\left(-\frac{1}{2} D_{\mathrm{kl}}(P \| Q)\right).
\]
In particular, √(1 − a²) ≤ √(1 − exp(−½ Dkl (P ||Q))²), which is the first claim of part (b). For the
second, note that √(1 − c) ≤ 1 − ½c for c ∈ [0, 1] by concavity of the square root.
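The chain of inequalities in this proof, namely ‖P − Q‖²TV ≤ ½ Dkl (P ||Q) and ‖P − Q‖TV ≤ √(1 − exp(−Dkl (P ||Q))) ≤ 1 − ½ exp(−Dkl (P ||Q)), can be spot-checked numerically. The following sketch uses hypothetical random p.m.f.s with strictly positive entries (so the KL-divergence is finite) and a fixed seed.

```python
import math
import random

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL-divergence in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def random_pmf(k, rng):
    """Random p.m.f. with strictly positive entries (the offset 1e-3 is an arbitrary choice)."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
violations = 0
for _ in range(1000):
    p, q = random_pmf(6, rng), random_pmf(6, rng)
    t, d = tv(p, q), kl(p, q)
    pinsker_ok = t ** 2 <= 0.5 * d + 1e-12
    first = math.sqrt(max(1.0 - math.exp(-d), 0.0))   # sqrt(1 - exp(-KL))
    second = 1.0 - 0.5 * math.exp(-d)                 # 1 - exp(-KL)/2
    part_b_ok = t <= first + 1e-12 and first <= second + 1e-12
    if not (pinsker_ok and part_b_ok):
        violations += 1
print("violations:", violations)
```

Pinsker's bound is the sharper of the two when the KL-divergence is small, while the exponential bound remains nontrivial (below 1) however large the divergence is.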
We also have the following bounds on the KL-divergence in terms of the χ²-divergence.
Proposition 2.2.9. For any distributions P, Q,
It is also possible to relate mutual information between distributions to f -divergences, and even
to bound the mutual information above and below by the Hellinger distance for certain problems. In
this case, we consider the following situation: let V ∈ {0, 1} uniformly at random, and conditional
on V = v, draw X ∼ Pv for some distribution Pv on a space X . Then we have that
\[
I(X; V) = \frac{1}{2} D_{\mathrm{kl}}(P_0 \| \bar{P}) + \frac{1}{2} D_{\mathrm{kl}}(P_1 \| \bar{P}),
\]
where P̄ = ½P0 + ½P1 . The divergence measure on the right side of the preceding identity is a
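special case of the Jensen-Shannon divergence, defined for λ ∈ [0, 1] by
\[
D_{\mathrm{js},\lambda}(P \| Q) := \lambda D_{\mathrm{kl}}(P \| \lambda P + (1 - \lambda) Q) + (1 - \lambda) D_{\mathrm{kl}}(Q \| \lambda P + (1 - \lambda) Q),
\]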
which is a symmetrized and bounded variant of the typical KL-divergence (we use the shorthand
Djs (P ||Q) := Djs,½ (P ||Q) for the symmetric case). As a consequence, we also have
\[
I(X; V) = \frac{1}{2} D_f(P_0 \| P_1) + \frac{1}{2} D_f(P_1 \| P_0),
\]
where f(t) = −t log(1/(2t) + ½) = t log(2t/(t + 1)), so that the mutual information is a particular f -divergence.
This form—as we see in the later chapters—is frequently convenient because it gives an object
with similar tensorization properties to KL-divergence while enjoying the boundedness properties
of Hellinger and variation distances. The following proposition captures the latter properties.
But of course the final integral is ‖P1 − P0 ‖TV , giving I(X; V ) ≤ log 2 · ‖P0 − P1 ‖TV . Conversely,
for the lower bound on Djs (P0 ||P1 ), we use the upper bound h2 (p) ≤ 2 log 2 · √(p(1 − p)) to obtain
\[
\frac{1}{\log 2} I(X; V) \ge 1 - \int (p_0 + p_1) \sqrt{\frac{p_0}{p_1 + p_0} \left(1 - \frac{p_0}{p_1 + p_0}\right)} \, d\mu
= 1 - \int \sqrt{p_0 p_1} \, d\mu = \frac{1}{2} \int (\sqrt{p_0} - \sqrt{p_1})^2 \, d\mu = d_{\mathrm{hel}}^2(P_0, P_1)
\]
as desired.
The Hellinger-based upper bound is simpler: by Proposition 2.2.9, we have
\[
\begin{aligned}
D_{\mathrm{js}}(P_0 \| P_1) &= \frac{1}{2} D_{\mathrm{kl}}(P_0 \| (P_0 + P_1)/2) + \frac{1}{2} D_{\mathrm{kl}}(P_1 \| (P_0 + P_1)/2) \\
&\le \frac{1}{2} D_{\chi^2}(P_0 \| (P_0 + P_1)/2) + \frac{1}{2} D_{\chi^2}(P_1 \| (P_0 + P_1)/2) \\
&= \frac{1}{2} \int \frac{(p_0 - p_1)^2}{p_0 + p_1} \, d\mu = \frac{1}{2} \int \frac{(\sqrt{p_0} - \sqrt{p_1})^2 (\sqrt{p_0} + \sqrt{p_1})^2}{p_0 + p_1} \, d\mu.
\end{aligned}
\]
Now note that (a + b)² ≤ 2a² + 2b² for any a, b ∈ R, and so (√p0 + √p1 )² ≤ 2(p0 + p1 ), and thus
the final expression is at most ∫ (√p0 − √p1 )² dµ = 2d²hel (P0 , P1 ).
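Collecting the bounds from this proof, for V uniform on {0, 1} we have log 2 · d²hel (P0 , P1 ) ≤ I(X; V ) = Djs (P0 ||P1 ) ≤ min{log 2 · ‖P0 − P1 ‖TV , 2d²hel (P0 , P1 )}. The sketch below checks this sandwich on hypothetical random p.m.f.s (fixed seed) and on a pair with disjoint supports, where the outer bounds involving log 2 hold with equality.

```python
import math
import random

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def js(p0, p1):
    """Symmetric Jensen-Shannon divergence, equal to I(X; V) for V uniform on {0, 1}."""
    mix = [(a + b) / 2.0 for a, b in zip(p0, p1)]
    return 0.5 * kl(p0, mix) + 0.5 * kl(p1, mix)

def bounds_hold(p0, p1, tol=1e-12):
    d, h2, t = js(p0, p1), hellinger_sq(p0, p1), tv(p0, p1)
    lower = math.log(2.0) * h2
    upper = min(math.log(2.0) * t, 2.0 * h2)
    return lower <= d + tol and d <= upper + tol

def random_pmf(k, rng):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(2)
all_ok = all(bounds_hold(random_pmf(5, rng), random_pmf(5, rng)) for _ in range(1000))
disjoint_ok = bounds_hold([1.0, 0.0], [0.0, 1.0])
print(all_ok, disjoint_ok)
```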
Df (λP1 + (1 − λ)P2 ||λQ1 + (1 − λ)Q2 ) ≤ λDf (P1 ||Q1 ) + (1 − λ)Df (P2 ||Q2 ) .
The proof of this proposition we leave as Exercise 2.11, which we treat as a consequence of the
more general “log-sum” like inequalities of Exercise 2.8. It is, however, an immediate consequence
of the fully specified definition (2.2.5) of an f -divergence, because pers(f ) is jointly convex. As an
immediate corollary, we see that the same result is true for KL-divergence as well.
Corollary 2.2.12. The KL-divergence Dkl (P ||Q) is jointly convex in its arguments P and Q.
We can also provide more general data processing inequalities for f -divergences, paralleling
those for the KL-divergence. In this case, we consider random variables X and Z on spaces X
and Z, respectively, and a Markov transition kernel K giving the Markov chain X → Z. That
is, K(· | x) is a probability distribution on Z for each x ∈ X , and conditioned on X = x, Z has
distribution K(· | x) so that K(A | x) = P(Z ∈ A | X = x). Certainly, this includes the situation
when Z = φ(X) for some function φ, and more generally when Z = φ(X, U ) for a function φ and
some additional randomness U . For a distribution P on X , we then define the marginals
\[
K_P(A) := \int_{\mathcal{X}} K(A \mid x) \, dP(x).
\]
Proposition 2.2.13. Let P and Q be distributions on X and let K be any Markov kernel. Then
\[
D_f(K_P \| K_Q) \le D_f(P \| Q).
\]
Thus, further processing of random variables can only bring them “closer” in the space of distribu-
tions; downstream processing of signals cannot make them further apart as distributions.
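In the finite case a Markov kernel is a row-stochastic matrix, and the statement that processing brings distributions closer is easy to observe. The sketch below uses a hypothetical kernel from a three-point set to a two-point set and compares the KL-divergence and the variation distance before and after processing.

```python
import math

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def push_forward(p, K):
    """Marginal K_P(z) = sum_x p(x) K[x][z] for a row-stochastic matrix K."""
    return [sum(p[x] * K[x][z] for x in range(len(p))) for z in range(len(K[0]))]

# Hypothetical inputs: two p.m.f.s on {0, 1, 2} and a kernel into {0, 1}
p = [0.6, 0.3, 0.1]
q = [0.2, 0.3, 0.5]
K = [
    [0.8, 0.2],  # K(. | x = 0)
    [0.5, 0.5],  # K(. | x = 1)
    [0.1, 0.9],  # K(. | x = 2)
]

Kp, Kq = push_forward(p, K), push_forward(q, K)
print("KL before/after:", kl(p, q), kl(Kp, Kq))
print("TV before/after:", tv(p, q), tv(Kp, Kq))
```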
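Proposition 2.3.1. Let X be an arbitrary set. For any distributions P1 and P2 on X , we have
\[
\inf_{\Psi} \{P_1(\Psi(X) \neq 1) + P_2(\Psi(X) \neq 2)\} = 1 - \|P_1 - P_2\|_{\mathrm{TV}},
\]
where the infimum is taken over all tests Ψ : X → {1, 2}.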
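Proof Any test Ψ : X → {1, 2} has an acceptance region, call it A ⊂ X , where it outputs 1 and
a region Ac where it outputs 2. Then P1 (Ψ ≠ 1) + P2 (Ψ ≠ 2) = P1 (Ac ) + P2 (A) = 1 − (P1 (A) − P2 (A)), so
\[
\inf_{\Psi} \{P_1(\Psi \neq 1) + P_2(\Psi \neq 2)\} = \inf_{A \subset \mathcal{X}} \{1 - (P_1(A) - P_2(A))\} = 1 - \sup_{A \subset \mathcal{X}} (P_1(A) - P_2(A)),
\]
which equals 1 − ‖P1 − P2 ‖TV by the definition of the total variation distance.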
In the two-hypothesis case, we also know that the optimal test, by the Neyman-Pearson lemma,
is a likelihood ratio test. That is, assuming that P1 and P2 have densities p1 and p2 , the optimal
test is of the form
\[
\Psi(X) =
\begin{cases}
1 & \text{if } \frac{p_1(X)}{p_2(X)} \ge t \\
2 & \text{if } \frac{p_1(X)}{p_2(X)} < t
\end{cases}
\]
for some threshold t ≥ 0. In the case that the prior probabilities on P1 and P2 are each ½, then
t = 1 is optimal.
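On a small finite set we can enumerate every test and confirm both claims: the best achievable sum of errors is 1 − ‖P1 − P2 ‖TV , and the likelihood ratio test with threshold t = 1 attains it. The p.m.f.s below are hypothetical.

```python
from itertools import product

# Hypothetical p.m.f.s on a four-point set
p1 = [0.5, 0.3, 0.15, 0.05]
p2 = [0.1, 0.2, 0.3, 0.4]
n = len(p1)

def error_sum(test):
    """P1(test != 1) + P2(test != 2), where test[x] in {1, 2} is the decision at x."""
    err1 = sum(p1[x] for x in range(n) if test[x] != 1)
    err2 = sum(p2[x] for x in range(n) if test[x] != 2)
    return err1 + err2

# Brute force over all 2^n deterministic tests
best = min(error_sum(test) for test in product([1, 2], repeat=n))

tv = 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))

# Likelihood ratio test with threshold t = 1: output 1 exactly when p1(x) >= p2(x)
lrt = tuple(1 if p1[x] >= p2[x] else 2 for x in range(n))

print(best, 1.0 - tv, error_sum(lrt))
```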
We give one example application of Proposition 2.3.1 to the problem of testing a normal mean.
Example 2.3.2 (Testing a normal mean): Suppose we observe X1 , . . . , Xn drawn i.i.d. from P
for P = P1 or P = P2 , where Pv is the normal distribution N(µv , σ²), where µ1 ≠ µ2 . We would like to
understand the sample size n necessary to guarantee that no test can have small error, that
is, say, that
\[
\inf_{\Psi} \{P_1(\Psi(X_1, \ldots, X_n) \neq 1) + P_2(\Psi(X_1, \ldots, X_n) \neq 2)\} \ge \frac{1}{2}.
\]
By Proposition 2.3.1, we have that
\[
\inf_{\Psi} \{P_1(\Psi(X_1, \ldots, X_n) \neq 1) + P_2(\Psi(X_1, \ldots, X_n) \neq 2)\} = 1 - \|P_1^n - P_2^n\|_{\mathrm{TV}},
\]
where Pvn denotes the n-fold product of Pv , that is, the distribution of X1 , . . . , Xn drawn i.i.d. from Pv .
The interaction between total variation distance and product distributions is somewhat subtle,
so it is often advisable to use a divergence measure more attuned to the i.i.d. nature of the sam-
pling scheme. Two such measures are the KL-divergence and Hellinger distance, both of which
we explore in the coming chapters. With that in mind, we apply Pinsker’s inequality (2.2.10)
to see that ‖P1n − P2n ‖²TV ≤ ½ Dkl (P1n ||P2n ) = (n/2) Dkl (P1 ||P2 ), which implies that
\[
1 - \|P_1^n - P_2^n\|_{\mathrm{TV}} \ge 1 - \left(\frac{n}{2} D_{\mathrm{kl}}(P_1 \| P_2)\right)^{1/2}
= 1 - \left(\frac{n}{2} \cdot \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}\right)^{1/2}
= 1 - \frac{\sqrt{n}}{2} \frac{|\mu_1 - \mu_2|}{\sigma}.
\]
In particular, if n ≤ σ²/(µ1 − µ2 )², then we have our desired lower bound of ½.
Conversely, a calculation yields that n ≥ Cσ²/(µ1 − µ2 )², for some numerical constant C ≥ 1, implies
small probability of error. We leave this calculation to the reader. ♦
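To see how loose the Pinsker-based bound is in this example, we can compare it with the exact value of 1 − ‖P1n − P2n ‖TV . The sketch below relies on a fact not derived in the text: since the sample mean is sufficient and two normals with common variance s² and means a, b have variation distance 2Φ(|a − b|/(2s)) − 1, the product distributions satisfy ‖P1n − P2n ‖TV = 2Φ(√n |µ1 − µ2 |/(2σ)) − 1. The parameter values are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Standard normal c.d.f., written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exact_min_error_sum(n, mu1, mu2, sigma):
    """1 - TV(P1^n, P2^n), using the closed form for equal-variance normals."""
    x = math.sqrt(n) * abs(mu1 - mu2) / (2.0 * sigma)
    return 1.0 - (2.0 * std_normal_cdf(x) - 1.0)

def pinsker_lower_bound(n, mu1, mu2, sigma):
    """The lower bound 1 - sqrt(n) |mu1 - mu2| / (2 sigma) from the example."""
    return 1.0 - math.sqrt(n) * abs(mu1 - mu2) / (2.0 * sigma)

mu1, mu2, sigma = 0.0, 0.5, 1.0   # hypothetical values, so sigma^2 / (mu1 - mu2)^2 = 4
for n in (1, 4, 16, 64):
    print(n, exact_min_error_sum(n, mu1, mu2, sigma), pinsker_lower_bound(n, mu1, mu2, sigma))
```

At n = σ²/(µ1 − µ2 )² the Pinsker bound equals ½ exactly, while for large n it becomes negative and hence vacuous even though the exact error sum stays positive.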
and we wish to provide lower bounds on the probability of error, that is, the probability that X̂ ≠ X. If we let
the function h2 (p) = −p log p − (1 − p) log(1 − p) denote the binary entropy (entropy of a Bernoulli
random variable with parameter p), Fano’s inequality takes the following form [e.g. 53, Chapter 2]:
Proposition 2.3.3 (Fano inequality). For any Markov chain X → Y → X̂, we have
\[
h_2(\mathbb{P}(\hat{X} \neq X)) + \mathbb{P}(\hat{X} \neq X) \log(|\mathcal{X}| - 1) \ge H(X \mid \hat{X}). \tag{2.3.1}
\]
Proof This proof follows by expanding an entropy functional in two different ways. Let E be
the indicator for the event that X̂ ≠ X, that is, E = 1 if X̂ ≠ X and is 0 otherwise. Then we have
\[
\begin{aligned}
H(X, E \mid \hat{X}) &= H(X \mid E, \hat{X}) + H(E \mid \hat{X}) \\
&= \mathbb{P}(E = 1) H(X \mid E = 1, \hat{X}) + \mathbb{P}(E = 0) \underbrace{H(X \mid E = 0, \hat{X})}_{= 0} + H(E \mid \hat{X}),
\end{aligned}
\]
where the zero follows because given there is no error, X has no variability given X̂. Expanding
the entropy by the chain rule in a different order, we have
\[
H(X, E \mid \hat{X}) = H(X \mid \hat{X}) + \underbrace{H(E \mid \hat{X}, X)}_{= 0},
\]
where the zero follows because E is a function of X and X̂. Combining the two expansions gives
\[
H(X \mid \hat{X}) = H(X, E \mid \hat{X}) = \mathbb{P}(E = 1) H(X \mid E = 1, \hat{X}) + H(E \mid \hat{X}).
\]
Noting that H(E | X̂) ≤ H(E) = h2 (P(E = 1)), as conditioning reduces entropy, and that
H(X | E = 1, X̂) ≤ log(|X | − 1), as X can take on at most |X | − 1 values when there is an error,
completes the proof.
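Corollary 2.3.4. Assume that X is uniform on X . For any Markov chain X → Y → X̂,
\[
\mathbb{P}(\hat{X} \neq X) \ge 1 - \frac{I(X; Y) + \log 2}{\log(|\mathcal{X}|)}. \tag{2.3.2}
\]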
Proof Let Perror = P(X ≠ X̂) denote the probability of error. Noting that h2 (p) ≤ log 2 for any
p ∈ [0, 1] (recall inequality (2.1.2), that is, that uniform random variables maximize entropy), then
using Proposition 2.3.3, we have
\[
\log 2 + P_{\mathrm{error}} \log(|\mathcal{X}|) \ge h_2(P_{\mathrm{error}}) + P_{\mathrm{error}} \log(|\mathcal{X}| - 1)
\stackrel{(i)}{\ge} H(X \mid \hat{X}) \stackrel{(ii)}{=} H(X) - I(X; \hat{X}).
\]
Here step (i) uses Proposition 2.3.3 and step (ii) uses the definition of mutual information, that
I(X; X̂) = H(X) − H(X | X̂). The data processing inequality implies that I(X; X̂) ≤ I(X; Y ),
and using H(X) = log(|X |) completes the proof.
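To get a feel for inequality (2.3.2), the sketch below evaluates both sides for a hypothetical symmetric channel: X is uniform on m symbols, and Y equals X with probability 1 − δ and is otherwise uniform on the remaining m − 1 symbols. We use the maximum a posteriori guess, which minimizes the error probability, so the bound must hold for it.

```python
import math

def symmetric_channel_joint(m, delta):
    """Joint p.m.f. of (X, Y): X uniform on m symbols, Y = X with probability 1 - delta."""
    off_diagonal = delta / (m * (m - 1))
    return [[(1.0 - delta) / m if x == y else off_diagonal for y in range(m)]
            for x in range(m)]

def mutual_information(joint):
    """I(X; Y) in nats, computed directly from the joint p.m.f."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(v * math.log(v / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, v in enumerate(row) if v > 0)

def map_error(joint):
    """Error probability of the MAP guess, which picks argmax_x joint[x][y] for each y."""
    return 1.0 - sum(max(col) for col in zip(*joint))

m = 16
results = []
for delta in (0.05, 0.2, 0.5, 0.8):
    joint = symmetric_channel_joint(m, delta)
    fano = 1.0 - (mutual_information(joint) + math.log(2.0)) / math.log(m)
    results.append((delta, map_error(joint), fano))
    print(delta, map_error(joint), fano)
```

For small δ the bound is negative and therefore vacuous; it becomes informative only once I(X; Y ) falls well below log m.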
In particular, Corollary 2.3.4 shows that when X is chosen uniformly at random and we observe
Y , we have
\[
\inf_{\Psi} \mathbb{P}(\Psi(Y) \neq X) \ge 1 - \frac{I(X; Y) + \log 2}{\log |\mathcal{X}|},
\]
where the infimum is taken over all testing procedures Ψ. Some interpretation of this quantity
is helpful. If we think roughly of the number of bits it takes to describe a variable X uniformly
chosen from X , then we expect that log2 |X | bits are necessary (and sufficient). Thus, until we
collect enough information that I(X; Y ) ≈ log |X |, so that I(X; Y )/ log |X | ≈ 1, we are unlikely to
be able to identify the variable X with any substantial probability. So we must collect enough
bits to actually discover X.
Example 2.3.5 (20 questions game): In the 20 questions game—a standard children’s game—
there are two players, the “chooser” and the “guesser,” and an agreed upon universe X . The
chooser picks an element x ∈ X , and the guesser’s goal is to find x by using a series of yes/no
questions about x. We consider optimal strategies for each player in this game, assuming that
X is finite and letting m = |X | be the universe size for shorthand.
For the guesser, it is clear that at most ⌈log2 m⌉ questions are necessary to guess the item
X that the chooser has picked: at each round of the game, the guesser asks a question that
eliminates half of the remaining possible items. Indeed, let us assume that m = 2^l for some
l ∈ N; if not, the guesser can always make her task more difficult by increasing the size of X
until it is a power of 2. Thus, after k rounds, there are m2^{−k} items left, and we have
\[
m \left(\frac{1}{2}\right)^k \le 1 \quad \text{if and only if} \quad k \ge \log_2 m.
\]
For the converse (the chooser’s strategy), let Y1 , Y2 , . . . , Yk be the sequence of yes/no answers
given to the guesser. Assume that the chooser picks X uniformly at random in X . Then Fano’s
inequality (2.3.2) implies that for the guess X̂ the guesser makes,
\[
\mathbb{P}(\hat{X} \neq X) \ge 1 - \frac{I(X; Y_1, \ldots, Y_k) + \log 2}{\log m}.
\]
By the chain rule for mutual information, we have
\[
I(X; Y_1, \ldots, Y_k) = \sum_{i=1}^k I(X; Y_i \mid Y_{1:i-1}) = \sum_{i=1}^k \left[H(Y_i \mid Y_{1:i-1}) - H(Y_i \mid Y_{1:i-1}, X)\right] \le \sum_{i=1}^k H(Y_i).
\]
As the answers Yi are yes/no, we have H(Yi ) ≤ log 2, so that I(X; Y1:k ) ≤ k log 2. Thus we
find
\[
\mathbb{P}(\hat{X} \neq X) \ge 1 - \frac{(k + 1) \log 2}{\log m} = \frac{\log_2 m - 1 - k}{\log_2 m},
\]
so that the guesser must have k ≥ log2 (m/2) to be guaranteed that she will make no
mistakes. ♦
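The guesser's halving strategy in this example is ordinary binary search. The sketch below identifies the universe with {0, . . . , m − 1} (an assumption made for concreteness), asks questions of the form "is x < mid?", and counts how many questions each possible secret requires.

```python
def guess(secret, m):
    """Find `secret` in range(m) with yes/no questions of the form 'is x < mid?'."""
    lo, hi = 0, m          # the remaining candidates are lo, ..., hi - 1
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1
        if secret < mid:   # answer "yes": keep the lower part
            hi = mid
        else:              # answer "no": keep the upper part
            lo = mid
    return lo, questions

def worst_case_questions(m):
    """Largest number of questions used over all possible secrets."""
    return max(guess(secret, m)[1] for secret in range(m))

for m in (20, 64, 1000):
    # (m - 1).bit_length() equals the ceiling of log2(m) for m >= 2
    print(m, worst_case_questions(m), (m - 1).bit_length())
```

The worst case matches ⌈log2 m⌉, which is within one question of the lower bound k ≥ log2 (m/2) obtained from Fano's inequality.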
Figure 2.1. Prefix-tree encoding of a set of symbols. The encoding for x1 is 0, for x2 is 10, for x3
is 11, for x4 is 12, for x5 is 20, for x6 is 21, and nothing is encoded as 1, 2, or 22.
Theorem 2.4.2. Let X be a finite or countable set, and let ℓ : X → N be a function. If ℓ(x) is the
length of the encoding of the symbol x in a uniquely decodable d-ary code, then
\[
\sum_{x \in \mathcal{X}} d^{-\ell(x)} \le 1. \tag{2.4.1}
\]
Conversely, given any function ℓ : X → N satisfying inequality (2.4.1), there is a prefix code whose
codewords have length ℓ(x) for each x ∈ X .
Proof We prove the first statement of the theorem first by a counting and asymptotic argument.
We begin by assuming that X is finite; we eliminate this assumption subsequently. As a
consequence, there is some maximum length ℓmax such that ℓ(x) ≤ ℓmax for all x ∈ X . For a sequence
x1 , . . . , xn ∈ X , we have by the definition of our encoding strategy that ℓ(x1 , . . . , xn ) = Σi ℓ(xi ),
where the sum runs over i = 1, . . . , n. In addition, for each m we let
\[
E_n(m) := \{x_{1:n} \in \mathcal{X}^n \text{ such that } \ell(x_{1:n}) = m\}
\]
denote the sequences x1:n encoded with codewords of length m in our code; then, as the code is uniquely
decodable, we certainly have card(En (m)) ≤ d^m for all n and m. Moreover, for all x1:n ∈ X n we
have ℓ(x1:n ) ≤ nℓmax . We thus re-index the sum Σx d^{−ℓ(x)} and compute
\[
\left(\sum_{x \in \mathcal{X}} d^{-\ell(x)}\right)^n = \sum_{x_{1:n} \in \mathcal{X}^n} d^{-\ell(x_{1:n})}
= \sum_{m=1}^{n \ell_{\max}} \operatorname{card}(E_n(m)) \, d^{-m}
\le \sum_{m=1}^{n \ell_{\max}} d^{m - m} = n \ell_{\max}.
\]
as each subset {x ∈ X : ℓ(x) ≤ k} is uniquely decodable, we have Dk ≤ 1 for all k. Then
1 ≥ limk→∞ Dk = Σx∈X d^{−ℓ(x)} .
The achievability of such a code is straightforward by a pictorial argument (recall Figure 2.1),
so we sketch the result non-rigorously. Indeed, let Td be an (infinite) d-ary tree. Then, at each
level m of the tree, assign one of the nodes at that level to each symbol x ∈ X such that `(x) = m.
Eliminate the subtree below that node, and repeat with the remaining symbols. The codeword
corresponding to symbol x is then the path to the symbol in the tree.
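The tree argument can be turned into a short constructive procedure. The sketch below is one standard way to do it (a "canonical" assignment that walks the tree level by level); the function names are hypothetical, and it assumes d ≤ 10 so that each digit is a single character. Applied to the lengths in Figure 2.1 with d = 3, it reproduces the codewords listed in the caption.

```python
from fractions import Fraction

def kraft_sum(lengths, d):
    """Exact value of the sum over symbols of d^(-length)."""
    return sum(Fraction(1, d ** l) for l in lengths.values())

def to_digits(value, length, d):
    """Write `value` in base d using exactly `length` digits (assumes d <= 10)."""
    digits = []
    for _ in range(length):
        value, r = divmod(value, d)
        digits.append(str(r))
    return "".join(reversed(digits))

def prefix_code(lengths, d):
    """Assign codewords of the requested lengths, shortest first, leftmost free node first."""
    if kraft_sum(lengths, d) > 1:
        raise ValueError("lengths violate the Kraft-McMillan inequality")
    code, counter, prev = {}, 0, None
    for symbol, l in sorted(lengths.items(), key=lambda kv: kv[1]):
        if prev is not None:
            counter *= d ** (l - prev)   # descend to level l in the d-ary tree
        code[symbol] = to_digits(counter, l, d)
        counter += 1                     # the subtree below this node is now used up
        prev = l
    return code

def is_prefix_free(words):
    return all(not b.startswith(a) for a in words for b in words if a != b)

lengths = {"x1": 1, "x2": 2, "x3": 2, "x4": 2, "x5": 2, "x6": 2}
code = prefix_code(lengths, d=3)
print(code, is_prefix_free(list(code.values())))
```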
With the Kraft-McMillan theorem in place, we may directly relate the entropy of a random
variable to the length of possible encodings for the variable; in particular, we show that the entropy
is essentially the best possible code length of a uniquely decodable source code. In this theorem,
we use the shorthand
\[
H_d(X) := -\sum_{x \in \mathcal{X}} p(x) \log_d p(x).
\]
Theorem 2.4.3. Let X ∈ X be a discrete random variable distributed according to P and let ℓC
be the length function associated with a d-ary encoding C : X → {0, . . . , d − 1}∗ . In addition, let C
be the set of all uniquely decodable d-ary codes for X . Then
\[
H_d(X) \le \inf_{C \in \mathcal{C}} \mathbb{E}_P[\ell_C(X)] \le H_d(X) + 1.
\]
Proof The lower bound is an argument by convex optimization, while for the upper bound
we give an explicit length function and (implicit) prefix code attaining the bound. For the lower
bound, we assume for simplicity that X is finite, and we identify X = {1, . . . , |X |} (let m = |X | for
shorthand). Then as C consists of uniquely decodable codebooks, all the associated length functions
must satisfy the Kraft-McMillan inequality (2.4.1). Letting ℓi = ℓ(i), the minimal encoding length
is at least
\[
\inf_{\ell \in \mathbb{R}^m} \left\{ \sum_{i=1}^m p_i \ell_i : \sum_{i=1}^m d^{-\ell_i} \le 1 \right\}.
\]
By introducing the Lagrange multiplier λ ≥ 0 for the inequality constraint, we may write the
Lagrangian for the preceding minimization problem as
\[
L(\ell, \lambda) = p^\top \ell + \lambda \left( \sum_{i=1}^m d^{-\ell_i} - 1 \right)
\quad \text{with} \quad
\nabla_\ell L(\ell, \lambda) = p - \lambda \left[ d^{-\ell_i} \log d \right]_{i=1}^m.
\]
In particular, the optimal ℓ satisfies ℓi = logd (θ/pi ) for some constant θ, and solving
Σi d^{−logd (θ/pi )} = 1 (summing over i = 1, . . . , m) gives θ = 1 and ℓ(i) = logd (1/pi ).
To attain the result, simply set our encoding to be ℓ(x) = ⌈logd (1/P (X = x))⌉, which satisfies the
Kraft-McMillan inequality and thus yields a valid prefix code with
\[
\mathbb{E}_P[\ell(X)] = \sum_{x \in \mathcal{X}} p(x) \left\lceil \log_d \frac{1}{p(x)} \right\rceil
\le -\sum_{x \in \mathcal{X}} p(x) \log_d p(x) + 1 = H_d(X) + 1
\]
as desired.
Theorem 2.4.3 thus shows that, at least to within an additive constant of 1, the entropy both
upper and lower bounds the expected length of a uniquely decodable code for the random variable
X. This is the first of our promised “operational interpretations” of the entropy.
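The length function ℓ(x) = ⌈logd (1/p(x))⌉ used in the proof is simple to compute, and the two-sided bound can be checked directly. The sketch below uses a hypothetical p.m.f. with exact rational probabilities, so that the ceiling is computed without floating-point error.

```python
import math
from fractions import Fraction

def shannon_length(p, d):
    """Smallest integer l with d**l >= 1/p, that is, the ceiling of log_d(1/p)."""
    length = 0
    while p * d ** length < 1:
        length += 1
    return length

d = 2
# Hypothetical p.m.f., stored exactly as fractions
pmf = [Fraction(2, 5), Fraction(3, 10), Fraction(1, 5), Fraction(1, 10)]

lengths = [shannon_length(p, d) for p in pmf]
kraft = sum(Fraction(1, d ** l) for l in lengths)
expected_length = float(sum(p * l for p, l in zip(pmf, lengths)))
entropy_d = -sum(float(p) * math.log(float(p), d) for p in pmf)

print(lengths, kraft, entropy_d, expected_length)
```

Because the Kraft sum is at most 1, a prefix code with these lengths exists, and its expected length lies between Hd (X) and Hd (X) + 1.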
In some situations, the limit (2.4.2) may not exist. However, there are a variety of situations in
which it does, and we focus generally on a specific but common instance in which the limit does
exist. First, we recall the definition of a stationary sequence of random variables.
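Definition 2.5. We say a sequence X1 , X2 , . . . of random variables is stationary if for all n and all
k ∈ N and all measurable sets A1 , . . . , Ak ⊂ X we have
\[
\mathbb{P}(X_1 \in A_1, \ldots, X_k \in A_k) = \mathbb{P}(X_{n+1} \in A_1, \ldots, X_{n+k} \in A_k).
\]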
Proposition 2.4.4. Let the sequence of random variables {Xi }, taking values in the discrete space
X , be stationary. Then
\[
H(\{X_i\}) = \lim_{n \to \infty} H(X_n \mid X_1, \ldots, X_{n-1}).
\]
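A stationary two-state Markov chain gives a concrete instance of this proposition. The sketch below assumes the entropy rate is the limit of H(X1 , . . . , Xn )/n (the standard definition, taken here as an assumption about the limit (2.4.2)); the transition matrix is hypothetical, and the chain is started from its stationary distribution. We compute joint entropies by brute-force enumeration.

```python
import math
from itertools import product

# Hypothetical two-state chain; pi is its stationary distribution (pi T = pi)
T = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.8, 0.2]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def joint_entropy(n):
    """H(X_1, ..., X_n), enumerating all 2^n state sequences."""
    total = 0.0
    for seq in product(range(2), repeat=n):
        pr = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            pr *= T[a][b]
        total -= pr * math.log(pr)
    return total

# For a stationary Markov chain, H(X_n | X_1, ..., X_{n-1}) = H(X_2 | X_1) for n >= 2
one_step = sum(pi[i] * entropy(T[i]) for i in range(2))

joint = [joint_entropy(n) for n in range(1, 11)]
conditionals = [joint[n] - joint[n - 1] for n in range(1, 10)]  # H(X_{n+1} | X_{1:n})
averages = [h / n for n, h in enumerate(joint, start=1)]        # H(X_{1:n}) / n

print(one_step, conditionals[-1], averages[-1])
```

The conditional entropies are constant from n = 2 on, while the averages H(X1 , . . . , Xn )/n decrease toward the same value, as the proposition predicts.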
Finally, we present a result showing that it is possible to achieve average code length of at most
the entropy rate, which for stationary sequences is no larger than the entropy of any single random
variable Xi . To do so, we require the use of a block code, which (while it may be a prefix code) treats
sets of random variables (X1 , . . . , Xm ) ∈ X m as a single symbol to be jointly encoded.
Proposition 2.4.5. Let the sequence of random variables X1 , X2 , . . . be stationary. Then for any
ε > 0, there exists an m ∈ N and a d-ary (prefix) block encoder C : X m → {0, . . . , d − 1}∗ such that
\[
\lim_{n} \frac{1}{n} \mathbb{E}_P[\ell_C(X_{1:n})] \le H(\{X_i\}) + \epsilon = \lim_{n} H(X_n \mid X_1, \ldots, X_{n-1}) + \epsilon.
\]
Taking n → ∞ yields that the term N (cN − a)/n → 0, which gives that cn − a ∈ [−ε, ε] eventually for any ε > 0,
which is our desired result.
Note that if m does not divide n, we may also encode the length of the sequence of encoded
words in each block of length m; in particular, if the block begins with a 0, it encodes m symbols,
while if it begins with a 1, then the next ⌈logd m⌉ bits encode the length of the block. This would
yield an increase in the expected length of the code to
\[
\mathbb{E}_P[\ell_C(X_{1:n})] \le \frac{2n + \lceil \log_2 m \rceil}{m} + \frac{n}{m} H(X_1, \ldots, X_m).
\]
Dividing by n and letting n → ∞ gives the result, as we can always choose m large.
2.5 Bibliography
The material in this chapter is classical in information theory. For all of our treatment of mutual
information, entropy, and KL-divergence in the discrete case, Cover and Thomas provide an es-
sentially complete treatment in Chapter 2 of their book [53]. Gray [94] provides a more advanced
(measure-theoretic) version of these results, with Chapter 5 covering most of our results (or Chapter 7
in the newer edition of the same book). Csiszár and Körner [55] is the classic reference for
coding theorems and results on communication, including stronger converse results.
The f -divergence was independently discovered by Ali and Silvey [4] and Csiszár [54], and is
consequently sometimes called an Ali-Silvey divergence or Csiszár divergence. Liese and Vajda [131]
provide a survey of f -divergences and their relationships with different statistical concepts (taking
a Bayesian point of view), and various authors have extended the pairwise divergence measures to
divergence measures between multiple distributions [98], making connections to experimental design
and classification [89, 70], which we investigate later in the book. The inequalities relating divergences
in Section 2.2.4 are now classical, and standard references present them [127, 167]. For a proof that
equality (2.2.4) is equivalent to the definition (2.2.3) with the appropriate closure operations, see
the paper [70, Proposition 1]. We borrow the proof of the upper bound in Proposition 2.2.10 from
the paper [132].
2.6 Exercises
Our first few questions investigate properties of a divergence between distributions that is weaker
than the KL-divergence, but is intimately related to optimal testing. Let P1 and P2 be arbitrary
distributions on a space X . The total variation distance between P1 and P2 is defined as
\[
\|P_1 - P_2\|_{\mathrm{TV}} := \sup_{A \subset \mathcal{X}} |P_1(A) - P_2(A)|.
\]
Exercise 2.1: Prove the following identities about total variation. Throughout, let P1 and P2
have densities p1 and p2 on a (common) set X .
(a) 2‖P1 − P2 ‖TV = ∫ |p1 (x) − p2 (x)|dx.
(b) For functions f : X → R, define the supremum norm ‖f ‖∞ = supx∈X |f (x)|. Show that
2‖P1 − P2 ‖TV = sup{∫X f (x)(p1 (x) − p2 (x))dx : ‖f ‖∞ ≤ 1}.
(c) ‖P1 − P2 ‖TV = ∫ max{p1 (x), p2 (x)}dx − 1.
(d) ‖P1 − P2 ‖TV = 1 − ∫ min{p1 (x), p2 (x)}dx.
Exercise 2.2 (Divergence between multivariate normal distributions): Let P1 be N(θ1 , Σ) and
P2 be N(θ2 , Σ), where Σ ≻ 0 is a positive definite matrix. What is Dkl (P1 ||P2 )?
Exercise 2.3 (The optimal test between distributions): Prove Le Cam’s inequality: for any
function ψ with dom ψ ⊃ X and any distributions P1 , P2 ,
\[
P_1(\psi(X) \neq 1) + P_2(\psi(X) \neq 2) \ge 1 - \|P_1 - P_2\|_{\mathrm{TV}}.
\]
Thus, the sum of the probabilities of error in a hypothesis testing problem, where based on a sample
X we must decide whether P1 or P2 is more likely, has value at least 1 − ‖P1 − P2 ‖TV . Given P1
and P2 , is this risk attainable?
Exercise 2.4: A random variable X has Laplace(λ, µ) distribution if it has density
p(x) = (λ/2) exp(−λ|x − µ|). Consider the hypothesis test of P1 versus P2 , where X has distribution Laplace(λ, µ1 )
under P1 and distribution Laplace(λ, µ2 ) under P2 , where µ1 < µ2 . Show that the minimal value
over all tests ψ of P1 versus P2 is
\[
\inf_{\psi} \left\{ P_1(\psi(X) \neq 1) + P_2(\psi(X) \neq 2) \right\} = \exp\left(-\frac{\lambda}{2} |\mu_1 - \mu_2|\right).
\]
Exercise 2.7 (f -divergences generalize standard divergences): Show the following properties of
f -divergences:
(c) If f (t) = t log t − log t, then Df (P ||Q) = Dkl (P ||Q) + Dkl (Q||P ).
(d) For any convex f satisfying f (1) = 0, Df (P ||Q) ≥ 0. (Hint: use Jensen’s inequality.)
(b) Generalizing the preceding result, let a : X → R+ and b : X → R+ , and let µ be a finite
measure on X with respect to which a is integrable. Show that
\[
\int a(x) \, d\mu(x) \, f\left(\frac{\int b(x) \, d\mu(x)}{\int a(x) \, d\mu(x)}\right) \le \int a(x) f\left(\frac{b(x)}{a(x)}\right) d\mu(x).
\]
If you are unfamiliar with measure theory, prove the following essentially equivalent result: let
u : X → R+ satisfy ∫ u(x)dx < ∞. Show that
\[
\int a(x) u(x) \, dx \, f\left(\frac{\int b(x) u(x) \, dx}{\int a(x) u(x) \, dx}\right) \le \int a(x) f\left(\frac{b(x)}{a(x)}\right) u(x) \, dx
\]
whenever ∫ a(x)u(x)dx < ∞. (It is possible to demonstrate this remains true under appropriate
limits even when ∫ a(x)u(x)dx = +∞, but it is a mess.)
(Hint: use the fact that the perspective of a function f , defined by h(x, t) = tf (x/t) for t > 0, is
jointly convex in x and t (see Proposition B.3.12).)
Exercise 2.9 (Data processing and f -divergences I): As with the KL-divergence, given a quantizer
g of the set X , where g induces a partition A1 , . . . , Am of X , we define the f -divergence between
P and Q conditioned on g as
\[
D_f(P \| Q \mid g) := \sum_{i=1}^m Q(A_i) f\left(\frac{P(A_i)}{Q(A_i)}\right)
= \sum_{i=1}^m Q(g^{-1}(\{i\})) f\left(\frac{P(g^{-1}(\{i\}))}{Q(g^{-1}(\{i\}))}\right).
\]
Given quantizers g1 and g2 , we say that g1 is a finer quantizer than g2 under the following condition:
assume that g1 induces the partition A1 , . . . , An and g2 induces the partition B1 , . . . , Bm ; then for
any of the sets Bi , there exist some k and sets Ai1 , . . . , Aik such that Bi = Ai1 ∪ · · · ∪ Aik . We let
g1 ≺ g2 denote that g1 is a finer quantizer than g2 .
(a) Let g1 and g2 be quantizers of the set X , and let g1 ≺ g2 , meaning that g1 is a finer quantization
than g2 . Prove that
Df (P ||Q | g2 ) ≤ Df (P ||Q | g1 ) .
Equivalently, show that whenever A and B are collections of sets partitioning X , but A is a
finer partition of X than B, that
\[
\sum_{B \in \mathcal{B}} Q(B) f\left(\frac{P(B)}{Q(B)}\right) \le \sum_{A \in \mathcal{A}} Q(A) f\left(\frac{P(A)}{Q(A)}\right).
\]
(b) Suppose that X is countable (or finite) so that P and Q have p.m.f.s p and q. Show that
\[
D_f(P \| Q) = \sum_{x} q(x) f\left(\frac{p(x)}{q(x)}\right),
\]
where on the left we are using the partition definition (2.2.3); you should show that the partition
into discrete parts of X achieves the supremum. You may assume that X is finite. (Though
feel free to prove the result in the case that X is infinite.)
Exercise 2.10 (General data processing inequalities): Let f be a convex function satisfying
f (1) = 0. Let K be a Markov transition kernel from X to Z, that is, K(·, x) is a probability
distribution on Z for each x ∈ X . (Written differently, we have X → Z, and conditioned on X = x,
Z has distribution K(·, x), so that K(A, x) is the probability that Z ∈ A given X = x.)
(a) Define the marginals KP (A) = ∫ K(A, x)p(x)dx and KQ (A) = ∫ K(A, x)q(x)dx. Show that
\[
D_f(K_P \| K_Q) \le D_f(P \| Q).
\]
Hint: by equation (2.2.3), w.l.o.g. we may assume that Z is finite and Z = {1, . . . , m}; also
recall Question 2.8.
(b) Let X and Y be random variables with joint distribution PXY and marginals PX and PY .
Define the f -information between X and Y as
\[
I_f(X; Y) := D_f(P_{XY} \| P_X \times P_Y).
\]
Use part (a) to show the following general data processing inequality: if we have the Markov
chain X → Y → Z, then
If (X; Z) ≤ If (X; Y ).
Exercise 2.11 (Convexity of f -divergences): Prove Proposition 2.2.11. Hint: Use Question 2.8.
Exercise 2.12 (Variational forms of KL divergence): Let P and Q be arbitrary distributions on a
common space X . Prove the following variational representation, known as the Donsker-Varadhan
theorem, of the KL divergence:
\[
D_{\mathrm{kl}}(P \| Q) = \sup_{f : \mathbb{E}_Q[e^{f(X)}] < \infty} \left\{ \mathbb{E}_P[f(X)] - \log \mathbb{E}_Q[\exp(f(X))] \right\}.
\]
Exercise 2.13: Let P and Q have densities p and q with respect to the base measure µ over the
set X . (Recall that this is no loss of generality, as we may take µ = P + Q.) Define the support
supp P := {x ∈ X : p(x) > 0}. Show that
\[
D_{\mathrm{kl}}(P \| Q) \ge \log \frac{1}{Q(\operatorname{supp} P)}.
\]
Exercise 2.14: Let P1 be N(θ1 , Σ1 ) and P2 be N(θ2 , Σ2 ), where Σi ≻ 0 are positive definite
matrices. Give Dkl (P1 ||P2 ).
Exercise 2.15: Let {Pv }v∈V be an arbitrary collection of distributions on a space X and µ be a
probability measure on V. Show that if V ∼ µ and conditional on V = v, we draw X ∼ Pv , then
(a) I(X; V ) = ∫ Dkl (Pv ||P̄) dµ(v), where P̄ = ∫ Pv dµ(v) is the (weighted) average of the Pv . You
may assume that V is discrete if you like.
(b) For any distribution Q on X , I(X; V ) = ∫ Dkl (Pv ||Q) dµ(v) − Dkl (P̄||Q). Conclude that
I(X; V ) ≤ ∫ Dkl (Pv ||Q) dµ(v), or, equivalently, P̄ minimizes ∫ Dkl (Pv ||Q) dµ(v) over all prob-
abilities Q.
Exercise 2.16 (The triangle inequality for variation distance): Let P and Q be distributions
on X_1^n = (X1 , . . . , Xn ) ∈ X n , and let Pi (· | x_1^{i−1}) be the conditional distribution of Xi given
X_1^{i−1} = x_1^{i−1} (and similarly for Qi ). Show that
\[
\|P - Q\|_{\mathrm{TV}} \le \sum_{i=1}^n \mathbb{E}_P\left[ \left\| P_i(\cdot \mid X_1^{i-1}) - Q_i(\cdot \mid X_1^{i-1}) \right\|_{\mathrm{TV}} \right].
\]
(b) Define the negative binary entropy h(p) = p log p + (1 − p) log(1 − p). Show that
Exercise 2.20: Use the paper “A New Metric for Probability Distributions” by Dominik Endres
and Johannes Schindelin to prove that if V ∼ Uniform{0, 1} and X | V = v ∼ Pv , then √I(X; V )
is a metric on distributions. (Said differently, Djs (P ||Q)^{1/2} is a metric on distributions, and it
generates the same topology as the TV-distance.)
Exercise 2.21: Relate the generalized Jensen-Shannon divergence between m distributions to
redundancy in encoding.
Chapter 3
Our second introductory chapter focuses on readers who may be less familiar with statistical mod-
eling methodology and the how and why of fitting different statistical models. As in the preceding
introductory chapter on information theory, this chapter will be a fairly terse blitz through the main
ideas. Nonetheless, the ideas and distributions here should give us something on which to hang our
hats, so to speak, as the distributions and models provide the basis for examples throughout the
book. Exponential family models form the basis of much of statistics, as they are a natural step
away from the most basic families of distributions—Gaussians—which admit exact computations
but are brittle, to a more flexible set of models that retain enough analytical elegance to permit
careful analyses while giving power in modeling. A key property is that fitting exponential family
models reduces to the minimization of convex functions—convex optimization problems—an oper-
ation we treat as a technology akin to evaluating a function like sin or cos. This perspective (which
is accurate enough) will arise throughout this book, and informs the philosophy we adopt that once
we formulate a problem as convex, it is solved.
In the continuous case, $p_\theta$ is instead a density on $\mathcal{X} \subset \mathbb{R}^k$, and $p_\theta$ takes the identical form above
but
\[ A(\theta) = \log \int_{\mathcal{X}} h(x) \exp(\langle \theta, \phi(x) \rangle) \, dx. \]
We can abstract away from this distinction between discrete and continuous distributions by making
the definition measure-theoretic, which we do here for completeness. (But recall the remarks in
Section 1.3.)
With our notation, we have the following definition.
Definition 3.1. The exponential family associated with the function φ and base measure µ is
defined as the set of distributions with densities pθ with respect to µ, where
Θ := {θ | A(θ) < ∞}
is open.
In Definition 3.1, we have included the carrier h in the base measure µ, and frequently we will give
ourselves the general notation
In some scenarios, it may be convenient to re-parameterize the problem in terms of some function
η(θ) instead of θ itself; we will not worry about such issues and simply use the formulae that are
most convenient.
We now give a few examples of exponential family models.
Example 3.1.2 (Poisson distribution): The Poisson distribution (for count data) is usually
parameterized by some $\lambda > 0$, and for $x \in \mathbb{N}$ has distribution $P_\lambda(X = x) = (1/x!) \lambda^x e^{-\lambda}$. Thus
by taking $\mu$ to be counting (discrete) measure on $\{0, 1, \ldots\}$ and setting $\theta = \log \lambda$, we find the
density (probability mass function in this case)
\[ p(x) = \frac{1}{x!} \lambda^x e^{-\lambda} = \frac{1}{x!} \exp(x \log \lambda - \lambda) = \frac{1}{x!} \exp(x\theta - e^\theta). \]
Notably, taking $h(x) = (x!)^{-1}$ and log-partition $A(\theta) = e^\theta$, we have probability mass function
$p_\theta(x) = h(x) \exp(\theta x - A(\theta))$. ◊
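As a quick numerical sanity check of Example 3.1.2, the following sketch compares the standard Poisson mass function with the exponential family form h(x) exp(θx − A(θ)) at θ = log λ. The function names and the value λ = 2.5 are illustrative choices, not part of the notes.

```python
import math

def poisson_pmf(x, lam):
    """Standard parameterization: (1/x!) * lam**x * exp(-lam)."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def expfam_pmf(x, theta):
    """Exponential family form with h(x) = 1/x! and A(theta) = exp(theta)."""
    return math.exp(theta * x - math.exp(theta)) / math.factorial(x)

lam = 2.5                      # illustrative rate
theta = math.log(lam)          # natural parameter
# Largest disagreement between the two forms over x = 0, ..., 29.
max_gap = max(abs(poisson_pmf(x, lam) - expfam_pmf(x, theta)) for x in range(30))
```

The two forms should agree up to floating-point error, so `max_gap` is expected to be negligible.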
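Example 3.1.3 (Normal distribution, mean parameterization): For the d-dimensional normal
distribution, we take the base measure to be Lebesgue measure on $\mathbb{R}^d$. If we fix the covariance and vary only
the mean $\mu$ in the family $\mathsf{N}(\mu, \Sigma)$, then $X \sim \mathsf{N}(\mu, \Sigma)$ has density
\[ p_\mu(x) = \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) - \frac{1}{2} \log \det(2\pi\Sigma) \right). \]
In particular, with $\theta = \mu$ we have carrier $h(x) = \exp(-\frac{1}{2} x^\top \Sigma^{-1} x) / ((2\pi)^{d/2} \det(\Sigma)^{1/2})$, sufficient statistic
$\phi(x) = \Sigma^{-1} x$, and log partition $A(\theta) = \frac{1}{2} \theta^\top \Sigma^{-1} \theta$. ◊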
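Example 3.1.4 (Normal distribution): Let $X \sim \mathsf{N}(\mu, \Sigma)$. We may re-parameterize this
as $\Theta = \Sigma^{-1}$ and $\theta = \Sigma^{-1}\mu$, and we have density
\[ p_{\theta,\Theta}(x) \propto \exp\left( \langle \theta, x \rangle - \frac{1}{2} \langle x x^\top, \Theta \rangle \right), \]
where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product. See Exercise 3.1. ◊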
In some cases, it is analytically convenient to include a few more conditions on the exponential
family.
Definition 3.2. Let {Pθ }θ∈Θ be an exponential family as in Definition 3.1. The sufficient statistic
φ is minimal if Θ = dom A ⊂ Rd is full-dimensional and there exists no vector u such that
Definition 3.2 is essentially equivalent to stating that φ(x) = (φ1 (x), . . . , φd (x)) has linearly inde-
pendent components when viewed as vectors [φi (x)]x∈X . While we do not prove this, via a suitable
linear transformation—a variant of Gram-Schmidt orthonormalization—one may modify any non-
minimal exponential family {Pθ } into an equivalent minimal exponential family {Qη }, meaning
that the two collections satisfy the equality {Pθ } = {Qη } (see Brown [39, Chapter 1]).
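Here, we enumerate a few of their key analytical properties, focusing on the cumulant generating
(or log partition) function $A(\theta) = \log \int e^{\langle \theta, \phi(x) \rangle} d\mu(x)$. We begin with a heuristic calculation, where
we assume that we may exchange differentiation and integration. Assuming that this is the case, we
then obtain the important expectation and covariance relationships that
\begin{align*}
\nabla A(\theta) & = \frac{1}{\int e^{\langle \theta, \phi(x) \rangle} d\mu(x)} \int \nabla_\theta e^{\langle \theta, \phi(x) \rangle} d\mu(x) \\
& = e^{-A(\theta)} \int \nabla_\theta e^{\langle \theta, \phi(x) \rangle} d\mu(x) = \int \phi(x) e^{\langle \theta, \phi(x) \rangle - A(\theta)} d\mu(x) = E_\theta[\phi(X)]
\end{align*}
because $e^{\langle \theta, \phi(x) \rangle - A(\theta)} = p_\theta(x)$. A completely similar (and still heuristic, at least at this point)
calculation gives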
That these identities hold is no accident and is central to the appeal of exponential family models.
The first and, from our perspective, most important result about exponential family models is
their convexity. While (assuming the differentiation relationships above hold) the differentiation
identity that $\nabla^2 A(\theta) = \mathrm{Cov}_\theta(\phi(X)) \succeq 0$ makes convexity of $A$ immediate, one can also provide a
direct argument without appealing to differentiation.
Proposition 3.2.1. The cumulant-generating function $\theta \mapsto A(\theta)$ is convex, and it is strictly convex
if and only if $\mathrm{Cov}_\theta(\phi(X))$ is positive definite for all $\theta \in \mathrm{dom}\, A$.
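Proof Let $\theta_\lambda = \lambda\theta_1 + (1 - \lambda)\theta_2$, where $\theta_1, \theta_2 \in \Theta$ and $\lambda \in (0, 1)$. Then $1/\lambda \ge 1$ and $1/(1 - \lambda) \ge 1$, and Hölder's
inequality implies
\begin{align*}
\log \int \exp(\langle \theta_\lambda, \phi(x) \rangle) d\mu(x)
& = \log \int \exp(\langle \theta_1, \phi(x) \rangle)^{\lambda} \exp(\langle \theta_2, \phi(x) \rangle)^{1 - \lambda} d\mu(x) \\
& \le \log \left[ \left( \int \exp(\langle \theta_1, \phi(x) \rangle)^{\lambda/\lambda} d\mu(x) \right)^{\lambda} \left( \int \exp(\langle \theta_2, \phi(x) \rangle)^{(1-\lambda)/(1-\lambda)} d\mu(x) \right)^{1 - \lambda} \right] \\
& = \lambda \log \int \exp(\langle \theta_1, \phi(x) \rangle) d\mu(x) + (1 - \lambda) \log \int \exp(\langle \theta_2, \phi(x) \rangle) d\mu(x),
\end{align*}
as desired. The strict convexity will be a consequence of Proposition 3.2.2 to come, as there we
formally show that $\nabla^2 A(\theta) = \mathrm{Cov}_\theta(\phi(X))$.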
We now show that A(θ) is indeed infinitely differentiable and how it generates the moments of
the sufficient statistics φ(x). To describe the properties, we provide a bit of notation related to
tensor products: for a vector $x \in \mathbb{R}^d$, we let
\[ x^{\otimes k} := \underbrace{x \otimes x \otimes \cdots \otimes x}_{k \text{ times}} \]
denote the $k$th order tensor, or multilinear operator, that for $v_1, \ldots, v_k \in \mathbb{R}^d$ satisfies
\[ x^{\otimes k}(v_1, \ldots, v_k) := \langle x, v_1 \rangle \cdots \langle x, v_k \rangle = \prod_{i=1}^k \langle x, v_i \rangle. \]
When $k = 2$, this is the familiar outer product $x^{\otimes 2} = x x^\top$. (More generally, one may think of $x^{\otimes k}$
as a $d \times d \times \cdots \times d$ box, where the $(i_1, \ldots, i_k)$ entry is $[x^{\otimes k}]_{i_1, \ldots, i_k} = x_{i_1} \cdots x_{i_k}$.) With this notation,
our first key result regards the differentiability of $A$, where we can compute (all) derivatives of $e^{A(\theta)}$
by interchanging integration and differentiation.
The proof of the proposition is involved and requires complex analysis, so we defer it to Sec. 3.6.1.
As particular consequences of Proposition 3.2.2, we can rigorously demonstrate the expectation
and covariance relationships that
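\[ \nabla A(\theta) = \frac{1}{\int e^{\langle \theta, \phi(x) \rangle} d\mu(x)} \int \nabla e^{\langle \theta, \phi(x) \rangle} d\mu(x) = \int \phi(x) p_\theta(x) d\mu(x) = E_\theta[\phi(X)] \]
and
\begin{align*}
\nabla^2 A(\theta) & = \frac{1}{\int e^{\langle \theta, \phi(x) \rangle} d\mu(x)} \int \phi(x)^{\otimes 2} e^{\langle \theta, \phi(x) \rangle} d\mu(x) - \frac{\left( \int \phi(x) e^{\langle \theta, \phi(x) \rangle} d\mu(x) \right)^{\otimes 2}}{\left( \int e^{\langle \theta, \phi(x) \rangle} d\mu(x) \right)^2} \\
& = E_\theta[\phi(X)\phi(X)^\top] - E_\theta[\phi(X)] E_\theta[\phi(X)]^\top \\
& = \mathrm{Cov}_\theta(\phi(X)).
\end{align*}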
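The identities ∇A(θ) = E_θ[φ(X)] and ∇²A(θ) = Cov_θ(φ(X)) are easy to check numerically. The sketch below does so for a hypothetical one-parameter family on the finite sample space {0, 1, 2, 3} with φ(x) = x and counting base measure (an illustrative choice, not an example from the notes), comparing finite differences of A with the mean and variance under p_θ.

```python
import math

SUPPORT = [0, 1, 2, 3]  # illustrative finite sample space, with phi(x) = x

def A(theta):
    """Log partition function: log of the sum of exp(theta * x) over the support."""
    return math.log(sum(math.exp(theta * x) for x in SUPPORT))

def moments(theta):
    """Mean and variance of phi(X) = X under p_theta(x) = exp(theta * x - A(theta))."""
    probs = [math.exp(theta * x - A(theta)) for x in SUPPORT]
    mean = sum(p * x for p, x in zip(probs, SUPPORT))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, SUPPORT))
    return mean, var

theta, eps = 0.3, 1e-4
grad_fd = (A(theta + eps) - A(theta - eps)) / (2 * eps)                 # approximates A'(theta)
hess_fd = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps ** 2   # approximates A''(theta)
mean, var = moments(theta)
```

Up to discretization error, `grad_fd` should match `mean` and `hess_fd` should match `var`.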
Minimal exponential families (Definition 3.2) also enjoy a few additional regularity properties.
Recall that A is strictly convex if
Proposition 3.2.3. Let {Pθ } be a regular exponential family. The log partition function A is
strictly convex if and only if {Pθ } is minimal.
Proof If the family is minimal, then $\mathrm{Var}_\theta(u^\top \phi(X)) > 0$ for any vector $u \neq 0$, while $\mathrm{Var}_\theta(u^\top \phi(X)) =
u^\top \nabla^2 A(\theta) u$. This implies the strict positive definiteness $\nabla^2 A(\theta) \succ 0$, which is equivalent to strict
convexity (see Corollary B.3.2 in Appendix B.3.1). Conversely, if $\nabla^2 A(\theta) \succ 0$ for all $\theta \in \Theta$, then
$\mathrm{Var}_\theta(u^\top \phi(X)) > 0$ for all $u \neq 0$ and so $u^\top \phi(x)$ is non-constant in $x$.
This is always a convex optimization problem (see Appendices B and C for much more on this), as A
is convex and the first term is linear, and so has no non-global optima. Here and throughout, as we
mention in the introductory remarks to this chapter, we treat convex optimization as a technology:
as long as the dimension of a problem is not too large and its objective can be evaluated, it is
(essentially) computationally trivial.
Of course, we never have access to the population P fully; instead, we receive a sample
X1 , . . . , Xn from P . In this case, a natural approach is to replace the expected (negative) log
likelihood above with its empirical version and solve
\[ \mathop{\mathrm{minimize}}_{\theta} \; -\sum_{i=1}^n \log p_\theta(X_i) = \sum_{i=1}^n \left[ -\langle \theta, \phi(X_i) \rangle + A(\theta) \right], \tag{3.2.2} \]
which is still a convex optimization problem (as the objective is convex in $\theta$). The maximum
likelihood estimate is any vector $\hat{\theta}_n$ minimizing the negative log likelihood (3.2.2), which by setting
gradients to 0 is evidently any vector satisfying
\[ \nabla A(\hat{\theta}_n) = E_{\hat{\theta}_n}[\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(X_i). \tag{3.2.3} \]
In particular, we need only find a parameter $\hat{\theta}_n$ matching moments of the empirical distribution
of the observed $X_i \sim P$. This $\hat{\theta}_n$ is unique whenever $\mathrm{Cov}_\theta(\phi(X)) \succ 0$ for all $\theta$, that is, when
the covariance of $\phi$ is full rank in the exponential family model, because then the objective in the
minimization problem (3.2.2) is strictly convex.
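To make the moment-matching condition (3.2.3) concrete, here is a minimal sketch for the Poisson family of Example 3.1.2, where A(θ) = e^θ. It runs Newton's method on the convex objective −θ·(sample mean) + A(θ); the function name, the data, and the iteration count are all illustrative assumptions, and the sample mean is assumed to be positive.

```python
import math

def fit_poisson_mle(xs, iters=50):
    """Newton's method for minimizing -theta * mean(xs) + exp(theta) over theta."""
    mean = sum(xs) / len(xs)
    theta = 0.0
    for _ in range(iters):
        grad = math.exp(theta) - mean   # A'(theta) minus the empirical mean of phi(X) = X
        hess = math.exp(theta)          # A''(theta) > 0, so the objective is strictly convex
        theta -= grad / hess
    return theta

xs = [2, 4, 3, 5, 1, 4, 3, 2, 6, 2]     # made-up counts with sample mean 3.2
theta_hat = fit_poisson_mle(xs)
```

At the solution, A'(θ̂) = exp(θ̂) equals the sample mean, which is exactly the condition (3.2.3) for this family; in this simple case the closed form θ̂ = log(sample mean) is also available.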
Let us proceed heuristically for a moment to develop a rough convergence guarantee for the
estimator $\hat{\theta}_n$; the next paragraph assumes some comfort with classical asymptotic statistics
(and the central limit theorem) and is not essential for what comes later. Then we can see how
minimizers of the problem (3.2.2) converge to their population counterparts. Assume that the data
$X_i$ are i.i.d. from an exponential family model $P_{\theta^\star}$. Then we expect that the maximum likelihood
estimate $\hat{\theta}_n$ should converge to $\theta^\star$, and so
\[ \frac{1}{n} \sum_{i=1}^n \phi(X_i) = \nabla A(\hat{\theta}_n) = \nabla A(\theta^\star) + (\nabla^2 A(\theta^\star) + o(1))(\hat{\theta}_n - \theta^\star). \]
But of course, $\nabla A(\theta^\star) = E_{\theta^\star}[\phi(X)]$, and so the central limit theorem gives that
\[ \frac{1}{n} \sum_{i=1}^n (\phi(X_i) - \nabla A(\theta^\star)) \stackrel{\cdot}{\sim} \mathsf{N}\left(0, n^{-1} \mathrm{Cov}_{\theta^\star}(\phi(X))\right) = \mathsf{N}\left(0, n^{-1} \nabla^2 A(\theta^\star)\right), \]
where $\stackrel{\cdot}{\sim}$ means “is approximately distributed as.” Multiplying by $(\nabla^2 A(\theta^\star) + o(1))^{-1} \approx \nabla^2 A(\theta^\star)^{-1}$,
we thus see (still working in our heuristic)
\begin{align*}
\hat{\theta}_n - \theta^\star & = (\nabla^2 A(\theta^\star) + o(1))^{-1} \frac{1}{n} \sum_{i=1}^n (\phi(X_i) - \nabla A(\theta^\star)) \\
& \stackrel{\cdot}{\sim} \mathsf{N}\left(0, n^{-1} \cdot \nabla^2 A(\theta^\star)^{-1}\right), \tag{3.2.4}
\end{align*}
where we use that $BZ \sim \mathsf{N}(0, B\Sigma B^\top)$ if $Z \sim \mathsf{N}(0, \Sigma)$. (It is possible to make each of these steps
fully rigorous.) Thus the cumulant generating function $A$ governs the error we expect in $\hat{\theta}_n - \theta^\star$.
Much of the rest of this book explores properties of these types of minimization problems: at
what rates do we expect $\hat{\theta}_n$ to converge to a global minimizer of problem (3.2.1)? Can we show
that these rates are optimal? Is this the “right” strategy for choosing a parameter? Exponential
families form a particular working example to motivate this development.
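The heuristic (3.2.4) can be probed by simulation. The sketch below uses the Bernoulli family written in natural parameters (φ(x) = x, θ = log(p/(1−p)), A(θ) = log(1 + e^θ), so that A''(θ) = p(1−p)); this family, the seed, and the sample sizes are illustrative assumptions rather than anything specified in the notes. It estimates n·Var(θ̂_n) across repeated samples and compares it with 1/A''(θ*).

```python
import math
import random

def simulate_scaled_variance(p_star=0.3, n=400, trials=2000, seed=0):
    """Monte Carlo estimate of n * Var(theta_hat_n) for the Bernoulli natural parameter."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        mean = sum(rng.random() < p_star for _ in range(n)) / n
        # The MLE solves A'(theta) = mean, i.e. theta = log(mean / (1 - mean)).
        estimates.append(math.log(mean / (1.0 - mean)))
    avg = sum(estimates) / trials
    var = sum((t - avg) ** 2 for t in estimates) / (trials - 1)
    return n * var

scaled_var = simulate_scaled_variance()
predicted = 1.0 / (0.3 * 0.7)   # 1 / A''(theta*) = 1 / (p*(1 - p*))
```

With these settings the two quantities should agree to within Monte Carlo error; the agreement is only approximate because (3.2.4) is an asymptotic statement.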
Similarly, we have
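\begin{align*}
D_{\mathrm{kl}}(P_{\theta+\Delta} \| P_\theta) & = E_{\theta+\Delta}[\langle \theta + \Delta, \phi(X) \rangle - A(\theta + \Delta) - \langle \theta, \phi(X) \rangle + A(\theta)] \\
& = A(\theta) - A(\theta + \Delta) + E_{\theta+\Delta}[\langle \Delta, \phi(X) \rangle] \\
& = A(\theta) - A(\theta + \Delta) - \nabla A(\theta + \Delta)^\top (-\Delta).
\end{align*}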
These identities give an immediate connection with convexity. Indeed, for a differentiable convex
function h, the first-order divergence associated with h is
which is always nonnegative, and is the gap between the linear approximation to the (convex)
function h and its actual value. In much of the statistical and machine learning literature, the
divergence (3.3.1) is called a Bregman divergence, though we will use the more evocative first-
order divergence. These will appear frequently throughout the book and, more generally, appear
frequently in work on optimization and statistics.
When the perturbation $\Delta$ is small, the infinite differentiability of $A$ gives
\[ D_{\mathrm{kl}}(P_\theta \| P_{\theta+\Delta}) = \frac{1}{2} \Delta^\top \nabla^2 A(\theta) \Delta + O(\|\Delta\|^3), \]
so that the Hessian $\nabla^2 A(\theta)$ tells quite precisely how the KL divergence changes as $\theta$ varies (locally).
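A small numerical illustration of this local expansion, again for the Poisson family with A(θ) = e^θ: the sketch computes the KL divergence by direct summation, compares it with the first-order (Bregman) form A(θ + Δ) − A(θ) − A'(θ)Δ obtained by expanding the log-likelihood ratio, and then with the quadratic approximation ½Δ²A''(θ). The truncation at 60 terms and the values of θ and Δ are assumptions made for the example.

```python
import math

def A(theta):
    """Poisson log partition function; note A' = A'' = exp as well."""
    return math.exp(theta)

def kl_poisson(theta_p, theta_q, terms=60):
    """Direct sum of p(x) * log(p(x) / q(x)) for two Poisson laws in natural parameters."""
    total = 0.0
    for x in range(terms):
        log_p = theta_p * x - A(theta_p) - math.lgamma(x + 1)
        log_q = theta_q * x - A(theta_q) - math.lgamma(x + 1)
        total += math.exp(log_p) * (log_p - log_q)
    return total

theta, delta = math.log(2.0), 0.01
kl_direct = kl_poisson(theta, theta + delta)
first_order = A(theta + delta) - A(theta) - A(theta) * delta   # uses A'(theta) = exp(theta)
quadratic = 0.5 * delta ** 2 * A(theta)                        # uses A''(theta) = exp(theta)
```

Here `kl_direct` and `first_order` should agree to floating-point accuracy, while `quadratic` differs from them only by a term of order Δ³.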
As we saw already in Example 2.3.2 (and see the next section), when the KL-divergence between
two distributions is small, it is hard to test between them, and in the sequel, we will show converses
to this. The Hessian $\nabla^2 A(\theta^\star)$ also governs the error in the estimate $\hat{\theta}_n - \theta^\star$ in our heuristic (3.2.4).
When the Hessian $\nabla^2 A(\theta)$ is large (strongly positive definite), the KL divergence $D_{\mathrm{kl}}(P_\theta \| P_{\theta+\Delta})$ is large,
and the asymptotic covariance (3.2.4) is small. For this reason, and for others we address later, for
exponential family models we call
The log partition function A(· | x) provides the same insights for the conditional models (3.4.1)
as it does for the unconditional exponential family models in the preceding sections. Indeed, as
in Propositions 3.2.1 and 3.2.2, the log partition A(· | x) is always C ∞ on its domain and convex.
Moreover, it gives the expected moments of the sufficient statistic φ conditional on x, as
from which we can (typically) extract the mean or other statistics of Y conditional on x.
Three standard examples will be our most frequent motivators throughout this book: linear
regression, binary logistic regression, and multiclass logistic regression. We give these three, as
well as describing two more important examples involving modeling count data through Poisson
regression and making predictions for targets y known to live in a bounded set.
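so that we have the exponential family representation (3.4.1) with $\phi(x, y) = \frac{1}{\sigma^2} x y$, $h(y) =
\exp(-\frac{1}{2\sigma^2} y^2 - \frac{1}{2} \log(2\pi\sigma^2))$, and $A(\theta \mid x) = \frac{1}{2\sigma^2} \theta^\top x x^\top \theta$. As $\nabla A(\theta \mid x) = E_\theta[\phi(X, Y) \mid X = x] =
\frac{1}{\sigma^2} x E_\theta[Y \mid X = x]$, we easily recover $E_\theta[Y \mid X = x] = \theta^\top x$. ◊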
Frequently, we wish to predict binary or multiclass random variables Y. For example, consider
a medical application in which we wish to assess the probability that, based on a set of covariates
$x \in \mathbb{R}^d$ (say, blood pressure, height, weight, family history), an individual will have a heart attack
in the next 5 years, so that Y = 1 indicates heart attack and Y = −1 indicates not. The next
example shows how we might model this.
Example 3.4.2 (Binary logistic regression): If $Y \in \{-1, 1\}$, we model
\[ p_\theta(y \mid x) = \frac{\exp(y x^\top \theta)}{1 + \exp(y x^\top \theta)}, \]
where the idea in the probability above is that if $x^\top \theta$ has the same sign as $y$, then the larger
$x^\top \theta y$ becomes, the higher the probability assigned to the label $y$; when $x^\top \theta y < 0$, the probability
is small. Of course, we always have $p_\theta(y \mid x) + p_\theta(-y \mid x) = 1$, and using the identity
\[ y x^\top \theta - \log(1 + \exp(y x^\top \theta)) = \frac{y + 1}{2} x^\top \theta - \log(1 + \exp(x^\top \theta)) \]
we obtain the generalized linear model representation $\phi(x, y) = \frac{y+1}{2} x$ and $A(\theta \mid x) = \log(1 +
\exp(x^\top \theta))$.
As an alternative, we could represent $Y \in \{0, 1\}$ by
\[ p_\theta(y \mid x) = \frac{\exp(y x^\top \theta)}{1 + \exp(x^\top \theta)} = \exp\left( y x^\top \theta - \log(1 + e^{x^\top \theta}) \right), \]
which has the simpler sufficient statistic $\phi(x, y) = x y$. ◊
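The identity used in Example 3.4.2 can be checked numerically. In the sketch below, s stands for the scalar x^T θ, and the function names and test scores are illustrative; the code makes no attempt to guard against overflow for very large |s|.

```python
import math

def log_prob_pm1(y, s):
    """log p_theta(y | x) for y in {-1, +1}, where s = x^T theta."""
    return y * s - math.log1p(math.exp(y * s))

def log_prob_glm(y, s):
    """Same quantity via phi(x, y) = ((y + 1) / 2) x and A(theta | x) = log(1 + exp(s))."""
    return (y + 1) / 2 * s - math.log1p(math.exp(s))

scores = [-3.0, -0.5, 0.0, 1.2, 4.0]
# Largest disagreement between the two representations over both labels.
max_gap = max(abs(log_prob_pm1(y, s) - log_prob_glm(y, s))
              for s in scores for y in (-1, 1))
# The two label probabilities should sum to one for every score.
total_mass = [math.exp(log_prob_pm1(1, s)) + math.exp(log_prob_pm1(-1, s)) for s in scores]
```

Both representations give the same log-probabilities, and the probabilities of the two labels sum to one for each score.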
Instead of a binary prediction problem, in many cases we have a multiclass prediction problem,
where we seek to predict a label Y for an object x belonging to one of k different classes. For
example, in image recognition, we are given an image x and wish to identify the subject Y of the
image, where Y ranges over k classes, such as birds, dogs, cars, trucks, and so on. This too we can
model using exponential families.
Example 3.4.3 (Multiclass logistic regression): In the case that we have a k-class prediction
problem in which we wish to predict $Y \in \{1, \ldots, k\}$ from $X \in \mathbb{R}^d$, we assign parameters
$\theta_y \in \mathbb{R}^d$ to each of the classes $y = 1, \ldots, k$. We then model
\[ p_\theta(y \mid x) = \frac{\exp(\theta_y^\top x)}{\sum_{j=1}^k \exp(\theta_j^\top x)} = \exp\left( \theta_y^\top x - \log \sum_{j=1}^k e^{\theta_j^\top x} \right). \]
Here, the idea is that if $\theta_y^\top x > \theta_j^\top x$ for all $j \neq y$, then the model assigns higher probability to
class $y$ than any other class; the larger the gap between $\theta_y^\top x$ and $\theta_j^\top x$, the larger the difference
in assigned probabilities. ◊
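In practice the log-normalizer in Example 3.4.3 is usually computed with a shift by the maximum score to avoid overflow. The following sketch shows one way to do this; the parameter values and helper names are hypothetical and chosen only for illustration.

```python
import math

def log_softmax(scores):
    """Return log p(y | x) for each class, shifting by the max score for stability."""
    shift = max(scores)
    log_partition = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return [s - log_partition for s in scores]

def class_scores(thetas, x):
    """Scores theta_y^T x for each class y."""
    return [sum(t_j * x_j for t_j, x_j in zip(theta, x)) for theta in thetas]

thetas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # made-up parameters: k = 3 classes, d = 2
x = [2.0, 0.5]
probs = [math.exp(lp) for lp in log_softmax(class_scores(thetas, x))]
```

The class with the largest score θ_y^T x (here the first) receives the largest probability, and the shift leaves the probabilities unchanged while keeping the exponentials bounded.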
Other approaches with these ideas allow us to model other situations. Poisson regression models
are frequent choices for modeling count data. For example, consider an insurance company that
wishes to issue premiums for shipping cargo in different seasons and on different routes, and so
wishes to predict the number of times a given cargo ship will be damaged by waves over a period
of service; we might represent this with a feature vector x encoding information about the ship to
be insured, typical weather on the route it will take, and the length of time it will be in service.
To model such counts Y ∈ {0, 1, 2, . . .}, we turn to Poisson regression.
Example 3.4.4 (Poisson regression): When $Y \in \mathbb{N}$ is a count, the Poisson distribution with
rate $\lambda > 0$ gives $P(Y = y) = e^{-\lambda} \lambda^y / y!$. Poisson regression models $\lambda$ via $e^{\theta^\top x}$, giving model
\[ p_\theta(y \mid x) = \frac{1}{y!} \exp\left( y x^\top \theta - e^{\theta^\top x} \right), \]
so that we have carrier $h(y) = 1/y!$ and the simple sufficient statistic $\phi(x, y) = y x$. The log partition
function is $A(\theta \mid x) = e^{\theta^\top x}$. ◊
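For the Poisson regression model, the negative log-likelihood of a single pair (x, y) is −y x^T θ + e^{θ^T x} + log y!, with gradient (e^{θ^T x} − y) x. The sketch below implements both and checks the gradient by central differences; the function names and the particular θ, x, y are illustrative assumptions.

```python
import math

def poisson_nll(theta, x, y):
    """Negative log-likelihood -log p_theta(y | x) for Poisson regression."""
    s = sum(t * xi for t, xi in zip(theta, x))
    return -y * s + math.exp(s) + math.lgamma(y + 1)

def poisson_nll_grad(theta, x, y):
    """Gradient in theta: (exp(theta^T x) - y) * x."""
    s = sum(t * xi for t, xi in zip(theta, x))
    return [(math.exp(s) - y) * xi for xi in x]

theta, x, y = [0.2, -0.1], [1.0, 3.0], 2
grad = poisson_nll_grad(theta, x, y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = []
for j in range(len(theta)):
    up = [t + (eps if i == j else 0.0) for i, t in enumerate(theta)]
    down = [t - (eps if i == j else 0.0) for i, t in enumerate(theta)]
    grad_fd.append((poisson_nll(up, x, y) - poisson_nll(down, x, y)) / (2 * eps))
```

The gradient vanishes exactly when the predicted rate e^{θ^T x} matches the observed count, which is the moment-matching property in this conditional setting.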
Lastly, we consider a less standard example, but which highlights the flexibility of these models.
Here, we assume a linear regression problem but in which we wish to predict values Y in a bounded
range.
Example 3.4.5 (Bounded range regression): Suppose that we know Y ∈ [−b, b], but we wish
to model it via an exponential family model with density
While its functional form makes this highly non-obvious, our general results guarantee that
$A(\theta \mid x)$ is indeed $C^\infty$ and convex in $\theta$. We have $\nabla A(\theta \mid x) = x E_\theta[Y \mid X = x]$ because
$\phi(x, y) = x y$, and we can therefore immediately recover $E_\theta[Y \mid X = x]$. Indeed, set $s = \theta^\top x$,
and without loss of generality assume $s \neq 0$. Then
\[ E[Y \mid x^\top \theta = s] = \frac{\partial}{\partial s} \log \frac{e^{bs} - e^{-bs}}{s} = \frac{b(e^{bs} + e^{-bs})}{e^{bs} - e^{-bs}} - \frac{1}{s}, \]
which increases from $-b$ to $b$ as $s = x^\top \theta$ increases from $-\infty$ to $+\infty$. ◊
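The conditional mean formula in Example 3.4.5 can be written as b·coth(bs) − 1/s. A short sketch (with an illustrative bound b = 2 and a hand-picked grid of s values that avoids s = 0) confirms numerically that it is odd, increasing, and stays inside (−b, b).

```python
import math

def conditional_mean(s, b):
    """E[Y | x^T theta = s] = b * coth(b * s) - 1 / s, valid for s != 0."""
    return b / math.tanh(b * s) - 1.0 / s

b = 2.0                                          # illustrative bound on |Y|
grid = [-50.0, -5.0, -0.5, 0.5, 5.0, 50.0]       # increasing values of s, excluding 0
means = [conditional_mean(s, b) for s in grid]
```

Very close to s = 0 the two terms nearly cancel, so a careful implementation would switch to a series expansion there; this sketch simply avoids that region.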
Once again, to find $\hat{\theta}_n$ amounts to matching moments, as $\nabla A(\theta \mid X_i) = E[\phi(X, Y) \mid X = X_i]$, and
we still enjoy the convexity properties of the standard exponential family models.
In general, we of course do not expect any exponential family or generalized linear model (GLM)
to have perfect fidelity to the world: all models are inaccurate (but many are useful!). Nonetheless,
we can still fit any of the GLM models in Examples 3.4.1–3.4.5 to data of the appropriate type. In
particular, for the logarithmic loss $\ell(\theta; x, y) = -\log p_\theta(y \mid x)$, we can define the empirical loss
\[ L_n(\theta) := \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i). \]
Then, as $n \to \infty$, we expect that $L_n(\theta) \to E[\ell(\theta; X, Y)]$, so that the minimizing $\theta$ should give the
best predictions possible according to the loss $\ell$. We shall therefore often be interested in such
convergence guarantees and the deviations of sample quantities (like $L_n$) from their population
counterparts.
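As a minimal sketch of fitting a GLM by minimizing the empirical loss, the code below runs plain gradient descent on L_n for binary logistic regression (Example 3.4.2) with scalar covariates, where the loss is log(1 + exp(−y x θ)). The data set, step size, and iteration count are made up for the illustration; the data are deliberately not linearly separable so that a finite minimizer exists.

```python
import math

def empirical_loss_grad(theta, xs, ys):
    """Derivative of L_n(theta) = (1/n) * sum of log(1 + exp(-y * x * theta))."""
    return sum(-y * x / (1.0 + math.exp(y * x * theta)) for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, step=1.0, iters=500):
    """Fixed-step gradient descent on the convex empirical loss."""
    theta = 0.0
    for _ in range(iters):
        theta -= step * empirical_loss_grad(theta, xs, ys)
    return theta

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # made-up scalar covariates
ys = [-1, -1, 1, -1, 1, 1]               # made-up labels, not separable by sign(x)
theta_hat = fit(xs, ys)
```

Because L_n is convex and smooth, the gradient at `theta_hat` should be numerically zero, which is the sample analogue of the moment-matching condition.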
\[ H_0 : \theta = \theta_0 \quad \text{versus} \quad H_{1,n} : \theta = \theta_n \]
as $n$ grows, where we observe a sample $X_1^n$ drawn i.i.d. either according to $P_{\theta_0}$ (i.e., $H_0$) or $P_{\theta_n}$
(i.e., $H_{1,n}$). By choosing $\theta_n$ in a way that makes the separation $v^\top(\theta_n - \theta_0)$ large but testing $H_0$
against $H_{1,n}$ challenging, we can then (roughly) identify the separation $\delta$ at which testing becomes
impossible.
Proposition 3.5.1. Let $\theta_0 \in \mathbb{R}^d$. Then there exists a sequence of parameters $\theta_n$ with $\|\theta_n - \theta_0\| =
O(1/\sqrt{n})$, separation
\[ v^\top(\theta_n - \theta_0) = \frac{1}{\sqrt{n}} \sqrt{v^\top \nabla^2 A(\theta_0)^{-1} v}, \]
and for which
\[ \inf_{\Psi} \left\{ P_{\theta_0}(\Psi(X_1^n) \neq 0) + P_{\theta_n}(\Psi(X_1^n) \neq 1) \right\} \ge \frac{1}{2} + O(n^{-1/2}). \]
Proof Let $\Delta \in \mathbb{R}^d$ be a potential perturbation to $\theta_1 = \theta_0 + \Delta$, which gives separation $\delta =
v^\top \theta_1 - v^\top \theta_0 = v^\top \Delta$. Let $P_0 = P_{\theta_0}$ and $P_1 = P_{\theta_1}$. Then the smallest summed probability of error
in testing between $P_0$ and $P_1$ based on $n$ observations $X_1^n$ is
by Proposition 2.3.1. Following the approach of Example 2.3.2, we apply Pinsker's inequality
(2.2.10) and use that the KL-divergence tensorizes to find
\[ 2 \|P_0^n - P_1^n\|_{\mathrm{TV}}^2 \le n D_{\mathrm{kl}}(P_0 \| P_1) = n D_{\mathrm{kl}}(P_{\theta_0} \| P_{\theta_0 + \Delta}) = n D_A(\theta_0 + \Delta, \theta_0), \]
where the final equality follows from the equivalence between KL and first-order divergences for
exponential families (Proposition 3.3.1).
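To guarantee that the summed probability of error is at least $\frac{1}{2}$, that is, $\|P_0^n - P_1^n\|_{\mathrm{TV}} \le \frac{1}{2}$,
it suffices to choose $\Delta$ satisfying $n D_A(\theta_0 + \Delta, \theta_0) \le \frac{1}{2}$. So to maximize the separation $v^\top \Delta$ while
guaranteeing a constant probability of error, we (approximately) solve
\[ \begin{array}{ll} \text{maximize} & v^\top \Delta \\ \text{subject to} & D_A(\theta_0 + \Delta, \theta_0) \le \frac{1}{2n}. \end{array} \]
Now, consider that $D_A(\theta_0 + \Delta, \theta_0) = \frac{1}{2} \Delta^\top \nabla^2 A(\theta_0) \Delta + O(\|\Delta\|^3)$. Ignoring the higher order term,
we consider maximizing $v^\top \Delta$ subject to $\Delta^\top \nabla^2 A(\theta_0) \Delta \le \frac{1}{n}$. A Lagrangian calculation shows that
this has solution
\[ \Delta = \frac{1}{\sqrt{n}} \frac{1}{\sqrt{v^\top \nabla^2 A(\theta_0)^{-1} v}} \nabla^2 A(\theta_0)^{-1} v. \]
With this choice, we have separation $\delta = v^\top \Delta = \sqrt{v^\top \nabla^2 A(\theta_0)^{-1} v / n}$, and $D_A(\theta_0 + \Delta, \theta_0) =
\frac{1}{2n} + O(1/n^{3/2})$. The summed probability of error is at least
\[ 1 - \|P_0^n - P_1^n\|_{\mathrm{TV}} \ge 1 - \sqrt{\frac{n}{4n} + O(n^{-1/2})} = 1 - \sqrt{\frac{1}{4} + O(n^{-1/2})} = \frac{1}{2} + O(n^{-1/2}) \]
as desired.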
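The perturbation chosen in the proof is simple to compute. The sketch below does so for a hypothetical 2×2 Hessian H standing in for ∇²A(θ₀) and a direction v (both made up for the example), and confirms that Δ saturates the quadratic constraint Δ^T H Δ = 1/n and achieves separation sqrt(v^T H⁻¹ v / n).

```python
import math

def solve_2x2(H, v):
    """Return H^{-1} v for a 2x2 matrix H via the explicit inverse formula."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (-c * v[0] + a * v[1]) / det]

def quad_form(u, H, w):
    """Compute u^T H w for 2-vectors u, w."""
    return sum(u[i] * H[i][j] * w[j] for i in range(2) for j in range(2))

H = [[2.0, 0.5], [0.5, 1.0]]     # stand-in for the Hessian of A at theta_0 (positive definite)
v = [1.0, 0.0]                   # direction of interest
n = 100

Hinv_v = solve_2x2(H, v)
scale = v[0] * Hinv_v[0] + v[1] * Hinv_v[1]            # v^T H^{-1} v
delta = [h / math.sqrt(n * scale) for h in Hinv_v]     # the maximizing perturbation
separation = v[0] * delta[0] + v[1] * delta[1]         # v^T delta
constraint = quad_form(delta, H, delta)                # delta^T H delta
```

Directions v along which H⁻¹ is large (that is, where the Hessian is small) admit larger separations at the same level of testing difficulty.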
Let us briefly sketch out why Proposition 3.5.1 is the “right” answer using the heuristics in Section 3.2.1.
For an unknown parameter $\theta$ in the exponential family model $P_\theta$, we observe $X_1, \ldots, X_n$,
and wish to test whether $v^\top \theta \ge t$ for a given threshold $t$. Call our null $H_0 : v^\top \theta \le t$, and assume
we wish to test at an asymptotic level $\alpha > 0$, meaning the probability the test falsely rejects $H_0$
is (as $n \to \infty$) at most $\alpha$. Assuming the heuristic (3.2.4), we have the approximate distributional
equality
\[ v^\top \hat{\theta}_n \stackrel{\cdot}{\sim} \mathsf{N}\left( v^\top \theta, \frac{1}{n} v^\top \nabla^2 A(\hat{\theta}_n)^{-1} v \right). \]
Note that we have $\hat{\theta}_n$ on the right side of the distribution; it is possible to make this rigorous, but
here we target only intuition building. A natural asymptotically level $\alpha$ test is then
\[ T_n := \begin{cases} \text{Reject} & \text{if } v^\top \hat{\theta}_n \ge t + z_{1-\alpha} \sqrt{v^\top \nabla^2 A(\hat{\theta}_n)^{-1} v / n} \\ \text{Accept} & \text{otherwise,} \end{cases} \]
where $z_{1-\alpha}$ is the $1 - \alpha$ quantile of a standard normal, $P(Z \ge z_{1-\alpha}) = \alpha$ for $Z \sim \mathsf{N}(0, 1)$. Let $\theta_0$
be such that $v^\top \theta_0 = t$, so $H_0$ holds. Then
\[ P_{\theta_0}(T_n \text{ rejects}) = P_{\theta_0}\left( \sqrt{n} \cdot v^\top(\hat{\theta}_n - \theta_0) \ge z_{1-\alpha} \sqrt{v^\top \nabla^2 A(\hat{\theta}_n)^{-1} v} \right) \to \alpha. \]
At least heuristically, then, this separation $\delta = \sqrt{v^\top \nabla^2 A(\theta_0)^{-1} v} / \sqrt{n}$ is the fundamental separation
in parameter values at which testing becomes possible (or below which it is impossible).
As a brief and suggestive aside, the precise growth of the KL-divergence $D_{\mathrm{kl}}(P_{\theta_0 + \Delta} \| P_{\theta_0}) =
\frac{1}{2} \Delta^\top \nabla^2 A(\theta_0) \Delta + O(\|\Delta\|^3)$ near $\theta_0$ plays the fundamental role in both the lower bound and upper
bound on testing. When the Hessian $\nabla^2 A(\theta_0)$ is “large,” meaning it is very positive definite,
distributions with small parameter distances are still well-separated in KL-divergence, making
testing easy, while when $\nabla^2 A(\theta_0)$ is small (nearly singular), the KL-divergence can be small even
for large parameter separations $\Delta$ and testing is hard. As a consequence, at least for exponential
family models, the Fisher information (3.3.2), which we defined as $\nabla^2 A(\theta) = \mathrm{Cov}_\theta(\phi(X))$, plays a
central role in testing and, as we see later, estimation.
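Lemma 3.6.1. Consider any collection $\{\theta_1, \ldots, \theta_m\} \subset \Theta$, and let $\Theta_0 = \mathrm{Conv}\{\theta_i\}_{i=1}^m$ and $C \subset
\mathrm{int}\, \Theta_0$. Then for any $k \in \mathbb{N}$, there exists a constant $K = K(C, k, \{\theta_i\})$ such that for all $\theta_0 \in C$,
Proof Let $B = \{u \in \mathbb{R}^d \mid \|u\| \le 1\}$ be the unit ball in $\mathbb{R}^d$. For any $\epsilon > 0$, there exists a $K = K(\epsilon)$
such that $\|x\|^k \le K e^{\epsilon \|x\|}$ for all $x \in \mathbb{R}^d$. As $C \subset \mathrm{int}\, \mathrm{Conv}(\Theta_0)$, there exists an $\epsilon > 0$ such that for
all $\theta_0 \in C$, $\theta_0 + 2\epsilon B \subset \Theta_0$, and by construction, for any $u \in B$ we can write $\theta_0 + 2\epsilon u = \sum_{j=1}^m \lambda_j \theta_j$
for some $\lambda \in \mathbb{R}^m_+$ with $\mathbf{1}^\top \lambda = 1$. We therefore have
But using the convexity of $t \mapsto \exp(t)$ and that $\theta_0 + 2\epsilon u \in \Theta_0$, the last quantity has upper bound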
Lemma 3.6.2. Under the conditions of Lemma 3.6.1, there exists a K such that for any θ, θ0 ∈ C
Proof We write
\[ \frac{\exp(\langle \theta, x \rangle) - \exp(\langle \theta_0, x \rangle)}{\|\theta - \theta_0\|} = \exp(\langle \theta_0, x \rangle) \frac{\exp(\langle \theta - \theta_0, x \rangle) - 1}{\|\theta - \theta_0\|} \]
Footnote 1: For complex functions, Osgood's lemma shows that if A is continuous and holomorphic in each variable individually,
it is holomorphic. For a treatment of such ideas in an engineering context, see, e.g. [92, Ch. 1].
From this, we can assume without loss of generality that $\theta_0 = 0$ (by shifting). Now note that
by convexity $e^{-a} \ge 1 - a$ for all $a \in \mathbb{R}$, so $1 - e^{a} \le |a|$ when $a \le 0$. Conversely, if $a > 0$, then
$a e^a \ge e^a - 1$ (note that $\frac{d}{da}(a e^a) = a e^a + e^a \ge e^a$), so dividing by $\|x\|$, we see that
as desired.
With the lemmas in hand, we can demonstrate a dominating function for the derivatives. Indeed,
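fix $\theta_0 \in \mathrm{int}\, \Theta$ and for $\theta \in \Theta$, define
\[ g(\theta, x) = \frac{\exp(\langle \theta, x \rangle) - \exp(\langle \theta_0, x \rangle) - \exp(\langle \theta_0, x \rangle) \langle x, \theta - \theta_0 \rangle}{\|\theta - \theta_0\|} = \frac{e^{\langle \theta, x \rangle} - e^{\langle \theta_0, x \rangle} - \langle \nabla e^{\langle \theta_0, x \rangle}, \theta - \theta_0 \rangle}{\|\theta - \theta_0\|}. \]
Then $\lim_{\theta \to \theta_0} g(\theta, x) = 0$ by the differentiability of $t \mapsto e^t$. Lemmas 3.6.1 and 3.6.2 show that if
we take any collection $\{\theta_j\}_{j=1}^m \subset \Theta$ for which $\theta \in \mathrm{int}\, \mathrm{Conv}\{\theta_j\}$, then for $C \subset \mathrm{int}\, \mathrm{Conv}\{\theta_j\}$, there
exists a constant $K$ such that
\[ |g(\theta, x)| \le \frac{|\exp(\langle \theta, x \rangle) - \exp(\langle \theta_0, x \rangle)|}{\|\theta - \theta_0\|} + \|x\| \exp(\langle \theta_0, x \rangle) \le K \max_j \exp(\langle \theta_j, x \rangle) \]
for all $\theta \in C$. As $\int \max_j e^{\langle \theta_j, x \rangle} d\mu(x) \le \sum_{j=1}^m \int e^{\langle \theta_j, x \rangle} d\mu(x) < \infty$, the dominated convergence
theorem thus implies that
\[ \lim_{\theta \to \theta_0} \int g(\theta, x) d\mu(x) = 0, \]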
Analyticity Over the subset $\Theta_{\mathbb{C}} := \{\theta + iz \mid \theta \in \Theta, z \in \mathbb{R}^d\}$ (where $i = \sqrt{-1}$ is the imaginary
unit), we can extend the preceding results to demonstrate that $A$ is analytic on $\Theta_{\mathbb{C}}$. Indeed, we
first simply note that for $a, b \in \mathbb{R}$, $\exp(a + ib) = \exp(a) \exp(ib)$ and $|\exp(a + ib)| = \exp(a)$, i.e.
$|e^z| = e^{\mathrm{Re}(z)}$ for $z \in \mathbb{C}$, and so Lemmas 3.6.1 and 3.6.2 follow mutatis mutandis as in the real case.
These are enough for the application of the dominated convergence theorem above, and we use that
$\exp(\cdot)$ is analytic to conclude that $\theta \mapsto M(\theta)$ is analytic on $\Theta_{\mathbb{C}}$.
3.7 Bibliography
3.8 Exercises
Exercise 3.1: In Example 3.1.4, give the sufficient statistic $\phi$ and an explicit formula for the log
partition function $A(\theta, \Theta)$ so that we can write $p_{\theta,\Theta}(x) = \exp(\langle \theta, \phi_1(x) \rangle + \langle \Theta, \phi_2(x) \rangle - A(\theta, \Theta))$.
Exercise 3.2: Consider the binary logistic regression model in Example 3.4.2, and let $\ell(\theta; x, y) =
-\log p_\theta(y \mid x)$ be the associated log loss.
(ii) Let $(x_i, y_i)_{i=1}^n \subset \mathbb{R}^d \times \{\pm 1\}$ be a sample. Give a sufficient condition for the minimizer of the
empirical log loss
\[ L_n(\theta) := \frac{1}{n} \sum_{i=1}^n \ell(\theta; x_i, y_i) \]
to be unique that depends only on the vectors $\{x_i\}$. Hint. A convex function $h$ is strictly
convex if its Hessian $\nabla^2 h$ is positive definite.
Part I
Chapter 4
Concentration Inequalities
In many scenarios, it is useful to understand how a random variable X behaves by giving bounds
on the probability that it deviates far from its mean or median. This allows us to prove
that estimation and learning procedures will have certain performance and that different decoding and
encoding schemes work with high probability, among other results. In this chapter, we give several
tools for proving bounds on the probability that random variables are far from their typical values.
We conclude the section with a discussion of basic uniform laws of large numbers and applications
to empirical risk minimization and statistical learning, though we focus on the relatively simple
cases we can treat with our tools.
P(X ≥ t)?
We begin with the three most classical inequalities for this purpose: the Markov, Chebyshev,
and Chernoff bounds, which are all instances of the same technique.
The basic inequality on which all else builds is Markov's inequality.
Proposition 4.1.1 (Markov's inequality). Let $X$ be a nonnegative random variable, meaning that
$X \ge 0$ with probability 1. Then for all $t > 0$,
\[ P(X \ge t) \le \frac{E[X]}{t}. \]
Proof For any nonnegative random variable, $P(X \ge t) = E[1\{X \ge t\}] \le E[(X/t) 1\{X \ge t\}] \le E[X]/t$, as
$X/t \ge 1$ whenever $X \ge t$.
When we know more about a random variable than that its expectation is finite, we can give
somewhat more powerful bounds on the probability that the random variable deviates from its
typical values. The first step in this direction, Chebyshev’s inequality, requires two moments, and
when we have exponential moments, we can give even stronger results. As we shall see, each of
these results is but an application of Proposition 4.1.1.
Proposition 4.1.2 (Chebyshev's inequality). Let $X$ be a random variable with $\mathrm{Var}(X) < \infty$. Then
\[ P(X - E[X] \ge t) \le \frac{\mathrm{Var}(X)}{t^2} \quad \text{and} \quad P(X - E[X] \le -t) \le \frac{\mathrm{Var}(X)}{t^2} \]
for all $t > 0$.
Proof We prove only the upper tail result, as the lower tail is identical. We first note that
$X - E[X] \ge t$ implies that $(X - E[X])^2 \ge t^2$. But of course, the random variable $Z = (X - E[X])^2$
is nonnegative, so Markov's inequality gives $P(X - E[X] \ge t) \le P(Z \ge t^2) \le E[Z]/t^2$, and
$E[Z] = E[(X - E[X])^2] = \mathrm{Var}(X)$.
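\[ P(X \ge t) \le \frac{E[e^{\lambda X}]}{e^{\lambda t}} = \varphi_X(\lambda) e^{-\lambda t} \]
for all $\lambda \ge 0$.
Proof This is another application of Markov's inequality: for $\lambda > 0$, we have $e^{\lambda X} \ge e^{\lambda t}$ if and
only if $X \ge t$, so that $P(X \ge t) = P(e^{\lambda X} \ge e^{\lambda t}) \le E[e^{\lambda X}] / e^{\lambda t}$.
In particular, taking the infimum over all $\lambda \ge 0$ in Proposition 4.1.3 gives the more standard
Chernoff (large deviation) bound
\[ P(X \ge t) \le \exp\left( \inf_{\lambda \ge 0} \left\{ \log \varphi_X(\lambda) - \lambda t \right\} \right). \]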
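To compare the three bounds on a concrete case, the sketch below takes X to be a standard exponential random variable (an illustrative choice, not an example from the notes), for which E[X] = 1, Var(X) = 1, the moment generating function is 1/(1 − λ) for 0 ≤ λ < 1, and the exact tail is P(X ≥ t) = e^{−t}. The Chernoff bound is found by a simple grid search over λ.

```python
import math

def markov(t):
    """Markov bound E[X] / t with E[X] = 1."""
    return 1.0 / t

def chebyshev(t):
    """Chebyshev bound Var(X) / (t - E[X])^2 on the deviation t - E[X], valid for t > 1."""
    return 1.0 / (t - 1.0) ** 2

def chernoff(t, grid=1000):
    """Minimize exp(log mgf(lam) - lam * t) over lam = 0, 1/grid, ..., (grid - 1)/grid."""
    return min(math.exp(-math.log(1.0 - k / grid) - (k / grid) * t) for k in range(grid))

t = 10.0
exact = math.exp(-t)
bounds = (markov(t), chebyshev(t), chernoff(t))
```

At t = 10 the bounds are ordered as one would hope: the Chernoff bound (which has closed form t·e^{1−t} for this distribution) is much tighter than Chebyshev, which in turn beats Markov, though all remain above the exact tail.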
\[ \varphi_X(\lambda) = E[\exp(\lambda X)] = \exp\left( \frac{\lambda^2 \sigma^2}{2} \right). \tag{4.1.1} \]
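As a consequence of the equality (4.1.1) and the Chernoff bound technique (Proposition 4.1.3),
we see that for $X$ Gaussian with variance $\sigma^2$, we have
\[ P(X \ge E[X] + t) \le \exp\left( -\frac{t^2}{2\sigma^2} \right) \quad \text{and} \quad P(X \le E[X] - t) \le \exp\left( -\frac{t^2}{2\sigma^2} \right) \]
for all $t \ge 0$. Indeed, we have $\log \varphi_{X - E[X]}(\lambda) = \frac{\lambda^2 \sigma^2}{2}$, and $\inf_\lambda \{ \frac{\lambda^2 \sigma^2}{2} - \lambda t \} = -\frac{t^2}{2\sigma^2}$, which is
attained by $\lambda = \frac{t}{\sigma^2}$. ◊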
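For a Gaussian the exact tail is available through the complementary error function, so the Chernoff bound can be compared against it directly. The sketch below does this on a small grid of t values; the value σ = 1.5 and the grid are arbitrary illustrative choices.

```python
import math

def gaussian_upper_tail(t, sigma):
    """Exact P(X - E[X] >= t) for X Gaussian with standard deviation sigma."""
    return 0.5 * math.erfc(t / (sigma * math.sqrt(2.0)))

def chernoff_bound(t, sigma):
    """The bound exp(-t^2 / (2 sigma^2))."""
    return math.exp(-t ** 2 / (2.0 * sigma ** 2))

sigma = 1.5
ts = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
pairs = [(gaussian_upper_tail(t, sigma), chernoff_bound(t, sigma)) for t in ts]
```

The bound holds at every t, and it captures the correct exponential rate of decay even though it is loose by a polynomial factor in t.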
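Example 4.1.5 (Random signs (Rademacher variables)): The random variable $X$ taking
values $\{-1, 1\}$ with equal probability is 1-sub-Gaussian. Indeed, we have
\[ E[\exp(\lambda X)] = \frac{1}{2} e^{\lambda} + \frac{1}{2} e^{-\lambda} = \frac{1}{2} \sum_{k=0}^\infty \frac{\lambda^k}{k!} + \frac{1}{2} \sum_{k=0}^\infty \frac{(-\lambda)^k}{k!} = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \le \sum_{k=0}^\infty \frac{(\lambda^2)^k}{2^k k!} = \exp\left( \frac{\lambda^2}{2} \right), \]
as claimed. ◊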
Bounded random variables are also sub-Gaussian; indeed, we have the following example.
Example 4.1.6 (Bounded random variables): Suppose that $X$ is bounded, say $X \in [a, b]$.
Then Hoeffding's lemma states that
\[ E[e^{\lambda(X - E[X])}] \le \exp\left( \frac{\lambda^2 (b - a)^2}{8} \right), \]
so that $X$ is $(b - a)^2/4$-sub-Gaussian.
We prove a somewhat weaker statement with a simpler argument, while Exercise 4.1 gives one
approach to proving the above statement. First, let $\varepsilon \in \{-1, 1\}$ be a Rademacher variable,
so that $P(\varepsilon = 1) = P(\varepsilon = -1) = \frac{1}{2}$. We apply a so-called symmetrization technique (a
common technique in probability theory, statistics, concentration inequalities, and Banach
space research) to give a simpler bound. Indeed, let $X'$ be an independent copy of $X$, so that
$E[X'] = E[X]$. We have
$= E[\exp(\lambda \varepsilon (X - X'))],$
where the inequality follows from Jensen's inequality and the last equality is a consequence of
the fact that $X - X'$ is symmetric about 0. Using the result of Example 4.1.5,
\[ E[\exp(\lambda \varepsilon (X - X'))] \le E\left[ \exp\left( \frac{\lambda^2 (X - X')^2}{2} \right) \right] \le \exp\left( \frac{\lambda^2 (b - a)^2}{2} \right), \]
where the final inequality is immediate from the fact that $|X - X'| \le b - a$. ◊
While Example 4.1.6 shows how a symmetrization technique can give sub-Gaussian behavior,
more sophisticated techniques involving explicitly bounding the logarithm of the moment generating
function of X, often by calculations involving exponential tilts of its density. In particular, letting
X be mean zero for simplicity, if we let
then
E[XeλX ] E[X 2 eλX ] E[XeλX ]2
ψ 0 (λ) = and ψ 00
(λ) = − ,
E[eλX ] E[eλX ] E[eλX ]2
where we can interchange the order of taking expectations and derivatives whenever ψ(λ) is finite.
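Notably, if $X$ has density $p_X$ (with respect to any base measure) then the random variable $Y_\lambda$ with
density
\[
p_\lambda(y) = p_X(y) \frac{e^{\lambda y}}{\mathbb{E}[e^{\lambda X}]}
\]
(with respect to the same base measure) satisfies
\[
\mathbb{E}[Y_\lambda] = \psi'(\lambda) \quad \text{and} \quad \mathrm{Var}(Y_\lambda) = \psi''(\lambda).
\]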
One can exploit this in many ways, which the exercises and coming chapters do. As a particular
example, we can give sharper sub-Gaussian constants for Bernoulli random variables.
Example 4.1.7 (Bernoulli random variables): Let $X$ be Bernoulli($p$), so that $X = 1$ with
probability $p$ and $X = 0$ otherwise. Then a strengthening of Hoeffding's lemma (also, essentially,
due to Hoeffding) is that
\[
\log \mathbb{E}[e^{\lambda(X - p)}] \le \frac{\sigma^2(p)}{2} \lambda^2
\quad \text{for} \quad
\sigma^2(p) := \frac{1 - 2p}{2 \log \frac{1 - p}{p}}.
\]
Here we take the limits as $p \to \{0, \frac{1}{2}, 1\}$ and have $\sigma^2(0) = 0$, $\sigma^2(1) = 0$, and $\sigma^2(\frac{1}{2}) = \frac{1}{4}$.
Because $p \mapsto \sigma^2(p)$ is concave and symmetric about $p = \frac{1}{2}$, this inequality is always sharper
than that of Example 4.1.6. Exercise 4.9 gives one proof of this bound exploiting exponential
tilting. $\Diamond$
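To make the constant in Example 4.1.7 concrete, here is a small sketch that evaluates $\sigma^2(p)$ (with the stated limits) and the exact centered log moment generating function of a Bernoulli variable, so the two sides of the inequality can be compared on a grid. The function names and the tolerance used near $p = 1/2$ are illustrative assumptions.

```python
import math

def sharp_sigma2(p: float) -> float:
    """sigma^2(p) = (1 - 2p) / (2 log((1 - p)/p)), with limits at p in {0, 1/2, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    if abs(p - 0.5) < 1e-6:
        # Limiting value at p = 1/2, used to avoid dividing 0 by 0.
        return 0.25
    return (1.0 - 2.0 * p) / (2.0 * math.log((1.0 - p) / p))

def bernoulli_log_mgf(lam: float, p: float) -> float:
    """log E[exp(lam * (X - p))] for X ~ Bernoulli(p)."""
    return math.log(1.0 - p + p * math.exp(lam)) - lam * p
```

For instance, `sharp_sigma2(0.05)` is roughly 0.153, noticeably smaller than the value 1/4 that Example 4.1.6 gives for any variable supported on $[0, 1]$.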
Chernoff bounds for sub-Gaussian random variables are immediate; indeed, they have the same
concentration properties as Gaussian random variables, a consequence of the nice analytical prop-
erties of their moment generating functions (that their logarithms are at most quadratic). Thus,
using the technique of Example 4.1.4, we obtain the following proposition.
Proposition 4.1.8. Let $X$ be $\sigma^2$-sub-Gaussian. Then for all $t \ge 0$ we have
\[
\mathbb{P}(X - \mathbb{E}[X] \ge t) \vee \mathbb{P}(X - \mathbb{E}[X] \le -t) \le \exp\Big(-\frac{t^2}{2\sigma^2}\Big).
\]
Chernoff bounds extend naturally to sums of independent random variables, because moment
generating functions of sums of independent random variables become products of moment gener-
ating functions.
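Proof We assume w.l.o.g. that the $X_i$ are mean zero. We have by independence and
sub-Gaussianity that
\[
\mathbb{E}\Big[\exp\Big(\lambda \sum_{i=1}^n X_i\Big)\Big]
= \mathbb{E}\Big[\exp\Big(\lambda \sum_{i=1}^{n-1} X_i\Big)\Big] \mathbb{E}[\exp(\lambda X_n)]
\le \exp\Big(\frac{\lambda^2 \sigma_n^2}{2}\Big) \mathbb{E}\Big[\exp\Big(\lambda \sum_{i=1}^{n-1} X_i\Big)\Big].
\]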
Two immediate corollaries of Propositions 4.1.8 and 4.1.9 show that sums of sub-Gaussian random
variables concentrate around their expectations. We begin with a general concentration inequality.
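Corollary 4.1.10. Let $X_i$ be independent $\sigma_i^2$-sub-Gaussian random variables. Then for all $t \ge 0$
\[
\max\Big\{ \mathbb{P}\Big(\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \ge t\Big), \mathbb{P}\Big(\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \le -t\Big) \Big\}
\le \exp\Big(-\frac{t^2}{2 \sum_{i=1}^n \sigma_i^2}\Big).
\]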
Additionally, the classical Hoeffding bound follows when we couple Example 4.1.6 with
Corollary 4.1.10: if $X_i \in [a_i, b_i]$, then
\[
\mathbb{P}\Big(\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \ge t\Big) \le \exp\Big(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big).
\]
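To give another interpretation of these inequalities, let us assume that the $X_i$ are independent and
$\sigma^2$-sub-Gaussian. Then we have that
\[
\mathbb{P}\Big(\frac{1}{n} \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \ge t\Big) \le \exp\Big(-\frac{n t^2}{2\sigma^2}\Big),
\]
or, for $\delta \in (0, 1)$, setting $\exp(-\frac{n t^2}{2\sigma^2}) = \delta$ or $t = \frac{\sqrt{2\sigma^2 \log\frac{1}{\delta}}}{\sqrt{n}}$, we have that
\[
\frac{1}{n} \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \le \frac{\sqrt{2\sigma^2 \log\frac{1}{\delta}}}{\sqrt{n}}
\quad \text{with probability at least } 1 - \delta.
\]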
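The inversion between the tail bound and the high-probability statement can be sketched as follows: one helper returns the deviation width for a given sample size, and the other returns a sample size sufficient for a target width. Both names are hypothetical, and the sketch assumes independent $\sigma^2$-sub-Gaussian summands as in the display above.

```python
import math

def deviation_width(sigma2: float, n: int, delta: float) -> float:
    """t = sqrt(2 * sigma2 * log(1/delta) / n), the width holding w.p. >= 1 - delta."""
    return math.sqrt(2.0 * sigma2 * math.log(1.0 / delta) / n)

def sufficient_sample_size(sigma2: float, t: float, delta: float) -> int:
    """Smallest integer n with exp(-n t^2 / (2 sigma2)) <= delta."""
    return math.ceil(2.0 * sigma2 * math.log(1.0 / delta) / (t * t))
```

With `sigma2 = 1.0`, `t = 0.1`, and `delta = 0.05`, the second helper returns 600, illustrating the $1/t^2$ dependence of the sample size on the target accuracy.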
There are a variety of other conditions equivalent to sub-Gaussianity, which we capture in the
following theorem.
Theorem 4.1.11. Let $X$ be a mean-zero random variable and $\sigma^2 \ge 0$ be a constant. The following
statements are all equivalent, meaning that there are numerical constant factors $K_j$ such that if one
statement (i) holds with parameter $K_i$, then statement (j) holds with parameter $K_j \le C K_i$, where
$C$ is a numerical constant.
(1) Sub-gaussian tails: $\mathbb{P}(|X| \ge t) \le 2 \exp(-\frac{t^2}{K_1 \sigma^2})$ for all $t \ge 0$.
(2) Sub-gaussian moments: $\mathbb{E}[|X|^k]^{1/k} \le K_2 \sigma \sqrt{k}$ for all $k$.
Particularly, (1) implies (2) with $K_1 = 1$ and $K_2 \le e^{1/e}$; (2) implies (3) with $K_2 = 1$ and
$K_3 = e\sqrt{\frac{2}{e - 1}} < 3$; (3) implies (4) with $K_3 = 1$ and $K_4 \le \frac{3}{4}$; and (4) implies (1) with $K_4 = \frac{1}{2}$ and
$K_1 \le 2$.
This result is standard in the literature on concentration and random variables, but see Ap-
pendix 4.5.1 for a proof of this theorem.
For completeness, we can give a tighter result than part (3) of the preceding theorem, giving a
concrete upper bound on squares of sub-Gaussian random variables. The technique used in the ex-
ample, to introduce an independent random variable for auxiliary randomization, is a common and
useful technique in probabilistic arguments (similar to our use of symmetrization in Example 4.1.6).
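\[
\mathbb{E}[\exp(\lambda X^2)] = \mathbb{E}[\exp(\sqrt{2\lambda} X Z)] \stackrel{(i)}{\le} \mathbb{E}[\exp(\lambda \sigma^2 Z^2)] \stackrel{(ii)}{=} \frac{1}{[1 - 2\sigma^2\lambda]_+^{1/2}},
\]
where inequality (i) follows because $X$ is sub-Gaussian, and equality (ii) because $Z \sim N(0, 1)$.
$\Diamond$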
where inequality (i) holds for $\lambda \le \frac{1}{4}$, because $-\log(1 - 2\lambda) \le 2\lambda + 4\lambda^2$ for $\lambda \le \frac{1}{4}$. $\Diamond$
As a second example, we can show that bounded random variables are sub-exponential. It is
clear that this is the case as they are also sub-Gaussian; however, in many cases, it is possible to
show that their parameters yield much tighter control over deviations than is possible using only
sub-Gaussian techniques.
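Example 4.1.14 (Bounded random variables are sub-exponential): Suppose that $X$ is a
mean zero random variable taking values in $[-b, b]$ with variance $\sigma^2 = \mathbb{E}[X^2]$ (note that we are
guaranteed that $\sigma^2 \le b^2$ in this case). We claim that
\[
\mathbb{E}[\exp(\lambda X)] \le \exp\Big(\frac{3\lambda^2\sigma^2}{5}\Big) \quad \text{for } |\lambda| \le \frac{1}{2b}. \tag{4.1.4}
\]
To see this, note first that for $k \ge 2$ we have $\mathbb{E}[|X|^k] \le \mathbb{E}[X^2 b^{k-2}] = \sigma^2 b^{k-2}$. Then by an
expansion of the exponential, we find
\begin{align*}
\mathbb{E}[\exp(\lambda X)] & = 1 + \mathbb{E}[\lambda X] + \frac{\lambda^2 \mathbb{E}[X^2]}{2} + \sum_{k=3}^\infty \frac{\lambda^k \mathbb{E}[X^k]}{k!}
\le 1 + \frac{\lambda^2\sigma^2}{2} + \sum_{k=3}^\infty \frac{\lambda^k \sigma^2 b^{k-2}}{k!} \\
& = 1 + \frac{\lambda^2\sigma^2}{2} + \lambda^2\sigma^2 \sum_{k=1}^\infty \frac{(\lambda b)^k}{(k+2)!}
\stackrel{(i)}{\le} 1 + \frac{\lambda^2\sigma^2}{2} + \frac{\lambda^2\sigma^2}{10},
\end{align*}
inequality (i) holding for $\lambda \le \frac{1}{2b}$. Using that $1 + x \le e^x$ gives the result.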
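It is possible to give a slightly tighter result for $\lambda \ge 0$. In this case, we have the bound
\[
\mathbb{E}[\exp(\lambda X)] \le 1 + \frac{\lambda^2\sigma^2}{2} + \lambda^2\sigma^2 \sum_{k=3}^\infty \frac{\lambda^{k-2} b^{k-2}}{k!}
= 1 + \frac{\sigma^2}{b^2}\big(e^{\lambda b} - 1 - \lambda b\big).
\]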
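Then using that $1 + x \le e^x$, we obtain Bennett's moment generating inequality, which is that
\[
\mathbb{E}[e^{\lambda X}] \le \exp\Big(\frac{\sigma^2}{b^2}\big(e^{\lambda b} - 1 - \lambda b\big)\Big) \quad \text{for } \lambda \ge 0. \tag{4.1.5}
\]
Inequality (4.1.5) always holds, and for $\lambda b$ near 0, we have $e^{\lambda b} - 1 - \lambda b \approx \frac{\lambda^2 b^2}{2}$. $\Diamond$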
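A concrete instance helps calibrate inequalities (4.1.4) and (4.1.5). The sketch below uses a hypothetical three-point variable, equal to $\pm b$ with probability $q/2$ each and to 0 otherwise, which is mean zero with variance $\sigma^2 = q b^2$; its moment generating function is available in closed form and can be compared against both bounds. The distribution and function names are illustrative assumptions rather than part of the example.

```python
import math

def three_point_mgf(lam: float, b: float, q: float) -> float:
    """E[exp(lam X)] when X = +b or -b w.p. q/2 each, and X = 0 w.p. 1 - q."""
    return 1.0 - q + q * math.cosh(lam * b)

def sub_exponential_bound(lam: float, sigma2: float) -> float:
    """Right side of (4.1.4); intended for |lam| <= 1/(2b)."""
    return math.exp(3.0 * lam * lam * sigma2 / 5.0)

def bennett_mgf_bound(lam: float, b: float, sigma2: float) -> float:
    """Right side of (4.1.5); intended for lam >= 0."""
    return math.exp(sigma2 / (b * b) * (math.exp(lam * b) - 1.0 - lam * b))
```

When $q$ is small the variance $q b^2$ is far below $b^2$, which is exactly the regime where these variance-sensitive bounds improve on bounds that use only the range $[-b, b]$.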
In particular, if the variance $\sigma^2 \ll b^2$, where $b$ is the absolute bound on $X$, inequality (4.1.4) gives much
tighter control on the moment generating function of $X$ than typical sub-Gaussian bounds based
only on the fact that $X \in [-b, b]$ allow.
More broadly, we can show a result similar to Theorem 4.1.11.
Theorem 4.1.15. Let X be a random variable and σ ≥ 0. Then—in the sense of Theorem 4.1.11—
the following statements are all equivalent for suitable numerical constants K1 , . . . , K4 .
(4) If, in addition, $\mathbb{E}[X] = 0$, then $\mathbb{E}[\exp(\lambda X)] \le \exp(K_4 \lambda^2 \sigma^2)$ for all $|\lambda| \le K_4'/\sigma$.
In particular, if (2) holds with $K_2 = 1$, then (4) holds with $K_4 = 2e^2$ and $K_4' = \frac{1}{2e}$.
The proof, which is similar to that for Theorem 4.1.11, is presented in Section 4.5.2.
While the concentration properties of sub-exponential random variables are not quite so nice
as those for sub-Gaussian random variables (recall Hoeffding’s inequality, Corollary 4.1.10), we
can give sharp tail bounds for sub-exponential random variables. We first give a simple bound on
deviation probabilities.
Proposition 4.1.16. Let $X$ be a mean-zero $(\tau^2, b)$-sub-exponential random variable. Then for all
$t \ge 0$,
\[
\mathbb{P}(X \ge t) \vee \mathbb{P}(X \le -t) \le \exp\Big(-\frac{1}{2} \min\Big\{\frac{t^2}{\tau^2}, \frac{t}{b}\Big\}\Big).
\]
Proof The proof is an application of the Chernoff bound technique; we prove only the upper tail
as the lower tail is similar. We have
\[
\mathbb{P}(X \ge t) \le \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda t}} \stackrel{(i)}{\le} \exp\Big(\frac{\lambda^2\tau^2}{2} - \lambda t\Big),
\]
inequality (i) holding for $|\lambda| \le 1/b$. To minimize the last term in $\lambda$, we take $\lambda = \min\{\frac{t}{\tau^2}, 1/b\}$,
which gives the result.
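The final step of the proof, plugging $\lambda = \min\{t/\tau^2, 1/b\}$ into the Chernoff exponent, can be checked numerically. The sketch below (with hypothetical function names, and assuming $b > 0$) evaluates the exponent at this choice and compares it with the exponent claimed in Proposition 4.1.16.

```python
def chernoff_exponent(lam: float, t: float, tau2: float) -> float:
    """The exponent lam^2 tau^2 / 2 - lam t from the Chernoff bound."""
    return lam * lam * tau2 / 2.0 - lam * t

def chernoff_lambda(t: float, tau2: float, b: float) -> float:
    """The choice lam = min{t / tau^2, 1/b} made in the proof (requires b > 0)."""
    return min(t / tau2, 1.0 / b)

def claimed_exponent(t: float, tau2: float, b: float) -> float:
    """The exponent -(1/2) min{t^2 / tau^2, t / b} in the proposition."""
    return -0.5 * min(t * t / tau2, t / b)
```

In the regime $t \le \tau^2/b$ the two exponents coincide at $-t^2/(2\tau^2)$; for larger $t$ the constraint $\lambda \le 1/b$ binds and the Chernoff exponent is at most $-t/(2b)$.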
Comparing with sub-Gaussian random variables, which have $b = 0$, we see that Proposition 4.1.16
gives a similar result for small $t$ (essentially the same concentration as for sub-Gaussian random
variables), while for large $t$, the tails decrease only exponentially in $t$.
We can also give a tensorization identity similar to Proposition 4.1.9.
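Proof We apply an inductive technique similar to that used in the proof of Proposition 4.1.9.
First, for any fixed $i$, we know that if $|\lambda| \le \frac{1}{b_i |a_i|}$, then $|a_i \lambda| \le \frac{1}{b_i}$ and so
\[
\mathbb{E}[\exp(\lambda a_i X_i)] \le \exp\Big(\frac{\lambda^2 a_i^2 \sigma_i^2}{2}\Big).
\]
Now, we inductively apply the preceding inequality, which applies so long as $|\lambda| \le \frac{1}{b_i |a_i|}$ for all $i$.
We have
\[
\mathbb{E}\Big[\exp\Big(\lambda \sum_{i=1}^n a_i X_i\Big)\Big] = \prod_{i=1}^n \mathbb{E}[\exp(\lambda a_i X_i)] \le \prod_{i=1}^n \exp\Big(\frac{\lambda^2 a_i^2 \sigma_i^2}{2}\Big),
\]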
It is instructive to study the structure of the bound of Corollary 4.1.18. Notably, the bound
is similar to the Hoeffding-type bound of Corollary 4.1.10 (holding for $\sigma^2$-sub-Gaussian random
variables) that
\[
\mathbb{P}\Big(\sum_{i=1}^n a_i X_i \ge t\Big) \le \exp\Big(-\frac{t^2}{2 \|a\|_2^2 \sigma^2}\Big),
\]
so that for small t, Corollary 4.1.18 gives sub-Gaussian tail behavior. For large t, the bound is
weaker. However, in many cases, Corollary 4.1.18 can give finer control than naive sub-Gaussian
bounds. Indeed, suppose that the random variables Xi are i.i.d., mean zero, and satisfy Xi ∈ [−b, b]
with probability 1, but have variance σ 2 = E[Xi2 ] ≤ b2 as in Example 4.1.14. Then Corollary 4.1.18
implies that
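\[
\mathbb{P}\Big(\sum_{i=1}^n a_i X_i \ge t\Big) \le \exp\Big(-\frac{1}{2} \min\Big\{\frac{5}{6} \frac{t^2}{\sigma^2 \|a\|_2^2}, \frac{t}{2b \|a\|_\infty}\Big\}\Big). \tag{4.1.6}
\]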
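When applied to a standard mean with $a_i = \frac{1}{n}$ (and with the minor simplification that $5/12 > 1/3$),
we obtain the bound that $\frac{1}{n} \sum_{i=1}^n X_i \le t$ with probability at least $1 - \exp(-n \min\{\frac{t^2}{3\sigma^2}, \frac{t}{4b}\})$. Written
differently, we take $t = \max\{\sigma \sqrt{\frac{3 \log\frac{1}{\delta}}{n}}, \frac{4b \log\frac{1}{\delta}}{n}\}$ to obtain
\[
\frac{1}{n} \sum_{i=1}^n X_i \le \max\Big\{\sigma \frac{\sqrt{3 \log\frac{1}{\delta}}}{\sqrt{n}}, \frac{4b \log\frac{1}{\delta}}{n}\Big\}
\quad \text{with probability at least } 1 - \delta.
\]
The sharpest such bound possible via more naive Hoeffding-type bounds is $b\sqrt{2 \log\frac{1}{\delta}}/\sqrt{n}$, which
has substantially worse scaling when $\sigma \ll b$.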
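To see the size of the gap, the two deviation widths from the preceding paragraph can be computed side by side. This is a minimal sketch with hypothetical function names; it simply evaluates the two closed-form expressions.

```python
import math

def bernstein_width(sigma: float, b: float, n: int, delta: float) -> float:
    """max{ sigma * sqrt(3 log(1/delta) / n), 4 b log(1/delta) / n }."""
    log_term = math.log(1.0 / delta)
    return max(sigma * math.sqrt(3.0 * log_term / n), 4.0 * b * log_term / n)

def hoeffding_width(b: float, n: int, delta: float) -> float:
    """b * sqrt(2 log(1/delta) / n), using only that X_i lies in [-b, b]."""
    return b * math.sqrt(2.0 * math.log(1.0 / delta) / n)
```

With `sigma = 0.1`, `b = 1.0`, `n = 10000`, and `delta = 0.01`, the variance-sensitive width is about 0.0037 while the Hoeffding-type width is about 0.030. When `sigma` equals `b` the constant 3 versus 2 makes the variance-sensitive width slightly larger, so the gain comes entirely from small variance.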
Lemma 4.1.19. Let $X$ be a random variable satisfying the Bernstein condition (4.1.7). Then
\[
\mathbb{E}\big[e^{\lambda(X - \mu)}\big] \le \exp\Big(\frac{\lambda^2\sigma^2}{2(1 - b|\lambda|)}\Big) \quad \text{for } |\lambda| \le \frac{1}{b}.
\]
Said differently, a random variable satisfying Condition (4.1.7) is $(\sqrt{2}\sigma, b/2)$-sub-exponential.
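Proof Without loss of generality we assume $\mu = 0$. We expand the moment generating function
by noting that
\begin{align*}
\mathbb{E}[e^{\lambda X}] = 1 + \frac{\lambda^2\sigma^2}{2} + \sum_{k=3}^\infty \frac{\lambda^k \mathbb{E}[X^k]}{k!}
& \stackrel{(i)}{\le} 1 + \frac{\lambda^2\sigma^2}{2} + \frac{\lambda^2\sigma^2}{2} \sum_{k=3}^\infty |\lambda b|^{k-2} \\
& = 1 + \frac{\lambda^2\sigma^2}{2} \frac{1}{[1 - b|\lambda|]_+}
\end{align*}
where inequality (i) used the Bernstein condition (4.1.7). Noting that $1 + x \le e^x$ gives the result.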
As one final example, we return to Bennett’s inequality (4.1.5) from Example 4.1.14.
Proof We assume without loss of generality that $\mathbb{E}[X] = 0$. Using the standard Chernoff bound
argument coupled with inequality (4.1.5), we see that
\[
\mathbb{P}\Big(\sum_{i=1}^n X_i \ge t\Big) \le \exp\Big(\sum_{i=1}^n \frac{\sigma_i^2}{b^2}\big(e^{\lambda b} - 1 - \lambda b\big) - \lambda t\Big).
\]
Letting $h(t) = (1 + t)\log(1 + t) - t$ as in the statement of the proposition and $\sigma^2 = \sum_{i=1}^n \sigma_i^2$, we
minimize over $\lambda \ge 0$, setting $\lambda = \frac{1}{b}\log(1 + \frac{bt}{\sigma^2})$. Substituting into our Chernoff bound application
gives the proposition.
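A slightly more intuitive writing of Bennett's inequality is to use averages, in which case for
$\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ the average of the variances,
\[
\mathbb{P}\Big(\frac{1}{n}\sum_{i=1}^n X_i \ge t\Big) \le \exp\Big(-\frac{n\sigma^2}{b^2} h\Big(\frac{bt}{\sigma^2}\Big)\Big).
\]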
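The substitution in the proof of Bennett's inequality is easy to get wrong by hand, so here is a short numerical sketch of it: the exponent from the Chernoff argument, the minimizing $\lambda$, and the function $h$. The names are hypothetical; `sigma2` denotes the sum of the variances as in the proof.

```python
import math

def h(u: float) -> float:
    """h(u) = (1 + u) log(1 + u) - u."""
    return (1.0 + u) * math.log1p(u) - u

def bennett_exponent(lam: float, t: float, b: float, sigma2: float) -> float:
    """(sigma2 / b^2) (e^{lam b} - 1 - lam b) - lam t, with sigma2 the summed variances."""
    return sigma2 / (b * b) * (math.exp(lam * b) - 1.0 - lam * b) - lam * t

def bennett_lambda(t: float, b: float, sigma2: float) -> float:
    """The minimizer lam = (1/b) log(1 + b t / sigma2)."""
    return math.log1p(b * t / sigma2) / b
```

Evaluating `bennett_exponent` at `bennett_lambda(t, b, sigma2)` should reproduce `-(sigma2 / b**2) * h(b * t / sigma2)` up to rounding, which is the exponent in the proposition.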
That this is a norm is not completely trivial, though a few properties are immediate: clearly
$\|aX\|_\psi = |a| \|X\|_\psi$, and we have $\|X\|_\psi = 0$ if and only if $X = 0$ with probability 1. The key result
is that $\|\cdot\|_\psi$ is convex, which then guarantees that it is a norm.
Proposition 4.1.21. The function $\|\cdot\|_\psi$ is convex on the space of random variables.
Proof Because $\psi$ is convex and non-decreasing, $x \mapsto \psi(|x|)$ is convex as well. (Convince yourself
of this.) Thus, its perspective transform $\mathrm{pers}(\psi)(t, |x|) := t\psi(|x|/t)$ is jointly convex in both $t \ge 0$
and $x$ (see Appendix B.3.3). This joint convexity of $\mathrm{pers}(\psi)$ implies that for any random variables $X_0$
and $X_1$, any $t_0, t_1$, and any $\lambda \in [0, 1]$,
\[
\mathbb{E}[\mathrm{pers}(\psi)(\lambda t_0 + (1 - \lambda)t_1, |\lambda X_0 + (1 - \lambda)X_1|)] \le \lambda \mathbb{E}[\mathrm{pers}(\psi)(t_0, |X_0|)] + (1 - \lambda)\mathbb{E}[\mathrm{pers}(\psi)(t_1, |X_1|)].
\]
We also have what we term the sub-Gaussian and sub-exponential norms, typically denoted by
considering the functions
\[
\psi_p(x) := \exp(|x|^p) - 1.
\]
These induce the Orlicz $\psi_p$-norms, as for $p \ge 1$, these are convex (as they are the composition of the
increasing convex function $\exp(\cdot)$ applied to the nonnegative convex function $|\cdot|^p$). Theorem 4.1.11
shows that we have a natural sub-Gaussian norm
while Theorem 4.1.15 shows a natural sub-exponential norm (or Orlicz ψ1 -norm)
Many relationships follow immediately from the definitions (4.1.10) and (4.1.11). For example,
any sub-Gaussian random variable (whether or not it is mean zero) has a square that is sub-
exponential:
(This is immediate by definition.) By tracing through the arguments in the proofs of Theo-
rems 4.1.11 and 4.1.15, we can also see that an alternative definition of the two norms could
be
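\[
\sup_{k \in \mathbb{N}} \frac{1}{\sqrt{k}} \mathbb{E}[|X|^k]^{1/k}
\quad \text{and} \quad
\sup_{k \in \mathbb{N}} \frac{1}{k} \mathbb{E}[|X|^k]^{1/k}
\]
for the sub-Gaussian and sub-exponential norms $\|X\|_{\psi_2}$ and $\|X\|_{\psi_1}$, respectively. They are all
equivalent up to numerical constants.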
dimension while preserving essential aspects of the dataset. This line of research begins with Indyk
and Motwani [112], and continues through a variety of other works, including Indyk [111] and
work on locality-sensitive hashing by Andoni et al. [6], among others. The original approach is due
to Johnson and Lindenstrauss, who used the results in the study of Banach spaces [117]; our proof
follows a standard argument.
The most specific variant of this problem is as follows: we have $n$ points $u_1, \ldots, u_n$, and we
would like to construct a mapping $\Phi : \mathbb{R}^d \to \mathbb{R}^m$, where $m \ll d$, such that
\[
\|\Phi u_i - \Phi u_j\|_2 \in (1 \pm \epsilon) \|u_i - u_j\|_2.
\]
Depending on the norm chosen, this task may be impossible; for the Euclidean ($\ell_2$) norm, however,
such an embedding is easy to construct using Gaussian random variables and with $m = O(\frac{1}{\epsilon^2} \log n)$.
This embedding is known as the Johnson-Lindenstrauss embedding. Note that this size $m$ is
independent of the dimension $d$, only depending on the number of points $n$.
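Example 4.1.23 (Johnson-Lindenstrauss): Let the matrix $\Phi \in \mathbb{R}^{m \times d}$ be defined as follows:
\[
\Phi_{ij} \stackrel{\text{iid}}{\sim} N(0, 1/m),
\]
and let $\Phi_i \in \mathbb{R}^d$ denote the $i$th row of this matrix. We claim that
\[
m \ge \frac{8}{\epsilon^2}\Big[2 \log n + \log\frac{1}{\delta}\Big]
\quad \text{implies} \quad
\|\Phi u_i - \Phi u_j\|_2^2 \in (1 \pm \epsilon) \|u_i - u_j\|_2^2
\]
for all pairs $u_i, u_j$ with probability at least $1 - \delta$. In particular, $m \gtrsim \frac{\log n}{\epsilon^2}$ is sufficient to achieve
accurate dimension reduction with high probability.
To see this, note that for any fixed vector $u$,
\[
\frac{\langle \Phi_i, u \rangle}{\|u\|_2} \sim N(0, 1/m),
\quad \text{and} \quad
\frac{\|\Phi u\|_2^2}{\|u\|_2^2} = \sum_{i=1}^m \langle \Phi_i, u / \|u\|_2 \rangle^2
\]
is a sum of independent scaled $\chi^2$-random variables. In particular, we have $\mathbb{E}[\|\Phi u / \|u\|_2\|_2^2] = 1$,
and using the $\chi^2$-concentration result of Example 4.1.13 yields
\begin{align*}
\mathbb{P}\big(\big|\|\Phi u\|_2^2 / \|u\|_2^2 - 1\big| \ge \epsilon\big)
& = \mathbb{P}\big(\big|m \|\Phi u\|_2^2 / \|u\|_2^2 - m\big| \ge m\epsilon\big) \\
& \le 2 \inf_{|\lambda| \le \frac{1}{4}} \exp\big(2m\lambda^2 - \lambda m \epsilon\big) = 2\exp\Big(-\frac{m\epsilon^2}{8}\Big),
\end{align*}
the last equality holding for $\epsilon \in [0, 1]$ (take $\lambda = \epsilon/4$). Now, using the union bound applied to each of the
pairs $(u_i, u_j)$ in the sample, we have
\[
\mathbb{P}\Big(\text{there exist } i \neq j \text{ s.t. } \big|\|\Phi(u_i - u_j)\|_2^2 - \|u_i - u_j\|_2^2\big| \ge \epsilon \|u_i - u_j\|_2^2\Big)
\le 2\binom{n}{2} \exp\Big(-\frac{m\epsilon^2}{8}\Big).
\]
Taking $m \ge \frac{8}{\epsilon^2} \log\frac{n^2}{\delta} = \frac{16}{\epsilon^2} \log n + \frac{8}{\epsilon^2} \log\frac{1}{\delta}$ yields that with probability at least $1 - \delta$, we have
$\|\Phi u_i - \Phi u_j\|_2^2 \in (1 \pm \epsilon) \|u_i - u_j\|_2^2$. $\Diamond$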
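The construction in Example 4.1.23 is short enough to simulate directly. The following sketch, using only the Python standard library, computes the sufficient dimension $m$ from the example, draws a matrix with i.i.d. $N(0, 1/m)$ entries, and measures the worst relative distortion of squared pairwise distances. All function names are illustrative, and the random number generator is passed in explicitly so that runs are reproducible.

```python
import math
import random

def jl_dimension(n: int, eps: float, delta: float) -> int:
    """Smallest integer m with m >= (8 / eps^2) * (2 log n + log(1/delta))."""
    return math.ceil(8.0 / (eps * eps) * (2.0 * math.log(n) + math.log(1.0 / delta)))

def gaussian_projection(m: int, d: int, rng: random.Random):
    """An m x d matrix (list of rows) with i.i.d. N(0, 1/m) entries."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(m)]

def project(phi, u):
    """Matrix-vector product phi @ u."""
    return [sum(r * x for r, x in zip(row, u)) for row in phi]

def squared_distance(u, v) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))

def max_distortion(points, phi) -> float:
    """Largest | ||Phi(u_i - u_j)||^2 / ||u_i - u_j||^2 - 1 | over all pairs i < j."""
    projected = [project(phi, u) for u in points]
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = squared_distance(projected[i], projected[j]) / squared_distance(points[i], points[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst
```

For example, `jl_dimension(10, 0.5, 0.1)` returns 222; in simulations the observed distortion is typically well below $\epsilon$, reflecting the slack in the union bound.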
the maximum likelihood decoder. We now investigate how to choose a collection {x1 , . . . , xm }
of such codewords and give finite sample bounds on its probability of error. In fact, by using
concentration inequalities, we can show that a randomly drawn codebook of fairly small dimension
is likely to enjoy good performance.
Intuitively, if our codebook $\{x_1, \ldots, x_m\} \subset \{0, 1\}^d$ is well-separated, meaning that each pair of
words $x_i, x_k$ satisfies $\|x_i - x_k\|_1 \ge cd$ for some numerical constant $c > 0$, we should be unlikely to
make a mistake. Let us make this precise. We mistake word $i$ for word $k$ only if the received signal
$Z$ satisfies $\|Z - x_i\|_1 \ge \|Z - x_k\|_1$, and letting $J = \{j \in [d] : x_{ij} \neq x_{kj}\}$ denote the set of at least
$c \cdot d$ indices where $x_i$ and $x_k$ differ, we have
\[
\|Z - x_i\|_1 \ge \|Z - x_k\|_1 \quad \text{if and only if} \quad \sum_{j \in J} \big(|Z_j - x_{ij}| - |Z_j - x_{kj}|\big) \ge 0.
\]
If $x_i$ is the word being sent and $x_i$ and $x_k$ differ in position $j$, then $|Z_j - x_{ij}| - |Z_j - x_{kj}| \in \{-1, 1\}$,
and is equal to $-1$ with probability $(1 - \epsilon)$ and $1$ with probability $\epsilon$. That is, we have $\|Z - x_i\|_1 \ge
\|Z - x_k\|_1$ if and only if
\[
\sum_{j \in J} \big(|Z_j - x_{ij}| - |Z_j - x_{kj}|\big) + |J|(1 - 2\epsilon) \ge |J|(1 - 2\epsilon) \ge cd(1 - 2\epsilon),
\]
and the expectation $\mathbb{E}_Q[|Z_j - x_{ij}| - |Z_j - x_{kj}| \mid x_i] = -(1 - 2\epsilon)$ when $x_{ij} \neq x_{kj}$. Using the Hoeffding
bound, then, we have
where we have used that there are at least $|J| \ge cd$ indices differing between $x_i$ and $x_k$. The
probability of making a mistake at all is thus at most $m \exp(-\frac{1}{2} cd(1 - 2\epsilon)^2)$ if our codebook has
separation $c \cdot d$.
For low error decoding to occur with extremely high probability, it is thus sufficient to choose
a set of code words {x1 , . . . , xm } that is well separated. To that end, we state a simple lemma.
Definition 4.3. Let $M_1, M_2, \ldots$ be an $\mathbb{R}$-valued sequence of random variables. They are a martingale
if there exist another sequence of random variables $\{Z_1, Z_2, \ldots\} \subset \mathcal{Z}$ and sequence of functions
$f_n : \mathcal{Z}^n \to \mathbb{R}$ such that
\[
\mathbb{E}[M_n \mid Z_1^{n-1}] = M_{n-1} \quad \text{and} \quad M_n = f_n(Z_1^n)
\]
for all $n \in \mathbb{N}$. We say that the sequence $M_n$ is adapted to $\{Z_n\}$.
for all n ∈ N.
There are numerous examples of martingale sequences. The classical one is the symmetric
random walk.
Example 4.2.1: Let $D_n \in \{\pm 1\}$ be uniform and independent. Then $D_n$ form a martingale
difference sequence adapted to themselves (that is, we may take $Z_n = D_n$), and $M_n = \sum_{i=1}^n D_i$
is a martingale. $\Diamond$
A more sophisticated example, to which we will frequently return and that suggests the potential
usefulness of martingale constructions, is the Doob martingale associated with a function f .
by the tower property of expectations. Thus, the Di satisfy Definition 4.4 of a martingale
difference sequence, and moreover, we have
\[
\sum_{i=1}^n D_i = f(X_1^n) - \mathbb{E}[f(X_1^n)],
\]
and so the Doob martingale captures exactly the difference between $f$ and its expectation. $\Diamond$
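because $D_1, \ldots, D_{n-1}$ are functions of $Z_1^{n-1}$. Then we use Definition 4.5, which implies that
$\mathbb{E}[e^{\lambda D_n} \mid Z_1^{n-1}] \le e^{\lambda^2\sigma_n^2/2}$, and we obtain
\[
\mathbb{E}[\exp(\lambda M_n)] \le \mathbb{E}\Big[\prod_{i=1}^{n-1} e^{\lambda D_i}\Big] \exp\Big(\frac{\lambda^2\sigma_n^2}{2}\Big).
\]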
as desired.
The second claims are simply applications of Chernoff bounds via Proposition 4.1.8 and that
E[Mn ] = 0.
Corollary 4.2.4. Let $D_i$ be a bounded difference martingale difference sequence, meaning that
$|D_i| \le c$. Then $M_n = \sum_{i=1}^n D_i$ satisfies
\[
\mathbb{P}(n^{-1/2} M_n \ge t) \vee \mathbb{P}(n^{-1/2} M_n \le -t) \le \exp\Big(-\frac{t^2}{2c^2}\Big) \quad \text{for } t \ge 0.
\]
Thus, bounded random walks are (with high probability) within $\pm\sqrt{n}$ of their expectations after
$n$ steps.
There exist extensions of these inequalities to the cases where we control the variance of the
martingales; see Freedman [87].
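For the symmetric random walk of Example 4.2.1 (so $c = 1$), the tail probability in Corollary 4.2.4 can be computed exactly from the binomial distribution and compared with the bound. The sketch below does this; the function names are hypothetical and the exact computation is specific to $\pm 1$ steps.

```python
import math

def walk_upper_tail(n: int, t: float) -> float:
    """Exact P(M_n / sqrt(n) >= t) for M_n a sum of n independent uniform +/-1 steps."""
    # M_n = 2H - n with H ~ Binomial(n, 1/2), so M_n >= t sqrt(n) iff H >= (n + t sqrt(n)) / 2.
    h_min = max(0, math.ceil((n + t * math.sqrt(n)) / 2.0 - 1e-12))
    return sum(math.comb(n, h) for h in range(h_min, n + 1)) / 2 ** n

def azuma_hoeffding_bound(t: float, c: float = 1.0) -> float:
    """The bound exp(-t^2 / (2 c^2)) from Corollary 4.2.4."""
    return math.exp(-t * t / (2.0 * c * c))
```

Comparing the two for moderate $n$ shows that the bound has the right Gaussian-type shape in $t$, although it is not tight in its constants.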
The classical inequality relating bounded differences and concentration is McDiarmid’s inequal-
ity, or the bounded differences inequality.
Proposition 4.2.5 (Bounded differences inequality). Let $f : \mathcal{X}^n \to \mathbb{R}$ satisfy bounded differences
with constants $c_i$, and let $X_i$ be independent random variables. Then $f(X_1^n) - \mathbb{E}[f(X_1^n)]$ is
$\frac{1}{4}\sum_{i=1}^n c_i^2$-sub-Gaussian, and
\[
\mathbb{P}\big(f(X_1^n) - \mathbb{E}[f(X_1^n)] \ge t\big) \vee \mathbb{P}\big(f(X_1^n) - \mathbb{E}[f(X_1^n)] \le -t\big) \le \exp\Big(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\Big).
\]
Proof The basic idea is to show that the Doob martingale (Example 4.2.2) associated with $f$ is
$c_i^2/4$-sub-Gaussian, and then to simply apply the Azuma-Hoeffding inequality. To that end, define
$D_i = \mathbb{E}[f(X_1^n) \mid X_1^i] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1}]$ as before, and note that $\sum_{i=1}^n D_i = f(X_1^n) - \mathbb{E}[f(X_1^n)]$. The
random variables
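\begin{align*}
U_i - L_i & \le \sup_{x_1^{i-1}} \sup_{x, x'} \Big\{ \mathbb{E}[f(X_1^n) \mid X_1^{i-1} = x_1^{i-1}, X_i = x] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1} = x_1^{i-1}, X_i = x'] \Big\} \\
& = \sup_{x_1^{i-1}} \sup_{x, x'} \int \big[ f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n) \big] dP(x_{i+1}^n) \le c_i,
\end{align*}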
where we have used the independence of the Xi and Definition 4.6 of bounded differences. Conse-
quently, we have by Hoeffding’s Lemma (Example 4.1.6) that E[eλDi | X1i−1 ] ≤ exp(λ2 c2i /8), that
is, the Doob martingale is c2i /4-sub-Gaussian.
A number of quantities satisfy the conditions of Proposition 4.2.5, and we give two examples
here; we revisit them in more detail later.
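Example 4.2.6 (Bounded random vectors): Let $B$ be a Banach space (a complete normed
vector space) with norm $\|\cdot\|$. Let $X_i$ be independent bounded random vectors in $B$ satisfying
$\mathbb{E}[X_i] = 0$ and $\|X_i\| \le c$. We claim that the quantity
\[
f(X_1^n) := \Big\| \frac{1}{n} \sum_{i=1}^n X_i \Big\|
\]
satisfies bounded differences: for any $x, x'$ with $\|x\| \le c$ and $\|x'\| \le c$, the triangle inequality gives
\[
|f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n)| \le \frac{1}{n} \|x - x'\| \le \frac{2c}{n}.
\]
Consequently, if $X_i$ are independent, we have
\[
\mathbb{P}\Big( \Big| \Big\| \frac{1}{n} \sum_{i=1}^n X_i \Big\| - \mathbb{E}\Big[ \Big\| \frac{1}{n} \sum_{i=1}^n X_i \Big\| \Big] \Big| \ge t \Big) \le 2\exp\Big(-\frac{nt^2}{2c^2}\Big) \tag{4.2.1}
\]
for all $t \ge 0$. That is, the norm of (bounded) random vectors in an essentially arbitrary vector
space concentrates extremely quickly about its expectation.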
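The remaining challenge is to control the expectation term in the concentration bound (4.2.1),
which can be difficult. In certain cases (for example, when we have a Euclidean
structure on the vectors $X_i$) it can be easier. Indeed, let us specialize to the case that $X_i \in \mathcal{H}$,
a (real) Hilbert space, so that there is an inner product $\langle \cdot, \cdot \rangle$ and the norm satisfies $\|x\|^2 = \langle x, x \rangle$
for $x \in \mathcal{H}$. Then Cauchy-Schwarz implies that
\[
\mathbb{E}\Big[ \Big\| \sum_{i=1}^n X_i \Big\| \Big]^2 \le \mathbb{E}\Big[ \Big\| \sum_{i=1}^n X_i \Big\|^2 \Big] = \sum_{i,j} \mathbb{E}[\langle X_i, X_j \rangle] = \sum_{i=1}^n \mathbb{E}[\|X_i\|^2].
\]
That is, assuming the $X_i$ are independent and $\mathbb{E}[\|X_i\|^2] \le \sigma^2$, inequality (4.2.1) becomes
\[
\mathbb{P}\Big( \|\overline{X}_n\| \ge \frac{\sigma}{\sqrt{n}} + t \Big) + \mathbb{P}\Big( \|\overline{X}_n\| \le -\frac{\sigma}{\sqrt{n}} - t \Big) \le 2\exp\Big(-\frac{nt^2}{2c^2}\Big)
\]
where $\overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. $\Diamond$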
We can specialize Example 4.2.6 to a situation that is very important for treatments of concen-
tration, sums of random vectors, and generalization bounds in machine learning.
Example 4.2.7 (Rademacher complexities): This example is actually a special case of
Example 4.2.6, but its frequent uses justify a more specialized treatment and consideration. Let
$\mathcal{X}$ be some space, and let $\mathcal{F}$ be some collection of functions $f : \mathcal{X} \to \mathbb{R}$. Let $\varepsilon_i \in \{-1, 1\}$ be a
collection of independent random sign vectors. Then the empirical Rademacher complexity of
$\mathcal{F}$ is
\[
R_n(\mathcal{F} \mid x_1^n) := \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \Big],
\]
Consequently, the centered empirical Rademacher complexity $R_n(\mathcal{F} \mid X_1^n) - R_n(\mathcal{F})$ is
$\frac{(b_1 - b_0)^2}{4n}$-sub-Gaussian by Theorem 4.2.3. $\Diamond$
These examples warrant more discussion, and it is possible to argue that many variants of these
random variables are well-concentrated. For example, instead of functions we may simply consider
an arbitrary set A ⊂ Rn and define the random variable
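\[
Z(A) := \sup_{a \in A} \langle a, \varepsilon \rangle = \sup_{a \in A} \sum_{i=1}^n a_i \varepsilon_i.
\]
As a function of the random signs $\varepsilon_i$, we may write $Z(A) = f(\varepsilon)$, and this is then a function
satisfying $|f(\varepsilon) - f(\varepsilon')| \le \sup_{a \in A} |\langle a, \varepsilon - \varepsilon' \rangle|$, so that if $\varepsilon$ and $\varepsilon'$ differ in index $i$, we have
$|f(\varepsilon) - f(\varepsilon')| \le 2 \sup_{a \in A} |a_i|$. That is, $Z(A) - \mathbb{E}[Z(A)]$ is $\sum_{i=1}^n \sup_{a \in A} |a_i|^2$-sub-Gaussian.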
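For a small finite set $A$, the bounded differences property of $Z(A)$ can be verified by brute force over all sign vectors. The sketch below computes the constants $c_i = 2\sup_{a \in A} |a_i|$ and the largest change in $Z(A)$ caused by flipping a single sign; the function names are illustrative, and the enumeration is only feasible for small $n$.

```python
import itertools

def z_value(A, eps) -> float:
    """Z(A) = max over a in A of <a, eps>, for a finite set A of vectors."""
    return max(sum(a_i * e_i for a_i, e_i in zip(a, eps)) for a in A)

def bounded_difference_constants(A):
    """c_i = 2 * max over a in A of |a_i|."""
    n = len(A[0])
    return [2.0 * max(abs(a[i]) for a in A) for i in range(n)]

def worst_single_flip_change(A, i: int) -> float:
    """Largest |Z(A)(eps) - Z(A)(eps')| over sign vectors eps, eps' differing only in index i."""
    n = len(A[0])
    worst = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        flipped = list(eps)
        flipped[i] = -flipped[i]
        worst = max(worst, abs(z_value(A, eps) - z_value(A, flipped)))
    return worst
```

The resulting variance proxy $\frac{1}{4}\sum_i c_i^2$ equals $\sum_i \sup_{a \in A} |a_i|^2$, matching the sub-Gaussian parameter stated above.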
as elements of this vector space $L$. (Here we have used $1_{X_i}$ to denote the point mass at $X_i$.)
Then the Rademacher complexity is nothing more than the expected norm of $P_n^0$, a random
vector, as in Example 4.2.6. This view is somewhat sophisticated, but it shows that any general
results we may prove about random vectors, as in Example 4.2.6, will carry over immediately
to versions of the Rademacher complexity. $\Diamond$
denote the empirical distribution on $\{X_i\}_{i=1}^n$, where $1_{X_i}$ denotes the point mass at $X_i$. Then for
functions $f : \mathcal{X} \to \mathbb{R}$ (or more generally, any function $f$ defined on $\mathcal{X}$), we let
\[
P_n f := \mathbb{E}_{P_n}[f(X)] = \frac{1}{n} \sum_{i=1}^n f(X_i)
\]
denote the empirical expectation of $f$ evaluated on the sample, and we also let
\[
P f := \mathbb{E}_P[f(X)] = \int f(x) dP(x)
\]
denote general expectations under a measure P . With this notation, we study uniform laws of
large numbers, which consist of proving results of the form
where convergence is in probability, expectation, almost surely, or with rates of convergence. When
we view $P_n$ and $P$ as (infinite-dimensional) vectors on the space of maps from $\mathcal{F} \to \mathbb{R}$, then we
may define the (semi)norm $\|\cdot\|_{\mathcal{F}}$ for any $L : \mathcal{F} \to \mathbb{R}$ by
\[
\|P_n - P\|_{\mathcal{F}} \to 0.
\]
Thus, roughly, we are simply asking questions about when random vectors converge to their
expectations.$^1$
The starting point of this investigation considers bounded random functions, that is, $\mathcal{F}$ consists
of functions $f : \mathcal{X} \to [a, b]$ for some $-\infty < a \le b < \infty$. In this case, the bounded differences
inequality (Proposition 4.2.5) immediately implies that expectations of $\|P_n - P\|_{\mathcal{F}}$ provide strong
guarantees on concentration of $\|P_n - P\|_{\mathcal{F}}$.
$^1$ Some readers may worry about measurability issues here. All of our applications will be in separable spaces,
so that we may take suprema with abandon without worrying about measurability, and consequently we ignore this
from now on.
by the triangle inequality. An entirely parallel argument gives the converse lower bound of $-\frac{b - a}{n}$,
and thus Proposition 4.2.5 gives the result.
Proposition 4.3.1 shows that, to provide control over high-probability concentration of $\|P_n - P\|_{\mathcal{F}}$,
it is (at least in cases where $\mathcal{F}$ is bounded) sufficient to control the expectation $\mathbb{E}[\|P_n - P\|_{\mathcal{F}}]$. We
take this approach through the remainder of this section, developing tools to simplify bounding
this quantity.
Our starting points consist of a few inequalities relating expectations to symmetrized quantities,
which are frequently easier to control than their non-symmetrized counterparts. This symmetrization
technique is widely used in probability theory, theoretical statistics, and machine learning. The key
is that for centered random variables, symmetrized quantities have, to within numerical constants,
similar expectations to their non-symmetrized counterparts. Thus, in many cases, it is equivalent
to analyze the symmetrized quantity and the initial quantity.
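Proposition 4.3.2. Let $X_i$ be independent random vectors on a (Banach) space with norm $\|\cdot\|$
and let $\varepsilon_i \in \{-1, 1\}$ be independent random signs. Then for any $p \ge 1$,
\[
2^{-p} \mathbb{E}\Big[ \Big\| \sum_{i=1}^n \varepsilon_i (X_i - \mathbb{E}[X_i]) \Big\|^p \Big]
\le \mathbb{E}\Big[ \Big\| \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \Big\|^p \Big]
\le 2^p \mathbb{E}\Big[ \Big\| \sum_{i=1}^n \varepsilon_i X_i \Big\|^p \Big].
\]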
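In the proof of the upper bound, we could also show the bound
\[
\mathbb{E}\Big[ \Big\| \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \Big\|^p \Big] \le 2^p \mathbb{E}\Big[ \Big\| \sum_{i=1}^n \varepsilon_i (X_i - \mathbb{E}[X_i]) \Big\|^p \Big],
\]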
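Now, note that the distribution of $X_i - X_i'$ is symmetric, so that $X_i - X_i' \stackrel{\text{dist}}{=} \varepsilon_i (X_i - X_i')$, and thus
\[
\mathbb{E}\Big[ \Big\| \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \Big\|^p \Big] \le \mathbb{E}\Big[ \Big\| \sum_{i=1}^n \varepsilon_i (X_i - X_i') \Big\|^p \Big].
\]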
as desired.
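For the left bound in the proposition, let $Y_i = X_i - \mathbb{E}[X_i]$ be the centered version of the random
variables. We break the sum over random variables into two parts, conditional on whether $\varepsilon_i = \pm 1$,
using repeated conditioning. We have
\begin{align*}
\mathbb{E}\Big[ \Big\| \sum_{i=1}^n \varepsilon_i Y_i \Big\|^p \Big]
& = \mathbb{E}\Big[ \Big\| \sum_{i : \varepsilon_i = 1} Y_i - \sum_{i : \varepsilon_i = -1} Y_i \Big\|^p \Big] \\
& \le \mathbb{E}\Big[ 2^{p-1} \mathbb{E}\Big[ \Big\| \sum_{i : \varepsilon_i = 1} Y_i \Big\|^p \mid \varepsilon \Big] + 2^{p-1} \mathbb{E}\Big[ \Big\| \sum_{i : \varepsilon_i = -1} Y_i \Big\|^p \mid \varepsilon \Big] \Big] \\
& = 2^{p-1} \mathbb{E}\Big[ \mathbb{E}\Big[ \Big\| \sum_{i : \varepsilon_i = 1} Y_i + \sum_{i : \varepsilon_i = -1} \mathbb{E}[Y_i] \Big\|^p \mid \varepsilon \Big] + \mathbb{E}\Big[ \Big\| \sum_{i : \varepsilon_i = -1} Y_i + \sum_{i : \varepsilon_i = 1} \mathbb{E}[Y_i] \Big\|^p \mid \varepsilon \Big] \Big] \\
& \le 2^{p-1} \mathbb{E}\Big[ \mathbb{E}\Big[ \Big\| \sum_{i : \varepsilon_i = 1} Y_i + \sum_{i : \varepsilon_i = -1} Y_i \Big\|^p \mid \varepsilon \Big] + \mathbb{E}\Big[ \Big\| \sum_{i : \varepsilon_i = -1} Y_i + \sum_{i : \varepsilon_i = 1} Y_i \Big\|^p \mid \varepsilon \Big] \Big] \\
& = 2^p \mathbb{E}\Big[ \Big\| \sum_{i=1}^n Y_i \Big\|^p \Big].
\end{align*}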
The expectation of $\|P_n^0\|_{\mathcal{F}}$ is of course the Rademacher complexity (Examples 4.2.7 and 4.2.8), and
we have the following corollary.
From Corollary 4.3.3, it is evident that by controlling the expectation of the symmetrized process
$\mathbb{E}[\|P_n^0\|_{\mathcal{F}}]$ we can derive concentration inequalities and uniform laws of large numbers. For example,
we immediately obtain that
\[
\mathbb{P}\big( \|P_n - P\|_{\mathcal{F}} \ge 2\mathbb{E}[\|P_n^0\|_{\mathcal{F}}] + t \big) \le \exp\Big(-\frac{2nt^2}{(b - a)^2}\Big)
\]
A refinement of Massart's finite class bound applies when the classes are infinite but, on a
collection $X_1, \ldots, X_n$, the functions $f \in \mathcal{F}$ may take on only a (smaller) number of values. In this
case, we define the empirical shatter coefficient of a collection of points $x_1, \ldots, x_n$ by $S_{\mathcal{F}}(x_1^n) :=
\mathrm{card}\{(f(x_1), \ldots, f(x_n)) \mid f \in \mathcal{F}\}$, the number of distinct vectors of values $(f(x_1), \ldots, f(x_n))$ the
functions $f \in \mathcal{F}$ may take. The shatter coefficient is the maximum of the empirical shatter coefficients
over $x_1^n \in \mathcal{X}^n$, that is, $S_{\mathcal{F}}(n) := \sup_{x_1^n} S_{\mathcal{F}}(x_1^n)$. It is clear that $S_{\mathcal{F}}(n) \le |\mathcal{F}|$ always, but by
only counting distinct values, we have the following corollary.
Corollary 4.3.5 (A sharper variant of Massart's finite class bound). Let $\mathcal{F}$ be any collection of
functions with $f : \mathcal{X} \to \mathbb{R}$, and assume that $\sigma_n^2 := n^{-1} \mathbb{E}[\max_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)^2] < \infty$. Then
\[
R_n(\mathcal{F}) \le \frac{\sqrt{2\sigma_n^2 \log S_{\mathcal{F}}(n)}}{\sqrt{n}}.
\]
Typical classes with small shatter coefficients include Vapnik-Chervonenkis classes of functions; we
do not discuss these further here, instead referring to one of the many books in machine learning
and empirical process theory in statistics.
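The empirical shatter coefficient is simple to compute for a finite family of functions: collect the vectors of values on the sample and count the distinct ones. The sketch below does exactly that; the function name and the threshold family used in the note afterwards are illustrative assumptions, not objects defined in the text.

```python
def empirical_shatter_coefficient(functions, xs) -> int:
    """Number of distinct value vectors (f(x_1), ..., f(x_n)) over f in the given family."""
    return len({tuple(f(x) for x in xs) for f in functions})
```

As a usage example, take the threshold functions $f_\theta(x) = 1\{x \ge \theta\}$ for $\theta$ on a fine grid of $[0, 1]$. On four distinct sample points there are only five distinct value vectors, however many thresholds the grid contains, which is the kind of gap between $S_{\mathcal{F}}(x_1^n)$ and $|\mathcal{F}|$ that Corollary 4.3.5 exploits.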
The most important of the calculus rules we use are the comparison inequalities for Rademacher
sums, which allow us to consider compositions of function classes and maintain small complexity
measures. We state the rule here; the proof is complex, so we defer it to Section 4.5.3.
Theorem 4.3.6 (Ledoux-Talagrand Contraction). Let T ⊂ Rn be an arbitrary set and let φi : R →
R be 1-Lipschitz and satisfy φi (0) = 0. Then for any nondecreasing convex function Φ : R → R+ ,
n
" !#
1 X
E Φ sup φi (ti )εi ≤ E Φ supht, εi .
2 t∈T t∈T
i=1
Figures 4.1 and 4.2 give examples of (respectively) a covering and a packing of the same set.
An exercise in proof by contradiction shows that the packing and covering numbers of a set are
in fact closely related:
Lemma 4.3.8. The packing and covering numbers satisfy the following inequalities:
M (2δ, Θ, ρ) ≤ N (δ, Θ, ρ) ≤ M (δ, Θ, ρ).
We leave derivation of this lemma to Exercise 4.11, noting that it shows that (up to constant factors)
packing and covering numbers have the same scaling in the radius $\delta$. As a simple example, we see
for any interval $[a, b]$ on the real line that in the usual absolute distance metric, $N(\delta, [a, b], |\cdot|) \asymp
(b - a)/\delta$.
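One direction of the relationship between packings and coverings is constructive: a packing to which no further point can be added is automatically a cover. The sketch below builds such a packing greedily on a finite set of points and checks the covering property. It assumes the convention that packing points are separated by strictly more than $\delta$ and that cover balls have radius $\delta$; the function names are hypothetical.

```python
def greedy_packing(points, delta, dist):
    """Greedily select points whose pairwise distances all exceed delta."""
    centers = []
    for p in points:
        if all(dist(p, c) > delta for c in centers):
            centers.append(p)
    return centers

def is_cover(centers, points, delta, dist) -> bool:
    """True if every point lies within distance delta of some center."""
    return all(any(dist(p, c) <= delta for c in centers) for p in points)
```

On a fine grid of $[0, 1]$ with $\delta = 0.1$ and the absolute-value metric, the greedy packing has on the order of $(b - a)/\delta = 10$ elements, consistent with the scaling stated above.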
As one example of the metric entropy, consider a set of functions F with reasonable covering
numbers (metric entropy) in k·k∞ -norm.
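Example 4.3.9 (The "standard" covering number guarantee): Let $\mathcal{F}$ consist of functions
$f : \mathcal{X} \to [-b, b]$ and let the metric $\rho$ be $\|f - g\|_\infty = \sup_{x \in \mathcal{X}} |f(x) - g(x)|$. Then
\[
\mathbb{P}\Big( \sup_{f \in \mathcal{F}} |P_n f - P f| \ge t \Big) \le \exp\Big(-\frac{nt^2}{18b^2} + \log N(t/3, \mathcal{F}, \|\cdot\|_\infty)\Big). \tag{4.3.2}
\]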
So as long as the covering numbers $N(t, \mathcal{F}, \|\cdot\|_\infty)$ grow sub-exponentially in $t$, so that $\log N(t) \ll
nt^2$, we have the (essentially) sub-Gaussian tail bound (4.3.2). Example 4.4.11 gives one typical
case. Indeed, fix a minimal $t/3$-cover of $\mathcal{F}$ in $\|\cdot\|_\infty$ of size $N := N(t/3, \mathcal{F}, \|\cdot\|_\infty)$, calling
the covering functions $f_1, \ldots, f_N$. Then for any $f \in \mathcal{F}$ and the function $f_i$ satisfying
$\|f - f_i\|_\infty \le t/3$, we have
\[
|P_n f - P f| \le |P_n f - P_n f_i| + |P_n f_i - P f_i| + |P f_i - P f| \le |P_n f_i - P f_i| + \frac{2t}{3}.
\]
The Azuma-Hoeffding inequality (Theorem 4.2.3) guarantees (by a union bound) that
\[
\mathbb{P}\Big( \max_{i \le N} |P_n f_i - P f_i| \ge t \Big) \le \exp\Big(-\frac{nt^2}{2b^2} + \log N\Big).
\]
Combine this bound (replacing $t$ with $t/3$) to obtain inequality (4.3.2). $\Diamond$
Given the relationships between packing, covering, and size of sets Θ, we would expect there
to be relationships between volume, packing, and covering numbers. This is indeed the case, as we
now demonstrate for arbitrary norm balls in finite dimensions.
Lemma 4.3.10. Let $B$ denote the unit $\|\cdot\|$-ball in $\mathbb{R}^d$. Then
\[
\Big(\frac{1}{\delta}\Big)^d \le N(\delta, B, \|\cdot\|) \le \Big(1 + \frac{2}{\delta}\Big)^d.
\]
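Proof We prove the lemma via a volumetric argument. For the lower bound, note that if the points $v_1, \ldots, v_N$ are a $\delta$-cover of $B$, then
\[
\mathrm{Vol}(B) \le \sum_{i=1}^N \mathrm{Vol}(\delta B + v_i) = N \, \mathrm{Vol}(\delta B) = N \, \mathrm{Vol}(B) \, \delta^d.
\]
In particular, $N \ge \delta^{-d}$. For the upper bound on $N(\delta, B, \|\cdot\|)$, let $\mathcal{V}$ be a $\delta$-packing of $B$ with maximal cardinality, so that $|\mathcal{V}| = M(\delta, B, \|\cdot\|) \ge N(\delta, B, \|\cdot\|)$ (recall Lemma 4.3.8). Notably, the collection of $\delta$-balls $\{\delta B + v_i\}_{i=1}^M$ cover the ball $B$ (as otherwise, we could put an additional element in the packing $\mathcal{V}$), and moreover, the balls $\{\frac{\delta}{2} B + v_i\}$ are all disjoint by definition of a packing. Consequently, we find that
\[
M \Big(\frac{\delta}{2}\Big)^d \mathrm{Vol}(B) = M \, \mathrm{Vol}\Big(\frac{\delta}{2} B\Big) \le \mathrm{Vol}\Big(B + \frac{\delta}{2} B\Big) = \Big(1 + \frac{\delta}{2}\Big)^d \mathrm{Vol}(B).
\]
Rewriting, we obtain
\[
M(\delta, B, \|\cdot\|) \le \Big(\frac{2}{\delta}\Big)^d \Big(1 + \frac{\delta}{2}\Big)^d \frac{\mathrm{Vol}(B)}{\mathrm{Vol}(B)} = \Big(1 + \frac{2}{\delta}\Big)^d,
\]
completing the proof.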
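To make the packing argument concrete, the following is a minimal Python sketch, under stated assumptions, that builds a $\delta$-packing of the Euclidean unit ball in $\mathbb{R}^2$ greedily from random sample points and compares its size with the bounds of Lemma 4.3.10. The helper names `greedy_packing` and `sample_unit_disk` are illustrative and not part of these notes. The greedy set is a genuine $\delta$-packing of $B$, so the upper bound $(1 + 2/\delta)^d$ must hold; it is only guaranteed to cover the sampled points rather than all of $B$, so the comparison with the lower bound $(1/\delta)^d$ is merely suggestive.

```python
import math
import random

def greedy_packing(points, delta):
    """Greedily keep points that are more than delta from every point kept so far."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > delta for c in centers):
            centers.append(p)
    return centers

def sample_unit_disk(num_points, rng):
    """Rejection-sample points from the Euclidean unit ball in R^2."""
    pts = []
    while len(pts) < num_points:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:
            pts.append((x, y))
    return pts

rng = random.Random(0)
d, delta = 2, 0.25
points = sample_unit_disk(5000, rng)
centers = greedy_packing(points, delta)

lower = (1 / delta) ** d        # lower bound on the covering number N(delta, B)
upper = (1 + 2 / delta) ** d    # upper bound on the packing number M(delta, B)
print(f"(1/delta)^d = {lower:.0f}, greedy packing size = {len(centers)}, "
      f"(1 + 2/delta)^d = {upper:.0f}")
```

By construction every sampled point lies within $\delta$ of some retained center, which mirrors the step in the proof where a maximal packing is automatically a cover.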
Let us give one application of Lemma 4.3.10 to concentration of random matrices; we explore more in the exercises as well. We can generalize the definition of sub-Gaussian random variables to sub-Gaussian random vectors, where we say that $X \in \mathbb{R}^d$ is a $\sigma^2$-sub-Gaussian vector if
\[
\mathbb{E}[\exp(\langle u, X - \mathbb{E}[X] \rangle)] \le \exp\Big(\frac{\sigma^2}{2} \|u\|_2^2\Big) \tag{4.3.3}
\]
for all $u \in \mathbb{R}^d$. For example, $X \sim \mathsf{N}(0, I_d)$ is immediately 1-sub-Gaussian, and $X \in [-b, b]^d$ with independent entries is $b^2$-sub-Gaussian. Now, suppose that $X_i$ are independent isotropic random vectors, meaning that $\mathbb{E}[X_i] = 0$, $\mathbb{E}[X_i X_i^\top] = I_d$, and that they are also $\sigma^2$-sub-Gaussian. Then by an application of Lemma 4.3.10, we can give concentration guarantees for the sample covariance $\Sigma_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ for the operator norm $\|A\|_{\mathrm{op}} := \sup\{\langle u, Av \rangle \mid \|u\|_2 = \|v\|_2 = 1\}$.
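Proposition 4.3.11. Let $X_i$ be independent isotropic and $\sigma^2$-sub-Gaussian vectors. Then there is a numerical constant $C$ such that the sample covariance $\Sigma_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ satisfies
\[
\|\Sigma_n - I_d\|_{\mathrm{op}} \le C \sigma^2 \Bigg[ \sqrt{\frac{d + \log \frac{1}{\delta}}{n}} + \frac{d + \log \frac{1}{\delta}}{n} \Bigg]
\]
with probability at least $1 - \delta$.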
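Proof The second inequality is trivial. Fix any $u \in \mathbb{B}_2^d$. Then for the $i$ such that $\|u - u_i\|_2 \le \epsilon$, we have
\[
\langle u, Au \rangle = \langle u - u_i, Au \rangle + \langle u_i, Au \rangle = \langle u - u_i, Au \rangle + \langle u_i, A(u - u_i) \rangle + \langle u_i, A u_i \rangle \le 2 \epsilon \|A\|_{\mathrm{op}} + \langle u_i, A u_i \rangle
\]
by definition of the operator norm. Taking a supremum over $u$ gives the final result.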
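Let the matrix $E_i = X_i X_i^\top - I$, and define the average error $\overline{E}_n = \frac{1}{n} \sum_{i=1}^n E_i$. Then with this lemma in hand, we see that for any $\epsilon$-cover $\mathcal{N}$ of the $\ell_2$-ball $\mathbb{B}_2^d$,
\[
(1 - 2\epsilon) \|\overline{E}_n\|_{\mathrm{op}} \le \max_{u \in \mathcal{N}} \langle u, \overline{E}_n u \rangle.
\]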
In general, however, we only have access to the risk via the empirical distribution of the $Z_i$, and we often choose $f$ by minimizing the empirical risk
\[
\widehat{L}_n(f) := \frac{1}{n} \sum_{i=1}^n \ell(f, Z_i). \tag{4.4.2}
\]
As written, this formulation is quite abstract, so we provide a few examples to make it somewhat
more concrete.
Example 4.4.1 (Binary classification problems): One standard problem (still abstract) that motivates the formulation (4.4.1) is the binary classification problem. Here the data $Z_i$ come in pairs $(X, Y)$, where $X \in \mathcal{X}$ is a vector of covariates (independent variables) and $Y \in \{-1, 1\}$ is the label of example $X$. The function class $\mathcal{F}$ consists of functions $f : \mathcal{X} \to \mathbb{R}$, and the goal is to find a function $f$ such that
\[
\mathbb{P}(\mathrm{sign}(f(X)) \neq Y)
\]
is small, that is, minimizing the risk $\mathbb{E}[\ell(f, Z)]$ where the loss is the 0-1 loss, $\ell(f, (x, y)) = 1\{f(x) y \le 0\}$. ◇
In this case, the loss function is the zero-one loss $\ell(f, (x, y)) = 1\{\max_{l \neq y} f_l(x) \ge f_y(x)\}$. ◇
Example 4.4.3 (Binary classification with linear functions): In the standard statistical learning setting, the data $x$ belong to $\mathbb{R}^d$, and we assume that our function class $\mathcal{F}$ is indexed by a set $\Theta \subset \mathbb{R}^d$, so that $\mathcal{F} = \{f_\theta : f_\theta(x) = \theta^\top x, \theta \in \Theta\}$. In this case, we may use the zero-one loss, the convex hinge loss, or the (convex) logistic loss, which are variously $\ell_{\mathrm{zo}}(f_\theta, (x, y)) := 1\{y \theta^\top x \le 0\}$, and the convex losses
\[
\ell_{\mathrm{hinge}}(f_\theta, (x, y)) = \big[1 - y x^\top \theta\big]_+ \quad \text{and} \quad \ell_{\mathrm{logit}}(f_\theta, (x, y)) = \log(1 + \exp(-y x^\top \theta)).
\]
The hinge and logistic losses, as they are convex, are substantially computationally easier to work with, and they are common choices in applications. ◇
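As a small illustration of the three losses in Example 4.4.3, here is a hedged Python sketch for linear predictors $f_\theta(x) = \theta^\top x$. The function names (`margin`, `zero_one_loss`, `hinge_loss`, `logistic_loss`) are illustrative choices rather than notation from these notes, and vectors are represented as plain tuples of floats.

```python
import math

def margin(theta, x, y):
    """Signed margin y * <theta, x> of a linear predictor on the example (x, y)."""
    return y * sum(t * xi for t, xi in zip(theta, x))

def zero_one_loss(theta, x, y):
    """1 if the margin is non-positive (a mistake or a tie), else 0."""
    return 1.0 if margin(theta, x, y) <= 0 else 0.0

def hinge_loss(theta, x, y):
    """[1 - y <theta, x>]_+ ."""
    return max(0.0, 1.0 - margin(theta, x, y))

def logistic_loss(theta, x, y):
    """log(1 + exp(-m)), rewritten to avoid overflow when the margin m is very negative."""
    m = margin(theta, x, y)
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))

theta, x, y = (1.0, -2.0), (0.5, 0.25), 1     # this example has margin exactly 0
print(zero_one_loss(theta, x, y), hinge_loss(theta, x, y), logistic_loss(theta, x, y))
```

Both convex losses dominate a multiple of the zero-one loss (the hinge loss directly, the logistic loss after dividing by $\log 2$), which is one reason they serve as computational surrogates.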
The main motivating question that we ask is the following: given a sample $Z_1, \ldots, Z_n$, if we choose some $\widehat{f}_n \in \mathcal{F}$ based on this sample, can we guarantee that it generalizes to unseen data? In particular, can we guarantee that (with high probability) we have the empirical risk bound
\[
\widehat{L}_n(\widehat{f}_n) = \frac{1}{n} \sum_{i=1}^n \ell(\widehat{f}_n, Z_i) \le R(\widehat{f}_n) + \epsilon \tag{4.4.3}
\]
for some small $\epsilon$? If we allow $\widehat{f}_n$ to be arbitrary, then this becomes clearly impossible: consider the classification example 4.4.1, and set $\widehat{f}_n$ to be the “hash” function that sets $\widehat{f}_n(x) = y$ if the pair $(x, y)$ was in the sample, and otherwise $\widehat{f}_n(x) = -1$. Then clearly $\widehat{L}_n(\widehat{f}_n) = 0$, while there is no useful bound on $R(\widehat{f}_n)$.
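The “hash” function above is easy to simulate. The following Python sketch is an illustration under assumptions of our own choosing: covariates are uniform on $[0, 1)$ and labels are independent fair coin flips, so no classifier can beat risk $1/2$. The names `draw_sample`, `zero_one_risk`, and `hash_classifier` are hypothetical.

```python
import random

rng = random.Random(0)

def draw_sample(n):
    """Covariates uniform on [0, 1); labels are independent fair coin flips in {-1, +1}."""
    return [(rng.random(), rng.choice((-1, 1))) for _ in range(n)]

def zero_one_risk(predict, data):
    """Fraction of examples (x, y) in data with predict(x) != y."""
    return sum(predict(x) != y for x, y in data) / len(data)

train = draw_sample(200)
memory = dict(train)                  # the "hash" table: x -> y for every training pair

def hash_classifier(x):
    return memory.get(x, -1)          # memorized label if x was seen, otherwise -1

test = draw_sample(5000)              # fresh data from the same distribution
train_risk = zero_one_risk(hash_classifier, train)
test_risk = zero_one_risk(hash_classifier, test)
print(f"empirical risk = {train_risk:.3f}, risk on fresh data = {test_risk:.3f}")
```

The memorizing rule fits the sample perfectly while its error on fresh data stays near one half, which is why a uniform guarantee over a restricted class is needed.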
for all $f \in \mathcal{F}$. (Recall that the risk functional $L(f) = \mathbb{E}_P[\ell(f, Z)]$.) For example, if the loss is the zero-one loss from classification problems, inequality (4.4.4) is satisfied with $\sigma^2 = \frac{1}{4}$ by Hoeffding's lemma. In order to guarantee a bound of the form (4.4.4) for a function $\widehat{f}$ chosen dependent on the data, in this section we give uniform bounds, that is, we would like to bound
\[
\mathbb{P}\Big(\text{there exists } f \in \mathcal{F} \text{ s.t. } L(f) > \widehat{L}_n(f) + t\Big) \quad \text{or} \quad \mathbb{P}\Big(\sup_{f \in \mathcal{F}} \big|\widehat{L}_n(f) - R(f)\big| > t\Big).
\]
Such uniform bounds are certainly sufficient to guarantee that the empirical risk is a good proxy for the true risk $L$, even when $\widehat{f}_n$ is chosen based on the data.
Now, recalling that our set of functions or predictors $\mathcal{F}$ is finite or countable, let us suppose that for each $f \in \mathcal{F}$, we have a complexity measure $c(f)$ (a penalty) such that
\[
\sum_{f \in \mathcal{F}} e^{-c(f)} \le 1. \tag{4.4.5}
\]
This inequality should look similar to the Kraft inequality from coding theory, which we will see in the coming chapters. As soon as we have such a penalty function, however, we have the following result.
Theorem 4.4.4. Let the loss $\ell$, distribution $P$ on $\mathcal{Z}$, and function class $\mathcal{F}$ be such that $\ell(f, Z)$ is $\sigma^2$-sub-Gaussian for each $f \in \mathcal{F}$, and assume that the complexity inequality (4.4.5) holds. Then with probability at least $1 - \delta$ over the sample $Z_{1:n}$,
\[
L(f) \le \widehat{L}_n(f) + \sqrt{2 \sigma^2 \frac{\log \frac{1}{\delta} + c(f)}{n}} \quad \text{for all } f \in \mathcal{F}.
\]
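Proof First, we note that by the usual sub-Gaussian concentration inequality (Corollary 4.1.10) we have for any $t \ge 0$ and any $f \in \mathcal{F}$ that
\[
\mathbb{P}\Big(L(f) \ge \widehat{L}_n(f) + t\Big) \le \exp\Big(-\frac{n t^2}{2 \sigma^2}\Big).
\]
Now, if we replace $t$ by $\sqrt{t^2 + 2 \sigma^2 c(f)/n}$, we obtain
\[
\mathbb{P}\Big(L(f) \ge \widehat{L}_n(f) + \sqrt{t^2 + 2 \sigma^2 c(f)/n}\Big) \le \exp\Big(-\frac{n t^2}{2 \sigma^2} - c(f)\Big).
\]
Then using a union bound, we have
\begin{align*}
\mathbb{P}\Big(\exists\, f \in \mathcal{F} \text{ s.t. } L(f) \ge \widehat{L}_n(f) + \sqrt{t^2 + 2 \sigma^2 c(f)/n}\Big)
& \le \sum_{f \in \mathcal{F}} \exp\Big(-\frac{n t^2}{2 \sigma^2} - c(f)\Big) \\
& = \exp\Big(-\frac{n t^2}{2 \sigma^2}\Big) \underbrace{\sum_{f \in \mathcal{F}} \exp(-c(f))}_{\le 1}.
\end{align*}
Setting $t^2 = 2 \sigma^2 \log \frac{1}{\delta} / n$ makes the right-hand side at most $\delta$, which gives the result.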
As one classical example of this setting, suppose that we have a finite class of functions $\mathcal{F}$. Then we can set $c(f) = \log |\mathcal{F}|$, in which case we clearly have the summation guarantee (4.4.5), and we obtain
\[
L(f) \le \widehat{L}_n(f) + \sqrt{2 \sigma^2 \frac{\log \frac{1}{\delta} + \log |\mathcal{F}|}{n}} \quad \text{uniformly for } f \in \mathcal{F}
\]
with probability at least $1 - \delta$. To make this even more concrete, consider the following example.
Example 4.4.5 (Floating point classifiers): We implement a linear binary classifier using double-precision floating point values, that is, we have $f_\theta(x) = \theta^\top x$ for all $\theta \in \mathbb{R}^d$ that may be represented using $d$ double-precision floating point numbers. Then for each coordinate of $\theta$, there are at most $2^{64}$ representable numbers; in total, we must thus have $|\mathcal{F}| \le 2^{64 d}$. Thus, for the zero-one loss $\ell_{\mathrm{zo}}(f_\theta, (x, y)) = 1\{\theta^\top x y \le 0\}$, we have
\[
L(f_\theta) \le \widehat{L}_n(f_\theta) + \sqrt{\frac{\log \frac{1}{\delta} + 45 d}{2n}}
\]
for all representable classifiers simultaneously, with probability at least $1 - \delta$, as the zero-one loss is $1/4$-sub-Gaussian. (Here we have used that $64 \log 2 < 45$.) ◇
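To get a feel for the magnitudes in Example 4.4.5, the sketch below evaluates the deviation term of Theorem 4.4.4 for a finite class and the rounded floating-point version with $45d$. It is a plain arithmetic illustration; the function names `finite_class_bound` and `float_classifier_bound` and the choices of $d$, $\delta$, and $n$ are assumptions made for the example only.

```python
import math

def finite_class_bound(n, delta, log_card, sigma2=0.25):
    """Deviation term sqrt(2 * sigma2 * (log(1/delta) + log|F|) / n) for a finite class."""
    return math.sqrt(2.0 * sigma2 * (math.log(1.0 / delta) + log_card) / n)

def float_classifier_bound(n, d, delta):
    """Same term with |F| <= 2^(64 d), using 64 * log(2) < 45 as in Example 4.4.5."""
    return math.sqrt((math.log(1.0 / delta) + 45.0 * d) / (2.0 * n))

d, delta = 100, 0.01
for n in (10**4, 10**5, 10**6):
    exact = finite_class_bound(n, delta, 64 * d * math.log(2))
    rounded = float_classifier_bound(n, d, delta)
    print(f"n = {n:>7}: with 64 d log 2: {exact:.4f}, with 45 d: {rounded:.4f}")
```

The two columns nearly coincide because $64 \log 2 \approx 44.4$, and both shrink at the rate $\sqrt{d/n}$.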
We also note in passing that by replacing $\delta$ with $\delta/2$ in the bounds of Theorem 4.4.4, a union bound yields the following two-sided corollary.
Corollary 4.4.6. Under the conditions of Theorem 4.4.4, we have
\[
\big|\widehat{L}_n(f) - L(f)\big| \le \sqrt{2 \sigma^2 \frac{\log \frac{2}{\delta} + c(f)}{n}} \quad \text{for all } f \in \mathcal{F}
\]
with probability at least $1 - \delta$.
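Example 4.4.7 (Rademacher complexity of the $\ell_2$-ball): Let $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le r\}$, and consider the class of linear functionals $\mathcal{F} := \{f_\theta(x) = \theta^\top x, \theta \in \Theta\}$. Then
\[
R_n(\mathcal{F} \mid x_1^n) \le \frac{r}{n} \sqrt{\sum_{i=1}^n \|x_i\|_2^2},
\]
because we have
\[
R_n(\mathcal{F} \mid x_1^n) = \frac{r}{n} \mathbb{E}\Bigg[\Big\|\sum_{i=1}^n \varepsilon_i x_i\Big\|_2\Bigg] \le \frac{r}{n} \sqrt{\mathbb{E}\Bigg[\Big\|\sum_{i=1}^n \varepsilon_i x_i\Big\|_2^2\Bigg]} = \frac{r}{n} \sqrt{\sum_{i=1}^n \|x_i\|_2^2},
\]
as desired. ◇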
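Example 4.4.8 (Rademacher complexity of the $\ell_1$-ball): In contrast to the previous example, suppose that $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_1 \le r\}$, and consider the linear class $\mathcal{F} := \{f_\theta(x) = \theta^\top x, \theta \in \Theta\}$. Then
\[
R_n(\mathcal{F} \mid x_1^n) = \frac{r}{n} \mathbb{E}\Bigg[\Big\|\sum_{i=1}^n \varepsilon_i x_i\Big\|_\infty\Bigg].
\]
Now, each coordinate $j$ of $\sum_{i=1}^n \varepsilon_i x_i$ is $\sum_{i=1}^n x_{ij}^2$-sub-Gaussian, and thus using that $\mathbb{E}[\max_{j \le d} Z_j] \le \sqrt{2 \sigma^2 \log d}$ for arbitrary $\sigma^2$-sub-Gaussian $Z_j$ (see Exercise 4.7), we have
\[
R_n(\mathcal{F} \mid x_1^n) \le \frac{r}{n} \sqrt{2 \log(2d) \max_j \sum_{i=1}^n x_{ij}^2}.
\]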
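For a small sample the expectations in Examples 4.4.7 and 4.4.8 can be computed exactly by enumerating all $2^n$ sign vectors, which lets us check both bounds numerically. The Python sketch below does this under illustrative assumptions (random data with entries in $[-1, 1]$, $n = 10$, $d = 5$, $r = 2$); the helper `exact_rademacher_norm` is a hypothetical name, not notation from these notes.

```python
import itertools
import math
import random

def exact_rademacher_norm(xs, norm):
    """E[norm(sum_i eps_i x_i)], computed exactly by enumerating all 2^n sign vectors."""
    n, d = len(xs), len(xs[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        s = [sum(e * x[j] for e, x in zip(signs, xs)) for j in range(d)]
        total += norm(s)
    return total / 2 ** n

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def linf_norm(v):
    return max(abs(c) for c in v)

rng = random.Random(0)
n, d, r = 10, 5, 2.0
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

# l2-ball: the supremum over theta is attained through the dual l2 norm (Example 4.4.7).
rad_l2 = (r / n) * exact_rademacher_norm(xs, l2_norm)
bound_l2 = (r / n) * math.sqrt(sum(l2_norm(x) ** 2 for x in xs))

# l1-ball: the dual norm is the l-infinity norm (Example 4.4.8).
rad_l1 = (r / n) * exact_rademacher_norm(xs, linf_norm)
max_col = max(sum(x[j] ** 2 for x in xs) for j in range(d))
bound_l1 = (r / n) * math.sqrt(2 * math.log(2 * d) * max_col)

print(f"l2-ball: exact {rad_l2:.4f} <= bound {bound_l2:.4f}")
print(f"l1-ball: exact {rad_l1:.4f} <= bound {bound_l1:.4f}")
```

Because $\|v\|_\infty \le \|v\|_2$, the $\ell_1$-ball complexity is never larger than the $\ell_2$-ball complexity at the same radius, which the exact values also reflect.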
These examples are sufficient to derive a few sophisticated risk bounds. We focus on the case
where we have a loss function applied to some class with reasonable Rademacher complexity, in
which case it is possible to recenter the loss class and achieve reasonable complexity bounds. The coming proposition does precisely this in the case of margin-based binary classification. Consider points $(x, y) \in \mathcal{X} \times \{\pm 1\}$, and let $\mathcal{F}$ be an arbitrary class of functions $f : \mathcal{X} \to \mathbb{R}$ and $\mathcal{L} = \{(x, y) \mapsto \ell(y f(x))\}_{f \in \mathcal{F}}$ be the induced collection of losses. As a typical example, we might have $\ell(t) = [1 - t]_+$, $\ell(t) = e^{-t}$, or $\ell(t) = \log(1 + e^{-t})$. We have the following proposition.
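Proposition 4.4.9. Let $\mathcal{F}$ and $\mathcal{X}$ be such that $\sup_{x \in \mathcal{X}} |f(x)| \le M$ for $f \in \mathcal{F}$ and assume that $\ell$ is $L$-Lipschitz. Define the empirical and population risks $\widehat{L}_n(f) := P_n \ell(Y f(X))$ and $L(f) := P \ell(Y f(X))$. Then
\[
\mathbb{P}\Big(\sup_{f \in \mathcal{F}} |\widehat{L}_n(f) - L(f)| \ge 4 L R_n(\mathcal{F}) + t\Big) \le 2 \exp\Big(-\frac{n t^2}{2 L^2 M^2}\Big) \quad \text{for } t \ge 0.
\]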
Proof We may recenter the class $\mathcal{L}$, that is, replace $\ell(\cdot)$ with $\ell(\cdot) - \ell(0)$, without changing $\widehat{L}_n(f) - L(f)$. Call this class $\mathcal{L}_0$, so that $\|P_n - P\|_{\mathcal{L}} = \|P_n - P\|_{\mathcal{L}_0}$. This recentered class satisfies bounded differences with constant $2ML$, as $|\ell(y f(x)) - \ell(y' f(x'))| \le L |y f(x) - y' f(x')| \le 2LM$, as in the proof of Proposition 4.3.1. Applying Proposition 4.3.1 and then Corollary 4.3.3 gives that $\mathbb{P}(\sup_{f \in \mathcal{F}} |\widehat{L}_n(f) - L(f)| \ge 2 R_n(\mathcal{L}_0) + t) \le \exp(-\frac{n t^2}{2 M^2 L^2})$ for $t \ge 0$. Then applying the contraction inequality (Theorem 4.3.6) yields $R_n(\mathcal{L}_0) \le 2 L R_n(\mathcal{F})$, giving the result.
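Example 4.4.10 (Support vector machines and hinge losses): In the support vector machine problem, we receive data $(X_i, Y_i) \in \mathbb{R}^d \times \{\pm 1\}$, and we seek to minimize the average of the losses $\ell(\theta; (x, y)) = [1 - y \theta^\top x]_+$. We assume that the space $\mathcal{X}$ has $\|x\|_2 \le b$ for $x \in \mathcal{X}$ and that
\[
\mathbb{P}\Big(\sup_{\theta \in \Theta} |P_n \ell(\theta; (X, Y)) - P \ell(\theta; (X, Y))| \ge 4 R_n(\mathcal{F}_\Theta) + t\Big) \le \exp\Big(-\frac{n t^2}{2 r^2 b^2}\Big),
\]
where $\mathcal{F}_\Theta = \{f_\theta(x) = \theta^\top x\}_{\theta \in \Theta}$. Now, we apply Example 4.4.7, which implies that
\[
R_n(\phi \circ \mathcal{F}_\Theta) \le 2 R_n(\mathcal{F}_\Theta) \le \frac{2 r b}{\sqrt{n}}.
\]
\[
\mathbb{P}\Big(\sup_{\theta \in \Theta} |P_n \ell(\theta; (X, Y)) - P \ell(\theta; (X, Y))| \ge \frac{4 r b}{\sqrt{n}} + t\Big) \le \exp\Big(-\frac{n t^2}{2 (rb)^2}\Big),
\]
so that $P_n$ and $P$ become close at rate roughly $rb/\sqrt{n}$ in this case. ◇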
When we do not have the simplifying structure of $\ell(y f(x))$ identified in the preceding examples, we can still provide guarantees of generalization using the covering number guarantees introduced in Section 4.3.2. The most common and important case is when we have a Lipschitzian loss function in an underlying parameter $\theta$.
Example 4.4.11 (Lipschitz functions over a norm-bounded parameter space): Consider the parametric loss minimization problem
for a loss function $\ell$ that is $M$-Lipschitz (with respect to the norm $\|\cdot\|$) in its argument, where for normalization we assume $\inf_{\theta \in \Theta} \ell(\theta, z) = 0$ for each $z$. Then the metric entropy of $\Theta$ bounds the metric entropy of the loss class $\mathcal{F} := \{z \mapsto \ell(\theta, z)\}_{\theta \in \Theta}$ for the supremum norm $\|\cdot\|_\infty$. Indeed, for any pair $\theta, \theta'$, we have
Assume that $\Theta \subset \{\theta \mid \|\theta\| \le b\}$ for some finite $b$. Then Lemma 4.3.10 guarantees that $\log N(\epsilon, \Theta, \|\cdot\|) \le d \log(1 + 2/\epsilon) \lesssim d \log \frac{1}{\epsilon}$, and so the classical covering number argument in Example 4.3.9 gives
\[
\mathbb{P}\Big(\sup_{\theta \in \Theta} |P_n \ell(\theta, Z) - P \ell(\theta, Z)| \ge t\Big) \le \exp\Big(-c \frac{n t^2}{b^2 M^2} + C d \log \frac{M}{t}\Big),
\]
where $c, C$ are numerical constants. In particular, taking $t^2 \asymp \frac{M^2 b^2 d}{n} \log \frac{n}{\delta}$ gives that
\[
|P_n \ell(\theta, Z) - P \ell(\theta, Z)| \lesssim \frac{M b \sqrt{d \log \frac{n}{\delta}}}{\sqrt{n}}
\]
with probability at least $1 - \delta$. ◇
\[
L^* = \inf_f L(f),
\]
where the preceding infimum is taken across all (measurable) functions. Then we have
error, but may have substantial approximation error. With that in mind, we would like to develop procedures that, rather than simply attaining good performance for the class $\mathcal{F}$, are guaranteed to trade off in an appropriate way between the two types of error. This leads us to the idea of structural risk minimization.
In this scenario, we assume we have a sequence of classes of functions, $\mathcal{F}_1, \mathcal{F}_2, \ldots$, of increasing complexity, meaning that $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$. For example, in a linear classification setting with vectors $x \in \mathbb{R}^d$, we might take a sequence of classes allowing increasing numbers of non-zeros in the classification vector $\theta$:
\[
\mathcal{F}_1 := \big\{f_\theta(x) = \theta^\top x \text{ such that } \|\theta\|_0 \le 1\big\}, \quad \mathcal{F}_2 := \big\{f_\theta(x) = \theta^\top x \text{ such that } \|\theta\|_0 \le 2\big\}, \quad \ldots.
\]
More broadly, let $\{\mathcal{F}_k\}_{k \in \mathbb{N}}$ be a (possibly infinite) increasing sequence of function classes. We assume that for each $\mathcal{F}_k$ and each $n \in \mathbb{N}$, there exists a constant $C_{n,k}(\delta)$ such that we have the uniform generalization guarantee
\[
\mathbb{P}\Big(\sup_{f \in \mathcal{F}_k} \big|\widehat{L}_n(f) - L(f)\big| \ge C_{n,k}(\delta)\Big) \le \delta \cdot 2^{-k}.
\]
(We will see in subsequent sections of the course how to obtain other more general guarantees.)
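We consider the following structural risk minimization procedure. First, given the empirical risk $\widehat{L}_n$, we find the model collection $\widehat{k}$ minimizing the penalized risk
\[
\widehat{k} := \mathop{\mathrm{argmin}}_{k \in \mathbb{N}} \Big\{\inf_{f \in \mathcal{F}_k} \widehat{L}_n(f) + C_{n,k}(\delta)\Big\}. \tag{4.4.7a}
\]
We then choose $\widehat{f}$ to minimize the empirical risk over the estimated “best” class $\mathcal{F}_{\widehat{k}}$, that is, set
\[
\widehat{f} := \mathop{\mathrm{argmin}}_{f \in \mathcal{F}_{\widehat{k}}} \widehat{L}_n(f). \tag{4.4.7b}
\]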
Theorem 4.4.12. Let $\widehat{f}$ be chosen according to the procedure (4.4.7a)–(4.4.7b). Then with probability at least $1 - \delta$, we have
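On the event that $\sup_{f \in \mathcal{F}_k} |\widehat{L}_n(f) - L(f)| < C_{n,k}(\delta)$ for all $k$, which occurs with probability at least $1 - \delta$, we have
\[
L(\widehat{f}) \le \widehat{L}_n(\widehat{f}) + C_{n,\widehat{k}}(\delta) = \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \Big\{\widehat{L}_n(f) + C_{n,k}(\delta)\Big\} \le \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \big\{L(f) + 2 C_{n,k}(\delta)\big\}.
\]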
We conclude with a final example, using our earlier floating point bound from Example 4.4.5, coupled with Corollary 4.4.6 and Theorem 4.4.12.
Example 4.4.13 (Structural risk minimization with floating point classifiers): Consider again our floating point example, and let the function class $\mathcal{F}_k$ consist of functions defined by at most $k$ double-precision floating point values, so that $\log |\mathcal{F}_k| \le 45 k$. Then by taking
\[
C_{n,k}(\delta) = \sqrt{\frac{\log \frac{1}{\delta} + 65 k \log 2}{2n}}
\]
we have that $|\widehat{L}_n(f) - L(f)| \le C_{n,k}(\delta)$ simultaneously for all $f \in \mathcal{F}_k$ and all $\mathcal{F}_k$, with probability at least $1 - \delta$. Then the structural risk minimization procedure (4.4.7) guarantees that
\[
L(\widehat{f}) \le \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \Bigg\{L(f) + \sqrt{\frac{2 \log \frac{1}{\delta} + 91 k}{n}}\Bigg\}.
\]
Roughly, we trade between small risk $L(f)$ (as the risk $\inf_{f \in \mathcal{F}_k} L(f)$ must be decreasing in $k$) and the estimation error penalty, which scales as $\sqrt{(k + \log \frac{1}{\delta})/n}$. ◇
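The selection step (4.4.7a) can be sketched in a few lines of Python once the best empirical risk within each class is available. The code below is a hedged illustration: the penalty follows Example 4.4.13, while the list `risks` of best-in-class empirical risks is made-up data, and the names `penalty` and `select_class` are hypothetical.

```python
import math

def penalty(n, k, delta):
    """C_{n,k}(delta) from Example 4.4.13: sqrt((log(1/delta) + 65 k log 2) / (2 n))."""
    return math.sqrt((math.log(1.0 / delta) + 65.0 * k * math.log(2.0)) / (2.0 * n))

def select_class(min_empirical_risks, n, delta):
    """Step (4.4.7a): pick k minimizing (best empirical risk in F_k) + C_{n,k}(delta).

    min_empirical_risks[k - 1] is assumed to hold the infimum of the empirical risk over F_k.
    """
    scores = [risk + penalty(n, k, delta)
              for k, risk in enumerate(min_empirical_risks, start=1)]
    k_hat = min(range(1, len(scores) + 1), key=lambda k: scores[k - 1])
    return k_hat, scores[k_hat - 1]

# Hypothetical best-in-class empirical risks for nested classes F_1, ..., F_6.
risks = [0.40, 0.22, 0.12, 0.11, 0.105, 0.10]
for n in (2_000, 200_000):
    k_hat, score = select_class(risks, n, delta=0.05)
    print(f"n = {n}: selected k = {k_hat}, penalized risk = {score:.3f}")
```

With few samples the penalty dominates and a small class is selected; as $n$ grows the penalty shrinks and the procedure moves to richer classes, which is the adaptivity that Theorem 4.4.12 formalizes.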
where for the last inequality we made the substitution $u = t^2/\sigma^2$. Noting that this final integral is $\Gamma(k/2)$, we have $\mathbb{E}[|X|^k] \le k \sigma^k \Gamma(k/2)$. Because $\Gamma(s) \le s^s$ for $s \ge 1$, we obtain
\[
\mathbb{E}[|X|^k]^{1/k} \le k^{1/k} \sigma \sqrt{k/2} \le e^{1/e} \sigma \sqrt{k}.
\]
Thus (2) holds with $K_2 = e^{1/e}$.
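(2) implies (3) Let $\sigma = \|X\|_{\psi_2} = \sup_{k \ge 1} k^{-\frac{1}{2}} \mathbb{E}[|X|^k]^{1/k}$, so that $K_2 = 1$ and $\mathbb{E}[|X|^k] \le k^{\frac{k}{2}} \sigma^k$ for all $k$. For $K_3 \in \mathbb{R}_+$, we thus have
\[
\mathbb{E}[\exp(X^2/(K_3^2 \sigma^2))] = \sum_{k=0}^\infty \frac{\mathbb{E}[X^{2k}]}{k! K_3^{2k} \sigma^{2k}} \le \sum_{k=0}^\infty \frac{\sigma^{2k} (2k)^k}{k! K_3^{2k} \sigma^{2k}} \stackrel{(i)}{\le} \sum_{k=0}^\infty \Big(\frac{2e}{K_3^2}\Big)^k
\]
where inequality (i) follows because $k! \ge (k/e)^k$, or $1/k! \le (e/k)^k$. Noting that $\sum_{k=0}^\infty \alpha^k = \frac{1}{1 - \alpha}$ for $|\alpha| < 1$, we obtain (3) by taking $K_3 = e \sqrt{2/(e - 1)} \approx 2.933$.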
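The choice of $K_3$ can be checked by direct arithmetic: with $K_3 = e\sqrt{2/(e-1)}$ the ratio of the geometric series is $2e/K_3^2 = (e-1)/e$, so the series sums to $e$. A short Python verification (purely a numerical sanity check, with variable names of our choosing):

```python
import math

K3 = math.e * math.sqrt(2.0 / (math.e - 1.0))
alpha = 2.0 * math.e / K3 ** 2          # ratio of the geometric series
series_bound = 1.0 / (1.0 - alpha)      # value of sum_{k >= 0} alpha^k
print(f"K3 = {K3:.4f}, alpha = {alpha:.4f}, series bound = {series_bound:.4f}")
```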
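(3) implies (4) Let us take $K_3 = 1$. We claim that (4) holds with $K_4 = \frac{3}{4}$. We prove this result for both small and large $\lambda$. First, note the (highly non-standard, but true!) inequality that $e^x \le x + e^{\frac{9 x^2}{16}}$ for all $x$. Then we have
\[
\mathbb{E}[\exp(\lambda X)] \le \underbrace{\mathbb{E}[\lambda X]}_{= 0} + \mathbb{E}\Big[\exp\Big(\frac{9 \lambda^2 X^2}{16}\Big)\Big].
\]
Now note that for $|\lambda| \le \frac{4}{3\sigma}$, we have $9 \lambda^2 \sigma^2/16 \le 1$, and so by Jensen's inequality,
\[
\mathbb{E}\Big[\exp\Big(\frac{9 \lambda^2 X^2}{16}\Big)\Big] = \mathbb{E}\Big[\exp(X^2/\sigma^2)^{\frac{9 \lambda^2 \sigma^2}{16}}\Big] \le e^{\frac{9 \lambda^2 \sigma^2}{16}}.
\]
For large $\lambda$, we use the simpler Fenchel-Young inequality, that is, that $\lambda x \le \frac{\lambda^2}{2c} + \frac{c x^2}{2}$, valid for all $c > 0$. Then we have for any $0 < c \le 2$ that
\[
\mathbb{E}[\exp(\lambda X)] \le e^{\frac{\lambda^2 \sigma^2}{2c}} \mathbb{E}\Big[\exp\Big(\frac{c X^2}{2 \sigma^2}\Big)\Big] \le e^{\frac{\lambda^2 \sigma^2}{2c}} e^{\frac{c}{2}},
\]
where the final inequality follows from Jensen's inequality. If $|\lambda| \ge \frac{4}{3\sigma}$, then $\frac{1}{2} \le \frac{9}{32} \lambda^2 \sigma^2$, and we have
\[
\mathbb{E}[\exp(\lambda X)] \le \inf_{c \in (0, 2]} e^{[\frac{1}{2c} + \frac{9c}{32}] \lambda^2 \sigma^2} = \exp\Big(\frac{3 \lambda^2 \sigma^2}{4}\Big).
\]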
(4) implies (1) This is the content of Proposition 4.1.8, with $K_4 = \frac{1}{2}$ and $K_1 = 2$.
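The step “(3) implies (4)” rests on two elementary facts that are easy to probe numerically: the inequality $e^x \le x + e^{9x^2/16}$, and the minimization of $\frac{1}{2c} + \frac{9c}{32}$ over $c \in (0, 2]$, which is attained at $c = 4/3$ with value $3/4$. The Python sketch below checks both on a grid; a grid check is of course not a proof, and the function names `gap` and `exponent` are our own.

```python
import math

def gap(x):
    """Right side minus left side of e^x <= x + e^{9 x^2 / 16}."""
    return x + math.exp(9.0 * x * x / 16.0) - math.exp(x)

grid = [i / 1000.0 for i in range(-5000, 5001)]      # x in [-5, 5]
min_gap = min(gap(x) for x in grid)

def exponent(c):
    """Coefficient of lambda^2 sigma^2 in the large-lambda bound, as a function of c."""
    return 1.0 / (2.0 * c) + 9.0 * c / 32.0

c_grid = [i / 1000.0 for i in range(1, 2001)]        # c in (0, 2]
best_c = min(c_grid, key=exponent)
print(f"min gap on grid = {min_gap:.2e}, best c = {best_c:.3f}, "
      f"value at c = 4/3: {exponent(4.0 / 3.0):.4f}")
```

The gap vanishes at $x = 0$ and is smallest, though still positive, for moderate positive $x$, which is why the constant $9/16$ cannot be lowered much.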
where we used the substitution $u = t/\sigma$. Thus we have $\mathbb{E}[|X|^k] \le 2 \Gamma(k + 1) \sigma^k$, and using $\Gamma(k + 1) \le k^k$ yields $\mathbb{E}[|X|^k]^{1/k} \le 2^{1/k} k \sigma$, so that (2) holds with $K_2 \le 2$.
where inequality (i) used that $k! \ge (k/e)^k$. Taking $K_3 = e^2/(e - 1) < 5$ gives the result.
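(2) if and only if (4) Thus, we see that up to constant numerical factors, the definition $\|X\|_{\psi_1} = \sup_{k \ge 1} k^{-1} \mathbb{E}[|X|^k]^{1/k}$ has the equivalent statements
Now, let us assume that (2) holds with $K_2 = 1$, so that $\sigma = \|X\|_{\psi_1}$ and that $\mathbb{E}[X] = 0$. Then we have $\mathbb{E}[X^k] \le k^k \|X\|_{\psi_1}^k$, and
\[
\mathbb{E}[\exp(\lambda X)] = 1 + \sum_{k=2}^\infty \frac{\lambda^k \mathbb{E}[X^k]}{k!} \le 1 + \sum_{k=2}^\infty \lambda^k \|X\|_{\psi_1}^k \cdot \frac{k^k}{k!} \le 1 + \sum_{k=2}^\infty \lambda^k \|X\|_{\psi_1}^k e^k,
\]
the final inequality following because $k! \ge (k/e)^k$. Now, if $|\lambda| \le \frac{1}{2e \|X\|_{\psi_1}}$, then we have
\[
\mathbb{E}[\exp(\lambda X)] \le 1 + \lambda^2 e^2 \|X\|_{\psi_1}^2 \sum_{k=0}^\infty (\lambda \|X\|_{\psi_1} e)^k \le 1 + 2 e^2 \|X\|_{\psi_1}^2 \lambda^2,
\]
as the final sum is at most $\sum_{k=0}^\infty 2^{-k} = 2$. Using $1 + x \le e^x$ gives that (2) implies (4). For the opposite direction, we may simply use that if (4) holds with $K_4 = 1$ and $K_4' = 1$, then $\mathbb{E}[\exp(X/\sigma)] \le \exp(1)$, so that (3) holds.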
4.6 Bibliography
A few references on concentration, random matrices, and entropies include Vershynin's extraordinarily readable lecture notes [170], upon which our proof of Theorem 4.1.11 is based, the comprehensive book of Boucheron, Lugosi, and Massart [34], and the more advanced material in Buldygin and Kozachenko [41]. Many of our arguments are based on those of Vershynin and Boucheron et al. Kolmogorov and Tikhomirov [121] introduced metric entropy.
4.7 Exercises
Exercise 4.1 (Concentration of bounded random variables): Let $X$ be a random variable taking values in $[a, b]$, where $-\infty < a \le b < \infty$. In this question, we show Hoeffding's Lemma, that is, that $X$ is sub-Gaussian: for all $\lambda \in \mathbb{R}$, we have
\[
\mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \le \exp\Big(\frac{\lambda^2 (b - a)^2}{8}\Big).
\]
(a) Show that $\mathrm{Var}(X) \le (\frac{b - a}{2})^2 = \frac{(b - a)^2}{4}$ for any random variable $X$ taking values in $[a, b]$.
(b) Let
\[
\varphi(\lambda) = \log \mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))].
\]
Assuming that E[X] = 0 (convince yourself that this is no loss of generality) show that
(You may assume that derivatives and expectations commute, which they do in this case.)
(c) Construct a random variable Yt , defined for t ∈ R, such that Yt ∈ [a, b] and
Exercise 4.2: In this question, we show how to use Bernstein-type (sub-exponential) inequalities to give sharp convergence guarantees. Recall (Example 4.1.14, Corollary 4.1.18, and inequality (4.1.6)) that if $X_i$ are independent bounded random variables with $|X_i - \mathbb{E}[X]| \le b$ for all $i$ and $\mathrm{Var}(X_i) \le \sigma^2$, then
\[
\max\Bigg\{\mathbb{P}\Big(\frac{1}{n} \sum_{i=1}^n X_i \ge \mathbb{E}[X] + t\Big), \mathbb{P}\Big(\frac{1}{n} \sum_{i=1}^n X_i \le \mathbb{E}[X] - t\Big)\Bigg\} \le \exp\Big(-\frac{1}{2} \min\Big\{\frac{5}{6} \frac{n t^2}{\sigma^2}, \frac{n t}{2b}\Big\}\Big).
\]
We consider minimization of loss functions $\ell$ over finite function classes $\mathcal{F}$ with $\ell \in [0, 1]$, so that if $L(f) = \mathbb{E}[\ell(f, Z)]$ then $|\ell(f, Z) - L(f)| \le 1$. Throughout this question, we let
We will show that, roughly, a procedure based on picking an empirical risk minimizer is unlikely to
choose a function f ∈ F with bad performance, so that we obtain faster concentration guarantees.
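(b) Define the set of “bad” prediction functions $\mathcal{F}_\epsilon^{\mathrm{bad}} := \{f \in \mathcal{F} : L(f) \ge L^\star + \epsilon\}$. Show that for any fixed $\epsilon \ge 0$ and any $f \in \mathcal{F}_{2\epsilon}^{\mathrm{bad}}$, we have
\[
\mathbb{P}\Big(\widehat{L}(f) \le L^\star + \epsilon\Big) \le \exp\Big(-\frac{1}{2} \min\Big\{\frac{5}{6} \frac{n \epsilon^2}{L^\star(1 - L^\star) + \epsilon(1 - \epsilon)}, \frac{n \epsilon}{2}\Big\}\Big).
\]
\[
\mathbb{P}\Big(L(\widehat{f}_n) \ge L(f^\star) + 2\epsilon\Big) \le \mathrm{card}(\mathcal{F}) \cdot \exp\Big(-\frac{1}{2} \min\Big\{\frac{5}{6} \frac{n \epsilon^2}{L^\star(1 - L^\star) + \epsilon(1 - \epsilon)}, \frac{n \epsilon}{2}\Big\}\Big).
\]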
(d) Using the result of part (c), argue that with probability at least $1 - \delta$,
\[
L(\widehat{f}_n) \le L(f^\star) + \frac{4 \log \frac{|\mathcal{F}|}{\delta}}{n} + \sqrt{\frac{12}{5}} \cdot \frac{\sqrt{L^\star(1 - L^\star) \cdot \log \frac{|\mathcal{F}|}{\delta}}}{\sqrt{n}}.
\]
Why is this better than an inequality based purely on the boundedness of the loss $\ell$, such as Theorem 4.4.4 or Corollary 4.4.6? What happens when there is a perfect risk minimizer $f^\star$?
Exercise 4.3 (Likelihood ratio bounds and concentration): Consider a data release problem, where given a sample $x$, we release a sequence of data $Z_1, Z_2, \ldots, Z_n$ belonging to a discrete set $\mathcal{Z}$, where $Z_i$ may depend on $Z_1^{i-1}$ and $x$. We assume that the data has limited information about $x$ in the sense that for any two samples $x, x'$, we have the likelihood ratio bound
\[
\frac{p(z_i \mid x, z_1^{i-1})}{p(z_i \mid x', z_1^{i-1})} \le e^\varepsilon.
\]
Let us control the amount of “information” (in the form of an updated log-likelihood ratio) released by this sequential mechanism. Fix $x, x'$, and define
\[
L(z_1, \ldots, z_n) := \log \frac{p(z_1, \ldots, z_n \mid x)}{p(z_1, \ldots, z_n \mid x')}.
\]
\[
\mathbb{P}\big(L(Z_1, \ldots, Z_n) \ge n \varepsilon (e^\varepsilon - 1) + t\big) \le \exp\Big(-\frac{t^2}{2 n \varepsilon^2}\Big).
\]
(b) Let $\gamma \in (0, 1)$. Give the largest value of $\varepsilon$ you can that is sufficient to guarantee that for any test $\Psi : \mathcal{Z}^n \to \{x, x'\}$, we have
where $P_x$ and $P_{x'}$ denote the sampling distribution of $Z_1^n$ under $x$ and $x'$, respectively?
where $C_p$ is a constant (that depends on $p$). As a corollary, derive that if $\mathbb{E}[|X_i|^p] \le \sigma^p$ and $p \ge 2$, then
\[
\mathbb{E}\Bigg[\Big|\frac{1}{n} \sum_{i=1}^n X_i\Big|^p\Bigg] \le C_p \frac{\sigma^p}{n^{p/2}}.
\]
That is, sample means converge quickly to zero in higher moments. Hint: For any fixed $x \in \mathbb{R}^n$, if $\varepsilon_i$ are i.i.d. uniform signs $\varepsilon_i \in \{\pm 1\}$, then $\varepsilon^\top x$ is sub-Gaussian.
Exercise 4.5 (Small balls and anti-concentration): Let $X$ be a nonnegative random variable satisfying $\mathbb{P}(X \le \epsilon) \le c \epsilon$ for some $c < \infty$ and all $\epsilon > 0$. Argue that if $X_i$ are i.i.d. copies of $X$, then
\[
\mathbb{P}\Big(\frac{1}{n} \sum_{i=1}^n X_i \ge t\Big) \ge 1 - \exp\big(-2n [1/2 - 2ct]_+^2\big)
\]
for all $t$.
Exercise 4.6 (Lipschitz functions remain sub-Gaussian): Let $X$ be $\sigma^2$-sub-Gaussian and $f : \mathbb{R} \to \mathbb{R}$ be $L$-Lipschitz, meaning that $|f(x) - f(y)| \le L|x - y|$ for all $x, y$. Prove that there exists a numerical constant $C < \infty$ such that $f(X)$ is $C L^2 \sigma^2$-sub-Gaussian.
Exercise 4.7 (Sub-Gaussian maxima): Let $X_1, \ldots, X_n$ be $\sigma^2$-sub-Gaussian (not necessarily independent) random variables. Show that
(a) $\mathbb{E}[\max_i X_i] \le \sqrt{2 \sigma^2 \log n}$.
(b) There exists a numerical constant $C < \infty$ such that $\mathbb{E}[\max_i |X_i|^p] \le (C p \sigma^2 \log k)^{p/2}$.
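Exercise 4.8: Consider a binary classification problem with logistic loss $\ell(\theta; (x, y)) = \log(1 + \exp(-y \theta^\top x))$, where $\theta \in \Theta := \{\theta \in \mathbb{R}^d \mid \|\theta\|_1 \le r\}$ and $y \in \{\pm 1\}$. Assume additionally that the space $\mathcal{X} \subset \{x \in \mathbb{R}^d \mid \|x\|_\infty \le b\}$. Define the empirical and population risks $\widehat{L}_n(\theta) := P_n \ell(\theta; (X, Y))$ and $L(\theta) := P \ell(\theta; (X, Y))$, and let $\widehat{\theta}_n = \mathop{\mathrm{argmin}}_{\theta \in \Theta} \widehat{L}_n(\theta)$. Show that with probability at least $1 - \delta$ over $(X_i, Y_i) \stackrel{\mathrm{iid}}{\sim} P$,
\[
L(\widehat{\theta}_n) \le \inf_{\theta \in \Theta} L(\theta) + C \frac{r b \sqrt{\log \frac{d}{\delta}}}{\sqrt{n}}
\]
where $C < \infty$ is a numerical constant (you need not specify this).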
Exercise 4.9 (Sub-Gaussian constants of Bernoulli random variables): In this exercise, we will derive sharp sub-Gaussian constants for Bernoulli random variables (cf. [106, Thm. 1] or [118, 24]), showing
\[
\log \mathbb{E}[e^{t(X - p)}] \le \frac{1 - 2p}{4 \log \frac{1 - p}{p}} t^2 \quad \text{for all } t \ge 0. \tag{4.7.1}
\]
where $Y_t = (1 - p)$ with probability $q(t) := \frac{p e^{t(1 - p)}}{p e^{t(1 - p)} + (1 - p) e^{-tp}}$ and $Y_t = -p$ otherwise.
\[
f(s) = \frac{1 - 2p}{2} C s^2 + C p s - \log(1 - p + p e^{Cs}),
\]
so that inequality (4.7.1) holds if and only if $f(s) \ge 0$ for all $s \ge 0$. Give $f'(s)$ and $f''(s)$.
(e) Show that $f(0) = f(1) = f'(0) = f'(1) = 0$, and argue that $f''(s)$ changes signs at most twice and that $f''(0) = f''(1) > 0$. Use this to show that $f(s) \ge 0$ for all $s \ge 0$.
JCD Comment: Perhaps use transportation inequalities to prove this bound, and
also maybe give Ordentlich and Weinberger’s “A Distribution Dependent Refinement
of Pinsker’s Inequality” as an exercise.
Exercise 4.10: Let $s(p) = \frac{1 - 2p}{\log \frac{1 - p}{p}}$. Show that $s$ is concave on $[0, 1]$.
1. A hypothesis test likelihood ratio for them (see page 40 of handwritten notes)
2. A full learning guarantee with convergence of Hessian and everything, e.g., for logistic
regression?
3. In the Ledoux-Talagrand stuff, maybe worth going through example of logistic regres-
sion. Also, having working logistic example throughout? Helps clear up the structure
and connect with exponential families.
Chapter 5
Concentration inequalities provide powerful techniques for demonstrating when random objects
that are functions of collections of independent random variables—whether sample means, functions
with bounded variation, or collections of random vectors—behave similarly to their expectations.
This chapter continues exploration of these ideas by incorporating the central thesis of this book:
that information theory’s connections to statistics center around measuring when (and how) two
probability distributions get close to one another. On its face, we remain focused on the main
objects of the preceding chapter, where we have a population probability distribution P on a space
$\mathcal{X}$ and some collection of functions $f : \mathcal{X} \to \mathbb{R}$. We then wish to understand when we expect the empirical distribution
\[
P_n := \frac{1}{n} \sum_{i=1}^n 1_{X_i},
\]
defined by the sample $X_i \stackrel{\mathrm{iid}}{\sim} P$, to be close to the population $P$ as measured by $f$. Following the notation we introduce in Section 4.3, for $P f := \mathbb{E}_P[f(X)]$, we again ask to have
\[
P_n f - P f = \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E}_P[f(X)]
\]
where the supremum is taken over measurable functions $g : \mathcal{X} \to \mathbb{R}$ with $\mathbb{E}_Q[e^{g(X)}] < \infty$.
We give one proof of this result and one sketch of a proof that holds when the underlying space is discrete: the first constructs a particular “tilting” of $Q$ via the function $e^g$ and verifies the equality. The second relies on the discretization of the KL-divergence and may be more intuitive to readers familiar with convex optimization: essentially, we expect this result because the function $\log(\sum_{j=1}^k e^{x_j})$ is the convex conjugate of the negative entropy. (See also Exercise 5.1.)
Proof We may assume that $P$ is absolutely continuous with respect to $Q$, meaning that $Q(A) = 0$ implies that $P(A) = 0$, as otherwise both sides are infinite by inspection. Thus, it is no loss of generality to let $P$ and $Q$ have densities $p$ and $q$.
Attainment in the equality is easy: we simply take $g(x) = \log \frac{p(x)}{q(x)}$, so that $\mathbb{E}_Q[e^{g(X)}] = 1$. To show that the right hand side is never larger than $D_{\mathrm{kl}}(P \| Q)$ requires a bit more work. To that end, let $g$ be any function such that $\mathbb{E}_Q[e^{g(X)}] < \infty$, and define the random variable $Z_g(x) = e^{g(x)}/\mathbb{E}_Q[e^{g(X)}]$, so that $\mathbb{E}_Q[Z_g] = 1$. Then using the absolute continuity of $P$ w.r.t. $Q$, we have
\begin{align*}
\mathbb{E}_P[\log Z_g] & = \mathbb{E}_P\Big[\log \frac{p(X)}{q(X)} + \log\Big(Z_g(X) \frac{q(X)}{p(X)}\Big)\Big] = D_{\mathrm{kl}}(P \| Q) + \mathbb{E}_P\Big[\log\Big(Z_g \frac{dQ}{dP}\Big)\Big] \\
& \le D_{\mathrm{kl}}(P \| Q) + \log \mathbb{E}_P\Big[Z_g \frac{dQ}{dP}\Big] \\
& = D_{\mathrm{kl}}(P \| Q) + \log \mathbb{E}_Q[Z_g].
\end{align*}
As $\mathbb{E}_Q[Z_g] = 1$, using that $\mathbb{E}_P[\log Z_g] = \mathbb{E}_P[g(X)] - \log \mathbb{E}_Q[e^{g(X)}]$ gives the result.
Here is the second proof of Theorem 5.1.1, which applies when $\mathcal{X}$ is discrete and finite. That we can approximate KL-divergence by suprema over finite partitions (as in definition (2.2.1)) suggests that this approach works in general (which it can), but this requires some not completely trivial approximations of $\mathbb{E}_P[g]$ and $\mathbb{E}_Q[e^g]$ by discretized versions of their expectations, which makes things rather tedious.
Proof of Theorem 5.1.1, the finite case We have assumed that $P$ and $Q$ have finite supports, which we identify with $\{1, \ldots, k\}$, with p.m.f.s $p, q \in \Delta_k = \{p \in \mathbb{R}_+^k \mid \langle \mathbf{1}, p \rangle = 1\}$. Define $f_q(v) = \log(\sum_{j=1}^k q_j e^{v_j})$, which is convex in $v$ (recall Proposition 3.2.1). Then the supremum in the variational representation takes the form
\[
h(p) := \sup_{v \in \mathbb{R}^k} \{\langle p, v \rangle - f_q(v)\}.
\]
If we can take derivatives and solve for zero, we are guaranteed to achieve the supremum. To that end, note that
\[
\nabla_v \{\langle p, v \rangle - f_q(v)\} = p - \Bigg[\frac{q_i e^{v_i}}{\sum_{j=1}^k q_j e^{v_j}}\Bigg]_{i=1}^k,
\]
so that setting $v_j = \log \frac{p_j}{q_j}$ achieves $p - \nabla_v f_q(v) = p - p = 0$ and hence the supremum. Noting that $\log(\sum_{j=1}^k q_j \exp(\log \frac{p_j}{q_j})) = \log(\sum_{j=1}^k p_j) = 0$ gives $h(p) = D_{\mathrm{kl}}(p \| q)$.
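The finite-case argument is easy to check numerically. In the hedged Python sketch below, the p.m.f.s `p` and `q` are arbitrary illustrative choices and `dv_objective` is our name for the map $v \mapsto \langle p, v \rangle - f_q(v)$; we confirm that $v_j = \log(p_j/q_j)$ attains $D_{\mathrm{kl}}(p \| q)$ and that randomly drawn $v$ never exceed it.

```python
import math
import random

def kl_divergence(p, q):
    """D_kl(p || q) for finite p.m.f.s with q_j > 0 wherever p_j > 0."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def dv_objective(p, q, v):
    """<p, v> - log(sum_j q_j exp(v_j)), the quantity maximized over v."""
    inner = sum(pj * vj for pj, vj in zip(p, v))
    return inner - math.log(sum(qj * math.exp(vj) for qj, vj in zip(q, v)))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
kl = kl_divergence(p, q)

v_star = [math.log(pj / qj) for pj, qj in zip(p, q)]      # the claimed maximizer
at_optimum = dv_objective(p, q, v_star)

rng = random.Random(0)
random_values = [dv_objective(p, q, [rng.uniform(-3, 3) for _ in p]) for _ in range(1000)]
print(f"KL = {kl:.6f}, objective at v* = {at_optimum:.6f}, "
      f"best random v = {max(random_values):.6f}")
```

Adding a constant to every coordinate of $v$ leaves the objective unchanged, so the maximizer is unique only up to such shifts.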
The Donsker-Varadhan variational representation already gives a hint that we can use some
information-theoretic techniques to control the difference between an empirical sample and its
expectation, at least in an average sense. In particular, we see that for any function g, we have
for any random variable $X$. Now, turning this on its head a bit, suppose that we consider a collection of functions $\mathcal{F}$ and put two probability measures $\pi$ and $\pi_0$ on $\mathcal{F}$, and consider $P_n f - P f$, where we consider $f$ a random variable $f \sim \pi$ or $f \sim \pi_0$. Then a consequence of the Donsker-Varadhan theorem is that
\[
\int (P_n f - P f) \, d\pi(f) \le D_{\mathrm{kl}}(\pi \| \pi_0) + \log \int \exp(P_n f - P f) \, d\pi_0(f)
\]
for any $\pi, \pi_0$. While this inequality is a bit naive (bounding a difference by an exponent seems wasteful), as we shall see, it has substantial applications when we can upper bound the KL-divergence $D_{\mathrm{kl}}(\pi \| \pi_0)$.
Proof The key is to combine Example 4.1.12 with the variational representation that Theorem 5.1.1 provides for KL-divergences. We state Example 4.1.12 as a lemma here.
Without loss of generality, we assume that $P f = 0$ for all $f \in \mathcal{F}$, and recall that $P_n f = \frac{1}{n} \sum_{i=1}^n f(X_i)$ is the empirical mean of $f$. Then we know that $P_n f$ is $\sigma^2/n$-sub-Gaussian, and Lemma 5.2.2 implies that $\mathbb{E}[\exp(\lambda (P_n f)^2)] \le [1 - 2 \lambda \sigma^2/n]_+^{-1/2}$ for any $f$, and thus for any prior $\pi_0$ on $f$ we have
\[
\mathbb{E}\Big[\int \exp(\lambda (P_n f)^2) \, d\pi_0(f)\Big] \le \big[1 - 2 \lambda \sigma^2/n\big]_+^{-1/2}.
\]
Consequently, taking $\lambda = \lambda_n := \frac{3n}{8 \sigma^2}$, we obtain
\[
\mathbb{E}\Big[\int \exp(\lambda_n (P_n f)^2) \, d\pi_0(f)\Big] = \mathbb{E}\Big[\int \exp\Big(\frac{3n}{8 \sigma^2} (P_n f)^2\Big) \, d\pi_0(f)\Big] \le 2.
\]
By Jensen's inequality (or Cauchy-Schwarz), it is immediate from Theorem 5.2.1 that we also have
\[
\int |P_n f - P f| \, d\pi(f) \le \sqrt{\frac{8 \sigma^2}{3} \cdot \frac{D_{\mathrm{kl}}(\pi \| \pi_0) + \log \frac{2}{\delta}}{n}} \quad \text{simultaneously for all } \pi \in \Pi \tag{5.2.2}
\]
with probability at least $1 - \delta$, so that $\mathbb{E}_\pi[|P_n f - P f|]$ is with high probability of order $1/\sqrt{n}$. The inequality (5.2.2) is the original form of the PAC-Bayes bound due to McAllester, with slightly
sharper constants and improved logarithmic dependence. The key is that stability, in the form of closeness between a prior $\pi_0$ and posterior $\pi$, allows us to achieve reasonably tight control over the deviations of random variables and functions with high probability.
Let us give an example, which is similar to many of our approaches in Section 4.4, to illustrate
some of the approaches this allows. The basic idea is that by appropriate choice of prior π0
and “posterior” π, whenever we have appropriately smooth classes of functions we achieve certain
generalization guarantees.
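Example 5.2.3 (A uniform law for Lipschitz functions): Consider a case as in Section 4.4,
where we let $L(\theta) = P\ell(\theta, Z)$ for some function $\ell : \Theta \times \mathcal{Z} \to \mathbb{R}$. Let
$\mathbb{B}_2^d = \{v \in \mathbb{R}^d \mid \|v\|_2 \le 1\}$ be the $\ell_2$-ball in $\mathbb{R}^d$, and let us assume that
$\Theta \subset r\mathbb{B}_2^d$ and additionally that $\theta \mapsto \ell(\theta, z)$ is $M$-Lipschitz for all $z \in \mathcal{Z}$.
For simplicity, we assume that $\ell(\theta, z) \in [0, 2Mr]$ for all $\theta \in \Theta$ (we may simply relativize
our bounds by replacing $\ell$ by $\ell(\cdot, z) - \inf_{\theta \in \Theta} \ell(\theta, z) \in [0, 2Mr]$).
If $\hat{L}_n(\theta) = P_n \ell(\theta, Z)$, then Theorem 5.2.1 implies that
\[
\int |\hat{L}_n(\theta) - L(\theta)| \, d\pi(\theta)
\le \sqrt{\frac{8M^2r^2}{3n}\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{2}{\delta}\Big]}
\]
for all $\pi$ with probability at least $1 - \delta$. Now, let $\theta_0 \in \Theta$ be arbitrary, and for $\epsilon > 0$ (to be
chosen later) take $\pi_0$ to be uniform on $(r + \epsilon)\mathbb{B}_2^d$ and $\pi$ to be uniform on $\theta_0 + \epsilon\mathbb{B}_2^d$. Then we
immediately see that $D_{\mathrm{kl}}(\pi\|\pi_0) = d\log(1 + \frac{r}{\epsilon})$. Moreover, we have
$\int \hat{L}_n(\theta)\,d\pi(\theta) \in \hat{L}_n(\theta_0) \pm M\epsilon$,
and similarly for $L(\theta)$, by the $M$-Lipschitz continuity of $\ell$. For any fixed $\epsilon > 0$, we thus have
\[
|\hat{L}_n(\theta_0) - L(\theta_0)| \le 2M\epsilon
+ \sqrt{\frac{8M^2r^2}{3n}\Big[d\log\Big(1 + \frac{r}{\epsilon}\Big) + \log\frac{2}{\delta}\Big]}
\]
simultaneously for all $\theta_0 \in \Theta$, with probability at least $1 - \delta$. By choosing $\epsilon = \frac{rd}{n}$ we obtain
that with probability at least $1 - \delta$,
\[
\sup_{\theta \in \Theta} |\hat{L}_n(\theta) - L(\theta)| \le \frac{2Mrd}{n}
+ \sqrt{\frac{8M^2r^2}{3n}\Big[d\log\Big(1 + \frac{n}{d}\Big) + \log\frac{2}{\delta}\Big]}.
\]
Thus, roughly, with high probability we have
$|\hat{L}_n(\theta) - L(\theta)| \le O(1)\, Mr\sqrt{\frac{d}{n}\log\frac{n}{d}}$ for all $\theta$. ◇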
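To get a feel for the scale of the final display in Example 5.2.3, the following minimal sketch evaluates its right-hand side with $\epsilon = rd/n$. The function name `lipschitz_uniform_bound` and the parameter values are illustrative assumptions, not part of the notes.

```python
import math

def lipschitz_uniform_bound(M, r, d, n, delta):
    """Right-hand side of the last display of Example 5.2.3 with eps = r*d/n.

    Hypothetical helper: M is the Lipschitz constant, r the radius of Theta,
    d the dimension, n the sample size, delta the failure probability.
    """
    eps = r * d / n
    kl = d * math.log(1.0 + r / eps)  # D_kl(pi || pi_0) = d log(1 + r/eps)
    deviation = math.sqrt(8.0 * M**2 * r**2 / (3.0 * n) * (kl + math.log(2.0 / delta)))
    return 2.0 * M * eps + deviation

# The bound decays roughly like sqrt(d log(n/d) / n) as n grows.
for n in (10**3, 10**4, 10**5):
    print(n, lipschitz_uniform_bound(M=1.0, r=1.0, d=10, n=n, delta=0.05))
```

The explicit factor of $d$ inside the square root is what makes this bound dimension-dependent.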
On the one hand, the result in Example 5.2.3 is satisfying: it applies to any Lipschitz function
and provides a uniform bound. On the other hand, compared with what Rademacher complexity
bounds (such as Proposition 4.4.9 and Example 4.4.10) achieve for specially structured linear
function classes, the result is somewhat weaker: it depends on the dimension explicitly, while the
Rademacher bounds do not. The Rademacher bounds can therefore apply in infinite dimensional
spaces, where Example 5.2.3 cannot.
We will give an example presently showing how to address some of these issues.
variable $X$ satisfies $|X| \le b$ but $\mathrm{Var}(X) \le \sigma^2 \ll b^2$, then $X$ concentrates more quickly about
its mean than the convergence provided by naive application of sub-Gaussian concentration with
sub-Gaussian parameter $b^2/8$. To that end, we investigate an alternative to Theorem 5.2.1 that
allows somewhat sharper control.
The approach is similar to our derivation in Theorem 5.2.1, where we show that the moment
generating function of a quantity like Pn f − P f is small (Eq. (5.2.1)) and then relate this—via the
Donsker-Varadhan change of measure in Theorem 5.1.1—to the quantities we wish to control. In
the next proposition, we provide relative bounds on the deviations of functions from their means.
To make this precise, let $\mathcal{F}$ be a collection of functions $f : \mathcal{X} \to \mathbb{R}$, and let $\sigma^2(f) := \mathrm{Var}(f(X))$ be
the variance of functions in $\mathcal{F}$. We assume the class satisfies the Bernstein condition (4.1.7) with
parameter $b$, that is,
\[
\big|E\big[(f(X) - Pf)^k\big]\big| \le \frac{k!}{2}\sigma^2(f) b^{k-2} \quad \text{for } k = 3, 4, \ldots. \tag{5.2.3}
\]
This says that the second moment of functions $f \in \mathcal{F}$ bounds (with the additional boundedness-type
constant $b$) the higher moments of functions in $\mathcal{F}$. We then have the following result.
Proof We begin with an inequality on the moment generating function of random variables
satisfying the Bernstein condition (4.1.7), that is, that $|E[(X - \mu)^k]| \le \frac{k!}{2}\sigma^2 b^{k-2}$ for $k \ge 2$. In this
case, Lemma 4.1.19 implies that
\[
E[e^{\lambda(X - \mu)}] \le \exp(\lambda^2\sigma^2)
\]
for $|\lambda| \le 1/(2b)$. As a consequence, for any $f$ in our collection $\mathcal{F}$, we see that if we define
\[
\Delta_n(f, \lambda) := \lambda\big[P_n f - Pf - \lambda\sigma^2(f)\big],
\]
we have that
\[
E[\exp(n\Delta_n(f, \lambda))] = E[\exp(\lambda(f(X) - Pf) - \lambda^2\sigma^2(f))]^n \le 1
\]
for all $n$, $f \in \mathcal{F}$, and $|\lambda| \le \frac{1}{2b}$. Then, for any fixed measure $\pi_0$ on $\mathcal{F}$, Markov's inequality implies
that
\[
P\Big(\int \exp(n\Delta_n(f, \lambda))\,d\pi_0(f) \ge \frac{1}{\delta}\Big) \le \delta. \tag{5.2.4}
\]
Now, as in the proof of Theorem 5.2.1, we use the Donsker-Varadhan Theorem 5.1.1 (change of
measure), which implies that
\[
n\int \Delta_n(f, \lambda)\,d\pi(f) \le D_{\mathrm{kl}}(\pi\|\pi_0) + \log\int \exp(n\Delta_n(f, \lambda))\,d\pi_0(f)
\]
for all distributions $\pi$. Using inequality (5.2.4), we obtain that with probability at least $1 - \delta$,
\[
\int \Delta_n(f, \lambda)\,d\pi(f) \le \frac{1}{n}\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{1}{\delta}\Big]
\]
for all π. As this holds for any fixed |λ| ≤ 1/(2b), this gives the desired result by rearranging.
We would like to optimize over the bound in Proposition 5.2.4 by choosing the “best” λ. If we
could choose the optimal λ, by rearranging Proposition 5.2.4 we would obtain the bound
\[
E_\pi[Pf] \le E_\pi[P_n f] + \inf_{\lambda > 0}\Big\{\lambda E_\pi[\sigma^2(f)] + \frac{1}{n\lambda}\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{1}{\delta}\Big]\Big\}
= E_\pi[P_n f] + 2\sqrt{\frac{E_\pi[\sigma^2(f)]}{n}\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{1}{\delta}\Big]}
\]
simultaneously for all π, with probability at least 1−δ. The problem with this approach is two-fold:
first, we cannot arbitrarily choose λ in Proposition 5.2.4, and second, the bound above depends on
the unknown population variance σ 2 (f ). It is thus of interest to understand situations in which
we can obtain similar guarantees, but where we can replace unknown population quantities on the
right side of the bound with known quantities.
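The closed form for the infimum over λ above is an instance of minimizing $\lambda V + K/(n\lambda)$, which is attained at $\lambda = \sqrt{K/(nV)}$ with value $2\sqrt{VK/n}$. The sketch below checks this numerically; `V` stands in for $E_\pi[\sigma^2(f)]$ and `K` for $D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{1}{\delta}$, and all names and values are hypothetical.

```python
import math

def objective(lam, V, K, n):
    """lam * V + K / (n * lam): the quantity inside the infimum over lambda."""
    return lam * V + K / (n * lam)

def best_lambda(V, K, n):
    """Unconstrained minimizer, obtained by setting the derivative to zero."""
    return math.sqrt(K / (n * V))

V, K, n = 0.25, 5.0, 1000          # illustrative values only
lam_star = best_lambda(V, K, n)
closed_form = 2.0 * math.sqrt(V * K / n)

# Scan a grid of multiples of lam_star; none should beat the closed form.
grid_min = min(objective(lam_star * (0.5 + 0.001 * j), V, K, n) for j in range(1, 1500))
print(lam_star, closed_form, grid_min)
```

Note that this unconstrained minimizer ignores the restriction $|\lambda| \le 1/(2b)$ in Proposition 5.2.4, which is one of the two obstacles the text goes on to describe.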
To that end, let us consider the following condition, a type of relative error condition related
to the Bernstein condition (4.1.7): for each f ∈ F,
\[
\sigma^2(f) \le b\,Pf. \tag{5.2.5}
\]
This condition is most natural when each of the functions $f$ takes nonnegative values, for example,
when $f(X) = \ell(\theta, X)$ for some loss function $\ell$ and parameter $\theta$ of a model. If the functions $f$ are
nonnegative and upper bounded by $b$, then we certainly have $\sigma^2(f) \le E[f(X)^2] \le bE[f(X)] = b\,Pf$,
so that Condition (5.2.5) holds. Revisiting Proposition 5.2.4, we rearrange to obtain the following
theorem.
Theorem 5.2.5. Let $\mathcal{F}$ be a collection of functions satisfying the Bernstein condition (5.2.3) as in
Proposition 5.2.4, and in addition, assume the variance-bounding condition (5.2.5). Then for any
$0 \le \lambda \le \frac{1}{2b}$, with probability at least $1 - \delta$,
\[
E_\pi[Pf] \le E_\pi[P_n f] + \frac{\lambda b}{1 - \lambda b}E_\pi[P_n f] + \frac{1}{\lambda(1 - \lambda b)}\cdot\frac{1}{n}\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{1}{\delta}\Big]
\]
for all $\pi$.
apply Proposition 5.2.4, and divide both sides of the resulting inequality by λ(1 − λb).
To make this uniform in λ, thus achieving a tighter bound (so that we need not pre-select λ),
we choose multiple values of λ and apply a union bound. To that end, let $1 + \eta = \frac{1}{1 - \lambda b}$, or $\eta = \frac{\lambda b}{1 - \lambda b}$
and $\frac{1}{\lambda b(1 - \lambda b)} = \frac{(1 + \eta)^2}{\eta}$, so that the inequality in Theorem 5.2.5 is equivalent to
\[
E_\pi[Pf] \le E_\pi[P_n f] + \eta E_\pi[P_n f] + \frac{(1 + \eta)^2}{\eta}\cdot\frac{b}{n}\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{1}{\delta}\Big].
\]
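Corollary 5.2.6. Let the conditions of Theorem 5.2.5 hold. Then with probability at least $1 - \delta$,
\[
E_\pi[Pf] \le E_\pi[P_n f] + 2\sqrt{\frac{b\,E_\pi[P_n f]}{n}\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta}\Big]}
+ \frac{1}{n}\Big(E_\pi[P_n f] + 5b\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta}\Big]\Big),
\]
simultaneously for all $\pi$ on $\mathcal{F}$.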
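for each $\eta \in \{1/n, \ldots, 1\}$. We consider two cases. In the first, assume that
$E_\pi[P_n f] \le \frac{b}{n}(D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta})$. Then taking $\eta = 1$ above evidently gives the result. In the second, we have
$E_\pi[P_n f] > \frac{b}{n}(D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta})$, and we can set
\[
\eta_\star = \sqrt{\frac{\frac{b}{n}\big(D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta}\big)}{E_\pi[P_n f]}} \in (0, 1).
\]
Choosing $\eta$ to be the smallest value $\eta_k$ in $\{\eta_1, \ldots, \eta_n\}$ with $\eta_k \ge \eta_\star$, so that $\eta_\star \le \eta \le \eta_\star + \frac{1}{n}$, then
implies the claim in the corollary.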
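The two-case argument can be checked numerically. The sketch below minimizes the η-parameterized bound over the grid $\{1/n, 2/n, \ldots, 1\}$, with $\delta$ replaced by $\delta/n$ for the union bound, and compares the result with the closed form of Corollary 5.2.6. The function names and test values are illustrative assumptions: `emp` stands for $E_\pi[P_n f]$ and `kl` for $D_{\mathrm{kl}}(\pi\|\pi_0)$.

```python
import math

def grid_bound(emp, kl, b, n, delta):
    """Best eta-parameterized bound over eta in {1/n, ..., 1} after a union bound."""
    K = kl + math.log(n / delta)           # log(1/delta) becomes log(n/delta)
    etas = [j / n for j in range(1, n + 1)]
    return min((1 + eta) * emp + (1 + eta) ** 2 / eta * (b / n) * K for eta in etas)

def corollary_bound(emp, kl, b, n, delta):
    """Right-hand side of Corollary 5.2.6."""
    K = kl + math.log(n / delta)
    return emp + 2 * math.sqrt(b * emp * K / n) + (emp + 5 * b * K) / n

for emp in (0.0, 0.01, 0.3):
    print(emp, grid_bound(emp, 3.0, 1.0, 500, 0.05), corollary_bound(emp, 3.0, 1.0, 500, 0.05))
```

In each case the grid minimum sits at or below the corollary's right-hand side, as the proof predicts.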
We call the quantity $\langle\theta, x\rangle y$ the margin of $\theta$ on the pair $(x, y)$, noting that when the margin is
large, $\langle\theta, x\rangle$ has the same sign as $y$ and is “confident” (i.e. far from zero). For shorthand, let us
define the expected and empirical losses at margin $\gamma$ by
Consider the following scenario: the data $x$ lie in a ball of radius $b$, so that $\|x\|_2 \le b$; note that
the losses $\ell_\gamma$ and $\ell_0$ satisfy the Bernstein (5.2.3) and self-bounding (5.2.5) conditions with constant
1 as they take values in $\{0, 1\}$. We then have the following proposition.
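Proposition 5.2.7. Let the above conditions on the data $(x, y)$ hold and let the margin $\gamma > 0$ and
radius $r < \infty$. Then with probability at least $1 - \delta$,
\[
P(\langle\theta, X\rangle Y \le 0) \le \Big(1 + \frac{1}{n}\Big)P_n(\langle\theta, X\rangle Y \le \gamma)
+ 8\,\frac{rb\sqrt{\log\frac{n}{\delta}}}{\gamma\sqrt{n}}\sqrt{P_n(\langle\theta, X\rangle Y \le \gamma)}
+ C\,\frac{r^2b^2\log\frac{n}{\delta}}{\gamma^2 n}
\]
simultaneously for all $\|\theta\|_2 \le r$, where $C$ is a numerical constant independent of the problem
parameters.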
Proposition 5.2.7 provides a “dimension-free” guarantee (it depends only on the $\ell_2$-norms $\|\theta\|_2$
and $\|x\|_2$), so that it can apply equally in infinite dimensional spaces. The key to the inequality
is that if we can find a large margin predictor, for example, one achieved by a support vector
machine or, more broadly, by minimizing a convex loss of the form
\[
\mathop{\mathrm{minimize}}_{\|\theta\|_2 \le r} \;\; \frac{1}{n}\sum_{i=1}^n \phi(\langle X_i, \theta\rangle Y_i)
\]
for some decreasing convex $\phi : \mathbb{R} \to \mathbb{R}_+$, e.g. $\phi(t) = [1 - t]_+$ or $\phi(t) = \log(1 + e^{-t})$, then we get
strong generalization performance guarantees relative to the empirical margin $\gamma$. As one particular
instantiation of this approach, suppose we can obtain a perfect classifier with positive margin: a
vector $\theta$ with $\|\theta\|_2 \le r$ such that $\langle\theta, X_i\rangle Y_i \ge \gamma$ for each $i = 1, \ldots, n$. Then Proposition 5.2.7
guarantees that
\[
P(\langle\theta, X\rangle Y \le 0) \le C\,\frac{r^2b^2\log\frac{n}{\delta}}{\gamma^2 n}
\]
with probability at least $1 - \delta$.
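Proof Let $\pi_0$ be $N(0, \tau^2 I)$ for some $\tau > 0$ to be chosen, and let $\pi$ be $N(\hat\theta, \tau^2 I)$ for some $\hat\theta \in \mathbb{R}^d$
satisfying $\|\hat\theta\|_2 \le r$. Then Corollary 5.2.6 implies that
\begin{align*}
E_\pi[L_\gamma(\theta)]
& \le E_\pi[\hat{L}_\gamma(\theta)] + 2\sqrt{\frac{E_\pi[\hat{L}_\gamma(\theta)]}{n}\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta}\Big]}
+ \frac{1}{n}\Big(E_\pi[\hat{L}_\gamma(\theta)] + C\Big[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta}\Big]\Big) \\
& \le E_\pi[\hat{L}_\gamma(\theta)] + 2\sqrt{\frac{E_\pi[\hat{L}_\gamma(\theta)]}{n}\Big[\frac{r^2}{2\tau^2} + \log\frac{n}{\delta}\Big]}
+ \frac{1}{n}\Big(E_\pi[\hat{L}_\gamma(\theta)] + C\Big[\frac{r^2}{2\tau^2} + \log\frac{n}{\delta}\Big]\Big)
\end{align*}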
Let us use the margin assumption. Note that if $Z \sim N(0, \tau^2 I)$, then for any fixed $\theta_0, x, y$ we
have
\[
\ell_0(\theta_0; (x, y)) - P(Z^\top x \ge \gamma) \le E[\ell_\gamma(\theta_0 + Z; (x, y))] \le \ell_{2\gamma}(\theta_0; (x, y)) + P(Z^\top x \ge \gamma)
\]
where the middle expectation is over $Z \sim N(0, \tau^2 I)$. Using the $\tau^2\|x\|_2^2$-sub-Gaussianity of $Z^\top x$, we
can obtain immediately that if $\|x\|_2 \le b$, we have
\[
\ell_0(\theta_0; (x, y)) - \exp\Big(-\frac{\gamma^2}{2\tau^2b^2}\Big) \le E[\ell_\gamma(\theta_0 + Z; (x, y))] \le \ell_{2\gamma}(\theta_0; (x, y)) + \exp\Big(-\frac{\gamma^2}{2\tau^2b^2}\Big).
\]
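Returning to our earlier bound, we evidently have that if $\|x\|_2 \le b$ for all $x \in \mathcal{X}$, then with
probability at least $1 - \delta$, simultaneously for all $\theta \in \mathbb{R}^d$ with $\|\theta\|_2 \le r$,
\begin{align*}
L_0(\theta) & \le \hat{L}_{2\gamma}(\theta) + 2\exp\Big(-\frac{\gamma^2}{2\tau^2b^2}\Big)
+ 2\sqrt{\frac{\hat{L}_{2\gamma}(\theta) + \exp(-\frac{\gamma^2}{2\tau^2b^2})}{n}\Big[\frac{r^2}{2\tau^2} + \log\frac{n}{\delta}\Big]} \\
& \qquad + \frac{1}{n}\Big(\hat{L}_{2\gamma}(\theta) + \exp\Big(-\frac{\gamma^2}{2\tau^2b^2}\Big) + C\Big[\frac{r^2}{2\tau^2} + \log\frac{n}{\delta}\Big]\Big).
\end{align*}
Setting $\tau^2 = \frac{\gamma^2}{2b^2\log n}$, so that $\exp(-\frac{\gamma^2}{2\tau^2b^2}) = \frac{1}{n}$ and $\frac{r^2}{2\tau^2} = \frac{r^2b^2\log n}{\gamma^2}$, we immediately see that for any
choice of margin $\gamma > 0$, we have with probability at least $1 - \delta$ that
\begin{align*}
L_0(\theta) & \le \hat{L}_{2\gamma}(\theta) + \frac{2}{n}
+ 2\sqrt{\frac{1}{n}\Big[\hat{L}_{2\gamma}(\theta) + \frac{1}{n}\Big]\Big[\frac{r^2b^2\log n}{\gamma^2} + \log\frac{n}{\delta}\Big]} \\
& \qquad + \frac{1}{n}\Big(\hat{L}_{2\gamma}(\theta) + \frac{1}{n} + C\Big[\frac{r^2b^2\log n}{\gamma^2} + \log\frac{n}{\delta}\Big]\Big)
\end{align*}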
\[
E[(P_n F - PF)^2 \mid X_1^n] \le \frac{8\sigma^2}{3n}\Big[D_{\mathrm{kl}}(\pi(\cdot \mid X_1^n)\|\pi_0) + \log\frac{2}{\delta}\Big],
\]
where the expectation is taken over $F \sim \pi(\cdot \mid X_1^n)$, leaving the sample fixed. Now, consider choosing
$\pi_0$ to be the average over all samples $X_1^n$ of $\pi$, that is, $\pi_0(\cdot) = E_P[\pi(\cdot \mid X_1^n)]$, the expectation taken
over $X_1^n \stackrel{\mathrm{iid}}{\sim} P$. Then by definition of mutual information,
Corollary 5.2.8. Let $F$ be chosen according to any distribution $\pi(\cdot \mid X_1^n)$ conditional on the sample
$X_1^n$. Then with probability at least $1 - \delta_0 - \delta_1$ over the sample $X_1^n \stackrel{\mathrm{iid}}{\sim} P$,
\[
E[(P_n F - PF)^2 \mid X_1^n] \le \frac{8\sigma^2}{3n}\Big[\frac{I(F; X_1^n)}{\delta_0} + \log\frac{2}{\delta_1}\Big].
\]
This corollary shows that if we have any procedure (say, a learning procedure or otherwise)
that limits the information between a sample $X_1^n$ and an output $F$, then we are guaranteed that
$F$ generalizes. Tighter analyses are possible, though they are not our focus here; the point is that
already there should be an inkling that limiting information between input samples and outputs
may be fruitful.
information about the sample. The starting point of our approach is similar to our analysis of
PAC-Bayesian learning and generalization: we observe that if the function we decide to compute
on the data X1n is chosen without much information about the data at hand, then its value on the
sample should be similar to its values on the full population. This insight dovetails with what we
have seen thus far, that appropriate “stability” in information can be useful and guarantee good
future performance.
by Corollary 4.1.10 (sub-Gaussian concentration) and a union bound. Thus, so long as |Φ| is not
exponential in the sample size n, we expect uniformly high accuracy.
Example 5.3.1 (Risk minimization via statistical queries): Suppose that we are in the loss-
minimization setting (4.4.2), where the losses $\ell(\theta, X_i)$ are convex and differentiable in $\theta$. Then
gradient descent applied to $\hat{L}_n(\theta) = P_n\ell(\theta, X)$ will converge to a minimizing value of $\hat{L}_n$. We
can evidently implement gradient descent by a sequence of statistical queries $\phi(x) = \nabla_\theta\ell(\theta, x)$,
iterating
\[
\theta^{(k+1)} = \theta^{(k)} - \alpha_k P_n\phi^{(k)}, \tag{5.3.2}
\]
where $\phi^{(k)} = \nabla_\theta\ell(\theta^{(k)}, x)$ and $\alpha_k$ is a stepsize. ◇
One issue with Example 5.3.1 is that we are interacting with the dataset, because each
sequential query $\phi^{(k)}$ depends on the previous $k - 1$ queries. (Our results on uniform convergence
of empirical functionals and related ideas address many of these challenges, so that the result of
the process (5.3.2) will be well-behaved regardless of the interactivity.)
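The iteration (5.3.2) can be sketched in a few lines, under the assumption of a squared loss $\ell(\theta, (u, y)) = \frac{1}{2}(\langle\theta, u\rangle - y)^2$ and synthetic data. The names `statistical_query`, `grad_query`, and `sq_gradient_descent` are hypothetical, and the "mechanism" here simply returns exact empirical means.

```python
import random

def statistical_query(sample, phi):
    """Mechanism: answer a vector-valued query phi with its empirical mean P_n phi."""
    vals = [phi(x) for x in sample]
    d = len(vals[0])
    return [sum(v[j] for v in vals) / len(vals) for j in range(d)]

def grad_query(theta):
    """Query phi^(k)(x): gradient in theta of the assumed squared loss at x = (u, y)."""
    def phi(x):
        u, y = x
        resid = sum(t * ui for t, ui in zip(theta, u)) - y
        return [resid * ui for ui in u]
    return phi

def sq_gradient_descent(sample, d, steps=200, alpha=0.1):
    """Iterate theta <- theta - alpha * P_n phi^(k); each query depends on past answers."""
    theta = [0.0] * d
    for _ in range(steps):
        g = statistical_query(sample, grad_query(theta))
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
    return theta

rng = random.Random(0)
true_theta = [1.0, -2.0]               # synthetic ground truth, for illustration only
sample = []
for _ in range(500):
    u = [rng.gauss(0, 1), rng.gauss(0, 1)]
    y = sum(t * ui for t, ui in zip(true_theta, u)) + 0.1 * rng.gauss(0, 1)
    sample.append((u, y))

theta_hat = sq_gradient_descent(sample, d=2)
print(theta_hat)
```

The analyst never touches `sample` directly; it only sees the answers to its queries, and each new query is built from the previous answers, which is exactly the interactivity at issue.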
We consider an interactive version of the statistical query estimation problem. In this version,
there are two parties: an analyst (or statistician or learner), who issues queries φ : X → R, and
a mechanism that answers the queries to the analyst. We index our functionals φ by t ∈ T for a
(possibly infinite) set T , so we have a collection {φt }t∈T . In this context, we thus have the following
scheme:
Input: Sample X1n drawn i.i.d. P , collection {φt }t∈T of possible queries
Repeat: for k = 1, 2, . . .
Of interest in the iteration in Figure 5.1 is that we interactively choose $T_1, T_2, \ldots, T_k$, where the choice $T_i$
may depend on our approximations of $E_P[\phi_{T_j}(X)]$ for $j < i$, that is, on the results of our previous
queries. Even more broadly, the analyst may be able to choose the index $T_k$ in alternative ways
depending on the sample $X_1^n$, and our goal is to still be able to accurately compute expectations
$P\phi_T = E_P[\phi_T(X)]$ when the index $T$ may depend on $X_1^n$. The setting in Figure 5.1 clearly breaks
with the classical statistical setting in which an analysis is pre-specified before collecting data, but
more closely captures modern data exploration practices.
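Theorem 5.3.2. Let $\{\phi_t\}_{t \in \mathcal{T}}$ be a collection of $\sigma^2$-sub-Gaussian functions $\phi_t : \mathcal{X} \to \mathbb{R}$. Then for
any random variable $T$ and any $\lambda > 0$,
\[
E[(P_n\phi_T - P\phi_T)^2] \le \frac{1}{\lambda}\Big[I(X_1^n; T) - \frac{1}{2}\log\big[1 - 2\lambda\sigma^2/n\big]_+\Big]
\]
and
\[
|E[P_n\phi_T] - E[P\phi_T]| \le \sqrt{\frac{2\sigma^2}{n}I(X_1^n; T)}
\]
where the expectations are taken over $T$ and the sample $X_1^n$.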
Proof The proof is similar to that of our first basic PAC-Bayes result in Theorem 5.2.1. Let
us assume w.l.o.g. that $P\phi_t = 0$ for all $t \in \mathcal{T}$, noting that then $P_n\phi_t$ is $\sigma^2/n$-sub-Gaussian. We
prove the first result first. Lemma 5.2.2 implies that $E[\exp(\lambda(P_n\phi_t)^2)] \le [1 - 2\lambda\sigma^2/n]_+^{-1/2}$ for each
inequality (iii) is Lemma 5.2.2.) Now, let $\pi_0$ be the marginal distribution on $T$ (marginally over
all observations $X_1^n$), and let $\pi$ denote the posterior of $T$ conditional on the sample $X_1^n$. Then
$E[D_{\mathrm{kl}}(\pi\|\pi_0)] = I(X_1^n; T)$ by definition of the mutual information, giving the bound on the squared
error.
For the second result, note that the Donsker-Varadhan equality implies
\[
\lambda E\Big[\int P_n\phi_t\,d\pi(t)\Big] \le E[D_{\mathrm{kl}}(\pi\|\pi_0)] + \log\int E[\exp(\lambda P_n\phi_t)]\,d\pi_0(t) \le I(X_1^n; T) + \frac{\lambda^2\sigma^2}{2n}.
\]
Dividing both sides by $\lambda$ and minimizing over $\lambda > 0$ (the minimum is at $\lambda = \sqrt{2nI(X_1^n; T)/\sigma^2}$) gives
$E[P_n\phi_T] \le \sqrt{2\sigma^2 I(X_1^n; T)/n}$, and performing the same analysis with
$-\phi_T$ gives the second result of the theorem.
The key in the theorem is that if the mutual information (the Shannon information) $I(X; T)$
between the sample $X$ and $T$ is small, then the expected squared error can be small. To make this
a bit clearer, let us choose values for $\lambda$ in the theorem; taking $\lambda = \frac{n}{2e\sigma^2}$ gives the following corollary.
for a numerical constant $C$. That is, powers of sub-Gaussian maxima grow at most logarithmically.
Indeed, by Theorem 4.1.11, we have for any $q \ge 1$ by Hölder's inequality that
\[
E[\max_j |Z_j|^p] \le E\Big[\sum_j |Z_j|^{pq}\Big]^{1/q} \le k^{1/q}(Cpq\sigma^2)^{p/2},
\]
and setting $q = \log k$ gives the inequality. Thus, we see that for any a priori fixed $v_1, \ldots, v_k, v_{k+1}$,
we have
\[
E[\max_j (v_j^T(P_n YX))^2] \le O(1)\frac{\log k}{n}.
\]
If instead we allow a single interaction, the problem is different. We issue queries associated
with $v = e_1, \ldots, e_k$, the $k$ standard basis vectors; then we simply set $V_{k+1} = P_n YX / \|P_n YX\|_2$.
Then evidently
\[
E[(V_{k+1}^T(P_n YX))^2] = E[\|P_n YX\|_2^2] = \frac{k}{n},
\]
which is exponentially larger than in the non-interactive case. That is, if an analyst is allowed
to interact with the dataset, he or she may be able to discover very large correlations that are
certainly false in the population, which in this case has $PXY = 0$. ◇
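A small simulation makes the gap concrete. It assumes, purely for illustration, that $X$ is uniform on $\{-1, +1\}^k$ and $Y$ is an independent uniform sign, so that each coordinate of $Y_iX_i$ is a uniform random sign and $PXY = 0$; the function name `simulate` and the sizes used are hypothetical.

```python
import random

def simulate(n, k, trials, seed=0):
    """Compare the k fixed queries e_1..e_k with one adaptively chosen query V_{k+1}."""
    rng = random.Random(seed)
    fixed, adaptive = 0.0, 0.0
    for _ in range(trials):
        # m[j] is the j-th coordinate of P_n[Y X]; each summand is a uniform sign.
        m = [sum(rng.choice((-1.0, 1.0)) for _ in range(n)) / n for _ in range(k)]
        fixed += max(v * v for v in m)     # max_j (e_j^T P_n YX)^2
        adaptive += sum(v * v for v in m)  # (V_{k+1}^T P_n YX)^2 = ||P_n YX||_2^2
    return fixed / trials, adaptive / trials

fixed_mse, adaptive_mse = simulate(n=50, k=20, trials=300)
print(fixed_mse, adaptive_mse, 20 / 50)
```

The adaptive query's squared value concentrates near $k/n$, while the largest of the pre-specified queries is several times smaller, even though every population correlation is zero.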
Example 5.3.4 shows that, without being a little careful, substantial issues may arise in interac-
tive data analysis scenarios. When we consider our goal more broadly, which is to be able to provide
accurate approximations to P φ for queries φ chosen adaptively for any population distribution P
and φ : X → [−1, 1], it is possible to construct quite perverse situations, where if we compute
sample expectations Pn φ exactly, one round of interaction is sufficient to find a query φ for which
Pn φ − P φ ≥ 1.
Example 5.3.5 (Exact query answering allows arbitrary corruption): Suppose we draw a
sample $X_1^n$ of size $n$ on a sample space $\mathcal{X} = [m]$ with $X_i \stackrel{\mathrm{iid}}{\sim} \mathrm{Uniform}([m])$, where $m \ge 2n$. Let
$\Phi$ be the collection of all functions $\phi : [m] \to [-1, 1]$, so that $P(|P_n\phi - P\phi| \ge t) \le \exp(-nt^2/2)$
for any fixed $\phi$. Suppose that in the interactive scheme in Fig. 5.1, we simply release answers
$A = P_n\phi$. Consider the following query:
More generally, when one performs an interactive data analysis (e.g. as in Fig. 5.1), adapting
hypotheses while interacting with a dataset, it is not a question of statistical significance or mul-
tiplicity control for the analysis one does, but for all the possible analyses one might have done
otherwise. Given the branching paths one might take in an analysis, it is clear that we require
some care.
With that in mind, we consider the desiderata for techniques we might use to control information
in the indices we select. We seek some type of stability in the information algorithms provide
to a data analyst—intuitively, if small changes to a sample do not change the behavior of an
analyst substantially, then we expect to obtain reasonable generalization bounds. If outputs of a
particular analysis procedure carry little information about a particular sample (but instead provide
information about a population), then Corollary 5.3.3 suggests that any estimates we obtain should
be accurate.
To develop this stability theory, we require two conditions: first, that whatever quantity we
develop for stability should compose adaptively, meaning that if we apply two (randomized) algo-
rithms to a sample, then if both are appropriately stable, even if we choose the second algorithm
because of the output of the first in arbitrary ways, they should remain jointly stable. Second, our
notion should bound the mutual information $I(X_1^n; T)$ between the sample $X_1^n$ and $T$. Lastly, we
remark that this control on the mutual information has an additional benefit: by the data processing
inequality, any downstream analysis we perform that depends only on $T$ necessarily satisfies the
same stability and information guarantees as $T$, because if we have the Markov chain $X_1^n \to T \to V$
then $I(X_1^n; V) \le I(X_1^n; T)$.
We consider randomized algorithms $A : \mathcal{X}^n \to \mathcal{A}$, taking values in our index set $\mathcal{A}$, where
$A(X_1^n) \in \mathcal{A}$ is a random variable that depends on the sample $X_1^n$. For simplicity in derivation,
we abuse notation in this section, and for random variables $X$ and $Y$ with distributions $P$ and $Q$
respectively, we denote
\[
D_{\mathrm{kl}}(X\|Y) := D_{\mathrm{kl}}(P\|Q).
\]
We then ask for a type of leave-one-out stability for the algorithms $A$, where $A$ is insensitive to the
changes of a single example (on average).
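\[
\frac{1}{n}\sum_{i=1}^n D_{\mathrm{kl}}\big(A(x_1^n)\|A(x_{\setminus i})\big) = \frac{1}{2n\sigma^2}\sum_{i=1}^n \frac{1}{n^2}x_i^2 \le \frac{1}{2\sigma^2n^2},
\]
so that the sample mean of a bounded random variable perturbed with Gaussian noise is
$\varepsilon = \frac{1}{2\sigma^2n^2}$-KL-stable. ◇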
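The computation above can be reproduced directly. The sketch assumes the mechanism releases $\frac{1}{n}\sum_j x_j + Z$ with $Z \sim N(0, \sigma^2)$, that the leave-one-out comparison mechanism releases $\frac{1}{n}\sum_{j \ne i} x_j + Z$ (an assumption, since the definition is not shown in this excerpt), and that $|x_i| \le 1$. The helper names are hypothetical.

```python
def kl_gauss_same_var(mu0, mu1, sigma2):
    """D_kl(N(mu0, sigma2) || N(mu1, sigma2)) = (mu0 - mu1)^2 / (2 sigma2)."""
    return (mu0 - mu1) ** 2 / (2.0 * sigma2)

def avg_loo_kl(x, sigma2):
    """Average over i of the KL divergence between full and leave-one-out releases."""
    n = len(x)
    total_sum = sum(x)
    full_mean = total_sum / n
    total = 0.0
    for i in range(n):
        loo_mean = (total_sum - x[i]) / n   # assumed A_i: drop x_i, keep the 1/n scaling
        total += kl_gauss_same_var(full_mean, loo_mean, sigma2)
    return total / n

x = [((-1) ** i) * (i % 7) / 7.0 for i in range(40)]   # illustrative data in [-1, 1]
sigma2 = 0.5
print(avg_loo_kl(x, sigma2), 1.0 / (2.0 * sigma2 * len(x) ** 2))
```

The bound is attained when every $|x_i| = 1$, and the stability parameter shrinks as the noise variance $\sigma^2$ or the sample size $n$ grows.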
Example 5.3.7 (KL-stability in mean estimation: Laplace noise addition): Let the conditions
of Example 2.1.7 hold, but suppose instead of Gaussian noise we add scaled Laplace noise,
We require a bit of notational trickery now. Fixing $i$, let $P_{A,A'}$ be the joint distribution of
$A'(A(x_1^n), x_1^n)$ and $A(x_1^n)$ and $Q_{A,A'}$ the joint distribution of $A'_i(A_i(x_{\setminus i}), x_{\setminus i})$ and $A_i(x_{\setminus i})$, so that
they are both distributions over $\mathcal{A}_1 \times \mathcal{A}_0$. Let $P_{A' \mid t}$ be the distribution of $A'(t, x_1^n)$ and similarly
$Q_{A' \mid t}$ is the distribution of $A'_i(t, x_{\setminus i})$. Note that $A'$, $A'_i$ both “observe” $x$, so that using the chain
rule (2.1.6) for KL-divergences, we have
\begin{align*}
D_{\mathrm{kl}}\big(A' \circ A, A \,\|\, A'_i \circ A_i, A_i\big) & = D_{\mathrm{kl}}\big(P_{A,A'}\|Q_{A,A'}\big) \\
& = D_{\mathrm{kl}}(P_A\|Q_A) + \int D_{\mathrm{kl}}\big(P_{A' \mid t}\|Q_{A' \mid t}\big)\,dP_A(t)
\end{align*}
as desired.
The second key result is that KL-stable algorithms also bound the mutual information of a
random function.
Lemma 5.3.9. Let $X_i$ be independent. Then for any random variable $A$,
\[
I(A; X_1^n) \le \sum_{i=1}^n I(A; X_i \mid X_{\setminus i}) = \sum_{i=1}^n \int D_{\mathrm{kl}}\big(A(x_1^n)\|A_i(x_{\setminus i})\big)\,dP(x_1^n),
\]
Proof Without loss of generality, we assume $A$ and $X$ are both discrete. In this case, we have
\[
I(A; X_1^n) = \sum_{i=1}^n I(A; X_i \mid X_1^{i-1}) = \sum_{i=1}^n H(X_i \mid X_1^{i-1}) - H(X_i \mid A, X_1^{i-1}).
\]
Now, because the $X_i$ follow a product distribution, $H(X_i \mid X_1^{i-1}) = H(X_i)$, while
$H(X_i \mid A, X_1^{i-1}) \ge H(X_i \mid A, X_{\setminus i})$ because conditioning reduces entropy. Consequently, we have
\[
I(A; X_1^n) \le \sum_{i=1}^n H(X_i) - H(X_i \mid A, X_{\setminus i}) = \sum_{i=1}^n I(A; X_i \mid X_{\setminus i}).
\]
To see the final equality, note that
\begin{align*}
I(A; X_i \mid X_{\setminus i}) & = \int_{\mathcal{X}^{n-1}} I(A; X_i \mid X_{\setminus i} = x_{\setminus i})\,dP(x_{\setminus i}) \\
& = \int_{\mathcal{X}^{n-1}}\int_{\mathcal{X}} D_{\mathrm{kl}}\big(A(x_1^n)\|A(x_{1:i-1}, X_i, x_{i+1:n})\big)\,dP(x_i)\,dP(x_{\setminus i})
\end{align*}
by definition of mutual information as $I(X; Y) = E_X[D_{\mathrm{kl}}(P_{Y \mid X}\|P_Y)]$.
Combining Lemmas 5.3.8 and 5.3.9, we see (nearly) immediately that KL stability implies
a mutual information bound, and consequently even interactive KL-stable algorithms maintain
bounds on mutual information.
Proposition 5.3.10. Let $A_1, \ldots, A_k$ be $\varepsilon_i$-KL-stable procedures, respectively, composed in any
arbitrary sequence. Let $X_i$ be independent. Then
\[
\frac{1}{n}I(A_1, \ldots, A_k; X_1^n) \le \sum_{i=1}^k \varepsilon_i.
\]
Proof Applying Lemma 5.3.9,
\[
I(A_1^k; X_1^n) \le \sum_{i=1}^n I(A_1^k; X_i \mid X_{\setminus i}) = \sum_{i=1}^n\sum_{j=1}^k I(A_j; X_i \mid X_{\setminus i}, A_1^{j-1}).
\]
where $A(a', x_1^n)$ is the (random) procedure $A$ on inputs $x_1^n$ and $a'$, while $A(a', x_{\setminus i})$ denotes the
(random) procedure $A$ on input $a', x_{\setminus i}, X_i$, and where the $i$th example $X_i$ follows its distribution
conditional on $A' = a'$ and $X_{\setminus i} = x_{\setminus i}$, as in Lemma 5.3.9. We then recognize that for each $i$, we
have
\[
\int D_{\mathrm{kl}}\big(A(a', x_1^n)\|A(a', x_{\setminus i})\big)\,dP(x_i \mid a', x_{\setminus i}) \le \int D_{\mathrm{kl}}\big(A(a', x_1^n)\|\tilde{A}(a', x_{\setminus i})\big)\,dP(x_i \mid a', x_{\setminus i})
\]
for any randomized function $\tilde{A}$, as the marginal $A$ in the lemma minimizes the average KL-divergence
(recall Exercise 2.15). Now, sum over $i$ and apply the definition of KL-stability as
in Lemma 5.3.8.
Input: Sample X1n ∈ X n drawn i.i.d. P , collection {φt }t∈T of possible queries φt : X →
[−1, 1]
Repeat: for k = 1, 2, . . .
This procedure is evidently KL-stable, and based on Example 5.3.6 and Proposition 5.3.10, we
have that
\[
\frac{1}{n}I(X_1^n; T_1, \ldots, T_k, T_{k+1}) \le \frac{k}{2\sigma^2n^2}
\]
so long as the indices $T_i \in \mathcal{T}$ are chosen only as functions of $P_n\phi + Z_j$ for $j < i$, as the classical
data processing inequality implies that
\[
\frac{1}{n}I(X_1^n; T_1, \ldots, T_k, T_{k+1}) \le \frac{1}{n}I(X_1^n; A_1, \ldots, A_k)
\]
because we have $X_1^n \to A_1 \to T_2$ and so on for the remaining indices. With this, we obtain the
following theorem.
Theorem 5.3.11. Let the indices $T_i$, $i = 1, \ldots, k + 1$ be chosen in an arbitrary way using the
procedure 5.2, and let $\sigma^2 > 0$. Then
\[
E\Big[\max_{j \le k}(A_j - P\phi_{T_j})^2\Big] \le \frac{2ek}{\sigma^2n^2} + \frac{10}{4n} + 4\sigma^2(\log k + 1).
\]
By inspection, we can optimize over $\sigma^2$ by setting $\sigma^2 = \sqrt{k/(\log k + 1)}/n$, which yields the
upper bound
\[
E\Big[\max_{j \le k}(A_j - P\phi_{T_j})^2\Big] \le \frac{10}{4n} + 10\frac{\sqrt{k(1 + \log k)}}{n}.
\]
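The choice of $\sigma^2$ balances the first and last terms of Theorem 5.3.11: substituting it turns both into multiples of $\sqrt{k(1 + \log k)}/n$, with total constant $2e + 4 \approx 9.44 \le 10$. A short numerical check, with hypothetical function names and illustrative values of $n$ and $k$:

```python
import math

def mse_bound(sigma2, n, k):
    """Right-hand side of Theorem 5.3.11 as a function of the noise variance sigma2."""
    return 2 * math.e * k / (sigma2 * n**2) + 10 / (4 * n) + 4 * sigma2 * (math.log(k) + 1)

def sigma2_choice(n, k):
    """The variance suggested in the text: sqrt(k / (log k + 1)) / n."""
    return math.sqrt(k / (math.log(k) + 1)) / n

n, k = 10_000, 100
s2 = sigma2_choice(n, k)
simplified = 10 / (4 * n) + 10 * math.sqrt(k * (1 + math.log(k))) / n
print(mse_bound(s2, n, k), simplified)
```

The noise level shrinks with $n$ but grows with the number of queries $k$, reflecting that more adaptive queries require more perturbation to keep the mutual information small.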
Comparing to Example 5.3.4, we see a substantial improvement. While we do not achieve accuracy
scaling with $\log k$, as we would if the queried functionals $\phi_t$ were completely independent of the
sample, we see that we achieve mean-squared error of order
\[
\frac{\sqrt{k\log k}}{n}
\]
with probability at least $1 - \delta$ suggest that we can optimize them by choosing $\pi$ carefully. For
example, in the context of learning a statistical model parameterized by $\theta \in \Theta$ with losses $\ell(\theta; x, y)$,
it is natural to attempt to find $\pi$ minimizing
\[
E_\pi[P_n\ell(\theta; X, Y) \mid P_n] + C\sqrt{\frac{1}{n}D_{\mathrm{kl}}(\pi\|\pi_0)}
\]
in $\pi$, where the expectation is taken over $\theta \sim \pi$. If this quantity has optimal value $\epsilon_n^\star$, then one is
immediately guaranteed that for the population $P$, we have $E_\pi[P\ell(\theta; X, Y)] \le \epsilon_n^\star + C\sqrt{\log\frac{1}{\delta}}/\sqrt{n}$.
Langford and Caruana [126] take this approach, and Dziugaite and Roy [79] use it to give (the
first) non-trivial bounds for deep learning models.
The questions of interactive data analysis date back at least several decades, perhaps most
profoundly highlighted positively by Tukey's Exploratory Data Analysis [168]. Problems of scientific
replicability have, conversely, highlighted many of the challenges of reusing data or peeking, even
innocently, at samples before performing statistical analyses [113, 86, 91]. Our approach to
formalizing these ideas, and to making the limiting of information leakage rigorous, draws from a
more recent strain of work in the theoretical computer science literature, with major contributions from Dwork,
Feldman, Hardt, Pitassi, Reingold, and Roth and Bassily, Nissim, Smith, Steinke, Stemmer, and
Ullman [78, 76, 77, 20, 21]. Our particular treatment most closely follows Feldman and Steinke [82].
The problems these techniques target also arise frequently in high-dimensional statistics, where one
often wishes to estimate uncertainty and perform inference after selecting a model. While we do
not touch on these problems, a few references in this direction include [25, 166, 109].
5.5 Exercises
Exercise 5.1 (Duality in Donsker-Varadhan): Here, we give a converse result to Theorem 5.1.1,
showing that for any function h : X → R,
where the supremum is taken over probability measures. If Q has a density, the supremum may be
taken over probability measures having a density.
(a) Show the equality (5.5.1) in the case that X is discrete by directly computing the supremum.
(That is, let |X | = k, and identify probability measures P and Q with vectors p, q ∈ Rk+ .)
(b) Let Q have density q. Assume that EQ [eh(X) ] < ∞ and let
(c) If $E_Q[e^{h(X)}] = +\infty$, then monotone convergence implies that $\lim_{B \uparrow \infty} E_Q[e^{\min\{B, h(X)\}}] = +\infty$.
Conclude (5.5.1).
124
Lexture Notes on Statistics and Information Theory John Duchi
Exercise 5.2 (An alternative PAC-Bayes bound): Let $f : \Theta \times \mathcal{X} \to \mathbb{R}$, and let $\pi_0$ be a density
on $\theta \in \Theta$. Use the dual form (5.5.1) of the variational representation of the KL-divergence to show
that with probability at least $1 - \delta$ over the draw of $X_1^n \stackrel{\mathrm{iid}}{\sim} P$,
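The function $\psi(t) = \min\{1, \max\{-1, t\}\}$ (the truncation of $t$ to the range $[-1, 1]$) is such a function.
Let $\pi_\theta$ be the normal distribution $N(\theta, \sigma^2 I)$ and $\pi_0$ be $N(0, \sigma^2 I)$.
(a) Let $\lambda > 0$. Use Exercise 5.2 to show that with probability at least $1 - \delta$, for all $\theta \in \mathbb{R}^d$
\[
\frac{1}{\lambda}\int P_n\psi(\lambda\langle\theta', X\rangle)\pi_\theta(\theta')\,d\theta' \le \langle\theta, E[X]\rangle + \lambda\big(\theta^\top\Sigma\theta + \sigma^2\mathrm{tr}(\Sigma)\big) + \frac{\|\theta\|_2^2/2\sigma^2 + \log\frac{1}{\delta}}{n\lambda}.
\]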
Exercise 5.4 (Large-margin PAC-Bayes bounds for multiclass problems): Consider the following
multiclass prediction scenario. Data comes in pairs $(x, y) \in b\mathbb{B}_2^d \times [k]$ where $\mathbb{B}_2^d = \{v \in \mathbb{R}^d \mid \|v\|_2 \le 1\}$
denotes the $\ell_2$-ball and $[k] = \{1, \ldots, k\}$. We make predictions using predictors $\theta_1, \ldots, \theta_k \in \mathbb{R}^d$,
where the prediction of $y$ on an example $x$ is
We suffer an error whenever yb(x) 6= y, and the margin of our classifier on pair (x, y) is
If hθy , xi > hθi , xi for all i 6= y, the margin is then positive (and the prediction is correct).
(a) Develop an analogue of the bounds in Section 5.2.2 in this k-class multiclass setting. To do
so, you should (i) define the analogue of the margin-based loss `γ , (ii) show how Gaussian
perturbations leave it similar, and (iii) prove an analogue of the bound in Section 5.2.2. You
should assume one of the two conditions
\[
\text{(C1)} \;\; \|\theta_i\|_2 \le r \text{ for all } i \qquad \text{(C2)} \;\; \sum_{i=1}^k \|\theta_i\|_2^2 \le kr^2
\]
(b) Describe a minimization procedure—just a few lines suffice—that uses convex optimization to
find a (reasonably) large-margin multiclass classifier.
Exercise 5.5 (A variance-based information bound): Let $\Phi = \{\phi_t\}_{t \in \mathcal{T}}$ be a collection of functions
$\phi_t : \mathcal{X} \to \mathbb{R}$, where each $\phi_t$ satisfies the Bernstein condition (4.1.7) with parameters $\sigma^2(\phi_t)$ and $b$,
that is, $|E[(\phi_t(X) - P\phi_t(X))^k]| \le \frac{k!}{2}\sigma^2(\phi_t)b^{k-2}$ for all $k \ge 3$ and $\mathrm{Var}(\phi_t(X)) = \sigma^2(\phi_t)$. Let $T \in \mathcal{T}$
be any random variable, which may depend on an observed sample $X_1^n$. Show that for all $C > 0$
and $|\lambda| \le \frac{C}{2b}$, then
\[
E\Big[\frac{P_n\phi_T - P\phi_T}{\max\{C, \sigma(\phi_T)\}}\Big] \le \frac{1}{n|\lambda|}I(T; X_1^n) + |\lambda|.
\]
Exercise 5.6 (An information bound on variance): Let $\Phi = \{\phi_t\}_{t \in \mathcal{T}}$ be a collection of functions
$\phi_t : \mathcal{X} \to \mathbb{R}$, where each $\phi_t : \mathcal{X} \to [-1, 1]$. Let $\sigma^2(\phi_t) = \mathrm{Var}(\phi_t(X))$. Let $s_n^2(\phi) = P_n\phi^2 - (P_n\phi)^2$ be
the sample variance of $\phi$. Show that for all $C > 0$ and $0 \le \lambda \le C/4$, then
\[
E\Big[\frac{s_n^2(\phi_T)}{\max\{C, \sigma^2(\phi_T)\}}\Big] \le \frac{1}{n\lambda}I(T; X_1^n) + 2.
\]
The $\max\{C, \sigma^2(\phi_T)\}$ term is there to help avoid division by 0. Hint: If $0 \le x \le 1$, then
$e^x \le 1 + 2x$, and if $X \in [0, 1]$, then $E[e^X] \le 1 + 2E[X] \le e^{2E[X]}$. Use this to argue that
$E[e^{\lambda nP_n(\phi - P\phi)^2/\max\{C, \sigma^2\}}] \le e^{2\lambda n}$ for any $\phi : \mathcal{X} \to [-1, 1]$ with $\mathrm{Var}(\phi) \le \sigma^2$, then apply the
Donsker-Varadhan theorem.
Exercise 5.7: Consider the following scenario: let $\phi : \mathcal{X} \to [-1, 1]$ and let $\alpha > 0$, $\tau > 0$. Let
$\mu = P_n\phi$ and $s^2 = P_n\phi^2 - \mu^2$. Define $\sigma^2 = \max\{\alpha s^2, \tau^2\}$, and assume that $\tau^2 \ge \frac{5\alpha}{n}$.
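(b) Show that if $\alpha^2 \le C'\tau^2$ for a numerical constant $C' < \infty$, then we can take $\varepsilon \le O(1)\frac{1}{n^2\alpha}$.
Hint: Use exercise 2.14, and consider the “alternative” mechanisms of sampling from
\[
N(\mu_{-i}, \sigma_{-i}^2) \quad \text{where} \quad \sigma_{-i}^2 = \max\{\alpha s_{-i}^2, \tau^2\}
\]
for
\[
\mu_{-i} = \frac{1}{n - 1}\sum_{j \ne i}\phi(X_j) \quad \text{and} \quad s_{-i}^2 = \frac{1}{n - 1}\sum_{j \ne i}\phi(X_j)^2 - \mu_{-i}^2.
\]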
Input: Sample X1n ∈ X n drawn i.i.d. P , collection {φt }t∈T of possible queries φt : X →
[−1, 1], parameters α > 0 and τ > 0
Repeat: for k = 1, 2, . . .
iii. Mechanism draws independent $Z_k \sim N(0, \sigma_k^2)$ and responds with answer
\[
A_k := P_n\phi + Z_k = \frac{1}{n}\sum_{i=1}^n \phi(X_i) + Z_k.
\]
Exercise 5.8 (A general variance-dependent bound on interactive queries): Consider the algorithm
in Fig. 5.3. Let $\sigma^2(\phi_t) = \mathrm{Var}(\phi_t(X))$ be the variance of $\phi_t$.
(a) Show that for b > 0 and for all 0 ≤ λ ≤ 2b ,
\[
E\Big[\max_{j \le k}\frac{|A_j - P\phi_{T_j}|}{\max\{b, \sigma(\phi_{T_j})\}}\Big] \le \frac{1}{n\lambda}I(X_1^n; T_1^k) + \lambda
+ \sqrt{2\log(ke)}\sqrt{\frac{4\alpha}{nb^2}I(X_1^n; T_1^k) + 2\alpha + \frac{\tau^2}{b^2}}.
\]
(If you do not have quite the right constants, that's fine.)
(b) Using the result of Exercise 5.7, show that with appropriate choices for the parameters
$\alpha, b, \tau^2, \lambda$ that for a numerical constant $C < \infty$
\[
E\Big[\max_{j \le k}\frac{|A_j - P\phi_{T_j}|}{\max\{(k\log k)^{1/4}/\sqrt{n}, \sigma(\phi_{T_j})\}}\Big] \le C\frac{(k\log k)^{1/4}}{\sqrt{n}}.
\]
You may assume that $k, n$ are large if necessary.
(c) Interpret the result from part (b). How does this improve over Theorem 5.3.11?
Chapter 6
defined whenever $Z \ge 0$ with probability 1. As our particular focus throughout this chapter, we
consider the moment generating function and associated transformation $X \mapsto e^{\lambda X}$. If we know the
moment generating function $\varphi_X(\lambda) := E[e^{\lambda X}]$, then $\varphi_X'(\lambda) = E[Xe^{\lambda X}]$, and so
This suggests (in a somewhat roundabout way we make precise) that control of the entropy $H(e^{\lambda X})$
should be sufficient for controlling the moment generating function of $X$.
The Herbst argument makes this rigorous.
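Proposition 6.1.2. Let $X$ be a random variable and assume that there exists a constant $\sigma^2 < \infty$
such that
\[
H(e^{\lambda X}) \le \frac{\lambda^2\sigma^2}{2}\varphi_X(\lambda) \tag{6.1.3}
\]
for all $\lambda \in \mathbb{R}$ (respectively, $\lambda \in \mathbb{R}_+$) where $\varphi_X(\lambda) = E[e^{\lambda X}]$ denotes the moment generating function
of $X$. Then
\[
E[\exp(\lambda(X - E[X]))] \le \exp\Big(\frac{\lambda^2\sigma^2}{2}\Big)
\]
for all $\lambda \in \mathbb{R}$ (respectively, $\lambda \in \mathbb{R}_+$).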
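Proof Let $\varphi = \varphi_X$ for shorthand. The proof proceeds by an integration argument, where we
show that $\log\varphi(\lambda) \le \frac{\lambda^2\sigma^2}{2}$. First, note that
\[
\varphi'(\lambda) = E[Xe^{\lambda X}],
\]
\[
\lambda\varphi'(\lambda) - \varphi(\lambda)\log\varphi(\lambda) = H(e^{\lambda X}) \le \frac{\lambda^2\sigma^2}{2}\varphi(\lambda),
\]
and dividing both sides by $\lambda^2\varphi(\lambda)$ yields the equivalent statement
\[
\frac{\varphi'(\lambda)}{\lambda\varphi(\lambda)} - \frac{1}{\lambda^2}\log\varphi(\lambda) \le \frac{\sigma^2}{2}.
\]
\[
\frac{\partial}{\partial\lambda}\Big[\frac{1}{\lambda}\log\varphi(\lambda)\Big] = \frac{\varphi'(\lambda)}{\lambda\varphi(\lambda)} - \frac{1}{\lambda^2}\log\varphi(\lambda).
\]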
It is possible to give a similar argument for sub-exponential random variables, which allows us
to derive Bernstein-type bounds, of the form of Corollary 4.1.18, but using the entropy method. In
particular, in the exercises, we show the following result.
Proposition 6.1.3. Assume that there exist positive constants b and σ such that
be the collection of all variables except Xi . Our first result is a consequence of the chain rule for
entropy and is known as Han’s inequality.
Proof The proof is a consequence of the chain rule for entropy and that conditioning reduces
entropy. We have
We also require a divergence version of Han’s inequality, which will allow us to relate the entropy
H of a random variable to divergences and other information-theoretic quantities. Let X be an
arbitrary space, and let Q be a distribution over X n and P = P1 ×· · ·×Pn be a product distribution
on the same space. For A ⊂ X n−1 , define the marginal densities
Proof We have seen earlier in the notes (recall the definition (2.2.1) of the KL divergence as
a supremum over all quantizers and the surrounding discussion) that it is no loss of generality to
assume that X is discrete. Thus, noting that the probability mass functions
\[
  q^{(i)}(x_{\setminus i}) = \sum_{x} q(x_1^{i-1}, x, x_{i+1}^n)
  \quad \text{and} \quad
  p^{(i)}(x_{\setminus i}) = \prod_{j \ne i} p_j(x_j),
\]
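Now, by subtracting $(n-1) \sum_{x_1^n} q(x_1^n) \log p(x_1^n)$ from both sides of the preceding display, we obtain
\begin{align*}
  (n-1) D_{\mathrm{kl}}(Q \| P)
  &= (n-1) \sum_{x_1^n} q(x_1^n) \log q(x_1^n) - (n-1) \sum_{x_1^n} q(x_1^n) \log p(x_1^n) \\
  &\ge \sum_{i=1}^n \sum_{x_{\setminus i}} q^{(i)}(x_{\setminus i}) \log q^{(i)}(x_{\setminus i})
     - (n-1) \sum_{x_1^n} q(x_1^n) \log p(x_1^n).
\end{align*}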
We expand the final term. Indeed, by the product nature of the distributions p, we have
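\begin{align*}
  (n-1) \sum_{x_1^n} q(x_1^n) \log p(x_1^n)
  &= (n-1) \sum_{x_1^n} q(x_1^n) \sum_{i=1}^n \log p_i(x_i) \\
  &= \sum_{i=1}^n \sum_{x_1^n} q(x_1^n) \underbrace{\sum_{j \ne i} \log p_j(x_j)}_{= \log p^{(i)}(x_{\setminus i})}
   = \sum_{i=1}^n \sum_{x_{\setminus i}} q^{(i)}(x_{\setminus i}) \log p^{(i)}(x_{\setminus i}).
\end{align*}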
Noting that
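\[
  \sum_{x_{\setminus i}} q^{(i)}(x_{\setminus i}) \log q^{(i)}(x_{\setminus i})
  - \sum_{x_{\setminus i}} q^{(i)}(x_{\setminus i}) \log p^{(i)}(x_{\setminus i})
  = D_{\mathrm{kl}}\left(Q^{(i)} \| P^{(i)}\right)
\]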
Finally, we will prove the main result of this subsection: a tensorization identity for the entropy
H(Y) of an arbitrary random variable Y that is a function of n independent random variables.
For this result, we use a technique known as tilting, in combination with the two variants of Han’s
inequality we have shown. Tilting transforms problems about random variables into problems about
distributions, allowing us to bring the tools of information and entropy to bear more directly.
The technique is a common one, used frequently in large deviation theory, statistics, and the
analysis of heavy-tailed data, among other areas. More concretely, let
Y = f(X1, . . . , Xn) for some non-negative function f. Then we may always define a tilted density
\[
  q(x_1, \ldots, x_n) := \frac{f(x_1, \ldots, x_n) \, p(x_1, \ldots, x_n)}{\mathbb{E}_P[f(X_1, \ldots, X_n)]} \tag{6.1.5}
\]
which, by inspection, satisfies $\int q(x_1^n) = 1$ and q ≥ 0. In our context, if f ≈ constant under the
distribution P, then we should have $f(x_1^n) p(x_1^n) \approx c \, p(x_1^n)$ and so $D_{\mathrm{kl}}(Q \| P)$ should be small; we
can make this rigorous via the following tensorization theorem.
Proof Inequality (6.1.6) holds for Y if and only if it holds for cY for any c > 0, so
we assume without loss of generality that $\mathbb{E}_P[Y] = 1$. We thus obtain that H(Y) = E[Y log Y] =
E[φ(Y)], where we define φ(t) = t log t. Let P have density p with respect to a base measure µ. Then
by defining the tilted distribution (density) $q(x_1^n) = f(x_1^n) p(x_1^n)$, we have $Q(\mathcal{X}^n) = 1$, and moreover,
we have
\[
  D_{\mathrm{kl}}(Q \| P)
  = \int q(x_1^n) \log \frac{q(x_1^n)}{p(x_1^n)} \, d\mu(x_1^n)
  = \int f(x_1^n) p(x_1^n) \log f(x_1^n) \, d\mu(x_1^n)
  = \mathbb{E}_P[Y \log Y] = H(Y).
\]
E[φ(Y )] − E[φ(E[Y | X\i ])] = E[E[φ(Y ) | X\i ] − φ(E[Y | X\i ])] = E[H(Y | X\i )].
Using Han’s inequality for relative entropies (Proposition 6.1.4) then immediately gives
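\[
  H(Y) = D_{\mathrm{kl}}(Q \| P)
  \le \sum_{i=1}^n \left[ D_{\mathrm{kl}}(Q \| P) - D_{\mathrm{kl}}\left(Q^{(i)} \| P^{(i)}\right) \right]
  = \sum_{i=1}^n \mathbb{E}[H(Y \mid X_{\setminus i})],
\]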
Theorem 6.1.6 shows that if we can show that individually the conditional entropies H(Y | X\i )
are not too large, then the Herbst argument (Proposition 6.1.2 or its variant Proposition 6.1.3)
allows us to provide strong concentration inequalities for general random variables Y .
\[
  \sup_{x \in \mathcal{X},\, x' \in \mathcal{X}}
  \left| f(x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x', x_{i+1}, \ldots, x_n) \right|
  \le c_i \quad \text{for all } x_{\setminus i}. \tag{6.1.7}
\]
Then we have the following result.
Proposition 6.1.7 (Bounded differences). Assume that f satisfies the bounded differences
condition (6.1.7), where $\frac{1}{4} \sum_{i=1}^n c_i^2 \le \sigma^2$. Let Xi be independent. Then Y = f(X1, . . . , Xn) is
σ²-sub-Gaussian.
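As a numerical illustration of Proposition 6.1.7, the following sketch compares an empirical upper tail with the sub-Gaussian bound $\exp(-t^2/(2\sigma^2))$. The choice of f (the sample mean of Uniform[0, 1] draws, so that $c_i = 1/n$), the parameters, and the function names are all assumptions made for this example and do not come from the notes.

```python
import math
import random


def empirical_tail(n, t, trials, rng):
    """Estimate P(Y - E[Y] >= t) for Y = mean of n independent Uniform[0, 1] draws."""
    hits = 0
    for _ in range(trials):
        y = sum(rng.random() for _ in range(n)) / n
        if y - 0.5 >= t:  # E[Y] = 1/2 for Uniform[0, 1] draws
            hits += 1
    return hits / trials


def subgaussian_bound(n, t):
    """Tail bound exp(-t^2 / (2 sigma^2)) with sigma^2 = (1/4) * sum_i c_i^2, c_i = 1/n."""
    sigma_sq = 0.25 * n * (1.0 / n) ** 2
    return math.exp(-t * t / (2.0 * sigma_sq))


rng = random.Random(0)
n, t = 50, 0.1
tail = empirical_tail(n, t, trials=20000, rng=rng)
bound = subgaussian_bound(n, t)
print(f"empirical tail {tail:.4f}, sub-Gaussian bound {bound:.4f}")
```

Changing one coordinate of the sample moves the mean by at most 1/n, which is where $c_i = 1/n$ comes from; the bound then reduces to $\exp(-2nt^2)$.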
Proof We use a similar integration argument to the Herbst argument of Proposition 6.1.2, and
we apply the tensorization inequality (6.1.6). First, let U be an arbitrary random variable taking
values in [a, b]. We claim that if ϕU (λ) = E[eλU ] and ψ(λ) = log ϕU (λ) is its cumulant generating
function, then
\[
  \frac{H(e^{\lambda U})}{\mathbb{E}[e^{\lambda U}]} \le \frac{\lambda^2 (b - a)^2}{8}. \tag{6.1.8}
\]
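Indeed, we have that $Y = \|\sum_{i=1}^n X_i\|_2$ satisfies the bounded differences inequality with
parameters $c_i$, and so
\begin{align*}
  P\left( \Big\| \sum_{i=1}^n X_i \Big\|_2 \ge t \right)
  &= P\left( \Big\| \sum_{i=1}^n X_i \Big\|_2 - \mathbb{E} \Big\| \sum_{i=1}^n X_i \Big\|_2
       \ge t - \mathbb{E} \Big\| \sum_{i=1}^n X_i \Big\|_2 \right) \\
  &\le \exp\left( -2 \, \frac{\left[ t - \mathbb{E} \| \sum_{i=1}^n X_i \|_2 \right]_+^2}{\sum_{i=1}^n c_i^2} \right).
\end{align*}
Noting that $\mathbb{E}[\| \sum_{i=1}^n X_i \|_2] \le \sqrt{\mathbb{E}[\| \sum_{i=1}^n X_i \|_2^2]} = \sqrt{\sum_{i=1}^n \mathbb{E}[\|X_i\|_2^2]}$ gives the result. ◇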
Theorem 6.1.9. Let X1 , . . . , Xn be independent random variables with Xi ∈ [a, b] for all i. Assume
that f : Rn → R is separately convex and L-Lipschitz with respect to the k·k2 norm. Then
We defer the proof of the theorem temporarily, giving two example applications. The first is to
the matrix concentration problem that motivates the beginning of this section.
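Example 6.1.10: Let $X \in \mathbb{R}^{m \times n}$ be a matrix with independent entries, where $X_{ij} \in [-1, 1]$
for all i, j, and let $|\!|\!|\cdot|\!|\!|$ denote the operator norm on matrices, that is,
$|\!|\!|A|\!|\!| = \sup_{u,v} \{ u^\top A v : \|u\|_2 \le 1, \|v\|_2 \le 1 \}$. Then Theorem 6.1.9 implies
\[
  P\left( |\!|\!|X|\!|\!| \ge \mathbb{E}[|\!|\!|X|\!|\!|] + t \right) \le \exp\left( -\frac{t^2}{16} \right)
\]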
where $\|\cdot\|_{\mathrm{Fr}}$ denotes the Frobenius norm of a matrix. Thus the matrix operator norm is
1-Lipschitz. Therefore, we have by Theorem 6.1.9 and the Chernoff bound technique that
As a second example, we consider Rademacher complexity. These types of results are important
for giving generalization bounds in a variety of statistical algorithms, and form the basis of a variety
of concentration and convergence results. We defer further motivation of these ideas to subsequent
chapters, just mentioning here that we can provide strong concentration guarantees for Rademacher
complexity or Rademacher chaos.
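Example 6.1.11: Let $A \subset \mathbb{R}^n$ be any collection of vectors. Then the Rademacher complexity
of the class A is
\[
  R_n(A) := \mathbb{E}\left[ \sup_{a \in A} \sum_{i=1}^n a_i \varepsilon_i \right], \tag{6.1.9}
\]
where the $\varepsilon_i$ are i.i.d. random signs. Writing $\widehat{R}_n(A) := \sup_{a \in A} \sum_{i=1}^n a_i \varepsilon_i$ for the
corresponding empirical quantity, we have
\[
  P\left( \widehat{R}_n(A) \ge R_n(A) + t \right) \le \exp\left( -\frac{t^2}{16 \, \mathrm{diam}(A)^2} \right),
\]
where $\mathrm{diam}(A) := \sup_{a \in A} \|a\|_2$. Indeed, we have that $\varepsilon \mapsto \sup_{a \in A} a^\top \varepsilon$ is a convex function,
as it is the maximum of a family of linear functions. Moreover, it is Lipschitz, with Lipschitz
constant bounded by $\sup_{a \in A} \|a\|_2$. Applying Theorem 6.1.9 as in Example 6.1.10 gives the
result. ◇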
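The concentration claim of Example 6.1.11 can be probed numerically for a small finite class. The sketch below is illustrative only: the class A (ten random vectors), the number of trials, and all function names are assumptions, and the Monte Carlo average stands in for the true expectation $R_n(A)$.

```python
import math
import random


def sup_inner_product(vectors, signs):
    """Compute sup_{a in A} sum_i a_i * eps_i for a finite class A."""
    return max(sum(a_i * s_i for a_i, s_i in zip(a, signs)) for a in vectors)


def rademacher_draws(vectors, trials, rng):
    """Draw independent copies of sup_{a in A} <a, eps> with random sign vectors eps."""
    n = len(vectors[0])
    draws = []
    for _ in range(trials):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        draws.append(sup_inner_product(vectors, signs))
    return draws


rng = random.Random(1)
n = 20
A = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(10)]  # hypothetical class
radius = max(math.sqrt(sum(x * x for x in a)) for a in A)  # sup_a ||a||_2, "diam(A)" above

draws = rademacher_draws(A, trials=5000, rng=rng)
r_n = sum(draws) / len(draws)  # Monte Carlo estimate of R_n(A)
t = 2.0 * radius
tail = sum(d >= r_n + t for d in draws) / len(draws)
bound = math.exp(-t * t / (16.0 * radius ** 2))
print(f"R_n estimate {r_n:.3f}, empirical tail {tail:.4f}, bound {bound:.4f}")
```

With t equal to twice the radius the bound evaluates to $e^{-1/4}$, which is loose here; the point of the sketch is only that the empirical tail sits below it.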
Proof of Theorem 6.1.9 The proof relies on our earlier tensorization identity and a
symmetrization lemma.
Lemma 6.1.12. Let $X, Y \stackrel{\mathrm{iid}}{\sim} P$ be independent. Then for any function g : R → R, we have
Proof For the first result, we use the convexity of the exponential in an essential way. In
particular, we have
because log is concave and $e^x \ge 0$. Using symmetry, that is, that g(X) − g(Y) has the same
distribution as g(Y) − g(X), we then find
\begin{align*}
  H(e^{\lambda g(X)})
  &\le \frac{1}{2} \mathbb{E}\left[ \lambda (g(X) - g(Y)) (e^{\lambda g(X)} - e^{\lambda g(Y)}) \right] \\
  &= \mathbb{E}\left[ \lambda (g(X) - g(Y)) (e^{\lambda g(X)} - e^{\lambda g(Y)}) 1\{g(X) \ge g(Y)\} \right].
\end{align*}
Now we use the classical first-order convexity inequality, that a convex function f satisfies
$f(t) \ge f(s) + f'(s)(t - s)$ for all t and s (Theorem B.3.3 in the appendices), which gives that
$e^t \ge e^s + e^s (t - s)$ for all s and t. Rewriting, we have $e^s - e^t \le e^s (s - t)$, and whenever s ≥ t, we have
$(s - t)(e^s - e^t) \le e^s (s - t)^2$. Replacing s and t with λg(X) and λg(Y), respectively, we obtain
\[
  \lambda (g(X) - g(Y)) (e^{\lambda g(X)} - e^{\lambda g(Y)}) 1\{g(X) \ge g(Y)\}
  \le \lambda^2 (g(X) - g(Y))^2 e^{\lambda g(X)} 1\{g(X) \ge g(Y)\}.
\]
Returning to the main thread of the proof, we note that the separate convexity of f and the
tensorization identity of Theorem 6.1.6 imply
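\[
  H(e^{\lambda f(X_{1:n})})
  \le \mathbb{E}\left[ \sum_{i=1}^n H(e^{\lambda f(X_{1:n})} \mid X_{\setminus i}) \right]
  \le \mathbb{E}\left[ \sum_{i=1}^n \lambda^2 \, \mathbb{E}\left[ (X_i - Y_i)^2 \left( \frac{\partial}{\partial x_i} f(X_{1:n}) \right)^2 e^{\lambda f(X_{1:n})} \,\Big|\, X_{\setminus i} \right] \right],
\]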
where Yi are independent copies of the Xi . Now, we use that (Xi −Yi )2 ≤ (b−a)2 and the definition
of the partial derivative to obtain
Noting that $\|\nabla f(X)\|_2^2 \le L^2$, and applying the Herbst argument, gives the result.
so that we “project out” the jth coordinate, and define the projected sets.
Chapter 7
In this chapter, we continue to build on our ideas on stability in different scenarios, ranging from
model fitting and concentration to interactive data analyses. Here, we show how stability ideas
allow us to provide a new type of protection: the privacy of participants in studies. Until the mid-
2000s, the major challenge in this direction had been a satisfactory definition of privacy, because
collection of side information often results in unforeseen compromises of private information. The
introduction of differential privacy—a type of stability in likelihood ratios for data releases from
differing samples—alleviated these challenges, providing a firm foundation on which to build private
estimators and other methodology. (Though it is possible to trace some of the definitions and major
insights in privacy back at least to survey sampling literature in the 1960s.) Consequently, in this
chapter we focus on privacy notions based on differential privacy and its cousins, developing the
information-theoretic stability ideas helpful to understand the protections it is possible to provide.
perform a study on smoking, and discover that smoking causes cancer. We publish the result, but
now we have “compromised” the privacy of everyone who smokes who did not participate in the
study: we know they are more likely to get cancer.
In each of these cases, the biggest challenge is one of side information: how can we be sure
that, when releasing a particular statistic, dataset, or other quantity, no adversary will be able
to infer sensitive data about participants in our study? We articulate three desiderata that—we
believe—suffice for satisfactory definitions of privacy. In discussion of private releases of data, we
require a bit of vocabulary. We term a (randomized) algorithm releasing data either a privacy
mechanism, consistent with much of the literature in privacy, or a channel, mapping from the input
sample to some output space, in keeping with our statistical and information-theoretic focus. In
no particular order, we wish our privacy mechanism, which takes as input a sample $X_1^n \in \mathcal{X}^n$ and
releases some Z, to satisfy the following.
i. Given the output Z, even an adversary knowing everyone in the study (excepting one person)
should not be able to test whether you belong to the study.
ii. If you participate in multiple “private” studies, there should be some graceful degradation
in the privacy protections, rather than a catastrophic failure. As part of this, any definition
should guarantee that further processing of the output Z of a private mechanism $X_1^n \to Z$, in
the form of the Markov chain $X_1^n \to Z \to Y$, should not allow further compromise of privacy
(that is, a data-processing inequality). Additional participation in “private” studies should
continue to provide little additional information.
iii. The mechanism $X_1^n \to Z$ should be resilient to side information: even if someone knows
something about you, he should learn little about you if you belong to $X_1^n$, and this should
remain true even if the adversary later gleans more information about you.
The third desideratum is perhaps most elegantly phrased via a Bayesian perspective, where an
adversary has some prior beliefs π on the membership of a dataset (these prior beliefs can then
capture any side information the adversary has). The strongest adversary has a prior supported on
two samples $\{x_1, \ldots, x_n\}$ and $\{x_1', \ldots, x_n'\}$ differing in only a single element; a private mechanism
would then guarantee the adversary’s posterior beliefs (after the release $X_1^n \to Z$) should not change
significantly.
Before continuing to address these challenges, we take a brief detour to establish notation for the
remainder of the chapter. It will be convenient to consider randomized procedures acting on samples
themselves; a sample $x_1^n$ is clearly isomorphic to the empirical distribution $P_n = \frac{1}{n} \sum_{i=1}^n 1_{x_i}$, and
for two empirical distributions $P_n$ and $P_n'$ supported on $\{x_1, \ldots, x_n\}$ and $\{x_1', \ldots, x_n'\}$, we evidently
have
\[
  n \left\| P_n - P_n' \right\|_{\mathrm{TV}} = d_{\mathrm{ham}}(\{x_1, \ldots, x_n\}, \{x_1', \ldots, x_n'\}),
\]
and so we will identify samples with their empirical distributions. With this notational convenience
in place, we then identify
\[
  \mathcal{P}_n = \left\{ P_n = \frac{1}{n} \sum_{i=1}^n 1_{x_i} \;\Big|\; x_i \in \mathcal{X} \right\}
\]
as the set of all empirical distributions on n points in $\mathcal{X}$, and we also abuse notation in an obvious
way to define $d_{\mathrm{ham}}(P_n, P_n') := n \|P_n - P_n'\|_{\mathrm{TV}}$ as the number of differing observations in the samples
that $P_n$ and $P_n'$ represent. A mechanism M is then a (typically) randomized mapping $M : \mathcal{P}_n \to \mathcal{Z}$,
which we can identify with its induced Markov channel Q from $\mathcal{X}^n \to \mathcal{Z}$; we use the equivalent
views as is convenient.
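The identification of samples with empirical distributions can be made concrete with a short sketch. The function name `hamming_distance` and the toy samples are hypothetical; the computation treats each sample as a multiset, so the order of observations is ignored, exactly as it is when passing to the empirical distribution.

```python
from collections import Counter


def hamming_distance(sample_a, sample_b):
    """Return n * ||P_n - P_n'||_TV for two equal-size samples viewed as multisets."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must have the same size n")
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    support = set(counts_a) | set(counts_b)
    # TV distance is half the L1 distance between probability mass functions;
    # multiplying by n turns normalized frequencies back into raw counts.
    l1_counts = sum(abs(counts_a[x] - counts_b[x]) for x in support)
    return l1_counts // 2  # l1_counts is even because both samples have n points


x = ["a", "b", "b", "c"]
x_neighbor = ["a", "b", "d", "c"]  # one observation "b" replaced by "d"
print(hamming_distance(x, x_neighbor))
```

Two samples are neighboring in the sense used below precisely when this quantity is at most 1.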
The challenges of side information motivate Dwork et al.’s definition of differential privacy [74].
The key in differential privacy is that the noisy channel releasing statistics provides guarantees of
bounded likelihood ratios between neighboring samples, that is, samples differing in only a single
entry.
Definition 7.1 (Differential privacy). Let $M : \mathcal{P}_n \to \mathcal{Z}$ be a randomized mapping. Then M is
ε-differentially private if for all (measurable) sets $S \subset \mathcal{Z}$ and all $P_n, P_n' \in \mathcal{P}_n$ with $d_{\mathrm{ham}}(P_n, P_n') \le 1$,
\[
  \frac{P(M(P_n) \in S)}{P(M(P_n') \in S)} \le e^{\varepsilon}. \tag{7.1.1}
\]
The intuition and original motivation for this definition are that an individual has little incentive
to participate (or not participate) in a study, as the individual’s data has limited effect on the
outcome.
The model (7.1.1) of differential privacy presumes that there is a trusted curator, such as a
hospital, researcher, or corporation, who can collect all the data into one centralized location, and
it is consequently known as the centralized model. A stronger model of privacy is the local model,
in which data providers trust no one, not even the data collector, and privatize their individual
data before the collector even sees it.
Definition 7.2 (Local differential privacy). A channel Q from $\mathcal{X}$ to $\mathcal{Z}$ is ε-locally differentially
private if for all measurable $S \subset \mathcal{Z}$ and all $x, x' \in \mathcal{X}$,
\[
  \frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le e^{\varepsilon}. \tag{7.1.2}
\]
It is clear that Definition 7.2 and the condition (7.1.2) are stronger than Definition 7.1: when
samples $\{x_1, \ldots, x_n\}$ and $\{x_1', \ldots, x_n'\}$ differ in at most one observation, the local model (7.1.2)
guarantees that the densities satisfy
\[
  \frac{dQ(Z_1^n \mid \{x_i\})}{dQ(Z_1^n \mid \{x_i'\})}
  = \prod_{i=1}^n \frac{dQ(Z_i \mid x_i)}{dQ(Z_i \mid x_i')} \le e^{\varepsilon},
\]
where the inequality follows because only a single ratio may contain $x_i \ne x_i'$.
In the remainder of this introductory section, we provide a few of the basic mechanisms in use
in differential privacy, then discuss its “semantics,” that is, its connections to the three desiderata
we outline above. In the coming sections, we revisit a few more advanced topics, in particular, the
composition of multiple private mechanisms and a few weakenings of differential privacy, as well as
more sophisticated examples.
estimate the proportion of the population with a characteristic (versus those without); call
these groups 0 and 1. Rather than ask the participant to answer the question specifically,
however, we give them a spinner with a face painted in two known areas, where the first
corresponds to group 0 and has area $e^{\varepsilon}/(1 + e^{\varepsilon})$ and the second to group 1 and has area
$1/(1 + e^{\varepsilon})$. Thus, when the participant spins the spinner, it lands in group 0 with probability
$e^{\varepsilon}/(1 + e^{\varepsilon})$. Then we simply ask the participant, upon spinning the spinner, to answer “Yes”
if he or she belongs to the indicated group, “No” otherwise.
Let us demonstrate that this randomized response mechanism provides ε-local differential
privacy. Indeed, a participant in group 0 answers “Yes” exactly when the spinner lands in
group 0, so that
\[
  \frac{Q(\text{Yes} \mid x = 0)}{Q(\text{Yes} \mid x = 1)} = e^{\varepsilon}
  \quad \text{and} \quad
  \frac{Q(\text{No} \mid x = 0)}{Q(\text{No} \mid x = 1)} = e^{-\varepsilon},
\]
so that $Q(Z = z \mid x)/Q(Z = z \mid x') \in [e^{-\varepsilon}, e^{\varepsilon}]$ for all x, x', z. That is, the randomized response
channel provides ε-local privacy. ◇
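The likelihood ratios of the spinner mechanism can be checked exactly with a few lines of code. This is a sketch; the dictionary layout and the function names `spinner_channel` and `max_log_ratio` are assumptions made for illustration.

```python
import math


def spinner_channel(eps):
    """Return Q(answer | x) for the spinner mechanism as a nested dict."""
    p0 = math.exp(eps) / (1.0 + math.exp(eps))  # spinner lands on group 0
    p1 = 1.0 - p0                               # spinner lands on group 1
    # A participant answers "Yes" exactly when the spinner indicates their own group.
    return {0: {"Yes": p0, "No": p1}, 1: {"Yes": p1, "No": p0}}


def max_log_ratio(channel):
    """Largest |log(Q(z | x) / Q(z | x'))| over answers z and inputs x, x'."""
    return max(
        abs(math.log(channel[x][z] / channel[xp][z]))
        for z in ("Yes", "No")
        for x in channel
        for xp in channel
    )


eps = 0.5
print(max_log_ratio(spinner_channel(eps)))  # equals eps up to floating-point error
```

The maximum log-likelihood ratio equals ε, so the mechanism is ε-locally private and no smaller privacy parameter would do.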
The interesting question is, of course, whether we can still use this channel to estimate the
proportion of the population with the sensitive characteristic. Indeed, we can. We can provide
a somewhat more general analysis, however, which we now do so that we can give a complete
example.
Example 7.1.2 (Randomized response, continued): Suppose that we have an attribute of
interest, x, taking the values x ∈ {1, . . . , k}. Then we consider the channel (of Z drawn
conditional on x)
\[
  Z = \begin{cases}
    x & \text{with probability } \frac{e^{\varepsilon}}{k - 1 + e^{\varepsilon}} \\
    \mathrm{Uniform}([k] \setminus \{x\}) & \text{with probability } \frac{k - 1}{k - 1 + e^{\varepsilon}}.
  \end{cases}
\]
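satisfies $\mathbb{E}[\hat{p}_n] = p$, and we also have
\[
  \mathbb{E}\left[ \|\hat{p}_n - p\|_2^2 \right]
  = \left( \frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1} \right)^2 \mathbb{E}\left[ \|\hat{c}_n - \mathbb{E}[\hat{c}_n]\|_2^2 \right]
  = \frac{1}{n} \left( \frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1} \right)^2 \sum_{j=1}^k P(Z = j)(1 - P(Z = j)).
\]
As $\sum_j P(Z = j) = 1$, we always have the bound $\mathbb{E}[\|\hat{p}_n - p\|_2^2] \le \frac{1}{n} \left( \frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1} \right)^2$.
We may consider two regimes for simplicity: when ε ≤ 1 and when ε ≥ log k. In the former
case (the high privacy regime) we have $\frac{1}{k} \lesssim P(Z = i) \lesssim \frac{1}{k}$, so that the mean $\ell_2$ squared error
scales as $\frac{1}{n} \frac{k^2}{\varepsilon^2}$. When ε ≥ log k is large, by contrast, we see that the error scales at worst as $\frac{1}{n}$,
which is the “non-private” mean squared error. ◇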
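A simulation of the k-ary randomized response channel makes the unbiasedness claim tangible. The debiasing formula below is obtained by inverting $P(Z = j) = (p_j(e^{\varepsilon} - 1) + 1)/(e^{\varepsilon} + k - 1)$, which follows from the channel definition and matches the factor $(e^{\varepsilon} + k - 1)/(e^{\varepsilon} - 1)$ in the error display; the surviving text does not show the estimator's definition, so treat this form, the function names, and the toy distribution as assumptions.

```python
import math
import random


def randomize(x, k, eps, rng):
    """Report x with probability e^eps / (k - 1 + e^eps), else a uniform element of [k] minus {x}."""
    keep = math.exp(eps) / (k - 1 + math.exp(eps))
    if rng.random() < keep:
        return x
    other = rng.randrange(1, k)  # uniform on {1, ..., k - 1}
    return other if other < x else other + 1  # skip over x itself


def debiased_estimate(reports, k, eps):
    """Invert P(Z = j) = (p_j (e^eps - 1) + 1) / (e^eps + k - 1) coordinate by coordinate."""
    n = len(reports)
    counts = [0] * k
    for z in reports:
        counts[z - 1] += 1
    scale = math.exp(eps) + k - 1
    return [(scale * c / n - 1.0) / (math.exp(eps) - 1.0) for c in counts]


rng = random.Random(2)
k, eps, n = 4, 1.0, 200_000
p = [0.4, 0.3, 0.2, 0.1]  # hypothetical true distribution
data = rng.choices(range(1, k + 1), weights=p, k=n)
reports = [randomize(x, k, eps, rng) for x in data]
p_hat = debiased_estimate(reports, k, eps)
print([round(v, 3) for v in p_hat])
```

Individual coordinates of the estimate may fall slightly outside [0, 1]; that is a consequence of debiasing and not a bug, though a practical pipeline might project back onto the simplex.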
While randomized response is essentially the standard mechanism in locally private settings, in
centralized privacy, the “standard” mechanism is Laplace noise addition because of its exponential
tails. In this case, we require a few additional definitions. Suppose that we wish to release some
d-dimensional function $f(P_n)$ of the sample distribution $P_n$ (equivalently, the associated sample
$X_1^n$), where f takes values in $\mathbb{R}^d$. In the case that f is Lipschitz with respect to the Hamming
metric (that is, the counting metric on $\mathcal{X}^n$), it is relatively straightforward to develop private
mechanisms. To better reflect the nomenclature in the privacy literature and for easier use in our
later development, for p ∈ [1, ∞] we define the global sensitivity of f by
\[
  \mathrm{GS}_p(f) := \sup_{P_n, P_n' \in \mathcal{P}_n} \left\{ \left\| f(P_n) - f(P_n') \right\|_p \mid d_{\mathrm{ham}}(P_n, P_n') \le 1 \right\}.
\]
This is simply the Lipschitz constant of f with respect to the Hamming metric. The global
sensitivity is a convenient quantity, because it allows simple noise addition strategies.
Letting $L = \mathrm{GS}_1(f)$ be the Lipschitz constant for simplicity, if we consider the mechanism
defined by the addition of $W \in \mathbb{R}^d$ with independent Laplace(L/ε) coordinates,
\[
  Z := f(P_n) + W, \quad W_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Laplace}(L/\varepsilon), \tag{7.1.3}
\]
we have that Z is ε-differentially private. Indeed, for samples $P_n, P_n'$ differing in at most a
single example, Z has density ratio
\[
  \frac{q(z \mid P_n)}{q(z \mid P_n')}
  = \exp\left( -\frac{\varepsilon}{L} \|f(P_n) - z\|_1 + \frac{\varepsilon}{L} \|f(P_n') - z\|_1 \right)
  \le \exp\left( \frac{\varepsilon}{L} \|f(P_n) - f(P_n')\|_1 \right)
  \le \exp(\varepsilon)
\]
by the triangle inequality and that f is L-Lipschitz with respect to the Hamming metric. Thus
Z is ε-differentially private. Moreover, we have
\[
  \mathbb{E}\left[ \|Z - f(P_n)\|_2^2 \right] = \frac{2 d \, \mathrm{GS}_1(f)^2}{\varepsilon^2},
\]
so that if L is small, we may report the value of f accurately. ◇
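A minimal sketch of the Laplace mechanism (7.1.3) follows, together with a Monte Carlo check of the mean squared error formula $2d\,\mathrm{GS}_1(f)^2/\varepsilon^2$. The released vector, the sensitivity value, and the function names are hypothetical. Laplace noise is sampled as a difference of two independent unit exponentials, a standard identity that the notes themselves do not use.

```python
import random


def sample_laplace(scale, rng):
    """Draw from Laplace(scale), using that Exp(1) - Exp(1) is a standard Laplace variable."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(value, l1_sensitivity, eps, rng):
    """Release value + W, where the W_j are i.i.d. Laplace(l1_sensitivity / eps)."""
    scale = l1_sensitivity / eps
    return [v + sample_laplace(scale, rng) for v in value]


rng = random.Random(3)
f_value = [0.2, -1.0, 3.5]  # hypothetical f(P_n) with d = 3
L, eps = 0.1, 0.5           # assumed GS_1(f) and privacy level
trials = 20_000

sq_err = 0.0
for _ in range(trials):
    z = laplace_mechanism(f_value, L, eps, rng)
    sq_err += sum((zi - vi) ** 2 for zi, vi in zip(z, f_value))
mse = sq_err / trials
predicted = 2 * len(f_value) * L ** 2 / eps ** 2
print(f"empirical MSE {mse:.4f}, predicted {predicted:.4f}")
```

The privacy guarantee depends on L being a valid bound on the global sensitivity; the code does not and cannot verify that for an arbitrary f.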
The most common instances and applications of the Laplace mechanism are in estimation of
means and histograms. Let us demonstrate more carefully worked examples in these two cases.
Example 7.1.4 (Private one-dimensional mean estimation): Suppose that we have variables
$X_i$ taking values in [−b, b] for some b < ∞, and wish to estimate E[X]. A natural function to
release is then $f(X_1^n) = \overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. This has Lipschitz constant 2b/n with respect to
the Hamming metric, because for any two samples $x, x' \in [-b, b]^n$ differing in only entry i, we
have
\[
  |f(x) - f(x')| = \frac{1}{n} |x_i - x_i'| \le \frac{2b}{n}
\]
because $x_i, x_i' \in [-b, b]$. Thus the Laplace mechanism (7.1.3) with the choice
W ∼ Laplace(2b/(nε)) yields
\[
  \mathbb{E}[(Z - \mathbb{E}[X])^2]
  = \mathbb{E}[(\overline{X}_n - \mathbb{E}[X])^2] + \mathbb{E}[(Z - \overline{X}_n)^2]
  = \frac{1}{n} \mathrm{Var}(X) + \frac{8 b^2}{n^2 \varepsilon^2}
  \le \frac{b^2}{n} + \frac{8 b^2}{n^2 \varepsilon^2}.
\]
We can privately release means with little penalty so long as $\varepsilon \gg n^{-1/2}$. ◇
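The trade-off in Example 7.1.4 is easy to tabulate. The sketch below evaluates the two terms of the bound $b^2/n + 8b^2/(n^2\varepsilon^2)$ for a few privacy levels; the sample size, the range b, and the function name are illustrative assumptions.

```python
def private_mean_mse_terms(n, b, eps):
    """Return (sampling_term, privacy_term) from the bound b^2/n + 8 b^2 / (n^2 eps^2)."""
    sampling_term = b * b / n
    privacy_term = 8.0 * b * b / (n * n * eps * eps)
    return sampling_term, privacy_term


n, b = 10_000, 1.0
for eps in (0.001, 0.01, 0.1, 1.0):
    sampling, privacy = private_mean_mse_terms(n, b, eps)
    print(f"eps={eps:<6} sampling={sampling:.2e} privacy={privacy:.2e} "
          f"ratio={privacy / sampling:.3g}")
```

The ratio of the privacy term to the sampling term is $8/(n\varepsilon^2)$, so it crosses a constant exactly when ε is of order $n^{-1/2}$, in line with the closing remark of the example.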
Example 7.1.5 (Private histogram (multinomial) release): Suppose that we wish to estimate
a multinomial distribution, or put differently, a histogram. That is, we have observations
X ∈ {1, . . . , k}, where k may be large, and wish to estimate $p_j := P(X = j)$ for j = 1, . . . , k.
For a given sample $x_1^n$, the empirical count vector $\hat{p}_n$ with coordinates $\hat{p}_{n,j} = \frac{1}{n} \sum_{i=1}^n 1\{X_i = j\}$
satisfies
\[
  \mathrm{GS}_1(\hat{p}_n) = \frac{2}{n}
\]
because swapping a single example $x_i$ for $x_i'$ may change the counts for at most two coordinates
j, j' by 1. Consequently, the Laplace noise addition mechanism
\[
  Z = \hat{p}_n + W, \quad W_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Laplace}\left( \frac{2}{n \varepsilon} \right)
\]
satisfies
\[
  \mathbb{E}\left[ \|Z - \hat{p}_n\|_2^2 \right] = \frac{8k}{n^2 \varepsilon^2}
\]
and consequently
\[
  \mathbb{E}\left[ \|Z - p\|_2^2 \right]
  = \frac{8k}{n^2 \varepsilon^2} + \frac{1}{n} \sum_{j=1}^k p_j (1 - p_j)
  \le \frac{8k}{n^2 \varepsilon^2} + \frac{1}{n}.
\]
This example shows one of the challenges of differentially private mechanisms: even in the case
where the quantity of interest is quite stable (insensitive to changes in the underlying sample,
or has small Lipschitz constant), it may be the case that the resulting mechanism adds noise
that introduces some dimension-dependent scaling. In this case, the conditions on privacy
levels acceptable for good estimation (in that the rate of convergence is no different from the
non-private case, which achieves $\mathbb{E}[\|\hat{p}_n - p\|_2^2] = \frac{1}{n} \sum_{j=1}^k p_j (1 - p_j) \le \frac{1}{n}$) are that
$\varepsilon^2 \gg \frac{k}{n}$. Thus, in the case that the histogram has a large number of bins, the naive noise
addition strategy cannot provide as much protection without sacrificing efficiency.
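If instead of $\ell_2$-error we consider $\ell_\infty$ error, it is possible to provide somewhat more satisfying
results in this case. Indeed, we know that $P(\|W\|_\infty \ge t) \le k \exp(-t/b)$ for $W_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Laplace}(b)$,
so that in the mechanism above we have
\[
  P\left( \|Z - \hat{p}_n\|_\infty \ge t \right) \le k \exp\left( -\frac{t n \varepsilon}{2} \right) \quad \text{for all } t \ge 0,
\]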
so that the samples under H0 and H1 differ only in the ith observation $X_i \in \{x_i, x_i'\}$. Now, for a
channel taking inputs from $\mathcal{X}^n$ and outputting $Z \in \mathcal{Z}$, we define ε-conditional hypothesis testing
privacy by saying that
\[
  Q(\Psi(Z) = 1 \mid H_0, Z \in A) + Q(\Psi(Z) = 0 \mid H_1, Z \in A) \ge 1 - \varepsilon \tag{7.1.4}
\]
for all sets $A \subset \mathcal{Z}$ satisfying $Q(A \mid H_0) > 0$ and $Q(A \mid H_1) > 0$. That is, roughly, no matter
what value Z takes on, the probability of error in a test of whether H0 or H1 is true, even with
knowledge of $x_j$, $j \ne i$, is high. We then have the following proposition.
Proposition 7.1.6. Assume the channel Q is ε-differentially private. Then Q is also
ε̄-conditional hypothesis testing private with $\bar{\varepsilon} = 1 - e^{-2\varepsilon} \le 2\varepsilon$.
Proof Let Ψ be any test of H0 versus H1, and let B = {z | Ψ(z) = 1} be the acceptance region
of the test. Then
\begin{align*}
  Q(B \mid H_0, Z \in A) + Q(B^c \mid H_1, Z \in A)
  &= \frac{Q(A, B \mid H_0)}{Q(A \mid H_0)} + \frac{Q(A, B^c \mid H_1)}{Q(A \mid H_1)} \\
  &\ge e^{-2\varepsilon} \frac{Q(A, B \mid H_1)}{Q(A \mid H_1)} + \frac{Q(A, B^c \mid H_1)}{Q(A \mid H_1)} \\
  &\ge e^{-2\varepsilon} \frac{Q(A, B \mid H_1) + Q(A, B^c \mid H_1)}{Q(A \mid H_1)},
\end{align*}
where the first inequality uses ε-differential privacy. Then we simply note that
$Q(A, B \mid H_1) + Q(A, B^c \mid H_1) = Q(A \mid H_1)$.
So we see that (roughly), even conditional on the output of the channel, we still cannot test whether
the initial dataset was x or x' whenever x, x' differ in only a single observation.
An alternative perspective is to consider a Bayesian one, which allows us to more carefully
consider side information. In this case, we consider the following thought experiment. An adversary
has a set of prior beliefs π on X n , and we consider the adversary’s posterior π(· | Z) induced by
observing the output Z of some mechanism M . In this case, Bayes factors, which measure how
much prior and posterior distributions differ after observations, provide one immediate perspective.
π(Pn | z)
≤ eε
π(Pn0 | z)
Proof Let q be the associated density of Z = M (·) (conditional or marginal). We have π(Pn |
z) = q(z | Pn )π(Pn )/q(z). Then
Thus we see that private channels mean that prior and posterior odds between two neighboring
samples cannot change substantially, no matter what the observation Z actually is.
For an alternative view, we consider a somewhat restricted family of prior distributions,
where we now take the view of a sample $x_1^n \in \mathcal{X}^n$. There is some annoyance in this calculation
in that the order of the sample may be important, but it at least gets toward some semantic
interpretation of differential privacy. We consider the adversary’s beliefs on whether a particular
value x belongs to the sample, but more precisely, we consider whether Xi = x. We assume that
the prior density π on X n satisfies
are independent of his beliefs about the other members of the dataset. (We assume that π is
a density with respect to a measure µ on X n−1 × X , where dµ(s, x) = dµ(s)dµ(x).) Under the
condition (7.1.5), we have the following proposition.
Proposition 7.1.8. Let Q be an ε-differentially private channel and let π be any prior distribution
satisfying condition (7.1.5). Then for any z, the posterior density πi on Xi satisfies
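Proof We abuse notation and for a sample $s \in \mathcal{X}^{n-1}$, where $s = (x_1^{i-1}, x_{i+1}^n)$, we let
$s \oplus_i x = (x_1^{i-1}, x, x_{i+1}^n)$. Letting µ be the base measure on $\mathcal{X}^{n-1} \times \mathcal{X}$ with respect to which π is a density
and $q(\cdot \mid x_1^n)$ be the density of the channel Q, we have
\begin{align*}
  \pi_i(x \mid Z = z)
  &= \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \, \pi(s \oplus_i x) \, d\mu(s)}
          {\int_{s \in \mathcal{X}^{n-1}} \int_{x' \in \mathcal{X}} q(z \mid s \oplus_i x') \, \pi(s \oplus_i x') \, d\mu(s, x')} \\
  &\stackrel{(\star)}{\le} e^{\varepsilon}
     \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \, \pi(s \oplus_i x) \, d\mu(s)}
          {\int_{s \in \mathcal{X}^{n-1}} \int_{x' \in \mathcal{X}} q(z \mid s \oplus_i x) \, \pi(s \oplus_i x') \, d\mu(s) \, d\mu(x')} \\
  &= e^{\varepsilon}
     \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \, \pi_{\setminus i}(s) \, d\mu(s) \, \pi_i(x)}
          {\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \, \pi_{\setminus i}(s) \, d\mu(s) \int_{x' \in \mathcal{X}} \pi_i(x') \, d\mu(x')} \\
  &= e^{\varepsilon} \pi_i(x),
\end{align*}
where inequality (⋆) follows from ε-differential privacy. The lower bound is similar.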
Roughly, however, we see that Proposition 7.1.8 captures the idea that even if an adversary has
substantial prior knowledge—in the form of a prior distribution π on the ith value Xi and everything
else in the sample—the posterior cannot change much.
Definition 7.4. Let P and Q be distributions on a space $\mathcal{X}$ with densities p and q (with respect to
a measure µ). For α ∈ [1, ∞], the Rényi-α-divergence between P and Q is
\[
  D_\alpha(P \| Q) := \frac{1}{\alpha - 1} \log \int \left( \frac{p(x)}{q(x)} \right)^{\alpha} q(x) \, d\mu(x).
\]
Here, the values α ∈ {1, ∞} are defined in terms of their respective limits.
Rényi divergences satisfy $\exp((\alpha - 1) D_\alpha(P \| Q)) = 1 + D_f(P \| Q)$, i.e.,
$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log(1 + D_f(P \| Q))$, for the f-divergence defined by $f(t) = t^{\alpha} - 1$, so that they inherit a number of the
properties of such divergences. We enumerate a few here for later reference.
Proposition 7.2.1 (Basic facts on Rényi divergence). Rényi divergences satisfy the following.
ii. $\lim_{\alpha \downarrow 1} D_\alpha(P \| Q) = D_{\mathrm{kl}}(P \| Q)$ and $\lim_{\alpha \uparrow \infty} D_\alpha(P \| Q) = \log \sup\{t \mid Q(p(X)/q(X) \ge t) > 0\}$.
iii. Let K(· | x) be a Markov kernel from X → Z as in Proposition 2.2.13, and let KP and KQ be
the induced marginals of P and Q under K, respectively. Then Dα (KP ||KQ ) ≤ Dα (P ||Q).
We leave the proof of this proposition as Exercise 7.1, noting that property i is a consequence
of Hölder’s inequality, property ii is by L’Hôpital’s rule, and property iii is an immediate
consequence of Proposition 2.2.13. Rényi divergences also tensorize nicely, generalizing the
tensorization properties of KL-divergence and information of Chapter 2 (recall the chain rule (2.1.6) for
KL-divergence), and we return to this later. As a preview, however, these tensorization
properties allow us to prove that the composition of multiple private data releases remains appropriately
private.
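With these preliminaries in place, we can then provide the following definition.
Definition 7.5 (Rényi-differential privacy). Let ε ≥ 0 and α ∈ [1, ∞]. A channel Q from $\mathcal{P}_n$ to
output space $\mathcal{Z}$ is (ε, α)-Rényi private if for all neighboring samples $P_n, P_n' \in \mathcal{P}_n$,
\[
  D_\alpha\left( Q(\cdot \mid P_n) \,\|\, Q(\cdot \mid P_n') \right) \le \varepsilon.
\]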
Clearly, any ε-differentially private channel is also (ε, α)-Rényi private for any α ≥ 1; as we soon
see, we can provide tighter guarantees than this.
Example 7.2.2 (Rényi divergence between Gaussian distributions): Consider normal
distributions $N(\mu_0, \Sigma)$ and $N(\mu_1, \Sigma)$. Then
\[
  D_\alpha(N(\mu_0, \Sigma) \| N(\mu_1, \Sigma)) = \frac{\alpha}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1). \tag{7.2.3}
\]
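To see this equality, we compute the appropriate integral of the densities. Let p and q be the
densities of $N(\mu_0, \Sigma)$ and $N(\mu_1, \Sigma)$, respectively. Then letting $\mathbb{E}_{\mu_1}$ denote expectation over
$X \sim N(\mu_1, \Sigma)$, we have
\begin{align*}
  \int \left( \frac{p(x)}{q(x)} \right)^{\alpha} q(x) \, dx
  &= \mathbb{E}_{\mu_1}\left[ \exp\left( -\frac{\alpha}{2} (X - \mu_0)^T \Sigma^{-1} (X - \mu_0) + \frac{\alpha}{2} (X - \mu_1)^T \Sigma^{-1} (X - \mu_1) \right) \right] \\
  &\stackrel{(i)}{=} \mathbb{E}_{\mu_1}\left[ \exp\left( -\frac{\alpha}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) + \alpha (\mu_0 - \mu_1)^T \Sigma^{-1} (X - \mu_1) \right) \right] \\
  &\stackrel{(ii)}{=} \exp\left( -\frac{\alpha}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) + \frac{\alpha^2}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) \right),
\end{align*}
where equality (i) is simply using that $(x - a)^2 - (x - b)^2 = (a - b)^2 + 2(b - a)(x - b)$ and
equality (ii) follows because $(\mu_0 - \mu_1)^T \Sigma^{-1} (X - \mu_1) \sim N(0, (\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0))$ under
$X \sim N(\mu_1, \Sigma)$. Noting that $-\alpha + \alpha^2 = \alpha(\alpha - 1)$ and taking logarithms gives the result. ◇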
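The closed form (7.2.3) can be sanity-checked in one dimension by numerical integration. The sketch below is an illustration under stated assumptions: the trapezoid rule, the integration window of 12 standard deviations around the point $\alpha\mu_0 + (1 - \alpha)\mu_1$ where the integrand peaks, and the function names are choices made here. The integrand is assembled in the log domain to avoid overflow in $q^{1-\alpha}$.

```python
import math


def log_gaussian_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def renyi_numeric(mu0, mu1, var, alpha, steps=20_000):
    """Approximate D_alpha(N(mu0, var) || N(mu1, var)) with the trapezoid rule (alpha > 1)."""
    center = alpha * mu0 + (1.0 - alpha) * mu1  # where p^alpha * q^(1 - alpha) peaks
    sd = math.sqrt(var)
    lo, hi = center - 12.0 * sd, center + 12.0 * sd
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        log_integrand = (alpha * log_gaussian_pdf(x, mu0, var)
                         + (1.0 - alpha) * log_gaussian_pdf(x, mu1, var))
        total += weight * math.exp(log_integrand)
    return math.log(total * h) / (alpha - 1.0)


def renyi_closed_form(mu0, mu1, var, alpha):
    """One-dimensional case of (7.2.3): alpha * (mu0 - mu1)^2 / (2 * var)."""
    return alpha * (mu0 - mu1) ** 2 / (2.0 * var)


numeric = renyi_numeric(0.0, 1.0, 2.0, 3.0)
closed = renyi_closed_form(0.0, 1.0, 2.0, 3.0)
print(numeric, closed)
```

For these parameters the closed form evaluates to 0.75, and the numerical integral should agree to several decimal places.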
Example 7.2.2 is the key to developing different privacy-preserving schemes under Rényi privacy.
Let us reconsider Example 7.1.3, except that instead of assuming the function f of interest is smooth
with respect to `1 norm, we use the `2 -norm.
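Example 7.2.3 (Gaussian mechanisms): Suppose that $f : \mathcal{P}_n \to \mathbb{R}^d$ has Lipschitz constant
L with respect to the $\ell_2$-norm (for the Hamming metric $d_{\mathrm{ham}}$), that is, global $\ell_2$-sensitivity
$\mathrm{GS}_2(f) \le L$. Then the mechanism
\[
  Z = f(P_n) + W, \quad W \sim N(0, \sigma^2 I)
\]
satisfies
\[
  D_\alpha\left( N(f(P_n), \sigma^2) \,\|\, N(f(P_n'), \sigma^2) \right)
  = \frac{\alpha}{2 \sigma^2} \left\| f(P_n) - f(P_n') \right\|_2^2
  \le \frac{\alpha}{2 \sigma^2} L^2
\]
for neighboring samples $P_n, P_n'$. Thus, if we have Lipschitz constant L and desire (ε, α)-Rényi
privacy, we may take $\sigma^2 = \frac{L^2 \alpha}{2 \varepsilon}$, and then the mechanism
\[
  Z = f(P_n) + W, \quad W \sim N\left( 0, \frac{L^2 \alpha}{2 \varepsilon} I \right) \tag{7.2.4}
\]
satisfies (ε, α)-Rényi privacy. ◇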
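A minimal sketch of the Gaussian mechanism (7.2.4), assuming a valid bound L on the global $\ell_2$-sensitivity is supplied by the caller. The function names and the numerical values are hypothetical; `renyi_cost` simply evaluates the divergence formula from the example at the worst-case distance L.

```python
import math
import random


def gaussian_noise_variance(l2_sensitivity, eps, alpha):
    """Per-coordinate variance sigma^2 = L^2 * alpha / (2 * eps) from (7.2.4)."""
    return l2_sensitivity ** 2 * alpha / (2.0 * eps)


def gaussian_mechanism(value, l2_sensitivity, eps, alpha, rng):
    """Release value + W with W ~ N(0, sigma^2 I), sigma^2 as in (7.2.4)."""
    sigma = math.sqrt(gaussian_noise_variance(l2_sensitivity, eps, alpha))
    return [v + rng.gauss(0.0, sigma) for v in value]


def renyi_cost(l2_distance, sigma_sq, alpha):
    """D_alpha between N(u, sigma^2 I) and N(v, sigma^2 I) when ||u - v||_2 = l2_distance."""
    return alpha * l2_distance ** 2 / (2.0 * sigma_sq)


rng = random.Random(4)
L, eps, alpha = 0.02, 0.25, 8.0
sigma_sq = gaussian_noise_variance(L, eps, alpha)
print(gaussian_mechanism([0.1, 0.7], L, eps, alpha, rng))
print(renyi_cost(L, sigma_sq, alpha))  # equals eps up to floating-point error
```

The last line confirms that the variance choice spends exactly the budget ε at order α when two neighboring samples attain the sensitivity bound.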
Certain special cases can make this more concrete. Indeed, suppose we wish to estimate a mean
E[X] where $X_i \stackrel{\mathrm{iid}}{\sim} P$ for some distribution P such that $\|X_i\|_2 \le r$ with probability 1 for some
radius r.
Example 7.2.4 (Bounded mean estimation with Gaussian mechanisms): Letting $f(X_1^n) = \overline{X}_n$
be the sample mean, where $X_i$ satisfy $\|X_i\|_2 \le r$ as above, we see immediately that
\[
  \mathrm{GS}_2(f) = \frac{2r}{n}.
\]
In this case, the Gaussian mechanism (7.2.4) with $L = \frac{2r}{n}$ yields
\[
  \mathbb{E}\left[ \left\| Z - \overline{X}_n \right\|_2^2 \right] = \mathbb{E}[\|W\|_2^2] = \frac{2 d r^2 \alpha}{n^2 \varepsilon}.
\]
Then we have
\[
  \mathbb{E}[\|Z - \mathbb{E}[X]\|_2^2]
  = \mathbb{E}[\|\overline{X}_n - \mathbb{E}[X]\|_2^2] + \mathbb{E}[\|Z - \overline{X}_n\|_2^2]
  \le \frac{r^2}{n} + \frac{2 d r^2 \alpha}{n^2 \varepsilon}.
\]
It is not immediately apparent how to compare this quantity to the case for the Laplace
mechanism in Example 7.1.3, but we will return to this shortly once we have developed connections
between the various privacy notions we have developed. ◇
Proposition 7.2.5. Let ε ≥ 0 and let P and Q be distributions such that $e^{-\varepsilon} \le P(A)/Q(A) \le e^{\varepsilon}$
for all measurable sets A. Then for any α ∈ [1, ∞],
\[
  D_\alpha(P \| Q) \le \min\left\{ \frac{3\alpha}{2} \varepsilon^2, \varepsilon \right\}.
\]
Corollary 7.2.6. Let ε ≥ 0 and assume that Q is ε-differentially private. Then for any α ≥ 1, Q
is $(\min\{\frac{3\alpha}{2} \varepsilon^2, \varepsilon\}, \alpha)$-Rényi private.
Before proving the proposition, let us see its implications for Example 7.2.4 versus estimation
under $\varepsilon$-differential privacy. Let $\varepsilon \le 1$, so that roughly to have “similar” privacy, we require
that our Rényi private channels satisfy $D_\alpha(Q(\cdot \mid x) \| Q(\cdot \mid x')) \le \varepsilon^2$. The $\ell_1$-sensitivity of the mean
satisfies $\|\bar{x}_n - \bar{x}'_n\|_1 \le \sqrt{d}\, \|\bar{x}_n - \bar{x}'_n\|_2 \le 2\sqrt{d}\, r/n$ for neighboring samples. Then the Laplace
mechanism (7.1.3) satisfies
\[ E[\|Z_{\rm Laplace} - E[X]\|_2^2] = E[\|\bar{X}_n - E[X]\|_2^2] + \frac{8 r^2}{n^2 \varepsilon^2} \cdot d^2, \]
while the Gaussian mechanism under $(\varepsilon^2, \alpha)$-Rényi privacy will yield
\[ E[\|Z_{\rm Gauss} - E[X]\|_2^2] = E[\|\bar{X}_n - E[X]\|_2^2] + \frac{2 r^2}{n^2 \varepsilon^2} \cdot d\alpha. \]
This is evidently better than the Laplace mechanism whenever $\alpha < d$.
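To make the comparison concrete, the sketch below evaluates the two privacy-noise terms, $8r^2d^2/(n^2\varepsilon^2)$ for the Laplace mechanism and $2r^2d\alpha/(n^2\varepsilon^2)$ for the Gaussian mechanism. The function names and the particular values of $d$, $r$, $n$, $\varepsilon$ are hypothetical choices for illustration only.

```python
def laplace_noise_mse(d, r, n, eps):
    """Privacy-noise term of the Laplace mechanism: 8 r^2 d^2 / (n^2 eps^2)."""
    return 8.0 * r ** 2 * d ** 2 / (n ** 2 * eps ** 2)

def gaussian_noise_mse(d, r, n, eps, alpha):
    """Privacy-noise term of the Gaussian mechanism under (eps^2, alpha)-Renyi privacy."""
    return 2.0 * r ** 2 * d * alpha / (n ** 2 * eps ** 2)

d, r, n, eps = 100, 1.0, 10_000, 0.5
ratios = {}
for alpha in (2, 10, 100, 1000):
    # The ratio simplifies to alpha / (4 d), so alpha < d is a sufficient condition.
    ratios[alpha] = gaussian_noise_mse(d, r, n, eps, alpha) / laplace_noise_mse(d, r, n, eps)
    print(f"alpha={alpha:5d}  Gaussian/Laplace noise ratio = {ratios[alpha]:.4f}")
```

Since the ratio of the two terms is $\alpha/(4d)$, the stated condition $\alpha < d$ is sufficient (though not necessary) for the Gaussian term to be smaller.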
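Proof of Proposition 7.2.5 We assume that $P$ and $Q$ have densities $p$ and $q$ with respect to a
base measure $\mu$, which is no loss of generality, whence the ratio condition implies that $e^{-\varepsilon} \le p/q \le e^{\varepsilon}$
and $D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \int (p/q)^\alpha q \, d\mu$. We prove the result assuming that $\alpha \in (1, \infty)$, as continuity
gives the result for $\alpha \in \{1, \infty\}$.
First, it is clear that $D_\alpha(P \| Q) \le \varepsilon$ always. For the other term in the minimum, let us assume
that $\alpha \le 1 + \frac{1}{\varepsilon}$ and $\varepsilon \le 1$. If either of these fails, the result is trivial, because for $\alpha > 1 + \frac{1}{\varepsilon}$ we
have $\frac{3}{2}\alpha\varepsilon^2 \ge \frac{3}{2}\varepsilon \ge \varepsilon$, and similarly $\varepsilon \ge 1$ implies $\frac{3}{2}\alpha\varepsilon^2 \ge \varepsilon$.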
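Now we perform a Taylor approximation of $t \mapsto (1 + t)^\alpha$. By Taylor’s theorem, we have for any
$t > -1$ that
\[ (1 + t)^\alpha = 1 + \alpha t + \frac{\alpha(\alpha - 1)}{2} (1 + \tilde{t})^{\alpha - 2} t^2 \]
for some $\tilde{t}$ between $0$ and $t$. Applying this with $t = p(z)/q(z) - 1$, we obtain
\begin{align*}
\exp\big((\alpha - 1) D_\alpha(P \| Q)\big)
& = \int \Big(\frac{p(z)}{q(z)}\Big)^\alpha q(z) \, d\mu(z) \\
& = \int \Big(1 + \frac{p(z)}{q(z)} - 1\Big)^\alpha q(z) \, d\mu(z) \\
& \le 1 + \alpha \int \Big(\frac{p(z)}{q(z)} - 1\Big) q(z) \, d\mu(z)
  + \frac{\alpha(\alpha - 1)}{2} \max\{1, \exp(\varepsilon(\alpha - 2))\} \int \Big(\frac{p(z)}{q(z)} - 1\Big)^2 q(z) \, d\mu(z) \\
& \le 1 + \frac{\alpha(\alpha - 1)}{2} e^{\varepsilon[\alpha - 2]_+} \cdot (e^\varepsilon - 1)^2.
\end{align*}
Now, we know that $\alpha - 2 \le 1/\varepsilon - 1$ by assumption, so using that $\log(1 + x) \le x$, we obtain
\[ D_\alpha(P \| Q) \le \frac{\alpha}{2} (e^\varepsilon - 1)^2 \cdot \exp([1 - \varepsilon]_+). \]
Finally, a numerical calculation yields that this quantity is at most $\frac{3\alpha}{2}\varepsilon^2$ for $\varepsilon \le 1$.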
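The final numerical step can be checked directly. The sketch below evaluates $(e^\varepsilon - 1)^2 \exp([1 - \varepsilon]_+)/\varepsilon^2$ on a grid of $\varepsilon \in (0, 1]$ and compares it with the constant 3; the grid resolution and the helper name `taylor_bound_constant` are assumptions of this illustration, and a grid check is of course not a proof.

```python
import math

def taylor_bound_constant(eps):
    """(e^eps - 1)^2 * exp([1 - eps]_+) / eps^2, to be compared with 3."""
    return math.expm1(eps) ** 2 * math.exp(max(1.0 - eps, 0.0)) / eps ** 2

grid = [i / 1000 for i in range(1, 1001)]   # eps in (0, 1]
worst = max(taylor_bound_constant(e) for e in grid)
print(f"largest value of the constant on the grid: {worst:.4f}")
```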
We can also provide connections from (ε, α)-Rényi privacy to (ε, δ)-differential privacy, and
then from there to ε-differential privacy. We begin by showing how to develop (ε, δ)-differential
privacy out of Rényi privacy. Another way to think about this proposition is that whenever two
distributions P and Q are close in Rényi divergence, then there is some limited “amplification” of
probabilities that is possible in moving from one to the other.
Proposition 7.2.7. Let $P$ and $Q$ satisfy $D_\alpha(P \| Q) \le \varepsilon$. Then for any set $A$,
\[ P(A) \le \exp\Big(\frac{\alpha - 1}{\alpha} \varepsilon\Big) Q(A)^{\frac{\alpha - 1}{\alpha}}. \]
Before turning to the proof of the proposition, we show how it can provide prototypical (ε, δ)-
private mechanisms via Gaussian noise addition.
\[ E[\|Z_{\rm Gauss} - E[X]\|_2^2] = E[\|\bar{X}_n - E[X]\|_2^2] + O(1) \frac{r^2}{n^2 \varepsilon^2} \cdot d \log\frac{1}{\delta}. \]
Comparing to the previous cases, we see an improvement over the Laplace mechanism whenever
$\log\frac{1}{\delta} \ll d$, or that $\delta \gg e^{-d}$. ◇
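Proof of Proposition 7.2.7 We use the data processing inequality of Proposition 7.2.1.iii,
which shows that
\[ \varepsilon \ge D_\alpha(P \| Q) \ge \frac{1}{\alpha - 1} \log\bigg[ \Big(\frac{P(A)}{Q(A)}\Big)^\alpha Q(A) \bigg]. \]
Rearranging and taking exponentials, we immediately obtain the first claim of the proposition.
For the second, we require a bit more work. First, let us assume that $Q(A) > e^{-\varepsilon} \delta^{\frac{\alpha}{\alpha - 1}}$. Then
we have by the first claim of the proposition that
\begin{align*}
P(A) & \le \exp\Big(\frac{\alpha - 1}{\alpha} \varepsilon + \frac{1}{\alpha} \log\frac{1}{Q(A)}\Big) Q(A) \\
& \le \exp\Big(\frac{\alpha - 1}{\alpha} \varepsilon + \frac{1}{\alpha} \varepsilon + \frac{1}{\alpha - 1} \log\frac{1}{\delta}\Big) Q(A)
  = \exp\Big(\varepsilon + \frac{1}{\alpha - 1} \log\frac{1}{\delta}\Big) Q(A).
\end{align*}
On the other hand, when $Q(A) \le e^{-\varepsilon} \delta^{\frac{\alpha}{\alpha - 1}}$, then again using the first result of the proposition,
\begin{align*}
P(A) & \le \exp\Big(\frac{\alpha - 1}{\alpha} (\varepsilon + \log Q(A))\Big) \\
& \le \exp\Big(\frac{\alpha - 1}{\alpha} \Big(\varepsilon - \varepsilon + \frac{\alpha}{\alpha - 1} \log \delta\Big)\Big) = \delta.
\end{align*}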
Finally, we develop our last set of connections, which show how we may relate (ε, δ)-private
channels with ε-private channels. To provide this definition, we require one additional weakened
notion of divergence, which relates (ε, δ)-differential privacy to Rényi-α-divergence with α = ∞.
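We define
\[ D_\infty^\delta(P \| Q) := \sup_{S \subset \mathcal{X}} \Big\{ \log\frac{P(S) - \delta}{Q(S)} \;\Big|\; P(S) > \delta \Big\}, \]
where the supremum is over measurable sets. Evidently equivalent to this definition is that
$D_\infty^\delta(P \| Q) \le \varepsilon$ if and only if
\[ P(S) \le e^\varepsilon Q(S) + \delta \quad \text{for all } S \subset \mathcal{X}. \]
Then we have the following lemma.
Lemma 7.2.10. Let $\varepsilon > 0$ and $\delta \in (0, 1)$, and let $P$ and $Q$ be distributions on a space $\mathcal{X}$.
(i) We have $D_\infty^\delta(P \| Q) \le \varepsilon$ if and only if there exists a probability distribution $R$ on $\mathcal{X}$ such that
$\|P - R\|_{\rm TV} \le \delta$ and $D_\infty(R \| Q) \le \varepsilon$.
(ii) We have $D_\infty^\delta(P \| Q) \le \varepsilon$ and $D_\infty^\delta(Q \| P) \le \varepsilon$ if and only if there exist distributions $P_0$ and $Q_0$
such that
\[ \|P - P_0\|_{\rm TV} \le \frac{\delta}{1 + e^\varepsilon}, \quad \|Q - Q_0\|_{\rm TV} \le \frac{\delta}{1 + e^\varepsilon}, \]
and
\[ D_\infty(P_0 \| Q_0) \le \varepsilon \quad \text{and} \quad D_\infty(Q_0 \| P_0) \le \varepsilon. \]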
The proof of the lemma is technical, so we defer it to Section 7.5.1. The key application of the
lemma—which we shall see presently—is that (ε, δ)-differentially private algorithms compose in
elegant ways.
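Proposition 7.2.12. Let $M$ be a $(\varepsilon, \alpha)$-Rényi private mechanism, where $\alpha \in (1, \infty)$. Then for
any neighboring $P_n, P_n', P_n^{(0)} \in \mathcal{P}_n$, we have
\[ E_0\bigg[ \Big(\frac{\pi(P_n \mid Z)}{\pi(P_n' \mid Z)}\Big)^{\alpha - 1} \bigg]^{\frac{1}{\alpha - 1}} \le e^\varepsilon \frac{\pi(P_n)}{\pi(P_n')}, \]
where $E_0$ denotes expectation taken over $Z = M(P_n^{(0)})$.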
Proposition 7.2.12 communicates a similar message to our previous results in this vein: even if
we get information from the output of the private mechanism on some sample $x^{(0)} \in \mathcal{X}^n$ near the
samples (datasets) of interest $x, x'$ that an adversary wishes to distinguish, it is impossible to update
beliefs by much. The parameter $\alpha$ then controls the degree of difficulty of this “impossible” claim,
which one can see by (for example) applying a Chebyshev-type bound to the posterior and
prior ratios.
We now turn to the promised proofs of Propositions 7.2.11 and 7.2.12. To prove the former, we
require a definition.
Definition 7.6. Distributions P and Q on a space X are (ε, δ)-close if for all measurable A
Letting p and q denote their densities (with respect to any shared base measure), they are (ε, δ)-
pointwise close if the set
satisfy
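\[ \max\{P(A_+), Q(A_-)\} \le \frac{e^{\beta\varepsilon} \delta}{e^{\beta\varepsilon} - 1}, \quad
\max\{P(A_-), Q(A_+)\} \le \frac{e^{-\varepsilon} \delta}{e^{\beta\varepsilon} - 1}. \]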
Conversely, if P and Q are (ε, δ)-pointwise close, then
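so that $Q(A) \le e^{-(1+\beta)\varepsilon} \delta / (1 - e^{-\beta\varepsilon}) = e^{-\varepsilon} \delta / (e^{\beta\varepsilon} - 1)$. The set $A_-$ satisfies the symmetric properties.
For the converse result, let $B = \{x : e^{-\varepsilon} q(x) \le p(x) \le e^{\varepsilon} q(x)\}$. Then for any set $A$ we have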
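Theorem 7.3.1. Let the conditions above hold, $\varepsilon_i < \infty$ for $i = 1, \ldots, n$, and $\alpha \in [1, \infty]$. Assume
that conditional on $z_1^{i-1}$, we have $D_\alpha\big(P_i(\cdot \mid z_1^{i-1}) \,\|\, Q_i(\cdot \mid z_1^{i-1})\big) \le \varepsilon_i$. Then
\[ D_\alpha(P \| Q) \le \sum_{i=1}^n \varepsilon_i. \]
Proof We assume without loss of generality that the conditional distributions $P_i(\cdot \mid z_1^{i-1})$ and
$Q_i$ are absolutely continuous with respect to a base measure $\mu$ on $\mathcal{Z}$.[1] Then we have
\begin{align*}
D_\alpha(P \| Q)
& = \frac{1}{\alpha - 1} \log \int \prod_{i=1}^n \Big(\frac{p_i(z_i \mid z_1^{i-1})}{q_i(z_i \mid z_1^{i-1})}\Big)^\alpha q_i(z_i \mid z_1^{i-1}) \, d\mu^n(z_1^n) \\
& = \frac{1}{\alpha - 1} \log \int_{\mathcal{Z}_1^{n-1}} \bigg[ \int \Big(\frac{p_n(z_n \mid z_1^{n-1})}{q_n(z_n \mid z_1^{n-1})}\Big)^\alpha q_n(z_n \mid z_1^{n-1}) \, d\mu(z_n) \bigg]
  \prod_{i=1}^{n-1} \Big(\frac{p_i}{q_i}\Big)^\alpha q_i \, d\mu^{n-1} \\
& \le \frac{1}{\alpha - 1} \log \int_{\mathcal{Z}_1^{n-1}} \exp((\alpha - 1)\varepsilon_n)
  \prod_{i=1}^{n-1} \Big(\frac{p_i(z_i \mid z_1^{i-1})}{q_i(z_i \mid z_1^{i-1})}\Big)^\alpha q_i(z_i \mid z_1^{i-1}) \, d\mu^{n-1}(z_1^{n-1}) \\
& = \varepsilon_n + D_\alpha\big(P_1^{n-1} \,\|\, Q_1^{n-1}\big).
\end{align*}
[1] This is no loss of generality, as the general definition of f-divergences is as suprema over finite partitions, or
quantizations, of each $X_i$ and $Y_i$ separately, as in our discussion of KL-divergence in Chapter 2.2.2. Thus we may
assume $\mathcal{Z}$ is discrete and $\mu$ is a counting measure.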
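A small numerical check can make the adaptive composition bound of Theorem 7.3.1 tangible. The sketch below builds a two-step discrete example in which the second-stage distributions depend on the first output, then compares the Rényi divergence of the joint distributions with $\varepsilon_1 + \varepsilon_2$, where $\varepsilon_2$ is the worst conditional divergence. All distributions, the value of $\alpha$, and the helper name `renyi` are made up for illustration.

```python
import math

def renyi(p, q, alpha):
    """D_alpha(P || Q) for discrete distributions with common support."""
    return math.log(sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p, q))) / (alpha - 1)

alpha = 3.0
p1, q1 = [0.6, 0.4], [0.5, 0.5]                 # first-stage distributions of Z_1
p2 = {0: [0.7, 0.3], 1: [0.2, 0.8]}             # P_2(. | z_1)
q2 = {0: [0.6, 0.4], 1: [0.3, 0.7]}             # Q_2(. | z_1)

# Joint distributions of (Z_1, Z_2) under P and Q.
joint_p = [p1[z1] * p2[z1][z2] for z1 in (0, 1) for z2 in (0, 1)]
joint_q = [q1[z1] * q2[z1][z2] for z1 in (0, 1) for z2 in (0, 1)]

eps1 = renyi(p1, q1, alpha)
eps2 = max(renyi(p2[z], q2[z], alpha) for z in (0, 1))   # bound holding for every z_1
total = renyi(joint_p, joint_q, alpha)
print(f"D_alpha(joint) = {total:.4f} <= eps1 + eps2 = {eps1 + eps2:.4f}")
```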
i. Adversary chooses arbitrary space $\mathcal{X}$, $n \in \mathbb{N}$, and two datasets $x^{(0)}, x^{(1)} \in \mathcal{X}^n$ with
$d_{\rm ham}(x^{(0)}, x^{(1)}) \le 1$.
Figure 7.1. The privacy game. In this game, the adversary may not directly observe
the private b ∈ {0, 1}.
By considering a special case centered around a particular individual in the game 7.1, we can gain
some intuition for the definition. Indeed, suppose that an individual has some data x0 ; in each
round of the game the adversary generates two datasets, one containing x0 and the other identical
except that x0 is removed. Then satisfying Definition 7.7 captures the intuition that an individual’s
privacy remains protected, even in the face of multiple (private) accesses of the individual’s data.
As an immediate corollary to Theorem 7.3.1, we then have the following.
Corollary 7.3.2. Assume that each channel in the game in Fig. 7.1 is $(\varepsilon_i, \alpha)$-Rényi private. Then
the arbitrary composition of $k$ such channels remains $(\sum_{i=1}^k \varepsilon_i, \alpha)$-Rényi private.
More sophisticated corollaries are possible once we start to use the connections between privacy
measures we outline in Section 7.2.2. In this case, we can develop so-called advanced composition
rules, which sometimes suggest that privacy degrades more slowly than might be expected under
adaptive composition.
Corollary 7.3.3. Assume that each channel in the game in Fig. 7.1 is $\varepsilon$-differentially private.
Then the composition of $k$ such channels is $k\varepsilon$-differentially private. Additionally, the composition
of $k$ such channels is
\[ \Big( \frac{3k}{2} \varepsilon^2 + \sqrt{6k \log\frac{1}{\delta}} \cdot \varepsilon, \; \delta \Big) \]
differentially private for all $\delta > 0$.
Proof The first claim is immediate: for $Q^{(0)}, Q^{(1)}$ as in Definition 7.7, we know that $D_\alpha\big(Q^{(0)} \| Q^{(1)}\big) \le k\varepsilon$
for all $\alpha \in [1, \infty]$ by Theorem 7.3.1 coupled with Proposition 7.2.5 (or Corollary 7.2.6).
For the second claim, we require a bit more work. Here, we use the bound $\frac{3\alpha}{2}\varepsilon^2$ in the Rényi
privacy bound in Corollary 7.2.6. Then we have for any $\alpha \ge 1$ that
\[ D_\alpha\big(Q^{(0)} \| Q^{(1)}\big) \le \frac{3k\alpha}{2} \varepsilon^2 \]
by Theorem 7.3.1. Now we apply Proposition 7.2.7 and Corollary 7.2.8, which allow us to conclude
$(\varepsilon, \delta)$-differential privacy from Rényi privacy. Indeed, by the preceding display, setting $\alpha = 1 + \eta$,
we have that the composition is $(\frac{3k}{2}\varepsilon^2 + \frac{3k\eta}{2}\varepsilon^2 + \frac{1}{\eta}\log\frac{1}{\delta}, \delta)$-differentially private for all $\eta > 0$ and
$\delta > 0$. Optimizing over $\eta$ gives the second result.
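The optimization over $\eta$ in the proof is a one-line calculation: minimizing $\frac{3k\eta}{2}\varepsilon^2 + \frac{1}{\eta}\log\frac{1}{\delta}$ gives $\eta^\star = \sqrt{2\log(1/\delta)/(3k\varepsilon^2)}$ and the value $\sqrt{6k\log(1/\delta)}\,\varepsilon$. The sketch below checks this numerically and compares the result with the basic bound $k\varepsilon$; the function names and the values of $k$, $\varepsilon$, $\delta$ are illustrative assumptions.

```python
import math

def basic_composition(k, eps):
    """Naive privacy level k * eps for k-fold composition."""
    return k * eps

def advanced_composition(k, eps, delta):
    """(3k/2) eps^2 + sqrt(6 k log(1/delta)) * eps, as in Corollary 7.3.3."""
    return 1.5 * k * eps ** 2 + math.sqrt(6.0 * k * math.log(1.0 / delta)) * eps

def renyi_route(k, eps, delta, eta):
    """(3k/2) eps^2 + (3 k eta / 2) eps^2 + log(1/delta) / eta, i.e. alpha = 1 + eta."""
    return 1.5 * k * eps ** 2 + 1.5 * k * eta * eps ** 2 + math.log(1.0 / delta) / eta

k, eps, delta = 1000, 0.01, 1e-6
eta_star = math.sqrt(2.0 * math.log(1.0 / delta) / (3.0 * k * eps ** 2))
optimized = renyi_route(k, eps, delta, eta_star)
print(f"basic: {basic_composition(k, eps):.3f}  advanced: {advanced_composition(k, eps, delta):.3f}")
```

With these (hypothetical) parameters the advanced bound is noticeably smaller than $k\varepsilon$, while for $\varepsilon \ge 1$ the comparison reverses.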
We note in passing that it is possible to get slightly sharper results than those in Corollary 7.3.3;
indeed, using ideas from Exercise 4.3 it is possible to achieve $(k\varepsilon(e^\varepsilon - 1) + \sqrt{2k\log\frac{1}{\delta}}\,\varepsilon, \delta)$-differential
privacy under adaptive composition.
A more sophisticated result, which shows adaptive composition for (ε, δ)-differentially private
channels, is also possible using Lemma 7.2.10.
Theorem 7.3.4. Assume that each channel in the game in Fig. 7.1 is $(\varepsilon, \delta)$-differentially private.
Then the composition of $k$ such channels is $(k\varepsilon, k\delta)$-differentially private. Additionally, they are
\[ \Big( \frac{3k}{2} \varepsilon^2 + \sqrt{6k \log\frac{1}{\delta_0}} \cdot \varepsilon, \; \delta_0 + \frac{k\delta}{1 + e^\varepsilon} \Big) \]
differentially private for all $\delta_0 > 0$.
Proof Consider the channels $Q_i$ in Fig. 7.1. As each satisfies $D_\infty^\delta(Q_i(\cdot \mid x^{(0)}) \| Q_i(\cdot \mid x^{(1)})) \le \varepsilon$
and $D_\infty^\delta(Q_i(\cdot \mid x^{(1)}) \| Q_i(\cdot \mid x^{(0)})) \le \varepsilon$, Lemma 7.2.10 guarantees the existence (at each sequential
step, which may depend on the preceding $i - 1$ outputs) of probability measures $Q_i^{(0)}$ and $Q_i^{(1)}$ such
that $D_\infty(Q_i^{(1-b)} \| Q_i^{(b)}) \le \varepsilon$ and $\|Q_i^{(b)} - Q_i(\cdot \mid x^{(b)})\|_{\rm TV} \le \delta/(1 + e^\varepsilon)$ for $b \in \{0, 1\}$.
Note that by construction (and Theorem 7.3.1) we have $D_\alpha(Q_1^{(b)} \cdots Q_k^{(b)} \| Q_1^{(1-b)} \cdots Q_k^{(1-b)}) \le
\min\{\frac{3k\alpha}{2}\varepsilon^2, k\varepsilon\}$. Let $Q^{(b)}$ denote the joint distribution on $Z_1, \ldots, Z_k$ under bit $b$. We also have
by the triangle inequality that $\|Q_1^{(b)} \cdots Q_k^{(b)} - Q^{(b)}\|_{\rm TV} \le k\delta/(1 + e^\varepsilon)$ for $b \in \{0, 1\}$. (See Exercise 2.16.)
As a consequence, we see as in the proof of Corollary 7.3.3 that the composition is
$(\frac{3k}{2}\varepsilon^2 + \frac{3k\eta}{2}\varepsilon^2 + \frac{1}{\eta}\log\frac{1}{\delta_0}, \delta_0 + k\delta/(1 + e^\varepsilon))$-differentially private for all $\eta > 0$ and $\delta_0$. Optimizing
gives the result.
As a consequence of these results, we see that whenever the privacy parameter $\varepsilon < 1$, it is
possible to compose multiple privacy mechanisms together and have privacy penalty scaling only
as the worse of $\sqrt{k}\varepsilon$ and $k\varepsilon^2$, which is substantially better than the “naive” bound of $k\varepsilon$. Of course,
a challenge here (relatively infrequently discussed in the privacy literature) is that when $\varepsilon \ge 1$,
which is a frequent case for practical deployments of privacy, all of these bounds are much worse
than a naive bound that $k$-fold composition of $\varepsilon$-differentially private algorithms is $k\varepsilon$-differentially
private.
in R+ and µ is a finite measure, making the last assumption trivial.) That is, the exponential
mechanism $M$ releases $Z = M(P_n)$ with probability proportional to
\[ \exp\Big( -\frac{\varepsilon}{L} \ell(P_n, z) \Big). \]
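That the mechanism (7.4.1) is $2\varepsilon$-differentially private is immediate: for any neighboring $P_n, P_n'$,
we have
\begin{align*}
\frac{Q(A \mid P_n)}{Q(A \mid P_n')}
& = \frac{\int \exp(-\frac{\varepsilon}{L} \ell(P_n', z)) \, d\mu(z)}{\int \exp(-\frac{\varepsilon}{L} \ell(P_n, z)) \, d\mu(z)}
  \cdot \frac{\int_A \exp(-\frac{\varepsilon}{L} \ell(P_n, z)) \, d\mu(z)}{\int_A \exp(-\frac{\varepsilon}{L} \ell(P_n', z)) \, d\mu(z)} \\
& \le \sup_{z \in \mathcal{Z}} \exp\Big( \frac{\varepsilon}{L} [\ell(P_n, z) - \ell(P_n', z)] \Big)
  \cdot \sup_{z \in A} \exp\Big( \frac{\varepsilon}{L} [\ell(P_n', z) - \ell(P_n, z)] \Big) \le \exp(2\varepsilon).
\end{align*}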
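On a finite output space with counting measure, the exponential mechanism and its $2\varepsilon$ guarantee can be computed exactly. The sketch below uses a hypothetical "private mode" task, with loss equal to the number of data points different from the candidate $z$ (which changes by at most $L = 1$ when one entry is replaced); the task, the data, and the function names are assumptions chosen only to illustrate the calculation above.

```python
import math

def exponential_mechanism_probs(losses, eps, sensitivity):
    """Probabilities proportional to exp(-(eps / L) * loss) under counting measure."""
    shift = min(losses)                       # subtracting the minimum avoids underflow
    weights = [math.exp(-eps * (loss - shift) / sensitivity) for loss in losses]
    total = sum(weights)
    return [w / total for w in weights]

def mode_losses(xs, candidates):
    """loss(x, z) = number of points different from z; sensitivity 1 per replaced entry."""
    return [sum(1 for x in xs if x != z) for z in candidates]

candidates = [0, 1, 2, 3]
x = [0, 0, 1, 2, 2, 2, 3]
x_neighbor = [0, 0, 1, 2, 2, 0, 3]            # one entry replaced
eps = 0.5
p = exponential_mechanism_probs(mode_losses(x, candidates), eps, 1.0)
p_nbr = exponential_mechanism_probs(mode_losses(x_neighbor, candidates), eps, 1.0)
# Largest absolute log-likelihood ratio over single outputs (hence over all sets A).
max_log_ratio = max(abs(math.log(a / b)) for a, b in zip(p, p_nbr))
print(f"max |log ratio| = {max_log_ratio:.4f} <= 2 * eps = {2 * eps:.4f}")
```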
As a first (somewhat trivial) example, we can recover the Laplace mechanism:
Example 7.4.1 (The Laplace mechanism): We can recover Example 7.1.3 through the
exponential mechanism. Indeed, suppose that we wish to release $f : \mathcal{P}_n \to \mathbb{R}^d$, where $\mathrm{GS}_1(f) \le L$.
Then taking $z \in \mathbb{R}^d$, $\ell(P_n, z) = \|f(P_n) - z\|_1$, and $\mu$ to be the usual Lebesgue measure on
$\mathbb{R}^d$, the exponential mechanism simply uses density
\[ q(z \mid P_n) \propto \exp\Big( -\frac{\varepsilon}{L} \|f(P_n) - z\|_1 \Big), \]
which is the Laplace mechanism. ◇
One challenge with the exponential mechanism (7.4.1) is that it is somewhat abstract and is
often hard to compute, as it requires evaluating an often high-dimensional integral to sample from.
Yet it provides a nice abstract mechanism with strong privacy guarantees and, as we shall see, good
utility guarantees. For the moment, we defer further examples and provide utility guarantees when
µ(Z) is finite, giving bounds based on the measure of “bad” solutions. For notational convenience,
we define the optimal value
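\[ \ell^\star(P_n) = \inf_{z \in \mathcal{Z}} \ell(P_n, z), \]
assuming tacitly that it is finite, and the sublevel sets
\[ S_t := \{ z \in \mathcal{Z} \mid \ell(P_n, z) \le \ell^\star(P_n) + t \}. \]
\[ \ell(P_n, Z) \le \ell^\star(P_n) + 2t \]
with probability at least $1 - \exp\big( -\frac{\varepsilon t}{L} + \log\frac{\mu(\mathcal{Z})}{\mu(S_t)} \big)$.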
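Proof Assume without loss of generality (by scaling) that the global Lipschitzian (sensitivity)
constant of $\ell$ is $L = 1$. Then for $Z \sim Q(\cdot \mid P_n)$, we have
\begin{align*}
P(\ell(P_n, Z) \ge \ell^\star(P_n) + 2t)
& = \frac{\int_{S_{2t}^c} \exp(-\varepsilon \ell(P_n, z)) \, d\mu(z)}{\int \exp(-\varepsilon \ell(P_n, z)) \, d\mu(z)}
  = \frac{\int_{S_{2t}^c} \exp(-\varepsilon(\ell(P_n, z) - \ell^\star(P_n))) \, d\mu(z)}{\int \exp(-\varepsilon(\ell(P_n, z) - \ell^\star(P_n))) \, d\mu(z)} \\
& \le \frac{\int_{S_{2t}^c} \exp(-2\varepsilon t) \, d\mu(z)}{\int_{S_t} \exp(-\varepsilon(\ell(P_n, z) - \ell^\star(P_n))) \, d\mu(z)}
  \le \exp(-\varepsilon t) \frac{\mu(S_{2t}^c)}{\mu(S_t)},
\end{align*}
where the final inequality uses that $\ell(P_n, z) - \ell^\star(P_n) \le t$ on $S_t$. Bounding $\mu(S_{2t}^c) \le \mu(\mathcal{Z})$ gives the result.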
We can provide a few simplifications of this result in different special cases. For example, if Z
is finite with cardinality card(Z), then Proposition 7.4.2 implies that taking µ to be the counting
measure on Z we have
Corollary 7.4.3. In addition to the conditions in Proposition 7.4.2, assume that $\mathrm{card}(\mathcal{Z})$ is finite.
Then for any $u \in (0, 1)$, with probability at least $1 - u$,
\[ \ell(P_n, Z) \le \ell^\star(P_n) + \frac{2L}{\varepsilon} \log\frac{\mathrm{card}(\mathcal{Z})}{u}. \]
That is, with extremely high probability, the loss of $Z$ from the exponential mechanism is at most
logarithmic in $\mathrm{card}(\mathcal{Z})$ and grows only linearly with the global sensitivity $L$.
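Because the output space is finite, the tail probability in Corollary 7.4.3 can be computed exactly rather than simulated. The sketch below does so for a hypothetical list of loss values and compares the exact probability of exceeding $\ell^\star + \frac{2L}{\varepsilon}\log\frac{\mathrm{card}(\mathcal{Z})}{u}$ with $u$; the losses, parameters, and the helper name `tail_probability` are illustrative assumptions.

```python
import math

def tail_probability(losses, eps, sensitivity, threshold):
    """Exact P(loss(Z) > min loss + threshold) for the exponential mechanism on a finite set."""
    best = min(losses)
    weights = [math.exp(-eps * (loss - best) / sensitivity) for loss in losses]
    bad = sum(w for w, loss in zip(weights, losses) if loss > best + threshold)
    return bad / sum(weights)

losses = [0.0, 1.0, 2.5, 4.0, 7.0, 9.0, 15.0, 30.0]   # hypothetical values of loss(P_n, z)
eps, sensitivity, u = 1.0, 1.0, 0.05
threshold = 2.0 * sensitivity / eps * math.log(len(losses) / u)
exact_tail = tail_probability(losses, eps, sensitivity, threshold)
print(f"threshold = {threshold:.3f}, exact tail = {exact_tail:.2e}, target u = {u}")
```

In examples like this one the exact tail is typically far below $u$, reflecting the slack in bounding $\mu(S_{2t}^c)$ by $\mu(\mathcal{Z})$.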
A second corollary allows us to bound the expected loss of the exponential mechanism, assuming
we have some control over the measure of the sublevel sets St .
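Corollary 7.4.4. Let $t \ge 0$ be the smallest scalar such that $t \ge \frac{2L}{\varepsilon} \log\frac{\mu(\mathcal{Z})}{\mu(S_t)}$
and $t \ge \frac{L}{\varepsilon}$. Then $Z$
drawn from the exponential mechanism (7.4.1) satisfies
\[ E[\ell(P_n, Z)] \le \ell^\star(P_n) + t + \frac{2L}{\varepsilon} \le \ell^\star(P_n) + 3t
\le \ell^\star(P_n) + O(1) \frac{L}{\varepsilon} \log\Big(1 + \frac{\mu(\mathcal{Z})}{\mu(S_t)}\Big). \]
\begin{align*}
E[\ell(P_n, Z) - \ell^\star(P_n)]
& \le t_0 + \int_{t_0}^\infty P(\ell(P_n, Z) - \ell^\star(P_n) \ge t) \, dt \\
& = t_0 + 2 \int_{t_0/2}^\infty P(\ell(P_n, Z) - \ell^\star(P_n) \ge 2t) \, dt \\
& \le t_0 + 2 \int_{t_0/2}^\infty \exp\Big( -\frac{\varepsilon t}{L} + \log\frac{\mu(\mathcal{Z})}{\mu(S_t)} \Big) dt \\
& \le t_0 + 2 e^\rho \int_{t_0/2}^\infty \exp\Big( -\frac{\varepsilon t}{L} \Big) dt
  = t_0 + \frac{2L}{\varepsilon} \exp\Big( \rho - \frac{\varepsilon t_0}{2L} \Big).
\end{align*}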
Corollary 7.4.4 may seem a bit circular: we require the ratio µ(Z)/µ(St ) to be controlled—but
it is relatively straightforward to use it (and Proposition 7.4.2) with a bit of care and standard
bounds on volumes.
Example 7.4.5 (Empirical risk minimization via the exponential mechanism): We consider
the empirical risk minimization problem, where we have losses $\ell : \Theta \times \mathcal{X} \to \mathbb{R}_+$, where $\Theta \subset \mathbb{R}^d$
is a parameter space of interest, and we wish to choose
\[ \hat{\theta}_n \in \mathop{\rm argmin}_{\theta \in \Theta} \Big\{ L(\theta, P_n) := \frac{1}{n} \sum_{i=1}^n \ell(\theta, x_i) \Big\} \]
where $P_n = \frac{1}{n} \sum_{i=1}^n 1_{x_i}$. We make a few standard assumptions: first, for simplicity, that $n$
is large enough that nd ≥ ε. We also assume that $\Theta \subset \mathbb{R}^d$ is an $\ell_2$-ball of radius $R$, that
$\theta \mapsto \ell(\theta, x_i)$ is $M$-Lipschitz for all $x_i$, and that $\ell(\theta, x_i) \in [0, 2MR]$ for all $\theta \in \Theta$. (Note that
this last is no loss of generality, as $\ell(\theta, x_i) - \inf_{\theta \in \Theta} \ell(\theta, x_i) \le M \sup_{\theta, \theta' \in \Theta} \|\theta - \theta'\|_2 \le 2MR$.)
Take the empirical loss $L(\theta, P_n)$ as our criterion function for the exponential mechanism, which
evidently satisfies $|L(\theta, P_n) - L(\theta, P_n')| \le \frac{2MR}{n}$ whenever $d_{\rm ham}(P_n, P_n') \le 1$, so that we release
$\theta$ with density
\[ q(\theta \mid x) \propto \exp\Big( -\frac{n\varepsilon}{2MR} L(\theta, P_n) \Big). \]
Let $\hat{\theta}_n$ be the empirical minimizer as above; then by the Lipschitz continuity of $\ell$, the sublevel
set $S_t$ evidently satisfies
\[ S_t \supset \Big\{ \theta \in \Theta \mid \|\theta - \hat{\theta}_n\|_2 \le \frac{t}{M} \Big\}. \]
Then a volume calculation (with the factor of 2 necessary because we may have $\hat{\theta}_n$ on the
boundary of $\Theta$) yields that for $\mu$ the Lebesgue measure,
\[ \frac{\mu(S_t)}{\mu(\mathcal{Z})} \ge \Big( \frac{t}{2MR} \Big)^d. \]
As a consequence, by Corollary 7.4.4, whenever $t \ge O(1) \frac{MR}{n\varepsilon} \cdot d \log\frac{MR}{t}$, we have
$E[L(\theta, P_n) \mid P_n] \le L(\hat{\theta}_n, P_n) + 3t$. The choice $t = O(1) \frac{MRd}{n\varepsilon}$ suffices whenever dε ≤ 1, so we obtain
\[ E[L(\theta, P_n)] \le L(\hat{\theta}_n, P_n) + O(1) \frac{MRd}{n\varepsilon} \log\frac{n\varepsilon}{d}, \]
whenever $\frac{d}{n\varepsilon} \le 1$. Notably, standard empirical risk minimization (recall Chapter 4.4) typically
achieves rates of convergence roughly of $MR/\sqrt{n}$, so that the gap of the exponential mechanism
is lower order whenever $\frac{d}{\sqrt{n}\varepsilon} \le 1$. ◇
which measures the amount that changing k observations in Pn can change the function f . In the
privacy literature, the particular choice k = 1 yields the local sensitivity
\[ Z = f(P_n) + \frac{\mathrm{LS}(f, P_n)}{\varepsilon} \cdot W \quad \text{for } W \sim \mathsf{Laplace}(1), \]
which is analogous to the Laplace mechanism (7.1.3), except that the noise scales with the local
sensitivity of f at Pn . The issue, as the next example makes clear, is that the scale of this noise
can compromise privacy.
Example 7.4.6 (The sensitivity of the sensitivity): Consider estimating a median $f(P_n) =
\mathrm{med}(P_n)$, where the data $x \in [0, 1]$ and $n = 2m + 1$ for simplicity, to make the median
unique. If the sample consists of $m$ points $x_i = 0$ and $m + 1$ points $x_i = 1$, then the sensitivity
$\omega_f(1, P_n) = 1$, the maximal value: we simply move one example from $x_i = 1$ to $x_i = 0$,
changing the median from $\mathrm{med}(P_n) = 1$ to $0$. On the other hand, on the sample $P_n'$ with $m - 1$
points $x_i = 0$ and $m + 2$ points $x_i = 1$, the sensitivity $\omega_f(1, P_n') = 0$, because changing a single
example cannot move the median from $f(P_n') = 1$. ◇
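The two samples in Example 7.4.6 are easy to check by brute force. The sketch below computes the local sensitivity of the median by replacing each entry in turn with the extreme values 0 and 1 (sufficient here because the median is monotone in each coordinate); the helper name `local_sensitivity_median` and the choice $m = 3$ are illustrative assumptions.

```python
import statistics

def local_sensitivity_median(xs, lo=0.0, hi=1.0):
    """max |med(x') - med(x)| over x' differing from x in one entry, with data in [lo, hi]."""
    base = statistics.median(xs)
    worst = 0.0
    for i in range(len(xs)):
        for value in (lo, hi):   # the median is monotone in each entry, so extremes suffice
            changed = xs[:i] + [value] + xs[i + 1:]
            worst = max(worst, abs(statistics.median(changed) - base))
    return worst

m = 3
sample = [0.0] * m + [1.0] * (m + 1)            # m zeros and m + 1 ones
neighbor = [0.0] * (m - 1) + [1.0] * (m + 2)    # differs from `sample` in one entry
print(local_sensitivity_median(sample), local_sensitivity_median(neighbor))
```

The two neighboring samples have local sensitivities at opposite extremes, which is exactly why noise scaled to the local sensitivity can leak information.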
Instead of using the inherently unstable quantity ω, then, we can instead use, essentially, its
inverse: define the inverse sensitivity
where df (t, Pn ) = +∞ if no Pn0 yields f (Pn0 ) = t. So df (t, Pn ) counts the number of examples that
must be changed in the sample Pn to move f (Pn ) to a target t, and by inspection, always satisfies
Then the inverse sensitivity mechanism releases a value $t$ with probability density proportional to
\[ q(t \mid P_n) \propto \exp\Big( -\frac{\varepsilon}{2} d_f(t, P_n) \Big). \tag{7.4.4} \]
Implicit in the definition (7.4.4) is a base measure µ, typically one of Lebesgue measure or counting
measure on a discrete set. Then a quick calculation (or recognition that the density (7.4.4) is a
particular instance of the exponential mechanism) gives the following proposition.
Proposition 7.4.7. Let M be the inverse sensitivity mechanism with density (7.4.4). Then M is
ε-differentially private.
As in the general exponential mechanism (7.4.1), efficiently sampling from the density (7.4.4)
can be challenging. Some cases admit easier reformulations.
Example 7.4.8 (Mean estimation with bounded data): Suppose the data $x \in [a, b]$ are
bounded and we wish to estimate the sample mean $f(P_n) = E_{P_n}[X] = \bar{x}_n$, where $P_n =
\frac{1}{n} \sum_{i=1}^n 1_{x_i}$. Changing a single observation can move the mean by at most $\frac{b-a}{n}$ (replace $x_i = a$
with $x_i' = b$). Thus, while discretization issues and that we may have $x_i \notin \{a, b\}$ make precisely
computing $d_f$ tedious, the approximation
\[ d_{\rm mean}(t, P_n) = \frac{n |t - \bar{x}_n|}{b - a}, \]
where we define $d_{\rm mean}(t, P_n) = +\infty$ for $t \notin [a, b]$, is both Lipschitz (with respect to the
Hamming metric) in the sample $P_n$, and approximates $d_f(t, P_n)$. (See Exercise 7.8 for a more
general approach justifying this particular approximation.) The approximation
The density (7.4.5) yields a particular step-like density. Define the shells
\[ S_k = \Big( \Big[\bar{x}_n - k \frac{b-a}{n}, \bar{x}_n - (k-1) \frac{b-a}{n}\Big] \cup \Big[\bar{x}_n + (k-1) \frac{b-a}{n}, \bar{x}_n + k \frac{b-a}{n}\Big] \Big) \cap [a, b] \]
corresponding to the amount the mean may change if we modify $k$ examples and let $\mathrm{Vol}(S_k)$ be
volume (length) of the intervals making up $S_k$. To sample from the density (7.4.5), note that
the denominator $C(P_n) := \int_a^b \exp(-\frac{\varepsilon}{2} d_{\rm mean}(s, P_n)) \, ds = \sum_{k=1}^n \mathrm{Vol}(S_k) e^{-\frac{k\varepsilon}{2}}$. Then we draw an
index $I \in [n]$ with probability $P(I = k) = \mathrm{Vol}(S_k) e^{-\varepsilon k/2} / C(P_n)$, and then choose $t$ uniformly
at random within $S_k$. ◇
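The two-stage sampler just described (draw a shell index, then a uniform point in that shell) can be sketched as follows. This is a minimal illustration assuming the shells are exactly as defined above, with empty shells (those clipped away by $[a, b]$) dropped; the function names and the synthetic data are hypothetical.

```python
import math
import random

def mean_shells(mean, a, b, n):
    """Shells S_k around the sample mean, clipped to [a, b]; empty shells are dropped."""
    width = (b - a) / n
    shells = []
    for k in range(1, n + 1):
        left = (max(a, mean - k * width), max(a, mean - (k - 1) * width))
        right = (min(b, mean + (k - 1) * width), min(b, mean + k * width))
        intervals = [(lo, hi) for lo, hi in (left, right) if hi > lo]
        if intervals:
            shells.append((k, intervals))
    return shells

def sample_private_mean(xs, a, b, eps, rng):
    """Draw shell k with probability proportional to Vol(S_k) exp(-eps k / 2), then t uniform in S_k."""
    shells = mean_shells(sum(xs) / len(xs), a, b, len(xs))
    weights = [sum(hi - lo for lo, hi in ivs) * math.exp(-eps * k / 2) for k, ivs in shells]
    _, intervals = rng.choices(shells, weights=weights)[0]
    # Within the chosen shell, pick one of its (at most two) intervals by length.
    lo, hi = rng.choices(intervals, weights=[h - l for l, h in intervals])[0]
    return rng.uniform(lo, hi)

rng = random.Random(0)
data = [rng.uniform(0.0, 1.0) for _ in range(200)]
draws = [sample_private_mean(data, 0.0, 1.0, eps=1.0, rng=rng) for _ in range(1000)]
# Sanity check: the clipped shells should tile [a, b].
total_length = sum(hi - lo for _, ivs in mean_shells(0.3, 0.0, 1.0, 10) for lo, hi in ivs)
```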
Example 7.4.9 (Median estimation): For the median, the inverse sensitivity takes a particularly
clean form, making sampling from the density (7.4.4) fairly straightforward. In this
case, for a sample $P_n = \frac{1}{n} \sum_{i=1}^n 1_{x_i}$, where $x_i \in \mathbb{R}$, we have
the number of examples between the median $f(P_n)$ and putative target $t$. If the data lie in
a range $x \in [a, b]$, then the density $q$ is relatively straightforward to compute. Similar to the
approach to the stepped density in Example 7.4.8, divide $[a, b]$ into the intervals
\[ S_k^- := [a_k^-, a_{k-1}^-] \quad \text{and} \quad S_k^+ := [a_{k-1}^+, a_k^+], \quad k = 1, \ldots, n/2, \]
where
\[ a_k^- = \inf\{ f(P_n') \mid d_{\rm ham}(P_n, P_n') \le k \} \quad \text{and} \quad a_k^+ = \sup\{ f(P_n') \mid d_{\rm ham}(P_n, P_n') \le k \}. \]
That is, $a_k^-$ is the smallest we can make the median by changing $k$ examples and $a_k^+$ the largest,
corresponding to the $\frac{1}{2} - \frac{k}{n}$ and $\frac{1}{2} + \frac{k}{n}$ quantiles of the sample $P_n$, where the 0 quantile is
$a$ and 1 quantile is $b$. Then defining the normalization constant
\[ C(P_n) := \int_a^b \exp\Big( -\frac{\varepsilon}{2} d_f(t, P_n) \Big) dt = \sum_{k=1}^n \mathrm{Vol}(S_k^- \cup S_k^+) \exp\Big( -\frac{\varepsilon}{2} k \Big) \]
(where the volume is simply interval length), we may sample from the density (7.4.4) by first
drawing a random index $I \in \{1, \ldots, n\}$ with probability
\[ P(I = k \mid P_n) = \frac{\mathrm{Vol}(S_k^- \cup S_k^+)}{C(P_n)} \exp\Big( -\frac{\varepsilon}{2} k \Big), \]
then drawing $t$ uniformly at random in one of the intervals $S_k^-$ or $S_k^+$, chosen with probabilities
$\mathrm{Vol}(S_k^-)/\mathrm{Vol}(S_k^- \cup S_k^+)$ or $\mathrm{Vol}(S_k^+)/\mathrm{Vol}(S_k^- \cup S_k^+)$, respectively. ◇
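A sketch of this median sampler follows, assuming an odd sample size and taking the interval endpoints $a_k^\pm$ to be the order statistics $k$ positions below and above the median (with $a$ and $b$ as the outermost endpoints). The function names and the data are hypothetical, and degenerate samples in which every shell has zero length are not handled.

```python
import math
import random

def median_shells(xs, a, b):
    """Intervals S_k^-, S_k^+ on which moving the median to t requires changing k points."""
    s = sorted(xs)
    m = len(s) // 2                      # index of the median, assuming len(xs) is odd
    padded = [a] + s + [b]               # padded[i + 1] == s[i]
    shells = []
    for k in range(1, m + 2):
        left = (padded[m + 1 - k], padded[m + 2 - k])    # [a_k^-, a_{k-1}^-]
        right = (padded[m + k], padded[m + k + 1])       # [a_{k-1}^+, a_k^+]
        intervals = [(lo, hi) for lo, hi in (left, right) if hi > lo]
        if intervals:
            shells.append((k, intervals))
    return shells

def sample_private_median(xs, a, b, eps, rng):
    """Draw k with probability proportional to Vol(S_k^- u S_k^+) exp(-eps k / 2), then t uniformly."""
    shells = median_shells(xs, a, b)
    weights = [sum(hi - lo for lo, hi in ivs) * math.exp(-eps * k / 2) for k, ivs in shells]
    _, intervals = rng.choices(shells, weights=weights)[0]
    lo, hi = rng.choices(intervals, weights=[h - l for l, h in intervals])[0]
    return rng.uniform(lo, hi)

rng = random.Random(1)
data = [0.1, 0.2, 0.4, 0.45, 0.5, 0.55, 0.6, 0.8, 0.9]
draws = [sample_private_median(data, 0.0, 1.0, eps=2.0, rng=rng) for _ in range(2000)]
```

Choosing the side of the shell in proportion to its length and then sampling uniformly is equivalent to sampling uniformly on $S_k^- \cup S_k^+$.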
The particular sampling strategies, where we construct concentric shells $S_k$ around $f(P_n)$ and
sample from these with geometrically decaying probabilities $e^{-k\varepsilon/2}$, point toward more general
sampling strategies and optimality guarantees for the inverse sensitivity mechanism. Define the
“shells”
\[ S_k := \{ f(P_n') \mid d_{\rm ham}(P_n, P_n') = k \}. \]
We focus on sampling from the density (7.4.4) in the case $t \in \mathbb{R}$, so sampling is equivalent to
drawing an index $I \in [n]$ with probability
\[ P(I = k \mid P_n) = \frac{\mathrm{Vol}(S_k)}{C(P_n)} e^{-\frac{\varepsilon}{2} k} \quad \text{for} \quad C(P_n) := \sum_{k=1}^n \mathrm{Vol}(S_k) e^{-\frac{\varepsilon}{2} k}, \tag{7.4.6} \]
then choosing $t$ uniformly at random in $S_k$.
Define the shorthand $\omega(k) = \omega_f(k, P_n)$. Then the values $t \in S_k$ all satisfy $|f(P_n) - t| \le \omega(k)$,
and so the inverse sensitivity mechanism $M$ guarantees
\[ E[|M(P_n) - f(P_n)|] \le \sum_{k=1}^n P(M(P_n) \in S_k) \, \omega(k). \]
Now our calculations become heuristic, where we make an effort to give the rough flavor of results
possible, and later apply the care necessary for tighter guarantees. Suppose that the interval lengths
$\mathrm{Vol}(S_k)$ are of the same order for $k \lesssim \frac{1}{\varepsilon}$, and grow only polynomially quickly for $k \gg \frac{1}{\varepsilon}$. Then
we have the heuristic bound $C(P_n) := \sum_{k=1}^n \mathrm{Vol}(S_k) e^{-k\varepsilon/2} \gtrsim \mathrm{Vol}(S_1) \sum_{k=1}^n e^{-k\varepsilon/2} \gtrsim \varepsilon^{-1} \mathrm{Vol}(S_1)$,
while
\[ E[|M(P_n) - f(P_n)|] \le \sum_{k=1}^n \frac{\mathrm{Vol}(S_k) e^{-k\varepsilon/2}}{\sum_{i=1}^n \mathrm{Vol}(S_i) e^{-i\varepsilon/2}} \, \omega(k)
\stackrel{\rm heuristic}{\lesssim} \sum_{k=1}^n \varepsilon e^{-k\varepsilon/2} \omega(k) \lesssim \max_k e^{-k\varepsilon/2} \omega(k), \]
where the heuristic inequality is our bound on the normalizing constant $C(P_n)$, and the final bound
follows because maxima are larger than (weighted) averages. Continuing the heuristic derivation,
the final maximum places exponentially small weight on $\omega(k)$ for $k \gg \frac{1}{\varepsilon}$. Thus (and again, this is
highly non-rigorous) we expect roughly that
\[ E[|M(P_n) - f(P_n)|] \stackrel{\rm heuristic}{\lesssim} \max_k e^{-k\varepsilon/2} \omega(k) \stackrel{\rm heuristic}{\lesssim} \omega_f\Big(\frac{c}{\varepsilon}, P_n\Big), \tag{7.4.7} \]
where $c$ is some numerical constant.
To gain some intuition for the claims of optimality we have made, let us revisit the equivalent
definitions of privacy that repose on testing, as in Eq. (7.1.4) and Proposition 7.1.6. By the
definition of differential privacy, the inverse sensitivity mechanism satisfies
\[ P(M(P_n) \in A) \le e^{k\varepsilon} P(M(P_n') \in A) \]
for any samples $P_n, P_n'$ satisfying $d_{\rm ham}(P_n, P_n') \le k$. So for $k \le \frac{1}{\varepsilon}$, we have
\[ P(M(P_n) \in A) \le \exp(1) P(M(P_n') \in A), \]
and so no procedure exists that can test whether the sample is $P_n$ or $P_n'$ with probability of error less
than $e^{-2}$, by Proposition 7.1.6. Thus, at a fundamental level, no procedure can reliably distinguish
the outputs of $M(P_n)$ from those of $M(P_n')$ when $P_n$ and $P_n'$ differ in only $1/\varepsilon$ examples. Thus, we
cannot expect to estimate $f(P_n)$ to accuracy better than $\omega_f(\frac{1}{\varepsilon}, P_n)$, and so for any $\varepsilon$-differentially
private mechanism $M$ and $P_n$, there exists $P_n' \in \mathcal{P}_n$ with $d_{\rm ham}(P_n, P_n') \le \frac{1}{\varepsilon}$ and for which
\[ \max_{\hat{P} \in \{P_n, P_n'\}} E\big[ |M(\hat{P}) - f(\hat{P})| \big] \gtrsim \omega_f\Big(\frac{1}{\varepsilon}, P_n\Big), \tag{7.4.8} \]
which the heuristic calculation (7.4.7) achieves.
To provide more rigorous guarantees requires restrictions on the functions f whose values we
wish to release. The simplest is that the function f : Pn → R obey a natural ordering property,
where larger changes in the sample distribution Pn beget larger changes in f .
So the mean and median (Examples 7.4.8 and 7.4.9) are both sample monotone. So, too, are
appropriately continuous functions $f$. For this, we make the obvious identification of $f : \mathcal{P}_n \to \mathbb{R}$
with the induced function on $\mathcal{X}^n$ by defining $f_{\mathcal{X}}(x_1^n) := f(n^{-1} \sum_{i=1}^n 1_{x_i})$. Then we say $f : \mathcal{P}_n \to \mathbb{R}$
is continuous if the induced function $f_{\mathcal{X}}$ is.
Proof Identify $f$ with its induced function $f_{\mathcal{X}}$ for notational simplicity, and let $x \in \mathcal{X}^n$,
$f(x) \le s \le t$, and $P_n = n^{-1} \sum_{i=1}^n 1_{x_i}$ be the empirical distribution associated with $x$. We
show that $d_f(s, P_n) \le d_f(t, P_n)$. If $d_f(t, P_n) = +\infty$, then the desired inequality holds trivially.
Otherwise, let $x' \in \mathcal{X}^n$ satisfy $f(x') = t$ and $d_{\rm ham}(x, x') = d_f(t, P_n)$. Then the function
$g(\lambda) := f((1 - \lambda)x + \lambda x')$ is continuous in $\lambda$ and satisfies $g(0) = f(x) \le g(1) = f(x') = t$. By the
intermediate value theorem, there exists $\lambda_s \in [0, 1]$ with $g(\lambda_s) = s$, and as $\mathcal{X}$ is convex the vector
$x_s = (1 - \lambda_s)x + \lambda_s x' \in \mathcal{X}^n$ satisfies $f(x_s) = g(\lambda_s) = s$. That $x_s$ is a convex combination of $x$ and
$x'$ then implies $d_f(s, P_n) \le d_{\rm ham}(x, x_s) \le d_{\rm ham}(x, x') = d_f(t, P_n)$.
With Definition 7.8 in place, we can provide a few stronger guarantees for the inverse sensitivity
mechanism. To avoid pathological sampling issues, one replaces the inverse sensitivity df (t, Pn ) with
a “smoothed” version, where for ρ ≥ 0 we define
(Pathological cases include estimating the median where the sample Pn consists of a single point re-
peated n times, which would make the density (7.4.4) uniform.) Then instead of the density (7.4.4),
we define the continuous inverse sensitivity mechanism Mcont to have density
While the parameter $\rho$ adds complexity, setting it to be very small (say, $\rho = \frac{1}{n^2}$) is a reasonable
practical default.
The continuous inverse sensitivity enjoys fairly strong error guarantees, as the next two propositions
demonstrate, providing two prototypical results. (Exercises 7.11 and 7.12 show how to prove
the propositions.) The first proposition shows that the inverse sensitivity mechanism is essentially
never worse than the Laplace mechanism (7.1.3) when $\varepsilon \lesssim 1$.
Proposition 7.4.11. Let $f$ be sample monotone (Definition 7.8) and have finite global sensitivity
$\mathrm{GS}(f) < \infty$. Then taking $\rho = 0$,
\[ E[|M_{\rm cont}(P_n) - f(P_n)|] \le \frac{1}{1 - e^{-\varepsilon/2}} \mathrm{GS}(f). \]
As Example 7.1.3 shows, the standard Laplace mechanism $M$ has error
\[ E[|M(P_n) - f(P_n)|] = \frac{\mathrm{GS}(f)}{\varepsilon}, \]
the same scaling Proposition 7.4.11 guarantees, because $1 - e^{-\varepsilon/2} = \varepsilon/2 + O(\varepsilon^2)$.
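A quick numerical comparison shows how close the two constants are. The sketch below evaluates the ratio of the factor $1/(1 - e^{-\varepsilon/2})$ from Proposition 7.4.11 to the Laplace factor $1/\varepsilon$ for a few (arbitrarily chosen) values of $\varepsilon \le 1$; the function names are hypothetical.

```python
import math

def inverse_sensitivity_factor(eps):
    """Multiplier of GS(f) in Proposition 7.4.11: 1 / (1 - exp(-eps / 2))."""
    return 1.0 / (1.0 - math.exp(-eps / 2.0))

def laplace_factor(eps):
    """Multiplier of GS(f) for the Laplace mechanism: 1 / eps."""
    return 1.0 / eps

ratios = {eps: inverse_sensitivity_factor(eps) / laplace_factor(eps)
          for eps in (0.01, 0.1, 0.5, 1.0)}
for eps, ratio in ratios.items():
    print(f"eps={eps:4.2f}  ratio of error factors = {ratio:.3f}")
```

For $\varepsilon \le 1$ the ratio stays between 2 and about 2.6, consistent with the expansion $1 - e^{-\varepsilon/2} = \varepsilon/2 + O(\varepsilon^2)$.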
For the next proposition, which provides a more nuanced guarantee, we require local sensitivities
for samples $P_n'$ near $P_n$, and so we define the largest local sensitivity within Hamming distance $K$
of the sample $P_n$ by
\[ L(K) := \sup_{P_n' \in \mathcal{P}_n} \{ \mathrm{LS}(f, P_n') \mid d_{\rm ham}(P_n, P_n') \le K \}
= \sup_{P_n' \in \mathcal{P}_n} \{ \omega_f(1, P_n') \mid d_{\rm ham}(P_n, P_n') \le K \}, \]
where we recall the definition (7.4.2) of the local sensitivity of $f$. Then we have the following.
Proposition 7.4.12. Let $f$ be sample monotone (Definition 7.8) and have finite global sensitivity
$\mathrm{GS}(f) < \infty$. Then for any $\rho \ge 0$ and $K_n = \big\lceil \frac{4 \log(2n \mathrm{GS}(f)/\rho)}{\varepsilon} \big\rceil$,
\[ E[|M_{\rm cont}(P_n) - f(P_n)|] \le 2\rho + \frac{1}{1 - e^{-\varepsilon/2}} L(K_n). \]
Unpacking Proposition 7.4.12 a bit, let us make the default substitution $\rho = \frac{1}{n^2}$. Then because
$1 - e^{-\varepsilon/2} = \varepsilon/2 + O(\varepsilon^2)$, for $\varepsilon \lesssim 1$ this yields
\[ E[|M_{\rm cont}(P_n) - f(P_n)|] \lesssim \frac{1}{\varepsilon} \sup_{P_n' \in \mathcal{P}_n} \{ \mathrm{LS}(f, P_n') \mid d_{\rm ham}(P_n', P_n) \le K_n \} + \frac{1}{n^2}, \]
where $K_n = \frac{4 \log \mathrm{GS}(f) + 12 \log n}{\varepsilon} \lesssim \frac{1}{\varepsilon} \log n$ for large sample sizes $n$. Comparing this to the sketched
lower bound (7.4.8), these quantities are of the same order whenever the moduli of continuity
$\omega_f(k; P_n)$ are roughly additive and comparable near $P_n$, so that for $k \lesssim \frac{1}{\varepsilon}$ there is a chain
$P_n, P_n^{(1)}, P_n^{(2)}, \ldots, P_n^{(k)}$ with $d_{\rm ham}(P_n^{(i)}, P_n^{(i+1)}) = 1$ and $\omega_f(k; P_n) \gtrsim \sum_{i=1}^k \mathrm{LS}(f, P_n^{(i)})$ and $\mathrm{LS}(f, P_n^{(i)}) \asymp
\mathrm{LS}(f, P_n')$ for $P_n'$ satisfying $d_{\rm ham}(P_n, P_n') \lesssim \frac{\log n}{\varepsilon}$. Under these conditions (which often require care
to check, but which hold, for example, for mean estimation) we then obtain
\[ E[|M_{\rm cont}(P_n) - f(P_n)|] \lesssim \omega_f\Big(\frac{1}{\varepsilon}, P_n\Big) + \frac{1}{n^2}. \]
In particular, when x ∈ T , we may take the density r so that p(x) ≤ r(x) ≤ q(x), as
by the inequalities (7.5.1), and so that R(X ) = 1. With this, we evidently have r(x) ≤ eε q(x) by
construction, and because S ⊂ T c , we have
by assumption.
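Now, we turn to the second statement of the lemma. We start with the easy direction, where
we assume that $P_0$ and $Q_0$ satisfy $D_\infty(P_0 \| Q_0) \le \varepsilon$ and $D_\infty(Q_0 \| P_0) \le \varepsilon$ as well as $\|P - P_0\|_{\rm TV} \le \frac{\delta}{1 + e^\varepsilon}$
and $\|Q - Q_0\|_{\rm TV} \le \frac{\delta}{1 + e^\varepsilon}$. Then for any set $S$ we have
\[ P(S) \le P_0(S) + \frac{\delta}{1 + e^\varepsilon} \le e^\varepsilon Q_0(S) + \frac{\delta}{1 + e^\varepsilon}
\le e^\varepsilon Q(S) + e^\varepsilon \frac{\delta}{1 + e^\varepsilon} + \frac{\delta}{1 + e^\varepsilon} = e^\varepsilon Q(S) + \delta, \]
or $D_\infty^\delta(P \| Q) \le \varepsilon$. The other direction is similar.
We consider the converse direction, where we have both $D_\infty^\delta(P \| Q) \le \varepsilon$ and $D_\infty^\delta(Q \| P) \le \varepsilon$. Let
us construct $P_0$ and $Q_0$ as in the statement of the lemma. Define the sets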
and similarly
\[ \alpha' := Q(S') - Q_1(S') = \frac{Q(S') - e^\varepsilon P(S')}{1 + e^\varepsilon} \le \frac{\delta}{1 + e^\varepsilon}. \]
Note also that we have $P(S) - P_1(S) = Q_1(S) - Q(S)$ and $Q(S') - Q_1(S') = P_1(S') - P(S')$ by
construction.
We assume w.l.o.g. that $\alpha \ge \alpha'$, so that if $\beta = \alpha - \alpha' \ge 0$, we have $\beta \le \frac{\delta}{1 + e^\varepsilon}$, and we have the
sandwiching
where
\[ P_0(S \cup T) = P_1(S \cup T) + \beta \quad \text{and} \quad Q_0(S \cup T) = Q_1(S \cup T) - \beta. \tag{7.5.2} \]
With these choices, we evidently obtain $Q_0(\mathcal{X}) = P_0(\mathcal{X}) = 1$ and that $D_\infty(P_0 \| Q_0) \le \varepsilon$ and
$D_\infty(Q_0 \| P_0) \le \varepsilon$ by construction. It remains to consider the variation distances. As $p_0 = p$ on $T'$,
we have
\begin{align*}
\|P - P_0\|_{\rm TV}
& = \frac{1}{2} \int_S |p - p_0| + \frac{1}{2} \int_{S'} |p - p_0| + \frac{1}{2} \int_T |p - p_0| \\
& = \frac{1}{2} (P(S) - P_0(S)) + \frac{1}{2} (P_0(S') - P(S')) + \frac{1}{2} (P_0(T) - P(T)) \\
& \le \frac{1}{2} \underbrace{(P(S) - P_1(S))}_{= \alpha} + \frac{1}{2} \underbrace{(P_0(S') - P(S'))}_{= \alpha'} + \frac{1}{2} \underbrace{(P_0(T) - P(T))}_{\le \beta},
\end{align*}
where the $P_0(T) - P(T) \le \beta$ claim follows because $p_1(x) = p(x)$ on $T$ and by the increasing
construction yielding equality (7.5.2), we have $P_0(T) - P(T) = P_0(T) - P_1(T) = \beta + P_1(S) -
P_0(S) \le \beta$. In particular, we have $\|P - P_0\|_{\rm TV} \le \frac{\alpha + \alpha'}{2} + \frac{\beta}{2} = \alpha \le \frac{\delta}{1 + e^\varepsilon}$. The argument that
$\|Q - Q_0\|_{\rm TV} \le \frac{\delta}{1 + e^\varepsilon}$ is similar.
7.6 Bibliography
Given the broad focus of this book, our treatment of privacy is necessarily somewhat brief, and
there is substantial depth to the subject that we do not cover.
The initial development of randomized response began with Warner [173], who proposed ran-
domized response in survey sampling as a way to collect sensitive data. This elegant idea remained
in use for many years, and a generalization to data release mechanisms with bounded likelihood
ratios—essentially, the local differential privacy definition 7.2—is due to Evfimievski et al. [80] in
2003 in the databases community. Dwork, McSherry, Nissim, and Smith [74] and the subsequent
work of Dwork et al. [73] defined differential privacy and its (ε, δ)-approximate relaxation. A small
industry of research has built out of these papers, with numerous extensions and developments.
The exponential mechanism is due to McSherry and Talwar [139].
The book of Dwork and Roth [72] surveys much of the field, from the perspective of computer
science, as of 2014. Lemma 7.2.10 is due to Dwork et al. [75], and our proof is based on theirs.
7.7 Exercises
Exercise 7.1: Prove Proposition 7.2.1.
Exercise 7.2: Prove Proposition 7.4.7.
Exercise 7.3 (Laplace mechanisms versus randomized response): In this question, you will
investigate using Laplace and randomized response mechanisms, as in Examples 7.1.3 and 7.1.1–
7.1.2, to perform locally private estimation of a mean, and compare this with randomized-response
based mechanisms.
We consider the following scenario: we have data Xi ∈ [0, 1], drawn i.i.d., and wish to estimate
the mean E[X] under local ε-differential privacy.
(a) The Laplace mechanism simply sets $Z_i = X_i + W_i$ for $W_i \stackrel{\rm iid}{\sim} \mathsf{Laplace}(b)$ for some $b$. What choice of $b$ guarantees $\varepsilon$-local differential privacy?
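(c) A randomized response mechanism for this case is the following: first, we randomly round $X_i$ to $\{0, 1\}$, by setting
\[
\tilde X_i = \begin{cases} 1 & \text{with probability } X_i \\ 0 & \text{otherwise.} \end{cases}
\]
Conditional on $\tilde X_i = x$, we then set
\[
Z_i = \begin{cases} x & \text{with probability } \frac{e^\varepsilon}{1+e^\varepsilon} \\ 1 - x & \text{with probability } \frac{1}{1+e^\varepsilon}. \end{cases}
\]
What is $E[Z_i]$?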
(d) For the randomized response $Z_i$ above, give constants $a$ and $b$ so that $aZ_i - b$ is unbiased for $E[X]$, that is, $E[aZ_i - b] = E[X]$. Let $\hat\theta_n = \frac{1}{n}\sum_{i=1}^n (aZ_i - b)$ be your mean estimator. What is $E[(\hat\theta_n - E[X])^2]$? Does this converge to the mean-square error of the sample mean $E[(\bar X_n - E[X])^2] = \mathrm{Var}(X)/n$ as $\varepsilon \uparrow \infty$?
(e) Now, it is time to compare the simple randomized response estimator from part (d) with the
Laplace mechanism from part (b). For each of the following distributions, generate samples
of size N = 10, 100, 1000, 10000, and then for T = 25 tests, compute the two estimators, both
with ε = 1. Then plot the mean-squared error and confidence intervals for each of the two
methods as well as the sample mean without any privacy.
Do you prefer the Laplace or randomized response mechanism? In one sentence, why?
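One way to organize the simulation in part (e) is sketched below in Python. The sketch rests on assumptions that are not stated in the exercise and should be checked against your own answers: it uses Laplace scale 1/ε for data in [0, 1], debiasing constants a = (e^ε + 1)/(e^ε − 1) and b = 1/(e^ε − 1), and Uniform[0, 1] data (so the true mean is 0.5). All function names are hypothetical.

```python
import math
import random

def laplace_noise(scale, rng):
    # The difference of two independent Exponential(1) draws is Laplace(1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def laplace_estimate(xs, eps, rng):
    # Assumed scale 1/eps (data in [0, 1]); average the privatized values.
    return sum(x + laplace_noise(1.0 / eps, rng) for x in xs) / len(xs)

def rr_expected_z(x, eps):
    # E[Z | X = x] for the round-then-flip mechanism of part (c).
    return (1.0 + x * math.expm1(eps)) / (1.0 + math.exp(eps))

def rr_debias_constants(eps):
    # Assumed constants (a, b) making a * Z - b unbiased for X.
    return (math.exp(eps) + 1.0) / math.expm1(eps), 1.0 / math.expm1(eps)

def rr_estimate(xs, eps, rng):
    a, b = rr_debias_constants(eps)
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    total = 0.0
    for x in xs:
        rounded = 1 if rng.random() < x else 0               # randomized rounding
        z = rounded if rng.random() < keep else 1 - rounded  # randomized response
        total += a * z - b
    return total / len(xs)

def compare(n, eps=1.0, trials=25, seed=0):
    """Mean-squared errors around the true mean 0.5 of Uniform[0, 1] data."""
    rng = random.Random(seed)
    mse = {"mean": 0.0, "laplace": 0.0, "rr": 0.0}
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        mse["mean"] += (sum(xs) / n - 0.5) ** 2 / trials
        mse["laplace"] += (laplace_estimate(xs, eps, rng) - 0.5) ** 2 / trials
        mse["rr"] += (rr_estimate(xs, eps, rng) - 0.5) ** 2 / trials
    return mse
```

Calling `compare(n)` for each n in 10, 100, 1000, 10000 gives the three error curves to plot; confidence intervals would need the per-trial squared errors, which this sketch averages away.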
Exercise 7.4 (A more sophisticated randomized response scheme): Let us consider a more sophisticated randomized response scheme than that in Exercise 7.3. Define quantized values
\[
b_0 = 0, \quad b_1 = \frac{1}{k}, \quad \ldots, \quad b_{k-1} = \frac{k-1}{k}, \quad b_k = 1. \qquad (7.7.1)
\]
Now consider a randomized response estimator that, when $X \in [b_j, b_{j+1}]$, first rounds $X$ randomly to $\tilde X \in \{b_j, b_{j+1}\}$ so that $E[\tilde X \mid X] = X$. Conditional on $\tilde X = j$, we then set
\[
Z = \begin{cases} j & \text{with probability } \frac{e^\varepsilon}{k + e^\varepsilon} \\ \mathsf{Uniform}(\{0, \ldots, k\} \setminus \{j\}) & \text{with probability } \frac{k}{k + e^\varepsilon}. \end{cases}
\]
(c) For any given $\varepsilon > 0$, give (approximately) the $k$ in the choice of the number of bins (7.7.1) that optimizes your bound, and (approximately) evaluate $E[(\hat\theta_n - E[X])^2]$ with your choice of $k$. As $\varepsilon \uparrow \infty$, does this converge to $\mathrm{Var}(X)/n$?
Exercise 7.5 (Subsampling via divergence measures (Balle et al. [14])): The hockey stick divergence functional, defined for $\alpha \ge 1$, is $\phi_\alpha(t) = [1 - \alpha t]_+$. It is straightforward to relate this to $(\varepsilon, \delta)$-differential privacy via Definition 7.6: two distributions $P$ and $Q$ are $(\varepsilon, \delta)$-close if and only if their $\phi_{e^\varepsilon}$-divergences are less than $\delta$, i.e., if and only if
(In your answer to this question, feel free to use Dα (P ||Q) as a shorthand for Dφα (P ||Q).)
(a) Let P0 , P1 , Q1 be any three distributions, and for some q ∈ [0, 1] and α ≥ 1, define P =
(1 − q)P0 + qP1 and Q = (1 − q)P0 + qQ1 . Let α0 = 1 + q(α − 1) = (1 − q) + qα and
θ = α0 /α ≤ 1. Show that
(b) Let ε > 0 and define ε(q) = log(1 + q(eε − 1)). Show that
Exercise 7.6 (Subsampling and privacy amplification (Balle et al. [14])): Consider the follow-
ing subsampling approach to privacy. Assume that we have a private (randomized) algorithm,
represented by A, that acts on samples of size m and guarantees (ε, δ)-differential privacy. The
subsampling mechanism is then defined as follows: given a sample $X_1^n$ of size $n > m$, choose a subsample $X_{\rm sub}$ of size $m$ uniformly at random from $X_1^n$, and then release $Z = A(X_{\rm sub})$.
(a) Use the results of parts (a) and (b) in Exercise 7.5 to show that $Z$ is $(\varepsilon(q), \delta q)$-differentially private, where $q = m/n$ and $\varepsilon(q) = \log(1 + q(e^\varepsilon - 1))$.
(b) Show that if $\varepsilon \le 1$, then $Z$ is $((e - 1)q\varepsilon, q\delta)$-differentially private, and if $\varepsilon \le \frac12$, then $Z$ is $(2(\sqrt{e} - 1)q\varepsilon, q\delta)$-differentially private. Hint: Argue that for any $T > 0$, one has $e^t - 1 \le (e^T - 1)\frac{t}{T}$ for all $t \in [0, T]$.
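As a numerical sanity check on part (b) (not a proof), the short Python sketch below compares ε(q) = log(1 + q(e^ε − 1)) with the linear bound obtained from the hint followed by log(1 + x) ≤ x. The function names are hypothetical.

```python
import math

def amplified_epsilon(eps, q):
    """eps(q) = log(1 + q (e^eps - 1)), the privacy level after subsampling."""
    return math.log1p(q * math.expm1(eps))

def linear_bound(eps, q, T):
    """Chord bound e^eps - 1 <= (e^T - 1) eps / T for eps in [0, T],
    combined with log(1 + x) <= x."""
    return q * math.expm1(T) * eps / T

# T = 1 gives (e - 1) q eps; T = 1/2 gives 2 (sqrt(e) - 1) q eps.
for eps in (0.1, 0.5, 1.0):
    for q in (0.01, 0.1, 0.5):
        assert amplified_epsilon(eps, q) <= linear_bound(eps, q, T=1.0)
```

With q = 1 (no subsampling) `amplified_epsilon` returns ε itself, so the amplification comes entirely from the sampling fraction.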
Exercise 7.7 (Concentration and privacy composition): In this question, we give an alternative to the privacy composition approaches we exploit in Section 7.3.2. Consider an identical scenario to that in Fig. 7.1, and begin by assuming that each channel $Q_i$ is $\varepsilon$-differentially private with density $q_i$, and let $Q^{(b)}$ be shorthand for $Q(\cdot \mid x^{(b)})$. Define the log-likelihood ratio
\[
L^{(b)}(Z_1^k) := \sum_{i=1}^k \log \frac{q_i^{(b)}(Z_i)}{q_i^{(1-b)}(Z_i)}.
\]
(a) Let $P$, $Q$ be any two distributions satisfying $D_\infty(P\|Q) \le \varepsilon$ and $D_\infty(Q\|P) \le \varepsilon$, i.e., that $\log\frac{P(A)}{Q(A)} \in [-\varepsilon, \varepsilon]$ for all sets $A$. Show that
(b) Let $Q^{(b)}$ denote the joint distribution of $Z_1, \ldots, Z_k$ when bit $b$ holds in the privacy game in Fig. 7.1. Show that
\[
E_b[L^{(b)}(Z_1^k)] \le k\varepsilon(e^\varepsilon - 1)
\]
where $E_b$ denotes expectation under $Q^{(b)}$, and that for all $t \ge 0$,
\[
Q^{(b)}\left(L^{(b)}(Z_1^k) \ge k\varepsilon(e^\varepsilon - 1) + t\right) \le \exp\left(-\frac{t^2}{2k\varepsilon^2}\right).
\]
Conclude that for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over $Z_1^k \sim Q^{(b)}$,
\[
L^{(b)}(Z_1^k) \le k(e^\varepsilon - 1)\varepsilon + \sqrt{2k\log\frac{1}{\delta}} \cdot \varepsilon.
\]
(c) Argue that for any (measurable) set $A$,
\[
Q^{(b)}(Z_1^k \in A) \le e^{\varepsilon(k,\delta)} \cdot Q^{(1-b)}(Z_1^k \in A) + \delta
\]
for all $\delta \in [0, 1]$, where $\varepsilon(k, \delta) = k\varepsilon(e^\varepsilon - 1) + \sqrt{2k\log\frac{1}{\delta}} \cdot \varepsilon$.
(d) Conclude the following tighter variant of Corollary 7.3.3: if each channel in Fig. 7.1 is $\varepsilon$-differentially private, then the composition of $k$ such channels is
\[
\left(k\varepsilon(e^\varepsilon - 1) + \sqrt{2k\log\frac{1}{\delta}} \cdot \varepsilon, \; \delta\right)
\]
differentially private for all $\delta > 0$.
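To get a feel for when the guarantee in part (d) is useful, the following Python sketch evaluates ε(k, δ) = kε(e^ε − 1) + sqrt(2k log(1/δ)) · ε and compares it with the naive level kε. The comparison with kε relies on the standard fact that k ε-private channels compose to kε-privacy, which is an assumption brought in here rather than part of the exercise. Function names are hypothetical.

```python
import math

def naive_composition(k, eps):
    """Privacy level k * eps from simply adding up the per-channel levels."""
    return k * eps

def concentration_composition(k, eps, delta):
    """eps(k, delta) = k eps (e^eps - 1) + sqrt(2 k log(1/delta)) * eps."""
    return k * eps * math.expm1(eps) + math.sqrt(2 * k * math.log(1 / delta)) * eps

# Many small-eps channels: the square-root term dominates and beats k * eps.
print(concentration_composition(100, 0.1, 1e-6), naive_composition(100, 0.1))
# A single channel: the delta-dependent term makes the bound worse than eps.
print(concentration_composition(1, 0.1, 1e-6), naive_composition(1, 0.1))
```

The bound improves on kε only once k is large relative to log(1/δ) and ε is small, which is the regime the exercise targets.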
As an aside, a completely similar derivation yields the following tighter analogue of Theorem 7.3.4: if each channel is $(\varepsilon, \delta)$-differentially private, then their composition is
\[
\left(k\varepsilon(e^\varepsilon - 1) + \sqrt{2k\log\frac{1}{\delta_0}} \cdot \varepsilon, \; \delta_0 + \frac{k\delta}{1 + e^\varepsilon}\right)
\]
differentially private for all $\delta_0 > 0$.
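Exercise 7.8 (One-dimensional minimization with inverse sensitivity): Consider the private minimization of the one dimensional loss $\ell(\theta, x)$ (for $\theta \in \Theta \subset \mathbb{R}$), where we wish to estimate
\[
\hat\theta(P_n) \in \operatorname*{argmin}_\theta \Big\{ P_n \ell(\theta, X) := \frac{1}{n}\sum_{i=1}^n \ell(\theta, X_i) \Big\},
\]
where we recall the notation from Chapters 4 and 5. Assume that the loss $\ell$ is convex, differentiable in $\theta$, and that it satisfies the Lipschitz-type guarantees that there exist constants $0 < L_0 \le L_1 < \infty$ such that
\[
[-L_0, L_0] \subset \{\ell'(\theta, x)\}_{x \in \mathcal{X}} \subset [-L_1, L_1] \qquad (7.7.2)
\]
for all $\theta \in \Theta$ and that $\{\ell'(\theta, x)\}_{x \in \mathcal{X}}$ is an interval. (That is, the set of potential derivatives $\ell'(\theta, x)$ as $x$ varies includes $[-L_0, L_0]$, is convex, and $|\ell'(\theta, x)| \le L_1$ for all $\theta \in \Theta$, $x \in \mathcal{X}$.)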
(a) Let the loss $\ell$ be the Huber loss $\ell(\theta, x) = h_u(\theta - x)$ for some fixed $u > 0$, where
\[
h_u(t) = \begin{cases} \frac{1}{2u} t^2 & \text{if } |t| \le u \\ |t| - \frac{u}{2} & \text{if } |t| \ge u. \end{cases}
\]
(b) Let the loss $\ell$ be the absolute value $\ell(\theta, x) = |\theta - x|$, where we abuse notation to call $\{\ell'(\theta, x)\}_{x = \theta} = [-1, 1]$ (the subdifferential). When $\mathcal{X} = \mathbb{R}$, show that $\ell$ satisfies the containment (7.7.2) with $L_0 = L_1 = 1$.
(c) Let $d_{\hat\theta}$ be the inverse sensitivity (7.4.3) for the minimizer $\hat\theta(P_n)$, which is the solution (in $\theta$) to $P_n \ell'(\theta, X) = 0$. Assuming inequality (7.7.2) holds, show that
(a) Implement the Laplace mechanism (7.1.3) for this problem. Fix n = 200 and repeat the
following experiment 50 times. For $\varepsilon = .1, .5, 1, 2$, generate a sample $x_1^n \in \mathcal{X}^n$ (from whatever distribution you like), then estimate $\bar x_n$ using the Laplace mechanism. Give a table of the mean squared errors $(\bar x_n - M(x_1^n))^2$.
(b) Implement the inverse sensitivity mechanism using the approximation in Example 7.4.8. Repeat
the experiment in part (a).
Exercise 7.10 (Estimating medians with the inverse sensitivity mechanism): The data at https://stats311.stanford.edu/data/salaries.txt contains approximately 250,000 salaries from the University of California Schools between 2011 and 2014. Assuming that the maximum salary is $3 \cdot 10^6$ and minimum is 0 (so the data $x \in [0, 3 \cdot 10^6]$), implement the inverse sensitivity mechanism for the median as in Example 7.4.9. Repeat the following 20 times: for each of $\varepsilon = .0625, .125, .25, .5, 1, 2$,
estimate the median using the inverse sensitivity mechanism with ε-differential privacy. Compute
the mean absolute errors across the 20 experiments for each ε.
Exercise 7.11 (Shells and accuracy in inverse sensitivity): Let f : Pn → R be sample monotone
(Def. 7.8) and $\rho \ge 0$. Let $M = M_{\rm cont}$ be the continuous inverse sensitivity mechanism with density (7.4.9). Define the upper and lower shells
\[
S_k^+ = \{t > f(P_n) \mid d_{f,\rho}(t, P_n) = k\} \quad \text{and} \quad S_k^- = \{t < f(P_n) \mid d_{f,\rho}(t, P_n) = k\},
\]
and the upper and lower moduli of continuity (values in the shells $S_k^\pm$)
(b) Bound $P(M(P_n) \in S_k^+)$ and $P(M(P_n) \in S_k^-)$, and using these bounds demonstrate that
Exercise 7.12 (Accuracy of the inverse sensitivity mechanism): In this question, we prove Propositions 7.4.11 and 7.4.12. Let the conditions and notation of Exercise 7.11 hold. Recall the definition
\[
L(K) := \sup_{P_n' \in \mathcal{P}_n} \left\{ \mathrm{LS}(f, P_n') \mid d_{\rm ham}(P_n, P_n') \le K \right\}.
\]
(a) Use Exercise 7.11.(b) and (c) to show that for any $K \in \mathbb{N}$,
\[
E\left[|M_{\rm cont}(P_n) - f(P_n)|\right] \le \rho + \frac{L(K)}{1 - e^{-\varepsilon/2}} \cdot \frac{\sum_{k=1}^K (\omega^+(k) + \omega^-(k)) e^{-k\varepsilon/2}}{\sum_{k=1}^n (\omega^+(k) + \omega^-(k)) e^{-k\varepsilon/2}} + \frac{\mathrm{GS}(f)}{\rho} \sum_{k=K+1}^n \left(\omega^+(k) + \omega^-(k)\right) e^{-k\varepsilon/2}.
\]
(b) Choose values for $\rho$ and $K$ to show that $E[|M_{\rm cont}(P_n) - f(P_n)|] \le \frac{1}{1 - e^{-\varepsilon/2}} \mathrm{GS}(f)$, giving Proposition 7.4.11.
Exercise 7.13 (Subsampling and Rényi privacy): We would like to estimate the mean $E[X]$ of $X \sim P$, where $X \in B = \{x \in \mathbb{R}^d \mid \|x\|_2 \le 1\}$, the $\ell_2$-ball in $\mathbb{R}^d$. We investigate the extent to which subsampling of a dataset can improve privacy by providing some additional anonymity. Consider the following mechanism for estimating (scaled) multiples of this mean: for a dataset $\{X_1, \ldots, X_n\}$, we let $S_i \in \{0, 1\}$ be i.i.d. Bernoulli($q$), that is, $E[S_i] = q$, and then consider the algorithm
\[
Z = \sum_{i=1}^n X_i S_i + \sigma W, \qquad W \sim \mathsf{N}(0, I_d). \qquad (7.7.3)
\]
(a) Let $Q(\cdot \mid X)$ and $Q(\cdot \mid X')$ denote the channels for the mechanism (7.7.3) with data matrices $X = [x_1 \cdots x_{n-1} \; x]$ and $X' = [x_1 \cdots x_{n-1}] \in \mathbb{R}^{d \times n}$. Let $P_\mu$ denote the normal distribution $\mathsf{N}(\mu, \sigma^2 I)$ with mean $\mu$ and covariance $\sigma^2 I$ on $\mathbb{R}^d$. Show that for any $\alpha \in (1, \infty)$,
and
\[
D_\alpha\left(Q(\cdot \mid X') \| Q(\cdot \mid X)\right) \le D_\alpha\left(P_0 \| qP_x + (1 - q)P_0\right).
\]
Consider two mechanisms for computing a sample mean $\bar X_n$ of vectors, where $\|x_i\|_2 \le b$ for all $i$. The first is to repeat the following $T$ times: for $t = 1, 2, \ldots, T$,
i. Draw $S \in \{0, 1\}^n$ with $S_i \stackrel{\rm iid}{\sim} \mathsf{Bernoulli}(q)$
ii. Set $Z_t = \frac{1}{nq}(XS + \sigma_{\rm sub} W_t)$, where $W_t \stackrel{\rm iid}{\sim} \mathsf{N}(0, I)$, as in (7.7.3).
Then set $Z_{\rm sub} = \frac{1}{T}\sum_{t=1}^T Z_t$. The other mechanism is to simply set $Z_{\rm Gauss} = \bar X_n + \sigma_{\rm Gauss} W$ for $W \sim \mathsf{N}(0, I)$.
(c) What level of privacy does $Z_{\rm sub}$ have? That is, $Z_{\rm sub}$ is $(\varepsilon, 2)$-Rényi private (against single removals (7.7.4)). Give a tight upper bound on $\varepsilon$.
(e) Fix $\varepsilon > 0$, and assume that each mechanism $Z_{\rm sub}$ and $Z_{\rm Gauss}$ have parameters chosen so that they are $(\varepsilon, 2)$-Rényi private. Optimize over $T, q, n, \sigma_{\rm sub}$ in the subsampling mechanism and $\sigma_{\rm Gauss}$ in the Gaussian mechanism, and provide the sharpest bound you can on
\[
E\left[\|Z_{\rm sub} - \bar X_n\|_2^2\right] \quad \text{and} \quad E\left[\|Z_{\rm Gauss} - \bar X_n\|_2^2\right].
\]
You may assume $\|x_i\|_2 = b$ for all $i$. (In your derivation, to avoid annoying constants, you should replace $\log(1 + t)$ with its upper bound, $\log(1 + t) \le t$, which is fairly sharp for $t \approx 0$.)
Part II
i. Minimax lower bounds (both local and global) using Le Cam’s, Fano’s, and Assouad’s methods.
Worked out long example with nonparametric regression.
ii. Strong data processing inequalities, along with some bounds on them (constrained risk inequal-
ities).
Chapter 8
Understanding the fundamental limits of estimation and optimization procedures is important for
a multitude of reasons. Indeed, developing bounds on the performance of procedures can give
complementary insights. By exhibiting fundamental limits of performance (perhaps over restricted
classes of estimators), it is possible to guarantee that an algorithm we have developed is optimal, so
that searching for estimators with better statistical performance will have limited returns, though
searching for estimators with better performance in other metrics may be interesting. Moreover,
exhibiting refined lower bounds on the performance of estimators can also suggest avenues for de-
veloping alternative, new optimal estimators; lower bounds need not be a fully pessimistic exercise.
In this chapter, we define and then discuss techniques for lower-bounding the minimax risk,
giving three standard techniques for deriving minimax lower bounds that have proven fruitful in
a variety of estimation problems [177]. In addition to reviewing these standard techniques—the
Le Cam, Fano, and Assouad methods—we present a few simplifications and extensions that may
make them more “user friendly.” Finally, the concluding sections of the chapter (Sections 8.6
and 8.7) present extensions of the ideas to nonparametric problems, where the effective number of
parameters to estimate grows with the sample size n; this culminates with an essentially geometric
treatment of information and divergence measures directly relating covering and packing numbers
to estimation.
known variance σ 2 , then θ(P ) = EP [X] uniquely determines distributions in P. In other scenarios,
however, θ does not uniquely determine the distribution: for instance, we may be given a class of densities $\mathcal{P}$ on the unit interval [0, 1], and we wish to estimate $\theta(P) = \int_0^1 (p'(t))^2 \, dt$, where p is the
density of P . Such problems arise, for example, in estimating the uniformity of the distribution
of a species over an area (large θ(P ) indicates an irregular distribution). In this case, θ does not
parameterize P , so we take a slightly broader viewpoint of estimating functions of distributions in
these notes.
The space Θ in which the parameter θ(P ) takes values depends on the underlying statistical
problem; as an example, if the goal is to estimate the univariate mean θ(P ) = EP [X], we have
$\Theta \subset \mathbb{R}$. To evaluate the quality of an estimator $\hat\theta$, we let $\rho : \Theta \times \Theta \to \mathbb{R}_+$ denote a (semi)metric on the space $\Theta$, which we use to measure the error of an estimator for the parameter $\theta$, and let $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ be a non-decreasing function with $\Phi(0) = 0$ (for example, $\Phi(t) = t^2$).
For a distribution $P \in \mathcal{P}$, we assume we receive i.i.d. observations $X_i$ drawn according to some $P$, and based on these $\{X_i\}$, the goal is to estimate the unknown parameter $\theta(P) \in \Theta$. For a given estimator $\hat\theta$ (a measurable function $\hat\theta : \mathcal{X}^n \to \Theta$) we assess the quality of the estimate $\hat\theta(X_1, \ldots, X_n)$ in terms of the risk
\[
E_P\left[\Phi\left(\rho(\hat\theta(X_1, \ldots, X_n), \theta(P))\right)\right].
\]
For instance, for a univariate mean problem with $\rho(\theta, \theta') = |\theta - \theta'|$ and $\Phi(t) = t^2$, this risk is the mean-squared error. As the distribution $P$ is varied, we obtain the risk functional for the problem, which gives the risk of any estimator $\hat\theta$ for the family $\mathcal{P}$.
For any fixed distribution P , there is always a trivial estimator of θ(P ): simply return θ(P ),
which will have minimal risk. Of course, this “estimator” is unlikely to be good in any real sense,
and it is thus important to consider the risk functional not in a pointwise sense (as a function of
individual P ) but to take a more global view. One approach to this is Bayesian: we place a prior
π on the set of possible distributions P, viewing θ(P ) as a random variable, and evaluate the risk
of an estimator $\hat\theta$ taken in expectation with respect to this prior on $P$. Another approach, first suggested by Wald [172], is to choose the estimator $\hat\theta$ minimizing the maximum risk
\[
\sup_{P \in \mathcal{P}} E_P\left[\Phi\left(\rho(\hat\theta(X_1, \ldots, X_n), \theta(P))\right)\right].
\]
An optimal estimator for this metric then gives the minimax risk, which is defined as
\[
M_n(\theta(\mathcal{P}), \Phi \circ \rho) := \inf_{\hat\theta} \sup_{P \in \mathcal{P}} E_P\left[\Phi\left(\rho(\hat\theta(X_1, \ldots, X_n), \theta(P))\right)\right], \qquad (8.1.1)
\]
where we take the supremum (worst-case) over distributions $P \in \mathcal{P}$, and the infimum is taken over all estimators $\hat\theta$. Here the notation $\theta(\mathcal{P})$ indicates that we consider parameters $\theta(P)$ for $P \in \mathcal{P}$ and distributions in $\mathcal{P}$.
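To make the difference between pointwise and worst-case risk concrete, here is a small Python sketch for a case where the risk can be computed exactly. It assumes a {0, 1}-valued Bernoulli(θ) model with squared error, which is our choice for illustration, and compares the sample mean with the classical shrinkage estimator (S + √n/2)/(n + √n), whose risk does not depend on θ. Function names are hypothetical.

```python
import math

def exact_risk(estimator, n, theta):
    """E_theta[(estimator(S, n) - theta)^2] for S ~ Binomial(n, theta)."""
    return sum(
        math.comb(n, s) * theta**s * (1 - theta) ** (n - s)
        * (estimator(s, n) - theta) ** 2
        for s in range(n + 1)
    )

def sample_mean(s, n):
    return s / n

def shrunk_mean(s, n):
    # Classical constant-risk estimator for the Bernoulli mean.
    return (s + math.sqrt(n) / 2) / (n + math.sqrt(n))

def max_risk(estimator, n, grid_size=201):
    """Approximate sup over theta in [0, 1] by a maximum over a uniform grid."""
    return max(exact_risk(estimator, n, i / (grid_size - 1)) for i in range(grid_size))

n = 25
print(max_risk(sample_mean, n))  # worst case at theta = 1/2, equal to 1 / (4 n)
print(max_risk(shrunk_mean, n))  # constant in theta, 1 / (4 (1 + sqrt(n))^2)
```

The sample mean has smaller risk near θ = 0 or θ = 1 but larger worst-case risk, which is exactly the distinction the supremum in (8.1.1) is designed to capture.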
In some scenarios, we study a specialized notion of risk appropriate for optimization problems
(and statistical problems in which all we care about is prediction). In these settings, we assume
there exists some loss function $\ell : \Theta \times \mathcal{X} \to \mathbb{R}$, where for an observation $x \in \mathcal{X}$, the value $\ell(\theta; x)$ measures the instantaneous loss associated with using $\theta$ as a predictor. In this case, we define the risk
\[
L_P(\theta) := E_P[\ell(\theta; X)] = \int_{\mathcal{X}} \ell(\theta; x) \, dP(x) \qquad (8.1.2)
\]
as the expected loss of the vector $\theta$. (See, e.g., Chapter 5 of the lectures by Shapiro, Dentcheva, and Ruszczyński [159], or work on stochastic approximation by Nemirovski et al. [143].)
Here the variables $T$ correspond to the goods transported to and from each location (so $T_{ij}$ is goods shipped from $i$ to $j$), and we wish to minimize the cost of our shipping and maximize the profit. By minimizing the risk (8.1.2) over a set $\Theta = \{\theta \in \mathbb{R}^m_+ : \sum_i \theta_i \le b\}$, we maximize our expected reward given a budget constraint $b$ on the amount of allocated resources. ◊
where the expectation is taken over $X_i$ and any randomness in the procedure $\hat\theta$. This expression captures the difference between the (expected) risk performance of the procedure $\hat\theta$ and the best possible risk, available if the distribution $P$ were known ahead of time. The minimax excess risk, defined with respect to the loss $\ell$, domain $\Theta$, and family $\mathcal{P}$ of distributions, is then defined by the best possible maximum excess risk,
\[
M_n(\Theta, \mathcal{P}, \ell) := \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \left\{ E_P\left[L_P(\hat\theta(X_1, \ldots, X_n))\right] - \inf_{\theta \in \Theta} L_P(\theta) \right\}, \qquad (8.1.3)
\]
where the infimum is taken over all estimators $\hat\theta : \mathcal{X}^n \to \Theta$ and the risk $L_P$ is implicitly defined in terms of the loss $\ell$. The techniques for providing lower bounds for the minimax risk (8.1.1) or the excess risk (8.1.3) are essentially identical; we focus for the remainder of this section on techniques for providing lower bounds on the minimax risk.
While trivial, this lower bound serves as the departure point for each of the subsequent techniques
for lower bounding the minimax risk.
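where the final inequality follows because $\Phi$ is non-decreasing. Now, let us define $\theta_v = \theta(P_v)$, so that $\rho(\theta_v, \theta_{v'}) \ge 2\delta$ for $v \ne v'$. By defining the testing function
\[
\Psi(\hat\theta) := \operatorname*{argmin}_{v \in \mathcal{V}} \{\rho(\hat\theta, \theta_v)\},
\]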
Figure 8.1. Example of a 2δ-packing of a set. The estimate $\hat\theta$ is contained in at most one of the δ-balls around the points $\theta_v$.
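breaking ties arbitrarily, we have that $\rho(\hat\theta, \theta_v) < \delta$ implies that $\Psi(\hat\theta) = v$ because of the triangle inequality and 2δ-separation of the set $\{\theta_v\}_{v \in \mathcal{V}}$. Indeed, assume that $\rho(\hat\theta, \theta_v) < \delta$; then for any $v' \ne v$, we have
\[
\rho(\hat\theta, \theta_{v'}) \ge \rho(\theta_v, \theta_{v'}) - \rho(\hat\theta, \theta_v) > 2\delta - \delta = \delta.
\]
The test must thus return $v$ as claimed. Equivalently, for $v \in \mathcal{V}$, the inequality $\Psi(\hat\theta) \ne v$ implies $\rho(\hat\theta, \theta_v) \ge \delta$. (See Figure 8.1.) By averaging over $\mathcal{V}$, we find that
\[
\sup_P P(\rho(\hat\theta, \theta(P)) \ge \delta) \ge \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P(\rho(\hat\theta, \theta(P_v)) \ge \delta \mid V = v) \ge \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P(\Psi(\hat\theta) \ne v \mid V = v).
\]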
The remaining challenge is to lower bound the probability of error in the underlying multi-way
hypothesis testing problem, which we do by choosing the separation δ to trade off between the loss
Φ(δ) (large δ increases the loss) and the probability of error (small δ, and hence separation, makes
the hypothesis test harder). Usually, one attempts to choose the largest separation δ that guarantees
a constant probability of error. There are a variety of techniques for this, and we present three:
Le Cam’s method, Fano’s method, and Assouad’s method, including extensions of the latter two
to enhance their applicability. Before continuing, however, we review some inequalities between
divergence measures defined on probabilities, which will be essential for our development, and
concepts related to packing sets (metric entropy, covering numbers, and packing).
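The reduction from estimation to testing can be made concrete with a few lines of Python. This is a minimal sketch with hypothetical names: it implements the nearest-point test Ψ and a check of the 2δ-separation condition, using a one-dimensional packing chosen purely for illustration.

```python
def nearest_hypothesis(theta_hat, thetas, rho):
    """Psi(theta_hat) = argmin_v rho(theta_hat, theta_v); ties go to the smallest index."""
    return min(range(len(thetas)), key=lambda v: rho(theta_hat, thetas[v]))

def is_two_delta_packing(thetas, rho, delta):
    """Check rho(theta_v, theta_v') >= 2 * delta for all distinct pairs."""
    return all(
        rho(thetas[i], thetas[j]) >= 2 * delta
        for i in range(len(thetas))
        for j in range(i + 1, len(thetas))
    )

def abs_distance(a, b):
    return abs(a - b)

thetas = [-1.0, 0.0, 1.0]  # illustrative 2 * delta packing with delta = 0.5
delta = 0.5
assert is_two_delta_packing(thetas, abs_distance, delta)

# Any estimate within delta of theta_v is mapped back to v by the test.
for v, theta_v in enumerate(thetas):
    for offset in (-0.49, 0.0, 0.49):
        assert nearest_hypothesis(theta_v + offset, thetas, abs_distance) == v
```

Read in the contrapositive, the final loop says that a testing error forces an estimation error of at least δ, which is the step that converts a lower bound on testing error into a lower bound on risk.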
of f -divergences (recall Section 2.2.3). We first recall the definitions of the three when applied to
distributions P , Q on a set X , which we assume have densities p, q with respect to a base measure
µ. Then we recall the total variation distance (2.2.6) is
\[
\|P - Q\|_{\rm TV} := \sup_{A \subset \mathcal{X}} |P(A) - Q(A)| = \frac12 \int |p(x) - q(x)| \, d\mu(x),
\]
which is the $f$-divergence $D_f(P\|Q)$ generated by $f(t) = \frac12 |t - 1|$. The Hellinger distance (2.2.7) is
\[
d_{\rm hel}(P, Q)^2 := \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 d\mu(x),
\]
which is the $f$-divergence $D_f(P\|Q)$ generated by $f(t) = (\sqrt{t} - 1)^2$. We also recall the Kullback-Leibler (KL) divergence
\[
D_{\rm kl}(P\|Q) := \int p(x) \log\frac{p(x)}{q(x)} \, d\mu(x), \qquad (8.2.2)
\]
which is the $f$-divergence $D_f(P\|Q)$ generated by $f(t) = t \log t$. As noted in Section 2.2.3, Proposition 2.2.8, these divergences have the following relationships.
Proposition (Proposition 2.2.8, restated). The total variation distance satisfies the following re-
lationships:
We now show how Proposition 2.2.8 is useful, because KL-divergence and Hellinger distance
both are easier to manipulate on product distributions than is total variation. Specifically, consider
the product distributions P = P1 × · · · × Pn and Q = Q1 × · · · × Qn . Then the KL-divergence
satisfies the decoupling equality
\[
D_{\rm kl}(P\|Q) = \sum_{i=1}^n D_{\rm kl}(P_i\|Q_i), \qquad (8.2.3)
\]
In particular, we see that for product distributions $P^n$ and $Q^n$, Proposition 2.2.8 implies that
\[
\|P^n - Q^n\|_{\rm TV}^2 \le \frac12 D_{\rm kl}(P^n\|Q^n) = \frac{n}{2} D_{\rm kl}(P\|Q)
\]
and
\[
\|P^n - Q^n\|_{\rm TV} \le d_{\rm hel}(P^n, Q^n) \le \sqrt{2 - 2(1 - d_{\rm hel}(P, Q)^2)^n}.
\]
As a consequence, if we can guarantee that $D_{\rm kl}(P\|Q) \le 1/n$ or $d_{\rm hel}(P, Q) \le 1/\sqrt{n}$, then we guarantee the strict inequality $\|P^n - Q^n\|_{\rm TV} \le 1 - c$ for a fixed constant $c > 0$, for any $n$. We will see how this type of guarantee can be used to prove minimax lower bounds in the following sections.
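The KL-based bound can be checked numerically in a case where the total variation distance between product distributions is computable exactly. The Python sketch below assumes {0, 1}-valued Bernoulli marginals (our choice for illustration); because the probability of a binary sequence depends only on its number of ones, the exact distance is a sum over n + 1 terms. Function names are hypothetical.

```python
import math

def kl_bernoulli(p, q):
    """D_kl(Bernoulli(p) || Bernoulli(q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def tv_product_bernoulli(p, q, n):
    """Exact TV distance between Bernoulli(p)^n and Bernoulli(q)^n.

    Sequences with s ones all have the same probability under each product
    measure, and there are C(n, s) of them."""
    return 0.5 * sum(
        math.comb(n, s) * abs(p**s * (1 - p) ** (n - s) - q**s * (1 - q) ** (n - s))
        for s in range(n + 1)
    )

def pinsker_product_bound(p, q, n):
    """sqrt(n * D_kl(P || Q) / 2), from Pinsker plus the KL decoupling equality."""
    return math.sqrt(n * kl_bernoulli(p, q) / 2)

for n in (1, 10, 100):
    exact = tv_product_bernoulli(0.5, 0.55, n)
    bound = pinsker_product_bound(0.5, 0.55, n)
    print(n, exact, bound)
```

When the per-sample divergence is of order 1/n the bound stays below 1 however large n is, which is what keeps the two product distributions hard to tell apart.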
Corollary 8.2.2. Let $B_d = \{v \in \mathbb{R}^d \mid \|v\| \le 1\}$ be the unit ball for the norm $\|\cdot\|$. Then there exists $\mathcal{V} \subset B_d$ with $|\mathcal{V}| \ge 2^d$ and $\|v - v'\| \ge \frac12$ for each $v \ne v' \in \mathcal{V}$.
Another common packing arises from coding theory, where the technique is to construct well-
separated code-books ({0, 1}-valued bit strings associated to individual symbols to be communi-
cated) for communication. In showing our lower bounds, we show that even if a code-book is
well-separated, it may still be hard to estimate. With that, we now demonstrate that there exist
(exponentially) large packings of the d-dimensional hypercube of points that are O(d)-separated in
the Hamming metric.
Proof We use the proof of Guntuboyina [97]. Consider a maximal subset $\mathcal{V}$ of $H_d = \{-1, 1\}^d$ satisfying
\[
\|v - v'\|_1 \ge d/2 \quad \text{for all distinct } v, v' \in \mathcal{V}. \qquad (8.2.5)
\]
That is, the addition of any vector $w \in H_d$, $w \notin \mathcal{V}$ to $\mathcal{V}$ will break the constraint (8.2.5). This means that if we construct the closed balls $B(v, d/2) := \{w \in H_d : \|v - w\|_1 \le d/2\}$, we must have
\[
\bigcup_{v \in \mathcal{V}} B(v, d/2) = H_d \quad \text{so} \quad |\mathcal{V}| |B(0, d/2)| = \sum_{v \in \mathcal{V}} |B(v, d/2)| \ge 2^d. \qquad (8.2.6)
\]
We now upper bound the cardinality of B(v, d/2) using the probabilistic method, which will imply
the desired result. Let Si , i = 1, . . . , d, be i.i.d. Bernoulli {0, 1}-valued random variables. Then by
their uniformity, for any v ∈ Hd ,
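for any $\lambda > 0$, by Markov's inequality (or the Chernoff bound). Since $E[\exp(\lambda S_1)] = \frac12 (1 + e^\lambda)$, we obtain
\[
2^{-d} |B(v, d/2)| \le \inf_{\lambda \ge 0} \left\{ 2^{-d} (1 + e^\lambda)^d \exp(-3\lambda d/4) \right\}.
\]
Taking $\lambda = \log 3$ gives $|B(v, d/2)| \le 3^{-3d/4} 4^d$, so that inequality (8.2.6) yields
\[
|\mathcal{V}| 3^{-3d/4} 4^d \ge |\mathcal{V}| |B(v, d/2)| \ge 2^d, \quad \text{or} \quad |\mathcal{V}| \ge \frac{3^{3d/4}}{2^d} = \exp\left(d\left(\frac34 \log 3 - \log 2\right)\right) \ge \exp(d/8),
\]
as claimed.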
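For small d the maximal set in the proof can be built explicitly. The Python sketch below uses a greedy scan over the hypercube, which is one way (our choice, not the proof's) to obtain a maximal set satisfying (8.2.5); it then checks the covering property behind (8.2.6) and the cardinality bound exp(d/8). Function names are hypothetical, and the scan is exponential in d, so it is only meant for toy dimensions.

```python
import itertools
import math

def l1_distance(v, w):
    return sum(abs(a - b) for a, b in zip(v, w))

def greedy_hypercube_packing(d):
    """Greedily build a maximal V in {-1, 1}^d with ||v - v'||_1 >= d / 2.

    Every rejected point is within l1-distance < d / 2 of a point already
    kept, so the final set cannot be extended."""
    packing = []
    for w in itertools.product((-1, 1), repeat=d):
        if all(l1_distance(w, v) >= d / 2 for v in packing):
            packing.append(w)
    return packing

d = 8
V = greedy_hypercube_packing(d)

# Separation (8.2.5) holds by construction.
assert all(l1_distance(v, w) >= d / 2 for v, w in itertools.combinations(V, 2))
# Maximality implies the closed balls B(v, d/2) cover the hypercube.
assert all(
    any(l1_distance(w, v) <= d / 2 for v in V)
    for w in itertools.product((-1, 1), repeat=d)
)
# The cardinality bound from the proof.
assert len(V) >= math.exp(d / 8)
```

In such small dimensions the greedy set is far larger than exp(d/8); the bound is a worst-case guarantee rather than a description of typical packings.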
Example 8.3.1 (Bernoulli mean estimation): Consider the problem of estimating the mean $\theta \in [-1, 1]$ of a $\{\pm 1\}$-valued Bernoulli distribution under the squared error loss $(\theta - \hat\theta)^2$, where $X_i \in \{-1, 1\}$. In this case, by fixing some $\delta > 0$, we set $\mathcal{V} = \{-1, 1\}$, and we define $P_v$ so that
\[
P_v(X = 1) = \frac{1 + v\delta}{2} \quad \text{and} \quad P_v(X = -1) = \frac{1 - v\delta}{2},
\]
whence we see that the mean $\theta(P_v) = \delta v$. Using the metric $\rho(\theta, \theta') = |\theta - \theta'|$ and loss $\Phi(\delta) = \delta^2$, we have separation $2\delta$ of $\theta(P_{-1})$ and $\theta(P_1)$. Thus, via Le Cam's method (8.3.3), we have that
\[
M_n(\mathsf{Bernoulli}([-1, 1]), (\cdot)^2) \ge \frac12 \delta^2 \left(1 - \|P_{-1}^n - P_1^n\|_{\rm TV}\right).
\]
We would thus like to upper bound $\|P_{-1}^n - P_1^n\|_{\rm TV}$ as a function of the separation $\delta$ and sample size $n$; here we use Pinsker's inequality (Proposition 2.2.8(a)) and the tensorization identity (8.2.3) that makes KL-divergence so useful. Indeed, we have
\[
\|P_{-1}^n - P_1^n\|_{\rm TV}^2 \le \frac12 D_{\rm kl}(P_{-1}^n\|P_1^n) = \frac{n}{2} D_{\rm kl}(P_{-1}\|P_1) = \frac{n}{2} \delta \log\frac{1 + \delta}{1 - \delta}.
\]
Noting that $\delta \log\frac{1+\delta}{1-\delta} \le 3\delta^2$ for $\delta \in [0, 1/2]$, we obtain that $\|P_{-1}^n - P_1^n\|_{\rm TV} \le \delta\sqrt{3n/2}$ for $\delta \le 1/2$. In particular, we can guarantee a high probability of error in the associated hypothesis testing problem (recall inequality (8.3.2)) by taking $\delta = 1/\sqrt{6n}$; this guarantees $\|P_{-1}^n - P_1^n\|_{\rm TV} \le \frac12$. We thus have the minimax lower bound
\[
M_n(\mathsf{Bernoulli}([-1, 1]), (\cdot)^2) \ge \frac12 \delta^2 \left(1 - \frac12\right) = \frac{1}{24n}.
\]
While the factor 1/24 is smaller than necessary, this bound is optimal to within constant factors; the sample mean $(1/n)\sum_{i=1}^n X_i$ achieves mean-squared error $(1 - \theta^2)/n$.
As an alternative proof, we may use the Hellinger distance and its associated decoupling identity (8.2.4). We sketch the idea, ignoring lower order terms when convenient. In this case, Proposition 2.2.7 implies
\[
\|P_1^n - P_2^n\|_{\rm TV} \le \sqrt{2} \, d_{\rm hel}(P_1^n, P_2^n) = \sqrt{2 - 2(1 - d_{\rm hel}(P_1, P_2)^2)^n}.
\]
Noting that
\[
d_{\rm hel}(P_1, P_2)^2 = \left(\sqrt{\frac{1+\delta}{2}} - \sqrt{\frac{1-\delta}{2}}\right)^2 = 1 - 2\sqrt{\frac{1 - \delta^2}{4}} = 1 - \sqrt{1 - \delta^2} \approx \frac12 \delta^2,
\]
and noting that $(1 - \delta^2) \approx e^{-\delta^2}$, we have (up to lower order terms in $\delta$) that $\|P_1^n - P_2^n\|_{\rm TV} \le \sqrt{2 - 2\exp(-\delta^2 n/2)}$. Choosing $\delta^2 = 1/(4n)$, we have $\sqrt{2 - 2\exp(-\delta^2 n/2)} \le 1/2$, thus giving the lower bound
\[
M_n(\mathsf{Bernoulli}([-1, 1]), (\cdot)^2) \text{ “}\ge\text{” } \frac12 \delta^2 \left(1 - \frac12\right) = \frac{1}{16n},
\]
where the quotations indicate we have been fast and loose in the derivation. ◊
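The arithmetic in the KL-based argument of Example 8.3.1 is easy to reproduce. The Python sketch below evaluates the two-point bound ½ δ² (1 − TV), with the total variation distance replaced by its Pinsker upper bound, and confirms that δ = 1/√(6n) gives at least 1/(24n). Function names are hypothetical.

```python
import math

def kl_two_point(delta):
    """D_kl(P_{-1} || P_1) = delta * log((1 + delta) / (1 - delta))."""
    return delta * math.log((1 + delta) / (1 - delta))

def le_cam_lower_bound(n, delta):
    """(1/2) delta^2 (1 - TV bound), with the TV distance bounded via
    Pinsker's inequality and tensorization of the KL divergence."""
    tv_bound = min(1.0, math.sqrt(n * kl_two_point(delta) / 2))
    return 0.5 * delta**2 * (1 - tv_bound)

n = 100
delta = 1 / math.sqrt(6 * n)
print(le_cam_lower_bound(n, delta), 1 / (24 * n), 1 / n)
```

The printed bound is slightly larger than 1/(24n) because the step δ log((1+δ)/(1−δ)) ≤ 3δ² is loose for small δ, and it sits below 1/n, the worst-case mean-squared error of the sample mean, so the lower and upper bounds match up to a constant.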
This example shows the “usual” rate of convergence in parametric estimation problems, that is,
that we can estimate a parameter θ at a rate (in squared error) scaling as 1/n. The mean estimator
above is, in some sense, the prototypical example of such regular problems. In some “irregular”
scenarios—including estimating the support of a uniform random variable, which we study in the
homework—faster rates are possible.
We also note in passing that there are substantially more complex versions of Le Cam’s method
that can yield sharp results for a wider variety of problems, including some in nonparametric
estimation [127, 177]. For our purposes, the simpler two-point perspective provided in this section
will be sufficient.
JCD Comment: Talk about Euclidean structure with KL space and information geom-
etry a bit here to suggest the KL approach later.
Restating the results in Chapter 2, we also have the following convenient rewriting of Fano’s
inequality when V is uniform in V (recall Corollary 2.3.4).
\[
P(\hat V \ne V) \ge 1 - \frac{I(V; X) + \log 2}{\log(|\mathcal{V}|)}. \qquad (8.4.2)
\]
\[
\inf_\Psi P(\Psi(X) \ne V) \ge 1 - \frac{I(V; X) + \log 2}{\log |\mathcal{V}|},
\]
where the infimum is taken over all testing procedures Ψ. By combining Corollary 8.4.2 with the
reduction from estimation to testing in Proposition 8.2.1, we obtain the following result.
Proposition 8.4.3. Let {θ(Pv )}v∈V be a 2δ-packing in the ρ-semimetric. Assume that V is uniform
on the set V, and conditional on V = v, we draw a sample X ∼ Pv . Then the minimax risk has
lower bound
\[
M(\theta(\mathcal{P}); \Phi \circ \rho) \ge \Phi(\delta) \left(1 - \frac{I(V; X) + \log 2}{\log |\mathcal{V}|}\right).
\]
To gain some intuition for Proposition 8.4.3, we think of the lower bound as a function of the separation $\delta > 0$. Roughly, as $\delta \downarrow 0$, the separation condition between the distributions $P_v$ is relaxed and we expect the distributions $P_v$ to be closer to one another. In this case (as will be made more explicit presently) the hypothesis testing problem of distinguishing the $P_v$ becomes more challenging, and the information $I(V; X)$ shrinks. Thus, what we roughly attempt to do is to choose our packing $\theta(P_v)$ as a function of $\delta$, and find the largest $\delta > 0$ making the mutual information small enough that
\[
\frac{I(V; X) + \log 2}{\log |\mathcal{V}|} \le \frac12. \qquad (8.4.3)
\]
In this case, the minimax lower bound is at least $\Phi(\delta)/2$. We now explore techniques for achieving such results.
With this definition of the mixture distribution, via algebraic manipulations, we have
\[
I(V; X) = \sum_v \pi(v) D_{\rm kl}\left(P_v \| \bar P\right), \qquad (8.4.4)
\]
a representation that plays an important role in our subsequent derivations. To see equality (8.4.4), let $\mu$ be a base measure over $\mathcal{X}$ (assume w.l.o.g. that $X$ has density $p(\cdot \mid v) = p_v(\cdot)$ conditional on $V = v$), and note that
\[
I(V; X) = \sum_v \int_{\mathcal{X}} p(x \mid v) \pi(v) \log\frac{p(x \mid v)}{\sum_{v'} p(x \mid v') \pi(v')} \, d\mu(x) = \sum_v \pi(v) \int_{\mathcal{X}} p(x \mid v) \log\frac{p(x \mid v)}{\bar p(x)} \, d\mu(x).
\]
Representation (8.4.4) makes it clear that if the distributions of the sample $X$ conditional on $V$ are all similar, then there is little information content. Returning to the discussion after Proposition 8.4.3, we have in this uniform setting that
\[
\bar P = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P_v \quad \text{and} \quad I(V; X) = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} D_{\rm kl}\left(P_v \| \bar P\right).
\]
The mutual information is small if the typical conditional distribution $P_v$ is difficult to distinguish from $\bar P$ (that is, has small KL-divergence from it).
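The mixture representation (8.4.4) is straightforward to verify numerically for distributions on a finite set. In the Python sketch below, distributions are lists of probabilities, function names are hypothetical, and the example distributions are made up for illustration. It also checks the consequence, by convexity of the KL divergence in its second argument, that I(V; X) is at most the largest pairwise divergence, which is what makes a uniform bound of the form (8.4.6) useful.

```python
import math

def kl(p, q):
    """D_kl(p || q) for finite distributions given as lists (0 log 0 = 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture(dists, prior):
    """The mixture P-bar = sum_v pi(v) P_v."""
    return [sum(w * d[x] for w, d in zip(prior, dists)) for x in range(len(dists[0]))]

def mutual_information(dists, prior):
    """I(V; X) from the joint pi(v) p(x | v) and the product of its marginals."""
    p_bar = mixture(dists, prior)
    total = 0.0
    for w, d in zip(prior, dists):
        for x, px in enumerate(d):
            if w * px > 0:
                total += w * px * math.log((w * px) / (w * p_bar[x]))
    return total

def average_kl_to_mixture(dists, prior):
    """Right-hand side of (8.4.4)."""
    p_bar = mixture(dists, prior)
    return sum(w * kl(d, p_bar) for w, d in zip(prior, dists))

dists = [[0.5, 0.5], [0.9, 0.1]]  # illustrative conditional distributions P_v
prior = [0.5, 0.5]                # uniform V
info = mutual_information(dists, prior)
max_pairwise = max(kl(p, q) for p in dists for q in dists)
print(info, average_kl_to_mixture(dists, prior), max_pairwise)
```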
In the local Fano method approach, we construct a local packing. This local packing approach
is based on constructing a family of distributions $P_v$ for $v \in \mathcal{V}$ defining a 2δ-packing (recall Section 8.2.1), meaning that $\rho(\theta(P_v), \theta(P_{v'})) \ge 2\delta$ for all $v \ne v'$, but which additionally satisfy the uniform upper bound
\[
D_{\rm kl}(P_v \| P_{v'}) \le \kappa^2 \delta^2 \quad \text{for all } v, v' \in \mathcal{V}, \qquad (8.4.6)
\]
where $\kappa > 0$ is a fixed problem-dependent constant. If we have the inequality (8.4.6), then so long as we can find a local packing $\mathcal{V}$ such that
we are guaranteed the testing error condition (8.4.3), and hence the minimax lower bound
\[
M(\theta(\mathcal{P}), \Phi \circ \rho) \ge \frac12 \Phi(\delta).
\]
The difficulty in this approach is constructing the packing set V that allows δ to be chosen to obtain
sharp lower bounds, and we often require careful choices of the packing sets V. (We will see how
to reduce such difficulties in subsequent sections.)
Constructing local packings As mentioned above, the main difficulty in using Fano’s method
is in the construction of so-called “local” packings. In these problems, the idea is to construct a
packing V of a fixed set (in a vector space, say Rd ) with constant radius and constant distance.
Then we scale elements of the packing by δ > 0, which leaves the cardinality |V| identical, but
allows us to scale δ in the separation in the packing and the uniform divergence bound (8.4.6). In
particular, Lemmas 8.2.3 and 4.3.10 show that we can construct exponentially large packings of
certain sets with balls of a fixed radius.
We now illustrate these techniques via two examples.
Example 8.4.4 (Normal mean estimation): Consider the d-dimensional normal location
family $\mathcal{N}_d = \{N(\theta, \sigma^2 I_{d \times d}) \mid \theta \in \mathbb{R}^d\}$; we wish to estimate the mean $\theta = \theta(P)$ of a given
distribution $P \in \mathcal{N}_d$ in mean-squared error, that is, with loss $\|\hat{\theta} - \theta\|_2^2$. Let $\mathcal{V}$ be a 1/2-packing
of the unit $\ell_2$-ball with cardinality at least $2^d$, as guaranteed by Lemma 4.3.10. (We assume
for simplicity that $d \ge 2$.)
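Now we construct our local packing. Fix $\delta > 0$, and for each $v \in \mathcal{V}$, set $\theta_v = \delta v \in \mathbb{R}^d$. Then
we have
\[ \|\theta_v - \theta_{v'}\|_2 = \delta \|v - v'\|_2 \ge \frac{\delta}{2} \]
for each distinct pair $v, v' \in \mathcal{V}$, and moreover, we note that $\|\theta_v - \theta_{v'}\|_2 \le \delta$ for such pairs as
well. By applying the Fano minimax bound of Proposition 8.4.3, we see that (given n normal
observations $X_i \stackrel{\mathrm{iid}}{\sim} P$)
\[ \mathfrak{M}_n(\theta(\mathcal{N}_d), \|\cdot\|_2^2) \ge \left(\frac{1}{2} \cdot \frac{\delta}{2}\right)^2 \left(1 - \frac{I(V; X_1^n) + \log 2}{\log |\mathcal{V}|}\right) = \frac{\delta^2}{16} \left(1 - \frac{I(V; X_1^n) + \log 2}{d \log 2}\right). \]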
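Now note that for any pair $v, v'$, if $P_v$ is the normal distribution $N(\theta_v, \sigma^2 I_{d \times d})$ we have
\[ D_{\mathrm{kl}}(P_v^n \| P_{v'}^n) = n \cdot D_{\mathrm{kl}}\left(N(\delta v, \sigma^2 I_{d \times d}) \| N(\delta v', \sigma^2 I_{d \times d})\right) = n \cdot \frac{\delta^2}{2 \sigma^2} \|v - v'\|_2^2, \]
as the KL-divergence between two normal distributions with identical covariance is
\[ D_{\mathrm{kl}}(N(\theta_1, \Sigma) \| N(\theta_2, \Sigma)) = \frac{1}{2} (\theta_1 - \theta_2)^\top \Sigma^{-1} (\theta_1 - \theta_2) \]
as in Example 2.1.7. As $\|v - v'\|_2 \le 1$, we have the KL-divergence bound (8.4.6) with $\kappa^2 = n / (2\sigma^2)$.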
Combining our derivations, we have the minimax lower bound
\[ \mathfrak{M}_n(\theta(\mathcal{N}_d), \|\cdot\|_2^2) \ge \frac{\delta^2}{16} \left(1 - \frac{n \delta^2 / (2\sigma^2) + \log 2}{d \log 2}\right). \tag{8.4.7} \]
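As a hedged illustration of how the free parameter $\delta$ in (8.4.7) might be tuned, the following Python sketch treats the right-hand side as a concave quadratic in $u = \delta^2$ and maximizes it in closed form. The function names and the numerical values in the usage note are illustrative assumptions, not part of the notes.

```python
import math


def fano_bound(delta_sq, n, d, sigma_sq):
    """Right-hand side of (8.4.7), viewed as a function of u = delta^2."""
    info = n * delta_sq / (2.0 * sigma_sq)  # KL bound kappa^2 * delta^2
    return (delta_sq / 16.0) * (1.0 - (info + math.log(2)) / (d * math.log(2)))


def best_delta_sq(n, d, sigma_sq):
    """Maximizer of u -> (u/16) * (a - b*u) with a = 1 - 1/d and
    b = n / (2 * sigma^2 * d * log 2), namely u = a / (2b)."""
    return (d - 1) * sigma_sq * math.log(2) / n


def best_bound(n, d, sigma_sq):
    """Value of the bound at the maximizer: (d-1)^2 sigma^2 log 2 / (32 d n)."""
    return fano_bound(best_delta_sq(n, d, sigma_sq), n, d, sigma_sq)
```

For example, `best_bound(100, 10, 2.0)` evaluates the bound at its maximizer; for $d \ge 2$ the maximal value is of order $d\sigma^2/n$, which is the scaling this local packing is designed to capture.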
Example 8.4.5 (Linear regression): In this example, we show how local packings can give
(up to some constant factors) sharp minimax rates for standard linear regression problems. In
particular, for fixed matrix $X \in \mathbb{R}^{n \times d}$, we observe
\[ Y = X\theta + \varepsilon, \]
where $\varepsilon \in \mathbb{R}^n$ consists of independent random variables $\varepsilon_i$ with variance bounded by $\mathrm{Var}(\varepsilon_i) \le \sigma^2$, and $\theta \in \mathbb{R}^d$ is allowed to vary over $\mathbb{R}^d$. For the purposes of our lower bound, we may
assume that $\varepsilon \sim N(0, \sigma^2 I_{n \times n})$. Let $\mathcal{P}$ denote the family of such normally distributed linear
regression problems, and assume for simplicity that $d \ge 32$.
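In this case, we use the Gilbert-Varshamov bound (Lemma 8.2.3) to construct a local packing
and attain minimax rates. Indeed, let $\mathcal{V}$ be a packing of $\{-1, 1\}^d$ such that $\|v - v'\|_1 \ge d/2$ for
distinct elements of $\mathcal{V}$, and let $|\mathcal{V}| \ge \exp(d/8)$ as guaranteed by the Gilbert-Varshamov bound.
For fixed $\delta > 0$, if we set $\theta_v = \delta v$, then we have the packing guarantee for distinct elements
$v, v'$ that
\[ \|\theta_v - \theta_{v'}\|_2^2 = \delta^2 \sum_{j=1}^d (v_j - v_j')^2 = 4 \delta^2 \|v - v'\|_1 \ge 2 d \delta^2. \]
Moreover, as $\|v - v'\|_2^2 \le 4d$ for all $v, v' \in \{-1, 1\}^d$, each pairwise divergence satisfies
\[ D_{\mathrm{kl}}\left(N(X\theta_v, \sigma^2 I_{n \times n}) \| N(X\theta_{v'}, \sigma^2 I_{n \times n})\right) = \frac{\delta^2}{2\sigma^2} \|X(v - v')\|_2^2 \le \frac{2 d \gamma_{\max}^2(X)}{\sigma^2} \delta^2, \]
where $\gamma_{\max}(X)$ denotes the maximum singular value of $X$; the same quantity upper bounds $I(V; Y)$. The local Fano method then gives
\[ \mathfrak{M}(\theta(\mathcal{P}), \|\cdot\|_2^2) \ge \frac{d \delta^2}{2} \left(1 - \frac{I(V; Y) + \log 2}{\log |\mathcal{V}|}\right) \ge \frac{d \delta^2}{2} \left(1 - \frac{\frac{2 d \gamma_{\max}^2(X)}{\sigma^2} \delta^2 + \log 2}{d/8}\right). \]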
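Now, if we choose
\[ \delta^2 = \frac{\sigma^2}{64 \gamma_{\max}^2(X)}, \quad \text{then} \quad 1 - \frac{8 \log 2}{d} - \frac{16 d \gamma_{\max}^2(X) \delta^2}{d \sigma^2} \ge 1 - \frac{1}{4} - \frac{1}{4} = \frac{1}{2}, \]
by assumption that $d \ge 32$. In particular, we obtain the lower bound
\[ \mathfrak{M}(\theta(\mathcal{P}), \|\cdot\|_2^2) \ge \frac{1}{256} \frac{\sigma^2 d}{\gamma_{\max}^2(X)} = \frac{1}{256} \frac{\sigma^2 d}{n} \frac{1}{\gamma_{\max}^2(\frac{1}{\sqrt{n}} X)}, \]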
for a convergence rate (roughly) of $\sigma^2 d / n$ after rescaling the singular values of $X$ by $1/\sqrt{n}$.
This bound is sharp in terms of the dimension, dependence on n, and the variance σ 2 , but
it does not fully capture the dependence on X, as it depends only on the maximum singular
value. Indeed, in this case, an exact calculation (cf. [130]) shows that the minimax value of
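the problem is exactly $\sigma^2 \mathrm{tr}((X^\top X)^{-1})$. Letting $\lambda_j(A)$ be the jth eigenvalue of a matrix $A$,
we have
\[ \sigma^2 \mathrm{tr}((X^\top X)^{-1}) = \frac{\sigma^2}{n} \mathrm{tr}((n^{-1} X^\top X)^{-1}) = \frac{\sigma^2}{n} \sum_{j=1}^d \frac{1}{\lambda_j(\frac{1}{n} X^\top X)} \ge \frac{\sigma^2 d}{n} \min_j \frac{1}{\lambda_j(\frac{1}{n} X^\top X)} = \frac{\sigma^2 d}{n} \frac{1}{\gamma_{\max}^2(\frac{1}{\sqrt{n}} X)}. \]
Thus, the local Fano method captures most, but not all, of the difficulty of the problem.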
\[ h_2(P_t) + P_t \log \frac{|\mathcal{V}| - N_t^{\min}}{N_t^{\max}} + \log N_t^{\max} \ge H(V \mid \hat{V}). \tag{8.4.9} \]
Before proving the proposition, which we do in Section 8.8.1, it is informative to note that it
reduces to the standard form of Fano’s inequality (8.4.1) in a special case. Suppose that we take
$\rho_{\mathcal{V}}$ to be the 0-1 metric, meaning that $\rho_{\mathcal{V}}(v, v') = 0$ if $v = v'$ and 1 otherwise. Setting $t = 0$ in
Proposition 8.4.6, we have $P_0 = P[\hat{V} \ne V]$ and $N_0^{\min} = N_0^{\max} = 1$, whence inequality (8.4.9) reduces
to inequality (8.4.1). Other weakenings allow somewhat clearer statements (see Section 8.8.2 for a
proof):
Corollary 8.4.7. If $V$ is uniform on $\mathcal{V}$ and $(|\mathcal{V}| - N_t^{\min}) > N_t^{\max}$, then
\[ P(\rho_{\mathcal{V}}(\hat{V}, V) > t) \ge 1 - \frac{I(V; X) + \log 2}{\log \frac{|\mathcal{V}|}{N_t^{\max}}}. \tag{8.4.10} \]
Inequality (8.4.10) is the natural analogue of the classical mutual-information based form of
Fano’s inequality (8.4.2), and it provides a qualitatively similar bound. The main difference is
that the usual cardinality $|\mathcal{V}|$ is replaced by the ratio $|\mathcal{V}| / N_t^{\max}$. This quantity serves as a rough
measure of the number of possible "regions" in the space $\mathcal{V}$ that are distinguishable, that is, the
number of subsets of $\mathcal{V}$ for which $\rho_{\mathcal{V}}(v, v') > t$ when $v$ and $v'$ belong to different regions. While
this construction is similar in spirit to the usual construction of packing sets in the standard
reduction from testing to estimation (cf. Section 8.2.1), our bound allows us to skip the packing set
construction. We can directly compute I(V ; X) where V takes values over the full space, as opposed
to computing the mutual information I(V 0 ; X) for a random variable V 0 uniformly distributed over
a packing set contained within V. In some cases, the former calculation can be much simpler, as
illustrated in examples and chapters to follow.
We now turn to providing a few consequences of Proposition 8.4.6 and Corollary 8.4.7, showing
how they can be used to derive lower bounds on the minimax risk. Proposition 8.4.6 is a generaliza-
tion of the classical Fano inequality (8.4.1), so it leads naturally to a generalization of the classical
Fano lower bound on minimax risk, which we describe here. This reduction from estimation to
testing is somewhat more general than the classical reductions, since we do not map the original
estimation problem to a strict test, but rather a test that allows errors. Consider as in the standard
reduction of estimation to testing in Section 8.2.1 a family of distributions {Pv }v∈V ⊂ P indexed by
a finite set V. This family induces an associated collection of parameters {θv := θ(Pv )}v∈V . Given
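a function $\rho_{\mathcal{V}} : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ and a scalar $t$, we define the separation $\delta(t)$ of this set relative to the
metric $\rho$ on $\Theta$ via
\[ \delta(t) := \sup \left\{ \delta \mid \rho(\theta_v, \theta_{v'}) \ge \delta \text{ for all } v, v' \in \mathcal{V} \text{ such that } \rho_{\mathcal{V}}(v, v') > t \right\}. \tag{8.4.11} \]
As a special case, when $t = 0$ and $\rho_{\mathcal{V}}$ is the discrete metric, this definition reduces to that of a
packing set: we are guaranteed that $\rho(\theta_v, \theta_{v'}) \ge \delta(0)$ for all distinct pairs $v \ne v'$, as in the classical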
approach to minimax lower bounds. On the other hand, allowing for t > 0 lends greater flexibility
to the construction, since only certain pairs θv and θv0 are required to be well-separated.
Given a set V and associated separation function (8.4.11), we assume the canonical estimation
setting: nature chooses V ∈ V uniformly at random, and conditioned on this choice V = v, a sample
X is drawn from the distribution Pv . We then have the following corollary of Proposition 8.4.6,
whose argument is completely identical to that for inequality (8.2.1):
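Corollary 8.4.8. Given $V$ uniformly distributed over $\mathcal{V}$ with separation function $\delta(t)$, we have
\[ \mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho) \ge \Phi\left(\frac{\delta(t)}{2}\right) \left[1 - \frac{I(X; V) + \log 2}{\log \frac{|\mathcal{V}|}{N_t^{\max}}}\right] \quad \text{for all } t. \tag{8.4.12} \]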
Notably, using the discrete metric $\rho_{\mathcal{V}}(v, v') = 1\{v \ne v'\}$ and taking $t = 0$ in the lower bound (8.4.12)
gives the classical Fano lower bound on the minimax risk based on constructing a packing [110, 177,
175]. We now turn to an example illustrating the use of Corollary 8.4.8 in providing a minimax
lower bound on the performance of regression estimators.
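Example: Normal regression model Consider the d-dimensional linear regression model $Y = X\theta + \varepsilon$, where $\varepsilon \in \mathbb{R}^n$ is i.i.d. $N(0, \sigma^2)$ and $X \in \mathbb{R}^{n \times d}$ is known, but $\theta$ is not. In this case, our
family of distributions is
\[ \mathcal{P}_X := \left\{ Y \sim N(X\theta, \sigma^2 I_{n \times n}) \mid \theta \in \mathbb{R}^d \right\} = \left\{ Y = X\theta + \varepsilon \mid \varepsilon \sim N(0, \sigma^2 I_{n \times n}), \theta \in \mathbb{R}^d \right\}. \]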
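We then obtain the following lower bound on the minimax error in squared $\ell_2$-norm: there
is a universal (numerical) constant $c > 0$ such that
\[ \mathfrak{M}_n(\theta(\mathcal{P}_X), \|\cdot\|_2^2) \ge c \frac{\sigma^2 d^2}{\|X\|_{\mathrm{Fr}}^2} \ge \frac{c}{\gamma_{\max}(X / \sqrt{n})^2} \cdot \frac{\sigma^2 d}{n}, \tag{8.4.13} \]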
where γmax denotes the maximum singular value. Notably, this inequality is nearly the sharpest
known bound proved via Fano inequality-based methods [43], but our technique is essentially direct
and straightforward.
To see inequality (8.4.13), let the set $\mathcal{V} = \{-1, 1\}^d$ be the d-dimensional hypercube, and define
$\theta_v = \delta v$ for some fixed $\delta > 0$. Then letting $\rho_{\mathcal{V}}$ be the Hamming metric on $\mathcal{V}$ and $\rho$ be the usual
$\ell_2$-norm, the associated separation function (8.4.11) satisfies $\delta(t) > \max\{\sqrt{t}, 1\} \delta$. Now, for any
$t \le \lceil d/3 \rceil$, the neighborhood size satisfies
\[ N_t^{\max} = \sum_{\tau = 0}^t \binom{d}{\tau} \le 2 \binom{d}{t} \le 2 \left(\frac{de}{t}\right)^t. \]
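for $d \ge 12$. (The case $2 \le d < 12$ can be checked directly.) In particular, by taking $t = \lfloor d/6 \rfloor$ we
obtain via Corollary 8.4.8 that
\[ \mathfrak{M}_n(\theta(\mathcal{P}_X), \|\cdot\|_2^2) \ge \frac{\max\{\lfloor d/6 \rfloor, 2\} \delta^2}{4} \left(1 - \frac{I(Y; V) + \log 2}{\max\{d/6, 2 \log 2\}}\right). \]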
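The neighborhood-size bound $N_t^{\max} \le 2\binom{d}{t} \le 2(de/t)^t$ used above can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it restricts to $1 \le t \le d/3$, where the geometric-series argument for the first inequality applies directly.

```python
import math


def hamming_ball_size(d, t):
    """Number of points of {-1, 1}^d within Hamming distance t of a fixed point."""
    return sum(math.comb(d, tau) for tau in range(t + 1))


def check_neighborhood_bounds(d, t):
    """Return True if N_t <= 2 * C(d, t) <= 2 * (d * e / t)^t holds for this (d, t)."""
    n_max = hamming_ball_size(d, t)
    middle = 2 * math.comb(d, t)
    upper = 2 * (d * math.e / t) ** t
    return n_max <= middle <= upper
```

For instance, `hamming_ball_size(12, 2)` counts the 1 + 12 + 66 = 79 points within Hamming distance 2 of a vertex of the 12-dimensional cube, which is below $2\binom{12}{2} = 132$.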
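But of course, for $V$ uniform on $\mathcal{V}$, we have $E[V V^\top] = I_{d \times d}$, and thus for $V, V'$ independent and
uniform on $\mathcal{V}$,
\[ I(Y; V) \le \frac{1}{|\mathcal{V}|^2} \sum_{v \in \mathcal{V}} \sum_{v' \in \mathcal{V}} D_{\mathrm{kl}}\left(N(X\theta_v, \sigma^2 I_{n \times n}) \| N(X\theta_{v'}, \sigma^2 I_{n \times n})\right) = \frac{\delta^2}{2\sigma^2} E\left[\|XV - XV'\|_2^2\right] = \frac{\delta^2}{\sigma^2} \|X\|_{\mathrm{Fr}}^2. \]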
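Substituting this into the preceding minimax bound, we obtain
\[ \mathfrak{M}_n(\theta(\mathcal{P}_X), \|\cdot\|_2^2) \ge \frac{\max\{\lfloor d/6 \rfloor, 2\} \delta^2}{4} \left(1 - \frac{\delta^2 \|X\|_{\mathrm{Fr}}^2 / \sigma^2 + \log 2}{\max\{d/6, 2 \log 2\}}\right). \]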
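The step $E[\|XV - XV'\|_2^2] = 2\|X\|_{\mathrm{Fr}}^2$ behind the mutual information bound can be verified by brute-force enumeration over the hypercube for small $d$. The following Python sketch does so; the helper names and the example matrix in the usage note are illustrative assumptions.

```python
import itertools


def frobenius_sq(X):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in X for x in row)


def mean_sq_image_distance(X):
    """Average of ||X(v - v')||_2^2 over all ordered pairs v, v' in {-1, 1}^d,
    i.e. E||XV - XV'||_2^2 for V, V' independent and uniform on the cube."""
    d = len(X[0])
    cube = list(itertools.product((-1, 1), repeat=d))
    total = 0.0
    for v in cube:
        for w in cube:
            diff = [a - b for a, b in zip(v, w)]
            total += sum(sum(row[j] * diff[j] for j in range(d)) ** 2 for row in X)
    return total / len(cube) ** 2
```

With `X = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]]`, the squared Frobenius norm is 15.25 and the enumerated average is 30.5, matching the factor of two that converts $\delta^2/(2\sigma^2)$ into $\delta^2/\sigma^2$.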
That is, we can take the parameter $\theta$ and test the individual indices via $\hat{v}$.
While Lemma 8.5.2 requires conditions on the loss Φ and metric ρ for the separation condi-
tion (8.5.1) to hold, it is sometimes easier to apply than Fano’s method. Moreover, while we will
not address this in class, several researchers [7, 68] have noted that it appears to allow easier application in so-called "interactive" settings, that is, those for which the sampling of the $X_i$ may not be
precisely i.i.d. It is closely related to Le Cam's method, discussed previously, as we see that if we
define $P_{+j} = 2^{1-d} \sum_{v : v_j = 1} P_v$ (and similarly for $-j$), Lemma 8.5.2 is equivalent to
\[ \mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge \delta \sum_{j=1}^d \left[1 - \|P_{+j} - P_{-j}\|_{\mathrm{TV}}\right]. \tag{8.5.2} \]
There are standard weakenings of the lower bound (8.5.2) (and Lemma 8.5.2). We give one
such weakening. First, we note that the total variation is convex, so that if we define $P_{v,+j}$ to be
the distribution $P_v$ where coordinate $j$ takes the value $v_j = 1$ (and similarly for $P_{v,-j}$), we have
\[ P_{+j} = \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} P_{v,+j} \quad \text{and} \quad P_{-j} = \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} P_{v,-j}. \]
Then as long as the loss satisfies the per-coordinate separation (8.5.1), we obtain the following:
\[ \mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge d\delta \left[1 - \max_{v, j} \|P_{v,+j} - P_{v,-j}\|_{\mathrm{TV}}\right]. \tag{8.5.3} \]
This most common version of Assouad’s lemma sometimes too brutally controls kP+j − P−j kTV .
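We also note that by the Cauchy-Schwarz inequality and convexity of the variation-distance,
we have
\[ \sum_{j=1}^d \|P_{+j} - P_{-j}\|_{\mathrm{TV}} \le \sqrt{d} \left(\sum_{j=1}^d \|P_{+j} - P_{-j}\|_{\mathrm{TV}}^2\right)^{1/2} \le \sqrt{d} \left(\sum_{j=1}^d \frac{1}{2^d} \sum_{v} \|P_{v,+j} - P_{v,-j}\|_{\mathrm{TV}}^2\right)^{1/2}, \]
and consequently we have a not quite so terribly weak version of inequality (8.5.2):
\[ \mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge \delta d \left[1 - \left(\frac{1}{d} \sum_{j=1}^d \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \|P_{v,+j} - P_{v,-j}\|_{\mathrm{TV}}^2\right)^{1/2}\right]. \tag{8.5.4} \]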
Regardless of whether we use the sharper version (8.5.2) or weakened versions (8.5.3) or (8.5.4),
the technique is essentially the same. We seek a setting of the distributions $P_v$ so that the probability
of making a mistake in the hypothesis test of Lemma 8.5.2 is high enough (say 1/2) or the variation
distance is small enough, such as $\|P_{+j} - P_{-j}\|_{\mathrm{TV}} \le 1/2$ for all $j$. Once this is satisfied, we obtain
a minimax lower bound of the form
\[ \mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge \delta \sum_{j=1}^d \left[1 - \frac{1}{2}\right] = \frac{d\delta}{2}. \]
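To make the relationship between the three forms (8.5.2), (8.5.3), and (8.5.4) concrete, the following Python sketch evaluates each right-hand side exactly on a small toy family. The family is an assumption chosen purely for illustration: a single observation $X \sim \mathrm{Binomial}(n, p_v)$ with $p_v = 1/2 + \epsilon \sum_j v_j / d$; it is not one of the examples in these notes.

```python
import itertools
import math


def binom_pmf(n, p):
    """Probability mass function of Binomial(n, p) as a list over k = 0..n."""
    return [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def tv(p, q):
    """Total variation distance between two pmfs on the same finite support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mixture(dists):
    """Uniform mixture of a list of pmfs."""
    return [sum(col) / len(dists) for col in zip(*dists)]


def assouad_bounds(d=3, n=5, eps=0.3, delta=1.0):
    """Return the right-hand sides of (8.5.2), (8.5.4), (8.5.3), in that order."""
    cube = list(itertools.product((-1, 1), repeat=d))
    P = {v: binom_pmf(n, 0.5 + eps * sum(v) / d) for v in cube}

    def with_coord(v, j, s):
        # Copy of v whose j-th coordinate is set to s.
        return v[:j] + (s,) + v[j + 1:]

    sharp, sq_sum, worst = 0.0, 0.0, 0.0
    for j in range(d):
        plus = mixture([P[v] for v in cube if v[j] == 1])
        minus = mixture([P[v] for v in cube if v[j] == -1])
        sharp += delta * (1.0 - tv(plus, minus))
        pair_tvs = [tv(P[with_coord(v, j, 1)], P[with_coord(v, j, -1)]) for v in cube]
        sq_sum += sum(t * t for t in pair_tvs) / len(cube)
        worst = max(worst, max(pair_tvs))
    cauchy_schwarz = delta * d * (1.0 - math.sqrt(sq_sum / d))
    crude = delta * d * (1.0 - worst)
    return sharp, cauchy_schwarz, crude
```

By convexity of total variation, Cauchy-Schwarz, and the fact that an average is at most a maximum, the three returned values are ordered from largest to smallest, so the sharp form never gives a weaker bound than either weakening.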
Example 8.5.3 (Normal mean estimation): For some $\sigma^2 > 0$ and $d \in \mathbb{N}$, we consider
estimation of the mean parameter for the normal location family
\[ \mathcal{N} := \left\{ N(\theta, \sigma^2 I_{d \times d}) : \theta \in \mathbb{R}^d \right\} \]
in squared Euclidean distance. We now show how for this family, the sharp Assouad's method
implies the lower bound
\[ \mathfrak{M}_n(\theta(\mathcal{N}), \|\cdot\|_2^2) \ge \frac{d \sigma^2}{8n}. \tag{8.5.5} \]
Up to constant factors, this bound is sharp; the sample mean has mean squared error $d\sigma^2 / n$.
We proceed in (essentially) the usual way we have set up. Fix some $\delta > 0$ and define $\theta_v = \delta v$,
taking $P_v = N(\theta_v, \sigma^2 I_{d \times d})$ to be the normal distribution with mean $\theta_v$. In this case, we see that
the hypercube structure is natural, as our loss function decomposes on coordinates: we have
$\|\theta - \theta_v\|_2^2 \ge \delta^2 \sum_{j=1}^d 1\{\mathrm{sign}(\theta_j) \ne v_j\}$. The family $P_v$ thus induces a $\delta^2$-Hamming separation
for the loss $\|\cdot\|_2^2$, and by Assouad's method (8.5.2), we have
\[ \mathfrak{M}_n(\theta(\mathcal{N}), \|\cdot\|_2^2) \ge \frac{\delta^2}{2} \sum_{j=1}^d \left[1 - \|P_{+j}^n - P_{-j}^n\|_{\mathrm{TV}}\right], \]
Example 8.5.4 (Logistic regression): In this example, consider the logistic regression model,
where we have known (fixed) regressors $X_i \in \mathbb{R}^d$ and an unknown parameter $\theta \in \mathbb{R}^d$; the goal
is to estimate $\theta$ after observing a sequence of $Y_i \in \{-1, 1\}$, where for $y \in \{-1, 1\}$ we have
\[ P(Y_i = y \mid X_i, \theta) = \frac{1}{1 + \exp(-y X_i^\top \theta)}. \]
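Denote this family by $\mathcal{P}_{\log}$, and for $P \in \mathcal{P}_{\log}$, let $\theta(P)$ be the predictor vector $\theta$. We would
like to estimate the vector $\theta$ in squared $\ell_2$ error. As in Example 8.5.3, if we choose some $\delta > 0$
and for each $v \in \{-1, 1\}^d$, we set $\theta_v = \delta v$, then we have the $\delta^2$-separation in Hamming metric
$\|\theta - \theta_v\|_2^2 \ge \delta^2 \sum_{j=1}^d 1\{\mathrm{sign}(\theta_j) \ne v_j\}$. Let $P_v^n$ denote the distribution of the n independent
observations $Y_i$ when $\theta = \theta_v$. Then we have by Assouad's lemma (and the weakening (8.5.4))
that
\begin{align*}
\mathfrak{M}_n(\theta(\mathcal{P}_{\log}), \|\cdot\|_2^2) & \ge \frac{\delta^2}{2} \sum_{j=1}^d \left[1 - \|P_{+j}^n - P_{-j}^n\|_{\mathrm{TV}}\right] \\
& \ge \frac{d \delta^2}{2} \left[1 - \left(\frac{1}{d} \sum_{j=1}^d \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \|P_{v,+j}^n - P_{v,-j}^n\|_{\mathrm{TV}}^2\right)^{1/2}\right]. \tag{8.5.6}
\end{align*}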
It remains to bound $\|P_{v,+j}^n - P_{v,-j}^n\|_{\mathrm{TV}}^2$ to find our desired lower bound. To that end, use the
shorthands $p_v(x) = 1/(1 + \exp(\delta x^\top v))$ and let $D_{\mathrm{kl}}(p \| q)$ be the binary KL-divergence between
Bernoulli(p) and Bernoulli(q) distributions. Then Pinsker's inequality (recall Proposition 2.2.8)
implies that for any $v, v'$,
\begin{align*}
\|P_v^n - P_{v'}^n\|_{\mathrm{TV}}^2 & \le \frac{1}{4} \left[D_{\mathrm{kl}}(P_v^n \| P_{v'}^n) + D_{\mathrm{kl}}(P_{v'}^n \| P_v^n)\right] \\
& = \frac{1}{4} \sum_{i=1}^n \left[D_{\mathrm{kl}}(p_v(X_i) \| p_{v'}(X_i)) + D_{\mathrm{kl}}(p_{v'}(X_i) \| p_v(X_i))\right].
\end{align*}
Let us upper bound the final KL-divergence. Let $p_a = 1/(1 + e^a)$ and $p_b = 1/(1 + e^b)$. We
claim that
\[ D_{\mathrm{kl}}(p_a \| p_b) + D_{\mathrm{kl}}(p_b \| p_a) \le (a - b)^2. \tag{8.5.7} \]
Deferring the proof of claim (8.5.7), we immediately see that
\[ \|P_v^n - P_{v'}^n\|_{\mathrm{TV}}^2 \le \frac{\delta^2}{4} \sum_{i=1}^n \left(X_i^\top (v - v')\right)^2. \]
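Now we recall inequality (8.5.6) for motivation, and we see that the preceding display implies
\[ \frac{1}{2^d d} \sum_{j=1}^d \sum_{v \in \{-1,1\}^d} \|P_{v,+j}^n - P_{v,-j}^n\|_{\mathrm{TV}}^2 \le \frac{\delta^2}{4d} \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \sum_{j=1}^d \sum_{i=1}^n (2 X_{ij})^2 = \frac{\delta^2}{d} \sum_{i=1}^n \sum_{j=1}^d X_{ij}^2. \]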
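Replacing the final double sum with $\|X\|_{\mathrm{Fr}}^2$, where $X$ is the matrix of the $X_i$, we have
\[ \mathfrak{M}_n(\theta(\mathcal{P}_{\log}), \|\cdot\|_2^2) \ge \frac{d \delta^2}{2} \left[1 - \left(\frac{\delta^2}{d} \|X\|_{\mathrm{Fr}}^2\right)^{1/2}\right]. \]
Choosing $\delta^2 = d / (4 \|X\|_{\mathrm{Fr}}^2)$ makes the bracketed term equal to 1/2, so that
\[ \mathfrak{M}_n(\theta(\mathcal{P}_{\log}), \|\cdot\|_2^2) \ge \frac{d \delta^2}{4} = \frac{d^2}{16 \|X\|_{\mathrm{Fr}}^2} = \frac{d}{n} \cdot \frac{1}{16 \cdot \frac{1}{dn} \sum_{i=1}^n \|X_i\|_2^2}. \]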
That is, we have a minimax lower bound scaling roughly as d/n for logistic regression, where
“large” Xi (in `2 -norm) suggest that we may obtain better performance in estimation. This is
intuitive, as a larger Xi gives a better signal to noise ratio.
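We return to prove the claim (8.5.7). Indeed, by a straightforward expansion, we have
\begin{align*}
D_{\mathrm{kl}}(p_a \| p_b) + D_{\mathrm{kl}}(p_b \| p_a) & = p_a \log \frac{p_a}{p_b} + (1 - p_a) \log \frac{1 - p_a}{1 - p_b} + p_b \log \frac{p_b}{p_a} + (1 - p_b) \log \frac{1 - p_b}{1 - p_a} \\
& = (p_a - p_b) \log \frac{p_a}{p_b} + (p_b - p_a) \log \frac{1 - p_a}{1 - p_b} = (p_a - p_b) \log \left(\frac{p_a}{1 - p_a} \cdot \frac{1 - p_b}{p_b}\right).
\end{align*}
Because $p_a / (1 - p_a) = e^{-a}$ and $(1 - p_b)/p_b = e^{b}$, the final logarithm equals $b - a$, so the symmetrized divergence is $(p_a - p_b)(b - a)$. Assuming without loss of generality that $b \ge a$, we have
\[ \frac{1}{1 + e^a} - \frac{1}{1 + e^b} = \frac{e^b - e^a}{(1 + e^a)(1 + e^b)} \le \frac{e^b - e^a}{e^b} = 1 - e^{a - b} \le 1 - (1 + (a - b)) = b - a, \]
which yields $D_{\mathrm{kl}}(p_a \| p_b) + D_{\mathrm{kl}}(p_b \| p_a) \le (a - b)^2$.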
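A quick numerical sanity check of claim (8.5.7) is easy to write. The Python sketch below (function names are illustrative) computes the symmetrized Bernoulli KL-divergence directly from its definition, so it can be compared with $(a - b)^2$ and with the closed form $(p_a - p_b)(b - a)$.

```python
import math


def bernoulli_kl(p, q):
    """KL-divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def logistic_prob(a):
    """p_a = 1 / (1 + e^a), as in the claim."""
    return 1.0 / (1.0 + math.exp(a))


def symmetrized_kl(a, b):
    """D_kl(p_a || p_b) + D_kl(p_b || p_a), computed from the definition."""
    pa, pb = logistic_prob(a), logistic_prob(b)
    return bernoulli_kl(pa, pb) + bernoulli_kl(pb, pa)
```

Evaluating `symmetrized_kl(a, b)` on a grid of $(a, b)$ values and comparing against `(a - b) ** 2` gives a simple empirical confirmation of the inequality.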
\[ Y_i = f(X_i) + \varepsilon_i \tag{8.6.1} \]
where $\varepsilon_i$ are independent, mean zero conditional on $X_i$, and $E[\varepsilon_i^2] \le \sigma^2$. See Figure 8.2 for an
example. We also assume that we fix the locations of the Xi as Xi = i/n ∈ [0, 1], that is, the Xi
are evenly spaced in [0, 1]. Given n observations Yi , we ask two questions: (1) how can we estimate
f ? and (2) what are the optimal rates at which it is possible to estimate f ?
Figure 8.2. Observations in a non-parametric regression problem, with function $f$ plotted. (Here
$f(x) = \sin(2x + \cos^2(3x))$.)
where $\lambda_0 > 0$ (this says the kernel has some width to it). A natural example is the "tent" function
given by $K_{\mathrm{tent}}(x) = [1 - |x|]_+$, which satisfies inequality (8.6.2) with $\lambda_0 = 1/2$. See Fig. 8.3 for two
examples, one the tent function and the other the function
\[ K(x) = 1\{|x| < 1\} \exp\left(-\frac{1}{(x - 1)^2}\right) \exp\left(-\frac{1}{(x + 1)^2}\right), \]
Figure 8.3: Left: "tent" kernel. Right: infinitely differentiable compactly supported kernel.
Now we consider a natural estimator of the function $f$ based on observations (8.6.1) known as
the Nadaraya-Watson estimator. Fix a bandwidth h, which we will see later smooths the estimated
The intuition here is that we have a locally weighted regression function, where points $X_i$ in the
neighborhood of $x$ are given higher weight than further points. Using this function $\hat{f}_n$ as our
estimator, it is possible to provide a guarantee on the bias and variance of the estimated function
at each point $x \in [0, 1]$.
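As a concrete illustration of locally weighted regression, here is a minimal Python sketch of a Nadaraya-Watson-type estimator with the tent kernel. It assumes the standard form $\hat{f}_n(x) = \sum_i Y_i W_{ni}(x)$ with weights $W_{ni}(x) = K((X_i - x)/h) / \sum_j K((X_j - x)/h)$, which is consistent with the weights summing to one as used in the analysis; the function names and the test function `f_example` (the function from Figure 8.2) are illustrative choices.

```python
import math


def tent_kernel(u):
    """Tent kernel K(u) = [1 - |u|]_+."""
    return max(0.0, 1.0 - abs(u))


def nw_weights(x, xs, h, kernel=tent_kernel):
    """Normalized kernel weights W_ni(x); they are nonnegative and sum to one."""
    raw = [kernel((xi - x) / h) for xi in xs]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no design point falls inside the kernel window")
    return [r / total for r in raw]


def nw_estimate(x, xs, ys, h, kernel=tent_kernel):
    """Locally weighted average of the responses ys at the query point x."""
    return sum(w * y for w, y in zip(nw_weights(x, xs, h, kernel), ys))


def f_example(x):
    """The regression function plotted in Figure 8.2."""
    return math.sin(2 * x + math.cos(3 * x) ** 2)
```

With noise-free responses on the grid $X_i = i/n$, the estimate at $x$ is a weighted average of values $f(X_i)$ with $|X_i - x| < h$, so its error is at most $Lh$ for an $L$-Lipschitz $f$; for `f_example` one may take $L = 5$.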
Proposition 8.6.1. Let the observation model (8.6.1) hold and assume condition (8.6.2). In
addition assume the bandwidth is suitably large that $h \ge 2/n$ and that the $X_i$ are evenly spaced on
[0, 1]. Then for any $x \in [0, 1]$, we have
\[ |E[\hat{f}_n(x)] - f(x)| \le Lh \quad \text{and} \quad \mathrm{Var}(\hat{f}_n(x)) \le \frac{2 \sigma^2}{\lambda_0 n h}. \]
Proof To bound the bias, we note that (conditioning implicitly on $X_i$)
\[ E[\hat{f}_n(x)] = \sum_{i=1}^n E[Y_i W_{ni}(x)] = \sum_{i=1}^n E[f(X_i) W_{ni}(x) + \varepsilon_i W_{ni}(x)] = \sum_{i=1}^n f(X_i) W_{ni}(x). \]
and because there are at least nh/2 indices satisfying |Xj − x| ≤ h, we obtain the claim (8.6.3).
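Using the claim, we have
\begin{align*}
\mathrm{Var}(\hat{f}_n(x)) & = E\left[\left(\sum_{i=1}^n (Y_i - f(X_i)) W_{ni}(x)\right)^2\right] = E\left[\left(\sum_{i=1}^n \varepsilon_i W_{ni}(x)\right)^2\right] \\
& = \sum_{i=1}^n W_{ni}(x)^2 E[\varepsilon_i^2] \le \sigma^2 \sum_{i=1}^n W_{ni}(x)^2.
\end{align*}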
Noting that $W_{ni}(x) \le 2 / (\lambda_0 n h)$ and $\sum_{i=1}^n W_{ni}(x) = 1$, we have
\[ \sigma^2 \sum_{i=1}^n W_{ni}(x)^2 \le \sigma^2 \max_i W_{ni}(x) \underbrace{\sum_{i=1}^n W_{ni}(x)}_{=1} \le \sigma^2 \frac{2}{\lambda_0 n h}, \]
With the proposition in place, we can then provide a theorem bounding the worst case pointwise
mean squared error for estimation of a function f ∈ F.
Theorem 8.6.2. Under the conditions of Proposition 8.6.1, choose $h = (\sigma^2 / L^2 \lambda_0)^{1/3} n^{-1/3}$. Then
there exists a universal (numerical) constant $C < \infty$ such that for any $f \in \mathcal{F}$,
\[ \sup_{x \in [0,1]} E[(\hat{f}_n(x) - f(x))^2] \le C \left(\frac{L \sigma^2}{\lambda_0}\right)^{2/3} n^{-\frac{2}{3}}. \]
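The bandwidth in Theorem 8.6.2 can be understood as balancing the squared bias and variance bounds of Proposition 8.6.1. The Python sketch below (with illustrative function names and parameter values) evaluates $L^2 h^2 + 2\sigma^2 / (\lambda_0 n h)$ and the stated choice of $h$; under these two bounds alone, the stated $h$ is the exact minimizer and yields the constant $C = 3$.

```python
def mse_bound(h, n, L, sigma_sq, lam0):
    """Squared-bias bound (L*h)^2 plus variance bound 2*sigma^2 / (lam0*n*h)."""
    return (L * h) ** 2 + 2.0 * sigma_sq / (lam0 * n * h)


def theorem_bandwidth(n, L, sigma_sq, lam0):
    """Bandwidth h = (sigma^2 / (L^2 * lam0))^(1/3) * n^(-1/3)."""
    return (sigma_sq / (L ** 2 * lam0)) ** (1.0 / 3.0) * n ** (-1.0 / 3.0)


def rate_at_theorem_bandwidth(n, L, sigma_sq, lam0):
    """Value of the bound at the theorem's h: 3 * (L*sigma^2/lam0)^(2/3) * n^(-2/3)."""
    return mse_bound(theorem_bandwidth(n, L, sigma_sq, lam0), n, L, sigma_sq, lam0)
```

For example, with `n=1000, L=2.0, sigma_sq=1.5, lam0=0.5`, perturbing the bandwidth by 10% in either direction increases `mse_bound`, illustrating the tradeoff.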
By integrating the result in Theorem 8.6.2 over the interval [0, 1], we immediately obtain the
following corollary.
Corollary 8.6.3. Under the conditions of Theorem 8.6.2, if we use the tent kernel $K_{\mathrm{tent}}$, we have
\[ \sup_{f \in \mathcal{F}} E_f[\|\hat{f}_n - f\|_2^2] \le C \left(\frac{L \sigma^2}{n}\right)^{2/3}, \]
In Proposition 8.6.1, it is possible to show that a more clever choice of kernels (ones that are
not always positive) can attain bias $E[\hat{f}_n(x)] - f(x) = O(h^\beta)$ if $f$ has Lipschitz $(\beta - 1)$th derivative.
In this case, we immediately obtain that the rate can be improved to
\[ \sup_x E[(\hat{f}_n(x) - f(x))^2] \le C n^{-\frac{2\beta}{2\beta + 1}}, \]
and every additional degree of smoothness gives a corresponding improvement in convergence rate.
We also remark that rates of this form, which are much larger than $n^{-1}$, are characteristic of non-parametric problems; essentially, we must adaptively choose a dimension that balances the sample
size, so that rates of $1/n$ are difficult or impossible to achieve.
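Theorem 8.6.4. Let the observation points $X_i$ be spaced evenly on [0, 1], and assume the observation model (8.6.1). Then there exists a universal constant $c > 0$ such that
\[ \mathfrak{M}_n(\mathcal{F}, \|\cdot\|_2^2) := \inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} E_f\left[\|\hat{f}_n - f\|_2^2\right] \ge c \left(\frac{\sigma^2}{n}\right)^{\frac{2}{3}}. \]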
Deferring the proof of the theorem temporarily, we make a few remarks. It is in fact possible to
show, using a completely identical technique, that if $\mathcal{F}_\beta$ denotes the class of functions with $\beta - 1$
derivatives, where the $(\beta - 1)$th derivative is Lipschitz, then
\[ \mathfrak{M}_n(\mathcal{F}_\beta, \|\cdot\|_2^2) \ge c \left(\frac{\sigma^2}{n}\right)^{\frac{2\beta}{2\beta + 1}}. \]
So for any smoothness class, we can never achieve the parametric $\sigma^2 / n$ rate, but we can come
arbitrarily close. As another remark, which we do not prove, in dimensions $d \ge 1$, the minimax
rate for estimation of functions $f$ with Lipschitz $(\beta - 1)$th derivative scales as
\[ \mathfrak{M}_n(\mathcal{F}_\beta, \|\cdot\|_2^2) \ge c \left(\frac{\sigma^2}{n}\right)^{\frac{2\beta}{2\beta + d}}. \tag{8.6.4} \]
This result can, similarly, be proved using a variant of Assouad’s method or a local Fano method;
see, for example, Györfi et al. [99, Chapter 3]. Exercise 8.9 works through a particular case of this
lower bound. This is a striking example of the curse of dimensionality: the penalty for increasing
dimension results in worse rates of convergence. For example, suppose that $\beta = 1$. In 1 dimension,
we require $n \ge 90 \approx (.05)^{-3/2}$ observations to achieve accuracy .05 in estimation of $f$, while we
require $n \ge 8000 = (.05)^{-(2+d)/2}$ even when the dimension $d = 4$, and $n \ge 64 \cdot 10^6$ observations even
in 10 dimensions, which is a relatively small problem. That is, the problem is made exponentially
more difficult by dimension increases.
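The sample sizes quoted above follow from solving $n^{-2\beta/(2\beta + d)} = \text{accuracy}$ for $n$. The short Python sketch below reproduces the arithmetic; it takes $c = 1$ and $\sigma^2 = 1$ in (8.6.4) as simplifying assumptions, and the function name is illustrative.

```python
import math


def samples_needed(accuracy, d, beta=1.0):
    """Value of n solving n^(-2*beta / (2*beta + d)) = accuracy,
    ignoring the constant c and taking sigma^2 = 1."""
    return (1.0 / accuracy) ** ((2.0 * beta + d) / (2.0 * beta))
```

With accuracy .05 and $\beta = 1$, the exponent is $(2 + d)/2$, so the required $n$ is $20^{3/2}$ for $d = 1$, $20^3$ for $d = 4$, and $20^6$ for $d = 10$.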
We now prove Theorem 8.6.4. To establish the result, we show how to construct a family of
problems, indexed by binary vectors $v \in \{-1, 1\}^k$, so that our estimation problem satisfies the
separation (8.5.1), then we show that the information based on observing noisy versions of the
functions we have defined is small. Choosing $k$ to make our resulting lower bound as high as
possible completes the argument.
0 outside of the interval [0, 1]. Then for any $v \in \{-1, 1\}^k$, define the "bump" functions
\[ g_j(x) := \frac{1}{k} g\left(k \left(x - \frac{j - 1}{k}\right)\right) \quad \text{and} \quad f_v(x) := \sum_{j=1}^k v_j g_j(x), \]
which we see is 1-Lipschitz. Now, consider any function $f : [0, 1] \to \mathbb{R}$, and let $E_j$ be shorthand for
the intervals $E_j = [(j - 1)/k, j/k]$ for $j = 1, \ldots, k$. We must find a mapping identifying a function
$f$ with points in the hypercube $\{-1, 1\}^k$. To that end, we may define a vector $\hat{v}(f) \in \{-1, 1\}^k$ by
\[ \hat{v}_j(f) = \mathop{\mathrm{argmin}}_{s \in \{-1, 1\}} \int_{E_j} (f(t) - s g_j(t))^2 \, dt. \]
Bounding the binary testing error Let $P_v^n$ denote the distribution of the n observations
$Y_i = f_v(X_i) + \varepsilon_i$ when $f_v$ is the true regression function. Then inequality (8.6.6) implies via
Assouad's lemma that
\[ \mathfrak{M}_n(\mathcal{F}, \|\cdot\|_2^2) \ge \frac{c}{k^3} \sum_{j=1}^k \left[1 - \|P_{+j}^n - P_{-j}^n\|_{\mathrm{TV}}\right]. \tag{8.6.7} \]
For any two functions $f_v$ and $f_{v'}$, we have that the observations $Y_i$ are independent and normal
with means $f_v(X_i)$ or $f_{v'}(X_i)$, respectively. Thus
\begin{align*}
D_{\mathrm{kl}}(P_v^n \| P_{v'}^n) & = \sum_{i=1}^n D_{\mathrm{kl}}\left(N(f_v(X_i), \sigma^2) \| N(f_{v'}(X_i), \sigma^2)\right) \\
& = \sum_{i=1}^n \frac{1}{2\sigma^2} (f_v(X_i) - f_{v'}(X_i))^2. \tag{8.6.8}
\end{align*}
Now we must show that the expression (8.6.8) scales more slowly than $n$, which we will see must
be the case whenever $d_{\mathrm{ham}}(v, v') \le 1$. Intuitively, most of the observations have the same
distribution by our construction of the $f_v$ as bump functions; let us make this rigorous.
We may assume without loss of generality that $v_j = v_j'$ for $j > 1$. As the $X_i = i/n$, we thus
have that only $X_i$ for $i$ near 1 can have non-zero values in the tensorization (8.6.8). In particular,
\[ f_v(i/n) = f_{v'}(i/n) \quad \text{for all } i \text{ s.t. } \frac{i}{n} \ge \frac{2}{k}, \text{ i.e. } i \ge \frac{2n}{k}. \]
Rewriting expression (8.6.8), then, and noting that $f_v(x) \in [-1/k, 1/k]$ for all $x$ by construction,
we have
\[ \sum_{i=1}^n \frac{1}{2\sigma^2} (f_v(X_i) - f_{v'}(X_i))^2 \le \sum_{i=1}^{2n/k} \frac{1}{2\sigma^2} (f_v(X_i) - f_{v'}(X_i))^2 \le \frac{1}{2\sigma^2} \cdot \frac{2n}{k} \cdot \frac{1}{k^2} = \frac{n}{k^3 \sigma^2}. \]
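Combining this with the identity (8.6.8), Pinsker's inequality, and the minimax bound (8.6.7), we obtain
\[ \|P_{+j}^n - P_{-j}^n\|_{\mathrm{TV}} \le \sqrt{\frac{n}{2 k^3 \sigma^2}}, \]
so
\[ \mathfrak{M}_n(\mathcal{F}, \|\cdot\|_2^2) \ge \frac{c}{k^3} \sum_{j=1}^k \left[1 - \sqrt{\frac{n}{2 k^3 \sigma^2}}\right]. \]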
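One way to see how the preceding display leads to an $n^{-2/3}$ rate is to pick $k$ so that the square-root term equals 1/2. The Python sketch below does this under two loudly stated assumptions: $k$ is treated as a real number (the construction itself needs an integer), and this particular balancing choice is illustrative rather than necessarily the one used to complete the proof. The function names and the default $c = 1$ are hypothetical.

```python
import math


def assouad_rate_bound(k, n, sigma_sq, c=1.0):
    """(c / k^3) * sum_{j=1}^k [1 - sqrt(n / (2 k^3 sigma^2))], i.e. the
    final display with the k identical summands collected."""
    return (c / k ** 2) * (1.0 - math.sqrt(n / (2.0 * k ** 3 * sigma_sq)))


def balancing_k(n, sigma_sq):
    """Real-valued k with k^3 = 2n / sigma^2, so the square-root term is 1/2."""
    return (2.0 * n / sigma_sq) ** (1.0 / 3.0)
```

With this choice the bound evaluates to $(c/2)(\sigma^2 / (2n))^{2/3}$, which matches the $(\sigma^2/n)^{2/3}$ scaling of Theorem 8.6.4 up to constants.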
\[ P(\hat{V} \ne V) \ge 1 - \frac{I(V; X) + \log 2}{\log(|\mathcal{V}|)}. \]
Thus, there are two ingredients in proving lower bounds on the error in a hypothesis test: upper
bounding the mutual information and lower bounding the size |V|. The key in the global Fano
method is an upper bound on the former (the information I(V ; X)) using covering numbers.
Before stating our result, we require a bit of notation. First, we assume that V is drawn from a
distribution µ, and conditional on V = v, assume the sample X ∼ Pv . Then a standard calculation
(or simply the definition of mutual information; recall equation (8.4.4)) gives that
\[ I(V; X) = \int D_{\mathrm{kl}}(P_v \| \bar{P}) \, d\mu(v), \quad \text{where} \quad \bar{P} = \int P_v \, d\mu(v). \]
Now, we show how to connect this mutual information quantity to a covering number of a set of
distributions.
Assume that for all $v$, we have $P_v \in \mathcal{P}$, where $\mathcal{P}$ is a collection of distributions. In analogy
with Definition 4.7, we say that the collection of distributions $\{Q_i\}_{i=1}^N$ form an $\epsilon$-cover of $\mathcal{P}$ in
KL-divergence if for all $P \in \mathcal{P}$, there exists some $i$ such that $D_{\mathrm{kl}}(P \| Q_i) \le \epsilon^2$. With this, we may
define the KL-covering number of the set $\mathcal{P}$ as
\[ N_{\mathrm{kl}}(\epsilon, \mathcal{P}) := \inf \left\{ N \in \mathbb{N} \mid \exists \, Q_i, i = 1, \ldots, N, \ \sup_{P \in \mathcal{P}} \min_i D_{\mathrm{kl}}(P \| Q_i) \le \epsilon^2 \right\}, \tag{8.7.1} \]
where $N_{\mathrm{kl}}(\epsilon, \mathcal{P}) = +\infty$ if no such cover exists. With definition (8.7.1) in place, we have the following
proposition.
so that inequality (8.7.3) holds. By carefully choosing the distribution Q in the upper bound (8.7.3),
we obtain the proposition.
Now, assume that the distributions $Q_i$, $i = 1, \ldots, N$ form an $\epsilon^2$-cover of the family $\mathcal{P}$, meaning
that
\[ \min_{i \in [N]} D_{\mathrm{kl}}(P \| Q_i) \le \epsilon^2 \quad \text{for all } P \in \mathcal{P}. \]
Let $p_v$ and $q_i$ denote the densities of $P_v$ and $Q_i$ with respect to some fixed base measure on $\mathcal{X}$ (the
choice of base measure does not matter). Then defining the distribution $Q = (1/N) \sum_{i=1}^N Q_i$,
we obtain for any $v$ that in expectation over $X \sim P_v$,
\begin{align*}
D_{\mathrm{kl}}(P_v \| Q) & = E_{P_v}\left[\log \frac{p_v(X)}{q(X)}\right] = E_{P_v}\left[\log \frac{p_v(X)}{N^{-1} \sum_{i=1}^N q_i(X)}\right] \\
& = \log N + E_{P_v}\left[\log \frac{p_v(X)}{\sum_{i=1}^N q_i(X)}\right] \le \log N + E_{P_v}\left[\log \frac{p_v(X)}{\max_i q_i(X)}\right] \\
& \le \log N + \min_i E_{P_v}\left[\log \frac{p_v(X)}{q_i(X)}\right] = \log N + \min_i D_{\mathrm{kl}}(P_v \| Q_i).
\end{align*}
By our assumption that the $Q_i$ form a cover, this gives the desired result, as $\epsilon \ge 0$ was arbitrary,
as was our choice of the cover.
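The mixture argument above, $D_{\mathrm{kl}}(P \| Q) \le \log N + \min_i D_{\mathrm{kl}}(P \| Q_i)$ for the uniform mixture $Q$ of $Q_1, \ldots, Q_N$, can be checked on small discrete distributions. The Python sketch below is only an illustration on finite supports; the function names and the example distributions in the usage note are arbitrary choices.

```python
import math


def kl(p, q):
    """KL-divergence between pmfs p and q on a common finite support
    (assumes q > 0 wherever p > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def uniform_mixture(qs):
    """Uniform mixture (1/N) * sum_i Q_i of a list of pmfs."""
    return [sum(col) / len(qs) for col in zip(*qs)]


def mixture_gap(p, qs):
    """log N + min_i KL(p || q_i) - KL(p || mixture); nonnegative by the argument above."""
    return math.log(len(qs)) + min(kl(p, q) for q in qs) - kl(p, uniform_mixture(qs))
```

For instance, with `p = [0.5, 0.3, 0.2]` and three candidate distributions `qs`, `mixture_gap(p, qs)` is nonnegative, showing that the price of using the mixture in place of the best single $Q_i$ is at most $\log N$.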
With Corollary 8.7.2 and Proposition 8.7.1 in place, we thus see that the global covering numbers
in KL-divergence govern the behavior of information.
We remark in passing that the quantity (8.7.2), and its i.i.d. analogue in Corollary 8.7.2, is
known as the index of resolvability, and it controls estimation rates and redundancy of coding
schemes for unknown distributions in a variety of scenarios; see, for example, Barron [17] and
Barron and Cover [18]. It is also similar to notions of complexity in Dudley’s entropy integral
(cf. Dudley [71]) in empirical process theory, where the fluctuations of an empirical process are
governed by a tradeoff between covering number and approximation of individual terms in the
process.
(ii) Bound the metric entropy. Give an upper bound on the KL-metric entropy of the class $\mathcal{P}$ of
distributions containing all the distributions $P_v$, that is, an upper bound on $\log N_{\mathrm{kl}}(\epsilon, \mathcal{P})$.
(iii) Find the critical radius. Noting as in Corollary 8.7.2 that with n i.i.d. observations, we have
\[ I(V; X_1^n) \le \inf_{\epsilon > 0} \left\{ \log N_{\mathrm{kl}}(\epsilon, \mathcal{P}) + n \epsilon^2 \right\}, \]
we now balance the information $I(V; X_1^n)$ and the packing entropy $\log M(\delta)$. To that end, we
choose $\epsilon_n$ and $\delta_n > 0$ at the critical radius, defined as follows: choose any $\epsilon_n$ and $\delta_n$ such that
\[ \log M(\delta_n) \ge 4 n \epsilon_n^2 + 2 \log 2 \ge 2 \log N_{\mathrm{kl}}(\epsilon_n, \mathcal{P}) + 2 n \epsilon_n^2 + 2 \log 2 \ge 2 \left(I(V; X_1^n) + \log 2\right). \]
(We could have chosen the $\epsilon_n$ attaining the infimum in the mutual information, but this way
we need only an upper bound on $\log N_{\mathrm{kl}}(\epsilon, \mathcal{P})$.)
(iv) Apply the Fano minimax bound. Having chosen $\delta_n$ and $\epsilon_n$ as above, we immediately obtain
that for the Markov chain $V \to X_1^n \to \hat{V}$,
\[ P(V \ne \hat{V}) \ge 1 - \frac{I(V; X_1, \ldots, X_n) + \log 2}{\log M(\delta_n)} \ge 1 - \frac{1}{2} = \frac{1}{2}, \]
and thus, applying the Fano minimax bound in Proposition 8.4.3, we obtain
\[ \mathfrak{M}_n(\theta(\mathcal{P}); \Phi \circ \rho) \ge \frac{1}{2} \Phi(\delta_n). \]
Now, if for some fixed $x \in [0, 1]$ and $f, g \in \mathcal{F}$ we define $P_f$ and $P_g$ to be the distributions of the
observations $f(x) + \varepsilon$ or $g(x) + \varepsilon$, we have that
\[ D_{\mathrm{kl}}(P_f \| P_g) = \frac{1}{2\sigma^2} (f(x) - g(x))^2 \le \frac{\|f - g\|_\infty^2}{2\sigma^2}, \]
and if $P_f^n$ is the distribution of the n observations $f(X_i) + \varepsilon_i$, $i = 1, \ldots, n$, we also have
\[ D_{\mathrm{kl}}\left(P_f^n \| P_g^n\right) = \sum_{i=1}^n \frac{1}{2\sigma^2} (f(X_i) - g(X_i))^2 \le \frac{n}{2\sigma^2} \|f - g\|_\infty^2. \]
where the final term vanishes since E is (V, Vb )-measurable. On the other hand, we also have
using the fact that conditioning reduces entropy. Applying the definition of conditional entropy
yields
and we upper bound each of these terms separately. For the first term, we have
since conditioned on the event E = 0, the random variable V may take values in a set of size at
most |V| − Ntmin . For the second, we have
since conditioned on E = 1, or equivalently on the event that ρ(Vb , V ) ≤ t, we are guaranteed that
V belongs to a set of cardinality at most Ntmax .
Combining the pieces and noting $P(E = 0) = P_t$, we have proved that
Combining this inequality with our earlier equality (8.8.1), we see that
\[ H(V \mid X) - \log N_t^{\max} \le H(V \mid \hat{V}) - \log N_t^{\max} \le P(\rho(\hat{V}, V) > t) \log \frac{|\mathcal{V}| - N_t^{\min}}{N_t^{\max}} + \log 2. \]
Rearranging the preceding equations yields
\[ P(\rho(\hat{V}, V) > t) \ge \frac{H(V \mid X) - \log N_t^{\max} - \log 2}{\log \frac{|\mathcal{V}| - N_t^{\min}}{N_t^{\max}}}. \tag{8.8.2} \]
Note that this bound holds without any assumptions on the distribution of $V$.
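By definition, we have $I(V; X) = H(V) - H(V \mid X)$. When $V$ is uniform on $\mathcal{V}$, we have
$H(V) = \log |\mathcal{V}|$, and hence $H(V \mid X) = \log |\mathcal{V}| - I(V; X)$. Substituting this relation into the
bound (8.8.2) yields the inequality
\[ P(\rho(\hat{V}, V) > t) \ge \frac{\log \frac{|\mathcal{V}|}{N_t^{\max}}}{\log \frac{|\mathcal{V}| - N_t^{\min}}{N_t^{\max}}} - \frac{I(V; X) + \log 2}{\log \frac{|\mathcal{V}| - N_t^{\min}}{N_t^{\max}}} \ge 1 - \frac{I(V; X) + \log 2}{\log \frac{|\mathcal{V}|}{N_t^{\max}}}. \]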
\[ \Phi(\rho(\theta, \theta(P_v))) \ge 2\delta \sum_{j=1}^d 1\{[\hat{v}(\theta)]_j \ne v_j\}. \]
as the average is smaller than the maximum of a set and using the separation assumption (8.5.1).
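Recalling the definition of the mixtures $P_{\pm j}$ as the joint distribution of $V$ and $X$ conditional on
$V_j = \pm 1$, we swap the summation orders to see that
\begin{align*}
\frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P_v\left([\hat{v}(\hat{\theta})]_j \ne v_j\right) & = \frac{1}{|\mathcal{V}|} \sum_{v : v_j = 1} P_v\left([\hat{v}(\hat{\theta})]_j \ne v_j\right) + \frac{1}{|\mathcal{V}|} \sum_{v : v_j = -1} P_v\left([\hat{v}(\hat{\theta})]_j \ne v_j\right) \\
& = \frac{1}{2} P_{+j}\left([\hat{v}(\hat{\theta})]_j \ne v_j\right) + \frac{1}{2} P_{-j}\left([\hat{v}(\hat{\theta})]_j \ne v_j\right).
\end{align*}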
This gives the statement claimed in the lemma, while taking an infimum over all testing procedures
Ψ : X → {−1, +1} gives the claim (8.5.2).
8.9 Bibliography
For a fuller technical introduction into nonparametric estimation, see the book by Tsybakov [167].
Has’minskii [100].
The material in Section 8.7 is based on a paper of Yang and Barron [175].
8.10 Exercises
Exercise 8.1 (A generalized version of Fano's inequality; cf. Proposition 8.4.6): Let $\mathcal{V}$ and $\hat{\mathcal{V}}$ be
arbitrary sets, and suppose that $\pi$ is a (prior) probability measure on $\mathcal{V}$, where $V$ is distributed
according to $\pi$. Let $V \to X \to \hat{V}$ be a Markov chain, where $V$ takes values in $\mathcal{V}$ and $\hat{V}$ takes values
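in $\hat{\mathcal{V}}$. Let $\mathcal{N} \subset \mathcal{V} \times \hat{\mathcal{V}}$ denote a measurable subset of $\mathcal{V} \times \hat{\mathcal{V}}$ (a collection of neighborhoods), and for
any $\hat{v} \in \hat{\mathcal{V}}$, denote the slice
\[ \mathcal{N}_{\hat{v}} := \{v \in \mathcal{V} : (v, \hat{v}) \in \mathcal{N}\}. \tag{8.10.1} \]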
That is, N denotes the neighborhoods of points v for which we do not consider a prediction vb for
v to be an error, and the slices (8.10.1) index the neighborhoods. Define the “volume” constants
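Define the error probability $P_{\mathrm{error}} = P[(V, \hat{V}) \notin \mathcal{N}]$ and entropy $h_2(p) = -p \log p - (1 - p) \log(1 - p)$.
\[ h_2(P_{\mathrm{error}}) + P_{\mathrm{error}} \log \frac{1 - p^{\min}}{p^{\max}} \ge \log \frac{1}{p^{\max}} - I(V; \hat{V}). \tag{8.10.2} \]
\[ P[(V, \hat{V}) \notin \mathcal{N}] \ge 1 - \frac{I(V; X) + \log 2}{\inf_{\hat{v}} \log \frac{1}{\pi(\mathcal{N}_{\hat{v}})}}. \]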
(c) Now we give a version explicitly using distances. Let $\mathcal{V} \subset \mathbb{R}^d$ and define $\mathcal{N} = \{(v, v') : \|v - v'\| \le \delta\}$ to be the points within $\delta$ of one another. Let $B_v$ denote the $\|\cdot\|$-ball of radius 1
centered at $v$. Conclude that for any prior $\pi$ on $\mathbb{R}^d$ that
\[ P\left(\|V - \hat{V}\|_2 \ge \delta\right) \ge 1 - \frac{I(V; X) + \log 2}{\log \frac{1}{\sup_v \pi(\delta B_v)}}. \]
Exercise 8.2: In this question, we will show that the minimax rate of estimation for the parameter
of a uniform distribution (in squared error) scales as $1/n^2$. In particular, assume that $X_i \stackrel{\mathrm{iid}}{\sim} \mathrm{Uniform}(\theta, \theta + 1)$, meaning that $X_i$ have densities $p(x) = 1\{x \in [\theta, \theta + 1]\}$. Let $X_{(1)} = \min_i \{X_i\}$
denote the first order statistic.
(b) Using Le Cam’s two-point method, show that the minimax rate for estimation of θ ∈ R for the
uniform family U = {Uniform(θ, θ + 1) : θ ∈ R} in squared error has lower bound c/n2 , where
c is a numerical constant.
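As a numerical illustration of the $1/n^2$ scaling (not a solution to the exercise), the following sketch simulates the first order statistic. It relies on the standard fact that $X_{(1)} - \theta$ has a Beta$(1, n)$ distribution, so $E[(X_{(1)} - \theta)^2] = 2/((n+1)(n+2))$. The function names, the value of $\theta$, and the number of trials are illustrative assumptions.

```python
import random

def min_statistic_mse(n, theta=0.3, trials=20000, seed=0):
    """Monte Carlo estimate of E[(X_(1) - theta)^2] for Uniform(theta, theta + 1) samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x_min = min(theta + rng.random() for _ in range(n))
        total += (x_min - theta) ** 2
    return total / trials

def exact_mse(n):
    """X_(1) - theta is Beta(1, n), whose second moment is 2 / ((n + 1)(n + 2))."""
    return 2.0 / ((n + 1) * (n + 2))

if __name__ == "__main__":
    # Doubling n should cut the squared error by roughly a factor of four.
    for n in (10, 20, 40):
        print(n, min_statistic_mse(n), exact_mse(n))
```

The simulated and exact values should agree up to Monte Carlo error, and both shrink by about a factor of four each time $n$ doubles.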
Exercise 8.3 (Sign identification in sparse linear regression): In sparse linear regression, we have
$n$ observations $Y_i = \langle X_i, \theta^* \rangle + \varepsilon_i$, where $X_i \in \mathbb{R}^d$ are known (fixed) vectors and the vector $\theta^*$ has
a small number $k \ll d$ of non-zero indices, and $\varepsilon_i \stackrel{\rm iid}{\sim} \mathsf{N}(0, \sigma^2)$. In this problem, we investigate the
problem of sign recovery, that is, identifying the vector of signs $\mathrm{sign}(\theta_j^*)$ for $j = 1, \ldots, d$, where
$\mathrm{sign}(0) = 0$.
Assume we have the following process: fix a signal threshold $\theta_{\min} > 0$. First, a vector $S \in
\{-1, 0, 1\}^d$ is chosen uniformly at random from the set of vectors $\mathcal{S}_k := \{s \in \{-1, 0, 1\}^d : \|s\|_1 = k\}$.
Then we define vectors $\theta^s$ so that $\theta^s_j = \theta_{\min} s_j$, and conditional on $S = s$, we observe
(b) Assume that $X \in \{-1, 1\}^{n \times d}$. Give a lower bound on how large $n$ must be for sign recovery.
Give a one sentence interpretation of $\sigma^2 / \theta_{\min}^2$.
Exercise 8.4 (General minimax lower bounds): In this exercise, we outline a more general
approach to minimax risk than that afforded by studying losses applied to parameter error. In
particular, we may instead consider losses of the form
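\[
L : \Theta \times \mathcal{P} \to \mathbb{R}_+
\]
where $\mathcal{P}$ is a collection of distributions and $\Theta$ is a parameter space, where additionally the losses
satisfy the condition
\[
\inf_{\theta \in \Theta} L(\theta, P) = 0 ~~ \mbox{for all } P \in \mathcal{P}.
\]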
(a) Consider a statistical risk minimization problem, where we have a distribution P on random
variable X ∈ X , loss function f : Θ × X → R, and for P ∈ P define the population risk
FP (θ) := EP [f (θ, X)]. Show that
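(b) For distributions $P_0, P_1$, define the separation between them (for the loss $L$) by
\[
\mathrm{sep}_L(P_0, P_1; \Theta) := \sup\left\{\delta \ge 0 :
\begin{array}{l}
L(\theta, P_0) \le \delta ~\mbox{implies}~ L(\theta, P_1) \ge \delta \\
L(\theta, P_1) \le \delta ~\mbox{implies}~ L(\theta, P_0) \ge \delta
\end{array}
~\mbox{for any}~ \theta \in \Theta \right\}. \tag{8.10.3}
\]
That is, having small loss on $P_0$ implies large loss on $P_1$ and vice versa.
We say a collection of distributions $\{P_v\}_{v \in \mathcal{V}}$ indexed by $\mathcal{V}$ is $\delta$-separated if $\mathrm{sep}_L(P_v, P_{v'}; \Theta) \ge \delta$.
Show that if $\{P_v\}_{v \in \mathcal{V}}$ is $\delta$-separated, then for any estimator $\hat{\theta}$
\[
\frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} E_{P_v}[L(\hat{\theta}, P_v)] \ge \delta \inf_{\hat{v}} P(\hat{v} \neq V),
\]
where $P$ is the joint distribution over the random index $V$ chosen uniformly and then $X$ sampled
$X \sim P_v$ conditional on $V = v$.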
Exercise 8.5 (Optimality in stochastic optimization): In this question, we prove minimax lower
bounds on the convergence rates in stochastic optimization problems based on the size of the
domain over which we optimize and certain Lipschitz conditions of the functions themselves. You
may assume the dimension d in the problems we consider is as large as you wish.
The setting is as follows: we have a domain Θ ⊂ Rd , function f : Θ × X → R, which is convex
in its first argument, and population risks FP (θ) := EP [f (θ, X)], where the expectation is taken
over X ∼ P . For any two functions F0 , F1 , let θv ∈ argminθ∈Θ Fv (θ), and define the optimization
distance between F0 and F1 by
\[
d_{\rm opt}(F_0, F_1; \Theta) := \inf_{\theta \in \Theta} \left\{ F_0(\theta) + F_1(\theta) - F_0(\theta_0) - F_1(\theta_1) \right\}.
\]
(where $F_P^\star = \inf_{\theta \in \Theta} F_P(\theta)$ and $X_1^n \stackrel{\rm iid}{\sim} P$) satisfies
\[
\mathfrak{M}_n(\mathcal{P}, L) \ge c \min\left\{ \frac{\sqrt{d}}{\sqrt{n}}, 1 \right\},
\]
where $c > 0$ is a constant. You may assume $d \ge 8$ (or any other large constant) for simplicity.
(f) Show how to modify this construction so that for constants $L, R > 0$, if $\Theta \supset [-R, R]^d$, there
are functions $f$ that are $L$-Lipschitz with respect to the $\ell_\infty$ norm, meaning
such that for this domain $\Theta$, loss $f$ (and induced $L$), and the same family of distributions $\mathcal{P}$
as above,
\[
\mathfrak{M}_n(\mathcal{P}, \Theta, L) \ge c L R \min\left\{ \frac{\sqrt{d}}{\sqrt{n}}, 1 \right\}.
\]
(g) Suppose that instead, we have $\Theta \supset \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le R_2\}$, the $\ell_2$-ball of radius $R_2$, and allow
$f$ to be $L_2$-Lipschitz with respect to the $\ell_2$-norm (instead of $\ell_\infty$). Show that
\[
\mathfrak{M}_n(\mathcal{P}, \Theta, L) \ge c \frac{L_2 R_2}{\sqrt{n}}.
\]
(b) Using the optimization distance $d_{\rm opt}(F_0, F_1; \Theta) = \inf_{\theta \in \Theta} \{F_0(\theta) + F_1(\theta) - F_0^\star - F_1^\star\}$, where
$F_v^\star = \inf_{\theta \in \Theta} F_v(\theta)$, defined in Question 8.5, show the separation
(c) Let the loss $L(\theta, P) = F_P(\theta) - \inf_{\theta \in \Theta} F_P(\theta)$ as in Question 8.5, let $\mathcal{P}$ be the collection of
distributions supported on $[-1, 1]^d$, and define the minimax loss gap
\[
\mathfrak{M}_n(\mathcal{P}, \Theta, L) := \inf_{\hat{\theta}_n} \sup_{P \in \mathcal{P}} E_P\left[ F_P(\hat{\theta}_n(X_1^n)) - F_P^\star \right]
\]
where $X_1^n \stackrel{\rm iid}{\sim} P$. Show that there exists a numerical constant $c > 0$ such that
\[
\mathfrak{M}_n(\mathcal{P}, \Theta, L) \ge c \frac{\sqrt{\log(2d)}}{\sqrt{n}}.
\]
(You may assume $d \ge 2$ to avoid trivial cases.) Hint. Use the result of Question 8.4 part (c).
Exercise 8.7: In this question, we study the question of whether adaptivity can give better
estimation performance for linear regression problems. That is, for $i = 1, \ldots, n$, assume that we
observe variables $Y_i$ in the usual linear regression setup,
\[
Y_i = \langle X_i, \theta \rangle + \varepsilon_i, ~~~ \varepsilon_i \stackrel{\rm iid}{\sim} \mathsf{N}(0, \sigma^2), \tag{8.10.4}
\]
where $\theta \in \mathbb{R}^d$ is unknown. But now, based on observing $Y_1^{i-1} = \{Y_1, \ldots, Y_{i-1}\}$, we allow an adaptive
choice of the next predictor variables $X_i \in \mathbb{R}^d$. Let $\mathcal{L}^n_{\rm ada}(F^2)$ denote the family of linear regression
problems under this adaptive setting (with $n$ observations) where we constrain the Frobenius norm
of the data matrix $X^\top = [X_1 \cdots X_n]$, $X \in \mathbb{R}^{n \times d}$, to have bound $\|X\|_{\rm Fr}^2 = \sum_{i=1}^n \|X_i\|_2^2 \le F^2$. We
use Assouad's method to show that the minimax mean-squared error satisfies the following bound:
\[
\mathfrak{M}(\mathcal{L}^n_{\rm ada}(F^2), \|\cdot\|_2^2) := \inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}^d} E[\|\hat{\theta} - \theta\|_2^2] \ge \frac{d \sigma^2}{n} \cdot \frac{1}{16 \cdot \frac{1}{dn} F^2}. \tag{8.10.5}
\]
Here the infimum is taken over all adaptive procedures satisfying $\|X\|_{\rm Fr}^2 \le F^2$.
In general, when we choose $X_i$ based on the observations $Y_1^{i-1}$, we are taking $X_i = F_i(Y_1^{i-1}, U_1^i)$,
where $U_i$ is a random variable independent of $\varepsilon_i$ and $Y_1^{i-1}$ and $F_i$ is some function. Justify the
following steps in the proof of inequality (8.10.5):
(i) Assume that nature chooses $v \in \mathcal{V} = \{-1, 1\}^d$ uniformly at random and, conditionally on $v$,
let $\theta = \theta_v$. Justify
\[
\mathfrak{M}(\mathcal{L}^n_{\rm ada}(F^2), \|\cdot\|_2^2) \ge \inf_{\hat{\theta}} \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} E_{\theta_v}[\|\hat{\theta} - \theta_v\|_2^2].
\]
Argue it is no loss of generality to assume that the choices for $X_i$ are deterministic based on
the $Y_1^{i-1}$. Thus, throughout we assume that $X_i = F_i(Y_1^{i-1}, u_1^i)$, where $u_1^n$ is a fixed sequence,
or, for simplicity, that $X_i$ is a function of $Y_1^{i-1}$.
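(ii) Fix $\delta > 0$. Let $v \in \{-1, 1\}^d$, and for each such $v$, define $\theta_v = \delta v$. Also let $P_v^n$ denote the joint
distribution (over all adaptively chosen $X_i$) of the observed variables $Y_1, \ldots, Y_n$, and define
$P_{+j}^n = \frac{1}{2^{d-1}} \sum_{v : v_j = 1} P_v^n$ and $P_{-j}^n = \frac{1}{2^{d-1}} \sum_{v : v_j = -1} P_v^n$, so that $P_{\pm j}^n$ denotes the distribution of
the $Y_i$ when $v \in \{-1, 1\}^d$ is chosen uniformly at random but conditioned on $v_j = \pm 1$. Then
\[
\inf_{\hat{\theta}} \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} E_{\theta_v}[\|\hat{\theta} - \theta_v\|_2^2] \ge \frac{\delta^2}{2} \sum_{j=1}^d \left[ 1 - \left\| P_{+j}^n - P_{-j}^n \right\|_{\rm TV} \right].
\]
(iii) We have
\[
\frac{\delta^2}{2} \sum_{j=1}^d \left[ 1 - \left\| P_{+j}^n - P_{-j}^n \right\|_{\rm TV} \right] \ge \frac{\delta^2 d}{2} \left[ 1 - \left( \frac{1}{d} \sum_{j=1}^d \left\| P_{+j}^n - P_{-j}^n \right\|_{\rm TV}^2 \right)^{1/2} \right].
\]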
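(iv) Let $P_{+j}^{(i)}$ be the distribution of the random variable $Y_i$ conditioned on $v_j = +1$ (with the other
coordinates of $v$ chosen uniformly at random), and let $P_{+j}^{(i)}(\cdot \mid y_1^{i-1}, x_i)$ denote the distribution
of $Y_i$ conditioned on $v_j = +1$, $Y_1^{i-1} = y_1^{i-1}$, and $x_i$. Justify
\begin{align*}
\left\| P_{+j}^n - P_{-j}^n \right\|_{\rm TV}^2
& \le \frac{1}{2} D_{\rm kl}\left( P_{+j}^n \| P_{-j}^n \right) \\
& \le \frac{1}{2} \sum_{i=1}^n \int D_{\rm kl}\left( P_{+j}^{(i)}(\cdot \mid y_1^{i-1}, x_i) \| P_{-j}^{(i)}(\cdot \mid y_1^{i-1}, x_i) \right) dP_{+j}^{i-1}(y_1^{i-1}, x_i).
\end{align*}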
(vi) We have
\[
\sum_{j=1}^d \left\| P_{+j}^n - P_{-j}^n \right\|_{\rm TV}^2 \le \frac{\delta^2}{\sigma^2} E[\|X\|_{\rm Fr}^2],
\]
where the final expectation is over $V$ drawn uniformly in $\{-1, 1\}^d$ and all $Y_i, X_i$.
(vii) Show how to choose δ appropriately to conclude the minimax bound (8.10.5).
Exercise 8.8: Suppose under the setting of Question 8.7 that we may no longer be adaptive,
meaning that the matrix X ∈ Rn×d must be chosen ahead of time (without seeing any data).
Assuming n ≥ d, is it possible to attain (within a constant factor) the risk (8.10.5)? If so, give an
example construction, if not, explain why not.
Exercise 8.9 (The curse of dimensionality in nonparametric regression): Consider the nonparametric
regression problem in Section 8.6. Let $\mathbb{B}^d$ be the unit $\ell_2$-ball in $\mathbb{R}^d$ and consider the
function class $\mathcal{F}$ of 1-Lipschitz functions taking values in $[-1, 1]$ on $\mathbb{B}^d$, and consider the error
$\|f - g\|_2^2 = \int_{\mathbb{B}^d} (f(x) - g(x))^2 dx$. (Here, 1-Lipschitz means $|f(x) - f(x')| \le \|x - x'\|_2$ for any $x, x'$.)
We show the minimax lower bound (8.6.4) for this function class using Fano's method. Fix $\delta \in [0, 1]$
to be chosen and let $\{x_j\}_{j=1}^M$ be the centers of a maximal $2\delta$-packing of $\mathbb{B}^d$, so that $M \ge (\frac{1}{2\delta})^d$ (by
Lemma 4.3.10), and define the "bump" functions
\[
g_j(x) = \delta \left[ 1 - \|x - x_j\|_2 / \delta \right]_+,
\]
which all have disjoint support. Then for a vector $v \in \{\pm 1\}^M$, define
\[
f_v(x) := \sum_{j=1}^M v_j g_j(x).
\]
(c) Use the Gilbert-Varshamov bound (Lemma 8.2.3) to show there is a collection $\mathcal{V} \subset \{\pm 1\}^M$ of
cardinality $\exp(M/8)$ with $\|f_v - f_{v'}\|_2^2 \ge c_d \delta^2$ for all $v \neq v' \in \mathcal{V}$, where $c_d$ depends only on the
dimension $d$.
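To make the bump construction concrete, here is a minimal one-dimensional sketch (so $\mathbb{B}^1 = [-1, 1]$), assuming $\delta = 1/4$ and five equally spaced $2\delta$-separated centers; these choices and all function names are illustrative only. It checks numerically that $f_v$ has slope at most 1 on a grid and that flipping one interior sign changes the squared $L^2$ distance by $4 \int g_j^2 = 8\delta^3/3$.

```python
def bump(x, center, delta):
    """g_j(x) = delta * [1 - |x - x_j| / delta]_+ in dimension d = 1."""
    return delta * max(0.0, 1.0 - abs(x - center) / delta)

def make_f(signs, centers, delta):
    """f_v(x) = sum_j v_j g_j(x)."""
    return lambda x: sum(s * bump(x, c, delta) for s, c in zip(signs, centers))

def sq_l2_distance(f, g, steps=512):
    """Midpoint-rule approximation of the integral of (f - g)^2 over [-1, 1]."""
    h = 2.0 / steps
    mids = (-1.0 + (k + 0.5) * h for k in range(steps))
    return sum((f(x) - g(x)) ** 2 for x in mids) * h

delta = 0.25
centers = [-1.0 + 2 * delta * j for j in range(5)]  # 2*delta-separated points in [-1, 1]
v = [1, 1, 1, 1, 1]
w = [1, 1, -1, 1, 1]  # differs from v only at the interior center 0
f_v, f_w = make_f(v, centers, delta), make_f(w, centers, delta)

# Largest finite-difference slope of f_v on a fine grid (1-Lipschitz means at most 1).
grid = [-1.0 + k / 256 for k in range(513)]
max_slope = max(abs(f_v(b) - f_v(a)) / (b - a) for a, b in zip(grid, grid[1:]))

if __name__ == "__main__":
    print(max_slope, sq_l2_distance(f_v, f_w), 8 * delta ** 3 / 3)
```

In higher dimensions the same computation applies with $|x - x_j|$ replaced by $\|x - x_j\|_2$; only the volume factor in $\int g_j^2$ changes.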
Exercise 8.10 (Optimal algorithms for memory access): In a modern CPU, memory is
organized in a hierarchy, so that data upon which computations are being actively performed lies
in a very small memory close to the logic units of the processor for which access is extraordinarily
fast, while data not being actively used lies in slower memory slightly farther from the processor.
(Modern processor memory is generally organized into the registers—a small number of 4- or 8-byte
memory locations on the processor—and level 1, 2, (and sometimes 3 or more) cache, which contain
small amounts of data and increasing access times, and RAM (random access memory).) Moving
data—communicating—between levels of the memory hierarchy is both power intensive and very
slow relative to computation on the data itself, so that in many algorithms the bulk of the time of
the algorithm is in moving data from one place to another to be computed upon. Thus, developing
very fast algorithms for numerical (and other) tasks on modern computers requires careful tracking
of memory access and communication, and careful control of these quantities can often yield orders
of magnitude speed improvements in execution. In this problem, you will prove a lower bound on
the number of communication steps that a variety of numerical-type methods must perform, giving
a concrete (attainable) inequality that allows one to certify optimality of specific algorithms.
In particular, we consider matrix multiplication, as it is a proxy for a class of cubic algorithms
that are well behaved. Let $A, B \in \mathbb{R}^{n \times n}$ be matrices, and assume we wish to compute $C = AB$,
via the simple algorithm that for all $i, j$ sets
\[
C_{ij} = \sum_{l=1}^n A_{il} B_{lj}.
\]
where F is some function—that may depend on i, j, l—and Mem(·) indicates that we access the
memory associated with the argument. (In our case, we have Cij = Cij + Ail · Blj .) We assume
that executing F requires that Mem(Ail ), Mem(Blj ), and Mem(Cij ) belong to fast memory, and
that each are distinct (stored in a separate place in slow and fast memory). We assume that the
order of the computations does not matter, so we may re-order them in any way. We call Mem(Ail )
(respectively B or C) an operand in our computation. We let M denote the size of fast/local
memory, and we would like to lower bound the number of times we must communicate an operand
into or out of the fast local memory as a function of n, the matrix size, and M , the fast memory
size, when all we may do is re-order the computation being executed. We let $N_{\rm Store}$ denote the
number of times we write something from fast memory out to slow memory and let $N_{\rm Load}$ denote the
number of times we load something from slow memory to fast memory. Let $N$ be the total number
of operations we execute (for simple matrix multiplication, we have $N = n^3$, though with sparse
matrices, this can be smaller).
We analyze the procedure by breaking the computation into a number of segments, where each
segment contains precisely M load or store (communication-causing) instructions.
(a) Let Nseg be an upper bound on the number of evaluations with the function F (·) in any given
segment (you will upper bound this in a later part of the problem). Justify that
(b) Within a segment, all operands involved must be in fast memory at least once to be computed
with. Assume that memory locations Mem(Ail ), Mem(Blj ), and Mem(Cij ) do not overlap.
For any operand involved in a memory operation in one of the segments, the operand (1) was
already in fast memory at the beginning of the segment, (2) was read from slow memory, (3)
is still in fast memory at the end of the segment, or (4) is written to slow memory at the end
of the segment. (There are also operands potentially created during execution that are simply
discarded; we do not bound those.) Justify the following: within a segment, for each type of
operand Mem(Aij ), Mem(Bij ), or Mem(Cij ), there are at most c · M such operands (i.e. there
are at most cM operands of type Mem(Aij ), independent of the others, and so on), where c is
a numerical constant. What value of c can you attain?
(c) Using the result of question 6.1, argue that $N_{\rm seg} \le c' \sqrt{M^3}$ for a numerical constant $c'$. What
value of $c'$ do you get?
(d) Using the result of part (c), argue that the number of loads and stores satisfies
\[
N_{\rm Store} + N_{\rm Load} \ge c'' \frac{N}{\sqrt{M}} - M
\]
for a numerical constant $c''$. What is your constant?
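To see why a bound of order $N/\sqrt{M}$ is attainable, the following sketch counts word transfers for a standard tiled (blocked) matrix multiplication. It is an illustration under simplifying assumptions, not part of the exercise: fast memory holds three $b \times b$ tiles (so $M = 3b^2$), the tile size $b$ divides $n$, and the accumulator tile is created in fast memory rather than loaded. The function name and counters are hypothetical.

```python
def tiled_matmul(A, B, b):
    """Multiply n x n matrices with b x b tiles, counting words moved to/from fast memory."""
    n = len(A)
    assert n % b == 0, "for simplicity the tile size must divide n"
    C = [[0] * n for _ in range(n)]
    loads = stores = flops = 0
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            c_tile = [[0] * b for _ in range(b)]  # accumulator created in fast memory
            for l0 in range(0, n, b):
                # Bring one tile of A and one tile of B into fast memory.
                a_tile = [row[l0:l0 + b] for row in A[i0:i0 + b]]
                b_tile = [row[j0:j0 + b] for row in B[l0:l0 + b]]
                loads += 2 * b * b
                for i in range(b):
                    for j in range(b):
                        for l in range(b):
                            c_tile[i][j] += a_tile[i][l] * b_tile[l][j]
                            flops += 1
            # Write the finished tile of C back to slow memory.
            for i in range(b):
                C[i0 + i][j0:j0 + b] = c_tile[i]
            stores += b * b
    return C, loads, stores, flops

n, b = 8, 2
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i - 2 * j for j in range(n)] for i in range(n)]
C, loads, stores, flops = tiled_matmul(A, B, b)
naive = [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    print(C == naive, loads, stores, flops)
```

Under these assumptions the counts are $2n^3/b$ loads and $n^2$ stores, so with $b = \sqrt{M/3}$ the total communication is $2\sqrt{3}\, n^3/\sqrt{M} + n^2$, matching the $N/\sqrt{M}$ order of the lower bound up to a constant.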
1. Use the global Fano method technique to give lower bounds for density estimation
Chapter 9
In this chapter, we revisit our minimax bounds in the context of what we term constrained risk
inequalities. While the minimax risk provides a first approach for providing fundamental limits
on procedures, its reliance on the collection of all measurable functions as its class of potential
estimators is somewhat limiting. Indeed, in most statistical and statistical learning problems, we
have some type of constraint on our procedures: they must be efficiently computable, they must
work with data arriving in a sequential stream, they must be robust, or they must protect the
privacy of the providers of the data. In modern computational hardware, where physical limits
prevent increasing clock speeds, we may like to use as much parallel computation as possible,
though there are potential tradeoffs between “sequentialness” of procedures and their parallelism.
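With this as context, we replace the minimax risk of Chapter 8.1 with the constrained minimax
risk, which, given a collection $\mathcal{C}$ of possible procedures (private, communication limited, or
otherwise), defines
\[
\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho, \mathcal{C}) := \inf_{\hat{\theta} \in \mathcal{C}} \sup_{P \in \mathcal{P}} E_P\left[ \Phi\big( \rho(\hat{\theta}(X), \theta(P)) \big) \right], \tag{9.0.1}
\]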
where as in the original defining equation (8.1.1) of the minimax risk, Φ : R+ → R+ is a nondecreas-
ing loss, ρ is a semimetric on the space Θ, and the expectation is taken over the sample X ∼ P .
In this chapter, we study the quantity (9.0.1) via a few examples, highlighting possibilities and
challenges with its analysis. We will focus on a restricted class of examples—many procedures do
not fall in the framework we consider—that assumes, given a sample X1 , . . . , Xn , we can represent
the class C of estimators under consideration as acting on some view or processed version Zi of
Xi . This allows us to study communication complexity, memory complexity, and certain private
estimators.
the marginal distribution $M_v(A) := \int Q(A \mid x) dP_v(x)$. The channel $Q$ satisfies a strong data
processing inequality with constant $\alpha \le 1$ for the given $f$-divergence
for any choice of $P_0, P_1$ on $\mathcal{X}$. For any such $f$, we define the $f$-strong data processing constant
\[
\alpha_f(Q) := \sup_{P_0 \neq P_1} \frac{D_f(M_0 \| M_1)}{D_f(P_0 \| P_1)}.
\]
These types of inequalities are common throughout information and probability theory. Perhaps
their most frequent use is in the development of conditions for the fast mixing of Markov chains.
Indeed, suppose the Markov kernel Q satisfies a strong data processing inequality with constant α
with respect to variation distance. If π denotes the stationary distribution of the Markov kernel Q
and we use the operator $\circ$ to denote one step of the Markov kernel,¹
\[
Q \circ P := \int Q(\cdot \mid x) dP(x),
\]
because Q ◦ π = π by definition of the stationary distribution. Thus, the Markov chain enjoys
geometric mixing.
To that end, a common quantity of interest is the Dobrushin coefficient, which immediately
implies mixing rates.
The Dobrushin coefficient satisfies many properties, some of which we discuss in the exercises and
others of which we enumerate here. The first is that
Proposition 9.1.1. The Dobrushin coefficient is the strong data processing constant for the variation
distance, that is,
\[
\alpha_{\rm TV}(Q) = \sup_{P_0 \neq P_1} \frac{\|Q \circ P_0 - Q \circ P_1\|_{\rm TV}}{\|P_0 - P_1\|_{\rm TV}}.
\]
Proof There are two directions to the proof; one easy and one more challenging. For the easy
direction, we see immediately that if $1_x$ and $1_y$ denote point masses at $x$ and $y$, then
\[
\sup_{P_0 \neq P_1} \frac{\|Q \circ P_0 - Q \circ P_1\|_{\rm TV}}{\|P_0 - P_1\|_{\rm TV}} \ge \sup_{x, y} \|Q(\cdot \mid x) - Q(\cdot \mid y)\|_{\rm TV}.
\]
The other direction, that $\|Q \circ P_0 - Q \circ P_1\|_{\rm TV} \le \alpha_{\rm TV}(Q) \|P_0 - P_1\|_{\rm TV}$, is more challenging.
For this, recall Lemma 2.2.4 characterizing the variation distance, and let $Q_\star(A) := \inf_y Q(A \mid y)$.
Then by definition of the Dobrushin coefficient $\alpha = \alpha_{\rm TV}(Q)$, we evidently have $|Q(A \mid x) - Q_\star(A)| \le
\alpha$. Let $M_v = \int Q(\cdot \mid x) dP_v(x)$ for $v \in \{0, 1\}$. By expanding $dP_0 - dP_1$ into its positive and negative
parts, we thus obtain
\begin{align*}
M_0(A) - M_1(A) & = \int Q(A \mid x) (dP_0 - dP_1)(x) \\
& = \int Q(A \mid x) [dP_0(x) - dP_1(x)]_+ - \int Q(A \mid x) [dP_1(x) - dP_0(x)]_+ \\
& \le \int Q(A \mid x) [dP_0(x) - dP_1(x)]_+ - Q_\star(A) \int [dP_1(x) - dP_0(x)]_+ \\
& = \int Q(A \mid x) [dP_0(x) - dP_1(x)]_+ - Q_\star(A) \int [dP_0(x) - dP_1(x)]_+,
\end{align*}
where the final equality uses Lemma 2.2.4. But of course we then obtain
\[
M_0(A) - M_1(A) \le \int (Q(A \mid x) - Q_\star(A)) [dP_0(x) - dP_1(x)]_+ \le \alpha \int [dP_0 - dP_1]_+ = \alpha \|P_0 - P_1\|_{\rm TV},
\]
where the second inequality follows as $0 \le Q(A \mid x) - Q_\star(A) \le \alpha$ and the equality is one of the
characterizations of the total variation distance in Lemma 2.2.4.
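The contraction in Proposition 9.1.1 is easy to check numerically for a finite-state kernel. The sketch below assumes a hypothetical 3-state kernel `Q` (row `x` is the distribution $Q(\cdot \mid x)$), computes $\sup_{x,y} \|Q(\cdot \mid x) - Q(\cdot \mid y)\|_{\rm TV}$, and compares the variation distance after $k$ steps with the geometric bound $\alpha^k \|P_0 - P_1\|_{\rm TV}$. The matrix and function names are illustrative.

```python
import random

def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def dobrushin(Q):
    """sup over pairs of rows x, y of the TV distance between Q(. | x) and Q(. | y)."""
    return max(tv(row_x, row_y) for row_x in Q for row_y in Q)

def step(Q, p):
    """One step of the kernel: (Q o P)(z) = sum_x Q(z | x) P(x)."""
    return [sum(p[x] * Q[x][z] for x in range(len(Q))) for z in range(len(Q[0]))]

def random_distribution(k, rng):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [wi / s for wi in w]

Q = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]
alpha = dobrushin(Q)

if __name__ == "__main__":
    rng = random.Random(0)
    p0, p1 = random_distribution(3, rng), random_distribution(3, rng)
    d0 = tv(p0, p1)
    for k in range(1, 6):
        p0, p1 = step(Q, p0), step(Q, p1)
        print(k, tv(p0, p1), alpha ** k * d0)  # left column should not exceed right
```

Point masses on the two most different rows attain the supremum, which is the "easy direction" of the proof.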
A more substantial fact is that the Dobrushin coefficient upper bounds every other strong data
processing constant.
Theorem 9.1.2. Let f : R+ → R ∪ {∞} satisfy f (1) = 0. Then for any channel Q,
The theorem is roughly a consequence of a few facts. First, Proposition 9.1.1 holds. Second,
without loss of generality we may assume that $f \ge 0$; indeed, replacing $f(t)$ with $h(t) = f(t) - f'(1)(t - 1)$
for any $f'(1) \in \partial f(1)$, we have $h \ge 0$ as $0 \in \partial h(1)$ and $D_h = D_f$. Third, any $f \ge 0$ with $0 \in \partial f(1)$
can be approximated arbitrarily accurately with functions of the form $h(t) = \sum_{i=1}^k a_i [t - c_i]_+ +
\sum_{i=1}^k b_i [d_i - t]_+$, where $c_i \ge 1$ and $d_i \le 1$. For such $h$, an argument shows that
which follows from the similarities between variation distance, with $f(t) = \frac{1}{2} |t - 1|$, and the positive
part functions $[\cdot]_+$.
There is a related result, which we do not prove, that guarantees that strong data processing
constants for $\chi^2$-divergences are the "worst" constants. In particular, if $QP = \int Q(\cdot \mid x) dP(x)$
denotes the application of one step of a channel $Q$ to $X \sim P$, then the $\chi^2$ contraction coefficient is
\[
\alpha_{\chi^2}(Q) = \sup_{P_0 \neq P_1} \frac{D_{\chi^2}(QP_0 \| QP_1)}{D_{\chi^2}(P_0 \| P_1)}.
\]
Then it is possible to show that for any twice continuously differentiable $f$ on $\mathbb{R}_{++}$ with $f''(1) > 0$,
and we also have $\alpha_{\chi^2}(Q) = \alpha_{\rm kl}(Q)$, so that the strong data processing inequalities for KL-divergence
and $\chi^2$-divergence coincide.
In our context, that of (constrained) minimax lower bounds, such data processing inequalities
immediately imply somewhat sharper lower bounds than the (unconstrained) applications in previous
chapters. Indeed, let us revisit the situation present in the local Fano bound, where the KL
divergence has a Euclidean structure as in the bound (8.4.6), meaning that $D_{\rm kl}(P_0 \| P_1) \le \kappa^2 \delta^2$ when
our parameters of interest $\theta_v = \theta(P_v)$ satisfy $\rho(\theta_0, \theta_1) \le \delta$. We assume that the constraints $\mathcal{C}$ impose
that the data $X_i$ is passed through a channel $Q$ with KL-data processing constant $\alpha_{\rm KL}(Q) \le 1$. In
this case, in the basic Le Cam's method (8.3.2), an application of Pinsker's inequality yields that
whenever $\rho(\theta_0, \theta_1) \ge 2\delta$ then
\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, \mathcal{C}) \ge \frac{\Phi(\delta)}{2} \left[ 1 - \sqrt{\frac{n}{2} D_{\rm kl}(M_0 \| M_1)} \right] \ge \frac{\Phi(\delta)}{2} \left[ 1 - \sqrt{n \kappa^2 \alpha_{\rm KL}(Q) \delta^2 / 2} \right],
\]
and the "standard" choice of $\delta$ to make the probability of error constant results in $\delta^2 = (2 n \kappa^2 \alpha_{\rm KL}(Q))^{-1}$,
or the minimax lower bound
\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, \mathcal{C}) \ge \frac{1}{4} \Phi\left( \frac{1}{\sqrt{2 n \kappa^2 \alpha_{\rm KL}(Q)}} \right),
\]
which suggests an effective sample size degradation of $n \mapsto n \alpha_{\rm KL}(Q)$. Similarly, in the local Fano
method in Chapter 8.4.1, we see identical behavior and an effective sample size degradation of
$n \mapsto n \alpha_{\rm KL}(Q)$, that is, if without constraints a sample size of $n(\epsilon)$ is required to achieve some
desired accuracy $\epsilon$, with the constraint a sample size of at least $n(\epsilon) / \alpha_{\rm KL}(Q)$ is necessary.
\[
D_{\rm kl}(M_0 \| M_1) + D_{\rm kl}(M_1 \| M_0) \le 4 (e^\varepsilon - 1)^2 \|P_0 - P_1\|_{\rm TV}^2.
\]
Figure 9.1. The sequentially interactive private observation model: the $i$th output $Z_i$ may depend
on $X_i$ and the previously released $Z_1^{i-1}$. (The diagram shows inputs $X_1, X_2, X_3, \ldots, X_n$ above outputs $Z_1, Z_2, Z_3, \ldots, Z_n$.)
Proof Without loss of generality, we assume that the output space $\mathcal{Z}$ is finite (by definition
(2.2.3)), and let $m_v(z)$ and $q(z \mid x)$ be the p.m.f.s of $M_v$ and $Q$, respectively, and let $P_0$
and $P_1$ have densities $p_0$ and $p_1$ with respect to a measure $\mu$. Then
\[
D_{\rm kl}(M_0 \| M_1) + D_{\rm kl}(M_1 \| M_0) = \sum_z (m_0(z) - m_1(z)) \log \frac{m_0(z)}{m_1(z)}.
\]
To control the difference $m_0(z) - m_1(z)$, note that for any fixed $x_0 \in \mathcal{X}$ we have
\[
\int_{\mathcal{X}} q(z \mid x_0) (p_0(x) - p_1(x)) d\mu(x) = 0.
\]
Thus
\[
m_0(z) - m_1(z) = \int_{\mathcal{X}} (q(z \mid x) - q(z \mid x_0)) (p_0(x) - p_1(x)) d\mu(x),
\]
and so
\begin{align*}
|m_0(z) - m_1(z)| & \le \sup_{x \in \mathcal{X}} |q(z \mid x) - q(z \mid x_0)| \int_{\mathcal{X}} |p_0(x) - p_1(x)| d\mu(x) \\
& = 2 q(z \mid x_0) \sup_{x \in \mathcal{X}} \left| \frac{q(z \mid x)}{q(z \mid x_0)} - 1 \right| \|P_0 - P_1\|_{\rm TV}.
\end{align*}
By definition of local differential privacy, $\left| \frac{q(z \mid x)}{q(z \mid x_0)} - 1 \right| \le e^\varepsilon - 1$, and as $x_0$ was arbitrary we obtain
Noting that $\inf_x q(z \mid x) \le \min\{m_0(z), m_1(z)\}$ we obtain the theorem.
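As a sanity check on the bound $D_{\rm kl}(M_0 \| M_1) + D_{\rm kl}(M_1 \| M_0) \le 4(e^\varepsilon - 1)^2 \|P_0 - P_1\|_{\rm TV}^2$, the following sketch evaluates both sides for binary randomized response, a standard $\varepsilon$-locally private channel that reports the true bit with probability $e^\varepsilon / (1 + e^\varepsilon)$. The choice of channel, the Bernoulli inputs, and the function names are assumptions made for illustration.

```python
import math

def rr_marginal(p, eps):
    """Marginal of Z when X ~ Bernoulli(p) and Z = X w.p. e^eps / (1 + e^eps), else 1 - X."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    m1 = p * keep + (1.0 - p) * (1.0 - keep)
    return [1.0 - m1, m1]

def kl(m, m_prime):
    """KL divergence between two p.m.f.s on a finite set."""
    return sum(a * math.log(a / b) for a, b in zip(m, m_prime) if a > 0)

def both_sides(p0, p1, eps):
    """Return (symmetrized KL of the marginals, 4 (e^eps - 1)^2 TV(P0, P1)^2)."""
    m0, m1 = rr_marginal(p0, eps), rr_marginal(p1, eps)
    lhs = kl(m0, m1) + kl(m1, m0)
    tv = abs(p0 - p1)  # TV distance between Bernoulli(p0) and Bernoulli(p1)
    rhs = 4.0 * (math.exp(eps) - 1.0) ** 2 * tv ** 2
    return lhs, rhs

if __name__ == "__main__":
    # P0 and P1 are point masses here, so D_kl(P0 || P1) is infinite before privatization.
    for eps in (0.1, 0.5, 1.0, 2.0):
        print(eps, both_sides(0.0, 1.0, eps))
```

Even with disjointly supported inputs, the privatized marginals have a finite divergence that shrinks like $\varepsilon^2$ for small $\varepsilon$.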
To be able to apply this result to obtain minimax lower bounds for estimation as in Section 8.3,
we need to address samples drawn from product distributions, even with the potential
interaction (9.2.1). In this case, we consider sequential samples $Z_i \sim Q(\cdot \mid X_i, Z_1^{i-1})$ and define
$M_v^n = \int Q(\cdot \mid x_1^n) dP_v(x_1^n)$ to be the marginal distribution over all the $Z_1^n$. Then we have the
following corollary.
Corollary 9.2.2. Assume that each channel $Q(\cdot \mid X_i, Z_1^{i-1})$ is $\varepsilon_i$-differentially private. Then
\[
D_{\rm kl}(M_0^n \| M_1^n) \le 4 \sum_{i=1}^n (e^{\varepsilon_i} - 1)^2 \|P_0 - P_1\|_{\rm TV}^2.
\]
Proof Recalling the chain rule (2.1.6) for the KL-divergence, we have
\[
D_{\rm kl}(M_0^n \| M_1^n) = \sum_{i=1}^n E_{M_0}\left[ D_{\rm kl}\left( M_{0,i}(\cdot \mid Z_1^{i-1}) \| M_{1,i}(\cdot \mid Z_1^{i-1}) \right) \right],
\]
where the outer expectation is taken over $Z_1^{i-1}$ drawn marginally from $M_0^n$, and $M_{v,i}(\cdot \mid z_1^{i-1})$
denotes the conditional distribution on $Z_i$ given $Z_1^{i-1} = z_1^{i-1}$ when $X_1^n \stackrel{\rm iid}{\sim} P_v$. Writing this distribution
out, we note that $Z_i$ is conditionally independent of $X_{\setminus i}$ given $X_i$ and $Z_1^{i-1}$ by construction,
so for any set $A$
\begin{align*}
M_{v,i}(A \mid z_1^{i-1}) & = \int Q(Z_i \in A \mid x_1^n, z_1^{i-1}) dP_v(x_1^n \mid z_1^{i-1}) = \int Q(Z_i \in A \mid x_i, z_1^{i-1}) dP_v(x_1^n \mid z_1^{i-1}) \\
& = \int Q(Z_i \in A \mid x_i, z_1^{i-1}) dP_v(x_i).
\end{align*}
Now we know that $Q(Z_i \in \cdot \mid x_i, z_1^{i-1})$ is $\varepsilon_i$-differentially private by assumption, so Theorem 9.2.1
gives
\[
D_{\rm kl}\left( M_{0,i}(\cdot \mid z_1^{i-1}) \| M_{1,i}(\cdot \mid z_1^{i-1}) \right) \le 4 (e^{\varepsilon_i} - 1)^2 \|P_0 - P_1\|_{\rm TV}^2
\]
for any realization $z_1^{i-1}$ of $Z_1^{i-1}$. Summing over $i$ in the chain rule gives the result.
Local privacy is such a strong condition on the channel $Q$ that it actually "transforms" the
KL-divergence into a variation distance, so that even if two distributions $P_0$ and $P_1$ have infinite
KL-divergence $D_{\rm kl}(P_0 \| P_1) = +\infty$ (for example, if their supports are not completely overlapping),
their induced marginals have the much smaller divergence $D_{\rm kl}(M_0 \| M_1) \le 4 (e^\varepsilon - 1)^2 \|P_0 - P_1\|_{\rm TV}^2 \le
4 (e^\varepsilon - 1)^2$. This transformation into a different metric means that even estimation problems that
should on their faces be easy become quite challenging under local privacy constraints; for example,
the minimax squared error for estimating the mean of a random variable with finite variance scales as
$1/\sqrt{n}$ rather than the typical $1/n$ scaling in non-private cases (see Exercise 9.4).
Let us demonstrate how to apply Corollary 9.2.2 in a few applications. Our main object of
interest is the private analogue of the minimax risk (8.1.1), where for a parameter $\theta : \mathcal{P} \to \Theta$,
semimetric $\rho$, and loss $\Phi$, for a family of channels $\mathcal{Q}$ we define the channel-constrained minimax
risk
\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, \mathcal{Q}) := \inf_{\hat{\theta}_n} \inf_{Q \in \mathcal{Q}} \sup_{P \in \mathcal{P}} E_{P,Q}\left[ \Phi\big( \rho(\hat{\theta}_n(Z_1^n), \theta(P)) \big) \right]. \tag{9.2.2}
\]
When we take $\mathcal{Q} = \mathcal{Q}_\varepsilon$ to be the collection of $\varepsilon$-locally differentially private (interactive)
channels (9.2.1), we obtain the $\varepsilon$-locally private minimax risk.
A few examples showing lower (and upper) bounds for the private minimax risk (9.2.2) in mean
estimation follow.
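Example 9.2.3 (Bounded mean estimation): Let $\mathcal{P}$ be the collection of distributions with
supports on $[-b, b]$, where $0 < b < \infty$. Then for any $\varepsilon \ge 0$, the minimax squared error satisfies
\[
\mathfrak{M}_n(\theta(\mathcal{P}), (\cdot)^2, \mathcal{Q}_\varepsilon) \gtrsim \frac{b^2}{(e^\varepsilon - 1)^2 n} + \frac{b^2}{n}.
\]
The second term in the bound is the classic minimax rate for this collection of distributions.
To see the first term, take Bernoulli distributions $P_0$ and $P_1 \in \mathcal{P}$, where for some $\delta \ge 0$
to be chosen, under $P_0$ we have $X = b$ with probability $\frac{1 - \delta}{2}$ and $-b$ otherwise, while under
$P_1$ we have $X = b$ with probability $\frac{1 + \delta}{2}$ and $X = -b$ otherwise. Then $\|P_0 - P_1\|_{\rm TV} = \delta$,
$E_1[X] - E_0[X] = 2 b \delta$, and by Le Cam's method (8.3.3), for any $\varepsilon$-locally private channel $Q$
and induced marginals $M_0^n, M_1^n$ as in Corollary 9.2.2, we have
\begin{align*}
\mathfrak{M}_n(\theta(\mathcal{P}), (\cdot)^2, \{Q\}) & \ge \frac{b^2 \delta^2}{2} \left[ 1 - \sqrt{\frac{1}{2} D_{\rm kl}(M_0^n \| M_1^n)} \right] \ge \frac{b^2 \delta^2}{2} \left[ 1 - \sqrt{2 (e^\varepsilon - 1)^2 n \|P_0 - P_1\|_{\rm TV}^2} \right] \\
& = \frac{b^2 \delta^2}{2} \left[ 1 - \sqrt{2 (e^\varepsilon - 1)^2 n \delta^2} \right].
\end{align*}
Setting $\delta^2 = \frac{1}{8 n (e^\varepsilon - 1)^2}$ gives the claimed minimax bound. ◇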
Effectively, then, we see a reduction in the effective sample size: when ε is large, there is no change,
but otherwise, the estimation error is similar to that when we observe a sample of size nε2 .
Example 9.2.4 (Estimating the parameter of a uniform distribution): In Exercise 8.2, we show
that estimating the parameter $\theta$ of a $\mathrm{Uniform}(\theta, \theta + 1)$ distribution has minimax squared error
scaling as $1/n^2$. Under local differential privacy, this is impossible. Let $\mathcal{P} = \{\mathrm{Uniform}(\theta, \theta +
1), \theta \in [0, 1]\}$ be the collection of uniform distributions with the given supports. Letting $P_0$
and $P_1$ be $\mathrm{Uniform}(0, 1)$ and $\mathrm{Uniform}(\delta, 1 + \delta)$, respectively, where $\delta \ge 0$ is to be chosen, we
have $\|P_0 - P_1\|_{\rm TV} = \delta$, while for any $\varepsilon$-differentially private channel $Q$ and induced marginals
$M_0$ and $M_1$,
\[
D_{\rm kl}(M_0^n \| M_1^n) \le 4 (e^\varepsilon - 1)^2 n \|P_0 - P_1\|_{\rm TV}^2 = 4 (e^\varepsilon - 1)^2 n \delta^2.
\]
Applying Le Cam's method (8.3.3) and taking $\delta \asymp \frac{1}{\sqrt{n} (e^\varepsilon - 1)}$, we thus have that if $\mathcal{Q}_\varepsilon$ denotes
the collection of $\varepsilon$-locally differentially private channels,
\[
\mathfrak{M}_n(\theta(\mathcal{P}), (\cdot)^2, \mathcal{Q}_\varepsilon) \gtrsim \frac{1}{(e^\varepsilon - 1)^2 n}.
\]
In both the preceding examples, a number of simple estimators achieve the given minimax rates.
The simplest is one based on the Laplace mechanism (Example 7.1.3): let $W_i \stackrel{\rm iid}{\sim} \mathrm{Laplace}(1)$, and
set $Z_i = X_i + \frac{2b}{\varepsilon} W_i$ in Example 9.2.3 and $Z_i = X_i + \frac{2}{\varepsilon} W_i$ in Example 9.2.4. In the former, define
$\hat{\theta}_n = \bar{Z}_n$ to be the mean; in the latter, $E[\bar{Z}_n] = \theta + \frac{1}{2}$, so $\hat{\theta}_n = \bar{Z}_n - \frac{1}{2}$ achieves the minimax rate.
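The Laplace-mechanism estimator for bounded means can be simulated directly. The sketch below is a minimal illustration, assuming data $X_i \in \{-b, b\}$ with $P(X_i = b) = p$ and using the fact that the difference of two independent Exponential(1) variables is Laplace(1); the function names and default parameters are hypothetical. Because Laplace(1) has variance 2, the mean-squared error of $\bar{Z}_n$ is $(\mathrm{Var}(X) + 8 b^2 / \varepsilon^2)/n$.

```python
import random

def private_mean(xs, b, eps, rng):
    """Average of Z_i = X_i + (2b / eps) W_i with W_i ~ Laplace(1)."""
    scale = 2.0 * b / eps
    # The difference of two independent Exponential(1) draws is Laplace(1).
    zs = [x + scale * (rng.expovariate(1.0) - rng.expovariate(1.0)) for x in xs]
    return sum(zs) / len(zs)

def simulate_mse(n, eps, b=1.0, p=0.7, trials=2000, seed=0):
    """Monte Carlo mean-squared error of the private mean for X in {-b, b}."""
    rng = random.Random(seed)
    true_mean = b * (2 * p - 1)
    total = 0.0
    for _ in range(trials):
        xs = [b if rng.random() < p else -b for _ in range(n)]
        total += (private_mean(xs, b, eps, rng) - true_mean) ** 2
    return total / trials

def predicted_mse(n, eps, b=1.0, p=0.7):
    """Var(X)/n plus the Laplace noise variance 2 * (2b / eps)^2 / n."""
    var_x = b * b * (1 - (2 * p - 1) ** 2)
    return (var_x + 8.0 * b * b / eps ** 2) / n

if __name__ == "__main__":
    for eps in (0.5, 1.0, 4.0):
        print(eps, simulate_mse(100, eps), predicted_mse(100, eps))
```

For small $\varepsilon$ the noise term $8 b^2 / (\varepsilon^2 n)$ dominates, matching the $b^2 / ((e^\varepsilon - 1)^2 n)$ term of the lower bound up to constants since $e^\varepsilon - 1 \approx \varepsilon$.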
More extreme examples are possible. Consider, for example, the problem of testing the support
of a distribution, where we care only about distinguishing two distributions.
Example 9.2.5 (Support testing): Consider the problem of testing between the support of
two uniform distributions, that is, given $n$ observations, we wish to test whether $P = P_0 =
\mathrm{Uniform}[0, 1]$ or $P = P_1 = \mathrm{Uniform}[\theta, 1]$ for some $\theta \in (0, 1)$. We can ask the rate at which
we may take $\theta \downarrow 0$ with $n$ while still achieving non-trivial testing power. Without privacy, a
simple (and optimal) test $\Psi$ is to simply check whether any observation $X_i < \theta$, in which case
we can trivially accept $P_0$ and reject $P_1$, otherwise accepting $P_1$. Then
\[
P_0(\Psi = 1) + P_1(\Psi = 0) = (1 - \theta)^n \le \exp(-\theta n),
\]
where $c_\varepsilon = 4 (e^\varepsilon - 1)^2$. In the range that $\frac{1}{n} \ll \theta \ll \frac{1}{\sqrt{n}}$, then, there is an essentially exponential
gap between the non-private and private cases. ◇
Figure 9.2. A communication tree representing testing equality for 2-dimensional bit strings
$x, y \in \{0, 1\}^2$. Internal nodes labeled $a_j$ communicate the $j$th bit $a_j(x) = x_j$ of $x$, while internal
nodes labeled $b_j$ communicate the $j$th bit $b_j(y) = y_j$ of $y$. The maximum number of messages is 4.
(A more efficient protocol is to have Alice send the entire string $x \in \{0, 1\}^n$, then for Bob to check
equality $x = y$ and output "Yes" or "No.")
protocol Π, which specifies the messages that each of Alice and Bob send to one another. We view
this as a series of rounds, where at each round, the protocol allows one {0, 1}-valued bit to be sent
and determines who sends this bit, and, at termination time, can compute f (x, y) based on the
communicated message. Then the communication cost of Π is the maximum number of messages
sent to (correctly) compute f over all inputs x, y.
A more convenient formulation for analysis is to consider a binary tree:
Definition 9.3. A protocol Π over a domain X × Y with output space Z is a binary tree, where
each internal node v is labeled with a mapping av : X → {0, 1} or bv : Y → {0, 1} and each leaf is
labeled with a value z ∈ Z.
Then to execute a communication protocol Π on input (x, y), we walk down the tree: beginning
at the root node, for each internal node v labeled av (an Alice node) we walk left if av (x) = 0 and
right if av (x) = 1, and each node v labeled bv (a Bob node) we walk left if bv (y) = 0 and right if
bv (y) = 1. Then the communication cost of the protocol Π is the height of the tree, which we denote
by depth(Π). Figure 9.2 shows an example for testing the equality x = y of two 2-dimensional bit
strings $x, y \in \{0, 1\}^2$.
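The tree view of a protocol translates directly into code. The following sketch encodes the equality protocol of Figure 9.2 as a binary tree, executes it by walking from the root, and computes its depth; the class names `Leaf` and `Node` and the helper functions are illustrative choices, not notation from the text.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Union

@dataclass
class Leaf:
    value: str  # output z of the protocol

@dataclass
class Node:
    speaker: str     # "alice" for a node a_v, "bob" for a node b_v
    bit: Callable    # maps the speaker's input to 0 or 1
    left: "Tree"     # followed when the communicated bit is 0
    right: "Tree"    # followed when the communicated bit is 1

Tree = Union[Leaf, Node]

def run(tree, x, y):
    """Walk down the tree on input (x, y) and return the leaf label."""
    while isinstance(tree, Node):
        b = tree.bit(x if tree.speaker == "alice" else y)
        tree = tree.right if b else tree.left
    return tree.value

def depth(tree):
    """Communication cost: the height of the protocol tree."""
    if isinstance(tree, Leaf):
        return 0
    return 1 + max(depth(tree.left), depth(tree.right))

def equality_protocol():
    """Alice sends x_1, Bob sends y_1, then the same for bit 2; any mismatch ends in "no"."""
    def check_bit(j, on_match):
        return Node("alice", lambda x: x[j],
                    Node("bob", lambda y: y[j], on_match, Leaf("no")),   # x_j = 0
                    Node("bob", lambda y: y[j], Leaf("no"), on_match))   # x_j = 1
    return check_bit(0, check_bit(1, Leaf("yes")))

if __name__ == "__main__":
    tree = equality_protocol()
    print("depth:", depth(tree))
    for x, y in product(product((0, 1), repeat=2), repeat=2):
        print(x, y, run(tree, x, y))
```

The depth of this tree is 4, matching the maximum number of messages stated in the caption of Figure 9.2.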
In classical communication complexity, the main questions center around the communication
complexity of a function $f : \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$, which is the length of the shortest protocol that computes $f$
correctly on all inputs: letting $\Pi_{\rm out}(x, y)$ denote the final output of the protocol $\Pi$ on inputs $(x, y)$,
this is
\[
\mathrm{CC}(f) := \inf \left\{ \mathrm{depth}(\Pi) \mid \Pi_{\rm out}(x, y) = f(x, y) ~\mbox{for all}~ x \in \mathcal{X}, y \in \mathcal{Y} \right\}.
\]
In many cases, it is useful to allow randomized communication protocols, which tolerate some
probability of error; in this case, we let Alice and Bob each have access to (an arbitrary amount)
of randomness, which we can identify without loss of generality with uniform random variables
$U_a, U_b \stackrel{\rm iid}{\sim} \mathrm{Uniform}[0, 1]$, and the nodes $a_v$ and $b_v$ in Definition 9.3 are then mappings $a_v : \mathcal{X} \times [0, 1] \to
\{0, 1\}$ and $b_v : \mathcal{Y} \times [0, 1] \to \{0, 1\}$ and they calculate $a_v(\cdot, U_a)$ and $b_v(\cdot, U_b)$, respectively. Abusing
notation slightly by leaving this randomness implicit, the randomized communication complexity
for an accuracy $\delta$ is then the length of the shortest randomized protocol that calculates $f(x, y)$
correctly with probability at least $1 - \delta$, that is,
\[
\mathrm{RCC}_\delta(f) := \inf \left\{ \mathrm{depth}(\Pi) \mid P(\Pi_{\rm out}(x, y) \neq f(x, y)) \le \delta ~\mbox{for all}~ x \in \mathcal{X}, y \in \mathcal{Y} \right\}. \tag{9.3.1}
\]
In the definition (9.3.1), we leave the randomization in $\Pi$ implicit, and note that we require that
the tree it induces still have a maximum length. We note that essentially any choice of $\delta > 0$ is
immaterial, as we always have
\[
\mathrm{RCC}_\delta(f) \le O(1) \log \frac{1}{\delta} \cdot \mathrm{RCC}_{1/3}(f),
\]
making all (constant) probability of error complexities essentially equivalent. (See Exercise 9.7.)
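One standard route to such a bound is repetition with a majority vote: run a protocol with error at most $1/3$ independently $k$ times and output the most common answer, which errs only if at least half of the runs err. The sketch below (an illustration under that assumption, not the solution to Exercise 9.7) computes the exact binomial tail and the smallest odd $k$ achieving error $\delta$; by Hoeffding's inequality the tail is at most $\exp(-k/18)$, so $k = O(\log \frac{1}{\delta})$ suffices.

```python
import math
from math import comb

def majority_error(k, p=1.0 / 3.0):
    """P(at least (k + 1) / 2 of k independent runs err), for odd k and per-run error p."""
    return sum(comb(k, i) * p ** i * (1.0 - p) ** (k - i)
               for i in range((k + 1) // 2, k + 1))

def repetitions_needed(delta, p=1.0 / 3.0):
    """Smallest odd k for which the majority vote errs with probability at most delta."""
    k = 1
    while majority_error(k, p) > delta:
        k += 2
    return k

if __name__ == "__main__":
    for delta in (0.1, 0.01, 0.001):
        k = repetitions_needed(delta)
        # Compare with the Hoeffding-based sufficient number of repetitions 18 log(1/delta).
        print(delta, k, 18 * math.log(1.0 / delta))
```

The number of repetitions grows linearly in $\log \frac{1}{\delta}$, so the communication cost grows by the same factor.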
There are variants of randomized complexity that allow public randomness rather than pri-
vate randomness, which can yield simpler algorithms and somewhat reduced complexity, but this
improvement is limited, as Alice and Bob can always essentially simulate public randomness (see
Exercise 9.8). Letting Ppub be the collection of protocols in which both Alice and Bob have access
to a shared random variable U ∼ Uniform[0, 1], we make the obvious extension
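\[ \mathrm{RCC}^{\mathrm{pub}}_\delta(f) := \inf_{\Pi \in \mathcal{P}_{\mathrm{pub}}} \{\mathrm{depth}(\Pi) \mid \mathbb{P}(\Pi_{\mathrm{out}}(x, y, U) \neq f(x, y)) \le \delta \text{ for all } x \in \mathcal{X}, y \in \mathcal{Y}\}. \]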
where the supremum is taken over joint distributions on (X, Y ), the infimum over randomized
protocols Π, and the right probability P is over any randomness in Π. There is a subtlety in this
definition: we require Π to be accurate on all inputs (x, y), not just with probability over the
distribution on (X, Y ) in the information measure I(X, Y ; Π(X, Y )). Relaxations to distributional
variants of the information complexity (9.3.3) are also natural, as in the definition (9.3.2). Thus
we sometimes consider the distributional information complexity
\[ \mathrm{IC}^\mu_\delta(f) := \inf_{\Pi} \{ I_2(X, Y; \Pi(X, Y)) \mid \mu(\Pi_{\mathrm{out}}(X, Y) \neq f(X, Y)) \le \delta \}, \]
Proposition 9.3.1. For any function f , δ ∈ (0, 1), and probability measure µ on X × Y,
and
RCCδ (f ) ≥ ICδ (f ).
Proof The first two inequalities are immediate. By Theorem 2.4.3, we have
that is, there must be at least some u achieving the average error of Π, and the protocol Π is
deterministic given u. So any protocol Π using public randomness to achieve probability of error δ
can be modified into a deterministic protocol Π(·, ·, u) that achieves µ-probability of error δ.²
Proposition 9.3.2. Let v be a node in a deterministic protocol Π and Rv be those pairs (x, y)
reaching node v. Then Rv is a rectangle.
² This is one direction of Yao's minimax theorem [176], which states that communication complexity with public
(shared) randomness and worst-case distributional complexity are identical: $\mathrm{RCC}^{\mathrm{pub}}_\delta(f) = \sup_\mu \mathrm{DCC}^\mu_\delta(f)$.
Proof We prove the result by induction. Certainly, for the root node v, we have Rv = X × Y,
which is a rectangle. Now, let v be an arbitrary (non-root) node in the tree and w its parent; assume
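w.l.o.g. that v is the left child of w and that in w, Alice speaks (that is, we use $a_w : \mathcal{X} \to \{0, 1\}$).
Then $R_w = A \times B$ by the inductive assumption. The pairs reaching the left child v are exactly those
$(x, y) \in R_w$ with $a_w(x) = 0$, so that $R_v = \{x \in A \mid a_w(x) = 0\} \times B$,
which is a rectangle.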
The structure of rectangles for correct protocols thus naturally determines the communication
complexity of a function f. For a set R ⊂ X × Y, we say R is f-constant if f(x, y) = f(x', y') for
all (x, y) ∈ R and (x', y') ∈ R. Thus, any correct protocol Π necessarily partitions X × Y into a
collection of f -constant rectangles, where we identify the rectangles with the leaves l of the protocol
tree. In particular, Proposition 9.3.2 implies the following corollary.
Corollary 9.3.3. Let N be the size of the minimal partition of X × Y into f -constant rectangles.
Then CC(f ) ≥ log2 N .
Proof Any correct protocol Π partitions X × Y into the f -constant rectangles {Rl } indexed by
its leaves l. A binary tree with at least N leaves has depth at least $\log_2 N$.
A related corollary follows by considering fooling sets, which are basically sets that rectangles
cannot contain.
Definition 9.4 (Fooling sets). A set S ⊂ X × Y is a fooling set for f if for any two distinct pairs $(x_0, y_0) \in S$
and $(x_1, y_1) \in S$ satisfying $f(x_0, y_0) = f(x_1, y_1)$, at least one of the inequalities $f(x_0, y_1) \neq f(x_0, y_0)$
or $f(x_1, y_0) \neq f(x_0, y_0)$ holds.
Corollary 9.3.4. Let f have a fooling set S of size N . Then CC(f ) ≥ log2 N .
Proof By definition, no f-constant rectangle contains more than a single element of S. So the
tree associated with any correct protocol Π has a distinct leaf for each element of S, and hence at
least N leaves.
An extension of the fooling set idea is the rectangle measure method, which proves that (for
some probability measure P ) the “size” of f -constant rectangles is small. By judicious choice of
the probability, we can then demonstrate lower bounds.
Proof By the union bound, any f-constant partition of X × Y into rectangles $\{R_l\}_{l=1}^N$ satisfies
$1 \le \sum_{l=1}^N P(R_l) \le N\delta$. So $N \ge \frac{1}{\delta}$, and the result follows by Corollary 9.3.3.
With these results, we can provide lower bounds on two exemplar problems that will inform
much of our coming development.
Example 9.3.6 (Equality): Consider the problem of testing equality of two n-bit strings
$x, y \in \{0, 1\}^n$, letting f = EQ be f(x, y) = 1 if x = y and 0 otherwise. Define the set
$S = \{(x, x) \mid x \in \{0, 1\}^n\}$, which has cardinality $2^n$, and satisfies f(x, x) = 1 for all (x, x) ∈ S.
That S is a fooling set is immediate: for any (x, x) and (x', x') ∈ S, if x ≠ x', then certainly
(x, x') ∉ S, and f(x, x') = 0 ≠ f(x, x). So
\[ n \le \mathrm{CC}(\mathrm{EQ}) \le n + 1, \]
where the upper bound follows by letting Alice simply communicate the string x and Bob
check if x = y, outputting 1 or 0 as x = y or x ≠ y. ◇
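As a small sanity check of the fooling set argument, the following Python sketch brute-forces Definition 9.4 for EQ on n = 3 bits. The helper names `eq` and `is_fooling_set` are illustrative choices for this sketch, not notation from these notes.

```python
from itertools import product


def eq(x, y):
    """Equality function EQ on bit tuples."""
    return int(x == y)


def is_fooling_set(f, S):
    """Brute-force check of the fooling set condition over distinct pairs in S."""
    for (x0, y0), (x1, y1) in product(S, repeat=2):
        # Only distinct pairs with equal function value are constrained.
        if (x0, y0) == (x1, y1) or f(x0, y0) != f(x1, y1):
            continue
        # Both "crossed" inputs agree with f(x0, y0): the condition fails.
        if f(x0, y1) == f(x0, y0) and f(x1, y0) == f(x0, y0):
            return False
    return True


n = 3
strings = list(product([0, 1], repeat=n))
diagonal = [(x, x) for x in strings]  # the set S of Example 9.3.6

print(is_fooling_set(eq, diagonal), len(diagonal))
```

The diagonal has 2^n elements, so Corollary 9.3.4 gives CC(EQ) ≥ n for this toy size as well.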
The second example concerns inner products on F2 , the field of arithmetic on the integers modulo
2 (that is, with bit strings); one could extend this to inner products in more complicated number
systems (such as floating point), but the basic ideas are cleaner when we deal with bits.
Example 9.3.7 (Inner products on $\mathbb{F}_2$): Consider computing the inner product $\mathrm{IP}_2(x, y) = \langle x, y \rangle \bmod 2$
for n-bit strings $x, y \in \{0, 1\}^n$, where addition is performed modulo 2. Rather
than constructing a fooling set directly, we use Proposition 9.3.5 and let P be the uniform
distribution on $\{0, 1\}^n \times \{0, 1\}^n$. Let R = A × B be a rectangle with $\langle x, y \rangle = 0$ for all
x ∈ A and y ∈ B. The linearity of the inner product guarantees that $\langle x, y \rangle = 0$ for all
x ∈ span(A) and y ∈ span(B), the (linear) spans of A and B in $\mathbb{F}_2^n$, respectively. Now
recognize that span(A), span(B) ⊂ $\mathbb{F}_2^n$ are orthogonal subspaces of $\mathbb{F}_2^n$, and so their dimensions
$d_0 = \dim(\mathrm{span}(A))$ and $d_1 = \dim(\mathrm{span}(B))$ satisfy $d_0 + d_1 \le n$.
Noting that if $d_0 = \dim(\mathrm{span}(A))$ then $|A| \le 2^{d_0}$ in $\mathbb{F}_2^n$, we thus obtain $|R| \le |A| \cdot |B| \le 2^{d_0 + d_1} \le 2^n$, which
(under the uniform measure P) satisfies
\[ P(R) \le \frac{2^n}{2^{2n}} = 2^{-n}. \]
By Proposition 9.3.5, we thus have
\[ n \le \mathrm{CC}(\mathrm{IP}_2) \le n + 1, \]
where once again the upper bound follows by letting Alice simply communicate $x \in \{0, 1\}^n$
and having Bob output $\langle x, y \rangle \bmod 2$. ◇
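The bound |R| ≤ 2^n on rectangles where the inner product vanishes can be checked by brute force for very small n. The sketch below is illustrative only: for each nonempty A it pairs A with the largest compatible B, and the function names are hypothetical. It enumerates all subsets of {0, 1}^n, so it is only feasible for n ≤ 3 or so.

```python
from itertools import combinations, product


def ip2(x, y):
    """Inner product of two bit tuples modulo 2."""
    return sum(a * b for a, b in zip(x, y)) % 2


def largest_zero_rectangle(n):
    """Largest |A| * |B| over rectangles A x B on which ip2 is identically 0."""
    points = list(product([0, 1], repeat=n))
    best = 0
    for size in range(1, len(points) + 1):
        for A in combinations(points, size):
            # The largest B compatible with A is everything orthogonal to all of A.
            B = [y for y in points if all(ip2(x, y) == 0 for x in A)]
            best = max(best, len(A) * len(B))
    return best


for n in (2, 3):
    size = largest_zero_rectangle(n)
    # Under the uniform measure on {0,1}^n x {0,1}^n, P(R) = |R| / 2^(2n).
    print(n, size, size / 2 ** (2 * n))
```

For these sizes the maximum equals 2^n exactly, attained for example by A = {0} and B = {0, 1}^n.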
Example 9.3.8 (Equality with randomization): Let $x, y \in \{0, 1\}^n$ and p be a prime number
satisfying $n^2 \le p \le 2n^2$ (the Prime Number Theorem guarantees the existence of such a p).
Let Alice choose a uniformly random number U ∈ {0, . . . , p − 1} and compute the polynomial
\[ a(U) = \sum_{i=1}^n x_i U^{i-1} \bmod p. \]
Then Alice may communicate both U and a(U) to Bob, which requires at most $2 \log_2 p \le
4 \log_2 n + 2$ bits. Then Bob checks whether
\[ b(U) = \sum_{i=1}^n y_i U^{i-1} \bmod p \]
satisfies b(U) = a(U). If so, Bob outputs “Yes” (equality), and otherwise, Bob outputs “No.”
This protocol satisfies $\mathrm{depth}(\Pi) \le 4 \log_2 n + O(1)$. Moreover, if x = y, it is always correct, while if
x ≠ y, then the protocol is incorrect only if a(U) = b(U), that is, U is a root of the polynomial
\[ p(u) = \sum_{i=1}^n (x_i - y_i) u^{i-1}. \]
But this is a non-zero polynomial of degree at most n − 1, which has at most n − 1 roots (on the field $\mathbb{F}_p$;
see Appendix A.1 for a brief review of polynomials). Thus for x ≠ y we have
\[ \mathbb{P}(\Pi(x, y) \text{ fails}) = \mathbb{P}(a(U) = b(U)) \le \frac{n-1}{p} < \frac{1}{n}, \]
and so $\mathrm{RCC}_{1/n}(\mathrm{EQ}) \le O(1) \log n$, exponentially improving over deterministic complexity.
In passing, we make two additional remarks. First, this protocol is one-way and non-interactive:
Alice can simply send O(log n) bits. Second, we can achieve any polynomially small probability
of error while still only paying logarithmically in communication, as
taking $n^k \le p \le 2n^k$ for k ≥ 2 gives error at most $(n-1)/p < n^{-(k-1)}$ and so yields $\mathrm{RCC}_{1/n^{k-1}}(\mathrm{EQ}) \le 2k \log_2 n + O(1)$. ◇
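The fingerprinting protocol is easy to simulate exactly. The sketch below assumes Alice's polynomial is a(u) = Σ x_i u^{i−1} mod p, which is what the difference polynomial p(u) in the example suggests; the helper names and the choice n = 8 are hypothetical. Rather than sampling U, it enumerates all p values of U, so the acceptance probability is exact.

```python
from itertools import product


def is_prime(m):
    """Trial-division primality test (adequate for the small p used here)."""
    if m < 2:
        return False
    k = 2
    while k * k <= m:
        if m % k == 0:
            return False
        k += 1
    return True


def prime_between(lo, hi):
    """Smallest prime in [lo, hi]."""
    for m in range(lo, hi + 1):
        if is_prime(m):
            return m
    raise ValueError("no prime in range")


def fingerprint(bits, u, p):
    """Evaluate sum_i bits[i] * u**i mod p by Horner's rule (0-based i)."""
    acc = 0
    for b in reversed(bits):
        acc = (acc * u + b) % p
    return acc


def accept_probability(x, y, p):
    """Exact probability over uniform U in {0, ..., p-1} that Bob outputs 'Yes'."""
    hits = sum(fingerprint(x, u, p) == fingerprint(y, u, p) for u in range(p))
    return hits / p


n = 8
p = prime_between(n * n, 2 * n * n)
x = (1, 0, 1, 1, 0, 0, 1, 0)
worst = max(accept_probability(x, y, p)
            for y in product([0, 1], repeat=n) if y != x)
print(p, accept_probability(x, x, p), worst, (n - 1) / p)
```

Equal inputs are always accepted, and for unequal inputs the worst-case acceptance probability stays below (n − 1)/p, matching the root-counting argument.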
Example 9.3.8 makes clear that any lower bounds on randomized communication complexity,
or, relatedly, information complexity, will necessarily be somewhat more subtle than those we have
presented for CC. We develop a few of the main ideas here. Because our focus is on information
theoretic techniques, we pass over a few of the standard tools for proving lower bounds involving
discrepancy and randomized inputs, touching on these in the bibliographic notes at the end of the
chapter. One of our main goals will be to show that the information complexity of the inner product
is indeed Ω(n), a much stronger result than Example 9.3.7. In contrast to the lower bounds we
provide for minimax risk in most of this book, the focus in communication complexity is to take
an a priori accurate estimator and demonstrate that it requires a certain amount of information to
be communicated, rather than the contrapositive result that limited information yields inaccurate
estimators. While these are clearly equivalent, it can be fruitful to use the perspective most relevant
for the problem at hand.
Two main ideas form the basis for information complexity lower bounds. The first is direct sum
inequalities, which show that computing a function on n inputs requires roughly n times more
communication than computing it (or at least, one of the constituent functions making it up)
on one. The second important insight is to provide lower bounds on the information necessary
to compute different primitives, and the particular structure of even randomized communication
protocols makes this possible. For the remainder of Section 9.3.3, we address the first of these,
returning to the information complexity of primitives in Section 9.3.4.
Example 9.3.10 (Decomposition of inner product): The inner product in $\mathbb{F}_2$, $f(x, y) = \langle x, y \rangle \bmod 2$,
has the decomposition (9.3.4) with $h(x_i, y_i) = x_i y_i$ and $g(z) = \langle \mathbf{1}, z \rangle \bmod 2$, which satisfies g(z) = 0 if $\sum_{i=1}^n z_i$ is
even and g(z) = 1 otherwise. ◇
While Example 9.3.8 makes clear that the decomposition (9.3.4) is not sufficient to guarantee a
randomized complexity lower bound of order n, it will be useful.
To develop the main information complexity direct sum theorem showing that the information
complexity of f is at least the sum of the complexities of its constituent primitives, we leverage
what we term plantable inputs:
Definition 9.5. Let $f : \mathcal{X}^n \times \mathcal{Y}^n \to \{0, 1\}$ have the decomposition (9.3.4), where the primitive h
is {0, 1}-valued. The pair $(x, y) \in \mathcal{X}^n \times \mathcal{Y}^n$ admits a planted solution if for each i ∈ {1, . . . , n}, all
$x_i', y_i'$, and the vectors
\[ x' = (x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \quad \text{and} \quad y' = (y_1, \ldots, y_{i-1}, y_i', y_{i+1}, \ldots, y_n), \]
we have $f(x', y') = h(x_i', y_i')$.
The binary inner product in Examples 9.3.7 and 9.3.10 has many plantable inputs: any of the $3^n$
pairs of vectors $x, y \in \{0, 1\}^n$ with $\langle x, y \rangle = 0$ admit planted solutions, as we have $x_i y_i = 0$ for each
i. The set-disjointness problem, Example 9.3.11, has the same plantable inputs. For the equality
function, only the $2^n$ pairs x = y admit planted solutions.
We outline the key idea to our direct sum lower bounds. Because we define information com-
plexity for protocols Π that are correct on all inputs with high probability, we can choose an
arbitrary distribution on inputs $(x_1^n, y_1^n) \in \mathcal{X}^n \times \mathcal{Y}^n$. Thus we choose a fooling distribution µ for
f, meaning that for $(X_i, Y_i) \stackrel{\text{iid}}{\sim} \mu$ the pair $(X_1^n, Y_1^n) \in \mathcal{X}^n \times \mathcal{Y}^n$ always admits a planted solution
(Definition 9.5). The next definition says this slightly differently.
Definition 9.6. A distribution µ on (x, y) ∈ X × Y is a fooling distribution if all $(x_1^n, y_1^n)$ in the
support of the product $\mu^n$ admit planted solutions (Definition 9.5).
Typically, fooling distributions µ require some dependence between Xi and Yi —for example, in the
inner product, we require Xi Yi = 0, so that if Xi = 1 then Yi = 0 and vice versa:
Example 9.3.12 (A fooling distribution for inner products and set disjointness): Define
the distribution µ on pairs (x, y) ∈ {0, 1} × {0, 1} as follows: let V be uniform on {0, 1}, and
conditional on V = 0, set X = 0 and let Y ∼ Uniform{0, 1}; conditional on V = 1, set Y = 0
and let X ∼ Uniform{0, 1}. Then certainly XY = 0, and any set of pairs $(X_i, Y_i) \stackrel{\text{iid}}{\sim} \mu$ satisfy
both that the binary inner product $\mathrm{IP}_2(X_1^n, Y_1^n) = \langle X_1^n, Y_1^n \rangle \bmod 2 = 0$ and set disjointness
$\mathrm{DISJ}(X_1^n, Y_1^n) = 1\{\langle X_1^n, Y_1^n \rangle > 0\} = 0$. ◇
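To make the planted solution property concrete, the following sketch enumerates the support of µ from Example 9.3.12 and checks Definition 9.5 by brute force for the inner product and disjointness functions with n = 3. All function names are illustrative; the check replaces coordinate i by every possible pair of bits and compares f against the primitive h(x_i, y_i) = x_i y_i.

```python
from itertools import product


def mu_pmf():
    """Joint pmf of (X, Y) under the fooling distribution of Example 9.3.12."""
    pmf = {}
    for v in (0, 1):            # V uniform on {0, 1}
        for bit in (0, 1):      # the free coordinate is uniform on {0, 1}
            pair = (0, bit) if v == 0 else (bit, 0)
            pmf[pair] = pmf.get(pair, 0.0) + 0.25
    return pmf


def ip2(x, y):
    return sum(a * b for a, b in zip(x, y)) % 2


def disj(x, y):
    return int(sum(a * b for a, b in zip(x, y)) > 0)


def admits_planted_solution(f, x, y):
    """Check f(x', y') == x_i' * y_i' for every coordinate i and every bit pair."""
    for i in range(len(x)):
        for xi, yi in product([0, 1], repeat=2):
            x2 = x[:i] + (xi,) + x[i + 1:]
            y2 = y[:i] + (yi,) + y[i + 1:]
            if f(x2, y2) != xi * yi:
                return False
    return True


def all_support_inputs_planted(f, n):
    """Every n-fold product of support points of mu must admit a planted solution."""
    support = list(mu_pmf())
    for pairs in product(support, repeat=n):
        x = tuple(pr[0] for pr in pairs)
        y = tuple(pr[1] for pr in pairs)
        if not admits_planted_solution(f, x, y):
            return False
    return True


print(mu_pmf())
print(all_support_inputs_planted(ip2, 3), all_support_inputs_planted(disj, 3))
```

The support has three points, which is where the count of 3^n plantable inputs for the inner product comes from.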
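\[ \mathrm{CIC}^\mu_\delta(h) := \inf_{\Pi} \sup_{V} \{ I(X, Y; \Pi(X, Y) \mid V) \text{ s.t. } \mathbb{P}(\Pi_{\mathrm{out}}(x, y) \neq h(x, y)) \le \delta \text{ for all } x \in \mathcal{X}, y \in \mathcal{Y} \}, \]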
where the infimum is over all (randomized) protocols and the supremum is over all random variables
making X and Y conditionally independent with joint distribution (X, Y ) ∼ µ. So if we can find a
variable V making the mutual information I(X, Y ; Π(X, Y ) | V ) large for any correct protocol Π,
the conditional information complexity of h is necessarily large.
With this, we obtain our main direct sum theorem for information complexity.
Theorem 9.3.13. Let µ be a fooling distribution on X × Y for a function f with primitive h. Then
Proof Let $V = V_1^n \in \mathcal{V}^n$ be any random vector with i.i.d. entries making $X_i$ and $Y_i$ conditionally
independent given $V_i$. Then for any protocol Π, we have
because we have the Markov chain $V \to (X_1^n, Y_1^n) \to \Pi$. Using the chain rule for mutual information,
where we recognize that $X_1^n$ and $Y_1^n$ are independent given V, we have
\begin{align*}
I(X_1^n, Y_1^n; \Pi \mid V)
& = \sum_{i=1}^n I(X_i, Y_i; \Pi \mid V, X_1^{i-1}, Y_1^{i-1}) \\
& = \sum_{i=1}^n \left[ H(X_i, Y_i \mid V, X_1^{i-1}, Y_1^{i-1}) - H(X_i, Y_i \mid V, \Pi, X_1^{i-1}, Y_1^{i-1}) \right] \\
& \ge \sum_{i=1}^n \left[ H(X_i, Y_i \mid V) - H(X_i, Y_i \mid V, \Pi) \right] = \sum_{i=1}^n I(X_i, Y_i; \Pi \mid V) \tag{9.3.5}
\end{align*}
because conditioning reduces entropy and $(X_i, Y_i)$ are independent of $X_1^{i-1}, Y_1^{i-1}$ given V.
Now we come to the key reduction from the global protocol Π to one solving individual primitives.
On inputs (x, y) ∈ X × Y, define the simulated protocol $\Pi_{i,v}(x, y)$ so that given the vector
$v_{\setminus i} \in \mathcal{V}^{n-1}$, Alice and Bob independently generate $(X_j^*, Y_j^*) \stackrel{\text{iid}}{\sim} \mu(\cdot \mid V_j = v_j)$ for j ≠ i, which
is possible because of the assumed conditional independence given V, yielding $X^*_{\setminus i} \in \mathcal{X}^{n-1}$ and
that is, the joint over the simulated protocol is equal to that over the original protocol Π conditional
on V\i = v\i . The latter claim (9.3.6) is essentially definitional; the former requires a bit more work.
To see that $\Pi_{i,v}$ is a δ-error protocol for the primitive h, note that by construction, $X^*_{\setminus i}$ and $Y^*_{\setminus i}$ are
in the support of µ, and so admit planted solutions. In particular, $f((X^*_{\setminus i}, x), (Y^*_{\setminus i}, y)) = h(x, y)$,
and so $\Pi_{i,v}$ is necessarily a δ-error protocol.
The distributional equality (9.3.6) guarantees that for any v we have
as desired.
With Theorem 9.3.13 in hand, we have our desired direct sum result, so that proving informa-
tion complexity lower bounds reduces to providing lower bounds on the (conditional) information
complexity of various 1-bit primitives. The following corollary highlights the theorem’s applications
to inner product and set disjointness (Examples 9.3.10 and 9.3.11).
Corollary 9.3.14. Let f be the binary inner product $f(x, y) = \langle x, y \rangle \bmod 2$ or the disjointness
function $f(x, y) = 1\{\langle x, y \rangle > 0\}$. Let µ be the fooling distribution in Example 9.3.12. Then
Exercise 9.10 explores similar techniques for the entrywise less-than-or-equal function, showing
similar complexity lower bounds.
Proposition 9.3.15. Let h(x, y) = xy for inputs x, y ∈ {0, 1}. Let µ be the fooling distribution in
Example 9.3.12. Then
\[ \mathrm{CIC}^\mu_\delta(h) \ge \frac{1}{4} \left( 1 - 2\sqrt{\delta(1 - \delta)} \right). \]
We prove this proposition in the remainder of this section, noting that as an immediate corollary,
we obtain the following lower bounds on the communication complexity of set disjointness and
binary inner product.
Corollary 9.3.16. Let f be the binary inner product $f(x, y) = \langle x, y \rangle \bmod 2$ or the disjointness
function $f(x, y) = 1\{\langle x, y \rangle > 0\}$. Then
\[ \mathrm{IC}_\delta(f) \ge \frac{n}{4} \left( 1 - 2\sqrt{\delta(1 - \delta)} \right). \]
To control the complexity of computing individual primitives, it proves easier to use metrics
tied more directly to testing. To that end, we recall the connection between Hellinger distance
and the mutual information, or Jensen-Shannon divergence, between a variable X and a single bit
B ∈ {0, 1} in Proposition 2.2.10, which gives that if B → Z, where Z ∼ Pb conditional on B = b,
then
\[ I_2(Z; B) \ge d_{\mathrm{hel}}^2(P_0, P_1). \]
To apply this inequality, recall the fooling distribution µ for inner products in Example 9.3.12,
where V ∼ Uniform{0, 1} and conditional on V = 0 we set X = 0 and draw Y ∼ Uniform{0, 1}, and
otherwise Y = 0 and X ∼ Uniform{0, 1}. Then for V → (X, Y ) from this distribution, we have
\[ I_2(X, Y; \Pi(X, Y) \mid V) = \frac{1}{2} I_2(Y; \Pi(0, Y) \mid V = 0) + \frac{1}{2} I_2(X; \Pi(X, 0) \mid V = 1). \]
Letting Qxy denote the (conditional) distribution over Π on input bits x, y ∈ {0, 1} and noting that
X and Y above are each uniform on {0, 1}, we see that Proposition 2.2.10 applies and so
\[ I_2(X, Y; \Pi(X, Y) \mid V) \ge \frac{1}{2} d_{\mathrm{hel}}^2(Q_{01}, Q_{00}) + \frac{1}{2} d_{\mathrm{hel}}^2(Q_{10}, Q_{00}). \]
Applying the triangle inequality that $(a - b)^2 \le (|a - c| + |c - b|)^2 \le 2(a - c)^2 + 2(b - c)^2$, we obtain
the following lemma.
Lemma 9.3.17. Let Π be any protocol acting on two bit inputs x, y ∈ {0, 1}, and let µ be the
fooling distribution in Example 9.3.12. Let $Q_{xy}$ be the distribution of Π(x, y) on inputs x, y. Then
\[ I_2(X, Y; \Pi(X, Y) \mid V) \ge \frac{1}{4} d_{\mathrm{hel}}^2(Q_{01}, Q_{10}). \]
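For readers who want the intermediate step spelled out, here is one way to see how the triangle inequality gives the factor 1/4. It assumes the convention $d_{\mathrm{hel}}^2(P, Q) = \frac{1}{2} \sum_\tau (\sqrt{P(\tau)} - \sqrt{Q(\tau)})^2$ for discrete distributions, which is the form used in the proof of Lemma 9.3.19.

```latex
% Apply (a - b)^2 <= 2(a - c)^2 + 2(b - c)^2 for each transcript \tau with
% a = \sqrt{Q_{01}(\tau)}, b = \sqrt{Q_{10}(\tau)}, c = \sqrt{Q_{00}(\tau)}.
\begin{align*}
d_{\mathrm{hel}}^2(Q_{01}, Q_{10})
  &= \frac{1}{2} \sum_\tau \Big(\sqrt{Q_{01}(\tau)} - \sqrt{Q_{10}(\tau)}\Big)^2 \\
  &\le \sum_\tau \Big(\sqrt{Q_{01}(\tau)} - \sqrt{Q_{00}(\tau)}\Big)^2
     + \sum_\tau \Big(\sqrt{Q_{10}(\tau)} - \sqrt{Q_{00}(\tau)}\Big)^2 \\
  &= 2 d_{\mathrm{hel}}^2(Q_{01}, Q_{00}) + 2 d_{\mathrm{hel}}^2(Q_{10}, Q_{00}).
\end{align*}
% Dividing by 4 gives
\[
\frac{1}{2} d_{\mathrm{hel}}^2(Q_{01}, Q_{00}) + \frac{1}{2} d_{\mathrm{hel}}^2(Q_{10}, Q_{00})
  \ge \frac{1}{4} d_{\mathrm{hel}}^2(Q_{01}, Q_{10}).
\]
```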
The last step in the proof of Proposition 9.3.15 is to demonstrate a property of (randomized)
protocols Π analogous to the rectangular property of deterministic communication that
Propositions 9.3.2 and 9.3.5 demonstrate. In analogy with the output leaf in the tree for deterministic
communication complexity, let τ be the transcript of the communication protocol, that is, its
entire communication trace. Then we claim the following analog of Proposition 9.3.2, which states that the set of
inputs resulting in a particular output in deterministic complexity is a rectangle in X × Y.
Lemma 9.3.18. Let Π be any randomized protocol with inputs in X ×Y. Then there exist functions
qx and qy such that for any transcript τ ,
P(Π(x, y) = τ ) = qx (τ ) · qy (τ ).
We thus have the following key cut and paste property, which shows that in some sense, Hellinger
distances respect the “rectangular” structure of communication protocols.
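Lemma 9.3.19. Let Π be any protocol acting on inputs in X × Y and let $Q_{x,y}$ be the distribution
of Π(x, y) on inputs x, y. Then for any x, x' ∈ X and y, y' ∈ Y,
\[ d_{\mathrm{hel}}^2(Q_{x,y}, Q_{x',y'}) = d_{\mathrm{hel}}^2(Q_{x,y'}, Q_{x',y}). \]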
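Proof Let $\mathcal{T}$ be the collection of all possible transcripts the protocol outputs. By Lemma 9.3.18
we have
\begin{align*}
d_{\mathrm{hel}}^2(Q_{x,y}, Q_{x',y'})
& = \frac{1}{2} \sum_{\tau \in \mathcal{T}} \left( \sqrt{Q_{x,y}(\tau)} - \sqrt{Q_{x',y'}(\tau)} \right)^2 \\
& = \frac{1}{2} \sum_{\tau \in \mathcal{T}} \left( \sqrt{q_x(\tau) q_y(\tau)} - \sqrt{q_{x'}(\tau) q_{y'}(\tau)} \right)^2
= 1 - \sum_{\tau} \sqrt{q_x(\tau) q_y(\tau) q_{x'}(\tau) q_{y'}(\tau)}.
\end{align*}
Rearranging by the trivial modification $q_x q_y q_{x'} q_{y'} = q_x q_{y'} q_{x'} q_y$, we have the result.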
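The cut and paste identity can be checked numerically on a toy private-coin protocol. In the sketch below, which is illustrative only and uses arbitrarily chosen probabilities, Alice sends a random bit depending on x, and Bob replies with a random bit depending on y and on Alice's bit. The transcript probability then factors as a term depending on (x, τ) times a term depending on (y, τ), as in Lemma 9.3.18.

```python
import math

# Hypothetical message probabilities for a two-message protocol on bits.
ALICE_P1 = {0: 0.2, 1: 0.7}  # P(Alice sends 1 | x)
BOB_P1 = {(0, 0): 0.1, (0, 1): 0.6,  # P(Bob sends 1 | y, Alice's bit a), keyed by (y, a)
          (1, 0): 0.5, (1, 1): 0.9}


def transcript_dist(x, y):
    """Distribution Q_{x,y} over transcripts tau = (a, b)."""
    dist = {}
    for a in (0, 1):
        pa = ALICE_P1[x] if a == 1 else 1 - ALICE_P1[x]
        for b in (0, 1):
            pb = BOB_P1[(y, a)] if b == 1 else 1 - BOB_P1[(y, a)]
            dist[(a, b)] = pa * pb  # q_x(tau) * q_y(tau)
    return dist


def hellinger_sq(P, Q):
    """Squared Hellinger distance (1/2) * sum (sqrt(P) - sqrt(Q))^2."""
    return 0.5 * sum((math.sqrt(P[t]) - math.sqrt(Q[t])) ** 2 for t in P)


Q = {(x, y): transcript_dist(x, y) for x in (0, 1) for y in (0, 1)}
lhs = hellinger_sq(Q[(0, 1)], Q[(1, 0)])
rhs = hellinger_sq(Q[(0, 0)], Q[(1, 1)])
print(lhs, rhs)
```

The two printed distances agree up to floating point error, which is the special case used in the proof of Proposition 9.3.15.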
We now finalize the proof of Proposition 9.3.15. Substituting this cutting and pasting in
Lemma 9.3.17 we have
\[ I_2(X, Y; \Pi(X, Y) \mid V) \ge \frac{1}{4} d_{\mathrm{hel}}^2(Q_{01}, Q_{10}) = \frac{1}{4} d_{\mathrm{hel}}^2(Q_{00}, Q_{11}). \]
Then a simple lemma recalling the testing inequalities in Chapter 2.3.1 completes the proof of the
proposition, because it guarantees that $4 I_2(X, Y; \Pi(X, Y) \mid V) \ge 1 - 2\sqrt{\delta(1 - \delta)}$ no matter the
choice of protocol Π, and so
\[ \mathrm{CIC}^\mu_\delta(h) \ge \inf_{\Pi} I_2(X, Y; \Pi(X, Y) \mid V) \ge \frac{1}{4} \left( 1 - 2\sqrt{\delta(1 - \delta)} \right). \]
Lemma 9.3.20. Let Π be any δ-accurate protocol for computing h(x, y) = xy and $Q_{xy}$ be its
distribution on inputs (x, y). Then $d_{\mathrm{hel}}^2(Q_{00}, Q_{11}) \ge 1 - 2\sqrt{\delta(1 - \delta)}$.
Proof Assume that Π computes the product xy ∈ {0, 1} correctly with probability at least
1 − δ, that is, $\mathbb{P}(\Pi_{\mathrm{out}}(x, y) \neq xy) \le \delta$ for all x, y ∈ {0, 1}. By Le Cam's testing lower bounds
(Proposition 2.3.1), we know that
where inequality (⋆) follows from the inequalities in Proposition 2.2.7 relating Hellinger and
total-variation distance. Let $d = d_{\mathrm{hel}}^2(Q_{00}, Q_{11})$ for shorthand. Then rearranging gives $d(2 - d) \ge
(1 - 2\delta)^2$. Solving for d in $0 \ge d^2 - 2d + (1 - 2\delta)^2$ yields $d \ge 1 - \sqrt{1 - (1 - 2\delta)^2}$. Recognize that
$1 - (1 - 2\delta)^2 = 4(\delta - \delta^2)$.
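The final algebra is quick to verify numerically. This sketch (function names are illustrative) computes the smaller root of d² − 2d + (1 − 2δ)² = 0 and compares it with the closed form 1 − 2√(δ(1 − δ)) from the lemma on a grid of δ values in [0, 1/2].

```python
import math


def smaller_root(delta):
    """Smaller root of d^2 - 2d + (1 - 2*delta)^2 = 0."""
    disc = max(0.0, 1.0 - (1.0 - 2.0 * delta) ** 2)  # guard against rounding
    return 1.0 - math.sqrt(disc)


def hellinger_lower_bound(delta):
    """Closed form 1 - 2*sqrt(delta*(1 - delta)) from Lemma 9.3.20."""
    return 1.0 - 2.0 * math.sqrt(delta * (1.0 - delta))


deltas = [k / 200 for k in range(101)]  # grid on [0, 1/2]
max_gap = max(abs(smaller_root(d) - hellinger_lower_bound(d)) for d in deltas)
print(max_gap)

# The resulting bound on the conditional information complexity of the
# AND primitive at error 1/3 (Proposition 9.3.15):
print(hellinger_lower_bound(1 / 3) / 4)
```

The bound is positive for every δ < 1/2 and vanishes at δ = 1/2, where a protocol that ignores its inputs already succeeds half the time.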
Here we have used the notation $Z_{<i} := (Z_1, \ldots, Z_{i-1})$, and we will use $Z_{\le i} := (Z_1, \ldots, Z_i)$ and
similarly for superscripts throughout. We will also use the notation $Z^{(t)}_{\to i} = (B^{(1)}, Z^{(t)}_{<i})$ to denote
all the messages coming into communication of $Z^{(t)}_i$. Figure 9.3 illustrates two rounds of this
communication scheme.
We can provide lower bounds on the minimax risk of communication-constrained estimators by
extending the data processing inequality approach we have developed. Our approach to the lower
bounds, which we provide in Sections 9.4.1 and 9.4.2 to follow, is roughly as follows. First, we
develop another direct sum bound, in analogy with Theorem 9.3.13, meaning that the difficulty of
solving a d-dimensional problem is roughly d-times that of solving a 1-dimensional version of the
problem; thus, any lower bounds on the error in 1-dimensional problems imply lower bounds for
d-dimensional problems. Second, we provide an extension of the data processing inequalities we
have developed thus far to apply to particular communication scenarios.
[Figure 9.3: diagram of machines holding $X_1, X_2, X_3, \ldots, X_m$ writing to public blackboards.]
Figure 9.3. Left: single round of communication of variables, writing to public blackboard $B^{(1)}$.
Right: two rounds of communication of variables, writing to public blackboards $B^{(1)}$ and $B^{(2)}$.
The key to our reductions is that we consider families of distributions where the coordinates of
X are independent, which dovetails with Assouad’s method. We thus index our distributions by
$v \in \{-1, 1\}^d$, and in proving our lower bounds, we assume the typical Markov structure
\[ V \to (X_1, \ldots, X_m) \to \Pi(X_1^m), \]
where V is chosen uniformly at random from $\{-1, 1\}^d$, and $\Pi = \Pi(X_1^m)$ denotes the protocol of
the entire communication. In this context, this is the entire set of blackboard messages
\[ \Pi = (B^{(1)}, \ldots, B^{(T)}) \]
(which also encodes the message order). We assume that X follows a d-dimensional product
distribution, so that conditional on V = v we have
\[ X \stackrel{\text{iid}}{\sim} P_v = P_{v_1} \otimes P_{v_2} \otimes \cdots \otimes P_{v_d}. \tag{9.4.1} \]
The generation strategy (9.4.1) guarantees that conditional on the jth coordinate $V_j = v_j$, the
coordinates $X_{i,j}$ are i.i.d. and independent of $V_{\setminus j} = (V_1, \ldots, V_{j-1}, V_{j+1}, \ldots, V_d)$ as well as independent
of $X_{i',j}$ for data points i' ≠ i.
In particular, viewing $X_{\le m, \setminus j}$ as extraneous randomness, we have the simpler Markovian structure
\[ V_j \to X_{\le m, j} \to \Pi, \tag{9.4.2} \]
so that we may think of the communication $\Pi = \Pi(X_{\le m, j})$ as acting only on $X_{\le m, j}$. Now, define
$M_{-j}$ and $M_{+j}$ to be the marginal distributions over the total communication protocol Π conditional
on $V_j = \pm 1$ in the one-variable model (9.4.2). Then Le Cam's testing equality (Proposition 2.3.1)
and the equivalence between Hellinger and variation distance (Proposition 2.2.7) imply that
\begin{align*}
\inf_{\hat{V}} 2 \sum_{j=1}^d \mathbb{P}(\hat{V}_j(\Pi) \neq V_j)
& \ge \sum_{j=1}^d \left( 1 - \| M_{-j} - M_{+j} \|_{\mathrm{TV}} \right)
\ge \sum_{j=1}^d \left( 1 - \sqrt{2} \, d_{\mathrm{hel}}(M_{-j}, M_{+j}) \right) \\
& \ge d \left( 1 - \sqrt{\frac{2}{d} \sum_{j=1}^d d_{\mathrm{hel}}^2(M_{-j}, M_{+j})} \right).
\end{align*}
Recalling Assouad’s method (Lemma 8.5.2) of Chapter 8.5, we see that any time we have a problem
with separation with respect to the Hamming metric (8.5.1), we have a lower bound on its error in
estimation problems. This proposition analogizes Theorem 9.3.13, in that small Hellinger distance
between the individual marginals M±j necessarily makes the testing and estimation problems hard.
We leave the proof of this proposition as Exercise 9.12, as it follows by adapting the techniques
we use to prove Theorem 9.2.1, with the main difference being the random variables with bounded
likelihood ratios (X → Z versus V → X). A brief example illustrates Proposition 9.4.2.
Example 9.4.3 (Bernoulli distributions): Let $P_v = \mathrm{Bernoulli}(\frac{1 + v\delta}{2})$ for v ∈ {−1, 1}. Then
we have likelihood ratio bound
\[ \log \frac{dP_1}{dP_{-1}} \le \log \frac{1 + \delta}{1 - \delta} \]
and so under the conditions of Proposition 9.4.2, for any Z we have
\[ I(V; Z) \le 2 \left( \frac{1 + \delta}{1 - \delta} - 1 \right)^2 I(X; Z) = 2 \left( \frac{2\delta}{1 - \delta} \right)^2 I(X; Z) \stackrel{(i)}{\le} 10 \delta^2 I(X; Z), \]
where inequality (i) holds for δ ∈ [0, 1/10]. ◇
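The constant in inequality (i) can be confirmed with a few lines of arithmetic. The sketch below (with an illustrative function name) evaluates the contraction factor 2(c − 1)² for c = (1 + δ)/(1 − δ), which simplifies to 8δ²/(1 − δ)², and compares it with 10δ² over a grid of δ in [0, 1/10].

```python
def contraction_factor(delta):
    """The factor 2 * (c - 1)^2 with likelihood ratio bound c = (1 + delta) / (1 - delta)."""
    c = (1.0 + delta) / (1.0 - delta)
    return 2.0 * (c - 1.0) ** 2


grid = [k / 1000 for k in range(101)]  # delta in [0, 0.1]
holds_on_grid = all(contraction_factor(d) <= 10.0 * d ** 2 + 1e-15 for d in grid)
print(holds_on_grid)

# At the endpoint delta = 1/10 the ratio to delta^2 is 8 / 0.81, just under 10.
print(contraction_factor(0.1) / 0.1 ** 2)

# The restriction on delta matters: at delta = 0.2 the factor exceeds 10 * delta^2.
print(contraction_factor(0.2), 10.0 * 0.2 ** 2)
```

The bound 10δ² is therefore nearly tight at δ = 1/10 and fails for larger δ, which explains the stated range.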
We now give the two main results connecting mutual information and the contraction-type
bounds in Definition 9.7. To provide bounds using Proposition 9.4.1, we wish to control the
Hellinger distance between individual marginals M±j , so we consider single variables in the Markov
chain
V → (X1 , . . . , Xm ) → Π,
where V ∈ {0, 1}. To state the coming theorems, we make a restriction on the data generation
V → X, calling distributions P0 and P1 (c, β)-contractive if
β(P0 , P1 ) ≤ β ≤ 1 and max {D∞ (P0 ||P1 ) , D∞ (P1 ||P0 )} ≤ log c, (9.4.3)
where D∞ (·||·) denotes the Rényi-∞-divergence. Proposition 9.4.2 shows that whenever such a c
exists we certainly have $\beta(P_0, P_1) \le 2(c - 1)^2$.
The next theorem then provides the basic information contraction inequality for single-variable
communication.
Theorem 9.4.4. Let 1 ≤ c < ∞ and β ≤ 1. Let $P_0$ and $P_1$ be (c, β)-contractive (9.4.3) distributions
on X and $M_v$, v ∈ {0, 1}, be the marginal distribution of the protocol Π conditional on V = v. Then
\[ d_{\mathrm{hel}}^2(M_0, M_1) \le \frac{7}{2} (c + 1) \beta \cdot \min \left\{ I(X_1^m; \Pi(X_1^m) \mid V = 0), I(X_1^m; \Pi(X_1^m) \mid V = 1) \right\}. \]
The proof of Theorem 9.4.4 is quite complicated, so we defer it to Section 9.5.
We can use Theorem 9.4.4 to obtain bounds on the probability of error—detection of d-
dimensional signals—in higher dimensional problems based on mutual information alone. Because
the theorem provides a bound involving the minimum of the conditional mutual informations, we
have substantial freedom to combine the direct-sum lower bounds in Section 9.4.1 to massage it
into the mutual information between the data $X_1^m$ and the protocol $\Pi(X_1^m)$.
We thus recall the definition (9.4.1) of our product distribution signals, where we assume that
each individual datum $X_i = (X_{i,1}, \ldots, X_{i,d}) = (X_{i,j})_{j=1}^d$ belongs to a d-dimensional set and, conditional
on $V = v \in \{-1, 1\}^d$, has independent coordinates distributed as $X_{i,j} \sim P_{v_j}$. With this, we
have the following theorem, which follows by a combination of Assouad’s method (in the context
of communication bounds, i.e. Proposition 9.4.1) and Theorem 9.4.4.
Theorem 9.4.5. Let Π be the entire communication protocol in Figure 9.3, $V \in \{-1, 1\}^d$ be uniform,
and generate $X_i \stackrel{\text{iid}}{\sim} P_v$, i = 1, . . . , m, according to the independent coordinate distribution (9.4.1).
Assume additionally that for each coordinate j = 1, . . . , d, the coordinate distributions $P_{\pm v_j}$ are
(c, β)-contractive (9.4.3). Then for any estimator $\hat{V}$,
\[ \sum_{j=1}^d \mathbb{P}(\hat{V}_j(\Pi) \neq V_j) \ge \frac{d}{2} \left( 1 - \sqrt{7 (c + 1) \frac{\beta}{d} \cdot I(X_1, \ldots, X_m; \Pi \mid V)} \right). \]
Proof Under the given conditions, Proposition 9.4.1 and Theorem 9.4.4 immediately combine to
give
\[ \sum_{j=1}^d \mathbb{P}(\hat{V}_j(\Pi) \neq V_j) \ge \frac{d}{2} \left( 1 - \sqrt{7 (c + 1) \frac{\beta}{d} \sum_{j=1}^d \min_{v \in \{-1, 1\}} I(X_{1,j}, \ldots, X_{m,j}; \Pi \mid V_j = v)} \right). \]
Certainly
\[ \min_{v \in \{-1, 1\}} I(X_{1,j}, \ldots, X_{m,j}; \Pi \mid V_j = v) \le I(X_{1,j}, \ldots, X_{m,j}; \Pi \mid V_j). \]
Then, using that w.l.o.g. we may assume the $X_{i,j}$ are discrete, we obtain
\begin{align*}
\sum_{j=1}^d I((X_{i,j})_{i=1}^m; \Pi \mid V_j)
& = \sum_{j=1}^d \left[ H((X_{i,j})_{i=1}^m \mid V_j) - H((X_{i,j})_{i=1}^m \mid \Pi, V_j) \right] \\
& \stackrel{(i)}{=} \sum_{j=1}^d \left[ H((X_{i,j})_{i=1}^m \mid (X_{i,j'})_{i \le m, j' < j}, V) - H((X_{i,j})_{i=1}^m \mid \Pi, V_j) \right] \\
& \le \sum_{j=1}^d \left[ H((X_{i,j})_{i=1}^m \mid (X_{i,j'})_{i \le m, j' < j}, V) - H((X_{i,j})_{i=1}^m \mid (X_{i,j'})_{i \le m, j' < j}, \Pi, V) \right] \\
& = \sum_{j=1}^d I((X_{i,j})_{i=1}^m; \Pi \mid V, (X_{i,j'})_{i \le m, j' < j}) = I(X_1, \ldots, X_m; \Pi \mid V),
\end{align*}
where equality (i) used the independence of $X_{i,j}$ from $V_{\setminus j}$ and $X_{i,j'}$ for j' ≠ j given $V_j$, and the
inequality that conditioning reduces entropy. This gives the theorem.
Thus we obtain
\[ I(X_1, \ldots, X_m; \Pi \mid V) \le I(X_1, \ldots, X_m; \Pi) = \sum_{i=1}^m \sum_{t=1}^T I(X_1, \ldots, X_m; Z_i^{(t)} \mid Z_{\to i}^{(t)}). \]
Choosing $\delta = \min\{1/10, \frac{d}{2C \sum_i B_i}\}$ gives the result.
This result deserves some discussion. It is sharp in the case that the number of bits is of order
d or less from each machine: when we set $B_i = d$, the lower bound becomes
\[ \sup_\theta \mathbb{E}_\theta[\| \hat{\theta}(\Pi) - \theta \|_2^2] \gtrsim \min\left\{ \frac{d}{m} \cdot \frac{d}{d}, d \right\} = \frac{d}{m}, \]
which is certainly achievable (each machine simply sends its entire vector $X_i \in \{0, 1\}^d$). When
machines communicate fewer than d bits, we have a tighter result; for example, if only a fraction k/m of the machines
send d bits, and the rest communicate little, we obtain
\[ \sup_\theta \mathbb{E}_\theta[\| \hat{\theta}(\Pi) - \theta \|_2^2] \gtrsim \min\left\{ \frac{d}{m} \cdot \frac{md}{kd}, d \right\} = \frac{d}{k}, \]
which is similarly intuitive. The extension of these ideas to the case when each machine has an
individual sample of size n is more challenging, as it requires tensorized variants of the strong data
processing inequality in Definition 9.7; we provide remarks in the bibliographical section.
The following observation shows that for appropriate choices of εkl , this is indeed weaker than
the interactive guarantee (9.4.5).
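Lemma 9.4.7. Let the communication Q satisfy the interactive privacy guarantee (9.4.5) and Π
be the induced communication protocol over rounds t ≤ T. Then
\[ \frac{1}{n} \sum_{i=1}^n D_{\mathrm{kl}}\left( \Pi(x_{\le n}) \,\|\, \Pi(x_{\le n}^{(i)}) \right) \le \frac{1}{n} \sum_{i=1}^n \sum_{t=1}^T \min\left\{ \varepsilon_{i,t}, \frac{3}{2} \varepsilon_{i,t}^2 \right\}. \]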
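Proof Using the chain rule for the KL-divergence, we have for any j that
\begin{align*}
D_{\mathrm{kl}}\left( \Pi(x_{\le n}) \,\|\, \Pi(x_{\le n}^{(j)}) \right)
& = \sum_{i=1}^n \sum_{t=1}^T \mathbb{E}\left[ D_{\mathrm{kl}}\left( Q(Z_i^{(t)} \in \cdot \mid x_i, Z_{\to i}^{(t)}) \,\|\, Q(Z_i^{(t)} \in \cdot \mid x_i^{(j)}, Z_{\to i}^{(t)}) \right) \right] \\
& = \sum_{t=1}^T \mathbb{E}\left[ D_{\mathrm{kl}}\left( Q(Z_j^{(t)} \in \cdot \mid x_j, Z_{\to j}^{(t)}) \,\|\, Q(Z_j^{(t)} \in \cdot \mid x_j^{(j)}, Z_{\to j}^{(t)}) \right) \right],
\end{align*}
where the expectation is taken over $Z_{\to i}^{(t)}$ in the protocol $\Pi(x_{\le n})$, and the second equality follows
because $x_i^{(j)} = x_i$ for all i except index j. Now let $P_0$ and $P_1$ be arbitrary distributions whose
densities satisfy $p_0(z)/p_1(z) \le e^{\varepsilon}$. Then
\[ D_{\mathrm{kl}}(P_0 \| P_1) \le \varepsilon \quad \text{and} \quad D_{\mathrm{kl}}(P_0 \| P_1) \le \log\left( 1 + D_{\chi^2}(P_0 \| P_1) \right) \le \log\left( 1 + (e^{\varepsilon} - 1)^2 \right) \]
by Proposition 2.2.9. Then by inspection $\min\{\varepsilon, \log(1 + (e^{\varepsilon} - 1)^2)\} \le \min\{\varepsilon, \frac{3}{2} \varepsilon^2\}$ for all ε ≥ 0.
Returning to the initial KL-divergence sum, we thus obtain
\[ \sum_{i=1}^n D_{\mathrm{kl}}\left( \Pi(x_{\le n}) \,\|\, \Pi(x_{\le n}^{(i)}) \right) \le \sum_{i=1}^n \sum_{t=1}^T \mathbb{E}\left[ \min\left\{ \varepsilon_{i,t}, \frac{3}{2} \varepsilon_{i,t}^2 \right\} \right], \]
as desired.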
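The "by inspection" step can be checked numerically. The following sketch (function names are illustrative) compares min{ε, log(1 + (e^ε − 1)²)} with min{ε, (3/2)ε²} on a fine grid of ε in (0, 10]; it is a numerical check on that grid, not a proof.

```python
import math


def kl_upper_bound(eps):
    """min{eps, log(1 + (e^eps - 1)^2)}, the two KL bounds from the proof."""
    return min(eps, math.log1p(math.expm1(eps) ** 2))


def simplified_bound(eps):
    """min{eps, (3/2) * eps^2}, the simplified bound in Lemma 9.4.7."""
    return min(eps, 1.5 * eps ** 2)


grid = [k / 1000 for k in range(1, 10001)]  # eps in (0, 10]
violations = [e for e in grid if kl_upper_bound(e) > simplified_bound(e) + 1e-12]
print(len(violations))

# The quadratic branch is the active one for small eps, the linear one for eps >= 2/3.
print(simplified_bound(0.1), simplified_bound(1.0))
```

On this grid there are no violations; the two sides are closest for ε a little below 2/3, where the quadratic and linear branches of the simplified bound cross.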
The key is that the average KL-local privacy guarantee is sufficient to provide a mutual infor-
mation bound, thus allowing us to apply Theorem 9.4.5 as in the proof of Proposition 9.4.6.
Proposition 9.4.8. Let Π be any εkl -KL-locally-private on average protocol and assume that
$X_1, \ldots, X_n$ are independent conditional on V. Then
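We abuse notation to let $\Pi^*(X_{\setminus i})$ be the marginal protocol (marginalizing over $X_i$). Then
\[ I(X_i; \Pi(X_1^n) \mid V, X_{\setminus i}) = \mathbb{E}\left[ D_{\mathrm{kl}}\left( \Pi(X_{\setminus i}, X_i) \,\|\, \Pi^*(X_{\setminus i}) \right) \right] \le \mathbb{E}\left[ D_{\mathrm{kl}}\left( \Pi(X_{\setminus i}, X_i) \,\|\, \Pi(X_{\setminus i}, X_i') \right) \right], \]
where the first expectation is taken over V and $X_j \stackrel{\text{iid}}{\sim} P_v$ conditional on V = v and the inequality
uses convexity and draws $X_i'$ independently. Summing over i = 1, . . . , n, Definition 9.8 gives the
result.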
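Corollary 9.4.9. Let the conditions of Theorem 9.4.5 hold. If the data release Π is $\varepsilon_{\mathrm{kl}}$-private on
average, then
\[ \sum_{j=1}^d \mathbb{P}(\hat{V}_j(\Pi) \neq V_j) \ge \frac{d}{2} \left( 1 - \sqrt{7 (c + 1) \frac{\beta}{d} n \varepsilon_{\mathrm{kl}}} \right). \]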
Specializing to the case that we wish to estimate a d-dimensional Bernoulli vector, where X ∈ {±1}
has coordinates with P(Xj = 1) = θj , Example 9.4.3 gives the following minimax lower bound.
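Corollary 9.4.10. Let $\mathfrak{M}_n(\theta(\mathcal{P}_d), \|\cdot\|_2^2, \varepsilon_{\mathrm{kl}})$ denote the minimax mean-square error for estimation
of a d-dimensional Bernoulli under the $\varepsilon_{\mathrm{kl}}$-KL-locally-private-on-average constraint in
Definition 9.8. Then
\[ \mathfrak{M}_n(\theta(\mathcal{P}_d), \|\cdot\|_2^2, \varepsilon_{\mathrm{kl}}) \ge c \min\left\{ d, \frac{d^2}{n \varepsilon_{\mathrm{kl}}} \right\}. \]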
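Proof By Corollary 9.4.9 and Example 9.4.3, we have minimax lower bound
\[ \mathfrak{M}_n(\theta(\mathcal{P}_d), \|\cdot\|_2^2, \varepsilon_{\mathrm{kl}}) \gtrsim d \delta^2 \left( 1 - \sqrt{C \frac{\delta^2}{d} n \varepsilon_{\mathrm{kl}}} \right) \]
for a numerical constant C, which is valid for δ ≲ 1. Choose $\delta^2$ to scale as $\min\{1, \frac{d}{n \varepsilon_{\mathrm{kl}}}\}$.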
When instead of the average KL-privacy we use the pure local differential privacy constraint (9.4.5),
Lemma 9.4.7 implies the following.
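Corollary 9.4.11. Let $\mathfrak{M}_n(\theta(\mathcal{P}_d), \|\cdot\|_2^2, \varepsilon)$ denote the minimax mean-square error for estimation
of a d-dimensional Bernoulli where each data release is $\varepsilon_{i,t}$-locally differentially private (9.4.5), and
$\sum_{t=1}^\infty \varepsilon_{i,t} \le \varepsilon$. Then
\[
  \mathfrak{M}_n(\theta(\mathcal{P}_d), \|\cdot\|_2^2, \varepsilon) \ge c \min\left\{d, \frac{d^2}{n(\varepsilon \wedge \varepsilon^2)}\right\}.
\]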
2. Provide a data processing inequality to relate dhel (M0 , Mi0 ) and the mutual information I(Xi ; Π)
between the individual observation Xi and the protocol Π.
3. Use the standard chain rules for mutual information to finalize the theorem.
Xi | b ∼ Pbi . (9.5.1)
For the standard basis vectors $e_1, \ldots, e_m$, we expect $M_0$ to be close to $M_{e_l}$, and thus hope for some
type of tensorization behavior, where we can relate $M_0$ and $M_1$ via one-step changes from $M_0$ to
$M_{e_l}$. The next lemma realizes this promise.
Lemma 9.5.1. Let $M_0$, $M_1$, and $M_{e_l}$ be as above. Then
\[
  d_{\mathrm{hel}}^2(M_0, M_1) \le 7 \sum_{l=1}^m d_{\mathrm{hel}}^2(M_0, M_{e_l}). \tag{9.5.2}
\]
Proof The proof crucially relies on the Euclidean structures that the Hellinger distance induces
along with analogues of the cut-and-paste (the “rectangular” structure of inputs in communication
protocols) properties from deterministic and randomized two-player communication. We assume
without loss of generality that Π is discrete, as the Hellinger distance is an f -divergence and so can
be arbitrarily approximated by discrete random variables.
First, we analogize the “rectangular” probabilistic structure of two-player communication pro-
tocols in Lemmas 9.3.18 and 9.3.19, which yields a multi-player cut-and-paste lemma.
Lemma 9.5.2 (cutting and pasting). Let $a, b, c, d \in \{0, 1\}^m$ be bit vectors satisfying $a_i + b_i = c_i + d_i$
for each $i = 1, \ldots, m$. Then
\[
  d_{\mathrm{hel}}^2(M_a, M_b) = d_{\mathrm{hel}}^2(M_c, M_d).
\]
Proof We claim the following analogue of Lemma 9.3.18: for any $X_1^m = x_1^m$ and any communication
transcript $\tau$, we may write
\[
  Q(\Pi(x_1^m) = \tau \mid x_1^m) = \prod_{i=1}^m f_{i,x_i}(\tau) \tag{9.5.3}
\]
for some functions $f_{i,x_i}$. Indeed, letting $\tau = \{z_i^{(t)}\}_{i \le m, t \le T}$ we have
\[
  Q(\Pi(x_1^m) = \tau \mid x_1^m)
  = \prod_{i,t} Q(z_i^{(t)} \mid x_1^m, z_{\to i}^{(t)})
  = \prod_{i=1}^m \underbrace{\prod_{t=1}^T Q(z_i^{(t)} \mid x_i, z_{\to i}^{(t)})}_{=: f_{i,x_i}(\tau)}
\]
where we use that message $z_i^{(t)}$ depends only on $x_i$ and $z_{\to i}^{(t)}$. Then we can write $M_b(\Pi(X_1^m) = \tau)$
as a product using Eq. (9.5.3): integrating over independent $X_i \sim P_{b_i}$, we have
\[
  M_b(\Pi(X_1^m) = \tau)
  = \int Q(\tau \mid x_1^m)\, dP_{b_1}(x_1) \cdots dP_{b_m}(x_m)
  = \prod_{i=1}^m \underbrace{\int f_{i,x_i}(\tau)\, dP_{b_i}(x_i)}_{=: g_{i,b_i}(\tau)}
  = \prod_{i=1}^m g_{i,b_i}(\tau).
\]
But as $a_i + b_i = c_i + d_i$ and each is $\{0, 1\}$-valued, we certainly have $g_{i,a_i} g_{i,b_i} = g_{i,c_i} g_{i,d_i}$, and so the
lemma follows.
The second result we require is due to Jayram [115], and is the following:
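Lemma 9.5.3. Let $\{P_b\}_{b \in \{0,1\}^m}$ be any collection of distributions satisfying the cutting and pasting
property $d_{\mathrm{hel}}^2(P_a, P_b) = d_{\mathrm{hel}}^2(P_c, P_d)$ whenever $a, b, c, d \in \{0, 1\}^m$ satisfy $a + b = c + d$. Let $N = 2^k$
for some $k \in \mathbb{N}$. Then for any collection of bit vectors $\{b^{(i)}\}_{i=1}^N \subset \{0, 1\}^m$ with $\langle b^{(i)}, b^{(j)} \rangle = 0$ for
$i \neq j$ and $\sum_{i=1}^N b^{(i)} = b$,
\[
  \prod_{l=1}^k (1 - 2^{-l})\, d_{\mathrm{hel}}^2(P_0, P_b) \le \sum_{i=1}^N d_{\mathrm{hel}}^2(P_0, P_{b^{(i)}}).
\]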
where the second inequality again follows from Lemma 9.5.3 as $b^{(i)} = e_j$ or $e_j + e_{j'}$ for some basis
vectors $e_j, e_{j'}$. This gives Lemma 9.5.1.
so that additionally W → Xl0 → Π(X 0 ) is a Markov chain. As a consequence, Definition 9.7 of the
strong data processing inequality gives
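It remains to relate $I(X_l'; \Pi(X'))$ to $I(X_l; \Pi(X) \mid V = 0)$. Here we use the bounded likelihood ratio
between $P_0$ and $P_1$. Indeed, we have by the condition (9.4.3) that
\[
  P_0 \ge \frac{1}{c} P_1 \quad \text{so} \quad (c + 1) P_0 \ge P_0 + P_1 \quad \text{or} \quad P_0 \ge \frac{2}{c + 1} \cdot \frac{P_0 + P_1}{2}.
\]
As a consequence, we have
\begin{align*}
  I(X_l; \Pi(X_1^m) \mid V = 0)
  &= \int D_{\mathrm{kl}}(Q(\cdot \mid X_l = x) \,\|\, M_0)\, dP_0(x) \\
  &\ge \frac{2}{c + 1} \int D_{\mathrm{kl}}(Q(\cdot \mid X_l = x) \,\|\, M_0)\, \frac{dP_0(x) + dP_1(x)}{2} \\
  &\ge \frac{2}{c + 1} \int D_{\mathrm{kl}}\big(Q(\cdot \mid X_l = x) \,\|\, \overline{M}\big)\, \frac{dP_0(x) + dP_1(x)}{2} \\
  &= \frac{2}{c + 1} I(X_l'; \Pi(X')),
\end{align*}
where the second inequality uses that $\overline{M} = \int Q(\cdot \mid X_l = x) \frac{dP_0(x) + dP_1(x)}{2}$ minimizes the integrated
KL-divergence (recall inequality (8.7.3)). Returning to inequality (9.5.4), we evidently have the
result of the lemma.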
as desired.
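\[
  \sum_{i,j} \|u_i - u_j\|_2^2
  = \sum_{i,j} \|u_i - \bar{u} + \bar{u} - u_j\|_2^2
  = \sum_{i,j} \|u_i - \bar{u}\|_2^2 + \sum_{i,j} \|\bar{u} - u_j\|_2^2
  = 2N \sum_{i=1}^N \|\bar{u} - u_i\|_2^2.
\]
\[
  \frac{1}{N} \sum_{1 \le i < j \le N} \|u_i - u_j\|_2^2
  \le \sum_{i=1}^N \|u_i - \bar{u}\|_2^2
  \le \sum_{i=1}^N \|u_i - u_0\|_2^2. \tag{9.5.5}
\]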
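The two displays above are statements about centering a finite point cloud, and can be sanity-checked numerically. This sketch assumes that the symbol whose overline was lost denotes the mean of the $u_i$ and that the final comparison point is an arbitrary reference vector; the sample points and function names are illustrative, not from the notes.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_vector(points):
    """Coordinate-wise mean of a list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def pairwise_sum(points):
    """Sum over all ordered pairs (i, j) of ||u_i - u_j||^2."""
    return sum(sq_dist(u, v) for u in points for v in points)

def centered_sum(points, center):
    """Sum over i of ||u_i - center||^2."""
    return sum(sq_dist(u, center) for u in points)

# Illustrative data (an assumption, any finite collection works).
points = [[0.0, 1.0], [2.0, -1.0], [3.0, 4.0], [-1.0, 0.5]]
u_bar = mean_vector(points)
```

Here `pairwise_sum(points)` should match `2 * len(points) * centered_sum(points, u_bar)`, and replacing `u_bar` by any other center can only increase `centered_sum`.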
Now, we return to the Hellinger distances. Evidently $2 d_{\mathrm{hel}}^2(P_a, P_b) = \|\sqrt{p_a(\cdot)} - \sqrt{p_b(\cdot)}\|_2^2$, so
that it is a Euclidean distance. As a consequence, for any pairwise disjoint collection of $N$ bit
vectors $b^{(i)}$, we have
\[
  \sum_{i=1}^N d_{\mathrm{hel}}^2(P_0, P_{b^{(i)}})
  \ge \frac{1}{N} \sum_{1 \le i < j \le N} d_{\mathrm{hel}}^2(P_{b^{(i)}}, P_{b^{(j)}})
  = \frac{1}{N} \sum_{1 \le i < j \le N} d_{\mathrm{hel}}^2(P_0, P_{b^{(i)} + b^{(j)}}) \tag{9.5.6}
\]
where the inequality follows from (9.5.5) and the equality by the assumed cut-and-paste property.
Now, we apply Baranyai's theorem, which says that we may decompose any complete graph $K_N$,
where $N$ is even, into $N - 1$ perfect matchings $M_i$, each with $N/2$ edges (necessarily, as they form a
perfect matching), where the $M_i$ are edge disjoint. Identifying the pairs $i < j$ with the complete
graph, we thus obtain
\[
  \sum_{1 \le i < j \le N} d_{\mathrm{hel}}^2(P_0, P_{b^{(i)} + b^{(j)}})
  = \sum_{l=1}^{N-1} \sum_{(i,j) \in M_l} d_{\mathrm{hel}}^2(P_0, P_{b^{(i)} + b^{(j)}}). \tag{9.5.7}
\]
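For ordinary graphs, the decomposition used here is the classical one-factorization of $K_N$, and one standard way to construct it is the circle (round-robin) method. The sketch below is an illustration of that construction under this assumption; it is not taken from the notes, and the function name is hypothetical.

```python
from itertools import combinations

def perfect_matchings(n: int):
    """Circle-method decomposition of K_n (n even) into n - 1 perfect matchings.

    Vertices are 0, ..., n - 1. Vertex 0 stays fixed while the others rotate.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    others = list(range(1, n))
    rounds = []
    for _ in range(n - 1):
        lineup = [0] + others
        # Pair the k-th vertex from the front with the k-th from the back.
        matching = [tuple(sorted((lineup[k], lineup[n - 1 - k]))) for k in range(n // 2)]
        rounds.append(matching)
        others = others[-1:] + others[:-1]  # rotate by one position
    return rounds
```

Each returned round plays the role of one matching $M_l$ in (9.5.7): the rounds are edge disjoint and together cover every pair $i < j$ exactly once.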
Now fix $n \in \{1, \ldots, N - 1\}$ and a matching $M_n$. By assumption we have $\langle b^{(i)} + b^{(j)}, b^{(i')} + b^{(j')} \rangle = 0$
for any distinct pairs $(i, j), (i', j') \in M_n$, and moreover, $\sum_{(i,j) \in M_n} (b^{(i)} + b^{(j)}) = b$. Thus, our
induction hypothesis gives that for each $n \in \{1, \ldots, N - 1\}$ and the associated matching $M_n$, we have
\[
  \sum_{(i,j) \in M_n} d_{\mathrm{hel}}^2(P_0, P_{b^{(i)} + b^{(j)}}) \ge d_{\mathrm{hel}}^2(P_0, P_b) \prod_{l=1}^{k-1} (1 - 2^{-l}).
\]
Substituting this lower bound into inequality (9.5.7) and using inequality (9.5.6), we obtain
\[
  \sum_{i=1}^N d_{\mathrm{hel}}^2(P_0, P_{b^{(i)}})
  \ge \frac{1}{N} \cdot (N - 1)\, d_{\mathrm{hel}}^2(P_0, P_b) \prod_{l=1}^{k-1} (1 - 2^{-l})
  = d_{\mathrm{hel}}^2(P_0, P_b) \prod_{l=1}^{k} (1 - 2^{-l}),
\]
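The final equality uses $(N - 1)/N = 1 - 2^{-k}$ for $N = 2^k$, and the resulting product stays bounded away from zero as $k$ grows. A small numerical sketch (the function name is an illustrative assumption) makes both points concrete:

```python
def product_factor(k: int) -> float:
    """Return prod_{l=1}^k (1 - 2^{-l})."""
    result = 1.0
    for l in range(1, k + 1):
        result *= 1.0 - 2.0 ** (-l)
    return result

def induction_step_holds(k: int) -> bool:
    """Check (N - 1)/N * prod_{l<k} (1 - 2^{-l}) == prod_{l<=k} (1 - 2^{-l}) for N = 2^k."""
    n = 2 ** k
    return abs((n - 1) / n * product_factor(k - 1) - product_factor(k)) < 1e-12
```

Numerically the product decreases toward roughly 0.2888, so the lemma loses only a constant factor regardless of $k$.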
9.6 Bibliography
Data processing inequalities originate with Dobrushin’s study of central limit theorems for Markov
chains [62, 63]; Dobrushin first proved Proposition 9.1.1 (see [63, Sec. 3.1]). Cohen et al. [50] show
that the strong data processing constant for variation distance is the largest of the strong data
processing constants (Theorem 9.1.2) for finite state spaces using careful linear algebraic techniques,
also showing the opposite extremality (inequality (9.1.1)) of the χ2 contraction coefficient [50,
Proposition II.6.15] for finite state spaces. Del Moral et al. [61] and Polyanskiy and Wu [146] give
related and approachable treatments for general alphabets, and Exercises 9.1 and 9.2 follow [61].
More broadly, strong data processing inequalities arise in many applications in communication,
estimation, and some functional analysis [147, 146].
Communication complexity begins with Yao [176], which introduces the communication com-
plexity setting we discuss in Section 9.3, making the connections between randomized complexities
and public (shared) randomness. The standard classical reference for the subject is Kushilevitz
and Nisan’s book [124]. There are numerous techniques that we do not discuss, including so-called
discrepancy lower bounds, which address both randomized and deterministic communication
complexity; for example, these give the stronger lower bound that $\mathrm{DCC}_\delta(\mathrm{IP}_2) \ge n - O(1)$ [124, Example
3.29 and Exercise 3.30]. Communication complexity has uses far beyond the “standard” commu-
nication setting we have outlined, with more recent research showing how to use the techniques
to provide lower bounds on the performance of algorithms in many computational models, such
as streaming models and memory-limited computation [141, 148]. Our information complexity ap-
proach follows Bar-Yossef et al. [15]. Recent work has shown how communication lower bounds and
strong data processing inequalities can be used to show the necessity of “memorization” in some
natural problems in machine learning, where any learning procedure with good enough performance
necessarily encodes substantial irrelevant information about a dataset [38].
Our treatment of communication complexity and its applications in estimation follows an ap-
proach Zhang et al. [178] originate. The particular techniques we adapt, involving direct sums and
strong data processing in communication, we adapt from Braverman et al. [37] and Garg et al.
[90]. Our results apply most easily to scenarios in which each machine or agent owns only a single
data item, which allows application of Proposition 9.4.2; tensorizing this to multiple observations
requires some care, but can be done with a truncation argument [178, 37] or more careful Sobolev
inequalities [147]. Our extension to private estimation scenarios follows the paper [65], which also
shows how to generalize to other variants of privacy.
9.7 Exercises
Exercise 9.1 (Approximating nonnegative convex functions): Let f : R → R+ ∪ {+∞} be a
closed, nonnegative convex function.
(a) Show that there exists a sequence of piecewise linear functions fn satisfying fn−1 ≤ fn ≤ f for
all n and for which fn (x) ↑ f (x) pointwise for all x s.t. f (x) < ∞, and fn (x) ↑ ∞ otherwise.
Hint: Let L be the collection of linear functions below f , that is L = {l | l(x) = a + bx, l(x) ≤
f (x) for all x}, and note that f (x) = sup{l(x) | l ∈ L}. (See Appendix C.2.) You may replace
L with functions of the form l(x) = f (x0 ) + g(x − x0 ), where g ∈ ∂f (x0 ) is a subderivative of
f at x0 .
(c) Conclude that for any measure $\mu$ on $\mathbb{R}_+$, $\int f_n\, d\mu \uparrow \int f\, d\mu$.
Exercise 9.2 (Proving Theorem 9.1.2): In this question, we formalize the sketched proof of
Theorem 9.1.2 by filling in details of the following steps. Let α = αTV (Q) be the Dobrushin
coefficient of the channel Q and f : R → R+ ∪ {+∞} be a closed convex function.
(a) There exists a nondecreasing sequence $f_n$ of piecewise linear functions, each of the form
$f_n(x) = \sum_{i=1}^n a_i [b_i - x]_+ + \sum_{i=1}^n c_i [x - d_i]_+$, where $b_i \le 1$, $d_i \ge 1$, and $a_i, c_i \ge 0$. Hint: Exercise 9.1.
(b) Let $M_v(A) = \int Q(A \mid x)\, dP_v(x)$ for $v \in \{0, 1\}$ be the induced marginal distributions. Show
that for any function of the form $h(t) = [t - \Delta]_+$, where $\Delta > 1$,
i. Define the set X (∆) := {x | dP0 (x) ≤ ∆dP1 (x)}. Argue that X (∆) must be non-null (i.e.,
have positive measure).
ii. Define the probability distribution P∆ with density
(c) Using the monotone convergence theorem, show that Df (M0 ||M1 ) ≤ αDf (P0 ||P1 ).
Exercise 9.3 (Markov chain mixing): Consider a Markov chain X1 , X2 , . . . with transition distri-
bution P (· | x) and stationary distribution π. Let P k (· | x) denote the distribution of the Markov
chain initialized in state x after k steps. Assume there exists some (finite) positive integer k ∈ N
such that for any two initial states x0 , x1 , the Markov chain satisfies
\[
  \big\| P^k(\cdot \mid x_0) - P^k(\cdot \mid x_1) \big\|_{\mathrm{TV}} \le \beta < 1.
\]
Show that the Markov chain enjoys fast mixing for any f-divergence: if there is any $n$ such that
$D_f(P^n(\cdot \mid x) \,\|\, \pi) < \infty$, the Markov chain mixes exponentially quickly in that it satisfies
\[
  \limsup_n \frac{1}{n} \log D_f(P^n(\cdot \mid x) \,\|\, \pi) \le \frac{1}{k} \log \beta < 0.
\]
In brief, as soon as one can demonstrate a constant gap in variation distance, one is guaranteed a
Markov chain mixes geometrically.
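As a numerical illustration of the phenomenon (not a solution to the exercise), the following sketch uses a two-state chain with $k = 1$ and tracks only the variation distance to stationarity rather than a general f-divergence. The flip probabilities `a` and `b` are hypothetical values chosen for the example.

```python
def step(dist, P):
    """One step of the chain: row vector times transition matrix."""
    size = len(P)
    return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

a, b = 0.3, 0.1                      # hypothetical flip probabilities
P = [[1 - a, a], [b, 1 - b]]         # two-state transition matrix
pi = [b / (a + b), a / (a + b)]      # its stationary distribution
beta = tv(P[0], P[1])                # one-step gap between the two rows

def tv_to_stationary(x: int, n: int) -> float:
    """TV distance between P^n(. | x) and pi."""
    dist = [1.0 if i == x else 0.0 for i in range(2)]
    for _ in range(n):
        dist = step(dist, P)
    return tv(dist, pi)
```

For this chain the distance to stationarity after $n$ steps is at most $\beta^n$, matching the geometric rate the exercise asks one to establish in general.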
Exercise 9.4: For k ∈ [1, ∞], we consider the collection of distributions
that is, distributions P supported on R with kth moment bounded by 1. We consider minimax
estimation of the mean E[X] for these families under ε-local differential privacy, meaning that for
each observation $X_i$, we observe a private realization $Z_i$ (which may depend on $Z_1^{i-1}$) where $Z_i$
is an ε-differentially private view of Xi . Let Qε denote the collection of all ε-differentially private
channels, and define the (locally) private minimax risk
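\[
  \mathfrak{M}_n(\theta(\mathcal{P}), (\cdot)^2, \varepsilon) := \inf_{\hat{\theta}_n} \inf_{Q \in \mathcal{Q}_\varepsilon} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\big[(\hat{\theta}_n(Z_1^n) - \theta(P))^2\big].
\]
(a) Assume that $\varepsilon \le 1$. For $k \in [1, \infty]$, show that there exists a constant $c > 0$ such that
\[
  \mathfrak{M}_n(\theta(\mathcal{P}_k), (\cdot)^2, \varepsilon) \ge c \left(\frac{1}{n \varepsilon^2}\right)^{\frac{k-1}{k}}.
\]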
(b) Give an ε-locally differentially private estimator achieving the minimax rate in part (a).
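Exercise 9.5: Show that the strong data processing inequality in Theorem 9.2.1 is sharp in the
following sense. There exist $\varepsilon$-differentially private channels $Q_\varepsilon$ such that for any Bernoulli
distributions $P_0$ and $P_1$ and induced marginal distributions $M_{v,\varepsilon} = Q(\cdot \mid X = 1) P_v(X = 1) + Q(\cdot \mid X = 0) P_v(X = 0)$,
\[
  \frac{D_{\mathrm{kl}}(M_{0,\varepsilon} \,\|\, M_{1,\varepsilon})}{\|P_0 - P_1\|_{\mathrm{TV}}^2} = \frac{\varepsilon^2}{2} + O(\varepsilon^3)
\]
as $\varepsilon \downarrow 0$.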
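One candidate channel to experiment with is binary randomized response, which reports the true bit with probability $e^\varepsilon/(1 + e^\varepsilon)$ and flips it otherwise. The sketch below is only a numerical sanity check under that assumption, not a proof and not necessarily the channel the exercise intends; the function names and the Bernoulli parameters used in any call are illustrative.

```python
import math

def randomized_response_marginal(p: float, eps: float) -> float:
    """P(Z = 1) when X ~ Bernoulli(p) and Z = X w.p. e^eps / (1 + e^eps), else 1 - X."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return p * keep + (1.0 - p) * (1.0 - keep)

def bernoulli_kl(m0: float, m1: float) -> float:
    """KL divergence (natural log) between Bernoulli(m0) and Bernoulli(m1)."""
    return m0 * math.log(m0 / m1) + (1.0 - m0) * math.log((1.0 - m0) / (1.0 - m1))

def kl_over_tv_squared(p0: float, p1: float, eps: float) -> float:
    """Ratio D_kl(M_0 || M_1) / ||P_0 - P_1||_TV^2 for Bernoulli inputs."""
    m0 = randomized_response_marginal(p0, eps)
    m1 = randomized_response_marginal(p1, eps)
    return bernoulli_kl(m0, m1) / (p0 - p1) ** 2  # TV between Bernoullis is |p0 - p1|
```

For small $\varepsilon$, dividing `kl_over_tv_squared(p0, p1, eps)` by $\varepsilon^2/2$ should give a value close to 1.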
Exercise 9.6: We apply the results of Exercise 9.4 to a problem of estimation of drug use.
Assume we interview a series of individuals i = 1, . . . , n, asking whether each takes illicit drugs.
Let Xi ∈ {0, 1} be 1 if person i uses drugs, 0 otherwise, and define θ∗ = E[X] = E[Xi ] = P (X = 1).
Instead of Xi we observe answers Zi under differential privacy,
Zi | Xi = x ∼ Q(· | Xi = x)
for an $\varepsilon$-differentially private $Q$ with $\varepsilon < \frac{1}{2}$ (so that $(e^\varepsilon - 1)^2 \le 2\varepsilon^2$). Let $\mathcal{Q}_\varepsilon$ denote the family of
all ε-differentially private channels, and let P denote the Bernoulli distributions with parameter
θ(P ) = P (Xi = 1) ∈ [0, 1] for P ∈ P.
(a) Use Le Cam’s method and the strong data processing inequality in Theorem 9.2.1 to show that
the minimax rate for estimation of the proportion θ∗ in absolute value satisfies
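\[
  \mathfrak{M}_n(\theta(\mathcal{P}), |\cdot|, \varepsilon) := \inf_{Q \in \mathcal{Q}_\varepsilon} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\big[|\hat{\theta}(Z_1, \ldots, Z_n) - \theta(P)|\big] \ge c \frac{1}{\sqrt{n \varepsilon^2}},
\]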
where c > 0 is a universal constant.
Exercise 9.7: Show that the randomized communication complexity (9.3.1) satisfies $\mathrm{RCC}_\delta(f) \le O(1) \log\frac{1}{\delta} \cdot \mathrm{RCC}_{1/3}(f)$ for any $f$ and any $\delta < 1$.
Exercise 9.8 (From public to private randomness): Consider the randomized complexity (9.3.1)
and associated public-randomness complexity $\mathrm{RCC}^{\mathrm{pub}}_\delta$. Let $\mathcal{X} = \mathcal{Y} = \{0, 1\}^n$ and $f : \mathcal{X} \times \mathcal{Y} \to \{0, 1\}$,
and let $\Pi$ be a protocol using public randomness $U$ such that $\max_{x,y} \mathbb{P}(\Pi(x, y, U) \neq f(x, y)) \le \epsilon$.
(b) Give a protocol that uses no public randomness but whose communication complexity is at
most $\mathrm{depth}(\Pi) + O(1) \log\frac{n}{\delta}$.
Exercise 9.9 (An information lower bound for indexing): In the indexing problem in communi-
cation complexity, Alice receives an n-bit string x ∈ {0, 1}n and Bob an index y ∈ [n] = {1, . . . , n},
and the two communicate to evaluate xy ; set f (x, y) = xy .
(a) Show that if Bob can send messages, the communication complexity of indexing satisfies
CC(f ) ≤ O(1) log n.
In the one way communication model, only Alice can send messages. Let µ be the uniform
distribution on $(X, Y) \in \{0, 1\}^n \times [n]$. We will show that $\mathrm{DCC}^\mu_\delta(f) \ge (1 - h_2(\delta)) n$, where
$h_2(p) = -p \log_2 p - (1 - p) \log_2(1 - p)$ is the binary entropy.
(b) Fix the index $Y = i$ and let $p_i = \mathbb{P}(\hat{X}_i = X_i \mid Y = i)$ based on a protocol $\Pi$. Use Fano's
inequality (Proposition 8.4.1) to argue that $h_2(p_i) \ge H_2(X_i \mid \Pi)$.
I(X1n ; Π) ≥ (1 − h2 (δ))n.
Exercise 9.10 (Information complexity for entrywise less or equal): Consider the entrywise less
than or equal to function $f : \{0, 1\}^n \times \{0, 1\}^n \to \{0, 1\}$ with $f(x, y) = 1\{x \preceq y\}$, so that $f(x, y) = 1$
if xi ≤ yi for each i and 0 if there exists i such that xi > yi .
(a) Show that f has the decompositional structure (9.3.4). Give the functions g and h.
(c) Use Theorem 9.3.13 and a modification of the proof of Proposition 9.3.15 to show that
$\mathrm{IC}_\delta(f) \ge \frac{n}{4}\big(1 - 2\sqrt{\delta(1 - \delta)}\big)$. (This is order optimal, because $\mathrm{IC}_\delta(f) \le \mathrm{CC}(f) \le n + 1$ trivially.)
Exercise 9.11 (Lower bounds for private logistic regression): This question is (likely) challenging.
Consider the logistic regression model for $y \in \{\pm 1\}$, $x \in \mathbb{R}^d$, that
\[
  p_\theta(y \mid x) = \frac{1}{1 + \exp(-y \langle \theta, x \rangle)}.
\]
For a distribution $P$ on $(X, Y) \in \mathbb{R}^d \times \{\pm 1\}$, where $Y \mid X = x$ has logistic distribution, define the
excess risk
\[
  L(\theta, P) := \mathbb{E}_P[\ell(\theta; X, Y)] - \inf_\theta \mathbb{E}_P[\ell(\theta; X, Y)]
\]
where $\ell(\theta; x, y) = \log(1 + \exp(-y \langle x, \theta \rangle))$ is the logistic loss. Let $\mathcal{P}$ be the collection of such
distributions, where $X$ is supported on $\{-1, 1\}^d$. Following the notation of Exercise 8.4, for a
channel $Q$ mapping $(X, Y) \to Z$, define
\[
  \mathfrak{M}_n(\mathcal{P}, L, Q) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\big[L(\hat{\theta}(Z_1^n), P)\big],
\]
where the expectation is taken over $Z_i \sim Q(\cdot \mid X_i, Z_1^{i-1})$. Assume that the channel releases are all
(locally) $\varepsilon$-differentially private.
(a) Show that for all n large enough,
\[
  \mathfrak{M}_n(\mathcal{P}, L, Q) \ge c \cdot \frac{d}{n} \cdot \frac{d}{\varepsilon \wedge \varepsilon^2}
\]
for some (numerical) constant $c > 0$.
(b) Suppose we allow additional passes through the dataset (i.e. multiple rounds of communication),
but still require that all data Zi released from Xi be ε-differentially private. That is, assume
we have the (sequential and interactive) release schemes of Fig. 9.3, and we guarantee that
\[
  Z_i^{(t)} \sim Q(\cdot \mid X_i, B^{(1)}, \ldots, B^{(t)}, Z_1^{(t)}, \ldots, Z_{i-1}^{(t)})
\]
is $\varepsilon_{i,t}$-differentially private, where $\sum_t \varepsilon_{i,t} \le \varepsilon$ for all $i$. Does the lower bound of part (a) change?
Chapter 10
When we wish to estimate a complete “object,” such as the parameter θ in a linear regression
Y = Xθ + ε, or a density when we observe X1 , . . . , Xn i.i.d. with a density f , the previous chapters
give a number of approaches to proving fundamental optimality results and limits. In many cases,
however, we wish to estimate functionals of a distribution or larger parameter, rather than the
entire distribution or a high-dimensional parameter. Suppose we wish to estimate some statistic
$T(P) \in \mathbb{R}$ of a probability distribution $P$. Then a naive estimator is to construct an estimate $\hat{P}$
of $P$, and simply plug it in: use $\hat{T} = T(\hat{P})$. But frequently (and as we have seen in the preceding
chapters) our ability to estimate $P$ may be limited, while various statistics of $P$ may be easier to
estimate. As a trivial example of this phenomenon, suppose we have an unknown distribution $P$
supported on $[-1, 1]$, and we wish to estimate the statistic $T(P) = \mathbb{E}_P[X]$, its expectation. Then
the trivial sample mean estimator
\[
  T_n := \overline{X}_n
\]
satisfies $\mathbb{E}[(T_n - \mathbb{E}[X])^2] \le \frac{1}{n}$. But an estimator that first attempts to approximate the full
distribution $P$ via some $\hat{P}$ and then estimates $\int x\, d\hat{P}(x)$ is likely to incur substantial additional error.
Alternatively, we might wish to test different properties of distributions. In goodness of fit
testing, we are given a sample $X_1, \ldots, X_n$ i.i.d. from a distribution $Q$, and we wish to distinguish
whether $Q = P$ or $Q$ is far from $P$. In related two-sample tests, we are given samples $X_1^n \stackrel{\mathrm{iid}}{\sim} P$
and $Y_1^m \stackrel{\mathrm{iid}}{\sim} Q$, and again wish to test whether $Q = P$ or $Q$ and $P$ are far from one another. For
example, in a medical study, we may wish to distinguish whether there are significant differences
between a treated population Q and control population P .
More broadly, we wish to develop tools to understand the optimality of different estimators
and tests of functionals, by which we mean scalar valued parameters of a distribution P . Such
parameters could include the norm kθk2 of a regression vector, an estimate of the best possible
expected loss $\inf_f \mathbb{E}_P[\ell(f(X), Y)]$ in a prediction problem, the distance $\|P - P_0\|_{\mathrm{TV}}$ of a sampled
population P from a reference P0 , or the probability mass of outcomes we have not observed in a
study. This chapter develops a few of the tools to understand these problems.
another, and then computing the distance between the convex hulls of the families. This leads to
Le Cam’s convex hull method, which we state abstractly and specialize later to different scenarios
of interest. Let P be a collection of distributions on an underlying space X , and let θ : P → Rd be
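a parameter of interest. We say that two subsets $\mathcal{P}_0 \subset \mathcal{P}$ and $\mathcal{P}_1 \subset \mathcal{P}$ are $\delta$-separated in $\|\cdot\|$ if
\[
  \|\theta(P_0) - \theta(P_1)\| \ge \delta \quad \text{for all } P_0 \in \mathcal{P}_0,\ P_1 \in \mathcal{P}_1.
\]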
We do not require that all of P0 be somehow on one side or the other of the collection {θ(P1 ) |
P1 ∈ P1 } of parameters associated with P1 , just that they be pairwise separate.
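Let $\mathrm{Conv}(\mathcal{P})$ be the collection of mixtures of elements of $\mathcal{P}$, that is,
\[
  \mathrm{Conv}(\mathcal{P}) = \left\{ \sum_{i=1}^m \lambda_i P_i \;\Big|\; m \in \mathbb{N},\ \lambda \succeq 0,\ \langle \lambda, \mathbf{1} \rangle = 1,\ P_i \in \mathcal{P} \right\}.
\]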
(note the temporary lack of sample size n), we then have the following generalization of inequal-
ity (8.3.3).
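Theorem 10.1.1 (Le Cam's Convex Hull Lower Bound). Let $\mathcal{P}_0$ and $\mathcal{P}_1 \subset \mathcal{P}$ be $\delta$-separated in
$\|\cdot\|$. Then
\[
  \mathfrak{M}(\theta(\mathcal{P}), \|\cdot\|) \ge \frac{\delta}{2} \sup\Big\{ 1 - \big\|\overline{P}_0 - \overline{P}_1\big\|_{\mathrm{TV}} \;\Big|\; \overline{P}_0 \in \mathrm{Conv}(\mathcal{P}_0),\ \overline{P}_1 \in \mathrm{Conv}(\mathcal{P}_1) \Big\}.
\]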
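Proof For any parameter $\theta$, the separation $\|\theta(P_0) - \theta(P_1)\| \ge \delta$ and the triangle inequality
guarantee that at least one of $\|\theta - \theta(P_0)\| \ge \delta/2$ or $\|\theta - \theta(P_1)\| \ge \delta/2$ holds for all pairs $P_0 \in \mathcal{P}_0$
and $P_1 \in \mathcal{P}_1$. Let $\overline{P}_0 = \sum_{j=1}^m \alpha_j P_j$ and $\overline{P}_1 = \sum_{j=1}^m \beta_j Q_j$ for $P_j \in \mathcal{P}_0$ and $Q_j \in \mathcal{P}_1$, respectively,
where $\alpha, \beta$ are convex combinations. Then by Markov's inequality,
\begin{align*}
  \mathfrak{M}(\theta(\mathcal{P}), \|\cdot\|)
  &\ge \frac{1}{2} \sum_{j=1}^m \alpha_j \mathbb{E}_{P_j}\big[\|\hat{\theta} - \theta(P_j)\|\big] + \frac{1}{2} \sum_{j=1}^m \beta_j \mathbb{E}_{Q_j}\big[\|\hat{\theta} - \theta(Q_j)\|\big] \\
  &\ge \frac{\delta}{2} \bigg[ \sum_{j=1}^m \alpha_j \mathbb{E}_{P_j}\big[1\{\|\hat{\theta} - \theta(P_j)\| \ge \delta/2\}\big] + \sum_{j=1}^m \beta_j \mathbb{E}_{Q_j}\big[1\{\|\hat{\theta} - \theta(Q_j)\| \ge \delta/2\}\big] \bigg] \\
  &\ge \frac{\delta}{2} \sum_{j=1}^m \bigg( \alpha_j \mathbb{E}_{P_j}\Big[\inf_{P_0 \in \mathcal{P}_0} 1\{\|\hat{\theta} - \theta(P_0)\| \ge \delta/2\}\Big] + \beta_j \mathbb{E}_{Q_j}\Big[\inf_{P_1 \in \mathcal{P}_1} 1\{\|\hat{\theta} - \theta(P_1)\| \ge \delta/2\}\Big] \bigg) \\
  &= \frac{\delta}{2} \bigg( \mathbb{E}_{\overline{P}_0}\Big[\inf_{P_0 \in \mathcal{P}_0} 1\{\|\hat{\theta} - \theta(P_0)\| \ge \delta/2\}\Big] + \mathbb{E}_{\overline{P}_1}\Big[\inf_{P_1 \in \mathcal{P}_1} 1\{\|\hat{\theta} - \theta(P_1)\| \ge \delta/2\}\Big] \bigg).
\end{align*}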
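We leave this form of total variation distance as an exercise (see Exercise 2.1). Substituting it into
the display above, we find that for any $\overline{P}_v \in \mathrm{Conv}(\mathcal{P}_v)$, we have
\[
  \mathfrak{M}(\theta(\mathcal{P}), \|\cdot\|) \ge \frac{\delta}{2} \Big( 1 - \big\|\overline{P}_0 - \overline{P}_1\big\|_{\mathrm{TV}} \Big).
\]
Taking a supremum over the $\overline{P}_v$ gives the theorem.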
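To see why taking convex hulls can help, the following sketch compares the variation distance from a null to a single alternative with the distance to a mixture of alternatives, for distributions on two coin flips. The specific distributions, the bias `a`, and the function names are illustrative assumptions; the separation `delta` is passed in rather than derived from a particular parameter.

```python
from itertools import product

def product_bernoulli(biases):
    """Distribution of independent coin flips on {0,1}^d as a dict outcome -> probability."""
    dist = {}
    for outcome in product((0, 1), repeat=len(biases)):
        prob = 1.0
        for bit, b in zip(outcome, biases):
            prob *= b if bit == 1 else 1.0 - b
        dist[outcome] = prob
    return dist

def mixture(dists, weights):
    """Convex combination of distributions sharing the same support."""
    return {k: sum(w * d[k] for w, d in zip(weights, dists)) for k in dists[0]}

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

def le_cam_bound(delta, p_bar0, p_bar1):
    """(delta / 2) * (1 - TV), the quantity inside the supremum of Theorem 10.1.1."""
    return 0.5 * delta * (1.0 - tv(p_bar0, p_bar1))

a = 0.2  # hypothetical bias
null = product_bernoulli([0.5, 0.5])
alt_plus = product_bernoulli([0.5 + a, 0.5 + a])
alt_minus = product_bernoulli([0.5 - a, 0.5 - a])
alt_mix = mixture([alt_plus, alt_minus], [0.5, 0.5])
```

In this toy case a single alternative sits at distance $a + a^2$ from the null while the mixture sits at distance $2a^2$, so the mixture yields the larger lower bound.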
where the expectation is taken with respect to $X \sim P_0$. More generally, let $V \in \mathcal{V}$ be a random
variable distributed according to $\pi$ and conditional on $V = v$, let $X \mid V = v \sim P_v$. Then for the
paired likelihood ratio $l(x \mid v, v') = \frac{p_v(x) p_{v'}(x)}{p_0^2(x)}$, the marginal distribution $\overline{P}$ of $X$ satisfies
where the expectation is taken jointly over $X \sim P_0$ and $V, V' \stackrel{\mathrm{iid}}{\sim} \pi$.
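Proof The starting point is to notice that for any two distributions $P$ and $Q$ we have
$D_{\chi^2}(P \| Q) = \int (dP/dQ - 1)^2\, dQ = \int \big(\frac{dP}{dQ}\big)^2 dQ - 2 \int \frac{dP}{dQ}\, dQ + \int dQ = \int \frac{dP^2}{dQ} - 1$. Then we proceed by recognizing that
$\big(\frac{1}{N} \sum_{i=1}^N x_i\big)^2 = \frac{1}{N^2} \sum_{i,j} x_i x_j$ for any sequence $x_i$, and so
\[
  D_{\chi^2}\big(\overline{P} \,\|\, P_0\big) + 1
  = \int \frac{\big((1/|\mathcal{V}|) \sum_{v \in \mathcal{V}} dP_v\big)^2}{dP_0}
  = \frac{1}{|\mathcal{V}|^2} \sum_{v, v' \in \mathcal{V}} \int \frac{dP_v\, dP_{v'}}{dP_0}
\]
as desired. The second statement has identical proof to the first except that we replace $\frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}}$
with expectations according to $\pi$.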
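The identity in the proof is easy to verify on a finite sample space, where integrals become sums. The following sketch does this for a hypothetical base distribution and three hypothetical mixture components; all names and numbers are illustrative.

```python
def chi_square(p, q):
    """D_chi2(P || Q) = sum_x (p(x) - q(x))^2 / q(x) for discrete distributions."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def mixture_chi_square_via_pairs(components, p0):
    """(1/|V|^2) sum_{v,v'} sum_x p_v(x) p_{v'}(x) / p_0(x), minus 1."""
    size = len(components)
    total = 0.0
    for pv in components:
        for pw in components:
            total += sum(a * b / c for a, b, c in zip(pv, pw, p0))
    return total / size ** 2 - 1.0

# Hypothetical base distribution and mixture components on a four-point space.
p0 = [0.25, 0.25, 0.25, 0.25]
components = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], [0.25, 0.35, 0.15, 0.25]]
p_bar = [sum(col) / len(components) for col in zip(*components)]
```

Computing `chi_square(p_bar, p0)` directly and via `mixture_chi_square_via_pairs(components, p0)` should agree up to rounding.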
The applications of this lemma are many, and going through a few examples will best show how
to leverage it. Roughly, our typical approach is the following: we identify V with {±1}d or some
other suitably nice collection of vectors. We then choose distributions Pv and P0 with densities
suitably nice that the ratios pv /p0 “act” like exponentials involving inner products of v ∈ V with
some other quantity; then, because v is uniform in V in Lemma 10.1.3, we can leverage all the
tools we have developed to control moment generating functions and concentration inequalities in
Chapter 4 to bound the χ2 -divergence and then apply Theorem 10.1.1.
Let us give one example of this approach, where we see the technique we use to prove the
lemma arises frequently. Let $P_0 = \mathsf{N}(0, \sigma^2 I_d)$ be the mean-zero normal distribution on $\mathbb{R}^d$, and for
$\mathcal{V} = \{-1, 1\}^d$ and some $\delta \ge 0$ to be chosen, let $P_v = \mathsf{N}(\delta v, \sigma^2 I_d)$. Then we have the following
lemma, which shows that while $D_{\mathrm{kl}}(P_v \| P_0) = \frac{d\delta^2}{2\sigma^2}$ for each individual $P_v$, the divergence for the
average can be much smaller (even quadratically so in the ratio $\delta^2/\sigma^2$).
Lemma 10.1.4. Let $P_0$ and $P_v$ be Gaussian distributions as above, and define the mixture
$\overline{P} = \frac{1}{2^d} \sum_{v \in \{\pm 1\}^d} P_v$. Then
\[
  2 \big\|P_0 - \overline{P}\big\|_{\mathrm{TV}}^2 \le \log\big(1 + D_{\chi^2}\big(\overline{P} \,\|\, P_0\big)\big) \le \frac{d\delta^4}{2\sigma^4}.
\]
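Proof The first inequality combines Pinsker's inequality (Proposition 2.2.8) with the bound
$D_{\mathrm{kl}}(P \| Q) \le \log(1 + D_{\chi^2}(P \| Q))$ in Proposition 2.2.9. Now we expand the $\chi^2$-divergence, yielding
\[
  1 + D_{\chi^2}\big(\overline{P} \,\|\, P_0\big) = \mathbb{E}\left[\exp\left(-\frac{1}{2\sigma^2} \|Y - \delta V\|_2^2 - \frac{1}{2\sigma^2} \|Y - \delta V'\|_2^2 + \frac{1}{\sigma^2} \|Y\|_2^2\right)\right],
\]
where the expectation is over $Y \sim \mathsf{N}(0, \sigma^2 I_d)$ and $V, V' \stackrel{\mathrm{iid}}{\sim} \mathrm{Uniform}(\mathcal{V})$. Taking the expectation over
$Y$ first, before averaging over the packing elements, allows more careful control. Indeed, expanding
the squares and recognizing that $\|v\|_2^2 = d$ for each $v \in \{\pm 1\}^d$, we have
\begin{align*}
  1 + D_{\chi^2}\big(\overline{P} \,\|\, P_0\big)
  &= \mathbb{E}\left[\exp\left(\frac{\delta}{\sigma^2} \langle Y, V + V' \rangle - \frac{d\delta^2}{\sigma^2}\right)\right]
  = \mathbb{E}\left[\exp\left(\frac{\delta^2}{2\sigma^2} \|V + V'\|_2^2 - \frac{d\delta^2}{\sigma^2}\right)\right] \\
  &= \mathbb{E}\left[\exp\left(\frac{\delta^2}{\sigma^2} \langle V, V' \rangle\right)\right] \\
  &\le \exp\left(\frac{d\delta^4}{2\sigma^4}\right),
\end{align*}
where the final key inequality follows because an individual $U \sim \mathrm{Uniform}(\{\pm 1\})$ is 1-sub-Gaussian,
and $\langle V, V' \rangle$ is thus $d$-sub-Gaussian.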
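The last step can be checked by brute force in small dimension: since the coordinates of $V$ and $V'$ are independent signs, $\mathbb{E}[\exp(\lambda \langle V, V' \rangle)] = \cosh(\lambda)^d \le \exp(d\lambda^2/2)$ with $\lambda = \delta^2/\sigma^2$. The sketch below enumerates the hypercube; function names and the sample values of `d` and `lam` are illustrative assumptions.

```python
import math
from itertools import product

def mgf_inner_product(d: int, lam: float) -> float:
    """E[exp(lam * <V, V'>)] for V, V' independent and uniform on {-1, +1}^d."""
    cube = list(product((-1, 1), repeat=d))
    total = 0.0
    for v in cube:
        for w in cube:
            total += math.exp(lam * sum(a * b for a, b in zip(v, w)))
    return total / len(cube) ** 2

def sub_gaussian_bound(d: int, lam: float) -> float:
    """The bound exp(d * lam^2 / 2) used in the proof, with lam = delta^2 / sigma^2."""
    return math.exp(d * lam ** 2 / 2.0)
```

Enumeration costs $4^d$ terms, so this is only practical for small `d`; it is meant as a check of the algebra, not as a computational tool.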
2. Multiple testing: say we have d distinct p-values Uj . Then set Zj = Φ−1 (Uj ). Under
the null that Uj ∼ Uniform[0, 1] these are i.i.d. N(0, 1). Alternatives then deviate from
this. Often interesting to consider other alternatives (sparse/dense/etc.)
Let us give one example to show how the mixture approach suggested by Lemma 10.1.3 works,
along with showing that a more naive approach using the two point method of Chapter 8.3 fails
to provide the correct bounds. After this we will further develop the techniques. We motivate the
example by considering regression problems, then simplify it to a more stylized and easily workable
form. Suppose we wish to estimate the best possible loss achievable in a regression problem,
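For simplicity, assume that $X \sim \mathsf{N}(0, I_d)$, and that the “base” distribution $P_0$ is simply that $Y \sim \mathsf{N}(0, 1)$,
while the alternatives are that $Y = X^\top \theta^\star + \sqrt{1 - \|\theta^\star\|_2^2}\, \varepsilon$, where $\varepsilon \sim \mathsf{N}(0, 1)$ and $\|\theta^\star\|_2^2 \le 1$. In
either case we have $Y \sim \mathsf{N}(0, 1)$ marginally, while
\[
  \inf_\theta \mathbb{E}_0[(X^\top \theta - Y)^2] = 1 \quad \text{and} \quad \inf_\theta \mathbb{E}_{\theta^\star}[(X^\top \theta - Y)^2] = 1 - \|\theta^\star\|_2^2,
\]
so that estimating the final risk is equivalent to estimating the squared $\ell_2$-norm $\|\theta^\star\|_2^2$.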
To make the calculations more palatable, let us assume the simpler Gaussian sequence model
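\[
  Y = \theta^\star + \varepsilon, \quad \varepsilon \sim \mathsf{N}(0, \sigma^2 I_n) \tag{10.1.2}
\]
where $\theta^\star \in \mathbb{R}^n$ satisfies $\|\theta^\star\|_2 \le r$ for some radius $r$, and we wish to estimate the statistic
\[
  T(P) := \|\theta^\star\|_2^2.
\]
Note that $\mathbb{E}[\|Y\|_2^2] = \|\theta^\star\|_2^2 + n\sigma^2$, so that a natural estimator is the debiased quantity
\[
  T_n := \|Y\|_2^2 - n\sigma^2.
\]
That is, the family $\mathcal{P}_{\sigma,r}$ defined as Gaussian sequence models (10.1.2) with variance $\sigma^2$ and $\|\theta^\star\|_2^2 \le r^2$ satisfies
\[
  \mathfrak{M}_n(T(\mathcal{P}_{\sigma,r}), |\cdot|) \le \sqrt{2n\sigma^4 + r^2\sigma^2} \le \sqrt{2n}\, \sigma^2 + r\sigma. \tag{10.1.3}
\]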
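The debiased estimator is simple enough to simulate. The following sketch draws from the Gaussian sequence model (10.1.2) and averages $T_n$ over repeated trials to illustrate that it is centered at $\|\theta^\star\|_2^2$; the function names, the seed, and any particular $\theta^\star$ passed in are illustrative assumptions.

```python
import random

def debiased_norm_estimate(y, sigma):
    """T_n = ||Y||_2^2 - n * sigma^2 for an observation vector y."""
    return sum(v * v for v in y) - len(y) * sigma ** 2

def simulate_mean_estimate(theta, sigma, trials, seed=0):
    """Average of T_n over independent draws Y = theta + N(0, sigma^2 I_n) noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = [t + rng.gauss(0.0, sigma) for t in theta]
        total += debiased_norm_estimate(y, sigma)
    return total / trials
```

For example, with `theta = [2.0] + [0.0] * 49` and `sigma = 1.0`, the average over a few thousand trials should land near 4, with fluctuations governed by the variance of $\|Y\|_2^2$.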
We first provide the more naive approach. Suppose that we were to use Le Cam’s two-point
method to achieve a lower bound in this case. The minimax risk from inequality (8.3.3) shows that
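(for a numerical constant $c > 0$), if $P_0$ and $P_1$ are (respectively) $\mathsf{N}(\theta_0, \sigma^2 I_n)$ and $\mathsf{N}(\theta_1, \sigma^2 I_n)$, then
for any choice of $\theta_0, \theta_1$ we have
\[
  \mathfrak{M}_n(T(\mathcal{P}_{\sigma,r}), |\cdot|) \ge \frac{1}{4} \big\{ \|\theta_0\|_2^2 - \|\theta_1\|_2^2 \big\} \cdot \big[ 1 - \|P_0 - P_1\|_{\mathrm{TV}} \big]. \tag{10.1.4}
\]
Recalling Pinsker's inequality (Proposition 2.2.8), we have
\[
  1 - \|P_0 - P_1\|_{\mathrm{TV}} \ge 1 - \frac{1}{\sqrt{2}} \sqrt{D_{\mathrm{kl}}(P_0 \| P_1)} = 1 - \frac{1}{2} \frac{\|\theta_0 - \theta_1\|_2}{\sigma}.
\]
So whenever $\|\theta_0 - \theta_1\|_2 \le \sigma$, we have
\[
  \mathfrak{M}_n(T(\mathcal{P}_{\sigma,r}), |\cdot|) \ge \frac{1}{8} \big( \|\theta_0\|_2^2 - \|\theta_1\|_2^2 \big).
\]
Take any $\theta_0$ such that $\|\theta_0\|_2 = r$ and $\theta_1 = (1 - t)\theta_0$, then choose the largest $t \in [0, 1]$ such that
$\|\theta_0 - \theta_1\|_2 = tr \le \sigma$. The choice $t = \min\{1, \frac{\sigma}{r}\}$ then gives that
\[
  \|\theta_0\|_2^2 - \|\theta_1\|_2^2 = r^2 (1 - (1 - t)^2) = r^2 (2t - t^2) = 2 \min\{r^2, r\sigma\} - \min\{r^2, \sigma^2\} \ge \min\{r^2, \sigma r\}.
\]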
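The final chain of (in)equalities for the two-point construction can be checked numerically over a grid of radii and noise levels. The function names and the grid below are illustrative assumptions, not part of the notes.

```python
def two_point_gap(r: float, sigma: float) -> float:
    """||theta_0||^2 - ||theta_1||^2 with ||theta_0|| = r, theta_1 = (1 - t) theta_0, t = min{1, sigma/r}."""
    t = min(1.0, sigma / r)
    return r ** 2 * (2 * t - t ** 2)

def closed_form_gap(r: float, sigma: float) -> float:
    """The equivalent expression 2 min{r^2, r sigma} - min{r^2, sigma^2}."""
    return 2 * min(r ** 2, r * sigma) - min(r ** 2, sigma ** 2)

def gap_lower_bound(r: float, sigma: float) -> float:
    """The claimed lower bound min{r^2, sigma r}."""
    return min(r ** 2, sigma * r)
```

When $\sigma \ge r$ the choice is $t = 1$ and the gap equals $r^2$; otherwise the gap is $2\sigma r - \sigma^2 \ge \sigma r$.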
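for $\overline{P} = \frac{1}{2^n} \sum_{v \in \mathcal{V}} P_v$. Substituting the result of Lemma 10.1.4 into the minimax lower bound, we
obtain
\[
  \mathfrak{M}_n(T(\mathcal{P}_{\sigma,r}), |\cdot|) \ge \frac{\delta^2 n}{2} \left( 1 - \sqrt{\frac{n\delta^4}{4\sigma^4}} \right).
\]
We choose $\delta$ so that the (implied) probability of error in the hypothesis test from which our
reduction follows is at least $\frac{1}{2}$, for which it evidently suffices to take $\delta = \frac{\sigma}{n^{1/4}}$. Putting all the pieces
together, we achieve the minimax lower bound
\[
  \mathfrak{M}_n(T(\mathcal{P}_{\sigma,r}), |\cdot|) \ge \frac{\delta^2 n}{4} = \frac{\sigma^2 \sqrt{n}}{4}. \tag{10.1.6}
\]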
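The arithmetic behind the choice $\delta = \sigma / n^{1/4}$ can be confirmed directly: with this choice the square-root term equals $1/2$, leaving $\sigma^2 \sqrt{n} / 4$. The sketch below (function names are illustrative) evaluates both sides.

```python
import math

def mixture_lower_bound(delta: float, sigma: float, n: int) -> float:
    """(delta^2 n / 2) * (1 - sqrt(n delta^4 / (4 sigma^4)))."""
    return 0.5 * delta ** 2 * n * (1.0 - math.sqrt(n * delta ** 4 / (4.0 * sigma ** 4)))

def bound_at_chosen_delta(sigma: float, n: int) -> float:
    """Evaluate the bound at delta = sigma / n^{1/4}."""
    delta = sigma / n ** 0.25
    return mixture_lower_bound(delta, sigma, n)
```

For any `sigma` and `n`, `bound_at_chosen_delta(sigma, n)` should match `sigma ** 2 * math.sqrt(n) / 4` up to rounding.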
Comparing the result from the upper bound (10.1.3), we see that at least in the regime that the
radius $r$ scales at most as $\sigma\sqrt{n}$, the mixture Le Cam method allows us to characterize the minimax
risk of estimation of $\|\theta\|_2^2$ in a Gaussian sequence model.
By combining the result (10.1.3) with the more naive two-point lower bound (10.1.5), which is
valid in “large radius” regimes, we have actually characterized the minimax risk.
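Corollary 10.1.5. Let $\mathcal{P}_{\sigma,r}$ be the Gaussian sequence model family $\{\mathsf{N}(\theta, \sigma^2 I_n) \mid \|\theta\|_2 \le r\}$, and
$T(\theta) = \|\theta\|_2^2$. Then there is a numerical constant $c > 0$ such that the minimax absolute error
satisfies
\[
  c \big( \sigma^2 \sqrt{n} + r\sigma \big) \le \mathfrak{M}_n(T(\mathcal{P}_{\sigma,r}), |\cdot|) \le \sqrt{2n\sigma^4 + r^2\sigma^2}.
\]
Proof The only thing to recognize is that $r\sigma \ge \sigma^2\sqrt{n}$ whenever $r \ge \sigma\sqrt{n}$, in which case
$\min\{r^2, \sigma r\} = \sigma r$ in the bound (10.1.5).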
that is, the sum of the worst-case probabilities that the test is incorrect. (We also use the notation
$R(\Psi \mid H_0, H_1)$ to denote the same quantity.) In the scenarios we consider, we will assume a metric
$\rho$ on the family of distributions $\mathcal{P}$, and instead of the general hypothesis test (10.2.1), we will
consider testing whether $P \in \mathcal{P}_0$ or $\rho(P, P_0) \ge \epsilon$ for all $P_0 \in \mathcal{P}_0$, giving the variant
\[
  \begin{aligned}
    H_0 &: P \in \mathcal{P}_0 \\
    H_1 &: P \in \mathcal{P}_1(\epsilon) := \{P \in \mathcal{P} \text{ s.t. } \rho(P, P_0) \ge \epsilon \text{ for all } P_0 \in \mathcal{P}_0\}
  \end{aligned}
  \tag{10.2.2}
\]
In this case, we can define the risk at distance $\epsilon$ for a sample of size $n$ by
leaving $\mathcal{P}_0$ and $\mathcal{P}$ implicit in the definition, and where we let $X_1^n \stackrel{\mathrm{iid}}{\sim} P$. From this, we can define
the minimax test risk
\[
  \inf_\Psi R_n(\Psi, \epsilon).
\]
We then ask for the particular thresholds at which the minimax test risk becomes small or
large. Thus, while the coming definition allows some ambiguity, we say that a sequence $\epsilon_n$ is a
minimax threshold or critical testing radius for the testing problem (10.2.2) if there exist numerical
constants $0 < c \le C < \infty$ such that
\[
  \inf_\Psi R_n(\Psi, C\epsilon_n) \le \frac{1}{3} \quad \text{and} \quad \inf_\Psi R_n(\Psi, c\epsilon_n) \ge \frac{2}{3}. \tag{10.2.4}
\]
Here, the constants $\frac{1}{3}$ and $\frac{2}{3}$ are unimportant, the point being that for separation at most $c\epsilon_n$,
no hypothesis test can test whether the distribution $P$ satisfies $P \in \mathcal{P}_0$ or $\inf_{P_0 \in \mathcal{P}_0} \rho(P, P_0) \ge c\epsilon_n$
with reasonable accuracy. But it is possible to test whether $P \in \mathcal{P}_0$ or $\inf_{P_0 \in \mathcal{P}_0} \rho(P, P_0) \ge C\epsilon_n$
with reasonable accuracy. Moreover, we can make the probability of error exponentially small by
increasing the sample size by a constant factor, as Exercise 10.2 explores.
Conveniently, the minimax test risk has a precise divergence-based form, to which we can apply
the techniques comparing different divergences we have developed. In particular, we have the
following analogue of Le Cam’s convex hull lower bound in Theorem 10.1.1, which provides the
same fundamental quantity (the variation distance between convex hulls of P0 and P1 ) for lower
bounds, except that it applies for testing.
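Proposition 10.2.1 (Convex hull lower bounds in testing). For any classes $\mathcal{P}_0$ and $\mathcal{P}_1$, the minimax
test risk satisfies
\[
  \inf_\Psi R(\Psi \mid \mathcal{P}_0, \mathcal{P}_1) \ge 1 - \sup\Big\{ \big\|\overline{P}_0 - \overline{P}_1\big\|_{\mathrm{TV}} \;\Big|\; \overline{P}_0 \in \mathrm{Conv}(\mathcal{P}_0),\ \overline{P}_1 \in \mathrm{Conv}(\mathcal{P}_1) \Big\}.
\]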
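\[
  R(\Psi \mid \mathcal{P}_0, \mathcal{P}_1) \ge \overline{P}_0(\Psi \neq 0) + \overline{P}_1(\Psi \neq 1)
\]
because suprema are always at least as large as averages. Now note that the set $A = \{x \mid \Psi(x) = 0\}$
satisfies
\[
  \overline{P}_0(\Psi \neq 0) + \overline{P}_1(\Psi \neq 1) = \overline{P}_0(A^c) + \overline{P}_1(A) = 1 - \big(\overline{P}_0(A) - \overline{P}_1(A)\big),
\]
and take an infimum over regions $A$.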
In fact, equality typically holds in Proposition 10.2.1, but this requires the application of (infinite
dimensional) convex duality, which is beyond our scope here.
for each P1 ∈ P1 . Typically, we choose statistics T so that EP0 [T ] = 0 for each P0 in the null P0
(though this is not always possible). The next proposition shows how to define a test that leverages
this to achieve small worst-case test error.
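Proposition 10.2.2. Let the statistic $T : \mathcal{X} \to \mathbb{R}$ robustly $C$-separate $\mathcal{P}_0$ from $\mathcal{P}_1$. Then for the
threshold $\tau = \sup_{P_0 \in \mathcal{P}_0} \mathbb{E}_{P_0}[T] + \sup_{P_0 \in \mathcal{P}_0} \sqrt{\mathrm{Var}_{P_0}(T)}$, the test
\[
  \Psi(X) := 1\{T \ge \tau\}
\]
satisfies
\[
  R(\Psi \mid \{P_0\}, \mathcal{P}_1) \le \frac{2}{C^2}.
\]
Proof Without loss of generality we assume $\sup_{P_0 \in \mathcal{P}_0} \mathbb{E}_{P_0}[T] = 0$, as the test is invariant to shifts,
so that $\tau = \sup_{P_0 \in \mathcal{P}_0} \sqrt{\mathrm{Var}_{P_0}(T)}$. We can also assume that $C \ge 1$, as otherwise the proposition is
vacuous. We control the test error in each case. Under any null $P_0$, we have
\[
  P_0(\Psi \neq 0) = P_0(T \ge \tau) \le \frac{\mathrm{Var}_0(T)}{C^2 \tau^2} = \frac{1}{C^2}.
\]
For the alternatives under $P_1 \in \mathcal{P}_1$, we have
\[
  P_1(\Psi \neq 1) = P_1(T \le \tau) = P_1(T - \mathbb{E}_1[T] \le \tau - \mathbb{E}_1[T]) \le \frac{\mathrm{Var}_1(T)}{[\mathbb{E}_1[T] - \tau]_+^2}.
\]
But of course,
\[
  \mathbb{E}_1[T] - \tau = \mathbb{E}_1[T] - \sup_{P_0} \mathbb{E}_{P_0}[T] - \sup_{P_0} \sqrt{\mathrm{Var}_{P_0}(T)} \ge C \sqrt{\mathrm{Var}_1(T)} + (C - 1) \sup_{P_0} \sqrt{\mathrm{Var}_{P_0}(T)}
\]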
\[
  \begin{aligned}
    H_0 &: P = P_0 = \mathsf{N}(0, I_d) \\
    H_1 &: P \in \mathcal{P}_1(r) := \{\mathsf{N}(\theta, I_d) \mid \|\theta\|_2 \ge r\}.
  \end{aligned}
  \tag{10.2.6}
\]
That is, we are interested in whether $X \sim P$ has a mean $\theta$ separated by at least $r$ from the all-zeros
vector. The problem is to find the critical radius $r$ at which testing between $\mathcal{P}_0 = \{P_0\}$
and $\mathcal{P}_1$ becomes feasible (or infeasible). ◇
Example 10.2.4 (A global null in multiple hypothesis testing): Consider the problem of
testing $d$ distinct null hypotheses $H_{0,j}$, $j = 1, \ldots, d$, where for each we have a p-value $Y_j$ and
reject $H_{0,j}$ if $Y_j \le \tau$ for a threshold $\tau$. (Recall that a p-value is a random variable $Y$ that is
sub-uniform, meaning that $P(Y \le u) \le P(U \le u)$ for $U \sim {\rm Uniform}[0, 1]$, so we are less likely
to reject at threshold $\tau$ than a uniform would be.) If we assume the $Y_j$ are exact p-values, that
is, $P(Y_j \le u) = u$ for $u \in [0, 1]$, then testing the global independent null
\[
  H_0 := \bigcap_{j=1}^d H_{0,j} = \left\{ \mbox{each}~ Y_j \stackrel{\rm iid}{\sim} {\rm Uniform}[0, 1] \right\}
\]
is equivalent to Gaussian signal detection. Indeed, let $Z_j = \Phi^{-1}(Y_j)$, where $\Phi$ denotes the
standard Gaussian cumulative distribution function. Then under the global null $H_0$, we have
$Z \sim \mathsf{N}(0, I_d)$.
The question of which alternative class $\mathcal{P}_1$ to consider is then frequently a matter of applications.
For example, we might be curious about alternatives for which a few nulls $H_{0,j}$ are false,
that is, sparse alternatives. Example 10.2.3 corresponds to something like dense alternatives.
◇
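The transformation $Z_j = \Phi^{-1}(Y_j)$ in Example 10.2.4 is easy to try numerically. The following is a minimal sketch, assuming exact (uniform) p-values simulated with a fixed seed; the helper name `p_values_to_z` and the choice $d = 2000$ are illustrative, and the guard against a p-value of exactly zero is only there because the inverse c.d.f. is undefined at the endpoints.

```python
import random
from statistics import NormalDist

def p_values_to_z(p_values):
    """Map p-values to z-scores via the standard Gaussian inverse c.d.f."""
    std_normal = NormalDist(0.0, 1.0)
    return [std_normal.inv_cdf(y) for y in p_values]

rng = random.Random(0)
d = 2000
# Exact p-values under the global null are i.i.d. Uniform[0, 1];
# random() may return exactly 0.0, where inv_cdf is undefined, so clip.
p_values = [max(rng.random(), 1e-12) for _ in range(d)]
z = p_values_to_z(p_values)

mean_z = sum(z) / d
var_z = sum((v - mean_z) ** 2 for v in z) / (d - 1)
print(f"sample mean {mean_z:.3f}, sample variance {var_z:.3f}")
```

Under the global null the sample mean and variance of the $Z_j$ should be close to 0 and 1, consistent with $Z \sim \mathsf{N}(0, I_d)$.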
With these as motivation, let us consider Example 10.2.3 more carefully, in an effort to find the
critical radius $r$ at which minimax testing becomes feasible (or infeasible). While our standard
techniques for estimation tell us that the minimax rate for estimating $\theta$ in a normal location family
$\mathcal{P} = \{\mathsf{N}(\theta, \sigma^2 I_d)\}_{\theta \in \mathbb{R}^d}$ (say, in mean squared error) necessarily scales as
\[
  \mathfrak{M}_n(\theta(\mathcal{P}), \|\cdot\|_2^2) = \frac{d \sigma^2}{n},
\]
we can test whether the mean of a Gaussian is zero at a smaller dimensionality: while
$E[\|\hat{\theta} - \theta\|_2^2] \to 0$ as $n \to \infty$ if and only if $d/n \to 0$, in the testing case we can save a
dimension-dependent factor $\sqrt{d}$. In particular, the next two examples (one addressing achievability and one
the fundamental limit) show that in the dense Gaussian signal detection problem of Example 10.2.3,
the critical test radius (10.2.4) at which testing is feasible or infeasible scales as
\[
  r_n := \frac{d^{1/4}}{\sqrt{n}}.
\]
We can achieve (asymptotically) accurate testing in the dense signal detection problem (10.2.6) if
and only if $\sqrt{d}/n \to 0$ as $n \to \infty$.
We first demonstrate achievability in Example 10.2.3, leveraging Proposition 10.2.2.
Example 10.2.5 (Achievability in Gaussian mean testing): We wish to test the alternatives (10.2.6).
We use the approach of Proposition 10.2.2: find an estimator of $\|\theta\|_2^2$, and
then threshold it for our test. The discussion preceding Corollary 10.1.5 (specifically
equation (10.1.3)) shows that given a sample of size $n$, the estimator $T_n = \|\bar{X}_n\|_2^2 - d/n$ is unbiased
for $\|\theta\|_2^2$ and satisfies
\[
  E_\theta\left[(T_n - \|\theta\|_2^2)^2\right] = {\rm Var}_\theta(T_n) \le \frac{2d}{n^2} + \frac{\|\theta\|_2^2}{n}.
  \tag{10.2.7}
\]
the statistic $T_n$ robustly 2-separates $P_0$ from $\mathcal{P}_1(r)$ (recall definition (10.2.5)) whenever
\[
  \frac{1}{2} \|\theta\|_2^2 \ge \left( \frac{\sqrt{2d}}{n} + \sqrt{\frac{2d}{n^2} + \frac{1}{n} \|\theta\|_2^2} \right)
\]
for all $\theta$ with $\|\theta\|_2 \ge r$. Immediately we see that if we take radius $r^2 = C \sqrt{d}/n$ for some $C > 0$,
then this separation occurs if $C \sqrt{d} \ge 2(\sqrt{2d} + \sqrt{2d + C \sqrt{d}})$, which of course happens for a large
enough constant $C$. Applying Proposition 10.2.2, we thus see that the test $\Psi(X_1^n) = 1\{T_n \ge \sqrt{2d/n^2}\}$
satisfies
\[
  R_n(\Psi, C r_n) \le \frac{1}{3} ~~ \mbox{for} ~~ r_n = \frac{d^{1/4}}{\sqrt{n}},
\]
which gives the achievability required for the critical test radius (10.2.4). ◇
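To get a feel for the test in Example 10.2.5, one can simulate it. The sketch below is only an illustration under stated assumptions: it draws the sample mean $\bar{X}_n \sim \mathsf{N}(\theta, I_d/n)$ directly, uses a dense alternative with $\|\theta\|_2^2 = C\sqrt{d}/n$ for a hypothetical constant $C = 8$, and reports empirical rejection frequencies rather than verifying any stated bound. All names and parameter values are made up for the illustration.

```python
import math
import random

def statistic(xbar, n):
    """T_n = ||xbar||_2^2 - d/n, which is unbiased for ||theta||_2^2."""
    d = len(xbar)
    return sum(v * v for v in xbar) - d / n

def mean_test(xbar, n):
    """Reject (return 1) when T_n >= sqrt(2d / n^2)."""
    d = len(xbar)
    return 1 if statistic(xbar, n) >= math.sqrt(2 * d) / n else 0

def sample_mean(theta, n, rng):
    """Draw the mean of n i.i.d. N(theta, I_d) vectors, i.e. N(theta, I_d / n)."""
    return [t + rng.gauss(0.0, 1.0) / math.sqrt(n) for t in theta]

def rejection_rate(theta, n, trials, rng):
    """Fraction of simulated samples on which the test rejects."""
    return sum(mean_test(sample_mean(theta, n, rng), n) for _ in range(trials)) / trials

rng = random.Random(1)
d, n, trials = 100, 50, 400
c = 8.0                              # hypothetical separation constant
r_squared = c * math.sqrt(d) / n     # squared radius r^2 = C sqrt(d) / n
theta = [math.sqrt(r_squared / d)] * d   # dense alternative, ||theta||_2^2 = r^2

null_rate = rejection_rate([0.0] * d, n, trials, rng)
alt_rate = rejection_rate(theta, n, trials, rng)
print(f"rejection rate under H0: {null_rate:.3f}, under H1: {alt_rate:.3f}")
```

Note that the per-coordinate signal here is $\sqrt{r^2/d} \approx 0.13$, far too small to estimate $\theta$ accurately when $d/n = 2$, yet the test still separates the two hypotheses.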
Example 10.2.5 shows that at the critical radius $r_n = d^{1/4}/\sqrt{n}$, it is possible (in a worst-case sense)
to test between the null $H_0 : \mathsf{N}(0, I_d)$ and alternatives $H_1 : \mathsf{N}(\theta, I_d)$ for $\|\theta\|_2 \ge C r_n$, where $C$ is a
numerical constant. We can also provide the converse.
Example 10.2.6 (Lower bounds in Gaussian mean testing): Let $\mathcal{P}_1(r) = \{\mathsf{N}(\theta, I_d) \mid \|\theta\|_2 \ge r\}$
be a collection of Gaussians with means $r$ away from the origin in $\ell_2$-norm. We seek the critical
radius $r$ below which it is impossible to distinguish between $P_0 = \mathsf{N}(0, I_d)$ and $P_1 \in \mathcal{P}_1(r)$ given
an i.i.d. sample $X_1^n$. Lemma 10.1.4 and Proposition 10.2.1 combine (set $\sigma^2 = \frac{1}{n}$ and $\delta^2 = \frac{r^2}{d}$
in Lemma 10.1.4) to give
\[
  \inf_\Psi R_n(\Psi \mid P_0, \mathcal{P}_1(r)) \ge 1 - \frac{1}{\sqrt{2}} \left[ \exp\left( \frac{n^2 r^4}{2d} \right) - 1 \right].
\]
In particular, the threshold $r^2 = \sqrt{d}/n$ means that there is necessarily constant test error
probability $R_n \ge 1 - \frac{1}{\sqrt{2}} (\sqrt{e} - 1) > .54$. Combining the estimation guarantee with this lower
bound shows that the critical radius (10.2.4) for testing $H_0 : \mathsf{N}(0, I_d)$ against the family of
alternatives $H_1 : \mathsf{N}(\theta, I_d)$ with $\|\theta\|_2 \ge r$ is precisely $r^2 = \sqrt{d}/n$. ◇
It turns out that even in what might appear to be a particularly simple case, that of multinomial
distributions, where we identify the distribution $P$ with a probability mass function (p.m.f.) $p \in \Delta_d$,
a surprising amount of complexity arises. We thus work through two examples on testing
distance between discrete distributions by considering two metrics on the probability mass functions:
the $\ell_2$-metric and the total variation distance (or $\ell_1$ metric). Then $\rho(p, q) = \|p - q\|$ for $\|\cdot\| = \|\cdot\|_2$
or $\|\cdot\| = \|\cdot\|_1$. In the uniformity testing case, we let $p_0 = \frac{1}{d} 1$ be the uniform distribution on $[d]$,
and we seek the critical threshold at which testing
\[
  \|p - p_0\| = 0 ~~ \mbox{versus} ~~ \|p - p_0\| \ge \epsilon
\]
from $n$ i.i.d. observations $X_i \stackrel{\rm iid}{\sim} p$ becomes feasible or infeasible.
It is simpler (for analyzing procedures) to consider a slight variant of this problem, which
uses the Poissonization trick. To motivate the idea, identify the observations $X_i$ with the basis
vectors (so that observing item $j \in \{1, \ldots, d\}$ corresponds to $X_i = e_j$). Then the sample mean
$\hat{p} = \frac{1}{n} \sum_{i=1}^n X_i$ is unbiased, but its coordinates exhibit dependence in that $\langle 1, \hat{p} \rangle = 1$, an annoyance
for analyses. Thus, we consider an alternative approach, where we assume a two-stage sampling
procedure: we first draw $N \sim {\rm Poi}(n)$, and then conditional on $N = m$, draw $X_i \stackrel{\rm iid}{\sim} p$, $i = 1, \ldots, m$.
As $E[N] = n$ and $N$ concentrates around its mean, this is nearly equivalent to simply observing
$X_i \stackrel{\rm iid}{\sim} p$ for $i = 1, \ldots, n$, and a standard probabilistic calculation shows that the distribution of
$\{X_i\}_{i=1}^N$ conditional on $N = m$ is identical to the distribution of $X_i \stackrel{\rm iid}{\sim} p$, $i = 1, \ldots, m$.
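The effect of the two-stage sampling procedure on the dependence between coordinates can be seen in a small simulation. The sketch below is an illustration under assumptions: the Poisson sampler is Knuth's multiplication method (adequate for moderate $n$; it is not taken from the notes), and the p.m.f., sample size, and seed are arbitrary. With a fixed sample size the counts of two categories are negatively correlated, with covariance $-n p_1 p_2$, while under Poissonization the empirical covariance should be near zero.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method for Poisson(lam); fine for moderate lam."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def multinomial_counts(p, m, rng):
    """Per-category counts from m i.i.d. draws from the p.m.f. p."""
    counts = [0] * len(p)
    for item in rng.choices(range(len(p)), weights=p, k=m):
        counts[item] += 1
    return counts

def poissonized_counts(p, n, rng):
    """Draw N ~ Poi(n), then N i.i.d. items from p; return (N, counts)."""
    total = sample_poisson(n, rng)
    return total, multinomial_counts(p, total, rng)

def covariance(xs, ys):
    """Unbiased sample covariance of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(2)
p, n, trials = [0.5, 0.3, 0.2], 30, 4000
poi = [poissonized_counts(p, n, rng)[1] for _ in range(trials)]
mult = [multinomial_counts(p, n, rng) for _ in range(trials)]
poisson_cov = covariance([c[0] for c in poi], [c[1] for c in poi])
multinomial_cov = covariance([c[0] for c in mult], [c[1] for c in mult])
print(f"Poissonized covariance {poisson_cov:.2f}, fixed-n covariance {multinomial_cov:.2f}")
```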
Even more, the minimax risk for estimation in this Poissonized sampling scheme is similar to
that for estimation in the original multinomial setting. Indeed, suppose that we wish to estimate
an abstract statistic T (p) of p ∈ ∆d , and assume for simplicity that T (p) ∈ [−r, r] for some fixed r.
Define the minimax and Poissonized minimax risks
and
\[
  \mathfrak{M}_{{\rm Poi}(n)} := \inf_{\{T_m\}} \sup_{p \in \Delta_d} E_p\left[ (T_N(X_1^N) - T(p))^2 \right],
\]
where the latter expectation is taken over the sample size $N \sim {\rm Poi}(n)$, and $\{T_m\}$ denotes a sequence
of estimators (defined for all sample sizes $m$). We have the following proposition, which shows that
if we can provide procedures that work in the Poissonized (independent sampling) setting, then the
standard multinomial sampling setting is similarly easy (or challenging).
Proposition 10.2.7. There exist numerical constants 0 < c, C < ∞ such that
This is equivalent to sampling $n \hat{p}_j \stackrel{\rm ind}{\sim} {\rm Poi}(n p_j)$ and $n \hat{q}_j \stackrel{\rm ind}{\sim} {\rm Poi}(n q_j)$, $j = 1, \ldots, d$, and so we use
the quantities (10.2.9) to define an estimator we can threshold using Proposition 10.2.2. We work
through this in the next (somewhat complicated) example.
Example 10.2.8 (Estimating the $\ell_2$-distance between multinomials): For the estimators (10.2.9),
define the quantity
\[
  Z_j := (n \hat{p}_j - n \hat{q}_j)^2 - n \hat{p}_j - n \hat{q}_j.
\]
Recalling that if $W \sim {\rm Poi}(\lambda)$ then $E[W] = {\rm Var}(W) = \lambda$, we have $E[n \hat{p}_j] = n p_j$ and ${\rm Var}(n \hat{p}_j) = n p_j$, so
\[
  \begin{aligned}
    E[Z_j] & = E[(n \hat{p}_j)^2] + E[(n \hat{q}_j)^2] - 2 n^2 p_j q_j - n p_j - n q_j \\
    & = {\rm Var}(n \hat{p}_j) + {\rm Var}(n \hat{q}_j) + (n p_j)^2 + (n q_j)^2 - 2 n^2 p_j q_j - n p_j - n q_j = n^2 (p_j - q_j)^2,
  \end{aligned}
\]
so that $E[\langle 1, Z \rangle] = n^2 \|p - q\|_2^2$,
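and
\[
  {\rm Var}(\langle 1, Z \rangle) \le 4 n^3 \|p - q\|_4^2 \|p + q\|_2 + 2 n^2 \|p + q\|_2^2.
\]
Under the (non-point) null $H_0 : p = q$, ${\rm Var}(\langle 1, Z \rangle) = 2 n^2 \|p + q\|_2^2 \le 8 n^2$, as $\sup_{p,q} \|p + q\|_2 = 2$.
Proposition 10.2.2 thus shows that if
\[
  \|p - q\|_2^2 \ge C \left[ \sqrt{\frac{8}{n^2}} + \sqrt{\frac{16 \|p - q\|_4^2}{n} + \frac{8}{n^2}} \right],
  \tag{10.2.11}
\]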
Summarizing, we see that if we wish to test whether two multinomials are identical or separated
in $\ell_2$, the critical threshold for the hypothesis test
\[
  \begin{aligned}
    H_0 &: p = q \\
    H_1 &: \|p - q\|_2 \ge \delta
  \end{aligned}
  \tag{10.2.12}
\]
satisfies $\delta \le \frac{1}{\sqrt{n}}$: we can test between $H_0$ and $H_1$ at separations that are essentially "independent"
of the dimension or number of categories $d$. This is in fact sharp, as a relatively straightforward
argument with Le Cam's two-point lemma demonstrates (see Exercise 10.9). However, if we change
the norm $\|\cdot\|_2$ into the $\ell_1$-norm $\|\cdot\|_1$, the story changes significantly.
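The unbiasedness computation in Example 10.2.8 can be verified numerically for a single coordinate. The sketch below sums the (truncated) Poisson probability mass functions exactly rather than simulating; it checks that for independent $W \sim {\rm Poi}(n p_j)$ and $V \sim {\rm Poi}(n q_j)$, the quantity $(W - V)^2 - W - V$ has mean $n^2 (p_j - q_j)^2$. The truncation point, the function names, and the values of $n$, $p_j$, $q_j$ are assumptions chosen for illustration.

```python
import math

def poisson_pmf_table(lam, cutoff):
    """Poisson(lam) probabilities for k = 0, ..., cutoff - 1 (requires lam > 0)."""
    return [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)) for k in range(cutoff)]

def expected_z(lam_p, lam_q, cutoff=150):
    """E[(W - V)^2 - W - V] for independent W ~ Poi(lam_p), V ~ Poi(lam_q).

    The sums are truncated at `cutoff`, which is far in the tail for the
    small rates used here.
    """
    pmf_p = poisson_pmf_table(lam_p, cutoff)
    pmf_q = poisson_pmf_table(lam_q, cutoff)
    return sum(pw * pv * ((w - v) ** 2 - w - v)
               for w, pw in enumerate(pmf_p)
               for v, pv in enumerate(pmf_q))

n, p_j, q_j = 20, 0.3, 0.1
value = expected_z(n * p_j, n * q_j)
target = n ** 2 * (p_j - q_j) ** 2
print(f"E[Z_j] = {value:.6f}, n^2 (p_j - q_j)^2 = {target:.6f}")
```

Summing over coordinates then gives the mean $n^2 \|p - q\|_2^2$ for $\langle 1, Z \rangle$ that the example thresholds.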
Let us change the hypothesis test (10.2.12) to a simpler-looking $\ell_1$-based variant, simpler in that we
only test goodness of fit. Identifying distributions $P$ on $\{1, \ldots, d\}$ with their p.m.f.s $p \in \Delta_d$, let $P_0$ be
the uniform distribution on $\{1, \ldots, d\}$, with p.m.f. $p_0 = \frac{1}{d} 1$. Then we consider the testing problem
\[
  \begin{aligned}
    H_0 &: p = p_0 \\
    H_1 &: \|p - p_0\|_1 \ge \delta,
  \end{aligned}
  \tag{10.2.13}
\]
which tests the $\ell_1$-distance to uniformity. In this case, developing a test that distinguishes these
hypotheses at the optimal rate is quite sophisticated, though we outline an approach to it in the
exercises. Developing the correct order of lower bound, that is, a threshold $\delta$ for which no test can
reliably distinguish $H_0$ from $H_1$, is possible via the mixture of $\chi^2$-distributions approach we have
developed in Lemma 10.1.3.
JCD Comment: Should I just do these as lemmas / propositions rather than examples?
They’re a bit involved for examples!
Proposition 10.2.9 (A lower bound for testing $\ell_1$-separated multinomials). In the testing
problem (10.2.13),
\[
  \inf_\Psi R_n(\Psi \mid H_0, H_1) \ge 1 - \frac{1}{\sqrt{2}}
\]
whenever $\delta \le \frac{d^{1/4}}{\sqrt{n}}$.
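Proof We construct a particular packing of the probability simplex $\Delta_d \subset \mathbb{R}^d_+$ that guarantees
that the divergence between elements of $H_0$ and $H_1$ in the test (10.2.13) is small. For simplicity,
we assume $d$ is even, as it changes nothing. For the base distribution $P_0$ take p.m.f. $p_0 = \frac{1}{d} 1$ as
required by the problem (10.2.13). To construct the alternatives, let $\mathcal{V} \subset \{\pm 1\}^d$ be the collection
of $2^{d/2}$ vectors of the form $v = (v', -v')$, where $v' \in \{\pm 1\}^{d/2}$, so that $\langle 1, v \rangle = 0$ for each $v \in \mathcal{V}$.
Then for $\delta \ge 0$ to be chosen, define the p.m.f.s $p_v = \frac{1 + \delta v}{d}$. Identify samples $X \in \{e_1, \ldots, e_d\}$. Then
for any $x \in \{e_j\}$, we have $P_v(X = x) = \frac{1}{d} (1 + \delta \langle v, x \rangle)$, and so for any pair $v, v'$ we have
\[
  \frac{P_v(X = x) P_{v'}(X = x)}{P_0(X = x)^2} = (1 + \delta \langle v, x \rangle)(1 + \delta \langle v', x \rangle).
\]
From this key equality, we see that if $V, V' \stackrel{\rm iid}{\sim} {\rm Uniform}(\mathcal{V})$, then for $\bar{P}^n = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P_v^n$ we have
\[
  \begin{aligned}
    1 + D_{\chi^2}(\bar{P}^n \| P_0^n)
    & = E_0\left[ \prod_{i=1}^n (1 + \delta \langle V, X_i \rangle)(1 + \delta \langle V', X_i \rangle) \right] \\
    & = E\left[ E_0[(1 + \delta \langle V, X \rangle)(1 + \delta \langle V', X \rangle) \mid V, V']^n \right] \\
    & = E\left[ \left( 1 + \frac{\delta^2}{d} \langle V, V' \rangle \right)^n \right],
  \end{aligned}
\]
where the final equality follows because $E_0[\langle v, X \rangle] = \frac{1}{d} \langle v, 1 \rangle = 0$ for each $v \in \mathcal{V}$. Now we use that
$1 + t \le e^t$ for all $t$ to obtain
\[
  1 + D_{\chi^2}(\bar{P}^n \| P_0^n)
  \le E\left[ \exp\left( \frac{n \delta^2}{d} \langle V, V' \rangle \right) \right]
  = E\left[ \exp\left( \frac{2 n \delta^2}{d} \sum_{j=1}^{d/2} U_j \right) \right]
\]
for $U_j \stackrel{\rm iid}{\sim} {\rm Uniform}\{\pm 1\}$. But of course these $U_j$ are 1-sub-Gaussian, so
\[
  1 + D_{\chi^2}(\bar{P}^n \| P_0^n) \le \exp\left( \frac{n^2 \delta^4}{d} \right).
\]
Now use Pinsker's inequalities (Propositions 2.2.8 and 2.2.9), which give $2 \|P_0^n - \bar{P}^n\|_{\rm TV}^2 \le \frac{n^2}{d} \delta^4$.
Choose $\delta^4 = \frac{d}{n^2}$.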
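For small $d$ the mixture computation in the proof of Proposition 10.2.9 can be carried out exactly by enumeration. The following sketch, with illustrative names and hypothetical values $d = 6$, $n = 30$, checks the one-sample identity $E_0[(1 + \delta \langle v, X \rangle)(1 + \delta \langle v', X \rangle)] = 1 + \frac{\delta^2}{d} \langle v, v' \rangle$ for every pair in $\mathcal{V}$ and compares the resulting value of $1 + D_{\chi^2}$ against the bound $\exp(n^2 \delta^4 / d)$ at the choice $\delta^4 = d / n^2$.

```python
import itertools
import math

def balanced_sign_vectors(d):
    """All vectors (v', -v') with v' in {+1, -1}^(d/2); each sums to zero."""
    half = d // 2
    return [list(s) + [-x for x in s] for s in itertools.product([1, -1], repeat=half)]

def one_sample_moment(v, w, delta):
    """E_0[(1 + delta <v, X>)(1 + delta <w, X>)] for X uniform on the basis vectors."""
    d = len(v)
    return sum((1 + delta * a) * (1 + delta * b) for a, b in zip(v, w)) / d

def chi_square_plus_one(d, n, delta):
    """1 + chi-square divergence of the uniform mixture from the null product."""
    vs = balanced_sign_vectors(d)
    total = sum(one_sample_moment(v, w, delta) ** n for v in vs for w in vs)
    return total / len(vs) ** 2

d, n = 6, 30
delta = (d / n ** 2) ** 0.25          # the choice delta^4 = d / n^2
exact = chi_square_plus_one(d, n, delta)
bound = math.exp(n ** 2 * delta ** 4 / d)
print(f"1 + chi-square = {exact:.4f}, sub-Gaussian bound = {bound:.4f}")
```

At this choice of $\delta$ the bound equals $e$, and the exact mixture value sits below it.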
3. Show that for testing, the rate at which we can test really is this modulus whenever
we have linear functions and convex classes, because of Le Cam’s result on Hellinger
affinities.
JCD Comment: Write this section
10.5 Bibliography
JCD Comment: We stole the mixture idea from David Pollard I believe.
Outline
I. Motivation: function values, testing certain quantities (e.g., is $\|P - Q\|_{\rm TV} \ge \epsilon$ or not), entropy
and other quantities, and allows superefficiency guarantees in an elegant way
III. “Best possible” lower bounds, super-efficiency and constrained risk inequalities
while $\int g_v dP_0 = 0$, so that each $P_v$ is a valid distribution. Let $P_v^n$ be the distribution of $n$
observations $X_i \stackrel{\rm iid}{\sim} P_v$, and let $\bar{P}^n = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P_v^n$.
Lemma 10.6.1. Define the inner product $\langle f, g \rangle_P = \int f(x) g(x) dP(x)$ and let $V, V' \stackrel{\rm iid}{\sim} {\rm Uniform}(\mathcal{V})$.
Then
\[
  D_{\chi^2}(\bar{P}^n \| P_0^n) + 1 \le E[\exp(n \langle g_V, g_{V'} \rangle_{P_0})].
\]
Proof The simple technical lemma 10.1.3 essentially gives us the result. We observe that
\[
  D_{\chi^2}(\bar{P}^n \| P_0^n) + 1
  = \frac{1}{|\mathcal{V}|^2} \sum_{v, v'} \int \frac{dP_v^n dP_{v'}^n}{dP_0^n}
  = \frac{1}{|\mathcal{V}|^2} \sum_{v, v'} \left( \int (1 + g_v(x))(1 + g_{v'}(x)) dP_0(x) \right)^n
\]
because $dP_v^n(x_1, \ldots, x_n) = \prod_{i=1}^n (1 + g_v(x_i)) dP_0(x_i)$, so that the integral decomposes into a product
of integrals. Then expanding $(1 + g_v)(1 + g_{v'})$ and noting that each has zero mean under $P_0$ gives
\[
  D_{\chi^2}(\bar{P}^n \| P_0^n) + 1 = \frac{1}{|\mathcal{V}|^2} \sum_{v, v'} (1 + E_0[g_v(X) g_{v'}(X)])^n.
\]
Definition 10.1. Let $k \in \mathbb{N}$ and the functions $\phi_j : \mathcal{X} \to [-b, b]$. Then the functions $\phi_j$ are an
admissible partition with variances $\sigma_j^2$ of $\mathcal{X}$ with respect to a probability distribution $P_0$ if
(ii) Each function has $P_0$ mean 0, i.e., $E_{P_0}[\phi_j(X)] = 0$ for each $j$.
(iii) Function $j$ has variance $\sigma_j^2 = E_{P_0}[\phi_j^2(X)] = \int \phi_j^2(x) dP_0(x)$.
With such a partition, we can define the functions $g_v(x) = t \langle v, \phi(x) \rangle = t \sum_{j=1}^k v_j \phi_j(x)$ for
$|t| \le 1/b$, and if we take $\mathcal{V} = \{-1, 1\}^k$, we obtain the following.
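Lemma 10.6.2. Let the functions $\{\phi_j\}_{j=1}^k$ be an admissible partition of $\mathcal{X}$ with variances $\sigma_j^2$. Fix
$|t| \le \frac{1}{b}$, and let $dP_{tv} = (1 + t \langle v, \phi(x) \rangle) dP_0(x)$ and $\bar{P}_t^n = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P_{tv}^n$. Then
\[
  D_{\chi^2}(\bar{P}_t^n \| P_0^n) \le \exp\left( \frac{n^2 t^4}{2} \sum_{j=1}^k \sigma_j^4 \right) - 1,
\]
\[
  D_{\chi^2}(\bar{P}_t^n \| P_0^n) \le n^2 t^4 \sum_{j=1}^k \sigma_j^4.
\]
Proof First, if $\phi(x) = [\phi_j(x)]_{j=1}^k$, then $E_0[\phi(X) \phi(X)^T] = {\rm diag}(\sigma_j^2)$, that is, the diagonal matrix
with $\sigma_j^2$ on its diagonal. By Lemma 10.6.1, we therefore have
\[
  D_{\chi^2}(\bar{P}_t^n \| P_0^n) + 1
  \le E\left[ \exp\left( n t^2 \sum_{j=1}^k \sigma_j^2 V_j V_j' \right) \right]
  \le E\left[ \exp\left( \frac{n^2 t^4}{2} \sum_{j=1}^k \sigma_j^4 V_j^2 \right) \right]
\]
by Hoeffding's Lemma (see Example 4.1.6), as $V_j \stackrel{\rm iid}{\sim} {\rm Uniform}(\{\pm 1\})$. Noting that $V_j^2 = 1$ gives the
first part of the lemma. The final statement is immediate once we observe that $e^x \le 1 + (e - 1) x \le
1 + 2x$ for $0 \le x \le 1$.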
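The two elementary facts used in the proof of Lemma 10.6.2 are simple to sanity-check on a grid. The sketch below (names and grid ranges are arbitrary choices) verifies numerically that $E[\exp(\lambda V V')] = \cosh(\lambda) \le \exp(\lambda^2 / 2)$ for independent uniform signs $V, V'$, which is the sub-Gaussian bound Hoeffding's lemma supplies, and that $e^x \le 1 + (e - 1) x \le 1 + 2x$ on $[0, 1]$. A grid check is of course not a proof.

```python
import math

def sign_product_mgf(lam):
    """E[exp(lam * V * V')] for independent uniform signs V, V'; equals cosh(lam)."""
    return 0.25 * sum(math.exp(lam * v * w) for v in (-1, 1) for w in (-1, 1))

# Hoeffding-type bound on a grid of lam values in [-5, 5].
lams = [i / 50 for i in range(-250, 251)]
hoeffding_holds = all(sign_product_mgf(l) <= math.exp(l * l / 2) + 1e-12 for l in lams)

# Convexity chord bound and its relaxation on [0, 1].
xs = [i / 1000 for i in range(1001)]
chord_holds = all(math.exp(x) <= 1 + (math.e - 1) * x + 1e-12 for x in xs)
slope_holds = all(1 + (math.e - 1) * x <= 1 + 2 * x for x in xs)
print(hoeffding_holds, chord_holds, slope_holds)
```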
10.7 Exercises
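Exercise 10.1: Recall the Hellinger distance between distributions $P$ and $Q$ with densities $p, q$
is $d_{\rm hel}(P, Q)^2 = \int (\sqrt{p(x)} - \sqrt{q(x)})^2 dx$. Let $P$ be $\mathsf{N}(\mu_0, \Sigma)$ and $Q$ be $\mathsf{N}(\mu_1, \Sigma)$. Show that
\[
  \frac{1}{2} d_{\rm hel}(P, Q)^2 = 1 - \exp\left( -\frac{1}{8} (\mu_0 - \mu_1)^\top \Sigma^{-1} (\mu_0 - \mu_1) \right).
\]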
Exercise 10.2: Suppose that the test $\Psi$ has test risk for testing between $P_0$ and $P_1$ satisfying
$R_n(\Psi \mid P_0, P_1) \le \frac{1}{3}$. Let $k \in \mathbb{N}$. Show how, given a sample of size $kn$, we can develop a test $\Psi^\star$ with
where c > 0 is a numerical constant. Hint. Split the sample into k samples of size n, and then
apply Ψ to each.
Exercise 10.3 (Poissonization: lower bounds [174]): Prove the lower bound in Proposition 10.2.7,
inequality (10.2.8), that is, that for numerical constants C, c,
Hint. Bound $\mathfrak{M}_{{\rm Poi}(2n)}$ with a weighted sum of $\mathfrak{M}_m$. Use the MGF calculation that for $X \sim {\rm Poi}(\lambda)$,
$E[e^{tX}] = \exp(\lambda (e^t - 1))$ to show that $N \sim {\rm Poi}(2n)$ is concentrated above $n$.
Exercise 10.4 (Poissonization: upper bounds [174]): Assume the minimax result that
where the supremum is over probability distributions (priors $\pi$) on $p \in \Delta_k$, and the expectation
is now over the random choice of $p$ and the sample $X_1^n \stackrel{\rm iid}{\sim} p$ drawn conditional on $p$. (This is a
standard infinite-dimensional saddle point result generalizing von Neumann's minimax theorem;
cf. [81, 160].) You will show the upper bound in Proposition 10.2.7, Eq. (10.2.8).
Let {Tm } be an arbitrary sequence of estimators and define the sequence of averaged risks
Exercise 10.5: Consider the hypothesis testing problem of testing whether a collection of independent
Bernoulli random variables $X_1, \ldots, X_n$ is fair ($H_0$, so that $P(X_i = 1) = \frac{1}{2}$ for each $i$) or
that there are unfair subcollections. That is, we wish to test
\[
  \begin{aligned}
    H_0 &: X_i \stackrel{\rm iid}{\sim} {\rm Bernoulli}(\tfrac{1}{2}) \\
    H_1 &: X_i \stackrel{\rm ind}{\sim} {\rm Bernoulli}(\tfrac{1 + \theta_i}{2}), ~ \theta \in C
  \end{aligned}
\]
for a set $C \subset [-1, 1]^n$. Show that if the set $C$ is orthosymmetric, meaning that whenever $\theta \in C$
then $S \theta \in C$ for any diagonal matrix $S$ of signs, i.e. ${\rm diag}(S) \in \{\pm 1\}^n$, then no test can reliably
distinguish $H_0$ from $H_1$ (in a minimax sense). Hint. Let $v \in \mathcal{V} := \{\pm 1\}^n$ index coordinate signs
and define $\theta_v = D v$ for some diagonal $D$, where $D v \in C$. Let $P_v$ be the product distribution with
$X_i \sim {\rm Bernoulli}(\frac{1 + D_i v_i}{2})$. What is $\frac{1}{2^n} \sum_{v \in \mathcal{V}} P_v$?
Exercise 10.6 (Testing a trend in independent Bernoullis): Consider testing whether a collection
of Bernoulli random variables has an "upward trend" over time, by which we mean that if $X_i \sim
{\rm Bernoulli}(p_i)$ independently, then
\[
  p_{\rm end} := \frac{1}{n/4} \sum_{i = \frac{3n}{4} + 1}^n p_i > p_{\rm beg} := \frac{1}{n/4} \sum_{i=1}^{n/4} p_i.
\]
Consider the following more quantitative version of this problem: we wish to test
\[
  \begin{aligned}
    H_0 &: X_i \stackrel{\rm iid}{\sim} {\rm Bernoulli}(\tfrac{1}{2}) \\
    H_1 &: X_i \stackrel{\rm ind}{\sim} {\rm Bernoulli}(p_i), ~ p_{\rm end} - p_{\rm beg} \ge \delta.
  \end{aligned}
\]
(a) Use Le Cam's two-point method to show that there exists a numerical constant $c > 0$ such that
for $\delta \le \frac{c}{\sqrt{n}}$, no test can reliably distinguish $H_0$ from $H_1$.
to develop a test $\Psi$ (use Proposition 10.2.2) that achieves test risk $R_n(\Psi \mid H_0, H_1) \le \frac{1}{4}$ whenever
$\delta \ge \frac{C}{\sqrt{n}}$, where $C < \infty$ is a constant.
(a) Give $E[\|\hat{p}\|_2^2]$.
(b) Show that $T_n := \|\hat{p} - \hat{q}\|_2^2$ satisfies
\[
  E[T_n] = \|p - q\|_2^2 + \frac{1}{n} + \frac{1}{m} - \frac{1}{n} \|p\|_2^2 - \frac{1}{m} \|q\|_2^2.
\]
Exercise 10.9: Show that in the hypothesis testing problem (10.2.12), there is a numerical
constant $c > 0$ such that $\delta \le c / \sqrt{n}$ implies that no test can reliably distinguish $H_0$ from $H_1$.
JCD Comment:
3. Lower bound for testing whether collection of coins is fair or some number are unfair.
Part III
Chapter 11
In prediction problems broadly construed, we have a random variable X and a label, or target or
response, Y , and we wish to fit a model or predictive function that accurately predicts the value
of Y given X. There are several perspectives possible when we consider such problems, each with
attendant advantages and challenges. We can roughly divide these into three approaches, though
there is considerable overlap between the tools, techniques, and goals of the three:
(1) Point prediction, where we wish to find a prediction function f so that f (X) most accurately
predicts Y itself.
(2) Probabilistic prediction, where we output a predicted distribution $P$ of $Y$, and we seek $\mathbb{P}(Y =
y \mid X = x) \approx P(Y = y \mid X = x)$, where here $\mathbb{P}$ denotes the "true" probability and $P$ the
predicted one. A relaxed version of this is calibration, the subject of the next chapter, where
we ask that $\mathbb{P}(Y = y \mid P) \approx P(Y = y)$, that is, the distribution of $Y$ given a predicted
distribution $P$ is accurate.
(3) Predictive inference, where for a given level $\alpha \in (0, 1)$, we seek a confidence set mapping $C$
such that $\mathbb{P}(Y \in C(X)) \approx 1 - \alpha$.
We focus mostly on the first two, though there is overlap between the approaches.
In this first chapter of the sequence, we focus on the probabilistic prediction problem. Our main
goal will be to elucidate and identify loss functions for choosing probabilistic predictions that are
proper, meaning that the true distribution of Y minimizes the loss, and strictly proper, meaning that
the true distribution of Y uniquely minimizes the loss. As part of this, we will develop mappings
between losses and entropy-type functionals; these rest on convex analytic techniques for their
cleanest statements, highlighting the links between convex analysis, prediction, and information.
Moreover, we highlight how any proper loss (which will be defined) is in correspondence with a
particular measure of entropy on the distribution P , and how these connect with an object known
as the Bregman divergence central to convex optimization. For the deepest understanding of this
chapter, it will therefore be useful to review the basic concepts of convexity (e.g., convex sets,
functions, and subgradients) in Appendix B, as well as the more subtle tools on optimality and
stability of solutions to convex optimization problems in Appendix C. We give an overview of the
important results in Section 11.1.1.
In much of the literature on prediction, one instead considers proper scoring rules, which are simply
negative proper losses, that is, functions $S$ on $\mathcal{P} \times \mathcal{Y}$ satisfying $S(P, y) = -\ell(P, y)$ for a (strictly)
proper loss. We focus on losses for consistency with the convex analytic tools we develop. In
addition, frequently we will work with discrete distributions, so that $Y$ has a probability mass
function (p.m.f.), in which case we will use $p \in \Delta_k := \{p \in \mathbb{R}^k_+ \mid \langle 1, p \rangle = 1\}$ to identify the
distribution and $\ell(p, y)$ instead of $\ell(P, y)$.
Perhaps the two most famous proper losses are the log loss and the squared loss (often termed
Brier scoring). For simplicity let us assume that $Y \in \{1, 2, \ldots, k\}$, and let $\Delta_k = \{p \in \mathbb{R}^k_+ \mid 1^T p = 1\}$
be the probability simplex; we then identify distributions $P$ on $\mathcal{Y}$ with vectors $p \in \Delta_k$, and abuse
notation to write $\ell(p, y)$ accordingly when it is unambiguous. The squared loss is then
\[
  \ell_{\rm sq}(p, y) = (p_y - 1)^2 + \sum_{i \neq y} p_i^2 = \|p - e_y\|_2^2,
\]
where $e_y$ is the $y$th standard basis vector, while the log loss (really, the negative logarithm) is
\[
  \ell_{\log}(p, y) = -\log p_y.
\]
Both of these are strictly proper. To see this propriety, let $Y$ have p.m.f. $p \in \Delta_k$, so that $P(Y = y) = p_y$.
Then for the squared loss and any $q \in \Delta_k$, we have
\[
  E[\ell_{\rm sq}(q, Y)] - E[\ell_{\rm sq}(p, Y)] = E[\|q - e_Y\|_2^2] - E[\|p - e_Y\|_2^2]
  = \|q\|_2^2 - 2 \langle q, p \rangle + 2 \langle p, p \rangle - \|p\|_2^2 = \|q - p\|_2^2.
\]
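The propriety calculation is easy to check numerically. In the sketch below, the expected excess squared loss of predicting $q$ when $Y \sim p$ is compared with $\|q - p\|_2^2$; for the log loss, the excess is compared with the KL divergence $\sum_y p_y \log(p_y / q_y)$, which is the standard identity for that loss and is stated here as background rather than taken from the displayed computation. The vectors $p$ and $q$ and the function names are illustrative assumptions.

```python
import math

def brier_loss(q, y):
    """Squared loss ||q - e_y||_2^2 for a predicted p.m.f. q and outcome y."""
    return sum((qi - (1.0 if i == y else 0.0)) ** 2 for i, qi in enumerate(q))

def log_loss(q, y):
    """Logarithmic loss -log q_y."""
    return -math.log(q[y])

def expected_loss(loss, q, p):
    """E[loss(q, Y)] when Y has p.m.f. p."""
    return sum(py * loss(q, y) for y, py in enumerate(p))

p = [0.5, 0.3, 0.2]   # true p.m.f. (made up)
q = [0.2, 0.2, 0.6]   # predicted p.m.f. (made up)

gap_sq = expected_loss(brier_loss, q, p) - expected_loss(brier_loss, p, p)
gap_log = expected_loss(log_loss, q, p) - expected_loss(log_loss, p, p)
squared_distance = sum((a - b) ** 2 for a, b in zip(q, p))
kl = sum(a * math.log(a / b) for a, b in zip(p, q))
print(f"Brier gap {gap_sq:.4f} vs ||q - p||^2 {squared_distance:.4f}")
print(f"log-loss gap {gap_log:.4f} vs KL(p || q) {kl:.4f}")
```

Both gaps are strictly positive whenever $q \neq p$, which is what strict propriety requires.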
case of the log loss—is no accident. In fact, for proper losses, we will show that this divergence
representation necessarily holds.
The key underlying our development is a particular construction, which we present in Sec-
tion 11.1.2, that transforms a loss into a generalized notion of entropy. Because it is so central, we
highlight it here, though before doing so, we take a brief detour through a few of the concepts in
convexity we require. Figures representing these results capture most of the mathematical content,
while Chapters B and C in the appendices contain proofs of the results we require.
where for $x \not\in {\rm dom} f$ we define $f(x) = +\infty$. We exclusively work with proper convex functions, so
that $f(x) > -\infty$ for each $x$. Typically, we work with closed convex $f$, meaning that the epigraph
${\rm epi} f = \{(x, t) \in \mathbb{R}^d \times \mathbb{R} \mid f(x) \le t\} \subset \mathbb{R}^{d+1}$ is a closed set; equivalently, $f$ is lower semi-continuous,
so that $\liminf_{y \to x} f(y) \ge f(x)$. A concave function $f$ is one for which $-f$ is convex.
Three main concepts form the basis for our development. The first is the subgradient (see
Appendix B.3). For a function $f : \mathbb{R}^d \to \mathbb{R}$, the subgradient set (also called the subdifferential) at
the point $x$ is
\[
  \partial f(x) := \left\{ s \in \mathbb{R}^d \mid f(y) \ge f(x) + \langle s, y - x \rangle ~ \mbox{for all}~ y \in \mathbb{R}^d \right\}.
  \tag{11.1.1}
\]
If $f$ is a convex function, then at any point $x$ in the relative interior of its domain, $\partial f(x)$ is non-empty
(Theorem B.3.3). Moreover, a quick calculation shows that $x$ minimizes $f(x)$ if and only if
$0 \in \partial f(x)$, and (a more challenging calculation) that if $\partial f(x) = \{s\}$ is a singleton, then $s = \nabla f(x)$
is the usual gradient. See the left plot of Figure 11.1. We shall in some cases allow subgradients to
take values in the extended reals $\bar{\mathbb{R}}$ and $\bar{\mathbb{R}}^k$, which will necessitate some additional care.
The second concept is that the supremum of a collection of convex functions is always convex,
that is, if fα is convex for each index α ∈ A, then
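is convex, and $f$ is closed if $f_\alpha$ is closed for each $\alpha$. The closedness of $f$ is immediate because
${\rm epi} f = \bigcap_{\alpha} {\rm epi} f_\alpha$, and convexity follows because
\[
  f(\lambda x + (1 - \lambda) y) \le \sup_{\alpha \in A} \{\lambda f_\alpha(x) + (1 - \lambda) f_\alpha(y)\}
  \le \lambda \sup_{\alpha \in A} f_\alpha(x) + (1 - \lambda) \sup_{\alpha \in A} f_\alpha(y).
\]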
Figure 11.1. Left: The quadratic $f(x) = \frac{1}{2} x^2$ and the linear approximation $\hat{f}(x) = f(x_0) + s(x - x_0)$,
where $x_0 = \frac{1}{2}$ and $s = f'(x_0)$. Right: the piecewise quadratic $f(x) = \max\{f_0(x), f_1(x)\}$ where
$f_0(x) = \frac{1}{2} x^2$ and $f_1(x) = \frac{1}{4} (x + \frac{1}{4})^2 + \frac{1}{8}$, intersecting at $x_0 = \frac{1 - \sqrt{10}}{4}$. (a) The function $f(x)$. (b)
The linear underestimator $\hat{f}(x) = f(x_0) + s_0 (x - x_0)$ for $s_0 = f_0'(x_0)$. (c) The linear underestimator
$\hat{f}(x) = f(x_0) + s_1 (x - x_0)$ for $s_1 = f_1'(x_0)$. (d) The linear approximation $\hat{f}(x) = f(x_1) + f'(x_1)(x - x_1)$
around the point $x_1 = \frac{1}{4}$.
Lastly, we revisit a special duality relationship that all closed convex functions $f$ enjoy (see
Appendix C.2 for a fuller treatment). The Fenchel-Legendre conjugate or convex conjugate of a
function $f$ is
\[
  f^*(s) := \sup_x \{\langle s, x \rangle - f(x)\}.
  \tag{11.1.3}
\]
The function $f^*$ is always convex, as it is the supremum of linear functions of $s$, and for any $x^\star(s)$
maximizing $\langle s, x \rangle - f(x)$, we have that $x^\star(s) \in \partial_s f^*(s)$ by the relationship (11.1.2); by a bit more
work, we see that if $s \in \partial f(x)$, then $0 \in \partial_x \{f(x) - \langle s, x \rangle\}$ and so $x$ maximizes $\langle s, x \rangle - f(x)$. See
Figure 11.2 for a graphical representation of this process. Flipping this argument by replacing
$f$ with $f^*$ and $x$ with $s$, when $s \in \partial f(x)$ and $x$ maximizes $\langle s, x \rangle - f(x)$ in $x$, then $x \in \partial f^*(s)$
and so $s$ maximizes $\langle s, x \rangle - f^*(s)$ in $s$. From this development comes the biconjugate, that is,
$f^{**}(x) = \sup_s \{\langle s, x \rangle - f^*(s)\}$, or $f^{**} = (f^*)^*$. The biconjugate $f^{**}$, it turns out, is the supremum
of all linear functionals below $f$, because $\langle s, x \rangle - f^*(s) \le f(x)$ for all $s$, and if $\partial f(x)$ is non-empty,
then the preceding argument guarantees that $\langle s, x \rangle - f^*(s) = f(x)$ for $s \in \partial f(x)$. Theorem C.2.1
in the appendices makes this rigorous, and shows that if $f$ is a closed convex function, then
\[
  f^{**}(x) = f(x)
\]
for all $x$. In particular, by passing through the conjugate, we can recover the function $f$ directly
whenever $f$ is closed convex.
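A brute-force grid computation makes the conjugate and biconjugate concrete. The sketch below takes $f(x) = \frac{1}{2} x^2$ (whose conjugate is $f^*(s) = \frac{1}{2} s^2$) and approximates the suprema by maxima over a finite grid on $[-5, 5]$; the grid, its spacing, and the function names are assumptions made for illustration, and the approximation is only meaningful for arguments whose maximizers lie inside the grid.

```python
def conjugate(f, s, points):
    """Grid approximation of f*(s) = sup_x { s x - f(x) }."""
    return max(s * x - f(x) for x in points)

def quadratic(x):
    return 0.5 * x * x

# Grid on [-5, 5] with spacing 0.01, used for both x and s.
points = [i / 100 for i in range(-500, 501)]
conj_table = {s: conjugate(quadratic, s, points) for s in points}

def biconjugate(x):
    """Grid approximation of f**(x) = sup_s { s x - f*(s) }."""
    return max(s * x - conj_table[s] for s in points)

print(f"f*(1.5)  ~ {conj_table[1.5]:.4f}  (s^2 / 2 = {0.5 * 1.5 ** 2:.4f})")
print(f"f**(0.7) ~ {biconjugate(0.7):.4f}  (f(0.7) = {quadratic(0.7):.4f})")
```

By construction every grid pair also satisfies $\langle s, x \rangle \le f(x) + f^*(s)$, since the conjugate is a maximum over the same grid.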
We immediately have the Fenchel-Young inequality that
\[
  \langle s, x \rangle \le f(x) + f^*(s),
\]
and (see Proposition C.2.2) if $f$ is a closed convex function, then equality holds if and only if
Figure 11.2. The conjugate function. The line of long dashes is $x \mapsto sx$, while the dotted line
is $x \mapsto sx - f^*(s)$. The blue line is the largest gap between $sx$ and $f(x)$, which equals $f^*(s)$. Note
that $x \mapsto sx - f^*(s)$ meets the graph of $f(x)$ at exactly the point of maximum difference $sx - f(x)$,
where $f'(x) = s$.
where we have paralleled the typical notation $H(Y)$ for the Shannon entropy. In many cases, it
will be more convenient to write this entropy directly as a function of the distribution $P$ of $Y$, in
which case we write
\[
  H_\ell(P) = \inf_{Q \in \mathcal{P}} E_P[\ell(Q, Y)],
  \tag{11.1.6}
\]
where $Y$ follows the distribution $P$; we will use whichever is more convenient. As the
notation (11.1.6) makes clear, $H_\ell(P)$ is the infimum of a collection of linear functions of the form
$P \mapsto E_P[\ell(Q, Y)]$ (one for each $Q \in \mathcal{P}$), so that necessarily $H_\ell(P)$ is concave in $P$. The remainder
of this chapter, and several parts of the coming chapters, highlights the ways that this particular
quantity informs the properties of the loss `, and more generally, how we may always view any
concave function H on a family of distributions P as a generalized entropy function.
In Section 11.2, we show how such entropy-type functionals map back to losses themselves,
so for now we content ourselves with a few examples to see why we call these entropies. Let us
temporarily assume that $Y$ has finite support $\{1, \ldots, k\}$ with $\mathcal{P} = \Delta_k = \{p \in \mathbb{R}^k_+ \mid \langle 1, p \rangle = 1\}$ the
collection of probability mass functions on elements $\{1, \ldots, k\}$.
Example 11.1.1 (Log loss): Consider the log loss $\ell_{\log}(p, y) = -\log p_y$. Then
\[
  H_{\ell_{\log}}(p) = \inf_{q \in \Delta_k} E_p[-\log q_Y]
  = \inf_{q \in \Delta_k} \left\{ -\sum_{y=1}^k p_y \log \frac{q_y}{p_y} - \sum_{y=1}^k p_y \log p_y \right\}
  = -\sum_{y=1}^k p_y \log p_y,
\]
This highlights an operational interpretation of entropy distinct from that arising in coding: the
(Shannon) entropy is the minimal expected loss of a player in a prediction game, where the player
chooses a distribution Q on Y , nature draws Y ∼ P , and upon observing Y = y, the player suffers
loss − log Q(Y = y).
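Example 11.1.2 (0-1 error): If instead we take the 0-1 loss, that is, $\ell_{0\mbox{-}1}(p, y) = 1$ if $p_y \le p_j$
for some $j \neq y$ and $\ell_{0\mbox{-}1}(p, y) = 0$ otherwise, then
\[
  H_{\ell_{0\mbox{-}1}}(p) = 1 - \max_y p_y.
\]
So $H_{\ell_{0\mbox{-}1}}(e_y) = 0$ for any standard basis vector, that is, distribution with all mass on a single
point $y$, and $H_{\ell_{0\mbox{-}1}}(p) > 0$ otherwise. Moreover, the vector $p = \frac{1}{k} 1$ maximizes $H_{\ell_{0\mbox{-}1}}(p)$, with
$H_{\ell_{0\mbox{-}1}}(\frac{1}{k} 1) = \frac{k-1}{k}$. ◇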
Example 11.1.3 (Brier scoring and squared error): For the squared error (Brier scoring)
loss $\ell_{\rm sq}(p, y) = \|p - e_y\|_2^2$, where $e_y \in \{0, 1\}^k$ is the $y$th standard basis vector, let $Y$ have p.m.f.
$p \in \Delta_k$. Then
\[
  H_{\ell_{\rm sq}}(Y) = E[\ell_{\rm sq}(p, Y)] = \|p\|_2^2 - 2 \|p\|_2^2 + 1 = 1 - \|p\|_2^2.
\]
So as above, we have $H_{\ell_{\rm sq}}(Y) \ge 0$, with $H_{\ell_{\rm sq}}(Y) = 0$ if and only if $Y$ is a point mass on one
of $\{1, \ldots, k\}$, and the uniform distribution with p.m.f. $p = \frac{1}{k} 1$ maximizes the entropy, with
$H_{\ell_{\rm sq}}({\rm Uniform}([k])) = 1 - 1/k$. ◇
These examples highlight how these entropy functions are types of uncertainty measures, giving
rise to “maximally uncertain” distributions p, which are typically uniform on Y .
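The three generalized entropies in these examples are simple enough to tabulate. The sketch below evaluates them on a uniform distribution, a point mass, and a skewed distribution for $k = 4$. It assumes the closed forms $-\sum_y p_y \log p_y$ for the log loss, $1 - \|p\|_2^2$ for the Brier loss, and $1 - \max_y p_y$ for the 0-1 loss; the last is an assumption of this sketch that agrees with the values $0$ and $\frac{k-1}{k}$ quoted in Example 11.1.2. Function names and test distributions are illustrative.

```python
import math

def shannon_entropy(p):
    """Generalized entropy of the log loss: -sum_y p_y log p_y (with 0 log 0 = 0)."""
    return -sum(v * math.log(v) for v in p if v > 0)

def zero_one_entropy(p):
    """Generalized entropy of the 0-1 loss, taken here to be 1 - max_y p_y."""
    return 1.0 - max(p)

def brier_entropy(p):
    """Generalized entropy of the squared (Brier) loss: 1 - ||p||_2^2."""
    return 1.0 - sum(v * v for v in p)

k = 4
uniform = [1.0 / k] * k
point_mass = [1.0, 0.0, 0.0, 0.0]
skewed = [0.7, 0.1, 0.1, 0.1]

for name, p in [("uniform", uniform), ("point mass", point_mass), ("skewed", skewed)]:
    print(f"{name:10s} log: {shannon_entropy(p):.3f}  "
          f"0-1: {zero_one_entropy(p):.3f}  Brier: {brier_entropy(p):.3f}")
```

All three vanish on the point mass and are largest on the uniform distribution, matching the reading of these functionals as uncertainty measures.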
In complete analogy with our development in Chapter 2, then, we can define the information
between variables X and Y relative to a particular loss function `. Thus, we define the `-conditional
entropy
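\[
  H_\ell(Y \mid X = x) := \inf_{Q \in \mathcal{P}} E[\ell(Q, Y) \mid X = x]
\]
and, in analogy to the definitions in Section 2.1.1, the conditional entropy of $Y$ given $X$ is
\[
  H_\ell(Y \mid X) := E\left[ \inf_{Q \in \mathcal{P}} E[\ell(Q, Y) \mid X] \right] = \int_{\mathcal{X}} H_\ell(Y \mid X = x) dP(x),
\]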
Example 11.1.5 (0-1 error): Consider the 0-1 error $\ell_{0\mbox{-}1}(p, y) = 1$ if $p_y \le \max_{j \neq y} p_j$ and
$\ell_{0\mbox{-}1}(p, y) = 0$ if $p_y > \max_{j \neq y} p_j$. Then letting $y^\star = {\rm argmax}_y P(Y = y)$ and $y^\star(x) = {\rm argmax}_y P(Y =
y \mid X = x)$, we have
\[
  I_{\ell_{0\mbox{-}1}}(X; Y) = E[P(Y = y^\star(X) \mid X)] - P(Y = y^\star) = P(Y = y^\star(X)) - P(Y = y^\star),
\]
the gap between the prior probability of making a mistake when guessing $Y$ and the posterior
probability given $X$. ◇
Example 11.1.6 (Squared error): For the Brier score with squared error $\ell_{\rm sq}(p, y) = \|p - e_y\|_2^2$,
we have $H_{\ell_{\rm sq}}(p) = 1 - \|p\|_2^2$, and so
\[
  I_{\ell_{\rm sq}}(X; Y) = E\left[ \sum_{j=1}^k P(Y = j \mid X)^2 \right] - \sum_{j=1}^k P(Y = j)^2
  = \sum_{j=1}^k {\rm Var}(P(Y = j \mid X)),
\]
the summed variances of the random variables $P(Y = j \mid X)$. The higher the variance of these
quantities, the more information $X$ carries about $Y$. ◇
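The identity in Example 11.1.6 can be checked on a small joint distribution. The sketch below computes the Brier information two ways, as $H_{\ell_{\rm sq}}(Y) - H_{\ell_{\rm sq}}(Y \mid X)$ and as the summed variances of the conditional probabilities. The $2 \times 2$ joint table is made up, every row is assumed to have positive mass, and the function names are illustrative.

```python
def brier_entropy(p):
    """Generalized entropy of the squared loss: 1 - ||p||_2^2."""
    return 1.0 - sum(v * v for v in p)

def brier_information(joint):
    """H(Y) - H(Y | X) for the Brier entropy, with joint[x][j] = P(X = x, Y = j)."""
    k = len(joint[0])
    p_x = [sum(row) for row in joint]
    p_y = [sum(row[j] for row in joint) for j in range(k)]
    cond_entropy = sum(px * brier_entropy([v / px for v in row])
                       for px, row in zip(p_x, joint))
    return brier_entropy(p_y) - cond_entropy

def summed_conditional_variances(joint):
    """sum_j Var(P(Y = j | X)), where the variance is over the draw of X."""
    k = len(joint[0])
    p_x = [sum(row) for row in joint]
    total = 0.0
    for j in range(k):
        cond = [row[j] / px for px, row in zip(p_x, joint)]
        mean = sum(px * c for px, c in zip(p_x, cond))
        total += sum(px * (c - mean) ** 2 for px, c in zip(p_x, cond))
    return total

# Made-up joint p.m.f. of (X, Y) on {0, 1} x {0, 1}; rows index X.
joint = [[0.30, 0.10],
         [0.10, 0.50]]
info = brier_information(joint)
variances = summed_conditional_variances(joint)
print(f"H(Y) - H(Y|X) = {info:.6f}, summed variances = {variances:.6f}")
```

For an independent pair (a joint table that is a product of its marginals) both quantities are zero.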
We do allow losses to attain infinite values, for example, we can allow `(Q, y) = +∞ if Q assigns
probability 0 to an event y, as in the case of the logarithmic loss. The following theorem then
provides the promised representation of proper losses, and additionally, highlights the centrality of
the generalized entropy functionals.
Theorem 11.2.1 (Proper scoring rules: the finite case). Let Y = {1, . . . , k} be finite and P ⊂ ∆k
a convex collection of distributions on Y. Then the following are true.
Additionally, if $\ell$ is real valued, then $\nabla h(p) \in \mathbb{R}^k$ in the representation (11.2.1). If $\ell(p, y)$ can take
the value $+\infty$, then we allow $\nabla h(p) \in \bar{\mathbb{R}}^k$ when $p \not\in {\rm relint}\, \Delta_k$. The loss is strictly proper if and only
if the convex $h$ is strictly convex.
by the first-order convexity property of convex functions (that is, the definition (11.1.1) of a
subdifferential).
Conversely, suppose that the loss is proper, and let $h(p) = h_\ell(p)$. Clearly $h$ is convex, as it is
the supremum of linear functionals of $p$. Moreover, propriety of $\ell$ guarantees that
\[
  h(p) \ge -E[\ell(q, Y)] = h(q) + \sum_{y=1}^k -\ell(q, y)(p_y - q_y).
\]
That is, for each $q \in \mathcal{P}$ the vector $[-\ell(q, y)]_{y=1}^k \in \partial h(q)$, so $h$ is subdifferentiable. Choosing the
vector $\nabla h(p) = [-\ell(p, y)]_{y=1}^k$, we have
\[
  \ell(p, y) = -h(p) + \ell(p, y) + h(p) = -h(p) - \sum_{i=1}^k p_i \ell(p, i) + \ell(p, y) = -h(p) - \langle \nabla h(p), e_y - p \rangle
\]
as desired. Note that $\ell(p, y) < \infty$ except when $p_y = 0$, in which case our definition $\nabla h(p) =
[-\ell(p, y)]_{y=1}^k$ remains sensible as $-\langle \nabla h(p), e_y - p \rangle = +\infty$.
As an alternative argument more directly using convexity, the definition $h(p) = \sup_q \{-E_p[\ell(q, Y)] \mid
q \in \mathcal{P}\}$ and the immediate calculation (11.1.2) of the subdifferential of the supremum show that
\[
  \partial h(p) \supset \left\{ [-\ell(q, y)]_{y=1}^k \mid q \in \Delta_k ~\mbox{satisfies}~ -E_p[\ell(q, Y)] = h(p) \right\}.
\]
But propriety guarantees that the set of such $q$ includes $p$, so that $\partial h(p) \ni [-\ell(p, y)]_{y=1}^k$.
For the strict inequalities and strict propriety, trace the argument replacing inequalities with
strict inequalities for $q \neq p$ and use Corollary B.3.2 or C.1.7.
The negative generalized entropy h in Theorem 11.2.1 is essentially unique and marks an impor-
tant duality between proper losses and convex functions: to each loss, we can assign a generalized
entropy, and from this generalized entropy, we can reconstruct the loss. Exercise 11.2 explores this
connection. We can also give a few examples that show how to recover standard losses. For each,
we begin with a convex function h, then exhibit the associated proper or strictly proper scoring
rule. One thing to notice in this representation is that, typically, we do not expect to achieve a
loss function convex in p, which is a weakness of the representation (11.2.1). In Section 11.3 (and
Chapter 14 in more depth), however, we will show how to convert suitable proper losses into sur-
rogates that are convex in their arguments and which, after a particular transformation based on
convex duality, are proper and yield the correct distributional predictions. We defer this, however,
and instead provide a few examples.
Example 11.2.2 (Logarithmic losses): Consider the negative entropy $h(p) = \sum_{y=1}^k p_y \log p_y$. We have $\frac{\partial}{\partial p_y} h(p) = 1 + \log p_y \in [-\infty, 1]$, and
\[ \ell_{\log}(p, y) = -\sum_{j=1}^k p_j \log p_j + \sum_{j=1}^k p_j (1 + \log p_j) - (1 + \log p_y) = -\log p_y, \]
yielding the log loss. Note that for this case, we do require that the gradients $\nabla h(p)$ take values in the (downward) extended reals $\underline{\mathbb{R}}^k$. ◇
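Example 11.2.3 (Brier scores and squared error): When we have the squared error $\ell_{\text{sq}}(p, y) = \|p - e_y\|_2^2$, we can directly check that $h(p) = \|p\|_2^2$ gives the loss. Indeed, as $\nabla h(p) = 2p$,
\[ -h(p) - \langle \nabla h(p), e_y - p \rangle = \|p\|_2^2 - 2 p_y = \|p - e_y\|_2^2 - 1, \]
which is $\ell_{\text{sq}}(p, y)$ up to the additive constant 1.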
More esoteric examples exist in the literature, such as the spherical score arising from $h(p) = \|p\|_2$ (note the lack of a square).
Example 11.2.4 (Spherical scores): Let $h(p) = \|p\|_2$, which is strictly convex on $\Delta_k$. Then $\nabla h(p) = p / \|p\|_2$ and
\[ \ell(p, y) = -\|p\|_2 - \frac{1}{\|p\|_2} \langle p, e_y - p \rangle = -\frac{p_y}{\|p\|_2}, \]
which is strictly proper but does not retain convexity. ◇
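To make the recipe "start from a convex $h$, then read off the loss" concrete, here is a minimal Python sketch of the representation $\ell(p, y) = -h(p) - \langle \nabla h(p), e_y - p \rangle$ for the three entropies in Examples 11.2.2 through 11.2.4. The function names are assumptions made for illustration, and the code assumes $p$ lies in the interior of the simplex so that all gradients are finite.

```python
import math


def savage_loss(h, grad_h, p, y):
    """l(p, y) = -h(p) - <grad h(p), e_y - p> for p in the simplex interior."""
    g = grad_h(p)
    inner = g[y] - sum(gi * pi for gi, pi in zip(g, p))
    return -h(p) - inner


def neg_entropy(p):
    return sum(v * math.log(v) for v in p)


def grad_neg_entropy(p):
    return [1.0 + math.log(v) for v in p]


def sq_norm(p):
    return sum(v * v for v in p)


def grad_sq_norm(p):
    return [2.0 * v for v in p]


def norm(p):
    return math.sqrt(sum(v * v for v in p))


def grad_norm(p):
    n = norm(p)
    return [v / n for v in p]


def expected_loss(h, grad_h, q, p):
    """E_p[l(q, Y)]: the risk of predicting q when Y has p.m.f. p."""
    return sum(p[y] * savage_loss(h, grad_h, q, y) for y in range(len(p)))
```

With `p = [0.2, 0.5, 0.3]`, the three choices of `h` return $-\log p_y$, $\|p - e_y\|_2^2 - 1$, and $-p_y / \|p\|_2$ respectively, and `expected_loss` is smallest at `q = p`, as propriety requires.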
Bregman divergences
A key aspect of the Savage representation (11.2.1) is that associated to any proper loss is a first-order divergence (or, less evocatively, the Bregman divergence). Recall from Chapter 3 that for a function $h : \mathbb{R}^k \to \mathbb{R}$, the first-order divergence associated with $h$ is
\[ D_h(u, v) = h(u) - h(v) - \langle \nabla h(v), u - v \rangle. \]
In typical definitions of the divergence, one requires that $h$ be differentiable; here, we allow non-differentiable $h$ so long as the choice $\nabla h(v) \in \partial h(v)$ is given. In particular, we see that
\[ D_h(u, v) \ge 0 \]
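losses with first-order divergences immediately. Indeed, let $h : \Delta_k \to \mathbb{R}$ be a convex function and the loss $\ell$ be the associated proper loss, with $\ell(p, y) = -h(p) - \langle \nabla h(p), e_y - p \rangle$. Now, suppose that $Y$ has p.m.f. $p$; then for any $q \in \Delta_k$, the gap
\[ E_p[\ell(q, Y)] - E_p[\ell(p, Y)] = h(p) - h(q) - \sum_{y=1}^k p_y \langle \nabla h(q), e_y - q \rangle = h(p) - h(q) - \langle \nabla h(q), p - q \rangle \]
is exactly the first-order divergence $D_h(p, q)$.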
We record this as a corollary to Theorem 11.2.1, highlighting the links between propriety, first-order divergences, and proper loss functions.
Corollary 11.2.5. Let the conditions of Theorem 11.2.1 hold. Then $\ell$ is (strictly) proper if and only if there exists a (strictly) convex $h : \Delta_k \to \mathbb{R}$ for which
\[ D_h(p, q) = E_p[\ell(q, Y)] - E_p[\ell(p, Y)] \]
for all $p, q \in \Delta_k$.
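A small Python sketch can illustrate the link between excess risk and first-order divergences. It uses the negative entropy $h(p) = \sum_y p_y \log p_y$, for which the divergence on the simplex is the KL divergence; the helper names are assumptions for illustration, and both $p$ and $q$ are assumed to have strictly positive entries.

```python
import math


def neg_entropy(p):
    return sum(v * math.log(v) for v in p)


def grad_neg_entropy(p):
    return [1.0 + math.log(v) for v in p]


def bregman(h, grad_h, u, v):
    """First-order divergence D_h(u, v) = h(u) - h(v) - <grad h(v), u - v>."""
    return h(u) - h(v) - sum(g * (a - b) for g, a, b in zip(grad_h(v), u, v))


def excess_risk(h, grad_h, p, q):
    """E_p[l(q, Y)] - E_p[l(p, Y)] for l(r, y) = -h(r) - <grad h(r), e_y - r>."""

    def loss(r, y):
        g = grad_h(r)
        return -h(r) - (g[y] - sum(gi * ri for gi, ri in zip(g, r)))

    return sum(p[y] * (loss(q, y) - loss(p, y)) for y in range(len(p)))


p = [0.2, 0.5, 0.3]
q = [0.3, 0.3, 0.4]
kl = sum(a * math.log(a / b) for a, b in zip(p, q))
```

Here `bregman(neg_entropy, grad_neg_entropy, p, q)`, `excess_risk(neg_entropy, grad_neg_entropy, p, q)`, and `kl` all coincide up to rounding.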
Example 11.2.6 (Continuous ranked probability score (CRPS)): The CRPS loss for a CDF $F$ at $y$ is
\[ \ell_{\text{crps}}(F, y) = \int (F(t) - 1\{y \le t\})^2 \, dt. \tag{11.2.3} \]
This is a strictly proper scoring rule: let $G$ be any cumulative distribution function, meaning that $\lim_{t \to -\infty} G(t) = 0$ and $\lim_{t \to \infty} G(t) = 1$, and let $Y$ have CDF $F$. Then
\[ E[\ell_{\text{crps}}(G, Y)] - E[\ell_{\text{crps}}(F, Y)] = \int \left( G(t)^2 - F(t)^2 - 2 (G(t) - F(t)) E[1\{Y \le t\}] \right) dt = \int (G(t) - F(t))^2 \, dt \]
because $E[1\{Y \le t\}] = F(t)$. This is the (squared) Cramér-von-Mises distance between $F$ and $G$, which is positive unless $F = G$. Unfortunately, computing the CRPS loss (11.2.3) is often challenging except for specially structured $F$. ◇
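For distributions supported on the integers $\{0, \ldots, m-1\}$, the CDFs are step functions and the integral (11.2.3) reduces to a finite sum, which makes the identity above easy to check numerically. The following Python sketch does so; the restriction to integer support and the function names are simplifying assumptions for illustration only.

```python
def cdf(pmf):
    """Cumulative sums F[j] = P(Y <= j) for a p.m.f. on {0, ..., m-1}."""
    out, total = [], 0.0
    for w in pmf:
        total += w
        out.append(total)
    return out


def crps(pmf_pred, y):
    """CRPS of the predicted p.m.f. at outcome y.

    On each unit interval [j, j+1) the integrand (F(t) - 1{y <= t})^2 is
    constant, so the integral is a sum over j.
    """
    F = cdf(pmf_pred)
    return sum((F[j] - (1.0 if y <= j else 0.0)) ** 2 for j in range(len(F)))


def expected_crps(pmf_pred, pmf_true):
    return sum(w * crps(pmf_pred, y) for y, w in enumerate(pmf_true))


f = [0.1, 0.4, 0.3, 0.2]      # true p.m.f. (hypothetical)
g = [0.25, 0.25, 0.25, 0.25]  # competing prediction (hypothetical)
gap = expected_crps(g, f) - expected_crps(f, f)
cvm = sum((a - b) ** 2 for a, b in zip(cdf(g), cdf(f)))
```

The excess risk `gap` equals the squared Cramér-von-Mises distance `cvm` between the two step CDFs.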
Because the computation of the continuous ranked probability score is challenging, it can be advantageous to consider other losses on probability distributions, which can allow more flexibility in modeling. To that end, we define the quantile loss: for a probability distribution $P$ on $\mathcal{Y}$, let
\[ \mathrm{Quant}_\alpha(P) := \inf\{t \mid P(Y \le t) \ge \alpha\} \]
be the $\alpha$-quantile of the distribution $P$. (When $Y$ has cumulative distribution $F$, this is the inverse CDF mapping $F^{-1}(\alpha) = \inf\{t \mid F(t) \ge \alpha\}$.) Defining the quantile penalty $\rho_\alpha(t) = \alpha [t]_+ + (1 - \alpha) [-t]_+$,
The propriety of the quantile loss is relatively straightforward; it is, however, not strictly proper.
Example 11.2.7 (Quantile loss): To see that the quantile loss (11.2.4) is proper, consider the single quantile penalty $\rho_\alpha$: let $g(t) = E[\rho_\alpha(Y - t)] = \alpha E[[Y - t]_+] + (1 - \alpha) E[[t - Y]_+]$, which we claim is minimized by $\mathrm{Quant}_\alpha(Y)$. Indeed, $g$ is convex, and it has left and right derivatives
\[ \partial_- g(t) := \lim_{s \uparrow t} \frac{g(s) - g(t)}{s - t} = -\alpha P(Y \ge t) + (1 - \alpha) P(Y < t) = P(Y < t) - \alpha \]
and
\[ \partial_+ g(t) := \lim_{s \downarrow t} \frac{g(s) - g(t)}{s - t} = -\alpha P(Y > t) + (1 - \alpha) P(Y \le t) = P(Y \le t) - \alpha. \]
For $t = \mathrm{Quant}_\alpha(Y)$, we have $\partial_- g(t) = P(Y < t) - \alpha \le 0$ and $\partial_+ g(t) = P(Y \le t) - \alpha \ge 0$, because $t \mapsto P(Y \le t)$ is right continuous. So convexity yields that $t = \mathrm{Quant}_\alpha(Y)$ minimizes $g$, and consequently $E[\ell(Q, Y)] \ge E[\ell(P, Y)]$ for any $Q$ whenever $Y \sim P$, and equality holds whenever $Q$ and $P$ have identical $\alpha$-quantiles for each $\alpha \in \mathcal{A}$. ◇
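The claim that the $\alpha$-quantile minimizes $g(t) = E[\rho_\alpha(Y - t)]$ is easy to probe on an empirical distribution. The Python sketch below uses the penalty $\rho_\alpha(u) = \alpha [u]_+ + (1 - \alpha)[-u]_+$ read off from the expression for $g$ in the example; the sample values and function names are hypothetical.

```python
import math


def pinball(alpha, u):
    """Quantile penalty rho_alpha(u) = alpha * [u]_+ + (1 - alpha) * [-u]_+."""
    return alpha * max(u, 0.0) + (1.0 - alpha) * max(-u, 0.0)


def expected_pinball(alpha, t, ys):
    """g(t) = E[rho_alpha(Y - t)] under the empirical distribution of ys."""
    return sum(pinball(alpha, y - t) for y in ys) / len(ys)


def quantile(alpha, ys):
    """inf{t : P(Y <= t) >= alpha} for the empirical distribution of ys."""
    ordered = sorted(ys)
    # The empirical CDF at the i-th order statistic (0-indexed) is (i + 1) / n.
    idx = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return ordered[idx]


ys = [3.0, -1.0, 4.0, 1.5, 9.0, 2.0, 6.0]
alpha = 0.3
t_star = quantile(alpha, ys)
grid = [x / 10.0 for x in range(-20, 100)]
best_on_grid = min(expected_pinball(alpha, t, ys) for t in grid)
```

On this sample `t_star` is the third order statistic, and no grid point achieves a smaller expected penalty than `expected_pinball(alpha, t_star, ys)`.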
The general case of Theorem 11.2.1 allows us to address such scenarios, though it does require measure theory to properly define. Happily, the generality does not require a particularly more sophisticated proof. For a (convex) function $h : \mathcal{P} \to \mathbb{R}$ on a family of distributions $\mathcal{P}$ on a set $\mathcal{Y}$, we say $h'(P; \cdot) : \mathcal{Y} \to \mathbb{R}$ is a subderivative of $h$ at $P \in \mathcal{P}$ whenever
\[ h(Q) \ge h(P) + \int_{\mathcal{Y}} h'(P, y) (dQ(y) - dP(y)) = h(P) + E_Q[h'(P, Y)] - E_P[h'(P, Y)] \quad \text{for all } Q \in \mathcal{P}. \tag{11.2.5} \]
When $\mathcal{Y}$ is discrete and we can identify $\mathcal{P}$ with the simplex $\Delta_k$, the inequality (11.2.5) is simply the typical subgradient inequality (11.1.1) that $h(q) \ge h(p) + \langle \nabla h(p), q - p \rangle$ for $p, q \in \Delta_k$, where $\nabla h(p) \in \partial h(p)$. We then have the following generalization of Theorem 11.2.1.
Theorem 11.2.8. Let $\mathcal{P}$ be a convex collection of distributions on $\mathcal{Y}$. Then the following are true.
(i) If the loss $\ell : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$ satisfies the representation
\[ \ell(P, y_0) = -h(P) - h'(P, y_0) + \int h'(P, y) \, dP(y) \quad \text{for all } y_0 \in \mathcal{Y}, \tag{11.2.6} \]
The loss is strictly proper if and only if the convex $h$ is strictly convex.
which is the supremum of linear functionals of $P$ and hence convex. If we let $h'(P, y) = -\ell(P, y) \in \mathbb{R}$ for $P \in \mathcal{P}$, then
\[ h(P) \ge -E_P[\ell(Q, Y)] = h(Q) + E_Q[\ell(Q, Y)] - E_P[\ell(Q, Y)] = h(Q) + \int h'(Q, y) (dP(y) - dQ(y)) \]
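Corollary 11.2.9. Let $\mathcal{P}$ be a convex collection of probability distributions on $\mathcal{Y}$. Then the loss $\ell : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$ is proper if and only if there exists a convex function $h : \mathcal{P} \to \mathbb{R}$ with subderivatives $h'(P, \cdot) : \mathcal{Y} \to \mathbb{R}$ such that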
The subdifferentials and differentiability in this potentially infinite dimensional case can make writing the particular representation (11.2.6) challenging; for example, the representation of the quantile loss in Example 11.2.7 is quite complex. In the case of predictions involving the cumulative distribution function $F$, however, one can obtain the subderivative by taking directional (Gateaux) derivatives in directions $G - F$ for cumulative distributions $G$. In this case, for the point cumulative distribution $G_y$ with $G_y(t) = 1\{y \le t\}$, we define
\[ h'(F, y) := \lim_{\epsilon \downarrow 0} \frac{h(F + \epsilon (G_y - F)) - h(F)}{\epsilon}. \]
The continuous ranked probability score (Example 11.2.6) admits this expansion.
Example 11.2.10 (CRPS (Example 11.2.6) continued): The strict propriety of the CRPS loss (11.2.3) means that the generalized entropy is
\[ h(F) = \sup_G -E[\ell_{\text{crps}}(G, Y)] = -E[\ell_{\text{crps}}(F, Y)] = \int (F(t) - 1) F(t) \, dt \]
by definition. Expanding $h(F + \epsilon (G - F))$ for small $\epsilon$ as in the recipe above, we have
\[ h(F + \epsilon (G - F)) = h(F) - \epsilon \int (G(t) - F(t)) \, dt + 2 \epsilon \int F(t) (G(t) - F(t)) \, dt + O(\epsilon^2). \]
To obtain the $y$-based derivative $h'(F, y)$, we choose $G_y(t) = 1\{y \le t\}$ to obtain the directional derivative
\[ h'(F, y) = \lim_{\epsilon \downarrow 0} \frac{h(F + \epsilon (G_y - F)) - h(F)}{\epsilon} = -\int (1\{y \le t\} - F(t)) \, dt + 2 \int F(t) (1\{y \le t\} - F(t)) \, dt. \]
as desired. ◇
In this vector-valued $Y$ case, instead of predicting distributions $P$, the goal is to predict the mean mapping
\[ \mu(P) := E_P[Y] \in \mathrm{Conv}(\mathcal{Y}), \]
so that $\mu : \mathcal{P} \to \mathbb{R}^k$ for the collection $\mathcal{P}$ of distributions on $\mathcal{Y}$. Our goal is to reward predictions of the correct expectation, leading to the following definition.
Definition 11.3. Let $C = \mathrm{cl}\, \mathrm{Conv}(\mathcal{Y})$ be a convex set. Then $\ell : C \times \mathcal{Y} \to \mathbb{R}$ is proper if
for $y \in \{0, \ldots, k\}$. Here, however, note the importance of allowing infinite values in the loss $\ell$ when $\mu \to \{0, k\}$. ◇
Proof One direction is, as in the previous cases, straightforward. Let $\ell$ have the given representation. Then for $\mu(P) = E_P[Y]$,
where the equality $(\star)$ follows because $\ell$ is proper. The function $h$ is closed convex, as it is the partial infimum of the closed convex function $p \mapsto -E_p[\ell(\mu, Y)] + I_{\Delta_m}(p)$, where we recall $I_{\Delta_m}(p) = 0$ if $p \in \Delta_m$ and $+\infty$ otherwise (see Proposition B.3.11).
We compute $\partial h(\mu)$ directly now. The infimum over $p$ in the definition of $h(\mu)$ is attained, as $\Delta_m$ is compact and $g(p) := -E_p[\ell(\mu, Y)]$ is necessarily continuous in $p$ satisfying $\mu(p) = \mu$, because regularity of the loss guarantees $\ell(\mu, y_i) \in \mathbb{R}$ whenever $p_i > 0$ is feasible in the mean mapping constraint $\mu(p) = \mu$. Moreover, it is immediate that
\[ \nabla g(p) = \begin{bmatrix} -\ell(\mu, y_1) \\ \vdots \\ -\ell(\mu, y_m) \end{bmatrix} \in \mathbb{R}^m. \]
Let $p^\star(\mu)$ be any $p$ attaining the infimum. By Proposition B.3.27 on the subgradients of partial minimization, we thus obtain
\[ \partial h(\mu) = \left\{ s \in \mathbb{R}^k \mid y_i^T s = -\ell(\mu, y_i) \text{ for } i = 1, \ldots, m \right\}, \]
and moreover, this set is necessarily non-empty for all $\mu \in \mathrm{relint}\, C = \{\mu(p) \mid p \succ 0, p \in \Delta_m\}$. Using this equality, we have
Note that for each of these, we have a direct relationship between the probabilistic predictions and derivatives of $\varphi$. In the binary logistic regression case, we have
\[ p(y \mid s) = 1 + \frac{\partial}{\partial s} \varphi(s, y) = 1 - \frac{1}{1 + \exp(ys)} = \frac{1}{1 + \exp(-ys)}, \]
while in the multiclass case we similarly have
\[ p(y \mid s) = 1 + \frac{\partial}{\partial s_y} \varphi(s, y) = \frac{\exp(s_y)}{\sum_{i=1}^k \exp(s_i)}. \]
Section 11.2.3 of losses for vector-valued $y$ where $\mathcal{Y} \subset \mathbb{R}^k$, so that instead of predicting probability distributions on $\mathcal{Y}$ itself we predict elements $\mu$ of the set $\{E_P[Y]\} = \mathrm{Conv}(\mathcal{Y})$, and let $\ell$ be a strictly proper loss. Theorems 11.2.1 and 11.2.14 demonstrate that if the loss $\ell$ is proper, there exists a (negative) generalized entropy, which in the case of Theorem 11.2.1 is $h(p) = \sup_q \{-E_p[\ell(q, Y)]\}$, for which
\[ \ell(\mu, y) = -h(\mu) - \langle \nabla h(\mu), y - \mu \rangle. \]
Note that $h$ is always a closed convex function, meaning that it is lower semicontinuous or, equivalently, that its epigraph $\mathrm{epi}\, h = \{(\mu, t) \mid h(\mu) \le t\}$ is closed.
Let us suppose temporarily that we have any such entropy. Recalling the convex conjugate (11.1.3), the negative generalized entropy $h$ is closed convex, and so its conjugate $h^*(s) = \sup_\mu \{\langle s, \mu \rangle - h(\mu)\}$ satisfies $h^{**}(\mu) = h(\mu)$. In particular, if we define the surrogate loss
\[ \varphi(s, y) := h^*(s) - \langle s, y \rangle, \]
then
\[ \inf_s E_P[\varphi(s, Y)] = \inf_s \{h^*(s) - \langle s, \mu(P) \rangle\} = -h^{**}(\mu(P)) = -h(\mu(P)), \]
so that minimizing the surrogate recovers the negative entropy $h$.
This identification of (generalized) entropies will underpin much of our development of the consistency of losses in sections to come. For now, we content ourselves with addressing how to understand propriety of the surrogate loss $\varphi$ and how to transform predictions $s \in \mathbb{R}^k$ into probabilistic predictions $\mu$.
The key will be to consider what we term convex-conjugate-linkages, or conjugate linkages for short. Recall the duality relationships (11.1.4) from the Fenchel-Young inequality we present in the convexity primer in Section 11.1.1. The negative generalized entropy $h$ is convex, and the dualities associated with its conjugate $h^*(s) = \sup_\mu \{\langle s, \mu \rangle - h(\mu)\}$ will form the basis of our transformations. We first give a somewhat heuristic presentation, as the intuition is important (but details to make things precise can be a bit tedious). Essentially, we require that $h^*$ and $h$ are continuously differentiable, in which case we have
\[ \nabla h(\mu) = s \quad \text{if and only if} \quad \nabla h^*(s) = \mu \quad \text{if and only if} \quad h^*(s) + h(\mu) = \langle s, \mu \rangle \]
by the Fenchel-Young inequalities (11.1.4). That is, the gradient $\nabla h^*$ of the conjugate transforms a score vector $s \in \mathbb{R}^k$ into a prediction for $Y$: we transform $s$ into a prediction $\mu$ via the conjugate link function
\[ \mathrm{pred}_h(s) = \operatorname*{argmax}_\mu \{\langle s, \mu \rangle - h(\mu)\} = \nabla h^*(s) = (\nabla h)^{-1}(s), \tag{11.3.1} \]
which finds the $\mu$ that best trades having maximal “entropy” $-h(\mu)$, or uncertainty, with alignment with the scores $\langle s, \mu \rangle$.
With this, it is then natural to consider the function substituting the prediction $\mu = \mathrm{pred}_h(s)$ into $\ell(\mu, y)$, and so we consider
\[ \ell(\mathrm{pred}_h(s), y). \]
Immediately, if $\mu = \mathrm{pred}_h(s) = \nabla h^*(s)$, we have $s = \nabla h(\mu)$ by construction (or the Fenchel-Young inequality (11.1.4)), and so $h(\mu) = \langle s, \mu \rangle - h^*(s)$ for this particular pair $(s, \mu)$, and $\nabla h(\mu) = \nabla h(\nabla h^*(s)) = s$ because $\nabla h$ and $\nabla h^*$ are inverses. Substituting, we obtain
\[ \ell(\mathrm{pred}_h(s), y) = -h(\mathrm{pred}_h(s)) - \langle \nabla h(\mathrm{pred}_h(s)), y - \mathrm{pred}_h(s) \rangle = -h(\mu) - \langle s, y - \mu \rangle = h^*(s) - \langle s, \mu \rangle - \langle s, y - \mu \rangle = h^*(s) - \langle s, y \rangle =: \varphi(s, y). \tag{11.3.2} \]
The surrogate loss (11.3.2) constructed from the negative entropy $h$ is the key transformation of the loss $\ell$ into a convex loss, and (no matter the properties of $\ell$) is always convex.
As we have already demonstrated, the construction (11.3.2) is more general than we have presented; certainly, $h^*$ is always convex, and so $\varphi$ is always convex in $s$. Moreover, if $Y$ has expectation $E[Y] = \mu$, then
\[ \inf_s E[\varphi(s, Y)] = \inf_s \{h^*(s) - \langle s, \mu \rangle\} = -h(\mu) \]
by conjugate duality, so the surrogate $\varphi$ always recovers the negative entropy $h$; without some type of differentiability conditions, however, the construction of the prediction mapping $\mathrm{pred}_h$ requires more care. Chapter 14 more deeply investigates these connections.
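The simplest instance of this construction is a binary label $Y \in \{0, 1\}$ with the negative binary entropy $h(p) = p \log p + (1 - p) \log(1 - p)$, whose conjugate is $h^*(s) = \log(1 + e^s)$. The Python sketch below, with illustrative function names that are not from the notes, checks that the surrogate $h^*(s) - sy$ agrees with the log loss evaluated at the linked prediction $\nabla h^*(s)$.

```python
import math


def h(p):
    """Negative binary entropy, for p in (0, 1)."""
    return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)


def h_conj(s):
    """Conjugate h*(s) = log(1 + e^s), written to avoid overflow."""
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))


def pred(s):
    """Conjugate link: the gradient of h*, i.e. the sigmoid."""
    return 1.0 / (1.0 + math.exp(-s))


def log_loss(p, y):
    """Loss induced by h: -y log p - (1 - y) log(1 - p)."""
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


def surrogate(s, y):
    """phi(s, y) = h*(s) - s * y, convex in the score s."""
    return h_conj(s) - s * y
```

For any finite score `s`, `surrogate(s, y)` and `log_loss(pred(s), y)` agree, and the Fenchel-Young equality `h_conj(s) + h(pred(s)) == s * pred(s)` holds up to rounding.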
All that remains is to give more precise conditions under which the prediction (11.3.1) is always unique and exists for all possible score vectors $s \in \mathbb{R}^k$. To that end, we make the following definition.
This is precisely the condition we require to make each step in the development of the surrogate (11.3.2) airtight; as a corollary to Theorem C.2.9 in the appendices, we have the following.
Corollary 11.3.1. Let $h$ be a Legendre negative entropy. Then the conjugate link prediction (11.3.1) is unique and exists for all $s \in \mathbb{R}^k$. In particular, the conjugate $h^*$ is strictly convex, continuously differentiable, satisfies $\mathrm{dom}\, h^* = \mathbb{R}^k$, and $\nabla h^* = (\nabla h)^{-1}$.
With this corollary in place, we can then give a theorem showing the equivalence of the strictly proper loss $\ell$ and its surrogate.
Theorem 11.3.2. Let $\ell : C \times \mathcal{Y} \to \mathbb{R}$ be the strictly proper loss associated with the Legendre negative entropy $h$. Then
Proof The first equality we have already demonstrated. For the minimization claim, we note that if $\mu = E[Y]$, then $E[\varphi(s, Y)] = h^*(s) - \langle \mu, s \rangle$ and $\inf_s \{h^*(s) - \langle \mu, s \rangle\} = -h(\mu)$. Strict propriety of $\ell$ then gives $\inf_{\mu'} E[\ell(\mu', Y)] = -h(\mu)$.
Said differently, the surrogate $\varphi$ is consistent with the loss $\ell$ and (strictly) proper, in that if $s$ minimizes $E[\varphi(s, Y)]$, then $\mathrm{pred}_h(s)$ minimizes $E[\ell(\mu, Y)]$. The statement in terms of limits
is necessary, however, as simple examples show, because with some link functions it is in fact
impossible to achieve the extreme points of Conv(Y), as in logistic regression. We provide a few
example applications (and non-applications) of Theorem 11.3.2. For the first, let us consider binary
logistic regression.
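Example 11.3.3 (Binary logistic regression): For a label $Y \in \{0, 1\}$ and predictions $p \in [0, 1]$, take the generalized entropy
\[ h(p) = p \log p + (1 - p) \log(1 - p), \quad \text{so that} \quad h^*(s) = \sup_{p \in [0, 1]} \{sp - h(p)\} = \log(1 + e^s), \]
where the supremum is achieved by $p = \mathrm{pred}_h(s) = \frac{e^s}{1 + e^s}$. Then we have
\[ \varphi(s, y) = \log(1 + e^s) - sy. \]
For the induced loss $\ell(p, y) = -y \log p - (1 - y) \log(1 - p)$ (the log loss), if $P(Y = 1) = 1$, then $p = 1$ minimizes $E[\ell(p, Y)]$. Similarly, if $P(Y = 0) = 1$, then $p = 0$ minimizes $E[\ell(p, Y)]$. Neither of these is achievable by a finite $s$ in $p(y \mid s) = \frac{e^{ys}}{1 + e^s}$, showing how the limiting argument is necessary. ◇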
The next example shows that we sometimes need to elaborate the setting of Theorem 11.3.2 to deal
with constraints.
Example 11.3.4 (Multiclass logistic regression): Identify the set $\mathcal{Y} = \{e_1, \ldots, e_k\}$ with the $k$ standard basis vectors, and for $p \in \Delta_k = \{p \in \mathbb{R}^k_+ \mid 1^T p = 1\}$, consider the negative entropy
\[ h(p) = \sum_{y=1}^k p_y \log p_y. \]
This function is strictly convex and of Legendre type for the positive orthant $\mathbb{R}^k_+$ but not for $\Delta_k$. Shortly, we shall allow linear constraints on the predictions to address this shortcoming. As an alternative, take $\mathcal{Y} = \{0, e_1, \ldots, e_{k-1}\}$, so that $\mathrm{Conv}(\mathcal{Y}) = \{p \in \mathbb{R}^{k-1}_+ \mid 1^T p \le 1\}$, which has an interior and so more easily admits a conjugate duality relationship. In this case, we may use the negative entropy-type function
\[ h(p) = \sum_{y=1}^{k-1} p_y \log p_y + (1 - 1^T p) \log(1 - 1^T p) \tag{11.3.4} \]
with
\[ \mathrm{pred}_h(s) = \left( \frac{e^{s_1}}{1 + \sum_{j=1}^{k-1} e^{s_j}}, \ldots, \frac{e^{s_{k-1}}}{1 + \sum_{j=1}^{k-1} e^{s_j}} \right). \]
Letting $p$ denote the entries of this vector, we can then assign a probability to class $k$ via $p_k = 1 - \sum_{j=1}^{k-1} p_j$. ◇
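The reference-class parameterization can be checked numerically: for the entropy (11.3.4), $\frac{\partial}{\partial p_y} h(p) = \log \frac{p_y}{1 - 1^T p}$, so applying $\nabla h$ to the linked prediction should return the original scores. The following Python sketch verifies this inverse relationship; the function names and the example score vector are illustrative assumptions.

```python
import math


def pred(s):
    """Conjugate link for (11.3.4): softmax with an implicit zero score for class k."""
    z = 1.0 + sum(math.exp(v) for v in s)
    return [math.exp(v) / z for v in s]


def grad_h(p):
    """Gradient of (11.3.4): d/dp_y h(p) = log(p_y / (1 - sum(p)))."""
    rest = 1.0 - sum(p)
    return [math.log(v / rest) for v in p]


s = [0.3, -1.2, 2.0]      # scores for classes 1, ..., k-1 (here k = 4)
p = pred(s)
p_last = 1.0 - sum(p)     # probability assigned to class k
recovered = grad_h(p)     # should reproduce s
```

Here `recovered` matches `s` entrywise, and `p_last` lies strictly between 0 and 1 for every finite score vector.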
In Section 11.4 we revisit exponential families in the (proper) loss minimization framework we
have thus far developed, which gives some additional perspective on these problems.
is a proper subspace of $\mathbb{R}^k$. The key motivating example here is the “failure” case of Example 11.3.4 on multiclass logistic regression, where $\mathcal{Y} = \{e_1, \ldots, e_k\}$, whose affine hull is exactly those vectors $p \in \mathbb{R}^k$ satisfying $\langle p, 1 \rangle = 1$. Naturally, in this case we wish to predict probabilities, and so given a score vector $s \in \mathbb{R}^k$ and using the negative entropy $h(p) = \sum_{y=1}^k p_y \log p_y$, we let
\[ \mathrm{pred}(s) = \operatorname*{argmin}_p \left\{ h(p) - \langle s, p \rangle \mid 1^T p = 1 \right\} = \left[ \frac{e^{s_y}}{\sum_{j=1}^k e^{s_j}} \right]_{y=1}^k. \]
Then for the loss $\ell(\mu, y) = -h(\mu) - \langle \nabla h(\mu), y - \mu \rangle$ associated with the negative entropy $h$, we define the surrogate
\[ \varphi(s, y) := \ell(\mathrm{pred}_{h,A}(s), y). \]
Perhaps remarkably, this construction still yields a well-defined convex loss with the same consistency properties as those in Theorem 11.3.2. Indeed, defining
and the associated conjugate $h_A^*(s) = \sup\{\langle s, \mu \rangle - h(\mu) \mid \mu \in A\}$, we have the following theorem.
Theorem 11.3.5. Let $\ell : C \times \mathcal{Y} \to \mathbb{R}$ be the strictly proper loss associated with the Legendre negative entropy $h$ and $A = \mathrm{aff}(\mathcal{Y})$ be the affine hull of $\mathcal{Y}$. Then
We return to proving the theorem presently, focusing here on how it applies to Example 11.3.4.
Example 11.3.6 (Multiclass logistic regression): Consider Example 11.3.4, where we identify $\mathcal{Y} = \{e_1, \ldots, e_k\} \subset \mathbb{R}^k$, which has affine hull $A = \{p \in \mathbb{R}^k \mid \langle 1, p \rangle = 1\}$. Then taking $h(p) = \sum_{y=1}^k p_y \log p_y$, a calculation with a Lagrangian shows that
\[ \mathrm{pred}_{h,A}(s) = \operatorname*{argmin}_{p \in \Delta_k} \{-\langle s, p \rangle + h(p)\} = \left[ \frac{e^{s_y}}{\sum_{j=1}^k e^{s_j}} \right]_{y=1}^k. \]
Notably, the logistic loss is not strictly convex, as $\varphi(s + t 1, y) = \varphi(s, y)$ for $t \in \mathbb{R}$. If $Y$ is a multinomial random variable with $P(Y = e_y) = p_y$, then by another calculation, the vector with entries
\[ s^\star_y = \log p_y \]
minimizes $E[\varphi(s, Y)]$, which in turn gives $\mathrm{pred}_{h,A}(s^\star) = p$, maintaining propriety. ◇
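The two claims closing the example, shift invariance and optimality of $s^\star_y = \log p_y$, can be checked directly. The Python sketch below assumes the surrogate takes the familiar multiclass logistic form $\varphi(s, y) = \log \sum_j e^{s_j} - s_y$ (an assumption here, obtained by substituting the softmax prediction into the log loss); the function names are illustrative.

```python
import math


def softmax(s):
    """pred_{h,A}(s): entries exp(s_y) / sum_j exp(s_j), computed stably."""
    m = max(s)
    w = [math.exp(v - m) for v in s]
    z = sum(w)
    return [v / z for v in w]


def logistic_surrogate(s, y):
    """phi(s, y) = log(sum_j exp(s_j)) - s_y (assumed form of the surrogate)."""
    m = max(s)
    return m + math.log(sum(math.exp(v - m) for v in s)) - s[y]


def expected_surrogate(s, p):
    """E[phi(s, Y)] when P(Y = e_y) = p_y."""
    return sum(py * logistic_surrogate(s, y) for y, py in enumerate(p))


p = [0.2, 0.5, 0.3]
s_star = [math.log(v) for v in p]
shifted = [v + 1.7 for v in s_star]          # s_star + t * 1 with t = 1.7
entropy = -sum(v * math.log(v) for v in p)   # the minimal expected surrogate
```

Both `s_star` and `shifted` map to `p` under `softmax` and attain expected surrogate equal to `entropy`, while any score vector with a different softmax does strictly worse.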
Corollary 11.3.7. The conjugate $h_A^*$ is continuously differentiable with $\mathrm{dom}\, h_A^* = \mathbb{R}^k$, and if $\mu = \nabla h_A^*(s)$, then $\mu \in \mathrm{int}\, \mathrm{dom}\, h$ and
\[ \nabla h(\mu) = s + v \]
for some vector $v$ normal to $A$, that is, a vector $v \in \mathbb{R}^k$ satisfying $\langle v, \mu_0 - \mu_1 \rangle = 0$ for all $\mu_0, \mu_1 \in A$.
While the proof of the corollary requires some care to make precise, a sketch can give intuition.
Sketch of Proof Because $h$ is strictly convex and its derivatives $\nabla h(\mu)$ explode as $\mu \to \mathrm{bd}\, \mathrm{dom}\, h$, the minimizer of $-\langle s, \mu \rangle + h(\mu)$ over $\mu \in A$ exists and is unique. Let $A = \{\mu \mid M \mu = b\}$ for shorthand, where $M \in \mathbb{R}^{n \times k}$ for some $n < k$. Then introducing Lagrange multiplier $w \in \mathbb{R}^n$ for the constraint $\mu \in A$, the Lagrangian for finding $\mathrm{pred}_{h,A}(s) = \operatorname*{argmin}_\mu \{h(\mu) - \langle s, \mu \rangle \mid \mu \in A\}$ is $L(\mu, w) = h(\mu) - \langle s, \mu \rangle + w^T (M \mu - b)$, whose stationarity condition in $\mu$ is
\[ \nabla h(\mu) - s + M^T w = 0. \]
Finally, we return to prove the theorem. Take any vector $s \in \mathbb{R}^k$. Then because $\mathrm{pred}_{h,A}(s) = \nabla h_A^*(s)$, we have
As $\nabla h_A^*(s) \in A$ and using the shorthand $\mu = \nabla h_A^*(s) \in A$, we have $\nabla h(\mu) = s + v$ for some $v$ normal to $A$. Moreover, $h(\mu) = h_A(\mu)$, and so the Fenchel-Young inequality (11.1.4) guarantees $-h_A(\mu) = h_A^*(s) - \langle s, \mu \rangle$. Substituting in the expression for $\varphi$, we obtain
Now, for any distribution $P \in \mathcal{P}$ with mean vector $\mu = \mu(P)$, the associated generalized negative entropy is
\[ h(\mu) := \sup_\theta \{-E_P[\varphi(\theta, X)]\} = \sup_\theta \{\langle \theta, \mu(P) \rangle - A(\theta)\} = A^*(\mu), \]
the convex conjugate of $A$. At this point, the centrality of the duality relationships (via gradients $\nabla A$ and $\nabla A^*$) between $\Theta$ and $\mathcal{M}$ to fitting and modeling should come as no surprise, and so we elucidate a few of the main properties. Because $\nabla A(\theta) = E_\theta[\phi(X)]$ in the exponential family, we immediately see that
\[ \nabla A(\Theta) := \{\nabla A(\theta)\}_{\theta \in \Theta} \subset \mathcal{M}. \]
Recalling the duality relationship (11.1.4) that
\[ \theta \in \partial A^*(\mu) \quad \text{if and only if} \quad \nabla A(\theta) = \mu, \]
we can say much more.
(ii) If the family is non-minimal, then $h$ is continuously differentiable relative to $\mathrm{aff}(\mathcal{M})$, meaning that there exists a continuous mapping $\nabla h(\mu) \in \Theta$ such that for all $\mu \in \mathcal{M}^o$,
\[ \partial h(\mu) = \{\nabla h(\mu)\} + \mathrm{aff}(\mathcal{M})^\perp. \]
Moreover, $\Theta = \Theta + \mathrm{aff}(\mathcal{M})^\perp$.
The proof of the proposition relies on the more sophisticated duality theory we develop in Appendices B and C, so we defer it to Section 11.5.2.
We can summarize the proposition by considering minimizers and maximizers: suppose we wish to choose $\theta$ to minimize the expected log loss
\[ E_P[\varphi(\theta, X)] = A(\theta) - \langle \mu(P), \theta \rangle. \]
Then so long as the distribution $P$ is not extremal in that $\mu(P) = E_P[\phi(X)] \in \mathrm{relint}\, \mathcal{M}$, there exists a parameter $\theta(P)$, unique up to translation in the subspace perpendicular to $\mathrm{aff}(\mathcal{M})$, for which
\[ \theta(P) \in \operatorname*{argmin}_\theta E_P[\varphi(\theta, X)] = \operatorname*{argmin}_\theta \{A(\theta) - \langle \mu(P), \theta \rangle\}. \]
Moreover, this parameter satisfies the mean matching condition
\[ \nabla A(\theta(P)) = \mu(P), \]
which is of course sufficient to be a minimizer of the expected log loss. As the statements in the proposition evidence, calculations become more challenging when we must perform them all in an affine subspace, though sometimes this care is unavoidable.
When $\mathrm{Cov}(X) \not\succ 0$, the solution $K = \mathrm{Cov}(X)^{-1}$ does not exist, so we must rely instead on part (ii) of Proposition 11.4.1. With some care, one may check that we can work in the subspace spanned by the eigenvectors of $\mathrm{Cov}(X)$, that is, if $\mathrm{Cov}(X) = U \Lambda U^\top$ and $U \in \mathbb{R}^{d \times k}$, the collection of symmetric matrices $K$ whose column space belongs to $\mathrm{span}(U)$. Then the pseudo-inverse $K = \mathrm{Cov}(X)^\dagger$ is the appropriate solution, and it recovers the covariance $\Sigma = K^\dagger = \mathrm{Cov}(X) \succeq 0$. ◇
Finally, let us give a last result that shows the duality relationships between the negative generalized entropy $h(\mu)$ and log partition $A$, which allows us to also capture a few of the nuances of minimization of the surrogate log loss $\varphi(\theta, x) = -\log p_\theta(x)$ when we encounter distributions $P$ for which the mean mapping $\mu(P)$ is on the boundary of $\mathcal{M}$ or even outside it.
Proposition 11.4.3. Let $\{P_\theta\}$ be a regular exponential family with log partition $A(\theta)$ with domain $\Theta$, and let $\mathcal{M}$ be the associated mean parameter space with relative interior $\mathcal{M}^o = \mathrm{relint}\, \mathcal{M}$. Let $h(\mu) = A^*(\mu)$ be the associated negative generalized entropy. Then
(ii) If $\mu \in \mathcal{M}^o$, there exists $\theta(\mu) \in \Theta$ such that the negative entropy satisfies $h(\mu) = A^*(\mu) = \langle \theta(\mu), \mu \rangle - A(\theta(\mu)) < \infty$. If $\mu \notin \mathrm{cl}\, \mathcal{M}$, then $h(\mu) = +\infty$.
(iii) If $\mu \in \mathrm{bd}\, \mathcal{M} = \mathrm{cl}\, \mathcal{M} \setminus \mathcal{M}^o$, then for any $\mu_0 \in \mathcal{M}^o$, $h(\mu) = \lim_{t \to 0} h(t \mu_0 + (1 - t) \mu)$, and there exist $\theta_t \in \Theta$ with
\[ \nabla A(\theta_t) = t \mu_0 + (1 - t) \mu \quad \text{and} \quad \lim_{t \to 0} \{A(\theta_t) - \langle \mu, \theta_t \rangle\} = \inf_\theta \{A(\theta) - \langle \mu, \theta \rangle\}. \]
In particular, there exist sequences of dual pairs $(\mu_n, \theta_n)$ with $\mu_n \in \mathcal{M}^o$ and $\theta_n \in \Theta$ satisfying $\mu_n = \nabla A(\theta_n)$, $\mu_n \to \mu$, $h(\mu_n) \to h(\mu)$, and $A(\theta_n) - \langle \mu, \theta_n \rangle \to \inf_\theta \{A(\theta) - \langle \mu, \theta \rangle\}$.
\[ E_P[\varphi(\theta_n, X)] \to \inf_\theta E_P[\varphi(\theta, X)] = -h(\mu(P)), \quad \text{and} \quad \mu(P_{\theta_n}) \to \mu(P), \]
so that they asymptotically satisfy the mean identity. Finally, if $\mu(P) \notin \mathrm{cl}\, \mathcal{M}$, then $\inf_\theta E[\varphi(\theta, X)] = -\infty$, making the choice of exponential family model poor, as it cannot capture the mean parameters at all.
Definition 11.5. Let $\nu$ be a base measure on $\mathcal{X}$ and assume $P$ has density $p$ with respect to $\nu$. Then the Shannon entropy of $P$ is
\[ H(P) = -\int p(x) \log p(x) \, d\nu(x). \]
Example 11.4.4: Let $P$ be the uniform distribution on $[0, a]$. Then the differential entropy is $H(P) = -\log(1/a) = \log a$. ◇
Example 11.4.5: Let $P$ be the normal distribution $N(\mu, \Sigma)$ and $\nu$ be Lebesgue measure. Then
\[ H(P) = \frac{1}{2} \log \det(2 \pi \Sigma) + \frac{1}{2} E[(X - \mu)^\top \Sigma^{-1} (X - \mu)] = \frac{d}{2} \log(2 \pi e) + \frac{1}{2} \log \det(\Sigma), \]
because $p(x) = \frac{1}{\sqrt{\det(2 \pi \Sigma)}} \exp(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu))$. ◇
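For exponential families, the log partition determines the Shannon entropy directly, highlighting that $-h$ is indeed a familiar entropy-like object.
Proposition 11.4.6. Let $\{P_\theta\}$ be a regular exponential family with respect to the base measure $\nu$. Then for any $\theta \in \Theta$,
\[ H(P_\theta) = -h(\mu(P_\theta)) = A(\theta) - \langle \mu(P_\theta), \theta \rangle, \]
where $h(\mu) = \sup_\theta \{\langle \mu, \theta \rangle - A(\theta)\} = A^*(\mu)$.
Proof Using $\log p_\theta(x) = \langle \theta, \phi(x) \rangle - A(\theta)$ we obtain $H(P_\theta) = -E_\theta[\langle \theta, \phi(X) \rangle - A(\theta)] = A(\theta) - \langle \mu(P_\theta), \theta \rangle$, where as usual $\mu(P) = E_P[\phi(X)]$. As $\theta$ and $\mu(P_\theta)$ have the duality relationship $\nabla A(\theta) = \mu(P_\theta)$, we obtain $A(\theta) - \langle \mu(P_\theta), \theta \rangle = -h(\mu(P_\theta))$ as desired.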
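As a quick check of Proposition 11.4.6, take the Bernoulli family on $\{0, 1\}$ with counting base measure, $\phi(x) = x$, and $A(\theta) = \log(1 + e^\theta)$. The Python sketch below (function names are illustrative assumptions) compares the Shannon entropy computed from the probabilities with $A(\theta) - \mu(P_\theta) \theta$.

```python
import math


def log_partition(theta):
    """A(theta) = log(1 + e^theta) for the Bernoulli family."""
    return math.log1p(math.exp(theta))


def mean_parameter(theta):
    """mu(P_theta) = A'(theta) = P_theta(X = 1)."""
    return 1.0 / (1.0 + math.exp(-theta))


def shannon_entropy(theta):
    """H(P_theta) computed directly from the two probabilities."""
    mu = mean_parameter(theta)
    return -mu * math.log(mu) - (1.0 - mu) * math.log(1.0 - mu)


def entropy_from_duality(theta):
    """A(theta) - <mu(P_theta), theta>, the right-hand side of the proposition."""
    return log_partition(theta) - mean_parameter(theta) * theta
```

The two entropy computations agree for every finite `theta`; at `theta = 0` both equal $\log 2$.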
The maximum entropy principle, which Jaynes [114] first elucidated in the 1950s, originates in statistical mechanics, where Jaynes showed that (in a sense) entropy in statistical mechanics and information theory were equivalent. The maximum entropy principle is this: given some constraints (prior information) about a distribution $P$, we consider all probability distributions satisfying said constraints. Then to encode our prior information while being as “objective” or “agnostic” as possible (essentially being as uncertain as possible), we should choose the distribution $P$ satisfying the constraints to maximize the Shannon entropy. This principle naturally gives rise to exponential family models, and (as we revisit later) allows connections to Bayesian and minimax procedures. One caveat throughout is that the base measure $\nu$ is essential to all our derivations: it radically affects the distributions $P$ we consider.
With all this said, suppose (without making any exponential family assumptions yet) we are given $\phi : \mathcal{X} \to \mathbb{R}^d$ and a mean vector $\mu \in \mathbb{R}^d$, and we wish to solve
\[ \text{maximize} \quad H(P) \quad \text{subject to} \quad E_P[\phi(X)] = \mu \tag{11.4.2} \]
over all distributions $P \in \mathcal{P}$, the collection of distributions having densities with respect to the base measure $\nu$, that is, $P \ll \nu$. Rewriting problem (11.4.2), we see that it is equivalent to
\[ \begin{aligned} \text{maximize} \quad & -\int p(x) \log p(x) \, d\nu(x) \\ \text{subject to} \quad & \int p(x) \phi(x) \, d\nu(x) = \mu, \quad p(x) \ge 0 \text{ for } x \in \mathcal{X}, \quad \int p(x) \, d\nu(x) = 1. \end{aligned} \]
Let
\[ \mathcal{P}^{\text{lin}}_\mu := \{P \ll \nu \mid E_P[\phi(X)] = \mu\} \]
be distributions with densities w.r.t. $\nu$ satisfying the expectation (linear) constraint $E[\phi(X)] = \mu$. We then obtain the following theorem.
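Theorem 11.4.7. For $\theta \in \mathbb{R}^d$, let $P_\theta$ have density
\[ p_\theta(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta)), \quad A(\theta) = \log \int \exp(\langle \theta, \phi(x) \rangle) \, d\nu(x), \]
with respect to the measure $\nu$. If $E_{P_\theta}[\phi(X)] = \mu$, then $P_\theta$ maximizes $H(P)$ over $\mathcal{P}^{\text{lin}}_\mu$; moreover, the distribution $P_\theta$ is unique (though $\theta$ need not be).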
Proof We first give a heuristic derivation (which is not completely rigorous) and then check that our result is exact. First, we write a Lagrangian for the problem (11.4.2). Introducing Lagrange multipliers $\lambda(x) \ge 0$ for the constraint $p(x) \ge 0$, $\theta_0 \in \mathbb{R}$ for the normalization constraint that $P(\mathcal{X}) = 1$, and $\theta \in \mathbb{R}^d$ for the constraints that $E_P[\phi(X)] = \mu$, we obtain the following Lagrangian:
\[ L(p, \theta, \theta_0, \lambda) = \int p(x) \log p(x) \, d\nu(x) + \sum_{i=1}^d \theta_i \left( \mu_i - \int p(x) \phi_i(x) \, d\nu(x) \right) + \theta_0 \left( \int p(x) \, d\nu(x) - 1 \right) - \int \lambda(x) p(x) \, d\nu(x). \]
Now, heuristically treating the density $p = [p(x)]_{x \in \mathcal{X}}$ as a finite-dimensional vector (in the case that $\mathcal{X}$ is finite, this is completely rigorous), we take derivatives and obtain
\[ \frac{\partial}{\partial p(x)} L(p, \theta, \theta_0, \lambda) = 1 + \log p(x) - \sum_{i=1}^d \theta_i \phi_i(x) + \theta_0 - \lambda(x) = 1 + \log p(x) - \langle \theta, \phi(x) \rangle + \theta_0 - \lambda(x). \]
To find the minimizing $p$ for the Lagrangian (the function is convex in $p$), we set this equal to zero to find that
\[ p(x) = \exp(\langle \theta, \phi(x) \rangle - 1 - \theta_0 + \lambda(x)). \]
Now, we note that with this setting, we always have $p(x) > 0$, so that the constraint $p(x) \ge 0$ is unnecessary and (by complementary slackness) we have $\lambda(x) = 0$. In particular, by taking $\theta_0 = -1 + A(\theta) = -1 + \log \int \exp(\langle \theta, \phi(x) \rangle) \, d\nu(x)$, we have that (according to our heuristic derivation) the optimal density $p$ should have the form
\[ p_\theta(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta)). \]
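Consider any distribution $P \in \mathcal{P}^{\text{lin}}_\mu$, and assume that we have some $\theta$ satisfying $E_{P_\theta}[\phi(X)] = \mu$. In this case, we may expand the entropy $H(P)$ as
\[ \begin{aligned} H(P) = -\int p \log p \, d\nu & = -\int p \log \frac{p}{p_\theta} \, d\nu - \int p \log p_\theta \, d\nu \\ & = -D_{\text{kl}}(P \| P_\theta) - \int p(x) [\langle \theta, \phi(x) \rangle - A(\theta)] \, d\nu(x) \\ & \overset{(\star)}{=} -D_{\text{kl}}(P \| P_\theta) - \int p_\theta(x) [\langle \theta, \phi(x) \rangle - A(\theta)] \, d\nu(x) \\ & = -D_{\text{kl}}(P \| P_\theta) + H(P_\theta), \end{aligned} \]
where the equality $(\star)$ follows because both $P$ and $P_\theta$ satisfy the constraint $E[\phi(X)] = \mu$.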
We obtain the following immediate corollary, which shows the direct connection between maximum entropy and minimizing expected logarithmic loss.
Corollary 11.4.8. Let $\{P_\theta\}$ be the exponential family with densities $p_\theta(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$ with respect to $\nu$. For any $\mu \in \mathcal{M}$, if there exists $\theta$ satisfying $E_{P_\theta}[\phi(X)] = \mu$, then $P_\theta$ solves
So if we consider minimizing the negative log loss (which is strictly proper) but wish to guarantee
that the predictive distribution satisfies EP [φ(X)] = µ, then the exponential family model is the
unique minimizer.
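On a finite sample space the maximum entropy property can be verified exactly. The Python sketch below takes $\mathcal{X} = \{0, 1, 2\}$ with counting measure and $\phi(x) = x$ (all illustrative assumptions), perturbs $P_\theta$ along a direction that preserves both total mass and the mean, and checks the decomposition $H(P) = H(P_\theta) - D_{\text{kl}}(P \| P_\theta)$ for the perturbed distributions.

```python
import math

SUPPORT = [0, 1, 2]  # hypothetical sample space with counting base measure


def exp_family(theta):
    """p_theta(x) proportional to exp(theta * x) on SUPPORT."""
    w = [math.exp(theta * x) for x in SUPPORT]
    z = sum(w)
    return [v / z for v in w]


def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)


def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def mean(p):
    return sum(x * v for x, v in zip(SUPPORT, p))


def perturb(p, t):
    """Move along (1, -2, 1), which changes neither the total mass nor the mean."""
    return [p[0] + t, p[1] - 2.0 * t, p[2] + t]


p_theta = exp_family(0.4)
competitors = [perturb(p_theta, t) for t in (-0.05, 0.03, 0.1)]
```

Each competitor has the same mean as `p_theta` but strictly smaller entropy, and the entropy deficit equals its KL divergence to `p_theta`.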
We give three examples of maximum entropy, showing how the choice of the base measure $\nu$ affects the resulting maximum entropy distribution. For all three, we assume that the space $\mathcal{X} = \mathbb{R}$ is the real line. We consider maximizing the entropy over all distributions $P$ satisfying
\[ E_P[X^2] = 1. \]
Example 11.4.9: Assume that the base measure $\nu$ is counting measure on the support $\{-1, 1\}$, so that $\nu(\{-1\}) = \nu(\{1\}) = 1$. Then the maximum entropy distribution is given by $P(X = x) = \frac{1}{2}$ for $x \in \{-1, 1\}$. ◇
Example 11.4.10: Assume that the base measure $\nu$ is Lebesgue measure on $\mathcal{X} = \mathbb{R}$, so that $\nu([a, b]) = b - a$ for $b \ge a$. Then by Theorem 11.4.7, we have that the maximum entropy distribution has the form $p_\theta(x) \propto \exp(-\theta x^2)$; recognizing the normal, we see that the optimal distribution is simply $N(0, 1)$. ◇
Example 11.4.11: Assume that the base measure $\nu$ is counting measure on the integers $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, \ldots\}$. Then Theorem 11.4.7 shows that the optimal distribution is a discrete version of the normal: we have $p_\theta(x) \propto \exp(-\theta x^2)$ for $x \in \mathbb{Z}$. That is, we choose $\theta > 0$ so that the distribution $p_\theta(x) = \exp(-\theta x^2) / \sum_{j=-\infty}^\infty \exp(-\theta j^2)$ has variance 1. ◇
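The parameter $\theta$ of the discrete normal in Example 11.4.11 has no closed form, but because the variance of $p_\theta$ is strictly decreasing in $\theta$, bisection finds it quickly. The Python sketch below truncates the integers to $|x| \le 60$, where the neglected mass is negligible for the values of $\theta$ searched; the truncation level, search interval, and function names are assumptions made for illustration.

```python
import math


def variance(theta, cutoff=60):
    """Variance of p_theta(x) proportional to exp(-theta x^2) on {-cutoff, ..., cutoff}.

    The distribution is symmetric, so its mean is zero and Var = E[X^2].
    """
    xs = range(-cutoff, cutoff + 1)
    w = [math.exp(-theta * x * x) for x in xs]
    z = sum(w)
    return sum(wi * x * x for wi, x in zip(w, xs)) / z


def solve_theta(lo=0.05, hi=5.0, iters=100):
    """Bisection for the theta with variance(theta) = 1."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if variance(mid) > 1.0:
            lo = mid  # variance too large, so theta must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


theta_hat = solve_theta()
```

The solution is very close to $\theta = 1/2$, the value for the continuous normal $N(0, 1)$ in Example 11.4.10, though not exactly equal to it.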
We remark in passing that in some cases, it is interesting to instead consider inequality rather than equality constraints in the linear constraints defining the family $\mathcal{P}^{\text{lin}}_\mu$. Exercises 11.10 and 11.11 explore these ideas.
Lastly, we consider the empirical variant of minimizing the log loss, equivalently, of maximum likelihood, where we maximize the likelihood of a given sample $X_1, \ldots, X_n$. Consider the sample-based maximum likelihood problem of solving
\[ \operatorname*{maximize}_\theta \; \prod_{i=1}^n p_\theta(X_i) \quad \equiv \quad \operatorname*{minimize}_\theta \; -\frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i), \tag{11.4.3} \]
for the exponential family model $p_\theta(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$. We have the following result.
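Proposition 11.4.12. Let $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^n \phi(X_i)$. Then any $\theta$ solving $E_{P_\theta}[\phi(X)] = \hat{\mu}_n$ is a maximum likelihood solution, which exists if and only if $\hat{\mu}_n \in \mathrm{relint}\, \mathcal{M}$. If the sample is drawn $X_i \overset{\text{iid}}{\sim} P$ where $P \ll \nu$ and $\mu(P) \in \mathrm{relint}\, \mathcal{M}$, then with probability 1, $\hat{\mu}_n \in \mathrm{relint}\, \mathcal{M}$ eventually.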
Proof Define the empirical negative log likelihood
\[ \hat{L}_n(\theta) := -\frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i) = -\langle \hat{\mu}_n, \theta \rangle + A(\theta), \]
which is convex. Taking derivatives and using that Θ = dom A is open, the parameter θ is a
minimizer if and only if ∇L̂_n(θ) = ∇A(θ) − µ̂_n = 0, that is, if and only if ∇A(θ) = µ̂_n. Apply Proposition 11.4.1.
For the final statement, note that µ̂_n ∈ aff(M) with probability 1. Then because µ(P) ∈ relint M
and µ̂_n → µ(P) with probability 1, we see that for any ε > 0 there is some (random, but finite) N
such that n ≥ N implies ‖µ̂_n − µ(P)‖ ≤ ε and µ̂_n ∈ aff(M), so that µ̂_n ∈ relint M once ε is small enough.
As a consequence of the result, we have the following rough equivalences tying together the
preceding material. In short, maximum entropy subject to (linear) empirical moment constraints
(Theorem 11.4.7) is equivalent to maximum likelihood estimation in exponential families (Propo-
sition 11.4.12), and these are all equivalent to minimizing the (surrogate) log loss E[ϕ(θ, X)].
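As a small numerical illustration of Proposition 11.4.12, the sketch below fits the Poisson family of Exercise 11.8, with φ(y) = y and A(θ) = e^θ, by Newton's method on the empirical negative log likelihood and checks the moment-matching condition ∇A(θ) = µ̂_n. The sample values, the starting point, and the function name are hypothetical choices made only for illustration.

```python
import math

def poisson_mle_natural_parameter(sample, iterations=50):
    """Minimize L_n(theta) = -mu_hat * theta + exp(theta) by Newton's method.

    Assumes mu_hat > 0, that is, mu_hat lies in the relative interior of the
    mean parameter space (0, infinity); otherwise no minimizer exists.
    """
    mu_hat = sum(sample) / len(sample)
    theta = 0.0
    for _ in range(iterations):
        gradient = math.exp(theta) - mu_hat  # gradient of L_n: A'(theta) - mu_hat
        hessian = math.exp(theta)            # second derivative: A''(theta)
        theta -= gradient / hessian
    return theta, mu_hat

sample = [2, 0, 3, 1, 4]  # hypothetical observations with mean 2
theta_hat, mu_hat = poisson_mle_natural_parameter(sample)
```

At the solution, `math.exp(theta_hat)` agrees with `mu_hat`, so the fitted model reproduces the empirical moment, as in the maximum entropy formulation.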
so that p is the carrier of Pθ (recall Chapter 3). The next proposition uses this to show, perhaps
unsurprisingly given our derivations thus far, that I-Projection is essentially the same as maximum
entropy, and the projection of a distribution P onto a family of linearly constrained distributions
yields exponential family distributions.
Proposition 11.4.13. Let Π = P^lin_µ. If pθ(x) = p(x) exp(⟨θ, φ(x)⟩ − A(θ)) satisfies E_{Pθ}[φ(X)] = µ,
then Pθ solves the I-projection problem (11.4.4). Moreover we have the Pythagorean identity
for Q ∈ P^lin_µ.
Proof We perform an expansion of the KL-divergence that parallels that in the proof of Theo-
rem 11.4.7. Indeed, for any Q ≪ ν, we have
\[ D_{\rm kl}(Q \| P) = \int q \log \frac{q}{p} \, d\nu = \int q \log \frac{p_\theta}{p} \, d\nu + D_{\rm kl}(Q \| P_\theta) = \int q(x) \left[ \langle \theta, \phi(x) \rangle - A(\theta) \right] d\nu(x) + D_{\rm kl}(Q \| P_\theta) \]
because pθ(x) = p(x) exp(⟨θ, φ(x)⟩ − A(θ)). Then because Q ∈ P^lin_µ, we have ∫ q(x)[⟨θ, φ(x)⟩ −
In brief, the exponential family model is the projection—in the sense of the KL divergence—of
a distribution P onto the collection of distributions satisfying E[φ(X)] = µ.
for any s ∈ G(µ), where equality (ii) follows because p*_i(µ) = 0 for i ∈ I(µ). As we allow extended
reals, replace s with s_∞ = lim_{t→∞}(s + t∆), which satisfies ⟨s_∞, y_i⟩ = ∞ = ℓ(µ, y_i) for i ∈ I(µ),
and we finally obtain
\[ h(\mu') \ge h(\mu) + \sum_{i=1}^m s_\infty^\top y_i \left( p_i^\star(\mu) - p_i^\star(\mu') \right) = h(\mu) + \langle s_\infty, \mu - \mu' \rangle. \]
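Now as µ ∈ M°, for any vector v ≠ 0 we have ⟨v, µ⟩ < ν*(φ, v). Let ε > 0 satisfy ⟨v, µ⟩ < ν*(φ, v) − ε and
be otherwise arbitrary, fix θ ∈ Θ, and let X_ε = {x | ⟨v, φ(x)⟩ ≥ ν*(φ, v) − ε}, which satisfies
ν(X_ε) > 0. Then
\begin{align*}
A(\theta + tv) - \langle \mu, \theta + tv \rangle
& = \log \int \exp(\langle \phi(x), \theta + tv \rangle) \, d\nu(x) - \langle \mu, \theta + tv \rangle \\
& \ge \log \int_{X_\epsilon} \exp(\langle \phi(x), \theta \rangle) e^{t(\nu^\star - \epsilon)} \, d\nu(x) - \langle \mu, \theta \rangle - t \langle \mu, v \rangle \\
& = t(\nu^\star(\phi, v) - \epsilon) + \log \nu(X_\epsilon) - t \langle \mu, v \rangle + \log \int_{X_\epsilon} e^{\langle \phi(x), \theta \rangle} \, d\nu(x) - \langle \mu, \theta \rangle.
\end{align*}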
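If ν(X_ε) = +∞, then A(θ + tv) = +∞ and so A′_∞(v) > 0 certainly. If ν(X_ε) < ∞, then note that
ν*(φ, v) − ε − ⟨µ, v⟩ > 0, and so
\[ A(\theta + tv) - \langle \mu, \theta + tv \rangle \ge t(\nu^\star(\phi, v) - \epsilon - \langle \mu, v \rangle) - \log \nu(X_c) + \log \int_{X_c} e^{\langle \phi(x), \theta \rangle} \, d\nu(x) - \langle \mu, \theta \rangle \]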
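and thus
\[ \frac{A(\theta + tv) - \langle \mu, \theta + tv \rangle - (A(\theta) - \langle \mu, v \rangle)}{t} \ge \nu^\star(\phi, v) - \epsilon - \langle \mu, v \rangle + o(1) \tag{11.5.1} \]
as t → ∞.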
Extending to the non-minimal case. If the exponential family is not minimal, there exists
a unit vector u and constant c such that ⟨u, φ(x)⟩ = c for ν-almost all x. Let U ∈ R^{d×k} be an
orthonormal basis for all such vectors, where k is the dimension of this collection. Then there exists
a vector c ∈ R^k such that c = U^⊤φ(x) for ν-almost all x, and we see that A(θ + Uv) = A(θ) + ⟨c, v⟩
as ⟨θ + Uv, φ(x)⟩ = ⟨θ, φ(x)⟩ + ⟨c, v⟩ for ν-almost all x. We show both inclusions as above. Let
U⊥ ∈ R^{d×(d−k)} be an orthonormal basis for the orthogonal subspace to U, so that U^⊤U = I_k and
U⊥^⊤U⊥ = I_{d−k}, and for any µ ∈ M, we have aff(M) = µ + span(U⊥).
Showing that ∇A(Θ) ⊂ M°. Fix θ0 ∈ Θ and let µ = ∇A(θ0). We must show that there
exists ε > 0 such that for all u ∈ span(U⊥) satisfying ‖u‖ ≤ ε, the point µ + u ∈ M. To that end,
note that for any vectors v ∈ R^{d−k} and w ∈ R^k, we have
A(θ0 + U⊥v + Uw) − ⟨µ + u, U⊥v + Uw⟩ = A(θ0 + U⊥v) − ⟨µ + u, U⊥v⟩
because U^⊤u = 0 and U^⊤µ = c for each u ∈ span(U⊥) and µ ∈ M. The function g(v) :=
A(θ0 + U⊥v) − ⟨µ, U⊥v⟩ is strictly convex as ∇²g(v) = U⊥^⊤∇²A(θ0 + U⊥v)U⊥ ≻ 0, because we
know that u^⊤φ(x) is non-constant for all u ∈ span(U⊥). Define f(v) = A(θ0 + U⊥v) − ⟨µ, U⊥v⟩.
Then applying Proposition C.1.10 as in the minimal representation case, there exists ε > 0 such
that v_u = argmin_v {f(v) − ⟨u, U⊥v⟩} exists and is continuous in u ∈ span(U⊥), where by inspection
v_0 = 0. Then θ_u := θ0 + U⊥v_u minimizes A(θ) − ⟨µ + u, θ⟩, satisfying ∇A(θ_u) = µ + u.
Showing that M° ⊂ ∇A(Θ). We again follow the logic of the minimal representation case.
Let µ ∈ M° = relint M, and recall ν*(φ, U⊥v) = ess sup_x ⟨φ(x), U⊥v⟩. Then there exists ε > 0 such
that µ + u ∈ M for each u ∈ span(U⊥) with ‖u‖ ≤ ε, so that
v ≠ 0. Following the same argument, mutatis mutandis, as that leading to inequality (11.5.1)
yields that g′_∞(v) > 0 for all v ≠ 0. That is, v ↦ A(θ + U⊥v) − ⟨µ, U⊥v⟩ has a minimizer v(µ)
(Corollary C.2.6), which is unique by the strict convexity of v ↦ A(θ + U⊥v), and which necessarily
satisfies U⊥^⊤∇A(θ + U⊥v(µ)) = U⊥^⊤µ. As U^⊤∇A(θ) = c for all θ and U^⊤µ = c for all µ ∈ M, this
shows that there exists θ(µ) such that ∇A(θ(µ)) = µ as desired. Moreover, fixing an arbitrary θ
and letting v(µ) be the unique minimizer of A(θ + U⊥v) − ⟨µ, U⊥v⟩, the set of all minimizers is
\[ \Theta^\star(\mu) = \operatorname*{argmin}_{\theta} \{ A(\theta) - \langle \mu, \theta \rangle \} = \left\{ \theta + U_\perp v(\mu) + U w \mid w \in \mathbb{R}^k \right\}. \]
that if µ ∈ Mo , there exists θ(µ) ∈ Θ such that ∇A(θ(µ)) = µ by Proposition 11.4.1. This θ(µ)
maximizes hθ, µi − A(θ) over all θ, and so h(µ) = hθ(µ), µi − A(θ(µ)) < ∞. By Corollary C.2.4 in
Appendix C.2.1 and Proposition 11.4.1, dom ∂h = Mo , and as h is subdifferentiable on the relative
interior of its domain, we have dom h ⊂ cl Mo = cl M. As h is closed convex, any point µ outside
its domain necessarily satisfies h(µ) = +∞.
Finally, for part (iii), we note that the function g(t) = h(tµ0 + (1 − t)µ) is a one-dimensional
closed convex function. One-dimensional closed convex functions are continuous on their domains
(Observation B.3.6 in Appendix B.3.2), and so g is necessarily continuous. Thus limt↓0 g(t) = g(0).
The existence of θt follows from Proposition 11.4.1.
Bibliography
JCD Comment: Need to do a lot here!
Gneiting and Raftery [93]
11.6 Exercises
Exercise 11.1 (Strict propriety of the log loss): Let ∆_k = {p ∈ R^k_+ | 1^⊤p = 1} be the probability
simplex. Show that if ℓ(q, y) = − log q_y and P(Y = y) = p_y, then
\[ \operatorname*{argmin}_{q \in \Delta_k} E[\ell(q, Y)] = p, \]
(a) Let ℓ : ∆_k → R be strictly proper and let Y have p.m.f. p. Show that H(p) = inf_{q∈∆_k} E[ℓ(q, Y)]
is strictly concave, and that H_per is strictly concave and continuously differentiable on R^k_{++}.
(b) Show the converse that if H : ∆_k → R is strictly concave and its perspective H_per is differen-
tiable on R^k_{++}, then there exists a proper scoring loss ℓ satisfying
Exercise 11.3: Give the details in the computations for Example 11.3.4.
Exercise 11.4: Let y ∈ {0, 1} and take the regularization function h(p) = − log p − log(1 − p).
(b) Give the associated loss ℓ and surrogate loss ϕ in the sense of Section 11.3.
(c) Plot the surrogate ϕ(s, y) + log 8 and the logistic regression surrogate log(1 + e^s) − sy for
y ∈ {0, 1}, each as a function of s. (The shift by log 8 guarantees the losses coincide at s = 0.)
(d) Give predh (s) for s ∈ R, verifying that predh (s) ∈ [0, 1].
Exercise 11.5: For h(p) = − log p − log(1 − p) as in Exercise 11.4, show that h is self-concordant,
meaning that |h‴(p)| ≤ 2(h″(p))^{3/2} for all p ∈ (0, 1). (Such functions are important in optimization;
the conjugate h* is then also guaranteed to be self-concordant.)
Exercise 11.6 (Surrogates for regression): Define h(c) = (1/4)c⁴.
(b) Show directly that the surrogate loss ϕ(s, y) = h*(s) − sy satisfies that if ŝ = argmin_s E[ϕ(s, Y)],
then pred_h(ŝ) = E[Y].
Exercise 11.7: Let P be a predicted distribution and for α ∈ [0, 1/2], define the lower and upper
quantiles l_α = Quant_α(P) and u_α = Quant_{1−α}(P). Given these quantiles, for a finite set A ⊂ [0, 1/2],
define the weighted interval loss
\[ W(P, y) := \sum_{\alpha \in A} \left[ \alpha (u_\alpha - l_\alpha) + \mathrm{dist}(y, [l_\alpha, u_\alpha]) \right], \]
which penalizes P using both the size (u_α − l_α) of the quantile intervals and the distance of the
outcome y from the predicted quantiles. Define the symmetrized set A_s = A ∪ {1 − α | α ∈ A}.
Show that
W(P, y) = ℓ_{quant,A_s}(P, y),
where ℓ_quant is the quantile loss (11.2.4).
Exercise 11.8: We explore a particularization of the results in Section 11.4. Let Y ∼ Poi(e^θ), so
that Y has p.m.f. pθ(y) = exp(θy − e^θ)/y! for y ∈ N. Let A(θ) = e^θ be the log-partition function.
Define the “surrogate” loss ϕ(θ, y) = − log pθ(y).
(a) Give the associated negative generalized entropy h(µ) for µ ∈ (0, ∞).
(b) Give the associated loss ℓ(µ, y) in the proper representation of Theorem 11.2.14. Directly verify
that it is strictly proper, in that argmin_µ E[ℓ(µ, Y)] = E[Y] for any Y supported on R_+.
Exercise 11.9: We explore a particularization of Example 11.4. Let X ∼ N(0, Σ) for a co-
variance Σ ≻ 0, and let K = Σ^{−1} be the associated precision matrix. Then X has density
p_K(x) = exp(−(1/2)⟨xx^⊤, K⟩ + (1/2) log det(K)) with respect to (a scaled) Lebesgue measure, and log
partition A(K) = −(1/2) log det(K), which has domain the positive definite matrices K ≻ 0 (and is
+∞ elsewhere).
(a) Give the associated negative generalized entropy h(M) for symmetric matrices M. Specify the
domain of h.
(b) Give the associated loss ℓ(M, x) in the proper representation of Theorem 11.2.14. Directly
verify that it is strictly proper, in that if the second moment matrix C := E[XX^⊤] of X
satisfies C ≻ 0, then argmin_M E[ℓ(M, X)] = C.
Exercise 11.10: In this extended exercise, we generalize Theorem 11.4.7 to apply to general
(finite-dimensional) convex cone constraints. A set C is a convex cone if for any two points x, y ∈ C,
we have λx + (1 − λ)y ∈ C for all λ ∈ [0, 1], and C is closed under positive scaling: x ∈ C implies
that tx ∈ C for all t ≥ 0. The following are standard examples (the positive orthant and the
semi-definite cone):
ii. The semidefinite cone. Take C = {X ∈ R^{d×d} : X = X^⊤, X ⪰ 0}, where a matrix X ⪰ 0 means
that a^⊤Xa ≥ 0 for all vectors a. Then C is convex and closed under positive scaling as well.
Given a convex cone C, we associate a cone ordering ⪰ with the cone and say that for two elements
x, y ∈ C, we have x ⪰ y if x − y ⪰ 0, that is, x − y ∈ C. In the orthant case, this simply means that
x is component-wise larger than y. For a given inner product ⟨·, ·⟩, define the dual cone
C* := {y | ⟨x, y⟩ ≥ 0 for all x ∈ C}.
For the standard (Euclidean) inner product, the positive orthant is thus self-dual, and similarly
the semidefinite cone is also self-dual. For a vector y, we write y ⪰_* 0 if y ∈ C* is in the dual cone.
With this setup, consider the following linearly constrained maximum entropy problem, where the
cone ordering derives from a cone C:
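Theorem 11.6.1. For θ ∈ R^d and K ∈ C*, the dual cone to C, let P_{θ,K} have density
\[ p_{\theta,K}(x) = \exp\left( \langle \theta, \phi(x) \rangle - \langle K, \psi(x) \rangle - A(\theta, K) \right), \qquad A(\theta, K) = \log \int \exp\left( \langle \theta, \phi(x) \rangle - \langle K, \psi(x) \rangle \right) d\nu(x), \]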
Exercise 11.11 (An application of Theorem 11.6.1): Let the cone C be the positive semidefinite
cone in R^{d×d}, ν be the Lebesgue measure dν(x) = dx and define ψ(x) = (1/2)xx^⊤ ∈ R^{d×d}. Let Σ ≻ 0.
Give the density solving
\[ \text{maximize} \;\; -\int p(x) \log p(x) \, dx \quad \text{subject to} \;\; E_P[XX^\top] \preceq \Sigma. \]
Exercise 11.12: Prove that the log determinant function is concave over the positive semidefinite
matrices. That is, show that for X, Y ∈ R^{d×d} satisfying X ≻ 0 and Y ≻ 0, we have
where σ_ij are specified only for indices i, j ∈ S (but we know that σ_ij = σ_ji and (i, i) ∈ S for all i).
Let Σ* denote the solution to this problem, assuming there is a positive definite matrix Σ satisfying
Σ_ij = σ_ij for i, j ∈ S. Show that for each unobserved pair (i, j) ∉ S, the (i, j) entry [Σ*^{−1}]_ij of
the inverse Σ*^{−1} is 0. Hint: The distribution maximizing the entropy H(X) = −∫ p(x) log p(x) dx
subject to E[X_i X_j] = σ_ij has Gaussian density of the form p(x) = exp(Σ_{(i,j)∈S} λ_ij x_i x_j − Λ_0).
Chapter 12
In Chapter 11, we encountered proper losses, in which we assume we predict probability distribu-
tions on outcomes Y . In typical problems, we wish to predict things about Y from a given set of
covariates or inputs X, and in focusing exclusively on the losses ` themselves, we implicitly assume
that we can model Y | X basically perfectly. Here, we move away from this focus exclusively on
the loss itself to incorporate discussion of predictions, where we seek a function f : X → Y (or
some other output space) that yields the most accurate predictions.
In this chapter, we adopt the view of Section 11.2.3, where the target Y ⊂ Rk is vector-valued,
and we wish to predict its expectation E[Y | X] as accurately as possible. For binary prediction,
we have Y ∈ {0, 1}, so that E[Y | X] = P(Y = 1 | X); in the case of multiclass prediction problems,
it is easy to represent Y as an element of the k standard basis vectors {e1 , . . . , ek } ⊂ Rk , so that
p = E[Y | X] is simply the p.m.f. of Y given X with entries py = P(Y = y | X). We focus here,
therefore, on choosing functions to minimize the risk, or expected population loss,
L(f) := E[ℓ(f(X), Y)].
When f is chosen from a collection F ⊂ {X → R^k} of functions, for example, to guarantee that we
can generalize, we do not expect to be able to perfectly minimize the population loss. Accordingly,
even though the loss is proper and hence minimized by f ? (x) = E[Y | X = x], we cannot perfectly
model reality, and so it is unrealistic to expect to be able to find f satisfying f (x) = E[Y | X = x],
even approximately, for all x.
We therefore depart from the goal of perfection to address a somewhat simpler criterion: that
of calibration. Here, the idea is that a predictor should be accurate on average conditional on its
own predictions. Consider again a weather forecasting problem, where Yt = 1 indicates it rains on
day t and Yt = 0 indicates no rain, and we wish to predict Yt based on observable covariates Xt
at time t. While we would like a forecaster to have perfect predictions pt = E[Yt | Xt ], we instead
ask that on days where the forecaster makes a given prediction, it should rain (roughly) with that
given frequency. In particular, we seek calibration, which is that
f (X) = E[Y | f (X)]. (12.0.1)
That is, given that the forecaster makes a prediction with value p = f (X), we should have
E[Y | f (X) = p] = p.
While in general it is challenging to achieve this perfect calibration, in this chapter we investigate
several variants of the desideratum (12.0.1) that allow for more elegant statistical and information-
theoretic approaches, as well as procedures to achieve calibration.
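The calibration condition (12.0.1) can be checked directly on a finite record of forecasts and outcomes. The sketch below groups days by the forecast value and compares each forecast with the observed frequency of rain on those days; the forecast and outcome lists are made-up data, and the function name is a hypothetical choice.

```python
from collections import defaultdict

def conditional_frequencies(predictions, outcomes):
    """Empirical version of E[Y | f(X) = p] for each distinct forecast p."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p, y in zip(predictions, outcomes):
        sums[p] += y
        counts[p] += 1
    return {p: sums[p] / counts[p] for p in counts}

# Hypothetical record: four days at each forecast value.
forecasts = [0.25] * 4 + [0.5] * 4 + [0.75] * 4
rain = [1, 0, 0, 0,  1, 1, 0, 0,  1, 1, 1, 1]

frequencies = conditional_frequencies(forecasts, rain)
# Gap between the forecast and the observed frequency, per forecast value.
gaps = {p: abs(frequencies[p] - p) for p in frequencies}
```

In this toy record the forecasts 0.25 and 0.5 match the observed frequencies exactly, while the forecast 0.75 is off by 0.25, so the forecaster is not calibrated.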
2. Show how to measure it, specifically using partitioned methods. I think that parti-
tioned ones should be better than non-partitioned approaches, because we can estimate
the binned / partitioned calibration error
This is far too stringent a condition to be achievable, so that one relaxes to various forms of marginal
or average calibration. See the bibliographic notes for some discussion of the approaches here.
The second strand of research on calibration that, again, we do not address, considers more
adversarial and sequential settings, where instead of any probabilistic underpinnings, nature (an
adversary) plays a game against the player (or predictor). Philosophically, this approach elegantly
does away with the need for probabilities: there is a physical world where whether it rains tomorrow
is essentially deterministic, and we use probability as a crutch to model things we cannot measure,
so calibration means that of the days on which we predict rain with a chance of 50%, it rains on
roughly 50% of those days. In this sequential setting, at times t = 1, 2, . . . , T , the player makes
a prediction pt of the outcome, and then nature may choose the outcome Yt . Without giving the
player a bit more leeway, calibration is impossible: say that Y ∈ {0, 1}, and nature plays Yt = 1
if pt ≤ .5 and Yt = 0 if pt > .5. Then any player is miscalibrated at least by an amount .5.
Astoundingly, Foster and Vohra [84] show that if the player is allowed to randomize, then the
forecasted probabilities pt can be made arbitrarily close to the empirical averages of the observed
Yt . While many of the techniques we consider and develop arise from this adversarial setting in the
literature, we shall mostly address the scenarios in which Y is indeed random.
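The impossibility argument for deterministic forecasters is easy to simulate. In the sketch below, nature follows the rule described above (Y_t = 1 if p_t ≤ 0.5 and Y_t = 0 otherwise), and we compute, for each distinct forecast value, the gap between the forecast and the average outcome on the rounds where it was issued. The forecast sequence is an arbitrary illustrative choice.

```python
def adversarial_outcome(prediction):
    """Nature's rule from the text: Y_t = 1 if p_t <= 0.5, else Y_t = 0."""
    return 1 if prediction <= 0.5 else 0

def calibration_gaps(predictions):
    """For each distinct forecast p, the gap |average outcome - p|."""
    totals, counts = {}, {}
    for p in predictions:
        y = adversarial_outcome(p)
        totals[p] = totals.get(p, 0) + y
        counts[p] = counts.get(p, 0) + 1
    return {p: abs(totals[p] / counts[p] - p) for p in counts}

forecasts = [0.1, 0.5, 0.9, 0.5, 0.7, 0.3, 0.9]  # hypothetical deterministic play
gaps = calibration_gaps(forecasts)
```

Every forecast value is off by at least 0.5, whatever deterministic sequence the player chooses, which is why randomization is needed for the positive result of Foster and Vohra.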
Then by Theorems 11.2.1 and 11.2.14, there exists a convex function h such that
ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩ (12.1.1)
for all µ ∈ M, y ∈ Y. Recall the Bregman divergence (11.2.2)
D_h(u, v) = h(u) − h(v) − ⟨∇h(v), u − v⟩,
which is nonnegative for all convex h (and strictly positive whenever h is strictly convex and u ≠ v),
and Corollary 11.2.5. Then for any prediction function f, if we condition on the predicted value
S = f(X), then
E[ℓ(S, Y) | S] = E[ℓ(E[Y | S], Y) | S] + E[ℓ(S, Y) − ℓ(E[Y | S], Y) | S]
= E[ℓ(E[Y | S], Y) | S] + E[D_h(E[Y | S], S) | S],
where we use the linearity E[ℓ(s, Y)] = ℓ(s, E[Y]) for any distribution on Y and fixed s ∈ R^k in the
second equality. We record this as a theorem.
Theorem 12.1.1. Let ℓ be a proper loss with representation (12.1.1). Then for any f : X → R^k,
E[ℓ(f(X), Y)] = E[ℓ(E[Y | f(X)], Y)] + E[D_h(E[Y | f(X)], f(X))].
In particular, the predictor g : R^k → R^k defined by
g(s) := E[Y | f(X) = s]
is calibrated and satisfies
E[ℓ(g ◦ f(X), Y)] = E[ℓ(E[Y | f(X)], Y)] ≤ E[ℓ(f(X), Y)],
and the inequality is strict whenever f is not calibrated and ℓ is strictly proper.
Proof The first statement we have already proved. For the second, note that
g(s) = E[Y | f(X) = s]
by construction of g, so that E[ℓ(g ◦ f(X), Y)] = E[ℓ(E[Y | f(X)], Y)]. The inequality and its
strictness are immediate because h is strictly convex if and only if ℓ is strictly proper.
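The decomposition in Theorem 12.1.1 can be verified numerically in a small case. The sketch below uses the binary log loss, for which h is the negative binary entropy and the Bregman divergence D_h(u, v) is the KL divergence between Bernoulli(u) and Bernoulli(v). The four-point covariate space, the conditional probabilities, and the scores are hypothetical values chosen for illustration.

```python
import math

def expected_log_loss(p, q):
    """E[loss(p, Y)] for the log loss when Y ~ Bernoulli(q)."""
    return -q * math.log(p) - (1.0 - q) * math.log(1.0 - p)

def kl_bernoulli(u, v):
    """D_h(u, v) for h the negative binary entropy: KL(Bern(u) || Bern(v))."""
    return u * math.log(u / v) + (1.0 - u) * math.log((1.0 - u) / (1.0 - v))

# X is uniform on four points (an assumption made for this example).
weight = 0.25
cond_prob = [0.1, 0.3, 0.6, 0.9]  # P(Y = 1 | X = x)
scores = [0.2, 0.2, 0.7, 0.7]     # f(x)

# Left-hand side: the risk E[loss(f(X), Y)].
risk = sum(weight * expected_log_loss(s, q) for s, q in zip(scores, cond_prob))

# Recalibration map g(s) = E[Y | f(X) = s], computed by grouping on the score.
groups = {}
for s, q in zip(scores, cond_prob):
    mass, positive = groups.get(s, (0.0, 0.0))
    groups[s] = (mass + weight, positive + weight * q)
calibrated = {s: positive / mass for s, (mass, positive) in groups.items()}

# Right-hand side: loss of the recalibrated predictor plus expected divergence.
recalibrated_risk = sum(
    weight * expected_log_loss(calibrated[s], q) for s, q in zip(scores, cond_prob)
)
miscalibration = sum(
    mass * kl_bernoulli(calibrated[s], s) for s, (mass, _) in groups.items()
)
```

Here `risk` equals `recalibrated_risk + miscalibration` up to floating point error, and the divergence term is strictly positive because the score 0.7 is issued on a group whose conditional mean is 0.75.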
To interpret this result, it essentially says that if we can post-process f to make it calibrated,
then we can only improve its risk, or expected loss, when ℓ is a proper loss. We can give an
alternative version of Theorem 12.1.1, where we instead consider the conjugate linkages in Section 11.3,
which can be useful when we wish to find f via convex optimization (instead of by directly min-
imizing a proper loss). To that end, assume that h is a strictly convex function, differentiable on
the interior of its domain, satisfying the Legendre conditions (11.3.3), and define the surrogate loss
(linked via duality and the negative generalized entropy h to `)
ϕ(s, y) = h*(s) − ⟨s, y⟩ = ℓ(pred_h(s), y),
where ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩ and
pred_h(s) = argmin_µ {−⟨s, µ⟩ + h(µ)} = ∇h*(s).
Then we have the following decomposition of the population surrogate loss, which follows similarly
to Theorem 12.1.1.
Theorem 12.1.2. Let ϕ be the surrogate loss defined above. Then for any f : X → Rk , we have
Proof The key is to rely on the duality relationships inherent in the definition of the surrogate
ϕ(s, y) = h*(s) − ⟨s, y⟩. We fix x and work exclusively in the space of the scores (predictions)
s = f(x) ∈ R^k, as
E[ϕ(f (X), Y ) | X = x] = ϕ(f (x), E[Y | X = x])
by definition. Let µ ∈ M = Conv(Y). Then ϕ(s, µ) = h∗ (s) − hs, µi, and
because h is (closed) convex. Additionally, if µ*(s) = ∇h*(s) = pred_h(s), then the conjugate duality
relationships (11.1.4) guarantee h*(s) = ⟨s, µ*(s)⟩ − h(µ*(s)) and s = ∇h(µ*(s)). Thus
\begin{align*}
\varphi(s, \mu) - \inf_{s'} \varphi(s', \mu) & = h^*(s) - \langle s, \mu \rangle + h(\mu) = h(\mu) - h(\mu^*(s)) - \langle s, \mu - \mu^*(s) \rangle \\
& = h(\mu) - h(\mu^*(s)) - \langle \nabla h(\mu^*(s)), \mu - \mu^*(s) \rangle = D_h(\mu, \mu^*(s)).
\end{align*}
Taking the expectation over X and using the shorthand S = f(X), we thus obtain
Lastly, we use that ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩ is proper, so inf_s ϕ(s, µ) = −h(µ) = ℓ(µ, µ),
giving the first claim of the theorem.
As in Theorem 12.1.1, Theorem 12.1.2 shows that calibrating a predictor f can only improve
the surrogate loss associated with h. Any predictor f : X → Rk has unnecessary error arising from
the average divergence of the prediction from being calibrated,
In both cases, we see that any proper (or derived proper) loss has a natural decomposition into
an error term relating to the typical error in predicting Y from E[Y | f (X)], which one frequently
refers to as sharpness of the predictor. Replacing f (X) with the expectation of Y given f (X) (or a
particular transformation thereof) does not increase this first term, but improves the second term,
which measures the typical error of a prediction from calibration.
Let us consider an example with squared error:
Example 12.1.3 (Squared error and calibration): In the case that h(p) = (1/2)‖p‖²₂, we have
h* = h and ∇h = ∇h* is the identity. Then Theorems 12.1.1 and 12.1.2 reduce to the statement
that
\[ E\left[ \|Y - f(X)\|_2^2 \right] = E\left[ \|Y - E[Y \mid X]\|_2^2 \right] + E\left[ \|E[Y \mid X] - f(X)\|_2^2 \right], \]
This result requires some delicate measure-theoretic arguments, so we defer it to the technical
proofs (see Section 12.6.1). The discontinuity of ece is relatively easy to show, however, even in
very simple cases.
Example 12.2.2 (Discontinuity of the calibration error): Let Y ∈ {0, 1} be a Bernoulli(1/2)
random variable, and let X ∈ {0, 1}. Take Y = X with probability 1. Then the predictor f_0 that
always predicts 1/2 is perfectly calibrated, but if for ε ∈ (0, 1/2] we define f_ε by
f_ε(0) = 1/2 − ε and f_ε(1) = 1/2 + ε
then we see that ece(f_ε) = 1/2 − ε, while ece(f_0) = 0. Certainly f_ε → f_0 in any L^p distance on
functions, while lim_{ε→0} ece(f_ε) = 1/2. ◇
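The discontinuity in Example 12.2.2 can be reproduced exactly on the two-point distribution. The sketch below computes the expected calibration error E|E[Y | f(X)] − f(X)| for a finite joint distribution; it assumes P(Y = 1) = 1/2 so that the constant predictor 1/2 is calibrated, and the helper names are hypothetical.

```python
def expected_calibration_error(joint, f):
    """E|E[Y | f(X)] - f(X)| for a finite joint p.m.f.

    joint maps (x, y) to a probability; f maps x to a score in [0, 1].
    """
    mass, positive = {}, {}
    for (x, y), prob in joint.items():
        score = f[x]
        mass[score] = mass.get(score, 0.0) + prob
        positive[score] = positive.get(score, 0.0) + prob * y
    return sum(m * abs(positive[s] / m - s) for s, m in mass.items())

# Y = X with probability 1, and X is Bernoulli(1/2) (assumed).
joint = {(0, 0): 0.5, (1, 1): 0.5}

def f_eps(eps):
    """The perturbed predictor: f(0) = 1/2 - eps and f(1) = 1/2 + eps."""
    return {0: 0.5 - eps, 1: 0.5 + eps}

ece_constant = expected_calibration_error(joint, f_eps(0.0))
ece_perturbed = expected_calibration_error(joint, f_eps(0.01))
```

With ε = 0 both inputs share the score 1/2 and the error is 0; with any ε > 0 the scores separate the two inputs, the conditional mean jumps to X, and the error is 1/2 − ε.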
Zi = (f (Xi ), Yi )
drawn i.i.d.; the coming lower bound holds if X = [0, 1] and f (x) = x, so in many cases, observing
X is of no help. Recall the (worst-case) test risk from Section 10.2, that for the testing problem
between classes H0 : P ∈ P0 and H1 : P ∈ P1 of distributions,
Because we consider the function f fixed and ask only whether we can evaluate its calibration error
under an (unknown) distribution P, we denote the expected calibration error of f under P via
ece_P(f) = E_P[‖E_P[Y | f(X)] − f(X)‖]. We thus consider testing perfect calibration H0 : ece(f) = 0
against alternatives H1 : ece(f) ≥ γ of miscalibration for γ > 0, defining
Theorem 12.2.3. Let f : X → [0, 1] be a predictor of Y ∈ {0, 1}. Assume for some 0 < c < 1/2 that
f(X) ∩ [c, 1 − c] has cardinality at least N. Then there is a distribution P0 such that ece_{P0}(f) = 0
and for any 0 < γ ≤ c,
nγ 2 1
inf Rn (Ψ | {P0 }, Pγ ) ≥ 1 − √ .
Ψ 2 N c(1 − c)
Before proving Theorem 12.2.3, we note the following immediate corollary; part (ii) follows from
part (i), which follows by taking N ↑ ∞ in the theorem.
Corollary 12.2.4. Let the conditions of Theorem 12.2.3 hold and let P0 = {P | ece_P(f) = 0}.
(i) If there exists 0 < c < 1/2 such that f(X) ∩ [c, 1 − c] has infinite cardinality, then P0 is non-empty
and for any 0 < γ ≤ c,
\[ \liminf_{n} \inf_{\Psi} R(\Psi \mid P_0, P_\gamma) = 1. \]
(ii) If there exists a neighborhood U of 1/2 such that U ⊂ f(X), then P0 is non-empty and for any
γ < 1/2, the minimax test risk satisfies
In brief, no test exists that is better than random guessing for testing between
given access to the predictions f (Xi ) and observed outcomes Yi . The theorem and corollary apply to
binary prediction models with Y ∈ {0, 1}, but the results immediately extend to more complicated
prediction problems where Y is vector-valued or multiclass.
Proof The proof relies on the convex hull testing lower bound from Proposition 10.2.1. Without
loss of generality, we can assume that X ⊂ [0, 1] and that f (x) = x by transforming the input
space. Let S = f (X) be the (random) scores that f outputs.
We first construct the perfectly calibrated distribution P0 and miscalibrated family Pγ . Define
the distribution P0 so that S is uniform on distinct points s1 , . . . , sN ∈ [c, 1 − c] and Y | S = s ∼
Bernoulli(s), that is, given S = s, Y = 1 with probability s and Y = 0 with probability 1 − s. By
construction, ece_{P0}(f) = 0. To construct the particular members of the alternative family Pγ, for
each j ∈ [N], define the “tilting” function
\[ \phi_j(y, s) := \left( \frac{y}{s_j} - \frac{1 - y}{1 - s_j} \right) 1\{s = s_j\}. \]
Then E_0[φ_j(Y, S)] = 0 while
\[ \mathrm{Var}_0(\phi_j(Y, S)) = \frac{1}{N} E_0\left[ \left( \frac{Y}{s_j} - \frac{1 - Y}{1 - s_j} \right)^2 \,\Big|\, S = s_j \right] = \frac{1}{N} \left( \frac{1}{s_j} + \frac{1}{1 - s_j} \right) = \frac{1}{N} \cdot \frac{1}{s_j (1 - s_j)}. \]
Note that |φ_j(y, s)| ≤ 1/c as c < 1/2, and if we define the vector φ(y, s) = (φ_1(y, s), . . . , φ_N(y, s)), then
‖φ(y, s)‖_0 ≤ 1 (that is, the number of non-zero entries is at most 1). Now as γ ∈ [0, c], for each
v ∈ {−1, 1}^N we may define the tilted distribution P_v with
P_v(Y = y, S = s) = (1 + γ⟨v, φ(y, s)⟩) P_0(Y = y, S = s),
which is a valid distribution whenever γ ≤ c, as |⟨v, φ(y, s)⟩| ≤ 1/c. We compute the calibration error
for distributions P ∈ {P_v}. Noting that S is still uniform on {s_1, . . . , s_N} under P_v, we have
E_v[Y | S = s_j] = s_j + γ v_j E_0[φ_j(Y, s_j)Y | S = s_j] = s_j + γ v_j,
and so ece_{P_v}(f) = (1/N) Σ_{j=1}^N γ|v_j| = γ. In particular, we have P_v ∈ Pγ.
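The tilting construction can be checked numerically for a small N. The sketch below builds the joint p.m.f. of (S, Y) under the tilted distribution for a given sign vector, then confirms that it is a valid distribution whose calibration error, for the identity predictor f(s) = s, equals γ. The score grid, the sign vector, and the value of γ are illustrative assumptions.

```python
def tilted_joint(scores, signs, gamma):
    """Joint p.m.f. P_v(Y = y, S = s) as a dict keyed by (s, y)."""
    n = len(scores)
    joint = {}
    for s, v in zip(scores, signs):
        for y in (0, 1):
            base = (s if y == 1 else 1.0 - s) / n  # P_0(Y = y, S = s)
            phi = y / s - (1 - y) / (1.0 - s)      # tilting function at s = s_j
            joint[(s, y)] = (1.0 + gamma * v * phi) * base
    return joint

def ece_identity_predictor(joint):
    """E|E[Y | S] - S| for the predictor f(s) = s under the given joint p.m.f."""
    mass, positive = {}, {}
    for (s, y), prob in joint.items():
        mass[s] = mass.get(s, 0.0) + prob
        positive[s] = positive.get(s, 0.0) + prob * y
    return sum(m * abs(positive[s] / m - s) for s, m in mass.items())

scores = [0.3, 0.4, 0.5, 0.6, 0.7]  # distinct points in [c, 1 - c] with c = 0.3
signs = [1, -1, 1, 1, -1]           # one choice of v in {-1, 1}^N
gamma = 0.2                         # requires gamma <= c

p_v = tilted_joint(scores, signs, gamma)
p_0 = tilted_joint(scores, signs, 0.0)
```

Setting γ = 0 recovers the calibrated base distribution, while any γ ≤ c shifts each conditional mean by exactly ±γ without changing the marginal distribution of the scores.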
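Lastly, we compute a bound on the testing error. For this, we recall Lemma 10.1.3. Letting
P̄^n = (1/2^N) Σ_v P_v^n, we have
\begin{align*}
D_{\chi^2}\left( \bar{P}^n \| P_0^n \right) + 1
& = \frac{1}{2^{2N}} \sum_{v, v'} E_0\left[ (1 + \gamma \langle v, \phi(Y, S) \rangle)(1 + \gamma \langle v', \phi(Y, S) \rangle) \right]^n \\
& = \frac{1}{2^{2N}} \sum_{v, v'} \left( 1 + \gamma^2 v^\top \mathrm{Cov}_0(\phi(Y, S)) v' \right)^n
\end{align*}
because the sampling is i.i.d. By our variance calculation for φ and that each φ_j has disjoint
support, we have Cov_0(φ(Y, S)) = (1/N) diag([1/(s_j(1 − s_j))]_{j=1}^N), and so
\[ D_{\chi^2}\left( \bar{P}^n \| P_0^n \right) + 1 = E\left[ \left( 1 + \frac{\gamma^2}{N} \sum_{j=1}^N \frac{V_j V_j'}{s_j (1 - s_j)} \right)^n \right] \le E\left[ \exp\left( \frac{n \gamma^2}{N} \sum_{j=1}^N \frac{V_j V_j'}{s_j (1 - s_j)} \right) \right] \]
where the expectation is over V, V′ drawn i.i.d. from Uniform({±1}^N). But of course the V_j V_j′ are i.i.d. random signs, and
hence 1-sub-Gaussian, so that
\[ D_{\chi^2}\left( \bar{P}^n \| P_0^n \right) + 1 \le \exp\left( \frac{n^2 \gamma^4}{2 N^2} \sum_{j=1}^N \frac{1}{s_j^2 (1 - s_j)^2} \right) \le \exp\left( \frac{n^2 \gamma^4}{2N} \cdot \frac{1}{c^2 (1 - c)^2} \right) \]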
and complete (continuing the analogy, that everything true can be proved), meaning that
We begin by considering types of distance to calibration. Let C(P) denote those functions g
that are perfectly calibrated for P, that is, C(P) = {g : X → R^k | E_P[Y | g(X)] = g(X)} (where
the defining equality holds with P-probability 1 over X). The set C(P) always contains at least the
constant function g(X) = E_P[Y] and so is non-empty (but is typically larger). Then we call the
minimum L¹(P) distance of a function f to the set C(P) the distance to calibration
It is not always clear how to estimate the distance dcal(f), which can make it challenging to use.
We also consider a complementary quantity that relies on an alternative variational character-
ization. Let W ⊂ {Rk → Rk } be a symmetric collection of functions, meaning that w ∈ W implies
−w ∈ W. We can view any such collection as potential witnesses of miscalibration, in that
and so if w can “witness” the portions of space where f (X) 6≈ E[Y | f (X)], it can certify miscali-
bration. We then arrive at what we term the calibration error relative to the class W,
Depending on the class W, this is sometimes called the weak calibration error, and with large
enough classes, we can recover the classical expected calibration error (12.2.1).
Example 12.2.5 (Recovering expected calibration error): For a norm ‖·‖, let the set W be
the collection of all functions w with bound sup_s ‖w(s)‖_* ≤ 1. Then
\[ \mathrm{CE}(f, W) = E\left[ \sup_{\|w\|_* \le 1} \langle w, E[Y \mid f(X)] - f(X) \rangle \right] = E\left[ \| E[Y \mid f(X)] - f(X) \| \right] = \mathrm{ece}(f), \]
It is more interesting to consider restricted classes; one of particular interest to us is that of bounded
Lipschitz functions. Let
\[ W_{\|\cdot\|} := \left\{ w : \mathbb{R}^k \to \mathbb{R}^k \mid \|w(s_0) - w(s_1)\|_* \le \|s_0 - s_1\| \text{ and } \|w(s)\|_* \le 1 \text{ for all } s, s_0, s_1 \right\} \tag{12.2.5} \]
denote the collection of functions bounded by 1 in ‖·‖_* and that are 1-Lipschitz with respect to
‖·‖. Then (as we see presently) we can at least estimate the calibration error relative to the class
W in the definition (12.2.4).
The final calibration measure we consider reposes on the idea of quantizing or partitioning the
output space, which relates to the idea of “binning” predictions that the literature on calibration
frequently considers. Here, we consider averages of Y conditioned on predictions in larger sets.
Thus, instead of evaluating the precise conditioning E[Y | f(X)], we look instead at the
expectation of Y conditional on f(X) ∈ A for a set A, so that a predicted score is (nearly) calibrated
if the diameter diam(A) is small, and E[Y | f(X) ∈ A] ≈ s for some s ∈ A. Given a partition
𝒜 of the space M = Conv(Y), it is then natural to evaluate the average error for each element
of 𝒜 (weighting by the probability of A), and consider the calibration error (12.2.4) for indicator
functions of A ∈ 𝒜, where we abuse notation slightly to define
\[ \mathrm{CE}(f, \mathcal{A}) := \sum_{A \in \mathcal{A}} \left\| E[(f(X) - Y) 1\{f(X) \in A\}] \right\| = \sum_{A \in \mathcal{A}} \left\| E[f(X) - Y \mid f(X) \in A] \right\| P(f(X) \in A). \]
Indeed, taking a supremum over all such partitions gives sup_𝒜 CE(f, 𝒜) = E[‖E[Y | f(X)] − f(X)‖],
the original expected calibration error (12.2.1). Additionally, and here we elide details, if f(X) is
a continuous random variable with suitably nice density and 𝒜_n denotes any partition satisfying
diam(A) ≤ 1/n for A ∈ 𝒜_n, then lim_n CE(f, 𝒜_n) = E[‖E[Y | f(X)] − f(X)‖]. Instead of
considering CE(f, 𝒜) directly, we optimize over all partitions, but penalize the average size of elements of
𝒜, giving the partitioned calibration error
\[ \mathrm{pce}(f) := \inf_{\mathcal{A}} \left\{ \mathrm{CE}(f, \mathcal{A}) + \sum_{A \in \mathcal{A}} \mathrm{diam}(A) P(f(X) \in A) \right\}. \tag{12.2.6} \]
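For binary outcomes, a plug-in version of the binned quantity CE(f, 𝒜) is simple to compute on a sample. The sketch below uses a partition of [0, 1] into equal-width bins, replaces expectations with sample averages, and also reports the penalized objective that appears inside the infimum (12.2.6) for that single partition. The equal-width binning, the sample, and the function name are assumptions made for illustration; the notes do not prescribe a specific partition.

```python
def binned_calibration_error(scores, outcomes, num_bins):
    """Plug-in CE(f, A) for equal-width bins on [0, 1], plus the penalized value.

    Returns (ce, penalized), where penalized adds the term
    sum over bins of diam(A) * P(f(X) in A), with diam(A) = 1 / num_bins.
    """
    n = len(scores)
    residual_sums = [0.0] * num_bins
    counts = [0] * num_bins
    for s, y in zip(scores, outcomes):
        b = min(int(s * num_bins), num_bins - 1)  # a score of 1.0 goes in the last bin
        residual_sums[b] += s - y
        counts[b] += 1
    ce = sum(abs(total) for total in residual_sums) / n
    penalty = sum(counts[b] / n for b in range(num_bins)) * (1.0 / num_bins)
    return ce, ce + penalty

scores = [0.1, 0.1, 0.9, 0.9]  # hypothetical predictions
outcomes = [0, 0, 1, 1]        # hypothetical labels

ce_two_bins, objective_two_bins = binned_calibration_error(scores, outcomes, 2)
ce_one_bin, objective_one_bin = binned_calibration_error(scores, outcomes, 1)
```

With a single bin the over- and under-predictions cancel and the binned error vanishes, while two bins expose an error of 0.1; the diameter penalty is what prevents coarse partitions from hiding miscalibration in the infimum.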
Corollary 12.2.6. Let Y ⊂ R^k have finite diameter and ‖·‖ be any norm. Then each of the
calibration measures dcal, CE(·, W_‖·‖), and pce in definitions (12.2.3), (12.2.4), and (12.2.6) is
sound and complete (12.2.2). Additionally, let Y = {e1, . . . , ek} and ‖·‖ = ‖·‖₁ be the ℓ1-norm.
Then for any f : X → M = Conv(Y),
\[ \frac{1}{2} \mathrm{CE}(f, W_{\|\cdot\|}) \le \mathrm{dcal}(f) \le \mathrm{CE}(f, W_{\|\cdot\|}) + 2 \sqrt{k \, \mathrm{CE}(f, W_{\|\cdot\|})} \]
and
\[ \mathrm{dcal}(f) \le \mathrm{pce}(f) \le \mathrm{dcal}(f) + 2 \sqrt{k \, \mathrm{dcal}(f)}. \]
Corollary 12.2.6 will come as a consequence of the deeper development we pursue in Section 12.5.
Here, we take Corollary 12.2.6 as motivation to give the type of typical result that justifies
calibration estimates. As the calibration measures are all roughly equivalent (except ece), measuring
any of them on a sample can provide evidence for or against calibration of a predictor f. We focus
on the simpler binary case in which f : X → [0, 1] and let W_Lip be bounded Lipschitz functions
w : [0, 1] → [−1, 1]. Given a sample (X_1^n, Y_1^n), the empirical variant of CE(f, W) is
\[ \widehat{\mathrm{CE}}_n(f) := \sup_{\|w\|_\infty \le 1} \left\{ \frac{1}{n} \sum_{i=1}^n w_i (Y_i - f(X_i)) \;\; \text{s.t.} \;\; |w_i - w_j| \le |f(X_i) - f(X_j)| \text{ for } i, j \le n \right\}. \]
By combining uniform covering bounds for the class of Lipschitz functions with a standard concen-
tration inequality, we then have the following convergence guarantee for ĈE_n.
Proposition 12.2.7. There exists a numerical constant C such that for any δ > 0,
\[ \widehat{\mathrm{CE}}_n(f) - \mathrm{CE}(f, W_{\rm Lip}) \le C \frac{\sqrt{\log \frac{n}{\delta}}}{n^{1/3}} \]
with probability at least 1 − δ.
Proof Fix ε > 0 and let 𝒩(ε) be a minimal ε-cover of the set W_Lip in uniform norm, meaning that
for each w ∈ W_Lip there is some w^{(j)} ∈ 𝒩(ε) with ‖w − w^{(j)}‖_∞ ≤ ε, and let N(ε) be its (minimal)
cardinality. Then log N(ε) ≲ (1/ε) log(1/ε) (recall Proposition 8.7.3 and Eq. (8.7.4)). For shorthand,
let the error vector E ∈ [−1, 1]^n have entries E_i = Y_i − f(X_i), and abusing notation, for w ∈ W_Lip
let ⟨w, E⟩_n = (1/n) Σ_{i=1}^n w(f(X_i))E_i.
Then for any w ∈ W_Lip, there exists i ≤ N(ε) such that
\[ \widehat{\mathrm{CE}}_n(f) - \mathrm{CE}(f, W_{\rm Lip}) \le \sup_{w \in W_{\rm Lip}} \left| \langle w, E \rangle_n - E[\langle w, E \rangle_n] \right| \le \max_{w \in \mathcal{N}(\epsilon)} \left| \langle w, E \rangle_n - E[\langle w, E \rangle_n] \right| + 2\epsilon. \]
by the Azuma-Hoeffding inequality and a union bound. Take ε = n^{−1/3} and t = C n^{−1/3} √(log(n/δ)) for
Summarizing, while the expected calibration error is fundamentally inestimable, there are alter-
native measures that are both sound and complete, and they can admit reasonable estimators. As
the class size k grows, however, it can become statistically infeasible to estimate the calibration of
predictors f , so that one must consider alternative metrics. The exercises and bibliography explore
these questions in more detail.
into an average loss and an expected divergence between f(X) and E[Y | f(X)], where h is the
negative (generalized) entropy (11.1.6) associated with the loss ℓ, so that the loss has representation
ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩. This suggests an approach to improving a predictor f : X → R^k
without compromising its average loss: make it closer to being calibrated, so that E[Y | f(X)] ≈
f(X). Here, we make this idea precise by using the weak calibration (12.2.4): if there exists a
witness function w certifying that E[⟨w(f(X)), Y − f(X)⟩] > 0, then we can post-process f to
f(X) + ηw(f(X)) for some stepsize η > 0 and only improve the expected loss. We first develop
the idea in the context of the squared error, where the calculations are cleanest, and extend it to
general proper losses based on convex conjugates (as in Section 11.3) immediately after. Combining
the ideas we develop, we also provide a (population-level) algorithm to transform a function f
by post-processing its outputs that guarantees the result is nearly calibrated relative to a class
W of witnesses. This provides an algorithmic proof quantitatively relating the calibration error
CE(f, W) relative to a class W to the improvement achievable in minimizing E[`(f (X), Y )] by
post-composition g ◦ f .
12.3.1 The post-processing gap and calibration audits for squared error
Consider a thought experiment: instead of using f to make predictions, we use a postprocessing g ◦ f,
where g : R^k → R^k has the (suggestively chosen) form g(v) = v + w(v), where w(v) = (g(v) − v).
Then using the representation ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩ for the proper loss, we recall
Theorem 12.1.1 and for µ(f(X)) := E[Y | f(X)] expand
where the final equality uses the linearity of y ↦ ℓ(µ, y), that is,
\[
  \mathbb{E}[\ell(g \circ f(X), Y)] = \mathbb{E}[\ell(\mathbb{E}[Y \mid f(X)], Y)] + \mathbb{E}\big[ D_h\big( \mathbb{E}[Y \mid f(X)],\, f(X) + w(f(X)) \big) \big]. \tag{12.3.1}
\]
We have decomposed the expected loss E[ℓ(g ◦ f(X), Y)] into a term that post-processing does not
change, which measures the sharpness with which E[Y | f(X)] predicts Y, and a divergence term
D_h measuring the error in calibration of g ◦ f(X) = f(X) + w(f(X)) for E[Y | f(X)].
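As a sanity check on the decomposition (12.3.1), the following sketch verifies it numerically for the squared error, where h(µ) = ½µ² and D_h(a, b) = ½(a − b)². The finite distribution, the function names, and the choice of post-processing are illustrative assumptions, not part of the notes.

```python
from collections import defaultdict


def squared_loss(pred, y):
    """Squared error 0.5 * (pred - y)^2, i.e. h(mu) = 0.5 * mu^2."""
    return 0.5 * (pred - y) ** 2


def decomposition_terms(dist, f, g):
    """Return (loss of g∘f, sharpness term, divergence term) for a finite distribution.

    dist is a list of (probability, x, y) triples; f maps x to a prediction and
    g post-processes predictions. All names here are hypothetical.
    """
    mass, ysum = defaultdict(float), defaultdict(float)
    for p, x, y in dist:
        mass[f(x)] += p
        ysum[f(x)] += p * y
    # mu[v] = E[Y | f(X) = v]
    mu = {v: ysum[v] / mass[v] for v in mass}

    loss = sum(p * squared_loss(g(f(x)), y) for p, x, y in dist)
    sharpness = sum(p * squared_loss(mu[f(x)], y) for p, x, y in dist)
    divergence = sum(p * 0.5 * (mu[f(x)] - g(f(x))) ** 2 for p, x, y in dist)
    return loss, sharpness, divergence
```

For any post-processing g, the first returned value should equal the sum of the other two up to floating point error; only the divergence term changes with g.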
The expansion (12.3.1) points toward an ability to postprocess any prediction function f : X → R^k
to both (i) obtain calibration relative to a class of functions W, as in Definition (12.2.4), and (ii)
improve the expected loss E[ℓ(f(X), Y)]. Moreover, this improvement is monotone, in that changes
“toward” calibration guarantee smaller expected loss, an improvement over the less refined results
in Theorems 12.1.1 and 12.1.2. To that end, define the post-processing gap for the (proper) loss ℓ
and function f relative to the class W of functions R^k → R^k by
\[
  \mathrm{gap}(\ell, f, \mathcal{W}) := \mathbb{E}[\ell(f(X), Y)] - \inf_{w \in \mathcal{W}} \mathbb{E}[\ell(f(X) + w(f(X)), Y)]. \tag{12.3.2}
\]
The gap (12.3.2) is fundamentally tied to the calibration error relative to the class W.
We specialize here to the simpler case of the squared error, as the statements are most transpar-
ent. We focus exclusively on symmetric convex collections of functions W, meaning that if w ∈ W,
then −w ∈ W, and W is convex.
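Proposition 12.3.1. Let $\ell(\mu, y) = \frac{1}{2} \|y - \mu\|_2^2$ be the squared error (Brier score), and let W be a
symmetric convex collection of functions, each 1-Lipschitz with respect to the $\ell_2$-norm $\|\cdot\|_2$. Define
$R^2(f) = \sup_{w \in \mathcal{W}} \mathbb{E}[\|w(f(X))\|_2^2]$. Then
\[
  \frac{1}{2} \min\Big\{ \mathrm{CE}(f, \mathcal{W}),\ \frac{\mathrm{CE}(f, \mathcal{W})^2}{R^2(f)} \Big\} \le \mathrm{gap}(\ell, f, \mathcal{W}) \le \mathrm{CE}(f, \mathcal{W}).
\]
Proof Fix x and let µ = E[Y | f(X) = f(x)] ∈ Conv(Y) and w = w(f(x)) be a potential update
to f(x). Then because $\ell(\mu, y) = \frac{1}{2} \|\mu - y\|_2^2$, for any y ∈ Y
\[
  \ell(\mu, y) + \langle \nabla \ell(\mu, y), w \rangle + \frac{1}{2} \|w\|_2^2 = \ell(\mu + w, y).
\]
Recognizing that ∇ℓ(µ, y) = (µ − y), for any w ∈ W we therefore have
\begin{align*}
  -\mathbb{E}[\langle f(X) - Y, w(f(X)) \rangle] - \frac{1}{2} \mathbb{E}[\|w(f(X))\|_2^2]
  & \le \mathbb{E}[\ell(f(X), Y)] - \mathbb{E}[\ell(f(X) + w(f(X)), Y)] \\
  & \le -\mathbb{E}[\langle f(X) - Y, w(f(X)) \rangle].
\end{align*}
Taking suprema over w on each side of the preceding inequalities and using the symmetry of W
gives
\begin{align*}
  \sup_{w \in \mathcal{W}} \Big\{ \mathbb{E}[\langle f(X) - Y, w(f(X)) \rangle] - \frac{1}{2} \mathbb{E}[\|w(f(X))\|_2^2] \Big\}
  & \le \mathrm{gap}(\ell, f, \mathcal{W}) \\
  & \le \sup_{w \in \mathcal{W}} \mathbb{E}[\langle f(X) - Y, w(f(X)) \rangle].
\end{align*}
Because $\mathrm{CE}(f, \mathcal{W}) = \sup_{w \in \mathcal{W}} \mathbb{E}[\langle f(X) - Y, w(f(X)) \rangle]$, we can use the convexity of W and the
definition $R^2(f) := \sup_{w \in \mathcal{W}} \mathbb{E}[\|w(f(X))\|_2^2]$ to see that for any η ∈ [0, 1], we may replace w with
η · w ∈ W, and we have
\[
  \sup_{\eta \in [0, 1]} \Big\{ \eta\, \mathrm{CE}(f, \mathcal{W}) - \frac{\eta^2}{2} R^2(f) \Big\} \le \mathrm{gap}(\ell, f, \mathcal{W}) \le \mathrm{CE}(f, \mathcal{W}).
\]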
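To make the final step explicit, the following short computation evaluates the supremum over η ∈ [0, 1]. It is a sketch that uses only CE(f, W) and R²(f) as defined in the proposition, assuming R²(f) > 0, with the shorthand c and r introduced here for convenience.

```latex
Write $c = \mathrm{CE}(f, \mathcal{W}) \ge 0$ and $r = R^2(f) > 0$, and let
$\psi(\eta) = \eta c - \tfrac{\eta^2}{2} r$ for $\eta \in [0, 1]$.
\begin{itemize}
  \item If $c \le r$, then $\eta = c / r \in [0, 1]$ is feasible and
        $\psi(c/r) = \frac{c^2}{r} - \frac{c^2}{2r} = \frac{c^2}{2r}$.
  \item If $c > r$, then taking $\eta = 1$ gives
        $\psi(1) = c - \frac{r}{2} \ge c - \frac{c}{2} = \frac{c}{2}$.
\end{itemize}
In either case,
\[
  \sup_{\eta \in [0, 1]} \psi(\eta) \ge \frac{1}{2} \min\Big\{ c, \frac{c^2}{r} \Big\},
\]
which is the lower bound on $\mathrm{gap}(\ell, f, \mathcal{W})$.
```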
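As an immediate corollary, we see that if $\mathcal{W} = \mathcal{W}_{\|\cdot\|_2}$ consists of the 1-Lipschitz functions with
$\|w(\cdot)\|_2 \le 1$, we have a cleaner guarantee.
Corollary 12.3.2. Let $\mathcal{W} = \mathcal{W}_{\|\cdot\|_2}$ and the conditions of Proposition 12.3.1 hold. Then
\[
  \frac{1}{2\, \mathrm{diam}(\mathcal{Y})^2} \mathrm{CE}(f, \mathcal{W})^2 \le \mathrm{gap}(\ell, f, \mathcal{W}) \le \mathrm{CE}(f, \mathcal{W}).
\]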
Thus, the calibration error upper and lower bounds the gap between the expected loss of f and a
post-processed version of f . This yields a nearly operational interpretation of the calibration error
relative to the class W: it is, to within a square, exactly the amount we could improve the expected
loss of the function f by postprocessing f itself.
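To see the sandwich of Proposition 12.3.1 in numbers, here is a minimal sketch for binary Y and a one-dimensional witness class W = {c · w0 : |c| ≤ 1}, which is symmetric and convex. The class, the finite distribution format, and the function name are assumptions made for illustration; for this class CE, R² and the gap all have closed forms.

```python
def gap_sandwich(dist, f, w0):
    """Return (lower bound, gap, CE) for squared error and W = {c * w0 : |c| <= 1}.

    dist is a list of (probability, x, y) triples with scalar y; f maps x to a
    scalar prediction; w0 is a fixed scalar witness. All names are hypothetical.
    """
    # a = E[w0(f(X)) (Y - f(X))], so CE(f, W) = sup_c c * a = |a|
    a = sum(p * w0(f(x)) * (y - f(x)) for p, x, y in dist)
    # R^2(f) = sup_c c^2 E[w0(f(X))^2] = E[w0(f(X))^2]
    r2 = sum(p * w0(f(x)) ** 2 for p, x, y in dist)
    ce = abs(a)
    if r2 == 0.0:
        return 0.0, 0.0, ce
    # Improvement from w = c * w0 is c * a - c^2 * r2 / 2; maximize over c in [-1, 1].
    c = max(-1.0, min(1.0, a / r2))
    gap = c * a - 0.5 * c * c * r2
    lower = 0.5 * min(ce, ce ** 2 / r2)
    return lower, gap, ce
```

On any input the three returned values should be nondecreasing, matching the inequalities of the proposition; when CE ≤ R² the lower bound is attained with equality for this particular class.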
and we may transform arbitrary scores s ∈ R^k to predictions via the conjugate link (11.3.1), that
is,
\[
  \mathrm{pred}_h(s) = \mathop{\mathrm{argmin}}_{\mu} \{ -\langle s, \mu \rangle + h(\mu) \} = \nabla h^*(s).
\]
So long as h is appropriately smooth, these satisfy $\ell(\mathrm{pred}_h(s), y) = \varphi(s, y)$. In complete analogy
with the post-processing gap (12.3.2) when we assume f makes predictions in (the affine hull of)
Y, we can define the surrogate post-processing gap
\[
  \mathrm{gap}(\varphi, f, \mathcal{W}) := \mathbb{E}[\varphi(f(X), Y)] - \inf_{w \in \mathcal{W}} \mathbb{E}[\varphi(f(X) + w(f(X)), Y)]. \tag{12.3.3}
\]
In spite of the similarity with definition (12.3.2), the actual predictions of Y from f in this case
come via the link $\mathrm{pred}_h(f(X))$. Thus, in this case we instead consider the calibration error relative
to a class W but after the composition of f with $\mathrm{pred}_h = \nabla h^*$, so that
\[
  \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}) = \sup_{w \in \mathcal{W}} \mathbb{E}[\langle w(f(X)), Y - \mathrm{pred}_h(f(X)) \rangle] = \sup_{w \in \mathcal{W}} \mathbb{E}[\langle w(f(X)), Y - \nabla h^*(f(X)) \rangle],
\]
where as always we assume that the class of witness functions satisfies W = −W. When the
prediction function is continuous enough in s, we can give an analogue of Proposition 12.3.1 to the
more general surrogate case. To that end, we assume that the conjugate h* has Lipschitz continuous
gradient with respect to the dual norm $\|\cdot\|_*$, meaning that
\[
  \| \nabla h^*(s_0) - \nabla h^*(s_1) \| \le \| s_0 - s_1 \|_*
\]
for all $s_0, s_1 \in \mathbb{R}^k$. This is equivalent (see Proposition C.2.7) to the negative entropy h being
strongly convex with respect to the norm $\|\cdot\|$, and also immediately implies that
\[
  \varphi(s + w, y) \le \varphi(s, y) + \langle \nabla_s \varphi(s, y), w \rangle + \frac{\|w\|_*^2}{2}.
\]
Example 12.3.3 (Multiclass logistic regression): For multiclass logistic regression, where we
take $h(p) = \sum_{j=1}^k p_j \log p_j$, we know that h is strongly convex with respect to the $\ell_1$ norm (this
is Pinsker’s inequality; see inequality (2.2.11)). Thus, the conjugate $h^*(s) = \log(\sum_{j=1}^k e^{s_j})$ has
Lipschitz gradient with respect to the $\ell_\infty$ norm, meaning that for the prediction link
\[
  \mathrm{pred}_h(s) = \bigg[ \frac{e^{s_y}}{\sum_{j=1}^k e^{s_j}} \bigg]_{y=1}^k,
\]
we have
\[
  \| \mathrm{pred}_h(s) - \mathrm{pred}_h(s') \|_1 \le \| s - s' \|_\infty
\]
for all $s, s' \in \mathbb{R}^k$. ◇
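The Lipschitz property in Example 12.3.3 is easy to probe numerically. The sketch below implements the softmax link and the ratio of the ℓ1 change in predictions to the ℓ∞ change in scores; the function names are illustrative, and a random search like this can only fail to find a counterexample, it does not prove the bound.

```python
import math


def softmax(s):
    """Prediction link pred_h(s) for the negative entropy h (numerically stabilized)."""
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    z = sum(exps)
    return [e / z for e in exps]


def lipschitz_ratio(s, s_prime):
    """Return ||softmax(s) - softmax(s')||_1 / ||s - s'||_inf (0.0 if s == s')."""
    l1 = sum(abs(a - b) for a, b in zip(softmax(s), softmax(s_prime)))
    linf = max(abs(a - b) for a, b in zip(s, s_prime))
    return l1 / linf if linf > 0 else 0.0
```

On random pairs of score vectors the ratio should never exceed 1, consistent with the displayed inequality.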
Example 12.3.4 (The squared error): When we measure the error of a prediction in R^k
by the squared $\ell_2$-norm $\frac{1}{2} \|f(x) - y\|_2^2$, this corresponds to the generalized negative entropy
$h(\mu) = \frac{1}{2} \|\mu\|_2^2$. In this case, the norm $\|\cdot\| = \|\cdot\|_2 = \|\cdot\|_*$, and we have the self duality h* = h,
so that the prediction mapping $\mathrm{pred}_h$ is the identity. ◇
With these examples as motivation, we then have the following generalization of Proposi-
tion 12.3.1.
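Proposition 12.3.5. Let the negative generalized entropy h be strongly convex with respect to the
norm $\|\cdot\|$ and consider surrogate loss $\varphi(s, y) = h^*(s) - \langle s, y \rangle$. Define $R_*^2(f) := \sup_{w \in \mathcal{W}} \mathbb{E}[\|w(f(X))\|_*^2]$.
Then
\[
  \frac{1}{2} \min\Big\{ \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}),\ \frac{\mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W})^2}{R_*^2(f)} \Big\} \le \mathrm{gap}(\varphi, f, \mathcal{W}) \le \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}).
\]
Proof Fix x and let s = f(x) and w = w(f(x)), and notice that for any y we have
\[
  \varphi(s, y) + \langle \nabla \varphi(s, y), w \rangle \le \varphi(s + w, y) \le \varphi(s, y) + \langle \nabla \varphi(s, y), w \rangle + \frac{1}{2} \|w\|_*^2.
\]
Recognizing that $\nabla \varphi(s, y) = \nabla h^*(s) - y$, for any w ∈ W we have
\begin{align*}
  -\mathbb{E}[\langle \nabla \varphi(f(X), Y), w(f(X)) \rangle] - \frac{1}{2} \mathbb{E}[\|w(f(X))\|_*^2]
  & \le \mathbb{E}[\varphi(f(X), Y)] - \mathbb{E}[\varphi(f(X) + w(f(X)), Y)] \\
  & \le -\mathbb{E}[\langle \nabla \varphi(f(X), Y), w(f(X)) \rangle].
\end{align*}
Taking suprema over w on each side and using the symmetry of W gives
\begin{align*}
  \sup_{w \in \mathcal{W}} \Big\{ \mathbb{E}[\langle \nabla h^*(f(X)) - Y, w(f(X)) \rangle] - \frac{1}{2} \mathbb{E}[\|w(f(X))\|_*^2] \Big\}
  & \le \mathrm{gap}(\varphi, f, \mathcal{W}) \\
  & \le \sup_{w \in \mathcal{W}} \mathbb{E}[\langle \nabla h^*(f(X)) - Y, w(f(X)) \rangle].
\end{align*}
Because $\mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}) = \sup_{w \in \mathcal{W}} \mathbb{E}[\langle \nabla h^*(f(X)) - Y, w(f(X)) \rangle]$, we can use the convexity of W
and the definition $R_*^2(f) := \sup_{w \in \mathcal{W}} \mathbb{E}[\|w(f(X))\|_*^2]$, to see that for any η ∈ [0, 1], we may replace
w with η · w ∈ W and
\[
  \sup_{\eta \in [0, 1]} \Big\{ \eta\, \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}) - \frac{\eta^2}{2} R_*^2(f) \Big\} \le \mathrm{gap}(\varphi, f, \mathcal{W}) \le \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}).
\]
Set $\eta = \min\{1, \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}) / R_*^2(f)\}$.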
A corollary specializing to the case of bounded witness functions allows a somewhat cleaner
statement, in analogy with Corollary 12.3.2. It provides the same operational interpretation: the
calibration error $\mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W})$ relative to W upper and lower bounds the improvement possible
through postprocessing f.
Corollary 12.3.6. Let the conditions of Proposition 12.3.5 hold, and additionally assume that the
witness functions W satisfy $\|w(s)\|_* \le 1$ for all s ∈ R^k. Then
\[
  \frac{1}{2\, \mathrm{diam}(\mathrm{dom}\, h)^2} \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W})^2 \le \mathrm{gap}(\varphi, f, \mathcal{W}) \le \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}).
\]
We can give an alternative perspective for this section by focusing on the definitions (12.3.2)
and (12.3.3) of the post-processing gap. Suppose we have a proper loss ℓ and we wish to improve the
expected loss of a predictor f by post-processing f. When there is little to be gained by replacing
f with an adjusted version f(x) + w(f(x)) for some w ∈ W, then f must be calibrated with respect
to the class W. So, for example, for a surrogate ϕ, the function f (really, its associated prediction
function $\mathrm{pred}_h \circ f$) is calibrated with respect to W if and only if E[ϕ(f(X) + w(f(X)), Y)] ≥
E[ϕ(f(X), Y)] for all w ∈ W.
As a particular special case to close this section, the standard multiclass logistic loss provides
a clean example.
Example 12.3.7 (Multiclass logistic losses, continued): Let h be the negative entropy
$h(p) = \sum_{j=1}^k p_j \log p_j$ restricted to the probability simplex $\Delta_k = \{p \in \mathbb{R}_+^k \mid \langle \mathbf{1}, p \rangle = 1\}$ and the
surrogate $\varphi(s, y) = \log(\sum_{j=1}^k e^{s_j}) - s_y$. Then for any class W consisting of functions with
$\|w(s)\|_\infty \le 1$ for all s ∈ R^k and any function f : X → R^k,
\[
  \frac{1}{2} \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W})^2 \le \mathbb{E}[\varphi(f(X), Y)] - \inf_{w \in \mathcal{W}} \mathbb{E}[\varphi(f(X) + w(f(X)), Y)].
\]
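Theorem 12.3.8. Assume that the surrogate loss ϕ is nonnegative and that the class of witnesses
W satisfies $R_* := \sup_s \|w(s)\|_* < \infty$. Then the algorithm in Figure 12.1 guarantees that
\[
  \min_{\tau < t} \mathrm{CE}(\mathrm{pred}_h \circ f_\tau, \mathcal{W}) \le \frac{\sqrt{2 R_*^2\, \mathbb{E}[\varphi(f_0(X), Y)]}}{\sqrt{t}},
\]
and in particular terminates with $\mathrm{CE}(\mathrm{pred}_h \circ f_t, \mathcal{W}) \le \epsilon$ for some t with
\[
  \mathbb{E}[\varphi(f(X) + \eta w(f(X)), Y)] \le \mathbb{E}[\varphi(f(X), Y)] + \eta\, \mathbb{E}[\langle w(f(X)), \nabla h^*(f(X)) - Y \rangle] + \frac{\eta^2}{2} \mathbb{E}[\|w(f(X))\|_*^2].
\]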
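iv. Terminate if
\[
  \mathrm{CE}(\mathrm{pred}_h \circ f_t, \mathcal{W}) \le \epsilon.
\]
\[
  \mathbb{E}[\varphi(f(X) - \eta w(f(X)), Y)] \le \mathbb{E}[\varphi(f(X), Y)] - \eta\, \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}) + \frac{\eta^2}{2} R_*^2.
\]
Choose $\eta_f = \mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W}) / R_*^2$ to obtain
\[
  \mathbb{E}[\varphi(f(X) - \eta_f w(f(X)), Y)] \le \mathbb{E}[\varphi(f(X), Y)] - \frac{1}{2} \frac{\mathrm{CE}(\mathrm{pred}_h \circ f, \mathcal{W})^2}{R_*^2}. \tag{12.3.4}
\]
Now we apply the obvious inductive argument. Let $f_t$ be a function in the iteration of Algorithm 12.1.
Then inequality (12.3.4) guarantees that if $\delta_t^2 := \frac{1}{2} \frac{\mathrm{CE}(\mathrm{pred}_h \circ f_t, \mathcal{W})^2}{R_*^2}$, then
\[
  \mathbb{E}[\varphi(f_{t+1}(X), Y)] \le \mathbb{E}[\varphi(f_t(X), Y)] - \delta_t^2.
\]
In particular,
\[
  0 \le \mathbb{E}[\varphi(f_t(X), Y)] \le \mathbb{E}[\varphi(f_0(X), Y)] - \sum_{\tau = 0}^{t - 1} \delta_\tau^2.
\]
Consequently,
\[
  t \min_{\tau < t} \delta_\tau^2 \le \mathbb{E}[\varphi(f_0(X), Y)],
\]
so that $\min_{\tau < t} \delta_\tau \le \sqrt{\mathbb{E}[\varphi(f_0(X), Y)] / t}$. Replacing $\delta_\tau$ with its definition gives the theorem.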
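The inductive argument can be traced on a toy population. The sketch below is one reading of the procedure suggested by the proof, not a transcription of Figure 12.1 (which is not reproduced in this excerpt): it specializes to the squared error ½(s − y)² with scalar predictions, uses a finite list of witnesses bounded by 1 (so R_* = 1, treated as closed under negation by taking absolute correlations), and steps with η = CE / R_*². All names are hypothetical.

```python
import math


def expectation(dist, fn):
    """Expectation of fn(x, y) under a finite distribution of (prob, x, y) triples."""
    return sum(p * fn(x, y) for p, x, y in dist)


def postprocess_until_calibrated(dist, f0, witnesses, eps, max_iter=100_000):
    """Iteratively post-process predictions until the witness correlations are <= eps.

    f0 maps each x to a scalar prediction; witnesses are functions R -> [-1, 1].
    Returns the final predictor and the per-iteration losses and calibration errors.
    """
    f = dict(f0)
    losses, ces = [], []
    for _ in range(max_iter):
        losses.append(expectation(dist, lambda x, y: 0.5 * (f[x] - y) ** 2))
        # Correlation of each witness with the residual Y - f(X).
        corr = [expectation(dist, lambda x, y, w=w: w(f[x]) * (y - f[x]))
                for w in witnesses]
        j = max(range(len(corr)), key=lambda i: abs(corr[i]))
        ce = abs(corr[j])
        ces.append(ce)
        if ce <= eps:
            break
        # Step eta = ce / R_*^2 with R_* = 1, in the direction sign(corr[j]) * w_j,
        # so the signed step is corr[j] itself. The update depends on x only via f[x].
        step, w = corr[j], witnesses[j]
        f = {x: v + step * w(v) for x, v in f.items()}
    return f, losses, ces
```

In this setting each iteration lowers the expected loss by at least half the squared calibration error, which is exactly the telescoping inequality used in the proof.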
those are squared error or general proper losses. That is, by calibrating we can beat (and hence,
calibeat) a given predictor. These arguments have exclusively been at the population level, leaving
it unclear whether this approach might actually work given a finite sample. While employing
these ideas for general losses and general decision settings, where we only guarantee Y ⊂ Rk , is
challenging because of dimensionality issues, here we show how to improve calibration in finite
samples while simultaneously losing little in squared error for binary predictions with Y ∈ {0, 1}.
That is, we have calibeating: from any potential predictor f , we can construct a predictor g with
both small calibration error and with (asymptotically) no larger squared error than f , realizing
Theorem 12.1.1 but in finite samples.
Let f : X → [0, 1] be any predictor of Y ∈ {0, 1}, and consider the squared error loss
ℓ(s, y) = (s − y)² with population loss L(f) = E[(Y − f(X))²]. The idea to improve calibration
of f without losing much in accuracy (squared error) is fairly straightforward: we discretize
f by binning its predictions so that the number of X_i for which f(X_i) falls in each bin is equal; such
binning ideas are central to the theory of calibration. Then we choose the postprocessed function
g by averaging observed Y values over those bins. This transforms the (population level)
idea present in Theorem 12.1.1, which says to choose the post-processing conditional expectation
g(x) = E[Y | f(X) = f(x)], into one implementable in finite samples, which approximately sets
where l and u are lower and upper bounds over which to average the predictions of f.
To make the ideas concrete, assume we have a sample $\{(X_i, Y_i)\}_{i=1}^{2n}$ of size 2n drawn i.i.d. according
to P (where we choose 2n for notational convenience), which we divide into samples $\{(X_i, Y_i)\}_{i=1}^n$
and $\{(X_i, Y_i)\}_{i=n+1}^{2n}$, letting $P_n^{(1)}$ denote the empirical distribution on the first sample and $P_n^{(2)}$ that
on the second. We use the first to choose the binning (quantization) of f and the second to actually
choose values for the binned function. Fix a number of bins b ∈ N to be chosen, for convenience
assuming that b divides n. Let the indices $i_1, \ldots, i_n$ sort $f(X_i)$, so that
except that $\hat{l}_1 = 0$ and $\hat{u}_b = 1$, and define the bins
\[
  B_1 = \big[ \hat{l}_1, \hat{u}_1 \big), \quad B_2 = \big[ \hat{l}_2, \hat{u}_2 \big), \quad \ldots, \quad B_b = \big[ \hat{l}_b, \hat{u}_b \big]
\]
to partition [0, 1]. These partition [0, 1] evenly in the empirical probabilities of $f(X_i)$, i = 1, . . . , n,
not evenly in the widths $\hat{u}_j - \hat{l}_j$.
To construct the recalibrated and binned version g of f, for each x ∈ X, define the bin mapping
which implicitly depends on the first sample $(X_1^n, Y_1^n)$. The partitioning of [0, 1] into the bins $B_j$
also induces a partition on $\mathcal{X} = \bigcup_{j=1}^b f^{-1}(B_j)$, where elements x, x′ belong to the same partition
set if bin(x) = bin(x′). Once we have this mapping from x to the associated prediction bin, we can
use the second sample (its empirical distribution) to define the binned function g by the average of
the second sample distribution $P_n^{(2)}$ over those examples falling into each bin. Formally, we define
g to be the piecewise constant function
where we assign g(x) an arbitrary value if no $X_i$ satisfies $f(X_i) \in B_j$ for the index j = bin(x).
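The two-sample construction can be sketched in a few lines. The code below is a hedged illustration of the procedure as described in the text: bin edges come from the first sample's sorted scores and bin values from label averages on the second sample. The exact boundary rule (midpoints between neighboring order statistics), the default value for empty bins, and all names are assumptions, since the displayed definitions are not available in this excerpt. For simplicity the returned g acts on scores v = f(x) rather than on x.

```python
import bisect


def fit_binned_recalibrator(scores1, scores2, labels2, b, default=0.5):
    """Build a piecewise-constant recalibration of scores in [0, 1].

    scores1: f(X_i) on the first sample, used only to choose bin edges.
    scores2, labels2: f(X_i) and Y_i on the second sample, used for bin averages.
    b: number of bins; assumed to divide len(scores1).
    """
    n = len(scores1)
    if n % b != 0:
        raise ValueError("b must divide the first sample size")
    m = n // b
    s = sorted(scores1)
    # Interior edges: midpoints between the last score of one block of m sorted
    # scores and the first score of the next (an assumed convention). With ties
    # across a block boundary the bins may hold unequal numbers of points.
    edges = [(s[j * m - 1] + s[j * m]) / 2 for j in range(1, b)]

    def bin_index(v):
        # Number of interior edges <= v, giving half-open bins [l_j, u_j).
        return bisect.bisect_right(edges, v)

    sums, counts = [0.0] * b, [0] * b
    for v, y in zip(scores2, labels2):
        j = bin_index(v)
        sums[j] += y
        counts[j] += 1
    # Empty bins receive an arbitrary default value, as the text allows.
    values = [sums[j] / counts[j] if counts[j] else default for j in range(b)]

    def g(v):
        return values[bin_index(v)]

    return g, edges, values
```

Using separate samples for the edges and the averages keeps the bin values independent of the bin choice, which is what the later concentration arguments condition on.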
Informally, this function g partitions X space into regions of roughly equal (small) probability
1/b, and for which f (x) belongs to a given interval on each region. Then recalibrating f on that
region changes the prediction error (Y − f (X))2 little, but improves the calibration. Formally, we
can show the following theorem.
Theorem 12.4.1. Let g be the binned and recalibrated estimator (12.4.1). Assume that the number
of bins b and sample size n satisfy $\frac{n}{\log n} \ge b$. Then there exists a numerical constant c > 0 such
that for all δ ∈ (0, 1), with probability at least $1 - 2 \exp(-c \frac{n}{b}) - \delta$,
\[
  L(g) \le L(f) + \frac{3}{b} + \frac{2b \log \frac{2b}{\delta}}{n} - \mathbb{E}\Big[ \big( \mathbb{E}[Y \mid \mathrm{bin}(X)] - \mathbb{E}[f(X) \mid \mathrm{bin}(X)] \big)^2 \Big]
\]
and g has expected calibration error (12.2.1) at most
\[
  \mathrm{ece}(g) \le \sqrt{\frac{2b \log \frac{2b}{\delta}}{n}}.
\]
while the expected calibration error is of order $n^{-1/4}$, ignoring the logarithmic factors. So we
improve the loss L(f) by a factor involving the calibration error of f (relative to the random
binning): the less calibrated f is, the more improvement we can provide, and with a penalty
tending to 0 at rate $\sqrt{\log n / n}$.
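To get a feel for the sizes involved, the following sketch evaluates the two non-negative penalty terms in Theorem 12.4.1 for given n, b, and δ. The function name is hypothetical, and the suggestion of taking b on the order of √n is an inference from the stated n^{-1/4} rate for the calibration error rather than something quoted from the notes.

```python
import math


def calibeating_bounds(n, b, delta):
    """Return (loss_penalty, ece_bound) from the statement of Theorem 12.4.1.

    loss_penalty = 3/b + (2b/n) log(2b/delta) is the additive term in the bound
    on L(g) - L(f), before subtracting the calibration improvement;
    ece_bound = sqrt((2b/n) log(2b/delta)).
    """
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    if n / math.log(n) < b:
        raise ValueError("the theorem assumes n / log n >= b")
    deviation = (2 * b / n) * math.log(2 * b / delta)
    return 3 / b + deviation, math.sqrt(deviation)
```

For example, `calibeating_bounds(10_000, 100, 0.05)` evaluates both terms with b = √n; the bounds are only informative once n is large relative to b log(b/δ).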
With the three lemmas in place, we can now expand the squared error to obtain the calibeating
theorem. Recalling the population squared error L(g) = E[(Y − g(X))²], let us suppose that the
consequences of Lemmas 12.4.2–12.4.4 hold, so that $|g(x) - \mathbb{E}[Y \mid f(X) \in B_j]|^2 \le \frac{2b}{n} \log \frac{2b}{\delta}$ and
$P(B_j) \le \frac{7}{4b}$ for each j. By the lemmas, these hold with probability $1 - 2 \exp(-c \frac{n}{b}) - \delta$. Define the
average function values and conditional expectations
\[
  \bar{f}_j := \mathbb{E}[f(X) \mid f(X) \in B_j] \quad \text{and} \quad \bar{E}_j := \mathbb{E}[Y \mid f(X) \in B_j].
\]
Then we have
\[
  L(g) = \mathbb{E}[(Y - g(X))^2] = \sum_{j=1}^b P(B_j)\, \mathbb{E}[(Y - \bar{E}_j + \bar{E}_j - g(X))^2 \mid f(X) \in B_j].
\]
Considering the expectation term, note that g(X) is constant for $f(X) \in B_j$ by construction of the
binning, and so for any $x \in f^{-1}(B_j)$, we have
by adding and subtracting $\bar{f}_j$ and expanding the square. Summarizing, we have shown so far that
\[
  L(g) \le \sum_{j=1}^b P(B_j)\, \mathbb{E}[(Y - \bar{f}_j)^2 \mid f(X) \in B_j] + \frac{2b}{n} \log \frac{2b}{\delta} - \sum_{j=1}^b P(B_j) (\bar{E}_j - \bar{f}_j)^2. \tag{12.4.2}
\]
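We can directly relate the first term in the expansion (12.4.2) to the expected error E[(Y −
f(X))²]. Indeed, by expanding out the square, we have
\begin{align*}
  \mathbb{E}[(Y - \bar{f}_j)^2 \mid f(X) \in B_j]
  & = \mathbb{E}[(Y - f(X) + f(X) - \bar{f}_j)^2 \mid f(X) \in B_j] \\
  & = \mathbb{E}[(Y - f(X))^2 \mid f(X) \in B_j] + 2\, \mathbb{E}[(Y - f(X))(f(X) - \bar{f}_j) \mid f(X) \in B_j] + \mathrm{Var}(f(X) \mid f(X) \in B_j) \\
  & \le \mathbb{E}[(Y - f(X))^2 \mid f(X) \in B_j] + 2 \sqrt{\mathrm{Var}(f(X) \mid f(X) \in B_j)} + \mathrm{Var}(f(X) \mid f(X) \in B_j),
\end{align*}
where the inequality uses the Cauchy-Schwarz inequality and that |Y − f(X)| ≤ 1. Because f(X) lies in
an interval of width $\hat{u}_j - \hat{l}_j \le 1$ on the bin, its conditional variance is at most $\frac{1}{4} (\hat{u}_j - \hat{l}_j)^2$, so that
\[
  \mathbb{E}[(Y - \bar{f}_j)^2 \mid f(X) \in B_j] \le \mathbb{E}[(Y - f(X))^2 \mid f(X) \in B_j] + \frac{5}{4} (\hat{u}_j - \hat{l}_j).
\]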
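Substituting in the bound (12.4.2) and recognizing that $\sum_{j=1}^b P(B_j)\, \mathbb{E}[(Y - f(X))^2 \mid f(X) \in B_j] = L(f)$,
we obtain
\[
  L(g) \le L(f) + \frac{5}{4} \sum_{j=1}^b P(B_j) (\hat{u}_j - \hat{l}_j) + \frac{2b}{n} \log \frac{2b}{\delta} - \sum_{j=1}^b P(B_j) (\bar{E}_j - \bar{f}_j)^2.
\]
But of course, $P(B_j) \le \frac{7}{4b}$ by the assumed conclusions of Lemma 12.4.2, and so
$\sum_{j=1}^b P(B_j) (\hat{u}_j - \hat{l}_j) \le \frac{7}{4b}$ as $\sum_{j=1}^b (\hat{u}_j - \hat{l}_j) = 1$. This gives the final inequality
\[
  L(g) \le L(f) + \frac{35}{16 b} + \frac{2b}{n} \log \frac{2b}{\delta} - \sum_{j=1}^b P(B_j) (\bar{E}_j - \bar{f}_j)^2,
\]
proving the first claim of the theorem. The bound on calibration error is immediate because
$|g(x) - \mathbb{E}[Y \mid f(X) \in B_j]|^2 \le \frac{2b}{n} \log \frac{2b}{\delta}$ for each $x \in f^{-1}(B_j)$ with the prescribed probability, by
Lemma 12.4.4.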
We provide upper and lower bounds on k as a function of the error in $P_n(A_j)$. Suppose that
for some t > 0, we have
\[
  \frac{1 - t}{4b} \le P_n(A_j) \le \frac{1 + t}{4b} \quad \text{for } j = 1, \ldots, 4b. \tag{12.4.3}
\]
Then
\[
  (k + 1) \frac{1 + t}{4b} \ge P_n(A_j \cup \cdots \cup A_{j+k}) \ge P_n(B_{j^\star}) = \frac{1}{b},
\]
and similarly
\[
  (k - 1) \frac{1 - t}{4b} \le P_n(A_{j+1} \cup \cdots \cup A_{j+k-1}) \le P_n(B_{j^\star}) = \frac{1}{b},
\]
implying the bounds
\[
  \frac{4}{1 + t} - 1 \le k \le \frac{4}{1 - t} + 1.
\]
In particular, if $t < \frac{1}{3}$ then 3 ≤ k ≤ 6, and so when the bounds (12.4.3) hold with $t = \frac{1}{3}$ we obtain
\[
  \frac{1}{2b} \le \frac{k - 1}{4b} = P(A_{j+1} \cup \cdots \cup A_{j+k-1}) \le P(B_{j^\star}) \le P(A_j \cup \cdots \cup A_{j+k}) = \frac{k + 1}{4b} \le \frac{7}{4b}.
\]
Apply Bernstein’s inequality using $t = \frac{1}{3}$, or $v = \frac{1}{12b}$, with variance bound $\sigma^2 \le P(A_j) \le \frac{1}{4b}$
to obtain that for each j = 1, . . . , 4b, we have
\[
  \mathbb{P}\Big( |P_n(A_j) - P(A_j)| \ge \frac{1}{12b} \Big) \le 2 \exp\bigg( -\frac{n / (12b)^2}{2/(4b) + \frac{2}{3} \frac{1}{12b}} \bigg) = 2 \exp\Big( -\frac{n}{80b} \Big).
\]
Apply a union bound to obtain the lemma once we recognize that n/b − log b ≳ n/b whenever
n/ log n ≥ b.
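The constant 1/80 in the exponent can be checked with exact rational arithmetic. The sketch below assumes the Bernstein form exp(−n v² / (2σ² + (2/3) v)) used in the display, with deviation v = v0/b and variance bound σ² = s0/b, so that the exponent is (n/b) times a constant; the helper name and the v0, s0 parametrization are hypothetical.

```python
from fractions import Fraction


def bernstein_constant(v0, s0):
    """Return c such that n v^2 / (2 sigma^2 + (2/3) v) = c * n / b.

    Here v = v0 / b is the deviation and sigma^2 = s0 / b the variance bound,
    so every term scales as 1 / b and the constant c does not depend on b.
    """
    v0, s0 = Fraction(v0), Fraction(s0)
    return v0 ** 2 / (2 * s0 + Fraction(2, 3) * v0)
```

With v0 = 1/12 and s0 = 1/4 this returns 1/80, matching the final exponent n/(80b).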
Proof of Lemma 12.4.3 Assume that $P(B_j) \le \frac{7}{4b}$. Then applying Bernstein’s inequality
(4.1.8), and using that $1\{f(X) \in B_j\}$ is a Bernoulli random variable with mean (and hence
variance) at most $\frac{7}{4b}$, we have
\[
  \mathbb{P}\Big( P_n^{(2)}(B_j) \ge \frac{2}{b} \Big) \le \exp\bigg( -\frac{n / (4b)^2}{\frac{7}{4b} + \frac{2}{3} \frac{1}{4b}} \bigg) = \exp\Big( -\frac{1}{28 + 8/3} \frac{n}{b} \Big) \le \exp\Big( -\frac{1}{31} \frac{n}{b} \Big).
\]
Similarly, we have $\mathbb{P}(P_n^{(2)}(B_j) \le \frac{1}{4b}) \le \exp(-\frac{1}{31} \frac{n}{b})$ as $P(B_j) \ge \frac{1}{2b}$. Applying a union bound over
j = 1, . . . , b, then noting that n/b − log b ≳ n/b whenever n/ log n ≥ b, we again obtain
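Proof of Lemma 12.4.4 Recall that $g(x) = \mathbb{E}_{P_n^{(2)}}[Y \mid \mathrm{bin}(X) = \mathrm{bin}(x)]$, and note that g is
constant on $x \in f^{-1}(B_j)$. Fix a bin j, and let $I(j) = \{i \in \{n + 1, \ldots, 2n\} \mid f(X_i) \in B_j\}$ denote
the indices in the second sample for which $f(X_i)$ falls in bin $B_j$. Then conditional on i ∈ I(j),
we have $Y_i \sim P(Y \in \cdot \mid f(X) \in B_j)$, so that
\[
  \mathbb{P}\bigg( \bigg| \frac{1}{|I(j)|} \sum_{i \in I(j)} Y_i - \mathbb{E}[Y \mid f(X) \in B_j] \bigg| \ge t \ \Big|\ I(j) \bigg) \le 2 \exp\big( -2\, \mathrm{card}(I(j))\, t^2 \big)
\]
by Hoeffding’s inequality. Then (conditioning on the bins $\{B_j\}$ chosen using $P_n^{(1)}$, which by
assumption satisfy $P(B_j) \in [\frac{1}{2b}, \frac{7}{4b}]$), we have for any fixed $x \in f^{-1}(B_j)$ that
\begin{align*}
  \mathbb{P}\bigg( \sup_{x \in f^{-1}(B_j)} |g(x) - \mathbb{E}[Y \mid f(X) \in B_j]| \ge t \ \Big|\ P_n^{(1)} \bigg)
  & = \sum_{I \subset [n]} \mathbb{P}\Big( |g(x) - \mathbb{E}[Y \mid f(X) \in B_j]| \ge t,\ I(j) = I \ \Big|\ P_n^{(1)} \Big) \\
  & \le \mathbb{P}\Big( \mathrm{card}(I(j)) < \frac{n}{4b} \ \Big|\ P_n^{(1)} \Big) + \sum_{I \subset [n],\, \mathrm{card}(I) \ge n/4b} \mathbb{P}\Big( |g(x) - \mathbb{E}[Y \mid f(X) \in B_j]| \ge t,\ I(j) = I \ \Big|\ P_n^{(1)} \Big) \\
  & \le \mathbb{P}\Big( P_n^{(2)}(B_j) < \frac{1}{4b} \Big) + 2 \exp\Big( -\frac{n t^2}{2b} \Big),
\end{align*}
where the final line applies Hoeffding’s inequality. Taking $t^2 = \frac{2b \log \frac{2b}{\delta}}{n}$ and applying Lemma 12.4.3
and a union bound gives Lemma 12.4.4.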
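For completeness, the arithmetic behind the choice of t can be written out. This is a short worked computation using only the quantities in the final display, showing why the Hoeffding term sums to δ over the b bins.

```latex
With $t^2 = \frac{2b}{n} \log \frac{2b}{\delta}$, the Hoeffding term for a single bin is
\[
  2 \exp\Big( -\frac{n t^2}{2b} \Big)
  = 2 \exp\Big( -\log \frac{2b}{\delta} \Big)
  = 2 \cdot \frac{\delta}{2b}
  = \frac{\delta}{b},
\]
so a union bound over the bins $j = 1, \ldots, b$ contributes total failure probability
at most $b \cdot \frac{\delta}{b} = \delta$.
```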
(i) it is sound and complete (12.2.2), that is, M(f ) = 0 if and only if f is calibrated for P , and
(ii) it is continuous with respect to the $L^1(P)$ metric on F, that is, for any f, if $f_n$ is a sequence
of functions with $\mathbb{E}[\|f(X) - f_n(X)\|] \to 0$, then
$M(f) - M(f_n) \to 0$.
(iii) it is Lipschitz continuous with respect to the $L^1(P)$ metric on F, that is, for some C < ∞
\[
  |M(f_0) - M(f_1)| \le C\, \mathbb{E}[\|f_0(X) - f_1(X)\|]
\]
for all $f_0, f_1 \in \mathcal{F}$.
If conditions (i) and (ii) (respectively (iii)) hold for all P in a collection of distributions P on X ×Y,
we will say that M is a continuous (respectively, Lipschitz) calibration measure for P.
The desiderata (ii) and (iii) are matters of taste; the central idea is that some type of continuity
is essential for efficient modeling, estimation, and analysis. We leave the norm k·k implicit in the
definition, and we typically omit the distribution P from the calibration metric as it is clear from
context. The two parts of Definition 12.2 admit many possible calibration measures. We consider
two types of measures, which are (almost) dual to one another, as examples. Both use a variational
representation, where in one we essentially look for the “closest” function that is calibrated, while
in the other, we investigate the ease with which we can (quantitatively) certify that a predictor f
is uncalibrated.
A key concept will be the equivalence of calibration measures, where we target a quantitative
equivalence. To define this, let 0 < α, β < ∞. Then we say that two candidate calibration measures
$M_0$ and $M_1$ on $\mathcal{F} \subset \{\mathcal{X} \to \mathbb{R}^k\}$ are (α, β)-equivalent if there exist constants $c_0, c_1$ (which may depend
on Y) such that
\[
  M_0(f) \le c_0 \big[ M_1(f) + M_1(f)^\alpha \big] \quad \text{and} \quad M_1(f) \le c_1 \big[ M_0(f) + M_0(f)^\beta \big]. \tag{12.5.1}
\]
Distances to calibration. Recall the distance to calibration (12.2.3), which for $\mathcal{C}(P) = \{g :
\mathcal{X} \to \mathbb{R}^k \mid \mathbb{E}_P[Y \mid g(X)] = g(X)\}$ (where the defining equality holds with P-probability 1
over X) has definition $d_{\mathrm{cal}}(f) := \inf_g \{\mathbb{E}[\|g(X) - f(X)\|] \ \text{s.t.}\ g \in \mathcal{C}(P)\}$. The measure (12.2.3)
is, after appropriate normalization, the largest Lipschitz measure of calibration: if M is any
Lipschitz calibration measure (with constant C = 1 in Definition 12.2 part (iii)), then taking a
perfectly calibrated g with ece(g) = 0, we necessarily have M(g) = 0. Then for any f we have
$M(f) = M(f) - M(g) \le \mathbb{E}[\|f(X) - g(X)\|]$, and taking an infimum over such g guarantees
$M(f) \le d_{\mathrm{cal}}(f)$.
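The second related quantity, which sometimes admits cleaner properties for analysis, is the penalized
calibration distance, which we define as
\[
  p_{\mathrm{cal}}(f) := \inf_g \big\{ \mathbb{E}[\|f(X) - g(X)\|] + \mathbb{E}[\|\mathbb{E}[Y \mid g(X)] - g(X)\|] \big\}. \tag{12.5.2}
\]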
These quantities are strongly related, and in the sequel (see Corollary 12.5.8), we show that
\[
  p_{\mathrm{cal}}(f) \le d_{\mathrm{cal}}(f) \le p_{\mathrm{cal}}(f) + C_{\mathcal{Y}} \sqrt{p_{\mathrm{cal}}(f)},
\]
where $C_{\mathcal{Y}}$ is a constant depending only on the set Y whenever Y has finite diameter.
To build intuition for the definition (12.5.2), consider the two quantities. The first measures
the usual L1 distance between the function f and a putative alternative g. The second is the
expected calibration error of g. By restricting the infimum in definition (12.5.2) to functions g
with ece(g) = 0, we simply have the L1 distance to the nearest calibrated function; as is, the
additional term in (12.5.2) allows trading between the distance to a calibrated function and the
actual calibration error. We also have the following proposition.
Proposition 12.5.1. The functions dcal and pcal are Lipschitz calibration measures.
\[
  p_{\mathrm{cal}}(f_0) - p_{\mathrm{cal}}(f_1) \le \inf_g \big\{ \mathbb{E}[\|f_0(X) - g(X)\|] + \mathbb{E}[\|\mathbb{E}[Y \mid g(X)] - g(X)\|] \big\}
\]
admits similar properties, as it also satisfies our desiderata for a calibration measure. In particular,
if we take W to be the class $\mathcal{W}_{\|\cdot\|}$ of bounded Lipschitz witness functions (12.2.5), we have the next
two propositions.
Proposition 12.5.2. Let F consist of functions with $\mathbb{E}[\|f(X)\|] < \infty$ and assume $\mathbb{E}[\|Y\|] < \infty$.
Then $\mathrm{CE}(\cdot, \mathcal{W}_{\|\cdot\|})$ is a continuous calibration measure over F.
Because continuity is such a weak requirement, the proof of this result relies on measure theoretic
results, so we defer it to Section 12.6.2.
When we assume the collection F consists of bounded functions and Y itself is bounded, we
can give a stronger guarantee for the weak calibration, and we no longer need to rely on careful
arguments considering the order of various limits.
Proposition 12.5.3. Assume that diam(Y) is finite and that F is a collection of bounded functions
X → R^k. Then $\mathrm{CE}(\cdot, \mathcal{W}_{\|\cdot\|})$ is a Lipschitz calibration measure over F.
Proof Let $\mathcal{W} = \mathcal{W}_{\|\cdot\|}$ for shorthand. That CE(f, W) = 0 when f is calibrated is immediate, as
by definition of conditional expectation we have
Lemma 12.5.4. Let S ∈ R^k be a random variable and $\mathbb{E}[\|g(S)\|] < \infty$. If $\mathbb{E}[\langle w(S), g(S) \rangle] = 0$ for
all bounded and 1-Lipschitz functions w, then g(S) = 0 with probability 1.
The converse is now trivial: let S = f(X), and note that $\mathrm{CE}(f, \mathcal{W}) = \sup_{w \in \mathcal{W}} \mathbb{E}[\langle w(S), \mathbb{E}[Y \mid S] - S \rangle]$,
and take g(S) = E[Y | S] − S in Lemma 12.5.4.
To see that CE is Lipschitz, fix ε > 0, let $w_0 \in \mathcal{W}$ be such that $\mathrm{CE}(f_0, \mathcal{W}) \le \mathbb{E}[\langle w_0(f_0(X)), Y - f_0(X) \rangle] + \epsilon$,
and let C < ∞ satisfy $C \ge \sup_{y \in \mathcal{Y}, x \in \mathcal{X}, f \in \mathcal{F}} \|y - f(x)\|$. Then
\begin{align*}
  \mathrm{CE}(f_0, \mathcal{W}) - \mathrm{CE}(f_1, \mathcal{W})
  & \le \mathbb{E}[\langle w_0(f_0(X)), Y - f_0(X) \rangle] - \mathbb{E}[\langle w_0(f_1(X)), Y - f_1(X) \rangle] + \epsilon \\
  & \le \mathbb{E}[\langle w_0(f_0(X)) - w_0(f_1(X)), Y - f_0(X) \rangle] + \mathbb{E}[\langle w_0(f_1(X)), f_1(X) - f_0(X) \rangle] + \epsilon \\
  & \le C\, \mathbb{E}[\|w_0(f_0(X)) - w_0(f_1(X))\|_*] + \mathbb{E}[\|f_1(X) - f_0(X)\|] + \epsilon \\
  & \le (1 + C)\, \mathbb{E}[\|f_1(X) - f_0(X)\|] + \epsilon.
\end{align*}
Repeating the same argument, mutatis mutandis, for the lower bound gives the Lipschitz continuity
as desired.
The family of weak calibration measures CE(f, W) as we vary the collection of potential witness
functions W yields a variety of behaviors. Different choices of W can give different continuous
calibration measures, where we may modify Definition 12.1 part (ii) to other notions of continuity,
such as Lipschitzness with respect to L2 (P ) norms. We explore a few of these in the exercises at
the end of the chapter.
Theorem 12.5.5. Let $\mathcal{Y} = \{e_1, \ldots, e_k\}$ and $\mathcal{W}_{\|\cdot\|}$ be the collection (12.2.5) of bounded Lipschitz
functions for a norm $\|\cdot\|$ on R^k. Then $d_{\mathrm{cal}}$, $p_{\mathrm{cal}}$, and $\mathrm{CE}(\cdot, \mathcal{W}_{\|\cdot\|})$ are each $(\frac{1}{2}, \frac{1}{2})$-equivalent. Moreover,
this equivalence is sharp, in that they are not (α, β)-equivalent for any $\alpha, \beta > \frac{1}{2}$.
The theorem follows as a compilation of the other results in this section. Along the way to demon-
strating this theorem, we introduce a few alternative measures of calibration we use as stepping
stones toward our final results. While many of our derivations will apply for general sets Y, in
some cases we will restrict to multiclass classification problems, so that Y = {e1 , . . . , ek } ⊂ Rk
are the k standard basis vectors. We present two main results: the first, Theorem 12.5.6, shows
an equivalence (up to a square root) between the penalized calibration distance (12.5.2) and the
partitioned calibration error (12.2.6). As a corollary of this result, we obtain the equivalence of the
distance to calibration (12.2.3) and penalized distance to calibration (12.5.2). The second main
result, Theorem 12.5.9, gives a similar equivalence between the penalized distance (12.5.2) and
the calibration error relative to Lipschitz functions (12.2.4). Throughout, to make the calculations
cleaner and more transparent, we restrict our functions to make predictions in M = conv(Y).
Once we work exclusively in the space of random scores S = f(X), we may define alternative
distances to calibration in analogy with the (penalized) distances to calibration, which will allow
us to more easily relate distances to the partitioned error (12.2.6). Thus, we define
\[
  d_{\mathrm{cal,low}}(f) := \inf_V \big\{ \mathbb{E}[\|S - V\|] \ \ \text{s.t.} \ \ \mathbb{E}[Y \mid V] = V \big\} \tag{12.5.3a}
\]
and
\[
  p_{\mathrm{cal,low}}(f) := \inf_V \big\{ \mathbb{E}[\|S - V\|] + \mathbb{E}[\|\mathbb{E}[Y \mid V] - V\|] \big\}, \tag{12.5.3b}
\]
where the infima are over all random variables V taking values in Conv(Y), which can have
arbitrary distribution with (S, Y) (but do not modify the joint (S, Y)), and in case (12.5.3a) are
calibrated. This formulation is convenient in that we can represent it as a convex optimization
problem, allowing us to bring the tools of duality to bear on it, though we defer this temporarily. By
considering V = g(X) for functions g : X → Conv(Y), we immediately see that $p_{\mathrm{cal}}(f) \ge p_{\mathrm{cal,low}}(f)$.
We can also consider upper distances
\[
  d_{\mathrm{cal,up}}(f) := \inf_{g : \mathbb{R}^k \to \mathrm{Conv}(\mathcal{Y})} \big\{ \mathbb{E}[\|S - g(S)\|] \ \ \text{s.t.} \ \ \mathbb{E}[Y \mid g(S)] = g(S) \big\}
\]
and
\[
  p_{\mathrm{cal,up}}(f) := \inf_{g : \mathbb{R}^k \to \mathrm{Conv}(\mathcal{Y})} \big\{ \mathbb{E}[\|S - g(S)\|] + \mathbb{E}[\|\mathbb{E}[Y \mid g(S)] - g(S)\|] \big\},
\]
which restrict the definitions (12.2.3) and (12.5.2) to compositions. We therefore have the inequalities
\[
  d_{\mathrm{cal,low}}(f) \le d_{\mathrm{cal}}(f) \le d_{\mathrm{cal,up}}(f) \quad \text{and} \quad p_{\mathrm{cal,low}}(f) \le p_{\mathrm{cal}}(f) \le p_{\mathrm{cal,up}}(f). \tag{12.5.4}
\]
The partitioned calibration error (12.2.6) allows us to provide a bound relating the calibration
error and the lower and upper calibration errors. To state the theorem, we make a normalization
with $\|\cdot\|$, assuming without loss of generality that $\|\cdot\|_\infty \le \|\cdot\|$.
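Theorem 12.5.6. Let $\mathcal{Y} \subset \mathbb{R}^k$ have finite diameter diam(Y) in the norm $\|\cdot\|$. Let S = f(X) ∈ R^k.
Then for all ε > 0,
\begin{align*}
  p_{\mathrm{cal,up}}(f) \le d_{\mathrm{cal,up}}(f) \le \mathrm{pce}(S)
  & \le \Big( 1 + \frac{2k\, \mathrm{diam}(\mathcal{Y})}{\varepsilon} \Big) p_{\mathrm{cal,low}}(f) + \|\mathbf{1}_k\|_* \varepsilon \\
  & \le \Big( 1 + \frac{2k\, \mathrm{diam}(\mathcal{Y})}{\varepsilon} \Big) d_{\mathrm{cal,low}}(f) + \|\mathbf{1}_k\|_* \varepsilon.
\end{align*}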
While the first inequality in Theorem 12.5.6 is relatively straightforward to prove, the second
requires substantially more care, so we defer the proof of the theorem to Section 12.6.4.
We record a few corollaries, one consequence of which is to show that the partitioned calibration
error (12.2.6) is at least a calibration measure in the sense of Definition 12.2.(i). Theorem 12.5.6
also shows that the penalized calibration distance pcal (f ) is equivalent, up to taking a square root,
to the upper and lower calibration “distances”. In each corollary, we let Ck = k1k k∗ for shorthand.
Corollary 12.5.7. Let the conditions of Theorem 12.5.6 hold. Then
p q
pcal,low (f ) ≤ pcal (f ) ≤ pcal,low (f ) + 2 Ck k diam(Y) pcal,low (f )
and p q
dcal,low (f ) ≤ dcal (f ) ≤ dcal,low (f ) + 2 Ck k diam(Y) dcal,low (f ).
Proof
p The first lower bound is immediate (recall the naive inequalitites (12.5.4)). Now set
ε = 2k diam(Y)pcal,low (f )/Ck in Theorem 12.5.6, and recognize that pcal,low (f ) ≤ pcal,up (f ).
We also obtain an approximate equivalence between the calibration distance dcal and penalized
calibration distance pcal from definitions (12.2.3) and (12.5.2).
Corollary 12.5.8. Let the conditions of Theorem 12.5.6 hold. Then
\[
p_{\mathrm{cal}}(f) \le d_{\mathrm{cal}}(f) \le p_{\mathrm{cal}}(f) + 2\sqrt{C_k \, k \, \mathrm{diam}(\mathcal{Y}) \, p_{\mathrm{cal}}(f)}.
\]
Proof The first inequality is immediate by definition. For the second, note (see Lemma 12.6.4
in the proof of Theorem 12.5.6 in Section 12.6.4) that pcal,low(f) ≤ pce(S) for S = f(X). Then
apply Theorem 12.5.6 with $\varepsilon = \sqrt{2k\,\mathrm{diam}(\mathcal{Y})\,p_{\mathrm{cal,low}}(f)/C_k}$ as in Corollary 12.5.7, and recognize
that pcal,low ≤ pcal.
Let us instantiate the theorem and its corollaries in a few special cases. If we make binary
predictions with Y = {0, 1}, then C_k = k = diam(Y) = 1, and Theorem 12.5.6 implies that
\[
p_{\mathrm{cal,low}}(f) \le p_{\mathrm{cal}}(f) \le p_{\mathrm{cal,low}}(f) + 2\sqrt{p_{\mathrm{cal,low}}(f)}.
\]
For k-class multiclass classification, where we identify Y = {e_1, . . . , e_k} with the k standard basis
vectors, we have the bounds
\[
p_{\mathrm{cal,low}}(f) \le p_{\mathrm{cal}}(f) \le p_{\mathrm{cal,low}}(f) + 2\sqrt{k \, p_{\mathrm{cal,low}}(f)},
\]
so long as we measure calibration errors with respect to the ℓ1-norm, that is, ‖y − f(x)‖₁, because
diam(Y) ≤ 1 and C_k = ‖1‖∞ = 1.
JCD Comment: Remark on sharpness here.
The same argument implies the following analogue for the distance to calibration (12.2.3).
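Corollary 12.5.11. Let Y = {e_1, . . . , e_k} and ‖·‖ = ‖·‖₁ be the ℓ1-norm. Then for any f : X →
Conv(Y), we have
\[
\frac{1}{2} \mathrm{CE}(f, \mathcal{W}_{\|\cdot\|}) \le d_{\mathrm{cal}}(f) \le \mathrm{CE}(f, \mathcal{W}_{\|\cdot\|}) + 2\sqrt{k \, \mathrm{CE}(f, \mathcal{W}_{\|\cdot\|})}.
\]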
To prove the converse requires more; we present most of the argument for an arbitrary discrete
space Y and specialize to the multiclass setting only at the end. The starting point is to reduce the
problem to a discrete problem over probability mass functions rather than general distributions, as
then it is much easier to apply the standard tools of convex duality. Consider the value
Let b ∈ N and $\mathcal{S}_b$ be a (minimal) 1/b covering {s_1, . . . , s_N} of Conv(Y), and define S_b to be the
projection of S to the nearest s_i. Then ‖S − V‖ = ‖S_b − V‖ ± 1/b, and
\[
d_{\mathrm{cal,low}}(S) = \inf_{V} \left\{ \mathbb{E}[\|S_b - V\|] \ \text{s.t.} \ \mathbb{E}[Y \mid V] = V \right\} \pm \frac{1}{b}.
\]
Now, if we replace the infimum over arbitrary joint distributions of (S_b, Y, V) leaving the marginal
(S_b, Y) unchanged (with V calibrated) with an infimum over only discrete distributions on V, we
have
\[
d_{\mathrm{cal,low}}(S) \le \inf_{V \ \text{finitely supported}} \left\{ \mathbb{E}[\|S_b - V\|] \ \text{s.t.} \ \mathbb{E}[Y \mid V] = V \right\} + \frac{1}{b}. \tag{12.5.5}
\]
Notably, the infimum is non-empty, as we can always choose V = Y .
With the problem (12.5.5) in hand, we can write a finite dimensional optimization problem
whose optimal value is the discretized infimum on the right side. Without loss of generality assuming
that S is finitely supported, we let psy = P(S = s, Y = y) be the probability mass function of
(S, Y). Then introducing the joint distribution Q with p.m.f. $q_{syv} = Q(S = s, Y = y, V = v)$, the
infimum (12.5.5) has the constraint that $\sum_v q_{syv} = p_{sy}$. Then $\mathbb{E}[\|S - V\|] = \sum_{s,y,v} q_{syv} \|s - v\|$ and
the calibration constraint E[Y | V] = V is equivalent to the equality constraint that $\sum_{s,y} q_{syv}(y - v) = 0$
for each v. This yields the convex optimization problem
\[
\begin{array}{ll}
\text{minimize} & \sum_{s,y,v} q_{syv} \|s - v\| \\
\text{subject to} & \sum_v q_{syv} = p_{sy}, \quad q \succeq 0, \quad \sum_{s,y} q_{syv}(y - v) = 0 \ \text{for all} \ v.
\end{array}
\tag{12.5.6}
\]
\[
L(q, \lambda, \theta, \beta)
= \sum_{s,y,v} q_{syv} \|s - v\| + \sum_{s,y,v} q_{syv} \beta_v^\top (y - v) - \sum_{s,y} \lambda_{sy} \Big( \sum_v q_{syv} - p_{sy} \Big) - \langle \theta, q \rangle.
\]
for each triple (s, y, v), we have inf q L(q, λ, θ, β) = −∞. The equality in the preceding display is
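equivalent to $\|s - v\| + \beta_v^\top (y - v) \ge \lambda_{sy}$, so that eliminating the θ ⪰ 0 variables, we have the dual
\[
\begin{array}{ll}
\text{maximize} & \sum_{s,y} \lambda_{sy} p_{sy} \\
\text{subject to} & \lambda_{sy} \le \|s - v\| + \beta_v^\top (y - v), \ \text{all} \ s, y, v
\end{array}
\]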
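to problem (12.5.6). Equivalently, recognizing that at the optimum we must saturate the constraints
on λ via $\lambda_{sy} = \min_v \{\|s - v\| + \beta_v^\top (y - v)\}$, we have
\[
\text{maximize} \quad \sum_{s,y} p_{sy} \min_v \left\{ \|s - v\| + \beta_v^\top (y - v) \right\} \tag{12.5.7}
\]
\[
\sup_{w \in \mathcal{W}_{\|\cdot\|}} \mathbb{E}[\langle w(S), Y - S \rangle] \ge \frac{1}{C} d_{\mathrm{cal,low}}(S),
\]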
and similarly
by the triangle inequality. Here, we specialize to the particular multiclass classification case in which
the set Y = {e_1, . . . , e_k} consists of extreme points of the probability simplex, so that s ∈ Conv(Y)
means that ⟨1, s⟩ = 1 and s ⪰ 0. Abusing notation slightly, let λ_i = λ_{e_i} for i = 1, . . . , k. Then
define the function
\[
w(s) := \begin{bmatrix} \lambda_1(s) \\ \vdots \\ \lambda_k(s) \end{bmatrix}.
\]
By inspection, we have
\[
\|w(s) - w(s')\|_* \le \big\| \, \|s - s'\| \, \mathbf{1} \, \big\|_* = \|\mathbf{1}\|_* \, \|s - s'\|.
\]
Lemma 12.6.1 (Egorov’s theorem). Let f_n → f in L^p(P) for some p ≥ 1. Then for each ϵ > 0,
there exists a set A_ϵ of measure at least P(A_ϵ) ≥ 1 − ϵ such that f_n → f uniformly on A_ϵ.
Lemma 12.6.3 (Density of Lipschitz functions). Let C_c^Lip be the collection of compactly supported
Lipschitz functions on R^k and P a probability distribution on R^k. Then C_c^Lip is dense in L^p(P), that
is, for each ϵ > 0 and f with E_P[|f(X)|^p] < ∞, there exists g ∈ C_c^Lip with E_P[|g(X) − f(X)|^p]^{1/p} ≤ ϵ.
Substituting these bounds into inequality (12.6.1), we have for any ϵ > 0 that there exists a set A_ϵ
with P(A_ϵ) ≥ 1 − ϵ and for which
By taking a supremum over w ∈ W_k in the last display and recognizing that ϵ > 0 was arbitrary,
we have shown that
\[
\liminf_n \mathrm{CE}(f_n, \mathcal{W}_k) \ge \mathrm{CE}(f, \mathcal{W}_k)
\]
for all k < ∞. By Lemma 12.6.3, for any integrable f and for each ϵ > 0 there exists k such that
Noting that CE(f_n, W) ≥ CE(f_n, W_k) for any k and taking ϵ → 0 gives the lemma.
and
CE(fn , W) − CE(f, W) ≤ sup E[hw(fn (X)), Y − fn (X)i − hw(f (X)), Y − f (X)i].
w∈W
We focus on bounding the first display, as showing that the second tends to zero requires, mutatis
mutandis, an identical argument.
Fix any w ∈ W. Then
E[hw(f (X)), Y − f (X)i − hw(fn (X)), Y − fn (X)i]
= E[hw(f (X)) − w(fn (X)), Y − f (X)i] + E[hw(fn (X)), fn (X) − f (X)i]
≤ E[min{2, kf (X) − fn (X)k} kY − f (X)k] + E[kfn (X) − f (X)k],
where the inequality follows because ‖w(s) − w(s′)‖∗ ≤ 2 and ‖w(s) − w(s′)‖∗ ≤ ‖s − s′‖ for any
s, s′ by construction. The second expectation certainly tends to zero as n → ∞, so we consider the
first. Define g_n(x, y) = min{2, ‖f(x) − f_n(x)‖} ‖y − f(x)‖. Then g_n(x, y) ≤ g(x, y) := 2‖y − f(x)‖,
which has finite expectation by assumption. Moreover, Egorov’s theorem (Lemma 12.6.1) guarantees
that for each k, there is a set A_k with P(A_k) ≥ 1 − 1/k and for which g_n → 0 uniformly on
A_k (because E[‖f(X) − f_n(X)‖] → 0). Define A_∞ = ∪_k A_k, so that P(A_∞) = 1, and g_n(x, y) → 0
pointwise on A_∞. Then the dominated convergence theorem guarantees that
\[
\mathbb{E}[g_n(X, Y)] = \mathbb{E}[g_n(X, Y) \mathbf{1}\{(X, Y) \in A_\infty\}] + \underbrace{\mathbb{E}[g_n(X, Y) \mathbf{1}\{(X, Y) \notin A_\infty\}]}_{= 0} \to 0.
\]
as n ↑ ∞. We now employ the same device we use in the proof of Lemma 12.2.1. For m ∈ N,
let B_m = ∪_{n≤m} A_{1/n}. Then w_n → f uniformly on B_m, and so E[‖g(S)‖₂] ≤ E[‖g(S)‖₂ 1{S ∉ B_m}],
that is, E[‖g(S)‖₂ 1{S ∈ B_m}] = 0. Monotone convergence implies 0 = lim_{m→∞} E[‖g(S)‖₂ 1{S ∈ B_m}] =
E[‖g(S)‖₂ 1{S ∈ B_∞}] where B_∞ = ∪_m B_m. As P(B_∞) = 1 by continuity of measure, we have
E[‖g(S)‖₂] = 0, giving the lemma.
Proof Fix any partition A, and define qA (s) to be the (unique) set A such that s ∈ A (so we
quantize s). Then set g(s) = E[Y | S ∈ qA (s)] to be the expectation of Y conditional on S being in
the same partition element as s. Then g(S) = E[Y | g(S)] with probability 1, so that g is perfectly
calibrated, and
To prove the claimed upper bound requires more work. For pedagogical reasons, let us attempt
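to prove a similar upper bound relating pce(S) to pcal,low(f). We might begin with a partition A
with maximal diameter diam(A) ≤ ϵ for A ∈ A, and for random variables (S, V, Y), begin with the
first term in the partition error, whence
\begin{align*}
\mathrm{CE}(S, \mathcal{A})
& \le \sum_{A \in \mathcal{A}} \Big( \|\mathbb{E}[(S - V) \mathbf{1}\{S \in A\}]\| + \|\mathbb{E}[(V - Y) \mathbf{1}\{S \in A\}]\| \Big) \\
& \le \mathbb{E}[\|S - V\|] + \sum_{A \in \mathcal{A}} \|\mathbb{E}[(V - Y) \mathbf{1}\{V \in A\}]\| + \sum_{A \in \mathcal{A}} \|\mathbb{E}[(V - Y)(\mathbf{1}\{S \in A\} - \mathbf{1}\{V \in A\})]\| \\
& \le \mathbb{E}[\|S - V\|] + \mathbb{E}[\|\mathbb{E}[Y \mid V] - V\|] + \sum_{A \in \mathcal{A}} \|\mathbb{E}[(V - Y)(\mathbf{1}\{S \in A\} - \mathbf{1}\{V \in A\})]\|
\end{align*}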
If S and V had continuous distributions, we would expect the probability that they fail to belong
to the same partition elements to scale as E[kS − V k]. This may fail, but to rectify the issue, we
can randomize.
Consequently, let us consider the randomized partition error, which we index with ε > 0 and
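for U ∼ Uniform[−1, 1]^k define as
\[
\mathrm{rpce}_\varepsilon(S) := \inf_{\mathcal{A}} \left\{ \sum_{A \in \mathcal{A}} \|\mathbb{E}[(S - Y) \mathbf{1}\{S + \varepsilon U \in A\}]\| + \sum_{A \in \mathcal{A}} \mathrm{diam}(A) \, \mathbb{P}(S \in A) \right\}. \tag{12.6.3}
\]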
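(The choice of Uniform[−1, 1]^k is made only for convenience in the calculations to follow.) Letting
c_k = ‖1_k‖∗, we see immediately that
\begin{align*}
\|\mathbb{E}[(S - Y) \mathbf{1}\{S + \varepsilon U \in A\}]\|
& \le \|\mathbb{E}[(S - V) \mathbf{1}\{S + \varepsilon U \in A\}]\| + \|\mathbb{E}[(V - Y) \mathbf{1}\{V + \varepsilon U \in A\}]\| \\
& \quad + \|\mathbb{E}[(V - Y)(\mathbf{1}\{S + \varepsilon U \in A\} - \mathbf{1}\{V + \varepsilon U \in A\})]\| \\
& \le \|\mathbb{E}[(S - V) \mathbf{1}\{S + \varepsilon U \in A\}]\| + \|\mathbb{E}[(V - Y) \mathbf{1}\{V + \varepsilon U \in A\}]\| \\
& \quad + \mathbb{E}\big[\|V - Y\| \cdot \big(\mathbb{P}(V + \varepsilon U \in A, S + \varepsilon U \notin A \mid V, S) + \mathbb{P}(S + \varepsilon U \in A, V + \varepsilon U \notin A \mid V, S, Y)\big)\big]
\end{align*}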
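Summing over sets A and using the triangle inequality and that S + εU ∈ A for some A, we find
\begin{align*}
\sum_{A \in \mathcal{A}} \|\mathbb{E}[(S - Y) \mathbf{1}\{S + \varepsilon U \in A\}]\|
& \le \mathbb{E}[\|S - V\|] + \mathbb{E}[\|\mathbb{E}[Y \mid V] - V\|] \tag{12.6.4} \\
& \quad + 2 \mathbb{E}\Big[\|V - Y\| \sum_{A \in \mathcal{A}} \mathbb{P}(V + \varepsilon U \in A, S + \varepsilon U \notin A \mid V, S, Y)\Big].
\end{align*}
We now may bound the probability in inequality (12.6.4). Recall that A = [−ε, ε]^k + εz for
some z ∈ 2Z^k, and fix v, s ∈ R^k. Let B = [−1, 1]^k be the ℓ∞ ball. Then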
Figure 12.2. The volume argument in inequality (12.6.5). In k dimensions, the hypercube of
side-length δ can be replicated 1/δ^{k−1} times on each exposed base of the cube centered at v, where
δ = ‖s − v‖∞. There are at most k such faces, giving volume at most kδ^k/δ^{k−1} = kδ to the gray
region.
Figure 12.2. The k-dimensional surface area of one side of a hypercube of radius δ is 2kδ^{k−1}, and
we can put at most 1/δ^{k−1} boxes in each facial part of the grey region.)
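Substituting inequality (12.6.5) into the bound (12.6.4) and conditioning and deconditioning on
V, S, we find that
\begin{align*}
\sum_{A \in \mathcal{A}} \|\mathbb{E}[(S - Y) \mathbf{1}\{S + \varepsilon U \in A\}]\|
& \le \mathbb{E}[\|S - V\|] + \mathbb{E}[\|\mathbb{E}[Y \mid V] - V\|] + \frac{2k}{\varepsilon} \mathbb{E}\Big[\|V - Y\| \, \|V - S\|_\infty \sum_{A \in \mathcal{A}} \mathbf{1}\{V + \varepsilon U \in A\}\Big] \\
& = \mathbb{E}[\|S - V\|] + \mathbb{E}[\|\mathbb{E}[Y \mid V] - V\|] + \frac{2k}{\varepsilon} \mathbb{E}[\|Y - V\| \, \|V - S\|_\infty].
\end{align*}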
Taking an infimum over partitions A gives the lemma.
12.7 Bibliography
Draft: Calibration remains an active research area. The initial references for online calibration are
Foster and Vohra [84], Dawid and Vovk [59]. The idea of calibeating is most present in Foster and
Hart [85]. Our proof of calibeating is based on Kumar et al. [123]. Blasiok et al. [31] demonstrate
the equivalence of the different metrics for measuring calibration, focusing on the case of binary
prediction; the extension to vector-valued Y appears to be new. The ideas of the postprocessing
gap also descend from Blasiok et al. [32], and the connections with general proper losses also
appear to be new. Propositions 12.5.1, 12.5.2, and 12.5.3 are new in that they are the first to
demonstrate that the measures are valid calibration measures (Definition 12.1, part (i)).
JCD Comment: A few more things to add either in the bibliography or the introduction
to the section:
1. We only really do calibration for binary/multiclass things. One would also really like to
predict full distributions Pt on general outcomes Y , which is harder (nearly impossible)
to do in any conditional sense.
2. It’s much easier to do predictive inference (cover) because don’t need accuracy
3. Maybe comment on variants for top entry (from multiclass to binary) classification
and why that is important. Maybe in the middle, maybe here.
12.8 Exercises
JCD Comment: Add a uniform convexity version of Proposition 12.3.5 as an exercise.
JCD Comment: Can we add an exercise about achieving weak calibration for different
classes of functions?
JCD Comment: A few potential exercises:
(i) Deal with any class W for which E[hw, f i] = 0 for all w ∈ W means f = 0, then
still get a continuous calibration measure
JCD Comment: Exercise: do Aaditya’s top-class calibration approach.
Chapter 13
3. Note the minimizer of `: we have s∗ (η) = sign(η − 1/2), and f ∗ (X) = sign(η(X) − 1/2)
minimizes risk R(f ) over all f
4. Minimizing f can be achieved pointwise, and we have
(b) Example 13.0.1 (Exponential loss): Consider the exponential loss, used in AdaBoost
(among other settings), which sets φ(s) = e^{−s}. In this case, we have
\[
\operatorname*{argmin}_s \ell_\phi(s, \eta) = \frac{1}{2} \log \frac{\eta}{1 - \eta}
\quad \text{because} \quad
\frac{\partial}{\partial s} \ell_\phi(s, \eta) = -\eta e^{-s} + (1 - \eta) e^{s}.
\]
Thus $f_\phi^*(x) = \frac{1}{2} \log \frac{\eta(x)}{1 - \eta(x)}$, and this is Fisher consistent. ◇
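As a quick numerical sanity check of the exponential-loss example, the following Python sketch compares the closed-form minimizer (1/2) log(η/(1 − η)) with a brute-force grid search over s. The function names and the grid range [−5, 5] are illustrative assumptions, not part of the notes.

```python
import math


def exp_conditional_risk(s, eta):
    """Conditional risk eta * exp(-s) + (1 - eta) * exp(s) of the exponential loss."""
    return eta * math.exp(-s) + (1.0 - eta) * math.exp(s)


def closed_form_minimizer(eta):
    """The minimizer (1/2) log(eta / (1 - eta)) from the example, for 0 < eta < 1."""
    return 0.5 * math.log(eta / (1.0 - eta))


def grid_minimizer(eta, lo=-5.0, hi=5.0, steps=20001):
    """Brute-force minimizer over an evenly spaced grid (illustration only)."""
    best_s, best_val = lo, exp_conditional_risk(lo, eta)
    for i in range(1, steps):
        s = lo + (hi - lo) * i / (steps - 1)
        val = exp_conditional_risk(s, eta)
        if val < best_val:
            best_s, best_val = s, val
    return best_s


if __name__ == "__main__":
    for eta in (0.1, 0.3, 0.7, 0.9):
        print(eta, closed_form_minimizer(eta), grid_minimizer(eta))
```

The sign of the minimizer agrees with sign(η − 1/2), which is the sense in which the example calls the loss Fisher consistent.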
(c) Classification calibration
1. Consider pointwise versions of risk (all that is necessary, turns out)
2. Define the infimal conditional φ-risks as
Definition 13.2. The margin-based loss φ is classification calibrated if H(δ) > 0 for
all δ > 0. Equivalently, for any η ≠ 1/2, we have $\ell_\phi^*(\eta) < \ell_\phi^{\mathrm{wrong}}(\eta)$.
5. Example (Example 13.0.1 continued): For the exponential loss, we have
\[
\ell_\phi^{\mathrm{wrong}}(\eta) = \inf_{s(2\eta - 1) \le 0} \left\{ \eta e^{-s} + (1 - \eta) e^{s} \right\} = e^0 = 1
\]
Proof Let A ⊂ Rd × R denote all the pairs (a, b) minorizing f , that is, those pairs
such that f (t) ≥ ha, ti − b for all t. Then we have
ψ(δn ) → 0 ⇔ δn → 0.
1. Some insights from theorem. Recall examples 13.0.1 and 13.0.2. For both of these, we
have that ψ(δ) = H(δ), as H is convex. For the hinge loss, φ(s) = [1 − s]+ , we obtain
for any f that
\[
\mathbb{P}(Y f(X) \le 0) - \inf_f \mathbb{P}(Y f(X) \le 0) \le \mathbb{E}\big[[1 - Y f(X)]_+\big] - \inf_f \mathbb{E}\big[[1 - Y f(X)]_+\big].
\]
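A pointwise version of the hinge bound can be checked on a grid. The sketch below uses the conditional form of the excess 0-1 risk that appears in the proof, namely |2η − 1| when sign(s) ≠ sign(2η − 1) and zero otherwise, and compares it with the excess conditional hinge risk. The function names, the grid, and the convention sign(0) = 0 are assumptions made for this illustration.

```python
def sign(t):
    """Sign with the convention sign(0) = 0."""
    return (t > 0) - (t < 0)


def hinge_conditional_risk(s, eta):
    """eta * [1 - s]_+ + (1 - eta) * [1 + s]_+ for margin s and eta = P(Y = 1 | x)."""
    return eta * max(0.0, 1.0 - s) + (1.0 - eta) * max(0.0, 1.0 + s)


def hinge_excess(s, eta):
    """Excess conditional hinge risk; the minimal value is 2 min{eta, 1 - eta}."""
    return hinge_conditional_risk(s, eta) - 2.0 * min(eta, 1.0 - eta)


def zero_one_excess(s, eta):
    """Excess conditional 0-1 risk: |2 eta - 1| when s has the wrong sign, else 0."""
    delta = 2.0 * eta - 1.0
    return abs(delta) if sign(s) != sign(delta) else 0.0


def bound_holds_on_grid(num_s=81, num_eta=41, tol=1e-12):
    """Check zero_one_excess <= hinge_excess for s in [-4, 4] and eta in [0, 1]."""
    for i in range(num_s):
        s = -4.0 + 8.0 * i / (num_s - 1)
        for j in range(num_eta):
            eta = j / (num_eta - 1)
            if zero_one_excess(s, eta) > hinge_excess(s, eta) + tol:
                return False
    return True
```

Taking expectations over X of the pointwise inequality gives the risk bound in the display.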
\[
\ell_\phi^{\mathrm{wrong}}(1/2) := \inf_{s(1 - 1) \le 0} \ell_\phi(s, 1/2) = \inf_s \ell_\phi(s, 1/2) = \ell_\phi^*(1/2),
\]
so $H(0) = \ell_\phi^*(1/2) - \ell_\phi^*(1/2) = 0$. (It is clear that the sub-optimality gap H ≥ 0 by
construction.)
2. We begin with the first statement of the theorem, inequality (13.0.2). Consider first
the gap (for a fixed margin s) in conditional 0-1 risk,
Now we use expression (13.0.3) to get an upper bound on R(f ) − R∗ via the φ-risk.
Indeed, consider the ψ-transform (13.0.1). By Jensen’s inequality, we have that
Now we use the special structure of the suboptimality function we have constructed.
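Note that ψ ≤ H, and moreover, we have for any s ∈ R that
\[
\mathbf{1}\{\mathrm{sign}(s) \ne \mathrm{sign}(2\eta - 1)\} H(|2\eta - 1|) = \mathbf{1}\{\mathrm{sign}(s) \ne \mathrm{sign}(2\eta - 1)\} \Big( \inf_{s(2\eta - 1) \le 0} \ell_\phi(s, \eta) - \ell_\phi^*(\eta) \Big)
\]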
= Rφ (f ) − Rφ∗ ,
satisfies $\ell_\phi'(0, \eta) = (2\eta - 1)\phi'(0)$, and if $\phi'(0) < 0$, this quantity is negative for η > 1/2.
Thus the minimizing s(η) ∈ (0, ∞]. (Proof by picture, but formalize in full notes.)
For the other direction assume that φ is classification calibrated. Recall the definition of
a subgradient gs of the function φ at s ∈ R is any gs such that φ(t) ≥ φ(s) + gs (t − s) for
all t ∈ R. (Picture.) Let g1 , g2 be such that `(s) ≥ `(0) + g1 s and `(s) ≥ `(0) + g2 s, which
exist by convexity. We show that both g1 , g2 < 0 and g1 = g2 . By convexity we have
and under the assumption that g_1 > g_2 we obtain $\ell_\phi^*(\eta) = \inf_{s \ge 0} \ell_\phi(s, \eta) > \ell_\phi^{\mathrm{wrong}}(\eta)$, which
is a contradiction to classification calibration. We thus obtain g_1 = g_2, so that the function
φ has a unique subderivative at s = 0 and is thus differentiable.
Now that we know φ is differentiable at 0, consider
If $\phi'(0) \ge 0$, then for s ≥ 0 and η > 1/2 we must have the right hand side is at least
φ(0), which contradicts classification calibration, because we know that $\ell_\phi^*(\eta) < \ell_\phi^{\mathrm{wrong}}(\eta)$
exactly as in the preceding argument.
a. Say it’s Fisher consistent (or infinite sample consistent) if $R_\varphi(f_n) \to R_\varphi^*$ implies $R(f_n) \to R^*$
b. Reduce to pointwise cases, compare non-uniform to uniform results (noting that in cases
where L is discrete, they are the same—requires a proof)
c. Basically, this is Question 13.4, except we will use finite Y I think (can still leave the super
general version in)
and
\[
f(x') \le \beta f(b) + (1 - \beta) f(x), \quad \text{or} \quad \frac{1}{1 - \beta} f(x') \le f(x) + \frac{\beta}{1 - \beta} f(b).
\]
Taking s, β → 0, we obtain
as desired.
assumption we have h(z/2) = b > 0, whence we have h(1) ≥ b > 0. In particular, the piecewise
linear function defined by
\[
g(t) = \begin{cases} 0 & \text{if } t \le z/2 \\ \frac{b}{1 - z/2} (t - z/2) & \text{if } t > z/2 \end{cases}
\]
is closed, convex, and satisfies g ≤ h. But $g(z) > 0 = h^{**}(z)$, a contradiction to the fact that $h^{**}$
is the largest (closed) convex function below h.
13.3 Exercises
Exercise 13.1: Find the suboptimality function Hφ and ψ-transform for the binary classification
problem with the following losses.
(a) Logistic loss. That is,
φ(s) = log(1 + e−s )
(b) Squared error (ordinary regression). The surrogate loss in this case for the pair (x, y) is $\frac{1}{2}(f(x) - y)^2$.
Show that for y ∈ {−1, 1}, this can be written as a margin-based loss, and compute the
associated suboptimality function Hφ and ψ-transform. Is the squared error classification
calibrated?
Exercise 13.2: Suppose we have a regression problem with data (independent variables) x ∈ X
and y ∈ R. We wish to find a predictor f : X → R minimizing the probability of being far away
from the true y, that is, for some c > 0, our loss is of the form
Show that no loss of the form $\varphi(s, y) = |s - y|^p$, where p ≥ 1, is Fisher consistent for the loss L,
even if the distribution of Y conditioned on X = x is symmetric about its mean E[Y | X]. That is,
show there exists a distribution on pairs X, Y such that the set of minimizers of the surrogate
Rϕ (f ) := E[ϕ(f (X), Y )]
is not included in the set of minimizers of the true risk, R(f ) = P(|Y − f (X)| ≥ c), even if the
distribution of Y (conditional on X) is symmetric.
Exercise 13.3 (Empirics of classification calibration): In this problem you will compare the
performance of hinge loss minimization and an ordinary linear regression in terms of classifica-
tion performance. Specifically, we compare the performance of the hinge surrogate loss with the
regression surrogate when the data is generated according to the model
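(ii) Set
\[
\hat{\theta}_{\mathrm{hinge}} = \operatorname*{argmin}_{\theta : \|\theta\|_2 \le R} \frac{1}{n} \sum_{i=1}^n [1 - y_i \langle x_i, \theta \rangle]_+
\]
and
\[
\hat{\theta}_{\mathrm{reg}} = \operatorname*{argmin}_{\theta} \frac{1}{2n} \sum_{i=1}^n (y_i - \langle x_i, \theta \rangle)^2 = \operatorname*{argmin}_{\theta} \|X\theta - y\|_2^2.
\]
(iii) Evaluate the 0-1 error rate of the vectors $\hat{\theta}_{\mathrm{hinge}}$ and $\hat{\theta}_{\mathrm{reg}}$ on the held-out data points $\{(x_i^{\mathrm{test}}, y_i^{\mathrm{test}})\}_{i=1}^n$.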
Perform the preceding steps (i)–(iii), using any n ≥ 100 and d ≥ 10 and a radius R = 5, for
different standard deviations σ = {0, 1, . . . , 10}; perform the experiment a number of times. Give
a plot or table exhibiting the performance of the classifiers learned on the held-out data. How do
the two compare? Given that for the hinge loss we know Hφ (δ) = δ (as presented in class), what
would you expect based on the answer to Question 13.1?
I have implemented (in the julia language; see http://julialang.org/) methods for solving
the hinge loss minimization problem with stochastic gradient descent so that you do not need to.
The file is available at this link. The code should (hopefully) be interpretable enough that if julia
is not your language of choice, you can re-implement the method in an alternative language.
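For readers who prefer Python, here is a minimal sketch of steps (ii) and (iii) using only the standard library. It is not the referenced julia code. The data-generating model in `make_data` (Gaussian features, labels given by the sign of a noisy linear score, ground truth `theta_star`) is a hypothetical stand-in, since the model for step (i) is not reproduced here; the step sizes, iteration counts, and all function names are likewise assumptions.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_data(n, d, sigma, theta_star, rng):
    """Hypothetical model (an assumption): x ~ N(0, I), y = sign(<x, theta*> + sigma * noise)."""
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        score = dot(x, theta_star) + sigma * rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(1.0 if score >= 0 else -1.0)
    return xs, ys


def project_l2_ball(theta, radius):
    norm = math.sqrt(dot(theta, theta))
    return theta if norm <= radius else [radius * t / norm for t in theta]


def fit_hinge(xs, ys, radius, iters=200):
    """Projected subgradient descent on the average hinge loss; returns the averaged iterate."""
    n, d = len(xs), len(xs[0])
    theta, avg = [0.0] * d, [0.0] * d
    for t in range(1, iters + 1):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            if y * dot(x, theta) < 1.0:  # hinge term is active
                for j in range(d):
                    grad[j] -= y * x[j] / n
        step = radius / math.sqrt(t)
        theta = project_l2_ball([th - step * g for th, g in zip(theta, grad)], radius)
        avg = [a + (th - a) / t for a, th in zip(avg, theta)]
    return avg


def fit_least_squares(xs, ys, iters=200, step=0.1):
    """Plain gradient descent on (1 / 2n) * sum (y_i - <x_i, theta>)^2."""
    n, d = len(xs), len(xs[0])
    theta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            r = dot(x, theta) - y
            for j in range(d):
                grad[j] += r * x[j] / n
        theta = [th - step * g for th, g in zip(theta, grad)]
    return theta


def error_rate(theta, xs, ys):
    """Fraction of points with y * <x, theta> <= 0."""
    return sum(1 for x, y in zip(xs, ys) if y * dot(x, theta) <= 0.0) / len(xs)


def run_once(sigma, n=200, d=10, radius=5.0, seed=0):
    rng = random.Random(seed)
    theta_star = [1.0] + [0.0] * (d - 1)  # hypothetical ground truth
    train_x, train_y = make_data(n, d, sigma, theta_star, rng)
    test_x, test_y = make_data(n, d, sigma, theta_star, rng)
    theta_hinge = fit_hinge(train_x, train_y, radius)
    theta_reg = fit_least_squares(train_x, train_y)
    return error_rate(theta_hinge, test_x, test_y), error_rate(theta_reg, test_x, test_y), theta_hinge


if __name__ == "__main__":
    for sigma in range(0, 11):
        err_hinge, err_reg, _ = run_once(float(sigma), seed=sigma)
        print(sigma, err_hinge, err_reg)
```

Repeating `run_once` over several seeds for each σ gives the table or plot the exercise asks for; the model in `make_data` should be replaced with the one specified in step (i).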
Exercise 13.4: In this question, we generalize our results on classification calibration and surro-
gate risk consistency to a much broader supervised learning setting. Consider the following general
supervised learning problem, where we assume that we have data in pairs (X, Y ) ∈ X × Y, where
X and Y are general spaces.
Let L : Rm × Y → R+ be a loss function we wish to minimize, so that the loss of a prediction
function f : X → Rm for the pair (x, y) is L(f (x), y). Let ϕ : Rm × Y → R be an arbitrary
surrogate, where ϕ(f (x), y) is the surrogate loss. Define the risk and ϕ-risk
Let PY denote the space of all probability distributions on Y, and define the conditional (pointwise)
risks ` : Rm × PY → R and `ϕ : Rm × PY → R by
\[
\ell(s, P) = \int_{\mathcal{Y}} L(s, y) p(y) \, dy \quad \text{and} \quad \ell_\varphi(s, P) = \int_{\mathcal{Y}} \varphi(s, y) p(y) \, dy.
\]
(Here for simplicity we simply write integration against dy; you may make this fully general if you
wish.) Let $\ell^*(P) = \inf_s \ell(s, P)$ denote the minimal conditional risk, and similarly for $\ell_\varphi^*(P)$, when
Y has distribution P . If Px denotes the distribution of Y conditioned on X = x, then we may
rewrite the risk functionals as
We will show that the same machinery we developed for classification calibration extends to this
general supervised learning setting.
For ϵ ≥ 0, define the suboptimality gap function
which measures the gap between achievable (pointwise) risk and the best surrogate risk when we
enforce that the true loss is not minimized. Also define the uniform suboptimality function
(Compare this with the definition of ∆ for the classification case to gain intuition.)
Prove that ∆ϕ(ϵ) > 0 for all ϵ > 0 implies if Rϕ(f_n) → R∗ϕ, then R(f_n) → R∗.
(b) We say that the loss ϕ is uniformly calibrated if ∆ϕ(ϵ) > 0 for all ϵ > 0. Show that, in the
margin-based binary classification case with loss φ : R → R, uniform calibration as defined
here is equivalent to classification-calibration as defined in class. You may assume that the
margin-based loss φ is continuous.
(c) A non-uniform result: assume that for all distributions P ∈ PY on the set Y, we have
∆ϕ(ϵ, P) > 0
if ϵ > 0. (We call this calibration.) Assume that there exists an upper bound function B : X →
R+ such that E[B(X)] < ∞ and ℓ(s, P_x) ≤ ℓ∗(P_x) + B(x) for all x and s ∈ R^m. For example, if
the loss L is bounded, this holds. Show that if the sequence of functions fn : X → Rm satisfies
Equivalently, show that for any distribution P on X × Y, for all ϵ > 0 there exists a δ > 0 such
that
Rϕ(f) ≤ R∗ϕ + δ implies R(f) ≤ R∗ + ϵ.
(You may ignore any measurability issues that come up.)
Chapter 14
2. For losses or risks, probably ` would be better and L for population loss (risk), but
not sure
4. Give proof of universal equivalence for the binary case, which is “easy” (at least,
easier...) because we can just use binary entropies of the form h(p) = inf α {p`(α) +
(1 − p)`(−α)}, choosing distributions in a fairly transparent way to get them. (Will
probably write this down in afternoon.)
New outline:
I. Generalized entropies
(a) Some similarity to the ideas in the Fenchel-Young losses paper, where given a vector s of
scores, we make predictions
(b) If loss is generalized entropy loss, then there is duality in that loss minimizers s give
calibrated p when Ω is strictly (or perhaps uniformly?) convex
(a) Multiclass case: any time we have a uniformly convex loss, we get consistency (or uniformly
convex entropy I guess)
(b) The discrete losses for structured prediction
V. Loss equivalence
(a)
c. Examples of 0-1 loss and its friends: have X ∈ X and Y ∈ {−1, 1}.
1. Example 14.0.1 (Binary classification with 0-1 loss): What is Bayes risk of binary
classifier? Let
\[
p_{+1}(x) = p(x \mid Y = 1) = \frac{P(Y = 1 \mid X = x) \, p(x)}{P(Y = 1)}
\]
be the density of X conditional on Y = 1 and similarly for p−1 (x), and assume that
each class occurs with probability 1/2. Then
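\begin{align*}
R^* & = \inf_\gamma \int \left[ \mathbf{1}\{\gamma(x) \le 0\} P(Y = 1 \mid X = x) + \mathbf{1}\{\gamma(x) \ge 0\} P(Y = -1 \mid X = x) \right] p(x) \, dx \\
& = \frac{1}{2} \inf_\gamma \int \left[ \mathbf{1}\{\gamma(x) \le 0\} p_{+1}(x) + \mathbf{1}\{\gamma(x) \ge 0\} p_{-1}(x) \right] dx = \frac{1}{2} \int \min\{p_{+1}(x), p_{-1}(x)\} \, dx.
\end{align*}
Similarly, we may compute the minimal prior risk, which is simply 1/2 by definition (14.0.2).
Looking at the gap between the two, we obtain
\[
R^*_{\mathrm{prior}} - R^* = \frac{1}{2} - \frac{1}{2} \int \min\{p_{+1}(x), p_{-1}(x)\} \, dx = \frac{1}{2} \int [p_1 - p_{-1}]_+ = \frac{1}{2} \|P_1 - P_{-1}\|_{\mathrm{TV}}.
\]
That is, the difference is half the variation distance between P_1 and P_{−1}, the distributions
of x conditional on the label Y. ◇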
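The identity in Example 14.0.1 is easy to verify for distributions on a finite set, where the integrals become sums. The following Python sketch does so for one pair of probability mass functions; the specific vectors and the function names are illustrative assumptions.

```python
def total_variation(p1, pm1):
    """sup_A |P1(A) - P-1(A)|, which for pmfs equals the sum of the positive parts of p1 - p-1."""
    return sum(max(a - b, 0.0) for a, b in zip(p1, pm1))


def bayes_risk_zero_one(p1, pm1):
    """R* = (1/2) * sum_x min{p1(x), p-1(x)} under equal class priors."""
    return 0.5 * sum(min(a, b) for a, b in zip(p1, pm1))


if __name__ == "__main__":
    p1 = [0.5, 0.3, 0.2]    # pmf of X given Y = +1 (illustrative)
    pm1 = [0.1, 0.3, 0.6]   # pmf of X given Y = -1 (illustrative)
    gap = 0.5 - bayes_risk_zero_one(p1, pm1)   # prior risk 1/2 minus Bayes risk
    print(gap, 0.5 * total_variation(p1, pm1))
```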
2. Example 14.0.2 (Binary classification with hinge loss): We now repeat precisely
the same calculations as in Example 14.0.1, but using as our loss the hinge loss
(recall Example 13.0.2). In this case, the minimal φ-risk is
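\begin{align*}
R_\phi^* & = \inf_\alpha \int \left[ [1 - \alpha]_+ P(Y = 1 \mid X = x) + [1 + \alpha]_+ P(Y = -1 \mid X = x) \right] p(x) \, dx \\
& = \frac{1}{2} \inf_\alpha \int \left[ [1 - \alpha]_+ p_1(x) + [1 + \alpha]_+ p_{-1}(x) \right] dx = \int \min\{p_1(x), p_{-1}(x)\} \, dx.
\end{align*}
We can similarly compute the prior risk as $R^*_{\phi,\mathrm{prior}} = 1$. Now, when we calculate
the improvement available via observing X = x, we find that
\[
R^*_{\phi,\mathrm{prior}} - R_\phi^* = 1 - \int \min\{p_1(x), p_{-1}(x)\} \, dx = \|P_1 - P_{-1}\|_{\mathrm{TV}},
\]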
that between the prior risk and the risk attainable after viewing x ∈ X .
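The hinge-loss calculation in Example 14.0.2 can be checked the same way on a finite set. The sketch below minimizes the conditional hinge risk over α on a grid at each point x, assuming equal class priors as in the example; the grid, the example vectors, and the function names are assumptions made for illustration.

```python
def total_variation(p1, pm1):
    """Sum of the positive parts of p1 - p-1, equal to sup_A |P1(A) - P-1(A)| for pmfs."""
    return sum(max(a - b, 0.0) for a, b in zip(p1, pm1))


def hinge_bayes_risk(p1, pm1, steps=2001):
    """sum_x inf_alpha (1/2) * ([1 - alpha]_+ p1(x) + [1 + alpha]_+ p-1(x)), alpha on a grid in [-2, 2]."""
    total = 0.0
    for a, b in zip(p1, pm1):
        best = float("inf")
        for i in range(steps):
            alpha = -2.0 + 4.0 * i / (steps - 1)
            val = 0.5 * (max(0.0, 1.0 - alpha) * a + max(0.0, 1.0 + alpha) * b)
            best = min(best, val)
        total += best
    return total


def hinge_prior_risk(steps=2001):
    """inf_alpha (1/2) [1 - alpha]_+ + (1/2) [1 + alpha]_+ on the same grid."""
    return min(
        0.5 * max(0.0, 1.0 - a) + 0.5 * max(0.0, 1.0 + a)
        for a in (-2.0 + 4.0 * i / (steps - 1) for i in range(steps))
    )


if __name__ == "__main__":
    p1 = [0.5, 0.3, 0.2]
    pm1 = [0.1, 0.3, 0.6]
    print(hinge_prior_risk() - hinge_bayes_risk(p1, pm1), total_variation(p1, pm1))
```

For these inputs the hinge gap equals the variation distance itself, twice the gap obtained for the 0-1 loss in Example 14.0.1.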
2. Didn’t present this. True definition of statistical information: suppose class 1 has
prior probability π and class −1 has prior 1 − π, and let P1 and P−1 be the distributions
of X ∈ X given Y = 1 and Y = −1, respectively. The Bayes risk associated with the
problem is then
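\begin{align*}
B_\pi(P_1, P_{-1}) & := \inf_\gamma \int \left[ \mathbf{1}\{\gamma(x) \le 0\} p_1(x) \pi + \mathbf{1}\{\gamma(x) \ge 0\} p_{-1}(x) (1 - \pi) \right] dx \tag{14.0.6} \\
& = \int p_1(x) \pi \wedge p_{-1}(x) (1 - \pi) \, dx
\end{align*}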
where we have used Bayes rule as in (14.0.9). Let us now divide all appearances of the
density p1 by p−1 , which yields
By inspection, the representation (14.0.11) gives the result of the theorem if we can argue
that the function fπ is convex, where we substitute p1 (x)/p−1 (x) for t in fπ (t).
To see that the function fπ is convex, consider the intermediate function
a. Do these equivalences mean anything? What about the fact that the suboptimality function
Hφ was linear for the hinge loss?
b. Consider problems with quantization: we must jointly learn a classifier (prediction or dis-
criminant function) γ and a quantizer q : X → {1, . . . , k}, where k is fixed and we wish
to find an optimal quantizer q ∈ Q, where Q is some family of quantizers. Recall the
notation (2.2.1) of quantization of f -divergence, so
\[
D_f(P_0 \| P_1 \mid q) = \sum_{i=1}^k P_1(q^{-1}(i)) \, f\!\left( \frac{P_0(q^{-1}(i))}{P_1(q^{-1}(i))} \right) = \sum_{i=1}^k P_1(A_i) \, f\!\left( \frac{P_0(A_i)}{P_1(A_i)} \right)
\]
Rφ (γ | q) = E[φ(Y γ(q(X)))]
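To make the quantized divergence concrete, here is a small Python sketch for distributions on a finite set. It pushes each distribution through a quantizer and evaluates the f-divergence of the induced distributions on the cells. The quantizer is 0-indexed here (the notes index cells by 1, . . . , k), and the choice f(t) = t log t, the example vectors, and the function names are assumptions made for illustration.

```python
import math


def f_divergence(p0, p1, f):
    """D_f(P0 || P1) = sum_x p1(x) f(p0(x) / p1(x)), assuming p1(x) > 0 for every x."""
    return sum(b * f(a / b) for a, b in zip(p0, p1))


def quantize(p, q, k):
    """Push the pmf p on {0, ..., len(p) - 1} through the quantizer q, giving a pmf on {0, ..., k - 1}."""
    out = [0.0] * k
    for x, mass in enumerate(p):
        out[q(x)] += mass
    return out


def quantized_f_divergence(p0, p1, f, q, k):
    """D_f(P0 || P1 | q): the f-divergence between the quantized distributions."""
    return f_divergence(quantize(p0, q, k), quantize(p1, q, k), f)


def kl_generator(t):
    """f(t) = t log t, with f(0) = 0, which generates the KL divergence."""
    return t * math.log(t) if t > 0 else 0.0


if __name__ == "__main__":
    p0 = [0.4, 0.1, 0.3, 0.2]
    p1 = [0.25, 0.25, 0.25, 0.25]
    pair_cells = lambda x: x // 2  # merges {0, 1} and {2, 3}
    print(f_divergence(p0, p1, kl_generator))
    print(quantized_f_divergence(p0, p1, kl_generator, pair_cells, 2))
```

In this example the quantizer merges cells in a way that hides all of the difference between the two distributions, so the quantized divergence is zero while the unquantized one is positive; for convex f, quantization never increases the divergence.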
Proof The proof follows straightforwardly via the representation (14.0.12). If φ1 and φ2
are equivalent, then we have that
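\begin{align*}
R^*_{\phi_1, \mathrm{prior}} - R^*_{\phi_1}(q) & = D_{f_{\pi, \phi_1}}(P_{-1} \| P_1 \mid q) = c \, D_{f_{\pi, \phi_2}}(P_{-1} \| P_1 \mid q) + a + b \\
& = c \left[ R^*_{\phi_2, \mathrm{prior}} - R^*_{\phi_2}(q) \right] + a + b
\end{align*}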
e. Some comments:
1. We have an interesting thing: if we wish to learn a quantizer and a classifier jointly,
then this is possible by using any loss equivalent to the true loss we care about.
2. Example: hinge loss and 0-1 loss are equivalent.
3. Turns out that the condition that the losses φ1 and φ2 be equivalent is (essentially)
necessary and sufficient for two quantizers to induce the same ordering [144]. This
equivalence is necessary and sufficient for the ordering conclusion of Theorem 14.0.4.
we see that
\[
H_L(P) = \min_{y \in \mathcal{Y}} \tau(y)^\top A \mu(P) = \inf_{\nu \in \mathcal{M}} \nu^\top A \mu(P).
\]
Notably, HL is concave in p, as it is the infimum of linear functionals, and with some abuse of
notation, we can also define the negative entropy mapping
which evidently satisfies Ω(ν) = HL (P ) for any ν ∈ M satisfying ν = µ(P ) and is convex. The
associated surrogate loss is
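\[
\varphi(s, y) := -s^\top \tau(y) + \Omega^*(s), \tag{14.2.1}
\]
and we have
\[
\mathbb{E}_p[\varphi(s, Y)] = -s^\top \mu(p) + \Omega^*(s),
\]
so that
\[
\inf_s \mathbb{E}_p[\varphi(s, Y)] = \inf_s \left\{ -s^\top \mu(p) + \Omega^*(s) \right\} = -\sup_s \left\{ s^\top \mu(p) - \Omega^*(s) \right\} = -\Omega(\mu(p)) = H_L(p).
\]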
As s minimizes ℓ, we have $\Omega(\mu) - s^\top \mu + \Omega^*(s) = 0$, and thus (using some duality results in the
appendices) we necessarily have
\[
s \in \partial \Omega(\mu).
\]
As $\Omega(\mu) = \max_{y \in \mathcal{Y}} -\tau(y)^\top A \mu + I_{\mathcal{M}}(\mu)$, we see that
where we recall that µ = µ(P ). If we make the (unrealistically) simplifying assumption that µ is
interior to M (say, for example, if P assigns positive probability to all y ∈ Y), i.e. µ ∈ int M, then
$N_{\mathcal{M}}(\mu) = \{0\}$. If we also assume there is only a single label $y^\star \in \mathcal{Y}$ minimizing $\tau(y^\star)^\top A \mu$, that is,
a single best prediction for the probabilities P on Y, then we obtain that $s = -A^\top \tau(y^\star)$. Then of
course the prediction function becomes
\[
\mathrm{pred}(s) = \operatorname*{argmax}_{y \in \mathcal{Y}} \left\{ -\tau(y)^\top A^\top \tau(y^\star) \right\} = y^\star
\]
Theorem 14.4.1. The surrogate ϕ is consistent for the discrete structured prediction loss L.
In this case, we may simplify the quantities by writing out the entropy functionals explicitly as
$\ell(s, P) = \tau(\hat{y}(s))^\top A \mu - \inf_{\nu \in \mathcal{M}} \nu^\top A \mu$, where $\hat{y}(s)$ is (an arbitrary) element of the prediction set
$\operatorname*{argmax}_y s^\top \tau(y)$. We need only show that ∆ϕ(ϵ, P) > 0 whenever ϵ > 0, which we prove by
contradiction.
Thus, assume for the sake of contradiction that ∆ϕ(ϵ, P) = 0. As the losses ϕ are piecewise
linear and the set of s such that $\ell(s, P) - \ell^\star(P) \ge \epsilon$ is a union of polyhedra, there must be s
achieving the infimum, and so for some vector of scores s, we have
while $\hat{y}(s)$ is incorrect. Following the calculation (14.4.1), we thus obtain that for some $\nu^\star \in \mathcal{M}$
satisfying $\langle \nu^\star, A\mu \rangle = \min_y \tau(y)^\top A \mu$ and a vector $w \in N_{\mathcal{M}}(\mu)$, we have
\[
s = -A^\top \nu^\star + w.
\]
For any µ ∈ M, define the shorthand $y^\star(\mu) = \operatorname*{argmin}_y \tau(y)^\top A \mu$, which is a set-valued mapping,
and let $y^\star(P) = y^\star(\mu(P))$ when there is no chance of notational confusion. If we can show the
inclusions
\[
\hat{y}(s) \subset y^\star(\nu^\star) \subset y^\star(\mu(P)), \tag{14.4.2}
\]
then the proof is complete, as we would evidently have our desired contradiction because necessarily
$\tau(\hat{y})^\top A \mu = \min_y \tau(y)^\top A \mu$ for any $\hat{y} \in \hat{y}(s)$.
Seeing the inclusion $y^\star(\nu^\star) \subset y^\star(\mu(P))$ is relatively straightforward. Let
\[
\nu' \in \mathrm{Conv}\left\{ \tau(y) \mid \tau(y)^\top A \mu(P) = \min_{y'} \mathbb{E}_P[L(y', Y)] = H_L(P) \right\} \tag{14.4.3}
\]
be otherwise arbitrary. For all P′ such that ν′ = µ(P′), the identifiability assumption 14.1 guarantees
that if $y \in y^\star(\nu')$, we must have P′(Y = y) > 0. That is, we have
\[
y^\star(\nu') \subset \bigcap_{P'} \left\{ \mathrm{supp}\, P' \mid \nu' = \mu(P') \right\}.
\]
In particular, Assumption 14.1 guarantees there is at least one P′ satisfying $\mathrm{supp}\, P' \subset y^\star(\mu(P))$
and ν′ = µ(P′), so that $y^\star(\nu') \subset y^\star(\mu(P))$ for all ν′ in the convex hull (14.4.3), and in particular
for $\nu^\star$.
The first inclusion in the chain (14.4.2) is more challenging. We begin a convex analytic result
that allows us to simplify maximizers of s> τ (y) in y.
Lemma 14.4.2. Let $w \in N_{\mathcal{M}}(\mu)$ be the element satisfying $s = -A^\top \nu^\star + w$. Then for any y ∈ Y
and any z ∈ supp P,
\[
\langle \tau(z) - \tau(y), w \rangle \ge 0.
\]
Proof Fix any y ∈ Y and let z ∈ supp P, so that p_z > 0. Then for a vector $\alpha \in \mathrm{Conv}(\tau(y') \mid y' \notin \{y, z\})$,
we can write $\mu(P) = \lambda_y \tau(y) + \lambda_z \tau(z) + (1 - \lambda_y - \lambda_z) \alpha$, where λ_y ≥ 0, λ_z ≥ p_z > 0,
and λ_y + λ_z ≤ 1. The vector $\nu = (\lambda_y + \lambda_z) \tau(y) + (1 - \lambda_y - \lambda_z) \alpha$ similarly satisfies ν ∈ M. By
the definition of the normal cone $N_{\mathcal{M}}(\mu)$, we know that $w^\top (\mu' - \mu) \le 0$ for all µ′ ∈ M, and in
particular this holds for µ′ = ν. As
\[
\nu - \mu = \lambda_z (\tau(y) - \tau(z)),
\]
we obtain
\[
\lambda_z w^\top (\tau(y) - \tau(z)) \le 0,
\]
and as λ_z > 0 the lemma follows.
With Lemma 14.4.2 in hand, we can consider the predictions $\mathrm{pred}(s) = \operatorname*{argmax}_y s^\top \tau(y)$. As
$s = -A^\top \nu^\star + w$, we have
\[
\hat{y}(s) = \operatorname*{argmax}_{y \in \mathcal{Y}} \left\{ -\tau(y)^\top A^\top \nu^\star + \tau(y)^\top w \right\} = \operatorname*{argmax}_{y \in \mathcal{Y}} \left\{ -\tau(y)^\top A \nu^\star + \tau(y)^\top w \right\},
\]
The associated conditional entropy of Y given X = x is $H_\ell(Y \mid X = x) = \inf_s \sum_{y=1}^k P(Y = y \mid X = x) \ell_y(s)$,
and the conditional entropy of Y given X is
\[
H_\ell(Y \mid X) := \int_{\mathcal{X}} H_\ell(Y \mid X = x) \, dP(x).
\]
\[
I_\ell(X; Y) := H_\ell(Y) - H_\ell(Y \mid X),
\]
so that Hφ is really a concave function on p ∈ [0, 1] with Hφ (Y ) = hφ (P (Y = 1)), where the binary
generalized entropy is
\[
h_\phi(p) := \inf_{s \in \mathbb{R}} \left\{ p \phi(-s) + (1 - p) \phi(s) \right\}.
\]
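The binary generalized entropy can be evaluated numerically for specific margin losses. The Python sketch below minimizes the convex objective pφ(−s) + (1 − p)φ(s) by ternary search and compares the result with closed forms that follow from direct calculation: 2 min{p, 1 − p} for the hinge loss, 2√(p(1 − p)) for the exponential loss, and the Shannon entropy in nats for the logistic loss. The search interval [−30, 30], the iteration count, and the function names are assumptions, and the search is only meaningful for p strictly between 0 and 1.

```python
import math


def hinge(s):
    return max(0.0, 1.0 - s)


def exponential(s):
    return math.exp(-s)


def logistic(s):
    # numerically stable log(1 + exp(-s))
    return max(-s, 0.0) + math.log1p(math.exp(-abs(s)))


def binary_generalized_entropy(phi, p, lo=-30.0, hi=30.0, iters=200):
    """Ternary search for inf_s { p * phi(-s) + (1 - p) * phi(s) }, with phi convex and 0 < p < 1."""
    def objective(s):
        return p * phi(-s) + (1.0 - p) * phi(s)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return objective(0.5 * (lo + hi))


if __name__ == "__main__":
    p = 0.3
    print(binary_generalized_entropy(hinge, p), 2.0 * min(p, 1.0 - p))
    print(binary_generalized_entropy(exponential, p), 2.0 * math.sqrt(p * (1.0 - p)))
    print(binary_generalized_entropy(logistic, p),
          -p * math.log(p) - (1.0 - p) * math.log(1.0 - p))
```

Each of these entropies is concave in p and symmetric about p = 1/2, matching the description of h_φ as a concave function on [0, 1].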
Definition 14.2. Losses `1 and `2 are universally equivalent if for all distributions on (X, Y ) and
all quantizers q1 and q2 ,
I`1 (q1 (X); Y ) ≤ I`1 (q2 (X); Y ) if and only if I`2 (q1 (X); Y ) ≤ I`2 (q2 (X); Y ).
We note in passing that, swapping the roles of q1 and q2 and taking contrapositives, an equivalent
formulation to Definition 14.2 is that
I`1 (q1 (X); Y ) < I`1 (q2 (X); Y ) if and only if I`2 (q1 (X); Y ) < I`2 (q2 (X); Y ).
Theorem 14.5.1. Let the multiclass losses $\ell_1$ and $\ell_2$ be bounded below and $H_1$ and $H_2$ be the
associated generalized entropies. Then $\ell_1$ and $\ell_2$ are universally equivalent if and only if there exist
$a > 0$, $b \in \mathbb{R}^k$, and $c \in \mathbb{R}$ such that for all distributions on $Y \in [k]$,
By inspection, each $h_i$ is a closed concave function (as it is the infimum of linear functions of $p$),
and by symmetry they satisfy $h_i(0) = h_i(1) = 0$ and $h_i(\tfrac{1}{2}) = \sup_{p \in [0,1]} h_i(p)$. We show that these
entropies satisfy a particular order equivalence property on [0, 1], which will turn out to be sufficient
to prove their equality.
To motivate what follows, recall that universal equivalence (Def. 14.2) must hold for all dis-
tributions on (X, Y ), and hence all (measurable) spaces X and joint distributions on X × {±1}.
Thus, consider a space X that we can partition into sets {A, Ac } or {B, B c }, where we take the
conditional distributions
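\[ Y \mid X \in A = \begin{cases} 1 & \text{w.p. } p_a \\ -1 & \text{w.p. } 1 - p_a, \end{cases} \qquad Y \mid X \in A^c = \begin{cases} 1 & \text{w.p. } q_a \\ -1 & \text{w.p. } 1 - q_a, \end{cases} \]
and
\[ Y \mid X \in B = \begin{cases} 1 & \text{w.p. } p_b \\ -1 & \text{w.p. } 1 - p_b, \end{cases} \qquad Y \mid X \in B^c = \begin{cases} 1 & \text{w.p. } q_b \\ -1 & \text{w.p. } 1 - q_b, \end{cases} \]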
where we require the consistency conditions that the marginals over Y remain constant, that is, if
P (A) = P (X ∈ A), we have
Then evidently by defining quantizers $q_1$ and $q_2$ such that $q_1(x) = 1\{x \in A\}$ and $q_2(x) = 1\{x \in B\}$, we have
\[ P(A)h_1(p_a) + (1 - P(A))h_1(q_a) \le P(B)h_1(p_b) + (1 - P(B))h_1(q_b) \]
if and only if
\[ P(A)h_2(p_a) + (1 - P(A))h_2(q_a) \le P(B)h_2(p_b) + (1 - P(B))h_2(q_b). \]
\[ h_1(p_a) + h_1(q_a) \le h_1(p_b) + h_1(q_b) \quad \text{if and only if} \quad h_2(p_a) + h_2(q_a) \le h_2(p_b) + h_2(q_b) \]
whenever $p_a + q_a = p_b + q_b$.
Generalizing this construction by taking distributions over $\mathcal{X}$ that partition it into $k$ equiprobable sets $\{A_1, \ldots, A_k\}$ or $\{B_1, \ldots, B_k\}$, each with $P(A_i) = P(B_i) = 1/k$, we see that universal
equivalence implies that for any vectors $p \in [0, 1]^k$ and $q \in [0, 1]^k$ satisfying $\mathbf{1}^\top p = \mathbf{1}^\top q$,
\[ \sum_{i=1}^k h_1(p_i) \le \sum_{i=1}^k h_1(q_i) \quad \text{if and only if} \quad \sum_{i=1}^k h_2(p_i) \le \sum_{i=1}^k h_2(q_i). \tag{14.6.1} \]
We shall give condition (14.6.1) a name, as it implies certain equivalence properties for convex
functions (we can replace $h_i$ with $-h_i$ and obtain convex functions).
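Definition 14.3. Let $\Omega \subset \mathbb{R}$ be a closed interval and let $f_1, f_2 : \Omega \to \mathbb{R}$ be closed convex functions.
Then $f_1$ and $f_2$ are order equivalent if for any $k \in \mathbb{N}$ and vectors $s \in \Omega^k$ and $t \in \Omega^k$ satisfying
$\mathbf{1}^\top s = \mathbf{1}^\top t$, we have
\[ \sum_{i=1}^k f_1(s_i) \le \sum_{i=1}^k f_1(t_i) \quad \text{if and only if} \quad \sum_{i=1}^k f_2(s_i) \le \sum_{i=1}^k f_2(t_i). \]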
As in the brief remark following Definition 14.2, by taking complements we have as well that
\[ \sum_{i=1}^k f_1(s_i) < \sum_{i=1}^k f_1(t_i) \quad \text{if and only if} \quad \sum_{i=1}^k f_2(s_i) < \sum_{i=1}^k f_2(t_i). \]
The theorem will then be proved if we can show the following lemma.
Lemma 14.6.1. Let f1 and f2 be order equivalent on Ω. Then there exist a > 0, and b, c ∈ R such
that f1 (t) = af2 (t) + bt + c for all t ∈ Ω.
The proof of Lemma 14.6.1 is somewhat involved, and we proceed in three parts. The key is
that order equivalence actually implies a strong relationship between affine combinations of points
in the domain of the functions $f_i$, not just convex combinations of points, which guarantees that
we can predict values of $f_2(v)$ for any $v \in \Omega$ from just three values of $f_i$ evaluated in $\Omega$. We state this
as a lemma, whose proof we defer temporarily to Sec. 14.6.1.
Lemma 14.6.2. If $f_1$ and $f_2$ are order equivalent on $\Omega$, then for any $\lambda \in \mathbb{R}^k$ satisfying $\lambda^\top \mathbf{1} = 1$
and any $u \in \Omega^k$, if $\lambda^\top u = v \in \Omega$ then
\[ \sum_{i=1}^k \lambda_i f_1(u_i) \le f_1(v) \quad \text{if and only if} \quad \sum_{i=1}^k \lambda_i f_2(u_i) \le f_2(v), \]
and the statement still holds with both inequalities replaced with strict inequalities.
In particular, if
\[ \sum_{i=1}^k \lambda_i f_1(u_i) = f_1(v) \quad \text{then necessarily} \quad \sum_{i=1}^k \lambda_i f_2(u_i) = f_2(v). \tag{14.6.2} \]
Lemma 14.6.3. Let $f$ be convex on $\mathbb{R}$. Let $u_0 < u_1$ and for $\lambda \in [0, 1]$, define $u_\lambda = (1 - \lambda)u_0 + \lambda u_1$.
If there exists any $\lambda \in (0, 1)$ such that $f(u_\lambda) = (1 - \lambda)f(u_0) + \lambda f(u_1)$, then $f$ is linear on $[u_0, u_1]$.
We leave the proof (an algebraic manipulation using the definitions of convexity) as Question 14.2.
The last intermediate step we require in the proof of Lemma 14.6.1 is that at three particular
points in the domain Ω, we can satisfy Lemma 14.6.1.
Lemma 14.6.4. Let $f_1, f_2$ be order equivalent on $\Omega = [u_0, u_1]$ and $u_c = \tfrac{1}{2}(u_0 + u_1)$. There are
$a > 0$ and $b, c \in \mathbb{R}$ such that $f_1(t) = af_2(t) + bt + c$ for $t \in \{u_0, u_c, u_1\}$.
find such a $\lambda$, then equality (14.6.2) guarantees that $f_1(v) = f_2(v)$, and we are done. As the points
$(u_i, f_1(u_i))_{i=1}^3$ are not collinear (recall Lemma 14.6.3), the matrix
\[ A := \begin{bmatrix} 1 & 1 & 1 \\ u_0 & u_c & u_1 \\ f_1(u_0) & f_1(u_c) & f_1(u_1) \end{bmatrix} \]
which evidently satisfies our desiderata. Thus f1 (v) = f2 (v), and as v was arbitrary, the proof is
complete.
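where $\mathbf{1}^\top a = k + \mathbf{1}^\top b$, as $\mathbf{1}^\top \lambda = \tfrac{1}{k} \mathbf{1}^\top (a - b) = 1$. Then we may define the two vectors
Then each has entries in $\Omega$, and we have $\mathbf{1}^\top t = \mathbf{1}^\top s$. Then order equivalence (Def. 14.3) guarantees
that
\[ \sum_{i=1}^m a_i f_1(u_i) \le k f_1(v) + \sum_{i=1}^m b_i f_1(u_i) \quad \text{if and only if} \quad \sum_{i=1}^m a_i f_2(u_i) \le k f_2(v) + \sum_{i=1}^m b_i f_2(u_i) \]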
From the first we obtain $c = f_1(0) - af_2(0)$, and substituting this into the third yields $b = f_1(1) - f_1(0) - a(f_2(1) - f_2(0))$. Finally, substituting both equalities into the equality with $f_1(\tfrac{1}{2})$ yields
that
\begin{align*}
f_1\left(\tfrac{1}{2}\right) &= a\left( f_2\left(\tfrac{1}{2}\right) - \frac{f_2(1) - f_2(0)}{2} \right) + \frac{f_1(1) - f_1(0)}{2} + f_1(0) - af_2(0) \\
&= a\left( f_2\left(\tfrac{1}{2}\right) - \frac{f_2(1) + f_2(0)}{2} \right) + \frac{f_1(1) + f_1(0)}{2}.
\end{align*}
As we know that $f_1, f_2$ are nonlinear, Lemma 14.6.3 applies, so that the convexity gaps $f_1(\tfrac{1}{2}) - \frac{f_1(1) + f_1(0)}{2}$ and $f_2(\tfrac{1}{2}) - \frac{f_2(1) + f_2(0)}{2}$ are both positive, and thus we take
\[ a = \frac{f_1(\tfrac{1}{2}) - \frac{f_1(0) + f_1(1)}{2}}{f_2(\tfrac{1}{2}) - \frac{f_2(0) + f_2(1)}{2}} > 0. \]
14.7 Bibliography
Point to full proof of Theorem 14.5.1.
14.8 Exercises
Exercise 14.1 (Bayes risk gaps): Consider a general binary classification problem with $(X, Y) \in \mathcal{X} \times \{-1, 1\}$. Let $\phi(\alpha) = \log(1 + e^{-\alpha})$, so that we use the logistic loss. Show that the surrogate
risk gap satisfies
\[ L^*_{\phi,\mathrm{prior}} - L^*_\phi = I(X; Y), \]
where $I$ is the mutual information.
Exercise 14.2: Prove Lemma 14.6.3. Hint: without loss of generality, you may take u0 = 0,
u1 = 1. Then for any u ∈ [0, 1], write λ as a convex combination of either {0, u} or {u, 1} and use
the definition of convexity.
Chapter 15
Fisher Information
Having explored the definitions associated with exponential families and their robustness properties,
we now turn to a study of somewhat more general parameterized distributions, developing connec-
tions between divergence measures and other geometric ideas such as the Fisher information. After
this, we illustrate a few consequences of Fisher information for optimal estimators, which gives a
small taste of the deep connections between information geometry, Fisher information, and exponential
family models. In the coming chapters, we show how Fisher information measures come to play a
central role in sequential (universal) prediction problems.
where the score function $\dot{\ell}_\theta = \nabla_\theta \log p_\theta(x)$ is the gradient of the log likelihood at $\theta$ (implicitly
depending on $X$) and the expectation $E_\theta$ denotes expectation taken with respect to $P_\theta$. Intuitively,
the Fisher information captures the variability of the gradient $\nabla \log p_\theta$; in a family of distributions
for which the score function $\dot{\ell}_\theta$ has high variability, we intuitively expect estimation of the parameter
$\theta$ to be easier (different $\theta$ change the behavior of $\dot{\ell}_\theta$), though the log-likelihood functional $\theta \mapsto E_{\theta_0}[\log p_\theta(X)]$ varies more in $\theta$.
Under suitable smoothness conditions on the densities pθ (roughly, that derivatives pass through
expectations; see Remark 15.1 at the end of this chapter), there are a variety of alternate definitions
of Fisher information. These smoothness conditions hold for exponential families, so at least in
the exponential family case, everything in this chapter is rigorous. (We note in passing that there
are more general definitions of Fisher information for more general families under quadratic mean
differentiability; see, for example, van der Vaart [169].) First, we note that the score function has
where in equality ($\star$) we have assumed that integration and differentiation may be exchanged. Under
similar conditions, we thus obtain an alternate definition of Fisher information as the negative
expected Hessian of $\log p_\theta(X)$. Indeed,
\[ = -E[\nabla^2 \log p_\theta(X)] + \nabla^2 \underbrace{\int p_\theta(x)\,dx}_{=1} = -E[\nabla^2 \log p_\theta(X)]. \tag{15.1.2} \]
This representation also makes clear the additional fact that, if we have $n$ i.i.d. observations from the
model $P_\theta$, then the information content similarly grows linearly, as $\log p_\theta(X_1^n) = \sum_{i=1}^n \log p_\theta(X_i)$.
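As a small numerical illustration of the two representations of Fisher information and of its linear growth in $n$, the following sketch works with the Bernoulli($p$) family parameterized by $p$. The function names are hypothetical and chosen only for this illustration; the computation enumerates outcomes exactly rather than simulating.

```python
import math


def score(x, p):
    """Score d/dp log p_p(x) of one Bernoulli(p) observation x in {0, 1}."""
    return 1.0 / p if x == 1 else -1.0 / (1.0 - p)


def hessian(x, p):
    """Second derivative d^2/dp^2 log p_p(x) of one Bernoulli(p) observation."""
    return -1.0 / p ** 2 if x == 1 else -1.0 / (1.0 - p) ** 2


def fisher_from_score(p):
    """E_p[score^2], computed by summing over the two outcomes."""
    return sum(prob * score(x, p) ** 2 for x, prob in ((1, p), (0, 1.0 - p)))


def fisher_from_hessian(p):
    """-E_p[hessian], the negative expected Hessian representation."""
    return -sum(prob * hessian(x, p) for x, prob in ((1, p), (0, 1.0 - p)))


def fisher_n_samples(p, n):
    """E[(sum of n i.i.d. scores)^2], enumerating the number s of ones."""
    total = 0.0
    for s in range(n + 1):
        prob = math.comb(n, s) * p ** s * (1.0 - p) ** (n - s)
        total_score = s * score(1, p) + (n - s) * score(0, p)
        total += prob * total_score ** 2
    return total


if __name__ == "__main__":
    p, n = 0.3, 5
    print(fisher_from_score(p), fisher_from_hessian(p), 1.0 / (p * (1.0 - p)))
    print(fisher_n_samples(p, n), n * fisher_from_score(p))
```

Both single-observation quantities agree with $1/(p(1-p))$, and the $n$-sample quantity is $n$ times larger, matching the additivity of the log likelihood.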
We now give two examples of Fisher information, the first somewhat abstract and the second
more concrete.
Iθ = ∇2 A(θ).
where Ip is the Fisher information for the single observation Bernoulli(p) family as in Example 15.1.2.
In fact, this inverse dependence on Fisher information is unavoidable, as made clear by the Cramér
Rao Bound, which provides lower bounds on the mean squared error of all unbiased estimators.
Proposition 15.2.1 (Cramér Rao Bound). Let φ : Rd → R be an arbitrary differentiable function
and assume that the random function (estimator) T is unbiased for φ(θ) under Pθ . Then
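As an immediate corollary to Proposition 15.2.1, we may take $\phi(\theta) = \langle \lambda, \theta \rangle$ for $\lambda \in \mathbb{R}^d$. Then,
varying $\lambda$ over all of $\mathbb{R}^d$, we obtain that for any unbiased estimator $T$ for the parameter $\theta \in \mathbb{R}^d$,
we have $\mathrm{Var}(\langle \lambda, T \rangle) \ge \lambda^\top I_\theta^{-1} \lambda$. That is, we have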
Corollary 15.2.2. Let T be unbiased for the parameter θ under the distribution Pθ . Then the
covariance of T has lower bound
\[ \mathrm{Cov}(T) \succeq I_\theta^{-1}. \]
In fact, the Cramér-Rao bound and Corollary 15.2.2 hold, in an asymptotic sense, for substantially
more general settings (without the unbiasedness requirement). For example, see the books of
van der Vaart [169] or Le Cam and Yang [128, Chapters 6 & 7], which show that under appropriate
conditions (known variously as quadratic mean differentiability and local asymptotic normality),
no estimator can have smaller mean squared error than the inverse Fisher information in any uniform
sense.
We now prove the proposition, where, as usual, we assume that it is possible to exchange
differentiation and integration.
Proof Throughout this proof, all expectations and variances are computed with respect to Pθ .
The idea of the proof is to choose λ ∈ Rd to minimize the variance
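where in the final step we used that $T$ is unbiased for $\phi(\theta)$. Using the preceding equality,
\[ \mathrm{Var}(T - \langle \lambda, \dot{\ell}_\theta \rangle) = \mathrm{Var}(T) + \lambda^\top I_\theta \lambda - 2E[(T - \phi(\theta))\langle \lambda, \dot{\ell}_\theta \rangle] = \mathrm{Var}(T) + \lambda^\top I_\theta \lambda - 2\langle \lambda, \nabla\phi(\theta) \rangle. \]
Taking $\lambda = I_\theta^{-1} \nabla\phi(\theta)$ gives $0 \le \mathrm{Var}(T - \langle \lambda, \dot{\ell}_\theta \rangle) = \mathrm{Var}(T) - \nabla\phi(\theta)^\top I_\theta^{-1} \nabla\phi(\theta)$, and rearranging
gives the result.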
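To see the bound in a concrete case, the sketch below (an illustration under the assumption of $n$ i.i.d. Bernoulli($p$) observations, with hypothetical function names) computes the exact variance of the sample mean, which is unbiased for $p$, and compares it with the Cramér-Rao lower bound $1/(n I_p) = p(1-p)/n$. In this family the sample mean attains the bound.

```python
import math


def sample_mean_variance(p, n):
    """Exact variance of the mean of n i.i.d. Bernoulli(p) draws, via the binomial pmf."""
    first = 0.0
    second = 0.0
    for s in range(n + 1):
        prob = math.comb(n, s) * p ** s * (1.0 - p) ** (n - s)
        first += prob * (s / n)
        second += prob * (s / n) ** 2
    return second - first ** 2


def cramer_rao_bound(p, n):
    """Inverse Fisher information of n i.i.d. Bernoulli(p) draws: p(1-p)/n."""
    return p * (1.0 - p) / n


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        print(p, sample_mean_variance(p, 20), cramer_rao_bound(p, 20))
```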
That is, the divergence is simply the difference between A(θ2 ) and its first order expansion
around θ1 . This suggests that we may approximate the KL-divergence via the quadratic re-
mainder in the first order expansion. Indeed, as A is infinitely differentiable (it is an exponential
family model), the Taylor expansion becomes
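\begin{align*}
D_{\mathrm{kl}}(P_{\theta_1} \| P_{\theta_2}) &= \frac{1}{2} \langle \theta_1 - \theta_2, \nabla^2 A(\theta_1)(\theta_1 - \theta_2) \rangle + O(\|\theta_1 - \theta_2\|^3) \\
&= \frac{1}{2} \langle \theta_1 - \theta_2, I_{\theta_1}(\theta_1 - \theta_2) \rangle + O(\|\theta_1 - \theta_2\|^3).
\end{align*}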
3
In particular, KL-divergence is roughly quadratic for exponential family models, where the
quadratic form is given by the Fisher information matrix. We also remark in passing that for a
convex function f , the Bregman divergence (associated with f ) between points x and y is given
by Df (x, y) = f (x) − f (y) − h∇f (y), x − yi; such divergences are common in convex analysis,
optimization, and differential geometry. Making such connections deeper and more rigorous is the
goal of the field of information geometry (see the book of Amari and Nagaoka [5] for more).
We can generalize this example substantially under appropriate smoothness conditions. Indeed,
we have
Proposition 15.3.2. For appropriately smooth families of distributions {Pθ }θ∈Θ ,
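\[ D_{\mathrm{kl}}(P_{\theta_1} \| P_{\theta_2}) = \frac{1}{2} \langle \theta_1 - \theta_2, I_{\theta_1}(\theta_1 - \theta_2) \rangle + o(\|\theta_1 - \theta_2\|^2). \tag{15.3.1} \]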
We only sketch the proof, as making it fully rigorous requires measure-theoretic arguments and
Lebesgue’s dominated convergence theorem.
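Sketch of Proof By a Taylor expansion of the log density $\log p_{\theta_2}(x)$ about $\theta_1$, we have
\begin{align*}
\log p_{\theta_2}(x) &= \log p_{\theta_1}(x) + \langle \nabla \log p_{\theta_1}(x), \theta_2 - \theta_1 \rangle \\
&\quad + \frac{1}{2} (\theta_1 - \theta_2)^\top \nabla^2 \log p_{\theta_1}(x)(\theta_1 - \theta_2) + R(\theta_1, \theta_2, x),
\end{align*}
where $R(\theta_1, \theta_2, x) = O_x(\|\theta_1 - \theta_2\|^3)$ is the remainder term, where $O_x$ denotes a hidden dependence
on $x$. Taking expectations and assuming that we can interchange differentiation and expectation
appropriately, we have
\begin{align*}
E_{\theta_1}[\log p_{\theta_2}(X)] &= E_{\theta_1}[\log p_{\theta_1}(X)] + \langle E_{\theta_1}[\dot{\ell}_{\theta_1}], \theta_2 - \theta_1 \rangle \\
&\quad + \frac{1}{2} (\theta_1 - \theta_2)^\top E_{\theta_1}[\nabla^2 \log p_{\theta_1}(X)](\theta_1 - \theta_2) + E_{\theta_1}[R(\theta_1, \theta_2, X)] \\
&= E_{\theta_1}[\log p_{\theta_1}(X)] - \frac{1}{2} (\theta_1 - \theta_2)^\top I_{\theta_1}(\theta_1 - \theta_2) + o(\|\theta_1 - \theta_2\|^2),
\end{align*}
where we have assumed that the $O(\|\theta_1 - \theta_2\|^3)$ remainder is uniform enough in $X$ that $E[R] = o(\|\theta_1 - \theta_2\|^2)$ and used that the score function $\dot{\ell}_\theta$ is mean zero under $P_\theta$.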
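The quadratic approximation (15.3.1) is easy to check numerically in one dimension. The sketch below assumes the Bernoulli family parameterized by its mean $p$, for which $I_p = 1/(p(1-p))$; the helper names are hypothetical. It compares the exact KL-divergence with the quadratic form $\frac{1}{2}(p - q)^2 I_p$ as $q \to p$.

```python
import math


def bernoulli_kl(p, q):
    """Exact D_kl(Bernoulli(p) || Bernoulli(q)) in nats, for p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def quadratic_approximation(p, q):
    """Second-order approximation (1/2) (p - q)^2 I_p with I_p = 1 / (p (1 - p))."""
    return 0.5 * (p - q) ** 2 / (p * (1.0 - p))


if __name__ == "__main__":
    p = 0.3
    for delta in (1e-1, 1e-2, 1e-3):
        exact = bernoulli_kl(p, p + delta)
        approx = quadratic_approximation(p, p + delta)
        # The ratio should approach 1 as delta shrinks.
        print(delta, exact, approx, exact / approx)
```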
We may use Proposition 15.3.2 to give a somewhat more general version of the Cramér-Rao
bound (Proposition 15.2.1) that applies to more general (sufficiently smooth) estimation problems.
Indeed, we will show that Le Cam’s method (recall Chapter 8.3) is (roughly) performing a type of
discrete second-order approximation to the KL-divergence, then using this to provide lower bounds.
More concretely, suppose we are attempting to estimate a parameter θ parameterizing the family
P = {Pθ }θ∈Θ , and assume that Θ ⊂ Rd and θ0 ∈ int Θ. Consider the minimax rate of estimation
of θ0 in a neighborhood around θ0 ; that is, consider
\[ \inf_{\hat{\theta}} \sup_{\theta = \theta_0 + v \in \Theta} E_\theta[\|\hat{\theta}(X_1^n) - \theta\|^2], \]
where the observations Xi are drawn i.i.d. Pθ . Fixing v ∈ Rd and setting θ = θ0 + δv for some
δ > 0, Le Cam’s method (8.3.3) then implies that
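\[ \inf_{\hat{\theta}} \max_{\theta \in \{\theta_0, \theta_0 + \delta v\}} E_\theta[\|\hat{\theta}(X_1^n) - \theta\|^2] \ge \frac{\delta^2 \|v\|^2}{8} \left[ 1 - \left\| P_{\theta_0}^n - P_{\theta_0 + \delta v}^n \right\|_{\mathrm{TV}} \right]. \]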
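Using Pinsker's inequality that $2\|P - Q\|_{\mathrm{TV}}^2 \le D_{\mathrm{kl}}(P \| Q)$ and the asymptotic quadratic approximation (15.3.1), we have
\[ \left\| P_{\theta_0}^n - P_{\theta_0 + \delta v}^n \right\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{kl}}\left( P_{\theta_0}^n \| P_{\theta_0 + \delta v}^n \right) = \frac{n}{2} \left( \frac{\delta^2}{2} v^\top I_{\theta_0} v + o(\delta^2 \|v\|^2) \right). \]
By taking $\delta^2 = (n v^\top I_{\theta_0} v)^{-1}$, for large enough $v$ and $n$ we know that $\theta_0 + \delta v \in \operatorname{int} \Theta$ (so that the
distribution $P_{\theta_0 + \delta v}$ exists), and for large $n$, the remainder term $o(\delta^2 \|v\|^2)$ becomes negligible. Thus
we obtain
\[ \inf_{\hat{\theta}} \max_{\theta \in \{\theta_0, \theta_0 + \delta v\}} E_\theta[\|\hat{\theta}(X_1^n) - \theta\|^2] \gtrsim \frac{\delta^2 \|v\|^2}{16} = \frac{1}{16} \frac{\|v\|^2}{n v^\top I_{\theta_0} v}. \tag{15.3.2} \]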
In particular, in one-dimension, inequality (15.3.2) implies a result generalizing the Cramér-Rao
bound. We have the following asymptotic local minimax result:
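Corollary 15.3.3. Let $\mathcal{P} = \{P_\theta\}_{\theta \in \Theta}$, where $\Theta \subset \mathbb{R}$, be a family of distributions satisfying the
quadratic approximation condition of Proposition 15.3.2. Then there exists a constant $c > 0$ such
that
\[ \lim_{v \to \infty} \lim_{n \to \infty} \inf_{\hat{\theta}_n} \sup_{\theta : |\theta - \theta_0| \le v/\sqrt{n}} E_\theta\left[ (\hat{\theta}_n(X_1^n) - \theta)^2 \right] \ge c\, \frac{1}{n} I_{\theta_0}^{-1}. \]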
Written differently (and with minor extension), Corollary 15.3.3 gives a lower bound based on
a local modulus of continuity of the loss function with respect to the metric induced by the Fisher
information. Indeed, suppose we wish to estimate a parameter $\theta$ in the neighborhood of $\theta_0$ (where
the neighborhood size decreases as $1/\sqrt{n}$) according to some loss function $\ell : \Theta \times \Theta \to \mathbb{R}$. Then if
we define the modulus of continuity of $\ell$ with respect to the Fisher information metric as
\[ \omega_\ell(\delta, \theta_0) := \sup_{v : \|v\| \le 1} \frac{\ell(\theta_0, \theta_0 + \delta v)}{\delta^2 v^\top I_{\theta_0} v}, \]
the combination of Corollary 15.3.3 and inequality (15.3.2) shows that the local minimax rate of
estimating $E_\theta[\ell(\hat{\theta}_n, \theta)]$ for $\theta$ near $\theta_0$ must be at least $\omega_\ell(n^{-1/2}, \theta_0)$. For more on connections between
moduli of continuity and estimation, see, for example, Donoho and Liu [64].
Remark 15.1: In order to make all of our exchanges of differentiation and expectation rigorous,
we must have some conditions on the densities we consider. One simple condition sufficient to make
this work is via Lebesgue's dominated convergence theorem. Let $f : \mathcal{X} \times \Theta \to \mathbb{R}$ be a differentiable
function. For a fixed base measure $\mu$, assume there exists a function $g$ such that $g(x) \ge \|\nabla_\theta f(x, \theta)\|$
for all $\theta$, where
\[ \int_{\mathcal{X}} g(x)\,d\mu(x) < \infty. \]
Then in this case, we have $\nabla_\theta \int f(x, \theta)\,d\mu(x) = \int \nabla_\theta f(x, \theta)\,d\mu(x)$ by the mean-value theorem and
definition of a derivative. (Note that for all $\theta_0$ we have $\sup_{v : \|v\|_2 \le \delta} \|\nabla_\theta f(x, \theta)\|_2 \big|_{\theta = \theta_0 + v} \le g(x)$.)
More generally, this type of argument can handle absolutely continuous functions, which are differentiable almost everywhere. ◇
Part IV
Chapter 16
In this chapter, we explore sequential game playing and online probabilistic prediction schemes.
These have applications in coding when the true distribution of the data is unknown, biological
algorithms (encoding genomic data, for example), control, and a variety of other areas. The field of
universal prediction is broad; in addition to this chapter touching briefly on a few of the techniques
therein and their relationships with statistical modeling and inference procedures, relevant reading
includes the survey by Merhav and Feder [140], the more recent book of Grünwald [95], and Tsachy
Weissman’s EE376c course at Stanford.
JCD Comment: Check out the below stuff a bit more carefully
To motivate this abstract setting we give two examples, the first abstract and the second somewhat
more concrete.
Example 16.1.1: Suppose that we receive $n$ random variables $X_i \stackrel{\mathrm{iid}}{\sim} P$; in this case, we have the
sequential prediction loss
\[ E_P[-\log q(X_1^n)] = \sum_{i=1}^n E_P\left[ \log \frac{1}{q(X_i \mid X_1^{i-1})} \right], \]
which corresponds to predicting $X_i$ given $X_1^{i-1}$ as well as possible, when the $X_i$ follow an
(unknown or adversarially chosen) distribution $P$. ◇
Example 16.1.2 (Coding): Expanding on the preceding example, suppose that the set $\mathcal{X}$
is finite, and we wish to encode $X$ into $\{0, 1\}$-valued sequences using as few bits as possible.
In this case, the Kraft inequality (recall Theorem 2.4.2) tells us that if $C : \mathcal{X} \to \{0, 1\}^*$ is a
uniquely decodable code, and $\ell_C(x)$ denotes the length of the encoding for the symbol $x \in \mathcal{X}$,
then
\[ \sum_x 2^{-\ell_C(x)} \le 1. \]
In particular, we have a coding game where we attempt to choose a distribution $Q$ (or sequential
coding scheme $C$) that has as small an expected length as possible, uniformly over distributions
$P$. (The field of universal coding studies such questions in depth; see Tsachy Weissman's course
EE376b.) ◇
We now show how the minimax game (16.1.1) naturally gives rise to exponential family models,
so that exponential family distributions are so-called robust Bayes procedures (cf. Grünwald and
Dawid [96]). Specifically, we say that Q is a robust Bayes procedure for the class P of distributions
if it minimizes the supremum risk (16.1.1) taken over the family P; that is, it is uniformly good for
all distributions P ∈ P. If we restrict our class P to be a linearly constrained family of distributions,
then we see that the exponential family distributions are natural robust Bayes procedures: they
uniquely solve the minimax game. More concretely, assume that P = Pαlin and that Pθ denotes the
exponential family distribution with density pθ (x) = p(x) exp(hθ, φ(x)i − A(θ)), where p denotes
the base density. We have the following.
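\[ \inf_Q \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} E_P[-\log q(X)] = \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} E_P[-\log p_\theta(X)] = \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} \inf_Q E_P[-\log q(X)]. \]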
Proof This is a standard saddle-point argument (cf. [153, 104, 35]). First, note that
where $H$ denotes the Shannon entropy, for any distribution $P \in \mathcal{P}_\alpha^{\mathrm{lin}}$. Moreover, for any $Q \ne P_\theta$,
we have
\[ \sup_P E_P[-\log q(X)] \ge E_{P_\theta}[-\log q(X)] > E_{P_\theta}[-\log p_\theta(X)] = H(P_\theta), \]
But we know from our standard maximum entropy results (Theorem 11.4.7) that $P_\theta$ maximizes the
entropy over $\mathcal{P}_\alpha^{\mathrm{lin}}$, that is, $\sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} H(P) = H(P_\theta)$.
In short: maximum entropy is equivalent to robust prediction procedures for linear families of
distributions Pαlin , which is equivalent to maximum likelihood in exponential families, which in turn
is equivalent to I-projection.
JCD Comment: Here we are back to the original stuff
where we have written it as the sum over $q(x_i \mid x_1^{i-1})$ to emphasize the sequential nature of the
game. Associated with the regret of the sequence $x_1^n$ is the adversarial regret (usually simply called
the regret) of $Q$ with respect to the family $\mathcal{P}$ of distributions, which is
\[ R_n^{\mathcal{X}}(Q, \mathcal{P}) := \sup_{P \in \mathcal{P},\, x_1^n \in \mathcal{X}^n} \mathrm{Reg}(Q, P, x_1^n). \tag{16.2.2} \]
In more generality, we may wish to use a loss function $\ell$ different than the log loss; that is, we
might wish to measure a loss-based version of the regret as
\[ \sum_{i=1}^n \ell(x_i, Q(\cdot \mid x_1^{i-1})) - \ell(x_i, P(\cdot \mid x_1^{i-1})), \]
where $\ell(x_i, P)$ indicates the loss suffered on the point $x_i$ when the distribution $P$ over $X_i$ is played,
and $P(\cdot \mid x_1^{i-1})$ denotes the conditional distribution of $X_i$ given $x_1^{i-1}$ according to $P$. We defer
discussion of such extensions until later, focusing on the log loss for now because of its natural connections
with maximum likelihood and coding.
A less adversarial problem is to minimize the redundancy, which is the expected regret under a
distribution $P$. In this case, we define the redundancy of $Q$ with respect to $P$ as the expected regret
of $Q$ with respect to $P$ under the distribution $P$, that is,
\[ \mathrm{Red}_n(Q, P) := E_P\left[ \log \frac{1}{q(X_1^n)} - \log \frac{1}{p(X_1^n)} \right] = D_{\mathrm{kl}}(P \| Q), \tag{16.2.3} \]
where the dependence on n is implicit in the KL-divergence. The worst-case redundancy with
respect to a class P is then
Rn (Q, P) := sup Redn (Q, P ). (16.2.4)
P ∈P
Example 16.2.1 (Example 16.1.2 on coding, continued): We noted in Example 16.1.2 that
for any p.m.f.s $p$ and $q$ on the set $\mathcal{X}$, it is possible to define coding schemes $C_p$ and $C_q$ with
code lengths
\[ \ell_{C_p}(x) = \log \frac{1}{p(x)} \quad \text{and} \quad \ell_{C_q}(x) = \log \frac{1}{q(x)}. \]
Conversely, given (uniquely decodable) encoding schemes $C_p$ and $C_q : \mathcal{X} \to \{0, 1\}^*$, the functions $p_{C_p}(x) = 2^{-\ell_{C_p}(x)}$ and $q_{C_q}(x) = 2^{-\ell_{C_q}(x)}$ satisfy $\sum_x p_{C_p}(x) \le 1$ and $\sum_x q_{C_q}(x) \le 1$. Thus,
the redundancy of $Q$ with respect to $P$ is the additional number of bits required to encode
variables distributed according to $P$ when we assume they have distribution $Q$:
\begin{align*}
\mathrm{Red}_n(Q, P) &= \sum_{i=1}^n E_P\left[ \log \frac{1}{q(X_i \mid X_1^{i-1})} - \log \frac{1}{p(X_i \mid X_1^{i-1})} \right] \\
&= \sum_{i=1}^n E_P[\ell_{C_q}(X_i)] - E_P[\ell_{C_p}(X_i)],
\end{align*}
where $\ell_C(x)$ denotes the number of bits $C$ uses to encode $x$. Note that, as in Section 2.4.1,
the code $\lceil -\log p(x) \rceil$ is (essentially) optimal. ◇
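The correspondence between code lengths and probabilities can be sketched on a three-symbol alphabet. The p.m.f.s and function names below are hypothetical, and logarithms are taken base 2 so that lengths are in bits. The sketch builds the integer lengths $\lceil -\log_2 p(x) \rceil$, checks the Kraft inequality, and evaluates the single-symbol redundancy $E_P[\log_2 \frac{1}{q(X)} - \log_2 \frac{1}{p(X)}]$, which is the KL-divergence in bits.

```python
import math


def shannon_code_lengths(pmf):
    """Integer code lengths ceil(-log2 p(x)) for each symbol with p(x) > 0."""
    return {x: math.ceil(-math.log2(px)) for x, px in pmf.items() if px > 0}


def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality, sum_x 2^{-l(x)}."""
    return sum(2.0 ** (-length) for length in lengths.values())


def redundancy_bits(p, q):
    """E_P[log2 1/q(X) - log2 1/p(X)], i.e. D_kl(P || Q) measured in bits."""
    return sum(px * (math.log2(px) - math.log2(q[x])) for x, px in p.items() if px > 0)


if __name__ == "__main__":
    p = {"a": 0.5, "b": 0.25, "c": 0.25}        # hypothetical true source
    q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}    # hypothetical assumed source
    print(shannon_code_lengths(p), kraft_sum(shannon_code_lengths(p)))
    print(shannon_code_lengths(q), kraft_sum(shannon_code_lengths(q)))
    print(redundancy_bits(p, q))
```

The redundancy is nonnegative and vanishes only when the assumed distribution matches the true one.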
As another example, we may consider a filtering or prediction problem for a linear system.
Moreover, if Compn (Θ) < +∞, then the normalized maximum likelihood distribution (also known
as the Shtarkov distribution) Q, defined with density
\[ q(x_1^n) = \frac{\sup_{\theta \in \Theta} p_\theta(x_1^n)}{\int \sup_\theta p_\theta(x_1^n)\,dx_1^n}, \]
is uniquely minimax optimal.
The proposition completely characterizes the minimax regret in the adversarial setting, and it
gives the unique distribution achieving the regret. Unfortunately, in most cases it is challenging
to compute the minimax optimal distribution Q, so we must make approximations of some type.
One approach is to make Bayesian approximations to Q, as we do in the sequel when we consider
redundancy rather than adversarial regret. See also the book of Grünwald [95] for more discussion
of this and other issues.
Proof We begin by proving the result in the case that Compn < +∞. First, note that the
normalized maximum likelihood distribution Q has constant regret:
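\begin{align*}
R_n^{\mathcal{X}}(Q, \mathcal{P}) &= \sup_{x_1^n \in \mathcal{X}^n} \left[ \log \frac{1}{q(x_1^n)} - \log \frac{1}{\sup_\theta p_\theta(x_1^n)} \right] \\
&= \sup_{x_1^n} \left[ \log \frac{\int \sup_\theta p_\theta(x_1^n)\,dx_1^n}{\sup_\theta p_\theta(x_1^n)} - \log \frac{1}{\sup_\theta p_\theta(x_1^n)} \right] = \mathrm{Comp}_n(\mathcal{P}).
\end{align*}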
We now give an example where (up to constant factor terms) we can explicitly calculate the
minimax regret in the adversarial setting. In this case, we compete with the family of i.i.d. Bernoulli
distributions.
where h2 (p) = −p log p − (1 − p) log(1 − p) is the binary entropy. Using this representation, we
find that the complexity of the Bernoulli family is
\[ \mathrm{Comp}_n([0, 1]) = \log \sum_{m=0}^n \binom{n}{m} e^{-n h_2(m/n)}. \]
Rather than explicitly compute with this, we now use Stirling's approximation (cf. Cover and
Thomas [53, Chapter 17]): for any $p \in (0, 1)$ with $np \in \mathbb{N}$, we have
\[ \binom{n}{np} \in \frac{1}{\sqrt{n}} \left[ \frac{1}{\sqrt{8p(1 - p)}}, \frac{1}{\sqrt{\pi p(1 - p)}} \right] \exp(n h_2(p)). \]
the noted asymptote occurring as $n \to \infty$ by the fact that this sum is a Riemann sum for the
integral $\int_0^1 \theta^{-1/2}(1 - \theta)^{-1/2}\,d\theta$. In particular, we have that as $n \to \infty$,
\begin{align*}
\inf_Q R_n^{\mathcal{X}}(Q, \mathcal{P}) = \mathrm{Comp}_n([0, 1]) &= \log\left( 2 + [8^{-1/2}, \pi^{-1/2}]\, n^{1/2} \int_0^1 \frac{1}{\sqrt{\theta(1 - \theta)}}\,d\theta + o(1) \right) \\
&= \frac{1}{2} \log n + \log \int_0^1 \frac{1}{\sqrt{\theta(1 - \theta)}}\,d\theta + O(1).
\end{align*}
We remark in passing that this is equal to $\frac{1}{2} \log n + \log \int_0^1 \sqrt{I_\theta}\,d\theta$ up to the $O(1)$ term, where $I_\theta$ denotes the Fisher
information of the Bernoulli family (recall Example 15.1.2). We will see that this holds in more
generality, at least for redundancy, in the sequel. ◇
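For moderate $n$ the Bernoulli complexity can also be evaluated exactly, which gives a quick check on the $\frac{1}{2} \log n$ growth. The sketch below (function name hypothetical) sums the maximized likelihoods $\binom{n}{m} (m/n)^m (1 - m/n)^{n-m}$ over the count $m$ of ones, working with log-gamma functions to avoid overflow and using the convention $0 \log 0 = 0$.

```python
import math


def bernoulli_complexity(n):
    """log sum_{m=0}^n C(n, m) (m/n)^m (1 - m/n)^(n - m), in nats."""
    total = 0.0
    for m in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
        log_max_lik = 0.0  # maximized likelihood is 1 when m is 0 or n
        if 0 < m < n:
            log_max_lik = m * math.log(m / n) + (n - m) * math.log(1.0 - m / n)
        total += math.exp(log_binom + log_max_lik)
    return math.log(total)


if __name__ == "__main__":
    for n in (10, 100, 1000, 4000):
        comp = bernoulli_complexity(n)
        # The second column should stay bounded as n grows.
        print(n, comp, comp - 0.5 * math.log(n))
```

Quadrupling $n$ should increase the complexity by roughly $\frac{1}{2} \log 4$, consistent with the expansion above.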
where we use Qn to denote that Q is applied on all n data points (in a sequential fashion, as
Q(· | X1i−1 )). In this expression, q and p denote the densities of Q and P , respectively. In a slightly
more general setting, we may consider the expected regret of Q with respect to a distribution Pθ
even under model mis-specification, meaning that the data is generated according to an alternate
distribution P . In this case, the (more general) redundancy becomes
\[ E_P\left[ \log \frac{1}{q(X_1^n)} - \log \frac{1}{p_\theta(X_1^n)} \right]. \tag{16.4.2} \]
In both cases (16.4.1) and (16.4.2), we would like to be able to guarantee that the redundancy
grows more slowly than $n$ as $n \to \infty$. That is, we would like to find distributions $Q$ such that, for
any $\theta_0 \in \Theta$, we have $\frac{1}{n} D_{\mathrm{kl}}(P_{\theta_0}^n \| Q_n) \to 0$ as $n \to \infty$. Assuming we could actually obtain such a
distribution in general, this is interesting because (even in the i.i.d. case) for any fixed distribution
$P_\theta \ne P_{\theta_0}$, we must have $D_{\mathrm{kl}}(P_{\theta_0}^n \| P_\theta^n) = n D_{\mathrm{kl}}(P_{\theta_0} \| P_\theta) = \Omega(n)$. A standard approach to attaining
such guarantees is the mixture approach, which is based on choosing $Q$ as a convex combination
(mixture) of all the possible source distributions $P_\theta$ for $\theta \in \Theta$.
In particular, given a prior distribution $\pi$ (weighting function integrating to 1) over $\Theta$, we define
the mixture distribution
\[ Q_n^\pi(A) = \int_\Theta \pi(\theta) P_\theta(A)\,d\theta \quad \text{for } A \subset \mathcal{X}^n. \tag{16.4.3} \]
Rewriting this in terms of densities $p_\theta$, we have
\[ q_n^\pi(x_1^n) = \int_\Theta \pi(\theta) p_\theta(x_1^n)\,d\theta. \]
Conceptually, this gives a simple prediction scheme, where at iteration $i$ we play the density
\[ q^\pi(x_i \mid x_1^{i-1}) = \frac{q^\pi(x_1^i)}{q^\pi(x_1^{i-1})}, \]
where we have emphasized that this strategy exhibits an exponential weighting approach, where
distribution weights are scaled exponentially by their previous loss performance of $\log 1/p_\theta(x_1^{i-1})$.
This mixture construction (16.4.3), with the weighting scheme (16.4.4), enjoys very good per-
formance. In fact, we say that so long as the prior π puts non-zero mass over all of Θ, under some
appropriate smoothness conditions, the scheme Qπ is universal, meaning that Dkl (Pθn ||Qπn ) = o(n).
We have the following theorem illustrating this effect. In the theorem, we let π be a density on Θ,
and we assume the Fisher information Iθ for the family P = {Pθ }θ∈Θ exists in a neighborhood of
θ0 ∈ int Θ, and that the distributions Pθ are sufficiently regular that differentiation and integration
can be interchanged. (See Clarke and Barron [49] for precise conditions.) We have
Theorem 16.4.1 (Clarke and Barron [49]). Under the above conditions, if $Q_n^\pi = \int P_\theta^n \pi(\theta)\,d\theta$ is
Example 16.4.2 (Bernoulli distributions with a Beta prior): Consider the class of binary
(i.i.d. or memoryless) Bernoulli sources, that is, the $X_i$ are i.i.d. Bernoulli($\theta$), where $\theta = P_\theta(X = 1) \in [0, 1]$. The Beta($\alpha, \beta$)-distribution prior on $\theta$ is the mixture $\pi$ with density
\[ \pi(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} \]
on $[0, 1]$, where $\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\,dt$ denotes the gamma function. We remark that under
the Beta($\alpha, \beta$) distribution, we have $E_\pi[\theta] = \frac{\alpha}{\alpha + \beta}$. (See any undergraduate probability text for
such results.)
If we play via a mixture of Bernoulli distributions under such a Beta-prior for $\theta$, by Theorem 16.4.1 we have a universal prediction scheme. We may also explicitly calculate the predictive distribution $Q$. To do so, we first compute the posterior $\pi(\theta \mid X_1^i)$ as in expression (16.4.4).
Let $S_i = \sum_{j=1}^i X_j$ be the partial sum of the $X$s up to iteration $i$. Then
\[ \pi(\theta \mid x_1^i) = \frac{p_\theta(x_1^i)\pi(\theta)}{q(x_1^i)} \propto \theta^{S_i}(1 - \theta)^{i - S_i} \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} = \theta^{\alpha + S_i - 1}(1 - \theta)^{\beta + i - S_i - 1}, \]
where we have ignored the denominator as we must simply normalize the above quantity in
$\theta$. But by inspection, the posterior density of $\theta \mid X_1^i$ is a Beta($\alpha + S_i, \beta + i - S_i$) distribution.
Thus to compute the predictive distribution for the next observation, we note that $E_\theta[X_{i+1}] = \theta$, so we have
\[ Q(X_{i+1} = 1 \mid X_1^i) = E_\pi[\theta \mid X_1^i] = \frac{S_i + \alpha}{i + \alpha + \beta}. \]
Moreover, Theorem 16.4.1 shows that when we play the prediction game with a Beta($\alpha, \beta$)-prior, we have redundancy scaling as
\[ D_{\mathrm{kl}}\left( P_{\theta_0}^n \| Q_n^\pi \right) = \frac{1}{2} \log \frac{n}{2\pi e} + \log \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} + \log \frac{1}{\theta_0^{\alpha - 1}(1 - \theta_0)^{\beta - 1}} + \frac{1}{2} \log \frac{1}{\theta_0(1 - \theta_0)} + o(1). \]
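The Beta-mixture predictor is simple enough to run directly. The following sketch (function names and the example sequence are hypothetical) accumulates the sequential log loss using the predictive probability $(S_i + \alpha)/(i + \alpha + \beta)$, and compares it with the log loss of the mixture marginal computed in closed form as a ratio of Beta functions, and with the best Bernoulli distribution in hindsight.

```python
import math


def sequential_log_loss(xs, alpha, beta):
    """Cumulative log loss of the Beta(alpha, beta) mixture played sequentially."""
    loss = 0.0
    ones_so_far = 0
    for i, x in enumerate(xs):
        prob_one = (ones_so_far + alpha) / (i + alpha + beta)
        loss += -math.log(prob_one if x == 1 else 1.0 - prob_one)
        ones_so_far += x
    return loss


def log_beta_function(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def mixture_log_loss(xs, alpha, beta):
    """-log q(x_1^n) for the mixture marginal B(alpha + s, beta + n - s) / B(alpha, beta)."""
    n, s = len(xs), sum(xs)
    return -(log_beta_function(alpha + s, beta + n - s) - log_beta_function(alpha, beta))


def best_hindsight_log_loss(xs):
    """-log sup_theta p_theta(x_1^n) for the i.i.d. Bernoulli family."""
    n, s = len(xs), sum(xs)
    if s == 0 or s == n:
        return 0.0
    return -(s * math.log(s / n) + (n - s) * math.log(1.0 - s / n))


if __name__ == "__main__":
    xs = [1, 0, 0, 1, 1, 1, 0, 1]
    seq = sequential_log_loss(xs, 0.5, 0.5)
    print(seq, mixture_log_loss(xs, 0.5, 0.5), seq - best_hindsight_log_loss(xs))
```

The first two numbers agree by the chain rule for the mixture density, and the last is the regret of the mixture on this particular sequence.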
As one additional interesting result, we show that mixture models are actually quite robust,
even under model mis-specification, that is, when the true distribution generating the data does not
belong to the class P = {Pθ }θ∈Θ . That is, mixtures can give good performance for the generalized
redundancy quantity (16.4.2). For this next result, we as usual define the mixture distribution $Q^\pi$
over the set $\mathcal{X}$ via $Q^\pi(A) = \int_\Theta P_\theta(A)\,d\pi(\theta)$. We may also restrict this mixture distribution to a
subset $\Theta_0 \subset \Theta$ by defining
\[ Q_{\Theta_0}^\pi(A) = \frac{1}{\pi(\Theta_0)} \int_{\Theta_0} P_\theta(A)\,d\pi(\theta). \]
Then we obtain the following robustness result.
Proposition 16.4.3. Assume that Pθ have densities pθ over X , let P be any distribution having
density p over X , and let q π be the density associated with Qπ . Then for any Θ0 ⊂ Θ,
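\[ E_P\left[ \log \frac{1}{q^\pi(X)} - \log \frac{1}{p_\theta(X)} \right] \le \log \frac{1}{\pi(\Theta_0)} + D_{\mathrm{kl}}\left( P \| Q_{\Theta_0}^\pi \right) - D_{\mathrm{kl}}(P \| P_\theta). \]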
In particular, Proposition 16.4.3 shows that so long as the mixture distributions QπΘ0 can closely
approximate Pθ , then we attain a convergence guarantee nearly as good as any in the family P =
{Pθ }θ∈Θ . (This result is similar in flavor to the mutual information bound (8.7.2), Corollary 8.7.2,
and the index of resolvability quantity.)
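Proof Fix any $\Theta_0 \subset \Theta$. Then we have $q^\pi(x) = \int_\Theta p_\theta(x)\,d\pi(\theta) \ge \int_{\Theta_0} p_\theta(x)\,d\pi(\theta)$. Thus we have
\begin{align*}
E_P\left[ \log \frac{p(X)}{q^\pi(X)} \right] &\le E_P\left[ \inf_{\Theta_0 \subset \Theta} \log \frac{p(X)}{\int_{\Theta_0} p_\theta(X)\,d\pi(\theta)} \right] \\
&= E_P\left[ \inf_{\Theta_0} \log \frac{p(X)\pi(\Theta_0)}{\pi(\Theta_0) \int_{\Theta_0} p_\theta(X)\,d\pi(\theta)} \right] = E_P\left[ \inf_{\Theta_0} \log \frac{p(X)}{\pi(\Theta_0) q_{\Theta_0}^\pi(X)} \right].
\end{align*}
This is certainly smaller than the same quantity with the infimum outside the expectation, and
noting that
\[ E_P\left[ \log \frac{1}{q^\pi(X)} - \log \frac{1}{p_\theta(X)} \right] = E_P\left[ \log \frac{p(X)}{q^\pi(X)} \right] - E_P\left[ \log \frac{p(X)}{p_\theta(X)} \right] \]
gives the result.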
With Theorem 16.4.1 in hand, we can give a somewhat more nuanced picture of this mutual
information quantity. As a first consequence of Theorem 16.4.1, we have that
\[ I_\pi(T; X_1^n) = \frac{d}{2} \log \frac{n}{2\pi e} + \int \log \frac{\sqrt{\det I_\theta}}{\pi(\theta)}\, \pi(\theta)\,d\theta + o(1), \tag{16.4.7} \]
where Iθ denotes the Fisher information matrix for the family {Pθ }θ∈Θ . One strand of Bayesian
statistics—we will not delve too deeply into this now, instead referring to the survey by Bernardo
[26]—known as reference analysis, advocates that in performing a Bayesian analysis, we should
choose the prior π that maximizes the mutual information between the parameters θ about which
we wish to make inferences and any observations X1n available. Moreover, in this set of strategies,
one allows n to tend to ∞, as we wish to take advantage of any data we might actually see. The
asymptotic formula (16.4.7) allows us to choose such a prior.
In a different vein, Jeffreys [116] proposed that if the square root of the determinant of the
Fisher information was integrable, then one should take π as
$$\pi_{\rm jeffreys}(\theta) = \frac{\sqrt{\det I_\theta}}{\int_\Theta \sqrt{\det I_\theta}\,d\theta},$$
known as the Jeffreys prior. Jeffreys originally proposed this for invariance reasons, as the infer-
ences made on the parameter θ under the prior πjeffreys are identical to those made on a trans-
formed parameter φ(θ) under the appropriately transformed Jeffreys prior. The asymptotic ex-
pression (16.4.7), however, shows that the Jeffreys prior is the asymptotic reference prior. Indeed,
computing the integral in (16.4.7), we have
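$$\int_\Theta \pi(\theta)\log\frac{\sqrt{\det I_\theta}}{\pi(\theta)}\,d\theta = \int_\Theta \pi(\theta)\log\frac{\pi_{\rm jeffreys}(\theta)}{\pi(\theta)}\,d\theta + \log\int\sqrt{\det I_\theta}\,d\theta$$
$$= -D_{\rm kl}(\pi \,\|\, \pi_{\rm jeffreys}) + \log\int\sqrt{\det I_\theta}\,d\theta,$$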
whenever the Jeffreys prior exists. Moreover, we see that in an asymptotic sense, the worst-case
prior distribution π for nature to play is given by the Jeffreys prior, as otherwise the −Dkl (π||πjeffreys )
term in the expected (Bayesian) redundancy is negative.
Example 16.4.4 (Jeffreys priors and the exponential distribution): Let us now assume that
our source distributions Pθ are exponential distributions, meaning that θ ∈ (0, ∞) and we have
density $p_\theta(x) = \exp(-\theta x - \log\frac{1}{\theta})$ for x ∈ [0, ∞). This is clearly an exponential family model,
and the Fisher information is easy to compute as $I_\theta = \frac{\partial^2}{\partial\theta^2}\log\frac{1}{\theta} = 1/\theta^2$ (cf. Example 15.1.1).
In this case, the Jeffreys prior is $\pi_{\rm jeffreys}(\theta) \propto \sqrt{I_\theta} = 1/\theta$, but this “density” does not integrate
over [0, ∞). One approach to this difficulty, advocated by Bernardo [26, Definition 3] (among
others) is to just proceed formally and notice that after observing a single datapoint, the
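“posterior” distribution π(θ | X) is well-defined. Following this idea, note that after seeing
some data X1 , . . . , Xi , with $S_i = \sum_{j=1}^i X_j$ as the partial sum, we have
$$\pi(\theta \mid x_1^i) \propto p_\theta(x_1^i)\,\pi_{\rm jeffreys}(\theta) = \theta^i\exp\Big(-\theta\sum_{j=1}^i x_j\Big)\frac{1}{\theta} = \theta^{i-1}\exp(-\theta S_i).$$
The resulting predictive density for the next observation satisfies
$$q(x \mid x_1^i) = \int_0^\infty p_\theta(x)\,\pi(\theta \mid x_1^i)\,d\theta \propto \int_0^\infty \theta e^{-\theta x}\,\theta^{i-1}e^{-\theta s_i}\,d\theta = \frac{1}{(s_i + x)^{i+1}}\int_0^\infty u^i e^{-u}\,du,$$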
where we made the change of variables u = θ(si + x). This is at least a distribution that
normalizes, so often one simply assumes the existence of a piece of fake data. For example, by
saying we “observe” x0 = 1, we have prior proportional to $\pi(\theta) = e^{-\theta}$, which yields redundancy
$$D_{\rm kl}\big(P^n_{\theta_0} \,\|\, Q^\pi_n\big) = \frac{1}{2}\log\frac{n}{2\pi e} + \theta_0 + \log\frac{1}{\theta_0} + o(1).$$
The difference is that, in this case, the redundancy bound is no longer uniform in θ0 , as it
would be for the true reference (or Jeffreys, if it exists) prior. ◇
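As a numerical sanity check of the predictive density in Example 16.4.4, the following minimal sketch (not part of the original notes) compares a closed form against direct numerical mixing. It assumes the normalization $q(x \mid x_1^i) = i\,s_i^i/(s_i + x)^{i+1}$, which follows from integrating $(s_i + x)^{-(i+1)}$ over x ∈ [0, ∞); the function names and the values of s and i are illustrative choices.

```python
import math

def jeffreys_predictive(x, s, i):
    """Predictive density q(x | x_1^i) for exponential data under the improper prior 1/theta.

    s is the partial sum S_i > 0 of the i >= 1 observations seen so far.
    Normalizing (s + x)^{-(i+1)} over x in [0, inf) gives i * s^i / (s + x)^(i+1).
    """
    return i * s**i / (s + x) ** (i + 1)

def midpoint_integral(f, lo, hi, steps=100_000):
    """Plain midpoint rule; accurate enough for a sanity check."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

def predictive_by_mixing(x, s, i, theta_max=60.0):
    """Integrate p_theta(x) against the normalized posterior theta^{i-1} e^{-theta s}."""
    def integrand(theta):
        posterior = s**i * theta ** (i - 1) * math.exp(-theta * s) / math.gamma(i)
        return theta * math.exp(-theta * x) * posterior
    return midpoint_integral(integrand, 0.0, theta_max)

s, i = 2.5, 3  # illustrative partial sum and sample count
closed_form = jeffreys_predictive(1.0, s, i)
numeric = predictive_by_mixing(1.0, s, i)
total_mass = midpoint_integral(lambda x: jeffreys_predictive(x, s, i), 0.0, 1000.0)
print(closed_form, numeric, total_mass)
```

The two evaluations of the density should agree closely, and the total mass should be near one, which is the sense in which the "posterior" predictive normalizes even though the prior does not.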
A natural question that arises from this expression is the following: if nature chooses a worst-case
prior, can we swap the order of maximization and minimization? That is, do we ever have the
equality
$$\sup_\pi I_\pi(T; X_1^n) = \inf_Q \sup_\theta D_{\rm kl}(P_\theta^n \,\|\, Q),$$
so that the worst-case Bayesian redundancy is actually the minimax redundancy? It is clear that
if nature can choose the worst case Pθ after we choose Q, the redundancy must be at least as bad
as the Bayesian redundancy, so
$$\sup_\pi I_\pi(T; X_1^n) \le \inf_Q \sup_\theta D_{\rm kl}(P_\theta^n \,\|\, Q) = \inf_Q R_n(Q, \mathcal{P}).$$
Indeed, if this inequality were an equality, then for the worst-case prior π∗ , the mixture $Q_n^{\pi^*}$ would
be minimax optimal.
In fact, the redundancy-capacity theorem, first proved by Gallager [88] and extended by Haussler
[101] (among others), allows us to do just that. That is, if we must choose a distribution
Q and then nature chooses Pθ adversarially, we can guarantee no worse redundancy than in the
(worst-case) Bayesian setting. We state a simpler version of the result that holds when the random
variables X take values in finite spaces; Haussler’s more general version shows that the next
theorem holds whenever X ∈ X and X is a complete separable metric space.
Theorem 16.4.5 (Gallager [88]). Let X be a random variable taking on a finite number of values
and Θ be a measurable space. Then
$$\sup_\pi \inf_Q \int D_{\rm kl}(P_\theta \,\|\, Q)\,d\pi(\theta) = \sup_\pi I_\pi(T; X) = \inf_Q \sup_{\theta\in\Theta} D_{\rm kl}(P_\theta \,\|\, Q).$$
Moreover, the infimum on the right is uniquely attained by some distribution Q∗ , and if π∗ attains
the supremum on the left, then $Q^* = \int P_\theta\,d\pi^*(\theta)$.
See Section 16.6 for a proof of Theorem 16.4.5.
This theorem is known as the redundancy-capacity theorem in the literature, because in classical
information theory, the capacity of a noisy channel $T \to X_1^n$ is the maximal mutual information
$\sup_\pi I_\pi(T; X_1^n)$. In the exercises, you explore some robustness properties of the optimal distribution
$Q^\pi$ in relation to this theorem. In short, though, we see that if there is a capacity achieving
prior, then the associated mixture distribution $Q^\pi$ is minimax optimal and attains the minimax
redundancy for the game.
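Theorem 16.4.5 can be checked numerically on a small finite family. The following is a minimal sketch, assuming the classical Blahut-Arimoto iteration (which these notes do not describe) to approximate a capacity-achieving prior; the function names and the three-distribution family are illustrative assumptions. At convergence, the mutual information under the prior and the worst-case divergence to the mixture should coincide.

```python
import math

def kl(p, q):
    """KL divergence in nats between two distributions on a finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture(family, prior):
    """Q^pi = sum_theta pi(theta) P_theta for a finite family of distributions."""
    m = len(family[0])
    return [sum(w * p[x] for w, p in zip(prior, family)) for x in range(m)]

def capacity(family, iters=500):
    """Blahut-Arimoto iteration approximating sup_pi I_pi(T; X)."""
    d = len(family)
    prior = [1.0 / d] * d
    for _ in range(iters):
        q = mixture(family, prior)
        # Reweight each theta by exp(D_kl(P_theta || Q^pi)), then renormalize.
        scores = [w * math.exp(kl(p, q)) for w, p in zip(prior, family)]
        total = sum(scores)
        prior = [s / total for s in scores]
    q = mixture(family, prior)
    info = sum(w * kl(p, q) for w, p in zip(prior, family))  # I_pi(T; X)
    worst = max(kl(p, q) for p in family)                    # sup_theta D_kl(P_theta || Q^pi)
    return prior, q, info, worst

# Illustrative family: two "noisy bit" distributions plus their midpoint.
family = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
prior, q, info, worst = capacity(family)
print(prior, q, info, worst)
```

In this example the third distribution equals the mixture of the first two, so the iteration drives its prior weight toward zero, and the capacity is that of a binary symmetric channel with crossover probability 0.1.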
distributed with appropriate variance measure, which gives the result. We now give the intuition for
this statement, first by heuristically deriving the asymptotics of a maximum likelihood estimator,
then by looking at the Bayesian case. (Clarke and Barron [49] provide a fully rigorous proof.)
where R is a remainder term. Assuming that θb → θ0 at any reasonable rate (this can be made
rigorous), this remainder is negligible asymptotically.
Rearranging this equality, we obtain
where we have used that the Fisher information Iθ = −Eθ [∇2 log pθ (X)] and the law of large
numbers. By the (multivariate) central limit theorem, we then obtain the asymptotic normality
result
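$$\sqrt{n}(\hat\theta - \theta_0) \approx \frac{1}{\sqrt{n}}\sum_{i=1}^n I_{\theta_0}^{-1}\nabla\log p_{\theta_0}(X_i) \stackrel{d}{\to} N(0, I_{\theta_0}^{-1}),$$
where $\stackrel{d}{\to}$ denotes convergence in distribution, with asymptotic variance
$$I_{\theta_0}^{-1}\,\mathbb{E}_{\theta_0}\big[\nabla\log p_{\theta_0}(X)\nabla\log p_{\theta_0}(X)^\top\big]\,I_{\theta_0}^{-1} = I_{\theta_0}^{-1} I_{\theta_0} I_{\theta_0}^{-1} = I_{\theta_0}^{-1}.$$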
$$\pi(\theta \mid X_1^n) = \frac{p_\theta(X_1^n)\,\pi(\theta)}{q_n(X_1^n)}.$$
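By our heuristic calculation of the MLE, this density (assuming the data overwhelms the prior) is
approximately a normal density with mean θ0 and variance $(nI_{\theta_0})^{-1}$, where we have used
expression (16.5.1). Expanding the redundancy, we obtain
$$\mathbb{E}_{\theta_0}\left[\log\frac{p_{\theta_0}(X_1^n)}{q_n(X_1^n)}\right] = \mathbb{E}_{\theta_0}\left[\log\frac{p_{\hat\theta}(X_1^n)\,\pi(\hat\theta)}{q_n(X_1^n)}\right] + \mathbb{E}_{\theta_0}\left[\log\frac{1}{\pi(\hat\theta)}\right] + \mathbb{E}_{\theta_0}\left[\log\frac{p_{\theta_0}(X_1^n)}{p_{\hat\theta}(X_1^n)}\right]. \tag{16.5.2}$$
by the asymptotic normality result, $\pi(\hat\theta) = \pi(\theta_0) + O(1/\sqrt{n})$ again by the asymptotic normality
result, and
$$\log p_{\hat\theta}(X_1^n) \approx \log p_{\theta_0}(X_1^n) + \Big(\sum_{i=1}^n \nabla\log p_{\theta_0}(X_i)\Big)^\top(\hat\theta - \theta_0)$$
$$\approx \log p_{\theta_0}(X_1^n) + \Big(\sum_{i=1}^n \nabla\log p_{\theta_0}(X_i)\Big)^\top I_{\theta_0}^{-1}\Big(\frac{1}{n}\sum_{i=1}^n \nabla\log p_{\theta_0}(X_i)\Big).$$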
where we have assumed that X belongs to a finite set, so that Q(X) is simply the probability of
X. For a given prior distribution π on θ, we define the expected redundancy as
$$\mathrm{Red}(Q, \pi) = \int D_{\rm kl}(P_\theta \,\|\, Q)\,d\pi(\theta).$$
Our goal is to show that the max-min value of the prediction game is the same as the min-max
value of the game, that is,
Proof We know that the max-min risk (worst-case Bayes risk) of the game is supπ Iπ (T ; X);
it remains to show that this is the min-max risk. To that end, define the capacity of the family
{Pθ }θ∈Θ as
C := sup Iπ (T ; X). (16.6.1)
π
Notably, this constant is finite (because Iπ (T ; X) ≤ log |X |), and there exists a sequence πn of prior
probabilities such that $I_{\pi_n}(T; X) \to C$. Now, let Q̄ be any cluster point of the sequence of mixtures
$Q^{\pi_n} = \int P_\theta\,d\pi_n(\theta)$; such a point exists because the space of probability distributions on the finite
and we claim this is sufficient for the theorem. Indeed, suppose that inequality (16.6.2) holds. Then
in this case, we have
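$$\inf_Q \sup_{\theta\in\Theta} \mathrm{Red}(Q, \theta) \le \sup_{\theta\in\Theta} \mathrm{Red}(\bar Q, \theta) = \sup_{\theta\in\Theta} D_{\rm kl}(P_\theta \,\|\, \bar Q) \le C,$$
and
$$\sup_\pi \inf_Q \mathrm{Red}(Q, \pi) \le \inf_Q \sup_\pi \mathrm{Red}(Q, \pi) = \inf_Q \sup_{\theta\in\Theta} \mathrm{Red}(Q, \theta).$$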
For the sake of contradiction, let us assume that there exists some θ ∈ Θ such that inequal-
ity (16.6.2) fails, call it θ∗ . We will then show that suitable mixtures (1 − λ)π + λδθ∗ , where δθ∗ is
the point mass on θ∗ , could increase the capacity (16.6.1). To that end, for shorthand define the
mixtures
$$\pi_{n,\lambda} = (1-\lambda)\pi_n + \lambda\delta_{\theta^*} \quad\text{and}\quad Q^{\pi_n,\lambda} = (1-\lambda)Q^{\pi_n} + \lambda P_{\theta^*}$$
for λ ∈ [0, 1]. Let us also use the notation $H_w(X \mid T)$ to denote the conditional entropy of
the random variable X given T (when T is distributed as w), and we abuse notation by writing
H(X) = H(P ) when X is distributed as P . In this case, it is clear that we have
To demonstrate our contradiction, we will show two things: first, that at λ = 0 the limits of both
sides of the preceding display are equal to the capacity C, and second, that the derivative of the
right hand side is positive. This will contradict the definition (16.6.1) of the capacity.
To that end, note that
It is clear that at λ = 0, both sides are equal to the capacity C, while taking derivatives with
respect to λ we have
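$$\frac{\partial}{\partial\lambda} H\big((1-\lambda)\bar Q + \lambda P_{\theta^*}\big) = -\sum_x \big(P_{\theta^*}(x) - \bar Q(x)\big)\log\big((1-\lambda)\bar Q(x) + \lambda P_{\theta^*}(x)\big),$$
so that
$$\frac{\partial}{\partial\lambda}\lim_n I_{\pi_{n,\lambda}}(T; X)\Big|_{\lambda=0} = -\sum_x P_{\theta^*}(x)\log\bar Q(x) + \sum_x \bar Q(x)\log\bar Q(x) + H(\bar Q) - C + \sum_x P_{\theta^*}(x)\log P_{\theta^*}(x)$$
$$= \sum_x P_{\theta^*}(x)\log\frac{P_{\theta^*}(x)}{\bar Q(x)} - C.$$
In particular, if inequality (16.6.2) fails to hold, then $\frac{\partial}{\partial\lambda}\lim_n I_{\pi_{n,\lambda}}(T; X)|_{\lambda=0} > 0$, contradicting the
definition (16.6.1) of the channel capacity.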
The uniqueness of the result follows from the strict convexity of the mutual information I in
the mixture channel Q̄.
16.7 Exercises
Exercise 16.1 (Minimax redundancy and different loss functions): In this question, we consider
expected losses under the Bernoulli distribution. Assume that $X_i \stackrel{\rm iid}{\sim} \mathrm{Bernoulli}(p)$, meaning that
Xi = 1 with probability p and Xi = 0 with probability 1 − p. We consider four different loss
functions, and their associated expected regret, for measuring the accuracy of our predictions of
such Xi . For each of the four choices below, we prove expected regret bounds on
$$\mathrm{Red}_n(\hat\theta, P, \ell) := \sum_{i=1}^n \mathbb{E}_P\big[\ell(\hat\theta(X_1^{i-1}), X_i)\big] - \inf_\theta \sum_{i=1}^n \mathbb{E}_P[\ell(\theta, X_i)], \tag{16.7.1}$$
where $\hat\theta$ is a predictor based on X1 , . . . , Xi−1 at time i. Define $S_i = \sum_{j=1}^i X_j$ to be the partial sum
up to time i. For each of parts (a)–(c), at time i use the predictor
$$\hat\theta_i = \hat\theta(X_1^{i-1}) = \frac{S_{i-1} + \frac{1}{2}}{i}.$$
(d) Extra credit: Show that there is a numerical constant c > 0 such that for any procedure
$\hat\theta$, the worst-case redundancy $\sup_{p\in[0,1]} \mathrm{Red}_n(\hat\theta, \mathrm{Bernoulli}(p), \ell) \ge c\sqrt{n}$ for the absolute loss ℓ in
part (c). Give a strategy attaining this redundancy.
Exercise 16.2 (Strong versions of redundancy): Assume that for a given θ ∈ Θ we draw
X1n ∼ Pθ . We define the Bayes redundancy for a family of distributions P = {Pθ }θ∈Θ as
$$C_n^\pi := \inf_Q \int D_{\rm kl}(P_\theta \,\|\, Q)\,d\pi(\theta) = I_\pi(T; X_1^n),$$
(b) Assume that π attains the supremum in the definition of $C_n^*$. Show that
Hint: Introduce the random variable Z to be 1 if the random variable T ∈ B and 0 otherwise, then
use that $Z \to T \to X_1^n$ forms a Markov chain, and expand the mutual information. For part (b),
the inequality $\frac{1-x}{x}\log\frac{1}{1-x} \le 1$ for all x ∈ [0, 1] may be useful.
Exercise 16.3 (Mixtures are as good as point distributions): Let P be a Laplace(λ) distribution
on R, meaning that X ∼ P has density
$$p(x) = \frac{\lambda}{2}\exp(-\lambda|x|).$$
Assume that $X_1, \ldots, X_n \stackrel{\rm iid}{\sim} P$, and let $P^n$ denote the n-fold product of P . In this problem, we
compare the predictive performance of distributions from the normal location family P = {N(θ, σ 2 ) :
θ ∈ R} with the mixture distribution Qπ over P defined by the normal prior distribution N(µ, τ 2 ),
that is, π(θ) = (2πτ 2 )−1/2 exp(−(θ − µ)2 /2τ 2 ).
(a) Let Pθ,Σ be the multivariate normal distribution with mean θ ∈ Rn and covariance Σ ∈ Rn×n .
What is Dkl (P n ||Pθ,Σ )?
(b) Show that inf θ∈Rn Dkl (P n ||Pθ,Σ ) = Dkl (P n ||P0,Σ ), that is, the mean-zero normal distribution
has the smallest KL-divergence from the Laplace distribution.
(c) Let Qπn be the mixture of the n-fold products in P, that is, Qπn has density
$$q_n^\pi(x_1^n) = \int_{-\infty}^\infty \pi(\theta)\,p_\theta(x_1)\cdots p_\theta(x_n)\,d\theta,$$
(d) Show that the redundancy of Qπn under the distribution P is asymptotically nearly as good
as the redundancy of any Pθ ∈ P, the normal location family (so Pθ has density pθ (x) =
(2πσ 2 )−1/2 exp(−(x − θ)2 /2σ 2 )). That is, show that
$$\sup_{\theta\in\mathbb{R}} \mathbb{E}_P\left[\log\frac{1}{q_n^\pi(X_1^n)} - \log\frac{1}{p_\theta(X_1^n)}\right] = O(\log n)$$
for any prior variance τ 2 > 0 and any prior mean µ ∈ R, where the big-Oh hides terms
dependent on τ 2 , σ 2 , µ2 .
(e) Extra credit: Can you give an interesting condition under which such redundancy guarantees
hold more generally? That is, using Proposition 16.4.3 in the notes, give a general condition
under which
$$\mathbb{E}_P\left[\log\frac{1}{q^\pi(X_1^n)} - \log\frac{1}{p_\theta(X_1^n)}\right] = o(n)$$
as n → ∞, for all θ ∈ Θ.
Chapter 17
Thus far, in our discussion of universal prediction and related ideas, we have focused (essentially)
exclusively on making predictions with the logarithmic loss, so that we play a full distribution over
the set X as our prediction at each time step in the procedure. This is natural in settings, such
as coding (recall examples 16.1.2 and 16.2.1), in which the log loss corresponds to a quantity we
directly care about, or when we do not necessarily know much about the task at hand but rather
wish to simply model a process. (We will see this more shortly.) In many cases, however, we have
a natural task-specific loss. The natural question that follows, then, is to what extent it is possible
to extend the results of Chapter 16 to different settings in which we do not necessarily care about
prediction of an entire distribution. (Relevant references include the paper of Cesa-Bianchi and
Lugosi [46], which shows how complexity measures known as Rademacher complexity govern the
regret in online prediction games; the book by the same authors [47], which gives results covering a
wide variety of online learning, prediction, and other games; the survey by Merhav and Feder [140];
and the study of consequences of the choice of loss for universal prediction problems by Haussler
et al. [102].)
where $\hat X_i$ are the predictions of the procedure we use and P is the distribution generating the data
$X_1^n$. In this case, if the distribution P is known, it is clear that the optimal strategy is to play the
Bayes-optimal prediction
$$X_i^* \in \operatorname*{argmin}_{x\in\hat{\mathcal{X}}} \mathbb{E}_P[\ell(x, X_i) \mid X_1^{i-1}] = \operatorname*{argmin}_{x\in\hat{\mathcal{X}}} \int_{\mathcal{X}} \ell(x, x_i)\,dP(x_i \mid X_1^{i-1}). \tag{17.1.1}$$
In many cases, however, we do not know the distribution P , and so our goal (as in the previous
chapter) is to minimize the cumulative loss simultaneously for all source distributions in
a family P.
where $\hat X_i$ is chosen according to $Q(\cdot \mid X_1^{i-1})$ as in expression (17.1.2). The natural question now, of
course, is whether the strategy (17.1.2) has redundancy growing more slowly than n.
It turns out that in some situations, this is the case: we have the following theorem [140, Section
III.A.2], which only requires that the usual redundancy (16.2.3) (with log loss) is sub-linear and the
loss is suitably bounded. In the theorem, we assume that the class of distributions P = {Pθ }θ∈Θ is
indexed by θ ∈ Θ.
Theorem 17.1.1. Assume that the redundancy $\mathrm{Red}_n(Q, P_\theta) \le R_n(\theta)$ and that $|\ell(\hat x, x) - \ell(x^*, x)| \le L$
for all x and predictions $\hat x, x^*$. Then we have
$$\frac{1}{n}\mathrm{Red}_n(Q, P_\theta, \ell) \le L\sqrt{\frac{2}{n}R_n(\theta)}.$$
To attain vanishing expected regret under the loss `, then, Theorem 17.1.1 requires only that
we play a Bayes’ strategy (17.1.2) with a distribution Q for which the average (over n) of the
usual redundancy (16.2.3) tends to zero, so long as the loss is (roughly) bounded. We give two
examples of bounded losses. First, we might consider the 0-1 loss, which clearly satisfies
$|\ell(\hat x, x) - \ell(x^*, x)| \le 1$. Second, the absolute value loss (which is used for robust estimation of location
parameters [145, 108]), given by $\ell(\hat x, x) = |x - \hat x|$, satisfies $|\ell(\hat x, x) - \ell(x^*, x)| \le |\hat x - x^*|$. If the
distribution Pθ has median θ and Θ is compact, then $\mathbb{E}[|\hat x - X|]$ is minimized by its median, and
$|\hat x - x^*|$ is bounded by the diameter of Θ.
Proof The theorem is essentially a consequence of Pinsker’s inequality (Proposition 2.2.8). By
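$$= \sum_{i=1}^n \int_{\mathcal{X}^{i-1}} p_\theta(x_1^{i-1}) \int_{\mathcal{X}} \big(p_\theta(x_i \mid x_1^{i-1}) - q(x_i \mid x_1^{i-1})\big)\big[\ell(\hat X_i, x_i) - \ell(X_i^*, x_i)\big]\,dx_i\,dx_1^{i-1}$$
$$\quad + \sum_{i=1}^n \int_{\mathcal{X}^{i-1}} p_\theta(x_1^{i-1})\,\underbrace{\mathbb{E}_Q\big[\ell(\hat X_i, X_i) - \ell(X_i^*, X_i) \mid x_1^{i-1}\big]}_{\le 0}\,dx_1^{i-1}, \tag{17.1.4}$$
$$\mathbb{E}_Q\big[\ell(\hat X_i, X_i) - \ell(X_i^*, X_i) \mid X_1^{i-1}\big]$$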
where we have used the definition of total variation distance. Combining this inequality with (17.1.4),
we obtain
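$$\mathrm{Red}_n(Q, P_\theta, \ell) \le 2L\sum_{i=1}^n \int_{\mathcal{X}^{i-1}} p_\theta(x_1^{i-1})\,\big\|P_\theta(\cdot \mid x_1^{i-1}) - Q(\cdot \mid x_1^{i-1})\big\|_{\rm TV}\,dx_1^{i-1}$$
$$\stackrel{(\star)}{\le} 2L\sum_{i=1}^n \Big(\int_{\mathcal{X}^{i-1}} p_\theta(x_1^{i-1})\,dx_1^{i-1}\Big)^{\frac12}\Big(\int_{\mathcal{X}^{i-1}} p_\theta(x_1^{i-1})\,\big\|P_\theta(\cdot \mid x_1^{i-1}) - Q(\cdot \mid x_1^{i-1})\big\|_{\rm TV}^2\,dx_1^{i-1}\Big)^{\frac12}$$
$$= 2L\sum_{i=1}^n \Big(\int_{\mathcal{X}^{i-1}} p_\theta(x_1^{i-1})\,\big\|P_\theta(\cdot \mid x_1^{i-1}) - Q(\cdot \mid x_1^{i-1})\big\|_{\rm TV}^2\,dx_1^{i-1}\Big)^{\frac12},$$
where the inequality (⋆) follows by the Cauchy-Schwarz inequality applied to the integrands $\sqrt{p_\theta}$
and $\sqrt{p_\theta}\,\|P - Q\|_{\rm TV}$. Applying the Cauchy-Schwarz inequality to the final sum, we have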
$$\mathrm{Red}_n(Q, P_\theta, \ell) \le 2L\sqrt{n}\,\Big(\sum_{i=1}^n \int_{\mathcal{X}^{i-1}} p_\theta(x_1^{i-1})\,\big\|P_\theta(\cdot \mid x_1^{i-1}) - Q(\cdot \mid x_1^{i-1})\big\|_{\rm TV}^2\,dx_1^{i-1}\Big)^{\frac12}$$
$$\stackrel{(\star\star)}{\le} 2L\sqrt{n}\,\Big(\frac{1}{2}\sum_{i=1}^n \int_{\mathcal{X}^{i-1}} p_\theta(x_1^{i-1})\,D_{\rm kl}\big(P_\theta(\cdot \mid x_1^{i-1}) \,\|\, Q(\cdot \mid x_1^{i-1})\big)\,dx_1^{i-1}\Big)^{\frac12}$$
$$= L\sqrt{2n}\,\sqrt{D_{\rm kl}(P_\theta^n \,\|\, Q)},$$
where inequality (⋆⋆) is an application of Pinsker’s inequality. But of course, we know that
$\mathrm{Red}_n(Q, P_\theta) = D_{\rm kl}(P_\theta^n \,\|\, Q)$ by definition (16.2.3) of the redundancy.
Before proceeding to examples, we note that in a variety of cases the bounds of Theorem 17.1.1 are
loose. For example, under mean-squared error, universal linear predictors [58, 151] have redundancy
O(log n), while Theorem 17.1.1 gives at best a bound of $O(\sqrt{n})$.
TODO: Add material on redundancy/capacity (Theorem 16.4.5) analogue in general loss case,
which allows playing mixture distributions based on mixture of {Pθ }θ∈Θ .
17.1.2 Examples
We now give an example application of Theorem 17.1.1 to a classification
problem with side information. In particular, let us consider the 0-1 loss $\ell_{0\text{-}1}(\hat y, y) = 1\{\hat y \cdot y \le 0\}$,
and assume that we wish to predict y based on a vector x ∈ Rd of regressors that are fixed ahead
of time. In addition, we assume that the “true” distribution (or competitor) Pθ is that given x and
θ, Y has normal distribution with mean ⟨θ, x⟩ and variance σ², that is,
$$Y_i = \langle\theta, x_i\rangle + \varepsilon_i, \qquad \varepsilon_i \stackrel{\rm iid}{\sim} N(0, \sigma^2).$$
Now, we consider playing according to a mixture distribution (16.4.3), and for our prior π we choose
θ ∼ N(0, τ 2 Id×d ), where τ > 0 is some parameter we choose.
Let us first consider the case in which we observe Y1 , . . . , Yn directly (rather than simply whether
we classify correctly) and consider the prediction scheme this generates. First, we recall as in the
posterior calculation (16.4.4) that we must calculate the posterior on θ given Y1 , . . . , Yi at step i+1.
Assuming we have computed this posterior, we play
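$$\hat Y_i := \operatorname*{argmin}_{y\in\mathbb{R}} \mathbb{E}_{Q^\pi}[\ell_{0\text{-}1}(y, Y_i) \mid Y_1^{i-1}] = \operatorname*{argmin}_{y\in\mathbb{R}} Q^\pi\big(\mathrm{sign}(Y_i) \ne \mathrm{sign}(y) \mid Y_1^{i-1}\big)$$
$$= \operatorname*{argmin}_{y\in\mathbb{R}} \int_{-\infty}^\infty P_\theta\big(\mathrm{sign}(Y_i) \ne \mathrm{sign}(y)\big)\,\pi(\theta \mid Y_1^{i-1})\,d\theta. \tag{17.1.5}$$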
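Lemma 17.1.2. Assume that θ has prior $N(0, \tau^2 I_{d\times d})$. Then conditional on $Y_1^i = y_1^i$ and the first
i vectors $x_1^i = (x_1, \ldots, x_i) \subset \mathbb{R}^d$, we have
$$\theta \mid y_1^i, x_1^i \sim N\Big(K_i^{-1}\sum_{j=1}^i x_j y_j,\; K_i^{-1}\Big), \quad\text{where}\quad K_i = \frac{1}{\tau^2}I_{d\times d} + \frac{1}{\sigma^2}\sum_{j=1}^i x_j x_j^\top.$$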
Deferring the proof of Lemma 17.1.2 temporarily, we note that under the distribution Qπ , as
by assumption we have Yi = hθ, xi i + εi , the posterior distribution (under the prior π for θ) on Yi+1
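conditional on $Y_1^i = y_1^i$ and x1 , . . . , xi+1 is
$$Y_{i+1} = \langle\theta, x_{i+1}\rangle + \varepsilon_{i+1} \mid y_1^i, x_1^i \sim N\Big(\Big\langle x_{i+1}, K_i^{-1}\sum_{j=1}^i x_j y_j\Big\rangle,\; x_{i+1}^\top K_i^{-1} x_{i+1} + \sigma^2\Big).$$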
Consequently, if we let $\hat\theta_{i+1}$ be the posterior mean of $\theta \mid y_1^i, x_1^i$ (as given by Lemma 17.1.2), the
optimal prediction (17.1.5) is to choose any $\hat Y_{i+1}$ satisfying $\mathrm{sign}(\hat Y_{i+1}) = \mathrm{sign}(\langle x_{i+1}, \hat\theta_{i+1}\rangle)$. Another
option is to simply play
$$\hat Y_{i+1} = x_{i+1}^\top K_i^{-1}\sum_{j=1}^i y_j x_j, \tag{17.1.6}$$
$$\inf_Q \sup_{\theta\in\Theta} \mathrm{Red}_n(Q, P_\theta) = \inf_Q \sup_{\theta\in\Theta} D_{\rm kl}(P_\theta^n \,\|\, Q) = \sup_\pi I_\pi(T; X_1^n) \le \log|\Theta|,$$
where T ∼ π and conditioned on T = θ we draw X1n ∼ Pθ . (Here we have used that I(T ; X1n ) =
H(T )−H(T | X1n ) ≤ H(T ) ≤ log |Θ|, by definition (2.1.3) of the mutual information.) In particular,
the redundancy is constant for any n.
Now we come to our question: is this possible in a purely sequential case? More precisely,
suppose we wish to predict a sequence of variables yi ∈ {−1, 1}, we have access to a finite collection
of strategies, and we would like to guarantee that we perform as well in prediction as any single
member of this class. Then, while it is not possible to achieve constant regret, it is possible to have
regret that grows only logarithmically in the number of comparison strategies. To establish the
setting, let us denote our collection of strategies, henceforth called “experts”, by $\{x_{i,j}\}_{j=1}^d$, where
i ranges in 1, . . . , n. Then at iteration i of the prediction game, we measure the loss of expert j by
$\ell(x_{i,j}, y)$.
We begin by considering a mixture strategy that would be natural under the logarithmic loss:
we assume the experts play points $x_{i,j} \in [0, 1]$, where $x_{i,j} = P(Y_i = 1)$ according to expert j.
(We remark in passing that while the notation is perhaps not completely explicit about this, the
experts may adapt to the sequence $Y_1^n$.) In this case, the loss we suffer is the usual log loss,
$\ell(x_{i,j}, y) = y\log\frac{1}{x_{i,j}} + (1-y)\log\frac{1}{1-x_{i,j}}$. Now, if we assume we begin with the uniform prior
distribution π(j) = 1/d for all j, then the posterior distribution, denoted by $\pi_j^i = \pi(j \mid Y_1^{i-1})$, is
$$\pi_j^i \propto \pi(j)\prod_{l=1}^i x_{l,j}^{y_l}(1 - x_{l,j})^{1-y_l} = \pi(j)\exp\Big(-\sum_{l=1}^i\Big[y_l\log\frac{1}{x_{l,j}} + (1-y_l)\log\frac{1}{1-x_{l,j}}\Big]\Big)$$
$$= \pi(j)\exp\Big(-\sum_{l=1}^i \ell(x_{l,j}, y_l)\Big).$$
This strategy suggests what is known variously as the multiplicative weights strategy [8], expo-
nentiated gradient descent method [119], or (after some massaging) a method known since the
late 1970s as the mirror descent or non-Euclidean gradient descent method (entropic gradient de-
scent) [142, 22].
In particular, we consider an algorithm for general losses where we fix a stepsize η > 0 (as we cannot
be as aggressive as in the probabilistic setting), and we then weight each of the experts j by
exponentially decaying the weight assigned to the expert for the losses it has suffered. For the algorithm
to work, unfortunately, we need a technical condition on the loss function and experts $x_{i,j}$. This
condition is analogous to a weakened version of exp-concavity, which is a common assumption
in online game playing scenarios (see the logarithmic regret algorithms developed by Hazan et al.
[103], as well as earlier work, for example, that by Kivinen and Warmuth [120] studying regression
problems for which the loss is strongly convex in one variable but not simultaneously in all). In
particular, exp-concavity is the assumption that
$$x \mapsto \exp(-\ell(x, y))$$
is a concave function. Because the exponential of the negative log loss is linear in the prediction, the log loss is obviously
exp-concave, but for alternate losses, we make a slightly weaker assumption. In particular, we
assume there are constants c, η such that for any vector π in the d-simplex (i.e. $\pi \in \mathbb{R}^d_+$ satisfies
$\sum_{j=1}^d \pi_j = 1$) there is some way to choose $\hat y$ so that for any y (that can be played in the game)
$$\exp\Big(-\frac{1}{c}\ell(\hat y, y)\Big) \ge \sum_{j=1}^d \pi_j\exp(-\eta\,\ell(x_{i,j}, y)) \quad\text{or}\quad \ell(\hat y, y) \le -c\log\Big(\sum_{j=1}^d \pi_j\exp(-\eta\,\ell(x_{i,j}, y))\Big). \tag{17.2.1}$$
By inspection, inequality (17.2.1) holds for the log loss with c = η = 1 and the choice
$\hat y = \sum_{j=1}^d \pi_j x_{i,j}$, because of the exp-concavity condition; any exp-concave loss also satisfies
inequality (17.2.1) with c = η = 1 and the choice of the posterior mean $\hat y = \sum_{j=1}^d \pi_j x_{i,j}$. The idea in
this case is that losses satisfying inequality (17.2.1) behave enough like the logarithmic loss that a
Bayesian updating of the experts works. (Condition (17.2.1) originates with the work of Haussler
et al. [102], where they name such losses (c, η)-realizable.)
Example 17.2.1 (Squared error and exp-concavity): Consider the squared error loss
$\ell(\hat y, y) = \frac{1}{2}(\hat y - y)^2$, where $\hat y, y \in \mathbb{R}$. We claim that if $x_j \in [0, 1]$ for each j, π is in the simplex, meaning
$\sum_j \pi_j = 1$ and $\pi_j \ge 0$, and y ∈ [0, 1], then the squared error $\pi \mapsto \ell(\langle\pi, x\rangle, y)$ is exp-concave,
that is, inequality (17.2.1) holds with c = η = 1 and $\hat y = \langle\pi, x\rangle$. Indeed, computing the Hessian
of the exponential, we have
$$\nabla_\pi^2 \exp\Big(-\frac{1}{2}(\langle\pi, x\rangle - y)^2\Big) = \nabla_\pi\Big(-\exp\Big(-\frac{1}{2}(\langle\pi, x\rangle - y)^2\Big)(\langle\pi, x\rangle - y)\,x\Big)$$
$$= \exp\Big(-\frac{1}{2}(\langle\pi, x\rangle - y)^2\Big)\big((\langle\pi, x\rangle - y)^2 - 1\big)\,xx^\top.$$
We can also show that the 0-1 loss satisfies the weakened version of exp-concavity in inequal-
ity (17.2.1), but we have to take the constant c to be larger (or η to be smaller).
Example 17.2.2 (Zero-one loss and weak exp-concavity): Now suppose that we use the
0-1 loss, that is, $\ell_{0\text{-}1}(\hat y, y) = 1\{y \cdot \hat y \le 0\}$. We claim that if we take a weighted majority vote
under the distribution π, meaning that we set $\hat y = \sum_{j=1}^d \pi_j\,\mathrm{sign}(x_j)$ for a vector x ∈ Rd , then
inequality (17.2.1) holds with any c large enough that
$$c^{-1} \le \log\frac{2}{1 + e^{-\eta}}. \tag{17.2.2}$$
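Thus, to attain
$$\ell_{0\text{-}1}(\hat y, y) = 1 \le -c\log\Big(\sum_{j=1}^d \pi_j e^{-\eta\,\ell_{0\text{-}1}(x_j, y)}\Big)$$
it is sufficient that
$$1 \le -c\log\Big(\frac{1 + e^{-\eta}}{2}\Big) \le -c\log\Big(\sum_{j=1}^d \pi_j e^{-\eta\,\ell_{0\text{-}1}(x_j, y)}\Big), \quad\text{or}\quad c^{-1} \le \log\frac{2}{1 + e^{-\eta}}.$$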
3. Choose $\hat y_i$ satisfying (17.2.1) for the weighting $\pi = \pi^i$ and expert values $\{x_{i,j}\}_{j=1}^d$
4. Observe $y_i$ and suffer loss $\ell(\hat y_i, y_i)$
Theorem 17.2.3 (Haussler et al. [102]). Assume condition (17.2.1) holds and that $\hat y_i$ is chosen by
the above scheme. Then for any j ∈ {1, . . . , d} and any sequence $y_1^n \in \mathbb{R}^n$,
$$\sum_{i=1}^n \ell(\hat y_i, y_i) \le c\log d + c\eta\sum_{i=1}^n \ell(x_{i,j}, y_i).$$
Proof This is an argument based on potentials. At each iteration, any loss we suffer implies that
the potential W i must decrease, but it cannot decrease too quickly (as otherwise the individual
predictors xi,j would suffer too much loss). Beginning with condition (17.2.1), we observe that
$$\ell(\hat y_i, y_i) \le -c\log\Big(\sum_{j=1}^d \pi_j^i\exp(-\eta\,\ell(x_{i,j}, y_i))\Big) = -c\log\frac{W^{i+1}}{W^i}$$
where the inequality uses that exp(·) is increasing. As log exp(a) = a, this is the desired result.
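To make Theorem 17.2.3 concrete for the zero-one loss, here is a minimal sketch of the exponentially weighted scheme with a weighted majority vote, as in Example 17.2.2. It assumes the weights take the form $\pi_j^i \propto \exp(-\eta\sum_{l<i}\ell(x_{l,j}, y_l))$, mirroring the posterior formula for the log loss; the function names, the random data, and the choice of η are illustrative and not taken from the notes.

```python
import math
import random

def zero_one(pred, y):
    """0-1 loss 1{pred * y <= 0} for a real prediction and a label in {-1, +1}."""
    return 1 if pred * y <= 0 else 0

def weighted_majority(expert_preds, labels, eta):
    """Exponential weights with a weighted majority vote.

    expert_preds[i][j] in {-1, +1} is expert j's prediction at round i.
    Returns the learner's cumulative loss and each expert's cumulative loss.
    """
    d = len(expert_preds[0])
    cum = [0] * d            # cumulative 0-1 loss of each expert
    learner = 0
    for preds, y in zip(expert_preds, labels):
        base = min(cum)      # shift exponents for numerical stability
        w = [math.exp(-eta * (loss - base)) for loss in cum]
        vote = sum(wj * xj for wj, xj in zip(w, preds)) / sum(w)
        learner += zero_one(vote, y)
        for j, xj in enumerate(preds):
            cum[j] += zero_one(xj, y)
    return learner, cum

def loss_bound(eta, d, expert_loss):
    """Right-hand side c log d + c * eta * loss, with 1/c = log(2 / (1 + e^{-eta}))."""
    c = 1.0 / math.log(2.0 / (1.0 + math.exp(-eta)))
    return c * math.log(d) + c * eta * expert_loss

random.seed(0)
n, d = 500, 8
labels = [random.choice([-1, 1]) for _ in range(n)]
# Hypothetical experts: expert 0 is usually right, the rest guess at random.
expert_preds = [
    [y if (j == 0 and random.random() < 0.9) else random.choice([-1, 1]) for j in range(d)]
    for y in labels
]
eta = math.sqrt(math.log(d) / n)
learner, cum = weighted_majority(expert_preds, labels, eta)
print(learner, min(cum), loss_bound(eta, d, min(cum)))
```

The bound holds for every expert j simultaneously, so the learner's loss should sit below `loss_bound` evaluated at each entry of `cum`, with the best expert giving the tightest comparison.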
We illustrate the theorem by continuing Example 17.2.2, showing how Theorem 17.2.3 gives a
regret guarantee of at most $\sqrt{n\log d}$ for any set of at most d experts and any sequence $y_1^n \in \mathbb{R}^n$
under the zero-one loss.
Example (Example 17.2.2 continued): By substituting the choice $c^{-1} = \log\frac{2}{1+e^{-\eta}}$ into the
regret guarantee of Theorem 17.2.3 (which satisfies inequality (17.2.1) by our guarantee (17.2.2)
from Example 17.2.2), we obtain
$$\sum_{i=1}^n \big[\ell_{0\text{-}1}(\hat y_i, y_i) - \ell_{0\text{-}1}(x_{i,j}, y_i)\big] \le \frac{\log d}{\log\frac{2}{1+e^{-\eta}}} + \frac{\big(\eta - \log\frac{2}{1+e^{-\eta}}\big)\sum_{i=1}^n \ell_{0\text{-}1}(x_{i,j}, y_i)}{\log\frac{2}{1+e^{-\eta}}}.$$
Now, we make an asymptotic expansion to give the basic flavor of the result (this can be made
rigorous, but the heuristic suffices here). First, we note that
$$\log\frac{2}{1 + e^{-\eta}} \approx \frac{\eta}{2} - \frac{\eta^2}{8},$$
and substituting this into the previous display, we have regret guarantee
$$\sum_{i=1}^n \big[\ell_{0\text{-}1}(\hat y_i, y_i) - \ell_{0\text{-}1}(x_{i,j}, y_i)\big] \lesssim \frac{\log d}{\eta} + \eta\sum_{i=1}^n \ell_{0\text{-}1}(x_{i,j}, y_i). \tag{17.2.3}$$
By making the choice $\eta \approx \sqrt{\log d / n}$ and noting that $\ell_{0\text{-}1} \le 1$, we obtain
$$\sum_{i=1}^n \big[\ell_{0\text{-}1}(\hat y_i, y_i) - \ell_{0\text{-}1}(x_{i,j}, y_i)\big] \lesssim \sqrt{n\log d}.$$
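The second-order expansion of $\log\frac{2}{1+e^{-\eta}}$ used in the example can be checked numerically. This is a small illustrative sketch (the helper names are ours): a Taylor expansion at η = 0 gives derivative 1/2 and second derivative −1/4 there, with a vanishing third derivative, so the error should shrink roughly like η⁴.

```python
import math

def inverse_c(eta):
    """The quantity log(2 / (1 + e^{-eta})) that upper-bounds 1/c in (17.2.2)."""
    return math.log(2.0 / (1.0 + math.exp(-eta)))

def quadratic_approx(eta):
    """Second-order Taylor approximation eta/2 - eta^2/8 around eta = 0."""
    return eta / 2.0 - eta**2 / 8.0

for eta in (0.5, 0.1, 0.01):
    err = abs(inverse_c(eta) - quadratic_approx(eta))
    print(eta, inverse_c(eta), quadratic_approx(eta), err)
```

For the step sizes $\eta \approx \sqrt{\log d / n}$ of interest, which are small when n is large, the approximation error is negligible next to the leading terms.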
We make a few remarks on the preceding example to close the chapter. First, ideally we would
like to attain adaptive regret guarantees, meaning that the regret scales with the performance of
the best predictor in inequality (17.2.3). In particular, we might expect that a good expert would
satisfy $\sum_{i=1}^n \ell_{0\text{-}1}(x_{i,j}, y_i) \ll n$. If we could choose
$$\eta \approx \Big(\frac{\log d}{\sum_{i=1}^n \ell_{0\text{-}1}(x_{i,j^*}, y_i)}\Big)^{\frac12},$$
where $j^* = \operatorname*{argmin}_j \sum_{i=1}^n \ell_{0\text{-}1}(x_{i,j}, y_i)$, then we would attain the regret bound
$$\sqrt{\log d \cdot \sum_{i=1}^n \ell_{0\text{-}1}(x_{i,j^*}, y_i)} \ll \sqrt{n\log d}.$$
For results of this form, see, for example, Cesa-Bianchi et al. [48] or the more recent work on mirror
descent of Steinhardt and Liang [162].
Secondly, we note that it is actually possible to give a regret bound of the form (17.2.3) without
relying on the near exp-concavity condition (17.2.1). In particular, performing mirror descent on
the convex losses defined by
$$\pi \mapsto \Big|\sum_{j=1}^d \mathrm{sign}(x_{i,j})\,\pi_j - \mathrm{sign}(y_i)\Big|$$
will give a regret bound of $\sqrt{n\log d}$ for the zero-one loss as well. We leave this
exploration to the interested reader.
Chapter 18
A related notion to the universal prediction problem with alternate losses is that of online learning
and online convex optimization, where we modify the requirements of Chapter 17 further. In the
current setting, we essentially do away with distributional assumptions altogether, including prediction
with a distribution, and we consider the following two player sequential game: we have a space W
in which we (the learner or first player) can play points w1 , w2 , . . ., while nature plays a sequence
of loss functions $\ell_t : W \to \mathbb{R}$. The goal is to guarantee that the regret
$$\sum_{t=1}^n \big[\ell_t(w_t) - \ell_t(w^\star)\big] \tag{18.0.1}$$
grows at most sub-linearly with n, for any $w^\star \in W$ (often, we desire this guarantee to be uniform).
As stated, this goal is too broad, so in this chapter we focus on a few natural restrictions, namely,
that the sequence of losses $\ell_t$ are convex, and W is a convex subset of Rd . In this setting, the
problem (18.0.1) is known as online convex programming.
λw + (1 − λ)w0 ∈ W.
for all λ ∈ [0, 1] and w, w′. The subgradient set, or subdifferential, of a convex function f at the
point w is defined to be
$$\partial f(w) := \big\{g \in \mathbb{R}^d : f(v) \ge f(w) + \langle g, v - w\rangle \text{ for all } v\big\},$$
and we say that any vector g ∈ Rd satisfying f (v) ≥ f (w) + ⟨g, v − w⟩ for all v is a subgradient. For
convex functions, the subdifferential set ∂f (w) is essentially always non-empty for any w ∈ dom f .¹
¹ Rigorously, we are guaranteed that ∂f (w) ≠ ∅ at all points w in the relative interior of the domain of f .
We now give several examples of convex functions, losses, and corresponding subgradients. The
first two examples are for classification problems, in which we receive data points x ∈ Rd and wish
to predict associated labels y ∈ {−1, 1}.
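Example 18.1.1 (Support vector machines): In the support vector machine problem, we
receive data in pairs (xt , yt ) ∈ Rd × {−1, 1}, and the loss function is the hinge loss
$$\ell_t(w) = \max\{0,\, 1 - y_t\langle w, x_t\rangle\},$$
which is convex because it is the maximum of two linear functions. Moreover, the subgradient
set is
$$\partial\ell_t(w) = \begin{cases} -y_t x_t & \text{if } y_t\langle w, x_t\rangle < 1 \\ -\lambda\cdot y_t x_t \text{ for } \lambda \in [0, 1] & \text{if } y_t\langle w, x_t\rangle = 1 \\ 0 & \text{otherwise.} \end{cases}$$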
Example 18.1.2 (Logistic regression): As in the support vector machine, we receive data in
pairs (xt , yt ) ∈ Rd × {−1, 1}, and the loss function is
$$\ell_t(w) = \log\big(1 + \exp(-y_t\langle x_t, w\rangle)\big).$$
To see that this loss is convex, note that if $h(t) = \log(1 + e^t)$, then $h'(t) = \frac{1}{1+e^{-t}}$
and $h''(t) = \frac{e^{-t}}{(1+e^{-t})^2} \ge 0$, and $\ell_t$ is the composition of a linear transformation with h. In this case,
$$\partial\ell_t(w) = \nabla\ell_t(w) = -\frac{1}{1 + e^{y_t\langle x_t, w\rangle}}\,y_t x_t.$$
◇
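The subgradient formulas in Examples 18.1.1 and 18.1.2 translate directly into code. The sketch below assumes the standard hinge loss max{0, 1 − y⟨w, x⟩} and logistic loss log(1 + exp(−y⟨x, w⟩)); the helper names and test vectors are illustrative. It checks the defining subgradient inequality for the hinge loss and compares the logistic gradient with a finite difference.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * dot(w, x))

def hinge_subgradient(w, x, y):
    """One element of the subdifferential; at the kink we take lambda = 0."""
    if y * dot(w, x) < 1:
        return [-y * xi for xi in x]
    return [0.0] * len(x)

def logistic_loss(w, x, y):
    return math.log1p(math.exp(-y * dot(w, x)))

def logistic_gradient(w, x, y):
    scale = -y / (1.0 + math.exp(y * dot(w, x)))
    return [scale * xi for xi in x]

w, x, y = [0.5, -1.0], [1.0, 2.0], 1
g = hinge_subgradient(w, x, y)
# Subgradient inequality: f(v) >= f(w) + <g, v - w> for every v.
for v in ([0.0, 0.0], [1.0, 1.0], [-2.0, 3.0], [3.0, -0.5]):
    lower = hinge_loss(w, x, y) + dot(g, [vi - wi for vi, wi in zip(v, w)])
    print(v, hinge_loss(v, x, y), lower)

# Central finite difference for the first coordinate of the logistic gradient.
eps = 1e-6
fd = (logistic_loss([w[0] + eps, w[1]], x, y) - logistic_loss([w[0] - eps, w[1]], x, y)) / (2 * eps)
print(fd, logistic_gradient(w, x, y)[0])
```

Any choice of λ ∈ [0, 1] at the kink yields a valid subgradient; λ = 0 is used here only for simplicity.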
where we have defined the vector $g_t = [\ell_{0\text{-}1}(x_{t,j}, y_t)]_{j=1}^d \in \{0, 1\}^d$. Notably, the expected zero-
one loss is convex (even linear), so that its online minimization falls into the online convex
programming framework. ◇
As we see in the sequel, online convex programming approaches are often quite simple, and, in
fact, are often provably optimal in a variety of scenarios outside of online convex optimization. This
motivates our study, and we will see that online convex programming approaches have a number of
similarities to our regret minimization approaches in previous chapters on universal coding, regret,
and redundancy.
[Figure: the Bregman divergence $D_\psi(w, v)$, illustrated with the points $(w, \psi(w))$ and $(v, \psi(v))$.]
Example 18.2.2 (KL divergence as a Bregman divergence): Take $\psi(w) = \sum_{j=1}^d w_j \log w_j$.
Then $\psi$ is convex over the positive orthant $\mathbb{R}^d_+$ (the second derivative of $w \log w$ is $1/w$), and
for $w, v \in \Delta_d = \{u \in \mathbb{R}^d_+ : \langle \mathbf{1}, u \rangle = 1\}$, we have
\[ D_\psi(w, v) = \sum_j w_j \log w_j - \sum_j v_j \log v_j - \sum_j (1 + \log v_j)(w_j - v_j) = \sum_j w_j \log \frac{w_j}{v_j} = D_{\rm kl}(w \| v), \]
where in the final equality we treat $w$ and $v$ as probability distributions on $\{1, \ldots, d\}$. ◇
With these examples in mind, we now present the mirror descent algorithm, which is the natural
generalization of online gradient descent.
Before providing the analysis of Algorithm 18.3, we give a few examples of its implementation.
First, by taking $\mathcal{W} = \mathbb{R}^d$ and $\psi(w) = \frac{1}{2}\|w\|_2^2$, we note that the mirror descent procedure simply
corresponds to the gradient update $w_{t+1} = w_t - \eta_t g_t$. We can also recover the exponentiated gradient
algorithm, also known as entropic mirror descent.
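To make the two special cases concrete, the following is a minimal Python sketch of a single mirror descent step under each choice of $\psi$. The function names, the example vectors, and the stepsize are illustrative assumptions, and the entropic step assumes the domain is the probability simplex, so that the update reduces to a multiplicative reweighting followed by normalization.

```python
import math

def euclidean_step(w, g, eta):
    """Gradient step w - eta * g (psi(w) = 0.5 * ||w||_2^2 on all of R^d)."""
    return [wj - eta * gj for wj, gj in zip(w, g)]

def entropic_step(w, g, eta):
    """Exponentiated gradient step on the simplex (psi = negative entropy)."""
    # Shifting g by its minimum leaves the normalized result unchanged
    # and keeps the exponentials in a safe numerical range.
    shift = min(g)
    unnormalized = [wj * math.exp(-eta * (gj - shift)) for wj, gj in zip(w, g)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

w = [0.25, 0.25, 0.25, 0.25]   # uniform starting point on the simplex
g = [1.0, 0.0, 0.5, 0.0]       # a hypothetical subgradient
print(euclidean_step(w, g, eta=0.1))
print(entropic_step(w, g, eta=0.1))
```

The Euclidean step generally leaves the simplex, while the entropic step stays on it and shifts mass away from coordinates with large $g_j$.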
For example, a straightforward calculation shows that the dual to the $\ell_\infty$-norm is the $\ell_1$-norm,
and the Euclidean norm $\|\cdot\|_2$ is self-dual (by the Cauchy-Schwarz inequality). Lastly, we require a
definition of functions of suitable curvature for use in mirror descent methods.
Definition 18.2. A convex function $f : \mathbb{R}^d \to \mathbb{R}$ is strongly convex with respect to the norm $\|\cdot\|$
over the set $\mathcal{W}$ if for all $w, v \in \mathcal{W}$ and $g \in \partial f(w)$ we have
\[ f(v) \ge f(w) + \langle g, v - w \rangle + \frac{1}{2}\|w - v\|^2. \]
That is, the function $f$ is strongly convex if it grows at least quadratically fast at every point in its
domain. It is immediate from the definition of the Bregman divergence that $\psi$ is strongly convex
if and only if
\[ D_\psi(w, v) \ge \frac{1}{2}\|w - v\|^2. \]
As two examples, we consider Euclidean distance and entropy. For the Euclidean distance, which
uses $\psi(w) = \frac{1}{2}\|w\|_2^2$, we have $\nabla\psi(w) = w$, and
\[ \frac{1}{2}\|v\|_2^2 = \frac{1}{2}\|w + v - w\|_2^2 = \frac{1}{2}\|w\|_2^2 + \langle w, v - w \rangle + \frac{1}{2}\|w - v\|_2^2 \]
by a calculation, so that $\psi$ is strongly convex with respect to the Euclidean norm. We also have
the following observation.
Observation 18.2.4. Let $\psi(w) = \sum_j w_j \log w_j$ be the negative entropy. Then $\psi$ is strongly convex
with respect to the $\ell_1$-norm, that is,
\[ D_\psi(w, v) = D_{\rm kl}(w \| v) \ge \frac{1}{2}\|w - v\|_1^2. \]
Proof The result is an immediate consequence of Pinsker's inequality, Proposition 2.2.8.
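As a numerical sanity check of the observation (not a proof), the sketch below draws random pairs of points in the interior of the simplex and evaluates the gap $D_{\rm kl}(w\|v) - \frac{1}{2}\|w - v\|_1^2$, which Pinsker's inequality says is nonnegative. The sampling scheme, dimension, seed, and helper names are assumptions made for illustration.

```python
import math
import random

def kl(w, v):
    """KL divergence between two distributions given as lists of probabilities."""
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v) if wj > 0)

def random_simplex_point(d, rng):
    # Normalized positive draws give a point in the interior of the simplex.
    raw = [rng.expovariate(1.0) + 1e-12 for _ in range(d)]
    total = sum(raw)
    return [r / total for r in raw]

def pinsker_gap(w, v):
    """KL(w || v) minus half the squared l1 distance; nonnegative by Pinsker."""
    l1 = sum(abs(a - b) for a, b in zip(w, v))
    return kl(w, v) - 0.5 * l1 ** 2

rng = random.Random(0)
gaps = [pinsker_gap(random_simplex_point(5, rng), random_simplex_point(5, rng))
        for _ in range(1000)]
print(min(gaps) >= -1e-12)
```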
With these examples in place, we present the main theorem of this section.
Theorem 18.2.5 (Regret of mirror descent). Let $\ell_t$ be an arbitrary sequence of convex functions,
and let $w_t$ be generated according to the mirror descent algorithm 18.3. Assume that the proximal
function $\psi$ is strongly convex with respect to the norm $\|\cdot\|$, which has dual norm $\|\cdot\|_*$. Then
Before proving the theorem, we provide a few comments to exhibit its power. First, we consider
the Euclidean case, where $\psi(w) = \frac{1}{2}\|w\|_2^2$, and we assume that the loss functions $\ell_t$ are all
$L$-Lipschitz, meaning that $|\ell_t(w) - \ell_t(v)| \le L\|w - v\|_2$, which is equivalent to $\|g_t\|_2 \le L$ for all
$g_t \in \partial\ell_t(w)$. In this case, the two regret bounds above become
\[ \frac{1}{2\eta}\|w^\star - w_1\|_2^2 + \frac{\eta}{2} n L^2 \quad \text{and} \quad \frac{1}{2\eta_n} R^2 + \sum_{t=1}^n \frac{\eta_t}{2} L^2, \]
respectively, where in the second case we assumed that $\|w^\star - w_t\|_2 \le R$ for all $t$. In the former
case, we take $\eta = \frac{R}{L\sqrt{n}}$, while in the second, we take $\eta_t = \frac{R}{L\sqrt{t}}$, which does not require knowledge
of $n$ ahead of time. Focusing on the latter case, we have the following corollary.
Corollary 18.2.6. Assume that $\mathcal{W} \subset \{w \in \mathbb{R}^d : \|w\|_2 \le R\}$ and that the loss functions $\ell_t$ are
$L$-Lipschitz with respect to the Euclidean norm. Take $\eta_t = \frac{R}{L\sqrt{t}}$. Then for all $w^\star \in \mathcal{W}$,
\[ \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le 3RL\sqrt{n}. \]
Proof For any $w, w^\star \in \mathcal{W}$, we have $\|w - w^\star\|_2 \le 2R$, so that $D_\psi(w^\star, w) \le 4R^2$. Using that
\[ \sum_{t=1}^n t^{-\frac{1}{2}} \le \int_0^n t^{-\frac{1}{2}} \, dt = 2\sqrt{n} \]
Now that we have presented the Euclidean variant of online convex optimization, we turn to an
example that achieves better performance in high dimensional settings, as long as the domain is
the probability simplex. (Recall Example 18.1.3 for motivation.) In this case, we have the following
corollary to Theorem 18.2.5.
Corollary 18.2.7. Assume that $\mathcal{W} = \Delta_d = \{w \in \mathbb{R}^d_+ : \langle \mathbf{1}, w \rangle = 1\}$ and take the proximal function
$\psi(w) = \sum_j w_j \log w_j$ to be the negative entropy in the mirror descent procedure 18.3. Then with the
fixed stepsize $\eta$ and initial point as the uniform distribution $w_1 = \mathbf{1}/d$, we have for any sequence of
convex losses $\ell_t$
\[ \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le \frac{\log d}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|g_t\|_\infty^2. \]
Proof Using Pinsker's inequality in the form of Observation 18.2.4, we have that $\psi$ is strongly
convex with respect to $\|\cdot\|_1$. Consequently, taking the dual norm to be the $\ell_\infty$-norm, part (a) of
Theorem 18.2.5 shows that
\[ \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le \frac{1}{\eta}\sum_{j=1}^d w_j^\star \log\frac{w_j^\star}{w_{1,j}} + \frac{\eta}{2}\sum_{t=1}^n \|g_t\|_\infty^2. \]
Noting that with $w_1 = \mathbf{1}/d$, we have $D_\psi(w^\star, w_1) \le \log d$ for any $w^\star \in \mathcal{W}$ gives the result.
Corollary 18.2.7 yields somewhat sharper results than Corollary 18.2.6, though in the restricted
setting that $\mathcal{W}$ is the probability simplex in $\mathbb{R}^d$. Indeed, let us assume that the subgradients
$g_t \in [-1, 1]^d$, the hypercube in $\mathbb{R}^d$. In this case, the tightest possible bound on their $\ell_2$-norm is
$\|g_t\|_2 \le \sqrt{d}$, while $\|g_t\|_\infty \le 1$ always. Similarly, if $\mathcal{W} = \Delta_d$, then we are only guaranteed that
$\|w^\star - w_1\|_2 \le 1$. Thus, the best regret guaranteed by the Euclidean case (Corollary 18.2.6) is
\[ \frac{1}{2\eta}\|w^\star - w_1\|_2^2 + \frac{\eta}{2} nd \le \sqrt{nd} \quad \text{with the choice } \eta = \frac{1}{\sqrt{nd}}, \]
while the entropic mirror descent procedure (Alg. 18.3 with $\psi(w) = \sum_j w_j \log w_j$) guarantees
\[ \frac{\log d}{\eta} + \frac{\eta}{2} n \le \sqrt{2n\log d} \quad \text{with the choice } \eta = \frac{\sqrt{2\log d}}{\sqrt{n}}. \tag{18.2.5} \]
The latter guarantee is exponentially better in the dimension. Moreover, the key insight is that
we essentially maintain a "prior," and then perform "Bayesian"-like updating of the posterior
distribution $w_t$ at each time step, exactly as in the setting of redundancy minimization.
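To see the size of the improvement, the sketch below evaluates the two guarantees, $\sqrt{nd}$ and $\sqrt{2n\log d}$, for a few dimensions, and checks that the stepsize $\eta = \sqrt{2\log d / n}$ attains the entropic bound. It assumes $\|g_t\|_\infty \le 1$ as in the discussion above; the values of $n$ and $d$ and the function names are arbitrary illustrations.

```python
import math

def fixed_step_bound(eta, n, d):
    """Entropic bound log(d)/eta + eta*n/2, assuming ||g_t||_inf <= 1."""
    return math.log(d) / eta + eta * n / 2

def euclidean_bound(n, d):
    """Euclidean guarantee sqrt(n*d) on the simplex with gradients in [-1, 1]^d."""
    return math.sqrt(n * d)

def entropic_bound(n, d):
    """Entropic guarantee sqrt(2*n*log d)."""
    return math.sqrt(2 * n * math.log(d))

n = 10_000
for d in (10, 1_000, 100_000):
    eta = math.sqrt(2 * math.log(d) / n)
    print(d, round(euclidean_bound(n, d)), round(entropic_bound(n, d)),
          round(fixed_step_bound(eta, n, d)))
```

The second and third printed columns agree, since the chosen $\eta$ balances the two terms of the entropic bound.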
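Lemma 18.2.9. Let $\ell_t : \mathcal{W} \to \mathbb{R}$ be any sequence of convex loss functions and $\eta_t$ be a non-increasing
sequence, where $\eta_0 = \infty$. Then with the mirror descent strategy (18.2.4), for any $w^\star \in \mathcal{W}$ we have
\[ \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le \sum_{t=1}^n \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) D_\psi(w^\star, w_t) + \sum_{t=1}^n \left[-\frac{1}{\eta_t} D_\psi(w_{t+1}, w_t) + \langle g_t, w_t - w_{t+1} \rangle\right]. \]
Proof Our proof follows by the application of a few key identities. First, we note that by
convexity, we have for any $g_t \in \partial\ell_t(w_t)$ that
\[ \ell_t(w_t) - \ell_t(w^\star) \le \langle g_t, w_t - w^\star \rangle. \tag{18.2.6} \]
Secondly, we have that because $w_{t+1}$ minimizes
\[ \langle g_t, w \rangle + \frac{1}{\eta_t} D_\psi(w, w_t) \]
over $w \in \mathcal{W}$, then Lemma 18.2.8 implies
\[ \langle \eta_t g_t + \nabla\psi(w_{t+1}) - \nabla\psi(w_t), w - w_{t+1} \rangle \ge 0 \quad \text{for all } w \in \mathcal{W}. \tag{18.2.7} \]
Taking $w = w^\star$ in inequality (18.2.7) and making a substitution in inequality (18.2.6), we have
\begin{align*}
\ell_t(w_t) - \ell_t(w^\star) & \le \langle g_t, w_t - w^\star \rangle = \langle g_t, w_{t+1} - w^\star \rangle + \langle g_t, w_t - w_{t+1} \rangle \\
& \le \frac{1}{\eta_t} \langle \nabla\psi(w_{t+1}) - \nabla\psi(w_t), w^\star - w_{t+1} \rangle + \langle g_t, w_t - w_{t+1} \rangle \\
& = \frac{1}{\eta_t} \left[D_\psi(w^\star, w_t) - D_\psi(w^\star, w_{t+1}) - D_\psi(w_{t+1}, w_t)\right] + \langle g_t, w_t - w_{t+1} \rangle \tag{18.2.8}
\end{align*}
where the final equality (18.2.8) follows from algebraic manipulations of $D_\psi(w, w')$. Summing
inequality (18.2.8) gives
\begin{align*}
\sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] & \le \sum_{t=1}^n \frac{1}{\eta_t} \left[D_\psi(w^\star, w_t) - D_\psi(w^\star, w_{t+1}) - D_\psi(w_{t+1}, w_t)\right] + \sum_{t=1}^n \langle g_t, w_t - w_{t+1} \rangle \\
& = \sum_{t=2}^n \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) D_\psi(w^\star, w_t) + \frac{1}{\eta_1} D_\psi(w^\star, w_1) - \frac{1}{\eta_n} D_\psi(w^\star, w_{n+1}) \\
& \qquad + \sum_{t=1}^n \left[-\frac{1}{\eta_t} D_\psi(w_{t+1}, w_t) + \langle g_t, w_t - w_{t+1} \rangle\right]
\end{align*}
as desired.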
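The "algebraic manipulations" behind equality (18.2.8) amount to a three-point identity for Bregman divergences. The sketch below checks it numerically for the negative entropy at three arbitrary positive vectors. The helper names and the test points are assumptions for illustration, and the full Bregman definition is used so that the vectors need not lie on the simplex.

```python
import math
import random

def psi(w):
    """Negative entropy sum_j w_j log w_j (requires positive entries)."""
    return sum(x * math.log(x) for x in w)

def grad_psi(w):
    return [1.0 + math.log(x) for x in w]

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def bregman(w, v):
    """D_psi(w, v) = psi(w) - psi(v) - <grad psi(v), w - v>."""
    diff = [a - b for a, b in zip(w, v)]
    return psi(w) - psi(v) - inner(grad_psi(v), diff)

def three_point_gap(w_star, w_next, w_cur):
    """Left side minus right side of the identity used for (18.2.8)."""
    grad_diff = [p - q for p, q in zip(grad_psi(w_next), grad_psi(w_cur))]
    lhs = inner(grad_diff, [a - b for a, b in zip(w_star, w_next)])
    rhs = (bregman(w_star, w_cur) - bregman(w_star, w_next)
           - bregman(w_next, w_cur))
    return lhs - rhs

rng = random.Random(1)
points = [[rng.uniform(0.1, 2.0) for _ in range(4)] for _ in range(3)]
print(abs(three_point_gap(*points)) < 1e-9)
```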
It remains to use the negative terms $-D_\psi(w_{t+1}, w_t)$ to cancel the gradient terms $\langle g_t, w_t - w_{t+1} \rangle$.
To that end, we recall Definition 18.1 of the dual norm $\|\cdot\|_*$ and the strong convexity assumption
on $\psi$. Using the Fenchel-Young inequality, we have
\[ \langle g_t, w_t - w_{t+1} \rangle \le \|g_t\|_* \|w_t - w_{t+1}\| \le \frac{\eta_t}{2}\|g_t\|_*^2 + \frac{1}{2\eta_t}\|w_t - w_{t+1}\|^2. \]
Now, we use the strong convexity condition, which gives
\[ -\frac{1}{\eta_t} D_\psi(w_{t+1}, w_t) \le -\frac{1}{2\eta_t}\|w_t - w_{t+1}\|^2. \]
Combining the preceding two displays in Lemma 18.2.9 gives the result of Theorem 18.2.5.
As stated, the bound of the proposition does not look substantially more powerful than
Corollary 18.2.7, but a few remarks will exhibit its consequences. We prove the proposition in
Section 18.4.1 to come.
First, we note that because $w_t \in \Delta_d$, we will always have $\sum_j w_{t,j} g_{t,j}^2 \le \|g_t\|_\infty^2$. So certainly
the bound of Proposition 18.4.1 is never worse than that of Corollary 18.2.7. Sometimes this can
be made tighter, however, as exhibited by the next corollary, which applies (for example) to the
experts setting of Example 18.1.3. More specifically, we have $d$ experts, each suffering losses in
$[0, 1]$, and we seek to predict with the best of the $d$ experts.
Corollary 18.4.2. Consider the linear online convex optimization setting, that is, where $\ell_t(w_t) = \langle g_t, w_t \rangle$ for vectors $g_t$, and assume that $g_t \in \mathbb{R}^d_+$ with $\|g_t\|_\infty \le 1$. In addition, assume that we know
an upper bound $L_n^\star$ on $\sum_{t=1}^n \ell_t(w^\star)$. Then taking the stepsize $\eta = \min\{1, \sqrt{\log d} / \sqrt{L_n^\star}\}$, we have
\[ \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le 3 \max\left\{\log d, \sqrt{L_n^\star \log d}\right\}. \]
Note that when $\ell_t(w^\star) = 0$ for all $t$, which corresponds to a perfect expert in Example 18.1.3,
the upper bound becomes constant in $n$, yielding $3 \log d$ as a bound on the regret. Unfortunately,
in our bound of Corollary 18.4.2, we had to assume that we knew ahead of time a bound on the
loss of the best predictor $w^\star$, which is unrealistic in practice. There are a number of techniques for
dealing with such issues, including a standard one in the online learning literature known as the
doubling trick. We explore some in the exercises.
Proof First, we note that $\sum_j w_j g_{t,j}^2 \le \langle w, g_t \rangle$ for any nonnegative vector $w$, as $g_{t,j} \in [0, 1]$. Thus,
Proposition 18.4.1 gives
\[ \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le \frac{\log d}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \langle w_t, g_t \rangle = \frac{\log d}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \ell_t(w_t). \]
\[ y_i = \frac{x_i \exp(-\eta g_i)}{\sum_j x_j \exp(-\eta g_j)}, \]
\[ -\frac{1}{\eta} D_\psi(y, x) + \langle g, x - y \rangle \le \frac{\eta}{2}\sum_{i=1}^d g_i^2 x_i. \]
Deferring the proof of the lemma, we note that it precisely applies to the setting of Lemma 18.2.9.
Indeed, with a fixed stepsize $\eta$, we have
\[ \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le \frac{1}{\eta} D_\psi(w^\star, w_1) + \sum_{t=1}^n \left[-\frac{1}{\eta} D_\psi(w_{t+1}, w_t) + \langle g_t, w_t - w_{t+1} \rangle\right]. \]
Earlier, we used the strong convexity of $\psi$ to eliminate the gradient terms $\langle g_t, w_t - w_{t+1} \rangle$ using the
Bregman divergence $D_\psi$. This time, we use Lemma 18.4.3: setting $y = w_{t+1}$ and $x = w_t$ yields the
bound
\[ \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le \frac{1}{\eta} D_\psi(w^\star, w_1) + \frac{\eta}{2}\sum_{t=1}^n \sum_{i=1}^d g_{t,i}^2 w_{t,i} \]
as desired.
Proof of Lemma 18.4.3 We begin by noting that a direct calculation yields $D_\psi(y, x) = D_{\rm kl}(y \| x) = \sum_i y_i \log\frac{y_i}{x_i}$. Substituting the values for $x$ and $y$ into this expression, we have
\[ \sum_i y_i \log\frac{y_i}{x_i} = \sum_i y_i \log\left(\frac{x_i \exp(-\eta g_i)}{x_i \left(\sum_j \exp(-\eta g_j) x_j\right)}\right) = -\eta \langle g, y \rangle - \sum_i y_i \log\Big(\sum_j x_j e^{-\eta g_j}\Big). \]
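Now we use a Taylor expansion of the function $g \mapsto \log(\sum_j x_j e^{-\eta g_j})$ around the point 0. If we
define the vector $p(g)$ with entries $p_i(g) = x_i e^{-\eta g_i} / \sum_j x_j e^{-\eta g_j}$, then Taylor's theorem gives
\[ \log\Big(\sum_j x_j e^{-\eta g_j}\Big) = \log(\langle \mathbf{1}, x \rangle) - \eta \langle p(0), g \rangle + \frac{\eta^2}{2} g^\top \left(\operatorname{diag}(p(\tilde{g})) - p(\tilde{g}) p(\tilde{g})^\top\right) g, \]
where $\tilde{g} = \lambda g$ for some $\lambda \in [0, 1]$. Noting that $p(0) = x$ and $\langle \mathbf{1}, x \rangle = \langle \mathbf{1}, y \rangle = 1$, we obtain
\[ D_\psi(y, x) = -\eta \langle g, y \rangle - \log(1) + \eta \langle g, x \rangle - \frac{\eta^2}{2} g^\top \left(\operatorname{diag}(p(\tilde{g})) - p(\tilde{g}) p(\tilde{g})^\top\right) g, \]
whence
\[ -\frac{1}{\eta} D_\psi(y, x) + \langle g, x - y \rangle \le \frac{\eta}{2}\sum_{i=1}^d g_i^2 p_i(\tilde{g}). \tag{18.4.1} \]
Using the Fenchel-Young inequality, we have $ab \le \frac{1}{3}|a|^3 + \frac{2}{3}|b|^{3/2}$ for any $a, b$, so $g_i g_j^2 \le \frac{1}{3} g_i^3 + \frac{2}{3} g_j^3$.
This implies that the numerator in our expression for $s'(\lambda)$ is non-positive. Thus we have $s(\lambda) \le s(0) = \sum_{i=1}^d g_i^2 x_i$, which gives the result when combined with inequality (18.4.1).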
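As a numerical companion to the lemma, the sketch below evaluates both sides of the inequality $-\frac{1}{\eta} D_\psi(y, x) + \langle g, x - y \rangle \le \frac{\eta}{2}\sum_i g_i^2 x_i$ for the exponentiated gradient update $y$ of $x$. It assumes, as in the setting above, that $x$ lies in the simplex and $g$ has nonnegative entries; the sampling ranges, the seed, and the function names are illustrative choices.

```python
import math
import random

def eg_update(x, g, eta):
    """y_i proportional to x_i * exp(-eta * g_i)."""
    unnormalized = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def kl(y, x):
    return sum(yi * math.log(yi / xi) for yi, xi in zip(y, x) if yi > 0)

def lemma_sides(x, g, eta):
    """Return (left side, right side) of the inequality for y = eg_update(x, g, eta)."""
    y = eg_update(x, g, eta)
    lhs = -kl(y, x) / eta + sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
    rhs = 0.5 * eta * sum(gi ** 2 * xi for gi, xi in zip(g, x))
    return lhs, rhs

rng = random.Random(0)
holds = True
for _ in range(1000):
    raw = [rng.uniform(0.01, 1.0) for _ in range(6)]
    x = [r / sum(raw) for r in raw]
    g = [rng.uniform(0.0, 5.0) for _ in range(6)]   # nonnegative losses
    eta = rng.uniform(0.01, 3.0)
    lhs, rhs = lemma_sides(x, g, eta)
    holds = holds and lhs <= rhs + 1e-12
print(holds)
```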
Chapter 19
Consider the following problem: we have a possible treatment for a population with a disease, but
we do not know whether the treatment will have a positive effect or not. We wish to evaluate the
treatment to decide whether it is better to apply it or not, and we wish to optimally allocate our
resources to attain the best outcome possible. There are challenges here, however, because for each
patient, we may only observe the patient's behavior and disease status in one of two possible states
(under treatment or under control), and we wish to allocate as few patients as possible to the group
with worse outcomes (be they control or treatment). This balancing act between exploration
(observing the effects of treatment or non-treatment) and exploitation (giving treatment or not as
we decide which has better palliative outcomes) underpins and is the paradigmatic aspect of the
multi-armed bandit problem.¹
¹ The problem is called the bandit problem in the literature because we imagine a player in a casino, choosing
between K different slot machines (hence a K-armed bandit, as this is a casino and the player will surely lose
eventually), each with a different unknown reward distribution. The player wishes to put as much of his money as
possible into the machine with the greatest expected reward.
Our main focus in this chapter is a fairly simple variant of the K-armed bandit problem, though
we note that there is a substantial literature in statistics, operations research, economics, game
theory, and computer science on variants of the problems we consider. In particular, we consider the
following sequential decision making scenario. We assume that there are $K$ distributions $P_1, \ldots, P_K$
on $\mathbb{R}$, which we identify (with no loss of generality) with $K$ random variables $Y_1, \ldots, Y_K$. Each
random variable $Y_i$ has mean $\mu_i$ and is $\sigma^2$-sub-Gaussian, meaning that
\[ \mathbb{E}\left[\exp\left(\lambda(Y_i - \mu_i)\right)\right] \le \exp\left(\frac{\lambda^2\sigma^2}{2}\right). \tag{19.0.1} \]
The goal is to find the index $i$ with the maximal mean $\mu_i$ without evaluating sub-optimal "arms"
(or random variables $Y_i$) too often. At each iteration $t$ of the process, the player takes an action
$A_t \in \{1, \ldots, K\}$, then, conditional on $i = A_t$, observes a reward $Y_i(t)$ drawn independently from
the distribution $P_i$. Then the goal is to minimize the regret after $n$ steps, which is
\[ \mathrm{Reg}_n := \sum_{t=1}^n \left(\mu_{i^\star} - \mu_{A_t}\right), \tag{19.0.2} \]
where $i^\star \in \operatorname{argmax}_i \mu_i$ so $\mu_{i^\star} = \max_i \mu_i$. The regret $\mathrm{Reg}_n$ as defined is a random quantity, so we
generally seek to give bounds on its expectation or high-probability guarantees on its value. In this
chapter, we generally focus for simplicity on the expected regret,
\[ \mathrm{Reg}_n := \mathbb{E}\left[\sum_{t=1}^n \left(\mu_{i^\star} - \mu_{A_t}\right)\right], \tag{19.0.3} \]
where the expectation is taken over any randomness in the player's actions $A_t$ and in the repeated
observations of the random variables $Y_1, \ldots, Y_K$.
to be the running average of the rewards of arm $i$ at time $t$ (computed only on those instances in
which arm $i$ was selected), we claim that for all $i$ and all $t$,
\[ \mathbb{P}\left(\hat{\mu}_i(t) \ge \mu_i + \sqrt{\frac{\sigma^2 \log\frac{1}{\delta}}{T_i(t)}}\right) \vee \mathbb{P}\left(\hat{\mu}_i(t) \le \mu_i - \sqrt{\frac{\sigma^2 \log\frac{1}{\delta}}{T_i(t)}}\right) \le \delta. \tag{19.1.1} \]
That is, so long as we pull the arms sufficiently many times, we are unlikely to pull the wrong arm.
We prove the claim (19.1.1) in the appendix to this chapter.
Here then is the UCB procedure:
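A minimal Python sketch of an upper-confidence-bound rule of this kind follows. It plays the arm maximizing $\hat{\mu}_i(t) + \sqrt{\sigma^2 \log(1/\delta_t) / T_i(t)}$, the index that appears in the analysis below. Several details are assumptions made for illustration rather than part of the procedure as stated: each arm is pulled once at the start so that $T_i(t) \ge 1$, rewards are Gaussian, $\delta_t = 1/t^2$, and the names `ucb` and `ucb_index` are hypothetical.

```python
import math
import random

def ucb_index(total_reward, pulls, t, sigma):
    """Empirical mean plus confidence width sqrt(sigma^2 * log(1/delta_t) / T_i)."""
    delta_t = 1.0 / t ** 2
    return total_reward / pulls + math.sqrt(sigma ** 2 * math.log(1.0 / delta_t) / pulls)

def ucb(means, n, sigma=1.0, seed=0):
    """Run n rounds on Gaussian arms with the given means; return pull counts T_i(n)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1   # assumed initialization: pull every arm once
        else:
            arm = max(range(K), key=lambda i: ucb_index(sums[i], counts[i], t, sigma))
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
    return counts

means = [0.0, 0.5, 1.0]
counts = ucb(means, n=2000)
# Realized (pseudo-)regret: sum of gap times number of pulls.
regret = sum((max(means) - m) * c for m, c in zip(means, counts))
print(counts, regret)
```

The final line uses the identity that the regret equals the gap-weighted number of pulls of each arm.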
If we define
\[ \Delta_i := \mu_{i^\star} - \mu_i \]
to be the gap in means between the optimal arm and any sub-optimal arm, we then obtain the
following guarantee on the expected number of pulls of any sub-optimal arm $i$ after $n$ steps.
Proposition 19.1.1. Assume that each of the $K$ arms is $\sigma^2$-sub-Gaussian and let the sequence
$\delta_1 \ge \delta_2 \ge \cdots$ be non-increasing and positive. Then for any $n$ and any arm $i \neq i^\star$,
\[ \mathbb{E}[T_i(n)] \le \left\lceil \frac{4\sigma^2 \log\frac{1}{\delta_n}}{\Delta_i^2} \right\rceil + 2\sum_{t=2}^n \delta_t. \]
Proof Without loss of generality, we assume arm 1 satisfies $\mu_1 = \max_i \mu_i$, and let arm $i$ be any
sub-optimal arm. The key insight is to carefully consider what occurs if we play arm $i$ in the UCB
procedure of Figure 19.1. In particular, if we play arm $i$ at time $t$, then we certainly have
\[ \hat{\mu}_i(t) + \sqrt{\frac{\sigma^2 \log\frac{1}{\delta_t}}{T_i(t)}} \ge \hat{\mu}_1(t) + \sqrt{\frac{\sigma^2 \log\frac{1}{\delta_t}}{T_1(t)}}. \]
For this to occur, at least one of the following three events must occur (we suppress the dependence
on $i$ for each of them):
\[ E_1(t) := \left\{\hat{\mu}_i(t) \ge \mu_i + \sqrt{\frac{\sigma^2 \log\frac{1}{\delta_t}}{T_i(t)}}\right\}, \quad E_2(t) := \left\{\hat{\mu}_1(t) \le \mu_1 - \sqrt{\frac{\sigma^2 \log\frac{1}{\delta_t}}{T_1(t)}}\right\}, \]
\[ E_3(t) := \left\{\Delta_i \le 2\sqrt{\frac{\sigma^2 \log\frac{1}{\delta_t}}{T_i(t)}}\right\}. \]
Indeed, suppose that none of the events $E_1, E_2, E_3$ occur at time $t$. Then we have
\[ \hat{\mu}_i(t) + \sqrt{\frac{\sigma^2 \log\frac{1}{\delta_t}}{T_i(t)}} < \mu_i + 2\sqrt{\frac{\sigma^2 \log\frac{1}{\delta_t}}{T_i(t)}} < \mu_i + \Delta_i = \mu_1 < \hat{\mu}_1(t) + \sqrt{\frac{\sigma^2 \log\frac{1}{\delta_t}}{T_1(t)}}, \]
then to have $T_i(t) > l^\star$ it must be the case that $E_3(t)$ cannot occur; that is, we would have
\[ 2\sqrt{\sigma^2 \log\tfrac{1}{\delta_t} / T_i(t)} < 2\sqrt{\sigma^2 \log\tfrac{1}{\delta_t} / l^\star} \le \Delta_i. \]
Thus we have
\begin{align*}
\mathbb{E}[T_i(n)] = \sum_{t=1}^n \mathbb{E}[1\{A_t = i\}] & \le l^\star + \sum_{t=l^\star+1}^n \mathbb{P}(A_t = i, E_3(t) \text{ fails}) \\
& \le l^\star + \sum_{t=l^\star+1}^n \mathbb{P}(E_1(t) \text{ or } E_2(t)) \le l^\star + \sum_{t=l^\star+1}^n 2\delta_t.
\end{align*}
Naturally, the number of times arm $i$ is selected in the sequential game is related to the regret
of a procedure; indeed, we have
\[ \mathrm{Reg}_n = \sum_{t=1}^n (\mu_{i^\star} - \mu_{A_t}) = \sum_{i=1}^K (\mu_{i^\star} - \mu_i) T_i(n) = \sum_{i=1}^K \Delta_i T_i(n). \]
Using this identity, we immediately obtain two theorems on the (expected) regret of the UCB
algorithm.
Theorem 19.1.2. Let $\delta_t = \delta/t^2$ for all $t$. Then for any $n \in \mathbb{N}$ the UCB algorithm attains
\[ \mathrm{Reg}_n \le \sum_{i \neq i^\star} \frac{4\sigma^2 [2\log n - \log\delta]}{\Delta_i} + \frac{\pi^2 - 2}{3}\Big(\sum_{i=1}^K \Delta_i\Big)\delta + \sum_{i=1}^K \Delta_i. \]
by Proposition 19.1.1. Summing over $i \neq i^\star$ and noting that $\sum_{t \ge 2} t^{-2} = \pi^2/6 - 1$ gives the result.
Let us unpack the bound of Theorem 19.1.2 slightly. First, we make the simplifying assumption
that $\delta_t = 1/t^2$ for all $t$, and let $\Delta = \min_{i \neq i^\star} \Delta_i$. In this case, we have expected regret bounded by
\[ \mathrm{Reg}_n \le 8\frac{K\sigma^2 \log n}{\Delta} + \frac{\pi^2 + 1}{3}\sum_{i=1}^K \Delta_i. \]
So we see that the asymptotic regret with this choice of $\delta$ scales as $(K\sigma^2/\Delta)\log n$, roughly linear
in the number of arms, logarithmic in $n$, and inversely proportional to the gap in means. As a concrete
example, if we know that the rewards for each arm $Y_i$ belong to the interval $[0, 1]$, then Hoeffding's
lemma (recall Example 4.1.6) states that we may take $\sigma^2 = 1/4$. Thus the mean regret becomes at
most $\sum_{i : \Delta_i > 0} \frac{2\log n}{\Delta_i}(1 + o(1))$, where the $o(1)$ term tends to zero as $n \to \infty$.
If we knew a bit more about our problem, then by optimizing over $\delta$ and choosing $\delta = \sigma^2/\Delta$,
we obtain the upper bound
\[ \mathrm{Reg}_n \le O(1)\left[\frac{K\sigma^2}{\Delta}\log\frac{n\Delta}{\sigma^2} + K\frac{\max_i \Delta_i}{\min_i \Delta_i}\right], \tag{19.1.2} \]
that is, the expected regret scales asymptotically as $(K\sigma^2/\Delta)\log(\frac{n\Delta}{\sigma^2})$: linearly in the number of
arms, logarithmically in $n$, and inversely proportional to the gap between the largest and other
means.
If any of the gaps $\Delta_i \to 0$ in the bound of Theorem 19.1.2, the bound becomes vacuous: it
simply says that the regret is upper bounded by infinity. Intuitively, however, pulling a slightly
sub-optimal arm should be insignificant for the regret. With that in mind, we present a slight
variant of the above bounds, which has a worse scaling with $n$ (the bound scales as $\sqrt{n}$ rather than
$\log n$) but is independent of the gaps $\Delta_i$.
Theorem 19.1.3. If UCB is run with parameter $\delta_t = 1/t^2$, then
\[ \mathrm{Reg}_n \le \sqrt{8K\sigma^2 n\log n} + 4\sum_{i=1}^K \Delta_i. \]
Proof Fix any $\gamma > 0$. Then we may write the regret with the standard identity
\[ \mathrm{Reg}_n = \sum_{i \neq i^\star} \Delta_i T_i(n) = \sum_{i : \Delta_i \ge \gamma} \Delta_i T_i(n) + \sum_{i : \Delta_i < \gamma} \Delta_i T_i(n) \le \sum_{i : \Delta_i \ge \gamma} \Delta_i T_i(n) + n\gamma, \]
where the final inequality uses that certainly $\sum_{i=1}^K T_i(n) \le n$. Taking expectations with our UCB
procedure and $\delta = 1$, we have by Theorem 19.1.2 that
\[ \mathrm{Reg}_n \le \sum_{i : \Delta_i \ge \gamma} \Delta_i \frac{8\sigma^2\log n}{\Delta_i^2} + \frac{\pi^2 + 1}{3}\sum_{i=1}^K \Delta_i + n\gamma \le K\frac{8\sigma^2\log n}{\gamma} + n\gamma + \frac{\pi^2 + 1}{3}\sum_{i=1}^K \Delta_i. \]
Optimizing over $\gamma$ by taking $\gamma = \frac{\sqrt{8K\sigma^2\log n}}{\sqrt{n}}$ gives the result.
Combining the above two theorems, we see that the UCB algorithm with parameters $\delta_t = 1/t^2$
automatically achieves the expected regret guarantee
\[ \mathrm{Reg}_n \le C \cdot \min\left\{\sum_{i : \Delta_i > 0} \frac{\sigma^2\log n}{\Delta_i}, \sqrt{K\sigma^2 n\log n}\right\}. \tag{19.1.3} \]
That is, UCB enjoys some adaptive behavior. It is not, however, optimal; there are algorithms,
including Audibert and Bubeck's MOSS (Minimax Optimal in the Stochastic Case) bandit procedure [11],
which achieve regret
\[ \mathrm{Reg}_n \le C \cdot \min\left\{\sqrt{Kn}, \frac{K}{\Delta}\log\frac{n\Delta^2}{K}\right\}, \]
which is essentially the bound specified by inequality (19.1.2) (which required knowledge of the
$\Delta_i$s) and an improvement by $\log n$ over the analysis of Theorem 19.1.3. It is also possible to provide
a high-probability guarantee for the UCB algorithms, which follows essentially immediately from
the proof techniques of Proposition 19.1.1, but we leave this to the interested reader.
Example 19.2.1 (Classical Bernoulli bandit problem): The classical bandit problem, as in the
UCB case of the previous section, has actions (arms) $\mathcal{A} = \{1, \ldots, K\}$, and the parameter space
$\Theta = [0, 1]^K$, and we have that $P_\theta$ is a distribution on $Y \in \{0, 1\}^K$, where $Y$ has independent
coordinates $1, \ldots, K$ with $P(Y_j = 1) = \theta_j$, that is, $Y_j \sim \mathrm{Bernoulli}(\theta_j)$. The goal is to find the
arm with highest mean reward, that is, $\operatorname{argmax}_j \theta_j$, and thus possible loss functions include
$\ell(a, \theta) = -\theta_a$ or, if we wish the loss to be positive, $\ell(a, \theta) = 1 - \theta_a \in [0, 1]$. ◇
Lastly, in this Bayesian setting, we require a prior distribution $\pi$ on the space $\Theta$, where $\pi(\Theta) = 1$.
We then define the Bayesian regret as
\[ \mathrm{Reg}_n(\mathcal{A}, \ell, \pi) = \mathbb{E}_\pi\left[\sum_{t=1}^n \left(\ell(A_t, \theta) - \ell(A^\star, \theta)\right)\right], \tag{19.2.1} \]
where $A^\star \in \operatorname{argmin}_{a \in \mathcal{A}} \ell(a, \theta)$ is the minimizer of the loss, and $A_t \in \mathcal{A}$ is the action the player takes
at time $t$ of the process. The expectation (19.2.1) is taken both over the randomness in $\theta$ according
to the prior $\pi$ and any randomness in the player's strategy for choosing the actions $A_t$ at each time.
Our approaches in this section build off of those in Chapter 16, except that we no longer fully
observe the desired observations Y —we may only observe YAt (t) at time t, which may provide less
information. The broad algorithmic framework for this section is as follows. We now give several
concrete instantiations of this broad procedure, as well as tools (both information-theoretic and
otherwise) for its analysis.
Example 19.2.2 (Thompson sampling with Bernoulli penalties): Let us suppose that the
vector $\theta \in [0, 1]^K$, and we draw $\theta_i \sim \mathrm{Beta}(1, 1)$, which corresponds to the uniform distribution
on $[0, 1]^K$. The actions available are simply to select one of the coordinates, $a \in \mathcal{A} = \{1, \ldots, K\}$,
and we observe $Y_a \sim \mathrm{Bernoulli}(\theta_a)$, that is, $\mathbb{P}(Y_a = 1 \mid \theta) = \theta_a$. That is, $\ell(a, \theta) = \theta_a$. Let
$T_a^1(t) = \mathrm{card}\{\tau \le t : A_\tau = a, Y_a(\tau) = 1\}$ be the number of times arm $a$ is pulled and results in
a loss of 1 by time $t$, and similarly let $T_a^0(t) = \mathrm{card}\{\tau \le t : A_\tau = a, Y_a(\tau) = 0\}$. Then, recalling
Example 16.4.2 on Beta-Bernoulli distributions, Thompson sampling proceeds as follows:
(1) For each arm $a \in \mathcal{A} = \{1, \ldots, K\}$, draw $\theta_a(t) \sim \mathrm{Beta}(1 + T_a^1(t), 1 + T_a^0(t))$.
(2) Play the action $A_t = \operatorname{argmin}_a \theta_a(t)$.
(3) Observe the loss $Y_{A_t}(t) \in \{0, 1\}$, and increment the appropriate count.
Thompson sampling is simple in this case, and it is implementable with just a few counters. ◇
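Steps (1) through (3) can be sketched in a few lines of Python using only two counters per arm. The simulation below is an illustration under stated assumptions: the true loss probabilities `theta`, the horizon, and the seed are hypothetical inputs, and `random.betavariate` from the standard library stands in for the posterior draw.

```python
import random

def thompson_bernoulli(theta, n, seed=0):
    """Thompson sampling for Bernoulli losses; theta[a] = P(loss of arm a is 1)."""
    rng = random.Random(seed)
    K = len(theta)
    ones = [0] * K    # T_a^1: pulls of arm a that produced loss 1
    zeros = [0] * K   # T_a^0: pulls of arm a that produced loss 0
    pulls = [0] * K
    for _ in range(n):
        # (1) Draw one sample per arm from its Beta posterior.
        samples = [rng.betavariate(1 + ones[a], 1 + zeros[a]) for a in range(K)]
        # (2) Play the arm whose sampled loss probability is smallest.
        arm = min(range(K), key=lambda a: samples[a])
        # (3) Observe a Bernoulli loss and increment the matching counter.
        if rng.random() < theta[arm]:
            ones[arm] += 1
        else:
            zeros[arm] += 1
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.2, 0.5, 0.8], n=2000)
print(pulls)
```

In this simulated instance the arm with the smallest loss probability should receive most of the pulls.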
We may extend Example 19.2.2 to the case in which the losses come from any distribution with
mean θi , so long as the distribution is supported on [0, 1]. In particular, we have the following
example.
Example 19.2.3 (Thompson sampling with bounded random losses): Let us again consider
the setting of Example 19.2.2, except that the observed losses $Y_a(t) \in [0, 1]$ with $\mathbb{E}[Y_a \mid \theta] = \theta_a$.
The following modification allows us to perform Thompson sampling in this case, even without
knowing the distribution of $Y_a \mid \theta$: instead of observing a loss $Y_a \in \{0, 1\}$, we construct a
random observation $\tilde{Y}_a \in \{0, 1\}$ with the property that $\mathbb{P}(\tilde{Y}_a = 1 \mid Y_a) = Y_a$. Then the losses
$\ell(a, \theta) = \theta_a$ are identical, and the posterior distribution over $\theta$ is still a Beta distribution. We
simply redefine
$T_a^0(t) := \mathrm{card}\{\tau \le t : A_\tau = a, \tilde{Y}_a(\tau) = 0\}$ and $T_a^1(t) := \mathrm{card}\{\tau \le t : A_\tau = a, \tilde{Y}_a(\tau) = 1\}$.
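The randomized rounding in this example is a one-line operation. The sketch below, with the hypothetical helper name `binarize`, draws $\tilde{Y} \in \{0, 1\}$ with $\mathbb{P}(\tilde{Y} = 1 \mid Y) = Y$ and checks empirically that the average of many draws is close to $Y$; the value 0.3 and the number of draws are arbitrary.

```python
import random

def binarize(y, rng):
    """Return 1 with probability y and 0 otherwise, for a loss y in [0, 1]."""
    return 1 if rng.random() < y else 0

rng = random.Random(0)
draws = [binarize(0.3, rng) for _ in range(20000)]
print(sum(draws) / len(draws))
```

The binarized loss can then be fed to the counters $T_a^0$ and $T_a^1$ exactly as in the Bernoulli case.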
Our first analysis shows that Thompson sampling can guarantee performance similar to (or
in some cases, better than) confidence-based procedures, which we do by using a sequence of
(potential) lower and upper bounds on the losses of actions. (Recall we wish to minimize our
losses, so that we would optimistically play those arms with the lowest estimated loss.) This
analysis is based on that of Russo and Van Roy [155]. Let $L_t : \mathcal{A} \to \mathbb{R}$ and $U_t : \mathcal{A} \to \mathbb{R}$ be an
arbitrary sequence of (random) functions that are measurable with respect to $\mathcal{H}_{t-1}$, that is, they
are constructed based only on $\{A_1, Y_{A_1}(1), \ldots, A_{t-1}, Y_{A_{t-1}}(t-1)\}$. Then we can decompose the
Bayesian regret (19.2.1) as
\begin{align*}
\mathrm{Reg}_n(\mathcal{A}, \ell, \pi) & = \sum_{t=1}^n \mathbb{E}_\pi\left[\ell(A_t, \theta) - \ell(A^\star, \theta)\right] \tag{19.2.2} \\
& = \sum_{t=1}^n \mathbb{E}_\pi[U_t(A_t) - L_t(A_t)] + \sum_{t=1}^n \mathbb{E}_\pi[\ell(A_t, \theta) - U_t(A_t)] + \sum_{t=1}^n \mathbb{E}_\pi[L_t(A_t) - \ell(A^\star, \theta)] \\
& \stackrel{(i)}{=} \sum_{t=1}^n \mathbb{E}_\pi[U_t(A_t) - L_t(A_t)] + \sum_{t=1}^n \mathbb{E}_\pi[\ell(A_t, \theta) - U_t(A_t)] + \sum_{t=1}^n \mathbb{E}_\pi[L_t(A_t^\star) - \ell(A_t^\star, \theta)],
\end{align*}
where in equality (i) we used that conditional on $\mathcal{H}_{t-1}$, $A_t$ and $A_t^\star = A^\star$ have the same distribution,
as we sample from the posterior $\pi(\theta \mid \mathcal{H}_{t-1})$, and $L_t$ is a function of $\mathcal{H}_{t-1}$. With the decomposition
(19.2.2) at hand, we may now provide an expected regret bound for Thompson (or posterior)
sampling. We remark that the behavior of Thompson sampling is independent of these upper and
lower bounds $U_t, L_t$ we have chosen; they are simply an artifact to make analysis easier.
Theorem 19.2.4. Suppose that conditional on the choice of action $A_t = a$, the received loss $Y_a(t)$
is $\sigma^2$-sub-Gaussian with mean $\ell(a, \theta)$, that is,
\[ \mathbb{E}\left[\exp\left(\lambda(Y_a(t) - \ell(a, \theta))\right) \mid \mathcal{H}_{t-1}\right] \le \exp\left(\frac{\lambda^2\sigma^2}{2}\right) \quad \text{for all } a \in \mathcal{A}. \]
Proof We choose the upper and lower bound functions somewhat carefully so as to get a fairly
sharp regret guarantee. In particular, we (as in our analysis of the UCB algorithm) let $\delta \in (0, 1)$
and define $T_a(t) := \mathrm{card}\{\tau \le t : A_\tau = a\}$ to be the number of times that action $a$ has been chosen
by iteration $t$. Then we define the mean loss for action $a$ at time $t$ by
\[ \hat{\ell}_a(t) := \frac{1}{T_a(t)}\sum_{\tau \le t, A_\tau = a} Y_a(\tau) \]
With these choices, we see that by the extension of the sub-Gaussian concentration bound (19.1.1)
and the equality (19.5.1) showing that the sum $\sum_{\tau \le t, A_\tau = a} Y_a(\tau)$ is equal in distribution to the sum
$\sum_{\tau \le t, A_\tau = a} Y_a'(\tau)$, where $Y_a'(\tau)$ are independent and identically distributed copies of $Y_a(\tau)$, we have
for any $\epsilon \ge 0$ that
\[ \mathbb{P}(U_t(a) \le \ell(a, \theta) - \epsilon \mid T_a(t)) \le \exp\left(-\frac{T_a(t)}{2\sigma^2}\left(\sqrt{\frac{2\sigma^2\log\frac{1}{\delta}}{T_a(t)}} + \epsilon\right)^2\right) \le \exp\left(-\log\frac{1}{\delta} - \frac{T_a(t)\epsilon^2}{2\sigma^2}\right), \tag{19.2.3} \]
where the final inequality uses that $(a + b)^2 \ge a^2 + b^2$ for $ab \ge 0$. We have an identical bound for
$\mathbb{P}(L_t(a) \ge \ell(a, \theta) + \epsilon \mid T_a(t))$.
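We may now bound the final two sums in the regret expansion (19.2.2) using inequality (19.2.3).
First, however, we make the observation that for any nonnegative random variable $Z$, we have
$\mathbb{E}[Z] = \int_0^\infty \mathbb{P}(Z \ge \epsilon)\,d\epsilon$. Using this, we have
\begin{align*}
\sum_{t=1}^n \mathbb{E}_\pi[\ell(A_t, \theta) - U_t(A_t)] & \le \sum_{t=1}^n \sum_{a \in \mathcal{A}} \mathbb{E}_\pi\left[[\ell(a, \theta) - U_t(a)]_+\right] \\
& = \sum_{t=1}^n \sum_{a \in \mathcal{A}} \mathbb{E}_\pi\left[\int_0^\infty \mathbb{P}(U_t(a) \le \ell(a, \theta) - \epsilon \mid T_a(t))\,d\epsilon\right] \\
& \stackrel{(i)}{\le} \delta\sum_{t=1}^n \sum_{a \in \mathcal{A}} \mathbb{E}_\pi\left[\int_0^\infty \exp\left(-\frac{T_a(t)\epsilon^2}{2\sigma^2}\right)d\epsilon\right] \stackrel{(ii)}{=} \delta\sum_{t=1}^n \sum_{a \in \mathcal{A}} \mathbb{E}_\pi\left[\sqrt{\frac{\pi\sigma^2}{2T_a(t)}}\right],
\end{align*}
where inequality (i) uses the bound (19.2.3) and equality (ii) uses that this is the integral of half
of a normal density. Substituting this bound, as well as the identical one for the terms involving
$L_t(A_t^\star)$, into the decomposition (19.2.2) yields
\[ \mathrm{Reg}_n(\mathcal{A}, \ell, \pi) \le \sum_{t=1}^n \mathbb{E}_\pi[U_t(A_t) - L_t(A_t)] + \delta\sum_{t=1}^n \sum_{a \in \mathcal{A}} \mathbb{E}_\pi\left[\sqrt{\frac{2\pi\sigma^2}{T_a(t)}}\right]. \]
Using that $T_a(t) \ge 1$ for each action $a$, we have $\sum_{a \in \mathcal{A}} \mathbb{E}_\pi[\sqrt{2\pi\sigma^2/T_a(t)}] < 3\sigma|\mathcal{A}|$. Lastly, we use
that
\[ U_t(A_t) - L_t(A_t) = 2\sqrt{\frac{2\sigma^2\log\frac{1}{\delta}}{T_{A_t}(t)}}. \]
Thus we have
\[ \sum_{t=1}^n \mathbb{E}_\pi[U_t(A_t) - L_t(A_t)] = 2\sqrt{2\sigma^2\log\frac{1}{\delta}}\sum_{a \in \mathcal{A}} \mathbb{E}_\pi\left[\sum_{t : A_t = a} \frac{1}{\sqrt{T_a(t)}}\right]. \]
Once we see that $\sum_{t=1}^T t^{-\frac{1}{2}} \le \int_0^T t^{-\frac{1}{2}}\,dt = 2\sqrt{T}$, we have the upper bound
\[ \mathrm{Reg}_n(\mathcal{A}, \ell, \pi) \le 4\sqrt{2\sigma^2\log\frac{1}{\delta}}\sum_{a \in \mathcal{A}} \mathbb{E}_\pi\left[\sqrt{T_a(n)}\right] + 3n\delta\sigma|\mathcal{A}|. \]
As $\sum_{a \in \mathcal{A}} T_a(n) = n$, the Cauchy-Schwarz inequality implies $\sum_{a \in \mathcal{A}} \sqrt{T_a(n)} \le \sqrt{|\mathcal{A}| n}$, which gives
the result.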
An immediate corollary of Theorem 19.2.4 is the following result, which applies in the case of
bounded losses $Y_a$ as in Examples 19.2.2 and 19.2.3.
Corollary 19.2.5. Let the losses $Y_a \in [0, 1]$ with $\mathbb{E}[Y_a \mid \theta] = \theta_a$, where $\theta_i \stackrel{\rm iid}{\sim} \mathrm{Beta}(1, 1)$ for
$i = 1, \ldots, K$. Then Thompson sampling satisfies
\[ \mathrm{Reg}_n(\mathcal{A}, \ell, \pi) \le 3\sqrt{Kn\log n} + \frac{3}{2}K. \]
Then $\mathbb{E}[\tilde{Y} \mid Y] = Y$.
Proof The proof is immediate: for each coordinate $j$ of $\tilde{Y}$, we have $\mathbb{E}[\tilde{Y}_j \mid Y] = w_j Y_j / w_j = Y_j$.
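The calculation in the proof can be reproduced exactly by enumerating the sampled coordinate. The sketch below assumes the estimator takes the importance-weighted form $\tilde{Y}_j = Y_j 1\{A = j\} / w_j$ with $A$ drawn from $w$, which is the form suggested by the identity $w_j Y_j / w_j = Y_j$. The vectors `y` and `w` and the function names are illustrative.

```python
def importance_weighted_estimate(y, w, arm):
    """Estimate of y that uses only coordinate `arm`, rescaled by 1 / w[arm]."""
    return [y[j] / w[j] if j == arm else 0.0 for j in range(len(y))]

def expected_estimate(y, w):
    """Exact expectation over arm ~ w of the importance-weighted estimate."""
    K = len(y)
    total = [0.0] * K
    for arm in range(K):
        estimate = importance_weighted_estimate(y, w, arm)
        for j in range(K):
            total[j] += w[arm] * estimate[j]
    return total

y = [0.2, 0.9, 0.5]     # hypothetical losses
w = [0.5, 0.3, 0.2]     # sampling distribution with positive entries
print(expected_estimate(y, w))
```

Each coordinate of the expectation matches `y` up to floating-point rounding, provided every $w_j$ is positive.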
Lemma 19.3.1 suggests the following procedure, which gives rise to (a variant of) Auer et al.'s
EXP3 (Exponentiated gradient for Exploration and Exploitation) algorithm [13]. We can prove
the following bound on the expected regret of the EXP3 Algorithm 19.3 by leveraging our refined
analysis of exponentiated gradients in Proposition 18.4.1.
[Figure 19.3: the EXP3 algorithm, with exponentiated gradient update $w_{t+1,i} = \frac{w_{t,i}\exp(-\eta g_{t,i})}{\sum_j w_{t,j}\exp(-\eta g_{t,j})}$.]
Proposition 19.3.2. Assume that for each $j$, we have $\mathbb{E}[Y_j^2] \le \sigma^2$ and the observed loss $Y_j \ge 0$.
Then Alg. 19.3 attains expected regret (we are minimizing)
\[ \mathrm{Reg}_n = \sum_{t=1}^n \mathbb{E}[\mu_{A_t} - \mu_{i^\star}] \le \frac{\log K}{\eta} + \frac{\eta}{2}\sigma^2 Kn. \]
In particular, choosing $\eta = \sqrt{\log K / (K\sigma^2 n)}$ gives
\[ \mathrm{Reg}_n = \sum_{t=1}^n \mathbb{E}[\mu_{A_t} - \mu_{i^\star}] \le \frac{3}{2}\sigma\sqrt{Kn\log K}. \]
Proof With Lemma 19.3.1 in place, we recall the refined regret bound of Proposition 18.4.1. We
have that for $w^\star \in \Delta_K$ and any sequence of vectors $g_1, g_2, \ldots$ with $g_t \in \mathbb{R}^K_+$, then exponentiated
gradient descent achieves
\[ \sum_{t=1}^n \langle g_t, w_t - w^\star \rangle \le \frac{\log K}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{j=1}^K w_{t,j} g_{t,j}^2. \]
\[ \mathbb{E}[g_t \mid w_t] = \mathbb{E}[Y] = \mu \]
This careful normalizing, allowed by Proposition 18.4.1, is essential to our analysis (and fails for
more naive applications of online convex optimization bounds). In particular, we have
\[ \mathrm{Reg}_n = \sum_{t=1}^n \mathbb{E}[\langle \mu, w_t - w^\star \rangle] = \sum_{t=1}^n \mathbb{E}[\langle g_t, w_t - w^\star \rangle] \le \frac{\log K}{\eta} + \frac{\eta}{2} n\mathbb{E}[\|Y\|_2^2]. \]
When the random observed losses $Y_a(t)$ are bounded in $[0, 1]$, then we have the mean regret
bound $\frac{3}{2}\sqrt{Kn\log K}$, which is as sharp as any of our other bounds.
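For bounded losses, the whole procedure fits in a short simulation. The sketch below is one possible reading of an EXP3-style loop, not a transcription of Algorithm 19.3: it samples an arm from $w_t$, forms the importance-weighted loss estimate (nonzero only in the sampled coordinate), and applies the exponentiated gradient update. The Bernoulli loss model, the stepsize $\eta = \sqrt{\log K / (Kn)}$ (that is, $\sigma^2 = 1$), and all function names are assumptions made for illustration.

```python
import math
import random

def sample_index(w, rng):
    """Draw an index with probability proportional to w, skipping zero weights."""
    u = rng.random() * sum(w)
    acc = 0.0
    last = 0
    for j, wj in enumerate(w):
        if wj > 0.0:
            acc += wj
            last = j
            if u < acc:
                return j
    return last

def exp3(loss_means, n, eta, seed=0):
    """Simulate n rounds with Bernoulli losses; return pull counts and final weights."""
    rng = random.Random(seed)
    K = len(loss_means)
    w = [1.0 / K] * K
    pulls = [0] * K
    for _ in range(n):
        arm = sample_index(w, rng)
        loss = 1.0 if rng.random() < loss_means[arm] else 0.0
        g_arm = loss / w[arm]              # estimate is zero in all other coordinates
        w[arm] *= math.exp(-eta * g_arm)   # only the sampled coordinate changes
        total = sum(w)
        w = [wj / total for wj in w]       # renormalize onto the simplex
        pulls[arm] += 1
    return pulls, w

K, n = 3, 5000
eta = math.sqrt(math.log(K) / (K * n))
pulls, w = exp3([0.1, 0.5, 0.9], n, eta)
print(pulls)
```

Because unsampled coordinates have a zero estimate, only one weight changes before renormalization in each round.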
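where $\hat{\mu}_i'(t) = \frac{1}{T_i(t)}\sum_{\tau : A_\tau = i} Y_i'(\tau)$ is the empirical mean of the copies $Y_i'(\tau)$ for those steps when
arm $i$ is selected. To see this, we use the standard fact that the characteristic function of a random
variable completely characterizes the random variable. Let $\varphi_{Y_i}(\lambda) = \mathbb{E}[e^{\iota\lambda Y_i}]$, where $\iota = \sqrt{-1}$ is
the imaginary unit, denote the characteristic function of $Y_i$, noting that by construction we have
$\varphi_{Y_i} = \varphi_{Y_i'}$. Then writing the joint characteristic function of $T_i(t)\hat{\mu}_i(t)$ and $T_i(t)$, we obtain
\begin{align*}
& \mathbb{E}\left[\exp\left(\iota\lambda_1\sum_{\tau=1}^t 1\{A_\tau = i\} Y_i(\tau) + \iota\lambda_2 T_i(t)\right)\right] \\
& \stackrel{(i)}{=} \mathbb{E}\left[\prod_{\tau=1}^t \mathbb{E}\left[\exp\left(\iota\lambda_1 1\{A_\tau = i\} Y_i(\tau) + \iota\lambda_2 1\{A_\tau = i\}\right) \mid \mathcal{H}_{\tau-1}\right]\right] \\
& \stackrel{(ii)}{=} \mathbb{E}\left[\prod_{\tau=1}^t \left(1\{A_\tau = i\} e^{\iota\lambda_2}\,\mathbb{E}[\exp(\iota\lambda_1 Y_i(\tau)) \mid \mathcal{H}_{\tau-1}] + 1\{A_\tau \neq i\}\right)\right] \\
& \stackrel{(iii)}{=} \mathbb{E}\left[\prod_{\tau=1}^t \left(1\{A_\tau = i\} e^{\iota\lambda_2}\varphi_{Y_i}(\lambda_1) + 1\{A_\tau \neq i\}\right)\right] \\
& \stackrel{(iv)}{=} \mathbb{E}\left[\prod_{\tau=1}^t \left(1\{A_\tau = i\} e^{\iota\lambda_2}\varphi_{Y_i'}(\lambda_1) + 1\{A_\tau \neq i\}\right)\right] \\
& = \mathbb{E}\left[\exp\left(\iota\lambda_1\sum_{\tau=1}^t 1\{A_\tau = i\} Y_i'(\tau) + \iota\lambda_2 T_i(t)\right)\right],
\end{align*}
where equality (i) is the usual tower property of conditional expectations, where $\mathcal{H}_{\tau-1}$ denotes the
history to time $\tau - 1$, equality (ii) holds because $A_\tau \in \mathcal{H}_{\tau-1}$ (that is, it is a function of the history),
equality (iii) follows because $Y_i(\tau)$ is independent of $\mathcal{H}_{\tau-1}$, and equality (iv) follows because $Y_i'$ and
$Y_i$ have identical distributions. The final step is simply reversing the steps.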
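With the distributional equality (19.5.1) in place, we see that for any $\delta \in [0, 1]$, we have
\begin{align*}
\mathbb{P}\left(\hat{\mu}_i(t) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{T_i(t)}}\right) & = \mathbb{P}\left(\hat{\mu}_i'(t) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{T_i(t)}}\right) \\
& = \sum_{s=1}^t \mathbb{P}\left(\hat{\mu}_i'(t) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{s}} \;\Big|\; T_i(t) = s\right)\mathbb{P}(T_i(t) = s) \\
& \le \sum_{s=1}^t \delta\,\mathbb{P}(T_i(t) = s) = \delta.
\end{align*}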
Part V
Appendices
Appendix A
This appendix collects several mathematical results and some of the more advanced mathematical
treatment required for full proofs of the results in the book. It is not a core part of the book, but it
serves readers who wish to see the measure-theoretic rigor necessary for some of our results,
or who otherwise wish to dot the appropriate I's and cross the appropriate T's.
Appendix B
Convex Analysis
In this appendix, we review several results in convex analysis that are useful for our purposes. We
give only a cursory study here, identifying the basic results and those that will be of most use to
us; the field of convex analysis as a whole is vast. The study of convex analysis and optimization
has become very important practically in the last forty to fifty years for a few reasons, the most
important of which is probably that convex optimization problems—those optimization problems
in which the objective and constraints are convex—are tractable, while many others are not. We
do not focus on optimization ideas here, however, building only some analytic tools that we will
find useful. We borrow most of our results from Hiriart-Urruty and Lemaréchal [104], focusing
mostly on the finite-dimensional case (though we present results that apply in infinite dimensional
cases with proofs that extend straightforwardly, and we do not specify the domains of our functions
unless necessary), as we require no results from infinite-dimensional analysis.
In addition, we abuse notation and assume that the range of any function is the extended real
line, meaning that if f : C → R we mean that f (x) ∈ R ∪ {−∞, +∞}, where −∞ and +∞ are
infinite and satisfy a + ∞ = +∞ and a − ∞ = −∞ for any a ∈ R. However, we assume throughout
and without further mention that our functions are proper, meaning that f (x) > −∞ for all x, as
this allows us to avoid annoying pathologies.
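Definition B.1. A set $C$ is convex if for all $\lambda \in [0, 1]$ and all $x, y \in C$, we have
$$\lambda x + (1 - \lambda) y \in C.$$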
An important restriction of convex sets is to closed convex sets, those convex sets that are, well,
closed.
We now consider two operations that extend sets, convexifying them in nice ways.
Definition B.2. The affine hull of a set $C$ is the smallest affine set containing $C$. That is,
$$\mathrm{aff}(C) := \bigg\{ \sum_{i=1}^k \lambda_i x_i : k \in \mathbb{N},\ x_i \in C,\ \lambda \in \mathbb{R}^k,\ \sum_{i=1}^k \lambda_i = 1 \bigg\}.$$
An almost immediate associated result is that the convex hull of a set is equal to the set of all
convex combinations of points in the set.
Proposition B.1.1. Let $C$ be an arbitrary set. Then
$$\mathrm{Conv}(C) = \bigg\{ \sum_{i=1}^k \lambda_i x_i : k \in \mathbb{N},\ x_i \in C,\ \lambda \in \mathbb{R}^k_+,\ \sum_{i=1}^k \lambda_i = 1 \bigg\}.$$
Proof Call $T$ the set on the right hand side of the equality in the proposition. Then $T \supset C$
is clear, as we may simply take $k = 1$ and $\lambda_1 = 1$ and vary $x_1 \in C$. Moreover, $T \subset \mathrm{Conv}(C)$, as any
convex set containing $C$ must contain all convex combinations of its elements; similarly, any convex
set $S \supset C$ must have $S \supset T$.
Thus if we show that $T$ is convex, then we are done. Take any two points $x, y \in T$. Then
$x = \sum_{i=1}^k \alpha_i x_i$ and $y = \sum_{i=1}^l \beta_i y_i$ for $x_i, y_i \in C$, where $\alpha_i, \beta_i \ge 0$ and $\sum_{i=1}^k \alpha_i = \sum_{i=1}^l \beta_i = 1$.
Fix $\lambda \in [0, 1]$. Then $(1 - \lambda)\beta_i \ge 0$ and $\lambda\alpha_i \ge 0$ for all $i$,
$$\lambda \sum_{i=1}^k \alpha_i + (1 - \lambda) \sum_{i=1}^l \beta_i = \lambda + (1 - \lambda) = 1,$$
and $\lambda x + (1 - \lambda) y$ is a convex combination of the points $x_i$ and $y_i$ weighted by $\lambda\alpha_i$ and $(1 - \lambda)\beta_i$,
respectively. So $\lambda x + (1 - \lambda) y \in T$ and $T$ is convex.
We also give one more definition, which is useful for dealing with some pathological cases in
convex analysis, as it allows us to assume many sets are full-dimensional.
Definition B.4. The relative interior of a set $C$ is the interior of $C$ relative to its affine hull, that
is,
$$\mathrm{relint}(C) := \{x \in C : B(x, \epsilon) \cap \mathrm{aff}(C) \subset C \text{ for some } \epsilon > 0\},$$
where $B(x, \epsilon) := \{y : \|y - x\| < \epsilon\}$ denotes the open ball of radius $\epsilon$ centered at $x$.
An example may make Definition B.4 clearer.
Example B.1.2 (Relative interior of a disc): Consider the (convex) set
$$C = \big\{x \in \mathbb{R}^d : x_1^2 + x_2^2 \le 1,\ x_j = 0 \text{ for } j \in \{3, \ldots, d\}\big\}.$$
The affine hull $\mathrm{aff}(C) = \mathbb{R}^2 \times \{0\} = \{(x_1, x_2, 0, \ldots, 0) : x_1, x_2 \in \mathbb{R}\}$ is simply the $(x_1, x_2)$-plane
in $\mathbb{R}^d$, while the relative interior $\mathrm{relint}(C) = \{x \in \mathbb{R}^d : x_1^2 + x_2^2 < 1\} \cap \mathrm{aff}(C)$ is the “interior”
of the 2-dimensional disc in $\mathbb{R}^d$. $\Diamond$
In finite dimensions, we may actually restrict the definition of the convex hull of a set C to
convex combinations of a bounded number (the dimension plus one) of the points in C, rather
than arbitrary convex combinations as required by Proposition B.1.1. This result is known as
Carathéodory’s theorem.
Theorem B.1.3. Let $C \subset \mathbb{R}^d$. Then $x \in \mathrm{Conv}(C)$ if and only if there exist points $x_1, \ldots, x_{d+1} \in C$
and $\lambda \in \mathbb{R}^{d+1}_+$ with $\sum_{i=1}^{d+1} \lambda_i = 1$ such that
$$x = \sum_{i=1}^{d+1} \lambda_i x_i.$$
Proof It is clear that if x can be represented as such a sum, then x ∈ Conv(C). Conversely,
Proposition B.1.1 implies that for any x ∈ Conv(C) we have
$$x = \sum_{i=1}^k \lambda_i x_i, \quad \lambda_i \ge 0, \quad \sum_{i=1}^k \lambda_i = 1, \quad x_i \in C$$
for some λi , xi . Assume that k > d + 1 and λi > 0 for each i, as otherwise, there is nothing to prove.
Then we know that the points $x_i - x_1$, $i = 2, \ldots, k$, are certainly linearly dependent (as there are $k - 1 > d$ of
them), and we can find (not identically zero) values $\alpha_2, \ldots, \alpha_k$ such that $\sum_{i=2}^k \alpha_i (x_i - x_1) = 0$. Let
$\alpha_1 = -\sum_{i=2}^k \alpha_i$ to obtain that we have both
$$\sum_{i=1}^k \alpha_i x_i = 0 \quad \text{and} \quad \sum_{i=1}^k \alpha_i = 0. \tag{B.1.1}$$
Notably, the equalities (B.1.1) imply that at least one $\alpha_i > 0$, and if we define $\lambda^* = \min_{i : \alpha_i > 0} \frac{\lambda_i}{\alpha_i} > 0$,
then setting $\lambda_i' = \lambda_i - \lambda^* \alpha_i$ we have
$$\lambda_i' \ge 0 \text{ for all } i, \quad \sum_{i=1}^k \lambda_i' = \sum_{i=1}^k \lambda_i - \lambda^* \sum_{i=1}^k \alpha_i = 1, \quad \text{and} \quad \sum_{i=1}^k \lambda_i' x_i = \sum_{i=1}^k \lambda_i x_i - \lambda^* \sum_{i=1}^k \alpha_i x_i = x.$$
But we know that at least one of the $\lambda_i' = 0$, so that we could write $x$ as a convex combination of
$k - 1$ elements. Repeating this strategy until $k = d + 1$ gives the theorem.
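The reduction step in the proof of Theorem B.1.3 is constructive, and a small sketch may make it concrete. The code below is only an illustration under stated assumptions: the helper names `null_vector` and `caratheodory` are hypothetical (they do not come from the notes), and exact rational arithmetic is used so that a weight really becomes zero at each step.

```python
from fractions import Fraction


def null_vector(rows):
    """Return a nonzero exact solution a of sum_j rows[i][j] * a[j] = 0 for all i.

    Assumes the system has more columns than its rank (hypothetical helper).
    """
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0])
    pivots = []  # pivot column of each reduced row, in order
    r = 0
    for c in range(n_cols):
        if r == len(m):
            break
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        pivot = m[r][c]
        m[r] = [v / pivot for v in m[r]]
        for i in range(len(m)):
            factor = m[i][c]
            if i != r and factor != 0:
                m[i] = [u - factor * w for u, w in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
    # Set one free variable to 1 and solve for the pivot variables.
    free = next(c for c in range(n_cols) if c not in pivots)
    a = [Fraction(0)] * n_cols
    a[free] = Fraction(1)
    for row, c in zip(m, pivots):
        a[c] = -row[free]
    return a


def caratheodory(points, weights):
    """Reduce a convex combination in R^d to at most d + 1 points, same value."""
    d = len(points[0])
    pts = [tuple(Fraction(v) for v in p) for p in points]
    lam = [Fraction(w) for w in weights]
    pts = [p for p, l in zip(pts, lam) if l > 0]
    lam = [l for l in lam if l > 0]
    while len(pts) > d + 1:
        # alpha solves sum_i alpha_i x_i = 0 and sum_i alpha_i = 0, as in (B.1.1).
        rows = [[p[j] for p in pts] for j in range(d)] + [[1] * len(pts)]
        alpha = null_vector(rows)
        # lambda* = min over alpha_i > 0 of lambda_i / alpha_i.
        step = min(l / a for l, a in zip(lam, alpha) if a > 0)
        lam = [l - step * a for l, a in zip(lam, alpha)]
        keep = [i for i, l in enumerate(lam) if l > 0]
        pts = [pts[i] for i in keep]
        lam = [lam[i] for i in keep]
    return pts, lam


points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 3)]
weights = [Fraction(1, 5)] * 5
reduced_points, reduced_weights = caratheodory(points, weights)
```

Each pass of the loop removes at least one point while keeping the weights nonnegative and summing to one, which mirrors the update $\lambda_i' = \lambda_i - \lambda^* \alpha_i$ in the proof.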
Observation B.1.4 is clear, as we have C ⊂ Conv(C), while any other convex S ⊃ C clearly satisfies
S ⊃ Conv(C). Secondly, we note that intersections preserve convexity.
Observation B.1.5. Let $\{C_\alpha\}_{\alpha \in A}$ be an arbitrary collection of convex sets. Then
$$C = \bigcap_{\alpha \in A} C_\alpha$$
is convex. Moreover, if $C_\alpha$ is closed for each $\alpha \in A$, then $C$ is closed as well.
The convexity property follows because if x1 ∈ C and x2 ∈ C, then clearly x1 , x2 ∈ Cα for all
α ∈ A, and moreover λx1 + (1 − λ)x2 ∈ Cα for all α and any λ ∈ [0, 1]. The closure property is
standard. In addition, we note that closing a convex set maintains convexity.
Observation B.1.6. Let C be convex. Then cl(C) is convex.
To see this, we note that if x, y ∈ cl(C) and xn → x and yn → y (where xn , yn ∈ C), then for any
λ ∈ [0, 1], we have λxn + (1 − λ)yn ∈ C and λxn + (1 − λ)yn → λx + (1 − λ)y. Thus we have
λx + (1 − λ)y ∈ cl(C) as desired.
Observation B.1.6 also implies the following result.
Observation B.1.7. Let $D$ be an arbitrary set. Then
$$\bigcap \{C : C \supset D,\ C \text{ is closed convex}\} = \mathrm{cl}\,\mathrm{Conv}(D).$$
Proof Let T denote the leftmost set. It is clear that T ⊂ cl Conv(D) as cl Conv(D) is a closed
convex set (by Observation B.1.6) containing D. On the other hand, if C ⊃ D is a closed convex
set, then C ⊃ Conv(D), while the closedness of C implies it also contains the closure of Conv(D).
Thus T ⊃ cl Conv(D) as well.
As our last consideration of operations that preserve convexity, we consider what is known as
the perspective of a set. To define this set, we need to define the perspective function, which, given
a point $(x, t) \in \mathbb{R}^d \times \mathbb{R}_{++}$ (here $\mathbb{R}_{++} = \{t : t > 0\}$ denotes strictly positive points), is defined as
$$\mathrm{pers}(x, t) = \frac{x}{t}.$$
We have the following definition.
Definition B.5. Let $C \subset \mathbb{R}^d \times \mathbb{R}_+$ be a set. The perspective transform of $C$, denoted by $\mathrm{pers}(C)$,
is
$$\mathrm{pers}(C) := \Big\{ \frac{x}{t} : (x, t) \in C \text{ and } t > 0 \Big\}.$$
This corresponds to taking all the points z ∈ C, normalizing them so their last coordinate is 1, and
then removing the last coordinate. (For more on perspective functions, see Boyd and Vandenberghe
[35, Chapter 2.3.3].)
It is interesting to note that the perspective of a convex set is convex. First, we note the
following.
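Lemma B.1.8. Let $C \subset \mathbb{R}^{d+1}$ be a compact line segment, meaning that $C = \{\lambda x + (1 - \lambda) y : \lambda \in [0, 1]\}$,
where $x_{d+1} > 0$ and $y_{d+1} > 0$. Then $\mathrm{pers}(C) = \{\lambda\, \mathrm{pers}(x) + (1 - \lambda)\, \mathrm{pers}(y) : \lambda \in [0, 1]\}$.
Proof Let $\lambda \in [0, 1]$. Then
$$
\begin{aligned}
\mathrm{pers}(\lambda x + (1 - \lambda) y)
&= \frac{\lambda x_{1:d} + (1 - \lambda) y_{1:d}}{\lambda x_{d+1} + (1 - \lambda) y_{d+1}} \\
&= \frac{\lambda x_{d+1}}{\lambda x_{d+1} + (1 - \lambda) y_{d+1}} \frac{x_{1:d}}{x_{d+1}} + \frac{(1 - \lambda) y_{d+1}}{\lambda x_{d+1} + (1 - \lambda) y_{d+1}} \frac{y_{1:d}}{y_{d+1}} \\
&= \theta\, \mathrm{pers}(x) + (1 - \theta)\, \mathrm{pers}(y),
\end{aligned}
$$
where $x_{1:d}$ and $y_{1:d}$ denote the vectors of the first $d$ components of $x$ and $y$, respectively, and
$$\theta = \frac{\lambda x_{d+1}}{\lambda x_{d+1} + (1 - \lambda) y_{d+1}} \in [0, 1].$$
Sweeping $\lambda$ from 0 to 1 sweeps $\theta \in [0, 1]$, giving the result.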
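As a quick numerical sanity check of Lemma B.1.8, the following sketch evaluates both sides of the identity along a segment in $\mathbb{R}^3$. The endpoints are arbitrary illustrative choices and the function names `pers`, `theta`, and `segment_gap` are assumptions made here for the example.

```python
def pers(z):
    """Perspective map: divide the first d coordinates by the positive last one."""
    *head, t = z
    return [v / t for v in head]


def theta(lam, x, y):
    """Reparametrized weight from the proof of Lemma B.1.8."""
    return lam * x[-1] / (lam * x[-1] + (1 - lam) * y[-1])


def segment_gap(x, y, steps=10):
    """Largest coordinate gap between pers(segment point) and the theta-mixture."""
    worst = 0.0
    for k in range(steps + 1):
        lam = k / steps
        z = [lam * a + (1 - lam) * b for a, b in zip(x, y)]
        th = theta(lam, x, y)
        mix = [th * a + (1 - th) * b for a, b in zip(pers(x), pers(y))]
        worst = max(worst, max(abs(u - v) for u, v in zip(pers(z), mix)))
    return worst


x = [1.0, 2.0, 4.0]   # last coordinate positive
y = [3.0, -1.0, 0.5]  # last coordinate positive
gap = segment_gap(x, y)
```

The gap is zero up to floating-point rounding, and `theta` moves from 0 to 1 as `lam` does, which is the reparametrization the proof relies on.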
We now consider some properties of convex sets, showing that (1) they have nice separation
properties—we can put hyperplanes between them—and (2) this allows several interesting represen-
tations of convex sets. We begin with the separation properties, developing them via the existence
of projections. Interestingly, this existence of projections does not rely on any finite-dimensional
structure, and can even be shown to hold in arbitrary Banach spaces (assuming the axiom of
choice) [133]. We provide the results in a Hilbert space, meaning a complete vector space for which
there exists an inner product $\langle \cdot, \cdot \rangle$ and associated norm $\|\cdot\|$ given by $\|x\|^2 = \langle x, x \rangle$. We first note
that projections exist.
Theorem B.1.10 (Projections). Let $C$ be a closed convex set. Then for any $x$, there exists a
unique point $\pi_C(x)$ minimizing $\|y - x\|$ over $y \in C$. Moreover, this point is characterized by the
inequality
$$\langle \pi_C(x) - x, y - \pi_C(x) \rangle \ge 0 \text{ for all } y \in C. \tag{B.1.2}$$
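Proof The existence and uniqueness of the projection follow from the parallelogram identity,
that is, that for any $x, y$ we have $\|x - y\|^2 + \|x + y\|^2 = 2(\|x\|^2 + \|y\|^2)$, which follows by noting
that $\|x \pm y\|^2 = \|x\|^2 + \|y\|^2 \pm 2\langle x, y \rangle$. Indeed, let $\{y_n\} \subset C$ be a sequence such that
$$\|y_n - x\| \to \inf_{y \in C} \|y - x\| =: p_\star$$
as $n \to \infty$, where $p_\star$ is the infimal value. We show that $y_n$ is Cauchy, so that there exists a (unique)
limit point of the sequence. Fix $\epsilon > 0$ and let $N$ be such that $n \ge N$ implies $\|y_n - x\|^2 \le p_\star^2 + \epsilon^2$.
Let $m, n \ge N$. Then by the parallelogram identity,
$$\|y_n - y_m\|^2 = \|(x - y_n) - (x - y_m)\|^2 = 2\big[\|x - y_n\|^2 + \|x - y_m\|^2\big] - \|(x - y_n) + (x - y_m)\|^2.$$
Noting that
$$(x - y_n) + (x - y_m) = 2\Big(x - \frac{y_n + y_m}{2}\Big) \quad \text{and} \quad \frac{y_n + y_m}{2} \in C \text{ (by convexity of } C\text{)},$$
we have
$$\|x - y_n\|^2 \le p_\star^2 + \epsilon^2, \quad \|x - y_m\|^2 \le p_\star^2 + \epsilon^2, \quad \text{and} \quad \|(x - y_n) + (x - y_m)\|^2 = 4\Big\|x - \frac{y_n + y_m}{2}\Big\|^2 \ge 4 p_\star^2.$$
In particular, we have
$$\|y_n - y_m\|^2 \le 2\big[2 p_\star^2 + 2\epsilon^2\big] - 4 p_\star^2 = 4\epsilon^2.$$
As $\epsilon > 0$ was arbitrary, this completes the proof of the first statement of the theorem.
To see the second result, assume that $z$ is a point satisfying inequality (B.1.2), that is, such
that
$$\langle z - x, y - z \rangle \ge 0 \text{ for all } y \in C.$$
Then we have
$$\|z - x\|^2 = \langle z - x, z - x \rangle = \underbrace{\langle z - x, z - y \rangle}_{\le 0} + \langle z - x, y - x \rangle \le \|z - x\| \|y - x\|$$
by the Cauchy-Schwarz inequality, so that $\|z - x\| \le \|y - x\|$ for all $y \in C$ and hence $z = \pi_C(x)$.
Conversely, for any $y \in C$ and $t \in [0, 1]$, the point $(1 - t)\pi_C(x) + ty$ belongs to $C$ by convexity, so that
$$
\begin{aligned}
\|\pi_C(x) - x\|^2 &\le \|(1 - t)\pi_C(x) + ty - x\|^2 = \|\pi_C(x) - x + t(y - \pi_C(x))\|^2 \\
&= \|\pi_C(x) - x\|^2 + 2t\langle \pi_C(x) - x, y - \pi_C(x) \rangle + t^2 \|y - \pi_C(x)\|^2.
\end{aligned}
$$
Subtracting the projection value $\|\pi_C(x) - x\|^2$ from both sides and dividing by $t > 0$, we have
$$0 \le 2\langle \pi_C(x) - x, y - \pi_C(x) \rangle + t \|y - \pi_C(x)\|^2,$$
and taking $t \downarrow 0$ gives inequality (B.1.2).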
Corollary B.1.11. Let $C$ be closed convex and $x \notin C$. Then there is a vector $v$ strictly separating
$x$ from $C$, that is,
$$\langle v, x \rangle > \sup_{y \in C} \langle v, y \rangle.$$
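To make the projection inequality (B.1.2) and the strict separation of Corollary B.1.11 concrete, here is a minimal sketch for one specific set, the closed unit disc in $\mathbb{R}^2$, where the projection has a closed form. The choice of set, the sample point, and the helper names are assumptions for illustration; the separating vector used is $v = x - \pi_C(x)$.

```python
import math


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_unit_ball(x):
    """Euclidean projection onto C = {y : ||y|| <= 1}."""
    norm = math.sqrt(inner(x, x))
    return list(x) if norm <= 1.0 else [a / norm for a in x]


def disc_grid(steps=12):
    """Deterministic sample of points in the closed unit disc of R^2."""
    pts = []
    for i in range(steps + 1):
        r = i / steps
        for j in range(4 * steps):
            ang = 2 * math.pi * j / (4 * steps)
            pts.append([r * math.cos(ang), r * math.sin(ang)])
    return pts


x = [3.0, 4.0]                        # a point outside C
p = project_unit_ball(x)              # its projection, here (0.6, 0.8)
v = [a - b for a, b in zip(x, p)]     # candidate separating vector

samples = disc_grid()
# Inequality (B.1.2): <pi_C(x) - x, y - pi_C(x)> >= 0 for every sampled y in C.
min_gap = min(
    inner([pi - xi for pi, xi in zip(p, x)], [yi - pi for yi, pi in zip(y, p)])
    for y in samples
)
# Separation: <v, x> exceeds <v, y> for every sampled y in C.
sup_vy = max(inner(v, y) for y in samples)
```

Sampling only checks finitely many $y \in C$, so this illustrates the statements rather than proving them; for the disc one can also verify by hand that $\sup_{y \in C} \langle v, y \rangle = \|v\|$.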
In addition, we can show the existence of supporting hyperplanes, that is, hyperplanes “sepa-
rating” the boundary of a convex set from itself.
Theorem B.1.12. Let $C$ be a convex set and $x \in \mathrm{bd}(C)$, where $\mathrm{bd}(C) = \mathrm{cl}(C) \setminus \mathrm{int}\, C$. Then there
exists a non-zero vector $v$ such that $\langle v, x \rangle \ge \sup_{y \in C} \langle v, y \rangle$.
Proof Let $D = \mathrm{cl}(C)$ be the closure of $C$ and let $x_n \notin D$ be a sequence of points such that
$x_n \to x$. Let us define the sequence of separating vectors $s_n = x_n - \pi_D(x_n)$ and the normalized
version $v_n = s_n / \|s_n\|$. Notably, we have $\langle v_n, x_n \rangle > \sup_{y \in C} \langle v_n, y \rangle$ for all $n$. Now, the sequence
$\{v_n\} \subset \{v : \|v\| = 1\}$ belongs to a compact set. Passing to a subsequence if necessary, let us
assume w.l.o.g. that $v_n \to v$ with $\|v\| = 1$. Then by a standard limiting argument for the $x_n \to x$,
we have
$$\langle v, x \rangle \ge \langle v, y \rangle \text{ for all } y \in C,$$
which was our desired result.
Theorem B.1.12 gives us an important result. In particular, let D be an arbitrary set, and let
C = cl Conv(D) be the closure of the convex hull of D, which is the smallest closed convex set
containing D. Then we can write C as the intersection of all the closed half-spaces containing D;
this is, in some sense, the most useful “convexification” of D. Recall that a closed half-space H is
defined with respect to a vector $v$ and real $r \in \mathbb{R}$ as
$$H := \{x : \langle v, x \rangle \le r\}.$$
Before stating the theorem, we remark that by Observation B.1.7, the intersection of all the closed
convex sets containing a set $D$ is equal to the closure of the convex hull of $D$.
Theorem B.1.13. Let $D$ be an arbitrary set. If $C = \mathrm{cl}\,\mathrm{Conv}(D)$, then
$$C = \bigcap_{H \supset D} H, \tag{B.1.3}$$
where $H$ denotes a closed half-space containing $D$. Moreover, for any closed convex set $C$,
$$C = \bigcap_{x \in \mathrm{bd}(C)} H_x, \tag{B.1.4}$$
is clearly supporting to $C$ at the point $\pi_C(x_0)$. The half-space $\{y : \langle y, v \rangle \le \langle \pi_C(x_0), v \rangle\}$ thus
contains $C$ and does not contain $x_0$, implying that $x_0 \notin \cap_{x \in \mathrm{bd}(C)} H_x$.
Now we show the first result (B.1.3). Let $C$ be the closed convex hull of $D$ and $T = \cap_{H \supset D} H$.
By a trivial extension of the representation (B.1.4), we have that $C = \cap_{H \supset C} H$, where $H$ denotes
any closed halfspace containing $C$. As $C \supset D$, we have that $H \supset C$ implies $H \supset D$, so that
$$T = \bigcap_{H \supset D} H \subset \bigcap_{H \supset C} H = C.$$
On the other hand, as $C = \mathrm{cl}\,\mathrm{Conv}(D)$, Observation B.1.7 implies that any closed convex set containing
$D$ contains $C$. As a closed halfspace is convex and closed, we have that $H \supset D$ implies $H \supset C$,
so that $T \supset C$ and thus $T = C$ as desired.
To any set $S$ we can associate a particular sublinear function, the support function of $S$, defining
$$\sigma_S(x) := \sup_{s \in S} \langle s, x \rangle. \tag{B.2.1}$$
This function is evidently a closed convex function—it is the supremum of linear functions—and is
positively homogeneous, so that it is sublinear. We thus immediately have the duality
Corollary B.2.2. Let $f$ be a sublinear function. Then it is the support function of the closed
convex set
$$S_f := \{s \mid \langle s, x \rangle \le f(x) \text{ for all } x \in \mathbb{R}^d\},$$
and hence if $C$ is closed convex, then
A few other consequences of the definition are immediate. We see that $\sigma_S$ has $\mathrm{dom}\, \sigma_S = \mathbb{R}^d$ if
and only if $S$ is bounded: whenever $\|s\| \le L$ for all $s \in S$, then $\sigma_S(x) \le L \|x\|$. Conversely,
if $\mathrm{dom}\, \sigma_S = \mathbb{R}^d$ then it is locally Lipschitz (Theorem B.3.4) and (by positive homogeneity) thus
globally Lipschitz, so we have $\langle s, x \rangle \le \sigma_S(x) \le L \|x\|$ for some $L < \infty$ and all $s \in S$, and taking $x = s / \|s\|$
gives $\|s\| \le L$. As another consequence, we see that support functions of a set $S$ are the support
functions of the closed convex hull of $S$:
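Proposition B.2.3. Let $S \subset \mathbb{R}^d$. Then
$$\sigma_S(x) = \sigma_{\mathrm{Conv}\, S}(x) = \sigma_{\mathrm{cl}\,\mathrm{Conv}\, S}(x) \text{ for all } x.$$
Proof Let $C = \mathrm{Conv}\, S$, and let $s_n \in C$ be any sequence with $\langle s_n, x \rangle \to \sup_{s \in C} \langle s, x \rangle$. Then there
exist $s_{n,i} \in S$, $i = 1, \ldots, k(n)$, such that $s_n = \sum_{i=1}^{k(n)} \lambda_i s_{n,i}$ for some $\lambda \succeq 0$, $\langle \lambda, \mathbf{1} \rangle = 1$, which
may change with $n$. But of course, $\langle s_n, x \rangle \le \max_i \langle s_{n,i}, x \rangle$, and thus $\sigma_S(x) \ge \sigma_C(x)$. To see that
$\sigma_C(x) = \sigma_{\mathrm{cl}\, C}(x)$, note that for each $\epsilon > 0$, for each $s \in \mathrm{cl}\, C$ there is $s' \in C$ with $\|s - s'\| < \epsilon$. Then
$\langle s, x \rangle \le \langle s', x \rangle + \epsilon \|x\|$ and $\sigma_{\mathrm{cl}\, C}(x) \le \sigma_C(x) + \epsilon \|x\|$. Take $\epsilon \downarrow 0$.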
This proposition, coupled with Corollary B.2.2, shows that if sets S1 , S2 have identical support
functions, then they have identical closed convex hulls, and if they are closed convex, they are thus
identical.
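Corollary B.2.4. Let $S_1, S_2 \subset \mathbb{R}^d$. If $\sigma_{S_1} = \sigma_{S_2}$, then $\mathrm{cl}\,\mathrm{Conv}\, S_1 = \mathrm{cl}\,\mathrm{Conv}\, S_2$.
Proof By Proposition B.2.3, we have $\sigma_{S_i} = \sigma_{\mathrm{cl}\,\mathrm{Conv}\, S_i}$ for each $i$, and Corollary B.2.2 shows that
if $\sigma_{C_1} = \sigma_{C_2}$ for closed convex sets $C_1$ and $C_2$, then $C_1 = C_2$.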
$$t_1 \sigma_1 + t_2 \sigma_2 = \sigma_{\mathrm{cl}(t_1 S_1 + t_2 S_2)}.$$
equality $(\star)$ following from Proposition B.2.3. As the suprema run independently through their
respective sets $S_1, S_2$, the latter quantity is evidently
The final result is an immediate consequence of the result that if $C$ is a compact convex set and
$S$ is closed convex, then $C + S$ is closed convex. That $C + S$ is convex is immediate. To see that
it is closed, let $x_n \in C$, $y_n \in S$ satisfy $x_n + y_n \to z$. Then proceeding to a subsequence, we have
$x_{n(m)} \to x_\infty$ for some $x_\infty \in C$, and thus $y_{n(m)} \to z - x_\infty$, which is then necessarily in $S$. As the
subsequence $x_{n(m)} + y_{n(m)} \to x_\infty + (z - x_\infty) \in C + S$ and $x_{n(m)} + y_{n(m)} \to z$ as well, this gives the
result.
Linear transformations of support functions are also calculable. In the result, recall that for a
matrix A and set S, the set AS = {As | s ∈ S}.
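Proposition B.2.6. Let $S \subset \mathbb{R}^d$ and $A \in \mathbb{R}^{m \times d}$. Then $\sigma_{\mathrm{cl}\, AS}(x) = \sigma_S(A^\top x)$.
Proof We have $\sigma_{AS}(x) = \sup_{s \in S} \langle As, x \rangle = \sup_{s \in S} \langle s, A^\top x \rangle$. The closure operation changes nothing
(Proposition B.2.3).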
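For a finite set $S$, the support function is a maximum over finitely many inner products, so Propositions B.2.3 and B.2.6 can be checked directly. The sketch below does so with small integer data; the set `S`, the matrix `A`, and the test directions are arbitrary illustrative assumptions, and only midpoints are used as sample points of the convex hull.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def support(S, x):
    """Support function of a finite set S: max over s in S of <s, x>."""
    return max(inner(s, x) for s in S)


def matvec(A, s):
    return [inner(row, s) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


S = [(1, 0), (0, 2), (-1, -1)]
A = [[1, 2], [0, 1], [3, -1]]          # maps R^2 to R^3
AS = [matvec(A, s) for s in S]         # the image set {A s : s in S}

# Proposition B.2.6 on a few directions x in R^3: sigma_{AS}(x) = sigma_S(A^T x).
directions3 = [(1, 0, 0), (0, 1, 0), (1, -2, 3), (-1, -1, -1)]
linear_ok = all(
    support(AS, x) == support(S, matvec(transpose(A), x)) for x in directions3
)

# Proposition B.2.3 (sampled): adding points of Conv S leaves sigma unchanged.
midpoints = [
    tuple(0.5 * a + 0.5 * b for a, b in zip(s, t))
    for i, s in enumerate(S) for t in S[i + 1:]
]
directions2 = [(1, 0), (0, 1), (-2, 1), (3, -4)]
hull_ok = all(support(S + midpoints, x) == support(S, x) for x in directions2)
```

Both checks hold exactly here because the data are integers and half-integers; for general floating-point data a tolerance would be the safer comparison.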
Lastly, we show how to use support functions to characterize whether sets have interiors. Recall
that for a set $S \subset \mathbb{R}^d$, the affine hull $\mathrm{aff}(S)$ (Definition B.2) is the set of affine combinations of
points in $S$, and the relative interior of $S$ is its interior relative to its affine hull (Definition B.4).
Proposition B.2.7. Let $S \subset \mathbb{R}^d$ be a non-empty closed convex set. Then
(i) $s \in \mathrm{int}\, S$ if and only if $\langle s, x \rangle < \sigma_S(x)$ for all $x \neq 0$.
(ii) $s \in \mathrm{relint}\, S$ if and only if $\langle s, x \rangle < \sigma_S(x)$ for all $x$ with $\sigma_S(x) + \sigma_S(-x) > 0$.
(iii) $\mathrm{int}\, S$ is non-empty if and only if $\sigma_S(x) + \sigma_S(-x) > 0$ for all $x \neq 0$.
Proof
(i) Because $\sigma_S$ is positively homogeneous, an equivalent statement is that $\sigma_S(x) > \langle s, x \rangle$ for all
$x \in \mathbb{S}^{d-1} = \{x \in \mathbb{R}^d \mid \|x\|_2 = 1\}$. If $s \in \mathrm{int}\, S$, there exists $\epsilon > 0$ such that $s + \epsilon x \in S$ for
all $x \in \mathbb{S}^{d-1}$, and so
$$\sigma_S(x) \ge \langle s + \epsilon x, x \rangle = \langle s, x \rangle + \epsilon,$$
so that $\langle s, x \rangle < \sigma_S(x)$.
Conversely, let $s$ be any point satisfying $\sigma_S(x) - \langle s, x \rangle > 0$ for all $x \in \mathbb{S}^{d-1}$. Because $\sigma_S$ is
lower semicontinuous, the infimum $\inf_{x \in \mathbb{S}^{d-1}} \{\sigma_S(x) - \langle s, x \rangle\}$ is attained at some $x_\star \in \mathbb{S}^{d-1}$ (see
Proposition C.0.1). Then there exists some $\epsilon > 0$ such that $\langle s, x \rangle + \epsilon \le \sigma_S(x)$ for all $x \in \mathbb{S}^{d-1}$.
Let $u$ be any vector with $\|u\|_2 < \epsilon$. Then $\langle s + u, x \rangle = \langle s, x \rangle + \langle u, x \rangle \le \langle s, x \rangle + \epsilon \le \sigma_S(x)$ for all $x \in \mathbb{S}^{d-1}$, so
Corollary B.2.2 implies $s + u \in S$ and $s \in \mathrm{int}\, S$.
(iii) Suppose $\mathrm{int}\, S$ is non-empty. Then $s \in \mathrm{int}\, S$ implies $\langle s, x \rangle < \sigma_S(x)$ for all $x$ with $\|x\| = 1$.
Then $\sigma_S(x) + \sigma_S(-x) > \langle s, x - x \rangle = 0$. Conversely, if $\mathrm{int}\, S$ is empty, there exists a hyperplane
containing $S$ (by a dimension counting argument and that the relative interior of $S$ is never
empty [104, Theorem III.2.1.3]), which we may write as $S \subset \{s \mid v^\top s = b\}$ for some $v \neq 0$.
For this $v$, $\sigma_S(v) + \sigma_S(-v) = b - b = 0$.
[Figure B.1: the epigraph $\mathrm{epi}\, f$ of a function $f$.]
We now build off of the definitions of convex sets to define convex functions. As we will see,
convex functions have several nice properties that follow from the geometric (separation) properties
of convex sets. First, we have
Definition B.7. A function $f$ is convex if for all $\lambda \in [0, 1]$ and $x, y \in \mathrm{dom}\, f$,
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y). \tag{B.3.1}$$
We define the domain $\mathrm{dom}\, f$ of a convex function to be those points $x$ such that $f(x) < +\infty$. Note
that Definition B.7 implies that the domain of $f$ must be convex.
An equivalent definition of convexity follows by considering a natural convex set attached to
the function $f$, known as its epigraph.
Definition B.8. The epigraph $\mathrm{epi}\, f$ of a function is the set
$$\mathrm{epi}\, f := \{(x, t) : f(x) \le t\}.$$
That is, the epigraph of a function f is the set of points on or above the graph of the function itself,
as depicted in Figure B.1. It is immediate from the definition of the epigraph that f is convex if
and only if epi f is convex. Thus, we see that any convex set C ⊂ Rd+1 that is unbounded “above,”
meaning that C = C + {0} × R+ , defines a convex function, and conversely, any convex function
defines such a set C. This duality in the relationship between a convex function and its epigraph
is central to many of the properties we exploit.
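Because the quotient function (B.3.2) is nondecreasing, we can relatively straightforwardly give
first-order characterizations of convexity as well. Indeed, suppose that $f : \mathbb{R} \to \mathbb{R}$ is differentiable;
then convexity is equivalent to the first-order inequality that for all $x, y \in \mathbb{R}$, we have
$$f(y) \ge f(x) + f'(x)(y - x). \tag{B.3.3}$$
That inequality (B.3.3) implies that $f$ is convex follows from algebraic manipulations: let
$\lambda \in [0, 1]$ and $z = \lambda x + (1 - \lambda) y$, so that $y - z = \lambda(y - x)$ and $x - z = (1 - \lambda)(x - y)$. Then
$$f(y) \ge f(z) + \lambda f'(z)(y - x) \quad \text{and} \quad f(x) \ge f(z) + (1 - \lambda) f'(z)(x - y),$$
and multiplying the former by $(1 - \lambda)$ and the latter by $\lambda$ and adding the two inequalities yields
$$\lambda f(x) + (1 - \lambda) f(y) \ge f(z) = f(\lambda x + (1 - \lambda) y),$$
as desired.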
We may also give the standard second order characterization: if $f : \mathbb{R} \to \mathbb{R}$ is twice differentiable
and $f''(x) \ge 0$ for all $x$, then $f$ is convex. To see this, note that
$$f(y) = f(x) + f'(x)(y - x) + \frac{1}{2} f''(tx + (1 - t) y)(x - y)^2$$
for some $t \in [0, 1]$ by Taylor's theorem, so that $f(y) \ge f(x) + f'(x)(y - x)$ for all $x, y$ because
$f''(tx + (1 - t) y) \ge 0$. As a consequence, we obtain inequality (B.3.3), which implies that $f$ is
convex.
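The first-order inequality $f(y) \ge f(x) + f'(x)(y - x)$ is easy to test on a grid, which gives a cheap way to see it hold for a convex function and fail for a non-convex one. The sketch below uses $f(x) = x^4$ and $f(x) = x^3$ as illustrative choices made here, with integer grid points so the comparisons are exact; the helper name `satisfies_first_order` is an assumption.

```python
def satisfies_first_order(f, df, grid):
    """Check f(y) >= f(x) + f'(x) (y - x) for all pairs (x, y) in the grid."""
    return all(f(y) >= f(x) + df(x) * (y - x) for x in grid for y in grid)


grid = list(range(-5, 6))

# x^4 is convex (its second derivative 12 x^2 is nonnegative everywhere).
quartic_ok = satisfies_first_order(lambda x: x ** 4, lambda x: 4 * x ** 3, grid)

# x^3 is not convex on R: the tangent at x = 0 lies above the graph at y = -1.
cubic_ok = satisfies_first_order(lambda x: x ** 3, lambda x: 3 * x ** 2, grid)
```

A grid check can only refute convexity, never certify it; passing on finitely many pairs is consistent with convexity but is not a proof.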
As convexity is a property that depends only on properties of functions on lines (one-dimensional
projections), we can straightforwardly extend the preceding results to functions $f : \mathbb{R}^d \to \mathbb{R}$.
Indeed, noting that if $h(t) = f(x + ty)$ then $h'(0) = \langle \nabla f(x), y \rangle$ and $h''(0) = y^\top \nabla^2 f(x) y$, we have
that a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if and only if
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) \text{ for all } x, y,$$
while a twice differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if and only if
$$\nabla^2 f(x) \succeq 0 \text{ for all } x.$$
Noting that nothing in the derivation that the quotient (B.3.2) was non-decreasing relied on $f$
being a function on $\mathbb{R}$, we can see that a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if and only if it satisfies the
increasing slopes criterion: for all $x \in \mathrm{dom}\, f$ and any vector $v$, the quotient
$$t \mapsto q(t) := \frac{f(x + tv) - f(x)}{t} \tag{B.3.4}$$
is nondecreasing in $t \ge 0$ (where we leave $x, v$ implicit). An alternative version of the criterion
(B.3.4) is that if $x \in \mathrm{dom}\, f$ and $v$ is any vector, if we define the one-dimensional convex
function $h(t) = f(x + tv)$ then for any $s < t$ and $\Delta > 0$, we have
$$\frac{h(t + \Delta) - h(t)}{\Delta} \ge \frac{h(t) - h(s)}{t - s} \ge \frac{h(t) - h(s - \Delta)}{t - (s - \Delta)}. \tag{B.3.5}$$
The proof that either of the inequalities (B.3.5) is equivalent to convexity we leave as an exercise
(Q. C.1).
(iv) The function $f$ has positive semidefinite Hessian: $\nabla^2 f(x) \succeq 0$ for all $x$.
A condition slightly stronger than convexity is strict convexity, which makes each of the
inequalities in Proposition B.3.1 strict. We begin with the classical definition: a function $f$ is strictly
convex if it is convex and
$$f(\lambda x + (1 - \lambda) y) < \lambda f(x) + (1 - \lambda) f(y)$$
whenever $\lambda \in (0, 1)$ and $x \neq y \in \mathrm{dom}\, f$. These are convex functions, but always have strictly
increasing slopes: secants lie strictly above $f$. By tracing through the arguments leading to
Proposition B.3.1 (replace appropriate non-strict inequalities with strict inequalities), one obtains the
following corollary describing strictly convex functions.
(iv) The function $f$ has positive definite Hessian: $\nabla^2 f(x) \succ 0$ for all $x$.
[Figure B.2 plot labels: $f(x)$, $f(x_0) + \langle g, x - x_0 \rangle$, and the point $(x_0, f(x_0))$.]
Figure B.2. The tangent (affine) function to the function $f$ generated by a subgradient $g$ at the
point $x_0$.
Interestingly, convex functions have subgradients (at least, nearly everywhere). This is perhaps
intuitively obvious by viewing a function in conjunction with its epigraph epi f and noting that
epi f has supporting hyperplanes, but here we state a result that will have further use.
Theorem B.3.3. Let $f$ be convex. Then there is an affine function minorizing $f$. More precisely,
for any $x_0 \in \mathrm{relint}\,\mathrm{dom}\, f$, there exists a vector $g$ such that
$$f(x) \ge f(x_0) + \langle g, x - x_0 \rangle \text{ for all } x.$$
Proof If $\mathrm{relint}\,\mathrm{dom}\, f = \emptyset$, then it is clear that $f$ is either identically $+\infty$ or its domain is a
single point $\{x_0\}$, in which case the constant function $f(x_0)$ minorizes $f$. Now, we assume that
$\mathrm{int}\,\mathrm{dom}\, f \neq \emptyset$, as we can simply always change basis to work in the affine hull of $\mathrm{dom}\, f$.
We use Theorem B.1.12 on the existence of supporting hyperplanes to construct a subgradient.
Indeed, we note that $(x_0, f(x_0)) \in \mathrm{bd}\,\mathrm{epi}\, f$, as for any open set $O$ containing the origin we have that $(x_0, f(x_0)) + O$
contains points both inside and outside of $\mathrm{epi}\, f$. Thus, Theorem B.1.12 guarantees the existence of
a vector $v$ and $a \in \mathbb{R}$, not both simultaneously zero, such that
$$\langle v, x_0 \rangle + a f(x_0) \le \langle v, x \rangle + a t \text{ for all } (x, t) \in \mathrm{epi}\, f. \tag{B.3.7}$$
Inequality (B.3.7) implies that $a \ge 0$, as for any $x$ we may take $t \to +\infty$ while satisfying $(x, t) \in \mathrm{epi}\, f$.
Now we argue that $a > 0$ strictly. To see this, note that for suitably small $\delta > 0$, we have
$x = x_0 - \delta v \in \mathrm{dom}\, f$. Then we find by inequality (B.3.7) that
$$\langle v, x_0 \rangle + a f(x_0) \le \langle v, x_0 \rangle - \delta \|v\|^2 + a f(x_0 - \delta v), \quad \text{or} \quad a\,[f(x_0) - f(x_0 - \delta v)] \le -\delta \|v\|^2.$$
So if $v = 0$, then Theorem B.1.12 already guarantees $a \neq 0$, while if $v \neq 0$, then $\|v\|^2 > 0$ and we
must have $a \neq 0$ and $f(x_0) \neq f(x_0 - \delta v)$. As we showed already that $a \ge 0$, we must have $a > 0$.
Then by setting $t = f(x)$ and dividing both sides of inequality (B.3.7) by $a$, we obtain
$$\frac{1}{a} \langle v, x_0 - x \rangle + f(x_0) \le f(x) \text{ for all } x \in \mathrm{dom}\, f.$$
Setting $g = -v / a$ gives the result of the theorem, as we have $f(x) = +\infty$ for $x \notin \mathrm{dom}\, f$.
Convex functions generally have quite nice behavior. Indeed, they enjoy some quite remarkable
continuity properties just by virtue of the defining convexity inequality (B.3.1). In particular, the
following theorem shows that convex functions are continuous on the relative interiors of their
domains. Even more, convex functions are Lipschitz continuous on any compact subsets contained
in the (relative) interior of their domains. (See Figure B.3 for an illustration of this fact.)
Theorem B.3.4. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and $C \subset \mathrm{relint}\,\mathrm{dom}\, f$ be compact. Then there exists
an $L = L(C) \ge 0$ such that
$$|f(x) - f(x')| \le L \|x - x'\| \text{ for all } x, x' \in C.$$
Figure B.3. Left: discontinuities in int dom f are impossible while maintaining convexity (Theo-
rem B.3.4). Right: At the edge of dom f , there may be points of discontinuity.
Lemma B.3.5. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and suppose that there are $x_0$, $\delta > 0$, $m$, and $M$ such
that
$$m \le f(x) \le M \text{ for } x \in B(x_0, 2\delta) := \{x : \|x - x_0\| < 2\delta\}.$$
Then $f$ is Lipschitz on $B(x_0, \delta)$, and moreover,
$$|f(y) - f(y')| \le \frac{M - m}{\delta} \|y - y'\| \text{ for } y, y' \in B(x_0, \delta).$$
Proof Let $y, y' \in B(x_0, \delta)$, and define $y'' = y' + \delta (y' - y) / \|y' - y\| \in B(x_0, 2\delta)$. Then we can
write $y'$ as a convex combination of $y$ and $y''$, specifically,
$$y' = \frac{\|y' - y\|}{\delta + \|y' - y\|}\, y'' + \frac{\delta}{\delta + \|y' - y\|}\, y.$$
By convexity of $f$, we therefore obtain
$$
\begin{aligned}
f(y') - f(y) &\le \frac{\|y' - y\|}{\delta + \|y' - y\|} f(y'') + \frac{\delta}{\delta + \|y' - y\|} f(y) - f(y) = \frac{\|y - y'\|}{\delta + \|y - y'\|} \big[f(y'') - f(y)\big] \\
&\le \frac{M - m}{\delta + \|y - y'\|} \|y - y'\|.
\end{aligned}
$$
Here we have used the bounds on $f$ assumed in the lemma. Swapping the assignments of $y$ and $y'$
gives the same lower bound, thus giving the desired Lipschitz continuity.
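A one-dimensional instance shows how the bound of Lemma B.3.5 behaves. The sketch below takes $f(x) = x^2$, $x_0 = 0$, and $\delta = 1$, all assumptions chosen for illustration, so that $0 \le f < 4$ on $B(0, 2)$ and the lemma guarantees a Lipschitz constant of at most $(M - m)/\delta = 4$ on $B(0, 1)$.

```python
def f(x):
    return x * x


def max_slope(func, points):
    """Largest difference quotient |f(y) - f(z)| / |y - z| over distinct pairs."""
    return max(
        abs(func(y) - func(z)) / abs(y - z)
        for y in points for z in points if y != z
    )


x0, delta = 0.0, 1.0
m, M = 0.0, 4.0                      # bounds for f on B(x0, 2 * delta) = (-2, 2)
bound = (M - m) / delta              # Lipschitz constant promised by the lemma

# Grid strictly inside B(x0, delta) = (-1, 1).
inner_points = [x0 + delta * k / 50 for k in range(-49, 50)]
observed = max_slope(f, inner_points)
```

The observed slope stays below 2 on this grid, so the lemma's constant 4 is valid but not tight here; the lemma trades sharpness for generality, since it uses only the bounds $m$ and $M$.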
With Lemma B.3.5 in place, we proceed to the proof proper. We assume without loss of
generality that $\mathrm{dom}\, f$ has an interior; otherwise we prove the theorem restricting ourselves to the
affine hull of $\mathrm{dom}\, f$. The proof follows a standard compactness argument. Suppose that for
each $x \in C$, we could construct an open ball $B_x = B(x, \delta_x)$ with $\delta_x > 0$ such that
$$|f(y) - f(y')| \le L_x \|y - y'\| \text{ for } y, y' \in B_x. \tag{B.3.8}$$
As the $B_x$ cover the compact set $C$, we can extract a finite number of them, call them $B_{x_1}, \ldots, B_{x_k}$,
covering $C$, and then within each (overlapping) ball $f$ is $\max_k L_{x_k}$ Lipschitz. As a consequence, we
find that
$$|f(y) - f(y')| \le \max_k L_{x_k} \|y - y'\|$$
for any $y, y' \in C$.
We thus must derive inequality (B.3.8), for which we use the boundedness Lemma B.3.5. We
must demonstrate that $f$ is bounded in a neighborhood of each $x \in C$. To that end, fix $x \in \mathrm{int}\,\mathrm{dom}\, f$,
and let the points $x_0, \ldots, x_d$ be affinely independent and such that
$$\Delta := \mathrm{Conv}\{x_0, \ldots, x_d\} \subset \mathrm{dom}\, f$$
Moreover, Theorem B.3.3 implies that there is some affine function $h$ minorizing $f$; let $h(x) =
a + \langle v, x \rangle$ denote this function. Then
exists and is finite, so that in the ball $B(x, 2\delta)$ constructed above, we have $f(y) \in [m, M]$ as required
by Lemma B.3.5. This guarantees the existence of a ball $B_x$ required by inequality (B.3.8).
[Figure B.4: a function $f(x)$ and its epigraph $\mathrm{epi}\, f$.]
Our final discussion of continuity properties of convex functions revolves around the most com-
mon and analytically convenient type of convex function, the so-called closed-convex functions.
for all $x_0$ and any sequence of points tending toward $x_0$. See Figure B.4 for an example of such a
function and its associated epigraph.
Interestingly, in the one-dimensional case, closed convexity implies continuity. Indeed, we have
the following observation (compare Figures B.4 and B.3 previously):
Proof By Theorem B.3.4, we need only consider the endpoints of the domain of $f$ (the result
is obvious by Theorem B.3.4 if $\mathrm{dom}\, f = \mathbb{R}$); let $x_0 \in \mathrm{bd}\,\mathrm{dom}\, f$. Let $y \in \mathrm{dom}\, f$ be an otherwise
arbitrary point, and define $x = \lambda y + (1 - \lambda) x_0$. Then taking $\lambda \to 0$, we have
$$f(x) \le \lambda f(y) + (1 - \lambda) f(x_0) \to f(x_0),$$
so that $\limsup_{x \to x_0} f(x) \le f(x_0)$. By the closedness assumption (B.3.9), we have $\liminf_{x \to x_0} f(x) \ge
f(x_0)$, and continuity follows. Note that in this argument, if $x_0 \notin \mathrm{dom}\, f$, then $f(x_0) = +\infty$ by
convention; for $\mathrm{epi}\, f$ to be closed we require that for each $t < f(x_0) = \infty$, we may take a small
enough open interval $U = (y, x_0)$ for which $f(x) > t$ for all $x \in U$.
In the full-dimensional case, we do not have quite the same continuity, though Theorem B.3.4
guarantees continuity on the (relative) interior of dom f .
An important characterization of convex functions is as the supremum of all affine functionals
(linear plus an offset) below them, which is one of the keys to duality relationships about functions
to come.
Theorem B.3.7. Let $f$ be closed convex and let $\mathcal{A}$ be the collection of affine functions $h$ satisfying
$f(x) \ge h(x)$ for all $x$. Then $f(x) = \sup_{h \in \mathcal{A}} h(x)$.
Proof By Theorem B.1.13, which shows that any closed convex set is the intersection of all the halfspaces
containing (even supporting) it, we can write $\mathrm{epi}\, f = \cap_{H \in \mathcal{H}} H$, where $\mathcal{H}$ is the collection of closed
halfspaces $H \supset \mathrm{epi}\, f$. We may write any such halfspace as
$$H = \{(x, r) \in \mathbb{R}^d \times \mathbb{R} \mid \langle a, x \rangle + br \le c\}$$
where $(a, b) \in \mathbb{R}^d \times \mathbb{R}$ is non-zero. As $H \supset \mathrm{epi}\, f$, the particular nature of epigraphs (that is, that
if $(x, t) \in \mathrm{epi}\, f$ then $(x, t + \Delta) \in \mathrm{epi}\, f$ for all $\Delta > 0$) means that $b \le 0$, and so for any $b < 0$ we
may divide through by $b$ to rewrite $H$ as $H = \{(x, r) \mid \langle a/b, x \rangle + r \ge c/b\}$, while if $b = 0$ then
$H = \{(x, r) \mid \langle a, x \rangle \le c\}$. That is, it is no loss of generality to set
which (respectively) correspond to the non-vertical halfspaces containing $\mathrm{epi}\, f$ and the halfspaces
containing $\mathrm{dom}\, f \subset \mathbb{R}^d$. We have $\mathrm{epi}\, f = \bigcap_{H \in \mathcal{H}_1} H \cap \bigcap_{H \in \mathcal{H}_0} H$.
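Identify the halfspaces $H \in \mathcal{H}_0$ or $\mathcal{H}_1$ with the associated triple $(a, 0, c)$ or $(a, 1, c)$ and abuse
notation to write $(a, i, c) \in \mathcal{H}_i$ for $i \in \{0, 1\}$. For any $(a, 1, c) \in \mathcal{H}_1$, the affine function
$$l(x) = c - \langle a, x \rangle = \inf\{r \mid \langle a, x \rangle + r \ge c\} \text{ satisfies } \langle a, x \rangle + l(x) \ge c \text{ for all } x,$$
and so necessarily $l(x) \le f(x)$ for all $x$, while for the function $h(x) = \sup_{(a, 1, c) \in \mathcal{H}_1} \{c - \langle a, x \rangle\}$ we
have
$$\mathrm{epi}\, h = \bigcap_{H \in \mathcal{H}_1} H.$$
Thus, if we can show that
$$\bigcap_{H \in \mathcal{H}_1} H \cap \bigcap_{H \in \mathcal{H}_0} H = \bigcap_{H \in \mathcal{H}_1} H \tag{B.3.10}$$
the proof will be complete.
To show the equality (B.3.10), take arbitrary vectors $v_0 = (a_0, 0, c_0) \in \mathcal{H}_0$ and $v_1 = (a_1, 1, c_1) \in
\mathcal{H}_1$, and let $H_0 = \{(x, r) \mid \langle a_0, x \rangle \ge c_0\}$ and $H_1 = \{(x, r) \mid \langle a_1, x \rangle + r \ge c_1\}$ be the associated
halfspaces. Consider the conic-like vector
$$v(t) := (a_1 + t a_0, 1, c_1 + t c_0) \text{ for } t \ge 0$$
and associated halfspace $H(t) := \{(x, r) \mid \langle a_1 + t a_0, x \rangle + r \ge c_1 + t c_0\}$. Then as $\langle a_0, x \rangle \ge c_0$ if and
only if $t \langle a_0, x \rangle \ge t c_0$ for all $t \ge 0$, any point $(x, r) \in H_0 \cap H_1$ satisfies
$$\langle a_1 + t a_0, x \rangle + r \ge c_1 + t c_0 \text{ for } t \ge 0,$$
that is, $H(t) \in \mathcal{H}_1$ and $(x, r) \in \cap_{t \ge 0} H(t)$. Additionally, taking $t = 0$ we see that $H(0) = H_1$ and
so $\cap_{t \ge 0} H(t) \subset H_1$, while taking $t \uparrow \infty$ we obtain that each $(x, r) \in \cap_{t \ge 0} H(t)$ satisfies $\langle a_0, x \rangle \ge c_0$.
That is, we have
$$\bigcap_{t \ge 0} H(t) = H_0 \cap H_1,$$
while $H(t) \in \mathcal{H}_1$ for all $t \ge 0$. This shows the equality (B.3.10).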
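Theorem B.3.7 can be seen in miniature for $f(x) = x^2$, whose tangent lines $h_a(x) = 2ax - a^2$ are affine minorants because $x^2 - h_a(x) = (x - a)^2 \ge 0$. The sketch below, an illustration with choices made here rather than part of the notes, takes the supremum over a finite grid of tangents and compares it with $f$ on the same grid, using quarter-integers so the arithmetic is exact.

```python
def f(x):
    return x * x


def tangent(a):
    """Affine minorant of f(x) = x^2 touching the graph at a: 2 a x - a^2."""
    return lambda x: 2 * a * x - a * a


grid = [k / 4 for k in range(-40, 41)]
minorants = [tangent(a) for a in grid]

# Every tangent lies below f on the grid.
all_below = all(h(x) <= f(x) for h in minorants for x in grid)

# The pointwise supremum of the tangents recovers f at each grid point.
sup_matches = all(max(h(x) for h in minorants) == f(x) for x in grid)
```

The supremum is attained by the tangent at $a = x$, which is why it matches $f$ exactly on grid points; off the grid a finite family of tangents would give a piecewise-linear lower approximation instead.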
In spite of the continuity of closed convex functions on R, closed convex functions on higher
dimensional spaces need not be continuous. Indeed, it is immediate (see Proposition B.3.9 to follow)
that f (x) := supα∈A {fα (x)} is closed convex whenever fα are all closed convex for any index set
A. We have the following failure of continuity.
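Example B.3.8 (A discontinuous closed convex function): Define the function $f : \mathbb{R}^2 \to \mathbb{R}$ by
\[ f(x) := \sup\left\{ \alpha x_1 + \beta x_2 \mid \tfrac{1}{2}\alpha^2 \le \beta \right\}. \]
Then certainly $f(0) = 0$ and $f$ is closed convex. If the supremum is attained then $\beta = \frac{1}{2}\alpha^2$ and so $\beta \ge 0$ and
\[ f(x) = \sup_{\alpha} \left\{ \alpha x_1 + \tfrac{1}{2}\alpha^2 x_2 \right\} = \begin{cases} 0 & \text{if } x = 0 \\ -\frac{x_1^2}{2 x_2} & \text{if } x_2 < 0 \\ +\infty & \text{otherwise.} \end{cases} \]
But then along the path $x_2 = -\frac{1}{2} x_1^2$, we always have $f(x) = 1$, while taking $x_1 \to 0$ gives $f(x) = 1 > 0 = f(0)$. $\Diamond$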
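As a small numerical sanity check of the example, the following Python sketch evaluates the closed form along the parabola $x_2 = -\frac{1}{2}x_1^2$ and compares it with a brute-force supremum at one point. The function names and the grid parameters (`radius`, `steps`) are illustrative choices, not part of the notes.

```python
def f_closed_form(x1, x2):
    """Closed form of f(x) = sup{a*x1 + b*x2 : a**2 / 2 <= b} from the example."""
    if x1 == 0.0 and x2 == 0.0:
        return 0.0
    if x2 < 0.0:
        return -x1 ** 2 / (2.0 * x2)
    return float("inf")


def f_grid(x1, x2, radius=50.0, steps=20001):
    """Brute-force the supremum over a on a grid, taking b = a**2 / 2.

    Only meaningful when x2 < 0 and the maximizer -x1 / x2 lies in [-radius, radius].
    """
    best = float("-inf")
    for i in range(steps):
        a = -radius + 2.0 * radius * i / (steps - 1)
        best = max(best, a * x1 + 0.5 * a * a * x2)
    return best


# Values of f along the parabola x2 = -x1**2 / 2 as x1 shrinks toward 0.
path_values = [f_closed_form(x1, -0.5 * x1 ** 2) for x1 in (1.0, 0.1, 1e-2, 1e-4)]
```

Every entry of `path_values` equals 1 even though the points approach the origin, where $f(0) = 0$, which is exactly the failure of continuity described above.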
Proposition B.3.9. Let {fα }α∈A be an arbitrary collection of convex functions indexed by A.
Then
\[ f(x) := \sup_{\alpha \in A} f_\alpha(x) \]
is convex. Moreover, if for each α ∈ A, the function fα is closed convex, f is closed convex.
Proof The proof is immediate once we consider the epigraph $\operatorname{epi} f$. We have that
\[ \operatorname{epi} f = \bigcap_{\alpha \in A} \operatorname{epi} f_\alpha, \]
which is convex whenever epi fα is convex for all α and closed whenever epi fα is closed for all α
(recall Observation B.1.5).
Figure B.5 (curves $f_1(x)$ and $f_2(x)$). The maximum of two convex functions is convex, as its epigraph is the intersection of the two epigraphs.
Another immediate result is that composition of a convex function with an affine transformation
preserves convexity:
Proposition B.3.10. Let A ∈ Rd×n and b ∈ Rd , and let f : Rd → R be convex. Then the function
g(y) = f (Ay + b) is convex.
Partial minimization of convex functions and some related transformations preserve convexity
as well.
From the proposition we immediately see that if f (x, y) is jointly convex in x and y, then the
partially minimized function inf y∈Y f (x, y) is convex whenever Y is a convex set.
Lastly, we consider the functional analogue of the perspective transform. Given a function
f : Rd → R, the perspective transform of f is defined as
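\[ \operatorname{pers}(f)(x, t) := \begin{cases} t f\left(\frac{x}{t}\right) & \text{if } t > 0 \text{ and } \frac{x}{t} \in \operatorname{dom} f \\ +\infty & \text{otherwise.} \end{cases} \tag{B.3.11} \]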
In analogue with the perspective transform of a convex set, the perspective transform of a function
is (jointly) convex.
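To make the definition concrete, here is a minimal Python sketch of the perspective transform for a scalar function with $\operatorname{dom} f = \mathbb{R}$ (an assumption made only for this illustration), together with a randomized check of joint convexity. The helper names, sampling ranges, and trial count are hypothetical.

```python
import random


def perspective(f, x, t):
    """pers(f)(x, t) = t * f(x / t) for t > 0 and +inf otherwise; assumes dom f = R."""
    if t <= 0:
        return float("inf")
    return t * f(x / t)


def max_convexity_violation(f, trials=10000, seed=0):
    """Largest observed value of pers(f)(convex combination) minus the chord value."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x0, x1 = rng.uniform(-5, 5), rng.uniform(-5, 5)
        t0, t1 = rng.uniform(0.1, 5), rng.uniform(0.1, 5)
        lam = rng.random()
        mid = perspective(f, lam * x0 + (1 - lam) * x1, lam * t0 + (1 - lam) * t1)
        chord = lam * perspective(f, x0, t0) + (1 - lam) * perspective(f, x1, t1)
        worst = max(worst, mid - chord)
    return worst
```

For $f(z) = z^2$ the perspective is $x^2 / t$, and `max_convexity_violation(lambda z: z * z)` should be zero up to rounding, consistent with joint convexity in $(x, t)$.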
Proof The result follows if we can show that epi pers(f ) is a convex set. With that in mind,
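note that
\[ \mathbb{R}^d \times \mathbb{R}_{++} \times \mathbb{R} \ni (x, t, r) \in \operatorname{epi} \operatorname{pers}(f) \quad \text{if and only if} \quad f\left(\frac{x}{t}\right) \le \frac{r}{t}. \]
Rewriting this, we have
\[ \operatorname{epi} \operatorname{pers}(f) = \left\{ (x, t, r) \in \mathbb{R}^d \times \mathbb{R}_{++} \times \mathbb{R} : f\left(\frac{x}{t}\right) \le \frac{r}{t} \right\} = \left\{ t(x', 1, r') : x' \in \mathbb{R}^d, t \in \mathbb{R}_{++}, r' \in \mathbb{R}, f(x') \le r' \right\} \]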
Take any sequence $\Delta_n \to 0$ achieving this limit supremum, and let $\Delta_n = \epsilon_n v_n$ for a sequence $v_n$ on the sphere, that is, $\|v_n\| = 1$, so $\epsilon_n = \|\Delta_n\|$. Then by passing to a subsequence if necessary, we can assume w.l.o.g. that $v_n \to v$ with $\|v\| = 1$. Then
Taking ∆ = tv and t ↓ 0, this implies that L kvk ≥ h∇f (x), vi for all v, which is equivalent to
k∇f (x)k ≤ L.
The main consequence of convexity that is important for us is that a convex function is direc-
tionally differentiable at every point in the interior of its domain, though the directional derivative
need not be linear:
Proposition B.3.16. Let f be convex and x ∈ int dom f . Then f 0 (x; v) exists and the mapping
v 7→ f 0 (x; v) is sublinear, convex, and globally Lipschitz.
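The following Python sketch illustrates the proposition on the $\ell_1$ norm at the point $x = (0, 1)$, where $f'(x; v) = |v_1| + v_2$ is sublinear but not linear. The one-sided difference quotient and the step size `t` are illustrative assumptions.

```python
def l1(x):
    """The l1 norm, a convex function that is not differentiable at x = (0, 1)."""
    return sum(abs(xi) for xi in x)


def directional_derivative(f, x, v, t=1e-6):
    """One-sided difference quotient approximating f'(x; v) = lim_{t -> 0+} (f(x + t v) - f(x)) / t."""
    shifted = [xi + t * vi for xi, vi in zip(x, v)]
    return (f(shifted) - f(x)) / t


x = [0.0, 1.0]
d_plus = directional_derivative(l1, x, [1.0, 0.0])    # approximately 1
d_minus = directional_derivative(l1, x, [-1.0, 0.0])  # approximately 1, not -1
```

A linear map would give `d_minus == -d_plus`; here both are close to 1, while positive homogeneity and subadditivity still hold.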
Proof If $x \in \operatorname{int} \operatorname{dom} f$, then the criterion (B.3.4) of increasing slopes guarantees that $f'(x; v) = \lim_{t \downarrow 0} \frac{f(x + tv) - f(x)}{t}$ exists for all $x \in \operatorname{int} \operatorname{dom} f$, as the quantity is monotone. To see that $f'(x; v)$ is convex and sublinear in $v$, note that positive homogeneity is immediate, as we have $\frac{1}{t}(f(x + \alpha t v) - f(x)) = \frac{\alpha}{\alpha t}(f(x + \alpha t v) - f(x))$ for all $\alpha > 0$, and $f'(x; 0) = 0$. That it is convex is straightforward as well: for any $u, v$ we have
\[ \frac{f(x + t(\lambda u + (1 - \lambda) v)) - f(x)}{t} \le \lambda \frac{f(x + tu) - f(x)}{t} + (1 - \lambda) \frac{f(x + tv) - f(x)}{t} \]
and take $t \downarrow 0$. For the global Lipschitz claim, note that $f$ is already locally Lipschitz near $x \in \operatorname{int} \operatorname{dom} f$ (recall Theorem B.3.4), so that there exists some $L < \infty$ and $\epsilon > 0$ such that for all $\|v\| = 1$ and $0 \le t \le \epsilon$, $|f(x + tv) - f(x)| \le Lt$, whence $|f'(x; v)| \le L$ and by homogeneity
An inspection of the proof shows that the result extends even to all of $\operatorname{dom} f$ if we allow $f'(x; v) = +\infty$ whenever $x + tv \notin \operatorname{dom} f$ for all $t > 0$, though of course we lose that $f'(x; v)$ is finite-valued.
Then we have the following corollary, showing that f 0 (x; v) provides a valid first-order development
of f in all directions from x (where we take ∞ · t = ∞ whenever t > 0).
as t ↓ 0 and
f (x + tv) ≥ f (x) + f 0 (x; v)t for all t ≥ 0.
There are strong connections between subdifferentials and directional derivatives, and hence of
the local developments (B.3.12). The following result makes this clear.
Proof For shorthand let S = {s | hs, vi ≤ f 0 (x; v) all v} be the set on the right. If s ∈ S, then
the criterion (B.3.4) of increasing slopes guarantees that
\[ \langle s, v \rangle \le \frac{f(x + tv) - f(x)}{t} \quad \text{for all } v \in \mathbb{R}^d, \ t > 0. \]
Recognizing that as v is allowed to vary over all of Rd and t > 0, then x + tv similarly describes
Rd , we see that this condition is completely equivalent to the definition (B.3.6) of the subgradient.
That $\partial f(x) \neq \emptyset$ is Theorem B.3.3.
We can also extend this to x ∈ dom f —not necessarily the interior—where we see that there is
no loss (even when f may be +∞ valued) to defining
Notably, the directional derivative function v 7→ f 0 (x; v) always exists for x ∈ dom f and is a
sublinear convex function, and thus ∂f (x) above is always a closed convex set whose support
function (recall (B.2.1)) is the closure of v 7→ f 0 (x; v). While the subdifferential ∂f (x) is always a
compact convex set when x ∈ int dom f , even when it exists it may not be compact if x is on the
boundary of dom f . To see one important example of this, consider the indicator function
\[ I_C(x) := \begin{cases} +\infty & \text{if } x \notin C \\ 0 & \text{if } x \in C \end{cases} \]
of a closed convex set C. For simplicity, let C = [a, b] be an interval. Then we have
\[ \partial I_C(x) = \begin{cases} [0, \infty] & \text{if } x = b \\ \{0\} & \text{if } a < x < b \\ [-\infty, 0] & \text{if } x = a. \end{cases} \]
Whether points ±∞ are included is a matter of convenience and whether we work with the extended
real line.
JCD Comment: Draw a picture of this
These representations point to a certain closure property of subgradients, namely, that the
subdifferential is closed under additions of the normal cone to the domain of f :
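Lemma B.3.19. Let $N_{\operatorname{dom} f}(x)$ be the normal cone (Definition C.1) to $\operatorname{dom} f$ at the point $x$ (where $N_{\operatorname{dom} f}(x) = \{0\}$ for $x \in \operatorname{int} \operatorname{dom} f$ and $N_{\operatorname{dom} f}(x) = \emptyset$ for $x \notin \operatorname{dom} f$). Then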
Proof We only need concern ourselves with points x ∈ bd dom f , where the normal cone N =
Ndom f (x) is non-trivial. If ∂f (x) is empty, there is nothing to prove, so assume that ∂f (x) is
non-empty. Then the definition (B.3.15) of the subdifferential as ∂f (x) = {s | hs, ui ≤ f 0 (x; u)}
allows us to prove the result. First, consider vectors u for which f 0 (x; u) = +∞. Then certainly, for
any s ∈ ∂f (x), we have hs + v, ui ≤ f 0 (x; u) for all v ∈ N . If f 0 (x; u) < ∞, then for small enough
t > 0 we necessarily have x + tu ∈ dom f . In particular, the definition of the normal cone gives that
v ∈ N satisfies 0 ≥ hv, x + tu − xi = thv, ui, or that hv, ui ≤ 0. Thus hs + v, ui ≤ hs, ui ≤ f 0 (x; u),
and so s + v ∈ ∂f (x) once again.
The claim about boundedness is immediate, because Ndom f is a cone.
A more compelling case for the importance of the subgradient set with respect to first-order
developments and differentiability properties of convex functions is the following:
JCD Comment: Add a picture of this as well.
Proof That sups∈∂f (x) hs, vi = f 0 (x; v) is immediate by Theorem B.3.7 and Proposition B.2.1,
because f 0 (x; v) is sublinear and closed convex in v when x ∈ int dom f . Certainly the right hand
sides are then equal.
We thus prove the equality f (y) = f (x)+f 0 (x; y −x)+o(ky − xk), where the argument is similar
to that for Proposition B.3.15. Let $y_n \to x$ be any sequence and let $\Delta_n = y_n - x$, so that $\|\Delta_n\| \to 0$;
as x ∈ int dom f , there exists a (local) Lipschitz constant L such that |f (x + ∆) − f (x)| ≤ L k∆k
for all small ∆. Similarly, because v 7→ f 0 (x; v) is convex (even positively homogeneous and thus
sublinear), it has a Lipschitz constant, and we take this to be $L$ as well. Now, write $\Delta_n = \epsilon_n v_n$ where $\|v_n\| = 1$ and $\epsilon_n \to 0$, and moving to a subsequence if necessary let $v_n \to v$. Then we have
Note that convexity only played the role of establishing the local Lipschitz property of f in the
proof of Proposition B.3.20; any locally Lipschitz function with directional derivatives will enjoy a
similar first-order expansion.
As our final result on smoothness properties of convex functions, we connect subdifferentials to
differentiability properties of convex f . First, we give a lemma showing that the subdifferential set
∂f is outer semicontinuous.
Lemma B.3.21 (Closure of the graph of the subdifferential). Let f : Rd → R be closed convex.
Then the graph {(x, s) | x ∈ Rd , s ∈ ∂f (x)} of its subdifferential is closed. Equivalently, whenever
xn → x with sn ∈ ∂f (xn ) and sn → s, f has non-empty subdifferential at x with s ∈ ∂f (x).
Proof We prove the second statement, whose equivalence to the first is definitional. Fix any
y ∈ Rd . Then f (y) ≥ f (xn ) + hsn , y − xn i, and because f is closed (i.e., lower semicontinuous),
we have $\liminf f(x_n) \ge f(x)$. Let $\epsilon > 0$ be arbitrary. Then for all large enough $n$, we have $f(x_n) \ge f(x) - \epsilon$, and similarly, $\|s_n - s\| \le \epsilon$, $\|x_n - x\| \le \epsilon$, and $\|y - x_n\| \le \|y - x\| + \epsilon$. Then
Given the somewhat technical Lemma B.3.21, we can show that if f is convex and differentiable
at a point, it is in fact continuously differentiable at the point.
Proposition B.3.22. Let f be convex and x ∈ int dom f . Then ∂f (x) is a singleton if and only if
f is differentiable at x. If additionally f is differentiable on an open set U , then f is continuously
differentiable on U .
Proof Because x ∈ int dom f , there exists L < ∞ such that f is L-Lipschitz near x by Theo-
rem B.3.4. Suppose that ∂f (x) = {s}. Then the directional derivative f 0 (x; v) = hs, vi for all v,
and Proposition B.3.20 gives
Proof By Proposition B.3.18, the set ∂f (x) is a compact convex set, and the general defi-
nition (B.3.15) of the subdifferential gives that ∂g(x) is closed convex. Let S1 = ∂f (x) and
S2 = ∂g(x). Then immediately S1 + S2 ⊂ ∂(f + g)(x), so that
\[ S := \partial (f + g)(x) = \left\{ s \mid \langle s, v \rangle \le f'(x; v) + g'(x; v) \text{ for all } v \in \mathbb{R}^d \right\} \]
is non-empty. Because of the support function equality f 0 (x; v) = σS1 (v) and g 0 (x; v) = σS2 (v),
Corollary B.2.5 gives
σS (v) = σS1 (v) + σS2 (v) = σS1 +S2 (v).
Thus (Corollary B.2.4) S1 + S2 = S.
Other situations that arise frequently are composition with affine mappings and taking maxima
or suprema of convex functions, so that finding a calculus for these is also important.
Corollary B.3.24. Let f : Rm → R be convex and for A ∈ Rm×d and b ∈ Rm , let g(x) = f (Ax+b).
Then
\[ \partial g(x) = A^T \partial f(Ax + b). \]
Proof Using the directional derivative, we have g 0 (x; v) = f 0 (Ax + b; Av) for all v ∈ Rd , and
applying Proposition B.2.6 gives that the latter is the support function of the convex compact set
$A^T \partial f(Ax + b)$.
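In the differentiable case the corollary reduces to the chain rule $\nabla g(x) = A^T \nabla f(Ax + b)$. The sketch below checks this against central finite differences for the assumed example $f(z) = \frac{1}{2}\|z\|_2^2$; the matrix, vectors, helper names, and step size are arbitrary illustrative choices.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def f(z):
    """Assumed example: f(z) = 0.5 * ||z||_2^2, whose gradient is z itself."""
    return 0.5 * sum(zi * zi for zi in z)


def g(A, b, x):
    """The composition g(x) = f(Ax + b)."""
    return f([yi + bi for yi, bi in zip(matvec(A, x), b)])


def grad_g(A, b, x):
    """Chain rule from the corollary: grad g(x) = A^T grad f(Ax + b)."""
    z = [yi + bi for yi, bi in zip(matvec(A, x), b)]
    return matvec(transpose(A), z)


def numeric_grad(fun, x, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((fun(xp) - fun(xm)) / (2.0 * h))
    return grad
```

For nonsmooth $f$ the same formula applies with sets: each subgradient of $f$ at $Ax + b$ is mapped through $A^T$.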
It is also useful to be able to compute subdifferentials of maxima and suprema (recall Proposi-
tion B.3.9). Consider a collection {fα }α∈A of convex functions, and define
be the indices attaining the suprema, that is, the active index set (though this may be empty).
Then there is an “easy” direction:
Lemma B.3.25. With the notation above,
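\[ \partial f(x) \supset \operatorname{cl} \operatorname{Conv} \left\{ \bigcup \partial f_\alpha(x) \mid \alpha \in A(x) \right\} = \operatorname{cl} \operatorname{Conv} \left\{ g \mid g \in \partial f_\alpha(x) \text{ for some } \alpha \in A(x) \right\}. \]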
Thus g ∈ ∂f (x), which as a closed convex set must thus include its closed convex hull.
A much more challenging argument is to show that the active index set A(x) exactly charac-
terizes the subdifferential of f at x; we simply state a typical result as a proposition.
Proposition B.3.26. Let A be a compact set (for some metric) and assume that for each x, the
mapping α 7→ fα (x) is upper semi-continuous. Then
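\[ \partial f(x) = \operatorname{Conv} \left\{ \bigcup \partial f_\alpha(x) \mid \alpha \in A(x) \right\} = \operatorname{Conv} \left\{ g \mid g \in \partial f_\alpha(x) \text{ for some } \alpha \in A(x) \right\}. \]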
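A finite index set is compact, so the proposition applies to a maximum of finitely many affine functions. The sketch below treats the one-dimensional case $f(x) = \max_i \{a_i x + b_i\}$, where the convex hull of the active slopes is an interval. The `(slope, intercept)` representation and the tolerance `tol` are assumptions of this illustration.

```python
def f_max(pieces, x):
    """f(x) = max_i (a_i * x + b_i) for pieces given as (slope, intercept) pairs."""
    return max(a * x + b for a, b in pieces)


def active_set(pieces, x, tol=1e-12):
    """Indices of the pieces attaining the maximum at x (up to a small tolerance)."""
    values = [a * x + b for a, b in pieces]
    top = max(values)
    return [i for i, v in enumerate(values) if v >= top - tol]


def subdifferential_interval(pieces, x):
    """Convex hull of the active slopes, returned as an interval (lo, hi)."""
    slopes = [pieces[i][0] for i in active_set(pieces, x)]
    return min(slopes), max(slopes)


pieces = [(-1.0, 0.0), (1.0, 0.0), (2.0, -1.0)]
kink_at_zero = subdifferential_interval(pieces, 0.0)  # pieces 0 and 1 are active
kink_at_one = subdifferential_interval(pieces, 1.0)   # pieces 1 and 2 are active
```

At a point where a single piece is active, the interval collapses to one slope, which matches differentiability of $f$ there.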
Finally, we revisit the partial minimization operation in Proposition B.3.11. In this case, we
require a bit more care when defining subdifferentials and subdifferentiability. For A ∈ Rn×m with
m ≥ n, where A has rank n (so that x 7→ Ax is surjective) and f : Rm → R, define the function
which is convex. Define the set Y ? (x) := {y | Ay = x and fA (x) = f (y)} to be the set of y
attaining the infimum in the definition of fA , which may be empty. When it is not, however, we
can characterize the subdifferential of fA (x):
Proposition B.3.27. Let x ∈ Rn be a point for which Y ? (x) is non-empty for the function fA .
Then
\[ \partial f_A(x) = \left\{ s \mid A^T s \in \partial f(y) \right\} \]
for any y ∈ Y ? (x), and the set on the right is independent of the choice of y.
Proof A vector $s$ is a subgradient of $f_A$ at $x$ if and only if
Because A has full row rank, for any x0 ∈ Rn there exists y 0 with Ay 0 = x0 ; by definition of fA as
the infimum, the preceding display is thus equivalent to
Appendix C
The existence and continuity properties of minimizers of (convex) optimization problems play a
central role in much of statistical theory. They are especially essential in our understanding of loss
functions and the associated optimality properties. In our context, this is especially central for
problems of classification calibration or surrogate risk consistency, as in Chapters 13 and 14. This
appendix records several representative results along these lines, and also builds up the duality
theory associated with convex conjugates, frequently identified as Fenchel-Young duality.
Broadly, throughout this appendix, we shall consider the generic optimization problem
\[ \begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & x \in C \end{array} \tag{C.0.1} \]
where $C$ is a closed convex set (we have not yet assumed convexity of $f$). Throughout (as in the previous appendix) we assume that $f$ is proper, so that $f(x) > -\infty$ for each $x$, and that $f(x) = +\infty$ if $x \notin \operatorname{dom} f$.
The most basic question we might ask is when minimizers even exist in the problem (C.0.1).
The standard result in this vein is that minimizers exist whenever $C$ is compact and $f$ is lower semicontinuous (B.3.9), that is, its epigraph is closed, i.e., $\liminf_{x \to x_0} f(x) \ge f(x_0)$.
Proof Let $f^\star = \inf_{x \in C} f(x)$, where for now we allow the possibility that $f^\star = -\infty$. Let $x_n \in C$ be a sequence of points satisfying $f(x_n) \to f^\star$. Proceeding to a subsequence if necessary, we can assume that $x_n \to x^\star \in C$ by the compactness of $C$. Then lower semi-continuity guarantees that $f^\star = \lim_n f(x_n) \ge f(x^\star) \ge f^\star$, and so $f(x^\star) = f^\star$ and so necessarily $f^\star > -\infty$.
When the domain C is not compact but only closed, alternative conditions are necessary to
guarantee the existence of minimizers. Perhaps the most frequent, and one especially useful with
convexity (as we shall see), is that f is coercive, meaning that
Proposition C.0.2. Let C be closed and f : Rd → R be lower semi-continuous over C and coercive.
Then inf x∈C f (x) > −∞ and the infimum is attained.
Proof Once again, let $f^\star = \inf_{x \in C} f(x)$ and let $x_n \in C$ satisfy $f(x_n) \to f^\star$. Certainly $x_n$ must be a bounded sequence because $f$ is coercive. Thus, it has a subsequential limit, and w.l.o.g. we assume that $x_n \to x^\star \in C$ by closedness. Lower semi-continuity guarantees that $f^\star = \liminf_n f(x_n) \ge f(x^\star) \ge f^\star$, giving the result.
Finally, we make a small remark on norms and dual norms, as these will be important for the more quantitative smoothness guarantees we provide. For a norm $\|\cdot\|$, the dual norm $\|\cdot\|_*$ has definition
\[ \|y\|_* := \sup \{ \langle x, y \rangle \mid \|x\| \le 1 \}. \]
This is a norm as it is positively homogeneous, kyk∗ = 0 if and only if y = 0, and satisfies the
triangle inequality. A few brief examples follow, which we leave as exercises to the reader.
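(i) The $\ell_2$-norm $\|x\|_2 = \sqrt{\langle x, x \rangle}$ is self-dual, so that its dual is $\|\cdot\|_2$.
(ii) The $\ell_1$ and $\ell_\infty$ norms are dual, that is, $\|x\|_\infty = \sup_{\|y\|_1 \le 1} \langle x, y \rangle$ and $\|y\|_1 = \sup_{\|x\|_\infty \le 1} \langle x, y \rangle$.
(iii) For all $p \in [1, \infty]$, the dual to the $\ell_p$ norm $\|x\|_p = (\sum_{j=1}^d |x_j|^p)^{1/p}$ is the $\ell_q$ norm with $q = \frac{p}{p-1}$, that is, for the $q \ge 1$ satisfying $\frac{1}{p} + \frac{1}{q} = 1$.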
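The duality in item (iii) can be checked numerically. For $1 < p < \infty$ and $y \neq 0$, the vector with entries $x_j = \operatorname{sign}(y_j) |y_j|^{q-1} / \|y\|_q^{q-1}$ has unit $\ell_p$ norm and attains $\langle x, y \rangle = \|y\|_q$. The Python sketch below verifies this; the function names are illustrative.

```python
import math


def lp_norm(x, p):
    """The l_p norm for p in [1, inf]."""
    if p == float("inf"):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def dual_exponent(p):
    """The conjugate exponent q with 1/p + 1/q = 1."""
    if p == 1:
        return float("inf")
    if p == float("inf"):
        return 1.0
    return p / (p - 1.0)


def holder_maximizer(y, p):
    """Unit-l_p vector x attaining <x, y> = ||y||_q; assumes 1 < p < inf and y != 0."""
    q = dual_exponent(p)
    scale = lp_norm(y, q) ** (q - 1.0)
    return [math.copysign(abs(yi) ** (q - 1.0), yi) / scale for yi in y]


y = [3.0, -4.0, 1.0]
x = holder_maximizer(y, 3.0)
attained = sum(xi * yi for xi, yi in zip(x, y))  # should match lp_norm(y, 1.5)
```

For $p = 1$ the supremum over the unit ball is attained at a signed coordinate vector, which gives $\max_j |y_j| = \|y\|_\infty$ as in item (ii).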
which rearranged yields f (y) ≥ f (x). If f is additionally strictly convex (recall Corollary B.3.2),
then the preceding inequality is strict whenever $y \neq x$.
Proof If 0 ∈ ∂f (x), then f (y) ≥ f (x) + h0, y − xi = f (x) for all y. Conversely, if x minimizes f ,
then we have f (y) ≥ f (x) for all y, and in particular, 0 ∈ ∂f (x).
Things become a bit more complicated when we consider the constraints in the problem (C.0.1),
so that the point x may be restricted. In this case, it is important and useful to consider the normal
cone to the set C, which is (essentially) the collection of vectors pointing out of C.
Definition C.1. Let C be a closed convex set. The normal cone to C at the point x ∈ C is the
collection of vectors
NC (x) := {v | hv, y − xi ≤ 0 for all y ∈ C} .
So $N_C(x)$ is the collection of vectors making an obtuse angle with any direction into the set $C$ from $x$.
JCD Comment: Draw a picture, and also, put this earlier in the discussion of convex sets.
It is clear that NC (x) is indeed a cone: if v ∈ NC (x), then certainly tv ∈ NC (x) for all t ≥ 0.
It is closed convex, being the intersection of halfspaces. Moreover, if x ∈ int C, then we have
NC (x) = {0}, and additionally, we can connect the supporting hyperplanes of C to its normal
cones: Theorem B.1.12 gives the following corollary.
Corollary C.1.3. Let C be closed convex. Then for any x ∈ bd(C), the normal cone NC (x) is
non-trivial and consists of the collection of supporting hyperplanes to C at x.
By a bit of subgradient calculus, we can then write optimality conditions involving the normal
cones to C. If C is a closed convex set, the convex indicator function IC (x) has subdifferentials
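\[ \partial I_C(x) = \begin{cases} \{0\} & \text{if } x \in \operatorname{int} C \\ N_C(x) & \text{if } x \in \operatorname{bd}(C) \\ \emptyset & \text{otherwise.} \end{cases} \]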
The only case requiring justification is the boundary case; for this, we note that w ∈ NC (x) if and
only if hw, y − xi ≤ 0 for all y ∈ C, which in turn occurs if and only if IC (y) ≥ IC (x) + hw, y − xi
for all y.
The subdifferential calculation for IC (x) yields the following general optimality characterization
for problem (C.0.1).
Proposition C.1.4. In the problem (C.0.1), let x ∈ int dom f . Then x minimizes f over C if and
only if
0 ∈ ∂f (x) + NC (x). (C.1.1)
Proof The minimization problem (C.0.1) is equivalent to the problem
As x ∈ int dom f , f has nonempty compact convex subdifferential ∂f (x), and so ∂(f + IC )(x) =
∂f (x) + ∂IC (x) = ∂f (x) + NC (x) by Proposition B.3.23. Apply Observation C.1.2.
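The condition (C.1.1) is easy to check by hand in one dimension. The sketch below assumes the illustrative problem of minimizing $f(x) = (x - c)^2$ over $C = [a, b]$ with $a < b$, whose solution is the projection of $c$ onto $C$, and verifies that $-\nabla f(x)$ lies in the normal cone at that point. All names here are hypothetical.

```python
def project_interval(c, a, b):
    """Minimizer of (x - c)^2 over [a, b], i.e. the projection of c onto the interval."""
    return min(max(c, a), b)


def in_normal_cone_interval(v, x, a, b, tol=1e-12):
    """Membership of v in N_C(x) for C = [a, b] with a < b.

    N_C(x) is {0} in the interior, [0, inf) at x = b, (-inf, 0] at x = a, and empty outside C.
    """
    if a < x < b:
        return abs(v) <= tol
    if x == b:
        return v >= -tol
    if x == a:
        return v <= tol
    return False


def satisfies_optimality(c, a, b):
    """Check 0 in grad f(x) + N_C(x) at the projected point x."""
    x = project_interval(c, a, b)
    grad = 2.0 * (x - c)
    return in_normal_cone_interval(-grad, x, a, b)
```

When $c$ lies outside the interval, the gradient at the solution is nonzero and points into $C$, so its negative points out of $C$, as the normal cone condition requires.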
Several equivalent versions of Proposition C.1.4 are possible. The first is that
that is, there is a subgradient vector g ∈ ∂f (x) such that −g ∈ NC (x), so that −g points outside the
set C. JCD Comment: Draw a picture
Another variant, frequently used, is to write Proposition C.1.4 as that x solves problem (C.0.1)
if and only if there exists g ∈ ∂f (x) such that
Indeed, taking g ∈ ∂f (x) to be the element satisfying −g ∈ NC (x), we immediately see that h−g, y−
xi ≤ 0 for all y ∈ C by definition of the normal cone, which is (C.1.2). JCD Comment: Use picture above
hgx − gy , x − yi ≥ λ kx − yk2 .
Proof Let us prove that (ii) if and only if (iii). Let gx ∈ ∂f (x) and gy ∈ ∂f (y) and assume (ii)
holds. Then
\[ f(y) \ge f(x) + \langle g_x, y - x \rangle + \frac{\lambda}{2} \|y - x\|^2 \]
\[ f(x) \ge f(y) + \langle g_y, x - y \rangle + \frac{\lambda}{2} \|x - y\|^2. \]
Adding the two inequalities gives
\[ 0 \ge \langle g_x - g_y, y - x \rangle + \lambda \|x - y\|^2. \]
Rearranging gives part (iii). Conversely, assume (iii), and for $t \in [0, 1]$ let $x_t = (1 - t)x + ty$ and define $h(t) = f(x_t)$. Then $h$ is convex and hence almost everywhere differentiable (and locally Lipschitz), so that $h(1) = h(0) + \int_0^1 h'(t)\,dt$. Noting that
\[ f(y) \ge f(tx + (1 - t)y) + t \langle g_t, y - x \rangle + \frac{\lambda}{2} t^2 \|x - y\|^2 \]
\[ f(x) \ge f(tx + (1 - t)y) + (1 - t) \langle g_t, x - y \rangle + \frac{\lambda}{2} (1 - t)^2 \|x - y\|^2 \]
for any $g_t \in \partial f(tx + (1 - t)y)$. Multiply the first inequality by $(1 - t)$ and the second by $t$, then add them to obtain
\[ t f(x) + (1 - t) f(y) \ge f(tx + (1 - t)y) + \frac{\lambda}{2} \left[ (1 - t) t^2 + t (1 - t)^2 \right] \|x - y\|^2, \]
and note that $(1 - t)t^2 + t(1 - t)^2 = t(1 - t)$. Finally, let (i) hold, which is equivalent to the condition that
\[ \frac{f((1 - t)x + ty) - f(x)}{t} + \frac{\lambda}{2} (1 - t) \|x - y\|^2 \le f(y) - f(x) \]
for $t \in (0, 1)$. Taking $t \downarrow 0$ gives $f'(x; y - x) + \frac{\lambda}{2} \|x - y\|^2 \le f(y) - f(x)$, and because $f'(x; y - x) = \sup_{s \in \partial f(x)} \langle s, y - x \rangle$ we obtain (ii).
As a first example application of strong convexity, consider minimizers of the tilted functions
as u varies. First, note that minimizers necessarily exist: the function fu (x) → ∞ whenever
kxk → ∞ by condition (ii) in Proposition C.1.5, and so we can restrict to minimizing fu over
compacta. Moreover, the minimizers xu := argminx fu (x) are unique, as the functions fu are
strongly (and hence strictly) convex. However, we can say more. Indeed, let C be any closed
convex set and let
\[ x_u = \operatorname*{argmin}_{x \in C} f_u(x). \tag{C.1.4} \]
We claim the following:
Proposition C.1.6. Let $f$ be $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ and subdifferentiable on $C$. Then the mapping $u \mapsto x_u$ is $\frac{1}{\lambda}$-Lipschitz continuous with respect to the dual norm $\|\cdot\|_*$, that is, $\|x_u - x_v\| \le \frac{1}{\lambda} \|u - v\|_*$.
Proof We use the optimality condition (C.1.2). We have ∂fu (x) = ∂f (x) − u, and thus for any
u, v we have both
hgu − u, y − xu i ≥ 0 and hgv − v, y − xv i ≥ 0
for some gu ∈ ∂f (xu ) and gv ∈ ∂f (xv ) for all y ∈ C. Set y = xv in the first inequality and y = xu
in the second and add them to obtain
hgu − gv + v − u, xv − xu i ≥ 0 or hv − u, xv − xu i ≥ hgv − gu , xv − xu i.
By strong convexity the last term satisfies hgv − gu , xv − xu i ≥ λ kxu − xv k2 . By definition of the
dual norm, kv − uk∗ kxv − xu k ≥ hv − u, xv − xu i, so ku − vk∗ kxv − xu k ≥ λ kxu − xv k2 , which is
the desired result.
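A one-dimensional sketch of this stability bound, under stated assumptions: take $f(x) = \frac{\lambda}{2} x^2 + |x|$, which is $\lambda$-strongly convex and nonsmooth at 0, take $C = [a, b]$, and take the tilt to be $f_u(x) = f(x) - ux$ (the form suggested by $\partial f_u(x) = \partial f(x) - u$ in the proof). The minimizer is then a clipped soft-thresholding of $u$. The function names, sampling range, and trial count are illustrative.

```python
import math
import random


def tilted_minimizer(u, lam, a, b):
    """argmin over [a, b] of (lam / 2) * x**2 + |x| - u * x.

    The unconstrained minimizer is soft-thresholding of u at level 1, divided by lam;
    in one dimension the constrained minimizer is its projection onto [a, b].
    """
    shrunk = math.copysign(max(abs(u) - 1.0, 0.0), u)
    return min(max(shrunk / lam, a), b)


def worst_lipschitz_ratio(lam, a, b, trials=2000, seed=0):
    """Largest observed |x_u - x_v| / |u - v| over random pairs (u, v)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u, v = rng.uniform(-10, 10), rng.uniform(-10, 10)
        if abs(u - v) > 1e-6:  # skip nearly equal pairs to avoid rounding noise
            gap = abs(tilted_minimizer(u, lam, a, b) - tilted_minimizer(v, lam, a, b))
            worst = max(worst, gap / abs(u - v))
    return worst
```

With `lam = 2.0` the observed ratio should not exceed $1/\lambda = 0.5$, in line with the proposition.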
There are alternative versions of strong convexity, typically given the name uniform convexity
in the convex analysis literature, which allow generalizations and similar quantitative stability
properties. In analogy with the strong convexity condition (C.1.3), we say that f is (λ, κ)-uniformly
convex, where κ ≥ 2, over C if it is closed and for all t ∈ [0, 1] and x, y ∈ C,
\[ f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y) - \frac{\lambda}{2} t (1 - t) \|x - y\|^\kappa \left[ (1 - t)^{\kappa - 1} + t^{\kappa - 1} \right]. \tag{C.1.5} \]
Notably the κ = 2 case is simply the familiar strong convexity.
JCD Comment: Give lemmas and propositions but leave as exercises, filling this out.
JCD Comment: Add some material on strict convexity implying a bit of growth around
a neighborhood, and stability properties of strongly convex functions.
The weakest version of such strong convexity properties is strict convexity, for which a careful
reading of the proof of Proposition C.1.5 (replace all λ with 0 and inequalities with strict inequal-
ities) gives the following characterization of equivalent definitions of strict convexity (recall also
Corollary B.3.2).
Corollary C.1.7. Let f be a convex function subdifferentiable on C. The following are equivalent.
(i) f is strictly convex on C.
hgx − gy , x − yi > 0.
Using Corollary C.1.7, we can then obtain certain smoothness properties of the tilted minimizers
xu of the minimization (C.1.4). We begin with a lemma that guarantees growth of convex functions
over their first-order approximations.
Lemma C.1.8. Let f be convex and subdifferentiable on the closed convex set C, and for any fixed
g ∈ ∂f (x0 ) define the Bregman divergence
A minor variant of the criterion of increasing slopes (B.3.4) and that h(0) = 0 then gives h0 (1; 1) =
limt↓0 h(1+t)−h(1)
t ≥ h(1)−h(0)
1 = h(1) = D(x0 , x0 ), so we have
0
0 0
δ(0 ) = D(x0 , x0 ) ≥ D(x0 , x0 ) + − 1 D(x0 , x0 ) = D(x0 , x0 ) ≥ δ()
as desired.
Whenever $f$ is strictly convex, because the infimum in $\delta(\epsilon)$ is attained in Lemma C.1.8, we have
the following guarantee.
Lemma C.1.9. Let the conditions of Lemma C.1.8 hold and additionally let f be strictly convex.
Then $\delta(\epsilon) > 0$ for all $\epsilon > 0$.
Combining these results yields the following non-quantitative version of Proposition C.1.6:
Proposition C.1.10. Let f be strictly convex and subdifferentiable on the closed convex set C,
and assume that the minimum x0 = argminx∈C f (x) is attained. Then the mapping u 7→ xu is
continuous in a neighborhood of u = 0.
Proof We show first that xu is continuous at u = 0. By Lemmas C.1.8 and C.1.9, we see that
for x ∈ C we have
where g ∈ ∂f (x0 ) satisfies hg, x − x0 i ≥ 0 for all x ∈ C by the optimality condition (C.1.2). Now,
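pick $\epsilon > 0$, so that if $\|x - x_0\| > \epsilon$ we have $\delta(\|x - x_0\|) \ge \|x - x_0\| \frac{\delta(\epsilon)}{\epsilon}$ by Lemma C.1.8. Then if $u$ satisfies $\|u\| < \frac{\delta(\epsilon)}{\epsilon}$, we have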
The conjugate function is the largest gap between the linear functional x 7→ hs, xi and the
function f itself. The remarkable property of such conjugates is that their biconjugates describe
the function f itself, or at least the largest closed convex function below f . To make this a bit
more precise, we state a theorem, and then connect to so-called convex closures of functions.
Theorem C.2.1. Let f be closed convex and f ∗ be its conjugate (C.2.1). Then
and we always have hx, si−f ∗ (s) ≤ f (x) by definition of f ∗ (s) = supx {hs, xi−f (x)}. So immediately
we see that f ∗∗ (x) ≤ f (x).
We essentially show that the linear functions hs (x) := hx, si − f ∗ (s) describe (enough) of
the global linear underestimators of f so that f (x) = sups hs (x), allowing us to apply Theo-
rem B.3.7. Indeed, let l(x) = hs, xi + b be any global underestimator of f . Then we must have
b ≤ f (x) − hs, xi for all x, that is, b ≤ inf x {f (x) − hs, xi} = − supx {hs, xi − f (x)} = −f ∗ (s), that
is, l(x) ≤ hs, xi − f ∗ (s) = hs (x). Apply Theorem B.3.7.
We may visualize f ∗∗ as pulling a string up below a function f , yielding the largest closed
convex underestimator of f . (While this is in fact a rigorous statement, we shall not prove it here.)
Even more, combining Theorem C.2.1 with this observation, we can exhibit a duality between
subgradients of f and f ∗ with this inequality.
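The "string pulling" picture can be illustrated with a discrete Legendre-Fenchel transform. The sketch below assumes the nonconvex double-well function $f(x) = \min\{(x - 1)^2, (x + 1)^2\}$ and replaces the suprema by maxima over finite grids, so it is only a grid approximation of $f^*$ and $f^{**}$; grid ranges and spacing are arbitrary choices.

```python
def conjugate_on_grid(values, points, slopes):
    """Discrete conjugate: g(s) = max over grid points x of (s * x - f(x))."""
    return [max(s * x - fx for x, fx in zip(points, values)) for s in slopes]


# Grids for the primal variable x and the dual variable s.
xs = [i / 20.0 for i in range(-60, 61)]    # x in [-3, 3]
ss = [j / 20.0 for j in range(-160, 161)]  # s in [-8, 8]

# A nonconvex double-well function with minima at x = -1 and x = +1.
f_vals = [min((x - 1.0) ** 2, (x + 1.0) ** 2) for x in xs]

f_star = conjugate_on_grid(f_vals, xs, ss)    # approximates f*
f_bistar = conjugate_on_grid(f_star, ss, xs)  # approximates f**
```

On this grid `f_bistar` never exceeds `f_vals`, it vanishes at $x = 0$ where $f(0) = 1$ (the well between the two minima is filled in), and it agrees with $f$ at $x = 2$, where $f$ already coincides with its convex envelope.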
Proof If hs, xi = f ∗ (s) + f (x), then −f (x) + hs, xi = f ∗ (s) ≥ hs, yi − f (y) for all y, and re-
arranging, we have f (y) ≥ f (x) + hs, y − xi, that is, s ∈ ∂f (x). Conversely, if s ∈ ∂f (x) then
0 ∈ ∂f (x) − s, so that x minimizes f (x) − hs, xi, or equivalently, x maximizes hs, xi − f (x) and so
hs, xi − f (x) = supx {hs, xi − f (x)} as desired. The final statement is immediate from a parallel
argument and the duality in Theorem C.2.1.
Writing Proposition C.2.2 differently, we see that ∂f and ∂f ∗ are inverses of one another. That
is, as set-valued mappings, where
and
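\[ \partial f^*(s) = \operatorname*{argmax}_x \{ \langle s, x \rangle - f(x) \} \quad \text{and} \quad \partial f(x) = \operatorname*{argmax}_s \{ \langle s, x \rangle - f^*(s) \}. \]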
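For a concrete instance of this inverse relationship, take the assumed example $f(x) = e^x$ on $\mathbb{R}$, whose conjugate is $f^*(s) = s \log s - s$ for $s > 0$. Then $\nabla f(x) = e^x$ and $\nabla f^*(s) = \log s$ are inverse maps, and the Fenchel-Young gap $f(x) + f^*(s) - sx$ vanishes exactly when $s = e^x$. The helper names below are illustrative.

```python
import math


def f(x):
    return math.exp(x)


def f_conj(s):
    """Conjugate of exp, valid for s > 0: sup_x {s * x - exp(x)} = s * log(s) - s."""
    return s * math.log(s) - s


def grad_f(x):
    return math.exp(x)


def grad_f_conj(s):
    return math.log(s)


def fenchel_young_gap(x, s):
    """f(x) + f*(s) - s * x, which is nonnegative and zero iff s is a (sub)gradient of f at x."""
    return f(x) + f_conj(s) - s * x
```

For example, `fenchel_young_gap(0.7, grad_f(0.7))` is zero up to rounding, while `fenchel_young_gap(0.7, 1.0)` is strictly positive.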
Additionally, we see that the domains and images of ∂f and ∂f ∗ are also related, which guar-
antees convexity properties of their images as well.
We can use the identification between the domains of ∂f and the images of ∂f ∗ to give a few
additional characterizations of the domains of convex functions and their conjugates; the domain
of f ∗ is intimately tied with the growth properties of f , and conversely by the relationship f = f ∗∗
when f is closed convex. As one example of how we can make this identification, note that if f ∗
is defined everywhere, that is, dom f ∗ = Rd , then similarly dom ∂f ∗ = Rd , and so in particular the
(sub)gradients of f must cover all of Rd . Even more, as we shall see, this implies certain growth
conditions on f .
To make this more rigorous, we require functions capturing the asymptotic growth of f . To that
end, we present the following proposition, which has the benefit of defining the recession function
(essentially, an asymptotic derivative) of f .
Proposition C.2.5. Let $f$ be a closed convex function and $f^*$ its convex conjugate. Then for any $x \in \operatorname{dom} f$, we may define
\[ f'_\infty(v) := \sup_{t > 0} \frac{f(x + tv) - f(x)}{t} = \lim_{t \to \infty} \frac{f(x + tv) - f(x)}{t} \tag{C.2.3} \]
independently of $x$, and moreover,
\[ f'_\infty(v) = \sigma_{\operatorname{dom} f^*}(v) \]
where $\sigma_{\operatorname{dom} f^*}$ is the support function (B.2.1) of $\operatorname{dom} f^*$.
Proof That for any fixed $x \in \operatorname{dom} f$ the limit exists and is equal to the supremum follows because of the criterion of increasing slopes (B.3.4), making the equality with the supremum immediate. That $f'_\infty(v)$ is independent of $x$ will follow once we show the second equality claimed in
by conjugate duality, as f is closed convex (Theorem C.2.1). Fix x ∈ dom f . Then for any
s ∈ dom f ∗ , we evidently have
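\[ \frac{f(x + tv) - f(x)}{t} \ge \frac{\langle s, x + tv \rangle - f^*(s) - f(x)}{t} \to \langle s, v \rangle \]
as $t \uparrow \infty$. Taking a supremum over $s \in \operatorname{dom} f^*$ gives that $f'_\infty(v) \ge \sigma_{\operatorname{dom} f^*}(v)$. For the opposite direction, note that
\[ \frac{f(x + tv) - f(x)}{t} = \frac{1}{t} \left[ \sup_{s \in \operatorname{dom} f^*} \{ \langle s, x + tv \rangle - f^*(s) \} - \sup_{s \in \operatorname{dom} f^*} \{ \langle s, x \rangle - f^*(s) \} \right] \]
\[ \le \frac{1}{t} \sup_{s \in \operatorname{dom} f^*} \{ \langle s, x + tv \rangle - f^*(s) - (\langle s, x \rangle - f^*(s)) \} = \frac{1}{t} \sup_{s \in \operatorname{dom} f^*} t \langle s, v \rangle. \]
Thus $f'_\infty(v) \le \sigma_{\operatorname{dom} f^*}(v)$, and we have the result.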
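As an illustration of the identity $f'_\infty = \sigma_{\operatorname{dom} f^*}$, consider the assumed example $f(x) = \log(1 + e^x)$. Its conjugate is finite exactly on $[0, 1]$, so the support function of $\operatorname{dom} f^*$ is $\max\{v, 0\}$. The sketch below compares this with difference quotients at a large $t$; the base point, the values of $t$, and the function names are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def slope(f, x, v, t):
    """The difference quotient (f(x + t v) - f(x)) / t, nondecreasing in t for convex f."""
    return (f(x + t * v) - f(x)) / t


def support_unit_interval(v):
    """Support function of [0, 1]: sup over s in [0, 1] of s * v."""
    return max(v, 0.0)


# Difference quotients in direction v = 1 from x = 0.3 increase toward the recession value 1.
slopes = [slope(softplus, 0.3, 1.0, t) for t in (1.0, 10.0, 100.0, 1e4)]
```

The limit does not depend on the base point: replacing `0.3` by any other `x` changes the quotients at finite `t` but not their limit.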
It is particularly interesting to understand the conditions under which dom f ∗ = Rd , that is,
f∗ is finite everywhere, and relatedly, under which the function x 7→ f (x) − hs, xi has a minimizer.
Recall that f : Rd → R is coercive if f (x) → ∞ whenever kxk → ∞, so that if f is closed convex,
then the tilted function f (·) − hs, ·i has a minimizer if and only if it is coercive. We call f super-
coercive if f (x)/ kxk → ∞ whenever kxk → ∞, so that f grows more than linearly. These concepts
are central to the existence of minimizers. A priori, any function with compact domain is super-
coercive, because f (x) = +∞ for x 6∈ dom f . For convex functions, we can relate such coercivity
ideas to the recession function $f'_\infty$ associated with $f$ as expression (C.2.3) defines. Particularly
a class Rockafellar [153] calls copositive functions, as these exhibit superlinear growth on all rays
toward infinity. We can relate this condition to the domains of f ∗ as well: using Proposition B.2.7,
Proposition C.2.5 gives
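Corollary C.2.6. Let $f : \mathbb{R}^d \to \mathbb{R}$ be closed convex. Then $s \in \operatorname{dom} f^*$ if and only if $\langle s, v \rangle \le f'_\infty(v)$
(i) If $f'_\infty(v) > 0$ for all $v \neq 0$, then $0 \in \operatorname{int} \operatorname{dom} f^*$ and $f$ has a minimizer. A sufficient condition for this is that $f$ be coercive.
(ii) We have $f'_\infty(v) = +\infty$ for all $v \neq 0$ if and only if $\operatorname{dom} f^* = \mathbb{R}^d$.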
Proof Combine Propositions B.2.7 and C.2.5: for part (i), note that if $f'_\infty(v) > 0$ for all $v \neq 0$, then $0 \in \operatorname{int} \operatorname{dom} f^*$, and so $f^*$ has a non-trivial subdifferential $\partial f^*(0)$ at 0; letting $x \in \partial f^*(0)$ we have $x \in \operatorname{argmin} f$. To see that coercivity is sufficient, note that if $f(x) \to \infty$ whenever $\|x\| \to \infty$, the criterion of increasing slopes (B.3.4) gives $f'_\infty(v) > 0$ for all $v \neq 0$. Part (ii) is similarly immediate.
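Proposition C.2.7. Let $f : \mathbb{R}^d \to \mathbb{R}$ be $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ (see Eq. (C.1.3)) on its domain. Then $\operatorname{dom} f^* = \mathbb{R}^d$ and $\nabla f^*$ is $\frac{1}{\lambda}$-Lipschitz continuous with respect to the dual norm $\|\cdot\|_*$, that is,
\[ \|\nabla f^*(u) - \nabla f^*(v)\| \le \frac{1}{\lambda} \|u - v\|_* \]
for all $u, v$. Conversely, let $f : \mathbb{R}^d \to \mathbb{R}$ be convex with $L$-Lipschitz gradient with respect to $\|\cdot\|$ on $\mathbb{R}^d$. Then $f^*$ is $\frac{1}{L}$-strongly convex with respect to the dual norm $\|\cdot\|_*$ on convex subsets $C \subset \operatorname{dom} \partial f^*$, and in particular,
\[ \langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_*^2. \tag{C.2.4} \]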
Proof For the first claim, let $C = \operatorname{dom} f$. Then Proposition C.1.6 shows that if $x_1 = \operatorname{argmin}_x \{f(x) - \langle s_1, x \rangle\}$ and $x_2 = \operatorname{argmin}_x \{f(x) - \langle s_2, x \rangle\}$ (which exist and are necessarily unique), we have $\|x_1 - x_2\| \le \frac{1}{\lambda} \|s_1 - s_2\|_*$. Then Proposition C.2.2 shows that $x_i \in \partial f^*(s_i)$ for $i = 1, 2$, and hence $\partial f^*(s_i)$ is necessarily single-valued and $(1/\lambda)$-Lipschitz continuous.
The converse is a bit trickier. Let $x$ and $y$ be arbitrary and $s_x = \nabla f(x)$ and $s_y = \nabla f(y)$; we
prove inequality (C.2.4), known as co-coercivity. By the $L$-Lipschitz continuity of $\nabla f$, we have
\begin{align*}
f(y) & = f(x) + \int_0^1 \langle \nabla f(x + t(y - x)), y - x \rangle \, dt \\
& = f(x) + \langle \nabla f(x), y - x \rangle + \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \, dt \\
& \le f(x) + \langle s_x, y - x \rangle + \int_0^1 L t \|y - x\|^2 \, dt = f(x) + \langle s_x, y - x \rangle + \frac{L}{2} \|y - x\|^2,
\end{align*}
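which is valid for any $x, y$. Note that $f(x) - \langle s_x, x \rangle = -f^*(s_x)$, so that rearranging we have
\begin{align*}
f^*(s_x) & \le \langle s_x, y \rangle - f(y) + \frac{L}{2} \|y - x\|^2 = \langle s, y \rangle - f(y) + \langle s_x - s, y \rangle + \frac{L}{2} \|y - x\|^2 \\
& \le f^*(s) + \langle s_x - s, y \rangle + \frac{L}{2} \|y - x\|^2,
\end{align*}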
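valid for any vector $s$ and any $y$. We may in particular take an infimum over $y$ on the right hand
side, where
\begin{align*}
\inf_y \left\{ \langle s_x - s, y \rangle + \frac{L}{2} \|y - x\|^2 \right\}
& = \inf_y \left\{ \langle s_x - s, y - x \rangle + \frac{L}{2} \|y - x\|^2 \right\} + \langle s_x - s, x \rangle \\
& \stackrel{(\star)}{=} \inf_{t \ge 0} \left\{ -t \|s_x - s\|_* + \frac{L t^2}{2} \right\} + \langle s_x - s, x \rangle \\
& = -\frac{1}{2L} \|s_x - s\|_*^2 + \langle s_x - s, x \rangle,
\end{align*}
where equality $(\star)$ follows by definition of the dual norm, which gives $\inf_{\|y - x\| = t} \langle s_x - s, y - x \rangle = -t \|s_x - s\|_*$, and we identify $t = \|y - x\|$. Thus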
\[
f^*(s_x) + \langle x, s - s_x \rangle + \frac{1}{2L} \|s - s_x\|_*^2 \le f^*(s)
\]
for all $s$. As $x \in \partial f^*(s_x)$, Proposition C.1.5(ii) gives the strong convexity result. The rest is algebraic
manipulations with $s_y = \nabla f(y)$ and an application of Proposition C.1.5(iii).
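To make Proposition C.2.7 concrete, here is a small numerical sketch in Python under stated assumptions: we take the hypothetical example $f(x) = \frac{1}{2} x^\top A x$ with $A = [[2, 1], [1, 2]]$, whose eigenvalues are 1 and 3, so that $f$ is 1-strongly convex with 3-Lipschitz gradient in the Euclidean norm (which is its own dual). Its conjugate is $f^*(s) = \frac{1}{2} s^\top A^{-1} s$. The function names below are illustrative only. The check samples random pairs of points and verifies both the $(1/\lambda)$-Lipschitz bound on $\nabla f^*$ and the co-coercivity inequality (C.2.4), up to a small floating-point tolerance.

```python
import math
import random

# Hypothetical example: f(x) = 0.5 * x^T A x, so grad f(x) = A x and grad f*(s) = A^{-1} s.
A = [[2.0, 1.0], [1.0, 2.0]]
A_INV = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]
LAM, L = 1.0, 3.0  # smallest and largest eigenvalues of A


def matvec(M, x):
    return [M[0][0] * x[0] + M[0][1] * x[1], M[1][0] * x[0] + M[1][1] * x[1]]


def sub(x, y):
    return [x[0] - y[0], x[1] - y[1]]


def dot(x, y):
    return x[0] * y[0] + x[1] * y[1]


def norm(x):
    return math.sqrt(dot(x, x))


def grad_f(x):
    return matvec(A, x)


def grad_f_conj(s):
    return matvec(A_INV, s)


def check(trials=1000, seed=0, tol=1e-9):
    """Return True if both inequalities hold on all sampled pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        y = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        # (1/lambda)-Lipschitz continuity of the conjugate's gradient.
        lipschitz_ok = norm(sub(grad_f_conj(x), grad_f_conj(y))) <= norm(sub(x, y)) / LAM + tol
        # Co-coercivity (C.2.4) of grad f.
        g = sub(grad_f(x), grad_f(y))
        cocoercive_ok = dot(g, sub(x, y)) >= dot(g, g) / L - tol
        if not (lipschitz_ok and cocoercive_ok):
            return False
    return True


if __name__ == "__main__":
    print(check())
```

A random check of this kind cannot prove the inequalities, but it is a quick sanity test of the constants: along the eigenvector $(1, 1)$ the co-coercivity bound holds with equality, so the constant $1/L$ cannot be improved for this example.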
There are more qualitative versions of Proposition C.2.7 that allow us to give a duality between
strict convexity and continuous differentiability of f . Here, we give one typical result.
Proposition C.2.8. Let $f : \mathbb{R}^d \to \mathbb{R}$ be strictly convex and closed. Then $\operatorname{int} \operatorname{dom} f^* \neq \emptyset$ and
$f^*$ is continuously differentiable on $\operatorname{int} \operatorname{dom} f^*$. Conversely, let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable on
$\Omega := \operatorname{int} \operatorname{dom} f$. Then $f^*$ is strictly convex on each convex $C \subset \nabla f(\Omega)$.
These results should be roughly expected because of the duality that $\nabla f = (\nabla f^*)^{-1}$ and that
$\partial f^*(s) = \operatorname{argmin}_x \{f(x) - \langle s, x \rangle\}$: strict convexity guarantees uniqueness of minimizers
(Proposition C.1.1), so that $\partial f^*(s)$ should be a singleton.
Proof To see that $\operatorname{int} \operatorname{dom} f^*$ is non-empty, we use the identification $f'_\infty(v) = \sigma_{\operatorname{dom} f^*}(v)$ in
Proposition C.2.5 and the interior identification in Proposition B.2.7. Because $f$ is strictly convex,
for any $x \in \operatorname{dom} f$ we have
sake of contradiction that $\partial f^*(s)$ has distinct points $x_1, x_2$. Then Corollary C.2.3 gives that $x_1$
and $x_2$ both minimize $f(x) - \langle s, x \rangle$ over $x$. But Proposition C.1.1 guarantees $x_1 = x_2$, so that
$\partial f^*(s) = \{\nabla f^*(s)\}$ is a singleton, and hence $f^*$ is continuously differentiable at $s$ (Proposition B.3.22).
For the converse claim, let $C$ be a convex set as stated. Suppose for the sake of contradiction
that $f^*$ is not strictly convex on $C$, so that there are distinct points $s_1, s_2 \in C$ for which $f^*$ is affine
on the line segment $[s_1, s_2] = \{t s_1 + (1 - t) s_2 \mid t \in [0, 1]\}$. As $C \subset \nabla f(\Omega)$ is convex, the midpoint
$s = \frac{1}{2}(s_1 + s_2) \in C$ and there exists $x$ satisfying $\nabla f(x) = s$, or $x \in \partial f^*(s)$. Then because $f^*$ is
assumed affine on $[s_1, s_2]$, we have $f^*(s) = \frac{1}{2} f^*(s_1) + \frac{1}{2} f^*(s_2)$ and $\langle s, x \rangle = \frac{1}{2} \langle s_1 + s_2, x \rangle$, so
We close this section by investigating particularly nice classes of functions f , where f and its
conjugate f ∗ are strictly convex and smooth. These results are central to the various conjugate
linkage dualities we explore in Chapter 11.3. We therefore make the following definition:
Thus, at the boundaries of their domains or as their argument tends off to infinity, functions of
Legendre type have slopes tending to ∞. This does not guarantee that f (x) → ∞ as x → bd dom f ,
though it does provide guarantees of regularity that the next theorem highlights.
Theorem C.2.9. Let f be a convex function of Legendre type (Def. C.2). Then f ∗ is strictly
convex, continuously differentiable, and dom f ∗ = Rd .
The theorem implies a number of results on continuity of minimizers and tilted minimiz-
ers (C.1.4), clarifying some of our earlier results. For example, we have the following corollary.
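Corollary C.2.10. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function of Legendre type. Then the tilted
minimizer
\[
x_u := \operatorname{argmin}_x \{f(x) - \langle u, x \rangle\}
\]
exists for all $u$, is continuous in $u$ and unique, and $x_u \in \operatorname{int} \operatorname{dom} f$.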
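As a hedged one-dimensional illustration of the corollary, consider the hypothetical example $f(x) = x \log x - x$ on $(0, \infty)$, which is strictly convex and differentiable with $f'(x) = \log x$ tending to $-\infty$ as $x \downarrow 0$ and to $+\infty$ as $x \to \infty$; we assume here that it satisfies the definition of Legendre type. Setting $f'(x) = u$ gives the tilted minimizer $x_u = e^u$, which lies in the interior of the domain and varies continuously with $u$, and the conjugate is $f^*(u) = e^u$. The Python sketch below recovers both numerically by bisection on $f'(x) - u$; the function names and the bracketing interval are illustrative choices.

```python
import math


def f_prime(x):
    """Derivative of the example f(x) = x*log(x) - x on (0, inf)."""
    return math.log(x)


def tilted_minimizer(u, lo=1e-12, hi=1e12):
    """Bisection on the increasing map x -> f'(x) - u; valid when lo < exp(u) < hi."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_prime(mid) > u:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def conjugate(u):
    """f*(u) = u * x_u - f(x_u), evaluated at the tilted minimizer x_u."""
    x = tilted_minimizer(u)
    return u * x - (x * math.log(x) - x)


if __name__ == "__main__":
    for u in (-3.0, 0.0, 2.0):
        print(u, tilted_minimizer(u), conjugate(u), math.exp(u))
```

Bisection is used only because it needs nothing beyond monotonicity of $f'$; with the default bracket it applies for $u$ roughly between $-27$ and $27$.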
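Lemma C.2.11. Let $f : \mathbb{R}^d \to \mathbb{R}$ be closed convex and satisfy the gradient boundary condition that
$\|s_n\| \to \infty$ for any sequence $x_n \to \operatorname{bd} \operatorname{dom} f$ and $s_n \in \partial f(x_n)$. Then
\[
f'_\infty(v) = \infty ~~ \text{for all } v \neq 0
\]
if and only if
\[
\|s_n\| \to \infty ~~ \text{whenever } \|x_n\| \to \infty \text{ and } s_n \in \partial f(x_n).
\]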
The theorem follows straightforwardly from Lemma C.2.11. By the boundary conditions (C.2.5)
associated with $f$, we have $f'_\infty(v) = \infty$ for all $v \neq 0$ (Lemma C.2.11). Because the support
function of $\operatorname{dom} f^*$ satisfies $\sigma_{\operatorname{dom} f^*} = f'_\infty$ (Proposition C.2.5), we see that $\operatorname{dom} f^* = \mathbb{R}^d$ as
$\operatorname{dom} f^* = \{s \mid \langle s, v \rangle \le \sigma_{\operatorname{dom} f^*}(v) \text{ for all } v\}$ (e.g., Proposition B.2.7 or Corollary B.2.2). With
this, $f^*$ is continuously differentiable and strictly convex on its domain (Proposition C.2.8).
convergent sequence.
As a last application of these ideas, in some cases we wish to allow constraints on the functions
f to be minimized, returning to the original convex optimization problem (C.0.1) with f a function
of Legendre type and C a closed convex set. We then have the following corollary.
Corollary C.2.12. Let $f$ be of Legendre type (Definition C.2) and $C \subset \mathbb{R}^d$ a closed convex set
with $\operatorname{int} \operatorname{dom} f \cap C \neq \emptyset$. Define $f_C(x) = f(x) + I_C(x)$. Then
(i) $f_C^*$ is continuously differentiable,
(ii) $\operatorname{dom} f_C^* = \mathbb{R}^d$, and
(iii) the constrained tilted minimizers
\[
x_u = \operatorname{argmin}_{x \in C} \{f(x) - \langle u, x \rangle\}
\]
Further reading
There are a variety of references on the topic, beginning with the foundational book by Rockafellar
[153], which initiated the study of convex functions and optimization in earnest. Since then, a
variety of authors have written (perhaps more easily approachable) books on convex functions,
optimization, and their related calculus. Hiriart-Urruty and Lemaréchal [104] have written two
volumes explaining in great detail finite-dimensional convex analysis, and provide a treatment of
some first-order algorithms for solving convex problems. Borwein and Lewis [33] and Luenberger
[133] give general treatments that include infinite-dimensional convex analysis, and Bertsekas [27]
gives a variety of theoretical results on duality and optimization theory.
There are, of course, books that combine theoretical treatment with questions of convex modeling
and procedures for solving convex optimization problems (problems for which the objective
and constraint sets are all convex). Boyd and Vandenberghe [35] give a very readable treatment
for those who wish to use convex optimization techniques and modeling, as well as the basic results
in convex analysis and duality theory. Ben-Tal and Nemirovski [23], as well as Nemirovski's
various lecture notes, give a theory of the tractability of computing solutions to convex
optimization problems as well as methods for solving them.
C.3 Exercises
Exercise C.1: Show that the alternative increasing slopes condition (B.3.5) is equivalent to
convexity of f .
Exercise C.2: Do the uniform convexity version of Proposition C.1.5.
Exercise C.3: Do the uniform convexity version of Proposition C.1.6.
Bibliography
[2] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit prob-
lem. In Proceedings of the Twenty Fifth Annual Conference on Computational Learning
Theory, 2012.
[3] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest
neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
[4] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution
from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.
[5] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of
Mathematical Monographs. American Mathematical Society, 2000.
[8] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm
and applications. Theory of Computing, 8(1):121–164, 2012.
[9] S. Artstein, K. Ball, F. Barthe, and A. Naor. Solution of Shannon’s problem on the mono-
tonicity of entropy. Journal of the American Mathematical Society, 17(4):975–982, 2004.
[10] P. Assouad. Deux remarques sur l’estimation. Comptes Rendus des Séances de l’Académie
des Sciences, Série I, 296(23):1021–1024, 1983.
[11] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring.
In Journal of Machine Learning Research, pages 2635–2686, 2010.
[12] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine Learning, 47(2-3):235–256, 2002.
[14] B. Balle, G. Barthe, and M. Gaboardi. Privacy amplification by subsampling: Tight analyses
via couplings and divergences. In Advances in Neural Information Processing Systems 31,
pages 6277–6287, 2018.
[16] A. Barron. Entropy and the central limit theorem. Annals of Probability, 14(1):336–342,
1986.
[18] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions
on Information Theory, 37:1034–1054, 1991.
[19] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101:138–156, 2006.
[20] R. Bassily, A. Smith, T. Steinke, and J. Ullman. More general queries and less generalization
error in adaptive data analysis. arXiv:1503.04843 [cs.LG], 2015.
[21] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability
for adaptive data analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on
the Theory of Computing, pages 1046–1059, 2016.
[22] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for
convex optimization. Operations Research Letters, 31:167–175, 2003.
[23] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. SIAM, 2001.
[24] D. Berend and A. Kontorovich. On the concentration of the missing mass. Electronic Com-
munications in Probability, 18:1–7, 2018.
[25] R. Berk, L. Brown, A. Buja, K. Zhang, and L. Zhao. Valid post-selection inference. Annals
of Statistics, 41(2):802–837, 2013.
[26] J. M. Bernardo. Reference analysis. In D. Day and C. R. Rao, editors, Bayesian Thinking,
Modeling and Computation, volume 25 of Handbook of Statistics, chapter 2, pages 17–90.
Elsevier, 2005.
[28] L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete, 65:181–238, 1983.
[29] L. Birgé. A new lower bound for multiple hypothesis testing. IEEE Transactions on Infor-
mation Theory, 51(4):1611–1614, 2005.
[30] L. Birgé and P. Massart. Estimation of integral functionals of a density. Annals of Statistics,
23(1):11–29, 1995.
[31] J. Blasiok, P. Gopalan, L. Hu, and P. Nakkiran. A unifying theory of distance from calibration.
In Proceedings of the Fifty-Fifth Annual ACM Symposium on the Theory of Computing, 2023.
URL https://arxiv.org/abs/2211.16886.
[32] J. Blasiok, P. Gopalan, L. Hu, and P. Nakkiran. When does optimizing a proper loss yield
calibration? In Advances in Neural Information Processing Systems 36, 2023.
URL https://arxiv.org/abs/2305.18764.
[33] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[34] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: a Nonasymptotic
Theory of Independence. Oxford University Press, 2013.
[35] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[36] S. Boyd, J. Duchi, and L. Vandenberghe. Subgradients. Course notes for Stanford Course
EE364b, 2015. URL http://web.stanford.edu/class/ee364b/lectures/subgradients_notes.pdf.
[37] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff. Communication lower
bounds for statistical estimation problems via a distributed data processing inequality. In
Proceedings of the Forty-Eighth Annual ACM Symposium on the Theory of Computing, 2016.
URL https://arxiv.org/abs/1506.07216.
[38] G. Brown, M. Bun, V. Feldman, A. Smith, and K. Talwar. When is memorization of irrelevant
training data necessary for high-accuracy learning? In Proceedings of the Fifty-Third Annual
ACM Symposium on the Theory of Computing, pages 123–132, 2021.
[39] L. D. Brown. Fundamentals of Statistical Exponential Families. Institute of Mathematical
Statistics, Hayward, California, 1986.
[40] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[41] V. Buldygin and Y. Kozachenko. Metric Characterization of Random Variables and Random
Processes, volume 188 of Translations of Mathematical Monographs. American Mathematical
Society, 2000.
[42] T. Cai and M. Low. Testing composite hypotheses, Hermite polynomials and optimal esti-
mation of a nonsmooth functional. Annals of Statistics, 39(2):1012–1041, 2011.
[43] E. J. Candès and M. A. Davenport. How well can we estimate a sparse vector. Applied and
Computational Harmonic Analysis, 34(2):317–323, 2013.
[44] O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical
Learning, volume 56 of IMS Lecture Notes and Monographs. Institute of Mathematical Statistics,
Beachwood, Ohio, USA, 2007. URL https://arxiv.org/abs/0712.0248.
[45] O. Catoni and I. Giulini. Dimension-free PAC-bayesian bounds for matrices, vectors, and
linear least squares regression. arXiv:1712.02747 [math.ST], 2017.
[46] N. Cesa-Bianchi and G. Lugosi. On prediction of individual sequences. Annals of Statistics,
27(6):1865–1895, 1999.
[47] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University
Press, 2006.
[48] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction
with expert advice. Machine Learning, 66(2–3):321–352, 2007.
[50] J. E. Cohen, Y. Iwasa, G. Rautu, M. B. Ruskai, E. Seneta, and G. Zbaganu. Relative entropy
under mappings by stochastic matrices. Linear Algebra and its Applications, 179:211–235,
1993.
[52] J. Couzin. Whole-genome data not anonymous, challenging assumptions. Science, 321(5894):
1278, 2008.
[53] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley,
2006.
[55] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless
Systems. Cambridge University Press, second edition, 2011.
[56] T. Dalenius. Towards a methodology for statistical disclosure control. Statistik Tidskrift, 15:
429–444, 1977.
[57] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss.
Random Structures and Algorithms, 22(1):60–65, 2002.
[58] L. D. Davisson. The prediction error of stationary gaussian time series of unknown covariance.
IEEE Transactions on Information Theory, 11:527–532, 1965.
[59] A. Dawid and V. Vovk. Prequential probability: principles and properties. Bernoulli, 5:
125–162, 1999.
[61] P. Del Moral, M. Ledoux, and L. Miclo. On contraction properties of Markov kernels. Prob-
ability Theory and Related Fields, 126:395–420, 2003.
[62] R. L. Dobrushin. Central limit theorem for nonstationary Markov chains. I. Theory of
Probability and Its Applications, 1(1):65–80, 1956.
[63] R. L. Dobrushin. Central limit theorem for nonstationary Markov chains. II. Theory of
Probability and Its Applications, 1(4):329–383, 1956.
[64] D. L. Donoho and R. C. Liu. Geometrizing rates of convergence I. Technical Report 137,
University of California, Berkeley, Department of Statistics, 1987.
[65] J. C. Duchi and R. Rogers. Lower bounds for locally private estimation via communica-
tion complexity. In Proceedings of the Thirty Second Annual Conference on Computational
Learning Theory, 2019.
[66] J. C. Duchi and F. Ruan. A constrained risk inequality for general losses. In Proceedings of
the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
[67] J. C. Duchi and M. J. Wainwright. Distance-based and continuum Fano inequalities with
applications to statistical estimation. arXiv:1311.2669 [cs.IT], 2013.
[68] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy, data processing inequalities,
and minimax rates. arXiv:1302.3203 [math.ST], 2013. URL http://arxiv.org/abs/1302.3203.
[69] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates.
In 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
[70] J. C. Duchi, K. Khosravi, and F. Ruan. Multiclass classification, information, divergence,
and surrogate risk. Annals of Statistics, 46(6b):3246–3275, 2018.
[71] R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.
[72] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and
Trends in Theoretical Computer Science, 9(3 & 4):211–407, 2014.
[73] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves:
Privacy via distributed noise generation. In Advances in Cryptology (EUROCRYPT 2006),
2006.
[74] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private
data analysis. In Proceedings of the Third Theory of Cryptography Conference, pages 265–284,
2006.
[75] C. Dwork, G. N. Rothblum, and S. P. Vadhan. Boosting and differential privacy. In 51st
Annual Symposium on Foundations of Computer Science, pages 51–60, 2010.
[76] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statistical
validity in adaptive data analysis. arXiv:1411.2664v2 [cs.LG], 2014.
[77] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statis-
tical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM
Symposium on the Theory of Computing, 2015.
[78] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. The reusable holdout:
Preserving statistical validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
[79] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep
(stochastic) neural networks with many more parameters than training data. In Proceedings
of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.
[80] A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving
data mining. In Proceedings of the Twenty-Second Symposium on Principles of Database
Systems, pages 211–222, 2003.
[81] K. Fan. Minimax theorems. Proceedings of the National Academy of Sciences, 39(1):42–47,
1953.
[82] V. Feldman and T. Steinke. Calibrating noise to variance in adaptive data analysis. In
Proceedings of the Thirty First Annual Conference on Computational Learning Theory, 2018.
URL http://arxiv.org/abs/1712.07196.
[83] G. Folland. Real Analysis: Modern Techniques and their Applications. Pure and Applied
Mathematics. John Wiley & Sons, second edition, 1999.
[85] D. P. Foster and S. Hart. “calibeating”: Beating forecasters at their own game.
arXiv:2209.0489 [econ.TH], 2022.
[86] A. Franco, N. Malhotra, and G. Simonovits. Publication bias in the social sciences: Unlocking
the file drawer. Science, 345(6203):1502–1505, 2014.
[87] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):
100–118, Feb. 1975.
[88] R. Gallager. Source coding with side information and universal coding. Technical Report
LIDS-P-937, MIT Laboratory for Information and Decision Systems, 1979.
[89] D. García-García and R. C. Williamson. Divergences and risks for multiclass experiments.
In Proceedings of the Twenty Fifth Annual Conference on Computational Learning Theory,
2012.
[90] A. Garg, T. Ma, and H. L. Nguyen. On communication cost of distributed statistical estima-
tion and dimensionality. In Advances in Neural Information Processing Systems 27, 2014.
[91] A. Gelman and E. Loken. The garden of forking paths: Why multiple comparisons can
be a problem, even when there is no “fishing expedition” or “p-hacking” and the research
hypothesis was posited ahead of time. Technical report, Columbia University, 2013.
[92] R. P. Gilbert. Function Theoretic Methods in Partial Differential Equations. Academic Press,
1969.
[93] T. Gneiting and A. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal
of the American Statistical Association, 102(477):359–378, 2007.
[95] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[96] P. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy, and
robust Bayesian decision theory. Annals of Statistics, 32(4):1367–1433, 2004.
[97] A. Guntuboyina. Lower bounds for the minimax risk using f -divergences, and applications.
IEEE Transactions on Information Theory, 57(4):2386–2399, 2011.
[98] L. Györfi and T. Nemetz. f -dissimilarity: A generalization of the affinity of several distribu-
tions. Annals of the Institute of Statistical Mathematics, 30:105–113, 1978.
[100] R. Z. Has’minskii. A lower bound on the risks of nonparametric estimates of densities in the
uniform metric. Theory of Probability and Applications, 23:794–798, 1978.
[101] D. Haussler. A general minimax result for relative entropy. IEEE Transactions on Information
Theory, 43(4):1276–1280, 1997.
[103] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online
convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational
Learning Theory, 2006.
[104] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II.
Springer, New York, 1993.
[105] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.
[106] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the
American Statistical Association, 58(301):13–30, Mar. 1963.
[108] P. J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.
[109] K. Hung and W. Fithian. Statistical methods for replicability assessment. Annals of Applied
Statistics, 14(3):1063–1087, 2020.
[111] P. Indyk. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Com-
putational Geometry. CRC Press, 2004.
[112] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of
dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of
Computing, 1998.
[113] J. P. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8), 2005.
doi: 10.1371/journal.pmed.0020124.
[115] T. S. Jayram. Hellinger strikes back: a note on the multi-party information complexity of
AND. In Proceedings of APPROX and RANDOM 2009, volume 5687 of Lecture Notes in
Computer Science, pages 562–573. Springer, 2009.
[116] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings
of the Royal Society of London, Series A: Mathematical and Physical Sciences, 186:453–461,
1946.
[117] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Con-
temporary Mathematics, 26:189–206, 1984.
[118] M. J. Kearns and L. Saul. Large deviation methods for approximate probabilistic inference.
In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages
311–319, 1998.
[119] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear
predictors. Information and Computation, 132(1):1–64, Jan. 1997.
[120] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems.
Machine Learning, 45(3):301–329, July 2001.
[121] A. Kolmogorov and V. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces.
Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.
[122] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery
Problems, volume 2033 of Lecture Notes in Mathematics. Springer-Verlag, 2011.
[123] A. Kumar, P. Liang, and T. Ma. Verified uncertainty calibration. In Advances in Neural
Information Processing Systems 32, 2019.
[124] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.
[125] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in
Applied Mathematics, 6:4–22, 1985.
[126] J. Langford and R. Caruana. (not) bounding the true error. In Advances in Neural Informa-
tion Processing Systems 14, 2001.
[128] L. Le Cam and G. L. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer, 2000.
[130] E. L. Lehmann and G. Casella. Theory of Point Estimation, Second Edition. Springer, 1998.
[131] F. Liese and I. Vajda. On divergences and informations in statistics and information theory.
IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
[132] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Informa-
tion Theory, 37(1):145–151, 1991.
[134] M. Madiman and A. Barron. Generalized entropy power inequalities and monotonicity prop-
erties of information. IEEE Transactions on Information Theory, 53(7):2317–2329, 2007.
[139] F. McSherry and K. Talwar. Mechanism design via differential privacy. In 48th Annual
Symposium on Foundations of Computer Science, 2007.
[140] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory,
44(6):2124–2147, 1998.
[141] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in
Theoretical Computer Science, 1(2):117–236, 2005.
[142] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization.
Wiley, 1983.
[143] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation ap-
proach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[145] B. T. Polyak and J. Tsypkin. Robust identification. Automatica, 16:53–63, 1980.
doi: 10.1016/0005-1098(80)90086-2. URL http://dx.doi.org/10.1016/0005-1098(80)90086-2.
[146] Y. Polyanskiy and Y. Wu. Strong data-processing inequalities for channels and Bayesian
networks. In Convexity and Concentration, volume 141 of The IMA Volumes in Mathematics
and its Applications, pages 211–249. Springer, 2017.
[147] M. Raginsky. Strong data processing inequalities and φ-Sobolev inequalities for discrete
channels. IEEE Transactions on Information Theory, 62(6):3355–3389, 2016.
[148] A. Rao and A. Yehudayoff. Communication Complexity and Applications. Cambridge Uni-
versity Press, 2020.
[149] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional
linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976–6994,
2011.
[150] M. Reid and R. Williamson. Information, divergence, and risk for binary experiments. Journal
of Machine Learning Research, 12:731–817, 2011.
[151] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions
on Information Theory, 30:629–636, 1984.
[152] H. Robbins. Some aspects of the sequential design of experiments. Bulletin American Math-
ematical Society, 55:527–535, 1952.
[155] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal
of Machine Learning Research, page To appear, 2014.
[156] D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. In
Advances in Neural Information Processing Systems 27, 2014.
[157] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of
Operations Research, 39(4):1221–1243, 2014.
[158] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal,
1948.
[160] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
[161] A. Slavkovic and F. Yu. Genomics and privacy. Chance, 28(2):37–39, 2015.
[162] J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient
algorithm. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[164] T. Tao. An Epsilon of Room, I: Real Analysis (pages from year three of a mathematical blog),
volume 117 of Graduate Studies in Mathematics. American Mathematical Society, 2010.
[165] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
[166] R. Tibshirani, J. Taylor, R. Lockhart, and R. Tibshirani. Exact post-selection inference for
sequential regression procedures. Journal of the American Statistical Association, 111(514):
600–620, 2016.
[169] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, 1998.
[172] A. Wald. Contributions to the theory of statistical estimation and testing hypotheses. Annals
of Mathematical Statistics, 10(4):299–326, 1939.
[173] S. Warner. Randomized response: a survey technique for eliminating evasive answer bias.
Journal of the American Statistical Association, 60(309):63–69, 1965.
[174] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best poly-
nomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.
[176] A. C.-C. Yao. Some complexity questions related to distributive computing (preliminary
report). In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing,
pages 209–213. ACM, 1979.
[177] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435.
Springer-Verlag, 1997.