Lecture Notes on Statistics and Information Theory

John Duchi
4 Concentration Inequalities 67
4.1 Basic tail inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.1 Sub-Gaussian random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1.2 Sub-exponential random variables . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1.3 Orlicz norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.4 First applications of concentration: random projections . . . . . . . . . . . . 80
4.1.5 A second application of concentration: codebook generation . . . . . . . . . . 81
4.2 Martingale methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities . . . . . . . . . 84
4.2.2 Examples and bounded differences . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Matrix concentration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4 Technical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.1 Proof of Theorem 4.1.11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.2 Proof of Theorem 4.1.15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.3 Proof of Theorem 5.1.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.4 Proof of Proposition 4.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9 Minimax lower bounds: the Le Cam, Fano, and Assouad methods 217
9.1 Basic framework and minimax risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
9.2 Preliminaries on methods for lower bounds . . . . . . . . . . . . . . . . . . . . . . . 219
9.2.1 From estimation to testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
9.2.2 Inequalities between divergences and product distributions . . . . . . . . . . 221
9.2.3 Metric entropy and packing numbers . . . . . . . . . . . . . . . . . . . . . . . 223
9.3 Le Cam’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.4 Fano’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.4.1 The classical (local) Fano method . . . . . . . . . . . . . . . . . . . . . . . . 226
9.4.2 A distance-based Fano method . . . . . . . . . . . . . . . . . . . . . . . . . . 231
9.5 Assouad’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.5.1 Well-separated problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.5.2 From estimation to multiple binary tests . . . . . . . . . . . . . . . . . . . . . 235
9.5.3 Example applications of Assouad’s method . . . . . . . . . . . . . . . . . . . 237
9.6 Deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9.6.1 Proof of Proposition 9.4.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9.6.2 Proof of Corollary 9.4.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.6.3 Proof of Lemma 9.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
V Appendices 592
Chapter 1
This book explores some of the (many) connections relating information theory, statistics, computa-
tion, and learning. Signal processing, machine learning, and statistics all revolve around extracting
useful information from signals and data. In signal processing and information theory, a central
question is how to best design signals—and the channels over which they are transmitted—to max-
imally communicate and store information, and to allow the most effective decoding. In machine
learning and statistics, by contrast, it is often the case that nature provides a fixed data distri-
bution, and it is the learner’s or statistician’s goal to recover information about this (unknown)
distribution. Our goal will be to show how information-theoretic perspectives can provide clean
answers about, and techniques to perform, this recovery.
The discovery of fundamental limits forms a central aspect of information theory: the develop-
ment of results that demonstrate that certain procedures are optimal. Thus, information theoretic
tools allow a characterization of the attainable results in a variety of communication and statis-
tical settings. As we explore in the coming chapters in the context of statistical, inferential, and
machine learning tasks, this allows us to develop procedures whose optimality we can certify—no
better procedure is possible. Such results are useful for a myriad of reasons; we would like to avoid
making bad decisions or false inferences, we may realize a task is impossible, and we can explicitly
calculate the amount of data necessary for solving different statistical problems.
In this context, we provide two main high-level examples, one for each of these tasks.
Example 1.1.1 (Source coding): The source coding, or data compression problem, is to
take information from a source, compress it, decompress it, and recover the original message.
Graphically, we have

Source → Compressor (encoder) → Decompressor (decoder) → Receiver.
The question, then, is how to design a compressor (encoder) and decompressor (decoder) that
uses the fewest number of bits to describe a source (or a message) while preserving all the
information, in the sense that the receiver receives the correct message with high probability.
This fewest number of bits is then the information content of the source (signal). 3
Example 1.1.2: The channel coding, or data transmission problem, is the same as the source
coding problem of Example 1.1.1, except that between the compressor and decompressor is a
source of noise, a channel. The graphical representation becomes

Source → Compressor → Channel → Decompressor → Receiver.
Here we investigate the maximum number of bits that may be sent per each channel use in
the sense that the receiver can reconstruct the desired message with low probability of error.
Because the channel introduces noise, we require some redundancy, and information theory
studies the exact amount of redundancy—in the form of additional bits—that must be sent to
allow such reconstruction. 3
Source (P) —X_1, . . . , X_n→ Compressor —P̂→ Decompressor → Receiver.

Here, we estimate P̂—an empirical version of the distribution P that is easier to describe than
the original signal X_1, . . . , X_n—with the hope that we learn information about the generating
distribution P, or at least describe it efficiently.
In our analogy with channel coding we can connect to estimation and inference. Consider a
statistical problem in which there exists some unknown function f on a space X that we wish
to estimate, and we are able to observe a noisy version of f (Xi ) for a series of Xi drawn from
a distribution P . Recalling the graphical description of Example 1.1.2, we now have a channel
P (Y | f (X)) that gives us noisy observations of f (X) for each Xi , but we generally no longer
choose the encoder/compressor. That is, we have
Source (P) —X_1, . . . , X_n→ Compressor —f(X_1), . . . , f(X_n)→ Channel P(Y | f(X)) —Y_1, . . . , Y_n→ Decompressor.
In the statistical setting, however, we typically have no choice in the design of the compressor f that transforms the original signal X_1, . . . , X_n, which
makes it somewhat different from traditional ideas in information theory. In some cases that we
explore later—such as experimental design, randomized controlled trials, reinforcement learning
and bandits (and associated exploration/exploitation tradeoffs)—we are also able to influence the
compression part of the above scheme.
Example 1.2.1: A classical example of the statistical paradigm in this lens is the usual linear
regression problem. Here the data X_i belong to R^d, and the compression function is f(x) = θ⊤x
for some vector θ ∈ Rd . Then the channel is often of the form
Y_i = θ⊤X_i + ε_i,

where θ⊤X_i is the signal and ε_i the noise, and the ε_i ∼ N(0, σ^2) are independent mean-zero
normal perturbations. Given a sequence of
pairs (Xi , Yi ), we wish to recover the true θ in the linear model.
In active learning or active sensing scenarios, also known as (sequential) experimental
design, we may choose the sequence Xi so as to better explore properties of θ. As one concrete
idea, if we allow infinite power, which in this context corresponds to letting ∥Xi ∥ → ∞—
choosing very “large” vectors xi —then the signal of θ⊤ Xi should swamp any noise and make
estimation easier. 3
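As a concrete numerical companion to this example, here is a minimal sketch (all names and parameter values are illustrative, not from the text) of recovering θ by least squares from the pairs (X_i, Y_i), including the effect of scaling up ∥X_i∥:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 3, 2000, 0.1
theta = np.array([1.0, -2.0, 0.5])        # the unknown parameter to recover

X = rng.normal(size=(n, d))               # observed inputs X_i
eps = rng.normal(scale=sigma, size=n)     # channel noise eps_i ~ N(0, sigma^2)
Y = X @ theta + eps                       # Y_i = theta^T X_i + eps_i

# Least-squares "decoder": theta_hat = argmin_t ||Y - X t||^2
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
err = np.abs(theta_hat - theta).max()

# Scaling the inputs ("more power") swamps the noise and shrinks the error
theta_big, *_ = np.linalg.lstsq(100 * X, (100 * X) @ theta + eps, rcond=None)
err_big = np.abs(theta_big - theta).max()

print(err, err_big)   # err_big is roughly 100x smaller
```

The second fit illustrates the "infinite power" remark: multiplying every X_i by 100 leaves the noise untouched, so the relative noise level, and hence the estimation error, drops by the same factor.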
context, we develop PAC-Bayesian bounds, and we also use the same framework to present tools to
control generalization and convergence in interactive data analyses. These types of analyses reflect
modern statistics, where one performs some type of data exploration before committing to a fuller
analysis, but which breaks classical statistical approaches, because the analysis now depends on
the sample. We provide a treatment of more advanced ideas in Chapter 7, where we develop more
sophisticated concentration results, such as on random matrices, using core ideas from information
theory, which allow us to connect divergence measures to different random processes. Finally, we
provide a chapter (Chapter 8) on disclosure limitation and privacy techniques, all of which repose
on different notions of stability in distribution.
Part II studies fundamental limits, using information-theoretic techniques to derive lower
bounds on the possible rates of convergence for various estimation, learning, and other statistical
problems. Chapter 9 kicks things off by developing the three major methods for lower bounds:
the Assouad, Fano, and Le Cam methods. This chapter shows the basic techniques from which all
the other lower bound ideas follow. At a high level, we might consider it, along with Part I, as
exhibiting the entire object of study of this book: how do distributions get close to one another, and
how can we leverage that closeness? We give a brief treatment of some lower bounding techniques
beyond these approaches in Chapter 10, including applications to certain nonparametric problems,
as well as a few results that move beyond the typical lower bounds, which apply in expectation,
to some that mimic “strong converses” in information theory, meaning that with exceedingly high
probability, one cannot hope to achieve anything better than average case error guarantees.
In modern statistical learning problems, one frequently has concerns beyond just statistical risk,
such as communication or computational cost, or the privacy of study participants. Accordingly,
we develop some of the recent techniques for such problems in Chapter 11, where we
wish to obtain optimality guarantees simultaneously along many dimensions, connecting to
communication complexity ideas from information theory. Chapter 12 provides a bit of a throwback to
estimation with squared error—the most common error metric—introducing the classical statistical
tools we have, but shows a few of the more modern applications of the ideas, which re-appear with
some frequency. Finally, we conclude the discussion of fundamental limits by looking at testing
problems and functional estimation, where one wishes to only estimate a single parameter of a
larger model (Chapter 13). While estimating a single scalar might seem, a priori, to be simpler
than other problems, adequately addressing its complexity requires a fairly nuanced treatment and
the introduction of careful information-theoretic tools.
Part III revisits all of our information theoretic notions from Chapter 2, but instead of simply
giving definitions and a few consequences, provides operational interpretations of the different
information-theoretic quantities, such as entropy. Of course this includes Shannon’s original results
on the relationship between coding and entropy (which we cover in the overview Chapter 2.4.1 on
information theory), but we also provide an interpretation of entropy and information as measures
of uncertainty in statistical experiments and statistical learning, which is a perspective typically
missing from information-theoretic treatments of entropy (Chapter 14). Our treatment shows a
deep connection between entropy and loss functions used for prediction, where a particular duality
allows moving back and forth between them.
We connect these ideas to the problem of calibration in Chapter 15, where we ask that a
prediction model be valid in the sense that, e.g., on 75% of the days for which the model predicts a
75% chance of rain, it indeed rains. We are also able to use these information-theoretic notions of risk, entropy, and losses
to connect to problems in optimization and machine learning. In particular, Chapter 16 explores the
ways that, if instead of fitting a model to some “true” loss we use an easier-to-optimize surrogate,
we essentially lose nothing. This allows us to delineate when (at least in asymptotic senses) it
is possible to computationally efficiently learn good predictors and design good experiments in
statistical machine learning problems. Because of the connections with optimization and convex
duality, these chapters repose on a nontrivial foundation of convex analysis; we include Appendices
(Appendix B and C) that provide a fairly comprehensive review of the results we require. For
readers unfamiliar with convex optimization and analysis, I will be the first to admit that these
chapters may be tough going—accordingly, we attempt to delineate the big-picture ideas from the
nitty-gritty technical conditions necessary for the most general results.
Part IV finishes the book with a treatment of stochastic optimization, online game playing,
and minimax problems. Our approach in Chapter 17 takes a modern perspective on stochastic
optimization as minimizing random models of functions, and it includes the “book” proofs of
convergence of the workhorses of modern machine learning optimization. It also leverages the
earlier results on fundamental limits to develop optimality theory for convex optimization in the
same framework. Chapter 18 explores online decision-making problems and, more broadly, problems
that require exploration and exploitation. This includes bandit problems and some basic questions
in causal estimation, where information-theoretic tools allow a clean treatment. The concluding
Chapter 19 revisits Chapter 14 on loss functions and predictions, but considers it more in the
context of particular games between nature and a statistician/learner. Once again leveraging the
perspective on entropy and loss functions we have developed, we are able to provide a generalization
of the celebrated redundancy/capacity theorem from information theory, but recast as a game of
loss minimization against nature.
Chapter 2
In this first introductory chapter, we discuss and review many of the basic concepts of information
theory in an effort to introduce them to readers unfamiliar with the tools. Our presentation is relatively
brisk, as our main goal is to get to the meat of the chapters on applications of the inequalities and
tools we develop, but these provide the starting point for everything in the sequel. One of the
main uses of information theory is to prove what, in an information theorist’s lexicon, are known
as converse results: fundamental limits that guarantee no procedure can improve over a particular
benchmark or baseline. We will give the first of these here to preview more of what is to come,
as these fundamental limits form one of the core connections between statistics and information
theory. The tools of information theory, in addition to their mathematical elegance, also come
with strong operational interpretations: they give quite precise answers and explanations for a
variety of real engineering and statistical phenomena. We will touch on one of these here (the
connection between source coding, or lossless compression, and the Shannon entropy), and much
of the remainder of the book will explore more.
2.1.1 Definitions
Here, we provide the basic definitions of entropy, information, and divergence, assuming the random
variables of interest are discrete or have densities with respect to Lebesgue measure.
Entropy: We begin with a central concept in information theory: the entropy. Let P be a distri-
bution on a finite (or countable) set X , and let p denote the probability mass function associated
with P . That is, if X is a random variable distributed according to P , then P (X = x) = p(x). The
entropy of X (or of P ) is defined as
H(X) := −∑_x p(x) log p(x).
Because p(x) ≤ 1 for all x, it is clear that this quantity is nonnegative. We will show later that if X
is finite, the maximum entropy distribution on X is the uniform distribution, setting p(x) = 1/|X |
for all x, which has entropy log(|X |).
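A direct transcription of this definition (a hypothetical helper function, not from the text), which also illustrates that the uniform distribution attains log |X|:

```python
import math

def entropy(pmf):
    """H(X) = -sum_x p(x) log p(x), in nats; outcomes with p(x) = 0 contribute 0."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform))   # log 4, about 1.386, the maximum over 4 outcomes
print(entropy(skewed))    # strictly smaller
```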
Later in the class, we provide a number of operational interpretations of the entropy. The
most common interpretation—which forms the beginning of Shannon’s classical information the-
ory [167]—is via the source-coding theorem. We present Shannon’s source coding theorem in
Section 2.4.1, where we show that if we wish to encode a random variable X, distributed according
to P , with a k-ary string (i.e. each entry of the string takes
P on one of k values), then the minimal
expected length of the encoding is given by H(X) = − x p(x) logk p(x). Moreover, this is achiev-
able (to within a length of at most 1 symbol) by using Huffman codes (among many other types of
codes). As an example of this interpretation, we may consider encoding a random variable X with
equi-probable distribution on m items, which has H(X) = log(m). In base-2, this makes sense: we
simply assign an integer to each item and encode each integer with the natural (binary) integer
encoding of length ⌈log m⌉.
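To make the integer-encoding remark concrete, here is a small sketch (the function name is ours) assigning each of m equiprobable items a fixed-length binary codeword of ⌈log₂ m⌉ bits:

```python
import math

def fixed_length_code(m):
    """Binary codewords of length ceil(log2 m) for m equiprobable items."""
    L = math.ceil(math.log2(m))
    return {i: format(i, "b").zfill(L) for i in range(m)}

code = fixed_length_code(5)        # H(X) = log2(5), about 2.32 bits per symbol
print(code[3])                     # '011': every codeword uses 3 = ceil(log2 5) bits
print({len(w) for w in code.values()})
```

For m = 5 the entropy is about 2.32 bits, while the fixed-length code spends 3 bits, within the "at most 1 symbol" slack mentioned above.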
We can also define the conditional entropy, which is the amount of information left in a random
variable after observing another. In particular, we define
H(X | Y = y) = −∑_x p(x | y) log p(x | y)   and   H(X | Y) = ∑_y p(y) H(X | Y = y).
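The two formulas above combine into a single sum over the joint pmf, since p(y) p(x | y) = p(x, y); a small sketch on a made-up joint distribution (the values are illustrative, not from the text):

```python
import math

# A hypothetical joint pmf p(x, y) under which Y is informative about X.
joint = {("a", 0): 0.4, ("b", 0): 0.1, ("a", 1): 0.1, ("b", 1): 0.4}

def cond_entropy(joint):
    """H(X | Y) = sum_y p(y) H(X | Y = y) = -sum_{x,y} p(x,y) log p(x|y)."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    # p(x | y) = p(x, y) / p(y)
    return -sum(p * math.log(p / p_y[y]) for (x, y), p in joint.items())

print(cond_entropy(joint))   # about 0.500, below H(X) = log 2, about 0.693
```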
Example 2.1.2 (Bernoulli random variables): Let h2 (p) = −p log p − (1 − p) log(1 − p) denote
the binary entropy, which is the entropy of a Bernoulli(p) random variable. 3
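A one-line sketch of the binary entropy (computed here in nats), which is maximized at p = 1/2:

```python
import math

def h2(p):
    """Binary entropy h2(p) = -p log p - (1-p) log(1-p), in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

print(h2(0.5))        # log 2, about 0.693, the maximum
print(h2(0.9))        # a biased coin is less uncertain
```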
Example 2.1.4 (A random variable with infinite entropy): While most “reasonable” discrete
random variables have finite entropy, it is possible to construct distributions with infinite
entropy. Indeed, let X have p.m.f. on {2, 3, . . .} defined by
p(k) = A / (k log^2 k),   where   A^{−1} = ∑_{k=2}^∞ 1/(k log^2 k) < ∞,

the last sum finite as ∫_2^∞ 1/(x log^α x) dx < ∞ if and only if α > 1: for α = 1, we have
∫_e^x 1/(t log t) dt = log log x, while for α > 1, we have

d/dx (log x)^{1−α} = (1 − α) · 1/(x log^α x),

so that ∫_e^∞ 1/(t log^α t) dt = 1/(α − 1). To see that the entropy is infinite, note that

H(X) = A ∑_{k≥2} ( log(1/A) + log k + 2 log log k ) / (k log^2 k) ≥ A ∑_{k≥2} (log k) / (k log^2 k) − C = A ∑_{k≥2} 1/(k log k) − C = ∞,

for a finite constant C, because ∑_{k≥2} 1/(k log k) diverges.
KL-divergence: Now we define two additional quantities, which are actually much more funda-
mental than entropy: they can always be defined for any distributions and any random variables,
as they measure distance between distributions. Entropy simply makes no sense for non-discrete
random variables, let alone random variables with continuous and discrete components, though it
proves useful for some of our arguments and interpretations.
Before defining these quantities, we recall the definition of a convex function f : Rk → R as any
bowl-shaped function, that is, one satisfying
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) (2.1.1)
for all λ ∈ [0, 1], all x, y. The function f is strictly convex if the convexity inequality (2.1.1) is
strict for λ ∈ (0, 1) and x ̸= y. We recall a standard result:
Proposition 2.1.5 (Jensen’s inequality). Let f be convex. Then for any random variable X,
f (E[X]) ≤ E[f (X)].
Moreover, if f is strictly convex, then f (E[X]) < E[f (X)] unless X is constant.
Now we may define and provide a few properties of the KL-divergence. Let P and Q be
distributions defined on a discrete set X . The KL-divergence between them is
Dkl (P ||Q) := ∑_{x∈X} p(x) log ( p(x) / q(x) ).
We observe immediately that Dkl (P ||Q) ≥ 0. To see this, we apply Jensen’s inequality (Propo-
sition 2.1.5) to the function − log and the random variable q(X)/p(X), where X is distributed
according to P :
Dkl (P ||Q) = −E[ log ( q(X)/p(X) ) ] ≥ −log E[ q(X)/p(X) ] = −log ∑_x p(x) · ( q(x)/p(x) ) = −log(1) = 0.
Moreover, as −log is strictly convex, we have Dkl (P ||Q) > 0 unless P = Q. Another consequence of
the positivity of the KL-divergence is that whenever the set X is finite with cardinality |X | < ∞,
for any random variable X supported on X we have H(X) ≤ log |X |. Indeed, letting m = |X |, Q
be the uniform distribution on X so that q(x) = 1/m, and X have distribution P on X , we have

0 ≤ Dkl (P ||Q) = ∑_x p(x) log ( p(x)/q(x) ) = −H(X) − ∑_x p(x) log q(x) = −H(X) + log m,   (2.1.2)
so that H(X) ≤ log m. Thus, the uniform distribution has the highest entropy over all distributions
on the set X .
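A quick numerical check of the identity (2.1.2), with a hypothetical pmf and helper functions of our own naming:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """D_kl(P||Q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

m = 4
p = [0.5, 0.25, 0.15, 0.10]
u = [1 / m] * m

print(kl(p, u), math.log(m) - entropy(p))  # identity (2.1.2): the two agree
```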
Mutual information: Having defined KL-divergence, we may now describe the information
content between two random variables X and Y . The mutual information I(X; Y ) between X and
Y is the KL-divergence between their joint distribution and the product of their marginal
distributions. More mathematically,

I(X; Y ) := ∑_{x,y} p(x, y) log ( p(x, y) / (p(x)p(y)) ).   (2.1.3)
We can rewrite this in several ways. First, using Bayes’ rule, we have p(x, y)/p(y) = p(x | y), so
I(X; Y ) = ∑_{x,y} p(y) p(x | y) log ( p(x | y) / p(x) )
= −∑_x ∑_y p(y) p(x | y) log p(x) + ∑_y p(y) ∑_x p(x | y) log p(x | y)
= H(X) − H(X | Y ).
Similarly, we have I(X; Y ) = H(Y ) − H(Y | X), so mutual information can be thought of as the
amount of entropy removed (on average) in X by observing Y . We may also think of mutual infor-
mation as measuring the similarity between the joint distribution of X and Y and their distribution
when they are treated as independent.
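Numerically, the definition (2.1.3) and the identity I(X; Y) = H(X) − H(X | Y) agree; a sketch with a made-up joint pmf:

```python
import math

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# Definition (2.1.3): sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
I_def = sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items())

# H(X) - H(X | Y), using p(x | y) = p(x, y) / p(y)
H_x = -sum(p * math.log(p) for p in px.values())
H_x_given_y = -sum(p * math.log(p / py[y]) for (x, y), p in joint.items())

print(I_def, H_x - H_x_given_y)  # the two expressions agree; both positive here
```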
Comparing the definition (2.1.3) to that for KL-divergence, we see that if PXY is the joint
distribution of X and Y , while PX and PY are their marginal distributions (distributions when X
and Y are treated independently), then I(X; Y ) = Dkl (PXY ||PX × PY ).
Entropies of continuous random variables For continuous random variables, we may define
an analogue of the entropy known as differential entropy, which for a random variable X with
density p is defined by

h(X) := −∫ p(x) log p(x) dx.   (2.1.4)
Note that the differential entropy may be negative—it is no longer directly a measure of the number
of bits required to describe a random variable X (on average), as was the case for the entropy. We
can similarly define the conditional entropy
h(X | Y ) = −∫ p(y) ∫ p(x | y) log p(x | y) dx dy.

We remark that the conditional differential entropy of X given Y for Y with arbitrary distribution—
so long as X has a density—is

h(X | Y ) = E[ −∫ p(x | Y ) log p(x | Y ) dx ],
where p(x | y) denotes the conditional density of X when Y = y. The KL divergence between
distributions P and Q with densities p and q becomes
Dkl (P ||Q) = ∫ p(x) log ( p(x)/q(x) ) dx,

and similarly, we have the analogues of mutual information as

I(X; Y ) = ∫ p(x, y) log ( p(x, y) / (p(x)p(y)) ) dx dy = h(X) − h(X | Y ) = h(Y ) − h(Y | X).
As we show in the next subsection, we can define the KL-divergence between arbitrary distributions
(and mutual information between arbitrary random variables) more generally without requiring
discrete or continuous distributions. Before investigating these issues, however, we present a few
examples. We also see immediately that for X uniform on a set [a, b], we have h(X) = log(b − a).
Example 2.1.6 (Entropy of normal random variables): The differential entropy (2.1.4) of
a normal random variable is straightforward to compute. Indeed, for X ∼ N(µ, σ^2) we have
p(x) = (1/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2)), so that

h(X) = −∫ p(x) [ log (1/√(2πσ^2)) − (x − µ)^2/(2σ^2) ] dx = (1/2) log(2πσ^2) + E[(X − µ)^2]/(2σ^2) = (1/2) log(2πσ^2) + 1/2 = (1/2) log(2πeσ^2).

For a general multivariate Gaussian, where X ∼ N(µ, Σ) for a vector µ ∈ R^n and Σ ≻ 0 with
density p(x) = (1/((2π)^{n/2} √det(Σ))) exp(−(1/2)(x − µ)⊤Σ^{−1}(x − µ)), we similarly have

h(X) = (1/2) E[ n log(2π) + log det(Σ) + (X − µ)⊤Σ^{−1}(X − µ) ]
= (n/2) log(2π) + (1/2) log det(Σ) + (1/2) tr(ΣΣ^{−1}) = (n/2) log(2πe) + (1/2) log det(Σ).
3
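One can sanity-check the scalar formula by Monte Carlo, since h(X) = E[−log p(X)]; a sketch with illustrative parameters:

```python
import math
import random

random.seed(0)
mu, sigma = 1.0, 2.0
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def neg_log_density(x):
    return 0.5 * math.log(2 * math.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2)

# h(X) = E[-log p(X)], estimated from samples X ~ N(mu, sigma^2)
n = 200_000
est = sum(neg_log_density(random.gauss(mu, sigma)) for _ in range(n)) / n
print(closed_form, est)  # agree to about two decimal places
```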
Continuing our examples with normal distributions, we may compute the divergence between
two multivariate Gaussian distributions:
Example 2.1.7 (Divergence between Gaussian distributions): Let P be the multivariate
normal N(µ1 , Σ), and Q be the multivariate normal distribution with mean µ2 and identical
covariance Σ ≻ 0. Then we have that

Dkl (P ||Q) = (1/2) (µ1 − µ2)⊤Σ^{−1}(µ1 − µ2).   (2.1.5)
We leave the computation of the identity (2.1.5) to the reader. 3
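In one dimension the identity (2.1.5) reads Dkl(P||Q) = (µ1 − µ2)^2/(2σ^2), and a Monte Carlo estimate of E_P[log p(X) − log q(X)] matches it; the parameter values below are illustrative:

```python
import math
import random

random.seed(1)
mu1, mu2, sigma = 0.0, 1.5, 2.0
closed_form = (mu1 - mu2) ** 2 / (2 * sigma**2)   # identity (2.1.5) in one dimension

def log_density(x, mu):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# D_kl(P||Q) = E_P[log p(X) - log q(X)], with X ~ P = N(mu1, sigma^2)
n = 200_000
est = sum(
    log_density(x, mu1) - log_density(x, mu2)
    for x in (random.gauss(mu1, sigma) for _ in range(n))
) / n
print(closed_form, est)  # close to each other
```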
An interesting consequence of Example 2.1.7 is that if a random vector X has a given covari-
ance Σ ∈ Rn×n , then the multivariate Gaussian with identical covariance has larger differential
entropy. Put another way, differential entropy for random variables with second moments is always
maximized by the Gaussian distribution.
Proposition 2.1.8. Let X be a random vector on Rn with a density, and assume that Cov(X) = Σ.
Then for Z ∼ N(0, Σ), we have
h(X) ≤ h(Z).
Proof Without loss of generality, we assume that X has mean 0. Let P be the distribution of
X with density p, and let Q be multivariate normal with mean 0 and covariance Σ; let Z be this
random variable. Then

Dkl (P ||Q) = ∫ p(x) log ( p(x)/q(x) ) dx = −h(X) + ∫ p(x) [ (n/2) log(2π) + (1/2) log det(Σ) + (1/2) x⊤Σ^{−1}x ] dx
= −h(X) + h(Z),
because Z has the same covariance as X. As 0 ≤ Dkl (P ||Q), we have h(Z) ≥ h(X) as desired.
We remark in passing that the fact that Gaussian random variables have the largest entropy has
been used to prove stronger variants of the central limit theorem; see the original results of Barron
[16], as well as later quantitative results on the increase of entropy of normalized sums by Artstein
et al. [9] and Madiman and Barron [143].
Chain rules for information and divergence: As another immediate corollary to the chain
rule for entropy, we see that mutual information also obeys a chain rule:
I(X; Y_1^n) = ∑_{i=1}^n I(X; Y_i | Y_1^{i−1}).

Indeed, we have

I(X; Y_1^n) = H(Y_1^n) − H(Y_1^n | X) = ∑_{i=1}^n [ H(Y_i | Y_1^{i−1}) − H(Y_i | X, Y_1^{i−1}) ] = ∑_{i=1}^n I(X; Y_i | Y_1^{i−1}).
The KL-divergence obeys similar chain rules, making mutual information and KL-divergence mea-
sures useful tools for evaluation of distances and relationships between groups of random variables.
As a second example, suppose that the distribution P = P1 ×P2 ×· · ·×Pn , and Q = Q1 ×· · ·×Qn ,
that is, that P and Q are product distributions over independent random variables Xi ∼ Pi or
Xi ∼ Qi . Then we immediately have the tensorization identity
Dkl (P ||Q) = Dkl (P_1 × · · · × P_n ||Q_1 × · · · × Q_n) = ∑_{i=1}^n Dkl (P_i ||Q_i).
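A quick numeric check of this tensorization on two hypothetical binary coordinates:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P1, Q1 = [0.3, 0.7], [0.5, 0.5]
P2, Q2 = [0.1, 0.9], [0.2, 0.8]

# Product distributions over pairs, enumerated in a fixed order
P = [p1 * p2 for p1 in P1 for p2 in P2]
Q = [q1 * q2 for q1 in Q1 for q2 in Q2]

print(kl(P, Q), kl(P1, Q1) + kl(P2, Q2))  # the joint divergence is the sum
```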
We remark in passing that these two identities hold for arbitrary distributions Pi and Qi or random
variables X, Y . As a final tensorization identity, we consider a more general chain rule for KL-divergences,
which will frequently be useful. We abuse notation temporarily, and for random
variables X and Y with distributions P and Q, respectively, we denote Dkl (X||Y ) := Dkl (P ||Q).
In analogy to the entropy, we can also define the conditional KL divergence. Let X and Y have
distributions P_{X|z} and P_{Y|z} conditioned on Z = z, respectively. Then we define

Dkl (X||Y | Z) = E_Z[ Dkl (P_{X|Z} ||P_{Y|Z}) ],

so that if Z is discrete we have Dkl (X||Y | Z) = ∑_z p(z) Dkl (P_{X|z} ||P_{Y|z}). With this notation, we
have the chain rule

Dkl (X_1, . . . , X_n ||Y_1, . . . , Y_n) = ∑_{i=1}^n Dkl (X_i ||Y_i | X_1^{i−1}),   (2.1.6)
because (in the discrete case, which—as we discuss presently—is fully general for this purpose) for
distributions PXY and QXY we have
Dkl (PXY ||QXY ) = ∑_{x,y} p(x, y) log ( p(x, y) / q(x, y) ) = ∑_{x,y} p(x) p(y | x) [ log ( p(y | x) / q(y | x) ) + log ( p(x) / q(x) ) ]
= ∑_x p(x) log ( p(x) / q(x) ) + ∑_x p(x) ∑_y p(y | x) log ( p(y | x) / q(y | x) ),
where the final equality uses that ∑_y p(y | x) = 1 for all x. In different notation, if we let P and
Q be any distributions on X_1 × · · · × X_n , and define P_i(A | x_1^{i−1}) = P(X_i ∈ A | X_1^{i−1} = x_1^{i−1}), and
similarly for Q_i , we have the following:
Expanding upon this, we give several tensorization identities, showing how to transform ques-
tions about the joint distribution of many random variables to simpler questions about their
marginals. As a first example, we see that as a consequence of the fact that conditioning de-
creases entropy, we see that for any sequence of (discrete or continuous, as appropriate) random
variables, we have
Both equalities hold with equality if and only if X1 , . . . , Xn are mutually independent. (The only
if follows because I(X; Y ) > 0 whenever X and Y are not independent, by Jensen’s inequality and
the fact that Dkl (P ||Q) > 0 unless P = Q.)
We return to information and divergence now. Suppose that random variables Y_i are independent
conditional on X, meaning that p(y_1, . . . , y_n | x) = ∏_{i=1}^n p(y_i | x).
Such scenarios are common—as we shall see—when we make multiple observations from a fixed
distribution parameterized by some X. Then we have the inequality
I(X; Y_1, . . . , Y_n) = ∑_{i=1}^n [ H(Y_i | Y_1^{i−1}) − H(Y_i | X, Y_1^{i−1}) ]
= ∑_{i=1}^n [ H(Y_i | Y_1^{i−1}) − H(Y_i | X) ] ≤ ∑_{i=1}^n [ H(Y_i) − H(Y_i | X) ] = ∑_{i=1}^n I(X; Y_i),   (2.1.7)
where the middle equality uses the assumed conditional independence of the Y_i given X, and the
inequality follows because conditioning decreases entropy. We conclude with the data processing
inequality, which applies when random variables form a Markov chain

X → Y → Z,

meaning that Z is conditionally independent of X given Y.
Proposition 2.1.10. With the above Markov chain, we have I(X; Z) ≤ I(X; Y ).
Proof We expand the mutual information I(X; Y, Z) in two ways via the chain rule:

I(X; Y, Z) = I(X; Z) + I(X; Y | Z) = I(X; Y ) + I(X; Z | Y ) = I(X; Y ),

where we note that the final equality follows because X is independent of Z given Y , so that
I(X; Z | Y ) = 0. Since I(X; Y | Z) ≥ 0, we conclude that I(X; Z) ≤ I(X; Y ).
There are related data processing inequalities for the KL-divergence—which we generalize in
the next section—as well. In this case, we may consider a simple Markov chain X → Z. If we
let P_1 and P_2 be distributions on X and Q_1 and Q_2 be the induced distributions on Z, that is,
Q_i(A) = ∫ P(Z ∈ A | x) dP_i(x), then we have

Dkl (Q_1 ||Q_2) ≤ Dkl (P_1 ||P_2),
the basic KL-divergence data processing inequality. A consequence of this is that, for any function
f and random variables X and Y on the same space, we have Dkl (f(X)||f(Y )) ≤ Dkl (X||Y ).
We explore these data processing inequalities more when we generalize KL-divergences in the next
section and in the exercises.
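For a finite alphabet the induced distribution Q_i is just a matrix-vector product with the channel's transition matrix, and the divergence can only shrink; a sketch with a made-up binary channel:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Channel K[x][z] = P(Z = z | X = x), a row-stochastic matrix (illustrative values)
K = [[0.9, 0.1],
     [0.3, 0.7]]

def push(p):
    """Induced distribution on Z: Q(z) = sum_x p(x) K[x][z]."""
    return [sum(p[x] * K[x][z] for x in range(2)) for z in range(2)]

P1, P2 = [0.8, 0.2], [0.4, 0.6]
Q1, Q2 = push(P1), push(P2)

print(kl(P1, P2), kl(Q1, Q2))  # processing through K decreases the divergence
```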
1. The set X ∈ A.
2. The collection of sets A is closed under finite set operations: union, intersection, and com-
plementation. That is, A, B ∈ A implies that Ac ∈ A, A ∩ B ∈ A, and A ∪ B ∈ A.
There is a 1-to-1 correspondence between quantizers—and their associated partitions of the set
X —and finite algebras on a set X , which we discuss briefly.1 It should be clear that there is a
one-to-one correspondence between finite partitions of the set X and quantizers q, so we must argue
that finite partitions of X are in one-to-one correspondence with finite algebras defined over X .
1
Pedantically, this one-to-one correspondence holds up to permutations of the partition induced by the quantizer.
In one direction, we may consider a quantizer q : X → {1, . . . , m}. Let the sets A1 , . . . , Am
be the partition associated with q, that is, for x ∈ Ai we have q(x) = i, or Ai = q−1 ({i}). Then
we may define an algebra Aq as the collection of all finite set operations performed on A1 , . . . , Am
(note that this is a finite collection, as finite set operations performed on the partition A1 , . . . , Am
induce only a finite collection of sets).
For the other direction, consider a finite algebra A over the set X . We can then construct a
quantizer qA that corresponds to this algebra. To do so, we define an atom of A as any non-empty
set A ∈ A such that if B ⊂ A and B ∈ A, then B = A or B = ∅. That is, the atoms of A are the
“smallest” sets in A. We claim there is a unique partition of X into atoms of A, say A1 , . . . , Am , a claim that one may prove inductively. The quantizer qA associated with A is then defined by
qA (x) = i when x ∈ Ai .
2.2.2 KL-divergence
In this section, we present the general definition of a KL-divergence, which holds for any pair of
distributions. Let P and Q be distributions on a space X . Now, let A be a finite algebra on X
(as in the previous section, this is equivalent to picking a partition of X and then constructing the
associated algebra), and assume that its atoms are atoms(A). The KL-divergence between P and
Q conditioned on A is
Dkl (P ||Q | A) := Σ_{A ∈ atoms(A)} P (A) log [P (A)/Q(A)] .
That is, we simply sum over the partition of X . Another way to write this is as follows. Let
q : X → {1, . . . , m} be a quantizer, and define the sets Ai = q−1 ({i}) to be the pre-images of each
i (i.e. the different quantization regions, or the partition of X that q induces). Then the quantized
KL-divergence between P and Q is
Dkl (P ||Q | q) := Σ_{i=1}^m P (Ai ) log [P (Ai )/Q(Ai )] .
We may now give the fully general definition of KL-divergence: the KL-divergence between P and Q is defined as

Dkl (P ||Q) := sup_q Dkl (P ||Q | q) ,     (2.2.1)

where the supremum is taken over all quantizers q of X (equivalently, over all finite partitions of X ).
This also gives a rigorous definition of mutual information. Indeed, if X and Y are random variables
with joint distribution PXY and marginal distributions PX and PY , we simply define
I(X; Y ) = Dkl (PXY ||PX × PY ) .
When P and Q have densities p and q, the definition (2.2.1) reduces to

Dkl (P ||Q) = ∫_R p(x) log [p(x)/q(x)] dx,
while if P and Q both have probability mass functions p and q, then—as we see in Exercise 2.6—the
definition (2.2.1) is equivalent to
Dkl (P ||Q) = Σ_x p(x) log [p(x)/q(x)] .
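For concreteness, a direct implementation of the discrete formula (a sketch; the pmfs below are arbitrary examples) exhibits the basic properties Dkl (P ||P ) = 0 and Dkl (P ||Q) ≠ Dkl (Q||P ) in general:

```python
import math

def kl_divergence(p, q):
    """Discrete KL-divergence sum_x p(x) log(p(x)/q(x)), in nats.

    Conventions: 0 * log(0/q) = 0, and p * log(p/0) = +infinity.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return float("inf")
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.25, 0.25]
q = [0.1, 0.45, 0.45]
```

For these pmfs, kl_divergence(p, q) and kl_divergence(q, p) differ, illustrating the asymmetry of the divergence.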
Measure-theoretic definition of KL-divergence If you have never seen measure theory before, skim this section; while the notation may be somewhat intimidating, it is fine to always consider only continuous or fully discrete distributions. We describe an interpretation below that means, for our purposes, one never really needs to think about measure-theoretic issues.
The general definition (2.2.1) of KL-divergence is equivalent to the following. Let µ be a measure
on X , and assume that P and Q are absolutely continuous with respect to µ, with densities p and
q, respectively. (For example, take µ = P + Q.) Then
Dkl (P ||Q) = ∫_X p(x) log [p(x)/q(x)] dµ(x).     (2.2.2)
The proof of this fact is somewhat involved, requiring the technology of Lebesgue integration. (See
Gray [104, Chapter 5].)
For those who have not seen measure theory, the interpretation of the equality (2.2.2) should be as follows. When integrating a function f (x), replace the symbol ∫ f (x)dµ(x) with one of two pairs of symbols: one may simply think of dµ(x) as dx, so that we are performing standard integration ∫ f (x)dx, or one should think of the integral operation ∫ f (x)dµ(x) as summing the argument of the integral, so that dµ(x) = 1 and ∫ f (x)dµ(x) = Σ_x f (x). (This corresponds to µ being “counting measure” on X .)
2.2.3 f -divergences
A more general notion of divergence is the so-called f -divergence, or Ali-Silvey divergence [6, 59]
(see also the alternate interpretations in the article by Liese and Vajda [137]). Here, the definition
is as follows. Let P and Q be probability distributions on the set X , and let f : R+ → R be a
convex function satisfying f (1) = 0. If X is a discrete set, then the f -divergence between P and Q
is
Df (P ||Q) := Σ_x q(x) f (p(x)/q(x)) .
More generally, for any set X and a quantizer q : X → {1, . . . , m}, letting Ai = q−1 ({i}) = {x ∈
X | q(x) = i} be the partition the quantizer induces, we can define the quantized divergence
Df (P ||Q | q) = Σ_{i=1}^m Q(Ai ) f (P (Ai )/Q(Ai )) ,
and the general definition of an f -divergence is (in analogy with the definition (2.2.1) of general KL-divergences)

Df (P ||Q) := sup_q Df (P ||Q | q) ,     (2.2.3)

the supremum taken over all quantizers q of X .
The definition (2.2.3) shows that, any time we have computations involving f -divergences—such as KL-divergence or mutual information—it is no loss of generality, when performing the computations, to assume that all distributions have finite discrete support. There is a measure-theoretic version of the definition (2.2.3) which is frequently easier to use. Assume w.l.o.g. that P and Q are absolutely continuous with respect to the base measure µ, with densities p and q. The f -divergence between P and Q is then

Df (P ||Q) := ∫_X q(x) f (p(x)/q(x)) dµ(x).     (2.2.4)
This definition, it turns out, is not quite as general as we would like—in particular, it is unclear
how we should define the integral for points x such that q(x) = 0. With that in mind, we recall
that the perspective transform (see Appendices B.1.1 and B.3.3) of a function f : R → R is defined
by pers(f )(t, u) = uf (t/u) if u > 0 and by +∞ if u ≤ 0. This function is convex in its arguments
(Proposition B.3.12). In fact, this is not quite enough for the fully correct definition. The closure of a convex function f is cl f (x) = sup{ℓ(x) | ℓ ≤ f, ℓ affine}, the supremum over all affine functions that globally lower bound f . Then [111, Proposition IV.2.2.2] the closure of pers(f ) is defined, for any t′ ∈ int dom f , by

cl pers(f )(t, u) = u f (t/u)                    if u > 0,
                   lim_{α↓0} α f (t′ + t/α)     if u = 0,
                   +∞                           if u < 0.
(The choice of t′ does not affect the definition.) Then the fully general formula expressing the f -divergence is

Df (P ||Q) = ∫_X cl pers(f )(p(x), q(x)) dµ(x).     (2.2.5)
This is what we mean by equation (2.2.4), which we use without comment.
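The quantized representation also yields an easy numerical illustration. The sketch below (with arbitrary example pmfs) implements the finite-support formula for Df and checks that coarsening a partition can only decrease the divergence—the ordering of quantizers the exercises develop:

```python
import math

def f_divergence(p, q, f):
    """sum_x q(x) f(p(x)/q(x)) for pmfs with q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def quantize(p, partition):
    """Collapse a pmf according to a partition of its indices."""
    return [sum(p[i] for i in block) for block in partition]

f_kl = lambda t: t * math.log(t) if t > 0 else 0.0   # generates KL-divergence
f_tv = lambda t: 0.5 * abs(t - 1)                    # generates total variation

p = [0.4, 0.1, 0.3, 0.2]
q = [0.1, 0.3, 0.3, 0.3]
partition = [[0, 1], [2, 3]]  # a coarse quantization of a 4-point space

for f in (f_kl, f_tv):
    coarse = f_divergence(quantize(p, partition), quantize(q, partition), f)
    assert coarse <= f_divergence(p, q, f) + 1e-12
```

Any convex f with f(1) = 0 may be plugged into the same routine.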
In the exercises, we explore several properties of f -divergences, including the quantized repre-
sentation (2.2.3), showing different data processing inequalities and orderings of quantizers based
on the fineness of their induced partitions. Broadly, f -divergences satisfy essentially the same properties as KL-divergence, such as data-processing inequalities, and they provide a generalization of mutual information. We explore f -divergences from additional perspectives later—they are important both for optimality in estimation and for consistency and prediction problems, as we discuss in Chapter 16.5.
Examples We give several examples of f -divergences here; in Section 9.2.2 we provide a few
examples of their uses as well as providing a few natural inequalities between them.
Example 2.2.1 (KL-divergence): By taking f (t) = t log t, which is convex and satisfies
f (1) = 0, we obtain Df (P ||Q) = Dkl (P ||Q). 3
Example 2.2.3 (Total variation distance): The total variation distance between probability
distributions P and Q defined on a set X is the maximum difference between probabilities they
assign on subsets of X :
∥P − Q∥TV := sup_{A⊂X } |P (A) − Q(A)| = sup_{A⊂X } (P (A) − Q(A)),     (2.2.6)

where the second equality follows by considering complements (P (Ac ) = 1 − P (A)). The total variation distance, as we shall see later, is important for verifying the optimality of different tests, and appears in the measurement of difficulty of solving hypothesis testing problems. With the choice f (t) = (1/2)|t − 1|, we obtain the total variation distance, that is, ∥P − Q∥TV = Df (P ||Q).
There are several alternative characterizations, which we provide as Lemma 2.2.4 next; it will
be useful in the sequel when we develop inequalities relating the divergences. 3
Lemma 2.2.4. Let P, Q be probability measures with densities p, q with respect to a base measure
µ and f (t) = 21 |t − 1|. Then
∥P − Q∥TV = Df (P ||Q) = (1/2) ∫ |p(x) − q(x)| dµ(x)
    = ∫ [p(x) − q(x)]+ dµ(x) = ∫ [q(x) − p(x)]+ dµ(x).
Considering the last integral ∫ [q(x) − p(x)]+ dµ(x), we see that for any set B ⊂ X , the set A = {x : q(x) > p(x)} satisfies

Q(A) − P (A) = ∫_A (q(x) − p(x))dµ(x) ≥ ∫_B (q(x) − p(x))dµ(x) = Q(B) − P (B).
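These equivalent expressions are easy to confirm numerically. The sketch below (arbitrary pmfs) computes the supremum over all subsets by brute force and compares it with the other formulas in Lemma 2.2.4:

```python
from itertools import combinations

p = [0.4, 0.1, 0.3, 0.2]
q = [0.1, 0.3, 0.3, 0.3]
n = len(p)

# sup_A (P(A) - Q(A)) by enumerating every subset A of the support
subsets = [s for r in range(n + 1) for s in combinations(range(n), r)]
tv_sup = max(sum(p[i] - q[i] for i in s) for s in subsets)

tv_l1 = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
tv_pos = sum(max(pi - qi, 0.0) for pi, qi in zip(p, q))
tv_neg = sum(max(qi - pi, 0.0) for pi, qi in zip(p, q))

assert abs(tv_sup - tv_l1) < 1e-12
assert abs(tv_pos - tv_l1) < 1e-12 and abs(tv_neg - tv_l1) < 1e-12
```

The brute-force supremum is of course only feasible for tiny supports; the L1 formula is what one uses in practice.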
Example 2.2.5 (Hellinger distance): The Hellinger distance between probability distributions P and Q defined on a set X is generated by the function f (t) = (√t − 1)^2 = t − 2√t + 1. The Hellinger distance is then

dhel^2 (P, Q) := (1/2) ∫ (√p(x) − √q(x))^2 dµ(x).     (2.2.7)
The non-squared version dhel (P, Q) is indeed a distance between probability measures P and
Q. It is sometimes convenient to rewrite the Hellinger distance in terms of the affinity between
P and Q, as
dhel^2 (P, Q) = (1/2) ∫ (p(x) + q(x) − 2√(p(x)q(x))) dµ(x) = 1 − ∫ √(p(x)q(x)) dµ(x),     (2.2.8)
which makes clear that dhel (P, Q) ∈ [0, 1] is on roughly the same scale as the variation distance;
we will say more later. 3
Example 2.2.6 (χ2 -divergence): The χ2 -divergence is generated by taking f (t) = (t − 1)^2 , so that

Dχ2 (P ||Q) := ∫ (p(x)/q(x) − 1)^2 q(x) dµ(x) = ∫ [p(x)^2 /q(x)] dµ(x) − 1,     (2.2.9)

where the equality is immediate because ∫ p dµ = ∫ q dµ = 1. 3
Proposition 2.2.7. The total variation distance and Hellinger distance satisfy

dhel^2 (P, Q) ≤ ∥P − Q∥TV ≤ dhel (P, Q) √(2 − dhel^2 (P, Q)).
Proof    We begin with the upper bound. We have by Hölder's inequality that

(1/2) ∫ |p(x) − q(x)| dµ(x) = (1/2) ∫ |√p(x) − √q(x)| · |√p(x) + √q(x)| dµ(x)
    ≤ ( (1/2) ∫ (√p(x) − √q(x))^2 dµ(x) )^{1/2} ( (1/2) ∫ (√p(x) + √q(x))^2 dµ(x) )^{1/2}
    = dhel (P, Q) ( 1 + ∫ √(p(x)q(x)) dµ(x) )^{1/2} .

As in Example 2.2.5, we have ∫ √(p(x)q(x)) dµ(x) = 1 − dhel (P, Q)^2 , so this (along with the representation in Lemma 2.2.4 for the variation distance) implies

∥P − Q∥TV = (1/2) ∫ |p(x) − q(x)| dµ(x) ≤ dhel (P, Q) (2 − dhel^2 (P, Q))^{1/2} .
For the lower bound on total variation, note that for any a, b ∈ R+ , we have a + b − 2√(ab) ≤ |a − b| (check the cases a > b and a < b separately); thus

dhel^2 (P, Q) = (1/2) ∫ [ p(x) + q(x) − 2√(p(x)q(x)) ] dµ(x) ≤ (1/2) ∫ |p(x) − q(x)| dµ(x),

as desired.
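A quick randomized check of Proposition 2.2.7 (a sketch; the distributions are random pmfs on five points):

```python
import math
import random

def hellinger_sq(p, q):
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_pmf(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(0)
for _ in range(1000):
    p, q = random_pmf(5, rng), random_pmf(5, rng)
    h2, t = hellinger_sq(p, q), tv(p, q)
    assert h2 <= t + 1e-12                          # lower bound
    assert t <= math.sqrt(h2 * (2.0 - h2)) + 1e-12  # upper bound
```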
Several important inequalities relate the variation distance to the KL-divergence. We state two important inequalities in the next proposition, both of which are important enough to justify their own names.
Proposition 2.2.8. The total variation distance satisfies the following relationships.
(a) Pinsker's inequality: for any distributions P and Q,

∥P − Q∥TV^2 ≤ (1/2) Dkl (P ||Q) .     (2.2.10)

(b) The Bretagnolle–Huber inequality: for any distributions P and Q,

∥P − Q∥TV ≤ √(1 − exp(−Dkl (P ||Q))) ≤ 1 − (1/2) exp(−Dkl (P ||Q)).
Proof    Exercise 2.19 outlines one proof of Pinsker's inequality using the data processing inequality (Proposition 2.2.13). We present an alternative via the Cauchy-Schwarz inequality. Using the definition (2.2.1) of the KL-divergence, we may assume without loss of generality that P and Q are finitely supported, say with p.m.f.s p1 , . . . , pm and q1 , . . . , qm . Define the negative entropy function h(p) = Σ_{i=1}^m pi log pi . Then showing that Dkl (P ||Q) ≥ 2 ∥P − Q∥TV^2 = (1/2) ∥p − q∥1^2 is equivalent to showing that

h(p) ≥ h(q) + ⟨∇h(q), p − q⟩ + (1/2) ∥p − q∥1^2 ,     (2.2.11)

because by inspection h(p) − h(q) − ⟨∇h(q), p − q⟩ = Σ_i pi log(pi /qi ). We do this via a Taylor expansion: we have

∇h(p) = [log pi + 1]_{i=1}^m and ∇^2 h(p) = diag([1/pi ]_{i=1}^m ).
By Taylor's theorem, there is some p̃ = (1 − t)p + tq, where t ∈ [0, 1], such that

h(p) = h(q) + ⟨∇h(q), p − q⟩ + (1/2) ⟨p − q, ∇^2 h(p̃)(p − q)⟩.

But looking at the final quadratic, we have for any vector v and any p ≥ 0 satisfying Σ_i pi = 1,

⟨v, ∇^2 h(p)v⟩ = Σ_{i=1}^m vi^2 /pi = ∥p∥1 Σ_{i=1}^m vi^2 /pi ≥ ( Σ_{i=1}^m √pi · |vi |/√pi )^2 = ∥v∥1^2 ,

where the inequality follows from Cauchy-Schwarz applied to the vectors [√pi ]i and [|vi |/√pi ]i . Thus inequality (2.2.11) holds.
For the claim (b), we use Proposition 2.2.7. Let a = ∫ √(p(x)q(x)) dµ(x) be a shorthand for the affinity, so that dhel^2 (P, Q) = 1 − a. Then Proposition 2.2.7 gives ∥P − Q∥TV ≤ √(1 − a) √(1 + a) = √(1 − a^2 ). Now apply Jensen's inequality to the exponential: we have

∫ √(p(x)q(x)) dµ(x) = ∫ √(q(x)/p(x)) p(x)dµ(x) = ∫ exp( (1/2) log (q(x)/p(x)) ) p(x)dµ(x)
    ≥ exp( (1/2) ∫ p(x) log (q(x)/p(x)) dµ(x) ) = exp( −(1/2) Dkl (P ||Q) ).

In particular, √(1 − a^2 ) ≤ √(1 − exp(−(1/2) Dkl (P ||Q))^2 ) = √(1 − exp(−Dkl (P ||Q))), which is the first claim of part (b). For the second, note that √(1 − c) ≤ 1 − c/2 for c ∈ [0, 1] by concavity of the square root.
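Pinsker's inequality is likewise easy to probe numerically; the following sketch checks it on randomly drawn pmfs (support size and trial count are arbitrary choices):

```python
import math
import random

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

rng = random.Random(1)
for _ in range(2000):
    raw_p = [rng.random() + 1e-3 for _ in range(4)]
    raw_q = [rng.random() + 1e-3 for _ in range(4)]
    p = [x / sum(raw_p) for x in raw_p]
    q = [x / sum(raw_q) for x in raw_q]
    assert tv(p, q) ** 2 <= 0.5 * kl(p, q) + 1e-12  # Pinsker, inequality (2.2.10)
```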
We also have the following bounds on the Hellinger distance in terms of the KL-divergence, and on the KL-divergence in terms of the χ2 -divergence.

Proposition 2.2.9. For any distributions P and Q,

2 dhel^2 (P, Q) ≤ Dkl (P ||Q) ≤ log (1 + Dχ2 (P ||Q)) ≤ Dχ2 (P ||Q) .
Proof    For the first inequality, note that log x ≤ x − 1 by concavity, or 1 − x ≤ − log x, so that

2 dhel^2 (P, Q) = 2 − 2 ∫ √(p(x)q(x)) dµ(x) = 2 ∫ p(x) ( 1 − √(q(x)/p(x)) ) dµ(x)
    ≤ 2 ∫ p(x) log √(p(x)/q(x)) dµ(x) = Dkl (P ||Q) .

For the second inequality, Jensen's inequality gives

Dkl (P ||Q) = ∫ log (dP/dQ) dP ≤ log ∫ (dP/dQ)^2 dQ = log(1 + Dχ2 (P ||Q)).

The last inequality is immediate as log(1 + t) ≤ t for all t > −1.
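These inequalities can also be verified numerically; a randomized sketch:

```python
import math
import random

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger_sq(p, q):
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def chi_sq(p, q):
    return sum(pi * pi / qi for pi, qi in zip(p, q)) - 1.0

rng = random.Random(2)
for _ in range(2000):
    raw_p = [rng.random() + 1e-3 for _ in range(4)]
    raw_q = [rng.random() + 1e-3 for _ in range(4)]
    p = [x / sum(raw_p) for x in raw_p]
    q = [x / sum(raw_q) for x in raw_q]
    assert 2 * hellinger_sq(p, q) <= kl(p, q) + 1e-12
    assert kl(p, q) <= math.log(1.0 + chi_sq(p, q)) + 1e-12
```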
It is also possible to relate mutual information between distributions to f -divergences, and even
to bound the mutual information above and below by the Hellinger distance for certain problems. In
this case, we consider the following situation: let V ∈ {0, 1} be uniform at random, and conditional on V = v, draw X ∼ Pv for some distribution Pv on a space X . Then we have that

I(X; V ) = (1/2) Dkl (P0 ||P̄ ) + (1/2) Dkl (P1 ||P̄ ),

where P̄ = (1/2) P0 + (1/2) P1 . The divergence measure on the right side of the preceding identity is a
special case of the Jensen-Shannon divergence, defined for λ ∈ [0, 1] by

Djs,λ (P ||Q) := λ Dkl (P ||λP + (1 − λ)Q) + (1 − λ) Dkl (Q||λP + (1 − λ)Q) ,     (2.2.12)

which is a symmetrized and bounded variant of the typical KL-divergence (we use the shorthand Djs (P ||Q) := Djs,1/2 (P ||Q) for the symmetric case). As a consequence, we also have
I(X; V ) = (1/2) Df (P0 ||P1 ) + (1/2) Df (P1 ||P0 ) ,

where f (t) = −t log ( 1/(2t) + 1/2 ) = t log ( 2t/(t + 1) ), so that the mutual information is a particular f -divergence.
This form—as we see in the later chapters—is frequently convenient because it gives an object
with similar tensorization properties to KL-divergence while enjoying the boundedness properties
of Hellinger and variation distances. The following proposition captures the latter properties.
Proposition 2.2.10. Let (X, V ) be distributed as above. Then

log 2 · dhel^2 (P0 , P1 ) ≤ I(X; V ) = Djs (P0 ||P1 ) ≤ min { 2 log 2 · ∥P0 − P1 ∥TV , 2 dhel^2 (P0 , P1 ) } .
Proof The lower bound and upper bound involving the variation distance both follow from
analytic bounds on the binary entropy functional h2 (p) = −p log p−(1−p) log(1−p). By expanding
the mutual information and letting p0 and p1 be densities of P0 and P1 with respect to some base
measure µ, we have
2 I(X; V ) = 2 Djs (P0 ||P1 ) = ∫ p0 log [2p0 /(p0 + p1 )] dµ + ∫ p1 log [2p1 /(p0 + p1 )] dµ
    = 2 log 2 + ∫ (p0 + p1 ) [ (p0 /(p0 + p1 )) log (p0 /(p0 + p1 )) + (p1 /(p0 + p1 )) log (p1 /(p0 + p1 )) ] dµ
    = 2 log 2 − ∫ (p0 + p1 ) h2 ( p0 /(p0 + p1 ) ) dµ.
We claim that

2 log 2 · min{p, 1 − p} ≤ h2 (p) ≤ 2 log 2 · √(p(1 − p))

for all p ∈ [0, 1] (see Exercises 2.17 and 2.18). Then the upper and lower bounds on the information become nearly immediate.
For the variation-based upper bound on I(X; V ), we use the lower bound h2 (p) ≥ 2 log 2 · min{p, 1 − p} to write

(2/log 2) I(X; V ) ≤ 2 − 2 ∫ (p0 (x) + p1 (x)) min { p0 (x)/(p0 (x) + p1 (x)) , p1 (x)/(p0 (x) + p1 (x)) } dµ(x)
    = 2 − 2 ∫ min{p0 (x), p1 (x)} dµ(x)
    = 2 ∫ (p1 (x) − min{p0 (x), p1 (x)}) dµ(x) = 2 ∫_{p1 >p0 } (p1 (x) − p0 (x)) dµ(x).
But of course the final integral is ∥P1 − P0 ∥TV , giving I(X; V ) ≤ log 2 · ∥P0 − P1 ∥TV . Conversely, for the lower bound on Djs (P0 ||P1 ), we use the upper bound h2 (p) ≤ 2 log 2 · √(p(1 − p)) to obtain

(1/log 2) I(X; V ) ≥ 1 − ∫ (p0 + p1 ) √( (p0 /(p0 + p1 )) (1 − p0 /(p0 + p1 )) ) dµ
    = 1 − ∫ √(p0 p1 ) dµ = (1/2) ∫ (√p0 − √p1 )^2 dµ = dhel^2 (P0 , P1 ),

as desired.
The Hellinger-based upper bound is simpler: by Proposition 2.2.9, we have

Djs (P0 ||P1 ) = (1/2) Dkl (P0 ||(P0 + P1 )/2) + (1/2) Dkl (P1 ||(P0 + P1 )/2)
    ≤ (1/2) Dχ2 (P0 ||(P0 + P1 )/2) + (1/2) Dχ2 (P1 ||(P0 + P1 )/2)
    = (1/2) ∫ (p0 − p1 )^2 /(p0 + p1 ) dµ = (1/2) ∫ (√p0 − √p1 )^2 (√p0 + √p1 )^2 /(p0 + p1 ) dµ.

Now note that (a + b)^2 ≤ 2a^2 + 2b^2 for any a, b ∈ R, and so (√p0 + √p1 )^2 ≤ 2(p0 + p1 ), and thus the final integral has bound ∫ (√p0 − √p1 )^2 dµ = 2 dhel^2 (P0 , P1 ).
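Numerically, the bounds of Proposition 2.2.10 look as follows (a randomized sketch; the trial setup is an arbitrary choice):

```python
import math
import random

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = random.Random(3)
log2 = math.log(2)
for _ in range(1000):
    raw_p = [rng.random() + 1e-3 for _ in range(5)]
    raw_q = [rng.random() + 1e-3 for _ in range(5)]
    p = [x / sum(raw_p) for x in raw_p]
    q = [x / sum(raw_q) for x in raw_q]
    d = js(p, q)
    assert log2 * hellinger_sq(p, q) <= d + 1e-12
    assert d <= 2 * log2 * tv(p, q) + 1e-12
    assert d <= 2 * hellinger_sq(p, q) + 1e-12
```

Note that js here is bounded by log 2, unlike the KL-divergence, which is the boundedness property the text emphasizes.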
Proposition 2.2.11. Let λ ∈ [0, 1] and let P1 , P2 , Q1 , Q2 be probability distributions on X . Then

Df (λP1 + (1 − λ)P2 ||λQ1 + (1 − λ)Q2 ) ≤ λDf (P1 ||Q1 ) + (1 − λ)Df (P2 ||Q2 ) .
The proof of this proposition we leave as Exercise 2.11, which we treat as a consequence of the
more general “log-sum” like inequalities of Exercise 2.8. It is, however, an immediate consequence
of the fully specified definition (2.2.5) of an f -divergence, because pers(f ) is jointly convex. As an
immediate corollary, we see that the same result is true for KL-divergence as well.
Corollary 2.2.12. The KL-divergence Dkl (P ||Q) is jointly convex in its arguments P and Q.
We can also provide more general data processing inequalities for f -divergences, paralleling
those for the KL-divergence. In this case, we consider random variables X and Z on spaces X
and Z, respectively, and a Markov transition kernel K giving the Markov chain X → Z. That
is, K(· | x) is a probability distribution on Z for each x ∈ X , and conditioned on X = x, Z has
distribution K(· | x) so that K(A | x) = P(Z ∈ A | X = x). Certainly, this includes the situation
when Z = ϕ(X) for some function ϕ, and more generally when Z = ϕ(X, U ) for a function ϕ and
some additional randomness U . For a distribution P on X , we then define the marginal distribution

KP (A) := ∫_X K(A | x) dP (x).
Proposition 2.2.13. Let P and Q be distributions on X and let K be any Markov kernel. Then

Df (KP ||KQ) ≤ Df (P ||Q) .
Thus, further processing of random variables can only bring them “closer” in the space of distribu-
tions; downstream processing of signals cannot make them further apart as distributions.
We can give an exact expression for the minimal possible error in the above hypothesis test.
Indeed, a standard result of Le Cam (see [134, 194, Lemma 1]) is the following variational representa-
tion of the total variation distance (2.2.6), which is the f -divergence associated with f (t) = 12 |t − 1|,
as a function of testing error.
Proposition 2.3.1. Let X be an arbitrary set. For any distributions P1 and P2 on X , we have

inf_Ψ {P1 (Ψ ̸= 1) + P2 (Ψ ̸= 2)} = 1 − ∥P1 − P2 ∥TV ,

the infimum taken over all tests Ψ : X → {1, 2}.
Proof Any test Ψ : X → {1, 2} has an acceptance region, call it A ⊂ X , where it outputs 1 and
a region Ac where it outputs 2.
inf_Ψ {P1 (Ψ ̸= 1) + P2 (Ψ ̸= 2)} = inf_{A⊂X } {1 − (P1 (A) − P2 (A))} = 1 − sup_{A⊂X } (P1 (A) − P2 (A)) = 1 − ∥P1 − P2 ∥TV .
In the two-hypothesis case, we also know that the optimal test, by the Neyman-Pearson lemma,
is a likelihood ratio test. That is, assuming that P1 and P2 have densities p1 and p2 , the optimal
test is of the form
Ψ(X) = 1 if p1 (X)/p2 (X) ≥ t,
       2 if p1 (X)/p2 (X) < t,
for some threshold t ≥ 0. In the case that the prior probabilities on P1 and P2 are each 21 , then
t = 1 is optimal.
We give one example application of Proposition 2.3.1 to the problem of testing a normal mean.
Example 2.3.2 (Testing a normal mean): Suppose we observe X1 , . . . , Xn drawn i.i.d. from P for P = P1
or P = P2 , where Pv is the normal distribution N(µv , σ 2 ), where µ1 ̸= µ2 . We would like to
understand the sample size n necessary to guarantee that no test can have small error, that
is, say, that
1
inf {P1 (Ψ(X1 , . . . , Xn ) ̸= 1) + P2 (Ψ(X1 , . . . , Xn ) ̸= 2)} ≥ .
Ψ 2
By Proposition 2.3.1, we have that

inf_Ψ {P1^n (Ψ ̸= 1) + P2^n (Ψ ̸= 2)} = 1 − ∥P1^n − P2^n ∥TV ,

where Pv^n denotes the n-fold product of Pv , that is, the distribution of X1 , . . . , Xn drawn i.i.d. from Pv .
The interaction between total variation distance and product distributions is somewhat
subtle, so it is often advisable to use a divergence measure more attuned to the i.i.d. nature
of the sampling scheme. Two such measures are the KL-divergence and Hellinger distance,
both of which we explore in the coming chapters. With that in mind, we apply Pinsker’s
inequality (2.2.10) to see that ∥P1^n − P2^n ∥TV^2 ≤ (1/2) Dkl (P1^n ||P2^n ) = (n/2) Dkl (P1 ||P2 ), which implies that

1 − ∥P1^n − P2^n ∥TV ≥ 1 − √( (n/2) Dkl (P1 ||P2 ) ) = 1 − √( (n/2) · (µ1 − µ2 )^2 /(2σ^2 ) ) = 1 − (√n |µ1 − µ2 |)/(2σ),

since Dkl (P1 ||P2 ) = (µ1 − µ2 )^2 /(2σ^2 ) for the two normal distributions. In particular, if n ≤ σ^2 /(µ1 − µ2 )^2 , then we have our desired lower bound of 1/2.
Conversely, a calculation yields that n ≥ Cσ^2 /(µ1 − µ2 )^2 , for some numerical constant C ≥ 1, implies small probability of error. We leave this calculation to the reader. 3
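The key fact used above, Dkl (P1 ||P2 ) = (µ1 − µ2 )^2 /(2σ^2 ) for two normals with common variance, can be confirmed by direct numerical integration (a sketch with arbitrary parameter choices):

```python
import math

mu1, mu2, sigma = 0.0, 1.0, 2.0

def npdf(x, mu, s):
    return math.exp(-((x - mu) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

# Riemann-sum approximation of the KL integral over a wide grid
dx = 1e-3
kl_num = sum(
    npdf(x, mu1, sigma) * math.log(npdf(x, mu1, sigma) / npdf(x, mu2, sigma)) * dx
    for x in (-30.0 + i * dx for i in range(int(60 / dx)))
)
kl_closed = (mu1 - mu2) ** 2 / (2 * sigma ** 2)
assert abs(kl_num - kl_closed) < 1e-6
```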
and we wish to provide lower bounds on the probability of error—that is, of the event X̂ ̸= X. If we let the function h2 (p) = −p log p − (1 − p) log(1 − p) denote the binary entropy (the entropy of a Bernoulli random variable with parameter p), Fano's inequality takes the following form [e.g. 57, Chapter 2]:
Proposition 2.3.3 (Fano inequality). For any Markov chain X → Y → X̂, we have

h2 (P(X̂ ̸= X)) + P(X̂ ̸= X) log(|X | − 1) ≥ H(X | X̂).     (2.3.1)
Proof    This proof follows by expanding an entropy functional in two different ways. Let E be the indicator for the event that X̂ ̸= X, that is, E = 1 if X̂ ̸= X and E = 0 otherwise. Then we have

H(X, E | X̂) = H(X | E, X̂) + H(E | X̂)
             = P(E = 1) H(X | E = 1, X̂) + P(E = 0) H(X | E = 0, X̂) + H(E | X̂),

where H(X | E = 0, X̂) = 0 because given that there is no error, X has no variability given X̂. Expanding the entropy by the chain rule in a different order, we have

H(X, E | X̂) = H(X | X̂) + H(E | X̂, X),

where H(E | X̂, X) = 0 because E is a function of X and X̂. Combining the two expansions yields

H(X | X̂) = H(X, E | X̂) = P(E = 1) H(X | E = 1, X̂) + H(E | X̂).

Noting that H(E | X̂) ≤ H(E) = h2 (P(E = 1)), as conditioning reduces entropy, and that H(X | E = 1, X̂) ≤ log(|X | − 1), as X can take on at most |X | − 1 values when there is an error, gives the result.
Combining Fano's inequality with the data processing inequality yields the following corollary.

Corollary 2.3.4. Suppose that X is uniform on the finite set X and that X → Y → X̂ forms a Markov chain. Then

P(X̂ ̸= X) ≥ 1 − (I(X; Y ) + log 2)/log(|X |).     (2.3.2)
Proof    Let Perror = P(X̂ ̸= X) denote the probability of error. Noting that h2 (p) ≤ log 2 for any p ∈ [0, 1] (recall inequality (2.1.2), that is, that uniform random variables maximize entropy), we have

log 2 + Perror log(|X |) ≥ h2 (Perror ) + Perror log(|X | − 1) ≥(i) H(X | X̂) =(ii) H(X) − I(X; X̂).

Here step (i) uses Proposition 2.3.3 and step (ii) uses the definition of mutual information, that is, that I(X; X̂) = H(X) − H(X | X̂). The data processing inequality implies that I(X; X̂) ≤ I(X; Y ), and using H(X) = log(|X |) completes the proof.
In particular, Corollary 2.3.4 shows that when X is chosen uniformly at random and we observe Y , we have

inf_Ψ P(Ψ(Y ) ̸= X) ≥ 1 − (I(X; Y ) + log 2)/log |X | ,

where the infimum is taken over all testing procedures Ψ. Some interpretation of this quantity
is helpful. If we think roughly of the number of bits it takes to describe a variable X uniformly chosen from X , then we expect that log2 |X | bits are necessary (and sufficient). Thus, until we collect enough information that I(X; Y ) ≈ log |X |, so that I(X; Y )/ log |X | ≈ 1, we are unlikely to be able to identify the variable X with any substantial probability. So we must collect enough bits to actually discover X.
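Fano's inequality itself is a statement about any joint distribution of (X, X̂) and is easy to check numerically; a randomized sketch:

```python
import math
import random

def h2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

rng = random.Random(4)
m = 4
for _ in range(500):
    # random joint pmf of (X, Xhat) on {0, ..., m-1}^2
    w = [[rng.random() for _ in range(m)] for _ in range(m)]
    s = sum(map(sum, w))
    joint = [[wij / s for wij in row] for row in w]
    p_err = sum(joint[i][j] for i in range(m) for j in range(m) if i != j)
    # H(X | Xhat), with rows indexing X and columns indexing Xhat
    h_cond = sum(
        joint[i][j] * math.log(sum(joint[k][j] for k in range(m)) / joint[i][j])
        for i in range(m) for j in range(m)
    )
    assert h2(p_err) + p_err * math.log(m - 1) >= h_cond - 1e-9  # inequality (2.3.1)
```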
Example 2.3.5 (20 questions game): In the 20 questions game—a standard children’s game—
there are two players, the “chooser” and the “guesser,” and an agreed upon universe X . The
chooser picks an element x ∈ X , and the guesser’s goal is to find x by using a series of yes/no
questions about x. We consider optimal strategies for each player in this game, assuming that
X is finite and letting m = |X | be the universe size for shorthand.
For the guesser, it is clear that at most ⌈log2 m⌉ questions suffice to find the item x that the chooser has picked—at each round of the game, the guesser asks a question that eliminates half of the remaining possible items. Indeed, let us assume that m = 2^l for some l ∈ N; if not, the guesser can always make her task more difficult by increasing the size of X until it is a power of 2. Thus, after k rounds, there are m 2^{−k} items left, and we have

m (1/2)^k ≤ 1 if and only if k ≥ log2 m.
For the chooser, suppose X is chosen uniformly from X and let Y1 , . . . , Yk denote the answers to the guesser's questions. Then Corollary 2.3.4 shows that any guess X̂ based on the answers satisfies

P(X̂ ̸= X) ≥ 1 − (I(X; Y1 , . . . , Yk ) + log 2)/log m .
By the chain rule for mutual information, we have

I(X; Y1 , . . . , Yk ) = Σ_{i=1}^k I(X; Yi | Y1:i−1 ) = Σ_{i=1}^k [H(Yi | Y1:i−1 ) − H(Yi | Y1:i−1 , X)] ≤ Σ_{i=1}^k H(Yi ).
As the answers Yi are yes/no, we have H(Yi ) ≤ log 2, so that I(X; Y1:k ) ≤ k log 2. Thus we find

P(X̂ ̸= X) ≥ 1 − ((k + 1) log 2)/log m = (log2 m − 1 − k)/log2 m ,

so that the guesser must have k ≥ log2 (m/2) to be guaranteed that she will make no mistakes. 3
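The matching strategy for the guesser is just bisection; a minimal sketch (the universe {0, . . . , m − 1} and the threshold questions are an illustrative choice):

```python
import math

def guess(x, m):
    """Find x in {0, ..., m-1} by asking "is the item <= mid?" questions."""
    lo, hi, questions = 0, m - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if x <= mid:
            hi = mid
        else:
            lo = mid + 1
    return lo, questions

m = 1000
assert all(guess(x, m)[0] == x for x in range(m))
worst = max(guess(x, m)[1] for x in range(m))
assert worst == math.ceil(math.log2(m))  # 10 questions when m = 1000
```

The worst case matches the ⌈log2 m⌉ bound, so the Fano lower bound above is tight up to rounding.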
While more useful (generally) than simply non-singular codes, uniquely decodable codes may require
inspection of an entire string before recovering the first element. With that in mind, we now consider
the easiest to use codes, which can always be decoded instantaneously.
As is hopefully apparent from the definitions, all prefix/instantaneous codes are uniquely decodable, and uniquely decodable codes are in turn non-singular. The converse is not true, though we will see a sense in which—as long as we care only about encoding sequences—using prefix instead of uniquely decodable codes has negligible consequences.
For example, written English, with periods (.) and spaces ( ) included at the ends of words
(among other punctuation) is an instantaneous encoding of English into the symbols of the alphabet
and punctuation, as punctuation symbols enforce that no “codeword” is a prefix of any other. A
few more concrete examples may make things more clear.
Example 2.4.1 (Encoding strategies): Consider the encoding schemes below, which encode
the letters a, b, c, and d.
Symbol C1 (x) C2 (x) C3 (x)
a 0 00 0
b 00 10 10
c 000 11 110
d 0000 110 111
By inspection, it is clear that C1 is non-singular but certainly not uniquely decodable (does
the sequence 0000 correspond to aaaa, bb, aab, aba, baa, ca, ac, or d?), while C3 is a prefix
code. We leave showing that C2 is uniquely decodable as an exercise. 3
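One can make the ambiguity of C1 and the prefix property of C3 concrete with a small script (a sketch; the recursive parser is naive and only suitable for short strings):

```python
C1 = {"a": "0", "b": "00", "c": "000", "d": "0000"}
C3 = {"a": "0", "b": "10", "c": "110", "d": "111"}

def decodings(code, s):
    """All ways to parse s into codewords; more than one means not uniquely decodable."""
    if s == "":
        return [[]]
    out = []
    for sym, w in code.items():
        if s.startswith(w):
            out += [[sym] + rest for rest in decodings(code, s[len(w):])]
    return out

def is_prefix_free(code):
    words = list(code.values())
    return not any(u != w and w.startswith(u) for u in words for w in words)

assert len(decodings(C1, "0000")) == 8  # aaaa, aab, aba, baa, bb, ac, ca, d
assert is_prefix_free(C3)
assert not is_prefix_free(C1)
assert len(decodings(C3, "110100")) == 1  # prefix codes parse unambiguously
```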
Theorem 2.4.2. Let X be a finite or countable set, and let ℓ : X → N be a function. If ℓ(x) is the length of the encoding of the symbol x in a uniquely decodable d-ary code, then

Σ_{x∈X } d^{−ℓ(x)} ≤ 1.     (2.4.1)
Conversely, given any function ℓ : X → N satisfying inequality (2.4.1), there is a prefix code whose
codewords have length ℓ(x) for each x ∈ X .
Proof    We prove the first statement of the theorem first by a counting and asymptotic argument.
We begin by assuming that X is finite; we eliminate this assumption subsequently. As a consequence, there is some maximum length ℓmax such that ℓ(x) ≤ ℓmax for all x ∈ X . For a sequence x1 , . . . , xn ∈ X , we have by the definition of our encoding strategy that ℓ(x1 , . . . , xn ) = Σ_{i=1}^n ℓ(xi ). In addition, for each m we let En (m) ⊂ X^n
Figure 2.1. Prefix-tree encoding of a set of symbols. The encoding for x1 is 0, for x2 is 10, for x3
is 11, for x4 is 12, for x5 is 20, for x6 is 21, and nothing is encoded as 1, 2, or 22.
denote the sequences x1:n encoded with codewords of length m in our code; as the code is uniquely decodable we certainly have card(En (m)) ≤ d^m for all n and m. Moreover, for all x1:n ∈ X^n we have ℓ(x1:n ) ≤ nℓmax . We thus re-index the sum Σ_x d^{−ℓ(x)} and compute
( Σ_{x∈X } d^{−ℓ(x)} )^n = Σ_{x1:n ∈X^n } d^{−ℓ(x1:n )} = Σ_{m=1}^{nℓmax} card(En (m)) d^{−m} ≤ Σ_{m=1}^{nℓmax} d^m d^{−m} = nℓmax .

Taking nth roots gives Σ_{x∈X } d^{−ℓ(x)} ≤ (nℓmax )^{1/n} → 1 as n → ∞, which proves the claim for finite X .
When X is countably infinite, define Dk := Σ_{x : ℓ(x)≤k} d^{−ℓ(x)} ; as each subset {x ∈ X : ℓ(x) ≤ k} is uniquely decodable, we have Dk ≤ 1 for all k. Then 1 ≥ limk→∞ Dk = Σ_{x∈X } d^{−ℓ(x)} .
The achievability of such a code intuitively follows by a pictorial argument (recall Figure 2.1),
so we first sketch the result non-rigorously. Indeed, let Td be an (infinite) d-ary tree. Then, at each
level m of the tree, assign one of the nodes at that level to each symbol x ∈ X such that ℓ(x) = m.
Eliminate the subtree below that node, and repeat with the remaining symbols. The codeword
corresponding to symbol x is then the path to the symbol in the tree.
A more formal version implementing this sketch follows. Let ℓ be a length function satisfying Σ_{x∈X } d^{−ℓ(x)} ≤ 1. Identify X with N (or a subset thereof) in such a way that 1 ≤ ℓ(1) ≤ ℓ(2) ≤ . . ., i.e., ℓ(x) ≤ ℓ(y) whenever x < y, and let Xm = {x ∈ X | ℓ(x) = m} be the set of inputs with encoding length m. For each x ∈ N, define the value

v(x) = Σ_{i<x} d^{−ℓ(i)} .
We let the codeword C(x) for x be the first ℓ(x) terms in the d-ary expansion of v(x). Certainly the length of this encoding satisfies |C(x)| = ℓ(x). To see that it is prefix-free, take two symbols x < y, and assume for the sake of contradiction that C(x) is a prefix of C(y). Then v(y) ≥ v(x), while v(y) − v(x) < d^{−ℓ(x)} because the two expansions agree on the first ℓ(x) terms. But

v(y) − v(x) = Σ_{i<y} d^{−ℓ(i)} − Σ_{i<x} d^{−ℓ(i)} = Σ_{x≤i<y} d^{−ℓ(i)} = d^{−ℓ(x)} + Σ_{x<i<y} d^{−ℓ(i)} ≥ d^{−ℓ(x)} ,

a contradiction.
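The construction in the proof is short enough to implement directly. The sketch below computes v(x) exactly (with rational arithmetic) and takes the first ℓ(x) d-ary digits of its expansion; the example length function is an arbitrary choice satisfying (2.4.1):

```python
from fractions import Fraction

def kraft_code(lengths, d=2):
    """Prefix code from nondecreasing lengths with sum d**(-l) <= 1, via the proof's construction."""
    assert lengths == sorted(lengths)
    assert sum(Fraction(1, d ** l) for l in lengths) <= 1
    codewords = []
    v = Fraction(0)
    for l in lengths:
        digits, frac = [], v
        for _ in range(l):        # first l digits of the d-ary expansion of v
            frac *= d
            digit = int(frac)     # integer part is the next digit
            digits.append(str(digit))
            frac -= digit
        codewords.append("".join(digits))
        v += Fraction(1, d ** l)
    return codewords

codes = kraft_code([1, 2, 3, 3], d=2)
assert codes == ["0", "10", "110", "111"]
assert not any(i != j and codes[j].startswith(codes[i])
               for i in range(len(codes)) for j in range(len(codes)))
```

Exact rationals avoid the floating-point pitfalls that appear for non-dyadic bases such as d = 3.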
With the Kraft-McMillan theorem in place, we may directly relate the entropy of a random variable to the length of possible encodings for the variable; in particular, we show that the entropy is essentially the best possible code length of a uniquely decodable source code. In this theorem, we use the shorthand

Hd (X) := − Σ_{x∈X } p(x) logd p(x).
Theorem 2.4.3. Let X ∈ X be a discrete random variable distributed according to P and let ℓC be the length function associated with a d-ary encoding C : X → {0, . . . , d − 1}∗ . In addition, let C be the set of all uniquely decodable d-ary codes for X . Then

Hd (X) ≤ inf_{C∈C} EP [ℓC (X)] ≤ Hd (X) + 1.
Proof The lower bound is an argument by convex optimization, while for the upper bound
we give an explicit length function and (implicit) prefix code attaining the bound. For the lower
bound, we assume for simplicity that X is finite, and we identify X = {1, . . . , |X |} (let m = |X | for
shorthand). Then as C consists of uniquely decodable codebooks, all the associated length functions
must satisfy the Kraft-McMillan inequality (2.4.1). Letting ℓi = ℓ(i), the minimal encoding length
is at least

inf_{ℓ∈R^m } { Σ_{i=1}^m pi ℓi : Σ_{i=1}^m d^{−ℓi } ≤ 1 } .
By introducing the Lagrange multiplier λ ≥ 0 for the inequality constraint, we may write the Lagrangian for the preceding minimization problem as

L(ℓ, λ) = p⊤ ℓ + λ ( Σ_{i=1}^m d^{−ℓi } − 1 ) with ∇ℓ L(ℓ, λ) = p − λ [ d^{−ℓi } log d ]_{i=1}^m .
In particular, the optimal ℓ satisfies ℓi = logd (θ/pi ) for some constant θ, and solving Σ_{i=1}^m d^{− logd (θ/pi )} = 1 gives θ = 1 and ℓ(i) = logd (1/pi ).
To attain the result, simply set the encoding lengths to ℓ(x) = ⌈ logd (1/P (X = x)) ⌉, which satisfies the Kraft-McMillan inequality and thus yields a valid prefix code with

EP [ℓ(X)] = Σ_{x∈X } p(x) ⌈ logd (1/p(x)) ⌉ ≤ − Σ_{x∈X } p(x) logd p(x) + 1 = Hd (X) + 1,

as desired.
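A quick check of the two-sided bound with Shannon code lengths ⌈log2 (1/p(x))⌉ (a sketch with a random pmf; here d = 2, so entropy is measured in bits):

```python
import math
import random

rng = random.Random(5)
raw = [rng.random() + 0.05 for _ in range(6)]
p = [x / sum(raw) for x in raw]

H = -sum(pi * math.log2(pi) for pi in p)                # H_2(X), in bits
lengths = [math.ceil(math.log2(1.0 / pi)) for pi in p]  # Shannon code lengths
assert sum(2.0 ** -l for l in lengths) <= 1.0 + 1e-12   # Kraft-McMillan (2.4.1)
avg = sum(pi * li for pi, li in zip(p, lengths))
assert H - 1e-9 <= avg <= H + 1.0 + 1e-9                # Theorem 2.4.3
```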
Theorem 2.4.3 thus shows that, at least to within an additive constant of 1, the entropy both
upper and lower bounds the expected length of a uniquely decodable code for the random variable
X. This is the first of our promised “operational interpretations” of the entropy.
In some situations, the limit (2.4.2) may not exist. However, there are a variety of situations in
which it does, and we focus generally on a specific but common instance in which the limit does
exist. First, we recall the definition of a stationary sequence of random variables.
Definition 2.5. We say a sequence X1 , X2 , . . . of random variables is stationary if for all n and all k ∈ N and all measurable sets A1 , . . . , Ak ⊂ X we have

P(X1 ∈ A1 , . . . , Xk ∈ Ak ) = P(Xn+1 ∈ A1 , . . . , Xn+k ∈ Ak ).
Proposition 2.4.4. Let the sequence of random variables {Xi }, taking values in the discrete space
X , be stationary. Then
\[
H(\{X_i\}) = \lim_{n \to \infty} H(X_n \mid X_1, \ldots, X_{n-1}).
\]
Proof We begin by making the following standard observation about Cesàro means: if c_n = (1/n) Σ_{i=1}^n a_i and a_i → a, then c_n → a.³ Now, we note that for a stationary sequence, we have that
H(Xn | X1:n−1 ) = H(Xn+1 | X2:n ),
and using that conditioning decreases entropy, we have
H(Xn+1 | X1:n ) ≤ H(Xn | X1:n−1 ).
Thus the sequence a_n := H(X_n | X_{1:n−1}) is non-increasing and bounded below by 0, so that it has some limit lim_{n→∞} H(X_n | X_{1:n−1}). As H(X_1, . . . , X_n) = Σ_{i=1}^n H(X_i | X_{1:i−1}) by the chain rule for entropy, we achieve the result of the proposition.
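To make Proposition 2.4.4 concrete, here is a small numeric illustration with a hypothetical two-state Markov chain (not an example from the text): for a stationary Markov chain, H(X_n | X_{1:n−1}) = H(X_2 | X_1) for all n ≥ 2, so the Cesàro averages H(X_1, . . . , X_n)/n decrease monotonically toward that entropy rate.

```python
import math
from itertools import product

# Hypothetical stationary two-state Markov chain: the averages H(X_{1:n})/n
# approach the entropy rate H(X_2 | X_1) as n grows.
P = [[0.9, 0.1], [0.4, 0.6]]   # transition matrix
pi = [0.8, 0.2]                # stationary distribution: pi P = pi

def h(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

rate = sum(pi[i] * h(P[i]) for i in range(2))   # H(X_2 | X_1)

def joint_entropy(n):
    # enumerate all length-n paths and sum -p log2 p
    total = 0.0
    for seq in product(range(2), repeat=n):
        pr = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            pr *= P[a][b]
        if pr > 0:
            total -= pr * math.log2(pr)
    return total

avg = [joint_entropy(n) / n for n in (1, 4, 8, 12)]
gaps = [abs(a - rate) for a in avg]
assert all(b < a for a, b in zip(gaps, gaps[1:]))  # monotone approach to the rate
```

For this chain the gap H(X_{1:n})/n − rate equals (H(X_1) − rate)/n exactly, so the convergence is visibly of order 1/n.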
Finally, we present a result showing that it is possible to achieve average code length of at most
the entropy rate, which for stationary sequences is smaller than the entropy of any single random
variable Xi . To do so, we require the use of a block code, which (while it may be a prefix code) treats
sets of random variables (X1 , . . . , Xm ) ∈ X m as a single symbol to be jointly encoded.
Proposition 2.4.5. Let the sequence of random variables X1 , X2 , . . . be stationary. Then for any
ϵ > 0, there exists an m ∈ N and a d-ary (prefix) block encoder C : X m → {0, . . . , d − 1}∗ such that
\[
\lim_{n \to \infty} \frac{1}{n} \mathbb{E}_P[\ell_C(X_{1:n})] \le H(\{X_i\}) + \epsilon = \lim_{n \to \infty} H(X_n \mid X_1, \ldots, X_{n-1}) + \epsilon.
\]
Proof Let C : X^m → {0, 1, . . . , d − 1}^* be any prefix code with
\[
\ell_C(x_{1:m}) \le \bigg\lceil \log_d \frac{1}{P(X_{1:m} = x_{1:m})} \bigg\rceil.
\]
Then whenever n/m is an integer, we have
\[
\mathbb{E}_P[\ell_C(X_{1:n})]
= \sum_{i=0}^{n/m - 1} \mathbb{E}_P\big[\ell_C(X_{mi+1}, \ldots, X_{m(i+1)})\big]
\le \sum_{i=0}^{n/m - 1} \big( H(X_{mi+1}, \ldots, X_{m(i+1)}) + 1 \big)
= \frac{n}{m} + \frac{n}{m} H(X_1, \ldots, X_m).
\]
Dividing by n gives the result by taking m suitably large that \(\frac{1}{m} + \frac{1}{m} H(X_1, \ldots, X_m) \le \epsilon + H(\{X_i\})\).
Note that if m does not divide n, we may also encode the length of the sequence of encoded words in each block of length m; in particular, if the block begins with a 0, it encodes m symbols, while if it begins with a 1, then the next ⌈log_d m⌉ bits encode the length of the block. This would yield an increase in the expected length of the code to
\[
\mathbb{E}_P[\ell_C(X_{1:n})] \le \frac{2n + \lceil \log_2 m \rceil}{m} + \frac{n}{m} H(X_1, \ldots, X_m).
\]
Dividing by n and letting n → ∞ gives the result, as we can always choose m large.
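A quick numeric sketch of the rates in Proposition 2.4.5, for the simplest (hypothetical) case of an i.i.d. Bernoulli(0.1) source: a Shannon block code on m symbols spends at most H(X_1, . . . , X_m) + 1 = mH(X_1) + 1 bits per block, so the per-symbol cost is H(X_1) + 1/m, which approaches the entropy rate as the block length m grows.

```python
import math

# Per-symbol cost of Shannon block coding an i.i.d. Bernoulli(0.1) source:
# (m*H1 + 1)/m = H1 + 1/m bits per symbol, shrinking to the entropy rate H1.
p = 0.1
H1 = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # entropy rate, about 0.469

per_symbol = {m: (m * H1 + 1) / m for m in (1, 2, 10, 100)}

assert per_symbol[1] > per_symbol[2] > per_symbol[10] > per_symbol[100] > H1
assert abs(per_symbol[100] - (H1 + 0.01)) < 1e-9
```

Note that without blocking (m = 1) any code spends at least 1 bit per symbol, more than double the entropy rate here; the +1/m overhead is what the block code amortizes away.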
³Indeed, let ϵ > 0 and take N such that i ≥ N implies that |a_i − a| < ϵ. Then for n ≥ N, we have
\[
c_n - a = \frac{1}{n} \sum_{i=1}^n (a_i - a)
= \frac{N(c_N - a)}{n} + \frac{1}{n} \sum_{i=N+1}^n (a_i - a)
\in \frac{N(c_N - a)}{n} \pm \epsilon.
\]
Taking n → ∞ yields that the term N(c_N − a)/n → 0, which gives that c_n − a ∈ [−ϵ, ϵ] eventually for any ϵ > 0, which is our desired result.
2.5 Bibliography
The material in this chapter is classical in information theory. For all of our treatment of mutual
information, entropy, and KL-divergence in the discrete case, Cover and Thomas provide an essen-
tially complete treatment in Chapter 2 of their book [57]. Gray [104] provides a more advanced
(measure-theoretic) version of these results, with Chapter 5 covering most of our results (or Chapter 7 in the newer edition of the same book). Csiszár and Körner [61] is the classic reference for
coding theorems and results on communication, including stronger converse results.
The f -divergence was independently discovered by Ali and Silvey [6] and Csiszár [59], and is
consequently sometimes called an Ali-Silvey divergence or Csiszár divergence. Liese and Vajda [137]
provide a survey of f -divergences and their relationships with different statistical concepts (taking a
Bayesian point of view), and various authors have extended the pairwise divergence measures to di-
vergence measures between multiple distributions [107], making connections to experimental design
and classification [98, 76], which we investigate later in the book. The inequalities relating divergences
in Section 2.2.4 are now classical, and standard references present them [134, 182]. For a proof that
equality (2.2.4) is equivalent to the definition (2.2.3) with the appropriate closure operations, see
the paper [76, Proposition 1]. We borrow the proof of the upper bound in Proposition 2.2.10 from
the paper [138].
JCD Comment: Converse to Kraft is Chaitin?
2.6 Exercises
Our first few questions investigate properties of a divergence between distributions that is weaker
than the KL-divergence, but is intimately related to optimal testing. Let P1 and P2 be arbitrary
distributions on a space X . The total variation distance between P1 and P2 is defined as
\[
\|P_1 - P_2\|_{\mathrm{TV}} := \sup_{A \subset \mathcal{X}} |P_1(A) - P_2(A)|.
\]
Exercise 2.1: Prove the following identities about total variation. Throughout, let P1 and P2
have densities p1 and p2 on a (common) set X .
(a) 2∥P_1 − P_2∥_TV = ∫ |p_1(x) − p_2(x)| dx.

(b) For functions f : X → ℝ, define the supremum norm ∥f∥_∞ = sup_{x∈X} |f(x)|. Show that 2∥P_1 − P_2∥_TV = sup_{∥f∥_∞ ≤ 1} ∫_X f(x)(p_1(x) − p_2(x)) dx.

(c) ∥P_1 − P_2∥_TV = ∫ max{p_1(x), p_2(x)} dx − 1.

(d) ∥P_1 − P_2∥_TV = 1 − ∫ min{p_1(x), p_2(x)} dx.
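The four identities can also be checked numerically for discrete distributions (sums in place of integrals). The pmfs below are arbitrary illustrations, and for (b) we use the maximizing choice f(x) = sign(p_1(x) − p_2(x)):

```python
# Numeric check of the total variation identities of Exercise 2.1 for two
# hypothetical pmfs on a common 3-point set (sums replace the integrals).
p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.2, 0.6]

tv = 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))        # identity (a)
via_max = sum(max(a, b) for a, b in zip(p1, p2)) - 1      # identity (c)
via_min = 1 - sum(min(a, b) for a, b in zip(p1, p2))      # identity (d)
# identity (b): the supremum is attained by f(x) = sign(p1(x) - p2(x))
via_sup = 0.5 * sum((1 if a >= b else -1) * (a - b) for a, b in zip(p1, p2))

assert abs(tv - via_max) < 1e-9
assert abs(tv - via_min) < 1e-9
assert abs(tv - via_sup) < 1e-9
```

All four expressions agree (here at the value 0.4), which is a useful consistency check while working the exercise, though of course not a proof.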
Exercise 2.2 (Divergence between multivariate normal distributions): Let P1 be N(θ1 , Σ) and
P2 be N(θ2 , Σ), where Σ ≻ 0 is a positive definite matrix.
(b) Show that d²_hel(P_1, P_2) = 1 − exp(−⅛ (θ_1 − θ_2)^⊤ Σ^{−1} (θ_1 − θ_2)).
Exercise 2.3 (The optimal test between distributions): Prove Le Cam's inequality: for any function ψ with dom ψ ⊃ X and any distributions P_1, P_2,
\[
P_1(\psi(X) \neq 1) + P_2(\psi(X) \neq 2) \ge 1 - \|P_1 - P_2\|_{\mathrm{TV}}.
\]
Thus, the sum of the probabilities of error in a hypothesis testing problem, where based on a sample
X we must decide whether P1 or P2 is more likely, has value at least 1 − ∥P1 − P2 ∥TV . Given P1
and P2 is this risk attainable?
Exercise 2.4: A random variable X has Laplace(λ, µ) distribution if it has density p(x) = (λ/2) exp(−λ|x − µ|). Consider the hypothesis test of P_1 versus P_2, where X has distribution Laplace(λ, µ_1) under P_1 and distribution Laplace(λ, µ_2) under P_2, where µ_1 < µ_2. Show that the minimal value over all tests ψ of P_1 versus P_2 is
\[
\inf_\psi \big\{ P_1(\psi(X) \neq 1) + P_2(\psi(X) \neq 2) \big\}
= \exp\Big( -\frac{\lambda}{2} |\mu_1 - \mu_2| \Big).
\]
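A numeric sanity check (with illustrative parameter values of our own choosing): by Le Cam's inequality with equality for the optimal test, the minimal summed error equals ∫ min{p_1, p_2}, which a Riemann sum matches against the claimed exp(−λ|µ_1 − µ_2|/2).

```python
import math

# Minimal summed testing error between Laplace(lam, mu1) and Laplace(lam, mu2):
# 1 - TV = integral of min{p1, p2}, claimed to equal exp(-lam*|mu1-mu2|/2).
lam, mu1, mu2 = 1.5, 0.0, 2.0

def lap(x, mu):
    return 0.5 * lam * math.exp(-lam * abs(x - mu))

dx = 1e-3   # Riemann sum of min{p1, p2} over a wide grid
err_sum = sum(min(lap(-30 + i * dx, mu1), lap(-30 + i * dx, mu2)) * dx
              for i in range(int(60 / dx)))

claimed = math.exp(-lam * abs(mu1 - mu2) / 2)   # = e^{-1.5}
assert abs(err_sum - claimed) < 1e-3
```

The grid width ±30 and step 10⁻³ are chosen so the truncation and discretization errors are well below the 10⁻³ tolerance.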
Exercise 2.7 (f -divergences generalize standard divergences): Show the following properties of
f -divergences:
(c) If f (t) = t log t − log t, then Df (P ||Q) = Dkl (P ||Q) + Dkl (Q||P ).
(d) For any convex f satisfying f (1) = 0, Df (P ||Q) ≥ 0. (Hint: use Jensen’s inequality.)
(b) Generalizing the preceding result, let a : X → ℝ_+ and b : X → ℝ_+, and let µ be a finite measure on X with respect to which a is integrable. Show that
\[
\int a(x) d\mu(x) \, f\bigg( \frac{\int b(x) d\mu(x)}{\int a(x) d\mu(x)} \bigg)
\le \int a(x) f\bigg( \frac{b(x)}{a(x)} \bigg) d\mu(x).
\]
If you are unfamiliar with measure theory, prove the following essentially equivalent result: let u : X → ℝ_+ satisfy ∫ u(x)dx < ∞. Show that
\[
\int a(x) u(x) dx \, f\bigg( \frac{\int b(x) u(x) dx}{\int a(x) u(x) dx} \bigg)
\le \int a(x) f\bigg( \frac{b(x)}{a(x)} \bigg) u(x) dx
\]
whenever ∫ a(x)u(x)dx < ∞. (It is possible to demonstrate this remains true under appropriate limits even when ∫ a(x)u(x)dx = +∞, but it is a mess.)
(Hint: use the fact that the perspective of a function f, defined by h(x, t) = tf(x/t) for t > 0, is jointly convex in x and t; see Proposition B.3.12.)
Exercise 2.9 (Data processing and f -divergences I): As with the KL-divergence, given a quantizer
g of the set X , where g induces a partition A1 , . . . , Am of X , we define the f -divergence between
P and Q conditioned on g as
\[
D_f(P\|Q \mid g) := \sum_{i=1}^m Q(A_i) f\bigg( \frac{P(A_i)}{Q(A_i)} \bigg)
= \sum_{i=1}^m Q(g^{-1}(\{i\})) f\bigg( \frac{P(g^{-1}(\{i\}))}{Q(g^{-1}(\{i\}))} \bigg).
\]
Given quantizers g1 and g2 , we say that g1 is a finer quantizer than g2 under the following condition:
assume that g1 induces the partition A1 , . . . , An and g2 induces the partition B1 , . . . , Bm ; then for
any of the sets B_i, there exist some k and sets A_{i_1}, . . . , A_{i_k} such that B_i = ∪_{j=1}^k A_{i_j}. We let
g1 ≺ g2 denote that g1 is a finer quantizer than g2 .
(a) Let g1 and g2 be quantizers of the set X , and let g1 ≺ g2 , meaning that g1 is a finer quantization
than g2 . Prove that
Df (P ||Q | g2 ) ≤ Df (P ||Q | g1 ) .
Equivalently, show that whenever A and B are collections of sets partitioning X, but A is a finer partition of X than B, then
\[
\sum_{B \in \mathcal{B}} Q(B) f\bigg( \frac{P(B)}{Q(B)} \bigg)
\le \sum_{A \in \mathcal{A}} Q(A) f\bigg( \frac{P(A)}{Q(A)} \bigg).
\]
(b) Suppose that X is countable (or finite) so that P and Q have p.m.f.s p and q. Show that
\[
D_f(P\|Q) = \sum_{x} q(x) f\bigg( \frac{p(x)}{q(x)} \bigg),
\]
where on the left we are using the partition definition (2.2.3); you should show that the partition
into discrete parts of X achieves the supremum. You may assume that X is finite. (Though
feel free to prove the result in the case that X is infinite.)
Exercise 2.10 (General data processing inequalities): Let f be a convex function satisfying
f (1) = 0. Let K be a Markov transition kernel from X to Z, that is, K(·, x) is a probability
distribution on Z for each x ∈ X . (Written differently, we have X → Z, and conditioned on X = x,
Z has distribution K(·, x), so that K(A, x) is the probability that Z ∈ A given X = x.)
(a) Define the marginals K_P(A) = ∫ K(A, x)p(x)dx and K_Q(A) = ∫ K(A, x)q(x)dx. Show that
\[
D_f(K_P \| K_Q) \le D_f(P \| Q).
\]
Hint: by equation (2.2.3), w.l.o.g. we may assume that Z is finite and Z = {1, . . . , m}; also
recall Question 2.8.
(b) Let X and Y be random variables with joint distribution PXY and marginals PX and PY .
Define the f -information between X and Y as
If (X; Y ) := Df (PXY ||PX × PY ) .
Use part (a) to show the following general data processing inequality: if we have the Markov
chain X → Y → Z, then
If (X; Z) ≤ If (X; Y ).
Exercise 2.11 (Convexity of f -divergences): Prove Proposition 2.2.11. Hint: Use Question 2.8.
Exercise 2.12 (Variational forms of KL divergence): Let P and Q be arbitrary distributions on a
common space X . Prove the following variational representation, known as the Donsker-Varadhan
theorem, of the KL divergence:
\[
D_{\mathrm{kl}}(P\|Q) = \sup_{f : \mathbb{E}_Q[e^{f(X)}] < \infty}
\Big\{ \mathbb{E}_P[f(X)] - \log \mathbb{E}_Q[\exp(f(X))] \Big\}.
\]
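For discrete P and Q, one can probe the Donsker-Varadhan representation numerically: the choice f = log(p/q) attains equality, while any other f gives a smaller value. The pmfs and the alternative f below are arbitrary illustrations.

```python
import math

# Donsker-Varadhan check on two hypothetical 3-point pmfs: the supremum is
# attained at f = log(p/q), and any other f is suboptimal.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dv(f):
    # E_P[f(X)] - log E_Q[exp(f(X))]
    return sum(pi * fi for pi, fi in zip(p, f)) - math.log(
        sum(qi * math.exp(fi) for qi, fi in zip(q, f)))

f_star = [math.log(pi / qi) for pi, qi in zip(p, q)]
assert abs(dv(f_star) - kl) < 1e-12          # equality at the optimizer
assert dv([1.0, -0.5, 0.3]) <= kl + 1e-12    # an arbitrary f falls short
```

At f = log(p/q) the normalizing term log E_Q[e^f] = log Σ q·(p/q) = 0, which is why the supremum collapses to the KL divergence.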
Exercise 2.14: Let P1 be N(θ1 , Σ1 ) and P2 be N(θ2 , Σ2 ), where Σi ≻ 0 are positive definite
matrices. Give Dkl (P1 ||P2 ).
Exercise 2.15: Let {Pv }v∈V be an arbitrary collection of distributions on a space X and µ be a
probability measure on V. Show that if V ∼ µ and conditional on V = v, we draw X ∼ Pv , then
(a) I(X; V) = ∫ D_kl(P_v ∥ P̄) dµ(v), where P̄ = ∫ P_v dµ(v) is the (weighted) average of the P_v. You may assume that V is discrete if you like.

(b) For any distribution Q on X, I(X; V) = ∫ D_kl(P_v ∥ Q) dµ(v) − D_kl(P̄ ∥ Q). Conclude that I(X; V) ≤ ∫ D_kl(P_v ∥ Q) dµ(v), or, equivalently, that P̄ minimizes ∫ D_kl(P_v ∥ Q) dµ(v) over all probabilities Q.
Exercise 2.16 (The triangle inequality for variation distance): Let P and Q be distributions on X_1^n = (X_1, . . . , X_n) ∈ X^n, and let P_i(· | x_1^{i−1}) be the conditional distribution of X_i given X_1^{i−1} = x_1^{i−1} (and similarly for Q_i). Show that
\[
\|P - Q\|_{\mathrm{TV}} \le \sum_{i=1}^n \mathbb{E}_P\Big[ \big\| P_i(\cdot \mid X_1^{i-1}) - Q_i(\cdot \mid X_1^{i-1}) \big\|_{\mathrm{TV}} \Big].
\]
(b) Define the negative binary entropy h(p) = p log p + (1 − p) log(1 − p). Show that
Exercise 2.20: Use the paper “A New Metric for Probability Distributions” by Dominik Endres and Johannes Schindelin to prove that if V ∼ Uniform{0, 1} and X | V = v ∼ P_v, then √(I(X; V)) is a metric on distributions. (Said differently, D_js(P ∥Q)^{1/2} is a metric on distributions, and it generates the same topology as the TV-distance.)
Exercise 2.21: Relate the generalized Jensen-Shannon divergence between m distributions to
redundancy in encoding.
Chapter 3
Our second introductory chapter focuses on readers who may be less familiar with statistical mod-
eling methodology and the how and why of fitting different statistical models. As in the preceding
introductory chapter on information theory, this chapter will be a fairly terse blitz through the main
ideas. Nonetheless, the ideas and distributions here should give us something on which to hang our
hats, so to speak, as the distributions and models provide the basis for examples throughout the
book. Exponential family models form the basis of much of statistics, as they are a natural step
away from the most basic families of distributions—Gaussians—which admit exact computations
but are brittle, to a more flexible set of models that retain enough analytical elegance to permit
careful analyses while giving power in modeling. A key property is that fitting exponential family
models reduces to the minimization of convex functions—convex optimization problems—an oper-
ation we treat as a technology akin to evaluating a function like sin or cos. This perspective (which
is accurate enough) will arise throughout this book, and informs the philosophy we adopt that once
we formulate a problem as convex, it is solved.
In the continuous case, p_θ is instead a density on X ⊂ ℝ^k, and p_θ takes the identical form above, but
\[
A(\theta) = \log \int_{\mathcal{X}} h(x) \exp(\langle \theta, \phi(x) \rangle) dx.
\]
We can abstract away from this distinction between discrete and continuous distributions by making
the definition measure-theoretic, which we do here for completeness. (But recall the remarks in
Section 1.4.)
With our notation, we have the following definition.
Definition 3.1. The exponential family associated with the function ϕ and base measure µ is defined as the set of distributions with densities p_θ with respect to µ, where
\[
p_\theta(x) = \exp\big( \langle \theta, \phi(x) \rangle - A(\theta) \big),
\]
and the natural parameter set
Θ := {θ | A(θ) < ∞}
is open.
In Definition 3.1, we have included the carrier h in the base measure µ, and frequently we will give
ourselves the general notation
In some scenarios, it may be convenient to re-parameterize the problem in terms of some function
η(θ) instead of θ itself; we will not worry about such issues and simply use the formulae that are
most convenient.
We now give a few examples of exponential family models.
Example 3.1.2 (Poisson distribution): The Poisson distribution (for count data) is usually parameterized by some λ > 0, and for x ∈ ℕ has distribution P_λ(X = x) = (1/x!)λ^x e^{−λ}. Thus by taking µ to be counting (discrete) measure on {0, 1, . . .} and setting θ = log λ, we find the density (probability mass function in this case)
\[
p(x) = \frac{1}{x!} \lambda^x e^{-\lambda}
= \frac{1}{x!} \exp(x \log \lambda - \lambda)
= \frac{1}{x!} \exp(x\theta - e^\theta).
\]
Notably, taking h(x) = (x!)^{−1} and log-partition A(θ) = e^θ, we have probability mass function p_θ(x) = h(x) exp(θx − A(θ)). □
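A quick check of Example 3.1.2 (with an arbitrary illustrative rate λ = 2.7): the exponential-family form h(x) exp(θx − A(θ)) with θ = log λ and A(θ) = e^θ reproduces the Poisson pmf exactly.

```python
import math

# Verify h(x)*exp(theta*x - A(theta)), with theta = log(lam) and A(theta) = e^theta,
# matches the Poisson pmf (1/x!) lam^x e^{-lam} for a hypothetical rate lam.
lam = 2.7
theta = math.log(lam)

max_diff = max(
    abs(lam ** x * math.exp(-lam) / math.factorial(x)            # Poisson pmf
        - math.exp(theta * x - math.exp(theta)) / math.factorial(x))  # exp. family
    for x in range(12)
)
assert max_diff < 1e-12
```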
Example 3.1.3 (Normal distribution, mean parameterization): For the d-dimensional normal distribution, we take µ to be Lebesgue measure on ℝ^d. If we fix the covariance and vary only the mean µ in the family N(µ, Σ), then X ∼ N(µ, Σ) has density
\[
p_\mu(x) = \exp\Big( -\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) - \frac{1}{2} \log \det(2\pi\Sigma) \Big).
\]
In particular, we have carrier h(x) = exp(−½ x^⊤Σ^{−1}x)/((2π)^{d/2} det(Σ)^{1/2}), sufficient statistic ϕ(x) = x, and log partition A(θ) = ½ θ^⊤Σ^{−1}θ. □
Example 3.1.4 (Normal distribution): Let X ∼ N(µ, Σ). We may re-parameterize this by Θ = Σ^{−1} and θ = Σ^{−1}µ, and we have density
\[
p_{\theta, \Theta}(x) \propto \exp\Big( \langle \theta, x \rangle - \frac{1}{2} \langle xx^\top, \Theta \rangle \Big),
\]
where ⟨·, ·⟩ denotes the Euclidean inner product. See Exercise 3.1. □
In some cases, it is analytically convenient to include a few more conditions on the exponential
family.
Definition 3.2. Let {P_θ}_{θ∈Θ} be an exponential family as in Definition 3.1. The sufficient statistic ϕ is minimal if Θ = dom A ⊂ ℝ^d is full-dimensional and there exists no vector u ≠ 0 such that ⟨u, ϕ(x)⟩ is constant for µ-almost every x ∈ X.
Definition 3.2 is essentially equivalent to stating that ϕ(x) = (ϕ1 (x), . . . , ϕd (x)) has linearly inde-
pendent components when viewed as vectors [ϕi (x)]x∈X . While we do not prove this, via a suitable
linear transformation—a variant of Gram-Schmidt orthonormalization—one may modify any non-
minimal exponential family {Pθ } into an equivalent minimal exponential family {Qη }, meaning
that the two collections satisfy the equality {Pθ } = {Qη } (see Brown [41, Chapter 1]).
Here, we enumerate a few of their key analytical properties, focusing on the cumulant generating (or log partition) function A(θ) = log ∫ e^{⟨θ,ϕ(x)⟩} dµ(x). We begin with a heuristic calculation, where we assume that we may exchange differentiation and integration. Assuming that this is the case, we then obtain the important expectation and covariance relationships that
\[
\nabla A(\theta)
= \frac{1}{\int e^{\langle \theta, \phi(x) \rangle} d\mu(x)} \int \nabla_\theta e^{\langle \theta, \phi(x) \rangle} d\mu(x)
= e^{-A(\theta)} \int \nabla_\theta e^{\langle \theta, \phi(x) \rangle} d\mu(x)
= \int \phi(x) e^{\langle \theta, \phi(x) \rangle - A(\theta)} d\mu(x)
= \mathbb{E}_\theta[\phi(X)]
\]
because e^{⟨θ,ϕ(x)⟩−A(θ)} = p_θ(x). A completely similar (and still heuristic, at least at this point) calculation gives
\[
\nabla^2 A(\theta) = \mathbb{E}_\theta[\phi(X)\phi(X)^\top] - \mathbb{E}_\theta[\phi(X)]\mathbb{E}_\theta[\phi(X)]^\top
= \mathrm{Cov}_\theta(\phi(X)).
\]
That these identities hold is no accident and is central to the appeal of exponential family models.
The first and, from our perspective, most important result about exponential family models is
their convexity. While (assuming the differentiation relationships above hold) the differentiation
identity that ∇2 A(θ) = Covθ (ϕ(X)) ⪰ 0 makes convexity of A immediate, one can also provide a
direct argument without appealing to differentiation.
Proposition 3.2.1. The cumulant-generating function θ 7→ A(θ) is convex, and it is strictly convex
if and only if Covθ (ϕ(X)) is positive definite for all θ ∈ dom A.
Proof Let θ_λ = λθ_1 + (1 − λ)θ_2, where θ_1, θ_2 ∈ Θ and λ ∈ (0, 1). Then 1/λ ≥ 1 and 1/(1 − λ) ≥ 1, and Hölder's inequality (with conjugate exponents 1/λ and 1/(1 − λ)) implies
\[
\log \int \exp(\langle \theta_\lambda, \phi(x) \rangle) d\mu(x)
= \log \int \exp(\langle \theta_1, \phi(x) \rangle)^\lambda \exp(\langle \theta_2, \phi(x) \rangle)^{1-\lambda} d\mu(x)
\]
\[
\le \log \bigg[ \bigg( \int \exp(\langle \theta_1, \phi(x) \rangle) d\mu(x) \bigg)^{\lambda}
\bigg( \int \exp(\langle \theta_2, \phi(x) \rangle) d\mu(x) \bigg)^{1-\lambda} \bigg]
= \lambda \log \int \exp(\langle \theta_1, \phi(x) \rangle) d\mu(x)
+ (1 - \lambda) \log \int \exp(\langle \theta_2, \phi(x) \rangle) d\mu(x),
\]
as desired. The strict convexity will be a consequence of Proposition 3.2.2 to come, as there we
formally show that ∇2 A(θ) = Covθ (ϕ(X)).
We now show that A(θ) is indeed infinitely differentiable and how it generates the moments of
the sufficient statistics ϕ(x). To describe the properties, we provide a bit of notation related to
tensor products: for a vector x ∈ Rd , we let
\[
x^{\otimes k} := \underbrace{x \otimes x \otimes \cdots \otimes x}_{k \text{ times}}
\]
denote the kth order tensor, or multilinear operator, that for v_1, . . . , v_k ∈ ℝ^d satisfies
\[
x^{\otimes k}(v_1, \ldots, v_k) := \langle x, v_1 \rangle \cdots \langle x, v_k \rangle = \prod_{i=1}^k \langle x, v_i \rangle.
\]
When k = 2, this is the familiar outer product x⊗2 = xx⊤ . (More generally, one may think of x⊗k
as a d × d × · · · × d box, where the (i1 , . . . , ik ) entry is [x⊗k ]i1 ,...,ik = xi1 · · · xik .) With this notation,
our first key result regards the differentiability of A, where we can compute (all) derivatives of eA(θ)
by interchanging integration and differentiation.
The proof of the proposition is involved and requires complex analysis, so we defer it to Sec. 3.6.1.
As particular consequences of Proposition 3.2.2, we can rigorously demonstrate the expectation
and covariance relationships that
\[
\nabla A(\theta)
= \frac{1}{\int e^{\langle \theta, \phi(x) \rangle} d\mu(x)} \int \nabla e^{\langle \theta, \phi(x) \rangle} d\mu(x)
= \int \phi(x) p_\theta(x) d\mu(x)
= \mathbb{E}_\theta[\phi(X)]
\]
and
\[
\nabla^2 A(\theta)
= \frac{1}{\int e^{\langle \theta, \phi(x) \rangle} d\mu(x)} \int \phi(x)^{\otimes 2} e^{\langle \theta, \phi(x) \rangle} d\mu(x)
- \frac{\big( \int \phi(x) e^{\langle \theta, \phi(x) \rangle} d\mu(x) \big)^{\otimes 2}}{\big( \int e^{\langle \theta, \phi(x) \rangle} d\mu(x) \big)^2}
= \mathbb{E}_\theta[\phi(X)\phi(X)^\top] - \mathbb{E}_\theta[\phi(X)] \mathbb{E}_\theta[\phi(X)]^\top
= \mathrm{Cov}_\theta(\phi(X)).
\]
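These identities are easy to probe numerically in one dimension. The sketch below uses the Bernoulli family (ϕ(x) = x on {0, 1} with counting base measure, so A(θ) = log(1 + e^θ)) and compares finite differences of A against the mean and variance of ϕ(X):

```python
import math

# Finite-difference check of dA/dθ = E_θ[φ(X)] and d²A/dθ² = Var_θ(φ(X)) for the
# Bernoulli family: φ(x) = x on {0, 1}, so A(θ) = log(1 + e^θ).
def A(t):
    return math.log(1 + math.exp(t))

theta, eps = 0.7, 1e-5
mean = math.exp(theta) / (1 + math.exp(theta))   # E_θ[X] = p_θ(1)
var = mean * (1 - mean)                          # Cov_θ(X)

dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)            # central difference
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps ** 2

assert abs(dA - mean) < 1e-8
assert abs(d2A - var) < 1e-4
```

The tolerances reflect the usual finite-difference trade-off: the central difference is accurate to O(ε²), while the second difference additionally picks up rounding error of order machine-epsilon/ε².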
Minimal exponential families (Definition 3.2) also enjoy a few additional regularity properties. Recall that A is strictly convex if A(λθ_1 + (1 − λ)θ_2) < λA(θ_1) + (1 − λ)A(θ_2) for all θ_1 ≠ θ_2 and λ ∈ (0, 1).
Proposition 3.2.3. Let {Pθ } be a regular exponential family. The log partition function A is
strictly convex if and only if {Pθ } is minimal.
Proof If the family is minimal, then Var_θ(u^⊤ϕ(X)) > 0 for any vector u ≠ 0, while Var_θ(u^⊤ϕ(X)) = u^⊤∇²A(θ)u. This implies the strict positive definiteness ∇²A(θ) ≻ 0, which is equivalent to strict
convexity (see Corollary B.3.2 in Appendix B.3.1). Conversely, if ∇2 A(θ) ≻ 0 for all θ ∈ Θ, then
Varθ (u⊤ ϕ(X)) > 0 for all u ̸= 0 and so u⊤ ϕ(x) is non-constant in x.
This is always a convex optimization problem (see Appendices B and C for much more on this), as A
is convex and the first term is linear, and so has no non-global optima. Here and throughout, as we
mention in the introductory remarks to this chapter, we treat convex optimization as a technology:
as long as the dimension of a problem is not too large and its objective can be evaluated, it is
(essentially) computationally trivial.
Of course, we never have access to the population P fully; instead, we receive a sample
X1 , . . . , Xn from P . In this case, a natural approach is to replace the expected (negative) log
likelihood above with its empirical version and solve
\[
\mathop{\mathrm{minimize}}_{\theta} \;\; -\sum_{i=1}^n \log p_\theta(X_i)
= \sum_{i=1}^n \big[ -\langle \theta, \phi(X_i) \rangle + A(\theta) \big], \tag{3.2.2}
\]
which is still a convex optimization problem (as the objective is convex in θ). The maximum likelihood estimate is any vector θ̂_n minimizing the negative log likelihood (3.2.2), which by setting gradients to 0 is evidently any vector satisfying
\[
\nabla A(\hat{\theta}_n) = \mathbb{E}_{\hat{\theta}_n}[\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(X_i). \tag{3.2.3}
\]
In particular, we need only find a parameter θ̂_n matching moments of the empirical distribution of the observed X_i ∼ P. This θ̂_n is unique whenever Cov_θ(ϕ(X)) ≻ 0 for all θ, that is, when the covariance of ϕ is full rank in the exponential family model, because then the objective in the minimization problem (3.2.2) is strictly convex.
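As an illustration of the moment-matching condition (3.2.3), consider the Poisson family of Example 3.1.2, where ∇A(θ) = e^θ: the MLE from (hypothetical) count data solves e^{θ̂} = (1/n)Σ_i X_i, i.e. θ̂_n is the log of the sample mean. The sketch below solves (3.2.3) by Newton's method on the strictly convex objective rather than using the closed form, and then checks the two agree.

```python
import math

# Moment matching (3.2.3) for the Poisson family of Example 3.1.2, where
# grad A(theta) = e^theta: Newton's method on the convex negative log likelihood.
data = [3, 1, 4, 1, 5, 2, 2, 0, 3, 4]    # hypothetical count data
mean = sum(data) / len(data)             # empirical moment (1/n) sum phi(X_i)

theta = 0.0
for _ in range(50):                      # theta <- theta - (grad A - mean)/hess A
    grad, hess = math.exp(theta) - mean, math.exp(theta)
    theta -= grad / hess

assert abs(math.exp(theta) - mean) < 1e-10       # moments matched
assert abs(theta - math.log(mean)) < 1e-10       # closed form: log(sample mean)
```

Because the objective is strictly convex here (Var_θ(X) = e^θ > 0), the moment-matching solution is unique, matching the discussion above.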
Let us proceed heuristically for a moment to develop a rough convergence guarantee for the
estimator θ̂_n; the next paragraph assumes comfort with some classical asymptotic statistics
(and the central limit theorem) and is not essential for what comes later. Then we can see how
minimizers of the problem (3.2.2) converge to their population counterparts. Assume that the data
Xi are i.i.d. from an exponential family model Pθ⋆ . Then we expect that the maximum likelihood
estimate θ̂_n should converge to θ⋆, and so
\[
\frac{1}{n} \sum_{i=1}^n \phi(X_i) = \nabla A(\hat{\theta}_n)
= \nabla A(\theta^\star) + \big( \nabla^2 A(\theta^\star) + o(1) \big)(\hat{\theta}_n - \theta^\star).
\]
But of course, ∇A(θ⋆ ) = Eθ⋆ [ϕ(X)], and so the central limit theorem gives that
\[
\frac{1}{n} \sum_{i=1}^n \big( \phi(X_i) - \nabla A(\theta^\star) \big)
\overset{\cdot}{\sim} \mathsf{N}\big( 0, \, n^{-1} \mathrm{Cov}_{\theta^\star}(\phi(X)) \big)
= \mathsf{N}\big( 0, \, n^{-1} \nabla^2 A(\theta^\star) \big),
\]
where ∼· means “is approximately distributed as.” Multiplying by (∇²A(θ⋆) + o(1))^{−1} ≈ ∇²A(θ⋆)^{−1}, we thus see (still working in our heuristic)
\[
\hat{\theta}_n - \theta^\star
= \big( \nabla^2 A(\theta^\star) + o(1) \big)^{-1} \frac{1}{n} \sum_{i=1}^n \big( \phi(X_i) - \nabla A(\theta^\star) \big)
\overset{\cdot}{\sim} \mathsf{N}\big( 0, \, n^{-1} \cdot \nabla^2 A(\theta^\star)^{-1} \big), \tag{3.2.4}
\]
where we use that BZ ∼ N(0, BΣB^⊤) if Z ∼ N(0, Σ). (It is possible to make each of these steps fully rigorous.) Thus the cumulant generating function A governs the error we expect in θ̂_n − θ⋆.
Much of the rest of this book explores properties of these types of minimization problems: at
what rates do we expect θ̂_n to converge to a global minimizer of problem (3.2.1)? Can we show
that these rates are optimal? Is this the “right” strategy for choosing a parameter? Exponential
families form a particular working example to motivate this development.
Similarly, we have
\[
D_{\mathrm{kl}}(P_{\theta+\Delta} \| P_\theta)
= \mathbb{E}_{\theta+\Delta}\big[ \langle \theta + \Delta, \phi(X) \rangle - A(\theta + \Delta) - \langle \theta, \phi(X) \rangle + A(\theta) \big]
= A(\theta) - A(\theta + \Delta) + \mathbb{E}_{\theta+\Delta}[\langle \Delta, \phi(X) \rangle]
= A(\theta) - A(\theta + \Delta) - \nabla A(\theta + \Delta)^\top(-\Delta).
\]
These identities give an immediate connection with convexity. Indeed, for a differentiable convex function h, the Bregman divergence associated with h is
\[
D_h(\theta_1, \theta_2) := h(\theta_1) - h(\theta_2) - \langle \nabla h(\theta_2), \theta_1 - \theta_2 \rangle, \tag{3.3.1}
\]
which is always nonnegative, and is the gap between the linear approximation to the (convex)
function h and its actual value. One might more accurately call the quantity (3.3.1) the “first-
order divergence,” which is more evocative, but the statistical, machine learning, and optimization
literatures—in which such divergences frequently appear—have adopted this terminology, so we
stick with it.
JCD Comment: Put in a picture of a Bregman divergence
Proposition 3.3.1. Let {P_θ} be an exponential family model with cumulant generating function A(θ). Then
\[
D_{\mathrm{kl}}(P_\theta \| P_{\theta+\Delta}) = D_A(\theta + \Delta, \theta)
\quad \text{and} \quad
D_{\mathrm{kl}}(P_{\theta+\Delta} \| P_\theta) = D_A(\theta, \theta + \Delta).
\]
When the perturbation ∆ is small, that A is infinitely differentiable then gives that
\[
D_{\mathrm{kl}}(P_\theta \| P_{\theta+\Delta}) = \frac{1}{2} \Delta^\top \nabla^2 A(\theta) \Delta + O(\|\Delta\|^3),
\]
so that the Hessian ∇2 A(θ) tells quite precisely how the KL divergence changes as θ varies (locally).
As we saw already in Example 2.3.2 (and see the next section), when the KL-divergence between
two distributions is small, it is hard to test between them, and in the sequel, we will show converses
to this. The Hessian ∇2 A(θ⋆ ) also governs the error in the estimate θbn − θ⋆ in our heuristic (3.2.4).
When the Hessian ∇2 A(θ) is large in the semidefinite order, the KL divergence Dkl (Pθ ||Pθ+∆ ) is large,
and the asymptotic covariance (3.2.4) is small. For this—and other reasons we address later—for
exponential family models, we call ∇2 A(θ) the Fisher information.
Suppose we have a sufficient statistic ϕ : X × Y → ℝ^d; we model Y | X = x via the generalized linear model (or conditional exponential family model) if it has density or probability mass function
\[
p_\theta(y \mid x) = \exp\big( \phi(x, y)^\top \theta - A(\theta \mid x) \big) h(y), \tag{3.4.1}
\]
The log partition function A(· | x) provides the same insights for the conditional models (3.4.1)
as it does for the unconditional exponential family models in the preceding sections. Indeed, as
in Propositions 3.2.1 and 3.2.2, the log partition A(· | x) is always C ∞ on its domain and convex.
Moreover, it gives the expected moments of the sufficient statistic ϕ conditional on x, as
\[
\nabla A(\theta \mid x) = \mathbb{E}_\theta[\phi(X, Y) \mid X = x]
\quad \text{and} \quad
\nabla^2 A(\theta \mid x) = \mathrm{Cov}_\theta(\phi(X, Y) \mid X = x),
\]
from which we can (typically) extract the mean or other statistics of Y conditional on x.
Three standard examples will be our most frequent motivators throughout this book: linear regression, binary logistic regression, and multiclass logistic regression. We give these three and also describe two more important examples, modeling count data through Poisson regression and making predictions for targets y known to live in a bounded set.
so that we have the exponential family representation (3.4.1) with ϕ(x, y) = (1/σ²)xy, h(y) = exp(−y²/(2σ²) − ½ log(2πσ²)), and A(θ | x) = (1/(2σ²)) θ^⊤xx^⊤θ. As ∇A(θ | x) = E_θ[ϕ(X, Y) | X = x] = (1/σ²) x E_θ[Y | X = x], we easily recover E_θ[Y | X = x] = θ^⊤x. □
Frequently, we wish to predict binary or multiclass random variables Y. For example, consider a medical application in which we wish to assess the probability that, based on a set of covariates x ∈ ℝ^d (say, blood pressure, height, weight, family history), an individual will have a heart attack in the next 5 years, so that Y = 1 indicates a heart attack and Y = −1 indicates none. The next example shows how we might model this.
Here, the idea is that if θ_y^⊤x > θ_j^⊤x for all j ≠ y, then the model assigns higher probability to class y than any other class; the larger the gap between θ_y^⊤x and θ_j^⊤x, the larger the difference in assigned probabilities. □
Other approaches with these ideas allow us to model other situations. Poisson regression models
are frequent choices for modeling count data. For example, consider an insurance company that
wishes to issue premiums for shipping cargo in different seasons and on different routes, and so
wishes to predict the number of times a given cargo ship will be damaged by waves over a period
of service; we might represent this with a feature vector x encoding information about the ship to
be insured, typical weather on the route it will take, and the length of time it will be in service.
To model such counts Y ∈ {0, 1, 2, . . .}, we turn to Poisson regression.
Example 3.4.4 (Poisson regression): When Y ∈ ℕ is a count, the Poisson distribution with rate λ > 0 gives P(Y = y) = e^{−λ}λ^y / y!. Poisson regression models λ via e^{θ^⊤x}, giving model
\[
p_\theta(y \mid x) = \frac{1}{y!} \exp\big( y x^\top \theta - e^{\theta^\top x} \big),
\]
so that we have carrier h(y) = 1/y! and the simple sufficient statistic ϕ(x, y) = yx. The log partition function is A(θ | x) = e^{θ^⊤x}. □
Lastly, we consider a less standard example, one that highlights the flexibility of these models. Here, we assume a linear regression problem in which we wish to predict values Y in a bounded range.
Example 3.4.5 (Bounded range regression): Suppose that we know Y ∈ [−b, b], but we wish to model it via an exponential family model with density
\[
p_\theta(y \mid x) = \exp\big( y \, \theta^\top x - A(\theta \mid x) \big), \qquad y \in [-b, b],
\]
whose normalizer is \(\int_{-b}^b e^{sy} dy = (e^{bs} - e^{-bs})/s\) with \(s = \theta^\top x\), where the limit as s → 0 is 2b; the (conditional) log partition function is thus
\[
A(\theta \mid x) =
\begin{cases}
\log \dfrac{e^{b\theta^\top x} - e^{-b\theta^\top x}}{\theta^\top x} & \text{if } \theta^\top x \neq 0, \\[1ex]
\log(2b) & \text{otherwise.}
\end{cases}
\]
While its functional form makes this highly non-obvious, our general results guarantee that
A(θ | x) is indeed C ∞ and convex in θ. We have ∇A(θ | x) = xEθ [Y | X = x] because
ϕ(x, y) = xy, and we can therefore immediately recover Eθ [Y | X = x]. Indeed, set s = θ⊤ x,
and without loss of generality assume s ̸= 0. Then
This is a convex optimization problem with C ∞ objective, so we can treat solving it as an (essen-
tially) trivial problem unless the sample size n or dimension d of θ are astronomically large.
As in the moment matching equality (3.2.3), a necessary and sufficient condition for θ̂_n to minimize the above objective is that it achieves 0 gradient, that is,
\[
\frac{1}{n} \sum_{i=1}^n \nabla A(\hat{\theta}_n \mid X_i) = \frac{1}{n} \sum_{i=1}^n \phi(X_i, Y_i).
\]
Once again, finding θ̂_n amounts to matching moments, as ∇A(θ | X_i) = E_θ[ϕ(X, Y) | X = X_i], and
we still enjoy the convexity properties of the standard exponential family models.
In general, we of course do not expect any exponential family or generalized linear model (GLM)
to have perfect fidelity to the world: all models are inaccurate (but many are useful!). Nonetheless,
we can still fit any of the GLM models in Examples 3.4.1–3.4.5 to data of the appropriate type. In
particular, for the logarithmic loss ℓ(θ; x, y) = − log p_θ(y | x), we can define the empirical loss
\[
L_n(\theta) := \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i, Y_i).
\]
Then, as n → ∞, we expect that Ln (θ) → E[ℓ(θ; X, Y )], so that the minimizing θ should give the
best predictions possible according to the loss ℓ. We shall therefore often be interested in such
convergence guarantees and the deviations of sample quantities (like Ln ) from their population
counterparts.
and similarly
\[
D_{\mathrm{kl}}(P_{\theta+\Delta} \circ Q \| P_\theta \circ Q)
= \mathbb{E}_{\theta+\Delta}\big[ D_{A(\cdot \mid X)}(\theta, \theta + \Delta) \big],
\]
where we recall the Bregman divergence (3.3.1) and have used that Eθ [ϕ(X, Y ) | X] = ∇A(θ | X).
Performing a Taylor expansion, we have
\[
A(\theta + \Delta \mid x)
= A(\theta \mid x) + \langle \nabla A(\theta \mid x), \Delta \rangle
+ \frac{1}{2} \Delta^\top \nabla^2 A(\theta \mid x) \Delta
+ O\big( \mathbb{E}_{\theta + t\Delta}[\|\phi(X, Y)\|^3 \mid x] \cdot \|\Delta\|^3 \big),
\]
where we have computed third derivatives of A(θ | x), and t ∈ [0, 1]. Evaluating the Taylor expansion in the integral form, once there exists some δ > 0 such that
\[
\int_0^1 \mathbb{E}_Q\Big[ \mathbb{E}_{\theta + tv}\big[ \|\phi(X, Y)\|^3 \mid X \big] \Big] dt < \infty
\]
for any ∥v∥₂ ≤ δ, for either of the expansions above we have the following corollary:
Corollary 3.4.6. Assume the marginal distribution Q on X satisfies the above integrability condition. Then as ∆ → 0,
\[
D_{\mathrm{kl}}(P_\theta \circ Q \| P_{\theta+\Delta} \circ Q)
= \frac{1}{2} \Delta^\top \mathbb{E}_Q[\nabla^2 A(\theta \mid X)] \Delta + O(\|\Delta\|^3)
\]
and
\[
D_{\mathrm{kl}}(P_{\theta+\Delta} \circ Q \| P_\theta \circ Q)
= \frac{1}{2} \Delta^\top \mathbb{E}_Q[\nabla^2 A(\theta \mid X)] \Delta + O(\|\Delta\|^3).
\]
In analogy with Proposition 3.3.1, we see again that the expected Hessian EQ [∇2 A(θ | X)] tells
quite precisely how the KL divergence changes as θ varies locally, but now, the distribution Q on
X also enters the picture. So when A and the distribution Q are such that EQ [∇2 A(θ | X)] is large
in the semidefinite order, then it is easy to distinguish data coming from Pθ ◦ Q from that drawn
from Pθ′ ◦ Q, and otherwise, it is not. We therefore call
\[
\mathbb{E}[\nabla^2 A(\theta \mid X)]
= \mathbb{E}[\mathrm{Cov}_\theta(\phi(X, Y) \mid X)]
= \mathbb{E}\big[ \nabla \log p_\theta(Y \mid X) \nabla \log p_\theta(Y \mid X)^\top \big] \tag{3.4.2}
\]
the Fisher information.
Example 3.4.7 (The information in logistic regression): For the binary logistic regression model (Example 3.4.2) with Y ∈ {0, 1}, we have
\[
\nabla \log p_\theta(y \mid x)
= yx - \frac{e^{x^\top \theta}}{1 + e^{x^\top \theta}} x
= (y - p_\theta(1 \mid x)) x
\]
Example 3.4.8 (The KL-divergence in logistic regression): The binary logistic regression model with Y ∈ {0, 1} also admits simple bounds on its KL-divergence. For these, we first make the simple observation that for the log-sum-exp function f(t) = log(1 + e^t), we have f′(t) = e^t/(1 + e^t) and f″(t) = f′(t)(1 − f′(t)). Taylor’s theorem states that
    f(t + Δ) = f(t) + f′(t)Δ + ∫_0^Δ f″(t + u)(Δ − u)du,
and as 0 ≤ f″ ≤ 1/4, we have |∫_0^Δ f″(t + u)(Δ − u)du| ≤ (1/4)|∫_0^Δ |Δ − u|du| = Δ²/8, so
    |f(t + Δ) − f(t) − f′(t)Δ| ≤ Δ²/8
for all Δ ∈ R. Computing the KL-divergence directly, we thus have for any parameters θ0, θ1 that
    Dkl(Pθ0(· | x)||Pθ1(· | x)) = f(θ1⊤x) − f(θ0⊤x) − f′(θ0⊤x)x⊤(θ1 − θ0) ≤ (x⊤(θ0 − θ1))²/8,
so the divergence is at most quadratic. ◁
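The quadratic bound above is easy to check numerically; the following sketch (hypothetical code, not from the notes) compares the exact binary KL divergence with (x⊤(θ0 − θ1))²/8 at random parameters:

```python
import math, random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def kl_bernoulli(p, q):
    # D_kl(Bernoulli(p) || Bernoulli(q))
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

random.seed(1)
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(3)]
    th0 = [random.gauss(0, 1) for _ in range(3)]
    th1 = [random.gauss(0, 1) for _ in range(3)]
    p = sigmoid(sum(a * b for a, b in zip(th0, x)))
    q = sigmoid(sum(a * b for a, b in zip(th1, x)))
    sep = sum(xi * (a - b) for xi, a, b in zip(x, th0, th1))
    # the KL divergence never exceeds the quadratic in the separation
    assert kl_bernoulli(p, q) <= sep ** 2 / 8 + 1e-12
print("quadratic KL bound holds on all draws")
```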
H0 : θ = θ0 versus H1,n : θ = θn
as n grows, where we observe a sample X1n drawn i.i.d. either according to Pθ0 (i.e., H0 ) or Pθn
(i.e., H1,n ). By choosing θn in a way that makes the separation v ⊤ (θn − θ0 ) large but testing H0
against H1,n challenging, we can then (roughly) identify the separation δ at which testing becomes
impossible.
Proposition 3.5.1. Let θ0 ∈ R^d. Then there exists a sequence of parameters θn with ∥θn − θ0∥ = O(1/√n), separation
    v⊤(θn − θ0) = (1/√n)√(v⊤∇²A(θ0)⁻¹v),
and for which
    inf_Ψ {Pθ0(Ψ(X_1^n) ≠ 0) + Pθn(Ψ(X_1^n) ≠ 1)} ≥ 1/2 + O(n^{−1/2}).
Proof Let Δ ∈ R^d be a potential perturbation to θ1 = θ0 + Δ, which gives separation δ = v⊤θ1 − v⊤θ0 = v⊤Δ. Let P0 = Pθ0 and P1 = Pθ1. Then the smallest summed probability of error in testing between P0 and P1 based on n observations X_1^n is
    inf_Ψ {P0^n(Ψ(X_1^n) ≠ 0) + P1^n(Ψ(X_1^n) ≠ 1)} = 1 − ∥P0^n − P1^n∥_TV
by Proposition 2.3.1. Following the approach of Example 2.3.2, we apply Pinsker’s inequality (2.2.10) and use that the KL-divergence tensorizes to find
    2∥P0^n − P1^n∥²_TV ≤ nDkl(P0||P1) = nDkl(Pθ0||Pθ0+Δ) = nD_A(θ0 + Δ, θ0),
where the final equality follows from the equivalence between KL and Bregman divergences for exponential families (Proposition 3.3.1).
To guarantee that the summed probability of error is at least 1/2, that is, ∥P0^n − P1^n∥_TV ≤ 1/2, it suffices to choose Δ satisfying nD_A(θ0 + Δ, θ0) ≤ 1/2. So to maximize the separation v⊤Δ while guaranteeing a constant probability of error, we (approximately) solve
    maximize v⊤Δ subject to D_A(θ0 + Δ, θ0) ≤ 1/(2n).
Now, consider that D_A(θ0 + Δ, θ0) = (1/2)Δ⊤∇²A(θ0)Δ + O(∥Δ∥³). Ignoring the higher order term, we consider maximizing v⊤Δ subject to Δ⊤∇²A(θ0)Δ ≤ 1/n. A Lagrangian calculation shows that this has solution
    Δ = (1/√n) · ∇²A(θ0)⁻¹v / √(v⊤∇²A(θ0)⁻¹v).
With this choice, we have separation δ = v⊤Δ = √(v⊤∇²A(θ0)⁻¹v/n), and D_A(θ0 + Δ, θ0) = 1/(2n) + O(1/n^{3/2}). The summed probability of error is at least
    1 − ∥P0^n − P1^n∥_TV ≥ 1 − √(1/4 + O(n^{−1/2})) = 1/2 + O(n^{−1/2})
as desired.
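The Lagrangian step in the proof above is concrete enough to verify directly. In this sketch (hypothetical numbers; H stands in for ∇²A(θ0)), the closed-form Δ makes the constraint Δ⊤∇²A(θ0)Δ = 1/n active and attains separation √(v⊤∇²A(θ0)⁻¹v/n):

```python
import math

def solve2(H, v):
    # solve H w = v for a 2x2 symmetric positive definite H (Cramer's rule)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * v[0] - H[0][1] * v[1]) / det,
            (H[0][0] * v[1] - H[1][0] * v[0]) / det]

H = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical Hessian, standing in for grad^2 A(theta_0)
v = [1.0, -1.0]
n = 400

w = solve2(H, v)                                # w = H^{-1} v
quad = sum(vi * wi for vi, wi in zip(v, w))     # v^T H^{-1} v
delta = [wi / (math.sqrt(n) * math.sqrt(quad)) for wi in w]

# constraint delta^T H delta = 1/n is active at the optimum
Hd = [sum(H[i][j] * delta[j] for j in range(2)) for i in range(2)]
constraint = sum(d * h for d, h in zip(delta, Hd))
# separation v^T delta = sqrt(v^T H^{-1} v / n)
sep = sum(vi * di for vi, di in zip(v, delta))
print(abs(constraint - 1.0 / n) < 1e-12, abs(sep - math.sqrt(quad / n)) < 1e-12)
```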
Let us briefly sketch out why Proposition 3.5.1 is the “right” answer using the heuristics in Section 3.2.1. For an unknown parameter θ in the exponential family model Pθ, we observe X1, . . . , Xn, and wish to test whether v⊤θ ≥ t for a given threshold t. Call our null H0 : v⊤θ ≤ t, and assume we wish to test at an asymptotic level α > 0, meaning the probability that the test falsely rejects H0 is (as n → ∞) at most α. Assuming the heuristic (3.2.4), we have the approximate distributional equality
    v⊤θ̂n ∼ N(v⊤θ, (1/n)v⊤∇²A(θ̂n)⁻¹v) (approximately).
Note that we have θ̂n on the right side of the distribution; it is possible to make this rigorous, but here we target only intuition building. A natural asymptotically level-α test is then
    Tn := Reject if v⊤θ̂n ≥ t + z_{1−α}√(v⊤∇²A(θ̂n)⁻¹v/n), Accept otherwise,
where z_{1−α} is the 1 − α quantile of a standard normal, P(Z ≥ z_{1−α}) = α for Z ∼ N(0, 1). Let θ0 be such that v⊤θ0 = t, so H0 holds. Then
    Pθ0(Tn rejects) = Pθ0(√n · v⊤(θ̂n − θ0) ≥ z_{1−α}√(v⊤∇²A(θ̂n)⁻¹v)) → α.
At least heuristically, then, this separation δ = √(v⊤∇²A(θ0)⁻¹v)/√n is the fundamental separation in parameter values at which testing becomes possible (or below which it is impossible).
As a brief and suggestive aside, the precise growth of the KL-divergence
    Dkl(Pθ0+Δ||Pθ0) = (1/2)Δ⊤∇²A(θ0)Δ + O(∥Δ∥³)
near θ0 plays the fundamental role in both the lower bound and upper bound on testing. When the Hessian ∇²A(θ0) is “large,” meaning it is very positive definite, distributions with small parameter distances are still well-separated in KL-divergence, making testing easy, while when ∇²A(θ0) is small (nearly indefinite), the KL-divergence can be small even for large parameter separations Δ and testing is hard. As a consequence, at least for exponential family models, the Fisher information (3.3.2), which we defined as ∇²A(θ) = Covθ(ϕ(X)), plays a central role in testing and, as we see later, estimation.
Lemma 3.6.1. Consider any collection {θ1, . . . , θm} ⊂ Θ, and let Θ0 = Conv{θ1, . . . , θm} and C ⊂ int Θ0. Then for any k ∈ N, there exists a constant K = K(C, k, {θi}) such that for all θ0 ∈ C,
Proof Let B = {u ∈ R^d | ∥u∥ ≤ 1} be the unit ball in R^d. For any ϵ > 0, there exists a K = K(ϵ) such that ∥x∥^k ≤ Ke^{ϵ∥x∥} for all x ∈ R^d. As C ⊂ int Conv(Θ0), there exists an ϵ > 0 such that for all θ0 ∈ C, θ0 + 2ϵB ⊂ Θ0, and by construction, for any u ∈ B we can write θ0 + 2ϵu = Σ_{j=1}^m λjθj for some λ ∈ R₊^m with 1⊤λ = 1. We therefore have
But using the convexity of t ↦ exp(t) and that θ0 + 2ϵu ∈ Θ0, the last quantity has upper bound
Lemma 3.6.2. Under the conditions of Lemma 3.6.1, there exists a K such that for any θ, θ0 ∈ C
Proof We write
    (exp(⟨θ, x⟩) − exp(⟨θ0, x⟩))/∥θ − θ0∥ = exp(⟨θ0, x⟩) · (exp(⟨θ − θ0, x⟩) − 1)/∥θ − θ0∥.
¹For complex functions, Osgood’s lemma shows that if A is continuous and holomorphic in each variable individually, it is holomorphic. For a treatment of such ideas in an engineering context, see, e.g., [101, Ch. 1].
From this, we can assume without loss of generality that θ0 = 0 (by shifting). Now note that by convexity e^{−a} ≥ 1 − a for all a ∈ R, so 1 − e^a ≤ |a| when a ≤ 0. Conversely, if a > 0, then ae^a ≥ e^a − 1 (note that (d/da)(ae^a) = ae^a + e^a ≥ e^a), so dividing by ∥x∥, we see that
as desired.
With the lemmas in hand, we can demonstrate a dominating function for the derivatives. Indeed, fix θ0 ∈ int Θ and for θ ∈ Θ, define
    g(θ, x) = (exp(⟨θ, x⟩) − exp(⟨θ0, x⟩) − exp(⟨θ0, x⟩)⟨x, θ − θ0⟩)/∥θ − θ0∥ = (e^{⟨θ,x⟩} − e^{⟨θ0,x⟩} − ⟨∇e^{⟨θ0,x⟩}, θ − θ0⟩)/∥θ − θ0∥.
Then lim_{θ→θ0} g(θ, x) = 0 by the differentiability of t ↦ e^t. Lemmas 3.6.1 and 3.6.2 show that if we take any collection {θj}_{j=1}^m ⊂ Θ for which θ ∈ int Conv{θj}, then for C ⊂ int Conv{θj}, there exists a constant K such that
    |g(θ, x)| ≤ |exp(⟨θ, x⟩) − exp(⟨θ0, x⟩)|/∥θ − θ0∥ + ∥x∥ exp(⟨θ0, x⟩) ≤ K max_j exp(⟨θj, x⟩)
for all θ ∈ C. As ∫ max_j e^{⟨θj,x⟩} dμ(x) ≤ Σ_{j=1}^m ∫ e^{⟨θj,x⟩} dμ(x) < ∞, the dominated convergence theorem thus implies that
    lim_{θ→θ0} ∫ g(θ, x)dμ(x) = 0,
Analyticity Over the subset Θ_C := {θ + iz | θ ∈ Θ, z ∈ R^d} (where i = √−1 is the imaginary unit), we can extend the preceding results to demonstrate that A is analytic on Θ_C. Indeed, we first simply note that for a, b ∈ R, exp(a + ib) = exp(a) exp(ib) and |exp(a + ib)| = exp(a), i.e., |e^z| = e^{Re z} for z ∈ C, and so Lemmas 3.6.1 and 3.6.2 follow mutatis mutandis as in the real case. These are enough for the application of the dominated convergence theorem above, and we use that exp(·) is analytic to conclude that θ ↦ M(θ) is analytic on Θ_C.
3.7 Bibliography
3.8 Exercises
Exercise 3.1: In Example 3.1.4, give the sufficient statistic ϕ and an explicit formula for the log
partition function A(θ, Θ) so that we can write pθ,Θ (x) = exp(⟨θ, ϕ1 (x)⟩ + ⟨Θ, ϕ2 (x)⟩ − A(θ, Θ)).
Exercise 3.2: Consider the binary logistic regression model in Example 3.4.2, and let ℓ(θ; x, y) =
− log pθ (y | x) be the associated log loss.
(ii) Let (xi, yi)_{i=1}^n ⊂ R^d × {±1} be a sample. Give a sufficient condition for the minimizer of the empirical log loss
    Ln(θ) := (1/n) Σ_{i=1}^n ℓ(θ; xi, yi)
to be unique that depends only on the vectors {xi}. Hint. A convex function h is strictly convex if its Hessian ∇²h is positive definite.
Exercise 3.3: Give the Fisher information (3.4.2) for each of the following generalized linear
models:
Part I
Chapter 4
Concentration Inequalities
In many scenarios, it is useful to understand how a random variable X behaves by giving bounds on the probability that it deviates far from its mean or median. This can allow us to prove that estimation and learning procedures achieve certain performance, that different decoding and encoding schemes work with high probability, and other results. In this chapter, we give several tools for proving bounds on the probability that random variables are far from their typical values. We conclude the chapter with a discussion of basic uniform laws of large numbers and applications to empirical risk minimization and statistical learning, though we focus on the relatively simple cases we can treat with our tools.
Given a random variable X and a threshold t, how can we bound the tail probability
    P(X ≥ t)?
We begin with the three most classical inequalities for this purpose: the Markov, Chebyshev, and Chernoff bounds, which are all instances of the same technique.
The basic inequality off of which all else builds is Markov’s inequality.
Proposition 4.1.1 (Markov’s inequality). Let X be a nonnegative random variable, meaning that X ≥ 0 with probability 1. Then
    P(X ≥ t) ≤ E[X]/t.
Proof For any random variable, P(X ≥ t) = E[1 {X ≥ t}] ≤ E[(X/t)1 {X ≥ t}] ≤ E[X]/t, as
X/t ≥ 1 whenever X ≥ t.
When we know more about a random variable than that its expectation is finite, we can give
somewhat more powerful bounds on the probability that the random variable deviates from its
typical values. The first step in this direction, Chebyshev’s inequality, requires two moments, and
when we have exponential moments, we can give even stronger results. As we shall see, each of
these results is but an application of Proposition 4.1.1.
Proposition 4.1.2 (Chebyshev’s inequality). Let X be a random variable with Var(X) < ∞. Then
    P(X − E[X] ≥ t) ≤ Var(X)/t² and P(X − E[X] ≤ −t) ≤ Var(X)/t²
for all t ≥ 0.
Proof We prove only the upper tail result, as the lower tail is identical. We first note that
X − E[X] ≥ t implies that (X − E[X])2 ≥ t2 . But of course, the random variable Z = (X − E[X])2
is nonnegative, so Markov’s inequality gives P(X − E[X] ≥ t) ≤ P(Z ≥ t2 ) ≤ E[Z]/t2 , and
E[Z] = E[(X − E[X])2 ] = Var(X).
Proposition 4.1.3 (Chernoff bounds). Let X be a random variable with moment generating function φX(λ) := E[e^{λX}]. Then
    P(X ≥ t) ≤ E[e^{λX}]/e^{λt} = φX(λ)e^{−λt}
for all λ ≥ 0.
Proof This is another application of Markov’s inequality: for λ > 0, we have eλX ≥ eλt if and
only if X ≥ t, so that P(X ≥ t) = P(eλX ≥ eλt ) ≤ E[eλX ]/eλt .
In particular, taking the infimum over all λ ≥ 0 in Proposition 4.1.3 gives the more standard Chernoff (large deviation) bound
    P(X ≥ t) ≤ exp(inf_{λ≥0} {log φX(λ) − λt}).
Example 4.1.4 (Gaussian random variables): Let X be a mean-zero Gaussian with variance σ², so that
    φX(λ) = E[exp(λX)] = exp(λ²σ²/2).    (4.1.1)
As a consequence of the equality (4.1.1) and the Chernoff bound technique (Proposition 4.1.3), we see that for X Gaussian with variance σ², we have
    P(X ≥ E[X] + t) ≤ exp(−t²/(2σ²)) and P(X ≤ E[X] − t) ≤ exp(−t²/(2σ²))
for all t ≥ 0. Indeed, we have log φ_{X−E[X]}(λ) = λ²σ²/2, and inf_λ {λ²σ²/2 − λt} = −t²/(2σ²), which is attained by λ = t/σ². ◁
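A quick Monte Carlo check (a hypothetical sketch, not part of the notes) confirms that the Chernoff bound exp(−t²/(2σ²)) dominates the empirical Gaussian tail:

```python
import math, random

def gaussian_tail_bound(t, sigma2):
    # Chernoff bound exp(-t^2 / (2 sigma^2)) derived from (4.1.1)
    return math.exp(-t * t / (2 * sigma2))

random.seed(2)
sigma, n = 1.5, 200_000
samples = [random.gauss(0, sigma) for _ in range(n)]
for t in [1.0, 2.0, 3.0]:
    empirical = sum(x >= t for x in samples) / n
    assert empirical <= gaussian_tail_bound(t, sigma ** 2)
print("Chernoff bound dominates the empirical tail")
```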
Example 4.1.5 (Random signs (Rademacher variables)): The random variable X taking values in {−1, 1} with equal probability is 1-sub-Gaussian. Indeed, we have
    E[exp(λX)] = (1/2)e^λ + (1/2)e^{−λ} = (1/2)Σ_{k=0}^∞ λ^k/k! + (1/2)Σ_{k=0}^∞ (−λ)^k/k! = Σ_{k=0}^∞ λ^{2k}/(2k)! ≤ Σ_{k=0}^∞ (λ²)^k/(2^k k!) = exp(λ²/2),
as claimed. ◁
Bounded random variables are also sub-Gaussian; indeed, we have the following example.
Example 4.1.6 (Bounded random variables): Suppose that X is bounded, say X ∈ [a, b]. Then Hoeffding’s lemma states that
    E[e^{λ(X−E[X])}] ≤ exp(λ²(b − a)²/8),
so that X is (b − a)²/4-sub-Gaussian.
We prove a somewhat weaker statement with a simpler argument, while Exercise 4.1 gives one approach to proving the above statement. First, let ε ∈ {−1, 1} be a Rademacher variable, so that P(ε = 1) = P(ε = −1) = 1/2. We apply a so-called symmetrization technique—a common technique in probability theory, statistics, concentration inequalities, and Banach space research—to give a simpler bound. Indeed, let X′ be an independent copy of X, so that E[X′] = E[X]. We have
    E[exp(λ(X − E[X]))] = E[exp(λ(X − E[X′]))] ≤ E[exp(λ(X − X′))] = E[exp(λε(X − X′))],
where the inequality follows from Jensen’s inequality and the last equality is a consequence of the fact that X − X′ is symmetric about 0. Using the result of Example 4.1.5,
    E[exp(λε(X − X′))] ≤ E[exp(λ²(X − X′)²/2)] ≤ exp(λ²(b − a)²/2),
where the final inequality is immediate from the fact that |X − X′| ≤ b − a. ◁
While Example 4.1.6 shows how a symmetrization technique can give sub-Gaussian behavior, more sophisticated techniques involve explicitly bounding the logarithm of the moment generating function of X, often by calculations involving exponential tilts of its density. In particular, letting X be mean zero for simplicity, if we let
    ψ(λ) := log E[e^{λX}],
then
    ψ′(λ) = E[Xe^{λX}]/E[e^{λX}] and ψ″(λ) = E[X²e^{λX}]/E[e^{λX}] − E[Xe^{λX}]²/E[e^{λX}]²,
where we can interchange the order of taking expectations and derivatives whenever ψ(λ) is finite. Notably, if X has density pX (with respect to any base measure) then the random variable Yλ with density
    pλ(y) = pX(y)e^{λy}/E[e^{λX}]
(with respect to the same base measure) satisfies E[Yλ] = ψ′(λ) and Var(Yλ) = ψ″(λ).
One can exploit this in many ways, which the exercises and coming chapters do. As a particular
example, we can give sharper sub-Gaussian constants for Bernoulli random variables.
Example 4.1.7 (Bernoulli random variables): Let X be Bernoulli(p), so that X = 1 with probability p and X = 0 otherwise. Then a strengthening of Hoeffding’s lemma (also, essentially, due to Hoeffding) is that
    log E[e^{λ(X−p)}] ≤ σ²(p)λ²/2 for σ²(p) := (1 − 2p)/(2 log((1 − p)/p)).
Here we take the limits as p → {0, 1/2, 1} and have σ²(0) = 0, σ²(1) = 0, and σ²(1/2) = 1/4. Because p ↦ σ²(p) is concave and symmetric about p = 1/2, this inequality is always sharper than that of Example 4.1.6. Exercise 4.12 gives one proof of this bound exploiting exponential tilting. ◁
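The sharpened constant is easy to probe numerically. The sketch below (hypothetical code; the tolerances are ours) checks log E[e^{λ(X−p)}] ≤ σ²(p)λ²/2 on a grid of λ and confirms σ²(p) ≤ 1/4:

```python
import math

def sigma2(p):
    # Hoeffding's sharper variance proxy; the p -> 1/2 limit equals 1/4
    if abs(p - 0.5) < 1e-12:
        return 0.25
    return (1 - 2 * p) / (2 * math.log((1 - p) / p))

def log_mgf_centered(lam, p):
    # log E[exp(lam (X - p))] for X ~ Bernoulli(p)
    return math.log(p * math.exp(lam * (1 - p)) + (1 - p) * math.exp(-lam * p))

ok = True
for p in [0.05, 0.2, 0.5, 0.8, 0.95]:
    s2 = sigma2(p)
    ok &= s2 <= 0.25 + 1e-12          # never worse than (b - a)^2 / 4
    for k in range(-40, 41):
        lam = k / 10.0
        ok &= log_mgf_centered(lam, p) <= s2 * lam ** 2 / 2 + 1e-10
print(ok)
```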
Chernoff bounds for sub-Gaussian random variables are immediate; indeed, they have the same concentration properties as Gaussian random variables, a consequence of the nice analytical properties of their moment generating functions (that their logarithms are at most quadratic). Thus, using the technique of Example 4.1.4, we obtain the following proposition.
Proposition 4.1.8. Let X be σ²-sub-Gaussian. Then for all t ≥ 0 we have
    P(X − E[X] ≥ t) ∨ P(X − E[X] ≤ −t) ≤ exp(−t²/(2σ²)).
Chernoff bounds extend naturally to sums of independent random variables, because moment generating functions of sums of independent random variables become products of moment generating functions.
Proposition 4.1.9. Let X1, . . . , Xn be independent, where Xi is σi²-sub-Gaussian. Then Σ_{i=1}^n Xi is (Σ_{i=1}^n σi²)-sub-Gaussian.
Proof We assume w.l.o.g. that the Xi are mean zero. We have by independence and sub-Gaussianity that
    E[exp(λ Σ_{i=1}^n Xi)] = E[exp(λ Σ_{i=1}^{n−1} Xi)] E[exp(λXn)] ≤ exp(λ²σn²/2) E[exp(λ Σ_{i=1}^{n−1} Xi)],
and induction gives the result.
Two immediate corollaries to Propositions 4.1.8 and 4.1.9 show that sums of sub-Gaussian random variables concentrate around their expectations. We begin with a general concentration inequality.
Corollary 4.1.10. Let Xi be independent σi²-sub-Gaussian random variables. Then for all t ≥ 0,
    max{P(Σ_{i=1}^n (Xi − E[Xi]) ≥ t), P(Σ_{i=1}^n (Xi − E[Xi]) ≤ −t)} ≤ exp(−t²/(2 Σ_{i=1}^n σi²)).
Additionally, the classical Hoeffding bound follows when we couple Example 4.1.6 with Corollary 4.1.10: if Xi ∈ [ai, bi], then
    P(Σ_{i=1}^n (Xi − E[Xi]) ≥ t) ≤ exp(−2t²/Σ_{i=1}^n (bi − ai)²).
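To see the Hoeffding bound in action, this small simulation (a hypothetical sketch, not part of the notes) compares the bound with an empirical tail frequency for centered uniform variables:

```python
import math, random

def hoeffding_bound(t, ranges):
    # P(sum_i (X_i - E X_i) >= t) <= exp(-2 t^2 / sum_i (b_i - a_i)^2)
    return math.exp(-2 * t * t / sum(r * r for r in ranges))

random.seed(3)
n, trials, t = 100, 20_000, 10.0
ranges = [1.0] * n              # X_i uniform on [0, 1], so b_i - a_i = 1
bound = hoeffding_bound(t, ranges)

hits = 0
for _ in range(trials):
    s = sum(random.random() - 0.5 for _ in range(n))   # centered sum
    hits += (s >= t)
print(hits / trials <= bound)
```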
To give another interpretation of these inequalities, let us assume that the Xi are independent and σ²-sub-Gaussian. Then we have that
    P((1/n) Σ_{i=1}^n (Xi − E[Xi]) ≥ t) ≤ exp(−nt²/(2σ²)),
or, for δ ∈ (0, 1), setting exp(−nt²/(2σ²)) = δ, that is, t = √(2σ² log(1/δ))/√n, we have that
    (1/n) Σ_{i=1}^n (Xi − E[Xi]) ≤ √(2σ² log(1/δ))/√n with probability at least 1 − δ.
There are a variety of other conditions equivalent to sub-Gaussianity, which we capture in the
following theorem.
Theorem 4.1.11. Let X be a random variable and σ² ≥ 0. The following statements are all equivalent, meaning that there are numerical constant factors Kj such that if one statement (i) holds with parameter Ki, then statement (j) holds with parameter Kj ≤ CKi, where C is a numerical constant.
(1) Sub-Gaussian tails: P(|X| ≥ t) ≤ 2 exp(−t²/(K1σ²)) for all t ≥ 0.
(2) Sub-Gaussian moments: E[|X|^k]^{1/k} ≤ K2σ√k for all k.
Particularly, (1) implies (2) with K1 = 1 and K2 ≤ e^{1/e}; (2) implies (3) with K2 = 1 and K3 = e√(2/(e − 1)) < 3; (3) implies (1) with K3 = 1 and K1 = 1/log 2. For the last part, (3) implies (4) with K3 = 1 and K4 ≤ 3/4, while (4) implies (1) with K4 = 1/2 and K1 ≤ 2.
This result is standard in the literature on concentration and random variables; see Section 4.4.1
for a proof.
For completeness, we can give a tighter result than part (3) of the preceding theorem, giving a concrete upper bound on squares of sub-Gaussian random variables. The technique used in the example, to introduce an independent random variable for auxiliary randomization, is a common and useful technique in probabilistic arguments (similar to our use of symmetrization in Example 4.1.6). Let X be mean zero and σ²-sub-Gaussian, and let Z ∼ N(0, 1) be independent of X. Then for λ ≥ 0,
    E[exp(λX²)] = E[exp(√(2λ)XZ)] ≤(i) E[exp(λσ²Z²)] =(ii) 1/[1 − 2σ²λ]₊^{1/2},
where inequality (i) follows because X is sub-Gaussian, and (ii) because Z ∼ N(0, 1). ◁
where inequality (⋆) holds for λ ≤ 1/4, because −log(1 − 2λ) ≤ 2λ + 4λ² for λ ≤ 1/4. ◁
As a second example, we can show that bounded random variables are sub-exponential. It is
clear that this is the case as they are also sub-Gaussian; however, in many cases, it is possible to
show that their parameters yield much tighter control over deviations than is possible using only
sub-Gaussian techniques.
Example 4.1.14 (Bounded random variables are sub-exponential): Suppose that X is a mean zero random variable taking values in [−b, b] with variance σ² = E[X²] (note that we are guaranteed that σ² ≤ b² in this case). We claim that
    E[exp(λX)] ≤ exp(3λ²σ²/5) for |λ| ≤ 1/(2b).    (4.1.4)
To see this, we expand e^z via
    e^z = 1 + z + (z²/2) Σ_{k=2}^∞ 2z^{k−2}/k! = 1 + z + (z²/2) Σ_{k=0}^∞ 2z^k/(k + 2)!.
For k ≥ 0, we have 2/(k + 2)! ≤ 1/3^k, so that Σ_{k=0}^∞ 2z^k/(k + 2)! ≤ Σ_{k=0}^∞ |z/3|^k = [1 − |z|/3]₊^{−1}. Thus
    e^z ≤ 1 + z + (z²/2)[1 − |z|/3]₊^{−1},
and as |X| ≤ b and |λ| < 3/b, we therefore obtain
    E[exp(λX)] ≤ 1 + E[λX] + E[λ²X²/(2[1 − |λX|/3]₊)] ≤ 1 + (1/(1 − |λ|b/3)) · λ²σ²/2.
Letting |λ| ≤ 1/(2b) implies 1/(1 − |λ|b/3) ≤ 6/5, and using that 1 + x ≤ e^x gives the result.
It is possible to give a slightly tighter result for λ ≥ 0. In this case, we have the bound
    E[exp(λX)] ≤ 1 + λ²σ²/2 + λ²σ² Σ_{k=3}^∞ λ^{k−2}b^{k−2}/k! = 1 + (σ²/b²)(e^{λb} − 1 − λb).
Then using that 1 + x ≤ e^x, we obtain Bennett’s moment generating inequality, which is that
    E[e^{λX}] ≤ exp((σ²/b²)(e^{λb} − 1 − λb)) for λ ≥ 0.    (4.1.5)
Inequality (4.1.5) always holds, and for λb near 0, we have e^{λb} − 1 − λb ≈ λ²b²/2. ◁
More broadly, we can show a result similar to Theorem 4.1.11.
Theorem 4.1.15. Let X be a random variable and σ ≥ 0. Then—in the sense of Theorem 4.1.11—the following statements are all equivalent for suitable numerical constants K1, . . . , K4.
(1) Sub-exponential tails: P(|X| ≥ t) ≤ 2 exp(−t/(K1σ)) for all t ≥ 0.
Comparing with sub-Gaussian random variables, which have b = 0, we see that Proposition 4.1.16 gives a similar result for small t—essentially the same concentration as sub-Gaussian random variables—while for large t, the tails decrease only exponentially in t.
We can also give a tensorization identity similar to Proposition 4.1.9.
Proof We apply an inductive technique similar to that used in the proof of Proposition 4.1.9. First, for any fixed i, we know that if |λ| ≤ 1/(bi|ai|), then |aiλ| ≤ 1/bi and so
    E[exp(λaiXi)] ≤ exp(λ²ai²σi²/2).
Now, we inductively apply the preceding inequality, which applies so long as |λ| ≤ 1/(bi|ai|) for all i. We have
    E[exp(λ Σ_{i=1}^n aiXi)] = Π_{i=1}^n E[exp(λaiXi)] ≤ Π_{i=1}^n exp(λ²ai²σi²/2),
It is instructive to study the structure of the bound of Corollary 4.1.18. Notably, the bound is similar to the Hoeffding-type bound of Corollary 4.1.10 (holding for σ²-sub-Gaussian random variables) that
    P(Σ_{i=1}^n aiXi ≥ t) ≤ exp(−t²/(2∥a∥₂²σ²)),
so that for small t, Corollary 4.1.18 gives sub-Gaussian tail behavior. For large t, the bound is weaker. However, in many cases, Corollary 4.1.18 can give finer control than naive sub-Gaussian bounds. Indeed, suppose that the random variables Xi are i.i.d., mean zero, and satisfy Xi ∈ [−b, b] with probability 1, but have variance σ² = E[Xi²] ≤ b² as in Example 4.1.14. Then Corollary 4.1.18 implies that
    P(Σ_{i=1}^n aiXi ≥ t) ≤ exp(−(1/2) min{5t²/(6σ²∥a∥₂²), t/(2b∥a∥∞)}).    (4.1.6)
When applied to a standard mean (and with a minor simplification that 5/12 < 1/3) with ai = 1/n, we obtain the bound that (1/n)Σ_{i=1}^n Xi ≤ t with probability at least 1 − exp(−n min{t²/(3σ²), t/(4b)}). Written differently, we take t = max{σ√(3 log(1/δ)/n), 4b log(1/δ)/n} to obtain
    (1/n) Σ_{i=1}^n Xi ≤ max{σ√(3 log(1/δ))/√n, 4b log(1/δ)/n} with probability at least 1 − δ.
The sharpest such bound possible via more naive Hoeffding-type bounds is b√(2 log(1/δ))/√n, which has substantially worse scaling.
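The scaling comparison in the last paragraph can be made concrete. In the sketch below (hypothetical parameter values, not from the notes), the Bernstein-type width derived from (4.1.6) beats the best Hoeffding width as soon as the variance σ² is small relative to the range b²:

```python
import math

def bernstein_width(sigma, b, n, delta):
    # deviation width max{sigma sqrt(3 log(1/d)/n), 4 b log(1/d)/n}
    L = math.log(1 / delta)
    return max(sigma * math.sqrt(3 * L / n), 4 * b * L / n)

def hoeffding_width(b, n, delta):
    # best width b sqrt(2 log(1/d)/n) from a pure sub-Gaussian bound
    return b * math.sqrt(2 * math.log(1 / delta) / n)

sigma, b, n, delta = 0.1, 1.0, 10_000, 1e-3   # sigma^2 << b^2
w_bernstein = bernstein_width(sigma, b, n, delta)
w_hoeffding = hoeffding_width(b, n, delta)
print(w_bernstein < w_hoeffding)
```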
The exercises ask you to work out further variants of these results, including the sub-exponential
behavior of quadratic forms of Gaussian random vectors. As one particular example, Exercises 4.10
and 4.11 work through the details of proving the following corollary.
Corollary 4.1.19. Let Z ∼ N(0, 1). Then for any μ ∈ R, (μ + Z)² is (4(1 + 2μ²), 4)-sub-exponential, and more precisely,
    E[exp(λ((μ + Z)² − (μ² + 1)))] ≤ exp(2λ²μ²/(1 − 2λ) + λ²/[1 − 2|λ|]₊).
Additionally, if Z ∼ N(0, I), then for any matrix A and vector b, ∥AZ − b∥₂² is sub-exponential with
    E[exp(λ(∥AZ − b∥₂² − ∥A∥²_Fr − ∥b∥₂²))] ≤ exp(2λ²(∥A∥²_Fr + 2∥b∥₂²)) for |λ| ≤ 1/(4∥A∥²_op).
where inequality (i) used the Bernstein condition (4.1.7). Noting that 1+x ≤ ex gives the result.
As one final example, we return to Bennett’s inequality (4.1.5) from Example 4.1.14.
Proof We assume without loss of generality that the Xi are mean zero. Using the standard Chernoff bound argument coupled with inequality (4.1.5), we see that
    P(Σ_{i=1}^n Xi ≥ t) ≤ exp(Σ_{i=1}^n (σi²/b²)(e^{λb} − 1 − λb) − λt).
Letting h(t) = (1 + t) log(1 + t) − t as in the statement of the proposition and σ² = Σ_{i=1}^n σi², we minimize over λ ≥ 0, setting λ = (1/b) log(1 + bt/σ²). Substituting into our Chernoff bound application gives the proposition.
A slightly more intuitive writing of Bennett’s inequality is to use averages, in which case for σ̄² = (1/n)Σ_{i=1}^n σi² the average of the variances,
    P((1/n) Σ_{i=1}^n Xi ≥ t) ≤ exp(−(nσ̄²/b²) h(bt/σ̄²)).
That this is a norm is not completely trivial, though a few properties are immediate: clearly ∥aX∥ψ = |a|∥X∥ψ, and we have ∥X∥ψ = 0 if and only if X = 0 with probability 1. The key result is that ∥·∥ψ is in fact convex, which then guarantees that it is a norm.
Proposition 4.1.22. The function ∥·∥ψ is convex on the space of random variables.
Proof Because ψ is convex and non-decreasing, x ↦ ψ(|x|) is convex as well. (Convince yourself of this.) Thus, its perspective transform pers(ψ)(t, |x|) := tψ(|x|/t) is jointly convex in both t ≥ 0 and x (see Appendix B.3.3). This joint convexity of pers(ψ) implies that for any random variables X0 and X1 and t0, t1,
    E[pers(ψ)(λt0 + (1 − λ)t1, |λX0 + (1 − λ)X1|)] ≤ λE[pers(ψ)(t0, |X0|)] + (1 − λ)E[pers(ψ)(t1, |X1|)];
that is, the triangle inequality holds. This implies that centering a variable can never increase its norm by much:
    ∥X − E[X]∥ψ ≤ ∥X∥ψ + ∥E[X]∥ψ ≤ ∥X∥ψ + ∥X∥ψ
by Jensen’s inequality, so that ∥X − E[X]∥ψ ≤ 2∥X∥ψ.
We can recover several standard norms on random variables, including some we have already implicitly used. The first are the classical Lp norms, where we take ψ(t) = t^p, in which case
    ∥X∥ψ = inf{t > 0 : E[|X/t|^p] ≤ 1} = E[|X|^p]^{1/p}.
We also have what we term the sub-Gaussian and sub-exponential norms, which we denote by considering the functions
    ψp(x) := exp(|x|^p) − 1.
These induce the Orlicz ψp-norms, as for p ≥ 1, these are convex (as they are the composition of the increasing convex function exp(·) applied to the nonnegative convex function |·|^p). Theorem 4.1.11 shows that we have a sub-Gaussian norm
    ∥X∥ψ2 := inf{t > 0 : E[exp(X²/t²)] ≤ 2},    (4.1.10)
and, similarly, a sub-exponential norm
    ∥X∥ψ1 := inf{t > 0 : E[exp(|X|/t)] ≤ 2}.    (4.1.11)
Many relationships follow immediately from the definitions (4.1.10) and (4.1.11). For example, the definition of the ψp-norms immediately implies that a sub-Gaussian random variable (whether or not it is mean zero) has a sub-exponential square:
    ∥X∥²_ψ2 = ∥X²∥_ψ1.
Additionally,
    ∥XY∥_ψ1 ≤ ∥X∥_ψ2 ∥Y∥_ψ2.
Proof We prove only the second statement. Because xy ≤ x²/(2η) + ηy²/2 for any x, y and any η > 0, for any t > 0 we have
    E[exp(|XY|/t)] ≤ E[exp(X²/(2ηt) + ηY²/(2t))] ≤ E[exp(X²/(ηt))]^{1/2} E[exp(ηY²/t)]^{1/2}
by the Cauchy–Schwarz inequality. In particular, if we take t = ∥X∥_ψ2∥Y∥_ψ2, then the choice η = ∥X∥_ψ2/∥Y∥_ψ2 gives E[exp(X²/(ηt))] ≤ 2 and E[exp(ηY²/t)] ≤ 2, so that E[exp(|XY|/t)] ≤ 2.
By tracing through the arguments in the proofs of Theorems 4.1.11 and 4.1.15, we can also see that we have the equivalences
    ∥X∥_ψ2 ≍ sup_{k∈N} E[|X|^k]^{1/k}/√k and ∥X∥_ψ1 ≍ sup_{k∈N} E[|X|^k]^{1/k}/k.
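These norms are also computable directly from the definition ∥X∥_ψ2 = inf{t > 0 : E[exp(X²/t²)] ≤ 2}. The bisection sketch below (hypothetical code, not from the notes) recovers the closed form 1/√log 2 for a Rademacher variable, whose square is identically 1:

```python
import math

def psi2_norm(mgf_of_square, lo=0.1, hi=100.0, iters=200):
    # bisection for inf{t > 0 : E[exp(X^2 / t^2)] <= 2}, where
    # mgf_of_square(s) = E[exp(s X^2)]
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mgf_of_square(1.0 / mid ** 2) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

# Rademacher X has X^2 = 1, so E[exp(s X^2)] = e^s and the
# psi_2 norm is exactly 1 / sqrt(log 2)
rademacher_norm = psi2_norm(lambda s: math.exp(s))
print(abs(rademacher_norm - 1.0 / math.sqrt(math.log(2.0))) < 1e-6)
```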
Corollary 4.1.24. Let X be any random variable with ∥X∥_ψ1 < ∞. Then for all t ≥ 0,
    P(|X| ≥ t) ≤ 2 exp(−t/∥X∥_ψ1),
and if additionally E[X] = 0, then E[exp(λX/∥X∥_ψ1)] ≤ exp(4λ²) for |λ| ≤ 1/2.
Proof The first statement is nearly trivial: we have by the Chernoff bounding method that
    P(|X| ≥ t) ≤ E[exp(|X|/∥X∥_ψ1)] exp(−t/∥X∥_ψ1) ≤ 2 exp(−t/∥X∥_ψ1)
by definition of the ψ1-norm. For the second, we mimic the proof of Theorem 4.1.15: because E[Z] = ∫_0^∞ P(Z ≥ t)dt for Z ≥ 0, we have
    E[|X|^k]/∥X∥^k_ψ1 ≤ ∫_0^∞ P(|X|/∥X∥_ψ1 ≥ t^{1/k})dt = k ∫_0^∞ P(|X|/∥X∥_ψ1 ≥ u)u^{k−1}du ≤ 2k ∫_0^∞ u^{k−1}e^{−u}du
using the substitution u^k = t. Rearranging yields E[|X|^k] ≤ 2∥X∥^k_ψ1 Γ(k + 1) = 2∥X∥^k_ψ1 k!. Then computing the moment generating function, we obtain
    E[exp(λX/∥X∥_ψ1)] ≤ 1 + Σ_{k=2}^∞ λ^k E[|X|^k]/(∥X∥^k_ψ1 k!) ≤ 1 + 2 Σ_{k=2}^∞ |λ|^k = 1 + 2λ²/(1 − |λ|)
for |λ| < 1. For |λ| ≤ 1/2, we use 1 + x ≤ e^x to obtain E[exp(λX/∥X∥_ψ1)] ≤ exp(4λ²), which is the desired result.
Depending on the norm chosen, this task may be impossible; for the Euclidean (ℓ2) norm, however, such an embedding is easy to construct using Gaussian random variables and with m = O((1/ϵ²) log n). This embedding is known as the Johnson–Lindenstrauss embedding. Note that this size m is independent of the dimension d, only depending on the number of points n.
Let Φ ∈ R^{m×d} be a matrix with i.i.d. N(0, 1/m) entries, and let Φi ∈ R^d denote the ith row of this matrix. We claim that
    m ≥ (8/ϵ²)(2 log n + log(1/δ)) implies ∥Φui − Φuj∥₂² ∈ (1 ± ϵ)∥ui − uj∥₂²
for all pairs ui, uj with probability at least 1 − δ. In particular, m ≳ (log n)/ϵ² is sufficient to achieve accurate dimension reduction with high probability.
To see this, note that for any fixed vector u,
    ⟨Φi, u⟩/∥u∥₂ ∼ N(0, 1/m), and ∥Φu∥₂²/∥u∥₂² = Σ_{i=1}^m ⟨Φi, u/∥u∥₂⟩²
is a sum of independent scaled χ²-random variables. In particular, we have E[∥Φu/∥u∥₂∥₂²] = 1, and using the χ²-concentration result of Example 4.1.13 yields
    P(|∥Φu∥₂²/∥u∥₂² − 1| ≥ ϵ) = P(m|∥Φu∥₂²/∥u∥₂² − 1| ≥ mϵ) ≤ 2 inf_{|λ|≤1/4} exp(2mλ² − λmϵ) = 2 exp(−mϵ²/8),
the last inequality holding for ϵ ∈ [0, 1]. Now, using the union bound applied to each of the (n choose 2) pairs (ui, uj) in the sample, we have
    P(there exist i ≠ j s.t. |∥Φ(ui − uj)∥₂² − ∥ui − uj∥₂²| ≥ ϵ∥ui − uj∥₂²) ≤ 2(n choose 2) exp(−mϵ²/8) ≤ n² exp(−mϵ²/8).
Taking m ≥ (8/ϵ²) log(n²/δ) = (16/ϵ²) log n + (8/ϵ²) log(1/δ) yields that with probability at least 1 − δ, we have ∥Φui − Φuj∥₂² ∈ (1 ± ϵ)∥ui − uj∥₂². ◁
Computing low-dimensional embeddings of high-dimensional data is an area of active research,
and more recent work has shown how to achieve sharper constants [63] and how to use more struc-
tured matrices to allow substantially faster computation of the embeddings Φu (see, for example,
Achlioptas [2] for early work in this direction, and Ailon and Chazelle [5] for the so-called “Fast
Johnson-Lindenstrauss transform”).
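A direct simulation (a hypothetical sketch; the parameter choices are ours) shows the embedding at work: projecting n = 20 points from d = 1000 dimensions down to m ≈ (8/ϵ²) log(n²/δ) dimensions preserves all pairwise squared distances to within a factor 1 ± ϵ:

```python
import math, random

random.seed(4)
n, d, eps, delta = 20, 1000, 0.9, 0.01
m = math.ceil(8 / eps ** 2 * (2 * math.log(n) + math.log(1 / delta)))

# projection matrix with i.i.d. N(0, 1/m) entries
Phi = [[random.gauss(0, 1 / math.sqrt(m)) for _ in range(d)] for _ in range(m)]
pts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
proj = [[sum(row[k] * p[k] for k in range(d)) for row in Phi] for p in pts]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

distorted = 0
for i in range(n):
    for j in range(i + 1, n):
        ratio = sqdist(proj[i], proj[j]) / sqdist(pts[i], pts[j])
        if not (1 - eps <= ratio <= 1 + eps):
            distorted += 1
print(m, distorted)
```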
the maximum likelihood decoder. We now investigate how to choose a collection {x1 , . . . , xm }
of such codewords and give finite sample bounds on its probability of error. In fact, by using
concentration inequalities, we can show that a randomly drawn codebook of fairly small dimension
is likely to enjoy good performance.
Intuitively, if our codebook {x1, . . . , xm} ⊂ {0, 1}^d is well-separated, meaning that each pair of words xi, xk satisfies ∥xi − xk∥₁ ≥ cd for some numerical constant c > 0, we should be unlikely to make a mistake. Let us make this precise. We mistake word i for word k only if the received signal Z satisfies ∥Z − xi∥₁ ≥ ∥Z − xk∥₁, and letting J = {j ∈ [d] : xij ≠ xkj} denote the set of at least c · d indices where xi and xk differ, we have
    ∥Z − xi∥₁ ≥ ∥Z − xk∥₁ if and only if Σ_{j∈J} (|Zj − xij| − |Zj − xkj|) ≥ 0.
If xi is the word being sent and xi and xk differ in position j, then |Zj − xij| − |Zj − xkj| ∈ {−1, 1}, and is equal to −1 with probability (1 − ϵ) and 1 with probability ϵ. That is, we have ∥Z − xi∥₁ ≥ ∥Z − xk∥₁ if and only if
    Σ_{j∈J} (|Zj − xij| − |Zj − xkj|) + |J|(1 − 2ϵ) ≥ |J|(1 − 2ϵ) ≥ cd(1 − 2ϵ),
and the expectation EQ[|Zj − xij| − |Zj − xkj| | xi] = −(1 − 2ϵ) when xij ≠ xkj. Using the Hoeffding bound, then, we have
    P(∥Z − xi∥₁ ≥ ∥Z − xk∥₁ | xi) ≤ exp(−2(|J|(1 − 2ϵ))²/(4|J|)) = exp(−(1/2)|J|(1 − 2ϵ)²) ≤ exp(−(1/2)cd(1 − 2ϵ)²),
where we have used that there are at least |J| ≥ cd indices differing between xi and xk. The probability of making a mistake at all is thus at most m exp(−(1/2)cd(1 − 2ϵ)²) if our codebook has separation c · d.
For low error decoding to occur with extremely high probability, it is thus sufficient to choose a set of code words {x1, . . . , xm} that is well separated. To that end, we state a simple lemma.
Lemma. Let X1, . . . , Xm be drawn independently and uniformly on the hypercube {0, 1}^d. Then for any t ≥ 0,
    P(∃ i ≠ j s.t. ∥Xi − Xj∥₁ < d/2 − dt) ≤ (m choose 2) exp(−2dt²) ≤ (m²/2) exp(−2dt²).
Proof First, let us consider two independent draws X and X′ uniformly on the hypercube. Let Z = Σ_{j=1}^d 1{Xj ≠ X′j} = d_ham(X, X′) = ∥X − X′∥₁. Then E[Z] = d/2. Moreover, Z is an i.i.d. sum of Bernoulli(1/2) random variables, so that by our concentration bounds of Corollary 4.1.10, we have
    P(∥X − X′∥₁ ≤ d/2 − t) ≤ exp(−2t²/d).
Using a union bound gives the remainder of the result.
(ii) By taking m ≤ exp(d/32), or d ≥ 32 log m, and δ = e^{−d/32}, then with probability at least 1 − e^{−d/32}—exponentially close to 1 in d—a randomly drawn codebook has all its entries separated by at least ∥xi − xj∥₁ ≥ d/4.
Take
    d ≥ max{32 log m, 8 log(m/δ)/(1 − 2ϵ)²}.
Then with probability at least 1 − 1/m over the draw of the codebook, the probability we make a mistake in transmission of any given symbol i over the channel Q is at most δ.
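The construction invites simulation. The sketch below (hypothetical code and parameters, not from the notes) draws a random codebook with d well above 32 log m, checks the d/4 separation, and decodes noisy transmissions over a binary symmetric channel by nearest Hamming distance:

```python
import random

random.seed(5)
d, m, eps = 256, 8, 0.1         # d = 256 comfortably exceeds 32 log m
book = [[random.randrange(2) for _ in range(d)] for _ in range(m)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

min_sep = min(hamming(book[i], book[j])
              for i in range(m) for j in range(i + 1, m))

errors, trials = 0, 500
for _ in range(trials):
    i = random.randrange(m)
    # transmit codeword i over BSC(eps): flip each bit independently w.p. eps
    z = [b ^ (random.random() < eps) for b in book[i]]
    decoded = min(range(m), key=lambda k: hamming(z, book[k]))
    errors += (decoded != i)
print(min_sep, errors)
```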
for all n ∈ N.
There are numerous examples of martingale sequences. The classical one is the symmetric random walk.
Example 4.2.1: Let Dn ∈ {±1} be uniform and independent. Then the Dn form a martingale difference sequence adapted to themselves (that is, we may take Zn = Dn), and Mn = Σ_{i=1}^n Di is a martingale. ◁
A more sophisticated example, to which we will frequently return and that suggests the potential
usefulness of martingale constructions, is the Doob martingale associated with a function f .
Example 4.2.2 (Doob martingales): Let f : X n → R be an otherwise arbitrary function,
and let X1 , . . . , Xn be arbitrary random variables. The Doob martingale is defined by the
difference sequence
$$D_i := E[f(X_1^n) \mid X_1^i] - E[f(X_1^n) \mid X_1^{i-1}].$$

By inspection, the $D_i$ are functions of $X_1^i$, and we have
by the tower property of expectations. Thus, the Di satisfy Definition 4.4 of a martingale
difference sequence, and moreover, we have
$$\sum_{i=1}^n D_i = f(X_1^n) - E[f(X_1^n)],$$

and so the Doob martingale captures exactly the difference between $f$ and its expectation. ◁
Proof   The proof is essentially immediate: letting $Z_n$ be the sequence to which the $D_n$ are
adapted, we write

$$E[\exp(\lambda M_n)] = E\left[\prod_{i=1}^n e^{\lambda D_i}\right] = E\left[E\left[\prod_{i=1}^n e^{\lambda D_i} \,\Big|\, Z_1^{n-1}\right]\right] = E\left[E\left[\prod_{i=1}^{n-1} e^{\lambda D_i} \,\Big|\, Z_1^{n-1}\right] E[e^{\lambda D_n} \mid Z_1^{n-1}]\right]$$
because $D_1, \ldots, D_{n-1}$ are functions of $Z_1^{n-1}$. Then we use Definition 4.5, which implies that
$E[e^{\lambda D_n} \mid Z_1^{n-1}] \le e^{\lambda^2 \sigma_n^2/2}$, and we obtain

$$E[\exp(\lambda M_n)] \le E\left[\prod_{i=1}^{n-1} e^{\lambda D_i}\right] \exp\left(\frac{\lambda^2 \sigma_n^2}{2}\right).$$
as desired.
The second claims are simply applications of Chernoff bounds via Proposition 4.1.8 and that
E[Mn ] = 0.
Corollary 4.2.4. Let $D_i$ be a bounded difference martingale difference sequence, meaning that
$|D_i| \le c$. Then $M_n = \sum_{i=1}^n D_i$ satisfies

$$P(n^{-1/2} M_n \ge t) \vee P(n^{-1/2} M_n \le -t) \le \exp\left(-\frac{t^2}{2c^2}\right) \quad \text{for } t \ge 0.$$
Thus, bounded random walks are (with high probability) within $\pm\sqrt{n}$ of their expectations after
$n$ steps.
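A small simulation, with illustrative parameters of our own choosing, confirms this scaling for the symmetric random walk of Example 4.2.1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10_000, 200

# 200 independent symmetric +/-1 random walks of length n
steps = rng.choice([-1, 1], size=(trials, n))
M_n = steps.sum(axis=1)

# Corollary 4.2.4 with c = 1: P(|M_n| / sqrt(n) >= t) <= 2 exp(-t^2 / 2)
t = 2.5
frac = float(np.mean(np.abs(M_n) / np.sqrt(n) >= t))
bound = 2 * np.exp(-t**2 / 2)
print(frac, bound)  # the empirical tail fraction sits below the bound
```

Here the bound is about $0.088$, while the empirical tail fraction is an order of magnitude smaller, reflecting the slack in Hoeffding-type bounds for this case.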
There exist extensions of these inequalities to the cases where we control the variance of the
martingales; see Freedman [96].
Definition 4.6 (Bounded differences). Let $f : \mathcal{X}^n \to \mathbb{R}$ for some space $\mathcal{X}$. Then $f$ satisfies
bounded differences with constants $c_i$ if for each $i \in \{1, \ldots, n\}$, all $x_1^n \in \mathcal{X}^n$, and $x_i' \in \mathcal{X}$ we have

$$|f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x_i', x_{i+1}^n)| \le c_i.$$
The classical inequality relating bounded differences and concentration is McDiarmid's inequality, or the bounded differences inequality.

Proposition 4.2.5 (McDiarmid's inequality). Let $X_1, \ldots, X_n$ be independent and let $f$ satisfy bounded differences with constants $c_1, \ldots, c_n$. Then

$$P\left(f(X_1^n) - E[f(X_1^n)] \ge t\right) \vee P\left(f(X_1^n) - E[f(X_1^n)] \le -t\right) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$
Proof   The basic idea is to show that the Doob martingale (Example 4.2.2) associated with $f$ is
$c_i^2/4$-sub-Gaussian, and then to simply apply the Azuma-Hoeffding inequality. To that end, define
$D_i = E[f(X_1^n) \mid X_1^i] - E[f(X_1^n) \mid X_1^{i-1}]$ as before, and note that $\sum_{i=1}^n D_i = f(X_1^n) - E[f(X_1^n)]$. The
random variables
where we have used the independence of the $X_i$ and Definition 4.6 of bounded differences. Consequently, we have by Hoeffding's Lemma (Example 4.1.6) that $E[e^{\lambda D_i} \mid X_1^{i-1}] \le \exp(\lambda^2 c_i^2/8)$, that
is, the Doob martingale is $c_i^2/4$-sub-Gaussian.
The remainder of the proof is simply Theorem 4.2.3.
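The following sketch, with an illustrative bounded-differences function of our own choosing, checks the inequality numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 2_000, 500

# f(x_1, ..., x_n) = fraction of the x_i landing in [0, 1/3]; replacing a
# single coordinate changes f by at most c_i = 1/n (Definition 4.6)
X = rng.random(size=(trials, n))
f = (X <= 1 / 3).mean(axis=1)

t = 0.03
frac = float(np.mean(np.abs(f - 1 / 3) >= t))
bound = 2 * np.exp(-2 * t**2 / (n * (1 / n) ** 2))  # sum_i c_i^2 = 1/n
print(frac, bound)  # empirical two-sided tail vs. McDiarmid's bound
```

With these parameters the bound is roughly $0.055$, comfortably above the observed tail fraction.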
A number of quantities satisfy the conditions of Proposition 4.2.5, and we give two examples
here; we will revisit them more later.
Example 4.2.6 (Bounded random vectors): Let B be a Banach space—a complete normed
vector space—with norm ∥·∥. Let Xi be independent bounded random vectors in B satisfying
$E[X_i] = 0$ and $\|X_i\| \le c$. We claim that the quantity

$$f(X_1^n) := \left\|\frac{1}{n}\sum_{i=1}^n X_i\right\|$$

satisfies bounded differences with constants $c_i = \frac{2c}{n}$, because

$$|f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n)| \le \frac{1}{n}\|x - x'\| \le \frac{2c}{n}.$$
Consequently, if the $X_i$ are independent, we have

$$P\left(\left|\,\left\|\frac{1}{n}\sum_{i=1}^n X_i\right\| - E\left\|\frac{1}{n}\sum_{i=1}^n X_i\right\|\,\right| \ge t\right) \le 2\exp\left(-\frac{nt^2}{2c^2}\right) \tag{4.2.1}$$
for all t ≥ 0. That is, the norm of (bounded) random vectors in an essentially arbitrary vector
space concentrates extremely quickly about its expectation.
The challenge becomes to control the expectation term in the concentration bound (4.2.1), which can be delicate. In certain cases—for example, when we have a Euclidean structure on the vectors $X_i$—it is easier. Indeed, let us specialize to the case that $X_i \in \mathcal{H}$,
a (real) Hilbert space, so that there is an inner product $\langle\cdot,\cdot\rangle$ and the norm satisfies $\|x\|^2 = \langle x, x\rangle$
for $x \in \mathcal{H}$. Then Cauchy-Schwarz implies that
$$\left(E\left\|\sum_{i=1}^n X_i\right\|\right)^2 \le E\left\|\sum_{i=1}^n X_i\right\|^2 = \sum_{i,j} E[\langle X_i, X_j\rangle] = \sum_{i=1}^n E[\|X_i\|^2].$$
That is, assuming the $X_i$ are independent and $E[\|X_i\|^2] \le \sigma^2$, inequality (4.2.1) becomes

$$P\left(\left\|\overline{X}_n\right\| \ge \frac{\sigma}{\sqrt{n}} + t\right) + P\left(\left\|\overline{X}_n\right\| \le -\frac{\sigma}{\sqrt{n}} - t\right) \le 2\exp\left(-\frac{nt^2}{2c^2}\right)$$

where $\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. ◁
We can specialize Example 4.2.6 to a situation that is very important for treatments of concen-
tration, sums of random vectors, and generalization bounds in machine learning.
Example 4.2.7 (Rademacher complexities): This example is actually a special case of Ex-
ample 4.2.6, but its frequent uses justify a more specialized treatment and consideration. Let
$\mathcal{X}$ be some space, and let $\mathcal{F}$ be some collection of functions $f : \mathcal{X} \to \mathbb{R}$. Let $\varepsilon_i \in \{-1, 1\}$ be a
collection of independent random signs. Then the empirical Rademacher complexity of
$\mathcal{F}$ is

$$R_n(\mathcal{F} \mid x_1^n) := E\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\right],$$
where the expectation is over only the random signs $\varepsilon_i$. (In some cases, depending on context
and convenience, one takes the absolute value $|\sum_i \varepsilon_i f(x_i)|$.) The Rademacher complexity of
$\mathcal{F}$ is

$$R_n(\mathcal{F}) := E[R_n(\mathcal{F} \mid X_1^n)],$$

the expectation of the empirical Rademacher complexities.
If $f : \mathcal{X} \to [b_0, b_1]$ for all $f \in \mathcal{F}$, then the empirical Rademacher complexity satisfies bounded
differences, because for any two sequences $x_1^n$ and $z_1^n$ differing in only element $j$, we have

$$n|R_n(\mathcal{F} \mid x_1^n) - R_n(\mathcal{F} \mid z_1^n)| \le E\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \varepsilon_i(f(x_i) - f(z_i))\right] = E\left[\sup_{f \in \mathcal{F}} \varepsilon_j(f(x_j) - f(z_j))\right] \le b_1 - b_0.$$
Consequently, $R_n(\mathcal{F} \mid X_1^n) - R_n(\mathcal{F})$ is $\frac{(b_1-b_0)^2}{4n}$-sub-Gaussian by Theorem 4.2.3. ◁
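Empirical Rademacher complexities are also straightforward to estimate by Monte Carlo; the finite threshold class below is a hypothetical example of our own:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)

# finite class of threshold functions f_c(x) = 1{x <= c}
thresholds = np.linspace(-2.0, 2.0, 9)
F = (x[None, :] <= thresholds[:, None]).astype(float)  # shape (|F|, n)

# Monte Carlo estimate of R_n(F | x_1^n) = E[ sup_f (1/n) sum_i eps_i f(x_i) ]
mc = 2_000
eps = rng.choice([-1.0, 1.0], size=(mc, n))
Rn_emp = float((eps @ F.T / n).max(axis=1).mean())
print(Rn_emp)  # on the order of sqrt(2 log|F| / n), here below about 0.09
```

The estimate respects the finite-class bound $\sqrt{2\log|\mathcal{F}|/n}$ (here about $0.094$), anticipating Massart's bound proved later.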
These examples warrant more discussion, and it is possible to argue that many variants of these
random variables are well-concentrated. For example, instead of functions we may simply consider
an arbitrary set $A \subset \mathbb{R}^n$ and define the random variable

$$Z(A) := \sup_{a \in A}\langle a, \varepsilon\rangle = \sup_{a \in A}\sum_{i=1}^n a_i \varepsilon_i.$$
As a function of the random signs $\varepsilon_i$, we may write $Z(A) = f(\varepsilon)$, and this is then a function
satisfying $|f(\varepsilon) - f(\varepsilon')| \le \sup_{a \in A}|\langle a, \varepsilon - \varepsilon'\rangle|$, so that if $\varepsilon$ and $\varepsilon'$ differ in index $i$, we have $|f(\varepsilon) - f(\varepsilon')| \le 2\sup_{a \in A}|a_i|$. That is, $Z(A) - E[Z(A)]$ is $\sum_{i=1}^n \sup_{a \in A} a_i^2$-sub-Gaussian.
as elements of this vector space $L$. (Here we have used $\mathbf{1}_{X_i}$ to denote the point mass at $X_i$.)
Then the Rademacher complexity is nothing more than the expected norm of $P_n^0$, a random
vector, as in Example 4.2.6. This view is somewhat sophisticated, but it shows that any general
results we may prove about random vectors, as in Example 4.2.6, will carry over immediately
to versions of the Rademacher complexity. ◁
where λmax and λmin denote maximal and minimal eigenvalues, respectively. Our approach will
be to generalize the approach using moment generating functions, though this becomes non-trivial
because there is no immediately obvious analogue of the tensorization identities we have for scalars.
While in the scalar case, for a sum $S_n = \sum_{i=1}^n X_i$ of independent random variables, we have

$$e^{\lambda S_n} = \prod_{i=1}^n e^{\lambda X_i},$$
such an identity fails for matrices, because their exponentials (typically) fail to commute.
To develop the basic matrix concentration inequalities we provide, we require a brief review of
matrix calculus and operator functions. We shall typically work with Hermitian matrices $A \in \mathbb{C}^{d\times d}$,
meaning that $A = A^*$, where $A^*$ denotes the Hermitian transpose of $A$, whose entries are $(A^*)_{ij} = \overline{A_{ji}}$, the conjugate of $A_{ji}$. We work in this generality for two reasons: first, because such matrices
admit the spectral decompositions we require to develop the operators we use, and second, because
we often will encounter random matrices with symmetric distributions, meaning that $X \stackrel{\mathrm{dist}}{=} -X$,
which can lead to confusion.
With this, we give a brief review of some properties of Hermitian matrices and some associated
matrix operators. Let $\mathbb{H}^d := \{A \in \mathbb{C}^{d\times d} \mid A^* = A\}$ be the Hermitian matrices. The spectral
theorem gives that any $A \in \mathbb{H}^d$ admits the spectral decomposition $A = U\Lambda U^*$, where $\Lambda$ is
the diagonal matrix of the (necessarily) real eigenvalues of $A$ and $U \in \mathbb{C}^{d\times d}$ is unitary, so that
$U^*U = UU^* = I$. For a function $f : \mathbb{R} \to \mathbb{R}$, we can then define its operator extension to $\mathbb{H}^d$ by

$$f(A) := U f(\Lambda) U^* = U \operatorname{diag}\left(f(\lambda_1(A)), \ldots, f(\lambda_d(A))\right) U^*,$$

where $A$ has spectral decomposition $A = U\Lambda U^*$ and $\lambda_i(A)$ denotes the $i$th eigenvalue of $A$. Because
we wish to mimic the approach based on moment generating functions that yields our original sub-
Gaussian and sub-Exponential concentration inequalities in Chapter 4, the most important function
for us will be the exponential, which evidently satisfies
$$\exp(A) = \sum_{k=0}^{\infty} \frac{1}{k!} A^k,$$
For $A, B \in \mathbb{C}^{m\times n}$ we define $\langle A, B\rangle = \operatorname{tr}(A^*B)$, while the space of Hermitian matrices admits the
real inner product $\langle A, B\rangle := \operatorname{tr}(A^*B)$. (See Exercise 4.14.) The spectral theorem also shows the
standard identity that $\operatorname{tr}(A) = \sum_{j=1}^d \lambda_j(A)$ for $A \in \mathbb{H}^d$.
To analogize our approach with real-valued random variables, we begin with the Chernoff bound,
Proposition 4.1.3. Here, we have the following observation:
Proof   First, apply the standard Chernoff bound to the random variable $\lambda_{\max}(X)$, which gives
for any $\lambda > 0$ that

$$P(\lambda_{\max}(X) \ge t) \le E[e^{\lambda\cdot\lambda_{\max}(X)}]e^{-\lambda t}.$$

Then observe that by definition of the matrix exponential, we have $e^{\lambda\cdot\lambda_{\max}(X)} \le \operatorname{tr}(e^{\lambda X})$, because
the eigenvalues of $e^{\lambda X}$ are all positive.
We would like now to provide some type of general tensorization identity for matrices, in analogy
with Propositions 4.1.9 or 4.1.17. Unfortunately, this breaks down: for Hermitian A, B, we have
$$e^{A+B} = e^A e^B$$
if and only if A and B commute [153], so that they are simultaneously diagonalizable. Nonetheless,
we have the following inequality, which will be the key to extending the standard one-dimensional
approach to concentration:
Proposition 4.3.2 (Golden-Thompson inequality). For Hermitian matrices $A, B$,

$$\operatorname{tr}(e^{A+B}) \le \operatorname{tr}(e^A e^B).$$
While the proof is essentially elementary, it is not central to our development, so we defer it to
Section 4.4.4. We remark in passing that there is a converse [153, Section 3]: tr(eA+B ) = tr(eA eB ) if
and only if AB = BA, that is, A and B are simultaneously diagonalizable. With Proposition 4.3.2 in
hand, however, we can develop matrix analogues of the Hoeffding and Bernstein-type concentration
bounds in Chapter 4.
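Because the Golden-Thompson inequality holds for every Hermitian pair, it is easy to spot-check numerically; the following sketch of ours does so on random symmetric matrices:

```python
import numpy as np

def expm_h(A):
    """exp(A) for a Hermitian matrix A via its spectral decomposition."""
    evals, U = np.linalg.eigh(A)
    return (U * np.exp(evals)) @ U.conj().T

rng = np.random.default_rng(5)
gap_min = np.inf
for _ in range(100):
    X, Y = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
    A, B = (X + X.T) / 2, (Y + Y.T) / 2    # real symmetric, hence Hermitian
    lhs = np.trace(expm_h(A + B)).real     # tr e^{A+B}
    rhs = np.trace(expm_h(A) @ expm_h(B)).real  # tr(e^A e^B)
    gap_min = min(gap_min, rhs - lhs)
print(gap_min)  # nonnegative (up to floating-point error) on every pair
```

The gap `rhs - lhs` is zero exactly when the two matrices commute, consistent with the converse remark above.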
We begin with Azuma-Hoeffding-type bounds, which analogize Theorem 4.2.3. The key to allow
an iterative “peeling” off of individual terms in a sum of random matrices is the following result:
Proof   Let $X'$ be an independent copy of $X$. Then because the trace exponential $\operatorname{tr}(e^X)$ is convex
on the Hermitian matrices (see Exercise 4.16), we have

$$\operatorname{tr}(E[e^{H+X}]) = \operatorname{tr}(E[e^{H+(X - E[X'])}]) \le \operatorname{tr}(E[e^{H+X-X'}])$$
by Jensen's inequality. Introducing the random sign $\varepsilon \in \{\pm 1\}$, we have by symmetry that $X - X' \stackrel{\mathrm{dist}}{=} \varepsilon(X - X')$, and so

$$\operatorname{tr}(E[e^{H+X}]) \le E[\operatorname{tr}(e^{H+\varepsilon X - \varepsilon X'})] = E[\operatorname{tr}(e^{H/2+\varepsilon X + H/2 - \varepsilon X'})] \le E[\operatorname{tr}(e^{H/2+\varepsilon X} e^{H/2-\varepsilon X'})],$$
where the second inequality follows from Proposition 4.3.2. Now we use that for Hermitian matrices
$A, B$, we have $\operatorname{tr}(AB) \le \|A\|_2\|B\|_2 = \operatorname{tr}(A^2)^{1/2}\operatorname{tr}(B^2)^{1/2}$, so that

$$E[\operatorname{tr}(e^{H/2+\varepsilon X}e^{H/2-\varepsilon X'})] \le E\left[\operatorname{tr}(e^{H+2\varepsilon X})^{1/2}\operatorname{tr}(e^{H-2\varepsilon X'})^{1/2}\right] \le E[\operatorname{tr}(e^{H+2\varepsilon X})]^{1/2}\,E[\operatorname{tr}(e^{H-2\varepsilon X'})]^{1/2}$$

by Cauchy-Schwarz. Because $\varepsilon X \stackrel{\mathrm{dist}}{=} -\varepsilon X'$, the lemma follows.
This allows us to perform the type of “peeling-off” argument, addressing one term in the sum
at a time, that gives tight enough moment generating function bounds.
where $\varepsilon \in \{\pm 1\}$ is an independent random sign and we have used independence. Now, we use the
following calculation: if $X$ is Hermitian and $\varepsilon$ a random sign, then

$$E[e^{\varepsilon X}] \preceq E[e^{X^2/2}]. \tag{4.3.1}$$
Temporarily deferring the argument for inequality (4.3.1), note that it immediately implies $E[e^{2\lambda\varepsilon X}] \preceq E[e^{2\lambda^2 X^2}]$. The convexity of the operator norm and that $\|X_n\|_{\mathrm{op}} \le b_n$ then imply

$$\left\|E[e^{2\lambda\varepsilon X_n}]\right\|_{\mathrm{op}} \le \left\|E[e^{2\lambda^2 X_n^2}]\right\|_{\mathrm{op}} \le E\left[e^{2\lambda^2\|X_n^2\|_{\mathrm{op}}}\right] \le e^{2\lambda^2 b_n^2}.$$
Repeating the argument by iteratively peeling off the last term $X_{n-i}$ for $S_{n-2}$ through $S_1$ then
yields

$$E[\operatorname{tr}(\exp(\lambda S_n))] \le \operatorname{tr}(I)\prod_{i=1}^n \exp(2\lambda^2 b_i^2),$$

which gives the theorem.
To see inequality (4.3.1), note that for any positive semidefinite $A$, we have $A \preceq tA$ for $t \ge 1$.
Then because $X^{2k} \succeq 0$ for all $k \in \mathbb{N}$ and $(2k)! \ge 2^k k!$, we have

$$E[e^{\varepsilon X}] = I + \sum_{k=1}^{\infty}\frac{E[X^{2k}]}{(2k)!} \preceq I + \sum_{k=1}^{\infty}\frac{E[(X^2)^k]}{2^k k!} = E[e^{X^2/2}],$$
Theorem 4.3.4 immediately implies the following corollary, whose argument parallels those in
Chapter 4 (e.g., Corollary 4.1.10).
Corollary 4.3.5. Let $X_i \in \mathbb{H}^d$ be independent mean-zero Hermitian matrices with $\|X_i\|_{\mathrm{op}} \le b_i$. Then $S_n := \sum_{i=1}^n X_i$
satisfies

$$P\left(\|S_n\|_{\mathrm{op}} \ge t\right) \le 2d\exp\left(-\frac{t^2}{8\sum_{i=1}^n b_i^2}\right).$$
If we have more direct bounds on E[eλXi ], then we can also employ those via a similar “peeling
off” the last term argument. By carefully controlling matrix moment generating functions in a way
similar to that we did in Example 4.1.14 to obtain sub-exponential behavior for bounded random
variables, we can give a matrix Bernstein-type inequality.
Theorem 4.3.6. Let $X_i$ be independent Hermitian matrices with $\|X_i\|_{\mathrm{op}} \le b$ and $\left\|E[X_i^2]\right\|_{\mathrm{op}} \le \sigma_i^2$.
Then $S_n = \sum_{i=1}^n X_i$ satisfies

$$P\left(\|S_n\|_{\mathrm{op}} \ge t\right) \le 2d\exp\left(-\min\left\{\frac{t^2}{4\sum_{i=1}^n\sigma_i^2}, \frac{3t}{4b}\right\}\right).$$
The proof of the theorem is similar to that of Theorem 4.3.4, so we leave it as an extended exercise
(Exercise 4.17).
We unpack the theorem a bit to give some intuition. Given a variance bound $\sigma^2$ such that
$E[X_i^2] \preceq \sigma^2 I$, the theorem states that

$$P\left(\left\|n^{-1}S_n\right\|_{\mathrm{op}} \ge t\right) \le 2d\exp\left(-\min\left\{\frac{nt^2}{4\sigma^2}, \frac{3nt}{4b}\right\}\right).$$
Letting $\delta \in (0,1)$ be arbitrary and setting $t = \max\left\{\frac{2\sigma}{\sqrt{n}}\sqrt{\log\frac{2d}{\delta}}, \frac{4b}{3n}\log\frac{d}{\delta}\right\}$, we have

$$\left\|\frac{1}{n}\sum_{i=1}^n X_i\right\|_{\mathrm{op}} \le \max\left\{\frac{2\sigma}{\sqrt{n}}\sqrt{\log\frac{2d}{\delta}}, \frac{4b}{3n}\log\frac{d}{\delta}\right\}$$
with probability at least 1 − δ. So we see the familiar sub-Gaussian and sub-exponential scaling of
the random sum.
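To see this scaling in a simulation (the rank-one random matrix model below is an illustrative choice of ours, not one from the text):

```python
import numpy as np

rng = np.random.default_rng(6)
d, trials = 10, 100

def avg_op_norm(n):
    """Average ||S_n / n||_op over trials for X_i = eps_i (e_j e_j^T - I/d),
    a mean-zero Hermitian model with ||X_i||_op <= 1 and E[X_i^2] bounded
    (in semidefinite order) by I/d."""
    norms = []
    for _ in range(trials):
        S = np.zeros((d, d))
        for eps, j in zip(rng.choice([-1.0, 1.0], size=n),
                          rng.integers(0, d, size=n)):
            X = -np.eye(d) / d
            X[j, j] += 1.0
            S += eps * X
        norms.append(np.linalg.norm(S, ord=2) / n)  # ord=2: operator norm
    return float(np.mean(norms))

r100, r400 = avg_op_norm(100), avg_op_norm(400)
print(r100, r400)  # sub-Gaussian regime: roughly halves as n quadruples
```

Quadrupling $n$ roughly halves $\|S_n/n\|_{\mathrm{op}}$, the $\sigma\sqrt{\log d}/\sqrt{n}$ behavior predicted by the bound.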
where for the last inequality we made the substitution $u = t^2/\sigma^2$. Noting that this final integral is
$\Gamma(k/2)$, we have $E[|X|^k] \le k\sigma^k\Gamma(k/2)$. Because $\Gamma(s) \le s^s$ for $s \ge 1$, we obtain

$$E[|X|^k]^{1/k} \le k^{1/k}\sigma\sqrt{k/2} \le e^{1/e}\sigma\sqrt{k}.$$

Thus (2) holds with $K_2 = e^{1/e}$.
(2) implies (3)   Let $\sigma = \sup_{k\ge 1} k^{-\frac{1}{2}}E[|X|^k]^{1/k}$, so that $K_2 = 1$ and $E[|X|^k] \le k^{k/2}\sigma^k$ for all $k$. For
$K_3 \in \mathbb{R}_+$, we thus have

$$E[\exp(X^2/(K_3^2\sigma^2))] = \sum_{k=0}^{\infty}\frac{E[X^{2k}]}{k!K_3^{2k}\sigma^{2k}} \le \sum_{k=0}^{\infty}\frac{\sigma^{2k}(2k)^k}{k!K_3^{2k}\sigma^{2k}} \stackrel{(i)}{\le} \sum_{k=0}^{\infty}\left(\frac{2e}{K_3^2}\right)^k,$$

where inequality (i) follows because $k! \ge (k/e)^k$, or $1/k! \le (e/k)^k$. Noting that $\sum_{k=0}^{\infty}\alpha^k = \frac{1}{1-\alpha}$,
we obtain (3) by taking $K_3 = e\sqrt{2/(e-1)} \approx 2.933$.
(3) implies (4)   Let us take $K_3 = 1$ and recall the assumption of (4) that $E[X] = 0$. We claim
that (4) holds with $K_4 = \frac{3}{4}$. We prove this result for both small and large $\lambda$. First, note the
(non-standard but true!) inequality that $e^x \le x + e^{9x^2/16}$ for all $x$. Then we have

$$E[\exp(\lambda X)] \le \underbrace{E[\lambda X]}_{=0} + E\left[\exp\left(\frac{9\lambda^2X^2}{16}\right)\right].$$
Now note that for $|\lambda| \le \frac{4}{3\sigma}$, we have $9\lambda^2\sigma^2/16 \le 1$, and so by Jensen's inequality,

$$E\left[\exp\left(\frac{9\lambda^2X^2}{16}\right)\right] = E\left[\exp(X^2/\sigma^2)^{\frac{9\lambda^2\sigma^2}{16}}\right] \le e^{\frac{9\lambda^2\sigma^2}{16}}.$$
For large $\lambda$, we use the simpler Fenchel-Young inequality, that is, that $\lambda x \le \frac{\lambda^2}{2c} + \frac{cx^2}{2}$, valid for all
$c > 0$. Then we have for any $0 < c \le 2$ that

$$E[\exp(\lambda X)] \le e^{\frac{\lambda^2\sigma^2}{2c}}\,E\left[\exp\left(\frac{cX^2}{2\sigma^2}\right)\right] \le e^{\frac{\lambda^2\sigma^2}{2c}}e^{\frac{c}{2}},$$
where the final inequality follows from Jensen's inequality. If $|\lambda| \ge \frac{4}{3\sigma}$, then $\frac{1}{2} \le \frac{9}{32}\lambda^2\sigma^2$, and we
have

$$E[\exp(\lambda X)] \le \inf_{c \in (0,2]} e^{\left[\frac{1}{2c} + \frac{9c}{32}\right]\lambda^2\sigma^2} = \exp\left(\frac{3\lambda^2\sigma^2}{4}\right).$$
(3) implies (1)   Assume (3) holds with $K_3 = 1$. Then for $t \ge 0$ we have

$$P(|X| \ge t) = P(X^2/\sigma^2 \ge t^2/\sigma^2) \le E[\exp(\lambda X^2/\sigma^2)]\exp\left(-\frac{\lambda t^2}{\sigma^2}\right)$$

for all $\lambda \ge 0$. For $\lambda \le 1$, Jensen's inequality implies $E[\exp(\lambda X^2/\sigma^2)] \le E[\exp(X^2/\sigma^2)]^{\lambda} \le e^{\lambda}$ by
assumption (3). Set $\lambda = \log 2 \approx .693$.
(4) implies (1)   This is the content of Proposition 4.1.8, with $K_4 = \frac{1}{2}$ and $K_1 = 2$.
where we used the substitution $u = t/\sigma$. Thus we have $E[|X|^k] \le 2\Gamma(k+1)\sigma^k$, and using $\Gamma(k+1) \le k^k$ yields $E[|X|^k]^{1/k} \le 2^{1/k}k\sigma$, so that (2) holds with $K_2 \le 2$.
where inequality (i) used that $k! \ge (k/e)^k$. Taking $K_3 = e^2/(e-1) < 5$ gives the result.
(2) if and only if (4)   Assume that (2) holds with $K_2 = 1$, and let $\sigma = \sup_{k\ge 1}\frac{1}{k}E[|X|^k]^{1/k}$. Then
because $E[X] = 0$,

$$E[\exp(\lambda X)] \le 1 + \sum_{k=2}^{\infty}\frac{\lambda^kE[X^k]}{k!} \le 1 + \sum_{k=2}^{\infty}\frac{(k|\lambda|\sigma)^k}{k!} \le 1 + \sum_{k=2}^{\infty}(e|\lambda|\sigma)^k,$$
where we have used that $k! \ge (k/e)^k$. When $|\lambda| < \frac{1}{e\sigma}$, evaluating the geometric series yields

$$E[\exp(\lambda X)] \le 1 + \frac{(e\lambda\sigma)^2}{1 - e|\lambda|\sigma}.$$

For $|\lambda| \le \frac{1}{2e\sigma}$, we obtain $E[e^{\lambda X}] \le 1 + 2e^2\sigma^2\lambda^2$, and as $1 + x \le e^x$ this implies (4).
For the opposite direction, assume (4) holds with K4 = K4′ = 1. Then E[exp(λX/σ)] ≤ exp(1)
for λ ∈ [−1, 1], and (3) holds. The preceding parts imply the remainder of the equivalence.
We leave the proof of the equality (4.4.1) as Exercise 4.15. Using it, however, it is evidently sufficient
to prove that there exists some sequence of integers $n \to \infty$ where along this sequence,

$$\operatorname{tr}\left(\left(e^{A/n}e^{B/n}\right)^n\right) \le \operatorname{tr}(e^Ae^B). \tag{4.4.2}$$
Now recall that the Schatten $p$-norm of a matrix $A$ is $\|A\|_p := \operatorname{tr}((AA^*)^{p/2})^{1/p} = \|\gamma(A)\|_p$,
the $\ell_p$-norm of its singular values, where $p = 2$ gives the Euclidean or Frobenius norm $\|A\|_2 = (\sum_{i,j}|A_{ij}|^2)^{1/2}$. This norm gives a generalized Hölder-type inequality for powers of 2, that is,
$n \in \{2^k\}_{k\in\mathbb{N}}$, which we can in turn use to prove the Golden-Thompson inequality. In particular,
we demonstrate that for $n$ a power of 2,

$$\operatorname{tr}(A_1 A_2 \cdots A_n) \le \prod_{i=1}^n \|A_i\|_n. \tag{4.4.3}$$
To see this inequality, we proceed inductively. Because the trace defines the inner product
$\langle A, B\rangle = \operatorname{tr}(A^*B)$, for $n = 2$, the Cauchy-Schwarz inequality implies

$$\operatorname{tr}(A_1A_2) \le \|A_1\|_2\|A_2\|_2.$$

We now perform an induction, where we have demonstrated the base case $n = 2$. Then for $n \ge 4$
a power of 2, we have by the inductive hypothesis that inequality (4.4.3) holds for $n/2$, so that

$$\operatorname{tr}(A_1A_2\cdots A_n) \le \prod_{i=1}^{n/2}\|A_{2i-1}A_{2i}\|_{n/2}.$$

Now consider an arbitrary pair of matrices $A, B$. We will demonstrate that $\|AB\|_{n/2} \le \|A\|_n\|B\|_n$,
which will then evidently imply inequality (4.4.3). For these, we have
$$\|AB\|_{n/2}^{n/2} = \operatorname{tr}\big(\underbrace{ABB^*A^*\cdots ABB^*A^*}_{n/4~\text{times}}\big) = \operatorname{tr}\left((A^*ABB^*)^{n/4}\right)$$
by the cyclic property of the trace. Using the inductive hypothesis again with $n/4$ copies of each
of the matrices $A^*A$ and $BB^*$, we thus have
$$\|AB\|_{n/2}^{n/2} \le \operatorname{tr}\left((A^*ABB^*)^{n/4}\right) \le \|A^*A\|_{n/2}^{n/4}\|BB^*\|_{n/2}^{n/4} = \operatorname{tr}\left((A^*A)^{n/2}\right)^{1/2}\operatorname{tr}\left((BB^*)^{n/2}\right)^{1/2} = \|A\|_n^{n/2}\|B\|_n^{n/2}.$$
That is, we have ∥AB∥n/2 ≤ ∥A∥n ∥B∥n for any A, B as desired, giving inequality (4.4.3).
We apply inequality (4.4.3) to powers of products of Hermitian matrices A, B. We have
$$\operatorname{tr}\left((AB)^n\right) \le \|AB\|_n^n = \operatorname{tr}\left((ABB^*A^*)^{n/2}\right) = \operatorname{tr}\left((A^*ABB^*)^{n/2}\right) = \operatorname{tr}\left((A^2B^2)^{n/2}\right)$$
because A = A∗ and B = B ∗ . Recognizing that A2 and B 2 are Hermitian, we repeat this argument
to obtain
$$\operatorname{tr}\left((A^2B^2)^{n/2}\right) \le \operatorname{tr}\left((A^4B^4)^{n/4}\right) \le \cdots \le \operatorname{tr}(A^nB^n)$$
for any $n \in \{2^k\}_{k\in\mathbb{N}}$. Replacing $A$ and $B$ by $e^A$ and $e^B$, which are both Hermitian, we obtain
4.5 Bibliography
A few references on concentration, random matrices, and entropies include Vershynin’s extraordi-
narily readable lecture notes [187], upon which our proof of Theorem 4.1.11 is based, the compre-
hensive book of Boucheron, Lugosi, and Massart [37], and the more advanced material in Buldygin
and Kozachenko [43]. Many of our arguments are based off of those of Vershynin and Boucheron
et al. Kolmogorov and Tikhomirov [126] introduced metric entropy.
We give weaker versions of the matrix-Hoeffding and matrix-Bernstein inequalities; it is possible
to do much better. Ahlswede and Winter developed early matrix concentration inequalities of the
type we present, and Petz [153] treats the Golden-Thompson inequality; the proof of Golden-Thompson
we give follows Terry Tao's blog. Lemma 4.3.3 is [181, Lemma 7.6]. It is possible to obtain better
concentration guarantees using Lieb's concavity inequality, which states that

$$A \mapsto \operatorname{tr}\exp(H + \log A) \tag{4.5.1}$$

is a concave function in $A \succ 0$ for any Hermitian $H$.
4.6 Exercises
Exercise 4.1 (Concentration of bounded random variables): Let X be a random variable taking
values in [a, b], where −∞ < a ≤ b < ∞. In this question, we show Hoeffding’s Lemma, that is,
that X is sub-Gaussian: for all λ ∈ R, we have
$$E[\exp(\lambda(X - E[X]))] \le \exp\left(\frac{\lambda^2(b-a)^2}{8}\right).$$
(a) Show that $\mathrm{Var}(X) \le \left(\frac{b-a}{2}\right)^2 = \frac{(b-a)^2}{4}$ for any random variable $X$ taking values in $[a, b]$.
(b) Let
ϕ(λ) = log E[exp(λ(X − E[X]))].
Assuming that E[X] = 0 (convince yourself that this is no loss of generality) show that
(c) Construct a random variable Yt , defined for t ∈ R, such that Yt ∈ [a, b] and
(b) Let $X$ be a random variable with $E[X] = 0$ and $\mathrm{Var}(X) = \sigma^2 > 0$. Show that

$$\liminf_{|\lambda|\to\infty}\frac{1}{|\lambda|}\log E[\exp(\lambda X)] > 0.$$
Exercise 4.3 (Mills ratio): Let $\phi(t) = \frac{1}{\sqrt{2\pi}}e^{-t^2/2}$ be the density of a standard Gaussian, $Z \sim \mathsf{N}(0,1)$, and $\Phi(t) = \int_{-\infty}^t \phi(u)du$ its cumulative distribution function.
(b) Define

$$g(t) := 1 - \Phi(t) - \frac{t}{t^2+1}\phi(t).$$

Show that $g(0) = \frac{1}{2}$, $g'(t) < 0$ for all $t \ge 0$, and that $\lim_{t\to\infty} g(t) = 0$.
Exercise 4.4 (Likelihood ratio bounds and concentration): Consider a data release problem,
where given a sample x, we release a sequence of data Z1 , Z2 , . . . , Zn belonging to a discrete set Z,
where Zi may depend on Z1i−1 and x. We assume that the data has limited information about x
in the sense that for any two samples x, x′ , we have the likelihood ratio bound
$$\frac{p(z_i \mid x, z_1^{i-1})}{p(z_i \mid x', z_1^{i-1})} \le e^{\varepsilon}.$$
Let us control the amount of “information” (in the form of an updated log-likelihood ratio) released
by this sequential mechanism. Fix x, x′ , and define
$$L(z_1, \ldots, z_n) := \log\frac{p(z_1, \ldots, z_n \mid x)}{p(z_1, \ldots, z_n \mid x')}.$$
(a) Show that

$$P\left(L(Z_1, \ldots, Z_n) \ge n\varepsilon(e^{\varepsilon} - 1) + t\right) \le \exp\left(-\frac{t^2}{2n\varepsilon^2}\right).$$
(b) Let γ ∈ (0, 1). Give the largest value of ε you can that is sufficient to guarantee that for any
test Ψ : Z n → {x, x′ }, we have
where Px and Px′ denote the sampling distribution of Z1n under x and x′ , respectively?
for all t.
Exercise 4.8 (Lipschitz functions remain sub-Gaussian): Let X be σ 2 -sub-Gaussian and f :
R → R be L-Lipschitz, meaning that |f (x) − f (y)| ≤ L|x − y| for all x, y. Prove that there exists a
numerical constant C < ∞ such that f (X) is CL2 σ 2 -sub-Gaussian.
Exercise 4.9 (Sub-gaussian maxima): Let X1 , . . . , Xn be σ 2 -sub-gaussian (not necessarily inde-
pendent) random variables. Show that
(a) $E[\max_i X_i] \le \sqrt{2\sigma^2\log n}$.

(b) There exists a numerical constant $C < \infty$ such that $E[\max_i |X_i|^p] \le (Cp\sigma^2\log n)^{p/2}$.
(b) Use a direct integration argument, as in Examples 4.1.12 and 4.1.13, to show that

$$E\left[\exp(\lambda(\mu+Z)^2)\right] \le \exp\left(\lambda\mu^2 + \frac{2\lambda^2\mu^2}{1-2\lambda} - \frac{1}{2}\log(1-2\lambda)\right)$$

for $\lambda < \frac{1}{2}$. Use this to prove the first part of Corollary 4.1.19.
Hint. It may be useful to use that $-\log(1-x) \le x + \frac{x^2}{2[1-|x|]_+}$ for all $x \in \mathbb{R}$.
Exercise 4.11: Let $Z \sim \mathsf{N}(0, I_d)$, and let $A \in \mathbb{R}^{n\times d}$ and $b \in \mathbb{R}^n$ be otherwise arbitrary. Using
the first part of Corollary 4.1.19, show the second part of Corollary 4.1.19, that is, that $\|AZ - b\|_2^2$
is $(4(\|A\|_{\mathrm{Fr}}^2 + 2\|b\|_2^2), 4\|A\|_{\mathrm{op}}^2)$-sub-exponential. Hint. Use the singular value decomposition $A = U\Gamma V^\top$ of $A$, and note that $V^\top Z \sim \mathsf{N}(0, I_d)$. Then
Exercise 4.12 (Sub-Gaussian constants of Bernoulli random variables): In this exercise, we will
derive sharp sub-Gaussian constants for Bernoulli random variables (cf. [112, Thm. 1] or [125, 24]),
showing

$$\log E[e^{t(X-p)}] \le \frac{1-2p}{4\log\frac{1-p}{p}}t^2 \quad\text{for all } t \ge 0. \tag{4.6.1}$$
where $Y_t = (1-p)$ with probability $q(t) := \frac{pe^{t(1-p)}}{pe^{t(1-p)} + (1-p)e^{-tp}}$ and $Y_t = -p$ otherwise.
$$f(s) = \frac{1-2p}{2}Cs^2 + Cps - \log(1 - p + pe^{Cs}),$$

so that inequality (4.6.1) holds if and only if $f(s) \ge 0$ for all $s \ge 0$. Give $f'(s)$ and $f''(s)$.
(e) Show that f (0) = f (1) = f ′ (0) = f ′ (1) = 0, and argue that f ′′ (s) changes signs at most twice
and that f ′′ (0) = f ′′ (1) > 0. Use this to show that f (s) ≥ 0 for all s ≥ 0.
JCD Comment: Perhaps use transportation inequalities to prove this bound, and
also maybe give Ordentlich and Weinberger’s “A Distribution Dependent Refinement
of Pinsker’s Inequality” as an exercise.
Exercise 4.13: Let $s(p) = \frac{1-2p}{\log\frac{1-p}{p}}$. Show that $s$ is concave on $[0, 1]$.
Exercise 4.14 (Inner products on complex matrices): Recall that ⟨·, ·⟩ is a complex inner product
on a vector space V if it satisfies the following for all x, y, z ∈ V :
(iii) It is conjugate linear in its first argument, so that ⟨αx + y, z⟩ = α⟨x, z⟩ + ⟨y, z⟩ for all α ∈ C.
The vector space V is real with real inner product if property (ii) is replaced with the symmetry
⟨x, y⟩ = ⟨y, x⟩ and linearity (iii) holds for α ∈ R.
(a) Show that the space of complex m × n matrices Cm×n has complex inner product ⟨A, B⟩ :=
tr(A∗ B).
(b) Show that the space Hn of n × n Hermitian matrices is a real vector space with inner product
⟨A, B⟩ := tr(A∗ B), and that consequently ⟨A, B⟩ ∈ R.
Exercise 4.15 (The Lie product formula): Let $A$ and $B$ be symmetric (or Hermitian) matrices.

(a) Show that $\left(e^{A/n}e^{B/n}\right)^n \to e^{A+B}$ as $n \to \infty$.

Hint. One argument proceeds as follows. Let $O(\epsilon)$ denote a matrix $E$ such that $\|E\|_{\mathrm{op}} \lesssim \epsilon$.
First, demonstrate that
$$e^{A/n} = I + \frac{1}{n}A + O(n^{-2}).$$
Then show that for any matrix $A$, we have $(I + n^{-1}A + o(n^{-1}))^n \to \exp(A)$. Combine these.
(b) Give an example of matrices A and B that do not commute and for which exp(A + B) ̸=
exp(A) exp(B).
Exercise 4.16: Define the trace exponential function f (X) := tr(eX ) on the Hermitian matrices.
(a) Prove that f is monotone for the semidefinite order, that is, if X ⪯ Y , then f (X) ≤ f (Y ).
Hint. It is enough to show that for any A ⪰ 0, the one-dimensional function h(t) := f (X + tA)
is monotone in t, or even that h′ (0) ≥ 0.
(b) Prove that the trace exponential is convex on the Hermitian matrices, that is, f (X) := tr(eX )
is convex. Hint. It is enough to show that for any X, V Hermitian that h(t) := f (X + tV ) is
convex in t, for which it in turn suffices to show that h′′ (0) ≥ 0.
Exercise 4.17 (The matrix-Bernstein inequality): In this question, we prove Theorem 4.3.6.
(a) Let $X_i \in \mathbb{H}^d$ be independent Hermitian matrices and $S_n = \sum_{i=1}^n X_i$. Use the Golden-Thompson
inequality (Proposition 4.3.2) to show that for all $\lambda \in \mathbb{R}$,

$$\operatorname{tr}(E[e^{\lambda S_n}]) \le d\prod_{i=1}^n\left\|E[e^{\lambda X_i}]\right\|_{\mathrm{op}}.$$
(b) Extend Example 4.1.14 to the matrix-valued case. Demonstrate that if $X$ is a mean-zero
Hermitian random matrix with $\|X\|_{\mathrm{op}} \le b$, then for all $|\lambda| < \frac{3}{b}$,

$$E[\exp(\lambda X)] \preceq I + \frac{1}{1 - b|\lambda|/3}\cdot\frac{\lambda^2E[X^2]}{2}.$$
for $|\lambda| \le \frac{3}{2b}$.
Exercise 4.18: In this question, we use Lieb's concavity inequality (4.5.1) to obtain a stronger
matrix Hoeffding inequality. Let $X_1, \ldots, X_n \in \mathbb{H}^d$ be an independent sequence of $d \times d$ mean-zero
Hermitian matrices, and let $S_n = \sum_{i=1}^n X_i$ be their sum.
(a) Let $X$ be a random Hermitian matrix with $X^2 \preceq A^2$. Show that

$$\log E[e^{\lambda\varepsilon X} \mid X] \preceq \frac{\lambda^2}{2}A^2.$$

Hint. Use that the matrix logarithm is operator monotone, that is, if $A \preceq B$, then $\log A \preceq \log B$.
$$P(\lambda_{\max}(S_n) \ge t) \le d\exp\left(-\frac{t^2}{8\sigma^2}\right).$$
(e) Give an example of random Hermitian matrices where the preceding bound is much sharper
than Corollary 4.3.5.
Exercise 4.19: In this question, we use Lieb’s concavity inequality (4.5.1) to demonstrate a
sharper matrix Bernstein inequality than Theorem 4.3.6.
(a) Define the matrix cumulant generating function $\phi_{X_i}(\lambda) := \log E[\exp(\lambda X_i)]$. Show that

$$E[\operatorname{tr}(\exp(\lambda S_n))] \le \operatorname{tr}\exp\left(\sum_{i=1}^n\phi_{X_i}(\lambda)\right).$$
(b) Using Exercise 4.17 part (b), show that if $E[X_i] = 0$, $E[X_i^2] \preceq \Sigma_i$, and $\|X_i\|_{\mathrm{op}} \le b$ for each $i$,
then for $|\lambda| < \frac{3}{b}$, we have

$$E[\operatorname{tr}(e^{\lambda S_n})] \le d\exp\left(\frac{1}{1-b|\lambda|/3}\cdot\frac{\lambda^2}{2}\sigma^2\right) \quad\text{where } \sigma^2 := \left\|\sum_{i=1}^n\Sigma_i\right\|_{\mathrm{op}}.$$
(c) Show that there exists a numerical constant $c > 0$ such that for all $t \ge 0$,

$$P\left(\|S_n\|_{\mathrm{op}} \ge t\right) \le 2d\exp\left(-c\min\left\{\frac{t^2}{\sigma^2}, \frac{t}{b}\right\}\right).$$
Chapter 5
denote the empirical distribution on $\{X_i\}_{i=1}^n$, where $\mathbf{1}_{X_i}$ denotes the point mass at $X_i$. Then for
functions $f : \mathcal{X} \to \mathbb{R}$ (or more generally, any function $f$ defined on $\mathcal{X}$), we let
$$P_nf := E_{P_n}[f(X)] = \frac{1}{n}\sum_{i=1}^n f(X_i)$$
denote the empirical expectation of $f$ evaluated on the sample, and we also let

$$Pf := E_P[f(X)] = \int f(x)dP(x)$$
denote general expectations under a measure P . With this notation, we study uniform laws of
large numbers, which consist of proving results of the form

$$\sup_{f\in\mathcal{F}}|P_nf - Pf| \to 0, \tag{5.1.1}$$
where convergence is in probability, expectation, almost surely, or with rates of convergence. When
we view Pn and P as (infinite-dimensional) vectors on the space of maps from F → R, then we
may define the (semi)norm $\|\cdot\|_{\mathcal{F}}$ for any $L : \mathcal{F} \to \mathbb{R}$ by

$$\|L\|_{\mathcal{F}} := \sup_{f\in\mathcal{F}}|L(f)|,$$
by the triangle inequality. An entirely parallel argument gives the converse lower bound of $-\frac{b-a}{n}$,
and thus Proposition 4.2.5 gives the result.
Proposition 5.1.1 shows that, to provide control over high-probability concentration of ∥Pn − P ∥F ,
it is (at least in cases where F is bounded) sufficient to control the expectation E[∥Pn − P ∥F ]. We
take this approach through the remainder of this section, developing tools to simplify bounding
this quantity.
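For a concrete instance of a uniform law (with an illustrative choice of distribution): taking $\mathcal{F}$ to be the half-line indicators $x \mapsto \mathbf{1}\{x \le t\}$ makes $\|P_n - P\|_{\mathcal{F}}$ the Kolmogorov-Smirnov distance, whose decay one can simulate:

```python
import numpy as np

rng = np.random.default_rng(7)

def sup_dev(n):
    """sup_t |P_n 1{X <= t} - P 1{X <= t}| for P = Uniform[0, 1]:
    the Kolmogorov-Smirnov distance between empirical and true CDFs."""
    x = np.sort(rng.random(n))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

devs = {n: float(np.mean([sup_dev(n) for _ in range(50)]))
        for n in (100, 400, 1600)}
print(devs)  # decreases roughly like 1/sqrt(n)
```

The averaged suprema shrink by about half each time $n$ quadruples, the $1/\sqrt{n}$ rate the symmetrization tools below deliver for this class.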
Our starting points consist of a few inequalities relating expectations to symmetrized quantities,
which are frequently easier to control than their non-symmetrized parts. This symmetrization
technique is widely used in probability theory, theoretical statistics, and machine learning. The key
is that for centered random variables, symmetrized quantities have, to within numerical constants,
similar expectations to their non-symmetrized counterparts. Thus, in many cases, it is equivalent
to analyze the symmetrized quantity and the initial quantity.
Proposition 5.1.2. Let $X_i$ be independent random vectors on a (Banach) space with norm $\|\cdot\|$
and let $\varepsilon_i \in \{-1, 1\}$ be independent random signs. Then for any $p \ge 1$,

$$2^{-p}E\left[\left\|\sum_{i=1}^n\varepsilon_i(X_i - E[X_i])\right\|^p\right] \le E\left[\left\|\sum_{i=1}^n(X_i - E[X_i])\right\|^p\right] \le 2^pE\left[\left\|\sum_{i=1}^n\varepsilon_iX_i\right\|^p\right].$$
¹Some readers may worry about measurability issues here. All of our applications will be in separable spaces,
so that we may take suprema with abandon without worrying about measurability, and consequently we ignore this
from now on.
In the proof of the upper bound, we could also show the bound

$$E\left[\left\|\sum_{i=1}^n(X_i - E[X_i])\right\|^p\right] \le 2^p\,E\left[\left\|\sum_{i=1}^n\varepsilon_i(X_i - E[X_i])\right\|^p\right],$$
Now, note that the distribution of $X_i - X_i'$ is symmetric, so that $X_i - X_i' \stackrel{\mathrm{dist}}{=} \varepsilon_i(X_i - X_i')$, and thus

$$E\left[\left\|\sum_{i=1}^n(X_i - E[X_i])\right\|^p\right] \le E\left[\left\|\sum_{i=1}^n\varepsilon_i(X_i - X_i')\right\|^p\right],$$

as desired.
For the left bound in the proposition, let $Y_i = X_i - E[X_i]$ be the centered version of the random
variables. We break the sum over random variables into two parts, conditional on whether $\varepsilon_i = \pm 1$,
using repeated conditioning. We have

$$E\left[\left\|\sum_{i=1}^n\varepsilon_iY_i\right\|^p\right] = E\left[\left\|\sum_{i:\varepsilon_i=1}Y_i - \sum_{i:\varepsilon_i=-1}Y_i\right\|^p\right] \le E\left[2^{p-1}E\left[\left\|\sum_{i:\varepsilon_i=1}Y_i\right\|^p \Big| \varepsilon\right] + 2^{p-1}E\left[\left\|\sum_{i:\varepsilon_i=-1}Y_i\right\|^p \Big| \varepsilon\right]\right]$$

$$= 2^{p-1}E\left[E\left[\left\|\sum_{i:\varepsilon_i=1}Y_i + \sum_{i:\varepsilon_i=-1}E[Y_i]\right\|^p \Big| \varepsilon\right] + E\left[\left\|\sum_{i:\varepsilon_i=-1}Y_i + \sum_{i:\varepsilon_i=1}E[Y_i]\right\|^p \Big| \varepsilon\right]\right]$$

$$\le 2^{p-1}E\left[E\left[\left\|\sum_{i:\varepsilon_i=1}Y_i + \sum_{i:\varepsilon_i=-1}Y_i\right\|^p \Big| \varepsilon\right] + E\left[\left\|\sum_{i:\varepsilon_i=-1}Y_i + \sum_{i:\varepsilon_i=1}Y_i\right\|^p \Big| \varepsilon\right]\right] = 2^pE\left[\left\|\sum_{i=1}^nY_i\right\|^p\right].$$
The expectation of $\|P_n^0\|_{\mathcal{F}}$ is of course the Rademacher complexity (Examples 4.2.7 and 4.2.8), and
we have the following corollary.

Corollary 5.1.3. Let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to \mathbb{R}$ and $X_i$ be i.i.d. Then $E[\|P_n - P\|_{\mathcal{F}}] \le 2E[\|P_n^0\|_{\mathcal{F}}]$.
From Corollary 5.1.3, it is evident that by controlling the expectation of the symmetrized process
E[∥Pn0 ∥F ] we can derive concentration inequalities and uniform laws of large numbers. For example,
we immediately obtain that

$$P\left(\|P_n - P\|_{\mathcal{F}} \ge 2E[\|P_n^0\|_{\mathcal{F}}] + t\right) \le \exp\left(-\frac{2nt^2}{(b-a)^2}\right)$$

for all $t \ge 0$ whenever $\mathcal{F}$ consists of functions $f : \mathcal{X} \to [a, b]$.
There are numerous examples of uniform laws of large numbers, many of which reduce to
developing bounds on the expectation E[∥Pn0 ∥F ], which is frequently possible via more advanced
techniques we develop in Chapter 7. A frequent application of these symmetrization ideas is to
risk minimization problems, as we discuss in the coming section; for these, it will be useful for us
to develop a few analytic and calculus tools. To better match the development of these ideas, we
return to the notation of Rademacher complexities, so that $R_n(\mathcal{F}) := \mathbb{E}[\|P_n^0\|_{\mathcal{F}}]$. The first is a
standard result, which we state for its historical value and the simplicity of its proof.
Proposition 5.1.4 (Massart's finite class bound). Let $\mathcal{F}$ be any collection of functions with $f : \mathcal{X} \to \mathbb{R}$, and assume that $\sigma_n^2 := n^{-1} \mathbb{E}[\max_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)^2] < \infty$. Then
\[
R_n(\mathcal{F}) \le \frac{\sqrt{2 \sigma_n^2 \log |\mathcal{F}|}}{\sqrt{n}}.
\]
Proof For each fixed $x_1^n$, the random variable $\sum_{i=1}^n \varepsilon_i f(x_i)$ is $\sum_{i=1}^n f(x_i)^2$-sub-Gaussian. Now, define $\sigma^2(x_1^n) := n^{-1} \max_{f \in \mathcal{F}} \sum_{i=1}^n f(x_i)^2$. Using the results of Exercise 4.9, that is, that $\mathbb{E}[\max_{j \le n} Z_j] \le \sqrt{2 \sigma^2 \log n}$ if the $Z_j$ are each $\sigma^2$-sub-Gaussian, we see that
\[
R_n(\mathcal{F} \mid x_1^n) \le \frac{\sqrt{2 \sigma^2(x_1^n) \log |\mathcal{F}|}}{\sqrt{n}}.
\]
Jensen's inequality, that $\mathbb{E}[\sqrt{\cdot}\,] \le \sqrt{\mathbb{E}[\cdot]}$, gives the result.
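As an illustration of Proposition 5.1.4, the following sketch (using an arbitrary, hypothetical finite class stored as a matrix of function values, with seed and sizes chosen only for illustration) estimates the empirical Rademacher complexity by Monte Carlo and checks it against Massart's bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite class F of m functions on n fixed points: row f of `values`
# holds (f(x_1), ..., f(x_n)).  The values are arbitrary bounded numbers.
m, n = 16, 200
values = rng.uniform(-1.0, 1.0, size=(m, n))

# Monte Carlo estimate of R_n(F | x_1^n) = E[max_f n^{-1} sum_i eps_i f(x_i)].
trials = 2000
eps = rng.choice([-1.0, 1.0], size=(trials, n))
rademacher = (eps @ values.T / n).max(axis=1).mean()

# Massart: R_n <= sqrt(2 sigma^2 log|F|) / sqrt(n), with the empirical
# sigma^2(x_1^n) = n^{-1} max_f sum_i f(x_i)^2.
sigma2 = (values**2).sum(axis=1).max() / n
massart = np.sqrt(2 * sigma2 * np.log(m)) / np.sqrt(n)

print(f"Monte Carlo estimate {rademacher:.4f} <= Massart bound {massart:.4f}")
assert rademacher <= massart
```

The bound is loose only by a modest constant here, reflecting its $\sqrt{\log |\mathcal{F}|}$ scaling in the class size.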
A refinement of Massart's finite class bound applies when the classes are infinite but, on a collection $X_1, \ldots, X_n$, the functions $f \in \mathcal{F}$ may take on only a (smaller) number of values. In this case, we define the empirical shatter coefficient of a collection of points $x_1, \ldots, x_n$ by $S_{\mathcal{F}}(x_1^n) := \operatorname{card}\{(f(x_1), \ldots, f(x_n)) \mid f \in \mathcal{F}\}$, the number of distinct vectors of values $(f(x_1), \ldots, f(x_n))$ the functions $f \in \mathcal{F}$ may take. The shatter coefficient is the maximum of the empirical shatter coefficients over $x_1^n \in \mathcal{X}^n$, that is, $S_{\mathcal{F}}(n) := \sup_{x_1^n} S_{\mathcal{F}}(x_1^n)$. It is clear that $S_{\mathcal{F}}(n) \le |\mathcal{F}|$ always, but by
only counting distinct values, we have the following corollary.
Corollary 5.1.5 (A sharper variant of Massart's finite class bound). Let $\mathcal{F}$ be any collection of functions with $f : \mathcal{X} \to \mathbb{R}$, and assume that $\sigma_n^2 := n^{-1} \mathbb{E}[\max_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)^2] < \infty$. Then
\[
R_n(\mathcal{F}) \le \frac{\sqrt{2 \sigma_n^2 \log S_{\mathcal{F}}(n)}}{\sqrt{n}}.
\]
Typical classes with small shatter coefficients include Vapnik-Chervonenkis classes of functions; we
do not discuss these further here, instead referring to one of the many books in machine learning
and empirical process theory in statistics.
The most important of the calculus rules we use are the comparison inequalities for Rademacher sums, which allow us to consider compositions of function classes and maintain small complexity measures. We state the rule here; the proof is complex, so we defer it to Section 4.4.3.
Proof The result is an almost immediate consequence of Theorem 5.1.6; we simply recenter our functions. Indeed, we have
\[
\begin{aligned}
R_n(\phi \circ \mathcal{F} \mid x_1^n)
&= \mathbb{E}\bigg[\sup_{f \in \mathcal{F}} \bigg|\frac{1}{n} \sum_{i=1}^n \varepsilon_i (\phi(f(x_i)) - \phi(0)) + \frac{1}{n} \sum_{i=1}^n \varepsilon_i \phi(0)\bigg|\bigg] \\
&\le \mathbb{E}\bigg[\sup_{f \in \mathcal{F}} \bigg|\frac{1}{n} \sum_{i=1}^n \varepsilon_i (\phi(f(x_i)) - \phi(0))\bigg|\bigg] + \mathbb{E}\bigg[\bigg|\frac{1}{n} \sum_{i=1}^n \varepsilon_i \phi(0)\bigg|\bigg] \\
&\le 2 L R_n(\mathcal{F}) + \frac{|\phi(0)|}{\sqrt{n}},
\end{aligned}
\]
the expectation $\mathbb{E}[\|P_n^0\|_{\mathcal{F}}]$, though the techniques we develop here will have broader use and can sometimes directly guarantee concentration.
The basic object we wish to control is a measure of the size of the space on which we work.
To that end, we modify notation a bit to simply consider arbitrary vectors θ ∈ Θ, where Θ is a
non-empty set with an associated (semi)metric ρ. For many purposes in estimation (and in our
optimality results in the further parts of the book), a natural way to measure the size of the set is
via the number of balls of a fixed radius δ > 0 required to cover it.
Definition 5.1 (Covering number). Let $\Theta$ be a set with (semi)metric $\rho$. A $\delta$-cover of the set $\Theta$ with respect to $\rho$ is a set $\{\theta_1, \ldots, \theta_N\}$ such that for any point $\theta \in \Theta$, there exists some $v \in \{1, \ldots, N\}$ such that $\rho(\theta, \theta_v) \le \delta$. The $\delta$-covering number of $\Theta$ is
\[
N(\delta, \Theta, \rho) := \inf\big\{N \in \mathbb{N} : \text{there exists a } \delta\text{-cover } \theta_1, \ldots, \theta_N \text{ of } \Theta\big\}.
\]
The metric entropy of the set $\Theta$ is simply the logarithm of its covering number, $\log N(\delta, \Theta, \rho)$.
We can define a related measure—more useful for constructing our lower bounds—of size that
relates to the number of disjoint balls of radius δ > 0 that can be placed into the set Θ.
Definition 5.2 (Packing number). A $\delta$-packing of the set $\Theta$ with respect to $\rho$ is a set $\{\theta_1, \ldots, \theta_M\}$ such that for all distinct $v, v' \in \{1, \ldots, M\}$, we have $\rho(\theta_v, \theta_{v'}) \ge \delta$. The $\delta$-packing number of $\Theta$ is
\[
M(\delta, \Theta, \rho) := \sup\big\{M \in \mathbb{N} : \text{there exists a } \delta\text{-packing } \theta_1, \ldots, \theta_M \text{ of } \Theta\big\}.
\]
Figures 5.1 and 5.2 give examples of (respectively) a covering and a packing of the same set.
An exercise in proof by contradiction shows that the packing and covering numbers of a set are
in fact closely related:
Lemma 5.1.8. The packing and covering numbers satisfy the following inequalities:
\[
M(2\delta, \Theta, \rho) \le N(\delta, \Theta, \rho) \le M(\delta, \Theta, \rho).
\]
We leave derivation of this lemma to Exercise 5.2, noting that it shows that (up to constant factors)
packing and covering numbers have the same scaling in the radius $\delta$. As a simple example, we see for any interval $[a, b]$ on the real line that in the usual absolute distance metric, $N(\delta, [a, b], |\cdot|) \asymp (b - a)/\delta$.
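To make the interval example concrete, a minimal sketch (the helper name `interval_cover` is ours, not the text's) builds an explicit $\delta$-cover of $[a, b]$ with roughly $(b - a)/(2\delta)$ points:

```python
import math

def interval_cover(a, b, delta):
    """Centers spaced 2*delta apart form a delta-cover of [a, b] in |.|."""
    n = math.ceil((b - a) / (2 * delta))
    return [a + (2 * i + 1) * delta for i in range(n)]

a, b, delta = 0.0, 1.0, 0.05
cover = interval_cover(a, b, delta)
# Every point of a fine grid lies within delta of some cover point.
assert all(min(abs(x - c) for c in cover) <= delta + 1e-12
           for x in [a + k * (b - a) / 1000 for k in range(1001)])
print(len(cover))  # the size scales as (b - a) / delta, up to a constant
```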
As one example of the metric entropy, consider a set of functions F with reasonable covering
numbers (metric entropy) in ∥·∥∞ -norm.
Example 5.1.9 (The "standard" covering number guarantee): Let $\mathcal{F}$ consist of functions $f : \mathcal{X} \to [-b, b]$ and let the metric $\rho$ be $\|f - g\|_\infty = \sup_{x \in \mathcal{X}} |f(x) - g(x)|$. Then
\[
\mathbb{P}\bigg(\sup_{f \in \mathcal{F}} |P_n f - P f| \ge t\bigg) \le \exp\bigg(-\frac{n t^2}{18 b^2} + \log N(t/3, \mathcal{F}, \|\cdot\|_\infty)\bigg). \tag{5.1.2}
\]
So as long as the covering numbers $N(t, \mathcal{F}, \|\cdot\|_\infty)$ grow sub-exponentially in $t$ (so that $\log N(t) \ll n t^2$), we have the (essentially) sub-Gaussian tail bound (5.1.2). Example 5.2.11 gives one typical case. Indeed, fix a minimal $t/3$-cover of $\mathcal{F}$ in $\|\cdot\|_\infty$ of size $N := N(t/3, \mathcal{F}, \|\cdot\|_\infty)$, calling the covering functions $f_1, \ldots, f_N$. Then for any $f \in \mathcal{F}$ and the function $f_i$ satisfying $\|f - f_i\|_\infty \le t/3$, we have
\[
|P_n f - P f| \le |P_n f - P_n f_i| + |P_n f_i - P f_i| + |P f_i - P f| \le |P_n f_i - P f_i| + \frac{2t}{3}.
\]
The Azuma-Hoeffding inequality (Theorem 4.2.3) guarantees (by a union bound) that
\[
\mathbb{P}\bigg(\max_{i \le N} |P_n f_i - P f_i| \ge t\bigg) \le \exp\bigg(-\frac{n t^2}{2 b^2} + \log N\bigg).
\]
Given the relationships between packing, covering, and size of sets Θ, we would expect there
to be relationships between volume, packing, and covering numbers. This is indeed the case, as we
now demonstrate for arbitrary norm balls in finite dimensions.
Proof We prove the lemma via a volumetric argument. For the lower bound, note that if the points $v_1, \ldots, v_N$ are a $\delta$-cover of $B$, then
\[
\mathrm{Vol}(B) \le \sum_{i=1}^N \mathrm{Vol}(\delta B + v_i) = N\, \mathrm{Vol}(\delta B) = N\, \mathrm{Vol}(B)\, \delta^d.
\]
In particular, $N \ge \delta^{-d}$. For the upper bound on $N(\delta, B, \|\cdot\|)$, let $\mathcal{V}$ be a $\delta$-packing of $B$ with maximal cardinality, so that $|\mathcal{V}| = M(\delta, B, \|\cdot\|) \ge N(\delta, B, \|\cdot\|)$ (recall Lemma 5.1.8). Notably, the collection of $\delta$-balls $\{\delta B + v_i\}_{i=1}^M$ covers the ball $B$ (as otherwise, we could put an additional element in the packing $\mathcal{V}$), and moreover, the balls $\{\frac{\delta}{2} B + v_i\}$ are all disjoint by definition of a packing. Consequently, we find that
\[
M \Big(\frac{\delta}{2}\Big)^d \mathrm{Vol}(B) = M\, \mathrm{Vol}\Big(\frac{\delta}{2} B\Big) \le \mathrm{Vol}\Big(B + \frac{\delta}{2} B\Big) = \Big(1 + \frac{\delta}{2}\Big)^d \mathrm{Vol}(B).
\]
Rewriting, we obtain
\[
M(\delta, B, \|\cdot\|) \le \frac{(1 + \frac{\delta}{2})^d\, \mathrm{Vol}(B)}{(\frac{\delta}{2})^d\, \mathrm{Vol}(B)} = \Big(1 + \frac{2}{\delta}\Big)^d.
\]
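A quick numeric check of the two-sided bound $\delta^{-d} \le N(\delta, B, \|\cdot\|) \le (1 + 2/\delta)^d$, using the $\ell_\infty$ unit ball, where an explicit grid cover is easy to write down; the grid construction is our illustration, not the text's:

```python
import itertools, math

# Explicit delta-cover of the ell_infty unit ball B = [-1, 1]^d:
# grid centers spaced 2/k <= 2*delta apart give k^d points, k = ceil(1/delta),
# and each center covers an ell_infty ball of radius 1/k <= delta.
d, delta = 2, 0.25
k = math.ceil(1.0 / delta)
centers = list(itertools.product(
    [-1 + (2 * i + 1) / k for i in range(k)], repeat=d))
N = len(centers)

# Two-sided volume bound from the lemma.
assert delta ** (-d) <= N <= (1 + 2 / delta) ** d
print(N)
```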
A random vector $X$ is $\sigma^2$-sub-Gaussian if $\langle u, X - \mathbb{E}[X] \rangle$ is $\sigma^2 \|u\|_2^2$-sub-Gaussian for all $u \in \mathbb{R}^d$. For example, $X \sim \mathsf{N}(0, I_d)$ is immediately 1-sub-Gaussian, and $X \in [-b, b]^d$ with independent entries is $b^2$-sub-Gaussian. Now, suppose that $X_i$ are independent isotropic random vectors, meaning that $\mathbb{E}[X_i] = 0$, $\mathbb{E}[X_i X_i^\top] = I_d$, and that they are also $\sigma^2$-sub-Gaussian. Then by an application of Lemma 5.1.10, we can give concentration guarantees for the sample covariance $\Sigma_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ in the operator norm $\|A\|_{\mathrm{op}} := \sup\{\langle u, A v \rangle \mid \|u\|_2 = \|v\|_2 = 1\}$.
Proposition 5.1.11. Let $X_i$ be independent isotropic and $\sigma^2$-sub-Gaussian vectors. Then there is a numerical constant $C$ such that with probability at least $1 - \delta$, the sample covariance $\Sigma_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ satisfies
\[
\|\Sigma_n - I_d\|_{\mathrm{op}} \le C \sigma^2 \bigg(\sqrt{\frac{d + \log \frac{1}{\delta}}{n}} + \frac{d + \log \frac{1}{\delta}}{n}\bigg).
\]
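A small simulation consistent with Proposition 5.1.11 (the dimension, sample sizes, and seed are arbitrary choices for illustration): standard Gaussian rows are isotropic and 1-sub-Gaussian, so the operator-norm error of the sample covariance should shrink roughly like $\sqrt{d/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

def cov_error(n):
    X = rng.standard_normal((n, d))      # isotropic, 1-sub-Gaussian rows
    Sigma_n = X.T @ X / n
    return np.linalg.norm(Sigma_n - np.eye(d), ord=2)  # operator norm

# Error ~ sqrt(d/n): a 16x larger sample should cut it by roughly 4x.
errs = {n: cov_error(n) for n in (500, 8000)}
print(errs)
assert errs[8000] < errs[500]
```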
Proof The second inequality is trivial. Fix any $u \in \mathbb{B}_2^d$. Then for the $i$ such that $\|u - u_i\|_2 \le \epsilon$, we have
\[
\langle u, A u \rangle = \langle u - u_i, A u \rangle + \langle u_i, A u \rangle = 2 \langle u - u_i, A u \rangle + \langle u_i, A u_i \rangle \le 2 \epsilon \|A\|_{\mathrm{op}} + \langle u_i, A u_i \rangle
\]
by definition of the operator norm. Taking a supremum over $u$ gives the final result.

Let the matrix $E_i = X_i X_i^\top - I$, and define the average error $\overline{E}_n = \frac{1}{n} \sum_{i=1}^n E_i$. Then with this lemma in hand, we see that for any $\epsilon$-cover $\mathcal{N}$ of the $\ell_2$-ball $\mathbb{B}_2^d$,
\[
(1 - 2\epsilon) \|\overline{E}_n\|_{\mathrm{op}} \le \max_{u \in \mathcal{N}} \langle u, \overline{E}_n u \rangle.
\]
Now, note that $\langle u, E_i u \rangle = \langle u, X_i \rangle^2 - \|u\|_2^2$ is sub-exponential, as it is certainly mean 0 and, moreover, is the square of a sub-Gaussian; in particular, Theorem 4.1.15 shows that there is a numerical constant $C < \infty$ such that
\[
\mathbb{E}[\exp(\lambda \langle u, E_i u \rangle)] \le \exp\big(C \lambda^2 \sigma^4\big) \quad \text{for } |\lambda| \le \frac{1}{C \sigma^2}.
\]
Taking $\epsilon = \frac{1}{4}$ in our covering $\mathcal{N}$, then,
\[
\mathbb{P}\big(\|\overline{E}_n\|_{\mathrm{op}} \ge t\big) \le \mathbb{P}\Big(\max_{u \in \mathcal{N}} \langle u, \overline{E}_n u \rangle \ge t/2\Big) \le |\mathcal{N}| \cdot \max_{u \in \mathcal{N}} \mathbb{P}\big(\langle u, n \overline{E}_n u \rangle \ge n t / 2\big)
\]
In general, however, we only have access to the risk via the empirical distribution of the Zi , and
we often choose f by minimizing the empirical risk
\[
\widehat{L}_n(f) := \frac{1}{n} \sum_{i=1}^n \ell(f, Z_i). \tag{5.2.2}
\]
As written, this formulation is quite abstract, so we provide a few examples to make it somewhat
more concrete.
Example 5.2.1 (Binary classification problems): One standard problem—still abstract—
that motivates the formulation (5.2.1) is the binary classification problem. Here the data Zi
come in pairs $(X, Y)$, where $X \in \mathcal{X}$ for some set $\mathcal{X}$ of covariates (independent variables) and
Y ∈ {−1, 1} is the label of example X. The function class F consists of functions f : X → R,
and the goal is to find a function $f$ such that
\[
\mathbb{P}(\mathrm{sign}(f(X)) \ne Y)
\]
is small, that is, minimizing the risk $\mathbb{E}[\ell(f, Z)]$ where the loss is the 0-1 loss, $\ell(f, (x, y)) = \mathbf{1}\{f(x) y \le 0\}$. 3
Example 5.2.3 (Binary classification with linear functions): In the standard statistical learning setting, the data $x$ belong to $\mathbb{R}^d$, and we assume that our function class $\mathcal{F}$ is indexed by a set $\Theta \subset \mathbb{R}^d$, so that $\mathcal{F} = \{f_\theta : f_\theta(x) = \theta^\top x, \theta \in \Theta\}$. In this case, we may use the zero-one loss
\[
\ell_{\mathrm{zo}}(f_\theta, (x, y)) := \mathbf{1}\big\{y \theta^\top x \le 0\big\},
\]
or the convex hinge and logistic losses
\[
\ell_{\mathrm{hinge}}(f_\theta, (x, y)) = \big[1 - y x^\top \theta\big]_+ \quad \text{and} \quad \ell_{\mathrm{logit}}(f_\theta, (x, y)) = \log(1 + \exp(-y x^\top \theta)).
\]
The hinge and logistic losses, as they are convex, are substantially computationally easier to work with, and they are common choices in applications. 3
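The three losses are simple to write down; this sketch (the function names are ours) also checks the standard fact that the hinge loss, and the logistic loss after division by $\log 2$, dominate the zero-one loss as functions of the margin $y \theta^\top x$:

```python
import math

def zero_one(margin):   # ell_zo: 1{y theta^T x <= 0}
    return 1.0 if margin <= 0 else 0.0

def hinge(margin):      # ell_hinge: [1 - y theta^T x]_+
    return max(1.0 - margin, 0.0)

def logistic(margin):   # ell_logit: log(1 + exp(-y theta^T x))
    return math.log1p(math.exp(-margin))

# Both convex surrogates upper bound the (scaled) zero-one loss.
for m in [-2.0, -0.5, 0.5, 2.0]:
    assert hinge(m) >= zero_one(m)
    assert logistic(m) / math.log(2) >= zero_one(m)
print(hinge(-0.5), logistic(-0.5))
```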
The main motivating question that we ask is the following: given a sample Z1 , . . . , Zn , if we
choose some fbn ∈ F based on this sample, can we guarantee that it generalizes to unseen data? In
particular, can we guarantee that (with high probability) we have the empirical risk bound
\[
\widehat{L}_n(\widehat{f}_n) = \frac{1}{n} \sum_{i=1}^n \ell(\widehat{f}_n, Z_i) \ge R(\widehat{f}_n) - \epsilon \tag{5.2.3}
\]
for some small $\epsilon$? If we allow $\widehat{f}_n$ to be arbitrary, this is clearly impossible: consider the classification Example 5.2.1, and set $\widehat{f}_n$ to be the "hash" function that sets $\widehat{f}_n(x) = y$ if the pair $(x, y)$ was in the sample, and otherwise $\widehat{f}_n(x) = -1$. Then clearly $\widehat{L}_n(\widehat{f}_n) = 0$, while there is no useful bound on $R(\widehat{f}_n)$.
Such uniform bounds are certainly sufficient to guarantee that the empirical risk is a good proxy
for the true risk L, even when fbn is chosen based on the data.
Now, recalling that our set of functions or predictors $\mathcal{F}$ is finite or countable, let us suppose that for each $f \in \mathcal{F}$, we have a complexity measure $c(f)$—a penalty—such that
\[
\sum_{f \in \mathcal{F}} e^{-c(f)} \le 1. \tag{5.2.5}
\]
This inequality should call to mind the Kraft inequality from coding theory, which we will see in the coming chapters. As soon as we have such a penalty function, however, we have the following result.
Theorem 5.2.4. Let the loss $\ell$, distribution $P$ on $\mathcal{Z}$, and function class $\mathcal{F}$ be such that $\ell(f, Z)$ is $\sigma^2$-sub-Gaussian for each $f \in \mathcal{F}$, and assume that the complexity inequality (5.2.5) holds. Then with probability at least $1 - \delta$ over the sample $Z_{1:n}$,
\[
L(f) \le \widehat{L}_n(f) + \sqrt{\frac{2 \sigma^2 (\log \frac{1}{\delta} + c(f))}{n}} \quad \text{for all } f \in \mathcal{F}.
\]
Proof First, we note that by the usual sub-Gaussian concentration inequality (Corollary 4.1.10) we have for any $t \ge 0$ and any $f \in \mathcal{F}$ that
\[
\mathbb{P}\big(L(f) \ge \widehat{L}_n(f) + t\big) \le \exp\bigg(-\frac{n t^2}{2 \sigma^2}\bigg).
\]
Now, if we replace $t$ by $\sqrt{t^2 + 2 \sigma^2 c(f)/n}$, we obtain
\[
\mathbb{P}\Big(L(f) \ge \widehat{L}_n(f) + \sqrt{t^2 + 2 \sigma^2 c(f)/n}\Big) \le \exp\bigg(-\frac{n t^2}{2 \sigma^2} - c(f)\bigg).
\]
Then using a union bound, we have
\[
\mathbb{P}\Big(\exists\, f \in \mathcal{F} \text{ s.t. } L(f) \ge \widehat{L}_n(f) + \sqrt{t^2 + 2 \sigma^2 c(f)/n}\Big)
\le \sum_{f \in \mathcal{F}} \exp\bigg(-\frac{n t^2}{2 \sigma^2} - c(f)\bigg)
= \exp\bigg(-\frac{n t^2}{2 \sigma^2}\bigg) \underbrace{\sum_{f \in \mathcal{F}} \exp(-c(f))}_{\le 1}.
\]
As one classical example of this setting, suppose that we have a finite class of functions $\mathcal{F}$. Then we can set $c(f) = \log |\mathcal{F}|$, in which case we clearly have the summation guarantee (5.2.5), and we obtain
\[
L(f) \le \widehat{L}_n(f) + \sqrt{\frac{2 \sigma^2 (\log \frac{1}{\delta} + \log |\mathcal{F}|)}{n}} \quad \text{uniformly for } f \in \mathcal{F}
\]
with probability at least $1 - \delta$. To make this even more concrete, consider the following example.
Example 5.2.5 (Floating point classifiers): We implement a linear binary classifier using double-precision floating point values, that is, we have $f_\theta(x) = \theta^\top x$ for all $\theta \in \mathbb{R}^d$ that may be represented using $d$ double-precision floating point numbers. Then for each coordinate of $\theta$, there are at most $2^{64}$ representable numbers; in total, we must thus have $|\mathcal{F}| \le 2^{64 d}$. Thus, for the zero-one loss $\ell_{\mathrm{zo}}(f_\theta, (x, y)) = \mathbf{1}\{\theta^\top x y \le 0\}$, we have
\[
L(f_\theta) \le \widehat{L}_n(f_\theta) + \sqrt{\frac{\log \frac{1}{\delta} + 45 d}{2 n}}
\]
for all representable classifiers simultaneously, with probability at least $1 - \delta$, as the zero-one loss is $1/4$-sub-Gaussian. (Here we have used that $64 \log 2 < 45$.) 3
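The resulting bound is easy to evaluate; the sketch below (the helper name is ours) computes the uniform gap $\sqrt{(\log(1/\delta) + 45 d)/(2n)}$ and verifies the constant $64 \log 2 < 45$ used above:

```python
import math

def generalization_gap(n, d, delta):
    """Uniform gap sqrt((log(1/delta) + 45 d) / (2 n)) from Example 5.2.5."""
    return math.sqrt((math.log(1 / delta) + 45 * d) / (2 * n))

assert 64 * math.log(2) < 45    # the constant used in the example

print(generalization_gap(n=100_000, d=10, delta=0.01))
```

For $n = 100{,}000$ samples in dimension $d = 10$, the gap is roughly $0.05$: the zero-one risks of all representable classifiers are simultaneously within about five percentage points of their empirical versions.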
We also note in passing that by replacing δ with δ/2 in the bounds of Theorem 5.2.4, a union
bound yields the following two-sided corollary.
Example 5.2.7 (Rademacher complexity of the $\ell_2$-ball): Let $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le r\}$, and consider the class of linear functionals $\mathcal{F} := \{f_\theta(x) = \theta^T x, \theta \in \Theta\}$. Then
\[
R_n(\mathcal{F} \mid x_1^n) \le \frac{r}{n} \sqrt{\sum_{i=1}^n \|x_i\|_2^2},
\]
because we have
\[
R_n(\mathcal{F} \mid x_1^n) = \frac{r}{n}\, \mathbb{E}\bigg[\bigg\|\sum_{i=1}^n \varepsilon_i x_i\bigg\|_2\bigg]
\le \frac{r}{n} \sqrt{\mathbb{E}\bigg[\bigg\|\sum_{i=1}^n \varepsilon_i x_i\bigg\|_2^2\bigg]}
= \frac{r}{n} \sqrt{\sum_{i=1}^n \|x_i\|_2^2},
\]
as desired. 3
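A Monte Carlo sanity check of this bound (the data, radius, trial count, and seed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 100, 5, 2.0
x = rng.standard_normal((n, d))

# Estimate R_n(F | x_1^n) = (r/n) E || sum_i eps_i x_i ||_2 by simulation.
trials = 4000
eps = rng.choice([-1.0, 1.0], size=(trials, n))
estimate = (r / n) * np.linalg.norm(eps @ x, axis=1).mean()

bound = (r / n) * np.sqrt((x**2).sum())   # (r/n) sqrt(sum_i ||x_i||_2^2)
print(estimate, bound)
assert estimate <= bound
```

The estimate sits just below the bound, as the only slack in the derivation is Jensen's inequality.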
Example 5.2.8 (Rademacher complexity of the $\ell_1$-ball): In contrast to the previous example, suppose that $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_1 \le r\}$, and consider the linear class $\mathcal{F} := \{f_\theta(x) = \theta^T x, \theta \in \Theta\}$. Then
\[
R_n(\mathcal{F} \mid x_1^n) = \frac{r}{n}\, \mathbb{E}\bigg[\bigg\|\sum_{i=1}^n \varepsilon_i x_i\bigg\|_\infty\bigg].
\]
Now, each coordinate $j$ of $\sum_{i=1}^n \varepsilon_i x_i$ is $\sum_{i=1}^n x_{ij}^2$-sub-Gaussian, and thus using that $\mathbb{E}[\max_{j \le d} Z_j] \le \sqrt{2 \sigma^2 \log d}$ for arbitrary $\sigma^2$-sub-Gaussian $Z_j$ (see Exercise 4.9), we have
\[
R_n(\mathcal{F} \mid x_1^n) \le \frac{r}{n} \sqrt{2 \log(2d) \max_j \sum_{i=1}^n x_{ij}^2}.
\]
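The $\ell_1$ case admits the same kind of Monte Carlo check (again with arbitrary illustrative data); note the only dimension dependence in the bound is the $\sqrt{\log(2d)}$ factor:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 100, 50, 1.0
x = rng.standard_normal((n, d))

# Estimate R_n(F | x_1^n) = (r/n) E || sum_i eps_i x_i ||_inf by simulation.
trials = 4000
eps = rng.choice([-1.0, 1.0], size=(trials, n))
estimate = (r / n) * np.abs(eps @ x).max(axis=1).mean()

# (r/n) sqrt(2 log(2d) max_j sum_i x_ij^2); the 2d accounts for +/- signs.
bound = (r / n) * np.sqrt(2 * np.log(2 * d) * (x**2).sum(axis=0).max())
print(estimate, bound)
assert estimate <= bound
```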
These examples are sufficient to derive a few sophisticated risk bounds. We focus on the case
where we have a loss function applied to some class with reasonable Rademacher complexity, in
which case it is possible to recenter the loss class and achieve reasonable complexity bounds. The
coming proposition does precisely this in the case of margin-based binary classification. Consider
points (x, y) ∈ X × {±1}, and let F be an arbitrary class of functions f : X → R and L =
{(x, y) 7→ ℓ(yf (x))}f ∈F be the induced collection of losses. As a typical example, we might have
ℓ(t) = [1 − t]+ , ℓ(t) = e−t , or ℓ(t) = log(1 + e−t ). We have the following proposition.
Proposition 5.2.9. Let $\mathcal{F}$ and $\mathcal{X}$ be such that $\sup_{x \in \mathcal{X}} |f(x)| \le M$ for $f \in \mathcal{F}$ and assume that $\ell$ is $L$-Lipschitz. Define the empirical and population risks $\widehat{L}_n(f) := P_n \ell(Y f(X))$ and $L(f) := P \ell(Y f(X))$. Then
\[
\mathbb{P}\bigg(\sup_{f \in \mathcal{F}} |\widehat{L}_n(f) - L(f)| \ge 4 L R_n(\mathcal{F}) + t\bigg) \le 2 \exp\bigg(-\frac{n t^2}{2 L^2 M^2}\bigg) \quad \text{for } t \ge 0.
\]
Proof We may recenter the class $\mathcal{L}$, that is, replace $\ell(\cdot)$ with $\ell(\cdot) - \ell(0)$, without changing $\widehat{L}_n(f) - L(f)$. Call this class $\mathcal{L}_0$, so that $\|P_n - P\|_{\mathcal{L}} = \|P_n - P\|_{\mathcal{L}_0}$. This recentered class satisfies bounded differences with constant $2 M L$, as $|\ell(y f(x)) - \ell(y' f(x'))| \le L |y f(x) - y' f(x')| \le 2 L M$, as in the proof of Proposition 5.1.1. Applying Proposition 5.1.1 and then Corollary 5.1.3 gives that $\mathbb{P}(\sup_{f \in \mathcal{F}} |\widehat{L}_n(f) - L(f)| \ge 2 R_n(\mathcal{L}_0) + t) \le 2 \exp(-\frac{n t^2}{2 M^2 L^2})$ for $t \ge 0$. Then applying the contraction inequality (Theorem 5.1.6) yields $R_n(\mathcal{L}_0) \le 2 L R_n(\mathcal{F})$, giving the result.
Example 5.2.10 (Support vector machines and hinge losses): In the support vector machine problem, we receive data $(X_i, Y_i) \in \mathbb{R}^d \times \{\pm 1\}$, and we seek to minimize the average of the losses $\ell(\theta; (x, y)) = [1 - y \theta^T x]_+$. We assume that the space $\mathcal{X}$ has $\|x\|_2 \le b$ for $x \in \mathcal{X}$ and that $\Theta \subset \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le r\}$. As the hinge loss is 1-Lipschitz and $|f_\theta(x)| \le r b$, Proposition 5.2.9 gives
\[
\mathbb{P}\bigg(\sup_{\theta \in \Theta} |P_n \ell(\theta; (X, Y)) - P \ell(\theta; (X, Y))| \ge 4 R_n(\mathcal{F}_\Theta) + t\bigg) \le \exp\bigg(-\frac{n t^2}{2 r^2 b^2}\bigg),
\]
where $\mathcal{F}_\Theta = \{f_\theta(x) = \theta^T x\}_{\theta \in \Theta}$. Now, we apply Example 5.2.7, which implies that
\[
R_n(\phi \circ \mathcal{F}_\Theta) \le 2 R_n(\mathcal{F}_\Theta) \le \frac{2 r b}{\sqrt{n}}.
\]
Thus
\[
\mathbb{P}\bigg(\sup_{\theta \in \Theta} |P_n \ell(\theta; (X, Y)) - P \ell(\theta; (X, Y))| \ge \frac{4 r b}{\sqrt{n}} + t\bigg) \le \exp\bigg(-\frac{n t^2}{2 (r b)^2}\bigg),
\]
so that $P_n$ and $P$ become close at rate roughly $r b / \sqrt{n}$ in this case. 3
Example 5.2.11 (Lipschitz functions over a norm-bounded parameter space): Consider the parametric loss minimization problem of minimizing $L(\theta) := P \ell(\theta, Z)$ over $\theta \in \Theta$, for a loss function $\ell$ that is $M$-Lipschitz (with respect to the norm $\|\cdot\|$) in its argument $\theta$, where for normalization we assume $\inf_{\theta \in \Theta} \ell(\theta, z) = 0$ for each $z$. Then the metric entropy of $\Theta$ bounds the metric entropy of the loss class $\mathcal{F} := \{z \mapsto \ell(\theta, z)\}_{\theta \in \Theta}$ for the supremum norm $\|\cdot\|_\infty$. Indeed, for any pair $\theta, \theta'$, we have
\[
\sup_z |\ell(\theta, z) - \ell(\theta', z)| \le M \|\theta - \theta'\|.
\]
Assume that $\Theta \subset \{\theta \mid \|\theta\| \le b\}$ for some finite $b$. Then Lemma 5.1.10 guarantees that $\log N(\epsilon, \Theta, \|\cdot\|) \le d \log(1 + 2/\epsilon) \lesssim d \log \frac{1}{\epsilon}$, and so the classical covering number argument in Example 5.1.9 gives
\[
\mathbb{P}\bigg(\sup_{\theta \in \Theta} |P_n \ell(\theta, Z) - P \ell(\theta, Z)| \ge t\bigg) \le \exp\bigg(-c \frac{n t^2}{b^2 M^2} + C d \log \frac{M}{t}\bigg),
\]
where $c, C$ are numerical constants. In particular, taking $t^2 \asymp \frac{M^2 b^2 d}{n} \log \frac{n}{\delta}$ gives that
\[
\sup_{\theta \in \Theta} |P_n \ell(\theta, Z) - P \ell(\theta, Z)| \lesssim \frac{M b \sqrt{d \log \frac{n}{\delta}}}{\sqrt{n}}
\]
with probability at least $1 - \delta$. 3
\[
L^* = \inf_f L(f),
\]
where the preceding infimum is taken across all (measurable) functions. Then we have
More broadly, let {Fk }k∈N be a (possibly infinite) increasing sequence of function classes. We
assume that for each Fk and each n ∈ N, there exists a constant Cn,k (δ) such that we have the
uniform generalization guarantee
\[
\mathbb{P}\bigg(\sup_{f \in \mathcal{F}_k} |\widehat{L}_n(f) - L(f)| \ge C_{n,k}(\delta)\bigg) \le \delta \cdot 2^{-k}.
\]
(We will see in subsequent sections of the course how to obtain other more general guarantees.)

We consider the following structural risk minimization procedure. First, given the empirical risk $\widehat{L}_n$, we find the model collection $\widehat{k}$ minimizing the penalized risk
\[
\widehat{k} := \mathop{\rm argmin}_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \Big\{\widehat{L}_n(f) + C_{n,k}(\delta)\Big\}. \tag{5.2.7a}
\]
We then choose $\widehat{f}$ to minimize the risk over the estimated "best" class $\mathcal{F}_{\widehat{k}}$, that is, set
\[
\widehat{f} := \mathop{\rm argmin}_{f \in \mathcal{F}_{\widehat{k}}} \widehat{L}_n(f). \tag{5.2.7b}
\]
Theorem 5.2.12. Let $\widehat{f}$ be chosen according to the procedure (5.2.7a)–(5.2.7b). Then with probability at least $1 - \delta$, we have
\[
L(\widehat{f}\,) \le \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \big\{L(f) + 2 C_{n,k}(\delta)\big\}.
\]
Proof On the event that $\sup_{f \in \mathcal{F}_k} |\widehat{L}_n(f) - L(f)| < C_{n,k}(\delta)$ for all $k$, which occurs with probability at least $1 - \delta$ (by a union bound, as $\sum_k \delta 2^{-k} \le \delta$), we have
\[
L(\widehat{f}\,) \le \widehat{L}_n(\widehat{f}\,) + C_{n,\widehat{k}}(\delta) = \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \big\{\widehat{L}_n(f) + C_{n,k}(\delta)\big\} \le \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \big\{L(f) + 2 C_{n,k}(\delta)\big\}.
\]
We conclude with a final example, using our earlier floating point bound from Example 5.2.5, coupled with Corollary 5.2.6 and Theorem 5.2.12.

Example 5.2.13 (Structural risk minimization with floating point classifiers): Consider again our floating point example, and let the function class $\mathcal{F}_k$ consist of functions defined by at most $k$ double-precision floating point values, so that $\log |\mathcal{F}_k| \le 45 k$. Then by taking
\[
C_{n,k}(\delta) = \sqrt{\frac{\log \frac{1}{\delta} + 65 k \log 2}{2 n}}
\]
we have that $|\widehat{L}_n(f) - L(f)| \le C_{n,k}(\delta)$ simultaneously for all $f \in \mathcal{F}_k$ and all $\mathcal{F}_k$, with probability at least $1 - \delta$. Then the structural risk minimization procedure (5.2.7) guarantees that
\[
L(\widehat{f}\,) \le \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \bigg\{L(f) + \sqrt{\frac{2 \log \frac{1}{\delta} + 91 k}{n}}\bigg\}.
\]
Roughly, we trade between small risk $L(f)$—as the risk $\inf_{f \in \mathcal{F}_k} L(f)$ must be decreasing in $k$—and the estimation error penalty, which scales as $\sqrt{(k + \log \frac{1}{\delta})/n}$. 3
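The selection rule (5.2.7a) is a one-liner once per-class empirical risks are in hand; the sketch below uses the penalty from Example 5.2.13, with hypothetical risk values chosen only for illustration:

```python
import math

def srm_select(emp_risks, n, delta):
    """Structural risk minimization over nested classes F_1, F_2, ...
    emp_risks[k-1] = min_{f in F_k} empirical risk.  The penalty
    C_{n,k}(delta) = sqrt((log(1/delta) + 65 k log 2) / (2 n)) matches
    Example 5.2.13 (at most k doubles per classifier)."""
    def penalty(k):
        return math.sqrt((math.log(1 / delta) + 65 * k * math.log(2)) / (2 * n))
    scores = [risk + penalty(k + 1) for k, risk in enumerate(emp_risks)]
    k_hat = min(range(len(scores)), key=scores.__getitem__)
    return k_hat + 1, scores[k_hat]

# Richer classes fit the sample better, but the penalty grows with k.
emp_risks = [0.30, 0.20, 0.18, 0.17]   # hypothetical minima over F_1..F_4
k_hat, score = srm_select(emp_risks, n=2000, delta=0.05)
print(k_hat, round(score, 4))
```

Here the procedure settles on an intermediate class: the drop in empirical risk from $\mathcal{F}_2$ to $\mathcal{F}_3$ is too small to pay the extra penalty.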
Example 5.3.2 (Logistic regression): In binary logistic regression (Example 3.4.2) with labels $y \in \{\pm 1\}$, we have data $z = (x, y) \in \mathbb{R}^d \times \{\pm 1\}$, and the loss
\[
\ell(\theta, z) = -\log p_\theta(y \mid x) = \log\big(1 + \exp(-y x^\top \theta)\big),
\]
which is $C^\infty$, has domain all of $\mathbb{R}^d$, and has Lipschitz-continuous derivatives of all orders. (Though the particular Lipschitz constant depends on $x$.) 3
Regardless, we will consider general rates of convergence and estimation error for estimators
minimizing the empirical loss
\[
L_n(\theta) := P_n \ell(\theta, Z) = \frac{1}{n} \sum_{i=1}^n \ell(\theta, Z_i),
\]
providing prototypical arguments for its convergence. Often, we will take Θ = Rd , though this will
not be essential.
Based on the results in the preceding sections on uniform convergence, one natural idea is to use uniform convergence: if we can argue that
\[
\sup_{\theta \in \Theta} |L_n(\theta) - L(\theta)| \stackrel{p}{\to} 0,
\]
then so long as the minimizer $\theta^\star$ of $L$ is unique and $L(\theta) - L(\theta^\star)$ grows with the distance $\|\theta - \theta^\star\|$,
we necessarily have θbn → θ⋆ . Unfortunately, this naive approach typically fails to achieve the
correct convergence rates, let alone the correct dependence on problem parameters. We therefore
take another approach.
JCD Comment: Will need some figures here / illustrations
Recall that a twice differentiable function L is convex if and only if ∇2 L(θ) ⪰ 0 for all θ ∈ dom L.
We thus expect that L should have some quadratic growth around its minimizer θ⋆ , meaning that
in a neighborhood of θ⋆ , we have
λ
L(θ) ≥ L(θ⋆ ) + ∥θ − θ⋆ ∥22
2
for θ near enough θ⋆ . In such a situation, because the sampled Ln is also convex and approximates
L, we then expect that for parameters θ far enough from θ⋆ that the growth of L(θ) above L(θ⋆ )
dominates the noise inherent in the sampling, we necessarily have Ln (θ) > Ln (θ⋆ ). Because the
empirical minimizer θbn necessarily satisfies Ln (θbn ) ≤ Ln (θ⋆ ), we would never choose such a distant
parameter, thus implying a convergence rate. To make this type of argument rigorous requires a
bit of convex analysis and sampling theory; luckily, we are by now well-equipped to address this.
Assumption A.5.1 (Standard conditions). For each $z \in \mathcal{Z}$, the losses $\ell(\theta, z)$ are convex in $\theta$. There are constants $M_0, M_1, M_2 < \infty$ such that for each $z \in \mathcal{Z}$,
(i) $\|\nabla \ell(\theta^\star, z)\|_2 \le M_0$,
(ii) $\|\nabla^2 \ell(\theta^\star, z)\|_{\mathrm{op}} \le M_1$, and
(iii) the Hessian $\nabla^2 \ell(\theta, z)$ is $M_2$-Lipschitz continuous in a neighborhood of radius $r > 0$ around $\theta^\star$, meaning $\|\nabla^2 \ell(\theta_0, z) - \nabla^2 \ell(\theta_1, z)\|_{\mathrm{op}} \le M_2 \|\theta_0 - \theta_1\|_2$ whenever $\|\theta_i - \theta^\star\|_2 \le r$.
Additionally, the minimizer $\theta^\star = \mathop{\rm argmin}_\theta L(\theta)$ exists and for a $\lambda > 0$ satisfies
\[
\nabla^2 L(\theta^\star) \succeq \lambda I.
\]
119
Lexture Notes on Statistics and Information Theory John Duchi
The “standard conditions” in Assumption A.5.1 are not so onerous. As we see when we spe-
cialize our coming results to exponential family models in Section 5.3.4, Assumption A.5.1 holds
essentially as soon as the family is minimal and ϕ(x) is bounded. The existence of minimizers can
be somewhat more subtle to guarantee than the smoothness conditions (i)–(iii), though these are
typically straightforward. (For more on the existence of minimizers, see Exercise 5.10.)
To quickly highlight the conditions, we revisit binary logistic and robust regression.
Example (Example 5.3.2 continued): For logistic regression with labels $y \in \{\pm 1\}$, we have
\[
\nabla \ell(\theta, (x, y)) = -\frac{1}{1 + e^{y x^\top \theta}}\, y x
\quad \text{and} \quad
\nabla^2 \ell(\theta, (x, y)) = p_\theta(y \mid x)(1 - p_\theta(y \mid x))\, x x^\top.
\]
Then Assumptions (i)–(iii) hold so long as $\sup_{x \in \mathcal{X}} \|x\|_2 < \infty$, with $M_0 = \sup_{x \in \mathcal{X}} \|x\|_2$ and $M_1 = \frac{1}{4} M_0^2$. We revisit the existence of minimizers in the sequel, noting that because $0 < p_\theta(y \mid x)(1 - p_\theta(y \mid x)) \le \frac{1}{4}$ for any $\theta, x, y$, if a minimizer exists then it is necessarily unique as soon as $\mathbb{E}[X X^\top] \succ 0$. 3
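A finite-difference check of these gradient and Hessian formulas, and of the bound $M_1 = \frac{1}{4}\|x\|_2^2$ coming from $p(1-p) \le \frac{1}{4}$; the function names here are ours:

```python
import numpy as np

def loss(theta, x, y):
    return np.log1p(np.exp(-y * x @ theta))

def grad(theta, x, y):
    return -y * x / (1 + np.exp(y * x @ theta))

def hessian(theta, x, y):
    p = 1 / (1 + np.exp(-y * x @ theta))    # p_theta(y | x)
    return p * (1 - p) * np.outer(x, x)

rng = np.random.default_rng(0)
theta, x, y = rng.standard_normal(3), rng.standard_normal(3), 1

# Central finite differences recover the gradient formula.
h, e = 1e-6, np.eye(3)
fd_grad = np.array([(loss(theta + h * e[j], x, y) - loss(theta - h * e[j], x, y))
                    / (2 * h) for j in range(3)])
assert np.allclose(fd_grad, grad(theta, x, y), atol=1e-6)

# Operator norm of the rank-one Hessian is p(1-p)||x||^2 <= ||x||^2 / 4.
assert np.linalg.norm(hessian(theta, x, y), 2) <= x @ x / 4 + 1e-12
```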
so that
\[
\nabla \ell(\theta, (x, y)) = h'(\langle x, \theta \rangle - y)\, x
\quad \text{and} \quad
\nabla^2 \ell(\theta, (x, y)) = h''(\langle x, \theta \rangle - y)\, x x^\top.
\]
A prototypical example is $h(t) = \log(1 + e^t) + \log(1 + e^{-t})$, which satisfies $h'(t) = \frac{e^t - 1}{e^t + 1} \in [-1, 1]$ and $h''(t) = \frac{2 e^t}{(e^t + 1)^2} \in (0, 1/2]$. So long as the covariates $x \in \mathcal{X}$ have finite radius $\mathrm{rad}_2(\mathcal{X}) := \sup_{x \in \mathcal{X}} \|x\|_2$, we obtain the Lipschitz constant bounds for Assumption A.5.1, parts (i)–(iii). In general, if $h$ is symmetric with $h''(0) > 0$, then minimizers exist whenever $Y$ is non-pathological and $\mathbb{E}[X X^\top] \succ 0$. Exercise 5.11 asks you to prove this last claim on existence of minimizers. 3
Lemma 5.3.4. Let $h$ be convex, $\theta_0 \in \mathop{\rm dom} h$, and $v$ an arbitrary vector, and set $\theta_1 = \theta_0 + v$. Then for all $t \ge 1$,
\[
h(\theta_0 + t v) - h(\theta_0) \ge t\, (h(\theta_1) - h(\theta_0)).
\]
Now we connect these results to the growth of suitably smooth convex functions. Here, we wish
to argue that the minimizer of a convex function h is not too far from a benchmark point θ0 at
which h has strong upward curvature and small gradient.
JCD Comment: Figure.
Lemma 5.3.6. Let $h$ be convex and $\lambda > 0$, $\gamma \ge 0$, and $\epsilon \ge \frac{2\gamma}{\lambda}$. Assume that for some $\theta_0$, we have $\|\nabla h(\theta_0)\|_2 \le \gamma$ and $\nabla^2 h(\theta) \succeq \lambda I$ for all $\theta$ satisfying $\|\theta - \theta_0\|_2 \le \epsilon$. Then the minimizer $\widehat{\theta} = \mathop{\rm argmin}_\theta h(\theta)$ exists and satisfies
\[
\|\widehat{\theta} - \theta_0\|_2 \le \frac{2\gamma}{\lambda}.
\]
Proof By Taylor's theorem, for any $\theta$ we have
\[
h(\theta) = h(\theta_0) + \langle \nabla h(\theta_0), \theta - \theta_0 \rangle + \frac{1}{2} (\theta - \theta_0)^\top \nabla^2 h(\tilde{\theta})(\theta - \theta_0)
\]
for a point $\tilde{\theta}$ on the line between $\theta$ and $\theta_0$. Now, let us take $\theta$ such that $\|\theta - \theta_0\|_2 \le \epsilon$. Then by assumption $\nabla^2 h(\tilde{\theta}) \succeq \lambda I$, and so we have
\[
h(\theta) \ge h(\theta_0) + \langle \nabla h(\theta_0), \theta - \theta_0 \rangle + \frac{\lambda}{2} \|\theta - \theta_0\|_2^2
\ge h(\theta_0) - \gamma \|\theta - \theta_0\|_2 + \frac{\lambda}{2} \|\theta - \theta_0\|_2^2
\]
by assumption that $\|\nabla h(\theta_0)\|_2 \le \gamma$ and the Cauchy-Schwarz inequality.

Fix $t \ge 0$. If we can show that $h(\theta) > h(\theta_0)$ for all $\theta$ satisfying $\|\theta - \theta_0\|_2 = t$, then Lemma 5.3.5 implies that $h(\theta) > h(\theta_0)$ whenever $\|\theta - \theta_0\|_2 \ge t$, so that necessarily $\|\widehat{\theta} - \theta_0\|_2 < t$. Returning to the previous display and letting $t = \|\theta - \theta_0\|_2$, note that
\[
h(\theta) \ge h(\theta_0) - \gamma t + \frac{\lambda}{2} t^2,
\]
which is strictly greater than $h(\theta_0)$ whenever $t > \frac{2\gamma}{\lambda}$; since $\epsilon \ge \frac{2\gamma}{\lambda}$, any such $t \le \epsilon$ is available, giving $\|\widehat{\theta} - \theta_0\|_2 \le \frac{2\gamma}{\lambda}$.
Lemma 5.3.7. Let $h$ be convex and assume that $\nabla^2 h$ is $M_2$-Lipschitz (part (iii) of Assumption A.5.1). Let $\lambda > 0$ be large enough and $\gamma > 0$ be small enough that $\gamma < \frac{\lambda^2}{8 M_2}$. Then if both $\nabla^2 h(\theta_0) \succeq \lambda I$ and $\|\nabla h(\theta_0)\|_2 \le \gamma$, the minimizer $\widehat{\theta} = \mathop{\rm argmin}_\theta h(\theta)$ exists and satisfies
\[
\|\widehat{\theta} - \theta_0\|_2 \le \frac{4\gamma}{\lambda}.
\]
Proof By Lemma 5.3.6, it is enough to show that $\nabla^2 h(\theta) \succeq \frac{\lambda}{2} I$ for all $\theta$ with $\|\theta - \theta_0\|_2 \le \frac{4\gamma}{\lambda}$. For this, we use the $M_2$-Lipschitz continuity of $\nabla^2 h$ to obtain that for any $\theta$ with $\|\theta - \theta_0\|_2 = t$,
\[
\nabla^2 L(\theta^\star) \succeq \lambda I \tag{5.3.2b}
\]
with high probability. Happily, the convergence guarantees we develop in Chapter 4 provide precisely the tools to do this.
Theorem 5.3.8. Let Assumption A.5.1 hold for the M-estimation problem (5.3.1). Let $\delta \in (0, \frac{1}{2})$, and define
\[
\gamma_n(\delta) := \frac{M_0}{\sqrt{n}}\bigg(1 + \sqrt{\log \frac{1}{\delta}}\bigg)
\quad \text{and} \quad
\epsilon_n(\delta) := \max\bigg\{2 \|\nabla^2 L(\theta^\star)\|_{\mathrm{op}} \sqrt{\frac{1}{n} \log \frac{2d}{\delta}},\; \frac{4 M_1}{3 n} \log \frac{2d}{\delta}\bigg\}.
\]
Then we have both $\|\nabla L_n(\theta^\star)\|_2 \le \gamma_n(\delta)$ and $\|\nabla^2 L_n(\theta^\star) - \nabla^2 L(\theta^\star)\|_{\mathrm{op}} \le \epsilon_n(\delta)$ with probability at least $1 - 2\delta$. So long as $\epsilon_n(\delta) \le \frac{\lambda}{4}$ and $\gamma_n(\delta) \le \frac{9 \lambda^2}{128 M_2}$, then with the same probability,
\[
\|\widehat{\theta}_n - \theta^\star\|_2 \le \frac{16}{3} \cdot \frac{\gamma_n(\delta)}{\lambda}.
\]
We defer the proof of the theorem to Section 5.3.5, instead providing commentary and a few examples of its application. Ignoring the numerical constants, the theorem roughly states the following: once $n$ is large enough that
\[
n \gtrsim \frac{M_2^2 M_0^2}{\lambda^4} \log \frac{1}{\delta}
\quad \text{and} \quad
n \gtrsim \frac{\|\nabla^2 L(\theta^\star)\|_{\mathrm{op}}^2}{\lambda^2} \log \frac{d}{\delta}, \tag{5.3.3}
\]
with probability at least $1 - \delta$ we have
\[
\|\widehat{\theta}_n - \theta^\star\|_2 \lesssim \frac{M_0}{\lambda \sqrt{n}} \sqrt{\log \frac{1}{\delta}}. \tag{5.3.4}
\]
These finite sample results are, at least for large $n$, order optimal, as we will develop in the coming sections on fundamental limits. Nonetheless, the conditions (5.3.3) are stronger than necessary, typically requiring that $n$ be quite large. In the exercises, we explore a class of quasi-self-concordant losses, where the second derivative controls the third derivative, allowing more direct application of Lemma 5.3.6, which allows reducing this sample size requirement. (See Exercises 5.5 and 5.6.)
Example 5.3.9 (Logistic regression, Example 5.3.2 continued): Recalling the logistic loss $\ell(\theta, (x, y)) = \log(1 + e^{-y \langle x, \theta \rangle})$ for $y \in \{\pm 1\}$ and $x \in \mathbb{R}^d$, assume the domain $\mathcal{X}$ consists of vectors $x$ with $\|x\|_2 \le \sqrt{d}$. For example, if $\mathcal{X} \subset [-1, 1]^d$, this holds. In this case, $M_0 = \sqrt{d}$, while $M_1 \le \frac{1}{4} d$ and $M_2 \lesssim d^{3/2}$. Assuming that the population Hessian $\nabla^2 L(\theta^\star) = \mathbb{E}[p_{\theta^\star}(Y \mid X)(1 - p_{\theta^\star}(Y \mid X)) X X^\top]$ has minimal eigenvalue $\lambda_{\min}(\nabla^2 L(\theta^\star)) \gtrsim 1$, then the conclusions of Theorem 5.3.8 apply as soon as $n \gtrsim d^4 \log \frac{1}{\delta}$. 3
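A simulation in this spirit (small $d$, Newton's method on the empirical logistic risk; all choices here are illustrative) shows the error $\|\widehat{\theta}_n - \theta^\star\|_2$ shrinking as $n$ grows, consistent with the $1/\sqrt{n}$ rate of (5.3.4):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
theta_star = np.array([1.0, -0.5, 0.25])

def fit(n):
    # Draw (X, Y) from the logistic model at theta_star, then run Newton
    # steps on the empirical logistic risk L_n to find theta_hat_n.
    X = rng.uniform(-1, 1, size=(n, d))
    p = 1 / (1 + np.exp(-X @ theta_star))
    Y = np.where(rng.uniform(size=n) < p, 1.0, -1.0)
    theta = np.zeros(d)
    for _ in range(25):
        m = Y * (X @ theta)
        s = 1 / (1 + np.exp(m))                     # = 1 - p_theta(y|x)
        g = -(X * (Y * s)[:, None]).mean(axis=0)    # gradient of L_n
        H = (X.T * (s * (1 - s))).dot(X) / n + 1e-9 * np.eye(d)
        theta -= np.linalg.solve(H, g)              # Newton step
    return np.linalg.norm(theta - theta_star)

# A 64x larger sample should cut the error by roughly 8x.
errs = {n: fit(n) for n in (250, 16000)}
print(errs)
assert errs[16000] < errs[250]
```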
When $n$ is large enough, the guarantee (5.3.4) allows us to also make the heuristic asymptotic expansions for the exponential family models in Section 3.2.1 hold in finite samples. Let the conclusions of Theorem 5.3.8 hold, so that $\|\nabla^2 L_n(\theta^\star) - \nabla^2 L(\theta^\star)\|_{\mathrm{op}} \le \epsilon_n(\delta)$ and so on. Then once we know that $\widehat{\theta}_n$ exists, by a Taylor expansion we can write
\[
0 = \nabla L_n(\widehat{\theta}_n) = \nabla L_n(\theta^\star) + (\nabla^2 L_n(\theta^\star) + E_n)(\widehat{\theta}_n - \theta^\star)
= \nabla L_n(\theta^\star) + (\nabla^2 L(\theta^\star) + E_n')(\widehat{\theta}_n - \theta^\star),
\]
where $E_n$ is an error matrix satisfying $\|E_n\|_{\mathrm{op}} \le M_2 \|\widehat{\theta}_n - \theta^\star\|_2$ and $E_n' = E_n + \nabla^2 L_n(\theta^\star) - \nabla^2 L(\theta^\star)$ satisfies $\|E_n'\|_{\mathrm{op}} \le \|E_n\|_{\mathrm{op}} + \epsilon_n(\delta)$ by the triangle inequality and Theorem 5.3.8. Using the infinite series expansion of the inverse
\[
(A + E)^{-1} = A^{-1} + \sum_{i=1}^\infty (-1)^i (A^{-1} E)^i A^{-1},
\]
valid for $A \succ 0$ whenever $\|E\|_{\mathrm{op}} < \lambda_{\min}(A)$ (see Exercise 5.4), we therefore have
\[
\widehat{\theta}_n - \theta^\star = -(\nabla^2 L(\theta^\star) + E_n')^{-1} \nabla L_n(\theta^\star) = -\nabla^2 L(\theta^\star)^{-1} \nabla L_n(\theta^\star) + R_n,
\]
where the remainder vector $R_n$ satisfies
\[
\|R_n\|_2 \lesssim \big\|\nabla^2 L(\theta^\star)^{-1} E_n' \nabla^2 L(\theta^\star)^{-1}\big\|_{\mathrm{op}} \|\nabla L_n(\theta^\star)\|_2
\lesssim \frac{1}{\lambda^2}\bigg(\frac{M_2 M_0}{\lambda} \sqrt{\frac{\log \frac{1}{\delta}}{n}} + \epsilon_n(\delta)\bigg) \|\nabla L_n(\theta^\star)\|_2
\lesssim \frac{M_2 M_0^2}{\lambda^2} \cdot \frac{\log \frac{1}{\delta}}{\lambda n} + \|\nabla^2 L(\theta^\star)\|_{\mathrm{op}} \frac{M_0}{\lambda^2} \cdot \frac{\sqrt{\log \frac{1}{\delta} \log \frac{d}{\delta}}}{n}.
\]
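The series expansion of the inverse is easy to verify numerically; here $A$ and $E$ are arbitrary illustrative matrices, chosen with $\|E\|_{\mathrm{op}} < \lambda_{\min}(A)$ so that the series converges:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = np.diag(np.linspace(1.0, 2.0, d))      # A > 0 with lambda_min(A) = 1
E = 0.1 * rng.standard_normal((d, d))
E = (E + E.T) / 2
assert np.linalg.norm(E, 2) < 1.0          # ||E||_op < lambda_min(A)

# Partial sums of (A + E)^{-1} = A^{-1} + sum_{i>=1} (-1)^i (A^{-1}E)^i A^{-1}.
Ainv = np.linalg.inv(A)
approx = Ainv.copy()
term = Ainv.copy()
for _ in range(30):
    term = -Ainv @ E @ term                # next term (-1)^i (A^{-1}E)^i A^{-1}
    approx += term

assert np.allclose(approx, np.linalg.inv(A + E), atol=1e-10)
```

The geometric decay of the terms (at rate $\|A^{-1}E\|_{\mathrm{op}}$) is what makes the first-order truncation, with remainder $R_n$, accurate above.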
Corollary 5.3.10. Let the conditions of Theorem 5.3.8 hold. Then there exists a problem-dependent constant $C$ such that the following holds: for any $\delta > 0$, for any $n \ge C \log \frac{1}{\delta}$, with probability at least $1 - \delta$,
\[
\widehat{\theta}_n - \theta^\star = -\nabla^2 L(\theta^\star)^{-1} \nabla L_n(\theta^\star) + R_n,
\]
where the remainder $R_n$ satisfies $\|R_n\|_2 \le C \cdot \frac{1}{n} \log \frac{d}{\delta}$. The constant $C$ may be taken to be continuous in all the problem parameters of Assumption A.5.1.
Corollary 5.3.10 highlights the two salient terms governing error in estimation problems: the curvature of the loss near the optimum, as $\nabla^2 L(\theta^\star)$ contributes, and the variance in the gradients $\nabla L_n(\theta^\star) = \frac{1}{n} \sum_{i=1}^n \nabla \ell(\theta^\star, Z_i)$. When the Hessian term $\nabla^2 L(\theta^\star)$ is "large," meaning that $\nabla^2 L(\theta^\star) \succeq \lambda I$ for some large value $\lambda > 0$, then estimation is easier: the curvature of the loss helps to identify $\theta^\star$. Conversely, when the variance $\mathrm{Var}(\nabla \ell(\theta^\star, Z)) = \mathbb{E}[\|\nabla \ell(\theta^\star, Z)\|_2^2]$ is large, then estimation is more challenging. As a final remark, let us imagine that the remainder term $R_n$ in the corollary, in addition to being small with high probability, satisfies $\mathbb{E}[\|R_n\|_2^2] \le \frac{C}{n^2}$, where $C$ is a problem-dependent constant. Let $G_n = -\nabla^2 L(\theta^\star)^{-1} \nabla L_n(\theta^\star)$ be the leading term in the expansion, which satisfies
\[
\mathbb{E}[\|G_n\|_2^2] = \frac{1}{n} \mathrm{Var}\big(\nabla^2 L(\theta^\star)^{-1} \nabla \ell(\theta^\star, Z)\big) = \frac{\mathrm{tr}\big(\nabla^2 L(\theta^\star)^{-1} \mathrm{Cov}(\nabla \ell(\theta^\star, Z)) \nabla^2 L(\theta^\star)^{-1}\big)}{n}.
\]
Then because $\|G_n + R_n\|_2^2 \le \|G_n\|_2^2 + \|R_n\|_2^2 + 2 \|G_n\|_2 \|R_n\|_2$, we have the heuristic
\[
\mathbb{E}\big[\|\widehat{\theta}_n - \theta^\star\|_2^2\big] = \mathbb{E}\big[\|G_n + R_n\|_2^2\big]
= \mathbb{E}[\|G_n\|_2^2] + \mathbb{E}[\|R_n\|_2^2] \pm 2\, \mathbb{E}[\|G_n\|_2 \|R_n\|_2]
\stackrel{(\star)}{=} \frac{\mathrm{tr}\big(\nabla^2 L(\theta^\star)^{-1} \mathrm{Cov}(\nabla \ell(\theta^\star, Z)) \nabla^2 L(\theta^\star)^{-1}\big)}{n} \pm \frac{C}{n^{3/2}}, \tag{5.3.5}
\]
where $C$ is a problem-dependent constant and the step $(\star)$ is heuristic. See Exercise 5.7 for one approach to make this step rigorous.
Corollary 5.3.10 presents one approach to make this rigorous. Assuming the sufficient statistics $\phi$ are bounded, we have $\nabla^2 L(\theta^\star) = \nabla^2 A(\theta^\star)$, and $\mathrm{Cov}_{\theta^\star}(\nabla L_n(\theta^\star)) = \frac{1}{n} \mathrm{Cov}_{\theta^\star}(\phi(X)) = \frac{1}{n} \nabla^2 A(\theta^\star)$. So
\[
\widehat{\theta}_n - \theta^\star = -\nabla^2 A(\theta^\star)^{-1} \frac{1}{n} \sum_{i=1}^n \big(\phi(X_i) - \nabla A(\theta^\star)\big) + R_n,
\]
is to consider the estimator $\widehat{\theta}_n$ only on some "good" event $\mathcal{E}_n$ that occurs with high probability, for example, that the remainder $R_n$ is small. The next corollary provides a prototypical result under the assumption that $X_i \stackrel{\mathrm{iid}}{\sim} P_{\theta^\star}$ for an exponential family model with bounded data $\sup_{x \in \mathcal{X}} \|\phi(x)\|_2 < \infty$ and positive definite Hessian $\nabla^2 A(\theta^\star) \succ 0$.

Corollary 5.3.11. Under the preceding conditions on the exponential family model $P_{\theta^\star}$, there exists a problem-dependent constant $C < \infty$ such that the following holds: for any $k \ge 1$, there are events $\mathcal{E}_n$ with probability $\mathbb{P}(\mathcal{E}_n) \ge 1 - \frac{1}{n^k}$ and
\[
\mathbb{E}_{\theta^\star}\Big[\|\widehat{\theta}_n - \theta^\star\|_2^2 \cdot \mathbf{1}\{\mathcal{E}_n\}\Big] \le \frac{1}{n} \mathrm{tr}\big(\nabla^2 A(\theta^\star)^{-1}\big) + \frac{C k \log n}{n^{3/2}}.
\]
The constant C may be taken continuous in θ⋆ .
Recalling the equality (3.3.2), we see that the Fisher information ∇2 A(θ) appears in a fundamental
way for the exponential families. Proposition 3.5.1 shows that this quantity is fundamental, at
least for testing; here it provides an upper bound on the convergence of the maximum likelihood
estimator. Exercise 5.8 extends Corollary 5.3.11 to an equality to within lower order terms.
Proof Let $\delta = \delta_n > 0$ be a value to be chosen and define the event $E_n$ to be that $\|R_n\|_2 \le \frac{C}{n} \log \frac{d}{\delta}$, which occurs with probability at least $1 - \delta$. Let
\[
G_n = -\nabla^2 L(\theta^\star)^{-1} \nabla L_n(\theta^\star) = -\nabla^2 A(\theta^\star)^{-1} P_n(\phi(X) - \nabla A(\theta^\star))
\]
be the mean-zero gradient term. Then $\hat\theta_n - \theta^\star = G_n + R_n$, and $\|G_n + R_n\|_2^2 \le \|G_n\|_2^2 + \|R_n\|_2^2 + 2\|G_n\|_2 \|R_n\|_2$. On the event $E_n$ we have $\|R_n\|_2^2 \le \frac{C^2}{n^2} \log^2 \frac{d}{\delta}$, and so
\[
E\Big[\big\|\hat\theta_n - \theta^\star\big\|_2^2 \, 1\{E_n\}\Big] \le E[\|G_n\|_2^2] + \frac{2C}{n} \log\frac{d}{\delta} \, E[\|G_n\|_2] + \frac{C^2}{n^2} \log^2 \frac{d}{\delta}.
\]
Now note that $E[\|G_n\|_2^2] = \frac{1}{n} \operatorname{tr}(\nabla^2 A(\theta^\star)^{-1})$, and set $\delta = \frac{1}{n^k}$.
These ideas also extend to generalized linear models, such as linear, logistic, or Poisson regression (recall Chapter 3.4). For the abstract generalized linear model of predicting a target $y \in \mathcal{Y}$ from covariates $x \in \mathcal{X}$, we have the loss $\ell(\theta, (x, y)) = -\phi(x, y)^\top \theta + A(\theta \mid x)$. Because the log partition is $C^\infty$, the smoothness conditions in Assumption A.5.1 then reduce to the boundedness
\[
\operatorname{rad}_2\big(\{\phi(x, y) \mid x \in \mathcal{X}, y \in \mathcal{Y}\}\big) := \sup_{x \in \mathcal{X}, y \in \mathcal{Y}} \|\phi(x, y)\|_2 < \infty.
\]
Assuming that a minimizer θ⋆ = argminθ L(θ) exists, the (local) strong convexity condition that
∇2 L(θ⋆ ) ≻ 0 then becomes that E[∇2 A(θ | X)] = E[Covθ (ϕ(X, Y ) | X)] ≻ 0. Exercise 5.10, part (c)
gives general sufficient conditions for the existence of minimizers in GLMs.
For logistic regression (Example 3.4.2), these conditions correspond to a bound on the covariate data $x$, that $E[XX^\top] \succ 0$, and that for each $X$, the label $Y$ is non-deterministic. For Poisson regression (Example 3.4.4), we have $\ell(\theta, (x, y)) = -y x^\top \theta + e^{\theta^\top x}$. When the count data $Y \in \mathbb{N}$ can be unbounded, Assumption A.5.1.(i) may fail, because $yx$ may be unbounded. If we model a data-generating process for which $X$ and $Y$ are both bounded using Poisson regression, however, then
the smoothness conditions in Assumption A.5.1 hold. (Again, see Exercise 5.10 for the existence
of solutions.)
Regardless, by an argument completely parallel to that for Corollary 5.3.10, we can provide
convergence rates for generalized linear model estimators. Here, we avoid the assumption of model
fidelity, instead assuming that θ⋆ = argminθ L(θ) exists and ∇2 L(θ⋆ ) = E[∇2 A(θ⋆ | X)] ≻ 0, so
that θ⋆ is unique.
Corollary 5.3.12. Let the preceding conditions hold and $p_\theta$ be a generalized linear model. Then there exists a problem-dependent constant $C < \infty$ such that the following holds: for any $\delta \in (0, 1)$ and for all $n \ge C \log \frac{1}{\delta}$, with probability at least $1 - \delta$
When the generalized linear model Pθ is correct, so that X ∼ P marginally and Y | X ∼ Pθ (· | X),
then Cov(ϕ(X, Y ) | X) = ∇2 A(θ⋆ | X), and so in this case (again, for a sequence of events En with
probability at least 1 − 1/nk ), we have
\[
E_{\theta^\star}\Big[\big\|\hat\theta_n - \theta^\star\big\|_2^2 \, 1\{E_n\}\Big] \le \frac{\operatorname{tr}\big(E[\nabla^2 A(\theta^\star \mid X)]^{-1}\big)}{n} + \frac{Ck \log n}{n^{3/2}}.
\]
Note that this quantity is the trace of the inverse Fisher information (3.3.2) in the generalized
linear model: the “larger” the information, the better estimation accuracy we can guarantee.
Lemma 5.3.13. Let Assumption A.5.1 hold. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
\[
\|\nabla L_n(\theta^\star)\|_2 \le \frac{M_0}{\sqrt{n}} \Big(1 + \sqrt{2 \log \frac{1}{\delta}}\Big).
\]
Proof The function $z_1^n \mapsto \|\nabla L_n(\theta^\star)\|_2$ satisfies bounded differences: for any two empirical samples $P_n, P_n'$ differing in only observation $i$,
Lemma 5.3.14. Let Assumption A.5.1 hold. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
\[
\big\|\nabla^2 L_n(\theta^\star) - \nabla^2 L(\theta^\star)\big\|_{\rm op} \le \max\bigg\{ 2 \sqrt{\frac{\|\nabla^2 L(\theta^\star)\|_{\rm op}}{n} \log \frac{2d}{\delta}}, \; \frac{4 M_1}{3n} \log \frac{2d}{\delta} \bigg\}.
\]
Proof Because $\nabla^2 L_n(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla^2 \ell(\theta, Z_i)$ and $\|\nabla^2 \ell(\theta, z)\|_{\rm op} \le M_1$ by Assumption A.5.1.(ii), Theorem 4.3.6 implies that
\[
P\Big(\big\|\nabla^2 L_n(\theta^\star) - \nabla^2 L(\theta^\star)\big\|_{\rm op} \ge t\Big) \le 2d \exp\bigg(-\min\bigg\{ \frac{n t^2}{4 \|\nabla^2 L(\theta^\star)\|_{\rm op}}, \frac{3 n t}{4 M_1} \bigg\}\bigg).
\]
Let $\epsilon_n(\delta)$ be the bound on the right side of Lemma 5.3.14. Then with probability at least $1 - \delta$,
\[
\nabla^2 L_n(\theta^\star) \succeq \nabla^2 L(\theta^\star) - \epsilon_n(\delta) I_d \succeq \frac{3\lambda}{4} I_d
\]
by the assumption that $n$ is large enough and $\nabla^2 L(\theta^\star) \succeq \lambda I$; we therefore have the first condition of Lemma 5.3.7 with $3\lambda/4$ replacing $\lambda$. Lemma 5.3.13 gives that with probability at least $1 - \delta$, $\|\nabla L_n(\theta^\star)\|_2 \le \gamma_n(\delta)$. Now use the assumption that $\gamma_n(\delta) \le \frac{9\lambda^2}{128 M_2}$, so that Lemma 5.3.7 implies
\[
\|\hat\theta_n - \theta^\star\|_2 \le \frac{16}{3} \frac{\gamma_n(\delta)}{\lambda},
\]
which proves the theorem.
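For the squared loss, the localization behind the final bound holds exactly: the empirical objective is quadratic, so $\hat\theta_n - \theta^\star = -(\nabla^2 L_n)^{-1} \nabla L_n(\theta^\star)$ and hence $\|\hat\theta_n - \theta^\star\|_2 \le \|\nabla L_n(\theta^\star)\|_2 / \lambda_{\min}(\nabla^2 L_n)$, a cleaner analogue of the $\frac{16}{3}\gamma_n(\delta)/\lambda$ bound. A minimal numerical check, with hypothetical data-generating choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 4
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -1.0, 0.5, 0.0])   # hypothetical true parameter
y = X @ theta_star + rng.normal(size=n)

# L_n(theta) = ||y - X theta||^2 / (2n) has constant Hessian X^T X / n
H = X.T @ X / n
grad = -X.T @ (y - X @ theta_star) / n          # nabla L_n at theta*
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # the least-squares estimator

err = np.linalg.norm(theta_hat - theta_star)
lam = np.linalg.eigvalsh(H).min()
# theta_hat - theta* = -H^{-1} grad, so the localization bound is exact
assert err <= np.linalg.norm(grad) / lam + 1e-10
```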
5.4 Exercises
JCD Comment: Exercise ideas around this: We could try to do things with moment bounds, using something like the Marcinkiewicz bounds together with moment bounds on matrices from other exercises. Writing some moment bound guarantees for matrices in operator norm would also be really neat. Also, exercises that handle dimension scaling could be fun, along with (associated) convergence rates.
Exercise 5.1: In this question, we show how to use Bernstein-type (sub-exponential) inequal-
ities to give sharp convergence guarantees. Recall (Example 4.1.14, Corollary 4.1.18, and inequal-
ity (4.1.6)) that if Xi are independent bounded random variables with |Xi − E[X]| ≤ b for all i and
Var(Xi ) ≤ σ 2 , then
\[
\max\bigg\{ P\bigg(\frac{1}{n} \sum_{i=1}^n X_i \ge E[X] + t\bigg), \; P\bigg(\frac{1}{n} \sum_{i=1}^n X_i \le E[X] - t\bigg) \bigg\} \le \exp\bigg(-\frac{1}{2} \min\bigg\{ \frac{5}{6} \frac{n t^2}{\sigma^2}, \frac{n t}{2b} \bigg\}\bigg).
\]
We consider minimization of loss functions ℓ over finite function classes F with ℓ ∈ [0, 1], so that if
L(f ) = E[ℓ(f, Z)] then |ℓ(f, Z) − L(f )| ≤ 1. Throughout this question, we let
We will show that, roughly, a procedure based on picking an empirical risk minimizer is unlikely to
choose a function f ∈ F with bad performance, so that we obtain faster concentration guarantees.
\[
P\big(\hat L(f) \ge L(f) + t\big) \vee P\big(\hat L(f) \le L(f) - t\big) \le \exp\bigg(-\frac{1}{2} \min\bigg\{ \frac{5}{6} \frac{n t^2}{L(f)(1 - L(f))}, \frac{n t}{2} \bigg\}\bigg).
\]
(b) Define the set of "bad" prediction functions $\mathcal{F}_\epsilon^{\rm bad} := \{f \in \mathcal{F} : L(f) \ge L^\star + \epsilon\}$. Show that for any fixed $\epsilon \ge 0$ and any $f \in \mathcal{F}_{2\epsilon}^{\rm bad}$, we have
\[
P\big(\hat L(f) \le L^\star + \epsilon\big) \le \exp\bigg(-\frac{1}{2} \min\bigg\{ \frac{5}{6} \frac{n \epsilon^2}{L^\star(1 - L^\star) + \epsilon(1 - \epsilon)}, \frac{n \epsilon}{2} \bigg\}\bigg).
\]
nϵ2
⋆
1 5 nϵ
P L(fn ) ≥ L(f ) + 2ϵ ≤ card(F) · exp − min
b , .
2 6 L⋆ (1 − L⋆ ) + ϵ(1 − ϵ) 2
(d) Using the result of part (c), argue that with probability at least $1 - \delta$,
\[
L(\hat f_n) \le L(f^\star) + \frac{4 \log \frac{|\mathcal{F}|}{\delta}}{n} + \sqrt{\frac{12}{5}} \cdot \frac{\sqrt{L^\star(1 - L^\star) \log \frac{|\mathcal{F}|}{\delta}}}{\sqrt{n}}.
\]
Why is this better than an inequality based purely on the boundedness of the loss ℓ, such as
Theorem 5.2.4 or Corollary 5.2.6? What happens when there is a perfect risk minimizer f ⋆ ?
\[
\big\|(A + E)^{-1} - S_k\big\|_{\rm op} \le \frac{\gamma^{k+1}}{\lambda_{\min}(A)(1 - \gamma)}.
\]
(a) Define g(t) = log f ′′ (t). Show that if f is C-q.s.c., then |g ′ (t)| ≤ C and so for any s ∈ R,
(b) Show that the function f (t) = log(1 + et ) + log(1 + e−t ) is q.s.c., and give its self-concordance
parameter.
(c) Show that the function f (t) = log(1 + et ) is q.s.c., and give its self-concordance parameter.
(d) Let f be C-q.s.c. and for a fixed x ∈ Rd , define h(θ) := f (⟨θ, x⟩). Show that for any θ, θ0 with
∆ = θ − θ0 , h satisfies
(a) Show that logistic regression with loss ℓ(θ, (x, y)) = log(1+e−y⟨x,θ⟩ ) and robust linear regression
with loss ℓ(θ, (x, y)) = log(1 + ey−⟨x,θ⟩ ) + log(1 + e⟨x,θ⟩−y ) are both 1-q.s.c. losses.
For the remainder of the problem, assume that the data $\mathcal{X} \subset \mathbb{R}^d$ satisfy $\|x\|_2 \le \sqrt{d}$ for all $x \in \mathcal{X}$.
Let L(θ) = E[ℓ(θ, (X, Y ))] and θ⋆ = argminθ L(θ). Assume that ∇2 L(θ⋆ ) ⪰ λI, where λ > 0 is
fixed, and let Ln (θ) = Pn ℓ(θ, (X, Y )) as usual.
(b) Show that if $\|\theta - \theta^\star\|_2 \le 1/\sqrt{d}$, then $\nabla^2 L_n(\theta) \succeq e^{-C} \nabla^2 L_n(\theta^\star)$.
(d) Let ℓ be a 1-q.s.c. loss and assume that |h′ (t, y)| ≤ 1 and |h′′ (t, y)| ≤ 1 for all t ∈ R. Give a
result similar to that of Theorem 5.3.8, but show that your conclusions hold with probability
at least 1 − δ as soon as
\[
n \gtrsim \frac{d^2}{\lambda^2} \log \frac{d}{\delta}.
\]
Exercise 5.7 (Truncation to obtain a moment bound): Let B < ∞. Show that under the
conditions of Corollary 5.3.10,
\[
E\Big[\big\|\hat\theta_n - \theta^\star\big\|_2^2 \wedge B\Big] \le \frac{\operatorname{tr}\big(\nabla^2 L(\theta^\star)^{-1} \operatorname{Cov}(\nabla \ell(\theta^\star, Z)) \nabla^2 L(\theta^\star)^{-1}\big)}{n} + \frac{C \log n}{n^{3/2}},
\]
where C is a problem-dependent constant.
Exercise 5.8: Let θbn = argminθ Ln (θ) for an M-estimation problem satisfying the conditions of
Corollary 5.3.10. Show that for any k ≥ 1, there are events En with P(En ) ≥ 1 − n−k and for which
\[
\bigg| E\Big[\big\|\hat\theta_n - \theta^\star\big\|_2^2 \, 1\{E_n\}\Big] - \frac{1}{n} \operatorname{tr}\big(\nabla^2 L(\theta^\star)^{-1} \operatorname{Cov}(\nabla \ell(\theta^\star, Z)) \nabla^2 L(\theta^\star)^{-1}\big) \bigg| \le \frac{Ck \log n}{n^{3/2}}.
\]
Exercise 5.9:
(a) Let $P_\theta$ be a minimal exponential family (Definition 3.2) with density $p_\theta(x) = \exp(\theta^\top x - A(\theta))$
with respect to a base measure µ. Show that for any θ⋆ ∈ dom A, the well-specified population
loss L(θ) = −Eθ⋆ [log pθ (X)] has unique minimizer θ⋆ .
For the remainder of the problem, we no longer assume the model is well-specified. For a measure $\mu$ on a set $\mathcal{X}$, recall that the essential supremum of a function $f : \mathcal{X} \to \mathbb{R}$ is $\operatorname{ess\,sup}_\mu f := \inf\{t \in \mathbb{R} \mid \mu(\{x : f(x) > t\}) = 0\}$.
(b) Let {Pθ } be the exponential family with density pθ (x) = exp(θ⊤ x − A(θ)) with respect to the
measure µ. Let X be a random variable for which µ essentially covers (5.4.1) the mean E[X].
Show that L(θ) = −E[log pθ (X)] has a minimizer.
Hint. A continuous convex function h has a minimizer if it is coercive, meaning that h(θ) → ∞
whenever ∥θ∥2 → ∞. Corollary C.3.7, part (i) may be useful.
Exercise 5.10: In this problem, you provide sufficient conditions for generalized linear models to
have minimizers. Let L(θ) = E[ℓ(θ, (X, Y ))] be the population loss (which may be misspecified).
(a) Consider Poisson regression with loss $\ell(\theta, (x, y)) = -y x^\top \theta + e^{\theta^\top x}$. Show that $L$ has a unique minimizer if $E[X] = 0$ and $\operatorname{Cov}(X) = E[XX^\top] \succ 0$.
(b) Consider logistic regression with loss $\ell(\theta, (x, y)) = -y x^\top \theta + \log(1 + e^{\theta^\top x})$ for $y \in \{0, 1\}$. Show that $L$ has a unique minimizer if $E[XX^\top] \succ 0$ and $0 < P(Y = 1 \mid X) < 1$ with probability 1 over $X$.
(c) Consider a generalized linear model with densities pθ (y | x) = exp(ϕ(x, y)⊤ θ − A(θ | x)) w.r.t.
a base measure µ(· | x) on y ∈ Y, and assume for simplicity that µ(Y | x) = 1 for all x. Assume
that for each vector v ∈ Sd−1 and x ∈ X ,
\[
\operatorname*{ess\,sup}_{\mu(\cdot \mid x)} \{v^\top \phi(x, y)\} \ge E_P[v^\top \phi(x, Y) \mid X = x],
\]
and the set of x for which a strict inequality holds has positive P -probability. (This is equivalent
to the set of x for which µ(· | x) essentially covers (5.4.1) the conditional mean EP [ϕ(x, Y ) |
X = x] having positive probability.) Show that a minimizer of L exists. You may assume
E[|A(θ | X)|] < ∞ for ∥θ∥2 ≤ 1 if it is convenient.
Hint. The techniques to solve Exercise 5.9 may be useful. In addition, see Exercise 4.2.
Exercise 5.11: Consider the robust regression setting of Example 5.3.3, and let h ≥ 0 be a
symmetric convex function, twice continuously differentiable in a neighborhood of 0. Assume that
for any (measurable) subset X0 ⊂ X , E[Y | X ∈ X0 ] exists and is finite, and assume E[XX ⊤ ] ≻
0. Show that a minimizer of L(θ) := E[h(⟨θ, X⟩ − Y )] exists. Hint. Show that L is coercive.
Corollary C.3.7, part (i) may be useful.
Exercise 5.12 (The delta method for approximate sums): Let T : Rd → Rp be a differentiable
function with derivative matrix Ṫ (θ) ∈ Rp×d , so that T (θ + ∆) = T (θ) + Ṫ (θ)∆ + o(∥∆∥) as ∆ → 0.
Let θbn ∈ Rd be a sequence of random vectors with
θbn − θ = Pn Z + Rn ,
where Zi are i.i.d. and Rn is a remainder term.
(a) Assume that $E[\|Z_i\|_2^2] < \infty$ and that for each $\epsilon > 0$, $P(\|R_n\|_2 \ge \epsilon/\sqrt{n}) \to 0$ as $n \to \infty$. Show that
\[
T(\hat\theta_n) - T(\theta) = \dot T(\theta) P_n Z + R_n',
\]
where the remainder $R_n'$ also satisfies $P(\|R_n'\|_2 \ge \epsilon/\sqrt{n}) \to 0$ for all $\epsilon > 0$.
(b) Assume that $T$ is locally smooth enough that for some $K < \infty$ and $\delta > 0$,
\[
\big\|T(\theta + \Delta) - T(\theta) - \dot T(\theta) \Delta\big\|_2 \le K \|\Delta\|_2^2
\]
when $\|\Delta\|_2 \le \delta$. Assume additionally that there exist $C_0, C_1 < \infty$ such that for $t \ge 0$, we have $\|R_n\|_2 \le \frac{C_0 t}{n}$ with probability at least $1 - e^{-t}$ and that $P(\|P_n Z\|_2 \ge t) \le C_1 \exp(-n t^2 / \sigma^2)$.
Give a quantitative version of part (a).
JCD Comment: Add in some connections to the exponential family material. Some
ideas:
1. A hypothesis test likelihood ratio for them (see page 40 of handwritten notes)
2. A full learning guarantee with convergence of Hessian and everything, e.g., for logistic
regression?
3. In the Ledoux-Talagrand stuff, maybe worth going through example of logistic regres-
sion. Also, having working logistic example throughout? Helps clear up the structure
and connect with exponential families.
Chapter 6
Concentration inequalities provide powerful techniques for demonstrating when random objects
that are functions of collections of independent random variables—whether sample means, functions
with bounded variation, or collections of random vectors—behave similarly to their expectations.
This chapter continues exploration of these ideas by incorporating the central thesis of this book:
that information theory’s connections to statistics center around measuring when (and how) two
probability distributions get close to one another. On its face, we remain focused on the main
objects of the preceding chapter, where we have a population probability distribution P on a space
X and some collection of functions f : X → R. We then wish to understand when we expect the
empirical distribution
\[
P_n := \frac{1}{n} \sum_{i=1}^n 1_{X_i},
\]
defined by the sample $X_i \stackrel{\rm iid}{\sim} P$, to be close to the population $P$ as measured by $f$. Following the notation we introduce in Section 5.1, for $P f := E_P[f(X)]$, we again ask to have
\[
P_n f - P f = \frac{1}{n} \sum_{i=1}^n f(X_i) - E_P[f(X)]
\]
where the supremum is taken over measurable functions g : X → R with EQ [eg(X) ] < ∞. We can
also replace this by bounded simple functions g.
We give one proof of this result and one sketch of a proof, valid when the underlying space is discrete, that may be more intuitive: the first constructs a particular "tilting" of $Q$ via the function $e^g$ and verifies the equality. The second relies on the discretization of the KL-divergence and may be more intuitive to readers familiar with convex optimization: essentially, we expect this result because the function $\log(\sum_{j=1}^k e^{x_j})$ is the convex conjugate of the negative entropy. (See also Exercise 6.1.)
Proof We may assume that P is absolutely continuous with respect to Q, meaning that Q(A) = 0
implies that P (A) = 0, as otherwise both sides are infinite by inspection. Thus, it is no loss of
generality to let P and Q have densities p and q.
Attainment in the equality is easy: we simply take $g(x) = \log \frac{p(x)}{q(x)}$, so that $E_Q[e^{g(X)}] = 1$. To show that the right hand side is never larger than $D_{\rm kl}(P \| Q)$ requires a bit more work. To that end, let $g$ be any function such that $E_Q[e^{g(X)}] < \infty$, and define the random variable $Z_g(x) = e^{g(x)} / E_Q[e^{g(X)}]$, so that $E_Q[Z_g] = 1$. Then using the absolute continuity of $P$ w.r.t. $Q$, we have
\[
E_P[\log Z_g] = E_P\bigg[\log \frac{p(X)}{q(X)} + \log \frac{q(X)}{p(X)} Z_g(X)\bigg] = D_{\rm kl}(P \| Q) + E_P\bigg[\log \frac{dQ}{dP} Z_g\bigg]
\le D_{\rm kl}(P \| Q) + \log E_P\bigg[\frac{dQ}{dP} Z_g\bigg] = D_{\rm kl}(P \| Q) + \log E_Q[Z_g].
\]
As EQ [Zg ] = 1, using that EP [log Zg ] = EP [g(X)] − log EQ [eg(X) ] gives the result.
For the claim that bounded simple functions are sufficient, all we need to do is demonstrate (asymptotic) achievability. For this, we use the definition (2.2.1) of the KL-divergence as a supremum over partitions. Take a sequence $\mathcal{A}_n$ of partitions such that $D_{\rm kl}(P \| Q \mid \mathcal{A}_n) \to D_{\rm kl}(P \| Q)$. Then let $g_n(x) = \sum_{A \in \mathcal{A}_n} 1\{x \in A\} \log \frac{P(A)}{Q(A)}$, which gives $D_{\rm kl}(P \| Q \mid \mathcal{A}_n) = E_P[g_n(X)] - \log E_Q[e^{g_n(X)}]$.
Here is the second proof of Theorem 6.1.1, which applies when X is discrete and finite. That we
can approximate KL-divergence by suprema over finite partitions (as in definition (2.2.1)) suggests
that this approach works in general—which it can—but this requires some not completely trivial
approximations of EP [g] and EQ [eg ] by discretized versions of their expectations, which makes
things rather tedious.
Proof of Theorem 6.1.1, the finite case We have assumed that $P$ and $Q$ have finite supports, which we identify with $\{1, \ldots, k\}$, with p.m.f.s $p, q \in \Delta_k = \{p \in \mathbb{R}^k_+ \mid \langle 1, p \rangle = 1\}$. Define $f_q(v) = \log(\sum_{j=1}^k q_j e^{v_j})$, which is convex in $v$ (recall Proposition 3.2.1). Then the supremum in the variational representation takes the form
\[
h(p) := \sup_v \big\{ \langle p, v \rangle - f_q(v) \big\}.
\]
If we can take derivatives and solve for zero, we are guaranteed to achieve the supremum. To that
end, note that
\[
\nabla_v \{\langle p, v \rangle - f_q(v)\} = p - \bigg[\frac{q_i e^{v_i}}{\sum_{j=1}^k q_j e^{v_j}}\bigg]_{i=1}^k,
\]
so that setting $v_j = \log \frac{p_j}{q_j}$ achieves $p - \nabla_v f_q(v) = p - p = 0$ and hence the supremum. Noting that $\log(\sum_{j=1}^k q_j \exp(\log \frac{p_j}{q_j})) = \log(\sum_{j=1}^k p_j) = 0$ gives $h(p) = D_{\rm kl}(p \| q)$.
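The finite-case argument is easy to verify numerically: with $v = \log(p/q)$ the objective $\langle p, v \rangle - f_q(v)$ attains $D_{\rm kl}(p \| q)$, and no other $v$ does better. A small sketch (the two distributions below are arbitrary choices):

```python
import numpy as np

# Verify the finite-alphabet Donsker-Varadhan identity:
# D_kl(p||q) = sup_v { <p, v> - log sum_j q_j e^{v_j} }, attained at v = log(p/q).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

def f_q(v):
    return np.log(np.sum(q * np.exp(v)))   # the log-partition f_q(v)

kl = np.sum(p * np.log(p / q))
v_opt = np.log(p / q)                      # optimal tilt from the proof
attained = p @ v_opt - f_q(v_opt)          # equals D_kl(p||q) since f_q(v_opt) = 0

rng = np.random.default_rng(1)
for _ in range(100):                       # any other v gives a smaller value
    v = rng.normal(size=3)
    assert p @ v - f_q(v) <= attained + 1e-12

assert abs(kl - attained) < 1e-12
```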
The Donsker-Varadhan variational representation already gives a hint that we can use information-theoretic techniques to control the difference between an empirical sample and its expectation, at least in an average sense. In particular, we see that for any function $g$, we have
\[
E_P[g(X)] \le D_{\rm kl}(P \| Q) + \log E_Q[e^{g(X)}]
\]
for any random variable $X$. Now, turning this on its head a bit, suppose that we consider a collection of functions $\mathcal{F}$ and put two probability measures $\pi$ and $\pi_0$ on $\mathcal{F}$, and consider $P_n f - P f$, where we consider $f$ a random variable $f \sim \pi$ or $f \sim \pi_0$. Then a consequence of the Donsker-Varadhan theorem is that
\[
\int (P_n f - P f) \, d\pi(f) \le D_{\rm kl}(\pi \| \pi_0) + \log \int \exp(P_n f - P f) \, d\pi_0(f)
\]
for any $\pi, \pi_0$. While this inequality is a bit naive (bounding a difference by an exponential seems wasteful), as we shall see it has substantial applications when we can upper bound the KL-divergence $D_{\rm kl}(\pi \| \pi_0)$.
expectation of f under P . The main theorem of this section shows that averages of the squared
error (Pn f − P f )2 of the empirical distribution Pn to P converge quickly to zero for all averaging
distributions π on functions f ∈ F so long as each f is σ 2 -sub-Gaussian, with the caveat that we
pay a cost for different choices of π. The key is that we choose some prior distribution π0 on F
first.
Theorem 6.2.1. Let $\Pi$ be the collection of all probability distributions on the set $\mathcal{F}$ and let $\pi_0$ be a fixed prior probability distribution on $f \in \mathcal{F}$. With probability at least $1 - \delta$,
\[
\int (P_n f - P f)^2 \, d\pi(f) \le \frac{8 \sigma^2}{3} \frac{D_{\rm kl}(\pi \| \pi_0) + \log \frac{2}{\delta}}{n} \quad \text{simultaneously for all } \pi \in \Pi.
\]
Proof Without loss of generality, we assume that $P f = 0$ for all $f \in \mathcal{F}$, and recall that $P_n f = \frac{1}{n} \sum_{i=1}^n f(X_i)$ is the empirical mean of $f$. Then we know that $P_n f$ is $\sigma^2/n$-sub-Gaussian, and Lemma 6.2.2 implies that $E[\exp(\lambda (P_n f)^2)] \le (1 - 2\lambda\sigma^2/n)_+^{-1/2}$ for any $f$, and thus for any prior $\pi_0$ on $f$ we have
\[
E \int \exp\big(\lambda (P_n f)^2\big) \, d\pi_0(f) \le \big(1 - 2\lambda\sigma^2/n\big)_+^{-1/2}.
\]
Consequently, taking $\lambda = \lambda_n := \frac{3n}{8\sigma^2}$, we obtain
\[
E \int \exp\big(\lambda_n (P_n f)^2\big) \, d\pi_0(f) = E \int \exp\Big(\frac{3n}{8\sigma^2} (P_n f)^2\Big) \, d\pi_0(f) \le 2.
\]
This holds without any probabilistic qualifications, so using the application (6.2.1) of Markov’s
inequality with λ = λn , we thus see that with probability at least 1 − δ over X1 , . . . , Xn , simulta-
neously for all distributions π,
By Jensen’s inequality (or Cauchy-Schwarz), it is immediate from Theorem 6.2.1 that we also
have
\[
\int |P_n f - P f| \, d\pi(f) \le \sqrt{\frac{8 \sigma^2}{3} \frac{D_{\rm kl}(\pi \| \pi_0) + \log \frac{2}{\delta}}{n}} \quad \text{simultaneously for all } \pi \in \Pi \tag{6.2.2}
\]
with probability at least $1 - \delta$, so that $E_\pi[|P_n f - P f|]$ is with high probability of order $1/\sqrt{n}$. The inequality (6.2.2) is the original form of the PAC-Bayes bound due to McAllester, with slightly sharper constants and improved logarithmic dependence. The key is that stability, in the form of closeness of a prior $\pi_0$ and posterior $\pi$, allows us to achieve reasonably tight control over the deviations of random variables and functions with high probability.
Let us give an example, which is similar to many of our approaches in Section 5.2, to illustrate
some of the approaches this allows. The basic idea is that by appropriate choice of prior π0
and “posterior” π, whenever we have appropriately smooth classes of functions we achieve certain
generalization guarantees.
Example 6.2.3 (A uniform law for Lipschitz functions): Consider a case as in Section 5.2,
where we let L(θ) = P ℓ(θ, Z) for some function ℓ : Θ × Z → R. Let Bd2 = {v ∈ Rd | ∥v∥2 ≤ 1}
be the ℓ2 -ball in Rd , and let us assume that Θ ⊂ rBd2 and additionally that θ 7→ ℓ(θ, z) is
M -Lipschitz for all z ∈ Z. For simplicity, we assume that ℓ(θ, z) ∈ [0, 2M r] for all θ ∈ Θ (we
may simply relativize our bounds by replacing ℓ by ℓ(·, z) − inf θ∈Θ ℓ(θ, z) ∈ [0, 2M r]).
If $\hat L_n(\theta) = P_n \ell(\theta, Z)$, then Theorem 6.2.1 implies that
\[
\int |\hat L_n(\theta) - L(\theta)| \, d\pi(\theta) \le \sqrt{\frac{8 M^2 r^2}{3n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{2}{\delta}\Big)}
\]
for all $\pi$ with probability at least $1 - \delta$. Now, let $\theta_0 \in \Theta$ be arbitrary, and for $\epsilon > 0$ (to be chosen later) take $\pi_0$ to be uniform on $(r + \epsilon) B_2^d$ and $\pi$ to be uniform on $\theta_0 + \epsilon B_2^d$. Then we immediately see that $D_{\rm kl}(\pi \| \pi_0) = d \log(1 + \frac{r}{\epsilon})$. Moreover, we have $\int \hat L_n(\theta) \, d\pi(\theta) \in \hat L_n(\theta_0) \pm M \epsilon$ and similarly for $L(\theta)$, by the $M$-Lipschitz continuity of $\ell$. For any fixed $\epsilon > 0$, we thus have
\[
|\hat L_n(\theta_0) - L(\theta_0)| \le 2 M \epsilon + \sqrt{\frac{8 M^2 r^2}{3n} \Big(d \log\Big(1 + \frac{r}{\epsilon}\Big) + \log \frac{2}{\delta}\Big)}
\]
simultaneously for all $\theta_0 \in \Theta$, with probability at least $1 - \delta$. By choosing $\epsilon = \frac{rd}{n}$ we obtain that with probability at least $1 - \delta$,
\[
\sup_{\theta \in \Theta} |\hat L_n(\theta) - L(\theta)| \le \frac{2 M r d}{n} + \sqrt{\frac{8 M^2 r^2}{3n} \Big(d \log\Big(1 + \frac{n}{d}\Big) + \log \frac{2}{\delta}\Big)}.
\]
Thus, roughly, with high probability we have $|\hat L_n(\theta) - L(\theta)| \le O(1) M r \sqrt{\frac{d}{n} \log \frac{n}{d}}$ for all $\theta$.
On the one hand, the result in Example 6.2.3 is satisfying: it applies to any Lipschitz function
and provides a uniform bound. On the other hand, when we compare to the results achievable for
specially structured linear function classes by applying Rademacher complexity bounds, such as Proposition 5.2.9 and Example 5.2.10, the result here is somewhat weaker, in that it depends on the dimension explicitly while the Rademacher bounds do not. The Rademacher bounds can thus potentially apply in infinite-dimensional spaces where Example 6.2.3 cannot. We will give an example presently showing how to address some of these issues.
we have that
\[
E[\exp(n \Delta_n(f, \lambda))] = E\big[\exp\big(\lambda(f(X) - P f) - \lambda^2 \sigma^2(f)\big)\big]^n \le 1
\]
for all $n$, $f \in \mathcal{F}$, and $|\lambda| \le \frac{1}{2b}$. Then, for any fixed measure $\pi_0$ on $\mathcal{F}$, Markov's inequality implies that
\[
P\bigg(\int \exp(n \Delta_n(f, \lambda)) \, d\pi_0(f) \ge \frac{1}{\delta}\bigg) \le \delta. \tag{6.2.4}
\]
Now, as in the proof of Theorem 6.2.1, we use the Donsker-Varadhan Theorem 6.1.1 (change of
measure), which implies that
\[
n \int \Delta_n(f, \lambda) \, d\pi(f) \le D_{\rm kl}(\pi \| \pi_0) + \log \int \exp(n \Delta_n(f, \lambda)) \, d\pi_0(f)
\]
for all distributions $\pi$. Using inequality (6.2.4), we obtain that with probability at least $1 - \delta$,
\[
\int \Delta_n(f, \lambda) \, d\pi(f) \le \frac{1}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta}\Big)
\]
for all π. As this holds for any fixed |λ| ≤ 1/(2b), this gives the desired result by rearranging.
We would like to optimize over the bound in Proposition 6.2.4 by choosing the “best” λ. If we
could choose the optimal λ, by rearranging Proposition 6.2.4 we would obtain the bound
\[
E_\pi[P f] \le E_\pi[P_n f] + \inf_{\lambda > 0} \bigg\{ \lambda E_\pi[\sigma^2(f)] + \frac{1}{n \lambda} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta}\Big) \bigg\}
= E_\pi[P_n f] + 2 \sqrt{\frac{E_\pi[\sigma^2(f)]}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta}\Big)}
\]
simultaneously for all π, with probability at least 1−δ. The problem with this approach is two-fold:
first, we cannot arbitrarily choose λ in Proposition 6.2.4, and second, the bound above depends on
the unknown population variance σ 2 (f ). It is thus of interest to understand situations in which
we can obtain similar guarantees, but where we can replace unknown population quantities on the
right side of the bound with known quantities.
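The infimum over $\lambda$ in the display above is a standard computation: $\inf_{\lambda > 0}\{\lambda V + D/(n\lambda)\} = 2\sqrt{VD/n}$, attained at $\lambda = \sqrt{D/(nV)}$. A quick check with hypothetical values for the variance and divergence terms:

```python
import numpy as np

V, D, n = 0.7, 3.2, 1000       # hypothetical E_pi[sigma^2(f)] and KL + log(1/delta)
lam_star = np.sqrt(D / (n * V))           # stationary point of lam*V + D/(n*lam)
closed_form = 2 * np.sqrt(V * D / n)

# the stationary point attains the closed form, and no grid point beats it
assert abs(lam_star * V + D / (n * lam_star) - closed_form) < 1e-12
for lam in np.linspace(1e-4, 1.0, 10000):
    assert lam * V + D / (n * lam) >= closed_form - 1e-12
```

The obstruction noted in the text is visible here: the optimal $\lambda$ depends on the population quantity $V$, which is unknown in practice.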
To that end, let us consider the following condition, a type of relative error condition related to the Bernstein condition (4.1.7): for each $f \in \mathcal{F}$,
\[
\sigma^2(f) \le b \, P f. \tag{6.2.5}
\]
This condition is most natural when each of the functions f take nonnegative values—for example,
when f (X) = ℓ(θ, X) for some loss function ℓ and parameter θ of a model. If the functions f are
nonnegative and upper bounded by b, then we certainly have σ 2 (f ) ≤ E[f (X)2 ] ≤ bE[f (X)] = bP f ,
so that Condition (6.2.5) holds. Revisiting Proposition 6.2.4, we rearrange to obtain the following
theorem.
Theorem 6.2.5. Let $\mathcal{F}$ be a collection of functions satisfying the Bernstein condition (6.2.3) as in Proposition 6.2.4, and in addition, assume the variance-bounding condition (6.2.5). Then for any $0 \le \lambda \le \frac{1}{2b}$, with probability at least $1 - \delta$,
\[
E_\pi[P f] \le E_\pi[P_n f] + \frac{\lambda b}{1 - \lambda b} E_\pi[P_n f] + \frac{1}{\lambda(1 - \lambda b)} \frac{1}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta}\Big)
\]
for all $\pi$.
apply Proposition 6.2.4, and divide both sides of the resulting inequality by λ(1 − λb).
To make this uniform in $\lambda$, thus achieving a tighter bound (so that we need not pre-select $\lambda$), we choose multiple values of $\lambda$ and apply a union bound. To that end, let $1 + \eta = \frac{1}{1 - \lambda b}$, or $\eta = \frac{\lambda b}{1 - \lambda b}$ and $\frac{1}{\lambda b (1 - \lambda b)} = \frac{(1 + \eta)^2}{\eta}$, so that the inequality in Theorem 6.2.5 is equivalent to
\[
E_\pi[P f] \le E_\pi[P_n f] + \eta E_\pi[P_n f] + \frac{(1 + \eta)^2}{\eta} \frac{b}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta}\Big).
\]
Using that our choice of $\eta \in (0, 1]$, this implies
\[
E_\pi[P f] \le E_\pi[P_n f] + \eta E_\pi[P_n f] + \frac{b}{\eta n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta}\Big) + \frac{3b}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta}\Big).
\]
Now, take η1 = 1/n, . . . , ηn = 1. Then by optimizing over η ∈ {η1 , . . . , ηn } (which is equivalent, to
within a 1/n factor, to optimizing over 0 < η ≤ 1) and applying a union bound, we obtain
Corollary 6.2.6. Let the conditions of Theorem 6.2.5 hold. Then with probability at least $1 - \delta$,
\[
E_\pi[P f] \le E_\pi[P_n f] + 2 \sqrt{\frac{b E_\pi[P_n f]}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta}\Big)} + \frac{1}{n} E_\pi[P_n f] + \frac{5b}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta}\Big),
\]
simultaneously for all $\pi$ on $\mathcal{F}$.
Proof By a union bound, we have
\[
E_\pi[P f] \le E_\pi[P_n f] + \eta E_\pi[P_n f] + \frac{b}{\eta n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta}\Big) + \frac{3b}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta}\Big)
\]
for each $\eta \in \{1/n, \ldots, 1\}$. We consider two cases. In the first, assume that $E_\pi[P_n f] \le \frac{b}{n}(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta})$. Then taking $\eta = 1$ above evidently gives the result. In the second, we have $E_\pi[P_n f] > \frac{b}{n}(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta})$, and we can set
\[
\eta_\star = \sqrt{\frac{\frac{b}{n}\big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta}\big)}{E_\pi[P_n f]}} \in (0, 1).
\]
Choosing $\eta$ to be the smallest value $\eta_k$ in $\{\eta_1, \ldots, \eta_n\}$ with $\eta_k \ge \eta_\star$, so that $\eta_\star \le \eta \le \eta_\star + \frac{1}{n}$, then implies the claim in the corollary.
We call the quantity $\langle \theta, x \rangle y$ the margin of $\theta$ on the pair $(x, y)$, noting that when the margin is large, $\langle \theta, x \rangle$ has the same sign as $y$ and is "confident" (i.e., far from zero). For shorthand, let us define the expected and empirical losses at margin $\gamma$ by
\[
L_\gamma(\theta) := P(\langle \theta, X \rangle Y \le \gamma) \quad \text{and} \quad \hat L_\gamma(\theta) := P_n(\langle \theta, X \rangle Y \le \gamma).
\]
Consider the following scenario: the data $x$ lie in a ball of radius $b$, so that $\|x\|_2 \le b$; note that the losses $\ell_\gamma$ and $\ell_0$ satisfy the Bernstein (6.2.3) and self-bounding (6.2.5) conditions with constant 1 as they take values in $\{0, 1\}$. We then have the following proposition.
Proposition 6.2.7. Let the above conditions on the data $(x, y)$ hold and let the margin $\gamma > 0$ and radius $r < \infty$. Then with probability at least $1 - \delta$,
\[
P(\langle \theta, X \rangle Y \le 0) \le \Big(1 + \frac{1}{n}\Big) P_n(\langle \theta, X \rangle Y \le \gamma) + 8 \frac{r b \sqrt{\log \frac{n}{\delta}}}{\gamma \sqrt{n}} \sqrt{P_n(\langle \theta, X \rangle Y \le \gamma)} + C \frac{r^2 b^2 \log \frac{n}{\delta}}{\gamma^2 n}
\]
simultaneously for all $\|\theta\|_2 \le r$, where $C$ is a numerical constant independent of the problem parameters.
Proposition 6.2.7 provides a “dimension-free” guarantee—it depends only on the ℓ2 -norms ∥θ∥2
and ∥x∥2 —so that it can apply equally in infinite dimensional spaces. The key to the inequality
is that if we can find a large margin predictor—for example, one achieved by a support vector
machine or, more broadly, by minimizing a convex loss of the form
\[
\mathop{\rm minimize}_{\|\theta\|_2 \le r} \quad \frac{1}{n} \sum_{i=1}^n \phi(\langle X_i, \theta \rangle Y_i)
\]
for some decreasing convex ϕ : R → R+ , e.g. ϕ(t) = [1 − t]+ or ϕ(t) = log(1 + e−t )—then we get
strong generalization performance guarantees relative to the empirical margin γ. As one particular
instantiation of this approach, suppose we can obtain a perfect classifier with positive margin: a
vector θ with ∥θ∥2 ≤ r such that ⟨θ, Xi ⟩Yi ≥ γ for each i = 1, . . . , n. Then Proposition 6.2.7
guarantees that
\[
P(\langle \theta, X \rangle Y \le 0) \le C \frac{r^2 b^2 \log \frac{n}{\delta}}{\gamma^2 n}
\]
with probability at least 1 − δ.
Proof Let $\pi_0$ be $N(0, \tau^2 I)$ for some $\tau > 0$ to be chosen, and let $\pi$ be $N(\hat\theta, \tau^2 I)$ for some $\hat\theta \in \mathbb{R}^d$ satisfying $\|\hat\theta\|_2 \le r$. Then Corollary 6.2.6 implies that
\[
E_\pi[L_\gamma(\theta)] \le E_\pi[\hat L_\gamma(\theta)] + 2 \sqrt{\frac{E_\pi[\hat L_\gamma(\theta)]}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta}\Big)} + \frac{1}{n} E_\pi[\hat L_\gamma(\theta)] + \frac{C}{n} \Big(D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta}\Big)
\le E_\pi[\hat L_\gamma(\theta)] + 2 \sqrt{\frac{E_\pi[\hat L_\gamma(\theta)]}{n} \Big(\frac{r^2}{2\tau^2} + \log \frac{n}{\delta}\Big)} + \frac{1}{n} E_\pi[\hat L_\gamma(\theta)] + \frac{C}{n} \Big(\frac{r^2}{2\tau^2} + \log \frac{n}{\delta}\Big),
\]
since $D_{\rm kl}(\pi \| \pi_0) = \frac{\|\hat\theta\|_2^2}{2\tau^2} \le \frac{r^2}{2\tau^2}$.
140
Lexture Notes on Statistics and Information Theory John Duchi
Let us use the margin assumption. Note that if $Z \sim N(0, \tau^2 I)$, then for any fixed $\theta_0, x, y$ we have
\[
\ell_0(\theta_0; (x, y)) - P(Z^\top x \ge \gamma) \le E[\ell_\gamma(\theta_0 + Z; (x, y))] \le \ell_{2\gamma}(\theta_0; (x, y)) + P(Z^\top x \ge \gamma),
\]
where the expectation is over $Z \sim N(0, \tau^2 I)$. Using the $\tau^2 \|x\|_2^2$-sub-Gaussianity of $Z^\top x$, we immediately obtain that if $\|x\|_2 \le b$, we have
\[
\ell_0(\theta_0; (x, y)) - \exp\Big(-\frac{\gamma^2}{2 \tau^2 b^2}\Big) \le E[\ell_\gamma(\theta_0 + Z; (x, y))] \le \ell_{2\gamma}(\theta_0; (x, y)) + \exp\Big(-\frac{\gamma^2}{2 \tau^2 b^2}\Big).
\]
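The sub-Gaussian bound $P(Z^\top x \ge \gamma) \le \exp(-\gamma^2/(2\tau^2 \|x\|_2^2))$ invoked here dominates the exact Gaussian tail, which we can confirm numerically; the parameter values below are hypothetical:

```python
import math

tau, gamma, b = 0.5, 1.2, 2.0
sigma = tau * b                    # Z^T x ~ N(0, tau^2 ||x||^2) with ||x||_2 = b
# exact Gaussian tail P(N(0, sigma^2) >= gamma) via the complementary error function
exact_tail = 0.5 * math.erfc(gamma / (sigma * math.sqrt(2.0)))
subgauss_bound = math.exp(-gamma**2 / (2 * tau**2 * b**2))
assert exact_tail <= subgauss_bound
```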
Returning to our earlier bound, we evidently have that if $\|x\|_2 \le b$ for all $x \in \mathcal{X}$, then with probability at least $1 - \delta$, simultaneously for all $\theta \in \mathbb{R}^d$ with $\|\theta\|_2 \le r$,
\[
L_0(\theta) \le \hat L_{2\gamma}(\theta) + 2 \exp\Big(-\frac{\gamma^2}{2\tau^2 b^2}\Big) + 2 \sqrt{\frac{\hat L_{2\gamma}(\theta) + \exp(-\frac{\gamma^2}{2\tau^2 b^2})}{n} \Big(\frac{r^2}{2\tau^2} + \log \frac{n}{\delta}\Big)}
+ \frac{1}{n} \Big(\hat L_{2\gamma}(\theta) + \exp\Big(-\frac{\gamma^2}{2\tau^2 b^2}\Big)\Big) + \frac{C}{n} \Big(\frac{r^2}{2\tau^2} + \log \frac{n}{\delta}\Big).
\]
Setting $\tau^2 = \frac{\gamma^2}{2 b^2 \log n}$, we immediately see that for any choice of margin $\gamma > 0$, we have with probability at least $1 - \delta$ that
\[
L_0(\theta) \le \hat L_{2\gamma}(\theta) + \frac{2}{n} + 2 \sqrt{\frac{1}{n} \Big(\hat L_{2\gamma}(\theta) + \frac{1}{n}\Big) \Big(\frac{r^2 b^2 \log n}{\gamma^2} + \log \frac{n}{\delta}\Big)}
+ \frac{1}{n} \Big(\hat L_{2\gamma}(\theta) + \frac{1}{n}\Big) + \frac{C}{n} \Big(\frac{r^2 b^2 \log n}{\gamma^2} + \log \frac{n}{\delta}\Big)
\]
for all $\|\theta\|_2 \le r$.
Rewriting (replacing 2γ with γ) and recognizing that with no loss of generality we may take γ
such that rb ≥ γ gives the claim of the proposition.
Corollary 6.2.8. Let F be chosen according to any distribution π(· | X1n ) conditional on the sample
iid
X1n . Then with probability at least 1 − δ0 − δ1 over the sample X1n ∼ P ,
\[
E[(P_n F - P F)^2 \mid X_1^n] \le \frac{8 \sigma^2}{3n} \bigg(\frac{I(F; X_1^n)}{\delta_0} + \log \frac{2}{\delta_1}\bigg).
\]
This corollary shows that if we have any procedure (say, a learning procedure or otherwise) that limits the information between a sample $X_1^n$ and an output $F$, then we are guaranteed that $F$ generalizes. Tighter analyses are possible, though they are not our focus here; already there should be an inkling that limiting the information between input samples and outputs may be fruitful.
of forking paths,” as Gelman and Loken [100] term it, causes challenges even when researchers are
not “p-hacking” or going on a “fishing expedition” to try to find publishable results. The problem
in these studies and approaches is that, because we make decisions that may, even only in a small
way, depend on the data observed, we have invalidated all classical statistical analyses.
To that end, we now consider interactive data analyses, where we perform data analyses se-
quentially, computing new functions on a fixed sample X1 , . . . , Xn after observing some initial
information about the sample. The starting point of our approach is similar to our analysis of
PAC-Bayesian learning and generalization: we observe that if the function we decide to compute
on the data X1n is chosen without much information about the data at hand, then its value on the
sample should be similar to its values on the full population. This insight dovetails with what we
have seen thus far, that appropriate “stability” in information can be useful and guarantee good
future performance.
by Corollary 4.1.10 (sub-Gaussian concentration) and a union bound. Thus, so long as |Φ| is not
exponential in the sample size n, we expect uniformly high accuracy.
Example 6.3.1 (Risk minimization via statistical queries): Suppose that we are in the loss-minimization setting (5.2.2), where the losses $\ell(\theta, X_i)$ are convex and differentiable in $\theta$. Then gradient descent applied to $\hat L_n(\theta) = P_n \ell(\theta, X)$ will converge to a minimizing value of $\hat L_n$. We can evidently implement gradient descent by a sequence of statistical queries $\phi(x) = \nabla_\theta \ell(\theta, x)$, iterating
\[
\theta^{(k+1)} = \theta^{(k)} - \alpha_k P_n \phi^{(k)}, \tag{6.3.2}
\]
where $\phi^{(k)} = \nabla_\theta \ell(\theta^{(k)}, x)$ and $\alpha_k$ is a stepsize.
One issue with the example (6.3.1) is that we are interacting with the dataset, because each
sequential query ϕ(k) depends on the previous k − 1 queries. (Our results on uniform convergence
of empirical functionals and related ideas address many of these challenges, so that the result of
the process (6.3.2) will be well-behaved regardless of the interactivity.)
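To make the iteration (6.3.2) concrete, here is a minimal sketch, not from the text: the least-squares loss, the data, and all names are illustrative assumptions. It runs gradient descent while touching the data only through the statistical query P_n ϕ^(k):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares loss l(theta, (a_i, b_i)) = 0.5 * (a_i @ theta - b_i)^2,
# so the query is phi(x) = grad_theta l(theta, x) = (a_i @ theta - b_i) * a_i.
n, d = 200, 3
A = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
b = A @ theta_star

def gradient_query(theta):
    """The statistical query P_n phi: the empirical mean of per-example gradients."""
    residuals = A @ theta - b
    return (A * residuals[:, None]).mean(axis=0)

theta = np.zeros(d)
alpha = 0.1  # stepsize alpha_k, held constant here
for _ in range(500):
    theta = theta - alpha * gradient_query(theta)  # the iteration (6.3.2)

print(np.round(theta, 4))
```

Each step depends on the sample only through the averaged gradient, which is exactly what makes the statistical-query abstraction (and the interactivity question that follows) natural.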
We consider an interactive version of the statistical query estimation problem. In this version,
there are two parties: an analyst (or statistician or learner), who issues queries ϕ : X → R, and
a mechanism that answers the queries to the analyst. We index our functionals ϕ by t ∈ T for a
(possibly infinite) set T , so we have a collection {ϕt }t∈T . In this context, we thus have the following
scheme:
Input: Sample X1n drawn i.i.d. P , collection {ϕt }t∈T of possible queries
Repeat: for k = 1, 2, . . .
Of interest in the iteration of Figure 6.1 is that we interactively choose T_1, T_2, . . . , T_k, where the choice T_i
may depend on our approximations of EP [ϕTj (X)] for j < i, that is, on the results of our previous
queries. Even more broadly, the analyst may be able to choose the index Tk in alternative ways
depending on the sample X1n , and our goal is to still be able to accurately compute expectations
P ϕT = EP [ϕT (X)] when the index T may depend on X1n . The setting in Figure 6.1 clearly breaks
with the classical statistical setting in which an analysis is pre-specified before collecting data, but
more closely captures modern data exploration practices.
for all distributions π on T, which may depend on P_n, where the expectation E is taken over the
i.i.d. sample X_1^n ∼ P. (Here inequality (i) is Theorem 6.1.1, inequality (ii) is Jensen's inequality, and
inequality (iii) is Lemma 6.2.2.) Now, let π0 be the marginal distribution on T (marginally over
all observations X1n ), and let π denote the posterior of T conditional on the sample X1n . Then
E[Dkl (π||π0 )] = I(X1n ; T ) by definition of the mutual information, giving the bound on the squared
error.
For the second result, note that the Donsker-Varadhan equality implies
    λ E[∫ P_n ϕ_t dπ(t)] ≤ E[D_kl(π||π_0)] + ∫ log E[exp(λ P_n ϕ_t)] dπ_0(t) ≤ I(X_1^n; T) + λ²σ²/(2n).
Dividing both sides by λ and optimizing over λ > 0 gives E[P_n ϕ_T] ≤ √(2σ² I(X_1^n; T)/n), and performing the same analysis with
−ϕ_T gives the second result of the theorem.
The key in the theorem is that if the mutual information—the Shannon information—I(X; T )
between the sample X and T is small, then the expected squared error can be small. To make this
a bit clearer, let us choose values for λ in the theorem; taking λ = n/(2eσ²) gives the following corollary.

Corollary 6.3.3. Under the conditions of the preceding theorem,
    E[(P_n ϕ_T − P ϕ_T)²] ≤ (2eσ²/n) I(X_1^n; T) + 5σ²/(4n).

Consequently, if we can limit the amount of information any particular query T (i.e., ϕ_T) contains
about the actual sample X_1^n, then we can guarantee reasonably high accuracy in the second moment errors
(P_n ϕ_T − P ϕ_T)².
Example 6.3.4 (A stylized correlation analysis): Consider the following stylized genetics
experiment. We observe vectors X ∈ {−1, 1}k , where Xj = 1 if gene j is expressed and −1
otherwise. We also observe phenotypes Y ∈ {−1, 1}, where Y = 1 indicates appearance of
the phenotype. In our setting, we will assume that the vectors X are uniform on {−1, 1}k
and independent of Y , but an experimentalist friend of ours wishes to know if there exists a
vector v with ∥v∥_2 = 1 such that the correlation between v^⊤X and Y is high, meaning that
v^⊤X is associated with Y. In our notation here, we have index set {v ∈ R^k | ∥v∥_2 = 1}, and
by Example 4.1.6, Hoeffding's lemma, and the independence of the coordinates of X we have
that v^⊤XY is ∥v∥_2²/4 = 1/4-sub-Gaussian. Now, we recall the fact that if Z_j, j = 1, . . . , k, are
σ²-sub-Gaussian, then for any p ≥ 1, we have
    E[max_j |Z_j|^p] ≤ (C p σ² log k)^{p/2}
for a numerical constant C. That is, powers of sub-Gaussian maxima grow at most logarithmically. Indeed, by Theorem 4.1.11, we have for any q ≥ 1 by Hölder's inequality that
    E[max_j |Z_j|^p] ≤ (E[Σ_j |Z_j|^{pq}])^{1/q} ≤ k^{1/q} (C p q σ²)^{p/2},
and setting q = log k gives the inequality. Thus, we see that for any a priori fixed v1 , . . . , vk , vk+1 ,
we have
    E[max_j (v_j^⊤ (P_n Y X))²] ≤ O(1) (log k)/n.
If instead we allow a single interaction, the problem is different. We issue queries
associated with v = e1 , . . . , ek , the k standard basis vectors; then we simply set Vk+1 =
Pn Y X/ ∥Pn Y X∥2 . Then evidently
    E[(V_{k+1}^⊤ (P_n Y X))²] = E[∥P_n Y X∥_2²] = k/n,
which is exponentially larger than in the non-interactive case. That is, if an analyst is allowed
to interact with the dataset, he or she may be able to discover very large correlations that are
certainly false in the population, which in this case has P XY = 0. 3
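A small simulation (hypothetical, not from the text; the parameter values are arbitrary) of the experiment in Example 6.3.4 illustrates the gap between the k pre-specified queries and the single adaptive query V_{k+1}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 100, 1000, 200

fixed, adaptive = [], []
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, k))   # gene indicators, uniform on {-1,1}^k
    Y = rng.choice([-1.0, 1.0], size=n)        # phenotype, independent of X
    g = (X * Y[:, None]).mean(axis=0)          # P_n Y X: answers to the queries e_1, ..., e_k
    fixed.append(np.max(g ** 2))               # best pre-specified direction: ~ log(k)/n
    v = g / np.linalg.norm(g)                  # adaptive direction V_{k+1}
    adaptive.append(float(v @ g) ** 2)         # equals ||P_n Y X||_2^2, with mean k/n

print(np.mean(fixed), np.mean(adaptive))
```

With n = 100 and k = 1000, the non-adaptive maximum concentrates near 2 log(k)/n ≈ 0.14, while the single adaptive query averages k/n = 10, despite the true population correlation being zero.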
Example 6.3.4 shows that, without being a little careful, substantial issues may arise in interac-
tive data analysis scenarios. When we consider our goal more broadly, which is to be able to provide
accurate approximations to P ϕ for queries ϕ chosen adaptively for any population distribution P
and ϕ : X → [−1, 1], it is possible to construct quite perverse situations, where if we compute
sample expectations Pn ϕ exactly, one round of interaction is sufficient to find a query ϕ for which
Pn ϕ − P ϕ ≥ 1.
Example 6.3.5 (Exact query answering allows arbitrary corruption): Suppose we draw an i.i.d.
sample X_1^n of size n on a sample space X = [m] with X_i ∼ Uniform([m]), where m ≥ 2n. Let
Φ be the collection of all functions ϕ : [m] → [−1, 1], so that P(|Pn ϕ − P ϕ| ≥ t) ≤ exp(−nt2 /2)
for any fixed ϕ. Suppose that in the interactive scheme in Fig. 6.1, we simply release answers
A = Pn ϕ. Consider the following query:
More generally, when one performs an interactive data analysis (e.g. as in Fig. 6.1), adapting
hypotheses while interacting with a dataset, it is not a question of statistical significance or mul-
tiplicity control for the analysis one does, but for all the possible analyses one might have done
otherwise. Given the branching paths one might take in an analysis, it is clear that we require
some care.
With that in mind, we consider the desiderata for techniques we might use to control information
in the indices we select. We seek some type of stability in the information algorithms provide
to a data analyst—intuitively, if small changes to a sample do not change the behavior of an
analyst substantially, then we expect to obtain reasonable generalization bounds. If outputs of a
particular analysis procedure carry little information about a particular sample (but instead provide
information about a population), then Corollary 6.3.3 suggests that any estimates we obtain should
be accurate.
To develop this stability theory, we require two conditions: first, that whatever quantity we
develop for stability should compose adaptively, meaning that if we apply two (randomized) algo-
rithms to a sample, then if both are appropriately stable, even if we choose the second algorithm
because of the output of the first in arbitrary ways, they should remain jointly stable. Second, our
notion should bound the mutual information I(X1n ; T ) between the sample X1n and T . Lastly, we
remark that this control on the mutual information has an additional benefit: by the data process-
ing inequality, any downstream analysis we perform that depends only on T necessarily satisfies the
same stability and information guarantees as T , because if we have the Markov chain X1n → T → V
then I(X1n ; V ) ≤ I(X1n ; T ).
We consider randomized algorithms A : X n → A, taking values in our index set A, where
A(X1n ) ∈ A is a random variable that depends on the sample X1n . For simplicity in derivation,
we abuse notation in this section, and for random variables X and Y with distributions P and Q
respectively, we denote
Dkl (X||Y ) := Dkl (P ||Q) .
We then ask for a type of leave-one-out stability for the algorithms A, where A is insensitive to the
changes of a single example (on average).
Definition 6.1. Let ε ≥ 0. A randomized algorithm A : X n → A is ε-KL-stable if for each
i ∈ {1, . . . , n} there is a randomized A_i : X^{n−1} → A such that for every sample x_1^n ∈ X^n,
    (1/n) Σ_{i=1}^n D_kl(A(x_1^n) || A_i(x_{\i})) ≤ ε.
Example 6.3.6 (KL-stability in mean estimation: Gaussian noise addition): Let x_i ∈ [−1, 1],
and consider A(x_1^n) = (1/n) Σ_{i=1}^n x_i + Z for Z ∼ N(0, σ²), taking A_i(x_{\i}) = (1/n) Σ_{j≠i} x_j + Z.
Because D_kl(N(μ_0, σ²)||N(μ_1, σ²)) = (μ_0 − μ_1)²/(2σ²), we have
    (1/n) Σ_{i=1}^n D_kl(A(x_1^n) || A_i(x_{\i})) = (1/(2nσ²)) Σ_{i=1}^n x_i²/n² ≤ 1/(2σ²n²),
so that the sample mean of a bounded random variable perturbed with Gaussian noise is
ε = 1/(2σ²n²)-KL-stable. 3
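A numerical sanity check of this calculation, under the assumption (illustrative, matching the example as reconstructed here) that A_i drops example i while keeping the 1/n normalization:

```python
import numpy as np

def kl_gauss(mu0, mu1, sigma2):
    """KL divergence between N(mu0, sigma2) and N(mu1, sigma2)."""
    return (mu0 - mu1) ** 2 / (2 * sigma2)

rng = np.random.default_rng(2)
n, sigma2 = 50, 0.25
x = rng.uniform(-1.0, 1.0, size=n)  # a bounded sample, x_i in [-1, 1]

mu = x.sum() / n                    # A(x) = mu + Z with Z ~ N(0, sigma2)
# Leave-one-out algorithms A_i: drop example i, keep the 1/n normalization
kls = [kl_gauss(mu, (x.sum() - x[i]) / n, sigma2) for i in range(n)]
eps = float(np.mean(kls))

print(eps, 1 / (2 * sigma2 * n ** 2))
```

The averaged KL divergence always lands below the 1/(2σ²n²) bound, with equality only if every |x_i| = 1.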
Example 6.3.7 (KL-stability in mean estimation: Laplace noise addition): Let the conditions
of Example 2.1.7 hold, but suppose that instead of Gaussian noise we add scaled Laplace noise,
that is, A(x_1^n) = (1/n) Σ_{i=1}^n x_i + Z for Z with density p(z) = (1/(2σ)) exp(−|z|/σ), where σ > 0. Then
using that if L_{μ,σ} denotes the Laplace distribution with shape σ and mean μ, with density
p(z) = (1/(2σ)) exp(−|z − μ|/σ), we have
    D_kl(L_{μ_0,σ} || L_{μ_1,σ}) = (1/σ²) ∫_0^{|μ_1−μ_0|} exp(−z/σ)(|μ_1 − μ_0| − z) dz
                                 = exp(−|μ_1 − μ_0|/σ) − 1 + |μ_1 − μ_0|/σ ≤ |μ_1 − μ_0|²/(2σ²),
we see that in this case the sample mean of a bounded random variable perturbed with Laplace
noise is ε = 1/(2σ²n²)-KL-stable, where σ is the shape parameter. 3
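One can check the closed-form Laplace KL divergence in Example 6.3.7 numerically; the following sketch (illustrative, not from the text; the values of |μ_1 − μ_0| and σ are arbitrary) compares the formula with a direct numerical integral:

```python
import numpy as np

delta, sigma = 0.7, 1.3  # |mu_1 - mu_0| and the Laplace shape (arbitrary choices)

# Closed form from the example: exp(-d/s) - 1 + d/s
closed = float(np.exp(-delta / sigma) - 1 + delta / sigma)

# Direct numerical integral of p * log(p/q) for the two Laplace densities
z = np.linspace(-60.0, 60.0, 240_001)
p = np.exp(-np.abs(z) / sigma) / (2 * sigma)
q = np.exp(-np.abs(z - delta) / sigma) / (2 * sigma)
f = p * np.log(p / q)
numeric = float(np.sum((f[:-1] + f[1:]) / 2) * (z[1] - z[0]))  # trapezoid rule

print(closed, numeric)
```

The two values agree to several digits, and the closed form sits below the quadratic bound |μ_1 − μ_0|²/(2σ²) used to conclude KL-stability.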
The two key facts are that KL-stable algorithms compose adaptively and that they bound
mutual information in independent samples.
Proof Let A_i and A′_i be the promised sub-algorithms in Definition 6.1. We apply the data
processing inequality, which implies for each i that
    D_kl(A′(A(x_1^n), x_1^n) || A′_i(A_i(x_{\i}), x_{\i})) ≤ D_kl((A′(A(x_1^n), x_1^n), A(x_1^n)) || (A′_i(A_i(x_{\i}), x_{\i}), A_i(x_{\i}))).
We require a bit of notational trickery now. Fixing i, let P_{A,A′} be the joint distribution of
A′(A(x_1^n), x_1^n) and A(x_1^n) and Q_{A,A′} the joint distribution of A′_i(A_i(x_{\i}), x_{\i}) and A_i(x_{\i}), so that
they are both distributions over A_1 × A_0. Let P_{A′|a} be the distribution of A′(a, x_1^n) and similarly
Q_{A′|a} the distribution of A′_i(a, x_{\i}). Note that A′ and A′_i both "observe" x, so that using the chain
rule (2.1.6) for KL-divergences, we have
as desired.
The second key result is that KL-stable algorithms also bound the mutual information of a
random function.
Proof Without loss of generality, we assume A and X are both discrete. In this case, we have
    I(A; X_1^n) = Σ_{i=1}^n I(A; X_i | X_1^{i−1}) = Σ_{i=1}^n [H(X_i | X_1^{i−1}) − H(X_i | A, X_1^{i−1})].
Now, because the Xi follow a product distribution, H(Xi | X1i−1 ) = H(Xi ), while H(Xi |
A, X1i−1 ) ≥ H(Xi | A, X\i ) because conditioning reduces entropy. Consequently, we have
    I(A; X_1^n) ≤ Σ_{i=1}^n [H(X_i) − H(X_i | A, X_{\i})] = Σ_{i=1}^n I(A; X_i | X_{\i}).
Combining Lemmas 6.3.8 and 6.3.9, we see (nearly) immediately that KL stability implies
a mutual information bound, and consequently even interactive KL-stable algorithms maintain
bounds on mutual information.
Fix an index j and for shorthand, let A = A_j and A′ = (A_1, . . . , A_{j−1}) be the first j − 1 procedures.
Then expanding the final mutual information term and letting ν denote the distribution of A′, we
have
    I(A; X_i | X_{\i}, A′) = ∫ D_kl(A(a′, x_1^n) || A(a′, x_{\i})) dP(x_i | A′ = a′, x_{\i}) dP^{n−1}(x_{\i}) dν(a′ | x_{\i}),
where A(a′, x_1^n) is the (random) procedure A on inputs x_1^n and a′, while A(a′, x_{\i}) denotes the
(random) procedure A on input a′, x_{\i}, X_i, where the ith example X_i follows its distribution
conditional on A′ = a′ and X_{\i} = x_{\i}, as in Lemma 6.3.9. We then recognize that for each i, we
have
    ∫ D_kl(A(a′, x_1^n) || A(a′, x_{\i})) dP(x_i | a′, x_{\i}) ≤ ∫ D_kl(A(a′, x_1^n) || Ã(a′, x_{\i})) dP(x_i | a′, x_{\i})
for any randomized function Ã, as the marginal A in the lemma minimizes the average KL-divergence (recall Exercise 2.15). Now, sum over i and apply the definition of KL-stability as
in Lemma 6.3.8.
Input: Sample X1n ∈ X n drawn i.i.d. P , collection {ϕt }t∈T of possible queries ϕt : X →
[−1, 1]
Repeat: for k = 1, 2, . . .
This procedure is evidently KL-stable, and based on Example 6.3.6 and Proposition 6.3.10, we
have that
    (1/n) I(X_1^n; T_1, . . . , T_k, T_{k+1}) ≤ k/(2σ²n²)
so long as the indices T_i ∈ T are chosen only as functions of the answers A_j = P_n ϕ_{T_j} + Z_j for j < i, as the classical
information processing inequality implies that
    (1/n) I(X_1^n; T_1, . . . , T_k, T_{k+1}) ≤ (1/n) I(X_1^n; A_1, . . . , A_k)
because we have X1n → A1 → T2 and so on for the remaining indices. With this, we obtain the
following theorem.
Theorem 6.3.11. Let the indices T_i, i = 1, . . . , k + 1, be chosen in an arbitrary way using the
procedure of Figure 6.2, and let σ² > 0. Then
    E[max_{j≤k} (A_j − P ϕ_{T_j})²] ≤ 2ek/(σ²n²) + 10/(4n) + 4σ²(log k + 1).

By inspection, we can optimize over σ² by setting σ² = √(k/(log k + 1))/n, which yields the
upper bound
    E[max_{j≤k} (A_j − P ϕ_{T_j})²] ≤ 10/(4n) + 10√(k(1 + log k))/n.
Comparing to Example 6.3.4, we see a substantial improvement. While we do not achieve accuracy
scaling with log k, as we would if the queried functionals ϕ_t were completely independent of the
sample, we see that we achieve mean-squared error of order
    √(k log k)/n
for k adaptively chosen queries.
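A hypothetical simulation suggesting how the Gaussian-noise mechanism blunts the adaptive attack of Example 6.3.4: the analyst now picks the direction V_{k+1} from noisy answers, with σ² ≈ √(k/log k)/n as in the optimization above. (The setup and constants are illustrative assumptions, not the theorem's exact mechanism.)

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, trials = 100, 1000, 200
sigma = np.sqrt(np.sqrt(k / np.log(k)) / n)  # noise level sigma^2 ~ sqrt(k / log k) / n

clean, noisy = [], []
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, k))
    Y = rng.choice([-1.0, 1.0], size=n)
    g = (X * Y[:, None]).mean(axis=0)          # exact answers P_n Y X
    a = g + rng.normal(scale=sigma, size=k)    # mechanism's noisy answers
    v_clean = g / np.linalg.norm(g)            # adaptive direction from exact answers
    v_noisy = a / np.linalg.norm(a)            # adaptive direction from noisy answers
    clean.append(float(v_clean @ g) ** 2)
    noisy.append(float(v_noisy @ g) ** 2)

print(np.mean(clean), np.mean(noisy))
```

With exact answers the adaptive squared correlation concentrates near k/n; choosing the direction from noisy answers shrinks it by an order of magnitude in this configuration, consistent with the mutual-information bound.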
Proof To prove the result, we use a technique sometimes called the monitor technique. Roughly,
the idea is that we can choose the index Tk+1 in any way we desire as long as it is a function of the
answers A1 , . . . , Ak and any other constants independent of the data. Thus, we may choose
    T_{k+1} := T_{k⋆} where k⋆ = argmax_{j≤k} |A_j − P ϕ_{T_j}|,
Setting t_0 = √(2 log(4k/√(2π))) gives E[max_j W_j²] ≤ 2 log k + log(4/√(2π)) + 1.
Langford and Caruana [131] take this approach, and Dziugaite and Roy [85] use it to give (the
first) non-trivial bounds for deep learning models.
The questions of interactive data analysis go back at least several decades, highlighted perhaps most profoundly, and most positively, by Tukey's Exploratory Data Analysis [183]. Problems of scientific
replicability have, conversely, highlighted many of the challenges of reusing data or peeking, even
innocently, at samples before performing statistical analyses [118, 95, 100]. Our approach to for-
malizing these ideas, and making rigorous limiting information leakage, draws from a more recent
strain of work in the theoretical computer science literature, with major contributions from Dwork,
Feldman, Hardt, Pitassi, Reingold, and Roth and Bassily, Nissim, Smith, Steinke, Stemmer, and
Ullman [84, 82, 83, 20, 21]. Our particular treatment most closely follows Feldman and Steinke [88].
The problems these techniques target also arise frequently in high-dimensional statistics, where one
often wishes to estimate uncertainty and perform inference after selecting a model. While we do
not touch on these problems, a few references in this direction include [27, 180, 114].
6.5 Exercises
Exercise 6.1 (Duality in Donsker-Varadhan): Here, we give a converse result to Theorem 6.1.1,
showing that for any function h : X → R,
    log E_Q[e^{h(X)}] = sup_P {E_P[h(X)] − D_kl(P||Q)},    (6.5.1)
where the supremum is taken over probability measures. If Q has a density, the supremum may be
taken over probability measures having a density.
(a) Show the equality (6.5.1) in the case that X is discrete by directly computing the supremum.
(That is, let |X | = k, and identify probability measures P and Q with vectors p, q ∈ Rk+ .)
(b) Let Q have density q. Assume that EQ [eh(X) ] < ∞ and let
(c) If EQ [eh(X) ] = +∞, then monotone convergence implies that limB↑∞ EQ [emin{B,h(X)} ] = +∞.
Conclude (6.5.1).
Exercise 6.2 (An alternative PAC-Bayes bound): Let f : Θ × X → R, and let π_0 be a density
on θ ∈ Θ. Use the dual form (6.5.1) of the variational representation of the KL-divergence to show
that with probability at least 1 − δ over the i.i.d. draw of X_1^n ∼ P,
The function ψ(t) = min{1, max{−1, t}} (the truncation of t to the range [−1, 1]) is such a function.
Let πθ be the normal distribution N(θ, σ 2 I) and π0 be N(0, σ 2 I).
(a) Let λ > 0. Use Exercise 6.2 to show that with probability at least 1 − δ, for all θ ∈ Rd
    (1/λ) ∫ P_n ψ(λ⟨θ′, X⟩) π_θ(θ′) dθ′ ≤ ⟨θ, E[X]⟩ + λ (θ^⊤Σθ + σ² tr(Σ)) + (∥θ∥_2²/(2σ²) + log(1/δ))/(nλ).
Exercise 6.4 (Large-margin PAC-Bayes bounds for multiclass problems): Consider the following
multiclass prediction scenario. Data comes in pairs (x, y) ∈ B_2^d × [k], where B_2^d = {v ∈ R^d | ∥v∥_2 ≤
1} denotes the ℓ_2-ball and [k] = {1, . . . , k}. We make predictions using predictors θ_1, . . . , θ_k ∈ R^d,
where the prediction of y on an example x is
    ŷ(x) := argmax_{i∈[k]} ⟨θ_i, x⟩.
We suffer an error whenever ŷ(x) ≠ y, and the margin of our classifier on the pair (x, y) is
    margin(x, y) := ⟨θ_y, x⟩ − max_{i≠y} ⟨θ_i, x⟩.
If ⟨θ_y, x⟩ > ⟨θ_i, x⟩ for all i ≠ y, the margin is then positive (and the prediction is correct).
(a) Develop an analogue of the bounds in Section 6.2.2 in this k-class multiclass setting. To do
so, you should (i) define the analogue of the margin-based loss ℓγ , (ii) show how Gaussian
perturbations leave it similar, and (iii) prove an analogue of the bound in Section 6.2.2. You
should assume one of the two conditions
    (C1) ∥θ_i∥_2 ≤ r for all i    (C2) Σ_{i=1}^k ∥θ_i∥_2² ≤ kr²
(b) Describe a minimization procedure—just a few lines suffice—that uses convex optimization to
find a (reasonably) large-margin multiclass classifier.
Exercise 6.5 (A variance-based information bound): Let Φ = {ϕ_t}_{t∈T} be a collection of functions
ϕ_t : X → R, where each ϕ_t satisfies the Bernstein condition (4.1.7) with parameters σ²(ϕ_t) and b,
that is, |E[(ϕ_t(X) − P ϕ_t(X))^k]| ≤ (k!/2) σ²(ϕ_t) b^{k−2} for all k ≥ 3 and Var(ϕ_t(X)) = σ²(ϕ_t). Let T ∈ T
be any random variable, which may depend on an observed sample X_1^n. Show that for all C > 0
and |λ| ≤ C/(2b),
    E[(P_n ϕ_T − P ϕ_T)/max{C, σ(ϕ_T)}] ≤ (1/(n|λ|)) I(T; X_1^n) + |λ|.
Exercise 6.6 (An information bound on variance): Let Φ = {ϕ_t}_{t∈T} be a collection of functions
ϕ_t : X → R, where each ϕ_t : X → [−1, 1]. Let σ²(ϕ_t) = Var(ϕ_t(X)), and let s_n²(ϕ) = P_n ϕ² − (P_n ϕ)² be
the sample variance of ϕ. Show that for all C > 0 and 0 ≤ λ ≤ C/4,
    E[s_n²(ϕ_T)/max{C, σ²(ϕ_T)}] ≤ (1/(nλ)) I(T; X_1^n) + 2.
The max{C, σ²(ϕ_T)} term is there to help avoid division by 0. Hint: If 0 ≤ x ≤ 1, then
e^x ≤ 1 + 2x, and if X ∈ [0, 1], then E[e^X] ≤ 1 + 2E[X] ≤ e^{2E[X]}. Use this to argue that
E[e^{λn P_n(ϕ−Pϕ)²/max{C,σ²}}] ≤ e^{2λn} for any ϕ : X → [−1, 1] with Var(ϕ) ≤ σ², then apply the
Donsker-Varadhan theorem.
Exercise 6.7: Consider the following scenario: let ϕ : X → [−1, 1] and let α > 0, τ > 0. Let
μ = P_n ϕ and s² = P_n ϕ² − μ². Define σ² = max{αs², τ²}, and assume that τ² ≥ 5α/n.
(b) Show that if α² ≤ C′τ² for a numerical constant C′ < ∞, then we can take ε ≤ O(1) · 1/(n²α).
Hint: Use Exercise 2.14, and consider the "alternative" mechanisms of sampling from
    N(μ_{−i}, σ²_{−i}) where σ²_{−i} = max{α s²_{−i}, τ²}
for
    μ_{−i} = (1/(n−1)) Σ_{j≠i} ϕ(X_j) and s²_{−i} = (1/(n−1)) Σ_{j≠i} ϕ(X_j)² − μ²_{−i}.
Exercise 6.8 (A general variance-dependent bound on interactive queries): Consider the algo-
rithm in Fig. 6.3. Let σ 2 (ϕt ) = Var(ϕt (X)) be the variance of ϕt .
(a) Show that for b > 0 and for all 0 ≤ λ ≤ 2b ,
    E[max_{j≤k} |A_j − P ϕ_{T_j}| / max{b, σ(ϕ_{T_j})}]
        ≤ (1/(nλ)) I(X_1^n; T_1^k) + λ + √(2 log(ke) [(4α/(nb²)) I(X_1^n; T_1^k) + 2α/(nb²) + τ²/b²]).
(If you do not have quite the right constants, that’s fine.)
(b) Using the result of Question 6.7, show that with appropriate choices for the parameters
α, b, τ 2 , λ that for a numerical constant C < ∞
    E[max_{j≤k} |A_j − P ϕ_{T_j}| / max{(k log k)^{1/4}/√n, σ(ϕ_{T_j})}] ≤ C (k log k)^{1/4}/√n.
Input: Sample X1n ∈ X n drawn i.i.d. P , collection {ϕt }t∈T of possible queries ϕt : X →
[−1, 1], parameters α > 0 and τ > 0
Repeat: for k = 1, 2, . . .
iii. Mechanism draws independent Z_k ∼ N(0, σ_k²) and responds with answer
    A_k := P_n ϕ_{T_k} + Z_k = (1/n) Σ_{i=1}^n ϕ_{T_k}(X_i) + Z_k.
Chapter 7
where the supremum is taken over measurable functions g : X → R with E_Q[e^{g(X)}] < ∞. Conversely, for any measurable g : X → R and distribution Q on X,
    log E_Q[e^{g(X)}] = sup_P {E_P[g(X)] − D_kl(P||Q)},
where the supremum is taken over probability distributions P on X with E_P[g(X)] < ∞.
Proof The first claim is simply Theorem 6.1.1. For the second, we assume that EQ [eg(X) ] < ∞.
(See Exercise 7.1 for the case that EQ [eg(X) ] = +∞.) Then via the first part of the corollary, for
any distribution P for which EP [g(X)] < ∞ we have Dkl (P ||Q) ≥ EP [g(X)] − log EQ [eg(X) ], that
is,
log EQ [eg(X) ] ≥ EP [g(X)] − Dkl (P ||Q) .
To obtain equality, assume w.l.o.g. that Q has density q, and define P to have density p(x) =
(e^{g(x)}/E_Q[e^{g(X)}]) q(x). This evidently integrates to 1, and
    E_P[g(X)] − D_kl(P||Q) = E_Q[g(X)e^{g(X)}]/E_Q[e^{g(X)}] − E_Q[(e^{g(X)}/E_Q[e^{g(X)}]) log(e^{g(X)}/E_Q[e^{g(X)}])]
                           = log E_Q[e^{g(X)}],
The key consequence of Corollary 7.1.1, for our purposes, is that it yields the first building block
for our development of transportation inequalities, which relate concentration and moment gener-
ating functions to KL-divergence measures. The first of these uses the variational representation
to provide an alternative characterization of sub-Gaussian random variables.
Theorem 7.1.2. Let X be a real-valued random variable. Then the following are equivalent:
(i) X is σ²-sub-Gaussian, that is, E_P[e^{λ(X−E_P[X])}] ≤ e^{λ²σ²/2} for all λ ∈ R.
(ii) For any probability distribution Q for which D_kl(Q||P) < ∞,
    E_Q[X] − E_P[X] ≤ √(2σ² D_kl(Q||P)).
Proof To see that (i) implies (ii), note that swapping the roles of P and Q in Corollary 7.1.1
and taking g(X) = λ(X − EP [X]) yields that
    λ(E_Q[X] − E_P[X]) − D_kl(Q||P) ≤ λ²σ²/2
for all distributions Q and all λ ≥ 0. Rearranging, we obtain
    E_Q[X] − E_P[X] ≤ D_kl(Q||P)/λ + λσ²/2,
and optimizing over λ > 0 yields (ii).
For the opposite direction, because inf_{λ>0} {a/λ + bλ} = 2√(ab) for a, b ≥ 0, we see that (ii) implies
    E_Q[X] − E_P[X] ≤ D_kl(Q||P)/λ + λσ²/2
for all λ > 0. Rewriting this by taking g(X) = X − E_P[X], we have λE_Q[g(X)] − D_kl(Q||P) ≤ λ²σ²/2
for all λ > 0 and Q. Applying Corollary 7.1.1 and taking a supremum over Q yields (i).
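As a quick sanity check of the equivalence (illustrative, not from the text): for a Gaussian base distribution and Gaussian mean shifts, the inequality in (ii) holds with equality.

```python
import numpy as np

# For P = N(0, sigma^2) and Q = N(mu, sigma^2):
#   E_Q[X] - E_P[X] = mu,   D_kl(Q||P) = mu^2 / (2 sigma^2),
# so sqrt(2 sigma^2 D_kl(Q||P)) = |mu|, and (ii) is tight.
sigma2 = 2.0
for mu in (0.1, 1.0, 3.5):
    gap = mu
    kl = mu ** 2 / (2 * sigma2)
    bound = float(np.sqrt(2 * sigma2 * kl))
    print(mu, gap, bound)
```

This tightness reflects that the Gaussian shift is exactly the change of measure attaining the supremum in the variational representation.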
Theorem 7.1.2 provides another proof that bounded random variables are sub-Gaussian; the
sheer slickness of the argument, which relies only on Pinsker’s inequality, hints at the power of the
approaches we will be able to develop. (Compare this approach to Example 4.1.6 and Exercise 4.1.)
Corollary 7.1.3. Let X be a random variable taking values in [a, b]. Then X is ((b − a)²/4)-sub-Gaussian.
Proof Assume P and Q have densities p and q w.r.t. a base measure µ. Then
    E_Q[X] − E_P[X] = ∫ x(q(x) − p(x)) dμ(x) ≤ b ∫_{q>p} (q(x) − p(x)) dμ(x) + a ∫_{p>q} (q(x) − p(x)) dμ(x)
                    = (b − a) ∥P − Q∥_TV,
where we have used one of our many characterizations of the total variation (Lemma 2.2.4). Applying Pinsker's inequality (Proposition 2.2.8), we obtain
    E_Q[X] − E_P[X] ≤ (b − a) √((1/2) D_kl(Q||P)) = √(((b − a)²/2) D_kl(Q||P)).
By Theorem 7.1.2, this implies that X is ((b − a)²/4)-sub-Gaussian.
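The conclusion of Corollary 7.1.3 can also be checked numerically; the sketch below (illustrative) verifies the implied moment-generating-function bound E[e^{λ(X−EX)}] ≤ e^{λ²(b−a)²/8} for Bernoulli variables, for which b − a = 1:

```python
import numpy as np

def mgf_centered_bernoulli(p, lam):
    """E[exp(lam * (X - p))] for X ~ Bernoulli(p)."""
    return (1 - p) * np.exp(-lam * p) + p * np.exp(lam * (1 - p))

lams = np.linspace(-10.0, 10.0, 401)
ok = True
for p in (0.05, 0.3, 0.5, 0.9):
    lhs = mgf_centered_bernoulli(p, lams)
    rhs = np.exp(lams ** 2 / 8)  # (b - a)^2/4-sub-Gaussian with b - a = 1
    ok = ok and bool(np.all(lhs <= rhs + 1e-12))

print(ok)
```

This is, of course, just Hoeffding's lemma again; the point of the corollary is that it now follows from Pinsker's inequality alone.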
Pinsker’s inequality that ∥P − Q∥2TV ≤ 21 Dkl (Q||P ) is the first example of a general family
of transportation inequalitites, which relate various distances on probability measures to the KL-
divergence. We develop these inqualities and a few representative consequences in Section 7.2.
Before doing so, howver, we present an application of the variational representation to estimating
a covariance as well as presenting a generalization of Theorem 7.1.2 beyond sub-Gaussian random
variables.
    ∥⟨X, u⟩∥_{ψ_2} ≤ σ
for all u ∈ S^{d−1} = {u ∈ R^d | ∥u∥_2 = 1}. (This is more convenient for analyzing P_n XX^⊤,
because the sub-multiplicativity of the Orlicz norms gives ∥v^⊤ XX^⊤ u∥_{ψ_1} ≤ ∥⟨X, u⟩∥_{ψ_2} ∥⟨X, v⟩∥_{ψ_2}, as in
Lemma 4.1.23.)
We then have the following proposition, which provides the same guarantees as Proposition 5.1.11
for isotropic Xi , meaning E[Xi Xi⊤ ] = Id .
Proposition 7.1.5. Let Xi be independent isotropic sub-Gaussian vectors with ∥⟨u, Xi ⟩∥ψ2 ≤ σ
for all u ∈ Sd−1 . Then with probability at least 1 − e−t ,
    ∥P_n XX^⊤ − I_d∥_op ≤ (8e²σ²/(e − 1)²) √((2d + t)/n)
so long as n ≥ 2d + t. Otherwise, ∥P_n XX^⊤ − I_d∥_op ≤ (4e²σ²/(e − 1)²)(1 + (2d + t)/n).
Proof Fix ϵ > 0 to be chosen, and let the prior density π0 be uniform on the Cartesian product
(1 + ϵ)Bd2 × (1 + ϵ)Bd2 . For any fixed u, v with ∥u∥2 , ∥v∥2 ≤ 1, let the posterior πu,v be uniform on
the product (u + ϵBd2 ) × (v + ϵBd2 ), so that for any vector x we have
Z
θ1⊤ xx⊤ θ2 πu,v (θ1 , θ2 )dθ1 dθ2 = u⊤ xx⊤ v.
Notably, f is mean zero under the distribution P, as E[XX^⊤] = I_d. Because we have assumed
∥⟨u, X⟩∥_{ψ_2} ≤ σ for all u ∈ S^{d−1}, for θ_1, θ_2 ∈ (1 + ϵ)B_2^d the function f(θ_1, θ_2, X) is sub-exponential, and more
precisely,
simultaneously for all u, v ∈ Bd2 with probability at least 1 − e−t , where we recall that π is uniform
on (u + ϵBd2 ) × (v + ϵBd2 ). It remains to control the expectation EP [exp(λf )] and the KL-divergence.
For the former, we apply Corollary 4.1.24 and use inequality (7.1.1) to guarantee that f has finite
ψ1 -norm, so for fixed θ1 , θ2 ∈ (1 + ϵ)Bd2 we have
when |λ| ≤ 1/(4(1 + ϵ)²σ²). Substituting this above,
    λ(u^⊤ P_n XX^⊤ v − ⟨u, v⟩) ≤ (D_kl(π||π_0) + t)/n + 16λ²(1 + ϵ)⁴σ⁴.
Finally, we evaluate the KL-divergence: for any u, v ∈ B_2^d, we have
    D_kl(π_{u,v}||π_0) = 2 D_kl(Uniform(u + ϵB_2^d)||Uniform((1 + ϵ)B_2^d)) = 2d log((1 + ϵ)/ϵ)
because π_{u,v} and π_0 are product distributions. Take ϵ = 1/(e − 1) to give D_kl(π||π_0) = 2d and 1 + ϵ = e/(e − 1).
Dividing by λ (accounting for the sign as appropriate), we see that with probability at least 1 − e−t ,
simultaneously for all ∥u∥2 = 1 and ∥v∥2 = 1 we have
    |u^⊤ P_n XX^⊤ v − ⟨u, v⟩| ≤ (2d + t)/(nλ) + 16 (e/(e − 1))⁴ σ⁴ λ
whenever 0 ≤ λ ≤ (e − 1)²/(4e²σ²). Take λ = ((e − 1)²/(4e²σ²)) min{√((2d + t)/n), 1}.
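As a quick numerical illustration of the √((2d + t)/n) scaling in Proposition 7.1.5 (a simulation with standard Gaussian, hence sub-Gaussian, vectors; the dimensions and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 20
errs = {}
for n in (200, 800, 3200):
    X = rng.normal(size=(n, d))                              # isotropic sub-Gaussian vectors
    emp = X.T @ X / n                                        # P_n X X^T
    errs[n] = float(np.linalg.norm(emp - np.eye(d), ord=2))  # operator-norm error

print(errs)  # errors shrink roughly like sqrt(d / n)
```

Quadrupling n should roughly halve the operator-norm error, matching the √(d/n) rate (with no dependence on any ambient structure beyond d and n).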
which is strictly increasing in s. (We will have much more to say about convex conjugates in
the coming chapters; see also Appendix B, especially Section C.2.) For now, we only require the
generalized inverse
(ϕ∗ )−1 (t) := inf {s ≥ 0 | ϕ∗ (s) > t} .
Because ϕ∗ (s) ≥ λs − ϕ(λ) for some λ > 0, we have ϕ∗ (s) → ∞ as s → ∞, meaning that (ϕ∗ )−1 (t)
exists and is finite.
    (ϕ*)^{−1}(t) = √(2σ²t) = inf_{λ>0} {λσ²/2 + t/λ}
gives the inverse. (Recall the proof of Theorem 7.1.2, which relies on this transformation.) 3
(ii) For any probability distribution Q for which Dkl (Q||P ) < ∞,
Proof To see that (i) implies (ii), note that if ϕ(λ) ≥ log EP [eλ(X−EP [X]) ], then as in the proof
of Theorem 7.1.2, swapping the roles of P and Q in Corollary 7.1.1 and taking g(X) = X − EP [X]
yields that
λ(EQ [X] − EP [X]) − ϕ(λ) ≤ Dkl (Q||P )
for all distributions Q. Taking a supremum over λ ∈ (0, b) on the left side then gives
Additionally, (ϕ∗ )−1 is concave and strictly increasing on R+ , with (ϕ∗ )−1 (t) → ∞ as t → ∞.
Proof We prove the equality first. As in our discussion earlier, because ϕ∗ (s) ≥ λs − ϕ(λ) for
any λ ∈ (0, b), the set {s ≥ 0 | ϕ*(s) > t} is non-empty. Then ϕ*(s) > t if and only if
    λs − ϕ(λ) > t for some λ ∈ (0, b).
Returning to the main thread of the argument, item (ii) then evidently implies
By inspection, for any joint distribution π on X and Y with the correct marginals P and Q
and any function f taking values in [0, 1], we have f (x) − f (y) ≤ 1 {x ̸= y} and
Consider the special case that P and Q are Bernoulli distributions with parameters p > q > 0, so
that ∥P − Q∥_TV = p − q. Now, for X ∼ P and Y ∼ Q, consider the joint distribution π for which
    X = 1, Y = 1  w.p. q
    X = 1, Y = 0  w.p. p − q
    X = 0, Y = 0  w.p. 1 − p.
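The Bernoulli coupling above can be realized by the quantile construction X = 1{U < p}, Y = 1{U < q} for a single shared uniform U; a short simulation (illustrative, with arbitrary p and q) confirms it has the right marginals and attains P(X ≠ Y) = p − q = ∥P − Q∥_TV:

```python
import numpy as np

rng = np.random.default_rng(6)
p, q = 0.7, 0.4          # p > q, so that TV(P, Q) = p - q
m = 200_000

U = rng.uniform(size=m)  # one shared uniform realizes the coupling in the display
X = (U < p).astype(float)
Y = (U < q).astype(float)

print(X.mean(), Y.mean(), float(np.mean(X != Y)))
```

The event {X ≠ Y} is exactly {q ≤ U < p}, which has probability p − q, so this coupling attains the total variation distance.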
(We use the letter W because such quantities are frequently termed Wasserstein distances, though
we shall stick with our notation.) To understand these as a transportation cost, we think of the
coupling π as a “plan” for moving probability mass from P to Q via some sampling scheme π(· | X)
or π(· | Y ).
In analogy with the total variation distance, which uses c(x, y) = 1 {x ̸= y}, the discrete dis-
tance, we can consider other costs on X × Y. Notably, for any pair of functions f : X → R and
g : Y → R, whenever f (x) + g(y) ≤ c(x, y) for all x ∈ X and y ∈ Y, we have the trivial inequality
Eπ [c(X, Y )] ≥ Eπ [f (X) + g(Y )] = EP [f (X)] + EQ [g(Y )].
In fact, deep duality results hold for these distances. The next theorem, a variant of results typically
called the Kantorovich duality theorems, captures the main results.
Theorem 7.2.1. Let c : X × Y → R+ ∪ {∞} be a lower semicontinuous cost function and X and
Y be metric spaces. Then for any distributions P and Q
Wc (P, Q) = sup {EP [f (X)] + EQ [g(Y )] | f (x) + g(y) ≤ c(x, y) for all x ∈ X , y ∈ Y} ,
and there is a coupling π ∈ Π(P, Q) achieving Eπ [c(X, Y )] = Wc (P, Q).
We provide a heuristic sketch of a proof of Theorem 7.2.1 in Section 7.2.2 to suggest why we might
expect duality to hold, with pointers to more rigorous references. It is also typically the case that
the supremum is attained by some functions f, g as well, but this is beyond our scope.
Most frequently, we consider cost functions c that are distances, meaning that c : X × X → R+
satisfies the triangle inequality. In this case, we can restrict the supremum to be over only 1-
Lipschitzian functions, where for a function f we define the Lipschitzian norm
    ∥f∥_{Lip,c} := sup_{x≠y} |f(x) − f(y)|/c(x, y).
In this case, we have the following corollary, whose proof we defer to Section 7.2.3.
Corollary 7.2.2. In addition to the conditions of Theorem 7.2.1, assume that c is a distance on
X × X . Then
    W_c(P, Q) = min_{π∈Π(P,Q)} E_π[c(X, Y)] = sup {E_P[f(X)] − E_Q[f(X)] | ∥f∥_{Lip,c} ≤ 1}.
For example, for the total variation distance, we have c(x, y) = 1 {x ̸= y}, and f being Lipschitzian
is equivalent to |f (x) − f (y)| ≤ 1 for all x, y.
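On the real line with c(x, y) = |x − y|, the monotone (sorted) coupling is optimal, which gives a simple numerical illustration of the duality in Corollary 7.2.2. (A sketch under these standard facts; the shifted-Gaussian example is an arbitrary choice.)

```python
import numpy as np

rng = np.random.default_rng(7)
m = 10_000
x = rng.normal(loc=0.0, size=m)  # sample from P = N(0, 1)
y = rng.normal(loc=1.0, size=m)  # sample from Q = N(1, 1)

# Primal: cost of the monotone (sorted) coupling, optimal on R for c(x, y) = |x - y|
w1 = float(np.mean(np.abs(np.sort(x) - np.sort(y))))

# Dual: any 1-Lipschitz f gives E_Q[f] - E_P[f] <= W_c(P, Q); here f(t) = t
dual = float(y.mean() - x.mean())

print(w1, dual)
```

For a pure location shift the 1-Lipschitz function f(t) = t is dual-optimal, so the two quantities coincide (both near the shift size 1), illustrating that the supremum over Lipschitz functions attains the transport cost.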
for i = 1, . . . , n. For example, Pinsker's inequality gives the bound (7.2.2) for any distribution P_i
with cost c(x, y) = 1{x ≠ y} and ϕ(t) = 2t². The following theorem shows how we can leverage
such marginal transportation inequalities to obtain a transportation inequality for product distributions.
Theorem 7.2.3. Let the distributions Pi on sets Xi , i = 1, . . . , n, satisfy the marginal transporta-
tion bound (7.2.2). Then the product P = P1 × · · · × Pn satisfies
    inf_{π∈Π(P,Q)} Σ_{i=1}^n ϕ(E_π[c(X_i, Y_i)]) ≤ D_kl(Q||P)
for any distribution Q on X_1 × · · · × X_n.
Proof We provide an inductive proof. Assume that the claimed inequality holds for some value
n (the base case, of course, being the assumed inequality (7.2.2)). By the chain rule (2.1.6) for the
KL-divergence, where we use the notation there as well for X1n+1 ∼ Q and Y1n+1 ∼ P , we have
    D_kl(Q||P^{n+1}) = D_kl(X_1^n||Y_1^n | X_{n+1}) + D_kl(X_{n+1}||Y_{n+1}).
Now, let πn+1 be the optimal coupling between Qn+1 and Pn+1 , and for values xn+1 ∈ X , let πx be
the optimal coupling between Q(X1n ∈ · | Xn+1 = x) and P n (these exist by Theorem 7.2.1). Then
by the induction hypothesis,
    Σ_{i=1}^n ϕ(E_{π_x}[c(X_i, Y_i) | X_{n+1} = x]) ≤ D_kl(X_1^n||Y_1^n | X_{n+1} = x)
and
    ϕ(E_{π_{n+1}}[c(X_{n+1}, Y_{n+1})]) ≤ D_kl(X_{n+1}||Y_{n+1}).
Then Jensen’s inequality implies that, integrating over Xn+1 ∼ Qn+1 , we have
    Σ_{i=1}^n ϕ(E_π[c(X_i, Y_i)]) + ϕ(E_{π_{n+1}}[c(X_{n+1}, Y_{n+1})])
        ≤ D_kl(X_1^n||Y_1^n | X_{n+1}) + D_kl(X_{n+1}||Y_{n+1}) = D_kl(Q||P^{n+1}).
The distribution π = EQn+1 [πXn+1 ] and πn+1 have a consistent joint distribution by construction,
giving the theorem.
In Section 7.3, we develop several applications of Theorem 7.2.3. Here, we present a few
corollaries of the result.
Proof For any Q and Y ∼ Pi , we have by Pinsker’s inequality and Theorem 7.2.1 that
    inf_{π∈Π(P_i,Q)} π(X ≠ Y) = ∥Q − P_i∥_TV ≤ √((1/2) D_kl(Q||P_i)).
We can recover the bounded differences inequality (Proposition 4.2.5) as a corollary (which, of course, we have already proved by easier means). To see this, note that any function f : X^n → R satisfying bounded differences with constants b = [b_i]_{i=1}^n satisfies
\[
|f(x_1^n) - f(y_1^n)| \le \sum_{i=1}^n b_i 1\{x_i \ne y_i\}.
\]
We use Corollary 7.2.4 coupled with Theorem 7.1.2 to prove the inequality. Letting Z(x_1^n) = f(x_1^n) − E_P[f(X_1^n)], we have for any coupling π ∈ Π(Q, P) that
\[
\mathbb{E}_Q[Z(Y_1^n)] - \mathbb{E}_P[Z(X_1^n)] \le \sum_{i=1}^n b_i \,\pi(Y_i \ne X_i)
\le \Bigl(\sum_{i=1}^n b_i^2\Bigr)^{1/2} \Bigl(\sum_{i=1}^n \pi(Y_i \ne X_i)^2\Bigr)^{1/2},
\]
and combining this bound with Corollary 7.2.4 and Theorem 7.1.2 yields
\[
\mathbb{E}_P\Bigl[e^{\lambda(f(X_1^n) - \mathbb{E}[f(X_1^n)])}\Bigr] \le \exp\Bigl(\frac{\lambda^2 \|b\|_2^2}{8}\Bigr).
\]
\[
\begin{aligned}
L(\pi, f, g)
&= \int_{\mathcal X}\int_{\mathcal Y} c(x,y)\,\pi(x,y)\,dx\,dy
+ \int_{\mathcal X} f(x)\Bigl(p(x) - \int_{\mathcal Y}\pi(x,y)\,dy\Bigr)dx
+ \int_{\mathcal Y} g(y)\Bigl(q(y) - \int_{\mathcal X}\pi(x,y)\,dx\Bigr)dy \\
&= \int \bigl(c(x,y) - f(x) - g(y)\bigr)\,\pi(x,y)\,dx\,dy + \mathbb{E}_P[f(X)] + \mathbb{E}_Q[g(Y)].
\end{aligned}
\]
To obtain a dual problem, we minimize out the nonnegative function π. If there exists a pair (x, y) with f(x) + g(y) > c(x, y) (again, eliding rigor), then we could send π(x, y) ↑ ∞, making the first integral −∞. Thus we obtain
\[
\inf_{\pi \ge 0} L(\pi, f, g) =
\begin{cases}
\mathbb{E}_P[f(X)] + \mathbb{E}_Q[g(Y)] & \text{if } f(x) + g(y) \le c(x,y) \text{ for all } x \in \mathcal X,\ y \in \mathcal Y, \\
-\infty & \text{otherwise.}
\end{cases}
\]
Assuming strong duality obtains, this gives the associated dual problem and theorem.
Define the c-conjugate f^c(y) := inf_x {c(x, y) − f(x)}, which at each y is the largest possible value v satisfying f(x) + v ≤ c(x, y) for all x. Setting g = f^c can only increase the objective in the supremum. We therefore have the equivalent problem
where we note that f(x) + f^c(y) ≤ c(x, y) for all x, y. The conjugate f^c is Lipschitz with respect to the cost (distance) c: we have
\[
f^c(x) - f^c(y) = \inf_{y'} \sup_{x'} \bigl\{ c(x, y') - f(y') - c(x', y) + f(x') \bigr\}
\le \sup_{x'} \bigl\{ c(x, x') - c(x', y) \bigr\} \le c(x, y)
\]
by the triangle inequality, and the lower bound is similar, showing that f^c is Lipschitz.
Of course, we should set f as large as possible given f^c while satisfying f(x) + f^c(y) ≤ c(x, y), meaning we choose
\[
f^\star(x) = \inf_y \{ c(x, y) - f^c(y) \},
\]
and \inf_y \{c(x, y) - f^c(y)\} = -f^c(x). That is, f(x) = −f^c(x) and is therefore Lipschitz, giving the corollary.
Definition 7.1. Let (X, ρ) be a metric space with metric ρ : X × X → R₊, and let p ∈ [1, 2]. Define the cost function c_p(x, y) := ρ^p(x, y). A probability distribution P on X satisfies an L_p transportation cost inequality with constant σ², or a T_p(σ²)-inequality, if for all distributions Q on X
\[
W_{c_p}(Q, P)^{1/p} \le \sqrt{2\sigma^2 D_{\mathrm{kl}}(Q\|P)}.
\]
We shall only consider T₁(σ²)-inequalities in this section, as they admit the most straightforward arguments. The bibliographic section provides pointers to other results.
Pinsker's inequality shows that if we use the Hamming metric on X, then all distributions P satisfy a T₁(1/4)-inequality. Marton's transportation inequality (Corollary 7.2.4) also shows that the product measure P^n satisfies a T₁-inequality, but with a constant that depends on n: let d_ham be the Hamming distance on X^n. Then for any coupling π of X_1^n ∼ Q and Y_1^n ∼ P^n, we have
\[
\mathbb{E}_\pi[d_{\mathrm{ham}}(X_1^n, Y_1^n)] = \sum_{i=1}^n \pi(X_i \ne Y_i)
\le \sqrt{n}\,\Bigl(\sum_{i=1}^n \pi(X_i \ne Y_i)^2\Bigr)^{1/2},
\]
and so
\[
W_{d_{\mathrm{ham}}}(Q, P^n) \le \sqrt{\frac{n}{2} D_{\mathrm{kl}}(Q\|P^n)},
\]
a T₁(n/4)-inequality.
The astounding consequence of such transport inequalities is that a probability measure P satisfying them necessarily inherits some strong concentration properties. To discuss this we require a bit of additional notation. For a metric space (X, ρ) and set A ⊂ X, define the r-blowup or r-expansion of A by
\[
A_r := \{x \in \mathcal X \mid \rho(x, A) \le r\},
\]
where we recall the notation ρ(x, A) = inf_{y ∈ A} ρ(x, y). When a small blowup of A drastically increases its probability, then P exhibits strong concentration properties. A more quantitative
version of this follows: we say P exhibits Gaussian concentration on (X, ρ) with constant κ > 0 if for some K < ∞,
\[
P(A) \ge \frac{1}{2} \quad \text{implies} \quad P(A_r) \ge 1 - K e^{-\kappa r^2} \text{ for } r \ge 0. \tag{7.3.1}
\]
Because we often think of r ↑ ∞, the only important constant is κ > 0, and so sometimes we provide the weaker guarantee P(A_r) ≥ 1 − K e^{−κ[r − r₀]₊²} for some r₀ ≥ 0. This latter guarantee implies the Gaussian metric concentration (7.3.1) for large r with a worse constant K and (asymptotically) equivalent κ.
By a clever argument involving conditioning on belonging to a particular set A, it is possible
to show the blowing up lemma: whenever a distribution P satisfies a transportation inequality, it
enjoys Gaussian metric concentration.
Theorem 7.3.1 (The blowing up lemma). Let P satisfy a T₁(σ²)-inequality. Then for any measurable set A with P(A) > 0 and r ≥ 0,
\[
P(A) \cdot P(A_r^c) \le \exp\Bigl(-\frac{r^2}{4\sigma^2}\Bigr)
\]
and
\[
P(A_r) \ge 1 - \exp\Biggl(-\frac{1}{2\sigma^2}\Bigl[r - \sqrt{2\sigma^2 \log\frac{1}{P(A)}}\Bigr]_+^2\Biggr).
\]
Letting A be any set with P(A) ≥ 1/2, the first inequality in Theorem 7.3.1 implies that P(A_r^c) ≤ 2 exp(−r²/(4σ²)), that is,
\[
P(A_r) \ge 1 - 2\exp\Bigl(-\frac{r^2}{4\sigma^2}\Bigr).
\]
The second inequality offers better constants in the exponent: taking r₀ = √(2σ² log 2) and r = t + r₀, we always have
\[
P(A_{r_0 + t}) \ge 1 - \exp\Bigl(-\frac{t^2}{2\sigma^2}\Bigr).
\]
Proof For any set A ⊂ X with P(A) > 0, define the conditional distribution P_A(·) := P(· ∩ A)/P(A). Then without loss of generality, we may assume P_A and P have densities p_A and p with respect to some base measure µ, where p_A(x) = p(x) 1{x ∈ A}/P(A). Then by inspection,
\[
D_{\mathrm{kl}}(P_A\|P) = \int p_A(x) \log\frac{p_A(x)}{p(x)}\,d\mu(x) = \log\frac{1}{P(A)}.
\]
Then for any sets A, B of positive probability, bounding ρ(A, B) by the expected cost of any coupling of P_A and P_B and applying the T₁(σ²)-inequality twice (via the triangle inequality for the Wasserstein distance), we have
\[
\rho(A, B) \le \sqrt{2\sigma^2 D_{\mathrm{kl}}(P_A\|P)} + \sqrt{2\sigma^2 D_{\mathrm{kl}}(P_B\|P)}. \tag{7.3.2}
\]
Substituting this into the bound (7.3.2), we have for any (measurable) sets A, B with P(A) > 0 and P(B) > 0 that
\[
\rho(A, B) \le \sqrt{2\sigma^2 \log\frac{1}{P(A)}} + \sqrt{2\sigma^2 \log\frac{1}{P(B)}}.
\]
From here, we can perform two manipulations. The first is to recognize that by the inequality (√a + √b)² = a + b + 2√(ab) ≤ 2a + 2b = (√(2(a + b)))², we have √a + √b ≤ √2 · √(a + b), and so
\[
\rho(A, B) \le \sqrt{4\sigma^2 \log\frac{1}{P(A)\,P(B)}}.
\]
Now take B to be the set of points at least r far away: recalling the r-blowup A_r := {y ∈ X | ρ(y, A) ≤ r} of A, we set B = A_r^c and obtain ρ(A, B) = ρ(A, A_r^c) ≥ r for all r ≥ 0, and P(B) = P(A_r^c) = 1 − P(A_r). In particular,
\[
r \le 2\sqrt{\sigma^2 \log\frac{1}{P(A)\,P(A_r^c)}}.
\]
Rearranging gives P(A)P(A_r^c) ≤ exp(−r²/(4σ²)). Alternatively, we can use inequality (7.3.2) directly to obtain
\[
r \le \sqrt{2\sigma^2 \log\frac{1}{P(A)}} + \sqrt{2\sigma^2 \log\frac{1}{1 - P(A_r)}}.
\]
Rearranging and solving for P(A_r) gives the theorem.
Theorem 7.3.1 is particularly evocative for product measures on discrete spaces. In this case,
we use the Hamming metric dham , which satisfies a T1 ( n4 )-transport cost inequality (Definition 7.1).
Corollary 7.3.2. Let A ⊂ X^n be any measurable set and let the X_i be drawn i.i.d. P. Then for any r ≥ 0,
\[
P\bigl(d_{\mathrm{ham}}(X_1^n, A) \le r\bigr) \ge 1 - \exp\Biggl(-\frac{2}{n}\Bigl[r - \sqrt{\frac{n}{2}\log\frac{1}{P^n(A)}}\Bigr]_+^2\Biggr).
\]
As a consequence, for any sequence of sets that have at least polynomial probability—the sets An ⊂ X^n satisfy P^n(An) ≥ n^{−p} for some p < ∞—neighborhoods expanded by only order √(n log n) in the Hamming metric necessarily cover nearly the entire probability mass. Indeed, in this case, log(1/P^n(An)) ≤ p log n, and so setting
\[
r_n = c\sqrt{\frac{n}{2}} + \sqrt{\frac{pn}{2}\log n},
\]
we have
\[
P\bigl(d_{\mathrm{ham}}(X_1^n, A_n) \le r_n\bigr) \ge 1 - \exp(-c^2) \quad \text{for all } n.
\]
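The blowup phenomenon is easy to observe by simulation. In the following sketch (an assumed setup, not from the text) we take P = Bernoulli(1/2) and A = {x ∈ {0,1}^n : Σᵢ xᵢ ≤ n/2}, so that P^n(A) ≥ 1/2, and estimate the probability that a fresh sample lies within Hamming distance r ≈ 2√n of A:

```python
import random

random.seed(0)
n, trials = 400, 2000
r = int(2 * n ** 0.5)  # blowup radius of order sqrt(n)

def dist_to_A(x):
    # A = {x : sum(x) <= n/2}; flipping a one to a zero reduces the sum
    # by one, so the Hamming distance to A is the excess number of ones.
    return max(sum(x) - n // 2, 0)

hits = sum(dist_to_A([random.randint(0, 1) for _ in range(n)]) <= r
           for _ in range(trials))
print(hits / trials)  # nearly all of the mass lies in the r-blowup of A
```

With these choices the empirical fraction is essentially 1, as the corollary predicts.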
where P0 and P1 are different distributions. Chapter 2.3.1 discusses this simple problem, connecting bounds on the variation distance to lower bounds on the summed probability of error. Here, we give a different interpretation, fixing both P0 and P1 and showing that if the type-I probability of error is small, then the type-II probability of error can decrease only at a certain rate.
To give intuition, let us first develop the optimal test, which gives an achievability result and
shows the correct behavior of the asymptotic error; we provide the converse results after this. We
shall assume for simplicity that the variance of the log-likelihood ratio v² := Var_{P₀}(log(p₀(X)/p₁(X))) < ∞. Define the log-likelihood ratio of a sequence x_1^n by
\[
L_n(x_1^n) := \sum_{i=1}^n \log\frac{p_0(x_i)}{p_1(x_i)},
\]
so that if Φ(t) = P(Z ≤ t) is the standard normal CDF, the central limit theorem implies
\[
P_0^n\Bigl(L_n(X_1^n) \ge n D_{\mathrm{kl}}(P_0\|P_1) + v\sqrt{n}\,\Phi^{-1}(1-\epsilon)\Bigr) \to \epsilon.
\]
The Neyman-Pearson lemma gives that the optimal test of P0 against P1 is to compare L_n(X_1^n) against a threshold, and so we may take a sequence a_n(ϵ) for which a_n(ϵ)/√n → 0 and define the test
\[
\Psi_n(x_1^n) =
\begin{cases}
\text{accept } H_0 & \text{if } L_n(x_1^n) - n D_{\mathrm{kl}}(P_0\|P_1) \ge v\sqrt{n}\,\Phi^{-1}(1-\epsilon) + a_n(\epsilon) \\
\text{accept } H_1 & \text{otherwise.}
\end{cases}
\]
Letting Ψ = 0 indicate accepting H0 and Ψ = 1 indicate accepting H1 for simplicity, this test satisfies P0ⁿ(Ψn = 0) ≥ 1 − ϵ for all n; taking a_n(ϵ) as large as possible while satisfying this guarantee yields an optimal test (via the Neyman-Pearson lemma) with a_n(ϵ) = o(√n).
Now we consider the type-II error, that is, P1ⁿ(Ψn = 0). For this, we observe that
\[
\begin{aligned}
P_1^n(\Psi_n = 0) &= P_1^n\Bigl(L_n(X_1^n) - n D_{\mathrm{kl}}(P_0\|P_1) \ge v\sqrt{n}\,\Phi^{-1}(1-\epsilon) + a_n(\epsilon)\Bigr) \\
&\stackrel{(i)}{\le} \mathbb{E}_{P_1^n}\bigl[e^{L_n(X_1^n)}\bigr]\exp\Bigl(-n D_{\mathrm{kl}}(P_0\|P_1) - v\sqrt{n}\,\Phi^{-1}(1-\epsilon) - a_n(\epsilon)\Bigr) \\
&\stackrel{(ii)}{=} \exp\Bigl(-n D_{\mathrm{kl}}(P_0\|P_1) - v\sqrt{n}\,\Phi^{-1}(1-\epsilon) - a_n(\epsilon)\Bigr),
\end{aligned}
\]
where inequality (i) is a Chernoff bound and equality (ii) holds because E_{P₁}[exp(log(p₀(X)/p₁(X)))] = 1.
Summarizing, we have the following result.
Proposition 7.3.3. Let ϵ ∈ (0, 1) and let Ψn be the level-ϵ likelihood ratio test of H0 against H1, that is, P0ⁿ(Ψn = H1) ≤ ϵ, and assume that v² := Var_{P₀}(log(p₀(X)/p₁(X))) < ∞. Then
\[
\frac{1}{n}\log P_1^n(\Psi_n = 0) \le -D_{\mathrm{kl}}(P_0\|P_1) - \frac{v}{\sqrt{n}}\,\Phi^{-1}(1-\epsilon) + o(1/\sqrt{n}).
\]
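The change-of-measure identity behind equality (ii) above, namely that E_{P1}[exp(log(p0(X)/p1(X)))] = 1, is easy to check numerically; here is a minimal sketch on a three-point alphabet (the two distributions are arbitrary choices):

```python
import math

p0 = [0.5, 0.3, 0.2]  # P0 on the alphabet {0, 1, 2}
p1 = [0.2, 0.3, 0.5]  # P1 on the alphabet {0, 1, 2}

# E_{P1}[exp(log(p0/p1))] = sum_x p1(x) * (p0(x) / p1(x)) = sum_x p0(x) = 1
moment = sum(q * math.exp(math.log(p / q)) for p, q in zip(p0, p1))
assert abs(moment - 1.0) < 1e-9

# the first-order error exponent D_kl(P0 || P1) in Proposition 7.3.3
dkl = sum(p * math.log(p / q) for p, q in zip(p0, p1))
assert dkl > 0
```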
We show now that we can derive converse results providing similar rates of convergence using the blowing-up lemma (Theorem 7.3.1) without relying so carefully on the particulars
of the simple hypothesis test, especially the Neyman-Pearson lemma. We first present a so-called
“weak converse”, then extend the idea to a “strong converse.” One should think of this distinction
as follows: a weak converse states that if the probability of (type-I) error tends to zero, then
the asymptotic type-II error can only decrease to zero at a certain rate. This does not, however,
eliminate the possibility of lower type-II errors by allowing a fixed ϵ > 0 probability of error in
testing the null H0 . The strong converse, on the other hand, eliminates this possibility: for any
fixed ϵ > 0, no improvement in (asymptotic) type-II error is possible. We note that the style of
argument we employ here extends beyond this simple setting: Chapter 10.3 employs it to give
similar fundamental limits in some communication and estimation problems.
Both the weak and strong converse we prove follow from the same simple idea: the data pro-
cessing inequality for the KL-divergence.
Proposition 7.3.4 (Weak converse in hypothesis testing). Let P0 and P1 be arbitrary distributions,
and let the test Ψ have level ϵ, that is, P0 (Ψ = 1) ≤ ϵ. Then
Specializing the result to the product distribution case, replacing P0 and P1 with P0n and P1n ,
respectively, we obtain that
Let us now provide the stronger result. An unfortunate limitation of the method we employ
here is that we must assume X is finite to enable the strongest applications of the blowing up
lemma (Theorem 7.3.1). In this case, we focus on the product distributions P0n and P1n , and let X
be the support of P0 . We assume without loss of generality that inf x∈X P1 ({x}) > 0, as otherwise,
Dkl (P0 ||P1 ) = ∞ and any lower bound is trivial. The key insight, which Ahlswede et al. [4] develop,
is that we should apply the data processing inequality to the enlargement Ar = {xn1 | dham (xn1 , A) ≤
r} of the acceptance set A = {x ∈ X n | Ψn (x) = 0} rather than A itself; because of the blowing-up
lemma, this set has nearly full measure, which in turn means that P1 (Ψn = 0) cannot actually be
too small.
Proposition 7.3.5 (Strong converse in hypothesis testing). Let P0 and P1 be distributions on X, and let the test Ψn have level ϵ for testing P0ⁿ against P1ⁿ. Then
\[
\frac{1}{n}\log P_1^n(\Psi_n = 0) \ge -D_{\mathrm{kl}}(P_0\|P_1) - O(1)\,\frac{\sqrt{\log n \cdot \log\frac{n}{1-\epsilon}}}{\sqrt{n}}
\]
for all large enough n.
Proof Following the paragraph preceding the proposition, let A = {x_1^n ∈ X^n | Ψn(x_1^n) = 0} be the acceptance set of Ψn, and let A_r = {x | d_ham(x, A) ≤ r} be its r-blowup. For simplicity in notation, define p_r = P0ⁿ(A_r) and q_r = P1ⁿ(A_r). Then by the data-processing inequality as in the proof of Proposition 7.3.4, we have
\[
n D_{\mathrm{kl}}(P_0\|P_1) = D_{\mathrm{kl}}(P_0^n\|P_1^n) \ge D_{\mathrm{kl}}(P_0^n(A_r)\|P_1^n(A_r)) = D_{\mathrm{kl}}(p_r\|q_r) \ge p_r \log\frac{1}{q_r} - h_2(p_r), \tag{7.3.4}
\]
where h₂ denotes the binary entropy.
Of course, we wish to provide bounds on log(1/q) for q = P1ⁿ(A), not q_r. For this, we use the following combinatorial lemma:
Lemma 7.3.6. Define q⋆ = min_{x∈X} P1({x}). For any r,
\[
P_1^n(A) \le P_1^n(A_r) \le \binom{n}{r}\Bigl(\frac{\mathrm{card}(\mathcal X)}{q_\star}\Bigr)^r P_1^n(A).
\]
Indeed, for any single x ∈ X^n and its r-neighborhood {x}_r = {x' | d_ham(x, x') ≤ r}, we have
\[
P_1^n(\{x\}_r) \le \binom{n}{r}\Bigl(\frac{\mathrm{card}(\mathcal X)}{q_\star}\Bigr)^r P_1^n(\{x\}).
\]
The union bound implies the lemma.
Finally, we use the blowing-up lemma (Theorem 7.3.1): fix t ≥ 0 to be chosen and take
\[
r = r(t) := \sqrt{\frac{n}{2}}\,t + \sqrt{\frac{n}{2}\log\frac{1}{1-\epsilon}}.
\]
Then p_r ≥ 1 − e^{−t²}, and so
\[
n D_{\mathrm{kl}}(P_0\|P_1) \ge (1 - e^{-t^2})\log\frac{1}{q} - r\log\frac{Kne}{r} - \log 2,
\]
where we may take K = card(X)/q⋆ by Lemma 7.3.6 and the bound \binom{n}{r} \le (ne/r)^r. If we take t = √(2 log n), then r = O(1)√(n log(n/(1−ϵ))), and so
\[
-(1 - n^{-2})\log\frac{1}{q} \ge -n D_{\mathrm{kl}}(P_0\|P_1) - O(1)\sqrt{n\log^2 n \cdot \log\frac{n}{1-\epsilon} + n\log n \cdot \log K}.
\]
Dividing by n(1 − 1/n²) gives the result.
7.5 Exercises
Exercise 7.1: Complete the proof of Corollary 7.1.1 in the case that EQ [eg(X) ] = +∞.
Exercise 7.2 (A discrete isoperimetric inequality): Let A ⊂ Z^d be a finite subset of the d-dimensional integers. Let the projection mapping π_j : Z^d → Z^{d−1} be defined by
Exercise 7.3: Let X be a non-constant and mean-zero random variable with moment generating
function φX defined on a neighborhood of 0, and let
b1 = sup {λ ≥ 0 | φX (λ) < ∞} and b0 = inf {λ ≤ 0 | φX (λ) < ∞} .
You may assume that φX is C ∞ on (b0 , b1 ) (Proposition 3.2.2 guarantees this).
(a) Show that φX is strictly increasing on (0, b1 ) and strictly decreasing on (b0 , 0).
(c) Show that φX (λ) → φX (b1 ) as λ ↑ b1 and φX (λ) → φX (b0 ) as λ ↓ b0 . (These limits may be
finite or infinite.)
(d) Show that φ′X (λ) → E[Xeb1 X ] > 0 as λ ↑ b1 . (This limit may be finite or infinite.)
(e) Show that the domain of φX , even when non-trivial, may be open or closed. In particular, give
an example of a random variable for which dom φX = [−1, 1], and give an example of a random
variable for which dom φX = (−1, 1).
Exercise 7.4 ("Dimension free" covariance estimation): Define the effective rank of a matrix Σ ⪰ 0 by
\[
r_{\mathrm{eff}}(\Sigma) := \frac{\mathrm{tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}}.
\]
Let Mi ∈ Rd×d be independent positive definite matrices for which E[Mi ] = Σ, and assume
∥u⊤ Mi u∥ψ1 ≤ κ2 u⊤ Σu
for any u ∈ Rd . This problem uses Corollary 7.1.4 to extend Proposition 7.1.5 to show that such
random matrices concentrate similarly to isotropic random matrices, but the effective rank replaces
the dimension d.
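The effective rank is a one-line computation; the following sketch (the diagonal covariance is an arbitrary example) illustrates that r_eff(Σ) can be far smaller than the dimension d when the spectrum decays quickly:

```python
import numpy as np

def effective_rank(Sigma):
    """r_eff(Sigma) = tr(Sigma) / ||Sigma||_op for a PSD matrix."""
    return np.trace(Sigma) / np.linalg.norm(Sigma, ord=2)

d = 50
Sigma = np.diag([2.0 ** (-j) for j in range(d)])  # geometrically decaying spectrum
r = effective_rank(Sigma)
assert 1.0 <= r < 2.0   # tr(Sigma) = 2 - 2**(1 - d), operator norm = 1
assert r < d            # much smaller than the ambient dimension
```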
(a) Fix β > 0 and r > 0, and define the prior distribution π0 on Θ = R^d × R^d by N(0, β^{−1}Σ) × N(0, β^{−1}Σ), that is, the product of two mean-zero normals with covariance β^{−1}Σ. Let P_u be the normal distribution N(u, β^{−1}Σ) truncated to the ball u + rB₂^d = {x | ‖x − u‖₂ ≤ r}, and define the posterior π_{u,v} = P_u × P_v. Show that for any u, v ∈ Σ^{1/2}S^{d−1} = {u | u^⊤Σ^{−1}u = 1}, we have
\[
D_{\mathrm{kl}}(\pi_{u,v}\|\pi_0) = 2\log\frac{1}{C(r)} + \beta,
\]
where C(r) := P(‖Z‖₂ ≤ r) for Z ∼ N(0, β^{−1}Σ) is the normalization for P_u.
(b) Show that if r = √(2β^{−1} tr(Σ)) then C(r) ≥ 1/2, and conclude that D_kl(π_{u,v}‖π₀) ≤ 2 log 2 + β.
(c) Define
\[
f(\theta_1, \theta_2, M) := \theta_1^\top \Sigma^{-1/2}(M - \Sigma)\Sigma^{-1/2}\theta_2.
\]
Show that for fixed θ₁, θ₂, f is sub-exponential, specifically,

(d) Show that for (θ₁, θ₂) ∼ π_{u,v} with the choice r = √(2 tr(Σ)/β), we have
\[
\max\{\|\theta_1\|_2, \|\theta_2\|_2\} \le \sqrt{\|\Sigma\|_{\mathrm{op}}} + \sqrt{2\beta^{-1}\mathrm{tr}(\Sigma)}.
\]
(e) Use Corollary 7.1.4 to show that for any fixed |λ| ≤ 1/(16κ²‖Σ‖_op), with probability at least 1 − t,
Exercise 7.5: In this question, we compare the convergence guarantee in Proposition 7.1.5 to that in inequality (7.5.1). Let X_i be drawn i.i.d. N(0, Σ) for some covariance Σ ⪰ 0.
for all u ∈ Rd .
for all u ∈ Rd .
(c) Given the previous parts, compare the guarantees Proposition 7.1.5 and inequality (7.5.1)
provide on the deviations Pn XX ⊤ − Σ. How far apart can they be?
Exercise 7.6 (An error lower bound): Let P0 and P1 be distributions on X, where X is finite. Let Ψn : X^n → {0, 1} be a sequence of tests with small enough type-II error that
\[
\gamma_n := \frac{1}{n}\log\frac{1}{P_1^n(\Psi_n = 0)} - D_{\mathrm{kl}}(P_0\|P_1) \ge 0
\]
for all large enough n. Show that there is a numerical constant c > 0 such that for all large enough n, the type-I error ϵ := P0ⁿ(Ψn = 1) satisfies
\[
\epsilon \ge 1 - n\exp\Bigl(-c\,\frac{n\gamma_n^2}{\log^2 n}\Bigr).
\]
JCD Comment: Exercise on self-concordant losses and measuring error in the “natural”
metric based on the Hessian.
Chapter 8
In this chapter, we continue to build on our ideas on stability in different scenarios, ranging from
model fitting and concentration to interactive data analyses. Here, we show how stability ideas
allow us to provide a new type of protection: the privacy of participants in studies. Until the mid-
2000s, the major challenge in this direction had been a satisfactory definition of privacy, because
collection of side information often results in unforeseen compromises of private information. The
introduction of differential privacy—a type of stability in likelihood ratios for data releases from
differing samples—alleviated these challenges, providing a firm foundation on which to build private
estimators and other methodology. (Though it is possible to trace some of the definitions and major
insights in privacy back at least to survey sampling literature in the 1960s.) Consequently, in this
chapter we focus on privacy notions based on differential privacy and its cousins, developing the
information-theoretic stability ideas helpful to understand the protections it is possible to provide.
perform a study on smoking, and discover that smoking causes cancer. We publish the result, but
now we have “compromised” the privacy of everyone who smokes who did not participate in the
study: we know they are more likely to get cancer.
In each of these cases, the biggest challenge is one of side information: how can we be sure that, when releasing a particular statistic, dataset, or other quantity, no adversary will be able to infer sensitive data about participants in our study? We articulate three desiderata that—we
believe—suffice for satisfactory definitions of privacy. In discussion of private releases of data, we
require a bit of vocabulary. We term a (randomized) algorithm releasing data either a privacy
mechanism, consistent with much of the literature in privacy, or a channel, mapping from the input
sample to some output space, in keeping with our statistical and information-theoretic focus. In
no particular order, we wish our privacy mechanism, which takes as input a sample X_1^n ∈ X^n and releases some Z, to satisfy the following.
i. Given the output Z, even an adversary knowing everyone in the study (excepting one person)
should not be able to test whether you belong to the study.
ii. If you participate in multiple “private” studies, there should be some graceful degradation
in the privacy protections, rather than a catastrophic failure. As part of this, any definition
should guarantee that further processing of the output Z of a private mechanism X1n → Z, in
the form of the Markov chain X1n → Z → Y , should not allow further compromise of privacy
(that is, a data-processing inequality). Additional participation in “private” studies should
continue to provide little additional information.
iii. The mechanism X1n → Z should be resilient to side information: even if someone knows
something about you, he should learn little about you if you belong to X1n , and this should
remain true even if the adversary later gleans more information about you.
The third desideratum is perhaps most elegantly phrased via a Bayesian perspective, where an
adversary has some prior beliefs π on the membership of a dataset (these prior beliefs can then
capture any side information the adversary has). The strongest adversary has a prior supported on
two samples {x1 , . . . , xn } and {x′1 , . . . , x′n } differing in only a single element; a private mechanism
would then guarantee the adversary’s posterior beliefs (after the release X1n → Z) should not change
significantly.
Before continuing to address these challenges, we take a brief detour to establish notation for the remainder of the chapter. It will be convenient to consider randomized procedures acting on samples themselves; a sample x_1^n is clearly isomorphic to the empirical distribution P_n = \frac{1}{n}\sum_{i=1}^n 1_{x_i}, and for two empirical distributions P_n and P_n' supported on {x₁, ..., x_n} and {x₁', ..., x_n'}, we evidently have
\[
n\,\|P_n - P_n'\|_{\mathrm{TV}} = d_{\mathrm{ham}}(\{x_1, \dots, x_n\}, \{x_1', \dots, x_n'\}),
\]
and so we will identify samples with their empirical distributions. With this notational convenience in place, we then identify
\[
\mathcal P_n = \Bigl\{ P_n = \frac{1}{n}\sum_{i=1}^n 1_{x_i} \;\Bigm|\; x_i \in \mathcal X \Bigr\}
\]
as the set of all empirical distributions on n points in X, and we also abuse notation in an obvious way to define d_ham(P_n, P_n') := n‖P_n − P_n'‖_TV as the number of differing observations in the samples P_n and P_n' represent. A mechanism M is then a (typically) randomized mapping M : P_n → Z,
which we can identify with its induced Markov channel Q from X n → Z; we use the equivalent
views as is convenient.
The challenges of side information motivate Dwork et al.’s definition of differential privacy [80].
The key in differential privacy is that the noisy channel releasing statistics provides guarantees of
bounded likelihood ratios between neighboring samples, that is, samples differing in only a single
entry.
Definition 8.1 (Differential privacy). Let M : P_n → Z be a randomized mapping. Then M is ε-differentially private if for all (measurable) sets S ⊂ Z and all P_n, P_n' ∈ P_n with d_ham(P_n, P_n') ≤ 1,
\[
\frac{\mathbb P(M(P_n) \in S)}{\mathbb P(M(P_n') \in S)} \le e^{\varepsilon}. \tag{8.1.1}
\]
The intuition and original motivation for this definition are that an individual has little incentive
to participate (or not participate) in a study, as the individual’s data has limited effect on the
outcome.
The model (8.1.1) of differential privacy presumes that there is a trusted curator, such as a
hospital, researcher, or corporation, who can collect all the data into one centralized location, and
it is consequently known as the centralized model. A stronger model of privacy is the local model,
in which data providers trust no one, not even the data collector, and privatize their individual
data before the collector even sees it.
Definition 8.2 (Local differential privacy). A channel Q from X to Z is ε-locally differentially private if for all measurable S ⊂ Z and all x, x' ∈ X,
\[
\frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le e^{\varepsilon}. \tag{8.1.2}
\]
It is clear that Definition 8.2 and the condition (8.1.2) are stronger than Definition 8.1: when samples {x₁, ..., x_n} and {x₁', ..., x_n'} differ in at most one observation, then the local model (8.1.2) guarantees that the densities satisfy
\[
\frac{dQ(Z_1^n \mid \{x_i\})}{dQ(Z_1^n \mid \{x_i'\})} = \prod_{i=1}^n \frac{dQ(Z_i \mid x_i)}{dQ(Z_i \mid x_i')} \le e^{\varepsilon},
\]
where the inequality follows because only a single ratio may contain x_i ≠ x_i'.
In the remainder of this introductory section, we provide a few of the basic mechanisms in use
in differential privacy, then discuss its “semantics,” that is, its connections to the three desiderata
we outline above. In the coming sections, we revisit a few more advanced topics, in particular, the
composition of multiple private mechanisms and a few weakenings of differential privacy, as well as
more sophisticated examples.
estimate the proportion of the population with a characteristic (versus those without); call
these groups 0 and 1. Rather than ask the participant to answer the question specifically,
however, we give them a spinner with a face painted in two known areas, where the first
corresponds to group 0 and has area eε /(1 + eε ) and the second to group 1 and has area
1/(1 + eε ). Thus, when the participant spins the spinner, it lands in group 0 with probability
eε /(1 + eε ). Then we simply ask the participant, upon spinning the spinner, to answer “Yes”
if he or she belongs to the indicated group, “No” otherwise.
Let us demonstrate that this randomized response mechanism provides ε-local differential privacy. Indeed, we have
\[
\frac{Q(\text{Yes} \mid x = 0)}{Q(\text{Yes} \mid x = 1)} = e^{\varepsilon}
\quad\text{and}\quad
\frac{Q(\text{No} \mid x = 0)}{Q(\text{No} \mid x = 1)} = e^{-\varepsilon},
\]
so that Q(Z = z | x)/Q(Z = z | x') ∈ [e^{−ε}, e^{ε}] for all x, z. That is, the randomized response channel provides ε-local privacy. 3
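The verification above can be done mechanically by enumerating every input/output pair of the channel; a small sketch (ε = 0.7 is an arbitrary choice):

```python
import math

eps = 0.7
e = math.exp(eps)
p0 = e / (1 + e)                 # probability the spinner indicates group 0
# Q[x][z]: probability a member of group x answers z (1 = "Yes", 0 = "No")
Q = {0: {1: p0, 0: 1 - p0},      # group 0 says "Yes" iff spinner shows group 0
     1: {1: 1 - p0, 0: p0}}      # group 1 says "Yes" iff spinner shows group 1

for z in (0, 1):
    for x in (0, 1):
        for xp in (0, 1):
            ratio = Q[x][z] / Q[xp][z]
            assert math.exp(-eps) - 1e-12 <= ratio <= math.exp(eps) + 1e-12
```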
The interesting question is, of course, whether we can still use this channel to estimate the
proportion of the population with the sensitive characteristic. Indeed, we can. We can provide
a somewhat more general analysis, however, which we now do so that we can give a complete
example.
Example 8.1.2 (Randomized response, continued): Suppose that we have an attribute of interest, x, taking the values x ∈ {1, . . . , k}. Then we consider the channel (of Z drawn conditional on x)
\[
Z = \begin{cases}
x & \text{with probability } \frac{e^{\varepsilon}}{k - 1 + e^{\varepsilon}} \\
\mathsf{Uniform}([k]\setminus\{x\}) & \text{with probability } \frac{k-1}{k - 1 + e^{\varepsilon}}.
\end{cases}
\]
The estimator \hat p_n := \frac{(e^{\varepsilon} + k - 1)\,\hat c_n - \mathbf 1}{e^{\varepsilon} - 1}, where \hat c_n denotes the empirical distribution of Z_1, \ldots, Z_n, satisfies E[\hat p_n] = p, and we also have
\[
\mathbb E\,\|\hat p_n - p\|_2^2
= \Bigl(\frac{e^{\varepsilon}+k-1}{e^{\varepsilon}-1}\Bigr)^2 \mathbb E\,\|\hat c_n - \mathbb E[\hat c_n]\|_2^2
= \Bigl(\frac{e^{\varepsilon}+k-1}{e^{\varepsilon}-1}\Bigr)^2 \frac{1}{n}\sum_{j=1}^k \mathbb P(Z=j)\bigl(1 - \mathbb P(Z=j)\bigr).
\]
As \sum_j \mathbb P(Z = j) = 1, we always have the bound \mathbb E[\|\hat p_n - p\|_2^2] \le \frac{1}{n}\bigl(\frac{e^{\varepsilon}+k-1}{e^{\varepsilon}-1}\bigr)^2.
We may consider two regimes for simplicity: when ε ≤ 1 and when ε ≥ log k. In the former case—the high privacy regime—we have P(Z = j) ≍ 1/k for each j, so that the mean squared ℓ₂ error scales as \frac{1}{n}\frac{k^2}{\varepsilon^2}. When ε ≥ log k is large, by contrast, we see that the error scales at worst as \frac{1}{n}, which is the "non-private" mean squared error. 3
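A short sketch of this k-ary channel and its debiasing. The inversion formula pⱼ = ((e^ε + k − 1)P(Z = j) − 1)/(e^ε − 1) is my reconstruction of the estimator (chosen so that it exactly inverts the channel), so rather than simulate, the code checks that identity at the population level (k, ε, and p are arbitrary choices):

```python
import math

k, eps = 5, 1.0
e = math.exp(eps)
p = [0.4, 0.3, 0.1, 0.1, 0.1]  # true distribution on {1, ..., k}

def pz(j):
    """P(Z = j): report the truth w.p. e/(k-1+e), else uniform over the rest."""
    return p[j] * e / (k - 1 + e) + (1 - p[j]) / (k - 1 + e)

# inverting the channel: p_j = ((e + k - 1) * P(Z = j) - 1) / (e - 1)
debiased = [((e + k - 1) * pz(j) - 1) / (e - 1) for j in range(k)]
assert all(abs(d - t) < 1e-9 for d, t in zip(debiased, p))
```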
While randomized response is essentially the standard mechanism in locally private settings, in
centralized privacy, the “standard” mechanism is Laplace noise addition because of its exponential
tails. In this case, we require a few additional definitions. Suppose that we wish to release some
d-dimensional function f (Pn ) of the sample distribution Pn (equivalently, the associated sample
X1n ), where f takes values in Rd . In the case that f is Lipschitz with respect to the Hamming
metric—that is, the counting metric on X n —it is relatively straightforward to develop private
mechanisms. To better reflect the nomenclature in the privacy literature, and for easier use in our future development, for p ∈ [1, ∞] we define the global sensitivity of f by
\[
\mathrm{GS}_p(f) := \sup_{P_n, P_n' \in \mathcal P_n} \bigl\{ \|f(P_n) - f(P_n')\|_p \;\big|\; d_{\mathrm{ham}}(P_n, P_n') \le 1 \bigr\}.
\]
This is simply the Lipschitz constant of f with respect to the Hamming metric. The global sensitivity is a convenient quantity, because it allows simple noise addition strategies.
Letting L = GS₁(f) be the Lipschitz constant for simplicity, if we consider the mechanism defined by the addition of W ∈ R^d with independent Laplace(L/ε) coordinates,
\[
Z := f(P_n) + W, \qquad W_j \stackrel{\mathrm{iid}}{\sim} \mathsf{Laplace}(L/\varepsilon), \tag{8.1.3}
\]
we have that Z is ε-differentially private. Indeed, for samples P_n, P_n' differing in at most a single example, Z has density ratio
\[
\frac{q(z \mid P_n)}{q(z \mid P_n')}
= \exp\Bigl(-\frac{\varepsilon}{L}\|f(P_n) - z\|_1 + \frac{\varepsilon}{L}\|f(P_n') - z\|_1\Bigr)
\le \exp\Bigl(\frac{\varepsilon}{L}\|f(P_n) - f(P_n')\|_1\Bigr) \le \exp(\varepsilon)
\]
by the triangle inequality and that f is L-Lipschitz with respect to the Hamming metric. Thus Z is ε-differentially private. Moreover, we have
\[
\mathbb E[\|Z - f(P_n)\|_2^2] = \frac{2d\,\mathrm{GS}_1(f)^2}{\varepsilon^2},
\]
so that if L is small, we may report the value of f accurately. 3
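A direct numerical check of this density-ratio argument (a sketch; d = 3, ε = 0.5, and the two function values, at ℓ1 distance 0.6 ≤ L, are arbitrary choices):

```python
import math
import random

def laplace_density(z, mu, b):
    """Product of len(mu) independent Laplace(b) densities centered at mu."""
    l1 = sum(abs(zi - mi) for zi, mi in zip(z, mu))
    return (2 * b) ** (-len(mu)) * math.exp(-l1 / b)

eps, L = 0.5, 1.0                 # privacy level and GS_1(f) = L
f_P = [0.2, -0.4, 1.0]            # f(P_n)
f_Pp = [0.5, -0.1, 1.0]           # f(P_n'), ||f(P_n) - f(P_n')||_1 = 0.6 <= L
b = L / eps                       # Laplace scale in the mechanism (8.1.3)

random.seed(1)
for _ in range(100):              # ratio bounded by e^eps at random outputs z
    z = [random.uniform(-3.0, 3.0) for _ in range(3)]
    ratio = laplace_density(z, f_P, b) / laplace_density(z, f_Pp, b)
    assert ratio <= math.exp(eps) + 1e-9
```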
The most common instances and applications of the Laplace mechanism are in estimation of
means and histograms. Let us demonstrate more carefully worked examples in these two cases.
Example 8.1.4 (Private one-dimensional mean estimation): Suppose that we have variables X_i taking values in [−b, b] for some b < ∞, and wish to estimate E[X]. A natural function to release is then f(X_1^n) = \bar X_n = \frac{1}{n}\sum_{i=1}^n X_i. This has Lipschitz constant 2b/n with respect to the Hamming metric, because for any two samples x, x' ∈ [−b, b]^n differing in only entry i, we have
\[
|f(x) - f(x')| = \frac{1}{n}|x_i - x_i'| \le \frac{2b}{n}
\]
because x_i ∈ [−b, b]. Thus the Laplace mechanism (8.1.3) with the choice W ∼ Laplace(2b/(nε)) yields
\[
\mathbb E[(Z - \mathbb E[X])^2] = \mathbb E[(\bar X_n - \mathbb E[X])^2] + \mathbb E[(Z - \bar X_n)^2]
= \frac{1}{n}\mathrm{Var}(X) + \frac{8b^2}{n^2\varepsilon^2} \le \frac{b^2}{n} + \frac{8b^2}{n^2\varepsilon^2}.
\]
We can privately release means with little penalty so long as ε ≫ n^{−1/2}. 3
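A quick simulation of this private mean release (a sketch; the uniform data distribution and the values of n, b, ε are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, eps = 100_000, 1.0, 1.0
xs = rng.uniform(-b, b, size=n)               # data in [-b, b] with E[X] = 0
noise = rng.laplace(scale=2 * b / (n * eps))  # Laplace(2b/(n * eps)) noise
z = xs.mean() + noise                         # the private release
assert abs(z) < 0.05   # error is of order b/sqrt(n) since eps >> n**-0.5
```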
Example 8.1.5 (Private histogram (multinomial) release): Suppose that we wish to estimate a multinomial distribution, or put differently, a histogram. That is, we have observations X ∈ {1, . . . , k}, where k may be large, and wish to estimate p_j := P(X = j) for j = 1, . . . , k. For a given sample x_1^n, the empirical count vector \hat p_n with coordinates \hat p_{n,j} = \frac{1}{n}\sum_{i=1}^n 1\{X_i = j\} satisfies
\[
\mathrm{GS}_1(\hat p_n) = \frac{2}{n}
\]
because swapping a single example x_i for x_i' may change the counts for at most two coordinates j, j' by 1 (and hence the empirical frequencies by 1/n). Consequently, the Laplace noise addition mechanism
\[
Z = \hat p_n + W, \qquad W_j \stackrel{\mathrm{iid}}{\sim} \mathsf{Laplace}\Bigl(\frac{2}{n\varepsilon}\Bigr)
\]
satisfies
\[
\mathbb E[\|Z - \hat p_n\|_2^2] = \frac{8k}{n^2\varepsilon^2}
\]
and consequently
\[
\mathbb E[\|Z - p\|_2^2] = \frac{8k}{n^2\varepsilon^2} + \frac{1}{n}\sum_{j=1}^k p_j(1 - p_j) \le \frac{8k}{n^2\varepsilon^2} + \frac{1}{n}.
\]
This example shows one of the challenges of differentially private mechanisms: even in the case where the quantity of interest is quite stable (insensitive to changes in the underlying sample, or has small Lipschitz constant), it may be the case that the resulting mechanism adds noise that introduces some dimension-dependent scaling. In this case, the condition on privacy levels acceptable for good estimation—in that the rate of convergence is no different from the non-private case, which achieves \mathbb E[\|\hat p_n - p\|_2^2] = \frac{1}{n}\sum_{j=1}^k p_j(1-p_j) \le \frac{1}{n}—is that ε ≫ √(k/n). Thus, in the case that the histogram has a large number of bins, the naive noise addition strategy cannot provide as much protection without sacrificing efficiency.
If instead of ℓ₂-error we consider ℓ∞ error, it is possible to provide somewhat more satisfying results in this case. Indeed, we know that P(‖W‖∞ ≥ t) ≤ k exp(−t/b) for independent W_j ∼ Laplace(b), so that in the mechanism above we have
\[
\mathbb P(\|Z - \hat p_n\|_\infty \ge t) \le k\exp\Bigl(-\frac{t n \varepsilon}{2}\Bigr) \quad \text{for all } t \ge 0,
\]
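This ℓ∞ bound is just a union bound over the k coordinates; since P(|W_j| ≥ t) = e^{−t/b} exactly for W_j ∼ Laplace(b), one can compare the exact tail with the bound in closed form (a sketch with arbitrary k and b):

```python
import math

k, b = 20, 0.1
for t in [0.05, 0.2, 0.5, 1.0]:
    per_coord = math.exp(-t / b)        # P(|W_j| >= t) for Laplace(b)
    exact = 1 - (1 - per_coord) ** k    # P(max_j |W_j| >= t), exact
    assert exact <= k * per_coord       # the union bound used above
```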
so that the samples under H0 and H1 differ only in the ith observation Xi ∈ {xi , x′i }. Now, for a
channel taking inputs from X n and outputting Z ∈ Z, we define ε-conditional hypothesis testing
privacy by saying that
Q(Ψ(Z) = 1 | H0 , Z ∈ A) + Q(Ψ(Z) = 0 | H1 , Z ∈ A) ≥ 1 − ε (8.1.4)
for all sets A ⊂ Z satisfying Q(A | H0 ) > 0 and Q(A | H1 ) > 0. That is, roughly, no matter
what value Z takes on, the probability of error in a test of whether H0 or H1 is true—even with
knowledge of xj , j ̸= i—is high. We then have the following proposition.
Proposition 8.1.6. Assume the channel Q is ε-differentially private. Then Q is also ε̄-conditional hypothesis testing private with ε̄ = 1 − e^{−2ε} ≤ 2ε.
Proof Let Ψ be any test of H0 versus H1, and let B = {z | Ψ(z) = 1} be the acceptance region of the test. Then
\[
\begin{aligned}
Q(B \mid H_0, Z \in A) + Q(B^c \mid H_1, Z \in A)
&= \frac{Q(A, B \mid H_0)}{Q(A \mid H_0)} + \frac{Q(A, B^c \mid H_1)}{Q(A \mid H_1)} \\
&\ge e^{-2\varepsilon}\,\frac{Q(A, B \mid H_1)}{Q(A \mid H_1)} + \frac{Q(A, B^c \mid H_1)}{Q(A \mid H_1)} \\
&\ge e^{-2\varepsilon}\,\frac{Q(A, B \mid H_1) + Q(A, B^c \mid H_1)}{Q(A \mid H_1)},
\end{aligned}
\]
where the first inequality uses ε-differential privacy. Then we simply note that Q(A, B | H1 ) +
Q(A, B c | H1 ) = Q(A | H1 ).
So we see that (roughly), even conditional on the output of the channel, we still cannot test whether
the initial dataset was x or x′ whenever x, x′ differ in only a single observation.
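For a finite channel, the guarantee (8.1.4) can be checked exhaustively. The sketch below uses binary randomized response as the ε-differentially private channel on a single observation (so H0 and H1 correspond to the two neighboring inputs x = 0 and x = 1; ε = 0.5 is arbitrary), enumerating every conditioning set A and every test Ψ:

```python
import math
from itertools import product

eps = 0.5
e = math.exp(eps)
# binary randomized response Q(z | x), an eps-differentially private channel
Q = {0: {0: e / (1 + e), 1: 1 / (1 + e)},
     1: {0: 1 / (1 + e), 1: e / (1 + e)}}

for A in ([0], [1], [0, 1]):                 # nonempty conditioning sets
    qa0 = sum(Q[0][z] for z in A)            # Q(Z in A | H0)
    qa1 = sum(Q[1][z] for z in A)            # Q(Z in A | H1)
    for psi in product([0, 1], repeat=2):    # every test z -> psi[z]
        err0 = sum(Q[0][z] for z in A if psi[z] == 1) / qa0
        err1 = sum(Q[1][z] for z in A if psi[z] == 0) / qa1
        # conditional type-I plus type-II error is at least e^{-2 eps}
        assert err0 + err1 >= math.exp(-2 * eps) - 1e-12
```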
An alternative perspective is to consider a Bayesian one, which allows us to more carefully
consider side information. In this case, we consider the following thought experiment. An adversary
has a set of prior beliefs π on X n , and we consider the adversary’s posterior π(· | Z) induced by
observing the output Z of some mechanism M . In this case, Bayes factors, which measure how
much prior and posterior distributions differ after observations, provide one immediate perspective.
In particular, if the channel releasing Z = M(·) is ε-differentially private, then for any neighboring samples Pn, Pn′ and any output z, the posterior odds satisfy
\[
  \frac{\pi(P_n \mid z)}{\pi(P_n' \mid z)} \le e^{\varepsilon} \, \frac{\pi(P_n)}{\pi(P_n')}.
\]
Proof Let q be the associated density of Z = M(·) (conditional or marginal). We have π(Pn | z) = q(z | Pn)π(Pn)/q(z), so that
\[
  \frac{\pi(P_n \mid z)}{\pi(P_n' \mid z)} = \frac{q(z \mid P_n)}{q(z \mid P_n')} \cdot \frac{\pi(P_n)}{\pi(P_n')} \le e^{\varepsilon} \, \frac{\pi(P_n)}{\pi(P_n')},
\]
the inequality because the channel is ε-differentially private.
Thus we see that private channels mean that prior and posterior odds between two neighboring
samples cannot change substantially, no matter what the observation Z actually is.
For an alternative view, we consider a somewhat restricted family of prior distributions, where we now take the view of a sample x_1^n ∈ X^n. There is some annoyance in this calculation in that the order of the sample may be important, but it at least gets toward some semantic interpretation of differential privacy. We consider the adversary's beliefs on whether a particular value x belongs to the sample, but more precisely, we consider whether Xi = x. Writing s ⊕i x for the sample that places x in position i of s ∈ X^{n−1}, we assume that the prior density π on X^n factorizes as
\[
  \pi(s \oplus_i x) = \pi_{\setminus i}(s) \, \pi_i(x), \tag{8.1.5}
\]
so that the adversary's beliefs about Xi are independent of his beliefs about the other members of the dataset. (We assume that π is a density with respect to a measure µ on X^{n−1} × X, where dµ(s, x) = dµ(s)dµ(x).) Under the condition (8.1.5), we have the following proposition.
Proposition 8.1.8. Let Q be an ε-differentially private channel and let π be any prior distribution satisfying condition (8.1.5). Then for any z, the posterior density πi on Xi satisfies
\[
  e^{-\varepsilon} \pi_i(x) \le \pi_i(x \mid Z = z) \le e^{\varepsilon} \pi_i(x).
\]
Proof We abuse notation and for a sample s ∈ X^{n−1}, where s = (x_1^{i−1}, x_{i+1}^n), we let s ⊕i x = (x_1^{i−1}, x, x_{i+1}^n). Letting µ be the base measure on X^{n−1} × X with respect to which π is a density and q(· | x_1^n) be the density of the channel Q, we have
\begin{align*}
\pi_i(x \mid Z = z)
&= \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \pi(s \oplus_i x) d\mu(s)}
        {\int_{s \in \mathcal{X}^{n-1}} \int_{x' \in \mathcal{X}} q(z \mid s \oplus_i x') \pi(s \oplus_i x') d\mu(s, x')} \\
&\stackrel{(\star)}{\le} e^{\varepsilon} \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \pi(s \oplus_i x) d\mu(s)}
        {\int_{s \in \mathcal{X}^{n-1}} \int_{x' \in \mathcal{X}} q(z \mid s \oplus_i x) \pi(s \oplus_i x') d\mu(s) d\mu(x')} \\
&= e^{\varepsilon} \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \pi_{\setminus i}(s) d\mu(s) \, \pi_i(x)}
        {\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \pi_{\setminus i}(s) d\mu(s) \int_{x' \in \mathcal{X}} \pi_i(x') d\mu(x')} \\
&= e^{\varepsilon} \pi_i(x),
\end{align*}
where inequality (⋆) follows from ε-differential privacy. The lower bound is similar.
Roughly, however, we see that Proposition 8.1.8 captures the idea that even if an adversary has
substantial prior knowledge—in the form of a prior distribution π on the ith value Xi and everything
else in the sample—the posterior cannot change much.
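The content of Proposition 8.1.8 is easy to check numerically in the simplest case n = 1 (so that the factorization condition is vacuous). The sketch below, with function names of our own choosing, uses the randomized-response channel on a single bit and verifies that any posterior stays within a factor e^{±ε} of the prior:

```python
import math

def rr_channel(eps):
    # randomized response on {0, 1}: report the true bit w.p. e^eps / (1 + e^eps),
    # which is an eps-differentially private channel
    p = math.exp(eps) / (1 + math.exp(eps))
    return {(z, x): (p if z == x else 1 - p) for z in (0, 1) for x in (0, 1)}

def posterior(q, prior, z):
    # Bayes update of the adversary's prior after observing the channel output z
    num = {x: q[(z, x)] * prior[x] for x in prior}
    tot = sum(num.values())
    return {x: v / tot for x, v in num.items()}

eps = 0.8
q = rr_channel(eps)
prior = {0: 0.3, 1: 0.7}
for z in (0, 1):
    post = posterior(q, prior, z)
    for x in prior:
        # the posterior is within e^{+/- eps} of the prior, whatever the observation
        assert math.exp(-eps) * prior[x] <= post[x] <= math.exp(eps) * prior[x]
```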
Definition 8.4. Let P and Q be distributions on a space X with densities p and q (with respect to a measure µ). For α ∈ [1, ∞], the Rényi-α-divergence between P and Q is
\[
  D_\alpha(P \| Q) := \frac{1}{\alpha - 1} \log \int \left( \frac{p(x)}{q(x)} \right)^\alpha q(x) \, d\mu(x).
\]
Here, the values α ∈ {1, ∞} are defined in terms of their respective limits.
Rényi divergences satisfy exp((α − 1)Dα(P||Q)) = 1 + Df(P||Q), i.e., Dα(P||Q) = (1/(α−1)) log(1 + Df(P||Q)), for the f-divergence defined by f(t) = t^α − 1, so that they inherit a number of the properties of such divergences. We enumerate a few here for later reference.
Proposition 8.2.1 (Basic facts on Rényi divergence). Rényi divergences satisfy the following.
i. The map α ↦ Dα(P||Q) is nondecreasing in α.
ii. limα↓1 Dα(P||Q) = Dkl(P||Q) and limα↑∞ Dα(P||Q) = sup{t | Q(p(X)/q(X) ≥ t) > 0}.
iii. Let K(· | x) be a Markov kernel from X → Z as in Proposition 2.2.13, and let KP and KQ be
the induced marginals of P and Q under K, respectively. Then Dα (KP ||KQ ) ≤ Dα (P ||Q).
We leave the proof of this proposition as Exercise 8.1, noting that property i is a consequence
of Hölder’s inequality, property ii is by L’Hopital’s rule, and property iii is an immediate conse-
quence of Proposition 2.2.13. Rényi divergences also tensorize nicely—generalizing the tensoriza-
tion properties of KL-divergence and information of Chapter 2 (recall the chain rule (2.1.6) for
KL-divergence)—and we return to this later. As a preview, however, these tensorization proper-
ties allow us to prove that the composition of multiple private data releases remains appropriately
private.
With these preliminaries in place, we can then provide the definition.
Definition 8.5 (Rényi-differential privacy). Let ε ≥ 0 and α ∈ [1, ∞]. A channel Q from Pn to output space Z is (ε, α)-Rényi private if for all neighboring samples Pn, Pn′ ∈ Pn,
\[
  D_\alpha\big( Q(\cdot \mid P_n) \,\|\, Q(\cdot \mid P_n') \big) \le \varepsilon.
\]
Clearly, any ε-differentially private channel is also (ε, α)-Rényi private for any α ≥ 1; as we soon
see, we can provide tighter guarantees than this.
Example 8.2.2 (Rényi divergence between Gaussian distributions): Consider normal distributions N(µ0, Σ) and N(µ1, Σ). Then
\[
  D_\alpha\big( N(\mu_0, \Sigma) \,\|\, N(\mu_1, \Sigma) \big) = \frac{\alpha}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1). \tag{8.2.3}
\]
To see this equality, we compute the appropriate integral of the densities. Let p and q be the densities of N(µ0, Σ) and N(µ1, Σ), respectively. Then letting Eµ1 denote expectation over X ∼ N(µ1, Σ), we have
\begin{align*}
\int \left( \frac{p(x)}{q(x)} \right)^\alpha q(x) dx
&= \mathbb{E}_{\mu_1}\left[ \exp\left( -\frac{\alpha}{2}(X - \mu_0)^T \Sigma^{-1} (X - \mu_0) + \frac{\alpha}{2}(X - \mu_1)^T \Sigma^{-1} (X - \mu_1) \right) \right] \\
&\stackrel{(i)}{=} \mathbb{E}_{\mu_1}\left[ \exp\left( -\frac{\alpha}{2}(\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) + \alpha (\mu_0 - \mu_1)^T \Sigma^{-1} (X - \mu_1) \right) \right] \\
&\stackrel{(ii)}{=} \exp\left( -\frac{\alpha}{2}(\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) + \frac{\alpha^2}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) \right),
\end{align*}
where equality (i) is simply using that (x − a)² − (x − b)² = (a − b)² + 2(b − a)(x − b) and equality (ii) follows because (µ0 − µ1)^T Σ^{−1}(X − µ1) ∼ N(0, (µ1 − µ0)^T Σ^{−1}(µ1 − µ0)) under X ∼ N(µ1, Σ). Noting that −α + α² = α(α − 1) and taking logarithms gives the result. 3
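One can sanity-check the identity (8.2.3) numerically in one dimension; the sketch below (a plain Riemann sum, with function names of our own choosing) integrates (p/q)^α q for two Gaussians with shared variance and compares against α(µ0 − µ1)²/(2σ²):

```python
import math

def renyi_gauss_numeric(mu0, mu1, sigma, alpha, lo=-30.0, hi=30.0, n=200001):
    # Riemann-sum approximation of (1/(alpha-1)) log \int (p/q)^alpha q dx
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        p = math.exp(-(x - mu0) ** 2 / (2 * sigma ** 2))  # unnormalized densities;
        q = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))  # the ratio p/q is exact
        total += (p / q) ** alpha * q * h
    total /= math.sqrt(2 * math.pi) * sigma               # normalize q
    return math.log(total) / (alpha - 1)

mu0, mu1, sigma, alpha = 0.0, 1.0, 1.5, 3.0
closed_form = alpha * (mu0 - mu1) ** 2 / (2 * sigma ** 2)
assert abs(renyi_gauss_numeric(mu0, mu1, sigma, alpha) - closed_form) < 1e-4
```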
Example 8.2.2 is the key to developing different privacy-preserving schemes under Rényi privacy.
Let us reconsider Example 8.1.3, except that instead of assuming the function f of interest is smooth
with respect to ℓ1 norm, we use the ℓ2 -norm.
Example 8.2.3 (Gaussian mechanisms): Suppose that f : Pn → R^d has Lipschitz constant L with respect to the ℓ2-norm (for the Hamming metric dham), that is, global ℓ2-sensitivity GS2(f) ≤ L. Then Example 8.2.2 shows that the mechanism
\[
  Z = f(P_n) + W, \qquad W \sim N(0, \sigma^2 I)
\]
satisfies
\[
  D_\alpha\big( N(f(P_n), \sigma^2 I) \,\|\, N(f(P_n'), \sigma^2 I) \big)
  = \frac{\alpha}{2\sigma^2} \left\| f(P_n) - f(P_n') \right\|_2^2 \le \frac{\alpha}{2\sigma^2} L^2
\]
for neighboring samples Pn, Pn′. Thus, if we have Lipschitz constant L and desire (ε, α)-Rényi privacy, we may take σ² = L²α/(2ε), and then the mechanism
\[
  Z = f(P_n) + W, \qquad W \sim N\left( 0, \frac{L^2 \alpha}{2\varepsilon} I \right) \tag{8.2.4}
\]
satisfies (ε, α)-Rényi privacy. 3
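As a concrete sketch (function and parameter names are our own), the calibration σ² = L²α/(2ε) in the mechanism (8.2.4) looks as follows:

```python
import math
import random

def gaussian_mechanism(value, L, eps, alpha, rng):
    # add N(0, sigma^2 I) noise with sigma^2 = L^2 * alpha / (2 * eps),
    # which gives (eps, alpha)-Renyi privacy for an L-sensitive f
    sigma = math.sqrt(L ** 2 * alpha / (2 * eps))
    return [v + rng.gauss(0.0, sigma) for v in value], sigma

rng = random.Random(0)
z, sigma = gaussian_mechanism([0.2, -0.1, 0.4], L=0.5, eps=0.5, alpha=4.0, rng=rng)
assert abs(sigma ** 2 - 0.5 ** 2 * 4.0 / (2 * 0.5)) < 1e-12  # sigma^2 = 1 here
```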
Certain special cases can make this more concrete. Indeed, suppose we wish to estimate a mean E[X], where the Xi are drawn i.i.d. from some distribution P such that ∥Xi∥2 ≤ r with probability 1 for some radius r > 0.
Example 8.2.4 (Bounded mean estimation with Gaussian mechanisms): Letting f(X_1^n) = \bar{X}_n be the sample mean, where the Xi satisfy ∥Xi∥2 ≤ r as above, we see immediately that
\[
  GS_2(f) = \frac{2r}{n}.
\]
In this case, the Gaussian mechanism (8.2.4) with L = 2r/n yields
\[
  \mathbb{E}\left[ \big\| Z - \bar{X}_n \big\|_2^2 \right] = \mathbb{E}[\|W\|_2^2] = \frac{2dr^2\alpha}{n^2\varepsilon}.
\]
Then we have
\[
  \mathbb{E}[\|Z - \mathbb{E}[X]\|_2^2] = \mathbb{E}[\|\bar{X}_n - \mathbb{E}[X]\|_2^2] + \mathbb{E}[\|Z - \bar{X}_n\|_2^2] \le \frac{r^2}{n} + \frac{2dr^2\alpha}{n^2\varepsilon}.
\]
It is not immediately apparent how to compare this quantity to the case for the Laplace mech-
anism in Example 8.1.3, but we will return to this shortly once we have developed connections
between the various privacy notions we have developed. 3
Proposition 8.2.5. Let ε ≥ 0 and let P and Q be distributions such that e^{−ε} ≤ P(A)/Q(A) ≤ e^ε for all measurable sets A. Then for any α ∈ [1, ∞],
\[
  D_\alpha(P \| Q) \le \min\left\{ \frac{3\alpha}{2} \varepsilon^2, \; \varepsilon \right\}.
\]
Corollary 8.2.6. Let ε ≥ 0 and assume that Q is ε-differentially private. Then for any α ≥ 1, Q is (min{(3α/2)ε², ε}, α)-Rényi private.
Before proving the proposition, let us see its implications for Example 8.2.4 versus estimation under ε-differential privacy. Let ε ≤ 1, so that roughly to have "similar" privacy, we require that our Rényi private channels satisfy Dα(Q(· | x)||Q(· | x′)) ≤ ε². The ℓ1-sensitivity of the mean satisfies ∥x̄_n − x̄′_n∥1 ≤ √d ∥x̄_n − x̄′_n∥2 ≤ 2√d r/n for neighboring samples. Then the Laplace mechanism (8.1.3) satisfies
\[
  \mathbb{E}[\|Z_{\rm Laplace} - \mathbb{E}[X]\|_2^2] = \mathbb{E}\big[ \big\| \bar{X}_n - \mathbb{E}[X] \big\|_2^2 \big] + \frac{8r^2}{n^2\varepsilon^2} \cdot d^2,
\]
while the Gaussian mechanism under (ε², α)-Rényi privacy will yield
\[
  \mathbb{E}[\|Z_{\rm Gauss} - \mathbb{E}[X]\|_2^2] = \mathbb{E}\big[ \big\| \bar{X}_n - \mathbb{E}[X] \big\|_2^2 \big] + \frac{2r^2}{n^2\varepsilon^2} \cdot d\alpha.
\]
This is evidently better than the Laplace mechanism whenever α < d.
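The d² versus dα distinction is easy to see with numbers; the following sketch (with illustrative parameter values of our own) compares the two excess error terms above:

```python
def laplace_excess(d, r, n, eps):
    # excess mean-squared error of the Laplace mechanism: 8 r^2 d^2 / (n^2 eps^2)
    return 8 * r ** 2 * d ** 2 / (n ** 2 * eps ** 2)

def gauss_excess(d, r, n, eps, alpha):
    # excess mean-squared error of the Gaussian mechanism: 2 r^2 d alpha / (n^2 eps^2)
    return 2 * r ** 2 * d * alpha / (n ** 2 * eps ** 2)

d, r, n, eps, alpha = 100, 1.0, 1000, 0.5, 4.0
# the Gaussian mechanism wins whenever 2 d alpha < 8 d^2, i.e. alpha < 4 d
assert gauss_excess(d, r, n, eps, alpha) < laplace_excess(d, r, n, eps)
```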
Proof of Proposition 8.2.5 We assume that P and Q have densities p and q with respect to a base measure µ, which is no loss of generality, whence the ratio condition implies that e^{−ε} ≤ p/q ≤ e^ε and Dα(P||Q) = (1/(α−1)) log ∫ (p/q)^α q dµ. We prove the result assuming that α ∈ (1, ∞), as continuity gives the result for α ∈ {1, ∞}.
First, it is clear that Dα(P||Q) ≤ ε always. For the other term in the minimum, let us assume that α ≤ 1 + 1/ε and ε ≤ 1. If either of these fails, the result is trivial, because for α > 1 + 1/ε we have (3/2)αε² ≥ (3/2)ε ≥ ε, and similarly ε ≥ 1 implies (3/2)αε² ≥ ε.
Now we perform a Taylor approximation of t ↦ (1 + t)^α. By Taylor's theorem, we have for any t > −1 that
\[
  (1 + t)^\alpha = 1 + \alpha t + \frac{\alpha(\alpha - 1)}{2} (1 + \tilde{t})^{\alpha - 2} t^2
\]
for some t̃ between 0 and t. Consequently,
\begin{align*}
\exp\left( (\alpha - 1) D_\alpha(P \| Q) \right)
&= \int \left( \frac{p(z)}{q(z)} \right)^\alpha q(z) d\mu(z)
 = \int \left( 1 + \frac{p(z)}{q(z)} - 1 \right)^\alpha q(z) d\mu(z) \\
&\le 1 + \alpha \int \left( \frac{p(z)}{q(z)} - 1 \right) q(z) d\mu(z)
   + \frac{\alpha(\alpha - 1)}{2} \max\{1, \exp(\varepsilon(\alpha - 2))\} \int \left( \frac{p(z)}{q(z)} - 1 \right)^2 q(z) d\mu(z) \\
&\le 1 + \frac{\alpha(\alpha - 1)}{2} e^{\varepsilon [\alpha - 2]_+} \cdot (e^\varepsilon - 1)^2,
\end{align*}
where the final inequality uses that ∫(p/q − 1) q dµ = 0 and that (p/q − 1)² ≤ (e^ε − 1)². Now, we know that α − 2 ≤ 1/ε − 1 by assumption, so using that log(1 + x) ≤ x, we obtain
\[
  D_\alpha(P \| Q) \le \frac{\alpha}{2} (e^\varepsilon - 1)^2 \cdot \exp([1 - \varepsilon]_+).
\]
Finally, a numerical calculation yields that this quantity is at most (3α/2)ε² for ε ≤ 1.
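The inequality of Proposition 8.2.5 is simple to verify numerically for discrete distributions whose likelihood ratio lies in [e^{−ε}, e^ε]; a small sketch (example distributions of our own):

```python
import math

def renyi(p, q, alpha):
    # Renyi-alpha divergence between discrete distributions p and q
    return math.log(sum(qi * (pi / qi) ** alpha for pi, qi in zip(p, q))) / (alpha - 1)

eps = 0.3
q = [0.25, 0.25, 0.5]
# choose p so every likelihood ratio p_i/q_i lies in [e^{-eps}, e^{eps}]
p = [0.25 * math.exp(eps), 0.25 * math.exp(-eps),
     1 - 0.25 * math.exp(eps) - 0.25 * math.exp(-eps)]
for alpha in (1.5, 2.0, 5.0, 20.0):
    assert renyi(p, q, alpha) <= min(1.5 * alpha * eps ** 2, eps) + 1e-12
```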
We can also provide connections from (ε, α)-Rényi privacy to (ε, δ)-differential privacy, and
then from there to ε-differential privacy. We begin by showing how to develop (ε, δ)-differential
privacy out of Rényi privacy. Another way to think about this proposition is that whenever two
distributions P and Q are close in Rényi divergence, then there is some limited “amplification” of
probabilities that is possible in moving from one to the other.
Proposition 8.2.7. Let P and Q satisfy Dα(P||Q) ≤ ε. Then for any set A,
\[
  P(A) \le \exp\left( \frac{\alpha - 1}{\alpha} \varepsilon \right) Q(A)^{\frac{\alpha - 1}{\alpha}}.
\]
Additionally, for any δ ∈ (0, 1),
\[
  P(A) \le \exp\left( \varepsilon + \frac{1}{\alpha - 1} \log \frac{1}{\delta} \right) Q(A) + \delta.
\]
Corollary 8.2.8. Let Q be an (ε, α)-Rényi private channel, where α ∈ (1, ∞). Then for any δ ∈ (0, 1), Q is (ε + (1/(α−1)) log(1/δ), δ)-differentially private.
Before turning to the proof of the proposition, we show how it can provide prototypical (ε, δ)-
private mechanisms via Gaussian noise addition.
For the Gaussian mechanism applied to bounded mean estimation as in Example 8.2.4, this conversion yields
\[
  \mathbb{E}[\|Z_{\rm Gauss} - \mathbb{E}[X]\|_2^2] = \mathbb{E}\big[ \big\| \bar{X}_n - \mathbb{E}[X] \big\|_2^2 \big] + O(1) \frac{r^2}{n^2\varepsilon^2} \cdot d \log \frac{1}{\delta}.
\]
Comparing to the previous cases, we see an improvement over the Laplace mechanism whenever log(1/δ) ≪ d, or equivalently δ ≫ e^{−d}. 3
Proof of Proposition 8.2.7 We use the data processing inequality of Proposition 8.2.1.iii, applied with the kernel z ↦ 1{z ∈ A}, which shows that
\[
  \varepsilon \ge D_\alpha(P \| Q) \ge \frac{1}{\alpha - 1} \log \left[ \left( \frac{P(A)}{Q(A)} \right)^\alpha Q(A) \right].
\]
Rearranging and taking exponentials, we immediately obtain the first claim of the proposition.
For the second, we require a bit more work. First, let us assume that Q(A) > e^{−ε} δ^{α/(α−1)}. Then we have by the first claim of the proposition that
\begin{align*}
P(A) &\le \exp\left( \frac{\alpha - 1}{\alpha} \varepsilon + \frac{1}{\alpha} \log \frac{1}{Q(A)} \right) Q(A) \\
&\le \exp\left( \frac{\alpha - 1}{\alpha} \varepsilon + \frac{1}{\alpha} \varepsilon + \frac{1}{\alpha - 1} \log \frac{1}{\delta} \right) Q(A)
 = \exp\left( \varepsilon + \frac{1}{\alpha - 1} \log \frac{1}{\delta} \right) Q(A).
\end{align*}
On the other hand, when Q(A) ≤ e^{−ε} δ^{α/(α−1)}, then again using the first result of the proposition,
\begin{align*}
P(A) &\le \exp\left( \frac{\alpha - 1}{\alpha} \left( \varepsilon + \log Q(A) \right) \right)
 \le \exp\left( \frac{\alpha - 1}{\alpha} \left( \varepsilon - \varepsilon + \frac{\alpha}{\alpha - 1} \log \delta \right) \right) = \delta.
\end{align*}
Finally, we develop our last set of connections, which show how we may relate (ε, δ)-private
channels with ε-private channels. To provide this definition, we require one additional weakened
notion of divergence, which relates (ε, δ)-differential privacy to Rényi-α-divergence with α = ∞.
We define
\[
  D_\infty^\delta(P \| Q) := \sup_{S \subset \mathcal{X}} \left\{ \log \frac{P(S) - \delta}{Q(S)} \;\Big|\; P(S) > \delta \right\},
\]
where the supremum is over measurable sets. Evidently equivalent to this definition is that D_∞^δ(P||Q) ≤ ε if and only if
\[
  P(S) \le e^\varepsilon Q(S) + \delta \quad \text{for all } S \subset \mathcal{X}.
\]
Then we have the following lemma.
Lemma 8.2.10. Let ε > 0 and δ ∈ (0, 1), and let P and Q be distributions on a space X.
(i) We have D_∞^δ(P||Q) ≤ ε if and only if there exists a probability distribution R on X such that ∥P − R∥TV ≤ δ and D_∞(R||Q) ≤ ε.
(ii) We have D_∞^δ(P||Q) ≤ ε and D_∞^δ(Q||P) ≤ ε if and only if there exist distributions P0 and Q0 such that
\[
  \left\| P - P_0 \right\|_{\rm TV} \le \frac{\delta}{1 + e^\varepsilon}, \qquad \left\| Q - Q_0 \right\|_{\rm TV} \le \frac{\delta}{1 + e^\varepsilon},
\]
and
\[
  D_\infty(P_0 \| Q_0) \le \varepsilon \quad \text{and} \quad D_\infty(Q_0 \| P_0) \le \varepsilon.
\]
The proof of the lemma is technical, so we defer it to Section 8.5.1. The key application of the
lemma—which we shall see presently—is that (ε, δ)-differentially private algorithms compose in
elegant ways.
Proposition 8.2.12. Let M be an (ε, α)-Rényi private mechanism, where α ∈ (1, ∞). Then for any neighboring Pn, Pn′, Pn^{(0)} ∈ Pn, we have
\[
  \mathbb{E}_0\left[ \left( \frac{\pi(P_n \mid Z)}{\pi(P_n' \mid Z)} \right)^{\alpha - 1} \right]^{\frac{1}{\alpha - 1}} \le e^{\varepsilon} \, \frac{\pi(P_n)}{\pi(P_n')},
\]
where E0 denotes expectation taken over Z = M(Pn^{(0)}).
Proposition 8.2.12 communicates a similar message to our previous results in this vein: even if
we get information from the output of the private mechanism on some sample x0 ∈ X n near the
samples (datasets) of interest x, x′ that an adversary wishes to distinguish, it is impossible to update
beliefs by much. The parameter α then controls the degree of difficulty of this “impossible” claim,
which one can see by (for example) applying a Chebyshev-type bound to the posterior ratio and
prior ratios.
We now turn to the promised proofs of Propositions 8.2.11 and 8.2.12. To prove the former, we
require a definition.
Definition 8.6. Distributions P and Q on a space X are (ε, δ)-close if for all measurable A,
\[
  P(A) \le e^\varepsilon Q(A) + \delta \quad \text{and} \quad Q(A) \le e^\varepsilon P(A) + \delta.
\]
Letting p and q denote their densities (with respect to any shared base measure), they are (ε, δ)-pointwise close if the set B := {x : e^{−ε} q(x) ≤ p(x) ≤ e^ε q(x)} satisfies P(B^c) ≤ δ and Q(B^c) ≤ δ.
In particular, if P and Q are (ε, δ)-close, then for any β > 0 the sets A_+ := {x : p(x) > e^{(1+β)ε} q(x)} and A_− := {x : q(x) > e^{(1+β)ε} p(x)} satisfy
\[
  \max\{P(A_+), Q(A_-)\} \le \frac{e^{\beta\varepsilon} \delta}{e^{\beta\varepsilon} - 1}, \qquad
  \max\{P(A_-), Q(A_+)\} \le \frac{e^{-\varepsilon} \delta}{e^{\beta\varepsilon} - 1}.
\]
Conversely, if P and Q are (ε, δ)-pointwise close, then
so that Q(A) ≤ e^{−(1+β)ε} δ/(1 − e^{−βε}) = e^{−ε} δ/(e^{βε} − 1). The set A_− satisfies the symmetric properties.
For the converse result, let B = {x : e^{−ε} q(x) ≤ p(x) ≤ e^ε q(x)}. Then for any set A we have
Theorem 8.3.1. Let the conditions above hold, εi < ∞ for i = 1, . . . , n, and α ∈ [1, ∞]. Assume that conditional on z_1^{i−1}, we have Dα(Pi(· | z_1^{i−1})||Qi(· | z_1^{i−1})) ≤ εi. Then
\[
  D_\alpha(P \| Q) \le \sum_{i=1}^n \varepsilon_i.
\]
Proof We assume without loss of generality that the conditional distributions Pi(· | z_1^{i−1}) and Qi(· | z_1^{i−1}) are absolutely continuous with respect to a base measure µ on Z.¹ Then we have
\begin{align*}
D_\alpha(P \| Q)
&= \frac{1}{\alpha - 1} \log \int \prod_{i=1}^n \left( \frac{p_i(z_i \mid z_1^{i-1})}{q_i(z_i \mid z_1^{i-1})} \right)^\alpha q_i(z_i \mid z_1^{i-1}) \, d\mu^n(z_1^n) \\
&= \frac{1}{\alpha - 1} \log \int_{\mathcal{Z}^{n-1}} \left[ \int \left( \frac{p_n(z_n \mid z_1^{n-1})}{q_n(z_n \mid z_1^{n-1})} \right)^\alpha q_n(z_n \mid z_1^{n-1}) \, d\mu(z_n) \right] \prod_{i=1}^{n-1} \left( \frac{p_i}{q_i} \right)^\alpha q_i \, d\mu^{n-1} \\
&\le \frac{1}{\alpha - 1} \log \left[ \exp((\alpha - 1)\varepsilon_n) \int_{\mathcal{Z}^{n-1}} \prod_{i=1}^{n-1} \left( \frac{p_i(z_i \mid z_1^{i-1})}{q_i(z_i \mid z_1^{i-1})} \right)^\alpha q_i(z_i \mid z_1^{i-1}) \, d\mu^{n-1}(z_1^{n-1}) \right] \\
&= \varepsilon_n + D_\alpha\big( P_1^{n-1} \| Q_1^{n-1} \big),
\end{align*}
and induction over n gives the result.
¹This is no loss of generality, as the general definition of f-divergences is as suprema over finite partitions, or quantizations, of the underlying spaces separately, as in our discussion of KL-divergence in Chapter 2.2.2. Thus we may assume Z is discrete and µ is a counting measure.
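In the non-adaptive case of product distributions, the bound of Theorem 8.3.1 in fact holds with equality, since exp((α − 1)Dα) multiplies across independent coordinates; a quick numerical check (example distributions of our own):

```python
import math

def renyi(p, q, alpha):
    # Renyi-alpha divergence between discrete distributions p and q
    return math.log(sum(qi * (pi / qi) ** alpha for pi, qi in zip(p, q))) / (alpha - 1)

def product(p1, p2):
    # joint pmf of two independent coordinates
    return [a * b for a in p1 for b in p2]

alpha = 2.5
p1, q1 = [0.6, 0.4], [0.5, 0.5]
p2, q2 = [0.1, 0.9], [0.2, 0.8]
lhs = renyi(product(p1, p2), product(q1, q2), alpha)
rhs = renyi(p1, q1, alpha) + renyi(p2, q2, alpha)
assert abs(lhs - rhs) < 1e-12  # additivity over independent coordinates
```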
i. Adversary chooses arbitrary space X, n ∈ N, and two datasets x^{(0)}, x^{(1)} ∈ X^n with dham(x^{(0)}, x^{(1)}) ≤ 1.
Figure 8.1. The privacy game. In this game, the adversary may not directly observe the private b ∈ {0, 1}.
By considering a special case centered around a particular individual in the game 8.1, we can gain
some intuition for the definition. Indeed, suppose that an individual has some data x0 ; in each
round of the game the adversary generates two datasets, one containing x0 and the other identical
except that x0 is removed. Then satisfying Definition 8.7 captures the intuition that an individual’s
privacy remains protected, even in the face of multiple (private) accesses of the individual’s data.
As an immediate corollary to Theorem 8.3.1, we then have the following.
Corollary 8.3.2. Assume that each channel in the game in Fig. 8.1 is (εi, α)-Rényi private. Then the arbitrary composition of k such channels remains (Σ_{i=1}^k εi, α)-Rényi private.
More sophisticated corollaries are possible once we start to use the connections between privacy
measures we outline in Section 8.2.2. In this case, we can develop so-called advanced composition
rules, which sometimes suggest that privacy degrades more slowly than might be expected under
adaptive composition.
Corollary 8.3.3. Assume that each channel in the game in Fig. 8.1 is ε-differentially private. Then the composition of k such channels is kε-differentially private. Additionally, the composition of k such channels is
\[
  \left( \frac{3k}{2} \varepsilon^2 + \sqrt{6k \log \frac{1}{\delta}} \cdot \varepsilon, \; \delta \right)
\]
differentially private for all δ > 0.
Proof The first claim is immediate: for Q^{(0)}, Q^{(1)} as in Definition 8.7, we know that Dα(Q^{(0)}||Q^{(1)}) ≤ kε for all α ∈ [1, ∞] by Theorem 8.3.1 coupled with Proposition 8.2.5 (or Corollary 8.2.6).
For the second claim, we require a bit more work. Here, we use the bound (3α/2)ε² in the Rényi privacy bound in Corollary 8.2.6. Then we have for any α ≥ 1 that
\[
  D_\alpha\big( Q^{(0)} \| Q^{(1)} \big) \le \frac{3k\alpha}{2} \varepsilon^2
\]
by Theorem 8.3.1. Now we apply Proposition 8.2.7 and Corollary 8.2.8, which allow us to conclude (ε, δ)-differential privacy from Rényi privacy. Indeed, by the preceding display, setting α = 1 + η, we have that the composition is ((3k/2)ε² + (3kη/2)ε² + (1/η) log(1/δ), δ)-differentially private for all η > 0 and δ > 0. Optimizing over η gives the second result.
We note in passing that it is possible to get slightly sharper results than those in Corollary 8.3.3; indeed, using ideas from Exercise 4.4 it is possible to achieve (kε(e^ε − 1) + √(2k log(1/δ)) ε, δ)-differential privacy under adaptive composition.
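Numerically, the advanced composition bound of Corollary 8.3.3 improves markedly on kε in the many-queries, small-ε regime; a small sketch with illustrative values:

```python
import math

def basic_composition(k, eps):
    # naive composition: k channels at eps each cost k * eps
    return k * eps

def advanced_composition(k, eps, delta):
    # (3k/2) eps^2 + sqrt(6 k log(1/delta)) * eps, as in Corollary 8.3.3
    return 1.5 * k * eps ** 2 + math.sqrt(6 * k * math.log(1 / delta)) * eps

k, eps, delta = 1000, 0.05, 1e-6
assert advanced_composition(k, eps, delta) < basic_composition(k, eps)
```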
A more sophisticated result, which shows adaptive composition for (ε, δ)-differentially private
channels, is also possible using Lemma 8.2.10.
Theorem 8.3.4. Assume that each channel in the game in Fig. 8.1 is (ε, δ)-differentially private. Then the composition of k such channels is (kε, kδ)-differentially private. Additionally, the composition is
\[
  \left( \frac{3k}{2} \varepsilon^2 + \sqrt{6k \log \frac{1}{\delta_0}} \cdot \varepsilon, \; \delta_0 + \frac{k\delta}{1 + e^\varepsilon} \right)
\]
differentially private for all δ0 > 0.
Proof Consider the channels Qi in Fig. 8.1. As each satisfies D_∞^δ(Qi(· | x^{(0)})||Qi(· | x^{(1)})) ≤ ε and D_∞^δ(Qi(· | x^{(1)})||Qi(· | x^{(0)})) ≤ ε, Lemma 8.2.10 guarantees the existence (at each sequential step, which may depend on the preceding i − 1 outputs) of probability measures Qi^{(0)} and Qi^{(1)} such that D_∞(Qi^{(1−b)}||Qi^{(b)}) ≤ ε and ∥Qi^{(b)} − Qi(· | x^{(b)})∥TV ≤ δ/(1 + e^ε) for b ∈ {0, 1}.
Note that by construction (and Theorem 8.3.1) we have Dα(Q1^{(b)} · · · Qk^{(b)} || Q1^{(1−b)} · · · Qk^{(1−b)}) ≤ min{(3kα/2)ε², kε}, where Q^{(b)} denotes the joint distribution on Z1, . . . , Zk under bit b. We also have by the triangle inequality that ∥Q1^{(b)} · · · Qk^{(b)} − Q^{(b)}∥TV ≤ kδ/(1 + e^ε) for b ∈ {0, 1}. (See Exercise 2.16.) As a consequence, we see as in the proof of Corollary 8.3.3 that the composition is ((3k/2)ε² + (3kη/2)ε² + (1/η) log(1/δ0), δ0 + kδ/(1 + e^ε))-differentially private for all η > 0 and δ0 > 0. Optimizing gives the result.
As a consequence of these results, we see that whenever the privacy parameter ε < 1, it is possible to compose multiple privacy mechanisms together and have privacy penalty scaling only as the worse of √k ε and kε², which is substantially better than the "naive" bound of kε. Of course, a challenge here, relatively infrequently discussed in the privacy literature, is that when ε ≥ 1, which is a frequent case for practical deployments of privacy, all of these bounds are much worse than the naive bound that k-fold composition of ε-differentially private algorithms is kε-differentially private.
in R+ and µ is a finite measure, making the last assumption trivial.) That is, the exponential mechanism M releases Z = M(Pn) with probability proportional to
\[
  \exp\left( -\frac{\varepsilon}{L} \ell(P_n, z) \right).
\]
That the mechanism (8.4.1) is 2ε-differentially private is immediate: for any neighboring Pn , Pn′ ,
we have
\begin{align*}
\frac{Q(A \mid P_n)}{Q(A \mid P_n')}
&= \frac{\int \exp(-\frac{\varepsilon}{L} \ell(P_n', z)) d\mu(z)}{\int \exp(-\frac{\varepsilon}{L} \ell(P_n, z)) d\mu(z)}
 \cdot \frac{\int_A \exp(-\frac{\varepsilon}{L} \ell(P_n, z)) d\mu(z)}{\int_A \exp(-\frac{\varepsilon}{L} \ell(P_n', z)) d\mu(z)} \\
&\le \sup_{z \in \mathcal{Z}} \exp\left( \frac{\varepsilon}{L} [\ell(P_n, z) - \ell(P_n', z)] \right)
 \cdot \sup_{z \in A} \exp\left( \frac{\varepsilon}{L} [\ell(P_n', z) - \ell(P_n, z)] \right) \le \exp(2\varepsilon).
\end{align*}
As a first (somewhat trivial) example, we can recover the Laplace mechanism:
Example 8.4.1 (The Laplace mechanism): We can recover Example 8.1.3 through the exponential mechanism. Indeed, suppose that we wish to release f : Pn → R^d, where GS1(f) ≤ L. Then taking z ∈ R^d, ℓ(Pn, z) = ∥f(Pn) − z∥1, and µ to be the usual Lebesgue measure on R^d, the exponential mechanism simply uses density
\[
  q(z \mid P_n) \propto \exp\left( -\frac{\varepsilon}{L} \left\| f(P_n) - z \right\|_1 \right),
\]
which is the Laplace mechanism. 3
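For a finite output space the exponential mechanism is a simple softmax over losses, and the 2ε bound above can be checked directly; a sketch with arbitrary illustrative losses (pointwise sensitivity at most L = 1):

```python
import math

def exponential_mechanism_probs(losses, eps, L):
    # release z with probability proportional to exp(-(eps / L) * loss(z))
    w = [math.exp(-(eps / L) * l) for l in losses]
    tot = sum(w)
    return [wi / tot for wi in w]

eps, L = 0.5, 1.0
losses = [0.0, 0.3, 0.9, 1.0]       # losses on a 4-point output space
losses_nb = [0.4, 0.0, 1.0, 0.6]    # a "neighboring" loss, differing by at most L
p = exponential_mechanism_probs(losses, eps, L)
p_nb = exponential_mechanism_probs(losses_nb, eps, L)
for a, b in zip(p, p_nb):
    # output probabilities differ by a factor of at most e^{2 eps}
    assert a <= math.exp(2 * eps) * b and b <= math.exp(2 * eps) * a
```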
One challenge with the exponential mechanism (8.4.1) is that it is somewhat abstract and is
often hard to compute, as it requires evaluating an often high-dimensional integral to sample from.
Yet it provides a nice abstract mechanism with strong privacy guarantees and, as we shall see, good
utility guarantees. For the moment, we defer further examples and provide utility guarantees when
µ(Z) is finite, giving bounds based on the measure of “bad” solutions. For notational convenience,
we define the optimal value
\[
  \ell^\star(P_n) := \inf_{z \in \mathcal{Z}} \ell(P_n, z),
\]
assuming tacitly that it is finite, and the sublevel sets
\[
  S_t := \left\{ z \in \mathcal{Z} \mid \ell(P_n, z) \le \ell^\star(P_n) + t \right\}.
\]
Proposition 8.4.2. Let Z be drawn from the exponential mechanism above. Then for any t ≥ 0,
\[
  \ell(P_n, Z) \le \ell^\star(P_n) + 2t
\]
with probability at least 1 − exp(−(εt/L) + log(µ(Z)/µ(St))).
Proof Assume without loss of generality (by scaling) that the global Lipschitzian (sensitivity) constant of ℓ is L = 1. Then for Z ∼ Q(· | Pn), we have
\begin{align*}
\mathbb{P}(\ell(P_n, Z) \ge \ell^\star(P_n) + 2t)
&= \frac{\int_{S_{2t}^c} \exp(-\varepsilon \ell(P_n, z)) d\mu(z)}{\int \exp(-\varepsilon \ell(P_n, z)) d\mu(z)}
 = \frac{\int_{S_{2t}^c} \exp(-\varepsilon(\ell(P_n, z) - \ell^\star(P_n))) d\mu(z)}{\int \exp(-\varepsilon(\ell(P_n, z) - \ell^\star(P_n))) d\mu(z)} \\
&\le \frac{\int_{S_{2t}^c} \exp(-2\varepsilon t) d\mu(z)}{\int_{S_t} \exp(-\varepsilon(\ell(P_n, z) - \ell^\star(P_n))) d\mu(z)}
 \le \exp(-\varepsilon t) \frac{\mu(S_{2t}^c)}{\mu(S_t)} \le \exp(-\varepsilon t) \frac{\mu(\mathcal{Z})}{\mu(S_t)},
\end{align*}
which gives the result.
We can provide a few simplifications of this result in different special cases. For example, if Z is finite with cardinality card(Z), then taking µ to be the counting measure on Z, Proposition 8.4.2 implies the following corollary.
Corollary 8.4.3. In addition to the conditions in Proposition 8.4.2, assume that card(Z) is finite. Then for any u ∈ (0, 1), with probability at least 1 − u,
\[
  \ell(P_n, Z) \le \ell^\star(P_n) + \frac{2L}{\varepsilon} \log \frac{\mathrm{card}(\mathcal{Z})}{u}.
\]
That is, with extremely high probability, the loss of Z from the exponential mechanism is at most
logarithmic in card(Z) and grows only linearly with the global sensitivity L.
A second corollary allows us to bound the expected loss of the exponential mechanism, assuming
we have some control over the measure of the sublevel sets St .
Corollary 8.4.4. Let t ≥ 0 be the smallest scalar such that t ≥ (2L/ε) log(µ(Z)/µ(St)) and t ≥ L/ε. Then Z drawn from the exponential mechanism satisfies
\[
  \mathbb{E}[\ell(P_n, Z)] \le \ell^\star(P_n) + t + \frac{2L}{\varepsilon} \le \ell^\star(P_n) + 3t \le \ell^\star(P_n) + O(1) \frac{L}{\varepsilon} \log\left( 1 + \frac{\mu(\mathcal{Z})}{\mu(S_t)} \right).
\]
Proof Let t0 ≥ 0 and set ρ := log(µ(Z)/µ(S_{t0/2})), so that log(µ(Z)/µ(St)) ≤ ρ for all t ≥ t0/2. Then
\begin{align*}
\mathbb{E}[\ell(P_n, Z) - \ell^\star(P_n)]
&\le t_0 + \int_{t_0}^\infty \mathbb{P}(\ell(P_n, Z) - \ell^\star(P_n) \ge t) \, dt
 = t_0 + 2 \int_{t_0/2}^\infty \mathbb{P}(\ell(P_n, Z) - \ell^\star(P_n) \ge 2t) \, dt \\
&\le t_0 + 2 \int_{t_0/2}^\infty \exp\left( -\frac{\varepsilon t}{L} + \log \frac{\mu(\mathcal{Z})}{\mu(S_t)} \right) dt
 \le t_0 + 2 e^\rho \int_{t_0/2}^\infty \exp\left( -\frac{\varepsilon t}{L} \right) dt
 = t_0 + \frac{2L}{\varepsilon} \exp\left( \rho - \frac{\varepsilon t_0}{2L} \right).
\end{align*}
Taking t0 = t as in the statement of the corollary gives the result.
Corollary 8.4.4 may seem a bit circular: we require the ratio µ(Z)/µ(St ) to be controlled—but
it is relatively straightforward to use it (and Proposition 8.4.2) with a bit of care and standard
bounds on volumes.
Example 8.4.5 (Empirical risk minimization via the exponential mechanism): We consider the empirical risk minimization problem, where we have losses ℓ : Θ × X → R+, where Θ ⊂ R^d is a parameter space of interest, and we wish to choose
\[
  \hat{\theta}_n \in \mathop{\rm argmin}_{\theta \in \Theta} \left\{ L(\theta, P_n) := \frac{1}{n} \sum_{i=1}^n \ell(\theta, x_i) \right\},
\]
where Pn = (1/n) Σ_{i=1}^n 1_{xi}. We make a few standard assumptions: first, for simplicity, that n is large enough that nε ≥ d. We also assume that Θ ⊂ R^d is an ℓ2-ball of radius R, that θ ↦ ℓ(θ, xi) is M-Lipschitz for all xi, and that ℓ(θ, xi) ∈ [0, 2MR] for all θ ∈ Θ. (Note that this last is no loss of generality, as ℓ(θ, xi) − inf_{θ∈Θ} ℓ(θ, xi) ≤ M sup_{θ,θ′∈Θ} ∥θ − θ′∥2 ≤ 2MR.)
Take the empirical loss L(θ, Pn) as our criterion function for the exponential mechanism, which evidently satisfies |L(θ, Pn) − L(θ, Pn′)| ≤ 2MR/n whenever dham(Pn, Pn′) ≤ 1, so that we release θ with density
\[
  q(\theta \mid x) \propto \exp\left( -\frac{n\varepsilon}{2MR} L(\theta, P_n) \right).
\]
Let θ̂n be the empirical minimizer as above; then by the Lipschitz continuity of ℓ, the sublevel set St evidently satisfies
\[
  S_t \supset \left\{ \theta \in \Theta \mid \| \theta - \hat{\theta}_n \|_2 \le \frac{t}{M} \right\}.
\]
Then a volume calculation (with the factor of 2 necessary because we may have θ̂n on the boundary of Θ) yields that for µ the Lebesgue measure,
\[
  \frac{\mu(S_t)}{\mu(\mathcal{Z})} \ge \left( \frac{t}{2MR} \right)^d.
\]
As a consequence, by Corollary 8.4.4, whenever t ≥ O(1) (MR/(nε)) · d log(MR/t), we have E[L(θ, Pn) | Pn] ≤ L(θ̂n, Pn) + 3t. The choice t = O(1) MRd/(nε) suffices whenever d/(nε) ≤ 1, so we obtain
\[
  \mathbb{E}[L(\theta, P_n)] \le L(\hat{\theta}_n, P_n) + O(1) \frac{MRd}{n\varepsilon} \log \frac{n\varepsilon}{d},
\]
whenever d/(nε) ≤ 1. Notably, standard empirical risk minimization (recall Chapter 5.2) typically achieves rates of convergence roughly of MR/√n, so that the gap of the exponential mechanism is lower order whenever d/(√n ε) ≤ 1. 3
which measures the amount that changing k observations in Pn can change the function f. In the privacy literature, the particular choice k = 1 yields the local sensitivity LS(f, Pn) := ωf(1, Pn), suggesting the mechanism
\[
  Z = f(P_n) + \frac{LS(f, P_n)}{\varepsilon} \cdot W \quad \text{for } W \sim \mathrm{Laplace}(1),
\]
which is analogous to the Laplace mechanism (8.1.3), except that the noise scales with the local sensitivity of f at Pn. The issue, as the next example makes clear, is that the scale of this noise can compromise privacy.
Example 8.4.6 (The sensitivity of the sensitivity): Consider estimating a median f(Pn) = med(Pn), where the data x ∈ [0, 1] and n = 2m + 1 for simplicity, to make the median unique. If the sample consists of m points xi = 0 and m + 1 points xi = 1, then the sensitivity ωf(1, Pn) = 1, the maximal value: we simply move one example from xi = 1 to xi = 0, changing the median from med(Pn) = 1 to 0. On the other hand, on the sample Pn′ with m − 1 points xi = 0 and m + 2 points xi = 1, the sensitivity ωf(1, Pn′) = 0, because changing a single example cannot move the median from f(Pn′) = 1. 3
Instead of using the inherently unstable quantity ω, then, we can instead use, essentially, its inverse: define the inverse sensitivity
\[
  d_f(t, P_n) := \inf_{P_n'} \left\{ d_{\rm ham}(P_n, P_n') \mid f(P_n') = t \right\},
\]
where df(t, Pn) = +∞ if no Pn′ yields f(Pn′) = t. So df(t, Pn) counts the number of examples that must be changed in the sample Pn to move f(Pn) to a target t, and by inspection, always satisfies
\[
  |d_f(t, P_n) - d_f(t, P_n')| \le d_{\rm ham}(P_n, P_n'),
\]
so that it is 1-Lipschitz with respect to the Hamming metric.
Then the inverse sensitivity mechanism releases a value t with probability density proportional to
\[
  q(t \mid P_n) \propto \exp\left( -\frac{\varepsilon}{2} d_f(t, P_n) \right). \tag{8.4.4}
\]
Implicit in the definition (8.4.4) is a base measure µ, typically one of Lebesgue measure or counting
measure on a discrete set. Then a quick calculation (or recognition that the density (8.4.4) is a
particular instance of the exponential mechanism) gives the following proposition.
Proposition 8.4.7. Let M be the inverse sensitivity mechanism with density (8.4.4). Then M is
ε-differentially private.
As in the general exponential mechanism (8.4.1), efficiently sampling from the density (8.4.4)
can be challenging. Some cases admit easier reformulations.
Example 8.4.8 (Mean estimation with bounded data): Suppose the data x ∈ [a, b] are bounded and we wish to estimate the sample mean f(Pn) = EPn[X] = x̄n, where Pn = (1/n) Σ_{i=1}^n 1_{xi}. Changing a single observation can move the mean by at most (b − a)/n (replace xi = a with x′i = b). Thus, while discretization issues, and the fact that we may have xi ∉ {a, b}, make precisely computing df tedious, the approximation
\[
  d_{\rm mean}(t, P_n) = \left\lceil \frac{n |t - \bar{x}_n|}{b - a} \right\rceil,
\]
where we define dmean(t, Pn) = +∞ for t ∉ [a, b], is both Lipschitz (with respect to the Hamming metric) in the sample Pn and approximates df(t, Pn). (See Exercise 8.8 for a more general approach justifying this particular approximation.) The approximation then yields the mechanism with density
\[
  q(t \mid P_n) \propto \exp\left( -\frac{\varepsilon}{2} d_{\rm mean}(t, P_n) \right). \tag{8.4.5}
\]
The density (8.4.5) yields a particular step-like density. Define the shells
\[
  S_k = \left( \left[ \bar{x}_n - k \tfrac{b-a}{n}, \, \bar{x}_n - (k-1) \tfrac{b-a}{n} \right] \cup \left[ \bar{x}_n + (k-1) \tfrac{b-a}{n}, \, \bar{x}_n + k \tfrac{b-a}{n} \right] \right) \cap [a, b]
\]
corresponding to the amount the mean may change if we modify k examples, and let Vol(Sk) be the volume (length) of the intervals making up Sk. To sample from the density (8.4.5), note that the denominator C(Pn) := ∫_a^b exp(−(ε/2) dmean(s, Pn)) ds = Σ_{k=1}^n Vol(Sk) e^{−kε/2}. Then we draw an index I ∈ [n] with probability P(I = k) = Vol(Sk) e^{−εk/2}/C(Pn), and then choose t uniformly at random within S_I. 3
Example 8.4.9 (Median estimation): For the median, the inverse sensitivity takes a particularly clean form, making sampling from the density (8.4.4) fairly straightforward. In this case, for a sample Pn = (1/n) Σ_{i=1}^n 1_{xi}, where xi ∈ R, the inverse sensitivity df(t, Pn) is the number of examples between the median f(Pn) and the putative target t. If the data lie in a range x ∈ [a, b], then the density q is relatively straightforward to compute. Similar to the approach to the stepped density in Example 8.4.8, divide [a, b] into the intervals
\[
  S_k^- := [a_k^-, a_{k-1}^-] \quad \text{and} \quad S_k^+ := [a_{k-1}^+, a_k^+], \qquad k = 1, \ldots, n/2,
\]
where
\[
  a_k^- = \inf\left\{ f(P_n') \mid d_{\rm ham}(P_n, P_n') \le k \right\} \quad \text{and} \quad a_k^+ = \sup\left\{ f(P_n') \mid d_{\rm ham}(P_n, P_n') \le k \right\}.
\]
That is, a_k^- is the smallest we can make the median by changing k examples and a_k^+ the largest, corresponding to the 1/2 − k/n and 1/2 + k/n quantiles of the sample Pn, where the 0 quantile is a and the 1 quantile is b. Then defining the normalization constant
\[
  C(P_n) := \int_a^b \exp\left( -\frac{\varepsilon}{2} d_f(t, P_n) \right) dt = \sum_{k} \mathrm{Vol}(S_k^- \cup S_k^+) \exp\left( -\frac{\varepsilon}{2} k \right)
\]
(where the volume is simply interval length), we may sample from the density (8.4.4) by first drawing a random index I with probability
\[
  \mathbb{P}(I = k \mid P_n) = \frac{\mathrm{Vol}(S_k^- \cup S_k^+)}{C(P_n)} \exp\left( -\frac{\varepsilon}{2} k \right),
\]
then drawing t uniformly at random in one of the intervals S_I^- or S_I^+ with probabilities Vol(S_I^-)/Vol(S_I^- ∪ S_I^+) or Vol(S_I^+)/Vol(S_I^- ∪ S_I^+), respectively. 3
The particular sampling strategies, where we construct concentric shells Sk around f(Pn) and sample from these with geometrically decaying probabilities e^{−kε/2}, point toward more general sampling strategies and optimality guarantees for the inverse sensitivity mechanism. Define the "shells"
\[
  S_k := \left\{ f(P_n') \mid d_{\rm ham}(P_n, P_n') = k \right\}.
\]
We focus on sampling from the density (8.4.4) in the case t ∈ R, so sampling is equivalent to drawing an index I ∈ [n] with probability
\[
  \mathbb{P}(I = k \mid P_n) = \frac{1}{C(P_n)} \mathrm{Vol}(S_k) e^{-\frac{\varepsilon}{2} k}
  \quad \text{for} \quad C(P_n) := \sum_{k=1}^n \mathrm{Vol}(S_k) e^{-\frac{\varepsilon}{2} k}, \tag{8.4.6}
\]
then choosing t uniformly at random in S_I.
Define the shorthand ω(k) = ωf(k, Pn). Then the values t ∈ Sk all satisfy |f(Pn) − t| ≤ ω(k), and so the inverse sensitivity mechanism M guarantees
\[
  \mathbb{E}[|M(P_n) - f(P_n)|] \le \sum_{k=1}^n \mathbb{P}(M(P_n) \in S_k) \, \omega(k).
\]
Now our calculations become heuristic, where we make an effort to give the rough flavor of results possible, and later apply the care necessary for tighter guarantees. Suppose that the interval lengths Vol(Sk) are of the same order for k ≲ 1/ε, and grow only polynomially quickly for k ≫ 1/ε. Then we have the heuristic bound C(Pn) := Σ_{k=1}^n Vol(Sk) e^{−kε/2} ≳ Vol(S1) Σ_{k=1}^n e^{−kε/2} ≳ ε^{−1} Vol(S1), while
\[
  \mathbb{E}[|M(P_n) - f(P_n)|] \le \sum_{k=1}^n \frac{\mathrm{Vol}(S_k) e^{-k\varepsilon/2}}{\sum_{i=1}^n \mathrm{Vol}(S_i) e^{-i\varepsilon/2}} \, \omega(k)
  \stackrel{\rm heuristic}{\lesssim} \sum_{k=1}^n \varepsilon e^{-k\varepsilon/2} \omega(k)
  \lesssim \max_k e^{-k\varepsilon/2} \omega(k),
\]
where the heuristic inequality is our bound on the normalizing constant C(Pn), and the final bound follows because maxima are larger than (weighted) averages. Continuing the heuristic derivation, the final maximum places exponentially small weight on ω(k) for k ≫ 1/ε. Thus (and again, this is non-rigorous) we expect roughly that
\[
  \mathbb{E}[|M(P_n) - f(P_n)|] \stackrel{\rm heuristic}{\lesssim} \max_k e^{-k\varepsilon/2} \omega(k)
  \stackrel{\rm heuristic}{\lesssim} \omega_f\left( \frac{c}{\varepsilon}, P_n \right), \tag{8.4.7}
\]
where c is some numerical constant.
To gain some intuition for the claims of optimality we have made, let us revisit the equivalent
definitions of privacy that repose on testing, as in Eq. (8.1.4) and Proposition 8.1.6. By the
definition of differential privacy, the inverse sensitivity mechanism satisfies

    P(M(Pn) ∈ A) ≤ e^{kε} P(M(Pn′) ∈ A)

for any samples Pn, Pn′ satisfying dham(Pn, Pn′) ≤ k. So for k ≤ 1/ε, we have

    P(M(Pn) ∈ A) ≤ exp(1) P(M(Pn′) ∈ A),

and so no procedure exists that can test whether the sample is Pn or Pn′ with probability of error less
than e^{−2}, by Proposition 8.1.6. Thus, at a fundamental level, no procedure can reliably distinguish
the outputs of M(Pn) from those of M(Pn′) when Pn and Pn′ differ in only 1/ε examples. We
therefore cannot expect to estimate f(Pn) to accuracy better than ωf(1/ε, Pn), and so for any ε-differentially
private mechanism M and Pn, there exists Pn′ ∈ Pn with dham(Pn, Pn′) ≤ 1/ε and for which

    max_{P̂ ∈ {Pn, Pn′}} E[|M(P̂) − f(P̂)|] ≳ ωf(1/ε, Pn),    (8.4.8)
which the heuristic calculation (8.4.7) achieves.
To provide more rigorous guarantees requires restrictions on the functions f whose values we
wish to release. The simplest is that the function f : Pn → R obey a natural ordering property,
where larger changes in the sample distribution Pn beget larger changes in f .
The mean and median (Examples 8.4.8 and 8.4.9) are both sample monotone. So, too, are appropriately
continuous functions f. For this, we make the obvious identification of f : Pn → R with
the induced function on X^n by defining fX(x1^n) := f(n^{−1} Σ_{i=1}^n 1_{xi}). Then we say f : Pn → R is
continuous if the induced function fX is.
Proof   Identify f with its induced function fX for notational simplicity, and let x ∈ X^n,
f(x) ≤ s ≤ t, and Pn = n^{−1} Σ_{i=1}^n 1_{xi} be the empirical distribution associated with x. We
show that df(s, Pn) ≤ df(t, Pn). If df(t, Pn) = +∞, then the desired inequality holds trivially.
Otherwise, let x′ ∈ X^n satisfy f(x′) = t and dham(x, x′) = df(t, Pn). Then the function
g(λ) := f((1 − λ)x + λx′) is continuous in λ and satisfies g(0) = f(x) ≤ g(1) = f(x′) = t. By the
intermediate value theorem, there exists λs ∈ [0, 1] with g(λs) = s, and as X is convex the vector
xs = (1 − λs)x + λs x′ ∈ X^n satisfies f(xs) = g(λs) = s. That xs is a convex combination of x and
x′ then implies df(s, Pn) ≤ dham(x, xs) ≤ dham(x, x′) = df(t, Pn).
With Definition 8.8 in place, we can provide a few stronger guarantees for the inverse sensitivity
mechanism. To avoid pathological sampling issues, one replaces the inverse sensitivity df (t, Pn ) with
a “smoothed” version, where for ρ ≥ 0 we define
(Pathological cases include estimating the median where the sample Pn consists of a single point re-
peated n times, which would make the density (8.4.4) uniform.) Then instead of the density (8.4.4),
we define the continuous inverse sensitivity mechanism Mcont to have density
While the parameter ρ adds complexity, setting it to be very small (say, ρ = 1/n²) is a reasonable
practical default.
The continuous inverse sensitivity enjoys fairly strong error guarantees, as the next two propo-
sitions demonstrate, providing two prototypical results. (Exercises 8.11 and 8.12 show how to prove
the propositions.) The first proposition shows that the inverse sensitivity mechanism is essentially
never worse than the Laplace mechanism (8.1.3) when ε ≲ 1.
Proposition 8.4.11. Let f be sample monotone (Definition 8.8) and have finite global sensitivity
GS(f) < ∞. Then taking ρ = 0,

    E[|Mcont(Pn) − f(Pn)|] ≤ GS(f) / (1 − e^{−ε/2}).
As Example 8.1.3 shows, the standard Laplace mechanism M has error

    E[|M(Pn) − f(Pn)|] = GS(f)/ε,
the same scaling Proposition 8.4.11 guarantees, because 1 − e^{−ε/2} = ε/2 + O(ε²).
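For intuition, a minimal sketch of the Laplace mechanism (8.1.3) follows; since a Laplace(b) variate has mean absolute value b, choosing b = GS(f)/ε reproduces the GS(f)/ε error above. The function name and interface are illustrative, not from the text.

```python
import random

def laplace_mechanism(value, global_sensitivity, eps, rng):
    """The Laplace mechanism (8.1.3): release value plus Laplace(b)
    noise with b = GS(f)/eps. A Laplace(b) variate is the difference
    of two independent Exponential(1/b) variates."""
    b = global_sensitivity / eps
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return value + noise
```

Averaging |M(Pn) − f(Pn)| over many runs recovers GS(f)/ε empirically.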
For the next proposition, which provides a more nuanced guarantee, we require local sensitivities
for samples Pn′ near Pn, and so we define the largest local sensitivity within Hamming distance K
of the sample Pn by

    L(K) := sup_{Pn′ ∈ Pn} {LS(f, Pn′) | dham(Pn, Pn′) ≤ K} = sup_{Pn′ ∈ Pn} {ωf(1, Pn′) | dham(Pn, Pn′) ≤ K},
where we recall the definition (8.4.2) of the local sensitivity of f . Then we have the following.
Proposition 8.4.12. Let f be sample monotone (Definition 8.8) and have finite global sensitivity
GS(f) < ∞. Then for any ρ ≥ 0 and Kn = ⌈4 log(2n GS(f)/ρ)/ε⌉,

    E[|Mcont(Pn) − f(Pn)|] ≤ 2ρ + L(Kn) / (1 − e^{−ε/2}).
Unpacking Proposition 8.4.12 a bit, let us make the default substitution ρ = 1/n². Then because
1 − e^{−ε/2} = ε/2 + O(ε²), for ε ≲ 1 this yields

    E[|Mcont(Pn) − f(Pn)|] ≲ (1/ε) sup_{Pn′ ∈ Pn} {LS(f, Pn′) | dham(Pn′, Pn) ≤ Kn} + 1/n²,

where Kn = ⌈(4 log GS(f) + 12 log n)/ε⌉ ≲ (1/ε) log n for large sample sizes n. Comparing this to the sketched
lower bound (8.4.8), these quantities are of the same order whenever the moduli of continuity
ωf(k; Pn) are roughly additive and comparable near Pn, so that for k ≲ 1/ε there is a chain
Pn^{(1)}, Pn^{(2)}, ..., Pn^{(k)} with dham(Pn^{(i)}, Pn^{(i+1)}) = 1 and ωf(k; Pn) ≳ Σ_{i=1}^k LS(f, Pn^{(i)}) and LS(f, Pn^{(i)}) ≍
LS(f, Pn′) for Pn′ satisfying dham(Pn, Pn′) ≲ (log n)/ε. Under these conditions—which often require care
to check, but which hold, for example, for mean estimation—we then obtain

    E[|Mcont(Pn) − f(Pn)|] ≲ ωf(1/ε, Pn) + 1/n².
In particular, when x ∈ T, we may take the density r so that p(x) ≤ r(x) ≤ q(x), as
by the inequalities (8.5.1), and so that R(X) = 1. With this, we evidently have r(x) ≤ e^ε q(x) by
construction, and because S ⊂ T^c, we have
by assumption.
Now, we turn to the second statement of the lemma. We start with the easy direction, where
we assume that P0 and Q0 satisfy D∞(P0||Q0) ≤ ε and D∞(Q0||P0) ≤ ε as well as ∥P − P0∥TV ≤ δ/(1 + e^ε)
and ∥Q − Q0∥TV ≤ δ/(1 + e^ε). Then for any set S we have

    P(S) ≤ P0(S) + δ/(1 + e^ε) ≤ e^ε Q0(S) + δ/(1 + e^ε) ≤ e^ε Q(S) + e^ε δ/(1 + e^ε) + δ/(1 + e^ε) = e^ε Q(S) + δ,

or D∞^δ(P||Q) ≤ ε. The other direction is similar.
We now consider the converse direction, where we have both D∞^δ(P||Q) ≤ ε and D∞^δ(Q||P) ≤ ε. Let
us construct P0 and Q0 as in the statement of the lemma. Define the sets
and similarly

    α′ := Q(S′) − Q1(S′) = (Q(S′) − e^ε P(S′)) / (1 + e^ε) ≤ δ/(1 + e^ε).
Note also that we have P (S) − P1 (S) = Q1 (S) − Q(S) and Q(S ′ ) − Q1 (S ′ ) = P1 (S ′ ) − P (S ′ ) by
construction.
We assume w.l.o.g. that α ≥ α′, so that β := α − α′ ≥ 0 satisfies β ≤ δ/(1 + e^ε), and we have the
sandwiching, where

    P0(S ∪ T) = P1(S ∪ T) + β   and   Q0(S ∪ T) = Q1(S ∪ T) − β.    (8.5.2)
With these choices, we evidently obtain Q0 (X ) = P0 (X ) = 1 and that D∞ (P0 ||Q0 ) ≤ ε and
D∞ (Q0 ||P0 ) ≤ ε by construction. It remains to consider the variation distances. As p0 = p on T ′ ,
we have
    ∥P − P0∥TV = (1/2)∫_S |p − p0| + (1/2)∫_{S′} |p − p0| + (1/2)∫_T |p − p0|
               = (1/2)(P(S) − P0(S)) + (1/2)(P0(S′) − P(S′)) + (1/2)(P0(T) − P(T))
               ≤ (1/2)(P(S) − P1(S)) + (1/2)(P0(S′) − P(S′)) + (1/2)(P0(T) − P(T)),

where P(S) − P1(S) = α and P0(S′) − P(S′) = α′, and the P0(T) − P(T) ≤ β claim follows because
p1(x) = p(x) on T and, by the construction yielding equality (8.5.2), we have P0(T) − P(T) = P0(T) − P1(T) = β + P1(S) −
P0(S) ≤ β. In particular, we have ∥P − P0∥TV ≤ (α + α′)/2 + β/2 = α ≤ δ/(1 + e^ε). The argument that
∥Q − Q0∥TV ≤ δ/(1 + e^ε) is similar.
8.6 Bibliography
Given the broad focus of this book, our treatment of privacy is necessarily somewhat brief, and
there is substantial depth to the subject that we do not cover.
The initial development of randomized response began with Warner [190], who proposed ran-
domized response in survey sampling as a way to collect sensitive data. This elegant idea remained
in use for many years, and a generalization to data release mechanisms with bounded likelihood
ratios—essentially, the local differential privacy definition 8.2—is due to Evfimievski et al. [86] in
2003 in the databases community. Dwork, McSherry, Nissim, and Smith [80] and the subsequent
work of Dwork et al. [79] defined differential privacy and its (ε, δ)-approximate relaxation. A small
industry of research has built out of these papers, with numerous extensions and developments.
The exponential mechanism is due to McSherry and Talwar [148].
The book of Dwork and Roth [78] surveys much of the field, from the perspective of computer
science, as of 2014. Lemma 8.2.10 is due to Dwork et al. [81], and our proof is based on theirs.
8.7 Exercises
Exercise 8.1: Prove Proposition 8.2.1.
Exercise 8.2: Prove Proposition 8.4.7.
Exercise 8.3 (Laplace mechanisms versus randomized response): In this question, you will
investigate locally private estimation of a mean using Laplace and randomized response mechanisms,
as in Examples 8.1.3 and 8.1.1–8.1.2, and compare the two approaches.
We consider the following scenario: we have data Xi ∈ [0, 1], drawn i.i.d., and wish to estimate
the mean E[X] under local ε-differential privacy.
(a) The Laplace mechanism simply sets Zi = Xi + Wi for Wi ∼iid Laplace(b) for some b. What choice
of b guarantees ε-local differential privacy?
(c) A randomized response mechanism for this case is the following: first, we randomly round Xi
to {0, 1}, by setting

    X̃i = 1 with probability Xi, and X̃i = 0 otherwise.

Conditional on X̃i = x, we then set

    Zi = x with probability e^ε/(1 + e^ε), and Zi = 1 − x with probability 1/(1 + e^ε).

What is E[Zi]?
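A short simulation of this two-stage scheme can be used to estimate E[Zi] empirically before deriving it analytically; the helper below is hypothetical and assumes Xi ∈ [0, 1].

```python
import math
import random

def randomized_response(x, eps, rng):
    """Two-stage locally private release of x in [0, 1]: round x to
    {0, 1} with P(1) = x, then report the rounded bit truthfully with
    probability e^eps / (1 + e^eps), and flip it otherwise."""
    x_round = 1 if rng.random() < x else 0
    truthful = rng.random() < math.exp(eps) / (1.0 + math.exp(eps))
    return x_round if truthful else 1 - x_round
```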
(d) For the randomized response Zi above, give constants a and b so that aZi − b is unbiased
for E[X], that is, E[aZi − b] = E[X]. Let θ̂n = (1/n) Σ_{i=1}^n (aZi − b) be your mean estimator.
What is E[(θ̂n − E[X])²]? Does this converge to the mean-squared error of the sample mean
E[(X̄n − E[X])²] = Var(X)/n as ε ↑ ∞?
(e) Now, it is time to compare the simple randomized response estimator from part (d) with the
Laplace mechanism from part (b). For each of the following distributions, generate samples
of size N = 10, 100, 1000, 10000, and then for T = 25 tests, compute the two estimators, both
with ε = 1. Then plot the mean-squared error and confidence intervals for each of the two
methods as well as the sample mean without any privacy.
Do you prefer the Laplace or randomized response mechanism? In one sentence, why?
Exercise 8.4 (A more sophisticated randomized response scheme): Let us consider a more
sophisticated randomized response scheme than that in Exercise 8.3. Define quantized values

    b0 = 0, b1 = 1/k, ..., b_{k−1} = (k−1)/k, bk = 1.    (8.7.1)

Now consider a randomized response estimator that, when X ∈ [bj, bj+1], first rounds X randomly
to X̃ ∈ {bj, bj+1} so that E[X̃ | X] = X. Conditional on X̃ = bj, we then set

    Z = j with probability e^ε/(k + e^ε), and Z = Uniform({0, ..., k} \ {j}) with probability k/(k + e^ε).
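A sketch of this k-ary scheme, with unbiased randomized rounding to the grid bj = j/k and reporting the index Z, might look as follows (the function name and the index-valued return are assumptions for illustration).

```python
import math
import random

def kary_randomized_response(x, k, eps, rng):
    """Exercise 8.4's scheme (illustrative sketch): round x in [0, 1]
    to a grid index j in {0, ..., k} with E[j/k | x] = x, then report
    j with probability e^eps / (k + e^eps), and otherwise report a
    uniformly random other index."""
    # Unbiased randomized rounding to the grid b_j = j / k.
    j = int(math.floor(x * k))
    if j < k and rng.random() < x * k - j:
        j += 1
    if rng.random() < math.exp(eps) / (k + math.exp(eps)):
        return j
    other = rng.randrange(k)          # uniform over the k indices != j
    return other if other < j else other + 1
```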
(c) For any given ε > 0, give (approximately) the k in the choice of the number of bins (8.7.1) that
optimizes your bound, and (approximately) evaluate E[(θ̂n − E[X])²] with your choice of k. As
ε ↑ ∞, does this converge to Var(X)/n?
Exercise 8.5 (Subsampling via divergence measures (Balle et al. [14])): The hockey stick
divergence functional, defined for α ≥ 1, is ϕα(t) = [1 − αt]+. It is straightforward to relate this to
(ε, δ)-differential privacy via Definition 8.6: two distributions P and Q are (ε, δ)-close if and only
if their ϕ_{e^ε}-divergences are less than δ, i.e., if and only if
(In your answer to this question, feel free to use Dα(P||Q) as a shorthand for Dϕα(P||Q).)
(a) Let P0 , P1 , Q1 be any three distributions, and for some q ∈ [0, 1] and α ≥ 1, define P =
(1 − q)P0 + qP1 and Q = (1 − q)P0 + qQ1 . Let α′ = 1 + q(α − 1) = (1 − q) + qα and
θ = α′ /α ≤ 1. Show that
(b) Let ε > 0 and define ε(q) = log(1 + q(eε − 1)). Show that
Exercise 8.6 (Subsampling and privacy amplification (Balle et al. [14])): Consider the follow-
ing subsampling approach to privacy. Assume that we have a private (randomized) algorithm,
represented by A, that acts on samples of size m and guarantees (ε, δ)-differential privacy. The
subsampling mechanism is then defined as follows: given a sample X1n of size n > m, choose a
subsample Xsub of size m uniformly at random from X1n , and then release Z = A(Xsub ).
(a) Use the results of parts (a) and (b) in Exercise 8.5 to show that Z is (ε(q), δq)-differentially
private, where q = m/n and ε(q) = log(1 + q(eε − 1)).
(b) Show that if ε ≤ 1, then Z is ((e − 1)qε, qδ)-differentially private, and if ε ≤ 1/2, then Z is
(2(√e − 1)qε, qδ)-differentially private. Hint: Argue that for any T > 0, one has e^t − 1 ≤
(e^T − 1) t/T for all t ∈ [0, T].
Exercise 8.7 (Concentration and privacy composition): In this question, we give an alternative
to the privacy composition approaches we exploit in Section 8.3.2. Consider an identical scenario to
that in Fig. 8.1, and begin by assuming that each channel Qi is ε-differentially private with density
qi, and let Q^{(b)} be shorthand for Q(· | x^{(b)}). Define the log-likelihood ratio

    L^{(b)}(Z1^k) := Σ_{i=1}^k log [ qi^{(b)}(Zi) / qi^{(1−b)}(Zi) ].
(a) Let P, Q be any two distributions satisfying D∞(P||Q) ≤ ε and D∞(Q||P) ≤ ε, i.e., such that
log(P(A)/Q(A)) ∈ [−ε, ε] for all sets A. Show that
(b) Let Q^{(b)} denote the joint distribution of Z1, ..., Zk when bit b holds in the privacy game in
Fig. 8.1. Show that

    E_b[L^{(b)}(Z1^k)] ≤ kε(e^ε − 1),

where E_b denotes expectation under Q^{(b)}, and that for all t ≥ 0,

    Q^{(b)}(L^{(b)}(Z1^k) ≥ kε(e^ε − 1) + t) ≤ exp(−t²/(2kε²)).

Conclude that for any δ ∈ (0, 1), with probability at least 1 − δ over Z1^k ∼ Q^{(b)},

    L^{(b)}(Z1^k) ≤ kε(e^ε − 1) + √(2k log(1/δ)) · ε.
(c) Argue that for any (measurable) set A,

    Q^{(b)}(Z1^k ∈ A) ≤ e^{ε(k,δ)} · Q^{(1−b)}(Z1^k ∈ A) + δ

for all δ ∈ [0, 1], where ε(k, δ) = kε(e^ε − 1) + √(2k log(1/δ)) · ε.
(d) Conclude the following tighter variant of Corollary 8.3.3: if each channel in Fig. 8.1 is ε-
differentially private, then the composition of k such channels is

    (kε(e^ε − 1) + √(2k log(1/δ)) · ε,  δ)

differentially private for all δ > 0.
As an aside, a completely similar derivation yields the following tighter analogue of Theorem 8.3.4:
if each channel is (ε, δ)-differentially private, then their composition is

    (kε(e^ε − 1) + √(2k log(1/δ0)) · ε,  δ0 + kδ/(1 + e^ε))

differentially private for all δ0 > 0.
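Numerically, the composition guarantee from part (d) can be compared with basic composition (which gives kε): for small per-channel ε the advanced bound grows like √k rather than k. A minimal sketch, with illustrative parameter values:

```python
import math

def advanced_composition_eps(k, eps, delta):
    """epsilon(k, delta) from part (d): composing k eps-differentially
    private channels gives (epsilon(k, delta), delta)-DP."""
    return (k * eps * (math.exp(eps) - 1.0)
            + math.sqrt(2.0 * k * math.log(1.0 / delta)) * eps)

# Illustrative values: 100 channels at eps = 0.05 each.
k, eps, delta = 100, 0.05, 1e-6
basic = k * eps                                    # basic composition
advanced = advanced_composition_eps(k, eps, delta)
```

Here basic composition gives ε = 5, while the bound above gives roughly 2.9, and the gap widens as k grows.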
Exercise 8.8 (One-dimensional minimization with inverse sensitivity): Consider the private
minimization of the one-dimensional loss ℓ(θ, x) (for θ ∈ Θ ⊂ R), where we wish to estimate

    θ̂(Pn) ∈ argmin_θ {Pn ℓ(θ, X) := (1/n) Σ_{i=1}^n ℓ(θ, Xi)},

where we recall the notation from Chapters 4 and 6. Assume that the loss ℓ is convex and differentiable
in θ, and that it satisfies the following Lipschitz-type guarantee: there exist constants 0 < L0 ≤ L1 < ∞
such that for all θ ∈ Θ, the set {ℓ′(θ, x)}_{x∈X} is an interval. (That is, the set of potential derivatives ℓ′(θ, x)
as x varies includes [−L0, L0], is convex, and |ℓ′(θ, x)| ≤ L1 for all θ ∈ Θ, x ∈ X.)
(a) Let the loss ℓ be the Huber loss ℓ(θ, x) = hu(θ − x) for some fixed u > 0, where

    hu(t) = t²/(2u) if |t| ≤ u, and hu(t) = |t| − u/2 if |t| ≥ u.
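A direct transcription of hu and its derivative (the linear branch is |t| − u/2, which makes hu continuously differentiable at |t| = u) is useful for checking the Lipschitz-type conditions above:

```python
def huber_loss(t, u):
    """h_u(t): quadratic t^2/(2u) for |t| <= u, linear |t| - u/2 for
    |t| >= u; the two branches agree (value u/2) at |t| = u."""
    return t * t / (2.0 * u) if abs(t) <= u else abs(t) - u / 2.0

def huber_derivative(t, u):
    """h_u'(t) = t/u on the quadratic piece and sign(t) on the linear
    pieces, so |h_u'(t)| <= 1 everywhere."""
    return t / u if abs(t) <= u else (1.0 if t > 0 else -1.0)
```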
(b) Let the loss ℓ be the absolute value ℓ(θ, x) = |θ − x|, where we abuse notation to call
{ℓ′(θ, x)}_{x=θ} = [−1, 1] (the subdifferential). When X = R, show that ℓ satisfies the containment
(8.7.2) with L0 = L1 = 1.
(c) Let dθ̂ be the inverse sensitivity (8.4.3) for the minimizer θ̂(Pn), which is the solution (in θ) to
Pn ℓ′(θ, X) = 0. Assuming inequality (8.7.2) holds, show that
(a) Implement the Laplace mechanism (8.1.3) for this problem. Fix n = 200 and repeat the
following experiment 50 times. For ε = .1, .5, 1, 2, generate a sample x1^n ∈ X^n (from whatever
distribution you like), then estimate the mean x̄n using the Laplace mechanism. Give a table of the mean
squared errors (x̄n − M(x1^n))².
(b) Implement the inverse sensitivity mechanism using the approximation in Example 8.4.8. Repeat
the experiment in part (a).
Exercise 8.10 (Estimating medians with the inverse sensitivity mechanism): The data at https:
//stats311.stanford.edu/data/salaries.txt contains approximately 250,000 salaries from the
University of California Schools between 2011 and 2014. Assuming that the maximum salary is 3·10⁶
and the minimum is 0 (so the data x ∈ [0, 3·10⁶]), implement the inverse sensitivity mechanism for the
median as in Example 8.4.9. Repeat the following 20 times: for each of ε = .0625, .125, .25, .5, 1, 2,
estimate the median using the inverse sensitivity mechanism with ε-differential privacy. Compute
the mean absolute errors across the 20 experiments for each ε.
Exercise 8.11 (Shells and accuracy in inverse sensitivity): Let f : Pn → R be sample monotone
(Def. 8.8) and ρ ≥ 0. Let M = Mcont be the continuous inverse sensitivity mechanism with
density (8.4.9). Define the upper and lower shells
Sk+ = {t > f (Pn ) | df,ρ (t, Pn ) = k} and Sk− = {t < f (Pn ) | df,ρ (t, Pn ) = k} ,
and the upper and lower moduli of continuity (values in the shells Sk± )
(b) Bound P(M (Pn ) ∈ Sk+ ) and P(M (Pn ) ∈ Sk− ), and using these bounds demonstrate that
Exercise 8.12 (Accuracy of the inverse sensitivity mechanism): In this question, we prove
Propositions 8.4.11 and 8.4.12. Let the conditions and notation of Exercise 8.11 hold. Recall the
definition

    L(K) := sup_{Pn′ ∈ Pn} {LS(f, Pn′) | dham(Pn, Pn′) ≤ K}.
(a) Use Exercise 8.11.(b) and (c) to show that for any K ∈ N,

    E[|Mcont(Pn) − f(Pn)|] ≤ ρ + [L(K)/(1 − e^{−ε/2})] · [Σ_{k=1}^K (ω+(k) + ω−(k)) e^{−kε/2}] / [Σ_{k=1}^n (ω+(k) + ω−(k)) e^{−kε/2}]
                             + [GS(f)/ρ] Σ_{k=K+1}^n (ω+(k) + ω−(k)) e^{−kε/2}.
(b) Choose values for ρ and K to show that E[|Mcont(Pn) − f(Pn)|] ≤ GS(f)/(1 − e^{−ε/2}), giving
Proposition 8.4.11.
Exercise 8.13 (Subsampling and Rényi privacy): We would like to estimate the mean E[X] of
X ∼ P , where X ∈ B = {x ∈ Rd | ∥x∥2 ≤ 1}, the ℓ2 -ball in Rd . We investigate the extent to which
subsampling of a dataset can improve privacy by providing some additional anonymity. Consider
the following mechanism for estimating (scaled) multiples of this mean: for a dataset {X1 , . . . , Xn },
we let Si ∈ {0, 1} be i.i.d. Bernoulli(q), that is, E[Si ] = q, and then consider the algorithm
    Z = Σ_{i=1}^n Xi Si + σW,   W ∼ N(0, Id).    (8.7.3)
(a) Let Q(· | X) and Q(· | X′) denote the channels for the mechanism (8.7.3) with data matrices
X = [x1 ··· x_{n−1} x] and X′ = [x1 ··· x_{n−1}] ∈ R^{d×n}. Let Pµ denote the normal distribution
N(µ, σ²I) with mean µ and covariance σ²I on R^d. Show that for any α ∈ (1, ∞),
and
Dα Q(· | X ′ )||Q(· | X) ≤ Dα (P0 ||qPx + (1 − q)P0 ) .
Consider two mechanisms for computing a sample mean X̄n of vectors, where ∥xi∥2 ≤ b for all i.
The first is to repeat the following T times: for t = 1, 2, ..., T,

i. Draw S ∈ {0, 1}^n with Si ∼iid Bernoulli(q)

ii. Set Zt = (1/(nq))(XS + σsub Wt), where Wt ∼iid N(0, I), as in (8.7.3).

Then set Zsub = (1/T) Σ_{t=1}^T Zt. The other mechanism is to simply set ZGauss = X̄n + σGauss W for
W ∼ N(0, I).
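A one-dimensional sketch of the two mechanisms (a simplification of the vector setting above, with hypothetical function names):

```python
import random

def subsampled_gaussian_mean(xs, q, sigma_sub, T, rng):
    """Z_sub: average T repetitions of (subsample-sum + noise)/(n*q),
    a scalar version of the vector mechanism in the text. Each data
    point enters the subsample independently with probability q, so
    the estimate is unbiased for the sample mean."""
    n = len(xs)
    total = 0.0
    for _ in range(T):
        subsample_sum = sum(x for x in xs if rng.random() < q)
        total += (subsample_sum + sigma_sub * rng.gauss(0.0, 1.0)) / (n * q)
    return total / T

def gaussian_mean(xs, sigma_gauss, rng):
    """Z_Gauss: the sample mean plus a single Gaussian perturbation."""
    return sum(xs) / len(xs) + sigma_gauss * rng.gauss(0.0, 1.0)
```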
(c) What level of privacy does Zsub have? That is, Zsub is (ε, 2)-Rényi private (against single
removals (8.7.4)); give a tight upper bound on ε.
(e) Fix ε > 0, and assume that each mechanism Zsub and ZGauss have parameters chosen so that
they are (ε, 2)-Rényi private. Optimize over T, q, n, σsub in the subsampling mechanism and
σGauss in the Gaussian mechanism, and provide the sharpest bound you can on
    E[∥Zsub − X̄n∥₂²]   and   E[∥ZGauss − X̄n∥₂²].
You may assume ∥xi ∥2 = b for all i. (In your derivation, to avoid annoying constants, you
should replace log(1 + t) with its upper bound, log(1 + t) ≤ t, which is fairly sharp for t ≈ 0.)
Part II
i. Minimax lower bounds (both local and global) using Le Cam’s, Fano’s, and Assouad’s methods.
Worked out long example with nonparametric regression.
ii. Strong data processing inequalities, along with some bounds on them (constrained risk inequal-
ities).
Chapter 9

Minimax lower bounds: the Le Cam, Fano, and Assouad methods
Understanding the fundamental limits of estimation and optimization procedures is important for
a multitude of reasons. Indeed, developing bounds on the performance of procedures can give
complementary insights. By exhibiting fundamental limits of performance (perhaps over restricted
classes of estimators), it is possible to guarantee that an algorithm we have developed is optimal, so
that searching for estimators with better statistical performance will have limited returns, though
searching for estimators with better performance in other metrics may be interesting. Moreover,
exhibiting refined lower bounds on the performance of estimators can also suggest avenues for de-
veloping alternative, new optimal estimators; lower bounds need not be a fully pessimistic exercise.
In this chapter, we define and then discuss techniques for lower-bounding the minimax risk,
giving three standard techniques for deriving minimax lower bounds that have proven fruitful in
a variety of estimation problems [194]. In addition to reviewing these standard techniques—the
Le Cam, Fano, and Assouad methods—we present a few simplifications and extensions that may
make them more “user friendly.” Finally, the concluding sections of the chapter (Sections 10.1
and 10.2) present extensions of the ideas to nonparametric problems, where the effective number of
parameters to estimate grows with the sample size n; this culminates with an essentially geometric
treatment of information and divergence measures directly relating covering and packing numbers
to estimation.
known variance σ², then θ(P) = E_P[X] uniquely determines distributions in P. In other scenarios,
however, θ does not uniquely determine the distribution: for instance, we may be given a class of
densities P on the unit interval [0, 1], and we wish to estimate θ(P) = ∫₀¹ (p′(t))² dt, where p is the
density of P. Such problems arise, for example, in estimating the uniformity of the distribution
of a species over an area (large θ(P) indicates an irregular distribution). In this case, θ does not
parameterize P, so we take a slightly broader viewpoint of estimating functions of distributions in
these notes.
The space Θ in which the parameter θ(P ) takes values depends on the underlying statistical
problem; as an example, if the goal is to estimate the univariate mean θ(P ) = EP [X], we have
Θ ⊂ R. To evaluate the quality of an estimator θ̂, we let ρ : Θ × Θ → R+ denote a (semi)metric
on the space Θ, which we use to measure the error of an estimator for the parameter θ, and let
Φ : R+ → R+ be a non-decreasing function with Φ(0) = 0 (for example, Φ(t) = t2 ).
For a distribution P ∈ P, we assume we receive i.i.d. observations Xi drawn according to some
P, and based on these {Xi}, the goal is to estimate the unknown parameter θ(P) ∈ Θ. For a
given estimator θ̂ (a measurable function θ̂ : X^n → Θ), we assess the quality of the estimate
θ̂(X1, ..., Xn) in terms of the risk

    E_P[Φ(ρ(θ̂(X1, ..., Xn), θ(P)))].
For instance, for a univariate mean problem with ρ(θ, θ′ ) = |θ − θ′ | and Φ(t) = t2 , this risk is the
mean-squared error. As the distribution P is varied, we obtain the risk functional for the problem,
which gives the risk of any estimator θb for the family P.
For any fixed distribution P, there is always a trivial estimator of θ(P): simply return θ(P),
which will have minimal risk. Of course, this “estimator” is unlikely to be good in any real sense,
and it is thus important to consider the risk functional not in a pointwise sense (as a function of
individual P) but to take a more global view. One approach to this is Bayesian: we place a prior
π on the set of possible distributions P, viewing θ(P) as a random variable, and evaluate the risk
of an estimator θ̂ taken in expectation with respect to this prior on P. Another approach, first
suggested by Wald [189], is to choose the estimator θ̂ minimizing the maximum risk

    sup_{P ∈ P} E_P[Φ(ρ(θ̂(X1, ..., Xn), θ(P)))].
An optimal estimator for this metric then gives the minimax risk, which is defined as

    M_n(θ(P), Φ ∘ ρ) := inf_{θ̂} sup_{P ∈ P} E_P[Φ(ρ(θ̂(X1, ..., Xn), θ(P)))],    (9.1.1)

where we take the supremum (worst-case) over distributions P ∈ P, and the infimum is taken over
all estimators θ̂. Here the notation θ(P) indicates that we consider parameters θ(P) for P ∈ P and
distributions in P.
In some scenarios, we study a specialized notion of risk appropriate for optimization problems
(and statistical problems in which all we care about is prediction). In these settings, we assume
there exists some loss function ℓ : Θ × X → R, where for an observation x ∈ X, the value ℓ(θ; x)
measures the instantaneous loss associated with using θ as a predictor. In this case, we define the
risk

    L_P(θ) := E_P[ℓ(θ; X)] = ∫_X ℓ(θ; x) dP(x)    (9.1.2)
as the expected loss of the vector θ. (See, e.g., Chapter 5 of the lectures by Shapiro, Dentcheva,
and Ruszczyński [168], or work on stochastic approximation by Nemirovski et al. [150].)
Here the variables T correspond to the goods transported to and from each location (so Tij is
goods shipped from i to j), and we wish to minimize the cost of our shipping and maximize
the profit. By minimizing the risk (9.1.2) over a set Θ = {θ ∈ R^m_+ : Σ_i θi ≤ b}, we maximize
our expected reward given a budget constraint b on the amount of allocated resources. ◁
where the expectation is taken over Xi and any randomness in the procedure θ̂. This expression
captures the difference between the (expected) risk performance of the procedure θ̂ and the best
possible risk, available if the distribution P were known ahead of time. The minimax excess risk,
defined with respect to the loss ℓ, domain Θ, and family P of distributions, is then defined by the
best possible maximum excess risk,

    M_n(Θ, P, ℓ) := inf_{θ̂} sup_{P ∈ P} { E_P[L_P(θ̂(X1, ..., Xn))] − inf_{θ ∈ Θ} L_P(θ) },    (9.1.3)

where the infimum is taken over all estimators θ̂ : X^n → Θ and the risk L_P is implicitly defined in
terms of the loss ℓ. The techniques for providing lower bounds for the minimax risk (9.1.1) or the
excess risk (9.1.3) are essentially identical; we focus for the remainder of this section on techniques
for providing lower bounds on the minimax risk.
While trivial, this lower bound serves as the departure point for each of the subsequent techniques
for lower bounding the minimax risk.
where the final inequality follows because Φ is non-decreasing. Now, let us define θv = θ(Pv), so
that ρ(θv, θv′) ≥ 2δ for v ≠ v′. By defining the testing function

    Ψ(θ̂) := argmin_{v ∈ V} ρ(θ̂, θv),
Figure 9.1. Example of a 2δ-packing of a set. The estimate θ̂ is contained in at most one of the
δ-balls around the points θv.
breaking ties arbitrarily, we have that ρ(θ̂, θv) < δ implies that Ψ(θ̂) = v because of the triangle
inequality and 2δ-separation of the set {θv}_{v∈V}. Indeed, assume that ρ(θ̂, θv) < δ; then for any
v′ ≠ v, we have

    ρ(θ̂, θv′) ≥ ρ(θv, θv′) − ρ(θ̂, θv) > 2δ − δ = δ.

The test must thus return v as claimed. Equivalently, for v ∈ V, the inequality Ψ(θ̂) ≠ v implies
ρ(θ̂, θv) ≥ δ. (See Figure 9.1.) By averaging over V, we find that

    sup_P P(ρ(θ̂, θ(P)) ≥ δ) ≥ (1/|V|) Σ_{v∈V} P(ρ(θ̂, θ(Pv)) ≥ δ | V = v) ≥ (1/|V|) Σ_{v∈V} P(Ψ(θ̂) ≠ v | V = v).
The remaining challenge is to lower bound the probability of error in the underlying multi-way
hypothesis testing problem, which we do by choosing the separation δ to trade off between the loss
Φ(δ) (large δ increases the loss) and the probability of error (small δ, and hence separation, makes
the hypothesis test harder). Usually, one attempts to choose the largest separation δ that guarantees
a constant probability of error. There are a variety of techniques for this, and we present three:
Le Cam’s method, Fano’s method, and Assouad’s method, including extensions of the latter two
to enhance their applicability. Before continuing, however, we review some inequalities between
divergence measures defined on probabilities, which will be essential for our development, and
concepts related to packing sets (metric entropy, covering numbers, and packing).
of f -divergences (recall Section 2.2.3). We first recall the definitions of the three when applied to
distributions P , Q on a set X , which we assume have densities p, q with respect to a base measure
µ. Then we recall the total variation distance (2.2.6) is
    ∥P − Q∥TV := sup_{A ⊂ X} |P(A) − Q(A)| = (1/2) ∫ |p(x) − q(x)| dµ(x),

which is the f-divergence Df(P||Q) generated by f(t) = (1/2)|t − 1|. The Hellinger distance (2.2.7) is

    d²hel(P, Q) := ∫ (√p(x) − √q(x))² dµ(x),

which is the f-divergence Df(P||Q) generated by f(t) = (√t − 1)². We also recall the Kullback-
Leibler (KL) divergence

    Dkl(P||Q) := ∫ p(x) log(p(x)/q(x)) dµ(x),    (9.2.2)

which is the f-divergence Df(P||Q) generated by f(t) = t log t. As noted in Section 2.2.3, Proposition
2.2.8, these divergences have the following relationships.
Proposition (Proposition 2.2.8, restated). The total variation distance satisfies the following re-
lationships:
We now show how Proposition 2.2.8 is useful, because KL-divergence and Hellinger distance
both are easier to manipulate on product distributions than is total variation. Specifically, consider
the product distributions P = P1 × · · · × Pn and Q = Q1 × · · · × Qn . Then the KL-divergence
satisfies the decoupling equality
    Dkl(P||Q) = Σ_{i=1}^n Dkl(Pi||Qi),    (9.2.3)
while because d²hel(P, Q) = 1 − ∫ √(dP dQ), the Hellinger distance satisfies

    d²hel(P, Q) = 1 − ∫ √(p1(x1) ··· pn(xn) q1(x1) ··· qn(xn)) dµ(x1^n)
                = 1 − ∏_{i=1}^n ∫ √(pi(xi) qi(xi)) dµ(xi)
                = 1 − ∏_{i=1}^n (1 − d²hel(Pi, Qi)).    (9.2.4)
In particular, we see that for product distributions P^n and Q^n, Proposition 2.2.8 implies that

    ∥P^n − Q^n∥²TV ≤ (1/2) Dkl(P^n||Q^n) = (n/2) Dkl(P||Q)

and

    ∥P^n − Q^n∥TV ≤ dhel(P^n, Q^n) ≤ √(2 − 2(1 − dhel(P, Q)²)^n).

As a consequence, if we can guarantee that Dkl(P||Q) ≤ 1/n or dhel(P, Q) ≤ 1/√n, then we
guarantee the strict inequality ∥P^n − Q^n∥TV ≤ 1 − c for a fixed constant c > 0, for any n. We
will see how this type of guarantee can be used to prove minimax lower bounds in the following
sections.
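This tensorization is easy to check numerically for Bernoulli distributions, where the TV distance between products reduces to a Binomial computation; the sketch below (illustrative names) compares the exact product TV distance with the Pinsker bound √(n Dkl(P||Q)/2).

```python
import math

def binom_pmf(n, k, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def tv_product_bernoulli(n, p, q):
    """Exact TV distance between the n-fold products P^n and Q^n of
    Bernoulli(p) and Bernoulli(q): the likelihood ratio depends only
    on the number of successes, so it equals the TV distance between
    Binomial(n, p) and Binomial(n, q)."""
    return 0.5 * sum(abs(binom_pmf(n, k, p) - binom_pmf(n, k, q))
                     for k in range(n + 1))

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

n, p, q = 50, 0.5, 0.55
tv_n = tv_product_bernoulli(n, p, q)
pinsker_n = math.sqrt(n * kl_bernoulli(p, q) / 2)  # sqrt(Dkl(P^n||Q^n)/2)
```

The exact distance always lies below the Pinsker bound, and both stay bounded away from 1 when Dkl(P||Q) ≲ 1/n, illustrating the guarantee above.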
Corollary 9.2.2. Let B_d = {v ∈ R^d | ∥v∥ ≤ 1} be the unit ball for the norm ∥·∥. Then there exists
V ⊂ B_d with |V| ≥ 2^d and ∥v − v′∥ ≥ 1/2 for each v ≠ v′ ∈ V.
Another common packing arises from coding theory, where the technique is to construct well-
separated code-books ({0, 1}-valued bit strings associated to individual symbols to be communi-
cated) for communication. In showing our lower bounds, we show that even if a code-book is
well-separated, it may still be hard to estimate. With that, we now demonstrate that there exist
(exponentially) large packings of the d-dimensional hypercube of points that are O(d)-separated in
the Hamming metric.
Proof   We use the proof of Guntuboyina [106]. Consider a maximal subset V of H_d = {−1, 1}^d
satisfying

    ∥v − v′∥₁ ≥ d/2 for all distinct v, v′ ∈ V.    (9.2.5)
That is, the addition of any vector w ∈ H_d, w ∉ V to V will break the constraint (9.2.5). This
means that if we construct the closed balls B(v, d/2) := {w ∈ H_d : ∥v − w∥₁ ≤ d/2}, we must have

    ⋃_{v∈V} B(v, d/2) = H_d,  so  |V| |B(0, d/2)| = Σ_{v∈V} |B(v, d/2)| ≥ 2^d.    (9.2.6)
We now upper bound the cardinality of B(v, d/2) using the probabilistic method, which will imply
the desired result. Let Si , i = 1, . . . , d, be i.i.d. Bernoulli {0, 1}-valued random variables. Then by
their uniformity, for any v ∈ Hd ,
for any λ > 0, by Markov’s inequality (or the Chernoff bound). Since E[exp(λS1 )] = 12 (1 + eλ ), we
obtain n o
2−d |B(v, d/2)| ≤ inf 2−d (1 + eλ )d exp(−3λd/4)
λ≥0
33d/4
−3d/4 d d 3
|V|3 4 ≥ |V||B(v, d/2)| ≥ 2 , or |V| ≥ d = exp d log 3 − log 2 ≥ exp(d/8),
2 4
as claimed.
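The maximality argument in the proof is constructive: greedily scanning the hypercube and keeping any point whose $\ell_1$-distance to all points kept so far is at least $d/2$ yields a maximal packing, which by the lemma must have at least $\exp(d/8)$ elements. A small-dimensional sketch (Python; names are ours):

```python
import itertools
import math

def greedy_packing(d):
    """Greedily build a maximal V in {-1,1}^d with pairwise l1-distance >= d/2."""
    V = []
    for w in itertools.product((-1, 1), repeat=d):
        if all(sum(abs(wi - vi) for wi, vi in zip(w, v)) >= d / 2 for v in V):
            V.append(w)
    return V

d = 12
V = greedy_packing(d)
# Separation (9.2.5) holds by construction ...
assert all(sum(abs(a - b) for a, b in zip(v, w)) >= d / 2
           for i, v in enumerate(V) for w in V[:i])
# ... and maximality gives the cardinality lower bound exp(d/8).
assert len(V) >= math.exp(d / 8)
```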
Example 9.3.1 (Bernoulli mean estimation): Consider the problem of estimating the mean
$\theta \in [-1, 1]$ of a $\{\pm 1\}$-valued Bernoulli distribution under the squared error loss $(\theta - \hat\theta)^2$, where
$X_i \in \{-1, 1\}$. In this case, by fixing some $\delta > 0$, we set $\mathcal{V} = \{-1, 1\}$, and we define $P_v$ so that
\[
P_v(X = 1) = \frac{1 + v\delta}{2} \quad \text{and} \quad P_v(X = -1) = \frac{1 - v\delta}{2},
\]
whence we see that the mean $\theta(P_v) = \delta v$. Using the metric $\rho(\theta, \theta') = |\theta - \theta'|$ and loss $\Phi(\delta) = \delta^2$,
we have separation $2\delta$ of $\theta(P_{-1})$ and $\theta(P_1)$. Thus, via Le Cam's method (9.3.3), we have that
\[
\mathfrak{M}_n(\mathrm{Bernoulli}([-1,1]), (\cdot)^2) \ge \frac{\delta^2}{2} \left( 1 - \|P_{-1}^n - P_1^n\|_{\rm TV} \right).
\]
We would thus like to upper bound $\|P_{-1}^n - P_1^n\|_{\rm TV}$ as a function of the separation $\delta$ and
sample size $n$; here we use Pinsker's inequality (Proposition 2.2.8(a)) and the tensorization
identity (9.2.3) that makes KL-divergence so useful. Indeed, we have
\[
\|P_{-1}^n - P_1^n\|_{\rm TV}^2 \le \frac{1}{2} D_{\rm kl}(P_{-1}^n \| P_1^n) = \frac{n}{2} D_{\rm kl}(P_{-1} \| P_1) = \frac{n}{2} \, \delta \log \frac{1 + \delta}{1 - \delta}.
\]
Noting that $\delta \log \frac{1+\delta}{1-\delta} \le 3\delta^2$ for $\delta \in [0, 1/2]$, we obtain that $\|P_{-1}^n - P_1^n\|_{\rm TV} \le \delta \sqrt{3n/2}$ for
$\delta \le 1/2$. In particular, we can guarantee a high probability of error in the associated hypothesis
testing problem (recall inequality (9.3.2)) by taking $\delta = 1/\sqrt{6n}$; this guarantees
$\|P_{-1}^n - P_1^n\|_{\rm TV} \le \frac{1}{2}$. We thus have the minimax lower bound
\[
\mathfrak{M}_n(\mathrm{Bernoulli}([-1,1]), (\cdot)^2) \ge \frac{\delta^2}{2} \left( 1 - \frac{1}{2} \right) = \frac{1}{24 n}.
\]
While the factor $1/24$ is smaller than necessary, this bound is optimal to within constant
factors; the sample mean $\frac{1}{n} \sum_{i=1}^n X_i$ achieves mean-squared error $(1 - \theta^2)/n$.

As an alternative proof, we may use the Hellinger distance and its associated decoupling
identity (9.2.4). We sketch the idea, ignoring lower order terms when convenient. In this case,
Proposition 2.2.7 implies
\[
\|P_1^n - P_2^n\|_{\rm TV} \le \sqrt{2} \, d_{\rm hel}(P_1^n, P_2^n) = \sqrt{2 - 2 (1 - d_{\rm hel}(P_1, P_2)^2)^n}.
\]
Noting that
\[
d_{\rm hel}^2(P_1, P_2) = \left( \sqrt{\frac{1+\delta}{2}} - \sqrt{\frac{1-\delta}{2}} \right)^2 = 1 - 2 \sqrt{\frac{1 - \delta^2}{4}} = 1 - \sqrt{1 - \delta^2} \approx \frac{1}{2} \delta^2,
\]
and noting that $(1 - \delta^2/2)^n \approx e^{-n\delta^2/2}$, we have (up to lower order terms in $\delta$) that $\|P_1^n - P_2^n\|_{\rm TV} \le \sqrt{2 - 2 \exp(-\delta^2 n/2)}$. Choosing $\delta^2 = 1/(4n)$, we have $\sqrt{2 - 2 \exp(-\delta^2 n/2)} \le 1/2$, thus giving
the lower bound
\[
\mathfrak{M}_n(\mathrm{Bernoulli}([-1,1]), (\cdot)^2) \text{ ``$\ge$'' } \frac{\delta^2}{2} \left( 1 - \frac{1}{2} \right) = \frac{1}{16 n},
\]
where the quotations indicate we have been fast and loose in the derivation. ◇
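In this example the number of $+1$'s is a sufficient statistic, so the total variation distance between the two product distributions can be computed exactly through Binomial probabilities. The following sketch (Python; names ours) confirms that the choice $\delta = 1/\sqrt{6n}$ indeed keeps $\|P_{-1}^n - P_1^n\|_{\rm TV}$ below $1/2$, as the Pinsker-based bound guarantees:

```python
import math

def tv_binom(n, p, q):
    """Exact TV distance between Bernoulli(p)^n and Bernoulli(q)^n, computed
    through the Binomial sufficient statistic (the number of successes)."""
    return 0.5 * sum(
        math.comb(n, k) * abs(p ** k * (1 - p) ** (n - k) - q ** k * (1 - q) ** (n - k))
        for k in range(n + 1))

for n in (1, 10, 100, 1000):
    delta = 1.0 / math.sqrt(6 * n)
    d = tv_binom(n, (1 + delta) / 2, (1 - delta) / 2)
    assert d <= delta * math.sqrt(1.5 * n) + 1e-12  # Pinsker-based bound delta*sqrt(3n/2)
    assert d <= 0.5                                 # hence TV stays below 1/2
```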
This example shows the “usual” rate of convergence in parametric estimation problems, that is,
that we can estimate a parameter θ at a rate (in squared error) scaling as 1/n. The mean estimator
above is, in some sense, the prototypical example of such regular problems. In some “irregular”
scenarios—including estimating the support of a uniform random variable, which we study in the
homework—faster rates are possible.
We also note in passing that there are substantially more complex versions of Le Cam's method
that can yield sharp results for a wider variety of problems, including some in nonparametric
estimation [134, 194]. For our purposes, the simpler two-point perspective provided in this section
will be sufficient.
JCD Comment: Talk about Euclidean structure with KL space and information geom-
etry a bit here to suggest the KL approach later.
Restating the results in Chapter 2, we also have the following convenient rewriting of Fano's
inequality when $V$ is uniform in $\mathcal{V}$ (recall Corollary 2.3.4):
\[
\mathbb{P}(\hat V \ne V) \ge 1 - \frac{I(V; X) + \log 2}{\log |\mathcal{V}|}, \tag{9.4.2}
\]
so that
\[
\inf_{\Psi} \mathbb{P}(\Psi(X) \ne V) \ge 1 - \frac{I(V; X) + \log 2}{\log |\mathcal{V}|},
\]
where the infimum is taken over all testing procedures Ψ. By combining Corollary 9.4.2 with the
reduction from estimation to testing in Proposition 9.2.1, we obtain the following result.
Proposition 9.4.3. Let $\{\theta(P_v)\}_{v \in \mathcal{V}}$ be a $2\delta$-packing in the $\rho$-semimetric. Assume that $V$ is uniform
on the set $\mathcal{V}$, and conditional on $V = v$, we draw a sample $X \sim P_v$. Then the minimax risk has
lower bound
\[
\mathfrak{M}(\theta(\mathcal{P}); \Phi \circ \rho) \ge \Phi(\delta) \left( 1 - \frac{I(V; X) + \log 2}{\log |\mathcal{V}|} \right).
\]
To gain some intuition for Proposition 9.4.3, we think of the lower bound as a function of the
separation $\delta > 0$. Roughly, as $\delta \downarrow 0$, the separation condition between the distributions $P_v$ is
relaxed and we expect the distributions $P_v$ to be closer to one another. In this case—as will be
made more explicit presently—the hypothesis testing problem of distinguishing the $P_v$ becomes
more challenging, and the information $I(V; X)$ shrinks. Thus, what we roughly attempt to do
is to choose our packing $\theta(P_v)$ as a function of $\delta$, and find the largest $\delta > 0$ making the mutual
information small enough that
\[
\frac{I(V; X) + \log 2}{\log |\mathcal{V}|} \le \frac{1}{2}. \tag{9.4.3}
\]
In this case, the minimax lower bound is at least Φ(δ)/2. We now explore techniques for achieving
such results.
With this definition of the mixture distribution, via algebraic manipulations, we have
\[
I(V; X) = \sum_v \pi(v) \, D_{\rm kl}(P_v \| \bar{P}), \tag{9.4.4}
\]
a representation that plays an important role in our subsequent derivations. To see equality (9.4.4),
let $\mu$ be a base measure over $\mathcal{X}$ (assume w.l.o.g. that $X$ has density $p(\cdot \mid v) = p_v(\cdot)$ conditional on
$V = v$), and note that
\[
I(V; X) = \sum_v \int_{\mathcal{X}} p(x \mid v) \pi(v) \log \frac{p(x \mid v)}{\sum_{v'} p(x \mid v') \pi(v')} \, d\mu(x) = \sum_v \pi(v) \int_{\mathcal{X}} p(x \mid v) \log \frac{p(x \mid v)}{\bar{p}(x)} \, d\mu(x).
\]
Representation (9.4.4) makes it clear that if the distributions of the sample $X$ conditional
on $V$ are all similar, then there is little information content. Returning to the discussion after
Proposition 9.4.3, we have in this uniform setting that
\[
\bar{P} = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P_v \quad \text{and} \quad I(V; X) = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} D_{\rm kl}(P_v \| \bar{P}).
\]
The mutual information is small if the typical conditional distribution $P_v$ is difficult to distinguish—
has small KL-divergence—from $\bar{P}$.
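For finite alphabets, identity (9.4.4) in the uniform case is a finite computation. A quick sketch (Python; the particular conditional distributions are arbitrary choices of ours):

```python
import math

def kl(p, q):
    """KL divergence between two distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Conditional distributions P_v of X over a 3-letter alphabet; V uniform on 3 points.
Pv = [[0.6, 0.3, 0.1],
      [0.2, 0.5, 0.3],
      [0.3, 0.3, 0.4]]
Pbar = [sum(col) / len(Pv) for col in zip(*Pv)]  # mixture (1/|V|) sum_v P_v

# I(V;X) from the joint distribution directly ...
I = sum((p[x] / len(Pv)) * math.log(p[x] / Pbar[x])
        for p in Pv for x in range(3) if p[x] > 0)
# ... agrees with the average KL to the mixture, identity (9.4.4),
avg_kl = sum(kl(p, Pbar) for p in Pv) / len(Pv)
assert abs(I - avg_kl) < 1e-12
# and is bounded by the worst pairwise divergence (convexity of the KL).
assert I <= max(kl(p, q) for p in Pv for q in Pv) + 1e-12
```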
In the local Fano method approach, we construct a local packing. This approach is based on
constructing a family of distributions $P_v$ for $v \in \mathcal{V}$ defining a $2\delta$-packing (recall Section 9.2.1),
meaning that $\rho(\theta(P_v), \theta(P_{v'})) \ge 2\delta$ for all $v \ne v'$, but which additionally satisfy the
uniform upper bound
\[
D_{\rm kl}(P_v \| P_{v'}) \le \kappa^2 \delta^2 \quad \text{for all } v, v' \in \mathcal{V}, \tag{9.4.6}
\]
where $\kappa > 0$ is a fixed problem-dependent constant. If we have the inequality (9.4.6), then so long
as we can find a local packing $\mathcal{V}$ such that
\[
\log |\mathcal{V}| \ge 2 \left( \kappa^2 \delta^2 + \log 2 \right),
\]
we are guaranteed the testing error condition (9.4.3), and hence the minimax lower bound
\[
\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge \frac{1}{2} \Phi(\delta).
\]
The difficulty in this approach is constructing the packing set V that allows δ to be chosen to obtain
sharp lower bounds, and we often require careful choices of the packing sets V. (We will see how
to reduce such difficulties in subsequent sections.)
Constructing local packings As mentioned above, the main difficulty in using Fano’s method
is in the construction of so-called “local” packings. In these problems, the idea is to construct a
packing V of a fixed set (in a vector space, say Rd ) with constant radius and constant distance.
Then we scale elements of the packing by δ > 0, which leaves the cardinality |V| identical, but
allows us to scale δ in the separation in the packing and the uniform divergence bound (9.4.6). In
particular, Lemmas 9.2.3 and 5.1.10 show that we can construct exponentially large packings of
certain sets with balls of a fixed radius.
We now illustrate these techniques via two examples.
Example 9.4.4 (Normal mean estimation): Consider the $d$-dimensional normal location
family $\mathcal{N}_d = \{\mathsf{N}(\theta, \sigma^2 I_{d \times d}) \mid \theta \in \mathbb{R}^d\}$; we wish to estimate the mean $\theta = \theta(P)$ of a given
distribution $P \in \mathcal{N}_d$ in mean-squared error, that is, with loss $\|\hat\theta - \theta\|_2^2$. Let $\mathcal{V}$ be a $1/2$-packing
of the unit $\ell_2$-ball with cardinality at least $2^d$, as guaranteed by Lemma 5.1.10. (We assume
for simplicity that $d \ge 2$.)

Now we construct our local packing. Fix $\delta > 0$, and for each $v \in \mathcal{V}$, set $\theta_v = \delta v \in \mathbb{R}^d$.
Then we have
\[
\|\theta_v - \theta_{v'}\|_2 = \delta \|v - v'\|_2 \ge \frac{\delta}{2}
\]
for each distinct pair $v, v' \in \mathcal{V}$, and moreover, we note that $\|\theta_v - \theta_{v'}\|_2 \le \delta$ for such pairs as
well. By applying the Fano minimax bound of Proposition 9.4.3, we see that (given $n$ normal
observations $X_i \stackrel{\rm iid}{\sim} P$)
\[
\mathfrak{M}_n(\theta(\mathcal{N}_d), \|\cdot\|_2^2) \ge \left( \frac{1}{2} \cdot \frac{\delta}{2} \right)^2 \left( 1 - \frac{I(V; X_1^n) + \log 2}{\log |\mathcal{V}|} \right) \ge \frac{\delta^2}{16} \left( 1 - \frac{I(V; X_1^n) + \log 2}{d \log 2} \right).
\]
Now note that for any pair $v, v'$, if $P_v$ is the normal distribution $\mathsf{N}(\theta_v, \sigma^2 I_{d \times d})$, we have
\[
D_{\rm kl}(P_v^n \| P_{v'}^n) = n \cdot D_{\rm kl}\left( \mathsf{N}(\delta v, \sigma^2 I_{d \times d}) \| \mathsf{N}(\delta v', \sigma^2 I_{d \times d}) \right) = n \cdot \frac{\delta^2}{2\sigma^2} \|v - v'\|_2^2,
\]
as the KL-divergence between two normal distributions with identical covariance is
\[
D_{\rm kl}(\mathsf{N}(\theta_1, \Sigma) \| \mathsf{N}(\theta_2, \Sigma)) = \frac{1}{2} (\theta_1 - \theta_2)^\top \Sigma^{-1} (\theta_1 - \theta_2),
\]
as in Example 2.1.7. As $\|v - v'\|_2 \le 1$, we have the KL-divergence bound (9.4.6) with $\kappa^2 = n/2\sigma^2$.

Combining our derivations, we have the minimax lower bound
\[
\mathfrak{M}_n(\theta(\mathcal{N}_d), \|\cdot\|_2^2) \ge \frac{\delta^2}{16} \left( 1 - \frac{n\delta^2/2\sigma^2 + \log 2}{d \log 2} \right). \tag{9.4.7}
\]
Choosing $\delta^2 = d\sigma^2 \log 2 / (2n)$ then gives
\[
1 - \frac{n\delta^2/2\sigma^2 + \log 2}{d \log 2} = 1 - \frac{1}{d} - \frac{1}{4} \ge \frac{1}{4}
\]
by assumption that $d \ge 2$, and inequality (9.4.7) implies the minimax lower bound
\[
\mathfrak{M}_n(\theta(\mathcal{N}_d), \|\cdot\|_2^2) \ge \frac{d\sigma^2 \log 2}{32 n} \cdot \frac{1}{4} \ge \frac{1}{185} \cdot \frac{d\sigma^2}{n}.
\]
While the constant $1/185$ is not sharp, we do obtain the right scaling in $d$, $n$, and the variance
$\sigma^2$; the sample mean attains the same risk. ◇
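The only analytic ingredient here beyond Fano's inequality is the Gaussian KL formula from Example 2.1.7; in one dimension it is easy to confirm by direct numerical integration. A sketch (Python; names and parameter values are ours):

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def kl_numeric(mu1, mu2, sigma, lo=-30.0, hi=30.0, steps=200000):
    """Midpoint-rule approximation of KL(N(mu1, sigma^2) || N(mu2, sigma^2))."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        p = gauss_pdf(x, mu1, sigma)
        total += p * math.log(p / gauss_pdf(x, mu2, sigma)) * h
    return total

mu1, mu2, sigma = 0.3, -0.9, 1.5
closed_form = (mu1 - mu2) ** 2 / (2 * sigma ** 2)  # the formula from Example 2.1.7
assert abs(kl_numeric(mu1, mu2, sigma) - closed_form) < 1e-6
```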
Example 9.4.5 (Linear regression): In this example, we show how local packings can give
(up to some constant factors) sharp minimax rates for standard linear regression problems. In
particular, for a fixed matrix $X \in \mathbb{R}^{n \times d}$, we observe
\[
Y = X\theta + \varepsilon,
\]
\[
\mathfrak{M}(\theta(\mathcal{P}), \|\cdot\|_2^2) \ge \frac{d\delta^2}{2} \left( 1 - \frac{I(V; Y) + \log 2}{\log |\mathcal{V}|} \right) \ge \frac{d\delta^2}{2} \left( 1 - \frac{2 d \gamma_{\max}^2(X) \delta^2 / \sigma^2 + \log 2}{d/8} \right).
\]
Now, if we choose
\[
\delta^2 = \frac{\sigma^2}{64 \gamma_{\max}^2(X)}, \quad \text{then} \quad 1 - \frac{8 \log 2}{d} - \frac{16 d \gamma_{\max}^2(X) \delta^2}{\sigma^2 d} \ge 1 - \frac{1}{4} - \frac{1}{4} = \frac{1}{2},
\]
and we obtain
\[
\mathfrak{M}(\theta(\mathcal{P}), \|\cdot\|_2^2) \ge \frac{1}{256} \frac{\sigma^2 d}{\gamma_{\max}^2(X)} = \frac{1}{256} \frac{\sigma^2 d}{n} \frac{1}{\gamma_{\max}^2(\frac{1}{\sqrt{n}} X)},
\]
for a convergence rate (roughly) of $\sigma^2 d/n$ after rescaling the singular values of $X$ by $1/\sqrt{n}$.
This bound is sharp in terms of the dimension, dependence on $n$, and the variance $\sigma^2$, but
it does not fully capture the dependence on $X$, as it depends only on the maximum singular
value. Indeed, in this case, an exact calculation (cf. [136]) shows that the minimax value of
the problem is exactly $\sigma^2 \mathrm{tr}((X^\top X)^{-1})$. Letting $\lambda_j(A)$ be the $j$th eigenvalue of a matrix $A$,
we have
\[
\sigma^2 \mathrm{tr}((X^\top X)^{-1}) = \frac{\sigma^2}{n} \mathrm{tr}((n^{-1} X^\top X)^{-1}) = \frac{\sigma^2}{n} \sum_{j=1}^d \frac{1}{\lambda_j(\frac{1}{n} X^\top X)} \ge \frac{\sigma^2 d}{n} \min_j \frac{1}{\lambda_j(\frac{1}{n} X^\top X)} = \frac{\sigma^2 d}{n} \frac{1}{\gamma_{\max}^2(\frac{1}{\sqrt{n}} X)}.
\]
Thus, the local Fano method captures most—but not all—of the difficulty of the problem. ◇
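The final chain of inequalities—that the exact minimax value $\sigma^2 \mathrm{tr}((X^\top X)^{-1})$ dominates the quantity the local Fano method certifies—can be checked on any design whose singular values we control. A sketch (Python) using a design with orthogonal columns, so that $X^\top X$ is diagonal (the numbers are arbitrary choices of ours):

```python
# A design with orthogonal columns: column j of X is s_j * e_j, so
# X^T X = diag(s_1^2, ..., s_d^2) and sigma^2 tr((X^T X)^{-1}) -- the exact
# minimax risk quoted in the example -- is available in closed form.
n, d, sigma2 = 8, 4, 2.0
s = [3.0, 2.0, 1.5, 1.0]  # nonzero singular values of X (arbitrary)

exact = sigma2 * sum(1.0 / sj ** 2 for sj in s)      # sigma^2 tr((X^T X)^{-1})
gamma_max_sq_scaled = max(sj ** 2 for sj in s) / n   # gamma_max(X / sqrt(n))^2
fano_certified = (sigma2 * d / n) / gamma_max_sq_scaled

# The exact risk dominates the (constant-free) Fano quantity, since
# sum_j 1/lambda_j >= d / lambda_max.
assert exact >= fano_certified
```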
\[
h_2(P_t) + P_t \log \frac{|\mathcal{V}| - N_t^{\min}}{N_t^{\max}} + \log N_t^{\max} \ge H(V \mid \hat V). \tag{9.4.9}
\]
Before proving the proposition, which we do in Section 9.6.1, it is informative to note that it
reduces to the standard form of Fano's inequality (9.4.1) in a special case. Suppose that we take
$\rho_{\mathcal{V}}$ to be the 0-1 metric, meaning that $\rho_{\mathcal{V}}(v, v') = 0$ if $v = v'$ and $1$ otherwise. Setting $t = 0$ in
Proposition 9.4.6, we have $P_0 = \mathbb{P}[\hat V \ne V]$ and $N_0^{\min} = N_0^{\max} = 1$, whence inequality (9.4.9) reduces
to inequality (9.4.1). Other weakenings allow somewhat clearer statements (see Section 9.6.2 for a
proof):
Corollary 9.4.7. If $V$ is uniform on $\mathcal{V}$ and $(|\mathcal{V}| - N_t^{\min}) > N_t^{\max}$, then
\[
\mathbb{P}(\rho_{\mathcal{V}}(\hat V, V) > t) \ge 1 - \frac{I(V; X) + \log 2}{\log \frac{|\mathcal{V}|}{N_t^{\max}}}. \tag{9.4.10}
\]
Inequality (9.4.10) is the natural analogue of the classical mutual-information based form of
Fano’s inequality (9.4.2), and it provides a qualitatively similar bound. The main difference is
that the usual cardinality |V| is replaced by the ratio |V|/Ntmax . This quantity serves as a rough
measure of the number of possible “regions” in the space V that are distinguishable—that is, the
number of subsets of V for which ρV (v, v ′ ) > t when v and v ′ belong to different regions. While
this construction is similar in spirit to the usual construction of packing sets in the standard
reduction from testing to estimation (cf. Section 9.2.1), our bound allows us to skip the packing set
construction. We can directly compute I(V ; X) where V takes values over the full space, as opposed
to computing the mutual information I(V ′ ; X) for a random variable V ′ uniformly distributed over
a packing set contained within V. In some cases, the former calculation can be much simpler, as
illustrated in examples and chapters to follow.
We now turn to providing a few consequences of Proposition 9.4.6 and Corollary 9.4.7, showing
how they can be used to derive lower bounds on the minimax risk. Proposition 9.4.6 is a generaliza-
tion of the classical Fano inequality (9.4.1), so it leads naturally to a generalization of the classical
Fano lower bound on minimax risk, which we describe here. This reduction from estimation to
testing is somewhat more general than the classical reductions, since we do not map the original
estimation problem to a strict test, but rather a test that allows errors. Consider as in the standard
reduction of estimation to testing in Section 9.2.1 a family of distributions {Pv }v∈V ⊂ P indexed by
a finite set V. This family induces an associated collection of parameters {θv := θ(Pv )}v∈V . Given
a function ρV : V × V → R and a scalar t, we define the separation δ(t) of this set relative to the
metric ρ on Θ via
\[
\delta(t) := \sup\left\{ \delta \mid \rho(\theta_v, \theta_{v'}) \ge \delta \text{ for all } v, v' \in \mathcal{V} \text{ such that } \rho_{\mathcal{V}}(v, v') > t \right\}. \tag{9.4.11}
\]
As a special case, when t = 0 and ρV is the discrete metric, this definition reduces to that of a
packing set: we are guaranteed that ρ(θv , θv′ ) ≥ δ(0) for all distinct pairs v ̸= v ′ , as in the classical
approach to minimax lower bounds. On the other hand, allowing for t > 0 lends greater flexibility
to the construction, since only certain pairs θv and θv′ are required to be well-separated.
Given a set V and associated separation function (9.4.11), we assume the canonical estimation
setting: nature chooses V ∈ V uniformly at random, and conditioned on this choice V = v, a sample
X is drawn from the distribution Pv . We then have the following corollary of Proposition 9.4.6,
whose argument is completely identical to that for inequality (9.2.1):
Corollary 9.4.8. Given $V$ uniformly distributed over $\mathcal{V}$ with separation function $\delta(t)$, we have
\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho) \ge \Phi\left( \frac{\delta(t)}{2} \right) \left( 1 - \frac{I(X; V) + \log 2}{\log \frac{|\mathcal{V}|}{N_t^{\max}}} \right) \quad \text{for all } t. \tag{9.4.12}
\]
Notably, using the discrete metric ρV (v, v ′ ) = 1 {v ̸= v ′ } and taking t = 0 in the lower bound (9.4.12)
gives the classical Fano lower bound on the minimax risk based on constructing a packing [115, 194,
192]. We now turn to an example illustrating the use of Corollary 9.4.8 in providing a minimax
lower bound on the performance of regression estimators.
Example 9.4.9 (Normal regression model): Consider the $d$-dimensional linear regression
model $Y = X\theta + \varepsilon$, where $\varepsilon \in \mathbb{R}^n$ has i.i.d. $\mathsf{N}(0, \sigma^2)$ entries and $X \in \mathbb{R}^{n \times d}$ is known, but $\theta$ is not. In
this case, our family of distributions is
\[
\mathcal{P}_X := \left\{ Y \sim \mathsf{N}(X\theta, \sigma^2 I_{n \times n}) \mid \theta \in \mathbb{R}^d \right\} = \left\{ Y = X\theta + \varepsilon \mid \varepsilon \sim \mathsf{N}(0, \sigma^2 I_{n \times n}), \, \theta \in \mathbb{R}^d \right\}.
\]
We then obtain the following lower bound on the minimax error in squared $\ell_2$-norm:
there is a universal (numerical) constant $c > 0$ such that
\[
\mathfrak{M}_n(\theta(\mathcal{P}_X), \|\cdot\|_2^2) \ge c \, \frac{\sigma^2 d^2}{\|X\|_{\rm Fr}^2} \ge \frac{c}{\gamma_{\max}(X/\sqrt{n})^2} \cdot \frac{\sigma^2 d}{n}, \tag{9.4.13}
\]
where $\gamma_{\max}$ denotes the maximum singular value. Notably, this inequality is nearly the sharpest
known bound proved via Fano inequality-based methods [47], but our technique is essentially
direct and straightforward.

To see inequality (9.4.13), let the set $\mathcal{V} = \{-1, 1\}^d$ be the $d$-dimensional hypercube, and
define $\theta_v = \delta v$ for some fixed $\delta > 0$. Then letting $\rho_{\mathcal{V}}$ be the Hamming metric on $\mathcal{V}$ and $\rho$
be the usual $\ell_2$-norm, the associated separation function (9.4.11) satisfies $\delta(t) > \max\{\sqrt{t}, 1\} \, \delta$.
Now, for any $t \le \lceil d/3 \rceil$, the neighborhood size satisfies
\[
N_t^{\max} = \sum_{\tau=0}^{t} \binom{d}{\tau} \le 2 \binom{d}{t} \le 2 \left( \frac{de}{t} \right)^t.
\]
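The two binomial-sum inequalities bounding $N_t^{\max}$ hold for every $1 \le t \le \lceil d/3 \rceil$, which is easy to confirm exhaustively in small dimensions (Python sketch; the ranges are our choices):

```python
import math

# Exhaustively check N_t^max = sum_{tau=0}^t C(d,tau) <= 2 C(d,t) <= 2 (d e / t)^t
# for 1 <= t <= ceil(d/3), the regime used in Example 9.4.9.
for d in range(3, 40):
    for t in range(1, math.ceil(d / 3) + 1):
        Nt = sum(math.comb(d, tau) for tau in range(t + 1))
        assert Nt <= 2 * math.comb(d, t)
        assert math.comb(d, t) <= (d * math.e / t) ** t
```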
As a second example, we can revisit the general M-estimation results in Chapter 5.3. The
construction here extends that in Example 9.4.9 to settings where there is no “true” parameter;
we leave working out the details as exercises. (See Exercises 9.10, 9.12, and 9.13, the last of which applies the
former two.)
\[
I(Z_1^n; V) \le \frac{n\delta^2}{2} \mathrm{tr}(\mathrm{Cov}_0(g(Z))) + n \cdot o(\delta^2)
\]
as $\delta \to 0$. For the separation of the induced parameters, let
which is mean-zero and bounded so long as $\ell$ is Lipschitz near $\theta_0$, and satisfies $\mathrm{tr}(\mathrm{Cov}_0(g(Z))) =
\mathrm{tr}(I_d) = d$. We therefore have $I(Z_1^n; V) \lesssim n\delta^2 d$ and $\theta_v - \theta_0 = \delta (H^{-1} \Sigma H^{-1})^{-1/2} v + O(\delta^2)$.
This gives a packing in the metric induced by the matrix $H \Sigma^{-1} H$, so applying Corollary 9.4.7,
there exists a numerical constant $c > 0$ such that for all suitably large $n$ we obtain
\[
\mathbb{P}\left( \|\hat\theta_n - \theta_V\|_{H \Sigma^{-1} H} \ge c \sqrt{d/n} \right) \ge \frac{1}{4}.
\]
Exercises 9.12 and 9.13 work through the details. ◇
That is, we can take the parameter $\theta$ and test the individual indices via $\hat v$.
While Lemma 9.5.2 requires conditions on the loss $\Phi$ and metric $\rho$ for the separation condi-
tion (9.5.1) to hold, it is sometimes easier to apply than Fano's method. Moreover, while we will
not address this in class, several researchers [8, 74] have noted that it appears to allow easier ap-
plication in so-called “interactive” settings—those for which the sampling of the $X_i$ may not be
precisely i.i.d. It is closely related to Le Cam's method, discussed previously, as we see that if we
define $P_{+j} = 2^{1-d} \sum_{v : v_j = 1} P_v$ (and similarly for $-j$), Lemma 9.5.2 is equivalent to
\[
\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge \delta \sum_{j=1}^d \left[ 1 - \|P_{+j} - P_{-j}\|_{\rm TV} \right]. \tag{9.5.2}
\]
There are standard weakenings of the lower bound (9.5.2) (and Lemma 9.5.2). We give one
such weakening. First, we note that the total variation distance is convex, so that if we define $P_{v,+j}$ to be
the distribution $P_v$ where coordinate $j$ takes the value $v_j = 1$ (and similarly for $P_{v,-j}$), we have
\[
P_{+j} = \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} P_{v,+j} \quad \text{and} \quad P_{-j} = \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} P_{v,-j}.
\]
Consequently,
\[
\|P_{+j} - P_{-j}\|_{\rm TV} = \bigg\| \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \left( P_{v,+j} - P_{v,-j} \right) \bigg\|_{\rm TV} \le \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \|P_{v,+j} - P_{v,-j}\|_{\rm TV} \le \max_{v, j} \|P_{v,+j} - P_{v,-j}\|_{\rm TV}.
\]
Then as long as the loss satisfies the per-coordinate separation (9.5.1), we obtain the following:
\[
\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge d\delta \left( 1 - \max_{v, j} \|P_{v,+j} - P_{v,-j}\|_{\rm TV} \right). \tag{9.5.3}
\]
This, the most common version of Assouad's lemma, sometimes controls $\|P_{+j} - P_{-j}\|_{\rm TV}$ too crudely.
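The convexity step behind (9.5.3)—that the TV distance between the mixtures $P_{+j}$ and $P_{-j}$ is at most the maximum over paired distributions differing in coordinate $j$—can be checked directly on a small product-Bernoulli family (Python sketch; the model and names are arbitrary choices of ours):

```python
import itertools

def tv(p, q):
    """Total variation distance between distributions given as dicts."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

def P_of(v, delta=0.3):
    """Product distribution on {0,1}^d with P(bit j = 1) = (1 + delta * v_j)/2."""
    dist = {}
    for x in itertools.product((0, 1), repeat=len(v)):
        prob = 1.0
        for xj, vj in zip(x, v):
            pj = (1 + delta * vj) / 2
            prob *= pj if xj == 1 else 1 - pj
        dist[x] = prob
    return dist

def mix(vs):
    """Uniform mixture of the distributions P_v for v in vs."""
    dists = [P_of(v) for v in vs]
    return {x: sum(dd[x] for dd in dists) / len(dists) for x in dists[0]}

d = 3
cube = list(itertools.product((-1, 1), repeat=d))
for j in range(d):
    plus = [v for v in cube if v[j] == 1]
    minus = [v for v in cube if v[j] == -1]
    mixed_tv = tv(mix(plus), mix(minus))
    # Pair each v with v_j = +1 against the same v with coordinate j flipped.
    pair_max = max(tv(P_of(v), P_of(v[:j] + (-1,) + v[j + 1:])) for v in plus)
    assert mixed_tv <= pair_max + 1e-12  # the convexity step behind (9.5.3)
```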
We also note that by the Cauchy–Schwarz inequality and convexity of the variation distance,
we have
\[
\sum_{j=1}^d \|P_{+j} - P_{-j}\|_{\rm TV} \le \sqrt{d} \left( \sum_{j=1}^d \|P_{+j} - P_{-j}\|_{\rm TV}^2 \right)^{1/2} \le \sqrt{d} \left( \sum_{j=1}^d \frac{1}{2^d} \sum_{v} \|P_{v,+j} - P_{v,-j}\|_{\rm TV}^2 \right)^{1/2},
\]
and consequently we have a not-quite-so-weak version of inequality (9.5.2):
\[
\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge \delta d \left( 1 - \left[ \frac{1}{d} \sum_{j=1}^d \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \|P_{v,+j} - P_{v,-j}\|_{\rm TV}^2 \right]^{1/2} \right). \tag{9.5.4}
\]
Regardless of whether we use the sharper version (9.5.2) or the weakened versions (9.5.3) or (9.5.4),
the technique is essentially the same. We seek a setting of the distributions $P_v$ so that the probability
of making a mistake in the hypothesis test of Lemma 9.5.2 is high enough—say $1/2$—or the variation
distance is small enough—such as $\|P_{+j} - P_{-j}\|_{\rm TV} \le 1/2$ for all $j$. Once this is satisfied, we obtain
a minimax lower bound of the form
\[
\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) \ge \delta \sum_{j=1}^d \left( 1 - \frac{1}{2} \right) = \frac{d\delta}{2}.
\]
Example 9.5.3 (Normal mean estimation): For some $\sigma^2 > 0$ and $d \in \mathbb{N}$, we consider
estimation of the mean parameter for the normal location family
\[
\mathcal{N} := \left\{ \mathsf{N}(\theta, \sigma^2 I_{d \times d}) : \theta \in \mathbb{R}^d \right\}
\]
in squared Euclidean distance. We now show how, for this family, the sharp Assouad's method
implies the lower bound
\[
\mathfrak{M}_n(\theta(\mathcal{N}), \|\cdot\|_2^2) \ge \frac{d\sigma^2}{8n}. \tag{9.5.5}
\]
Up to constant factors, this bound is sharp; the sample mean has mean squared error $d\sigma^2/n$.

We proceed in (essentially) the usual way we have set up. Fix some $\delta > 0$ and define $\theta_v = \delta v$,
taking $P_v = \mathsf{N}(\theta_v, \sigma^2 I_{d \times d})$ to be the normal distribution with mean $\theta_v$. In this case, we see that
the hypercube structure is natural, as our loss function decomposes on coordinates: we have
$\|\theta - \theta_v\|_2^2 \ge \delta^2 \sum_{j=1}^d 1\{\mathrm{sign}(\theta_j) \ne v_j\}$. The family $P_v$ thus induces a $\delta^2$-Hamming separation
for the loss $\|\cdot\|_2^2$, and by Assouad's method (9.5.2), we have
\[
\mathfrak{M}_n(\theta(\mathcal{N}), \|\cdot\|_2^2) \ge \frac{\delta^2}{2} \sum_{j=1}^d \left[ 1 - \|P_{+j}^n - P_{-j}^n\|_{\rm TV} \right],
\]
where $P_{\pm j}^n = 2^{1-d} \sum_{v : v_j = \pm 1} P_v^n$. It remains to provide upper bounds on $\|P_{+j}^n - P_{-j}^n\|_{\rm TV}$. By
the convexity of $\|\cdot\|_{\rm TV}^2$ and Pinsker's inequality, we have
\[
\|P_{+j}^n - P_{-j}^n\|_{\rm TV}^2 \le \max_{d_{\rm ham}(v, v') \le 1} \|P_v^n - P_{v'}^n\|_{\rm TV}^2 \le \frac{1}{2} \max_{d_{\rm ham}(v, v') \le 1} D_{\rm kl}(P_v^n \| P_{v'}^n).
\]
Example 9.5.4 (Logistic regression): In this example, consider the logistic regression model,
where we have known (fixed) regressors $X_i \in \mathbb{R}^d$ and an unknown parameter $\theta \in \mathbb{R}^d$; the goal
is to estimate $\theta$ after observing a sequence of $Y_i \in \{-1, 1\}$, where for $y \in \{-1, 1\}$ we have
\[
P(Y_i = y \mid X_i, \theta) = \frac{1}{1 + \exp(-y X_i^\top \theta)}.
\]
Denote this family by $\mathcal{P}_{\log}$, and for $P \in \mathcal{P}_{\log}$, let $\theta(P)$ be the predictor vector $\theta$. We would
like to estimate the vector $\theta$ in squared $\ell_2$ error. As in Example 9.5.3, if we choose some $\delta > 0$
and for each $v \in \{-1, 1\}^d$ we set $\theta_v = \delta v$, then we have the $\delta^2$-separation in Hamming metric
$\|\theta - \theta_v\|_2^2 \ge \delta^2 \sum_{j=1}^d 1\{\mathrm{sign}(\theta_j) \ne v_j\}$. Let $P_v^n$ denote the distribution of the $n$ independent
observations $Y_i$ when $\theta = \theta_v$. Then we have by Assouad's lemma (and the weakening (9.5.4))
that
\[
\begin{aligned}
\mathfrak{M}_n(\theta(\mathcal{P}_{\log}), \|\cdot\|_2^2) &\ge \frac{\delta^2}{2} \sum_{j=1}^d \left[ 1 - \|P_{+j}^n - P_{-j}^n\|_{\rm TV} \right] \\
&\ge \frac{d\delta^2}{2} \left[ 1 - \left( \frac{1}{d} \sum_{j=1}^d \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \|P_{v,+j}^n - P_{v,-j}^n\|_{\rm TV}^2 \right)^{1/2} \right]. \tag{9.5.6}
\end{aligned}
\]
It remains to bound $\|P_{v,+j}^n - P_{v,-j}^n\|_{\rm TV}^2$ to find our desired lower bound. To that end,
use the shorthand $p_v(x) = 1/(1 + \exp(\delta x^\top v))$, and let $D_{\rm kl}(p \| q)$ be the binary KL-divergence
between $\mathrm{Bernoulli}(p)$ and $\mathrm{Bernoulli}(q)$ distributions. Then Pinsker's inequality (recall Proposi-
tion 2.2.8) implies that for any $v, v'$,
\[
\|P_v^n - P_{v'}^n\|_{\rm TV}^2 \le \frac{1}{4} \left[ D_{\rm kl}(P_v^n \| P_{v'}^n) + D_{\rm kl}(P_{v'}^n \| P_v^n) \right] = \frac{1}{4} \sum_{i=1}^n \left[ D_{\rm kl}(p_v(X_i) \| p_{v'}(X_i)) + D_{\rm kl}(p_{v'}(X_i) \| p_v(X_i)) \right].
\]
Let us upper bound the final KL-divergence. Let $p_a = 1/(1 + e^a)$ and $p_b = 1/(1 + e^b)$. We
claim that
\[
D_{\rm kl}(p_a \| p_b) + D_{\rm kl}(p_b \| p_a) \le (a - b)^2. \tag{9.5.7}
\]
Deferring the proof of claim (9.5.7), we immediately see that
\[
\|P_v^n - P_{v'}^n\|_{\rm TV}^2 \le \frac{\delta^2}{4} \sum_{i=1}^n \left( X_i^\top (v - v') \right)^2.
\]
Now we recall inequality (9.5.6) for motivation, and we see that the preceding display implies
\[
\frac{1}{d} \sum_{j=1}^d \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \|P_{v,+j}^n - P_{v,-j}^n\|_{\rm TV}^2 \le \frac{\delta^2}{4d} \frac{1}{2^d} \sum_{v \in \{-1,1\}^d} \sum_{j=1}^d \sum_{i=1}^n (2 X_{ij})^2 = \frac{\delta^2}{d} \sum_{i=1}^n \sum_{j=1}^d X_{ij}^2.
\]
Replacing the final double sum with $\|X\|_{\rm Fr}^2$, where $X$ is the matrix of the $X_i$, we have
\[
\mathfrak{M}_n(\theta(\mathcal{P}_{\log}), \|\cdot\|_2^2) \ge \frac{d\delta^2}{2} \left[ 1 - \left( \frac{\delta^2}{d} \|X\|_{\rm Fr}^2 \right)^{1/2} \right].
\]
Choosing $\delta^2 = d/(4 \|X\|_{\rm Fr}^2)$ then gives
\[
\mathfrak{M}_n(\theta(\mathcal{P}_{\log}), \|\cdot\|_2^2) \ge \frac{d\delta^2}{4} = \frac{d^2}{16 \|X\|_{\rm Fr}^2} = \frac{d}{16 n} \cdot \frac{1}{\frac{1}{dn} \sum_{i=1}^n \|X_i\|_2^2}.
\]
That is, we have a minimax lower bound scaling roughly as d/n for logistic regression, where
“large” Xi (in ℓ2 -norm) suggest that we may obtain better performance in estimation. This is
intuitive, as a larger Xi gives a better signal to noise ratio.
We return to prove the claim (9.5.7). Indeed, by a straightforward expansion, we have
\[
\begin{aligned}
D_{\rm kl}(p_a \| p_b) + D_{\rm kl}(p_b \| p_a)
&= p_a \log \frac{p_a}{p_b} + (1 - p_a) \log \frac{1 - p_a}{1 - p_b} + p_b \log \frac{p_b}{p_a} + (1 - p_b) \log \frac{1 - p_b}{1 - p_a} \\
&= (p_a - p_b) \log \frac{p_a}{p_b} + (p_b - p_a) \log \frac{1 - p_a}{1 - p_b} = (p_a - p_b) \log \frac{p_a (1 - p_b)}{(1 - p_a) p_b}.
\end{aligned}
\]
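The claim (9.5.7) is also easy to confirm numerically, along with the identity that the expansion above produces for the logistic map (Python sketch; names ours):

```python
import math

def bin_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for a in (-3.0, -0.5, 0.0, 1.2, 4.0):
    for b in (-2.5, 0.0, 0.7, 3.0):
        pa, pb = 1 / (1 + math.exp(a)), 1 / (1 + math.exp(b))
        sym = bin_kl(pa, pb) + bin_kl(pb, pa)
        assert sym <= (a - b) ** 2 + 1e-12  # claim (9.5.7)
        # The expansion collapses to (p_a - p_b)(b - a), because
        # log[p_a (1 - p_b) / ((1 - p_a) p_b)] = b - a for these p's.
        assert abs(sym - (pa - pb) * (b - a)) < 1e-9
```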
where the final term vanishes since $E$ is $(V, \hat V)$-measurable. On the other hand, we also have
using the fact that conditioning reduces entropy. Applying the definition of conditional entropy
yields
and we upper bound each of these terms separately. For the first term, we have
since conditioned on the event E = 0, the random variable V may take values in a set of size at
most |V| − Ntmin . For the second, we have
since conditioned on $E = 1$, or equivalently on the event that $\rho(\hat V, V) \le t$, we are guaranteed that
$V$ belongs to a set of cardinality at most $N_t^{\max}$.
Combining the pieces and noting $\mathbb{P}(E = 0) = P_t$, we have proved that
Combining this inequality with our earlier equality (9.6.1), we see that
\[
H(V \mid X) - \log N_t^{\max} \le H(V \mid \hat V) - \log N_t^{\max} \le \mathbb{P}(\rho(\hat V, V) > t) \log \frac{|\mathcal{V}| - N_t^{\min}}{N_t^{\max}} + \log 2.
\]
Note that this bound holds without any assumptions on the distribution of $V$.
By definition, we have $I(V; X) = H(V) - H(V \mid X)$. When $V$ is uniform on $\mathcal{V}$, we have
$H(V) = \log |\mathcal{V}|$, and hence $H(V \mid X) = \log |\mathcal{V}| - I(V; X)$. Substituting this relation into the
bound (9.6.2) yields the inequality
\[
\mathbb{P}(\rho(\hat V, V) > t) \ge \frac{\log \frac{|\mathcal{V}|}{N_t^{\max}} - I(V; X) - \log 2}{\log \frac{|\mathcal{V}| - N_t^{\min}}{N_t^{\max}}} \ge 1 - \frac{I(V; X) + \log 2}{\log \frac{|\mathcal{V}|}{N_t^{\max}}}.
\]
\[
\Phi(\rho(\theta, \theta(P_v))) \ge 2\delta \sum_{j=1}^d 1\left\{ [\hat v(\theta)]_j \ne v_j \right\}.
\]
as the average is smaller than the maximum of a set, and using the separation assumption (9.5.1).
Recalling the definition of the mixtures $P_{\pm j}$ as the joint distribution of $V$ and $X$ conditional on
$V_j = \pm 1$, we swap the summation orders to see that
\[
\frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P_v\left( [\hat v(\theta)]_j \ne v_j \right) = \frac{1}{|\mathcal{V}|} \sum_{v : v_j = 1} P_v\left( [\hat v(\theta)]_j \ne v_j \right) + \frac{1}{|\mathcal{V}|} \sum_{v : v_j = -1} P_v\left( [\hat v(\theta)]_j \ne v_j \right) = \frac{1}{2} P_{+j}\left( [\hat v(\theta)]_j \ne v_j \right) + \frac{1}{2} P_{-j}\left( [\hat v(\theta)]_j \ne v_j \right).
\]
This gives the statement claimed in the lemma, while taking an infimum over all testing procedures
$\Psi : \mathcal{X} \to \{-1, +1\}$ gives the claim (9.5.2).
9.7 Bibliography
For a fuller technical introduction to nonparametric estimation, see the book by Tsybakov [182].
Has’minskii [109].
The material in Section 10.2 is based on a paper of Yang and Barron [192].
9.8 Exercises
Exercise 9.1 (A generalized version of Fano's inequality; cf. Proposition 9.4.6): Let $\mathcal{V}$ and $\hat{\mathcal{V}}$ be
arbitrary sets, and suppose that $\pi$ is a (prior) probability measure on $\mathcal{V}$, where $V$ is distributed
according to $\pi$. Let $V \to X \to \hat V$ be a Markov chain, where $V$ takes values in $\mathcal{V}$ and $\hat V$ takes values
in $\hat{\mathcal{V}}$. Let $N \subset \mathcal{V} \times \hat{\mathcal{V}}$ denote a measurable subset of $\mathcal{V} \times \hat{\mathcal{V}}$ (a collection of neighborhoods), and for
any $\hat v \in \hat{\mathcal{V}}$, denote the slice
\[
N_{\hat v} := \{ v \in \mathcal{V} : (v, \hat v) \in N \}. \tag{9.8.1}
\]
That is, $N$ denotes the neighborhoods of points $v$ for which we do not consider a prediction $\hat v$ for
$v$ to be an error, and the slices (9.8.1) index the neighborhoods. Define the “volume” constants
Define the error probability $P_{\rm error} = \mathbb{P}[(V, \hat V) \notin N]$ and entropy $h_2(p) = -p \log p - (1 - p) \log(1 - p)$.
\[
h_2(P_{\rm error}) + P_{\rm error} \log \frac{1 - p^{\min}}{p^{\max}} \ge \log \frac{1}{p^{\max}} - I(V; \hat V). \tag{9.8.2}
\]
(c) Now we give a version explicitly using distances. Let $\mathcal{V} \subset \mathbb{R}^d$ and define $N = \{(v, v') :
\|v - v'\| \le \delta\}$ to be the pairs of points within $\delta$ of one another. Let $B_v$ denote the $\|\cdot\|$-ball of radius 1
centered at $v$. Conclude for any prior $\pi$ on $\mathbb{R}^d$ that
\[
\mathbb{P}\left( \|V - \hat V\| \ge \delta \right) \ge 1 - \frac{I(V; X) + \log 2}{\log \frac{1}{\sup_v \pi(\delta B_v)}}.
\]
Exercise 9.2: In this question, we will show that the minimax rate of estimation for the parameter
of a uniform distribution (in squared error) scales as $1/n^2$. In particular, assume that $X_i \stackrel{\rm iid}{\sim}
\mathrm{Uniform}(\theta, \theta + 1)$, meaning that the $X_i$ have densities $p(x) = 1\{x \in [\theta, \theta + 1]\}$. Let $X_{(1)} = \min_i\{X_i\}$
denote the first order statistic.

(a) Prove that
\[
\mathbb{E}[(X_{(1)} - \theta)^2] = \frac{2}{(n+1)(n+2)}.
\]
(Hint: the fact that $\mathbb{E}[Z] = \int_0^\infty \mathbb{P}(Z \ge t) \, dt$ for any positive random variable $Z$ may be useful.)
(b) Using Le Cam’s two-point method, show that the minimax rate for estimation of θ ∈ R for the
uniform family U = {Uniform(θ, θ + 1) : θ ∈ R} in squared error has lower bound c/n2 , where
c is a numerical constant.
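For part (a), the stated expectation can be confirmed numerically via the hinted identity $\mathbb{E}[Z] = \int_0^\infty \mathbb{P}(Z \ge t) \, dt$ applied to $Z = (X_{(1)} - \theta)^2$, for which $\mathbb{P}(Z \ge t) = (1 - \sqrt{t})^n$ on $[0, 1]$ (Python sketch; names ours):

```python
import math

def second_moment_numeric(n, steps=200000):
    """Midpoint-rule approximation of E[(X_(1) - theta)^2] via
    E[Z] = int_0^1 P(Z >= t) dt with P(Z >= t) = (1 - sqrt(t))^n."""
    h = 1.0 / steps
    return sum((1 - math.sqrt((i + 0.5) * h)) ** n * h for i in range(steps))

for n in (1, 2, 5, 10):
    exact = 2.0 / ((n + 1) * (n + 2))
    assert abs(second_moment_numeric(n) - exact) < 1e-4
```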
Exercise 9.3 (Sign identification in sparse linear regression): In sparse linear regression, we have
$n$ observations $Y_i = \langle X_i, \theta^* \rangle + \varepsilon_i$, where the $X_i \in \mathbb{R}^d$ are known (fixed) vectors, the vector $\theta^*$ has
a small number $k \ll d$ of non-zero indices, and $\varepsilon_i \stackrel{\rm iid}{\sim} \mathsf{N}(0, \sigma^2)$. In this problem, we investigate the
problem of sign recovery, that is, identifying the vector of signs $\mathrm{sign}(\theta_j^*)$ for $j = 1, \ldots, d$, where
$\mathrm{sign}(0) = 0$.

Assume we have the following process: fix a signal threshold $\theta_{\min} > 0$. First, a vector $S \in
\{-1, 0, 1\}^d$ is chosen uniformly at random from the set of vectors $\mathcal{S}_k := \{s \in \{-1, 0, 1\}^d : \|s\|_1 = k\}$.
Then we define vectors $\theta^s$ so that $\theta_j^s = \theta_{\min} s_j$, and conditional on $S = s$, we observe

(b) Assume that $X \in \{-1, 1\}^{n \times d}$. Give a lower bound on how large $n$ must be for sign recovery.
Give a one-sentence interpretation of $\sigma^2/\theta_{\min}^2$.
Exercise 9.4 (Multiple hypothesis testing and recovery): A p-value for a null hypothesis H is any
random variable Y ∈ [0, 1] such that P(Y ≤ u) ≤ u for all u ∈ [0, 1] whenever H holds, so that it is
no more likely to be small than a Uniform[0, 1] random variable (as Y being quite small is evidence
against the hypothesis). In the multiple hypothesis testing problem, we consider n null hypotheses
H1 , . . . , Hn , and an associated p-value Yi ∈ [0, 1] for each. The goal in multiple hypothesis testing
is to reject as many hypotheses as possible without making false discoveries, that is, rejecting an
index i for which the null Hi holds.
We can model this via an estimation problem of estimating a binary vector v ∈ {0, 1}n indicating
which hypotheses are null (vi = 0) and which are non-null. To mathematize this, let V ∈ {0, 1}n
iid
have i.i.d. entries Vi ∼ Bernoulli(ϵ), and conditional on Vi = 0, draw Yi ∼ P0 := Uniform([0, 1]), and
ind
conditional on Vi = 1, draw Yi ∼ P1 , where P1 has support [0, 1]; we consider the consequences of
making various choices for P1 .
(a) Let P1 = Uniform[0, τ ] for some τ < 1. Show that inf vb P(Vi ̸= vb(Y1n )) = min{τ (1 − ϵ), ϵ}.
inf P(Vi ̸= vb(Y1n )) = min {τ (1 − ϵ), ϵ(1 − τ )} + min {(1 − τ )(1 − ϵ), ϵτ } .
v
b
(c) Let the setting of part (b) hold. I0 := {i ∈ [n] | Vi = 0} and I1 := {i ∈ [n] | Vi = 1} be the
collections of null and non-null indices, respectively. Assume the sparse testing regime, where
we take
ϵ = n−β and τ = n−r ,
where β ∈ ( 21 , 1) and r ∈ (0, ∞). For a test vb : [0, 1]n → {0, 1}n , let R := {i | vbi = 1} be the set
of rejected null hypotheses. Show that if r < β,
card(R ∩ I0 ) + card(Rc ∩ I1 )
E ≥ 1 − o(1).
card(I1 ) + 1
That is, the number of mistakes scales at least as the number of true non-nulls.
then rejects $H_{(1)}, \ldots, H_{(\hat k)}$, the associated nulls (where $\hat k = 0$ if $Y_{(k)} > \frac{k\alpha}{n}$ for each $k$). Benjamini
and Hochberg prove that if $F$ is the number of false discoveries, meaning hypotheses rejected that
were null, then so long as the nulls are independent,
\[
\mathbb{E}\left[ F / \max\{\hat k, 1\} \right] \le \alpha.
\]
We compare these two procedures under the distributional setting of Exercise 9.4 part (b), where
$\tau = n^{-r}$ and $\epsilon = n^{-\beta}$ for $r > \beta$ and $\beta \in (\frac{1}{2}, 1)$. Let $I_0 = \{i \in [n] \mid V_i = 0\}$ and $I_1 = \{i \in [n] \mid V_i = 1\}$
be the null and non-null hypotheses, respectively.

(a) Assume $r < 1$ and let $R_{\rm BC} = \{i \in [n] \mid Y_i \le \frac{\alpha}{n}\}$ be the hypotheses the Bonferroni correction
rejects. Show that
\[
\mathbb{E}[\mathrm{card}(R_{\rm BC}^c \cap I_1)] \ge \epsilon n (1 - o(1)) = n^{1-\beta} (1 - o(1)) \quad \text{and} \quad \mathbb{E}\left[ \frac{\mathrm{card}(R_{\rm BC}^c \cap I_1)}{\mathrm{card}(I_1) + 1} \right] \ge 1 - o(1).
\]
For the remainder of the question, let $R \subset [n]$ be the indices the BH procedure (9.8.3) rejects.

(b) Show that if $r > \beta$, then for each $\delta > 0$,
\[
\mathbb{P}\left( \mathrm{card}(R \cap I_1) \le (1 - \delta) \, \mathrm{card}(I_1) \right) \to 0.
\]
(c) Show that if $r > \beta$, then for any sequence $\alpha_n \to 0$ slowly enough in the BH procedure (9.8.3),
\[
\mathbb{E}\left[ \frac{\mathrm{card}(R \cap I_0) + \mathrm{card}(R^c \cap I_1)}{\mathrm{card}(I_1) + 1} \right] \to 0.
\]
(d) Comparing with Exercise 9.4, what can we say about the relative merits of the Bonferroni
correction and the BH procedure (9.8.3)?

See also the paper [1] for more on optimality of procedures of the form (9.8.3).
Exercise 9.6: In this question, we study the question of whether adaptivity can give better estimation performance for linear regression problems. That is, for $i = 1, \ldots, n$, assume that we observe variables $Y_i$ in the usual linear regression setup,
\[ Y_i = \langle X_i, \theta\rangle + \varepsilon_i, \quad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathsf{N}(0, \sigma^2), \tag{9.8.4} \]
where $\theta \in \mathbb{R}^d$ is unknown. But now, based on observing $Y_1^{i-1} = \{Y_1, \ldots, Y_{i-1}\}$, we allow an adaptive choice of the next predictor variables $X_i \in \mathbb{R}^d$. Let $\mathcal{L}_n^{\mathrm{ada}}(F^2)$ denote the family of linear regression problems under this adaptive setting (with $n$ observations), where we constrain the Frobenius norm of the data matrix $X^\top = [X_1 \cdots X_n]$, $X \in \mathbb{R}^{n \times d}$, to have bound $\|X\|_{\mathrm{Fr}}^2 = \sum_{i=1}^n \|X_i\|_2^2 \le F^2$. We use Assouad's method to show that the minimax mean-squared error satisfies the following bound:
\[ \mathfrak{M}(\mathcal{L}_n^{\mathrm{ada}}(F^2), \|\cdot\|_2^2) := \inf_{\hat\theta} \sup_{\theta \in \mathbb{R}^d} \mathbb{E}\big[\|\hat\theta - \theta\|_2^2\big] \ge \frac{d\sigma^2}{n} \cdot \frac{1}{16\frac{F^2}{dn}}. \tag{9.8.5} \]
Here the infimum is taken over all adaptive procedures satisfying $\|X\|_{\mathrm{Fr}}^2 \le F^2$.
In general, when we choose Xi based on the observations Y1i−1 , we are taking Xi = Fi (Y1i−1 , U1i ),
where Ui is a random variable independent of εi and Y1i−1 and Fi is some function. Justify the
following steps in the proof of inequality (9.8.5):
(i) Assume that nature chooses v ∈ V = {−1, 1}d uniformly at random and, conditionally on v,
let θ = θv . Justify
\[ \mathfrak{M}(\mathcal{L}_n^{\mathrm{ada}}(F^2), \|\cdot\|_2^2) \ge \inf_{\hat\theta} \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \mathbb{E}_{\theta_v}\big[\|\hat\theta - \theta_v\|_2^2\big]. \]
Argue it is no loss of generality to assume that the choices for Xi are deterministic based on
the Y1i−1 . Thus, throughout we assume that Xi = Fi (Y1i−1 , ui1 ), where un1 is a fixed sequence,
or, for simplicity, that Xi is a function of Y1i−1 .
(ii) Fix $\delta > 0$. Let $v \in \{-1,1\}^d$, and for each such $v$, define $\theta_v = \delta v$. Also let $P_v^n$ denote the joint distribution (over all adaptively chosen $X_i$) of the observed variables $Y_1, \ldots, Y_n$, and define
\[ P_{+j}^n = \frac{1}{2^{d-1}} \sum_{v : v_j = 1} P_v^n \quad \text{and} \quad P_{-j}^n = \frac{1}{2^{d-1}} \sum_{v : v_j = -1} P_v^n, \]
so that $P_{\pm j}^n$ denotes the distribution of the $Y_i$ when $v \in \{-1,1\}^d$ is chosen uniformly at random but conditioned on $v_j = \pm 1$. Then
\[ \inf_{\hat\theta} \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \mathbb{E}_{\theta_v}\big[\|\hat\theta - \theta_v\|_2^2\big] \ge \frac{\delta^2}{2} \sum_{j=1}^d \left[1 - \big\|P_{+j}^n - P_{-j}^n\big\|_{\mathrm{TV}}\right]. \]
(iii) We have
\[ \frac{\delta^2}{2} \sum_{j=1}^d \left[1 - \big\|P_{+j}^n - P_{-j}^n\big\|_{\mathrm{TV}}\right] \ge \frac{\delta^2 d}{2} \left[1 - \bigg(\frac{1}{d} \sum_{j=1}^d \big\|P_{+j}^n - P_{-j}^n\big\|_{\mathrm{TV}}^2\bigg)^{1/2}\right]. \]
(iv) Let $P_{+j}^{(i)}$ be the distribution of the random variable $Y_i$ conditioned on $v_j = +1$ (with the other coordinates of $v$ chosen uniformly at random), and let $P_{+j}^{(i)}(\cdot \mid y_1^{i-1}, x_i)$ denote the distribution of $Y_i$ conditioned on $v_j = +1$, $Y_1^{i-1} = y_1^{i-1}$, and $x_i$. Justify
\[ \big\|P_{+j}^n - P_{-j}^n\big\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{kl}}\big(P_{+j}^n \,\|\, P_{-j}^n\big) \le \frac{1}{2} \sum_{i=1}^n \int D_{\mathrm{kl}}\big(P_{+j}^{(i)}(\cdot \mid y_1^{i-1}, x_i) \,\|\, P_{-j}^{(i)}(\cdot \mid y_1^{i-1}, x_i)\big)\, dP_{+j}(y_1^{i-1}, x_i). \]
(vi) We have
\[ \sum_{j=1}^d \big\|P_{+j}^n - P_{-j}^n\big\|_{\mathrm{TV}}^2 \le \frac{\delta^2}{\sigma^2} \mathbb{E}\big[\|X\|_{\mathrm{Fr}}^2\big], \]
where the final expectation is over $V$ drawn uniformly in $\{-1,1\}^d$ and all $Y_i, X_i$.
(vii) Show how to choose δ appropriately to conclude the minimax bound (9.8.5).
Exercise 9.7: Suppose under the setting of Question 9.6 that we may no longer be adaptive,
meaning that the matrix X ∈ Rn×d must be chosen ahead of time (without seeing any data).
Assuming n ≥ d, is it possible to attain (within a constant factor) the risk (9.8.5)? If so, give an
example construction, if not, explain why not.
Exercise 9.8 (The curse of dimensionality in nonparametric regression): Consider the nonparametric regression problem in Section 10.1. Let $B_d$ be the unit $\ell_2$-ball in $\mathbb{R}^d$ and consider the function class $\mathcal{F}$ of 1-Lipschitz functions taking values in $[-1,1]$ on $B_d$, and consider the error $\|f - g\|_2^2 = \int_{B_d} (f(x) - g(x))^2\, dx$. (Here, 1-Lipschitz means $|f(x) - f(x')| \le \|x - x'\|_2$ for any $x, x'$.) We show the minimax lower bound (10.1.4) for this function class using Fano's method.
Fix $\delta \in [0,1]$ to be chosen and let $\{x_j\}_{j=1}^M$ be the centers of a maximal $2\delta$-packing of $B_d$, so that $M \ge (\frac{1}{2\delta})^d$ (by Lemma 5.1.10), and define the "bump" functions
\[ g_j(x) = \delta\left[1 - \|x - x_j\|_2/\delta\right]_+, \]
which all have disjoint support. Then for a vector $v \in \{\pm 1\}^M$, define
\[ f_v(x) := \sum_{j=1}^M v_j g_j(x). \]
(c) Use the Gilbert-Varshamov bound (Lemma 9.2.3) to show there is a collection V ⊂ {±1}M of
cardinality exp(M/8) with ∥fv − fv′ ∥22 ≥ cd δ 2 for all v ̸= v ′ ∈ V, where cd depends only on the
dimension d.
Exercise 9.9 (Optimal algorithms for memory access): In a modern CPU, memory is organized
in a hierarchy, so that data upon which computations are being actively performed lies in a very
small memory close to the logic units of the processor for which access is extraordinarily fast, while
data not being actively used lies in slower memory slightly farther from the processor. (Modern
processor memory is generally organized into the registers—a small number of 4- or 8-byte memory
locations on the processor—and level 1, 2, (and sometimes 3 or more) cache, which contain small
amounts of data and increasing access times, and RAM (random access memory).) Moving data—
communicating—between levels of the memory hierarchy is both power intensive and very slow
relative to computation on the data itself, so that in many algorithms the bulk of the time of the
algorithm is in moving data from one place to another to be computed upon. Thus, developing
very fast algorithms for numerical (and other) tasks on modern computers requires careful tracking
of memory access and communication, and careful control of these quantities can often yield orders
of magnitude speed improvements in execution. In this problem, you will prove a lower bound on
the number of communication steps that a variety of numerical-type methods must perform, giving
a concrete (attainable) inequality that allows one to certify optimality of specific algorithms.
(b) Within a segment, all operands involved must be in fast memory at least once to be computed
with. Assume that memory locations Mem(Ail ), Mem(Blj ), and Mem(Cij ) do not overlap.
For any operand involved in a memory operation in one of the segments, the operand (1) was
already in fast memory at the beginning of the segment, (2) was read from slow memory, (3)
is still in fast memory at the end of the segment, or (4) is written to slow memory at the end
of the segment. (There are also operands potentially created during execution that are simply
discarded; we do not bound those.) Justify the following: within a segment, for each type of operand Mem($A_{il}$), Mem($B_{lj}$), or Mem($C_{ij}$), there are at most $c \cdot M$ such operands (i.e., there are at most $cM$ operands of type Mem($A_{il}$), independent of the others, and so on), where $c$ is a numerical constant. What value of $c$ can you attain?
(c) Using the result of Question 7.2, argue that $N_{\mathrm{seg}} \le c'\sqrt{M^3}$ for a numerical constant $c'$. What value of $c'$ do you get?
(d) Using the result of part (c), argue that the number of loads and stores satisfies
\[ N_{\mathrm{Store}} + N_{\mathrm{Load}} \ge c'' \frac{N}{\sqrt{M}} - M \]
for a numerical constant $c''$. What is your constant?
Exercise 9.10 (Tilting and information bounds): Let P0 be a distribution on X ∈ X , and assume
w.l.o.g. that P0 has a density p0 . For a bounded function g : X → Rk with EP0 [g(X)] = 0, define
the tilted density
pt,g (x) := (1 + ⟨t, g(x)⟩)p0 (x), (9.8.6)
for t ∈ Rk , with induced distribution Pt,g .
(a) Show that if t is near enough to 0, then pt,g is indeed a valid density.
(c) Let $\mathcal{V} \subset \mathbb{R}^k$ be a bounded set and $V \sim \mathsf{Uniform}(\mathcal{V})$. Let $\delta \ge 0$. Suppose that conditional on $V = v$, we draw $X_i \stackrel{\mathrm{iid}}{\sim} P_{\delta v, g}$. Show the mutual information bound
\[ I(X_1^n; V) \le \frac{n\delta^2}{2}\,\operatorname{tr}\big(\operatorname{Cov}_0(g(X))\,\mathbb{E}[V V^\top]\big) + O(n\delta^3). \]
(a) Show that the mean is differentiable relative to G for any collection of mean-zero bounded
functions, and give the derivative θ̇P0 .
For the remainder of the problem, assume that for any G consisting of bounded functions, θ : P →
Rd is differentiable relative to G with bounded derivative θ̇P0 . Let Σ = Cov0 (θ̇P0 ) and for A ≻ 0
define the Mahalanobis norm ∥v∥2A = v ⊤ Av.
(b) For δ ≥ 0, show how to construct a collection of distributions {Qv } indexed by v ∈ V = {−1, 1}d
such that as δ → 0, for any pair v, v ′
(c) Show that (for large enough dimension $d$) there exist numerical constants $c_0, c_1 > 0$ and a collection of distributions $\{Q_v\} \subset \mathcal{P}$, indexed by $v \in \mathcal{V} = \{-1,1\}^d$, such that if we choose $V \sim \mathsf{Uniform}(\mathcal{V})$ and then draw $X_1^n \stackrel{\mathrm{iid}}{\sim} Q_v$ conditional on $V = v$,
\[ \mathbb{P}\left( \big\|\hat\theta(X_1^n) - \theta(Q_V)\big\|_{\Sigma^{-1}} \ge c_0\sqrt{d/n} \right) \ge c_1. \]
Exercise 9.13 (A minimax lower bound for M-estimation): Let ℓ be a convex loss and L(θ) =
EP0 [ℓ(θ, Z)] satisfy all the conditions of Assumption A.5.1 in Chapter 5.3, where P0 is a distribution
on Z ∈ Z that we assume w.l.o.g. has a density p0 . Define the tilted densities pt as in Exercise 9.10,
Eq. (9.8.6). Let θ(P ) = argminθ EP [ℓ(θ, Z)] be the population minimizer and define the Hessian
H = ∇2 L(θ0 ) and covariance Σ := CovP0 (∇ℓ(θ0 , Z)) ≻ 0. Previewing Chapter 12.3.1, the implicit
function theorem implies that
for $t \in \mathbb{R}^k$ near 0. Show that by choosing an appropriate tilting function $g : \mathcal{Z} \to \mathbb{R}^d$, there exist numerical constants $c_0, c_1, c_2 > 0$ such that for large enough $n$, the choice $\delta_n := c_0\sqrt{d/n}$ yields
\[ \inf_{\hat\theta} \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} P_{\delta_n v, g}\left( \big(\hat\theta(Z_1^n) - \theta(P_{\delta_n v, g})\big)^\top \big(H^{-1}\Sigma H^{-1}\big)^{-1} \big(\hat\theta(Z_1^n) - \theta(P_{\delta_n v, g})\big) \ge c_1 \frac{d}{n} \right) \ge c_2. \]
How does this lower bound compare to the guarantee in Corollary 5.3.10?
Chapter 10
Figure 10.1. Observations in a non-parametric regression problem, with function f plotted. (Here
f (x) = sin(2x + cos2 (3x)).)
Yi = f (Xi ) + εi (10.1.1)
where εi are independent, mean zero conditional on Xi , and E[ε2i ] ≤ σ 2 . See Figure 10.1 for an
example. We also assume that we fix the locations of the Xi as Xi = i/n ∈ [0, 1], that is, the Xi
are evenly spaced in [0, 1]. Given n observations Yi , we ask two questions: (1) how can we estimate
f ? and (2) what are the optimal rates at which it is possible to estimate f ?
where λ0 > 0 (this says the kernel has some width to it). A natural example is the “tent” function
given by Ktent (x) = [1 − |x|]+ , which satisfies inequality (10.1.2) with λ0 = 1/2. See Fig. 10.2 for
two examples, one the tent function and the other the function
\[ K(x) = \mathbf{1}\{|x| < 1\}\, \exp\left(-\frac{1}{(x-1)^2}\right) \exp\left(-\frac{1}{(x+1)^2}\right), \]
which is infinitely differentiable and supported on $[-1, 1]$.
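Both kernels are simple to evaluate directly; the sketch below (function names are our own) implements them exactly as in the two displays above.

```python
import numpy as np

def tent_kernel(x):
    """K_tent(x) = [1 - |x|]_+, which satisfies (10.1.2) with lambda_0 = 1/2."""
    return np.maximum(1.0 - np.abs(x), 0.0)

def smooth_kernel(x):
    """Infinitely differentiable kernel supported on [-1, 1]:
    K(x) = 1{|x| < 1} exp(-1/(x-1)^2) exp(-1/(x+1)^2)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    inside = np.abs(x) < 1
    xi = x[inside]
    out[inside] = np.exp(-1.0 / (xi - 1.0) ** 2) * np.exp(-1.0 / (xi + 1.0) ** 2)
    return out

print(tent_kernel(np.array([0.0, 0.5, 2.0])))  # [1., 0.5, 0.]
print(smooth_kernel(np.array([0.0, 1.0])))     # positive at 0, exactly 0 at +/-1
```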
Figure 10.2: Left: "tent" kernel. Right: infinitely differentiable compactly supported kernel.
Now we consider a natural estimator of the function $f$ based on the observations (10.1.1), known as the Nadaraya-Watson estimator. Fix a bandwidth $h$, which we will see later smooths the estimated functions $f$. For all $x$, define the weights
\[ W_{ni}(x) := \frac{K\left(\frac{X_i - x}{h}\right)}{\sum_{j=1}^n K\left(\frac{X_j - x}{h}\right)} \]
and the estimator $\hat{f}_n(x) := \sum_{i=1}^n W_{ni}(x) Y_i$.
The intuition here is that we have a locally weighted regression function, where points Xi in the
neighborhood of x are given higher weight than further points. Using this function fbn as our
estimator, it is possible to provide a guarantee on the bias and variance of the estimated function
at each point x ∈ [0, 1].
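As a concrete sketch (our own illustration, using the tent kernel, the evenly spaced design $X_i = i/n$, and the regression function from Figure 10.1), the Nadaraya-Watson estimate at a point is simply a locally weighted average of the $Y_i$:

```python
import numpy as np

def nadaraya_watson(x, X, Y, h, kernel):
    """Evaluate fhat_n(x) = sum_i W_ni(x) Y_i, where
    W_ni(x) = K((X_i - x)/h) / sum_j K((X_j - x)/h)."""
    w = kernel((X - x) / h)
    return np.dot(w, Y) / np.sum(w)

def tent(x):
    return np.maximum(1.0 - np.abs(x), 0.0)

rng = np.random.default_rng(1)
n, sigma = 200, 0.1
X = np.arange(1, n + 1) / n                 # evenly spaced design X_i = i/n
f = lambda x: np.sin(2 * x + np.cos(3 * x) ** 2)  # function from Figure 10.1
Y = f(X) + sigma * rng.standard_normal(n)
h = n ** (-1 / 3)                            # bandwidth of the order in Theorem 10.1.2
fhat = nadaraya_watson(0.5, X, Y, h, tent)
print(fhat, f(0.5))                          # estimate should be near f(0.5)
```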
Proposition 10.1.1. Let the observation model (10.1.1) hold and assume condition (10.1.2). In addition, assume the bandwidth is suitably large, namely $h \ge 2/n$, and that the $X_i$ are evenly spaced on $[0,1]$. Then for any $x \in [0,1]$, we have
\[ \big|\mathbb{E}[\hat{f}_n(x)] - f(x)\big| \le Lh \quad \text{and} \quad \operatorname{Var}(\hat{f}_n(x)) \le \frac{2\sigma^2}{\lambda_0 n h}. \]
Proof To bound the bias, we note that (conditioning implicitly on the $X_i$)
\[ \mathbb{E}[\hat{f}_n(x)] = \sum_{i=1}^n \mathbb{E}[Y_i W_{ni}(x)] = \sum_{i=1}^n \mathbb{E}\big[f(X_i) W_{ni}(x) + \varepsilon_i W_{ni}(x)\big] = \sum_{i=1}^n f(X_i) W_{ni}(x), \]
and because there are at least nh/2 indices satisfying |Xj − x| ≤ h, we obtain the claim (10.1.3).
Using the claim, we have
\[ \operatorname{Var}(\hat{f}_n(x)) = \mathbb{E}\left[\bigg(\sum_{i=1}^n (Y_i - f(X_i)) W_{ni}(x)\bigg)^2\right] = \mathbb{E}\left[\bigg(\sum_{i=1}^n \varepsilon_i W_{ni}(x)\bigg)^2\right] = \sum_{i=1}^n W_{ni}(x)^2\, \mathbb{E}[\varepsilon_i^2] \le \sigma^2 \sum_{i=1}^n W_{ni}(x)^2. \]
Noting that $W_{ni}(x) \le \frac{2}{\lambda_0 n h}$ and $\sum_{i=1}^n W_{ni}(x) = 1$, we have
\[ \sigma^2 \sum_{i=1}^n W_{ni}(x)^2 \le \sigma^2 \max_i W_{ni}(x) \underbrace{\sum_{i=1}^n W_{ni}(x)}_{=1} \le \frac{2\sigma^2}{\lambda_0 n h}. \]
With the proposition in place, we can then provide a theorem bounding the worst case pointwise
mean squared error for estimation of a function f ∈ F.
Theorem 10.1.2. Under the conditions of Proposition 10.1.1, choose $h = (\sigma^2/(L^2\lambda_0))^{1/3} n^{-1/3}$. Then there exists a universal (numerical) constant $C < \infty$ such that for any $f \in \mathcal{F}$,
\[ \sup_{x \in [0,1]} \mathbb{E}\big[(\hat{f}_n(x) - f(x))^2\big] \le C\left(\frac{L\sigma^2}{\lambda_0}\right)^{2/3} n^{-\frac{2}{3}}. \]
By integrating the result in Theorem 10.1.2 over the interval [0, 1], we immediately obtain the
following corollary.
Corollary 10.1.3. Under the conditions of Theorem 10.1.2, if we use the tent kernel $K_{\mathrm{tent}}$, we have
\[ \sup_{f \in \mathcal{F}} \mathbb{E}_f\big[\|\hat{f}_n - f\|_2^2\big] \le C\left(\frac{L\sigma^2}{n}\right)^{2/3}, \]
where $C$ is a universal constant.
In Proposition 10.1.1, it is possible to show that a more clever choice of kernels—ones that are not always positive—can attain bias $\mathbb{E}[\hat{f}_n(x)] - f(x) = O(h^\beta)$ if $f$ has Lipschitz $(\beta-1)$th derivative. In this case, we immediately obtain that the rate can be improved to
\[ \sup_x \mathbb{E}\big[(\hat{f}_n(x) - f(x))^2\big] \le C n^{-\frac{2\beta}{2\beta+1}}, \]
and every additional degree of smoothness gives a corresponding improvement in convergence rate.
We also remark that rates of this form, which are much larger than $n^{-1}$, are characteristic of nonparametric problems; essentially, we must adaptively choose a dimension that balances the sample size, so that rates of $1/n$ are difficult or impossible to achieve.
Deferring the proof of the theorem temporarily, we make a few remarks. It is in fact possible to
show—using a completely identical technique—that if Fβ denotes the class of functions with β − 1
derivatives, where the (β − 1)th derivative is Lipschitz, then
\[ \mathfrak{M}_n(\mathcal{F}_\beta, \|\cdot\|_2^2) \ge c\left(\frac{\sigma^2}{n}\right)^{\frac{2\beta}{2\beta+1}}. \]
So for any smoothness class, we can never achieve the parametric σ 2 /n rate, but we can come
arbitrarily close. As another remark, which we do not prove, in dimensions d ≥ 1, the minimax
rate for estimation of functions f with Lipschitz (β − 1)th derivative scales as
\[ \mathfrak{M}_n(\mathcal{F}_\beta, \|\cdot\|_2^2) \ge c\left(\frac{\sigma^2}{n}\right)^{\frac{2\beta}{2\beta+d}}. \tag{10.1.4} \]
This result can, similarly, be proved using a variant of Assouad’s method or a local Fano method;
see, for example, Györfi et al. [108, Chapter 3]. Exercise 9.8 works through a particular case of this
lower bound. This is a striking example of the curse of dimensionality: the penalty for increasing
dimension results in worse rates of convergence. For example, suppose that β = 1. In 1 dimension,
we require n ≥ 90 ≈ (.05)−3/2 observations to achieve accuracy .05 in estimation of f , while we
require n ≥ 8000 = (.05)−(2+d)/2 even when the dimension d = 4, and n ≥ 64·106 observations even
in 10 dimensions, which is a relatively small problem. That is, the problem is made exponentially
more difficult by dimension increases.
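The sample-size arithmetic above amounts to solving $n^{-2\beta/(2\beta+d)} \le \epsilon$ for $n$, i.e. $n \ge \epsilon^{-(2\beta+d)/(2\beta)}$; a two-line sketch makes the exponential dependence on $d$ explicit.

```python
import math

def samples_needed(accuracy, d, beta=1):
    """Smallest n with n^(-2*beta/(2*beta + d)) <= accuracy."""
    return math.ceil(accuracy ** (-(2 * beta + d) / (2 * beta)))

# Reproduce the numbers in the text: beta = 1, target accuracy 0.05.
for d in (1, 4, 10):
    print(d, samples_needed(0.05, d))  # -> 90, 8000, 64000000
```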
We now prove Theorem 10.1.4. To establish the result, we show how to construct a family
of problems—indexed by binary vectors v ∈ {−1, 1}k —so that our estimation problem satisfies
the separation (9.5.1), then we show that the information based on observing noisy versions of
the functions we have defined is small. Choosing k to make our resulting lower bound as high as
possible completes the argument.
which we see is 1-Lipschitz. Now, consider any function f : [0, 1] → R, and let Ej be shorthand for
the intervals Ej = [(j − 1)/k, j/k] for j = 1, . . . , k. We must find a mapping identifying a function
$f$ with points in the hypercube $\{-1,1\}^k$. To that end, we may define a vector $\hat{v}(f) \in \{-1,1\}^k$ by
\[ \hat{v}_j(f) = \mathop{\operatorname{argmin}}_{s \in \{-1,1\}} \int_{E_j} (f(t) - s g_j(t))^2\, dt. \]
Bounding the binary testing error Let Pvn denote the distribution of the n observations
Yi = fv (Xi ) + εi when fv is the true regression function. Then inequality (10.1.6) implies via
Assouad’s lemma that
\[ \mathfrak{M}_n(\mathcal{F}, \|\cdot\|_2^2) \ge \frac{c}{k^3} \sum_{j=1}^k \left[1 - \big\|P_{+j}^n - P_{-j}^n\big\|_{\mathrm{TV}}\right]. \tag{10.1.7} \]
\[ \big\|P_{+j}^n - P_{-j}^n\big\|_{\mathrm{TV}}^2 \le \max_v \big\|P_{v,+j}^n - P_{v,-j}^n\big\|_{\mathrm{TV}}^2 \le \frac{1}{2} \max_v D_{\mathrm{kl}}\big(P_{v,+j}^n \,\|\, P_{v,-j}^n\big). \]
For any two functions fv and fv′ , we have that the observations Yi are independent and normal
with means fv (Xi ) or fv′ (Xi ), respectively. Thus
\[ D_{\mathrm{kl}}\big(P_v^n \,\|\, P_{v'}^n\big) = \sum_{i=1}^n D_{\mathrm{kl}}\big(\mathsf{N}(f_v(X_i), \sigma^2) \,\|\, \mathsf{N}(f_{v'}(X_i), \sigma^2)\big) = \sum_{i=1}^n \frac{1}{2\sigma^2}\big(f_v(X_i) - f_{v'}(X_i)\big)^2. \tag{10.1.8} \]
Now we must show that the expression (10.1.8) scales more slowly than $n$, which must be the case whenever $d_{\mathrm{ham}}(v, v') \le 1$. Intuitively, most of the observations have the same distribution by our construction of the $f_v$ as bump functions; let us make this rigorous.
We may assume without loss of generality that vj = vj′ for j > 1. As the Xi = i/n, we thus
have that only Xi for i near 1 can have non-zero values in the tensorization (10.1.8). In particular,
\[ f_v(i/n) = f_{v'}(i/n) \quad \text{for all } i \text{ s.t. } \frac{i}{n} \ge \frac{2}{k}, \text{ i.e. } i \ge \frac{2n}{k}. \]
Rewriting expression (10.1.8), then, and noting that fv (x) ∈ [−1/k, 1/k] for all x by construction,
we have
\[ \sum_{i=1}^n \frac{1}{2\sigma^2}\big(f_v(X_i) - f_{v'}(X_i)\big)^2 \le \sum_{i=1}^{2n/k} \frac{1}{2\sigma^2}\big(f_v(X_i) - f_{v'}(X_i)\big)^2 \le \frac{1}{2\sigma^2} \cdot \frac{2n}{k} \cdot \frac{1}{k^2} = \frac{n}{k^3\sigma^2}. \]
Combining this with inequality (10.1.8) and the minimax bound (10.1.7), we obtain
\[ \big\|P_{+j}^n - P_{-j}^n\big\|_{\mathrm{TV}} \le \sqrt{\frac{n}{2k^3\sigma^2}}, \]
so
\[ \mathfrak{M}_n(\mathcal{F}, \|\cdot\|_2^2) \ge \frac{c}{k^3} \sum_{j=1}^k \left[1 - \sqrt{\frac{n}{2k^3\sigma^2}}\right]. \]
Thus, there are two ingredients in proving lower bounds on the error in a hypothesis test: upper
bounding the mutual information and lower bounding the size |V|. The key in the global Fano
method is an upper bound on the former (the information I(V ; X)) using covering numbers.
Before stating our result, we require a bit of notation. First, we assume that V is drawn from a
distribution µ, and conditional on V = v, assume the sample X ∼ Pv . Then a standard calculation
(or simply the definition of mutual information; recall equation (9.4.4)) gives that
\[ I(V; X) = \int D_{\mathrm{kl}}\big(P_v \,\|\, \bar{P}\big)\, d\mu(v), \quad \text{where } \bar{P} = \int P_v\, d\mu(v). \]
Now, we show how to connect this mutual information quantity to a covering number of a set of
distributions.
Assume that for all v, we have Pv ∈ P, where P is a collection of distributions. In analogy
with Definition 5.1, we say that the collection of distributions $\{Q_i\}_{i=1}^N$ forms an $\epsilon$-cover of $\mathcal{P}$ in KL-divergence if for all $P \in \mathcal{P}$, there exists some $i$ such that $D_{\mathrm{kl}}(P \,\|\, Q_i) \le \epsilon^2$. With this, we may define the KL-covering number of the set $\mathcal{P}$ as
\[ N_{\mathrm{kl}}(\epsilon, \mathcal{P}) := \inf\Big\{ N \in \mathbb{N} \;\Big|\; \exists\, Q_i,\ i = 1, \ldots, N,\ \sup_{P \in \mathcal{P}} \min_i D_{\mathrm{kl}}(P \,\|\, Q_i) \le \epsilon^2 \Big\}, \tag{10.2.1} \]
where Nkl (ϵ, P) = +∞ if no such cover exists. With definition (10.2.1) in place, we have the
following proposition.
so that inequality (10.2.3) holds. By carefully choosing the distribution Q in the upper bound (10.2.3),
we obtain the proposition.
Now, assume that the distributions $Q_i$, $i = 1, \ldots, N$, form an $\epsilon^2$-cover of the family $\mathcal{P}$, meaning that
\[ \min_{i \in [N]} D_{\mathrm{kl}}(P \,\|\, Q_i) \le \epsilon^2 \quad \text{for all } P \in \mathcal{P}. \]
Let $p_v$ and $q_i$ denote the densities of $P_v$ and $Q_i$ with respect to some fixed base measure on $\mathcal{X}$ (the choice of base measure does not matter). Then defining the distribution $\bar{Q} = (1/N)\sum_{i=1}^N Q_i$, we obtain for any $v$ that in expectation over $X \sim P_v$,
\begin{align*}
D_{\mathrm{kl}}\big(P_v \,\|\, \bar{Q}\big) &= \mathbb{E}_{P_v}\left[\log\frac{p_v(X)}{\bar{q}(X)}\right] = \mathbb{E}_{P_v}\left[\log\frac{p_v(X)}{N^{-1}\sum_{i=1}^N q_i(X)}\right] \\
&= \log N + \mathbb{E}_{P_v}\left[\log\frac{p_v(X)}{\sum_{i=1}^N q_i(X)}\right] \le \log N + \mathbb{E}_{P_v}\left[\log\frac{p_v(X)}{\max_i q_i(X)}\right] \\
&\le \log N + \min_i \mathbb{E}_{P_v}\left[\log\frac{p_v(X)}{q_i(X)}\right] = \log N + \min_i D_{\mathrm{kl}}\big(P_v \,\|\, Q_i\big).
\end{align*}
By our assumption that the Qi form a cover, this gives the desired result, as ϵ ≥ 0 was arbitrary,
as was our choice of the cover.
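The mixture bound just proved, $D_{\mathrm{kl}}(P_v \,\|\, \bar{Q}) \le \log N + \min_i D_{\mathrm{kl}}(P_v \,\|\, Q_i)$, is easy to sanity-check numerically; the following sketch (a toy example of our own, with random discrete distributions) does so.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions on a common support."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
N, k = 5, 8                               # N cover elements, support of size k
Q = rng.dirichlet(np.ones(k), size=N)     # densities q_1, ..., q_N
p = rng.dirichlet(np.ones(k))             # a distribution P_v
Qbar = Q.mean(axis=0)                     # the mixture (1/N) sum_i Q_i
lhs = kl(p, Qbar)
rhs = np.log(N) + min(kl(p, q) for q in Q)
print(lhs, rhs)                           # lhs <= rhs, as the proof shows
```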
Corollary 10.2.2. Assume that $X_1, \ldots, X_n$ are drawn i.i.d. from $P_v$ conditional on $V = v$. Let $N_{\mathrm{kl}}(\epsilon, \mathcal{P})$ denote the KL-covering number of a collection $\mathcal{P}$ containing the distributions (over a single observation) $P_v$ for all $v \in \mathcal{V}$. Then
\[ I(V; X_1^n) \le \inf_{\epsilon \ge 0}\left\{ n\epsilon^2 + \log N_{\mathrm{kl}}(\epsilon, \mathcal{P}) \right\}. \]
With Corollary 10.2.2 and Proposition 10.2.1 in place, we thus see that the global covering numbers
in KL-divergence govern the behavior of information.
We remark in passing that the quantity (10.2.2), and its i.i.d. analogue in Corollary 10.2.2,
is known as the index of resolvability, and it controls estimation rates and redundancy of coding
schemes for unknown distributions in a variety of scenarios; see, for example, Barron [17] and Barron
and Cover [18]. It is also similar to notions of complexity in Dudley’s entropy integral (cf. Dudley
[77]) in empirical process theory, where the fluctuations of an empirical process are governed by a
tradeoff between covering number and approximation of individual terms in the process.
(i) Bound the packing entropy. Give a lower bound on the packing number of the set Θ with
2δ-separation (call this lower bound M (δ)).
(ii) Bound the metric entropy. Give an upper bound on the KL-metric entropy of the class P of
distributions containing all the distributions Pv , that is, an upper bound on log Nkl (ϵ, P).
(iii) Find the critical radius. Noting as in Corollary 10.2.2 that with n i.i.d. observations, we have
\[ I(V; X_1^n) \le n\epsilon^2 + \log N_{\mathrm{kl}}(\epsilon, \mathcal{P}) \quad \text{for all } \epsilon \ge 0, \]
we now balance the information $I(V; X_1^n)$ and the packing entropy $\log M(\delta)$. To that end, we choose $\epsilon_n$ and $\delta_n > 0$ at the critical radius, defined as follows: choose any $\epsilon_n$ and $\delta_n$ such that
\[ \log M(\delta_n) \ge 4n\epsilon_n^2 + 2\log 2 \ge 2\log N_{\mathrm{kl}}(\epsilon_n, \mathcal{P}) + 2n\epsilon_n^2 + 2\log 2 \ge 2\left(I(V; X_1^n) + \log 2\right). \]
(We could have chosen the $\epsilon_n$ attaining the infimum in the mutual information, but this way we need only an upper bound on $\log N_{\mathrm{kl}}(\epsilon, \mathcal{P})$.)
(iv) Apply the Fano minimax bound. Having chosen $\delta_n$ and $\epsilon_n$ as above, we immediately obtain that for the Markov chain $V \to X_1^n \to \hat{V}$,
\[ \mathbb{P}(V \neq \hat{V}) \ge 1 - \frac{I(V; X_1, \ldots, X_n) + \log 2}{\log M(\delta_n)} \ge 1 - \frac{1}{2} = \frac{1}{2}, \]
and thus, applying the Fano minimax bound in Proposition 9.4.3, we obtain
\[ \mathfrak{M}_n(\theta(\mathcal{P}); \Phi \circ \rho) \ge \frac{1}{2}\Phi(\delta_n). \]
The rate in Proposition 10.2.3 is sharp to within factors logarithmic in $n$; a more precise analysis of the upper and lower bounds on the minimax rate yields
\[ \mathfrak{M}_n(\mathcal{F}, \|\cdot\|_\infty) := \inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{E}_f\big[\|\hat{f}_n - f\|_\infty\big] \asymp \left(\frac{\sigma^2 \log n}{n}\right)^{1/3}. \]
Proof Our first step is to note that the covering and packing numbers of the set $\mathcal{F}$ in the $\ell_\infty$ metric satisfy
\[ \log N(\delta, \mathcal{F}, \|\cdot\|_\infty) \asymp \log M(\delta, \mathcal{F}, \|\cdot\|_\infty) \asymp \frac{1}{\delta}. \tag{10.2.4} \]
To see this, fix some $\delta \in (0,1)$ and assume for simplicity that $1/\delta$ is an integer. Define the sets $E_j = [\delta(j-1), \delta j)$, and for each $v \in \{-1,1\}^{1/\delta}$ define $h_v(x) = \sum_{j=1}^{1/\delta} v_j \mathbf{1}\{x \in E_j\}$. Then define the function $f_v(t) = \int_0^t h_v(s)\, ds$, which increases or decreases linearly on each interval of width $\delta$ in $[0,1]$. Then these $f_v$ form a $2\delta$-packing and a $2\delta$-cover of $\mathcal{F}$, and there are $2^{1/\delta}$ such $f_v$. Thus the asymptotic approximation (10.2.4) holds.
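The packing construction is concrete enough to code; this sketch (our own discretization, with $\delta = 1/8$) builds two of the piecewise linear functions $f_v$ differing in a single coordinate and evaluates their sup-norm separation, which equals $2\delta$.

```python
import numpy as np

def f_v(v, delta, t):
    """Piecewise linear f_v(t) = int_0^t h_v(s) ds, where h_v = v_j on
    E_j = [delta*(j-1), delta*j); slope +/-1 on each width-delta interval."""
    t = np.asarray(t, dtype=float)
    j = np.minimum((t / delta).astype(int), len(v) - 1)   # interval index
    cum = np.concatenate([[0.0], np.cumsum(v) * delta])   # values at the knots
    return cum[j] + v[j] * (t - j * delta)

delta = 1 / 8
v = np.ones(8, dtype=int)
w = v.copy(); w[3] = -1                  # flip one coordinate of v
grid = np.linspace(0, 1, 1001)
gap = np.max(np.abs(f_v(v, delta, grid) - f_v(w, delta, grid)))
print(gap)                               # sup-norm separation 2*delta = 0.25
```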
Now, if for some fixed $x \in [0,1]$ and $f, g \in \mathcal{F}$ we define $P_f$ and $P_g$ to be the distributions of the observations $f(x) + \varepsilon$ or $g(x) + \varepsilon$, we have that
\[ D_{\mathrm{kl}}\big(P_f \,\|\, P_g\big) = \frac{1}{2\sigma^2}\big(f(x) - g(x)\big)^2 \le \frac{\|f - g\|_\infty^2}{2\sigma^2}, \]
and if $P_f^n$ is the distribution of the $n$ observations $f(X_i) + \varepsilon_i$, $i = 1, \ldots, n$, we also have
\[ D_{\mathrm{kl}}\big(P_f^n \,\|\, P_g^n\big) = \sum_{i=1}^n \frac{1}{2\sigma^2}\big(f(X_i) - g(X_i)\big)^2 \le \frac{n}{2\sigma^2}\|f - g\|_\infty^2, \]
as desired.
through a communication channel, then the probability of error in decoding messages necessarily
approaches 1. To set the stage, recall the setting of Section 9.4, where we choose V ∈ V uniformly at
random, then have the Markov chain V → Y → Vb . Then letting ϵ = P(Vb ̸= V ), Fano’s inequality
(Corollary 9.4.2) equivalently states that
\[ \log \operatorname{card}(\mathcal{V}) \le \frac{I(V; Y)}{1 - \epsilon} + \frac{\log 2}{1 - \epsilon}. \]
In a communication setting, where we wish to send a message v ∈ V over a noisy channel, this
result states that if we wish to have vanishing error ϵ → 0, then the maximum number of messages
it is possible to send has log card(V) ≤ I(V ; Y ) + log 2.
An elegant way to derive strong converses is to provide refined versions of Fano’s inequal-
ity. To develop these refinements, we typically consider somewhat more specific settings than the
completely arbitrary Markov chain V → Y → Vb . The most common scenarios are independent
sampling scenarios, where conditional on V = v, we draw Y1n ∈ Y n from a product distribution Pvn .
Letting ϵ be some measure of the probability of error and g(ϵ) be a function for which g(ϵ) < ∞
whenever ϵ < 1, then a typical refinement of Fano’s inequality takes the form
\[ \log \operatorname{card}(\mathcal{V}) \le I(V; Y_1^n) + \widetilde{O}\big(\sqrt{n} \cdot g(\epsilon)\big), \tag{10.3.1} \]
Consider, for example, the case where $\mathcal{X} = \mathcal{Y} = \{0, 1\}$, so that we send bits, and we think of the system as encoding a message
v ∈ V as a bit string x ∈ {0, 1}n , and the channel corrupts each bit xi ∈ {0, 1} independently to
Yi ∈ {0, 1}. Then we expect to be able to encode roughly 2cn messages, for some constant c, and
the information I(V ; Y1n ) should grow linearly in n, as we send n messages. So inequality (10.3.1)
√
would then imply we must have ϵ → 1 whenever log card(V)−I(V ; Y1n ) ≫ n, ignoring logarithmic
factors.
We provide one exemplar result of the form (10.3.1) for finite output spaces $\mathcal{Y}$ in the next
section. We then show how it implies certain high-probability lower bounds for different estimation
problems. In particular, we will show that for many estimation problems, there is some accuracy
threshold ϵn for which the probability of estimating to accuracy at all better than ϵn tends to zero
exponentially quickly.
Proposition 10.3.1 recovers (with the ever-so-slightly stronger quantity log(M + 1) instead of
log M ) the initial (weak) Fano inequality in Corollary 9.4.2. The coming theorem extends this
result when the channel involves repeated independent sampling, where conditional on V = v, we
draw a vector Y1n ∼ Pvn , where Pvn is a product distribution on the output space Y n . As in the
strong converse for hypothesis testing (Proposition 7.3.5), we provide bounds on the probabilities
of “enlarged” subsets of a partition Y n . Note that in this variant, we control the maximal error
rather than the average error present in the weaker Fano inequalities; some weakening like this is
unavoidable (though the proof of that is beyond our scope).
Theorem 10.3.2. Let $V$ be uniform on a set of size $M$ and $\epsilon = \max_v \mathbb{P}(\hat{v}(Y_1^n) \neq v \mid V = v)$ be the maximal error of the estimator $\hat{v}$. Then for all $t \ge \sqrt{8/n}$,
\[ \big(1 - e^{-t^2}\big)\log(M+1) \le I(V; Y_1^n) + \sqrt{\frac{n}{2}}\,\log\big(n\operatorname{card}(\mathcal{Y})\big)\left[t + \sqrt{\log\frac{1}{1-\epsilon}}\right]. \]
Proof We mimic the proof of Proposition 10.3.1, beginning from the inequality (10.3.2). Here, however, instead of taking the functions $g_v$ to be logarithms of the indicators of the partitions $A_v = \{y \in \mathcal{Y}^n \mid \hat{v}(y) = v\}$, we instead consider the $r$-blowups $A_v^r := \{y \in \mathcal{Y}^n \mid d_{\mathrm{ham}}(y, A_v) \le r\}$, and define
\[ g_v(y) := \log\big(\mathbf{1}\{y \in A_v^r\} + \delta\big) \]
for some $\delta > 0$ to be chosen. Let
\[ p_v^r := P_v^n(A_v^r) \quad \text{and} \quad q_v^r := Q(A_v^r) \]
for shorthand. Then because $\mathbb{E}_v[g_v] = p_v^r \log(1+\delta) - (1 - p_v^r)\log\frac{1}{\delta}$ and $\mathbb{E}_Q[e^{g_v}] = q_v^r + \delta$, inequality (10.3.2) implies
\[ I(V; Y_1^n) \ge \frac{1}{M} \sum_v \left[ p_v^r \log(1+\delta) - (1 - p_v^r)\log\frac{1}{\delta} \right] + \log M - \log\Big(\sum_v q_v^r + M\delta\Big). \tag{10.3.3} \]
\[ \sum_v q_v^r = \sum_v Q(A_v^r) = \sum_{y \in \mathcal{Y}^n} \sum_{v : y \in A_v^r} Q(\{y\}). \]
Substituting this into inequality (10.3.3) and letting $\epsilon(r) := \frac{1}{M}\sum_v (1 - p_v^r)$ be the blown-up average "error," we obtain
\[ I(V; Y_1^n) \ge (1 - \epsilon(r))\log(1+\delta) - \epsilon(r)\log\frac{1}{\delta} + \log M - \log\left(\binom{n}{r} \operatorname{card}(\mathcal{Y})^r + \delta M\right). \]
\[ I(V; Y_1^n) \ge \big(1 - e^{-t^2}\big)\log(1 + M) - r\left[\log n + \log\operatorname{card}(\mathcal{Y})\right]. \]
Substitute for $r = r(t)$ above and rearrange, recognizing it is sufficient that $t \ge \sqrt{8/n}$ to guarantee that $r \ge 2$.
The cardinality restriction on $\mathcal{Y}$ in Theorem 10.3.2 is, while inelegant, not typically too onerous—for example, if we represent $Y$ as a 32-bit floating point number, then $\log\operatorname{card}(\mathcal{Y}) \le 32\log 2$. For large enough $n$, we obtain a cleaner statement. Suppose that $\log n + \log\operatorname{card}(\mathcal{Y}) \le 2\log n$; using $\frac{1}{\sqrt{2}-1} = 1 + \sqrt{2}$, it is sufficient that $n \ge \operatorname{card}(\mathcal{Y})^{1+\sqrt{2}}$. Then for all $t \ge \sqrt{8/n}$,
\[ \big(1 - e^{-t^2}\big)\log(M+1) \le I(V; Y_1^n) + t\sqrt{2n}\log n + \sqrt{2n\log^2 n \cdot \log\frac{1}{1-\epsilon}}. \tag{10.3.5} \]
We can also rearrange the refined Fano inequality in Theorem 10.3.2 to provide direct lower bounds on probabilities of error in communication and estimation settings.
We can also rearrange the refined Fano inequality in Theorem 10.3.2 to provide direct lower bounds
on probabilities of error in communication and estimation settings.
Proposition 10.3.3. Let the conditions of Theorem 10.3.2 hold and $n \ge \operatorname{card}(\mathcal{Y})^{\sqrt{2}+1}$. Define
Solve for ϵ.
Theorem 10.3.2 has the weakness that it relies explicitly on the finiteness of the output space
Y. Extensions of the result exist, though their proofs rely on reverse hypercontractivity of Markov
semigroups and related functional inequalities and so are beyond our scope. We assume the same
setting as Theorem 10.3.2, where $V \in \mathcal{V}$ is uniform, and conditional on $V = v$, we draw $Y_i \stackrel{\mathrm{ind}}{\sim} P_{v,i}$ for $i = 1, \ldots, n$, that is, $Y_1^n \sim P_v^n$ for a product distribution $P_v^n$. Then instead of a cardinality bound on $\mathcal{Y}$, we assume there exists a baseline probability measure $P_0$ on $\mathcal{Y}$ providing the uniform likelihood ratio bound
\[ \alpha := \max_i \max_v \left\|\frac{dP_{v,i}}{dP_0}\right\|_\infty < \infty. \]
When Y is finite, we can always take P0 to be uniform on Y, so that α ≤ card(Y). Then Liu et al.
[139, Theorem 3.2] prove the following result:
Theorem 10.3.4. Let $\hat{v} : \mathcal{Y}^n \to \mathcal{V}$ be an estimator of $V \sim \mathsf{Uniform}(\mathcal{V})$, where $M = \operatorname{card}(\mathcal{V})$. Let $\epsilon < 1$ and assume the geometric average of the probabilities of correct estimation satisfies
\[ \left( \prod_{v \in \mathcal{V}} \mathbb{P}\big(\hat{v}(Y_1^n) = v \mid V = v\big) \right)^{1/M} \ge 1 - \epsilon. \]
Then
\[ \log M \le I(V; Y_1^n) + 2\sqrt{n(\alpha - 1)\log\frac{1}{1-\epsilon}} + \log\frac{1}{1-\epsilon}. \]
Theorem 10.3.4 exhibits a better dependence on n and the probability of error ϵ—using the geo-
metric average rather than the maximum probability of error—than Theorem 10.3.2 and so asymp-
totically is stronger.
Y | X = x ∼ Pθ (· | x).
We say that the model has a $\kappa$ quadratic information bound if for each $x \in \mathcal{X}$, there exists a distribution $P_0(\cdot \mid x)$ on $\mathcal{Y}$ such that
\[ D_{\mathrm{kl}}\big(P_\theta(\cdot \mid x) \,\|\, P_0(\cdot \mid x)\big) \le \kappa^2 \langle x, \theta\rangle^2, \tag{10.3.6} \]
refining the local quadratic bound (9.4.6). Many models satisfy such bounds; typical generalized
linear models satisfy it, for example (recall Chapter 3.4). Concretely, Example 3.4.8 shows that for
binary logistic regression of a label y ∈ {0, 1}, the null model P0 that Y ∼ Uniform{0, 1} satisfies
\[ D_{\mathrm{kl}}\big(P_\theta(\cdot \mid x) \,\|\, P_0(\cdot \mid x)\big) \le \min\left\{ \frac{1}{8}(x^\top\theta)^2,\ \log 2 \right\}. \]
8
The key is that, for any model with a quadratic information bound (10.3.6), we can upper bound
the information in estimation problems. We proceed conditionally on the covariates X1n ∈ X n , and
take our packing set V ⊂ {−1, 1}d .
Theorem 10.3.5. Fix a covariate matrix $X = [x_1 \cdots x_n]^\top \in \mathbb{R}^{n\times d}$, and assume the model $P_\theta$ satisfies the quadratic information bound (10.3.6) for $x \in \{x_i\}_{i=1}^n$. Then there exists a numerical constant $c > 0$ such that the following holds. For $\delta \ge 0$, define
\[ \gamma(\delta) := \frac{cd - \kappa^2\delta^2\|X\|_{\mathrm{Fr}}^2}{\sqrt{n\log n}} - \sqrt{2\log 2 \cdot \log d} - \frac{1}{\sqrt{n\log n}}. \]
Then for all $n \ge \operatorname{card}(\mathcal{Y})^{\sqrt{2}+1}$, there exists $\theta$ with $\|\theta\|_2 \le \delta\sqrt{d}$ such that
\[ P_\theta\left( \big\|\hat\theta(Y_1^n) - \theta\big\|_2 \ge \frac{\delta\sqrt{d}}{2} \right) \ge 1 - \exp\big(-[\gamma(\delta)]_+^2\big). \]
Unpacking Theorem 10.3.5, let us assume the covariates are standardized to $x_i \in \{\pm 1\}^d$, so that $\|X\|_{\mathrm{Fr}}^2 = nd$. Then $\gamma(\delta) \ge \frac{cd - \kappa^2\delta^2 nd}{\sqrt{n\log n}} - \sqrt{2\log d}$, and taking $\delta^2 = \frac{c}{2n\kappa^2}$, we obtain
\[ \gamma(\delta) \ge \frac{cd}{2\sqrt{n\log n}} - \sqrt{2\log d}. \]
We thus obtain the following corollary, which shows that there is essentially no probability that an estimator can have accuracy better than $O(1)\sqrt{d/n}$ when the dimension scales so that $d^2/n \gg 1$.

Corollary 10.3.6. In addition to the conditions of Theorem 10.3.5, assume the covariates $x_i \in [-1,1]^d$ and the dimension $d \ge \sqrt{n}\log^3 n$. Then there exists a numerical constant $c > 0$ such that
\[ \sup_{\|\theta\|_2 \le \kappa^{-1}\sqrt{d/n}} P_\theta\left( \big\|\hat\theta_n - \theta\big\|_2 \ge \frac{c}{\kappa}\sqrt{\frac{d}{n}} \right) \ge 1 - n^{-c}. \]
Of course, if the dimension is larger relative to n, we obtain stronger bounds; for example, once
$d \ge \sqrt n \log^4 n$ we obtain
\[
\sup_\theta P_\theta\left( \big\|\hat\theta_n - \theta\big\|_2 \ge \frac{c}{\kappa}\sqrt{\frac dn} \right) \ge 1 - \frac{1}{n^{c\log n}}.
\]
JCD Comment: Specialize to logistic regression for giggles. Commentary that can do
better with bounded likelihood ratios. Maybe add an exercise to that effect? Also clean
up the conditional on x part.
Clean up / connect conditions on d2 /n → ∞ or whatever.
Lemma 10.3.7. There exist numerical constants $0 < c$ and $C < \infty$ such that a packing $\mathcal V \subset \{-1, 1\}^d$ of the hypercube exists with the following properties: its cardinality satisfies $\mathrm{card}(\mathcal V) \ge \exp(cd)$,
\[
\|v - v'\|_2 > \sqrt d \ \text{ for } v \neq v' \in \mathcal V, \qquad\text{and}\qquad \mathbb E[VV^\top] \preceq \Big(1 + C\sqrt{d/e^{cd}}\Big) I_d
\]
for $V \sim \mathrm{Uniform}(\mathcal V)$.
We defer the proof of the lemma temporarily, as it is a straightforward application of the concen-
tration guarantees we have already developed.
Now, consider the refined Fano bound in Theorem 10.3.2. Let V be the packing Lemma 10.3.7
guarantees, and for a δ ≥ 0 to be chosen, set θv = δv for each v ∈ V. Let V ∼ Uniform(V), and
conditional on $V = v$, draw
\[
Y_i \stackrel{\rm ind}{\sim} P_{\theta_v}(\cdot \mid X = x_i).
\]
Let $\hat\theta(Y_1^n)$ be any estimator of $\theta$. Because the $\theta_v$ are well separated, with $\|\theta_v - \theta_{v'}\|_2 > \delta\sqrt d$, we see that
if $\|\hat\theta - \theta_v\|_2 \le \frac{\delta\sqrt d}{2}$, then $\|\hat\theta - \theta_{v'}\|_2 > \frac{\delta\sqrt d}{2}$ for all $v' \neq v$. Thus, the test
\[
\Psi(Y_1^n) := \begin{cases} v & \text{if } \|\hat\theta - \theta_v\|_2 \le \frac{\delta\sqrt d}{2} \\ \text{arbitrary} & \text{otherwise} \end{cases}
\]
satisfies
\[
P(\Psi(Y_1^n) \neq v \mid V = v) \le P\Big( \big\|\hat\theta - \theta_v\big\|_2 > \frac{\delta\sqrt d}{2} \ \Big|\ V = v \Big).
\]
Proposition 10.3.3 then implies that the maximal probability of error $\epsilon := \max_v P(\|\hat\theta - \theta_v\|_2 > \frac{\delta}{2}\sqrt d)$
satisfies $\epsilon \ge 1 - \exp(-[\gamma]_+^2)$ for the quantity $\gamma$ specified in that proposition, where $2^d \ge M := \mathrm{card}(\mathcal V) \ge \exp(cd)$.
For the final step, we upper bound the mutual information. Letting $\bar P^n = \frac{1}{\mathrm{card}(\mathcal V)}\sum_{v\in\mathcal V} P_v^n$ and
$P_0^n$ be any other product distribution, we obtain
\[
I(V; Y_1^n) = \frac{1}{\mathrm{card}(\mathcal V)} \sum_{v\in\mathcal V} D_{\rm kl}\big(P_v^n \,\|\, \bar P^n\big) \le \frac{1}{\mathrm{card}(\mathcal V)} \sum_{v\in\mathcal V} D_{\rm kl}\big(P_v^n \,\|\, P_0^n\big).
\]
and the choice $t = d/2$ gives $P(\min_{i\neq j} \|U_i - U_j\|_2^2 \le d) \le \exp(-d/8 + 2\log N)$. Taking $N = \exp(\frac{d}{16} - \frac12)$ gives that $\|U_i - U_j\|_2 > \sqrt d$ for all $i \neq j$ with probability at least $1 - 1/e$.
To obtain a covariance bound, we use either Proposition 5.1.11 or Proposition 7.1.5, which
for $N \ge \exp(cd)$ guarantees that $\frac1N \sum_{i=1}^N U_i U_i^\top \preceq \big(1 + C\sqrt{(d+t)/N}\big) I_d$ with probability at least
$1 - e^{-t}$.
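The lemma's proof is easy to mimic in simulation (a sketch; the dimension, vector count, and seed are my choices): for random sign vectors, $\|u - v\|_2^2 = 4 \times (\text{number of disagreeing coordinates})$ concentrates sharply around its mean $2d$, so all pairs are separated by more than $\sqrt d$ with overwhelming probability.

```python
import random

random.seed(0)
d, N = 200, 400  # dimension and number of uniformly random sign vectors

U = [[random.choice((-1, 1)) for _ in range(d)] for _ in range(N)]

def sqdist(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

# The squared distances concentrate around E||U_i - U_j||^2 = 2d, so the
# minimum over all ~N^2/2 pairs stays well above d except with tiny probability.
min_sq = min(sqdist(U[i], U[j]) for i in range(N) for j in range(i + 1, N))
assert min_sq > d  # separation: ||U_i - U_j||_2 > sqrt(d) for every pair
```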
JCD Comment: Here, put a strong converse for channel coding. Also put a strong
converse for estimation in a d2 /n → ∞ model for, e.g., logistic regression.
Is it possible that the paper by Liu, van Handel, and Verdu gives better than the d2 /n log n
convergence?
10.4 Exercises
JCD Comment: Some things to either do as exercises or include in the actual note
Chapter 11
In this chapter, we revisit our minimax bounds in the context of what we term constrained risk
inequalities. While the minimax risk provides a first approach to establishing fundamental limits
on procedures, its reliance on the collection of all measurable functions as its class of potential
estimators is somewhat limiting. Indeed, in most statistical and statistical learning problems, we
have some type of constraint on our procedures: they must be efficiently computable, they must
work with data arriving in a sequential stream, they must be robust, or they must protect the
privacy of the providers of the data. In modern computational hardware, where physical limits
prevent increasing clock speeds, we may like to use as much parallel computation as possible,
though there are potential tradeoffs between “sequentialness” of procedures and their parallelism.
With this as context, we replace the minimax risk of Chapter 9.1 with the constrained mini-
max risk, which, given a collection C of possible procedures—private, communication limited, or
otherwise—defines
\[
\mathfrak M(\theta(\mathcal P), \Phi \circ \rho, \mathcal C) := \inf_{\hat\theta \in \mathcal C} \sup_{P \in \mathcal P} \mathbb E_P\Big[\Phi\big(\rho(\hat\theta(X), \theta(P))\big)\Big], \tag{11.0.1}
\]
where as in the original defining equation (9.1.1) of the minimax risk, Φ : R+ → R+ is a nondecreas-
ing loss, ρ is a semimetric on the space Θ, and the expectation is taken over the sample X ∼ P .
In this chapter, we study the quantity (11.0.1) via a few examples, highlighting possibilities and
challenges with its analysis. We will focus on a restricted class of examples—many procedures do
not fall in the framework we consider—that assumes, given a sample X1 , . . . , Xn , we can represent
the class C of estimators under consideration as acting on some view or processed version Zi of
Xi . This allows us to study communication complexity, memory complexity, and certain private
estimators.
the marginal distribution $M_v(A) := \int Q(A \mid x) dP_v(x)$. The channel $Q$ satisfies a strong data
processing inequality with constant $\alpha \le 1$ for the given $f$-divergence if $D_f(M_0\|M_1) \le \alpha D_f(P_0\|P_1)$
for any choice of $P_0, P_1$ on $\mathcal X$. For any such $f$, we define the $f$-strong data processing constant
\[
\alpha_f(Q) := \sup_{P_0 \neq P_1} \frac{D_f(M_0\|M_1)}{D_f(P_0\|P_1)}.
\]
These types of inequalities are common throughout information and probability theory. Perhaps
their most frequent use is in the development of conditions for the fast mixing of Markov chains.
Indeed, suppose the Markov kernel $Q$ satisfies a strong data processing inequality with constant $\alpha$
with respect to variation distance. If $\pi$ denotes the stationary distribution of the Markov kernel $Q$
and we use the operator $\circ$ to denote one step of the Markov kernel,1
\[
Q \circ P := \int Q(\cdot \mid x) dP(x),
\]
then $\|Q \circ P - \pi\|_{\rm TV} = \|Q \circ P - Q \circ \pi\|_{\rm TV} \le \alpha \|P - \pi\|_{\rm TV}$,
because $Q \circ \pi = \pi$ by definition of the stationary distribution; iterating over $n$ steps of the kernel gives contraction at rate $\alpha^n$. Thus, the Markov chain enjoys
geometric mixing.
To that end, a common quantity of interest is the Dobrushin coefficient
\[
\alpha(Q) := \sup_{x, y} \|Q(\cdot \mid x) - Q(\cdot \mid y)\|_{\rm TV},
\]
which immediately implies mixing rates.
The Dobrushin coefficient satisfies many properties, some of which we discuss in the exercises and
others of which we enumerate here. The first is the following.
Proposition 11.1.1. The Dobrushin coefficient is the strong data processing constant for the variation distance, that is,
\[
\alpha_{\rm TV}(Q) = \sup_{P_0 \neq P_1} \frac{\|Q \circ P_0 - Q \circ P_1\|_{\rm TV}}{\|P_0 - P_1\|_{\rm TV}}.
\]
Proof   There are two directions to the proof; one easy and one more challenging. For the easy
direction, we see immediately that if $1_x$ and $1_y$ denote point masses at $x$ and $y$, then $Q \circ 1_x = Q(\cdot \mid x)$ and $\|1_x - 1_y\|_{\rm TV} = 1$ for $x \neq y$, so
\[
\sup_{P_0 \neq P_1} \frac{\|Q \circ P_0 - Q \circ P_1\|_{\rm TV}}{\|P_0 - P_1\|_{\rm TV}} \ge \sup_{x,y} \|Q(\cdot \mid x) - Q(\cdot \mid y)\|_{\rm TV}.
\]
The other direction—that $\|Q \circ P_0 - Q \circ P_1\|_{\rm TV} \le \alpha_{\rm TV} \|P_0 - P_1\|_{\rm TV}$—is more challenging.
For this, recall Lemma 2.2.4 characterizing the variation distance, and let $Q_\star(A) := \inf_y Q(A \mid y)$.
Then by definition of the Dobrushin coefficient $\alpha = \alpha_{\rm TV}(Q)$, we evidently have $|Q(A \mid x) - Q_\star(A)| \le \alpha$. Let $M_v = \int Q(\cdot \mid x) dP_v(x)$ for $v \in \{0, 1\}$. By expanding $dP_0 - dP_1$ into its positive and negative
parts, we thus obtain
\begin{align*}
M_0(A) - M_1(A) &= \int Q(A \mid x)(dP_0 - dP_1)(x) \\
&= \int Q(A \mid x) [dP_0(x) - dP_1(x)]_+ - \int Q(A \mid x) [dP_1(x) - dP_0(x)]_+ \\
&\le \int Q(A \mid x) [dP_0(x) - dP_1(x)]_+ - Q_\star(A) \int [dP_1(x) - dP_0(x)]_+ \\
&= \int Q(A \mid x) [dP_0(x) - dP_1(x)]_+ - Q_\star(A) \int [dP_0(x) - dP_1(x)]_+,
\end{align*}
where the final equality uses Lemma 2.2.4. But of course we then obtain
\[
M_0(A) - M_1(A) \le \int \big(Q(A \mid x) - Q_\star(A)\big) [dP_0(x) - dP_1(x)]_+ \le \alpha \int [dP_0 - dP_1]_+ = \alpha \|P_0 - P_1\|_{\rm TV},
\]
where the inequality follows as $0 \le Q(A \mid x) - Q_\star(A) \le \alpha$ and the equality is one of the characterizations of the total variation distance in Lemma 2.2.4.
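A small numerical check of Proposition 11.1.1 (a sketch; the kernel is arbitrary and the function names are mine): compute the Dobrushin coefficient as the maximal TV distance between rows of a finite kernel, then verify the contraction $\|Q \circ P_0 - Q \circ P_1\|_{\rm TV} \le \alpha_{\rm TV}(Q)\|P_0 - P_1\|_{\rm TV}$ on random input distributions.

```python
import itertools
import random

def tv(p, q):
    """Total variation distance between two finite p.m.f.s."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def apply_kernel(Q, p):
    """One Markov step: (Q o P)(z) = sum_x P(x) Q(z | x)."""
    return [sum(p[x] * Q[x][z] for x in range(len(p))) for z in range(len(Q[0]))]

# Rows of Q are the conditional distributions Q(. | x).
Q = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]

# Dobrushin coefficient: alpha(Q) = sup_{x, y} ||Q(.|x) - Q(.|y)||_TV.
alpha = max(tv(Q[x], Q[y]) for x, y in itertools.combinations(range(len(Q)), 2))

random.seed(1)
for _ in range(1000):
    raw0 = [random.random() for _ in range(3)]
    raw1 = [random.random() for _ in range(3)]
    p0 = [v / sum(raw0) for v in raw0]
    p1 = [v / sum(raw1) for v in raw1]
    # Contraction guaranteed by Proposition 11.1.1.
    assert tv(apply_kernel(Q, p0), apply_kernel(Q, p1)) <= alpha * tv(p0, p1) + 1e-12
```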
A more substantial fact is that the Dobrushin coefficient upper bounds every other strong data
processing constant.
Theorem 11.1.2. Let $f : \mathbb R_+ \to \mathbb R \cup \{\infty\}$ satisfy $f(1) = 0$. Then for any channel $Q$, $\alpha_f(Q) \le \alpha_{\rm TV}(Q)$.
The theorem is roughly a consequence of a few facts. First, Proposition 11.1.1 holds. Second,
without loss of generality we may assume that $f \ge 0$: replacing $f(t)$ with $h(t) = f(t) - f'(1)(t - 1)$
for any $f'(1) \in \partial f(1)$, we have $h \ge 0$ as $0 \in \partial h(1)$, and $D_h = D_f$. Third, any $f \ge 0$ with $0 \in \partial f(1)$
can be approximated arbitrarily accurately by functions of the form
\[
h(t) = \sum_{i=1}^k a_i [t - c_i]_+ + \sum_{i=1}^k b_i [d_i - t]_+, \quad \text{where } c_i \ge 1 \text{ and } d_i \le 1.
\]
For such $h$, an argument shows that $\alpha_h(Q) \le \alpha_{\rm TV}(Q)$,
which follows from the similarities between variation distance, with $f(t) = \frac12|t - 1|$, and the positive
part functions $[\cdot]_+$.
There is a related result, which we do not prove, that guarantees that strong data processing
constants for $\chi^2$-divergences are the "worst" constants. In particular, if $QP = \int Q(\cdot \mid x)dP(x)$
denotes the application of one step of a channel $Q$ to $X \sim P$, then the $\chi^2$ contraction coefficient is
\[
\alpha_{\chi^2}(Q) = \sup_{P_0 \neq P_1} \frac{D_{\chi^2}(QP_0 \| QP_1)}{D_{\chi^2}(P_0 \| P_1)}.
\]
Then it is possible to show that for any twice continuously differentiable $f$ on $\mathbb R_{++}$ with $f''(1) > 0$, we have $\alpha_{\chi^2}(Q) \le \alpha_f(Q)$,
and we also have αχ2 (Q) = αkl (Q), so that the strong data processing inequalities for KL-divergence
and χ2 -divergence coincide.
In our context, that of (constrained) minimax lower bounds, such data processing inequalities
immediately imply somewhat sharper lower bounds than the (unconstrained) applications in previ-
ous chapters. Indeed, let us revisit the situation present in the local Fano bound, where the KL
divergence has a Euclidean structure as in the bound (9.4.6), meaning that $D_{\rm kl}(P_0\|P_1) \le \kappa^2\delta^2$ when
our parameters of interest θv = θ(Pv ) satisfy ρ(θ0 , θ1 ) ≤ δ. We assume that the constraints C impose
that the data Xi is passed through a channel Q with KL-data processing constant αKL (Q) ≤ 1. In
this case, in the basic Le Cam’s method (9.3.2), an application of Pinsker’s inequality yields that
whenever ρ(θ0 , θ1 ) ≥ 2δ then
\[
\mathfrak M_n(\theta(\mathcal P), \Phi \circ \rho, \mathcal C) \ge \frac{\Phi(\delta)}{2}\left(1 - \sqrt{\frac n2 D_{\rm kl}(M_0\|M_1)}\right) \ge \frac{\Phi(\delta)}{2}\left[1 - \sqrt{n\kappa^2\alpha_{\rm KL}(Q)\delta^2/2}\right],
\]
and the “standard” choice of δ to make the probability of error constant results in δ 2 = (2nκ2 αKL (Q))−1 ,
or the minimax lower bound
\[
\mathfrak M_n(\theta(\mathcal P), \Phi \circ \rho, \mathcal C) \ge \frac14 \Phi\left(\frac{1}{\sqrt{2n\kappa^2\alpha_{\rm KL}(Q)}}\right),
\]
which suggests an effective sample size degradation of n 7→ nαKL (Q). Similarly, in the local Fano
method in Chapter 9.4.1, we see identical behavior and an effective sample size degradation of
n 7→ nαKL (Q), that is, if without constraints a sample size of n(ϵ) is required to achieve some
desired accuracy ϵ, with the constraint a sample size of at least n(ϵ)/αKL (Q) is necessary.
In particular, Theorem 11.2.1 shows that if the channel $Q$ is $\varepsilon$-differentially private, then for any $P_0$ and $P_1$ the induced marginals satisfy
\[
D_{\rm kl}(M_0\|M_1) + D_{\rm kl}(M_1\|M_0) \le 4(e^\varepsilon - 1)^2 \|P_0 - P_1\|_{\rm TV}^2.
\]
Figure 11.1. The sequentially interactive private observation model: the ith output $Z_i$ may depend
on $X_i$ and the previously released $Z_1^{i-1}$.
Proof   Without loss of generality, we assume that the output space $\mathcal Z$ is finite (by definition (2.2.3)), and let $m_v(z)$ and $q(z \mid x)$ be the p.m.f.s of $M_v$ and $Q$, respectively, and let $P_0$
and $P_1$ have densities $p_0$ and $p_1$ with respect to a measure $\mu$. Then
\[
D_{\rm kl}(M_0\|M_1) + D_{\rm kl}(M_1\|M_0) = \sum_z (m_0(z) - m_1(z)) \log \frac{m_0(z)}{m_1(z)}.
\]
To control the difference $m_0(z) - m_1(z)$, note that for any fixed $x_0 \in \mathcal X$ we have
\[
\int_{\mathcal X} q(z \mid x_0)(p_0(x) - p_1(x))d\mu(x) = 0.
\]
Thus
\[
m_0(z) - m_1(z) = \int_{\mathcal X} (q(z \mid x) - q(z \mid x_0))(p_0(x) - p_1(x))d\mu(x),
\]
and so
\[
|m_0(z) - m_1(z)| \le \sup_{x\in\mathcal X} |q(z \mid x) - q(z \mid x_0)| \int_{\mathcal X} |p_0(x) - p_1(x)|d\mu(x) = 2 q(z \mid x_0) \sup_{x\in\mathcal X} \left|\frac{q(z \mid x)}{q(z \mid x_0)} - 1\right| \cdot \|P_0 - P_1\|_{\rm TV}.
\]
By definition of local differential privacy, $\big|\frac{q(z \mid x)}{q(z \mid x_0)} - 1\big| \le e^\varepsilon - 1$, and as $x_0$ was arbitrary we obtain
\[
|m_0(z) - m_1(z)| \le 2(e^\varepsilon - 1) \inf_x q(z \mid x) \cdot \|P_0 - P_1\|_{\rm TV}.
\]
Noting that $\inf_x q(z \mid x) \le \min\{m_0(z), m_1(z)\}$, we obtain the theorem.
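To see the theorem concretely, here is a sketch (the channel choice and names are mine) using randomized response, a standard $\varepsilon$-locally private channel that keeps a bit with probability $e^\varepsilon/(1 + e^\varepsilon)$ and flips it otherwise; for Bernoulli inputs the induced marginals obey the symmetrized-KL bound.

```python
import math

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

eps = 1.0
keep = math.exp(eps) / (1 + math.exp(eps))  # P(Z = X): eps-DP randomized response

def marginal(p):
    """m(1) = P(Z = 1) when X ~ Bernoulli(p) passes through randomized response."""
    return p * keep + (1 - p) * (1 - keep)

for p0 in (0.05, 0.2, 0.5, 0.8):
    for p1 in (0.1, 0.4, 0.9):
        m0, m1 = marginal(p0), marginal(p1)
        sym_kl = kl_bernoulli(m0, m1) + kl_bernoulli(m1, m0)
        tv = abs(p0 - p1)  # TV distance between Bernoulli(p0) and Bernoulli(p1)
        # Theorem 11.2.1: symmetrized KL <= 4 (e^eps - 1)^2 * TV^2.
        assert sym_kl <= 4 * (math.exp(eps) - 1) ** 2 * tv ** 2 + 1e-12
```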
To be able to apply this result to obtain minimax lower bounds for estimation as in Section 9.3, we need to address samples drawn from product distributions, even with the potential
interaction (11.2.1). In this case, we consider sequential samples $Z_i \sim Q(\cdot \mid X_i, Z_1^{i-1})$ and define
$M_v^n = \int Q(\cdot \mid x_1^n) dP_v^n(x_1^n)$ to be the marginal distribution over all the $Z_1^n$. Then we have the
following corollary.
Corollary 11.2.2. Assume that each channel $Q(\cdot \mid X_i, Z_1^{i-1})$ is $\varepsilon_i$-differentially private. Then
\[
D_{\rm kl}(M_0^n \| M_1^n) \le 4 \sum_{i=1}^n (e^{\varepsilon_i} - 1)^2 \|P_0 - P_1\|_{\rm TV}^2.
\]
Proof   Recalling the chain rule (2.1.6) for the KL-divergence, we have
\[
D_{\rm kl}(M_0^n \| M_1^n) = \sum_{i=1}^n \mathbb E_{M_0}\Big[ D_{\rm kl}\big( M_{0,i}(\cdot \mid Z_1^{i-1}) \,\|\, M_{1,i}(\cdot \mid Z_1^{i-1}) \big) \Big],
\]
where the outer expectation is taken over $Z_1^{i-1}$ drawn marginally from $M_0^n$, and $M_{v,i}(\cdot \mid z_1^{i-1})$
denotes the conditional distribution on $Z_i$ given $Z_1^{i-1} = z_1^{i-1}$ when $X_1^n \stackrel{\rm iid}{\sim} P_v$. Writing this distribution out, we note that $Z_i$ is conditionally independent of $X_{\setminus i}$ given $X_i$ and $Z_1^{i-1}$ by construction,
so for any set $A$
\begin{align*}
M_{v,i}(A \mid z_1^{i-1}) &= \int Q(Z_i \in A \mid x_1^n, z_1^{i-1})dP_v(x_1^n \mid z_1^{i-1}) = \int Q(Z_i \in A \mid x_i, z_1^{i-1})dP_v(x_1^n \mid z_1^{i-1}) \\
&= \int Q(Z_i \in A \mid x_i, z_1^{i-1})dP_v(x_i).
\end{align*}
Now we know that $Q(Z_i \in \cdot \mid x_i, z_1^{i-1})$ is $\varepsilon_i$-differentially private by assumption, so Theorem 11.2.1
gives
\[
D_{\rm kl}\big( M_{0,i}(\cdot \mid z_1^{i-1}) \,\|\, M_{1,i}(\cdot \mid z_1^{i-1}) \big) \le 4(e^{\varepsilon_i} - 1)^2 \|P_0 - P_1\|_{\rm TV}^2
\]
for any realization $z_1^{i-1}$ of $Z_1^{i-1}$. Iterating this gives the result.
Local privacy is such a strong condition on the channel $Q$ that it effectively "transforms" the
KL-divergence into a variation distance: even if two distributions $P_0$ and $P_1$ have infinite
KL-divergence $D_{\rm kl}(P_0\|P_1) = +\infty$—for example, if their supports are not completely overlapping—
their induced marginals have the much smaller divergence $D_{\rm kl}(M_0\|M_1) \le 4(e^\varepsilon - 1)^2 \|P_0 - P_1\|_{\rm TV}^2 \le 4(e^\varepsilon - 1)^2$. This transformation into a different metric means that even estimation problems that
should on their face be easy become quite challenging under local privacy constraints; for example,
the minimax squared error for estimating the mean of a random variable with finite variance scales as
$1/\sqrt n$ rather than the typical $1/n$ scaling in non-private cases (see Exercise 11.4).
Let us demonstrate how to apply Corollary 11.2.2 in a few applications. Our main object of
interest is the private analogue of the minimax risk (9.1.1), where for a parameter $\theta : \mathcal P \to \Theta$,
semimetric $\rho$, and loss $\Phi$, for a family of channels $\mathcal Q$ we define the channel-constrained minimax
risk
\[
\mathfrak M_n(\theta(\mathcal P), \Phi \circ \rho, \mathcal Q) := \inf_{\hat\theta_n} \inf_{Q\in\mathcal Q} \sup_{P\in\mathcal P} \mathbb E_{P,Q}\Big[\Phi\big(\rho(\hat\theta_n(Z_1^n), \theta(P))\big)\Big]. \tag{11.2.2}
\]
When we take Q = Qε to be the collection of ε-locally differentially private (interactive) chan-
nels (11.2.1), we obtain the ε-locally private minimax risk.
A few examples showing lower (and upper) bounds for the private minimax risk (11.2.2) in
mean estimation follow.
Example 11.2.3 (Bounded mean estimation): Let $\mathcal P$ be the collection of distributions with
supports on $[-b, b]$, where $0 < b < \infty$. Then for any $\varepsilon \ge 0$, the minimax squared error satisfies
\[
\mathfrak M_n(\theta(\mathcal P), (\cdot)^2, \mathcal Q_\varepsilon) \gtrsim \frac{b^2}{(e^\varepsilon - 1)^2 n} + \frac{b^2}{n}.
\]
The second term in the bound is the classic minimax rate for this collection of distributions.
To see the first term, take Bernoulli distributions $P_0$ and $P_1 \in \mathcal P$, where for some $\delta \ge 0$
to be chosen, under $P_0$ we have $X = b$ with probability $\frac{1-\delta}{2}$ and $-b$ otherwise, while under
$P_1$ we have $X = b$ with probability $\frac{1+\delta}{2}$ and $X = -b$ otherwise. Then $\|P_0 - P_1\|_{\rm TV} = \delta$,
$\mathbb E_1[X] - \mathbb E_0[X] = 2b\delta$, and by Le Cam's method (9.3.3), for any $\varepsilon$-locally private channel $Q$
and induced marginals $M_0^n, M_1^n$ as in Corollary 11.2.2, we have
\begin{align*}
\mathfrak M_n(\theta(\mathcal P), (\cdot)^2, \{Q\}) &\ge \frac{b^2\delta^2}{2}\left(1 - \sqrt{\frac12 D_{\rm kl}(M_0^n\|M_1^n)}\right) \ge \frac{b^2\delta^2}{2}\left(1 - \sqrt{2(e^\varepsilon - 1)^2 n \|P_0 - P_1\|_{\rm TV}^2}\right) \\
&= \frac{b^2\delta^2}{2}\left(1 - \sqrt{2(e^\varepsilon - 1)^2 n\delta^2}\right).
\end{align*}
Setting $\delta^2 = \frac{1}{8n(e^\varepsilon - 1)^2}$ gives the claimed minimax bound. ◊
Effectively, then, we see a reduction in the effective sample size: when ε is large, there is no change,
but otherwise, the estimation error is similar to that when we observe a sample of size nε2 .
In both the preceding examples, a number of simple estimators achieve the given minimax rates.
The simplest is one based on the Laplace mechanism (Example 8.1.3): let $W_i \stackrel{\rm iid}{\sim} \mathrm{Laplace}(1)$, and
set $Z_i = X_i + \frac{2b}{\varepsilon}W_i$ in Example 11.2.3 and $Z_i = X_i + \frac2\varepsilon W_i$ in Example 11.2.4. In the former, define
$\hat\theta_n = \bar Z_n$ to be the mean of the $Z_i$; in the latter, $\mathbb E[\bar Z_n] = \frac{\theta+1}{2}$, so $\hat\theta_n = 2\bar Z_n - 1$ achieves the minimax rate.
More extreme examples are possible. Consider, for example, the problem of testing the support
of a distribution, where we care only about distinguishing two distributions.
Example 11.2.5 (Support testing): Consider the problem of testing between the supports
of two uniform distributions: given $n$ observations, we wish to test whether $P = P_0 = \mathrm{Uniform}[0, 1]$ or $P = P_1 = \mathrm{Uniform}[\theta, 1]$ for some $\theta \in (0, 1)$. We can ask the rate at which
we may take $\theta \downarrow 0$ with $n$ while still achieving non-trivial testing power. Without privacy, a
simple (and optimal) test $\Psi$ is to check whether any observation $X_i < \theta$, in which case
we trivially accept $P_0$ and reject $P_1$, otherwise accepting $P_1$. Then
\[
P_0(\Psi = 1) + P_1(\Psi = 0) = (1 - \theta)^n \le \exp(-\theta n),
\]
so non-trivial power is possible once $\theta \gg 1/n$. Under $\varepsilon$-local privacy, by contrast, $\|P_0 - P_1\|_{\rm TV} = \theta$, so Corollary 11.2.2 gives $D_{\rm kl}(M_0^n\|M_1^n) \le c_\varepsilon n\theta^2$, where $c_\varepsilon := 4(e^\varepsilon - 1)^2$.
Whenever $\theta \ll \frac{1}{\sqrt n}$, we have $\|M_0^n - M_1^n\|_{\rm TV} \to 0$, and so for any test based on the private data
$Z_1^n$, the probabilities of error satisfy
\[
\inf_\Psi \big\{P_0(\Psi(Z_1^n) \neq 0) + P_1(\Psi(Z_1^n) \neq 1)\big\} \ge 1 - \sqrt{1 - \exp(-c_\varepsilon n\theta^2)}.
\]
In the range $\frac1n \ll \theta \ll \frac{1}{\sqrt n}$, then, there is an essentially exponential
gap between the non-private and private cases. ◊
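The gap is easy to see numerically (a sketch; the particular $n$, $\varepsilon$, and $\theta = 10/n$ are my choices): at this separation the non-private test errs with vanishing probability, while the private lower bound forces constant error.

```python
import math

n, eps = 10000, 1.0
c_eps = 4 * (math.exp(eps) - 1) ** 2
theta = 10.0 / n  # a separation between 1/n and 1/sqrt(n)

nonprivate_error = (1 - theta) ** n  # = P0(Psi = 1) + P1(Psi = 0) <= exp(-theta n)
private_lower_bound = 1 - math.sqrt(1 - math.exp(-c_eps * n * theta ** 2))

assert nonprivate_error < 1e-4    # non-private testing succeeds easily
assert private_lower_bound > 0.5  # private testing fails with constant probability
```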
Figure 11.2. A communication tree representing testing equality for 2-dimensional bit strings
$x, y \in \{0, 1\}^2$. Internal nodes labeled $a_j$ communicate the jth bit $a_j(x) = x_j$ of $x$, while internal
nodes labeled $b_j$ communicate the jth bit $b_j(y) = y_j$ of $y$. The maximum number of messages is 4.
(A more efficient protocol is to have Alice send the entire string $x \in \{0, 1\}^n$, then for Bob to check
equality $x = y$ and output "Yes" or "No.")
Definition 11.3. A protocol Π over a domain X × Y with output space Z is a binary tree, where
each internal node v is labeled with a mapping av : X → {0, 1} or bv : Y → {0, 1} and each leaf is
labeled with a value z ∈ Z.
Then to execute a communication protocol Π on input (x, y), we walk down the tree: beginning
at the root node, for each internal node v labeled av (an Alice node) we walk left if av (x) = 0 and
right if av (x) = 1, and for each node v labeled bv (a Bob node) we walk left if bv (y) = 0 and right if
bv (y) = 1. Then the communication cost of the protocol Π is the height of the tree, which we denote
by depth(Π). Figure 11.2 shows an example for testing the equality x = y of two 2-dimensional bit
strings x, y ∈ {0, 1}2 .
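This walk is easy to implement directly (a sketch; the tuple-based tree encoding and names are mine). The following reproduces the equality tree of Figure 11.2 for 2-bit strings and checks it on all 16 input pairs:

```python
# A protocol tree: leaves carry outputs; internal nodes carry (who, j),
# meaning Alice ('A') or Bob ('B') announces bit j of her/his input.
def leaf(z):
    return ("leaf", z)

def node(who, j, left, right):
    return ("node", who, j, left, right)

def run(tree, x, y):
    """Walk the tree; return (output, number of bits communicated)."""
    bits = 0
    while tree[0] == "node":
        _, who, j, left, right = tree
        bit = x[j] if who == "A" else y[j]
        tree = right if bit else left
        bits += 1
    return tree[1], bits

# Equality on 2-bit strings, as in Figure 11.2: compare bit j of x and y,
# short-circuiting to "no" on any mismatch.
def eq_subtree(j, rest):
    return node("A", j, node("B", j, rest, leaf("no")),
                        node("B", j, leaf("no"), rest))

tree = eq_subtree(0, eq_subtree(1, leaf("yes")))

for x0 in (0, 1):
    for x1 in (0, 1):
        for y0 in (0, 1):
            for y1 in (0, 1):
                out, bits = run(tree, (x0, x1), (y0, y1))
                assert (out == "yes") == ((x0, x1) == (y0, y1))
                assert bits <= 4  # depth(Pi) = 4
```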
In classical communication complexity, the main questions center around the communication
complexity of a function f : X × Y → Z, which is the length of the shortest protocol that computes f
correctly on all inputs: letting Πout (x, y) denote the final output of the protocol Π on inputs (x, y),
this is
CC(f ) := inf {depth(Π) | Πout (x, y) = f (x, y) for all x ∈ X , y ∈ Y} .
In many cases, it is useful to allow randomized communication protocols, which tolerate some
probability of error; in this case, we let Alice and Bob each have access to an arbitrary amount
of randomness, which we can identify without loss of generality with uniform random variables
277
Lexture Notes on Statistics and Information Theory John Duchi
$U_a, U_b \stackrel{\rm iid}{\sim} \mathrm{Uniform}[0, 1]$, and the nodes $a_v$ and $b_v$ in Definition 11.3 are then mappings $a_v : \mathcal X \times [0,1] \to \{0,1\}$ and $b_v : \mathcal Y \times [0,1] \to \{0,1\}$, and they calculate $a_v(\cdot, U_a)$ and $b_v(\cdot, U_b)$, respectively. Abusing
notation slightly by leaving this randomness implicit, the randomized communication complexity
for an accuracy δ is then the length of the shortest randomized protocol that calculates f (x, y)
correctly with probability at least 1 − δ, that is,
RCCδ (f ) := inf {depth(Π) | P(Πout (x, y) ̸= f (x, y)) ≤ δ for all x ∈ X , y ∈ Y} . (11.3.1)
In the definition (11.3.1), we leave the randomization in $\Pi$ implicit, and note that we require that
the tree it induces still has bounded depth. We note that essentially any choice of $\delta > 0$ is
immaterial, as we always have
\[
RCC_\delta(f) \le O(1) \log\frac1\delta \cdot RCC_{1/3}(f),
\]
making all (constant) probability of error complexities essentially equivalent. (See Exercise 11.7.)
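The amplification behind this inequality is plain repetition with a majority vote; a quick sketch (exact binomial sums rather than simulation; names are mine) shows the error of a 1/3-error protocol decaying in the number k of independent repetitions:

```python
from math import comb

def majority_error(k, delta=1/3):
    """P(majority of k independent runs is wrong) when each run errs w.p. delta."""
    return sum(comb(k, j) * delta**j * (1 - delta)**(k - j)
               for j in range(k // 2 + 1, k + 1))

# Repeating an RCC_{1/3} protocol O(log(1/delta)) times drives error below delta.
errs = [majority_error(k) for k in (1, 5, 15, 31)]
assert all(a > b for a, b in zip(errs, errs[1:]))  # strictly decreasing in k
assert majority_error(15) < 0.1
```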
There are variants of randomized complexity that allow public randomness rather than pri-
vate randomness, which can yield simpler algorithms and somewhat reduced complexity, but this
improvement is limited, as Alice and Bob can always essentially simulate public randomness (see
Exercise 11.8). Letting Ppub be the collection of protocols in which both Alice and Bob have access
to a shared random variable U ∼ Uniform[0, 1], we make the obvious extension
\[
RCC^{\rm pub}_\delta(f) := \inf_{\Pi\in\mathcal P_{\rm pub}} \big\{\mathrm{depth}(\Pi) \mid P(\Pi_{\rm out}(x, y, U) \neq f(x, y)) \le \delta \text{ for all } x \in \mathcal X, y \in \mathcal Y\big\}.
\]
\[
IC_\delta(f) := \sup \inf \big\{I(X, Y; \Pi(X, Y)) \mid P(\Pi_{\rm out}(x, y) \neq f(x, y)) \le \delta \text{ for all } x \in \mathcal X, y \in \mathcal Y\big\}, \tag{11.3.3}
\]
where the supremum is taken over joint distributions on (X, Y ), the infimum over randomized
protocols Π, and the right probability P is over any randomness in Π. There is a subtlety in this
definition: we require Π to be accurate on all inputs (x, y), not just with probability over the
distribution on (X, Y ) in the information measure I(X, Y ; Π(X, Y )). Relaxations to distributional
variants of the information complexity (11.3.3) are also natural, as in the definition (11.3.2). Thus
we sometimes consider the distributional information complexity
The different notions of communication complexity satisfy a natural ordering, making proving
lower bounds for some notions (or conversely, developing low-communication methods for different
protocols) much easier or harder than others. We record the standard inequalities in the coming
proposition, which essentially follows immediately from the operational interpretation of entropy as
the average length of the best encoding of a signal (Section 2.4.1).
Proposition 11.3.1. For any function $f$, $\delta \in (0, 1)$, and probability measure $\mu$ on $\mathcal X \times \mathcal Y$,
\[
CC(f) \ge RCC_\delta(f) \ge RCC^{\rm pub}_\delta(f),
\]
and
\[
RCC_\delta(f) \ge IC_\delta(f).
\]
Proof   The first two inequalities are immediate. By Theorem 2.4.3, we have $I(X, Y; \Pi(X, Y)) \le H(\Pi(X, Y)) \le \mathrm{depth}(\Pi)$, which gives $RCC_\delta(f) \ge IC_\delta(f)$. Finally, for a protocol $\Pi$ with public randomness $U$ and probability of error at most $\delta$ on all inputs, taking expectations over $U$ and $(X, Y) \sim \mu$ gives average error at most $\delta$;
that is, there must be at least some $u$ achieving the average error of $\Pi$, and the protocol $\Pi$ is
deterministic given $u$. So any protocol $\Pi$ using public randomness to achieve probability of error $\delta$
can be modified into a deterministic protocol $\Pi(\cdot, \cdot, u)$ that achieves $\mu$-probability of error $\delta$.2
Proposition 11.3.2. Let $v$ be a node in a deterministic protocol $\Pi$ and $R_v$ be those pairs $(x, y)$
reaching node $v$. Then $R_v$ is a rectangle.
Proof   We prove the result by induction. Certainly, for the root node $v$, we have $R_v = \mathcal X \times \mathcal Y$,
which is a rectangle. Now, let $v$ be an arbitrary (non-root) node in the tree and $w$ its parent; assume
w.l.o.g. that $v$ is the left child of $w$ and that at $w$, Alice speaks (that is, we use $a_w : \mathcal X \to \{0,1\}$).
Then $R_w = A \times B$ by the inductive assumption, and
\[
R_v = \{(x, y) \in A \times B \mid a_w(x) = 0\} = \{x \in A \mid a_w(x) = 0\} \times B,
\]
which is a rectangle.
The structure of rectangles for correct protocols thus naturally determines the communication
complexity of a function f . For a set R ⊂ X × Y, we say R is f -constant if f (x, y) = f (x′ , y ′ ) for
all (x, y) ∈ R and (x′ , y ′ ) ∈ R. Thus, any correct protocol Π necessarily partitions X × Y into a
collection of f -constant rectangles, where we identify the rectangles with the leaves l of the protocol
tree. In particular, Proposition 11.3.2 implies the following corollary.
Corollary 11.3.3. Let N be the size of the minimal partition of X × Y into f -constant rectangles.
Then CC(f ) ≥ log2 N .
Proof Any correct protocol Π partitions X × Y into the f -constant rectangles {Rl } indexed by
its leaves l. The minimal depth of a binary tree with at least N leaves is log2 N .
A related corollary follows by considering fooling sets, which are essentially sets of which no
f-constant rectangle can contain more than a single element.
Definition 11.4 (Fooling sets). A set S ⊂ X × Y is a fooling set for f if for any two pairs
(x0 , y0 ) ∈ S and (x1 , y1 ) ∈ S satisfying f (x0 , y0 ) = f (x1 , y1 ), at least one of the inequalities
f (x0 , y1 ) ̸= f (x0 , y0 ) or f (x1 , y0 ) ̸= f (x0 , y0 ) holds.
With this definition, the next corollary is almost immediate.
Corollary 11.3.4. Let f have a fooling set S of size N . Then CC(f ) ≥ log2 N .
Proof By definition, no f -constant rectangle contains more than a single element of S. So the
tree associated with any correct protocol Π has a single leaf for each element of S.
An extension of the fooling set idea is the rectangle measure method, which proves that (for
some probability measure P ) the “size” of f -constant rectangles is small. By judicious choice of
the probability, we can then demonstrate lower bounds.
Proposition 11.3.5. Let $P$ be a probability distribution on $\mathcal X \times \mathcal Y$. If all $f$-constant rectangles $R$
have probability at most $P(R) \le \delta$, then $CC(f) \ge \log_2 \frac1\delta$.
Proof   By the union bound, any $f$-constant partition of $\mathcal X \times \mathcal Y$ into rectangles $\{R_l\}_{l=1}^N$ satisfies
$1 \le \sum_{l=1}^N P(R_l) \le N\delta$. So $N \ge \frac1\delta$, and the result follows by Corollary 11.3.3.
With these results, we can provide lower bounds on two exemplar problems that will inform
much of our coming development.
Example 11.3.6 (Equality): Consider the problem of testing equality of two $n$-bit strings
$x, y \in \{0, 1\}^n$, letting $f = EQ$ be $f(x, y) = 1$ if $x = y$ and 0 otherwise. Define the set
$S = \{(x, x) \mid x \in \{0, 1\}^n\}$, which has cardinality $2^n$ and satisfies $f(x, x) = 1$ for all $(x, x) \in S$.
That $S$ is a fooling set is immediate: for any $(x, x)$ and $(x', x') \in S$ with $x \neq x'$, certainly
$f(x, x') = 0 \neq 1 = f(x, x)$. So
\[
n \le CC(EQ) \le n + 1,
\]
where the upper bound follows by letting Alice simply communicate the string $x$ and Bob
check whether $x = y$, outputting 1 or 0 as $x = y$ or $x \neq y$. ◊
The second example concerns inner products on F2 , the field of arithmetic on the integers modulo
2 (that is, with bit strings); one could extend this to inner products in more complicated number
systems (such as floating point), but the basic ideas are cleaner when we deal with bits.
Example 11.3.7 (Inner products on $\mathbb F_2$): Consider computing the inner product $IP_2(x, y) = \langle x, y\rangle \bmod 2$ for $n$-bit strings $x, y \in \{0, 1\}^n$, where addition is performed modulo 2. Rather
than constructing a fooling set directly, we use Proposition 11.3.5 and let $P$ be the uniform
distribution on $\{0,1\}^n \times \{0,1\}^n$. Let $R = A \times B$ be a rectangle with $\langle x, y\rangle = 0$ for all
$x \in A$ and $y \in B$. The linearity of the inner product guarantees that $\langle x, y\rangle = 0$ for all
$x \in \mathrm{span}(A)$ and $y \in \mathrm{span}(B)$, the (linear) spans of $A$ and $B$ in $\mathbb F_2^n$, respectively. Now
recognize that $\mathrm{span}(A)$ and $\mathrm{span}(B)$ are orthogonal subspaces of $\mathbb F_2^n$, and so their dimensions
$d_0 = \dim(\mathrm{span}(A))$ and $d_1 = \dim(\mathrm{span}(B))$ satisfy $d_0 + d_1 \le n$.
Noting that $|A| \le 2^{d_0}$ and $|B| \le 2^{d_1}$, we thus obtain $|R| \le |A| \cdot |B| \le 2^n$,
which (under the uniform measure $P$) satisfies
\[
P(R) \le \frac{2^n}{2^{2n}} = 2^{-n}.
\]
By Proposition 11.3.5, we thus have
\[
n \le CC(IP_2) \le n + 1,
\]
where once again the upper bound follows by letting Alice simply communicate $x \in \{0,1\}^n$
and having Bob output $\langle x, y\rangle \bmod 2$. ◊
Example 11.3.8 (Equality with randomization): Let $x, y \in \{0,1\}^n$ and $p$ be a prime number
satisfying $n^2 \le p \le 2n^2$ (the Prime Number Theorem guarantees the existence of such a $p$).
Let Alice choose a uniformly random number $U \in \{0, \ldots, p - 1\}$ and compute the polynomial
\[
a(U) = \sum_{i=1}^n x_i U^{i-1} \bmod p.
\]
Then Alice may communicate both $U$ and $a(U)$ to Bob, which requires at most $2\log_2 p \le 4\log_2 n + 2$ bits. Then Bob checks whether the corresponding polynomial $b(U) := \sum_{i=1}^n y_i U^{i-1} \bmod p$
satisfies $b(U) = a(U)$. If so, Bob outputs "Yes" (equality), and otherwise, Bob outputs "No."
This protocol satisfies depth(Π) ≤ 4 log2 n + 1. Moreover, if x = y, it is always correct, while if
x ̸= y, then the protocol is incorrect only if a(U ) = b(U ), that is, U is a root of the polynomial
\[
p(u) = \sum_{i=1}^n (x_i - y_i)u^{i-1}.
\]
But this is a nonzero polynomial of degree at most $n - 1$, which has at most $n - 1$ roots (over the field $\mathbb F_p$;
see Appendix A.1 for a brief review of polynomials). Thus for $x \neq y$ we have
\[
P(\Pi(x, y) \text{ fails}) = P(a(U) = b(U)) \le \frac{n-1}{p} < \frac1n,
\]
and so RCC1/n (EQ) ≤ O(1) log n, exponentially improving over deterministic complexity.
In passing, we make two additional remarks. First, this protocol is one-way and non-
interactive: Alice can simply send O(log n) bits. Second, we can achieve essentially any prob-
ability of success in the bound while still only paying logarithmically in communication, as
taking $n^k \le p \le 2n^k$ for $k \ge 2$ yields $RCC_{1/n^k}(EQ) \le 2k\log_2 n + O(1)$. ◊
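The protocol is a few lines of code (a sketch; the function names are mine). For a fixed unequal pair, enumerating every choice of $U$ verifies the $(n-1)/p$ failure bound directly:

```python
import random

def poly_hash(bits, u, p):
    """Evaluate sum_i bits[i] * u^(i-1) mod p via Horner's rule."""
    acc = 0
    for b in reversed(bits):
        acc = (acc * u + b) % p
    return acc

n, p = 8, 67  # 67 is prime and n^2 = 64 <= 67 <= 128 = 2 n^2

random.seed(3)
x = [random.randint(0, 1) for _ in range(n)]
y = list(x)
y[4] ^= 1  # make y differ from x in one bit

# If x == y the protocol never fails; if x != y it fails only when U is a
# root of the difference polynomial, so at most n - 1 of the p choices fail.
failures = sum(poly_hash(x, u, p) == poly_hash(y, u, p) for u in range(p))
assert failures <= n - 1
assert all(poly_hash(x, u, p) == poly_hash(list(x), u, p) for u in range(p))
```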
Example 11.3.8 makes clear that any lower bounds on randomized communication complexity,
or, relatedly, information complexity, will necessarily be somewhat more subtle than those we have
presented for CC. We develop a few of the main ideas here. Because our focus is on information
theoretic techniques, we pass over a few of the standard tools for proving lower bounds involving
discrepancy and randomized inputs, touching on these in the bibliographic notes at the end of the
chapter. One of our main goals will be to show that the information complexity of the inner product
is indeed Ω(n), a much stronger result than Example 11.3.7. In contrast to the lower bounds we
provide for minimax risk in most of this book, the focus in communication complexity is to take
an a priori accurate estimator and demonstrate that it requires a certain amount of information to
be communicated, rather than the contrapositive result that limited information yields inaccurate
estimators. While these are clearly equivalent, it can be fruitful to use the perspective most relevant
for the problem at hand.
Two main ideas form the basis for information complexity lower bounds: first, direct sum
inequalities, which show that computing a function on n inputs requires roughly order n more
communication than computing it (or at least, one of the constituent functions making it up)
on one. The second important insight is to provide lower bounds on the information necessary
to compute different primitives, and the particular structure of even randomized communication
protocols makes this possible. For the remainder of Section 11.3.3, we address the first of these,
returning to the information complexity of primitives in Section 11.3.4.
While Example 11.3.8 makes clear that the decomposition (11.3.4) is not sufficient to guarantee a
randomized complexity lower bound of order n, it will be useful.
To develop the main information complexity direct sum theorem showing that the information
complexity of f is at least the sum of the complexities of its constituent primitives, we leverage
what we term plantable inputs:
Definition 11.5. Let $f : \mathcal X^n \times \mathcal Y^n \to \{0, 1\}$ have the decomposition (11.3.4), where the primitive
$h$ is $\{0,1\}$-valued. The pair $(x, y) \in \mathcal X^n \times \mathcal Y^n$ admits a planted solution if for each $i \in \{1, \ldots, n\}$,
all $x_i' \in \mathcal X$ and $y_i' \in \mathcal Y$, and the vectors
\[
x' = (x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \quad\text{and}\quad y' = (y_1, \ldots, y_{i-1}, y_i', y_{i+1}, \ldots, y_n),
\]
we have $f(x', y') = h(x_i', y_i')$.
The binary inner product in Examples 11.3.7 and 11.3.10 has many plantable inputs: any of the 3n
pairs of vectors x, y ∈ {0, 1}n with ⟨x, y⟩ = 0 admit planted solutions, as we have xi yi = 0 for each
i. The set-disjointness problem, Example 11.3.11, has the same plantable inputs. For the equality
function, only the 2n pairs x = y admit planted solutions.
We outline the key idea to our direct sum lower bounds. Because we define information com-
plexity for protocols Π that are correct on all inputs with high probability, we can choose an
arbitrary distribution on inputs $(x_1^n, y_1^n) \in \mathcal X^n \times \mathcal Y^n$. Thus we choose a fooling distribution $\mu$ for
$f$, meaning that for $(X_i, Y_i) \stackrel{\rm iid}{\sim} \mu$ the pair $(X_1^n, Y_1^n) \in \mathcal X^n \times \mathcal Y^n$ always admits a planted solution
(Definition 11.5). The next definition says this slightly differently.
Definition 11.6. A distribution µ on (x, y) ∈ X × Y is a fooling distribution if all (xn1 , y1n ) in the
support of the product µn admit planted solutions (Definition 11.5).
Typically, fooling distributions µ require some dependence between Xi and Yi —for example, in the
inner product, we require Xi Yi = 0, so that if Xi = 1 then Yi = 0 and vice versa:
Example 11.3.12 (A fooling distribution for inner products and set disjointness): Define
the distribution µ on pairs (x, y) ∈ {0, 1} × {0, 1} as follows: let V be uniform on {0, 1}, and
conditional on V = 0, set X = 0 and let Y ∼ Uniform{0, 1}; conditional on V = 1, set Y = 0
and let X ∼ Uniform{0, 1}. Then certainly XY = 0, and any set of pairs (X_i, Y_i) iid∼ µ satisfies
both that the binary inner product IP2(X_1^n, Y_1^n) = ⟨X_1^n, Y_1^n⟩ mod 2 = 0 and set disjointness
DISJ(X_1^n, Y_1^n) = 1{⟨X_1^n, Y_1^n⟩ > 0} = 0. ◁
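As an illustrative aside (not part of the original notes), the fooling distribution of Example 11.3.12 is easy to simulate, and a short check confirms that every draw from µ^n admits a planted solution, so both IP2 and DISJ are constant on the support:

```python
# Sketch: sample from the fooling distribution mu of Example 11.3.12 and
# check that every draw has x_i * y_i = 0 coordinatewise (planted solutions),
# so IP2 = 0 and DISJ = 0 on the entire support of mu^n.
import random

def sample_fooling_pair(rng):
    """One draw (X, Y) ~ mu: a uniform bit V decides which side is zeroed."""
    if rng.random() < 0.5:            # V = 0: X = 0, Y uniform on {0,1}
        return 0, rng.randint(0, 1)
    else:                             # V = 1: Y = 0, X uniform on {0,1}
        return rng.randint(0, 1), 0

rng = random.Random(0)
n = 12
for _ in range(1000):
    pairs = [sample_fooling_pair(rng) for _ in range(n)]
    assert all(xi * yi == 0 for xi, yi in pairs)        # planted solutions
    assert sum(xi * yi for xi, yi in pairs) % 2 == 0    # IP2(x, y) = 0
    assert not any(xi and yi for xi, yi in pairs)       # DISJ indicator = 0
print("all draws admit planted solutions")
```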
CIC_µ^δ(h) := inf_Π sup_V { I(X, Y; Π(X, Y) | V) s.t. P(Π_out(x, y) ≠ h(x, y)) ≤ δ for all x ∈ X, y ∈ Y },
where the infimum is over all (randomized) protocols and the supremum is over all random variables
making X and Y conditionally independent with joint distribution (X, Y ) ∼ µ. So if we can find a
variable V making the mutual information I(X, Y ; Π(X, Y ) | V ) large for any correct protocol Π,
the conditional information complexity of h is necessarily large.
With this, we obtain our main direct sum theorem for information complexity.
Theorem 11.3.13. Let µ be a fooling distribution on X × Y for a function f with primitive h. Then
IC_δ(f) ≥ n · CIC_µ^δ(h).
Proof Let V = V_1^n ∈ V^n be any random vector with i.i.d. entries making (X_i, Y_i) conditionally
independent given V_i. Then for any protocol Π, we have
I(X_1^n, Y_1^n; Π) ≥ I(X_1^n, Y_1^n; Π | V)
because we have the Markov chain V → (X1n , Y1n ) → Π. Using the chain rule for mutual informa-
tion, where we recognize that X1n and Y1n are independent given V , we have
I(X_1^n, Y_1^n; Π | V) = Σ_{i=1}^n I(X_i, Y_i; Π | V, X_1^{i−1}, Y_1^{i−1})
    = Σ_{i=1}^n [H(X_i, Y_i | V, X_1^{i−1}, Y_1^{i−1}) − H(X_i, Y_i | V, Π, X_1^{i−1}, Y_1^{i−1})]
    ≥ Σ_{i=1}^n [H(X_i, Y_i | V) − H(X_i, Y_i | V, Π)] = Σ_{i=1}^n I(X_i, Y_i; Π | V)     (11.3.5)
because conditioning reduces entropy and (Xi , Yi ) are independent of X1i−1 , Y1i−1 given V .
Now we come to the key reduction from the global protocol Π to one solving individual primitives.
On inputs (x, y) ∈ X × Y, define the simulated protocol Π_{i,v}(x, y) so that given the vector
v_{\i} ∈ V^{n−1}, Alice and Bob independently generate (X_j^*, Y_j^*) iid∼ µ(· | V_j = v_j) for j ≠ i, which
is possible because of the assumed conditional independence given V, yielding X_{\i}^* ∈ X^{n−1} and
Y_{\i}^* ∈ Y^{n−1}; they then run Π on the inputs ((X_{\i}^*, x), (Y_{\i}^*, y)). We claim that Π_{i,v} is a δ-error
protocol for the primitive h, and that Π_{i,v}(x, y) has the distribution of Π((X_{\i}^*, x), (Y_{\i}^*, y)) (11.3.6);
that is, the joint over the simulated protocol is equal to that over the original protocol Π conditional
on V_{\i} = v_{\i}. The latter claim (11.3.6) is essentially definitional; the former requires a bit more work.
To see that Π_{i,v} is a δ-error protocol for the primitive h, note that by construction, X_{\i}^* and Y_{\i}^* are
in the support of µ, and so the inputs admit planted solutions. In particular, f((X_{\i}^*, x), (Y_{\i}^*, y)) = h(x, y),
and so Π_{i,v} is necessarily a δ-error protocol.
The distributional equality (11.3.6) guarantees that for any v we have
as desired.
With Theorem 11.3.13 in hand, we have our desired direct sum result, so that proving informa-
tion complexity lower bounds reduces to providing lower bounds on the (conditional) information
complexity of various 1-bit primitives. The following corollary highlights the theorem’s applications
to inner product and set disjointness (Examples 11.3.10 and 11.3.11).
Corollary 11.3.14. Let f be the binary inner product f(x, y) = ⟨x, y⟩ mod 2 or the disjointness
function f(x, y) = 1{⟨x, y⟩ > 0}. Let µ be the fooling distribution in Example 11.3.12. Then
IC_δ(f) ≥ n · CIC_µ^δ(h), where h(x, y) = xy.
Exercise 11.10 explores similar techniques for the entrywise less-than-or-equal function, showing
similar complexity lower bounds.
Proposition 11.3.15. Let h(x, y) = xy for inputs x, y ∈ {0, 1}. Let µ be the fooling distribution
in Example 11.3.12. Then
CIC_µ^δ(h) ≥ (1/4) (1 − 2√(δ(1 − δ))).
We prove this proposition in the remainder of this section, noting that as an immediate corollary,
we obtain the following lower bounds on the communication complexity of set disjointness and
binary inner product.
Corollary 11.3.16. Let f be the binary inner product f(x, y) = ⟨x, y⟩ mod 2 or the disjointness
function f(x, y) = 1{⟨x, y⟩ > 0}. Then
IC_δ(f) ≥ (n/4) (1 − 2√(δ(1 − δ))).
To control the complexity of computing individual primitives, it proves easier to use metrics
tied more directly to testing. To that end, we recall the connection between Hellinger distance
and the mutual information, or Jensen-Shannon divergence, between a variable X and a single bit
B ∈ {0, 1} in Proposition 2.2.10, which gives that if B → Z, where Z ∼ Pb conditional on B = b,
then
I_2(Z; B) ≥ d_hel^2(P_0, P_1).
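As a numeric sanity check (a sketch, not a proof and not part of the notes), one can verify the inequality I_2(Z; B) ≥ d_hel^2(P_0, P_1) on randomly drawn finite-alphabet distributions, computing the mutual information as the Jensen-Shannon divergence in bits:

```python
# Numeric check of I_2(Z; B) >= d_hel^2(P0, P1): with B uniform on {0,1} and
# Z ~ P_B, the mutual information I_2(Z; B) equals the Jensen-Shannon
# divergence in bits, which dominates the squared Hellinger distance.
import math, random

def hellinger_sq(p, q):
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def js_bits(p, q):
    """Jensen-Shannon divergence in bits = I_2(Z; B) for a uniform bit B."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl2(r, s):
        return sum(a * math.log2(a / b) for a, b in zip(r, s) if a > 0)
    return 0.5 * kl2(p, m) + 0.5 * kl2(q, m)

rng = random.Random(0)
def rand_dist(k):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

for _ in range(2000):
    p, q = rand_dist(5), rand_dist(5)
    assert js_bits(p, q) >= hellinger_sq(p, q) - 1e-12
print("I_2 >= d_hel^2 held on all random trials")
```

Note the base-2 logarithm matters: for disjoint supports both sides equal 1, so the inequality is tight there and fails in nats.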
To apply this inequality, recall the fooling distribution µ for inner products in Example 11.3.12,
where V ∼ Uniform{0, 1} and conditional on V = 0 we set X = 0 and draw Y ∼ Uniform{0, 1}, and
otherwise Y = 0 and X ∼ Uniform{0, 1}. Then for V → (X, Y ) from this distribution, we have
I_2(X, Y; Π(X, Y) | V) = (1/2) I_2(Y; Π(0, Y) | V = 0) + (1/2) I_2(X; Π(X, 0) | V = 1).
Letting Qxy denote the (conditional) distribution over Π on input bits x, y ∈ {0, 1} and noting that
X and Y above are each uniform on {0, 1}, we see that Proposition 2.2.10 applies and so
I_2(X, Y; Π(X, Y) | V) ≥ (1/2) d_hel^2(Q_{01}, Q_{00}) + (1/2) d_hel^2(Q_{10}, Q_{00}).
Applying the triangle inequality that (a − b)^2 ≤ (|a − c| + |c − b|)^2 ≤ 2(a − c)^2 + 2(c − b)^2, we obtain
the following lemma.
Lemma 11.3.17. Let Π be any protocol acting on two bit inputs x, y ∈ {0, 1}, and let µ be the
fooling distribution in Example 11.3.12. Let Qxy be the distribution of Π(x, y) on inputs x, y. Then
I_2(X, Y; Π(X, Y) | V) ≥ (1/4) d_hel^2(Q_{01}, Q_{10}).
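The step behind Lemma 11.3.17 is only that Hellinger distance is a metric on square-root densities; a quick numeric check (illustrative only) confirms the needed quadratic triangle inequality on random distributions:

```python
# Sanity check of the step behind Lemma 11.3.17: since d_hel is a metric,
#   d_hel^2(Q01, Q10) <= 2 d_hel^2(Q01, Q00) + 2 d_hel^2(Q00, Q10),
# which converts (1/2)[d^2(Q01,Q00) + d^2(Q10,Q00)] into (1/4) d^2(Q01,Q10).
import math, random

def hellinger_sq(p, q):
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

rng = random.Random(0)
def rand_dist(k):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

for _ in range(2000):
    q01, q10, q00 = rand_dist(6), rand_dist(6), rand_dist(6)
    lhs = hellinger_sq(q01, q10)
    rhs = 2 * hellinger_sq(q01, q00) + 2 * hellinger_sq(q00, q10)
    assert lhs <= rhs + 1e-12
print("triangle-inequality step verified on random trials")
```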
The last step in the proof of Proposition 11.3.15 is to demonstrate a property of (randomized)
protocols Π analogous to the rectangular property of deterministic communication that Proposi-
tions 11.3.2 and 11.3.5 demonstrate. In analogy with the output leaf in the tree for deterministic
communication complexity, let τ be the transcript of the communication protocol, that is, its en-
tire communication trace. Then we claim the following analog of Proposition 11.3.2 that the set of
inputs resulting in a particular output in deterministic complexity is a rectangle in X × Y.
Lemma 11.3.18. Let Π be any randomized protocol with inputs in X × Y. Then there exist
functions qx and qy such that for any transcript τ ,
P(Π(x, y) = τ ) = qx (τ ) · qy (τ ).
We thus have the following key cut and paste property, which shows that in some sense, Hellinger
distances respect the “rectangular” structure of communication protocols.
Lemma 11.3.19. Let Π be any protocol acting on inputs in X × Y and let Q_{x,y} be the distribution
of Π(x, y) on inputs x, y. Then
d_hel^2(Q_{x,y}, Q_{x′,y′}) = d_hel^2(Q_{x,y′}, Q_{x′,y}).
Proof Let T be the collection of all possible transcripts the protocol outputs. By Lemma 11.3.18
we have
d_hel^2(Q_{x,y}, Q_{x′,y′}) = (1/2) Σ_{τ∈T} (√(Q_{x,y}(τ)) − √(Q_{x′,y′}(τ)))^2
    = (1/2) Σ_{τ∈T} (√(q_x(τ)q_y(τ)) − √(q_{x′}(τ)q_{y′}(τ)))^2 = 1 − Σ_τ √(q_x(τ) q_y(τ) q_{x′}(τ) q_{y′}(τ)).
Rearranging by the trivial modification qx qy qx′ qy′ = qx qy′ qx′ qy , we have the result.
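The cut-and-paste property is easy to see concretely for transcripts with the product structure of Lemma 11.3.18. The example below (an illustration with arbitrary invented factors, not the notes' construction) takes Q_{xy}(t_1, t_2) = p_x(t_1) r_y(t_2), where Alice's factor depends only on x and Bob's only on y:

```python
# Illustration of Lemma 11.3.19 under the product structure of Lemma 11.3.18:
# if Q_{xy}(t1, t2) = p_x(t1) * r_y(t2), then d_hel^2(Q00, Q11) equals
# d_hel^2(Q01, Q10), since both reduce to the same product of Bhattacharyya
# coefficients. The factor distributions below are arbitrary random choices.
import math, random

def hellinger_sq(p, q):
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

rng = random.Random(0)
def rand_dist(k):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

p = {0: rand_dist(4), 1: rand_dist(4)}   # Alice's factor q_x
r = {0: rand_dist(4), 1: rand_dist(4)}   # Bob's factor q_y

def Q(x, y):
    """Joint transcript distribution on 16 outcomes, flattened."""
    return [p[x][i] * r[y][j] for i in range(4) for j in range(4)]

lhs = hellinger_sq(Q(0, 0), Q(1, 1))
rhs = hellinger_sq(Q(0, 1), Q(1, 0))
assert abs(lhs - rhs) < 1e-12
print("cut-and-paste: d^2(Q00, Q11) == d^2(Q01, Q10)")
```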
We now finalize the proof of Proposition 11.3.15. Substituting this cutting and pasting into
Lemma 11.3.17, we have
I_2(X, Y; Π(X, Y) | V) ≥ (1/4) d_hel^2(Q_{01}, Q_{10}) = (1/4) d_hel^2(Q_{00}, Q_{11}).
Then a simple lemma recalling the testing inequalities in Chapter 2.3.1 completes the proof of the
proposition, because it guarantees that 4 I_2(X, Y; Π(X, Y) | V) ≥ 1 − 2√(δ(1 − δ)) no matter the
choice of protocol Π, and so
CIC_µ^δ(h) ≥ inf_Π I_2(X, Y; Π(X, Y) | V) ≥ (1/4) (1 − 2√(δ(1 − δ))).
Lemma 11.3.20. Let Π be any δ-accurate protocol for computing h(x, y) = xy and Q_{xy} be its
distribution on inputs (x, y). Then d_hel^2(Q_{00}, Q_{11}) ≥ 1 − 2√(δ(1 − δ)).
Proof Assume that Π computes the product xy ∈ {0, 1} correctly with probability at least
1 − δ, that is, P(Π_out(x, y) ≠ xy) ≤ δ for all x, y ∈ {0, 1}. By Le Cam's testing lower bounds
(Proposition 2.3.1), we know that
1 − 2δ ≤ ∥Q_{00} − Q_{11}∥_TV ≤ d_hel(Q_{00}, Q_{11}) √(2 − d_hel^2(Q_{00}, Q_{11})),     (⋆)
where inequality (⋆) follows from the inequalities in Proposition 2.2.7 relating Hellinger and total-
variation distance. Let d = d_hel^2(Q_{00}, Q_{11}) for shorthand. Then rearranging gives d(2 − d) ≥
(1 − 2δ)^2. Solving for d in 0 ≥ d^2 − 2d + (1 − 2δ)^2 yields d ≥ 1 − √(1 − (1 − 2δ)^2). Recognize that
1 − (1 − 2δ)^2 = 4δ(1 − δ), giving d ≥ 1 − 2√(δ(1 − δ)) as claimed.
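The closing algebra is worth checking once; the short script below (illustrative, not part of the notes) verifies both the quadratic root and the simplification for a grid of δ:

```python
# Arithmetic check of the end of the proof: the root of d^2 - 2d + (1-2delta)^2
# is d = 1 - sqrt(1 - (1-2delta)^2), and since 1 - (1-2delta)^2 = 4 delta(1-delta),
# this equals 1 - 2 sqrt(delta(1-delta)).
import math

for k in range(1, 50):
    delta = k / 100.0
    root = 1 - math.sqrt(1 - (1 - 2 * delta) ** 2)
    closed_form = 1 - 2 * math.sqrt(delta * (1 - delta))
    assert abs(root - closed_form) < 1e-12
    # the root satisfies the quadratic with equality
    assert abs(root ** 2 - 2 * root + (1 - 2 * delta) ** 2) < 1e-9
print("d >= 1 - 2 sqrt(delta(1-delta)) verified on the grid")
```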
Here we have used the notation Z_{<i} := (Z_1, . . . , Z_{i−1}), and we will use Z_{≤i} := (Z_1, . . . , Z_i) and
similarly for superscripts throughout. We will also use the notation Z_{→i}^{(t)} = (B^{(1)}, Z_{<i}^{(t)}) to denote
all the messages coming into the communication of Z_i^{(t)}. Figure 11.3 illustrates two rounds of this
communication scheme.
We can provide lower bounds on the minimax risk of communication-constrained estimators by
extending the data processing inequality approach we have developed. Our approach to the lower
bounds, which we provide in Sections 11.4.1 and 11.4.2 to follow, is roughly as follows. First, we
develop another direct sum bound, in analogy with Theorem 11.3.13, meaning that the difficulty of
Figure 11.3. Left: single round of communication of variables, writing to public blackboard B (1) .
Right: two rounds of communication of variables, writing to public blackboards B (1) and B (2) .
solving a d-dimensional problem is roughly d-times that of solving a 1-dimensional version of the
problem; thus, any lower bounds on the error in 1-dimensional problems imply lower bounds for
d-dimensional problems. Second, we provide an extension of the data processing inequalities we
have developed thus far to apply to particular communication scenarios.
The key to our reductions is that we consider families of distributions where the coordinates of
X are independent, which dovetails with Assouad's method. We thus index our distributions by
v ∈ {−1, 1}^d, and in proving our lower bounds, we assume the typical Markov structure
V → (X_1, . . . , X_m) → Π(X_1^m),
where V is chosen uniformly at random from {−1, 1}d , and Π = Π(X1m ) denotes the protocol of
the entire communication—in this context, this is the entire set of blackboard messages
Π = (B (1) , . . . , B (T ) ),
(which also encodes the message order). We assume that X follows a d-dimensional product
distribution, so that conditional on V = v, the X_i are i.i.d. with
X_i ∼ P_v = P_{v_1} ⊗ P_{v_2} ⊗ · · · ⊗ P_{v_d}.     (11.4.1)
The generation strategy (11.4.1) guarantees that conditional on the jth coordinate Vj = vj , the co-
ordinates Xi,j are i.i.d. and independent of V\j = (V1 , . . . , Vj−1 , Vj+1 , . . . , Vd ) as well as independent
of Xi′ ,j for data points i′ ̸= i.
In particular, viewing X≤m,\j as extraneous randomness, we have the simpler Markovian structure
Vj → X≤m,j → Π, (11.4.2)
so that we may think of the communication Π = Π(X_{≤m,j}) as acting only on X_{≤m,j}. Now, define
M_{−j} and M_{+j} to be the marginal distributions over the total communication protocol Π conditional
on V_j = −1 and V_j = +1, respectively, in the one-variable model (11.4.2). Then Le Cam's testing
equality (Proposition 2.3.1), and the equivalence between Hellinger and variation distance
(Proposition 2.2.7), imply that
inf_{V̂} Σ_{j=1}^d 2 P(V̂_j(Π) ≠ V_j) ≥ Σ_{j=1}^d (1 − ∥M_{−j} − M_{+j}∥_TV) ≥ Σ_{j=1}^d (1 − √2 d_hel(M_{−j}, M_{+j}))
    ≥ d (1 − √( (2/d) Σ_{j=1}^d d_hel^2(M_{−j}, M_{+j}) )).
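The final step in the display above is Cauchy-Schwarz; a quick numeric check (illustrative only) confirms it for arbitrary Hellinger-distance vectors:

```python
# Numeric check of the Cauchy-Schwarz step: with h_j = d_hel(M_-j, M_+j),
#   sum_j (1 - sqrt(2) h_j) >= d (1 - sqrt((2/d) sum_j h_j^2)),
# which reduces to sum_j h_j <= sqrt(d * sum_j h_j^2).
import math, random

rng = random.Random(0)
for _ in range(1000):
    d = rng.randint(1, 20)
    h = [rng.random() for _ in range(d)]
    lhs = sum(1 - math.sqrt(2) * hj for hj in h)
    rhs = d * (1 - math.sqrt((2 / d) * sum(hj ** 2 for hj in h)))
    assert lhs >= rhs - 1e-9
print("Cauchy-Schwarz step verified on random trials")
```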
Recalling Assouad’s method (Lemma 9.5.2) of Chapter 9.5, we see that any time we have a problem
with separation with respect to the Hamming metric (9.5.1), we have a lower bound on its error in
estimation problems. This proposition analogizes Theorem 11.3.13, in that small Hellinger distance
between the individual marginals M±j necessarily makes the testing and estimation problems hard.
We leave the proof of this proposition as Exercise 11.12, as it follows by adapting the techniques we
use to prove Theorem 11.2.1, with the main difference being the random variables with bounded
likelihood ratios (X → Z versus V → X). A brief example illustrates Proposition 11.4.2.
We now give the two main results connecting mutual information and the contraction-type
bounds in Definition 11.7. To provide bounds using Proposition 11.4.1, we wish to control the
Hellinger distance between individual marginals M±j , so we consider single variables in the Markov
chain
V → (X1 , . . . , Xm ) → Π,
where V ∈ {0, 1}. To state the coming theorems, we make a restriction on the data generation
V → X, calling distributions P0 and P1 (c, β)-contractive if
β(P0 , P1 ) ≤ β ≤ 1 and max {D∞ (P0 ||P1 ) , D∞ (P1 ||P0 )} ≤ log c, (11.4.3)
where D∞ (·||·) denotes the Rényi-∞-divergence. Proposition 11.4.2 shows that whenever such a c
exists we certainly have β(P0 , P1 ) ≤ 2(c − 1)2 .
The next theorem then provides the basic information contraction inequality for single-variable
communication.
Theorem 11.4.4. Let 1 ≤ c < ∞ and β ≤ 1. Let P0 and P1 be (c, β)-contractive (11.4.3)
distributions on X and Mv , v ∈ {0, 1} be the marginal distribution of the protocol Π conditional on
V = v. Then
d_hel^2(M_0, M_1) ≤ (7/2) (c + 1) β · min{ I(X_1^m; Π(X_1^m) | V = 0), I(X_1^m; Π(X_1^m) | V = 1) }.
The proof of Theorem 11.4.4 is quite complicated, so we defer it to Section 11.5.
We can use Theorem 11.4.4 to obtain bounds on the probability of error—detection of d-
dimensional signals—in higher dimensional problems based on mutual information alone. Because
the theorem provides a bound involving the minimum of the conditional mutual informations, we
have substantial freedom to combine the direct-sum lower bounds in Section 11.4.1 to massage it
into the mutual information between the data X1m and the protocol Π(X1m ).
We thus recall the definition (11.4.1) of our product distribution signals, where we assume
that each individual datum Xi = (Xi,1 , . . . , Xi,d ) = (Xi,j )dj=1 belongs to a d-dimensional set and
conditional on V = v ∈ {−1, 1}d has independent coordinates distributed as Xi,j ∼ Pvj . With
this, we have the following theorem, which follows by a combination of Assouad’s method (in the
context of communication bounds, i.e. Proposition 11.4.1) and Theorem 11.4.4.
Theorem 11.4.5. Let Π be the entire communication protocol in Figure 11.3, let V ∈ {−1, 1}^d be
uniform, and generate X_i iid∼ P_v, i = 1, . . . , m, according to the independent coordinate distribu-
tion (11.4.1). Assume additionally that for each coordinate j = 1, . . . , d, the coordinate distributions
P_{±v_j} are (c, β)-contractive (11.4.3). Then for any estimator V̂,
Σ_{j=1}^d P(V̂_j(Π) ≠ V_j) ≥ (d/2) (1 − √( 7(c + 1) (β/d) · I(X_1, . . . , X_m; Π | V) )).
Proof Under the given conditions, Proposition 11.4.1 and Theorem 11.4.4 immediately combine
to give
Σ_{j=1}^d P(V̂_j(Π) ≠ V_j) ≥ (d/2) (1 − √( 7(c + 1) (β/d) Σ_{j=1}^d min_{v∈{−1,1}} I(X_{1,j}, . . . , X_{m,j}; Π | V_j = v) )).
Certainly
min_{v∈{−1,1}} I(X_{1,j}, . . . , X_{m,j}; Π | V_j = v) ≤ I(X_{1,j}, . . . , X_{m,j}; Π | V_j).
Then, using that w.l.o.g. we may assume the Xi,j are discrete, we obtain
Σ_{j=1}^d I((X_{i,j})_{i=1}^m; Π | V_j) = Σ_{j=1}^d [H((X_{i,j})_{i=1}^m | V_j) − H((X_{i,j})_{i=1}^m | Π, V_j)]
  (i)= Σ_{j=1}^d [H((X_{i,j})_{i=1}^m | (X_{i,j′})_{i≤m,j′<j}, V) − H((X_{i,j})_{i=1}^m | Π, V_j)]
    ≤ Σ_{j=1}^d [H((X_{i,j})_{i=1}^m | (X_{i,j′})_{i≤m,j′<j}, V) − H((X_{i,j})_{i=1}^m | (X_{i,j′})_{i≤m,j′<j}, Π, V)]
    = Σ_{j=1}^d I((X_{i,j})_{i=1}^m; Π | V, (X_{i,j′})_{i≤m,j′<j}) = I(X_1, . . . , X_m; Π | V),
where equality (i) used the independence of Xi,j from V\j and Xi,j ′ for j ′ ̸= j given Vj , and the
inequality that conditioning reduces entropy. This gives the theorem.
we must provide a (good enough) upper bound on the mutual information I(X_1, . . . , X_m; Π | V)
between the data points X_i and the communication protocol. While there are many strategies for
providing bounds and strong data processing inequalities, we focus mainly on situations with
bounded likelihood ratio, where Proposition 11.4.2 directly provides the type of strong data
processing inequality we require.
Proposition 11.4.6. Let M_m(θ(P_d), ∥·∥_2^2, {B_i}_{i=1}^m) denote the minimax mean-square error for
estimation of a d-dimensional Bernoulli under the information constraint (11.4.4). Then for a
numerical constant c > 0,
M_m(θ(P_d), ∥·∥_2^2, {B_i}_{i=1}^m) ≥ c min{ (d/m) · d / ((1/m) Σ_{i=1}^m B_i), d }.
Proof By the standard Assouad reduction (Section 9.5), when we take coordinate distributions
P_{v_j} = Bernoulli((1 + δ v_j)/2), we have a cδ^2-separation in Hamming metric. Applying Theorem 11.4.5
and Example 11.4.3, we obtain the minimax lower bound, valid for 0 ≤ δ ≤ 1/10, of
M_m(θ(P_d), ∥·∥_2^2, {B_i}_{i=1}^m) ≥ cδ^2 d (1 − C √( (δ^2/d) I(X_1, . . . , X_m; Π | V) )).
Thus we obtain
I(X_1, . . . , X_m; Π | V) ≤ I(X_1, . . . , X_m; Π) = Σ_{i=1}^m Σ_{t=1}^T I(X_1, . . . , X_m; Z_i^{(t)} | Z_{→i}^{(t)}).
Choosing δ^2 = min{1/100, d / (2C^2 Σ_{i=1}^m B_i)} gives the result.
This result deserves some discussion. It is sharp in the case that the number of bits is of order
d or less from each machine: when we set B_i = d, the lower bound becomes
sup_θ E_θ[∥θ̂(Π) − θ∥_2^2] ≳ min{ (d/m) · (d/d), d } = d/m,
which is certainly achievable (each machine simply sends its entire vector X_i ∈ {0, 1}^d). When
machines communicate fewer than d bits, we have a tighter result; for example, if only k of the m
machines send d bits, and the rest communicate little, we obtain
sup_θ E_θ[∥θ̂(Π) − θ∥_2^2] ≳ min{ (d/m) · (md/(kd)), d } = d/k,
which is similarly intuitive. The extension of these ideas to the case when each machine has an
individual sample of size n is more challenging, as it requires tensorized variants of the strong data
processing inequality in Definition 11.7; we provide remarks in the bibliographical section.
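The arithmetic of the two scenarios above can be checked directly; the helper below is an illustration reading the lower bound of Proposition 11.4.6 (constants dropped) as min{(d/m) · d/((1/m) Σ_i B_i), d}:

```python
# Worked arithmetic for the two communication-budget scenarios, dropping
# constants from the lower bound min{ (d/m) * d / ((1/m) sum_i B_i), d }.
def lower_bound(d, m, bits):
    """Communication lower bound (up to constants) for budgets bits[0..m-1]."""
    avg_bits = sum(bits) / m
    return min((d / m) * d / avg_bits, d)

d, m = 64, 32
# every machine sends d bits: bound is d/m
assert lower_bound(d, m, [d] * m) == d / m
# only k machines send d bits, the rest (idealized as) nothing: bound is d/k
k = 8
assert lower_bound(d, m, [d] * k + [0] * (m - k)) == d / k
print("d/m =", d / m, " d/k =", d / k)
```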
The following observation shows that for appropriate choices of εkl , this is indeed weaker than
the interactive guarantee (11.4.5).
Lemma 11.4.7. Let the communication Q satisfy the interactive privacy guarantee (11.4.5) and
Π be the induced communication protocol over rounds t ≤ T . Then
(1/n) Σ_{i=1}^n D_kl(Π(x_{≤n}) ∥ Π(x_{≤n}^{(i)})) ≤ (1/n) Σ_{i=1}^n Σ_{t=1}^T min{ ε_{i,t}, (3/2) ε_{i,t}^2 }.
Proof Using the chain rule for the KL-divergence, we have for any j that
D_kl(Π(x_{≤n}) ∥ Π(x_{≤n}^{(j)})) = Σ_{i=1}^n Σ_{t=1}^T E[ D_kl( Q(Z_i^t ∈ · | x_i, Z_{→i}^{(t)}) ∥ Q(Z_i^t ∈ · | x_i^{(j)}, Z_{→i}^{(t)}) ) ]
    = Σ_{t=1}^T E[ D_kl( Q(Z_j^t ∈ · | x_j, Z_{→j}^{(t)}) ∥ Q(Z_j^t ∈ · | x_j^{(j)}, Z_{→j}^{(t)}) ) ],
where the expectation is taken over Z_{→j}^{(t)} in the protocol Π(x_{≤n}), and the second equality follows
because x_i^{(j)} = x_i for all indices i except i = j. Now let P_0 and P_1 be arbitrary distributions whose
densities satisfy p_0(z)/p_1(z) ≤ e^ε. Then
D_kl(P_0 ∥ P_1) ≤ ε  and  D_kl(P_0 ∥ P_1) ≤ log(1 + D_{χ²}(P_0 ∥ P_1)) ≤ log(1 + (e^ε − 1)^2)
by Proposition 2.2.9. Then by inspection, min{ε, log(1 + (e^ε − 1)^2)} ≤ min{ε, (3/2) ε^2} for all ε ≥ 0.
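The "by inspection" claim is easy to confirm numerically on a grid (an illustrative check, not a proof):

```python
# Grid check of min{eps, log(1 + (e^eps - 1)^2)} <= min{eps, 1.5 * eps^2},
# the inspection step in the proof of Lemma 11.4.7.
import math

eps = 0.001
while eps < 10:
    lhs = min(eps, math.log(1 + (math.exp(eps) - 1) ** 2))
    rhs = min(eps, 1.5 * eps ** 2)
    assert lhs <= rhs + 1e-12
    eps += 0.001
print("inequality held on the grid")
```

Intuitively, for small ε the log term behaves like ε², while for large ε both minima equal ε.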
Returning to the initial KL-divergence sum, we thus obtain
Σ_{i=1}^n D_kl(Π(x_{≤n}) ∥ Π(x_{≤n}^{(i)})) ≤ Σ_{i=1}^n Σ_{t=1}^T E[ min{ ε_{i,t}, (3/2) ε_{i,t}^2 } ],
as desired.
The key is that the average KL-local privacy guarantee is sufficient to provide a mutual infor-
mation bound, thus allowing us to apply Theorem 11.4.5 as in the proof of Proposition 11.4.6.
Proposition 11.4.8. Let Π be any ε_kl-KL-locally-private on average protocol and assume that
X_1, . . . , X_n are independent conditional on V. Then
We abuse notation to let Π^*(X_{\i}) be the marginal protocol (marginalizing over X_i). Then
I(X_i; Π(X_1^n) | V, X_{\i}) = E[ D_kl( Π(X_{\i}, X_i) ∥ Π^*(X_{\i}) ) ] ≤ E[ D_kl( Π(X_{\i}, X_i) ∥ Π(X_{\i}, X_i′) ) ],
where the first expectation is taken over V and X_j iid∼ P_v conditional on V = v, and the inequality
uses convexity and draws X_i′ independently. Summing over i = 1, . . . , n, Definition 11.8 gives the
result.
Specializing to the case that we wish to estimate a d-dimensional Bernoulli vector, where X ∈ {±1}
has coordinates with P(Xj = 1) = θj , Example 11.4.3 gives the following minimax lower bound.
Corollary 11.4.10. Let Mn (θ(Pd ), ∥·∥22 , εkl ) denote the minimax mean-square error for estima-
tion of a d-dimensional Bernoulli under the εkl -KL-locally-private-on-average constraint in Defini-
tion 11.8. Then
M_n(θ(P_d), ∥·∥_2^2, ε_kl) ≥ c min{ d, d^2 / (n ε_kl) }.
Proof By Corollary 11.4.9 and Example 11.4.3, we have the minimax lower bound
M_n(θ(P_d), ∥·∥_2^2, ε_kl) ≳ dδ^2 (1 − C √( (δ^2/d) n ε_kl ))
for a numerical constant C, which is valid for δ ≲ 1. Choose δ^2 to scale as min{1, d/(n ε_kl)}.
When instead of the average KL-privacy we use the pure local differential privacy constraint (11.4.5),
Lemma 11.4.7 implies the following.
Corollary 11.4.11. Let M_n(θ(P_d), ∥·∥_2^2, ε) denote the minimax mean-square error for estimation
of a d-dimensional Bernoulli where each data release is ε_{i,t}-locally differentially private (11.4.5),
and Σ_{t=1}^∞ ε_{i,t} ≤ ε. Then
M_n(θ(P_d), ∥·∥_2^2, ε) ≥ c min{ d, d^2 / (n (ε ∧ ε^2)) }.
Xi | b ∼ Pbi . (11.5.1)
For the standard basis vectors e_1, . . . , e_m, we expect M_0 to be close to M_{e_l}, and thus hope for some
type of tensorization behavior, where we can relate M_0 and M_1 via one-step changes from M_0 to
M_{e_l}. The next lemma realizes this promise.
Proof The proof crucially relies on the Euclidean structures that the Hellinger distance induces
along with analogues of the cut-and-paste (the “rectangular” structure of inputs in communication
protocols) properties from deterministic and randomized two-player communication. We assume
without loss of generality that Π is discrete, as the Hellinger distance is an f -divergence and so can
be arbitrarily approximated by discrete random variables.
First, we analogize the “rectangular” probabilistic structure of two-player communication pro-
tocols in Lemmas 11.3.18 and 11.3.19, which yields a multi-player cut-and-paste lemma.
Lemma 11.5.2 (cutting and pasting). Let a, b, c, d ∈ {0, 1}m be bit vectors satisfying ai +bi = ci +di
for each i = 1, . . . , m. Then
d2hel (Ma , Mb ) = d2hel (Mc , Md ).
Proof We claim the following analogue of Lemma 11.3.18: for any X_1^m = x_1^m and any commu-
nication transcript τ, we may write
Q(Π(x_1^m) = τ | x_1^m) = ∏_{i=1}^m f_{i,x_i}(τ)     (11.5.3)
for some functions f_{i,x_i}. Indeed, letting τ = {z_i^{(t)}}_{i≤m,t≤T}, we have
Q(Π(x_1^m) = τ | x_1^m) = ∏_{i,t} Q(z_i^{(t)} | x_1^m, z_{→i}^{(t)}) = ∏_{i=1}^m ∏_{t=1}^T Q(z_i^{(t)} | x_i, z_{→i}^{(t)}) =: ∏_{i=1}^m f_{i,x_i}(τ),
where we use that message z_i^{(t)} depends only on x_i and z_{→i}^{(t)}. Then we can write M_b(Π(X_1^m) = τ)
as a product using Eq. (11.5.3): integrating over independent X_i ∼ P_{b_i}, we have
M_b(Π(X_1^m) = τ) = ∫ Q(τ | x_1^m) dP_{b_1}(x_1) · · · dP_{b_m}(x_m) = ∏_{i=1}^m ∫ f_{i,x_i}(τ) dP_{b_i}(x_i) =: ∏_{i=1}^m g_{i,b_i}(τ).
But as ai + bi = ci + di and each is {0, 1}-valued, we certainly have gi,ai gi,bi = gi,ci gi,di , and so the
lemma follows.
The second result we require is due to Jayram [120], and is the following:
Lemma 11.5.3. Let {P_b}_{b∈{0,1}^m} be any collection of distributions satisfying the cutting and pasting
property d_hel^2(P_a, P_b) = d_hel^2(P_c, P_d) whenever a, b, c, d ∈ {0, 1}^m satisfy a + b = c + d. Let N = 2^k
for some k ∈ N. Then for any collection of bit vectors {b^{(i)}}_{i=1}^N ⊂ {0, 1}^m with ⟨b^{(i)}, b^{(j)}⟩ = 0 for
i ≠ j and b = Σ_{i=1}^N b^{(i)},
Σ_{i=1}^N d_hel^2(P_0, P_{b^{(i)}}) ≥ d_hel^2(P_0, P_b) ∏_{l=1}^k (1 − 2^{−l}).
where the second inequality again follows from Lemma 11.5.3, as b^{(i)} = e_j or e_j + e_{j′} for some basis
vectors e_j, e_{j′}. This gives Lemma 11.5.1.
so that additionally W → Xl′ → Π(X ′ ) is a Markov chain. As a consequence, Definition 11.7 of the
strong data processing inequality gives
It remains to relate I(X_l′; Π(X′)) to I(X_l; Π(X) | V = 0). Here we use the bounded likelihood
ratio between P_0 and P_1. Indeed, we have by the condition (11.4.3) that
P_0 ≥ (1/c) P_1, so (c + 1) P_0 ≥ P_0 + P_1, or P_0 ≥ (2/(c + 1)) · (P_0 + P_1)/2.
As a consequence, we have
I(X_l; Π(X_1^m) | V = 0) = ∫ D_kl( Q(· | X_l = x) ∥ M_0 ) dP_0(x)
    ≥ (2/(c + 1)) ∫ D_kl( Q(· | X_l = x) ∥ M_0 ) (dP_0(x) + dP_1(x))/2
    ≥ (2/(c + 1)) ∫ D_kl( Q(· | X_l = x) ∥ M̄ ) (dP_0(x) + dP_1(x))/2
    = (2/(c + 1)) I(X_l′; Π(X′)),
where the second inequality uses that M̄ = ∫ Q(· | X_l = x) (dP_0(x) + dP_1(x))/2 minimizes the
integrated KL-divergence (recall inequality (10.2.3)). Returning to inequality (11.5.4), we evidently
have the result of the lemma.
Now, we note that as the X_i are independent conditional on V (and w.l.o.g. for the purposes of
mutual information, we may assume they are discrete), for any v ∈ {0, 1} we have
Σ_{i=1}^m I(X_i; Π | V = v) = Σ_{i=1}^m [H(X_i | V = v) − H(X_i | Π, V = v)]
    = Σ_{i=1}^m [H(X_i | X_1^{i−1}, V = v) − H(X_i | Π, V = v)]
    ≤ Σ_{i=1}^m [H(X_i | X_1^{i−1}, V = v) − H(X_i | X_1^{i−1}, Π, V = v)]
    = Σ_{i=1}^m I(X_i; Π | X_1^{i−1}, V = v) = I(X_1, . . . , X_m; Π | V = v),
where the inequality used that conditioning decreases entropy. We thus obtain
d_hel^2(M_0, M_1) ≤ (7/2) (c + 1) β min_{v∈{0,1}} I(X_1, . . . , X_m; Π | V = v),
as desired.
For vectors u_1, . . . , u_N with mean ū = (1/N) Σ_{i=1}^N u_i, expanding squares (the cross terms vanish) gives
Σ_{i,j} ∥u_i − u_j∥_2^2 = Σ_{i,j} ∥u_i − ū + ū − u_j∥_2^2 = Σ_{i,j} [∥u_i − ū∥_2^2 + ∥ū − u_j∥_2^2] = 2N Σ_{i=1}^N ∥ū − u_i∥_2^2,
so that for any u_0,
(1/N) Σ_{1≤i<j≤N} ∥u_i − u_j∥_2^2 ≤ Σ_{i=1}^N ∥u_i − ū∥_2^2 ≤ Σ_{i=1}^N ∥u_i − u_0∥_2^2.     (11.5.5)
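A quick numeric check (illustrative, with random Gaussian vectors) confirms both the identity and the mean-minimizing property in (11.5.5):

```python
# Numeric check of the Euclidean identity behind (11.5.5): with u_bar the mean,
#   sum_{i,j} ||u_i - u_j||^2 = 2N sum_i ||u_i - u_bar||^2,
# and sum_i ||u_i - u_bar||^2 <= sum_i ||u_i - u_0||^2 for any fixed u_0.
import random

rng = random.Random(0)
N, dim = 8, 3
u = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(N)]
ubar = [sum(ui[c] for ui in u) / N for c in range(dim)]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

lhs = sum(sqdist(u[i], u[j]) for i in range(N) for j in range(N))
mid = 2 * N * sum(sqdist(ui, ubar) for ui in u)
assert abs(lhs - mid) < 1e-9
u0 = [rng.gauss(0, 1) for _ in range(dim)]
assert sum(sqdist(ui, ubar) for ui in u) <= sum(sqdist(ui, u0) for ui in u) + 1e-12
print("identity and mean-minimizing property verified")
```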
Now, we return to the Hellinger distances. Evidently 2 d_hel^2(P_a, P_b) = ∥√(p_a(·)) − √(p_b(·))∥_2^2, so
that it is a Euclidean distance. As a consequence, for any pairwise disjoint collection of N bit
vectors b^{(i)}, we have
Σ_{i=1}^N d_hel^2(P_0, P_{b^{(i)}}) ≥ (1/N) Σ_{1≤i<j≤N} d_hel^2(P_{b^{(i)}}, P_{b^{(j)}}) = (1/N) Σ_{1≤i<j≤N} d_hel^2(P_0, P_{b^{(i)}+b^{(j)}}),     (11.5.6)
where the inequality follows from (11.5.5) and the equality by the assumed cut-and-paste property.
Now, we apply Baranyai’s theorem, which says that we may decompose any complete graph KN ,
where N is even, into N − 1 perfect matchings Mi with N/2 edges—necessarily, as they form a
perfect matching—where each Mi is edge disjoint. Identifying the pairs i < j with the complete
graph, we thus obtain
Σ_{1≤i<j≤N} d_hel^2(P_0, P_{b^{(i)}+b^{(j)}}) = Σ_{l=1}^{N−1} Σ_{(i,j)∈M_l} d_hel^2(P_0, P_{b^{(i)}+b^{(j)}}).     (11.5.7)
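The special case of Baranyai's theorem used here, that K_N with N even decomposes into N − 1 edge-disjoint perfect matchings, admits an explicit classical construction. The sketch below uses the round-robin (circle) method, which is one such construction (an illustration, not the notes' argument):

```python
# Sketch of the 1-factorization of K_N used with (11.5.7): for even N, the
# round-robin (circle) construction produces N-1 edge-disjoint perfect
# matchings covering all N(N-1)/2 edges.
def round_robin_matchings(N):
    """Return N-1 perfect matchings of K_N on vertices 0..N-1 (N even)."""
    assert N % 2 == 0
    others = list(range(1, N))
    matchings = []
    for r in range(N - 1):
        row = [0] + others[r:] + others[:r]   # fix vertex 0, rotate the rest
        matchings.append([tuple(sorted((row[i], row[N - 1 - i])))
                          for i in range(N // 2)])
    return matchings

N = 8
ms = round_robin_matchings(N)
all_edges = [e for m in ms for e in m]
assert len(ms) == N - 1 and all(len(m) == N // 2 for m in ms)
assert len(set(all_edges)) == N * (N - 1) // 2    # edge-disjoint, covers K_N
for m in ms:                                       # each is a perfect matching
    assert sorted(v for e in m for v in e) == list(range(N))
print("K_8 decomposed into 7 edge-disjoint perfect matchings")
```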
Now fix n ∈ {1, . . . , N − 1} and a matching M_n. By assumption we have ⟨b^{(i)} + b^{(j)}, b^{(i′)} + b^{(j′)}⟩ = 0
for any distinct pairs (i, j), (i′, j′) ∈ M_n, and moreover, Σ_{(i,j)∈M_n} (b^{(i)} + b^{(j)}) = b. Thus, our
induction hypothesis gives that for any of our matchings M_n, we have
Σ_{(i,j)∈M_n} d_hel^2(P_0, P_{b^{(i)}+b^{(j)}}) ≥ d_hel^2(P_0, P_b) ∏_{l=1}^{k−1} (1 − 2^{−l}).
Substituting this lower bound into inequality (11.5.7) and using inequality (11.5.6), we obtain
Σ_{i=1}^N d_hel^2(P_0, P_{b^{(i)}}) ≥ (1/N) · (N − 1) d_hel^2(P_0, P_b) ∏_{l=1}^{k−1} (1 − 2^{−l}) = d_hel^2(P_0, P_b) ∏_{l=1}^k (1 − 2^{−l}),
completing the induction.
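The loss factor ∏_{l=1}^k (1 − 2^{−l}) never degrades below a universal constant; a two-line computation (illustrative) shows the infinite product converges to roughly 0.2888:

```python
# The factor prod_{l=1}^k (1 - 2^{-l}) is bounded below uniformly in k: the
# infinite product converges to about 0.2888, so the direct-sum argument
# loses only a constant factor regardless of k.
prod = 1.0
for l in range(1, 60):
    prod *= 1 - 2.0 ** (-l)
assert 0.288 < prod < 0.289
print(f"prod_(l>=1) (1 - 2^-l) ~ {prod:.4f}")
```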
11.6 Bibliography
Data processing inequalities originate with Dobrushin’s study of central limit theorems for Markov
chains [66, 67]; Dobrushin first proved Proposition 11.1.1 (see [67, Sec. 3.1]). Cohen et al. [54] show
that the strong data processing constant for variation distance is the largest of the strong data pro-
cessing constants (Theorem 11.1.2) for finite state spaces using careful linear algebraic techniques,
also showing the opposite extremality (inequality (11.1.1)) of the χ2 contraction coefficient [54,
Proposition II.6.15] for finite state spaces. Del Moral et al. [65] and Polyanskiy and Wu [154] give
related and approachable treatments for general alphabets, and Exercises 11.1 and 11.2 follow [65].
More broadly, strong data processing inequalities arise in many applications in communication,
estimation, and some functional analysis [157, 154].
Communication complexity begins with Yao [193], which introduces the communication com-
plexity setting we discuss in Section 11.3, making the connections between randomized complexities
and public (shared) randomness. The standard classical reference for the subject is Kushilevitz
and Nisan’s book [129]. There are numerous techniques that we do not discuss, including so-called
discrepancy lower bounds, which address both randomized and deterministic communication com-
plexity; for example, these give the stronger lower bound that DCCδ (IP2 ) ≥ n−O(1) [129, Example
3.29 and Exercise 3.30]. Communication complexity has uses far beyond the “standard” commu-
nication setting we have outlined, with more recent research showing how to use the techniques
to provide lower bounds on the performance of algorithms in many computational models, such
as streaming models and memory-limited computation [149, 159]. Our information complexity ap-
proach follows Bar-Yossef et al. [15]. Recent work has shown how communication lower bounds and
strong data processing inequalities can be used to show the necessity of “memorization” in some
natural problems in machine learning, where any learning procedure with good enough performance
necessarily encodes substantial irrelevant information about a dataset [40].
Our treatment of communication complexity and its applications in estimation follows an ap-
proach Zhang et al. [198] originate. The particular techniques we adapt, involving direct sums and
strong data processing in communication, we adapt from Braverman et al. [39] and Garg et al.
[99]. Our results apply most easily to scenarios in which each machine or agent owns only a single
data item, which allows application of Proposition 11.4.2; tensorizing this to multiple observations
requires some care, but can be done with a truncation argument [198, 39] or more careful Sobolev
inequalities [157]. Our extension to private estimation scenarios follows the paper [70], which also
shows how to generalize to other variants of privacy.
11.7 Exercises
Exercise 11.1 (Approximating nonnegative convex functions): Let f : R → R+ ∪ {+∞} be a
closed, nonnegative convex function.
(a) Show that there exists a sequence of piecewise linear functions fn satisfying fn−1 ≤ fn ≤ f for
all n and for which fn (x) ↑ f (x) pointwise for all x s.t. f (x) < ∞, and fn (x) ↑ ∞ otherwise.
Hint: Let L be the collection of linear functions below f , that is L = {l | l(x) = a + bx, l(x) ≤
f (x) for all x}, and note that f (x) = sup{l(x) | l ∈ L}. (See Appendix C.2.) You may replace
L with functions of the form l(x) = f (x0 ) + g(x − x0 ), where g ∈ ∂f (x0 ) is a subderivative of
f at x0 .
Exercise 11.2 (Proving Theorem 11.1.2): In this question, we formalize the sketched proof of
Theorem 11.1.2 by filling in details of the following steps. Let α = αTV (Q) be the Dobrushin
coefficient of the channel Q and f : R → R+ ∪ {+∞} be a closed convex function.
(a) There exists a nondecreasing sequence f_n of piecewise linear functions, each of the form
f_n(x) = Σ_{i=1}^n a_i [b_i − x]_+ + Σ_{i=1}^n c_i [x − d_i]_+, where b_i ≤ 1, d_i ≥ 1, and a_i, c_i ≥ 0.
Hint: Exercise 11.1.
(b) Let M_v(A) = ∫ Q(A | x) dP_v(x) for v ∈ {0, 1} be the induced marginal distributions. Show
that for any function of the form h(t) = [t − ∆]_+, where ∆ > 1,
i. Define the set X (∆) := {x | dP0 (x) ≤ ∆dP1 (x)}. Argue that X (∆) must be non-null (i.e.,
have positive measure).
ii. Define the probability distribution P_∆ with density
dP_∆(x) = (∆ dP_1(x) − dP_0(x)) / (∫ [∆ dP_1(x) − dP_0(x)]_+) · 1{x ∈ X(∆)}.
Argue that the measure
G = ∆P1 − (∆ − 1)P∆
is a probability distribution.
Exercise 11.3 (Markov chain mixing): Consider a Markov chain X1 , X2 , . . . with transition
distribution P (· | x) and stationary distribution π. Let P k (· | x) denote the distribution of the
Markov chain initialized in state x after k steps. Assume there exists some (finite) positive integer
k ∈ N such that for any two initial states x0 , x1 , the Markov chain satisfies
∥P^k(· | x_0) − P^k(· | x_1)∥_TV ≤ β < 1.
Show that the Markov chain enjoys fast mixing for any f-divergence: if there is any n such that
D_f(P^n(· | x)||π) < ∞, the Markov chain mixes exponentially quickly in that it satisfies
lim sup_n (1/n) log D_f(P^n(· | x)||π) ≤ (1/k) log β < 0.
In brief, as soon as one can demonstrate a constant gap in variation distance, the Markov chain
is guaranteed to mix geometrically.
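A small numerical illustration (not part of the exercise): for a two-state chain, the variation distance between the two k-step distributions can be computed exactly, and it contracts geometrically at the one-step rate β. The transition matrix below is an arbitrary choice.

```python
import numpy as np

# Two-state transition matrix; row x is P(. | x).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def tv_rows(M):
    """Total variation distance between the two rows of a stochastic matrix."""
    return 0.5 * np.abs(M[0] - M[1]).sum()

beta = tv_rows(P)          # one-step contraction coefficient; here 0.7
tvs = []
Pk = P.copy()
for k in range(1, 11):     # TV distance between P^k(.|0) and P^k(.|1)
    tvs.append(tv_rows(Pk))
    Pk = Pk @ P
```

For a two-state chain the contraction is exact (tv_k = β^k); in general the Dobrushin coefficient only gives tv_k ≤ β^k, which already yields the geometric mixing the exercise asserts.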
Exercise 11.4: For k ∈ [1, ∞], we consider the collection of distributions
P_k := {P : E_P[|X|^k]^{1/k} ≤ 1},
that is, distributions P supported on R with kth moment bounded by 1. We consider minimax
estimation of the mean E[X] for these families under ε-local differential privacy, meaning that for
each observation X_i, we observe a private realization Z_i (which may depend on Z_1^{i−1}), where Z_i
is an ε-differentially private view of X_i. Let Q_ε denote the collection of all ε-differentially private
channels, and define the (locally) private minimax risk
M_n(θ(P), (·)², ε) := inf_{θ̂_n} inf_{Q∈Q_ε} sup_{P∈P} E_{P,Q}[(θ̂_n(Z_1^n) − θ(P))²].
(a) Assume that ε ≤ 1. For k ∈ [1, ∞], show that there exists a constant c > 0 such that
M_n(θ(P_k), (·)², ε) ≥ c (1/(nε²))^{(k−1)/k}.
(b) Give an ε-locally differentially private estimator achieving the minimax rate in part (a).
Exercise 11.5: Show that the strong data processing inequality in Theorem 11.2.1 is sharp in the
following sense. There exist ε-differentially private channels Q_ε such that for any Bernoulli distributions
P_0 and P_1 and induced marginal distributions M_{v,ε} = Q(· | X = 1)P_v(X = 1) + Q(· | X = 0)P_v(X = 0),
D_kl(M_{0,ε}||M_{1,ε}) / ∥P_0 − P_1∥²_TV = ε²/2 + O(ε³)
as ε ↓ 0.
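A numerical sanity check of this sharpness claim (a sketch, using the standard randomized-response channel, which keeps the input bit with probability e^ε/(1 + e^ε); the Bernoulli parameters below are arbitrary choices):

```python
import math

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def marginal(theta, eps):
    """P(Z = 1) when eps-randomized response is applied to X ~ Bernoulli(theta)."""
    q = math.exp(eps) / (1 + math.exp(eps))  # probability the bit is kept
    return q * theta + (1 - q) * (1 - theta)

theta0, theta1 = 0.2, 0.7
tv = abs(theta0 - theta1)    # ||P0 - P1||_TV for two Bernoulli distributions

def ratio(eps):
    m0, m1 = marginal(theta0, eps), marginal(theta1, eps)
    return kl_bernoulli(m0, m1) / tv ** 2
```

As ε ↓ 0 the ratio approaches ε²/2, matching the displayed expansion.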
Exercise 11.6: We apply the results of Exercise 11.4 to a problem of estimation of drug use.
Assume we interview a series of individuals i = 1, . . . , n, asking whether each takes illicit drugs.
Let Xi ∈ {0, 1} be 1 if person i uses drugs, 0 otherwise, and define θ∗ = E[X] = E[Xi ] = P (X = 1).
Instead of Xi we observe answers Zi under differential privacy,
Zi | Xi = x ∼ Q(· | Xi = x)
for an ε-differentially private Q with ε < 1/2 (so that (e^ε − 1)² ≤ 2ε²). Let Q_ε denote the family of
all ε-differentially private channels, and let P denote the Bernoulli distributions with parameter
θ(P) = P(X_i = 1) ∈ [0, 1] for P ∈ P.
(a) Use Le Cam’s method and the strong data processing inequality in Theorem 11.2.1 to show
that the minimax rate for estimation of the proportion θ∗ in absolute value satisfies
M_n(θ(P), | · |, ε) := inf_{Q∈Q_ε} inf_{θ̂} sup_{P∈P} E_{P,Q}[|θ̂(Z_1, . . . , Z_n) − θ(P)|] ≥ c · 1/√(nε²),
where c > 0 is a universal constant.
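For intuition about the rate, here is a sketch of the matching upper bound via randomized response plus debiasing (the estimator and simulation parameters are illustrative assumptions, not taken from the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(x, eps, rng):
    """eps-differentially private bit release: keep x w.p. q = e^eps/(1+e^eps)."""
    q = np.exp(eps) / (1 + np.exp(eps))
    keep = rng.random(x.shape) < q
    return np.where(keep, x, 1 - x)

def debiased_mean(z, eps):
    """Unbiased estimate of theta: E[Z] = q*theta + (1-q)(1-theta), so invert."""
    q = np.exp(eps) / (1 + np.exp(eps))
    return (z.mean() - (1 - q)) / (2 * q - 1)

theta, eps, n = 0.3, 0.5, 200_000
x = (rng.random(n) < theta).astype(int)
z = randomized_response(x, eps, rng)
theta_hat = debiased_mean(z, eps)
```

The variance of the debiased mean scales as 1/(n(2q − 1)²) ≍ 1/(nε²) for small ε, matching the lower bound up to constants.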
Exercise 11.7: Show that the randomized communication complexity (11.3.1) satisfies RCC_δ(f) ≤
O(1) log(1/δ) · RCC_{1/3}(f) for any f and any δ < 1.
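The standard route is independent repetition plus a majority vote: running an error-1/3 protocol k times and taking the majority answer multiplies the communication by k = O(log(1/δ)) while driving the error below δ. A sketch of the error calculation (exact binomial tail, odd k):

```python
from math import comb

def majority_error(k, p=1/3):
    """P(majority vote over k independent runs is wrong), each run wrong w.p. p."""
    return sum(comb(k, j) * p ** j * (1 - p) ** (k - j)
               for j in range(k // 2 + 1, k + 1))
```

By a Chernoff bound the error decays like exp(−k · D_kl(1/2‖1/3)), so k = O(log(1/δ)) repetitions suffice.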
Exercise 11.8 (From public to private randomness): Consider the randomized complexity (11.3.1)
and the associated public-randomness complexity RCC^pub_δ. Let X = Y = {0, 1}^n and f : X × Y → {0, 1},
and let Π be a protocol using public randomness U such that max_{x,y} P(Π(x, y, U) ̸= f(x, y)) ≤ ε.
(b) Give a protocol that uses no public randomness but whose communication complexity is at
most depth(Π) + O(1) log(n/δ).
Exercise 11.9 (An information lower bound for indexing): In the indexing problem in communi-
cation complexity, Alice receives an n-bit string x ∈ {0, 1}n and Bob an index y ∈ [n] = {1, . . . , n},
and the two communicate to evaluate xy ; set f (x, y) = xy .
(a) Show that if Bob can send messages, the communication complexity of indexing satisfies
CC(f ) ≤ O(1) log n.
In the one-way communication model, only Alice can send messages. Let µ be the uniform
distribution on (X, Y) ∈ {0, 1}^n × [n]. We will show that DCC^µ_δ(f) ≥ (1 − h_2(δ))n, where
h_2(p) = −p log₂ p − (1 − p) log₂(1 − p) is the binary entropy.
(b) Fix the index Y = i and let p_i = P(X̂_i = X_i | Y = i) based on a protocol Π. Use Fano's
inequality (Proposition 9.4.1) to argue that h_2(p_i) ≥ H_2(X_i | Π), and conclude that
I(X_1^n; Π) ≥ (1 − h_2(δ))n.
Exercise 11.10 (Information complexity for entrywise less or equal): Consider the entrywise less
than or equal to function f : {0, 1}n × {0, 1}n → {0, 1} with f (x, y) = 1 {x ⪯ y}, so that f (x, y) = 1
if xi ≤ yi for each i and 0 if there exists i such that xi > yi .
(a) Show that f has the decompositional structure (11.3.4). Give the functions g and h.
(c) Use Theorem 11.3.13 and a modification of the proof of Proposition 11.3.15 to show that
IC_δ(f) ≥ (n/4)(1 − 2√(δ(1 − δ))). (This is order optimal, because IC_δ(f) ≤ CC(f) ≤ n + 1 trivially.)
Exercise 11.11 (Lower bounds for private logistic regression): This question is (likely) challeng-
ing. Consider the logistic regression model for y ∈ {±1}, x ∈ Rd , that
p_θ(y | x) = 1/(1 + exp(−y⟨θ, x⟩)).
For a distribution P on (X, Y ) ∈ Rd × {±1}, where Y | X = x has logistic distribution, define the
excess risk
r(θ, P ) := EP [ℓ(θ; X, Y )] − inf EP [ℓ(θ; X, Y )]
θ
where ℓ(θ; x, y) = log(1 + exp(−y⟨x, θ⟩)) is the logistic loss. Let P be the collection of such
distributions, where X is supported on {−1, 1}d . Peeking ahead to Chapter 17 and Section 17.3,
for a channel Q mapping (X, Y) → Z, define
M_n(P, Q) := inf_{θ̂} sup_{P∈P} E_{P,Q}[r(θ̂(Z_1^n), P)],
where the expectation is taken over Z_i ∼ Q(· | X_i, Z_1^{i−1}). Assume that the channel releases are all
(locally) ε-differentially private.
(a) Show that
M_n(P, Q) ≥ c · (d/n) · (d/(ε ∧ ε²))
for some (numerical) constant c > 0.
(b) Suppose we allow additional passes through the dataset (i.e., multiple rounds of communication),
but still require that all data Z_i released from X_i be ε-differentially private. That is, assume
we have the (sequential and interactive) release schemes of Fig. 11.3, and we guarantee that
Z_i^{(t)} ∼ Q(· | X_i, B^{(1)}, . . . , B^{(t)}, Z_1^{(t)}, . . . , Z_{i−1}^{(t)})
is ε_{i,t}-differentially private, where Σ_t ε_{i,t} ≤ ε for all i. Does the lower bound of part (a) change?
(a) Show that if p(v) and p(v | x) denote the p.m.f.s of V and V conditional on X = x, then
Chapter 12
The squared error takes a particularly central place in the theory of estimation, as its simple
structure means it admits closed forms for many optimal estimators and, for lack of a better
description, it plays nicely with differentiation. In this chapter, we develop a few of the elements of
this theory, presenting some of the classical bounds on estimation accuracy and extending them beyond
basic parametric estimators. For example, in any Bayesian problem, where we draw a parameter
θ from a prior π on θ and then observe a sample X ∼ P_θ, the posterior mean E[θ | X] is always
Bayes optimal: we have
E[∥θ̂(X) − θ∥₂²] = E[E[∥θ̂(X) − θ∥₂² | X]] ≥ E[E[∥E[θ | X] − θ∥₂² | X]],
where we use that inf t E[∥Y − t∥22 ] = E[∥Y − E[Y ]∥22 ] for any random vector Y .
Rather than explicitly using this Bayesian perspective with its explicit characterization of optimal
estimation, however, we leverage the connection between squared error and correlation, showing
that on average over parameters θ, the parameter θ is correlated with the score vector
ℓ̇_θ(x) := ∇_θ log p_θ(x) = ∇_θ p_θ(x) / p_θ(x)
when X ∼ Pθ has density pθ . The connection between correlation—as an inner product—and
squared error then allows us to provide strong lower bounds on estimation. Another description
for our approaches here is that integration-by-parts implies lower bounds on the squared error.
The main results in this chapter provide lower bounds on the estimation error in terms of the
Fisher information matrix of the parameter θ,
J(θ) := E_θ[ℓ̇_θ(X)ℓ̇_θ(X)⊤] = Cov_θ(ℓ̇_θ(X)), (12.0.1)
where the equality exploits that, so long as we may exchange integration and differentiation,
E_θ[ℓ̇_θ(X)] = ∫ (∇_θ p_θ(x)/p_θ(x)) p_θ(x) dx = ∫ ∇_θ p_θ(x) dx (⋆)= ∇_θ ∫ p_θ(x) dx = 0,
because ∫ p_θ = 1 for any θ, where equality (⋆) typically requires justification. For an i.i.d. sample
X_1^n ∼iid P_θ, then, the n-observation information matrix satisfies
J_n(θ) = nJ(θ),
because log p_θ(x_1^n) = Σ_{i=1}^n log p_θ(x_i). Then the main consequences of the development here are
inequalities of the form
E[∥θ̂(X_1^n) − θ∥₂²] ≥ tr(J(θ)^{−1})/n − O(1/n²),
so that the Fisher information matrix (12.0.1) plays a fundamental role in estimation lower bounds.
Proposition 12.1.1 (The Cramér-Rao bound). Assume that θb : X → R is unbiased for θ. Then
E_θ[(θ̂(X) − θ)²] ≥ 1/J(θ),
where (⋆) exchanges integration and differentiation. Then as ṗθ (x) = ℓ̇θ (x)pθ (x), we obtain
1 = ∫ θ̂(x) ℓ̇_θ(x) p_θ(x) dµ(x) = E_θ[θ̂(X) ℓ̇_θ(X)] = E_θ[(θ̂(X) − θ) ℓ̇_θ(X)].
While there is a long history of using the Cramér-Rao bound to claim some type of statistical
efficiency for an estimator, the Cramér-Rao bound is the biggest con in the history of statistics. In no
way does it actually indicate that an estimator is efficient, as most practically efficient estimators
are biased, because of regularization or other choices. Even if an estimator is asymptotically
unbiased, e.g., E[θ̂_n(X_1^n)] → θ, the Cramér-Rao inequality says nothing. Additionally, as soon as the
underlying parameter space Θ is compact, typically no unbiased estimator that guarantees θ̂ ∈ Θ
even exists.
Example 12.1.2 (Estimation of a bounded parameter): Let {Pθ } be any family of probability
models absolutely continuous with respect to one another, meaning that for any θ0 , θ1 ∈ Θ,
P_{θ_0}(A) > 0 implies P_{θ_1}(A) > 0. Suppose the parameter θ ∈ [a, b], and assume θ̂ is unbiased
for θ but still satisfies θ̂ ∈ [a, b]. Then we claim that θ̂ = a with probability 1 and θ̂ = b with
probability 1. Indeed, when θ = a, we have
E_a[θ̂(X)] = a,
Building off of Example 12.1.2, we can show that if we have the a priori guarantee that the
parameter θ ∈ Θ for a convex body Θ, then we can reduce the mean-squared error of any estimator
θ̂ trivially: simply project θ̂ onto Θ. We can provide stricter guarantees. For example, Exercise 12.1
shows that if θ̂ is unbiased for θ, then (except in pathological settings) the estimator
θ̃(X) := Proj_Θ(θ̂) = argmin_{θ∈Θ} ∥θ − θ̂∥₂²
has strictly smaller mean-squared error for all θ ∈ Θ: a rebuke to the idea that the Cramér-Rao
inequality provides a fundamental limit. In general, there always exists λ > 0 such that the
regularized estimator has smaller mean-squared error than the unregularized estimator θ̂_0 (see
Exercise 12.2).
JCD Comment: Give the Fisher information in this problem?
Here, let us consider estimation of the mean of a standard normal random vector. Let Z ∼ N(θ, I_d),
where θ ∈ R^d is unknown, and note that the Fisher information in this case is J(θ) = I_d, so the
information bound on the squared error is tr(J(θ)^{−1}) = d. Note that
argmin_t {∥t − Z∥₂² + λ∥t∥₂²} = Z/(1 + λ)
takes the simple shrinkage form Z ↦ Z/(1 + λ), so we reparameterize and consider estimators of the
form
θ̂_β = βZ
for β ∈ [0, 1]. (This is equivalent to the choice λ = 1/β − 1 in the usual ridge regression formulation.)
Then immediately we see that as Z = θ + ε for ε ∼ N(0, I_d), we have θ̂_β = βθ + βε, and so
E[∥θ̂_β − θ∥₂²] = (1 − β)²∥θ∥₂² + β²d.
Taking derivatives, we see that β = ∥θ∥₂²/(d + ∥θ∥₂²) minimizes the mean-squared error. This
at least shows that there is always some shrinkage-based estimator outperforming the putative
information bound, which is d.
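A quick numerical check of this minimizer (a sketch; the dimension and ∥θ∥₂² below are arbitrary choices):

```python
import numpy as np

def mse(beta, theta_norm_sq, d):
    """Risk of the shrinkage estimator beta * Z for Z ~ N(theta, I_d)."""
    return (1 - beta) ** 2 * theta_norm_sq + beta ** 2 * d

d, theta_norm_sq = 10, 4.0
betas = np.linspace(0.0, 1.0, 100_001)
beta_grid = betas[np.argmin(mse(betas, theta_norm_sq, d))]   # grid minimizer
beta_formula = theta_norm_sq / (d + theta_norm_sq)           # analytic minimizer
```

The risk at β = ∥θ∥₂²/(d + ∥θ∥₂²) is d∥θ∥₂²/(d + ∥θ∥₂²) < d, strictly beating the information bound.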
In the particular case of normal mean estimation, James-Stein-type estimators allow us to say
more. As we wish to estimate θ, the “optimal” shrinkage parameter β = ∥θ∥₂²/(d + ∥θ∥₂²) is of course
unavailable. But E[∥Z∥₂²] = d + ∥θ∥₂², and so an estimate of β is possible: we have
β = (E[∥Z∥₂²] − d)/E[∥Z∥₂²] = 1 − d/E[∥Z∥₂²],
and replacing β with the slightly less conservative counterpart
β̂ := 1 − [d − 2]₊/∥Z∥₂²
yields the James-Stein estimator θ̂^JS := β̂ Z.
Theorem 12.1.3. Let the dimension d > 2 and Z ∼ N(θ, Id ) for some θ ∈ Rd . Then
E_θ[∥θ̂^JS − θ∥₂²] < E_θ[∥Z − θ∥₂²] = d.
In this case, we see that an adaptive estimator strictly outperforms the sample mean, which achieves
the information bound.
JCD Comment: Give citation and commentary for James-Stein, saying that we won’t
worry about it too much, just that it means we ought to move beyond Cramér-Rao.
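A Monte Carlo check of Theorem 12.1.3 (a sketch; this uses a positive-part variant of the James-Stein shrinkage, and the particular θ and dimension are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def james_stein(z):
    """Positive-part James-Stein estimator (1 - (d - 2)/||z||^2)_+ z, for d > 2."""
    d = z.shape[-1]
    shrink = 1.0 - (d - 2) / np.sum(z ** 2, axis=-1, keepdims=True)
    return np.maximum(shrink, 0.0) * z

d, trials = 10, 200_000
theta = np.ones(d)
z = theta + rng.standard_normal((trials, d))
risk_js = np.mean(np.sum((james_stein(z) - theta) ** 2, axis=1))
risk_mle = np.mean(np.sum((z - theta) ** 2, axis=1))
```

Here risk_mle concentrates near the information bound d = 10, while risk_js is strictly and substantially smaller.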
Then in this case, we can show that a Fisher information lower bound holds on average over π, with
a small additional penalty to capture the information π gives on the parameter θ. For notational
simplicity in the theorem, we assume that X ∼ Pθ , detailing the i.i.d. case afterward.
Theorem 12.2.1 (The van Trees inequality). Let the above conditions hold. Then
Eπ[E_θ[(θ̂(X) − θ)²]] ≥ 1/(Eπ[J(θ)] + J(π)).
Proof Define the augmented score
ℓ̇_{θ,π}(x) := (∂/∂θ) log(p_θ(x)π(θ)) = (p_θ(x)π(θ))′/(p_θ(x)π(θ)) = ṗ_θ(x)/p_θ(x) + π̇(θ)/π(θ).
= 1,
because Eπ [π ′ (θ)/π(θ)] = 0.
When the sample X is an i.i.d. sample X_1^n ∼iid P_θ, then of course the Fisher information becomes
J_n(θ) = nJ(θ) = nE_θ[ℓ̇_θ(X)²], where X is a single observation. Then Theorem 12.2.1 provides the
bound
E[(θ̂(X_1^n) − θ)²] ≥ 1/(nEπ[J(θ)] + J(π)) = 1/(nEπ[J(θ)]) − O(1) · J(π)/(Eπ[J(θ)]² n²),
assuming J(π) < ∞. In principle it is possible to choose the prior π to maximize the lower bound,
but typically this is rather immaterial.
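The algebra behind the displayed expansion is just 1/(nA + B) = 1/(nA) − B/(A²n²) + O(1/n³); a short numerical check (A and B are arbitrary stand-ins for Eπ[J(θ)] and J(π)):

```python
A, B = 2.0, 5.0  # stand-ins for E_pi[J(theta)] and J(pi)

def exact(n):
    return 1.0 / (n * A + B)

def two_term(n):
    return 1.0 / (n * A) - B / (A ** 2 * n ** 2)

# The remainder of the two-term expansion decays at the 1/n^3 rate.
errs = [abs(exact(n) - two_term(n)) for n in (10, 100, 1000)]
```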
Because we usually want a local lower bound—something akin to the Cramér-Rao bound that
(almost) applies to a particular problem parameter θ0 of interest—it is most common to localize
the prior π to some shrinking neighborhood of θ0 . Let π be any prior supported on a neighborhood
of 0; for simplicity, we can take [−1, 1]. Then for any a > 0 and θ_0 ∈ R, the density
f_a(θ) := (1/a) · π((θ − θ_0)/a)
satisfies ∫ f_a(θ)dθ = 1, has support [θ_0 − a, θ_0 + a], and information J(f_a) = J(π)/a². (See also Exercise 12.3.) Then if J(θ) is continuous in θ near θ_0, we obtain the following corollary.
cise 12.3.) Then if J(θ) is continuous in θ near θ0 , we obtain the following corollary.
Corollary 12.2.2. Let θ_0 ∈ R, the prior density π have compact support on [−1, 1] and finite
information J(π) < ∞, and a_n > 0 be any sequence satisfying 1/√n ≪ a_n ≪ 1. Define the prior
densities
π_n(θ) := (1/a_n) · π((θ − θ_0)/a_n).
If J(θ) is continuous in a neighborhood of θ_0, then
∫_{θ_0−a_n}^{θ_0+a_n} E_θ[(θ̂(X_1^n) − θ)²] π_n(θ)dθ ≥ 1/(nJ(θ_0)) − o(1/n).
In a strong sense, then, the inverse Fisher information J(θ)−1 provides a (nearly) pointwise lower
bound on estimation in mean-square error.
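A numerical check of the rescaling J(f_a) = J(π)/a² for localized priors (a sketch; the cosine-squared prior below is a hypothetical choice with J(π) = π² exactly):

```python
import numpy as np

def prior_information(pdf, dpdf, lo, hi, m=200_001):
    """Numerically evaluate J(pi) = int pi'(t)^2 / pi(t) dt by the trapezoid rule."""
    t = np.linspace(lo, hi, m)[1:-1]  # drop the endpoints, where pi vanishes
    y = dpdf(t) ** 2 / pdf(t)
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(t)))

# Smooth prior pi(t) = cos^2(pi t / 2) supported on [-1, 1]; J(pi) = pi^2.
pi0 = lambda t: np.cos(np.pi * t / 2) ** 2
dpi0 = lambda t: -(np.pi / 2) * np.sin(np.pi * t)

J0 = prior_information(pi0, dpi0, -1.0, 1.0)

a = 0.1
fa = lambda t: pi0(t / a) / a         # rescaled prior, supported on [-a, a]
dfa = lambda t: dpi0(t / a) / a ** 2  # chain rule gives the extra factor 1/a
Ja = prior_information(fa, dfa, -a, a)
```

The localized prior's information grows like 1/a², which is why a_n cannot shrink too quickly in Corollary 12.2.2.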
In the multivariate case, we define J(θ) := E_θ[ℓ̇_θ(X)ℓ̇_θ(X)⊤] and J(π) := Eπ[∇ log π(θ)∇ log π(θ)⊤].
With this notation, the following multivariate van Trees inequality follows almost exactly the same
lines of proof as Theorem 12.2.1.
Then each term satisfies integration-by-parts identities as in the proof of Theorem 12.2.1, with
E[(θ̂_j(X) − θ_j) · [ℓ̇_{θ,π}(X)]_j] = ∫∫ (θ̂_j(x) − θ_j) ∂_j(p_θ(x)π_j(θ_j)) dθ_j · π_{\j}(θ_{\j}) dθ_{\j} dµ(x)
= ∫∫ ( [(θ̂_j(x) − θ_j) p_θ(x)π_j(θ_j)]_{θ_j=a_j}^{θ_j=b_j} + ∫_{a_j}^{b_j} π_j(θ_j)p_θ(x) dθ_j ) π_{\j}(θ_{\j}) dθ_{\j} dµ(x)
= ∫∫ π(θ)p_θ(x) dθ dµ(x) = 1.
We therefore obtain
E[⟨θ̂(X) − θ, ℓ̇_{θ,π}(X)⟩] = ∫∫ ⟨θ̂(x) − θ, ℓ̇_{θ,π}(x)⟩ π(θ)p_θ(x) dµ(x)dθ = d.
Applying the Cauchy-Schwarz inequality to the inner product ⟨u, v⟩ = ⟨A−1/2 u, A1/2 v⟩ for any
A ≻ 0 gives
⟨θ̂ − θ, ℓ̇_{θ,π}⟩ = ⟨θ̂ − θ, ℓ̇_θ⟩ + ⟨θ̂ − θ, ∇ log π(θ)⟩ ≤ ∥θ̂ − θ∥_{A^{−1}} ∥ℓ̇_θ + ∇ log π(θ)∥_A,
Typically, we choose the matrix A to be related to the Fisher information of the parameter θ.
As motivation, recall our (heuristic) development in Chapter 3, where we showed in the approximation (3.2.4) that
θ̂_n − θ⋆ ·∼ N(0, n^{−1} J(θ⋆)^{−1}),
where we have substituted the Fisher information. Recalling Chapter 5.3.4, a small modification
of Corollary 5.3.11 and its application of Corollary 5.3.10 yields the following result. (See Exer-
cise 12.4.)
Corollary 12.2.4. Let the conditions of Corollary 5.3.11 hold, let π be any density supported on
a convex body Θ0 , and let Θ be any convex body with Θ ⊃ Θ0 . Let θbn be the maximum likelihood
estimator and define the projected estimator
θen = ProjΘ (θbn ).
Let us assume J(θ) is continuous in θ. In Corollary 12.2.4, setting B(θ) = J(θ) we then obtain
E_θ[(θ̃_n(X_1^n) − θ)⊤ J(θ)(θ̃_n(X_1^n) − θ)] ≤ d/n + o(1/n)
simultaneously for all θ ∈ Θ_0. Comparing with Theorem 12.2.3, fix θ⋆, let δ > 0, and let Θ_0 be any
neighborhood of θ⋆ small enough that (1 − δ)J(θ) ⪯ J(θ⋆) ⪯ (1 + δ)J(θ) for all θ ∈ Θ_0. Then
Eπ[∥θ̂_n(X_1^n) − θ∥²_{A^{−1}}] ≥ d²/(n tr(A Eπ[J(θ)]) + tr(A J(π))) ≥ d²/(n(1 + δ) tr(A J(θ⋆)) + tr(A J(π)))
In Section 12.3, we exploit these ideas profitably in general estimation problems, such as M-
estimation without well-specified models or the general estimation lower bound framework we
develop in Chapter 9.
In any case, suppose we have a statistic ψ : R^d → R^p and wish to estimate ψ(θ). Assume
the derivative matrix ψ̇(θ) ∈ R^{p×d} exists, with entries ψ̇_{ij}(θ) = ∂ψ_i(θ)/∂θ_j, so that ψ(θ + ∆) =
ψ(θ) + ψ̇(θ)∆ + o(∥∆∥). We then have the following theorem, which extends Theorem 12.2.3.
Theorem 12.2.6. Let the conditions of Theorem 12.2.3 hold, and let C ∈ Rd×p be an arbitrary
matrix. Assume that ψ is continuously differentiable on the support of π. Then for any A ≻ 0 and
any estimator ψb of ψ(θ),
Eπ[(ψ̂(X) − ψ(θ))⊤ A^{−1} (ψ̂(X) − ψ(θ))] ≥ Eπ[tr(ψ̇(θ)C)]² / (tr(CAC⊤Eπ[J(θ)]) + tr(CAC⊤J(π))).
Then, writing E(x, θ) := ψ̂(x) − ψ(θ) for shorthand, we have
E[⟨ψ̂(X) − ψ(θ), C⊤ℓ̇_{θ,π}(X)⟩] = Σ_{j=1}^d ∫∫ ⟨E(x, θ), c_j⟩ ∂_j(p_θ(x)π_j(θ_j)) dθ_j π_{\j}(θ_{\j}) dθ_{\j} dµ(x)
= Σ_{j=1}^d ∫∫ ( [⟨E(x, θ), c_j⟩p_θ(x)π_j(θ_j)]_{a_j}^{b_j} + ∫_{a_j}^{b_j} ⟨∂_j ψ(θ), c_j⟩p_θ(x)π_j(θ_j) dθ_j ) π_{\j}(θ_{\j}) dθ_{\j} dµ(x)
= Eπ[tr(ψ̇(θ)C)]
by integration by parts.
For the remainder of the proof, we mimic that of Theorem 12.2.3. We have
E[⟨E(X, θ), C⊤ℓ̇_{θ,π}(X)⟩] ≤ E[∥E(X, θ)∥²_{A^{−1}}]^{1/2} E[∥C⊤ℓ̇_{θ,π}(X)∥²_A]^{1/2},
The final version of the van Trees, or Bayesian Cramér-Rao inequality, extends Theorem 12.2.3
to include additional terms to better reflect the geometry of the problem at hand, which eliminates
the need for some of the local approximations we have made to consider particular parameters θ0
of interest. In this case, we let the matrices A ≻ 0 and C ∈ Rd×p vary in θ in Theorem 12.2.6. For
such a setting, we will require a prior Fisher information with
s_{C,π}(θ) := (1/π(θ)) [ Σ_{i=1}^d (∂/∂θ_i)(C_{ij}(θ)π(θ)) ]_{j=1}^p ∈ R^p
taking the place of the usual score vector ∇ log π(θ) (these agree when p = d and C = Id , so that
sI,π (θ) = ∇ log π(θ)) and
h i
J(π | A, C) := Eπ sC,π (θ)⊤ A(θ)sC,π (θ) . (12.2.1)
To state the most general extension of Theorem 12.2.6, we will present a few additional defini-
tions and conditions. A function f on Θ ⊂ Rd is suitably continuous if for each coordinate j and
almost all values θ ∈ Θ, the coordinate function h(t) := f (θ + tej ) is absolutely continuous. Then
we consider the conditions (i)–(v) below, which relax the assumptions necessary for Theorem 12.2.6.
(i) The density pθ (x) is measurable in (x, θ) and, for almost every x, is suitably continuous in θ.
(iii) The Fisher information J(θ) := Eθ [ℓ̇θ (X)ℓ̇θ (X)⊤ ] exists and diag(J(θ))1/2 is locally integrable.
(v) The prior density π is suitably continuous, its domain Θ is compact with piecewise C¹
boundary, π is positive on int Θ, and π vanishes on bd Θ.
Then Gill and Levit [102, Theorem 1] show the following result.
Theorem 12.2.7 (The multivariate van Trees inequality). Let conditions (i)–(v) hold. Then for
any estimator ψb : X → Rp ,
Eπ[(ψ̂(X) − ψ(θ))⊤ A(θ)^{−1} (ψ̂(X) − ψ(θ))] ≥ Eπ[tr(ψ̇(θ)C(θ))]² / (Eπ[tr(C(θ)A(θ)C(θ)⊤J(θ))] + J(π | A, C)).
Exercise 12.6 asks you to prove a slightly weaker version of Theorem 12.2.7, which follows from
arguments similar to those we use to prove Theorem 12.2.6. Under the same conditions, when we
have an i.i.d. sample X_1^n ∼iid P_θ, the fact that the Fisher information tensorizes implies
Corollary 12.2.8. Let the conditions of Theorem 12.2.7 hold and X_1^n ∼iid P_θ. Then for any
estimator ψ̂_n : X^n → R^p,
Eπ[(ψ̂_n(X_1^n) − ψ(θ))⊤ A(θ)^{−1} (ψ̂_n(X_1^n) − ψ(θ))] ≥ Eπ[tr(ψ̇(θ)C(θ))]² / (nEπ[tr(C(θ)A(θ)C(θ)⊤J(θ))] + J(π | A, C)).
Corollary 12.2.9. Let the conditions of Theorem 12.2.7 hold. Then for A(θ) = J(θ) and C(θ) =
Id , we have J(π | A, C) = Eπ [∇ log π(θ)∇ log π(θ)⊤ ] and
Eπ[(θ̂_n(X_1^n) − θ)⊤ J(θ)(θ̂_n(X_1^n) − θ)] ≥ d²/(nd + J(π | A, C)) ≥ d/n − J(π | A, C)/n².
Taking ψ to be the identity once again, but this time letting A(θ) = Id and C(θ) = J(θ)−1 and
assuming J is suitably differentiable, we obtain
Eπ[∥θ̂_n(X_1^n) − θ∥₂²] ≥ Eπ[tr(J(θ)^{−1})]² / (nEπ[tr(J(θ)^{−1})] + J(π | C)) ≥ Eπ[tr(J(θ)^{−1})]/n − J(π | C)/n².
When ψ : R^d → R^p is not the identity mapping, the bounds are more sophisticated, but natural
choices again present themselves. We begin by heuristically developing an upper bound, using a
technique known as the delta method: as ψ is assumed differentiable, for θ̂_n obeying the usual
asymptotics θ̂_n − θ⋆ ·∼ N(0, (1/n)Σ) for some covariance Σ, we may proceed heuristically to obtain
Let us revisit linear regression (or Example 12.2.5) in this context, where we assume that the
model is true.
where we let X = [X_1 · · · X_n]⊤ be the design matrix. If we wish to estimate a single coordinate
ψ(θ) = e_j⊤θ, the information for the coordinate θ_j is then J_ψ(θ) = (1/σ²)(e_j⊤(X⊤X)^{−1}e_j)^{−1}, and
Corollary 12.2.10 gives the lower bound
Eπ[(θ̂_j − θ_j)²] ≥ σ² e_j⊤(X⊤X)^{−1}e_j + O(1/n²) = (σ²/n) e_j⊤(n^{−1}X⊤X)^{−1}e_j + O(1/n²).
The lower bounds in Corollary 12.2.10 provide further insights in cases where we must estimate
problems with nuisance parameters that are uninteresting, or at least not the subject of particular
scientific investigation. See Exercise 12.7.
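A simulation confirming the coordinate-wise variance formula for ordinary least squares (a sketch; the design, noise level, and dimensions below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, sigma, trials = 50, 3, 0.5, 20_000
X = rng.standard_normal((n, d))            # fixed design matrix
XtX_inv = np.linalg.inv(X.T @ X)

# OLS error theta_hat - theta = (X^T X)^{-1} X^T eps, simulated over many draws.
eps = sigma * rng.standard_normal((trials, n))
errs = eps @ X @ XtX_inv                   # valid since (X^T X)^{-1} is symmetric
emp_var = errs.var(axis=0)                 # empirical Var(theta_hat_j)
theory = sigma ** 2 * np.diag(XtX_inv)     # sigma^2 e_j^T (X^T X)^{-1} e_j
```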
which is a function of the distribution P . Treating the probability distribution P as the “parame-
ter,” then θ(P ) is a function of that parameter, which at the outset seems hard to address with the
techniques we have developed thus far, as in this case P is (typically) infinite dimensional. This is
also the more abstract setting we adopt in Chapter 9.
To extend the techniques for lower bounding estimator error to this setting, we adopt a perspec-
tive that begins with Stein [172], where estimation in a general problem should be at least as hard
as estimation in any particular (model-based) sub-problem. Thus, we take the following two-phase
approach to developing estimation lower bounds for a parameter θ(P ) around a base distribution
P0 :
2. Show how to write θ(Pt ) as a function θ(t) of the underlying parameter t, then apply Theo-
rem 12.2.7 or its corollaries.
3. Optionally, revisit step 1 and modify the underlying models to make the estimation error as large as possible.
Exercises 9.10, 9.12, and 9.13 explore this approach in the context of information-theoretic lower
bounds. Here, we employ the ideas to obtain asymptotically exact results.
The actual approach to doing this is, at the end of the day, rather straightforward given our
development of minimax lower bounds as well as the information-based bounds in the preceding
sections. Assume without loss of generality that P0 has a density p0 on X with respect to a base
measure µ (we could always simply take µ = P_0). Then for a bounded function g : X → R^k with
mean E_0[g(X)] = 0, define the tilted density
p_t(x) := (1 + ⟨t, g(x)⟩) p_0(x). (12.3.1)
It is immediate that for any t ∈ R^k, we have ∫ p_t(x)dµ(x) = E_0[1 + ⟨t, g(X)⟩] = 1, and because
we assume g is bounded, for t near enough 0 we are guaranteed that p_t ≥ 0, so that p_t is indeed a
probability density. Once we have these tilted densities, we will assume that the parameter θ(P)
of interest is locally differentiable:
Assumption A.12.1. Let U be any neighborhood of 0. For the model family {Pt }t∈U , the parameter
θ(Pt ) is continuously differentiable in t on U , with derivative matrix θ̇(t) ∈ Rd×k .
Assumption A.12.1 is abstract, but frequently holds; we provide an example with mean esti-
mation here, and after presenting the main lower bound leveraging the assumption, show how it
applies to the general M-estimation problems of Chapter 5.3.
where we have used that g is mean zero. Then θ(P_t) is affine in t and hence differentiable.
Conveniently, the tilted construction (12.3.1) immediately allows us to compute the Fisher
information matrix for the nuisance parameter t we have invented. Because log pt (x) = log(1 +
⟨t, g(x)⟩) + log p0 (x), we obtain
∇ log p_t(x) = g(x)/(1 + ⟨t, g(x)⟩) and E_t[∇ log p_t(X)] = E_0[(1 + ⟨t, g(X)⟩) · g(X)/(1 + ⟨t, g(X)⟩)] = E_0[g(X)] = 0
and
J(t) = E_0[g(X)g(X)⊤/(1 + ⟨t, g(X)⟩)] = E_0[g(X)g(X)⊤(1 − ⟨t, g(X)⟩)] + O(∥t∥²) = J(0) + O(∥t∥).
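These identities are easy to verify numerically on a finite sample space (a sketch; the base p.m.f. and the choice of g below are arbitrary):

```python
import numpy as np

# Finite base distribution p0 on the points x in {0, 1, 2, 3}.
x = np.arange(4.0)
p0 = np.array([0.1, 0.4, 0.3, 0.2])
g = x - np.dot(p0, x)                 # bounded and mean zero under p0

def tilted(t):
    """Tilted p.m.f. p_t(x) = (1 + t g(x)) p0(x), valid for |t| small."""
    return (1 + t * g) * p0

def fisher_info(t):
    """J(t) = E_0[g^2 / (1 + t g)] for the one-dimensional tilt."""
    return float(np.dot(p0, g ** 2 / (1 + t * g)))

t = 0.1
pt = tilted(t)
score_mean = float(np.dot(pt, g / (1 + t * g)))   # E_t[score] = E_0[g] = 0
```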
Corollary 12.3.2. Let Assumption A.12.1 hold and P_t be the tilted distributions (12.3.1), and let
conditions (i)–(v) hold with t replacing θ. Then for any estimator θ̂ : X^n → R^d, the average risk
∫ E_{P_t}[(θ̂(X_1^n) − θ(P_t))⊤ A(t)^{−1} (θ̂(X_1^n) − θ(P_t))] π(t)dt
obeys the bound of Corollary 12.2.8; in particular, for the choice A(t) = J_θ(t)^{−1},
Eπ[(θ̂(X_1^n) − θ(P_T))⊤ J_θ(T)(θ̂(X_1^n) − θ(P_T))] ≥ d/n − O(1/n²).
One particular (abstract, but common) setting is important: when θ(Pt ) is not only differ-
entiable in t, but where in fact there exists a mean-zero influence function θ̇P0 : X → Rd for θ
satisfying
θ(Pt ) = θ(P0 ) + EP0 [θ̇P0 (X)⟨g(X), t⟩] + o(∥t∥), (12.3.2)
so that the derivative in Assumption A.12.1 is linear in g and θ̇(0) = EP0 [θ̇P0 (X)g(X)⊤ ]. (The
assumption that θ̇P0 is mean zero is no loss of generality, as EP0 [g(X)] = 0 by assumption.) This
linearity holds, for example, for the mean as in Example 12.3.1, where θ̇P0 (x) = x − EP0 [X]; we
will also see it presently for general M-estimators in Section 12.3.1.
Let us proceed somewhat heuristically to show the types of lower bounds the existence of the influence
function (12.3.2) provides. For simplicity, let us take the prior π = π_n^d for a normalized prior density
π_n(t) = √n π_0(√n · t), where π_0 is smooth and supported on [−1, 1]; with this we may assume the
prior information J(π) scales as dn (recall Corollary 12.2.2, and see also Exercise 12.3). Assuming
we can vary g while maintaining the linearity (12.3.2), then, we assume g : X → R^d and may use
J(t) = Cov_0(g(X)) + O(∥t∥) and take C = I_d to obtain
Eπ[(θ̂(X_1^n) − θ(P_T))⊤ A(T)^{−1} (θ̂(X_1^n) − θ(P_T))] ≥ tr(E_{P_0}[θ̇_{P_0}(X)g(X)⊤])² / (nEπ[tr(A(T)Cov_0(g(X)))]) − o(1/n)
for any estimator θ̂ and positive definite A(t). The choice A(t) = I_d then gives
Eπ[∥θ̂(X_1^n) − θ(P_T)∥₂²] ≥ E_{P_0}[⟨θ̇_{P_0}(X), g(X)⟩]² / (nE_{P_0}[∥g(X)∥₂²]) − o(1/n),
which by Cauchy-Schwarz is maximized by g(X) = θ̇_{P_0}(X), giving the lower bound
Eπ[∥θ̂(X_1^n) − θ(P_T)∥₂²] ≥ E_{P_0}[∥θ̇_{P_0}(X)∥₂²]/n − o(1/n). (12.3.3)
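For mean estimation (Example 12.3.1) the influence function is θ̇_{P_0}(x) = x − E[X], so the bound (12.3.3) reads tr(Cov(X))/n, which the sample mean attains; a quick Monte Carlo confirmation (the covariance below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)

d, n, trials = 3, 100, 20_000
Sigma = np.diag([1.0, 2.0, 0.5])
A = np.linalg.cholesky(Sigma)

# X has mean zero (theta = 0) and covariance Sigma; risk of the sample mean.
X = rng.standard_normal((trials, n, d)) @ A.T
mean_err = X.mean(axis=1)                       # Xbar - theta over each replication
risk = float(np.mean(np.sum(mean_err ** 2, axis=1)))
bound = np.trace(Sigma) / n                     # E||influence||^2 / n
```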
Alternatively, if we let Σ = Cov(θ̇_{P_0}(X)), then setting A = Σ we obtain
Eπ[∥θ̂(X_1^n) − θ(P_T)∥²_{Σ^{−1}}] ≥ E[⟨θ̇_{P_0}(X), g(X)⟩]² / (n tr(Σ Cov_0(g(X)))) − o(1/n),
where ∥x∥_A = √(x⊤Ax) is the Mahalanobis norm. Taking g(x) = Σ^{−1} θ̇_{P_0}(x), we obtain
Eπ[∥θ̂(X_1^n) − θ(P_T)∥²_{Σ^{−1}}] ≥ d/n − o(1/n).
In comparison with the bounds available via information-theoretic techniques, such as those Exercises
9.12 and 9.13 develop, we see that we have lost some generality in that the results apply only
to the squared error, but have gained in that the leading constants are unimprovable.
To provide lower bounds on estimating θ(P), the simplest approach is to derive an influence function
expansion (12.3.2) under tilted models P_t of P_0. To do this, we can use the implicit function theorem,
which will show fairly precisely how the minimizers θ(P_t) vary with tiltings (12.3.1).
Before stating the result, we define a bit of notation: let Lt (θ) = EPt [ℓ(θ, Z)] = E0 [(1 +
⟨t, g(Z)⟩)ℓ(θ, Z)] be the tilted population loss, and for shorthand let θt = θ(Pt ) for t near 0. Then
applying the implicit function theorem, we have the following lemma (we defer its proof to the
end of this section); the lemma holds under much weaker conditions, but we prefer to keep things
simple.
Lemma 12.3.3. Let g : X → Rk be a bounded and mean-zero function and Assumption A.5.1 hold
on the losses ℓ and population loss L0 . Then for all t in a neighborhood of 0, θt exists, is unique,
and is differentiable in t, with
θ_{t+v} − θ_t = −E_0[(∇²_θ L_t(θ_t))^{−1} ∇ℓ(θ_t, Z)g(Z)⊤] v + o(∥v∥)
as v → 0.
Immediately, the smoothness conditions in Assumption A.5.1 then imply the influence function
expansion (12.3.2) with θ̇_{P_0}(z) = −∇²L_0(θ_0)^{−1}∇ℓ(θ_0, z).
Corollary 12.3.4. Let π_n be smooth enough and have support on the scaled ℓ₂-ball (1/√n)B₂^d of
radius 1/√n. Then for any matrix A(t) ≻ 0, continuous in t, and estimator θ̂,
E_{π_n}[∥θ̂(X_1^n) − θ(P_t)∥²_{A(t)^{−1}}] ≥ E_{P_0}[⟨∇²L_0(θ_0)^{−1}∇ℓ(θ_0, Z), g(Z)⟩]² / (nE_{π_n}[tr(A(T)Cov_0(g(Z)))]) − o(1/n).
Exercise 12.8 shows how particular choices of g and A yield different lower bounds. In brief, however,
Corollary 12.3.4 shows that asymptotically in n, the error of the empirical risk minimizers (5.3.1)
is optimal.
Finally, we return to prove Lemma 12.3.3.
Proof of Lemma 12.3.3 The result is nearly immediate once we have the implicit function
theorem, a version of which we state here. (We provide a proof sketch as it can be useful for simply
remembering exactly what the implicit function theorem states.)
Sketch of Proof We will only show what the form of the derivative must be, presuming it
exists; given the form that the derivative should take then an argument involving the Banach fixed
point theorem demonstrates the existence. So assume that F (x, y) = 0; we expand the identity
0 = F (x + ∆x , y + ∆y ) to solve for the form ∆x must take in terms of y, and then the linear part
of this form gives ẋ(y). To that end, we have
0 = F (x + ∆x , y + ∆y )
= F (x, y) + Dx F (x, y)∆x + Dy F (x, y)∆y + O(∥∆x ∥2 + ∥∆y ∥2 ).
Then using F(x, y) = 0, it must be the case that ∆_x = −(D_x F(x, y))^{−1} D_y F(x, y)∆_y + O(∥∆_x∥² +
∥∆_y∥²). Proceeding heuristically, we may take ∆_y → 0 and ∆_x → 0 to obtain that to first order,
∆_x = −(D_x F(x, y))^{−1} D_y F(x, y)∆_y + O(∥∆_y∥²), giving ẋ(y) = −(D_x F(x, y))^{−1} D_y F(x, y).
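A numerical check of the conclusion ẋ(y) = −(D_x F)^{−1} D_y F (a sketch on the hypothetical constraint F(x, y) = x² + y² − 1, which defines x(y) = √(1 − y²) near (1, 0)):

```python
import math

def x_of_y(y):
    """The implicitly defined solution branch of F(x, y) = x^2 + y^2 - 1 = 0."""
    return math.sqrt(1.0 - y ** 2)

y0 = 0.3
x0 = x_of_y(y0)
DxF = 2.0 * x0                     # dF/dx at (x0, y0)
DyF = 2.0 * y0                     # dF/dy at (x0, y0)
implicit_slope = -DyF / DxF        # -(DxF)^{-1} DyF

h = 1e-6
numeric_slope = (x_of_y(y0 + h) - x_of_y(y0 - h)) / (2 * h)
```

The implicit-function formula matches a centered finite-difference derivative of the explicit branch.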
To apply the implicit function theorem (Lemma 12.3.5), we compute the derivatives of Lt (θ) =
E0 [(1 + ⟨t, g(Z)⟩)ℓ(θ, Z)] via
∇2t,θ Lt (θ) = E0 [g(Z)∇ℓ(θ, Z)⊤ ] ∈ Rk×d and ∇2θ Lt (θ) = E0 [(1 + ⟨t, g(Z)⟩)∇2 ℓ(θ, Z)].
At θ_0 = θ(P_0), we have by assumption that ∇²_θ L_0(θ_0) = E_0[∇²ℓ(θ_0, Z)] ≻ 0, and ∇_θ L_0(θ_0) = 0 as
well. Make the identifications x ↦ θ, y ↦ t in the implicit function theorem, so we also identify
F(x, y) ↦ ∇_θ L_t(θ), D_x F(x, y) ↦ ∇²_θ L_t(θ), and D_y F(x, y) ↦ ∇²_{t,θ} L_t(θ)⊤. Then we can evidently
write
θ_{t+v} = θ_t − E_0[(∇²_θ L_t(θ_t))^{−1} ∇ℓ(θ_t, Z)g(Z)⊤] v + o(∥v∥)
(ii) The benchmark should be uniformly achievable, in that there should exist a procedure achieving
the benchmark performance on each instance P ∈ P.
At some level, minimax complexity guarantees satisfy the two desiderata (i)–(ii): a constant func-
tion at least provides a quantity for each P ∈ P, and minimaxity means that the result is indeed
achievable. The key is that an instance-optimal bound should provide a converse, which we consider
here.
(iii) The benchmark should provide a super-efficiency converse: no procedure outperforms the
benchmark except on a negligible collection of instances.
The bounds we have developed in the local Fano method, as in Chapter 9.4, provide bounds
that at least approach the desiderata (i)–(iii): they provide a lower bound centered around a
given distribution P_θ, and (frequently) these bounds are achievable. Sometimes, the bounds are
independent of the parameter: Example 9.4.4 shows that in the d-dimensional normal location
family {N(θ, σ²I_d)}_{θ∈R^d}, we have
sup_{∥θ−θ_0∥₂ ≤ c√(d/n)} E_θ[∥θ̂_n − θ∥₂²] ≳ dσ²/n (12.4.1)
for some constant c, and the sample mean X̄_n obviously achieves the lower bound. So for the squared
error, the quantity dσ²/n can serve as a benchmark: it depends on the variance σ² of the instance at
hand, and it is uniformly achievable. What is missing, however, is a super-efficiency converse (iii):
at least in the statement of the bound (12.4.1), nothing prevents an estimator from achieving much
smaller than dσ²/n error everywhere except for a few worst-case points.
The variations on the Van Trees inequality in Sections 12.2 and 12.3 show that for parametric
families (and even beyond), the Fisher information provides a super-efficiency guarantee for the
squared error of the form (iii): by Corollary 12.2.9 and the bounds following, for smooth enough
prior densities π, for any estimator θbn we have
  ∫ E_θ[∥θ̂_n − θ∥_2²] π(θ)dθ ≥ (1/n) ∫ tr(J(θ)^{−1}) π(θ)dθ − O(1/n²),
and maximum-likelihood estimators attain these lower bounds (under appropriate conditions). In
the case of the normal location family, because J(θ) = (1/σ²)I_d, we can then strengthen the lower bound (12.4.1) to

  inf_{θ̂_n} ∫ E_θ[∥θ̂_n − θ∥_2²] π(θ)dθ ≥ (1 − o(1)) · dσ²/n
for “most” priors π.
To provide a bit more weight to this discussion, consider the setting of Section 12.2.3, where
we wish to estimate a differentiable function ψ(θ) ∈ Rp of the parameter θ ∈ Rd of interest, and
measure error via a quadratic with matrix A(θ) ≻ 0. This of course also includes the non-parametric settings in Section 12.3. For a procedure ψ̂, or more accurately, a sequence of procedures ψ̂ = {ψ̂_n} defined for each sample size n, define the pointwise limiting squared error at θ by

  L_A(θ, ψ̂) := lim sup_n n · E_θ[(ψ̂_n(X_1ⁿ) − ψ(θ))^⊤ A(θ)^{−1} (ψ̂_n(X_1ⁿ) − ψ(θ))].
Theorem 12.4.1. Let the conditions of Theorem 12.2.7 hold. Then for any estimator sequence ψ̂_n,

  ∫ L_A(θ, ψ̂) π(θ)dθ ≥ ∫ tr(ψ̇(θ) J(θ)^{−1} ψ̇(θ)^⊤ A(θ)^{−1}) π(θ)dθ.
Proof  In Corollary 12.2.8, define the matrix C(θ) = J(θ)^{−1} ψ̇(θ)^⊤ A(θ)^{−1}, which gives for any n that

  n E_π[(ψ̂_n(X_1ⁿ) − ψ(θ))^⊤ A(θ)^{−1} (ψ̂_n(X_1ⁿ) − ψ(θ))] ≥ E_π[tr(ψ̇(θ) J(θ)^{−1} ψ̇(θ)^⊤ A(θ)^{−1})] − O(1/n),
where the big-O hides terms that depend on the prior π. Now define the sequence of functions
  f_n(θ) := n E_θ[(ψ̂_n(X_1ⁿ) − ψ(θ))^⊤ A(θ)^{−1} (ψ̂_n(X_1ⁿ) − ψ(θ))],

each of which is nonnegative. We have lim sup_n f_n(θ) = L_A(θ, ψ̂) ∈ [0, ∞], and as lim sup_n f_n(θ) = lim_{n→∞} sup_{m≥n} f_m(θ), monotone convergence implies ∫ sup_{m≥n} f_m(θ) π(θ)dθ ↓ ∫ L_A(θ, ψ̂) π(θ)dθ.
Taking A = Id and ψ(θ) = θ to be the identity mapping, Theorem 12.4.1 thus shows that for
“most” priors π, the limiting mean squared error of estimators satisfies
  ∫ lim sup_n n E_θ[∥θ̂_n − θ∥_2²] π(θ)dθ ≥ ∫ tr(J(θ)^{−1}) π(θ)dθ.    (12.4.2)
So, for example, there can be no open set U on which an estimator has mean squared error (asymptotically) better than (1/n) tr(J(θ)^{−1}) everywhere on that set. Indeed, supposing to the contrary that L(θ) := lim sup_n n E_θ[∥θ̂_n − θ∥_2²] < tr(J(θ)^{−1}) for each θ ∈ U, we take any smooth prior π supported within U, and then obtain the contradiction that

  ∫ L(θ) π(θ)dθ < ∫ tr(J(θ)^{−1}) π(θ)dθ.
Another perspective on this is that no “good” estimator can outperform. Suppose that an
estimator is efficient in some neighborhood of θ0 for the squared error, meaning that
  lim sup_n n E_θ[∥θ̂_n − θ∥_2²] ≤ tr(J(θ)^{−1})
for each θ near θ0 . Then the set of points θ at which strict inequality can hold above necessarily
has measure 0: rearranging the preceding inequality we have
  0 ≥(⋆) ∫ [tr(J(θ)^{−1}) − lim sup_n n E_θ[∥θ̂_n − θ∥_2²]] π(θ)dθ ≥ 0,
where the inequality (⋆) is inequality (12.4.2). As the integrand is nonnegative, it must be 0 for
almost every θ. Such average-case super-efficiency converses are also present in the classical theory
of asymptotic estimator efficiency that Le Cam and Hájek develop [cf. 184], which extends the
present results to general loss functions far beyond the squared error.
JCD Comment: We should maybe have a small section here on super-efficiency and
estimation. Perhaps more in exercises as well, assuming we do Le Cam in exercises
12.7 Exercises
Exercise 12.1: Let Θ be a compact convex set with non-empty interior (a convex body), and let
{Pθ }θ∈Θ be a family of distributions, all absolutely continuous with respect to one another.
Hint. Theorem B.1.11 in Appendix B.1.2 provides a useful characterization of the projection
onto a convex set.
(c) Conclude that if Θ is a convex body and {Pθ }θ∈Θ satisfies the assumed conditions of the
exercise, then the projected estimator θe satisfies
  E_θ[∥θ̃(X) − θ∥_2²] < E_θ[∥θ̂(X) − θ∥_2²]
for all θ ∈ Θ.
  R′(0⁺) := lim_{λ↓0} (R(λ) − R(0))/λ,
show that R′ (0+ ) < 0, and conclude that for any θ, there exists λ > 0 such that R(λ) < R(0).
Exercise 12.3: In this question, you investigate some examples and properties of the information
of compactly supported C ∞ functions.
(a) Let ϕ(t) = exp(−1/(1 − t²)) for |t| < 1 and ϕ(t) = 0 for |t| ≥ 1. Show that ϕ is C^∞.

(b) Show that for ϕ as in part (a), ∫_{−1}^{1} ϕ′(t)²/ϕ(t) dt < ∞.
(c) Let ϕ be a symmetric, nonnegative, C^∞ function with support [−1, 1] and ∫ ϕ(t)dt = 1. For 0 < a ≤ 1 define π_a(t) = ϕ(t/a)/a. Show that π_a is a density on [−a, a] and give J(π_a) as a function of J(π_1).
where the matrices are partitioned similarly. Assume that M ≻ 0. Show that X ⪯ A−1 , with
equality if and only if B = 0.
(c) Use Corollary 12.2.10 to interpret the above inequality in the context of problems in which the nuisance parameters η are known versus unknown.
Exercise 12.8: Consider M-estimation problems as in Corollary 12.3.4. Let H = ∇2 L(θ0 ) be the
Hessian of L at θ0 = argmin_θ L(θ), and let Σ = Cov(∇ℓ(θ0, Z)) be the covariance of the gradients. Give
appropriate choices of the matrices A(t) and tilting functions g : Z → Rd to show the following:
What do these results say about the empirical risk minimizer (5.3.1)?
Chapter 13
When we wish to estimate a complete “object,” such as the parameter θ in a linear regression
Y = Xθ + ε, or a density when we observe X1 , . . . , Xn i.i.d. with a density f , the previous chapters
give a number of approaches to proving fundamental optimality results and limits. In many cases,
however, we wish to estimate functionals of a distribution or larger parameter, rather than the
entire distribution or a high-dimensional parameter. Suppose we wish to estimate some statistic
T (P ) ∈ R of a probability distribution P . Then a naive estimator is to construct an estimate Pb
of P , and simply plug it in: use Tb = T (Pb). But frequently—and as we have seen in the preceding
chapters—our ability to estimate Pb may be limited, while various statistics of P may be easier to
estimate. As a trivial example of this phenomenon, suppose we have an unknown distribution P
supported on [−1, 1], and we wish to estimate the statistic T (P ) = EP [X], its expectation. Then
the trivial sample mean estimator
  T_n := X̄_n

satisfies E[(T_n − E[X])²] ≤ 1/n. But an estimator that first attempts to approximate the full distribution P via some P̂ and then estimate ∫ x dP̂(x) is likely to incur substantial additional error.
Alternatively, we might wish to test different properties of distributions. In goodness of fit
testing, we are given a sample X1 , . . . , Xn i.i.d. from a distribution Q, and we wish to distinguish
whether Q = P or Q is far from P. In related two-sample tests, we are given samples X_1ⁿ ∼ P and Y_1ᵐ ∼ Q, each drawn i.i.d., and again wish to test whether Q = P or Q and P are far from one another. For
example, in a medical study, we may wish to distinguish whether there are significant differences
between a treated population Q and control population P .
More broadly, we wish to develop tools to understand the optimality of different estimators
and tests of functionals, by which we mean scalar valued parameters of a distribution P . Such
parameters could include the norm ∥θ∥2 of a regression vector, an estimate of the best possible
expected loss inf f EP [ℓ(f (X), Y )] in a prediction problem, the distance ∥P − P0 ∥TV of a sampled
population P from a reference P0 , or the probability mass of outcomes we have not observed in a
study. This chapter develops a few of the tools to understand these problems.
In some cases, it is possible to reduce the development of lower bounds for estimation problems
to characterizing purely geometric objects, relating the continuity of a function to be estimated to
distances of the underlying probability distributions. This approach makes the intuition that esti-
mation should be hard when the statistic of interest is quite sensitive to the underlying distribution
quantitative. To proceed, we define the local modulus of continuity of a functional θ : P → R on a
family P of distributions at a fixed distribution P0 with respect to the Hellinger distance,
  ω_hel(ϵ; θ, P_0, P) := sup_{P_1∈P} {|θ(P_0) − θ(P_1)| : d_hel(P_0, P_1) ≤ ϵ}.    (13.1.1)
We recognize this as the local Lipschitz constant of the parameter θ for the Hellinger distance
around a fixed distribution P0 . The global modulus of continuity is simply the supremum of the
local modulus over all P_0 ∈ P,

  ω_hel(ϵ; θ, P) := sup_{P_0∈P} ω_hel(ϵ; θ, P_0, P).
These are “geometric” quantities in that each measures how much θ can vary over small neighbor-
hoods with respect to the Hellinger distance. Using Le Cam’s two point method (Chapter 9.3),
we can nearly immediately see that this Hellinger modulus implies lower bounds on estimation
error. In this case, we lower bound a somewhat smaller quantity than the minimax error, instead
bounding the “hardest one-dimensional sub-problem,”
  M¹ᵈ_n(θ(P), P_0, Φ) := sup_{P_1∈P} inf_{θ̂_n} max_{P∈{P_0,P_1}} E_{Pⁿ}[Φ(|θ̂_n(X_1ⁿ) − θ(P)|)].    (13.1.2)
The quantity (13.1.2) differs from the standard minimax risk in that the distribution P0 is fixed,
and nature must first choose an alternative P1 ; the estimator θbn then knows P1 and P0 . Any lower
bound on the hardest subproblem (13.1.2) immediately lower bounds the minimax risk, as
  M_n(θ(P), Φ) ≥ sup_{P_0∈P} M¹ᵈ_n(θ(P), P_0, Φ).
by the tensorization identity (9.2.4) for Hellinger distance. Because (1 − c/n)n → e−c , the asymp-
totic limit in the proposition follows from the inequality
  lim sup_n sup_{d²_hel(P,Q) ≤ c/n} ∥Pⁿ − Qⁿ∥_TV ≤ √(1 − e^{−c}) · √(1 + e^{−c}) = √(1 − e^{−2c}).
Let us perform the evaluation in finite samples to obtain explicit constants. Because x ↦ √x · √(2 − x) is increasing for 0 ≤ x ≤ 1, we see that if d_hel(P_0, P_1) ≤ √(c/n) then

  ∥P_0ⁿ − P_1ⁿ∥_TV ≤ √(1 − (1 − c/n)ⁿ) · √(1 + (1 − c/n)ⁿ).
So long as the local modulus does not vary too wildly as ϵ → 0, it in fact characterizes the
difficulty of the hardest one-dimensional subproblem at P0 . We say that the modulus is regular at
P0 if there exist 0 < r1 ≤ r0 and K0 , K1 such that
for all small ϵ > 0. Then we have the following complement to Proposition 13.1.1, showing that
the Hellinger modulus indeed is the fundamental quantity governing the risk (13.1.2) of the hardest
one-dimensional subproblem.
Proposition 13.1.2. The risk of the hardest one-dimensional subproblem satisfies

  M¹ᵈ_n(θ(P), P_0, Φ) ≤ sup_{ϵ≥0} e^{−nϵ²} Φ(ω(ϵ)).    (13.1.3)

If additionally Φ(t) = t^p for some p > 0 and the local modulus is regular at P_0, then for large enough n,

  M¹ᵈ_n(θ(P), P_0, Φ) ≤ e^{−r_1 p/2} · Φ(ω_hel(√(r_0 p/(2n)); θ, P_0, P)).
Proof Let ω(ϵ) = ωhel (ϵ; θ, P0 , P) for shorthand. Given P0 , P1 , let θ0 = θ(P0 ) and θ1 = θ(P1 ),
and let Ψ_n be the optimal test between P_0ⁿ and P_1ⁿ, so that

  P_0(Ψ_n ≠ 0) + P_1(Ψ_n ≠ 1) = 1 − ∥P_0ⁿ − P_1ⁿ∥_TV ≤ 1 − d²_hel(P_0ⁿ, P_1ⁿ) = (1 − d²_hel(P_0, P_1))ⁿ

by Proposition 2.2.7 and the tensorization identity (9.2.4). Define the estimator θ̂_n = θ_1 if Ψ_n = 1 and θ̂_n = θ_0 otherwise. Then

  max_{P∈{P_0,P_1}} E_P[Φ(|θ̂_n − θ(P)|)] ≤ (1 − d²_hel(P_0, P_1))ⁿ Φ(|θ_0 − θ_1|).
(take the ϵ values minimizing, respectively, e^{−nϵ²} K_1^p ϵ^{r_1 p} and e^{−nϵ²} K_0^p ϵ^{r_0 p}). Substituting gives the result.
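To see where these ϵ values come from: for Φ(t) = t^p and a modulus behaving as Kϵ^r, the map ϵ ↦ e^{−nϵ²} ϵ^{rp} is stationary where the derivative of −nϵ² + rp log ϵ vanishes, i.e. at ϵ = √(rp/(2n)), with value e^{−rp/2}(rp/(2n))^{rp/2}. A quick numeric confirmation (the values of n, r, p below are arbitrary choices of mine):

```python
import math

n, r, p = 50, 1.5, 2          # hypothetical sample size and regularity values
rp = r * p

def objective(eps):
    return math.exp(-n * eps * eps) * eps ** rp

eps_star = math.sqrt(rp / (2 * n))   # stationary point of -n*eps^2 + rp*log(eps)
best = objective(eps_star)

# a grid search confirms eps_star is the maximizer of the objective
grid = [i / 10000.0 for i in range(1, 20000)]
assert all(objective(e) <= best + 1e-12 for e in grid)

# and the maximal value matches e^{-rp/2} (rp/(2n))^{rp/2}
closed_form = math.exp(-rp / 2) * (rp / (2 * n)) ** (rp / 2)
assert abs(best - closed_form) < 1e-12
```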
Noting that ½ − ½ √(1 − e^{−1}) · √(1 − e^{−2}) > 1/8, Proposition 13.1.1 implies that for large enough n, we have the (global) minimax lower bound

  M_n(θ(P), Φ) ≥ (1/8) · Φ(½ ω_hel(1/√n; θ, P)).    (13.1.4)
So the modulus of continuity of the parameter θ at a radius of roughly 1/√n for the Hellinger provides
a lower bound on estimation, reducing the problem of obtaining a minimax lower bound to one of
lower bounding the modulus of continuity of θ. (The question of attainability of the implied bound
is more involved; see the bibliographic section for some discussion of this and related issues.)
Example 13.1.3 (Estimating the value of a nonparametric function): Let us revisit the
nonparametric regression problems of Section 10.1. Assume we receive observations Y_i = f(X_i) + ε_i, where ε_i are i.i.d. N(0, 1), X_i are i.i.d. Uniform([−1, 1]), and f is a 1-Lipschitz function.
We wish to estimate the value of f at 0, i.e., θ(P ) = f (0). Once we provide a lower bound
on the Hellinger modulus in this observation model, Proposition 13.1.1 then gives a minimax
lower bound.
We now construct a particular function f . Let ϕ(x) = [1 − |x|]+ , which is 1-Lipschitz, and
for t ∈ [0, 1] define ft (x) = tϕ(x/t), which is 1-Lipschitz (and f0 is identically 0). Letting Pt
and P0 denote the joint distributions of (X, Y ) when f = ft or f = f0 , we have
  d²_hel(P_t, P_0) = ½ ∫_{−1}^{1} d²_hel(N(f_t(x), 1), N(f_0(x), 1)) dx.
The Hellinger distance between two Gaussians (see Exercise 2.2) satisfies

  d²_hel(N(µ_0, σ²), N(µ_1, σ²)) = 1 − exp(−(µ_0 − µ_1)²/(8σ²)) ≤ (µ_0 − µ_1)²/(8σ²),

so that d²_hel(P_t, P_0) ≤ (1/16) ∫_{−1}^{1} f_t(x)² dx = t³/24, because ∫_{−1}^{1} ϕ(u)² du = 2/3. (The factor 24 is, of course, unimportant.) Observing that the separation between the parameters of interest is θ(P_t) − θ(P_0) = t, the modulus has lower bound

  ω_hel(ϵ; θ, P) ≥ sup{t : t³ ≤ 24ϵ²} = ∛24 · ϵ^{2/3}.
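The Hellinger computation in this example can be confirmed numerically: with f_t(x) = tϕ(x/t) and the Gaussian formula above, d²_hel(P_t, P_0) = ½∫_{−1}^{1}(1 − e^{−f_t(x)²/8})dx, which is t³/24 up to higher-order terms in t. The sketch below is my own check, not from the text:

```python
import math

def hellinger_sq(t, num=4001):
    # d_hel^2(P_t, P_0) = (1/2) * int_{-1}^{1} (1 - exp(-f_t(x)^2 / 8)) dx,
    # computed with Simpson's rule after the substitution x = t*u, so that
    # the bump is well resolved:
    #   = (t/2) * int_{-1}^{1} (1 - exp(-t^2 (1 - |u|)^2 / 8)) du
    h = 2.0 / (num - 1)
    total = 0.0
    for i in range(num):
        u = -1.0 + i * h
        w = 1 if i in (0, num - 1) else (4 if i % 2 == 1 else 2)
        total += w * (1.0 - math.exp(-((t * (1 - abs(u))) ** 2) / 8.0))
    return (t / 2.0) * total * h / 3.0

t = 0.05
d2 = hellinger_sq(t)
assert d2 <= t ** 3 / 24 * (1 + 1e-9)   # the bound, via 1 - e^{-a} <= a
assert d2 >= t ** 3 / 24 * (1 - 1e-3)   # and it is essentially tight for small t
```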
where once again equality (?) is heuristic. Expressions (13.1.5) and (13.1.6) show that, at least
under some regularity conditions, we expect that
  D_f(P_{θ+v} ∥ P_θ) = (f″(1)/2) v^⊤ J(θ) v + o(∥v∥²),    (13.1.7)

so that the (local) geometry that the Fisher information matrix J(θ) induces on probability distributions is equivalent to that of the f-divergences.
Under the condition (13.1.7), it becomes rather straightforward to compute the modulus of continuity. Indeed, we may generalize the Hellinger modulus to define

  ω_f(ϵ; θ, P) := sup{|θ(P_0) − θ(P_1)| : D_f(P_1 ∥ P_0) ≤ ϵ²},

where we square ϵ to match the Hellinger case (13.1.1), which corresponds to f(t) = ½(√t − 1)².
Then the Fisher information (asymptotically) characterizes the modulus of continuity whenever we
have the identifiability condition that for all ϵ > 0,
Proposition 13.1.4. Let {Pθ } be a family of distributions for which the f -divergence satisfies the
expansion (13.1.7) at θ ∈ Rd , and let T : Rd → R be differentiable at θ. Then
  (1/ϵ²) D_f(P_{θ+ϵv} ∥ P_θ) → (f″(1)/2) v^⊤ J(θ) v.
Then because T(θ + v) = T(θ) + ⟨∇T(θ), v⟩ + o(∥v∥), we have

  lim inf_{ϵ→0} ω(ϵ)/ϵ ≥ sup{⟨∇T(θ), v⟩ : v^⊤ J(θ) v ≤ 2/f″(1)} = √(2/f″(1)) · ∥J(θ)^{−1/2} ∇T(θ)∥_2.
For the upper bound, by the identifiability assumption (13.1.8), there exists a function δ(ϵ) → 0 as ϵ ↓ 0 for which D_f(P_{θ′} ∥ P_θ) ≤ ϵ² implies that ∥θ′ − θ∥_2 ≤ δ(ϵ). The identity (13.1.7) shows that for small enough ϵ > 0, we have D_f(P_{θ′} ∥ P_θ) = (f″(1)/2)(θ′ − θ)^⊤ J(θ)(θ′ − θ) + o(∥θ′ − θ∥²) whenever ∥θ′ − θ∥ ≤ δ(ϵ), so that in fact it must be the case that D_f(P_{θ′} ∥ P_θ) ≤ ϵ² implies that θ′ = θ + v for some v satisfying v^⊤ J(θ) v ≤ (2/f″(1)) ϵ² (1 + o(1)) as ϵ → 0. So

  ω(ϵ) ≤ sup{|T(θ + v) − T(θ)| : v^⊤ J(θ) v ≤ (2/f″(1)) ϵ² (1 + o(1))}
       = sup{|⟨∇T(θ), v⟩| + o(∥v∥) : v^⊤ J(θ) v ≤ (2/f″(1)) ϵ² (1 + o(1))}
       ≤ sup{∥J(θ)^{−1/2} ∇T(θ)∥_2 ∥J(θ)^{1/2} v∥_2 + o(∥v∥) : ∥J(θ)^{1/2} v∥_2² ≤ (2/f″(1)) ϵ² (1 + o(1))}.
In a sense, then, so long as we have enough regularity, most divergence measures and related
moduli of continuity induce the same geometry via the Fisher information. The next subsection de-
velops conditions under which these Fisher information expansions hold, but here we preview them
a bit by claiming that the expansion (13.1.7) holds for the squared Hellinger distance, corresponding to f(t) = ½(√t − 1)² = t/2 − √t + ½, for most distributions, where f″(1) = ¼.
Corollary 13.1.5. Let {P_θ}_{θ∈R} be a suitably regular family of distributions, and let J(θ) = E_θ[ℓ̇_θ²] be the Fisher information. Then there exists a numerical constant c > 0 such that for any θ_0,

  sup_{θ_1} inf_{θ̂_n} max_{θ∈{θ_0,θ_1}} E_θ[|θ̂_n(X_1ⁿ) − θ|] ≥ c / √(n J(θ_0))

for all large enough n.
As |e^{−v²/2} − 1| ≤ v² and

  |Σ_{k=2}^∞ (xv)^k / k!| ≤ |xv|² Σ_{k=0}^∞ |xv|^k / (k+2)! ≤ |xv|² exp(|xv|/2) ≤ |xv|² exp(x²/4 + v²/4),

we may take h(x) = |x| + 1 + x² e^{x²/4}, which is certainly integrable against e^{−x²/2}.
The tilts that form the basis for lower bounds that do not depend on a particular parameter, but which still provide Fisher-information-like quantities as in Chapter 12.3, also satisfy Definition 13.1. Take the tilted densities p_t(x) = p_0(x)(1 + ⟨t, g(x)⟩) as in definition (12.3.1), where p_0 is the density of P_0 w.r.t. some base measure µ (which can simply be P_0). Then clearly the family {P_t} for t in a neighborhood of 0 has densities, and the score is ℓ̇_t(x) = ∇_t log(1 + ⟨t, g(x)⟩) = g(x)/(1 + ⟨t, g(x)⟩). The ratio p_t(x)/p_0(x) = 1 + ⟨t, g(x)⟩ then satisfies

  p_t(x)/p_0(x) − 1 − ⟨g(x), t⟩ = 0.
Then for non-pathological functions f , we typically have the second-order expansion (13.1.7)
of the f -divergence in terms of the Fisher information. In this case, we restrict to the class of
f -divergences that have global quadratic approximations: we assume there is some K < ∞ such
that

  |f(1 + t) − f(1) − f′(1)t − (f″(1)/2)t²| ≤ K t²  for all t ≥ −1.    (13.1.9)
The three most common divergences we encounter, excepting the variation distance, all satisfy this condition.
Example 13.1.9: For the Hellinger distance with f(t) = ½(√t − 1)², inequality (13.1.9) holds with K = ½. For the KL-divergence with f(t) = t log t − t + 1, inequality (13.1.9) again holds with K = ½. For the χ²-divergence with f(t) = (t − 1)², inequality (13.1.9) holds with K = 0. Exercise 13.2 asks you to prove these claims.
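Before proving the claims in Exercise 13.2, one can probe them numerically. Since f(1) = 0 and f′(1) = 0 for all three functions, condition (13.1.9) reduces to |f(1 + t) − f″(1)t²/2| ≤ Kt²; the grid of t values below is an arbitrary choice of mine:

```python
import math

cases = {
    # f, f''(1), claimed K  (f(1) = 0 and f'(1) = 0 for all three)
    "hellinger": (lambda t: 0.5 * (math.sqrt(t) - 1) ** 2, 0.25, 0.5),
    "kl":        (lambda t: t * math.log(t) - t + 1 if t > 0 else 1.0, 1.0, 0.5),
    "chi2":      (lambda t: (t - 1) ** 2, 2.0, 0.0),
}

ts = [-0.999, -0.9, -0.5, -0.1, -1e-3, 1e-3, 0.1, 0.5, 1.0, 4.0, 10.0, 100.0]
for name, (f, fpp, K) in cases.items():
    for t in ts:
        lhs = abs(f(1 + t) - fpp * t * t / 2)
        assert lhs <= K * t * t + 1e-12, (name, t, lhs)
```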
Definition 13.1 and inequality (13.1.9) are enough to guarantee the local information approx-
imation (13.1.7), meaning that for “regular enough” parametric families, all locally quadratic di-
vergences are locally equivalent to the metric induced by the Fisher information matrix.
Lemma 13.1.10. Let {Pθ } be sufficiently regular for the χ2 -divergence (Definition 13.1) at θ and
f : R+ → R be a convex function with f (1) = 0 satisfying the second-order Taylor bound (13.1.9).
Then for v near 0, the f -divergence satisfies
  D_f(P_{θ+v} ∥ P_θ) = (f″(1)/2) D_{χ²}(P_{θ+v} ∥ P_θ) + o(∥v∥²)

and

  D_{χ²}(P_{θ_0+v} ∥ P_{θ_0}) = v^⊤ J(θ_0) v + o(∥v∥²).
See Section 13.5.2 for a proof of this result, which as a consequence demonstrates that the χ2 -
divergence is indeed (locally) finite on {Pθ }.
JCD Comment: Make this its own subsection to set it off a little bit. Make an example
with just the 1-dimensional families. Connect with χ2 and other divergences here, so that
Fisher-informations and other divergences aren’t so different. Move Lemma 13.1.10 to
here or the next subsection. Add in some results on (1 + tg) families, which will satisfy
the regularity conditions and allow “nonparametric” settings, more or less.
Also, maybe do the heuristic versions of these, ignoring regularity conditions. Then
everything is a bunch cleaner I think.
JCD Comment: Outline for this section: Might be good to actually begin the whole
thing with this section.
3. (Don’t do this!) Show that for testing, the rate at which we can test really is this
modulus whenever we have linear functions and convex classes, because of Le Cam’s
result on Hellinger affinities.
(note the temporary lack of sample size n), we then have the following generalization of inequal-
ity (9.3.3).
Theorem 13.2.1 (Le Cam's Convex Hull Lower Bound). Let P_0 and P_1 ⊂ P be δ-separated in ∥·∥. Then

  M(θ(P), ∥·∥) ≥ (δ/2) sup{1 − ∥P̄_0 − P̄_1∥_TV : P̄_0 ∈ Conv(P_0), P̄_1 ∈ Conv(P_1)}.
Proof  For any parameter θ, the separation ∥θ(P_0) − θ(P_1)∥ ≥ δ and the triangle inequality guarantee that at least one of ∥θ − θ(P_0)∥ ≥ δ/2 or ∥θ − θ(P_1)∥ ≥ δ/2 holds for all pairs P_0 ∈ P_0 and P_1 ∈ P_1. Let P̄_0 = Σ_{j=1}^m α_j P_j and P̄_1 = Σ_{j=1}^m β_j Q_j for P_j ∈ P_0 and Q_j ∈ P_1, respectively, where α, β are convex combinations. Then by Markov's inequality,

  M(θ(P), ∥·∥) ≥ ½ Σ_{j=1}^m α_j E_{P_j}[∥θ̂ − θ(P_j)∥] + ½ Σ_{j=1}^m β_j E_{Q_j}[∥θ̂ − θ(Q_j)∥]
      ≥ (δ/2) [Σ_{j=1}^m α_j E_{P_j}[1{∥θ̂ − θ(P_j)∥ ≥ δ/2}] + Σ_{j=1}^m β_j E_{Q_j}[1{∥θ̂ − θ(Q_j)∥ ≥ δ/2}]]
      ≥ (δ/2) Σ_{j=1}^m [α_j E_{P_j}[inf_{P_0∈P_0} 1{∥θ̂ − θ(P_0)∥ ≥ δ/2}] + β_j E_{Q_j}[inf_{P_1∈P_1} 1{∥θ̂ − θ(P_1)∥ ≥ δ/2}]]
      = (δ/2) [E_{P̄_0}[inf_{P_0∈P_0} 1{∥θ̂ − θ(P_0)∥ ≥ δ/2}] + E_{P̄_1}[inf_{P_1∈P_1} 1{∥θ̂ − θ(P_1)∥ ≥ δ/2}]].
We leave this form of total variation distance as an exercise (see Exercise 2.1). Substituting it into
the display above, we find that for any P̄_v ∈ Conv(P_v), we have

  M(θ(P), ∥·∥) ≥ (δ/2) (1 − ∥P̄_0 − P̄_1∥_TV).

Taking a supremum over the P̄_v gives the theorem.
where the expectation is taken with respect to X ∼ P_0. More generally, let V ∈ V be a random variable distributed according to π, and conditional on V = v, let X | V = v ∼ P_v. Then for the paired likelihood ratio l(x | v, v′) = p_v(x) p_{v′}(x) / p_0²(x), the marginal distribution P̄ of X satisfies

  1 + D_{χ²}(P̄ ∥ P_0) = E[l(X | V, V′)],

where the expectation is taken jointly over X ∼ P_0 and V, V′ drawn i.i.d. from π.
Proof  The starting point is to notice that for any two distributions P and Q we have D_{χ²}(P ∥ Q) = ∫ (dP/dQ − 1)² dQ = ∫ (dP/dQ)² dQ − 2 ∫ dP + ∫ dQ = ∫ (dP/dQ)² dQ − 1. Then we proceed by recognizing that ((1/N) Σ_{i=1}^N x_i)² = (1/N²) Σ_{i,j} x_i x_j for any sequence x_i, and so

  D_{χ²}(P̄ ∥ P_0) + 1 = ∫ ((1/|V|) Σ_{v∈V} dP_v)² / dP_0 = (1/|V|²) Σ_{v,v′∈V} ∫ (dP_v dP_{v′}) / dP_0,

as desired. The second statement has an identical proof, except that we replace (1/|V|) Σ_{v∈V} with expectations according to π.
When we apply Lemma 13.2.3 for product distributions, we can sometimes obtain tensorization-
type inequalities, which allows more applications; the next result is an immediate consequence of
Lemma 13.2.3.
Lemma 13.2.4. Let P̄ⁿ = (1/|V|) Σ_{v∈V} P_vⁿ, where P_v and P_0 have densities as in Lemma 13.2.3. Then

  1 + D_{χ²}(P̄ⁿ ∥ P_0ⁿ) = (1/|V|²) Σ_{v,v′∈V} (E_0[p_v(X) p_{v′}(X) / p_0²(X)])ⁿ.
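Lemma 13.2.4 can be verified directly on a toy discrete family, where both sides are finite sums; the binary distributions below are my own illustrative choice (two packing elements, n = 3 observations):

```python
import itertools, math

# toy family on {0, 1}: p_v(x) = (1 + 0.3 * v * (2x - 1)) / 2 for v in {-1, +1}
p0 = [0.5, 0.5]
pv = {v: [(1 - 0.3 * v) / 2, (1 + 0.3 * v) / 2] for v in (-1, 1)}
V, n = [-1, 1], 3

# left side: enumerate the product space {0,1}^n and sum Pbar^n(x)^2 / P0^n(x)
lhs = -1.0
for x in itertools.product([0, 1], repeat=n):
    mix = sum(math.prod(pv[v][xi] for xi in x) for v in V) / len(V)  # Pbar^n(x)
    lhs += mix * mix / 0.5 ** n

# right side: the lemma's single-coordinate expectation, raised to the n
rhs = -1.0 + sum(
    sum(pv[v][x] * pv[vp][x] / p0[x] for x in (0, 1)) ** n
    for v in V for vp in V
) / len(V) ** 2

assert abs(lhs - rhs) < 1e-12  # both equal D_chi2(Pbar^n || P0^n)
```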
The applications of these lemmas are many, and going through a few examples will best show
how to leverage them. Roughly, our typical approach is the following: we identify V with {±1}d or
some other suitably nice collection of vectors. We then choose distributions Pv and P0 with densities
suitably nice that the ratios pv /p0 “act” like exponentials involving inner products of v ∈ V with
some other quantity; then, because v is uniform in V in Lemma 13.2.3, we can leverage all the
tools we have developed to control moment generating functions and concentration inequalities in
Chapter 4 to bound the χ2 -divergence and then apply Theorem 13.2.1.
Let us give one example of this approach, where we see the technique we use to prove the
lemma arises frequently. Let P_0 = N(0, σ²I_d) be the standard normal distribution on R^d, and for V = {−1, 1}^d and some δ ≥ 0 to be chosen, let P_v = N(δv, σ²I_d). Then we have the following lemma, which shows that while D_kl(P_v ∥ P_0) = dδ²/(2σ²) for each individual P_v, the divergence for the average can be much smaller (even quadratically so in the ratio δ²/σ²).
Lemma 13.2.5. Let P_0 and P_v be Gaussian distributions as above, and define the mixture P̄ = 2^{−d} Σ_{v∈{±1}^d} P_v. Then

  2∥P_0 − P̄∥²_TV ≤ log(1 + D_{χ²}(P̄ ∥ P_0)) ≤ dδ⁴/(2σ⁴).
Proof  The first inequality combines Pinsker's inequality (Proposition 2.2.8) with the bound D_kl(P ∥ Q) ≤ log(1 + D_{χ²}(P ∥ Q)) in Proposition 2.2.9. Now we expand the χ²-divergence, yielding

  1 + D_{χ²}(P̄ ∥ P_0) = E[exp(−∥Y − δV∥_2²/(2σ²) − ∥Y − δV′∥_2²/(2σ²) + ∥Y∥_2²/σ²)],

where the expectation is over Y ∼ N(0, σ²I_d) and V, V′ drawn i.i.d. Uniform(V). Taking the expectation over
Y first, before averaging over the packing elements, allows more careful control. Indeed, expanding
the squares and recognizing that ∥v∥_2² = d for each v ∈ {±1}^d, we have

  1 + D_{χ²}(P̄ ∥ P_0) = E[exp((δ/σ²)⟨Y, V + V′⟩ − dδ²/σ²)] = E[exp((δ²/(2σ²))∥V + V′∥_2² − dδ²/σ²)]
      = E[exp((δ²/σ²)⟨V, V′⟩)]
      ≤ exp(dδ⁴/(2σ⁴)),

where the final key inequality follows because an individual U ∼ Uniform({±1}) is 1-sub-Gaussian, and ⟨V, V′⟩ is thus d-sub-Gaussian.
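In this Gaussian case the last expectation is in fact available in closed form: ⟨V, V′⟩ is a sum of d independent Rademacher variables, so E[exp(λ⟨V, V′⟩)] = cosh(λ)^d, and the sub-Gaussian step is just cosh(λ) ≤ e^{λ²/2}. A short sketch of this check (the function name is mine):

```python
import math

def chi2_mixture(d, delta, sigma):
    """Exact value of 1 + D_chi2(Pbar || P0) from the display above:
    E exp((delta^2 / sigma^2) <V, V'>) = cosh(delta^2 / sigma^2)^d."""
    lam = delta ** 2 / sigma ** 2
    return math.cosh(lam) ** d

for d in (1, 5, 50):
    for delta, sigma in ((0.1, 1.0), (0.5, 1.0), (0.2, 0.5)):
        exact = chi2_mixture(d, delta, sigma)
        bound = math.exp(d * delta ** 4 / (2 * sigma ** 4))
        assert 1.0 <= exact <= bound + 1e-12  # cosh(lam)^d <= e^{d lam^2 / 2}
```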
Using the same techniques, we can provide a similar upper bound, coupled with the identity in Lemma 13.2.4, that gives a tensorization-like bound.

Lemma 13.2.6. Let P_0 and P_v be Gaussian distributions as above, and define the mixture P̄ⁿ = 2^{−d} Σ_{v∈{±1}^d} P_vⁿ. Then

  2∥P_0ⁿ − P̄ⁿ∥²_TV ≤ log(1 + D_{χ²}(P̄ⁿ ∥ P_0ⁿ)) ≤ dn²δ⁴/(2σ⁴).
Proof  Tracing the proof of Lemma 13.2.5 and using Lemma 13.2.4, we have

  1 + D_{χ²}(P̄ⁿ ∥ P_0ⁿ) = (1/4^d) Σ_{v,v′∈{±1}^d} (E_0[exp((δ/σ²)⟨Y, v + v′⟩ − dδ²/σ²)])ⁿ
      = (1/4^d) Σ_{v,v′∈{±1}^d} exp((nδ²/(2σ²))∥v + v′∥_2² − ndδ²/σ²).

Letting V, V′ be independent and uniform on {±1}^d and recognizing that ∥v + v′∥_2² = 2d + 2⟨v, v′⟩, we have

  1 + D_{χ²}(P̄ⁿ ∥ P_0ⁿ) = E[exp((δ²n/σ²)⟨V, V′⟩)] ≤ exp(dδ⁴n²/(2σ⁴)),

as desired.
2. Multiple testing: say we have d distinct p-values U_j. Then set Z_j = Φ^{−1}(U_j). Under the null that U_j ∼ Uniform[0, 1], these are i.i.d. N(0, 1). Alternatives then deviate from this. It is often interesting to consider other alternatives (sparse/dense/etc.)
JCD Comment: Clean this up now, because I moved Lemma 13.2.5 up.
Let us give one example to show how the mixture approach suggested by Lemma 13.2.3 works,
along with showing that a more naive approach using the two point method of Chapter 9.3 fails
to provide the correct bounds. After this we will further develop the techniques. We motivate the
example by considering regression problems, then simplify it to a more stylized and easily workable
form. Suppose we wish to estimate the best possible loss achievable in a regression problem,
  inf_θ E[(X^⊤θ − Y)²].

For simplicity, assume that X ∼ N(0, I_d), and that the "base" distribution P_0 is simply that Y ∼ N(0, 1), while the alternatives are that Y = X^⊤θ⋆ + (1 − ∥θ⋆∥_2²)^{1/2} ε, where ε ∼ N(0, 1) and ∥θ⋆∥_2² ≤ 1. In either case we have Y ∼ N(0, 1) marginally, while

  inf_θ E[(X^⊤θ − Y)²] = 1 − ∥θ⋆∥_2²,

so that estimating the final risk is equivalent to estimating the ℓ_2-norm ∥θ⋆∥_2².
To make the calculations more palatable, let us assume the simpler Gaussian sequence model

  Y = θ⋆ + ε,  ε ∼ N(0, σ²I_n),    (13.2.2)

where θ⋆ ∈ Rⁿ satisfies ∥θ⋆∥_2 ≤ r for some radius r, and we wish to estimate the statistic T(P) := ∥θ⋆∥_2². Note that E[∥Y∥_2²] = ∥θ⋆∥_2² + nσ², so that a natural estimator is the debiased quantity

  T_n := ∥Y∥_2² − nσ².

The family P_{σ,r} defined as Gaussian sequence models (13.2.2) with variance σ² and ∥θ⋆∥_2² ≤ r² satisfies

  M_n(T(P_{σ,r}), |·|) ≤ √(2nσ⁴ + r²σ²) ≤ √(2n)·σ² + rσ.    (13.2.3)
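A quick simulation supports the risk scale of T_n. A standard variance computation (mine, not displayed in the text) gives E[(T_n − ∥θ⋆∥_2²)²] = 2nσ⁴ + 4σ²∥θ⋆∥_2², and in the configuration below (arbitrary choices n = 50, σ = r = 1) this sits below the relaxed bound √(2n)σ² + rσ:

```python
import math, random

random.seed(0)
n, sigma, r = 50, 1.0, 1.0
theta = [r / math.sqrt(n)] * n          # a vector with ||theta||_2 = r
trials, sq_err = 4000, 0.0
for _ in range(trials):
    y = [t + random.gauss(0.0, sigma) for t in theta]
    T_n = sum(yi * yi for yi in y) - n * sigma ** 2   # the debiased estimator
    sq_err += (T_n - r ** 2) ** 2
rmse = math.sqrt(sq_err / trials)

exact = math.sqrt(2 * n * sigma ** 4 + 4 * sigma ** 2 * r ** 2)
assert abs(rmse / exact - 1) < 0.1                        # Monte Carlo matches the formula
assert rmse <= math.sqrt(2 * n) * sigma ** 2 + r * sigma  # and sits below the bound here
```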
We first provide the more naive approach. Suppose that we were to use Le Cam's two-point method to achieve a lower bound in this case. The minimax risk from inequality (9.3.3) shows that (for a numerical constant c > 0), if P_0 and P_1 are (respectively) N(θ_0, σ²I_n) and N(θ_1, σ²I_n), then for any choice of θ_0, θ_1 we have

  M_n(T(P_{σ,r}), |·|) ≥ ¼ |∥θ_0∥_2² − ∥θ_1∥_2²| · [1 − ∥P_0 − P_1∥_TV].    (13.2.4)

Recalling Pinsker's inequality (Proposition 2.2.8), we have

  1 − ∥P_0 − P_1∥_TV ≥ 1 − √(D_kl(P_0 ∥ P_1)/2) = 1 − ∥θ_0 − θ_1∥_2/(2σ).
In particular, whenever ∥θ_0 − θ_1∥_2 ≤ σ, we obtain

  M_n(T(P_{σ,r}), |·|) ≥ (1/8) |∥θ_0∥_2² − ∥θ_1∥_2²|.

Take any θ_0 such that ∥θ_0∥_2 = r and θ_1 = (1 − t)θ_0, then choose the largest t ∈ [0, 1] such that ∥θ_0 − θ_1∥_2 = tr ≤ σ. The choice t = min{1, σ/r} then gives that

  ∥θ_0∥_2² − ∥θ_1∥_2² = r²(1 − (1 − t)²) = r²(2t − t²) = 2 min{r², rσ} − min{r², σ²} ≥ min{r², σr},

so that M_n(T(P_{σ,r}), |·|) ≥ (1/8) min{r², σr}.    (13.2.5)
We therefore turn to using the mixture approach. Let P_0 = N(0, σ²I_n), and for V = {±1}ⁿ define P_v = N(δv, σ²I_n). It is immediate that T(P_0) = 0 while T(P_v) = δ²n, so we have separation in the values of the statistic. In this case, we apply Theorem 13.2.1 to obtain

  M_n(T(P_{σ,r}), |·|) ≥ (δ²n/2) [1 − √(½ log(1 + D_{χ²}(P̄ ∥ P_0)))]

for P̄ = 2^{−n} Σ_{v∈V} P_v. Substituting the result of Lemma 13.2.5 into the minimax lower bound, we obtain

  M_n(T(P_{σ,r}), |·|) ≥ (δ²n/2) (1 − √(nδ⁴/(4σ⁴))).
We choose δ so that the (implied) probability of error in the hypothesis test from which our reduction follows is at least ½, for which it evidently suffices to take δ = σ/n^{1/4}. Putting all the pieces together, we achieve the minimax lower bound

  M_n(T(P_{σ,r}), |·|) ≥ δ²n/4 = σ²√n/4.    (13.2.6)
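The algebra in the choice of δ is worth double-checking: with δ = σ/n^{1/4} we get nδ⁴/(4σ⁴) = 1/4, so the bracketed factor equals 1 − 1/2 = 1/2 and the bound becomes δ²n/4 = σ²√n/4. In code (my own arithmetic check):

```python
import math

for n in (16, 100, 10_000):
    for sigma in (0.5, 1.0, 3.0):
        delta = sigma / n ** 0.25
        # the lower bound (delta^2 n / 2)(1 - sqrt(n delta^4 / (4 sigma^4)))
        lower = (delta ** 2 * n / 2) * (1 - math.sqrt(n * delta ** 4 / (4 * sigma ** 4)))
        # ... collapses to sigma^2 sqrt(n) / 4
        assert abs(lower - sigma ** 2 * math.sqrt(n) / 4) < 1e-9 * max(1.0, lower)
```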
Comparing the result from the upper bound (13.2.3), we see that at least in the regime that the radius r scales at most as σ√n, the mixture Le Cam method allows us to characterize the minimax risk of estimation of ∥θ∥_2² in a Gaussian sequence model.
By combining the result (13.2.3) with the more naive two-point lower bound (13.2.5), which is
valid in “large radius” regimes, we have actually characterized the minimax risk.
Corollary 13.2.7. Let P_{σ,r} be the Gaussian sequence model family {N(θ, σ²I_n) : ∥θ∥_2 ≤ r}, and T(θ) = ∥θ∥_2². Then there is a numerical constant c > 0 such that the minimax absolute error satisfies

  c(σ²√n + rσ) ≤ M_n(T(P_{σ,r}), |·|) ≤ √(2nσ⁴ + r²σ²).

Proof  The only thing to recognize is that rσ ≥ σ²√n whenever r ≥ σ√n, in which case min{r², σr} = σr in the bound (13.2.5).
and the regularization λ is chosen to enforce sufficient smoothness of f̂ (see, e.g., [110, Chapter
5.4]).
the L2 -norm of the kth derivative of an (unknown) function f , using the convex hull methodology
we have so far developed. We will consider estimation of such functionals over classes of functions
with bounded higher-order derivatives, defining
  F_s := {f : [0, 1] → R | f(0) = 0, f ∈ C^s and ∥f^(s)∥_∞ ≤ 1},

functions that are s-times continuously differentiable with uniformly bounded sth derivative, where
we choose f(0) = 0 for normalization. We adopt the observation model of nonparametric regression, as in Example 13.1.3, so that Y_i = f(X_i) + ε_i, where ε_i are i.i.d. N(0, 1) and X_i are i.i.d. Uniform([0, 1]). Then we can use the convex hull method to obtain lower bounds on estimation.
Proposition 13.2.8. Fix k ∈ N. Then for any s ≥ 0,

  M_n(T_k(F_{k+s}), |·|) = inf_{T̂} sup_{f∈F_{k+s}} E_f[|T̂(Y_1ⁿ, X_1ⁿ) − T_k(f)|] ≳ n^{−4s/(4(k+s)+1)}.
Proof  The key is to construct "bump-like" functions that induce separation in the functional T_k, as we have done in the construction of nonparametric lower bounds in Chapter 10.1. Pick any "bump" function ϕ : [0, 1] → R with ϕ(0) = ϕ(1) = 0 for which ∥ϕ^{(k+s)}∥_∞ ≤ 1 and ∫_0^1 ϕ^{(k)}(x)² dx > 0. For example, if we take h(x) = exp(−1/(1 − x²)) 1{|x| < 1}, then (up to numerical constant scaling) the bumps ϕ(x) := h(4x − 1) − h(4x − 3) satisfy our desiderata, as they are C^∞ and compactly
supported on [0, 1]. For a value m ∈ N to be chosen, define the rescaled function

  g(x) := m^{−(k+s)} ϕ(mx),

which evidently satisfies g^{(k)}(x) = m^{−s} ϕ^{(k)}(mx) and g^{(k+s)}(x) = ϕ^{(k+s)}(mx), so ∥g^{(k+s)}∥_∞ ≤ 1. Then for v ∈ {−1, 1}^m, define the functions

  f_v(x) := Σ_{j=1}^m v_j g(x − (j−1)/m) = m^{−(k+s)} Σ_{j=1}^m v_j ϕ(mx − (j−1)),
The divergence bounds for the convex hull are more involved, but follow the approach we have
developed: we relate the χ2 divergence to exponential moments of (independent) random packing
vectors, as in Lemma 13.2.5, and then use sub-Gaussianity of random signs (Example 4.1.5) to
bound this quantity. Define Pvⁿ to be the joint distribution of (Xi, Yi)_{i=1}^n when Yi = f_v(Xi) + εi, and let P̄ⁿ = 2^{−m} Σ_{v∈{±1}^m} Pvⁿ. Let p_v(y | x) denote the density of Y ∼ N(f_v(x), 1). By Lemma 13.2.4, we have
  1 + Dχ²(P̄ⁿ ∥ P₀ⁿ) = (1/2^{2m}) Σ_{v,v′∈{±1}^m} E₀[ p_v(Y | X) p_{v′}(Y | X) / p₀(Y | X)² ]ⁿ
   = (1/2^{2m}) Σ_{v,v′∈{±1}^m} E₀[ exp( Y (f_v(X) + f_{v′}(X)) − f_v(X)²/2 − f_{v′}(X)²/2 ) ]ⁿ,
where we have used that X ∼ Uniform([0, 1]) and that Y ∼ N(0, 1) under P₀. Bounding the expectations requires a bit of work. Define the coordinate functions g_j(x) = g(x − (j − 1)/m) and the vector function ⃗g(x) = [g_j(x)]_{j=1}^m, so that f_v(x) = ⟨v, ⃗g(x)⟩ and ⃗g(x) has at most one non-zero element, because the supports of the g_j are disjoint. Then
  E₀[ p_v(Y | X) p_{v′}(Y | X) / p₀(Y | X)² | X = x ] = exp( ⟨v + v′, ⃗g(x)⟩²/2 − ⟨v, ⃗g(x)⟩²/2 − ⟨v′, ⃗g(x)⟩²/2 )
   = exp( ⟨v, ⃗g(x)⟩⟨v′, ⃗g(x)⟩ ) = exp( v^⊤ diag(⃗g(x))² v′ )
because of the disjoint support of the elements of ⃗g . Now we use that if |t| ≤ 1, then et ≤ 1 + t + t2 ,
which implies

  E₀[ p_v(Y | X) p_{v′}(Y | X) / p₀(Y | X)² ] ≤ 1 + v^⊤ E₀[diag(⃗g(X))²] v′ + ∫₀¹ (v^⊤ diag(⃗g(x))² v′)² dx
   = 1 + ⟨v, v′⟩ ∫₀^{1/m} g²(x) dx + m ∫₀^{1/m} g⁴(x) dx
   = 1 + (1/m^{2(k+s)+1}) ∫₀¹ ϕ²(u) du · ⟨v, v′⟩ + (1/m^{4(k+s)}) ∫₀¹ ϕ⁴(u) du.
In particular, for the numerical constants c = ∫₀¹ ϕ²(u) du and c′ = ∫₀¹ ϕ⁴(u) du, we use that 1 + t ≤ e^t to obtain that for independent V, V′ ∼ Uniform({±1}^m),

  1 + Dχ²(P̄ⁿ ∥ P₀ⁿ) ≤ E exp( (nc/m^{2(k+s)+1}) ⟨V, V′⟩ + nc′/m^{4(k+s)} )
   ≤ exp( c²n²m/(2m^{4(k+s)+2}) + nc′/m^{4(k+s)} ).  (13.2.8)
In particular, we can choose m scaling as n^{2/(4(k+s)+1)}, which (with an appropriate constant) yields 1 + Dχ²(P̄ⁿ ∥ P₀ⁿ) ≤ 2, so that 2∥P₀ⁿ − P̄ⁿ∥²_TV ≤ log(1 + Dχ²(P̄ⁿ ∥ P₀ⁿ)) ≤ log 2 < 1. Substituting this choice of m into the separation bound (13.2.7) gives the result.
Proposition 13.2.8 has a few consequences. First, if we wish to estimate the integral of the kth derivative of a function f, and all we know is that f has a bounded and continuous kth derivative (the case s = 0), then the minimax risk is constant—consistent estimation is impossible. Typical choices are k = 1 and one additional degree of smoothness s = 1 in Proposition 13.2.8, which implies a lower bound on the minimax risk of

  M(T₁(F₂), |·|) ≳ n^{−1/(2+1/4)} = n^{−4/9}.

More generally, if the number of additional degrees of smoothness s ≤ k, then the standard parametric rate n^{−1/2} is unachievable.
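To make the exponent in Proposition 13.2.8 concrete, the following snippet (illustrative only; the helper name `rate_exponent` is ours, not the text's) tabulates 4s/(4(k + s) + 1) and confirms the two claims above: the rate is n^{−4/9} when k = s = 1, and the parametric exponent 1/2 is out of reach exactly when s ≤ k.

```python
from fractions import Fraction

def rate_exponent(k, s):
    # minimax rate n^{-4s/(4(k+s)+1)} per Proposition 13.2.8
    return Fraction(4 * s, 4 * (k + s) + 1)

assert rate_exponent(1, 1) == Fraction(4, 9)  # the k = s = 1 case in the text
assert rate_exponent(1, 0) == 0               # s = 0: constant risk, no consistency

# the parametric rate n^{-1/2} is unachievable exactly when s <= k
for k in range(1, 6):
    for s in range(0, 6):
        slower = rate_exponent(k, s) < Fraction(1, 2)
        assert slower == (s <= k)
print("exponent checks passed")
```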
JCD Comment: Maybe add two or three exercises around these ideas:
that is, the sum of the worst-case probabilities that the test is incorrect. (We also use the notation R(Ψ | H₀, H₁) to denote the same quantity.) In the scenarios we consider, we will assume a metric
ρ on the family of distributions P, and instead of the general hypothesis test (13.3.1), we will
consider testing whether P ∈ P₀ or ρ(P, P₀) ≥ ϵ for all P₀ ∈ P₀, giving the variant

  H₀ : P ∈ P₀
  H₁ : P ∈ P₁(ϵ) := {P ∈ P s.t. ρ(P, P₀) ≥ ϵ for all P₀ ∈ P₀}.  (13.3.2)
In this case, we can define the risk at distance ϵ for a sample of size n by

  Rn(Ψ, ϵ) := sup_{P₀∈P₀} P₀ⁿ(Ψ(X₁ⁿ) ≠ 0) + sup_{P∈P₁(ϵ)} Pⁿ(Ψ(X₁ⁿ) ≠ 1),

leaving P₀ and P implicit in the definition, and where we let X₁ⁿ ~iid P. From this, we can define the minimax test risk

  inf_Ψ Rn(Ψ, ϵ).
We then ask for the particular thresholds ϵ at which the minimax test risk becomes small or
large. Thus, while the coming definition allows some ambiguity, we say that a sequence ϵn is a
minimax threshold or critical testing radius for the testing problem (13.3.2) if there exist numerical
constants 0 < c ≤ C < ∞ such that
  inf_Ψ Rn(Ψ, Cϵn) ≤ 1/3  and  inf_Ψ Rn(Ψ, cϵn) ≥ 2/3.  (13.3.4)

The constants 1/3 and 2/3 are unimportant, the point being that for separation at most cϵn, no hypothesis test can determine whether the distribution P satisfies P ∈ P₀ or inf_{P₀∈P₀} ρ(P, P₀) ≥ cϵn with reasonable accuracy. But it is possible to test whether P ∈ P₀ or inf_{P₀∈P₀} ρ(P, P₀) ≥ Cϵn
with reasonable accuracy. Moreover, we can make the probability of error exponentially small by
increasing the sample size by a constant factor, as Exercise 13.4 explores. In some cases, and we
give one extended example in Section 13.3.4, one can establish a stronger result than the critical
radius (13.3.4), instead establishing a phase transition. In this case, we say that a sequence ϵn is
the phase transition threshold if for any c < 1 < C,
  lim sup_n inf_Ψ Rn(Ψ, Cϵn) = 0  and  lim inf_n inf_Ψ Rn(Ψ, cϵn) = 1.  (13.3.5)
Conveniently, the minimax test risk has a precise divergence-based form, to which we can apply
the techniques comparing different divergences we have developed. In particular, we have the
following analogue of Le Cam’s convex hull lower bound in Theorem 13.2.1, which provides the
same fundamental quantity (the variation distance between convex hulls of P0 and P1 ) for lower
bounds, except that it applies for testing.
Proposition 13.3.1 (Convex hull lower bounds in testing). For any classes P₀ and P₁, the minimax test risk satisfies

  inf_Ψ R(Ψ | P₀, P₁) ≥ 1 − sup{ ∥P̄₀ − P̄₁∥_TV | P̄₀ ∈ Conv(P₀), P̄₁ ∈ Conv(P₁) }.

Proof  For any P̄₀ ∈ Conv(P₀) and P̄₁ ∈ Conv(P₁), we have

  R(Ψ | P₀, P₁) ≥ P̄₀(Ψ ≠ 0) + P̄₁(Ψ ≠ 1)

because suprema are always at least as large as averages. Now note that the set A = {x | Ψ(x) = 0} satisfies

  P̄₀(Ψ ≠ 0) + P̄₁(Ψ ≠ 1) = P̄₀(A^c) + P̄₁(A) = 1 − (P̄₀(A) − P̄₁(A)),

and take an infimum over regions A.
In fact, equality typically holds in Proposition 13.3.1, but this requires the application of (infinite
dimensional) convex duality, which is beyond our scope here.
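The equality case is easy to see on a toy instance: for singleton classes P₀ = {P} and P₁ = {Q} on a three-letter alphabet (the pmfs below are arbitrary choices of ours), the likelihood-ratio test attains risk 1 − ∥P − Q∥_TV, and brute force over all deterministic tests does no better. A minimal sketch:

```python
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))  # total variation distance

def risk(accept_h1):
    # accept_h1[x] truthy means Psi(x) = 1; risk = P(Psi != 0) + Q(Psi != 1)
    return sum(p[x] for x in range(3) if accept_h1[x]) + \
           sum(q[x] for x in range(3) if not accept_h1[x])

# likelihood-ratio test: accept H1 exactly where q(x) > p(x)
risk_lr = risk([q[x] > p[x] for x in range(3)])
# brute force over all 2^3 deterministic tests
best = min(risk([(mask >> x) & 1 for x in range(3)]) for mask in range(8))

assert abs(risk_lr - (1 - tv)) < 1e-12
assert abs(best - (1 - tv)) < 1e-12
print(best, 1 - tv)
```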
estimator. For the converse results that no test can distinguish the families P0 and P1 at a particular
distance, we use the mixture χ2 approaches we have outlined.
Let us give the general recipe first. Suppose that we have a statistic T designed to separate
the classes P0 and P1 . Such a statistic should assign large values for samples X ∼ P1 for P1 ∈ P1
and small values for samples X ∼ P0 . A more quantitative version of this, where the separation
E1 [T ] − E0 [T ] is commensurate with the variance of T , is sufficient to test between P0 and P1 with
high accuracy. To that end, we say that the statistic T robustly C-separates P0 and P1 if
  E_{P₁}[T] − sup_{P₀∈P₀} E_{P₀}[T] ≥ C ( sup_{P₀∈P₀} √Var_{P₀}(T) + √Var_{P₁}(T) )  (13.3.6)
for each P1 ∈ P1 . Typically, we choose statistics T so that EP0 [T ] = 0 for each P0 in the null P0
(though this is not always possible). The next proposition shows how to define a test that leverages
this to achieve small worst-case test error.
Proposition 13.3.2. Let the statistic T : X → R robustly C-separate P₀ from P₁. Then for the threshold τ = sup_{P₀∈P₀} E_{P₀}[T] + C sup_{P₀∈P₀} √Var_{P₀}(T), the test

  Ψ(X) := 1{T ≥ τ}

satisfies

  R(Ψ | P₀, P₁) ≤ 2/C².
C2
Proof  Without loss of generality we assume sup_{P₀∈P₀} E_{P₀}[T] = 0, as the test is invariant to shifts, so that τ = C sup_{P₀∈P₀} √Var_{P₀}(T). We can also assume that C ≥ 1, as otherwise the proposition is vacuous. We control the test error in each case. Under any null P₀, Chebyshev’s inequality gives

  P₀(Ψ ≠ 0) = P₀(T ≥ τ) ≤ Var_{P₀}(T)/τ² ≤ 1/C².
For the alternatives P₁ ∈ P₁, we have

  P₁(Ψ ≠ 1) = P₁(T < τ) ≤ P₁(T − E₁[T] ≤ τ − E₁[T]) ≤ Var₁(T)/[E₁[T] − τ]₊².

But of course, the separation condition (13.3.6) gives

  E₁[T] − τ = E₁[T] − sup_{P₀} E_{P₀}[T] − C sup_{P₀} √Var_{P₀}(T) ≥ C √Var₁(T),

so that

  P₁(Ψ ≠ 1) ≤ Var₁(T)/(C² Var₁(T)) = 1/C²,

as desired.
Example 13.3.4 (A global null in multiple hypothesis testing): Consider the problem of testing d distinct null hypotheses H₀,ⱼ, j = 1, . . . , d, where for each we have a p-value Yⱼ and reject H₀,ⱼ if Yⱼ ≤ τ for a threshold τ. (Recall that a p-value is a random variable Y that is sub-uniform, meaning that P(Y ≤ u) ≤ P(U ≤ u) for U ∼ Uniform[0, 1], so we are less likely to reject at threshold τ than a uniform would be.) If we assume the Yⱼ are exact p-values, that is, P(Yⱼ ≤ u) = u for u ∈ [0, 1], then testing the global null

  H₀ := ∩_{j=1}^d H₀,ⱼ = { each Yⱼ ~iid Uniform[0, 1] }

is equivalent to Gaussian signal detection. Indeed, let Zⱼ = Φ⁻¹(Yⱼ), where Φ denotes the standard Gaussian cumulative distribution function. Then under the global null H₀, we have Z ∼ N(0, I_d). The question of which alternative class P₁ to consider is then frequently a matter of application. For example, we might be curious about alternatives for which a few nulls H₀,ⱼ are false, that is, sparse alternatives. Example 13.3.3 corresponds to something like dense alternatives. ◇
With these as motivation, let us consider Example 13.3.3 more carefully, in an effort to find the critical radius r at which minimax testing becomes feasible (or infeasible). While our standard techniques for estimation tell us that the minimax rate for estimating θ in a normal location family P = {N(θ, σ²I_d)}_{θ∈R^d} (say, in mean squared error) necessarily scales as

  Mn(θ(P), ∥·∥₂²) = dσ²/n,

we can test whether the mean of a Gaussian is zero at a smaller separation—effectively, while E[∥θ̂ − θ∥₂²] → 0 as n → ∞ if and only if d/n → 0, in the testing case, we can save a dimension-dependent factor √d. In particular, the next two examples—one addressing achievability and one
the fundamental limit—show that in the dense Gaussian signal detection problem of Example 13.3.3, the critical test radius (13.3.4) at which testing is feasible or infeasible scales as

  rn := d^{1/4}/√n.

We can achieve (asymptotically) accurate testing in the dense signal detection problem (13.3.7) if and only if √d/n → 0 as n → ∞.
We first demonstrate achievability in Example 13.3.3, leveraging Proposition 13.3.2.
Example 13.3.5 (Achievability in Gaussian mean testing): We wish to test the alternatives (13.3.7). We use the approach of Proposition 13.3.2: find an estimator of ∥θ∥₂², and then threshold it for our test. The discussion preceding Corollary 13.2.7 (specifically equation (13.2.3)) shows that given a sample of size n, the estimator Tn = ∥X̄n∥₂² − d/n is unbiased for ∥θ∥₂² and satisfies

  E_θ[(Tn − ∥θ∥₂²)²] = Var_θ(Tn) ≤ 2d/n² + 4∥θ∥₂²/n.  (13.3.8)
Note that E₀[Tn] = 0, and so because

  E_θ[Tn] − E₀[Tn] = ∥θ∥₂²,

the statistic Tn robustly 2-separates P₀ from P₁(r) (recall definition (13.3.6)) whenever

  (1/2)∥θ∥₂² ≥ √(2d)/n + √(2d/n² + 4∥θ∥₂²/n)
for all θ with ∥θ∥₂ ≥ r. Immediately we see that if we take radius r² = C√d/n for some C > 0, then this separation occurs if C√d ≥ 2(√(2d) + √(2d + 4C√d)), which of course happens for a large enough numerical constant C. Applying Proposition 13.3.2, we thus see that the test Ψ(X₁ⁿ) = 1{Tn ≥ 2√(2d/n²)} satisfies

  Rn(Ψ, Crn) ≤ 1/3  for  rn = d^{1/4}/√n

(taking the numerical constant C large enough), which gives the achievability required for the critical test radius (13.3.4). ◇
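A Monte Carlo sketch of this test (our own illustrative parameter choices: d = 100, n = 100, separation constant C = 8, and a threshold two null standard deviations of Tn out) simulates the sample mean X̄n directly and checks that both error probabilities are small:

```python
import random, math

random.seed(0)
d, n = 100, 100
r2 = 8 * math.sqrt(d) / n        # separation ||theta||^2 = C sqrt(d)/n with C = 8
tau = 2 * math.sqrt(2 * d) / n   # threshold: two null standard deviations of T_n

def T_stat(theta):
    # T_n = ||xbar||^2 - d/n, simulating xbar ~ N(theta, I_d/n) directly
    xbar = [random.gauss(t, 1 / math.sqrt(n)) for t in theta]
    return sum(x * x for x in xbar) - d / n

trials = 400
theta0 = [0.0] * d
theta1 = [math.sqrt(r2 / d)] * d  # energy spread evenly: ||theta1||^2 = r2

null_err = sum(T_stat(theta0) >= tau for _ in range(trials)) / trials
alt_err = sum(T_stat(theta1) < tau for _ in range(trials)) / trials
print(null_err, alt_err)  # both error rates should be small
```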
Example 13.3.5 shows that at the critical radius rn = d^{1/4}/√n, it is possible (in a worst-case sense) to test between the null H₀ : N(0, I_d) and alternatives H₁ : N(θ, I_d) for ∥θ∥₂ ≥ Crn, where C is a numerical constant. We can also provide the converse.
Example 13.3.6 (Lower bounds in Gaussian mean testing): Let P₁(r) = {N(θ, I_d) | ∥θ∥₂ ≥ r} be the collection of Gaussians with means r away from the origin in ℓ₂-norm. We seek the critical radius r below which it is impossible to distinguish between P₀ = N(0, I_d) and P₁ ∈ P₁(r) given an i.i.d. sample X₁ⁿ. Lemma 13.2.6 and Proposition 13.3.1 combine (set δ² = r²/d in Lemma 13.2.6) to give

  inf_Ψ Rn(Ψ | P₀, P₁(r)) ≥ 1 − √(n²r⁴/(4d)).

In particular, the threshold r² = √d/n means that there is necessarily constant test error probability Rn ≥ 1/2. Combining the achievability guarantee of Example 13.3.5 with this lower bound shows that the critical radius (13.3.4) for testing H₀ : N(0, I_d) against the family of alternatives H₁ : N(θ, I_d) with ∥θ∥₂² ≥ r² is precisely r² = √d/n. ◇
∥p − p0 ∥ = 0 versus ∥p − p0 ∥ ≥ ϵ
from n i.i.d. observations Xi ~iid p becomes feasible or infeasible.
It is simpler (for analyzing procedures) to consider a slight variant of this problem, which uses the Poissonization trick. To motivate the idea, identify the observations Xi with the basis vectors (so that observing item j ∈ {1, . . . , d} corresponds to Xi = eⱼ). Then the sample mean p̂ = (1/n) Σ_{i=1}^n Xi is unbiased, but its coordinates exhibit dependence in that ⟨1, p̂⟩ = 1—an annoyance for analyses. Thus, we consider an alternative approach, where we assume a two-stage sampling procedure: we first draw N ∼ Poi(n), and then conditional on N = m, draw Xi ~iid p, i = 1, . . . , m. As E[N] = n and N concentrates around its mean, this is nearly equivalent to simply observing Xi ~iid p for i = 1, . . . , n, and a standard probabilistic calculation shows that the distribution of {Xi}_{i=1}^N conditional on N = m is identical to the distribution of Xi ~iid p, i = 1, . . . , m.
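A small simulation (a sketch only; the Knuth-style Poisson sampler and the choices n = 60 and pmf p are ours) shows what Poissonization buys: each count n·p̂ⱼ behaves like an independent Poi(npⱼ) variable, with mean ≈ variance and essentially uncorrelated coordinates, unlike fixed-n multinomial counts:

```python
import random, math

random.seed(1)

def poisson(lam):
    # Knuth's multiplication algorithm; fine for moderate lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

n, p = 60, [0.5, 0.3, 0.2]
trials = 4000
counts = []
for _ in range(trials):
    m = poisson(n)                       # random sample size N ~ Poi(n)
    draws = random.choices(range(3), weights=p, k=m) if m else []
    counts.append([draws.count(j) for j in range(3)])

for j in range(3):
    cj = [c[j] for c in counts]
    mean = sum(cj) / trials
    var = sum((x - mean) ** 2 for x in cj) / trials
    # Poissonized counts: n*phat_j ~ Poi(n p_j), so mean and variance ~ n p_j
    assert abs(mean - n * p[j]) < 1.5
    assert abs(var - n * p[j]) < 3.0

# coordinates are (nearly) uncorrelated; fixed-n multinomial gives -n p0 p1 = -9
c0 = [c[0] for c in counts]; c1 = [c[1] for c in counts]
m0, m1 = sum(c0) / trials, sum(c1) / trials
cov = sum((a - m0) * (b - m1) for a, b in zip(c0, c1)) / trials
print(cov)
```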
Even more, the minimax risk for estimation in this Poissonized sampling scheme is similar to that for estimation in the original multinomial setting. Indeed, suppose that we wish to estimate an abstract statistic T(p) of p ∈ ∆_d, and assume for simplicity that T(p) ∈ [−r, r] for some fixed r. Define the minimax and Poissonized minimax risks

  Mn := inf_{T̂} sup_{p∈∆_d} E_p[(T̂(X₁ⁿ) − T(p))²]  and  M_{Poi(n)} := inf_{{Tm}} sup_{p∈∆_d} E_p[(T_N(X₁^N) − T(p))²],

where the latter expectation is taken over the sample size N ∼ Poi(n), and {Tm} denotes a sequence of estimators (defined for all sample sizes m). We have the following proposition, which shows that if we can provide procedures that work in the Poissonized (independent sampling) setting, then the standard multinomial sampling setting is similarly easy (or challenging).
Proposition 13.3.7. There exist numerical constants 0 < c, C < ∞ such that
This is equivalent to sampling np̂ⱼ ∼ Poi(npⱼ) and nq̂ⱼ ∼ Poi(nqⱼ) independently, j = 1, . . . , d, and so we use
the quantities (13.3.10) to define an estimator we can threshold using Proposition 13.3.1. We work
through this in the next (somewhat complicated) example.
Example 13.3.8 (Estimating the ℓ₂-distance between multinomials): For the estimators (13.3.10), define the quantity

  Zⱼ := (np̂ⱼ − nq̂ⱼ)² − np̂ⱼ − nq̂ⱼ.
Recalling that if W ∼ Poi(λ) then E[W] = Var(W) = λ, we have E[np̂ⱼ] = npⱼ and Var(np̂ⱼ) = npⱼ, so

  E[Zⱼ] = E[(np̂ⱼ)²] + E[(nq̂ⱼ)²] − 2n²pⱼqⱼ − npⱼ − nqⱼ
   = Var(np̂ⱼ) + Var(nq̂ⱼ) + (npⱼ)² + (nqⱼ)² − 2n²pⱼqⱼ − npⱼ − nqⱼ = n²(pⱼ − qⱼ)²,

so that E[⟨1, Z⟩] = n²∥p − q∥₂².
A computation with the Poisson moments also gives

  Var(⟨1, Z⟩) ≤ 4n³ ∥p − q∥₄² ∥p + q∥₂ + 2n² ∥p + q∥₂².

Under the (non-point) null H₀ : p = q, we have Var(⟨1, Z⟩) = 2n²∥p + q∥₂² ≤ 8n², as sup_{p,q} ∥p + q∥₂ = 2. Proposition 13.3.2 thus shows that if

  ∥p − q∥₂² ≥ C ( √(8/n²) + √(16∥p − q∥₄²/n + 8/n²) ),  (13.3.12)
Summarizing, we see that if we wish to test whether two multinomials are identical or separated in ℓ₂, the critical threshold for the hypothesis test

  H₀ : p = q
  H₁ : ∥p − q∥₂ ≥ δ  (13.3.13)

scales as δ ≍ 1/√n: we can test between H₀ and H₁ at separations that are essentially “independent” of the dimension or number of categories d. This is in fact sharp, as a relatively straightforward argument with Le Cam’s two-point lemma demonstrates (see Exercise 13.18). However, if we change the norm ∥·∥₂ into the ℓ₁-norm ∥·∥₁, the story changes significantly.
Let us change the hypothesis test (13.3.13) to a simpler-looking—in that we only test goodness of fit—ℓ₁-based variant. Identifying distributions P on {1, . . . , d} with their p.m.f.s p ∈ ∆_d, let P₀ be the uniform distribution on {1, . . . , d}, with p.m.f. p₀ = (1/d)1. Then we consider the testing problem

  H₀ : p = p₀
  H₁ : ∥p − p₀∥₁ ≥ δ,  (13.3.14)
which tests the ℓ1 -distance to uniformity. In this case, developing a test that distinguishes these
hypotheses at the optimal rate is quite sophisticated, though we outline an approach to it in the
exercises. To develop the correct order of lower bound—that is, a threshold δ for which no test can
reliably distinguish H0 from H1 —is possible via the mixture of χ2 -distributions approach we have
developed in Lemma 13.2.3.
Proposition 13.3.9 (A lower bound for testing ℓ₁-separated multinomials). In the testing problem (13.3.14),

  inf_Ψ Rn(Ψ | H₀, H₁) ≥ 1 − 1/√2

whenever δ ≤ d^{1/4}/√n.
Proof  We construct a particular packing of the probability simplex ∆_d ⊂ R₊^d that guarantees that the divergence between elements of H₀ and H₁ in the test (13.3.14) is small. For simplicity, we assume d is even, as it changes nothing. For the base distribution P₀ take the p.m.f. p₀ = (1/d)1 as required by the problem (13.3.14). To construct the alternatives, let V ⊂ {±1}^d be the collection of 2^{d/2} vectors of the form v = (v′, −v′), where v′ ∈ {±1}^{d/2}, so that ⟨1, v⟩ = 0 for each v ∈ V. Then for δ ≥ 0 to be chosen, define the p.m.f.s p_v = (1 + δv)/d. Identify samples X ∈ {e₁, . . . , e_d}. Then for any x ∈ {eⱼ}, we have P_v(X = x) = (1/d)(1 + δ⟨v, x⟩), and so for any pair v, v′ we have

  P_v(X = x) P_{v′}(X = x) / P₀(X = x)² = (1 + δ⟨v, x⟩)(1 + δ⟨v′, x⟩).
From this key equality, we see that if V, V′ ~iid Uniform(V), then for P̄ = (1/|V|) Σ_{v∈V} P_v we have

  1 + Dχ²(P̄ⁿ ∥ P₀ⁿ) = E₀[ ∏_{i=1}^n (1 + δ⟨V, Xi⟩)(1 + δ⟨V′, Xi⟩) ]
   = E[ E₀[(1 + δ⟨V, X⟩)(1 + δ⟨V′, X⟩) | V, V′]ⁿ ]
   = E[ (1 + (δ²/d)⟨V, V′⟩)ⁿ ],
where the final equality follows because E₀[⟨v, X⟩] = (1/d)⟨v, 1⟩ = 0 for each v ∈ V. Now we use that 1 + t ≤ e^t for all t to obtain

  1 + Dχ²(P̄ⁿ ∥ P₀ⁿ) ≤ E exp( (nδ²/d)⟨V, V′⟩ ) = E exp( (2nδ²/d) Σ_{j=1}^{d/2} Uⱼ )

for Uⱼ ~iid Uniform{±1}. But of course these Uⱼ are 1-sub-Gaussian, so

  1 + Dχ²(P̄ⁿ ∥ P₀ⁿ) ≤ exp( n²δ⁴/d ).
Now use Pinsker’s inequalities (Propositions 2.2.8 and 2.2.9), which give 2∥P₀ⁿ − P̄ⁿ∥²_TV ≤ log(1 + Dχ²(P̄ⁿ ∥ P₀ⁿ)) ≤ n²δ⁴/d. Choosing δ⁴ = d/n² then gives ∥P₀ⁿ − P̄ⁿ∥_TV ≤ 1/√2, and Proposition 13.3.1 gives the result.
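The single-observation identity driving this proof, E₀[(1 + δ⟨v, X⟩)(1 + δ⟨v′, X⟩)] = 1 + δ²⟨v, v′⟩/d for the balanced sign vectors v = (u, −u), can be verified exhaustively in small dimension (d = 6 and δ = 0.3 below are arbitrary choices of ours):

```python
import itertools

# X is uniform on the standard basis {e_1, ..., e_dim}; for balanced sign
# vectors v = (u, -u) the linear terms cancel, leaving 1 + delta^2 <v,v'>/dim.
dim, delta = 6, 0.3
half = dim // 2
balanced = [u + tuple(-a for a in u)
            for u in itertools.product([-1, 1], repeat=half)]
for v in balanced:
    for w in balanced:
        lhs = sum((1 + delta * v[j]) * (1 + delta * w[j])
                  for j in range(dim)) / dim
        rhs = 1 + delta ** 2 * sum(a * b for a, b in zip(v, w)) / dim
        assert abs(lhs - rhs) < 1e-12
print("identity verified for", len(balanced), "sign vectors")
```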
Example 13.3.10 (Sparse Gaussian signal detection): In the sparse Gaussian signal detection problem, we observe n random variables Yi ∼ N(µi, 1), where under the null H₀ the means µi = 0 identically, and under the alternative H₁ we take µi = 0 or µi = √(2r log n) for some value r < 1, but for which most of the µi are zero. Note that if r > 1, then the trivial test comparing maxᵢ Yi to √(2r log n) would be asymptotically perfect: under the null,

  P_{H₀}( max_{i≤n} Yi ≥ √(2r log n) ) ≤ n P(Y₁ ≥ √(2r log n)) ≤ n exp( −(2r log n)/2 ) = n^{1−r} → 0,

while under the alternative, so long as k ≫ 1 of the signals are non-null, for Zi ~iid N(0, 1) we have

  P_{H₁}( max_{i≤n} Yi ≥ √(2r log n) ) ≥ P( max_{i≤k} Zi ≥ 0 ) = 1 − 2^{−k} → 1.
One typical formulation is as a mixture problem, where under H₀ we have Yi ~iid N(0, 1), while under H₁ we have

  Yi ~iid (1 − ϵn)N(0, 1) + ϵn N(µ, 1),

a mixture of N(0, 1) and N(µ, 1) distributions, where ϵn determines the sparsity fraction and µ the signal strength. We then ask for the rates at which ϵn → 0 and the associated signal strengths µ > 0 that determine whether testing between H₀ and H₁ is possible. As a brief remark, it is relatively straightforward to show that if ϵn = n^{−β} for some β ∈ (1/2, 1), so that the signal is indeed quite sparse, then for µ = √(2r log n), testing between H₀ and H₁ is impossible if r < β − 1/2. (See Exercise 13.5.) ◇
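The max test above is easy to simulate (a sketch; the choices n = 500, r = 2, and k = 10 non-null coordinates are ours, placed in the r > 1 regime where the test should be nearly perfect):

```python
import random, math

random.seed(3)
n, r, k = 500, 2.0, 10                 # r > 1: max test is asymptotically perfect
mu = math.sqrt(2 * r * math.log(n))    # signal strength sqrt(2 r log n)
thresh = mu
trials = 200

# null: all means zero; reject when the maximum exceeds the threshold
null_rej = sum(
    max(random.gauss(0, 1) for _ in range(n)) >= thresh for _ in range(trials)
)
# alternative: k coordinates carry the signal mu, the rest are null
alt_det = sum(
    max(random.gauss(mu if i < k else 0.0, 1) for i in range(n)) >= thresh
    for _ in range(trials)
)
print(null_rej / trials, alt_det / trials)
```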
In the rest of this section, we develop some techniques to answer the questions Example 13.3.10
poses.
Abstractly, we model the sparse signal detection problem as testing mixtures, where for individual observations Yi we have a null distribution P₀, a known alternative P₁, and a sequence of observations Yi, i = 1, . . . , n, each drawn either

  H₀ : Yi ~iid P₀  or  H₁ : Yi ~iid (1 − ϵ)P₀ + ϵP₁,  (13.3.15)

so that under the alternative H₁ we observe from a mixture of P₀ and P₁, that is, about an ϵ fraction of the time we observe data from P₁. Then the question in such a sparse signal detection problem is the rate at which we can take ϵ ↓ 0 while still reliably testing between H₀ and H₁.
Because the testing problem (13.3.15) is a simple hypothesis test of P₀ⁿ against ((1 − ϵ)P₀ + ϵP₁)ⁿ, the likelihood ratio test is always optimal, though this is a bit unsatisfying as a principle. Alternatively, by the identity that inf_Ψ {P(Ψ = 1) + Q(Ψ = 0)} = 1 − ∥P − Q∥_TV, and the equivalence between Hellinger distance and total variation distance that Proposition 2.2.7 shows, we see that

  d²hel(P₀ⁿ, ((1 − ϵ)P₀ + ϵP₁)ⁿ) → 1  if and only if  inf_Ψ Rn(Ψ | H₀, H₁) → 0.

Equivalently, because d²hel(Pⁿ, Qⁿ) = 1 − (1 − d²hel(P, Q))ⁿ, we see that for the sparse mixture testing problem (13.3.15),

  inf_Ψ Rn(Ψ | H₀, H₁) → 1 if d²hel(P₀, (1 − ϵ)P₀ + ϵP₁) ≪ 1/n,
  inf_Ψ Rn(Ψ | H₀, H₁) → 0 if d²hel(P₀, (1 − ϵ)P₀ + ϵP₁) ≫ 1/n.  (13.3.16)
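The tensorization identity d²hel(Pⁿ, Qⁿ) = 1 − (1 − d²hel(P, Q))ⁿ underlying the display (13.3.16) can be checked by direct enumeration on a small alphabet (the pmfs, mixture weight, and n = 3 below are arbitrary choices of ours):

```python
import itertools, math

p0 = [0.6, 0.3, 0.1]
p1 = [0.1, 0.2, 0.7]
eps = 0.15
mix = [(1 - eps) * a + eps * b for a, b in zip(p0, p1)]

def hel2(p, q):
    # squared Hellinger distance: 1 minus the Hellinger affinity
    return 1 - sum(math.sqrt(a * b) for a, b in zip(p, q))

h2 = hel2(p0, mix)
n = 3
# explicit product distributions over the alphabet {0,1,2}^n
pn = {x: math.prod(p0[i] for i in x)
      for x in itertools.product(range(3), repeat=n)}
qn = {x: math.prod(mix[i] for i in x)
      for x in itertools.product(range(3), repeat=n)}
h2_prod = 1 - sum(math.sqrt(pn[x] * qn[x]) for x in pn)

assert abs(h2_prod - (1 - (1 - h2) ** n)) < 1e-12
print(h2, h2_prod)
```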
While a fully general theory characterizing the limits (13.3.16) does not exist, examples can
help to delineate when we might hope to detect sparse signals. As a simple one that captures some
of the techniques, consider the following Bernoulli detection problem.
and its error under the alternative Xi ~iid P₁ satisfies

  P₁( Sn > n/2 + √((n/2) log(1/α)) ) = P( Sn − n/2 − ∆n > √((n/2) log(1/α)) − ∆n )
   ≥ 1 − exp( −(2/n) [ ∆n − √((n/2) log(1/α)) ]₊² ) → 1

whenever ∆ ≫ 1/√n. ◇
This example fails to capture the fuller complexity of signal detection, because no individual signal can be too strong—the observations lie in {0, 1} regardless. Example 13.3.11 does, however, show that the “interesting” regime is when the signals are quite sparse, so that β > 1/2. Indeed, let β < 1/2 and assume that ∥P₀ − P₁∥_TV ≥ c > 0. Then it is relatively simple to develop a test Ψ that achieves risk Rn(Ψ | H₀, H₁) → 0 by counting observations more likely to have come from P₀ or P₁, and a slight elaboration of this procedure works for β = 1/2 as well.
Corollary 13.3.12. Let P₀ and P₁ be distributions satisfying ∥P₀ − P₁∥_TV = c > 0, and consider the hypothesis test (13.3.15) with ϵ = ϵn = n^{−β}. Then if β < 1/2,

  inf_Ψ Rn(Ψ | H₀, H₁) → 0.

If P₀,n and P₁,n are sequences of distributions satisfying ∥P₀,n − P₁,n∥_TV → 1, then if β ≤ 1/2,

  inf_Ψ Rn(Ψ | H₀, H₁) → 0.

In each limit we consider the hypothesis test of H₀ : P₀,nⁿ versus H₁ : ((1 − ϵn)P₀,n + ϵn P₁,n)ⁿ.
discovered), Ui is likely to be nearer 0, so we have P(Ui ≤ u) > u for u ∈ [0, 1]. We thus formulate (abstractly) the problem of detecting whether more than a negligible number of non-nulls are present as testing

  H₀ : Ui ~iid Uniform([0, 1])  versus  H₁ : Ui ~iid (1 − ϵ)Uniform([0, 1]) + ϵP₁,

an instance of the abstract problem (13.3.15). This is distinct from discovering which p-values are significant and ought to be rejected—we ask whether what we observe would be unlikely if each p-value were null. ◇
JCD Comment: In this example, include a version where we consider nulls P0 =
Uniform[0, 1] and alternatives P1 = (1 − τ )Uniform[0, τ ] + τ Uniform[τ, 1] or something
similar, arguing that the testing/detection problem is the same. Could also think about
it in the context of signal recovery.
Let us (heuristically) formulate Example 13.3.10 in the context of Example 13.3.13 to perform a more quantitative analysis and build a stylized problem that enables us to evaluate potential procedures for detection in Example 13.3.13. Let Φ denote the standard normal CDF, so that for detecting observations Yi that are large, the natural p-value is Ui = Φ(−Yi), as under the null that Yi ∼ N(0, 1), we have Φ(−Yi) ∼ Uniform([0, 1]). Then using the approximation that Φ⁻¹(1 − u) ≈ √(2 log(1/u)) for u near 0, under the alternative Y ∼ N(µ, 1), for Z ∼ N(0, 1) we have

  P(Φ(−Y) ≤ u) = P( Z + µ ≥ Φ⁻¹(1 − u) ) = P( Z ≥ Φ⁻¹(1 − u) − µ ) ≈ P( Z ≥ √(2 log u⁻¹) − µ ).

If µ = √(2r log n) as in the scaling of Example 13.3.10, then we see a transition in the probability above as u ≷ n^{−r}: if u ≪ n^{−r}, then

  P( Z ≥ √(2 log u⁻¹) − √(2r log n) ) ≤ exp( −(√(2 log u⁻¹) − √(2r log n))²/2 ) ≈ 0,
That is, we have P(Φ(−Y) ≥ u) ≈ 0 if u ≫ n^{−r}, so when Yi comes from the non-null N(µ, 1), we can (heuristically) model Φ(−Yi) as uniform on [0, n^{−r}]. With this as motivation, we instantiate Example 13.3.13 with a particular choice of the alternative P₁ that exhibits this type of transitional behavior: Ui ∼ P₁ is uniform on [0, τ] for some τ > 0, where for smaller values of τ we are more likely to observe “significant” p-values.
where r indicates the “signal strength,” with larger r decreasing the threshold τ. Take ϵ = ϵn = n^{−β} for some β ∈ (1/2, 1). Then P₀ has density 1 on [0, 1], while P₁ has density (1/τ) 1{0 ≤ u ≤ τ}. Let us compute the Hellinger distance between P₀ and the mixture, as the limit (13.3.16) dictates. The general recipe begins as follows: we let L = dP₁/dP₀ be the likelihood ratio between P₁ and P₀, so that

  d²hel(P₀, (1 − ϵ)P₀ + ϵP₁) = (1/2) E₀[ (1 − √((1 − ϵ) + ϵL))² ] = 1 − E₀[ √(1 − ϵ + ϵL) ],
We see the thresholding behavior (13.3.5), where testing is possible only if the signal for the
non-null p-values is suitably strong, that is, they have support [0, n^{−r}] for some r > 2β − 1. ◇
Example 13.3.14 shows that in situations in which a (relatively) small proportion of p-values in a
multiple hypothesis test are from non-null distributions, so long as they concentrate near enough to
0, we can detect them. We can relax the choice to make P0 and P1 not absolutely continuous—they
have different supports—while still providing the same asymptotic interpretation, where non-null
p-values are likely to be near 0, with the same phase transitions; Exercise 13.8 provides one such
modification. The development of tests that adaptively achieve the rate Example 13.3.14 suggests
requires nontrivial effort. We outline one such approach in Exercise 13.6, and we also revisit
Example 13.3.10 and develop its critical thresholds in the exercises as well.
JCD Comment: Finish exercises on higher criticism using the notes I’ve developed.
Reference them here. (Write one that uses the maximum to develop, heuristically, the
right threshold, then the full one.)
Theorem 13.4.1. Assume the loss Φ : R₊ → R₊ is convex. Let θ₀ = θ(P₀) and θ₁ = θ(P₁), and define the separation ∆ := 2Φ((1/2)|θ₀ − θ₁|). If the estimator θ̂ satisfies R_{P₀}(θ̂) ≤ γ, then

  R_{P₁}(θ̂) ≥ [ √∆ − √(γ ρχ²(P₁ ∥ P₀)) ]₊².

The theorem immediately extends to product distributions, as ρχ²(P₁ⁿ ∥ P₀ⁿ) = ρχ²(P₁ ∥ P₀)ⁿ, so that if θ̂n satisfies E₀[Φ(|θ̂n(X₁ⁿ) − θ₀|)] ≤ γ, then

  E_{P₁}[ Φ(|θ̂n(X₁ⁿ) − θ₁|) ] ≥ [ √(2Φ(|θ₀ − θ₁|/2)) − √(γ ρχ²(P₁ ∥ P₀)ⁿ) ]₊².
Proof  We assume without loss of generality that θ̂(x) ∈ [θ₀, θ₁], as otherwise, we simply project it onto the interval [θ₀, θ₁]. For any θ ∈ [θ₀, θ₁], we have θ = tθ₀ + (1 − t)θ₁ for some t ∈ [0, 1], so

  √(Φ(|θ − θ₀|)) + √(Φ(|θ − θ₁|)) = √(Φ((1 − t)|θ₁ − θ₀|)) + √(Φ(t|θ₁ − θ₀|))
   ≥ √( Φ((1 − t)|θ₁ − θ₀|) + Φ(t|θ₁ − θ₀|) ) ≥ √( 2Φ(|θ₁ − θ₀|/2) ),  (13.4.2)

because t = 1/2 minimizes Φ(ta) + Φ((1 − t)a). Using the majorization (13.4.2), we thus obtain

  E₁[Φ(|θ̂ − θ₀|)]^{1/2} + E₁[Φ(|θ̂ − θ₁|)]^{1/2} ≥ √(2Φ(|θ₁ − θ₀|/2)) = √∆.
Applying Theorem 13.4.1 generally involves a two-step process: we assume that some estimator achieves small risk, then show that there exist distributions close in χ²-divergence whose parameters nonetheless differ.
Example 13.4.2 (Super-efficiency in normal mean estimation): Let P = {N(θ, 1)}_{θ∈R}, and assume we observe Xi ~iid N(θ, 1). Let θ̂n be an estimator satisfying

  E_{θ₀}[ |θ̂n − θ₀| ] ≤ γ/√n

at some θ₀ ∈ R. Now, consider an alternative distribution P₁ = N(θ₁, 1), for which we observe that

  ρχ²(P₁ ∥ P₀) = exp(|θ₁ − θ₀|²)  so  ρχ²(P₁ⁿ ∥ P₀ⁿ) = exp(n|θ₁ − θ₀|²).
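The Gaussian affinity formula ρχ²(N(θ₁, 1) ∥ N(θ₀, 1)) = exp(|θ₁ − θ₀|²) is easy to confirm by Monte Carlo, estimating E₀[(dP₁/dP₀)²] directly (the shift 0.5 and trial count below are arbitrary choices of ours):

```python
import random, math

random.seed(4)
t0, t1 = 0.0, 0.5
delta = t1 - t0
trials = 200_000
acc = 0.0
for _ in range(trials):
    x = random.gauss(t0, 1)
    # Gaussian likelihood ratio dP1/dP0 at x
    lr = math.exp(delta * (x - t0) - delta * delta / 2)
    acc += lr * lr
est = acc / trials
print(est, math.exp(delta ** 2))  # the two values should nearly agree
```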
Example 13.4.2 is an instance of a broader class of super-efficiency results that arise from considering an alternative to the Hellinger modulus of continuity, where we define the χ²-modulus, with a slight modification that does not change the asymptotics but makes computation simpler. With that in mind, recall the χ²-affinity (13.4.1) and define the local χ²-modulus of continuity by

  ωχ²(ϵ; θ, P₀, P) := sup_{P₁∈P} { |θ(P₁) − θ(P₀)| | log ρχ²(P₁ ∥ P₀) ≤ 2ϵ² }.  (13.4.3)

By Proposition 2.2.9, 2d²hel(P₀, P₁) ≤ log(1 + Dχ²(P₁ ∥ P₀)) = log ρχ²(P₁ ∥ P₀), so the Hellinger modulus dominates the χ²-modulus:
Corollary 13.4.3. Let the conditions of Theorem 13.4.1 hold, and define the shorthand ω(ϵ) = ωχ²(ϵ; θ, P₀, P). Let θ̂n be any estimator satisfying

  E_{P₀}[ Φ(|θ̂n(X₁ⁿ) − θ₀|) ] ≤ γ · Φ(ω(1/√n))

for some γ < 1. Then for all distributions P₁ such that log(1 + Dχ²(P₁ ∥ P₀)) ≤ (1/2n) log(1/γ),

  E_{P₁}[ Φ(|θ̂n(X₁ⁿ) − θ(P₁)|) ] ≥ [ √(2Φ(|θ(P₁) − θ₀|/2)) − √(Φ(ω(1/√n))) ]₊².

In particular,

  sup_{P₁ : log ρχ²(P₁∥P₀) ≤ (1/2n) log(1/γ)} E_{P₁}[ Φ(|θ̂n(X₁ⁿ) − θ(P₁)|) ]
   ≥ [ √( 2Φ( (1/2) ω( (1/2)√((1/n) log(1/γ)) ) ) ) − √(Φ(ω(1/√n))) ]₊².
Proof  Let Φn = Φ(ωχ²(1/√n; θ, P₀, P)) for shorthand. By Theorem 13.4.1, for any P₁ and associated parameters θ₀, θ₁, we have

  R_{P₁}(θ̂) ≥ [ √(2Φ(|θ₁ − θ₀|/2)) − √(γΦn · ρχ²(P₁ ∥ P₀)ⁿ) ]₊².
So we see that, if an estimator outperforms the risk that the χ²-modulus of continuity predicts at a distribution P₀, then its loss on nearby distributions P₁ must be larger than the modulus predicts. A few subtleties are worth noting here, however: the moduli in Corollary 13.4.3 are local to P₀, not to the alternative P₁, meaning that we have not fully satisfied the pointwise super-efficiency converse (iii’). Of course, if the modulus (13.4.3) itself is appropriately continuous in the argument P₀, then we obtain a super-efficiency result.
Example 13.4.4 (Super-efficiency in regular parametric families): Let {P_θ}_{θ∈R^d} be a parametric family regular enough for the χ²-divergence (Definition 13.1), so that Dχ²(P_{θ+v} ∥ P_θ) = v^⊤J(θ)v + o(∥v∥²) for v small, where J(θ) := E_θ[ℓ̇_θ ℓ̇_θ^⊤] is the Fisher information matrix for the parameter θ. Let T : R^d → R be a continuously differentiable function of interest and T̂n be an estimator of T(θ) satisfying

  E_{θ₀}[ (T̂n(X₁ⁿ) − T(θ₀))² ] ≤ γ ∇T(θ₀)^⊤J(θ₀)⁻¹∇T(θ₀) / n

at some θ₀. We recognize ∇T(θ)^⊤J(θ)⁻¹∇T(θ) as the constant factor characterizing the local modulus of continuity for problems with Fisher information, as in Proposition 13.1.4. Then applying Corollary 13.4.3 with the squared error Φ(t) = t², we obtain that for all θ satisfying (θ − θ₀)^⊤J(θ₀)(θ − θ₀) ≲ (1/n) log(1/γ),

  E_θ[ (T̂n(X₁ⁿ) − T(θ))² ] ≥ [ √(|T(θ) − T(θ₀)|²/2) − O(1) ∥J(θ₀)^{−1/2}∇T(θ₀)∥₂/√n ]₊².
Using the differentiability of T, we have |T(θ) − T(θ₀)| = |⟨∇T(θ₀), θ − θ₀⟩| + o(∥θ − θ₀∥). Then choose

  θ = θ₀ + c √((1/n) log(1/γ)) · J(θ₀)⁻¹∇T(θ₀) / ∥J(θ₀)^{−1/2}∇T(θ₀)∥₂

for some numerical constant c > 0 to see that

  E_θ[ (T̂n(X₁ⁿ) − T(θ))² ] ≳ (∇T(θ₀)^⊤J(θ₀)⁻¹∇T(θ₀)/n) log(1/γ) ≳ (∇T(θ)^⊤J(θ)⁻¹∇T(θ)/n) log(1/γ),
Example 13.4.4 extends Example 13.4.2, showing that estimating too accurately at any one
point implies much worse estimation performance elsewhere.
Proof  The only modification we need to make to the proof of Theorem 13.4.1 is to replace the majorization inequality (13.4.2) with a slightly smaller lower bound. To that end, let θ ∈ [θ₀, θ₁], so that θ = tθ₀ + (1 − t)θ₁ for some t ∈ [0, 1]. Then at least one of |θ − θ₀| ≥ (1/2)|θ₁ − θ₀| or |θ − θ₁| ≥ (1/2)|θ₁ − θ₀| holds, and so

  √(Φ(|θ − θ₀|)) + √(Φ(|θ − θ₁|)) ≥ √(Φ(|θ₁ − θ₀|/2)).

Applications of Corollary 13.4.5 follow much the same technique as our applications of Theorem 13.4.1. To highlight the techniques, we revisit normal mean estimation, Example 13.4.2, but now consider a testing scenario.
Example 13.4.6 (Super-efficient testing of a normal mean): Let P = {N(θ, σ²)}_{θ∈R} be the collection of normal distributions, and consider a zero-one loss function indicating whether the estimated mean is near the true mean, Φ(t) = 1{|t| ≥ σ/√n}, so that

  R_{P_θⁿ}(θ̂) = P_θⁿ( |θ̂(X₁ⁿ) − θ| ≥ σ/√n ).
We expect that the risk of most estimators should be roughly of constant order, as minimax
considerations dictate. Let us assume, however, that we have a sequence of estimators θ̂n satisfying P_0^n(|θ̂n| ≥ σ/√n) ≤ δn, where δn → 0 (i.e., θ̂n is super-efficient at θ = 0). Fix any 0 < c < 1, and define the sequence of local alternative parameter spaces
\[
\Theta_n := \Big\{\theta \in \mathbb{R} \;\Big|\; \frac{2\sigma}{\sqrt{n}} \le |\theta| \le \frac{\sigma}{\sqrt{n}}\sqrt{c\log\frac{1}{\delta_n}}\Big\}.
\]
Then we claim that
\[
\liminf_{n} \inf_{\theta\in\Theta_n} P_\theta^n\Big(|\widehat{\theta}_n(X_1^n) - \theta| \ge \frac{\sigma}{\sqrt{n}}\Big) = 1, \qquad (13.4.4)
\]
that is, for a reasonably large collection of parameters θ in a shell around 0, the estimator is never within σ/√n of the true parameter.
To see the limit (13.4.4), assume that n is large enough that c log(1/δn) ≥ 2, so that Θn is non-empty. Let θ ∈ Θn. Then calculating the χ²-affinity, we obtain
\[
\rho_{\chi^2}(P_\theta \,\|\, P_0)^n = \exp\Big(\frac{n\theta^2}{\sigma^2}\Big) \le \exp\Big(\frac{c\,\sigma^2 n \log\frac{1}{\delta_n}}{\sigma^2 n}\Big) = \delta_n^{-c}.
\]
For any θ ∈ Θn, we have Φ(|θ|/2) = 1{|θ| ≥ 2σ/√n} = 1, so Corollary 13.4.5 gives
\[
R_{P_\theta^n}(\widehat{\theta}_n) \ge \Big[1 - \delta_n^{(1-c)/2}\Big]_+^2 \to 1,
\]
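The χ²-affinity computation above can be checked numerically. The following sketch (an illustration, not part of the notes; the values of θ and σ are arbitrary) verifies by quadrature that 1 + Dχ2(N(θ, σ²) ‖ N(0, σ²)) = exp(θ²/σ²), so that the n-fold affinity is exp(nθ²/σ²):

```python
import numpy as np

# Verify numerically that for P_theta = N(theta, sigma^2) and P_0 = N(0, sigma^2),
# the chi-square affinity satisfies 1 + D_chi2(P_theta || P_0) = exp(theta^2 / sigma^2).
theta, sigma = 0.3, 1.5  # arbitrary illustrative values
x = np.linspace(-30.0, 30.0, 400_001)
dx = x[1] - x[0]
p0 = np.exp(-(x / sigma) ** 2 / 2) / np.sqrt(2 * np.pi * sigma ** 2)
pt = np.exp(-(((x - theta) / sigma) ** 2) / 2) / np.sqrt(2 * np.pi * sigma ** 2)
affinity = np.sum(pt ** 2 / p0) * dx   # Riemann sum of p_theta^2 / p_0
exact = np.exp(theta ** 2 / sigma ** 2)
```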
\[
\overline{\omega}_{\chi^2}(\epsilon; \theta, P_0, \mathcal{P}) := \sup_{m\in\mathbb{N}}\;\sup_{P_1,\ldots,P_m\in\mathcal{P}} \Big\{\min_{i\le m}|\theta(P_0) - \theta(P_i)| \;\Big|\; \log\rho_{\chi^2}\big(P \,\|\, P_0\big) \le 2\epsilon^2 \text{ for some } P \in \mathrm{Conv}\{P_i\}_{i=1}^m\Big\}. \qquad (13.4.5)
\]
The quantity (13.4.5) is more sophisticated than the typical local modulus, but measures how much
it is possible to perturb the parameter θ of interest while making sure that some element of the
convex hull Conv{P_i}_{i=1}^m is close to the base distribution P0. We abuse notation slightly to let P^n = {P^n}_{P∈P} be the collection of product distributions and
\[
\overline{\omega}_{\chi^2}(\epsilon; \theta, P_0^n, \mathcal{P}^n) := \sup_{m\in\mathbb{N}}\;\sup_{P_1,\ldots,P_m\in\mathcal{P}} \Big\{\min_{i\le m}|\theta(P_0) - \theta(P_i)| \;\Big|\; \log\rho_{\chi^2}\big(P^n \,\|\, P_0^n\big) \le 2\epsilon^2 \text{ for some } P^n \in \mathrm{Conv}\{P_i^n\}_{i=1}^m\Big\}.
\]
The mixture modulus will provide a benchmark sufficient to guarantee pointwise super-efficiency
converses. First, however, we note, as a corollary of the convex hull method in Theorem 13.2.1, that the mixture modulus always lower bounds the minimax risk.
Corollary 13.4.7. Let the conditions of Theorem 13.2.1 hold. Then for each n and P0 ∈ P,
\[
M_n(\theta(\mathcal{P}), |\cdot|) \ge \frac{1}{4}\,\overline{\omega}_{\chi^2}\Big(\frac{1}{2}; \theta, P_0^n, \mathcal{P}^n\Big).
\]
Proof Let P0 ∈ P and P1 ⊂ P be any δ-separated collection, that is, satisfying |θ(P0 )−θ(P )| ≥ δ
for all P ∈ P1 . Then Theorem 13.2.1 implies that
\[
M_n(\theta(\mathcal{P}), |\cdot|) \ge \frac{\delta}{2}\Big(1 - \big\|P_0^n - \overline{P}^n\big\|_{\mathrm{TV}}\Big)
\]
for any P̄^n ∈ Conv{P_1^n}. By Pinsker's inequality (Proposition 2.2.8) and Proposition 2.2.9,
\[
2\big\|P_0^n - \overline{P}^n\big\|_{\mathrm{TV}}^2 \le \log\rho_{\chi^2}\big(\overline{P}^n \,\|\, P_0^n\big),
\]
363
Lexture Notes on Statistics and Information Theory John Duchi
so if log ρχ2(P̄^n ‖ P0^n) ≤ 1/2, then ‖P0^n − P̄^n‖_TV ≤ 1/2. Taking a supremum over the convex hull gives the result.
We turn to risk transfer inequalities, where we show in analogy with Theorem 13.4.1 that
achieving small risk at some distribution P0 implies achieving larger risk at others.
Theorem 13.4.8. Let Pv ∈ P for v = 0, 1, …, M, let θv = θ(Pv) for each v, and define the separation ∆ = min_{v≥1} |θ0 − θv|. If the estimator θ̂ satisfies E_{P0}[|θ̂ − θ0|] ≤ γ, then for any nonnegative λ ∈ ℝ₊^M with ⟨λ, 1⟩ = 1 and Pλ := Σ_{v=1}^M λv Pv,
\[
\sum_{v=1}^M \lambda_v\, E_{P_v}\big[|\widehat{\theta} - \theta_v|\big] \ge \Big[\sqrt{\Delta} - \sqrt{\gamma\,\rho_{\chi^2}(P_\lambda \| P_0)}\Big]_+^2.
\]
Proof The proof mirrors that of Theorem 13.4.1. Fix v ∈ [M ]. Then for any θ ∈ R, we have as
in inequality (13.4.2) that
\[
\sqrt{|\theta - \theta_0|} + \sqrt{|\theta - \theta_v|} \ge \sqrt{|\theta_0 - \theta_v|} \ge \sqrt{\Delta}.
\]
So we see that for any v,
\[
E_v\big[|\widehat{\theta} - \theta_v|^{1/2} + |\widehat{\theta} - \theta_0|^{1/2}\big] \ge \sqrt{\Delta},
\]
implying by Cauchy-Schwarz that
\[
E_v\big[|\widehat{\theta} - \theta_v|\big] \ge E_v\big[|\widehat{\theta} - \theta_v|^{1/2}\big]^2 \ge \Big[\sqrt{\Delta} - E_v\big[|\widehat{\theta} - \theta_0|^{1/2}\big]\Big]_+^2.
\]
We can leverage Theorem 13.4.8 to show how the convex hull modulus (13.4.5) implies super-
efficiency converses. For shorthand in the corollary, we define
\[
\overline{\omega}_n(\epsilon) := \overline{\omega}_{\chi^2}(\epsilon; \theta, P_0^n, \mathcal{P}^n).
\]
Corollary 13.4.9. Let P0 ∈ P, and let θ̂n be an estimator satisfying E_{P0}[|θ̂n(X_1^n) − θ(P0)|] ≤ γ ω̄n(1). Then there exists a collection of distributions {Pv}_{v=1}^M ⊂ P and a vector λ ∈ ℝ₊^M, ⟨1, λ⟩ = 1, such that
\[
\sum_{v=1}^M \lambda_v\, E_{P_v^n}\big[|\widehat{\theta}_n(X_1^n) - \theta_v|\big] \ge \Big[\sqrt{\overline{\omega}_n(1/\sqrt{\gamma})} - \sqrt{\overline{\omega}_n(1)}\Big]_+^2.
\]
364
Lexture Notes on Statistics and Information Theory John Duchi
for any vector λ ∈ ℝ₊^M with ⟨1, λ⟩ = 1, where Pλ^n = Σ_v λv Pv^n. So long as log ρχ2(Pλ^n ‖ P0^n) ≤ 1/γ, we therefore have
\[
\sum_{v=1}^M \lambda_v\, E_{P_v^n}\big[|\widehat{\theta}(X_1^n) - \theta_v|\big] \ge \Big[\sqrt{\Delta} - \sqrt{\overline{\omega}_n(1)}\Big]_+^2.
\]
Take a supremum over all such convex combinations of P^n = {P^n}_{P∈P} and let ∆ = ω̄n(1/√γ).
Corollary 13.4.9 shows how the convex-hull modulus around P0 provides a (roughly) unimprovable bound: if an estimator achieves a convergence rate faster than ω̄n(1) by a factor γ < 1, then on average over some collection of alternative distributions, its risk must scale at least as the modulus across a larger neighborhood.
Example 13.4.10 (An integral functional bound): We revisit the estimation problems in Section 13.2.3, where the goal was to estimate Tk(f) := ∫₀¹ (f^{(k)}(x))² dx given observations Yi = f(Xi) + εi. For simplicity, let us take k = s = 1, so that f is assumed to have bounded second derivative and we wish to estimate T1(f) = ∫ f′(x)² dx. Suppose that we have an estimator that, for the identically zero function f0(x) = 0, has risk
\[
E_{f_0}\big[|\widehat{T}_n|\big] \le \gamma_n\, n^{-4/9}, \quad \text{where } \gamma_n \to 0,
\]
which is (asymptotically) faster than the lower bound in Proposition 13.2.8. Then Corollary 13.4.9 implies that for the class F2 = {f : [0, 1] → ℝ | ‖f″‖∞ ≤ 1, f(0) = 0}, we have
\[
\overline{\omega}_{\chi^2}(\epsilon; T_1, P_0^n, \mathcal{P}^n) \gtrsim \sup_{m\in\mathbb{N}}\Big\{\frac{1}{m^2} \;\Big|\; \frac{n^2}{m^9} + \frac{n}{m^8} \le \epsilon^2\Big\} \gtrsim \Big(\frac{\epsilon}{n}\Big)^{4/9},
\]
where we chose m = (n/ϵ)^{2/9}. By Corollary 13.4.9, we thus have
\[
\sup_{f\in\mathcal{F}_2} E_{P_f^n}\big[|\widehat{T}_n - T_1(f)|\big] \gtrsim \Big(\frac{1}{n\gamma_n}\Big)^{4/9} \gg \frac{1}{n^{4/9}},
\]
For s ∈ ℝ define fs(x) := p_{θ+sv_s}(x), so that ḟs(x) = ⟨ṗ_{θ+sv_s}(x), v_s⟩ and the score ℓ̇s(x) := ḟs(x)/fs(x) = ⟨v_s, ṗ_{θ+sv_s}(x)⟩/p_{θ+sv_s}(x). Now for t > 0,
\[
\sqrt{f_t(x)} = \sqrt{f_0(x)} + \int_0^t \frac{\dot{f}_s(x)}{2\sqrt{f_s(x)}}\,ds,
\]
so we have
\[
\frac{1}{t^2}\, d_{\mathrm{hel}}^2(P_{\theta+tv_t}, P_\theta) = \frac{1}{8}\int \bigg(\frac{1}{t}\int_0^t \frac{\dot{f}_s(x)}{\sqrt{f_s(x)}}\,ds\bigg)^2 d\mu(x).
\]
By Jensen's inequality, we have the domination condition
\[
\bigg(\frac{1}{t}\int_0^t \frac{\dot{f}_s(x)}{\sqrt{f_s(x)}}\,ds\bigg)^2 \le \frac{1}{t}\int_0^t \frac{\dot{f}_s(x)^2}{f_s(x)}\,ds = \frac{1}{t}\int_0^t \dot{\ell}_s^2(x)\, f_s(x)\,ds,
\]
for small enough t, as J is assumed continuous at θ0. The assumption that for µ-almost all x there is a neighborhood of θ0 on which √pθ is continuously differentiable then allows us to apply a variant of dominated convergence (see Proposition A.3.3 in Appendix A.3), because
\[
\frac{1}{t}\int_0^t \bigg(\frac{\langle \dot{p}_{\theta_0+sv_s}(x), v_s\rangle}{\sqrt{p_{\theta_0+sv_s}(x)}}\bigg)^2 ds \to \frac{\langle v, \dot{p}_{\theta_0}(x)\rangle^2}{p_{\theta_0}(x)}
\]
for µ-almost all x. Then (1/t²) d²_hel(P_{θ0+tv_t}, P_{θ0}) → (1/8)∫⟨ℓ̇_{θ0}(x), v⟩² p_{θ0}(x) dµ(x) = (1/8) vᵀJ(θ0)v. Because
Assuming the bound (13.5.1), standard Lp-convergence arguments give the expansion Dχ2(Pθ+v ‖ Pθ) = vᵀJ(θ)v + o(‖v‖²), as Eθ[ℓ̇θ ℓ̇θᵀ] = J(θ). As in the proof of Lemma 13.1.6, we essentially reduce to the one-dimensional case. Let vt → v ∈ S^{d−1} as t → 0. Then
\[
\frac{1}{t^2}\int \Big(\frac{dP_{\theta+tv_t}}{dP_\theta} - 1 - t\dot{\ell}_\theta^\top v_t\Big)^2 dP_\theta \le \int h^2(x)\,\|v_t\|^2\,dP_\theta(x) \lesssim \int h^2(x)\,dP_\theta(x) < \infty,
\]
and using that for µ-almost all x
\[
\frac{1}{t}\Big(\frac{p_{\theta+tv_t}(x)}{p_\theta(x)} - 1\Big) - \dot{\ell}_\theta(x)^\top v_t \to 0
\]
as t → 0, we have by dominated convergence that
\[
\int \Big(\frac{dP_{\theta+tv_t}}{dP_\theta} - 1 - t\dot{\ell}_\theta^\top v_t\Big)^2 dP_\theta = o(t^2),
\]
giving the claim (13.5.1).
Now we demonstrate the expansions of the f-divergences. Without loss of generality, we assume that f′(1) = 0, because we may always replace f with f̃(t) := f(t) − tf′(1) + f′(1), which satisfies f̃(1) = f̃′(1) = 0 and gives the same divergence. Once again let vt → v ∈ S^{d−1} as t → 0, and assume w.l.o.g. that ‖vt‖₂ ≤ 2 for all t. Define the densities qt(x) = p_{θ+tv_t}(x), so that by assumption
\[
\Big|\frac{q_t(x)}{q_0(x)} - 1 - t\dot{\ell}_\theta(x)^\top v_t\Big| \le 2t\,h(x).
\]
Define the gap
\[
g_t(x) := \frac{1}{t^2}\bigg(f\Big(\frac{q_t(x)}{q_0(x)}\Big) - \frac{f''(1)}{2}\Big(\frac{q_t(x)}{q_0(x)} - 1\Big)^2\bigg).
\]
Then
\[
|g_t(x)| \le \frac{C}{t^2}\Big(\frac{q_t(x)}{q_0(x)} - 1\Big)^2 \le \frac{2C}{t^2}\Big(\frac{q_t(x)}{q_0(x)} - 1 - t\dot{\ell}_\theta(x)^\top v_t\Big)^2 + 2C\,\big\|\dot{\ell}_\theta(x)\big\|_2^2\,\|v_t\|_2^2 \le 8C\,h(x)^2 + 2C\,\big\|\dot{\ell}_\theta(x)\big\|_2^2\,\|v_t\|_2^2 \le 8C\Big(h(x)^2 + \big\|\dot{\ell}_\theta(x)\big\|_2^2\Big),
\]
which is integrable with respect to Pθ. Because f(1 + s) = f(1) + f′(1)s + (f″(1)/2)s² + o(s²) and f(1) = f′(1) = 0 by our w.l.o.g. assumption,
\[
g_t(x) = \frac{1}{t^2}\bigg(f\Big(1 + \frac{q_t(x)}{q_0(x)} - 1\Big) - \frac{f''(1)}{2}\Big(\frac{q_t(x)}{q_0(x)} - 1\Big)^2\bigg) = o\bigg(\frac{1}{t^2}\Big(\frac{q_t(x)}{q_0(x)} - 1\Big)^2\bigg) = o\Big(t^{-2}\big(t\dot{\ell}_\theta(x)^\top v_t + o(t)\big)^2\Big) \to 0
\]
for all x. Dominated convergence then implies that lim_{t→0} ∫ g_t(x) q_0(x) dµ(x) = 0, and so
\[
\frac{1}{t^2}\Big(D_f(Q_t \| Q_0) - \frac{f''(1)}{2} D_{\chi^2}(Q_t \| Q_0)\Big) \to 0
\]
as t → 0. Because we took qt(x) = p_{θ+tv_t} along a path, this gives the result.
13.6 Bibliography
JCD Comment: We stole the mixture idea from David Pollard I believe. Might want
to cite that old Tsybakov paper for constrained risk inequalities.
Cite Donoho and Liu, Geometrizing Rates II, for the attainability and add some com-
mentary on it. Mention that there are extensions to privacy as well.
Cite Cai and Low for the motivation of super-efficiency converses, but also recognize that
they’ve been around for a long time and cite Hodges counterexample, and cite van der
Vaart’s lecture on super-efficiency.
Cite Birgé and Massart [33] for smooth functionals.
Outline
I. Motivation: function values, testing certain quantities (e.g. is ∥P − Q∥TV ≥ ϵ or not), entropy
and other quantities, and allows superefficiency guarantees in an elegant way
III. “Best possible” lower bounds, super-efficiency and constrained risk inequalities
368
Lexture Notes on Statistics and Information Theory John Duchi
Now, let us suppose that we define the collection {Pv} by tiltings of an underlying base distribution P0, where each tilting is indexed by a function gv : X → [−1, ∞) via dPv = (1 + gv)dP0, and where ∫ gv dP0 = 0, so that each Pv is a valid probability distribution. Let Pv^n be the distribution of n observations Xi iid∼ Pv, and let P̄^n = (1/|V|) Σ_{v∈V} Pv^n.
Lemma 13.7.1. Define the inner product ⟨f, g⟩_P = ∫ f(x)g(x)dP(x), and let V, V′ iid∼ Uniform(V). Then
\[
D_{\chi^2}\big(\overline{P}^n \,\|\, P_0^n\big) + 1 \le E\big[\exp\big(n\langle g_V, g_{V'}\rangle_{P_0}\big)\big].
\]
Proof The simple technical Lemma 13.2.3 essentially gives us the result. We observe that
\[
D_{\chi^2}\big(\overline{P}^n \,\|\, P_0^n\big) + 1 = \frac{1}{|\mathcal{V}|^2}\sum_{v,v'} \int \frac{dP_v^n\, dP_{v'}^n}{dP_0^n} = \frac{1}{|\mathcal{V}|^2}\sum_{v,v'} \bigg(\int (1 + g_v(x))(1 + g_{v'}(x))\,dP_0(x)\bigg)^n
\]
because dP_v^n(x_1, \ldots, x_n) = \prod_{i=1}^n (1 + g_v(x_i))\,dP_0(x_i), so that the integral decomposes into a product of integrals. Then expanding (1 + g_v)(1 + g_{v'}) and noting that each g_v has zero mean under P0 gives
\[
D_{\chi^2}\big(\overline{P}^n \,\|\, P_0^n\big) + 1 = \frac{1}{|\mathcal{V}|^2}\sum_{v,v'} \big(1 + E_0[g_v(X)g_{v'}(X)]\big)^n.
\]
Applying the bound 1 + x ≤ e^x and recognizing the average over v, v′ as an expectation over V, V′ iid∼ Uniform(V) gives the lemma.
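The identity in the proof can be verified by brute force on a small finite sample space. In the sketch below (an illustration, not from the notes; the base p.m.f. and tilting functions are arbitrary choices satisfying ∫ gv dP0 = 0), we enumerate X^n directly:

```python
import itertools
import numpy as np

# Brute-force check of D_chi2(bar P^n || P_0^n) + 1
#   = (1/|V|^2) sum_{v,v'} (1 + E_0[g_v g_{v'}])^n on a 3-point space.
p0 = np.array([0.5, 0.3, 0.2])                     # base distribution P_0
g = [np.array([0.2, -0.2, -0.2]),                  # tiltings with E_0[g_v] = 0
     np.array([-0.2, 0.5, -0.25])]
assert all(abs(gv @ p0) < 1e-12 for gv in g)
n = 3
# Left side: enumerate X^n and compute sum_x bar P^n(x)^2 / P_0^n(x).
lhs = 0.0
for xs in itertools.product(range(3), repeat=n):
    pbar = np.mean([np.prod([(1 + gv[x]) * p0[x] for x in xs]) for gv in g])
    lhs += pbar ** 2 / np.prod([p0[x] for x in xs])
# Right side: average of (1 + E_0[g_v g_{v'}])^n over pairs (v, v').
rhs = np.mean([[(1 + (gv * gw) @ p0) ** n for gw in g] for gv in g])
```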
Definition 13.2. Let k ∈ ℕ and let ϕj : X → [−b, b]. Then the functions ϕj are an admissible partition with variances σj² of X with respect to a probability distribution P0 if
(ii) each function has P0-mean zero, i.e., E_{P0}[ϕj(X)] = 0 for each j;
(iii) function j has variance σj² = E_{P0}[ϕj²(X)] = ∫ ϕj²(x)dP0(x).
With such a partition, we can define the functions gv(x) = t⟨v, ϕ(x)⟩ = t Σ_{j=1}^k vj ϕj(x) for |t| ≤ 1/b, and if we take V = {−1, 1}^k, we obtain the following.
369
Lexture Notes on Statistics and Information Theory John Duchi
Lemma 13.7.2. Let the functions {ϕj}_{j=1}^k be an admissible partition of X with variances σj². Fix |t| ≤ 1/b, let dP_{tv} = (1 + t⟨v, ϕ(x)⟩)dP0(x), and let P̄_t^n = (1/|V|) Σ_{v∈V} P_{tv}^n. Then
\[
D_{\chi^2}\big(\overline{P}_t^n \,\|\, P_0^n\big) \le \exp\bigg(\frac{n^2 t^4}{2}\sum_{j=1}^k \sigma_j^4\bigg) - 1,
\]
and, so long as (n²t⁴/2) Σ_{j=1}^k σj⁴ ≤ 1,
\[
D_{\chi^2}\big(\overline{P}_t^n \,\|\, P_0^n\big) \le n^2 t^4 \sum_{j=1}^k \sigma_j^4.
\]
Proof First, if ϕ(x) = [ϕj(x)]_{j=1}^k, then E0[ϕ(X)ϕ(X)ᵀ] = diag(σ1², …, σk²), that is, the diagonal matrix with σj² on its diagonal. By Lemma 13.7.1, we therefore have
\[
D_{\chi^2}\big(\overline{P}_t^n \,\|\, P_0^n\big) + 1 \le E\bigg[\exp\bigg(n t^2 \sum_{j=1}^k \sigma_j^2 V_j V_j'\bigg)\bigg] \le E\bigg[\exp\bigg(\frac{n^2 t^4}{2}\sum_{j=1}^k \sigma_j^4 V_j^2\bigg)\bigg]
\]
by Hoeffding's lemma (see Example 4.1.6), as Vj iid∼ Uniform({±1}). Noting that Vj² = 1 gives the first part of the lemma. The final statement is immediate once we observe that e^x ≤ 1 + (e − 1)x ≤ 1 + 2x for 0 ≤ x ≤ 1.
13.8 Exercises
Exercise 13.1: Recall that the Hellinger distance between distributions P and Q with densities p, q is d_hel(P, Q)² = ∫(√p(x) − √q(x))² dx. Let P be N(µ0, Σ) and Q be N(µ1, Σ). Show that
\[
\frac{1}{2}\, d_{\mathrm{hel}}^2(P, Q) = 1 - \exp\Big(-\frac{1}{8}(\mu_0 - \mu_1)^\top \Sigma^{-1} (\mu_0 - \mu_1)\Big).
\]
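A quick one-dimensional quadrature check of this identity (the values of µ0, µ1, and σ below are arbitrary illustrations, not part of the exercise):

```python
import numpy as np

# Check (1/2) d_hel^2(N(mu0, s^2), N(mu1, s^2)) = 1 - exp(-(mu0 - mu1)^2 / (8 s^2))
# by direct numerical integration of the squared-root-density difference.
mu0, mu1, s = 0.4, 1.7, 0.9   # arbitrary illustrative values
x = np.linspace(-30.0, 30.0, 400_001)
dx = x[1] - x[0]
p = np.exp(-(((x - mu0) / s) ** 2) / 2) / np.sqrt(2 * np.pi * s ** 2)
q = np.exp(-(((x - mu1) / s) ** 2) / 2) / np.sqrt(2 * np.pi * s ** 2)
dhel2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx
exact = 1 - np.exp(-((mu0 - mu1) ** 2) / (8 * s ** 2))
```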
That is, in a pointwise sense, the asymptotic distribution of the Hodges estimator is the same as that of the sample mean, except at 0, where it is simply the point mass at 0.
370
Lexture Notes on Statistics and Information Theory John Duchi
Exercise 13.4: Suppose that the test Ψ has test risk for testing between P0 and P1 satisfying Rn(Ψ | P0, P1) ≤ 1/3. Let k ∈ ℕ. Show how, given a sample of size kn, we can develop a test Ψ⋆ with Rkn(Ψ⋆ | P0, P1) ≤ e^{−ck}, where c > 0 is a numerical constant. Hint. Split the sample into k subsamples of size n, and then apply Ψ to each.
Exercise 13.5: Take the sampling model (13.3.15), where ϵ = ϵn = n^{−β} for some β ∈ (1/2, 1), and P1 = N(µn, 1) for µn = √(2r log n).
(a) Show that
\[
1 + D_{\chi^2}\big((1 - \epsilon)P_0 + \epsilon P_1 \,\|\, P_0\big) = 1 - \epsilon^2 + \epsilon^2 e^{\mu_n^2}.
\]
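Part (a) can be sanity-checked by quadrature; the values of ϵ and µ below are arbitrary illustrations (not tied to the n^{−β} scaling of the exercise):

```python
import numpy as np

# Check 1 + D_chi2((1-eps) P0 + eps P1 || P0) = 1 - eps^2 + eps^2 exp(mu^2)
# for P0 = N(0, 1), P1 = N(mu, 1), by numerical integration of mix^2 / p0.
eps, mu = 0.1, 1.2   # arbitrary illustrative values
x = np.linspace(-35.0, 35.0, 700_001)
dx = x[1] - x[0]
p0 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
p1 = np.exp(-((x - mu) ** 2) / 2) / np.sqrt(2 * np.pi)
mix = (1 - eps) * p0 + eps * p1
lhs = np.sum(mix ** 2 / p0) * dx              # = 1 + D_chi2(mixture || P0)
rhs = 1 - eps ** 2 + eps ** 2 * np.exp(mu ** 2)
```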
Exercise 13.6 (Higher criticism and sparse detection [68]): One version of Tukey's Higher Criticism statistic is to consider the normalized process
\[
Z_n(t) := \sqrt{n}\,\frac{\mathbb{P}_n(U \le t) - t}{\sqrt{t(1 - t)}} \quad \text{and} \quad T_n^{HC} := \sup_t\Big\{Z_n(t) \;\Big|\; \frac{1}{n} \le t \le 1 - \frac{1}{n}\Big\}.
\]
This higher criticism statistic converges in probability: T_n^{HC}/√(2 log log n) →^p 1 when Ui iid∼ Uniform([0, 1]) (see, e.g., [169, Theorem 16.1.2]). Consider the null and alternative setting of Example 13.3.14.
(a) Show that for any α ∈ (0, 1), there is a sequence an → 1 such that under the null H0,
\[
\limsup_n \mathbb{P}_{H_0}\Big(T_n^{HC} \ge a_n\sqrt{2\log\log n}\Big) \le \alpha.
\]
(b) Show that under the alternative H1 that Ui iid∼ (1 − ϵ)Uniform[0, 1] + ϵUniform[0, τ], where ϵ = n^{−β} and τ = n^{−r} for some β ∈ (1/2, 1) and r > 2β − 1, we have
\[
\mathbb{P}_{H_1}\Big(T_n^{HC} \ge a_n\sqrt{2\log\log n}\Big) \to 1.
\]
Thus, higher criticism has optimal power to detect anywhere on the interior of the region r > 2β − 1.
Exercise 13.7 (Non-sparse detection problems are easy): Let P0 and P1 be distributions on a
random variable X and consider observations Xi drawn i.i.d. from
\[
H_0 : X_i \stackrel{iid}{\sim} P_0 \qquad \text{or} \qquad H_1 : X_i \stackrel{iid}{\sim} (1 - \epsilon)P_0 + \epsilon P_1.
\]
We investigate the rate at which we may allow ϵ = ϵn → 0 while still testing accurately, taking
ϵn = n−β for some β ∈ [0, 1].
(b) Assume that as n → ∞, we allow P1 to vary so that ∥P0 − P1 ∥TV → 1, and β ≤ 21 . Show that
it is possible to construct a test Ψn such that Rn (Ψn | H0 , H1 ) → 0 as n → ∞.
(c) Let P0 = N(0, 1) and P1 = N(µ, 1) for µ = √(2r log n), where r > 0. Show that ‖P0 − P1‖_TV ≥ 1 − n^{−r}. What does this say about Example 13.3.10?
(d) Let P0 = Uniform([0, 1]) and P1 = Uniform([0, τ ]) for any τ < 1. Give ∥P0 − P1 ∥TV . What does
this say about Example 13.3.13?
Exercise 13.8 (A less brutal multiple testing scenario): Instead of the instantiation of Exam-
ple 13.3.13 in Example 13.3.14, consider nulls and alternatives
Exercise 13.9 (Multiple testing and the Benjamini-Hochberg procedure): Consider the Benjamini-
Hochberg step-up procedure (9.8.3) for rejecting hypotheses in Exercise 9.5. Here you provide a
few results that suggest a failure mode of this procedure. (Though of course it exhibits many
complementary optimality properties.)
(a) Let Ui iid∼ Uniform[0, 1]. For c ∈ [0, 1/n], define the region
\[
A_c := \big\{x \in \mathbb{R}^n \mid 0 \le x_1 \le \cdots \le x_n \le 1,\; x_i \ge c\,i \text{ for each } i\big\}.
\]
Letting U(1) ≤ U(2) ≤ ⋯ ≤ U(n) be the order statistics of U ∈ [0, 1]^n, show that
Hint. The density of the collection of order statistics is uniform on A0 and satisfies p(u_1^n) = n!.
(b) Let k̂ be the number of hypotheses the BH procedure (9.8.3) rejects on data Ui iid∼ Uniform[0, 1]. Show that P(k̂ > 0) = α.
(c) Suppose that U1, …, Uk iid∼ Uniform[0, τ], where k = n^{1−β} and τ = n^{−r}, mimicking the setting of Example 13.3.14, where 2β − 1 < r < β. Show that if we only compute order statistics for this set of variables,
\[
\mathbb{P}\Big(\text{any } U_{(i)} \ge \frac{\alpha i}{n}\Big) \to 0.
\]
Exercise 13.10 (A phase diagram): Consider the multiple hypothesis testing (sparse p-values)
scenario of Example 13.3.14 or its less brutal version in Exercise 13.8. Use the results of the
example and Exercise 9.5 to justify the phase diagram in Figure 13.1, which graphically depicts
the following:
i. If r > β, then it is possible to asymptotically perfectly recover all null and non-null signals; if
r < β, it is impossible.
ii. If r > 2β − 1, then it is possible to detect that there are non-null signals, but not to identify
them; if r < 2β − 1, then even detection is impossible.
[Figure 13.1: phase diagram in the (β, r) plane, β ∈ [0.5, 1.0], showing three regions: “Asymptotically perfect recovery” (r > β), “Detection possible” (2β − 1 < r < β), and “No recovery” (r < 2β − 1).]
Figure 13.1. Phase diagram for the stylized sparse hypothesis testing in Example 13.3.14. See
Exercise 13.10.
1. Work through lower bound for sparse testing. Also add a reference to the next problem
in the main text.
Exercise 13.11 (Detecting a sparse signal through the maximum):
(a) Show that if Zi iid∼ N(0, 1), then max_{i≤n} Zi/√(2 log n) →^p 1. Hint. Use the Mills ratio, Exercise 4.3.
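A Monte Carlo illustration of part (a) follows; the convergence is slow (the correction terms are of order log log n / √(log n)), so at moderate n the ratio sits somewhat below 1. The sample size and seed are arbitrary:

```python
import numpy as np

# Illustrate max_{i<=n} Z_i / sqrt(2 log n) -> 1 in probability for Z_i iid N(0, 1).
# At n = 200,000 the ratio is typically around 0.9 due to slow convergence.
rng = np.random.default_rng(0)
n = 200_000
z = rng.standard_normal(n)
ratio = z.max() / np.sqrt(2 * np.log(n))
```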
(b) Show that if Zi iid∼ N(0, 1), then for any α ∈ [0, 1], there exists a sequence an = an(α) → 1 for which P(max_{i≤n} Zi ≥ an√(2 log n)) → α.
Recall the testing problem of Example 13.3.10 to distinguish H0 : Yi iid∼ N(0, 1) from the alternative H1 : Yi iid∼ (1 − ϵn)N(0, 1) + ϵn N(µn, 1), where ϵn = n^{−β}, µn = √(2r log n), and β ∈ (1/2, 1).
(c) Show that if r > (1 − √(1 − β))², then for an = an(0) → 1 from part (b), the test
\[
\Psi_n := \begin{cases} 0 & \text{if } \max_{i\le n} Y_i \le a_n\sqrt{2\log n} \\ 1 & \text{if } \max_{i\le n} Y_i > a_n\sqrt{2\log n} \end{cases}
\]
satisfies Rn(Ψn | H0, H1) → 0.
Exercise 13.12 (Poissonization: lower bounds [191]): Prove the lower bound in Proposi-
tion 13.3.7, inequality (13.3.9), that is, that for numerical constants C, c,
Hint. Bound MPoi(2n) with a weighted sum of Mm . Use the MGF calculation that for X ∼ Poi(λ),
E[etX ] = exp(λ(et − 1)) to show that N ∼ Poi(2n) is concentrated above n.
Exercise 13.13 (Poissonization: upper bounds [191]): Assume the minimax result that
where the supremum is over probability distributions (priors π) on p ∈ ∆k , and the expectation
iid
is now over the random choice of p and the sample X1n ∼ p drawn conditional on p. (This is a
standard infinite-dimensional saddle point result generalizing von-Neumann’s minimax theorem;
cf. [87, 170].) You will show the upper bound in Proposition 13.3.7, Eq. (13.3.9).
Let {Tm } be an arbitrary sequence of estimators and define the sequence of averaged risks
Exercise 13.14: Consider the hypothesis testing problem of testing whether a collection of independent Bernoulli random variables X1, …, Xn is fair (H0, so that P(Xi = 1) = 1/2 for each i) or whether there are unfair subcollections. That is, we wish to test
\[
H_0 : X_i \stackrel{iid}{\sim} \mathrm{Bernoulli}\big(\tfrac{1}{2}\big) \qquad \text{versus} \qquad H_1 : X_i \stackrel{ind}{\sim} \mathrm{Bernoulli}\Big(\frac{1 + \theta_i}{2}\Big),\; \theta \in C,
\]
for a set C ⊂ [−1, 1]n . Show that if the set C is orthosymmetric, meaning that whenever θ ∈ C
then Sθ ∈ C for any diagonal matrix S of signs, i.e. diag(S) ∈ {±1}n , then no test can reliably
distinguish H0 from H1 (in a minimax sense). Hint. Let v ∈ V := {±1}n index coordinate signs
and define θv = Dv for some diagonal D, where Dv ∈ C. Let Pv be the product distribution with Xi ∼ Bernoulli((1 + D_{ii} v_i)/2). What is (1/2^n) Σ_{v∈V} Pv?
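The answer to the hint's question can be checked by enumeration for small n: averaging the product Bernoulli((1 + D_ii v_i)/2) laws over uniform signs factorizes coordinate-by-coordinate and recovers fair coins, so the uniform mixture over H1 coincides with H0. A sketch with arbitrary diagonal entries:

```python
import itertools
import numpy as np

# Enumerate the mixture (1/2^n) sum_v P_v for P_v a product of Bernoulli((1 + d_i v_i)/2).
# The mixture should equal the uniform (fair-coin) distribution on {0, 1}^n.
d = np.array([0.3, 0.6, 0.1])          # arbitrary diagonal entries of D
n = len(d)
mix = {x: 0.0 for x in itertools.product([0, 1], repeat=n)}
for v in itertools.product([-1, 1], repeat=n):
    for x in mix:
        pr = np.prod([(1 + d[i] * v[i]) / 2 if x[i] == 1 else (1 - d[i] * v[i]) / 2
                      for i in range(n)])
        mix[x] += pr / 2 ** n
fair = 1.0 / 2 ** n
```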
\[
p_{\mathrm{end}} := \frac{1}{n/4}\sum_{i=3n/4+1}^{n} p_i \;>\; p_{\mathrm{beg}} := \frac{1}{n/4}\sum_{i=1}^{n/4} p_i.
\]
Consider the following more quantitative version of this problem: we wish to test
\[
H_0 : X_i \stackrel{iid}{\sim} \mathrm{Bernoulli}\big(\tfrac{1}{2}\big) \qquad \text{versus} \qquad H_1 : X_i \stackrel{ind}{\sim} \mathrm{Bernoulli}(p_i),\; p_{\mathrm{end}} - p_{\mathrm{beg}} \ge \delta.
\]
(a) Use Le Cam’s two-point method to show that there exists a numerical constant c > 0 such that
for δ ≤ √cn , no test can reliably distinguish H0 from H1 .
to develop a test Ψ (use Proposition 13.3.2) that achieves test risk Rn(Ψ | H0, H1) ≤ 1/4 whenever δ ≥ C/√n, where C < ∞ is a constant.
(a) Give E[‖p̂‖₂²].
(b) Show that Tn := ‖p̂ − q̂‖₂² satisfies
\[
E[T_n] = \|p - q\|_2^2 + \frac{1}{n} + \frac{1}{m} - \frac{\|p\|_2^2}{n} - \frac{\|q\|_2^2}{m}.
\]
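For part (a), a short computation gives E[‖p̂‖₂²] = ‖p‖₂² + (1 − ‖p‖₂²)/n when p̂ is the empirical p.m.f. of n multinomial draws from p. A Monte Carlo sketch (the choices of p, n, and simulation count are illustrative):

```python
import numpy as np

# Check E[||phat||_2^2] = ||p||_2^2 + (1 - ||p||_2^2)/n by simulation,
# where phat is the empirical p.m.f. of n i.i.d. draws from p.
rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])
n, sims = 20, 200_000
counts = rng.multinomial(n, p, size=sims)          # shape (sims, 3)
phat_sq = ((counts / n) ** 2).sum(axis=1)
mc = phat_sq.mean()
exact = p @ p + (1 - p @ p) / n
```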
Exercise 13.18: Show that in the hypothesis testing problem (13.3.13), there is a numerical constant c > 0 such that δ ≤ c/√n implies that no test can reliably distinguish H0 from H1.
Exercise 13.19: Consider the linear regression model
Y = Xθ + ε, ε ∼ N(0, In )
where the design X ∈ Rn×d is fixed (and known), and Y ∈ Rn and θ ∈ Rd . Consider the two
hypotheses
H0 : θ = 0 versus H1 : ∥θ∥2 ≥ r.
Use the convex hull method for lower bounds in testing (Proposition 13.3.1) to show that if
\[
r \le \frac{d^{1/4}}{\sqrt{n}\,\|n^{-1/2} X\|_{\mathrm{op}}},
\]
then any test of H0 against H1 has minimax test risk at least 1/2. If X is an orthogonal design, so that (1/n)XᵀX = I_d, is this result tight?
JCD Comment: Put that additional question in here from the Scratch directory
JCD Comment:
3. Lower bound for testing whether collection of coins is fair or some number are unfair.
Part III
JCD Comment: TODO: include an intro describing how we’re going to get into
operational interpretations of these quantities, and that is fun.
Chapter 14
In prediction problems broadly construed, we have a random variable X and a label, or target or
response, Y , and we wish to fit a model or predictive function that accurately predicts the value
of Y given X. There are several perspectives possible when we consider such problems, each with
attendant advantages and challenges. We can roughly divide these into three approaches, though
there is considerable overlap between the tools, techniques, and goals of the three:
(1) Point prediction, where we wish to find a prediction function f so that f (X) most accurately
predicts Y itself.
(2) Probabilistic prediction, where we output a predicted distribution P of Y, and we seek P(Y = y | X = x) ≈ ℙ(Y = y | X = x), where here ℙ denotes the “true” probability and P the predicted one. A relaxed version of this is calibration, the subject of the next chapter, where we ask that ℙ(Y = y | P) ≈ P(Y = y), that is, that the distribution of Y given a predicted distribution P is accurate.
(3) Predictive inference, where for a given level α ∈ (0, 1), we seek a confidence set mapping C
such that P(Y ∈ C(X)) ≈ 1 − α.
We focus mostly on the former two, though there is overlap between the approaches.
In this first chapter of the sequence, we focus on the probabilistic prediction problem. Our main
goal will be to elucidate and identify loss functions for choosing probabilistic predictions that are
proper, meaning that the true distribution of Y minimizes the loss, and strictly proper, meaning that
the true distribution of Y uniquely minimizes the loss. As part of this, we will develop mappings
between losses and entropy-type functionals; these will repose on convex analytic techniques for their
cleanest statements, highlighting the links between convex analysis, prediction, and information.
Moreover, we highlight how any proper loss (which will be defined) is in correspondence with a
particular measure of entropy on the distribution P , and how these connect with an object known
as the Bregman divergence central to convex optimization. For the deepest understanding of this
chapter, it will therefore be useful to review the basic concepts of convexity (e.g., convex sets,
functions, and subgradients) in Appendix B, as well as the more subtle tools on optimality and
stability of solutions to convex optimization problems in Appendix C. We give an overview of the
important results in Section 14.1.1.
In much of the literature on prediction, one instead considers proper scoring rules, which are simply
negative proper losses, that is, functions S : P × Y → ℝ satisfying S(P, y) = −ℓ(P, y) for a (strictly)
proper loss. We focus on losses for consistency with the convex analytic tools we develop. In
addition, frequently we will work with discrete distributions, so that Y has a probability mass
function (p.m.f.), in which case we will use p ∈ ∆k := {p ∈ Rk+ | ⟨1, p⟩ = 1} to identify the
distribution and ℓ(p, y) instead of ℓ(P, y).
Perhaps the two most famous proper losses are the log loss and the squared loss (often termed
Brier scoring). For simplicity let us assume that Y ∈ {1, 2, . . . , k}, and let ∆k = {p ∈ Rk+ | 1T p = 1}
be the probability simplex; we then identify distributions P on Y with vectors p ∈ ∆k , and abuse
notation to write ℓ(p, y) accordingly and when it is unambiguous. The squared loss is then
\[
\ell_{\mathrm{sq}}(p, y) = (p_y - 1)^2 + \sum_{i \ne y} p_i^2 = \|p - e_y\|_2^2,
\]
where ey is the yth standard basis vector, while the log loss (really, the negative logarithm) is
\[
\ell_{\log}(p, y) = -\log p_y.
\]
Both of these are strictly proper. To see this propriety, let Y have p.m.f. p ∈ ∆k, so that P(Y = y) = py.
Then for the squared loss and any q ∈ ∆k, we have
\[
E[\ell_{\mathrm{sq}}(q, Y)] - E[\ell_{\mathrm{sq}}(p, Y)] = E\big[\|q - e_Y\|_2^2\big] - E\big[\|p - e_Y\|_2^2\big] = \|q\|_2^2 - 2\langle q, p\rangle + \|p\|_2^2 = \|q - p\|_2^2.
\]
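This identity, and hence the strict propriety of the Brier score, can be checked numerically (the p.m.f.s below are arbitrary random draws):

```python
import numpy as np

# Check E[l_sq(q, Y)] - E[l_sq(p, Y)] = ||q - p||_2^2 when Y has p.m.f. p,
# so the squared (Brier) loss is strictly proper: q = p uniquely minimizes the risk.
rng = np.random.default_rng(2)
k = 5
p = rng.dirichlet(np.ones(k))          # true p.m.f. (arbitrary draw)
q = rng.dirichlet(np.ones(k))          # a competing prediction
ey = np.eye(k)
risk = lambda r: sum(p[y] * np.sum((r - ey[y]) ** 2) for y in range(k))
gap = risk(q) - risk(p)
gap_exact = np.sum((q - p) ** 2)
```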
case of the log loss—is no accident. In fact, for proper losses, we will show that this divergence
representation necessarily holds.
The key underlying our development is a particular construction, which we present in Sec-
tion 14.1.2, that transforms a loss into a generalized notion of entropy. Because it is so central, we
highlight it here, though before doing so, we take a brief detour through a few of the concepts in
convexity we require. Figures representing these results capture most of the mathematical content,
while Chapters B and C in the appendices contain proofs of the results we require.
where for x ̸∈ dom f we define f (x) = +∞. We exclusively work with proper convex functions, so
that f (x) > −∞ for each x. Typically, we work with closed convex f , meaning that the epigraph
epi f = {(x, t) ∈ Rd × R | f (x) ≤ t} ⊂ Rd+1 is a closed set; equivalently, f is lower semi-continuous,
so that lim inf y→x f (y) ≥ f (x). A concave function f is one for which −f is convex.
Three main concepts form the basis for our development. The first is the subgradient (see
Appendix B.3). For a function f : Rd → R, the subgradient set (also called the subdifferential) at
the point x is
\[
\partial f(x) := \big\{s \in \mathbb{R}^d \mid f(y) \ge f(x) + \langle s, y - x\rangle \text{ for all } y \in \mathbb{R}^d\big\}. \qquad (14.1.1)
\]
If f is a convex function, then at any point x in the relative interior of its domain, ∂f (x) is non-
empty (Theorem B.3.3). Moreover, a quick calculation shows that x minimizes f (x) if and only if
0 ∈ ∂f (x), and (a more challenging calculation) that if ∂f (x) = {s} is a singleton, then s = ∇f (x)
is the usual gradient. See the left plot of Figure 14.1. We shall in some cases allow subgradients to take values in the extended reals ℝ̄ and ℝ̄ᵏ, which will necessitate some additional care.
The second concept is that the supremum of a collection of convex functions is always convex, that is, if fα is convex for each index α ∈ A, then
\[
f(x) := \sup_{\alpha \in A} f_\alpha(x)
\]
is convex, and f is closed if each fα is closed. The closure of f is immediate because epi f = ∩_α epi fα, and convexity follows because
\[
f(\lambda x + (1 - \lambda)y) \le \sup_{\alpha\in A}\{\lambda f_\alpha(x) + (1 - \lambda)f_\alpha(y)\} \le \lambda\sup_{\alpha\in A} f_\alpha(x) + (1 - \lambda)\sup_{\alpha\in A} f_\alpha(y).
\]
Figure 14.1. Left: the quadratic f(x) = ½x² and the linear approximation f̂(x) = f(x₀) + s(x − x₀), where x₀ = ½ and s = f′(x₀). Right: the piecewise quadratic f(x) = max{f₀(x), f₁(x)}, where f₀(x) = ½x² and f₁(x) = ¼(x + ¼)² + ⅛, intersecting at x₀ = (1 − √10)/4. (a) The function f(x). (b) The linear underestimator f̂(x) = f(x₀) + s₀(x − x₀) for s₀ = f₀′(x₀). (c) The linear underestimator f̂(x) = f(x₀) + s₁(x − x₀) for s₁ = f₁′(x₀). (d) The linear approximation f̂(x) = f(x₁) + f′(x₁)(x − x₁) around the point x₁ = ¼.
Lastly, we revisit a special duality relationship that all closed convex functions f enjoy (see
Appendix C.2 for a fuller treatment). The Fenchel-Legendre conjugate or convex conjugate of a
function f is
\[
f^*(s) := \sup_x \{\langle s, x\rangle - f(x)\}. \qquad (14.1.3)
\]
The function f ∗ is always convex, as it is the supremum of linear functions of s, and for any x⋆ (s)
maximizing ⟨s, x⟩ − f (x), we have that x⋆ (s) ∈ ∂s f ∗ (s) by the relationship (14.1.2); by a bit more
work, we see that if s ∈ ∂f (x), then 0 ∈ ∂x {f (x) − ⟨s, x⟩} and so x maximizes ⟨s, x⟩ − f (x). See
Figure 14.2 for a graphical representation of this process. Flipping this argument by replacing
f with f ∗ and x with s, when s ∈ ∂f (x) and x maximizes ⟨s, x⟩ − f (x) in x, then x ∈ ∂f ∗ (s)
and so s maximizes ⟨s, x⟩ − f ∗ (s) in s. From this development comes the biconjugate, that is,
f ∗∗ (x) = sups {⟨s, x⟩ − f ∗ (s)}, or f ∗∗ = (f ∗ )∗ . The biconjugate f ∗∗ is equal to the supremum of all
linear functionals below f :
Lemma 14.1.1. Let f be a closed convex function. Then f ∗∗ (x) = f (x).
Proof We prove the equivalent statement that if G ⊂ ℝ^d × ℝ denotes all the pairs (s, b) such that the affine function x ↦ ⟨s, x⟩ − b minorizes f, that is, f(x) ≥ ⟨s, x⟩ − b for all x, then
\[
f^{**}(x) = \sup_{(s,b)\in G}\{\langle s, x\rangle - b\}.
\]
Theorem B.3.7 in Appendix B.3.2 guarantees that sup_{(s,b)∈G}{⟨s, x⟩ − b} = f(x). To see the displayed equality, note that (s, b) ∈ G if and only if
\[
f(x) \ge \langle s, x\rangle - b \;\text{ for all } x \quad\Longleftrightarrow\quad b \ge \langle s, x\rangle - f(x) \;\text{ for all } x \quad\Longleftrightarrow\quad b \ge f^*(s).
\]
In particular, we obtain the equalities
\[
\sup_{(s,b)\in G}\{\langle s, x\rangle - b\} = \sup_{s,b}\big\{\langle s, x\rangle - b \mid s \in \mathrm{dom}\, f^*,\; f^*(s) \le b\big\} = \sup_s\{\langle s, x\rangle - f^*(s)\}
\]
as desired.
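A small numerical illustration of conjugacy (not part of the notes): for f(x) = ½x², a discrete Legendre transform on a grid recovers f∗(s) = ½s², and transforming again recovers f = f∗∗ at grid points.

```python
import numpy as np

# Discrete Legendre (Fenchel) transform on a grid: f*(s) = sup_x {s x - f(x)}.
# For f(x) = x^2 / 2 the conjugate is f*(s) = s^2 / 2, and f** = f.
x = np.linspace(-10.0, 10.0, 20_001)        # grid step 1e-3
f = 0.5 * x ** 2
s = np.linspace(-3.0, 3.0, 13)              # slopes -3.0, -2.5, ..., 3.0
fstar = np.array([np.max(si * x - f) for si in s])
pts = np.array([-1.0, 0.5, 2.0])            # points at which to check the biconjugate
fbb = np.array([np.max(xi * s - fstar) for xi in pts])
```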
Figure 14.2. The conjugate function. The line of long dashes is x ↦ sx, while the dotted line is x ↦ sx − f∗(s). The blue line is the largest gap between sx and f(x), which equals f∗(s). Note that x ↦ sx − f∗(s) meets the graph of f(x) at exactly the point of maximum difference sx − f(x), where f′(x) = s.
where we have paralleled the typical notation H(Y ) for the Shannon entropy. In many cases, it
will be more convenient to write this entropy directly as a function of the distribution P of Y , in
which case we write
\[
H_\ell(P) = \inf_{Q\in\mathcal{P}} E_P[\ell(Q, Y)], \qquad (14.1.6)
\]
where Y follows the distribution P; we will use whichever is more convenient. As the notation (14.1.6) makes clear, Hℓ(P) is the infimum of a collection of linear functions of the form P ↦ E_P[ℓ(Q, Y)], one for each Q ∈ P, so that necessarily Hℓ(P) is concave in P. The remainder
of this chapter, and several parts of the coming chapters, highlights the ways that this particular
quantity informs the properties of the loss ℓ, and more generally, how we may always view any
concave function H on a family of distributions P as a generalized entropy function.
In Section 14.2, we show how such entropy-type functionals map back to losses themselves,
so for now we content ourselves with a few examples to see why we call these entropies. Let us
temporarily assume that Y has finite support {1, . . . , k} with P = ∆k = {p ∈ Rk+ | ⟨1, p⟩ = 1} the
collection of probability mass functions on elements {1, . . . , k}.
Example 14.1.2 (Log loss): Consider the log loss ℓlog(p, y) = −log py. Then
\[
H_{\ell_{\log}}(p) = \inf_{q\in\Delta_k} E_p[-\log q_Y] = \inf_{q\in\Delta_k}\bigg\{-\sum_{y=1}^k p_y \log\frac{q_y}{p_y}\bigg\} - \sum_{y=1}^k p_y \log p_y = -\sum_{y=1}^k p_y \log p_y,
\]
the Shannon entropy, because the infimand in the middle expression is the KL divergence Dkl(p‖q) ≥ 0, minimized at q = p.
This highlights an operational interpretation of entropy distinct from that arising in coding: the
(Shannon) entropy is the minimal expected loss of a player in a prediction game, where the player
chooses a distribution Q on Y , nature draws Y ∼ P , and upon observing Y = y, the player suffers
loss − log Q(Y = y).
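This prediction-game interpretation is easy to verify numerically: the expected log loss of a prediction q equals H(p) + Dkl(p‖q), so it is minimized exactly at q = p. A sketch with arbitrary random distributions:

```python
import numpy as np

# The Shannon entropy H(p) = -sum_y p_y log p_y is the minimal expected log loss,
# attained by predicting q = p; any other q incurs H(p) + KL(p || q) > H(p).
rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(4))
H = -np.sum(p * np.log(p))
cross = lambda q: -np.sum(p * np.log(q))     # expected log loss of prediction q
competitors = [cross(rng.dirichlet(np.ones(4))) for _ in range(1000)]
```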
Example 14.1.3 (0-1 error): If instead we take the 0-1 loss, that is, ℓ0-1(p, y) = 1 if py ≤ pj for some j ≠ y and ℓ0-1(p, y) = 0 otherwise, then
\[
H_{\ell_{0\text{-}1}}(p) = \inf_{q\in\Delta_k} E_p[\ell_{0\text{-}1}(q, Y)] = 1 - \max_y p_y.
\]
So Hℓ0-1(ey) = 0 for any standard basis vector, that is, any distribution with all mass on a single point y, and Hℓ0-1(p) > 0 otherwise. Moreover, the vector p = 1/k maximizes Hℓ0-1(p), with Hℓ0-1(1/k) = (k − 1)/k. ◇
Example 14.1.4 (Brier scoring and squared error): For the squared error (Brier scoring) loss ℓsq(p, y) = ∥p − ey∥₂², where ey ∈ {0, 1}k is the yth standard basis vector, let Y have p.m.f. p ∈ ∆k. Then

Hℓsq(Y) = E[ℓsq(p, Y)] = ∥p∥₂² − 2∥p∥₂² + 1 = 1 − ∥p∥₂².

So as above, we have Hℓsq(Y) ≥ 0, with Hℓsq(Y) = 0 if and only if Y is a point mass on one of {1, . . . , k}, and the uniform distribution with p.m.f. p = (1/k)1 maximizes the entropy, with Hℓsq(Uniform([k])) = 1 − 1/k. ◁
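A numerical sanity check of the Brier entropy (Python; not from the notes, with an arbitrary random p): the expected Brier score exceeds 1 − ∥p∥₂² for every prediction q, with equality at q = p, since Ep∥q − ey∥² − (1 − ∥p∥²) = ∥q − p∥².

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
p = rng.dirichlet(np.ones(k))            # p.m.f. of Y

def expected_brier(q):
    # E_p ||q - e_Y||_2^2 = ||q||^2 - 2<q, p> + 1
    return q @ q - 2 * (q @ p) + 1.0

H = 1 - p @ p                            # H_{l_sq}(p) = 1 - ||p||_2^2
assert np.isclose(expected_brier(p), H)  # infimum attained at q = p
for _ in range(100):
    q = rng.dirichlet(np.ones(k))
    assert expected_brier(q) >= H - 1e-12
u = np.full(k, 1 / k)                    # uniform p.m.f. maximizes the entropy
assert np.isclose(1 - u @ u, 1 - 1 / k)
```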
These examples highlight how these entropy functions are types of uncertainty measures, giving
rise to “maximally uncertain” distributions p, which are typically uniform on Y .
Lecture Notes on Statistics and Information Theory John Duchi
In complete analogy with our development in Chapter 2, then, we can define the information
between variables X and Y relative to a particular loss function ℓ. Thus, we define the ℓ-conditional
entropy
Hℓ (Y | X = x) := inf E [ℓ(Q, Y ) | X = x]
Q∈P
and, in analogy to the definitions in Section 2.1.1, the conditional entropy of Y given X is

Hℓ(Y | X) := E[ inf_{Q∈P} E[ℓ(Q, Y) | X] ] = ∫_X Hℓ(Y | X = x) dP(x),

and the associated ℓ-information Iℓ(X; Y) := Hℓ(Y) − Hℓ(Y | X).
Example 14.1.5 (Shannon information): Taking the log loss ℓ(p, y) = −log py, we have Hℓlog(Y) = H(Y) and Hℓlog(Y | X) = H(Y | X), so that Iℓlog(X; Y) = H(Y) − H(Y | X) = I(X; Y), the usual Shannon mutual information. ◁
Example 14.1.6 (0-1 error): Consider the 0-1 error ℓ0-1(p, y) = 1 if py ≤ max_{j̸=y} pj and ℓ0-1(p, y) = 0 if py > max_{j̸=y} pj. Then letting y⋆ = argmax_y P(Y = y) and y⋆(x) = argmax_y P(Y = y | X = x), we have

Iℓ0-1(X; Y) = P(Y ̸= y⋆) − E[P(Y ̸= y⋆(X) | X)],

the gap between the prior probability of making a mistake when guessing Y and the posterior probability given X. ◁
Example 14.1.7 (Squared error): For the Brier score with squared error ℓsq(p, y) = ∥p − ey∥₂², we have Hℓsq(p) = 1 − ∥p∥₂², and so

Iℓsq(X; Y) = E[ Σ_{j=1}^k (P(Y = j | X) − P(Y = j))² ] = Σ_{j=1}^k Var(P(Y = j | X)),

the summed variances of the random variables P(Y = j | X). The higher the variance of these quantities, the more information X carries about Y. ◁
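This variance identity is easy to check numerically (Python; not from the notes, and the joint p.m.f. below is an arbitrary choice): computing Hℓsq(Y) − Hℓsq(Y | X) directly agrees with the sum of variances.

```python
import numpy as np

# Arbitrary joint p.m.f. of (X, Y) on {0,1} x {1,2,3}
P = np.array([[0.10, 0.25, 0.05],
              [0.20, 0.10, 0.30]])
px = P.sum(axis=1)                 # marginal of X
pyx = P / px[:, None]              # row x holds P(Y = . | X = x)
py = P.sum(axis=0)                 # marginal of Y

H = lambda p: 1 - p @ p            # Brier entropy H_{l_sq}
# I(X;Y) relative to the Brier loss: H(Y) - E_X[ H(Y | X) ]
I = H(py) - px @ np.array([H(r) for r in pyx])
# identity: sum over classes j of Var(P(Y = j | X))
var_sum = sum(px @ (pyx[:, j] - py[j]) ** 2 for j in range(3))
assert np.isclose(I, var_sum)
```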
We do allow losses to attain infinite values; for example, we can allow ℓ(Q, y) = +∞ if Q assigns probability 0 to an outcome y, as in the case of the logarithmic loss. The following theorem then
provides the promised representation of proper losses, and additionally, highlights the centrality of
the generalized entropy functionals.
Theorem 14.2.1 (Proper scoring rules: the finite case). Let Y = {1, . . . , k} be finite and P ⊂ ∆k
a convex collection of distributions on Y. Then the following are true.
That is, for each q ∈ P the vector [−ℓ(q, y)]_{y=1}^k ∈ ∂Ω(q), so Ω is subdifferentiable. Choosing the vector ∇Ω(p) = [−ℓ(p, y)]_{y=1}^k, we have

ℓ(p, y) = −Ω(p) + ℓ(p, y) + Ω(p) = −Ω(p) − Σ_{i=1}^k pi ℓ(p, i) + ℓ(p, y) = −Ω(p) − ⟨∇Ω(p), ey − p⟩

as desired. Note that ℓ(p, y) < ∞ except when py = 0, in which case our definition ∇Ω(p) = [−ℓ(p, y)]_{y=1}^k remains sensible as −⟨∇Ω(p), ey − p⟩ = +∞.
As an alternative argument more directly using convexity, the definition Ω(p) = sup_q {−Ep[ℓ(q, Y)] | q ∈ P} and the immediate calculation (14.1.2) of the subdifferential of the supremum shows that

∂Ω(p) ⊃ { [−ℓ(q, y)]_{y=1}^k | q ∈ ∆k satisfies −Ep[ℓ(q, Y)] = Ω(p) }.
But propriety guarantees that the set of such q includes p, so that ∂Ω(p) ⊃ [−ℓ(p, y)]ky=1 .
For the strict inequalities and strict propriety, trace the argument replacing inequalities with
strict inequalities for q ̸= p and use Corollary B.3.2 or C.1.9.
The negative generalized entropy Ω in Theorem 14.2.1 is essentially unique and marks an impor-
tant duality between proper losses and convex functions: to each loss, we can assign a generalized
entropy, and from this generalized entropy, we can reconstruct the loss. Exercise 14.2 explores this
connection. We can also give a few examples that show how to recover standard losses. For each,
we begin with a convex function Ω, then exhibit the associated proper or strictly proper scoring
rule. One thing to notice in this representation is that, typically, we do not expect to achieve a
loss function convex in p, which is a weakness of the representation (14.2.1). In Section 14.3 (and, in more depth, Chapter 16, especially Section 16.5), however, we will show how to convert suitable proper losses into surrogates that are convex in their arguments and which, after a particular
transformation based on convex duality, are proper and yield the correct distributional predictions.
We defer this, however, and instead provide a few examples.
Example 14.2.2 (Logarithmic losses): Consider the negative entropy Ω(p) = Σ_{y=1}^k py log py. We have (∂/∂py)Ω(p) = 1 + log py ∈ [−∞, 1], and

ℓlog(p, y) = −Σ_{j=1}^k pj log pj + Σ_{j=1}^k pj(1 + log pj) − (1 + log py) = −log py,

yielding the log loss. Note that for this case, we do require that the gradients ∇Ω(p) take values in the (downward) extended reals [−∞, ∞)^k. ◁
Example 14.2.3 (Brier scores and squared error): When we have the squared error ℓsq (p, y) =
∥p − ey ∥22 , we can directly check that Ω(p) = ∥p∥22 gives the loss. Indeed,
More esoteric examples exist in the literature, such as the spherical score arising from Ω(p) =
∥p∥2 (note the lack of a square).
Example 14.2.4 (Spherical scores): Let Ω(p) = ∥p∥₂, which is strictly convex on ∆k. Then ∇Ω(p) = p/∥p∥₂ and

ℓ(p, y) = −∥p∥₂ − (1/∥p∥₂)⟨p, ey − p⟩ = −py/∥p∥₂,

which is strictly proper but does not retain convexity. ◁
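A small numerical check of the spherical score's propriety (Python; not from the notes, arbitrary random p): by Cauchy-Schwarz, Ep[−qY/∥q∥₂] = −⟨p, q⟩/∥q∥₂ ≥ −∥p∥₂, with equality at q = p.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(6))

def expected_spherical(q):
    # E_p[l(q, Y)] with l(q, y) = -q_y / ||q||_2
    return -(p @ q) / np.linalg.norm(q)

best = expected_spherical(p)                 # = -||p||_2
assert np.isclose(best, -np.linalg.norm(p))
for _ in range(200):                         # Cauchy-Schwarz: no q does better
    q = rng.dirichlet(np.ones(6))
    assert expected_spherical(q) >= best - 1e-12
```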
Bregman divergences
A key aspect of the Savage representation (14.2.1) is that associated to any proper loss is a Bregman divergence, which measures the difference between a convex function and its first-order approximation. Recall from Chapter 3 that for a function Ω : Rk → R, we define

DΩ(u, v) := Ω(u) − Ω(v) − ⟨∇Ω(v), u − v⟩.

In typical definitions of the divergence, one requires that Ω be differentiable; here, we allow non-differentiable Ω so long as the choice ∇Ω(v) ∈ ∂Ω(v) is given. In particular, we see that

DΩ(u, v) ≥ 0
and the negative entropy Ω(u) = Σ_{j=1}^k uj log uj, which implicitly encodes the constraint that u ≻ 0. This gives

DΩ(u, v) = Σ_{j=1}^k uj log uj − Σ_{j=1}^k vj log vj − Σ_{j=1}^k (1 + log vj)(uj − vj) = Σ_{j=1}^k uj log(uj/vj) + 1ᵀ(v − u),

the generalized Kullback-Leibler divergence, which reduces to Dkl(u||v) when u, v ∈ ∆k.
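A quick numerical check of this Bregman divergence (Python; not from the notes, random positive vectors): the generalized KL divergence is nonnegative for any u, v ≻ 0, vanishes at u = v, and reduces to the KL divergence on the simplex.

```python
import numpy as np

def bregman_negent(u, v):
    # D_Omega(u, v) for Omega(u) = sum_j u_j log u_j, defined for u, v > 0
    return np.sum(u * np.log(u / v)) + np.sum(v - u)

rng = np.random.default_rng(3)
u = rng.uniform(0.1, 2.0, size=5)
v = rng.uniform(0.1, 2.0, size=5)
assert bregman_negent(u, v) >= 0               # Bregman divergences are >= 0
assert np.isclose(bregman_negent(u, u), 0.0)

# On the simplex the linear term vanishes and D_Omega is the KL divergence.
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))
assert np.isclose(bregman_negent(p, q), np.sum(p * np.log(p / q)))
```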
This is a strictly proper scoring rule: let G be any cumulative distribution function, meaning
that limt→−∞ G(t) = 0 and limt→∞ G(t) = 1, and let Y have CDF F . Then
E[ℓcrps(G, Y)] − E[ℓcrps(F, Y)] = ∫ ( G(t)² − F(t)² − 2(G(t) − F(t))E[1{Y ≤ t}] ) dt = ∫ (G(t) − F(t))² dt
because E[1{Y ≤ t}] = F(t). This is a variant of the (squared) Cramér-von Mises distance between F and G, which is positive unless F = G. Unfortunately, computing the CRPS loss (14.2.3) is often challenging except for specially structured F. ◁
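The excess-risk identity for the CRPS can be verified on a grid (Python; not from the notes, and the logistic CDFs below are arbitrary stand-ins for F and G): discretizing the integrals over t, the difference in expected scores equals ∫(G − F)² dt.

```python
import numpy as np

# Grids standing in for the integrals over t; F is the CDF of Y, G a forecast.
t = np.linspace(-10, 10, 4001)
dt = t[1] - t[0]
F = 1 / (1 + np.exp(-t))                   # logistic CDF (law of Y)
G = 1 / (1 + np.exp(-(t - 1) / 2))         # a mis-specified forecast CDF

# E[l_crps(G,Y)] - E[l_crps(F,Y)] = int (G - F)^2 dt, using E[1{Y <= t}] = F(t)
lhs = np.sum(G**2 - F**2 - 2 * (G - F) * F) * dt
rhs = np.sum((G - F) ** 2) * dt
assert np.isclose(lhs, rhs) and rhs > 0    # forecaster is penalized unless G = F
```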
Because the computation of the continuous ranked probability score is challenging, it can be
advantageous to consider other losses on probability distributions, which can allow more flexibility
in modeling. To that end, we define the quantile loss: for a probability distribution P on Y , let
Quantα (Y ) = Quantα (P ) := inf {t | P (Y ≤ t) ≥ α}
to be the α-quantile of the distribution P . (When Y has cumulative distribution F , this is the
inverse CDF mapping F −1 (α) = inf{t | F (t) ≥ α}.) Defining the quantile penalty
ρα (t) = α [t]+ + (1 − α) [−t]+ ,
for a collection A of values in [0, 1], the quantile loss is
ℓquant,A(P, y) := Σ_{α∈A} ρα(y − Quantα(P)).    (14.2.4)
The propriety of the quantile loss is relatively straightforward; it is, however, not strictly proper.
Example 14.2.7 (Quantile loss): To see that the quantile loss (14.2.4) is proper, consider the
single quantile penalty ρα : let g(t) = E[ρα (Y − t)] = αE[[Y − t]+ ] + (1 − α)E[[t − Y ]+ ], which
we claim is minimized by Quantα (Y ). Indeed, g is convex, and it has left and right derivatives
∂−g(t) := lim_{s↑t} (g(s) − g(t))/(s − t) = −αP(Y ≥ t) + (1 − α)P(Y < t) = P(Y < t) − α and

∂+g(t) := lim_{s↓t} (g(s) − g(t))/(s − t) = −αP(Y > t) + (1 − α)P(Y ≤ t) = P(Y ≤ t) − α.
Then for t = Quantα(Y), we have ∂−g(t) = P(Y < t) − α ≤ 0 and ∂+g(t) = P(Y ≤ t) − α ≥ 0, because t ↦ P(Y ≤ t) is right continuous. So convexity yields

E[ρα(Y − Quantα(Y))] ≤ E[ρα(Y − t)]

for all t. Applying this argument for each α ∈ A, we thus have

E[ℓquant,A(Q, Y)] ≥ E[ℓquant,A(P, Y)]

for any Q whenever Y ∼ P, and equality holds whenever Q and P have identical α-quantiles for each α ∈ A. ◁
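The pinball-loss argument above can be seen empirically (Python; not from the notes, with an arbitrary exponential sample and α = 0.8): a grid search over t for the empirical E[ρα(Y − t)] finds its minimizer at the α-quantile.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.exponential(size=200_000)          # sample Y ~ P (Exp(1), arbitrary)
alpha = 0.8

def pinball(t):
    # empirical E[rho_alpha(Y - t)] with rho_alpha(r) = a[r]_+ + (1-a)[-r]_+
    r = y - t
    return np.mean(alpha * np.maximum(r, 0) + (1 - alpha) * np.maximum(-r, 0))

q = np.quantile(y, alpha)                  # empirical Quant_alpha(P)
grid = np.linspace(q - 1, q + 1, 401)
best = grid[np.argmin([pinball(t) for t in grid])]
assert abs(best - q) < 0.05                # minimizer sits at the alpha-quantile
```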
The general case of Theorem 14.2.1 allows us to address such scenarios, though it does require
measure theory to properly define. Happily, the generality does not require a particularly more
sophisticated proof. For a (convex) function Ω : P → R on a family of distributions P on a set Y,
we say Ω′(P, ·) : Y → R is a subderivative of Ω at P ∈ P whenever

Ω(Q) ≥ Ω(P) + ∫_Y Ω′(P, y)(dQ(y) − dP(y)) = Ω(P) + EQ[Ω′(P, Y)] − EP[Ω′(P, Y)] for all Q ∈ P.    (14.2.5)
When Y is discrete and we can identify P with the simplex ∆k , the inequality (14.2.5) is simply
the typical subgradient inequality (14.1.1) that Ω(q) ≥ Ω(p) + ⟨∇Ω(p), q − p⟩ for p, q ∈ ∆k , where
∇Ω(p) ∈ ∂Ω(p). We then have the following generalization of Theorem 14.2.1.
Theorem 14.2.8. Let P be a convex collection of distributions on Y. Then the following are true.
The loss is strictly proper if and only if Ω is strictly convex.
which is the supremum of linear functionals of P and hence convex. If we let Ω′(P, y) = −ℓ(P, y) ∈ R for P ∈ P, then

Ω(P) ≥ −EP[ℓ(Q, Y)] = Ω(Q) + EQ[ℓ(Q, Y)] − EP[ℓ(Q, Y)] = Ω(Q) + ∫ Ω′(Q, y)(dP(y) − dQ(y))
Corollary 14.2.9. Let P be a convex collection of probability distributions on Y. Then the loss
ℓ : P × Y → R is proper if and only if there exists a convex function Ω : P → R with subderivatives
Ω′ (P, ·) : Y → R such that
The loss ℓ is strictly proper if and only if Ω is strictly convex. Similarly, ℓ is (strictly) proper if
and only if there exists a (strictly) convex and subdifferentiable Ω : P → R for which
The subdifferentials and differentiability in this potentially infinite dimensional case can make
writing the particular representation (14.2.6) challenging; for example, the representation of the
quantile loss in Example 14.2.7 is quite complex. In the case of predictions involving the cumulative
distribution function F , however, one can obtain the subderivative by taking directional (Gateaux)
derivatives in directions G−F for cumulative distributions G. In this case, for the point cumulative
distribution Gy with Gy (t) = 1 {y ≤ t}, we define
The continuous ranked probability score (Example 14.2.6) admits this expansion.
Example 14.2.10 (CRPS (Example 14.2.6) continued): The strict propriety of the CRPS loss (14.2.3) means that the generalized negative entropy is

Ω(F) = sup_G {−E[ℓ(G, Y)]} = −E[ℓcrps(F, Y)] = ∫ (F(t) − 1)F(t) dt

by definition. Expanding Ω(F + ϵ(G − F)) for small ϵ as in the recipe above, we have

Ω(F + ϵ(G − F)) = Ω(F) − ϵ ∫ (G(t) − F(t)) dt + 2ϵ ∫ F(t)(G(t) − F(t)) dt + O(ϵ²).
To obtain the y-based derivative Ω′(F, y), we choose Gy(t) = 1{y ≤ t} to obtain the directional derivative

Ω′(F, y) = lim_{ϵ↓0} (Ω(F + ϵ(Gy − F)) − Ω(F))/ϵ = −∫ (1{y ≤ t} − F(t)) dt + 2 ∫ F(t)(1{y ≤ t} − F(t)) dt,

as desired. ◁
We can also consider the log loss for general distributions, which is a bit subtle because of
the necessity of defining a base measure. Note that for discrete cases—when Y ∈ {1, . . . , k} or Y
countable—we could always use the probability mass function without loss of generality.
Example 14.2.11 (Log loss for general distributions): For general distributions P, the logarithmic loss requires additional work to be sensible, because −log p(y) is not well-defined if Y is not discrete. Thus, we fix a base measure ν on Y, and let P be the collection of distributions that are absolutely continuous with respect to ν. Then for P ∈ P, we let ℓ(P, y) = −log p(y), where p is the density of P with respect to ν, while ℓ(P, y) = +∞ for P ̸≪ ν. Noting that for P ∈ P,

EP[ℓ(Q, Y)] = EP[−log q(Y)] = EP[log(p(Y)/q(Y))] − EP[log p(Y)] = Dkl(P||Q) + Hν(P),
where Hν(P) := −∫ p(y) log p(y) dν(y) denotes the (Shannon) entropy for the base measure ν, we see that ℓ is indeed strictly proper.
The negative entropy is thus Ω(P ) = −Hν (P ) when P ≪ ν and +∞ otherwise. To obtain
the directional derivatives as in the representation (14.2.6), we heuristically take the derivative ∂Ω(P)/∂p(y) to guess that

Ω′(P, y) = 1 + log p(y).
We can directly check that this is indeed a subgradient for Ω: we have
Ω(P) + EQ[Ω′(P, Y)] − EP[Ω′(P, Y)] = EP[log p(Y)] + EQ[1 + log p(Y)] − EP[1 + log p(Y)]
= EQ[log(p(Y)/q(Y))] + EQ[log q(Y)] = Ω(Q) − Dkl(Q||P),
assuming Q ≪ ν. (Otherwise, Dkl (Q||P ) = +∞, and EQ [log p(Y )] = −∞ regardless.) Finally,
we check the loss representation (14.2.6): so long as Q ≪ ν, we have
−Ω(Q) − Ω′ (Q, y) + EQ [Ω′ (Q, Y )] = Hν (Q) − Hν (Q) − 1 + 1 − log q(y) = − log q(y),
as desired. ◁
In this vector-valued Y case, instead of predicting distributions P, the goal is to predict the mean mapping

µ(P) := EP[Y] ∈ Conv(Y),
so that µ : P → Rk for the collection P of distributions on Y . Our goal is to reward predictions of
the correct expectation, leading to the following definition.
Definition 14.3. Let M = cl Conv(Y) be the (convex) collection of mean parameters for Y ∈ Y.
Then ℓ : M × Y → R is proper if
Definition 14.3 generalizes Definition 14.2 in the multinomial case, where Y is a discrete set that
we may identify with the basis vectors {e1 , . . . , ek }, as Example 14.2.12 makes clear.
With this definition, we can extend Theorem 14.2.1 to a more general case, where as usual we
say that ℓ is regular if EP [ℓ(EP [Y ], Y )] < ∞ for all distributions P on Y.
With this theorem, we have an essentially complete analogy with Theorem 14.2.1. There are
subtleties in the proof because the mapping from probabilities P to EP [Y ] can be many-to-one,
necessitating some care in the calculations, and making infinite losses somewhat challenging. A few
examples centered around ordinal regression illustrate the scenarios.
ℓΩ(µ, y) = (1/2)(µ − y)² − (1/2)y²,

which is strictly convex and proper.
Other choices of Ω are possible. One natural choice is a variant of the negative binary
entropy, and we define
Ω(µ) = (k − µ) log(k − µ) + µ log µ,
which is convex in µ ∈ [0, k], with Ω(µ) = +∞ for µ > k or µ < 0, while Ω(0) = Ω(k) = k log k.
We have Ω′(µ) = log(µ/(k − µ)), and so
for y ∈ {0, . . . , k}. Here, however, note the importance of allowing infinite values in the loss ℓ when µ → {0, k}. ◁
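The quadratic case above is simple enough to verify empirically (Python; not from the notes, with an arbitrary sample over {0, ..., 4}): the expected loss E[(µ − Y)²/2 − Y²/2] is minimized at µ = E[Y], the defining property of a proper loss for the mean.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.integers(0, 5, size=100_000).astype(float)   # Y supported on {0,...,4}
mu_star = y.mean()

def expected_loss(mu):
    # l_Omega(mu, y) = (mu - y)^2/2 - y^2/2, the proper loss from Omega = mu^2/2
    return np.mean(0.5 * (mu - y) ** 2 - 0.5 * y**2)

grid = np.linspace(0.0, 4.0, 801)
best = grid[np.argmin([expected_loss(m) for m in grid])]
assert abs(best - mu_star) < 0.01                    # minimized at E[Y]
```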
Proof One direction is, as in the previous cases, straightforward. Let ℓ have the given represen-
tation. Then for µ(P ) = EP [Y ],
where the equality (⋆) follows because ℓ is proper. The function Ω is closed convex, as it is the partial
infimum of the closed convex function p 7→ −Ep [ℓ(µ, Y )] + I∆m (p), where we recall I∆m (p) = 0 if
p ∈ ∆m and +∞ otherwise (see Proposition B.3.11).
We compute ∂Ω(µ) directly now. The infimum over p in the definition of Ω(µ) is attained, as
∆m is compact and g(p) := −Ep [ℓ(µ, Y )] is necessarily continuous in p satisfying µ(p) = µ, because
regularity of the loss guarantees ℓ(µ, yi ) ∈ R whenever pi > 0 is feasible in the mean mapping
constraint µ(p) = µ. Moreover, it is immediate that
∇g(p) = [−ℓ(µ, y1), . . . , −ℓ(µ, ym)]ᵀ ∈ Rm.
Let p⋆ (µ) be any p attaining the infimum. By Proposition B.3.29 on the subgradients of partial
minimization, we thus obtain
∂Ω(µ) = { s ∈ Rk | yiᵀs = −ℓ(µ, yi) for i = 1, . . . , m },
and moreover, this set is necessarily non-empty for all µ ∈ relint C = {µ(p) | p ≻ 0, p ∈ ∆m }. Using
this equality, we have
Note that for each of these, we have a direct relationship between the probabilistic predictions and
derivatives of φ. In the binary logistic regression case, where y ∈ {−1, 1}, we have

p(y | s) = 1 + y (∂/∂s) φ(s, y) = 1 − 1/(1 + exp(ys)) = 1/(1 + exp(−ys)),

while in the multiclass case we similarly have

p(y | s) = 1 + (∂/∂sy) φ(s, y) = exp(sy) / Σ_{i=1}^k exp(si).
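The binary relationship can be checked by finite differences (Python; not from the notes, arbitrary test points, labels taken in {−1, +1}): the model probability equals one plus y times the derivative of the logistic loss.

```python
import numpy as np

def phi(s, y):            # binary logistic loss with labels y in {-1, +1}
    return np.log1p(np.exp(-y * s))

def prob(y, s):           # model probability 1 / (1 + e^{-ys})
    return 1 / (1 + np.exp(-y * s))

# finite differences confirm p(y | s) = 1 + y * d(phi)/ds
for s in (-2.0, -0.3, 0.0, 1.7):
    for y in (-1.0, 1.0):
        h = 1e-6
        dphi = (phi(s + h, y) - phi(s - h, y)) / (2 * h)
        assert np.isclose(prob(y, s), 1 + y * dphi, atol=1e-6)
```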
convex optimization to bear on actually fitting predictive models. We work in the general setting of
Section 14.2.3 of losses for vector-valued y where Y ⊂ Rk , so that instead of predicting probability
distributions on Y itself we predict elements µ of the set {EP [Y ]} = Conv(Y), and let ℓ be a strictly
proper loss. Theorems 14.2.1 and 14.2.15 demonstrate that if the loss ℓ is proper, there exists a
(negative) generalized entropy, which in the case of Theorem 14.2.1 is Ω(p) = supq {−Ep [ℓ(q, Y )]},
for which
ℓ(µ, y) = −Ω(µ) − ⟨∇Ω(µ), y − µ⟩.
Note that Ω is always a closed convex function, meaning that it is lower semicontinuous or that its
epigraph epi Ω = {(µ, t) | Ω(µ) ≤ t} is closed.
Let us suppose temporarily that we have any such entropy. Recalling the convex conju-
gate (14.1.3), the negative generalized entropy Ω is closed convex, and so its conjugate Ω∗ (s) =
sup{⟨s, µ⟩ − Ω(µ)} satisfies Ω∗∗(µ) = Ω(µ). In particular, if we define the surrogate loss

φ(s, y) := Ω∗(s) − ⟨s, y⟩,    (14.3.2)

then

inf_s EP[φ(s, Y)] = inf_s {Ω∗(s) − ⟨s, µ(P)⟩} = −Ω∗∗(µ(P)) = −Ω(µ(P)).
This identification of (generalized) entropies will underpin much of our development of the consis-
tency of losses in sections to come. For now, we content ourselves with addressing how to under-
stand propriety of the surrogate loss φ and how to transform predictions s ∈ Rk into probabilistic
predictions µ.
The key will be to consider what we term convex-conjugate-linkages, or conjugate linkages for
short. Recall the duality relationships (14.1.4) from the Fenchel-Young inequality we present in
the convexity primer in Section 14.1.1. The negative generalized entropy Ω is convex, and the
dualities associated with its conjugate Ω∗ (s) = supµ {⟨s, µ⟩ − Ω(µ)} will form the basis of our
transformations. We first give a somewhat heuristic presentation, as the intuition is important
(but details to make things precise can be a bit tedious). Essentially, we require that Ω∗ and Ω are
continuously differentiable, in which case we have
∇Ω(µ) = s if and only if ∇Ω∗ (s) = µ if and only if Ω∗ (s) + Ω(µ) = ⟨s, µ⟩
by the Fenchel-Young inequalities (14.1.4). That is, the gradient ∇Ω∗ of the conjugate transforms a score vector s ∈ Rk into elements µ ∈ Conv(Y) to predict Y: we transform s into a prediction µ via the conjugate link function
predΩ (s) = argmax {⟨s, µ⟩ − Ω(µ)} = ∇Ω∗ (s) = (∇Ω)−1 (s), (14.3.1)
µ
which finds the µ that best trades having maximal “entropy” −Ω(µ), or uncertainty, with alignment
with the scores ⟨s, µ⟩.
With this, it is then natural to consider the function substituting the prediction µ = predΩ (s)
into ℓ(µ, y), and so we consider
ℓ(predΩ (s), y).
Immediately, if µ = predΩ (s) = ∇Ω∗ (s), we have s = ∇Ω(µ) by construction (or the Fenchel-Young
inequality (14.1.4)), and so Ω(µ) = ⟨s, µ⟩ − Ω∗ (s) for this particular pair (s, µ), and ∇Ω(µ) =
∇Ω(∇Ω∗ (s)) = s because ∇Ω and ∇Ω∗ are inverses. Substituting, we obtain
ℓ(predΩ(s), y) = −Ω(predΩ(s)) − ⟨∇Ω(predΩ(s)), y − predΩ(s)⟩ = −Ω(µ) − ⟨s, y − µ⟩
= Ω∗(s) − ⟨s, µ⟩ − ⟨s, y − µ⟩ = Ω∗(s) − ⟨s, y⟩ = φ(s, y).
The surrogate loss (14.3.2) constructed from the negative entropy Ω is the key transformation of
the loss ℓ into a convex loss, and (no matter the properties of ℓ) is always convex.
As we have already demonstrated, the construction (14.3.2) is more general than we have
presented; certainly, Ω∗ is always convex, and so φ is always convex in s. Moreover, if Y has expectation E[Y] = µ, then

inf_s E[φ(s, Y)] = inf_s {Ω∗(s) − ⟨s, µ⟩} = −Ω∗∗(µ) = −Ω(µ)
by conjugate duality, so the surrogate φ always recovers the negative entropy Ω; without some type
of differentiability conditions, however, the construction of the prediction mapping predΩ requires
more care. Sections 16.3, 16.4, and 16.5 investigate these connections more deeply.
All that remains is to give more precise conditions under which the prediction (14.3.1) is always
unique and exists for all possible score vectors s ∈ Rk . To that end, we make the following definition.
This is precisely the condition we require to make each step in the development of the surro-
gate (14.3.2) airtight; as a corollary to Theorem C.2.9 in the appendices, we have the following.
Corollary 14.3.1. Let Ω be a Legendre negative entropy. Then the conjugate link prediction (14.3.1)
is unique and exists for all s ∈ Rk . In particular, the conjugate Ω∗ is strictly convex, continuously
differentiable, satisfies dom Ω∗ = Rk , and ∇Ω∗ = (∇Ω)−1 .
With this corollary in place, we can then give a theorem showing the equivalence of the strictly
proper loss ℓ and its surrogate.
Theorem 14.3.2. Let ℓ : M × Y → R be the strictly proper loss associated with the Legendre
negative entropy Ω. Then
Proof The first equality we have already demonstrated. For the minimization claim, we note
that if µ = E[Y ], then E[φ(s, Y )] = Ω∗ (s)−⟨µ, s⟩ and inf s {Ω∗ (s)−⟨µ, s⟩} = −Ω(µ). Strict propriety
of ℓ then gives inf µ′ E[ℓ(µ′ , Y )] = −Ω(µ).
Said differently, the surrogate φ is consistent with the loss ℓ and (strictly) proper, in that
if s minimizes E[φ(s, Y )], then predΩ (s) minimizes E[ℓ(µ, Y )]. The statement in terms of limits
is necessary, however, as simple examples show, because with some link functions it is in fact
impossible to achieve the extreme points of Conv(Y), as in logistic regression. We provide a few
example applications (and non-applications) of Theorem 14.3.2. For the first, let us consider binary
logistic regression.
Example 14.3.3 (Binary logistic regression): For a label Y ∈ {0, 1} and predictions p ∈ [0, 1],
take the generalized negative entropy
where the supremum is achieved by p = predΩ(s) = e^s/(1 + e^s). Then we have
For the induced loss ℓ(p, y) = −y log p − (1 − y) log(1 − p) (the log loss), if P(Y = 1) = 1,
then p = 1 minimizes E[ℓ(p, Y )]. Similarly, if P(Y = 0) = 1, then p = 0 minimizes E[ℓ(p, Y )].
Neither of these is achievable by a finite s in p(y | s) = e^{ys}/(1 + e^s), showing how the limiting argument is necessary. ◁
The next example shows that we sometimes need to elaborate the setting of Theorem 14.3.2 to deal
with constraints.
Example 14.3.4 (Multiclass logistic regression): Identify the set Y = {e1 , . . . , ek } with the
k standard basis vectors, and for p ∈ ∆k = {p ∈ Rk+ | 1T p = 1}, consider the negative entropy
k
X
Ω(p) = py log py .
y=1
This function is strictly convex and of Legendre type for the positive orthant Rk+ but not for
∆k . Shortly, we shall allow linear constraints on the predictions to address this shortcoming.
As an alternative, take Y = {0, e1, . . . , ek−1}, so that Conv(Y) = {p ∈ R^{k−1}_+ | 1ᵀp ≤ 1},
which has an interior and so more easily admits a conjugate duality relationship. In this case, the negative entropy-type function

Ω(p) = Σ_{y=1}^{k−1} py log py + (1 − 1ᵀp) log(1 − 1ᵀp)    (14.3.4)

with

predΩ(s) = ( e^{s1}/(1 + Σ_{j=1}^{k−1} e^{sj}), . . . , e^{s_{k−1}}/(1 + Σ_{j=1}^{k−1} e^{sj}) ).

Letting p denote the entries of this vector, we can then assign a probability to class k via pk = 1 − Σ_{j=1}^{k−1} pj. ◁
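The conjugate-link relationship for the entropy (14.3.4) can be verified directly (Python; not from the notes, with k − 1 = 3 coordinates and random scores): the map s ↦ predΩ(s) above inverts ∇Ω on the interior of Conv(Y).

```python
import numpy as np

def pred(s):
    # conjugate link grad(Omega*) for
    # Omega(p) = sum_y p_y log p_y + (1 - 1'p) log(1 - 1'p)
    e = np.exp(s)
    return e / (1 + e.sum())

def grad_Omega(p):
    return np.log(p) - np.log(1 - p.sum())

rng = np.random.default_rng(7)
for _ in range(50):
    s = rng.normal(size=3)
    p = pred(s)
    assert np.all(p > 0) and p.sum() < 1      # lands in the interior of Conv(Y)
    assert np.allclose(grad_Omega(p), s)      # grad Omega and grad Omega* invert
```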
In Section 14.4 we revisit exponential families in the (proper) loss minimization framework we
have thus far developed, which gives some additional perspective on these problems.
is a proper subspace of Rk . The key motivating example here is the “failure” case of Example 14.3.4
on multiclass logistic regression, where Y = {e1 , . . . , ek }, whose affine hull is exactly those vectors
p ∈ Rk satisfying ⟨p, 1⟩ = 1. Naturally, in this case we wish to predict probabilities, and so given a score vector s ∈ Rk and using the negative entropy Ω(p) = Σ_{y=1}^k py log py, we let

pred(s) = argmin_p { Ω(p) − ⟨s, p⟩ | 1ᵀp = 1 } = [ e^{sy} / Σ_{j=1}^k e^{sj} ]_{y=1}^k.
Then for the loss ℓ(µ, y) = −Ω(µ) − ⟨∇Ω(µ), y − µ⟩ associated with the negative entropy Ω, we
define the surrogate
φ(s, y) := ℓ(predΩ,A (s), y).
Perhaps remarkably, this construction still yields a well-defined convex loss with the same consis-
tency properties as those in Theorem 14.3.2. Indeed, defining
and the associated conjugate Ω∗A (s) = sup{⟨s, µ⟩ − Ω(µ) | µ ∈ A}, we have the following theorem.
Theorem 14.3.5. Let ℓ : M × Y → R be the strictly proper loss associated with the Legendre
negative entropy Ω and A = aff(Y) be the affine hull of Y. Then
We return to proving the theorem presently, focusing here on how it applies to Example 14.3.4.
Example 14.3.6 (Multiclass logistic regression): Consider Example 14.3.4, where we identify Y = {e1, . . . , ek} ⊂ Rk, which has affine hull A = {p ∈ Rk | ⟨1, p⟩ = 1}. Then taking Ω(p) = Σ_{y=1}^k py log py, a calculation with a Lagrangian shows that

predΩ,A(s) = argmin_{p∈∆k} {−⟨s, p⟩ + Ω(p)} = [ e^{sy} / Σ_{j=1}^k e^{sj} ]_{y=1}^k.
Notably, the logistic loss is not strictly convex, as φ(s + t1, y) = φ(s, y) for t ∈ R. If Y is a
multinomial random variable with P(Y = ey ) = py , then by another calculation, the vector
with entries
s⋆y = log py
minimizes E[φ(s, Y)], which in turn gives predΩ,A(s⋆) = p, maintaining propriety. ◁
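A numerical check of this propriety claim (Python; not from the notes, arbitrary random p, writing the multiclass logistic surrogate as φ(s, ey) = log Σi e^{si} − sy): the score s⋆ = log p is stationary for the expected surrogate, and random competitors never beat it.

```python
import numpy as np

rng = np.random.default_rng(8)
p = rng.dirichlet(np.ones(4))            # P(Y = e_y) = p_y

def exp_logistic(s):
    # E[phi(s, Y)] with phi(s, e_y) = log sum_i e^{s_i} - s_y
    return np.log(np.sum(np.exp(s))) - p @ s

softmax = lambda s: np.exp(s) / np.sum(np.exp(s))
s_star = np.log(p)
# s* = log p is a stationary point: gradient softmax(s) - p vanishes there
assert np.allclose(softmax(s_star), p)
for _ in range(100):                     # and no random score does better
    s = rng.normal(size=4)
    assert exp_logistic(s) >= exp_logistic(s_star) - 1e-12
```

The minimal value exp_logistic(s⋆) equals the Shannon entropy of p, consistent with inf_s E[φ(s, Y)] = −Ω(µ).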
Corollary 14.3.7. The conjugate Ω∗A is continuously differentiable with dom Ω∗A = Rk , and if
µ = ∇Ω∗A (s), then µ ∈ int dom Ω and
∇Ω(µ) = s + v
for some vector v normal to A, that is, a vector v ∈ Rk satisfying ⟨v, µ0 −µ1 ⟩ = 0 for all µ0 , µ1 ∈ A.
While the proof of the corollary requires some care to make precise, a sketch can give intuition.
Sketch of Proof Because Ω is strictly convex and its derivatives ∇Ω(µ) explode as µ → bd dom Ω, the minimizer of −⟨s, µ⟩ + Ω(µ) over µ ∈ A exists and is unique. Let A = {µ | Aµ = b} for shorthand, where A ∈ Rn×k for some n < k. Then introducing a Lagrange multiplier w ∈ Rn for the constraint µ ∈ A, the Lagrangian for finding predΩ,A(s) = argminµ {Ω(µ) − ⟨s, µ⟩ | µ ∈ A} is L(µ, w) = Ω(µ) − ⟨s, µ⟩ + ⟨w, Aµ − b⟩, and setting its gradient in µ to zero gives

∇Ω(µ) − s + Aᵀw = 0.
Finally, we return to prove the theorem. Take any vector s ∈ Rk . Then because predΩ,A (s) =
∇Ω∗A (s), we have
As ∇Ω∗A (s) ∈ A and using the shorthand µ = ∇Ω∗A (s) ∈ A, we have ∇Ω(µ) = s + v for some v
normal to A. Moreover, Ω(µ) = ΩA (µ), and so the Fenchel-Young inequality (14.1.4) guarantees
−ΩA (µ) = Ω∗A (s) − ⟨s, µ⟩. Substituting in the expression for φ, we obtain
Now, for any distribution P ∈ P with mean vector µ = µ(P), the associated generalized negative entropy is

Ω(µ) := sup_θ {−EP[φ(θ, X)]} = sup_θ {⟨θ, µ(P)⟩ − A(θ)} = A∗(µ),
the convex conjugate of A. At this point, the centrality of the duality relationships (via gradients
∇A and ∇A∗ ) between Θ and M to fitting and modeling should come as no surprise, and so we
elucidate a few of the main properties. Because ∇A(θ) = Eθ [ϕ(X)] in the exponential family, we
immediately see that
∇A(Θ) := {∇A(θ)}θ∈Θ ⊂ M.
Recalling the duality relationship (14.1.4) that
θ ∈ ∂A∗ (µ) if and only if ∇A(θ) = µ,
we can say much more.
(ii) If the family is non-minimal, then Ω is continuously differentiable relative to aff(M), meaning
that there exists a continuous mapping ∇Ω(µ) ∈ Θ such that for all µ ∈ Mo ,
n o
∂Ω(µ) = ∇Ω(µ) + aff(M)⊥ .
Moreover, Θ = Θ + aff(M)⊥ .
The proof of the proposition relies on the more sophisticated duality theory we develop in Appen-
dices B and C, so we defer it to Section 14.5.2.
We can summarize the proposition by considering minimizers and maximizers: suppose we wish to choose θ to minimize the expected log loss EP[φ(θ, X)] = A(θ) − ⟨EP[ϕ(X)], θ⟩.
Then so long as the distribution P is not extremal in that µ(P ) = EP [ϕ(X)] ∈ relint M, there
exists a parameter θ(P ), unique up to translation in the subspace perpendicular to aff(M), for
which
θ(P ) ∈ argmin EP [φ(θ, X)] = argmin{A(θ) − ⟨µ(P ), θ⟩}.
θ θ
Moreover, this parameter satisfies the mean matching condition
∇A(θ(P )) = µ(P ),
which is of course sufficient to be a minimizer of the expected log loss. As the statements in the
proposition evidence, calculations become more challenging when we must perform them all in an
affine subspace, though sometimes this care is unavoidable.
When Cov(X) ̸≻ 0, the solution K = Cov(X)−1 does not exist, so we must rely instead
on part (ii) of Proposition 14.4.1. With some care, one may check that we can work in the
subspace spanned by the eigenvectors of Cov(X), that is, if Cov(X) = U ΛU ⊤ and U ∈ Rd×k ,
the collection of symmetric matrices K whose column space belongs to span(U ). Then the
pseudo-inverse K = Cov(X)† is the appropriate solution, and it recovers the covariance Σ =
K † = Cov(X) ⪰ 0. 3
Finally, let us give a last result that shows the duality relationships between the negative
generalized entropy Ω(µ) and log partition A, which allows us to also capture a few of the nuances
of minimization of the surrogate log loss φ(θ, x) = − log pθ (x) when we encounter distributions P
for which the mean mapping µ(P ) is on the boundary of M or even outside it.
Proposition 14.4.3. Let {Pθ } be a regular exponential family with log partition A(θ) with domain
Θ, and let M be the associated mean parameter space with relative interior Mo = relint M. Let
Ω(µ) = A∗ (µ) be the associated negative generalized entropy. Then
(i) A(θ) = Ω∗ (θ) = A∗∗ (θ) for all θ.
(ii) If µ ∈ Mo , there exists θ(µ) ∈ Θ such that the negative entropy satisfies Ω(µ) = A∗ (µ) =
⟨θ(µ), µ⟩ − A(θ(µ)) < ∞. If µ ̸∈ cl M, then Ω(µ) = +∞.
(iii) If µ ∈ bd M = cl M \ Mo, then for any µ0 ∈ Mo, Ω(µ) = lim_{t→0} Ω(tµ0 + (1 − t)µ), and there exist θt ∈ Θ with

∇A(θt) = tµ0 + (1 − t)µ and lim_{t→0} {A(θt) − ⟨µ, θt⟩} = inf_θ {A(θ) − ⟨µ, θ⟩}.

In particular, there exist sequences of dual pairs (µn, θn) with µn ∈ Mo and θn ∈ Θ satisfying µn = ∇A(θn), µn → µ, Ω(µn) → Ω(µ), and A(θn) − ⟨µ, θn⟩ → inf_θ {A(θ) − ⟨µ, θ⟩}.
See Section 14.5.2 for the deferred proof.
While the statement of Proposition 14.4.3 is complex, considering minimizers of E[φ(θ, X)] can
give some understanding. If P is a distribution such that µ(P ) ∈ Mo , then there exists a parameter
θ(P ) minimizing EP [φ(θ, X)]. If µ(P ) ∈ bd M, then either there exists a minimizer θ(P ) of the
loss, or there is a sequence of points θn such that
EP [φ(θn , X)] → inf EP [φ(θ, X)] = −Ω(µ(P )), and µ(Pθn ) → µ(P ),
θ
so that they asymptotically satisfy the mean identity. Finally, if µ(P) ̸∈ cl M, then infθ E[φ(θ, X)] =
−∞, making the choice of exponential family model poor, as it cannot capture the mean parameters.
Example 14.4.4: Let P be the uniform distribution on [0, a]. Then the differential entropy is H(P) = −log(1/a) = log a. ◁
Example 14.4.5: Let P be the normal distribution N(µ, Σ) on Rd and ν be Lebesgue measure. Then

H(P) = (1/2) log(det(2πΣ)) + (1/2) E[(X − µ)ᵀΣ⁻¹(X − µ)] = (d/2) log(2πe) + (1/2) log det(Σ),

because p(x) = det(2πΣ)^{−1/2} exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ)). ◁
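The closed form can be checked by numerical integration in one dimension (Python; not from the notes, with an arbitrary σ): a Riemann sum for −∫ p log p matches (1/2) log(2πeσ²).

```python
import numpy as np

sigma = 1.7                               # arbitrary standard deviation
x = np.linspace(-12 * sigma, 12 * sigma, 200_001)
dx = x[1] - x[0]
p = np.exp(-(x**2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
H_numeric = -np.sum(p * np.log(p)) * dx   # Riemann sum for -int p log p
H_formula = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
assert np.isclose(H_numeric, H_formula, atol=1e-6)
```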
For exponential families, the log partition determines the Shannon entropy directly, highlighting
that −h is indeed a familiar entropy-like object.
Proposition 14.4.6. Let {Pθ } be a regular exponential family with respect to the base measure ν.
Then for any θ ∈ Θ,
H(Pθ) = −h(µ(Pθ)) = A(θ) − ⟨µ(Pθ), θ⟩,
where h(µ) = Ω(µ) = sup_θ {⟨µ, θ⟩ − A(θ)} = A∗(µ).
Proof Using log pθ (x) = ⟨θ, ϕ(x)⟩ − A(θ) we obtain H(Pθ ) = −Eθ [⟨θ, ϕ(X)⟩ − A(θ)] = A(θ) −
⟨µ(Pθ ), θ⟩, where as usual µ(P ) = EP [ϕ(X)]. As θ and µ(Pθ ) have the duality relationship
∇A(θ) = µ(Pθ ), we obtain A(θ) − ⟨µ(Pθ ), θ⟩ = −Ω(µ(Pθ )) as desired.
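Proposition 14.4.6 is easy to verify in the Bernoulli family (Python; not from the notes, arbitrary test parameters): with A(θ) = log(1 + e^θ) and µ = ∇A(θ) the sigmoid, A(θ) − µθ equals the binary Shannon entropy of Pθ.

```python
import numpy as np

# Bernoulli as an exponential family: p_theta(x) = exp(theta * x - A(theta)),
# x in {0, 1}, with A(theta) = log(1 + e^theta) and mean mu = sigmoid(theta).
A = lambda th: np.log1p(np.exp(th))
for th in (-3.0, -0.5, 0.0, 1.2, 4.0):
    mu = 1 / (1 + np.exp(-th))                      # = grad A(theta)
    H_shannon = -(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))
    assert np.isclose(A(th) - mu * th, H_shannon)   # H(P_theta) = A - <mu,theta>
```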
The maximum entropy principal, which Jaynes [119] first elucidated in the 1950s, originates in
statistical mechanics, where Jaynes showed that (in a sense) entropy in statistical mechanics and
information theory were equivalent. The maximum entropy principle is this: given some constraints
(prior information) about a distribution P , we consider all probability distributions satisfying said
constraints. Then to encode our prior information while being as “objective” or “agnostic” as
possible (essentially being as uncertain as possible), we should choose the distribution P satisfying
the constraints to maximize the Shannon entropy. This principle naturally gives rise to exponential
family models, and (as we revisit later) allows connections to Bayesian and minimax procedures.
One caveat throughout is that the base measure ν is essential to all our derivations: it radically affects the distributions P we consider.
With all this said, suppose (without making any exponential family assumptions yet) we are
given ϕ : X → Rd and a mean vector µ ∈ Rd , and we wish to solve
maximize H(P ) subject to EP [ϕ(X)] = µ (14.4.2)
over all distributions P ∈ P, the collection of distributions having densities with respect to the
base measure ν, that is, P ≪ ν. Rewriting problem (14.4.2), we see that it is equivalent to
maximize − ∫ p(x) log p(x) dν(x)
subject to ∫ p(x)ϕ(x) dν(x) = µ, p(x) ≥ 0 for x ∈ X , ∫ p(x) dν(x) = 1.
Let
Pµlin := {P ≪ ν | EP [ϕ(X)] = µ}
be distributions with densities w.r.t. ν satisfying the expectation (linear) constraint E[ϕ(X)] = µ.
We then obtain the following theorem.
Theorem 14.4.7. For θ ∈ Rd , let Pθ have density
Z
pθ (x) = exp(⟨θ, ϕ(x)⟩ − A(θ)), A(θ) = log exp(⟨θ, ϕ(x)⟩)dν(x),
with respect to the measure ν. If EPθ [ϕ(X)] = µ, then Pθ maximizes H(P ) over Pµlin ; moreover, the
distribution Pθ is unique (though θ need not be).
Proof We first give a heuristic derivation—which is not completely rigorous—and then check to
verify that our result is exact. First, we write a Lagrangian for the problem (14.4.2). Introducing
Lagrange multipliers λ(x) ≥ 0 for the constraint p(x) ≥ 0, θ0 ∈ R for the normalization constraint
that P (X ) = 1, and θ ∈ Rd for the constraints that EP [ϕ(X)] = µ, we obtain the following
Lagrangian:
L(p, θ, θ0 , λ) = ∫ p(x) log p(x) dν(x) + Σ_{i=1}^d θi (µi − ∫ p(x)ϕi (x) dν(x)) + θ0 (∫ p(x) dν(x) − 1) − ∫ λ(x)p(x) dν(x).
Now, heuristically treating the density p = [p(x)]x∈X as a finite-dimensional vector (in the case
that X is finite, this is completely rigorous), we take derivatives and obtain
∂/∂p(x) L(p, θ, θ0 , λ) = 1 + log p(x) − Σ_{i=1}^d θi ϕi (x) + θ0 − λ(x) = 1 + log p(x) − ⟨θ, ϕ(x)⟩ + θ0 − λ(x).
To find the minimizing p for the Lagrangian (the function is convex in p), we set this equal to zero
to find that
p(x) = exp (⟨θ, ϕ(x)⟩ − 1 − θ0 − λ(x)) .
Now, we note that with this setting, we always have p(x) > 0, so that the constraint p(x) ≥ 0 is unnecessary and (by complementary slackness) we have λ(x) = 0. In particular, by taking θ0 = −1 + A(θ) = −1 + log ∫ exp(⟨θ, ϕ(x)⟩) dν(x), we have that (according to our heuristic derivation) the optimal density p should have the form
pθ (x) = exp (⟨θ, ϕ(x)⟩ − A(θ)) .
So we see the form of distribution we would like to have.
Consider any distribution P ∈ Pµlin , and assume that we have some θ satisfying EPθ [ϕ(X)] = µ.
In this case, we may expand the entropy H(P ) as
H(P ) = − ∫ p log p dν = − ∫ p log(p/pθ ) dν − ∫ p log pθ dν
= −Dkl (P ||Pθ ) − ∫ p(x)[⟨θ, ϕ(x)⟩ − A(θ)] dν(x)
(⋆) = −Dkl (P ||Pθ ) − ∫ pθ (x)[⟨θ, ϕ(x)⟩ − A(θ)] dν(x) = −Dkl (P ||Pθ ) + H(Pθ ),
where in the step (⋆) we have used the fact that ∫ p(x)ϕ(x) dν(x) = ∫ pθ (x)ϕ(x) dν(x) = µ. As
Dkl (P ||Pθ ) > 0 unless P = Pθ , we have shown that Pθ is the unique distribution maximizing the
entropy, as desired.
We obtain the following immediate corollary, which shows the direct connection between max-
imum entropy and minimizing expected logarithmic loss.
Corollary 14.4.8. Let {Pθ } be the exponential family with densities pθ (x) = exp(⟨θ, ϕ(x)⟩ − A(θ)) with respect to ν. For any µ ∈ M, if there exists θ satisfying EPθ [ϕ(X)] = µ, then Pθ solves the maximum entropy problem (14.4.2).
So if we consider minimizing the negative log loss (which is strictly proper) but wish to guarantee
that the predictive distribution satisfies EP [ϕ(X)] = µ, then the exponential family model is the
unique minimizer.
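To see Theorem 14.4.7 in action numerically (a sketch; the four-point sample space and the target mean below are illustrative choices), one can minimize θ ↦ A(θ) − ⟨θ, µ⟩ by gradient descent and check that the resulting exponential-family density satisfies the moment constraint:

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0, 2.0])   # finite sample space, counting base measure
phi = xs                                # statistic phi(x) = x
mu_target = 1.0                         # desired mean E_P[phi(X)]

theta = 0.0
for _ in range(5000):                   # gradient descent on theta -> A(theta) - theta * mu
    w = np.exp(theta * phi)
    p = w / w.sum()
    grad = p @ phi - mu_target          # grad of A(theta) - theta*mu is E_theta[phi] - mu
    theta -= 0.1 * grad

p_theta = np.exp(theta * phi - np.log(np.exp(theta * phi).sum()))
print(theta, p_theta @ phi)             # fitted mean matches mu_target
```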
We give three examples of maximum entropy, showing how the choice of the base measure ν affects the resulting maximum entropy distribution. For all three, we assume that the space X = R
is the real line. We consider maximizing the entropy over all distributions P satisfying
EP [X 2 ] = 1.
Example 14.4.9: Assume that the base measure ν is counting measure on the support
{−1, 1}, so that ν({−1}) = ν({1}) = 1. Then the maximum entropy distribution is given by
P (X = x) = 1/2 for x ∈ {−1, 1}. 3
Example 14.4.10: Assume that the base measure ν is Lebesgue measure on X = R, so that
ν([a, b]) = b − a for b ≥ a. Then by Theorem 14.4.7, we have that the maximum entropy
distribution has the form pθ (x) ∝ exp(−θx2 ); recognizing the normal, we see that the optimal
distribution is simply N(0, 1). 3
Example 14.4.11: Assume that the base measure ν is counting measure on the integers
Z = {. . . , −2, −1, 0, 1, . . .}. Then Theorem 14.4.7 shows that the optimal distribution is a
discrete version of the normal: we have pθ (x) ∝ exp(−θx²) for x ∈ Z. That is, we choose θ > 0 so that the distribution pθ (x) = exp(−θx²) / Σ_{j=−∞}^∞ exp(−θj²) has variance 1. 3
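A short computation makes this last example concrete (a sketch; the truncation of Z and the bisection bracket are implementation choices): since the variance of pθ is strictly decreasing in θ > 0, bisection recovers the θ giving unit variance.

```python
import math

def variance(theta, trunc=200):
    # p_theta(x) is proportional to exp(-theta * x^2) on the integers (truncated)
    xs = range(-trunc, trunc + 1)
    Z = sum(math.exp(-theta * x * x) for x in xs)
    return sum(x * x * math.exp(-theta * x * x) for x in xs) / Z

lo, hi = 1e-3, 10.0          # variance is decreasing in theta on (0, infinity)
for _ in range(80):          # bisect for Var = 1
    mid = 0.5 * (lo + hi)
    if variance(mid) > 1.0:
        lo = mid
    else:
        hi = mid
theta = 0.5 * (lo + hi)
print(theta, variance(theta))  # theta is close to, but not exactly, 1/2
```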
We remark in passing that in some cases, it is interesting to instead consider inequality rather than
equality constraints in the linear constraints defining the family P lin . Exercises 14.10 and 14.11
explore these ideas.
Lastly, we consider the empirical variant of minimizing the log loss, equivalently, of maximum
likelihood, where we maximize the likelihood of a given sample X1 , . . . , Xn . Consider the sample-
based maximum likelihood problem of solving
maximize_θ Π_{i=1}^n pθ (Xi ) ≡ minimize_θ − (1/n) Σ_{i=1}^n log pθ (Xi ), (14.4.3)
for the exponential family model pθ (x) = exp(⟨θ, ϕ(x)⟩ − A(θ)). We have the following result.
Proposition 14.4.12. Let µ̂n = (1/n) Σ_{i=1}^n ϕ(Xi ). Then any θ solving EPθ [ϕ(X)] = µ̂n is a maximum likelihood solution, which exists if and only if µ̂n ∈ relint M. If the sample is drawn Xi ∼ P i.i.d., where P ≪ ν and µ(P ) ∈ relint M, then with probability 1, µ̂n ∈ relint M eventually.
Proof Define the empirical negative log likelihood
L̂n (θ) := − (1/n) Σ_{i=1}^n log pθ (Xi ) = −⟨µ̂n , θ⟩ + A(θ),
which is convex. Taking derivatives and using that Θ = dom A is open, the parameter θ is a minimizer if and only if ∇L̂n (θ) = ∇A(θ) − µ̂n = 0, if and only if ∇A(θ) = µ̂n . Apply Proposition 14.4.1.
For the final statement, note that µ̂n ∈ aff(M) with probability 1. Then because µ(P ) ∈ relint M and µ̂n → µ(P ) with probability 1, we see that for any ϵ > 0 there is some (random, but finite) N such that n ≥ N implies ∥µ̂n − µ(P )∥ ≤ ϵ and µ̂n ∈ aff(M), so that µ̂n ∈ relint M.
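As a sketch of this moment-matching characterization in the Poisson family of Exercise 14.8 (the sample below is an arbitrary illustration), we have ∇A(θ) = e^θ , so the maximum likelihood solution is θ̂ = log µ̂n ; a grid search over the empirical negative log-likelihood confirms it:

```python
import math

sample = [3, 1, 4, 1, 5, 9, 2, 6]
mu_hat = sum(sample) / len(sample)

# Moment matching: solve grad A(theta) = e^theta = mu_hat
theta_mle = math.log(mu_hat)

def neg_loglik(theta):
    # L_n(theta) = -mu_hat * theta + A(theta), dropping the theta-free log(y!) terms
    return -mu_hat * theta + math.exp(theta)

grid = [theta_mle + 0.01 * k for k in range(-200, 201)]
best = min(grid, key=neg_loglik)
print(theta_mle, best)  # grid minimizer coincides with log(mu_hat)
```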
As a consequence of the result, we have the following rough equivalences tying together the
preceding material. In short, maximum entropy subject to (linear) empirical moment constraints
(Theorem 14.4.7) is equivalent to maximum likelihood estimation in exponential families (Propo-
sition 14.4.12), and these are all equivalent to minimizing the (surrogate) log loss E[φ(θ, X)].
for any s ∈ G(µ), where equality (ii) follows because p⋆i (µ) = 0 for i ∈ I(µ). As we allow extended
reals, replace s with s∞ = limt→∞ (s + t∆), which satisfies ⟨s∞ , yi ⟩ = ∞ = ℓ(µ, yi ) for i ∈ I(µ),
and we finally obtain
Ω(µ′ ) ≥ Ω(µ) + Σ_{i=1}^m s∞⊤ yi (p⋆i (µ) − p⋆i (µ′ )) = Ω(µ) + ⟨s∞ , µ − µ′ ⟩.
A(·) − ⟨µ, ·⟩ has a minimizer. (See Proposition C.3.5 and Corollary C.3.7 in Appendix C.2.1).
To that end, for vectors v ∈ Rd , define the essential supremum of ϕ(x) in the direction v by
ν ⋆ (ϕ, v) := ess sup_x ⟨ϕ(x), v⟩ = inf {t ∈ R | ν({x ∈ X | ⟨v, ϕ(x)⟩ ≥ t}) = 0} .
Now as µ ∈ Mo , for any vector v ̸= 0 we have ⟨v, µ⟩ < ν ⋆ (ϕ, v). Let ϵ > 0 satisfy ⟨v, µ⟩ < ν ⋆ (ϕ, v) − ϵ
be otherwise arbitrary, fix θ ∈ Θ, and let Xϵ = {x | ⟨v, ϕ(x)⟩ ≥ ν ⋆ (ϕ, v) − ϵ}, which satisfies
ν(Xϵ ) > 0. Then
A(θ + tv) − ⟨µ, θ + tv⟩ = log ∫ exp(⟨ϕ(x), θ + tv⟩) dν(x) − ⟨µ, θ + tv⟩
≥ log ∫_{Xϵ} exp(⟨ϕ(x), θ⟩) e^{t(ν⋆(ϕ,v)−ϵ)} dν(x) − ⟨µ, θ⟩ − t⟨µ, v⟩
= t(ν ⋆ (ϕ, v) − ϵ) − t⟨µ, v⟩ + log ∫_{Xϵ} e^{⟨ϕ(x),θ⟩} dν(x) − ⟨µ, θ⟩.
If ν(Xϵ ) = +∞, then A(θ + tv) = +∞ and so A′∞ (v) > 0 certainly. If ν(Xϵ ) < ∞, then note that
ν ⋆ (ϕ, v) − ϵ − ⟨µ, v⟩ > 0, and so
A(θ + tv) − ⟨µ, θ + tv⟩ ≥ t(ν ⋆ (ϕ, v) − ϵ − ⟨µ, v⟩) + log ∫_{Xϵ} e^{⟨ϕ(x),θ⟩} dν(x) − ⟨µ, θ⟩,
and thus
A(θ + tv) − ⟨µ, θ + tv⟩ − (A(θ) − ⟨µ, v⟩)
≥ ν ⋆ (ϕ, v) − ϵ − ⟨µ, v⟩ + o(1) (14.5.1)
t
as t → ∞.
Extending to the non-minimal case. If the exponential family is not minimal, there exists
a unit vector u and constant c such that ⟨u, ϕ(x)⟩ = c for ν-almost all x. Let U ∈ Rd×k be an
orthonormal basis for all such vectors, where k is the dimension of this collection. Then there exists
a vector c ∈ Rk such that c = U ⊤ ϕ(x) for ν-almost all x, and we see that A(θ + U v) = A(θ) + ⟨c, v⟩
as ⟨θ + U v, ϕ(x)⟩ = ⟨θ, ϕ(x)⟩ + ⟨c, v⟩ for ν-almost all x. We show both inclusions as above. Let
U⊥ ∈ Rd×(d−k) be an orthonormal basis for the orthogonal subspace to U , so that U ⊤ U = Ik and
U⊥⊤ U⊥ = Id−k , and for any µ ∈ M, we have aff(M) = µ + span(U⊥ ).
Showing that ∇A(Θ) ⊂ Mo . Fix θ0 ∈ Θ and let µ = ∇A(θ0 ). We must show that there
exists ϵ > 0 such that for all u ∈ span(U⊥ ) satisfying ∥u∥ ≤ ϵ, the point µ + u ∈ M. To that end,
note that for any vectors v ∈ Rd−k and w ∈ Rk , we have
A(θ0 + U⊥ v + U w) − ⟨µ + u, U⊥ v + U w⟩ = A(θ0 + U⊥ v) − ⟨µ + u, U⊥ v⟩
v ̸= 0. Following the same argument, mutatis mutandis, as that leading to inequality (14.5.1) yields that g′∞ (v) > 0 for all v ̸= 0. That is, v 7→ A(θ + U⊥ v) − ⟨µ, U⊥ v⟩ has a minimizer v(µ) (Corollary C.3.7), which is unique by the strict convexity of v 7→ A(θ + U⊥ v), and which necessarily satisfies U⊥⊤ ∇A(θ + U⊥ v(µ)) = U⊥⊤ µ. As U ⊤ ∇A(θ) = c for all θ and U ⊤ µ = c for all µ ∈ M, this
shows that there exists θ(µ) such that ∇A(θ(µ)) = µ as desired. Moreover, fixing an arbitrary θ
and letting v(µ) be the unique minimizer of A(θ + U⊥ v) − ⟨µ, U⊥ v⟩, the set of all minimizers
Θ⋆ (µ) = argmin_θ {A(θ) − ⟨µ, θ⟩} = { θ + U⊥ v(µ) + U w | w ∈ Rk }.
Finally, for part (iii), we note that the function g(t) = Ω(tµ0 + (1 − t)µ) is a one-dimensional
closed convex function. One-dimensional closed convex functions are continuous on their domains
(Observation B.3.6 in Appendix B.3.2), and so g is necessarily continuous. Thus limt↓0 g(t) = g(0).
The existence of θt follows from Proposition 14.4.1.
Bibliography
JCD Comment: Need to do a lot here!
Gneiting and Raftery [103]
14.6 Exercises
Exercise 14.1 (Strict propriety of the log loss): Let ∆k = {p ∈ Rk+ | 1T p = 1} be the probability
simplex. Show that if ℓ(q, y) = − log qy and P(Y = y) = py , then
argmin_{q∈∆k} E[ℓ(q, Y )] = p.
Exercise 14.3: Give the details in the computations for Example 14.3.4.
Exercise 14.4: Let y ∈ {0, 1} and take the regularization function Ω(p) = − log p − log(1 − p).
(a) Verify that the entropy is of Legendre type (Definition 14.4).
(b) Give the associated loss ℓ and surrogate loss φ in the sense of Section 14.3.
(c) Plot the surrogate φ(s, y) + log 8 and the logistic regression surrogate log(1 + es ) − sy for
y ∈ {0, 1}, each as function of s. (The shift by log 8 guarantees the losses coincide at s = 0.)
(d) Give predΩ (s) for s ∈ R, verifying that predΩ (s) ∈ [0, 1].
Exercise 14.5: For Ω(p) = − log p − log(1 − p) as in Exercise 14.4, show that Ω is self-concordant, meaning that |Ω′′′ (p)| ≤ 2(Ω′′ (p))3/2 for all p ∈ (0, 1). (Such functions are important in optimization; the conjugate Ω∗ is then also guaranteed to be self-concordant.)
Exercise 14.6 (Surrogates for regression): Define Ω(µ) = (1/4)µ⁴.
(b) Show directly that the surrogate loss φ(s, y) = Ω∗ (s)−sy satisfies that if ŝ = argmins E[φ(s, Y )],
then predΩ (ŝ) = E[Y ].
Exercise 14.7: Let P be a predicted distribution and for α ∈ [0, 1/2], define the lower and upper quantiles lα = Quantα (P ) and uα = Quant1−α (P ). Given these quantiles, for a finite set A ⊂ [0, 1/2],
define the weighted interval loss
W (P, y) := Σ_{α∈A} [α(uα − lα ) + dist(y, [lα , uα ])] ,
which penalizes P using both the size (uα − lα ) of the quantile intervals and the distance of the
outcome y from the predicted quantiles. Define the symmetrized set As = A ∪ {1 − α | α ∈ A}.
Show that
W (P, y) = ℓquant,As (P, y),
where ℓquant is the quantile loss (14.2.4).
Exercise 14.8: We explore a particularization of the results in Section 14.4. Let Y ∼ Poi(eθ ), so
that Y has p.m.f. pθ (y) = exp(θy − eθ )/y! for y ∈ N. Let A(θ) = eθ be the log-partition function.
Define the “surrogate” loss φ(θ, y) = − log pθ (y).
(a) Give the associated negative generalized entropy Ω(µ) for µ ∈ (0, ∞).
(b) Give the associated loss ℓ(µ, y) in the proper representation of Theorem 14.2.15. Directly verify
that it is strictly proper, in that argminµ E[ℓ(µ, Y )] = E[Y ] for any Y supported on R+ .
Exercise 14.9: We explore a particularization of Example 14.4. Let X ∼ N(0, Σ) for a covariance Σ ≻ 0, and let K = Σ−1 be the associated precision matrix. Then X has density pK (x) = exp(−(1/2)⟨xx⊤ , K⟩ + (1/2) log det(K)) with respect to (a scaled) Lebesgue measure, and log partition A(K) = −(1/2) log det(K), which has domain the positive definite matrices K ≻ 0 (and is +∞ elsewhere).
(a) Give the associated negative generalized entropy Ω(M ) for symmetric matrices M . Specify the
domain of Ω.
(b) Give the associated loss ℓ(M, x) in the proper representation of Theorem 14.2.15. Directly
verify that it is strictly proper, in that if the second moment matrix C := E[XX T ] of X
satisfies C ≻ 0, then argminM E[ℓ(M, X)] = C.
Exercise 14.10: In this extended exercise, we generalize Theorem 14.4.7 to apply to general
(finite-dimensional) convex cone constraints. A set C is a convex cone if for any two points x, y ∈ C,
we have λx + (1 − λ)y ∈ C for all λ ∈ [0, 1], and C is closed under positive scaling: x ∈ C implies
that tx ∈ C for all t ≥ 0. The following are standard examples (the positive orthant and the
semi-definite cone):
i. The positive orthant. Take C = Rd+ = {x ∈ Rd | xj ≥ 0 for each j}. Then C is convex and closed under positive scaling.
ii. The semidefinite cone. Take C = {X ∈ Rd×d : X = X ⊤ , X ⪰ 0}, where a matrix X ⪰ 0 means
that a⊤ Xa ≥ 0 for all vectors a. Then C is convex and closed under positive scaling as well.
Given a convex cone C, we associate a cone ordering ⪰ with the cone and say that for two elements
x, y ∈ C, we have x ⪰ y if x − y ⪰ 0, that is, x − y ∈ C. In the orthant case, this simply means that
x is component-wise larger than y. For a given inner product ⟨·, ·⟩, define the dual cone
C ∗ := {y | ⟨x, y⟩ ≥ 0 for all x ∈ C}.
For the standard (Euclidean) inner product, the positive orthant is thus self-dual, and similarly
the semidefinite cone is also self-dual. For a vector y, we write y ⪰∗ 0 if y ∈ C ∗ is in the dual cone.
With this setup, consider the following linearly constrained maximum entropy problem, where the
cone ordering ⪯ derives from a cone C:
Exercise 14.11 (An application of Theorem 14.6.1): Let the cone C be the positive semidefinite
cone in Rd×d , ν be the Lebesgue measure dν(x) = dx and define ψ(x) = (1/2) xx⊤ ∈ Rd×d . Let Σ ≻ 0.
Give the density solving
maximize − ∫ p(x) log p(x) dx subject to EP [XX ⊤ ] ⪯ Σ.
Exercise 14.12: Prove that the log determinant function is concave over the positive semidefinite
matrices. That is, show that for X, Y ∈ Rd×d satisfying X ⪰ 0 and Y ⪰ 0, we have
where σij are specified only for indices i, j ∈ S (but we know that σij = σji and (i, i) ∈ S for all i).
Let Σ∗ denote the solution to this problem, assuming there is a positive definite matrix Σ satisfying
Σij = σij for i, j ∈ S. Show that for each unobserved pair (i, j) ̸∈ S, the (i, j) entry [Σ∗−1 ]ij of the inverse Σ∗−1 is 0. Hint: The distribution maximizing the entropy H(X) = − ∫ p(x) log p(x) dx subject to E[Xi Xj ] = σij has Gaussian density of the form p(x) = exp( Σ_{(i,j)∈S} λij xi xj − Λ0 ).
Chapter 15
In Chapter 14, we encountered proper losses, in which we assume we predict probability distribu-
tions on outcomes Y . In typical problems, we wish to predict things about Y from a given set of
covariates or inputs X, and in focusing exclusively on the losses ℓ themselves, we implicitly assume
that we can model Y | X basically perfectly. Here, we move away from this focus exclusively on
the loss itself to incorporate discussion of predictions, where we seek a function f : X → Y (or
some other output space) that yields the most accurate predictions.
In this chapter, we adopt the view of Section 14.2.3, where the target Y ⊂ Rk is vector-valued,
and we wish to predict its expectation E[Y | X] as accurately as possible. For binary prediction,
we have Y ∈ {0, 1}, so that E[Y | X] = P(Y = 1 | X); in the case of multiclass prediction problems,
it is easy to represent Y as an element of the k standard basis vectors {e1 , . . . , ek } ⊂ Rk , so that
p = E[Y | X] is simply the p.m.f. of Y given X with entries py = P(Y = y | X). We focus here,
therefore, on choosing functions to minimize the risk, or expected population loss,
R(f ) := E[ℓ(f (X), Y )].
When f is chosen from a collection F ⊂ {X → Rk } of functions, for example, to guarantee that we
can generalize, we do not expect to be able to perfectly minimize the population loss. Accordingly,
even though the loss is proper and hence minimized by f ⋆ (x) = E[Y | X = x], we cannot perfectly
model reality, and so it is unrealistic to expect to be able to find f satisfying f (x) = E[Y | X = x],
even approximately, for all x.
We therefore depart from the goal of perfection to address a somewhat simpler criterion: that
of calibration. Here, the idea is that a predictor should be accurate on average conditional on its
own predictions. Consider again a weather forecasting problem, where Yt = 1 indicates it rains on
day t and Yt = 0 indicates no rain, and we wish to predict Yt based on observable covariates Xt
at time t. While we would like a forecaster to have perfect predictions pt = E[Yt | Xt ], we instead
ask that on days where the forecaster makes a given prediction, it should rain (roughly) with that
given frequency. In particular, we seek calibration, which is that
f (X) = E[Y | f (X)]. (15.0.1)
That is, given that the forecaster makes a prediction with value p = f (X), we should have
E[Y | f (X) = p] = p.
While in general it is challenging to achieve this perfect calibration, in this chapter we investigate
several variants of the desideratum (15.0.1) that allow for more elegant statistical and information-
theoretic approaches, as well as procedures to achieve calibration.
2. Show how to measure it, specifically using partitioned methods. I think that parti-
tioned ones should be better than non-partitioned approaches, because we can estimate
the binned / partitioned calibration error
This is far too stringent a condition to be achievable, so that one relaxes to various forms of marginal
or average calibration. See the bibliographic notes for some discussion of the approaches here.
The second strand of research on calibration that, again, we do not address, considers more
adversarial and sequential settings, where instead of any probabilistic underpinnings, nature (an
adversary) plays a game against the player (or predictor). Philosophically, this approach elegantly
does away with the need for probabilities: there is a physical world where whether it rains tomorrow
is essentially deterministic, and we use probability as a crutch to model things we cannot measure,
so calibration means that of the days on which we predict rain with a chance of 50%, it rains on
roughly 50% of those days. In this sequential setting, at times t = 1, 2, . . . , T , the player makes
a prediction pt of the outcome, and then nature may choose the outcome Yt . Without giving the
player a bit more leeway, calibration is impossible: say that Y ∈ {0, 1}, and nature plays Yt = 1
if pt ≤ .5 and Yt = 0 if pt > .5. Then any player is miscalibrated by at least .5.
Astoundingly, Foster and Vohra [90] show that if the player is allowed to randomize, then the
forecasted probabilities pt can be made arbitrarily close to the empirical averages of the observed
Yt . While many of the techniques we consider and develop arise from this adversarial setting in the
literature, we shall mostly address the scenarios in which Y is indeed random.
Then by Theorems 14.2.1 and 14.2.15, there exists a convex function h such that
ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩ (15.1.1)
for all µ ∈ M, y ∈ Y. Recall the Bregman divergence (14.2.2)
Dh (u, v) = h(u) − h(v) − ⟨∇h(v), u − v⟩,
which is nonnegative for all convex h (and strictly positive whenever h is strictly convex and u ̸= v),
and Corollary 14.2.5. Then for any prediction function f , if we condition on the predicted value
S = f (X), then
E[ℓ(S, Y ) | S] = E[ℓ(E[Y | S], Y ) | S] + E[ℓ(S, Y ) − ℓ(E[Y | S], Y ) | S]
= E[ℓ(E[Y | S], Y ) | S] + E[Dh (E[Y | S], S) | S],
where we use the linearity E[ℓ(s, Y )] = ℓ(s, E[Y ]) for any distribution on Y and fixed s ∈ Rk in the
second equality. We record this as a theorem.
Theorem 15.1.1. Let ℓ be a proper loss with representation (15.1.1). Then for any f : X → Rk ,
E[ℓ(f (X), Y )] = E[ℓ(E[Y | f (X)], Y )] + E[Dh (E[Y | f (X)], f (X))].
In particular, the predictor g : Rk → Rk defined by
g(s) := E[Y | f (X) = s]
is calibrated and satisfies
E[ℓ(g ◦ f (X), Y )] = E[ℓ(E[Y | f (X)], Y )] ≤ E[ℓ(f (X), Y )],
and the inequality is strict whenever f is not calibrated and ℓ is strictly proper.
Proof The first statement we have already proved. For the second, note that
g(s) = E[Y | f (X) = s]
by construction of g, so that E[ℓ(g ◦ f (X), Y )] = E[ℓ(E[Y | f (X)], Y )]. The inequality and its
strictness are immediate because h is strictly convex if and only if ℓ is strictly proper.
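A small exact computation illustrates the theorem with the (proper) Brier score; all numbers below are toy choices, and every expectation is computed by enumeration rather than sampling:

```python
from collections import defaultdict

# Recalibrating f via g(s) = E[Y | f(X) = s] can only reduce a proper loss.
p_x = {0: 1/3, 1: 1/3, 2: 1/3}          # distribution of X
q = {0: 0.2, 1: 0.6, 2: 0.9}            # q(x) = P(Y = 1 | X = x)
f = {0: 0.3, 1: 0.3, 2: 0.8}            # a miscalibrated predictor

def brier_risk(pred):
    # E[(pred(X) - Y)^2], expanded exactly over X and Y in {0, 1}
    return sum(p_x[x] * (q[x] * (pred[x] - 1) ** 2 + (1 - q[x]) * pred[x] ** 2)
               for x in p_x)

# g(s) = E[Y | f(X) = s], computed by conditional averaging over X
num, den = defaultdict(float), defaultdict(float)
for x in p_x:
    num[f[x]] += p_x[x] * q[x]
    den[f[x]] += p_x[x]
g = {s: num[s] / den[s] for s in num}
gf = {x: g[f[x]] for x in p_x}          # the recalibrated predictor g o f

print(brier_risk(f), brier_risk(gf))    # the second risk is no larger
```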
To interpret this result, it essentially says that if we can post-process f to make it calibrated,
then we can only improve its risk, or expected loss, when ℓ is a proper loss. We can give an alter-
native version of Theorem 15.1.1, where we instead consider the conjugate linkages in Section 14.3,
which can be useful when we wish to find f via convex optimization (instead of by directly min-
imizing a proper loss). To that end, assume that h is a strictly convex function, differentiable on
the interior of its domain, satisfying the Legendre conditions (14.3.3), and define the surrogate loss
(linked via duality and the negative generalized entropy h to ℓ)
φ(s, y) = h∗ (s) − ⟨s, y⟩ = ℓ(predh (s), y),
where ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩ and
predh (s) = argmin{−⟨s, µ⟩ + h(µ)} = ∇h∗ (s).
µ
Then we have the following decomposition of the population surrogate loss, which follows similarly
to Theorem 15.1.1.
Theorem 15.1.2. Let φ be the surrogate loss defined above. Then for any f : X → Rk , we have
Proof The key is to rely on the duality relationships inherent in the definition of the surrogate
φ(s, y) = h∗ (s) − ⟨s, y⟩. We fix x and work exclusively in the space of the scores (predictions)
s = f (x) ∈ Rk , as
E[φ(f (X), Y ) | X = x] = φ(f (x), E[Y | X = x])
by definition. Let µ ∈ M = Conv(Y). Then φ(s, µ) = h∗ (s) − ⟨s, µ⟩, and
inf_{s′} φ(s′ , µ) = − sup_{s′} {⟨s′ , µ⟩ − h∗ (s′ )} = −h∗∗ (µ) = −h(µ)
because h is (closed) convex. Additionally, if µ∗ (s) = ∇h∗ (s) = predh (s), then the conjugate duality
relationships (14.1.4) guarantee h∗ (s) = ⟨s, µ∗ (s)⟩ − h(µ∗ (s)) and s = ∇h(µ∗ (s)). Thus
φ(s, µ) − inf_{s′} φ(s′ , µ) = h∗ (s) − ⟨s, µ⟩ + h(µ) = h(µ) − h(µ∗ (s)) − ⟨s, µ − µ∗ (s)⟩
= h(µ) − h(µ∗ (s)) − ⟨∇h(µ∗ (s)), µ − µ∗ (s)⟩ = Dh (µ, µ∗ (s)).
Taking the expectation over X and using the shorthand S = f (X), we thus obtain
Lastly, we use that ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩ is proper, so inf s φ(s, µ) = −h(µ) = ℓ(µ, µ),
giving the first claim of the theorem.
As in Theorem 15.1.1, Theorem 15.1.2 shows that calibrating a predictor f can only improve
the surrogate loss associated with h. Any predictor f : X → Rk has unnecessary error arising from
the average divergence of the prediction from being calibrated,
In both cases, we see that any proper (or derived proper) loss has a natural decomposition into
an error term relating to the typical error in predicting Y from E[Y | f (X)], which one frequently
refers to as sharpness of the predictor. Replacing f (X) with the expectation of Y given f (X) (or a
particular transformation thereof) does not increase this first term, but improves the second term,
which measures the typical error of a prediction from calibration.
Let us consider an example with squared error:
Example 15.1.3 (Squared error and calibration): In the case that h(p) = (1/2)∥p∥22 , we have h∗ = h and ∇h = ∇h∗ is the identity. Then Theorems 15.1.1 and 15.1.2 reduce to the statement that
E[∥Y − f (X)∥22 ] = E[∥Y − E[Y | X]∥22 ] + E[∥E[Y | X] − f (X)∥22 ],
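This decomposition can be checked exactly on a two-point example (toy numbers; all expectations are computed by enumeration):

```python
# Exact check of the squared-error decomposition in Example 15.1.3 for binary Y.
p_x = {0: 0.5, 1: 0.5}      # distribution of X
q = {0: 0.25, 1: 0.75}      # E[Y | X = x] for Bernoulli Y
f = {0: 0.4, 1: 0.6}        # an arbitrary predictor

# E[(Y - f(X))^2], enumerating X and Y in {0, 1}
lhs = sum(p_x[x] * (q[x] * (1 - f[x]) ** 2 + (1 - q[x]) * f[x] ** 2) for x in p_x)
noise = sum(p_x[x] * q[x] * (1 - q[x]) for x in p_x)       # E[(Y - E[Y|X])^2]
miscal = sum(p_x[x] * (q[x] - f[x]) ** 2 for x in p_x)     # E[(E[Y|X] - f(X))^2]
print(lhs, noise + miscal)  # equal
```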
This result requires some delicate measure-theoretic arguments, so we defer it to the technical proofs (see Section 15.6.1). The discontinuity of ece is relatively easy to show, however, even in very simple cases.
Example 15.2.2 (Discontinuity of the calibration error): Let Y ∈ {0, 1} be a Bernoulli random variable and let X ∈ {0, 1} be uniform; take Y = X with probability 1. Then the predictor that always predicts 1/2 is perfectly calibrated, but if for ϵ ∈ (0, 1/2] we define fϵ by
fϵ (0) = 1/2 − ϵ and fϵ (1) = 1/2 + ϵ,
then we see that ece(fϵ ) = 1/2 − ϵ, while ece(f0 ) = 0. Certainly fϵ → f0 in any Lp distance on functions, while limϵ→0 ece(fϵ ) = 1/2. 3
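A direct computation in this example (a sketch; taking X uniform on {0, 1} is the assumption that makes the constant predictor calibrated) shows the discontinuity numerically:

```python
# ece(f_eps) = 1/2 - eps for eps > 0, yet the constant predictor f_0 has ece 0.
def ece(eps):
    # X uniform on {0, 1}, Y = X; f_eps(x) = 1/2 - eps if x == 0 else 1/2 + eps
    if eps == 0:
        # f_0 is constant, so E[Y | f(X)] = E[Y] = 1/2 = f_0: perfectly calibrated
        return 0.0
    # for eps > 0 the two scores are distinct, so E[Y | f(X) = s] recovers X
    return 0.5 * abs(0 - (0.5 - eps)) + 0.5 * abs(1 - (0.5 + eps))

print([ece(e) for e in (0.25, 0.1, 0.01, 0.0)])
```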
Zi = (f (Xi ), Yi )
drawn i.i.d.; the coming lower bound holds if X = [0, 1] and f (x) = x, so in many cases, observing
X is of no help. Recall the (worst-case) test risk from Section 13.3, that for the testing problem
between classes H0 : P ∈ P0 and H1 : P ∈ P1 of distributions,
Because we consider the function f fixed and ask only whether we can evaluate its calibration error
under an (unknown) distribution P , we denote the expected calibration error of f under P via
eceP (f ) = EP [∥EP [Y | f (X)] − f (X)∥]. We thus consider testing perfect calibration H0 : ece(f ) =
0 against alternatives H1 : ece(f ) ≥ γ of miscalibration for γ > 0, defining
Theorem 15.2.3. Let f : X → [0, 1] be a predictor of Y ∈ {0, 1}. Assume for some 0 < c < 1/2 that f (X ) ∩ [c, 1 − c] has cardinality at least N . Then there is a distribution P0 such that eceP0 (f ) = 0 and for any 0 < γ ≤ c,
inf_Ψ Rn (Ψ | {P0 }, Pγ ) ≥ 1 − nγ² / (2√N c(1 − c)).
Before proving Theorem 15.2.3, we note the following immediate corollary; part (ii) follows from
part (i), which follows by taking N ↑ ∞ in the theorem.
Corollary 15.2.4. Let the conditions of Theorem 15.2.3 hold and let P0 = {P | eceP (f ) = 0}.
(i) If there exists 0 < c < 1/2 such that f (X ) ∩ [c, 1 − c] has infinite cardinality, then P0 is non-empty and for any 0 < γ ≤ c,
lim inf_n inf_Ψ Rn (Ψ | P0 , Pγ ) = 1.
(ii) If there exists a neighborhood U of 1/2 such that U ⊂ f (X ), then P0 is non-empty and for any γ < 1/2, the minimax test risk satisfies
In brief, no test exists that is better than random guessing for testing between
given access to the predictions f (Xi ) and observed outcomes Yi . The theorem and corollary apply to
binary prediction models with Y ∈ {0, 1}, but the results immediately extend to more complicated
prediction problems where Y is vector-valued or multiclass.
Proof The proof relies on the convex hull testing lower bound from Proposition 13.3.1. Without
loss of generality, we can assume that X ⊂ [0, 1] and that f (x) = x by transforming the input
space. Let S = f (X) be the (random) scores that f outputs.
We first construct the perfectly calibrated distribution P0 and miscalibrated family Pγ . Define
the distribution P0 so that S is uniform on distinct points s1 , . . . , sN ∈ [c, 1 − c] and Y | S = s ∼
Bernoulli(s), that is, given S = s, Y = 1 with probability s and Y = 0 with probability 1 − s. By
construction, eceP0 (f ) = 0. To construct the particular members of the alternative family Pγ , for
each j ∈ [N ], define the “tilting” function
ϕj (y, s) := ( y/sj − (1 − y)/(1 − sj ) ) 1{s = sj } .
Then E0 [ϕj (Y, S)] = 0, while
Var0 (ϕj (Y, S)) = (1/N ) E0 [ (Y /sj − (1 − Y )/(1 − sj ))² | S = sj ] = (1/N )(1/sj + 1/(1 − sj )) = 1/(N sj (1 − sj )).
Note that |ϕj (y, s)| ≤ 1/c as c < 1/2, and if we define the vector ϕ(y, s) = (ϕ1 (y, s), . . . , ϕN (y, s)), then ∥ϕ(y, s)∥0 ≤ 1 (that is, the number of non-zero entries is at most 1). Now as γ ∈ [0, c], for each v ∈ {−1, 1}N we may define the tilted distribution Pv with
Pv (Y = y, S = s) = (1 + γ⟨v, ϕ(y, s)⟩) P0 (Y = y, S = s),
which is a valid distribution whenever γ ≤ c, as |⟨v, ϕ(y, s)⟩| ≤ 1/c.
for distributions P ∈ {Pv }. Noting that S is still uniform on {s1 , . . . , sN } under Pv , we have
Ev [Y | S = sj ] = sj + γvj E0 [ϕj (Y, sj )Y | S = sj ] = sj + γvj ,
and so ecePv (f ) = (1/N ) Σ_{j=1}^N γ|vj | = γ. In particular, we have Pv ∈ Pγ .
Lastly, we compute a bound on the testing error. For this, we recall Lemma 13.2.3. Letting P̄ n = 2^{−N} Σ_v Pvn , we have
Dχ2 (P̄ n ||P0n ) + 1 = 2^{−2N} Σ_{v,v′} E0 [(1 + γ⟨v, ϕ(Y, S)⟩)(1 + γ⟨v′ , ϕ(Y, S)⟩)]^n
= 2^{−2N} Σ_{v,v′} (1 + γ² v⊤ Cov0 (ϕ(Y, S))v′ )^n
because the sampling is i.i.d. By our variance calculation for ϕ and that each ϕj has disjoint
support, we have Cov0 (ϕ(Y, S)) = (1/N ) diag([1/(sj (1 − sj ))]_{j=1}^N ), and so
Dχ2 (P̄ n ||P0n ) + 1 = E[ (1 + (γ²/N ) Σ_{j=1}^N Vj Vj′ /(sj (1 − sj )))^n ] ≤ E[ exp( (nγ²/N ) Σ_{j=1}^N Vj Vj′ /(sj (1 − sj )) ) ],
where the expectation is over V, V′ ∼ Uniform({±1}N ) i.i.d. But of course the Vj Vj′ are i.i.d. random signs, and hence 1-sub-Gaussian, so that
Dχ2 (P̄ n ||P0n ) + 1 ≤ exp( (n²γ⁴/(2N²)) Σ_{j=1}^N 1/(sj²(1 − sj )²) ) ≤ exp( n²γ⁴/(2N c²(1 − c)²) ).
and complete (continuing the analogy, that everything true can be proved), meaning that
We begin by considering types of distance to calibration. Let C(P ) denote those functions g
that are perfectly calibrated for P , that is, C(P ) = {g : X → Rk | EP [Y | g(X)] = g(X)} (where
the defining equality holds with P -probability 1 over X). The set C(P ) always contains at least the constant function g(X) = EP [Y ] and so is non-empty (but is typically larger). Then we call the
minimum L1 (P ) distance of a function f to the set C(P ) the distance to calibration
It is not always clear how to estimate the distance dcal (f ), which can make it challenging to use.
We also consider a complementary quantity that relies on an alternative variational character-
ization. Let W ⊂ {Rk → Rk } be a symmetric collection of functions, meaning that w ∈ W implies
−w ∈ W. We can view any such collection as potential witnesses of miscalibration, in that
and so if w can “witness” the portions of space where f (X) ̸≈ E[Y | f (X)], it can certify miscali-
bration. We then arrive at what we term the calibration error relative to the class W,
Depending on the class W, this is sometimes called the weak calibration error, and with large
enough classes, we can recover the classical expected calibration error (15.2.1).
Example 15.2.5 (Recovering expected calibration error): For a norm ∥·∥, let the set W be the collection of all functions w with bound sup_s ∥w(s)∥_* ≤ 1. Then

CE(f, W) = E[ sup_{∥w∥_*≤1} ⟨w, E[Y | f(X)] − f(X)⟩ ] = E[ ∥E[Y | f(X)] − f(X)∥ ] = ece(f),

recovering the expected calibration error (15.2.1). ◁
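As a numerical sanity check of this identity in the binary case k = 1 (a sketch; the discrete distribution below is invented for illustration), the optimal bounded witness is w(s) = sign(E[Y | f(X) = s] − s), and the variational form recovers the expected calibration error exactly:

```python
import numpy as np

# Toy binary setup (invented): f(X) takes four values s with P(f(X) = s) = 1/4,
# and Y | f(X) = s is Bernoulli(p(s)).
scores = np.array([0.1, 0.3, 0.6, 0.9])   # possible values of f(X)
p_true = np.array([0.2, 0.3, 0.5, 0.95])  # E[Y | f(X) = s]
probs = np.full(4, 0.25)                  # P(f(X) = s)

# Direct formula: ece(f) = E[|E[Y | f(X)] - f(X)|].
ece = np.sum(probs * np.abs(p_true - scores))

# Witness form: sup over |w(s)| <= 1 of E[w(f(X)) (Y - f(X))].  The supremum
# is attained at w(s) = sign(E[Y | f(X) = s] - s), and equals the ece.
w_opt = np.sign(p_true - scores)
ce_witness = np.sum(probs * w_opt * (p_true - scores))

assert np.isclose(ece, ce_witness)
```

Any witness with |w(s)| ≤ 1 gives a lower bound on the ece; the sign witness makes the bound tight.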
It is more interesting to consider restricted classes; one of particular interest to us is that of bounded Lipschitz functions. Let

W_∥·∥ := { w : R^k → R^k | ∥w(s₀) − w(s₁)∥_* ≤ ∥s₀ − s₁∥ and ∥w(s)∥_* ≤ 1 for all s, s₀, s₁ } (15.2.5)

denote the collection of functions bounded by 1 in ∥·∥_* and that are 1-Lipschitz with respect to ∥·∥. Then (as we see presently) we can at least estimate the calibration error relative to the class W in the definition (15.2.4).
The final calibration measure we consider reposes on the idea of quantizing or partitioning the output space, which relates to the idea of “binning” predictions that the literature on calibration frequently considers. Here, we consider averages of Y conditioned on predictions in larger sets. Thus, instead of evaluating the precise conditioning E[Y | f(X)], we look instead at the expectation of Y conditional on f(X) ∈ A for a set A, so that a predicted score is (nearly) calibrated if the diameter diam(A) is small and E[Y | f(X) ∈ A] ≈ s for some s ∈ A. Given a partition A of the space M = Conv(Y), it is then natural to evaluate the average error for each element of A (weighting by the probability of A), and consider the calibration error (15.2.4) for indicator functions of A ∈ A, where we abuse notation slightly to define

CE(f, A) := Σ_{A∈A} ∥E[(f(X) − Y) 1{f(X) ∈ A}]∥ = Σ_{A∈A} ∥E[f(X) − Y | f(X) ∈ A]∥ P(f(X) ∈ A).
Indeed, taking a supremum over all such partitions gives sup_A CE(f, A) = E[∥E[Y | f(X)] − f(X)∥], the original expected calibration error (15.2.1). Additionally, and here we elide details, if f(X) is a continuous random variable with suitably nice density and A_n denotes any partition satisfying diam(A) ≤ 1/n for A ∈ A_n, then lim_n CE(f, A_n) = E[∥E[Y | f(X)] − f(X)∥]. Instead of considering CE(f, A) directly, we optimize over all partitions, but penalize the average size of elements of A, giving the partitioned calibration error

pce(f) := inf_A { CE(f, A) + Σ_{A∈A} diam(A) P(f(X) ∈ A) }. (15.2.6)
Corollary 15.2.6. Let Y ⊂ R^k have finite diameter and ∥·∥ be any norm. Then each of the calibration measures dcal, CE(·, W_∥·∥), and pce in definitions (15.2.3), (15.2.4), and (15.2.6) is sound and complete (15.2.2). Additionally, let Y = {e₁, ..., e_k} and ∥·∥ = ∥·∥₁ be the ℓ₁-norm. Then for any f : X → M = Conv(Y),

(1/2) CE(f, W_∥·∥) ≤ dcal(f) ≤ CE(f, W_∥·∥) + 2 √(k CE(f, W_∥·∥))

and

dcal(f) ≤ pce(f) ≤ dcal(f) + 2 √(k dcal(f)).
Corollary 15.2.6 will come as a consequence of the deeper development we pursue in Section 15.5. Here, we take Corollary 15.2.6 as motivation to give the typical type of result that justifies calibration estimates. Because the calibration measures are all roughly equivalent (except ece), measuring any of them on a sample can provide evidence for or against calibration of a predictor f. We focus
on the simpler binary case in which f : X → [0, 1] and let W_Lip be the bounded Lipschitz functions w : [0, 1] → [−1, 1]. Given a sample (X₁^n, Y₁^n), the empirical variant of CE(f, W) is

ĈE_n(f) := sup_{∥w∥_∞ ≤ 1} { (1/n) Σ_{i=1}^n w_i (Y_i − f(X_i)) s.t. |w_i − w_j| ≤ |f(X_i) − f(X_j)| for i, j ≤ n }.
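Because the objective and constraints are linear in (w₁, ..., w_n), this empirical quantity is a linear program. The following sketch computes it (assuming scipy is available; the sample is invented, and on the real line it suffices to impose the Lipschitz constraints between neighbors in sorted score order, since the remaining constraints follow by telescoping):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

# Invented sample: scores s_i = f(X_i) and binary outcomes from a shifted truth.
n = 40
s = rng.uniform(0, 1, n)
y = (rng.uniform(0, 1, n) < np.clip(s + 0.2, 0, 1)).astype(float)
e = y - s  # residuals Y_i - f(X_i)

# Maximize (1/n) sum_i w_i e_i subject to |w_i| <= 1 and the Lipschitz
# constraints |w_i - w_j| <= |f(X_i) - f(X_j)| (adjacent sorted pairs suffice).
order = np.argsort(s)
gaps = np.diff(s[order])
A_ub = np.zeros((2 * (n - 1), n))
for r, (i, j) in enumerate(zip(order[:-1], order[1:])):
    A_ub[2 * r, i], A_ub[2 * r, j] = 1.0, -1.0        #   w_i - w_j <= gap
    A_ub[2 * r + 1, i], A_ub[2 * r + 1, j] = -1.0, 1.0  # w_j - w_i <= gap
b_ub = np.repeat(gaps, 2)
res = linprog(-e / n, A_ub=A_ub, b_ub=b_ub, bounds=[(-1, 1)] * n, method="highs")
ce_hat = -res.fun
print("empirical weak calibration error:", ce_hat)
```

The constant witness w ≡ sign(mean residual) is always feasible, so ĈE_n(f) is at least the absolute average residual; the LP value can only be larger.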
By combining uniform covering bounds for the class of Lipschitz functions with a standard concentration inequality, we then have the following convergence guarantee for ĈE_n.
Proposition 15.2.7. There exists a numerical constant C such that for any δ > 0,

ĈE_n(f) − CE(f, W_Lip) ≤ C n^{−1/3} √(log(n/δ))

with probability at least 1 − δ.
Proof Fix ϵ > 0 and let N(ϵ) be a minimal ϵ-cover of the set W_Lip in the uniform norm, meaning that for each w ∈ W_Lip there exists w^(j) ∈ N(ϵ) with ∥w − w^(j)∥_∞ ≤ ϵ, and let N(ϵ) be its (minimal) cardinality. Then log N(ϵ) ≲ (1/ϵ) log(1/ϵ) (recall Proposition 10.2.3 and Eq. (10.2.4)). For shorthand, let the error vector E ∈ [−1, 1]^n have entries E_i = Y_i − f(X_i), and abusing notation, for w ∈ W_Lip let ⟨w, E⟩_n = (1/n) Σ_{i=1}^n w(f(X_i)) E_i. Then, because for any w ∈ W_Lip there is an element of N(ϵ) within uniform distance ϵ,

ĈE_n(f) − CE(f, W_Lip) ≤ sup_{w∈W_Lip} |⟨w, E⟩_n − E[⟨w, E⟩_n]| ≤ max_{w∈N(ϵ)} |⟨w, E⟩_n − E[⟨w, E⟩_n]| + 2ϵ,

and each term in the maximum concentrates by the Azuma-Hoeffding inequality and a union bound. Take ϵ = n^{−1/3} and t = C n^{−1/3} √(log(n/δ)) for a sufficiently large numerical constant C to obtain the result.
Summarizing, while the expected calibration error is fundamentally inestimable, there are alter-
native measures that are both sound and complete, and they can admit reasonable estimators. As
the class size k grows, however, it can become statistically infeasible to estimate the calibration of
predictors f , so that one must consider alternative metrics. The exercises and bibliography explore
these questions in more detail.
into an average loss and an expected divergence between f (X) and E[Y | f (X)], where h is the
negative (generalized) entropy (14.1.6) associated with the loss ℓ, so that the loss has representation
ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩. This suggests an approach to improving a predictor f : X → Rk
without compromising its average loss: make it closer to being calibrated, so that E[Y | f (X)] ≈
f (X). Here, we make this idea precise by using the weak calibration (15.2.4): if there exists a
witness function w certifying that E[⟨w(f (X)), Y − f (X)⟩] ≫ 0, then we can post-process f to
f (X) + ηw(f (X)) for some stepsize η > 0 and only improve the expected loss. We first develop
the idea in the context of the squared error, where the calculations are cleanest, and extend it to
general proper losses based on convex conjugates (as in Section 14.3) immediately after. Combining
the ideas we develop, we also provide a (population-level) algorithm to transform a function f
by post-processing its outputs that guarantees the result is nearly calibrated relative to a class
W of witnesses. This provides an algorithmic proof quantitatively relating the calibration error
CE(f, W) relative to a class W to the improvement achievable in minimizing E[ℓ(f (X), Y )] by
post-composition g ◦ f .
15.3.1 The post-processing gap and calibration audits for squared error
Consider a thought experiment: instead of using f to make predictions, we use a postprocessing g ◦ f, where g : R^k → R^k has the (suggestively chosen) form g(v) = v + w(v), where w(v) = (g(v) − v). Then using the representation ℓ(µ, y) = −h(µ) − ⟨∇h(µ), y − µ⟩ for the proper loss, we recall Theorem 15.1.1 and for µ(f(X)) := E[Y | f(X)] expand

E[ℓ(g ◦ f(X), Y)] = E[ℓ(E[Y | f(X)], Y)] + E[D_h(E[Y | f(X)], f(X) + w(f(X)))], (15.3.1)

where the final equality uses the linearity of y ↦ ℓ(µ, y). We have decomposed the expected loss E[ℓ(g ◦ f(X), Y)] into a term that post-processing does not change, which measures the sharpness with which E[Y | f(X)] predicts Y, and a divergence term D_h measuring the error in calibration of g ◦ f(X) = f(X) + w(f(X)) for E[Y | f(X)].
The expansion (15.3.1) points toward an ability to postprocess any prediction function f : X → R^k to both (i) obtain calibration relative to a class of functions W, as in definition (15.2.4), and (ii) improve the expected loss E[ℓ(f(X), Y)]. Moreover, this improvement is monotone, in that changes “toward” calibration guarantee smaller expected loss, an improvement over the less refined results in Theorems 15.1.1 and 15.1.2. To that end, define the post-processing gap for the (proper) loss ℓ and function f relative to the class W of functions R^k → R^k by

gap(ℓ, f, W) := E[ℓ(f(X), Y)] − inf_{w∈W} E[ℓ(f(X) + w(f(X)), Y)]. (15.3.2)
The gap (15.3.2) is fundamentally tied to the calibration error relative to the class W.
We specialize here to the simpler case of the squared error, as the statements are most transpar-
ent. We focus exclusively on symmetric convex collections of functions W, meaning that if w ∈ W,
then −w ∈ W, and W is convex.
Proposition 15.3.1. Let ℓ(µ, y) = (1/2)∥y − µ∥₂² be the squared error (Brier score), and let W be a symmetric convex collection of functions, each 1-Lipschitz with respect to the ℓ₂-norm ∥·∥₂. Define R²(f) = sup_{w∈W} E[∥w(f(X))∥₂²]. Then

(1/2) min{ CE(f, W), CE(f, W)²/R²(f) } ≤ gap(ℓ, f, W) ≤ CE(f, W).
Proof Fix x and let µ = E[Y | f(X) = f(x)] ∈ Conv(Y) and w = w(f(x)) be a potential update to f(x). Then because ℓ(µ, y) = (1/2)∥µ − y∥₂², for any y ∈ Y

ℓ(µ, y) + ⟨∇ℓ(µ, y), w⟩ + (1/2)∥w∥₂² = ℓ(µ + w, y).

Recognizing that ∇ℓ(µ, y) = (µ − y), for any w ∈ W we therefore have

−E[⟨f(X) − Y, w(f(X))⟩] − (1/2) E[∥w(f(X))∥₂²] ≤ E[ℓ(f(X), Y)] − E[ℓ(f(X) + w(f(X)), Y)] ≤ −E[⟨f(X) − Y, w(f(X))⟩].

Taking suprema over w on each side of the preceding inequalities and using the symmetry of W gives

sup_{w∈W} { E[⟨f(X) − Y, w(f(X))⟩] − (1/2) E[∥w(f(X))∥₂²] } ≤ gap(ℓ, f, W) ≤ sup_{w∈W} E[⟨f(X) − Y, w(f(X))⟩].

Because CE(f, W) = sup_{w∈W} E[⟨f(X) − Y, w(f(X))⟩], we can use the convexity of W and the definition R²(f) := sup_{w∈W} E[∥w(f(X))∥₂²] to see that for any η ∈ [0, 1], we may replace w with η·w ∈ W, and we have

sup_{η∈[0,1]} { η CE(f, W) − (η²/2) R²(f) } ≤ gap(ℓ, f, W) ≤ CE(f, W).

Setting η = min{1, CE(f, W)/R²(f)} gives the claimed lower bound.
As an immediate corollary, we see that if W = W∥·∥2 consists of the 1-Lipschitz functions with
∥w(·)∥2 ≤ 1, we have a cleaner guarantee.
Corollary 15.3.2. Let W = W_∥·∥₂ and the conditions of Proposition 15.3.1 hold. Then

CE(f, W)² / (2 diam(Y)²) ≤ gap(ℓ, f, W) ≤ CE(f, W).
Thus, the calibration error upper and lower bounds the gap between the expected loss of f and a
post-processed version of f . This yields a nearly operational interpretation of the calibration error
relative to the class W: it is, to within a square, exactly the amount we could improve the expected
loss of the function f by postprocessing f itself.
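For the squared error, the quadratic expansion in the proof is an exact identity, which the following sketch verifies numerically (the data and the particular witness w(v) = v, which is 1-Lipschitz and bounded by 1 on [0, 1], are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented binary data: scores in [0, 1] with inflated conditional means.
n = 500
s = rng.uniform(0, 1, n)
y = (rng.uniform(0, 1, n) < np.clip(1.2 * s, 0, 1)).astype(float)

def sq_loss(pred):
    return 0.5 * np.mean((pred - y) ** 2)

w = s  # witness values w(f(X_i)) for the witness w(v) = v
ce_w = np.mean(w * (y - s))   # E[<w(f(X)), Y - f(X)>] for this single witness
r2_w = np.mean(w ** 2)        # E[||w(f(X))||^2]

for eta in (0.1, 0.5, 1.0):
    improvement = sq_loss(s) - sq_loss(s + eta * w)
    predicted = eta * ce_w - 0.5 * eta ** 2 * r2_w
    # For squared error the second-order expansion is exact, so these agree.
    assert np.isclose(improvement, predicted)
```

Maximizing the predicted improvement over η ∈ [0, 1] recovers the lower bound sup_η { η CE − (η²/2) R² } from the proof of Proposition 15.3.1.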
and we may transform arbitrary scores s ∈ R^k to predictions via the conjugate link (14.3.1), that is,

pred_h(s) = argmin_µ { −⟨s, µ⟩ + h(µ) } = ∇h*(s).
So long as h is appropriately smooth, these satisfy ℓ(predh (s), y) = φ(s, y). In complete analogy
with the post-processing gap (15.3.2) when we assume f makes predictions in (the affine hull of)
Y, we can define the surrogate post-processing gap
gap(φ, f, W) := E[φ(f(X), Y)] − inf_{w∈W} E[φ(f(X) + w(f(X)), Y)]. (15.3.3)
In spite of the similarity with definition (15.3.2), the actual predictions of Y from f in this case
come via the link predh (f (X)). Thus, in this case we instead consider the calibration error relative
to a class W but after the composition of f with predh = ∇h∗ , so that
CE(pred_h ◦ f, W) = sup_{w∈W} E[⟨w(f(X)), Y − pred_h(f(X))⟩] = sup_{w∈W} E[⟨w(f(X)), Y − ∇h*(f(X))⟩],
where as always we assume that the class of witness functions satisfies W = −W. When the
prediction function is continuous enough in s, we can give an analogue of Proposition 15.3.1 to the
more general surrogate case. To that end, we assume that the conjugate h∗ has Lipschitz continuous
gradient with respect to the dual norm ∥·∥_*, meaning that

∥∇h*(s₀) − ∇h*(s₁)∥ ≤ ∥s₀ − s₁∥_*

for all s₀, s₁ ∈ R^k. This is equivalent (see Proposition C.2.6) to the negative entropy h being strongly convex with respect to the norm ∥·∥, and also immediately implies that

φ(s + w, y) ≤ φ(s, y) + ⟨∇_s φ(s, y), w⟩ + ∥w∥_*²/2.
Example 15.3.3 (Multiclass logistic regression): For multiclass logistic regression, where we take h(p) = Σ_{j=1}^k p_j log p_j, we know that h is strongly convex with respect to the ℓ₁ norm (this is Pinsker's inequality; see inequality (2.2.11)). Thus, the conjugate h*(s) = log(Σ_{j=1}^k e^{s_j}) has Lipschitz gradient with respect to the ℓ∞ norm, meaning that for the prediction link

pred_h(s) = [ e^{s_y} / Σ_{j=1}^k e^{s_j} ]_{y=1}^k,

we have

∥pred_h(s) − pred_h(s′)∥₁ ≤ ∥s − s′∥_∞

for all s, s′ ∈ R^k. ◁
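A quick numerical probe of this contraction property (a sketch; random score pairs only spot-check, not prove, the inequality):

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(s):
    # pred_h(s) = gradient of log-sum-exp, computed stably
    z = np.exp(s - s.max())
    return z / z.sum()

# Check ||pred_h(s) - pred_h(s')||_1 <= ||s - s'||_inf on random pairs.
for _ in range(1000):
    s = rng.normal(size=5)
    t = s + rng.normal(scale=0.5, size=5)
    lhs = np.abs(softmax(s) - softmax(t)).sum()
    rhs = np.abs(s - t).max()
    assert lhs <= rhs + 1e-12
```

The inequality is the dual pairing from the example: strong convexity of the negative entropy in ℓ₁ yields a softmax that is 1-Lipschitz from ℓ∞ to ℓ₁.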
Example 15.3.4 (The squared error): When we measure the error of predictions in R^k by the squared ℓ₂-norm (1/2)∥f(x) − y∥₂², this corresponds to the generalized negative entropy h(µ) = (1/2)∥µ∥₂². In this case, the norms ∥·∥ = ∥·∥₂ = ∥·∥_* coincide, and we have the self-duality h* = h, so that the prediction mapping pred_h is the identity. ◁
With these examples as motivation, we then have the following generalization of Proposi-
tion 15.3.1.
Proposition 15.3.5. Let the negative generalized entropy h be strongly convex with respect to the norm ∥·∥ and consider the surrogate loss φ(s, y) = h*(s) − ⟨s, y⟩. Define R_*²(f) := sup_{w∈W} E[∥w(f(X))∥_*²]. Then

(1/2) min{ CE(pred_h ◦ f, W), CE(pred_h ◦ f, W)²/R_*²(f) } ≤ gap(φ, f, W) ≤ CE(pred_h ◦ f, W).
Proof Fix x and let s = f(x) and w = w(f(x)), and notice that for any y we have

φ(s, y) + ⟨∇φ(s, y), w⟩ ≤ φ(s + w, y) ≤ φ(s, y) + ⟨∇φ(s, y), w⟩ + (1/2)∥w∥_*².

Recognizing that ∇φ(s, y) = ∇h*(s) − y, for any w ∈ W we have

−E[⟨∇φ(f(X), Y), w(f(X))⟩] − (1/2) E[∥w(f(X))∥_*²] ≤ E[φ(f(X), Y)] − E[φ(f(X) + w(f(X)), Y)] ≤ −E[⟨∇φ(f(X), Y), w(f(X))⟩].

Taking suprema over w on each side and using the symmetry of W gives

sup_{w∈W} { E[⟨∇h*(f(X)) − Y, w(f(X))⟩] − (1/2) E[∥w(f(X))∥_*²] } ≤ gap(φ, f, W) ≤ sup_{w∈W} E[⟨∇h*(f(X)) − Y, w(f(X))⟩].

Because CE(pred_h ◦ f, W) = sup_{w∈W} E[⟨∇h*(f(X)) − Y, w(f(X))⟩], we can use the convexity of W and the definition R_*²(f) := sup_{w∈W} E[∥w(f(X))∥_*²] to see that for any η ∈ [0, 1], we may replace w with η·w ∈ W and

sup_{η∈[0,1]} { η CE(pred_h ◦ f, W) − (η²/2) R_*²(f) } ≤ gap(φ, f, W) ≤ CE(pred_h ◦ f, W).

Set η = min{1, CE(pred_h ◦ f, W)/R_*²(f)}.
A corollary specializing to the case of bounded witness functions allows a somewhat cleaner
statement, in analogy with Corollary 15.3.2. It provides the same operational interpretation: the
calibration error CE(f, W) of f relative to W upper and lower bounds improvement possible through
postprocessing f .
Corollary 15.3.6. Let the conditions of Proposition 15.3.5 hold, and additionally assume that the witness functions W satisfy ∥w(s)∥_* ≤ 1 for all s ∈ R^k. Then

CE(pred_h ◦ f, W)² / (2 diam(dom h)²) ≤ gap(φ, f, W) ≤ CE(pred_h ◦ f, W).
We can give an alternative perspective for this section by focusing on the definitions (15.3.2)
and (15.3.3) of the post-processing gap. Suppose we have a proper loss ℓ and we wish to improve the
expected loss of a predictor f by post-processing f . When there is little to be gained by replacing
f with an adjusted version f (x) + w(f (x)) for some w ∈ W, then f must be calibrated with respect
to the class W. So, for example, for a surrogate φ, the function f (really, its associated prediction
function pred_h ◦ f) is calibrated with respect to W if and only if E[φ(f(X), Y)] ≤ E[φ(f(X) + w(f(X)), Y)] for all w ∈ W.
As a particular special case to close this section, the standard multiclass logistic loss provides
a clean example.
Example 15.3.7 (Multiclass logistic losses, continued): Let h be the negative entropy h(p) = Σ_{j=1}^k p_j log p_j restricted to the probability simplex Δ_k = {p ∈ R₊^k | ⟨1, p⟩ = 1} and the surrogate φ(s, y) = log(Σ_{j=1}^k e^{s_j}) − s_y. Then for any class W consisting of functions with ∥w(s)∥_∞ ≤ 1 for all s ∈ R^k and any function f : X → R^k,

(1/2) CE(pred_h ◦ f, W)² ≤ E[φ(f(X), Y)] − inf_{w∈W} E[φ(f(X) + w(f(X)), Y)].
Theorem 15.3.8. Assume that the surrogate loss φ is nonnegative and that the class of witnesses W satisfies R_* := sup_s ∥w(s)∥_* < ∞. Then the algorithm in Figure 15.1 guarantees that

min_{τ<t} CE(pred_h ◦ f_τ, W) ≤ √( 2 R_*² E[φ(f₀(X), Y)] / t ),

and in particular terminates with CE(pred_h ◦ f_t, W) ≤ ϵ for some t ≤ 2 R_*² E[φ(f₀(X), Y)]/ϵ².

Proof The smoothness of h* implies that for any stepsize η and witness w,

E[φ(f(X) + ηw(f(X)), Y)] ≤ E[φ(f(X), Y)] + η E[⟨w(f(X)), ∇h*(f(X)) − Y⟩] + (η²/2) E[∥w(f(X))∥_*²].
[Figure 15.1: the iterative recalibration algorithm, whose final step (iv) terminates once CE(pred_h ◦ f_t, W) ≤ ϵ.]
Taking w ∈ W to achieve the supremum defining CE(pred_h ◦ f, W) and replacing w with −w via the symmetry of W, this gives

E[φ(f(X) − ηw(f(X)), Y)] ≤ E[φ(f(X), Y)] − η CE(pred_h ◦ f, W) + (η²/2) R_*².

Choose η_f = CE(pred_h ◦ f, W)/R_*² to obtain

E[φ(f(X) − η_f w(f(X)), Y)] ≤ E[φ(f(X), Y)] − (1/2) CE(pred_h ◦ f, W)²/R_*². (15.3.4)
Now we apply the obvious inductive argument. Let f_t be a function in the iteration of Algorithm 15.1. Then inequality (15.3.4) guarantees that if δ_t² := (1/2) CE(pred_h ◦ f_t, W)²/R_*², then

E[φ(f_{t+1}(X), Y)] ≤ E[φ(f_t(X), Y)] − δ_t².

In particular,

0 ≤ E[φ(f_t(X), Y)] ≤ E[φ(f₀(X), Y)] − Σ_{τ=0}^{t−1} δ_τ²,

so that

t min_{τ<t} δ_τ² ≤ E[φ(f₀(X), Y)],

and min_{τ<t} δ_τ ≤ √(E[φ(f₀(X), Y)]/t). Replacing δ_τ with its definition gives the theorem.
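A population-level sketch of this iteration in the binary logistic case follows. Everything below is invented for illustration: a discrete population with known conditional probabilities, witnesses w(s) = sign of the conditional residual (bounded by 1 but ignoring any Lipschitz restriction), and the crude curvature constant R_*² = 1 (the logistic φ″ is in fact at most 1/4):

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

# Discrete population: m points x with weights px and known P(Y = 1 | x).
m = 8
px = np.full(m, 1 / m)
py = np.array([0.05, 0.2, 0.3, 0.35, 0.6, 0.7, 0.9, 0.95])
f = np.linspace(-3, 3, m) + 1.5  # initial, miscalibrated scores f(x)

def logistic_loss(scores):
    # E[phi(f(X), Y)] with phi(s, y) = log(1 + e^s) - s y
    return np.sum(px * (np.logaddexp(0, scores) - scores * py))

def residuals(scores):
    # r(s) = E[Y - sigmoid(s) | f(X) = s], constant on each score level set
    r = np.empty_like(scores)
    for s in np.unique(scores):
        idx = scores == s
        r[idx] = np.sum(px[idx] * py[idx]) / np.sum(px[idx]) - sigmoid(s)
    return r

eps, losses = 0.02, [logistic_loss(f)]
for _ in range(2000):
    r = residuals(f)
    ce = np.sum(px * np.abs(r))   # CE(pred o f, W) for W = {|w| <= 1}
    if ce <= eps:
        break
    f = f + ce * np.sign(r)       # step eta = CE / R_*^2 with R_* = 1
    losses.append(logistic_loss(f))
```

Each accepted step decreases the loss by at least (7/8)·CE² here (the logistic curvature φ″ ≤ 1/4 makes the quadratic penalty η²/8 rather than η²/2), so the loop must terminate with CE ≤ ϵ, mirroring the theorem's telescoping argument.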
those are squared error or general proper losses. That is, by calibrating we can beat (and hence,
calibeat) a given predictor. These arguments have exclusively been at the population level, leaving
it unclear whether this approach might actually work given a finite sample. While employing
these ideas for general losses and general decision settings, where we only guarantee Y ⊂ Rk , is
challenging because of dimensionality issues, here we show how to improve calibration in finite
samples while simultaneously losing little in squared error for binary predictions with Y ∈ {0, 1}.
That is, we have calibeating: from any potential predictor f , we can construct a predictor g with
both small calibration error and with (asymptotically) no larger squared error than f , realizing
Theorem 15.1.1 but in finite samples.
Let f : X → [0, 1] be any predictor of Y ∈ {0, 1}, and consider the squared error loss
ℓ(s, y) = (s − y)2 with population loss L(f ) = E[(Y − f (X))2 ]. The idea to improve calibra-
tion of f without losing much in accuracy (squared error) is fairly straightforward: we discretize
f by binning its predictions so that the number of Xi for which f (Xi ) is in a bin is equal; such
binning ideas are central to the theory of calibration. Then we choose the postprocessed func-
tion g by averaging observed Y values over those bins. This transforms the (population level)
idea present in Theorem 15.1.1, which says to choose the post-processing conditional expectation
g(x) = E[Y | f(X) = f(x)], into one implementable in finite samples, which approximately sets g(x) to the average of those observed Y_i for which f(X_i) ∈ [l, u], where l and u are lower and upper bounds over which to average the predictions of f.
To make the ideas concrete, assume we have a sample (X_i, Y_i)_{i=1}^{2n} of size 2n drawn i.i.d. according to P (where we choose 2n for notational convenience), which we divide into samples {(X_i, Y_i)}_{i=1}^n and {(X_i, Y_i)}_{i=n+1}^{2n}, letting P_n^{(1)} denote the empirical distribution on the first sample and P_n^{(2)} that on the second. We use the first to choose the binning (quantization) of f and the second to actually
choose values for the binned function. Fix a number of bins b ∈ N to be chosen, for convenience
assuming that b divides n. Let the indices i₁, ..., i_n sort f(X_i), so that f(X_{i₁}) ≤ f(X_{i₂}) ≤ ··· ≤ f(X_{i_n}), and choose endpoints l̂_j ≤ û_j so that each bin contains n/b of the sorted values f(X_i), except that l̂₁ = 0 and û_b = 1, and define the bins

B₁ = [l̂₁, û₁), B₂ = [l̂₂, û₂), ..., B_b = [l̂_b, û_b]
to partition [0, 1]. These partition [0, 1] evenly in the empirical probabilities of f (Xi ), i = 1, . . . , n,
not evenly in the widths û_j − l̂_j.
To construct the recalibrated and binned version g of f, for each x ∈ X, define the bin mapping

bin(x) := the index j ∈ {1, ..., b} such that f(x) ∈ B_j,
which implicitly depends on the first sample (X₁^n, Y₁^n). The partitioning of [0, 1] into the bins B_j also induces a partition on X = ∪_{j=1}^b f^{-1}(B_j), where elements x, x′ belong to the same partition set if bin(x) = bin(x′). Once we have this mapping from x to the associated prediction bin, we can use the second sample (its empirical distribution) to define the binned function g by the average of the second sample distribution P_n^{(2)} over those examples falling into each bin. Formally, we define g to be the piecewise constant function

g(x) := E_{P_n^{(2)}}[Y | f(X) ∈ B_{bin(x)}], (15.4.1)

where we assign g(x) an arbitrary value if no X_i satisfies f(X_i) ∈ B_j for the index j = bin(x).
Informally, this function g partitions X space into regions of roughly equal (small) probability
1/b, and for which f (x) belongs to a given interval on each region. Then recalibrating f on that
region changes the prediction error (Y − f (X))2 little, but improves the calibration. Formally, we
can show the following theorem.
Theorem 15.4.1. Let g be the binned and recalibrated estimator (15.4.1). Assume that the number of bins b and sample size n satisfy n/log n ≥ b. Then there exists a numerical constant c > 0 such that for all δ ∈ (0, 1), with probability at least 1 − 2 exp(−c n/b) − δ,

L(g) ≤ L(f) + 3/b + (2b/n) log(2b/δ) − E[ (E[Y | bin(X)] − E[f(X) | bin(X)])² ]

and g has expected calibration error (15.2.1) at most

ece(g) ≤ √( (2b/n) log(2b/δ) ).
Choosing the number of bins b ≍ √n, for example, the additive penalty terms in the loss bound are of order √(log n/n), while the expected calibration error is of order n^{−1/4}, ignoring the logarithmic factors. So we improve the loss L(f) by a factor involving the calibration error of f (relative to the random binning)—the less calibrated f is, the more improvement we can provide—and with a penalty tending to 0 at rate √(log n/n).
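A sketch of the whole construction on synthetic data (the predictor, the true conditional probability, and the default value for empty bins are all invented; quantile edges of the first half-sample give the equal-count bins):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic binary problem: predictor f(X) = X, truth E[Y | X] = clip(X^2 + 0.1, 0, 1).
n, b = 2000, 10                       # half-sample size and number of bins
x = rng.uniform(0, 1, 2 * n)
f_x = x                               # the (miscalibrated) predictor
y = (rng.uniform(0, 1, 2 * n) < np.clip(x ** 2 + 0.1, 0, 1)).astype(float)

# First half: bin edges at empirical quantiles of f, so bins have equal counts.
edges = np.quantile(f_x[:n], np.linspace(0, 1, b + 1))
edges[0], edges[-1] = 0.0, 1.0

def bin_index(v):
    return np.clip(np.searchsorted(edges, v, side="right") - 1, 0, b - 1)

# Second half: g is the average of Y over each bin (0.5 is an arbitrary default).
idx2 = bin_index(f_x[n:])
y2 = y[n:]
g_vals = np.array([y2[idx2 == j].mean() if np.any(idx2 == j) else 0.5
                   for j in range(b)])

# Compare squared errors on a fresh evaluation sample.
xe = rng.uniform(0, 1, 10 ** 5)
ye = (rng.uniform(0, 1, xe.size) < np.clip(xe ** 2 + 0.1, 0, 1)).astype(float)
loss_f = np.mean((ye - xe) ** 2)
loss_g = np.mean((ye - g_vals[bin_index(xe)]) ** 2)
print(f"L(f) = {loss_f:.4f}, L(g) = {loss_g:.4f}")
```

Because the bins have small width where f(X) concentrates, replacing f by the binned average of Y loses little sharpness while removing most of the systematic miscalibration, so the binned g should attain smaller squared error than f here.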
With the three lemmas in place, we can now expand the squared error to obtain the calibeating theorem. Recalling the population squared error L(g) = E[(Y − g(X))²], let us suppose that the consequences of Lemmas 15.4.2–15.4.4 hold, so that |g(x) − E[Y | f(X) ∈ B_j]|² ≤ (2b/n) log(2b/δ) and P(B_j) ≤ 7/(4b) for each j. By the lemmas, these hold with probability 1 − 2 exp(−c n/b) − δ. Define the average function values and conditional expectations

f̄_j := E[f(X) | f(X) ∈ B_j] and Ē_j := E[Y | f(X) ∈ B_j].

Then we have

L(g) = E[(Y − g(X))²] = Σ_{j=1}^b P(B_j) E[(Y − Ē_j + Ē_j − g(X))² | f(X) ∈ B_j].
Considering the expectation term, note that g(X) is constant for f(X) ∈ B_j by construction of the binning; adding and subtracting f̄_j, expanding the square, and applying the bound |g(x) − Ē_j|² ≤ (2b/n) log(2b/δ) for x ∈ f^{-1}(B_j), we have shown so far that

L(g) ≤ Σ_{j=1}^b P(B_j) E[(Y − f̄_j)² | f(X) ∈ B_j] + (2b/n) log(2b/δ) − Σ_{j=1}^b P(B_j)(Ē_j − f̄_j)². (15.4.2)
We can directly relate the first term in the expansion (15.4.2) to the expected error E[(Y − f(X))²]. Indeed, by expanding out the square, we have

E[(Y − f̄_j)² | f(X) ∈ B_j]
= E[(Y − f(X) + f(X) − f̄_j)² | f(X) ∈ B_j]
= E[(Y − f(X))² | f(X) ∈ B_j] + 2 E[(Y − f(X))(f(X) − f̄_j) | f(X) ∈ B_j] + Var(f(X) | f(X) ∈ B_j)
≤ E[(Y − f(X))² | f(X) ∈ B_j] + 2 √(Var(f(X) | f(X) ∈ B_j)) + Var(f(X) | f(X) ∈ B_j),

and because f(X) ∈ B_j implies Var(f(X) | f(X) ∈ B_j) ≤ (û_j − l̂_j)²/4 ≤ (û_j − l̂_j)/4, we obtain

E[(Y − f̄_j)² | f(X) ∈ B_j] ≤ E[(Y − f(X))² | f(X) ∈ B_j] + (5/4)(û_j − l̂_j).
Substituting in the bound (15.4.2) and recognizing that Σ_{j=1}^b P(B_j) E[(Y − f(X))² | f(X) ∈ B_j] = L(f), we find

L(g) ≤ L(f) + (5/4) Σ_{j=1}^b P(B_j)(û_j − l̂_j) + (2b/n) log(2b/δ) − Σ_{j=1}^b P(B_j)(Ē_j − f̄_j)².

But of course, P(B_j) ≤ 7/(4b) by the assumed conclusions of Lemma 15.4.2, and so Σ_{j=1}^b P(B_j)(û_j − l̂_j) ≤ 7/(4b) as Σ_{j=1}^b (û_j − l̂_j) = 1. This gives the final inequality

L(g) ≤ L(f) + 35/(16b) + (2b/n) log(2b/δ) − Σ_{j=1}^b P(B_j)(Ē_j − f̄_j)²,
proving the first claim of the theorem. The bound on calibration error is immediate because |g(x) − E[Y | f(X) ∈ B_j]|² ≤ (2b/n) log(2b/δ) for each x ∈ f^{-1}(B_j) with the prescribed probability, by Lemma 15.4.4.
We provide upper and lower bounds on k as a function of the error in P_n(A_j). Suppose that for some t > 0, we have

(1 − t)/(4b) ≤ P_n(A_j) ≤ (1 + t)/(4b) for j = 1, ..., 4b. (15.4.3)

Then

((1 + t)/(4b)) (k + 1) ≥ P_n(A_j ∪ ··· ∪ A_{j+k}) ≥ P_n(B_{j⋆}) = 1/b,

and similarly

((1 − t)/(4b)) (k − 1) ≤ P_n(A_{j+1} ∪ ··· ∪ A_{j+k}) ≤ P_n(B_{j⋆}) = 1/b,

implying the bounds

4/(1 + t) − 1 ≤ k ≤ 4/(1 − t) + 1.

In particular, if t < 1/3 then 3 ≤ k ≤ 6, and so when the bounds (15.4.3) hold with t = 1/3 we obtain

1/(2b) ≤ (k − 1)/(4b) = P(A_{j+1} ∪ ··· ∪ A_{j+k−1}) ≤ P(B_{j⋆}) ≤ P(A_j ∪ ··· ∪ A_{j+k}) = (k + 1)/(4b) ≤ 7/(4b).
Apply Bernstein's inequality with t = 1/3, that is, with deviation v = 1/(12b), and variance bound σ² ≤ P(A_j) ≤ 1/(4b), to obtain that for each j = 1, ..., 4b, we have

P( |P_n(A_j) − P(A_j)| ≥ 1/(12b) ) ≤ 2 exp( − n(1/(12b))² / (2/(4b) + (2/3)(1/(12b))) ) = 2 exp( −n/(80b) ).

Apply a union bound to obtain the lemma once we recognize that n/b − log b ≳ n/b whenever n/log n ≥ b.
Proof of Lemma 15.4.3 Assume that P(B_j) ≤ 7/(4b). Then applying Bernstein's inequality (4.1.8), and using that 1{f(X) ∈ B_j} is a Bernoulli random variable with mean (and hence variance) at most 7/(4b), we have

P( P_n^{(2)}(B_j) ≥ 2/b ) ≤ exp( − n(1/(4b))² / (7/(4b) + (2/3)(1/(4b))) ) = exp( − (1/(28 + 8/3)) n/b ) ≤ exp( −n/(31b) ).

Similarly, we have P( P_n^{(2)}(B_j) ≤ 1/(4b) ) ≤ exp(−n/(31b)) as P(B_j) ≥ 1/(2b). Applying a union bound over j = 1, ..., b, then noting that n/b − log b ≳ n/b whenever n/log n ≥ b, we again obtain the lemma.
Proof of Lemma 15.4.4 Recall that g(x) = E_{P_n^{(2)}}[Y | bin(X) = bin(x)], and note that g is constant on x ∈ f^{-1}(B_j). Fix a bin j, and let I(j) = {i ∈ {n+1, ..., 2n} | f(X_i) ∈ B_j} denote the indices in the second sample for which f(X_i) falls in bin B_j. Then conditional on i ∈ I(j), we have Y_i ~ P(Y ∈ · | f(X) ∈ B_j), so that

P( | (1/card(I(j))) Σ_{i∈I(j)} Y_i − E[Y | f(X) ∈ B_j] | ≥ t | I(j) ) ≤ 2 exp( −2 card(I(j)) t² )

by Hoeffding's inequality. Then, conditioning on the bins {B_j} chosen using P_n^{(1)} (which by assumption satisfy P(B_j) ∈ [1/(2b), 7/(4b)]), we have for any fixed x ∈ f^{-1}(B_j) that

P( sup_{x∈f^{-1}(B_j)} |g(x) − E[Y | f(X) ∈ B_j]| ≥ t | P_n^{(1)} )
= Σ_{I⊂[n]} P( |g(x) − E[Y | f(X) ∈ B_j]| ≥ t, I(j) = I | P_n^{(1)} )
≤ P( card(I(j)) < n/(4b) | P_n^{(1)} ) + Σ_{I⊂[n], card(I)≥n/(4b)} P( |g(x) − E[Y | f(X) ∈ B_j]| ≥ t, I(j) = I | P_n^{(1)} )
≤ P( P_n^{(2)}(B_j) < 1/(4b) ) + 2 exp( −nt²/(2b) ),

where the final line applies Hoeffding's inequality. Taking t² = (2b/n) log(2b/δ) and applying Lemma 15.4.3 and a union bound gives Lemma 15.4.4.
(i) it is sound and complete (15.2.2), that is, M(f ) = 0 if and only if f is calibrated for P , and
(ii) it is continuous with respect to the L1(P) metric on F, that is, for any f, if f_n is a sequence of functions with E[∥f(X) − f_n(X)∥] → 0, then M(f_n) → M(f).
(iii) it is Lipschitz continuous with respect to the L1(P) metric on F, that is, for some C < ∞,

|M(f₀) − M(f₁)| ≤ C E[∥f₀(X) − f₁(X)∥]

for all f₀, f₁ ∈ F.
If conditions (i) and (ii) (respectively (iii)) hold for all P in a collection of distributions P on X ×Y,
we will say that M is a continuous (respectively, Lipschitz) calibration measure for P.
The desiderata (ii) and (iii) are matters of taste; the central idea is that some type of continuity
is essential for efficient modeling, estimation, and analysis. We leave the norm ∥·∥ implicit in the
definition, and we typically omit the distribution P from the calibration metric as it is clear from
context. The two parts of Definition 15.2 admit many possible calibration measures. We consider
two types of measures, which are (almost) dual to one another, as examples. Both use a variational
representation, where in one we essentially look for the “closest” function that is calibrated, while
in the other, we investigate the ease with which we can (quantitatively) certify that a predictor f
is uncalibrated.
A key concept will be the equivalence of calibration measures, where we target a quantitative
equivalence. To define this, let 0 < α, β < ∞. Then we say that two candidate calibration measures
M0 and M1 on F ⊂ X → Rk are (α, β)-equivalent if there exist constants c0 , c1 (which may depend
on Y) such that
M₀(f) ≤ c₀ [M₁(f) + M₁(f)^α] and M₁(f) ≤ c₁ [M₀(f) + M₀(f)^β]. (15.5.1)
Distances to calibration. Recall the distance to calibration (15.2.3), which for C(P ) = {g :
X → Rk | EP [Y | g(X)] = g(X)} (where the defining equality holds with P -probability 1
over X) has definition dcal (f ) := inf g {E[∥g(X) − f (X)∥] s.t. g ∈ C(P )}. The measure (15.2.3)
is, after appropriate normalization, the largest Lipschitz measure of calibration: if M is any Lip-
schitz calibration measure (with constant C = 1 in Definition 15.2 part (iii)), then taking a per-
fectly calibrated g with ece(g) = 0, we necessarily have M(g) = 0. Then for any f we have
M(f ) = M(f ) − M(g) ≤ E[∥f (X) − g(X)∥], and taking an infimum over such g guarantees
M(f ) ≤ dcal (f ).
The second related quantity, which sometimes admits cleaner properties for analysis, is the penalized calibration distance, which we define as

pcal(f) := inf_g { E[∥f(X) − g(X)∥] + E[∥E[Y | g(X)] − g(X)∥] }. (15.5.2)
These quantities are strongly related, and in the sequel (see Corollary 15.5.8), we show that

pcal(f) ≤ dcal(f) ≤ pcal(f) + C_Y √(pcal(f)),
where CY is a constant depending only on the set Y whenever Y has finite diameter.
To build intuition for the definition (15.5.2), consider the two quantities. The first measures
the usual L1 distance between the function f and a putative alternative g. The second is the
expected calibration error of g. By restricting the infimum in definition (15.2.3) to functions g
with ece(g) = 0, we simply have the L1 distance to the nearest calibrated function; as is, the
additional term in (15.5.2) allows trading between the distance to a calibrated function and the
actual calibration error. We also have the following proposition.
Proposition 15.5.1. The functions dcal and pcal are Lipschitz calibration measures.
pcal(f₀) − pcal(f₁) ≤ inf_g { E[∥f₀(X) − g(X)∥] + E[∥E[Y | g(X)] − g(X)∥] }
The weak calibration error CE(·, W) admits similar properties, as it also satisfies our desiderata for a calibration measure. In particular, if we take W to be the class W_∥·∥ of bounded Lipschitz witness functions (15.2.5), we have the next two propositions.
Proposition 15.5.2. Let F consist of functions with E[∥f (X)∥] < ∞ and assume E[∥Y ∥] < ∞.
Then CE(·, W∥·∥ ) is a continuous calibration measure over F.
Because continuity is such a weak requirement, the proof of this result relies on measure theoretic
results, so we defer it to Section 15.6.2.
When we assume the collection F consists of bounded functions and Y itself is bounded, we
can give a stronger guarantee for the weak calibration, and we no longer need to rely on careful
arguments considering the order of various limits.
Proposition 15.5.3. Assume that diam(Y) is finite and that F is a collection of bounded functions
X → Rk . Then CE(·, W∥·∥ ) is a Lipschitz calibration measure over F.
Proof Let W = W_∥·∥ for shorthand. That CE(f, W) = 0 when f is calibrated is immediate, as by definition of conditional expectation we have E[⟨w(f(X)), Y − f(X)⟩] = E[⟨w(f(X)), E[Y | f(X)] − f(X)⟩] = 0 for each w ∈ W.
Lemma 15.5.4. Let S ∈ Rk be a random variable and E[∥g(S)∥] < ∞. If E[⟨w(S), g(S)⟩] = 0 for
all bounded and 1-Lipschitz functions w, then g(S) = 0 with probability 1.
The converse is now trivial: let S = f(X), note that CE(f, W) = sup_{w∈W} E[⟨w(S), E[Y | S] − S⟩], and take g(S) = E[Y | S] − S in Lemma 15.5.4.
To see that CE is Lipschitz, let w₀ ∈ W be such that CE(f₀, W) ≤ E[⟨w₀(f₀(X)), Y − f₀(X)⟩] + ϵ, and let C < ∞ satisfy C ≥ sup_{y∈Y, x∈X, f∈F} ∥y − f(x)∥. Then

CE(f₀, W) − CE(f₁, W) ≤ E[⟨w₀(f₀(X)), Y − f₀(X)⟩] − E[⟨w₀(f₁(X)), Y − f₁(X)⟩] + ϵ
≤ E[⟨w₀(f₀(X)) − w₀(f₁(X)), Y − f₀(X)⟩] + E[⟨w₀(f₁(X)), f₁(X) − f₀(X)⟩] + ϵ
≤ C E[∥w₀(f₀(X)) − w₀(f₁(X))∥_*] + E[∥f₁(X) − f₀(X)∥] + ϵ
≤ (1 + C) E[∥f₁(X) − f₀(X)∥] + ϵ.
Repeating the same argument, mutatis mutandis, for the lower bound gives the Lipschitz continuity
as desired.
The family of weak calibration measures CE(f, W) as we vary the collection of potential witness
functions W yields a variety of behaviors. Different choices of W can give different continuous
calibration measures, where we may modify Definition 15.1 part (ii) to other notions of continuity,
such as Lipschitzness with respect to L2 (P ) norms. We explore a few of these in the exercises at
the end of the chapter.
Theorem 15.5.5. Let Y = {e₁, ..., e_k} and W_∥·∥ be the collection (15.2.5) of bounded Lipschitz functions for a norm ∥·∥ on R^k. Then dcal, pcal, and CE(·, W_∥·∥) are each (1/2, 1/2)-equivalent. Moreover, this equivalence is sharp, in that they are not (α, β)-equivalent for any α, β > 1/2.
The theorem follows as a compilation of the other results in this section. Along the way to demon-
strating this theorem, we introduce a few alternative measures of calibration we use as stepping
stones toward our final results. While many of our derivations will apply for general sets Y, in
some cases we will restrict to multiclass classification problems, so that Y = {e1 , . . . , ek } ⊂ Rk
are the k standard basis vectors. We present two main results: the first, Theorem 15.5.6, shows
an equivalence (up to a square root) between the penalized calibration distance (15.5.2) and the
partitioned calibration error (15.2.6). As a corollary of this result, we obtain the equivalence of the
distance to calibration (15.2.3) and penalized distance to calibration (15.5.2). The second main
result, Theorem 15.5.9, gives a similar equivalence between the penalized distance (15.5.2) and
the calibration error relative to Lipschitz functions (15.2.4). Throughout, to make the calculations
cleaner and more transparent, we restrict our functions to make predictions in M = Conv(Y).
Once we work exclusively in the space of random scores S = f (X), we may define alternative
distances to calibration in analogy with the (penalized) distances to calibration, which will allow
us to more easily relate distances to the partitioned error (15.2.6). Thus, we define

dcal,low (f ) := inf_V { E[∥S − V ∥] s.t. E[Y | V ] = V }   (15.5.3a)

and

pcal,low (f ) := inf_V { E[∥S − V ∥] + E[∥E[Y | V ] − V ∥] } ,   (15.5.3b)
where the infima are over all random variables V taking values in Conv(Y), which may have an
arbitrary joint distribution with (S, Y ) (but do not modify the joint distribution of (S, Y )), and
which in case (15.5.3a) must be calibrated. This formulation is convenient in that we can represent it as a convex optimization
problem, allowing us to bring the tools of duality to bear on it, though we defer this temporarily. By
considering V = g(X) for functions g : X → Conv(Y), we immediately see that pcal (f ) ≥ pcal,low (f ).
We can also consider upper distances

dcal,up (f ) := inf_{g:Rk→Conv(Y)} { E[∥S − g(S)∥] s.t. E[Y | g(S)] = g(S) }

and

pcal,up (f ) := inf_{g:Rk→Conv(Y)} { E[∥S − g(S)∥] + E[∥E[Y | g(S)] − g(S)∥] } ,
which restrict the definitions (15.2.3) and (15.5.2) to compositions. We therefore have the inequal-
ities
dcal,low (f ) ≤ dcal (f ) ≤ dcal,up (f ) and pcal,low (f ) ≤ pcal (f ) ≤ pcal,up (f ). (15.5.4)
The partitioned calibration error (15.2.6) allows us to provide a bound relating the calibration
error and the lower and upper calibration errors. To state the theorem, we make a normalization
with ∥·∥, assuming without loss of generality that ∥·∥∞ ≤ ∥·∥.
Theorem 15.5.6. Let Y ⊂ Rk have finite diameter diam(Y) in the norm ∥·∥. Let S = f (X) ∈ Rk .
Then for all ε > 0,

pcal,up (f ) ≤ dcal,up (f ) ≤ pce(S) ≤ (1 + 2k diam(Y)/ε) pcal,low (f ) + ∥1k ∥∗ ε
                                    ≤ (1 + 2k diam(Y)/ε) dcal,low (f ) + ∥1k ∥∗ ε.
While the first inequality in Theorem 15.5.6 is relatively straightforward to prove, the second
requires substantially more care, so we defer the proof of the theorem to Section 15.6.4.
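The partitioned calibration error pce is defined earlier in the chapter (15.2.6); as a rough numerical illustration in the scalar binary case, the sketch below bins scores uniformly and charges each cell both its aggregate miscalibration and a diameter penalty, in the style of the randomized variant (15.6.3). The bin count and score model are illustrative assumptions, not the text's exact construction.

```python
import numpy as np

def partitioned_ce(S, Y, n_bins=20):
    """Sketch of a partitioned calibration error for scalar scores in [0, 1):
    over a uniform partition A, sum |E[(S - Y) 1{S in A}]| + diam(A) P(S in A).
    The binning and diameter penalty follow the style of (15.6.3)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (S >= lo) & (S < hi)
        total += abs(np.mean((S - Y) * mask)) + (hi - lo) * np.mean(mask)
    return total

rng = np.random.default_rng(1)
n = 100_000
S = rng.uniform(size=n)
Y_cal = (rng.uniform(size=n) < S).astype(float)     # calibrated: E[Y|S] = S
Y_mis = (rng.uniform(size=n) < S**2).astype(float)  # miscalibrated: E[Y|S] = S^2

pce_cal = partitioned_ce(S, Y_cal)
pce_mis = partitioned_ce(S, Y_mis)
```

For the calibrated predictor only the diameter penalty (plus noise) remains, while the miscalibrated predictor additionally pays roughly E|S − S²| = 1/6.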
We record a few corollaries, one consequence of which is to show that the partitioned calibration
error (15.2.6) is at least a calibration measure in the sense of Definition 15.2.(i). Theorem 15.5.6
also shows that the penalized calibration distance pcal (f ) is equivalent, up to taking a square root,
to the upper and lower calibration “distances”. In each corollary, we let Ck = ∥1k ∥∗ for shorthand.
Corollary 15.5.7. Let the conditions of Theorem 15.5.6 hold. Then
pcal,low (f ) ≤ pcal (f ) ≤ pcal,low (f ) + 2 √( Ck k diam(Y) pcal,low (f ) )

and

dcal,low (f ) ≤ dcal (f ) ≤ dcal,low (f ) + 2 √( Ck k diam(Y) dcal,low (f ) ).
Proof  The first lower bound is immediate (recall the naive inequalities (15.5.4)). Now set
ε = √( 2k diam(Y) pcal,low (f )/Ck ) in Theorem 15.5.6, and recognize that pcal,low (f ) ≤ pcal,up (f ).
We also obtain an approximate equivalence between the calibration distance dcal and penalized
calibration distance pcal from definitions (15.2.3) and (15.5.2).
Corollary 15.5.8. Let the conditions of Theorem 15.5.6 hold. Then
pcal (f ) ≤ dcal (f ) ≤ pcal (f ) + 2 √( Ck k diam(Y) pcal (f ) ).
Proof  The first inequality is immediate by definition. For the second, note (see Lemma 15.6.4
in the proof of Theorem 15.5.6 in Section 15.6.4) that pcal,low (f ) ≤ pce(S) for S = f (X). Then
apply Theorem 15.5.6 with ε = √( 2k diam(Y) pcal,low (f )/Ck ) as in Corollary 15.5.7, and recognize
that pcal,low ≤ pcal .
Let us instantiate the theorem and its corollaries in a few special cases. If we make binary
predictions with Y = {0, 1}, then Ck = k = diam(Y) = 1, and Theorem 15.5.6 implies that
pcal,low (f ) ≤ pcal (f ) ≤ pcal,low (f ) + 2 √( pcal,low (f ) ).
For k-class multiclass classification, where we identify Y = {e1 , . . . , ek } with the k standard basis
vectors, we have the bounds
pcal,low (f ) ≤ pcal (f ) ≤ pcal,low (f ) + 2 √( k pcal,low (f ) ),
so long as we measure calibration errors with respect to the ℓ1 -norm, that is, ∥y − f (x)∥1 , because
diam(Y) ≤ 1 and Ck = ∥1∥∞ = 1.
JCD Comment: Remark on sharpness here.
The same argument implies the following analogue for the distance to calibration (15.2.3).
Corollary 15.5.11. Let Y = {e1 , . . . , ek } and ∥·∥ = ∥·∥1 be the ℓ1 -norm. Then for any f : X →
Conv(Y), we have
(1/2) CE(f, W∥·∥ ) ≤ dcal (f ) ≤ CE(f, W∥·∥ ) + 2 √( k CE(f, W∥·∥ ) ).
To prove the converse requires more; we present most of the argument for an arbitrary discrete
space Y and specialize to the multiclass setting only at the end. The starting point is to reduce the
problem to a discrete problem over probability mass functions rather than general distributions, as
then it is much easier to apply the standard tools of convex duality. Consider the value dcal,low (S) = inf_V { E[∥S − V ∥] s.t. E[Y | V ] = V }.
Let b ∈ N, let {s1 , . . . , sN } be a (minimal) 1/b-cover of Conv(Y), and define Sb to be the
projection of S onto the nearest si . Then ∥S − V ∥ = ∥Sb − V ∥ ± 1/b, and

dcal,low (S) = inf_V { E[∥Sb − V ∥] s.t. E[Y | V ] = V } ± 1/b.
Now, if we replace the infimum over arbitrary joint distributions of (Sb , Y, V ) leaving the marginal
(Sb , Y ) unchanged (with V calibrated) with an infimum over only discrete distributions on V , we
have
dcal,low (S) ≤ inf_{V finitely supported} { E[∥Sb − V ∥] s.t. E[Y | V ] = V } + 1/b.   (15.5.5)
Notably, the feasible set is non-empty, as we can always choose V = Y .
With the problem (15.5.5) in hand, we can write a finite dimensional optimization problem
whose optimal value is the discretized infimum on the right side. Without loss of generality assuming
that S is finitely supported, we let psy = P(S = s, Y = y) be the probability mass function of
(S, Y ). Then, introducing a joint distribution Q with p.m.f. qsyv = Q(S = s, Y = y, V = v), the
infimum (15.5.5) carries the constraint that Σv qsyv = psy . We then have E[∥S − V ∥] = Σs,y,v qsyv ∥s − v∥, and
the calibration constraint E[Y | V ] = V is equivalent to the equality constraint that Σs,y qsyv (y − v) = 0
for each v. This yields the convex optimization problem
minimize_q   Σs,y,v qsyv ∥s − v∥
subject to   Σv qsyv = psy for all (s, y),   q ⪰ 0,   Σs,y qsyv (y − v) = 0 for all v.   (15.5.6)
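Because problem (15.5.6) is a linear program once the supports are finite, small instances can be solved directly. The following sketch builds such an instance for binary outcomes Y ∈ {0, 1} with a hypothetical three-point score distribution and a finite grid of candidate values for V , and solves it with scipy.optimize.linprog; all of the specific numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical small instance: binary outcomes Y in {0, 1}, scores S on a
# three-point support, candidate calibrated values V restricted to a grid.
S_vals = np.array([0.2, 0.5, 0.8])
V_vals = np.linspace(0.0, 1.0, 21)
eta = np.array([0.4, 0.5, 0.6])                     # P(Y=1 | S=s): miscalibrated
p_sy = np.stack([(1 - eta) / 3, eta / 3], axis=1)   # joint p.m.f. p_{sy}

nS, nY, nV = len(S_vals), 2, len(V_vals)

def idx(s, y, v):                                   # flatten (s, y, v) -> column
    return (s * nY + y) * nV + v

# Objective: sum_{s,y,v} q_{syv} |s - v|.
c = np.array([abs(S_vals[s] - V_vals[v])
              for s in range(nS) for y in range(nY) for v in range(nV)])

A_eq, b_eq = [], []
for s in range(nS):                  # marginal constraints: sum_v q_{syv} = p_{sy}
    for y in range(nY):
        row = np.zeros(nS * nY * nV)
        row[[idx(s, y, v) for v in range(nV)]] = 1.0
        A_eq.append(row)
        b_eq.append(p_sy[s, y])
for v in range(nV):                  # calibration: sum_{s,y} q_{syv}(y - v) = 0
    row = np.zeros(nS * nY * nV)
    for s in range(nS):
        for y in range(nY):
            row[idx(s, y, v)] = y - V_vals[v]
    A_eq.append(row)
    b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
d_low = res.fun                      # discretized lower calibration distance
```

The LP is always feasible (take V = Y , routing each y to the grid point v = y), and here the optimum is strictly positive because S is miscalibrated.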
Introducing multipliers βv ∈ Rk for the calibration constraints, λsy ∈ R for the marginal constraints, and θ ⪰ 0 for the nonnegativity constraints, we form the Lagrangian

L(q, λ, θ, β) = Σs,y,v qsyv ∥s − v∥ + Σs,y,v qsyv βvT (y − v) − Σs,y λsy ( Σv qsyv − psy ) − ⟨θ, q⟩.
Unless the coefficient ∥s − v∥ + βvT (y − v) − λsy − θsyv of qsyv vanishes for each triple (s, y, v), we
have inf q L(q, λ, θ, β) = −∞. The equality in the preceding display is equivalent to
∥s − v∥ + βvT (y − v) ≥ λsy , so that eliminating the θ ⪰ 0 variables, we have the dual

maximize   Σs,y λsy psy
subject to  λsy ≤ ∥s − v∥ + βvT (y − v) for all s, y, v
to problem (15.5.6). Equivalently, recognizing that at the optimum we must saturate the constraints
on λ via λsy = minv {∥s − v∥ + βvT (y − v)}, we have
maximize_β   Σs,y psy minv { ∥s − v∥ + βvT (y − v) } .   (15.5.7)
sup_{w∈W∥·∥} E[⟨w(S), Y − S⟩] ≥ (1/C) dcal,low (S),
and similarly
by the triangle inequality. Here, we specialize to the particular multiclass classification case in which
the set Y = {e1 , . . . , ek } consists of extreme points of the probability simplex, so that s ∈ Conv(Y)
means that ⟨1, s⟩ = 1 and s ⪰ 0. Abusing notation slightly, let λi = λei for i = 1, . . . , k. Then
define the function w(s) := (λ1 (s), . . . , λk (s)).
By inspection, we have

∥w(s) − w(s′ )∥∗ ≤ ∥ ∥s − s′ ∥ 1 ∥∗ = ∥1∥∗ ∥s − s′ ∥.
Several of the proofs in this chapter rely on standard results from analysis and measure theory;
we give these as base lemmas, as any book on graduate level real analysis (implicitly) contains
them (see, e.g., Tao [176, Chapters 1.3 and 1.13] or Royden [163]).
Lemma 15.6.1 (Egorov’s theorem). Let fn → f in Lp (P ) for some p ≥ 1. Then for each ϵ > 0,
there exists a set A of measure at least P (A) ≥ 1 − ϵ such that fn → f uniformly on A.
Lemma 15.6.2 (Monotone convergence). Let fn : X → R+ be a monotone increasing sequence of
functions and f (x) = limn fn (x) (which may be infinite). Then ∫ f (x)dµ(x) = limn ∫ fn (x)dµ(x)
for any measure µ.
Lemma 15.6.3 (Density of Lipschitz functions). Let CcLip be the collection of compactly supported
Lipschitz functions on Rk and P a probability distribution on Rk . Then CcLip is dense in Lp (P ), that
is, for each ϵ > 0 and f with EP [|f (X)|p ] < ∞, there exists g ∈ CcLip with EP [|g(X)−f (X)|p ]1/p ≤ ϵ.
Substituting these bounds into inequality (15.6.1), we have for any ϵ > 0 that there exists a set Aϵ
with P (Aϵ ) ≥ 1 − ϵ and for which
By taking a supremum over w ∈ Wk in the last display and recognizing that ϵ > 0 was arbitrary,
we have shown that
lim inf CE(fn , Wk ) ≥ CE(f, Wk )
n
for all k < ∞. By Lemma 15.6.3, for any integrable f and for each ϵ > 0 there exists k such that
Noting that CE(fn , W) ≥ CE(fn , Wk ) for any k and taking ϵ → 0 gives the lemma.
and
CE(fn , W) − CE(f, W) ≤ sup E[⟨w(fn (X)), Y − fn (X)⟩ − ⟨w(f (X)), Y − f (X)⟩].
w∈W
We focus on bounding the first display, as showing that the second tends to zero requires, mutatis
mutandis, an identical argument.
Fix any w ∈ W. Then
E[⟨w(f (X)), Y − f (X)⟩ − ⟨w(fn (X)), Y − fn (X)⟩]
= E[⟨w(f (X)) − w(fn (X)), Y − f (X)⟩] + E[⟨w(fn (X)), fn (X) − f (X)⟩]
≤ E[min{2, ∥f (X) − fn (X)∥} ∥Y − f (X)∥] + E[∥fn (X) − f (X)∥],
where the inequality follows because ∥w(s) − w(s′ )∥∗ ≤ 2 and ∥w(s) − w(s′ )∥∗ ≤ ∥s − s′ ∥ for any
s, s′ by construction. The second expectation certainly tends to zero as n → ∞, so we consider the
first. Define gn (x, y) = min{2, ∥f (x) − fn (x)∥} ∥y − f (x)∥. Then gn (x, y) ≤ g(x, y) = ∥y − f (x)∥,
which has finite expectation by assumption. Moreover, Egorov’s theorem (Lemma 15.6.1) guaran-
tees that for each k, there is a set Ak with P (Ak ) ≥ S 1 − 1/k and for which gn → 0 uniformly on
Ak (because E[∥f (X) − fn (X)∥] → 0). Define A∞ = k Ak , so that P (A∞ ) = 1, and gn (x, y) → 0
pointwise on A∞ . Then the dominated convergence theorem guarantees that
E[gn (X, Y )] = E[gn (X, Y )1 {(X, Y ) ∈ A∞ }] + E[gn (X, Y )1 {(X, Y ) ̸∈ A∞ }] → 0.
| {z }
=0
as n ↑ ∞. We now employ the same device we use in the proof of Lemma 15.2.1. For m ∈ N,
let Bm = ∪n≤m A1/n . Then wn → f uniformly on Bm , and so E[∥g(S)∥2 ] ≤ E[∥g(S)∥2 1 {S ̸∈ Bm }],
that is, E[∥g(S)∥2 1 {S ∈ Bm }] = 0. Monotone convergence implies 0 = limm→∞ E[∥g(S)∥2 1 {S ∈ Bm }] =
E[∥g(S)∥2 1 {S ∈ B∞ }], where B∞ = ∪m Bm . As P (B∞ ) = 1 by continuity of measure, we have
E[∥g(S)∥2 ] = 0, giving the lemma.
Proof Fix any partition A, and define qA (s) to be the (unique) set A such that s ∈ A (so we
quantize s). Then set g(s) = E[Y | S ∈ qA (s)] to be the expectation of Y conditional on S being in
the same partition element as s. Then g(S) = E[Y | g(S)] with probability 1, so that g is perfectly
calibrated, and
To prove the claimed upper bound requires more work. For pedagogical reasons, let us attempt
to prove a similar upper bound relating pce(S) to pcal,low (f ). We might begin with a partition A
with maximal diameter diam(A) ≤ ϵ for A ∈ A, and for random variables (S, V, Y ), begin with the
first term in the partition error, whence
CE(S, A) ≤ Σ_{A∈A} [ ∥E[(S − V )1 {S ∈ A}]∥ + ∥E[(V − Y )1 {S ∈ A}]∥ ]
  ≤ E[∥S − V ∥] + Σ_{A∈A} ∥E[(V − Y )1 {V ∈ A}]∥ + Σ_{A∈A} ∥E[(V − Y )(1 {S ∈ A} − 1 {V ∈ A})]∥
  ≤ E[∥S − V ∥] + E[∥E[Y | V ] − V ∥] + Σ_{A∈A} ∥E[(V − Y )(1 {S ∈ A} − 1 {V ∈ A})]∥.
If S and V had continuous distributions, we would expect the probability that they fail to belong
to the same partition elements to scale as E[∥S − V ∥]. This may fail, but to rectify the issue, we
can randomize.
Consequently, let us consider the randomized partition error, which we index with ε > 0 and
for U ∼ Uniform[−1, 1]k define as
rpceε (S) := inf_A { Σ_{A∈A} ∥E[(S − Y )1 {S + εU ∈ A}]∥ + Σ_{A∈A} diam(A) P(S ∈ A) } .   (15.6.3)
(The choice of uniform [−1, 1]k is only made for convenience in the calculations to follow.) Letting
ck = ∥1k ∥∗ , we see immediately that
∥E[(S − Y )1 {S + εU ∈ A}]∥
  ≤ ∥E[(S − V )1 {S + εU ∈ A}]∥ + ∥E[(V − Y )1 {V + εU ∈ A}]∥ + ∥E[(V − Y )(1 {S + εU ∈ A} − 1 {V + εU ∈ A})]∥
  ≤ ∥E[(S − V )1 {S + εU ∈ A}]∥ + ∥E[(V − Y )1 {V + εU ∈ A}]∥
    + E [∥V − Y ∥ · (P(V + εU ∈ A, S + εU ̸∈ A | V, S, Y ) + P(S + εU ∈ A, V + εU ̸∈ A | V, S, Y ))] .
Summing over sets A and using the triangle inequality and that S + εU ∈ A for some A, we find
Σ_{A∈A} ∥E[(S − Y )1 {S + εU ∈ A}]∥ ≤ E[∥S − V ∥] + E[∥E[Y | V ] − V ∥]   (15.6.4)
    + 2 E[ ∥V − Y ∥ Σ_{A∈A} P(V + εU ∈ A, S + εU ̸∈ A | V, S, Y ) ] .
We now may bound the probability in inequality (15.6.4). Recall that A = [−ε, ε]k + εz for
some z ∈ 2Zk , and fix v, s ∈ Rk . Let B = [−1, 1]k be the ℓ∞ ball. Then
Figure 15.2. The volume argument in inequality (15.6.5). In k dimensions, the hypercube of
side-length δ can be replicated 1/δ^{k−1} times on each exposed face of the cube centered at v, where
δ = ∥s − v∥∞ . There are at most k such faces, giving volume at most kδ^k /δ^{k−1} = kδ to the gray
region. (The (k − 1)-dimensional surface area of a hypercube of radius δ is 2kδ^{k−1} , and we can put
at most 1/δ^{k−1} boxes in each facial part of the gray region.)
Substituting inequality (15.6.5) into the bound (15.6.4) and conditioning and deconditioning on
V, S, we find that
Σ_{A∈A} ∥E[(S − Y )1 {S + εU ∈ A}]∥
  ≤ E[∥S − V ∥] + E[∥E[Y | V ] − V ∥] + (2k/ε) E[ ∥V − Y ∥ ∥V − S∥∞ Σ_{A∈A} 1 {V + εU ∈ A} ]
  = E[∥S − V ∥] + E[∥E[Y | V ] − V ∥] + (2k/ε) E[ ∥Y − V ∥ ∥V − S∥∞ ].
Taking an infimum over partitions A gives the lemma.
15.7 Bibliography
Draft: Calibration remains an active research area. The initial references for online calibration are
Foster and Vohra [90], Dawid and Vovk [64]. The idea of calibeating is most present in Foster and
Hart [94]. Our proof of calibeating is based on Kumar et al. [128]. Blasiok et al. [34] demonstrate
the equivalence of the different metrics for measuring calibration, focusing on the case of binary
prediction; the extension to vector-valued Y appears to be new. The ideas of the postprocessing
gap also descend from Blasiok et al. [35], and the connections with general proper losses also
appear to be new. Propositions 15.5.1, 15.5.2, and 15.5.3 are new in that they are the first to
demonstrate that the measures are valid calibration measures (Definition 15.1, part (i)).
JCD Comment: A few more things to add either in the bibliography or the introduction
to the section:
1. We only really do calibration for binary/multiclass things. One would also really like to
predict full distributions Pt on general outcomes Y , which is harder (nearly impossible)
to do in any conditional sense.
2. It’s much easier to do predictive inference (cover) because don’t need accuracy
3. Maybe comment on variants for top entry (from multiclass to binary) classification
and why that is important. Maybe in the middle, maybe here.
15.8 Exercises
Exercise 15.1: We say W is weakly complete if E[⟨w(S), g(S)⟩] = 0 for all w ∈ W implies that g(S) = 0 with
probability 1. Let W be any symmetric weakly complete collection of functions. Show that
CE(f, W) is sound and complete for calibration, that is, CE(f, W) = 0 if and only if f is calibrated.
(b) Let W be symmetric and weakly complete (Exercise 15.1) and W k = {(w1 , . . . , wk ) | wj ∈
W} be the vector-valued functions with components in W, where w ∈ W k satisfies w(s) =
(w1 (s1 ), . . . , wk (sk )) for s ∈ Rk . Show that CE(f, W k ) is sound and complete for marginal
calibration, that is, CE(f, W k ) = 0 if and only if f is marginally calibrated.
Show that if W is weakly complete (Exercise 15.1) and symmetric, then CE(f, Wtop-K ) is sound
and complete for top-K calibration.
JCD Comment: Add a uniform convexity version of Proposition 15.3.5 as an exercise.
JCD Comment: Can we add an exercise about achieving weak calibration for different
classes of functions?
JCD Comment: A few potential exercises:
(i) Deal with any class W for which E[⟨w, f ⟩] = 0 for all w ∈ W means f = 0, then
still get a continuous calibration measure
JCD Comment: Exercise: do Aaditya’s top-class calibration approach.
Chapter 16
While proper losses—a major focus in Chapters 14 and 15—provide a principled approach to choosing
a loss and predicting the “right” distribution in prediction problems, they are not a panacea.
In many contexts, we wish to focus more exclusively on optimal prediction of a particular target
Y rather than its distribution, whether because the problem at hand does not require distribu-
tional predictions or because Y itself is so complex that a distribution over Y would be generally
intractable to describe. For example, in structured prediction problems, the target Y may be a com-
plex or combinatorial object, such as labeling each pixel in an image as foreground or background,
a parse tree of a sentence in natural language processing, or a ranking of items.
The setting for much of this chapter has similarities to that in the preceding two: we assume
data coming in pairs (X, Y ) ∈ X × Y, but now we abstractly consider a function f : X → Rk for
some k, and we seek the function f minimizing the risk, or expected loss, R(f ) := E[ℓ(f (X), Y )].
The canonical examples are the zero-one losses: in the binary case, Y ∈ {−1, 1} and ℓ0-1 (s, y) = 1 {ys ≤ 0}
for s ∈ R. In the multiclass case, we take Y ∈ {1, . . . , k} and f : X → Rk with
prediction yb = argmaxj fj (x), and the zero-one error for s ∈ Rk and label y ∈ [k] becomes
ℓ0-1 (s, y) = 1 { sy ≤ max_{j̸=y} sj } ,
which is 0 when the score sy assigned to the correct label y is greater than all others.
In many of the cases we consider in this chapter, the loss ℓ is hard to minimize: it is non-
convex and, as in the case of the zero-one losses above, non-differentiable, NP-hard to minimize
in general, and gradient information does not permit even some reasonable local search for a
predictive function. As a consequence, this chapter takes a different tack, where we search for
where the infima are over the class of all measurable functions. We then seek surrogate risk con-
sistency, essentially, a type of infinite sample consistency, sometimes called Fisher consistency, a
population-level guarantee that minimizing Rφ guarantees minimizing R.
Definition 16.1 (Surrogate risk consistency). The loss φ is surrogate risk consistent for the loss
ℓ if for any sequence of functions fn and any distribution on (X, Y ),
This chapter develops the theory of such surrogate risk consistency. The theory obtains its
cleanest and most transparent form in the case of binary classification with the zero-one error,
but it extends beyond this, including to multiclass and structured prediction problems. As we will
show, the dualities we develop in Chapter 14 and connections with generalized entropies remain
important; in many cases, any loss φ generating a suitably concave generalized entropy guar-
antees consistency. Deeper equivalence results between loss functions arise via this entropy and
information-based perspective, and we close the chapter by developing these. Throughout, we will
elide measure-theoretic details, as they are secondary to the main thrusts of the arguments (but
see the bibliographic section).
where ϕ : R → R is a convex function acting on the margin of a predictor, where the margin of f
on an example (x, y) is
yf (x).
A large margin yf (x) ≫ 0 thus corresponds to a correct prediction, while non-positive margin
yf (x) ≤ 0 means that f (x) has opposite sign to y.
Example 16.1.1 (Common convex loss functions): Several convex losses are frequent in the
literature on classification. These include the logistic loss,
The key step to understanding surrogate-risk consistency is to move from the full population
expectation to conditional expectations given X: as we work with arbitrary measurable functions
f : X → R, we can essentially choose f (X) to optimally predict Y given X. To that end, let
η(x) := P(Y = 1 | X = x)
and
Rφ (f ) = E [η(X)ϕ(f (X)) + (1 − η(X))ϕ(−f (X))] .
Immediately, we see that f ⋆ (x) = η(x) − 1/2 minimizes R (one may arbitrarily modify f ⋆ on the set
{x | η(x) = 1/2}). Defining the conditional risks
we evidently have
Thus, we seek relationships between r(s, η) and rϕ (s, η) that guarantee consistency.
Example 16.1.2 (Exponential loss): Consider the exponential loss, which AdaBoost and
other boosting algorithms use, which sets ϕ(s) = e−s . In this case, for η ∈ (0, 1) we have
argmin_s rϕ (s, η) = (1/2) log( η/(1 − η) )   because   (∂/∂s) rϕ (s, η) = −ηe−s + (1 − η)es .
Solving for s by setting the derivative to zero yields e2s = η/(1 − η), so that fϕ⋆ (x) = (1/2) log( η(x)/(1 − η(x)) )
minimizes Rφ (f ), where we allow fϕ⋆ to take infinite values if η ∈ {0, 1}. As sign(fϕ⋆ (x)) =
sign(2η(x) − 1), this is optimal for the zero-one loss as well. 3
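The closed-form minimizer in Example 16.1.2 is easy to verify numerically; the sketch below compares it with a one-dimensional search over the conditional risk (the particular values of η and the search interval are illustrative).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Check that s*(eta) = (1/2) log(eta/(1-eta)) minimizes the conditional
# exponential-loss risk r_phi(s, eta) = eta*e^{-s} + (1-eta)*e^{s}.
def r_phi(s, eta):
    return eta * np.exp(-s) + (1.0 - eta) * np.exp(s)

etas = (0.2, 0.6, 0.9)
s_closed = [0.5 * np.log(e / (1.0 - e)) for e in etas]
s_num = [minimize_scalar(lambda s: r_phi(s, e), bounds=(-10.0, 10.0),
                         method="bounded").x for e in etas]
```

The numerical minimizers agree with the closed form, and each has the sign of 2η − 1, matching the claim that the exponential-loss minimizer is optimal for the zero-one loss.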
rϕ⋆ (η) := inf_s rϕ (s, η)   and   rϕwrong (η) := inf_{s(η−1/2)≤0} rϕ (s, η).
Then we expect that rϕ⋆ (η) < rϕwrong (η) for all η ̸= 1/2 is sufficient to guarantee surrogate risk
consistency for the zero-one loss.
To make this intuition rigorous, define the sub-optimality function ∆ϕ : [0, 1] → R by

∆ϕ (δ) := rϕwrong ((1 + δ)/2) − rϕ⋆ ((1 + δ)/2).   (16.1.1)
rϕwrong (η) = inf_{s(2η−1)≤0} { ηe−s + (1 − η)es } = e0 = 1
Example 16.1.3 (Hinge loss): We can also consider the hinge loss ϕ(s) = [1 − s]+ . Com-
puting the minimizers of the conditional risk, we have
whose unique minimizer (for η ̸∈ {0, 1/2, 1}) is s(η) = sign(2η − 1), while s(η) = sign(2η − 1)
remains a minimizer for all η. We thus have
Example 16.1.4 (Squared error): Let the margin-based loss be φ(s, y) = (1/2)(s − y)2 = (1/2)(1 − ys)2
for y ∈ {−1, 1}. Then we have

rϕ⋆ (η) = inf_s { (η/2)(s − 1)2 + ((1 − η)/2)(s + 1)2 } = 2η(1 − η)

because s(η) = 2η − 1 minimizes rϕ (s, η). On the other hand, rϕwrong (η) = 1/2 for all η, and so
we have ∆ϕ (δ) = 1/2 − (1/2)(1 − δ)(1 + δ) = δ 2 /2. 3
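The suboptimality functions computed in Examples 16.1.2–16.1.4 can be checked numerically by performing the two one-dimensional minimizations in (16.1.1) directly; the search intervals below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Margin-based losses from Examples 16.1.2-16.1.4 and their closed-form
# suboptimality functions Delta_phi(d).
losses = {
    "exp":     lambda s: np.exp(-s),
    "hinge":   lambda s: max(1.0 - s, 0.0),
    "squared": lambda s: 0.5 * (1.0 - s) ** 2,
}
closed = {
    "exp":     lambda d: 1.0 - np.sqrt(1.0 - d**2),
    "hinge":   lambda d: d,
    "squared": lambda d: 0.5 * d**2,
}

def delta_phi(phi, d):
    """Delta_phi(d) = r_wrong((1+d)/2) - r*((1+d)/2), via 1-d search."""
    eta = (1.0 + d) / 2.0
    r = lambda s: eta * phi(s) + (1.0 - eta) * phi(-s)
    r_star = minimize_scalar(r, bounds=(-20.0, 20.0), method="bounded").fun
    r_wrong = minimize_scalar(r, bounds=(-20.0, 0.0), method="bounded").fun
    return r_wrong - r_star

errs = [abs(delta_phi(phi, d) - closed[name](d))
        for name, phi in losses.items() for d in (0.2, 0.5, 0.8)]
```

The numerical values match the closed forms 1 − √(1 − δ²), δ, and δ²/2 up to the optimizer's tolerance.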
JCD Comment: Put figures here with exponential loss and conditional probability η.
where we have used Rϕ (f ) = E[ϕ(Y f (X))] to make clear that we use the margin-based loss φ(s, y) =
ϕ(ys). The key is to leverage the biconjugate of ∆ϕ , which (recall Lemma 14.1.1) is the largest
(closed) convex function below ∆ϕ . To that end, we define the ψ-transform

ψϕ (δ) := ∆ϕ∗∗ (δ)   (16.1.2)
of the margin-based loss ϕ. With this, we obtain the following characterization of classification
calibration.
Theorem 16.1.5. Let ϕ be a margin-based loss function and ψ the associated ψ-transform. Then
for any f : X → R,
ψ(R(f ) − R⋆ ) ≤ Rϕ (f ) − Rϕ⋆ . (16.1.3)
Moreover, if ϕ is nonnegative, the following three are equivalent:
The proof of the theorem relies on the convex analysis we develop in the brief primer in Section 14.1.1
and Appendix B, so we defer it to Section 16.1.3.
Theorem 16.1.5 provides concrete conditions to guarantee infinite-sample consistency of a margin-
based surrogate loss φ(s, y) = ϕ(ys). If the associated suboptimality gap ∆ϕ (δ) is strictly positive
for δ > 0, then the associated ψ-transform (16.1.2) is also strictly positive, and we have the quan-
titative guarantee that
ψϕ (R(f ) − R⋆ ) ≤ Rϕ (f ) − Rϕ⋆ ,
so that a small excess surrogate risk guarantees small excess zero-one error, though different loss
functions yield different explicit upper bounds.
By recalling Examples 16.1.2, 16.1.3, and 16.1.4, we can see immediate applications of the
theorem. For each of these, we have that ψ(δ) = ∆(δ), as the function ∆ is convex. Considering each
in turn, for the exponential loss ϕ(s) = exp(−s), we have ∆(δ) = 1 − √(1 − δ 2 ) = δ 2 /2 + O(δ 4 ) ≥ δ 2 /2,
while for the hinge loss ϕ(s) = [1 − s]+ , we have ∆(δ) = δ. Thus we obtain for any f that

P(Y f (X) ≤ 0) − inf_f P(Y f (X) ≤ 0) ≤ E [1 − Y f (X)]+ − inf_f E [1 − Y f (X)]+ .
so that the bound on the zero-one error that the exponential loss guarantees is quadratically worse
than the one the hinge loss provides. The squared loss (1/2)(f (x) − y)2 has suboptimality function
∆(δ) = δ 2 /2, yielding a guarantee similar to that the exponential loss provides, while also guaranteeing
that regressing directly on ±1 labels is consistent.
should necessarily satisfy sign(s) = sign(2η − 1). This is precisely the correct condition.
JCD Comment: Put a figure, including the fact that rϕ′ (0, η) = (2η − 1)ϕ′ (0) < 0 when
η > 1/2, so the minimizer must be to the right.
Theorem 16.1.6. If ϕ is convex, then ϕ is classification calibrated if and only if ϕ′ (0) exists and
ϕ′ (0) < 0.
Proof  Let η > 1/2 without loss of generality. First, suppose that ϕ is differentiable at 0 and ϕ′ (0) < 0. Then the
domain of ϕ necessarily includes an interval around 0, and
satisfies rϕ′ (0, η) = (2η − 1)ϕ′ (0); if ϕ′ (0) < 0, this quantity is negative for η > 1/2. Thus the
minimizing s(η) ∈ (0, ∞]. Indeed, we have
by the first-order inequality for convexity, and the final term is strictly positive for s < 0. Moreover,
for s > 0 but near 0. Thus rϕwrong (η) = rϕ (0, η) = ϕ(0), while rϕ⋆ (η) < ϕ(0) for all η ̸= 1/2. In
particular, ∆ϕ (δ) > 0 for δ > 0, as desired.
To see the converse, we must prove that ϕ is differentiable at 0. Recall that a subgradient gs
of the function ϕ at s ∈ R is any gs such that ϕ(t) ≥ ϕ(s) + gs (t − s) for all t ∈ R. Because
ϕ is classification calibrated, its domain necessarily includes an interval around 0; thus ϕ is
subdifferentiable at 0 and the subgradient set ∂ϕ(0) ̸= ∅ (see Theorem B.3.3 or Proposition B.3.20
in Appendix B.) Let g1 , g2 ∈ ∂ϕ(0). We show that both g1 , g2 < 0 and g1 = g2 , implying that ϕ is
differentiable at 0.
By convexity we have
We first show that g1 = g2 , meaning that ϕ is differentiable. Without loss of generality, assume
g1 > g2 . Then for η > 1/2, we would have ηg1 − (1 − η)g2 > 0, which would imply that
for all s ≥ 0 by (16.1.4), by taking s′ = 0 in the second inequality. By our assumption of classifi-
cation calibration, for η > 1/2 we know that
inf_s rϕ (s, η) < inf_{s≤0} rϕ (s, η) = rϕwrong (η),   so   rϕ⋆ (η) = inf_{s≥0} rϕ (s, η),
and under the assumption that g1 > g2 we obtain rϕ⋆ (η) = inf_{s≥0} rϕ (s, η) ≥ rϕwrong (η), which
contradicts classification calibration. We thus obtain g1 = g2 , so that the function ϕ has a
unique subderivative at s = 0 and is thus differentiable there.
Now that we know ϕ is differentiable at 0, consider
If ϕ′ (0) ≥ 0, then for s ≥ 0 and η > 1/2 the right hand side must be at least ϕ(0), which contradicts
classification calibration, because rϕ⋆ (η) < rϕwrong (η) as in the preceding argument.
Theorem 16.1.6 makes it easy to determine whether a convex margin-based loss is classification
calibrated: simply take its derivative at 0. Each of Examples 16.1.2–16.1.4 is thus immediately
classification calibrated. Other examples follow immediately as well.
Example 16.1.7 (Logistic loss): The logistic loss ϕ(t) = log(1 + e−t ) satisfies ϕ′ (0) = −1/2
and so is classification calibrated. 3
Example 16.1.8 (Squared hinge loss): The squared hinge loss ϕ(t) = [1 − t]2+ satisfies
ϕ′ (0) = −2 and so is classification calibrated. 3
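The derivative test of Theorem 16.1.6 is easy to automate: for convex ϕ, compare one-sided difference quotients at 0 and check that they agree and are negative. This finite-difference sketch is heuristic (step size and tolerance are illustrative choices), but it reproduces the examples above; the kinked loss [−t]+ is an assumed non-example.

```python
import numpy as np

def is_classification_calibrated(phi, h=1e-6, tol=1e-4):
    """Numerical check of Theorem 16.1.6 for a convex phi: classification
    calibrated iff phi is differentiable at 0 with phi'(0) < 0. We compare
    left and right difference quotients at 0 (a finite-difference sketch)."""
    left = (phi(0.0) - phi(-h)) / h
    right = (phi(h) - phi(0.0)) / h
    return abs(left - right) < tol and right < 0

logistic = lambda t: np.log1p(np.exp(-t))     # phi'(0) = -1/2: calibrated
sq_hinge = lambda t: max(1.0 - t, 0.0) ** 2   # phi'(0) = -2: calibrated
perceptron = lambda t: max(-t, 0.0)           # kinked at 0: not calibrated
```

For the kinked loss the two one-sided quotients differ (−1 versus 0), so the test correctly rejects it.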
We use expression (16.1.5) to upper bound R(f ) − R⋆ via the ϕ-risk, for which we use the
ψ-transform (16.1.2). By Jensen’s inequality,
We always have ∆(0) = 0, as rϕwrong (1/2) = inf_{s·0≤0} rϕ (s, 1/2) = rϕ⋆ (1/2), and by construction
∆ ≥ 0, so that its convex hull satisfies ψ ≥ 0 and ψ(0) = 0. Thus
Now we use the special structure of the suboptimality function. By construction of ψ as the convex
minorant of ∆, we have ψ ≤ ∆, and moreover, for any s ∈ R,

1 {sign(s) ̸= sign(2η − 1)} ∆(|2η − 1|) = 1 {sign(s) ̸= sign(2η − 1)} ( inf_{s′ (2η−1)≤0} rϕ (s′ , η) − rϕ⋆ (η) ),

because (1 + |2η − 1|)/2 = max{η, 1 − η}. Combining inequalities (16.1.6) and (16.1.7), we see that
= Rϕ (f ) − Rϕ⋆ ,
(c) If ∆(δ) > 0 for all δ > 0, then ψ(δ) > 0 for all δ > 0.
Deferring the proof of the lemma, we turn to the equivalence of items (i)–(iii) in the theorem.
To see that classification calibration implies ψ(δ) → 0 if and only if δ → 0 (i.e., (i) implies (ii)),
use Lemma 16.1.9: if ϕ is classification calibrated, then ∆(δ) > 0 for all δ > 0 and by continuity,
inf δ≥c ψ(δ) > 0 for all c > 0, so that ψ(δ) → 0 if and only if δ → 0. Now, let (ii) hold: then if
Rϕ (fn ) → Rϕ⋆ , we necessarily have ψ(R(fn ) − R⋆ ) → 0 by inequality (16.1.3); this occurs if and
only if R(fn ) − R⋆ → 0 by assumption; that is, implication (iii) holds. Finally, we show that if
for any sequence of measurable functions fn the convergence Rϕ (fn ) → Rϕ⋆ implies R(fn ) → R⋆
(implication (iii)), then the loss ϕ is necessarily classification calibrated (i). Assume for the sake of
contradiction that (iii) holds but (i) fails, that is, ϕ is not classification calibrated. Then there
must exist η < 1/2 and a sequence sn ≥ 0 (i.e., a sequence of predictions with incorrect sign)
satisfying
rϕ (sn , η) → rϕ⋆ (η).
Construct the classification problem with a singleton X = {x}, and set P(Y = 1) = η. Then the
sequence fn (x) = sn satisfies Rϕ (fn ) → Rϕ⋆ but the true 0-1 risk R(fn ) ̸→ R⋆ .
We return to the promised proof of Lemma 16.1.9.
Proof of Lemma 16.1.9  The function rϕ⋆ (η) = inf s {ηϕ(s) + (1 − η)ϕ(−s)} for η ∈ [0, 1], with
rϕ⋆ (η) = −∞ otherwise, is concave, as it is the infimum of linear functions of η; moreover, −rϕ⋆ is
closed convex. Because ϕ is nonnegative, rϕ⋆ has domain [0, 1]. In one dimension, closed convex
functions are continuous on their domains (see Observation B.3.6 in Appendix B.3.2).
Additionally, we also have
Additionally, we also have
for 0 ≤ η < 1/2, and by symmetry rϕwrong (η) = rϕwrong (1 − η), which is thus continuous on [0, 1/2) ∪ (1/2, 1].
A parallel argument to that for rϕ⋆ then shows that rϕwrong is left and right continuous at 1/2, and so

∆ϕ (δ) = rϕwrong ((1 + δ)/2) − rϕ⋆ ((1 + δ)/2)

is continuous on δ ∈ [0, 1], where ∆ϕ (1 + ϵ) = +∞ for ϵ > 0. That ψ = ∆∗∗ is continuous on [0, 1] is then
immediate by the continuity of closed convex functions on their domains.
The nonnegativity of ∆ is immediate, and to see that ∆(0) = 0, note that
rϕwrong (1/2) = inf_{s·0≤0} rϕ (s, 1/2) = inf_s rϕ (s, 1/2) = rϕ⋆ (1/2),
is closed, convex, and satisfies g ≤ h. But g(z) > 0 = h∗∗ (z), a contradiction to the fact that h∗∗
is the largest (closed) convex function below h.
Lemma 16.1.10 immediately yields the final claim (c) of Lemma 16.1.9: we know that ∆(δ) > 0
for all δ > 0 and that ∆ is continuous in δ.
Example 16.2.1 (Multiclass classification): Assume that Y ∈ {1, . . . , k}. For a score vector
s ∈ Rk , the zero-one error is
ℓ0-1 (s, y) = 1 { sy ≤ max_{j̸=y} sj } .
where ϕ is a convex differentiable function with ϕ′ (0) < 0. These are both consistent losses (as
we shall see). 3
Example 16.2.2 (Ranking problems): Suppose we wish to rank k items based on covariates
x, so that Y ∈ Sk , the set of permutations of [k], where we write Y (i) ≻ Y (j) to indicate that
i is ranked ahead of j. A prediction assigning scores s ∈ Rk to each of the k items naturally
induces a permutation via the sorting s(1) ≥ s(2) ≥ · · · ≥ s(k) . The Kendall tau distance counts
the pairwise disagreements between s and Y via
ℓ(s, y) = (k choose 2)^{−1} Σ_{i<j} 1{si ≥ sj and y(i) ≺ y(j)}.
In other scenarios, the data Y may consist of only partial feedback, such as a pairwise rank-
ing, so that Y = (i, j) indicates item i is preferred to j. Then for y = (i, j), the pairwise
disagreement becomes
ℓ(s, y) = 1{si ≤ sj}.
These related losses—and others like them—admit essentially no efficiently minimizable sur-
rogates. 3
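These pairwise losses are nonetheless easy to evaluate directly. The sketch below (the function name and the encoding of the true ranking as `rank[i]`, the position of item i with smaller meaning ranked ahead, are our own, not from the notes) computes the normalized count of discordant pairs between a score vector and a ground-truth ordering, ignoring ties in the scores.

```python
from itertools import combinations
from math import comb

def kendall_loss(s, rank):
    """Fraction of pairs (i, j) on which the scores s and the true ranking
    disagree: s orders i ahead of j while the ranking orders j ahead of i,
    or vice versa (ties in s are ignored here)."""
    k = len(s)
    discordant = sum(
        1 for i, j in combinations(range(k), 2)
        if (s[i] - s[j]) * (rank[j] - rank[i]) < 0
    )
    return discordant / comb(k, 2)
```

For instance, `kendall_loss([3, 2, 1], [0, 1, 2])` is 0 (the scores reproduce the ranking exactly), while reversing the scores makes every pair discordant and gives loss 1.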
Let PY denote the space of all probability distributions on Y, and define the conditional (point-
wise) risks r : Rk × PY → R and rφ : Rk × PY → R by
Following our development in the binary case, let r⋆ (P ) = inf s r(s, P ) denote the minimal condi-
tional risk, and similarly for rφ⋆ (P ), when Y has distribution P . If PY |x denotes the distribution of
Y conditioned on X = x, then we may rewrite the risk functionals as
which measures the gap between achievable (pointwise) risk and the best surrogate risk when we
enforce that the true loss is not minimized to accuracy better than ϵ. We can then define the
uniform suboptimality function
(Exercise 16.6 shows how this relates to rϕ for binary classification.) Now let ∆φ∗∗(ϵ) be the biconjugate of ∆φ, that is, ∆φ∗∗ is the largest convex function below ∆φ. We can then make the following definition, which analogizes Definition 16.2.
Definition 16.3 (Uniform calibration). The surrogate φ is uniformly calibrated for the loss ℓ if
∆φ (ϵ) > 0 for all ϵ > 0.
We have the following proposition, which analogizes Theorem 16.1.5.
Proposition 16.2.3. For any measurable f : X → Rk ,
∆φ∗∗(R(f) − R⋆) ≤ Rφ(f) − Rφ⋆.
where we use Jensen’s inequality and that ψ ≤ ∆φ . Now, note that by definition, for any P ∈ PY
and f (x) ∈ Rk , we have
giving the first statement of the proposition. For the second statement, note that ϵ 7→ ∆φ (ϵ) is
non-decreasing by construction, and uniform calibration implies it is strictly positive for ϵ > 0.
Lemma 16.1.10 implies its biconjugate is positive as well.
Definition 16.4. The surrogate φ is calibrated for the loss ℓ if for all P ∈ PY and ϵ > 0,
∆φ (ϵ, P ) > 0.
Definition 16.4 relaxes the uniform calibration condition in Definition 16.3 to apply pointwise, that
is, for each distribution P on Y. While this precludes the cleanest guarantee that ψ(R(f) − R⋆) ≤ Rφ(f) − Rφ⋆, under a minor condition to address integrability issues (see also Exercise 16.7), it still
is sufficient to guarantee surrogate risk consistency.
Theorem 16.2.4. Let ℓ be bounded. Then φ is calibrated for the loss ℓ if and only if it is surrogate
risk consistent for ℓ, that is, Rφ (fn ) → Rφ⋆ implies that R(fn ) → R⋆ .
Proof We prove the implication that φ is calibrated implies that it is surrogate risk consistent;
the converse is an essentially immediate exercise by considering distributions supported on a single
point x. Let B < ∞ be the bound on ℓ, so we may assume that r(s, P ) − r⋆ (P ) ≤ B. Let fn be a
sequence of functions satisfying Rφ (fn ) → Rφ⋆ and let ϵ > 0 be arbitrary. Let Px = P (Y ∈ · | X = x)
be the conditional distribution of Y given X = x, and for shorthand, define
so that δn (x) ∈ [0, B]. Then for the risk gap R(fn ) − R⋆ , we see that
where we have used that δn (x) ≤ ϵ on Acn,ϵ by definition. Now, consider the first expectation. Let
γ > 0 be otherwise arbitrary, and note that
for all γ ≥ 0. But of course, by assumption the first term satisfies Rφ(fn) − Rφ⋆ → 0, while the continuity of probability measures guarantees that limγ↓0 P(∆φ(ϵ, PX) ≤ γ) = P(∆φ(ϵ, PX) = 0) = 0.
In particular, E[1 {X ∈ An,ϵ } δn (X)] → 0, and so R(fn ) − R⋆ ≤ ϵ + o(1) as n → ∞. As ϵ > 0 was
arbitrary, this completes the proof.
Lemma 16.2.5. Let ϕ : R → R be non-increasing and satisfy ϕ(t) < ϕ(−t) for all t > 0, and let the surrogate φ take the pairwise form (16.2.3). Let p ∈ ∆k and s ∈ Rk satisfy rφ(s, p) = rφ⋆(p). If pi > pj, then si ≥ sj, and if additionally ϕ is differentiable with ϕ′(0) < 0, then si > sj.
Proof Without loss of generality, we take i = 1 and j = 2, so that p1 > p2. Let s ∈ Rk and let s′ = (s2, s1, s3, . . . , sk) be s with its first two entries swapped. Then
rφ(s, p) − rφ(s′, p) = p1 ϕ(s1 − s2) + p1 Σ_{j≥3} ϕ(s1 − sj) + p2 ϕ(s2 − s1) + p2 Σ_{j≥3} ϕ(s2 − sj)
    − [ p2 ϕ(s1 − s2) + p2 Σ_{j≥3} ϕ(s1 − sj) + p1 ϕ(s2 − s1) + p1 Σ_{j≥3} ϕ(s2 − sj) ]
  = (p1 − p2) [ ϕ(s1 − s2) − ϕ(s2 − s1) + Σ_{j≥3} (ϕ(s1 − sj) − ϕ(s2 − sj)) ].
If s1 < s2 , then ϕ(s1 − sj ) ≥ ϕ(s2 − sj ) and ϕ(s1 − s2 ) > ϕ(s2 − s1 ), by assumption, so that if
p1 > p2 , we must have rφ (s, p) − rφ (s′ , p) > (p1 − p2 ) · (0 + 0) = 0, which would contradict that
rφ (s, p) = rφ⋆ (p).
When ϕ is differentiable, if s minimizes rφ(s, p), then we have ∇rφ(s, p) = 0, and in particular,
0 = p1 Σ_{j=1}^k ϕ′(s1 − sj) − Σ_{j=1}^k pj ϕ′(sj − s1).
Assume for the sake of contradiction that s1 = s2 = t for some t ∈ R. Then the preceding equality and its analogue for the second coordinate imply that
p1 Σ_{j=1}^k ϕ′(t − sj) = p2 Σ_{j=1}^k ϕ′(t − sj),
and as p1 > p2, this forces Σ_{j=1}^k ϕ′(t − sj) = 0, which is impossible because ϕ′ ≤ 0 and ϕ′(t − s1) = ϕ′(0) < 0.
This lemma implies the following proposition, which gives sufficient conditions for a convex
margin-based loss to be surrogate risk consistent.
Proposition 16.2.6. Let ϕ : R → R be non-increasing, convex, and differentiable with ϕ′ (0) < 0,
and let the surrogate φ take the pairwise form (16.2.3). Then φ is surrogate risk consistent.
The differentiability assumption on ϕ is important here; without it, Proposition 16.2.6 may fail (see
Exercise 16.8).
Proof We use Theorem 16.2.4. Fix p ∈ ∆k and without loss of generality assume that p1 > p2 ;
by Theorem 16.2.4, it is then enough to show that
Let s(n) ∈ Rk be any sequence with s1(n) ≤ s2(n) for all n, and assume for the sake of contradiction that rφ(s(n), p) → rφ⋆(p). Then (working with the compactification [−∞, ∞] of R as necessary) there is a convergent subsequence, which without loss of generality we may take as the entire sequence, with s(n) → s ∈ (R ∪ {±∞})k, where s1 ≤ s2. But then (with trivial modification to address the infinite
case) this contradicts Lemma 16.2.5.
One can demonstrate other variants of Proposition 16.2.6 using similar loss functions. For
example, we have the following, whose proof Exercise 16.9 outlines.
Proposition 16.2.7. Let ϕ : R → R be non-increasing, convex, and differentiable with ϕ′ (0) < 0.
Let the surrogate be
φ(s, y) := ϕ(sy) + Σ_{j≠y} ϕ(−sj).
Then φ is surrogate risk consistent.
Let
Ω(p) = −H(p)
be the negative entropy, and require that H be closed, meaning that Ω is a closed convex function.
We immediately see that the loss
φ(s, y) := −sy + Ω∗(s),    (16.3.1)
where
Ω∗(s) = sup_{p∈∆k} {⟨p, s⟩ + H(p)},
satisfies
inf_s Ep[φ(s, Y)] = inf_s {Ω∗(s) − ⟨p, s⟩} = −Ω∗∗(p) = −Ω(p) = H(p).
We can record this as a proposition, where we recall the entropy Hℓ (p) := inf s Ep [ℓ(s, Y )] associated
to the loss ℓ from Chapter 14.
Proposition 16.3.1. Let H : ∆k → R be closed concave and Ω = −H. Then the losses φ(s, y) =
−sy + Ω∗ (s) are closed, convex, and satisfy
So any (generalized) entropy on the simplex generates a convex loss with the same entropy.
JCD Comment: Perhaps reference earlier material a little bit more carefully here.
The construction (16.3.1) is a privileged construction, as from it we can derive desirable prop-
erties of the convex loss φ itself, and in particular, the loss φ is frequently surrogate risk consistent
(Definition 16.1) for the zero-one error. Specializing Definition 16.4 to the classification case, we
consider the following definition.
Definition 16.5. The loss φ : Rk × [k] → R is classification calibrated for the zero-one loss if for
any p ∈ ∆k and y such that py < maxj pj ,
inf_s Σ_{j=1}^k pj φ(s, j) < inf_{s : sy ≥ maxj sj} Σ_{j=1}^k pj φ(s, j).
Theorem 16.2.4 shows that Definition 16.5 is equivalent to surrogate risk consistency.
The generalized entropy function H provides a convenient route to determining the classification
calibration of the surrogate φ from the construction (16.3.1). To provide sufficient conditions, we
recall the collection of uniformly convex functions (see Appendix C.1.2, equation (C.1.5)):
Definition 16.6. A convex function f : Rk → R is (λ, κ, ∥·∥)-uniformly convex over C ⊂ Rk if it
is closed and for all t ∈ [0, 1] and x1 , x2 ∈ C,
f(tx1 + (1 − t)x2) ≤ t f(x1) + (1 − t) f(x2) − (λ/2) t(1 − t) ‖x1 − x2‖κ [(1 − t)κ−1 + tκ−1].
Uniform convexity is a weaker notion than strong convexity (which corresponds to the case κ = 2), but it still guarantees that the function exhibits superlinear growth. The condition is sufficient to guarantee
the classification calibration of the associated loss φ:
Theorem 16.3.2. Let H be closed concave, symmetric, and have domain dom H = ∆k. Let φ have the definition (16.3.1). Assume that (a) H is strictly concave and inf_s Σ_{j=1}^k pj ℓ(s, j) is attained for each p ∈ ∆k, or (b) H is uniformly concave. Then φ is classification calibrated.
As should not be surprising, this result relies strongly on the stability properties of the minimizers
of the loss φ as well as the differentiability properties of the conjugate Ω∗ of the negative entropy
Ω = −H. We therefore postpone the proof to Section 16.3.1, instead using it to give a few more
concrete examples of consistent losses.
the classical entropy of Y . (Recall Examples 14.3.4 and 14.3.6.) The negative entropy Ω = −H is strongly convex over ∆k by Pinsker's inequality, and it has the dual
Ω∗(s) = log Σ_{j=1}^k e^{sj},
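For this entropy the construction (16.3.1) gives the multinomial logistic loss φ(s, y) = −sy + log Σ_j e^{sj}. A quick numerical sanity check (a sketch of our own; the shortcut of evaluating the stationary point s = log p, rather than optimizing over all s, is an assumption we make here) confirms that the attained risk Ep[φ(s, Y)] recovers the Shannon entropy H(p):

```python
import math

def phi(s, y):
    # multinomial logistic loss: phi(s, y) = -s_y + log(sum_j exp(s_j))
    return -s[y] + math.log(sum(math.exp(v) for v in s))

def expected_risk(s, p):
    # E_p[phi(s, Y)] for Y ~ p on {0, ..., k-1}
    return sum(py * phi(s, y) for y, py in enumerate(p))

p = [0.5, 0.25, 0.25]
H = -sum(q * math.log(q) for q in p)   # Shannon entropy of p
s_star = [math.log(q) for q in p]      # stationary point of E_p[phi(s, Y)]
# expected_risk(s_star, p) matches H, while other scores do no better
```

Since Σ_j exp(log p_j) = 1, the log-partition term vanishes at s⋆ and the risk reduces to −Σ_y p_y log p_y = H(p).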
Example 16.3.4 (Norm-based losses): For any 1 < κ < ∞, take any loss of the form
φ(s, y) = −sy + (1/κ) ‖s‖κκ ,
which for the conjugate exponent κ∗ = κ/(κ − 1) < ∞ satisfies
H(p) = inf_s Ep[φ(s, Y)] = −(1/κ∗) ‖p‖κ∗κ∗ .
This entropy is strictly concave, and the infimum is always attained, so Theorem 16.3.2 gives
that φ is surrogate-risk consistent. 3
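The closed form above follows from the first-order condition, which gives the minimizer sj = pj^{1/(κ−1)}; the sketch below (our own numerical verification, not from the notes) checks the identity for one choice of p and κ.

```python
def norm_loss_entropy(p, kappa):
    """Evaluate E_p[phi(s, Y)] for phi(s, y) = -s_y + (1/kappa)||s||_kappa^kappa
    at the stationary point s_j = p_j^(1/(kappa - 1))."""
    s = [q ** (1.0 / (kappa - 1.0)) for q in p]
    penalty = sum(v ** kappa for v in s) / kappa
    return sum(q * (-s[y] + penalty) for y, q in enumerate(p))

p, kappa = [0.5, 0.3, 0.2], 3.0
kappa_star = kappa / (kappa - 1.0)
# closed form from the example: H(p) = -(1/kappa*) * ||p||_{kappa*}^{kappa*}
closed_form = -sum(q ** kappa_star for q in p) / kappa_star
```

Both expressions reduce to −(1/κ∗)Σ_j pj^{κ∗}, so the two computations agree up to floating-point error.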
We also require the following technical lemma, whose proof we defer temporarily:
Lemma 16.3.7. Let f have Hölder-continuous gradient and assume f ⋆ = inf u f (u) > −∞. Then
if f (un ) → f ⋆ , we have ∇f (un ) → 0.
Now we may prove the theorem. Under assumption (a) of the theorem, that H is strictly concave
and that the infimum inf s Ep [φ(s, Y )] is always attained, the preceding observation immediately
gives the result: if pi > pj , then we must have
such that Σ_{j=1}^k pj φ(s(m), j) → H(p). Because H is uniformly concave (i.e., Ω is uniformly convex), the dual gradient ∇Ω∗ is Hölder continuous with dom ∇Ω∗ = Rk (Proposition C.2.7). Then because Σ_{j=1}^k pj φ(s, j) = −⟨p, s⟩ + Ω∗(s), we obtain
Example 16.4.2 (Image registration): The image registration problem arises when one is
given a sequence of images of a particular environment and wishes to identify objects within
sequential images. We can represent this as follows: consider two d × d images, collections of
pixels x ∈ [0, 1]d×d and x′ ∈ [0, 1]d×d (for simplicity, we assume the images only have a pixel
intensity and so are grayscale, though multi-channel images add little additional complexity).
The registration problem is to match pixels xij in the first image to pixels x′i′ j ′ in the second.
Assuming it is possible for pixels to be unmatched, a registration consists of a mapping π :
[d] × [d] → ([d] × [d] ∪ {∅}), where π(i, j) = (i′ , j ′ ) indicates the pixel at location (i, j) in image
x matches to (i′ , j ′ ) in image x′ (and π(i, j) = ∅ that the pixel is unmatched). 3
Abstracting away the details in the two examples, the typical setting we consider is that y ∈ Y
consists of “parts,” where each part contributes to the whole prediction. In Example 16.4.1, parts
are pairs of nodes, each with a particular binding affinity; in Example 16.4.2, we may represent
parts as pairs of nodes in a bipartite graph. So in the former, given a graph on k nodes, a prediction
is a subset y ⊂ [k] × [k] of edges, where for perfect matchings, each node i matches with exactly
one other node j (and not itself), and for symmetry we can represent (i, j) ∈ y if and only if
(j, i) ∈ y. In the latter, we are given two sets of k items, and wish to find a matching between
the two; equivalently, a permutation of [k], which we can therefore represent as y ∈ Sk the set of
permutations of [k], where y(i) represents the index to which i matches. Figure 16.1 represents this
graphically. In each of these cases, it is natural to represent the loss of a predicted matching y ′ for
a correct matching y by counting the number of edges incorrectly labeled.
In structured prediction settings, we typically assume the loss of interest decomposes across the
“parts” of y, so that each part of y contributes to the loss as a whole. We will consequently consider
a broad family of structured losses that we can write as follows. We have a statistic τ : Y → Rm
representing the parts of y, and the loss of a prediction y ′ ∈ Y for a correct value y ∈ Y is
To alleviate the abstractness of the formulation (16.4.1), we rewrite the problems we have already
considered in its form.
Example 16.4.3 (Cost-weighted multiclass classification): Let the label y ∈ {1, . . . , k}, and
let wij ≥ 0 be the loss for predicting class i when the true class is j (and wii = 0). Then we
take τ (y) = ey , the yth standard basis vector, and A ∈ Rk×k with entries Aij = wij , which
gives ℓ(y ′ , y) = τ (y ′ )⊤ Aτ (y) = wy′ ,y in the formulation (16.4.1). 3
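A minimal check of this bilinear representation (the particular cost matrix below is a hypothetical example of our own, not from the notes):

```python
def tau(y, k):
    # tau(y) = e_y, the y-th standard basis vector in R^k
    e = [0.0] * k
    e[y] = 1.0
    return e

def bilinear_loss(y_pred, y_true, A):
    # tau(y')^T A tau(y), the form (16.4.1)
    k = len(A)
    tp, tt = tau(y_pred, k), tau(y_true, k)
    return sum(tp[i] * A[i][j] * tt[j] for i in range(k) for j in range(k))

# hypothetical cost matrix: W[i][j] = cost of predicting i when the truth is j
W = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]]
```

By construction `bilinear_loss(i, j, W)` simply reads off the entry W[i][j], so the bilinear form reproduces the tabulated costs.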
Example 16.4.4 (Ranking and bipartite matchings): For bipartite matching, which is equiv-
alent to predicting a permutation, we let y, y ′ ∈ Sk be permutations. Consider the Hamming
Figure 16.1. Left two figures: a bipartite graph, with edges across, and a bipartite graph with
matching indicated. Right two figures: a fully connected graph and fully connected graph exhibiting
a perfect matching, with each node matched to exactly one other node.
distance between permutations, ℓ(y′, y) = Σ_{i=1}^k 1{y′(i) ≠ y(i)}. Then for ⟨·, ·⟩ the usual inner product on matrices, we let τ(y) ∈ {0, 1}k×k be the matrix representing the permutation y, that is, M = τ(y) satisfies Mi,y(i) = 1 for each i and zeros elsewhere. Then
ℓ(y′, y) = k − ⟨τ(y′), τ(y)⟩ = k − Σ_{i=1}^k 1{y′(i) = y(i)}
has the representation (16.4.1). Alternatively, Kendall's tau distance counts the pairwise disagreements between permutations y, y′ : [k] → [k] via
ℓ(y′, y) = Σ_{1≤i,j≤k} [ 1{y′(i) < y′(j) and y(i) > y(j)} + 1{y′(i) > y′(j) and y(i) < y(j)} ].
Recognizing this as the sum of discordant pairs and letting sign(0) = 0, we rewrite
ℓ(y′, y) = Σ_{1≤i,j≤k} (1 − sign(y′(i) − y′(j)) sign(y(i) − y(j))),
so we see that if we define the matrix τ (y) ∈ {−1, 0, 1}k×k by τ (y)ij = sign(y(i) − y(j)), then
ℓ(y ′ , y) = k 2 − ⟨τ (y ′ ), τ (y)⟩, which again is of the form (16.4.1). 3
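The permutation-matrix representation of the Hamming distance is easy to verify directly; a small sketch (our own encoding of a permutation y as a list with y[i] the index to which i matches):

```python
def perm_matrix(y):
    # tau(y) in {0,1}^{k x k}: row i has a single 1 in column y[i]
    k = len(y)
    M = [[0] * k for _ in range(k)]
    for i, yi in enumerate(y):
        M[i][yi] = 1
    return M

def matrix_inner(M, N):
    # the usual (entrywise) inner product on matrices
    return sum(a * b for rm, rn in zip(M, N) for a, b in zip(rm, rn))

def hamming(y1, y2):
    return sum(a != b for a, b in zip(y1, y2))

y1, y2 = [0, 1, 2, 3], [1, 0, 2, 3]
k = len(y1)
# k - <tau(y1), tau(y2)> equals the Hamming distance between y1 and y2
```

The inner product counts the positions where the two permutations agree, so subtracting it from k counts the disagreements.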
ℓ(y′, y) = Σ_{(i,j)∈y′} 1{(i, j) ∉ y} counts the number of edges present in y′ but not in y. Letting ⟨·, ·⟩ be the usual inner product on matrices and τ(y) ∈ Rk×k have entries τ(y)ij = 1{(i, j) ∈ y}, we have ℓ(y′, y) = −(1/2)⟨τ(y′), τ(y)⟩ + k/2, agreeing with the formulation (16.4.1). 3
Given a loss with the representation (16.4.1), the prediction function yb(s) from a vector s ∈ Rm of scores is
yb(s) ≡ pred(s) := argmax_{y∈Y} { τ(y)⊤ s },
which decodes a score vector s into a “most likely” yb; we choose the element arbitrarily if the
maximizer is non-unique. It will turn out that the generalized entropy associated to the loss (16.4.1)
and an analogue of the surrogate (16.3.1) will be consistent for predictions here, showing that the
construction (16.3.1) is indeed a privileged one, providing a natural approach (with guaranteed
surrogate-risk consistency properties) for constructing convex loss functions.
the structured prediction hinge loss. (See, for example, [58, 122, 178, 177].) Then so long as
ℓ(y ′ , y) > 0 for y ′ ̸= y, it is immediate that ϕ(s, y) = 0 implies yb(s) = y uniquely; moreover, the
loss (16.4.3) is convex as it is the maximum of linear functions of s.
While we will not dwell on this, the formulation (16.4.3) also has the major advantage that
for many structured sets Y—even exponentially large sets—computing the maximum is computa-
tionally efficient. For example, in Examples 16.4.4 and 16.4.5, computing the objective (16.4.3) is
equivalent to solving a maximum-weight matching problem in a graph. For the k-class classification
zero-one error ℓ0-1 (y, y ′ ) = 1 {y ̸= y ′ }, the loss (16.4.3) becomes
ϕ(s, y) = max_{j≠y} [1 + sj − sy]+ .
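For the zero-one case this specializes to a familiar one-liner (a sketch; the function name is ours):

```python
def multiclass_hinge(s, y):
    """phi(s, y) = max_{j != y} [1 + s_j - s_y]_+,
    the structured hinge loss specialized to the k-class zero-one error."""
    return max(0.0, max(1.0 + s[j] - s[y] for j in range(len(s)) if j != y))
```

With scores s = [2.0, 0.0, -1.0], the loss is 0 for y = 0, since the margin of class 0 over every other class exceeds 1, and is positive for the other labels.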
JCD Comment: Make an exercise to show how the loss (16.4.3) is a maximum-weight
matching.
Unfortunately, excepting (essentially) the case of binary classification, such margin-based classifiers are necessarily inconsistent. We can see this most clearly for the multiclass zero-one loss.
Proposition 16.4.6. Consider k-class classification with zero-one loss ℓ0-1 (y, y ′ ) = 1 {y ̸= y ′ }
and let y ⋆ = argmaxy∈[k] P (Y = y). Then for the hinge loss (16.4.3), the following hold:
The proposition shows that surrogate consistency can occur only if the problem is low noise enough
that the correct label has at least probability 21 . Whether this is a reasonable assumption depends
on the application, as in some problems the correct label is relatively obvious, meaning we are in
case (i) above. The result also helps to explain the consistency results in the binary setting, where
of course the correct label necessarily has probability at least 12 .
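One can see the failure numerically: when no label reaches probability 1/2, the uninformative score s = 0 achieves strictly smaller expected hinge loss ϕ(s, y) = max_{j≠y}[1 + sj − sy]+ than any score vector favoring the most likely class (a sketch under that form of the loss; the specific distribution p below is our own example):

```python
def hinge(s, y):
    # phi(s, y) = max_{j != y} [1 + s_j - s_y]_+
    return max(0.0, max(1.0 + s[j] - s[y] for j in range(len(s)) if j != y))

def risk(s, p):
    # E_P[phi(s, Y)] for Y ~ p
    return sum(py * hinge(s, y) for y, py in enumerate(p))

p = [0.4, 0.35, 0.25]   # no class reaches probability 1/2
e1 = [1.0, 0.0, 0.0]    # score vector favoring the most likely class
# risk([0,0,0], p) = 1.0 beats risk(e1, p) = 1.2, so the hinge
# minimizer carries no information about the most likely label
```

Here the constant score wins, so minimizing the surrogate cannot recover the Bayes prediction, exactly as the proposition indicates.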
Proof Let pj = P(Y = j) for shorthand, and define the function g(s, y) := [sy − sj]j≠y ∈ Rk−1, writing gy(s) := minj gj(s, y) = min_{j≠y}(sy − sj). Then
EP[ϕ(s, Y)] = Σ_{y=1}^k py [1 − minj gj(s, y)]+ = Σ_{y=1}^k py [1 − gy(s)]+.
Let s⋆ minimize the objective and write gy⋆ = gy(s⋆) for shorthand. We claim two properties: (P1) that gy⋆ ≤ 1 for each y, and (P2) that if yb = argmaxj gj⋆, then gj⋆ = −gyb⋆ for j ≠ yb. Assuming property (P1), the objective becomes Σy py(1 − gy), so that the problem is equivalent to maximizing Σy py gy. When property (P2) holds, Σy py gy = (2pyb − 1)gyb, and this in turn is equivalent to solving
Finally, we see that if no y ⋆ satisfies P (Y = y ⋆ ) ≥ 21 , then (2py − 1) < 0 for all y, and so choosing
s = 0 evidently maximizes the preceding display. If P (Y = y ⋆ ) > 12 , then the choice s = ey
evidently achieves the maximum above.
We return to properties (P1) and (P2). For (P1), suppose to the contrary that gy⋆ > 1 for some y. Then [1 − gy⋆]+ = 0, while for j ≠ y,
gj⋆ ≤ sj⋆ − sy⋆ = −(sy⋆ − sj⋆) ≤ −min_{j≠y}(sy⋆ − sj⋆) = −gy⋆ < −1,
which would yield objective EP[ϕ(s⋆, Y)] = Σ_{j=1}^k pj [1 − gj⋆]+. If py < 1/2, this quantity satisfies Σ_{j=1}^k pj [1 − gj⋆]+ > 2(1 − py) > 1, so that s = 0 would give better objective; if py ≥ 1/2, then setting s = ey would yield E[ϕ(s, Y)] = 2(1 − py) < Σ_{j≠y} pj [1 − gj⋆]+. To see property (P2), observe that
gyb⋆ ∈ [0, 1]. We know that gj⋆ ≤ sj⋆ − syb⋆ ≤ −gyb⋆ as above. Consider the score vector s with sj = 0 for j ≠ yb and syb = gyb⋆. Then gj(s) = −gyb⋆ ≥ gj⋆ for j ≠ yb, while gyb(s) = gyb⋆. So
Σ_{j=1}^k pj [1 − gj⋆]+ = pyb [1 − gyb⋆]+ + Σ_{j≠yb} pj [1 − gj⋆]+ ≥ pyb [1 − gyb(s)]+ + Σ_{j≠yb} pj [1 − gj(s)]+,
so that s attains an objective value no larger than that of s⋆ while satisfying property (P2).
Let ∆Y be the set of probability vectors (or p.m.f.s) on Y, where p ∈ ∆Y assigns probability py to y ∈ Y. Recalling Section 14.2.3, where we considered vector-valued targets y, we generalize the mean mapping by defining the marginal polytope associated with the part-mapping τ of y by
M := Conv({τ(y)}y∈Y) = { Σ_{y∈Y} py τ(y) | p ∈ ∆Y },
so that Hℓ is equivalent for all distributions P with identical mean µ(P ) = EP [τ (Y )]. With some
abuse of notation, we can therefore define the negative entropy mapping by
where IM (µ) = 0 if µ ∈ M and +∞ otherwise, which evidently satisfies Ω(ν) = Hℓ (P ) for any
ν ∈ M satisfying ν = µ(P ); Ω is convex.
With this construction in hand, the immediate generalization of the surrogate (16.3.1) is
φ(s, y) := −s⊤ τ(y) + Ω∗(s),    (16.4.4)
which is closed and convex. For this surrogate, we immediately observe that
Example 16.4.7 (Multiclass classification): For k-class classification with the zero-one loss
ℓ0-1 (y, y ′ ) = 1 {y ̸= y ′ }, we have τ (y) = ey and take A = 11⊤ −Ik in the representation (16.4.1).
Then for p ∈ ∆k ,
which of course we could have obtained by direct calculation. Exercise 16.3 asks you to show
that for Ω(p) = maxj pj − 1,
Ω∗(s) = 1 + max{ s(1) − 1, (s(1) + s(2) − 1)/2, . . . , (s(1) + · · · + s(k) − 1)/k },    (16.4.5)
where s(1) ≥ s(2) ≥ · · · ≥ s(k) denotes the vector s sorted in decreasing order. Then the convex surrogate loss of the zero-one loss is
φ(s, y) = 1 − sy + max{ s(1) − 1, (s(1) + s(2) − 1)/2, . . . , (s(1) + · · · + s(k) − 1)/k }.
In passing, we note that other losses also generate the 0-1 entropy H(p) = 1 − maxj pj . For
example, if we define
φhinge(s, y) := Σ_{j≠y} [1 + sj − sy]+ ,
then inf_s Ep[φhinge(s, Y)] = k(1 − maxj pj). (Exercise 16.3 asks you to prove this as well.) 3
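The closed form (16.4.5) can be probed against a brute-force computation of the conjugate over a grid of the simplex (a verification sketch of our own, written for k = 3; the grid size is chosen so that the uniform-over-top-m candidate maximizers lie on the grid):

```python
def conj_closed_form(s):
    # (16.4.5): Omega*(s) = 1 + max_m (s_(1) + ... + s_(m) - 1) / m,
    # with s sorted in decreasing order
    srt = sorted(s, reverse=True)
    prefix, best = 0.0, float("-inf")
    for m, v in enumerate(srt, start=1):
        prefix += v
        best = max(best, (prefix - 1.0) / m)
    return 1.0 + best

def conj_brute_force(s, n=60):
    # sup over a grid of the simplex of <p, s> - Omega(p),
    # where Omega(p) = max_j p_j - 1 (k = 3 only)
    best = float("-inf")
    for a in range(n + 1):
        for b in range(n + 1 - a):
            p = (a / n, b / n, (n - a - b) / n)
            val = sum(pi * si for pi, si in zip(p, s)) - max(p) + 1.0
            best = max(best, val)
    return best
```

On a test vector such as s = (0.3, −0.2, 0.1) the two computations agree to floating-point precision, and for s = (10, 0, 0) both give Ω∗(s) = 10, attained at the vertex p = e1.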
The first question is why, for the particular discrete-type loss (16.4.1), the surrogate (16.4.4)
should be (surrogate-risk) consistent. To gain some intuition for this case, we present a few calculations that rely on convex analysis, before we move to the more sophisticated argument to come,
which implies surrogate risk consistency. As always, we consider only pointwise versions of the
risk—as surrogate risk consistency requires only this—and fix a P ∈ ∆Y and its induced µ = µ(P ).
Recall the conditional surrogate loss rφ (s, P ) = EP [φ(s, Y )], and consider the vector s minimizing
it, thus satisfying
rφ (s, P ) = −s⊤ µ + Ω∗ (s).
As s minimizes r, we have Ω(µ) − s⊤ µ + Ω∗ (s) = 0, and thus (using Proposition C.2.3 in Ap-
pendix C.2.1) we necessarily have
s ∈ ∂Ω(µ).
As Ω(µ) = maxy∈Y −τ (y)⊤ Aµ + IM (µ), then recalling the definition NM (µ) = {v | ⟨v, ν − µ⟩ ≤
0 for all ν ∈ M} of the normal cone to M at µ (see Definition C.1), we have
by Proposition C.1.4, where we recall that µ = µ(P ). If we make the (unrealistically) simplifying
assumption that µ is interior to M (say, for example, if P assigns positive probability to all y ∈ Y),
i.e. µ ∈ int M, then NM (µ) = {0}. If we also assume there is only a single label y ⋆ ∈ Y minimizing
τ (y ⋆ )⊤ Aµ, that is, a single best prediction for the probabilities P on Y , then we obtain that
s = −A⊤ τ (y ⋆ ). Then of course the prediction function becomes
pred(s) = argmax_{y∈Y} { −τ(y)⊤ A⊤ τ(y⋆) } = y⋆.
Proposition 16.4.8. Let P be a distribution on (X, Y ) such that with probability 1 over the draw
of X, E[τ (Y ) | X] ∈ int M and yb(X) = argminy∈Y E[ℓ(y, Y ) | X] is unique. Then the surrogate
minimizer
f⋆(x) := argmin_s E[φ(s, Y) | X = x]
satisfies
E[ℓ(pred(f⋆(X)), Y)] = inf_f E[ℓ(f(X), Y)].
Thus, under appropriate regularity conditions on the distribution of the pairs (X, Y ), the surrogate
loss we have constructed from the entropy is consistent, in that given infinite data, it provides
optimal predictions.
To show the more sophisticated calibration guarantee of Definition 16.4, which (via Theo-
rem 16.2.4) implies surrogate risk consistency, we will use an additional assumption.
JCD Comment: Could we do it with an alternative assumption to A.16.1 that there
exists a distribution P ⋆ for which µ(P ⋆ ) = µ(P ) and P ⋆ (Y = y) > 0 for each y ∈ y ⋆ (µ)?
Then if ν ∈ Conv{τ (y)} in Eq. (16.4.9), we have ν = µ(P ′ ) for some P ′ supported only
on y ⋆ (µ). Now there exists a distribution Pν with µ(Pν ) = ν and Pν (y) > 0 for each
y ∈ y ⋆ (ν), and y ⋆ (µ(Pν )) = y ⋆ (ν) clearly.
The assumption is somewhat restrictive, though we can relax it at the expense of a much more
sophisticated analysis. For k-class multiclass classification problems with the zero-one loss as in
Example 16.4.7, we see immediately that if y ∈ argminy EP [ℓ0-1 (y, Y )], then P (Y = y) ≥ k1 > 0.
In bipartite matching problems (Example 16.4.4), however, the assumption can fail when we allow
arbitrary distributions on the collections of matchings (see Exercise 16.4).
Nonetheless, under the assumption, we have the following theorem, whose proof we provide in
Section 16.4.3.
Theorem 16.4.9. Let Assumption A.16.1 hold for a structured prediction loss ℓ taking the form (16.4.1).
Then the surrogate φ(s, y) = −s⊤ τ (y) + Ω∗ (s) is surrogate-risk consistent for ℓ.
So the generalized entropy provides a general purpose construction that guarantees consistency. As
we note above, in some cases Assumption A.16.1 may fail to hold for an arbitrary distribution P
on the set Y. If we assume the problem is such that the conditional distribution of Y satisfies the
low-noise condition
for all x except a null set, then the theorem still holds.
Corollary 16.4.10. Let the distribution on (X, Y ) satisfy the low noise condition (16.4.7). Then
the surrogate φ(s, y) is surrogate-risk consistent for ℓ, that is, Rφ (fn ) → Rφ⋆ implies R(fn ) → R⋆ .
where r(s, P ) = EP [ℓ(pred(s), Y )] and r⋆ (P ) = inf s EP [ℓ(pred(s), Y )]. In this case, we may
simplify the quantities by writing out the entropy functionals explicitly as
where yb(s) is (an arbitrary) element of the prediction set argmaxy s⊤ τ (y). We need only show that
∆φ (ϵ, P ) > 0 whenever ϵ > 0, which we prove by contradiction.
Thus, assume for the sake of contradiction that ∆φ (ϵ, P ) = 0. As the losses φ are piecewise
linear and the set of s such that r(s, P ) − r⋆ (P ) ≥ ϵ is a union of polyhedra, there must be s
achieving the infimum, and so for some vector of scores s, we have
Ω∗ (s) − s⊤ µ + Ω(µ) = 0
while yb(s) is incorrect. Following the calculation (16.4.6), we thus obtain that for some ν ⋆ ∈ M
satisfying ⟨ν ⋆ , Aµ⟩ = miny τ (y)⊤ Aµ and a vector w ∈ NM (µ), we have
s = −A⊤ ν ⋆ + w.
For ν ∈ M, define the shorthand y ⋆ (ν) = argminy τ (y)⊤ Aν, which is a set-valued mapping. If
we can show the inclusions
yb(s) ⊂ y ⋆ (ν ⋆ ) ⊂ y ⋆ (µ(P )), (16.4.8)
then the proof is complete, as we would evidently have our desired contradiction: necessarily, for any yb ∈ yb(s), we would obtain τ(yb)⊤Aµ(P) = miny τ(y)⊤Aµ(P).
To see that the inclusion y⋆(ν⋆) ⊂ y⋆(µ(P)) holds is relatively straightforward. Let
ν′ ∈ Conv{ τ(y) | τ(y)⊤Aµ(P) = min_{y′∈Y} EP[ℓ(y′, Y)] = Hℓ(P) }    (16.4.9)
be otherwise arbitrary. For all P ′ satisfying the mean-mapping equality ν ′ = µ(P ′ ), the identifi-
ability assumption A.16.1 guarantees that if y ∈ y ⋆ (ν ′ ), we must have P ′ (Y = y) > 0. That is,
y⋆(ν′) is contained in the supports
y⋆(ν′) ⊂ ∩_{P′} { supp P′ | ν′ = µ(P′) }.
In equation (16.4.9), we may represent each ν ′ via P ′ supported only on {y | τ (y)⊤ Aµ(P ) =
Hℓ (P )} = y ⋆ (µ(P )). Thus y ⋆ (ν ′ ) ⊂ y ⋆ (µ(P )) for all ν ′ in the convex hull (16.4.9), and in particular
for ν ⋆ .
The first inclusion in the chain (16.4.8) is more challenging. We begin with a convex analytic result that allows us to simplify maximizers of s⊤τ(y) in y.
Lemma 16.4.11. Let w ∈ NM(µ) be the element satisfying s = −A⊤ν⋆ + w. Then for any y ∈ Y
and any z ∈ supp P ,
⟨τ (z) − τ (y), w⟩ ≥ 0.
Proof Fix any y ∈ Y and let z ∈ supp P , so that pz = P (Y = z) > 0. Then for some vector
α ∈ Conv(τ(y′) | y′ ∉ {y, z}), we can write µ(P) = λy τ(y) + λz τ(z) + (1 − λy − λz)α, where
λy ≥ 0, λz ≥ pz > 0, and λy + λz ≤ 1. The vector ν = (λy + λz )τ (y) + (1 − λy − λz )α similarly
satisfies ν ∈ M. By the definition of the normal cone NM (µ), we know that w⊤ (µ′ − µ) ≤ 0 for all
µ′ ∈ M, and in particular this holds for µ′ = ν. As
ν − µ = λz (τ (y) − τ (z)),
we obtain
λz w⊤ (τ (y) − τ (z)) ≤ 0,
and as λz > 0 the lemma follows.
With Lemma 16.4.11 in hand, we can consider the prediction set yb(s) = argmaxy s⊤ τ (y). As
s = −A⊤ ν ⋆ + w, we have
yb(s) = argmax_{y∈Y} { −τ(y)⊤ A⊤ ν⋆ + τ(y)⊤ w } = argmax_{y∈Y} { −τ(y)⊤ Aν⋆ + τ(y)⊤ w },
τ (y ′ )⊤ w ≤ τ (y)⊤ w.
We re-adopt the notation of the generalized entropy associated with the loss ℓ : Rk × Y → R that we use in Section 14.1.3, defining
Hℓ(Y) := inf_s E[ℓ(s, Y)]
and the conditional entropy Hℓ(Y | X) := EX[ inf_s E[ℓ(s, Y) | X] ],
the average expected loss observing X. Then, as in definition (14.1.7), the information that X
carries about Y is the amount that X reduces the expected loss in prediction Y , as measured by
the loss ℓ:
Iℓ (X; Y ) := Hℓ (Y ) − Hℓ (Y | X).
We have seen examples in which the entropy associated to different losses was (up to a constant
multiplicative factor) identical and constructed surrogate risks with the property that the associated
entropy was identical to the original loss ℓ. For example, in binary classification, letting p = P (Y =
1) and (1 − p) = P (Y = −1), the zero-one loss has entropy
We might ask whether two losses having identical entropies guarantees something more than the
basic consistency properties we have developed.
To answer this question, we adopt the language of quantization and data processing. We will be somewhat
abstract and say that a quantizer of X ∈ X is any mapping q : X → Z for some space Z. We
will think of such quantizers as a data representation, taking raw inputs x and transforming them
into z ∈ Z. Then for a prediction function f : Z → Rk , we define the quantized risk and optimal
quantized risks
L(f | q) := E[ℓ(f(q(X)), Y)] and L⋆(q) := inf_f L(f | q).
Because L⋆(q) = Hℓ(Y | q(X)), we have
L⋆(q1) < L⋆(q2) if and only if Iℓ(q1(X); Y) > Iℓ(q2(X); Y),
that is, q1(X) typically carries more information about Y than q2(X).
We can now ask for stronger versions of surrogate risk consistency, where in addition to con-
sistency of φ for a given loss ℓ, we ask that using φ to choose a data representation (or quantizer)
from a class Q of potential quantizers should be equivalent to using the original loss ℓ. We therefore
define the following equivalence:
Definition 16.7. Losses ℓ1 and ℓ2 are universally equivalent if for all distributions on (X, Y ) and
all quantizers q1 and q2 ,
Iℓ1 (q1 (X); Y ) ≤ Iℓ1 (q2 (X); Y ) if and only if Iℓ2 (q1 (X); Y ) ≤ Iℓ2 (q2 (X); Y ).
We note in passing that, swapping the roles of q1 and q2 and taking contrapositives, an equivalent formulation of Definition 16.7 is that
Iℓ1(q1(X); Y) < Iℓ1(q2(X); Y) if and only if Iℓ2(q1(X); Y) < Iℓ2(q2(X); Y).
It is almost immediate that if the generalized entropies Hℓ1 and Hℓ2 associated with losses ℓ1 and ℓ2 are equal up to multiplicative and affine constants, then the losses ℓ1 and ℓ2 are universally equivalent. The converse turns out to be true as well for multiclass classification problems, where ℓ : Rk × [k] → R.
Theorem 16.5.1. Let the multiclass losses ℓ1 and ℓ2 be bounded below and H1 and H2 be the
associated generalized entropies. Then ℓ1 and ℓ2 are universally equivalent if and only if there exist
a > 0, b ∈ Rk , and c ∈ R such that for all distributions on Y ∈ [k],
H1 (Y ) = aH2 (Y ) + b⊤ p + c, (16.5.1)
where p = [P (Y = y)]ky=1 is the p.m.f. of Y .
One direction of the theorem, as we mention above, is easy: if the entropy equivalence (16.5.1)
holds, then ℓ1 and ℓ2 are universally equivalent. Indeed, we see that
I1 (q1 (X); Y ) ≤ I1 (q2 (X); Y ) if and only if H1 (Y | q1 (X)) ≥ H1 (Y | q2 (X)),
and letting p = [P (Y = y)]ky=1 be the marginal probabilities of the label Y , the latter occurs if and
only if
aH2 (Y | q1 (X)) + b⊤ [P (Y = y)]ky=1 + c ≥ aH2 (Y | q2 (X)) + b⊤ [P (Y = y)]ky=1 + c.
The sufficiency of condition (16.5.1) is thus immediate; for its necessity, see Section 16.5.1.
Example 16.5.2 (Universal equivalence for 0-1 losses): Consider classification with the 0-1
loss ℓ0-1(s, y) = 1{s_y ≤ max_{j≠y} s_j}. Then Hℓ0-1(Y) = 1 − max_y P(Y = y). Several convex
surrogates are both surrogate risk consistent and universally equivalent to the zero-one loss.
Recalling Example 16.4.7,
φ(s, y) := 1 − s_y + max{ s_(1) − 1, (s_(1) + s_(2) − 1)/2, . . . , (s_(1) + · · · + s_(k) − 1)/k }

is consistent and equivalent to ℓ0-1, with Hφ(Y) = Hℓ0-1(Y). The hinge-type loss
φhinge(s, y) = Σ_{j≠y} [1 + s_j − s_y]_+ also satisfies Hφhinge(Y) = k · Hℓ0-1(Y). ⋄
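These entropy values are easy to check numerically. The following Python sketch (our own helper names, not part of the notes) evaluates the hinge-type surrogate risk at a candidate minimizer and confirms Hφhinge(Y) = k · Hℓ0-1(Y) for one illustrative p.m.f.:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                # illustrative p.m.f. of Y, k = 3
k = len(p)

def hinge_risk(s):
    # E[phi_hinge(s, Y)] with phi_hinge(s, y) = sum_{j != y} [1 + s_j - s_y]_+
    total = 0.0
    for y in range(k):
        margins = np.maximum(1.0 + s - s[y], 0.0)
        total += p[y] * (margins.sum() - margins[y])  # drop the j = y term
    return float(total)

H01 = 1.0 - float(p.max())                   # H_{0-1}(Y) = 1 - max_y P(Y = y)
s_star = np.where(np.arange(k) == np.argmax(p), 0.5, -0.5)
val_star = hinge_risk(s_star)
# random points cannot do better: the objective is convex with infimum k * H01
val_rand = min(hinge_risk(rng.normal(size=k)) for _ in range(2000))
```

Here s_star is one (hypothetical but verifiable) minimizer; the random search is only a sanity check that nothing beats it.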
We complete this section with a corollary of Theorems 16.5.1 and 16.4.9, which shows that the
entropy-based construction of surrogate losses is enough to achieve both surrogate risk consistency
and universal equivalence of losses.
Corollary 16.5.3. Let ℓ be a structured prediction loss of the form (16.4.1) with associated
generalized entropy Hℓ(Y) = inf_s E[ℓ(pred(s), Y)], and let Assumption A.16.1 hold. Then the
surrogate (16.4.4), φ(s, y) := −s⊤τ(y) + Ω∗(s), is both universally equivalent to and surrogate
risk consistent for ℓ.
In brief, given a (discrete) loss, we can construct its entropy, develop a convex surrogate whose
associated entropy is identical to the original function, and guarantee consistency of the convex
surrogate.
By inspection, each h_i is a closed concave function (as it is the infimum of linear functions of p),
and by symmetry they satisfy h_i(0) = h_i(1) = 0 and h_i(1/2) = sup_{p∈[0,1]} h_i(p). We show that these
entropies satisfy a particular order equivalence property on [0, 1], which will turn out to be sufficient
to prove their equality.
To motivate what follows, recall that universal equivalence (Def. 16.7) must hold for all dis-
tributions on (X, Y ), and hence all (measurable) spaces X and joint distributions on X × {±1}.
Thus, consider a space X that we can partition into sets {A, Ac } or {B, B c }, where we take the
conditional distributions
Y | X ∈ A = 1 w.p. p_a, −1 w.p. 1 − p_a,    Y | X ∈ A^c = 1 w.p. q_a, −1 w.p. 1 − q_a,

and

Y | X ∈ B = 1 w.p. p_b, −1 w.p. 1 − p_b,    Y | X ∈ B^c = 1 w.p. q_b, −1 w.p. 1 − q_b,

where we require the consistency condition that the marginal over Y remains constant, that is,
with P(A) = P(X ∈ A) and P(B) = P(X ∈ B), we have

P(A)p_a + (1 − P(A))q_a = P(B)p_b + (1 − P(B))q_b.
Then evidently, by defining quantizers q1 and q2 such that q1(x) = 1{x ∈ A} and q2(x) =
1{x ∈ B}, universal equivalence requires that

P(A)h1(p_a) + (1 − P(A))h1(q_a) ≤ P(B)h1(p_b) + (1 − P(B))h1(q_b) if and only if
P(A)h2(p_a) + (1 − P(A))h2(q_a) ≤ P(B)h2(p_b) + (1 − P(B))h2(q_b).

In the special case that P(A) = P(B) = 1/2, this becomes

h1(p_a) + h1(q_a) ≤ h1(p_b) + h1(q_b) if and only if h2(p_a) + h2(q_a) ≤ h2(p_b) + h2(q_b)

whenever p_a + q_a = p_b + q_b.
Generalizing this construction by taking distributions over X that partition it into k equiprobable
sets {A1, . . . , Ak} or {B1, . . . , Bk}, each with P(Ai) = P(Bi) = 1/k, we see that universal
equivalence implies that for any vectors p ∈ [0, 1]^k and q ∈ [0, 1]^k satisfying 1⊤p = 1⊤q,

Σ_{i=1}^k h1(p_i) ≤ Σ_{i=1}^k h1(q_i) if and only if Σ_{i=1}^k h2(p_i) ≤ Σ_{i=1}^k h2(q_i). (16.5.2)
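Condition (16.5.2) is straightforward to probe numerically when h2 is an affine modification of h1, as in (16.5.1). The following Python sketch (illustrative choices of h1 and of the constants; all names are ours) exploits that, since 1⊤p = 1⊤q, the linear b-term contributes equally to both sides, so the orderings must agree:

```python
import numpy as np

rng = np.random.default_rng(0)

def h1(p):
    # binary entropy, a prototypical concave generalized entropy on [0, 1]
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

a, b, c = 2.0, -0.3, 0.1                     # illustrative constants, a > 0

def h2(p):
    return a * h1(p) + b * p + c

k, trials, mismatches = 5, 0, 0
while trials < 500:
    p = rng.uniform(0, 1, k)
    q = rng.uniform(0, 1, k)
    q = q - (q.mean() - p.mean())            # enforce 1^T p = 1^T q
    if np.any(q < 0) or np.any(q > 1):
        continue
    s1, t1 = float(h1(p).sum()), float(h1(q).sum())
    if abs(s1 - t1) < 1e-9:                  # skip numerical ties
        continue
    trials += 1
    # the b-term contributes b * 1^T p = b * 1^T q to both sides, so it cancels
    if (s1 <= t1) != (h2(p).sum() <= h2(q).sum()):
        mismatches += 1
```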
We shall give condition (16.5.2) a name, as it implies certain equivalence properties for convex
functions (we can replace hi with −hi and obtain convex functions).
Definition 16.8. Let Ω ⊂ R be a closed interval and let f1, f2 : Ω → R be closed convex functions.
Then f1 and f2 are order equivalent if for any k ∈ N and vectors s ∈ Ω^k and t ∈ Ω^k satisfying
1⊤s = 1⊤t, we have

Σ_{i=1}^k f1(s_i) ≤ Σ_{i=1}^k f1(t_i) if and only if Σ_{i=1}^k f2(s_i) ≤ Σ_{i=1}^k f2(t_i).
As in the brief remark following Definition 16.7, by taking complements we have as well that

Σ_{i=1}^k f1(s_i) < Σ_{i=1}^k f1(t_i) if and only if Σ_{i=1}^k f2(s_i) < Σ_{i=1}^k f2(t_i).
The theorem will then be proved if we can show the following lemma.
Lemma 16.5.4. Let f1 and f2 be order equivalent on Ω. Then there exist a > 0, and b, c ∈ R such
that f1 (t) = af2 (t) + bt + c for all t ∈ Ω.
The proof of Lemma 16.5.4 is somewhat involved, and we proceed in three parts. The key is
that order equivalence actually implies a strong relationship between affine combinations of points
in the domain of the functions f_i, not just convex combinations of points, which guarantees that
we can predict the value of f2(v) for any v ∈ Ω from just three values of f_i evaluated in Ω. We state
this as a lemma, whose proof we defer temporarily to Section 16.5.2.
Lemma 16.5.5. If f1 and f2 are order equivalent on Ω, then for any λ ∈ R^k satisfying λ⊤1 = 1
and any u ∈ Ω^k, if λ⊤u = v ∈ Ω then

Σ_{i=1}^k λ_i f1(u_i) ≤ f1(v) if and only if Σ_{i=1}^k λ_i f2(u_i) ≤ f2(v),

and the statement still holds with both inequalities replaced with strict inequalities.
In particular, if

Σ_{i=1}^k λ_i f1(u_i) = f1(v) then necessarily Σ_{i=1}^k λ_i f2(u_i) = f2(v). (16.5.3)
Lemma 16.5.6. Let f be convex on R. Let u0 < u1 and for λ ∈ [0, 1], define u_λ = (1 − λ)u0 + λu1.
If there exists any λ ∈ (0, 1) such that f(u_λ) = (1 − λ)f(u0) + λf(u1), then f is linear on [u0, u1].
We leave the proof (an algebraic manipulation using the definition of convexity) as Exercise 16.11.
The last intermediate step we require in the proof of Lemma 16.5.4 is that at three particular
points in the domain Ω, we can satisfy the conclusion of Lemma 16.5.4.

Lemma 16.5.7. Let f1, f2 be order equivalent on Ω = [u0, u1] and u_c = (u0 + u1)/2. There are
a > 0 and b, c ∈ R such that f1(t) = af2(t) + bt + c for t ∈ {u0, u_c, u1}.
find such a λ, then equality (16.5.3) guarantees that f1(v) = f2(v), and we are done. As the points
(u0, f1(u0)), (u_c, f1(u_c)), (u1, f1(u1)) are not collinear (recall Lemma 16.5.6), the matrix

A := [ 1        1          1
       u0       u_c        u1
       f1(u0)   f1(u_c)    f1(u1) ]

is invertible, so we may solve Aλ = (1, v, f1(v))⊤ for a vector λ, which evidently satisfies our
desiderata. Thus f1(v) = f2(v), and as v was arbitrary, the proof is complete.
α⊤u = v + β⊤u or a⊤u = kv + b⊤u.

Then each has entries in Ω, and we have 1⊤t = 1⊤s. Then order equivalence (Def. 16.8) guarantees
that

Σ_{i=1}^m a_i f1(u_i) ≤ k f1(v) + Σ_{i=1}^m b_i f1(u_i) if and only if
Σ_{i=1}^m a_i f2(u_i) ≤ k f2(v) + Σ_{i=1}^m b_i f2(u_i).
From the first we obtain c = f1(0) − af2(0), and substituting this into the third yields b = f1(1) −
f1(0) − a(f2(1) − f2(0)). Finally, substituting both equalities into the equality with f1(1/2) yields
that

f1(1/2) = a( f2(1/2) − (f2(1) − f2(0))/2 ) + (f1(1) − f1(0))/2 + f1(0) − af2(0)
        = a( f2(1/2) − (f2(1) + f2(0))/2 ) + (f1(1) + f1(0))/2.

As we know that f1, f2 are nonlinear, Lemma 16.5.6 applies, so that the convexity gaps
f1(1/2) − (f1(1) + f1(0))/2 and f2(1/2) − (f2(1) + f2(0))/2 are both nonzero (indeed, both strictly
negative, by convexity), and thus we may take

a = ( f1(1/2) − (f1(0) + f1(1))/2 ) / ( f2(1/2) − (f2(0) + f2(1))/2 ) > 0.
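Concretely, these three equalities determine (a, b, c) from function values alone. A small Python check (with a hypothetical pair f1 = a0 f2 + b0 t + c0 on Ω = [0, 1]; the names are ours) recovers the constants via exactly the displayed formulas:

```python
import numpy as np

a0, b0, c0 = 1.7, -0.4, 0.25                 # hypothetical ground truth, a0 > 0

def f2(t):
    return t ** 2                            # convex and nonlinear on [0, 1]

def f1(t):
    return a0 * f2(t) + b0 * t + c0

# recover a from the midpoint convexity gaps, then c and b in turn
a = (f1(0.5) - (f1(0.0) + f1(1.0)) / 2) / (f2(0.5) - (f2(0.0) + f2(1.0)) / 2)
c = f1(0.0) - a * f2(0.0)
b = f1(1.0) - f1(0.0) - a * (f2(1.0) - f2(0.0))

grid = np.linspace(0, 1, 101)
max_err = float(np.max(np.abs(f1(grid) - (a * f2(grid) + b * grid + c))))
```

The affine part of f1 cancels in the midpoint gap, which is why the ratio of gaps isolates a.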
16.6 Bibliography
Section 16.1 is based on Bartlett, Jordan, and McAuliffe [19]. See also Zhang [196], Lugosi and
Vayatis [142], Zhang and Yu [197]. The general treatment follows [173]. The particular multiclass
results we present are due to Zhang [195].
The proof of necessity in Theorem 16.5.1 appears in Section 16.5.1. See also Nowak-Vila et al. [151],
and, for the inconsistency result (Proposition 16.4.6), Liu [140].
16.7 Exercises
Exercise 16.1: Find the suboptimality function ∆ϕ and ψ-transform (16.1.2) for the binary
classification problem (Y ∈ {−1, 1}) with the following losses.
(c) Squared error (ordinary regression). The surrogate loss in this case for the pair (x, y) is
(1/2)(f(x) − y)². Show that for y ∈ {−1, 1}, this can be written as a margin-based loss, and
compute the associated suboptimality function ∆ϕ and ψ-transform. Is the squared error
classification calibrated?
Exercise 16.2: Suppose we have a regression problem with data (independent variables) x ∈ X
and y ∈ R. We wish to find a predictor f : X → R minimizing the probability of being far away
from the true y, that is, for some c > 0, our loss is of the form

ℓ(s, y) = 1{|s − y| ≥ c}.
Show that no loss of the form φ(s, y) = |s − y|p , where p ≥ 1, is Fisher consistent for the loss ℓ,
even if the distribution of Y conditioned on X = x is symmetric about its mean E[Y | X]. That is,
show there exists a distribution on pairs X, Y such that the set of minimizers of the surrogate
Rφ (f ) := E[φ(f (X), Y )]
is not included in the set of minimizers of the true risk, R(f ) = P(|Y − f (X)| ≥ c), even if the
distribution of Y (conditional on X) is symmetric.
Exercise 16.3 (Hinge-type losses and entropies [76]):
Exercise 16.4: Let ℓ be the Hamming loss on bipartite matchings, so that for y, y′ ∈ Sk, we
define ℓ(y, y′) = Σ_{i=1}^k 1{y(i) ≠ y′(i)} as in Example 16.4.4. Let Mπ be the permutation matrix
associated with a permutation π.
(b) Show that Assumption A.16.1 must fail for bipartite matchings.
(c) Using permutations that induce block diagonal matrices of the form
M = [ I_j   0          0
      0     [0 1; 1 0] 0
      0     0          I_{k−j−2} ],
show a stronger failure of Assumption A.16.1: there exist distributions on permutations for
which there exists a unique minimizer of y ⋆ = argminy E[ℓ(y, Y )] but P (Y = y ⋆ ) = 0.
Exercise 16.5 (Empirics of classification calibration): In this problem you will compare the
performance of hinge loss minimization and an ordinary linear regression in terms of classifica-
tion performance. Specifically, we compare the performance of the hinge surrogate loss with the
regression surrogate when the data is generated according to the model
(i) Generate two collections of n datapoints in d dimensions according to the model (16.7.1),
where θ ∈ Rd is chosen (ahead of time) uniformly at random from the sphere {θ ∈ Rd : ∥θ∥2 =
R}, and where each xi ∈ Rd is chosen as N(0, I_{d×d}). Let (xi, yi) denote pairs from the first
collection and (x_i^test, y_i^test) pairs from the second.
(ii) Set

θ̂_hinge = argmin_{θ:∥θ∥2≤R} (1/n) Σ_{i=1}^n [1 − yi⟨xi, θ⟩]_+

and

θ̂_reg = argmin_θ (1/2n) Σ_{i=1}^n (yi − ⟨xi, θ⟩)² = argmin_θ ∥Xθ − y∥²₂.
(iii) Evaluate the 0-1 error rate of the vectors θ̂_hinge and θ̂_reg on the held-out data points
{(x_i^test, y_i^test)}_{i=1}^n.
Perform the preceding steps (i)–(iii), using any n ≥ 100 and d ≥ 10 and a radius R = 5, for
different standard deviations σ ∈ {0, 1, . . . , 10}; perform the experiment a number of times. Give
a plot or table exhibiting the performance of the classifiers learned on the held-out data. How do
the two compare? Given that for the hinge loss we know ∆ϕ(δ) = δ (as presented in class), what
would you expect based on the answer to Exercise 16.1?
I have implemented (in the julia language; see http://julialang.org/) methods for solving
the hinge loss minimization problem with stochastic gradient descent so that you do not need to.
The file is available at this link. The code should (hopefully) be interpretable enough that if julia
is not your language of choice, you can re-implement the method in an alternative language.
Exercise 16.6:
Show that in the case of binary classification with the zero-one loss, so that φ(s, y) = ϕ(sy), the
uniform gap (16.2.2) and binary gap (16.1.1) are equal, that is, ∆φ (ϵ) = ∆ϕ (ϵ). In particular,
uniform calibration (Definition 16.3) and classification calibration (Definition 16.2) are equivalent.
Exercise 16.7: We generalize Theorem 16.2.4 to the case that the loss ℓ is not uniformly bounded.
Assume there exists an upper bound function B : X → R+ such that E[B(X)] < ∞ and r(s, Px) ≤
r⋆(Px) + B(x) for all x ∈ X and s ∈ Rk. Show that φ is calibrated for the loss ℓ (Definition 16.4)
if and only if it is surrogate risk consistent (Definition 16.1).
Exercise 16.8: Let ϕ(t) = [1 − t]_+. Show by example that the surrogate (16.2.3) with φ(s, y) =
Σ_{j=1}^k ϕ(s_y − s_j) may be inconsistent.
Exercise 16.10 (Bayes risk gaps): Consider a general binary classification problem with (X, Y ) ∈
X × {−1, 1}. Let ϕ(t) = log(1 + e−t ), so that we use the logistic loss. Show that the surrogate risk
gap satisfies

Hϕ(Y) − Hϕ(Y | X) = I(X; Y),

where I is the mutual information.
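A numerical sanity check of this exercise (a Python sketch using scipy; the helper names are ours, not the notes'): for the logistic loss the generalized entropy Hϕ coincides with the Shannon entropy in nats, so the surrogate risk gap matches the mutual information computed directly:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def phi(t):
    # logistic loss phi(t) = log(1 + e^{-t})
    return np.log1p(np.exp(-t))

def H_phi(p1):
    # generalized entropy inf_s E[phi(s Y)] when P(Y = 1) = p1
    obj = lambda s: p1 * phi(s) + (1 - p1) * phi(-s)
    return float(minimize_scalar(obj, bounds=(-30, 30), method="bounded").fun)

def shannon(p1):
    # Shannon entropy of a Bernoulli(p1) label, in nats
    q = np.array([p1, 1 - p1])
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

Px = np.array([0.2, 0.5, 0.3])               # marginal of X on {0, 1, 2}
p1x = np.array([0.9, 0.4, 0.6])              # P(Y = 1 | X = x)
p1 = float(Px @ p1x)                         # marginal P(Y = 1)

gap = H_phi(p1) - float(Px @ np.array([H_phi(q) for q in p1x]))
I_XY = shannon(p1) - float(Px @ np.array([shannon(q) for q in p1x]))
```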
Exercise 16.11: Prove Lemma 16.5.6. Hint: without loss of generality, you may take u0 = 0,
u1 = 1. Then for any u ∈ [0, 1], write u_λ as a convex combination of either {0, u} or {u, 1} and use
the definition of convexity.
Part IV
Chapter 17
In Chapter 5, we discussed generalization guarantees, which followed from the various uniform con-
centration inequalities we developed, and applications in estimation, with finite-sample guarantees
for recovering parameters θ in well-behaved models and smooth loss functions. In many cases,
however, we care only about minimizing some expected loss—the particular parameters may be
unimportant—as when we perform large-scale machine learning or prediction tasks. In other cases,
we have online problems, where we do not even make particular statistical assumptions, but assume
simply that data arrives in a stream, and we wish to make predictions on this stream that are at
least as good as some reference static predictor. Finally, the smoothness assumptions on which our
convergence results of Section 5.3 repose—which are, in a sense, necessary—lie in opposition to
some of our surrogate consistency results in Chapter 16, which suggest several optimality properties
of hinge-type and other non-smooth losses, which are beyond the purview of our earlier convergence
guarantees. This chapter addresses these problems.
In online optimization, we consider the following two player sequential game: we have a space
Θ in which we—the learner or first player—can play points θ1 , θ2 , . . ., while nature plays a sequence
of loss functions ℓt : Θ → R. The goal is to guarantee that the regret
Reg_n(θ⋆) := Σ_{t=1}^n [ℓt(θt) − ℓt(θ⋆)] (17.0.1)
grows at most sub-linearly with n for any fixed reference θ⋆ ∈ Θ. We will typically consider scenarios
in which the sequence of losses ℓt are convex or can be appropriately modeled by convex losses,
and Θ is a convex subset of Rd . This defines the online convex optimization problem. The related
problem of stochastic convex optimization follows when nature cannot be so capricious: instead of
losses ℓt chosen (essentially) arbitrarily and adversarially, we have the risk minimization problems
of Chapter 5. Thus, for a distribution P on Z ∈ Z, we wish to minimize the risk L(θ) := EP[ℓ(θ; Z)].
by convexity, the average vector θ̄n := (1/n) Σ_{t=1}^n θt satisfies

E[L(θ̄n)] ≤ (1/n) Σ_{t=1}^n E[L(θt)] (⋆)= (1/n) Σ_{t=1}^n E[ℓt(θt)] = E[ (1/n) Σ_{t=1}^n ℓt(θt) ], (17.0.3)
where the equality (⋆) follows because θt is a function only of Z_1^{t−1}, and so is conditionally independent
of Zt . Inequality (17.0.3) is our first example of an online-to-batch conversion: any convergence
guarantee for the average regret in the online setting (17.0.1) implies a matching guarantee for
stochastic optimization.
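A tiny simulation illustrates the online-to-batch conversion (a Python sketch under assumed quadratic losses ℓt(θ) = ½(θ − Zt)²; all names are ours). By convexity, the risk of the averaged iterate is no worse than the average risk of the iterates, as in (17.0.3):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, eta = 2.0, 1.0, 500, 0.05
z = rng.normal(mu, sigma, size=n)            # i.i.d. stream Z_1, ..., Z_n

def L(theta):
    # population risk of ell(theta; z) = (theta - z)^2 / 2
    return 0.5 * ((theta - mu) ** 2 + sigma ** 2)

theta, iterates = 0.0, []
for zt in z:
    iterates.append(theta)
    theta -= eta * (theta - zt)              # gradient step on ell_t

theta_bar = float(np.mean(iterates))         # averaged iterate
avg_risk = float(np.mean([L(th) for th in iterates]))
```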
In the remainder of the chapter, we develop some of the modern perspective on algorithms for
solving online and stochastic convex optimization problems, providing their guarantees. It will turn
out that Bregman divergences, which we have developed in the context of prediction problems,
provide one of our central tools. We will also develop a collection of lower bounds showing the
(minimax) optimality of these methods.
Example 17.1.2 (Logistic regression): As in the support vector machine, we receive data in
pairs (xt, yt) ∈ Rd × {−1, 1}, and the loss function is the logistic loss

ℓt(θ) = log(1 + exp(−yt⟨xt, θ⟩)),

with singleton subdifferential ∂ℓt(θ) = { −yt xt / (1 + e^{yt⟨xt,θ⟩}) }, because the loss is C∞. ⋄
where we have defined the vector gt = [ℓ0-1(xt,j, yt)]_{j=1}^d ∈ {0, 1}^d. Notably, the expected zero-one
loss is convex (even linear), so that its online minimization falls into the online convex
programming framework. ⋄
We will see that, in spite of their simplicity, online convex programming approaches
have applications beyond the initial regret formulation (17.0.1).
ii. minimize this model, regularizing so that the updated point does not move too far (or too
aggressively follow spurious information)
Let us make this more precise before giving the actual algorithms and derived procedures.
ii. Lower bound. The model satisfies ℓ̂(θ; θ0 ) ≤ ℓ(θ), with equality at θ = θ0 .
The most common model is the first-order model, which takes g ∈ ∂ℓ(θ0) and sets ℓ̂(θ; θ0) := ℓ(θ0) + ⟨g, θ − θ0⟩,
for which it is immediate that both convexity (because ℓ̂ is affine) and the lower bound condition
hold. A somewhat less trivial model applies when we know that the losses are nonnegative; in this
case, with g ∈ ∂ℓ(θ0) as before, we may use the truncated model ℓ̂(θ; θ0) := [ℓ(θ0) + ⟨g, θ − θ0⟩]_+,
and in particular, if ℓ is differentiable at θ0 then ℓ̂ is as well, with identical derivative. To see the
inclusion (17.2.1), note that if g0 ∈ ∂ℓ̂(θ0; θ0), then for any θ, we have

ℓ(θ) ≥ ℓ̂(θ; θ0) ≥ ℓ̂(θ0; θ0) + ⟨g0, θ − θ0⟩ = ℓ(θ0) + ⟨g0, θ − θ0⟩,

so that g0 ∈ ∂ℓ(θ0),
and we will present a preliminary analysis because the simplicity of the update lends itself to an
elementary proof of convergence.
The update (17.2.2) makes clear that we trade between improving performance on ℓt via the
linear approximation ℓt (θ) ≈ ℓt (θt )+⟨gt , θ −θt ⟩ and remaining close to θt according to the Euclidean
distance ∥·∥2 . When we use the standard linear approximation as the model ℓ̂, that is, for some
gt ∈ ∂ℓt (θ) use ℓ̂t (θ) = ℓt (θt ) + ⟨gt , θ − θt ⟩, we may rewrite the update (17.2.2) as the two steps
θ_{t+1/2} = θt − ηt gt,    θ_{t+1} = Proj_Θ(θ_{t+1/2}) = argmin_{θ∈Θ} ∥θ − θ_{t+1/2}∥²₂,
where ProjΘ denotes Euclidean projection onto Θ. The update (17.2.2) admits an elegant analysis,
which we include even though it is a special case of the more general Theorem 17.2.9 to come,
because it highlights the main ideas. We use a fixed stepsize for simplicity.
Proposition 17.2.1 (Convergence of projected gradient descent). Let ηt = η > 0 for all t. Then
for any θ ∈ Θ,
Σ_{t=1}^n [ℓt(θt) − ℓt(θ)] ≤ (1/(2η)) ∥θ1 − θ∥²₂ + (η/2) Σ_{t=1}^n ∥gt∥²₂.
Proof The only important step is to write an appropriate measure of convergence: the error
∥θt+1 − θ∥22 , from which we can derive a “one-step” progress guarantee, which we then recurse
more or less via a bit of algebra. To that end, note that for any θ ∈ Θ and any vector v,

∥Proj_Θ(v) − θ∥₂ ≤ ∥v − θ∥₂,

because projections decrease (Euclidean) distance. (See Corollary B.1.13 in Appendix B.1.2.) So
we have the one-step progress guarantee
∥θt+1 − θ∥22 ≤ ∥θt − ηgt − θ∥22 = ∥θt − θ∥22 − 2η⟨gt , θt − θ⟩ + η 2 ∥gt ∥22 .
By the first-order conditions for convexity, we have ℓt(θt) + ⟨gt, θ − θt⟩ ≤ ℓt(θ), or −⟨gt, θt − θ⟩ ≤
ℓt(θ) − ℓt(θt). Substituting gives

ℓt(θt) − ℓt(θ) ≤ (1/(2η)) ( ∥θt − θ∥²₂ − ∥θ_{t+1} − θ∥²₂ ) + (η/2) ∥gt∥²₂. (17.2.3)

Sum inequality (17.2.3) and note that the sum telescopes to obtain the proposition.
Typically, one imposes boundedness conditions on θ and ∥gt ∥2 , which then allows a more evoca-
tive guarantee. As we have focused on Euclidean-type updates, we present one such prototypical
result, with more discussion in Section 17.2.4.
Corollary 17.2.2. Assume that Θ ⊂ {θ ∈ Rd : ∥θ∥2 ≤ R2} and that ∥g∥2 ≤ G2 for any g ∈ ∂ℓt(θ)
for all t. Take η = R2/(G2√n) and let θ1 = 0. Then for all θ⋆ ∈ Θ,

Σ_{t=1}^n [ℓt(θt) − ℓt(θ⋆)] ≤ R2 G2 √n.
Such guarantees—order √n regret, with a multiplicative constant involving the size of the space Θ
and the magnitude of the (sub)gradients of the losses ℓt—are typical and, as we shall see, minimax
optimal.
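The corollary is simple to verify empirically. A minimal Python implementation of the projected-gradient update (our own names; linear losses chosen for convenience) confirms the regret never exceeds R2 G2 √n under the corollary's stepsize:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, R2, G2 = 5, 400, 1.0, 1.0
eta = R2 / (G2 * np.sqrt(n))                 # stepsize from the corollary

# linear losses ell_t(theta) = <g_t, theta> with ||g_t||_2 <= G2
G = rng.normal(size=(n, d))
G /= np.maximum(np.linalg.norm(G, axis=1, keepdims=True) / G2, 1.0)

def proj_ball(v, r):
    # Euclidean projection onto {theta : ||theta||_2 <= r}
    nv = np.linalg.norm(v)
    return v if nv <= r else v * (r / nv)

theta, loss_alg, g_sum = np.zeros(d), 0.0, np.zeros(d)
for g in G:
    loss_alg += float(g @ theta)             # suffer loss, then update
    theta = proj_ball(theta - eta * g, R2)
    g_sum += g

# the best fixed comparator in the ball minimizes <sum_t g_t, theta>
loss_best = -R2 * float(np.linalg.norm(g_sum))
regret = loss_alg - loss_best
bound = R2 * G2 * np.sqrt(n)
```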
[Figure: the Bregman divergence Dψ(w, v) as the gap between ψ at the point (w, ψ(w)) and the tangent to ψ at (v, ψ(v)).]
Example 17.2.4 (KL divergence as a Bregman divergence): Take ψ(w) = Σ_{j=1}^d wj log wj.
Then ψ is convex over the positive orthant R^d_+ (the second derivative of w log w is 1/w), and
for w, v ∈ ∆d = {u ∈ R^d_+ : ⟨1, u⟩ = 1}, we have

Dψ(w, v) = Σ_j wj log wj − Σ_j vj log vj − Σ_j (1 + log vj)(wj − vj) = Σ_j wj log(wj/vj) = Dkl(w||v),

where in the final equality we treat w and v as probability distributions on {1, . . . , d}. ⋄
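A quick numeric confirmation (Python sketch, our helper names): computing the Bregman divergence of the negative entropy directly from the definition (17.2.4) reproduces the KL divergence on the simplex:

```python
import numpy as np

def neg_entropy(w):
    return float(np.sum(w * np.log(w)))

def bregman(w, v):
    # D_psi(w, v) = psi(w) - psi(v) - <grad psi(v), w - v>; grad psi(v)_j = 1 + log v_j
    return neg_entropy(w) - neg_entropy(v) - float((1.0 + np.log(v)) @ (w - v))

def kl(w, v):
    return float(np.sum(w * np.log(w / v)))

rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(6))
v = rng.dirichlet(np.ones(6))
```

The ⟨1, w − v⟩ term vanishes on the simplex, which is why the two expressions agree there.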
With these examples in mind, we now present the mirror descent algorithm, which is the natural
generalization of online gradient descent. We call ψ the distance generating function as it yields
the Bregman divergence in the method, functioning analogously to the squared ℓ2 -distance in the
earlier Algorithm 17.1.
The mirror descent update (17.2.5) frequently admits easy-to-compute solutions. Assume for
simplicity that we use the linear model ℓ̂t (θ) = ℓt (θt ) + ⟨gt , θ − θt ⟩, where gt ∈ ∂ℓt (θt ). By taking
Θ = Rd and ψ(θ) = 12 ∥θ∥22 , we note that the mirror descent procedure simply corresponds to the
gradient update θt+1 = θt − ηt gt , and it evidently generalizes the update (17.2.2). We can also
recover the exponentiated gradient or entropic mirror descent algorithm.
minimize_θ ⟨g, θ⟩ + (1/η) Σ_j θj log(θj/vj) subject to θ ∈ ∆d.

Writing the Lagrangian for this problem after introducing multipliers τ ∈ R for the constraint
that ⟨1, θ⟩ = 1 and λ ∈ R^d_+ for θ ⪰ 0, we have

L(θ, λ, τ) = ⟨g, θ⟩ + (1/η) Σ_{j=1}^d θj log(θj/vj) − ⟨λ, θ⟩ + τ(⟨1, θ⟩ − 1),

and setting ∂L/∂θj = 0 yields

θj = vj exp(−ηgj + λjη − τη − 1).

As each θj > 0, complementary slackness gives λ = 0, and choosing τ to normalize θ gives
θj = vj e^{−ηgj} / Σ_{l=1}^d vl e^{−ηgl},

the explicit form of the update (17.2.5) when using the linear approximation for ℓ̂. ⋄
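The resulting multiplicative update θj ∝ vj e^{−ηgj} is simple to implement. The Python sketch below (a hypothetical instance; names are ours) checks that the update is feasible and that randomly drawn points of the simplex never achieve a smaller value of the KL-regularized linearized objective:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 8, 0.5
v = rng.dirichlet(np.ones(d))                # current point on the simplex
g = rng.normal(size=d)                       # gradient of the linear model

theta = v * np.exp(-eta * g)                 # multiplicative update ...
theta /= theta.sum()                         # ... then normalize

def objective(th):
    # <g, th> + (1/eta) KL(th || v), the KL-regularized linearized objective
    return float(g @ th + np.sum(th * np.log(th / v)) / eta)

best_random = min(objective(rng.dirichlet(np.ones(d))) for _ in range(2000))
```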
For example, a straightforward calculation shows that the dual to the ℓ∞ -norm is the ℓ1 -norm,
and the Euclidean norm ∥·∥2 is self-dual (by the Cauchy-Schwarz inequality). Lastly, we require a
definition of functions of suitable curvature for use in mirror descent methods.
Definition 17.3. A convex function f : Rd → R is strongly convex with respect to the norm ∥·∥
over the set Θ if for all w, v ∈ Θ and g ∈ ∂f (w) we have
f(v) ≥ f(w) + ⟨g, v − w⟩ + (1/2) ∥w − v∥².
The function f is strongly convex if it grows at least quadratically fast at every point in its domain.
Strongly convex functions play an important role in the stability properties of minimizers and enjoy
a number of equivalent characterizations; see Proposition C.1.5 in Appendix C.1 for more.
The definition (17.2.4) of the divergence makes apparent that ψ is strongly convex if and only if

Dψ(w, v) ≥ (1/2) ∥w − v∥².
As three examples, we consider Euclidean distance, entropy, and p-norms for 1 < p ≤ 2.
Example 17.2.6: For the Euclidean distance, which uses ψ(w) = (1/2)∥w∥²₂, we have ∇ψ(w) = w,
and

(1/2)∥v∥²₂ = (1/2)∥w + v − w∥²₂ = (1/2)∥w∥²₂ + ⟨w, v − w⟩ + (1/2)∥w − v∥²₂

by a calculation, so that ψ is strongly convex with respect to the ℓ2-norm. ⋄
Example 17.2.7: Let ψ(w) = Σ_j wj log wj be the negative entropy. Then ψ is strongly
convex with respect to the ℓ1-norm, that is,

Dψ(w, v) = Dkl(w||v) ≥ (1/2)∥w − v∥²₁,

which follows immediately from Pinsker's inequality, Proposition 2.2.8. ⋄
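Pinsker's inequality in this ℓ1 form is likewise easy to spot-check (Python sketch, our names):

```python
import numpy as np

rng = np.random.default_rng(0)
worst_slack = float("inf")
for _ in range(2000):
    w = rng.dirichlet(np.ones(5))
    v = rng.dirichlet(np.ones(5))
    dkl = float(np.sum(w * np.log(w / v)))   # KL divergence in nats
    l1 = float(np.abs(w - v).sum())
    worst_slack = min(worst_slack, dkl - 0.5 * l1 ** 2)
```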
It can be convenient to have divergences strongly convex with respect to the ℓ1 -norm over all of Rd
rather than just the positive orthant. Squared ℓp -norms provide this guarantee:
Example 17.2.8: Let ψ(w) = (1/(2(p−1))) ∥w∥²_p, where 1 < p ≤ 2. Then ψ is strongly convex with
respect to the ℓp-norm ∥·∥p. (Exercise 17.5 asks you to prove this fact.)
In dimension d ≥ 3, consider the choice p = 1 + 1/log d. Then because ∥w∥1 ≤ d^{(p−1)/p} ∥w∥p for
all p ∈ [1, ∞], for this choice of p we have ∥w∥p ≥ d^{(1−p)/p} ∥w∥1 = d^{−1/(1+log d)} ∥w∥1 ≥ e^{−1} ∥w∥1,
so

Dψ(w, v) ≥ (1/2) ∥w − v∥²_p ≥ (1/(2e²)) ∥w − v∥²₁.

To within a numerical constant, ψ(w) = (1/(2(p−1))) ∥w∥²_p is strongly convex with respect to the
ℓ1-norm. ⋄
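The norm comparison ∥w∥p ≥ e^{−1}∥w∥1 driving this example can be spot-checked numerically (Python sketch, our names), here with d = 1000 and p = 1 + 1/log d:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
p = 1 + 1 / np.log(d)                        # the choice from the example
violations = 0
for _ in range(200):
    w = rng.normal(size=d)
    lp = float(np.sum(np.abs(w) ** p) ** (1 / p))
    l1 = float(np.abs(w).sum())
    if lp < np.exp(-1) * l1 - 1e-9:          # should never trigger
        violations += 1
```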
With these examples in place, we present the main convergence guarantee for mirror descent.
Theorem 17.2.9 (Regret of mirror descent). Let ℓt be an arbitrary sequence of convex functions,
and let θt be generated according to the mirror descent algorithm 17.3. Assume that the proximal
function ψ is strongly convex with respect to the norm ∥·∥, which has dual norm ∥·∥∗ . Then for a
sequence of subgradients gt ∈ ∂ℓt (θt ),
(a) If ηt = η for all t, then for any θ⋆ ∈ Θ,

Σ_{t=1}^n [ℓt(θt) − ℓt(θ⋆)] ≤ (1/η) Dψ(θ⋆, θ1) + (η/2) Σ_{t=1}^n ∥gt∥²∗.
Example 17.2.10 (Lipschitz constants and linear prediction): Consider margin-based binary
classification, where for data (xt, yt) ∈ Rd × {±1} we have loss ℓt(θ) = ϕ(yt⟨θ, xt⟩) for some
convex ϕ with ϕ′(0) < 0 and |ϕ′(s)| ≤ 1 for any ϕ′(s) ∈ ∂ϕ(s), s ∈ R. Particular choices include
i. the logistic loss, with ϕ(t) = log(1 + e^{−t}), and
ii. the hinge loss, with ϕ(t) = [1 − t]_+.
Then ∂ℓt(θ) = yt xt ∂ϕ(yt⟨θ, xt⟩), and the losses ℓt are G-Lipschitz with respect to the norm ∥·∥
if and only if ∥x∥∗ ≤ G for all x ∈ X. ⋄
In the Euclidean case, when ψ(θ) = 12 ∥θ∥22 , we assume that the loss functions ℓt are all G-
Lipschitz with respect to the ℓ2 -norm. In Example 17.2.10, this corresponds to the data x belonging
to an ℓ2 -ball of radius G. In this case, the two regret bounds above become
(1/(2η)) ∥θ⋆ − θ1∥²₂ + (η/2) nG²    and    (1/(2ηn)) R² + Σ_{t=1}^n (ηt/2) G²,
respectively, where in the second case we assumed that ∥θ⋆ − θt∥2 ≤ R for all t. In the former case,
we take η = R/(G√n) and recover Corollary 17.2.2; in the latter we take ηt = R/(G√t), which does not
require knowledge of n ahead of time. We obtain the following corollary.
Corollary 17.2.11. Assume that Θ ⊂ {θ ∈ Rd : ∥θ∥2 ≤ R} and that the loss functions ℓt are
G-Lipschitz with respect to the Euclidean norm. Take ηt = R/(G√t). Then for all θ⋆ ∈ Θ,

Σ_{t=1}^n [ℓt(θt) − ℓt(θ⋆)] ≤ 3RG√n.
Proof For any θ, θ⋆ ∈ Θ, we have ∥θ − θ⋆ ∥2 ≤ 2R, so that Dψ (θ⋆ , θ) ≤ 4R2 . Using that
Σ_{t=1}^n t^{−1/2} ≤ ∫_0^n t^{−1/2} dt = 2√n
Other geometries suggest alternative choices of the distance-generating function ψ. For example,
in high-dimensional settings, when the underlying domain Θ is the probability simplex, we can
achieve bounds that depend only on the ℓ∞ norm of the gradients. (Recall Example 17.1.3 for
motivation.) In this case, we have the following corollary to Theorem 17.2.9.
Corollary 17.2.12. Assume that Θ = ∆d = {θ ∈ R^d_+ : ⟨1, θ⟩ = 1} and take the proximal function
ψ(θ) = Σ_j θj log θj to be the negative entropy in the mirror descent procedure 17.3. Then with the
fixed stepsize η and initial point as the uniform distribution θ1 = 1/d, we have for any sequence of
convex losses ℓt

Σ_{t=1}^n [ℓt(θt) − ℓt(θ⋆)] ≤ (log d)/η + (η/2) Σ_{t=1}^n ∥gt∥²∞.
Proof Using Pinsker’s inequality in the form of Example 17.2.7, we have that ψ is strongly
convex with respect to ∥·∥1 . Consequently, taking the dual norm to be the ℓ∞ -norm, part (a) of
Theorem 17.2.9 shows that
Σ_{t=1}^n [ℓt(wt) − ℓt(w⋆)] ≤ (1/η) Σ_{j=1}^d w⋆_j log( w⋆_j / w_{1,j} ) + (η/2) Σ_{t=1}^n ∥gt∥²∞.
Noting that with w1 = 1/d, we have Dψ (w⋆ , w1 ) ≤ log d for any w⋆ ∈ Θ gives the result.
Corollary 17.2.12 yields somewhat sharper results than Corollary 17.2.11, though in the re-
stricted setting that Θ is the probability simplex in Rd. Indeed, let us assume that the subgradients
gt ∈ [−1, 1]^d, the hypercube in Rd. In this case, the tightest possible bound on their ℓ2-norm is
∥gt∥2 ≤ √d, while ∥gt∥∞ ≤ 1 always. Similarly, if Θ = ∆d, then we may only guarantee that
∥θ⋆ − θ1∥2 ≤ √2. Thus, Euclidean methods (Corollary 17.2.11) can only guarantee regret

(1/(2η)) ∥θ⋆ − θ1∥²₂ + (η/2) nd ≤ √(2nd) with the choice η = √(2/(nd)),
while the entropic mirror descent procedure (Alg. 17.3 with ψ(θ) = Σ_j θj log θj) guarantees

(log d)/η + (η/2) n ≤ √(2n log d) with the choice η = √((2 log d)/n). (17.2.6)

The latter guarantee is exponentially better in the dimension.
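To see the entropic guarantee (17.2.6) in action, here is a minimal Python implementation of entropic mirror descent (exponentiated gradient) on random linear losses with gradients in the hypercube [−1, 1]^d; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 1000
eta = np.sqrt(2 * np.log(d) / n)             # stepsize from (17.2.6)

theta = np.ones(d) / d                       # uniform initial point
loss_alg, g_sum = 0.0, np.zeros(d)
for _ in range(n):
    g = rng.uniform(-1.0, 1.0, size=d)       # subgradient in [-1, 1]^d
    loss_alg += float(g @ theta)
    g_sum += g
    theta = theta * np.exp(-eta * g)         # multiplicative update
    theta /= theta.sum()

loss_best = float(g_sum.min())               # best vertex of the simplex
regret = loss_alg - loss_best
bound = np.sqrt(2 * n * np.log(d))
```

The best fixed comparator over the simplex for linear losses is a vertex, so the regret is against the best single coordinate.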
As a final example, we revisit the p-norm algorithms in high dimensions (at least, when d ≥ 3).
Choose p = 1 + 1/log d < 2, so that its conjugate q = p/(p−1) = 1 + log d satisfies
∥g∥q ≤ d^{1/q} ∥g∥∞ ≤ e ∥g∥∞. Then because ∥θ∥p ≤ ∥θ∥1, we have

Σ_{t=1}^n [ℓt(θt) − ℓt(θ⋆)] ≤ ((log d)/(2η)) ∥θ⋆∥²₁ + (eη/2) Σ_{t=1}^n ∥gt∥²∞.

Assume each ℓt is G∞-Lipschitz with respect to the ℓ1-norm (which is easier to satisfy than
Lipschitzness with respect to any other ℓp-norm, as it is equivalent to boundedness of the
gradients in ℓ∞), and that the domain Θ is contained in an ℓ1-ball of ℓ1-radius R1. Then with
the choice η = R1 √(log d) / (G∞ √n) we obtain

Σ_{t=1}^n [ℓt(θt) − ℓt(θ⋆)] ≤ O(1) G∞ R1 √(n log d).
Lemma 17.2.14. Let h : Rd → R be a convex function and Θ be a convex set. Then θ⋆ minimizes
h over Θ if and only if there exists g ∈ ∂h(θ⋆) such that

⟨g, θ − θ⋆⟩ ≥ 0 for all θ ∈ Θ.
The next result gives the key progress guarantee for any model-based minimization strategy.
Lemma 17.2.15. Let ℓ̂ be any valid model (Definition 17.1) of ℓ : Θ → R at the point θ0, and for
η > 0 define

θη := argmin_{θ∈Θ} { ℓ̂(θ) + (1/η) Dψ(θ, θ0) }.

Then for some g0 ∈ ∂ℓ(θ0) and any θ ∈ Θ,

ℓ(θ0) − ℓ(θ) ≤ (1/η) [ Dψ(θ, θ0) − Dψ(θ, θη) − Dψ(θη, θ0) ] + ⟨g0, θ0 − θη⟩.
Proof We consider the optimality conditions for the minimization. By the optimality conditions
in Lemma 17.2.14, there exists gη ∈ ∂ℓ̂(θη) such that

⟨ gη + (1/η)( ∇ψ(θη) − ∇ψ(θ0) ), θ − θη ⟩ ≥ 0 for all θ ∈ Θ.
Rearranging, we have

(1/η) ⟨∇ψ(θη) − ∇ψ(θ0), θ − θη⟩ ≥ ⟨gη, θη − θ⟩.
The magical step is the “three term identity,” valid for any u, v, w, that

⟨∇ψ(u) − ∇ψ(v), w − u⟩ = Dψ(w, v) − Dψ(w, u) − Dψ(u, v).

Applying this identity with u = θη, v = θ0, and w = θ in the rearranged optimality condition
yields

⟨gη, θη − θ⟩ ≤ (1/η) [ Dψ(θ, θ0) − Dψ(θ, θη) − Dψ(θη, θ0) ]. (17.2.7)
Finally, we use the modeling conditions in Definition 17.1 to lower bound the left hand side. To
that end, note that if g0 ∈ ∂ℓ̂(θ0), then by convexity we obtain

⟨gη, θη − θ⟩ ≥(i) ℓ̂(θη) − ℓ̂(θ) ≥(ii) ℓ̂(θη) − ℓ(θ) ≥(iii) ℓ̂(θ0) + ⟨g0, θη − θ0⟩ − ℓ(θ)
            = ℓ(θ0) + ⟨g0, θη − θ0⟩ − ℓ(θ),
where inequalities (i) and (iii) are first-order convexity, and inequality (ii) follows from the lower
bounding condition in Definition 17.1. Substituting and rearranging in inequality (17.2.7) yields the
claimed inequality in the lemma, and the last bit is to use the inclusion (17.2.1), that if g0 ∈ ∂ ℓ̂(θ0 )
then g0 ∈ ∂ℓ(θ0 ).
Summing the inequality in Lemma 17.2.15 implies the following regret bound.
Lemma 17.2.16. Let ℓt : Θ → R be any sequence of convex loss functions and ηt be a non-
increasing sequence, where η0 = ∞. Then with the mirror descent strategy (17.2.5), there are
subgradients gt ∈ ∂ℓt(θt) such that for any θ⋆ ∈ Θ we have

Σ_{t=1}^n [ℓt(θt) − ℓt(θ⋆)] ≤ Σ_{t=1}^n (1/ηt − 1/η_{t−1}) Dψ(θ⋆, θt)
    + Σ_{t=1}^n [ −(1/ηt) Dψ(θ_{t+1}, θt) + ⟨gt, θt − θ_{t+1}⟩ ].
as desired.
It remains to use the negative terms −Dψ(θ_{t+1}, θt) to cancel the gradient terms ⟨gt, θt − θ_{t+1}⟩.
To that end, we recall Definition 17.2 of the dual norm ∥·∥∗ and the strong convexity assumption
on ψ. Using the Fenchel-Young inequality ab ≤ (1/(2η)) a² + (η/2) b², valid for any η > 0, we have

⟨gt, θt − θ_{t+1}⟩ ≤ ∥gt∥∗ ∥θt − θ_{t+1}∥ ≤ (ηt/2) ∥gt∥²∗ + (1/(2ηt)) ∥θt − θ_{t+1}∥².

Now, we use the strong convexity condition, which gives

−(1/ηt) Dψ(θ_{t+1}, θt) ≤ −(1/(2ηt)) ∥θt − θ_{t+1}∥².

Combining the preceding two displays in Lemma 17.2.16 gives the result of Theorem 17.2.9.
Considering the convergence guarantees in the preceding section, it is also interesting to provide
minimax lower bounds that capture the geometric aspects of the problems we consider—their
Lipschitz continuity properties and relationship with the underlying domain Θ ⊂ Rd . Because
it is tedious to carefully address the interaction between probability distributions and losses ℓ, we
modify the minimax definition (17.3.1) slightly. Now, let L consist of functions ℓ : Rd → R ∪ {+∞}.
Let P(L) consist of all probability distributions on ℓ ∈ L, so that we identify the sample space
Z with losses themselves, and make the natural definition LP(θ) = EP[ℓ(θ)] = ∫ ℓ(θ) dP(ℓ). This
makes it easy, for example, to consider the class of convex functions that are Lipschitz continuous
with respect to the p-norm
$$\mathcal{L}_p := \Big\{ \text{convex } \ell : \mathbb{R}^d \to \mathbb{R}_+ \text{ s.t. } |\ell(\theta) - \ell(\theta')| \le \|\theta - \theta'\|_p \text{ for all } \theta, \theta' \in \mathbb{R}^d \Big\}.$$
Then we let
$$\mathfrak{M}_n(\Theta, \mathcal{L}) := \inf_{\hat\theta_n} \sup_{P \in \mathcal{P}(\mathcal{L})} \mathbb{E}_{P^n}\Big[ L_P(\hat\theta_n) - \inf_{\theta \in \Theta} L_P(\theta) \Big], \tag{17.3.2}$$
where the infimum is over estimators $\hat\theta_n$ constructed from n i.i.d. observations from P.
Throughout, $r : \Theta \times \mathcal{P} \to \mathbb{R}_+$ denotes a risk gap.
We now mimic the reduction from estimation to testing we developed in Chapter 9.2.1. With
this in mind, for distributions P0 , P1 , we define the separation between them (for the risk gap r) by
$$\mathrm{sep}_r(P_0, P_1; \Theta) := \sup\left\{ \delta \ge 0 : \begin{array}{l} r(\theta, P_0) \le \delta \text{ implies } r(\theta, P_1) \ge \delta \\ r(\theta, P_1) \le \delta \text{ implies } r(\theta, P_0) \ge \delta \end{array} \text{ for any } \theta \in \Theta \right\}. \tag{17.3.5}$$
That is, having small loss on P0 implies large loss on P1 and vice versa. The next examples show
simple ways to construct separated objectives in optimization using absolute and squared losses.
Note that they have different scaling in the natural underlying “separation,” which suggests—
correctly—that optimizing one loss may be easier than the other.
Example 17.3.2 (A separation for one-dimensional objectives): Consider data x ∈ {−1, 1}
and losses ℓ(θ, x) = |θ − x|. Consider the distributions $P_{-1}$ and $P_1$ on X with $P_v(X = v) = \frac{1+\delta}{2}$
and $P_v(X = -v) = \frac{1-\delta}{2}$ for v ∈ {−1, 1}, where δ ∈ [0, 1]. Then by inspection,
$$L_v(\theta) := \mathbb{E}_{P_v}[\ell(\theta, X)] = \frac{1+\delta}{2}|\theta - v| + \frac{1-\delta}{2}|\theta + v|$$
satisfies $\mathop{\mathrm{argmin}}_\theta L_v(\theta) = v$ and $L_v^\star := \inf_\theta L_v(\theta) = 1 - \delta$. For the risk gap (17.3.4), we use
that if $\theta v \le 0$ then $L_v(\theta) - L_v^\star \ge 1 - (1 - \delta) = \delta$, so that
$$\mathop{\mathrm{argmin}}_\theta L_v(\theta) = \delta v \quad \text{and} \quad \inf_\theta L_v(\theta) = \frac{1+\delta}{2}(1-\delta)^2 + \frac{1-\delta}{2}(1+\delta)^2 = 1 - \delta^2.$$
In this case, for $\theta v \le 0$ we have $L_v(\theta) \ge 1$, and so $L_v(\theta) - L_v^\star \ge 1 - (1 - \delta^2) = \delta^2$,
a quadratically smaller separation than for the absolute loss in Example 17.3.2. ⋄
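The δ versus δ² scaling of the two separations above is easy to verify numerically; the following sketch evaluates both population losses on a grid (the squared-loss objective is the analogue checked against, and the grid resolution is our choice):

```python
import numpy as np

delta = 0.2
theta = np.linspace(-1, 1, 200001)   # fine grid over Theta = [-1, 1]

def L_abs(theta, v, delta):
    # population loss E|theta - X| under P_v: X = v w.p. (1+delta)/2, -v w.p. (1-delta)/2
    return (1 + delta) / 2 * np.abs(theta - v) + (1 - delta) / 2 * np.abs(theta + v)

def L_sq(theta, v, delta):
    # squared-loss analogue E(theta - X)^2 under the same distribution
    return (1 + delta) / 2 * (theta - v) ** 2 + (1 - delta) / 2 * (theta + v) ** 2

for v in (-1, 1):
    La, Ls = L_abs(theta, v, delta), L_sq(theta, v, delta)
    assert abs(theta[np.argmin(La)] - v) < 1e-4           # argmin L_v = v
    assert abs(La.min() - (1 - delta)) < 1e-4             # inf L_v = 1 - delta
    assert abs(theta[np.argmin(Ls)] - delta * v) < 1e-4   # argmin = delta * v
    assert abs(Ls.min() - (1 - delta ** 2)) < 1e-4        # inf = 1 - delta^2
    # suboptimality at theta with theta*v <= 0: >= delta (absolute) vs >= delta^2 (squared)
    bad = theta * v <= 0
    assert La[bad].min() - (1 - delta) >= delta - 1e-4
    assert Ls[bad].min() - (1 - delta ** 2) >= delta ** 2 - 1e-4
```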
Define Ψ : Θ → V by
$$\Psi(\theta) = \begin{cases} v & \text{if } r(\theta, P_v) < \delta \\ \text{arbitrary} & \text{otherwise.} \end{cases}$$
$$\mathbb{P}\big(r(\hat\theta, P_V) \ge \delta\big) = 1 - \mathbb{P}\big(r(\hat\theta, P_V) < \delta\big) \ge 1 - \mathbb{P}\big(\Psi(\hat\theta) = V\big) = \mathbb{P}\big(\Psi(\hat\theta) \ne V\big).$$
As a corollary, we obtain a lower bound on the minimax risk for stochastic optimization. As we
have already seen in Example 17.3.1, r(θ, P ) = LP (θ) − inf θ⋆ ∈Θ LP (θ⋆ ) is a valid risk gap, making
the following result immediate.
Corollary 17.3.5. Assume that P has a δ-separated subset {Pv }v∈V for the population loss gap (17.3.4).
Then the minimax risk (17.3.1) satisfies
$$\mathfrak{M}_n(\mathcal{P}, \Theta, \ell) \ge \delta \inf_{\hat v} \mathbb{P}\big(\hat v(Z_1^n) \ne V\big),$$
where $\mathbb{P}$ denotes the joint distribution over V following an arbitrary distribution and, conditional
on V = v, drawing $Z_1^n \stackrel{\mathrm{iid}}{\sim} P_v$.
This quantity always lower bounds the separation in the risk gap we use for stochastic optimization
(recall Example 17.3.1).
Lemma 17.3.6. Let P0 , P1 be distributions and Lv (θ) = EPv [ℓ(θ, Z)] for v ∈ {0, 1}. Then for the
risk gap r(θ, P ) := LP (θ) − inf θ∈Θ LP (θ),
$$\mathrm{sep}_r(P_0, P_1; \Theta) \ge \frac{1}{2}\, d_{\mathrm{opt}}(L_0, L_1; \Theta).$$
Proof Let $L_v^\star = \inf_{\theta \in \Theta} L_v(\theta)$ for shorthand. Let δ ≥ 0 be such that $d_{\mathrm{opt}}(L_0, L_1; \Theta) \ge \delta$, and
assume that $L_0(\theta) - L_0^\star \le \delta/2$. Then
$$L_1(\theta) - L_1^\star = L_0(\theta) + L_1(\theta) - L_0^\star - L_1^\star - \big(L_0(\theta) - L_0^\star\big) \ge d_{\mathrm{opt}}(L_0, L_1; \Theta) - \delta/2 \ge \delta/2.$$
We now turn to the first optimality lower bound. We focus on a particular collection of distri-
butions P indexed by v ∈ {−1, 1}d on the sample space X = {±ej }dj=1 of signed standard basis
vectors and a particular loss ℓ. Appropriate scaling will allow us to derive further lower bounds
for more general parameter spaces. We follow our now familiar recipe: construct well-separated
losses, obtain a packing, and then show that testing between the data coming from any particular
loss is challenging.
Let v ∈ {−1, 1}$^d$. For some δ > 0 to be chosen, define the distribution $P_v$ on X ∈ X by
$$X = \begin{cases} v_j e_j & \text{w.p. } \frac{1+\delta}{2d} \\ -v_j e_j & \text{w.p. } \frac{1-\delta}{2d}, \end{cases} \qquad j = 1, \ldots, d, \tag{17.3.9}$$
and by inspection
$$\theta_v := \mathop{\mathrm{argmin}}_\theta L_v(\theta) = v \quad \text{and} \quad \inf_\theta L_v(\theta) = 1 - \delta.$$
As the distance between v and v′ grows, so too does the optimization distance between $L_v$ and $L_{v'}$
for the loss (17.3.8) (as does the separation):
Lemma 17.3.7. Let Θ ⊃ [−1, 1]$^d$. For any v, v′ ∈ {±1}$^d$,
$$d_{\mathrm{opt}}(L_v, L_{v'}; \Theta) = \frac{\delta}{d}\|v - v'\|_1.$$
d
Proof Let L⋆v = inf θ Lv (θ) = 1 − δ. Then writing out the quantity inside the infimum in the
definition (17.3.7), we have
Constructing the packing The actual packing construction is now straightforward: as we saw
in Chapter 9.4.1, we may use the Gilbert-Varshamov bound (Lemma 9.2.3) to pack the hypercube,
and then we leverage Lemma 17.3.7. Indeed, Lemma 9.2.3 implies that there exists a d/2-packing of
the hypercube {−1, 1}d in ℓ1 distance with cardinality at least exp(d/8). We conclude the following:
Observation 17.3.8. Let d ∈ N and Θ ⊃ [−1, 1]d . There exists a packing V ⊂ {−1, 1}d of
cardinality at least exp(d/8) such that for the loss (17.3.8) and any choice δ ∈ [0, 1] in the distri-
bution (17.3.9),
$$d_{\mathrm{opt}}(L_v, L_{v'}; \Theta) \ge \frac{\delta}{2} \qquad \text{for } v \ne v' \in \mathcal{V}.$$
Lower bounding the testing error Now that Observation 17.3.8 gives a separation of the
expected losses (17.3.8), we can apply Fano’s or Le Cam’s methods to achieve a minimax lower
bound on the expected optimization error. The key insights are the following KL-divergence and
information bounds.
Lemma 17.3.9. Let $X \sim P_v$ according to the coordinate sampling scheme (17.3.9). For $\delta \le \frac{1}{2}$,
$$D_{\mathrm{kl}}\big(P_v^n \| P_{v'}^n\big) \le \frac{9n}{8d}\|v - v'\|_1 \delta^2.$$
For any packing $\mathcal{V} \subset \{-1, 1\}^d$, if $V \sim \mathrm{Uniform}(\mathcal{V})$ and conditional on V = v we draw $X_1^n \stackrel{\mathrm{iid}}{\sim} P_v$,
$$I(X_1^n; V) \le \frac{9n}{4}\delta^2.$$
Proof We demonstrate the first inequality; the second follows from the trivial observation that
$I(X_1^n; V) \le \max_{v, v'} D_{\mathrm{kl}}\big(P_v^n \| P_{v'}^n\big)$. For any $v, v' \in \{-1, 1\}^d$, the sampling scheme (17.3.9) gives
$$D_{\mathrm{kl}}(P_v \| P_{v'}) = \frac{1}{2d}\sum_{j : v_j \ne v_j'}\left[(1+\delta)\log\frac{1+\delta}{1-\delta} + (1-\delta)\log\frac{1-\delta}{1+\delta}\right] = \frac{\|v - v'\|_1}{4d}\left[(1+\delta)\log\frac{1+\delta}{1-\delta} + (1-\delta)\log\frac{1-\delta}{1+\delta}\right] \le \frac{\|v - v'\|_1}{4d}\big(4\delta^2 + \delta^3\big),$$
the last inequality valid for $0 \le \delta \le \frac{1}{2}$. Because $D_{\mathrm{kl}}\big(P_v^n \| P_{v'}^n\big) = n D_{\mathrm{kl}}(P_v \| P_{v'})$ and $4\delta^2 + \delta^3 \le \frac{9}{2}\delta^2$ for such δ, we obtain
the final inequality $D_{\mathrm{kl}}\big(P_v^n \| P_{v'}^n\big) \le \frac{9n}{8d}\|v - v'\|_1 \delta^2$ for $\delta \le \frac{1}{2}$.
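The single-sample KL bound in the proof can be checked numerically by writing out the coordinate-sampling probabilities of (17.3.9) explicitly; the following is an illustrative sketch:

```python
import numpy as np

def kl_coordinate_scheme(v, vp, delta):
    """KL divergence D(P_v || P_v') for the coordinate-sampling scheme (17.3.9):
    P_v(X = e_j) = (1 + delta v_j)/(2d), P_v(X = -e_j) = (1 - delta v_j)/(2d)."""
    d = len(v)
    p = np.concatenate([(1 + delta * np.asarray(v)) / (2 * d),
                        (1 - delta * np.asarray(v)) / (2 * d)])
    q = np.concatenate([(1 + delta * np.asarray(vp)) / (2 * d),
                        (1 - delta * np.asarray(vp)) / (2 * d)])
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
d = 8
for delta in (0.1, 0.25, 0.5):
    for _ in range(200):
        v = rng.choice([-1, 1], size=d)
        vp = rng.choice([-1, 1], size=d)
        kl = kl_coordinate_scheme(v, vp, delta)
        bound = 9 / (8 * d) * np.abs(v - vp).sum() * delta ** 2  # per-sample bound
        assert kl <= bound + 1e-12
```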
Theorem 17.3.10. Let Θ ⊃ [−1, 1]d and P contain distributions on {±ej }dj=1 . Then for the loss
ℓ defined in (17.3.8), there exists a numerical constant c > 0 such that
$$\mathfrak{M}_n(\mathcal{P}, \Theta, \ell) \ge c \min\left\{\frac{\sqrt{d}}{\sqrt{n}}, 1\right\}.$$
Proof Let $\mathfrak{M}_n = \mathfrak{M}_n(\mathcal{P}, \Theta, \ell)$ for simplicity, and fix $\delta \in [0, \frac{1}{4}]$ to be chosen. We consider
two cases: that d ≥ 14 and d < 14. In the former case, take the packing $\mathcal{V} \subset \{-1, 1\}^d$ and
distribution (17.3.9) that Observation 17.3.8 promises, which gives separation δ/4 via Lemma 17.3.6.
Then Corollary 17.3.5 and Fano's inequality imply that
$$\mathfrak{M}_n \ge \frac{\delta}{4}\left(1 - \frac{I(X_1^n; V) + \log 2}{d/8}\right)$$
by Pinsker's inequality and the information bound Lemma 17.3.9 provides. Choose $\delta^2 = \min\{\frac{1}{4}, \frac{d}{144n}\}$
and treat d < 14 as a numerical constant.
Corollary 17.3.11. Let the collection of losses $\mathcal{L}$ consist of all losses $G_1$-Lipschitz with respect to
the ℓ∞-norm, meaning g ∈ ∂ℓ(θ) implies $\|g\|_1 \le G_1$, and assume $\Theta \supset [-R_\infty, R_\infty]^d$. Then
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge c R_\infty G_1 \min\left\{\frac{\sqrt{d}}{\sqrt{n}}, 1\right\}.$$
Proof Scale the losses (17.3.8) and sampling distribution (17.3.9) by setting
$$X = \begin{cases} R_\infty v_j e_j & \text{w.p. } \frac{1+\delta}{2d} \\ -R_\infty v_j e_j & \text{w.p. } \frac{1-\delta}{2d} \end{cases}$$
under $P_v$ and
$$\ell(\theta; x) = \begin{cases} G_1 |\theta_j - R_\infty| & \text{if } x_j > 0 \\ G_1 |\theta_j + R_\infty| & \text{if } x_j < 0, \end{cases}$$
where j indexes the unique nonzero coordinate of x. The proof of Theorem 17.3.10 then proceeds mutatis mutandis.
The main focus in our development of stochastic gradient- and mirror-descent-type methods
was to develop algorithms that enjoyed convergence guarantees irrespective of the particular loss
ℓ, requiring only boundedness conditions on Θ and the gradients g ∈ ∂ℓ. We can extend Theo-
rem 17.3.10 to address this as well. For constants C > 0, let C · L = {C · ℓ | ℓ ∈ L} denote the
multiplicative scaling of a collection of losses (so, e.g., if L consists of 1-Lipschitz functions, then
C · L consists of C-Lipschitz functions). The following corollary provides minimax lower bounds
for such collections and domains Θ other than the ℓ∞ -ball.
Corollary 17.3.12. Let 2 ≤ p, q ≤ ∞. Let Lq be the collection of convex losses Lipschitz with
respect to the ℓq -norm, and assume Θ contains the ℓp -ball of radius Rp . Then there is a numerical
constant c > 0 such that for all n ≥ d and G > 0,
$$\mathfrak{M}_n(\Theta, G \cdot \mathcal{L}_q) \ge c R_p G \frac{d^{1/2 - 1/p}}{\sqrt{n}}.$$
Proof We sketch the proof. The ℓp-ball of radius $R_p$ contains the ℓ∞-ball of radius $R_\infty := R_p / d^{1/p}$. The modification of the losses (17.3.8) in the proof of Corollary 17.3.11 guarantees that
they are Lipschitz with respect to any ℓq-norm with the appropriate constant, as the loss defined for 1-sparse $x \in \mathbb{R}^d$ by $\ell(\theta; x) = G|\theta_j - \mathrm{sign}(x_j) R_p|$, where $x_j \ne 0$ is the unique non-zero element of x, always
satisfies $\|g\|_1 \le G$ for $g \in \partial\ell(\theta; x)$. In particular, for $q^* = \frac{q}{q-1}$ conjugate to q, $\|g\|_{q^*} \le \|g\|_1 \le G$.
Applying Corollary 17.3.11 gives
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge c \frac{R_p}{d^{1/p}} G \min\left\{\frac{\sqrt{d}}{\sqrt{n}}, 1\right\},$$
and when n ≥ d, the first term in the minimum is smaller than the second, whence
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge c R_p G \frac{d^{1/2 - 1/p}}{\sqrt{n}}$$
for n ≥ d.
Comparing to the regret bounds for stochastic and online gradient methods (Corollary 17.2.2)
shows that these results are sharp. Indeed, let $\Theta \subset \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le R_2\}$ and assume $\|g\|_2 \le G_2$
for all subgradients g, and let $\theta_t$ be the iterates of the stochastic gradient method (17.2.2) with
stepsize $\eta = \frac{R_2}{G_2\sqrt{n}}$, as in Corollary 17.2.2. Then the average $\hat\theta_n = \frac{1}{n}\sum_{t=1}^n \theta_t$ satisfies
$$\mathbb{E}[L_P(\hat\theta_n)] \le L_P(\theta^\star) + \frac{G_2 R_2}{\sqrt{n}} \qquad \text{for any } \theta^\star \in \Theta,$$
as in the inequality (17.0.3). Notably, $\|g\|_2 \le \|g\|_q$ for all $1 \le q \le 2$, and $\|\theta\|_2 \le d^{1/2-1/p}\|\theta\|_p$ for
$p \in [2, \infty]$. Thus the ℓp-ball of radius $R_p$ is contained within the ℓ2-ball of radius $d^{1/2-1/p} R_p$, so
the stochastic gradient method guarantees
$$\mathbb{E}[L_P(\hat\theta_n)] - L_P(\theta^\star) \le G_q R_p \cdot \frac{d^{1/2-1/p}}{\sqrt{n}}$$
whenever $\Theta \subset \{\theta \mid \|\theta\|_p \le R_p\}$ and the losses satisfy $\|g\|_q \le G_q$ for $g \in \partial\ell(\theta)$. Comparing to
Corollary 17.3.12, these rates of convergence are evidently minimax optimal.
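One can watch the upper bound hold in simulation: the sketch below runs projected stochastic subgradient descent on the one-dimensional absolute-loss problem of Example 17.3.2, with the stepsize η = R/(G√n) as in Corollary 17.2.2, and checks that the averaged iterate's suboptimality is of order GR/√n. The slack constant 3 in the final assertion is our choice for this illustration, not part of any theorem.

```python
import numpy as np

def sgd_abs_loss(n, delta=0.3, R=1.0, seed=0):
    """Projected SGD for L(theta) = E|theta - X| with X in {-1, 1},
    P(X = 1) = (1 + delta)/2, over Theta = [-R, R]; here G = 1 and
    the stepsize is eta = R / (G sqrt(n))."""
    rng = np.random.default_rng(seed)
    eta = R / np.sqrt(n)
    theta, total = 0.0, 0.0
    for _ in range(n):
        x = 1.0 if rng.random() < (1 + delta) / 2 else -1.0
        g = np.sign(theta - x)                    # subgradient of |theta - x|
        theta = min(max(theta - eta * g, -R), R)  # gradient step, then project
        total += theta
    return total / n                              # averaged iterate

def pop_loss(theta, delta=0.3):
    return (1 + delta) / 2 * abs(theta - 1) + (1 - delta) / 2 * abs(theta + 1)

n = 4000
gap = pop_loss(sgd_abs_loss(n)) - pop_loss(1.0)   # population optimum is theta = 1
assert 0 <= gap <= 3.0 / np.sqrt(n)               # ~ G R / sqrt(n), with slack
```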
These results provide sharp lower bounds for certain cases and gradient/parameter space ge-
ometries, and in particular, the dual geometries when Θ is an ℓp -ball for some p ≥ 2 and L consists
of functions Lipschitz with respect to the ℓp -norm. Here, we state (leaving the proof to the exer-
cises) a result that naturally handles the geometry in which Θ is contained in an ℓp -ball for p ≤ 2.
We say a convex set Θ ⊂ Rd is orthosymmetric if for θ ∈ Θ, Sθ ∈ Θ for any diagonal sign matrix
S. Any ℓp -ball is orthosymmetric. For a vector of nonnegative parameters {Gj }dj=1 , we let
$$\mathcal{L}(\{G_j\}) := \Big\{ \text{convex } \ell : \mathbb{R}^d \to \mathbb{R} \text{ s.t. } g \in \partial\ell(\theta) \text{ satisfies } |g_j| \le G_j \text{ for } j = 1, \ldots, d \Big\}, \tag{17.3.11}$$
that is, the losses whose subgradients have coordinates bounded as specified. Any loss of the form $\ell(\theta; x) = \sum_{j=1}^d G_j|\theta_j - x_j|$ evidently belongs to the class (17.3.11), as do linear functions ℓ(θ; x) = ⟨x, θ⟩ for
vectors x satisfying |xj | ≤ Gj . The following theorem provides a lower bound for this class.
Theorem 17.3.13. Let Θ ⊂ Rd be an orthosymmetric convex set and L be the class (17.3.11) of
losses. Then
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge \frac{1}{8\sqrt{n}} \sup_{\theta \in \Theta} \sum_{j=1}^d G_j |\theta_j|.$$
Corollary 17.3.14. Let $\|\cdot\|$ be an orthosymmetric norm, and let $\Theta \subset \mathbb{R}^d$ be an orthosymmetric convex
set. Let $\mathcal{L}$ consist of losses G-Lipschitz with respect to the norm $\|\cdot\|$, that is, for which $g \in \partial\ell(\theta)$
implies $\|g\|_* \le G$. Then
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge \frac{G}{8\sqrt{n}} \sup_{\theta \in \Theta} \|\theta\|.$$
That is, the norms ∥·∥ on Θ and the magnitude of the dual norms ∥·∥∗ of gradients necessarily
appear in any lower bound.
When $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_p \le R_p\}$ is an ℓp-ball of radius $R_p$ and $\mathcal{L}$ consists of $G_q$-Lipschitz loss
functions for the ℓp-norm, where $q = \frac{p}{p-1}$ is conjugate to p, then Corollary 17.3.14 implies the lower
bound
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \gtrsim \frac{1}{\sqrt{n}} G_q R_p.$$
For p > 1, assuming that $\theta_1 = 0$ and we use a fixed stepsize η > 0, the p-norm algorithms as in
Example 17.2.13 guarantee regret
$$\sum_{t=1}^n \ell_t(\theta_t) - \ell_t(\theta^\star) \le \frac{\|\theta^\star\|_p^2}{2(p-1)\eta} + \frac{\eta}{2} n G_q^2 \le \frac{R_p^2}{2(p-1)\eta} + \frac{\eta}{2} n G_q^2$$
for any sequence of losses $\ell_t \in \mathcal{L}$. Choosing $\eta = \frac{R_p}{G_q\sqrt{(p-1)n}}$ then yields the minimax upper bound
$$\mathbb{E}\big[L_P(\hat\theta_n)\big] - \inf_{\theta^\star \in \Theta} L_P(\theta^\star) \le \frac{G_q R_p}{\sqrt{(p-1)n}},$$
where $\hat\theta_n = \frac{1}{n}\sum_{t=1}^n \theta_t$ is the average of the iterates of the mirror descent method (17.2.5) and we
have applied Jensen's inequality as in (17.0.3). So, at least up to the factor of $\sqrt{p-1}$, these methods
achieve minimax optimal convergence. (We can in fact show that the factor $\sqrt{p-1}$ is necessary,
though this is beyond our scope.)
Proof For shorthand, let $G = G_\infty$ and $R = R_1$. Consider the sample space $\mathcal{X} = \{-1, 1\}^d$ and
the linear losses
$$\ell(\theta; x) := G\langle\theta, x\rangle,$$
which evidently satisfy $\|\nabla\ell(\theta; x)\|_\infty = G\|x\|_\infty = G$. Define the packing set $\mathcal{V} := \{\pm e_j\}_{j=1}^d$ of
the signed standard basis vectors and let $P_v$ be the distribution on $X \in \{\pm 1\}^d$ with independent
coordinates
$$X_j = \begin{cases} 1 & \text{w.p. } \frac{1 + \delta v_j}{2} \\ -1 & \text{w.p. } \frac{1 - \delta v_j}{2}. \end{cases}$$
We first demonstrate the separations we use to prove the lower bound, then apply Fano's
method. Define $L_v(\theta) = \mathbb{E}_{P_v}[\ell(\theta; X)] = G\delta\langle v, \theta\rangle$, so that
$$\inf_{\theta \in \Theta}\{L_v(\theta) + L_{v'}(\theta)\} = \inf_{\|\theta\|_1 \le R} G\delta\langle v + v', \theta\rangle = -GR\delta\|v + v'\|_\infty \ge -GR\delta,$$
as $\|v + v'\|_\infty = 0$ or 1 for $v \ne v' \in \{\pm e_j\}_{j=1}^d$. Thus, recalling the optimization distance (17.3.7),
we have
$$d_{\mathrm{opt}}(L_v, L_{v'}; \Theta) \ge -GR\delta + 2GR\delta = GR\delta.$$
Substituting this into the optimization-to-testing lower bound (17.3.6), we obtain that if we draw
$V \sim \mathrm{Uniform}(\mathcal{V})$ and, conditional on V = v, draw $X_1^n \stackrel{\mathrm{iid}}{\sim} P_v$, then
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge \frac{GR\delta}{2} \inf_{\hat v} \mathbb{P}\big(\hat v(X_1^n) \ne V\big).$$
We may now apply Fano's method, which implies
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge \frac{GR\delta}{2}\left(1 - \frac{I(X_1^n; V) + \log 2}{\log(2d)}\right).$$
To bound the mutual information, consider the KL-divergence $D_{\mathrm{kl}}(P_v \| P_{v'})$ for a pair $v \ne v'$. Let
$D_0(\delta) = D_{\mathrm{kl}}\big(\mathrm{Bernoulli}(\tfrac{1+\delta}{2}) \| \mathrm{Bernoulli}(\tfrac{1}{2})\big)$ and $D_1(\delta) = D_{\mathrm{kl}}\big(\mathrm{Bernoulli}(\tfrac{1+\delta}{2}) \| \mathrm{Bernoulli}(\tfrac{1-\delta}{2})\big)$.
Then by inspection
$$D_{\mathrm{kl}}(P_v \| P_{v'}) \le \max\{2 D_0(\delta), D_1(\delta)\} \le D_1(\delta) \le \frac{9}{4}\delta^2,$$
the final inequality valid for $0 \le \delta \le \frac{1}{2}$. So $I(X_1^n; V) \le \frac{9n}{4}\delta^2$, and
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge \frac{GR\delta}{2}\left(1 - \frac{9n\delta^2}{4\log(2d)} - \frac{\log 2}{\log(2d)}\right).$$
Assuming that d ≥ 2, we have $\log 2 / \log(2d) \le \frac{1}{2}$, and so taking $\delta = \min\{\sqrt{\log(2d)/(9n)}, \frac{1}{2}\}$ to make the
testing-error component at least $\frac{1}{4}$ implies the lower bound
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge \frac{GR\delta}{8}.$$
Substituting the value of δ gives the proposition when d ≥ 2.
In the case that d = 1, a straightforward argument via Le Cam's two-point method gives the
claimed lower bound.
for a sequence $Z_1^n \stackrel{\mathrm{iid}}{\sim} P$. The theorem shows that this quantity provides a high-probability bound
on the suboptimality gap on the population expected loss LP (θ) := EP [ℓ(θ, Z)]. The key is that, for
any online optimization procedure, the ith point θi is a function only of the past random variables
Z1i−1 , so that Zi provides fresh (independent) randomness at each iteration.
Theorem 17.4.1. Let L consist of convex losses, G-Lipschitz continuous with respect to the norm
∥·∥, and let Θ satisfy supθ,θ′ ∈Θ ∥θ − θ′ ∥ ≤ R. Let θi be any sequence predicted by an online procedure.
Then for any distribution P ,
$$\mathbb{P}\left( L_P(\hat\theta_n) - L_P(\theta) \ge \frac{1}{n}\mathrm{Reg}_n(\theta) + 2\sqrt{2}\, GRt \right) \le \exp(-nt^2).$$
Proof Let Fi := {Z1i } be the information available to the algorithm at time i, and let ℓi (θ) =
ℓ(θ; Zi ). We construct a sub-Gaussian martingale difference sequence using these (recall Defini-
tion 4.5), after which we may apply the Azuma-Hoeffding concentration bound (Theorem 4.2.3).
Thus, define the martingale differences
because Zi is independent of Z1i−1 and θi is a function of Z1i−1 . Additionally, under the Lipschitz
and boundedness conditions on Θ and ℓ, we have $|D_i| \le 2G\|\theta_i - \theta\| \le 2GR$. The $D_i$ thus form a
$4G^2R^2$-sub-Gaussian martingale difference sequence, and Theorem 4.2.3 implies
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n D_i \ge t\right) \le \exp\left(-\frac{nt^2}{8G^2R^2}\right)$$
JCD Comment: Maybe add a bit of commentary around this and about why it is a
good generalization bound. Compare to the uniform convergence guarantees in previous
chapters.
As stated, the bound of the proposition does not look substantially more powerful than
Corollary 17.2.12, but a few remarks will exhibit its consequences. We prove the proposition in
Section 17.5.1 to come.
First, we note that because $w_t \in \Delta_d$, we will always have $\sum_j w_{t,j} g_{t,j}^2 \le \|g_t\|_\infty^2$. So certainly
the bound of Proposition 17.5.1 is never worse than that of Corollary 17.2.12. Sometimes this can
be made tighter, however, as exhibited by the next corollary, which applies (for example) to the
experts setting of Example 17.1.3. More specifically, we have d experts, each suffering losses in
[0, 1], and we seek to predict with the best of the d experts.
Corollary 17.5.2. Consider the linear online convex optimization setting, that is, where $\ell_t(w_t) = \langle g_t, w_t\rangle$ for vectors $g_t$, and assume that $g_t \in \mathbb{R}_+^d$ with $\|g_t\|_\infty \le 1$. In addition, assume that we know
an upper bound $L_n^\star$ on $\sum_{t=1}^n \ell_t(w^\star)$. Then taking the stepsize $\eta = \min\{1, \sqrt{\log d / L_n^\star}\}$, we have
$$\sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le 3\max\Big\{\log d, \sqrt{L_n^\star \log d}\Big\}.$$
$$y_i = \frac{x_i \exp(-\eta g_i)}{\sum_j x_j \exp(-\eta g_j)},$$
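This multiplicative update is simple to implement. The sketch below runs it on synthetic linear losses in [0, 1]^d and checks a regret bound of the form log d/η + (η/2)Σₜ Σᵢ wₜ,ᵢ g²ₜ,ᵢ; the loss data and the designated best expert are fabricated for illustration.

```python
import numpy as np

def exp_weights(losses, eta):
    """Exponential weights: w_{t+1,i} proportional to w_{t,i} exp(-eta g_{t,i}).
    Returns (total mixture loss, sum_t sum_i w_{t,i} g_{t,i}^2)."""
    n, d = losses.shape
    w = np.full(d, 1.0 / d)          # uniform initial mixture
    total, quad = 0.0, 0.0
    for g in losses:
        total += float(w @ g)        # loss of the current mixture
        quad += float(w @ g ** 2)
        w = w * np.exp(-eta * g)     # multiplicative update
        w /= w.sum()
    return total, quad

rng = np.random.default_rng(2)
n, d = 2000, 10
losses = rng.random((n, d))
losses[:, 3] *= 0.5                  # expert 3 is best on average
eta = np.sqrt(np.log(d) / n)
total, quad = exp_weights(losses, eta)
regret = total - losses.sum(axis=0).min()
assert regret <= np.log(d) / eta + eta / 2 * quad  # regret bound for g_t >= 0
```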
Earlier, we used the strong convexity of ψ to eliminate the gradient terms $\langle g_t, w_t - w_{t+1}\rangle$ using the
Bregman divergence $D_\psi$. This time, we use Lemma 17.5.3: setting $y = w_{t+1}$ and $x = w_t$ yields
the bound
$$\sum_{t=1}^n \ell_t(w_t) - \ell_t(w^\star) \le \frac{1}{\eta} D_\psi(w^\star, w_1) + \frac{\eta}{2}\sum_{t=1}^n\sum_{i=1}^d g_{t,i}^2 w_{t,i},$$
as desired.
Proof of Lemma 17.5.3 We begin by noting that a direct calculation yields $D_\psi(y, x) = D_{\mathrm{kl}}(y \| x) = \sum_i y_i \log\frac{y_i}{x_i}$. Substituting the values for x and y into this expression, we have
$$\sum_i y_i \log\frac{y_i}{x_i} = \sum_i y_i \log\frac{x_i \exp(-\eta g_i)}{x_i\big(\sum_j \exp(-\eta g_j) x_j\big)} = -\eta\langle g, y\rangle - \sum_i y_i \log\Big(\sum_j x_j e^{-\eta g_j}\Big).$$
Now we use a Taylor expansion of the function $g \mapsto \log\big(\sum_j x_j e^{-\eta g_j}\big)$ around the point 0. If we
define the vector p(g) by $p_i(g) = x_i e^{-\eta g_i} / \sum_j x_j e^{-\eta g_j}$, then
$$\log\sum_j x_j e^{-\eta g_j} = \log(\langle \mathbf{1}, x\rangle) - \eta\langle p(0), g\rangle + \frac{\eta^2}{2}\, g^\top\big(\mathrm{diag}(p(\tilde g)) - p(\tilde g)p(\tilde g)^\top\big) g,$$
where $\tilde g = \lambda g$ for some λ ∈ [0, 1]. Noting that p(0) = x and $\langle \mathbf{1}, x\rangle = \langle \mathbf{1}, y\rangle = 1$, we obtain
$$D_\psi(y, x) = -\eta\langle g, y\rangle + \log(1) + \eta\langle g, x\rangle - \frac{\eta^2}{2}\, g^\top\big(\mathrm{diag}(p(\tilde g)) - p(\tilde g) p(\tilde g)^\top\big) g,$$
whence
$$-\frac{1}{\eta} D_\psi(y, x) + \langle g, x - y\rangle \le \frac{\eta}{2}\sum_{i=1}^d g_i^2\, p_i(\tilde g). \tag{17.5.1}$$
i=1
Lastly, we claim that the function
$$s(\lambda) = \sum_{i=1}^d g_i^2 \frac{x_i e^{-\lambda g_i}}{\sum_j x_j e^{-\lambda g_j}}$$
is non-increasing in λ ≥ 0 when the coordinates of g are nonnegative; differentiating, the numerator of $s'(\lambda)$ is $\sum_{i,j} x_i x_j e^{-\lambda(g_i + g_j)}(g_i^2 g_j - g_i^3)$.
Using the Fenchel-Young inequality, we have $ab \le \frac{1}{3}|a|^3 + \frac{2}{3}|b|^{3/2}$ for any a, b, so $g_i g_j^2 \le \frac{1}{3} g_i^3 + \frac{2}{3} g_j^3$.
This implies that the numerator in our expression for $s'(\lambda)$ is non-positive. Thus we have $s(\lambda) \le s(0) = \sum_{i=1}^d g_i^2 x_i$, which gives the result when combined with inequality (17.5.1).
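The monotonicity of s(λ) for nonnegative g is easy to confirm numerically; a minimal sketch with randomly drawn x on the simplex:

```python
import numpy as np

def s(lam, x, g):
    """s(lambda) = sum_i g_i^2 x_i e^{-lam g_i} / sum_j x_j e^{-lam g_j}."""
    w = x * np.exp(-lam * g)
    return float((g ** 2 * w).sum() / w.sum())

rng = np.random.default_rng(3)
for _ in range(100):
    x = rng.random(6)
    x /= x.sum()                  # a point on the simplex
    g = rng.random(6)             # nonnegative gradient coordinates
    vals = [s(l, x, g) for l in np.linspace(0.0, 2.0, 50)]
    assert all(b <= a + 1e-10 for a, b in zip(vals, vals[1:]))  # non-increasing
    assert vals[0] == s(0.0, x, g)   # s(0) = sum_i g_i^2 x_i
```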
17.6 Exercises
Exercise 17.1: We consider the doubling trick, a frequently used technique in online learning
to allow good performance of online learning procedures even without knowledge of the number of
steps n they will be run. For this question, we define the regret in the usual way as
$$\mathrm{Reg}_n := \sup_{w^\star \in \Theta} \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)].$$
(a) Suppose that we have a procedure (algorithm) A(η) parameterized by the real value η ≥ 0
(usually, this is simply a stepsize) that achieves the regret bound
$$\mathrm{Reg}_n \le \frac{r^2}{2\eta} + \frac{\eta}{2} L^2 n,$$
where r and L are known constants. Consider the following procedure, which proceeds in
epochs k = 1, 2, . . ., each of which lasts for $n_k = 2^k$ steps. At the start of epoch k, restart the
algorithm A(η) with parameter choice $\eta_k = \frac{r}{L\sqrt{2^k}}$, and run the algorithm with this choice of
parameter for $2^k$ steps. Show that
$$\mathrm{Reg}_n \le C \cdot L r \sqrt{n},$$
where C is some numerical constant (in our solution, we have $C \le 2/(\sqrt{2} - 1)$).
(b) Now we consider a slightly more restrictive setting, but we obtain better guarantees. Consider
the mixture of experts problem, in which d experts suffer losses in [0, 1] at each timestep; we let
$g_t \in [0, 1]^d$ denote the loss vector. We play a mixture of experts $w_t \in \Theta = \Delta_d$, suffering
(expected) loss $\ell_t(w_t) = \langle g_t, w_t\rangle$. In the course notes, we show that in this setting
$$\mathrm{Reg}_n = \sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le \frac{\log d}{\eta} + \frac{\eta}{2}\sum_{t=1}^n\sum_{j=1}^d w_{t,j} g_{t,j}^2$$
when using the exponential weights algorithm with stepsize η, which in turn implies
$$\sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le \left(1 - \frac{\eta}{2}\right)^{-1}\left[\frac{\log d}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \ell_t(w^\star)\right]$$
for any $w^\star \in \Theta$. Consider the following procedure, which proceeds in epochs k = 1, 2, . . ., within
each of which we perform exponential weights with stepsize $\eta_k = \min\{1, \sqrt{\log d}/2^k\}$. Let $E_k$
denote those times t belonging to epoch k, which correspond to times when we run exponential
weights with parameter $\eta_k$. Define $L^{(k)} = \min_j \sum_{t \in E_k} g_{t,j}$ to be the loss incurred by the best
expert in epoch k as the procedure runs, and continue epoch k until the best expert's loss in
epoch k satisfies $L^{(k)} \ge 4^k$. Then begin a new epoch. Show that with this procedure,
$$\sum_{t=1}^n [\ell_t(w_t) - \ell_t(w^\star)] \le C_1 \log\log d \cdot \log d + C_2 \sqrt{\log d \cdot \sum_{t=1}^n \ell_t(w^\star)}$$
for numerical constants $C_1$ and $C_2$ (we obtain $C_1 \le 3$ and $C_2 \le 8\sqrt{2}$).
Exercise 17.2 (Min-max games and regret): The saddle point problem, or min-max game
problem, considers solving
$$\mathop{\mathrm{minimize}}_{x \in X} \; \sup_{y \in Y} L(x, y), \tag{17.6.1}$$
where L is convex in its first argument and concave in its second, and X, Y are convex sets. (See
Appendix C.4 for a general treatment of these problems.) We say a point (x⋆ , y ⋆ ) ∈ X × Y is a
saddle point if
$$\sup_{y \in Y} L(x^\star, y) \le L(x^\star, y^\star) \le \inf_{x \in X} L(x, y^\star).$$
Now we show how online convex optimization can prove that saddle points exist for some problems.
Assume that X and Y are compact and convex, and that for each x0 ∈ X, L(x0 , y) is 1-Lipschitz
and concave in y, and for each y0 ∈ Y , L(x, y0 ) is 1-Lipschitz and convex and subdifferentiable in
x. Consider the following online game: at iteration t, the x player chooses $x_t \in X$, and then the y
player chooses the "best response"
yt ∈ argmax L(xt , y).
y∈Y
The goal of the x player is to achieve low regret with respect to any fixed x⋆ ∈ X.
(b) Define ft (x) = L(x, yt ). Give a strategy for the x player so that for any T and any x⋆ ∈ X,
$$\sum_{t=1}^T f_t(x_t) - f_t(x^\star) \le O(1)\sqrt{T}.$$
(c) Show that $\bar x_T = \frac{1}{T}\sum_{t=1}^T x_t$ and $\bar y_T = \frac{1}{T}\sum_{t=1}^T y_t$ satisfy
$$\sup_{y \in Y} L(\bar x_T, y) \le L(x^\star, \bar y_T) + O(1)/\sqrt{T}$$
for any $x^\star \in X$.
(d) Show that there exists a saddle point for the min-max problem (17.6.1) when L is Lipschitz
and X and Y are compact, as above.
Exercise 17.3 (von Neumann’s minimax theorem for zero-sum games): Let A ∈ Rm×n be an
arbitrary matrix and X and Y be compact convex sets. Show that
$$\inf_{x \in X} \sup_{y \in Y} \langle x, Ay\rangle = \sup_{y \in Y} \inf_{x \in X} \langle x, Ay\rangle.$$
In the language of games, there are strategies (x⋆ , y ⋆ ) for which it is unimportant which player
plays first.
Exercise 17.4 (Second-order strong convexity conditions): Let f be twice continuously differen-
tiable on its domain, which you may assume to be open.
(a) Assume that f is strongly convex with respect to the norm ∥·∥. Show that for all x ∈ dom f ,
u⊤ ∇2 f (x)u ≥ ∥u∥2 for all u.
(b) Show that if u⊤ ∇2 f (x)u ≥ ∥u∥2 for all u and x ∈ dom f , then f is strongly convex with respect
to the norm ∥·∥.
Hint. See Proposition C.1.5 in Appendix C.1.
Exercise 17.5 (The strong convexity of p-norms): In this question, we use the results of Exer-
cise 17.4 to show that for $1 < p < 2$, $f_p(w) := \frac{1}{2(p-1)}\|w\|_p^2$ is strongly convex with respect to the
norm $\|\cdot\|_p$.
(a) Define $\psi(t) = \frac{1}{2(p-1)} t^{2/p}$ and $\phi(t) = |t|^p$. Let $H(w) = \nabla^2 f_p(w)$. Show that
$$H_{ii}(w) = \psi''\big(\|w\|_p^p\big)\big(\phi'(w_i)\big)^2 + \psi'\big(\|w\|_p^p\big)\phi''(w_i)$$
and for i ≠ j,
$$H_{ij}(w) = \psi''\big(\|w\|_p^p\big)\phi'(w_i)\phi'(w_j).$$
(c) Use Hölder’s inequality to demonstrate that u⊤ ∇2 fp (w)u ≥ ∥u∥2p for all u.
$$\theta_\eta = \frac{1}{\|w\|_q^{q-2}} \Big( |w_j|^{q-1}\, \mathrm{sign}(w_j) \Big)_{j=1}^d.$$
$$\sup_{\theta, \theta' \in \Theta} D_\psi(\theta, \theta') \le \frac{5}{2} R_p^2.$$
Exercise 17.10 (On adaptive stepsizes): Let ψ be a distance generating function strongly convex
with respect to a norm ∥·∥ over Θ, and let θt be generated by the iteration (17.2.5). Fix η > 0, and
define the tth stepsize
$$\eta_t = \frac{\eta}{\sqrt{\sum_{\tau=1}^t \|g_\tau\|_*^2}}, \tag{17.6.2}$$
so that the adaptive steps (17.6.2) enjoy a type of post-hoc optimality guarantee.
Exercise 17.11 (Coordinate-based lower bounds [73]): This question explores using Assouad’s
method for lower bounds in stochastic convex optimization, providing a proof of Theorem 17.3.13.
Recall that a convex set Θ ⊂ Rd is orthosymmetric if for any θ ∈ Θ, Sθ ∈ Θ for any diagonal sign
matrix S. Let Gj ≥ 0 and L be the coordinate-bounded loss class (17.3.11).
(a) Fix an arbitrary vector $a \in \mathbb{R}^d$. Define the loss $\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ by
$$\ell(\theta; x) := \sum_{j=1}^d G_j |\theta_j - a_j x_j|.$$
(b) Use a variant of Assouad’s method to prove Theorem 17.3.13. That is, show that there exists
a numerical constant c > 0 such that for any orthosymmetric convex parameter space Θ,
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge c \frac{1}{\sqrt{n}} \sup_{\theta \in \Theta} \sum_{j=1}^d G_j |\theta_j|.$$
(It is possible but perhaps tedious to obtain a constant $c \ge \frac{1}{8}$.) Hint. Define distributions
$P_v$ on $X \in \{-1, 1\}^d$, indexed by $v \in \{-1, 1\}^d$, so that $X \sim P_v$ has independent coordinates
$X_j \sim \mathrm{Bernoulli}\big(\frac{1 + v_j\delta}{2}\big)$. Take a ∈ Θ and use the losses in the previous part.
$$\mathfrak{M}_n(\Theta, \mathcal{L}) \ge \frac{G_\infty R_1 \delta}{2}\left(1 - o(1) - \frac{9n\delta^2}{8\log(2d)}\right).$$
Exercise 17.14 (An empirical comparison of Bandit algorithms): In this question, you will
investigate three algorithms for solving the Bandit problem: the Upper Confidence Bound algorithm
(UCB), Thompson sampling (also known as Posterior Sampling), and exponential gradient. You will
attempt to maximize the reward achieved by the algorithms (note that in the notes, we sometimes
maximize and sometimes minimize; make sure you have your signs correct!).
In particular, set the rewards for the arms in the following way:
i. Let $\theta_1 = \frac{1}{2}$ and $\theta_2 = \cdots = \theta_K = \frac{1}{2} - \epsilon$.
ii. When arm j is sampled, return Y = 1 with probability θj and Y = 0 with probability 1 − θj .
Perform Thompson sampling (Example 18.3.4 in the notes) assuming that the prior on θ has
each coordinate independent with a Beta(1, 1) distribution. Perform UCB with the confidence pa-
rameter $\delta_t = 1/\sqrt{t}$ (Algorithm 18.1 in the notes) and the appropriate choice of the sub-Gaussian pa-
rameter $\sigma^2$ (hint: use Hoeffding's lemma for $\sigma^2$). Perform exponentiated gradient (Algorithm 18.6)
using the optimal stepsize choice η, assuming that $\sigma^2 = \frac{1}{2}$ in the bound.
Plot your results for each of experiments (a) and (b). Which algorithm do you prefer?
Chapter 18
Consider the following problem: we have a possible treatment for a population with a disease, but
we do not know whether the treatment will have a positive effect or not. We wish to evaluate the
treatment to decide whether it is better to apply it or not, and we wish to optimally allocate our
resources to attain the best outcome possible. There are challenges here, however, because for each
patient, we may only observe the patient’s behavior and disease status in one of two possible states—
under treatment or under control—and we wish to allocate as few patients to the group with worse
outcomes (be they control or treatment) as possible. This balancing act between exploration—
observing the effects of treatment or non-treatment—and exploitation—giving treatment or not as
we decide which has better palliative outcomes—underpins the challenges in this chapter.
Our main focus will be variants of the K-armed bandit problem, so named because we imagine
a player in a casino, choosing between K different slot machines. As this is a casino, the player
will surely lose eventually, hence the bandit moniker. Each machine has a different and unknown
reward distribution. The player wishes to put as much money as possible into the machine with
the greatest expected reward. There is a substantial literature in statistics, operations research,
economics, game theory, and computer science on variants of the problems we consider here.
The goal is to find the index i with the maximal mean µi without sampling sub-optimal distributions
Pi (or random variables Y (i)) too often.
We consider this in an online setting, proceeding iteratively for t = 1, 2, . . ., where at each
iteration t of the process, the player takes an action At ∈ {1, . . . , K}, then, conditional on i = At ,
observes a reward Yt (i) drawn independently from the distribution Pi . The key is that given the
past, the action At is independent of the vector Yt := (Yt (1), . . . , Yt (K)), and at time t, we can
never observe the entire feedback vector. Then the goal is to minimize the realized regret after n
steps, which is
$$\mathrm{Reg}_n := \sum_{t=1}^n \big(\mu_{i^\star} - \mu_{A_t}\big), \tag{18.1.2}$$
where $i^\star \in \mathop{\mathrm{argmax}}_i \mu_i$, so $\mu_{i^\star} = \max_i \mu_i$. While the regret (18.1.2) involves only the means of
random variables, it is still random, so we generally seek to give bounds on its expectation or
high-probability guarantees on its value. In this chapter, we generally focus for simplicity on the
expected regret,
$$\mathrm{Reg}_n := \mathbb{E}\left[\sum_{t=1}^n \big(\mu_{i^\star} - \mu_{A_t}\big)\right], \tag{18.1.3}$$
where the expectation is taken over any randomness in the player’s actions At and in the repeated
observations of the random variables Y (1), . . . , Y (K).
Example 18.1.1 (Potential outcomes): Consider estimating the effect of a particular treat-
ment on some disease. In a simple version of this setting, the set of actions is {0, 1}, corre-
sponding to treatment or control. In a particular model for estimation of causal effects, the
eponymous Neyman-Rubin causal model, we imagine an individual i as having potential out-
comes Yi := (Yi (0), Yi (1)), where Yi (0) indicates the individual’s response under the control
(no treatment), while Yi (1) indicates the individual’s response to treatment.
These outcomes are potential because we can never view both: we may only observe one or
the other, as a patient cannot be in both treatment and control. When the action Ai (assignment
to treatment or control) for patient i is independent of any particular characteristics of the
patient, then
E[Yi (a) | Ai = a] = E[Yi (a)] =: µa (18.1.4)
is the expected response for the patient for treatment (a = 1) or control (a = 0), and E[Yi (1) −
Yi (0)] is the (unobservable!) treatment effect for the patient. (Here, we elide a small detail: we
are thinking of the particular individual i as a random representative individual, not attempting
to estimate anything particular about them.) If we choose Ai in a way that depends on the
individual i, then we may have confounding: E[Yi (a) | Ai = a] ̸= E[Yi (a)].
Given a group of n individuals, a fully randomized trial to estimate the treatment effect
chooses n/2 of the individuals to receive treatment and n/2 to be in the control, uniformly at
random. Then
$$\hat\tau := \frac{1}{n/2}\sum_{i : A_i = 1} Y_i(A_i) - \frac{1}{n/2}\sum_{i : A_i = 0} Y_i(A_i)$$
is an unbiased estimator for $\tau = \mathbb{E}[Y(1) - Y(0)]$, with $\mathbb{P}(|\hat\tau - \tau| \ge t) \le \exp\big(-c\frac{nt^2}{\sigma^2}\big)$ in the sub-Gaussian setting (18.1.1). This treatment assignment strategy, while effective for estimating
gaussian setting (18.1.1). This treatment assignment strategy, while effective for estimating
the treatment effect, typically incurs very high regret (18.1.3), at least if the treatment is
effective or has strong negative outcomes, because it allocates too many individuals to either
the control or treatment arm of the study. ⋄
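A short simulation illustrates the unbiasedness of the randomized-trial estimator τ̂; the Gaussian potential outcomes and the means μ₁, μ₀ below are hypothetical choices for illustration:

```python
import numpy as np

def randomized_trial(n, mu1, mu0, sigma=1.0, seed=0):
    """Fully randomized trial: a uniformly random half of the n individuals is
    treated; tau_hat is the difference of the observed arm means."""
    rng = np.random.default_rng(seed)
    treat = rng.permutation(n) < n // 2        # random half gets treatment
    y1 = mu1 + sigma * rng.normal(size=n)      # potential outcomes Y_i(1)
    y0 = mu0 + sigma * rng.normal(size=n)      # potential outcomes Y_i(0)
    observed = np.where(treat, y1, y0)         # only one outcome is ever seen
    return observed[treat].mean() - observed[~treat].mean()

taus = [randomized_trial(1000, mu1=1.0, mu0=0.2, seed=s) for s in range(200)]
assert abs(np.mean(taus) - 0.8) < 0.05         # unbiased for tau = mu1 - mu0
```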
Example 18.1.1 highlights many of the central aspects of bandit-like problems. First, they have
deep connections with causal reasoning: by selecting an action (or arm) A, we are intervening in
a system, with a goal to identify the best intervention. Second, the online nature can be critical:
in medical settings, for example, it would be unethical to continue a drug trial if it were ineffective.
Finally, the example implicitly shows one of the major benefits of the online scenario: because we
select the action At without any information about the outcomes Yt = (Yt (1), . . . , Yt (K)), we can
avoid issues of confounding that would arise trying to estimate or learn from offline data.
Let
\[
N_t(i) = {\rm card}\{\tau \le t \mid A_\tau = i\}
\]
denote the number of times that arm i has been pulled by time t of the bandit process, and define
\[
\hat{\mu}_t(i) := \frac{1}{N_t(i)} \sum_{\tau \le t,\, A_\tau = i} Y_\tau(i),
\]
to be the running average of the rewards of arm i at time t (computed only on those instances in
which arm i was selected). The sequential nature of the bandit process, coupled with the (conditionally) i.i.d. randomness in the samples Yτ(i) ∼ Pi when Aτ = i in our version of the bandit
game implies an important distributional equality, which underpins our analyses throughout:
Lemma 18.2.1. Let Y′τ(i) ∼ Pi, τ = 1, 2, . . ., be i.i.d. copies of the random variables Yτ(i). Then Y′τ(i) is independent of Nt(i) for all t and τ, and
\[
(\hat{\mu}_t(i), N_t(i)) \stackrel{\rm dist}{=} (\hat{\mu}'_t(i), N_t(i)), \tag{18.2.1}
\]
where \(\hat{\mu}'_t(i) = \frac{1}{N_t(i)} \sum_{\tau : A_\tau = i} Y'_\tau(i)\) is the empirical mean of the copies Y′τ(i) for those steps when arm i is selected.
We prove the claim (18.2.1) in the Appendix 18.6.1 to this chapter.
The distributional equality (18.2.1) implies that the means μ̂t(i) are sub-Gaussian, even conditional on the number of pulls Nt(i). Indeed, we observe that for λ ∈ R and any m ∈ N,
\[
E[\exp(\lambda m \hat{\mu}_t(i)) \mid N_t(i) = m] \stackrel{(\star)}{=} E[\exp(\lambda m \hat{\mu}'_t(i)) \mid N_t(i) = m] \le \exp\Big(\frac{m\lambda^2\sigma^2}{2}\Big), \tag{18.2.2}
\]
where equality (⋆) follows from Lemma 18.2.1, while the inequality follows because conditional on the event Nt(i) = m, we have
\[
\hat{\mu}'_t(i) \stackrel{\rm dist}{=} \frac{1}{m}\sum_{\tau=1}^m Y'_\tau(i) \quad \text{for } Y'_\tau(i) \stackrel{\rm iid}{\sim} P_i.
\]
Thus, by taking conditional expectations over the value Nt (i), we immediately see via Corol-
lary 4.1.10 that for all t,
\[
P\bigg(\hat{\mu}_t(i) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{N_t(i)}}\bigg) \vee P\bigg(\hat{\mu}_t(i) \le \mu_i - \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{N_t(i)}}\bigg) \le \delta. \tag{18.2.3}
\]
That is, so long as we pull the arms sufficiently many times, we are unlikely to pull the wrong arm.
Here then is the UCB procedure:
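The procedure of Figure 18.1 amounts to pulling each arm once and thereafter playing the arm maximizing the index μ̂t(i) + √(σ² log(1/δt)/Nt(i)) that appears in the proof of Proposition 18.2.2. A minimal sketch under the illustrative choices δt = 1/t² and Bernoulli rewards:

```python
import math
import random

def ucb(means, n, sigma2=0.25, rng=random):
    """Minimal UCB sketch: pull every arm once, then play the arm with the
    largest index mu_hat_t(i) + sqrt(sigma^2 log(1/delta_t) / N_t(i)),
    here with delta_t = 1/t^2 and Bernoulli(means[i]) rewards."""
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    total_reward = 0.0
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1  # initialization: pull each arm once
        else:
            log_term = math.log(t * t)  # log(1/delta_t) for delta_t = 1/t^2
            arm = max(range(K), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(sigma2 * log_term / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return counts, total_reward

random.seed(1)
counts, _ = ucb([0.2, 0.5, 0.8], n=5000)
print(counts)  # the optimal arm (index 2) receives the bulk of the pulls
```

Consistent with Proposition 18.2.2, the sub-optimal arms accumulate only O(log n/Δi²) pulls.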
If we define
∆i := µi⋆ − µi
to be the gap in means between the optimal arm and any sub-optimal arm, we then obtain the
following guarantee on the expected number of pulls of any sub-optimal arm i after n steps.
Proposition 18.2.2. Assume that each of the K arms is σ 2 -sub-Gaussian and let the sequence
δ1 ≥ δ2 ≥ · · · be non-increasing and positive. Then for any T and any arm i ̸= i⋆ ,
\[
E[N_T(i)] \le \bigg\lceil \frac{4\sigma^2\log\frac{1}{\delta_T}}{\Delta_i^2} \bigg\rceil + 2\sum_{t=2}^T \delta_t.
\]
Proof Without loss of generality, we assume arm 1 satisfies µ1 = maxi µi , and let arm i be any
sub-optimal arm. The key insight is to carefully consider what occurs if we play arm i in the UCB
procedure of Figure 18.1. In particular, if we play arm i at time t, then we certainly have
\[
\hat{\mu}_t(i) + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{N_t(i)}} \ge \hat{\mu}_t(1) + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{N_t(1)}}.
\]
For this to occur, at least one of the following three events must occur (we suppress the dependence
on i for each of them):
\[
E_{1,t} := \bigg\{\hat{\mu}_t(i) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{N_t(i)}}\bigg\}, \quad
E_{2,t} := \bigg\{\hat{\mu}_t(1) \le \mu_1 - \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{N_t(1)}}\bigg\}, \quad
E_{3,t} := \bigg\{\Delta_i \le 2\sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{N_t(i)}}\bigg\}.
\]
Indeed, suppose that none of the events E1,t, E2,t, E3,t occurs at time t. Then we have
\[
\hat{\mu}_t(i) + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{N_t(i)}} < \mu_i + 2\sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{N_t(i)}} < \mu_i + \Delta_i = \mu_1 < \hat{\mu}_t(1) + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{N_t(1)}},
\]
a contradiction to arm i being played at time t. Now define
\[
l^\star = \bigg\lceil \frac{4\sigma^2\log\frac{1}{\delta_T}}{\Delta_i^2} \bigg\rceil,
\]
so that once arm i has been pulled l⋆ times, the event E3,t cannot occur (as the δt are non-increasing). Then
\[
E[N_T(i)] = \sum_{t=1}^T E[\mathbf{1}\{A_t = i\}] \le l^\star + \sum_{t=l^\star+1}^T P(A_t = i,\ E_{3,t} \text{ fails})
\le l^\star + \sum_{t=l^\star+1}^T P(E_{1,t} \text{ or } E_{2,t}) \le l^\star + \sum_{t=l^\star+1}^T 2\delta_t.
\]
Naturally, the number of times arm i is selected in the sequential game is related to the regret
of a procedure; indeed, we have
\[
{\rm Reg}_n = \sum_{t=1}^n (\mu_{i^\star} - \mu_{A_t}) = \sum_{i=1}^K (\mu_{i^\star} - \mu_i)\, N_i(n) = \sum_{i=1}^K \Delta_i N_i(n).
\]
Using this identity, we immediately obtain two theorems on the (expected) regret of the UCB
algorithm.
Theorem 18.2.3. Let δt = δ/t² for all t. Then for any n ∈ N the UCB algorithm attains
\[
{\rm Reg}_n \le \sum_{i\ne i^\star} \frac{4\sigma^2[2\log n - \log\delta]}{\Delta_i} + \frac{\pi^2 - 2}{3}\sum_{i=1}^K \Delta_i\,\delta + \sum_{i=1}^K \Delta_i.
\]
Proof  With δt = δ/t² we have log(1/δn) = 2 log n − log δ, so each sub-optimal arm i satisfies
\[
E[N_n(i)] \le \bigg\lceil \frac{4\sigma^2(2\log n - \log\delta)}{\Delta_i^2} \bigg\rceil + 2\sum_{t=2}^n \frac{\delta}{t^2}
\]
by Proposition 18.2.2. Summing over i ≠ i⋆ and noting that Σ_{t≥2} t^{-2} = π²/6 − 1 gives the result.
Let us unpack the bound of Theorem 18.2.3 slightly. First, we make the simplifying choice δ = 1, so that δt = 1/t² for all t, and let Δ = min_{i≠i⋆} Δi. In this case, we have expected regret bounded by
\[
{\rm Reg}_n \le \frac{8K\sigma^2\log n}{\Delta} + \frac{\pi^2+1}{3}\sum_{i=1}^K \Delta_i.
\]
So we see that the asymptotic regret with this choice of δ scales as (Kσ²/Δ) log n: roughly linear in the number of arms, logarithmic in n, and inversely proportional to the gap in means. As a concrete example, if we know that the rewards for each arm belong to the interval [0, 1], then Hoeffding's lemma (recall Example 4.1.6) shows that we may take σ² = 1/4. Thus the mean regret becomes at most Σ_{i:Δi>0} (2 log n/Δi)(1 + o(1)), where the o(1) term tends to zero as n → ∞.
If we knew a bit more about our problem, then by optimizing over δ and choosing δ = σ²/Δ, we obtain the upper bound
\[
{\rm Reg}_n \le O(1)\bigg(\frac{K\sigma^2}{\Delta}\log\frac{n\Delta}{\sigma^2} + K\,\frac{\max_i \Delta_i}{\min_i \Delta_i}\bigg), \tag{18.2.4}
\]
that is, the expected regret scales asymptotically as (Kσ²/Δ) log(nΔ/σ²): linearly in the number of arms, logarithmically in n, and inversely proportional to the gap between the largest and other means.
If any of the gaps ∆i → 0 in the bound of Theorem 18.2.3, the bound becomes vacuous—it
simply says that the regret is upper bounded by infinity. Intuitively, however, pulling a slightly
sub-optimal arm should be insignificant for the regret. With that in mind, we present a slight
√
variant of the above bounds, which has a worse scaling with n—the bound scales as n rather than
log n—but is independent of the gaps ∆i .
Theorem 18.2.4. If UCB is run with parameter δt = 1/t², then
\[
{\rm Reg}_n \le \sqrt{8K\sigma^2 n\log n} + 4\sum_{i=1}^K \Delta_i.
\]
Proof  Fix any γ > 0. Then we may write the regret with the standard identity
\[
{\rm Reg}_n = \sum_{i\ne i^\star} \Delta_i N_i(n) = \sum_{i:\Delta_i\ge\gamma} \Delta_i N_i(n) + \sum_{i:\Delta_i<\gamma} \Delta_i N_i(n) \le \sum_{i:\Delta_i\ge\gamma} \Delta_i N_i(n) + n\gamma,
\]
Optimizing over γ by taking γ = \(\sqrt{8K\sigma^2\log n / n}\) gives the result.
Combining the above two theorems, we see that the UCB algorithm with parameters δt = 1/t2
automatically achieves the expected regret guarantee
\[
{\rm Reg}_n \le C\cdot\min\bigg\{\sum_{i:\Delta_i>0}\frac{\sigma^2\log n}{\Delta_i},\ \sqrt{K\sigma^2 n\log n}\bigg\}. \tag{18.2.5}
\]
That is, UCB enjoys some adaptive behavior. It is not, however, optimal; there are algorithms, including Audibert and Bubeck's MOSS (Minimax Optimal in the Stochastic Case) bandit procedure [11], which achieve regret
\[
{\rm Reg}_n \le C\cdot\min\bigg\{\sqrt{Kn},\ \frac{K}{\Delta}\log\frac{n\Delta^2}{K}\bigg\},
\]
which is essentially the bound specified by inequality (18.2.4) (which required knowledge of the Δi's) and improves the analysis of Theorem 18.2.4 by a √(log n) factor. It is also possible to provide a
high-probability guarantee for the UCB algorithms, which follows essentially immediately from the
proof techniques of Proposition 18.2.2; we leave this to Exercise 18.6.
Example 18.3.1 (Classical Bernoulli bandit problem): The classical bandit problem, as in the
UCB case of the previous section, has actions (arms) A = {1, . . . , K}, and the parameter space
Θ = [0, 1]K , and Pθ is a distribution on Y ∈ {0, 1}K , where Y has independent coordinates
ind
1, . . . , K with Pθ (Y (j) = 1) = θj , that is, Y (j) ∼ Bernoulli(θj ). The goal is to find the arm
a⋆ ∈ argmaxj θj with highest mean reward. Then the loss function ℓ(a, y) = −y satisfies
ℓ(a, y) ∈ [−1, 0], and Eθ [ℓ(a, Y (a))] = −θa , so that the optimal action minimizes Eθ [ℓ(a, Y (a))]
over actions a ∈ A. 3
where
\[
A^\star = A^\star(\theta) = \mathop{\rm argmin}_{a\in\mathcal{A}}\, E_\theta[\ell(a, Y(a))]
\]
minimizes the expected loss of taking action a ∈ A when θ is the parameter (and so is a
function of θ), while At ∈ A is the action the player takes at time t of the process. With this
abstract setting, Figure 18.2 captures the broad algorithmic framework for this section.
\[
H_{t-1} := \{(A_1, Y_1(A_1)), (A_2, Y_2(A_2)), \ldots, (A_{t-1}, Y_{t-1}(A_{t-1}))\}
\]
In Figure 18.2, we left the choice of θ unspecified; in worst case settings, it may be adversarial.
Moving to a Bayesian setting in which θ ∼ π for some prior distribution π on the space Θ allows us
to leverage information-theoretic tools to obtain regret bounds, as well as new yet intuitive algorithms.
Bayesian strategies—because they (can) incorporate prior knowledge—have the advantage that they
suggest policies for exploration and trading between regret and information; that is, they allow us
to quantify a value for information. They often yield very simple procedures, allowing simpler
implementations. In this Bayesian setting where we have a prior distribution π with support Θ, we
then define the Bayesian regret as
\[
{\rm Reg}_n(\mathcal{A}, \ell, \pi) = E_\pi\bigg[\sum_{t=1}^n \big(\ell(A_t, Y_t(A_t)) - \ell(A^\star, Y_t(A^\star))\big)\bigg]. \tag{18.3.2}
\]
We take the expectation (18.3.2) over both the randomness in θ according to the prior π and any
randomness in the player’s strategy for choosing the actions At at each time.
One consequence of our definition (18.3.2) worth noting is that because we take expectations,
we could instead consider ℓ(a, θ) := E[ℓ(a, Y (a)) | θ], but this quantity is typically unobservable to
the algorithm. The setting (18.3.2) also encapsulates what appear to be more general settings in
which we obtain observations Yt (a) by taking action a ∈ A, and suffer some random loss Lt,a that
depends on the action a. But we could simply incorporate these random losses into the observation
Yt (a) itself, making ℓ(a, Yt (a)) “pull out” the component of the observation including the loss.
If we can relate the gaps in expected losses to the information each action At provides about the optimal action, then each step of Algorithm 18.2 either gains substantial information or suffers small regret, but never both.
Thus, once we have collected sufficient information, the remaining regret must be small.
To make things rigorous, we unfortunately must confront a notational choice that bedevils
bounds combining information theory and statistics, in that conditional mutual information and
entropy are expectations, but sometimes we wish to condition on particular realizations of variables.
Thus, define the history
\[
H_t := \{(A_1, Y_1(A_1)), (A_2, Y_2(A_2)), \ldots, (A_t, Y_t(A_t))\}
\]
of actions and observations through iteration t of our bandit process; here, we think of the particular realizations (At, Yt(At)). Then for any action a ∈ A, define the conditional mutual information
\[
I_t(a) := I(A^\star; Y_t(a) \mid H_{t-1}) \tag{18.3.3}
\]
to be the mutual information between A⋆ and the observation Yt(a) conditional on the realized history Ht−1. By this we mean that we draw θ from its posterior distribution conditional on
Ht−1 , then set A⋆ = argmina∈A ℓ(a, θ), and draw Yt (a) conditional on θ as well. The mutual
information (18.3.3) is a random variable, as it depends on the particular realization Ht−1 of the
history (so we think of it analogously to I(X; Y | Z = z) in our definitions of mutual information in
Chapter 2.1.1, except the value z is random). Without any real loss of generality, we can discretize
and assume A is countable, allowing us to write
\[
I_t(a) = H(A^\star \mid H_{t-1}) - H(A^\star \mid Y_t(a), H_{t-1}),
\]
the reduction in entropy on the optimal action A⋆ from observing Yt(a). Then for a distribution ρ on actions A, define
\[
I_t(\rho) := I(A^\star; A, Y_t(A) \mid H_{t-1}) \quad\text{for } A \sim \rho, \tag{18.3.4}
\]
where now we replace the fixed action a in the definition (18.3.3) with a random A ∼ ρ, and then
conditional on A = a draw Yt (a) as before. This is the information gained about A⋆ by taking
action A ∼ ρ, conditional on the realization Ht−1 of the history.
With the random averaged information (18.3.4), we can define the ratio between the squared expected regret and the information gained by sampling an action A ∼ ρ via
\[
R_t(\rho) := \frac{\big(E_\rho[\ell(A, Y_t(A)) - \ell(A^\star, Y_t(A^\star)) \mid H_{t-1}]\big)^2}{I_t(\rho)}. \tag{18.3.5}
\]
This random ratio is, again, a random variable, as it is a function of the (realized) history Ht−1 . If
we can choose a distribution ρ to make the ratio (18.3.5) small, this captures the desirable property
that for A ∼ ρ, either A should have losses close to the optimal action A⋆ , or A and the response
Yt (A) should provide substantial information about A⋆ .
That squared gaps in expected losses should be related to information at all might seem a
priori unmotivated: why the squared error? Why the information? But we have seen relationships
between squared errors and information many times throughout the book; indeed, the connection
between squared distances and KL-divergences underpins many of the concentration inequalities
we developed in previous chapters via the Donsker-Varadhan variational representation of KL-
divergence (Theorem 6.1.1). Particular examples arise in Chapter 6.2 on PAC-Bayes generalization
bounds, in our development of adaptive data analysis in Chapter 6.3 (recall Theorem 6.3.2), and
in transportation inequalities for concentration (Chapter 7, Theorem 7.1.2).
By an application of the Cauchy-Schwarz inequality and using the definition of the mutual
information, we can prove the next theorem, which upper bounds the expected regret by the
loss/information ratio and the information gained throughout the process about the optimal arm:

Theorem 18.3.2. Let ρt denote the distribution of the action At conditional on the history Ht−1. Then
\[
{\rm Reg}_n(\mathcal{A}, \ell, \pi) \le \sqrt{E_\pi\bigg[\sum_{t=1}^n R_t(\rho_t)\bigg]}\cdot\sqrt{I\big(A^\star; \{A_t, Y_t(A_t)\}_{t=1}^n\big)}.
\]
Proof As in the discussion before the theorem, we apply the Cauchy-Schwarz inequality:
\begin{align*}
E_\pi\bigg[\sum_{t=1}^n \ell(A_t, Y_t(A_t)) - \ell(A^\star, Y_t(A^\star))\bigg]
&= \sum_{t=1}^n E_\pi\bigg[\frac{\ell(A_t, Y_t(A_t)) - \ell(A^\star, Y_t(A^\star))}{\sqrt{I_t(\rho_t)}}\sqrt{I_t(\rho_t)}\bigg] \\
&= \sum_{t=1}^n E_\pi\bigg[\frac{E_{\rho_t}[\ell(A_t, \theta) - \ell(A^\star, \theta) \mid H_{t-1}]}{\sqrt{I_t(\rho_t)}}\sqrt{I_t(\rho_t)}\bigg] \\
&= \sum_{t=1}^n E_\pi\Big[\sqrt{R_t(\rho_t)}\,\sqrt{I_t(\rho_t)}\Big] \\
&\le \sqrt{E_\pi\bigg[\sum_{t=1}^n R_t(\rho_t)\bigg]}\,\sqrt{\sum_{t=1}^n E_\pi[I_t(\rho_t)]},
\end{align*}
where the second equality uses that the action sampling distribution ρt is a function of the history
Ht−1. Then observe that
\[
E_\pi[I_t(\rho_t)] = E_\pi[I(A^\star; A_t, Y_t(A_t) \mid H_{t-1})] = I\big(A^\star; A_t, Y_t(A_t) \mid \{A_\tau, Y_\tau(A_\tau)\}_{\tau<t}\big)
\]
by definition of the conditional mutual information. Using the chain rule for mutual information (recall Section 2.1.2),
\[
\sum_{t=1}^n I\big(A^\star; A_t, Y_t(A_t) \mid \{A_\tau, Y_\tau(A_\tau)\}_{\tau<t}\big) = I\big(A^\star; \{A_t, Y_t(A_t)\}_{t=1}^n\big),
\]
which gives the result.
As an immediate corollary, note that whenever the set A is finite or countable, the mutual information has the trivial upper bound
\[
I\big(A^\star; \{A_t, Y_t(A_t)\}_{t=1}^n\big) = H(A^\star) - H\big(A^\star \mid \{A_t, Y_t(A_t)\}_{t=1}^n\big) \le H(A^\star),
\]
the (Shannon) entropy of A⋆. We elide the dependence of A⋆ on the prior π over Θ, though of course different priors can yield different entropies. Whenever A is finite, H(A⋆) ≤ log card(A), and we record these observations as a corollary.
Corollary 18.3.3. Assume that the action distributions ρt are chosen so that for some Rπ < ∞, the average loss/information ratio satisfies
\[
\frac{1}{n}\sum_{t=1}^n E_\pi[R_t(\rho_t)] \le R_\pi.
\]
Then Regn(A, ℓ, π) ≤ √(n Rπ H(A⋆)).

Let ρt^{ts} denote the distribution of the action At induced by sampling from the posterior πt in Thompson sampling, that is, by drawing θt ∼ πt and playing At ∈ argmin_{a∈A} ℓ(a, θt). Thompson [179] originally proposed this procedure in 1933 in the first paper on bandit problems, and it has since been the subject of substantial analysis.
We provide a few more concrete specifications of Algorithm 18.3 for Thompson (posterior) sampling, beginning with the case of Bernoulli rewards.
Example 18.3.4 (Thompson sampling for a K-armed Bernoulli bandit): Let the vector
θ ∈ [0, 1]K parameterize K independent Bernoulli(θa ) distributions, where on action a ∈ A =
{1, . . . , K}, we observe Y (a) ∼ Bernoulli(θa ), that is, P(Y (a) = 1 | θ) = θa .
Place a beta prior on the coordinates of θ, θa ∼ Beta(1, 1), which corresponds to the uniform distribution on [0, 1]^K. Let
\[
N_a^1(t) = {\rm card}\{\tau \le t : A_\tau = a,\ Y_\tau(a) = 1\}
\]
be the number of times arm a is pulled by time t and yields reward 1, and similarly let N_a^0(t) = card{τ ≤ t : Aτ = a, Yτ(a) = 0}. Then, peeking ahead to Example 19.5.2 on Beta-Bernoulli distributions,
Thompson sampling with the loss ℓ(a, y) = −y proceeds as follows:
(1) For each arm a ∈ A = {1, . . . , K}, draw θt (a) ∼ Beta(1 + Na1 (t), 1 + Na0 (t)).
(2) Play the action At = argmaxa θt (a).
(3) Observe Yt (At ) ∈ {0, 1}, and increment the appropriate count.
In this case, we may implement Thompson sampling with just a few counters. 3
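This counter-based implementation is only a few lines; a minimal sketch (the Bernoulli simulation setup and function name are illustrative):

```python
import random

def thompson_bernoulli(theta, n, rng=random):
    """Thompson sampling for a K-armed Bernoulli bandit: draw
    theta_t(a) ~ Beta(1 + N^1_a(t), 1 + N^0_a(t)) for each arm, play the
    argmax, observe a Bernoulli(theta[a]) reward, and update the counts."""
    K = len(theta)
    ones = [0] * K   # N^1_a(t): pulls of arm a with reward 1
    zeros = [0] * K  # N^0_a(t): pulls of arm a with reward 0
    pulls = [0] * K
    for _ in range(n):
        a = max(range(K), key=lambda i: rng.betavariate(1 + ones[i], 1 + zeros[i]))
        y = 1 if rng.random() < theta[a] else 0
        ones[a] += y
        zeros[a] += 1 - y
        pulls[a] += 1
    return pulls

random.seed(2)
pulls = thompson_bernoulli([0.3, 0.5, 0.7], n=5000)
print(pulls)  # the optimal arm (index 2) dominates the pull counts
```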
We may extend Example 18.3.4 to the case in which the losses come from any distribution with
mean θi , so long as the distribution is supported on [0, 1]. In particular, we have the following
example.
Example 18.3.5 (Thompson sampling with bounded random losses): Let us again consider
the setting of Example 18.3.4, except that the observations Yt (a) ∈ [0, 1] with E[Y (a) | θ] = θa .
The following modification allows us to perform Thompson sampling in this case, even without
knowing the distribution of Y (a) | θ: we construct a random observation Ye (a) ∈ {0, 1} with the
property that P(Ye (a) = 1 | Y (a)) = Y (a). Then with losses ℓ(a, y) = −y, we seek the action a
maximizing the mean θa , and the posterior distribution over θ is still a Beta distribution. We
simply redefine
\[
N_a^0(t) := {\rm card}\{\tau \le t : A_\tau = a,\ \widetilde{Y}_\tau(a) = 0\} \quad\text{and}\quad N_a^1(t) := {\rm card}\{\tau \le t : A_\tau = a,\ \widetilde{Y}_\tau(a) = 1\}.
\]
The Thompson sampling procedure is otherwise identical. 3
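The binarization Ỹ(a) ∼ Bernoulli(Y(a)) is a one-liner; a sketch checking empirically that E[Ỹ(a) | Y(a)] = Y(a) (the helper name is illustrative):

```python
import random

def binarize(y, rng=random):
    """Given an observed value y in [0, 1], return Y_tilde in {0, 1} with
    P(Y_tilde = 1 | y) = y, so that E[Y_tilde | y] = y."""
    return 1 if rng.random() < y else 0

random.seed(3)
y = 0.37
draws = [binarize(y) for _ in range(200000)]
print(sum(draws) / len(draws))  # close to 0.37
```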
The key to our analysis of Thompson sampling is that whenever the losses themselves are sub-
Gaussian, (a multiple of) the mutual information between A⋆ and (Yt (At ), At ) upper bounds the
squared regret. There is some subtlety here that we note only in passing: when we say that for
each a ∈ A,
ℓ(a, Y (a)) is σ 2 -sub-Gaussian, (18.3.7)
we mean the following: for any distribution π on θ ∈ Θ, the observed loss from the process that
(i) chooses a ∈ A, (ii) draws θ ∼ π, and conditional on θ draws Y ∼ Pθ , then (iii) observes Y (a)
and ℓ(a, Y (a)) is σ 2 -sub-Gaussian. Alternatively, in the context of the sequential bandit problems
here, we could say that there is some σ 2 , which is a function of the history Ht−1 , such that for each
a ∈ A, ℓ(a, Yt (a)) is σ 2 -sub-Gaussian conditional on Ht−1 . A trivial sufficient condition for all of
this is that the losses be bounded: if
\[
\sup_y \ell(a, y) - \inf_y \ell(a, y) \le B,
\]
then ℓ(a, Y(a)) is (B²/4)-sub-Gaussian.
Lemma 18.3.6. Assume that for each a ∈ A, the loss ℓ(a, Y(a)) is σ²-sub-Gaussian (18.3.7) and that A is finite. Assume that for θ ∼ π, the optimal action A⋆ = argmin_{a∈A} E_θ[ℓ(a, Y(a))] has distribution ρ, and let A ∼ ρ be independent of A⋆. Then
\[
\sum_{a\in\mathcal{A}} \rho(a)\big(E[\ell(a, Y(a))] - E[\ell(a, Y(a)) \mid A^\star = a]\big) \le \sqrt{2\sigma^2\,{\rm card}(\mathcal{A})\,I(A, Y(A); A^\star)}.
\]
Proof  Let L(a) := ℓ(a, Y(a)) be shorthand for the random realization of the loss. Then by the Cauchy-Schwarz inequality,
\[
\sum_{a\in\mathcal{A}} \rho(a)\big(E[L(a)] - E[L(a) \mid A^\star = a]\big)
\le \bigg(\sum_a \rho(a)^2\big(E[L(a)] - E[L(a) \mid A^\star = a]\big)^2\bigg)^{1/2}\sqrt{{\rm card}(\mathcal{A})}
\le \bigg(\sum_{a,a^\star} \rho(a)\rho(a^\star)\big(E[L(a)] - E[L(a) \mid A^\star = a^\star]\big)^2\bigg)^{1/2}\sqrt{{\rm card}(\mathcal{A})},
\]
because ρ(a)² ≤ ρ(a) and we have only added nonnegative terms. Now we use that the losses are sub-Gaussian (18.3.7).
If Pa and Pa|A⋆=a⋆ denote the distributions of L(a) = ℓ(a, Y(a)) marginally and conditional on A⋆ = a⋆, respectively, then Theorem 7.1.2 implies that
\[
\big(E[L(a)] - E[L(a) \mid A^\star = a^\star]\big)^2 \le 2\sigma^2 D_{\rm kl}\big(P_{a\mid A^\star=a^\star}\,\|\,P_a\big).
\]
Summing over a⋆, we recognize that the marginal Pa = Σ_{a⋆} ρ(a⋆)Pa|A⋆=a⋆, and so the familiar representation (9.4.4) of the mutual information as a mixture of KL-divergences implies
\[
\sum_{a^\star\in\mathcal{A}} \rho(a^\star) D_{\rm kl}\big(P_{a\mid A^\star=a^\star}\,\|\,P_a\big) = I(\ell(A, Y(A)); A^\star \mid A = a) \le I(A, Y(A); A^\star \mid A = a)
\]
by the data processing inequality. Finally, we use that A and A⋆ are independent and identically distributed to observe that
\[
\sum_{a\in\mathcal{A}} \rho(a)\, I(A, Y(A); A^\star \mid A = a) = I(A, Y(A); A^\star).
\]
Lemma 18.3.6 immediately extends to the instantaneous expected loss gaps conditional on the history, because in the sampling procedure in Alg. 18.3, we have the distributional equality
\[
A^\star \mid H_{t-1} \stackrel{\rm dist}{=} A_t \mid H_{t-1}
\]
by construction of the posterior πt on θ given the history Ht−1 . Additionally, they are certainly
independent given the history, and so Lemma 18.3.6 implies that for Thompson sampling, whenever
A is finite,
\[
\big(E[\ell(A_t, Y_t(A_t)) \mid H_{t-1}] - E[\ell(A^\star, Y_t(A^\star)) \mid H_{t-1}]\big)^2 \le 2\sigma^2\,{\rm card}(\mathcal{A})\,I(A_t, Y_t(A_t); A^\star \mid H_{t-1}).
\]
Recalling the loss/information ratio (18.3.5), we see that for Thompson sampling we have the uniform bound
\[
R_t(\rho_t^{\rm ts}) \le 2\sigma^2\,{\rm card}(\mathcal{A}) \tag{18.3.9}
\]
uniformly for all times t. Thus, as a corollary to the main Theorem 18.3.2 bounding expected regret
by the mutual information, we have the following result, which we state as a theorem to highlight
the importance of Thompson sampling-style algorithms.
Theorem 18.3.7. Let A be finite, and assume that the loss ℓ(a, Yt(a)) is σ²-sub-Gaussian (18.3.7) for each action a ∈ A. Then Thompson sampling has regret
\[
{\rm Reg}_n(\mathcal{A}, \ell, \pi) \le \sqrt{2\sigma^2\,{\rm card}(\mathcal{A})\,H(A^\star)}\,\sqrt{n}.
\]
We make a few remarks here. First, the entropy H(A⋆) of the optimal action is never greater than log card(A), and so the regret essentially scales at worst as √(card(A) n). When the entropy
the optimal action is smaller, of course, the regret itself may be smaller. As an example application
of Theorem 18.3.7, let us revisit the Bernoulli bandit problem in Example 18.3.4.
Example 18.3.8 (Bernoulli bandits, continued): Consider the Bernoulli bandit setting of Example 18.3.4. Then the mean regret for loss ℓ(a, y(a)) = −y(a) is
\[
{\rm Reg}_n = \sum_{t=1}^n E_\pi[\theta_{A^\star} - \theta_{A_t}],
\]
the gap between the best mean reward of an arm and the played arm At. Because there are K arms, with {0, 1}-valued rewards, each is 1/4-sub-Gaussian, and so Thompson sampling achieves regret
\[
{\rm Reg}_n \le \sqrt{\frac{K\log K}{2}}\,\sqrt{n}.
\]
This is minimax rate optimal up to the log K factor, as we shall see, and removes the log n multiplicative factors (18.2.5) present in our other analyses. 3
JCD Comment: Add an experiment here that looks at UCB versus Thompson sam-
pling, and see what happens.
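A minimal version of such an experiment, with hypothetical Bernoulli arms and the UCB index and Beta-Bernoulli posterior updates sketched earlier (all constants here are illustrative):

```python
import math
import random

def run(algo, theta, n, rng):
    """One bandit run; returns the cumulative (pseudo-)regret."""
    K = len(theta)
    pulls = [0] * K; sums = [0.0] * K
    ones = [0] * K; zeros = [0] * K
    best = max(theta)
    regret = 0.0
    for t in range(1, n + 1):
        if algo == "ucb":
            if t <= K:
                a = t - 1  # pull each arm once
            else:
                a = max(range(K), key=lambda i: sums[i] / pulls[i]
                        + math.sqrt(0.25 * math.log(t * t) / pulls[i]))
        else:  # Thompson sampling with Beta(1, 1) priors
            a = max(range(K), key=lambda i: rng.betavariate(1 + ones[i], 1 + zeros[i]))
        y = 1 if rng.random() < theta[a] else 0
        pulls[a] += 1; sums[a] += y; ones[a] += y; zeros[a] += 1 - y
        regret += best - theta[a]
    return regret

rng = random.Random(4)
theta = [0.3, 0.5, 0.6]
ucb_regret = sum(run("ucb", theta, 2000, rng) for _ in range(20)) / 20
ts_regret = sum(run("thompson", theta, 2000, rng) for _ in range(20)) / 20
print(ucb_regret, ts_regret)
```

At this horizon the posterior-sampling runs typically incur regret comparable to or smaller than the confidence-bound runs, though the comparison depends on the gaps and the constants in the confidence widths.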
(Recall here that ∆(A) denotes the collection of probability distributions on A, and Eρ denotes
expectation over A ∼ ρ.)
In cases where there are only finitely many actions A, because Thompson sampling guarantees the bound Rt(ρt^{ts}) ≤ 2σ² card(A) whenever the losses are sub-Gaussian (recall inequality (18.3.9)), we similarly have Rt(ρt^{ids}) ≤ 2σ² card(A). Thus information-directed sampling achieves no worse regret:
It is not always apparent how to compute the information-directed sampling distribution (18.3.10),
as it involves optimization of a ratio of losses and mutual information. The connections between
information and squared error, however, allow other related approaches, which can sometimes be
easier to implement. For example, the following result, which is similar to Lemma 18.3.6, relates
the information of actions to the variance of losses:
Lemma 18.3.10. Assume that ℓ(a, Y(a)) is σ²-sub-Gaussian (18.3.7) for each a ∈ A. Then
\[
2\sigma^2\, I(A^\star; Y(A), A \mid A = a) \ge {\rm Var}\big(E[\ell(a, Y(a)) \mid A^\star]\big).
\]
Proof  Let Pa|A⋆=a⋆ denote the distribution of ℓ(a, Y(a)) conditional on A⋆ = a⋆, and let ρ be the distribution of A⋆. Then
\begin{align*}
2\sigma^2\, I(A^\star; Y(A), A \mid A = a)
&\stackrel{(i)}{\ge} 2\sigma^2\, I(A^\star; \ell(a, Y(a)) \mid A = a) \\
&\stackrel{(ii)}{=} 2\sigma^2 \int_{a^\star\in\mathcal{A}} D_{\rm kl}\big(P_{a\mid A^\star=a^\star}\,\|\,P_a\big)\, d\rho(a^\star) \\
&\stackrel{(iii)}{\ge} \int_{a^\star\in\mathcal{A}} \big(E[\ell(a, Y(a))] - E[\ell(a, Y(a)) \mid A^\star = a^\star]\big)^2\, d\rho(a^\star) \\
&= {\rm Var}\big(E[\ell(a, Y(a)) \mid A^\star]\big),
\end{align*}
where step (i) follows by the data processing inequality, step (ii) via the representation (9.4.4) of the mutual information, step (iii) uses the sub-Gaussianity of ℓ(a, Y(a)) (Theorem 7.1.2), and the final equality follows because E[E[ℓ(a, Y(a)) | A⋆]] = E[ℓ(a, Y(a))].
We consider Lemma 18.3.10 conditional on the history Ht−1. Writing Et and Vart for the conditional expectation and variance given the history, the lemma shows that for any distribution ρ on A, the loss/information ratio satisfies Rt(ρ) ≤ 2σ² Vt(ρ) for the loss/variance ratio
\[
V_t(\rho) := \frac{\big(E_\rho\big[{\mathbb E}_t[\ell(A, Y_t(A))] - {\mathbb E}_t[\ell(A^\star, Y_t(A^\star))]\big]\big)^2}{\sum_{a\in\mathcal{A}} \rho(a)\,{\rm Var}_t\big({\mathbb E}_t[\ell(a, Y_t(a)) \mid A^\star]\big)}. \tag{18.3.11}
\]
The quantity Vt (ρ) is the ratio of a quadratic function in ρ over a linear function of ρ, and so is
convex in ρ. (See Exercise 18.1.) Thus, at least from a computational perspective, it is natural to
use variance-directed sampling, which chooses
\[
\rho_t^{\rm vds} := \mathop{\rm argmin}_{\rho\in\Delta(\mathcal{A})} V_t(\rho).
\]
Whenever we can guarantee the variance ratio inf_ρ Vt(ρ) is bounded for all t, then evidently we have an order √n regret bound.
In some cases, we can use this variance-directed sampling strategy to obtain regret bounds
without having to compute information directly, instead relying on variances. For simplicity, we
focus on the case when A is discrete, so that we can identify ρ with vectors ρ ∈ R^K or ρ ∈ R^N. Then, using the shorthands v = [Var(E[ℓ(a, Y(a)) | A⋆])]_{a∈A} and r = [E[ℓ(a, Y(a)) − ℓ(A⋆, Y(A⋆))]]_{a∈A} for the vectors of variances and instantaneous regrets, respectively, the variance ratio becomes
\[
V_t(\rho) = \frac{\langle\rho, r\rangle^2}{\langle\rho, v\rangle}.
\]
In the case of the K-armed bandit with sub-Gaussian arms, we can modify the argument that
Thompson sampling has bounded loss/information ratio (18.3.9) to obtain an identical bound for
the variance ratio:
Lemma 18.3.11. Assume A is finite. Then variance-directed sampling and Thompson sampling satisfy
\[
V_t(\rho_t^{\rm vds}) = \inf_\rho V_t(\rho) \le V_t(\rho_t^{\rm ts}) \le {\rm card}(\mathcal{A}). \tag{18.3.12}
\]
As an immediate consequence of the inequality (18.3.12), we have the following corollary, which
follows from the main Theorem 18.3.2.
Example 18.3.13 (Network routing): In the source routing problem, one wishes to send a
sequence of packets from a source to a given destination, and one chooses a path (sequence
of links) on which to send the packets among nodes i = 1, . . . , d. Representing a path via a
matrix a ∈ {0, 1}^{d×d}, where aij = 1 if the packet is sent on the link from node i to j and 0 otherwise, then the (noiseless) total transit cost is Σij aij θij, where θij indicates the delay on
edge (i, j). The actions A correspond to valid paths between nodes in the network. 3
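A tiny illustration of this encoding, with a hypothetical four-node network and illustrative delay values:

```python
# Hypothetical 4-node network with delays theta_ij on each directed link;
# a path is the action, and its transit cost is sum_ij a_ij * theta_ij.
theta = {(0, 1): 1.0, (1, 3): 2.5, (0, 2): 1.5, (2, 3): 1.0, (1, 2): 0.2}

def cost(path, theta):
    """Total transit cost of a path given as a sequence of nodes."""
    return sum(theta[(u, v)] for u, v in zip(path, path[1:]))

# The action set A: valid paths from the source (node 0) to the destination (node 3).
paths = [[0, 1, 3], [0, 2, 3], [0, 1, 2, 3]]
best = min(paths, key=lambda p: cost(p, theta))
print(best, cost(best, theta))  # the cheapest path and its cost
```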
Given the setting (18.3.13), we can extend Lemma 18.3.11 to apply to arbitrary action sets A.
Proposition 18.3.14. Assume the linear bandit model (18.3.13). Then variance-directed sampling and Thompson sampling satisfy
\[
V_t(\rho_t^{\rm vds}) \le V_t(\rho_t^{\rm ts}) \le d,
\]
where Vt denotes the loss/variance ratio (18.3.11). If additionally Y(a) = ⟨θ, a⟩ + ε is σ²-sub-Gaussian for each a ∈ A, then information-directed sampling, variance-directed sampling, and Thompson sampling satisfy the loss/information ratio bound
\[
R_t(\rho_t^{\rm ids}) \le 2\sigma^2 V_t(\rho_t^{\rm vds}) \le 2\sigma^2 V_t(\rho_t^{\rm ts}) \le 2\sigma^2 d.
\]
Proof We tacitly ignore the history Ht−1 as it is immaterial for the proof, and let V = Vt be the
loss/variance ratio. Let ρ = ρts , so that A⋆ ∼ ρ. Then V (ρvds ) ≤ V (ρ), and defining the matrix
M ∈ Rd×d by M = Cov(E[θ | A⋆ ]), when A ∼ ρ the variance becomes
As a corollary, we can obtain the “standard” regret bounds for linear bandits, assuming we have
a prior π over the parameters Θ ⊂ Rd . In this case, the loss
ℓ(a, y) = −y satisfies E[ℓ(a, Y (a)) | θ] = −⟨a, θ⟩,
and so minimizing the loss over a ∈ A corresponds to maximizing ⟨a, θ⟩. We then obtain the
following corollary.
Corollary 18.3.15. Let A be a countable or finite set of actions. Assume the linear bandit
model (18.3.13) and that for each a ∈ A, Y (a) = ⟨θ, a⟩ + ε is σ 2 -sub-Gaussian (18.3.7). Then
information-directed sampling, variance-directed sampling, and Thompson sampling each have the
regret bound
\[
{\rm Reg}_n(\mathcal{A}, \ell, \pi) \le \sqrt{2\sigma^2 d\, H(A^\star)}\cdot\sqrt{n}.
\]
Let us work through one example here, which shows how Thompson-style sampling procedures
can apply to linear bandit problems.
Example 18.3.16 (Thompson sampling for linear bandits with a Gaussian prior): Consider
the linear bandit setting (18.3.13), and let Θ = Rd be all of d-dimensional space; put a
Gaussian prior N(0, τ²I_d) on θ, where τ² < ∞ captures the prior variance, and assume the noise terms ε are i.i.d. N(0, σ²). At least at the first step t = 1, the observed loss satisfies
\[
-\ell(a, Y(a)) = Y(a) = \langle a, \theta\rangle + \varepsilon \sim {\sf N}\big(0, \|a\|_2^2\tau^2 + \sigma^2\big).
\]
Let us return to develop the posterior distribution (18.3.14) of θ. We prove the result
inductively. Let π(θ | y1t , at1 ) be shorthand for the density of θ given Yi (ai ) = yi and Ai = ai
for i ≤ t. We show that
\[
\pi(\theta \mid y_1^t, a_1^t) \propto \pi(\theta \mid y_1^{t-1}, a_1^{t-1}) \exp\Big(-\frac{1}{2\sigma^2}(y_t - \langle\theta, a_t\rangle)^2\Big). \tag{18.3.15}
\]
To see this, note that at time t, we have the graphical structure in Figure 18.4, so that Yt (At )
is conditionally independent of the history Ht−1 given At , θ. Thus we can write (with some
abuse of notation) that
\begin{align*}
\pi(\theta \mid y_1^t, a_1^t) &\propto p(\theta, y_t, a_t \mid y_1^{t-1}, a_1^{t-1}) = \pi(\theta \mid y_1^{t-1}, a_1^{t-1})\, p(y_t, a_t \mid \theta, y_1^{t-1}, a_1^{t-1}) \\
&= \pi(\theta \mid y_1^{t-1}, a_1^{t-1})\, p(y_t \mid a_t, \theta)\, p(a_t \mid y_1^{t-1}, a_1^{t-1}),
\end{align*}
Given this equality, the conditional distribution (18.3.14) follows by algebraic manipulations
(see Exercise 18.2). 3
The sampling distribution (18.3.14) is quite easy to use in Example 18.3.16: letting
\[
\hat{\theta}_t = \mathop{\rm argmin}_{\theta}\bigg\{\frac{1}{2\sigma^2}\sum_{i=1}^t (Y_i - \langle A_i, \theta\rangle)^2 + \frac{1}{2\tau^2}\|\theta\|_2^2\bigg\}
\]
be the minimizer of the regularized squared error, we draw θ_{t+1} ∼ N(θ̂_t, (1/t)Σ_t) for the covariance Σ_t in (18.3.14), and then choose
\[
A_{t+1} \in \mathop{\rm argmax}_{a\in\mathcal{A}}\, \langle a, \theta_{t+1}\rangle,
\]
which minimizes the linear loss given the sampled parameter θ_{t+1}.
For example, if A = {−1, 1}^d/√d consists of all {−1, 1}-valued vectors normalized to lie in the unit ball, then so long as τ², σ² = O(1), we have expected regret Regn ≤ O(1) d√n. Figure 18.5 shows example behavior with this sampling strategy, with the minor modification that we take A = {a ∈ R^d | ∥a∥₂ ≤ 1} to be the ℓ₂-ball. The figure shows that our analysis captures the typical behavior of the method: the majority of the time, the (average) regret scales no worse than d/√n.
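A sketch of this sampling strategy on the ℓ₂-ball action set, using the standard conjugate Gaussian update (posterior precision I/τ² + (1/σ²)Σᵢ AᵢAᵢ⊤, an assumption standing in for the form of (18.3.14); dimensions and variances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, tau2, sigma2 = 3, 2000, 1.0, 0.1
theta = rng.normal(0.0, np.sqrt(tau2), size=d)  # true parameter, drawn from the prior

# Conjugate Gaussian update: with prior N(0, tau^2 I) and noise N(0, sigma^2),
# the posterior has precision P = I/tau^2 + (1/sigma^2) sum_i A_i A_i^T and
# mean P^{-1} b, where b = (1/sigma^2) sum_i Y_i A_i.
P = np.eye(d) / tau2
b = np.zeros(d)
opt = np.linalg.norm(theta)  # best achievable reward <a, theta> over the l2 ball
regret = 0.0
for t in range(n):
    theta_s = rng.multivariate_normal(np.linalg.solve(P, b), np.linalg.inv(P))
    a = theta_s / (np.linalg.norm(theta_s) + 1e-12)  # argmax of <a, theta_s> on the ball
    y = a @ theta + rng.normal(0.0, np.sqrt(sigma2))
    P += np.outer(a, a) / sigma2
    b += y * a / sigma2
    regret += opt - a @ theta
print(regret)
```

The cumulative regret stays far below the trivial worst case, consistent with the O(d√n) guarantee of Corollary 18.3.15.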
Figure 18.5. The regret behavior of Thompson sampling for the linear bandit problem (18.3.13).
Each plot shows the results of 500 experiments run for 1000 iterations using the Gaussian prior of
Example 18.3.16, where the action set A = {a ∈ Rd | ∥a∥2 ≤ 1} is the ℓ2 -ball of radius 1. The dark
black line (Inst. regret) in each plot shows the instantaneous regret ⟨θ, A⋆ − At ⟩ at each iteration t,
averaged over the 500 experiments. The blue line (Mean regret) shows the "posterior mean" regret: taking θt = E[θ | Ht] and the "mean" action at = θt/∥θt∥₂. The dotted black line shows d/√n, while
the shaded orange region gives the 10–90% quantiles of the instantaneous regret across experiments.
This is certainly possible. In this scenario, we would like to formulate the bandit problem as one
of minimizing a (partially) observed sequence of convex losses, where to leverage the approaches
in Chapter 17, we must be able to construct unbiased gradient estimators to have any hope of
achieving small regret. Let us assume as usual that we have K arms, with an (unknown) mean
vector µ ∈ RK . Then at each step t of the procedure, we play a distribution wt ∈ ∆K on the arms,
and then we select an arm a at random with probability wt,a . The expected loss we suffer is then
ℓt (wt ) = ⟨wt , µ⟩, though we observe only a random realization of the loss for the arm a that we
play.
Because of its natural connections with estimation of probability distributions, we will use the
exponentiated gradient algorithm, Example 17.2.5, to play this game. We face one main difficulty:
we must estimate the gradient of the losses, ∇ℓt (wt ) = µ, even though we only observe a random
variable Yt (a) ∈ R+ , conditional on selecting action At = a at time t, with the property that
E[Yt (a) | Ht−1 ] = µa . Happily, we can construct such an estimate without too much additional
variance.
Lemma 18.4.1. Let Y ∈ R^K_+, let w ∈ ∆_K have w_j > 0 for each j, and draw A ∼ w. Define Ỹ ∈ R^K by Ỹ(j) := 1{A = j} Y(j)/w_j. Then E[Ỹ | Y] = Y.
Proof  The proof is immediate: for each coordinate j of Ỹ, we have E[Ỹ(j) | Y] = w_j · Y(j)/w_j = Y(j).
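A numerical check of this unbiasedness (the helper name and test values are illustrative):

```python
import random

def importance_weighted(y, w, rng=random):
    """Draw A ~ w, observe only Y(A), and return the estimate Y_tilde with
    Y_tilde(j) = 1{A = j} Y(j) / w_j, which is unbiased for Y."""
    u, acc, a = rng.random(), 0.0, len(w) - 1
    for j, wj in enumerate(w):
        acc += wj
        if u < acc:
            a = j
            break
    return [y[a] / w[a] if j == a else 0.0 for j in range(len(w))]

random.seed(6)
y, w = [0.2, 0.9, 0.5], [0.5, 0.3, 0.2]
m = 200000
avg = [0.0] * len(y)
for _ in range(m):
    est = importance_weighted(y, w)
    for j in range(len(y)):
        avg[j] += est[j] / m
print(avg)  # each coordinate is close to the corresponding y[j]
```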
Lemma 18.4.1 suggests the following procedure, which gives rise to (a variant of) Auer et al.'s EXP3 (Exponentiated gradient for Exploration and Exploitation) algorithm [13], whose central update is
\[
w_{t+1,i} = \frac{w_{t,i}\exp(-\eta g_{t,i})}{\sum_j w_{t,j}\exp(-\eta g_{t,j})}.
\]
We can prove the following bound on the expected regret of the EXP3 Algorithm 18.6 by leveraging our refined analysis of exponentiated gradients in Proposition 17.5.1.
Proposition 18.4.2. Assume that for each j, we have E[Y(j)²] ≤ σ² and the observed loss Y(j) ≥ 0. Then Alg. 18.6 attains expected regret
\[
{\rm Reg}_n = \sum_{t=1}^n E[\mu_{A_t} - \mu_{a^\star}] \le \frac{\log K}{\eta} + \frac{\eta}{2}\sigma^2 K n.
\]
In particular, choosing η = √(log K/(Kσ²n)) gives
\[
{\rm Reg}_n = \sum_{t=1}^n E[\mu_{A_t} - \mu_{a^\star}] \le \frac{3}{2}\sigma\sqrt{Kn\log K}.
\]
Proof  With Lemma 18.4.1 in place, we recall the refined regret bound of Proposition 17.5.1: for w⋆ ∈ ∆_K and any sequence of vectors g₁, g₂, . . . with g_t ∈ R^K_+, exponentiated gradient descent achieves
\[
\sum_{t=1}^n \langle g_t, w_t - w^\star\rangle \le \frac{\log K}{\eta} + \frac{\eta}{2}\sum_{t=1}^n\sum_{j=1}^K w_{t,j}\, g_{t,j}^2.
\]
The importance-weighted gradient estimates gt of Lemma 18.4.1 satisfy E[gt | wt] = E[Y] = µ.
This careful normalizing, allowed by Proposition 17.5.1, is essential to our analysis (and fails for
more naive applications of online convex optimization bounds). In particular, we have
\[
\mathrm{Reg}_n = \sum_{t=1}^n \mathbb{E}[\langle \mu, w_t - w^\star \rangle] = \sum_{t=1}^n \mathbb{E}[\langle g_t, w_t - w^\star \rangle] \le \frac{\log K}{\eta} + \frac{\eta}{2} n \mathbb{E}[\|Y\|_2^2].
\]
Proposition 18.4.2 provides a regret bound in the multi-armed bandit problem that applies
as soon as the (random) losses Yt (a) are nonnegative and have finite second moment. In this
sense, it is more general than the bounds for the other methods we have developed, which we have
analyzed using sub-Gaussianity. When the losses Y (a) are bounded or sub-Gaussian, it is possible
to achieve high-probability guarantees on the regret, though this is beyond our scope. When the
random observed losses \(Y_t(a)\) are bounded in [0, 1], the sub-Gaussian constant is \(\sigma^2 = \frac{1}{4}\), yielding the mean regret bound \(\frac{3}{4}\sqrt{Kn \log K}\), which is as sharp (to within constant factors) as any of our other bounds.
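To make the procedure concrete, here is a minimal simulation sketch of the EXP3 update on a Bernoulli bandit with losses in [0, 1] (so \(\sigma^2 = 1/4\)). The arm means and horizon are illustrative choices, and for simplicity the sketch omits the ϵ-exploration mixing of (18.4.1):

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 10, 2000
mu = np.linspace(0.2, 0.8, K)              # Bernoulli loss means (illustrative)
eta = np.sqrt(np.log(K) / (K * 0.25 * n))  # eta = sqrt(log K / (K sigma^2 n))

w = np.ones(K) / K
pseudo_regret = 0.0
for t in range(n):
    a = rng.choice(K, p=w)                 # play arm a ~ w_t
    y = rng.binomial(1, mu[a])             # observe only this arm's loss
    g = np.zeros(K)
    g[a] = y / w[a]                        # importance-weighted gradient (Lemma 18.4.1)
    w = w * np.exp(-eta * g)               # exponentiated-gradient update
    w /= w.sum()
    pseudo_regret += mu[a] - mu.min()

assert abs(w.sum() - 1) < 1e-9
assert 0.0 <= pseudo_regret <= n * (mu.max() - mu.min())
```

With these settings Proposition 18.4.2 predicts expected regret at most \(\frac{3}{4}\sqrt{Kn \log K} \approx 161\); a single run is random, but should typically land in that vicinity.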
for each a ∈ {1, . . . , K}. Then we draw At ∼ ρt in iteration t. The exploration strategy (18.4.1)
encourages just enough more exploration without incurring substantial additional regret; see the
references for pointers to its analysis.
In Figure 18.7, we plot results that show typical behavior of the three main algorithms we
consider in this chapter for K-armed Bernoulli bandit problems: Upper Confidence Bound (UCB)-
type algorithms (Section 18.2), Thompson/posterior sampling algorithms (Section 18.3.2), and
the ϵ-exploration variant (18.4.1) of exponentiated gradient (EXP3) algorithms (Section 18.4). Each figure plots a summary of the instantaneous regrets \(\theta_{A_t} - \theta_{A^\star}\) over t = 1, . . . , 1000 iterations, collating the results of 4000 experiments. In each individual experiment, we draw \(\theta \sim \mathsf{Uniform}([0,1]^K)\) at the outset, then at each iteration t, conditional on the action \(A_t = a \in \{1, \ldots, K\}\), return \(Y_t(a) \sim \mathsf{Bernoulli}(\theta_a)\). The figure highlights a few things. First, the regrets appear to decrease polynomially in t, consistent with the analysis we have developed. Second, Thompson sampling outperforms the other algorithms; it is a reasonable default procedure when it is implementable. Finally, the bottom plot in Fig. 18.7 shows that eventually, Thompson sampling appears to find the best arm, incurring 0 regret in most samples.
[Figure 18.7 appears here: two semilog plots of instantaneous regret against the iteration n for the EXP3, UCB, and Thompson sampling algorithms.]

Figure 18.7. An empirical comparison of algorithms for the K = 10-armed Bernoulli bandit problem. Each figure represents the results of 4000 experiments. Top: mean instantaneous regret \(\theta_{A_t} - \theta_{A^\star}\). Bottom: 90th percentile of instantaneous regret \(\theta_{A_t} - \theta_{A^\star}\) across experiments. EXP3 uses the exploration strategy (18.4.1) with \(\eta = \sqrt{\log K}/\sqrt{KT}\). UCB uses confidence \(\delta = 10^{-5}\). Thompson sampling uses independent Beta(1, 1) priors for each coordinate of \(\theta \in [0, 1]^K\).
played throughout the procedure. Let An+1 ∼ ρ, and let θ parameterize Pθ . Then
\[
n \mathbb{E}_\theta[\ell(A_{n+1}, Y(A_{n+1})) - \ell(A^\star, Y(A^\star))] = \sum_{t=1}^n \mathbb{E}_\theta[\ell(A_t, Y(A_t)) - \ell(A^\star, Y(A^\star))] = \mathrm{Reg}_n(\ell, \mathcal{A}, \theta),
\]
corresponding to E[ℓ(a, Y (a)) − ℓ(A⋆ , Y (A⋆ ))]. We say that a distribution q is an exploration
strategy if for each t = 1, 2, . . ., it defines a distribution for the tth action At ∼ q(· | Ht−1 ), where
Ht = {Ai , Yi (Ai )}i≤t denotes the history as usual.
Given the gap (18.5.1) and definition of an exploration strategy, the minimax risk of a decision estimation problem is then
\[
M_n(\mathcal{P}) := \inf_{\rho, q} \sup_{P \in \mathcal{P}} \mathbb{E}_{P, \rho, q}[\mathrm{gap}_P(A_{n+1})], \tag{18.5.2}
\]
where the infimum is taken over all exploration strategies q and distributions ρ on actions \(A_{n+1}\), where ρ is a measurable function of \(H_n = (A_1^n, Y_1^n)\). In the literature on bandit problems, this
quantity is the minimax regret for best arm identification, which cares only about identifying the
optimal action a ∈ A. We can define the minimax regret similarly, letting
\[
R_n(\mathcal{P}) := \inf_q \sup_{P \in \mathcal{P}} \sum_{t=1}^n \mathbb{E}_{P, q}[\mathrm{gap}_P(A_t)],
\]
where we take the infimum over strategies q. The discussion above shows that
Rn (P) ≥ n · Mn (P),
This measures the separation between models in that any action a ∈ A is sub-optimal for at least
one of the distributions \(P_1\) or \(P_2\). For any model \(\bar{P}\) and probability distribution q on \(\mathcal{A}\), we can define the ϵ-neighborhood of \(\bar{P}\) indexed by \(A \sim q\) by
\[
\mathcal{P}_{q,\epsilon}(\bar{P}) := \big\{ P \in \mathcal{P} \mid \mathbb{E}_q\big[ D_{\mathrm{kl}}\big( \bar{P}(A) \| P(A) \big) \big] \le \epsilon^2 \big\}.
\]
(To keep the notation clear, in the case that \(\mathcal{A}\) is countable, we simply mean \(\mathbb{E}_q[D_{\mathrm{kl}}(\bar{P}(A) \| P(A))] = \sum_a D_{\mathrm{kl}}(\bar{P}(a) \| P(a)) q(a)\) to be the average KL-divergence between \(\bar{P}(a)\) and \(P(a)\).)
We can now define the analogue of the modulus of continuity of a parameter with respect to the Hellinger distance (13.1.1), which we shall term the modulus of action separation, by
\[
\omega_{\mathcal{A}}(\epsilon \mid \bar{P}, \mathcal{P}) := \inf_q \delta_{\mathcal{A}}\big( \mathcal{P}_{q,\epsilon}(\bar{P}) \big). \tag{18.5.3}
\]
q
Intuitively, we might expect this quantity to control lower bounds: when ϵ is small, the definition of the KL-neighborhood \(\mathcal{P}_{q,\epsilon}\) means that playing actions \(A \sim q\) does not allow us to distinguish one distribution \(P_1\) from another \(P_2\) in \(\mathcal{P}_{q,\epsilon}\), but any action a has some non-trivial loss in at least one of \(P_1\) and \(P_2\).
Continuing our analogy with Chapter 13, we have the following result, which shows that the action modulus (18.5.3) at radius \(1/\sqrt{n}\) lower bounds the minimax risk:

Theorem 18.5.1. For any model \(\bar{P}\), not necessarily in \(\mathcal{P}\), the minimax risk satisfies
\[
M_n(\mathcal{P}) \ge \frac{1}{4} \omega_{\mathcal{A}}\Big( \frac{1}{\sqrt{8n}} \;\Big|\; \bar{P}, \mathcal{P} \Big).
\]
We defer the proof of Theorem 18.5.1 to Section 18.5.3, instead focusing on some applications of
the result here.
The key in Theorem 18.5.1 is that a geometric-like quantity—the modulus (18.5.3)—guarantees lower bounds on the minimax risk for decision estimation. As a consequence, if we can provide lower bounds on the modulus, we immediately obtain lower bounds on the minimax risk. The next two examples exhibit recipes for this strategy:
This minimax rate is not quite sharp: the correct rate is of order \(d\sigma/\sqrt{n}\), which motivates a different lower bounding technique we explore in the next section; the application to standard multi-armed bandits, however, highlights the utility of the approach.
The interactions between the action set A and the parameter space Θ are sophisticated, so we focus
on two simple but natural cases: we always take Θ = Bd2 , the ℓ2 -ball, and consider either the actions
A = [−1, 1]d or A = Bd2 . We begin by stating the main theorem of this section and a few corollaries.
In the theorem, we elaborate the notation (18.5.2) for the minimax risk in a decision estimation
problem by also including the action set \(\mathcal{A}\), so \(M_n(\mathcal{P}, \mathcal{A}) = \inf_{\rho, q} \sup_{P \in \mathcal{P}} \mathbb{E}_{P, \rho, q}[\mathrm{gap}_P(A_{n+1})]\), where ρ and q draw actions in \(\mathcal{A}\).
Theorem 18.5.4. Let P be the collection of linear bandit models y(x) = ⟨θ, x⟩ + ε, ∥θ∥2 ≤ 1, with
ε ∼ N(0, σ 2 )-noise. Then the following minimax lower bounds hold.
(i) For any action set satisfying \(\{-1, 1\}^d \subset \mathcal{A} \subset [-1, 1]^d\),
\[
M_n(\mathcal{P}, \mathcal{A}) \ge \frac{1}{4} \cdot \Big( \frac{d\sigma}{\sqrt{n}} \wedge \sqrt{d} \Big).
\]

(ii) For any action set satisfying \(\{\pm 1/\sqrt{d}\}^d \subset \mathcal{A} \subset \mathbb{B}_2^d\),
\[
M_n(\mathcal{P}, \mathcal{A}) \ge \frac{1}{8} \cdot \Big( \frac{d\sigma}{\sqrt{n}} \wedge 1 \Big).
\]
Before proving the theorem, we can develop a few simple corollaries to the result and proof
technique by considering Bayesian regret (18.3.2).
Corollary 18.5.5. Let the conditions of Theorem 18.5.4 hold and assume that \(n \ge d\). Let \(\mathcal{A}\) satisfy either condition (i) or (ii) of the theorem and let \(\ell_{\mathrm{id}}(a, y) = y\) be the identity loss. Then there is a \(\delta \in [0, 1/\sqrt{d}]\) such that for \(\pi_{\mathrm{Uniform}}\) uniform over \(\theta \in \{-\delta, \delta\}^d\),
\[
\mathrm{Reg}_n(\ell_{\mathrm{id}}, \mathcal{A}, \pi_{\mathrm{Uniform}}) \ge \frac{1}{8} d\sigma \sqrt{n}.
\]
Of course, the worst-case regret also satisfies the lower bound in the corollary, so that
\[
\sup_{\theta \in \Theta} \mathrm{Reg}_n(\ell_{\mathrm{id}}, \mathcal{A}, \theta) \ge \frac{1}{8} d\sigma \sqrt{n}
\]
under the same conditions. The corollary shows that our analyses of information-directed and Thompson sampling were sharp, at least to within numerical constants; Corollary 18.3.15 shows that if \(\mathcal{A} = \{\pm 1/\sqrt{d}\}^d\) or \(\mathcal{A} = \{-1, 1\}^d\), then because \(H(A^\star) \le d \log 2\), we have
\[
\mathrm{Reg}_n(\mathcal{A}, \ell_{\mathrm{id}}, \pi) \le \sqrt{2 \log 2} \cdot d\sigma \sqrt{n}.
\]
Proof of Theorem 18.5.4. We now return to prove Theorem 18.5.4 by following the recipe of Assouad's method. Beginning with the embedding, for \(v \in \mathcal{V} = \{-1, 1\}^d\), let \(\theta_v = \frac{\delta}{\sqrt{d}} v\), where \(\delta \le 1\) is to be chosen in each lower bound construction. The next lemma then demonstrates a Hamming separation (9.5.1) for each action set.
Lemma 18.5.6. Let \(\theta \in \{\theta_v\}_{v \in \mathcal{V}} = \{\pm \delta/\sqrt{d}\}^d\) as above, and let \(a^\star(\theta) = \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} \langle a, \theta \rangle\). Then

(i) For any action set satisfying \(\{-1, 1\}^d \subset \mathcal{A} \subset [-1, 1]^d\),
\[
\langle \theta, a^\star(\theta) - a \rangle \ge \frac{\delta}{\sqrt{d}} \sum_{j=1}^d 1\{\mathrm{sign}(a_j) \ne \mathrm{sign}(\theta_j)\}.
\]

(ii) For any action set satisfying \(\{\pm 1/\sqrt{d}\}^d \subset \mathcal{A} \subset \mathbb{B}_2^d\),
\[
\langle \theta, a^\star(\theta) - a \rangle \ge \frac{\delta}{2d} \sum_{j=1}^d 1\{\mathrm{sign}(a_j) \ne \mathrm{sign}(\theta_j)\}.
\]
Proof   For the first claim of the lemma, we have \(a^\star(\theta) = \mathrm{sign}(\theta)\) (defined elementwise), and so
\[
\langle \theta, a^\star(\theta) - a \rangle = \sum_{j=1}^d (|\theta_j| - a_j \theta_j) \ge \sum_{j=1}^d |\theta_j| 1\{\mathrm{sign}(a_j) \ne \mathrm{sign}(\theta_j)\},
\]
where we used that \(a_j \in [-1, 1]\). Noting that \(|\theta_j| = \delta/\sqrt{d}\) gives the first result.
For the second result, note that whenever \(\|a\|_2 \le 1\) and \(v \in \{-1, 1\}^d\), we have
\[
\sqrt{d} \sum_{j=1}^d \Big( \frac{1}{\sqrt{d}} - a_j v_j \Big)^2 = \sqrt{d} \big( 1 - 2\langle a, v \rangle/\sqrt{d} + \|a\|_2^2 \big) \le 2\sqrt{d} \big( 1 - \langle a, v \rangle/\sqrt{d} \big).
\]
Scaling terms appropriately, for \(v = \sqrt{d}\,\theta/\delta\) we obtain
\[
\frac{2}{\delta} \langle \theta, a^\star(\theta) - a \rangle = \frac{2}{\sqrt{d}} \langle v, a^\star(\theta) - a \rangle = 2\big( 1 - \langle v, a \rangle/\sqrt{d} \big) \ge \sum_{j=1}^d \Big( \frac{1}{\sqrt{d}} - a_j v_j \Big)^2 \ge \frac{1}{d} \sum_{j=1}^d 1\{\mathrm{sign}(a_j) \ne v_j\}.
\]
Multiplying by \(\frac{\delta}{2}\) gives the lemma.
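Both inequalities of Lemma 18.5.6 are easy to spot-check numerically; in the sketch below the dimension, δ, and the sampled actions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
d, delta = 8, 0.7
v = rng.choice([-1.0, 1.0], size=d)
theta = delta / np.sqrt(d) * v              # theta_v = (delta / sqrt(d)) v

def mismatches(a):
    return np.sum(np.sign(a) != np.sign(theta))

# (i) hypercube-like actions: a*(theta) = sign(theta)
for _ in range(200):
    a = rng.uniform(-1, 1, size=d)
    lhs = theta @ (np.sign(theta) - a)
    assert lhs >= delta / np.sqrt(d) * mismatches(a) - 1e-12

# (ii) l2-ball actions: a*(theta) = theta / ||theta||_2 = v / sqrt(d)
for _ in range(200):
    a = rng.standard_normal(d)
    a /= max(1.0, np.linalg.norm(a))        # enforce ||a||_2 <= 1
    lhs = theta @ (v / np.sqrt(d) - a)
    assert lhs >= delta / (2 * d) * mismatches(a) - 1e-12
```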
As the last step, we must control the variation distance terms in the lower bound (9.5.4). Let \(P_{v,+j}\) and \(P_{v,-j}\) denote the distribution of the observations \(Y_1, \ldots, Y_n\) under the model \(\theta = \delta v/\sqrt{d}\) where \(v_j\) is set to (respectively) 1 or −1. Then we have the following lemma:

Lemma 18.5.7. Let \(\mathcal{A} \subset \mathbb{R}^d\) be any action set satisfying \(\|a\|_2 \le D < \infty\) for all \(a \in \mathcal{A}\). Then for the Gaussian linear bandit model,
\[
\frac{1}{d 2^d} \sum_{j=1}^d \sum_{v \in \mathcal{V}} \|P_{v,+j} - P_{v,-j}\|_{\mathrm{TV}}^2 \le \frac{n \delta^2}{4 \sigma^2 d^2} D^2.
\]
Proof   Fix \(v \in \{-1, 1\}^d\) temporarily, and let \(P = P_{v,+j}\) and \(Q = P_{v,-j}\). Use the shorthand that \(P_i\) is the distribution of the ith observation \(Y_i\) conditional on the history \(H_{i-1}\), and similarly for \(Q_i\). Let \(\theta_+ \in \mathbb{R}^d\) and \(\theta_- \in \mathbb{R}^d\) be the parameters associated with \(P_{v,+j}\) and \(P_{v,-j}\), respectively. Then applying the chain rule for the KL-divergence (Lemma 2.1.9), we obtain
\[
D_{\mathrm{kl}}(P_{v,+j} \| P_{v,-j}) = \sum_{i=1}^n \mathbb{E}_P[D_{\mathrm{kl}}(P_i \| Q_i)] = \sum_{i=1}^n \mathbb{E}_P\big[ D_{\mathrm{kl}}\big( \mathsf{N}(\langle A_i, \theta_+ \rangle, \sigma^2) \,\|\, \mathsf{N}(\langle A_i, \theta_- \rangle, \sigma^2) \big) \mid A_i \big]
= \sum_{i=1}^n \mathbb{E}_P\bigg[ \frac{\langle A_i, \theta_+ - \theta_- \rangle^2}{2\sigma^2} \bigg] = \frac{\delta^2}{2d\sigma^2} \sum_{i=1}^n \mathbb{E}[A_{ij}^2].
\]
Combining Lemmas 18.5.6 and 18.5.7, we can complete the proof of the theorem by considering the cases (i) and (ii) distinguishing the action sets \(\mathcal{A}\). We first consider case (i), the hypercube-like action set, where \(\{-1, 1\}^d \subset \mathcal{A} \subset [-1, 1]^d\). Then part (i) of Lemma 18.5.6, Assouad's method (9.5.4), and Lemma 18.5.7 (take \(D = \sqrt{d}\)) show that
\[
M_n(\mathcal{P}) \ge \frac{\sqrt{d}\,\delta}{2} \bigg[ 1 - \Big( \frac{n \delta^2}{4 \sigma^2 d} \Big)^{1/2} \bigg].
\]
Setting \(\delta = \sqrt{d \sigma^2 / n} \wedge 1\) implies the first claim of the theorem.
For case (ii), the \(\ell_2\)-type action set satisfying \(\{\pm 1/\sqrt{d}\}^d \subset \mathcal{A} \subset \mathbb{B}_2^d\), the same argument (except using part (ii) of Lemma 18.5.6 and setting the action radius D = 1) implies
\[
M_n(\mathcal{P}) \ge \frac{\delta}{4} \bigg[ 1 - \Big( \frac{n \delta^2}{4 \sigma^2 d^2} \Big)^{1/2} \bigg].
\]
Set \(\delta = \frac{d\sigma}{\sqrt{n}} \wedge 1\).
Now we use the definition of the action separation: for any distribution q on actions \(\mathcal{A}\), we have \(\inf_{q^\star} \delta_{\mathcal{A}}(\mathcal{P}_{q^\star,\epsilon}(\bar{P})) \le \delta_{\mathcal{A}}(\mathcal{P}_{q,\epsilon}(\bar{P}))\), and in particular, this holds for \(q = q_P\). So taking \(\delta < \delta_{\mathcal{A}}(\mathcal{P}_{q_P,\epsilon}(\bar{P}))\), there must exist \(P_1\) and \(P_2\) so that for all \(a \in \mathcal{A}\) we have at least one of \(\mathrm{gap}_{P_1}(a) \ge \delta\) or \(\mathrm{gap}_{P_2}(a) \ge \delta\) and \(\mathbb{E}_{q_P}[D_{\mathrm{kl}}(\bar{P}(a) \| P_i(a))] \le \epsilon^2\) for each i = 1, 2. In particular,
\[
\sup_{P \in \mathcal{P}} \mathbb{E}_{P,\rho,q}[\mathrm{gap}_P(A_{n+1})] \ge \frac{\delta}{2} \big( 1 - \sqrt{2 n \epsilon^2} \big) \ge \frac{\delta}{4} = \frac{1}{4} \omega_{\mathcal{A}}\big(1/\sqrt{8n} \mid \bar{P}, \mathcal{P}\big)
\]
as desired.
because \(N_t(i) = \sum_{\tau=1}^t 1\{A_\tau = i\}\) and \(A_\tau \in H_{\tau-1}\), that is, it is a function of the history. Then we observe that conditional on \(H_{\tau-1}\), \(Y_\tau(i)\) and \(Y_\tau'(i)\) have identical distributions, so that, by again using the tower property of expectations, the two expectations agree. This gives the equality (18.2.1).
algorithms and provided finite sample bounds on their regret, while Auer et al. [12] introduced
EXP3 and its analysis. Interest in Thompson sampling reignited when Chapelle and Li [51] showed
that, in spite of its (at the time) heuristic nature, it was competitive with established procedures,
and Kaufmann et al. [124] provided the first analysis of the procedure. There are many connections
between the particular cases we consider here and the broader field of reinforcement learning [175];
information-based perspectives remain an active area of research.
Our approach in Section 18.3 to Bayesian bandits follows the approach that Russo and Van Roy [164, 165, 166] pioneered. These analyses require a certain well-specification of the procedures,
so that the prior and posterior distributions on the parameter are accurate, as otherwise the regret
bounds fail to hold; extending Thompson sampling to apply even when the prior is unknown
requires additional techniques (e.g. [3] or [132, Chapter 36]). That most analyses of bandit problems
repose on some type of well-specification assumption—e.g., for linear bandits, that rewards Y (a) =
⟨θ, a⟩+ε—poses a problem for saying anything rigorous about their behavior in real-world scenarios.
Researchers have begun to investigate this issue, showing that in the case of mis-specified models,
finding the best approximation in the class can take time exponential in the dimension d of the
approximator [69], while slightly weaker approximation guarantees admit efficient algorithms [133,
93]. Many questions in this direction remain open.
The approach we take in Section 18.5 derives from a variety of sources. At a high level, Sec-
tion 18.5.1 is a distilled and slightly weaker version of Foster et al.’s decision estimation coeffi-
cient [91, 92]. We, however, take a perspective a little closer to the modulus of continuity of a
statistic with respect to Hellinger distance, as in Chapter 13. The idea of using the empirical
distribution of plays At to lower bound the regret as n-times the minimax risk (18.5.2) we learned
from Bubeck and Cesa-Bianchi [42, Ch. 3.3], whose minimax approach also bears similarities to the one we use in proving Theorem 18.5.1. The results using Assouad's method to lower bound the
expected regret of action An+1 —the pure exploration case—appear to be new, though they take
inspiration from Lattimore and Szepesvári [132, Chapter 24], who focus on lower bounds on the
regret for both ℓ∞ and ℓ2 -style action sets; the technique with Assouad’s method also yields sharper
constants.
JCD Comment: Also say something about causal inference!
has \(\sqrt{n}(\hat{\tau}_{\mathrm{naive}} - \tau) \stackrel{d}{\rightsquigarrow} \mathsf{N}\big(0, 4(\mathrm{Var}(Y(1)) + \mathrm{Var}(Y(0)))\big)\)
18.8 Exercises
Exercise 18.1 (Convexity of quadratic-over-linear functions):
(a) For \(x, y \in \mathbb{R}\), define \(h(x, y) = \frac{x^2}{y}\). Show that h is convex over the domain y > 0.

(b) Let \(x \in \mathbb{R}^n\) and \(y \in \mathbb{R}\). Show that \(h(x, y) = \|x\|^2 / y\) is convex over the domain y > 0 for any norm \(\|\cdot\|\).
(c) Let X and Y be vector spaces, and let f : X → Rn and g : Y → R be linear functions. Show
that h(x, y) = ∥f (x)∥2 /g(y) is convex over the domain {y ∈ Y | g(y) > 0}.
(d) Assume f and g above are continuous linear functions, and that f (x) ̸= 0 for all x. Show that
h above is a closed convex function.
Show that \(\theta \sim \mathsf{N}(\hat{\theta}, \frac{1}{n} \Sigma^{-1})\) for \(\Sigma = \frac{1}{n\tau^2} I_d + \frac{1}{n\sigma^2} \sum_{i=1}^n x_i x_i^\top\) and
\[
\hat{\theta} = \frac{1}{\sigma^2 n} \Sigma^{-1} \sum_{i=1}^n x_i y_i = \mathop{\mathrm{argmin}}_\theta \bigg\{ \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \langle x_i, \theta \rangle)^2 + \frac{1}{2\tau^2} \|\theta\|_2^2 \bigg\}.
\]
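Assuming the usual Gaussian setup for this exercise (observations \(y_i = \langle x_i, \theta \rangle + \varepsilon_i\) with \(\varepsilon_i \sim \mathsf{N}(0, \sigma^2)\) and prior \(\theta \sim \mathsf{N}(0, \tau^2 I_d)\)), the posterior mean is the ridge estimator, and one can check numerically that it is the stationary point of the displayed objective; the data below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 4
sigma, tau = 0.5, 2.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + sigma * rng.standard_normal(n)

# Posterior mean / ridge estimator: (X^T X / sigma^2 + I / tau^2)^{-1} X^T y / sigma^2
A = X.T @ X / sigma**2 + np.eye(d) / tau**2
theta_hat = np.linalg.solve(A, X.T @ y / sigma**2)

# Gradient of (1/(2 sigma^2)) sum_i (y_i - <x_i, theta>)^2 + (1/(2 tau^2)) ||theta||_2^2
grad = -X.T @ (y - X @ theta_hat) / sigma**2 + theta_hat / tau**2
assert np.allclose(grad, 0, atol=1e-8)
```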
Exercise 18.3: We describe an approach using stochastic gradient methods to solve pure-exploration linear bandit problems. Assume that \(\theta \in \mathbb{B}_2^d\) and let the action set \(\mathcal{A} = \mathbb{B}_2^d\), and assume that \(y(x) = \langle \theta, x \rangle + \varepsilon\) for ε an independent mean-zero variable with \(\mathbb{E}[\varepsilon^2] \le \sigma^2\). We wish to find a (potentially random) action \(A \in \mathcal{A}\) so that \(\mathbb{E}[\langle A, \theta \rangle] \le -\|\theta\|_2 + o(1)\), that is, minimizing \(\langle a, \theta \rangle\). Consider the following procedure: at iteration t, draw \(w_t \sim \mathsf{Uniform}(\mathbb{S}^{d-1})\), observe \(y_t = \langle \theta, w_t \rangle + \varepsilon_t\), and then set the estimated subgradient \(g_t = d \cdot w_t y_t\).
(a) Show that E[gt | Ht−1 ] = θ and E[∥gt ∥22 | Ht−1 ] ≤ d ∥θ∥22 + d2 σ 2 .
(b) Let η > 0 be a stepsize (which you will determine). Define the iterates
\[
\theta_t = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{B}_2^d} \Big\{ \langle g_t, \theta \rangle + \frac{1}{2\eta} \|\theta - \theta_{t-1}\|_2^2 \Big\}
\]
and \(\bar{\theta}_n = \frac{1}{n} \sum_{t=1}^n \theta_t\). Give a setting of the stepsize η so that
\[
\mathbb{E}[\langle \theta, \bar{\theta}_n \rangle] \le -\|\theta\|_2 + O(1) \sqrt{\frac{d^2 \sigma^2 + d}{n}}.
\]
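The unbiasedness in part (a) rests on the identity \(\mathbb{E}[w w^\top] = I/d\) for w uniform on \(\mathbb{S}^{d-1}\), so that \(\mathbb{E}[d \cdot w \langle \theta, w \rangle] = \theta\). A Monte Carlo sketch of this fact (noiseless observations, illustrative dimension):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)              # a unit-norm parameter in B_2^d

# Draw w uniform on the sphere via normalized Gaussians, observe y = <theta, w>,
# and form g = d * w * y; then E[g] = d E[w w^T] theta = theta.
m = 400_000
W = rng.standard_normal((m, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
g = d * W * (W @ theta)[:, None]
assert np.linalg.norm(g.mean(axis=0) - theta) < 0.05
```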
n
Exercise 18.4 (Exploration in linear bandits without noise): Consider a variant of the linear
bandit problem with convex action set A ⊂ Rd where at time t, given action x, we observe y(x) =
⟨θt , x⟩ for θt chosen i.i.d. but satisfying E[θt ] = θ. Assume the action set A ⊃ α{−1, 1}d for some
α > 0 and that A is convex.
(a) Show that the vector gt = α−2 ⟨θt , A⟩A for A ∼ Uniform(α{−1, 1}d ) satisfies
(b) Assume that the noise in \(\theta_t\) is such that \(\mathbb{E}[\|\theta_t\|_2^2] \le \kappa^2\). Using the result of Exercise 18.3, show how to construct an action \(\hat{a}_n\) based on n observations such that
\[
\mathbb{E}[\langle \theta, \hat{a}_n \rangle] \le \inf_{a \in \mathcal{A}} \langle \theta, a \rangle + O(1) \kappa \operatorname{diam}(\mathcal{A}) \sqrt{\frac{d}{n}}.
\]

(c) Compare the result of part (b) to achievable convergence guarantees in the full information case, where we may observe \(\theta_t\), and the upper and lower bounds for cases in which \(y(x) = \langle \theta, x \rangle + \varepsilon\).
(c) Compare the result of part (b) to achievable convergence guarantees in the full information case,
where we may observe θt , and the upper and lower bounds for cases in which y(x) = ⟨θ, x⟩ + ε.
Exercise 18.5: Show that information-directed sampling without noise and sparse θ is really easy (better than UCB, say).
JCD Comment: Put in a few further examples, or leave as exercises?
i. Linear bandits
iii. We can pretty easily show that in the two-armed bandit case, a bound of 1/∆2 (where
∆ is the gap in arms) is how many pulls are needed for each arm. Can also show that
probability of error at time n + 1 is at least exp(−n∆2 ) or so (Bretagnolle-Huber).
JCD Comment: We can do one on heavy-tailed (or at least just lighter MGFs)-based
UCB algorithms.
JCD Comment: One on causality?
Chapter 19
This final chapter explores a few of the many connections between online learning, probabilistic
prediction, and Bayesian statistics, where we link the areas via information theoretic analyses.
Within information theory, these problems classically arise from universal prediction, where one
wishes to encode a sequence of random variables arriving sequentially from an unknown distribution
nearly as well as if one knew the distribution. We use some of these ideas as motivation, but we
will focus more on the statistical sides of the problem and connections with the theory of proper
losses for prediction we develop in Chapter 14, referring to the bibliographic section for further
exploration.
Consider the following minimax probabilistic game between nature, who chooses some distribution to sample from a set \(\mathcal{X}\) (for now finite), and a decision maker or player, who wishes to model the sampling distribution as well as possible. We, the player, first choose a distribution Q on \(\mathcal{X}\) with probability mass function q, and then nature chooses a \(P \in \mathcal{P}\), where \(\mathcal{P}\) is a collection of distributions on \(\mathcal{X}\). Nature draws \(X \sim P\), and upon revealing a realization x, we suffer log loss \(-\log q(x)\) and expected loss \(\mathbb{E}_P[-\log q(X)]\). In our game we thus suffer (worst-case) expected loss
\[
\sup_{P \in \mathcal{P}} \mathbb{E}_P[-\log q(X)] = \sup_{P \in \mathcal{P}} \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)}, \tag{19.0.1}
\]
Instead of using the objective (19.0.2), it is frequently sensible to relativize losses to P: we wish
to compete against the best P ∈ P, and (as in Chapter 17) seek to play distributions Q where
we have little regret relative to the losses we would suffer if we had played the optimal P . Then
instead of the problem (19.0.2), we consider the objective
\[
\sup_{P \in \mathcal{P}} \Big\{ \mathbb{E}_P[-\log q(X)] - \inf_{Q \in \mathcal{P}} \mathbb{E}_P[-\log q(X)] \Big\} = \sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[ \log \frac{p(X)}{q(X)} \Big],
\]
the KL-divergence between P and Q. More abstractly, we can revisit the prediction setting of Chapter 14, where, for a proper loss ℓ, we suffer ℓ(Q, x) for predicting distribution Q on realization
x; the problem (19.0.2) corresponds to the logarithmic loss. Then we wish to minimize the worst-
case expected regret of playing Q instead of the optimal P , that is, solve the game
Such minimax games are a focus in the game theory, economics, and optimization literatures, and
as we shall see, they bring insights to information theory and statistics.
We can also imagine a variant of the game (19.0.3). To make the connections with parameter
estimation and Bayesian statistics notationally clearer, let the collection P = {Pθ }θ∈Θ be parame-
terized by θ (this is no loss of generality). Then instead of choosing a distribution Q and allowing
nature to choose a distribution Pθ , we could switch the order of the game: nature first chooses prior
distribution π on θ, and without seeing θ (but with knowledge of the distribution π) we choose the
predictive distribution Q. This leads to the Bayesian regret, which is simply the expectation
\[
\int_\Theta \big( \mathbb{E}_{P_\theta}[\ell(Q, X)] - \mathbb{E}_{P_\theta}[\ell(P_\theta, X)] \big) \pi(\theta) d\theta.
\]
Let Pπ denote the marginal distribution over X obtained by drawing θ ∼ π and then drawing
X ∼ Pθ . When the loss ℓ is proper (recall Chapter 14.1), Pπ minimizes the Bayesian regret.
Nature, acting adversarially, may choose a prior to make this regret as large as possible, but the
Q player always plays with knowledge of nature’s choice. This leads to the main question of this
chapter: when are the (worst case) Bayesian regret and the game (19.0.3) where Q plays first
equivalent?
This chapter investigates the problems (19.0.2) and (19.0.3) and their variants. We begin by
motivating the problems in two ways: first, in Section 19.1, as a form of robust Bayesian procedure,
which leads to classical maximum entropy estimators. In Section 19.2, we overview some results on
universal and sequential prediction, connecting coding problems to the games (19.0.2) and (19.0.3).
We then present a fundamental duality result: that (for many classes P of distributions), it does
not matter which player goes first in problem (19.0.3), and similarly in problem (19.0.2): there are
robust choices of Q that nature—the P player—can, essentially, gain no advantage against. For the
remainder of the chapter, we then provide analyses of Bayesian statistical procedures, showing how they provide asymptotic guarantees on the log loss \(-\mathbb{E}_P[\log q(X_1^n)]\) when \(X_i \stackrel{\mathrm{iid}}{\sim} P\), giving explicit ways to (asymptotically) find such optimal Q distributions.
A “robust” Bayesian chooses the worst possible prior π on Θ from a collection of potential priors,
yielding the robust loss
\[
\sup_{\pi \in \Pi} \int \mathbb{E}_{P_\theta}[-\log q(X)] \pi(\theta) d\theta.
\]
where we think of a u-player and a v-player choosing u (respectively v), after which the other player
chooses a best-response. The game has a value if the order of play does not matter, that is, when
inf u supv L(u, v) = supv inf u L(u, v). In general, no matter U , V , and L, we always have the weak
min-max inequality
sup inf L(u, v) ≤ inf sup L(u, v). (19.1.4)
v∈V u∈U u∈U v∈V
Indeed, for any u0 ∈ U and v0 ∈ V , we certainly have inf u∈U L(u, v0 ) ≤ supv∈V L(u0 , v), and taking
the supremum over v0 on the left and infimum over u0 on the right implies inequality (19.1.4). A
pair \((u^\star, v^\star) \in U \times V\) is a saddle point for the game if
\[
\sup_{v \in V} L(u^\star, v) \le L(u^\star, v^\star) \le \inf_{u \in U} L(u, v^\star), \tag{19.1.5}
\]
the saddle nomenclature following as \((u^\star, v^\star)\) simultaneously minimizes and maximizes L. For such a point, we necessarily have
\[
\inf_{u \in U} \sup_{v \in V} L(u, v) \le \sup_{v \in V} L(u^\star, v) \le L(u^\star, v^\star) \le \inf_{u \in U} L(u, v^\star) \le \sup_{v \in V} \inf_{u \in U} L(u, v),
\]
so in view of the weak min-max inequality (19.1.4), each inequality is necessarily an equality above.
We capture this as a proposition for later reference:
Proposition 19.1.1. If a point (u⋆ , v ⋆ ) is a saddle point for the game (19.1.3), then the game has
value L(u⋆ , v ⋆ ), independent of the order of play, and
Moreover, u⋆ minimizes supv∈V L(u, v) over u ∈ U , and v ⋆ maximizes inf u∈U L(u, v) over v ∈ V .
Classical results in convex optimization and game theory provide sufficient conditions for such
saddle points to exist and, more generally, for equality to hold in inequality (19.1.4). Exercise 17.3 shows how to use online convex optimization techniques to prove the classical von Neumann minimax theorem: when \(U \subset \mathbb{R}^m\) and \(V \subset \mathbb{R}^n\) are compact convex sets and \(L(u, v) = \langle u, Av \rangle\) for some matrix A, there indeed exists a saddle point. Appendix C.4 reviews the main results
in (finite-dimensional) convex analysis on existence of saddle points, though because of the par-
ticular structure of the regret minimization game (19.0.3), we shall be able to give a more direct
“information-theoretic” proof of duality, which will have the advantage that it applies equally to
finite and infinite-dimensional problems.
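For finite strategy sets, the weak min-max inequality (19.1.4) can be checked mechanically; a sketch with a random payoff matrix \(L[i, j] = L(u_i, v_j)\):

```python
import numpy as np

rng = np.random.default_rng(5)
L = rng.standard_normal((6, 7))   # rows: u-player choices, columns: v-player choices

minimax = L.max(axis=1).min()     # inf_u sup_v L(u, v)
maximin = L.min(axis=0).max()     # sup_v inf_u L(u, v)
assert maximin <= minimax         # the weak min-max inequality (19.1.4)
```

Equality generally fails over such finite (pure-strategy) sets; von Neumann's theorem restores it once mixed strategies make the sets convex.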
which immediately implies the equality (19.1.2) by Proposition 19.1.1. In this case, when we
do indeed have a saddle point, the saddle point thus necessarily achieves the maximum entropy
H(P ⋆ ) = supP ∈P H(P ), which is thus the value for the game (19.0.2). By the defining criterion for
a saddle point, q ⋆ evidently uniquely minimizes EP ⋆ [− log q(X)] = H(P ⋆ ) + Dkl (P ⋆ ||Q), so we see
that Q⋆ maximizes the entropy over P as well, so that maximum entropy becomes the robust Bayes
act against P: it solves the worst-case problem (19.0.2) and maximizes entropy. In summary, for
the log-loss game, any saddle point takes the form
(P ⋆ , P ⋆ ),
Continuing to consider the case that \(\mathcal{X}\) is discrete, let \(P_\theta\) denote the exponential family distribution with p.m.f. \(p_\theta(x) = h(x) \exp(\langle \theta, \phi(x) \rangle - A(\theta))\), where h denotes a carrier. We have the following.

Proposition 19.1.2. Let the conditions above hold. If \(\mathbb{E}_{P_\theta}[\phi(X)] = \alpha\), then
\[
\inf_Q \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} \mathbb{E}_P[-\log q(X)] = \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} \mathbb{E}_P[-\log p_\theta(X)] = \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} \inf_Q \mathbb{E}_P[-\log q(X)].
\]
where H denotes the Shannon entropy, for any distribution \(P \in \mathcal{P}_\alpha^{\mathrm{lin}}\). Moreover, for any \(Q \ne P_\theta\), we have
\[
\sup_{P} \mathbb{E}_P[-\log q(X)] \ge \mathbb{E}_{P_\theta}[-\log q(X)] > \mathbb{E}_{P_\theta}[-\log p_\theta(X)] = H(P_\theta),
\]
where the inequality follows because \(D_{\mathrm{kl}}(P_\theta \| Q) > 0\). This shows the first equality in the proposition.
For the second equality, note that
\[
\inf_Q \mathbb{E}_P[-\log q(X)] = \underbrace{\inf_Q \mathbb{E}_P\Big[ \log \frac{p(X)}{q(X)} \Big]}_{=0} - \mathbb{E}_P[\log p(X)] = H(P).
\]
But we know from our standard maximum entropy results (Theorem 14.4.7) that Pθ maximizes the
entropy over \(\mathcal{P}_\alpha^{\mathrm{lin}}\), that is, \(\sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} H(P) = H(P_\theta)\).
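The mechanism behind the proposition is visible numerically: with carrier h ≡ 1, \(-\log p_\theta(x) = A(\theta) - \langle \theta, \phi(x) \rangle\) is affine in \(\phi(x)\), so \(\mathbb{E}_P[-\log p_\theta(X)]\) takes the same value \(H(P_\theta)\) for every P satisfying the mean constraint. A sketch on a four-point alphabet with ϕ(x) = x (the particular θ and perturbation are illustrative):

```python
import numpy as np

x = np.arange(4.0)                       # alphabet {0, 1, 2, 3}, phi(x) = x
theta = 0.3
p_theta = np.exp(theta * x)
p_theta /= p_theta.sum()                 # p_theta(x) = exp(theta * x - A(theta))
alpha = p_theta @ x                      # the mean constraint E[phi(X)] = alpha

# Perturb along z with sum(z) = 0 and sum(z * x) = 0 to stay in P_alpha^lin.
z = np.array([1.0, -2.0, 1.0, 0.0])
P = p_theta + 0.05 * z
assert (P > 0).all() and abs(P.sum() - 1) < 1e-12 and abs(P @ x - alpha) < 1e-12

H_theta = -(p_theta @ np.log(p_theta))   # Shannon entropy of P_theta
loss_P = -(P @ np.log(p_theta))          # E_P[-log p_theta(X)]
assert abs(loss_P - H_theta) < 1e-12     # constant over the mean-matched class
```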
The same result immediately holds beyond discrete sets X , which we state here without proof,
as it is more or less completely identical to the proof of Proposition 19.1.2. Let ν be some measure
on X (e.g., the Lebesgue measure dν(x) = dx), and let
Thus, we may define the distribution \(q_C(x) = 2^{-\ell_C(x)} / \sum_x 2^{-\ell_C(x)}\), and for any sequence \(x_1^n\), we have
\[
-\log_2 q_C(x_1^n) = \sum_{i=1}^n \ell_C(x_i) + n \log_2 \bigg[ \sum_x 2^{-\ell_C(x)} \bigg] \le \sum_{i=1}^n \ell_C(x_i).
\]
Conversely, given any distribution q on \(x_1^n \in \mathcal{X}^n\), the function \(\ell : \mathcal{X}^n \to \mathbb{N}\) with \(\ell(x_1^n) := \lceil -\log_2 q(x_1^n) \rceil\) evidently satisfies \(\sum_{x_1^n} 2^{-\ell(x_1^n)} \le \sum_{x_1^n} q(x_1^n) = 1\), and so there exists a binary prefix code \(C : \mathcal{X}^n \to \{0, 1\}^*\) with the length function \(\ell_C(x_1^n) = \lceil -\log_2 q(x_1^n) \rceil\) by the converse part of the Kraft inequality in Theorem 2.4.2. Thus for \(X_1^n \sim P\),
\[
\mathbb{E}_P[\ell_C(X_1^n)] = \mathbb{E}_P\big[ \lceil -\log_2 q(X_1^n) \rceil \big] \le 1 + \mathbb{E}_P[-\log_2 q(X_1^n)].
\]
The minimax game (19.0.2) thus corresponds to a coding game where we attempt to choose a distribution Q (or sequential coding scheme C) that has as small an expected length as possible, uniformly over distributions P. ◁
When we have sequences x1 , . . . , xn , the minimax log-loss game (19.0.2) thus is equivalent to a
sequential setting in which we attempt to predict symbols, or equivalently, encode the sequence xn1
online. Rather than measuring performance simply by the losses \(-\log q(x)\), however, we measure against reference distributions P, and ask whether we can predict the sequence (nearly) as well as if we knew the true distribution of the data. Or, in more general settings, we would like to predict
the data as well as all predictive distributions P from some family of distributions P, even if a
priori we know little about the coming sequence of data.
This leads us to an online setting that parallels the online convex optimization scenarios we investigated in Chapter 17. With the log loss \(\ell(q, x) = -\log q(x)\), we are even in the setting of that chapter, but the explosion of \(-\log q(x)\) near the boundaries of the probability simplex necessitates
a different set of tools. We consider two versions of the sequential prediction game: adversarial
and probabilistic. For both of the following definitions of sequential prediction games, we assume
that p and q are densities or probability mass functions in the case that X is continuous or discrete
(this is no real loss of generality) for distributions P and Q.
Adversarial regret We begin with the adversarial case. Given a sequence xn1 ∈ X n , the regret
of the distribution Q for the sequence xn1 with respect to the distribution P is
\[
\mathrm{Reg}(Q, P, x_1^n) := \log \frac{1}{q(x_1^n)} - \log \frac{1}{p(x_1^n)} = \sum_{i=1}^n \bigg[ \log \frac{1}{q(x_i \mid x_1^{i-1})} - \log \frac{1}{p(x_i \mid x_1^{i-1})} \bigg], \tag{19.2.1}
\]
where we have written it as the sum over \(q(x_i \mid x_1^{i-1})\) to emphasize the sequential nature of the
game. The quantity (19.2.1) measures how much we “regret” playing a distribution Q over the
alternate distribution P . In the context of the coding game in Example 19.2.1, this corresponds to
the realized number of bits necessary to encode \(x_1^n\) over a putative encoding through p. Associated with the regret of the sequence \(x_1^n\) is the adversarial regret of Q with respect to the family \(\mathcal{P}\) of distributions, which, as in Chapter 17, we typically call simply the regret:
\[
R_n^{\mathcal{X}}(Q, \mathcal{P}) := \sup_{P \in \mathcal{P},\, x_1^n \in \mathcal{X}^n} \mathrm{Reg}(Q, P, x_1^n). \tag{19.2.2}
\]
n
Redundancy A less adversarial problem is to minimize the expected regret with respect to a
distribution P , which for the log loss gets the special name redundancy. We thus define
\[
\mathrm{Red}_n(Q, P) := \mathbb{E}_P\Big[ \log \frac{1}{q(X_1^n)} - \log \frac{1}{p(X_1^n)} \Big] = D_{\mathrm{kl}}(P \| Q), \tag{19.2.3}
\]
where the dependence on n is implicit in the KL-divergence. The name choice follows from the
connections to coding games in Example 19.2.1:
Example 19.2.2 (Example 19.2.1 on coding, continued): For any p.m.f.s p and q on the set
X , we can define coding schemes Cp and Cq with code lengths
\[
\ell_{C_p}(x) = \log \frac{1}{p(x)} \quad \text{and} \quad \ell_{C_q}(x) = \log \frac{1}{q(x)}.
\]
Conversely, given (uniquely decodable) encoding schemes \(C_p\) and \(C_q : \mathcal{X} \to \{0, 1\}^*\), the functions \(p_{C_p}(x) = 2^{-\ell_{C_p}(x)}\) and \(q_{C_q}(x) = 2^{-\ell_{C_q}(x)}\) satisfy \(\sum_x p_{C_p}(x) \le 1\) and \(\sum_x q_{C_q}(x) \le 1\) by the Kraft-McMillan inequality (Theorem 2.4.2). Thus, the redundancy of Q with respect to P is the additional number of bits required to encode variables distributed according to P when we assume they have distribution Q, that is, how redundant our implicit encoding by Q is:
\[
\mathrm{Red}_n(Q, P) = \sum_{i=1}^n \mathbb{E}_P\Big[ \log \frac{1}{q(X_i \mid X_1^{i-1})} - \log \frac{1}{p(X_i \mid X_1^{i-1})} \Big] = \sum_{i=1}^n \big( \mathbb{E}_P[\ell_{C_q}(X_i)] - \mathbb{E}_P[\ell_{C_p}(X_i)] \big),
\]
where \(\ell_C(x)\) denotes the number of bits C uses to encode x. As in Section 2.4.1, the code with lengths \(\lceil -\log p(x) \rceil\) is (essentially) optimal. ◁
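A small numerical illustration of the redundancy-as-extra-bits interpretation (single-symbol case; the distributions are illustrative, with P dyadic so that the integer code lengths are exact):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # true distribution P
q = np.full(4, 0.25)                     # coding distribution Q

redundancy = np.sum(p * np.log2(p / q))  # D_kl(P || Q), in bits
assert redundancy >= 0

# Idealized lengths log2(1/p) and log2(1/q): the expected extra length of
# coding with q in place of p is exactly the redundancy.
extra_bits = np.sum(p * (np.log2(1 / q) - np.log2(1 / p)))
assert abs(extra_bits - redundancy) < 1e-12

# Integer lengths ceil(log2(1/p)) satisfy the Kraft inequality and are
# within one bit of the entropy.
lengths = np.ceil(np.log2(1 / p))
assert np.sum(2.0 ** (-lengths)) <= 1 + 1e-12
entropy = np.sum(p * np.log2(1 / p))
assert entropy <= np.sum(p * lengths) <= entropy + 1
```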
As another example, we may consider a filtering or prediction problem for a linear system.
where \(\ell(P, x_i)\) indicates the loss suffered on the point \(x_i\) when the distribution P over \(X_i\) is played, and \(P(\cdot \mid x_1^{i-1})\) denotes the conditional distribution of \(X_i\) given \(x_1^{i-1}\) according to P. The particular loss should be clear from context in most of our discussion.
Taking expectations, we obtain the expected regret

EReg_n(Q, P) := E_P[ Σ_{i=1}^n ( ℓ(Q(· | X_1^{i−1}), X_i) − ℓ(P(· | X_1^{i−1}), X_i) ) ]
             = Σ_{i=1}^n E_P[ ℓ(Q(· | X_1^{i−1}), X_i) − ℓ(P(· | X_1^{i−1}), X_i) ],
where the expectation is taken over X1n ∼ P . Because we focus on proper losses, the expected regret
is always nonnegative, and when ℓ is strictly proper, it is positive unless Q = P . The worst-case
expected regret with respect to a class P is then supP ∈P ERegn (Q, P ).
To simplify notation, let us consider the case that n = 1 (or, equivalently, that we simply
measure losses ℓ(Q, X1n ) against an entire vector X1n ). Then minimizing the worst-case expected
regret is identical to the problem (19.0.3) because ℓ is proper. We can make an essentially complete analogy with the redundancy (19.2.3),
that is, that the expected regret is a divergence between distributions. To do this, recall the Savage
representation of proper losses from Section 14.2. Corollaries 14.2.5 and 14.2.9 show the loss ℓ is (strictly) proper if and only if there exists a (strictly) convex Ω such that

ℓ(Q, x) = −Ω(Q) − ⟨∇Ω(Q), e_x − Q⟩

(where we abuse notation slightly to write inner products), and for this negative entropy Ω,

E_P[ℓ(Q, X)] − E_P[ℓ(P, X)] = Ω(P) − Ω(Q) − ⟨∇Ω(Q), P − Q⟩,

where we recall the Bregman divergence DΩ(P, Q) = Ω(P) − Ω(Q) − ⟨∇Ω(Q), P − Q⟩. That is, the expected regret

EReg(Q, P) = DΩ(P, Q)    (19.3.1)
is the divergence between P and Q, as in the case (19.2.3) of the log loss with the KL-divergence,
and the representation (19.3.1) via divergence will be the key to the dualities we show.
We consider the min-max game formulation, but set notation to evoke estimating parameters
θ of a model of interest. Indeed, without loss of generality, we can identify P = {Pθ }θ∈Θ for some
space Θ (for example, taking Θ to be in one-to-one mapping with probability distributions). Then
letting Π(Θ) be the collection of probability distributions on Θ, we evidently have

sup_{P∈P} EReg(Q, P) = sup_{π∈Π(Θ)} ∫ EReg(Q, P_θ) dπ(θ),
and the rightmost quantity is linear in π. Then we can ask for a distribution Q minimizing the
worst-case expected regret, and similarly ask when it does not matter whether we choose Q first
and nature chooses π or vice versa: do we have the duality
inf_Q sup_π ∫ EReg(Q, P_θ) dπ(θ) = sup_π inf_Q ∫ EReg(Q, P_θ) dπ(θ)?
To this end, define the information the observation X carries about T under the loss ℓ and prior π,

I_ℓ(π; X) := inf_Q ∫ ( E_{P_θ}[ℓ(Q, X)] − E_{P_θ}[ℓ(P_θ, X)] ) dπ(θ) = H_ℓ(X) − H_ℓ(X | T),    (19.3.2)

where the final equality uses the generalized entropy (14.1.5) and tacitly draws T ∼ π.
Because the loss ℓ is proper, the infimum over Q in the definition (19.3.2) is always attained by the marginal distribution

P_π(A) := ∫ P_θ(A) dπ(θ),    (19.3.3)
that is, P_π is the marginal distribution on X after drawing T ∼ π and then X ∼ P_T. To see this, simply note that ∫ E_{P_θ}[ℓ(Q, X)] dπ(θ) = E_{P_π}[ℓ(Q, X)], which P_π minimizes by propriety. Recalling the representation of expected regret as the divergence (19.3.1) for the (generalized) negative entropy Ω(P) = −inf_Q E_P[ℓ(Q, X)], we thus may represent the information as the average Bregman divergence

I_ℓ(π; X) = inf_Q ∫ DΩ(P_θ, Q) dπ(θ) = ∫ DΩ(P_θ, P_π) dπ(θ).
This measures the most information that the channel T → X can possibly contain when we know
that T is drawn from the prior π ∈ Π. To rephrase our questions of duality as a saddle point
problem as in Section 19.1.1, we define the expected loss gap (regret) with respect to the prior π,
L(Q, π) := ∫ E_{P_θ}[ℓ(Q, X)] dπ(θ) − ∫ E_{P_θ}[ℓ(P_θ, X)] dπ(θ) = ∫ DΩ(P_θ, Q) dπ(θ),    (19.3.5)

where we recall that P_π denotes the marginal distribution (19.3.3) and thus satisfies

inf_Q L(Q, π) = L(P_π, π).    (19.3.6)
We can therefore rephrase our questions of duality as the following regret/capacity duality ques-
tion: under what circumstances do we have min-max equality for the saddle objective (19.3.5),
inf_Q sup_{π∈Π} L(Q, π) = sup_{π∈Π} inf_Q L(Q, π) = sup_{π∈Π} I_ℓ(π; X)?    (19.3.7)
Because Π ⊂ Π(Θ) may be infinite dimensional, this equality does not follow so immediately from
classical minimax theorems. Nonetheless, the duality (19.3.7) does indeed typically hold.
A key intermediate result is that when there is a prior π ⋆ achieving the capacity, then we are
guaranteed a saddle point, and so long as ℓ is strictly proper, we have uniqueness. Because we
wish to actually achieve minimizers, we make a minor restriction (as we will see, this is in practice
not important) to assume the losses admit a one-dimensional lower semicontinuity. So we say that
DΩ(P, Q) is directionally lower semicontinuous in Q if for any two distributions Q_0, Q_1,

lim inf_{λ↓0} DΩ(P, (1−λ)Q_0 + λQ_1) ≥ DΩ(P, Q_0).    (19.3.8)
Under this condition, achieving the maximum capacity is sufficient to guarantee the duality (19.3.7),
and even more, the existence of saddle points:
Lemma 19.3.1. Let Π be a convex collection of distributions on Θ. Assume that the capacity C_ℓ(Π) := sup_{π∈Π} I_ℓ(π; X) is finite, and that π⋆ ∈ Π achieves it: I_ℓ(π⋆; X) = C_ℓ(Π). Let the divergence DΩ(P_θ, Q) be directionally lower semicontinuous in Q for each P_θ. Then the marginal distribution P_{π⋆} satisfies

L(P_{π⋆}, π) ≤ L(P_{π⋆}, π⋆) ≤ L(Q, π⋆)   for all π ∈ Π and all distributions Q,

so (P_{π⋆}, π⋆) is a saddle point for the expected regret (19.3.5), and L(P_{π⋆}, π⋆) = C_ℓ(Π) = I_ℓ(π⋆; X). If additionally ℓ is strictly proper, then P_{π⋆} uniquely achieves the infimum in L(Q, π⋆).
Writing the lemma in terms of the gaps in the expected losses, if π⋆ maximizes I_ℓ(π; X) over π ∈ Π, then Q⋆ = P_{π⋆}, and the regret/capacity game (19.0.3) has saddle point (P_{π⋆}, π⋆):

sup_{π∈Π} ∫ ( E_{P_θ}[ℓ(P_{π⋆}, X)] − E_{P_θ}[ℓ(P_θ, X)] ) dπ(θ) ≤ ∫ ( E_{P_θ}[ℓ(P_{π⋆}, X)] − E_{P_θ}[ℓ(P_θ, X)] ) dπ⋆(θ)
                                                               ≤ inf_Q ∫ ( E_{P_θ}[ℓ(Q, X)] − E_{P_θ}[ℓ(P_θ, X)] ) dπ⋆(θ).
While we have been purposefully informal in the statement of Theorem 19.3.2, stating merely
that the loss ℓ is “proper enough,” we provide several formalizations of the theorem in Section 19.6,
presenting a full version in Theorem 19.6.5 in Section 19.6.4. In brief, “proper enough” means that
the loss ℓ is strictly proper plus a bit more, so that convergence of distributions in the associated
Bregman divergence representation of expected regret (19.3.1) guarantees some type of convergence
of distributions.
In the next subsection, we give several instantiations of the theorem. From a practical perspec-
tive, Theorem 19.3.2 (and the more precise Theorem 19.6.5) could use some help: it only guarantees
the existence of an optimal distribution achieving the capacity or worst-case expected regret. By
returning to the setting in which we have sequential observations X_i, i = 1, …, n, we can ask not whether a worst-case regret minimizing distribution Q⋆ exists, but two related questions. First, whether we can achieve sublinear regret: that is, whether there exists some distribution Q⋆ for which

sup_{P∈P} EReg_n(Q⋆, P) = o(n).
Then the worst-case expected regret (or redundancy) grows only sublinearly with n, and we play
(asymptotically) essentially as well as if we knew the true distribution P. Secondly and relatedly, we ask whether we can identify an (asymptotically) worst-case prior π⋆. Because of loss propriety, we know that given a prior π, Q = P_π minimizes ∫ EReg(Q, P_θ) dπ(θ), so that if we can perform estimation using the prior π⋆, we achieve (asymptotically) optimal expected regret. These two questions form the core of Sections 19.4 and 19.5.
Lastly, if π ⋆ achieves the supremum in the capacity, then Q⋆ = Pπ⋆ is the marginal over X when
T ∼ π ⋆ , and we have the saddle point
sup_{π∈Π} ∫ Dkl(P_θ||P_{π⋆}) dπ(θ) ≤ ∫ Dkl(P_θ||P_{π⋆}) dπ⋆(θ) = I(X; T) ≤ inf_Q ∫ Dkl(P_θ||Q) dπ⋆(θ).
Whenever X is finite, we more or less have duality without conditions beyond ℓ being strictly
proper and lower semicontinuous, that is, that Q ↦ ℓ(Q, x) is lower semicontinuous.
Corollary 19.3.4 (Regret/capacity duality for finite sets). Let P = {Pθ }θ∈Θ be distributions
on a finite set X , let Π ⊂ Π(Θ) be a convex collection of distributions on Θ, and assume the
capacity (19.3.4) is finite: C_ℓ(Π) < ∞. If ℓ is strictly proper and lower semicontinuous, then there exists a unique distribution Q⋆ on X such that

sup_{π∈Π} ∫ ( E_{P_θ}[ℓ(Q⋆, X)] − E_{P_θ}[ℓ(P_θ, X)] ) dπ(θ) = C_ℓ(Π).
Lastly, if π ⋆ achieves the supremum in the capacity, then Q⋆ = Pπ⋆ , and (Pπ⋆ , π ⋆ ) is a saddle point.
Finally, we state one more corollary of Theorem 19.6.5 to come, which shows an application of
the results to the continuous-ranked probability scores we discuss in Examples 14.2.6 and 14.2.10,
which arise when we predict cumulative distributions. Recall that in this case, for a CDF F , we
have loss ℓ_crps(F, y) = ∫ (F(t) − 1{y ≤ t})² dt.
Corollary 19.3.5 (Regret/capacity duality for ranked scores). Let P = {Fθ }θ∈Θ be any collection
of cumulative distributions on a compact interval T = [t0 , t1 ] and let Π be a convex collection of
distributions on Θ. Then there exists a unique cumulative distribution F⋆ such that

sup_{π∈Π} ∫ ( E_{F_θ}[ℓ_crps(F⋆, X)] − E_{F_θ}[ℓ_crps(F_θ, X)] ) dπ(θ) = C_ℓ(Π).

If π⋆ achieves the capacity C_ℓ(Π), then F⋆(t) = ∫ F_θ(t) dπ⋆(θ) is the marginal CDF.
where 1_x denotes the point mass on x. In this case, the (expected) regret is simply E_P[ℓ(Q, X)] ≥ 0, and the worst-case game is to solve

minimize_Q  sup_{P∈P} E_P[ℓ(Q, X)].    (19.3.10)
Example 19.3.6: If X is discrete, then the log loss ℓ(q, x) = −log q(x) satisfies the condition (19.3.9), because q = 1_x satisfies ℓ(q, x) = −log 1 = 0. ◁

Example 19.3.7: If X ⊂ R, the continuous ranked probability score (CRPS) ℓ_crps(F, x) = ∫ (F(t) − 1{x ≤ t})² dt satisfies the condition (19.3.9), where we take F(t) = 1{x ≤ t}. ◁
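To make the CRPS example concrete, here is a small Monte Carlo sketch (our own illustration; the Uniform[0,1] CDF, seed, and grid sizes are arbitrary choices) checking that the generalized entropy of the CRPS satisfies E_F[ℓ_crps(F, X)] = ∫ F(t)(1−F(t)) dt, which equals 1/6 for the uniform CDF:

```python
import math
import random

random.seed(1)

def crps(F, x, lo, hi, grid=400):
    """Riemann-sum approximation of  integral of (F(t) - 1{x <= t})^2 dt  over [lo, hi]."""
    dt = (hi - lo) / grid
    total = 0.0
    for k in range(grid):
        t = lo + (k + 0.5) * dt
        total += (F(t) - (1.0 if x <= t else 0.0)) ** 2 * dt
    return total

# Uniform[0,1] CDF; its generalized CRPS entropy is the integral of t(1-t), i.e. 1/6
F = lambda t: min(max(t, 0.0), 1.0)
mc = sum(crps(F, random.random(), 0.0, 1.0) for _ in range(3000)) / 3000
print(mc)  # close to 1/6
```

The identity follows because E[(F(t) − 1{X ≤ t})²] = F(t)(1 − F(t)) pointwise when X ∼ F.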
Assume now that P is a convex collection of distributions on X . Then recalling the generalized
entropy (14.1.6)

H_ℓ(P) = inf_Q E_P[ℓ(Q, X)] = E_P[ℓ(P, X)],

we say that the robust game (19.3.10) has a solution if there exists Q⋆ such that

sup_{P∈P} H_ℓ(P) = sup_{P∈P} E_P[ℓ(Q⋆, X)],
Note that if a saddle point exists, then whenever ℓ is strictly proper, it must be the case that
P ⋆ = Q⋆ , and moreover, P ⋆ must maximize the generalized entropy Hℓ (P ) over P.
We can therefore provide corollaries of Theorem 19.3.2 (or, more accurately, Theorem 19.6.5 to
come).
and there exists a unique minimizer Q⋆ of supP ∈P EP [ℓ(Q, X)]. If P ⋆ attains the maximum entropy,
then Q⋆ = P ⋆ and (P ⋆ , P ⋆ ) is a saddle point for the game (19.3.10).
where Π(Θ) denotes the collection of prior probability distributions on Θ and P_π is the marginal over X when T ∼ π and X ∼ P_θ conditional on T = θ, as usual. (See Corollary B.1.16 in Appendix B.1.2 to see the equality of the convex hull.) Then for a given prior π, propriety of ℓ means that P_π minimizes ∫ E_{P_θ}[ℓ(Q, X)] dπ(θ) = E_{P_π}[ℓ(Q, X)]. We call π⋆ a worst-case prior if it maximizes

inf_Q ∫ E_{P_θ}[ℓ(Q, X)] dπ(θ).
Corollary 19.3.9. Let X be finite and ℓ be strictly proper. Let {Pθ }θ∈Θ be any collection of
distributions on X. Assume the maximum generalized entropy sup_{π∈Π(Θ)} H_ℓ(P_π) < ∞. Then there exists a unique minimizer Q⋆ of sup_{P∈P} E_P[ℓ(Q, X)], and

sup_{π∈Π(Θ)} H_ℓ(P_π) = sup_{π∈Π(Θ)} E_{P_π}[ℓ(Q⋆, X)] = inf_Q sup_{π∈Π(Θ)} ∫ E_{P_θ}[ℓ(Q, X)] dπ(θ) < ∞.
If additionally the convex hull {Pπ }π∈Π(Θ) is closed, then there is a prior π ⋆ maximizing the gen-
eralized entropy Hℓ (Pπ ), π ⋆ is the worst-case prior, Q⋆ = Pπ⋆ , and (Pπ⋆ , Pπ⋆ ) is a saddle point for
the game (19.3.10).
Proof All that we need to argue is that there is a prior π ⋆ maximizing the entropy Hℓ (Pπ ).
Because P = Conv{P_θ}_{θ∈Θ} ⊂ ∆_X is bounded, if it is closed it is necessarily compact. Then noting that H_ℓ(P) is the infimum of linear functions of P, it is concave and upper semicontinuous, so it attains its maximum over compact sets. So there is P⋆ ∈ P maximizing H_ℓ(P); that P is closed implies there exists at least one prior π⋆ on Θ for which P⋆ = P_{π⋆}.
We have no particular need to work with finite domains, though (as we note above), this
usually necessitates more care in the constructions. As a particular example, however, we can
leverage Corollary 19.3.5.
Corollary 19.3.10. Let T = [t0 , t1 ] ⊂ R be a compact interval, and let P be the collection of
cumulative distributions on T (i.e., distributions P supported on T ). Then for the continuous
ranked probability loss ℓcrps , there is a (unique) CDF F ⋆ minimizing supF ∈P EF [ℓcrps (F, X)], and
sup_{F∈P} ∫_{t_0}^{t_1} (1 − F(t)) F(t) dt = sup_{F∈P} E_F[ℓ_crps(F⋆, X)].
Now, take any distribution π ∈ Π, and let λ ∈ (0, 1), defining πλ = (1 − λ)π ⋆ + λπ (recall we have
assumed Π is convex). Then Iℓ (πλ ; X) ≤ Cℓ immediately, while for Pλ = λPπ + (1 − λ)Pπ⋆ and
T ∈ Θ representing a draw from π or π ⋆ , we have
Cℓ ≥ I_ℓ(π_λ; X) = inf_Q ∫ E_{P_θ}[ℓ(Q, X)] dπ_λ(θ) − λ E_π[H_ℓ(P_T)] − (1−λ) E_{π⋆}[H_ℓ(P_T)]

 (i)
 =  ∫ E_{P_θ}[ℓ(P_λ, X)] dπ_λ(θ) − λ E_π[H_ℓ(P_T)] − (1−λ) E_{π⋆}[H_ℓ(P_T)]

 =  (1−λ) [ ∫ E_{P_θ}[ℓ(P_λ, X)] dπ⋆(θ) − E_{π⋆}[H_ℓ(P_T)] ] + λ [ ∫ E_{P_θ}[ℓ(P_λ, X)] dπ(θ) − E_π[H_ℓ(P_T)] ]

 (ii)
 ≥  (1−λ) Cℓ + λ [ ∫ E_{P_θ}[ℓ(P_λ, X)] dπ(θ) − E_π[H_ℓ(P_T)] ],

where step (i) uses the propriety of the loss and step (ii) follows from the earlier bound (19.3.11). Rearranging and dividing through by λ > 0, we obtain by definition of L that

L(λP_π + (1−λ)P_{π⋆}, π) = ∫ E_{P_θ}[ℓ(λP_π + (1−λ)P_{π⋆}, X)] dπ(θ) − E_π[H_ℓ(P_T)] ≤ Cℓ for all λ ∈ (0, 1).
Now, the directional lower semicontinuity (19.3.8) implies

lim inf_{λ↓0} ( E_{P_θ}[ℓ(λP + (1−λ)Q, X)] − E_{P_θ}[ℓ(P_θ, X)] ) ≥ DΩ(P_θ, Q)
for any P, Q. Applying Fatou's lemma (see Proposition A.3.1 in Appendix A.3) then gives

L(P_{π⋆}, π) ≤ lim inf_{λ↓0} ∫ ( E_{P_θ}[ℓ(λP_π + (1−λ)P_{π⋆}, X)] − H_ℓ(P_θ) ) dπ(θ) ≤ Cℓ.
where ν is some base measure on X^n. Note that we may have Comp_n(Θ) = +∞, especially when Θ is non-compact. This is not particularly uncommon; consider, for example, the case of a normal location family model over X = R with Θ = R.

The complexity (19.4.1) is precisely the minimax regret in the adversarial setting.

Proposition 19.4.1. The minimax regret satisfies

inf_Q R_n^X(Q, P) = Comp_n(Θ).
Moreover, if Comp_n(Θ) < +∞, then the normalized maximum likelihood distribution (also known as the Shtarkov distribution) Q, defined with density

q(x_1^n) = sup_{θ∈Θ} p_θ(x_1^n) / ∫ sup_θ p_θ(x̃_1^n) dx̃_1^n,
is uniquely minimax optimal.
The proposition completely characterizes the minimax regret in the adversarial setting, and it
gives the unique distribution achieving the regret. Unfortunately, in most cases it is challenging
to compute the minimax optimal distribution Q, so we must make approximations of some type.
In the sequel, we will see that by moving to redundancy rather than adversarial regret, Bayesian
approximations to Q allow this.
Proof  We begin by proving the result in the case that Comp_n(Θ) < +∞. First, note that the normalized maximum likelihood distribution Q has constant regret:

R_n^X(Q, P) = sup_{x_1^n∈X^n} [ log 1/q(x_1^n) − log 1/sup_θ p_θ(x_1^n) ]
            = sup_{x_1^n} log ( sup_θ p_θ(x_1^n) / q(x_1^n) )
            = log ∫ sup_θ p_θ(x̃_1^n) dx̃_1^n = Comp_n(Θ).

Moreover, for any distribution Q′ on X^n with density q′, we have

R_n^X(Q′, P) ≥ ∫ [ log 1/q′(x_1^n) − log 1/sup_θ p_θ(x_1^n) ] q(x_1^n) dx_1^n
            = ∫ log ( q(x_1^n)/q′(x_1^n) ) q(x_1^n) dx_1^n + Comp_n(Θ)
            = Dkl(Q||Q′) + Comp_n(Θ),    (19.4.2)
so that Q is uniquely minimax optimal, as Dkl(Q||Q′) > 0 unless Q′ = Q.
Now we show how to extend the lower bound (19.4.2) to the case when Comp_n(Θ) = +∞. Let us assume without loss of generality that X is countable and consists of points x_1, x_2, … (we can discretize X otherwise) and assume we have n = 1. Fix any ϵ ∈ (0, 1) and construct a sequence θ_1, θ_2, … so that p_{θ_j}(x_j) ≥ (1−ϵ) sup_{θ∈Θ} p_θ(x_j), and define the sets Θ_j = {θ_1, …, θ_j}. Clearly we have Comp(Θ_j) ≤ log j, and if we define q_j(x) = max_{θ∈Θ_j} p_θ(x) / Σ_{x′∈X} max_{θ∈Θ_j} p_θ(x′), we may extend the reasoning yielding inequality (19.4.2) to obtain, for any distribution Q′ on X,

R^X(Q′, P) = sup_{x∈X} [ log 1/q′(x) − log 1/sup_{θ∈Θ} p_θ(x) ]
           ≥ Σ_x q_j(x) [ log 1/q′(x) − log 1/max_{θ∈Θ_j} p_θ(x) ]
           = Σ_x q_j(x) log ( q_j(x)/q′(x) ) + log Σ_{x′} max_{θ∈Θ_j} p_θ(x′)
           = Dkl(Q_j||Q′) + Comp(Θ_j).

Because max_{θ∈Θ_j} p_θ(x_i) ≥ (1−ϵ) sup_θ p_θ(x_i) for each i ≤ j, we have Comp(Θ_j) → +∞ as j → ∞ whenever Comp(Θ) = +∞, and so R^X(Q′, P) = +∞ for every Q′.
While typically computing the minimax (adversarial) regret is challenging, in some simple cases
we can compute it to within constant factors. In this case, we compete with the family of i.i.d.
Bernoulli distributions.
where h2 (p) = −p log p − (1 − p) log(1 − p) is the binary entropy. Using this representation, we
find that the complexity of the Bernoulli family is

Comp_n([0, 1]) = log Σ_{m=0}^n \binom{n}{m} e^{−n h_2(m/n)}.
Rather than explicitly compute with this, we now use Stirling’s approximation (cf. Cover
and Thomas [57, Lemma 17.5.1]): for any p ∈ (0, 1) with np ∈ N, we have
" #
n 1 1 1
∈√ p ,p exp(nh2 (p)).
np n 8p(1 − p) πp(1 − p)
the noted asymptote occurring as n → ∞ because this sum is a Riemann sum for the integral ∫_0^1 θ^{−1/2}(1−θ)^{−1/2} dθ. So as n → ∞,

inf_Q R_n^X(Q, P) = Comp_n([0, 1]) = log ( 2 + [8^{−1/2}, π^{−1/2}] n^{1/2} ∫_0^1 1/√(θ(1−θ)) dθ + o(1) )
                 = (1/2) log n + log ∫_0^1 1/√(θ(1−θ)) dθ + O(1).
We remark in passing that this is equal to (1/2) log n + log ∫_0^1 √(J_θ) dθ, where J_θ denotes the Fisher information of the Bernoulli family (recall Example 3.1.1). We will see that this holds in more generality, at least for redundancy, in the sequel. ◁
where we use Qn to denote that Q is applied on all n data points (in a sequential fashion, as
Q(· | X1i−1 )). In this expression, q and p denote the densities of Q and P , respectively. In a slightly
more general setting, we may consider the expected regret of Q with respect to a distribution Pθ
even under model mis-specification, meaning that the data is generated according to an alternate
distribution P . In this case, the (more general) redundancy becomes
E_P[ log 1/q(X_1^n) − log 1/p_θ(X_1^n) ].    (19.5.2)
In both cases (19.5.1) and (19.5.2), we would like to be able to guarantee that the redundancy
grows more slowly than n as n → ∞. That is, we would like to find distributions Q such that,
for any θ_0 ∈ Θ, we have (1/n) Dkl(P_{θ0}^n||Q^n) → 0 as n → ∞. Assuming we could actually obtain such a distribution in general, this is interesting because (even in the i.i.d. case) for any fixed distribution P_θ ≠ P_{θ0}, we must have Dkl(P_{θ0}^n||P_θ^n) = n Dkl(P_{θ0}||P_θ) = Ω(n). Motivated by the
where we note that Q_π^n will typically not be a product distribution. Then at time i, this mixture plays the density

q_π(x_i | x_1^{i−1}) = ∫_Θ q(x_i, θ | x_1^{i−1}) dθ = ∫_Θ p_θ(x_i) π(θ | x_1^{i−1}) dθ,

where

π(θ | x_1^{i−1}) = π(θ) p_θ(x_1^{i−1}) / ∫_Θ π(θ′) p_{θ′}(x_1^{i−1}) dθ′ = π(θ) exp( −log 1/p_θ(x_1^{i−1}) ) / ∫_Θ π(θ′) exp( −log 1/p_{θ′}(x_1^{i−1}) ) dθ′,    (19.5.3)

and we have emphasized that this strategy exhibits an exponential weighting approach, where distribution weights are scaled exponentially by their previous loss performance of log 1/p_θ(x_1^{i−1}).
The weighting scheme (19.5.3) yields asymptotically strong performance. In fact, so long as the prior π puts non-zero mass over all of Θ, under some appropriate smoothness conditions, the scheme Q_π is universal, meaning that Dkl(P_θ^n||Q_π^n) = o(n) for any θ ∈ int Θ. Clarke and
Barron [52, 53] make these statements rigorous; while the particular conditions are beyond the
scope of this book, in essence they require the following: the Fisher information Jθ for the family
P = {Pθ }θ∈Θ exists in a compact set K interior to Θ, and the distributions Pθ are sufficiently
regular that differentiation and integration can be interchanged (essentially, uniform versions of
the conditions (i)–(v) necessary for the van Trees inequality, Theorem 12.2.7).
Theorem 19.5.1 (Clarke and Barron [52, 53]). Let Q_π^n = ∫ P_θ^n π(θ) dθ be the marginal distribution over X_1^n obtained by drawing T ∼ π and then X_i drawn i.i.d. from P_θ conditional on T = θ. Let π have compact
support Θ, and K ⊂ int Θ be any compact set. Then under appropriate smoothness conditions on
the distributions Pθ and π,
Dkl(P_θ^n||Q_π^n) − (d/2) log (n/2πe) → log 1/π(θ) + (1/2) log det(J_θ)   as n → ∞    (19.5.4a)

uniformly in θ ∈ K, and

∫ Dkl(P_θ^n||Q_π^n) π(θ) dθ − (d/2) log (n/2πe) − ∫ π(θ) log ( √(det(J_θ)) / π(θ) ) dθ → 0.    (19.5.4b)
While we do not rigorously prove the theorem, we give a sketch showing the main components of
the result based on asymptotic normality arguments in Section 19.5.2.
Example 19.5.2 (Bernoulli distributions with a Beta prior): Consider the class of binary
(i.i.d. or memoryless) Bernoulli sources, that is, the X_i are i.i.d. Bernoulli(θ), where θ = P_θ(X = 1) ∈ [0, 1]. The Beta(α, β)-distribution prior on θ is the mixture π with density

π(θ) = ( Γ(α+β) / (Γ(α)Γ(β)) ) θ^{α−1} (1−θ)^{β−1}
on [0, 1], where Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt denotes the gamma function. We remark that under the Beta(α, β) distribution, we have E_π[θ] = α/(α+β). (See any undergraduate probability text for such results.)
If we play via a mixture of Bernoulli distributions under such a Beta prior for θ, by Theorem 19.5.1 we have a universal prediction scheme. We may also explicitly calculate the predictive distribution Q. To do so, we first compute the posterior π(θ | X_1^i) as in expression (19.5.3). Let S_i = Σ_{j=1}^i X_j be the partial sum of the Xs up to iteration i. Then

π(θ | x_1^i) = p_θ(x_1^i) π(θ) / q(x_1^i) ∝ θ^{S_i}(1−θ)^{i−S_i} θ^{α−1}(1−θ)^{β−1} = θ^{α+S_i−1} (1−θ)^{β+i−S_i−1},

where we have ignored the denominator as we must simply normalize the above quantity in θ. But by inspection, the posterior density of θ | X_1^i is a Beta(α+S_i, β+i−S_i) distribution. Thus, to compute the predictive distribution, we note that E_θ[X] = θ, so we have

Q(X_{i+1} = 1 | X_1^i) = E_π[θ | X_1^i] = (S_i + α) / (i + α + β).
Moreover, Theorem 19.5.1 shows that when we play the prediction game with a Beta(α, β) prior, we have redundancy scaling as

Dkl(P_{θ0}^n||Q_π^n) = (1/2) log (n/2πe) + log ( Γ(α)Γ(β) / Γ(α+β) · 1/(θ_0^{α−1}(1−θ_0)^{β−1}) ) + (1/2) log (1/(θ_0(1−θ_0))) + o(1).
We can also show how to play sequentially with Gaussian models, a simplified version of Ex-
ample 19.2.3.
Example 19.5.3 (Gaussian distributions with a Gaussian prior): Consider the collection
of i.i.d. (memoryless) Gaussian sources, where the X_i are drawn i.i.d. N(θ, σ²I). Then if the prior π on θ is N(0, τ²I), a calculation involving completing squares yields posterior density

π(θ | x_1^n) ∝ exp( −(1/2)(1/τ² + n/σ²) ‖θ − (nτ²/(nτ² + σ²)) x̄_n‖² ),

that is,

T | X_1^n ∼ N( (τ²n/(τ²n + σ²)) X̄_n, (σ²τ²/(σ² + τ²n)) I ).
Thus, the predictive distribution Q_π(X_{n+1} ∈ · | X_1^n) is Gaussian with mean

E_Q[X_{n+1} | X_1^n] = E_π[T | X_1^n] = (τ²n/(τ²n + σ²)) X̄_n

and covariance

Cov(X_{n+1} | X_1^n) = Cov(T | X_1^n) + σ²I = ( σ²τ²/(σ² + τ²n) + σ² ) I.
Direct, but tedious, calculations show that this agrees with the limit (19.5.4a). ◁
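In one dimension the predictive moments are simple to verify numerically by discretizing the posterior (our own sketch; the data and the values σ² = 1, τ² = 4 are arbitrary):

```python
import math

def predictive_moments(xs, sigma2, tau2):
    """Closed-form mean/variance of X_{n+1} | x_1^n with prior N(0, tau2), d = 1."""
    n = len(xs)
    xbar = sum(xs) / n
    mean = tau2 * n / (tau2 * n + sigma2) * xbar
    var = sigma2 * tau2 / (sigma2 + tau2 * n) + sigma2
    return mean, var

def predictive_moments_numeric(xs, sigma2, tau2, lo=-10.0, hi=10.0, grid=20001):
    """Same quantities via a grid over theta: posterior mean, plus noise variance."""
    m1 = m2 = den = 0.0
    for k in range(grid):
        t = lo + (hi - lo) * k / (grid - 1)
        loglik = -sum((x - t) ** 2 for x in xs) / (2 * sigma2) - t * t / (2 * tau2)
        w = math.exp(loglik)
        den += w
        m1 += t * w
        m2 += t * t * w
    mean = m1 / den
    post_var = m2 / den - mean ** 2
    return mean, post_var + sigma2  # predictive variance adds the noise sigma2

print(predictive_moments([1.0, 2.0, 0.0], 1.0, 4.0))          # (12/13, 4/13 + 1)
print(predictive_moments_numeric([1.0, 2.0, 0.0], 1.0, 4.0))
```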
While Theorem 19.5.1 addresses the case in which the data are generated i.i.d. Pθ , so that
the model Pθ is well-specified, mixture models enjoy a type of robustness even under model mis-
specification, that is, when the true distribution generating the data does not belong to the class
P = {Pθ }θ∈Θ . In this case, we look at the generalized redundancy (19.5.2), measuring loss relative
to Dkl (P ||Pθ ). We now allow restricting mixture distributions to a subset Θ0 ⊂ Θ by defining
Q_{π,Θ_0}(A) := (1/π(Θ_0)) ∫_{Θ_0} P_θ(A) dπ(θ).
Proposition 19.5.4. Assume that the P_θ have densities p_θ over X, let P be any distribution having density p over X, and let q_π be the density associated with Q_π. Then for all Θ_0 ⊂ Θ and θ ∈ Θ,

E_P[ log 1/q_π(X) − log 1/p_θ(X) ] ≤ log 1/π(Θ_0) + Dkl(P||Q_{π,Θ_0}) − Dkl(P||P_θ).
In particular, Proposition 19.5.4 shows that so long as the mixture distributions Qπ,Θ0 can closely
approximate Pθ , then we attain a convergence guarantee nearly as good as any in the family P =
{Pθ }θ∈Θ . (This result is similar in flavor to the mutual information bound (10.2.2), Corollary 10.2.2,
and the index of resolvability there.)

Proof  Fix any Θ_0 ⊂ Θ. Then q_π(x) = ∫_Θ p_θ(x) dπ(θ) ≥ ∫_{Θ_0} p_θ(x) dπ(θ). Thus we have

E_P[ log ( p(X)/q_π(X) ) ] ≤ E_P[ inf_{Θ_0⊂Θ} log ( p(X) / ∫_{Θ_0} p_θ(X) dπ(θ) ) ]
 = E_P[ inf_{Θ_0} log ( p(X) π(Θ_0) / ( π(Θ_0) ∫_{Θ_0} p_θ(X) dπ(θ) ) ) ] = E_P[ inf_{Θ_0} log ( p(X) / ( π(Θ_0) q_{π,Θ_0}(X) ) ) ].
This is certainly smaller than the same quantity with the infimum outside the expectation, and
noting that
E_P[ log 1/q_π(X) − log 1/p_θ(X) ] = E_P[ log ( p(X)/q_π(X) ) ] − E_P[ log ( p(X)/p_θ(X) ) ]
gives the result.
the mutual information between a random variable T ∼ π and X_1^n drawn i.i.d. from P_θ conditional on T = θ. With Theorem 19.5.1 in hand, we can give a somewhat more nuanced picture of this mutual information quantity. As a first consequence of Theorem 19.5.1, we have that

I_π(T; X_1^n) = (d/2) log (n/2πe) + ∫ π(θ) log ( √(det J_θ) / π(θ) ) dθ + o(1),    (19.5.5)
where Jθ denotes the Fisher information matrix for the family {Pθ }θ∈Θ . One strand of Bayesian
statistics known as reference analysis advocates that in performing a Bayesian analysis, that is,
performing an experiment (observing X1n ) to estimate θ ∈ Θ, we should choose the prior π that
maximizes the mutual information between the parameters θ about which we wish to make in-
ferences and any observations X1n available. Moreover, in this set of strategies, one allows n to
tend to ∞, as we wish to take advantage of any data we might actually see. The asymptotic for-
mula (19.5.5) allows us to choose such a prior. (We will not delve too deeply into this; instead, we
refer to the survey by Bernardo [28] and papers of Berger et al. [25, 26].)
In a different vein, Jeffreys [121] proposed that if the square root of the determinant of the
Fisher information was integrable, then one should take π as
π_jeffreys(θ) = √(det J_θ) / ∫_Θ √(det J_θ) dθ,
giving the eponymous Jeffreys prior. Jeffreys originally proposed this for invariance reasons, as
the inferences made on the parameter θ under the prior πjeffreys are identical to those made on a
transformed parameter ϕ(θ) under the appropriately transformed Jeffreys prior.
Proceeding somewhat non-rigorously, in that we shall not worry about integrability, or uni-
formity in the prior π, the asymptotic expression (19.5.5) shows that the Jeffreys prior and the
asymptotic reference prior coincide. Indeed, computing the integral in (19.5.5), we have
∫_Θ π(θ) log ( √(det J_θ) / π(θ) ) dθ = ∫_Θ π(θ) log ( π_jeffreys(θ) / π(θ) ) dθ + log ∫ √(det J_θ) dθ
 = −Dkl(π||π_jeffreys) + log ∫ √(det J_θ) dθ,
whenever the Jeffreys prior exists. Moreover, we see that in an asymptotic sense, the Jeffreys prior
is the worst-case prior distribution π for nature to play, as otherwise the −Dkl (π||πjeffreys ) term in
the expected (Bayesian) redundancy is negative.
Example 19.5.5 (Jeffreys priors and the exponential distribution): Let us now assume that
our source distributions Pθ are exponential distributions, meaning that θ ∈ (0, ∞) and we have
density p_θ(x) = exp(−θx − log(1/θ)) for x ∈ [0, ∞). This is clearly an exponential family model, and the Fisher information is easy to compute as J_θ = (∂²/∂θ²) log(1/θ) = 1/θ² (cf. Eq. (3.3.2)).

In this case, the Jeffreys prior is π_jeffreys(θ) ∝ √(J_θ) = 1/θ, but this "density" does not
integrate over [0, ∞). One approach to this difficulty, advocated by Bernardo [28, Definition 3]
(among others) is to just proceed formally and notice that after observing a single datapoint, the "posterior" distribution π(θ | X) is well-defined. Following this idea, note that after seeing some data X_1, …, X_i, with S_i = Σ_{j=1}^i X_j as the partial sum, we have

π(θ | x_1^i) ∝ p_θ(x_1^i) π_jeffreys(θ) = θ^i exp( −θ Σ_{j=1}^i x_j ) · (1/θ) = θ^{i−1} exp(−θ S_i).
Integrating, we have for s_i = Σ_{j=1}^i x_j that

q(x | x_1^i) = ∫_0^∞ p_θ(x) π(θ | x_1^i) dθ ∝ ∫_0^∞ θ e^{−θx} θ^{i−1} e^{−θ s_i} dθ = (1/(s_i + x)^{i+1}) ∫_0^∞ u^i e^{−u} du,
where we made the change of variables u = θ(s_i + x). This is at least a distribution that normalizes; often, one simply assumes the existence of a piece of fake data. For example, by saying we "observe" x_0 = 1, we have prior proportional to π(θ) = e^{−θ}, which yields redundancy

Dkl(P_{θ0}^n||Q_π^n) = (1/2) log (n/2πe) + θ_0 + log (1/θ_0) + o(1).
The difference is that, in this case, the redundancy bound is no longer uniform in θ0 , as it
would be for the true reference (or Jeffreys, if it exists) prior. ◁
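The predictive density above normalizes explicitly: ∫_0^∞ i s_i^i/(s_i + x)^{i+1} dx = 1, so q(x | x_1^i) = i s_i^i/(s_i + x)^{i+1}. A short numerical check (our own sketch; the normalizing constant and the values s = 2, i = 3, x = 1 are our own computations and choices):

```python
import math

def q_pred(x, s, i):
    """Predictive density q(x | x_1^i) = i * s^i / (s + x)^(i+1); the constant
    i * s^i is our own normalization (the density integrates to 1 on [0, inf))."""
    return i * s ** i / (s + x) ** (i + 1)

def q_pred_numeric(x, s, i, hi=50.0, grid=200000):
    """Same density via the integral of p_theta(x) against the posterior on a grid."""
    num = den = 0.0
    for k in range(1, grid):
        th = hi * k / grid
        post = th ** (i - 1) * math.exp(-th * s)  # unnormalized posterior
        num += th * math.exp(-th * x) * post      # p_theta(x) = theta e^{-theta x}
        den += post
    return num / den

print(q_pred(1.0, 2.0, 3))          # 3 * 8 / 81
print(q_pred_numeric(1.0, 2.0, 3))  # close to the same value
```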
Define the Fisher score (sum of gradients of log-likelihoods) S_n := Σ_{i=1}^n ∇ log p_{θ0}(X_i). For θ near θ_0 and X_i drawn i.i.d. from P_{θ0}, we thus have
log p_θ(X_1^n) = Σ_{i=1}^n log p_{θ0}(X_i) + S_n^⊤(θ−θ_0) + (1/2)(θ−θ_0)^⊤ [ Σ_{i=1}^n ∇² log p_{θ0}(X_i) ] (θ−θ_0) + o(‖θ−θ_0‖²)
             ≈ log p_{θ0}(X_1^n) + S_n^⊤(θ−θ_0) − (n/2)(θ−θ_0)^⊤ J_{θ0} (θ−θ_0),
where we used the law of large numbers and the definition J_θ = −E_{P_θ}[∇² log p_θ(X)] of the Fisher information matrix, so that (1/n) Σ_{i=1}^n ∇² log p_{θ0}(X_i) → −J_{θ0} with probability 1. With this approximation, we complete the square so that for S̄_n = (1/n) S_n we can write

log p_θ(X_1^n) ≈ log p_{θ0}(X_1^n) − (n/2)(θ − θ_0 − J_{θ0}^{−1} S̄_n)^⊤ J_{θ0} (θ − θ_0 − J_{θ0}^{−1} S̄_n) + (n/2) S̄_n^⊤ J_{θ0}^{−1} S̄_n    (19.5.6)
for θ near θ0 .
We use the approximation (19.5.6) to compute expectations. First, observe that E_{θ0}[S̄_n^⊤ J_{θ0}^{−1} S̄_n] = d/n, and so we typically have J_{θ0}^{−1} S̄_n = O(1/√n). Thus, when θ is farther than distance of order 1/√n from θ_0, we have

(n/2)(θ − θ_0 − J_{θ0}^{−1} S̄_n)^⊤ J_{θ0} (θ − θ_0 − J_{θ0}^{−1} S̄_n) → ∞.
Thus, assuming the approximation (19.5.6) is accurate enough, if we let Θ0 be those points θ near
θ0 , then for “most” samples X1n we have
∫ π(θ) p_θ(X_1^n) dθ
 ≈ p_{θ0}(X_1^n) e^{(n/2) S̄_n^⊤ J_{θ0}^{−1} S̄_n} ∫_{Θ_0} π(θ) exp( −(n/2)(θ − θ_0 − J_{θ0}^{−1} S̄_n)^⊤ J_{θ0} (θ − θ_0 − J_{θ0}^{−1} S̄_n) ) dθ
 ≈ p_{θ0}(X_1^n) π(θ_0) e^{(n/2) S̄_n^⊤ J_{θ0}^{−1} S̄_n} ∫_{Θ_0} exp( −(n/2)(θ − θ_0 − J_{θ0}^{−1} S̄_n)^⊤ J_{θ0} (θ − θ_0 − J_{θ0}^{−1} S̄_n) ) dθ
because points far from θ_0 have essentially 0 mass in the integral and π is continuous. Treating the last integral as that of a Gaussian density (this is known as a Laplace approximation), where we recall that for J ≻ 0 we have ∫_{R^d} exp(−(1/2) x^⊤ J x) dx = (2π)^{d/2} det(J)^{−1/2}, we obtain
∫ π(θ) p_θ(X_1^n) dθ ≈ p_{θ0}(X_1^n) π(θ_0) exp( (n/2) S̄_n^⊤ J_{θ0}^{−1} S̄_n ) (2π)^{d/2} det(nJ_{θ0})^{−1/2}.    (19.5.7)
Once we substitute the heuristic approximation (19.5.7) into the KL-divergence, we obtain
Dkl(P_{θ0}^n||Q_π^n) ≈ log 1/π(θ_0) − E_{P_{θ0}}[ (n/2) S̄_n^⊤ J_{θ0}^{−1} S̄_n ] − (d/2) log(2π) + (d/2) log n + (1/2) log det(J_{θ0})
 = log 1/π(θ_0) + (d/2) log (1/2πe) + (d/2) log n + (1/2) log det(J_{θ0}),

because E_{P_{θ0}}[S̄_n S̄_n^⊤] = n^{−1} J_{θ0}. Assuming each approximation term ≈ adds an at most o(1) error
to the resulting formula, this is the result (19.5.4a). To make these statements rigorous requires
arguments that control the error terms in the various integrals and expansions, typically via variants
of Lebesgue’s dominated convergence theorem; we view the heuristic (19.5.7) as satisfying.
for some convex Ω. In equation (19.6.1), we abuse notation slightly, in that we are (without comment) viewing distributions Q as elements of ∆_X = {q ∈ R_+^X | ⟨1, q⟩ = 1} ⊂ R^X, and e_x is the "xth" standard basis vector. Then ∇Ω(Q) is a (particular, fixed) element of the subdifferential ∂Ω(Q), which necessarily exists because Ω is proper (Theorem 14.2.1). As Ω(P) = sup_Q −E_P[ℓ(Q, X)] for proper losses, we then have the equality (19.3.1), that is,

EReg(Q, P) = E_P[ℓ(Q, X)] − E_P[ℓ(P, X)] = DΩ(P, Q),

the Bregman divergence between P and Q. To avoid issues of non-determinacy in the selection of ∇Ω(Q), we shall always take the gap in the expected losses as the definition of the divergence.
When the space X is finite and the loss ℓ is strictly proper, the next lemma shows that conver-
gence of distributions in DΩ implies convergence in any norm. To make clear we work with vectors
p ∈ RX , we use lower case letters. We require one additional consideration: that
DΩ (p, q) := Ep [ℓ(q, X)] − Ep [ℓ(p, X)] = Ep [ℓ(q, X)] − Hℓ (p) = Ω(p) + Ep [ℓ(q, X)]
Lemma 19.6.1. Let X be finite, and assume the loss ℓ is strictly proper and that DΩ is lower semicontinuous. Then there exists a function ω with ω(ϵ) > 0 for all ϵ > 0 such that

DΩ(p, q) ≥ ω(‖p − q‖₁)   for all p, q ∈ ∆_X.
Proof  We identify P = ∆_X = {p ∈ R_+^X | ⟨1, p⟩ = 1}, a compact set. Assume for the sake of contradiction that

inf_{p,q∈∆_X} { DΩ(p, q) | ‖p − q‖₁ ≥ ϵ } = 0
for some ϵ > 0. Let pn , qn ∈ ∆X tend to this infimum, that is, satisfy ∥qn − pn ∥1 ≥ ϵ and
DΩ (pn , qn ) → 0. Taking subsequences as necessary, we may assume that qn → q and pn → p. By
the lower semicontinuity assumption, for any δ > 0 there exists N such that n ≥ N implies

E_{p_n}[ℓ(q_n, X)] − E_{p_n}[ℓ(p_n, X)] ≥ E_p[ℓ(q, X)] + Ω(p) − δ = D_Ω(p, q) − δ.

As δ > 0 was arbitrary,

lim inf_n E_{p_n}[ℓ(q_n, X)] − E_{p_n}[ℓ(p_n, X)] ≥ D_Ω(p, q) > 0,

where D_Ω(p, q) > 0 because ∥p − q∥₁ ≥ ϵ and ℓ is strictly proper; this contradicts D_Ω(p_n, q_n) → 0.
We decompose the proof into several parts. We first recall from Lemma 19.3.1 that if π̂ attains
the supremum in the definition (19.3.4) of the information in the experiment {Pθ } for the loss ℓ, then
(Pπ̂ , π̂) is a saddle point. Using this intermediate result, we use finite-dimensional approximations
of the capacity supπ Iℓ (π; X), which in turn give a sequence of (explicit) saddle points Qn =
Pπn minimizing the worst-case redundancy. Finally, with a bit of algebraic manipulation and the
careful use of the relationship between Bregman divergences DΩ and proper losses ℓ, we can use
completeness to show that limits of the sequence Qn necessarily exist, and they are unique.
Πn := Conv{π1 , . . . , πn } (19.6.2)
is isomorphic to an at most n − 1 dimensional set, and so because the saddle objective L(Q, π) is
linear in π, inf Q L(Q, π) is a concave function (and upper semicontinuous, as it is the infimum of
linear functions). So
π_n^⋆ = argmax_{π∈Π_n} inf_Q L(Q, π) = argmax_{π∈Π_n} I_ℓ(π; X)
is attained. The sequence πn⋆ of priors also satisfies Iℓ (πn⋆ ; X) → Cℓ because Cℓ ≥ Iℓ (πn⋆ ; X) ≥
Iℓ (πn ; X) → Cℓ . Because the loss ℓ is proper, the mixture distribution Qn = Pπn⋆ minimizes
L(Q, πn⋆ ), and Lemma 19.3.1 implies supπ∈Πn L(Qn , π) ≤ L(Qn , πn⋆ ) ≤ inf Q L(Q, πn⋆ ) → Cℓ .
Now for k ∈ N consider the pair Q_n = P_{π_n^⋆} and Q_{n+k} = P_{π_{n+k}^⋆}. We have

L(Q_{n+k}, π_n^⋆) − L(Q_n, π_n^⋆) = −Ω(Q_{n+k}) − ⟨∇Ω(Q_{n+k}), P_{π_n^⋆} − Q_{n+k}⟩ + Ω(Q_n) = D_Ω(Q_n, Q_{n+k})

and

D_Ω(Q_n, Q_{n+k}) ≤ L(Q_{n+k}, π_{n+k}^⋆) − L(Q_n, π_n^⋆) ≤ C_ℓ − I_ℓ(π_n^⋆; X),

the inequality following again from Lemma 19.3.1. As k is arbitrary, we have the limit (in n)

sup_{k≥1} D_Ω(Q_n, Q_{n+k}) → 0. (19.6.3)
Here we use that the space X is finite; by Lemma 19.6.1, there is necessarily some Q⋆ ∈ ∆X such
that Qn → Q⋆ . We show this Q⋆ is a saddle.
For any π ∈ Π_∞ := ∪_{n=1}^∞ Π_n(Θ), we have π ∈ Π_n(Θ) for all large enough n, so that similarly
L(Qn , π) ≤ Cℓ
for all π ∈ Π∞ . Now we may use the lower semi-continuity of the regret DΩ (Pθ , Q) in Q, which
implies
∫ (E_{P_θ}[ℓ(Q⋆, X)] − H_ℓ(P_θ)) dπ(θ) ≤ ∫ lim inf_n (E_{P_θ}[ℓ(Q_n, X)] − H_ℓ(P_θ)) dπ(θ)   (i)
≤ lim inf_n ∫ (E_{P_θ}[ℓ(Q_n, X)] − H_ℓ(P_θ)) dπ(θ) ≤ C_ℓ,   (ii)
where step (i) used lower semicontinuity and step (ii) used Fatou’s lemma (Proposition A.3.1). In
brief, we have demonstrated that

L(Q⋆, π) ≤ sup_{π∈Π} I_ℓ(π; X) = C_ℓ for all π ∈ Π_∞ = ∪_{n≥1} Π_n. (19.6.4)
For the final step to showing that Q⋆ is optimal, we show that the inequality (19.6.4) holds
for any π ∈ Π. To that end, let π ∈ Π be any fixed prior, and then replace the set (19.6.2) with Π̃_n = Conv{π₁, . . . , π_n, π} ⊂ Π. This is isomorphic to an n-dimensional set, and we repeat precisely the same derivation mutatis mutandis, yielding a sequence of priors π̃_n^⋆ and distributions Q̃_n = P_{π̃_n^⋆}. Then

L(Q̃_n, π_n^⋆) − L(Q_n, π_n^⋆) = −Ω(Q̃_n) − ⟨∇Ω(Q̃_n), P_{π_n^⋆} − Q̃_n⟩ + Ω(Q_n) = D_Ω(Q_n, Q̃_n).
Similarly, we have

L(Q̃_n, π_n^⋆) − L(Q_n, π_n^⋆) ≤ L(Q̃_n, π̃_n^⋆) − L(Q_n, π_n^⋆) = I_ℓ(π̃_n^⋆; X) − I_ℓ(π_n^⋆; X) ≤ C_ℓ − I_ℓ(π_n^⋆; X) → 0,

so D_Ω(Q_n, Q̃_n) → 0; because Q_n → Q⋆, completeness gives Q̃_n → Q⋆ as well, and repeating the argument leading to inequality (19.6.4) with Π̃_n yields

L(Q⋆, π) ≤ C_ℓ for all π ∈ Π. (19.6.5)
Saddle point and its uniqueness As an immediate consequence of inequality (19.6.5), we see
that Q⋆ is indeed a saddle point: we have
sup_{π∈Π} L(Q⋆, π) = sup_{π∈Π} I_ℓ(π; X) = sup_{π∈Π} inf_Q L(Q, π) ≤ inf_Q sup_{π∈Π} L(Q, π)
by the weak min-max inequality, so equality holds in each step. Finally, we show that Q⋆ is unique.
Suppose that Q̃ is a different distribution satisfying C_ℓ = sup_{π∈Π} L(Q̃, π). Then

C_ℓ ≥ L(Q̃, π_n^⋆) = L(Q̃, π_n^⋆) − L(Q_n, π_n^⋆) + L(Q_n, π_n^⋆)
= −Ω(Q̃) − ⟨∇Ω(Q̃), P_{π_n^⋆} − Q̃⟩ + Ω(Q_n) + L(Q_n, π_n^⋆) = L(Q_n, π_n^⋆) + D_Ω(Q_n, Q̃).

Because L(Q_n, π_n^⋆) = I_ℓ(π_n^⋆; X) → C_ℓ, this forces the convergence D_Ω(Q_n, Q̃) → 0 (19.6.6); by Lemma 19.6.1 we then have ∥Q_n − Q̃∥₁ → 0, so Q̃ = lim_n Q_n = Q⋆.
Then L(Q⋆, π⋆) ≤ inf_Q L(Q, π⋆), and because ℓ is strictly proper we necessarily have Q⋆ = P_{π⋆}.
representation (14.2.1) from Theorem 14.2.1. In this case, we can rewrite the loss using the general
representation in Theorem 14.2.8,
for some convex Ω. Here, we have abused notation to mimic that in (19.6.1): we take 1x to be the
point mass at x and inner products to mean expectations, i.e.,
for an appropriate subgradient Ω′(Q, x), as in expression (14.2.6). We then have the equality
exactly as in the finite-dimensional case. Therefore, as in the proof of Corollary 19.3.4, we still
have the saddle point representation
L(Q, π) = ∫ D_Ω(P_θ, Q) dπ(θ)
and the infimal representation (19.3.6) that inf Q L(Q, π) = L(Pπ , π) = Iℓ (π; X), and so at least
insofar as setting up the problem, finiteness of X is immaterial because of the general proper loss
representation in Theorem 14.2.8.
The other places in which we use finite dimensionality all revolve around continuity of divergence measures and completeness of collections of distributions. The second place in which we use (something like) finite dimensionality is in the statement of Lemma 19.3.1, which assumes that the expected regret (Bregman divergence) D_Ω(P_θ, Q) is lower semicontinuous along one-dimensional directions in Q. The remaining three uses of finite dimensionality all regard completeness: equation (19.6.3) uses that if Q_n is Cauchy for the Bregman divergence, i.e., sup_{k≥1} D_Ω(Q_n, Q_{n+k}) → 0, then Q_n → Q⋆ for some distribution Q⋆. We use this same completeness to argue that if Q_n → Q⋆ and D_Ω(Q_n, Q̃_n) → 0, then Q̃_n → Q⋆ as well in the derivation of inequality (19.6.5). And finally, the convergence (19.6.6) uses that D_Ω(Q_n, Q̃) → 0 and D_Ω(Q_n, Q) → 0 implies Q = Q̃, which proves that Q⋆ is unique.
Thus, while we will require somewhat careful topological and metric space considerations, to
prove a rigorous version of Theorem 19.3.2 for arbitrary domains requires demonstrating only two
conditions relating to modes of convergence of probability distributions:
(A) D_Ω is lower semicontinuous for the desired mode of convergence: if P_n → P and Q_n → Q, then lim inf_n D_Ω(P_n, Q_n) ≥ D_Ω(P, Q).

(B) D_Ω is complete for the desired mode of convergence: if P_n is Cauchy for the divergence D_Ω, meaning that sup_{m≥1} D_Ω(P_n, P_{n+m}) → 0, then there exists P such that P_n → P, and if D_Ω(P_n, P) → 0 and D_Ω(P_n, Q) → 0, then P = Q.
If we can show both desiderata (A) and (B), then evidently Theorem 19.3.2 will be airtight. This
requires a small detour into the convergence of distributions in different topologies, all of which
coincide when the space X is finite. (See Appendix A.4 for a discussion of different modes of conver-
gence of probability distributions on metric spaces.) We shall focus on two modes of convergence:
in total variation distance and convergence in distribution.
Example 19.6.2: Let D(P, Q) = D_kl(P||Q). Then Pinsker's inequality (Proposition 2.2.8) implies that D(P, Q) ≥ 2∥P − Q∥²_TV. Many choices for the set P are possible here so long as P is closed for the variation distance. As particular examples, if P consists of the collection of all distributions on X, then the KL-divergence is complete. Similarly, if P consists of all distributions on X absolutely continuous with respect to some base measure ν, Definition 19.1 is satisfied as well (again, see Lemma A.4.4 in Appendix A.4).
The KL-divergence corresponds to the logarithmic loss ℓ(Q, x) = −log q(x), as we have seen several times. Example 14.2.11 treats the case of general, potentially non-discrete, spaces X. 3
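A quick numerical sanity check of Pinsker's inequality on randomly drawn pmfs (an illustration of ours, not part of the notes):

```python
import math
import random

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def tv(p, q):
    # total variation distance: half the L1 distance between pmfs
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

def rand_pmf(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [wi / s for wi in w]

random.seed(0)
for _ in range(1000):
    p, q = rand_pmf(5), rand_pmf(5)
    assert kl(p, q) >= 2 * tv(p, q) ** 2 - 1e-12  # Pinsker's inequality
```
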
Note that the divergence D(P, Q) = D_kl(P||Q) and saddle objective L(Q, π) = ∫ D_kl(P_θ||Q) dπ(θ) are well-defined regardless of the domain X, so that the redundancy/capacity theorem holds so long
as we can verify the lower semicontinuity (A). For this, we have the following observation, which
is a corollary of the Donsker-Varadhan representation of the KL-divergence (Theorem 6.1.1):
Corollary 19.6.3 (Semicontinuity of KL-divergence). Let ∥Pn − P ∥TV and ∥Qn − Q∥TV → 0.
Then lim inf n Dkl (Pn ||Qn ) ≥ Dkl (P ||Q).
Proof Let g be a simple function, that is, satisfying g(x) = Σ_{i=1}^k α_i 1{x ∈ A_i} for scalars α_i
and measurable sets Ai . Then clearly EPn [g] → EP [g] and EQn [eg ] → EQ [eg ] by the convergence in
total variation. So
Dkl (Pn ||Qn ) ≥ EPn [g] − log EQn [eg ] → EP [g] − log EQ [eg ].
As g was arbitrary, we may take a supremum over such simple functions to achieve the KL-
divergence on the right.
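The mechanism of the proof is easy to see numerically: each test function g supplies the lower bound E_P[g] − log E_Q[e^g], and the choice g = log(p/q) attains the KL divergence exactly. A finite-alphabet sketch of ours:

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def dv_lower_bound(p, q, g):
    """Donsker-Varadhan bound: E_P[g] - log E_Q[e^g] <= D_kl(P||Q) for bounded g."""
    ep = sum(pi * gi for pi, gi in zip(p, g))
    eq = sum(qi * math.exp(gi) for qi, gi in zip(q, g))
    return ep - math.log(eq)

p = [0.6, 0.3, 0.1]
q = [0.2, 0.3, 0.5]
# the optimal g = log(p/q) attains the KL divergence; any other g lower-bounds it
g_opt = [math.log(pi / qi) for pi, qi in zip(p, q)]
g_simple = [1.0, 0.0, -1.0]  # a "simple function" as in the proof
print(dv_lower_bound(p, q, g_opt), kl(p, q))  # equal
print(dv_lower_bound(p, q, g_simple))         # a weaker lower bound
```
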
Combining this corollary with the discussion in Example 19.6.2, we see we have satisfied desider-
ata (A) and (B), and have thus proved the redundancy/capacity duality in Corollary 19.3.3.
Convergence in distribution Let X be a metric space (we leave the metric implicit). Recall that a sequence of distributions¹ P_n on X converges in distribution to a distribution P, which we denote P_n ⇝_d P, if for every bounded continuous function f : X → R,

E_{P_n}[f(X)] → E_P[f(X)].
To distinguish this from Definition 19.1 and highlight the necessity of considering convergence
in distribution, consider predicting cumulative distributions. As in forecasting problems, natural
losses include the cumulative ranked probability score (CRPS) that we discuss in Examples 14.2.6
and 14.2.10.
For distributions P and Q on R^d, let

D(P, Q) := ∫ (P(X ⪯ t) − Q(X ⪯ t))² dt

be the average squared distance between their cumulative distribution functions (CDFs). Let P be the collection of all distributions on a compact set T ⊂ R^d. Then Proposition A.4.7 in Appendix A.4 shows the (more or less) standard result that this divergence is complete for convergence in distribution: we have D(P_n, P) → 0 if and only if P_n ⇝_d P, and P_n is Cauchy for the Bregman divergence if and only if it has a limit P_n ⇝_d P. That is, it satisfies Definition 19.2.
In this case, the divergence corresponds to a (potentially) multivariate version of the CRPS loss ℓ_crps(Q, y) := ∫ (Q(Y ⪯ t) − 1{y ⪯ t})² dt, which a minor generalization of Examples 14.2.6 and 14.2.10 makes clear. 3
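To make the loss concrete, the following sketch (ours; the Riemann-sum discretization and the function name are illustrative, assuming a one-dimensional Q with finite support) evaluates ℓ_crps directly:

```python
def crps(q_pmf, support, y, grid):
    """CRPS l(Q, y) = integral of (Q(X <= t) - 1{y <= t})^2 dt, computed by a
    Riemann sum for a distribution Q with finite support (illustrative sketch)."""
    def cdf(t):
        return sum(p for s, p in zip(support, q_pmf) if s <= t)
    dt = grid[1] - grid[0]
    return sum((cdf(t) - (1.0 if y <= t else 0.0)) ** 2 * dt for t in grid)

grid = [i * 0.01 for i in range(-300, 301)]          # covers [-3, 3]
exact_hit = crps([1.0], [0.0], y=0.0, grid=grid)     # point mass on the outcome
miss = crps([1.0], [1.0], y=0.0, grid=grid)          # point mass one unit away
print(exact_hit, miss)  # approximately 0 and 1: CRPS penalizes distance to y
```
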
Inspecting Definition (19.2) and Example 14.2.10, we see that both of the desiderata (A) and (B) hold: the semicontinuity follows because for Q_n ⇝_d Q,

by the triangle inequality and the example. The completeness we have shown.
¹ We shall assume all probabilities are Borel measures, meaning they are defined on the Borel sets.
i. it is strictly proper
ii. for the negative generalized entropy Ω in its representation (19.6.7), either
By our discussion so far, this is precisely what is needed for a non-informal version of Theorem 19.3.2. Recalling the definition (19.3.5) of the saddle objective

L(Q, π) := ∫ (E_{P_θ}[ℓ(Q, X)] − E_{P_θ}[ℓ(P_θ, X)]) dπ(θ),

sup_{θ∈Θ} (E_{P_θ}[ℓ(Q⋆, X)] − E_{P_θ}[ℓ(P_θ, X)]) = C_ℓ(Π(Θ)) = sup_{π∈Π} I_ℓ(π; X).
Lastly, if π ⋆ achieves the capacity Cℓ (Π), then Q⋆ = Pπ⋆ , and (Pπ⋆ , π ⋆ ) is a saddle point:
Clearly, Theorem 19.6.5 shows that we have the general regret/capacity duality (19.3.7).
19.8 Exercises
JCD Comment: A few exercise ideas: 1. General corollary for robust Bayes things,
building off of Theorem 19.6.5. 2. Show how regret for log-loss means we get regret for
other (bounded) losses by total variation. Originally had that in the book; can be just
an exercise now.
JCD Comment: Do exercises with different entropies, e.g., Ω(p) = −Σ_x √p(x) or other power-type functions. These are Legendre.
Exercise 19.1 (Minimax redundancy and different loss functions): In this question, we consider expected losses under the Bernoulli distribution. Assume that X_i ∼ Bernoulli(p), i.i.d., meaning that X_i = 1 with probability p and X_i = 0 with probability 1 − p. We consider four different loss functions, and their associated expected regret, for measuring the accuracy of our predictions of such X_i. For each of the four choices below, we prove expected regret bounds on

Red_n(θ̂, P, ℓ) := Σ_{i=1}^n E_P[ℓ(θ̂(X_1^{i−1}), X_i)] − inf_θ Σ_{i=1}^n E_P[ℓ(θ, X_i)], (19.8.1)

where θ̂ is a predictor based on X_1, . . . , X_{i−1} at time i. Define S_i = Σ_{j=1}^i X_j to be the partial sum up to time i. For each of parts (a)–(c), at time i use the predictor

θ̂_i = θ̂(X_1^{i−1}) = (S_{i−1} + 1/2)/i.
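To see this predictor in action (an illustration of ours: the exercise's specific losses are given in parts (a)–(c), so here we simulate with the log loss, under which the add-1/2 predictor coincides with the Krichevsky–Trofimov mixture), one can compute its cumulative regret against the best fixed parameter in hindsight:

```python
import math
import random

def log_loss_regret(xs):
    """Cumulative log-loss regret of the add-1/2 predictor
    theta_i = (S_{i-1} + 1/2)/i against the best fixed theta in hindsight."""
    n, s, loss = len(xs), 0, 0.0
    for i, x in enumerate(xs, start=1):
        theta = (s + 0.5) / i          # the predictor from the exercise
        loss += -math.log(theta if x == 1 else 1.0 - theta)
        s += x
    k = sum(xs)
    best = 0.0                          # best fixed theta is the empirical mean
    if 0 < k < n:
        t = k / n
        best = -(k * math.log(t) + (n - k) * math.log(1.0 - t))
    return loss - best

random.seed(1)
xs = [1 if random.random() < 0.3 else 0 for _ in range(2000)]
regret = log_loss_regret(xs)
print(regret)  # nonnegative and O(log n) for this predictor
```
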
(d) Extra credit: Show that there is a numerical constant c > 0 such that for any procedure θ̂, the worst-case redundancy sup_{p∈[0,1]} Red_n(θ̂, Bernoulli(p), ℓ) ≥ c√n for the absolute loss ℓ in part (c). Give a strategy attaining this redundancy.
Exercise 19.2: Fill in the details in the calculations for Example 19.5.3.
Exercise 19.3 (Strong versions of redundancy): Assume that for a given θ ∈ Θ we draw X_1^n ∼ P_θ. We define the Bayes redundancy for a family of distributions P = {P_θ}_{θ∈Θ} as

C_n := inf_Q ∫ D_kl(P_θ||Q) dπ(θ) = I_π(T; X_1^n),
Hint: Introduce the random variable Z to be 1 if the random variable T ∈ B_ϵ and 0 otherwise, then use that Z → T → X_1^n forms a Markov chain, and expand the mutual information. For part (b), the inequality ((1−x)/x) log(1/(1−x)) ≤ 1 for all x ∈ [0, 1] may be useful.
Exercise 19.4 (Mixtures are as good as point distributions): Let P be a Laplace(λ) distribution on R, meaning that X ∼ P has density

p(x) = (λ/2) exp(−λ|x|).

Assume that X_1, . . . , X_n ∼ P, i.i.d., and let P^n denote the n-fold product of P. In this problem, we compare the predictive performance of distributions from the normal location family P = {N(θ, σ²) : θ ∈ R} with the mixture distribution Q_π over P defined by the normal prior distribution N(µ, τ²), that is, π(θ) = (2πτ²)^{−1/2} exp(−(θ − µ)²/2τ²).
(a) Let Pθ,Σ be the multivariate normal distribution with mean θ ∈ Rn and covariance Σ ∈ Rn×n .
What is Dkl (P n ||Pθ,Σ )?
(b) Show that inf θ∈Rn Dkl (P n ||Pθ,Σ ) = Dkl (P n ||P0,Σ ), that is, the mean-zero normal distribution
has the smallest KL-divergence from the Laplace distribution.
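For intuition on parts (a) and (b), the univariate analogue can be checked numerically: the sketch below (ours, with an illustrative Riemann-sum quadrature rather than the exercise's closed-form computation) integrates D_kl(Laplace(λ)||N(θ, σ²)) and confirms that θ = 0 minimizes it.

```python
import math

def kl_laplace_normal(lam, theta, sigma2, step=1e-3, lim=30.0):
    """Riemann-sum approximation of D_kl(Laplace(lam) || N(theta, sigma2))
    on [-lim, lim]; the tails beyond +/- lim are negligible for this sketch."""
    total = 0.0
    for i in range(int(2 * lim / step)):
        t = -lim + i * step
        log_p = math.log(0.5 * lam) - lam * abs(t)
        log_q = -0.5 * math.log(2 * math.pi * sigma2) - (t - theta) ** 2 / (2 * sigma2)
        total += math.exp(log_p) * (log_p - log_q) * step
    return total

# the mean-zero normal is closest to the symmetric Laplace in KL divergence
vals = {theta: kl_laplace_normal(1.0, theta, 2.0) for theta in (-1.0, -0.5, 0.0, 0.5, 1.0)}
print(vals)  # minimized at theta = 0
```
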
(c) Let Q_n^π be the mixture of the n-fold products in P, that is, Q_n^π has density

q_n^π(x_1^n) = ∫_{−∞}^∞ π(θ) p_θ(x_1) ··· p_θ(x_n) dθ,
(d) Show that the redundancy of Q_n^π under the distribution P is asymptotically nearly as good as the redundancy of any P_θ ∈ P, the normal location family (so P_θ has density p_θ(x) = (2πσ²)^{−1/2} exp(−(x − θ)²/2σ²)). That is, show that

sup_{θ∈R} E_P[log(1/q_n^π(X_1^n)) − log(1/p_θ(X_1^n))] = O(log n)

for any prior variance τ² > 0 and any prior mean µ ∈ R, where the big-Oh hides terms dependent on τ², σ², µ².
(e) Extra credit: Can you give an interesting condition under which such redundancy guarantees hold more generally? That is, using Proposition 19.5.4 in the notes, give a general condition under which

E_P[log(1/q^π(X_1^n)) − log(1/p_θ(X_1^n))] = o(n)

as n → ∞, for all θ ∈ Θ.
where H is the Shannon entropy w.r.t. the measure ν. For Q ≪ ν, we know that Q has density q w.r.t. ν, so sup_{P∈P_α^lin} E_P[−log q(X)] ≥ E_{P_θ}[log(1/q(X))] = D_kl(P_θ||Q) + H_ν(P_θ), which gives a gap unless D_kl(P_θ||Q) = 0.
Apply Theorem 14.4.7 to obtain that

inf_{Q≪ν} E_P[−log q(X)] = inf_{Q≪ν} D_kl(P||Q) + E_P[−log p(X)] = H(P) ≤ H(P_θ)
Part V
Appendices
Appendix A
This appendix collects several mathematical results and some of the more advanced mathematical treatment required for full proofs of the results in the book. It is not a core part of the book, but it does provide, for readers who wish to see it, the measure-theoretic rigor necessary for some of our results, dotting the appropriate i's and crossing the appropriate t's.
Proposition A.3.3 (An extended Scheffé's lemma). Let f_n : X → Y where (X, µ) is a measure space and Y is a Banach space. Assume f_n → f either in µ-measure or for µ-almost every x, and that lim sup_n ∫ ∥f_n(x)∥^p dµ(x) ≤ ∫ ∥f(x)∥^p dµ(x) < ∞. Then ∫ ∥f_n(x) − f(x)∥^p dµ(x) → 0.
Proof By convexity of the norm, ∥f_n(x) − f(x)∥^p ≤ 2^{p−1} ∥f_n(x)∥^p + 2^{p−1} ∥f(x)∥^p, so
for all bounded continuous f : X → R. There are many equivalent versions of convergence in
distribution, and the Portmanteau theorem provides several characterizations. (See, e.g., Billingsley
[30, Chapter 1.2] or van der Vaart and Wellner [186, Chapter 1.3].) Because we assume Pn and P
are probability distributions, which induce random elements Xn and X, we do not need to address
the measurability questions of van der Vaart and Wellner [186, Chapter 1.3].
Theorem A.4.1 (Portmanteau). Let (X, ρ) be a metric space and P_n, P be probability distributions on X. The following are all equivalent to the convergence in distribution P_n ⇝_d P.
(i) EPn [f (X)] → EP [f (X)] for all bounded, 1-Lipschitz continuous f : X → R.
(ii) For all upper semicontinuous f : X → R bounded from above, lim supn EPn [f (X)] ≤ EP [f (X)].
(iii) For all lower semicontinuous f : X → R bounded from below, lim inf n EPn [f (X)] ≥ EP [f (X)].
(vi) For all continuity sets A of P , meaning sets for which the boundary bd A = cl A\int A satisfies
P (bd A) = 0, limn Pn (A) = P (A).
If additionally X ⊂ R^d, then letting F_n(t) = P_n(X ⪯ t) and F(t) = P(X ⪯ t) denote the CDFs of P_n and P, then P_n ⇝_d P is equivalent to
(vii) F_n(t) → F(t) for all continuity points t of F.
The topology induced by convergence in distribution possesses several elegant properties, including compactness when the underlying space is compact. Prokhorov's theorem (see Billingsley [30,
Chapter 1.5] or van der Vaart and Wellner [186, Chapter 1.3]) makes this clear. For the theorem,
we require a family of probability measures defined on the same σ-algebra A, meaning that P ∈ P
are all mappings P : A → [0, 1]. Then P is tight if for all ϵ, there is a compact K such that
P(K) ≥ 1 − ϵ for all P ∈ P. The class P is sequentially compact if for any sequence P_n ∈ P, there is a subsequence P_{n(m)} and probability measure Q on (X, A) for which P_{n(m)} ⇝_d Q (though we need not have Q ∈ P).
Theorem A.4.2 (Prokhorov). Let P be a collection of measures on (X , A).
(i) If P is tight, then it is sequentially compact.
(ii) Assume that (X , ρ) is separable and complete. Then if P is sequentially compact, it is tight.
When X is separable and complete, we can also metrize convergence in distribution via the
Lévy-Prokhorov distance: for a set A ⊂ X, define its ϵ-enlargement A^ϵ := {x ∈ X | ρ(x, A) < ϵ}, where ρ(x, A) = inf_{y∈A} ρ(x, y). Then for probability distributions P, Q, define

d_prob(P, Q) := inf{ϵ > 0 | P(A) ≤ Q(A^ϵ) + ϵ and Q(A) ≤ P(A^ϵ) + ϵ for all Borel sets A}.
This metrizes convergence in distribution on separable complete metric spaces [30, Theorem 6.8].
Theorem A.4.3. Let X be separable and complete. Then
d_prob(P_n, P) → 0 if and only if P_n ⇝_d P.
Additionally, let P be the collection of all Borel probability measures on X . Then P is complete
and separable in the dprob metric.
ii. If additionally Pn are absolutely continuous with respect to a measure ν, then P is as well.
Proof We show claim i first. For each A ∈ F, we have Pn (A) ∈ [0, 1], and Pn (A) is Cauchy
in R and so has a limit we shall call P (A) (though we have not demonstrated it is a probability
distribution). It is immediate that this limit satisfies P (∅) = 0 and P (X ) = 1, and that it is finitely
additive. It remains to demonstrate that P is countably additive. For this, let A1 , A2 , . . . be a
collection of disjoint measurable sets, and let ϵ > 0 be arbitrary. For m ∈ N, define the tail sets
Bm = ∪i>m Ai . For all m ∈ N, finite additivity implies
P(∪_{i=1}^∞ A_i) = Σ_{i=1}^m P(A_i) + P(B_m).
sup_k sup_m |P_n(B_m) − P_{n+k}(B_m)| ≤ sup_k ∥P_n − P_{n+k}∥_TV ≤ ϵ
For any fixed n, the probability of the tail sets Pn (Bm ) → 0 as m → ∞. Take m large enough that
Pn (Bm ) ≤ ϵ. Then 0 ≤ P (Bm ) ≤ 2ϵ, so P satisfies
Σ_{i=1}^m P(A_i) ≤ Σ_{i=1}^m P(A_i) + P(B_m) = P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^m P(A_i) + 2ϵ
for large m. As ϵ > 0 was arbitrary, taking m → ∞ gives that P is indeed countably additive.
It remains to show that ∥Pn − P ∥TV → 0. Fix ϵ > 0 and let N be large enough that
∥Pn − Pn+k ∥TV ≤ ϵ for all n ≥ N and k ≥ 0. For any set A, we have Pn+k (A) → P (A), so
We now return to alternative metrics for convergence in distribution. For probability distributions P and Q on R^d, we define the squared L₂-distance of their cumulative distributions

D_cdf(P, Q) := ∫ (P(X ⪯ t) − Q(X ⪯ t))² dt.
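In one dimension D_cdf is the squared L₂ (Cramér) distance between CDFs, which the following sketch (ours, for finitely supported distributions, with an illustrative integration grid) computes directly:

```python
def dcdf(pmf_p, pmf_q, support, grid):
    """Squared L2 distance between CDFs, D_cdf(P, Q) = integral of
    (F_P(t) - F_Q(t))^2 dt, via a Riemann sum on a common finite support."""
    def cdf(pmf, t):
        return sum(w for s, w in zip(support, pmf) if s <= t)
    dt = grid[1] - grid[0]
    return sum((cdf(pmf_p, t) - cdf(pmf_q, t)) ** 2 * dt for t in grid)

support = [0.0, 1.0, 2.0]
grid = [i * 0.01 for i in range(-100, 301)]  # covers [-1, 3]
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
d = dcdf(p, q, support, grid)
print(d)  # the CDFs differ by 0.25 on [0, 2), so D_cdf is about 0.125
```
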
Lemma A.4.5. Let f : X → Y be any function between metric spaces (X, ρX ) and (Y, ρY ), and
let Bδ (x) = {z ∈ X | ρX (x, z) < δ} be the open δ-ball around x. Define the set
D_ϵ(f) := {x ∈ X | for all δ > 0 there exist y, z ∈ B_δ(x) s.t. ρ_Y(f(y), f(z)) ≥ ϵ}
of ϵ-discontinuities of f . Then Dϵ (f ) is closed, and the set
D(f ) := {x ∈ X | f is discontinuous at x}
is Borel-measurable.
Proof We show that Dϵ (f ) is closed. Let x∞ be a limit point of Dϵ (f ). If x∞ is isolated, meaning
that there is some δ0 > 0 such that (Bδ0 (x∞ ) \ {x∞ }) ∩ Dϵ (f ) = ∅, then clearly x∞ ∈ Dϵ (f ).
Otherwise, let δ > 0 be arbitrary. Then there is some x ∈ Dϵ (f ) with x ∈ Bδ (x∞ ) \ {x∞ } by
definition of the limit point, and because Bδ (x∞ ) \ {x∞ } is open, there exists δ0 > 0 such that
Bδ0 (x) ⊂ Bδ (x∞ ) \ {x∞ }. As x ∈ Dϵ (f ), there are certainly y, z ∈ Bδ0 (x) such that ρY (f (y), f (z)) ≥
ϵ. But then y, z ∈ Bδ (x∞ ), and as δ was arbitrary we have x∞ ∈ Dϵ (f ).
The Borel measurability of D(f ) follows because D(f ) = ∪n≥1 D1/n (f ) is the countable union of
closed sets.
Lemma A.4.6. Let F be the cumulative distribution function of X ∈ R^d. If d = 1, the set D(F) of discontinuities of F is countable, and if d > 1, it is Borel measurable and has measure 0.
Proof The set of discontinuity points of a one-dimensional CDF F is necessarily countable: for
ϵ > 0, the set Dϵ (F ) := {t ∈ R | F (t) ≥ lim supδ↓0 F (t − δ) + ϵ} has cardinality at most 1/ϵ < ∞, so
D(F) := {t ∈ R | lim_{δ↓0} F(t − δ) < F(t)} = ∪_{n∈N} D_{1/n}(F)
is countable.
Let F be the cumulative distribution function of X ∈ Rd and F1 , . . . , Fd be the (marginal)
cumulative distributions of X1 , . . . , Xd . Let t, t′ ∈ Rd , and define the elementwise maximum t ∨ t′ =
[max{tj , t′j }]dj=1 and similarly the elementwise minimum t ∧ t′ . Then by monotonicity of the CDFs,
so a point t is a discontinuity of F only if it is a discontinuity for at least one of the marginal CDFs
Fj . That is, we have shown that the set D(F ) of discontinuities of F satisfies
D(F) ⊂ ∪_{j=1}^d (R^{j−1} × D(F_j) × R^{d−j}).
D(F ) is Borel measurable by Lemma A.4.5. If λ denotes d-dimensional Lebesgue measure, mono-
tonicity of λ then gives
λ(D(F)) ≤ Σ_{j=1}^d λ(R^{j−1} × D(F_j) × R^{d−j}) ≤ Σ_{j=1}^d Σ_{t∈D(F_j)} λ(R^{j−1} × {t} × R^{d−j}) = 0,
because the Lebesgue measure of any lower-dimensional subset of Rd is 0. Lebesgue and Borel
measures agree on the Borel sets.
Proposition A.4.7. Let T be a compact subset of R^d. Then D_cdf is complete for convergence in distribution over T, and P_n ⇝_d P if and only if D_cdf(P_n, P) → 0.
Proof We first show that if P_n ⇝_d P, then D_cdf(P_n, P) → 0. To see this, recall from Lemma A.4.6 that the points of discontinuity of any cumulative distribution function t ↦ P(X ⪯ t) have measure 0 (and are also measurable). Thus, by definition of convergence in distribution, the (Lebesgue) measure of t for which P_n(X ⪯ t) ̸→ P(X ⪯ t) is 0. Now, we have a sequence of functions F_n(t) := P_n(X ⪯ t) and F(t) := P(X ⪯ t), where F_n(t) → F(t) for almost all t. Then because (F_n(t) − F(t))² ≤ 1{t ∈ T} and T is compact (so that ∫ 1{t ∈ T} dt = Vol(T) < ∞), Lebesgue's dominated convergence theorem implies D_cdf(P_n, P) = ∫ (F_n(t) − F(t))² dt → 0.
Now we show that if D_cdf(P_n, P) → 0, then P_n ⇝_d P. By Prokhorov's theorem (Thm. A.4.2),
any subsequence Pn(m) has a further subsequence that converges to some limit distribution; call
this Q. Then by the triangle inequality,
Dcdf (P, Q)1/2 ≤ Dcdf (P, Pn(m) )1/2 + Dcdf (Pn(m) , Q)1/2 → 0,
so Q = P . Using the standard topological result that if for a sequence Pn , every subsequence has
a further subsequence converging to P , then Pn converges to P , we have the result.
Lastly, we show completeness. Let P_n be a Cauchy sequence for D_cdf. Then by Prokhorov's theorem, because T is compact, for every subsequence of P_n there is a further subsequence for which P_{n(m)} ⇝_d Q for some probability distribution Q. Let Q₀, Q₁ be two subsequential limits; we show that Q₀ = Q₁. To see this, note that for each n, m ∈ N,
show that Q0 = Q1 . To see this, note that for each n, m ∈ N,
Dcdf (Q0 , Q1 )1/2 ≤ Dcdf (Q0 , Pn )1/2 + Dcdf (Pn , Pm )1/2 + Dcdf (Pm , Q1 )1/2 .
Now, choose n and m along subsequences that, respectively, satisfy P_n ⇝_d Q₀ and P_m ⇝_d Q₁. Then
the first and last terms above converge to 0 by the first part of the proof, and by assumption that
Pn is Cauchy, the middle term converges to 0. So Dcdf (Q0 , Q1 ) = 0, and Q0 = Q1 .
and as (∂/∂x)(x log x − x) = log x, we obtain
Appendix B
Convex Analysis
In this appendix, we review several results in convex analysis that are useful for our purposes. We
give only a cursory study here, identifying the basic results and those that will be of most use to
us; the field of convex analysis as a whole is vast. The study of convex analysis and optimization
has become very important practically in the last forty to fifty years for a few reasons, the most
important of which is probably that convex optimization problems—those optimization problems
in which the objective and constraints are convex—are tractable, while many others are not. We
do not focus on optimization ideas here, however, building only some analytic tools that we will
find useful. We borrow most of our results from Hiriart-Urruty and Lemaréchal [111], focusing
mostly on the finite-dimensional case (though we present results that apply in infinite dimensional
cases with proofs that extend straightforwardly, and we do not specify the domains of our functions
unless necessary), as we require no results from infinite-dimensional analysis.
In addition, we abuse notation and assume that the range of any function is the extended real
line, meaning that if f : C → R we mean that f (x) ∈ R ∪ {−∞, +∞}, where −∞ and +∞ are
infinite and satisfy a + ∞ = +∞ and a − ∞ = −∞ for any a ∈ R. However, we assume throughout
and without further mention that our functions are proper, meaning that f (x) > −∞ for all x, as
this allows us to avoid annoying pathologies.
Definition B.1. A set C is convex if for all x, y ∈ C and all λ ∈ [0, 1],

λx + (1 − λ)y ∈ C.
An important restriction of convex sets is to closed convex sets, those convex sets that are, well,
closed.
JCD Comment: Picture
We now consider two operations that extend sets, convexifying them in nice ways.
Definition B.2. The affine hull of a set C is the smallest affine set containing C. That is,
aff(C) := {Σ_{i=1}^k λ_i x_i : k ∈ N, x_i ∈ C, λ ∈ R^k, Σ_{i=1}^k λ_i = 1}.
An almost immediate associated result is that the convex hull of a set is equal to the set of all
convex combinations of points in the set.
Proposition B.1.1. Let C be an arbitrary set. Then
Conv(C) = {Σ_{i=1}^k λ_i x_i : k ∈ N, x_i ∈ C, λ ∈ R^k_+, Σ_{i=1}^k λ_i = 1}.
Proof Call T the set on the right hand side of the equality in the proposition. Then T ⊃ C
is clear, as we may simply take λ1 = 1 and vary x ∈ C. Moreover, the set T ⊂ Conv(C), as any
convex set containing C must contain all convex combinations of its elements; similarly, any convex
set S ⊃ C must have S ⊃ T .
Thus if we show that T is convex, then we are done. Take any two points x, y ∈ T. Then x = Σ_{i=1}^k α_i x_i and y = Σ_{i=1}^l β_i y_i for x_i, y_i ∈ C. Fix λ ∈ [0, 1]. Then (1 − λ)β_i ≥ 0 and λα_i ≥ 0 for all i,

λ Σ_{i=1}^k α_i + (1 − λ) Σ_{i=1}^l β_i = λ + (1 − λ) = 1,
and λx + (1 − λ)y is a convex combination of the points xi and yi weighted by λαi and (1 − λ)βi ,
respectively. So λx + (1 − λ)y ∈ T and T is convex.
We also give one more definition, which is useful for dealing with some pathological cases in convex analysis, as it allows us to assume many sets are full-dimensional.
Definition B.4. The relative interior of a set C is the interior of C relative to its affine hull, that
is,
relint(C) := {x ∈ C : B(x, ϵ) ∩ aff(C) ⊂ C for some ϵ > 0} ,
where B(x, ϵ) := {y : ∥y − x∥ < ϵ} denotes the open ball of radius ϵ centered at x.
An example may make Definition B.4 clearer.
Example B.1.2 (Relative interior of a disc): Consider the (convex) set
C = {x ∈ R^d : x₁² + x₂² ≤ 1, x_j = 0 for j ∈ {3, . . . , d}}.
The affine hull aff(C) = R2 × {0} = {(x1 , x2 , 0, . . . , 0) : x1 , x2 ∈ R} is simply the (x1 , x2 )-plane
in Rd , while the relative interior relint(C) = {x ∈ Rd : x21 + x22 < 1} ∩ aff(C) is the “interior”
of the 2-dimensional disc in Rd . 3
In finite dimensions, we may actually restrict the definition of the convex hull of a set C to
convex combinations of a bounded number (the dimension plus one) of the points in C, rather
than arbitrary convex combinations as required by Proposition B.1.1. This result is known as
Carathéodory’s theorem.
Theorem B.1.3. Let C ⊂ R^d. Then x ∈ Conv(C) if and only if there exist points x₁, . . . , x_{d+1} ∈ C and λ ∈ R^{d+1}_+ with Σ_{i=1}^{d+1} λ_i = 1 such that

x = Σ_{i=1}^{d+1} λ_i x_i.
Proof It is clear that if x can be represented as such a sum, then x ∈ Conv(C). Conversely,
Proposition B.1.1 implies that for any x ∈ Conv(C) we have
x = Σ_{i=1}^k λ_i x_i,  λ_i ≥ 0,  Σ_{i=1}^k λ_i = 1,  x_i ∈ C
for some λi , xi . Assume that k > d + 1 and λi > 0 for each i, as otherwise, there is nothing to prove.
Then we know that the points x_i − x₁ are certainly linearly dependent (as there are k − 1 > d of them), and we can find (not identically zero) values α₂, . . . , α_k such that Σ_{i=2}^k α_i(x_i − x₁) = 0. Let α₁ = −Σ_{i=2}^k α_i to obtain that we have both

Σ_{i=1}^k α_i x_i = 0 and Σ_{i=1}^k α_i = 0. (B.1.1)
Notably, the equalities (B.1.1) imply that at least one α_i > 0, and if we define λ∗ = min_{i:α_i>0} λ_i/α_i > 0, then setting λ′_i = λ_i − λ∗α_i we have
λ′_i ≥ 0 for all i,  Σ_{i=1}^k λ′_i = Σ_{i=1}^k λ_i − λ∗ Σ_{i=1}^k α_i = 1,  and  Σ_{i=1}^k λ′_i x_i = Σ_{i=1}^k λ_i x_i − λ∗ Σ_{i=1}^k α_i x_i = x.
But we know that at least one of the λ′i = 0, so that we could write x as a convex combination of
k − 1 elements. Repeating this strategy until k = d + 1 gives the theorem.
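The proof of Theorem B.1.3 is effectively an algorithm: repeatedly find an affine dependence among the active points and shift weight along it until some coefficient vanishes. A pure-Python sketch of this reduction (ours, not a robust numerical implementation; `null_vector` is a small Gaussian-elimination helper with illustrative tolerances):

```python
def null_vector(A):
    """Return a nonzero v with A v = 0, for an m x k matrix A with k > m,
    via Gaussian elimination (illustrative, with a fixed pivot tolerance)."""
    m, k = len(A), len(A[0])
    M = [row[:] for row in A]
    piv_cols, r = [], 0
    for c in range(k):
        if r >= m:
            break
        p = max(range(r, m), key=lambda i: abs(M[i][c]))
        if abs(M[p][c]) < 1e-12:
            continue  # no pivot in this column; it stays free
        M[r], M[p] = M[p], M[r]
        M[r] = [a / M[r][c] for a in M[r]]
        for i in range(m):
            if i != r:
                f = M[i][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        piv_cols.append(c)
        r += 1
    free = next(c for c in range(k) if c not in piv_cols)
    v = [0.0] * k
    v[free] = 1.0
    for row, c in enumerate(piv_cols):
        v[c] = -M[row][free]
    return v

def caratheodory_reduce(points, weights):
    """Reduce a convex combination of points in R^d to at most d + 1 atoms,
    mirroring the proof of the theorem (a sketch, not production code)."""
    d = len(points[0])
    w = list(map(float, weights))
    active = [i for i, wi in enumerate(w) if wi > 0]
    while len(active) > d + 1:
        # rows encode sum_i alpha_i x_i = 0 (d rows) and sum_i alpha_i = 0 (ones row)
        A = [[points[i][j] for i in active] for j in range(d)] + [[1.0] * len(active)]
        alpha = null_vector(A)  # sums to 0 and is nonzero, so some entry is positive
        lam = min(w[i] / a for i, a in zip(active, alpha) if a > 1e-12)
        for i, a in zip(active, alpha):
            w[i] = max(w[i] - lam * a, 0.0)
        active = [i for i in active if w[i] > 1e-12]
    return w

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
weights = [0.2] * 5
new_w = caratheodory_reduce(points, weights)
old_pt = [sum(wi * p[j] for wi, p in zip(weights, points)) for j in range(2)]
new_pt = [sum(wi * p[j] for wi, p in zip(new_w, points)) for j in range(2)]
print(new_w)  # at most d + 1 = 3 atoms remain, representing the same point
```
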
Observation B.1.4 is clear, as we have C ⊂ Conv(C), while any other convex S ⊃ C clearly satisfies
S ⊃ Conv(C). Secondly, we note that intersections preserve convexity.
Observation B.1.5. Let {C_α}_{α∈A} be an arbitrary collection of convex sets. Then

C = ∩_{α∈A} C_α

is convex, and if each C_α is closed, then C is closed.
The convexity property follows because if x1 ∈ C and x2 ∈ C, then clearly x1 , x2 ∈ Cα for all
α ∈ A, and moreover λx1 + (1 − λ)x2 ∈ Cα for all α and any λ ∈ [0, 1]. The closure property is
standard. In addition, we note that closing a convex set maintains convexity.
Observation B.1.6. Let C be convex. Then cl(C) is convex.
To see this, we note that if x, y ∈ cl(C) and xn → x and yn → y (where xn , yn ∈ C), then for any
λ ∈ [0, 1], we have λxn + (1 − λ)yn ∈ C and λxn + (1 − λ)yn → λx + (1 − λ)y. Thus we have
λx + (1 − λ)y ∈ cl(C) as desired.
Observation B.1.6 also implies the following result.
Observation B.1.7. Let D be an arbitrary set. Then

∩ {C closed : C ⊃ D, C is convex} = cl Conv(D).
Proof Let T denote the leftmost set. It is clear that T ⊂ cl Conv(D) as cl Conv(D) is a closed
convex set (by Observation B.1.6) containing D. On the other hand, if C ⊃ D is a closed convex
set, then C ⊃ Conv(D), while the closedness of C implies it also contains the closure of Conv(D).
Thus T ⊃ cl Conv(D) as well.
As our last consideration of operations that preserve convexity, we consider what is known as
the perspective of a set. To define this set, we need to define the perspective function, which, given
a point (x, t) ∈ R^d × R₊₊ (here R₊₊ = {t : t > 0} denotes strictly positive points), is defined as

pers(x, t) = x/t.
We have the following definition.
Definition B.5. Let C ⊂ Rd × R+ be a set. The perspective transform of C, denoted by pers(C),
is

pers(C) := { x/t : (x, t) ∈ C and t > 0 }.
This corresponds to taking all the points z ∈ C, normalizing them so their last coordinate is 1, and
then removing the last coordinate. (For more on perspective functions, see Boyd and Vandenberghe
[38, Chapter 2.3.3].)
It is interesting to note that the perspective of a convex set is convex. First, we note the
following.
Lemma B.1.9. Let C ⊂ Rd+1 be a compact line segment, meaning that C = {λx + (1 − λ)y : λ ∈
[0, 1]}, where xd+1 > 0 and yd+1 > 0. Then pers(C) = {λ pers(x) + (1 − λ) pers(y) : λ ∈ [0, 1]}.
Proof Let λ ∈ [0, 1]. Then

pers(λx + (1 − λ)y) = (λx1:d + (1 − λ)y1:d) / (λxd+1 + (1 − λ)yd+1)
 = (λxd+1 / (λxd+1 + (1 − λ)yd+1)) · (x1:d / xd+1) + ((1 − λ)yd+1 / (λxd+1 + (1 − λ)yd+1)) · (y1:d / yd+1)
 = θ pers(x) + (1 − θ) pers(y),

where x1:d and y1:d denote the vectors of the first d components of x and y, respectively, and

θ = λxd+1 / (λxd+1 + (1 − λ)yd+1) ∈ [0, 1].

Sweeping λ from 0 to 1 sweeps θ through [0, 1], giving the result.
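Lemma B.1.9 is easy to check numerically; the following sketch (Python with numpy, illustrative only) verifies that pers maps the segment between x and y onto the segment between pers(x) and pers(y), with the reparametrization θ given in the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.standard_normal(3), 2.0)   # point in R^4 with x_{d+1} = 2 > 0
y = np.append(rng.standard_normal(3), 0.5)   # point with y_{d+1} = 0.5 > 0
pers = lambda z: z[:-1] / z[-1]              # the perspective map

for lam in np.linspace(0.0, 1.0, 11):
    z = lam * x + (1 - lam) * y
    theta = lam * x[-1] / (lam * x[-1] + (1 - lam) * y[-1])
    # pers of the lam-combination equals the theta-combination of the images.
    assert np.allclose(pers(z), theta * pers(x) + (1 - theta) * pers(y))
```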
We now consider some properties of convex sets, showing that (1) they have nice separation
properties—we can put hyperplanes between them—and (2) this allows several interesting represen-
tations of convex sets. We begin with the separation properties, developing them via the existence
of projections. Interestingly, this existence of projections does not rely on any finite-dimensional
structure, and can even be shown to hold in arbitrary Banach spaces (assuming the axiom of
choice) [141]. We provide the results in a Hilbert space, meaning a complete vector space for which
there exists an inner product ⟨·, ·⟩ and associated norm ∥·∥ given by ∥x∥2 = ⟨x, x⟩. We first note
that projections exist.
Theorem B.1.11 (Projections). Let C be a closed convex set. Then for any x, there exists a
unique point πC (x) minimizing ∥y − x∥ over y ∈ C. Moreover, this point is characterized by the
inequality
⟨πC (x) − x, y − πC (x)⟩ ≥ 0 for all y ∈ C. (B.1.2)
Proof The existence and uniqueness of the projection follows from the parallelogram identity,
that is, that for any x, y we have ∥x − y∥2 + ∥x + y∥2 = 2(∥x∥2 + ∥y∥2 ), which follows by noting
that ∥x + y∥2 = ∥x∥2 + ∥y∥2 + 2⟨x, y⟩. Indeed, let {yn } ⊂ C be a sequence such that
∥yn − x∥ → inf_{y∈C} ∥y − x∥ =: p⋆
as n → ∞, where p⋆ is the infimal value. We show that yn is Cauchy, so that there exists a (unique)
limit point of the sequence. Fix ϵ > 0 and let N be such that n ≥ N implies ∥yn − x∥2 ≤ p2⋆ + ϵ2 .
Let m, n ≥ N . Then by the parallelogram identity,

∥yn − ym∥² = ∥(x − yn ) − (x − ym )∥² = 2[∥x − yn∥² + ∥x − ym∥²] − ∥(x − yn ) + (x − ym )∥².

Noting that

(x − yn ) + (x − ym ) = 2(x − (yn + ym )/2)   and   (yn + ym )/2 ∈ C (by convexity of C),

we have

∥x − yn∥² ≤ p⋆² + ϵ²,   ∥x − ym∥² ≤ p⋆² + ϵ²,   and   ∥(x − yn ) + (x − ym )∥² = 4 ∥x − (yn + ym )/2∥² ≥ 4p⋆².

In particular, we have ∥yn − ym∥² ≤ 2[(p⋆² + ϵ²) + (p⋆² + ϵ²)] − 4p⋆² = 4ϵ², that is, ∥yn − ym∥ ≤ 2ϵ. As ϵ > 0 was arbitrary, this completes the proof of the first statement of the theorem.
To see the second result, assume that z is a point satisfying inequality (B.1.2), that is, such
that
⟨z − x, y − z⟩ ≥ 0 for all y ∈ C.
Then we have

∥z − x∥² = ⟨z − x, z − x⟩ = ⟨z − x, z − y⟩ + ⟨z − x, y − x⟩ ≤ ⟨z − x, y − x⟩ ≤ ∥z − x∥ ∥y − x∥,

where the first inequality follows because ⟨z − x, z − y⟩ = −⟨z − x, y − z⟩ ≤ 0 by assumption, and the second is the Cauchy–Schwarz inequality. Dividing both sides by ∥z − x∥ (the case z = x being trivial) gives ∥z − x∥ ≤ ∥y − x∥ for all y ∈ C, so that z = πC (x). Conversely, for any y ∈ C and t ∈ (0, 1], convexity of C gives (1 − t)πC (x) + ty ∈ C, whence

∥πC (x) − x∥² ≤ ∥(1 − t)πC (x) + ty − x∥² = ∥πC (x) − x + t(y − πC (x))∥²
 = ∥πC (x) − x∥² + 2t⟨πC (x) − x, y − πC (x)⟩ + t² ∥y − πC (x)∥².

Subtracting the projection value ∥πC (x) − x∥² from both sides and dividing by t > 0, we have 0 ≤ 2⟨πC (x) − x, y − πC (x)⟩ + t ∥y − πC (x)∥²; taking t ↓ 0 gives inequality (B.1.2).
A useful consequence of the characterization (B.1.2) is that for any y ∈ C, ∥πC (x) − y∥ ≤ ∥x − y∥.
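As a numerical sanity check of Theorem B.1.11 (a sketch in Python with numpy; the closed convex set here is the Euclidean unit ball, an illustrative choice for which the projection has a closed form):

```python
import numpy as np

def proj_ball(x, radius=1.0):
    """Euclidean projection onto the closed ball {y : ||y|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

rng = np.random.default_rng(2)
x = 3.0 * rng.standard_normal(5)   # a point (typically) outside the ball
px = proj_ball(x)

for _ in range(1000):
    y = proj_ball(rng.standard_normal(5))       # an arbitrary point of C
    assert np.dot(px - x, y - px) >= -1e-9      # characterization (B.1.2)
    assert np.linalg.norm(px - y) <= np.linalg.norm(x - y) + 1e-9
```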
In addition, we can show the existence of supporting hyperplanes, that is, hyperplanes “sepa-
rating” the boundary of a convex set from itself.
Theorem B.1.14. Let C be a convex set and x ∈ bd(C), where bd(C) = cl(C) \ int C. Then there
exists a non-zero vector v such that ⟨v, x⟩ ≥ supy∈C ⟨v, y⟩.
Proof Let D = cl(C) be the closure of C and let xn ̸∈ D be a sequence of points such that
xn → x. Let us define the sequence of separating vectors sn = xn − πD (xn ) and the normalized
version vn = sn / ∥sn ∥. Notably, we have ⟨vn , xn ⟩ > supy∈C ⟨vn , y⟩ for all n. Now, the sequence
{vn } ⊂ {v : ∥v∥ = 1} belongs to a compact set.1 Passing to a subsequence if necessary, let us
assume w.l.o.g. that vn → v with ∥v∥ = 1. Then by a standard limiting argument for the xn → x,
we have
⟨v, x⟩ ≥ ⟨v, y⟩ for all y ∈ C,
which was our desired result.
Theorem B.1.14 gives us an important result. In particular, let D be an arbitrary set, and let
C = cl Conv(D) be the closure of the convex hull of D, which is the smallest closed convex set
containing D. Then we can write C as the intersection of all the closed half-spaces containing D;
this is, in some sense, the most useful “convexification” of D. Recall that a closed half-space H is
defined with respect to a vector v and real r ∈ R as

H := {x : ⟨v, x⟩ ≤ r}.
Before stating the theorem, we remark that by Observation B.1.6, the intersection of all the closed
convex sets containing a set D is equal to the closure of the convex hull of D.
¹ In infinite dimensions, this may not be the case. But we can apply the Banach–Alaoglu theorem, which states that, as the vn are linear operators, the sequence is weak-* compact, so that there is a vector v with ∥v∥ ≤ 1 and a subsequence m(n) ⊂ N such that ⟨v_{m(n)}, x⟩ → ⟨v, x⟩ for all x.
Theorem B.1.15. Let D be an arbitrary set. Then

cl Conv(D) = ∩_{H⊃D} H,    (B.1.3)

where H denotes a closed half-space containing D. Moreover, for any closed convex set C,

C = ∩_{x∈bd(C)} Hx ,    (B.1.4)

where Hx denotes a closed half-space containing C and supporting it at x ∈ bd(C).
Proof We begin with the proof of the second result (B.1.4). Indeed, by Theorem B.1.14, we
know that at each point x on the boundary of C, there exists a non-zero supporting hyperplane v,
so that the half-space
Hx,v := {y : ⟨v, y⟩ ≤ ⟨v, x⟩} ⊃ C
is closed, convex, and contains C. We clearly have the containment C ⊂ ∩x∈bd(C) Hx . Now let
x0 ̸∈ C; we show that x0 ̸∈ ∩x∈bd(C) Hx . As x0 ̸∈ C, the projection πC (x0 ) of x0 onto C satisfies
⟨x0 − πC (x0 ), x0 ⟩ > supy∈C ⟨x0 − πC (x0 ), y⟩ by Corollary B.1.12. Moreover, letting v = x0 − πC (x0 ),
the hyperplane
hx0 ,v := {y : ⟨y, v⟩ = ⟨πC (x0 ), v⟩}
is clearly supporting to C at the point πC (x0 ). The half-space {y : ⟨y, v⟩ ≤ ⟨πC (x0 ), v⟩} thus
contains C and does not contain x0 , implying that x0 ̸∈ ∩x∈bd(C) Hx .
Now we show the first result (B.1.3). Let C be the closed convex hull of D and T = ∩H⊃D H.
By a trivial extension of the representation (B.1.4), we have that C = ∩H⊃C H, where H denotes
any halfspace containing C. As C ⊃ D, we have that H ⊃ C implies H ⊃ D, so that
T = ∩_{H⊃D} H ⊂ ∩_{H⊃C} H = C.
On the other hand, as C = cl Conv(D), Observation B.1.7 implies that any closed set containing
D contains C. As a closed halfspace is convex and closed, we have that H ⊃ D implies H ⊃ C,
and thus T = C as desired.
One elegant corollary of the closure operations and supporting hyperplanes for convex sets is
that we can approximate convex hulls by expectations of vectors, even for potentially uncountable
collections. By combining the strong law of large numbers with our descriptions of the convex hull,
we have the following result.
Corollary B.1.16. Let X = {xα }α∈A ⊂ Rd be an arbitrary collection of vectors and let P be the
collection of probability distributions on elements of A for which EP [∥xA ∥] < ∞, where A ∼ P .
Then
Conv(X) = {EP [xA ] | P ∈ P} ⊂ cl Conv(X).
If additionally X is compact (closed and bounded), then cl Conv(X) = {EP [xA ] | P ∈ P}.
Proof Let C = {EP [xA ] | P ∈ P} be the middle set. We show that Conv(X) ⊂ C ⊂ cl Conv(X).
Taking any P, Q ∈ P, we have λP + (1 − λ)Q ∈ P for all λ ∈ [0, 1], so that C is convex, giving Conv(X) ⊂ C. For the second inclusion, fix any P ∈ P and draw A1 , A2 , . . . , An iid from P . Then by the strong law of large numbers, x̄n := (1/n) Σ_{i=1}^n xAi → EP [xA ] with probability 1, and so certainly there is a sequence of elements x̄n ∈ Conv(X) satisfying x̄n → EP [xA ], and EP [xA ] ∈ cl Conv(X).
To demonstrate that Conv(X) = C requires more work. We prove the result by induction on
the dimension. Consider the case that d = 1: then Conv(X) takes on one of three forms: the open
interval Conv(X) = (b0 , b1 ), the half-open interval Conv(X) = (b0 , b1 ] or Conv(X) = [b0 , b1 ), or
the closed interval Conv(X) = [b0 , b1 ]. In the last case, Conv(X) is compact so that Conv(X) =
cl Conv(X) and we are done. Consider the first two, and w.l.o.g. assume EP [xA ] = b0 while xα > b0
for all α ∈ A. But for a distribution P on A to yield EP [xA ] = b0 , it must be the case that
P (xA = b0 ) = 1, a contradiction.
Now consider the d-dimensional case, and let µ = EP [xA ] for shorthand. Suppose for the sake
of contradiction that that µ ∈ cl Conv(X) \ Conv(X). Then there is a non-zero vector v such that
⟨v, µ⟩ ≥ ⟨v, x⟩ for all x ∈ Conv(X). Letting b = ⟨v, µ⟩, we have ⟨v, xα ⟩ ≤ b for all α ∈ A. That is,
the hyperplane H = {x | ⟨v, x⟩ = b} separates µ from X, and the halfspace H− := {x | ⟨v, x⟩ ≤ b}
contains X. So the scalar values ⟨v, xα ⟩ ≤ b for α ∈ A, and EP [⟨v, xA ⟩] = ⟨v, µ⟩ = b implies that
⟨v, xA ⟩ = b with probability 1 over A ∼ P . In particular, the collection {xα | ⟨v, xα ⟩ = b} is
non-empty, so the sets H ∩ X and H ∩ Conv(X) are non-empty and of dimension d − 1. Induct
downwards.
The final claim is simply Observation B.1.8.
The second inclusion in Corollary B.1.16 can be strict even in one dimension: let xα = α for
α ∈ (0, 1), so that Conv(X) = X, and any distribution P on α yields EP [xA ] ∈ (0, 1).
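The strong-law argument in the proof of Corollary B.1.16 is easy to probe numerically (a Python/numpy sketch with a finite collection X, illustrative only): sample means of iid draws x_{A_i} approach EP [xA ], which by the corollary lies in Conv(X).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 2))    # a finite collection {x_alpha} in R^2
p = rng.dirichlet(np.ones(6))      # a distribution P on the index set A
mu = p @ X                         # E_P[x_A], a convex combination of rows of X

idx = rng.choice(6, size=200_000, p=p)
xbar = X[idx].mean(axis=0)         # strong law: xbar -> mu almost surely
assert np.linalg.norm(xbar - mu) < 0.02
```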
Such functions are important in that they give some of the first dualities between convex sets and
convex functions. As we see in the section to come, they also allow us to describe various first-order
smoothness properties of convex functions.
The main result we shall need on sublinear functions is that they can be defined by a dual
construction.
Proposition B.2.1. Let f be a closed sublinear function and define S := {s | ⟨s, x⟩ ≤ f (x) for all x}.
Then

f (x) = sup_{s∈S} ⟨s, x⟩.
Proof As f is closed convex, there exist affine functions minorizing f at each point in its domain
(Theorem B.3.3). That is, for some pair (s, t) ∈ Rd × R, we have ⟨s, x⟩ − t ≤ f (x) for all x ∈ Rd .
Because any closed convex function is the supremum of all affine functions minorizing it (Theorem B.3.7), we evidently have f (x) = sup {⟨s, x⟩ − t | ⟨s, ·⟩ − t minorizes f }. Now, if ⟨s, x⟩ − t ≤ f (x) for all x, then positive homogeneity of f gives λ⟨s, x⟩ − t ≤ f (λx) = λf (x) for all λ > 0; dividing by λ and letting λ ↑ ∞ yields ⟨s, x⟩ ≤ f (x), that is, s ∈ S, while taking x = 0 shows t ≥ 0, so that ⟨s, x⟩ − t ≤ ⟨s, x⟩. Each affine minorant is thus dominated by a linear minorant ⟨s, ·⟩ with s ∈ S, and f (x) = sup_{s∈S} ⟨s, x⟩ follows.
To any set S we can associate a particular sublinear function, the support function of S, defined as σS (x) := sup_{s∈S} ⟨s, x⟩.
This function is evidently a closed convex function—it is the supremum of linear functions—and is
positively homogeneous, so that it is sublinear. We thus immediately have the duality
Corollary B.2.2. Let f be a sublinear function. Then it is the support function of the closed
convex set
Sf := {s | ⟨s, x⟩ ≤ f (x) for all x ∈ Rd },
and hence if C is closed convex, then C = SσC = {s | ⟨s, x⟩ ≤ σC (x) for all x ∈ Rd }.
A few other consequences of the definition are immediate. We see that σS has dom σS = Rd if
and only if S is bounded: whenever ∥s∥ ≤ L for all s ∈ S, then σS (x) ≤ L ∥x∥. Conversely,
if dom σS = Rd then it is locally Lipschitz (Theorem B.3.4) and (by positive homogeneity) thus
globally Lipschitz, so we have ⟨s, x⟩ ≤ σS (x) ≤ L ∥x∥ for some L < ∞ and taking x = s/ ∥s∥
gives ∥s∥ ≤ L. As another consequence, we see that the support function of a set S is the support function of the closed convex hull of S:
Proposition B.2.3. Let S ⊂ Rd . Then σS = σConv S = σcl Conv S .
Proof Let C = Conv S, and let sn ∈ C be any sequence with ⟨sn , x⟩ → sup_{s∈C} ⟨s, x⟩. Then there exist sn,i ∈ S, i = 1, . . . , k(n), such that sn = Σ_{i=1}^{k(n)} λi sn,i for some λ ⪰ 0 with ⟨λ, 1⟩ = 1, which may change with n. But of course ⟨sn , x⟩ ≤ maxi ⟨sn,i , x⟩, and thus σS (x) ≥ σC (x); the inclusion S ⊂ C gives the reverse inequality σS (x) ≤ σC (x). To see that σC (x) = σcl C (x), note that for each ϵ > 0, for each s ∈ cl C there is s′ ∈ C with ∥s − s′∥ < ϵ. Then ⟨s, x⟩ ≤ ⟨s′, x⟩ + ϵ ∥x∥ and σcl C (x) ≤ σC (x) + ϵ ∥x∥. Take ϵ ↓ 0.
This proposition, coupled with Corollary B.2.2, shows that if sets S1 , S2 have identical support
functions, then they have identical closed convex hulls, and if they are closed convex, they are thus
identical.
Corollary B.2.4. Let S1 , S2 ⊂ Rd . If σS1 = σS2 , then cl Conv S1 = cl Conv S2 .
Proof By Proposition B.2.3, we have σSi = σcl Conv Si for each i, and Corollary B.2.2 shows that
if σC1 = σC2 for closed convex sets C1 and C2 , then C1 = C2 .
Corollary B.2.5. Let σ1 and σ2 be the support functions of the nonempty closed convex sets S1
and S2 . Then if t1 > 0 and t2 > 0,
t1 σ1 + t2 σ2 = σcl(t1 S1 +t2 S2 ) .
To see this, write

σ_{cl(t1 S1 + t2 S2)}(x) = σ_{t1 S1 + t2 S2}(x) = sup_{s1∈S1 , s2∈S2} ⟨t1 s1 + t2 s2 , x⟩,    (⋆)

the first equality (⋆) following from Proposition B.2.3. As the suprema run independently through their respective sets S1 , S2 , the latter quantity is evidently t1 sup_{s1∈S1} ⟨s1 , x⟩ + t2 sup_{s2∈S2} ⟨s2 , x⟩ = t1 σ1 (x) + t2 σ2 (x).
The final result is an immediate consequence of the result that if C is a compact convex set and
S is closed convex, then C + S is closed convex. That C + S is convex is immediate. To see that
it is closed, let xn ∈ C, yn ∈ S satisfy xn + yn → z. Then proceeding to a subsequence, we have
xn(m) → x∞ for some x∞ ∈ C, and thus yn(m) → z − x∞ , which is then necessarily in S. As the
subsequence xn(m) + yn(m) → x∞ + (z − x∞ ) ∈ C + S and xn(m) + yn(m) → z as well, this gives the
result.
Linear transformations of support functions are also calculable. In the result, recall that for a
matrix A and set S, the set AS = {As | s ∈ S}.
Proposition B.2.6. Let S ⊂ Rd and A ∈ Rm×d . Then σcl AS (x) = σS (A⊤ x).
Proof We have σAS (x) = sup_{s∈S} ⟨As, x⟩ = sup_{s∈S} ⟨s, A⊤x⟩ = σS (A⊤x). The closure operation changes nothing (Proposition B.2.3).
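For finite sets the support function is a maximum of inner products, and the identities above can be checked directly (a Python/numpy sketch; `sigma` is an illustrative helper, not notation from the notes):

```python
import numpy as np

def sigma(S, x):
    """Support function sigma_S(x) = max_{s in S} <s, x> for finite S (rows)."""
    return np.max(S @ x)

rng = np.random.default_rng(4)
S1 = rng.standard_normal((8, 3))
S2 = rng.standard_normal((5, 3))
A = rng.standard_normal((2, 3))
x2, x3 = rng.standard_normal(2), rng.standard_normal(3)

# Proposition B.2.6: sigma_{AS}(x) = sigma_S(A^T x).
assert np.isclose(sigma(S1 @ A.T, x2), sigma(S1, A.T @ x2))

# Corollary B.2.5 (finite case): sigma_{t1 S1 + t2 S2} = t1 sigma_1 + t2 sigma_2,
# where t1 S1 + t2 S2 is the Minkowski sum of the scaled sets.
t1, t2 = 0.7, 2.0
msum = ((t1 * S1)[:, None, :] + (t2 * S2)[None, :, :]).reshape(-1, 3)
assert np.isclose(sigma(msum, x3), t1 * sigma(S1, x3) + t2 * sigma(S2, x3))
```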
Lastly, we show how to use support functions to characterize whether sets have interiors. Recall
that for a set S ⊂ Rd , the affine hull aff(S) (Definition B.2) is the set of affine combinations of a
point in S, and the relative interior of S is its interior relative to its affine hull (Definition B.4). The following items characterize interiors via the support function σS .
(i) s ∈ int S if and only if ⟨s, x⟩ < σS (x) for all x ̸= 0.
(ii) s ∈ relint S if and only if ⟨s, x⟩ < σS (x) for all x with σS (x) + σS (−x) > 0.
(iii) int S is non-empty if and only if σS (x) + σS (−x) > 0 for all x ̸= 0.
Proof
(i) Because σS is positively homogeneous, an equivalent statement is that σS (x) > ⟨s, x⟩ for all
x ∈ Sd−1 = {x ∈ Rd | ∥x∥2 = 1}. If s ∈ int S, then there exists ϵ > 0 such that s + ϵx ∈ S for
all x ∈ Sd−1 , and so
σS (x) ≥ ⟨s + ϵx, x⟩ = ⟨s, x⟩ + ϵ,
(iii) Suppose int S is non-empty. Then s ∈ int S implies ⟨s, x⟩ < σS (x) for all x with ∥x∥ = 1.
Then σS (x) + σS (−x) > ⟨s, x − x⟩ = 0. Conversely, if int S is empty, there exists a hyperplane
containing S (by a dimension counting argument and that the relative interior of S is never
empty [111, Theorem III.2.1.3]), which we may write as S ⊂ {s | v T s = b} for some v ̸= 0.
For this σS (v) + σS (−v) = b − b = 0.
Figure B.1. The epigraph epi f of a function f .
We now build off of the definitions of convex sets to define convex functions. As we will see,
convex functions have several nice properties that follow from the geometric (separation) properties
of convex sets. First, we have
We define the domain dom f of a convex function to be those points x such that f (x) < +∞. Note
that Definition B.7 implies that the domain of f must be convex.
An equivalent definition of convexity follows by considering a natural convex set attached to
the function f , known as its epigraph:

epi f := {(x, t) ∈ Rd × R : f (x) ≤ t}.

That is, the epigraph of a function f is the set of points on or above the graph of the function itself,
as depicted in Figure B.1. It is immediate from the definition of the epigraph that f is convex if
and only if epi f is convex. Thus, we see that any convex set C ⊂ Rd+1 that is unbounded “above,”
meaning that C = C + {0} × R+ , defines a convex function, and conversely, any convex function
defines such a set C. This duality in the relationship between a convex function and its epigraph
is central to many of the properties we exploit.
Because the quotient function (B.3.2) is nondecreasing, we can relatively straightforwardly give
first-order characterizations of convexity as well. Indeed, suppose that f : R → R is differentiable;
then convexity is equivalent to the first-order inequality that for all x, y ∈ R, we have

f (y) ≥ f (x) + f ′(x)(y − x).    (B.3.3)
To see that inequality (B.3.3) implies that f is convex follows from algebraic manipulations: let
λ ∈ [0, 1] and z = λx + (1 − λ)y, so that y − z = λ(y − x) and x − z = (1 − λ)(x − y). Then inequality (B.3.3) gives

f (y) ≥ f (z) + f ′(z)(y − z) = f (z) + λf ′(z)(y − x)   and   f (x) ≥ f (z) + f ′(z)(x − z) = f (z) + (1 − λ)f ′(z)(x − y),

and multiplying the former by (1 − λ) and the latter by λ and adding the two inequalities yields

λf (x) + (1 − λ)f (y) ≥ f (z) + λ(1 − λ)f ′(z)[(y − x) + (x − y)] = f (λx + (1 − λ)y),

as desired.
We may also give the standard second order characterization: if f : R → R is twice differentiable
and f ′′ (x) ≥ 0 for all x, then f is convex. To see this, note that
f (y) = f (x) + f ′(x)(y − x) + (1/2) f ′′(tx + (1 − t)y)(x − y)²
for some t ∈ [0, 1] by Taylor’s theorem, so that f (y) ≥ f (x) + f ′ (x)(y − x) for all x, y because
f ′′ (tx + (1 − t)y) ≥ 0. As a consequence, we obtain inequality (B.3.3), which implies that f is
convex.
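These characterizations are simple to verify numerically for a concrete convex function (a Python/numpy sketch, with the illustrative choice f (x) = e^x so that f ′ = f and f ′′ > 0):

```python
import numpy as np

f, fprime = np.exp, np.exp           # f(x) = e^x is convex with f'' = e^x > 0
rng = np.random.default_rng(5)
xs, ys = rng.uniform(-2, 2, 100), rng.uniform(-2, 2, 100)

# First-order inequality (B.3.3): f(y) >= f(x) + f'(x)(y - x).
assert np.all(f(ys) >= f(xs) + fprime(xs) * (ys - xs) - 1e-12)

# Definitional (secant) inequality: f(l x + (1-l) y) <= l f(x) + (1-l) f(y).
lam = rng.uniform(0, 1, 100)
assert np.all(f(lam * xs + (1 - lam) * ys)
              <= lam * f(xs) + (1 - lam) * f(ys) + 1e-12)
```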
As convexity is a property that depends only on properties of functions on lines—one dimen-
sional projections—we can straightforwardly extend the preceding results to functions f : Rd → R.
Indeed, noting that if h(t) = f (x + ty) then h′ (0) = ⟨∇f (x), y⟩ and h′′ (0) = y ⊤ ∇2 f (x)y, we have
that a differentiable function f : Rd → R is convex if and only if

f (y) ≥ f (x) + ⟨∇f (x), y − x⟩ for all x, y.
Noting that nothing in the derivation that the quotient (B.3.2) was non-decreasing relied on f
being a function on R, we can see that a function f : Rd → R is convex if and only if it satisfies the
increasing slopes criterion: for all x ∈ dom f and any vector v, the quotient
f (x + tv) − f (x)
t 7→ q(t) := (B.3.4)
t
is nondecreasing in t ≥ 0 (where we leave x, v implicit). An alternative version of the criterion (B.3.4) is that if x ∈ dom f and v is any vector, if we define the one-dimensional convex function h(t) = f (x + tv), then for any s < t and ∆ > 0, we have

(h(t) − h(s))/(t − s) ≤ (h(t + ∆) − h(s))/(t + ∆ − s)   and   (h(t) − h(s))/(t − s) ≤ (h(t + ∆) − h(s + ∆))/(t − s).    (B.3.5)
The proof that either of the inequalities (B.3.5) is equivalent to convexity we leave as an exercise
(Q. C.1).
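A quick numerical illustration of the increasing-slopes criterion (B.3.4) (a Python/numpy sketch with the convex function f (z) = |z|³, an illustrative choice):

```python
import numpy as np

f = lambda z: np.abs(z) ** 3          # a convex function on R
x, v = 0.3, -1.0
ts = np.linspace(0.01, 5.0, 500)
q = (f(x + ts * v) - f(x)) / ts       # the difference quotient q(t) of (B.3.4)
assert np.all(np.diff(q) >= -1e-12)   # q is nondecreasing in t > 0
```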
(iv) The function f has positive semidefinite Hessian: ∇2 f (x) ⪰ 0 for all x.
A condition slightly stronger than convexity is strict convexity, which makes each of the in-
equalities in Proposition B.3.1 strict. We begin with the classical definition: a function f is strictly
convex if it is convex and

f (λx + (1 − λ)y) < λf (x) + (1 − λ)f (y)

whenever λ ∈ (0, 1) and x ̸= y ∈ dom f . Such functions are convex but always have strictly increasing slopes—secants lie strictly above f . By tracing through the arguments leading to Proposition B.3.1 (replacing appropriate non-strict inequalities with strict inequalities), one obtains the following corollary describing strictly convex functions.
(iv) The function f has positive definite Hessian: ∇2 f (x) ≻ 0 for all x.
Figure B.2. The tangent (affine) function to the function f generated by a subgradient g at the
point x0 .
See Figure B.2 for an illustration of the affine minorizing function given by the subgradient of a
convex function at a particular point.
Interestingly, convex functions have subgradients (at least, nearly everywhere). This is perhaps
intuitively obvious by viewing a function in conjunction with its epigraph epi f and noting that
epi f has supporting hyperplanes, but here we state a result that will have further use.
Theorem B.3.3. Let f be convex. Then there is an affine function minorizing f . More precisely,
for any x0 ∈ relint dom f , there exists a vector g such that

f (x) ≥ f (x0 ) + ⟨g, x − x0 ⟩ for all x.
Proof If relint dom f = ∅, then it is clear that f is either identically +∞ or its domain is a
single point {x0 }, in which case the constant function f (x0 ) minorizes f . Now, we assume that
int dom f ̸= ∅, as we can simply always change basis to work in the affine hull of dom f .
We use Theorem B.1.14 on the existence of supporting hyperplanes to construct a subgradient.
Indeed, we note that (x0 , f (x0 )) ∈ bd epi f , as for any open set O we have that (x0 , f (x0 )) + O
contains points both inside and outside of epi f . Thus, Theorem B.1.14 guarantees the existence of
a vector v and a ∈ R, not both simultaneously zero, such that

⟨v, x⟩ + at ≥ ⟨v, x0 ⟩ + af (x0 ) for all (x, t) ∈ epi f.    (B.3.7)
Inequality (B.3.7) implies that a ≥ 0, as for any x we may take t → +∞ while satisfying (x, t) ∈
epi f . Now we argue that a > 0 strictly. To see this, note that for suitably small δ > 0, we have
x = x0 − δv ∈ dom f . Then we find by inequality (B.3.7) that
⟨v, x0 ⟩ + af (x0 ) ≤ ⟨v, x0 ⟩ − δ ∥v∥2 + af (x0 − δv), or a [f (x0 ) − f (x0 − δv)] ≤ −δ ∥v∥2 .
So if v = 0, then Theorem B.1.14 already guarantees a ̸= 0, while if v ̸= 0, then ∥v∥2 > 0 and we
must have a ̸= 0 and f (x0 ) ̸= f (x0 − δv). As we showed already that a ≥ 0, we must have a > 0.
Then by setting t = f (x0 ) and dividing both sides of inequality (B.3.7) by a, we obtain
(1/a) ⟨v, x0 − x⟩ + f (x0 ) ≤ f (x) for all x ∈ dom f.
Setting g = −v/a gives the result of the theorem, as we have f (x) = +∞ for x ̸∈ dom f .
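For intuition, the minorization of Theorem B.3.3 can be checked for the prototypical nonsmooth convex function f (x) = |x|, where any g ∈ [−1, 1] serves as a subgradient at x0 = 0 (a Python/numpy sketch, illustrative only):

```python
import numpy as np

f = np.abs
x0 = 0.0
xs = np.linspace(-3.0, 3.0, 601)
for g in (-1.0, -0.25, 0.0, 0.8, 1.0):          # subgradients of |.| at 0
    # Affine minorant: f(x) >= f(x0) + g (x - x0) for all x.
    assert np.all(f(xs) >= f(x0) + g * (xs - x0) - 1e-12)
```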
Convex functions generally have quite nice behavior. Indeed, they enjoy some quite remarkable
continuity properties just by virtue of the defining convexity inequality (B.3.1). In particular, the
following theorem shows that convex functions are continuous on the relative interiors of their
domains. Even more, convex functions are Lipschitz continuous on any compact subsets contained
in the (relative) interior of their domains. (See Figure B.3 for an illustration of this fact.)
Theorem B.3.4. Let f : Rd → R be convex and C ⊂ relint dom f be compact. Then there exists
an L = L(C) ≥ 0 such that
|f (x) − f (x′)| ≤ L ∥x − x′∥ for all x, x′ ∈ C.
As an immediate consequence of Theorem B.3.4, we note that if f : Rd → R is convex and defined
everywhere on Rd , then it is continuous. Moreover, we also have that f : Rd → R is continuous
everywhere on the (relative) interior of its domain: let any x0 ∈ relint dom f . Then for small enough
ϵ > 0, the set cl({x0 + ϵB} ∩ dom f ), where B = {x : ∥x∥2 ≤ 1}, is a closed and bounded—and
hence compact—set contained in the (relative) interior of dom f . Thus f is Lipschitz on this set,
which is a neighborhood of x0 . In addition, if f : R → R, then f is continuous everywhere except
(possibly) at the endpoints of its domain.
Figure B.3. Left: discontinuities in int dom f are impossible while maintaining convexity (Theo-
rem B.3.4). Right: At the edge of dom f , there may be points of discontinuity.
Lemma B.3.5. Let f be convex and suppose that m ≤ f (y) ≤ M for all y ∈ B(x0 , 2δ), where δ > 0. Then for all y, y ′ ∈ B(x0 , δ),

|f (y) − f (y ′)| ≤ ((M − m)/δ) ∥y − y ′∥ .

Proof Let y, y ′ ∈ B(x0 , δ), and define y ′′ = y ′ + δ(y ′ − y)/ ∥y ′ − y∥ ∈ B(x0 , 2δ). Then we can write y ′ as a convex combination of y and y ′′, specifically,

y ′ = (∥y ′ − y∥ / (δ + ∥y ′ − y∥)) y ′′ + (δ / (δ + ∥y ′ − y∥)) y.
Thus we obtain by convexity
f (y ′) − f (y) ≤ (∥y ′ − y∥/(δ + ∥y ′ − y∥)) f (y ′′) + (δ/(δ + ∥y ′ − y∥)) f (y) − f (y)
 = (∥y − y ′∥/(δ + ∥y − y ′∥)) [f (y ′′) − f (y)]
 ≤ ((M − m)/(δ + ∥y − y ′∥)) ∥y − y ′∥ ≤ ((M − m)/δ) ∥y − y ′∥ .
Here we have used the bounds on f assumed in the lemma. Swapping the assignments of y and y ′
gives the same lower bound, thus giving the desired Lipschitz continuity.
With Lemma B.3.5 in place, we proceed to the proof proper. We assume without loss of
generality that dom f has an interior; otherwise we prove the theorem restricting ourselves to the
affine hull of dom f . The proof follows a standard compactification argument. Suppose that for
each x ∈ C, we could construct an open ball Bx = B(x, δx ) with δx > 0 such that
|f (y) − f (y ′)| ≤ Lx ∥y − y ′∥ for y, y ′ ∈ Bx .    (B.3.8)
As the Bx cover the compact set C, we can extract a finite number of them, call them Bx1 , . . . , Bxk ,
covering C, and then within each (overlapping) ball f is maxk Lxk Lipschitz. As a consequence, we
find that
|f (y) − f (y ′)| ≤ max_k Lxk ∥y − y ′∥ for any y, y ′ ∈ C.
We thus must derive inequality (B.3.8), for which we use the boundedness Lemma B.3.5. We
must demonstrate that f is bounded in a neighborhood of each x ∈ C. To that end, fix x ∈
int dom f , and let the points x0 , . . . , xd be affinely independent and such that
∆ := Conv{x0 , . . . , xd } ⊂ dom f
and x ∈ int ∆; let δ > 0 be such that B(x, 2δ) ⊂ ∆. Then by Carathéodory’s theorem (Theorem B.1.3) we may write any point y ∈ B(x, 2δ) as y = Σ_{i=0}^d λi xi for Σ_i λi = 1 and λi ≥ 0, and thus

f (y) ≤ Σ_{i=0}^d λi f (xi ) ≤ max_{i∈{0,...,d}} f (xi ) =: M.
Moreover, Theorem B.3.3 implies that there is some affine function h minorizing f ; let h(x) = a + ⟨v, x⟩ denote this function. Then

m := inf_{x∈C} f (x) ≥ inf_{x∈C} h(x) = a + inf_{x∈C} ⟨v, x⟩ > −∞
exists and is finite, so that in the ball B(x, 2δ) constructed above, we have f (y) ∈ [m, M ] as required
by Lemma B.3.5. This guarantees the existence of a ball Bx required by inequality (B.3.8).
Our final discussion of continuity properties of convex functions revolves around the most com-
mon and analytically convenient type of convex function, the so-called closed-convex functions.
Recall that a convex function f is closed if its epigraph epi f is a closed set; equivalently, f is lower semicontinuous, meaning that

lim inf_{n→∞} f (xn ) ≥ f (x0 )    (B.3.9)

for all x0 and any sequence of points xn tending toward x0 . See Figure B.4 for an example of such a function and its associated epigraph.
Interestingly, in the one-dimensional case, closed convexity implies continuity. Indeed, we have
the following observation (compare Figures B.4 and B.3 previously):
Observation B.3.6. Let f : R → R be a closed convex function. Then f is continuous on its
domain, and for any x0 ∈ bd dom f , limx→x0 f (x) = f (x0 ) whether or not x0 ∈ dom f .
Proof By Theorem B.3.4, we need only consider the endpoints of the domain of f (the result
is obvious by Theorem B.3.4 if dom f = R); let x0 ∈ bd dom f . Let y ∈ dom f be an otherwise
arbitrary point, and define x = λy + (1 − λ)x0 . Then taking λ → 0, we have f (x) ≤ λf (y) + (1 − λ)f (x0 ) → f (x0 ),
so that lim supx→x0 f (x) ≤ f (x0 ). By the closedness assumption (B.3.9), we have lim inf x→x0 f (x) ≥
f (x0 ), and continuity follows. Note that in this argument, if x0 ̸∈ dom f , then f (x0 ) = +∞ by
convention; for epi f to be closed we require that for each t < f (x0 ) = ∞, we may take a small
enough open interval U = (y, x0 ) for which f (x) > t for all x ∈ U .
In the full-dimensional case, we do not have quite the same continuity, though Theorem B.3.4
guarantees continuity on the (relative) interior of dom f .
An important characterization of convex functions is as the supremum of all affine functionals
(linear plus an offset) below them, which is one of the keys to duality relationships about functions
to come.
Theorem B.3.7. Let f be closed convex and let A be the collection of affine functions h satisfying
f (x) ≥ h(x) for all x. Then f (x) = suph∈A h(x).
Proof By Theorem B.1.15, which shows that any closed convex set is the intersection of all the halfspaces containing (even supporting) it, we can write epi f = ∩_{H∈H} H, where H is the collection of closed
halfspaces H ⊃ epi f . We may write any such halfspace as
H = {(x, r) ∈ Rd × R | ⟨a, x⟩ + br ≤ c}
where (a, b) ∈ Rd × R is non-zero. As H ⊃ epi f , the particular nature of epigraphs (that is, that
if (x, t) ∈ epi f then (x, t + ∆) ∈ epi f for all ∆ > 0) means that b ≤ 0, and so for any b < 0 we
may divide through by b to rewrite H as H = {(x, r) | ⟨a/b, x⟩ + r ≥ c/b}, while if b = 0 then
H = {(x, r) | ⟨a, x⟩ ≤ c}. That is, it is no loss of generality to set

H1 := {H = {(x, r) | ⟨a, x⟩ + r ≥ c} : H ⊃ epi f }   and   H0 := {H = {(x, r) | ⟨a, x⟩ ≥ c} : H ⊃ epi f }

(replacing (a, c) by (−a, −c) in the vertical halfspaces as needed), which (respectively) correspond to the non-vertical halfspaces containing epi f and the halfspaces containing dom f ⊂ Rd . We have epi f = ∩_{H∈H1} H ∩ ∩_{H∈H0} H.
Identify the halfspaces H ∈ H0 or H1 with the associated triple (a, 0, c) or (a, 1, c) and abuse notation to write (a, i, c) ∈ Hi for i ∈ {0, 1}. For any (a, 1, c) ∈ H1 , the affine function l(x) := c − ⟨a, x⟩ satisfies epi l ⊃ epi f , and so necessarily l(x) ≤ f (x) for all x, while for the function h(x) = sup_{(a,1,c)∈H1} {c − ⟨a, x⟩} we have

epi h = ∩_{H∈H1} H.
Now fix (a0 , 0, c0 ) ∈ H0 and (a1 , 1, c1 ) ∈ H1 , with associated halfspaces H0 and H1 , and for t ≥ 0 consider the triple (a1 + ta0 , 1, c1 + tc0 ) and associated halfspace H(t) := {(x, r) | ⟨a1 + ta0 , x⟩ + r ≥ c1 + tc0 }. Then as ⟨a0 , x⟩ ≥ c0 if and only if t⟨a0 , x⟩ ≥ tc0 for all t ≥ 0, any point (x, r) ∈ H0 ∩ H1 satisfies

⟨a1 + ta0 , x⟩ + r ≥ c1 + tc0 ,
that is, H(t) ∈ H1 and (x, r) ∈ ∩t≥0 H(t). Additionally, taking t = 0 we see that H(0) = H1 and
so ∩t≥0 H(t) ⊂ H1 , while taking t ↑ ∞ we obtain that each (x, r) ∈ ∩t≥0 H(t) satisfies ⟨a0 , x⟩ ≥ c0 .
That is, we have ∩_{t≥0} H(t) = H0 ∩ H1 ,
In spite of the continuity of closed convex functions on R, closed convex functions on higher
dimensional spaces need not be continuous. Indeed, it is immediate (see Proposition B.3.9 to follow)
that f (x) := supα∈A {fα (x)} is closed convex whenever fα are all closed convex for any index set
A. We have the following failure of continuity.
Example B.3.8 (A discontinuous closed convex function): Define the function f : R² → R by

f (x) := sup_{α,β} { αx1 + βx2 | (1/2)α² ≤ β }.
2
Then certainly f (0) = 0 and f is closed convex. If the supremum is attained then β = (1/2)α², and so β ≥ 0 and

f (x) = sup_α { αx1 + (1/2)α² x2 } = 0 if x = 0,   −x1²/(2x2) if x2 < 0,   and +∞ otherwise.
But then along the path x2 = −(1/2)x1², we always have f (x) = 1, while taking x1 → 0 along this path gives f (x) = 1 > 0 = f (0), so f is discontinuous at 0.
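The discontinuity in Example B.3.8 is easy to observe from the closed form (a Python/numpy sketch of the example’s formula for f ):

```python
import numpy as np

def f(x1, x2):
    """Closed form of the function in Example B.3.8."""
    if x1 == 0.0 and x2 == 0.0:
        return 0.0
    return -x1 ** 2 / (2 * x2) if x2 < 0 else np.inf

# Along the path x2 = -x1^2 / 2 the function is identically 1,
# so f(x) -> 1 != 0 = f(0) as x -> 0 along this path.
for x1 in (1.0, 0.1, 1e-3, 1e-6):
    assert np.isclose(f(x1, -x1 ** 2 / 2), 1.0)
assert f(0.0, 0.0) == 0.0
```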
Proposition B.3.9. Let {fα }α∈A be an arbitrary collection of convex functions, where A is an arbitrary index set. Then

f (x) := sup_{α∈A} fα (x)

is convex. Moreover, if for each α ∈ A the function fα is closed convex, then f is closed convex.
Proof The proof is immediate once we consider the epigraph epi f . We have that
epi f = ∩_{α∈A} epi fα ,
which is convex whenever epi fα is convex for all α and closed whenever epi fα is closed for all α
(recall Observation B.1.5).
Another immediate result is that composition of a convex function with an affine transformation
preserves convexity:
Proposition B.3.10. Let A ∈ Rd×n and b ∈ Rd , and let f : Rd → R be convex. Then the function
g(y) = f (Ay + b) is convex.
Partial minimization of convex functions and some related transformations preserve convexity
as well.
Figure B.5. The maximum of two convex functions is convex, as its epigraph is the intersection of
the two epigraphs.
From the proposition we immediately see that if f (x, y) is jointly convex in x and y, then the
partially minimized function inf y∈Y f (x, y) is convex whenever Y is a convex set.
Lastly, we consider the functional analogue of the perspective transform. Given a function
f : Rd → R, the perspective transform of f is defined as
pers(f )(x, t) := t f (x/t) if t > 0 and x/t ∈ dom f , and pers(f )(x, t) := +∞ otherwise.    (B.3.11)
In analogue with the perspective transform of a convex set, the perspective transform of a function
is (jointly) convex.
Proof The result follows if we can show that epi pers(f ) is a convex set. With that in mind,
note that

(x, t, r) ∈ Rd × R++ × R satisfies (x, t, r) ∈ epi pers(f ) if and only if f (x/t) ≤ r/t.

Rewriting this, we have

epi pers(f ) = {(x, t, r) ∈ Rd × R++ × R : f (x/t) ≤ r/t}
 = {t(x′, 1, r′) : x′ ∈ Rd , t ∈ R++ , r′ ∈ R, f (x′) ≤ r′},

which is the cone generated by the set {(x′, 1, r′) : f (x′) ≤ r′}, a copy of the convex set epi f embedded at height t = 1; as the cone generated by a convex set is convex, epi pers(f ) is convex.
Finally, we discuss closing a convex function, that is, replacing f with cl f , the lower semicontinuous closure of f . For f : Rd → R, we define cl f pointwise by

cl f (x) := lim inf_{y→x} f (y)

to be the lower semicontinuous closure of f . (Recall Definition B.10.) We demonstrate that this function is indeed the largest closed function below f .

Lemma. Let f : Rd → R. Then (1) cl f is closed and cl f ≤ f , (2) epi cl f = cl epi f , and (3) if g is closed and g ≤ f , then g ≤ cl f .
Proof Assuming the second statement, the first follows trivially, so let us prove the second.
Let (x, r) ∈ cl epi f . Then there exists a sequence (xn , rn ) ∈ epi f such that (xn , rn ) → (x, r),
and f (xn ) ≤ rn . Then evidently cl f (x) ≤ limn rn = r, so that (x, r) ∈ epi cl f . Conversely, let (x, r) ∈ epi cl f . Let xn → x satisfy f (xn ) → lim inf_{y→x} f (y) = cl f (x). Then because r ≥ cl f (x),
there exist rn ≥ f (xn ) for which rn → r, while (xn , rn ) ∈ epi f . In particular, (xn , rn ) → (x, r),
which is thus in cl epi f .
Finally, we prove the third claim of the lemma. If g is closed, then epi g is closed as well, and
epi g ⊃ epi f . Then epi g ⊃ cl epi f = epi cl f , that is, cl f (x) ≤ r implies g(x) ≤ r, i.e., cl f ≥ g.
Using this lemma, we can provide some additional color to Theorem B.3.7 that a closed convex
function is equal to the supremum of the affine functions minorizing it. In particular, Observa-
tion B.1.6 that closures of convex sets remain convex implies that if f is convex, then cl f is convex
as well. We collect this and a bit more with the following proposition, which now immediately
follows.
Proposition B.3.14. Let f : Rd → R be a convex function. Let A be the collection of all affine
functions h ≤ f . Then cl f (x) = suph∈A h(x) for all x. In particular, if f is lower semicontinuous
at x, then f (x) = suph∈A h(x).
Take any sequence ∆n → 0 achieving this limit supremum, and let ∆n = ϵn vn for a sequence vn on
the sphere, that is, ∥vn ∥ = 1, so ϵn = ∥∆n ∥. Then by passing to a subsequence if necessary, we can
assume w.l.o.g. that vn → v with ∥v∥ = 1. Taking ∆ = tv and t ↓ 0, this implies that L∥v∥ ≥ ⟨∇f(x), v⟩ for all v, which is equivalent to
∥∇f (x)∥ ≤ L.
The main consequence of convexity that is important for us is that a convex function is direc-
tionally differentiable at every point in the interior of its domain, though the directional derivative
need not be linear:
Proposition B.3.18. Let f be convex and x ∈ int dom f . Then f ′ (x; v) exists and the mapping
v 7→ f ′ (x; v) is sublinear, convex, and globally Lipschitz.
Proof If x ∈ int dom f, then the criterion (B.3.4) of increasing slopes guarantees that

    f′(x; v) = lim_{t↓0} (f(x + tv) − f(x))/t

exists for all x ∈ int dom f, as the difference quotient is monotone in t. To see that f′(x; v) is convex and sublinear in v, note that positive homogeneity is immediate, as for all α > 0 we have

    (1/t)(f(x + tαv) − f(x)) = α · (1/(αt))(f(x + αtv) − f(x)),

so that f′(x; αv) = αf′(x; v), and f′(x; 0) = 0. That it is convex is straightforward as well: for any u, v we have

    (f(x + t(λu + (1 − λ)v)) − f(x))/t ≤ λ(f(x + tu) − f(x))/t + (1 − λ)(f(x + tv) − f(x))/t,

and take t ↓ 0. For the global Lipschitz claim, note that f is already locally Lipschitz near x ∈ int dom f (recall Theorem B.3.4), so that there exist L < ∞ and ϵ > 0 such that |f(x + tv) − f(x)| ≤ Lt for all ∥v∥ = 1 and 0 ≤ t ≤ ϵ, whence |f′(x; v)| ≤ L, and by homogeneity |f′(x; v)| ≤ L∥v∥ for all v.
An inspection of the proof shows that the result extends even to all of dom f if we allow f ′ (x; v) =
+∞ whenever x + tv ̸∈ dom f for all t > 0, though of course we lose that f ′ (x; v) is finite-valued.
Then we have the following corollary, showing that f′(x; v) provides a valid first-order development of f in all directions from x (where we take ∞ · t = ∞ whenever t > 0): for x ∈ dom f and any v ∈ Rd,

    f(x + tv) = f(x) + f′(x; v)t + o(t) as t ↓ 0, and
    f(x + tv) ≥ f(x) + f′(x; v)t for all t ≥ 0.
There are strong connections between subdifferentials and directional derivatives, and hence of the local developments (B.3.13). The following result makes this clear: for convex f and x ∈ int dom f,

    ∂f(x) = {s | ⟨s, v⟩ ≤ f′(x; v) for all v ∈ Rd} ≠ ∅.
Proof For shorthand let S = {s | ⟨s, v⟩ ≤ f′(x; v) for all v} be the set on the right. If s ∈ S, then the criterion (B.3.4) of increasing slopes guarantees that
    ⟨s, v⟩ ≤ (f(x + tv) − f(x))/t for all v ∈ Rd and t > 0.
Recognizing that as v is allowed to vary over all of Rd and t > 0, then x + tv similarly describes
Rd , we see that this condition is completely equivalent to the definition (B.3.6) of the subgradient.
That ∂f (x) ̸= ∅ is Theorem B.3.3.
We can also extend this to x ∈ dom f (not necessarily the interior), where we see that there is no loss (even when f may be +∞ valued) to defining

    ∂f(x) := {s ∈ Rd | ⟨s, v⟩ ≤ f′(x; v) for all v ∈ Rd}. (B.3.16)
Notably, the directional derivative function v 7→ f ′ (x; v) always exists for x ∈ dom f and is a
sublinear convex function, and thus ∂f (x) above is always a closed convex set whose support
function (recall (B.2.1)) is the closure of v 7→ f ′ (x; v). While the subdifferential ∂f (x) is always a
compact convex set when x ∈ int dom f , even when it exists it may not be compact if x is on the
boundary of dom f . To see one important example of this, consider the indicator function
    IC(x) := { 0     if x ∈ C
             { +∞    if x ∉ C
of a closed convex set C. For simplicity, let C = [a, b] be an interval. Then we have
    ∂IC(x) = { [0, ∞]    if x = b
             { {0}       if a < x < b
             { [−∞, 0]   if x = a.
Whether points ±∞ are included is a matter of convenience and whether we work with the extended
real line.
These representations point to a certain closure property of subgradients, namely, that the subdifferential is closed under additions of the normal cone to the domain of f:

Lemma B.3.21. Let Ndom f(x) be the normal cone (Definition C.1) to dom f at the point x (where Ndom f(x) = {0} for x ∈ int dom f and Ndom f(x) = ∅ for x ∉ dom f). Then

    ∂f(x) + Ndom f(x) = ∂f(x).

In particular, ∂f(x) is unbounded whenever it is non-empty and Ndom f(x) ≠ {0}.
Proof We only need concern ourselves with points x ∈ bd dom f , where the normal cone N =
Ndom f (x) is non-trivial. If ∂f (x) is empty, there is nothing to prove, so assume that ∂f (x) is
non-empty. Then the definition (B.3.16) of the subdifferential as ∂f (x) = {s | ⟨s, u⟩ ≤ f ′ (x; u)}
allows us to prove the result. First, consider vectors u for which f ′ (x; u) = +∞. Then certainly, for
any s ∈ ∂f (x), we have ⟨s + v, u⟩ ≤ f ′ (x; u) for all v ∈ N . If f ′ (x; u) < ∞, then for small enough
t > 0 we necessarily have x + tu ∈ dom f . In particular, the definition of the normal cone gives that
v ∈ N satisfies 0 ≥ ⟨v, x + tu − x⟩ = t⟨v, u⟩, or that ⟨v, u⟩ ≤ 0. Thus ⟨s + v, u⟩ ≤ ⟨s, u⟩ ≤ f ′ (x; u),
and so s + v ∈ ∂f (x) once again.
The claim about boundedness is immediate, because Ndom f is a cone.
A more compelling case for the importance of the subgradient set with respect to first-order
developments and differentiability properties of convex functions is the following:
Proposition B.3.22. Let f be convex and x ∈ int dom f. Then as y → x,

    f(y) = f(x) + f′(x; y − x) + o(∥y − x∥) = f(x) + sup_{s∈∂f(x)} ⟨s, y − x⟩ + o(∥y − x∥).
Proof That sups∈∂f (x) ⟨s, v⟩ = f ′ (x; v) is immediate by Theorem B.3.7 and Proposition B.2.1,
because f ′ (x; v) is sublinear and closed convex in v when x ∈ int dom f . Certainly the right hand
sides are then equal.
We thus prove the equality f(y) = f(x) + f′(x; y − x) + o(∥y − x∥), where the argument is similar to that for Proposition B.3.17. Let yn → x be any sequence and let ∆n = yn − x, so that ∥∆n∥ → 0; as x ∈ int dom f, there exists a (local) Lipschitz constant L such that |f(x + ∆) − f(x)| ≤ L∥∆∥
for all small ∆. Similarly, because v ↦ f′(x; v) is convex (even positively homogeneous and thus sublinear), it has a Lipschitz constant, and we take this to be L as well. Now, write ∆n = ϵn vn where ∥vn∥ = 1 and ϵn → 0, and moving to a subsequence if necessary let vn → v. Then we have

    |f(x + ∆n) − f(x) − f′(x; ∆n)| ≤ |f(x + ϵn vn) − f(x + ϵn v)| + |f(x + ϵn v) − f(x) − ϵn f′(x; v)| + ϵn |f′(x; v) − f′(x; vn)| ≤ 2Lϵn ∥vn − v∥ + o(ϵn) = o(ϵn),

because vn → v and the middle term is o(ϵn) by the definition of the directional derivative.
Note that convexity only played the role of establishing the local Lipschitz property of f in the
proof of Proposition B.3.22; any locally Lipschitz function with directional derivatives will enjoy a
similar first-order expansion.
As our final result on smoothness properties of convex functions, we connect subdifferentials to
differentiability properties of convex f . First, we give a lemma showing that the subdifferential set
∂f is outer semicontinuous.
Lemma B.3.23 (Closure of the graph of the subdifferential). Let f : Rd → R be closed convex.
Then the graph {(x, s) | x ∈ Rd , s ∈ ∂f (x)} of its subdifferential is closed. Equivalently, whenever
xn → x with sn ∈ ∂f (xn ) and sn → s, f has non-empty subdifferential at x with s ∈ ∂f (x).
Proof We prove the second statement, whose equivalence to the first is definitional. Fix any
y ∈ Rd . Then f (y) ≥ f (xn ) + ⟨sn , y − xn ⟩, and because f is closed (i.e., lower semicontinuous),
we have lim inf f(xn) ≥ f(x). Let ϵ > 0 be arbitrary. Then for all large enough n, we have f(xn) ≥ f(x) − ϵ, and similarly, ∥sn − s∥ ≤ ϵ, ∥xn − x∥ ≤ ϵ, and ∥y − xn∥ ≤ ∥y − x∥ + ϵ. Then

    f(y) ≥ f(xn) + ⟨sn, y − xn⟩ ≥ f(x) + ⟨s, y − x⟩ − ϵ(1 + ∥s∥ + ∥y − x∥ + ϵ),

and taking ϵ ↓ 0 gives f(y) ≥ f(x) + ⟨s, y − x⟩, that is, s ∈ ∂f(x).
Given the somewhat technical Lemma B.3.23, we can show that if f is convex and differentiable
at a point, it is in fact continuously differentiable at the point.
Proposition B.3.24. Let f be convex and x ∈ int dom f . Then ∂f (x) is a singleton if and only if
f is differentiable at x. If additionally f is differentiable on an open set U , then f is continuously
differentiable on U .
Proof Because x ∈ int dom f , there exists L < ∞ such that f is L-Lipschitz near x by Theo-
rem B.3.4. Suppose that ∂f (x) = {s}. Then the directional derivative f ′ (x; v) = ⟨s, v⟩ for all v,
and Proposition B.3.22 gives f(y) = f(x) + ⟨s, y − x⟩ + o(∥y − x∥), that is, f is differentiable at x with ∇f(x) = s.
Subdifferentials of sums add under mild regularity:

Proposition B.3.25. Let f be convex with x ∈ int dom f, and let g be convex and subdifferentiable at x. Then ∂(f + g)(x) = ∂f(x) + ∂g(x).
Proof By Proposition B.3.20, the set ∂f (x) is a compact convex set, and the general defi-
nition (B.3.16) of the subdifferential gives that ∂g(x) is closed convex. Let S1 = ∂f (x) and
S2 = ∂g(x). Then immediately S1 + S2 ⊂ ∂(f + g)(x), so that
    S := ∂(f + g)(x) = {s | ⟨s, v⟩ ≤ f′(x; v) + g′(x; v) for all v ∈ Rd}
is non-empty. Because of the support function equality f ′ (x; v) = σS1 (v) and g ′ (x; v) = σS2 (v),
Corollary B.2.5 gives
σS (v) = σS1 (v) + σS2 (v) = σS1 +S2 (v).
Thus (Corollary B.2.4) S1 + S2 = S.
Other situations that arise frequently are composition with affine mappings and taking maxima
or suprema of convex functions, so that finding a calculus for these is also important.
Corollary B.3.26. Let f : Rm → R be convex and for A ∈ Rm×d and b ∈ Rm , let g(x) = f (Ax+b).
Then
∂g(x) = AT ∂f (Ax + b).
Proof Using the directional derivative, we have g ′ (x; v) = f ′ (Ax + b; Av) for all v ∈ Rd , and
applying Proposition B.2.6 gives that the latter is the support function of the convex compact set
AT ∂f (Ax + b).
It is also useful to be able to compute subdifferentials of maxima and suprema (recall Proposition B.3.9). Consider a collection {fα}α∈A of convex functions, and define

    f(x) := sup_{α∈A} fα(x) and A(x) := {α ∈ A | fα(x) = f(x)},

the latter being the indices attaining the supremum, that is, the active index set (though this may be empty).
Then there is an “easy” direction:
Lemma B.3.27. With the notation above,
    ∂f(x) ⊃ cl Conv(∪_{α∈A(x)} ∂fα(x)) = cl Conv{g | g ∈ ∂fα(x) for some α ∈ A(x)}.
Proof Let α ∈ A(x) and g ∈ ∂fα(x), so that for any y we have f(y) ≥ fα(y) ≥ fα(x) + ⟨g, y − x⟩ = f(x) + ⟨g, y − x⟩. Thus g ∈ ∂f(x), which as a closed convex set must thus include the closed convex hull of all such g.
A much more challenging argument is to show that the active index set A(x) exactly charac-
terizes the subdifferential of f at x; we simply state a typical result as a proposition.
Proposition B.3.28. Let A be a compact set (for some metric) and assume that for each x, the
mapping α 7→ fα (x) is upper semi-continuous. Then
    ∂f(x) = Conv(∪_{α∈A(x)} ∂fα(x)) = Conv{g | g ∈ ∂fα(x) for some α ∈ A(x)}.
Finally, we revisit the partial minimization operation in Proposition B.3.11. In this case, we
require a bit more care when defining subdifferentials and subdifferentiability. For A ∈ Rn×m with
m ≥ n, where A has rank n (so that y ↦ Ay is surjective) and f : Rm → R, define the function

    fA(x) := inf{f(y) | Ay = x},

which is convex. Define the set Y⋆(x) := {y | Ay = x and fA(x) = f(y)} to be the set of y
attaining the infimum in the definition of fA , which may be empty. When it is not, however, we
can characterize the subdifferential of fA (x):
Proposition B.3.29. Let x ∈ Rn be a point for which Y ⋆ (x) is non-empty for the function fA .
Then
    ∂fA(x) = {s | AT s ∈ ∂f(y)}
for any y ∈ Y ⋆ (x), and the set on the right is independent of the choice of y.
Proof A vector s is a subgradient of fA at x if and only if

    fA(x′) ≥ fA(x) + ⟨s, x′ − x⟩ for all x′ ∈ Rn.
Because A has full row rank, for any x′ ∈ Rn there exists y′ with Ay′ = x′; by definition of fA as the infimum, the preceding display is thus equivalent to

    f(y′) ≥ f(y) + ⟨s, Ay′ − Ay⟩ = f(y) + ⟨AT s, y′ − y⟩ for all y′ ∈ Rm,

where y ∈ Y⋆(x) attains fA(x) = f(y); that is, AT s ∈ ∂f(y).
Appendix C
The existence and continuity properties of minimizers of (convex) optimization problems play a
central role in much of statistical theory. They are essential in our understanding of loss functions and the associated optimality properties. In our context, this is especially central for problems of classification calibration or surrogate risk consistency, as in Chapter 16. This
appendix records several representative results along these lines, and also builds up the duality
theory associated with convex conjugates, frequently identified as Fenchel-Young duality.
Broadly, throughout this appendix, we shall consider the generic optimization problem
    minimize f(x) subject to x ∈ C,    (C.0.1)
where C is a closed convex set (we have not yet assumed convexity of f). Throughout (as in the
previous appendix) we assume that f is proper, so that f (x) > −∞ for each x, and that f (x) = +∞
if x ̸∈ dom f .
The most basic question we might ask is when minimizers even exist in the problem (C.0.1).
The standard result in this vein is that minimizers exist whenever C is compact and f is lower semicontinuous (B.3.9), that is, its epigraph is closed, i.e., lim inf_{x→x0} f(x) ≥ f(x0).
Proof Let f ⋆ = inf x∈C f (x), where for now we allow the possibility that f ⋆ = −∞. Let xn ∈ C
be a sequence of points satisfying f(xn) → f⋆. Proceeding to a subsequence if necessary, we can
assume that xn → x⋆ ∈ C by the compactness of C. Then lower semi-continuity guarantees that
f⋆ = lim_n f(xn) ≥ f(x⋆) ≥ f⋆, so that f(x⋆) = f⋆, and necessarily f⋆ > −∞.
When the domain C is not compact but only closed, alternative conditions are necessary to
guarantee the existence of minimizers. Perhaps the most frequent, and one especially useful with
convexity (as we shall see), is that f is coercive, meaning that f(x) → +∞ whenever ∥x∥ → ∞.
Proposition C.0.2. Let C be closed and f : Rd → R be lower semi-continuous over C and coercive.
Then inf x∈C f (x) > −∞ and the infimum is attained.
Proof Once again, let f⋆ = inf_{x∈C} f(x) and let xn ∈ C satisfy f(xn) → f⋆. Certainly xn must be a bounded sequence because f is coercive. Thus, it has a subsequential limit, and w.l.o.g. we assume that xn → x⋆ ∈ C by closedness. Lower semi-continuity guarantees that f⋆ = lim inf_n f(xn) ≥ f(x⋆) ≥ f⋆, giving the result.
Finally, we make a small remark on norms and dual norms, as these will be important for the more quantitative smoothness guarantees we provide. For a norm ∥·∥, the dual norm ∥·∥∗ has definition

    ∥y∥∗ := sup_{∥x∥≤1} ⟨x, y⟩.

This is a norm, as it is positively homogeneous, satisfies ∥y∥∗ = 0 if and only if y = 0, and satisfies the triangle inequality.
triangle inequality. A few brief examples follow, which we leave as exercises to the reader.
(i) The ℓ2-norm ∥x∥2 = √⟨x, x⟩ is self-dual, so that its dual is ∥·∥2.

(ii) The ℓ1 and ℓ∞ norms are dual, that is, ∥x∥∞ = sup_{∥y∥1≤1} ⟨x, y⟩ and ∥y∥1 = sup_{∥x∥∞≤1} ⟨x, y⟩.

(iii) For all p ∈ [1, ∞], the dual to the ℓp norm ∥x∥p = (Σ_{j=1}^d |xj|^p)^{1/p} is the ℓq norm with q = p/(p − 1), that is, the q ≥ 1 satisfying 1/p + 1/q = 1.
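These duality relations can be checked numerically. The sketch below (our own illustration; the helper lp_norm and the choice p = 3 are assumptions for the example) verifies Hölder's inequality ⟨x, y⟩ ≤ ∥x∥p ∥y∥q on random vectors and exhibits an x with ∥x∥p = 1 attaining the supremum that defines the dual norm:

```python
import random

def lp_norm(v, p):
    # the ell_p norm of a vector v, including p = infinity
    if p == float("inf"):
        return max(abs(c) for c in v)
    return sum(abs(c) ** p for c in v) ** (1.0 / p)

random.seed(1)
p = 3.0
q = p / (p - 1)                      # conjugate exponent: 1/p + 1/q = 1
y = [random.uniform(-1, 1) for _ in range(5)]

# Hölder's inequality <x, y> <= ||x||_p ||y||_q on random x
for _ in range(500):
    x = [random.uniform(-1, 1) for _ in range(5)]
    inner = sum(a * b for a, b in zip(x, y))
    assert inner <= lp_norm(x, p) * lp_norm(y, q) + 1e-9

# the Hölder-optimal x, with ||x||_p = 1, attains <x, y> = ||y||_q
scale = lp_norm(y, q) ** (q - 1)
x_star = [(1 if c >= 0 else -1) * abs(c) ** (q - 1) / scale for c in y]
assert abs(lp_norm(x_star, p) - 1.0) < 1e-9
assert abs(sum(a * b for a, b in zip(x_star, y)) - lp_norm(y, q)) < 1e-9
```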
which rearranged yields f(y) ≥ f(x). If f is additionally strictly convex (recall Corollary B.3.2), then the preceding inequality is strict whenever y ≠ x.
Proof If 0 ∈ ∂f (x), then f (y) ≥ f (x) + ⟨0, y − x⟩ = f (x) for all y. Conversely, if x minimizes f ,
then we have f (y) ≥ f (x) for all y, and in particular, 0 ∈ ∂f (x).
Things become a bit more complicated when we consider the constraints in the problem (C.0.1),
so that the point x may be restricted. In this case, it is important and useful to consider the normal
cone to the set C, which is (essentially) the collection of vectors pointing out of C.
Definition C.1. Let C be a closed convex set. The normal cone to C at the point x ∈ C is the
collection of vectors
NC (x) := {v | ⟨v, y − x⟩ ≤ 0 for all y ∈ C} .
So NC(x) is the collection of vectors making an obtuse angle with any direction into the set C from x.
It is clear that NC (x) is indeed a cone: if v ∈ NC (x), then certainly tv ∈ NC (x) for all t ≥ 0.
It is closed convex, being the intersection of halfspaces. Moreover, if x ∈ int C, then we have
NC (x) = {0}, and additionally, we can connect the supporting hyperplanes of C to its normal
cones: Theorem B.1.14 gives the following corollary.
Corollary C.1.3. Let C be closed convex. Then for any x ∈ bd(C), the normal cone NC(x) is non-trivial and consists of the normal vectors of supporting hyperplanes to C at x.
By a bit of subgradient calculus, we can then write optimality conditions involving the normal
cones to C. If C is a closed convex set, the convex indicator function IC (x) has subdifferentials
    ∂IC(x) = { {0}      if x ∈ int C
             { NC(x)    if x ∈ bd(C)
             { ∅        otherwise.
The only case requiring justification is the boundary case; for this, we note that w ∈ NC (x) if and
only if ⟨w, y − x⟩ ≤ 0 for all y ∈ C, which in turn occurs if and only if IC (y) ≥ IC (x) + ⟨w, y − x⟩
for all y.
The subdifferential calculation for IC (x) yields the following general optimality characterization
for problem (C.0.1).
Proposition C.1.4. In the problem (C.0.1), let x ∈ int dom f . Then x minimizes f over C if and
only if
0 ∈ ∂f (x) + NC (x). (C.1.1)
Proof The minimization problem (C.0.1) is equivalent to the problem of minimizing f(x) + IC(x) over x ∈ Rd.
As x ∈ int dom f , f has nonempty compact convex subdifferential ∂f (x), and so ∂(f + IC )(x) =
∂f (x) + ∂IC (x) = ∂f (x) + NC (x) by Proposition B.3.25. Apply Observation C.1.2.
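A minimal numeric sketch of the optimality condition (C.1.1) (our own example, not from the text): minimizing f(x) = (x − z)²/2 over C = [0, 1] gives the projection x⋆ = clip(z, 0, 1), and the gradient g = x⋆ − z satisfies −g ∈ NC(x⋆), so that 0 ∈ ∂f(x⋆) + NC(x⋆):

```python
def project(z):
    # minimizer of f(x) = (x - z)^2 / 2 over C = [0, 1]
    return min(max(z, 0.0), 1.0)

for z in [-2.0, -0.5, 0.0, 0.3, 1.0, 2.5]:
    x_star = project(z)
    g = x_star - z                       # f'(x_star), the only subgradient
    # -g lies in the normal cone: <-g, y - x_star> <= 0 for every y in C
    for k in range(101):
        y = k / 100.0
        assert -g * (y - x_star) <= 1e-12
```

When z ∈ C the gradient g vanishes; when z lies outside C, the nonzero −g points out of the set at the boundary point x⋆, exactly as the normal-cone picture suggests.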
Several equivalent versions of Proposition C.1.4 are possible. The first is that ∂f(x) ∩ (−NC(x)) ≠ ∅,
that is, there is a subgradient vector g ∈ ∂f (x) such that −g ∈ NC (x), so that −g points outside the
set C.
Another variant, frequently used, is to write Proposition C.1.4 as that x solves problem (C.0.1)
if and only if there exists g ∈ ∂f (x) such that
⟨g, y − x⟩ ≥ 0 for all y ∈ C. (C.1.2)
Indeed, taking g ∈ ∂f (x) to be the element satisfying −g ∈ NC (x), we immediately see that ⟨−g, y−
x⟩ ≤ 0 for all y ∈ C by definition of the normal cone, which is (C.1.2).
We say that f is λ-strongly convex with respect to the norm ∥·∥ over C if for all x, y ∈ C and t ∈ [0, 1],

    f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) − (λ/2) t(1 − t) ∥x − y∥². (C.1.3)

Proposition C.1.5. Let f be subdifferentiable on C. The following are equivalent.

(i) f is λ-strongly convex (C.1.3) over C.

(ii) f(y) ≥ f(x) + ⟨g, y − x⟩ + (λ/2) ∥x − y∥² for all x, y ∈ C and g ∈ ∂f(x).

(iii) ⟨gx − gy, x − y⟩ ≥ λ ∥x − y∥² for all x, y ∈ C, gx ∈ ∂f(x), and gy ∈ ∂f(y).
Proof Let us prove that inequality (ii) holds if and only if (iii) does. Let gx ∈ ∂f (x) and
gy ∈ ∂f (y) and assume (ii) holds. Then
    f(y) ≥ f(x) + ⟨gx, y − x⟩ + (λ/2) ∥y − x∥²
    f(x) ≥ f(y) + ⟨gy, x − y⟩ + (λ/2) ∥x − y∥².

Adding these two inequalities gives
0 ≥ ⟨gx − gy , y − x⟩ + λ ∥x − y∥2 .
Rearranging gives part (iii). Conversely, assume (iii), and for t ∈ [0, 1] let xt = (1 − t)x + ty and define h(t) = f(xt). Then h is convex and hence almost everywhere differentiable (and locally Lipschitz), so that h(1) = h(0) + ∫₀¹ h′(t) dt. Noting that h′(t) = ⟨gt, y − x⟩ for any gt ∈ ∂f(xt) wherever h is differentiable, condition (iii) applied to the pair (xt, x) gives h′(t) ≥ ⟨g0, y − x⟩ + λt ∥y − x∥² for g0 ∈ ∂f(x), and integrating yields (ii). To see that (ii) implies (i), apply (ii) at the point tx + (1 − t)y to obtain
    f(y) ≥ f(tx + (1 − t)y) + t⟨gt, y − x⟩ + (λ/2) t² ∥x − y∥²
    f(x) ≥ f(tx + (1 − t)y) + (1 − t)⟨gt, x − y⟩ + (λ/2) (1 − t)² ∥x − y∥²
for any gt ∈ ∂f (tx + (1 − t)y). Multiply the first inequality by (1 − t) and the second by t, then
add them to obtain
    tf(x) + (1 − t)f(y) ≥ f(tx + (1 − t)y) + (λ/2) [(1 − t)t² + t(1 − t)²] ∥x − y∥²,
and note that (1 − t)t² + t(1 − t)² = t(1 − t), which gives (i). Finally, let (i) hold, which is equivalent to the condition that

    (f((1 − t)x + ty) − f(x))/t + (λ/2)(1 − t) ∥x − y∥² ≤ f(y) − f(x)
for t ∈ (0, 1). Taking t ↓ 0 gives f′(x; y − x) + (λ/2) ∥x − y∥² ≤ f(y) − f(x), and because f′(x; y − x) = sup_{s∈∂f(x)} ⟨s, y − x⟩ we obtain (ii).
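As a quick numeric check of the equivalence (our own example function, not from the text): f(x) = x² + x⁴ has f″(x) = 2 + 12x² ≥ 2, hence is 2-strongly convex on R, and its derivative satisfies the monotonicity condition (iii) with λ = 2:

```python
import random

def grad(x):
    # derivative of f(x) = x^2 + x^4
    return 2 * x + 4 * x ** 3

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    # condition (iii): <f'(x) - f'(y), x - y> >= lam * |x - y|^2 with lam = 2
    assert (grad(x) - grad(y)) * (x - y) >= 2 * (x - y) ** 2 - 1e-9
```

The inequality holds here because (x³ − y³)(x − y) ≥ 0 for all x, y, so the quartic term only adds to the quadratic lower bound.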
As a first example application of strong convexity, consider minimizers of the tilted functions

    fu(x) := f(x) − ⟨u, x⟩

as u varies. First, note that minimizers necessarily exist: the function fu(x) → ∞ whenever ∥x∥ → ∞ by condition (ii) in Proposition C.1.5, and so we can restrict to minimizing fu over
compacta. Moreover, the minimizers xu := argminx fu (x) are unique, as the functions fu are
strongly (and hence strictly) convex. However, we can say more. Indeed, let C be any closed
convex set and let
    xu = argmin_{x∈C} fu(x). (C.1.4)
Proposition C.1.6. Let f be λ-strongly convex with respect to the norm ∥·∥ and subdifferentiable on C. Then the mapping u ↦ xu is (1/λ)-Lipschitz continuous with respect to the dual norm ∥·∥∗, that is, ∥xu − xv∥ ≤ (1/λ) ∥u − v∥∗.
Proof We use the optimality condition (C.1.2). We have ∂fu (x) = ∂f (x) − u, and thus for any
u, v we have both
⟨gu − u, y − xu ⟩ ≥ 0 and ⟨gv − v, y − xv ⟩ ≥ 0
for some gu ∈ ∂f (xu ) and gv ∈ ∂f (xv ) for all y ∈ C. Set y = xv in the first inequality and y = xu
in the second and add them to obtain
⟨gu − gv + v − u, xv − xu ⟩ ≥ 0 or ⟨v − u, xv − xu ⟩ ≥ ⟨gv − gu , xv − xu ⟩.
By strong convexity the last term satisfies ⟨gv − gu , xv − xu ⟩ ≥ λ ∥xu − xv ∥2 . By definition of the
dual norm, ∥v − u∥∗ ∥xv − xu ∥ ≥ ⟨v − u, xv − xu ⟩, so ∥u − v∥∗ ∥xv − xu ∥ ≥ λ ∥xu − xv ∥2 , which is
the desired result.
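For intuition, a minimal sketch of Proposition C.1.6 (the function f(x) = (λ/2)x² and the set C = [0, 1] are our own choices, giving the closed form xu = clip(u/λ, 0, 1) for the tilted minimizer):

```python
lam = 2.0

def x_u(u):
    # argmin over C = [0, 1] of (lam/2) x^2 - u x, i.e. clip(u / lam, 0, 1)
    return min(max(u / lam, 0.0), 1.0)

tilts = [i / 10.0 - 3.0 for i in range(61)]      # grid of tilts in [-3, 3]
for u in tilts:
    for v in tilts:
        # Proposition C.1.6: |x_u - x_v| <= (1/lam) |u - v|
        assert abs(x_u(u) - x_u(v)) <= abs(u - v) / lam + 1e-12
```

The clipping only helps: projection onto the interval is 1-Lipschitz, so the (1/λ) bound from the unconstrained map u/λ is preserved.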
There are alternative versions of strong convexity, typically given the name uniform convexity
in the convex analysis literature, which allow generalizations and similar quantitative stability
properties. In analogy with the strong convexity condition (C.1.3), we say that f is (λ, κ)-uniformly
convex, where κ ≥ 2, over C if it is closed and for all t ∈ [0, 1] and x, y ∈ C,
    f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) − (λ/2) t(1 − t) ∥x − y∥κ [(1 − t)^{κ−1} + t^{κ−1}]. (C.1.5)
Notably, the κ = 2 case is simply the familiar strong convexity property. Taking κ > 2 weakens
strong convexity, yielding correspondingly weaker guarantees of stability and optimality. We can,
however, provide analogies to Propositions C.1.5 and C.1.6.
Proposition C.1.7. Let f be (λ, κ)-uniformly convex with respect to ∥·∥ over C. Then for all x, y ∈ C and g ∈ ∂f(x),

    f(y) ≥ f(x) + ⟨g, y − x⟩ + (λ/2) ∥x − y∥κ.

This in turn implies that for all gx ∈ ∂f(x) and gy ∈ ∂f(y),

    ⟨gx − gy, x − y⟩ ≥ λ ∥x − y∥κ,

which in turn implies that f is ((2/κ)λ, κ)-uniformly convex with respect to the norm ∥·∥.
We leave the proof of the proposition as Exercise C.2, noting that it follows via the same argu-
ments as those we use to prove the strong convexity version (Proposition C.1.5). An analog of
Proposition C.1.6 also holds with an alternative smoothness condition.
Proposition C.1.8. Let f be (λ, κ)-uniformly convex with respect to the norm ∥·∥ and subdifferentiable on C. Then the mapping u ↦ xu is 1/(κ − 1)-Hölder continuous, and in particular,

    ∥xu − xv∥ ≤ λ^{−1/(κ−1)} ∥u − v∥∗^{1/(κ−1)}

for all u, v.
Corollary C.1.9. Let f be a convex function subdifferentiable on C. The following are equivalent.

(i) f is strictly convex on C.

(ii) For all x ≠ y in C and any gx ∈ ∂f(x), gy ∈ ∂f(y),

    ⟨gx − gy, x − y⟩ > 0.
Using Corollary C.1.9, we can then obtain certain smoothness properties of the tilted minimizers
xu of the minimization (C.1.4). We begin with a lemma that guarantees growth of convex functions
over their first-order approximations.
Lemma C.1.10. Let f be convex and subdifferentiable on the closed convex set C, and for any fixed g ∈ ∂f(x0) define the Bregman divergence Df(x, x0) := f(x) − ⟨g, x − x0⟩ − f(x0) and the growth function

    δ(ϵ) := inf{Df(x, x0) | x ∈ C, ∥x − x0∥ = ϵ}.

Then δ ≥ 0 and the ratio δ(ϵ)/ϵ is nondecreasing in ϵ > 0, so that δ(∥x − x0∥) ≥ ∥x − x0∥ δ(ϵ)/ϵ whenever ∥x − x0∥ ≥ ϵ.
Whenever f is strictly convex, because the infimum in δ(ϵ) is attained in Lemma C.1.10, we
have the following guarantee.
Lemma C.1.11. Let the conditions of Lemma C.1.10 hold and additionally let f be strictly convex.
Then δ(ϵ) > 0 for all ϵ > 0.
Combining these results yields the following non-quantitative version of Proposition C.1.6:
Proposition C.1.12. Let f be strictly convex and subdifferentiable on the closed convex set C,
and assume that the minimum x0 = argminx∈C f (x) is attained. Then the mapping u 7→ xu is
continuous in a neighborhood of u = 0.
Proof We show first that xu is continuous at u = 0. By Lemmas C.1.10 and C.1.11, we see that for x ∈ C we have

    f(x) ≥ f(x0) + ⟨g, x − x0⟩ + δ(∥x − x0∥) ≥ f(x0) + δ(∥x − x0∥),

where g ∈ ∂f(x0) satisfies ⟨g, x − x0⟩ ≥ 0 for all x ∈ C by the optimality condition (C.1.2). Now pick ϵ > 0, so that if ∥x − x0∥ > ϵ we have δ(∥x − x0∥) ≥ ∥x − x0∥ δ(ϵ)/ϵ by Lemma C.1.10. Then if u satisfies ∥u∥∗ < δ(ϵ)/ϵ, we have

    f(x) − ⟨u, x⟩ ≥ f(x0) − ⟨u, x0⟩ + δ(∥x − x0∥) − ∥u∥∗ ∥x − x0∥ ≥ f(x0) − ⟨u, x0⟩ + ∥x − x0∥ (δ(ϵ)/ϵ − ∥u∥∗) > f(x0) − ⟨u, x0⟩

whenever ∥x − x0∥ > ϵ.
Thus any minimizer xu of f (x) − ⟨u, x⟩ over x ∈ C must satisfy ∥xu − x0 ∥ ≤ ϵ, and strict convexity
guarantees its uniqueness.
The argument that u 7→ xu is continuous in a neighborhood of zero is completely similar once
we recognize that for the divergence Df (x, x0 ) := f (x) − ⟨g, x − x0 ⟩ − f (x0 ) (where g ∈ ∂f (x0 ) is
fixed), we have Df = Dfu for fu (x) = f (x) − ⟨u, x⟩ and xu is near x0 for u small.
For f : Rd → R, define the conjugate function

    f∗(s) := sup_x {⟨s, x⟩ − f(x)}. (C.2.1)

The conjugate function is the largest gap between the linear functional x ↦ ⟨s, x⟩ and the function f itself. The remarkable property of such conjugates is that their biconjugates describe the function f itself, or at least the largest closed convex function below f. To make this a bit more precise, we state a theorem, and then connect to so-called convex closures of functions.
Theorem C.2.1. Let f be closed convex and f ∗ be its conjugate (C.2.1). Then
f ∗∗ (x) = f (x) for all x.
Proof By definition, we have
    f∗∗(x) = sup_s {⟨x, s⟩ − f∗(s)},
and we always have ⟨x, s⟩−f ∗ (s) ≤ f (x) by definition of f ∗ (s) = supx {⟨s, x⟩−f (x)}. So immediately
we see that f ∗∗ (x) ≤ f (x).
We essentially show that the linear functions hs (x) := ⟨x, s⟩ − f ∗ (s) describe (enough) of
the global linear underestimators of f so that f (x) = sups hs (x), allowing us to apply Theo-
rem B.3.7. Indeed, let l(x) = ⟨s, x⟩ + b be any global underestimator of f . Then we must have
b ≤ f (x) − ⟨s, x⟩ for all x, that is, b ≤ inf x {f (x) − ⟨s, x⟩} = − supx {⟨s, x⟩ − f (x)} = −f ∗ (s), that
is, l(x) ≤ ⟨s, x⟩ − f ∗ (s) = hs (x). Apply Theorem B.3.7.
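Theorem C.2.1 can be visualized with a crude grid approximation of the conjugate (our own sketch; the grid, the test points, and the tolerance are arbitrary choices). For f(x) = x², the conjugate is f∗(s) = s²/4, and conjugating twice recovers f up to discretization error:

```python
xs = [i / 50.0 for i in range(-250, 251)]      # grid on [-5, 5]

def f(x):
    return x * x                                # closed convex, f*(s) = s^2/4

def conjugate(g, grid):
    # discrete Fenchel conjugate: g*(s) = max over the grid of s*x - g(x)
    return lambda s: max(s * x - g(x) for x in grid)

f_star = conjugate(f, xs)
f_bistar = conjugate(f_star, xs)

assert abs(f_star(2.0) - 1.0) < 1e-3            # f*(2) = 2^2 / 4 = 1
for x in [-1.5, -0.3, 0.0, 0.7, 2.0]:
    assert abs(f_bistar(x) - f(x)) < 1e-3       # biconjugation recovers f
```

For a nonconvex g the same code would instead return the largest closed convex minorant of g, matching Corollary C.2.2.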
We may visualize f ∗∗ as pulling a string up below a function f , yielding the largest closed
convex underestimator of f . Combining Theorem C.2.1 with Proposition B.3.14, we obtain the
following corollary. (Note that we require f (x) > −∞ for all x.)
Corollary C.2.2. Let f : Rd → R be convex and f ∗ its conjugate (C.2.1). Then
f ∗∗ (x) = cl f (x) for all x.
In particular, f is lower semicontinuous at x if and only if f (x) = cl f (x) = f ∗∗ (x).
Even more, combining Theorem C.2.1 with the Fenchel–Young inequality f(x) + f∗(s) ≥ ⟨s, x⟩, we can exhibit a duality between subgradients of f and f∗.

Proposition C.2.3. Let f be convex. Then ⟨s, x⟩ = f∗(s) + f(x) if and only if s ∈ ∂f(x); if additionally f is closed, these hold if and only if x ∈ ∂f∗(s).

Proof If ⟨s, x⟩ = f∗(s) + f(x), then −f(x) + ⟨s, x⟩ = f∗(s) ≥ ⟨s, y⟩ − f(y) for all y, and rearranging, we have f(y) ≥ f(x) + ⟨s, y − x⟩, that is, s ∈ ∂f(x). Conversely, if s ∈ ∂f(x) then 0 ∈ ∂f(x) − s, so that x minimizes f(x) − ⟨s, x⟩, or equivalently, x maximizes ⟨s, x⟩ − f(x), and so ⟨s, x⟩ − f(x) = sup_y {⟨s, y⟩ − f(y)} = f∗(s) as desired. The final statement is immediate from a parallel argument and the duality in Theorem C.2.1.
Writing Proposition C.2.3 differently, we see that ∂f and ∂f ∗ are inverses of one another. That
is, as set-valued mappings, x ∈ ∂f∗(s) if and only if s ∈ ∂f(x), and

    ∂f∗(s) = argmax_x {⟨s, x⟩ − f(x)} and ∂f(x) = argmax_s {⟨s, x⟩ − f∗(s)}.
Additionally, we see that the domains and images of ∂f and ∂f∗ are also related: for closed convex f, dom ∂f = Im ∂f∗ and Im ∂f = dom ∂f∗, which guarantees convexity properties of their images as well.
Proof Let x ∈ dom ∂f , so that ∂f (x) is non-empty. Then s ∈ ∂f (x) implies that ⟨s, x⟩ =
f (x) + f ∗ (s) and x ∈ ∂f ∗ (s) by Proposition C.2.3. Similarly, if x ∈ Im ∂f ∗ , then there is some s
for which x ∈ ∂f ∗ (s) and so ⟨s, x⟩ = f (x) + f ∗ (s) and s ∈ ∂f (x).
where equality (⋆) follows by definition of the dual norm and we identify t = ∥y − x∥. Thus

    f∗(sx) + ⟨x, s − sx⟩ + (1/(2L)) ∥s − sx∥∗² ≤ f∗(s)
for all s. As x ∈ ∂f ∗ (sx ), Proposition C.1.5(ii) gives the strong convexity result. The rest is alge-
braic manipulations with sy = ∇f (y) and an application of Proposition C.1.5, part (iii).
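The simplest instance of this smoothness/strong-convexity duality is quadratic (our own example, not from the text): f(x) = (L/2)x² has L-Lipschitz gradient, its conjugate is f∗(s) = s²/(2L), and the conjugate satisfies the (1/L)-strong-convexity lower bound just derived:

```python
L = 4.0

def f_star(s):
    # conjugate of f(x) = (L/2) x^2 is f*(s) = s^2 / (2 L)
    return s * s / (2 * L)

def f_star_grad(s):
    return s / L

for s0 in [-2.0, 0.0, 1.0, 3.0]:
    for k in range(-50, 51):
        s = k / 10.0
        # strong-convexity lower bound with parameter 1/L
        lower = f_star(s0) + f_star_grad(s0) * (s - s0) + (s - s0) ** 2 / (2 * L)
        assert f_star(s) >= lower - 1e-12
```

For this quadratic the bound holds with equality, which is the boundary case of the duality: larger L (a smoother f) yields a flatter, less strongly convex conjugate.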
We can extend Proposition C.2.6 to the uniformly convex case (recall Proposition C.1.7), though for this we require a slightly extended definition of smoothness beyond Lipschitz continuity of the gradients.
Definition C.2. A function f is (L, β, ∥·∥)-smooth if it is differentiable and has β-Hölder continuous gradient with respect to the norm ∥·∥, that is,

    ∥∇f(x) − ∇f(y)∥∗ ≤ L ∥x − y∥^{β−1} for all x, y.
Proposition C.2.7. Let f : Rd → R be (λ, κ)-uniformly convex (C.1.5) with respect to the norm ∥·∥. Then dom f∗ = Rd and ∇f∗ is (λ^{−1/(κ−1)}, κ/(κ−1), ∥·∥∗)-smooth. Conversely, let f : Rd → R be convex and (L, β, ∥·∥)-smooth. Then f∗ is (2^{−1/(β−1)} ((β−1)/β) L^{−1/(β−1)}, β/(β−1))-uniformly convex (C.1.5) with respect to the dual norm ∥·∥∗ on any convex subset of dom ∂f∗, and in particular,

    f∗(s1) ≥ f∗(s0) + ⟨∂f∗(s0), s1 − s0⟩ + ((β−1)/β) L^{−1/(β−1)} ∥s0 − s1∥∗^{β/(β−1)}.
Exercise C.3 asks for a proof of this proposition, which more or less follows from a similar technique as that we use to prove Proposition C.2.6. In short, however, uniform convexity and smoothness are dual to one another (with a minor loss in leading constant multipliers): let κ and κ∗ = κ/(κ − 1) be conjugate, where κ ≥ 2. Then the function f is uniformly convex with exponent κ if and only if f∗ has κ∗-Hölder continuous gradients.
There are more qualitative versions of Proposition C.2.6 that allow us to give a duality between
strict convexity and continuous differentiability of f . Here, we give one typical result.
Proposition C.2.8. Let f : Rd → R be strictly convex and closed. Then int dom f ∗ ̸= ∅ and
f ∗ is continuously differentiable on int dom f ∗ . Conversely, let f : Rd → R be differentiable on
Ω := int dom f . Then f ∗ is strictly convex on each convex C ⊂ ∇f (Ω).
These results should be roughly expected because of the duality that ∇f = (∇f∗)−1 and that ∂f∗(s) = argmin_x {f(x) − ⟨s, x⟩}, because strict convexity guarantees uniqueness of minimizers (Proposition C.1.1), so that ∂f∗ should be a singleton.
Proof To see that int dom f∗ is non-empty, we use the identification f∞′(v) = σ_{dom f∗}(v) in Proposition C.3.5 and the interior identification in Proposition B.2.7. Because f is strictly convex, for any x ∈ dom f and v ≠ 0 we have

    0 < (f(x − tv) − f(x))/t + (f(x + tv) − f(x))/t for t > 0,

and taking t → ∞ gives 0 < f∞′(−v) + f∞′(v). Proposition B.2.7 then shows that int dom f∗ ≠ ∅.
For the claim that f ∗ is continuously differentiable, take s ∈ int dom f ∗ , and suppose for the
sake of contradiction that ∂f ∗ (s) has distinct points x1 , x2 . Then Corollary C.2.4 gives that x1
and x2 both minimize f (x) − ⟨s, x⟩ over x. But Proposition C.1.1 guarantees x1 = x2 , so that
∂f∗(s) = {∇f∗(s)} is a singleton, and hence f∗ is continuously differentiable at s (Proposition B.3.24).
For the converse claim, let C be a convex set as stated. Suppose for the sake of contradiction
that f ∗ is not strictly convex on C, so that there are distinct points s1 , s2 ∈ C for which f ∗ is affine
on the line segment [s1 , s2 ] = {ts1 + (1 − t)s2 | t ∈ [0, 1]}. As C ⊂ ∇f (Ω) is convex, the midpoint
s = (1/2)(s1 + s2) ∈ C and there exists x satisfying ∇f(x) = s, or x ∈ ∂f∗(s). Then because f∗ is assumed affine on [s1, s2], we have f∗(s) = (1/2)f∗(s1) + (1/2)f∗(s2) and ⟨s, x⟩ = (1/2)⟨s1 + s2, x⟩, so

    0 = f(x) + f∗(s) − ⟨s, x⟩ = (1/2)[(f(x) + f∗(s1) − ⟨s1, x⟩) + (f(x) + f∗(s2) − ⟨s2, x⟩)].

Each of the terms in parentheses is nonnegative by the Fenchel–Young inequality, and each is 0 if and only if si ∈ ∂f(x); but by assumption ∂f(x) = {∇f(x)} is a singleton, so we must have s1 = s2, a contradiction.
Thus, at the boundaries of their domains or as their argument tends off to infinity, functions of
Legendre type have slopes tending to ∞. This does not guarantee that f (x) → ∞ as x → bd dom f ,
though it does provide guarantees of regularity that the next theorem highlights.
Theorem C.2.9. Let f be a convex function of Legendre type (Def. C.3). Then f ∗ is strictly
convex, continuously differentiable, and dom f ∗ = Rd .
The theorem implies a number of results on continuity of minimizers and tilted minimiz-
ers (C.1.4), clarifying some of our earlier results. For example, we have the following corollary.
Corollary C.2.10. Let f : Rd → R be a convex function of Legendre type. Then the tilted
minimizer
xu := argmin{f (x) − ⟨u, x⟩}
exists for all u, is continuous in u and unique, and xu ∈ int dom f .
Lemma C.2.11. Let f : Rd → R be closed convex and satisfy the gradient boundary condition that
∥sn ∥ → ∞ for any sequence xn → bd dom f and sn ∈ ∂f (xn ). Then
f∞′ (v) = ∞ for all v ̸= 0
if and only if
∥sn ∥ → ∞ whenever ∥xn ∥ → ∞ and sn ∈ ∂f (xn ).
The theorem follows straightforwardly from Lemma C.2.11. By the boundary conditions (C.2.4)
associated with f , we have f∞′ (v) = ∞ for all v ̸= 0 (Lemma C.2.11). Because the support
function of dom f ∗ satisfies σdom f ∗ = f∞′ (Proposition C.3.5), we see that dom f ∗ = Rd , as
dom f ∗ = {s | ⟨s, v⟩ ≤ σdom f ∗ (v) for all v} (e.g., Proposition B.2.7 or Corollary B.2.2). With
this, f ∗ is continuously differentiable and strictly convex on its domain (Proposition C.2.8).
As a last application of these ideas, in some cases we wish to allow constraints on the functions
f to be minimized, returning to the original convex optimization problem (C.0.1) with f a function
of Legendre type and C a closed convex set. We then have the following corollary.
Corollary C.2.12. Let f be of Legendre type (Definition C.3) and C ⊂ Rd a closed convex set
with int dom f ∩ C ̸= ∅. Define fC (x) = f (x) + IC (x). Then
the tilted minimizer xu := argminx∈C {f (x) − ⟨u, x⟩} exists, is unique and continuous in u, and there exists v ∈ NC (xu ) such that xu = ∇f ∗ (u − v) and ∇f (xu ) = u − v.
Proof The function fC := f +IC is closed convex. To show that dom fC∗ = Rd , we can equivalently
show that (fC )′∞ (v) = ∞ for all non-zero v. Because f is Legendre-type, Lemma C.2.11 guarantees
that if x ∈ dom f ∩ C, then
So dom fC∗ = Rd , and thus xu argminx∈C {f (x) − ⟨u, x⟩} = ∇fC∗ (u) exists and is unique and contin-
uous, as f is strictly convex.
By the standard subgradient conditions for optimality, the vector xu is characterized by
0 ∈ ∇f (xu ) − u + NC (xu ),
and so xu ∈ int dom f (as otherwise ∥∇f (xu )∥ = +∞ by Definition C.3) and
xu = ∇f ∗ (u − v) for some v ∈ NC (xu ).
Determining the boundedness and closure of convex sets often conveniently reduces to checking
for the existence of (infinite) rays remaining within the set. To reduce notational complexity,
throughout this section we will assume that C ⊂ Rd is a closed convex set, making note when results
extend beyond this case. Our starting point is a characterization of various limiting directions that
lie within convex sets. For a point x ∈ C, we define the recession cone
C∞ (x) := {v ∈ Rd | x + tv ∈ C for all t ≥ 0}        (C.3.1)
to be the set of directions at which C extends off to ∞. It is immediate that C∞ (x) is a cone, because
v ∈ C∞ (x) implies tv ∈ C∞ (x) for any t ≥ 0. As in the case of the recession function (C.3.3), this
definition is in fact independent of the choice of x:
Proposition C.3.1. Let C be closed convex. The set C∞ (x) is a closed convex cone, independent
of x, and
C∞ (x) = ∩t>0 (1/t)(C − x).
Proof Let x0 , x1 ∈ C. For the claim that C∞ (x) is independent of x, it is enough to show that
C∞ (x0 ) ⊂ C∞ (x1 ). So let v ∈ C∞ (x0 ) and t ≥ 0, so that x0 + tv ∈ C, and for ϵ ∈ [0, 1] let
yϵ = (1 − ϵ)x0 + ϵx1 + tv = (1 − ϵ)(x0 + (t/(1 − ϵ))v) + ϵx1 .
Then yϵ ∈ C for each ϵ ∈ (0, 1) because x0 + (t/(1 − ϵ))v ∈ C, and y1 = limϵ↑1 yϵ ∈ C as C is closed.
To see the definition of C∞ (x) in terms of the intersection, note that v ∈ C∞ (x) if and only if
tv ∈ C − x for all t > 0, that is, v ∈ 1t (C − x) for all t > 0. That C∞ (x) is thus the intersection of
closed convex sets implies it is closed convex.
In view of Proposition C.3.1, we may simply write C∞ for the recession cone of a convex set C.
The next lemma enumerates a few properties of such recession cones.
Lemma C.3.2. Let C ⊂ Rd be a closed convex set. Then the following hold.
(i) C∞ = {v | C + R+ v ⊂ C} and C = C + C∞ .
(ii) If {Cα }α∈A is a collection of closed convex sets with non-empty intersection, then (∩α Cα )∞ =
∩α (Cα )∞ .
(iii) Let u be a unit vector. Then u ∈ C∞ if and only if there exists a sequence xn ∈ C with
∥xn ∥ → ∞ while xn / ∥xn ∥ → u.
(iv) C is bounded if and only if C∞ = {0}, that is, it has no directions of recession.
Proof The first claim is immediate from the definition of the recession cone and Proposition C.3.1,
so we consider the remainder.
(ii) We have v ∈ (Cα )∞ for each α ∈ A if and only if for some x ∈ ∩α Cα , x + tv ∈ Cα for each α
and t ≥ 0, that is, x + tv ∈ ∩α Cα for all t ≥ 0 and v ∈ (∩α Cα )∞ .
(iii) Let u ∈ C∞ be a unit vector. Let x ∈ C, and take xn = x + nu for n ∈ N. Then clearly
xn / ∥xn ∥ → u, while xn ∈ C for each n. Conversely, assume that xn / ∥xn ∥ → u while
∥xn ∥ → ∞ and xn ∈ C. Let t ≥ 0 and N be large enough that ∥xn ∥ ≥ t for n ≥ N . Then
x + tu = limn [(1 − t/∥xn ∥)x + (t/∥xn ∥)xn ],
and each point (1 − t/∥xn ∥)x + (t/∥xn ∥)xn is a convex combination of x, xn ∈ C, so lies in C for
n ≥ N ; as C is closed, x + tu ∈ C, whence u ∈ C∞ .
(iv) We certainly have C∞ = {0} when C is bounded. If C is unbounded, then we may take a
sequence xn ∈ C with ∥xn ∥ → ∞, and by considering convergent subsequences, may assume
w.l.o.g. that xn / ∥xn ∥ → u for some u ∈ Sd−1 . Then u ∈ C∞ by part (iii), so C∞ ̸= {0}.
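These properties are easy to check numerically for polyhedra (a standard fact, not proved in the text, is that C = {x | Ax ≤ b} has recession cone C∞ = {v | Av ≤ 0}). The sketch below, with illustrative choices of A and b, verifies the definition (C.3.1), independence of the base point, and part (iv):

```python
def dot(row, x):
    return sum(a * b for a, b in zip(row, x))

def in_C(A, b, x, tol=1e-9):
    """Membership test for the polyhedron C = {x | Ax <= b}."""
    return all(dot(row, x) <= bi + tol for row, bi in zip(A, b))

# C = {(x1, x2) | x1 >= 0, x2 >= 0, x2 - x1 <= 1}: an unbounded polyhedron
A = [[-1.0, 0.0], [0.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0, 1.0]

# v = (1, 1) satisfies Av <= 0, so it is a recession direction
v = (1.0, 1.0)
assert all(dot(row, v) <= 1e-9 for row in A)

# x + t v stays in C for arbitrarily large t, from any base point in C
for x in [(1.0, 1.0), (5.0, 2.0)]:  # independence of the base point
    for t in (0.0, 1.0, 1e3, 1e6):
        assert in_C(A, b, [xi + t * vi for xi, vi in zip(x, v)])

# the box [0,1]^2 is bounded: no nonzero v satisfies Av <= 0, so C_infty = {0}
A_box = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
for v_try in [(1.0, 0.0), (-1.0, 1.0), (0.0, -1.0)]:
    assert not all(dot(row, v_try) <= 1e-9 for row in A_box)
```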
Motivated by the conjugate duality relationships that Theorem C.2.1 presents—so that f = f ∗∗
if and only if f is closed—we will find it useful to consider closedness properties of convex sets
C under linear mappings. Recall that for A ∈ Rn×d , we write AC = {Ax | x ∈ C}, and let
null(A) = {z | Az = 0} be its nullspace. We have the following theorem on closedness.
Theorem C.3.3. Let C ⊂ Rd be a closed convex set and A ∈ Rm×d . If C∞ ∩ null(A) is a linear
subspace, then AC is closed.
Proof Let L := C∞ ∩ null(A), which is a linear subspace by assumption. We first use the decomposition
C = (L⊥ ∩ C) + L,
which holds because L ⊂ C∞ ∩ (−C∞ ) and C = C + L (recall Lemma C.3.2, part (i)). Therefore, AC = A(L⊥ ∩ C)
as AL = {0}. Now let y ∈ cl(AC). For ϵ > 0, define the sets
Bϵ := {x | ∥Ax − y∥ ≤ ϵ} and Cϵ := C ∩ L⊥ ∩ Bϵ = {x ∈ C ∩ L⊥ | ∥Ax − y∥ ≤ ϵ}.
Because AC = A(L⊥ ∩ C), the sets Cϵ are non-empty, closed, and nested. If we can show that Cϵ is
bounded, then it will be compact, and the finite intersection property will imply that ∩ϵ>0 Cϵ ̸= ∅,
and so there will be some x ∈ C for which Ax = y.
To show the boundedness, we use Lemma C.3.2. Because (Bϵ )∞ = null(A), part (ii) of
Lemma C.3.2 gives
(Cϵ )∞ = C∞ ∩ (L⊥ )∞ ∩ (Bϵ )∞ = C∞ ∩ L⊥ ∩ null(A) = L ∩ L⊥ = {0},
so that part (iv) of the same lemma implies Cϵ is bounded and hence compact, as desired.
As a representative corollary of Theorem C.3.3, let us revisit the problem of partial minimization,
as in Proposition B.3.11. Let F : Rd × Rn → R be convex, and for x ∈ Rd , y ∈ Rn define the partial
minimization function
f (x) := inf F (x, y).
y
We have seen already that f is convex—we wish to know whether f is closed convex. A sufficient
condition here is that F be closed convex and F (x, ·) be coercive for some x.
Corollary C.3.4. Let F be closed convex. If there exists x0 , t0 for which the set
{y | F (x0 , y) ≤ t0 }
is non-empty and compact, then f is closed convex.
Proof Consider the epigraphs
epi f = {(x, t) | f (x) ≤ t} and epi F = {(x, y, t) | F (x, y) ≤ t} .
Define the projection π : Rd × Rm × R → Rd × R by π(x, y, t) = (x, t) (which is evidently a linear
operator). Then we claim that
π(epi F ) ⊂ epi f ⊂ cl π(epi F ).        (C.3.2)
Indeed, if (x, y, t) ∈ epi F , then (x, t) ∈ epi f trivially, showing the first inclusion. For the second,
note that if (x, t) ∈ epi f , then there must be a sequence yn with lim inf n F (x, yn ) ≤ t. Then for any
ϵ > 0, (x, yn , t + ϵ) ∈ epi F for all large n, and (x, t + ϵ) ∈ π(epi F ). So (x, t) ∈ cl π(epi F ).
In view of the containments (C.3.2), to show that epi f is closed it is enough to show that
π(epi F ) is closed. By Theorem C.3.3, for this it is sufficient to show that (epi F )∞ ∩ null(π) is a
linear subspace, and using that null(π) = {(0, y, 0) | y ∈ Rm }, this set is
(epi F )∞ ∩ null(π) = {(0, vy , 0) | (x, y + tvy , r) ∈ epi F for all t ≥ 0, where (x, y, r) ∈ epi F }.
But as is now familiar from our treatment of recession functions and recession cones, we know that
in the final set, the choices (x, y, r) ∈ epi F are arbitrary (Proposition C.3.1), and in particular,
we may take x = x0 for the x0 making the set {y | F (x0 , y) ≤ t0 } compact. But Lemma C.3.2,
part (iv) guarantees that only vy = 0 satisfies (x0 , y + tvy , r) ∈ epi F for all t ≥ 0, so that
(epi F )∞ ∩ null(π) = {0} is trivially a linear subspace.
Proposition C.3.5. Let f be a closed convex function and f ∗ its convex conjugate. Then for any
x ∈ dom f , we may define the recession function
f∞′ (v) := limt→∞ (f (x + tv) − f (x))/t = supt>0 (f (x + tv) − f (x))/t,        (C.3.3)
which does not depend on the choice of x ∈ dom f , and f∞′ = σdom f ∗ .
Proof That for any fixed x ∈ dom f the limit exists and is equal to the supremum follows
from the criterion of increasing slopes (B.3.4). That f∞′ (v) is independent of x will follow once we
show the second claimed equality, f∞′ = σdom f ∗ , which we obtain by conjugate duality, as f is closed
convex (Theorem C.2.1). Fix x ∈ dom f . Then for any s ∈ dom f ∗ , we evidently have
f (x + tv) ≥ ⟨s, x + tv⟩ − f ∗ (s) for all t ≥ 0,
so that (f (x + tv) − f (x))/t ≥ ⟨s, v⟩ + (⟨s, x⟩ − f ∗ (s) − f (x))/t → ⟨s, v⟩ as t → ∞, and hence
f∞′ (v) ≥ σdom f ∗ (v).
It is particularly interesting to understand the conditions under which dom f ∗ = Rd , that is,
f ∗ is finite everywhere, and relatedly, under which the function x ↦ f (x) − ⟨s, x⟩ has a minimizer.
Recall that f : Rd → R is coercive if f (x) → ∞ whenever ∥x∥ → ∞. For convex functions, we may
characterize coercivity by the recession function of f .
Proposition C.3.6. Let f : Rd → R be closed convex. Then f is coercive if and only if f∞′ (v) > 0
for all v ̸= 0, that is, f is coercive on each line.
Proof If f is coercive, then for any v ̸= 0, f (x + tv) → ∞ as t → ∞, so that for any x ∈ dom f ,
f (x + tv) − f (x) > 0 for large enough t. In particular,
f∞′ (v) = supt>0 (f (x + tv) − f (x))/t > 0
by Proposition C.3.5. Conversely, assume that f∞′ (v) > 0 for all v ̸= 0. Then if f is not coercive, it
must have an unbounded level set Sα := {x | f (x) ≤ α}. As Sα is a convex set, if it is unbounded,
then there exists a unit vector v ∈ (Sα )∞ , the recession cone of Sα , by Lemma C.3.2. But then
x + tv ∈ Sα for all t ≥ 0 implies that f (x + tv) ≤ α for all t ≥ 0, a contradiction to f∞′ (v) > 0.
We call f super-coercive if f (x)/ ∥x∥ → ∞ whenever ∥x∥ → ∞, so that f grows more than
linearly. These concepts are central to the existence of minimizers. A priori, any function with
compact domain is super-coercive, because f (x) = +∞ for x ̸∈ dom f . The recession function f∞′
associated with f , as expression (C.3.3) defines, characterizes these functions as well. Particularly
important are those f satisfying
f∞′ (v) = +∞ for all v ̸= 0,
a class Rockafellar [162] calls copositive functions, as these exhibit superlinear growth on all rays
toward infinity. We can relate this condition to the domains of f ∗ as well: using Proposition B.2.7,
Propositions C.3.5 and C.3.6 give
Corollary C.3.7. Let f : Rd → R be closed convex. Then s ∈ dom f ∗ if and only if ⟨s, v⟩ ≤ f∞′ (v)
for all v. In particular, (i) if f is coercive, then f attains its minimum, and (ii) if f∞′ (v) = +∞ for
all v ̸= 0, then dom f ∗ = Rd and x ↦ f (x) − ⟨s, x⟩ attains its minimum for each s ∈ Rd .
Proof The first claim follows from Propositions C.3.5 and C.3.6 with Proposition B.2.7. For
part (i), if f is coercive, so that f∞′ (v) > 0 for
all v ̸= 0, then 0 ∈ int dom f ∗ , and so f ∗ has a non-trivial subdifferential ∂f ∗ (0) at 0; letting
x ∈ ∂f ∗ (0) we have x ∈ argmin f . Part (ii) is similarly immediate.
As an example consequence, if f is closed convex and c = inf ∥v∥=1 f∞ ′ (v) > 0 is the minimal
value of the recession function, then the tilted function f (·) − ⟨s, ·⟩ has a minimizer whenever
∥s∥ < c.
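Numerically, one can approximate f∞′ (v) = supt>0 (f (x + tv) − f (x))/t by a difference quotient at large t. The sketch below (with illustrative functions, not from the text) shows that f (x) = log(1 + ex ) has recession slopes f∞′ (1) = 1 and f∞′ (−1) = 0, so inf ∥v∥=1 f∞′ (v) = 0 and f is not coercive, while f (x) = x2 has unbounded slopes in every direction, the super-coercive case:

```python
import math

def recession_slope(f, v, x=0.0, t=1e6):
    """Approximate f'_inf(v) = sup_t (f(x + t v) - f(x)) / t via a large t."""
    return (f(x + t * v) - f(x)) / t

def logistic(x):
    # f(x) = log(1 + e^x), computed stably for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def quad(x):
    return x * x

# logistic: slope 1 in direction +1 and slope 0 in direction -1 => not coercive
assert abs(recession_slope(logistic, +1.0) - 1.0) < 1e-6
assert abs(recession_slope(logistic, -1.0) - 0.0) < 1e-6

# x^2: difference quotients grow without bound (f'_inf(v) = +infinity in every
# direction), the super-coercive case, so every tilted function x -> f(x) - s*x
# has a minimizer
assert recession_slope(quad, 1.0) > 1e5
assert recession_slope(quad, 1.0, t=1e8) > recession_slope(quad, 1.0, t=1e6)
```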
We begin with a few straightforward remarks giving sufficient conditions to swap the min and max
(i.e., the infimum and supremum) in the equality (C.4.1), then develop more general theory that
shows when this is indeed possible. When equality (C.4.1) holds, we shall say that strong duality
holds for the saddle point problem.
The starting point is the weak min-max inequality, which holds without conditions on L or the
sets X, Y .
Proposition C.4.1 (The weak min-max inequality). For any sets X, Y and function L,
supy∈Y infx∈X L(x, y) ≤ infx∈X supy∈Y L(x, y).        (C.4.2)
Proof Fix y0 ∈ Y . Then for any x0 ∈ X, inf x∈X L(x, y0 ) ≤ L(x0 , y0 ) ≤ supy∈Y L(x0 , y). Taking
a supremum on the left side implies supy∈Y inf x∈X L(x, y) ≤ supy∈Y L(x0 , y) for any x0 ∈ X. We
may now take an infimum on the right.
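A brute-force grid evaluation makes the weak min-max inequality concrete (an illustrative sketch, not from the text): for the convex-concave bilinear game L(x, y) = xy on [−1, 1]2 the two values coincide (a saddle point at (0, 0)), while for L(x, y) = (x − y)2 , which is convex rather than concave in y, the inequality is strict:

```python
def minimax_values(L, xs, ys):
    """Grid approximations of (sup_y inf_x L, inf_x sup_y L)."""
    sup_inf = max(min(L(x, y) for x in xs) for y in ys)
    inf_sup = min(max(L(x, y) for y in ys) for x in xs)
    return sup_inf, inf_sup

grid = [i / 100.0 - 1.0 for i in range(201)]   # 201 points covering [-1, 1]

# bilinear (convex-concave) game x*y on [-1,1]^2: saddle point at (0,0), no gap
lo, hi = minimax_values(lambda x, y: x * y, grid, grid)
assert lo <= hi                    # the weak min-max inequality
assert abs(lo) < 1e-12 and abs(hi) < 1e-12

# (x - y)^2 is convex in y rather than concave: the inequality can be strict
lo, hi = minimax_values(lambda x, y: (x - y) ** 2, grid, grid)
assert lo <= hi
assert lo == 0.0 and abs(hi - 1.0) < 1e-12   # sup inf = 0 < 1 = inf sup
```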
The simplest condition guaranteeing equality (C.4.1) is the existence of a saddle point (x⋆ , y ⋆ ) ∈
X × Y , meaning a point satisfying
L(x⋆ , y) ≤ L(x⋆ , y ⋆ ) ≤ L(x, y ⋆ ) for all x ∈ X and y ∈ Y.
One of the main concerns of duality theory and min-max games is when such points exist, as they
guarantee the equality (C.4.1) and even achieve the values of the game:
Proposition C.4.2. A pair (x⋆ , y ⋆ ) ∈ X × Y is a saddle point for L on X × Y if and only if all
of the following are true:
i. x⋆ minimizes x ↦ supy∈Y L(x, y) over x ∈ X;
ii. y ⋆ maximizes y ↦ infx∈X L(x, y) over y ∈ Y ;
iii. the saddle point equality (C.4.1) holds.
In this case,
L(x⋆ , y ⋆ ) = infx∈X L(x, y ⋆ ) = supy∈Y L(x⋆ , y) = infx∈X supy∈Y L(x, y) = supy∈Y infx∈X L(x, y).
Proof Suppose first that (x⋆ , y ⋆ ) is a saddle point. Then
infx∈X supy∈Y L(x, y) ≤ supy∈Y L(x⋆ , y) ≤ L(x⋆ , y ⋆ ) ≤ infx∈X L(x, y ⋆ ) ≤ supy∈Y infx∈X L(x, y),
so that in view of the weak min-max inequality (C.4.2), each inequality must actually be an equality,
and equality (C.4.1) holds as well. Thus, infx∈X L(x, y ⋆ ) = supy∈Y infx∈X L(x, y), so that y ⋆ maximizes
infx∈X L(x, y) over y ∈ Y , and similarly x⋆ minimizes supy∈Y L(x, y) over x ∈ X.
Now suppose conditions i–iii hold above. Then
infx∈X supy∈Y L(x, y) = supy∈Y L(x⋆ , y) ≥ L(x⋆ , y ⋆ ) ≥ infx∈X L(x, y ⋆ ) = supy∈Y infx∈X L(x, y).
By assumption, the left and right quantities are equal, so that (x⋆ , y ⋆ ) is a saddle point.
is closed convex. We will extend our arguments with tilted minimizers and convex conjugates, so
that the key object we consider will be the parameterized primal value function
p(u) := infx∈X supy∈Y {L(x, y) + ⟨u, y⟩}.        (C.4.4)
Our first step is to derive the conjugate of this primal value function.
Lemma C.4.3. The conjugate of the value function (C.4.4) satisfies
p∗ (v) = −infx∈X L(x, v) for v ∈ Y, and p∗ (v) = +∞ otherwise.
Proof We expand
p∗ (v) = supu {⟨u, v⟩ − infx∈X supy∈Y {L(x, y) + ⟨u, y⟩}}
= supu supx∈X infy∈Y {⟨u, v − y⟩ − L(x, y)} = supx∈X supu infy∈Y {⟨u, v − y⟩ − L(x, y)} .
Of course, for v ∈ Y , we may always choose y = v in the infimum and always have
infy∈Y {⟨u, v − y⟩ − L(x, y)} ≤ ⟨u, v − v⟩ − L(x, v) = −L(x, v).
Because t < ∞ was arbitrary, we must have supu infy∈Y {⟨u, v − y⟩ − L(x, y)} = +∞ when v ̸∈ Y .
Given Lemma C.4.3, we can almost immediately see that the biconjugate of p will determine
whether we have the saddle point equality (C.4.1). We state this as a proposition.
Proposition C.4.4. Let p be the primal value function (C.4.4). Then
p∗∗ (0) = supy∈Y infx∈X L(x, y).
The saddle point equality (C.4.1) holds if p is lower semicontinuous at 0 with either p(0) < ∞ or
p(0) = ∞ and p(u) > −∞ for all u.
Proof Taking the biconjugate, we have
p∗∗ (u) = supy {⟨y, u⟩ − p∗ (y)} = supy∈Y {⟨y, u⟩ + infx∈X L(x, y)}
by Lemma C.4.3. This shows the first equality of the proposition. For the second claim, apply
Corollary C.2.2, which shows that p∗∗ (0) = p(0) if and only if p is lower semicontinuous at 0.
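The biconjugation driving Proposition C.4.4 can be illustrated numerically (a sketch with an illustrative f , not from the text): a grid-based Legendre-Fenchel transform of f (x) = x2 recovers f ∗ (s) = s2 /4, and transforming twice recovers f itself, matching the fact that f ∗∗ = f for closed convex f (Theorem C.2.1):

```python
def conjugate(vals, grid, s):
    """Discrete Legendre-Fenchel conjugate f*(s) = max_x {s*x - f(x)} over a grid."""
    return max(s * x - fx for x, fx in zip(grid, vals))

xgrid = [i / 100.0 - 5.0 for i in range(1001)]   # 1001 points covering [-5, 5]
f_vals = [x * x for x in xgrid]                  # f(x) = x^2

# the conjugate of x^2 is s^2 / 4 ...
for s in (-2.0, 0.0, 1.0, 3.0):
    assert abs(conjugate(f_vals, xgrid, s) - s * s / 4.0) < 1e-3

# ... and conjugating twice recovers f, since f is closed convex (f** = f)
fstar_vals = [conjugate(f_vals, xgrid, s) for s in xgrid]
for x in (-2.0, -0.5, 0.0, 1.5):
    assert abs(conjugate(fstar_vals, xgrid, x) - x * x) < 1e-3
```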
Theorem C.4.5. Let X, Y be convex sets and let L be closed convex-concave on X × Y . Assume
that the function f , defined by
f (x) := supy∈Y L(x, y) for x ∈ X and f (x) := +∞ otherwise,
satisfies inf x f (x) < ∞ and has compact level sets. Then the value function (C.4.4) is closed convex,
strong duality holds for the saddle point problem (C.4.1), and there exists a nonempty compact set
X ⋆ ⊂ X minimizing f .
Proof By Corollary C.3.4, for F (x, u) = supy∈Y {L(x, y) + ⟨u, y⟩} + IX (x), it is enough to show
that there exists u0 such that x ↦ F (x, u0 ) has compact level sets. Taking u = 0, we have
F (x, 0) = f (x), and so the level sets {x | F (x, 0) ≤ t} are compact by assumption. So p(u) is lower
semicontinuous in u, and p∗∗ (u) = p(u) for all u. Apply Proposition C.4.4 to obtain strong duality.
For the existence of a set of minimizers, note that f (x) is closed convex and has compact level
sets, so its minimizers are attained.
In brief, under the conditions of the theorem, if we let x⋆ ∈ X ⋆ , we have for any y ∈ Y that
supy∈Y L(x⋆ , y) = infx∈X supy∈Y L(x, y) = supy∈Y infx∈X L(x, y) ≤ supy∈Y L(x⋆ , y),
and equality must hold in the final inequality. That is, we have a sort of “partial” saddle point:
there exists x⋆ achieving the value of the game that L implies. This yields the following corollary.
Corollary C.4.6. Let X, Y be convex sets, and let X be compact. Assume L is closed convex-
concave on X × Y . If there exists x ∈ X such that supy∈Y L(x, y) < ∞, then
infx∈X supy∈Y L(x, y) = supy∈Y infx∈X L(x, y),
and there exists a non-empty convex compact set X ⋆ minimizing supy∈Y L(x, y) over x ∈ X.
An elaborated version of these arguments guarantees the existence of saddle points, so that
strong duality (C.4.1) obtains.
Theorem C.4.7. Let X, Y be convex sets and L be closed convex-concave and finite on X × Y .
Then the set X ⋆ × Y ⋆ of saddle points for L on X × Y is convex, compact, and non-empty if any
of the following conditions hold:
i. The functions f (x) := supy∈Y L(x, y)+IX (x) and g(y) := supx∈X −L(x, y)+IY (y) are coercive.
Proof Theorem C.4.5 guarantees that infx∈X supy∈Y L(x, y) = supy∈Y infx∈X L(x, y) under any
of the conditions. It also guarantees the existence of a compact set of minimizers X ⋆ ⊂ X such
that
supy∈Y L(x⋆ , y) = infx∈X supy∈Y L(x, y)
for x⋆ ∈ X ⋆ . Flipping the roles of x and y, we see similarly that there exists a compact convex set
Y ⋆ ⊂ Y such that
supx∈X −L(x, y ⋆ ) = infy∈Y supx∈X −L(x, y),
that is, infx∈X L(x, y ⋆ ) = supy∈Y infx∈X L(x, y). Proposition C.4.2 gives the theorem.
Further reading
There are a variety of references on the topic, beginning with the foundational book by Rockafellar
[162], which initiated the study of convex functions and optimization in earnest. Since then, a
variety of authors have written (perhaps more easily approachable) books on convex functions,
optimization, and their related calculus. Hiriart-Urruty and Lemaréchal [111] have written two
volumes explaining in great detail finite-dimensional convex analysis, and provide a treatment of
some first-order algorithms for solving convex problems. Borwein and Lewis [36] and Luenberger
[141] give general treatments that include infinite-dimensional convex analysis, and Bertsekas [29]
gives a variety of theoretical results on duality and optimization theory.
There are, of course, books that combine theoretical treatment with questions of convex mod-
eling and procedures for solving convex optimization problems (problems for which the objective
and constraint sets are all convex). Boyd and Vandenberghe [38] give a very readable treatment
for those who wish to use convex optimization techniques and modeling, as well as the basic results
in convex analytic background and duality theory. Ben-Tal and Nemirovski [22], as well as Ne-
mirovski’s various lecture notes, give a theory of the tractability of computing solutions to convex
optimization problems as well as methods for solving them.
C.5 Exercises
Exercise C.1: Show that the alternative increasing slopes condition (B.3.5) is equivalent to
convexity of f .
Exercise C.2: Prove Proposition C.1.7. Hint: See the proof of Proposition C.1.5.
Exercise C.3: Prove Proposition C.1.8. Hint: See the proof of Proposition C.1.6.
Exercise C.4: Do the uniform convexity version of Proposition C.1.5.
Exercise C.5: Do the uniform convexity version of Proposition C.1.6.
Bibliography
[3] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit prob-
lem. In Proceedings of the Twenty Fifth Annual Conference on Computational Learning
Theory, 2012.
[4] R. Ahlswede, P. Gács, and J. Körner. Bounds on conditional probabilities with applications in
multi-user communication. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete,
34(2):157–177, 1976.
[5] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest
neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
[6] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution
from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.
[9] S. Artstein, K. Ball, F. Barthe, and A. Naor. Solution of Shannon’s problem on the mono-
tonicity of entropy. Journal of the American Mathematical Society, 17(4):975–982, 2004.
[10] P. Assouad. Deux remarques sur l’estimation. Comptes Rendus des Séances de l’Académie
des Sciences, Série I, 296(23):1021–1024, 1983.
[11] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring.
Journal of Machine Learning Research, pages 2635–2686, 2010.
[12] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine Learning, 47(2-3):235–256, 2002.
[14] B. Balle, G. Barthe, and M. Gaboardi. Privacy amplification by subsampling: Tight analyses
via couplings and divergences. In Advances in Neural Information Processing Systems 31,
pages 6277–6287, 2018.
[16] A. Barron. Entropy and the central limit theorem. Annals of Probability, 14(1):336–342,
1986.
[18] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions
on Information Theory, 37:1034–1054, 1991.
[19] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101:138–156, 2006.
[20] R. Bassily, A. Smith, T. Steinke, and J. Ullman. More general queries and less generalization
error in adaptive data analysis. arXiv:1503.04843 [cs.LG], 2015.
[21] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability
for adaptive data analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on
the Theory of Computing, pages 1046–1059, 2016.
[22] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. SIAM, 2001.
[23] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful
approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300,
1995.
[24] D. Berend and A. Kontorovich. On the concentration of the missing mass. Electronic Com-
munications in Probability, 18:1–7, 2018.
[25] J. O. Berger. The case for objective Bayesian analysis. Bayesian Analysis, 1(3):385–402,
2006.
[26] J. O. Berger, J. Bernardo, and D. Sun. The formal definition of reference priors. Annals of
Statistics, 37(2):905–938, 2009.
[27] R. Berk, L. Brown, A. Buja, K. Zhang, and L. Zhao. Valid post-selection inference. Annals
of Statistics, 41(2):802–837, 2013.
[28] J. M. Bernardo. Reference analysis. In D. Day and C. R. Rao, editors, Bayesian Thinking,
Modeling and Computation, volume 25 of Handbook of Statistics, chapter 2, pages 17–90.
Elsevier, 2005.
[31] L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete, 65:181–238, 1983.
[32] L. Birgé. A new lower bound for multiple hypothesis testing. IEEE Transactions on Infor-
mation Theory, 51(4):1611–1614, 2005.
[33] L. Birgé and P. Massart. Estimation of integral functionals of a density. Annals of Statistics,
23(1):11–29, 1995.
[34] J. Blasiok, P. Gopalan, L. Hu, and P. Nakkiran. A unifying theory of distance from calibration.
In Proceedings of the Fifty-Fifth Annual ACM Symposium on the Theory of Computing, 2023.
URL https://arxiv.org/abs/2211.16886.
[35] J. Blasiok, P. Gopalan, L. Hu, and P. Nakkiran. When does optimizing a proper loss yield
calibration? In Advances in Neural Information Processing Systems 36, 2023. URL https:
//arxiv.org/abs/2305.18764.
[36] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[38] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[40] G. Brown, M. Bun, V. Feldman, A. Smith, and K. Talwar. When is memorization of irrelevant
training data necessary for high-accuracy learning? In Proceedings of the Fifty-Third Annual
ACM Symposium on the Theory of Computing, pages 123–132, 2021.
[42] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[43] V. Buldygin and Y. Kozachenko. Metric Characterization of Random Variables and Random
Processes, volume 188 of Translations of Mathematical Monographs. American Mathematical
Society, 2000.
[44] T. Cai and M. Low. Testing composite hypotheses, Hermite polynomials and optimal esti-
mation of a nonsmooth functional. Annals of Statistics, 39(2):1012–1041, 2011.
[45] T. T. Cai, Y. Wang, and L. Zhang. The cost of privacy: optimal rates of convergence for
parameter estimation with differential privacy. Annals of Statistics, 49(5):2825–2850, 2021.
[46] T. T. Cai, Y. Wang, and L. Zhang. Score attack: A lower bound technique for optimal
differentially private learning. arXiv:2303.07152 [math.ST], 2023.
[47] E. J. Candès and M. A. Davenport. How well can we estimate a sparse vector. Applied and
Computational Harmonic Analysis, 34(2):317–323, 2013.
[49] O. Catoni and I. Giulini. Dimension-free PAC-bayesian bounds for matrices, vectors, and
linear least squares regression. arXiv:1712.02747 [math.ST], 2017.
[50] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University
Press, 2006.
[51] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural
Information Processing Systems 24, 2011.
[53] B. S. Clarke and A. R. Barron. Jeffreys’ prior is asymptotically least favorable under entropy
risk. Journal of Statistical Planning and Inference, 41:37–60, 1994.
[54] J. E. Cohen, Y. Iwasa, G. Rautu, M. B. Ruskai, E. Seneta, and G. Zbaganu. Relative entropy
under mappings by stochastic matrices. Linear Algebra and its Applications, 179:211–235,
1993.
[56] J. Couzin. Whole-genome data not anonymous, challenging assumptions. Science, 321(5894):
1278, 2008.
[57] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley,
2006.
[61] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless
Systems. Cambridge University Press, second edition, 2011.
[62] T. Dalenius. Towards a methodology for statistical disclosure control. Statistik Tidskrift, 15:
429–444, 1977.
[63] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss.
Random Structures and Algorithms, 22(1):60–65, 2002.
[64] A. Dawid and V. Vovk. Prequential probability: principles and properties. Bernoulli, 5:
125–162, 1999.
[65] P. Del Moral, M. Ledoux, and L. Miclo. On contraction properties of Markov kernels. Prob-
ability Theory and Related Fields, 126:395–420, 2003.
[66] R. L. Dobrushin. Central limit theorem for nonstationary Markov chains. I. Theory of
Probability and Its Applications, 1(1):65–80, 1956.
[67] R. L. Dobrushin. Central limit theorem for nonstationary Markov chains. II. Theory of
Probability and Its Applications, 1(4):329–383, 1956.
[68] D. L. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Annals
of Statistics, 32(3), 2004.
[69] S. Du, S. Kakade, R. Wang, and L. Yang. Is a good representation sufficient for sample
efficient reinforcement learning? In Proceedings of the Eighth International Conference on
Learning Representations, 2020.
[70] J. C. Duchi and R. Rogers. Lower bounds for locally private estimation via communica-
tion complexity. In Proceedings of the Thirty Second Annual Conference on Computational
Learning Theory, 2019.
[71] J. C. Duchi and F. Ruan. A constrained risk inequality for general losses. In Proceedings of
the 23nd International Conference on Artificial Intelligence and Statistics, 2020.
[72] J. C. Duchi and M. J. Wainwright. Distance-based and continuum Fano inequalities with
applications to statistical estimation. arXiv:1311.2669 [cs.IT], 2013.
[74] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy, data processing inequalities,
and minimax rates. arXiv:1302.3203 [math.ST], 2013. URL http://arxiv.org/abs/1302.
3203.
[75] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates.
In 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
[77] R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.
[78] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and
Trends in Theoretical Computer Science, 9(3 & 4):211–407, 2014.
[79] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves:
Privacy via distributed noise generation. In Advances in Cryptology (EUROCRYPT 2006),
2006.
[80] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private
data analysis. In Proceedings of the Third Theory of Cryptography Conference, pages 265–284,
2006.
[81] C. Dwork, G. N. Rothblum, and S. P. Vadhan. Boosting and differential privacy. In 51st
Annual Symposium on Foundations of Computer Science, pages 51–60, 2010.
[82] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statistical
validity in adaptive data analysis. arXiv:1411.2664v2 [cs.LG], 2014.
[83] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statis-
tical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM
Symposium on the Theory of Computing, 2015.
[84] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. The reusable holdout:
Preserving statistical validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
[85] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep
(stochastic) neural networks with many more parameters than training data. In Proceedings
of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.
[86] A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving
data mining. In Proceedings of the Twenty-Second Symposium on Principles of Database
Systems, pages 211–222, 2003.
[87] K. Fan. Minimax theorems. Proceedings of the National Academy of Sciences, 39(1):42–47,
1953.
[88] V. Feldman and T. Steinke. Calibrating noise to variance in adaptive data analysis. In
Proceedings of the Thirty First Annual Conference on Computational Learning Theory, 2018.
URL http://arxiv.org/abs/1712.07196.
[89] G. Folland. Real Analysis: Modern Techniques and their Applications. Pure and Applied
Mathematics. John Wiley & Sons, second edition, 1999.
[91] D. Foster, S. Kakade, J. Qian, and A. Rakhlin. The statistical complexity of interactive
decision making. arXiv:2112.13487v3 [cs.LG], 2021.
[92] D. Foster, N. Golowich, and Y. Han. Tight guarantees for interactive decision making with
the decision-estimation coefficient. In Proceedings of the Thirty Sixth Annual Conference on
Computational Learning Theory, 2023.
[94] D. P. Foster and S. Hart. “Calibeating”: Beating forecasters at their own game.
arXiv:2209.0489 [econ.TH], 2022.
[95] A. Franco, N. Malhotra, and G. Simonovits. Publication bias in the social sciences: Unlocking
the file drawer. Science, 345(6203):1502–1505, 2014.
[96] D. A. Freedman. On tail probabilities for martingales. Annals of Probability, 3(1):100–118,
1975.
[97] R. Gallager. Source coding with side information and universal coding. Technical Report
LIDS-P-937, MIT Laboratory for Information and Decision Systems, 1979.
[98] D. Garcı́a-Garcı́a and R. C. Williamson. Divergences and risks for multiclass experiments.
In Proceedings of the Twenty Fifth Annual Conference on Computational Learning Theory,
2012.
[99] A. Garg, T. Ma, and H. L. Nguyen. On communication cost of distributed statistical estima-
tion and dimensionality. In Advances in Neural Information Processing Systems 27, 2014.
[100] A. Gelman and E. Loken. The garden of forking paths: Why multiple comparisons can
be a problem, even when there is no “fishing expedition” or “p-hacking” and the research
hypothesis was posited ahead of time. Technical report, Columbia University, 2013.
[101] R. P. Gilbert. Function Theoretic Methods in Partial Differential Equations. Academic Press,
1969.
[102] R. D. Gill and B. Y. Levit. Applications of the van Trees inequality: a Bayesian Cramér-Rao
bound. Bernoulli, 1(1–2):59–79, 1995.
[103] T. Gneiting and A. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal
of the American Statistical Association, 102(477):359–378, 2007.
[105] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[106] A. Guntuboyina. Lower bounds for the minimax risk using f-divergences, and applications.
IEEE Transactions on Information Theory, 57(4):2386–2399, 2011.
[107] L. Györfi and T. Nemetz. f-dissimilarity: A generalization of the affinity of several distributions. Annals of the Institute of Statistical Mathematics, 30:105–113, 1978.
[109] R. Z. Has’minskii. A lower bound on the risks of nonparametric estimates of densities in the
uniform metric. Theory of Probability and Its Applications, 23:794–798, 1978.
[110] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,
second edition, 2009.
[111] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II.
Springer, New York, 1993.
[112] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the
American Statistical Association, 58(301):13–30, Mar. 1963.
[129] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.
[130] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in
Applied Mathematics, 6:4–22, 1985.
[131] J. Langford and R. Caruana. (Not) bounding the true error. In Advances in Neural Information Processing Systems 14, 2001.
[132] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[133] T. Lattimore, C. Szepesvári, and G. Weisz. Learning with good feature representations
in bandits and in RL with a generative model. In Proceedings of the 37th International
Conference on Machine Learning, 2020.
[136] E. L. Lehmann and G. Casella. Theory of Point Estimation, Second Edition. Springer, 1998.
[137] F. Liese and I. Vajda. On divergences and informations in statistics and information theory.
IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
[138] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[139] J. Liu, R. van Handel, and S. Verdú. Second-order converses via reverse hypercontractivity.
Mathematical Statistics and Learning, 2(2):103–163, 2019.
[140] Y. Liu. Fisher consistency of multicategory support vector machines. In Proceedings of the 11th
International Conference on Artificial Intelligence and Statistics, pages 291–298, 2007.
[142] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods.
Annals of Statistics, 32(1):30–55, 2004.
[143] M. Madiman and A. Barron. Generalized entropy power inequalities and monotonicity prop-
erties of information. IEEE Transactions on Information Theory, 53(7):2317–2329, 2007.
[148] F. McSherry and K. Talwar. Mechanism design via differential privacy. In 48th Annual
Symposium on Foundations of Computer Science, 2007.
[149] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in
Theoretical Computer Science, 1(2):117–236, 2005.
[150] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation ap-
proach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[151] A. Nowak-Vila, F. Bach, and A. Rudi. Consistent structured prediction with max-min margin
Markov networks. In Proceedings of the 37th International Conference on Machine Learning,
2020.
[153] D. Petz. A survey of certain trace inequalities. Banach Center Publications, 30:287–298,
1994.
[154] Y. Polyanskiy and Y. Wu. Strong data-processing inequalities for channels and Bayesian
networks. In Convexity and Concentration, volume 161 of The IMA Volumes in Mathematics
and its Applications. Springer, 2017.
[155] Y. Polyanskiy and Y. Wu. Information Theory: From Coding to Learning. Cambridge
University Press, 2024.
[157] M. Raginsky. Strong data processing inequalities and ϕ-Sobolev inequalities for discrete
channels. IEEE Transactions on Information Theory, 62(6):3355–3389, 2016.
[159] A. Rao and A. Yehudayoff. Communication Complexity and Applications. Cambridge Uni-
versity Press, 2020.
[160] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional
linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994,
2011.
[161] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 55:527–535, 1952.
[164] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal
of Machine Learning Research, to appear, 2014.
[165] D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. In
Advances in Neural Information Processing Systems 27, 2014.
[166] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of
Operations Research, 39(4):1221–1243, 2014.
[167] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, 1948.
[169] G. R. Shorack and J. A. Wellner. Empirical Processes with Applications to Statistics, volume 59. SIAM, 2009.
[170] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
[171] A. Slavkovic and F. Yu. Genomics and privacy. Chance, 28(2):37–39, 2015.
[172] C. Stein. Efficient nonparametric testing and estimation. In Proceedings of the Third Berkeley
Symposium on Mathematical Statistics and Probability, pages 187–195, 1956.
[173] I. Steinwart. How to compare different loss functions and their risks. Constructive Approxi-
mation, 26:225–287, 2007.
[175] R. Sutton and A. Barto. Reinforcement Learning: An Introduction (Second Edition). MIT
Press, 2018.
[176] T. Tao. An Epsilon of Room, I: Real Analysis (pages from year three of a mathematical blog),
volume 117 of Graduate Studies in Mathematics. American Mathematical Society, 2010.
[177] B. Taskar. Learning Structured Prediction Models: A Large Margin Approach. PhD thesis,
Stanford University, 2005.
[178] B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In Empirical
Methods in Natural Language Processing, 2004.
[179] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.
[180] R. Tibshirani, J. Taylor, R. Lockhart, and R. Tibshirani. Exact post-selection inference for
sequential regression procedures. Journal of the American Statistical Association, 111(514):
600–620, 2016.
[181] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Compu-
tational Mathematics, 12:389–434, 2012.
[184] A. W. van der Vaart. Superefficiency. In D. Pollard, E. Torgersen, and G. Yang, editors,
Festschrift for Lucien Le Cam, chapter 27. Springer, 1997.
[185] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, 1998.
[186] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With
Applications to Statistics. Springer, New York, 1996.
[189] A. Wald. Contributions to the theory of statistical estimation and testing hypotheses. Annals
of Mathematical Statistics, 10(4):299–326, 1939.
[190] S. Warner. Randomized response: a survey technique for eliminating evasive answer bias.
Journal of the American Statistical Association, 60(309):63–69, 1965.
[191] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best poly-
nomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.
[193] A. C.-C. Yao. Some complexity questions related to distributive computing (preliminary
report). In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing,
pages 209–213. ACM, 1979.
[194] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435.
Springer-Verlag, 1997.
[195] T. Zhang. Statistical analysis of some multi-category large margin classification methods.
Journal of Machine Learning Research, 5:1225–1251, 2004.
[196] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk
minimization. Annals of Statistics, 32:56–85, 2004.
[197] T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. The Annals
of Statistics, 33:1538–1579, 2005.
[199] N. Zhivotovskiy. Dimension-free bounds for sums of independent matrices and simple tensors
via the variational principle. Electronic Journal of Probability, 29:1–28, 2024.