
Toward Faster Methods in Bayesian Unsupervised Learning

by
Tin D. Nguyen
B.S.E. Operations Research and Financial Engineering, Princeton University, 2018
S.M. Electrical Engineering and Computer Science, MIT, 2020

Submitted to the Department of Electrical Engineering and Computer Science


in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2024

© 2024 Tin D. Nguyen. This work is licensed under a CC BY 4.0 license.


The author hereby grants to MIT a nonexclusive, worldwide, irrevocable, royalty-free license
to exercise any and all rights under copyright, including to reproduce, preserve, distribute
and publicly display copies of the thesis, or release the thesis under an open-access license.

Authored by: Tin D. Nguyen


Department of Electrical Engineering and Computer Science
May 12, 2024

Certified by: Tamara Broderick


Associate Professor of Electrical Engineering and Computer Science
Thesis Supervisor

Accepted by: Leslie A. Kolodziejski


Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
Toward Faster Methods in Bayesian Unsupervised Learning
by
Tin D. Nguyen
Submitted to the Department of Electrical Engineering and Computer Science
on May 12, 2024 in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

ABSTRACT
Many data analyses can be seen as discovering a latent set of traits in a population.
For example, what are the themes, or topics, behind Wikipedia documents? To encode
structural information in these unsupervised learning problems, such as the hierarchy among
words, documents, and latent topics, one can use Bayesian probabilistic models. The
application of Bayesian unsupervised learning faces three computational challenges. Firstly,
existing works aim to speed up Bayesian inference via parallelism, but these methods
struggle in Bayesian unsupervised learning due to the so-called “label-switching problem”.
Secondly, in Bayesian nonparametrics for unsupervised learning, computers cannot learn
the distribution over the countable infinity of random variables posited by the model in
finite time. Finally, to assess the generalizability of Bayesian conclusions, we might want to
detect the posterior’s sensitivity to the removal of a very small amount of data, but checking
this sensitivity directly takes an intractably long time. My thesis addresses the first two
computational challenges, and establishes a first step in tackling the last one. I utilize a
known representation of the probabilistic model to evade the label-switching problem: when
parallel processors are available, I derive fast estimates of Bayesian posteriors in unsupervised
learning. Generalizing existing works and providing more guidance, I derive accurate and easy-
to-use finite approximations for infinite-dimensional priors. Lastly, I assess generalizability in
supervised Bayesian models, which can be seen as a precursor to the models used in Bayesian
unsupervised learning. In supervised models, I develop and test a computationally efficient
tool to detect sensitivity regarding data removals for analyses based on MCMC.

Thesis supervisor: Tamara Broderick


Title: Associate Professor of Electrical Engineering and Computer Science

I am grateful to my advisor, Tamara, for her guidance, support, and encouragement. I am
also grateful to my committee members, Stefanie and Ashia, for their feedback and advice. I
dedicate this thesis to my mom and dad. They have taught me to be kind, to work hard,
and to do a good job for its own sake.

Contents

Title page 1

Abstract 3

List of Figures 11

Introduction 15

1 Fast and Accurate MCMC Through Couplings 19


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.1 Random Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.2 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.3 An Unbiased Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.4 Couplings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3 Our Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3.1 Coupling For Gibbs On Partitions . . . . . . . . . . . . . . . . . . . . 24
1.3.2 An Optimal Transport Coupling . . . . . . . . . . . . . . . . . . . . . 25
1.3.3 Extension To Other Samplers . . . . . . . . . . . . . . . . . . . . . . 26
1.3.4 Variance Reduction Via Trimming . . . . . . . . . . . . . . . . . . . . 27
1.4 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4.1 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.4.2 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.5.1 Models, Datasets, And Implementation . . . . . . . . . . . . . . . . . 30
1.5.2 Improved Accuracy With Coupling . . . . . . . . . . . . . . . . . . . 30
1.5.3 Faster Meeting With OT Couplings . . . . . . . . . . . . . . . . . . . 33
1.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Appendices 35
1.A Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.B Functions of Interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.C Unbiasedness Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.D Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.E Label-Switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

1.E.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.E.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.F Trimming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.G Additional Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.G.1 Target Distributions And Gibbs Conditionals . . . . . . . . . . . . . 43
1.G.2 General Markov Chain Settings . . . . . . . . . . . . . . . . . . . . . 45
1.G.3 Dataset Preprocessing, Hyperparameters, Dataset-Specific Markov
Chain Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.G.4 Visualizing Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . 47
1.H All Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.H.1 gene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.H.2 synthetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.H.3 seed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.H.4 abalone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.H.5 k-regular . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.I Metric Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.I.1 Definition Of Variation Of Information Metric . . . . . . . . . . . . . 49
1.I.2 Impact Of Metric On Meeting Time . . . . . . . . . . . . . . . . . . . 51
1.J Extension to Split-Merge Samplers . . . . . . . . . . . . . . . . . . . . . . 51
1.K More RMSE Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.K.1 Different Functions Of Interest . . . . . . . . . . . . . . . . . . . . . . 52
1.K.2 Different Minimum Iteration (m) Settings . . . . . . . . . . . . . . . 52
1.K.3 Different Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.K.4 Different DPMM Hyperparameters . . . . . . . . . . . . . . . . . . . 53
1.L More Meeting Time Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.M Estimates of Predictive Density . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.M.1 Data, Target Model, And Definition Of Posterior Predictive . . . . . 54
1.M.2 Estimates Of Posterior Predictive Density . . . . . . . . . . . . . . . 55
1.M.3 Posterior Predictives Become More Like the True Data-Generating Density 56

2 Finite Approximations of Nonparametric Priors 63


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.3 Automated independent finite approximations . . . . . . . . . . . . . . . . . 68
2.3.1 Applying our approximation to CRMs . . . . . . . . . . . . . . . . . 69
2.3.2 Applying our approximation to exponential family CRMs . . . . . . . 71
2.3.3 Normalized independent finite approximations . . . . . . . . . . . . . 73
2.4 Non-asymptotic error bounds . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.4.1 Bounds when approximating an exponential family CRM . . . . . . . 74
2.4.2 Approximating a (hierarchical) Dirichlet process . . . . . . . . . . . . 80
2.5 Conceptual benefits of finite approximations . . . . . . . . . . . . . . . . . . 82
2.6 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.6.1 Image denoising with the beta–Bernoulli process . . . . . . . . . . . . 85
2.6.2 Topic modelling with the modified hierarchical Dirichlet process . . . 87
2.6.3 Comparing predictions across independent finite approximations . . . 88

2.6.4 Discount estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.6.5 Dispersion estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Appendices 93
2.A Additional examples of AIFA construction . . . . . . . . . . . . . . . . . . . 93
2.B Proofs of AIFA convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.B.1 AIFA converges to CRM in distribution . . . . . . . . . . . . . . . . . 95
2.B.2 Differentiability of smoothed indicator . . . . . . . . . . . . . . . . . 100
2.B.3 Normalized AIFA EPPF converges to NCRM EPPF . . . . . . . . . . 101
2.C Marginal processes of exponential CRMs . . . . . . . . . . . . . . . . . . . . 103
2.D Admissible hyperparameters of extended gamma process . . . . . . . . . . . 106
2.E Technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.E.1 Concentration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.E.2 Total variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
2.E.3 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.F Verification of upper bound’s assumptions for additional examples . . . . . . 121
2.F.1 Gamma–Poisson with zero discount . . . . . . . . . . . . . . . . . . . 121
2.F.2 Beta–negative binomial with zero discount . . . . . . . . . . . . . . . 123
2.G Proofs of CRM bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.G.1 Upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.G.2 Lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
2.H DPMM results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.H.1 Upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.H.2 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.I Proofs of DP bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2.I.1 Upper bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
2.I.2 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
2.J More ease-of-use results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
2.J.1 Conceptual results (continued.) . . . . . . . . . . . . . . . . . . . . . 149
2.J.2 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
2.K Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
2.K.1 Image denoising using the beta–Bernoulli process . . . . . . . . . . . 153
2.K.2 Topic modelling with the modified HDP . . . . . . . . . . . . . . . . 154
2.K.3 Comparing IFAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
2.K.4 Beta process hyperparameter estimation . . . . . . . . . . . . . . . . 158
2.K.5 Dispersion estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
2.L Additional experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
2.L.1 Denoising other images . . . . . . . . . . . . . . . . . . . . . . . . . . 163
2.L.2 Effect of AIFA tuning hyperparameters . . . . . . . . . . . . . . . . . 164
2.L.3 Estimation of mass and concentration . . . . . . . . . . . . . . . . . . 164

3 Sensitivity of MCMC to Small-Data Removals 167
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
3.1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.2.1 Bayesian data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.2.2 Drop-data non-robustness . . . . . . . . . . . . . . . . . . . . . . . . 170
3.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
3.3.1 Taylor series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
3.3.2 Estimating the influence . . . . . . . . . . . . . . . . . . . . . . . . . 175
3.3.3 Confidence intervals for AMIP . . . . . . . . . . . . . . . . . . . . . . 176
3.3.4 Putting everything together . . . . . . . . . . . . . . . . . . . . . . . 179
3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
3.4.1 Estimate coverage of confidence interval for AMIP . . . . . . . . . . . 180
3.4.2 Estimate coverage of confidence intervals for sum-of-influence . . . . . 180
3.4.3 Re-running MCMC on interpolation path . . . . . . . . . . . . . . . . 181
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
3.5.1 Linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
3.5.2 Hierarchical model on microcredit data . . . . . . . . . . . . . . . . . 184
3.5.3 Hierarchical model on tree mortality data . . . . . . . . . . . . . . . 189
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

Appendices 197
3.A Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
3.A.1 Accuracy of first-order approximation . . . . . . . . . . . . . . . . . . 197
3.A.2 Estimator properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
3.B Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
3.B.1 Taylor series proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
3.B.2 First-order accuracy proofs . . . . . . . . . . . . . . . . . . . . . . . . 203
3.B.3 Consistency and asymptotic normality proofs . . . . . . . . . . . . . 207
3.C Additional Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . 214
3.C.1 Linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
3.C.2 Hierarchical model for microcredit data . . . . . . . . . . . . . . . . . 214
3.C.3 Hierarchical model for tree mortality data . . . . . . . . . . . . . . . 215

Conclusion 217

References 235

List of Figures

1.1 Lower error at high process count using our estimator (blue) versus using naive
parallelism (red). For details, see Section 1.5.2. . . . . . . . . . . . . . . . . . 20
1.2 Top row and bottom row give results for gene and k-regular, respectively.
The first two columns show that coupled chains provide better point estimates
than naive parallelism. The third column shows that confidence intervals based
on coupled chains are better than those from naive parallelism. The fourth
column shows that OT coupling meets in less time than label-based couplings. 31
1.3 Coupled-chain estimates have large outliers. Meanwhile, naive parallelism
estimates have substantial bias that does not go away with replication. . . . 32
1.F.1 Trimmed mean has better RMSE than sample mean on Example 1.F.1. Left
panel plots RMSE versus J. Right panel gives boxplots for J = 1000. . . . . 44
1.G.1 Visualizing synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.H.1 Results on gene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.H.2 Results on synthetic. Figure legends are the same as Figure 1.H.1. The
results are consistent with Figure 1.2. . . . . . . . . . . . . . . . . . . . . . . 49
1.H.3 Results on seed. Figure legends are the same as Figure 1.H.1. The results are
consistent with Figure 1.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.H.4 Results on abalone. Similar to Figure 1.2, coupled chains perform better
than naive parallelism with more processes, and our coupling yields smaller
meeting times than label-based couplings. See Figure 1.H.5 for the performance
of trimmed estimators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.H.5 Effect of trimming amount on abalone. . . . . . . . . . . . . . . . . . . . 52
1.H.6 Results on k-regular. Figure legends are the same as Figure 1.H.1. . . . . 53
1.I.1 Hamming and VI metrics induce similar meeting times . . . . . . . . . . . 54
1.J.1 Split-merge results on gene . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
1.J.2 Split-merge results on synthetic . . . . . . . . . . . . . . . . . . . . . . . . 56
1.K.1 Co-clustering results for clustering data sets. . . . . . . . . . . . . . . . . 59
1.K.2 Impact of different m on the RMSE. The first two panels are LCP estimation
for seed. The last two panels are CC(0, 1) estimation for synthetic. . . . . 60
1.K.3 RMSE and intervals for gene on k-means initialization. . . . . . . . . . . . 60
1.K.4 The bias in naive parallel estimates is a function of the DPMM hyperparameters. 60
1.L.1 Meeting time under OT coupling is better than alternative couplings on
Erdős–Rényi graphs, indicated by the fast decrease of the survival functions. 61
1.M.1 Posterior predictive density for different numbers of observations N . . . . . 61

2.6.1 AIFA and TFA denoised images have comparable quality. (a) The noiseless
image. (b) The corrupted image. (c,d) Sample denoised images from finite
models with K = 60. We report PSNR (in dB) with respect to the noiseless
image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.6.2 (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K.
Error bars depict 1-standard-deviation ranges across 5 trials. (b,c) How PSNR
evolves during inference across 10 trials, with 5 each starting from cold or
warm starts, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.6.3 (a) Test log-likelihood (testLL) as a function of approximation level K. Error
bars show 1 standard deviation across 5 trials. (b,c) TestLL change during
inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.6.4 (a) The left panel shows the average predictive log-likelihood of the AIFA (blue)
and BFRY IFA (red) as a function of the approximation level K; the average is
across 10 trials with different random seeds for the stochastic optimizer. The
right panel shows highest predictive log-likelihood across the same 10 trials.
(b) The panels are analogous to (a), except the GenPar IFA is in red. . . . . 89
2.6.5 (a) We estimate the discount by maximizing the marginal likelihood of the
AIFA (left) or the full process (right). The solid blue line is the median of the
estimated discounts, while the lower and upper bounds of the error bars are
the 20% and 80% quantiles. The black dashed line is the ideal value of the
estimated discount, equal to the ground-truth discount. (b) In each panel, the
solid red line is the average log of negative log marginal likelihood (LNLML)
across batches. The light red region depicts two standard errors in either
direction from the mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.6.6 Blue histograms show posterior density estimates for τ from MCMC draws.
The ground-truth τ (solid red line) is 0.7 in the overdispersed case (upper
row) and 1.5 in the underdispersed case (lower row). The threshold τ = 1
(dashed black line) marks the transition from overdispersion (τ < 1.0) to
underdispersion (τ > 1.0). The percentile in each panel’s title is the percentile
where the ground truth τ falls in the posterior draws. The approximation size
K of the AIFA increases in the plots from left to right. . . . . . . . . . . . . 91
2.L.1 Sample AIFA and TFA denoised images have comparable quality. (a) shows
the noiseless image. (b) shows the corrupted image. (c,d) are sample denoised
images from finite models with K = 60. PSNR (in dB) is computed with
respect to the noiseless image. . . . . . . . . . . . . . . . . . . . . . . . . . . 163
2.L.2 (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K.
The error bars reflect randomness in both initialization and simulation of the
conditionals across 5 trials. AIFA denoising quality improves as K increases,
and the performance is similar to TFA across approximation levels. Moreover,
the TFA- and AIFA-denoised images are very similar: the PSNR ≈ 50 for TFA
versus AIFA, whereas PSNR < 35 for TFA or AIFA versus the original image.
(b,c) Show how PSNR evolves during inference. The “warm-start” lines
indicate that the AIFA-inferred (respectively, TFA-inferred) parameters are
excellent initializations for TFA (respectively, AIFA) inference. . . . . . . . . 164

2.L.3 Sample AIFA and TFA denoised images have comparable quality. (a) shows
the noiseless image. (b) shows the corrupted image. (c,d) are sample denoised
images from finite models with K = 60. PSNR (in dB) is computed with
respect to the noiseless image. . . . . . . . . . . . . . . . . . . . . . . . . . . 165
2.L.4 (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K.
The error bars reflect randomness in both initialization and simulation of the
conditionals across 5 trials. AIFA denoising quality improves as K increases,
and the performance is similar to TFA across approximation levels. Moreover,
the TFA- and AIFA-denoised images are very similar: the PSNR ≈ 47 for TFA
versus AIFA, whereas PSNR < 31 for TFA or AIFA versus the original image.
(b,c) Show how PSNR evolves during inference. The “warm-start” lines
indicate that the AIFA-inferred (respectively, TFA-inferred) parameters are
excellent initializations for TFA (respectively, AIFA) inference. . . . . . . . . 165
2.L.5 The predictive log-likelihood of AIFA is not sensitive to different settings of
a and bK . Each color corresponds to a combination of a and bK . (a) is the
average across 5 trials with different random seeds for the stochastic optimizer,
while (b) is the best across the same trials. . . . . . . . . . . . . . . . . . . . 166
2.L.6 In fig. 2.L.6a, we estimate the mass by maximizing the marginal likelihood
of the AIFA (left panel) or the full process (right panel). The solid blue line
is the median of the estimated masses, while the lower and upper bounds of
the error bars are the 20% and 80% quantiles. The black dashed line is the
ideal value of the estimated mass, equal to the ground-truth mass. The key
for fig. 2.L.6b is the same, but for concentration instead of mass. . . . . . . . 166

3.5.1 (Linear model) Histogram of treatment effect MCMC draws. The blue line
indicates the sample mean. The dashed red line is the zero threshold. The
dotted blue lines indicate estimates of the approximate credible interval’s endpoints. 183
3.5.2 (Linear model) Confidence interval and refit. At maximum, we remove 1%
of the data. Each panel corresponds to a target conclusion change: ‘sign’ is
the change in sign, ‘sig’ is change in significance, and ‘both’ is the change in
both sign and significance. Error bars are confidence interval for refit after
removing the most extreme data subset. Each ‘x’ is the refit after removing
the proposed data and re-running MCMC. The dotted blue line is the fit on
the full data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
3.5.3 (Linear model) Monte Carlo estimate of AMIP confidence interval’s coverage.
Each panel corresponds to a target conclusion change. The dashed line is the
nominal level η = 0.95. The solid line is the sample mean of the indicator
variable for the event that ground truth is contained in the confidence interval.
The error bars are confidence intervals for the population mean of these indicators. 184
3.5.4 (Linear model) Monte Carlo estimate of sum-of-influence confidence interval’s
coverage. Each panel corresponds to a target conclusion change. The dashed
line is the nominal level η = 0.95. The solid line is the sample mean of the
indicator variable for the event that ground truth is contained in the confidence
interval, and error bars are confidence intervals for the population mean of
these indicators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

3.5.5 (Linear model) Quality of the linear approximation. Each panel corresponds
to a target conclusion change. The solid blue line is the full-data fit. The
horizontal axis is the distance from the weight that represents the full data.
We plot both the refit from rerunning MCMC and the linear approximation of
the refit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
3.5.6 (Hierarchical model for microcredit) Histogram of treatment effect MCMC
draws. See the caption of fig. 3.5.1 for the meaning of the distinguished vertical
lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.5.7 (Hierarchical model for microcredit) Confidence interval and refit. See the
caption of fig. 3.5.2 for meaning of annotated lines. . . . . . . . . . . . . . . 188
3.5.8 (Hierarchical model for microcredit) Monte Carlo estimate of AMIP confidence
interval’s coverage. See the caption of fig. 3.5.3 for the meaning of the error
bars and the distinguished lines. . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.5.9 (Hierarchical model for microcredit) Monte Carlo estimate of sum-of-influence
confidence interval’s coverage. See the caption of fig. 3.5.4 for the meaning of
the panels and the distinguished lines. . . . . . . . . . . . . . . . . . . . . . 189
3.5.10 (Hierarchical model for microcredit) Quality of linear approximation. See the
caption of fig. 3.5.5 for the meaning of the panels and the distinguished lines. 189
3.5.11 (Hierarchical model for tree mortality) Histogram of slope MCMC draws. See
the caption of fig. 3.5.1 for the meaning of the distinguished vertical lines. . . 191
3.5.12 (Hierarchical model for tree mortality) Confidence interval and refit. See the
caption of fig. 3.5.2 for the meaning of the panels and the distinguished lines. 192
3.5.13 (Hierarchical model on subsampled tree mortality) Histogram of effect MCMC
draws. See fig. 3.5.1 for the meaning of the distinguished lines. . . . . . . . . 192
3.5.14 (Hierarchical model on subsampled tree mortality) Confidence interval and refit.
See the caption of fig. 3.5.2 for the meaning of the panels and the distinguished
lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
3.5.15 (Hierarchical model on subsampled tree mortality) Monte Carlo estimate of
coverage of confidence interval for ∆(α). See fig. 3.5.3 for the meaning of the
panels and the distinguished lines. . . . . . . . . . . . . . . . . . . . . . . . . 193
3.5.16 (Hierarchical model on subsampled tree mortality) Monte Carlo estimate of
coverage of confidence interval for sum-of-influence. See fig. 3.5.4 for the
meaning of the panels and the distinguished lines. . . . . . . . . . . . . . . . 194
3.5.17 (Hierarchical model on subsampled tree mortality) Quality of linear approximation.
See fig. 3.5.5 for the meaning of the panels and the distinguished lines. . . . . 194

Introduction

Discovering the topics behind Wikipedia texts is just one example in which researchers are
interested in latent traits. Other examples include recovering unique speakers across audio
recordings of many meetings, group membership from network data, driver archetypes from
car sensors, co-occurrence of species from environmental DNA, and themes from online
question-and-answer boards [43, 63, 81, 123, 187]. When studying latent traits, we might
model the data hierarchically. For example, one might conceptualize that the words in a
given document are exchangeable, and that different documents are independent realizations
of some underlying distribution. One might use Bayesian probabilistic models to codify
such a hierarchy: they contain conditional independence statements that encode hierarchical
structure in the data-generating process.
In general, the application of Bayesian unsupervised learning to data is computationally
intensive. For instance, to draw inference, we need to estimate the posterior distribution.
The posterior is a complex probability distribution over both real-valued parameters, such
as the topics, and discrete parameters, such as the assignment of words or documents to
topics. While recent advances in Bayesian computation, such as Carpenter et al. [39], provide
accurate and fast posterior approximations in models with only real-valued parameters, they
do not solve the inference problem for Bayesian unsupervised learning. Roughly speaking, to
approximate the posterior in such models, there are two options. One might spend a long
time to obtain an accurate estimate, using slow Markov chain Monte Carlo (MCMC) algorithms
such as Gibbs sampling. Or one might spend a little time to get an estimate without clear
accuracy guarantees, using optimization methods such as variational inference [23, 115].
In this thesis, I identify three specific computational challenges. First, while past works
[92, 208] have investigated how to improve the speed of MCMC without degrading accuracy
by using parallelism, their techniques struggle in Bayesian unsupervised learning, due to the
“label-switching” problem: roughly speaking, it can be difficult to usefully combine results
from different processors when they may or may not be handling semantically equivalent
relabelings of traits. Second, when Bayesian nonparametric (BNP) models are used because
we expect the number of latent traits present in a dataset to grow with the number of
observations, computers cannot store an infinity of objects, or learn the distribution of an
infinite collection in finite time. Finally, if a practitioner is interested in checking whether
their conclusions generalize beyond collected data, they might want to quantify the posterior’s
sensitivity to the removal of a very small amount of data. But checking this sensitivity
directly takes an intractably long time: the brute-force approach, which enumerates all small
subsets and re-analyzes, has to search over a combinatorially large number of subsets.
My thesis addresses the first two computational challenges, and establishes a first step in

tackling the last one. I use a representation of the unsupervised learning model that avoids the
label-switching problem: while this representation is not new, my work is the first to utilize it
to improve MCMC speed while maintaining accuracy. I build upon the existing literature on
finite approximations of BNP to derive an accurate and convenient-to-use approximation of
infinite-dimensional priors. Conceptualizing supervised Bayes as a first step towards Bayesian
unsupervised learning, I develop and test a fast tool to detect sensitivity with respect to data
removals for analyses based on MCMC estimates of supervised Bayes posteriors. Below, I
highlight the key findings of my thesis.

Debiasing Markov chain Monte Carlo in unsupervised learning. Markov chain


Monte Carlo (MCMC) methods are often used in clustering since they guarantee asymptoti-
cally exact expectations in the infinite-time limit. In finite time, though, slow mixing often
leads to poor performance. Modern computing environments offer massive parallelism, but
naive implementations of parallel MCMC can exhibit substantial bias. In MCMC samplers
of continuous random variables, Markov chain couplings can overcome bias. But these
approaches depend crucially on paired chains meeting after a small number of transitions.
I show that straightforward applications of existing coupling ideas to discrete clustering
variables fail to meet quickly. This failure arises from the “label-switching problem”: semanti-
cally equivalent cluster relabelings impede fast meeting of coupled chains. I instead consider
chains as exploring the space of partitions rather than partitions’ (arbitrary) labelings. Using
a metric on the partition space, I formulate a practical algorithm using optimal transport
couplings. My theory confirms the method is accurate and efficient. In experiments ranging
from clustering of genes or seeds to graph colorings, I show the benefits of my coupling in
the highly parallel, time-limited regime. This work, which is in collaboration with Brian L.
Trippe and Tamara Broderick, has been published as Nguyen et al. [151]. For more detail,
see Chapter 1.

Automatic independent finite approximations. Completely random measures (CRMs)


and their normalizations (NCRMs) offer flexible models in Bayesian nonparametrics. But
the infinite dimensionality of these models often necessitates ad hoc inference approaches.
To better interface with modern black-box inference engines, I propose a general recipe to
construct tractable finite-dimensional approximations for infinite-dimensional, homogeneous
CRMs and NCRMs. My construction, automated independent finite approximations (AIFAs),
relies on independent and identically distributed atom masses; it generalizes important special
cases already used in practice, is practical to implement, and can be used both in the presence
and absence of power laws. I upper bound the approximation error of AIFAs for a wide
class of common CRMs and NCRMs — and thereby develop guidelines for choosing the
approximation level. My lower bounds in key cases suggest that my upper bounds are
tight. I show empirically that AIFAs can be used to learn CRM hyperparameters, including
the discount governing power laws. I compare AIFAs with an alternative approximation:
truncated finite approximations (TFAs). I prove that, for worst-case choices of observation
likelihoods, TFAs are more efficient than AIFAs. But in real-data experiments with image
denoising and topic modeling, AIFAs and TFAs perform similarly. Moreover, AIFA update
equations are easier to derive, and AIFAs are better suited to distributed computation.

This work, which is in collaboration with Jonathan Huggins, Lorenzo Masoero, Lester Mackey,
and Tamara Broderick, has been published as Nguyen et al. [152]. For more detail, see
Chapter 2.

Sensitivity to data removal for MCMC-based analyses. If the conclusion of a data


analysis is sensitive to dropping very few data points, that conclusion might hinge on the
particular data at hand rather than representing a more broadly applicable truth. How
could we check whether this sensitivity holds? One idea is to consider every small subset of
data, drop it from the dataset, and re-run our analysis. But running MCMC to approximate
a Bayesian posterior is already very expensive; running multiple times is prohibitive, and
the number of re-runs needed here is combinatorially large. Recent work proposes a fast
and accurate approximation to find the worst-case dropped data subset, but that work
was developed for problems based on estimating equations — and does not directly handle
Bayesian posterior approximations using MCMC. I make two principal contributions in the
present work. I adapt the existing data-dropping approximation to estimators computed
via MCMC. Observing that Monte Carlo errors induce variability in the approximation,
I use a variant of the bootstrap to quantify this uncertainty. I demonstrate how to use
my approximation in practice to determine whether there is non-robustness in a problem.
Empirically, my method is accurate in simple models, such as linear regression. In models
with complicated structure, such as hierarchical models, the performance of my method is
mixed. This work is in collaboration with Ryan Giordano, Rachael Meager, and Tamara
Broderick. For more detail, see Chapter 3.

Chapter 1

Fast and Accurate MCMC Through


Couplings

Figure 1.1: Lower error at high process count using our estimator (blue) versus using naive
parallelism (red). For details, see Section 1.5.2.

1.1 Introduction
Markov chain Monte Carlo (MCMC) is widely used in applications for exploring distributions
over clusterings, or partitions, of data. For instance, Prabhakaran et al. [169] use MCMC to
approximate a Bayesian posterior over clusters of gene expression data for “discovery and
characterization of cell types”; Chen et al. [42] use MCMC to approximate the number of
k-colorings of a graph; and DeFord et al. [49] use MCMC to identify partisan gerrymandering
via partitioning of geographical units into districts. An appealing feature of MCMC for
many applications is that it yields asymptotically exact expectations in the infinite-time limit.
However, real-life samplers must always be run in finite time, and MCMC mixing is often
prohibitively slow in practice. While this slow mixing has led some practitioners to turn
to other approximations such as variational Bayes [21], these alternative methods can yield
arbitrarily poor approximations of the expectation of interest [87].
A different approach is to speed up MCMC, e.g., by taking advantage of recent
computational advances. While wall-clock time is often at a premium, modern computing
environments increasingly offer massive parallel processing. For example, institute-level
compute clusters commonly make hundreds of processors available to their users simultaneously
[178]. Recent efforts to enable parallel MCMC on graphics processing units [119] offer to
expand parallelism further, with modern commodity GPUs providing over ten thousand cores.
A naive approach to exploiting parallelism is to run MCMC separately on each processor; we
illustrate this approach on a genetics dataset (gene) in Figure 1.1 with full experimental
details in Section 1.5. One might either directly average the resulting estimates across
processors (red solid line in Figure 1.1) or use a robust averaging procedure (red dashed line
in Figure 1.1). Massive parallelism can be used to reduce variance of the final estimate but
does not mitigate the problem of bias, so the final estimate does not improve substantially as
the number of processes increases.
Recently, Jacob et al. [93] built on the work of Glynn and Rhee [74] to eliminate bias in
MCMC with a coupling. The basic idea is to cleverly set up dependence between two MCMC
chains so that they are still practical to run and also meet exactly at a random but finite

20
time. After meeting, these coupled chains can be used to compute an unbiased estimate of
the expectation of interest. So arbitrarily large reductions in the estimate’s variance due to
massive parallelism translate directly into arbitrarily large reductions in total error. Since a
processor’s computation concludes after the chains meet, a useful coupling relies heavily on
setting up coupled chains that meet quickly.
Jacob et al. [93] did not consider MCMC over partitions in particular and Glynn and Rhee
[74] did not work on MCMC. But there is existing work on couplings applied to partitions in
other contexts that can be adapted into the Jacob et al. [93] framework. For instance, [99]
uses maximal couplings on partition labelings to prove convergence rates for graph coloring,
and Gibbs [69] uses a common random number coupling for two-state Ising models. Though
[99] was theoretical rather than practical and Gibbs [69] did not apply to general partition
models, we can adapt the Jacob et al. [93] setup in a straightforward manner to use either
coupling scheme. While this adaptation ensures asymptotically-unbiased MCMC samples,
we will see (Section 1.5.3) that both schemes exhibit slow meeting times in practice. We
attribute this issue to the label-switching problem, which is well-known for plaguing MCMC
over partitions [98]. In particular, many different labelings correspond to the same partition.
In the case of couplings, two chains may nearly agree on the partition but require many
iterations to change label assignments, so the coupling is unnecessarily slow to meet.
Our main contribution, then, is to propose and analyze a practical coupling that uses the
unbiasedness of the [93] framework but operates directly in the true space of interest – i.e.,
the space of partitions – to thereby exhibit fast meeting times. In particular, we define an
optimal transport (OT) coupling in the partition space (Section 1.3). For clustering models,
we prove that our coupling produces unbiased estimates (Section 1.4.1). We provide a big-O
analysis to support the fast meeting times of our coupling (Section 1.4.2). We empirically
demonstrate the benefits of our coupling on a simulated analysis; on Dirichlet process mixture
models applied to real genetic, agricultural, and marine life data; and on a graph coloring
problem. We show that, for a fixed wall time, our coupling provides much more accurate
estimates and confidence intervals than naive parallelism (Section 1.5.2). And we show that
our coupling meets much more quickly than standard label-based couplings for partitions
(Section 1.5.3). Our code is available at https://github.com/tinnguyen96/partition-coupling.
Related work. Couplings of Markov chains have a long history in MCMC. But they
either have primarily been a theoretical tool, do not provide guarantees of consistency in
the limit of many processes, or are not generally applicable to Markov chains over partitions
(Section 1.A). Likewise, much previous work has sought to utilize parallelism in MCMC. But
this work has focused on splitting large datasets into small subsets and running MCMC
separately on each subset. Here, however, our distribution of interest is over partitions of the
data; combining partitions learned separately on multiple processors seems to face much the
same difficulties as the original problem (Section 1.A). Xu et al. [208] have also used OT
techniques within the Jacob et al. [93] framework, but their focus was continuous-valued
random variables. For partitions, OT techniques might most straightforwardly be applied to
the label space – and we expect would fare poorly, like the other label-space couplings in
Section 1.5.3. Our key insight is to work directly in the space of partitions.

1.2 Setup
Before describing our method, we first review random partitions, set up Markov chain Monte
Carlo for partitions (with an emphasis on Gibbs sampling), and review the Jacob et al. [93]
coupling framework.

1.2.1 Random Partitions


For a natural number $N$, a partition of $[N] := \{1, 2, \ldots, N\}$ is a collection of $K \le N$
non-empty disjoint sets $\{A_1, A_2, \ldots, A_K\}$ whose union is $[N]$. In a clustering problem, we
can think of $A_k$ as containing the data indices in a particular cluster. Let $\mathcal{P}_N$ denote the
set of all partitions of $[N]$. Let $\pi$ denote an element of $\mathcal{P}_N$, and let $\Pi$ be a random partition
(i.e., a $\mathcal{P}_N$-valued random variable) with probability mass function (p.m.f.) $p_\Pi$. We report a
summary that takes the form of an expectation: $H^* := \int h(\pi)\, p_\Pi(\pi)\, \mathrm{d}\pi$.

As an example, consider a Bayesian cluster analysis for $N$ data points $\{W_n\}_{n=1}^{N}$, with
$W_n \in \mathbb{R}^D$. A common generative procedure uses a Dirichlet process mixture model (DPMM)
and conjugate Gaussian cluster likelihoods, with hyperparameters $\alpha > 0$, $\mu_0 \in \mathbb{R}^D$, and
$\Sigma_0, \Sigma_1$ positive definite $D \times D$ matrices. First draw $\Pi = \pi$ with probability
$$\alpha^{|\pi|} \prod_{A \in \pi} (|A| - 1)! \; \Big/ \; \left[ \alpha (\alpha + 1) \cdots (\alpha + N - 1) \right].$$
Then draw cluster centers $\mu_A \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu_0, \Sigma_0)$ for $A \in \Pi$ and observed data
$W_j \mid \mu_A \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu_A, \Sigma_1)$ for $j \in A$. The distribution of interest is the Bayesian
posterior over $\Pi$: $p_\Pi(\pi) := \Pr(\Pi = \pi \mid W)$. A summary $H^*$ of interest might be the posterior
mean of the number of clusters for $N$ data points or of the proportion of data in the largest
cluster; see Section 1.B for more discussion.
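To make the prior concrete, the following minimal Python sketch (our illustration, not code from this thesis; the name crp_log_prob is ours) evaluates the partition probability above in log space.

```python
import math

def crp_log_prob(partition, alpha):
    """Log prior probability of a partition under the DPMM prior above:
    alpha^{|pi|} prod_{A in pi} (|A| - 1)! / [alpha (alpha + 1) ... (alpha + N - 1)].
    """
    N = sum(len(A) for A in partition)
    log_num = len(partition) * math.log(alpha)
    log_num += sum(math.lgamma(len(A)) for A in partition)  # lgamma(|A|) = log (|A| - 1)!
    log_den = sum(math.log(alpha + i) for i in range(N))    # log rising factorial
    return log_num - log_den

# Example: pi = {{1, 2}, {3}} with alpha = 1 has probability 1 / (1 * 2 * 3) = 1/6.
print(math.exp(crp_log_prob([{1, 2}, {3}], alpha=1.0)))  # approximately 0.1667
```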
An assignment of data points to clusters is often encoded in a vector of labels. E.g., one
might represent π = {{1, 2}, {3}} with the vector z = [1, 1, 2]; z indicates that data points
1 and 2 are in the same cluster (arbitrarily labeled 1 here) while point 3 is in a different
cluster (arbitrarily labeled 2). The partition can be recovered from the labeling, but the
labels themselves are ancillary to the partition and, as we will see, can introduce unnecessary
hurdles for fast MCMC mixing.
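As a quick illustration of this encoding (a sketch; the helper labels_to_partition is hypothetical, not code from this thesis):

```python
def labels_to_partition(z):
    """Recover the partition from a label vector; data indices are 1-based as in the text."""
    blocks = {}
    for idx, label in enumerate(z, start=1):
        blocks.setdefault(label, set()).add(idx)
    return list(blocks.values())

print(labels_to_partition([1, 1, 2]))  # [{1, 2}, {3}]
print(labels_to_partition([2, 2, 7]))  # [{1, 2}, {3}]: a different labeling, same partition
```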

1.2.2 Markov Chain Monte Carlo


In the DPMM example and many others, the exact computation of the summary $H^*$ is
intractable, so Markov chain Monte Carlo provides an approximation. In particular, let $X_t$
(for any $t$) denote a random partition; suppose we have access to a Markov chain $\{X_t\}_{t=0}^{\infty}$
with starting value $X_0$ drawn according to some initial distribution and evolving according
to a transition kernel $X_t \sim T(X_{t-1}, \cdot)$ stationary with respect to $p_\Pi$. Then we approximate
$H^*$ with the empirical average of samples: $T^{-1} \sum_{t=1}^{T} h(X_t)$.
We focus on Gibbs samplers in what follows, since they are a convenient and popular
choice for partitions [48, 136, 150]. We also extend our methods to more sophisticated
samplers, such as split-merge samplers [94], that use Gibbs samplers as a sub-routine; see
Section 1.3.3. To form a Gibbs sampler on the partition itself rather than the labeling, we
first introduce some notation. Namely, let π(−n) and Π(−n) denote π and Π, respectively,
with data point n removed. For example, if π = {{1, 3}, {2}}, then π(−1) = {{3}, {2}}.

With this notation, we can write the leave-out conditional distributions of the Gibbs
sampler as $p_{\Pi \mid \Pi(-n)}$. In particular, take a random partition $X$. Suppose $X(-n)$ has $K - 1$
elements. Then the $n$th data point can either be added to an existing element or form a
new element in the partition. Each of these $K$ options forms a new partition; call the new
partitions $\{\pi^k\}_{k=1}^{K}$. It follows that there exist $a_k \ge 0$ such that
$$p_{\Pi \mid \Pi(-n)}(\cdot \mid X(-n)) = \sum_{k=1}^{K} a_k \delta_{\pi^k}(\cdot), \qquad (1.1)$$
where $\delta_{\pi^k}$ denotes a Dirac measure on $\pi^k$. When $p_\Pi$ is available up to a proportionality
constant, it is tractable to compute or sample from $p_{\Pi \mid \Pi(-n)}$.
Algorithm 1 shows one sweep of the resulting Gibbs sampler. For any $X$, the transition
kernel $T(X, \cdot)$ for this sampler's Markov chain is the distribution of the output $\tilde{X}$ of
Algorithm 1.

Algorithm 1 Single Gibbs Sweep

Inputs:
    pΠ ▷ Target
    X ▷ Current partition
1: procedure SingleGibbsSweep(pΠ, X)
2:     X̃ ← X
3:     for n ← 1, N do
4:         X̃ ∼ pΠ|Π(−n)(· | X̃(−n))
5:     end for
6:     return X̃
7: end procedure
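A minimal Python rendering of Algorithm 1 might look as follows. This is a sketch, not the thesis implementation; leave_out_conditional is an assumed model-specific helper that returns the candidate partitions and weights from Equation (1.1).

```python
import numpy as np

def single_gibbs_sweep(leave_out_conditional, X, N, rng):
    """One sweep of Algorithm 1 over data indices n = 1, ..., N.

    leave_out_conditional(X, n) is assumed to return the K candidate partitions
    {pi^k} and their probabilities {a_k} from Equation (1.1); its form depends
    on the target model (e.g., a DPMM posterior).
    """
    for n in range(1, N + 1):
        candidates, probs = leave_out_conditional(X, n)
        X = candidates[rng.choice(len(candidates), p=probs)]
    return X

# Usage sketch: X_new = single_gibbs_sweep(cond, X0, N, np.random.default_rng(0))
```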

1.2.3 An Unbiased Estimator


Jacob et al. [93] show how to construct an unbiased estimator of $H^*$ for some Markov chain $\{X_t\}$ when
an additional Markov chain $\{Y_t\}$ with two properties is available. First, $Y_t \mid Y_{t-1}$ must also
evolve using the same transition $T(\cdot, \cdot)$ as $\{X_t\}$, so that $\{Y_t\}$ is equal in distribution to $\{X_t\}$.
Second, there must exist a random meeting time $\tau < \infty$ with sub-geometric tails such that
the two chains meet exactly at time $\tau$ ($X_\tau = Y_{\tau-1}$) and remain faithful afterwards (for all
$t \ge \tau$, $X_t = Y_{t-1}$). When these properties hold, the following provides an unbiased estimate
of $H^*$:
$$H_{\ell:m}(X, Y) := \underbrace{\frac{1}{m - \ell + 1} \sum_{t=\ell}^{m} h(X_t)}_{\text{Usual MCMC average}} + \underbrace{\sum_{t=\ell+1}^{\tau-1} \min\left\{1, \frac{t - \ell}{m - \ell + 1}\right\} \left\{h(X_t) - h(Y_{t-1})\right\}}_{\text{Bias correction}}, \qquad (1.2)$$
where $\ell$ is the burn-in length and $m$ sets a minimum number of iterations [93, Equation 2].
The hyperparameters $\ell$ and $m$ impact the runtime and variance of $H_{\ell:m}$; for instance,
smaller $m$ is typically associated with smaller runtimes but larger variance. Jacob et al. [93,
Section 6] recommend setting $\ell$ to be a large quantile of the meeting time and $m$ as a multiple
of $\ell$. We follow these recommendations in our work.

One interpretation of Equation (1.2) is as the usual MCMC estimate plus a bias correction.
Since $H_{\ell:m}$ is unbiased, a direct average of many copies of $H_{\ell:m}$ computed in parallel can be
made to have arbitrarily small error (for estimating $H^*$). It remains to apply the idea from
Equation (1.2) to partition-valued chains.
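In code, Equation (1.2) is a short function. Below is a sketch under the assumption that the coupled draws are stored as sequences X and Y satisfying X[t] == Y[t - 1] for all t >= tau; the names are ours, not the thesis's.

```python
def unbiased_estimate(h, X, Y, ell, m, tau):
    """Compute H_{ell:m}(X, Y) from Equation (1.2)."""
    # Usual MCMC average over iterations ell, ..., m.
    usual = sum(h(X[t]) for t in range(ell, m + 1)) / (m - ell + 1)
    # Bias correction over iterations ell + 1, ..., tau - 1.
    correction = sum(
        min(1.0, (t - ell) / (m - ell + 1)) * (h(X[t]) - h(Y[t - 1]))
        for t in range(ell + 1, tau)
    )
    return usual + correction
```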

1.2.4 Couplings
To create two chains of partitions that evolve together, we will need a joint distribution over
partitions from both chains that respects the marginals of each chain. To that end, we define
a coupling.
Definition 1.2.1. A coupling $\gamma$ of two discrete distributions, $\sum_{k=1}^{K} a_k \delta_{\pi^k}(\cdot)$ and
$\sum_{k'=1}^{K'} b_{k'} \delta_{\nu^{k'}}(\cdot)$, is a distribution on the product space,
$$\gamma(\cdot) = \sum_{k} \sum_{k'} u^{k,k'} \delta_{(\pi^k, \nu^{k'})}(\cdot), \qquad (1.3)$$
that satisfies the marginal constraints $\sum_{k} u^{k,k'} = b_{k'}$, $\sum_{k'} u^{k,k'} = a_k$, and $0 \le u^{k,k'} \le 1$.

1.3 Our Methods


We have just described how to achieve unbiased estimates when two chains with a particular
relationship are available. It remains to show that we can construct these chains so that
they meet quickly in practice. First, we describe a general setup for a coupling of two Gibbs
samplers over partitions in Section 1.3.1. Our method is a special case where we choose
a coupling function that encourages the two chains to meet quickly (Section 1.3.2). We
extend our coupling to split-merge samplers in Section 1.3.3. We employ a variance reduction
procedure to further improve our estimates (Section 1.3.4).

1.3.1 Coupling For Gibbs On Partitions


Let $X, Y$ be two partitions of $[N]$. By Equation (1.1), we can write
$p_{\Pi \mid \Pi(-n)}(\cdot \mid X(-n)) = \sum_{k=1}^{K} a_k \delta_{\pi^k}(\cdot)$ for some $K$ and tuples $(a_k, \pi^k)$.
And we can write $p_{\Pi \mid \Pi(-n)}(\cdot \mid Y(-n)) = \sum_{k'=1}^{K'} b_{k'} \delta_{\nu^{k'}}(\cdot)$ for some $K'$ and
tuples $(b_{k'}, \nu^{k'})$. We say that a coupling function is any function that returns a coupling
for these distributions.

Definition 1.3.1. A coupling function $\psi$ takes as input a target $p_\Pi$, a leave-out index
$n$, and partitions $X, Y$. It returns a coupling $\gamma = \psi(p_\Pi, n, X, Y)$ of
$p_{\Pi \mid \Pi(-n)}(\cdot \mid X(-n))$ and $p_{\Pi \mid \Pi(-n)}(\cdot \mid Y(-n))$.

Given a coupling function $\psi$, Algorithm 2 gives the coupled transition from the current
pair of partitions $(X, Y)$ to another pair $(\tilde{X}, \tilde{Y})$. Repeating this algorithm guarantees the
first required property from the Jacob et al. [93] construction in Section 1.2.3: co-evolution of
the two chains with correct marginal distributions. It remains to show that we can construct
an appropriate coupling function and that the chains meet (quickly).

Algorithm 2 Coupled Gibbs Sweep

Inputs:
    pΠ ▷ Target
    ψ ▷ Coupling function
    X and Y ▷ Current partitions
1: procedure CoupledGibbsSweep(pΠ, ψ, X, Y)
2:     X̃ ← X, Ỹ ← Y
3:     for n ← 1, N do
4:         γ ← ψ(pΠ, n, X̃, Ỹ)
5:         (X̃, Ỹ) ∼ γ
6:     end for
7:     return X̃, Ỹ
8: end procedure
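Mirroring Algorithm 2 in Python (a sketch under the assumption that coupling_fn stands in for ψ and returns the coupling's atom pairs and joint weights; the names are ours):

```python
def coupled_gibbs_sweep(coupling_fn, X, Y, N, rng):
    """One coupled sweep (Algorithm 2): at each index n, draw the pair (X, Y)
    jointly from the coupling gamma = psi(p_Pi, n, X, Y)."""
    for n in range(1, N + 1):
        pairs, weights = coupling_fn(n, X, Y)  # atoms (pi^k, nu^k') and weights u^{k,k'}
        X, Y = pairs[rng.choice(len(pairs), p=weights)]
    return X, Y
```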

1.3.2 An Optimal Transport Coupling


We next detail our choice of coupling function; namely, we start from an optimal transport
(OT) coupling and add a nugget term for regularity. For a distance $d$ between partitions,
the OT coupling function $\psi^{\mathrm{OT}} = \psi^{\mathrm{OT}}(p_\Pi, n, X, Y)$ minimizes the expected distance between
partitions after one coupled Gibbs step given partitions $X, Y$ and leave-out index $n$. Using
the notation of Sections 1.2.4 and 1.3.1, we define
$$\psi^{\mathrm{OT}} := \operatorname*{arg\,min}_{\text{couplings } \gamma} \sum_{k=1}^{K} \sum_{k'=1}^{K'} u^{k,k'} \, d(\pi^k, \nu^{k'}). \qquad (1.4)$$

To complete the specification of $\psi^{\mathrm{OT}}$, we choose a metric $d$ on partitions that was introduced
by Mirkin and Chernyi [145] and Rand [174]:
$$d(\pi, \nu) = \sum_{A \in \pi} |A|^2 + \sum_{B \in \nu} |B|^2 - 2 \sum_{A \in \pi,\, B \in \nu} |A \cap B|^2. \qquad (1.5)$$
Observe that $d(\pi, \nu)$ is zero when $\pi = \nu$. More generally, we can construct a graph from
a partition by treating the indices in $[N]$ as vertex labels and assigning any two indices in
the same partition element to share an edge; then $d/2$ is equal to the Hamming distance
between the adjacency matrices implied by $\pi$ and $\nu$ [145, Theorems 2–3]. The principal trait
of $d$ for our purposes is that $d$ steadily increases as $\pi$ and $\nu$ become more dissimilar. In
Section 1.I, we discuss other potential metrics and show that an alternative with similar
qualitative behavior yields essentially equivalent empirical results.
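For concreteness, the metric in Equation (1.5) takes a few lines to compute when partitions are stored as collections of Python sets (a sketch; partition_distance is our name for it):

```python
def partition_distance(pi, nu):
    """The Mirkin-Chernyi / Rand metric of Equation (1.5)."""
    return (sum(len(A) ** 2 for A in pi)
            + sum(len(B) ** 2 for B in nu)
            - 2 * sum(len(A & B) ** 2 for A in pi for B in nu))

print(partition_distance([{1, 2}, {3}], [{1, 2}, {3}]))  # 0: identical partitions
print(partition_distance([{1, 2}, {3}], [{1}, {2, 3}]))  # 4: they disagree on pairs (1,2) and (2,3)
```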

In practice, any standard optimal transport¹ solver can be used in $\psi^{\mathrm{OT}}$, and we discuss
our particular choice in more detail in Section 1.4.2. To prove unbiasedness of a coupling
(Theorem 1.4.1), it is convenient to ensure that every joint setting of $(X, Y)$ is reachable
from every other joint setting in the sampler. As we discuss after Theorem 1.4.1 and in
Section 1.C, adding a small nugget term to the coupling function accomplishes this goal. To
that end, define the independent coupling $\psi^{\mathrm{ind}}$ to have atom size $u^{k,k'} = a_k b_{k'}$ at $(\pi^k, \nu^{k'})$.
Let $\eta \in (0, 1)$. Then our final coupling function $\psi_\eta^{\mathrm{OT}} = \psi_\eta^{\mathrm{OT}}(p_\Pi, n, X, Y)$ equals
$$\psi_\eta^{\mathrm{OT}} = \begin{cases} \psi^{\mathrm{OT}}(X, X) & \text{if } X = Y \\ (1 - \eta)\, \psi^{\mathrm{OT}}(X, Y) + \eta\, \psi^{\mathrm{ind}}(X, Y) & \text{otherwise}, \end{cases} \qquad (1.6)$$
where we elide the dependence on $p_\Pi, n$ for readability. In practice, we set $\eta$ to $10^{-5}$, so the
behavior of $\psi_\eta^{\mathrm{OT}}$ is dominated by $\psi^{\mathrm{OT}}$.

As a check, notice that when two chains first meet, the behavior of $\psi_\eta^{\mathrm{OT}}$ reverts to that of
$\psi^{\mathrm{OT}}$. Since there is a coupling with expected distance zero, that coupling is chosen as the
minimizer in $\psi^{\mathrm{OT}}$. Therefore, the two chains remain faithful going forward.
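As an illustration, the sketch below solves the exact transport problem of Equation (1.4) with a generic linear-program solver and then forms the nugget mixture of Equation (1.6); a production implementation would use a dedicated exact-OT solver, and all names here are ours.

```python
import numpy as np
from scipy.optimize import linprog

def ot_coupling(a, b, cost):
    """Minimize sum_{k, k'} u[k, k'] * cost[k, k'] subject to the marginal
    constraints of Definition 1.2.1 (row sums a, column sums b)."""
    K, Kp = len(a), len(b)
    A_eq = np.zeros((K + Kp, K * Kp))
    for k in range(K):
        A_eq[k, k * Kp:(k + 1) * Kp] = 1.0  # sum_{k'} u[k, k'] = a[k]
    for kp in range(Kp):
        A_eq[K + kp, kp::Kp] = 1.0          # sum_k u[k, k'] = b[k']
    res = linprog(np.asarray(cost).ravel(), A_eq=A_eq,
                  b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
    return res.x.reshape(K, Kp)

def coupled_draw(a, b, cost, rng, eta=1e-5):
    """Mix the OT plan with the independent coupling, as in Equation (1.6),
    then draw a joint index pair (k, k')."""
    u = (1 - eta) * ot_coupling(a, b, cost) + eta * np.outer(a, b)
    flat = rng.choice(u.size, p=u.ravel() / u.sum())  # renormalize for numerical safety
    return divmod(flat, len(b))
```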

1.3.3 Extension To Other Samplers


With $\psi_\eta^{\mathrm{OT}}$, we can also couple samplers that use Gibbs sampling as a sub-routine; to illustrate,
we next describe a coupling for a split-merge sampler [94]. Split-merge samplers pair a basic
Gibbs sweep with a Metropolis–Hastings (MH) move designed to facilitate larger-scale changes
across the clustering. In particular, the MH move starts from partition $X$ by selecting a pair
of distinct data indices $(i, j)$ uniformly at random. If $i$ and $j$ belong to the same cluster, the
sampler proposes to split this cluster. Otherwise, the sampler proposes to merge together the
two clusters containing $i$ and $j$. The proposal is accepted or rejected in the MH move. For
our purposes, we summarize the full move, including proposal and acceptance but conditional
on the choice of $i$ and $j$, as $\tilde{X} \sim \mathrm{SplitMerge}(i, j, X)$. One iteration of the split-merge sampler
is identical to Algorithm 1, except that between lines 1 and 2 of Algorithm 1, we sample $(i, j)$
and perform $\mathrm{SplitMerge}(i, j, \tilde{X})$.

Algorithm 3 shows our coupling of a split-merge sampler. We use the same pair of indices
$(i, j)$ in the split-merge moves across both the $X$ and $Y$ chains. We use $\psi_\eta^{\mathrm{OT}}$ to couple at the
level of the Gibbs sweeps.

Gibbs samplers and split-merge samplers offer differing strengths and weaknesses. For
instance, the MH move may take a long time to finish; Algorithm 1 might run for more iterations in
the same time, potentially producing better estimates sooner. The MH move is also more
complex and thus potentially more prone to errors in implementation. In what follows, we
consider both samplers; we compare our coupling to naive parallelism for Gibbs sampling in
Section 1.5, and we make the analogous comparison for split-merge samplers in Section 1.J.
¹We note that the optimization problem defining Equation (1.4) is an exact transport problem, not an
entropically-regularized transport problem [46]. Hence the marginal distributions defined by ψ^OT automatically
match the inputs p_{Π|Π(−n)}(· | X(−n)) and p_{Π|Π(−n)}(· | Y(−n)), without need of post-processing.

Algorithm 3 Coupled Gibbs Sweep with Split–Merge Move
Inputs:
    p_Π                                ▷ Target
    X and Y                            ▷ Current partitions
 1: procedure CoupledSplitMergeSweep(p_Π, X, Y)
 2:     X̃ ← X, Ỹ ← Y
 3:     (i, j) ← Uniformly random pair of data indices
 4:     X̃ ∼ SplitMerge(i, j, X̃)
 5:     Ỹ ∼ SplitMerge(i, j, Ỹ)
 6:     for n ← 1, N do
 7:         γ ← ψ_η^OT(p_Π, n, X̃, Ỹ)
 8:         (X̃, Ỹ) ∼ γ
 9:     end for
10:     return X̃, Ỹ
11: end procedure
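A minimal Python sketch of Algorithm 3 follows; the helpers split_merge(i, j, partition) and coupled_step(p_target, n, X, Y), standing in for SplitMerge and for a draw from ψ_η^OT, are hypothetical and supplied by the caller.

    import random

    def coupled_split_merge_sweep(p_target, X, Y, N, split_merge, coupled_step):
        # Shared pair of indices (i, j) across both chains.
        i, j = random.sample(range(N), 2)
        X = split_merge(i, j, X)
        Y = split_merge(i, j, Y)
        # Coupled Gibbs sweep using the OT coupling at each leave-out index.
        for n in range(N):
            X, Y = coupled_step(p_target, n, X, Y)
        return X, Y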

1.3.4 Variance Reduction Via Trimming


We have described how to generate a single estimate of H* from Equation (1.2); in practice,
on the jth processor, we run chains X^j and Y^j to compute H_{ℓ:m}(X^j, Y^j). It remains to
decide how to aggregate the observations {H_{ℓ:m}(X^j, Y^j)}_{j=1}^{J} across J processors.
A natural option is to report the sample mean, (1/J) Σ_{j=1}^{J} H_{ℓ:m}(X^j, Y^j). If each individual
estimate is unbiased, the squared error of the sample mean decreases to zero at rate 1/J,
and standard confidence intervals have asymptotically correct coverage.
For finite J, though, there may be outliers that drive the sample mean far from H*. To
counteract the effect of outliers and achieve a lower squared error, we also report a classical
robust estimator: the trimmed mean [201]. Recall that for α ∈ (0, 0.5), the α-trimmed mean
is the average of the observations between (inclusive) the 100α% quantile and the 100(1 − α)%
quantile of the observed data. The trimmed mean is asymptotically normally distributed
[16, 189] and provides sub-Gaussian confidence intervals [135]. See Section 1.F for more
discussion of the trimmed mean.
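As an illustration, the α-trimmed mean is available directly in scipy; the per-processor estimates below are a heavy-tailed stand-in, and trim_mean's proportiontocut argument is the fraction removed from each tail.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Stand-in for per-processor estimates with occasional large outliers.
    estimates = rng.standard_t(df=2, size=1000)

    alpha = 0.005  # trim 0.5% in each tail, i.e. the most extreme 1% overall
    print("sample mean: ", estimates.mean())
    print("trimmed mean:", stats.trim_mean(estimates, alpha))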

1.4 Theoretical Results


To verify that our coupling is useful, we need to check that it efficiently returns accurate
estimates. We first check that the coupled estimate Hℓ:m (X, Y ) at a single processor is
unbiased – so that aggregated estimates across processors can exhibit arbitrarily small
squared loss. Second, we check that there is no undue computational cost of coupling relative
to a single chain.

1.4.1 Unbiasedness
Jacob et al. [93, Assumptions 1–3] give sufficient conditions for unbiasedness of Equation (1.2).
We next use these to establish sufficient conditions under which H_{ℓ:m}(X, Y) is unbiased when targeting
a DPMM posterior.

Theorem 1.4.1 (Sufficient Conditions for Unbiased Estimation). Let p_Π be the DPMM
posterior in Section 1.2.1. Assume the following two conditions on ψ.
(1) There exists ϵ > 0 such that for all n ∈ [N] and for all X, Y ∈ P_N such that X ≠ Y,
the output γ of the coupling function ψ satisfies

    ∀k ∈ [K] and k' ∈ [K'],  u^{k,k'} ≥ ϵ.    (1.7)

(2) If X = Y, then the output coupling γ of ψ satisfies γ(X̃ = Ỹ) = 1; i.e. the coupling is
faithful.
Then, the estimator in Equation (1.2) constructed from Algorithm 2 is an unbiased estimator
for H*. Furthermore, Equation (1.2) has a finite variance and a finite expected computing
time.

We prove Theorem 1.4.1 in Section 1.C. Our proof exploits the discreteness of the sample
space to ensure chains meet. Condition (1) roughly ensures that any joint state in the product
space is reachable from any other joint state under the Gibbs sweep; we use it to establish
that the meeting time τ has sub-geometric tails. Condition (2) implies that the Markov
chains are faithful once they meet.

Corollary 1.4.1. Let p_Π be the DPMM posterior. The Equation (1.2) estimator using
Algorithm 2 with coupling function ψ_η^OT(p_Π, n, X, Y) is unbiased for H*.

Proof. It suffices to check Theorem 1.4.1's conditions. We show ψ_η^OT is faithful at the end
of Section 1.3.2. For a partition, the associated leave-out distributions place positive mass
on all K accessible atoms, so marginal transition probabilities are lower bounded by some
ω > 0. The nugget guarantees each u^{k,k'} ≥ ηω² > 0.
Note that the introduction of the nugget allows us to verify that the first condition of Theorem 1.4.1
is met without relying on properties specific to the optimal transport coupling.
We conjecture that one could analogously show unbiased estimates may be obtained using
couplings of Markov chains defined in the label space by introducing a similar nugget to
transitions on this alternative state space. Crucially, though, we will see in Section 1.5.3 that
our coupling in the partition space exhibits much faster meeting times in practice than these
couplings in the label space.

1.4.2 Time Complexity


The accuracy improvements of our method can be achieved only if the compute expense of
coupling is not too high relative to single-chain Gibbs. In Section 1.5.2, we show empirically
that our method outperforms naive parallel samplers run for the same wall time. Here we
use theory to describe why we expect this behavior.

There are two key computations that must happen in any coupling Gibbs step within a
sweep:

(1) computing the atom sizes a_k, b_{k'} and atom locations π^k, ν^{k'} in the sense of Definition 1.2.1
and Definition 1.3.1;

(2) computing the pairwise distances d(π^k, ν^{k'}), and solving the optimal transport problem
(Equation (1.4)).
Let β(N, K) represent the time it takes to compute the Gibbs conditional p_{Π|Π(−n)} for a
partition of size K, and let K̃ represent the size of the largest partition visited in any chain,
across all processors, while the algorithm runs. Then part (1) takes O(β(N, K̃)) time to
run. For single chains, computing atom sizes and locations dominates the compute time; the
computation required is of the same order, but is done for one chain, rather than two, on
each processor. We show in Proposition 1.D.1 in Section 1.D that part (2) can be computed
in O(K̃³ log K̃) time. Proposition 1.D.1 follows from efficient use of data structures; naive
implementations are more computationally costly. Note that the total running time for a full
Gibbs sweep (Algorithm 1 or Algorithm 2) will be N times the single-step cost.
The extra cost of a coupling Gibbs step will be small relative to the cost of a single-chain
Gibbs step, then, if O(K̃³ log K̃) is small relative to O(β(N, K̃)).² As an illustrative example,
consider again the DPMM application from Section 1.2.1. We start with a comparison that we
suspect captures typical operating procedure, but we also consider a worst-case comparison.
Standard comparison: The direct cost of a standard Gibbs step is β(N, K) = O(ND + KD³)
(see Proposition 1.D.2 in Section 1.D). By Equation 3.24 in Pitman [163], the number of
clusters in a DPMM grows a.s. as O(log N) as N → ∞.³ If we take K̃ = O(log N),
O(K̃³ log K̃) will generally be smaller than β(N, K) = O(ND + KD³) for sufficiently large
N.
Worst-case comparison: The complexity of a DPMM Gibbs step can be reduced to
β(N, K) = O(KD² + D³) through careful use of data structures and conditional conjugacy
(see Proposition 1.D.2 in Section 1.D). Still, the coupling cost O(K̃³ log K̃) is not much larger
than the cost of this step whenever K̃ is not much larger than D.
For our experiments, we run the standard rather than optimized Gibbs step due to its
simplicity and use in existing work [e.g. 48]. In e.g. our gene expression experiment with
D = 50, we expect this choice has little impact on our results. Our Proposition 1.D.1
establishing O(K̃³ log K̃) for the optimal transport solver applies to Orlin's algorithm [154].
However, convenient public implementations are not available, so instead we use the simpler
network simplex algorithm [107] as implemented by Flamary et al. [61]. Although Kelly and
O'Neill [107, Section 3.6] upper bound the worst-case complexity of the network simplex as
O(K̃⁵), the algorithm's average-case performance may be as good as O(K̃²) [25, Figure 6].
²We show in Section 1.D that, while there are also initial setup costs before running any Gibbs sweep,
these costs do not impact the amortized complexity.
³Two caveats: (1) If a Markov chain is run long enough, it will eventually visit all possible cluster
configurations. But if we run in finite time, it will not have time to explore every collection of clusters. So we
assume O(log N) is a reasonable approximation in finite time. (2) Also note that the log N growth is for data
generated from a DPMM, whereas in real life we cannot expect data are perfectly simulated from the model.

1.5 Empirical Results
We now demonstrate empirically that our OT coupling (1) gives more accurate estimates
and confidence intervals for the same wall time and processor budget as naive parallelism
and (2) meets much faster than label-based couplings.

1.5.1 Models, Datasets, And Implementation


We run samplers for both clustering and graph coloring problems, which we describe next. We
detail our construction of ground truth, sampler initialization, and algorithm hyperparameters
(ℓ and m) in Section 1.G.2.
Motivating examples and target models. For clustering, we use single-cell RNA
sequencing data [169], X-ray data of agricultural seed kernels [41, 56], physical measurements
of abalone [56, 148], and synthetic data from a Gaussian mixture model. In each case, our
target model is the Bayesian posterior over partitions from the DPMM. For graph colorings,
sampling from the uniform distribution on k-colorings of graphs is a key sub-routine in
fully polynomial randomized approximation schemes. It suffices to sample from the
partition distribution induced by the uniform distribution on k-colorings, which serves as our
target model; see Section 1.G.1 for details.
Summaries of interest. Our first summary is the mean proportion of data points in
the largest cluster; we write LCP for “largest component proportion.” See, e.g., Liverani et al.
[133] for its use in Bayesian analysis. Our second summary is the co-clustering probability;
we write CC(a, b) for the probability that data points indexed by a and b belong to the same
cluster. See, e.g., DeFord et al. [49] for its use in redistricting. In Section 1.M, we also report
a more complex summary: the posterior predictive distribution, which is a quantity of interest
in density estimation [58, 77].
Dataset details. Our synthetic dataset has 300 observations and 2 covariates. Our
gene dataset originates from Zeisel et al. [209] and was previously used by Prabhakaran
et al. [169] in a DPMM-based analysis. We use a subset with 200 observations and 50
covariates to allow us to quickly iterate on experiments. We use the unlabeled version of
the seed dataset from Charytanowicz et al. [41], Dua and Graff [56] with 210 observations
and 7 covariates. For the abalone dataset from Dua and Graff [56], Nash et al. [148], we
remove the labels and binary features, which yields 4177 observations and 7 covariates. For
graph data (k-regular), we use a 4-regular graph with 6 vertices; we target the partition
distribution induced by the uniform distribution on 4-colorings.

1.5.2 Improved Accuracy With Coupling


In Figure 1.2, we first show that our coupling estimates and confidence intervals offer improved
accuracy over naive parallelism. To the best of our knowledge, no previous coupling paper as
of this writing has compared coupling estimates or confidence intervals to those that arise
from naively parallel chains.
Processor setup. We give both coupling and naively parallel approaches the same
number of processors J. We ensure equal wall time across processors as we describe next;
this setup represents a computing system where, e.g., the user pays for total wall time, in

which case we ensure equal cost between approaches.

Figure 1.2: Top row and bottom row give results for gene and k-regular, respectively.
The first two columns show that coupled chains provide better point estimates than naive
parallelism. The third column shows that confidence intervals based on coupled chains are
better than those from naive parallelism. The fourth column shows that OT coupling meets
in less time than label-based couplings.

For the coupling on the jth processor, we run until the chains meet and record the total
time ξ^j. In the naively parallel case, then, we run a single chain on the jth processor for
time ξ^j. In either case, each processor
returns an estimate of H*. We can aggregate these estimates with a sample mean or trimmed
estimator. Let H_{c,J} represent the coupled estimate after aggregation across J processors and
H_{u,J} represent the naive parallel (uncoupled) estimate after aggregation across J processors.
To understand the variability of these estimates, we replicate them I times: {H_{c,J}^{(i)}}_{i=1}^{I} and
{H_{u,J}^{(i)}}_{i=1}^{I}. In particular, we simulate running on 180,000 processors, so for each J, we let
I = 180,000/J; see Section 1.G.2 for details. For the ith replicate, we compute the squared error
e_{c,i} := (H_{c,J}^{(i)} − H*)²; similarly in the uncoupled case.
Better point estimates. The upper left panel of Figure 1.2 shows the behavior of LCP
estimates for gene. The horizontal axis gives the number of processes J. The vertical value
of any solid line is found by taking the square root of the median (across I replicates) of the
squared error and then dividing by the (positive) ground truth. Blue shows the performance
of the aggregated standard-mean coupling estimate; red shows the naive parallel estimate.
The blue regions show the 20% to 80% quantile range. We can see that, at higher numbers
of processors, the coupling estimates consistently yield a lower percentage error than the
naive parallel estimates for a shared wall time. The difference is even more pronounced
for the trimmed estimates (first row, second column of Figure 1.2); here we see that, even
at smaller numbers of processors, the coupling estimates consistently outperform the naive
parallel estimates for a shared wall time. We see the same patterns for estimating CC(2,4) in
k-regular (second row, first two columns of Figure 1.2) and also for synthetic, seed,
and abalone in Figures 1.H.2a, 1.H.3a and 1.H.4a in Section 1.H. We see similar patterns
in the root mean squared error across replicates in Figure 1.1 (which pertains to gene) and
the left panel of Figures 1.H.2b, 1.H.3b, 1.H.4b and 1.H.6b for the remaining datasets.

Figure 1.3: Coupled-chain estimates have large outliers. Meanwhile, naive parallelism
estimates have substantial bias that does not go away with replication.
Figure 1.3 illustrates that the problem with naive parallelism is the bias of the individual
chains, whereas only variance is eliminated by parallelism. In particular, the histogram on
the right depicts the J estimates returned across each uncoupled chain at each processor j.
We see that the population mean across these estimates is substantially different from the
ground truth. This observation also clarifies why trimming does not benefit the naive parallel
estimator: trimming can eliminate outliers but not systematic bias across processors.
By contrast, we plot the J coupling estimates returned across each processor j as horizontal
coordinates of points in the left panel of Figure 1.3. Vertical coordinates are random noise to
aid in visualization. By plotting the 1% and 99% quantiles of the J estimators, we can see
that trimming will eliminate a few outliers. But the vast majority of estimates concentrate
near the ground truth.
Better confidence intervals. The third column of Figure 1.2 shows that the confidence
intervals returned by coupling are also substantially improved relative to naive parallelism.
The setup here is slightly different from that of the first two columns. For the first two
columns, we instantiated many replicates of individual users and thereby checked that coupling
generally can be counted upon to beat naive parallelism. But, in practice, an actual user
would run just a single replicate. Here, we evaluate the quality of a confidence interval that
an actual user would construct. We use only the individual estimates s^j that make up one
H_{c,J}, with s^j = H_{ℓ:m}(X^j, Y^j) (or the equivalent for H_{u,J}), to form a point estimate of H* and a
notion of uncertainty.
In the third column of Figure 1.2, each solid line shows the sample-average estimate
aggregated across J processors: (1/J) Σ_{j=1}^{J} s^j. The error bars show ±2 standard errors
of the mean (SEM), where one SEM equals √(Var({s^j}_{j=1}^{J})/(J − 1)). Since the individual
coupling estimators (blue) from each processor j are unbiased, we expect the error bars to be
calibrated, and indeed we see appropriate coverage of the ground truth (dashed black line). By
contrast, we again see systematic bias in the naive parallel estimates – and very overconfident
intervals; indeed they are so small as to be largely invisible in the top row of the third
column of Figure 1.2 – i.e., when estimating LCP in the gene dataset. The ground truth is

many standard errors away from the naive parallel estimates. We see the same patterns for
estimating CC(2,4) for k-regular (second row, third column of Figure 1.2). See the right
panel of Figures 1.H.2b, 1.H.3b and 1.H.4b in Section 1.H for similar behaviors in synthetic,
seed, and abalone.
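For concreteness, a minimal sketch of this interval computation (the array s of per-processor estimates s^j is hypothetical):

    import numpy as np

    def sem_interval(s):
        # Point estimate and +/- 2 SEM half-width from estimates s^j.
        s = np.asarray(s)
        J = s.size
        sem = np.sqrt(s.var() / (J - 1))  # numpy's var uses the 1/J convention
        return s.mean(), 2.0 * sem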

1.5.3 Faster Meeting With OT Couplings


Next we show that meeting times with our OT coupling on partitions are faster than with
label-based couplings using maximal coupling [99] and a common random number generator (common
RNG) [69]. We did not directly add a comparison with label-based couplings to our plots in
Section 1.5.2 since, in many cases, the label-based coupling chains fail to meet altogether
even with a substantially larger time budget than Section 1.5.2 currently uses.
Instead, we now provide a direct comparison of meeting times in the fourth column of
Figure 1.2. To generate each figure, we set a fixed compute-time budget: 10
minutes for the top row, and 2 minutes for the bottom row. Each time budget is roughly
the amount of time taken to generate the ground truth (i.e., the long, single-chain runs) for
each dataset. If during that time a coupling method makes the two chains meet, we record
the meeting time τ; otherwise, the meeting time for that replica is right-censored, and we
record the number of data sweeps up to that point. Using the censored data, we estimate the
survival functions of the meeting times using the classic Kaplan–Meier procedure [105].
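As a sketch of this survival-function estimation under right-censoring, one can use the lifelines package; the meeting times and budget below are hypothetical.

    import numpy as np
    from lifelines import KaplanMeierFitter

    # Meeting times in sweeps; np.inf marks replicas that did not meet
    # within the compute budget (right-censored observations).
    meeting_times = np.array([12.0, 40.0, np.inf, 25.0, np.inf, 18.0])
    budget = 100.0

    durations = np.minimum(meeting_times, budget)
    observed = meeting_times <= budget  # False for censored replicas

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)
    print(kmf.survival_function_)  # estimate of Pr(tau > t)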
In the clustering examples (Figure 1.2 top row, fourth column and also the left panel
of Figures 1.H.2c, 1.H.3c and 1.H.4c in Section 1.H), the label-based couplings’ survival
functions Pr(τ > t) do not go to zero for large times t, but instead they plateau around
0.1. In other words, the label-based coupling chains fail to meet on about 10% of attempts.
Meanwhile, all replicas with our OT coupling successfully meet in the allotted time. Since so
many label-based couplings fail to meet before the time taken to generate the ground truth,
these label-based couplings perform worse than essentially standard MCMC. In addition to
survival functions, we also plot the distance between coupled chains – which decreases the
fastest for our OT couplings – in the right panel of Figures 1.H.1c, 1.H.2c, 1.H.3c, 1.H.4c
and 1.H.6c in Section 1.H. As discussed in Section 1.E, we believe the improvement of our
OT coupling over baselines arises from using a coupling function that incentivizes decreasing
the distance between partitions rather than between labelings.
Separate from accurate estimation in little time, our comparison of survival functions
in the bottom row, fourth column of Figure 1.2 and in Figure 1.L.1 from Section 1.L is
potentially of independent interest. While the bottom row of Figure 1.2 gives results for
k-regular, Figure 1.L.1 gives results on Erdős-Rényi random graphs. The tightest bounds
for mixing time for Gibbs samplers on graph colorings to date [42] rely on couplings on
labeled representations. Our result suggests better bounds may be attainable by considering
convergence of partitions rather than labelings.

1.6 Discussion
We demonstrated how to efficiently couple partition-valued Gibbs samplers using optimal
transport – to take advantage of parallelism for improved estimation. Multiple directions

show promise for future work. E.g., while we have used CPUs in our experiments here, we
expect that GPU implementations will improve the applicability of our methodology. More
extensive theory on the trimmed estimator could clarify its guarantees and best practical
settings.

Appendix

1.A Related Work


Couplings of Markov chains have a long history in MCMC. Historically, they have primarily
been a theoretical tool for analyzing convergence of Markov chains (see e.g. Lindvall [132]
and references therein). Some works prior to Jacob et al. [93] used coupled Markov chains
for computation, but do not provide guarantees of consistency in the limit of many processes
or are not generally applicable to Markov chains over partitions. E.g., Propp and Wilson
[172] and follow-up works generate exact, i.i.d. samples but require a partial ordering of the
state space that is almost surely preserved by applications of an iterated random function
representation of the Markov transition kernel [91, Chapter 4.4]. It is unclear what such a
partial ordering looks like for the space of partitions. Neal [149] proposes estimates obtained
using circularly coupled chains that can be computed in parallel and aggregated, but these
estimates are not unbiased and so aggregated estimates are not asymptotically exact. Parallel
tempering methods [192] also utilize coupled chains to improve MCMC estimates but, like
naive parallelism, provide guarantees asymptotic only in the number of transitions, not in
the number of processes.
Outside of couplings, other lines of work have sought to utilize parallelism to obtain
improved MCMC estimates in limited time. To our best knowledge, that work has focused
on challenges introduced by large datasets and has subsequently focused on distributing
datasets across processors. For example, Rabinovich et al. [173], Scott et al. [182], Srivastava
et al. [188] explore methods running multiple chains in parallel on small subsets of a large
dataset, and Lao et al. [119] proposes using data parallelism on GPUs to accelerate likelihood
computations. However, these methods offer little help in the current setting as the partition
is the quantity of interest in our case; even if distributions over partitions of subsets are found
at each processor, these distributions are not trivial to combine across processors. Also, the
operations that avail themselves to GPU acceleration (such as matrix multiplications) are
not immediately present in Markov chains on partitions.

1.B Functions of Interest


We express functions of interest, h, in partition notation. Suppose there are N observations,
and the partition is Π = {A_1, A_2, . . . , A_K}. To compute the largest component proportion (LCP),
we first rank the clusters by decreasing size, |A_{(1)}| ≥ |A_{(2)}| ≥ . . . ≥ |A_{(K)}|, and report the
proportion of data in the largest cluster: |A_{(1)}|/N. If we are interested in the co-clustering
probability of data points indexed by j_1 and j_2, then we let h be the co-clustering indicator.
Namely, if j_1 and j_2 belong to the same element of Π (i.e. there exists some A ∈ Π such that
j_1, j_2 ∈ A), then h(Π) equals 1; otherwise, it equals 0.
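A minimal Python sketch of these two summaries, with partitions represented as lists of sets (a representation chosen here for illustration):

    def lcp(partition, N):
        # Largest component proportion: |A_(1)| / N.
        return max(len(A) for A in partition) / N

    def co_clustering(partition, j1, j2):
        # Indicator that j1 and j2 share a partition element.
        return int(any(j1 in A and j2 in A for A in partition))

    Pi = [{1}, {2, 3, 4}]
    assert lcp(Pi, N=4) == 0.75
    assert co_clustering(Pi, 2, 3) == 1 and co_clustering(Pi, 1, 2) == 0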
In addition to these summary statistics of the partition, we can also estimate cluster-specific
parameters, like cluster centers. For the Gaussian DPMM from Section 1.2.1, suppose
that we care about the mean of the cluster that contains a particular data point, say data point
1. This expectation is E(µ_A s.t. 1 ∈ A). This is equivalent to E[θ_i | x] in the notation of
MacEachern [136]. In Section 1.2.1, we use µ_A to denote the cluster center for all elements
i ∈ A, while MacEachern [136] uses individual θ_i's to denote cluster centers for individual
data points, with the possibility that θ_i = θ_j if data points i and j belong to the same
partition element. We can rewrite the expectation as E(E[µ_A s.t. 1 ∈ A | Π]), using the law
of total expectation. E[µ_A s.t. 1 ∈ A | Π] is the posterior mean of the cluster that contains
data point 1, which is a function only of the partition Π.

1.C Unbiasedness Theorem


Lemma 1.C.1 (Transition kernel is aperiodic and irreducible for Gaussian DPMM). Denote
by 1 the partition of [N ] where all elements belong to one cluster. For Gaussian DPMM, the
transition kernel from Algorithm 1 satisfies

• For any X ∈ PN , T (X, X) > 0.

• For any X ∈ PN , T (X, 1) > 0.

• For any X ∈ PN , T (1, X) > 0.

Proof of Lemma 1.C.1. For any starting X ∈ P_N, we observe that there is positive probability
of staying at the same state after the T(X, ·) transition, i.e. T(X, X) > 0. In the Gaussian DPMM,
because the support of the Gaussian distribution is the whole Euclidean space (see also Equation (1.14)),
when the nth data point is left out (resulting in the conditional p_{Π|Π(−n)}(· | X(−n))),
there is positive probability that the nth data point is re-inserted into the same partition element of X, i.e.
p_{Π|Π(−n)}(X | X(−n)) > 0. Since T(X, ·) is the composition of these N leave-outs and re-inserts,
the probability of staying at X is the product of the probabilities for each p_{Π|Π(−n)}(· | X(−n)),
which is overall a positive number.
One series of updates that transforms X into 1 in one sweep is to a) assign 1 to its own
cluster and b) assign 2, 3, . . . , N to the same cluster as 1. This series of updates also has
positive probability in the Gaussian DPMM.
On transforming 1 into X: for each component A in X, let c(A) be the smallest element
in the component. For instance, if X = {{1, 2}, {3, 4}} then c({1, 2}) = 1, c({3, 4}) = 3. We
sort the components A by their c(A) to get a list c_1 < c_2 < . . . < c_{|X|}. For each 1 ≤ n ≤ N,
let l(n) = c(A) for the component A that contains n. In the previous example, we have c_1 = 1
and c_2 = 3, while l(1) = 1, l(2) = 1, l(3) = 3, l(4) = 3. One series of updates that transforms 1
into X is

• Initialize j = 1.

• For 1 ≤ n ≤ N: if n = c_j, then make a new cluster with n and increment j = j + 1. Else,
assign n to the cluster that currently contains l(n).

This series of updates also has positive probability in the Gaussian DPMM.


Proof of Theorem 1.4.1. Because of Jacob et al. [93, Proposition 3], it suffices to check Jacob
et al. [93, Assumptions 1–3].

Checking Assumption 1. Because the sample space P_N is finite, max_{π∈P_N} h(π) is finite.
This means the expectation of any moment of h under the Markov chain is also bounded.
We show that E[h(X^t)] → H* as t → ∞ by standard ergodicity arguments.⁴

• Aperiodic. From Lemma 1.C.1, we know T (X, X) > 0 for any X. This means the Markov
chain is aperiodic [126, Section 1.3].

• Irreducible. From Lemma 1.C.1, for any X, Y , we know that T (X, 1) > 0 and T (1, Y ) > 0,
meaning that T 2 (X, Y ) > 0. This means the Markov chain is irreducible.

• Invariant w.r.t. p_Π. The transition kernel T(X, ·) from Algorithm 1 leaves the target p_Π
invariant because each leave-out conditional p_{Π|Π(−n)} leaves the target p_Π invariant. If
X ∼ p_Π, then X(−n) ∼ p_{Π(−n)}. Hence, if X̃ | X ∼ p_{Π|Π(−n)}(· | X(−n)), then by integrating out
X, we have X̃ ∼ p_Π.

By Levin and Peres [126, Theorem 4.9], there exist constants α ∈ (0, 1) and C > 0 such
that

    max_{π∈P_N} ∥T^t(π, ·) − p_Π∥_TV ≤ C α^t.

Since the sample space is finite, the total variation bound implies that for any π, expectations
under T^t(π, ·) are close to expectations under p_Π:

    max_{π∈P_N} |E_{X^t | X^0=π} h(X^t) − H*| ≤ (max_{π∈P_N} h(π)) C α^t.

Taking expectations over the initial condition X^0 = π,

    |E_{X^t} h(X^t) − H*| = |E_{X^0}[E_{X^t | X^0=π} h(X^t) − H*]| ≤ E_{X^0}|E_{X^t | X^0=π} h(X^t) − H*| ≤ (max_{π∈P_N} h(π)) C α^t.

Since the right hand side goes to zero as t → ∞, we have shown that E[h(X^t)] → H* as t → ∞.

Checking Assumption 2. To show that the meeting time is geometric, we show that
there exists ϵ' > 0 such that for any X and Y, under one coupled sweep from Algorithm 2
((X̃, Ỹ) ∼ T̂(·, (X, Y))),

    P(X̃ = Ỹ = 1 | X, Y) ≥ ϵ'.    (1.8)

⁴MacEachern [136, Theorem 1] states a geometric ergodicity theorem for a Gibbs sampler like Algorithm 1
but does not provide verification of the aperiodicity, irreducibility, or stationarity.

If this were true, we would have P(X̃ = Ỹ | X, Y) ≥ ϵ', and

    P(τ > t) = P(∩_{i=0}^{t} {X^{i+1} ≠ Y^{i}}) = P(X^1 ≠ Y^0) Π_{i=1}^{t} P(X^{i+1} ≠ Y^{i} | X^{i} ≠ Y^{i−1}),

where we have used the Markov property to remove conditioning beyond X^i ≠ Y^{i−1}. Since
min_{X,Y} P(X̃ = Ỹ | X, Y) ≥ ϵ', we have P(X^{i+1} ≠ Y^i | X^i ≠ Y^{i−1}) ≤ 1 − ϵ', meaning P(τ > t) ≤ (1 − ϵ')^t.
To see why Equation (1.8) is true: because of Lemma 1.C.1, there exists a series of
intermediate partitions x^1, x^2, . . . , x^{N−1} (with x^0 = X, x^N = 1) such that for 1 ≤ n ≤ N,
p_{Π|Π(−n)}(x^n | x^{n−1}) > 0. Likewise, there exists a series y^1, y^2, . . . , y^{N−1} for Y. Because the
coupling function ψ satisfies u^{k,k'} ≥ ϵ, for any n, there is at least probability ϵ of transitioning
to (x^n, y^n) from (x^{n−1}, y^{n−1}). Overall, there is probability at least ϵ^N of transitioning from
(X, Y) to (1, 1). Since the choice of X, Y was arbitrary, we have proven Equation (1.8)
with ϵ' = ϵ^N.

Checking Assumption 3. By design, the chains remain faithful after coupling.

1.D Time Complexity



Proposition 1.D.1. Given the atom sizes a_k, b_{k'} and atom locations π^k, ν^{k'} in the sense
of Definition 1.3.1, we can compute the coupling matrix µ^{k,k'} for the OT coupling function in
O(K̃³ log K̃) time.

Proof of Proposition 1.D.1. To find µ^{k,k'}, we need to solve the optimization problem in
Equation (1.4). However, given just the marginal distributions (a_k, b_{k'} and π^k, ν^{k'}), we do
not have enough “data” for the optimization problem, since the pairwise distances d(π^k, ν^{k'})
for k ∈ [K], k' ∈ [K'], which define the objective function, are missing. We observe that it is
not necessary to compute d(π^k, ν^{k'}); it suffices to compute d(π^k, ν^{k'}) − c for some constant c,
in the sense that the solution to the optimization problem in Equation (1.4) is unchanged
when we add a constant value to every distance. In particular, because for any coupling γ,
Σ_{k=1}^{K} Σ_{k'=1}^{K'} u^{k,k'} = 1,

    γ := argmin_{couplings γ} Σ_{k=1}^{K} Σ_{k'=1}^{K'} u^{k,k'} d(π^k, ν^{k'}) = argmin_{couplings γ} Σ_{k=1}^{K} Σ_{k'=1}^{K'} u^{k,k'} [d(π^k, ν^{k'}) − c].    (1.9)

We now show that if we set c = d(π(−n), ν(−n)), then we can compute all O(K̃²) values
of d(π^k, ν^{k'}) − c in O(K̃²) time. First, if we use A_n^k and B_n^{k'} to denote the elements of π^k and
ν^{k'}, respectively, containing data point n, then for any n we may write

    d(π^k, ν^{k'}) = d(π(−n), ν(−n)) + [|A_n^k|² − (|A_n^k| − 1)²] + [|B_n^{k'}|² − (|B_n^{k'}| − 1)²]
                     − 2 [|A_n^k ∩ B_n^{k'}|² − (|A_n^k ∩ B_n^{k'}| − 1)²].    (1.10)
Simplifying some terms, we can also write

    d(π^k, ν^{k'}) = d(π(−n), ν(−n)) + [2|A_n^k| − 1] + [2|B_n^{k'}| − 1] − 2 [2|A_n^k ∩ B_n^{k'}| − 1]
                   = d(π(−n), ν(−n)) + 2 [|A_n^k| + |B_n^{k'}| − 2|A_n^k ∩ B_n^{k'}|],

which means

    d(π^k, ν^{k'}) − d(π(−n), ν(−n)) = 2 [|A_n^k| + |B_n^{k'}| − 2|A_n^k ∩ B_n^{k'}|].

At first it may seem that this still does not solve the problem, as directly computing the
size of the set intersections is O(N) (if cluster sizes scale as O(N)). However, Equation (1.9)
is just our final stepping stone. If we additionally keep track of sizes of intersections at every
step, updating them as we adapt the partitions, it will take only constant time for each
update. As such, we are able to form the K × K' matrix of d(π^k, ν^{k'}) − c in O(K̃²) time.
With the array of d(π^k, ν^{k'}) − d(π(−n), ν(−n)), we now have enough “data” for the
optimization problem that is the optimal transport. Regardless of N, the optimization itself
may be computed in O(K̃³ log K̃) time with Orlin's algorithm [154].
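A vectorized sketch of this computation (the function name and array layout are ours): given, for the current leave-out index n, the size of the block containing n in each candidate partition and the maintained intersection sizes, the shifted cost matrix follows in O(K̃²) operations.

    import numpy as np

    def shifted_costs(A_sizes, B_sizes, intersections):
        # A_sizes[k] = |A_n^k|, B_sizes[kp] = |B_n^{k'}|, and
        # intersections[k, kp] = |A_n^k intersect B_n^{k'}|.
        A = np.asarray(A_sizes, dtype=float)[:, None]
        B = np.asarray(B_sizes, dtype=float)[None, :]
        # d(pi^k, nu^{k'}) - d(pi(-n), nu(-n)), per the identity above.
        return 2.0 * (A + B - 2.0 * np.asarray(intersections, dtype=float))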

The next proposition estimates the time taken to construct the Gibbs conditionals
(β(N, K)) for the Gaussian DPMM.

Proposition 1.D.2 (Gibbs conditional runtime with dense Σ_0, Σ_1). Suppose the covariance
matrices Σ_0 and Σ_1 are dense, i.e. the number of non-zero entries is Θ(D²). The standard
implementation takes time β(N, K) = O(ND + KD³). By spending O(D³) time precomputing
at the beginning of sampling, and using additional data structures, the time can be reduced to
β(N, K) = O(KD² + D³).

Proof of Proposition 1.D.2. We first mention the well-known posterior formula for a Gaussian
model with known covariances [18, Chapter 2.3]. Namely, if µ ∼ N(µ_0, Σ_0) and
W_1, W_2, . . . , W_M | µ ∼ N(µ, Σ_1) independently, then µ | W_1, . . . , W_M is Gaussian with covariance Σ_c and
mean µ_c satisfying

    Σ_c = (Σ_0^{-1} + M Σ_1^{-1})^{-1},
    µ_c = Σ_c (Σ_0^{-1} µ_0 + Σ_1^{-1} [Σ_{m=1}^{M} W_m]).    (1.11)
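A direct numpy transcription of Equation (1.11), as a sketch (in the optimized implementation below, the inverses would be cached rather than recomputed):

    import numpy as np

    def gaussian_posterior(mu0, Sigma0, Sigma1, W):
        # W has shape (M, D): the observations currently in the cluster.
        M = W.shape[0]
        prec0 = np.linalg.inv(Sigma0)
        prec1 = np.linalg.inv(Sigma1)
        Sigma_c = np.linalg.inv(prec0 + M * prec1)
        mu_c = Sigma_c @ (prec0 @ mu0 + prec1 @ W.sum(axis=0))
        return mu_c, Sigma_c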

Suppose |Π| = K. Based on the expressions for the Gibbs conditional in Equation (1.14),
the computational work involved for a held-out observation W_n can be broken down into
three steps:

1. Evaluating the prior likelihood N(W_n | µ_0, Σ_0 + Σ_1).

2. For each cluster c ∈ Π(−n), computing µ_c, Σ_c, (Σ_c + Σ_1)^{-1}, and the determinant of (Σ_c + Σ_1)^{-1}.

3. For each cluster c ∈ Π(−n), evaluating the likelihood N(W_n | µ_c, Σ_c + Σ_1).

Standard implementation. The time to evaluate the prior N(W_n | µ_0, Σ_0 + Σ_1) is O(D³),
as we need to compute the precision matrix (Σ_0 + Σ_1)^{-1} and its determinant. With time
O(KD³), we can compute the various cluster-specific covariances, precisions, and determinants
(where D³ is the cost for each cluster). To compute the posterior means µ_c, we need to
compute the sums Σ_j W_j for all clusters, which takes O(ND), as we need to iterate over
all D coordinates of all N observations. The time to evaluate N(W_n | µ_c, Σ_c + Σ_1) across
clusters is O(KD²). Overall this leads to O(ND + KD³) runtime.

Optimized implementation. By precomputing (Σ_0 + Σ_1)^{-1} (and its determinant) once
at the beginning of sampling for a cost of O(D³), we can solve Step 1 in time O(D²), since
that is the time to compute the quadratic form involved in the Gaussian likelihood. Once we
have the means and precisions from Step 2, the time to complete Step 3 is O(KD²): for each
cluster, it takes time O(D²) to evaluate the likelihood, and there are K clusters. It remains
to show how much time it takes to solve Step 2. We note that quantities like Σ_0^{-1} µ_0 and Σ_1^{-1}
can also be computed once in O(D³) time at start up.


Regarding the covariance Σ_c and the precisions (Σ_c + Σ_1)^{-1}: at all points during sampling,
the posterior covariance Σ_c depends only on the number of data points in the cluster
(Equation (1.11)), and leaving out data point n only changes the number of points in
exactly one cluster. Hence, if we maintain Σ_c and (Σ_c + Σ_1)^{-1} (and their determinants) for
all clusters c ∈ Π, when a data point is left out, we only need to update one such Σ_c and
(Σ_c + Σ_1)^{-1}. Namely, suppose that Π = {A_1, A_2, . . . , A_K}. We maintain the precisions
(Σ(A_1) + Σ_1)^{-1}, (Σ(A_2) + Σ_1)^{-1}, . . . , (Σ(A_K) + Σ_1)^{-1}. Let A_j be the cluster element that
originally contained n. When we leave out data point n to form Π(−n), the only precision
that needs to be changed is (Σ(A_j) + Σ_1)^{-1}. Let the new cluster be Ã_j: the time to compute
Σ(Ã_j), (Σ(Ã_j) + Σ_1)^{-1}, and its determinant is O(D³).
Regarding the means µ_c, the use of data structures similar to those for the covariances/precisions
removes the apparent need to do O(ND) computations. If we keep track of Σ_{i∈c} W_i for each
cluster c, then when data point n is left out, we only need to update Σ_{i∈c} W_i for the cluster
c that originally contained n, which takes only O(D). With the Σ_j W_j in place, evaluating
each of the K means µ_c takes O(D²); hence the time to compute the means is O(KD²). Overall,
the time spent in Step 2 is O(KD² + D³), leading to an overall O(KD² + D³) runtime.
The standard implementation is used, for instance, in de Valpine et al. [48] (see the
CRP_conjugate_dmnorm_dmnorm() function from NIMBLE's source code). Miller and
Harrison [143] use the standard implementation in the univariate case (see the Normal.jl
function).

Corollary 1.D.1 (Gibbs conditional runtime with diagonal Σ_0, Σ_1). Suppose the covariances
Σ_0 and Σ_1 are diagonal matrices, i.e. there are only Θ(D) non-zero entries. Then a standard
implementation takes time β(N, K) = O(ND). Using additional data structures, the time
can be reduced to β(N, K) = O(KD).

Proof of Corollary 1.D.1. When the covariance matrices are diagonal, we do not incur the
cubic costs of inverting D × D matrices. The breakdown of computational work is similar to
the proof of Proposition 1.D.2.

Standard implementation. The covariances and precision matrices each take only time
O(D) to compute; as there are K of them, the time taken is O(KD). To compute the
posterior means µ_c, we iterate through all coordinates of all observations in forming the sums
Σ_j W_j, leading to O(ND) runtime. The time to evaluate the Gaussian likelihoods is just O(D)
because of the diagonal precision matrices. Overall the runtime is O(ND).

Optimized implementation. By avoiding the recomputation of Σ_j W_j from scratch, we
reduce the time taken to compute the posterior means to O(KD). Overall the runtime is
O(KD).

1.E Label-Switching
1.E.1 Example 1
Suppose there are 4 data points, indexed by 1, 2, 3, 4. The labeling of the X chain is z_1 =
[1, 2, 2, 2], meaning that the partition is {{1}, {2, 3, 4}}. The labeling of the Y chain is
z_2 = [2, 1, 1, 2], meaning that the partition is {{1, 4}, {2, 3}}. The Gibbs sampler temporarily
removes the data point 4. For both chains, the remaining data points are partitioned into
{{1}, {2, 3}}. We denote π^1 = {{1, 4}, {2, 3}}, π^2 = {{1}, {2, 3, 4}}, π^3 = {{1}, {2, 3}, {4}}:
in the first two partitions, the data point is assigned to an existing cluster, while in the last
partition, the data point is in its own cluster. There exist three positive numbers a_1, a_2, a_3,
summing to one, such that

    p_{Π|Π(−4)}(· | X(−4)) = p_{Π|Π(−4)}(· | Y(−4)) = Σ_{k=1}^{3} a_k δ_{π^k}(·).

Since the two distributions on partitions are the same, couplings based on partitions like
ψ_η^OT will make the chains meet with probability 1 in the next step. However, this is not true
under labeling-based couplings like maximal or common RNG. In this example, the same
partition is represented with different labels under either chain. The X chain represents
π^1, π^2, π^3 with the labels 1, 2, 3, respectively. Meanwhile, the Y chain represents π^1, π^2, π^3
with the labels 2, 1, 3, respectively. Let z_X be the label assignment of the data point in
question (recall that we have been leaving out 4) under the X chain. Similarly we define
z_Y. Maximal coupling maximizes the probability that z_X = z_Y. However, the coupling that
results in the two chains X and Y meeting is the following:

    Pr(z_X = u, z_Y = v) = { a_3  if u = v = 3
                           { a_1  if u = 1, v = 2
                           { a_2  if u = 2, v = 1
                           { 0    otherwise.

In general, a_1 ≠ a_2, meaning that the maximal coupling is different from this coupling that
causes the two chains to achieve the same partition after updating the assignment of 4. A
similar phenomenon holds for the common RNG coupling.

1.E.2 Example 2
For the situation in Section 1.E.1, the discussion of Ju et al. from Tancredi et al. [193]
proposes a relabeling procedure to better align the clusters in the two partitions before
constructing couplings. Indeed, if z_2 were relabeled [1, 2, 2, 1] (the label of each cluster is
the smallest data index in that cluster), then upon the removal of data point 4, the
label-based and partition-based couplings would agree. However, such a relabeling fix still
suffers from the label-switching problem in general, since the smallest data index does not convey
much information about the cluster. For concreteness, we demonstrate an example where the
best coupling from minimizing label distances is different from the best coupling minimizing
partition distances.
Suppose there are 6 data points, indexed from 1 through 6. The partition of the X
chain is {{1, 3, 4}, {2, 5, 6}}. The partition of the Y chain is {{1, 5, 6}, {2, 3, 4}}. Using the
labeling rule from above, the label vector for X is z_X = [1, 2, 1, 1, 2, 2] while that for Y is
z_Y = [1, 2, 2, 2, 1, 1]. The Gibbs sampler temporarily removes the data point 1. The three
next possible states of the X chain are the partitions ν_1, ν_2, ν_3 where ν_1 = {{1, 3, 4}, {2, 5, 6}},
ν_2 = {{3, 4}, {1, 2, 5, 6}} and ν_3 = {{3, 4}, {2, 5, 6}, {1}}. The labelings of data points 2
through 6 for all three partitions are the same; the only difference between the labeling vectors
is the label of data point 1: for ν_1, z_X(1) = 1; for ν_2, z_X(1) = 2; and for ν_3, z_X(1) = 3. On
the Y side, the three next possible states of the Y chain are the partitions µ_1, µ_2, µ_3 where
µ_1 = {{1, 5, 6}, {2, 3, 4}}, µ_2 = {{5, 6}, {1, 2, 3, 4}} and µ_3 = {{5, 6}, {2, 3, 4}, {1}}. As for
the labeling of 1 under Y: for µ_1, z_Y(1) = 1; for µ_2, z_Y(1) = 2; and for µ_3, z_Y(1) = 3. Suppose
that the marginal assignment probabilities are the following:

• Pr(X = ν_1) = Pr(X = ν_2) = 0.45, Pr(X = ν_3) = 0.1.

• Pr(Y = µ_1) = Pr(Y = µ_2) = 0.45, Pr(Y = µ_3) = 0.1.

Under label-based couplings, since Pr(z_X(1) = a) = Pr(z_Y(1) = a) for a ∈ {1, 2, 3}, the
coupling that minimizes the distance between the labels will pick Pr(z_X(1) = z_Y(1)) = 1,
which means the following for the induced partitions:

    Pr(X = ν, Y = µ) = { 0.45  if ν = ν_1, µ = µ_1
                        { 0.45  if ν = ν_2, µ = µ_2
                        { 0.1   if ν = ν_3, µ = µ_3.    (1.12)

Under the partition-based transport coupling, the distances between partitions (Equation (1.5)) are the following:

         µ_1   µ_2   µ_3
    ν_1   16    10    12
    ν_2   10    16    14
    ν_3   12    14     8

Notice that the distances d(ν_1, µ_1) and d(ν_2, µ_2) are actually larger than d(ν_1, µ_2) and d(ν_2, µ_1):
in other words, the label-based coupling from Equation (1.12) proposes a coupling with
larger-than-minimal expected distance. In fact, solving the transport problem, we find that
the coupling that minimizes the expected partition distance is actually

    Pr(X = ν, Y = µ) = { 0.45  if ν = ν_1, µ = µ_2
                        { 0.45  if ν = ν_2, µ = µ_1
                        { 0.1   if ν = ν_3, µ = µ_3.    (1.13)

1.F Trimming
We consider the motivating situation in Example 1.F.1. This is a case where trimming outliers
before taking the average yields a more accurate estimator (in terms of mean squared error)
than the regular sample mean. For reference, the RMSE of an estimator µ̂ of a real-valued
unknown quantity µ is

    √(E∥µ̂ − µ∥²).

Example 1.F.1 (Mixture distribution with large outliers). For µ > 0, p < 1, consider the
mixture distribution (0.5 − p/2) N(−µ, 1) + p N(0, 1) + (0.5 − p/2) N(µ, 1). The mean is 0.
The variance is 1 + (1 − p)µ². Therefore, the RMSE of the sample mean computed using J
iid draws is √(1 + (1 − p)µ²)/√J.

In Example 1.F.1, increasing µ, which corresponds to larger outlier magnitude, increases
the RMSE.
In trimmed means (Section 1.3.4), the quantity α determines how much trimming is done.
Intuitively, for Example 1.F.1, if we trim about 0.5 − p/2 of the top and bottom samples from
the mixture distribution, what remain are roughly samples from N(0, 1).
The mean of these samples has variance only about 1/J, resulting in an RMSE which does
not suffer from large µ.
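A small simulation sketch of Example 1.F.1 comparing the two estimators (the helper sample_mixture is our own):

    import numpy as np
    from scipy import stats

    def sample_mixture(J, mu=7.0, p=0.9, rng=None):
        # (0.5 - p/2) N(-mu, 1) + p N(0, 1) + (0.5 - p/2) N(mu, 1)
        rng = rng or np.random.default_rng(0)
        centers = rng.choice([-mu, 0.0, mu], size=J,
                             p=[0.5 - p / 2, p, 0.5 - p / 2])
        return centers + rng.standard_normal(J)

    draws = sample_mixture(J=1000)
    alpha = 1.2 * (0.5 - 0.9 / 2)  # trim just past the outlier mass per tail
    print("sample mean: ", draws.mean())
    print("trimmed mean:", stats.trim_mean(draws, alpha))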
In Figure 1.F.1, we illustrate the improvement of the trimmed mean over the sample mean for
problems like Example 1.F.1. We set p = 0.9, µ = 7, and α = 1.2(0.5 − p/2). Similar to
Figure 1.1, RMSE is estimated by adding another level of simulation to capture the variability
across aggregates. The left panel shows that the RMSE of the trimmed mean is smaller than that of
the sample mean. The right panel explains why that is the case. Here, we box-plot the trimmed
mean and sample mean, where the randomness comes from the iid Monte Carlo draws from the
target mixture for J = 1000. The variance of the trimmed mean is smaller than that of the sample
mean, which matches the motivation for trimming.
For other situations where there exist better estimators than the sample mean, we refer
to the literature on Stein's paradox [190].

Figure 1.F.1: Trimmed mean has better RMSE than sample mean on Example 1.F.1. Left
panel plots RMSE versus J. Right panel gives boxplots for J = 1000.

1.G Additional Experimental Details

1.G.1 Target Distributions And Gibbs Conditionals

DPMM. Denote by N(x | µ, Σ) the Gaussian density at x for a Gaussian distribution
with mean µ and covariance Σ. For the Gaussian DPMM from Section 1.2.1, the Gibbs
conditionals have the form

    Pr(z_n = c | Π(−n), W_{1:N}) = { β · (α/(N − 1 + α)) · N(W_n | µ_0, Σ_0 + Σ_1)                      if c is a new cluster
                                   { β · ((size of cluster c)/(N − 1 + α)) · N(W_n | µ_c, Σ_c + Σ_1)    if c is an existing cluster,    (1.14)

where β is a normalization constant so that Σ_c Pr(z_n = c | Π(−n), W_{1:N}) = 1, c is an
index into the clusters that comprise Π(−n) (or a new cluster), and µ_c and Σ_c are the posterior
parameters of the cluster indexed by c. See Neal [150] for derivations.

Graph coloring. Let G be an undirected graph with vertices V = [N] and edges E ⊂ V × V,
and let Q = [q] be a set of q colors. A graph coloring is an assignment of a color in Q to each
vertex such that the endpoints of each edge have different colors. We here demonstrate
an application of our method to a Gibbs sampler which explores the uniform distribution over
valid q-colorings of G, i.e. the distribution which places equal mass on every proper coloring
of G.
To employ Algorithm 2, for this problem we need only to characterize the p.m.f. on
partitions of the vertices implied by the uniform distribution on its colorings. A partition
corresponds to a proper coloring only if no two adjacent vertices are in the same element of the
partition. As such, we can write

    p_{Π_N}(π) ∝ 1{|π| ≤ q and A(π)_{i,j} = 1 → (i, j) ∉ E, ∀ i ≠ j} · C(q, |π|) · |π|!,

where the indicator term checks that π can correspond to a proper coloring and the second
term accounts for the number of unique colorings which induce the partition π. In particular,
it is the product of the number of ways to choose |π| unique colors from Q (C(q, |π|) := q!/(|π|!(q − |π|)!))
and the number of ways to assign those colors to the groups of vertices in π.

The Gibbs conditionals have the form

    p_{Π|Π(−n)}(Π = y | Π(−n)) = [q!/(q − |y|)!] / Σ_{x consistent with Π(−n)} [q!/(q − |x|)!]
                               = [1/(q − |y|)!] / Σ_{x consistent with Π(−n)} [1/(q − |x|)!].    (1.15)

In Equation (1.15), x and y are partitions of the whole set of N vertices.
In implementations, to simulate from the conditional in Equation (1.15), it suffices to
represent the partition with a color vector. Suppose we condition on Π(−n), i.e. the
colors for all but the nth vertex are fixed, and there are q' unique colors that have been used
(q' can be strictly smaller than q). Vertex n can either take on a color in [q'] (as long as the color is
not used by a neighbor), or take on the color q' + 1 (if q' < q). The transition probabilities
are computed from the induced partition sizes |x|.
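As a sketch of these conditional probabilities (the function name is ours): given the block count |x| of the full partition under each admissible choice for vertex n, the weights are proportional to 1/(q − |x|)!.

    from math import factorial

    def coloring_conditional(block_counts, q):
        # block_counts: |x| for each admissible next partition, Eq. (1.15).
        weights = [1.0 / factorial(q - s) for s in block_counts]
        total = sum(weights)
        return [w / total for w in weights]

    # Example with q = 4 colors and candidate partitions of sizes 2, 2, 3:
    print(coloring_conditional([2, 2, 3], q=4))  # [0.25, 0.25, 0.5]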

1.G.2 General Markov Chain Settings


Ground truth. For clustering, we run 10 single-chain Gibbs samplers for 10,000 sweeps
each; we discard the first 1,000 sweeps. For graph coloring, we also run 10 chains, but each
for 100,000 sweeps and discard the first 10,000. We compute an unthinned MCMC estimate
from each chain and use the average across the 10 chains as ground truth. The standard
errors across chains are very small. Dividing the errors by the purported ground truth yields
values with magnitude smaller than 5 × 10^{-3}. In percentage error, this is less than 0.5%,
which is orders of magnitude smaller than the percentage errors from coupled chains or naive
parallel estimates.⁵

Sampler initializations. In clustering, we initialize each chain at the partition where all
data points belong to a single element, i.e. the one-component partition. In graph coloring,
we initialize the Markov chain by greedily coloring the vertices. Our intuition suggests
that coupling should be especially helpful relative to naively parallel chains when samplers
require a large burn-in, since slow mixing induces bias in the uncoupled chains. In general,
one cannot know in advance if that bias is present or not, but we can try to encourage
suboptimal initialization in our experiments to explore its effects. For completeness, we
consider alternative initialization schemes, such as k-means, in Figure 1.K.3.

Choice of hyperparameters in aggregate estimates. Recall that Equation (1.2)


involves two free hyperparameters, ℓ and m, that we need to set. A general recommendation
from Jacob et al. [93, Section 3.1] is to select m = 10ℓ and ℓ to be a large quantile of the
meeting time distribution. We take heed of these suggestions, but also prioritize m’s that are
small because we are interested in the time-limited regime. Larger m leads to longer compute
times across both coupled chains and naively parallel chains, and the bias in naively parallel
chains is more apparent for shorter m: see Figure 1.K.2. In the naive parallel case, we discard
the first 10% of sweeps completed in any time budget as burn-in steps. In our trimmed
estimates, we remove the most extreme 1% of estimates (so 0.5% in either direction).
⁵The percentage errors for LCP are typically 0.01%, while percentage errors for co-clustering are typically
0.1%.

Simulating many processes. To quantify the sampling variability of the aggregate
estimates (sample or trimmed mean across J processors), we first generate a large number
(V = 180,000) of coupled estimates H_{ℓ:m}(X^j, Y^j) (and V naive parallel estimates U^j, where
the time to construct H_{ℓ:m}(X^j, Y^j) is equal to the time to construct U^j).⁶ For each J, we
batch up the V estimates in a consistent way across coupled chains and naive parallelism, making
sure that the equality between coupled wall time and naive parallel wall time is maintained.
There are I = V/J batches. For the ith batch, we combine H_{ℓ:m}(X^j, Y^j) (or U^j) for indices
j in the list [(i − 1)J + 1, iJ] to form H_{c,J}^{(i)} (or H_{u,J}^{(i)}) in the sense of Section 1.5.2. By this
batching procedure, smaller values of J have more batches I. The largest J we consider for
gene, k-regular and abalone is 2,750 while that for synthetic and seed is 1,750. This
means the largest J has at least 57 batches.
To generate the survival functions (last column of Figure 1.2), we use 600 draws from the
(censored) meeting time distribution by simulating 600 coupling experiments.

1.G.3 Dataset Preprocessing, Hyperparameters, And Dataset-Specific Markov Chain Settings
gene i.e. single-cell RNAseq. We extract the D = 50 genes with the most variation across
N = 200 cells. We then take the log of the features and normalize so that each feature has
mean 0 and variance 1. We target the posterior of the probabilistic model in Section 1.2.1
with α = 1.0, µ_0 = 0_D, and diagonal covariance matrices Σ_0 = 0.5 I_D, Σ_1 = 1.3 I_D. Notably, this
is a simplification of the set-up considered by Prabhakaran et al. [169], who work with a
larger dataset and additionally perform fully Bayesian inference over these hyperparameters.
That the prior variance is smaller than the noise variance yields a “challenging” clustering
problem, where the cluster centers themselves are close to each other and observations are
noisy realizations of the centers. We set ℓ = 30 and m = 300.

seed i.e. wheat seed measurements. The original dataset from [41] has 8 features; we
first remove the “target” feature, which contains label information for supervised learning.
Overall there are N = 210 observations and D = 7 features. We normalize each feature to
have mean 0 and variance 1. We target the posterior of the probabilistic model in Section 1.2.1
with α = 1.0, µ_0 = 0_D, and diagonal covariance matrices Σ_0 = 1.0 I_D, Σ_1 = 1.0 I_D. We set ℓ = 10
and m = 100.

synthetic. We generate N = 300 observations from a 4-component mixture model in
2 dimensions. The four cluster centers are [−0.8, −0.8], [−0.8, 0.8], [0.8, −0.8], and [0.8, 0.8]. Each
data point is equally likely to come from one of the four components; the observation noise is
isotropic, zero-mean Gaussian with standard deviation 0.5. These settings result in a dataset
where the observations form clear clusters, but there is substantial overlap at the cluster
boundaries; see Figure 1.G.1a.
On this data, we target the posterior of the probabilistic model in Section 1.2.1 with
α = 0.2, µ_0 = 0_D, and diagonal covariance matrices Σ_0 = 0.75 I_D, Σ_1 = 0.7 I_D. Different from
⁶The best computing infrastructure we have access to has only 400 processors, so we generate these V
estimates by sequentially running V/400 batches, each batch constructing 400 estimates in parallel.

(a) synthetic data (b) k-regular data

Figure 1.G.1: Visualizing synthetic data

gene, the prior variance is larger than the noise variance for synthetic. We set ℓ = 10,
m = 100.

abalone i.e. physical measurements of abalone specimens. The original dataset
from [148] has 9 features; we first remove the “Rings” feature, which contains label information
for supervised learning, and the “Sex” feature, which contains binary information that is
not compatible with the Gaussian DPMM generative model. Overall there are N = 4,177
observations and D = 7 features. We normalize each feature to have mean 0 and variance 1.
We target the posterior of the probabilistic model in Section 1.2.1 with α = 1.0, µ_0 = 0_D,
and diagonal covariance matrices Σ_0 = 2.0 I_D, Σ_1 = 2.0 I_D. We set ℓ = 10 and m = 100.

k-regular. Anticipating that regular graphs are hard to color, we experiment with a
4-regular, 6-node graph – see Figure 1.G.1b. The target distribution is the distribution over
vertex partitions induced by uniform colorings using 4 colors. We set ℓ = 1, m = 4.

1.G.4 Visualizing Synthetic Data


Figure 1.G.1 visualizes the two synthetic datasets.

1.H All Figures


1.H.1 gene
Figure 1.H.1 shows results for LCP estimation on gene – see Figure 1.K.1 for results on co-
clustering. The two panels that did not appear in Figure 1.2 are the left panel of Figure 1.H.1b
and the right panel of Figure 1.H.1c. The left panel of Figure 1.H.1b is the same as Figure 1.1:
the y-axis plots the RMSE instead of the range of losses. As expected from the bias-variance

decomposition, the RMSE for coupled estimates decreases with increasing J because of
unbiasedness, while the RMSE for naive parallel estimates does not go away because of bias.
The right panel of Figure 1.H.1c plots typical d distances between coupled chains under
different couplings as a function of the number of sweeps done. d decreases to zero very
fast under OT coupling, while it is possible for chains under maximal and common RNG
couplings to be far from each other even after many sampling steps.

(a) Losses (b) RMSE and intervals

(c) Coupling choice

Figure 1.H.1: Results on gene.

1.H.2 synthetic
Figure 1.H.2 shows results for LCP estimation on synthetic – see Figure 1.K.1 for results
on co-clustering.

1.H.3 seed
Figure 1.H.3 shows results for LCP estimation on seed – see Figure 1.K.1 for results on
co-clustering.

1.H.4 abalone
Figure 1.H.4 shows results for LCP estimation on abalone. In Figure 1.H.4a and Fig-
ure 1.H.4b, we do not report results for the trimmed estimator with the default trimming

48
(a) Losses (b) RMSE and intervals

(c) Coupling choice

Figure 1.H.2: Results on synthetic. Figure legends are the same as Figure 1.H.1. The
results are consistent with Figure 1.2.

amount (0.01 i.e. 1%). This trimming amount is too large for the application, and in
Figure 1.H.5, we show that trimming the most extreme 0.1% yields much better estimation.
In Figure 1.H.5, the first panel (from the left) plots the errors incurred using the trimmed
mean with the default α = 1%. Trimming of coupled chains is still better than naive
parallelism, but worse than sample mean of coupled chains. In the second panel, we use
α = 0.1%, and the trimming of coupled chains performs much better. In the third panel, we
fix the number of processes to be 2000 and quantify the RMSE as a function of the trimming
amount (expressed in percentages). We see a gradual decrease in the RMSE as the trimming
amount is reduced, indicating that this is a situation in which smaller trimming amounts
are preferred.

1.H.5 k-regular
Figure 1.H.6 shows results for CC(2, 4) estimation on k-regular.

1.I Metric Impact


1.I.1 Definition Of Variation Of Information Metric
Variation of information, or VI, is defined in Meilă [141, Equation 16]. We replicate the
definition in what follows.

Figure 1.H.3: Results on seed. (a) Losses. (b) RMSE and intervals. (c) Coupling choice. Figure legends are the same as Figure 1.H.1. The results are consistent with Figure 1.2.

Let π and ν be two partitions of [N]. Denote the clusters in π by $\{A_1, A_2, \ldots, A_K\}$ and the clusters in ν by $\{B_1, B_2, \ldots, B_{K'}\}$. For each $k \in [K]$ and $k' \in [K']$,
define the number P (k, k ′ ) to be

\[
P(k, k') := \frac{|A_k \cap B_{k'}|}{N}.
\]

$|A_k \cap B_{k'}|$ is the size of the overlap between $A_k$ and $B_{k'}$. Because of the normalization by N, the $P(k, k')$'s are non-negative and sum to 1, hence can be interpreted as probability masses. Summing across all k (or k') has a marginalization effect, and we define

\[
P(k) := \sum_{k'=1}^{K'} P(k, k').
\]

Similarly we define $P'(k') := \sum_{k=1}^{K} P(k, k')$. The VI metric is then

\[
d_I(\pi, \nu) = \sum_{k=1}^{K} \sum_{k'=1}^{K'} P(k, k') \log \frac{P(k, k')}{P(k) P'(k')}. \tag{1.16}
\]

In terms of theoretical properties, Meilă [141, Property 1] shows that dI is a metric for the
space of partitions.
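To make the definition concrete, the following is a minimal Python sketch (ours, not the thesis implementation) that evaluates Equation (1.16) when the two partitions of [N] are represented as label vectors; the function name and representation are illustrative.

import numpy as np

def partition_distance(labels_pi, labels_nu):
    """Evaluate Equation (1.16) for two partitions of [N] given as label arrays."""
    labels_pi, labels_nu = np.asarray(labels_pi), np.asarray(labels_nu)
    N = len(labels_pi)
    clusters_pi = np.unique(labels_pi)
    clusters_nu = np.unique(labels_nu)
    # P[k, k'] = |A_k intersect B_{k'}| / N
    P = np.array([[np.sum((labels_pi == a) & (labels_nu == b)) / N
                   for b in clusters_nu] for a in clusters_pi])
    P_row = P.sum(axis=1)  # P(k)
    P_col = P.sum(axis=0)  # P'(k')
    mask = P > 0  # use the 0 log 0 = 0 convention
    return np.sum(P[mask] * np.log(P[mask] / np.outer(P_row, P_col)[mask]))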

Figure 1.H.4: Results on abalone. (a) Losses. (b) RMSE and intervals. (c) Coupling choice. Similar to Figure 1.2, coupled chains perform better than naive parallelism with more processes, and our coupling yields smaller meeting times than label-based couplings. See Figure 1.H.5 for the performance of trimmed estimators.

1.I.2 Impact Of Metric On Meeting Time


In Figures 1.I.1a to 1.I.1d, we examine the effect of metric on the meeting time for coupled
chains. In place of the Hamming metric in Equation (1.5), we can use the variation of
information (VI) metric from Equation (1.16) in defining the OT problem (Equation (1.4)).
Based on the survival functions, the meeting time under the VI metric is similar to the meeting time under the default Hamming metric: in all cases, the survival functions lie mostly right on top of each other. Time is measured in the number of sweeps taken, rather than processor time, because under the Hamming metric we have a fast implementation (Section 1.4.2), while we are not aware of fast implementations for the VI metric. Hence, our recommended metric choice is Hamming (Equation (1.5)).
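For intuition, a generic (and much slower) way to sample from an optimal-transport coupling of two discrete conditionals is a small linear program. The sketch below is ours and purely illustrative, not the fast implementation referenced above: p and q are the two chains' categorical conditionals over K labels, and cost[i, j] is the chosen metric between outcomes.

import numpy as np
from scipy.optimize import linprog

def sample_ot_coupling(p, q, cost, rng):
    # Solve min <cost, pi> subject to pi having row sums p and column sums q.
    K = len(p)
    A_eq, b_eq = [], []
    for i in range(K):  # row-marginal constraints: sum_j pi[i, j] = p[i]
        row = np.zeros(K * K)
        row[i * K:(i + 1) * K] = 1.0
        A_eq.append(row)
        b_eq.append(p[i])
    for j in range(K):  # column-marginal constraints: sum_i pi[i, j] = q[j]
        col = np.zeros(K * K)
        col[j::K] = 1.0
        A_eq.append(col)
        b_eq.append(q[j])
    res = linprog(np.asarray(cost, dtype=float).ravel(),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    pi = np.maximum(res.x.reshape(K, K), 0.0)
    pi /= pi.sum()
    idx = rng.choice(K * K, p=pi.ravel())
    return idx // K, idx % K  # coupled labels for the two chains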

1.J Extension to Split-Merge Sample


SplitMerge(i, j, X) is the “Restricted Gibbs Sampling Split–Merge Procedure” from Jain and
Neal [94], where our implementation proposes 1 split–merge move and uses 5 intermediate
Gibbs scans to compute the proposed split (or merge) states.
We refer to Section 1.G for the comprehensive experimental setup. The LCP estimation results for gene are given in Figure 1.J.1.

Figure 1.H.5: Effect of trimming amount on abalone.

Instead of the one-component initialization, we use a k-means clustering with 5 components as initialization. m is set to be 100, while ℓ is 10. Switching from a pure Gibbs sampler to split–merge samplers can reduce the bias caused
by a bad initialization. But there is still bias that does not go away with replication, and the
results are consistent with Figure 1.2.
We also have split-merge results for estimation of CC(0, 1) on synthetic in Figure 1.J.2.
m is set to be 50, while ℓ is 5.

1.K More RMSE Plots


1.K.1 Different Functions Of Interest
Figure 1.K.1 displays co-clustering results for clustering data sets. The results are consistent
with those for LCP estimation. Co-clustering appears to be a more challenging estimation
problem than LCP, indicated by the higher percentage errors for the same m.

1.K.2 Different Minimum Iteration (m) Settings


In Figure 1.K.2, with an increase in m (from the default 100 to 150), the bias in the naive
parallel approach reduces (the percentage error goes from 15% to 10%, for instance), and the variance of the coupled chains' estimates also decreases.

1.K.3 Different Initialization


In Figure 1.K.3, we initialize the Markov chains with the clustering from a k-means run with 5 clusters, instead of the one-component initialization. Also see Figure 1.J.1 for more k-means initialization results. The bias from naive parallelism is smaller than when initialized
from the one-component initialization (RMSE in Figure 1.K.3 is around 5% while RMSE in
Figure 1.H.1 is about 10%). However, the bias is still significant enough that even with a lot
of processors, naive parallel estimates are still inadequate.

Figure 1.H.6: Results on k-regular. (a) Losses. (b) RMSE and intervals. (c) Coupling choice. Figure legends are the same as Figure 1.H.1.

1.K.4 Different DPMM Hyperparameters


For convenience, throughout our experiments, we use diagonal covariance matrices Σ0 = s0 ID
and Σ1 = s1 ID , where the variances in different dimensions are the same. We find that the
bias of standard MCMC is influenced by s0 and s1 : some settings cause naive parallel chains
to have meaningfully large bias, while others do not. Figure 1.K.4 illustrates on synthetic
that when s1 is small compared to s0 , standard MCMC actually has small bias even when
run for a short amount of time. For values of s1 that are closer to (or larger than) s0 , the bias
in standard MCMC is much larger. m and ℓ are set to be 100 and 10 across these settings of
s1 .

1.L More Meeting Time Plots


In Figure 1.L.1, we generate Erdős-Rényi random graphs, including each possible edge with
probability 0.2. The graph in the first two panels has N = 25 vertices, while the one in the
latter two panels has N = 30. We determine a sufficient number of colors by first greedily
coloring the vertices. It turns out that 6 colors is sufficient to properly color the vertices in
either set of panels.
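A minimal sketch (ours) of the greedy coloring step used to pick the number of colors; the adjacency-list representation and function name are illustrative.

def greedy_num_colors(adj):
    # adj: dict mapping each vertex to an iterable of its neighbors
    color = {}
    for v in adj:  # any fixed vertex order yields a proper coloring
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return max(color.values()) + 1 if color else 0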

Figure 1.I.1: Hamming and VI metrics induce similar meeting times. (a) gene. (b) synthetic. (c) seed. (d) k-regular.

1.M Estimates of Predictive Density


1.M.1 Data, Target Model, And Definition Of Posterior Predictive
As the posterior predictive is easiest to visualize in one dimension, we draw artificial data
from a univariate, 10-component Gaussian mixture model with known observational noise
standard deviation σ = 2.0, and use a DPMM to analyze this data. The cluster proportions
were generated from a symmetric Dirichlet distribution with mass 1 for all 10-coordinates.
The cluster means were randomly generated from N (0, 102 ). Since this is an artificial dataset,
we can control the number of observations: we denote gmm-100 to be the dataset of 100
observations, for instance.
The target DPMM has µ0 = 0, α = 1, Σ0 = 3.0, and Σ1 = 2.0.

The function of interest is the posterior predictive density
\[
\Pr(W_{N+1} \in dx \mid W_{1:N}) = \sum_{\Pi_{N+1}} \Pr(W_{N+1} \in dx \mid \Pi_{N+1}, W_{1:N}) \Pr(\Pi_{N+1} \mid W_{1:N}). \tag{1.17}
\]

In Equation (1.17), ΠN +1 denotes the partition of the data W1:(N +1) . To translate Equa-
tion (1.17) into an integral over just the posterior over ΠN (the partition of W1:N ) we break
up $\Pi_{N+1}$ into $(\Pi_N, Z)$, where Z is the cluster indicator specifying the cluster of $\Pi_N$ (or a new cluster) to which $W_{N+1}$ belongs.

Figure 1.J.1: Split-merge results on gene. (a) Losses. (b) RMSE and intervals. (c) Estimates.

Then
\[
\Pr(W_{N+1} \in dx \mid W_{1:N}) = \sum_{\Pi_N} \left[ \sum_{Z} \Pr(W_{N+1} \in dx, Z \mid \Pi_N, W_{1:N}) \right] \Pr(\Pi_N \mid W_{1:N}).
\]

Each Pr(WN +1 ∈ dx, Z | ΠN , W1:N ) is computed using the prediction rule for the CRP and
Gaussian conditioning. Namely

\[
\Pr(W_{N+1} \in dx, Z \mid \Pi_N, W_{1:N}) = \underbrace{\Pr(W_{N+1} \in dx \mid Z, \Pi_N, W_{1:N})}_{\text{posterior predictive of Gaussian}} \times \underbrace{\Pr(Z \mid \Pi_N)}_{\text{CRP prediction rule}}.
\]

The first term is computed with the function used during Gibbs sampling to reassign data
points to clusters. In the second term, we ignore the conditioning on W1:N , since Z and W1:N
are conditionally independent given ΠN .
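The following is a hedged sketch (ours, univariate, with known observation noise variance) of how this decomposition turns posterior partition samples into a posterior predictive density estimate; all names are illustrative.

import numpy as np
from scipy.stats import norm

def predictive_density(x, data, partitions, alpha, mu0, s0sq, s1sq):
    N = len(data)
    dens = np.zeros_like(np.asarray(x, dtype=float))
    for part in partitions:  # each `part` is an array of cluster labels
        for k in np.unique(part):
            w = data[part == k]
            # Gaussian conjugate posterior for this cluster's mean
            prec = 1.0 / s0sq + len(w) / s1sq
            m = (mu0 / s0sq + w.sum() / s1sq) / prec
            # CRP weight for joining cluster k, times its posterior predictive
            dens += len(w) / (N + alpha) * norm.pdf(x, m, np.sqrt(1.0 / prec + s1sq))
        # CRP weight for opening a new cluster, times the prior predictive
        dens += alpha / (N + alpha) * norm.pdf(x, mu0, np.sqrt(s0sq + s1sq))
    return dens / len(partitions)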

1.M.2 Estimates Of Posterior Predictive Density


We first discretize the domain using 150 evenly-spaced points in the interval [−20, 30]: these
are the locations at which to evaluate the posterior predictive. We set m = 100 and ℓ = 10
in constructing the estimate from Equation (1.2). We average the results from 400 coupled
chain estimates. In each panel of Figure 1.M.1, the solid blue curve is an unbiased estimate
of the posterior predictive density: the error across replicates is very small and we do not plot uncertainty bands.

Figure 1.J.2: Split-merge results on synthetic. (a) Losses. (b) RMSE and intervals. (c) Estimates.

The black dashed curve is the true density of the population, i.e., the
10-component Gaussian mixture model density. The grey histogram bins the observed data.

1.M.3 Posterior Predictives Become More Like the True Data-Generating Density
In Figure 1.M.1, by visual inspection, the distance between the posterior predictive density
and the underlying density decreases as N increases. This is related to the phenomenon
of posterior concentration, where with more observations gathered, the Bayesian posterior
concentrates more and more on the true data generating process. We refer to Ghosal et al.
[67], Lijoi et al. [128] for more thorough discussions of posterior concentration. In what
follows, we justify the concentration behavior for the Gaussian DPMM when the observation noise is correctly specified.
Theorem 1.M.1 (DP mixtures prior is consistent for finite mixture models). Let $f_0(x) := \sum_{i=1}^m p_i \, \mathcal{N}(x \mid \theta_i, \sigma_1^2)$ be a finite mixture model. Suppose we observe iid data $X_1, \ldots, X_n$ from $f_0$. Consider the following probabilistic model:
\[
\begin{aligned}
\hat{P} &\sim \mathrm{DP}(\alpha, \mathcal{N}(0, \sigma_0^2)) \\
\theta_i \mid \hat{P} &\overset{\text{iid}}{\sim} \hat{P}, \quad i = 1, 2, \ldots, n \\
X_i \mid \theta_i &\overset{\text{indep}}{\sim} \mathcal{N}(\theta_i, \sigma_1^2), \quad i = 1, 2, \ldots, n.
\end{aligned}
\]
Let $\hat{P}_n$ be the posterior predictive distribution of this generative process. Then, almost surely under $P_{f_0}^{\infty}$,
\[
d_{TV}\left(\hat{P}_n, P_{f_0}\right) \xrightarrow{n \to \infty} 0.
\]

To prove Theorem 1.M.1, we first need some definitions and auxiliary results.

Definition 1.M.1 (Strongly consistent priors). Suppose iid data $X_1, X_2, \ldots, X_n$ is generated from some probability measure that is absolutely continuous with respect to Lebesgue measure. Denote the density of this data-generating measure by $f_0$. Let $\mathcal{F}$ be the set of all densities on $\mathbb{R}$. Consider the probabilistic model where we put a prior Π over densities f, and observations $X_i$ are conditionally iid given f. We use $P_f$ to denote the probability measure with density f. For any measurable subset A of $\mathcal{F}$, the posterior of A given the observations $X_i$ is denoted $\Pi(A \mid X_{1:n})$. A strong neighborhood around $f_0$ is any subset of $\mathcal{F}$ containing a set of the form $V = \{f \in \mathcal{F} : \int |f - f_0| < \epsilon\}$, according to Ghosal et al. [67]. The prior Π is strongly consistent at $f_0$ if for any strong neighborhood U,
\[
\lim_{n \to \infty} \Pi(U \mid X_{1:n}) = 1, \tag{1.18}
\]
holds almost surely for $X_{1:\infty}$ distributed according to $P_{f_0}^{\infty}$.

Proposition 1.M.1 (Ghosh and Ramamoorthi [68, Proposition 4.2.1]). If a prior Π is strongly consistent at $f_0$, then the predictive distribution, defined as
\[
\hat{P}_n(A \mid X_{1:n}) := \int_f P_f(A) \, \Pi(f \mid X_{1:n}), \tag{1.19}
\]
also converges to $f_0$ in total variation, almost surely under $P_{f_0}^{\infty}$:
\[
d_{TV}\left(\hat{P}_n, P_{f_0}\right) \to 0.
\]

The definition of the posterior predictive density in Equation (1.19) can equivalently be rewritten as
\[
\hat{P}_n(A \mid X_{1:n}) = \Pr(X_{n+1} \in A \mid X_{1:n}),
\]
since $P_f(A) = P_f(X_{n+1} \in A)$ and all the X's are conditionally iid given f.
We are ready to prove Theorem 1.M.1.
Proof of Theorem 1.M.1. First, we can rewrite the DP mixture model as a generative model over continuous densities f:
\[
\begin{aligned}
\hat{P} &\sim \mathrm{DP}(\alpha, \mathcal{N}(0, \sigma_0^2)) \\
f &= \mathcal{N}(0, \sigma_1^2) * \hat{P} \\
X_i \mid f &\overset{\text{iid}}{\sim} f, \quad i = 1, 2, \ldots, n,
\end{aligned} \tag{1.20}
\]
where $\mathcal{N}(0, \sigma_1^2) * \hat{P}$ is a convolution, with density $f(x) := \int_\theta \mathcal{N}(x - \theta \mid 0, \sigma_1^2) \, d\hat{P}(\theta)$.
The main idea is showing that the posterior $\Pi(f \mid X_{1:n})$ is strongly consistent and then leveraging Proposition 1.M.1. For the former, we verify the conditions of Lijoi et al. [128, Theorem 1].

The first condition of Lijoi et al. [128, Theorem 1] is that $f_0$ is in the K-L support of the prior over f in Equation (1.20). We use Ghosal et al. [67, Theorem 3]. Clearly $f_0$ is the convolution of the normal density $\mathcal{N}(0, \sigma_1^2)$ with the distribution $P(\cdot) = \sum_{i=1}^m p_i \delta_{\theta_i}$. $P(\cdot)$ is compactly supported since m is finite. Since the support of $P(\cdot)$ is the set $\{\theta_i\}_{i=1}^m$, which belongs in $\mathbb{R}$, the support of $\mathcal{N}(0, \sigma_0^2)$, the conditions on P are satisfied by Ghosh and Ramamoorthi [68, Theorem 3.2.4]. The condition that the prior over bandwidths covers the true bandwidth is trivially satisfied since we perfectly specified $\sigma_1$.

The second condition of Lijoi et al. [128, Theorem 1] is simple: because the prior over $\hat{P}$ is a DP, it reduces to checking that
\[
\int_{\mathbb{R}} |\theta| \, \mathcal{N}(\theta \mid 0, \sigma_0^2) \, d\theta < \infty,
\]
which is true.

The final condition trivially holds because we have perfectly specified $\sigma_1$: there is actually zero probability that $\sigma_1$ becomes too small, and we never need to worry about setting γ or the sequence $\sigma_k$.

Figure 1.K.1: Co-clustering results for clustering data sets. (a) Losses, CC(0, 21) estimation on gene. (b) RMSE and intervals, CC(0, 21) estimation on gene. (c) Losses, CC(0, 1) estimation on synthetic. (d) RMSE and intervals, CC(0, 1) estimation on synthetic. (e) Losses, CC(0, 19) estimation on seed. (f) RMSE and intervals, CC(0, 19) estimation on seed.
Figure 1.K.2: Impact of different m on the RMSE. (a) m = 100, seed. (b) m = 150, seed. (c) m = 100, synthetic. (d) m = 150, synthetic. The first two panels are LCP estimation for seed. The last two panels are CC(0, 1) estimation for synthetic.

Figure 1.K.3: RMSE and intervals for gene on k-means initialization.

Figure 1.K.4: The bias in naive parallel estimates is a function of the DPMM hyperparameters. (a) s1 = 0.5, s0 = 0.75. (b) s1 = 0.7, s0 = 0.75. (c) s1 = 0.85, s0 = 0.75.

Figure 1.L.1: Meeting time under OT coupling is better than under alternative couplings on Erdős–Rényi graphs, indicated by the fast decrease of the survival functions. (a) N = 25. (b) N = 30.

Figure 1.M.1: Posterior predictive density for different numbers of observations N. (a) gmm-100. (b) gmm-200. (c) gmm-300.

Chapter 2

Finite Approximations of Nonparametric Priors
2.1 Introduction
Many data analysis problems can be seen as discovering a latent set of traits in a population
— for example, recovering topics or themes from scientific papers, ancestral populations from
genetic data, interest groups from social network data, or unique speakers across audio
recordings of many meetings [20, 63, 158]. In all of these cases, we might reasonably expect
the number of latent traits present in a data set to grow with the number of observations.
One might choose a different prior for each data set size, but then model construction potentially
becomes inconvenient and unwieldy. A simpler approach is to choose a single prior that
naturally yields different expected numbers of traits for different numbers of data points.
In theory, Bayesian nonparametric (BNP) priors have exactly this desirable property due
to a countable infinity of traits, so that there are always more traits to reveal through the
accumulation of more data.
However, the infinite-dimensional parameter presents a practical challenge; namely, it is
impossible to store an infinity of random variables in memory or learn the distribution over an
infinite number of variables in finite time. Some authors have developed conjugate priors and
likelihoods [30, 96, 153] to circumvent the infinite representation; in particular, these models
allow marginalization of the infinite collection of latent traits. These models will typically
be part of a more complex generative model where the remaining components are all finite.
Therefore, users can apply approximate inference schemes such as Gibbs sampling. However,
these marginal forms typically limit the user to a constrained family of models; are not
amenable to parallelization; would require substantial new development to use with modern
inference engines like NIMBLE [47]; and are not straightforward to use with variational Bayes.

An alternative approach is to approximate the infinite-dimensional prior with a finite-


dimensional prior that essentially replaces the infinite collection of random traits by a
finite subset of “likely” traits. Unlike a fixed finite-dimensional prior across all data set
sizes, this finite-dimensional prior is an approximation to the BNP prior. Therefore, its
cardinality can be informed directly by the BNP prior and the size of the observed data.
Any moderately complex model will necessitate approximate inference, such as Markov chain
Monte Carlo (MCMC) or variational Bayes (VB). Therefore, as long as the error due to the
finite-dimensional prior approximation is small compared to the error due to using approximate
inference, inferential quality is not affected. Unlike marginal representations, probabilistic
programming languages like NIMBLE [47] natively support such finite approximations.
Much of the previous work on finite approximations developed and analyzed truncations
of series representations of the random measures underlying the nonparametric prior; we
call these truncated finite approximations (TFAs) and refer to Campbell et al. [34] for a
thorough study. TFAs start from a sequential ordering of population traits in a random
measure. The TFA retains a finite set of approximating traits; these match the population
traits until a finite point and do not include terms beyond that [7, 34, 55, 157, 180]. However,
we show in section 2.5 that the sequential nature of TFAs makes it difficult to derive update
steps in an approximate inference algorithm (either MCMC or VB) and is not amenable to
parallelization.
Here, we instead develop and analyze a general-purpose finite approximation consisting of

independent and identically distributed (i.i.d.) representations of the traits together with
their rates within the population; we call these independent finite approximations (IFAs). At
the time of writing, we are aware of two alternative lines of work on generic constructions of
finite approximations using i.i.d. random variables, namely Lijoi et al. [131] and Lee et al.
[124, 125]. Lijoi et al. [131] design approximations for clustering models, characterize the
posterior predictive distribution, and derive tractable inference schemes. However, the authors
have not developed their method for trait allocations, where data points can potentially belong
to multiple traits and can potentially exhibit traits in different amounts. And in particular it
would require additional development to perform inference in trait allocation models using
their approximations.1 Lee et al. [124, 125] construct finite approximations through a novel
augmentation scheme. However, Lee et al. [124, 125] lack explicit constructions in important
situations, such as exponential-family rate measures, because the functions involved in the
augmentation are, in general, only implicitly defined. When the augmentation is implicit,
there is not currently a way to evaluate (up to proportionality constant) the probability
density of the finite-dimensional distribution; therefore standard Markov chain Monte Carlo
and variational approaches for approximate inference are unavailable.
Our contributions. We propose a general-purpose construction for IFAs that subsumes a
number of special cases that have already been successfully used in applications (section 2.3.1).
We call our construction the automated independent finite approximation, or AIFA. We show
that AIFAs can handle a wide variety of models — including homogeneous completely random
measures (CRMs) and normalized CRMs (NCRMs) (section 2.3.3).2 Our construction can
handle (N)CRMs exhibiting power laws and has an especially convenient form for exponential
family CRMs (section 2.3.2). We show that our construction works for useful CRMs not
previously seen in the BNP literature (Example 2.3.4). Unlike marginal representations,
AIFAs do not require conditional conjugacy and can be used with VB. We show that, unlike
TFAs, AIFAs facilitate straightforward derivations within approximate inference schemes
such as MCMC or VB and are amenable to parallelization during inference (section 2.5). In
existing special cases, practitioners report similar predictive performance between AIFAs and
TFAs [117] and that AIFAs are also simpler to use compared to TFAs [63, 100]. In contrast
to the methods of Lee et al. [124, 125], one can always evaluate the probability density (up to
a proportionality constant) of AIFAs; furthermore, in section 2.6.4, AIFAs accurately learn
model hyperparameters by maximizing the marginal likelihood where the methods of Lee
et al. [124, 125] struggle.
In section 2.4, we bound the error induced by approximating an exact infinite-dimensional
prior with an AIFA. Our analysis provides interpretable error bounds with explicit dependence
on the size of the approximation and the data cardinality; our bounds can be used to set the
size of the approximation in practice. Our error bounds reveal that for the worst-case choice of
observation likelihood, to approximate the target to a desired accuracy, it is necessary to use
a large IFA model while a small TFA model would suffice. However, in practical experiments
with standard observation likelihoods, we find that AIFAs and TFAs of equal sizes have similar performance.

1 We also note that, without modification, their approximation is not suitable for use in statistical models where the unnormalized atom sizes of the CRM are bounded, as arise when modeling the frequencies (in [0, 1]) of traits. While model reparameterization may help, it requires (at least) additional steps.
2 NCRMs are also called normalized random measures with independent increments (NRMIs) [97, 176].

Likewise, we find that, when both apply, AIFAs and alternative IFAs [124, 125] exhibit similar predictive performance (section 2.6.3). But AIFAs apply more
broadly and are amenable to hyperparameter learning via optimizing the marginal likelihood,
unlike Lee et al. [124, 125] (section 2.6.4). As a further illustration, we show that we are
able to learn whether a model is over- or underdispersed, and by how much, using an AIFA
approximating a novel BNP prior in section 2.6.5.

2.2 Background
Our work will approximate nonparametric priors, so we first review construction of these
priors from completely random measures (CRMs). Then we cover existing work on the
construction of truncated and independent finite approximations for these CRM priors. For
some space Ψ, let ψi ∈ Ψ represent the i-th trait of interest, and let θi > 0 represent the
corresponding rate or frequency of this trait in the population. If the set of traits is finite, we
let I equal its cardinality; if the set of traits is countably infinite, we let I = ∞. Collect the
pairs of traits and frequencies in a measure Θ that places non-negative mass $\theta_i$ at location $\psi_i$: $\Theta := \sum_{i=1}^{I} \theta_i \delta_{\psi_i}$, where $\delta_{\psi_i}$ is a Dirac measure placing mass 1 at location $\psi_i$. To perform Bayesian inference, we need to choose a prior distribution on Θ and a likelihood for the observed data $Y_{1:N} := \{Y_n\}_{n=1}^N$ given Θ. Then, applying a disintegration, we can obtain the
posterior on Θ given the observed data.
Homogeneous completely random measures. Many common BNP priors can be
formulated as completely random measures [109, 127].3 CRMs are constructed from Poisson
point processes,4 which are straightforward to manipulate analytically [111]. Consider a
Poisson point process on $\mathbb{R}_+ := [0, \infty)$ with rate measure $\nu(d\theta)$ such that $\nu(\mathbb{R}_+) = \infty$ and $\int \min(1, \theta) \, \nu(d\theta) < \infty$. Such a process generates a countably infinite set of rates $(\theta_i)_{i=1}^\infty$ with $\theta_i \in \mathbb{R}_+$ and $0 < \sum_{i=1}^\infty \theta_i < \infty$ almost surely. We assume throughout that $\psi_i \overset{\text{i.i.d.}}{\sim} H$ for
some diffuse distribution H. The distribution H, called the ground measure, serves as a
prior on the traits in the space Ψ. For example, consider a common topic model. Each trait
ψi represents a latent topic, modeled as a probability vector in the simplex of vocabulary
words. And θi represents the frequency with which the topic ψi appears across documents in
a corpus. H is a Dirichlet distribution over the probability simplex, with dimension given by
the number of words in the vocabulary.
By pairing the rates from the Poisson process with traits drawn from the ground measure,
we obtain a completely random measure and use the shorthand CRM(H, ν) for its law:
$\Theta = \sum_i \theta_i \delta_{\psi_i} \sim \mathrm{CRM}(H, \nu)$. Since the traits $\psi_i$ and the rates $\theta_i$ are independent, the CRM is homogeneous. When the total mass $\Theta(\Psi)$ is strictly positive and finite, the corresponding normalized CRM (NCRM) is $\Xi := \Theta/\Theta(\Psi)$, which is a discrete probability measure: $\Xi = \sum_i \xi_i \delta_{\psi_i}$, where $\xi_i = \theta_i / (\sum_j \theta_j)$ [97, 176].

3 Conversely, some important priors, such as Pitman-Yor processes, are not CRMs or their normalizations and are outside the scope of the present paper [8, 129, 164].
4 For brevity, we do not consider the fixed-location and deterministic components of a CRM [109]. When these are purely atomic, they can be added to our analysis without undue effort.
The CRM prior on Θ is typically combined with a likelihood that generates trait counts for
each data point. Let ℓ(· | θ) be a proper probability mass function on N ∪ {0} for all θ in the
support of ν. The process $X_n := \sum_i x_{ni} \delta_{\psi_i}$ collects the trait counts, where $x_{ni} \mid \Theta \sim \ell(\cdot \mid \theta_i)$
independently across atom index i and i.i.d. across data index n. We denote the distribution
of Xn as LP(ℓ, Θ), which we call the likelihood process. Together, the prior on Θ and likelihood
on X given Θ form a generative model for allocation of data points to traits; hence, this
generative model is a special case of a trait allocation model [33]. Analogously, when the trait
counts are restricted to {0, 1}, this generative model represents a special case of a feature
allocation model.
Since the trait counts are typically just a latent component in a full generative model
specification, we define the observed data to be $Y_n \mid X_n \overset{\text{indep}}{\sim} f(\cdot \mid X_n)$ for a probability kernel
f (dY | X). Consider the topic modeling example: θi represents the rate of topic ψi in
a document corpus; Θ captures the rates of all topics; Xn captures how many words in
document n are generated from each topic; and Yn gives the observed collection of words for
that document.
Finite approximations. Since the set $\{\theta_i\}_{i=1}^\infty$ is countably infinite, it is not possible to simulate or perform posterior inference for every $\theta_i$. One approximation scheme uses a finite approximation $\Theta_K := \sum_{i=1}^K \rho_i \delta_{\psi_i}$. The atom sizes $\{\rho_i\}_{i=1}^K$ are designed so that $\Theta_K$ is a good approximation of Θ in a suitable sense. Since it involves a finite number of parameters
unlike Θ, ΘK can be used directly in standard posterior approximation schemes such as
Markov chain Monte Carlo or variational Bayes. But not using the full CRM Θ introduces
approximation error.
A truncated finite approximation [TFA; 7, 34, 55, 157, 180] requires constructing an ordering
on the set of rates from the Poisson process; let $(\theta_i)_{i=1}^\infty$ be the corresponding sequence of
rates. The approximation uses ρi = θi for i up to some K; i.e. one keeps the first K rates in
the sequence and ignores the remaining ones. We refer to the number of instantiated atoms
K as the approximation level. Campbell et al. [34] categorizes and analyzes TFAs. TFAs
offer an attractive nested structure: to refine an existing truncation, it suffices to generate
the additional terms in the sequence. However, the complex dependencies between the rates $(\theta_i)_{i=1}^K$ potentially make inference more challenging.
We instead develop a family of independent finite approximations (IFAs). An IFA is defined
by a sequence of probability measures ν1 , ν2 , . . . such that at approximation level K, there are
K atoms whose weights are given by $\rho_1, \ldots, \rho_K \overset{\text{i.i.d.}}{\sim} \nu_K$. The probability measures are chosen so that the sequence of approximations converges in distribution to the target CRM: $\Theta_K \overset{D}{\to} \Theta$
as K → ∞. For random measures, convergence in distribution can also be characterized by
convergence of integrals under the measures [104, Lemma 12.1 and Theorem 16.16]. The
advantages and disadvantages of IFAs reverse those of TFAs: the atoms are now i.i.d.,
potentially making inference easier, but a completely new approximation must be constructed
if K changes.
Next consider approximating an NCRM $\Xi = \sum_i \xi_i \delta_{\psi_i}$, where $\xi_i = \theta_i / (\sum_j \theta_j)$, with a finite approximation. A normalized TFA might be defined in one of two ways. In the first approach, the rates $\{\rho_i\}_{i=1}^K$ that target the CRM rates $\{\theta_i\}_{i=1}^\infty$ are normalized to form the NCRM approximation; i.e. the approximation has atom sizes $\rho_i / \sum_{j=1}^K \rho_j$ [34]. The second approach directly constructs an ordering over the sequence of normalized rates $\xi_i$ and truncates this representation.5 We construct normalized IFAs in a similar manner to the first TFA approach: the NCRM approximation has atom sizes $\rho_i / \sum_{j=1}^K \rho_j$, where $\{\rho_i\}_{i=1}^K$ are the IFA rates.

5 In this case, $\sum_{i=1}^K \xi_i < 1$. Therefore, setting the final atom size in the NCRM approximation to be $1 - \sum_{i=1}^K \xi_i$ ensures the approximation is a probability measure.
In the past, independent finite approximations have largely been developed on a case-by-case
basis [1, 27, 124, 155]. Our goal is to provide a general-purpose mechanism. Lijoi et al. [131]
and Lee et al. [125] have also recently pursued a more general construction, but we believe
there remains room for improvement. Lijoi et al. [131] focus on NCRMs for clustering; it is not
immediately clear how to adapt this work for inference in trait allocation models. Also, Lijoi
et al. [131, Theorem 1] employ infinitely divisible random variables. Since infinitely divisible
distributions that are not Dirac measures cannot have bounded support, the approximate
rates $\{\rho_i\}_{i=1}^K$ are not naturally compatible with the trait likelihood ℓ(· | θ) if the support of
the rate measure ν is bounded. But the support of ν is often bounded in applications to trait
allocation models; e.g., θi may represent a feature frequency, taking values in [0, 1], and ℓ(· | θ)
may take the form of a Bernoulli, binomial, or negative binomial distribution. Therefore,
applications of the finite approximations of Lijoi et al. [131, Theorem 1] to these models
may require some additional work. The construction in Lee et al. [125, Proposition 3.2]
yields $\{\rho_i\}_{i=1}^K$ that are compatible with ℓ(· | θ) and recovers important cases in the literature.
However, outside these special cases, it is unknown if the i.i.d. distributions are tractable
because the densities νK are not explicitly defined; see the discussion around eq. (2.3) for
more details.

Example 2.2.1 (Running example: beta process). For concreteness, we consider the (three-parameter) beta process [28, 195], also known as the stable beta process [195], as a running example of a CRM. The process BP(γ, α, d)
is defined by a mass parameter γ > 0, discount parameter d ∈ [0, 1), and concentration
parameter α > −d. It has rate measure

Γ(α + 1)
ν(dθ) = γ 1{0 ≤ θ ≤ 1}θ−d−1 (1 − θ)α+d−1 dθ. (2.1)
Γ(1 − d)Γ(α + d)

The d = 0 case yields the standard beta process [82, 198]. The beta process is typically paired
with the Bernoulli likelihood process with conditional distribution $\ell(x \mid \theta) = \theta^x (1-\theta)^{1-x} \mathbb{1}\{x \in \{0, 1\}\}$. The resulting beta–Bernoulli process has been used in factor analysis models [55, 157]
and for dictionary learning [210].

2.3 Automated independent finite approximations


In this section we introduce automated independent finite approximations, a practical con-
struction of independent finite approximations (IFAs) for a broad class of CRMs. We highlight
a useful special case of our construction for exponential family CRMs [30] without power
laws and apply our construction to approximate NCRMs. In all of these cases, we prove that
as the approximation size increases, the distribution of the approximation converges (in some
relevant sense) to that of the exact infinite-dimensional model.
2.3.1 Applying our approximation to CRMs
Formally, we define IFAs in terms of a fixed, diffuse probability measure H and a sequence of
probability measures ν1 , ν2 , . . . . The K-atom IFA ΘK is
\[
\Theta_K := \sum_{i=1}^K \rho_i \delta_{\psi_i}, \qquad \rho_i \overset{\text{i.i.d.}}{\sim} \nu_K, \qquad \psi_i \overset{\text{i.i.d.}}{\sim} H,
\]
which we write as ΘK ∼ IFAK (H, νK ). We consider CRM rate measures ν with densities that,
near zero, are (roughly) proportional to $\theta^{-1-d}$, where $d \in [0, 1)$ is the discount parameter.
We will propose a general construction for IFAs given a target random measure and prove
that it converges to the target (Theorem 2.3.1). We first summarize our requirements for
which CRMs we approximate in Assumption 2.3.1. We show in section 2.A that popular
BNP priors satisfy Assumption 2.3.1; specifically, we check the beta, gamma [59, 110, 200],
generalized gamma [26], beta prime [27], and PG(α, ζ)-generalized gamma [95] processes.
Assumption 2.3.1. For $d \in [0, 1)$ and $\eta \in V \subseteq \mathbb{R}^d$, we take $\Theta \sim \mathrm{CRM}(H, \nu(\cdot; \gamma, d, \eta))$ for
\[
\nu(d\theta; \gamma, d, \eta) := \gamma \, \theta^{-1-d} \, g(\theta)^{-d} \, \frac{h(\theta; \eta)}{Z(1 - d, \eta)} \, d\theta
\]
such that
1. for $\xi > 0$ and $\eta \in V$, $Z(\xi, \eta) := \int \theta^{\xi - 1} g(\theta)^{\xi} h(\theta; \eta) \, d\theta < \infty$;
2. g is continuous, $g(0) = 1$, and there exist constants $0 < c_* \le c^* < \infty$ such that $c_* \le g(\theta)^{-1} \le c^* (1 + \theta)$;
3. there exists $\epsilon > 0$ such that for all $\eta \in V$, the map $\theta \mapsto h(\theta; \eta)$ is continuous and bounded on $[0, \epsilon]$.
Other than the discount d and mass γ, the rate measure ν potentially depends on ad-
ditional hyperparameters η. The finiteness of the normalizer Z is necessary in defining
finite-dimensional distributions whose densities are similar in form to ν. The conditions on
the behaviors of g(θ) and h(θ; η) ensure that the overall rate measure’s behavior near θ = 0
is dominated by the θ−1−d term. The support of the rate measure is implicitly determined by
h(θ; η).
Given a CRM satisfying Assumption 2.3.1, we can construct a sequence of IFAs that converge
in distribution to that CRM.
Theorem 2.3.1. Suppose Assumption 2.3.1 holds. Let
\[
S_b(\theta) = \begin{cases} \exp\left( \dfrac{-1}{1 - (\theta - b)^2 / b^2} + 1 \right) & \text{if } \theta \in (0, b), \\ \mathbb{1}\{\theta > 0\} & \text{otherwise.} \end{cases} \tag{2.2}
\]
For $c := \gamma h(0; \eta) / Z(1 - d, \eta)$, let
\[
\nu_K(d\theta) := \theta^{-1 + cK^{-1} - d\,S_{1/K}(\theta - 1/K)} \, g(\theta)^{cK^{-1} - d} \, h(\theta; \eta) \, Z_K^{-1} \, d\theta
\]
be a family of probability densities, where $Z_K$ is chosen such that $\int \nu_K(d\theta) = 1$. If $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$, then $\Theta_K \overset{D}{\to} \Theta$ as $K \to \infty$.

See section 2.B.1 for a proof of Theorem 2.3.1. We choose the particular form of Sb (θ) in
eq. (2.2) for concreteness and convenience. But our theory still holds for a more general class
of Sb forms, as we describe in more detail in the proof of Theorem 2.3.1.
Definition 2.3.2. We call the K-atom IFA resulting from Theorem 2.3.1 the automated IFA
(AIFAK ).
Although the normalization constant ZK is not always available analytically, numerical imple-
mentation remains straightforward. When ZK is a quantity of interest, such as in section 2.6.4,
we estimate it using standard numerical integration schemes for a one-dimensional integral
[160, 204]. For other tasks, we need not access ZK directly. In our experiments, we show that
we can use either Markov chain Monte Carlo (sections 2.6.1 and 2.6.5) or variational Bayes
(sections 2.6.2 and 2.6.3) with the unnormalized density.
To illustrate our construction, we next apply Theorem 2.3.1 to BP(γ, α, d) from Example 2.2.1.
In section 2.A, we show how to construct AIFAs for the beta prime, gamma, generalized
gamma, and PG(α, ζ)-generalized gamma processes.
Example 2.3.1 (Beta process AIFA). To apply Assumption 2.3.1, let η = α + d, V = R+ ,
g(θ) = 1, $h(\theta; \eta) = (1 - \theta)^{\eta - 1} \mathbb{1}[\theta \le 1]$, and Z(ξ, η) equal the beta function B(ξ, η). Then
the CRM rate measure ν in Assumption 2.3.1 corresponds to that of BP(γ, α, d) from
Example 2.2.1. Note that we make no additional restrictions on the hyperparameters γ, α, d
beyond those in the original CRM (Example 2.2.1). Observe that h is continuous and bounded
on [0, 1/2], and the normalization function B(ξ, η) is finite for ξ > 0, η ∈ V ; it follows that
Assumption 2.3.1 holds. By Theorem 2.3.1, then, the AIFA density is
\[
\frac{1}{Z_K} \, \theta^{-1 + c/K - d\,S_{1/K}(\theta - 1/K)} \, (1 - \theta)^{\alpha + d - 1} \, \mathbb{1}\{0 \le \theta \le 1\} \, d\theta,
\]
where c := γ/B(α + d, 1 − d) and ZK is the normalization constant. The density does not in
general reduce to a beta distribution in θ due to the θ in the exponent.
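As a small illustration of the numerical treatment of $Z_K$ discussed above, the sketch below (ours, with placeholder hyperparameter values) evaluates this unnormalized density and computes $Z_K$ by one-dimensional quadrature.

import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def S(theta, b):
    # The bump function S_b from eq. (2.2)
    if theta <= 0:
        return 0.0
    if theta >= b:
        return 1.0
    return np.exp(-1.0 / (1.0 - (theta - b) ** 2 / b ** 2) + 1.0)

def unnorm_aifa_density(theta, K, gamma, alpha, d):
    c = gamma / beta_fn(alpha + d, 1.0 - d)
    expo = -1.0 + c / K - d * S(theta - 1.0 / K, 1.0 / K)
    return theta ** expo * (1.0 - theta) ** (alpha + d - 1.0)

# Placeholder values; the integrable singularity near 0 may need a higher limit.
K, gamma, alpha, d = 50, 1.0, 2.0, 0.25
Z_K, _ = quad(unnorm_aifa_density, 0.0, 1.0, args=(K, gamma, alpha, d), limit=200)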
Comparison to an alternative IFA construction. Lee et al. [125, Proposition 3.2] verify
the validity of a different IFA construction. Their construction requires two functions: (1) a bivariate function Λ(θ, t) such that for any t > 0, $\Delta(t) := \int \Lambda(\theta, t) \, \nu(d\theta) < \infty$, and (2) a
univariate function f (n) such that ∆(f (n)) is bounded from both above and below by n as
n → ∞. If these functions exist and
\[
\tilde{\nu}_K(d\theta) := \frac{\Lambda(\theta, f(K)) \, \nu(d\theta)}{\Delta(f(K))}, \tag{2.3}
\]
Lee et al. [125, Proposition 3.2] show that $\mathrm{IFA}_K(H, \tilde{\nu}_K)$ converges in distribution to CRM(H, ν)
as K → ∞. The usability of eq. (2.3) in practice depends on the tractability of Λ and f .
There are typically many tractable Λ(θ, t) [125, Section 4]. Proposition B.2 of Lee et al. [125]
lists tractable f for the important cases of the beta process and the generalized gamma
process with d > 0. However, the choice of f provided there for general power-law processes is
not tractable because its evaluation requires computing complicated inverses in the asymptotic
regime. Furthermore, for processes without power laws, no general recipe for f is known.
In contrast, the AIFA construction in Theorem 2.3.1 always yields densities that can be
evaluated up to proportionality constants.

Example 2.3.2 (Beta process: an IFA comparison). We next compare our beta process
AIFA to the two separate IFAs proposed by Lee et al. [125] and Lee et al. [124] for disjoint
subcases within the case d > 0. First consider the subcase where α = 0, d > 0. Lee et al.
[124] derive7 what we call8 the BFRY IFA. The IFA density, denoted $\nu_{\mathrm{BFRY}}(d\theta)$, is equal to
\[
\frac{\gamma}{K} \, \frac{\theta^{-d-1} (1 - \theta)^{d-1}}{B(d, 1 - d)} \left[ 1 - \exp\left( -\frac{K \Gamma(d) d}{\gamma} \left( \frac{\theta}{1 - \theta} \right)^{1/d} \right) \right] \mathbb{1}\{0 \le \theta \le 1\} \, d\theta. \tag{2.4}
\]

Second, consider the subcase where α > 0, d > 0, Lee et al. [125, Section 4.5] derive another
K-atom IFA, which we call9 the generalized Pareto IFA (GenPar IFA). The IFA density,
denoted νGenPar (dθ), is equal to
 
\[
\frac{\gamma}{K} \, \frac{\theta^{-d-1} (1 - \theta)^{\alpha+d-1}}{B(1 - d, \alpha + d)} \left[ 1 - \frac{1}{\left( \theta \left( \left(1 + \frac{Kd}{\gamma\alpha}\right)^{1/d} - 1 \right) + 1 \right)^{\alpha}} \right] \mathbb{1}\{0 \le \theta \le 1\} \, d\theta. \tag{2.5}
\]

Since the BFRY IFA and GenPar IFA apply to disjoint hyperparameter regimes, they are not
directly comparable. Since our AIFA applies to the whole domain α ≥ −d, we can separately
compare it to each of these alternative IFAs; we also highlight that the AIFA still applies
when α ∈ (−d, 0), a case not covered by either the BFRY IFA or GenPar IFA.
We find in Section 2.6.3 that the AIFA and BFRY IFA have comparable predictive performance;
the AIFA and GenPar IFA also have comparable predictive performance. But in Section 2.6.4,
we show that the AIFA is much more reliable than the BFRY IFA or the GenPar IFA
for estimating the discount (d) hyperparameter by maximizing the marginal likelihood.
Conversely, sampling from a BFRY IFA or GenPar IFA prior is easier than sampling from an AIFA prior since the BFRY and GenPar IFA priors are formed from standard distributions.

7 There is a typo in Lee et al. [124, Theorem 2, item (iii)]: θ/K should be (θ/Γ(α))/K.
8 Devroye and James [51] introduce the acronym BFRY to denote a distribution named for the authors Bertoin et al. [15]. We here use “BFRY IFA” to denote what Lee et al. [124] call the “BFRY process” and thereby emphasize that this process forms an IFA.
9 We use the term “generalized Pareto” because Lee et al. [125, Section 4.5] use generalized Pareto variates to define Λ(θ, t) from eq. (2.3).

2.3.2 Applying our approximation to exponential family CRMs


Exponential family CRMs with d = 0 comprise a widely used special case of CRMs. In what
follows, we show how Theorem 2.3.1 simplifies in this special case.
In common BNP models, the relationship between the likelihood ℓ(· | θ) and the CRM prior is
closely related to finite-dimensional exponential family conjugacy [30, Section 4]. In particular,
the likelihood has an exponential family form,

\[
\ell(x \mid \theta) := \kappa(x) \, \theta^{\phi(x)} \exp\left( \langle \mu(\theta), t(x) \rangle - A(\theta) \right). \tag{2.6}
\]



Here $x \in \mathbb{N} \cup \{0\}$, $\kappa(x) \in \mathbb{R}$ is the base density, $\phi(x) \in \mathbb{R}$ and $t(x) \in \mathbb{R}^{D'}$ (for some $D'$) form the vector of sufficient statistics $(t(x), \phi(x))^T$, $A(\theta) \in \mathbb{R}$ is the log partition function, $\mu(\theta) \in \mathbb{R}^{D'}$ and $\ln\theta$ form the vector of natural parameters $(\mu(\theta), \ln\theta)^T$, and $\langle \mu(\theta), t(x) \rangle$
denotes the standard Euclidean inner product. The rate measure nearly matches the form of
the conjugate prior, but behaves like $\theta^{-1}$ near 0:
\[
\nu(d\theta) := \gamma' \, \theta^{-1} \exp\left( \left\langle \begin{pmatrix} \psi \\ \lambda \end{pmatrix}, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right) \mathbb{1}\{\theta \in U\} \, d\theta, \tag{2.7}
\]

where $\gamma' > 0$, $\lambda > 0$, $\psi \in \mathbb{R}^{D'}$, and $U \subseteq \mathbb{R}_+$ is the support of ν. eq. (2.7) leads to the suggestive terminology of exponential family CRMs. The $\theta^{-1}$ dependence near 0 means
that these models lack power-law behavior. Models that can be cast in this form include
the standard beta process with Bernoulli or negative binomial likelihood [27, 211] and the
gamma process with Poisson likelihood [1, 180]. We refer to these models as, respectively,
the beta–Bernoulli, beta–negative binomial, and gamma–Poisson processes.
We now specialize Assumption 2.3.1 and Theorem 2.3.1 to exponential family CRMs in
Assumption 2.3.2 and Corollary 2.3.3, respectively.
Assumption 2.3.2. Let ν be of the form in eq. (2.7) and assume that
1. for any $\xi > -1$ and any $\eta = (\psi, \lambda)^T$ where $\lambda > 0$, the normalizer defined as
\[
Z(\xi, \eta) := \int_U \theta^{\xi} \exp\left( \left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right) d\theta \tag{2.8}
\]
is finite, and
2. there exists $\epsilon > 0$ such that, for any $\eta = (\psi, \lambda)^T$ where $\lambda > 0$, the map
\[
\varsigma : \theta \mapsto \exp\left( \left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right) \mathbb{1}\{\theta \in U\}
\]
is a continuous and bounded function of θ on $[0, \epsilon]$.
Corollary 2.3.3. Suppose Assumption 2.3.2 holds. For $c := \gamma' \varsigma(0)$, let
\[
\nu_K(\theta) := \frac{\theta^{c/K - 1} \, \varsigma(\theta)}{Z(c/K - 1, \eta)}. \tag{2.9}
\]
If $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$, then $\Theta_K \overset{D}{\to} \Theta$.
The density in eq. (2.9) is almost the same as the rate measure of eq. (2.7), except the $\theta^{-1}$ term has become $\theta^{c/K - 1}$. As a result, eq. (2.9) is a proper exponential-family distribution.
In section 2.A, we detail the corresponding d = 0 special cases of the AIFA for beta prime,
gamma, generalized gamma, and PG(α,ζ)-generalized gamma processes. We cover the beta
process case next.
Example 2.3.3 (Beta process AIFA for d = 0). Corollary 2.3.3 is sufficient to recover known
IFA results for BP(γ, α, 0); when d = 0, the AIFA from Example 2.3.1 simplifies to νK =
Beta(γα/K, α). Doshi-Velez et al. [55] approximate BP(γ, 1, 0) with νK = Beta(γ/K, 1).
For BP(γ, α, 0), Griffiths and Ghahramani [78] set νK = Beta (γα/K, α), and Paisley and
Carin [155] use νK = Beta (γα/K, α(1 − 1/K)). The difference between Beta (γα/K, α) and
Beta (γα/K, α(1 − 1/K)) is negligible for moderately large K.
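Simulating this d = 0 special case is straightforward; the following short sketch (ours, with illustrative names) draws the K atom weights and an N-by-K binary feature allocation matrix from the beta–Bernoulli AIFA prior.

import numpy as np

def sample_beta_bernoulli_aifa(N, K, gamma, alpha, seed=None):
    rng = np.random.default_rng(seed)
    rho = rng.beta(gamma * alpha / K, alpha, size=K)  # feature frequencies
    X = rng.random((N, K)) < rho  # each point exhibits feature i w.p. rho_i
    return rho, X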

We can also use Corollary 2.3.3 to create a new finite approximation for a nonparametric
process so far not explored in the Bayesian nonparametric literature.

Example 2.3.4 (CMP likelihood and extended gamma process). The CMP likelihood 10 [186]
is given by

\[
\ell(x \mid \theta) = \frac{\theta^x}{(x!)^\tau} \, \frac{1}{Z_\tau(\theta)}, \quad \text{where } Z_\tau(\theta) := \sum_{y=0}^{\infty} \frac{\theta^y}{(y!)^\tau}. \tag{2.10}
\]

The conjugate CRM prior, which we call an extended gamma (or Xgamma) process, has four
hyperparameters: mass γ, concentration c, maximum T , and shape τ :

\[
\nu(d\theta) = \gamma \, \theta^{-1} \, Z_\tau^{-c}(\theta) \, \mathbb{1}\{0 \le \theta \le T\} \, d\theta. \tag{2.11}
\]

Unlike existing BNP models, the model in eqs. (2.10) and (2.11), which we call the Xgamma–CMP process, is able to capture different dispersion regimes. For τ < 1, the variance of the counts
from ℓ(x | θ) is larger than the mean of the counts, corresponding to overdispersion. For τ > 1,
the variance of the counts from ℓ(x | θ) is smaller than the mean of the counts, corresponding
to underdispersion. As we show in section 2.6.5, the latent shape τ can be inferred using
observed data. Broderick et al. [27], Zhou et al. [211] provide BNP trait allocation models
that handle overdispersion. Canale and Dunson [35] provide a BNP model that handles both
underdispersion and overdispersion, but for clustering rather than traits. We are not aware
of trait allocation models that handle underdispersion, or any trait allocation models that
handle both underdispersion and overdispersion. Following the approach of Broderick et al.
[30], in section 2.D we show that as long as γ > 0, c > 0, T ≥ 1, and τ > 0, the total mass of
the rate measure is infinite and the number of active traits is almost surely finite. Under
these conditions, we show in section 2.A that Corollary 2.3.3 applies to the CRM in eq. (2.11),
and we construct the resulting AIFA.
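For reference, a hedged numerical sketch (ours) of evaluating the CMP likelihood in eq. (2.10); the infinite normalizer is truncated at a finite cap, which introduces a small additional approximation.

import math

def cmp_pmf(x, theta, tau, cap=500):
    # log of theta^y / (y!)^tau for y = 0, ..., cap - 1
    log_terms = [y * math.log(theta) - tau * math.lgamma(y + 1) for y in range(cap)]
    m = max(log_terms)
    log_Z = m + math.log(sum(math.exp(t - m) for t in log_terms))  # log-sum-exp
    return math.exp(x * math.log(theta) - tau * math.lgamma(x + 1) - log_Z)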

2.3.3 Normalized independent finite approximations


Given that AIFAs are approximations that converge to the corresponding target CRM, it is
natural to ask if normalizations of AIFAs converge to the corresponding normalization of the
target CRM, i.e., the corresponding NCRM. Our next result shows that normalized AIFAs
indeed converge, in the sense that the exchangeable partition probability functions, or EPPFs
[161], converge. Given a random sample of size N from an NCRM Ξ, the EPPF gives the
probability of the induced partition from such a sample. In particular, consider the model
$\Xi \sim \mathrm{NCRM}$, $X_n \mid \Xi \overset{\text{i.i.d.}}{\sim} \Xi$ for $1 \le n \le N$.11 Grouping the indices n with the same value of $X_n$ induces a partition over the set $\{1, 2, \ldots, N\}$. Let b represent the number of distinct values in the set $\{X_n\}_{n=1}^N$, so $b \le N$. Let $n_i$ be the number of indices n with $X_n$ equal to the i-th distinct value of $X_n$, for some ordering of the values. So $\sum_{i=1}^b n_i = N$ and $\forall i$, $n_i \ge 1$.
With this notation in hand, we can write the EPPF, which gives the probability of the induced
partition under the model, as a symmetric function $p(n_1, n_2, \ldots, n_b)$ that depends only on the counts $n_i$.

10 CMP stands for Conway-Maxwell-Poisson.
11 We reuse the $X_n$ notation from the CRM description, even though $X_n$ now is a scalar, because the role of the draws from Ξ is the same as that of the draws from Θ.

Similarly, we let $p_K(n_1, n_2, \ldots, n_b)$ be the EPPF for the normalized AIFAK.
Note that pK (n1 , n2 , . . . , nb ) = 0 when K < b since the normalized AIFAK at approximation
level K generates at most K blocks.
Theorem 2.3.4. Suppose Assumption 2.3.1 holds. Take any positive integers $N, b, \{n_i\}_{i=1}^b$ such that $b \le N$, $n_i \ge 1$, and $\sum_{i=1}^b n_i = N$. Let p be the EPPF of the NCRM $\Xi := \Theta/\Theta(\Psi)$. If $\Theta_K$ is the AIFA for Θ at approximation level K, and $p_K$ is the EPPF for the corresponding NCRM approximation $\Theta_K/\Theta_K(\Psi)$, then
\[
\lim_{K \to \infty} p_K(n_1, n_2, \ldots, n_b) = p(n_1, n_2, \ldots, n_b).
\]

See section 2.B.3 for the proof. Since the EPPF gives the probability of each partition, the
point-wise convergence in Theorem 2.3.4 certifies that the distribution over partitions induced
by sampling from the normalized AIFAK converges to that induced by sampling from the
target NCRM, for any finite sample size N .
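To make the objects being compared concrete, here is a small illustrative sketch (ours) of the partition-induction step: given finite atom weights, sample N points from the normalized measure and record the induced block sizes $(n_1, \ldots, n_b)$.

import numpy as np

def induced_block_sizes(rho, N, seed=None):
    rng = np.random.default_rng(seed)
    xi = rho / rho.sum()  # normalized IFA weights
    draws = rng.choice(len(xi), size=N, p=xi)
    _, counts = np.unique(draws, return_counts=True)
    return np.sort(counts)[::-1]  # block sizes of the induced partition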

2.4 Non-asymptotic error bounds


Theorems 2.3.1 and 2.3.4 justify the use of our proposed AIFA construction in the limit
K → ∞ but do not provide guidance on how to choose the approximation level K when N
observations are available. In section 2.4.1, we quantify the error introduced by replacing an
exponential family CRM with the AIFA. In section 2.4.2, we quantify the error introduced
by replacing a Dirichlet process (DP) [60, 184] with the corresponding normalized AIFA.
We derive error bounds that are simple to manipulate and yield recommendations for the
appropriate K for a given N and a desired accuracy level.

2.4.1 Bounds when approximating an exponential family CRM


Recall from section 2.2 that the CRM prior Θ is typically paired with a likelihood process LP,
which manifests features Xn , and a probability kernel f relating active features to observations
Yn . The target nonparametric model can be summarized as

\[
\begin{aligned}
\Theta &\sim \mathrm{CRM}(H, \nu), \\
X_n \mid \Theta &\overset{\text{i.i.d.}}{\sim} \mathrm{LP}(\ell, \Theta), & n = 1, 2, \ldots, N, \\
Y_n \mid X_n &\overset{\text{indep}}{\sim} f(\cdot \mid X_n), & n = 1, 2, \ldots, N.
\end{aligned} \tag{2.12}
\]

The approximating model, with νK as in Theorem 2.3.1 (or Corollary 2.3.3), is

\[
\begin{aligned}
\Theta_K &\sim \mathrm{AIFA}_K(H, \nu_K), \\
Z_n \mid \Theta_K &\overset{\text{i.i.d.}}{\sim} \mathrm{LP}(\ell, \Theta_K), & n = 1, 2, \ldots, N, \\
W_n \mid Z_n &\overset{\text{indep}}{\sim} f(\cdot \mid Z_n), & n = 1, 2, \ldots, N.
\end{aligned} \tag{2.13}
\]

Active traits in the approximate model are collected in Zn and observations are Wn . Let PN,∞
be the marginal distribution of the observations Y1:N and PN,K be the marginal distribution
of the observations $W_{1:N}$. The approximation error we analyze is the total variation distance $d_{TV}(P_{N,K}, P_{N,\infty}) := \sup_{0 \le g \le 1} \left| \int g \, dP_{N,K} - \int g \, dP_{N,\infty} \right|$ between the two observational processes,
one using the CRM and the other one using the approximate AIFAK as the prior. Total
variation is a standard choice of error when analyzing CRM approximations [34, 55, 89, 157].
Small total variation distance implies small differences in expectations of bounded functions.
Conditions. In our analysis, we focus on exponential family CRMs and conjugate like-
lihood processes. We will suppose Assumption 2.3.2 holds. Our analysis guarantees that
dTV (PN,K , PN,∞ ) is small whenever a conjugate exponential family CRM–likelihood pair and
the corresponding AIFA model satisfy certain conditions, beyond those already stated in
Assumption 2.3.2. In the proof of the error bound, these conditions serve as intermediate
results that ultimately lead to small approximation error. Because we can verify the conditions
for common models, we have error bounds in the most prevalent use cases of CRMs. To
express these conditions, we use the marginal process representation of the target and the
approximate model, i.e., the series of conditional distributions of Xn | X1:(n−1) (or Zn | Z1:(n−1) )
with Θ (or ΘK ) integrated out. Corollary 6.2 of Broderick et al. [30] guarantees that the
marginal Xn | X1:(n−1) is a random measure with finite support and with a convenient form.
Since we will use this form to write our conditions (Condition 2.4.1 below), we first review
the requisite notation — and establish analogous notation for Zn | Z1:(n−1) .
We start by defining h and M to describe the conditional distribution $X_n \mid X_{1:(n-1)}$. Let $K_{n-1}$ be the number of unique atom locations in $X_1, X_2, \ldots, X_{n-1}$, and let $\{\zeta_i\}_{i=1}^{K_{n-1}}$ be the
collection of unique atom locations in X1 , X2 , . . . , Xn−1 . Fix an atom location ζj (the choice
of j does not matter). For m with 1 ≤ m ≤ n, let xm be the atom size of Xm at atom
location ζj ; xm may be zero if there is no atom at ζj in Xm . The distribution of xn depends
only on the x1:(n−1) values, which are the atom sizes of previous measures Xm at ζj . We use
h(x | x1:(n−1) ) to denote the probability mass function (p.m.f.) of xn at value x. Furthermore,
Xn has a finite number of new atoms, which can be grouped together by atom size. Consider
any potential atom size x ∈ N. Define pn,x to be the number of atoms of size x. Regardless
of atom size, each atom location is a fresh draw from the ground measure H and pn,x is
Poisson-distributed; we use Mn,x to denote the mean of pn,x .
Next, we define $\tilde{h}$, which governs the conditional distribution of $Z_n \mid Z_{1:(n-1)}$. Let $0_{n-1}$ be the zero vector with $n - 1$ components. Although $h(x \mid x_{1:(n-1)})$ is defined only for count vectors $x_{1:(n-1)}$ that are not identically zero, we will see that $\tilde{h}(x \mid 0_{n-1})$ is well-defined. In particular, let $\{\zeta_i\}_{i=1}^{K_{n-1}}$ be the union of atom locations in $Z_1, Z_2, \ldots, Z_{n-1}$. Fix an atom location $\zeta_j$. For $1 \le m \le n$, let $x_m$ be the atom size of $Z_m$ at atom location $\zeta_j$. We write the p.m.f. of $x_n$ at x as $\tilde{h}(x \mid x_{1:(n-1)})$. In addition, $Z_n$ also has a maximum of $K - K_{n-1}$ new atoms with locations disjoint from $\{\zeta_i\}_{i=1}^{K_{n-1}}$, and the distribution of atom sizes is governed by $\tilde{h}(x \mid 0_{n-1})$.
Note that we reuse the xn and ζj notation from Xn | X1:(n−1) without risk of confusion, since
$x_n$ and $\zeta_j$ are dummy variables whose meanings are clear given the context of h or $\tilde{h}$.
In section 2.C, we describe the marginal processes in more detail and give formulas for h, $\tilde{h}$,
and Mn,x in terms of the functions that parametrize eqs. (2.6) and (2.7) and the normalizer
eq. (2.8). For the beta–Bernoulli process with d = 0, the functions have particularly convenient
forms.

Example 2.4.1. For the beta–Bernoulli model with d = 0, we have
\[
h(x \mid x_{1:(n-1)}) = \frac{\sum_{i=1}^{n-1} x_i}{\alpha - 1 + n} \mathbb{1}\{x = 1\} + \frac{\alpha + \sum_{i=1}^{n-1} (1 - x_i)}{\alpha - 1 + n} \mathbb{1}\{x = 0\},
\]
\[
\tilde{h}(x \mid x_{1:(n-1)}) = \frac{\sum_{i=1}^{n-1} x_i + \gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K} \mathbb{1}\{x = 1\} + \frac{\alpha + \sum_{i=1}^{n-1} (1 - x_i)}{\alpha - 1 + n + \gamma\alpha/K} \mathbb{1}\{x = 0\},
\]
\[
M_{n,1} = \frac{\gamma\alpha}{\alpha - 1 + n}, \qquad M_{n,x} = 0 \text{ for } x > 1.
\]
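A direct transcription (ours, as a sketch with hypothetical function names) of the three formulas above:

def h_target(x_prev, alpha):
    """P(x_n = 1 | previous counts) under the marginalized beta-Bernoulli."""
    n = len(x_prev) + 1
    return sum(x_prev) / (alpha - 1 + n)

def h_tilde(x_prev, alpha, gamma, K):
    """The AIFA analogue: P(x_n = 1 | previous counts) at one atom."""
    n = len(x_prev) + 1
    return (sum(x_prev) + gamma * alpha / K) / (alpha - 1 + n + gamma * alpha / K)

def M_new(n, alpha, gamma):
    """Poisson mean M_{n,1} for the number of brand-new features at step n."""
    return gamma * alpha / (alpha - 1 + n)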
We now formulate conditions on h, $\tilde{h}$, and $M_{n,x}$ that will yield small $d_{TV}(P_{N,K}, P_{N,\infty})$.

Condition 2.4.1. There exist constants $\{C_i\}_{i=1}^5$ such that

1. for all $n \in \mathbb{N}$,
\[
\sum_{x=1}^{\infty} M_{n,x} \le \frac{C_1}{n - 1 + C_1}; \tag{2.14}
\]

2. for all $n \in \mathbb{N}$,
\[
\sum_{x=1}^{\infty} \tilde{h}(x \mid x_{1:(n-1)} = 0_{n-1}) \le \frac{1}{K} \frac{C_1}{n - 1 + C_1}; \tag{2.15}
\]

3. for any $n \in \mathbb{N}$, for any $\{x_i\}_{i=1}^{n-1} \ne 0_{n-1}$,
\[
\sum_{x=0}^{\infty} \left| h(x \mid x_{1:(n-1)}) - \tilde{h}(x \mid x_{1:(n-1)}) \right| \le \frac{1}{K} \frac{C_1}{n - 1 + C_1}; \text{ and} \tag{2.16}
\]

4. for all $n \in \mathbb{N}$, for any $K \ge C_2 (\ln n + C_3)$,
\[
\sum_{x=1}^{\infty} \left| M_{n,x} - K \tilde{h}(x \mid x_{1:(n-1)} = 0_{n-1}) \right| \le \frac{1}{K} \frac{C_4 \ln n + C_5}{n - 1 + C_1}. \tag{2.17}
\]

Note that the conditions depend only on the functions governing the exponential family
CRM prior and its conjugate likelihood process — and not on the observation likelihood f. eq. (2.14) constrains the growth rate of the target model since $\sum_{n=1}^{N} \sum_{x=1}^{\infty} M_{n,x}$ is the expected number of components for data cardinality N. Because each $\sum_{x=1}^{\infty} M_{n,x}$ is at most
expected number of components for data cardinality N . Because each x=1 Mn,x is at most
O(1/n), the total number of components after N samples is O(ln N ). Similarly, eq. (2.15)
constrains the growth rate of the approximate model. The third condition (eq. (2.16)) ensures
that eh is a good approximation of h in total variation distance and that there is also a
reduction in the error as n increases. Finally, eq. (2.17) implies that K e h(x | 0n−1 ) is an
accurate approximation of Mn,x , and there is also a reduction in the error as n increases.
We show that Condition 2.4.1 holds for the most commonly used non-power-law CRM models;
see Example 2.4.2 for the case of the beta–Bernoulli model with discount d = 0 and section 2.F
for the beta–negative binomial and gamma–Poisson models with d = 0. As we detail next, we
believe Condition 2.4.1 is also reasonable beyond these common models. The O(1/n) quantity
in eq. (2.14) is the typical expected number of new features after observing n observations in non-power-law BNP models. eqs. (2.15) to (2.17) are likely to hold when $\tilde{h}$ is a small perturbation of h and $K\tilde{h}$ is a small perturbation of $M_{n,x}$. For instance, in Example 2.4.1, the functional form of $\tilde{h}$ is very similar to that of h, except that $\tilde{h}$ has the additional $\gamma\alpha/K$ factor in both numerator and denominator. The functional form of $K\tilde{h}$ is very similar to that of $M_{n,x}$, except that $K\tilde{h}$ has an additional $\gamma\alpha/K$ factor in the denominator.

Example 2.4.2 (Beta–Bernoulli with d = 0, continued). The growth rate of the target model is
\[
\sum_{x=1}^{\infty} M_{n,x} = M_{n,1} = \frac{\gamma\alpha}{n - 1 + \alpha}.
\]
Since $\tilde{h}$ is supported on $\{0, 1\}$, the growth rate of the approximate model is
\[
\tilde{h}(1 \mid x_{1:(n-1)} = 0_{n-1}) = \frac{\gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K} \le \frac{1}{K} \frac{\gamma\alpha}{n - 1 + \alpha}.
\]
Since both h and $\tilde{h}$ are supported on $\{0, 1\}$, eq. (2.16) becomes
\[
\left| \tilde{h}(1 \mid x_{1:(n-1)}) - h(1 \mid x_{1:(n-1)}) \right| = \left| \frac{\sum_{i=1}^{n-1} x_i + \gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K} - \frac{\sum_{i=1}^{n-1} x_i}{\alpha - 1 + n} \right| \le \frac{\gamma\alpha}{K} \frac{1}{n - 1 + \alpha}.
\]
And because $M_{n,x} = 0 = \tilde{h}(x \mid \cdot)$ for $x > 1$, eq. (2.17) becomes
\[
\left| M_{n,1} - K \tilde{h}(1 \mid x_{1:(n-1)} = 0_{n-1}) \right| = \left| \frac{\gamma\alpha}{\alpha - 1 + n} - \frac{\gamma\alpha}{\alpha - 1 + n + \frac{\gamma\alpha}{K}} \right| \le \frac{\gamma^2\alpha}{K} \frac{1}{n - 1 + \alpha}.
\]

Calibrating {Ci } based on these inequalities is straightforward.


Upper bound. We now make use of Condition 2.4.1 to derive an upper bound on the
approximation error induced by AIFAs.
Theorem 2.4.1 (Upper bound for exponential family CRMs). Recall that PN,∞ is the
distribution of Y1:N from eq. (2.12) while PN,K is the distribution of W1:N from eq. (2.13). If
Assumption 2.3.2 and Condition 2.4.1 hold, then there exist positive constants $C', C'', C''', C''''$ depending only on $\{C_i\}_{i=1}^5$ such that
\[
d_{TV}(P_{N,\infty}, P_{N,K}) \le \frac{C' + C'' \ln^2 N + C''' \ln N \ln K + C'''' \ln K}{K}.
\]

See section 2.G.1 for explicit values of the constants as well as the proof. Theorem 2.4.1
states that the AIFA approximation error grows as $O(\ln^2 N)$ with fixed K, and decreases as
O (ln K/K) for fixed N . The bound accords with our intuition that, for fixed K, the error
should increase as N increases: with more data, the expected number of latent components in
the data increases, demanding finite approximations of increasingly larger sizes. In particular,
O(ln N ) is the standard Bayesian nonparametric growth rate for non-power law models. It is

likely that the $O(\ln^2 N)$ factor can be improved to $O(\ln N)$ due to $O(\ln N)$ being the natural
growth rate; more generally, we conjecture that the error directly depends on the expected
number of latent components in a model for N observations. On the other hand, for fixed N ,
we expect that error should decrease as K increases and the approximation thus has greater
capacity. This behavior also matches Theorem 2.3.1, which guarantees that sufficiently large
finite models have small error.
We highlight that Theorem 2.4.1 provides upper bounds both (i) for approximations that
were already known in the literature but where bounds were not already known, as in the
case of the beta–negative binomial process, and (ii) for processes and approximations not
previously studied in the literature in any form.
Lower bounds. From the upper bound in Theorem 2.4.1, we know how to set a sufficient
number of atoms for accurate approximations: for the total variation to be less than some ϵ, we
solve for the smallest K such that the right hand side of Theorem 2.4.1 is smaller than ϵ. We
now derive lower bounds on the AIFA approximation error to characterize a necessary number
of atoms for accurate approximations, by looking at worst-case observational likelihoods
f . In particular, Theorem 2.4.1 implies that an AIFA with K = O (poly(ln N )/ϵ) atoms
suffices in approximating the target model to less than ϵ error. In Theorem 2.4.2 below, we
establish that K must grow at least at a ln N rate in the worst case. In Theorem 2.4.3 below,
we establish that the 1/ϵ term is necessary. To the best of our knowledge, Theorems 2.4.2
and 2.4.3 are the first lower bounds on IFA approximation error for any process.
Our lower bounds apply to the beta–Bernoulli process with d = 0. Recall that PN,∞ is the
distribution of Y1:N from eq. (2.12) while PN,K is the distribution of W1:N from eq. (2.13). In
what follows, PN,∞
BP
refers to the marginal distribution of the observations that arises when
we use the prior BP(γ, α, 0). Analogously, PN,K BP
is the observational distribution that arises
when we use the AIFAK approximation in Example 2.3.1. The observational likelihood f
will be clear from context. The worst-case observational likelihoods f are pathological. We
leave to future work to lower bound the approximation error when more common likelihoods
f , such as Gaussian or Dirichlet, are used.
For the first result, it will be useful to define the growth function for any N ∈ N, α > 0:
N
X α
C(N, α) := . (2.18)
n=1
n−1+α

C(N, α) satisfies limN →∞ C(N, α)/(α ln N ) = 1; this asymptotic equivalence is a corollary of


Lemma 2.E.10 or Theorem 2.3 from Korwar and Hollander [113]. Our next result shows that
our AIFA approximation can be poor if the approximation level K is too small compared to
the growth function C(N, α).

Theorem 2.4.2 (ln N is necessary). For the beta–Bernoulli process model with d = 0,
there exists an observation likelihood f , independent of K and N , such that for any N , if
K ≤ 0.5γC(N, α), then
BP BP C
dTV (PN,∞ , PN,K ) ≥ 1 − γα/8 ,
N
where C is a constant depending only on γ and α.

78
See section 2.G.2 for the proof. The intuition is that, with high probability, the number of
features that manifest in the target X1:N is greater than 0.5γC(N, α). However, the finite
model Z1:N has fewer than 0.5γC(N, α) components. Hence, there is an event where the target
and approximation assign drastically different probability masses. Theorem 2.4.2 implies that
as N grows, if the approximation level K fails to surpass the 0.5γC(N, α) threshold, then
the total variation between the approximate and the target model remains bounded from
zero; in fact, the error tends to one.
We next show that the 1/K factor in the upper bound from Theorem 2.4.1 is tight (up to
logarithmic factors).

Theorem 2.4.3 (Lower bound of 1/K). For the beta–Bernoulli process model with d = 0,
there exists an observation likelihood f , independent of K and N , such that for any N ,

BP BP 1 1
dTV (PN,∞ , PN,K )≥C 2
,
(1 + γ/K) K

where C is a constant depending only on γ.

See section 2.G.2 for the proof. The intuition is that, under the pathological likelihood
f , analyzing the AIFA approximation error is the same as analyzing the binomial–Poisson
approximation error [122]. We then show that 1/K is a lower bound using the techniques
from [13]. Theorem 2.4.3 implies that an AIFA with K = Ω (1/ϵ) atoms is necessary in the
worst case.
Our lower bounds (which apply specifically to the beta–Bernoulli process) are much less
general than our upper bounds. However, as a practical matter, generality in the lower
bounds is not so crucial due to the different roles played by upper and lower bounds. Upper
bounds give control over the approximation error; this control is what is needed to trust
the approximation and to set the approximation level. Whether or not we have access to
lower bounds, general-purpose upper bounds give us this control. Lower bounds, on the other
hand, serve as a helpful check that the upper bounds are not too loose — and reassure us
that we are not inefficiently using too many atoms in a too-large approximation. From that
standpoint, the need for general-purpose lower bounds is not as pressing.
The dependence on the accuracy level in the d = 0 beta–Bernoulli process is worse for AIFAs
than for TFAs. For example, consider the Bondesson approximation [24, 34] of BP(γ, α, 0);
we will see next that this approximation is a TFA with excellent error bounds.
i.i.d.
Pk 2.4.3 (Bondesson approximation [24]). Fix α ≥ 1, let El ∼ Exp(1), P
Example and and let
Γk := l=1 El . The K-atom Bondesson approximation of BP(γ, α, 0) is a TFA K k=1 θk δψk ,
i.i.d. i.i.d.
where θk := Vk exp(−Γk /γα), Vk ∼ Beta(1, α − 1), and ψk ∼ H.

The following result gives a bound on the error of the Bondesson approximation.

Proposition 2.4.4. [34, Appendix A.1] For γ > 0, α ≥ 1, let ΘK be distributed according
i.i.d. indep
to a level-K Bondesson approximation of BP(γ, α, 0), Rn | ΘK ∼ LP(ℓ; ΘK ), Tn | Rn ∼
f (· | Rn ) with N observations. Let QN,K be the distribution of the observations T1:N . Then:
 K
γα
BP

dTV PN,∞ , QN,K ≤ N γ 1+γα .

79
Proposition 2.4.4 implies that a TFA with K = O (ln{N/ϵ}) atoms suffices in approximating
the target model to less than ϵ error. Up to log factors in N , comparing the necessary 1/ϵ
level for an AIFA and the sufficient ln (1/ϵ) level for a TFA, we conclude that the necessary
size for an AIFA is exponentially larger than the sufficient size for a TFA, in the worst-case
observational likelihood f.

2.4.2 Approximating a (hierarchical) Dirichlet process


So far we have analyzed AIFA error for CRM-based models. In this section, we analyze the
error that arises from using a normalized AIFA as an approximation for an NCRM; here, we
focus on a Dirichlet process — i.e., a normalized gamma process without power-law behavior.
We first consider a generative model with the same number of layers as in previous sections.
But we also consider a more complex generative model, with an additional layer — as is
common in, e.g., text analysis. Indeed, one of the strengths of Bayesian modeling is the
flexibility facilitated by hierarchical modeling, and a goal of probabilistic programming is to
provide fast, automated inference for these more complex models.
Dirichlet process. The Dirichlet process is one of the most widely used nonparametric
priors and arises as a normalized gamma process. The generalized gamma process CRM is
λ1−d −d−1 −λθ
characterized by the rate measure ν(dθ) = γ Γ(1−d) θ e dθ. We denote its distribution as
ΓP(γ, λ, d). A normalized draw from ΓP(γ, 1, 0) is Dirichlet-process distributed with mass
parameter γ [60, 110]. By Corollary 2.3.3, IFAK (H, νK ) with νK = Gam(γ/K, 1) converges to
ΓP(γ, 1, 0). Because the normalization of independent gamma random variables is aPDirichlet
random variable, a normalized draw from IFAK (H, νK ) is equal in distribution to K i=1 pi δψi
i.i.d.
where ψi ∼ H and {pi }K i=1 ∼ Dir({γ/K}1K ). We call this distribution the finite symmetric
Dirichlet (FSD), and denote it as FSDK (γ, H).12
In the simplest use case, the Dirichlet process is used as the de Finetti measure for observations
i.i.d.
Xn ; i.e., Ξ ∼ DP, Xn | Ξ ∼ Ξ for 1 ≤ n ≤ N . In section 2.H, we state error bounds when
FSDK replaces the Dirichlet process as the mixing measure that are analogous to the results
in section 2.4.1. The upper bound is similar to Theorem 2.4.1 in that the error grows as
O(ln2 N ) with fixed K, and decreases as O (ln K/K) for fixed N . The lower bounds, which
are the analogues of Theorems 2.4.2 and 2.4.3, state that K = Ω(ln N ) is necessary for
accurate approximations, and that truncation-based approximations are better than FSDK ,
in the worst case. In comparison to existing results [89, 90], Theorem 1 of Ishwaran and
Zarepour [90] does not bound the distance between observational processes, so it is not directly
comparable to our error bound. We improve upon Theorem 4 of Ishwaran and Zarepour
[89], whose upper bound on the FSD approximation error lacks an explicit dependence on
K or N . So, unlike our bounds, that bound cannot be inverted to determine a sufficient
approximation level K.
Hierarchical Dirichlet process. In modern applications such as text analysis, practitioners
use additional hierarchical levels to capture group structure in observed data. In text, we
might have D documents with N words in each. More, generally, we might have D groups
(each indexed by d) with N observations (each indexed by n) each. We target the influential
12
The name “finite symmetric Dirichlet” comes from Kurihara et al. [117]. See Ishwaran and James [88,
Section 2.2] for other names this distribution has had in the literature.

80
model of Hoffman et al. [85], Wang et al. [206], which is a variant of the hierarchical Dirichlet
process [HDP; 196] and which we refer to as the modified HDP. In the HDP, G is a population
measure with G ∼ DP(ω, H). The measure for the d-th subpopulation is Gd | G ∼ DP(α, G);
the concentrations ω and α are potentially different from each other. The modified HDP is
defined in terms of the truncated stick-breaking (TSB) approximation:
i.i.d.
Definition 2.4.5 (Stick-breaking approximation [184]). For i = 1, 2, . . . , K − 1, let vi ∼
i.i.d.
Beta(1, α). Set vK = 1. Let ξi = vi i−1
j=1 (1 − vj ). Let ψk ∼ H, and ΞK = k=1 ξk δψk . We
Q PK
denote the distribution of ΞK as TSBK (α, H).

In the modified HDP, the sub-population measure is distributed as Gd | G ∼ TSBT (α, G).
Wang et al. [206] and Hoffman et al. [85] set T to be small so that inference in the modified
HDP is more efficient than in the HDP, since the number of parameters per group is greatly
reduced. From a modeling standpoint, small T is a reasonable assumption since documents
typically manifest a small number of topics from the corpus, with the total number depending
on the document length and independent of corpus size. For completeness, the generative
process of the modified HDP is

G ∼ DP(ω, H),
i.i.d.
Hd | G ∼ TSBT (α, G) across d,
indep (2.19)
βdn | Hd ∼ Hd (·) across d, n
indep
Wdn | βdn ∼ f (· | βdn ) across d, n.

Hd contains at most T distinct atom locations, all shared with the base measure G.
The finite approximation we consider replaces the population-level Dirichlet process with
FSDK , keeping the other conditionals intact:13

GK ∼ FSDK (ω, H),


i.i.d.
Fd | GK ∼ TSBT (α, GK ) across d,
indep (2.20)
ψdn | Fd ∼ Fd (·) across d, n,
indep
Zdn | ψdn ∼ f (· | ψdn ) across d, n.

Our contribution is analyzing the error of eq. (2.20).


Let P(N,D),∞ be the distribution of the observations {Wdn }. Let P(N,D),K be the distribution
of the observations {Zdn }. We have the following bound on the total variation distance
between P(N,D),∞ and P(N,D),K .

13
Our construction in eq. (2.20) is slightly different from Eqs. 5.5 and 5.6 in Fox et al. [63]. Our document-
level process Fd contains at most T topics from the underlying corpus; by contrast, the Fox et al. [63]
document-level process contains as many topics as the corpus-level process. However, the novelty of eq. (2.20)
is incidental since the replacement of the population-level DP with the FSD in the modified HDP is analogous
to the DP case.

81
Theorem 2.4.6 (Upper bound for modified HDP). For some constants C ′ , C ′′ , C ′′′ , C ′′′ that
depend only on ω,
 C ′ + C ′′ ln2 (DT ) + C ′′′ ln(DT ) ln K + C ′′′′ ln K
dTV P(N,D),∞ , P(N,D),K ≤ .
K

See section 2.I.1 for explicit values of the constants as well as the theorem’s proof. For
fixed K, Theorem 2.4.6 is independent of N , the number of observations in each group, but
scales with the number of groups D like O(poly(ln D)). For fixed D, the approximation error
decreases to zero at rate no slower that O (ln K/K). The O(ln(DT )) factor is related to the
expected logarithmic growth rate of Dirichlet process mixture models [9, Section 5.2] in the
following way. Since there are D groups, each manifesting at most T distinct atom locations
from an underlying Dirichlet process prior, the situation is akin to generating DT samples
from a common Dirichlet process prior. Hence, the expected number of unique samples
is O(ln(DT )). Similar to Theorem 2.4.1, we speculate that the O(ln2 (DT )) factor can be
improved to O(ln(DT )). For error bounds of truncation-based approximations of hierarchical
processes, such as the HDP, we refer to Lijoi et al. [130, Theorem 1].

2.5 Conceptual benefits of finite approximations


Though approximation error lends itself more readily to analysis, ease-of-use considerations
are often at the forefront of users’ choice of finite approximation in practice. Therefore, we
next compare AIFAs to TFAs in this dimension. We see that AIFAs offer more straightforward
updates in approximate inference algorithms and easier implementation of parallelism.
To reduce notation in this section, we let a term without subscripts represent the collection of
all subscripted terms: ρ := (ρk )Kk=1 denotes the collection of atom sizes, ψ := (ψk )k=1 denotes
K

the collection of atom locations, x := (xn,k )K,N


k=1,n=1 denotes the latent trait counts of each
observation, and y := (yn )n=1 denotes the observed data. We use a dot to collect terms
14 N

across the corresponding subscript: x.,k := (xn,k )Nn=1 denotes trait counts across observations
of the k-th trait. We next consider algorithms to approximate the posterior distribution
P(ρ, ψ, x | y) of the finite approximation.
Gibbs sampling. When all latent parameters are continuous, Hamiltonian Monte Carlo
methods are increasingly standard for performing Markov chain Monte Carlo (MCMC) poste-
rior approximation [40, 84]. However, due to the discreteness of the trait counts x, successful
MCMC algorithms for CRMs or their approximations have been based largely on Gibbs sam-
pling [66]. In particular, blocked Gibbs sampling utilizing the natural Markov blanket structure
is straightforward to implement when the complete conditionals P(ρ | x, ψ, y), P(x | ψ, ρ, y),
and P(ψ | x, ρ, y) are easy to simulate from.15
14
The usage of x in this section is different from the usage in the remaining sections: in eq. (2.6), x is a
single observation from the likelihood process.
15
QN
Because of the factorization P(x | ψ, ρ, y) = n=1 P(xn,. | ψ, ρ, yn ), Gibbs sampling over the finite ap-
proximation can be an appealing technique even when Gibbs sampling over the marginal process is not.
In particular, the wall-time of a Gibbs iteration for the finite approximation can be small by drawing
P(xn,. | ψ, ρ, yn ) in parallel. Meanwhile, any iteration to update the trait counts with the marginal process
representation needs to sequentially process the data points, prohibiting speed up through parallelism.

82
Different finite approximations with the same number of atoms K change only P(ρ) in
the generative model. So, of the conditionals, we expect only P(ρ | x, ψ, y) to differ across
finite approximations. We next show in Proposition 2.5.1 that the form of P(ρ | x, ψ, y) is
particularly tractable for AIFAs. Then we will discuss how Gibbs derivations are substantially
more involved for TFAs.

Proposition 2.5.1 (Conditional conjugacy of AIFA). Suppose the likelihood is an exponential


family (eq. (2.6)) and the AIFA prior νK is as in Corollary 2.3.3. Then the complete
conditional of the atom sizes factorizes across atoms as:
K
Y
P(ρ | x, ψ, y) = P(ρk | x.,k ).
k=1

Furthermore, each P(ρk | x.,k ) is in the same exponential family as the AIFA prior, with
density proportional to
N
!
PN X
1{ρ ∈ U }ρc/K+ n=1 ϕ(xn,k )−1 exp ⟨ψ + t(xn,k ), µ(ρ)⟩ + (λ + N )[−A(ρ)] . (2.21)
n=1

See section 2.J.2 for the proof of Proposition 2.5.1. For common models — such as beta–
Bernoulli, gamma–Poisson, and beta–negative binomial — we see that the complete condi-
tionals over AIFA atom sizes are in forms that are well known and easy to simulate.
There are many different types of TFAs, but typical TFA Gibbs updates pose additional
challenges. Even when P(ρ) is easy to sample from, P(ρ | x) can be intractable, as we see in
the following example.

Example 2.5.1 (Stick-breaking approximation [28, 156]). Consider the TFA for BP(γ, α, 0)
given by
XK XCi i−1
Y
(i) (l)
ΘK = Vi,j (1 − Vi,j )δψij ,
i=1 j=1 l=1

i.i.d. (l) i.i.d. i.i.d.


where Ci ∼ Poisson(γ), Vi,j ∼ Beta(1, α) and ψi,j ∼ H. One can sample the atom sizes
(i) Q (l)
Vi,j i−1l=1 (1 − Vi,j ). But there is no tractable way to sample from the conditional distribution
P(ρ | x) because of the dependence on Ci as well as the entangled form of each ρ. Strategies to
make sampling more tractable include introducing auxiliary round indicator variables rk and
marginalizing out the stick-breaking proportions [28]. However, the final model still contains
one Gibbs conditional that is difficult to sample from [28, Equation 37].

Other superposition-based approximations, like decoupled Bondesson or power-law [34],


present similar challenges due to the number of atoms per round variables Ci and the
dependence among the atom sizes.
Mean-field variational inference (MFVI). Analogous to Hamiltonian Monte Carlo for
MCMC, black-box variational methods are increasingly used for variational inference when
the latent parameters are continuous [17, 32, 108, 116, 175, 179]. Mean-field coordinate

83
ascent updates [205, Section 6.3] remain popular for cases with discrete variables, including
the present trait counts x.16
MFVI posits a factorized distribution q to approximate the exact posterior. In our case, we
approximate P(ρ, ψ, x | y) with q(ρ, ψ, x) = qρ (ρ)qψ (ψ)qx (x). We focus on qρ (ρ). For fixed
qψ (ψ) and qx (x), the optimal qρ∗ minimizes the (reverse) Kullback-Leibler divergence between
the posterior and qρ∗ qψ qx :

qρ∗ := argmin KL (qρ (·)qψ (·)qx (·) || P(·, ·, · | y)) . (2.22)


Our next result shows that qρ∗ takes a convenient form when using AIFAs.

Corollary 2.5.2 (AIFA optimal distribution is in exponential family). Suppose the likelihood
is an exponential family (eq. (2.6)) and the AIFA prior νK is as in Corollary 2.3.3. Then,
the density of qρ∗ is given by
Y
qρ∗ (ρ) = pek (ρk ), (2.23)
k

where each pek has density at ρk proportional to


 P   
ψ + n Exn,k ∼qx t(xn,k ) µ(ρk )
P
c/K+ n Exn,k ∼qx ϕ(xn,k )−1
1{ρk ∈ U }ρk exp , (2.24)
λ+N −A(ρk )

where xn,k ∼ qx denotes the marginal distribution of xn,k under qx (x).

That is, when using the AIFA, the optimal qρ∗ factorizes across the K atoms, and each
distribution is in the conjugate exponential family for the likelihood ℓ(xn,k | ρk ). Typically
users will report summary statistics like means or variances of the variational approximations
qρ∗ . These are typically straightforward from the exponential family form.
The TFA case is much more complex and requires both more steps in the inference scheme
as well as additional approximations. See section 2.J for two illustrative examples.
Parallelization. We end with a brief discussion on parallelization. In both Proposition 2.5.1
and Corollary 2.5.2, the update distribution for ρ factorizes across the K atoms. Hence,
AIFA updates can be done in parallel across atoms, yielding speed-ups in wall-clock time,
with the gains being greatest when there are many instantiated atoms. For TFAs, due to the
complicating coupling among the atom rates, there is no such benefit from parallelization.

2.6 Empirical evaluation


In our experiments, we compare our AIFA constructions to TFAs and to other IFA con-
structions [124, 125] on a variety of synthetic and real-data examples. Even though our
theory suggests better performance of TFAs than AIFAs for worst-case likelihoods, we find
comparable performance of TFAs and AIFAs in predictive tasks (sections 2.6.1 and 2.6.2).
16
When discrete latent variables are present, black-box variational methods typically utilize enumeration
strategies to marginalize out the discrete variables. There exists a tradeoff between user time and wall time.
The user time is small since there is no need to derive update equations, but the wall time can be large
depending on the enumeration strategy.

84
Likewise, we find comparable performance of AIFAs and alternative IFAs in predictive tasks
(section 2.6.3). However, we find that AIFAs can be used to learn model hyperparameters
where alternative IFA approximations fail (section 2.6.4). And we show that AIFAs can be
used to learn model hyperparameters for new models, not previously explored in the BNP
literature (section 2.6.5).
In relation to prior studies, existing empirical work has compared IFAs and TFAs only for
simpler models and smaller data sets (e.g., Doshi-Velez et al. [55, Table 1,2] and Kurihara
et al. [117, Figure 4]). Our comparison is grounded in models with more levels and analyzes
datasets of much larger sizes. For instance, in our topic modeling application, we analyze
nearly 1 million documents, while the comparison in Kurihara et al. [117] utilizes only 200
synthetic data points.

2.6.1 Image denoising with the beta–Bernoulli process


Our first experiments show comparable performance of the AIFA and TFA at an image
denoising task with a CRM-based target model. We use MCMC for image denoising through
dictionary learning because it is an application where finite approximations of BNP models
— in particular the beta–Bernoulli process with d = 0 — have proven useful [210]. The
observation likelihood in this dictionary learning model is not one of the worst cases in
section 2.4.1. We find that the performance of AIFAs and TFAs is comparable across K, and
the posterior modes across TFA and AIFA models are similar to each other.
The goal of image denoising is to recover the original, noiseless image (e.g., fig. 2.6.1a) from
a corrupted one (e.g., fig. 2.6.1b). The input image is first decomposed into small contiguous
patches. The model assumes that each patch is a combination of latent basis elements. By
estimating the coefficients expressing the combination, one can denoise the individual patches
and ultimately the overall image. The beta–Bernoulli process allows simultaneous estimation
of both basis elements and basis assignments. The number of extracted patches depends
on both the patch size and the input image size. So even on the same input image, the
analysis might process a varying number of “observations.” The nonparametric nature of
the beta–Bernoulli process sidesteps the cumbersome problem of calibrating the number of
basis elements for these different data set sizes, which can be large even for a relatively small
image; for a 256 × 256 image like fig. 2.6.1b, the number of extracted patches, N , is about
60,000. We quantify denoising quality by computing the peak signal-to-noise ratio (PNSR)
between the original and the denoised image [86]. The higher the PNSR, the more similar
the images.
We use Gibbs sampling to approximate the posterior distributions. To ensure stability and
accuracy of the sampler, patches (i.e., observations) are gradually introduced in epochs,
and the sampler modifies only the latent variables of the current epoch’s observations. See
section 2.K.1 for more details about the finite approximations, the hyperparameter settings,
and the inference algorithm.
figs. 2.6.1c and 2.6.1d visually summarize the results of posterior inference for a particular
image. We report experiments with other images in section 2.L.1. Our results across all
images indicate that the AIFA and TFA perform similarly, and both approximations perform
much better than the baseline (i.e., the noisy input image). fig. 2.6.2 quantitatively confirms
these qualitative findings; fig. 2.6.2a shows that, for approximation levels we considered, the

85
(a) Original (b) Input, 24.64 dB (c) AIFA, 33.81 dB (d) TFA, 34.03 dB

Figure 2.6.1: AIFA and TFA denoised images have comparable quality. (a) The noiseless
image. (b) The corrupted image. (c,d) Sample denoised images from finite models with
K = 60. We report PSNR (in dB) with respect to the noiseless image.

PSNR between either the TFA or AIFA output image and the original image are always very
similar and substantially higher (between 30 and 35) than the PSNR between the original
and corrupted image (below 30). In fact, each TFA denoised image is more similar to the
AIFA denoised image than to the original image; the PSNR between the TFA and AIFA
outputs is about 50. We also see from fig. 2.6.2a that the quality of denoised images improves
with increasing K. The improvement with K is largest for small K, and plateaus for larger
values of K.
In addition to randomly initializing the latent variables at the beginning of the Gibbs sampler
of one model (“cold start”), we can use the last configuration of latent variables visited in
the other model as the initial state of the Gibbs sampler (“warm start”). In fig. 2.6.2b, the
warm-start curve uses the output of inference with the AIFA as an initial value for inference
with the TFA; similarly, the warm-start curve of fig. 2.6.2c uses the output with the TFA to
initialize inference with the AIFA. For both approximations, K = 60. At the end of training,
all latent variables for all patches have been assigned, so for the warm start experiment,
we make all patches available from the start instead of gradually introducing patches. For
both approximations, the Gibbs sampler initialized at the warm start visits candidate images
that essentially have the same PSNR as the starting configuration; the PSNR values never
deviate from the initial PSNR by more than 1%. The early iterates of the cold-start Gibbs
sampler are noticeably lower in quality compared to the warm-start iterates, and the quality
at the plateau is still lower than that of the warm start.17 Each PSNR trace corresponds to a
different set of initial values and simulation of the conditionals. The variation across the 5
warm-start trials is small; the variation across the 5 cold-start trials is larger but still quite
small. In all, the modes of TFA posterior are good initializations for inference with the AIFA
model, and vice versa.
17
Because the warm start represents the end of the training from the cold start with gradually introduced
patches, the gap in final PSNR is due to the gradual patch introduction.

86
(a) Performance across K (b) TFA training (c) AIFA training

Figure 2.6.2: (a) Peak signal-to-noise ratio (PNSR) as a function of approximation level
K. Error bars depict 1-standard-deviation ranges across 5 trials. (b,c) How PSNR evolves
during inferenceacross 10 trials, with 5 each starting from respectively cold or warm starts.

2.6.2 Topic modelling with the modified hierarchical Dirichlet pro-


cess
We next compare the performance of normalized AIFAs (namely, FSDK ) and TFAs (namely,
TSBK ) in a DP-based model with additional hierarchy: the modified HDP from section 2.4.2.
As in section 2.6.1, we find that the approximations perform similarly.
We use the modified HDP for topic modeling. We apply stochastic variational inference with
mean-field factorization [85] to approximate the posterior over the latent topics. The training
corpus consists of nearly one million documents from Wikipedia. We measure the quality
of inferred topics via predictive log-likelihood on a set of 10,000 held-out documents. See
section 2.K.2 for complete experimental details.
fig. 2.6.3a shows that, as expected, the quality of the inferred topics improves as the approxi-
mation level grows. For a given approximation level, the quality of the topics learned using
the TFA and the normalized AIFA are almost the same.
The warm start in this case corresponds to using variational parameters at the end of the
other model’s training. fig. 2.6.3b uses the outputs of inference with the normalized AIFA
approximation as initial values for inference with the normalized TFA; similarly fig. 2.6.3c uses
the TFA to initialize inference with the AIFA. We fix the number of topics to K = 300 and
run 5 trials each with the cold start and warm start, respectively. For both approximations,
the test log-likelihood stays nearly the same for warm-start training iterates; the test log-
likelihood for the iterates never deviate more than 0.5% from the initial value. The early
iterates after the cold start are noticeably lower in quality compared to the warm iterates;
however at the end of training, the test log-likelihoods are nearly the same. Each trace
corresponds to a different set of initial values and ordering of data batches processed. The
variation across either cold starts or warm starts is small. So, in sum, the modes of the TFA
posterior are good initializations for inference with the AIFA model, and vice versa.

87
(a) Performance across K (b) TFA training (c) AIFA training

Figure 2.6.3: (a) Test log-likelihood (testLL) as a function of approximation level K. Error
bars show 1 standard deviation across 5 trials. (b,c) TestLL change during inference.

2.6.3 Comparing predictions across independent finite approxima-


tions
We next show that AIFAs have comparable predictive performance with other IFAs, namely
the BFRY IFA and GenPar IFA. We consider a linear–Gaussian factor analysis model with
the power-law beta–Bernoulli process [78], where the AIFA, BFRY IFA, or GenPar IFA can
be used directly.
Recall that the BFRY IFA applies only when the concentration hyperparameter is zero, and
the GenPar IFA applies only when the concentration parameter is positive. We consider
it a strength of the AIFA that it applies to both cases (and the negative range of the
concentration hyperparameter) simultaneously. Nonetheless, we here generate two separate
synthetic datasets: one to compare the BFRY IFA with the AIFA and one to compare the
GenPar IFA with the AIFA. In each case, we generate 2,000 data points from the full CRM
model with a discount of d = 0.6. We use 1,500 for training and report predictive log-likelihood
on the 500 held-out data points. For posterior approximation, we use automatic differentiation
variational inference as implemented in Pyro [17]. To isolate the effect of the approximation
type, we use “ideal” initialization conditions: we initialize the variational parameters using the
latent features, assignments, and variances that generated the training set. See section 2.K.3
for more details about the BRFY IFA, GenPar IFA, and the approximate inference scheme.
fig. 2.6.4a shows that across approximation levels K, the predictive performances of the AIFA
and BFRY IFA are similar. Likewise, fig. 2.6.4b shows that the predictive performance of the
AIFA and GenPar IFA are similar.

2.6.4 Discount estimation


We next show that AIFAs can reliably recover the beta process discount hyperparameter d,
which governs the power law growth in the number of features. By contrast, we show that
the BFRY IFA or GenPar IFA struggle at this task. In section 2.L.3, we show that the AIFA
can also reliably estimate the mass and concentration hyperparameters.

88
(a) BFRY IFA versus AIFA (b) GenPar IFA vs AIFA

Figure 2.6.4: (a) The left panel shows the average predictive log-likelihood of the AIFA
(blue) and BFRY IFA (red) as a function of the approximation level K; the average is across
10 trials with different random seeds for the stochastic optimizer. The right panel shows
highest predictive log-likelihood across the same 10 trials. (b) The panels are analogous to
(a), except the GenPar IFA is in red.

We generate a synthetic dataset so that the ground truth hyperparameter values are known.
The data takes the form of a binary matrix X, with N rows and K̃ columns. We generate
X from an Indian buffet process prior; recall that the Indian buffet process is the marginal
process of a beta process CRM paired with Bernoulli likelihood. To learn the hyperparameter
values with an AIFA, we maximize the marginal likelihood of the observed matrix X implied
by the AIFA. In particular, we compute the marginal likelihood by integrating the Bernoulli
likelihood P(xn,k | θk ) over θk distributed as the K-atom AIFA νK . To quantify the variability
of the estimation procedure, we generate 50 feature matrices and compute the maximum
likelihood estimate for each of these 50 trials. See section 2.K.4 for more experimental details.
fig. 2.6.5a shows that we can use an AIFA to estimate the underlying discount for a variety
of ground-truth discounts. Since the estimates and error bars are similar whether we use the
AIFA (left) or full nonparametric process (right), we conclude that using the AIFA yields
comparable inference to using the full process.
In theory, the marginal likelihood of the BFRY IFA can also be used to estimate the discount,
but in practice we find that this approach is not straightforward and can yield unreliable
estimates. At the time of writing, such an experiment had not yet been attempted; Lee et al.
[124] focus on clustering models and do not discuss strategies to estimate any hyperparameter
in a feature allocation model with a BFRY IFA. We are not aware of a closed-form formula
for the marginal likelihood. Default schemes to numerically integrate P(0 | θk ) against the
BFRY prior for θk fail because of overflow issues. (KΓ(d)d/γ)1/d is typically very large,
especially for small d. Due to finite precision, 1 − exp −(Kd/γ)1/d 1−θ θ
evaluates to 1 on


the quadrature grid used by numerical integrators [160]. In this case, eq. (2.4) behaves as
θ−d−1 near 0, and thus the integral over θ diverges. To create the left panel of fig. 2.6.5b,
we view the marginal likelihood as an expectation and construct Monte Carlo estimates; we
draw 105 BFRY samples to estimate the marginal likelihood, and we take the estimate’s
logarithm as an approximation to the log marginal likelihood (red line). To quantify the

89
(a) Maximum likelihood estimates (b) Log negative log marginal likelihood

Figure 2.6.5: (a) We estimate the discount by maximizing the marginal likelihood of the
AIFA (left) or the full process (right). The solid blue line is the median of the estimated
discounts, while the lower and upper bounds of the error bars are the 20% and 80% quantiles.
The black dashed line is the ideal value of the estimated discount, equal to the ground-truth
discount. (b) In each panel, the solid red line is the average log of negative log marginal
likelihood (LNLML) across batches. The light red region depicts two standard errors in either
direction from the mean.

uncertainty, we draw 100 batches of 105 samples (light red region). Even for this large
number of Monte Carlo samples, the estimated log marginal likelihood curve is too noisy to
be useful for hyperparameter estimation. By comparison, we can compute the log marginal
likelihood analytically for the IBP (dashed black line); it is much smoother and features a
clear minimum. Moreover, we can compute the AIFA log marginal likelihood via numerical
integration (solid blue line); it is also very smooth and features a clear minimum.
We again consider the BFRY IFA and GenPar IFA separately and generate separate simulated
data for each case due to their disjoint assumptions; we generate date with concentration α = 0
for the BFRY IFA and with α > 0 for the GenPar IFA. An experiment to recover a discount
hyperparameter with the GenPar IFA, analogous to the experiment above with the BFRY
IFA, has also not previously been attempted. There is no analytical formula for the GenPar
IFA marginal likelihood, and we again encounter overflow when trying numerical integration.
Therefore, we resort to Monte Carlo; we find that estimates of the log marginal likelihood are
too noisy for practical use in recovering the discount (the right panel of Figure 2.6.5b).

2.6.5 Dispersion estimation


Finally, we show that the AIFA can straightforwardly be adapted to estimate hyperparameters
in other BNP processes, not just the beta process. In particular we show that AIFAs can
be used to learn the dispersion parameter τ in the novel Xgamma–CMP process that we
introduced in Example 2.3.4. We consider a well-known application of BNP trait-allocation
models to matrix-factorization–based topic modeling [180]. The observed data is a count
matrix X, with N rows, representing documents, and V columns, representing vocabulary
words. We adjust the model of Roychowdhury and Kulis [180] to use the Xgamma–CMP

90
Figure 2.6.6: Blue histograms show posterior density estimates for τ from MCMC draws.
The ground-truth τ (solid red line) is 0.7 in the overdispersed case (upper row) and 1.5 in
the underdispersed case (lower row). The threshold τ = 1 (dashed black line) marks the
transition from overdispersion (τ < 1.0) to underdispersion (τ > 1.0). The percentile in each
panel’s title is the percentile where the ground truth τ falls in the posterior draws. The
approximation size K of the AIFA increases in the plots from left to right.

process of Example 2.3.4 instead of a gamma–Poisson process. The added flexibility of τ


allows modeling trait count distributions that are over- or under-dispersed, which cannot be
done with the gamma-Poisson process.
To have a notion of ground truth, we generate synthetic data (with N = 600) from a large
AIFA (with K = 500) of the Xgamma–CMP process, which is a good approximation of the
BNP limit.18 In each set of experiments, the data are overdispersed (τ < 1) or underdispersed
(τ > 1). In this case, we take a Bayesian approach to estimating τ , and put a uniform prior on
τ ∈ (0, 100] since τ must be strictly positive. For smaller values of K (K = 50 to K = 150),
we approximate the posterior for the K-atom AIFA using Gibbs sampling. See section 2.K.5
for more details about the experimental setup.
fig. 2.6.6 shows that the posterior approximation agrees with the ground truth on the
dispersion type (over or under) in each case. We also see from the figures that the 95%
credible intervals contain the ground-truth τ value in each case.
18
For the chosen number of documents N , let the number of traits with positive count be K.
b There is no
noticeable difference in the distribution of K between K = 500 and K > 500. The rates of the inactive (zero
b
count) traits are smaller than 1/N .

91
2.7 Discussion
We have provided a general construction of automated independent finite approximations
(AIFAs) for completely random measures and their normalizations. Our construction provides
novel finite approximations not previously seen in the literature. For processes without
power-law behavior, we provide approximation error bounds; our bounds show that we can
ensure accurate approximation by setting the number of atoms K to be (1) logarithmic in the
number of observations N and (2) inverse to the error tolerance ϵ. We have discussed how the
independence and automatic construction of AIFA atom sizes lead to convenient inference
schemes. A natural competitor for AIFAs is a truncated finite approximation (TFA). We show
that, for the worst case choice of observational likelihood and the same K, AIFAs can incur
larger error than the corresponding TFAs. However, in our experiments, we find that the two
methods have essentially the same performance in practice. Meanwhile, AIFAs are overall
easier to work with than TFAs, whose coupled atoms complicate the development of inference
schemes. Future work might extend our error bound analysis to conjugate exponential family
CRMs with power-law behavior. An obstacle to upper bounds for the positive-discount case is
the verification of the clauses in Condition 2.4.1. In the positive-discount case, the functions
h and Mn,x , which describe the marginal representation of the nonparametric process, take
forms that are straightforwardly amenable to analysis. But the function e h, which describes
the finite approximations, is complex. In general, h is equal to the ratio of two normalization
e
constants of different AIFAs. The normalization constants can be computed numerically.
However, to make theoretical statements such as the clauses in Condition 2.4.1, we need to
prove their smoothness properties. Another direction is to tighten the error upper bound by
focusing on specific, commonly-used observational likelihoods — in contrast to the worst-case
analysis we provide here. Finally, more work is required to directly compare the size of error
in the finite approximation to the size of error due to approximate inference algorithms such
as Markov chain Monte Carlo or variational inference.

92
Appendix

2.A Additional examples of AIFA construction


Let B(α, β) = Γ(α)Γ(β)
Γ(α+β)
denote the beta function.
Example 2.A.1 (Beta prime process). Taking V = R+ , g(θ) = (1 + θ)−1 , h(θ; η) = (1 + θ)−η ,
and Z(ξ, η) = B(ξ, η) in Theorem 2.3.1 yields the beta prime process of Broderick et al. [27],
which has rate measure
γ
ν(dθ) = θ−1−d (1 + θ)−d−η dθ.
B(η, 1 − d)
Since g is continuous, g(0) = 1, 1 ≤ g(θ) ≤ 1 + θ, and h(θ; η) is continuous and bounded on
[0, 1], Assumption 2.3.1 holds.
In the case of d = 0, the corresponding exponential family distribution is beta prime. With
two placeholder parameters α and β, the beta prime density at θ > 0 is
θα−1 (1 + θ)−α−β
Beta′ (θ; α, β) = .
B (α, β)
To construct AIFA using Corollary 2.3.3, we set c = γη and
νK (θ) = Beta′ (θ; γη/K, η).
Example 2.A.2 (Generalized gamma process). Taking V = R+ , g(θ) = 1, h(θ; λ) = e−λθ ,
and Z(ξ, λ) = Γ(ξ)λ−ξ in Theorem 2.3.1 yields the generalized gamma process, with rate
measure
λ1−d −d−1 −λθ
ν(dθ) = γ θ e dθ.
Γ(1 − d)
Since h(θ; η) is continuous and bounded on [0, 1], Assumption 2.3.1 holds.
In the case of d = 0 i.e. the gamma process, the corresponding exponential family distribution
is gamma. To construct AIFA using Corollary 2.3.3, we set c = γλ and
νK (θ) = Gamma(θ; γλ/K, λ).
Example 2.A.3 (PG(α,ζ)-generalized gamma process). Taking V = R2+ , g(θ) = 1, h(θ; η) =
e−(η1 θ) 2 , and Z(ξ, η) = Γ(ξ/η2 )(η1 η2 )−ξ in Theorem 2.3.1 yields the PG(α,ζ)-generalized
η

gamma process whose rate measure is


γ(η1 η2 )1−d −d−1 −(η1 θ)η2
ν(dθ) = θ e dθ.
Γ((1 − d)/η2 )

93
Since h(θ; η) is continuous and bounded on [0, 1], Assumption 2.3.1 holds.
γ(η1 η2 )1−d
In the positive discount case, let c = Γ((1−d)/η 2)
, and the finite-dimensional distribution has
density equalling
1 c/K−1−dS1/K (1−1/K) −(η1 θ)η2
θ e dθ,
ZK
R∞
where ZK := 0 θc/K−1−dS1/K (1−1/K) e−(η1 θ) 2 dθ.
η

In the case of d = 0, the corresponding exponential family distribution is generalized gamma.


With three placeholder parameters ξ ′ , η1′ , and η2′ , the generalized gamma density at θ > 0 is

′ ′ ′ η2
′ (η ′ /ξ ′η1 )θη1 −1 e−(θ/ξ )
GenGamma(θ; ξ , η1′ , η2′ ) = 2
Γ(η1′ /η2′ )

To construct AIFA using Corollary 2.3.3, we set c = γη1 η2


Γ(η2−1 )
and
 
1 γη1 η2
νK (θ) = GenGamma θ; , , η2 .
η1 KΓ(η2−1 )

Example 2.A.4 (Extended gamma process). RTaking V = (0, ∞) × (1, ∞), g(θ) = 1,

h(θ; η) = Zτ−c (θ), U = [0, T ], and Z(ξ, η) = 0 θξ−1 Zτ−c (θ)dθ in Theorem 2.3.1 yields
the extended gamma process from eq. (2.11). Since g(θ) = 1, the second condition in
Assumption 2.3.1 holds. For any τ and c, Zτ−c (θ) is continuous and bounded on [0, 1], so
the third condition in Assumption 2.3.1 holds. As for the first condition, we note that
Zτ−c (θ) ≤ (1 + θ)−c , since the minimum of Zτ (θ) with respect to τ is 1 + θ, attained at τ = ∞.
Therefore, Z(ξ, η) is finite if
Z T
θξ−1 (1 + θ)−c dθ
0

is finite. Since (1 + θ)
−c
≤ 1, the last integral is at most
T

Z
ξ−1
θ dθ = ,
0 ξ
which is finite. Hence, all three conditions of Assumption 2.3.1 hold, and we can apply
Corollary 2.3.3. The AIFA is
1 γ/K−1 −c
νK (θ) = θ Zτ (θ)1{0 ≤ θ ≤ T }dθ,
ZK
RT
where ZK is the normalization constant ZK = 0 θγ/K−1 Zτ−c (θ)dθ. More generally, for
γ, c, τ > 0 and T ≥ 1, we use the notation XGamma(γ, c, τ, T ) to denote the real-valued
distribution with density at θ equal to:

θγ−1 Zτ−c (θ)1{0 ≤ θ ≤ T }


XGamma(θ; γ, c, τ, T ) := RT . (2.25)
θ γ−1 Z −c (θ)dθ
0 τ

94
2.B Proofs of AIFA convergence
In this appendix, to highlight the fact that the i.i.d. distributions are different across K, we
use ρK,i to denote the i-th atom size in the approximation of level K i.e. the K-atom AIFA is
PK i.i.d. i.i.d.
ΘK := i=1 ρK,i δψK,i , ρK,i ∼ νK , ψK,i ∼ H.

2.B.1 AIFA converges to CRM in distribution


We first state a more general construction than Theorem 2.3.1, and proceed to prove that
result, as a proof of Theorem 2.3.1.
For the more general construction, we first generalize the Sb (θ) as in Theorem 2.3.1 with
so-called approximate indicators.

Definition 2.B.1. The parameterized function family {Sb }b∈R+ is composed of approximate
indicators if, for any b ∈ R+ , Sb (θ) is a real, non-decreasing function such that Sb (θ) = 0 for
θ ≤ 0 and Sb (θ) = 1 for θ ≥ b.

Valid examples of approximate indicators are the indicator function Sb (θ) = 1{θ > 0} and
the smoothed indicator function from Theorem 2.3.1. Some approximate indicators have a
point of discontinuity; e.g., Sb (θ) = 1{θ > 0}. But the smoothed indicator is both continuous
and differentiable; see section 2.B.2.

Theorem 2.B.2. Suppose Assumption 2.3.1 holds, and let {Sb }b∈R+ be a family of approxi-
mate indicators. Fix a > 0, and let (bK )K∈N be a decreasing sequence such that bK → 0. For
c := γh(0; η)/Z(1 − d, η), let
−1 −dS −1 ) −1 −d −1
νK (dθ) := θ−1+cK bK (θ−aK
g(θ)cK h(θ; η)ZK dθ

be a family of probability densities, where ZK is chosen such that νK (dθ) = 1. If ΘK ∼


R
D
IFAK (H, νK ), then ΘK → Θ as K → ∞.

Theorem 2.B.2 recovers Theorem 2.3.1 by setting Sb equaling the smoothed indicator, a = 1,
and bK = 1/K. See section 2.L.2 for discussions on the impact of the tuning hyperaparameters
on the performance of our IFA.
In order to prove Theorem 2.B.2 , we require a few auxiliary results.

Lemma 2.B.3 (Kallenberg [104, Lemma 12.1, Lemma 12.2 and Theorem 16.16]). Let Θ be
a random measure and Θ1 , Θ2 , . . . a sequence of random measures. If for all measurable sets
A and t > 0,
lim E[e−tΘK (A) ] = E[e−tΘ(A) ],
K→∞
D
then ΘK → Θ.

For a density f , let µ(t, f ) : θ 7→ (1 − e−tθ )f (θ). In results that follow we assume all measures
on R+ have densities with respect to Lebesgue measure. We abuse notation and use the same
symbol to denote the measure and the density.

95
Proposition 2.B.4. Let Θ ∼ CRM(H, ν) and for K = 1, 2, . . . , let ΘK ∼ IFAK (H, νK )
where ν is a measure and ν1 , ν2 , . . . are probability measures on R+ , all absolutely continuous
D
with respect to Lebesgue measure. If ∥µ(1, nνK ) − µ(1, ν)∥1 → 0, then ΘK → Θ.

Proof. Let t > 0 and A a measurable set. First, recall that the Laplace functional of the
CRM Θ is  Z ∞ 
−tΘ(A)
E[e ] = exp −H(A) µ(t, ν)(θ) dθ .
0

We have

E[e−tρK,1 1(ψK,1 ∈A) ] = P(ψK,1 ∈ A)E[e−tρK,1 ] + P(ψK,1 ∈ / A)


−tρK,1
= H(A)E[e ] + 1 − H(A)
= 1 − H(A)(1 − E[e−tρK,1 ])
H(A) ∞
Z
=1− µ(t, KνK )(θ) dθ.
K 0

−tθ
Since |1−e
|1−e−θ |
|
≤ max(1, t), it follows by hypothesis that ∥µ(t, KνK ) − µ(t, ν)∥1 → 0. Thus, by
dominated convergence and the standard exponential limit,
K
H(A) ∞
 Z
−tρK,1 1(ψK,1 ∈A) K
lim E[e ] = lim 1 − µ(t, KνK )(θ) dθ
K→∞ K→∞ K 0
 Z ∞ 
= exp − lim H(A) µ(t, KνK )(θ) dθ
K→∞ 0
 Z ∞ 
= exp −H(A) µ(t, ν)(θ) dθ .
0

Finally, by the independence of the random variables {θK,i }K


i=1 and {ψK,i }i=1 ,
K

lim E[e−tΘK (A) ] = lim E[e−tρK,1 1(ψK,1 ∈A) ]K ,


K→∞ K→∞

so the result follows from Lemma 2.B.3.

Lemma 2.B.5. If there exist measures π(θ) dθ and π ′ (θ) dθ on R+ such that for some κ > 0
and c, c′ ,

1. the measures µ, µ1 , µ2 , . . . have densities f, f1 , f2 , . . . with respect to π and densities


f ′ , f1′ , f2′ , . . . with respect to π ′ ,
Rκ K→∞
2. 0 |f ′ (θ) − fK′ (θ)|dθ −−−→ 0,
K→∞
3. supθ∈[κ,∞) |f (θ) − fK (θ)| −−−→ 0,

4. supθ∈[0,κ] π ′ (θ) ≤ c′ < ∞, and


R∞
5. κ π(θ) dθ ≤ c < ∞,

96
then
K→∞
∥µ − µK ∥1 −−−→ 0.

Proof. We have, using the assumptions and Hölder’s inequality,


Z κ Z ∞
′ ′ ′
∥µ − µK ∥1 = |f (θ) − fK (θ)|π (dθ) + |f (θ) − fK (θ)|π(dθ)
0 κ
!Z
κ
≤ sup π ′ (θ) |f ′ (θ) − fK′ (θ)|dθ
θ∈[0,κ] 0
!Z

+ sup |f (θ) − fK (θ)| π(dθ)
θ∈[κ,∞) κ
Z κ

≤c |f ′ (θ) − fK′ (θ)|dθ + c sup |f (θ) − fK (θ)|.
0 θ∈[κ,∞)

The conclusion follows by the assumptions.


Proof of Theorem 2.B.2. Note that since h is continuous and bounded on [0, ϵ], c as given in
the theorem statement is finite. We will apply Lemma 2.B.5 with κ = min(1, ϵ), µ = µ(1, ν),
µK = µ(1, nνK ),
θ−d g(θ)1−d h(θ; η)
π(θ) = ,
Z(1 − d, η)
and π ′ (θ)R := (θg(θ))d π(θ). Because of the finiteness of Z(ξ, η), item 5 of Lemma 2.B.5, which

asks for κ π(θ) dθ < ∞, is satisfied. Thus, f (θ) = γ(1 − e−θ )(θg(θ))−1 ,
−1 −1 +d−dS −1 ) −1
fK (θ) = nZK (1 − e−θ )θ−1+cK bK (θ−aK
g(θ)−1+cK ,

and f ′ (θ) = (θg(θ))−d f (θ), and fK′ (θ) = (θg(θ))−d fK (θ).


We now note a few useful properties that we will use repeatedly in the proof. Observe that
−1
(a/K)cK = 1 + o(1). The assumption that h is bounded and continuous implies that on
[0, a/K], h(θ; η) = h(0; η) + o(1). Similarly, for any δ > 0, g(θ) is bounded and continuous
for θ ∈ [0, δ] and therefore, together with the fact that g(0) = 1, we can conclude that on
[0, a/K], g(θ) = 1 + o(1).
For the remainder of the proof we will consider K large enough that aK −1 + 2bK and cK −1
are less than κ. The normalizing constant ZK can be written as
Z a/K
−1
ZK = (θg(θ))−1+cK π ′ (dθ)
0
Z κ
−1 −dS −1 ) −1
+ θ−1+cK bK (θ−aK
g(θ)−1+cK π ′ (dθ)
a/K
Z ∞
−1 −d
+ (θg(θ))−1+cK π ′ (dθ).
κ

97
We rewrite each term in turn. For the first term,
Z a/K Z a/K
−1+cK −1 −1+cK −1 ′ −1
θ g(θ) π (dθ) = (c/γ + o(1)) θ−1+cK dθ
0 0
K  a cK −1
= (c/γ + o(1))
c K
K
= + o(K).
γ
−1
Since κ ≤ 1 and SbK ∈ [0, 1], for θ ∈ [a/K, κ], θ−dSbK (θ−aK ) ≤ θ−d . Since g(0) = 1, c∗ ≤ 1
−1
and therefore g(θ)−1+cK ≤ c∗−1+c . Hence the second term is upper bounded by
Z κ
−1+c −1 K d K cK −1 −1
c∗ θ−1+cK −d π ′ (dθ) ≤ c−1
∗ (c/γ + O(1)) d
(κ − (a/K)cK )
a/K a c
= O(K d ) × O(ln K)
= o(K).

For the third term,


Z ∞ Z ∞
−1+cK −1 −d ′ −1
(θg(θ)) π (dθ) = (θg(θ))−1+cK π(dθ)
κ κ
Z ∞
−1+cK −1
≤ (κc∗ ) π(dθ)
κ
−1
≤ (κc∗ ) .

Hence, ZK = Kγ + o(K) and KZK


−1
= γ(1 + eK ), where eK = o(1).
Next, we have

sup |f (θ) − fK (θ)|


θ∈[κ,∞)
−1 −1
= sup (1 − e−θ )(θg(θ))−1 |γ − KZK (θg(θ))cK |
θ∈[κ,∞)
−1
≤ sup γ(θg(θ))−1 |1 − (1 + eK )(θg(θ))cK |
θ∈[κ,∞)
−1
≤ γ sup (θg(θ))−1 |1 − (θg(θ))cK |
θ∈[κ,∞)
−1 (2.26)
+ γeK sup (θg(θ))−1+cK .
θ∈[κ,∞)

To bound the two terms we will use the fact that if θ ≥ κ, then
θ κ
θg(θ) ≥ ≥ ∗ =: κ̃
c∗ (1 + θ) c (1 + κ)

and if θ ≤ 1 then θg(θ) ≤ c∗ ≤ 1. Hence, letting ψ := θg(θ), for the first term in eq. (2.26)

98
we have
−1
γ sup (θg(θ))−1 |1 − (θg(θ))cK |
θ∈[κ,∞)
−1
≤ γ sup ψ −1 |1 − ψ cK |
ψ∈[κ̃,∞)
−1 −1
≤ γ sup ψ −1 |1 − ψ cK | + γ sup ψ −1 |1 − ψ cK |
ψ∈[κ̃,1] ψ∈[1,∞)
 Kc−1
−1 K −c
cK −1 K
≤ γκ̃ sup |1 − ψ |+γ 1−
ψ∈[κ̃,1] K K −c
−1 c
≤ γκ̃−1 (1 − κ̃cK ) + O(1) ×
K −c
−1 −1
= γκ̃ × o(1) + O(K )
→ 0.

Similarly, for the second term in eq. (2.26) we have


−1 −1
γeK sup (θg(θ))−1+cK ≤ γeK sup ψ −1+cK
θ∈[κ,∞) ψ∈[κ̃,∞)
−1
≤ γκ̃ eK
→ 0.
−1 −1
Since g(θ) is bounded on [0, κ], g(θ)cK = 1 + o(1) and therefore (1 + eK )g(θ)cK = 1 + e′K ,
where e′K = o(1). Using this observation together with the bound (1 − e−θ )θ−1 ≤ 1, we have
Z κ Z κ
′ ′
|f (θ) − fK (θ)|dθ = (θg(θ))−d |f (θ) − fK (θ)|dθ
0 0
Z κ
−1 cK −1 +d−dSbK (θ−aK −1 ) −1
= (1 − e−θ )(θg(θ))−1−d |γ − KZK θ g(θ)cK |dθ
0
Z κ
∗ −1 −1
≤ γ[c (1 + κ)] 1+d
θ−d |1 − (1 + e′K )θcK +d−dSbK (θ−aK ) |dθ
0
Z κ Z κ
cK −1 +d−dSbK (θ−aK −1 ) −1 −1
≤γ −d
θ |1 − θ ′
|dθ + γeK θcK +d−dSbK (θ−aK ) dθ. (2.27)
0 0

We bound the first integral in eq. (2.27) in four parts: from 0 to aK −1 , from aK −1 to
aK −1 + bK , from aK −1 + bK to κ − bK , and from κ − bK to κ. The first part is equal to
Z aK −1 Z aK −1
−d d+cK −1 −1
θ |1 − θ |dθ ≤ θ−d + θcK dθ
0 0
aK −1
θ1−d K 1+cK −1
= + θ
1−d c+K 0
1 K −1
= (aK −1 )1−d + (aK −1 )1+cK
1−d c+K
→ 0.

99
The second part is equal to
Z aK −1 +bK Z aK −1 +bK
−d cK −1 +d−dSbK (θ−aK −1 ) −1 −d
θ |1 − θ |dθ ≤ θ−d + θcK dθ
aK −1 aK −1
Z aK −1 +bK
≤2 θ−d dθ
aK −1
KaK −1 +b
2 1−d
= θ
1−d aK −1
  a 1−d 
2 a 1−d
= ( + bK ) −
1−d K K
→ 0.
The third part is equal to
Z κ−bK Z κ−bK
−d cK −1 −1 −d
θ |1 − θ |dθ = θ−d − θcK dθ
aK −1 +bK aK −1 +bK
κ−bK
1 1−d K −1
= θ − θ1−d+cK
1−d c + K(1 − d) aK −1 +bK
1−d
(κ − bK ) K −1
= − (κ − bK )1−d+cK
1−d c + K(1 − d)
−1
(aK + bK )1−d K −1
− + (aK −1 + bK )1−d+cK
1−d c+K
→ 0.
The fourth part is equal to
Z κ Z κ
−d cK −1 −1 −d
θ |1 − θ |dθ ≤ θ−d + θcK dθ
κ−bK κ−bK
→0
using the same argument as the second part. The second integral in eq. (2.27) is upper
bounded by
Z κ Z κ
′ cK −1 −dSbK (θ−aK −1 ) ′ κ1−d
γeK θ dθ ≤ γeK θ−d dθ = γe′K = o(K).
0 0 1−d
Since supθ∈[0,κ] π ′ (θ) < ∞ by the boundedness of g and h and π is a probability density
by construction, conclude using Lemma 2.B.5 that ∥µ − µK ∥1 → 0. It then follows from
D
Lemma 2.B.3 that ΘK → Θ.

2.B.2 Differentiability of smoothed indicator


We show that (  
−1
exp 1−(θ−b)2 /b2 + 1 if θ ∈ (0, b)
Sb (θ) =
1[θ > 0] otherwise.

100
is differentiable over the whole real line. Since on the separate domains (−∞, 0), (0, b), and
(b, ∞), the derivative exists and is continuous, we only need to show that the values of the
derivative at θ = 0 and θ = b from either side match.
To start, we show that Sb (θ) is continuous at θ = 0 and θ = b.
 
1
lim Sb (θ) = exp 1 − = 1,
θ→b− 1−0
 
1
lim Sb (θ) = exp 1 − = 0.
θ→0+ ∞

For θ = b, the derivative from the right (θ →


− b+ ) is 0 since constant function. The derivative
on the interval (0, b) equals

dSb −1 2(θ − b)
= Sb (θ) . (2.28)
dθ [(θ − b)2 /b2 − 1]2 b2

The limit as we approach b from the left is 0 since limθ→b− Sb (θ) = 1 and the term (θ − b)
vanishes. So the one-sided derivative is continuous at θ = b.
For θ = 0, the derivative from the left (θ →
− 0− ) is 0 since also constant function. The limit
of eq. (2.28) as we approach 0 from the right is also 0. It suffices to show
−1
lim+ Sb (θ) = 0.
θ→0 [(θ − b)2 /b2 − 1]2

Reparametrizing x = 1
1−(θ−b)2 /b2
, we have that x → ∞ and θ → 0+ . The last limit becomes

exp(−x)
lim = 0,
x→∞ x2
which is true because the decay of the exponential function is faster than any polynomial.
The derivative defined over disjoint intervals are continuous at the boundary points, so the
overall approximate indicator is differentiable.

2.B.3 Normalized AIFA EPPF converges to NCRM EPPF


Proof of Theorem 2.3.4. First, we show that the total mass of AIFA converges in distribution
to the total mass of CRM. It suffices to consider K ≥ b so that the AIFA EPPF is non-zero
since we only care about the asymptotic behavior of pK (n1 , n2 , . . . , nb ). Through section 2.B.1,
we have shown that for all measurable sets A and t > 0, the Laplace functionals converge:

lim E[e−tΘK (A) ] = E[e−tΘ(A) ],


K→∞

By choosing A = Ψ i.e. the ground space, we have that ΘK (Ψ) is the total mass of AIFA
and Θ(Ψ) is the total mass of CRM
K
X ∞
X
ΘK (Ψ) = ρK,i , Θ(Ψ) = θi .
i=1 i=1

101
Since for any t > 0, the Laplace transform of ΘK (Ψ) converges to that of Θ(Ψ), we conclude
that ΘK (Ψ) converges to Θ(Ψ) in distribution [104, Theorem 5.3]:
K
D
X
ρK,i → Θ(Ψ). (2.29)
i=1

Second, we show that the decreasing order statistics of AIFA atom sizes converges (in finite-
dimensional distributions i.e., in f.d.d) to the decreasing order statistics of CRM atom sizes.
For each K, the decreasing order statistics of AIFA atoms is denoted by {ρK,(i) }K i=1 :

ρK,(1) ≥ ρK,(2) ≥ · · · ≥ ρK,(K) .

We will leverage Loeve [134, Theorem 4 and page 191] to find the limiting distribution
{ρK,(i) }K as K → ∞. It is easy to verify the conditions to use the theorem: because the
Pi=1
sums i=1 ρK,i converge in distribution to a limit, we know that all the ρK,i ’s are uniformly
K

asymptotically negligible [104, Lemma 15.13]. Now, we discuss what the limits are. It
is well-known that Θ(Ψ) is an infinitely divisible positive random variable with no drift
component and Levy measure exactly ν(dθ) [159]. In the terminology of Loeve [134, Equation
2], the characteristics of Θ(Ψ) are a = b = 0 (no drift or Gaussian parts), L(x) = 0, and

M (x) = −ν([x, ∞)).

Let I be a counting process in reverse over (0, ∞) defined based on the Poisson point process
i=1 in the following way. For any x, I(x) is the number of points θi exceeding the threshold
{θi }∞
x:
I(x) := |{i : θi ≥ x}|.
We augment I(0) = ∞ and I(∞) = 0. As a stochastic process, I has independent increments,
in that for all 0 = t0 < t1 < · · · < tk , the increments I(ti ) − I(ti−1 ) are independent,
furthermore the law of the increments is I(ti−1 ) − I(ti ) ∼ Poisson(M (ti ) − M (ti−1 )). These
properties are simple consequences of the counting measure induced by the Poisson point
process. According to Loeve [134, Page 191], the limiting distribution of {ρK,(i) }Ki=1 is governed
by I, in the sense that for any fixed t ∈ N, for any x1 , x2 , . . . , xt ∈ [0, ∞):
lim P(ρK,(1) < x1 , ρK,(2) < x2 , . . . , ρK,(t) < xt )
K→∞
(2.30)
= P(I(x1 ) < 1, I(x2 ) < 2, . . . , I(xt ) < t).
Because the θi ’s induce I, we can relate the left hand side to the order statistics of the Poisson
point process. We denote the decreasing order statistic of the {θi }∞ i=1 as:

θ(1) ≥ θ(2) ≥ · · · ≥ θ(n) ≥ · · ·

Clearly, for any t ∈ N, the event that I(x) exceeds t is the same as the top t jumps among
the {θi }∞
i=1 exceed x: I(x) ≥ t ⇐⇒ θ(t) ≥ x. Therefore eq. (2.30) can be rewritten as, for
any fixed t ∈ N, for any x1 , x2 , . . . , xt ∈ [0, ∞):

lim P(ρK,(1) < x1 , ρK,(2) < x2 , . . . , ρK,(t) < xt ) = P(θ(1) < x1 , θ(2) < x2 , . . . , θ(t) < xt ).
K→∞
(2.31)

102
It is well-known that convergence of the distribution function imply weak convergence — for
instance, see Pollard [166, Chapter III, Problem 1]. Actually, from Loeve [134, Theorem 5 and
page 194], for any fixed t ∈ N, the convergence in distribution of {ρK,(i) }ti=1 to {θi }ti=1 holds
P∞
jointly with the convergence of K to i=1 θi : the two conditions of the theorem,
P
i=1 ρK,(i)
which are continuity of the distribution function of each ρK,i and M (0) = −∞19 , are easily
verified. Therefore, by the continuous mapping theorem, if we define the normalized atom
sizes:
ρK,(s) θ(s)
pK,(s) := PK , p(s) := P∞ ,
i=1 ρK,i i=1 θi

we also have that the normalized decreasing order statistics converge:


f.d.d. ∞
(pK,i )K
i=1 → (pK,(i) )i=1 .

Finally we show that the EPPFs converge. In addition, if we define the size-biased permutation
(in the sense of Gnedin [75, Section 2]) of the normalized atom sizes:

pK,i } ∼ SBP(pK,(s) ), {e
{e pi } ∼ SBP(p(s) ),

then by Gnedin [75, Theorem 1], the finite-dimensional distributions of the size-biased
permutation also converges:
f.d.d.
pK,i )K
(e i=1 → (e pi )∞
i=1 . (2.32)
Pitman [162, Equation 45] gives the EPPF of Ξ = Θ/Θ(Ψ):
b b−1 i
!!
Y Y X
p(n1 , n2 , . . . , nb ) = E peni i −1 1− pej ,
i=1 i=1 j=1

Likewise, the EPPF of ΞK = ΘK /ΘK (Ψ) is:


b b−1 i
!!
Y Y X
i −1
pK (n1 , n2 , . . . , nt ) = E penK,i 1− peK,j .
i=1 i=1 j=1

Since b is fixed, and each pj is


 [0, 1]Pvalued, the mapping from the b-dimensional vector p to
the product i=1 pi is continuous and bounded. The choice of N , b,
Qb ni −1 Qb−1 i
i=1 1 − j=1 pj
ni have been fixed but arbitrary. Hence, the convergence in finite-dimensional distributions
of in eq. (2.32) imply that the EPPFs converge.

2.C Marginal processes of exponential CRMs


The marginal process characterization describes the probabilistic model not through the
i.i.d.
two-stage sampling Θ ∼ CRM(H, ν) and Xn | Θ ∼ LP(ℓ; Θ), but through the conditional
distributions Xn | Xn−1 , Xn−2 , . . . , X1 i.e. the underlying Θ has been marginalized out. This
19
There is a typo in Loeve [134].

103
perspective removes the need to infer a countably infinite set of target variables. In addition,
the exchangeability between X1 , X2 , . . . , XN i.e. the joint distribution’s invariance with
respect to ordering of observations [3], often enables the development of inference algorithms,
namely Gibbs samplers.
Broderick et al. [30, Corollary 6.2] derive the conditional distributions Xn | Xn−1 , Xn−2 , . . . , X1
for general exponential family CRMs eqs. (2.6) and (2.7).

Proposition 2.C.1 (Target’s marginal process [30, Corollary 6.2]). For any n, Xn | Xn−1 , . . . , X1
is a random measure with finite support.
K
1. Let {ζi }i=1
n−1
be the union of atom locations in X1 , X2 , . . . , Xn−1 . For 1 ≤ m ≤ n − 1, let
xm,j be the atom size of Xm at atom location ζj . Denote xn,i to be the atom size of Xn at
atom location ζi . The xn,i ’s are independent across i and the p.m.f. of xn,i at x is

h(x | x1:(n−1) ) =
 Pn−1 
Pn−1 t(xm,i ) + t(x)
Z −1 + m=1 ϕ(xm,i ) + ϕ(x), η + m=1
n
κ(x)  Pn−1  .
Pn−1 t(xm,i )
Z −1 + m=1 ϕ(xm,i ), η + m=1
n−1

2. For each x ∈ N, Xn has pn,x atoms whose atom size is exactly x. The locations of each
atom are iid H: as H is diffuse, they are disjoint from the existing union of atoms {ζi }K
i=1 .
n −1

pn,x is Poisson-distributed, independently across x, with mean:

Mn,x =
  
′ n−1 (n − 1)t(0) + t(x))
γ κ(0) κ(x)Z −1 + (n − 1)ϕ(0) + ϕ(x), η + .
n

In Proposition 2.C.2, we state a similar characterization of Zn | Zn−1 , Zn−2 , . . . , Z1 for the


finite-dimensional model in eq. (2.13) and give the proof.

Proposition 2.C.2 (Approximation’s marginal process). For any n, Zn | Zn−1 , . . . , Z1 is a


random measure with finite support.
K
1. Let {ζi }i=1
n−1
be the union of atom locations in Z1 , Z2 , . . . , Zn−1 . For 1 ≤ m ≤ n − 1, let
zm,j be the atom size of Zm at atom location ζj . Denote zn,i to be the atom size of Zn at
atom location ζi . zn,i ’s are independently across i and the p.m.f. of zn,i at x is:

h(x | z1:(n−1) ) =
e
 Pn−1 
Pn−1 t(zm,i ) + t(x)
Z c/K − 1 + m=1 ϕ(zm,i ) + ϕ(x), η + m=1
n
κ(x)  Pn−1  .
Pn−1 t(zm,i )
Z c/K − 1 + m=1 ϕ(zm,i ), η + m=1
n−1

104
2. K − Kn−1 atom locations are generated iid from H. Zn has pn,x atoms whose size is
exactly x (for x ∈ N ∪ {0}) over these K − Kn−1 atom locations (the pn,0 atoms whose
atom size is 0 can be interpreted as not present in Zn ). The joint distribution of pn,x is a
multinomial with K − Kn−1 trials, with success of type x having probability:
h(x | z1:(n−1) = 0n−1 ) =
e
  
(n − 1)t(0) + t(x)
Z c/K − 1 + (n − 1)ϕ(0) + ϕ(x), η +
n
κ(x)    .
(n − 1)t(0)
Z c/K − 1 + (n − 1)ϕ(0), η +
n−1

Proof of Proposition 2.C.2. We only need to prove the conditional distributions for the atom
sizes: that the K distinct atom locations are generated iid from the base measure is clear.
First we consider n = 1. By construction in Corollary 2.3.3, a priori, the trait frequencies
i=1 are independent, each following the distribution:
{ρi }K
  
1{θ ∈ U } c/K−1 µ(θ)
P(ρi ∈ dθ) = θ exp η, .
Z (c/K − 1, η) −A(θ)
Conditioned on {ρi }Ki=1 , the atom sizes z1,i that Z1 puts on the i-th atom location are
independent across i and each is distributed as:
P(z1,i = x | ρi ) = κ(x)ρϕ(x) exp (⟨µ(ρi ), t(x)⟩ − A(ρi )) .
Integrating out ρi , the marginal distribution for z1,i is:
Z
P(z1,i = x) = P(z1,i = x | ρi = θ)P(ρi ∈ dθ)
Z     
κ(x) c/K−1+ϕ(x) t(x) µ(θ)
= θ exp η+ , dθ
Z (c/K − 1, η) U 1 −A(θ)
  
t(x)
Z c/K − 1 + ϕ(x), η +
1
= κ(x) ,
Z (c/K − 1, η)
by definition of Z as the normalizer eq. (2.8).
Now we consider n ≥ 2. The distribution of zn,i only depends on the distribution of
zn−1,i , zn−2,i , . . . , z1,i since the atom sizes across different atoms are independent of each other
both a priori and a posteriori. The predictive distribution is an integral:
Z
P(zn,i = x | z1:(n−1),i ) = P(zn,i = x | ρi )P(ρi ∈ dθ | z1:(n−1),i ).

Because the prior over ρi is conjugate for the likelihood zi,j | ρi , and the observations zi,j
are conditionally independent given ρi , the posterior P(ρi ∈ dθ | z1:(n−1),i ) is in the same
exponential family but with different natural parameters:
 Pn−1   
Pn−1 t(zm,i ) µ(θ)
θc/K−1+ m=1 ϕ(zm,i ) exp η+ m=1 , dθ
n−1 −A(θ)
1{θ ∈ U }  Pn−1  .
Pn−1 t(zm,i )
Z c/K − 1 + m=1 ϕ(zm,i ), η + m=1
n−1

105
This means that the predictive distribution P(zn,i = x | z1:(n−1),i ) equals:
 Pn−1   
R c/K−1+Pn−1 ϕ(z )+ϕ(x)
m=1 t(zm,i ) + t(x) µ(θ)
θ m=1 m,i
exp η+ , dθ
U n −A(θ)
κ(x)  Pn−1 
Pn−1 t(zm,i )
Z c/K − 1 + m=1 ϕ(zm,i ), η + m=1
n−1
 Pn−1 
Pn−1 t(z m,i ) + t(x)
Z c/K − 1 + m=1 ϕ(zm,i ) + ϕ(x), η + m=1
n
= κ(x)  Pn−1  .
Pn−1 t(zm,i )
Z c/K − 1 + m=1 ϕ(zm,i ), η + m=1
n−1

The predictive distribution P(zn,i = x | z1:(n−1),i ) govern both the distribution of atom sizes
for known atom locations and new atom locations.

2.D Admissible hyperparameters of extended gamma pro-


cess
We first describe the two desidarata of a useful Bayesian nonparametric model in more detail.
The condition that the total mass of the rate measure needs to be infinite reads as
Z ∞
ν(dθ) = ∞
0

This is Broderick et al. [30, A1]. To ensure that the number of active traits is almost surely
finite, it suffices to ensure that the expected number of traits is finite. The condition that
the expected number of active traits is finite reads as
Z ∞
(1 − Zτ−1 (θ))ν(dθ) < ∞.
0

This is Broderick et al. [30, A2]: note that Zτ−1 (θ) is exactly the probability that a trait with
rate θ does not manifest.
Lemma 2.D.1 (Hyperparameters for extended gamma rate measure). For any γ > 0, c > 0,
T ≥ 1, τ > 0, for the rate measure ν(t.heta) from eq. (2.11) Then,
R∞
• 0 ν(dθ) = ∞.
R∞
• 0 [1 − Zτ−1 (θ)]ν(dθ) < ∞.
Proof of Lemma 2.D.1. We observe that it suffices to show the two conclusions for γ = 1,
since any positive scaling of the rate measure will preserve the finiteness (or infiniteness) of
the integrals. In addition, we can replace the upper limit of integration, ∞, by T , since the
rate measure is zero for θ > T .
We begin with elementary observations about the monotonicity of Zτ (θ). Zτ (θ) is increasing
in θ but decreasing in τ . In the limit of τ → ∞, Zτ (θ) approaches 1 + θ.

106
RT
To prove the first statement, we use a simple lower bound on 0 ν(dθ), which holds since
T ≥ 1: Z T Z 1
−1 −c
θ Zτ (θ)dθ ≥ θ−1 Zτ−c (θ)dθ
0 0
Z 1
−c
≥ Zτ (1) θ−1 dθ = ∞.
0

Since Zτ R(θ) is increasing in θ, for all θ ∈ [0, 1], Zτ−c (θ)


≥ > 0. There are many ways
Zτ−c (1)
1
to show 0 θ−1 dθ = ∞ — the connection with the harmonic series is one.
To prove the second statement, we consider two cases separately.
In the first case, τ ≤ 1.0. We first show that, there exists a constant κ > 0 such that, for
θ ∈ [0, 1]:
1 − Zτ−1 (θ) ≤ θ + κθ2 . (2.33)
Consider the Taylor series of Zτ (θ). By recursion, the jth derivative of Zτ (θ) equals

∞ i
!1−τ
di X Y θj
Zτ (θ) = (j + k) . (2.34)
dθi j=0 k=1
(j!)τ

It is easy to check that the infinite sums in eq. (2.34) converge for any θ. By absolute
convergence theorems20 , it suffices to inspect θ > 0. By the ratio test, subsequent terms have
ratio
i
!1−τ i
!1−τ
θj+1 Y θj Y θ(j + 1 + i)1−τ j→∞
(j + 1 + k) / (j + k) = −−−→ 0.
[(j + 1)!]τ k=1 [(j)!]τ k=1 j+1

Clearly Zτ (0) = 1. Hence, for all θ close enough to 0, Zτ (θ) is strictly positive. Therefore,
Zτ−1 (θ) also has derivatives of all orders in an open interval containing [0, 1]. Note that
d
Z (θ) θ=0 = 1. Therefore
dθ τ

d
d −1 − dθ Zτ (θ) θ=0
Z (θ) = = −1.
dθ τ θ=0 Zτ2 (0)

By Taylor’s theorem Kline [112, Section 20.3], for any θ ∈ [0, 1], there exists a y between 0
and θ such that
1 d2 −1
 
−1
Zτ (θ) = 1 − θ + Z (θ) θ=y θ2 .
2 dθ2 τ
d2
It is clear that the second derivative Z −1 (θ) θ=y
dθ2 τ
is bounded by a constant independent of
y for y ∈ [0, 1], since
d 2 2
d2 −1

Z (θ)
dθ2 τ d 1
2
Zτ (θ) = −2 Zτ (θ) ,
dθ Zτ2 (θ) dθ 3
Zτ (θ)
20
see, e.g. https://www.whitman.edu/mathematics/calculus_online/section11.06.html

107
with the Zτ (θ) being at least 1 and the derivatives being bounded. This shows eq. (2.33).
Therefore:
Z T Z 1 Z T
−1 −1 −1 −c
[1 − Zτ (θ)]ν(dθ) ≤ [1 − Zτ (θ)]θ Zτ (θ)dθ + θ−1 Zτ−c (θ)dθ
0 0 1
= A + B.
We use the estimate 1 − Zτ−1 (θ)θ + κθ2 in the first part (A):
Z 1 Z 1
−1 −1 −c
[1 − Zτ (θ)]θ Zτ (θ)dθ ≤ (1 + κθ)Zτ−c (θ)dθ.
0 0

Since Zτ−c (θ) ≤ exp(−cθ), it is true that A is finite. For the second part (B), we again use
the upper bound Zτ−c (θ) ≤ exp(−cθ) and also θ−1 ≤ 1 to conclude that B is finite. Overall
A + B is finite.
In the second case, τ > 1.0. Since Zτ (θ) ≤ Z1 (θ), 1 − Zτ−1 (θ) ≤ 1 − Z1−1 (θ) = 1 − exp(−θ).
In addition, since Zτ (θ) ≥ Z∞ (θ), we also have Zτ−c (θ) ≤ Z∞−c 1
= (1+θ)c . Hence

Z T Z T
−1 1
[1 − Zτ (θ)]ν(dθ) ≤ (1 − exp(−θ))θ−1 dθ.
0 0 (1 + θ)c
Observe that for any positive θ, (1 − exp(−θ))θ−1 ≤ 1. Therefore
Z T Z T
−1 1
[1 − Zτ (θ)]ν(dθ) ≤ c
dθ.
0 0 (1 + θ)

The integrand 1
(1+θ)c
is continous and upper bounded on [0, T ], so the overall integral is finite.

2.E Technical lemmas


2.E.1 Concentration
Lemma 2.E.1 (Modified upper tail Chernoff bound). Let X = ni=1 Xi , where Xi = 1 with
P
probability pi and Xi = 0Pwith probability 1 − pi , and all Xi are independent. Let µ be an
upper bound on E(X) = ni=1 pi . Then for all δ > 0:
δ2
 
P(X ≥ (1 + δ)µ) ≤ exp − µ .
2+δ
Proof of Lemma 2.E.1. The proof relies on the regular upper tail Chernoff bound [54, Theo-
rem 1.10.1] and an argument using stochastic domination. We pad the first n Poisson trials
that define X with additional trials Xn+1 , Xn+2 , . . . , Xn+m . m is the smallest natural number
such that µ−E[X]
m
≤ 1. Each XPn+i is a Bernoulli with probability µ−E[X] m
, and the trials are
independent. Then Y = X + m j=1 X n+j is itself the sum of Poisson trials with mean exactly
µ, so the regular Chernoff bound applies:
δ2
 
P(Y ≥ (1 + δ)µ) ≤ exp − µ ,
2+δ

108
where we used [54, Equation 1.10.13] and the simple observation that 2/3δ < δ. By
construction, X is stochastically dominated by Y , so the tail probabilities of X are upper
bounded by the tail probabilities of Y .
Lemma 2.E.2 (Lower tail Chernoff bound [54, Theorem 1.10.5]). Let X = ni=1 Xi , where
P
Xi = 1 with probability pi and Xi = 0 with probability 1 − pi , and all Xi are independent. Let
µ := E(X) = i=1 pi . Then for all δ ∈ (0, 1):
Pn

P(X ≤ (1 − δ)µ) ≤ exp(−µδ 2 /2).

Lemma 2.E.3 (Tail bounds for Poisson distribution). If X ∼ Poisson(λ) then for any x > 0:
x2
 
P(X ≥ λ + x) ≤ exp − ,
2(λ + x)
and for any 0 < x < λ:  2
x
P(X ≤ λ − x) ≤ exp − .

Proof of Lemma 2.E.3. For x ≥ −1, let ψ(x) := 2((1 + x) ln(1 + x) − x)/x2 .
We first inspect the upper tail bound. If X ∼ Poisson(λ), for any x > 0, Pollard [165,
Exercise 3 p.272] implies that:
 2  
x x
P(X ≥ λ + x) ≤ exp − ψ .
2λ λ
x2 x2
To show the upper tail bound, it suffices to prove that 2λ ψ λx is greater than 2(λ+x) . In


general, we show that for u ≥ 0:

(u + 1)ψ(u) − 1 ≥ 0. (2.35)

The denominator of (u+1)ψ(u)−1 is clearly positive. Consider the numerator of (u+1)ψ(u)−1,


which is g(u) := 2((u + 1)2 ln(u + 1) − u(u + 1) − u2 . Its 1st and 2nd derivatives are:

g ′ (u) = 4(u + 1) ln(u + 1) − 2u + 1


g ′′ (u) = 4 ln(u + 1) + 2.

Since g ′′ (u) ≥ 0, g ′ (u) is monotone increasing. Since g ′ (0) = 1, g ′ (u) > 0 for u ≥ 0, hence
g(u) is monotone increasing. Because g(0) = 0, we conclude that g(u) ≥ 0 for u > 0 and
eq. (2.35) holds. Plugging in u = x/λ:
x 1 λ
ψ ≥ x = ,
λ 1+ λ x+λ
x2 x2
which shows 2λ ψ λx ≥ 2(λ+x) .


Now we inspect the lower tail bound. We follow the proof of Canonne [36, Theorem 1]. We
first argue that:  2  
x x
P(X ≤ λ − x) ≤ exp − ψ − . (2.36)
2λ λ

109
For any θ, the moment generating function E[exp(θX)] is well-defined and well-known:

E[exp(θX)] := exp(λ(exp(θ) − 1)).

Therefore:

P(X ≤ λ − x) ≤ P(exp(θX) ≤ exp(θ(λ − x)) ≤ P(exp(θ(λ − x − X)) ≥ 1)


≤ exp(θ(λ − x))E[exp(−θX)],

where we have used Markov’s inequality.


We now aim to minimize exp(θ(λ − x))E[exp(−θX)] as a function of θ. Its logarithm is:

λ(exp(−θ) − 1) + θ(λ − x).

This is a convex function, whose derivative vanishes at θ = − ln 1 − x


. Overall this means

λ
the best upper bound on P(X ≤ λ − x) is:
 x x x 
exp −λ + (1 − ) ln(1 − ) ,
λ λ λ
which is exactly the right hand side of eq. (2.36). Hence to demonstrate the lower tail bound,
it suffices to show that:  x
ψ − ≥ 1.
λ
More generally, we show that for −1 ≤ u ≤ 0, ψ(u) − 1 ≥ 0. Consider the numerator of
ψ(u) − 1, which is h(u) := 2((1 + u) ln(1 + u) − u) − u2 . The first two derivatives are:

h′ (u) = 2(1 + ln(1 + u)) − 2u


2
h′′ (u) = −2
1+u
Since h′′ (u) ≥ 0, h(u) is convex on [−1, 0]. Note that h(0) = 0. Also, by simple continuity
argument, h(−1) = 2. Therefore, h is non-negative on [0, 1], meaning that ψ(u) ≥ 1.
P∞
Lemma 2.E.4 (Multinomial-Poisson approximation). Let {pi }∞ i=1 , pi ≥ 0, i=1 pi < 1.
Suppose there are n independent trials: in each trial, success of type i has probability pi . Let
X = {Xi }∞ i=1 be the number of type i successes after n trial. Let Y = {Yi }i=1 be independent

Poisson random variables, where Yi has mean npi . Then, there exists a coupling (X, b Yb ) of
PX and PY such that !2
X∞
P(Xb ̸= Yb ) ≤ n pi .
i=1

Furthermore, the joint distribution (X,


b Yb ) naturally disintegrates i.e. the conditional distribu-
tion X
b | Yb exists.

Proof of Lemma 2.E.4. First, we recognize that both X and Y can be sampled in two steps.
• Regarding X, first sample N1 ∼ Binom (n, ∞ i=1 pi ). Then, for P
each 1 ≤ k =
̸ N1 ,
P

independently sample Zk where P(Zk = i) = P∞ pj . Then, Xi = k=1 1{Zk = i} for


p i N 1
j=1
each i.

110
P∞
• Regarding Y , first sample N2 ∼ Poisson (n i=1 pi ). Then, for each 1 ≤ k ≤ N2 ,
independently sample Tk where P(Tk = i) = P pi . Then, Yi = N k=1 1{Tk = i} for
P 2

j=1 pj
each i.
The two-step sampling perspective for X comes from rejection sampling: to generate a success
of type k, we first generate some type of success, and then re-calibrate to get the right
proportion for type k. The two-step perspective for Y comes from the thinning property of
Poisson distribution [120, Exercise 1.5]. The thinning property implies that for any finite
index set K, all {Yi } for i ∈ K are mutually independent and marginally, Yi ∼ Poisson(npi ).
Hence the whole collection {Yi }∞ i=1 are independent Poissons and the mean of Yi is npi .
Observing that the conditional X | N1 = n is the same as Y | N2 = n, we propose the coupling
that essentially proves propagation rule Lemma 2.E.8. The proposed coupling (X, b Yb ) is that

• Sample (N c2 ) from the maximal coupling that attains dTV between the two distribu-
c1 , N
tions: Binom (n, ∞
P∞
i=1 pi ) and Poisson (n
P
i=1 pi ).

• If N
c1 = N c2 , let the common value be n, sample X c1 = n and set Yb = X.
b |N b Else
N c2 , independently sample X
c1 ̸= N c1 and Yb | N
b |N c2 .

From the classic binomial-Poisson approximation [122], we know that



!2
X
P(Nc1 ̸= N
c2 ) = dTV (PN1 , PN2 ) ≤ n pi ,
i=1

which guarantees that !2



X
b ̸= Yb ) ≤ n
P(X pi .
i=1

Alternatively, we can sample from the conditional X b | Yb in the following way. From Yb ,

compute N c2 , which is just
x=1 Yi . Sample N1 from the conditional distribution N1 | N2 of
P b c c c
the maximal coupling that attains the binomial-Poisson total variation. If N c1 = N c2 , set
Xb = Yb . Else sample X b from the conditional X c1 . It is straightforward to verify that this
b |N
is the conditional Xb | Yb of the joint (X,b Yb ) described above.

Lemma 2.E.5 (Total variation between Poissons [2, Corrollary 3.1]). Let P1 be the Poisson
distribution with mean s, P2 the Poisson distribution with mean t. Then:
dTV (P1 , P2 ) ≤ 1 − exp(−|s − t|) ≤ |s − t|.

2.E.2 Total variation


We will frequently use the following relationship between total variation and coupling. For
two distributions PX and PY over the same measurable space, it is well-known that the total
variation distance between PX and PY is at most the infimum over joint distributions (X,
b Yb )
which are couplings of PX and PY :
dTV (PX , PY ) ≤ inf b ̸= Yb ).
P(X
X,
b Yb coupling of PX ,PY

111
When PX and PY are discrete distributions, the inequality is actual equality, and there exists
couplings that attain the equality [126, Proposition 4.7].
We first state the chain rule, which will be applied to compare joint distributions that admit
densities.
Lemma 2.E.6 (Chain rule). Suppose PX1 ,Y1 and PX2 ,Y2 are two distributions that have
densities with respect to a common measure over the ground space A × B. Then:
dTV (PX1 ,Y1 , PX2 ,Y2 ) ≤ dTV (PX1 , PX2 ) + sup dTV (PY1 | X1 =a , PY2 | X2 =a ).
a∈A

Proof of Lemma 2.E.6. Because both PX1 ,Y1 and PX2 ,Y2 have densities, total variation distance
is half of L1 distance between the densities:
Z
1
dTV (PX1 ,Y1 , PX2 ,Y2 ) = | PX1 ,Y1 (a, b) − PX2 ,Y2 (a, b) | dadb
2 A×B
Z
1
= |PX1 ,Y1 (a, b) − PX2 (a)PY1 | X1 (b | a)
2 A×B
+ PX2 (a)PY1 | X1 (b | a) − PX2 ,Y2 (a, b)|dadb
Z
1
≤ PY | X (b | a) | PX1 (a) − PX2 (a) |
2 A×B 1 1
+ PX2 (a) | PY1 | X1 (b | a) − PY2 | X2 (b | a) | dadb
Z
1
= PY | X (b | a) | PX1 (a) − PX2 (a) | dadb
2 A×B 1 1
Z
1
+ PX (a) | PY1 | X1 (b | a) − PY2 | X2 (b | a)|dadb,
2 A×B 2
where we have used triangle inequality. Regarding the first term, using Fubini:
Z
1
PY | X (b | a)|PX1 (a) − PX2 (a)|dadb
2 A×B 1 1
Z Z 
1
= PY | X (b | a)db |PX1 (a) − PX2 (a)|da
2 a∈A b∈B 1 1
Z
1
= |PX1 (a) − PX2 (a)|da
2 a∈A
= dTV (PX1 , PX2 ).
Regarding the second term:
Z
1
PX (a)|PY1 | X1 (b | a) − PY2 | X2 (b | a)|dadb
2 A×B 2
Z  Z 
1
= |PY1 | X1 (b | a) − PY2 | X2 (b | a)|db PX2 (a)da
a∈A 2 b∈B
 Z
≤ sup dTV (PY1 | X1 =a , PY2 | X2 =a ) PX2 (a)da
a∈A a∈A
= sup dTV (PY1 | X1 =a , PY2 | X2 =a ).
a∈A

The sum between the first and second upper bound gives the total variation chain rule.

112
An important consequence of Lemma 2.E.6 is when the distributions being compared have
natural independence structures.
Lemma 2.E.7 (Product rule). Let PX1 ,Y1 and PX2 ,Y2 be discrete distributions. In addition,
suppose PX1 ,Y1 factorizes into PX1 PY1 and similarly PX2 ,Y2 = PX2 PY2 . Then:
dTV (PX1 ,Y1 , PX2 ,Y2 ) ≤ dTV (PX1 , PX2 ) + dTV (PY1 , PY2 ).
Proof of Lemma 2.E.7. Since PX1 ,Y1 and PX2 ,Y2 are discrete distributions, we can apply
Lemma 2.E.6 (the common measure is the counting measure). Because each joint distribution
PXi ,Yi factorizes into PXi PYi , for any a ∈ A, the right most term in the inequality of
Lemma 2.E.6 simplifies into
sup dTV (PY1 | X1 =a , PY2 | X2 =a ) = dTV (PY1 , PY2 ),
a∈A

since PY1 = PY1 | X1 =a and PY2 = PY2 | X2 =a for any a.


We call the next lemma the propagation rule, which applies even if distributions do not have
densities.
Lemma 2.E.8 (Propagation rule). Suppose PX1 ,Y1 and PX2 ,Y2 are two distributions over the
same measurable space. Suppose that the conditional Y2 | X2 = a is the same as the conditional
Y1 | X1 = a, which we just denote as Y | X = a. Then:
dTV (PY1 , PY2 ) ≤ inf c1 ̸= X
P(X c2 ).
X
c1 ,X
c2 coupling of PX ,PX
1 2

If PX1 and PX2 are discrete distributions, we also have:


dTV (PY1 , PY2 ) ≤ dTV (PX1 , PX2 ).
Proof of Lemma 2.E.8. Let (X c2 ) be any coupling of PX1 and PX2 . The following two-step
c1 , X
process generates a coupling of PY1 and PY2 :
• Sample (X
c1 , X
c2 ).

• If Xc1 = Xc2 , let the common value be x. Sample Yb1 from the conditional distribution
Y | X = x, and set Yb2 = Yb1 . Else if Xc1 ̸= X c2 , independently sample Yb1 from Y | X = X c1
and Yb2 from Y | X = X c2 .
It is easy to verify that the tuple (Yb1 , Yb2 ) is a coupling of PY1 and PY2 . In addition, (Yb1 , Yb2 )
has the property that
P(Yb1 ̸= Yb2 , X
c1 = X c2 ) = 0,
since conditioned on X c1 = Xc2 , the values of Yb1 and Yb2 always agree. Therefore:
P(Yb1 ̸= Yb2 ) = P(Yb1 ̸= Yb2 , X
c1 ̸= X
c2 ) ≤ P(X
c1 ̸= X
c2 ).
This means that dTV (PY1 , PY2 ) is small:
dTV (PY1 , PY2 ) ≤ P(X
c1 ̸= X
c2 ).

So far (X c2 ) has been an arbitrary coupling between PX1 and PX2 . The final step is
c1 , X
taking the infimum on the right hand side over couplings. When PX1 and PX2 are discrete
distributions, the infimum over couplings is equal to the total variation distance.

113
The final lemma is the reduction rule, which says that the a larger collection of random
variables, in general, has larger total variation distance than a smaller one.

Lemma 2.E.9 (Reduction rule). Suppose PX1 ,Y1 and PX2 ,Y2 are two distributions over the
same measurable space A × B. Then:

dT V (PX1 ,Y1 , PX2 ,Y2 ) ≥ dT V (PX1 , PX2 ).

Proof of Lemma 2.E.9. By definition,

dT V (PX1 , PX2 ) = sup |PX1 (A) − PX2 (A)|.


measurable A

For any measurable A, the product A × B is also measurable. In addition:

PX1 (A) − PX2 (A) = PX1 ,Y1 (A, B) − PX2 ,Y2 (A, B).

Therefore, for any A,

|PX1 (A) − PX2 (A)| ≤ dT V (PX1 ,Y1 , PX2 ,Y2 ),

since PX1 ,Y1 (A, B) − PX2 ,Y2 (A, B) is the difference in probability mass for one measurable
event. The final step is taking supremum of the left hand side.

2.E.3 Miscellaneous
Lemma 2.E.10 (Order of growth of harmonic-like sums).
N
X α
α [ln N + ln(α + 1) − ψ(α)] ≥ ≥ α(ln N − ψ(α) − 1).
n=1
n−1+α

where ψ is the digamma function.

Proof of Lemma 2.E.10. Because of the digamma function identity ψ(z + 1) = ψ(z) + 1/z
for z > 0, we have:
N
X α
= α[ψ(α + N ) − ψ(α)]
n=1
n − 1 + α
Gordon [76, Theorem 5] says that
1 1
ψ(α + N ) ≥ ln(α + N ) − − ≥ ln N − 1.
2(α + N ) 12(α + N )2

Gordon [76, Theorem 5] also says that

ψ(α + N ) ≤ ln(α + N ) ≤ ln((α + 1)N ) = ln(1 + α) + ln N,

where it’s a simple proof that α + N ≤ αN + N = (α + 1)N when α > 0, N ≥ 1

114
We list a collection of technical lemmas that are used when verifying Condition 2.4.1 for the
recurring examples.
The first set assists in the beta–Bernoulli model.

• For α > 0 and i = 1, 2, 3, . . .:


 
1 1 1
≤2 1{i = 1} + 1{i > 1} . (2.37)
i+α−1 2α i

• For m, x, y > 0, m ≤ y:
m+x m x
− ≤ . (2.38)
y+x y y

Proof of eq. (2.37). If i = 1, 1


i+α−1
= α1 . If i ≥ 1, 1
i+α−1
≤ 1
i−1
≤ 2i .
Proof of eq. (2.38).

m+x m (m + x)y − m(y + x) x(y − m) x


− = = ≤ .
y+x y y(y + x) y(y + x) y

The second set aid in the gamma–Poisson model.

• For x ∈ [0, 1);


(1 − x) ln(1 − x) + x ≥ 0. (2.39)

• For x ∈ (0, 1), for p ≥ 0:


x
(1 − x)p + p ≥ 1. (2.40)
1−x
• For λ > 0, for m > 0, t > 1, x > 0:

1/t
dT V NB(m, t−1 ), NB(m + x, t−1 ) ≤ x (2.41)

,
1 − 1/t

where NB(r, θ) is the negative binomial distribution.

• For y ∈ N, K > m > 0:

m Γ(m/K + y) m2
−K ≤e . (2.42)
y Γ(m/K)y! K

where e is the Euler constant and Γ(y) is the gamma function.

Proof of eq. (2.39). Set g(x) to (1−x) ln(1−x)+x. Then its derivative is g ′ (x) = − ln(1−x) ≥
0, meaning the function is monotone increasing. Since g(0) = 0, it’s true that g(x) ≥ 0 over
[0, 1).

115
Proof of eq. (2.40). Let f (p) = (1 − x)p + p 1−x
x
− 1. Then f ′ (p) = ln(1 − x)(1 − x)p + 1−x
x
.
Also f (p) = (ln(1 − x)) (1 − x) > 0. So f (p) is monotone increasing. At p = 0,
′′ 2 p ′

f ′ (0) = ln(1 − x) + 1−x


x
≥ 0. Therefore f ′ (p) ≥ 0 for all p. So f (p) is increasing. Since
f (0) = 0, it’s true that f (p) ≥ 0 for all p.
Proof of eq. (2.41). It is known that NB(r, θ) is a Poisson stopped sum distribution [101,
Equation 5.15]:

• N ∼ Poisson(−r ln(1 − θ)).


i.i.d. −θk
• Yi ∼ Log(θ) where the Log(θ) distribution’s pmf at k equals k ln(1−θ)
.


PN
i=1 Yi ∼ NB(r, θ).

Therefore, by the propagation rule Lemma 2.E.8, to compare NB(m, t−1 ) with NB(m + x, t−1 ),
it suffices to compare the two generating Poissons.

dT V NB(m, t−1 ), NB(m + x, t−1 )




≤ dT V (Poisson(−m ln(1 − t−1 ), Poisson(−(m + x) ln(1 − t−1 )))


t−1
≤ − ln(1 − t−1 )x ≤ x .
1 − t−1
We have used the fact that total variation distance between Poissons is dominated by their
different in means Lemma 2.E.5 and eq. (2.39).

Q 
Proof of eq. (2.42). Since Γ y−1 m
+ j), we
m
 m
 m
 m Qy−1 m
+y =
K j=0 ( K + j) Γ K
=Γ K K j=1 ( K
have:
y−1
!
m Γ(m/K + y) m Y m/K + j
−K = −1 .
y Γ(m/K)y! y j=1
j

We inspect the product in more detail.


y−1 y−1   y−1  
Y m/K + j Y m/K Y m/K
= 1+ ≤ exp
j=1
j j=1
j j=1
j
y−1
!
mX1 m 
= exp ≤ exp (ln y + 1) = (ey)m/K .
K j=1 j K

where the (y − 1)-th Harmonic sum is bounded by ln y + 1. Therefore

m Γ(m/K + y) m
(ey)m/K − 1 .

−K ≤
y Γ(m/K)y! y

We quickly prove that for any u ≥ 1, 0 < a < 1, we have

ua − 1 ≤ a(u − 1).

116
Truly, consider the function g(u) = a(u − 1) − ua + 1. The derivative is g ′ (u) = a − aua−1 =
a(1 − ua−1 ). Since a ∈ (0, 1) and u ≥ 1, g ′ (u) > 0. Therefore g(u) is monotone increasing.
Since g(1) = 0, we have reached the conclusion. Applying to our situation:
m
(ey)m/K − 1 ≤ (ey − 1).
K
In all:
m Γ(m/K + y) m2
−K ≤e .
y Γ(m/K)y! K

The third set aid in the beta–negative binomial model.

• For x > 0, z ≥ y > 1:

B(x, y) − B(x, z) ≤ (z − y)B(x + 1, y − 0.5) ≤ (z − y)B(x + 1, y − 1). (2.43)

• For any r > 0, b ≥ 1:



X Γ(y + r) r
B(y, b + r) ≤ . (2.44)
y=1
y!Γ(r) b − 0.5

• For b ≥ 1, for any c > 0, for any K ≥ c:


Γ(b) c
1− ≤ (2 + ln b) . (2.45)
Γ(b + c/K) K

• For b > 1, c > 0, K ≥ 2c(ln b + 2):


K c
c− ≤ (3 ln b + 8). (2.46)
B(c/K, b) K

Proof of eq. (2.43). First we prove that for any x ∈ [0, 1):

1 − x ln(1 − x) + x ≥ 0.

Truly, let g(x) be the function on the left hand side. Then its derivative is

′ 2 1 − x − ln(1 − x) − 2
g (x) = √ .
2 1−x
Denote the numerator function by h(x). Its derivative is
1 1
h′ (x) = −√ ≥ 0,
1−x 1−x
since x ∈ [0, 1] meaning h is monotone increasing. Since h(0) = 0, it means h(x) ≥ 0. This
means g ′ (x) ≥ 0 i.e. g itself is monotone increasing. Since g(0) = 0 it’s true that g(x) ≥ 0 for
all x ∈ [0, 1).

117
Second we prove that for all x ∈ [0, 1], for all p ≥ 0:
x
(1 − x)p + p √ − 1 ≥ 0. (2.47)
1−x

Truly, let f (p) = (1 − x)p + p √1−x x


− 1. Then f ′ (p) = ln(1 − x)(1 − x)p + √1−x x
. Also
f (p) = (ln(1 − x)) (1 − x) > 0. So f (p) is monotone increasing. At p = 0, f ′ (0) =
′′ 2 p ′

ln(1 − x) + √1−xx
> 0. Therefore f ′ (p) ≥ 0 for all p. So f (p) is increasing. Since f (0) = 0,
it’s true that f (p) ≥ 0 for all p.
We finally prove the inequality about beta functions.
Z 1
B(x, y) − B(x, z) = θx−1 (1 − θ)y−1 (1 − (1 − θ)z−y )dθ
0
Z 1
≤ θx−1 (1 − θ)y−1 (z − y)θ(1 − θ)−0.5 dθ
0
Z 1
= (z − y) θx (1 − θ)y−1.5 dθ = (z − y)B(x + 1, y − 0.5).
0

where we have use 1−(1−θ)z−y ≤ (z −y)θ(1−θ)−1/2 from eq. (2.47). As for B(x+1, y−0.5) ≤
B(x + 1, y − 1), it is because of the monotonicity of the beta function.
Proof of eq. (2.44).
∞ Z ∞
1X
X Γ(y + r) Γ(y + r) y−1
B(y, b + r) = θ (1 − θ)b+r−1 dθ
y=1
y!Γ(r) 0 y=1 y!Γ(r)

Z 1 !
X Γ(y + r)
= θ−1 θy (1 − θ)b+r−1 dθ
0 y=1
y! Γ(r)
Z 1  
−1 1
= θ r
−1 (1 − θ)b+r−1 dθ
0 (1 − θ)
Z 1
θ−1 (1 − (1 − θ)r ) (1 − θ)b−1 dθ

=
Z0 1
θ
≤ θ−1 r √ (1 − θ)b−1 dθ
0 1 − θ
Z 1
r
=r (1 − θ)b−1.5 dθ = ,
0 b − 0.5

where the identity ∞ y=1 y! Γ(r) θ = (1−θ)r −1 is due to the normalization constant for negative
Γ(y+r) y 1
P

binomial distributions, and we also used eq. (2.47) on 1 − (1 − θ)r .


Proof of eq. (2.45). First we prove that:

Γ(b) c
1− ≤ (2 + ln b).
Γ(b + c/K) K

118
The recursion defining Γ(b) allows us to write:
 
⌊b⌋−1
Γ(b) Y b−i Γ(b − ⌊b⌋ + 1)
1− =1−  .
Γ(b + c/K) i=1
b + c/K − i Γ(b + c/K − ⌊b⌋ + 1)

The argument proceeds in one of two ways. If Γ(b−⌊b⌋+1)


Γ(b+c/K−⌊b⌋+1)
≥ 1, then we have:
⌊b⌋−1
Γ(b) Y b−i
1− ≤1−
Γ(b + c/K) i=1
b + c/K − i
 
  ⌊b⌋−1
b−1 b−1 b−i Y
= 1− + − 
b + c/K − 1 b + c/K − 1 i=1
b + c/K − i
 
⌊b⌋−1
c 1 b−1 1 −
Y b−i
= + 
K b + c/K − 1 b + c/K − 1 i=2
b + c/K − i
 
⌊b⌋−1
c 1 Y b−i
≤ + 1−
 
Kb−1 i=2
b + c/K − i
⌊b⌋−1
c X 1 c
≤ ... ≤ ≤ (ln b + 1).
K i=1 b − i K

Else, Γ(b−⌊b⌋+1)
Γ(b+c/K−⌊b⌋+1)
< 1 and we write:

Γ(b)
1−
Γ(b + c/K)
 
⌊b⌋−1
Γ(b − ⌊b⌋ + 1) Γ(b − ⌊b⌋ + 1) Y b−i
=1− + 1 − 
Γ(b + c/K − ⌊b⌋ + 1) Γ(b + c/K − ⌊b⌋ + 1) i=1
b + c/K − i
 
Γ(b − ⌊b⌋ + 1) c
≤ 1− + (ln b + 1).
Γ(b + c/K − ⌊b⌋ + 1) K

We now argue that for all x ∈ [1, 2), for all K ≥ c, 1 − Γ(x)
Γ(x+c/K)
≤ Kc . By convexity of Γ(x),
c Γ′ (x+c/K)
we know that Γ(x) ≥ Γ(x + c/K) − c ′
K
Γ (x + c/K). Hence Γ(x+1/K)
Γ(x)
≥ 1− K Γ(x+c/K)
. Since
Γ′ (y)
x + c/K ∈ [1, 3) and ψ(y) = Γ(y)
, the digamma function, is a monotone increasing function
Γ′ (x+c/K) Γ′ (3)
(it is the derivative of a ln Γ(x), which is also convex), Γ(x+c/K)
≤ Γ(3)
≤ 1. Applying this
to x = b − ⌊b⌋ + 1, we conclude that:
Γ(b) c
1− ≤ (2 + ln b).
Γ(b + c/K) K
We now show that:
Γ(b) c
− 1 ≥ − (ln b + ln 2).
Γ(b + c/K) K

119
Convexity of Γ(y) means that:

c ′ Γ(b) c Γ′ (b + c/K)
Γ(b) ≥ Γ(b + c/K) − Γ (b + c/K) →
− −1≥− .
K Γ(b + c/K) K Γ(b + c/K)

From Alzer [4, Equation 2.2], we know that ψ(x) ≤ ln(x) for positive x. Therefore:

c Γ′ (b + c/K) c c
− ≥ − ln(b + c/K) ≥ − (ln b + ln 2)
K Γ(b + c/K) K K
since b + Kc ≤ 2b.
We combine two sides of the inequality to conclude that the absolute value is at most
c
K
(2 + ln b).
Proof of eq. (2.46).

K K/c Γ(c/K + b)
c− =c −1
B(c/K, b) Γ(c/K) Γ(b)
   
K/c Γ(c/K + b) K/c
=c −1 + −1
Γ(c/K) Γ(b) Γ(c/K)
 
K/c Γ(c/K + b) K/c
≤c −1 + −1 .
Γ(c/K) Γ(b) Γ(c/K)
On the one hand:
K/c Γ(1)
= .
Γ(c/K) Γ(1 + c/K)
From eq. (2.45), we know:
Γ(1) 2c
−1 ≤ .
Γ(1 + c/K) K
On the other hand, let y = Γ(b)/Γ(c/K + b). Then:

Γ(c/K + b) 1 |1 − y|
−1 = −1 = .
Γ(b) y y

Again using eq. (2.45), |1 − y| ≤ Kc (2 + ln b). Since K ≥ 2c(ln b + 2), c


K
(2 + ln b) is at most
0.5, meaning |1 − y| ≤ 0.5 and y ≥ 0.5. Therefore

Γ(c/K + b) 2c
− 1 ≤ (2 + ln b).
Γ(b) K
In all:
  
K 2c c 2c
c− ≤c 1+ 2 (2 + ln b) +
B(c/K, b) K K K
c
≤ (3 ln b + 8).
K

120
2.F Verification of upper bound’s assumptions for addi-
tional examples
Recall the definitions of h, e
h, and Mn,x for exponential family CRM-likelihood in section 2.C.

2.F.1 Gamma–Poisson with zero discount


First we write down the functions in Condition 2.4.1 for non-power-law gamma–Poisson. This
requires expressing the rate measure and likelihood in exponential-family form:
1 x
ℓ(x | θ) = θ exp(−θ), ν(dθ) = γλθ−1 exp(−λθ),
x!
which means that κ(x) = 1/x!, ϕ(x) = x, µ(θ) = 0, A(θ) = θ. This leads to the normalizer
Z ∞
Z= θξ exp(−λθ)dθ = Γ(ξ + 1)λ−(ξ+1) .
0

Therefore, h is
−1+ n−1
1 Γ(−1 + n−1
P
i=1 xi +x+1
P
i=1 x i + x + 1)(λ + n)
h(xn = x | x1:(n−1) ) = n−1
Pn−1
x! Γ(−1 + i=1 xi + 1)(λ + n − 1)−1+ i=1 xi +1
P
x  Pi=1n−1
Pn−1  xi
1 Γ( i=1 xi + x) 1 1
= Pn−1 1− ,
x! Γ( i=1 xi ) λ+n λ+n

and similarly e
h is
Pn−1 Pn−1
1 Γ(−1 + i=1 xi + x + 1 + γλ/K)(λ + n)−1+ i=1 xi +x+1+γλ/K
h(xn = x | x1:(n−1) ) =
e
x! Γ(−1 + n−1 −1+ n−1
P
i=1 xi +1+γλ/K
P
i=1 x i + 1 + γλ/K)(λ + n − 1)
x  Pi=1n−1
xi +γλ/K
1 Γ( n−1
P 
i=1 x i + x + γλ/K) 1 1
= Pn−1 1− ,
x! Γ( i=1 xi + γλ/K) λ+n λ+n

and Mn,x is
1 γλ
Mn,x = γλ Γ(x)(λ + n)−x = .
x! x(λ + n)x
Now, we state the constants so that gamma–Poisson satisfies Condition 2.4.1, and give the
proof.
Proposition 2.F.1 (Gamma–Poisson satisfies Condition 2.4.1). The following hold for
arbitrary γ, λ > 0. For any n:

X γλ
Mn,x ≤ .
x=1
n−1+λ

X γλ
h(x | x1:(n−1) = 0n−1 ) ≤
e .
x=1
n−1+λ

121
For any K:

X 2γλ 1
h(x | x1:(n−1) ) − e
h(x | x1:(n−1) ) ≤ .
x=0
K n−1+λ
For any K ≥ γλ :

X γ 2 λ + eγ 2 λ2 1
Mn,x − K e
h(x | x1:(n−1) = 0n−1 ) ≤ .
x=1
K n−1+λ

Proof of Proposition 2.F.1. The growth rate condition of the target model is simple:
∞ ∞ ∞
X X 1 X 1 γλ
Mn,x = γλ x
≤ γλ x
= .
x=1 x=1
x(λ + n) x=1
(λ + n) n−1+λ

The growth rate condition of the approximate model is also simple:


∞  γλ/K
X 1
h(x | x1:(n−1) = 0n−1 ) = 1 − h(0 | x1:(n−1) = 0n−1 ) = 1 − 1 −
e e
x=1
λ+n
γλ (λ + n)−1 1 γλ
≤ −1
= ,
K 1 − (λ + n) Kn−1+λ
where we have used eq. (2.40) with p = γλ K
, x = (λ + n)−1 .
For the total variation between h and e h condition, observe that h and e
h are p.m.f’s of negative
binomial distributions, namely:
n−1
!
X
h(x | x1:(n−1) ) = NB x | xi , (λ + n)−1 ,
i=1
n−1
!
X
h(x | x1:(n−1) ) = NB x |
e xi + γλ/K, (λ + n)−1 .
i=1

The two negative binomial distributions have the same success probability and only differ in
the number of trials. Hence using eq. (2.41), we have:

X γλ (λ + n)−1 2γλ 1
h(x | x1:(n−1) ) − h(x | x1:(n−1) ) ≤ 2
e
−1
= ,
x=0
K 1 − (λ + n) K n−1+λ

where the factor 2 reflects how total variation distance is 1/2 the L1 distance between p.m.f’s.
For the total variation between Mn,. and K e h(· | 0) condition,

X
Mn,x − K e
h(x | x1:(n−1) = 0n−1 )
x=1
∞  γλ/K
X 1 γλ Γ(γλ/K + x) 1
= −K 1−
x=1
(λ + n)x x Γ(γλ/K)x! λ+n

! !
  γλ/K
X 1 γλ 1 γλ Γ(γλ/K + x)
≤ 1− 1− + −K .
x=1
(λ + n)x x λ+n x Γ(γλ/K)x!

122
Using eq. (2.41) we can upper bound:
 γλ/K
1 γλ 1
1− 1− ≤ ,
λ+n K λ+n−1

while eq. (2.42) gives the upper bound:

γλ Γ(γλ/K + x) eγ 2 λ2
−K ≤ .
x Γ(γλ/K)x! K
This means:

X
Mn,x − K e
h(x | x1:(n−1) = 0n−1 )
x=1
∞ ∞
X 1 γλ γλ 1 X 1 eγ 2 λ2
≤ +
x=1
(λ + n)x x K λ + n − 1 x=1 (λ + n)x K
2 2
γ λ 1 eγ 2 λ2 1
≤ 2
+
K (λ + n − 1) K λ+n−1
2 2 2
γ λ + eγ λ 1
≤ .
K n−1+λ

2.F.2 Beta–negative binomial with zero discount


First we write down the functions in Condition 2.4.1 for non-power-law beta–negative binomial.
This requires expressing the rate measure and likelihood in exponential-family form:
Γ(x + r) x
ℓ(x | θ) = θ exp(r ln(1 − θ)),
x!Γ(r)
ν(dθ) = γαθ−1 exp(ln(1 − θ)(α − 1))1{θ ≤ 1},

which means that κ(x) = Γ(x + r)/Γ(r)x!, ϕ(x) = x, µ(θ) = 0, A(θ) = −r ln(1 − θ). This
leads to the normalizer:
Z 1
Z= θξ (1 − θ)rλ dθ = B(ξ + 1, rλ + 1).
0

To match the parametrizations, we need to set λ = α−1


r
i.e. rλ = α − 1. Therefore, h is
Pn−1
Γ(x + r) B( i=1 xi + x, rn + α)
h(xn = x | x1:(n−1) ) = Pn−1 ,
x!Γ(r) B( i=1 xi , r(n − 1) + α)

and e
h is Pn−1
Γ(x + r) B(γα/K + i=1 xi + x, rn + α)
h(xn = x | x1:(n−1) ) =
e Pn−1 ,
x!Γ(r) B(γα/K + i=1 xi , r(n − 1) + α)

123
and Mn,x is
Γ(x + r)
Mn,x = γα B(x, rn + α).
x!Γ(r)
Now, we state the constants so that beta–negative binomial satisfies Condition 2.4.1, and
give the proof.
Proposition 2.F.2 (Beta–negative binomial satisfies Condition 2.4.1). The following hold
for any γ > 0 and α > 1. For any n:

X γα
Mn,x ≤ .
x=1
n − 1 + (α − 0.5)/r

For any n, any K:



X 1 4γα
h(x | x1:(n−1) = 0n−1 ) ≤
e .
x=1
K n − 1 + (α − 0.5)/r

For any K:

X γα 1
h(x | x1:(n−1) ) − e
h(x | x1:(n−1) ) ≤ 2 .
x=0
K n − 1 + α/r
For any n, for K ≥ γα(3 ln(r(n − 1) + α) + 8):

X
Mn,x − K e
h(x | x1:(n−1) = 0n−1 )
x=1
γα (4γα + 3) ln(rn + α + 1) + (10 + 2r)γα + 24
≤ .
K n − 1 + (α − 0.5)/r
Proof of Proposition 2.F.2. The growth rate condition for the target model is easy to verify:
∞ ∞
X X Γ(x + r) r
Mn,x = γα B(x, rn + α) ≤ γα ,
x=1 x=1
Γ(r)x! r(n − 1) + α − 0.5

where we have used eq. (2.44) with b = r(n − 1) + α.


As for the growth rate condition of the approximate model,

X B(γα/K, rn + α)
h(x | x1:(n−1) = 0n−1 ) = 1 − e
e h(0 | x1:(n−1) = 0n−1 ) = 1 −
x=1
B(γα/K, r(n − 1) + α)
B(γα/K, r(n − 1) + α) − B(γα/K, rn + α)
= .
B(γα/K, r(n − 1) + α)
The numerator is small because of eq. (2.43) where x = γα/K, y = r(n − 1) + α, z = rn + α:
B(γα/K, r(n − 1) + α) − B(γα/K, rn + α) ≤ rB(γα/K + 1, r(n − 1) + α − 0.5)
≤ rB(1, r(n − 1) + α − 0.5)
1
= .
n − 1 + (α − 0.5)/r

124
The denominator is large because eq. (2.46) with eq. (2.46) with c = γα, b = r(n − 1) + α:
1 4γα
≤ .
B(γα/K, r(n − 1) + α) K

Combining the two give yields



X 1 4γα
h(x | x1:(n−1) = 0n−1 ) ≤
e .
x=1
K n − 1 + (α − 0.5)/r

For the total variation between h and e h condition, we first discuss how each function can
be expressed a p.m.f.P of so-called beta negative binomial i.e., BNB [101, Section 6.2.3]
distribution. Let A = n−1i=1 xi . Observe that:

Γ(x + r) B(A + x, rn + α) Γ(A + r) B(r + x, A + r(n − 1) + α)


= . (2.48)
Γ(r)x! B(A, r(n − 1) + α) Γ(A)x! B(r, r(n − 1) + α)

The random variable V1 whose p.m.f at x appears on the right hand side of eq. (2.48) is the
result of a two-step sampling procedure:

P ∼ Beta(r, r(n − 1) + α), V1 | P ∼ NB(A; P ).

We denote such a distribution as V1 ∼ BNB(A; r, r(n − 1) + α). An analogous argument


applies to e
h:  γα 
P ∼ Beta(r, r(n − 1) + α), V2 | P ∼ NB A + ;P .
K
Therefore:

h(x | x1:(n−1) ) = BNB (x | A; r, r(n − 1) + α)


 γα 
h(x | x1:(n−1) ) = BNB x | A +
e ; r, r(n − 1) + α .
K
We now bound the total variation between the BNB distributions. Because they have a
common mixing distribution, we can upper bound the distance with an integral using simple
triangle inequalities:
  1X ∞
dTV h, e
h = |P(V1 = x) − P(V2 = x)|
2 x=0
∞ Z 1
1X
= (P(V1 = x | P = p) − P(V2 = x | P = p))P(P ∈ dp)
2 x=0 0

Z 1 !
1X
≤ |P(V1 = x | P = p) − P(V2 = x | P = p)| P(P ∈ dp)
0 2 x=0
Z 1
= dTV (NB(A, p), NB(A + γα/K, p)) P(P ∈ dp).
0

125
For any p, we use eq. (2.41) to upper bound the total variation distance between negative
binomial distributions. Therefore:
  Z 1 γα p
dTV h, h ≤
e P(P ∈ dp)
0 K 1−p
Z 1
γα 1
= pr (1 − p)r(n−1)+α−2 dp
K B(r, r(n − 1) + α) 0
γα B(r + 1, r(n − 1) + α − 1) γα 1
= = .
K B(r, r(n − 1) + α) K n − 1 + α/r

Finally, we verify the condition between K e


h and Mn,. , which is showing that the following
sum is small:

X Γ(x + r) B(γα/K + x, rn + α)
γαB(x, rn + α) − K .
x=1
x!Γ(r) B(γα/K, r(n − 1) + α)

We look at the summand for x = 1 and the summation from x = 2 through ∞ separately.
For x = 1, we prove that:

Γ(r + 1) B(γα/K + 1, rn + α) 4rγ 2 α2 2 + ln(rn + α + 1)


γαB(1, rn + α) − K ≤ .
Γ(r) B(γα/K, r(n − 1) + α) K rn + α
(2.49)
Expanding gives:

B(1 + γα/K, rn + α)
γαB(1, rn + α) − K
B(γα/K, r(n − 1) + α)
|γαB(1, rn + α)B(γα/K, r(n − 1) + α) − KB(1 + γα/K, rn + α)|
= . (2.50)
B(γα/K, r(n − 1) + α)

We look at the numerator of the right hand side in eq. (2.50):

Γ(γα/K)Γ(r(n − 1) + α) Γ(1 + γα/K)Γ(rn + α)


γαB(1, rn + α) −K
Γ(γα/K + r(n − 1) + α) Γ(1 + γα/K + rn + α)
1 Γ(r(n − 1) + α) Γ(rn + α)
= γαΓ(γα/K) −
rn + α Γ(γα/K + r(n − 1) + α) Γ(γα/K + 1 + rn + α)
γαΓ(γα/K) Γ(r(n − 1) + α) Γ(rn + α + 1)
= −
rn + α Γ(γα/K + r(n − 1) + α) Γ(γα/K + 1 + rn + α)
 
γαΓ(γα/K) Γ(r(n − 1) + α) Γ(rn + α + 1)
≤ −1 + −1
rn + α Γ(γα/K + r(n − 1) + α) Γ(γα/K + 1 + rn + α)
γαΓ(γα/K) 2γα
≤ (2 + ln(rn + α + 1)),
rn + α K
where we have used eq. (2.45) with c = γα and b = r(n − 1) + α or b = rn + α + 1. In all,

126
eq. (2.50) is upper bounded by:

2γ 2 α2 2 + ln(rn + α + 1) Γ(γα/K)
rn + α K B(γα/K, r(n − 1) + α)
2 2
2γ α 2 + ln(rn + α + 1) Γ(γα/K + r(n − 1) + α)
=
rn + α K Γ(r(n − 1) + α)
2 2
4γ α 2 + ln(rn + α + 1)
≤ ,
K rn + α

since Γ(r(n−1)+α+γα/K)
Γ(r(n−1)+α)
≥ 1 − γα
K
(2 + ln(r(n − 1) + α)) ≥ 0.5 with K ≥ 2γα(2 + ln(r(n − 1) + α).
Combining with Γ(r + 1)/Γ(r) = r, this is the proof of eq. (2.49).
We now move onto the summands from x = 2 to ∞. By triangle inequality:

B(γα/K + x, rn + α)
γαB(x, rn + α) − K ≤ T1 (x) + T2 (x),
B(γα/K, r(n − 1) + α)

where:
K
T1 (x) := B(x, rn + α) γα − ,
B(γα/K, r(n − 1) + α)
B(x, rn + α) − B( γα
K
+ x, rn + α)
T2 (x) := K .
B(γα/K, r(n − 1) + α)

The helper inequalities we have proven once again are useful:

K γα
γα − ≤ (3 ln(r(n − 1) + α) + 8)
B(γα/K, r(n − 1) + α) K
K γα
≤ γα + (3 ln(r(n − 1) + α) + 8) ≤ 2γα,
B(γα/K, r(n − 1) + α) K
γα
|B(x, rn + α) − B(γα/K + x, rn + α)| ≤ B(x − 1, rn + α + 1)
K
since K ≥ γα(3 ln(r(n − 1) + α) + 8), we have applied eq. (2.46) in the first and second
inequality and eq. (2.43) in the third one. So for each x ≥ 2, each summand is at most

γα(3 ln(r(n − 1) + α) + 8) Γ(x + r)


B(x, rn + α)
K x!Γ(r)
2γ 2 α2 Γ(x + r)
+ B(x − 1, rn + α + 1).
K x!Γ(r)

To upper bound the summation from x = 2 to ∞, it suffices to bound:


∞ ∞
X Γ(x + r) X Γ(x + r) r
B(x, rn + α) ≤ B(x, rn + α) ≤ ,
x=2
Γ(r)x! x=1
Γ(r)x! r(n − 1) + α − 0.5

127
and:
∞ ∞
X Γ(x + r) X Γ(x − 1 + r + 1)
B(x − 1, rn + α + 1) ≤ r B(x − 1, rn + α + 1)
x=2
Γ(r)x! x=2
Γ(r + 1)(x − 1)!

X Γ(z + r + 1)
≤r B(z, rn + α + 1)
z=1
Γ(r + 1)z!
r(r + 1)
≤ ,
r(n − 1) + α − 0.5
where we have used eq. (2.44) in each upper bound. So the summation from x = 2 to ∞ is
upper bounded by:
γα(3 ln(r(n − 1) + α) + 8) r 2γ 2 α2 r(r + 1)
+ (2.51)
K r(n − 1) + α − 0.5 K r(n − 1) + α − 0.5
eqs. (2.49) and (2.51) combine to give:

X
Mn,x − K e
h(x | x1:(n−1) = 0n−1 )
x=1
γα (4γα + 3) ln(rn + α + 1) + (10 + 2r)γα + 24
≤ .
K n − 1 + (α − 0.5)/r

2.G Proofs of CRM bounds


2.G.1 Upper bound
Proof of Theorem 2.4.1. We first give explicit formulas for the constants C ′ , C ′′ , C ′′′ , C ′′′ . Let
β be the smallest positive constant where β 2 /(1 + β) ≥ 4/C1 . Such constant exists because
β 2 /(1 + β) is an increasing function. The constants are

C ′ = (β + 1)C1 ln(1 + 1/C1 ) [4C1 ln(1 + 1/C1 ) + C5 ]


+ C12 ψ1 (C1 ) + exp(2C1 (ψ(C1 ) + 1))
+ (β + 1)2C1 ln(1 + 1/C1 ) + C2 C3 ,

and
C ′′ =(β + 1)C1 (2C1 + C4 ) + [(β + 1)C1 + C2 ]/ ln 2
+ (β + 1) [C1 (4C1 ln(1 + 1/C1 ) + C5 ) + (2C1 + C4 )C1 ln(1 + 1/C1 )] / ln 2,

and
C ′′′ = (β + 1)2C12 ln(1 + 1/C1 ),
C ′′′′ = (β + 1)2C12 ln(1 + 1/C1 ) + (β + 1)C1 .
By the end of the proof, the reasoning for these constants will be clear.

128
We will focus on the case where the approximation level K is Ω(ln N ):

K ≥ max {(β + 1) max(C(K, C1 ), C(N, C1 )), C2 (ln N + C3 )} , (2.52)

where C(N, α) is the growth function from eq. (2.18). To see why it is sufficient, consider the
case where K < max {(β + 1) max(C(K, C1 ), C(N, C1 )), C2 (ln N + C3 )}. This implies that
K is smaller than a sum
K < (β + 1)(C(N, C1 ) + C(K, C1 )) + C2 (ln N + C3 )
≤ [(β + 1)C1 + C2 ] ln N + (β + 1)C1 ln K + (β + 1)2C1 ln(1 + 1/C1 ) + C2 C3

where we have used upper bound on the growth function from Lemma 2.E.10. Total variation
distance is always upper bounded by 1. Hence, dTV (PN,∞ , PN,K ) is at most

[(β + 1)C1 + C2 ] ln N + (β + 1)C1 ln K + (β + 1)2C1 ln(1 + 1/C1 ) + C2 C3


K
which is smaller than
Ĉ (0) + Ĉ (1) ln2 N + Ĉ (2) ln K
(2.53)
K
where
Ĉ (0) = (β + 1)2C1 ln(1 + 1/C1 ) + C2 C3 ,
Ĉ (1) = [(β + 1)C1 + C2 ]/ ln 2,
Ĉ (2) = (β + 1)C1 .
In the sequel, we will only consider the situation in eq. (2.52).
First, we argue that it suffices to bound the total variation distance between the trait-
allocation matrices coming from the target model and the approximate model. Given the
latent measures X1 , X2 , . . . , XN from the target model, we can read off the feature-allocation
matrix F , which has N rows and as many columns as there are unique atom locations among
the Xi ’s:

1. The i-th row of F records the atom sizes of Xi .

2. Each column corresponds to an atom location: the locations are sorted first according to
the index of the first measure Xi to manifest it (counting from 1, 2, . . .), and then its atom
size in Xi .

For illustration, suppose X1 = 3δψ1 + 4δψ2 + 4δψ3 , X2 = 2δψ1 + δψ3 + δψ4 + 2δψ5 and
X3 = 6δψ2 + 2δψ3 + δψ5 + 2δψ6 + 3δψ7 . Then the associate trait-allocation matrix has 3 rows
and 7 columns and has entries equal to
 
3 4 4 0 0 0 0
2 0 1 1 2 0 0 . (2.54)
0 6 2 0 1 2 3

The marginal process that described the atom sizes of Xn | Xn−1 , Xn−2 , . . . , X1 in Proposi-
tion 2.C.1 is also the description of how the rows of F are generated. The joint distribution

129
X1 , X2 , . . . , Xn can be two-step sampled. First, the trait-allocation matrix F is sampled.
Then, the atom locations are drawn iid from the base measure H: each column of F is
assigned an atom location, and the latent measure Xi has atom size Fi,j on the jth atom
location. A similar two-step sampling generates Z1 , Z2 , . . . , Zn , the latent measures under
the approximate model: the distribution over the feature-allocation matrix F ′ follows Propo-
sition 2.C.2 instead of Proposition 2.C.1, but conditioned on the feature-allocation matrix,
the process generating atom locations and constructing latent measures is exactly the same.
In other words, this implies that the conditional distributions Y1:N | F and W1:N | F ′ when
F = F ′ are the same, since both models have the same the observational likelihood f given
the latent measures 1 through N . Denote PF to be the distribution of the feature-allocation
matrix under the target model, and PF ′ the distribution of the feature-allocation matrix
under the approximate model. Lemma 2.E.8 implies that

dTV (PN,∞ , PN,K ) ≤ inf P(F ̸= F ′ ). (2.55)


F,F ′ coupling of PF ,PF ′

Next, we parametrize the trait-allocation matrices in a way that is convenient for the analysis
of total variation distance. Let J be the number of columns of F . Our parametrization
involves dn,x , for n ∈ [N ] and x ∈ N, and sj , for j ∈ [J]:

1. For n = 1, 2, . . . , N :

(a) If n = 1, for each x ∈ N, d1,x counts the number of columns j where F1,j = x.
(b) For n ≥ 2, for each x ∈ N, let Jn = {j : ∀i < n, Fi,j = 0} i.e. no observation before n
manifests the atom locations indexed by columns in Jn . For each x ∈ N, dn,x counts
the number of columns j ∈ Jn where Fn,j = x.

2. For j = 1, 2, . . . , J, let Ij = min{i : Fi,j > 0} i.e. the first row to manifest the j-th atom
location. Let sj = FIj :N,j i.e. the history of the j-th atom location.

In words, dn,x is the number of atom locations that is first instantiated by the individual n
P∞
and each atom has size x, while sj is the history of the j-th atom location. N
P
n=1 x=1 dn,x
is exactly J, the number of columns. For the example in eq. (2.54):

1. For n = 1, 2, . . . , 3:

(a) For n = 1, d1,1 = d1,2 = d1,j = 0 for j > 4. d1,3 = 1, d1,4 = 2.


(b) For n = 2, d2,1 = 1, d2,2 = 1, d2,j = 0 for j > 2.
(c) For n = 3, d3,1 = 0, d3,2 = 1, d3,3 = 1, d3,j = 0 for j > 3.

2. For j = 1, 2, . . . , 7, s1 = [3, 2, 0], s2 = [4, 0, 6], s3 = [4, 1, 2], s4 = [1, 0], s5 = [2, 1], s6 = [2],
s7 = [3].

We use the short-hand d to refer to the collection of dn,x and s the collection of sj . There is
a one-to-one mapping between (d, s) and the trait-allocation matrix f , since we can read-off
(d, s) from f and use (d, s) to reconstruct f. Let (D, S) be the distribution of d and s under

130
the target model, while (D′ , S ′ ) is the distribution under the approximate model. We have
that
dTV (PN,∞ , PN,K ) ≤ ′ ′
inf P((D, S) ̸= (D′ , S ′ )).
(D,S),(D ,S ) coupling of PD,S ,PD′ ,S ′

To find an upper bound on dTV (PN,∞ , PN,K ), we will demonstrate a joint distribution such
that P((D, S) ̸= (D′ , S ′ )) is small. The rest of the proof is dedicated to that end. To start,
we only assume that (D, S, D′ , S ′ ) is a proper coupling, in that marginally (D, S) ∼ PD,S and
(D′ , S ′ ) ∼ PD′ ,S ′ . As we progress, gradually more structure is added to the joint distribution
(D, S, D′ , S ′ ) to control P((D, S) ̸= (D′ , S ′ )).
We first decompose P((D, S) ̸= (D′ , S ′ )) into other probabilistic quantities which can be
analyzed using Condition 2.4.1. Define the typical set:
( N X∞
)
X
D∗ = d : dn,x ≤ (β + 1) max(C(K, C1 ), C(N, C1 )) .
n=1 x=1

d ∈ D∗ means that the trait-allocation matrix f has a small number of columns. The claim
is that:

P((D, S) ̸= (D′ , S ′ )) ≤ P(D ̸= D′ ) + P(S ̸= S ′ | D = D′ , D ∈ D∗ ) + P(D ∈


/ D∗ ). (2.56)

This is true from basic properties of probabilities and conditional probabilities:

P((D, S) ̸= (D′ , S ′ ))
= P(D ̸= D′ ) + P(S = ̸ S ′ , D = D′ )
= P(D ̸= D′ ) + P(S = ̸ S ′ , D = D′ , D ∈ D∗ ) + P(S ̸= S ′ , D = D′ , D ∈
/ D∗ )
≤ P(D ̸= D′ ) + P(S = ̸ S ′ | D = D′ , D ∈ D∗ ) + P(D ∈/ D∗ ),

The three ideas behind this upper bound are the following. First, because of the growth
condition, we can analyze the atypical set probability P(D ∈ / D∗ ). Second, because of the
total variation between h and eh, we can analyze P(S ̸= S ′ | D = D′ , D ∈ D∗ ). Finally, we can
analyze P(D ̸= D′ ) because of the total variation between K e h and Mn,. . In what follows we
carry out the program.
Atypical set probability. The P(D ∈ / D∗ ) term in eq. (2.56) is easiest to control. Under
the target model Proposition 2.C.1, the Di,x ’s are independent Poissons with mean Mi,x ,
P P∞ PN P∞
so the sum N i=1 x=1 Di,x is itself a Poisson with mean M = i=1 x=1 Mi,x . Because of
Lemma 2.E.3, for any x > 0:

N X
!
x2
X  
P Di,x > M + x ≤ exp − .
i=1 x=1
2(M + x)

For the event P(D ∈ / D∗ ), M + x = (β + 1) max(C(K, C1 ), C(N, C1 )), M ≤ C(N, C1 ) due to


eq. (2.14), so that x ≥ β max(C(K, C1 ), C(N, C1 )). Therefore:

β2
 
P(D ∈ ∗
/ D ) ≤ exp − max(C(K, C1 ), C(N, C1 )) . (2.57)
2(β + 1)

131
Difference between histories. To minimize the difference probability between the histories
of atom sizes i.e. the P(S ̸= S ′ | D = D′ , D ∈ D∗ ) term in eq. (2.56), we will use eq. (2.16).
The claim is, there exists a coupling of S ′ | D′ and S | D such that:
(β + 1) max(C(K, C1 ), C(N, C1 ))
P(S ̸= S ′ | D = D′ , D ∈ D∗ ) ≤ C(N, C1 ). (2.58)
K
Fix some d ∈ D∗ — since we are in the typical set, the number of columns in the trait-
allocation matrix is at most (β + 1) max(C(K, C1 ), C(N, C1 )). Conditioned on D = d, there
is a finite number of history variables S, one for each atom location; similar for conditioning
of S ′ on D′ = d. For both the target and the approximate model, the density of the joint
distribution factorizes:
YJ
P(S = s | D = d) = P(Sj = sj | D = d)
j=1
J
Y
′ ′
P(S = s | D = d) = P(Sj′ = sj | D′ = d),
j=1

since in both marginal processes, the atom sizes for different atom locations are independent
of each other. Each Sj (or Sj′ ) only takes values from a countable set. Therefore, by
Lemma 2.E.7,
J
X
dTV (PS | D=d , PS ′ | D′ =d ) ≤ dTV (PSj | D=d , PSj′ | D′ =d ).
j=1

We inspect each dTV (PSj | D=d , PSj′ | D′ =d ). Fixing d also fixes Ij , the first row to manifest the
j-th atom location. The history sj is then a N − Ij + 1 dimensional integer vector, whose tth
entry is the atom size over the jthe atom location of the t + Ij − 1 row. Because of eq. (2.16),
we know that conditioned on the same partial history Sj (1 : (t − 1)) = Sj′ (1 : (t − 1)) = s,
the distributions Sj (t) and Sj′ (t) are very similar. The conditional distribution Sj (t) | D =
d, Sj (1 : (t − 1)) = s is governed by h Proposition 2.C.1 while Sj′ (t) | D′ = d, Sj′ (1 : (t − 1)) = s
is governed by e h Proposition 2.C.2. Hence:
  1 C1
dTV PSj (t) | D=d,Sj (1:(t−1))=s , PSj (t) | D′ =d,Sj (1:(t−1))=s ≤ 2
′ ′ ,
K t + Ij − 2 + C1
for any partial history s. To use this conditional bound, we repeatedly use Lemma 2.E.6 to com-
pare the joint Sj = (Sj (1), Sj (2), . . . , Sj (N −Ij +1)) with the joint Sj′ = (Sj′ (1), Sj′ (2), . . . , Sj′ (N −
Ij + 1)), peeling off one layer of random variables (indexed by t) at a time.
dTV (PSj | D=d , PSj′ | D′ =d )
N −Ij +1  
X
≤ max dTV PSj (t) | D=d,Sj (1:(t−1))=s , PSj′ (t) | D′ =d,Sj′ (1:(t−1))=s
s
t=1
N −Ij +1
X 1 C1
≤ 2
t=1
K t + Ij − 2 + C1
C(N, C1 )
≤2 .
K

132
Multiplying the right hand side by (β + 1) max(C(K, C1 ), C(N, C1 )), the upper bound on J,
we arrive at the same upper bound for the total variation between PS | D=d and PS ′ | D′ =d in
eq. (2.58). Furthermore, our analysis of the total variation can be back-tracked to construct
the coupling between the conditional distributions S | D = d and S ′ | D′ = d which attains
that small probability of difference because all the distributions being analyzed are discrete.
Since the choice of conditioning d ∈ D∗ was arbitrary, we have actually shown eq. (2.58).
Difference between new atom sizes. Finally, to control the difference probability for
the distribution over new atom sizes i.e. the P(D ̸= D′ ) term in eq. (2.56), we will utilize
eqs. (2.15) and (2.17). For each n, define the short-hand d1:n to refer to the collection di,x for
i ∈ [n], x ∈ N, and the typical sets:
( n X ∞
)
X
Dn∗ = d1:n : di,x ≤ (β + 1) max(C(K, C1 ), C(N, C1 )) .
i=1 x=1

The type of expansion performed in eq. (2.56) can be done once here to see that:
P(D ̸= D′ )
′ ′
= P((D1:(N −1 , DN ) ̸= (D1:(N −1) , DN ))

≤ P(D1:(N −1) ̸= D1:(N −1) )
′ ′ ∗
+ P(DN ̸= DN | D1:(N −1) = D1:(N −1) , D1:(N −1) ∈ Dn−1 )

+ P(D1:(N −1) ∈
/ Dn−1 ).
Apply the expansion once more to P(D1:(N −1) = ′
̸ D1:(N −1) ), then to P(D1:(N −2) ̸= D1:(N −2) ).

If we define:
Bj = P(Dj ̸= Dj′ | D1:(j−1) = D1:(j−1)
′ ∗
, D1:(j−1) ∈ Dj−1 ),
with the special case B1 simply being P(D1 ̸= D1′ ), then:
N
X N
X

P(D ̸= D ) ≤ Bj + P(D1:(j−1) ∈ ∗
/ Dj−1 ). (2.59)
j=1 j=2

The second summation inP eq. (2.59), comprising


PN ofPonly atypical probabilities, is easier to
j−1 P∞ ∞
control. For any j, since i=1 x=1 Di,x ≤ i=1 x=1 Di,x , P(D1:(j−1) ∈ ∗
/ Dj−1 ) ≤ P(D ∈ /
D ), so a generous upper bound for the contribution of all the atypical probabilities including

the first one from eq. (2.57) is


N
X
∗ ∗
P(D ∈
/ D )+ P(D1:(j−1) ∈
/ Dj−1 )
j=2

β2
  
≤ exp − max(C(K, C1 ), C(N, C1 )) − ln N .
2(β + 1)
By Lemma 2.E.10, max(C(K, C1 ), C(N, C1 )) ≥ C1 (max(ln N, ln K) − C1 (ψ(C1 ) + 1)). Since
β2
we have set β so that β+1 C1 = 4, we have
β2
max(C(K, C1 ), C(N, C1 )) − ln N ≥ 2 max(ln N, ln K) − 2C1 (ψ(C1 ) + 1) − ln N
2(β + 1)
≥ ln K − 2C1 (ψ(C1 ) + 1).

133
meaning the overall atypical probabilities is at most
N
X exp(2C1 (ψ(C1 ) + 1))
/ D∗ ) +
P(D ∈ P(D1:(j−1) ∈ ∗
/ Dj−1 )≤ . (2.60)
j=2
K

As for the first summation in eq. (2.59), we look at the individual Bj ’s. For any fixed d1:(j−1) ∈

Dj−1 , we claim that there exists a coupling between the conditionals Dj | D1:(j−1) = d1:(j−1)
and Dj′ | D1:(j−1)

= d1:(j−1) such that P(Dj ̸= Dj′ | D1:(j−1) = D1:(j−1)

= d1:(j−1) ) is at most

C12 1 1
2
+ [C4 ln j + C5 + (β + 1) max(C(K, C1 ), C(N, C1 ))] . (2.61)
K (j − 1 + C1 ) j − 1 + C1
Because the upper bound holds for arbitrary values d1:(j−1) , the coupling actually ensures that,
as long as D1:(j−1) = D1:(j−1)

for some value in Dj−1∗
, the probability of difference between
Dj and Dj is small i.e. Bj is at most the right hand side.

We demonstrate the existence of a distribution U = {Ux }∞ x=1 of independent Poisson random


variables, such that both the total variation between PDj | D1:(j−1) =d1:(j−1) and PU and the total
variation between PDj′ | D1:(j−1)
′ =d1:(j−1) and PU are small. Here, each Ux has mean:

j−1 ∞
!
X X
E(Ux ) = K− di,y h(x | x1:(j−1) = 0).
e
i=1 y=1

On the one hand, conditioned on D1:(j−1)



= d1:(j−1) , Dj′ = {Dj,x

x=1 is the joint distribution of
}∞
Pj−1 P∞
types of successes of type x, where there are K − i=1 x=1 di,x independent trials and types
x success has probability e h(x | x1:(j−1) = 0) by Proposition 2.C.2. Because of Lemma 2.E.4
and eq. (2.15):
j−1 ∞
! ∞ !2
X X X
P(D′ ̸= U | D′
j = d1:(j−1) ) ≤ K −
1:(j−1) di,y h(x | x1:(j−1) = 0)
e
i=1 y=1 x=1
 2
1 C1
≤K
K j − 1 + C1
C2 1
≤ 1 . (2.62)
K (j − 1 + C1 )2
On the other hand, conditioned on D1:(j−1) , Dj = {Dj,x }∞x=1 consists of independent Poissons,
where the mean of Dj,x is Mj,x by Proposition 2.C.1. We show that there exists a coupling of
PU and PDj such that
X∞
P(U ̸= Dj ) ≤ dTV (PUx , PDj,x ). (2.63)
x=1

For each x ≥ 1, let Ox be the maximal coupling distribution between PUx and PDj,x i.e. for
(A, B) ∼ Ox , P(A ≠ B) = dTV (PUx , PDj,x ). Such Ox exists because both PUx and PDj,x are
Poisson (hence discrete) distributions. Furthermore, since Ox is itself a discrete distribution,
the conditional distributions Dj,x | Ux exists. Denote the natural zig-zag bijection from

134
{N ∪ 0}2 to N to be L.21 Denote by Fx the cdf of the distribution of L(A, B) for (A, B) ∼ Ox .
To generate samples from Ox , it suffices to generate samples from Fx and transform using
the inverse of L. Consider the following coupling of PU and PDj :

• Generate i.i.d uniform random random variables V1 , V2 , . . .

• For x ≥ 1, let (Ux , Dj,x ) = L−1 (Fx−1 (Vx )).

Marginally, each Ux (or Dj,x ) is Poisson with the right mean, and across x, the Ux (or Dj,x )
are independent of each other because we use i.i.d uniform r.v’s. Alternatively, the conditional
distribution of Dj | U implied by this joint distribution is as follows:

• For x ≥ 1, sample Ux | Dj,x from the conditional distribution implied by the maximal
coupling Ox .

If U is different from Dj , it must be that for at least one x, Ux ̸= Dj,x . Therefore



X
P(U ̸= Dj ) ≤ P(Ux ̸= Dj,x ).
x=1

Since the coupling (Ux , Dj,x ) attains the dTV (PUx , PDj,x ), we are done. From Lemma 2.E.5,
we know

X
dTV (PUx , PDj,x )
x=1
∞ j−1 ∞
!
X X X
≤ Mj,x − K− di,y h(x|x1:(j−1) = 0)
e
x=1 i=1 y=1
∞ j−1 ∞
!
X X X
≤ |Mj,x − K e
h(x | x1:(j−1) = 0)| + h(x | x1:(j−1) = 0)
di,y e
x=1 i=1 y=1
∞ j−1 ∞
! ∞
!
X X X X
≤ |Mj,x − K e
h(x | x1:(j−1) = 0)| + di,y h(x | x1:(j−1) = 0) .
e (2.64)
x=1 i=1 y=1 x=1

The first term is upper


P∞ bounded by eq. (2.17). Regarding the second term, since we are in the
typical set, i=1 y=1 di,y is small and we also use eq. (2.15). Therefore the overall bound
Pj−1
on the second term is:
1 C1
(β + 1) max(C(K, C1 ), C(N, C1 )) .
K j − 1 + C1

Combining the two bounds and eq. (2.63) give the following bound on P(U ̸= Dj ):

1 C4 ln j + C5 1 C1
P(U ̸= Dj ) ≤ + (β + 1) max(C(K, C1 ), C(N, C1 )) . (2.65)
K j − 1 + C1 K j − 1 + C1
21
L(0, 0) = 1, L(0, 1) = 2, L(1, 0) = 3, L(2, 0) = 4, L(1, 1) = 5, L(0, 2) = 6 and so on.

135
We now show how the combination of eqs. (2.62) and (2.65) imply eq. (2.61). From eq. (2.65),
there exists a coupling of PU and PDj such that the difference probability is small. From
eq. (2.62), there exists a coupling of PU and PDj′ | D1:(j−1)
′ =d1:(j−1) such that the difference
probability is small. In both cases, we can sample from the conditional distribution based
on U . Dj | U exists because of the discussion after eq. (2.63), while Dj′ | D1:(j−1)

= d1:(j−1) , U
exists because of Lemma 2.E.4. Therefore, we can glue the two couplings together, by first
sampling U , and then sample from the appropriate conditional distributions. By taking
expectations of the simple triangle inequality for the discrete metric i.e.

1{Dj ̸= Dj′ } ≤ 1{Dj ̸= U } + 1{Dj′ ̸= U },

we reach eq. (2.61).


We sum of the right hand side of eq. (2.61) across j. This shows that Nj=1 Bj is at most
P

N
!
C12 X 1 (β + 1) max(C(K, C1 ), C(N, C1 ))
+ C(N, C1 )
K j=1
(j − 1 + C1 )2 K
C4 ln N + C5
+ C(N, C1 ).
K
The first term is upper bounded by the trigamma function ψ1 (·):
N
C12 X 1 C12 ψ1 (C1 )
≤ .
K j=1 (j − 1 + C1 )2 K

This means, an upper bound on Bj is


PN
j=1

C12 ψ1 (C1 ) β + 1
+ C(N, C1 ) [(C1 + C4 ) ln N + C1 ln K + 2C1 ln(1 + 1/C1 ) + C5 ] . (2.66)
K K
Because of eqs. (2.59), (2.60) and (2.66), we can couple D and D′ such that P(D =
̸ D′ )+P(D ∈
/
D ) is at most

C12 ψ1 (C1 ) + exp(2C1 (ψ(C1 ) + 1))


K (2.67)
β+1
+ C(N, C1 ) [(C1 + C4 ) ln N + C1 ln K + 2C1 ln(1 + 1/C1 ) + C5 ] .
K
Aggregating the results from eqs. (2.58) and (2.67), we have that dTV (PN,∞ , PN,K ) is at most

C12 ψ1 (C1 ) + exp(2C1 (ψ(C1 ) + 1))


K
β+1
+ C(N, C1 ) [max(C(K, C1 ), C(N, C1 )) + (C1 + C4 ) ln N ]
K
β+1
+ C(N, C1 ) [C1 ln K + 2C1 ln(1 + 1/C1 ) + C5 ] .
K

136
We expand the sum of the last two term by upper bounding max(C(K, C1 ), C(N, C1 )) by
C(K, C1 ) + C(N, C1 ) and using the upper bound Lemma 2.E.10. The end result is

C̃ (0) + C̃ (1) ln K + C̃ (2) ln N + C̃ (3) ln N ln K + C̃ (4) ln2 N


K
where C̃ (0) is equal to

(β + 1)C1 ln(1 + 1/C1 ) [4C1 ln(1 + 1/C1 ) + C5 ] + C12 ψ1 (C1 ) + exp(2C1 (ψ(C1 ) + 1)),

and
C̃ (1) = (β + 1)2C12 ln(1 + 1/C1 ),
C̃ (2) = (β + 1) [C1 (4C1 ln(1 + 1/C1 ) + C5 ) + (2C1 + C4 )C1 ln(1 + 1/C1 )] ,
C̃ (3) = (β + 1)2C12 ln(1 + 1/C1 ),
C̃ (4) = (β + 1)C1 (2C1 + C4 ).
Since N is a natural number, N ≥ 1 we can write ln N ≤ (1/ ln 2) ln2 N , to simplify the
upper bound on total variation as
 
C̃ (0) + C̃ (4) + C̃ (2) / ln 2 ln2 N + C̃ (3) ln N ln K + C̃ (1) ln K
. (2.68)
K
Taking the sum of individual coefficients in front of ln2 N (et cetera) between eq. (2.68) and
eq. (2.53) yields the constants at the beginning of the proof.
In applications, the observational likelihood f and the ground measure H might be random
rather than fixed quantities. For instance, in linear–Gaussian beta–Bernoulli processes without
good prior information, probabilistic models put priors on the variances of the Gaussian
features as well as the noise in observed data. In such cases, the AIFAs remain the same
as the in Theorem 2.B.2 (or Corollary 2.3.3) since the rate measure ν is still fixed. The
above proof of Theorem 2.4.1 can be easily extended to the case where f and H are random,
because the argument leading to eq. (2.55) retains validity when f and H have the same
distribution under the target and the approximate model. For completeness, we state the
error bound in such cases where hyper-priors are used.

Corollary 2.G.1 (Upper bound for hyper-priors). Let H be a prior distribution for ground
measures H and F be a prior distribution for observational likelihoods f. Suppose the target
model is
H ∼ H(.),
f ∼ F(.),
Θ | H ∼ CRM(H, ν),
i.i.d.
Xn | Θ ∼ LP(ℓ, Θ), n = 1, 2, . . . , N,
indep
Yn | f, Xn ∼ f (· | Xn ), n = 1, 2, . . . , N.

137
The approximate model, with νK as in Theorem 2.B.2 (or Corollary 2.3.3), is
H ∼ H(.),
f ∼ F(.),
ΘK | H ∼ IFAK (H, νK ),
i.i.d.
Zn | ΘK ∼ LP(ℓ, ΘK ), n = 1, 2, . . . , N,
indep
Wn | f, Zn ∼ f (· | Zn ), n = 1, 2, . . . , N.
If Assumption 2.3.1 and Condition 2.4.1 hold, then there exist positive constants C ′ , C ′′ , C ′′′
depending only on {Ci }5i=1 such that
C ′ + C ′′ ln2 N + C ′′′ ln N ln K
dTV (PY1:N , PW1:N ) ≤ .
K
The upper bound in Corollary 2.G.1 is visually identical to Theorem 2.4.1, and has no
dependence on the hyper-priors H or F.

2.G.2 Lower bound


Proof of Theorem 2.4.2. First we mention which probability kernel f results in the large
total variation distance: the pathological f is the Dirac measure i.e., f (· | X) := δX (.). With
this conditional likelihood Xn = Yn and Zn = Wn , meaning:
BP BP
dTV (PN,∞ , PN,K ) = dTV (PX1:N , PZ1:N ).
Now we discuss why the total variation is lower bounded by the function of N . Let A be the
event that there are at least 12 γC(N, α) unique atom locations in among the latent states:
 
1
A := x1:N : #unique atom locations ≥ γC(N, α) .
2
The probabilities assigned to this event by the approximate and the target models are very
different from each other. On the one hand, since K < γC(N,α)
2
, under AIFAK , A has measure
zero:
PZ1:N (A) = 0. (2.69)
On the other hand, under beta–Bernoulli, the number of unique atom locations drawn
is a Poisson random variable with mean exactly γC(N, α) — see Proposition 2.C.1 and
Proposition 2.C.2. The complement of A is a lower tail event. By Lemma 2.E.3 with
λ = γC(N, α) and x = 12 γC(N, α):
 
γC(N, α)
PX1:N (A) ≥ 1 − exp − . (2.70)
8
Because of Lemma 2.E.10, we can lower bound C(N, α) by a multiple of ln N :
constant
   
γC(N, α) γα ln N αγ(ψ(α) + 1)
exp − ≤ exp − + = .
8 8 8 N γα/8
We now combine eqs. (2.69) and (2.70) and recall that total variation is the maximum over
discrepancy in probabilistic masses.

138
The proof of Theorem 2.4.3 relies on the ability to compute a lower bound on the total
variation distance between a binomial distribution and a Poisson distribution.

Proposition 2.G.2 (Lower bound on total variation between binomial and Poisson). For
all K, it is true that
    2
γ/K γ/K
dTV Poisson (γ) , Binom K, ≥ C(γ)K ,
γ/K + 1 γ/K + 1

where
1 1
C(γ) = .
8 γ + exp(−1)(γ + 1) max(12γ 2 , 48γ, 28)
Proof of Proposition 2.G.2. We adapt the proof of [13, Theorem 2] to our setting. The
Poisson(γ) distribution satisfies the functional equality:

E[γy(Z + 1) − Zy(Z)] = 0, (2.71)

where y is any real-valued function and Z ∼ Poisson(γ).


Denote γK = γ/K+1
γ
. For m ∈ N, let

m2
 
x(m) = m exp − ,
γK θ

where θ is a constant which will be specified later. x(m) serves as a test function to lower bound
the total variation distance between Poisson(γ) and Binom (K, γK /K). Let Xi ∼ Ber( γKK ),
independently across i from 1 to K, and W = K i=1 . Then W ∼ Binomial (K, γK /K). The
P
following identity is adapted from [13, Equation 2.1]:
K
 γ 2 X
K
E[γK x(W + 1) − W x(W )] = E[x(Wi + 2) − x(Wi + 1)], (2.72)
K i=1

where Wi = W − Xi .
We first argue that the right hand side is not too small i.e. for any i,
2
3γK + 12γK + 7
E[x(Wi + 2) − x(Wi + 1)] ≥ 1 − . (2.73)
θγK

Consider the derivative of x(m):

m2 2m2 3m2
  
d
x(m) = exp − 1− ≥1− ,
dm γK θ γK θ θγK

because of the easy-to-verify inequality e−x (1 − 2x) ≥ 1 − 3x for x ≥ 0. This means that
Z Wi +2 
3m2

1
x(Wi + 2) − x(Wi + 1) ≥ 1− dm = 1 − (3Wi2 + 9Wi + 7).
Wi +1 θγK θγK

139
Taking expectations, noting that E(Wi ) ≤ γK and E(Wi2 ) = Var(Wi ) + [E(Wi )]2 ≤
PK γK
j=1 K +
(γK )2 = γK
2
+ γK we have proven eq. (2.73).
Now, because of positivity of x, and that γ ≥ γK , we trivially have

E[γx(W + 1) − W x(W )] ≥ E[γK x(W + 1) − W x(W )]. (2.74)

Combining eq. (2.72), eq. (2.73) and eq. (2.74) we have that
 γ 2  2

K 3γK + 12γK + 7
E[γx(W + 1) − W x(W )] ≥ K 1− .
K θγK
 
Recalling eq. (2.71), for any coupling (W, Z) such that W ∼ Binom K, γ/K+1 and Z ∼
γ/K

Poisson(γ):
2 2
 
γK 3γK + 12γK + 7
E[γ(x(W + 1) − x(Z + 1)) + Zx(Z) − W x(W )] ≥ 1− .
K θγK
Suppose (W, Z) is the maximal coupling attaining the total variation distance between PW
and PZ i.e. P(W ̸= Z) = dTV (PY , PZ ). Clearly,

γ(x(W + 1) − x(Z + 1)) + Zx(Z) − W x(W )


≤ 1{W ̸= Z} sup |(γx(m1 + 1) − m1 x(m1 )) − (γx(m2 + 1) − m2 x(m2 ))|
m1 ,m2

≤ 21{W ̸= Z} sup |(γx(m + 1) − mx(m)|.


m

Taking expectations on both sides, we conclude that


γ2 2
 
3γK + 12γK + 7
2dTV (PW , PZ ) × sup |γx(m + 1) − mx(m)| ≥ K 1− . (2.75)
m K θγK
It remains
 to upper2 bound
 supm |γx(m + 1) − mx(m)|. Recall that the derivative of x is
2
exp − γmK θ 1 − 2m γK θ
, taking values in [−2e−3/2 , 1]. This means for any m, −2e−3/2 ≤
x(m + 1) − x(m) ≤ 1. Hence:

|γx(m + 1) − mx(m)| = |γ(x(m + 1) − x(m)) + (γ − m)x(m)|


m2
 
≤ γ + (m + γ)m exp −
γK θ
m2
 
2
≤ γ + (γ + 1)m exp −
γK θ
≤ γ + θγK (γ + 1) exp(−1). (2.76)

where the last inequality owes to the easy-to-verify x exp(−x) ≤ exp(−1). Combining
eq. (2.76) and eq. (2.75) we have that
3γ 2 +12γ +7
1 − K θγK K
   
γ/K 1  γ 2
K
dTV Binomial K, , Poisson(γ) ≥ K .
γ/K + 1 2 γ + (γ + 1)θγK exp(−1) K

140
 
Finally, we calibrate θ. By selecting θ = max 12γK , γ28K , 48 we have that the numerator
of the unwieldy fraction is at least 14 and its denominator is at most γ + exp(−1)(γ +
1) max(12γ 2 , 48γ, 28), because γK < γ. This completes the proof.
Proof of Theorem 2.4.3. The constant C in the theorem statement is

C := γ 2 / γ + exp(−1)(γ + 1) max(12γ 2 , 48γ, 28) ,




which is equal to γ 2 C(γ), with C(γ) from Proposition 2.G.2.


First we mention which Pprobability kernel f results in the large total variation distance. For
any discrete measure M i=1 δψi , f is the Dirac measure sitting on M , the number of atoms.

M
X
f (. | δψi ) := δM (.). (2.77)
i=1

Now we show that under such f , the total variation distance is lower bounded. From
Lemma 2.E.9, we know that
BP BP
dTV (PN,∞ , PN,K ) = dTV (PY1:N , PW1:N ) ≥ dTV (PY1 , PW1 ).

Hence it suffices to show:


γ2 1
dTV (PY1 , PW1 ) ≥ C(γ) .
K (1 + γ/K)2

Recall the generative process defining PY1 and PW1 . Y1 is an observation from the target
beta–Bernoulli model, and the functions h, eh, and Mn,x are given in Example 2.4.1. By
Proposition 2.C.1,
NT
i.i.d.
X
NT ∼ Poisson(γ), ψk ∼ H, X1 = δψk , Y1 ∼ f (. | X1 ).
i=1

W1 is an observation from the approximate model, so by Proposition 2.C.2,


  NA
γ/K i.i.d.
X
NA ∼ Binom K, , ϕk ∼ H, Z1 = δϕ k , W1 ∼ f (. | Z1 ).
1 + γ/K i=1

Because of the choice of f , Y1 = NT and W1 = NA . Hence, by Proposition 2.G.2,

dTV (PY1 , PW1 ) = dTV (PNT , PNA )


γ2 1
≥ C(γ) .
K (1 + γ/K)2

141
2.H DPMM results
We consider Dirichlet process mixture models [6]

Θ ∼ DP(α, H),
i.i.d.
Xn | Θ ∼ Θ, n = 1, 2, . . . , N, (2.78)
indep
Yn | Xn ∼ f (· | Xn ), n = 1, 2, . . . , N.

with corresponding approximation

ΘK ∼ FSDK (α, H),


i.i.d.
Zn | ΘK ∼ ΘK , n = 1, 2, . . . , N, (2.79)
indep
Wn | Zn ∼ f (· | Zn ), n = 1, 2, . . . , N, .

Let PN,∞ be the distribution of the observations Y1:N . Let PN,K be the distribution of the
observations W1:N .

2.H.1 Upper bound


Upper bounds on the error made by FSDK can be used to determine the sufficient K
to approximate the target process for a given N and accuracy level. We upper bound
dTV (PN,∞ , PN,K ) in Theorem 2.H.1.
Theorem 2.H.1 (Upper bound for DPMM). For some constants C ′ , C ′′ , C ′′′ , C ′′′′ that only
depend on α,

C ′ + C ′′ ln2 N + C ′′′ ln N ln K + C ′′′′ ln K


dTV (PN,∞ , PN,K ) ≤ .
K
The proof and explicit values of the constants are given in section 2.I.1. Theorem 2.H.1
is similar to Theorem 2.4.1, although the exact values of the constants C ′ , C ′′ , C ′′′ , C ′′′′ are
different. The O(ln2 N ) growth of the bound for fixed N can likely be reduced to O(ln N ), the
inherent growth rate of DP mixture models [9, Section 5.2]. The O(ln K/K) rate of decrease
to zero is tight because of a 1/K lower bound on the approximation error. Theorem 2.H.1 is an
improvement over the existing theory for FSDK , in the sense that Ishwaran and Zarepour [89,
Theorem 4] provide an upper bound on dTV (PN,∞ , PN,K ) that lacks an explicit dependence
on K or N — that bound cannot be inverted to determine the sufficient K to approximate
the target to a given accuracy, while it is simple to determine using Theorem 2.H.1.

2.H.2 Lower bounds


As Theorem 2.H.1 is only an upper bound, we now investigate the tightness of the inequality
in terms of N and K. We first look at the dependence of the error bound in terms of ln N .
Theorem 2.H.2 shows that finite approximations cannot be accurate if the approximation
level is too small compared to the growth rate ln N .

142
Theorem 2.H.2 (ln N is necessary). There exists a probability kernel f (·), independent of
K, N , such that for any N ≥ 2, if K ≤ 12 C(N, α), then

C′
dTV (PN,∞ , PN,K ) ≥ 1 − α/8
N
where C ′ is a constant only dependent on α.

See section 2.I.2 for the proof. Theorem 2.H.2 implies that as N grows, if the approximation
level K fails to surpass the C(N, α)/2 threshold, then the total variation between the
approximate and the target model remains bounded from zero — in fact, the error tends
to one. Recall that C(N, α) = Ω(ln N ), so the necessary approximation level is Ω(ln N ).
Theorem 2.H.2 is the analog of Theorem 2.4.2.
We also investigate the tightness of Theorem 2.H.1 in terms of K. In Theorem 2.H.3, our
lower bound indicates that the 1/K factor in Theorem 2.H.1 is tight (up to log factors).

Theorem 2.H.3 (1/K lower bound). There exists a probability kernel f (·), independent of
K, N , such that for any N ≥ 2,
α 1
dTV (PN,∞ , PN,K ) ≥ .
1+αK
See section 2.I.2 for the proof. While Theorem 2.H.1 implies that the normalized AIFA with
K = O (poly(ln N )/ϵ) atoms suffices in approximating the DP mixture model to less than ϵ
error, Theorem 2.H.3 implies that a normalized AIFA with K = Ω (1/ϵ) atoms is necessary in
the worst case. This worst-case behavior is analogous to Theorem 2.4.3 for DP-based models.
The 1/ϵ dependence means that AIFAs are worse than TFAs in theory. It is known that
small TFA models are already excellent approximations of the DP. Definition 2.4.5 is a very
well-known finite approximation whose error is upper bounded in Proposition 2.H.4.
i.i.d. indep
Proposition 2.H.4. [88, Theorem 2] Let ΞK ∼ TSBK (α, H), Rn | ΞK ∼ ΞK , Tn | Rn ∼
f (· | Rn ) with N observations. Let QN,K be the distribution of the observations T1:N . Then:
dTV (PN,∞ , QN,K ) ≤ 2N exp − K−1 α
.

Proposition 2.H.4 implies that a TFA with K = O (ln (N/ϵ)) atoms suffices in approximating
the DP mixture model to less than ϵ error. Modulo log factors, comparing the necessary 1/ϵ
level for AIFA and the sufficient ln (1/ϵ) level for TFA, we conclude that the necessary size
for normalized IFA is exponentially larger than the sufficient size for TFA, in the worst case.

2.I Proofs of DP bounds


Our technique to analyze the error made by FSDK follows a similar vein to the technique
in section 2.G. We compare the joint distribution of the latents X1:N and Z1:N (with the
underlying Θ or ΘK marginalized out) using the conditional distributions Xn | X1:(n−1) and
Zn | Z1:(n−1) . Before going into the proofs, we give the form of the conditionals.
The conditional X1:N | X1:(n−1) is the well-known Blackwell-MacQueen prediction rule.

143
Proposition 2.I.1. Blackwell and MacQueen [19] For n = 1, X1 ∼ H. For n ≥ 2,
α X nj
Xn | Xn−1 , Xn−2 , . . . , X1 ∼ H+ δψ ,
n−1+α j
n−1+α j

where {ψj } is the set of unique values among Xn−1 , Xn−2 , . . . , X1 and nj is the cardinality of
the set {i : 1 ≤ i ≤ n − 1, Xi = ψj }.

The conditionals Zn | Z1:(n−1) are related to the Blackwell-MacQueen prediction rule.

Proposition 2.I.2. Pitman [162] For n = 1, Z1 ∼ H. For n ≥ 2, let {ψj }Jj=1 n


be the set
of unique values among Zn−1 , Zn−2 , . . . , Z1 and nj is the cardinality of the set {i : 1 ≤ i ≤
n − 1, Zi = ψj }. If Jn < K:
n J
(K − Jn )α/K X nj + α/K
Zn | Zn−1 , Zn−2 , . . . , Z1 ∼ H+ δψ ,
n−1+α j=1
n−1+α j

Otherwise, if Jn = K, there is zero probability of drawing a fresh component from H i.e. Zn


comes only from {ψj }j=1 Jn :
Jn
X nj + α/K
Zn | Zn−1 , Zn−2 , . . . , Z1 ∼ δψj .
j=1
n − 1 + α

Jn ≤ K is an invariant of these of prediction rules: once Jn = K, all subsequent Jm for


m ≥ n is also equal to K.

2.I.1 Upper bounds


Proof of Theorem 2.H.1. The constants C ′ , C ′′ , C ′′′ , C ′′′′ are as follows

C ′ = exp(α(ψ(α) + 1)) + 2α2 ln2 (1 + 1/α),


3α2 ln(1 + 1/α)
C ′′ = α2 + ,
ln 2 (2.80)
C ′′′ = α2 ,
C ′′′′ = α2 ln(1 + 1/α).

The reasoning for these constants will be clear by the end of the proof.
To begin, observe that the conditional distributions of the observations given the latent
variables are the same across target and approximate models: PY1:N |X1:N is the same as
PW1:N |Z1:N if X1:N = Z1:N . Therefore, using Lemma 2.E.8, we want to show that there exists
a coupling of PX1:N and PZ1:N that has small difference probability.
First, we construct a coupling of PX1:N and PZ1:N such that, for any n ≥ 1, for any x1:(n−1)
such that Jn is the number of unique atom locations among x1:(n−1) is at most K,

α Jn
P(Xn ̸= Zn | X1:(n−1) = Z1:(n−1) = x1:(n−1) ) ≤ . (2.81)
Kn−1+α

144
The case where n = 1 reads that P(X1 = ̸ Z1 ) = 0. Such a coupling exists because the total
variation distance between the prediction rules Xn | X1:(n−1) and Zn | Z1:(n−1) is small. Let
{ψj }Jj=1
n
be the unique atom locations in x1:(n−1) and nj be the number of latents xi that
manifest atom location ψj . The distribution Xn | X1:(n−1) can be sampled from in two steps:

• Sample I1 from the categorical distribution over Jn + 1 elements where, for 1 ≤ j ≤ Jn ,


P(I1 = j) = nj /(n − 1 + α) and P(I1 = Jn + 1) = α/(n − 1 + α).

• If I1 = j for 1 ≤ j ≤ Jn , set Xn = δψj . If I1 = Jn + 1, draw a fresh atom from H, label


ψJn +1 and set Xn = δψJn +1 .

Similarly, we can generate Zn | Z1:(n−1) in two steps:

• Sample I2 from the categorical distribution over Jn + 1 elements where, for 1 ≤ j ≤ Jn ,


nj +α/K
P(I2 = j) = n−1+α and P(I2 = Jn + 1) = α(1−J n /K)
n−1+α
.

• If I2 = j for 1 ≤ j ≤ Jn , set Zn = δψj . If I2 = Jn + 1, draw a fresh atom from H, label


ψJn +1 and set Zn = δψJn +1 .

Still conditioning on X1:(n−1) and Z1:(n−1) , we observe that the distribution of Xn | I1 is the
same as Zn | I2 . Hence, using the propagation argument from Lemma 2.E.8, it suffices to
couple I1 and I2 so that

P(I1 ̸= I2 | X1:(n−1) = Z1:(n−1) = x1:(n−1) )

is small. Since I1 and I2 are categorical distributions, the minimum of the difference probability
is the total variation distance between the two distributions, which equals 1/2 the L1 distance
between marginals
Jn
X nj + α/K nj α α(1 − Jn /K) α Jn
− + − =2 .
j=1
n − 1 + α n − 1 + α n − 1 + α n − 1 + α K n − 1 + α

Dividing the last equation by 2 gives eq. (2.81). The joint coupling of PX1:N and PZ1:N is the
natural gluing of the couplings PXn | X1:(n−1) and PZn | Z1:(n−1) .
We now show that for the coupling satisfying eq. (2.81), the overall probability of difference
P(X1:N ≠ Z1:N ) is small. Recall the growth function from eq. (2.18). We will use the notation
of a typical set in the rest of the proof:

Dn := x1:(n−1) : Jn ≤ (1 + δ) max(C(N, α), C(K, α)) .

In other words, the number of unique values among the x1:(n−1) is small. The constant δ
δ2
satisfies 2+δ α = 2: such δ always exists and is unique. The following decomposition is used
to investigate the difference probability on the typical set:

P(X1:N ̸= Z1:N ) = P((X1:(N −1) , XN ) ̸= (Z1:(N −1) , ZN ))


= P(X1:(N −1) ̸= Z1:(N −1) ) + P(XN ̸= ZN , X1:(N −1) = Z1:(N −1) ). (2.82)

145
The second term can be further expanded:

P(XN ̸= ZN ,X1:(N −1) = Z1:(N −1) , X1:(N −1) ∈ DN )


+ P(XN ̸= ZN , X1:(N −1) = Z1:(N −1) , X1:(N −1) ∈
/ DN ).

The former term is at most

P(XN ̸= ZN | X1:(N −1) = Z1:(N −1) , X1:(N −1) ∈ DN ),

while the latter term is at most


P(X1:(N −1) ∈
/ DN ).
To recap, we can bound P(X1:N ̸= Z1:N ) by bounding three quantities:

1. The difference probability of a shorter process P(X1:(N −1) ̸= Z1:(N −1) ).

2. The difference probability of the prediction rule on typical sets P(XN =


̸ ZN | X1:(N −1) =
Z1:(N −1) , X1:(N −1) ∈ DN ).

3. The probability of the atypical set P(X1:(N −1) ∈


/ DN ).

By recursively applying the expansion initiated in eq. (2.82) to P(X1:(N −1) ̸= Z1:(N −1) ), we
actually only need to bound difference probability of the different prediction rules on typical
sets and the atypical set probabilities.
Regarding difference probability of the different prediction rules, being in the typical set
allows us to control Jn in eq. (2.81). Summation across n = 1 through N gives the overall
bound of
α
(1 + δ) max(C(N, α), C(K, α))C(N, α). (2.83)
K
Regarding the atypical set probabilities, because Jn−1 is stochastically dominated by Jn i.e.,
the number of unique values at time n is at least the number at time n − 1, all the atypical
set probabilities are upper bounded by the last one i.e. P(X1:(N −1) ∈
/ DN ). When N > 1, JN
is the sum of independent Poisson trials, with an overall mean equaling exactly C(N − 1, α)
and J1 is defined to be 0. Therefore, the atypical event has small probability because of
Lemma 2.E.1:
P(JN > (1 + δ) max(C(N − 1, α), C(K, α)) ≤ P(JN > (1 + δ) max(C(N, α), C(K, α))
δ2
 
≤ exp − max(C(N, α), C(K, α) .
2+δ

Even accounting for all N atypical events through union bound, the total probability is still
small small:   2 
δ
exp − max(C(N, α), C(K, α) − ln N .
2+δ
By Lemma 2.E.10, max(C(N, α), C(K, α) ≥ α max(ln N, ln K − α(ψ(α) + 1). we have

δ2
max(C(N, α), C(K, α) − ln N ≥ ln K − α(ψ(α) + 1),
2+δ

146
meaning the overall atypical probabilities is at most

exp(α(ψ(α) + 1))
. (2.84)
K
The overall total variation bound combines eqs. (2.83) and (2.84). We first upper bound
C(N, α) using Lemma 2.E.10 and upper bound max(C(N, α), C(K, α)) by the sum of the two
constituent terms. We also upper bound ln N ≤ ln2 N/ ln 2 to remove the dependence on the
sole ln N factor. After the algebraic manipulations, we arrive at the constants in eq. (2.80)s.

Proof of Theorem 2.4.6. The constants C ′ , C ′′ , C ′′′ , C ′′′′ are as follows:

C ′ = exp(ω(ψ(ω) + 1)) + 2ω 2 ln2 (1 + 1/ω),


3ω 2 ln(1 + 1/ω)
C ′′ = ω 2 + ,
ln 2
C ′′′ = ω 2 ,
C ′′′′ = ω 2 ln(1 + 1/ω).

The main idea is reducing to the Dirichlet process mixture model. We do this in two steps.
First, the conditional distribution of the observations W | H1:D of the target model is the same
as the conditional distribution Z | F1:D of the approximate model if H1:D = F1:D . Second,
there exists latent variables Λ and Φ such that the conditional distribution of H1:D | Λ and
the conditional F1:D | Φ are the same when Λ = Φ. Recall the construction of the Fd in terms
of atom locations ϕd,j and stick-breaking weights γd,j :

GK ∼ FSDK (ω, H),


i.i.d.
ϕdj | GK ∼ GK (.) across d, j,
i.i.d.
γdj ∼ Beta(1, α) across d, j (except γdT = 1),
T
!
X Y
Fd | ϕd,. , γd,. = γdi (1 − γdj ) δϕdj .
i=1 j<i

Similarly Hd is also constructed in terms of atom locations λd,j and stick-breaking weights
ηd,j :

G ∼ DP(ω, H),
i.i.d.
λdj | G ∼ G(.) across d, j,
i.i.d.
ηdj ∼ Beta(1, α) across d, j (except ηdT = 1),
T
!
X Y
Hd | λd,. , ηd,. = ηdi (1 − ηdj ) δλdj .
i=1 j<i

Therefore, if we set Λ = {λdj }d,j and Φ = {ϕdj }d,j , then H1:D | Λ is the same as the conditional
F1:D | Φ if Λ = Φ.

147
Overall, this means that W | Λ is the same as Z | Φ. Again by Lemma 2.E.8, we only need to
demonstrate a coupling between PΛ and PΦ such that the difference probability is small.
From the proof of Theorem 2.4.1 in section 2.I.1, we already know how to couple PΛ and PΦ .
On the one hand, since λdj are conditionally iid given G across d, j, the joint distribution of
λdj is from a DPMM (probability kernel f being Dirac f (· | x) = δx (·)) where the underlying
DP has concentration ω. On the other hand, since ϕdj are conditionally iid given GK across
d, j, the joint distribution ϕdj comes from the finite mixture with FSDK . Each observational
process has cardinality DT . Therefore, we can couple PΛ and PΦ such that

C ′ + C ′′ ln2 (DT ) + C ′′′ ln(DT ) ln K + C ′′′′ ln K


P(Λ ̸= Φ) ≤ ,
K
where the constants have been given at the beginning of this proof.

2.I.2 Lower bounds


Proof of Theorem 2.H.2. First we mention which probability kernel f results in the large
total variation distance: the pathological f is the Dirac measure i.e., f (· | x) = δx (.). With
this conditional likelihood Xn = Yn and Zn = Wn , meaning:

dTV (PN,∞ , PN,K ) = dTV (PX1:N , PZ1:N ).

Now we discuss why the total variation is lower bounded by the function of N . Let A be the
event that there are at least 12 C(N, α) unique components in among the latent states:
 
1
A := x1:N : #unique values ≥ C(N, α) .
2

The probabilities assigned to this event by the approximate and the target models are very
different from each other. On the one hand, since K < C(N,α)
2
, under FSDK , A has measure
zero:
PZ1:N (A) = 0. (2.85)
On the other hand, under DP, the number of unique atoms drawn is the sum of Poisson
trials with expectation exactly C(N, α). The complement of A is a lower tail event. Hence
by Lemma 2.E.2 with δ = 1/2, µ = C(N, α), we have:
 
C(N, α)
PX1:N (A) ≥ 1 − exp − (2.86)
8

Because of Lemma 2.E.10, we can lower bound C(N, α) by a multiple of ln N :

constant
   
C(N, α) α ln N α(ψ(α) + 1)
exp − ≤ exp − + = .
8 8 8 N α/8

We now combine eqs. (2.85) and (2.86) and recall that total variation is the maximum over
probability discrepancies.

148
Proof of Theorem 2.H.3. First we mention which probability kernel f results in the large
total variation distance: the pathological f is the Dirac measure i.e., f (· | x) = δx (.).
Now we show that under such f, the total variation distance is lower bounded. Observe that
it suffices to understand the total variation between PY1 ,Y2 and PW1 ,W2 , because Lemma 2.E.9
already implies
dTV (PN,∞ , PN,K ) ≥ dTV (PY1 ,Y2 , PW1 ,W2 ).
Since f is Dirac, Xn = Yn and Zn = Wn and we have:

dTV (PY1 ,Y2 , PW1 ,W2 ) = dTV (PX1 ,X2 , PZ1 ,Z2 ).

Consider the event that the two latent states are equal. Under the target model,
1
P(X2 = X1 ) = ,
1+α
while under the approximate one,
1 + α/K
P(Z2 = Z1 ) = .
1+α
They are simple consequences of the prediction rules in Propositions 2.I.1 and 2.I.2. Therefore,
there exists a measurable event where the probability mass assigned by the target and
approximate models differ by
1 + α/K 1 α 1
− = , (2.87)
1+α 1+α 1+αK
meaning dTV (PX1 ,X2 , PZ1 ,Z2 ) ≥ α 1
1+α K
.

2.J More ease-of-use results


2.J.1 Conceptual results (continued.)
We begin by stating the log density of the optimal qρ∗ under general priors.
Proposition 2.J.1 (Optimal distribution over atom  rates). Define the normalization constant
C := ρ exp ln P(ρ) + n,k Exn,k ∼qx ln P(xn,k | ρk ) dρ where xn,k ∼ qx denote the marginal
R P

distribution of xn,k under qx (x) (which is a distribution over the whole set (xn,k )n,k ). Then
X
qρ∗ (ρ) = − ln C + ln P(ρ) + Exn,k ∼qx ln ℓ(xn,k | ρk ). (2.88)
n,k

The proof of Proposition 2.J.1 is given in section 2.J.2.


Knowing the log density eq. (2.88) does not mean that drawing inference is easy. By drawing
inference, we mean computing posterior expectations of important integrands. Polynomials
(such as ρk ) are natural integrands. In addition, we also need to compute quantities like
Eρk ∼qρ {ln ℓ(xn,k | ρk )} to derive the optimal distribution for qx (x).

149
Proposition 2.J.2. Suppose that the variational distribution qx (x) factorizes as qx (x) =
n,k fn,k (xn,k ). For a particular n, k, let fn,k be the optimal distribution over (n, k) trait

Q
count with all other variational distributions being fixed i.e.
 
Y

fn,k := arg min KL qρ qψ fn,k fn′ ,k′ || P̄  ,
fn,k
(n′ ,k′ )̸=(n,k)

where P̄ denotes the posterior P(·, ·, · | y). Then, the p.m.f. of fn,k

at xn,k is equal to
− ln C + Eρk ∼qρ ln ℓ(xn,k | ρk ) + Eψ∼qψ ,xn,−k ∼fn,−k ln P(yn | xn,. , ψ).
for some positive constant C.
See section 2.J.2 for the proof of this proposition.
Under TFAs such as Example 2.5.1, since we cannot identify the log density in eq. (2.88)
with a well-known distribution, we do not have formulas for expectations. For Example 2.5.1,
strategies to make computing expectations more tractable Qi−1 include introducing auxiliary
(l)
round indicator variables rk , replacing the product l=1 (1 − Vi,j ) with a more succinct
representation and fixing the functional form qρ rather than using optimality conditions
[156, Section 3.2]. However, Paisley et al. [156, Section 3.3] still runs into intractability
issues when evaluating Eρk ∼qρ {ln ℓ(xn,k | ρk )} in the beta–Bernoulli process, and additional
approximations such as Taylor series expansion are needed.
In our second TFA example, the complete conditional of the atom sizes can be sampled
without auxiliary variables, but important expectations are not analytically tractable.
Example 2.J.1 (Bondesson approximation [55, 197]). When α = 1, the Bondesson approxi-
mation in Example 2.4.3 becomes
K i
i.i.d. i.i.d.
X Y
ΘK = ρi δψi , ρi = pj , pj ∼ Beta(γ, 1), ψi ∼ H. (2.89)
i=1 j=1

The atom sizes are dependent because they jointly depend on p1 , . . . , pK , but the complete
conditional of atom sizes P(ρ | x) admits a density proportional to
K
γ1{j=K}+ N
P PN
n=1 xn,j −1
Y
1{0 ≤ ρK ≤ ρK−1 ≤ . . . ≤ ρ1 ≤ 1} ρj (1 − ρj )N − n=1 xn,j
.
j=1

The conditional distributions P(ρi | ρ−i , x) are truncated betas, so adaptive rejection sampling
[70] can be used as a sub-routine to sample each P(ρi | ρ−i , x) and then sweep over all atom
sizes. However, for this exponential family, expectations of the sufficient statistics are not
tractable. The optimal qρ∗ in the sense of eq. (2.22) has a density proportional to
K
γ1{j=K}+ N
P PN
n=1 Eqx xn,j −1
Y
1{0 ≤ ρK ≤ ρK−1 ≤ . . . ≤ ρ1 ≤ 1} ρj (1 − ρj )N − n=1 Eqx xn,j
.
j=1

We do not know closed-form formulas for E{ln(ρi )} or E{ln(1 − ρi )}. Rather than using
the qρ∗ which comes from optimality arguments, Doshi-Velez et al. [55] fixes the functional
form of the variational distribution. Even then, further approximations such as Taylor series
expansion are necessary to approximate E{ln(ρi )} or E{ln(1 − ρi )}.

150
Other series-based approximations, like thinning or rejection sampling [34], are characterized
by even less tractable dependencies between atom sizes in both the prior and the conditional
P(ρ | x).

2.J.2 Proofs
Proof of Proposition 2.5.1. Because of the Markov blanket, conditioning on x, ψ, y is the
same as conditioning on x:
P(ρ | x, ψ, y) = P(ρ | x).
Conditioned on the atom rates, the trait counts are independent across the atoms. In the
prior over atom rates, the atom rates are independent across the atoms. These facts mean
that the posterior also factorizes across the atoms
K
Y
P(ρ | x) = P(ρk | x.,k )
k=1

We look at each factor P(ρk | x.,k ). This is the posterior for ρk after observing N observations
n=1 . Since the AIFA prior over ρk is the conjugate prior of the trait count likelihood, the
(xn,k )N
posterior is in the same exponential family, with updated parameters based on the sufficient
statistics and the log partition function.
Proof of Proposition 2.J.1. Minimizing the KL divergence is equivalent to maximizing the
evidence lower bound (ELBO):
ELBO(q) := E(ρ,ψ,x)∼q ln P(y, ρ, ψ, x) − E(ρ,ψ,x)∼q ln q(ρ, ψ, x). (2.90)
The log joint probability P(y, ρ, ψ, x), regardless of the prior over ρ, decomposes as
X
ln P(y, ρ, ψ, x) = ln P(ρ) + ln P(ψk )
k
X X (2.91)
+ ln P(xn,k | ρk ) + ln P(yn | xn,. , ψ).
n,k n

Recall that the variational distribution factorizes like as q(ρ, ψ, x) = qρ (ρ)qψ (ψ)qx (x). There-
fore, for fixed qψ (ψ) and qx (x), the ELBO from eq. (2.90) depends on qρ (ρ) only through
X
f (qρ ) := Eρ∼qρ ln P(ρ) + Exn,k ∼qx ,ρk ∼qρ ln P(xn,k | ρk ) − Eρ∼qρ ln qρ (ρ).
n,k

Here, the notation ρk ∼ qρ means the marginal distribution of ρk under qρ . Using Fubin’s
theorem, we rewrite the last integral as
qρ (ρ)
f (qρ ) = −Eρ∼qρ ln P
P(ρ) × exp( n,k Exn,k ∼qx ln P(xn,k | ρk ))
The denominator P(ρ) × exp(P n,k Exn,k ∼qx ln P(xn,k | ρk )) is exactly equal to Cq0 (ρ) where
P
ln q0 (ρ) = − ln C + ln P(ρ) + n,k Exn,k ∼qx ln ℓ(xn,k | ρk ). Therefore
f (qρ ) = −KL(qρ ||q0 ) + ln C.

151
This means that the unique maximizer of f (qρ ) is qρ = q0 i.e. the log density of qρ∗ is as given
in eq. (2.88).
Proof of Corollary 2.5.2. We specialize the formula in eq. (2.88) to the AIFA prior.
Recall the exponential-family form of ℓ(xn,k | ρk ):

ℓ(xn,k | ρk ) = ln κ(xn,k ) + ϕ(xn,k ) ln ρk + ⟨µ(ρk ), t(xn,k )⟩ − A(ρk ). (2.92)

Next, observe that Exn,k ∼qx ln ℓ(xn,k | ρk ) is equal to

Exn,k ∼qx ln κ(xn,k ) + Exn,k ∼qx ϕ(xn,k ) × ln ρk + ⟨µ(ρk ), Exn,k ∼qx t(xn,k )⟩ − A(ρk ). (2.93)

Recall that AIFA prior over ρk is the conjugate prior for the likelihood in eq. (2.92):
     
ψ µ(ρk ) ψ
ln P(ρk ) = (c/K − 1) ln ρk + ⟨ , ⟩ − ln Z(c/K − 1, ), (2.94)
λ −A(ρk ) λ
and the prior factorizes across atoms:
X
ln P(ρ) = ln P(ρk )
k

Putting eq. (2.93) and eq. (2.94) together, We have


X X
ln P(ρ) + Exn,k ∼qx ln ℓ(xn,k | ρk ) = Tk (ρk )
n,k k

where Tk (ρk ) is equal to


 P   
X ψ + n Exn,k ∼qx t(xn,k ) µ(ρk )
(c/K + Exn,k ∼qx ϕ(xn,k ) − 1) ln ρk ) + ,
λ+N −A(ρk )
n

Accounting for the normalization constant Zk for each dimension k, we arrive at eq. (2.23).
Proof of Proposition 2.J.2. The argument is the same as section 2.J.2. In the overall ELBO,
the only terms that depend on fn,k is

Exn,k ∼fn,k ,ρk ∼qρ ln ℓ(xn,k | ρk ) + Exn,k ∼fn,k ,xn,−k ∼fn,−k ,ψ∼qψ ln P(yn | xn,. , ψ)
− Exn,k ∼fn,k ln fn,k (xn,k ).
We use Fubini to express the last integral as a negative KL-like quantity, and use optimality
of KL arguments to derive the p.m.f. of the minimizer.

2.K Experimental setup


In this section, the notation for atom sizes, atom locations, latent trait counts and observed
data follow that of section 2.5 i.e. (ρk )Kk=1 denotes the collection of atom sizes, (ψk )k=1
K

denotes the collection of atom locations, (xn,k )k=1,n=1 denotes the latent trait counts of each
K,N

observation, and (yn )N


n=1 denotes the observed data.

152
2.K.1 Image denoising using the beta–Bernoulli process
Data. We obtain the “clean” house image from http://sipi.usc.edu/database/. We downscale
the original 512 × 512 image to 256 × 256 and convert colors to gray scale. We add iid
Gaussian noise to the pixels of the clean image, resulting in the noisy input image. We follow
Zhou et al. [210] in extracting the patches. We use patches of size 8 × 8, and flatten each
observed patch yi into a vector in R64 .

Finite approximations. We use finite approximations that target the beta–Bernoulli


process with BP(1, 1, 0) i.e. γ = 1, α = 1, d = 0. Zhou et al. [210] remark that the denoising
performance is not sensitive to the choice of γ and α. Therefore, we pick γ = α = 1
for computational convenience, since the beta process with α = 1 has the simple TFA in
Example 2.J.1. To be explicit, the TFA for the given beta–Bernoulli process is
i.i.d.
vj ∼ Beta(1, 1), i = 1, 2, . . . , K,
i
Y
ρi = vj , i = 1, 2, . . . , K, (2.95)
j=1
indep
xn,i | ρi ∼ Ber(ρi ), across n, i.

while the corresponding AIFA is


 
i.i.d. 1
ρi ∼ Beta ,1 , i = 1, 2, . . . , K,
K (2.96)
indep
xn,i | ρi ∼ Ber(ρi ), across n, i.

We report the performance for K’s between 10 and 100 with spacing 10.

Ground measure and observational likelihood. Following Zhou et al. [210], we fix the
ground measure but put a hyper-prior (in the sense of Corollary 2.G.1) on the observational
likelihood. The ground measure is a fixed Gaussian distribution:
 
i.i.d. 1
ψi ∼ N 0, I64 , i = 1, 2, . . . , K. (2.97)
64
The observational likelihood involves two Gaussian distributions with random variances:
γw ∼ Gamma(10−6 , 10−6 ),
γe ∼ Gamma(10−6 , 10−6 ),
i.i.d.
wn,i | γw ∼ N (0, γw−1 ), across i, n, (2.98)
K
indep
X
yn | xn,. , wn,. ψ, γe ∼ N ( xn,i wn,i ψi , γe−1 I64 ), across n.
i=1

We use the (shape,rate) parametrization of the gamma distribution. The weights wn,i enable
an observation to manifest a non-integer (and potentially negative) scaled version of the

153
i-th basis element. The precision γw determines the scale of these weights. The precision γe
determines the noise variance of the observations. We are uninformative about the precisions
by choosing the Gamma(10−6 , 10−6 ) priors.
In sum, the full finite models combine either eqs. (2.96) to (2.98) (for AIFA) or eqs. (2.95),
(2.97) and (2.98) (for TFA).

Approximate inference. We use Gibbs sampling to traverse the posterior over all the
latent variables — the ones that are most important for denoising are x, w, ψ. The chosen
ground measure and observational likelihood have the right conditional conjugacies so that
blocked Gibbs sampling is conceptually simple for most of the latent variables. The only
difference between AIFA and TFA is the step to sample the feature proportions ρ: TFA
updates are much more involved compared to AIFA (see section 2.5). The order in which
Gibbs sampler scans through the blocks of variables does not affect the denoising quality.
To generate the PSNR in fig. 2.6.2a, after finishing the gradual introduction of all patches,
we run 150 Gibbs sweeps. We use the final state of the latent variables at the end of these
Gibbs sweep as the warm-start configurations in figs. 2.6.2b and 2.6.2c.

Evaluation metric. We discuss how iterates from Gibbs sampling define output images.
Each configuration of x, w, ψ defines each patch’s “noiseless” value:
K
X
yen = xn,i wn,i ψi .
i=1

Each pixel in the overall image is covered by a small number of patches. The “noiseless” value
of each pixel is the average of the pixel value suggested by the various patches that cover that
pixel. We aggregate the output images across Gibbs sweeps by a simple weighted averaging
mechanism. We report the PSNR of the output image with the original image following the
formulas from [86].

2.K.2 Topic modelling with the modified HDP


Data. We download and pre-process into bags-of-words about one million random Wikipedia
documents, following Hoffman et al. [83].

Finite models. We fix the ground measure to be a Dirichlet distribution and the observa-
tional likelihood to be a categorical distribution i.e. no hyper-priors. The AIFA is

G0 ∼ FSDK (ω, Dir(η1V )),


i.i.d.
Gd | G0 ∼ TSBT (α, G0 ), across d,
indep
βdn | Gd ∼ Gd (·), across d, n,
indep
wdn | βdn ∼ Categorical(βdn ), across d, n.

154
while the TFA is
G0 ∼ TSBK (ω, Dir(η1V )),
i.i.d.
Gd | G0 ∼ TSBT (α, G0 ), across d,
indep
βdn | Gd ∼ Gd (·), across d, n,
indep
wdn | βdn ∼ Categorical(βdn ), across d, n.
We set the hyperparameters η, α, ω, and T following Wang et al. [206], in that η = 0.01, α =
1.0, ω = 1.0, T = 20. We report the performance for K’s between 20 and 300 with spacing 40.

Approximate inference. We approximate the posterior in each model using stochastic


variational inference [85]. Both models have conditional conjugacies that enable the use of
exponential family variational distributions and closed-form expectation equations for all
update types. The batch size is 500. We use the learning rate (t + τ )−κ , where t is the number
of data mini-batches. For cold-start experiments, we set τ = 1.0 and κ = 0.9. To generate the
results of fig. 2.6.3a, we process 4000 mini-batches of documents. We obtain the warm-start
initializations in figs. 2.6.3b and 2.6.3c by processing 512 mini-batches of documents. When
training from warm-start initialization, to reflect the fact that the initial topics are the results
of a training period, we change τ = 512, but use the same κ as cold start.

Evaluation metrics. We compute held-out log-likelihood following Hoffman et al. [85].


Each test document d′ is separated into two parts who and wobs 22 , with no common words
between the two. In our experiments, we set 75% of words to be observed, the remaining
25% unseen. The predictive distribution of each word wnew in the who is exactly equal to:
Z
p(wnew | D, wobs ) = p(wnew | θd′ , β)p(θd′ , β | D, wobs )dθd′ dβ.
θd′ ,β

This is an intractable computation as the posterior p(θd′ , β | D, wobs ) is not analytical. We


approximate it with a factorized distribution:
p(θd′ , β | D, wobs ) ≈ q(β | D)q(θd′ ),
where q(β | D) is fixed to be the variational approximation found during training and q(θd′ )
minimizes the KL between the variational distribution and the posterior. Operationally, we
do an E-step for the document d′ based on the variational distribution of β and the observed
words wobs , and discard the distribution over zd′ ,. , the per-word topic assignments because
of the mean-field assumption. Using those approximations, the predictive approximation is
approximately:
K
X
p(wnew | D, wobs ) ≈ pe(wnew | D, wobs ) = Eq (θd′ (k))Eq (βk (wnew )),
k=1
22
How each document is separated into these two parts can have an impact on the range of test log-likelihood
values encountered. For instance, if the first (in order of appearance in the document) x% of words were the
observed words and the last (100 − x)% words were unseen, then the test log-likelihood is low, presumably
since predicting future words using only past words and without any filtering is challenging. Randomly
assigning words to be observed and unseen gives better test log-likelihood.

155
and the final number we report for document d′ is:
1 X
ln pe(w | D, wobs ).
|who | w∈w
ho

2.K.3 Comparing IFAs


Data. For the AIFA versus BFRY IFA comparison i.e. fig. 2.6.4a, we generate synthetic
data {yn }2000
n=1 from a power-law beta–Bernoulli process BP(2, 0, 0.6).


X
θi ψi ∼ BP(2, 0, 0.6; N (0, 5I5 )),
i=1
indep
xn,i | θi ∼ Ber(θi ), across n, i,
indep
X
yn | xn,. , ψ ∼ N ( xn,i ψi , I5 ), across n.
i

For the AIFA vs GenPar IFA comparison i.e. fig. 2.6.4b, we use the same generative process
except the beta process is BP(2, 1.0, 0.6). We marginalize out the feature proportions θi and
sample the assignment matrix X = {xn,i } from the power-law Indian buffet process [195].
The feature means are Gaussian distributed, with prior mean 0 and prior covariance 5I5 .
Conditioned on the feature combination, the observations are Gaussian with noise variance
I5 . Since the data is exchangeable, without loss of generality, we use y1:1500 for training and
y1501:2000 for evaluation.

Finite approximations. We use finite approximations that have exact knowledge of the
beta process hyperparameters. For instance, for the AIFA versus BFRY IFA comparison, we
use K-atom AIFA prior with densities

1{0 ≤ θ ≤ 1} −1+c/K−0.6S(θ−1/K)
νAIFA (dθ) := θ (1 − θ)−0.4 dθ, (2.99)
ZK
(  
−1
exp 1−K 2 (θ−1/K)2 + 1 if θ ∈ (0, 1/K)
where c := B(0.6,0.4)
2
and S(θ) = , and ZK is the
1{θ > 0} otherwise.
suitable normalization constant.
In all, the approximation to the beta–Bernoulli part of the generative process is
i.i.d.
ρi ∼ νe(.) for i ∈ [K],
indep
(2.100)
xn,i | ρi ∼ Ber(ρi ) across , n, i,

where νe(.) is either νAIFA , νBFRY or νGenPar . We report the performance for K from 2 to 100.

156
Ground measure and observational likelihood. We use hyper-priors in the sense of
Corollary 2.G.1. The ground measure is random because the we do not fix the variance of
the feature means.
σg ∼ Gamma(5, 5),
i.i.d. (2.101)
ψi ∼ N (0, σg2 I5 ) for i ∈ [K].
The observational likelihood is also random because we do not fix the noise variance of the
observed data.
σc ∼ Gamma(5, 5),
indep
yn | xn,. , ψ, σc ∼ N (
X
xn,i ψi , σc2 I5 ). (2.102)
i

In eqs. (2.101) and (2.102), we use the (shape, rate) parametrization of the gamma distribution.
The full finite models are described by eqs. (2.100) to (2.102).23

Approximate inference. We use mean-field variational inference to approximate the


posterior. We pick the variational distribution q(σc , σg , ρ, ψ, x) with the following factorization
structure: Y Y Y
q(σc )q(σg ) q(ρi ) q(ψi ) q(xn,i ).
i i i,n

Each variation distribution is the natural exponential family. Specifically, we have q(σc ) =
Gamma(νc (0), νc (1)), q(σg ) = Gamma(νg (0), νg (1)), q(ψi ) = N (τi , ζi ), q(ρi ) = Beta(κi (0), κi (1)),
q(xn,i ) = Ber(ϕn,i ). We set the initial variational parameters using using the latent features,
feature assignment matrix, and the variances of the features prior and the observations around
the feature combination. We use the ADAM optimizer in Pyro (learning rate 0.001, β1 = 0.9,
clipping gradients if their norms exceed 40) to minimize the KL divergence between the
approximation and exact posterior. We sub-sample 50 data points at a time to form the
objective for stochastic variational inference. We terminate training after processing 5,000
mini-batches of data.

Evaluation metrics. We use the following definition of predictive likelihood:


m
X
ln P(yn+i | y1:n ), (2.103)
i=1

where y1:n are the training data and {yn+i }m i=1 are the held-out data points.
We estimate P(yn+i | y1:n ) using Monte Carlo samples, since the predictive likelihood is an
integral of the posterior over training data:
Z
P(yn+i | y1:n ) = P(yn+i | xn+i , ψ, σ)P(xn+i , ψ, σ, ρ | y1:n ),
xn+i ,σ,ψ,ρ

23
During inference, we add a small tolerance of 10−3 to the standard deviations σc , σg , ζi in the model to
avoid singular covariance matrices, although this is not strictly necessary if we clip gradients.

157
where xn+i is the assignment vector of the n + i test point. Define the S Monte Carlo samples
of the variational approximation to the posterior as (xs(n+1):(n+m),. , ρs , ψ s , σ s )Ss=1 . We jointly
estimate P(yn+i | y1:n ) across test points yn+i using the S Monte Carlo samples:
S
1X
P(yn+i | y1:n ) ≈ P(yn+i | xsn+i , ψ s , σ s ).
S s=1

We use S = 1,000 samples from the (approximate) posterior to estimate the average log
test-likelihood in eq. (2.103).

2.K.4 Beta process hyperparameter estimation


Data. In this experiment, the number of observations, or the number of rows in the feature
matrix, is N = 1000. For discount estimation, we generate 50 matrices from the corresponding
IBP [195] with for mass γ = 3.0, concentration α = 1.0 and discount varying from 0 through
0.5. For mass estimation, we generate 50 matrices from the IBP with concentration α = 1.0,
discount d = 0.25 and mass varying from 1.0 through 5.0. For the concentration estimation,
we generate 50 matrices from the IBP with mass α = 3.0, discount d = 0.25 and concentration
varying from 0 through 5.0.

AIFA marginal likelihood. The K-atom AIFA rates define a generative process over
feature matrices with N rows and K columns:
i.i.d.
θk ∼ AIFAK across k,
indep
xn,k | θk ∼ Ber(θk ) across n, k.

xn,k is the entry in the nth row and kth column of the feature matrix. Treating the beta
process hyperparameters γ, α, d as unknowns, we compute the probability of observing a
particular feature matrix {xn,k } (integrating out the AIFA rates) as a function of γ, α, d. By
symmetry and independence among the columns x.,k , it suffices to compute the probability
of observing just one column, say {xn,1 }Nn=1 . Conditioned on θ1 , the probability of observing
{xn,1 }n=1 is exactly
N
N
x
Y
θ1 n,1 (1 − θ1 )1−xn,1
n=1

We integrate out θ1 to compute the marginal likelihood. Recall that c(γ, α, d) = γ/B(α +
d, 1 − d) for the beta process AIFA. The marginal likelihood of observing the first column
n=1 is
{xn,1 }N
"N #
Y x
Eθ∼AIFAK θ1 n,1 (1 − θ1 )1−xn,1
n=1
R1 P P
0
θ−1+c(γ,α,d)/K+ n xn,1 −dS1/K (θ−1/K) (1 − θ)α+d+N − n xn,1 −1 dθ
= R 1 −1+c(γ,α,d)/K−dS (θ−1/K) .
θ 1/K (1 − θ) α+d−1 dθ
0

158
In all, if we denote
Z 1
ZK (γ, α, d; x, y) := θ−1+c(γ,α,d)/K+x−dS1/K (θ−1/K) (1 − θ)α+d+(y−x)−1 dθ,
0

then the marginal probability of observing a particular binary matrix {xn,k }, as a function of
γ, α, d, is
K P
Y ZK (γ, α, d; n xn,k , N )
. (2.104)
k=1
ZK (γ, α, d; 0, 0)

For feature matrices coming from an IBP, the number of columns K b is random, and usually
(much) smaller than the number of atoms in the approximation. In this section, the approxi-
mation level is K = 100,000: the distribution of the number of active features in the finite
model (for d ∈ [0, 0.5]) has no noticeable change between K = 100,000 and K > 100,000.
The fact that K − K b columns are missing is the same as K − K b columns being identically
zero; hence, when evaluating the marginal probability of matrices that have less than K
columns, we simple pad the missing columns with zeros.
It remains to show how to compute eq. (2.104) using numerical methods. The bottleneck is
computing ZK (γ, α, d; x, y). We split the integral into two disjoint domains. The first domain
is (0, 1/K): on this domain, the integral is an incomplete beta integral, which is implemented
in libraries such as Virtanen et al. [204]. The second domain is [1/K, 1]. On this domain,
we first compute m∗ , the maximum value of the integrand θ−1+c(γ,α,d)/K+x−dS1/K (θ−1/K) (1 −
θ)α+d+(y−x)−1 . We then use numerical integration to integrate θ−1+c(γ,α,d)/K+x−dS1/K (θ−1/K) (1−
θ)α+d+(y−x)−1 /m∗ . We divide by m∗ to avoid the integrand getting too small, which happens
if x or y are large. The last integrand is well-behaved (bounded and smooth), and we expect
numerical integration to be accurate.

Marginal likelihood under BFRY IFA (or GenPar IFA) are challenging to estimate.
In theory, for the BFRY IFA, it is also possible to express the marginal likelihood (as a
function of γ, α, d) for an observed feature matrix xn,k under the BFRY IFA prior as a ratio
between normalization constants. However, we run into numerical issues (divergence errors)
computing the BFRY IFA normalization constants that are not present in computing the
AIFA normalization constants. For completeness, the BFRY IFA normalization constants are
of the kind
Z 1   
γ/K x−d−1 y−x+d−1 1/d θ
ZBFRY (γ, d; x, y) = θ (1 − θ) 1 − exp −(Kd/γ) .
0 B(d, 1 − d) 1−θ
(2.105)
Whether this integral has a closed-form solution is unknown: the closed-formed marginal
likelihoods from Lee et al. [124] apply to clustering models from normalized CRMs rather
than feature-allocation models from unnormalized CRMs. Numerical integration struggles
with Equation (2.105) for x = 0. (Kd/γ)1/d is typically very large: when γ = 1, d = 0.1, even
K = 100 leads to (Kd/γ)1/d being on the order of 1020 . As a result, under standard floating
point precision, 1 − exp −(Kd/γ)1/d 1−θ θ
evaluates to 1 on all points of the quadrature grid:
this leads to divergent behavior, as the factor θ−d−1 by itself grows too fast near 0.

159
We resort to Monte Carlo to estimate the normalization constant. In each Monte Carlo batch,
we draw K random variables θ1 , θ2 , . . . , θK from the BFRY density eq. (2.4), and estimate
the log of ZBFRY (γ, d; x, y) with
 K
logsumexp [(x − d − 1) ln θk + (y − x + d − 1) ln(1 − θk ) − ln K] .
k=1

In the left panel of fig. 2.6.5b, we first generate an feature matrix from IBP with mass 3.0,
concentration 0.0 and discount 0.25. We then plot the estimate of the marginal likelihood
under BFRY IFA for this feature matrix as a function of d for mass fixed at 3.0 and discount
fixed at 0.0.
GenPar IFA faces similar problems as BFRY IFA. We are not aware of a closed-form formula
for the marginal likelihood. Namely, we are not able to show that eq. (2.5) is a conjugate
prior for the Bernoulli likelihood: when we observe an observation X = 1 from the model
X ∼ Ber(θ), θ ∼ νGenPar , the posterior density for θ is proportional to
 
θ−d (1 − θ)α+d−1 
1 −   1 
B(1 − d, α + d)   1/d   1{0 ≤ θ ≤ 1}.
α 
Kd
θ 1+ γα
−1 +1

This new density is not in the same family as the original generalized Pareto variate. Default
schemes to numerically integrate P(0 | θk ) against the generalized Pareto prior for θk fail
 1/d
because of overflow issues associated with the magnitude of the term 1 + Kd γα
. In the left
panel of fig. 2.6.5b, we first generate an feature matrix from IBP with mass 3.0, concentration
1.0 and discount 0.25. We then plot the estimate of the marginal likelihood under BFRY IFA
for this feature matrix as a function of d for mass fixed at 3.0 and discount fixed at 0.0.

Optimization. For AIFA i.e. left panel of fig. 2.6.5a, to estimate the beta process hy-
perparameters given an observed feature matrix, we maximize the marginal probability
in eq. (2.104) with respect to γ, α, d, by doing a grid search with a fine resolution. The
base grid for the triplet γ, α, d is the Cartesian product of three lists: [1.0, 2.0, 3.0, 4.0, 5.0],
[0.5, 1.0, 1.5, 2.0, 2.5], and [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]. We refine the base grid around the true
hyperparameters. For example, in the discount estimation experiment, a true configuration is
(3.0, 1.0, 0.4) The refinement here is the Cartesian product of three lists [2.6, 2.8, 3.0, 3.2, 3.4],
[0.8, 0.9, 1.0, 1.1, 1.2], [0.36, 0.38, 0.4, 0.42, 0.44]. We append the refinement to the base grid by
looping through all the configurations. We propose the best hyperparameters by evaluating
the marginal likelihood (eq. (2.104)) at all points on the grid, and reporting the maximizer.
For the nonparametric process i.e. right panel of fig. 2.6.5a, the probability of observing a
particular feature matrix under the IBP prior over N rows is given in Broderick et al. [29,
Equation 7]. We maximize this function with respect to γ, α, d using differential evolution
techniques [191, 204].

160
2.K.5 Dispersion estimation
Generative model. The probabilistic model is
i.i.d.
λk ∼ XGamma(α/K, c, τ, T ) across k,
i.i.d.
ϕk ∼ Dir(aϕ 1V ) across k,
indep (2.106)
zn,k | λk ∼ CMP(λk , τ ) across k, n,
indep
X
xn,v | zn,: , ϕ ∼ Poisson( zn,k ϕk,v ), across v, n.
k

Recall the definition of the Xgamma variate from eq. (2.25). The observed data is the count
matrix xn,v , the number of times document n manifests vocab word v. The hyperparameters
are α, c, τ, T and aϕ . To draw data for eq. (2.106), we need to sample from XGamma and
CMP, two distributions that are not implemented in standard numerical libraries. The
only Pbottleneck in drawing CMP(θ, PτL) isθycomputing Zτ (θ). We approximate the infinite
∞ θy
sum y=0 (y!)τ with a truncation y=0 (y!)τ , using the bounds from Minka et al. [144] to
make sure the contribution of the left-out terms is small. To draw from XGamma, whose
unnormalized density has a contribution from Zτ−c (θ), we use the above approximation of
Zτ (θ) and slice sampling on the approximation of the unnormalized density.
When generating synthetic data, we draw N = 600 documents, over a vocabulary of size 100,
from a model with K = 500. The under-dispersed case and the over-dispersed case have the
same following hyperparameters: α = 20, c = 1, T = 1000, aϕ = 0.01. For underdispersion,
τ = 1.5, while for overdispersion, τ = 0.7. Our primary goal of inference is estimating the
topics and the shape τ . As such, during posterior inference, we fix the hyperparameters α, c,
T , and aϕ at the data-generating values, and sample the remaining latent variables (λ, ϕ, z)
and shape τ . We put a uniform (0, 100] prior on the shape τ : τ is always positive, and there
is no noticeable difference in amount of dispersion (ratio of variance over mean) between
τ = 100 and τ > 100. Furthermore, during sampling, the values of τ are much smaller than
100, indicating that inference would have remained the same for different choices of the
uniform’s upper limit.

Gibbs sampling. During sampling, following Zhou et al. [211, Section 4], we augment
the original model by introducing three additional families of latent variables: s, u and q.
Conditioned on z and ϕ, the pseudocount sn,k,v is distributed as Poisson
indep
sn,k,v | z, ϕ ∼ Poisson(zn,k ϕk,v ), across n, k, v,
and the sn,k,v add up to be xn,v in the following way
X
xn,v = sn,k,v .
k

Summing the pseudocounts across words, we have
$$u_{n,k} := \sum_v s_{n,k,v}.$$
It then holds that
$$u_{n,k} \mid z_{n,k} \overset{\text{indep}}{\sim} \mathrm{Poisson}(z_{n,k}) \text{ across } n, k, \qquad \{s_{n,k,v}\}_{v=1}^V \mid u_{n,k}, \phi_k \overset{\text{indep}}{\sim} \mathrm{Multi}(u_{n,k}; \phi_k) \text{ across } n, k. \tag{2.107}$$
Summing the pseudocounts across documents, we have
$$q_{k,v} := \sum_n s_{n,k,v}.$$

We use a blocked Gibbs sampling strategy. The variable blocks are ϕ, λ, s (which determines u and q), z, and τ. First, we compute the Gibbs conditional of the topics ϕ. Since u is determined by s (eq. (2.107)), conditioned on s, ϕ is independent of the remaining latent variables:
$$P(\phi \mid x, \lambda, s, z, \tau) = P(\phi \mid s) \propto P(\phi)P(s \mid u, \phi) = \prod_{k=1}^K \mathrm{Dir}\big(\phi_k \mid [a_\phi + q_{k,v}]_{v=1}^V\big).$$
Next, we compute the Gibbs conditionals of the rates λ. Conditioned on the trait counts z and shape τ, λ is independent of the remaining latent variables:
$$P(\lambda \mid x, \phi, s, z, \tau) = P(\lambda \mid z, \tau) = \prod_{k=1}^K P(\lambda_k \mid z_{\cdot,k}, \tau) = \prod_{k=1}^K \mathrm{XGamma}\Big(\lambda_k \,\Big|\, \frac{\alpha}{K} + \sum_n z_{n,k},\; c + N,\; \tau,\; T\Big).$$

We use the scheme discussed after eq. (2.106) to sample these XGamma variates. The Gibbs conditionals of the trait counts z are
$$P(z \mid x, \phi, \lambda, s, \tau) = P(z \mid \lambda, s, \tau) = \prod_{n,k} P(z_{n,k} \mid \lambda_k, u_{n,k}, \tau) \propto \prod_{n,k} \mathrm{Poisson}(u_{n,k} \mid z_{n,k})\,\mathrm{CMP}(z_{n,k} \mid \lambda_k, \tau).$$
To draw from the distribution whose p.m.f. at $z \in \mathbb{N} \cup \{0\}$ is proportional to $\mathrm{Poisson}(u_{n,k} \mid z)\,\mathrm{CMP}(z \mid \lambda_k, \tau)$, the only bottleneck is computing the normalization constant
$$\sum_{z=0}^{\infty} \frac{\exp(-z)\, z^{u_{n,k}}}{u_{n,k}!}\; Z_\tau(\lambda_k)^{-1}\, \frac{\lambda_k^z}{(z!)^\tau}.$$
The multiplicative factors that do not depend on $z$ can be taken out of the sum: we only need to compute $\sum_{z=0}^{\infty} (\lambda_k/e)^z z^{u_{n,k}} / (z!)^\tau$. Similar to the computation of $Z_\tau(\theta)$, we approximate the above infinite sum with a finite truncation, making sure the left-out terms have a small contribution.
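A sketch of this finite-truncation draw, with a fixed truncation point max_z that we assume is large enough for the left-out tail to be negligible:

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

def sample_znk(u_nk, lam_k, tau, max_z=500):
    # Draw z with p.m.f. proportional to Poisson(u_nk | z) CMP(z | lam_k, tau),
    # enumerating z = 0, ..., max_z; factors constant in z cancel upon
    # normalization, so only (lam_k / e)^z z^{u_nk} / (z!)^tau is needed.
    z = np.arange(1, max_z + 1)
    log_p = -z + u_nk * np.log(z) + z * np.log(lam_k) - tau * gammaln(z + 1)
    log_p0 = 0.0 if u_nk == 0 else -np.inf  # z^{u_nk} at z = 0
    log_p = np.concatenate([[log_p0], log_p])
    p = np.exp(log_p - log_p.max())
    return rng.choice(max_z + 1, p=p / p.sum())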
The Gibbs conditionals of the pseudocounts s are
$$P(s \mid x, \phi, \lambda, z, \tau) = P(s \mid x, \phi, z) = \prod_{n,v} \mathrm{Multi}\Big(\{s_{n,k,v}\}_{k=1}^K \,\Big|\, x_{n,v};\; \Big[z_{n,k}\phi_{k,v} \Big/ \sum_{k'} z_{n,k'}\phi_{k',v}\Big]_{k=1}^K\Big).$$

[Panels: (a) Original; (b) Input, 24.68 dB; (c) AIFA, 34.62 dB; (d) TFA, 34.76 dB]

Figure 2.L.1: Sample AIFA and TFA denoised images have comparable quality. (a) shows
the noiseless image. (b) shows the corrupted image. (c,d) are sample denoised images from
finite models with K = 60. PSNR (in dB) is computed with respect to the noiseless image.

Finally, the Gibbs conditionals of the shape τ are
$$P(\tau \mid x, \phi, \lambda, s, z) = P(\tau \mid z, \lambda) \propto P(z \mid \tau, \lambda)P(\lambda \mid \tau)P(\tau).$$
In implementations, we omit the contribution from $P(\lambda \mid \tau)$, since it contributes a very small amount (less than 0.1%) to the overall value of $\ln P(z \mid \tau, \lambda) + \ln P(\lambda \mid \tau) + \ln P(\tau)$, but takes more time to evaluate than the other two components. In other words, the unnormalized log density of τ conditioned on the other variables is just
$$\ln P(z \mid \tau, \lambda) + \ln P(\tau) = \sum_{n,k} \ln \mathrm{CMP}(z_{n,k} \mid \lambda_k, \tau) + \ln \mathbb{1}\{\tau \in (0, 100]\}.$$

We use slice sampling to draw from this distribution.

MCMC results. We run 40 chains, each for 50,000 iterations. After discarding the first 25,000 iterations, all chains have $\widehat{R}$ diagnostic [65] smaller than 1.01. To combat the serial correlation, we thin the samples after burn-in, selecting only one draw every 2,000 iterations. The effective number of samples remaining after burn-in and thinning is about 1,000.

2.L Additional experiments


2.L.1 Denoising other images
Similar to the house image, the clean plane image was obtained from http://sipi.usc.edu/database/. The clean, the corrupted, and the example denoised images from AIFA/TFA for the plane image are given in fig. 2.L.1. In figs. 2.L.2b and 2.L.2c, the approximation level is K = 60.
Similar to the house image, the clean truck image was obtained from http://sipi.usc.edu/database/. The clean, the corrupted, and the example denoised images from AIFA/TFA for the truck image are given in fig. 2.L.3. In figs. 2.L.4b and 2.L.4c, the approximation level is K = 60.

[Panels: (a) Performance versus K; (b) TFA training; (c) AIFA training]

Figure 2.L.2: (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K. The error bars reflect randomness in both initialization and simulation of the conditionals across 5 trials. AIFA denoising quality improves as K increases, and the performance is similar to TFA across approximation levels. Moreover, the TFA- and AIFA-denoised images are very similar: the PSNR ≈ 50 for TFA versus AIFA, whereas PSNR < 35 for TFA or AIFA versus the original image. (b,c) show how PSNR evolves during inference. The “warm-start” lines indicate that the AIFA-inferred (respectively, TFA-inferred) parameters are excellent initializations for TFA (respectively, AIFA) inference.

2.L.2 Effect of AIFA tuning hyperparameters

We investigate the impact of a and $b_K$, which are two tunable parameters in the more general definition of AIFA from Theorem 2.B.2. Other than the setting of a and $b_K$, the experimental setup is the same as in section 2.K.3.
From fig. 2.L.5, we see that the settings of a and $b_K$ do not have a big impact on the performance of the IFA from Theorem 2.B.2. We report results for combinations of $a \in \{0.1, 1\}$ and $b_K = 1/\sqrt{K}$ or $b_K = 1/K$.

2.L.3 Estimation of mass and concentration

Figure 2.L.6 shows that we can use an AIFA to estimate the underlying mass and concentration for a variety of ground-truth masses and concentrations. The experimental setup is from section 2.K.4. Since the error bars in the left and right panels are comparable, we conclude that the AIFA yields inference comparable to the full nonparametric process.

[Panels: (a) Original; (b) Input, 24.69 dB; (c) AIFA, 30.06 dB; (d) TFA, 30.24 dB]

Figure 2.L.3: Sample AIFA and TFA denoised images have comparable quality. (a) shows
the noiseless image. (b) shows the corrupted image. (c,d) are sample denoised images from
finite models with K = 60. PSNR (in dB) is computed with respect to the noiseless image.

[Panels: (a) Performance versus K; (b) TFA training; (c) AIFA training]

Figure 2.L.4: (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K. The error bars reflect randomness in both initialization and simulation of the conditionals across 5 trials. AIFA denoising quality improves as K increases, and the performance is similar to TFA across approximation levels. Moreover, the TFA- and AIFA-denoised images are very similar: the PSNR ≈ 47 for TFA versus AIFA, whereas PSNR < 31 for TFA or AIFA versus the original image. (b,c) show how PSNR evolves during inference. The “warm-start” lines indicate that the AIFA-inferred (respectively, TFA-inferred) parameters are excellent initializations for TFA (respectively, AIFA) inference.

[Panels: (a) Average; (b) Best]

Figure 2.L.5: The predictive log-likelihood of AIFA is not sensitive to different settings of a
and bK . Each color corresponds to a combination of a and bK . (a) is the average across 5
trials with different random seeds for the stochastic optimizer, while (b) is the best across
the same trials.

[Panels: (a) Estimation of mass γ; (b) Estimation of concentration α]

Figure 2.L.6: In fig. 2.L.6a, we estimate the mass by maximizing the marginal likelihood of
the AIFA (left panel) or the full process (right panel). The solid blue line is the median of
the estimated masses, while the lower and upper bounds of the error bars are the 20% and
80% quantiles. The black dashed line is the ideal value of the estimated mass, equal to the
ground-truth mass. The key for fig. 2.L.6b is the same, but for concentration instead of mass.

Chapter 3

Sensitivity of MCMC to Small-Data Removals
3.1 Introduction
Consider the following motivating example. In forest management, ecologists study the association between drought and tree canopy death. For instance, Senf et al. [183] analyze data on tree mortality across many years and many regions in Europe, using a Bayesian mixed effects model. To draw inferences, Senf et al. use Markov chain Monte Carlo (MCMC) to approximate posterior functionals. These estimates suggest a positive and significant (in the Bayesian sense) association between drought and tree mortality in the period between 1987 and 2016. For planning, ecologists want to know whether such trends can be extrapolated beyond the gathered data. The generalization might be across time: if the association persists after 2016, it could help policymakers make a case that drought needs to be mitigated to prevent tree death in the future. The generalization might be across space: if the association holds in European regions not covered by the data (such as parts of Russia), it could save policymakers from having to repeat the study, which takes time and resources.
This is a specific instance of a broader problem. In many fields, researchers analyze collected data with Bayesian models and MCMC: examples include economics [140], epidemiology [103], psychology [168], and many more. In general, these analysts want to know if their findings can be more broadly applied.
Standard tools to assess generalization do not answer this question entirely. An analyst might use frequentist tools (confidence intervals, p-values) or Bayesian tools (credible intervals, posterior quantiles) to predict whether their inferences hold in the broader population. The validity of these methods technically depends on the assumption that the gathered data is an independent and identically distributed (i.i.d.) sample from the broader population. In practice, we have reason to suspect that this assumption is not met: for instance, in the tree mortality case, the data collected up to 2016 is likely similar, but not identical, to the data that will be collected in the future, perhaps due to climate change.
An analyst might hope that deviations from the i.i.d. assumption are small enough that their conclusions remain the same in the broader population, and standard tools accurately assess generalization. Conversely, if it were possible to remove a small fraction of data and change conclusions, the analyst would be worried about such deviations and the assessment from standard tools.
Broderick et al. [31] is the first to formulate such small-data sensitivity as a check on generalization. Along with the formulation, one contribution of this work is a fast way to detect sensitivity when the analysis in question is based on estimating equations [114, Chapter 13]. Regardless of how estimators are constructed, in general, the brute-force approach to finding an influential small fraction of data is computationally intractable. One would need to enumerate all possible data subsets of a given cardinality and re-analyze each subset: even when the fraction of data removed is small and each analysis takes little time, there are too many such subsets to consider; see the discussion at the end of section 3.3. For estimating equations, Broderick et al. [31] estimate the effect of dropping data with a first-order Taylor series approximation: this approximation can be optimized very efficiently, while the brute-force approach is not at all practical.
Neither Broderick et al. [31] nor subsequent existing work on small-data removals [64, 147,
185] can be immediately applied to determine sensitivity in MCMC. Since MCMC cannot be

cast as the root of estimating equations or the solution to an optimization problem, neither
Broderick et al. [31] nor Shiffman et al. [185] apply to our situation. As Freund and Hopkins
[64], Moitra and Rohatgi [147] focus on ordinary least squares (OLS), their work does not
address our problem, either.

Our contributions. We extend Broderick et al. [31] to handle analyses based on MCMC. In section 3.2, we introduce the concepts in Bayesian decision-making and describe the non-robustness concerns. In section 3.3.1, to approximate the effect of removing observations to first order, we use known results on how much a posterior expectation locally changes under small perturbations to the total log likelihood [52, 71, 80, 181]. As this approximation involves posterior covariances, in section 3.3.2, we re-use the MCMC draws that an analyst would have already generated to estimate what happens when data is removed. Recognizing that Monte Carlo errors induce variability in our approximation, in section 3.3.3, we use a variant of the bootstrap [57] to quantify this uncertainty. For more discussion on how our methodology relates to existing work, see section 3.1.1. Experimentally, we apply our method to three Bayesian analyses. In section 3.5, we detect non-robustness in econometric and ecological studies. However, while our approximation performs well in simple models such as linear regression, it is less reliable in complex models, such as ones with many random effects.

3.1.1 Related work


Our work arguably fits into the intersection of three lines of work. We have already mentioned
the first: papers on detecting sensitivity to small-data removal.
The second quantifies the changes that happen to a posterior expectation because of small
perturbations to the total log likelihood. While this local robustness literature has already
mentioned how to estimate the effect of dropping an individual observation, our work is the
first to apply these estimates to assess whether MCMC is sensitive to the removal of small
data. Recent works in this literature include Giordano and Broderick [71], Giordano et al.
[72, 73], Mohamed et al. [146], while foundational works include Diaconis and Freedman
[52], Gustafson [80], Ruggeri and Wasserman [181]. While some works, such as Giordano
et al. [73], Gustafson [80] generate perturbations through varying prior choice, others, such
as Giordano and Broderick [71] generate perturbations by removing observations from the
analysis.
The third set of works, which is the so-called Bayesian case influence literature, quantifies
the importance of individual observations to a Bayesian analysis. As we will explain, existing
works do not tackle our problem. Early works in this area include Carlin and Polson
[37], Johnson and Geisser [102], Lavine [121], Mcculloch [138], while recent works include
Marshall and Spiegelhalter [137], Millar and Stewart [142], Pratola et al. [170], Thomas et al.
[199], van der Linde [202]. Such papers focus on the identification of outliers, rather than
predictions about whether the conclusion changes after removing a small amount of data.
Generally, this literature defines an observation to be an outlier if the Kullback-Leibler (KL)
divergence between the posterior after removing the observation and the original posterior is
large. For conclusions based on posterior functionals, such as the mean, we are not aware of
how to systematically connect the KL divergence to the sensitivity of the decision-making

process: in fact, recent work [50] has shown that comparing probability distributions based
on the KL divergence can be misleading if an analyst really cared about the comparison
between the distributions’ means.

3.2 Background
We introduce the problem of drop-data non-robustness in Bayesian data analysis.

3.2.1 Bayesian data analysis


In this section, we introduce the notation and concepts involved in Bayesian data analysis. Suppose we have a dataset $\{d^{(n)}\}_{n=1}^N$. For instance, in regression, each observation is a vector of covariates $x^{(n)}$ and a response $y^{(n)}$: in this case, we write $d^{(n)} = (x^{(n)}, y^{(n)})$. Consider a parameter $\beta \in \mathbb{R}^P$ of interest. To estimate the latent β, one option is to take a Bayesian approach. First, we probabilistically model the link between β and the data through a log likelihood function $L(d^{(n)} \mid \beta)$. As an example, in linear regression, β consists of the coefficients θ and the noise σ, with the log likelihood equaling $L(d^{(n)} \mid \beta) = -\frac{1}{2\sigma^2}(y^{(n)} - \theta^T x^{(n)})^2 - \frac{1}{2}\log(2\pi\sigma^2)$. Second, we specify a prior distribution over the latent parameters, and use $p(\beta)$ to denote the prior density. Then, the density of the posterior distribution of β given the data is
$$p(\beta \mid \{d^{(n)}\}_{n=1}^N) \propto p(\beta) \prod_{n=1}^N \exp\big(L(d^{(n)} \mid \beta)\big).$$

In practice, an analyst uses a functional of the posterior to draw conclusions. One prominent functional is the posterior mean $\mathbb{E}\, g(\beta)$, where g is a mapping from $\mathbb{R}^P$ to $\mathbb{R}$. As an example, in linear regression, a practitioner will commonly make a decision based on the sign of the posterior mean of a particular regression coefficient. Other decisions are made with credible intervals. An econometrician might declare that an intervention is helping some population if the vast majority of the posterior mass for a particular coefficient lies above zero. That is, the practitioner checks if the lower bound of a credible interval lies above zero. This decision might be considered to reflect a Bayesian notion of significance. Decisions might also be made with approximate credible intervals: while exact intervals are based on posterior quantiles, an approximate interval is often based on the sum of the posterior mean and a multiple of the posterior standard deviation.
Computationally, in general, the functionals needed to make a conclusion are not available
in closed form. To approximate posterior functionals, practitioners frequently use Markov
chain Monte Carlo (MCMC) methods. Let (β (1) , . . . , β (S) ) denote the MCMC draws that
target the posterior distribution: a draw refers to one such β (s) , and S is the number of draws.
In practice, we estimate expectations using (β (1) , . . . , β (S) ), and make a decision based on
such estimates.

3.2.2 Drop-data non-robustness


With notation for Bayesian data analyses in place, we introduce the problem of drop-data
non-robustness.

A Bayesian analyst might be worried if the substantive decision arising from their data
analysis changed after removing some small fraction α of the data. For instance,

• If their decision were based on the sign of the posterior mean, they would be worried if
that sign changed.

• If their decision were based on zero falling outside a credible interval, they would be
worried if we can make the credible interval contain zero.

• If their decision was based on both the sign and the significance, they would be worried
if we can both change the posterior mean’s sign and put a majority of the posterior
mass on the opposite side of zero.

In general, we expect an analyst to be worried if we could remove a small fraction α of the


data and change their decision.
To describe non-robustness precisely and to develop our approximation, we need notation
to indicate the dependence of posterior functionals on the presence of data points. We
introduce a vector of data weights w = (w1 , w2 , . . . , wN ), where wn is the weight for the n-th
observation. Each wn is constrained to be in the interval [0, 1]. The whole vector w defines
the so-called weighted posterior distribution.

Definition 3.2.1. Let $Z(w) := \int p(\beta) \prod_{n=1}^N \exp\big(w_n L(d^{(n)} \mid \beta)\big)\, d\beta$. If $Z(w) < \infty$, the weighted posterior distribution associated with w has density
$$p(\beta \mid w, \{d^{(n)}\}_{n=1}^N) := \frac{1}{Z(w)}\, p(\beta) \exp\Big(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\Big).$$

$w_n$ encodes the inclusion of $d^{(n)}$ in the analysis. If $w_n = 0$, the n-th observation is ignored; if $w_n = 1$, the n-th observation is fully included. We recover the regular unnormalized posterior density by setting all weights to 1: $w = 1_N = (1, 1, \ldots, 1)$. It is possible that $p(\beta) \exp\big(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\big)$ is not integrable for some w. This is the case when the prior $p(\beta)$ is improper and all weights have been set to zero: $w = 0_N = (0, 0, \ldots, 0)$. In the following, we assume that any contribution of the likelihood is enough to define a proper posterior.

Assumption 3.2.1. For all $w \in [0,1]^N \setminus \{0_N\}$, $Z(w) < \infty$.

This assumption is immediate in the case of a proper prior and standard likelihoods. The notation $p(\beta \mid w, \{d^{(n)}\}_{n=1}^N)$ emphasizes the dependence on w, and will supersede the $p(\beta \mid \{d^{(n)}\}_{n=1}^N)$ notation. To indicate expectations under the weighted posterior, we use the subscript w: $\mathbb{E}_w$ is the expectation taken with respect to the randomness $\beta \sim p(\beta \mid w, \{d^{(n)}\}_{n=1}^N)$.
With the weighted posterior notation, we extend concepts from the standard analysis to the new analysis involving weights. The value of a posterior functional depends on w. For instance, the posterior mean under the weighted posterior is $\mathbb{E}_w g(\beta)$, and we recover the standard posterior mean by setting $w = 1_N$.

The Bayesian analyst's non-robustness concern can be formalized as follows. For $\alpha \in (0, 1)$, let $W_\alpha$ denote the set of all weight vectors that correspond to dropping no more than 100α% of the data, i.e.,
$$W_\alpha := \Big\{ w \in \{0,1\}^N : \frac{1}{N}\sum_{n=1}^N (1 - w_n) \le \alpha \Big\}.$$
We say the analysis is non-robust if there exists a weight w that a) corresponds to dropping a small amount of data ($w \in W_\alpha$) and b) changes the conclusion.
We focus on decision problems that satisfy the following simplifying assumption: there exists a posterior functional, which we denote by ϕ(w), such that $\phi(1_N) < 0$ and the conclusion changes if and only if $\phi(w) > 0$. Such a functional will be called a “quantity of interest” (QoI). We show how the changes mentioned at this section's beginning fit this framework. To change the conclusion about sign, if the full-data posterior mean ($\mathbb{E}_{1_N} g(\beta)$) were positive, we take
$$\phi(w) = -\mathbb{E}_w g(\beta).$$
Since the full-data posterior mean is positive, $\phi(1_N) < 0$. And $\phi(w) > 0$ is equivalent to the posterior mean (after removing the data) being negative. To change the conclusion about significance, if the approximate credible interval's left endpoint¹ ($\mathbb{E}_{1_N} g(\beta) - z_{0.975}\sqrt{\mathrm{Var}_{1_N} g(\beta)}$) were positive, we take
$$\phi(w) = -\Big(\mathbb{E}_w g(\beta) - z_{0.975}\sqrt{\mathrm{Var}_w g(\beta)}\Big).$$
$\phi(w) > 0$ is equivalent to moving the left endpoint below zero, thus changing a significant result to a non-significant one. Finally, to change to a significant result of the opposite sign, if the approximate credible interval's left endpoint were positive, we take
$$\phi(w) = -\Big(\mathbb{E}_w g(\beta) + z_{0.975}\sqrt{\mathrm{Var}_w g(\beta)}\Big).$$
On the full data, the right endpoint is above zero. On a weight such that $\phi(w) > 0$, the right endpoint has been moved below zero: the conclusion has changed from a positive result to a significant negative result.

¹Our approximate credible interval multiplies the posterior standard deviation by $z_{0.975}$, which is the 97.5% quantile of the standard normal, but we can replace this with other scalings without undue effort.
Under such assumptions, checking for non-robustness is equivalent to a) finding the maximum value of ϕ(w) subject to $w \in W_\alpha$ and b) checking its sign. The outcome of this comparison remains the same if we retain the feasible set, maximize the objective function $\phi(w) - c$, and compare the optimal value with $-c$, for c being any constant that does not depend on the weight. Out of later convenience, we set $c = \phi(1_N)$. As in Broderick et al. [31, Section 2], we define the Maximum Influence Perturbation to be the largest change, induced in a quantity of interest, by dropping no more than 100α% of the data. In our notation, it is the optimal value of the following optimization problem:
$$\max_{w \in W_\alpha} \; \big(\phi(w) - \phi(1_N)\big). \tag{3.1}$$
If the Maximum Influence Perturbation is more than $-\phi(1_N)$, then the conclusion is non-robust to small data removal. The set of observations that achieves the Maximum Influence Perturbation is called the Most Influential Set: to report it, we compute the optimal solution of eq. (3.1), and find its zero indices.
In general, the brute force approach to eq. (3.1) takes a prohibitively long time. We would need to enumerate every data subset that drops no more than 100α% of the original data, and, for each subset, re-run MCMC to re-estimate the quantity of interest. There are more than $\binom{N}{\lfloor N\alpha \rfloor}$ elements in $W_\alpha$. One of our later numerical studies involves N = 16,560 observations: even for α = 0.001, there are more than $10^{54}$ subsets to consider. Each Markov chain already takes a noticeable amount of time to construct: in this analysis, to generate 4,000 samples, we need to run the chain for 1 minute. The total time to compute the Maximum Influence Perturbation would be on the order of $10^{48}$ years.

3.3 Methods
As the brute force solution to eq. (3.1) is computationally prohibitive, we turn to approximation
methods. In this section, we provide a series of approximations to the Maximum Influence
Perturbation problem.

3.3.1 Taylor series


Our first approximation relies on the first-order Taylor series of the quantity of interest ϕ(w).
This idea of approximating the Maximum Influence Perturbation with Taylor series was first
proposed in Broderick et al. [31], in the context of Z-estimators. Our work extends this idea
to conclusions based on MCMC.
To be able to form a Taylor series, we require that the quantity of interest ϕ(w) is
differentiable with respect to the weight w. We are not aware of a complete theory (necessary
and sufficient conditions) for this differentiability. However, through Assumption 3.3.1 and
Assumption 3.3.2, we state a set of sufficient conditions.

Assumption 3.3.1. Let g be a function from $\mathbb{R}^P$ to the real line. ϕ(w) is a linear combination of the posterior mean and posterior standard deviation, i.e., there exist constants $c_1$ and $c_2$, which are independent of w, such that
$$\phi(w) = c_1\, \mathbb{E}_w g(\beta) + c_2 \sqrt{\mathrm{Var}_w g(\beta)}.$$
A typical choice of g is the function that returns the p-th coordinate of a P-dimensional vector.
It might appear that constraining ϕ(w) to be a linear combination of the posterior mean
and standard deviation is overly restrictive. However, this choice encompasses many cases of
practical interest: recall from section 3.2.2 that the quantities of interest for changing sign,
changing significance, and producing a significant result of the opposite sign, take the form of
Assumption 3.3.1. Furthermore, the choice of constraining ϕ(w) to be a linear combination
of the posterior mean and standard deviation in Assumption 3.3.1 is done out of convenience.
Our framework can also handle quantities of interest that involve higher moments of the
posterior distribution, and the function that combines these moments need not be linear,

but we omit these cases for brevity. However, we note that posterior quantiles in general do
not satisfy Assumption 3.3.1 and leave to future work the question of how to diagnose the
sensitivity of such quantities of interest.
Assumption 3.3.2. For any $w \in [0,1]^N \setminus \{0_N\}$, the following functions have finite expectations under the weighted posterior: $|g(\beta)|$, $g(\beta)^2$, $|L(d^{(n)} \mid \beta)|$ (for all n), $|g(\beta)L(d^{(n)} \mid \beta)|$ (for all n), and $|g(\beta)^2 L(d^{(n)} \mid \beta)|$ (for all n).

The assumption is mild. It is satisfied by, for instance, linear regression under a Gaussian likelihood with $g(\beta) = \beta_p$.
Under Assumption 3.2.1, Assumption 3.3.1, and Assumption 3.3.2, ϕ(w) is continuously differentiable with respect to w.

Theorem 3.3.1. Assume Assumption 3.2.1, Assumption 3.3.1, and Assumption 3.3.2. For any $\delta \in (0,1)$, ϕ(w) is continuously differentiable with respect to w on $\{w \in [0,1]^N : \max_n w_n \ge \delta\}$. The n-th partial derivative² at w is equal to $c_1 f + c_2 s$, where
$$f = \mathrm{Cov}_w\big(g(\beta),\, L(d^{(n)} \mid \beta)\big), \tag{3.2}$$
and
$$s = \frac{\mathrm{Cov}_w\big(g(\beta)^2,\, L(d^{(n)} \mid \beta)\big) - 2\,\mathbb{E}_w g(\beta) \times \mathrm{Cov}_w\big(g(\beta),\, L(d^{(n)} \mid \beta)\big)}{\sqrt{\mathrm{Var}_w g(\beta)}}. \tag{3.3}$$

²If $w_n$ lies on the boundary, the partial derivative is understood to be one-sided.

See section 3.B.1 for the proof. This theorem is a specific instance of the sensitivity of
posterior expectations with respect to log likelihood perturbations: for further reading, we
recommend Basu et al. [14], Diaconis and Freedman [52], Gustafson [80]. Theorem 3.3.1
establishes both the existence of the partial derivatives and their formula. Equation (3.2) is
the partial derivative of the posterior mean with respect to the weights, while eq. (3.3) is that
for the posterior standard deviation, with the understanding that the derivative is one-sided.
Based on Theorem 3.3.1, we define the n-th influence as the partial derivative of ϕ(w) at $w = 1_N$:
$$\psi_n := \frac{\partial \phi(w)}{\partial w_n}\bigg|_{w=1_N}.$$
Then, the first-order Taylor series approximation of $\phi(w) - \phi(1_N)$ is
$$\phi(w) - \phi(1_N) \approx \sum_{n=1}^N \psi_n (w_n - 1). \tag{3.4}$$
This approximation predicts that leaving out the n-th observation ($w_n = 0$) changes the quantity of interest by $-\psi_n$. Using eq. (3.4), we approximately solve eq. (3.1) by replacing its objective function but keeping its feasible set:
$$\max_{w} \; \sum_{n=1}^N (w_n - 1)\psi_n \quad \text{s.t.} \quad w_n \in \{0,1\}, \;\; \frac{1}{N}\sum_{n=1}^N (1 - w_n) \le \alpha. \tag{3.5}$$
Algorithm 4 Influence Estimate (EI)
Inputs:
    $c_1, c_2$  ▷ ϕ(w)-defining constants
    $(\beta^{(1)}, \ldots, \beta^{(S)})$  ▷ Markov chain
1: procedure EI($c_1$, $c_2$, $(\beta^{(1)}, \ldots, \beta^{(S)})$)
2:    $m \leftarrow \frac{1}{S}\sum_{s=1}^S g(\beta^{(s)})$
3:    $v \leftarrow \frac{1}{S}\sum_{s=1}^S g(\beta^{(s)})^2 - m^2$
4:    $\hat\psi \leftarrow (0, 0, \ldots, 0)$  ▷ N-dimensional vector
5:    for $n \leftarrow 1, N$ do
6:        $f \leftarrow \frac{1}{S}\sum_s g(\beta^{(s)}) L(d^{(n)} \mid \beta^{(s)}) - \big(\frac{1}{S}\sum_s g(\beta^{(s)})\big)\big(\frac{1}{S}\sum_s L(d^{(n)} \mid \beta^{(s)})\big)$
7:        $q \leftarrow \frac{1}{S}\sum_s g(\beta^{(s)})^2 L(d^{(n)} \mid \beta^{(s)}) - \big(\frac{1}{S}\sum_s g(\beta^{(s)})^2\big)\big(\frac{1}{S}\sum_s L(d^{(n)} \mid \beta^{(s)})\big)$
8:        $\hat{s} \leftarrow (q - 2mf)/\sqrt{v}$  ▷ Estimate of eq. (3.3)
9:        $\hat\psi_n \leftarrow c_1 f + c_2 \hat{s}$  ▷ Estimate of $\psi_n$
10:   end for
11:   return $\hat\psi$
12: end procedure
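In NumPy, Algorithm 4 can be vectorized along the lines of the following sketch. The array names are ours: g_draws holds $g(\beta^{(s)})$ for each draw, and loglik holds $L(d^{(n)} \mid \beta^{(s)})$ for each draw and observation.

import numpy as np

def estimate_influences(c1, c2, g_draws, loglik):
    # g_draws: shape (S,); loglik: shape (S, N).
    m = g_draws.mean()
    v = (g_draws ** 2).mean() - m ** 2
    # Sample covariances of g(beta) and g(beta)^2 with each L(d^(n) | beta).
    f = (g_draws[:, None] * loglik).mean(axis=0) - m * loglik.mean(axis=0)
    q = ((g_draws ** 2)[:, None] * loglik).mean(axis=0) \
        - (g_draws ** 2).mean() * loglik.mean(axis=0)
    s_hat = (q - 2.0 * m * f) / np.sqrt(v)  # estimate of eq. (3.3)
    return c1 * f + c2 * s_hat              # estimates of psi_n, shape (N,)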

Solving Equation (3.5) is straightforward. For any $w \in W_\alpha$, the objective function is equal to $\sum_{n: w_n = 0} (-\psi_n)$. Let $w(\alpha)$ be the optimal solution and $\Delta(\alpha)$ be the optimal value of eq. (3.5). We denote by $U(\alpha)$ the set of observations omitted according to $w(\alpha)$: $U(\alpha) := \{d_n : w(\alpha)_n = 0\}$. Let $r_1, r_2, \ldots, r_N$ sort the $\psi_n$ in increasing order: $\psi_{r_1} \le \psi_{r_2} \le \ldots \le \psi_{r_N}$. Let m be the smallest index such that $\psi_{r_{m+1}} \ge 0$: if none exists, set m to N. If $m \ge 1$, $w(\alpha)$ assigns weight 0 to the observations $r_1, r_2, \ldots, r_{\min(m, \lfloor N\alpha \rfloor)}$, and 1 to the remaining ones. Otherwise, m = 0 and $w(\alpha)$ assigns weight 1 to all observations. Following Broderick et al. [31], we call the optimal value of Equation (3.5) the Approximate Maximum Influence Perturbation (AMIP), and denote it by $\Delta(\alpha)$. It is equal to $-\sum_{m=1}^{\lfloor N\alpha \rfloor} \psi_{r_m} \mathbb{1}\{\psi_{r_m} < 0\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function.

3.3.2 Estimating the influence


To solve eq. (3.5), we need to compute the influence ψn . In this section, we use MCMC to
estimate ψn .
Because $\psi_n$ is the partial derivative at $w = 1_N$, Theorem 3.3.1 tells us that $\psi_n$ is a function of certain expectations and covariances under the full-data posterior. Therefore, the MCMC draws from the full-data posterior, which are already used to estimate $\phi(1_N)$, can be used to estimate $\psi_n$. Algorithm 4 shows how this is done. In a nutshell, we replace all population expectations with sample averages. The estimate of $\psi_n$ will be called $\hat\psi_n$.
Since ψ̂n is only an approximation of ψn , we are not able to solve eq. (3.5) exactly, but only
solve an approximation of it. Algorithm 5 details the procedure. The outputs of Algorithm 5
are point estimates of ∆(α), U (α), and w(α).

Algorithm 5 Sum of Sorted Influence Estimate (SoSIE)
Inputs:
    $c_1, c_2$  ▷ ϕ(w)-defining constants
    $(\beta^{(1)}, \ldots, \beta^{(S)})$  ▷ Markov chain
    α  ▷ Fraction of data to drop
1: procedure SoSIE($c_1$, $c_2$, $(\beta^{(1)}, \ldots, \beta^{(S)})$, α)
2:    $\hat\psi \leftarrow$ EI($c_1$, $c_2$, $(\beta^{(1)}, \ldots, \beta^{(S)})$)
3:    Find ranks $v_1, v_2, \ldots, v_N$ such that $\hat\psi_{v_1} \le \hat\psi_{v_2} \le \ldots \le \hat\psi_{v_N}$
4:    Find the smallest p such that $\hat\psi_{v_{p+1}} \ge 0$. If none exists, set p to N.
5:    If $p \ge 1$, $\widehat{U} \leftarrow \{d_{v_1}, \ldots, d_{v_{\min(p, \lfloor N\alpha \rfloor)}}\}$. Otherwise, $\widehat{U} \leftarrow \emptyset$
6:    $\widehat{\Delta} \leftarrow -\sum_{m=1}^{\lfloor N\alpha \rfloor} \hat\psi_{v_m} \mathbb{1}\{\hat\psi_{v_m} < 0\}$
7:    return $\widehat{\Delta}$, $\widehat{U}$
8: end procedure
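Correspondingly, a NumPy sketch of Algorithm 5, reusing the estimate_influences sketch above:

import numpy as np

def sosie(c1, c2, g_draws, loglik, alpha):
    psi_hat = estimate_influences(c1, c2, g_draws, loglik)
    order = np.argsort(psi_hat)                  # ranks v_1, ..., v_N
    n_drop = int(np.floor(alpha * psi_hat.size))
    candidates = order[:n_drop]
    U_hat = candidates[psi_hat[candidates] < 0]  # keep only negative influences
    Delta_hat = -psi_hat[U_hat].sum()
    return Delta_hat, U_hat                      # U_hat indexes the dropped data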

3.3.3 Confidence intervals for AMIP


$\widehat{\Delta}$ from algorithm 5 is a noisy point estimate of $\Delta(\alpha)$. One concern regarding the quality of $\widehat{\Delta}$ is noise due to sampling uncertainty of $(\beta^{(1)}, \ldots, \beta^{(S)})$. In this section, we design confidence intervals for $\Delta(\alpha)$. We begin by considering the special case when $(\beta^{(1)}, \ldots, \beta^{(S)})$ comes from exact sampling. Then, we relax the exact sampling assumption and consider general Markov chain Monte Carlo.

3.3.3.1 Exact sampling.


For certain priors and likelihoods, we are able to draw exact Monte Carlo samples from the posterior distribution, i.e., $(\beta^{(1)}, \ldots, \beta^{(S)})$ is an i.i.d. sample of size S drawn from the full-data posterior distribution. This happens for conjugate models [53], or for models in which convenient augmentation schemes have been discovered, such as Bayesian logistic regression with Polya-Gamma augmentation [167]. Conceptually, $\widehat{\Delta}$ can be thought of as an estimator constructed from an i.i.d. sample. However, the sample in question is not the data $\{d^{(n)}\}_{n=1}^N$, but $(\beta^{(1)}, \ldots, \beta^{(S)})$. To highlight the dependence between $\widehat{\Delta}$ and $(\beta^{(1)}, \ldots, \beta^{(S)})$, we will use the notation $\widehat{\Delta}(\beta^{(1)}, \ldots, \beta^{(S)})$. The estimator $\widehat{\Delta}$ is a complex, non-smooth function of the sample: the act of taking the minimum across the estimated influences $\hat\psi_n$ is non-smooth. We do not attempt to prove distributional results for this estimator and use such results to quantify uncertainty. Instead, we appeal to the bootstrap [57], a general-purpose technique to quantify the sampling uncertainty of estimators.
Our confidence interval construction proceeds in three steps. First, we define the so-called bootstrap distribution of $\widehat{\Delta}$. Second, we approximate this distribution with an empirical distribution based on Monte Carlo draws. Finally, we use the range spanned by quantiles of this empirical distribution as our confidence interval for $\Delta(\alpha)$.
To define the bootstrap distribution, consider the empirical distribution of the sample $(\beta^{(1)}, \ldots, \beta^{(S)})$:
$$\frac{1}{S}\sum_{i=1}^S \delta_{\beta^{(i)}}(\cdot).$$
We denote one draw from this empirical distribution by $\beta^{*(s)}$. A bootstrap sample is a set of S draws: $(\beta^{*(1)}, \beta^{*(2)}, \ldots, \beta^{*(S)})$. The bootstrap distribution of $\widehat{\Delta}$ is the distribution of $\widehat{\Delta}(\beta^{*(1)}, \beta^{*(2)}, \ldots, \beta^{*(S)})$, where the randomness is taken over the bootstrap sample but is conditional on the original sample $(\beta^{(1)}, \ldots, \beta^{(S)})$.
Clearly, the bootstrap distribution is discrete with finite support. If we chose to, we could enumerate its support and compute its probability mass function, by enumerating all possible values a bootstrap sample can take. However, this is time-consuming. It suffices to approximate the bootstrap distribution with Monte Carlo draws. The draw $\widehat{\Delta}(\beta^{*(1)}, \beta^{*(2)}, \ldots, \beta^{*(S)})$ is abbreviated $\widehat{\Delta}^*$: we generate a total of B such draws. When B increases, the empirical distribution of $(\widehat{\Delta}^*_1, \widehat{\Delta}^*_2, \ldots, \widehat{\Delta}^*_B)$ becomes a better approximation of the bootstrap distribution. However, the computational cost scales with B. In practice, B in the hundreds is commonplace: our numerical work uses B = 200.
We now define confidence intervals for $\Delta(\alpha)$. Each interval is parametrized by η, the nominal coverage level, which is valued in (0, 1). We compute two quantiles of the empirical distribution over $(\widehat{\Delta}^*_1, \widehat{\Delta}^*_2, \ldots, \widehat{\Delta}^*_B)$, the $(1-\eta)/2$ and $(1+\eta)/2$ quantiles³, and define the interval spanned by these two values as our confidence interval. By default, we set η = 0.95.

³We use R's quantile() to compute the sample quantiles. When $(1+\eta)/2 \times B$ is not an integer, the $(1+\eta)/2$ quantile is defined by linearly interpolating the order statistics.
One limitation of our current work is that we do not make theoretical claims regarding
the actual coverage of such confidence intervals. Although bootstrap confidence intervals can
always be computed, whether the actual coverage matches the nominal coverage η depends
on structural properties of the estimator and regularity conditions on the sample. To verify
the quality of these confidence intervals, we turn to numerical simulation. We leave to future
work the task of formulating reasonable assumptions and theoretically analyzing the actual
coverage.
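Putting the pieces of this subsection together, a sketch of the interval construction under exact sampling, reusing the estimate_influences sketch above (the inner AMIP recomputation mirrors lines 3 to 6 of Algorithm 5):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(c1, c2, g_draws, loglik, alpha, B=200, eta=0.95):
    # Percentile bootstrap interval for Delta(alpha) from i.i.d. draws.
    S = g_draws.shape[0]
    deltas = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, S, size=S)      # resample draws with replacement
        psi = estimate_influences(c1, c2, g_draws[idx], loglik[idx])
        n_drop = int(np.floor(alpha * psi.size))
        worst = np.sort(psi)[:n_drop]
        deltas[b] = -worst[worst < 0].sum()   # AMIP on the bootstrap sample
    return np.quantile(deltas, [(1 - eta) / 2, (1 + eta) / 2])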

3.3.3.2 General MCMC.


In the previous section, we made the simplifying assumption that exact sampling was possible. We now lift this assumption and handle the case in which $(\beta^{(1)}, \ldots, \beta^{(S)})$ truly comes from a Markov chain (such as the output of Hamiltonian Monte Carlo). This case is much more common in practice than the exact sampling case.
To construct confidence intervals, one idea is to use the previous section’s construction
without modification. In other words, apply the bootstrap to a non-i.i.d. sample: recall that
the Markov chain states are not independent of each other. Theoretically, it is known that
the bootstrap struggles on non-i.i.d. samples, for even simple estimators. For example, if
the estimator in question is the sample mean and the draws exhibit positive autocorrelation,
under mild regularity conditions, the bootstrap variance estimate seriously underestimates
the true sampling variance, even in the limit of infinite sample size [118, Theorem 2.2].
In our case, the bootstrap likely struggles on the sample means that are involved in the definition of $\widehat{\Delta}$: for instance, it is very common for $(\beta^{(1)}_p, \beta^{(2)}_p, \ldots, \beta^{(S)}_p)$ to exhibit positive autocorrelation in practice. Therefore, we have reason to be pessimistic about the ability of bootstrap confidence intervals to adequately cover $\Delta(\alpha)$.
Fundamentally, the bootstrap fails in the non-i.i.d. case because the draws that form
the bootstrap sample do not have any dependence, while the draws that form the original
sample do. To improve upon the bootstrap, one option is to resample in a way that respects
the original sample’s dependence structure. We recognize that the sample in question,
(β (1) , . . . , β (S) ), is a (multivariate) time series: we focus on methods that perform well under
time series dependence. One such scheme is the non-overlapping block bootstrap [38, 118].⁴ The sample $(\beta^{(1)}, \ldots, \beta^{(S)})$ is divided up into a number of blocks: each block is a vector of contiguous draws. Let L be the number of elements in a block, and let $M := \lfloor S/L \rfloor$ denote the number of blocks. The m-th block is defined as
$$B_m := \big(\beta^{((m-1)L+1)}, \ldots, \beta^{(mL)}\big).$$


To generate one sample from the non-overlapping block bootstrap distribution, we first
draw with replacement from the set of blocks M values: B1∗ , . . . , BM ∗
. Then, we write the
elements of these drawn blocks in a contiguous series. For example, when (β (1) , . . . , β (S) ) =
(β (1) , β (2) , β (3) , β (4) ) and L = 2, the two blocks are (β (1) , β (2) ), and (β (3) , β (4) ). The set of
possible samples from resampling include (β (1) , β (2) , β (1) , β (2) ) and (β (3) , β (4) , β (3) , β (4) ) but
not (β (1) , β (3) , β (1) , β (3) ).
The name “non-overlapping block bootstrap” comes from the fact that these blocks, viewed as sets, are disjoint from each other. While the name is needed in Lahiri [118] to distinguish this rule from other blocking rules, moving forward, as we only consider the above blocking rule, we will refer to the procedure simply as the block bootstrap. Intuitively, the block bootstrap sample
is a good approximation of the original sample if the latter has short-term dependence: in
such a case, the original sample itself can be thought of as the concatenation of smaller, i.i.d.
subsamples, and the generation of a block bootstrap sample mimics that. In well-behaved
probabilistic models with well-tuned algorithms, the MCMC draws can be expected to only
have short-term dependence, and the block bootstrap is a good choice.
The block bootstrap has one hyperparameter: the block length L. We would like both L
and M to be large: large L captures time series dependence at larger lags, and large M is
close to having many i.i.d. subsamples. However, since their product is constrained to be S,
the choice of L is a trade-off. In numerical studies, we set L = 10.
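The resampling step itself is a few lines; a minimal sketch, whose output indices can replace the i.i.d. index draw in the earlier bootstrap sketch:

import numpy as np

rng = np.random.default_rng(0)

def block_bootstrap_indices(S, L=10):
    # One non-overlapping block bootstrap resample: split 1..S into
    # M = floor(S / L) contiguous length-L blocks, draw M blocks with
    # replacement, and concatenate their indices.
    M = S // L
    starts = rng.integers(0, M, size=M) * L
    return np.concatenate([np.arange(s, s + L) for s in starts])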
Our construction of confidence intervals for general MCMC proceeds identically to the
previous section’s construction, except for the step of generating the bootstrap sample: instead
of drawing from the vanilla bootstrap, we draw from the block bootstrap. We will denote the
endpoints of such an interval by ∆lb (α) (lower endpoint) and ∆ub (α) (upper endpoint).
Similar to the previous section, we do not make theoretical claims on the actual coverage
of our block bootstrap confidence intervals: we verify the quality of the intervals through
later numerical studies.
⁴The original paper, Carlstein [38], did not use the term “non-overlapping block bootstrap” to describe the technique. The name comes from Lahiri [118].
3.3.4 Putting everything together
Now, we chain together the intermediate approximations from the previous sections to form
our final estimate of eq. (3.1). We then explain how to use it to determine non-robustness.
Our final estimate of the Maximum Influence Perturbation is the interval $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$ constructed in section 3.3.3. This approximation is the result of combining section 3.3.3, where $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$ approximates $\Delta(\alpha)$, with section 3.3.1, where $\Delta(\alpha)$ approximates the Maximum Influence Perturbation. Our final estimate of the Most Influential Set is $\widehat{U}$, which is an output of algorithm 5. This approximation is the result of combining section 3.3.2, where $\widehat{U}$ approximates $U(\alpha)$, with section 3.3.1, where $U(\alpha)$ approximates the Most Influential Set.
To determine non-robustness, we use $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$ as follows. Recall that we have assumed for simplicity that the decision threshold is zero, and that $\phi(1_N) < 0$. We believe that the interval $[\phi(1_N) + \Delta_{lb}(\alpha), \phi(1_N) + \Delta_{ub}(\alpha)]$ contains the quantity of interest after removing the most extreme observations. Therefore, our assessment of non-robustness depends on the relationship between this interval and the threshold zero in the following way (see the sketch after this list):

• ϕ(1N ) + ∆lb (α) > 0. Hence, [ϕ(1N ) + ∆lb (α), ϕ(1N ) + ∆ub (α)] is entirely on the opposite
side of 0 compared to ϕ(1N ). We declare the analysis to be non-robust.

• ϕ(1N ) + ∆ub (α) < 0. Hence, [ϕ(1N ) + ∆lb (α), ϕ(1N ) + ∆ub (α)] is entirely on the same
side of 0 compared to ϕ(1N ). We do not declare non-robustness.

• ϕ(1N ) + ∆lb (α) ≤ 0 ≤ ϕ(1N ) + ∆ub (α). The interval contains 0, and we abstain from
making an assessment about non-robustness. We recommend practitioners run more
MCMC draws to reduce the width of the confidence interval.
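In code, this three-way assessment is a small function of the interval endpoints; a sketch, with hypothetical names, under the standing assumption that $\phi(1_N) < 0$:

def assess_robustness(phi_full, delta_lb, delta_ub):
    # phi_full is phi(1_N); delta_lb and delta_ub come from section 3.3.3.
    lo, hi = phi_full + delta_lb, phi_full + delta_ub
    if lo > 0:
        return "non-robust"                       # interval past the threshold
    if hi < 0:
        return "no non-robustness detected"
    return "inconclusive: generate more MCMC draws"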

While $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$ plays the main role in determining non-robustness, $\widehat{U}$ plays a supporting role. For problems in which running MCMC a second time is not prohibitively expensive, we can refit the analysis without the data points in $\widehat{U}$. Doing the refit is one way of verifying the quality of our assessment of non-robustness: if $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$ declares that the conclusion is non-robust, and the conclusion truly changes after removing $\widehat{U}$ and refitting, then we conclusively know that our assessment is correct.

3.4 Experimental Setup


For the rest of the chapter, we check the quality of our approximations empirically on real data analyses. In this section, we only describe the checks; for the actual results, see section 3.5.
A practitioner with a particular definition of “small data” can set α to reflect their concern.
We consider a number of α values. We set the maximum value of α to be 0.01. This choice is
motivated by Broderick et al. [31]. Many analyses are non-robust to removing 1% of the data,
and we a priori think that α > 1% is a large amount of data to remove. We vary log10 (α) in
an equidistant grid of length 10 from −3 to −2. The ten values are 0.10%, 0.13%, 0.17%,
0.22%, 0.28%, 0.36%, 0.46%, 0.60%, 0.77% and 1.00%.
For the range of dropout fraction specified above and across three common quantities of
interest corresponding to sign, significance, and significant result of opposite sign changes, we

walk through what a practitioner would do in practice (although they would choose only one
α and one decision). Our method proposes an influential data subset and a change in the
quantity of interest, represented by a confidence interval.
Ideally, we want to check if our interval includes the result of the worst-case data to leave
out. We are unable to do so, since we do not know how to compute the worst-case result in a
reasonable amount of time. We settle for the following checks.
In the first check, for a particular MCMC run, we plot how the change from re-running MCMC without the proposed data compares to the confidence interval. We recommend the user run this check if re-running MCMC a second time is not too computationally expensive.
Unfortunately, such refitting does not paint a complete picture of approximation quality.
For instance, the MCMC run might be unlucky since MCMC is random. To be more
comprehensive, we run additional checks. We do not expect users to run these tests, as
their computational costs are high. The central question is how frequently (under MCMC
randomness) the confidence interval includes the result after removing the worst-case data.
Since we estimate the worst-case change with a linear approximation, a natural way to answer
this question is with two separate checks: while section 3.4.1 checks how frequently the
confidence interval includes the result of the linear approximation i.e. the AMIP, section 3.4.3,
checks whether the linear approximation is good. To understand why we observe the coverage
in section 3.4.1, in section 3.4.2 we isolate the impact of the sorting step in the construction
of our confidence interval.

3.4.1 Estimate coverage of confidence interval for AMIP


We estimate how frequently [∆lb (α), ∆ub (α)] covers the AMIP by using another level of Monte
Carlo. Recall that [∆lb (α), ∆ub (α)] is intended to be a confidence interval covering ∆(α)
a fraction η of the time. If the estimated coverage is far from η, we have evidence that
[∆lb (α), ∆ub (α)] does not achieve the desired nominal coverage.
We draw J Markov chains: we set J = 960. On each chain, we estimate the influences and construct the confidence interval $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$. From all the chains, for each n, we have J estimates of $\psi_n$. We take the sample mean across chains, and denote this by $\psi^*_n$: because of variance reduction through averaging, $\psi^*_n$ is a much better estimate of $\psi_n$ than any individual $\hat\psi_n$. We denote by $U^*(\alpha)$ the indices of the $\lfloor N\alpha \rfloor$ most negative $\psi^*_n$. We sort the $\psi^*_n$ across n and sum the $\lfloor N\alpha \rfloor$ most negative ones. This sum is denoted $\Delta^*(\alpha)$: we use it in place of the ground truth $\Delta(\alpha)$. We use the sample mean of the indicators $\mathbb{1}\{\Delta^*(\alpha) \in [\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]\}$ as the point estimate of the coverage. We also report a 95% confidence interval for the coverage.
This interval is computed using binomial tests designed in Clopper and Pearson [44] and
implemented as R’s binom.test() function.
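A sketch of the coverage computation; SciPy's binomtest with method="exact" computes the same Clopper-Pearson interval that R's binom.test() reports.

import numpy as np
from scipy.stats import binomtest

def coverage_estimate(hits, level=0.95):
    # hits: one boolean per chain, indicating whether that chain's
    # confidence interval contained the proxy ground truth Delta*(alpha).
    k, J = int(np.sum(hits)), len(hits)
    ci = binomtest(k, J).proportion_ci(confidence_level=level, method="exact")
    return k / J, (ci.low, ci.high)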

3.4.2 Estimate coverage of confidence intervals for sum-of-influence


It is possible that the estimated coverage of $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$ is far from the nominal η. We suspect that such a discrepancy comes from the sorting of the $\hat\psi_n$ in the construction of $\widehat{\Delta}$. To modularize out the sorting, we consider a target of inference that is simpler than $\Delta(\alpha)$. At a high level, we fix an index set I, and define the target to be the sum of influences in I: $\sum_{n \in I} \psi_n$. On each sample $(\beta^{(1)}, \ldots, \beta^{(S)})$, our point estimate is $\sum_{n \in I} \hat\psi_n$: this estimate does not involve any sorting, while $\widehat{\Delta}$ does. We construct the confidence interval, $[V_{lb}, V_{ub}]$, from the block bootstrap distribution of $\sum_{n \in I} \hat\psi_n$. The difference between $[V_{lb}, V_{ub}]$ and $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$, which is constructed from the block bootstrap distribution of $\widehat{\Delta}$, is that the former is not based on sorting the influence estimates. If the actual coverage of $[V_{lb}, V_{ub}]$ is close to the nominal value, we have evidence that the miscoverage of $[\Delta_{lb}(\alpha), \Delta_{ub}(\alpha)]$ is due to this sorting.
From section 3.4.1, we use $\psi^*_n$ and the associated $\Delta^*(\alpha)$ and $U^*(\alpha)$ as replacements for ground truths. We set I to be $U^*(\alpha)$. We run another set of J Markov chains: for each chain, we construct the confidence interval $[V_{lb}, V_{ub}]$ by sampling from the block bootstrap distribution of the estimator $\sum_{n \in I} \hat\psi_n$. We report the sample mean of the indicators $\mathbb{1}\{\sum_{n \in I} \psi_n \in [V_{lb}, V_{ub}]\}$ as our point estimate of the coverage. We also report a 95% confidence interval for the coverage. This interval is computed using the binomial tests designed in Clopper and Pearson [44] and implemented as R's binom.test() function.

3.4.3 Re-running MCMC on interpolation path


Ideally, we want to know the difference between the Maximum Influence Perturbation and the AMIP. As we have established, we do not know how to compute the former efficiently. We settle for checking the linearity approximation made in section 3.3.1, i.e., estimating $\phi(w) - \phi(1_N)$ with $\sum_n (w_n - 1)\psi_n$. In particular, we expect the first-order Taylor series approximation to be arbitrarily good for w arbitrarily close to $1_N$. By necessity, we are interested in some $w^*$ that has a non-trivial distance from $1_N$. Plotting the quantity of interest ϕ(w) on an interpolation path between $1_N$ and $w^*$, we get a sense of how much we have diverged from linearity by that point.
From section 3.4.1, we have $\psi^*_n$ as our replacement for the ground truth $\psi_n$. We focus on α = 0.05: 5% is a large amount of data to remove, and a priori we expect the linear approximation to be poor. Recall that $U^*(0.05)$ is the set of $\lfloor 0.05N \rfloor$ observations that are most influential according to the sorted $\psi^*_n$. Let $w^*$ be the N-dimensional weight vector that is 0 for observations in $U^*(0.05)$ and 1 otherwise. For $\zeta \in [0, 1]$, the linear approximation of $\phi(\zeta w^* + (1 - \zeta)1_N)$ is $\phi(1_N) + \zeta \Delta^*(0.05)$. In the extreme ζ = 0, we do not leave out any data. In the extreme ζ = 1, we leave out the entirety of $U^*(0.05)$, i.e., 5% of the data. An intermediate value ζ roughly corresponds⁵ to removing (5ζ)% of the data. We discretize [0, 1] with 16 values: 0, 0.0010, 0.0016, 0.0027, 0.0044, 0.0072, 0.0118, 0.0193, 0.0316, 0.0518, 0.0848, 0.1389, 0.2276, 0.3728, 0.6105, 1. For each value on this grid, we run MCMC to estimate $\phi(\zeta w^* + (1 - \zeta)1_N)$, and compare it to the linear approximation.

⁵This correspondence is not exact, since for ζ < 1, all observations in $U^*(0.05)$ are included in the analysis, only with downplayed contributions.
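A sketch of this check; run_mcmc_qoi, an assumed helper that re-runs MCMC under the given data weights and returns the estimated quantity of interest, does the expensive work.

import numpy as np

ZETAS = np.array([0, 0.0010, 0.0016, 0.0027, 0.0044, 0.0072, 0.0118, 0.0193,
                  0.0316, 0.0518, 0.0848, 0.1389, 0.2276, 0.3728, 0.6105, 1.0])

def path_check(run_mcmc_qoi, w_star, phi_full, delta_star):
    # Compare refits along the path zeta * w_star + (1 - zeta) * 1_N with
    # the linear prediction phi(1_N) + zeta * Delta*(0.05).
    refits = np.array([run_mcmc_qoi(z * w_star + (1.0 - z)) for z in ZETAS])
    linear = phi_full + ZETAS * delta_star
    return refits, linear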

3.5 Experiments
In our experiments, we find that our approximation works well for a simple linear model. But
we find that it can struggle in hierarchical models with more complex structure.
3.5.1 Linear model
We consider a slight variation of a microcredit analysis from Meager [139]. In Meager [139],
conclusions regarding microcredit efficacy were based on ordinary least squares (OLS). We
refer the reader to Broderick et al. [31, Section 4.3.2] for investigations of such conclusions’
non-robustness. Here, we instead consider an analogous Bayesian analysis using MCMC, and
we examine the robustness of conclusions from this analysis.
Our quality checks suggest that our approximation is accurate. Our confidence interval
contains the refit after removing the proposed data. The actual coverage of the confidence
interval for AMIP is close to the nominal coverage. The actual coverage of the confidence
interval for sum-of-influence is also close to the nominal coverage. Even for dropping 5% of
the data, the linear approximation is still adequate.

3.5.1.1 Background and full-data fit.


Meager [139] studies the microcredit data from Angelucci et al. [5], which was an RCT conducted in Mexico. There are N = 16,560 households in the RCT. Each observation is $d^{(n)} = (x^{(n)}, y^{(n)})$, where $x^{(n)}$ is the treatment status and $y^{(n)}$ is the profit measured. The log-likelihood for the n-th observation is $L(d^{(n)} \mid \mu, \theta, \sigma) = -\frac{1}{2\sigma^2}(y^{(n)} - \theta x^{(n)} - \mu)^2 - \frac{1}{2}\log(2\pi\sigma^2)$. Here, the model parameters are the baseline profit μ, the treatment effect θ, and the noise scale σ. The most interesting parameter is θ: as $x^{(n)}$ is binary, θ compares the means in the treatment and control groups. Meager [139] estimates the model parameters with OLS.
Our variation of the above analysis is as follows. We put t location-scale distribution
priors on the model parameters, with the additional constraint that the noise scale σ is
positive: for exact values of the prior hyperparameters, see section 3.C. We use Hamiltonian
Monte Carlo (HMC) as implemented in Stan [39] to approximate the full-data posterior. We
draw S = 4000 samples.
Figure 3.5.1 plots the histogram of the treatment effect draws as well as key sample
summaries. The sample mean is equal⁶ to −4.55. The sample standard deviation is 5.79.
These values are close to the point estimate and the standard error from OLS [139]. Our
estimate of the approximate credible interval’s left endpoint is −16.10; our estimate of the
right endpoint is 6.99. Based on these summaries, an analyst would likely conclude that
while the posterior mean of the effect of microcredit is negative, the uncertainty interval
covers zero, so they cannot confidently conclude that microcredit either helps or hurts. These
conclusions are in line with Meager [139].

3.5.1.2 Sensitivity results.


Running our approximation takes very little time compared to running the original analysis. Generating the draws in fig. 3.5.1 took 3 minutes on MIT Supercloud [177]. For one α and one quantity of interest, it took less than 5 seconds to make a confidence interval for what happens if we remove the most extreme data subset. A user might check approximation quality by dropping a proposed subset and re-running MCMC: each such check took us around 3 minutes, the runtime of the original analysis.
⁶We round to two decimal places in our numerical studies.
Figure 3.5.1: (Linear model) Histogram of treatment effect MCMC draws. The blue line
indicates the sample mean. The dashed red line is the zero threshold. The dotted blue lines
indicate estimates of approximate credible interval’s endpoints.

Figure 3.5.2: (Linear model) Confidence intervals and refits. At maximum, we remove 1% of the data. Each panel corresponds to a target conclusion change: ‘sign’ is the change in sign, ‘sig’ is the change in significance, and ‘both’ is the change in both sign and significance. Error bars are confidence intervals for the refit after removing the most extreme data subset. Each ‘x’ is the refit after removing the proposed data and re-running MCMC. The dotted blue line is the fit on the full data.

In fig. 3.5.2, we plot our confidence intervals and the results after removing the proposed data. Although the confidence intervals are wide, they are still useful. Across quantities of interest and removal fractions, our intervals contain the refit after removing the proposed data. For changing sign, our method predicts that there exists a data subset of relative size at most 0.1% such that if we remove it, we change the posterior mean's sign. Refitting after removing the proposed data confirms this prediction. For changing significance, our method predicts that there exists a data subset of relative size at most 0.36% such that if we remove it, we change the sign of the approximate credible interval's right endpoint: refitting confirms this prediction. Our method is not able to predict whether the result can be changed to a significant effect of the opposite sign for these α values and this number of samples: we recommend a larger number of MCMC samples.

Figure 3.5.3: (Linear model) Monte Carlo estimate of AMIP confidence interval’s coverage.
Each panel corresponds to a target conclusion change. The dashed line is the nominal level
η = 0.95. The solid line is the sample mean of the indicator variable for the event that ground
truth is contained in the confidence interval. The error bars are confidence intervals for the
population mean of these indicators.

3.5.1.3 Additional quality checks.


Figure 3.5.3 shows that the actual coverage of the confidence interval for the AMIP is close
to the nominal one, across α. As the half-width of each error bar is small (only 0.02), we
believe that the difference between the true coverage and our point estimate of it is small.
For either ‘sign’ or ‘both’ QoI, the error bars do not contain the nominal η. However, the
difference between the point estimate and the nominal η is only 0.03 at worst, which is small.
For the ‘sig’ QoI, the point estimate is within 0.005 of the nominal value, and the error bars
contain the nominal η.
Figure 3.5.4 shows that the actual coverage of the confidence interval for the sum-of-
influence is close to the nominal one across α. The absolute errors between our estimate of
coverage and the nominal η are similar to those seen in fig. 3.5.3. This success suggests that
the default block length, L = 10, is appropriate for this problem.
Figure 3.5.5 shows that the linear approximation works very well. It is somewhat
remarkable that the linear approximation is this good even after dropping 5%, which we
consider to be a large fraction of data. The horizontal axis (‘scale’) is the same as ζ in
section 3.4.3. For all quantities of interest, the linear approximation and the refit lie mostly
on top of each other: towards the right end of each panel, the approximation slightly
underestimates the refit.

3.5.2 Hierarchical model on microcredit data


We consider a part of the analysis of microcredit done in Meager [140]. Originally, Meager
studies a number of impacts made by microcredit, using data from seven separate RCTs ana-
lyzed under a hierarchical model fitted with MCMC. In Broderick et al. [31], this hierarchical
model is fitted using variational inference [23], and the authors investigate the non-robustness
of the conclusions based on that fit. Here, we focus on only a component of the hierarchical
model. We fit this component, which is still a hierarchical model in itself, using MCMC, and

Figure 3.5.4: (Linear model) Monte Carlo estimate of sum-of-influence confidence interval’s
coverage. Each panel corresponds to a target conclusion change. The dashed line is the
nominal level η = 0.95. The solid line is the sample mean of the indicator variable for the
event that ground truth is contained in the confidence interval, and error bars are confidence
intervals for the population mean of these indicators.

examine the fit’s non-robustness.


Our approximation does not work as well as it did for the linear model. For the particular
MCMC run, our confidence interval does not contain the refit after removing proposed data.
The confidence interval for AMIP undercovers: the relative error between estimated coverage
and nominal coverage is at most 9.1%. The confidence interval for the sum-of-influence also
undercovers: at worst, the relative error is 14.7%. The linear approximation is adequate for
the posterior mean even after removing 5%. For the credible endpoints, the approximation is
good until removing roughly 1.8% of the data, and breaks down after that.
A priori, we think that α > 1% is a large data fraction to remove, and we are not worried
about the Maximum Influence Perturbation for such α. So, that the linear approximation
stops working after 1.8% is not a cause for concern. It is more pressing to improve the
confidence intervals. It is likely that a problem-dependent block length L will outperform the
default L = 10.

3.5.2.1 Background and full-data fit.


To study the relationship between microcredit and profit, Meager [140] combines the data
from Angelucci et al. [5] with that from Attanasio et al. [10], Augsburg et al. [11], Banerjee
et al. [12], Crépon et al. [45], Karlan and Zinman [106], Tarozzi et al. [194]. In the aggregated data, each observation is a household, with $d^{(n)} = (x^{(n)}, y^{(n)}, g^{(n)})$ where $x^{(n)}$ is the treatment status, $y^{(n)}$ is the profit measured, and $g^{(n)}$ indicates the household’s country. Meager [140]
uses a tailored hierarchical model that simultaneously estimates a number of effects. This
model separates the dataset into three parts: households with negative profit, households
with zero profit, and households with positive profit. Microcredit is modeled to have an
impact on the proportion of data assigned to each part: for households with non-zero profit,
microcredit is modeled to have an impact on the location and spread of the log of absolute
profit.
For our experiment, we will not look at all the impacts estimated by Meager [140]’s model.

Figure 3.5.5: (Linear model) Quality of the linear approximation. Each panel corresponds to
a target conclusion change. The solid blue line is the full-data fit. The horizontal axis is the
distance from the weight that represents the full data. We plot both the refit from rerunning
MCMC and the linear approximation of the refit.

We focus only on how microcredit impacts the households with negative realizations of profit.
Meager [140]’s model is such that to study this impact, it suffices to a) filter out observations
with non-negative profit from the aggregated data and b) use only a model component rather
than the entire model.
The dataset on households with negative profits has 3,493 observations. The relevant
model component from Meager [140] is the following. They model all households in a given
country as exchangeable, and “share strength” across countries. The absolute value of the
profit is modeled as coming from a log-normal distribution. If the household is in country $k$, this distribution has mean $\mu_k^{(\text{country})} + \tau_k^{(\text{country})} x^{(n)}$ and variance $\exp\left(\xi_k^{(\text{country})} + \theta_k^{(\text{country})} x^{(n)}\right)$, where $(\mu_k^{(\text{country})}, \tau_k^{(\text{country})}, \xi_k^{(\text{country})}, \theta_k^{(\text{country})})$ are latent parameters to be learned. In other
words, access to microcredit has country-specific impacts on the location and scale of the
log of absolute profit. To borrow strength, the above country-specific parameters are modeled
as coming from a common distribution. For instance, there exists a global parameter, $\tau$, such
that the $\tau_k^{(\text{country})}$’s are a priori independent Gaussians centered at $\tau$. For complete specification
of the model, i.e., the list of all global parameters and the prior choice, see section 3.C.
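To make the component concrete, here is a minimal numpy sketch of its log-likelihood for the negative-profit households; the function and variable names are our own hypothetical choices, not Meager [140]'s code, and the spread is parameterized on the log-variance scale as in the text.

import numpy as np
from scipy.stats import norm

def component_log_lik(x, y, country, mu, tau, xi, theta):
    # x: treatment indicators; y: (negative) profits; country: integer labels
    # indexing the country-specific parameters mu, tau (location of log|profit|)
    # and xi, theta (log-variance of log|profit|).
    log_abs_profit = np.log(np.abs(y))
    loc = mu[country] + tau[country] * x
    scale = np.sqrt(np.exp(xi[country] + theta[country] * x))
    # |profit| is log-normal, so log|profit| is Gaussian.
    return norm.logpdf(log_abs_profit, loc=loc, scale=scale).sum()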
Roughly speaking, τ is an average treatment effect across countries. We use S = 4000
HMC draws to approximate the posterior. Figure 3.5.6 plots the histogram of the treatment
effect draws and sample summaries. The sample mean is equal to 0.09. The sample standard
deviation is 0.09. These values are in agreement with the mean and standard deviation
estimates obtained from fitting on the original model and data [140]. Our estimate of the
approximate credible interval’s left endpoint is −0.09; our estimate of the right endpoint is
0.27.
Based on the summaries in fig. 3.5.6, an analyst might come to a decision based on either
(1) the observation that the posterior mean is positive, or (2) the observation that the
uncertainty interval covers zero and therefore they cannot be confident of the sign of the
unknown parameter.

Figure 3.5.6: (Hierarchical model for microcredit) Histogram of treatment effect MCMC
draws. See the caption of fig. 3.5.1 for the meaning of the distinguished vertical lines.

3.5.2.2 Sensitivity results.


Running our approximation takes very little time compared to running the
original analysis. Generating the draws in fig. 3.5.6 took 8 minutes. For one α and one
quantity of interest, it took less than 15 seconds to make a confidence interval for what
happens if we remove the most extreme data subset. A user might check approximation
quality by dropping a proposed subset and re-running MCMC: each such check took us
around 8 minutes, the runtime of the original analysis.
Figure 3.5.7 plots our confidence intervals and the result after removing the proposed
data. In general, our confidence interval predicts a more extreme change than the actual refit
achieves. The interval is therefore not conservative: if it predicts that a change is achievable,
we cannot always trust that such a change is possible. The refit is not a monotone function of
the proposed data’s size in the case of ‘both’ and ‘sig’. The non-monotonicity indicates that
not all observations in the proposed data induce the right direction of change (upon their
removal). For instance, in the case of ‘sig’, we aim to increase the credible left endpoint, but
actually, the endpoint decreases between α = 0.46% and α = 0.60%. Since the proposed data
is $\hat{U}$ from algorithm 5, it is apparent that the proposed data for α = 0.46% is nested in the
proposed data for α = 0.60%. This means that some observations in the difference between
these subsets actually decrease the left endpoint upon removal, rather than increase it.
Our method is not able to predict whether the posterior mean can change sign for these α
values and this number of samples; likewise, our method is not able to predict whether the
result can be changed to a significant effect of the opposite sign. In either case, we recommend
a larger number of MCMC samples. For changing significance, our method predicts there
exists a data subset of relative size at most 0.60% such that if we remove it, we change the
sign of the approximate credible interval’s left endpoint. However, refitting does not confirm
this prediction.

3.5.2.3 Additional quality checks.


Figure 3.5.8 shows that the confidence interval for ∆(α) undercovers, but the degree
of undercoverage is arguably mild. Our confidence interval for the true coverage does not
contain the nominal η except for the smallest α. As α increases, our point estimate of the
coverage generally decreases: for the largest α, the difference between our point estimate and
the nominal η is 0.08, which translates to a relative error of 8.4%. If we compare η with the

Figure 3.5.7: (Hierarchical model for microcredit) Confidence interval and refit. See the
caption of fig. 3.5.2 for meaning of annotated lines.

Figure 3.5.8: (Hierarchical model for microcredit) Monte Carlo estimate of AMIP confidence
interval’s coverage. See the caption of fig. 3.5.3 for the meaning of the error bars and the
distinguished lines.

lower endpoint of our confidence interval for the true coverage, the worst relative error is
9.1%.
Figure 3.5.9 shows that the confidence interval for sum-of-influence has the right coverage
for sign change, but undercovers for significance change and generating a significant result of
the opposite sign. At worst, in the case of ‘sig’, the relative error between the nominal η and
our estimate of true coverage is 14.7%.
Intuitively, the block bootstrap underestimates uncertainty if the block length is not large
enough to overcome the time series dependence in the MCMC samples. The miscoverage
suggests that the default block length, L = 10, is too small for this problem. One potential
reason for the difference in coverage between ‘sign’ and ‘sig’ is that the estimate of influence
for ‘sign’ involves fewer objects than that for ‘sig’. While an estimate of influence
for ‘sign’ involves $g(\beta)$ and $L(d^{(n)} \mid \beta)$, an estimate of influence for ‘sig’ involves $g(\beta)$,
$L(d^{(n)} \mid \beta)$, and $g(\beta)^2$. It is possible that the default block length is enough to capture time
series dependence for $g(\beta)$ and $L(d^{(n)} \mid \beta)$, but is inadequate for $g(\beta)^2$.
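To make the mechanics concrete, the following is a minimal sketch, of our own construction, of a block bootstrap over MCMC draws with non-overlapping blocks of length L; the statistic and chain below are hypothetical.

import numpy as np

def block_bootstrap_ci(draws, statistic, L=10, B=1000, eta=0.95, seed=0):
    # draws: shape (S,) array of correlated MCMC samples; statistic: a function
    # of a resampled chain; L: block length; B: number of bootstrap replicates.
    rng = np.random.default_rng(seed)
    n_blocks = len(draws) // L
    # Cut the chain into contiguous blocks to preserve local dependence.
    blocks = draws[: n_blocks * L].reshape(n_blocks, L)
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n_blocks, size=n_blocks)  # resample blocks
        stats[b] = statistic(blocks[idx].reshape(-1))
    return np.quantile(stats, [(1 - eta) / 2, (1 + eta) / 2])

# Hypothetical usage: interval for the mean of an autocorrelated chain.
rng = np.random.default_rng(1)
chain = np.zeros(4000)
for s in range(1, 4000):
    chain[s] = 0.9 * chain[s - 1] + rng.normal()
print(block_bootstrap_ci(chain, statistic=np.mean))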
Figure 3.5.10 provides evidence that the linear approximation is adequate for ζ less
than 0.3728 for ‘both’ QoI and ‘sig’, but is grossly wrong for larger ζ. Using the rough

Figure 3.5.9: (Hierarchical model for microcredit) Monte Carlo estimate of sum-of-influence
confidence interval’s coverage. See the caption of fig. 3.5.4 for the meaning of the panels and
the distinguished lines.

Figure 3.5.10: (Hierarchical model for microcredit) Quality of linear approximation. See the
caption for fig. 3.5.5 for the meaning of the panels and the distinguished lines.

correspondence between ζ and amount of data dropped, we say that the linear approximation
is adequate until dropping 1.8% of the data. For ‘both’ QoI, the refit plateaus after dropping
1.8%, while the linear approximation continues to decrease. For ‘sig’, the refit decreases after
dropping 1.8%, while the linear approximation continues to increase. The approximation is
good for ‘sign’ even after removing 5% of the data: the refit and the prediction lie on top of
each other for ‘sign’.

3.5.3 Hierarchical model on tree mortality data


In the final experiment, we break from microcredit and look at ecological data. In particular,
we consider a slight tweak of the analysis of European tree mortality from Senf et al. [183].
These authors use a high-dimensional mixed effects model, fitted with MCMC, to identify
the impact of drought on tree death.
Our approximation also struggles in this case. For the particular MCMC run used
to estimate the full-data posterior, our confidence interval does not contain the refit after
removing the proposed data. As each MCMC run is already highly time-consuming, we do not

run quality checks on the whole dataset. We settle for running quality checks on a subsample
of the data. On the subsampled data, the confidence interval for AMIP undercovers: the
undercoverage is severe for one of the quantities of interest. However, the confidence interval
for sum-of-influence is close to achieving the nominal coverage. For all three quantities of
interest, the linear approximation is good up to removing roughly 1.1% of the data. For two
of the three, it breaks down afterwards; for the remaining one, it continues to be good up to
3%, then falters.
Once again, we think that dropping more than 1% of the data is already removing a large
fraction. We are not worried about the Maximum Influence Perturbation for such α. So, that
the linear approximation stops working after 1.1% is not a cause for concern.

3.5.3.1 Background and full-data fit


Senf et al. [183] studies the relationship between drought and tree death in Europe. To
identify the association, they have compiled a dataset with N = 87,390 observations. Europe
is divided into 2,913 regions, and the data spans 30 years: each observation is a set of
measurements made in a particular region, which we denote as $l^{(n)}$, and at a particular year,
which we denote as $t^{(n)}$. For our purposes, it suffices to know that the measurement of (the
opposite of) drought is called climatic water balance, and we denote it as $x^{(n)}$: larger values
of $x^{(n)}$ indicate that more water is available, i.e., there is less drought. The response of interest,
$y^{(n)}$, is excess death of tree canopy.
In our experiment, we mostly replicate [183]’s probabilistic model: we use the same
likelihood, and make only an immaterial modification in the choice of priors. For the
likelihood, [183] models each $y^{(n)}$ as a realization from an exponentially modified Gaussian
distribution. Recall that such a distribution has three parameters, $(\mu, \sigma, \lambda)$, and a random
variate can be expressed as the sum of a normal variate $\mathcal{N}(\mu, \sigma^2)$ and an exponential
variate with rate $\lambda$. When modelling $\{y^{(n)}\}_{n=1}^N$, the model uses the same $\sigma$ and $\lambda$ for all
observations. However, the mean $\mu$ is a function of $n$. It is the sum of three components.
The first is an affine function of $x^{(n)}$, i.e., $\mu + \theta x^{(n)}$ for some latent parameters $\mu$ and $\theta$. The
second is a smoothing spline of $x^{(n)}$: it is included to capture non-linear relationships, but
we do not go into details here. The third contains the random effects for the region $l^{(n)}$
and the time $t^{(n)}$: if the observation is located at $l$ and took place during $t$, this term is
$(\mu_t^{(\text{time})} + \mu_l^{(\text{region})}) + (\theta_t^{(\text{time})} + \theta_l^{(\text{location})})x^{(n)}$.
At a high level, both Senf et al. [183]’s prior and our prior share strength across regions
and times by modeling the random effects as coming from common global distributions.
However, while Senf et al. [183] uses an improper prior, we use a proper one. Numerically,
there is no perceptible difference between the two. Theoretically, we prefer working with
proper priors to avoid the integrability issue mentioned around Assumption 3.2.1. For
complete specification of our model, see section 3.C.
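As a sketch of this likelihood, the snippet below evaluates the exponentially modified Gaussian log-density with scipy and assembles a simplified version of the mean structure; the conversion from our (µ, σ, λ) parametrization to scipy's shape parameter K = 1/(σλ) is the main subtlety, and the helper shown is a hypothetical simplification that omits the spline term.

import numpy as np
from scipy.stats import exponnorm

def emg_log_lik(y, mu_n, sigma, lam):
    # y ~ Normal(mu_n, sigma^2) + Exponential(rate=lam); scipy's exponnorm
    # uses the shape parameter K = 1 / (sigma * lam).
    K = 1.0 / (sigma * lam)
    return exponnorm.logpdf(y, K, loc=mu_n, scale=sigma).sum()

def mean_structure(x, t, l, mu, theta, mu_time, mu_region, theta_time, theta_region):
    # Affine term plus time and region random effects; the spline is omitted.
    return mu + theta * x + mu_time[t] + mu_region[l] + (theta_time[t] + theta_region[l]) * x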
Following Senf et al. [183], we make conclusions based on posterior functionals of θ.
Roughly speaking, θ is the average (across time and space) association effect that water
balance has on excess tree death. We use S = 8000 HMC draws to approximate the posterior.
Figure 3.5.11 plots the histogram of the association effect draws and sample summaries. The
sample mean is equal to −1.88. The sample standard deviation is 0.48. These estimates are
very close to those reported in Senf et al. [183, Table 1]. Our estimate of the approximate

Figure 3.5.11: (Hierarchical model for tree mortality) Histogram of slope MCMC draws. See
the caption of fig. 3.5.1 for the meaning of the distinguished vertical lines.

credible interval’s left endpoint is −2.81; our estimate of the right endpoint is −0.94.
In our parametrization, if θ were estimated to be negative, it would indicate that the
availability of water is negatively associated with tree death. In other words, drought is
positively associated with tree death. Based on the sample summaries, a forest ecologist might
decide that drought has a positive relationship with canopy mortality, since the posterior
mean is negative, and this relationship is significant, since the approximate credible interval
does not contain zero.

3.5.3.2 Sensitivity results.


Running our approximation takes very little time compared to running the
original analysis. Generating the draws in fig. 3.5.11 took 12 hours. For one α and one
quantity of interest, it took less than 2 minutes to make a confidence interval for what happens
if we remove the most extreme data subset. A user might check approximation quality by
dropping a proposed subset and re-running MCMC: each such check took us around 12 hours,
which is the runtime of the original analysis.
Figure 3.5.12 plots our confidence intervals and the result after removing the proposed
data. In general, our confidence interval predicts a more extreme change than realized by
the refit: hence, our interval is not conservative. The overestimation is particularly severe
for the ‘both’ QoI and the ‘sig’ QoI. For changing sign, our method predicts there exists a
data subset of relative size at most 0.17% such that if we remove it, we change the posterior
mean’s sign; refitting does not confirm this prediction, however. The smallest α whose refit’s
posterior mean actually changes sign is 0.22%. For changing significance, our method predicts
there exists a data subset of relative size at most 0.10% such that if we remove it, we change
the sign of the right endpoint; refitting confirms this prediction. For generating a significant
result of the opposite sign, our method predicts there exists a data subset of relative size
at most 0.17% such that if we remove it, we change the sign of the left endpoint; refitting
does not confirm this prediction, however. The smallest α whose refit’s left endpoint actually
changes sign is 1.0%.

3.5.3.3 Results on subsampled data.


Running MCMC on the original dataset of size over 80,000 took 12 hours. In theory, we can
spend time (on the order of thousands of hours) to run our quality checks, but we do not

Figure 3.5.12: (Hierarchical model for tree mortality) Confidence interval and refit. See the
caption of fig. 3.5.2 for the meaning of the panels and the distinguished lines.

Figure 3.5.13: (Hierarchical model on subsampled tree mortality) Histogram of effect MCMC
draws. See fig. 3.5.1 for the meaning of the distinguished lines.

do so. Instead, we subsample 2,000 observations at random from the original dataset. Each
MCMC on this subsample takes only 15 minutes, making it possible to run quality checks in
a few hours instead of weeks. We hope that the subsampled data is representative enough of
the original data that the quality checks on the subsampled data are indicative of the quality
checks on the original data.
We use the same probabilistic model to analyze the subsampled data. Figure 3.5.13 plots
the histogram of the association effect draws and sample summaries. Based on the draws,
a forest ecologist might tentatively say that drought is positively associated with canopy
mortality if they relied on the posterior mean, but refrain from conclusively deciding, since
the approximate credible interval contains zero.
Figure 3.5.14 shows our confidence intervals and the actual refits. Similar to fig. 3.5.12,
our confidence intervals predict a more extreme change than realized by the refit. The
overestimation is most severe for ‘both’ QoI.
In fig. 3.5.15, the confidence interval for AMIP undercovers for all quantities of interest.
The actual coverage decreases as α increases. The undercoverage is most severe for ‘sig’ QoI:
while the nominal level is 0.95, the confidence interval for the true coverage only contains
values less than 0.15. This translates to a relative error of over 84%. In other words, our
confidence interval for significance change is too narrow, and rarely contains the AMIP. For
‘both’ QoI and ‘sig’ QoI, the worst-case relative error between the nominal and the estimated

Figure 3.5.14: (Hierarchical model on subsampled tree mortality) Confidence interval and
refit. See the caption of fig. 3.5.2 for the meaning of the panels and the distinguished lines.

Figure 3.5.15: (Hierarchical model on subsampled tree mortality) Monte Carlo estimate of
coverage of confidence interval for ∆(α). See fig. 3.5.3 for the meaning of the panels and the
distinguished lines.

coverage, which occurs under the largest α, is 15.7%.


In Figure 3.5.16, the estimated coverage of the confidence interval for sum-of-influence is
close to the nominal coverage. Note the stark contrast in the vertical scale of the ‘sig’ panel
in fig. 3.5.15 with that in fig. 3.5.16. At worst, our point estimate of the true coverage is 0.04
less than the nominal level, which is only a 4.2% relative error. This success of the block
bootstrap indicates that the undercoverage observed in fig. 3.5.15 can be attributed to the
sorting step involved in the definition of $\hat{\Delta}$. We leave to future work the investigation of why
the interference caused by the sorting step is so much more severe for changing the significance
than for changing sign or generating a significant result of the opposite sign.
Figure 3.5.17 shows that the linear approximation is good for the posterior mean (‘sign’
QoI) and the left credible endpoint (‘both’ QoI) up to ζ = 0.2276: in data percentages, this
is roughly 1.1%. For larger ζ, the refit for ‘both’ QoI plateaus while the linear approximation
continues to increase, and the linear approximation for posterior mean slightly underestimates
it. For the left endpoint (‘both’ QoI), the linear approximation is close to the refit up to
ζ = 0.6105 (roughly 3% of data); afterwards, the left endpoint increases while the linear
approximation continues to decrease.

Figure 3.5.16: (Hierarchical model on subsampled tree mortality) Monte Carlo estimate of
coverage of confidence interval for sum-of-influence. See fig. 3.5.4 for the meaning of the
panels and the distinguished lines.

Figure 3.5.17: (Hierarchical model on subsampled tree mortality) Quality of linear approxi-
mation. See fig. 3.5.5 for the meaning of the panels and the distinguished lines.

3.6 Discussion
We have provided a fast approximation to what happens to conclusions made with MCMC in
Bayesian models when a small percentage of data is removed. In real data experiments, our
approximation is accurate in simple models, such as linear regression. In complicated models,
such as hierarchical ones with many random effects, our methods are less accurate. A number
of open questions remain. We suspect that choosing the block length more carefully may
improve performance: how to pick the block length in a data-driven way is an interesting
question for future work. Currently, we can assess sensitivity for quantities of interest based
on posterior expectations and posterior standard deviations. For analysts that use posterior
quantiles to make decisions, we are not able to assess sensitivity. To extend our work to
quantiles, one would need to quantify how much a quantile changes under small perturbations
of the total log likelihood. Finally, we have not fully isolated the source of difficulty in
complex models like those in Senf et al. [183]. In the analysis of tree mortality data, there
are a number of conflating factors.
• The model has a large number of parameters.

• The parameters are organized hierarchically.

• We use MCMC to approximate the posterior.

To determine if the difficulty comes from high dimensionality or if the difficulty comes from
hierarchical organization, future work might apply our approximation to a high-dimensional
model without hierarchical structure. For instance, one might use MCMC on a linear
regression with many parameters and non-conjugate priors. To check if MCMC is a cause of
difficulty, one could experiment with variational inference (VI). If we choose to approximate
the posterior with VI, we can use the machinery developed for estimating equations [31] to
assess small-data sensitivity. If the dropping-data approximation works well there, we have
evidence that MCMC is part of the problem in complex models.

Appendix

3.A Theory
In this section, we theoretically quantify the approximation errors incurred by our methodology.
Namely, section 3.A.1 analyzes the error made by the first-order approximation, while
section 3.A.2 analyzes the error made by using MCMC to estimate influences.

3.A.1 Accuracy of first-order approximation


In this section, we investigate the error incurred by replacing $\phi(w) - \phi(1_N)$ with the Taylor
series from section 3.3.1. While the approximation applies to any model that satisfies
Assumption 3.2.1, Assumption 3.3.1, and Assumption 3.3.2, our error analysis is limited
to two models: a normal model and a normal means model. Their salient features are the
following. Both are convenient to analyze and address the same statistical task: derive the
population mean based on a finite sample $\{x^{(n)}\}_{n=1}^N$ where $x^{(n)} \in \mathbb{R}$. The normal means
model is hierarchical: the observations are organized into disjoint groups. Each observation
$d^{(n)}$ is $(x^{(n)}, g^{(n)})$, where $g^{(n)}$ is valued in $\{1, 2, \ldots, G\}$, with $g^{(n)} = g$ indicating that the $n$-th
observation belongs to the $g$-th group. The normal model does not have this structure, as
only $x^{(n)}$ is observed and used in modeling. We show that the error in the normal model is
qualitatively different from the error in the normal means model. Roughly speaking, the
former depends on the ratio between the number of observations left out, $\lfloor N\alpha \rfloor$, and the
total number of observations, $N$. Meanwhile, the latter depends on three quantities: a) $\lfloor N\alpha \rfloor$,
b) the number of groups $G$, and c) the number of observations in a group.
Before specializing to different models, we pin down the common notion of error. We
define error to be the difference between $\phi(w) - \phi(1_N)$ and $\sum_n (w_n - 1)\psi_n$. We mainly care
when $w$ encodes the full removal of certain observations and full inclusion of the remaining
ones, i.e., $w \in \{0, 1\}^N$. If we let $q$ be the function that returns the zero indices of such
a weight ($q(w) = \{n : w_n = 0\}$), then its inverse $q^{-1}$ takes a set of observation indices
($I \subset \{1, 2, \ldots, N\}$) and produces a weight valued in $\{0, 1\}^N$. We reformulate the error as
a function of $I$ instead of $w$ by replacing $w$ with $q^{-1}(I)$ in the definition of error. This
reformulation reads
$$\mathrm{Err}(I) = \phi(q^{-1}(I)) - \phi(1_N) + \sum_{n \in I} \psi_n.$$

3.A.1.1 Normal model.
We detail the prior and likelihood of the normal model and the associated quantity
of interest. The parameter of interest is the population mean $\mu$. The likelihood of an observation
is Gaussian with a known standard deviation $\sigma$. In other words, the $n$-th log-likelihood
evaluated at $\mu$ is $L(d^{(n)} \mid \mu) = \frac{1}{2}\log\frac{1}{2\pi\sigma^2} - \frac{1}{2\sigma^2}[(x^{(n)})^2 - 2x^{(n)}\mu + \mu^2]$. We choose the uniform
distribution over the real line as the prior for $\mu$. The quantity of interest is the posterior
mean of $\mu$.
In this model, expectations under the weighted posterior have closed forms. We can
derive an explicit expression for the error. To display the error, it is convenient to define
the sample average of observations as a function of $I$: for any $I \subset \{1, 2, \ldots, N\}$, let $\bar{x}_I :=
(1/|I|)\sum_{n \in I} x^{(n)}$. The sample average of the whole dataset will be denoted by $\bar{x}$.

Lemma 3.A.1. For the normal model, $\mathrm{Err}(I)$ is equal to
$$\frac{|I|^2(\bar{x} - \bar{x}_I)}{N(N - |I|)}.$$
We prove Lemma 3.A.1 in section 3.B.2. The error is a function of $I$ through a) the
cardinality of the set, $|I|$, and b) the difference between the whole dataset’s sample mean, $\bar{x}$,
and the sample mean for elements in $I$. Since the data is fixed, we can upper bound $|\bar{x} - \bar{x}_I|$
with the constant $2\|x\|_\infty$, where $\|x\|_\infty := \max_n |x^{(n)}|$. The rate at which the absolute value
of the error goes to zero is $|I|^2$: as the ratio $|I|/N$ equals $\alpha$, this means the error’s absolute
value goes to zero like $\alpha^2$.
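As a quick numerical illustration of Lemma 3.A.1 (our own check, not part of the original analysis), the sketch below compares the exact error against the closed form, using the facts that the weighted posterior mean is the weighted sample average and that $\psi_n = (x^{(n)} - \bar{x})/N$.

import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
xbar = x.mean()
psi = (x - xbar) / N  # influence of each observation on the posterior mean

I = np.array([0, 3, 7])            # indices to remove
phi_full = xbar                    # full-data posterior mean
phi_drop = np.delete(x, I).mean()  # posterior mean after removing I
err = phi_drop - phi_full + psi[I].sum()

# Closed form from Lemma 3.A.1.
closed = len(I) ** 2 * (xbar - x[I].mean()) / (N * (N - len(I)))
assert np.isclose(err, closed)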

3.A.1.2 Normal means model.


We detail the prior and likelihood of the normal means model and the associated quantity
of interest. The parameters of interest are the population mean $\mu$ and the group means
$\theta = (\theta_1, \theta_2, \ldots, \theta_G)$. Observations in group $g$ are modeled as Gaussian centered at the
group mean $\theta_g$ with a known standard deviation $\sigma$. In other words, the $n$-th log-likelihood
is $L(d^{(n)} \mid \mu, \theta) = \frac{1}{2}\log\frac{1}{2\pi\sigma^2} - \frac{1}{2\sigma^2}[(x^{(n)})^2 - 2x^{(n)}\theta_{g^{(n)}} + \theta_{g^{(n)}}^2]$. The prior over $(\mu, \theta)$ is the
following. We choose the uniform distribution over the real line as the prior for $\mu$. Conditioned
on $\mu$, the group means are Gaussian centered at $\mu$, with a known standard deviation $\tau$. The
quantity of interest is the posterior mean of $\mu$.
This model, like the normal model, has closed-form posterior expectations. Before
displaying the exact formula for the error $\mathrm{Err}(I)$, we need to describe the weighted posterior
in more detail. For each group $g$, we define three functions of $w$:
$$N_g(w) := \sum_{n: g^{(n)} = g} w_n, \qquad M_g(w) := \frac{\sum_{n: g^{(n)} = g} w_n x^{(n)}}{N_g(w)}, \qquad \Lambda_g(w) := \left(\frac{\sigma^2}{N_g(w)} + \tau^2\right)^{-1}.$$
While $N_g(w)$ sums up the weights of observations in group $g$, $M_g(w)$ is the weighted average
of observations in this group, and $\Lambda_g(w)$ will be used to weigh $M_g(w)$ in forming the posterior
mean of $\mu$. Section 3.B.2 shows that $\mathbb{E}_w \mu$ is equal to
$$\frac{\sum_{g=1}^G \Lambda_g(w) M_g(w)}{\sum_{g=1}^G \Lambda_g(w)}.$$

To avoid writing $\sum_{g=1}^G \Lambda_g(w)$ repeatedly, we define $\Lambda(w) := \sum_{g=1}^G \Lambda_g(w)$. To lighten notation, for
expectations under the original posterior, we write $\mu^*$ instead of $\mathbb{E}_{1_N}\mu$ and $N_g^*$ instead of
$N_g(1_N)$. The same shorthand applies to $M_g(1_N)$, $\Lambda_g(1_N)$, and $\Lambda(1_N)$. In words, $\mu^*$
is the posterior mean of $\mu$ under the full-data posterior, $N_g^*$ is the number of observations in
group $g$ of the original dataset, and so on. We also utilize the $\bar{x}_I$ and $\bar{x}$ notations defined in the
normal model section.
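A minimal numpy sketch of these closed forms (our own illustration, with hypothetical data): given weights w, data x, and group labels g, it computes $N_g(w)$, $M_g(w)$, $\Lambda_g(w)$, and the weighted posterior mean of $\mu$.

import numpy as np

def weighted_posterior_mean(w, x, g, G, sigma, tau):
    # Per-group weighted counts and weighted averages.
    N_g = np.array([w[g == k].sum() for k in range(G)])
    M_g = np.array([(w[g == k] * x[g == k]).sum() for k in range(G)]) / N_g
    # Lambda_g(w) = (sigma^2 / N_g(w) + tau^2)^(-1).
    Lam_g = 1.0 / (sigma**2 / N_g + tau**2)
    return (Lam_g * M_g).sum() / Lam_g.sum()

# Hypothetical usage with the full-data weights w = 1_N.
rng = np.random.default_rng(0)
G, N = 5, 200
g = rng.integers(0, G, size=N)
x = rng.normal(size=N)
print(weighted_posterior_mean(np.ones(N), x, g, G, sigma=1.0, tau=0.5))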
The error in the normal means model is given in the following lemma.

Lemma 3.A.2. In the normal means model, let the index set $I$ be such that there exists
$k \in \{1, 2, \ldots, G\}$ such that $g^{(n)} = k$ for all $n \in I$. Define
$$F(I) := \frac{|I|^2}{N_k^*[N_k^* - |I|]}(M_k^* - \bar{x}_I) + \frac{|I|}{N_k^*}\frac{\sigma^2 \Lambda_k^*}{N_k^*}(\mu^* - M_k^*),$$
$$E(I) := \frac{|I|}{N_k^*[N_k^* - |I|]}\sigma^2 \Lambda_k(q^{-1}(I))\Lambda_k^*.$$
Then, $\mathrm{Err}(I)$ is equal to
$$\frac{\Lambda_k(q^{-1}(I))}{\Lambda^*}F(I) + \frac{\sum_{g \neq k}\Lambda_g^*\left(M_g^* - M_k(q^{-1}(I))\right)}{\Lambda^*\,\Lambda(q^{-1}(I))}E(I).$$

We prove Lemma 3.A.2 in section 3.B.2. The constraint that all observations in $I$
belong to the same group $k$ is made for convenience: we can derive the error without this
constraint, but the formula would be much more complicated.
A corollary of Lemma 3.A.2 is that the absolute value of the error behaves like $|I|^2/(G|N_k^*|^2)$.

Corollary 3.A.1. In the normal means model, for all groups $g$, assume that $N_g^* \geq \sigma^2/\tau^2$.
Let the index set $I$ be such that there exists $k \in \{1, 2, \ldots, G\}$ such that $g^{(n)} = k$ for all $n \in I$.
For this $k$, assume that $N_k^* - |I| \geq \sigma^2/\tau^2$. Then,
$$|\mathrm{Err}(I)| \leq C(\|x\|_\infty, \sigma, \tau)\frac{1}{G}\frac{|I|^2}{|N_k^*|^2},$$
where $C(\|x\|_\infty, \sigma, \tau)$ is a constant that only depends on $\|x\|_\infty$, $\sigma$, and $\tau$.

We prove Corollary 3.A.1 in section 3.B.2. In addition to the assumptions of Lemma 3.A.2,
the corollary assumes that the number of observations in each group is not too small, and
that after removing $I$, group $k$ still has enough observations. This condition allows us to
approximate $\Lambda_k^*$ and $\Lambda_g(q^{-1}(I))$ with a constant. The factor $\|x\|_\infty$ in the bound comes from
upper bounding $|M_g^* - M_k(q^{-1}(I))|$ by $2\max_{n=1}^N |x^{(n)}|$.
For two reasons, we conjecture that similar qualitative differences also appear in the comparison
between more complicated hierarchical and non-hierarchical models. First, the fundamental
task of estimating the population mean is embedded in many other statistical tasks, such
as regression. Second, the group structure imposed by the normal means model is also
found in practically relevant hierarchical models.

3.A.2 Estimator properties
Recall from section 3.3.3 that one concern regarding the quality of $\hat{\Delta}$ is the $(\beta^{(1)}, \ldots, \beta^{(S)})$-induced
sampling uncertainty. Theoretically analyzing this uncertainty is difficult, with
one obstacle being that $\hat{\Delta}$ is a non-smooth function of $(\beta^{(1)}, \ldots, \beta^{(S)})$. In this section, we
settle for the easier goal of analyzing the sampling uncertainty of the influence estimates
$\hat{\psi}_n$. We expect such theoretical characterizations to play a role in the eventual theoretical
characterizations of $\hat{\Delta}$, but we leave this step to future work.
In this analysis, we make more restrictive assumptions than those needed for Theorem 3.3.1
to hold. We assume that the sample $(\beta^{(1)}, \ldots, \beta^{(S)})$ comes from exact sampling: the independence
across draws makes it easier to analyze sampling uncertainty. We focus on the quantity
of interest equaling the posterior mean ($c_1 = 1$, $c_2 = 0$ in the sense of Assumption 3.3.1):
the scaling $c_1 = 1$ for the posterior mean is made out of convenience, and a similar analysis
can be conducted when $c_2 \neq 0$, but we omit it for brevity. Finally, we need more stringent
moment conditions than Assumption 3.3.2.
Assumption 3.A.1. The functions $|g(\beta)^2 L(d^{(i)} \mid \beta)L(d^{(j)} \mid \beta)|$ (across $i, j$) have finite
expectation under the full-data posterior.
This moment condition guarantees that the sample covariance of $g(\beta)$ and $L(d^{(i)} \mid \beta)$ has
finite variance under the full-data posterior: it plays the same role as finite kurtosis plays in
proofs of sample variance consistency.
With the assumptions in place, we begin by showing that the sampling uncertainty of $\hat{\psi}_n$
goes to zero in the limit $S \to \infty$.
Lemma 3.A.3. Assume Assumption 3.2.1, Assumption 3.3.1, Assumption 3.3.2, and Assumption 3.A.1 hold. Let $\hat{\psi}$ be the output of algorithm 4 for $c_1 = 1$, $c_2 = 0$ and $(\beta^{(1)}, \ldots, \beta^{(S)})$ being an
i.i.d. sample. Then, there exists a constant $C$ such that for all $n$, for all $S$, $\mathrm{Var}(\hat{\psi}_n) \leq C/S$.
We prove Lemma 3.A.3 in section 3.B.3. That the variance of an individual $\hat{\psi}_n$ goes to zero
at the rate of $1/S$ is not surprising: $\hat{\psi}_n$ is a sample covariance, after all.
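The sample-covariance form of $\hat{\psi}_n$ admits a short vectorized sketch; the inputs below are hypothetical, and we use the biased (divide-by-S) covariance as in the analysis.

import numpy as np

def estimate_influences(g_draws, loglik_draws):
    # g_draws: shape (S,) values of g(beta^(s)); loglik_draws: shape (S, N),
    # entry (s, n) is L(d^(n) | beta^(s)). Returns psi_hat of length N.
    g_centered = g_draws - g_draws.mean()
    L_centered = loglik_draws - loglik_draws.mean(axis=0)
    return g_centered @ L_centered / len(g_draws)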
We use Lemma 3.A.3 to show consistency of different estimators.
Theorem 3.A.1. Assume Assumption 3.2.1, Assumption 3.3.1, Assumption 3.3.2, and
Assumption 3.A.1 hold. Let $\hat{\psi}$ be the output of algorithm 4 for $c_1 = 1$, $c_2 = 0$ and $(\beta^{(1)}, \ldots, \beta^{(S)})$
being an i.i.d. sample. Then $\max_{n=1}^N |\hat{\psi}_n - \psi_n|$ converges in probability to 0 in the limit
$S \to \infty$, and $\hat{\Delta}$ converges in probability to $\Delta(\alpha)$ in the limit $S \to \infty$.
We prove Theorem 3.A.1 in section 3.B.3. Our theorem states that the vector $\hat{\psi}$ is a
consistent estimator for the vector $\psi$ and $\hat{\Delta}$ is a consistent estimator for $\Delta(\alpha)$.
Not only is $\hat{\psi}$ consistent in estimating $\psi$, it is also asymptotically normal.
Theorem 3.A.2. Assume Assumption 3.2.1, Assumption 3.3.1, Assumption 3.3.2, and Assumption 3.A.1 hold. Let $\hat{\psi}$ be the output of algorithm 4 for $c_1 = 1$, $c_2 = 0$ and $(\beta^{(1)}, \ldots, \beta^{(S)})$ being an i.i.d. sample. Then $\sqrt{S}(\hat{\psi} - \psi)$ converges in distribution to $\mathcal{N}(0_N, \Sigma)$ where $\Sigma$ is the $N \times N$ matrix whose $(i, j)$ entry, $\Sigma_{i,j}$, is the covariance between
$(g(\beta) - \mathbb{E}_{1_N}g(\beta))\left(L(d^{(i)} \mid \beta) - \mathbb{E}_{1_N}L(d^{(i)} \mid \beta)\right)$
and $(g(\beta) - \mathbb{E}_{1_N}g(\beta))\left(L(d^{(j)} \mid \beta) - \mathbb{E}_{1_N}L(d^{(j)} \mid \beta)\right)$, taken under the full-data posterior.
We prove Theorem 3.A.2 in section 3.B.3. Heuristically, for each $n$, the distribution of $\hat{\psi}_n$
is the Gaussian centered at $\psi_n$, with standard deviation $\sqrt{\Sigma_{n,n}}/\sqrt{S}$.

3.A.2.1 Normal model with unknown precision.

While $\sqrt{\Sigma_{n,n}/S}$ eventually goes to zero, for finite $S$, this standard deviation can be large,
making $\hat{\psi}_n$ an imprecise estimate of $\psi_n$. To illustrate this phenomenon, we will derive $\Sigma_{n,n}$ in
the context of a simple probabilistic model: a normal model with unknown precision.
We first introduce the model and the associated quantity of interest. The data is a set of
$N$ real values: $d^{(n)} = x^{(n)}$, where $x^{(n)} \in \mathbb{R}$. The parameters of interest are the mean $\mu$ and
the precision $\tau$ of the population. The log-likelihood of an observation based on $\mu$ and $\tau$ is
Gaussian: $\frac{1}{2}\log\frac{\tau}{2\pi} - \frac{1}{2}\tau[(x^{(n)})^2 - 2x^{(n)}\mu + \mu^2]$. The prior is chosen to be the following: $\mu$ is
distributed uniformly over the real line, and $\tau$ is distributed from a gamma distribution.
The quantity of interest is the posterior mean of $\mu$.
For this probabilistic model, the assumptions of Theorem 3.A.2 are satisfied. We show
that the variance $\Sigma_{n,n}$ behaves like a quartic function of the observation $x^{(n)}$.
Lemma 3.A.4. In the normal-gamma model, there exist constants $D_1$, $D_2$, and $D_3$, where
$D_1 > 0$, such that for all $n$, $\Sigma_{n,n}$ is equal to $D_1(x^{(n)} - \bar{x})^4 + D_2(x^{(n)} - \bar{x})^2 + D_3$.
We prove Lemma 3.A.4 in section 3.B.3. $D_1$, $D_2$, $D_3$ are based on posterior expectations:
for instance, the proof shows that $D_1 = \frac{\mathbb{E}_{1_N}[\tau^{-1}(\tau - \mathbb{E}_{1_N}\tau)^2]}{4N}$. It is easy to show that for the
normal-gamma model,
$$\mathrm{Cov}_{1_N}(\mu, L(d^{(n)} \mid \mu, \tau)) = \frac{x^{(n)} - \bar{x}}{N}.$$
Hence, while the mean of $\hat{\psi}_n$ behaves like a linear function of $x^{(n)} - \bar{x}$, its standard deviation
behaves like a quadratic function of $x^{(n)} - \bar{x}$. In other words, the more influence an observation
has, the harder it is to accurately determine its influence!
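One can see this numerically by exploiting the closed-form normal-gamma posterior given in section 3.B.3 to draw exact samples and estimate each $\hat{\psi}_n$; the sketch below is our own illustration with hypothetical data and prior settings.

import numpy as np

rng = np.random.default_rng(0)
N, S = 50, 2000
x = rng.normal(size=N)
xbar, a, b = x.mean(), 2.0, 2.0  # hypothetical gamma prior (shape, rate)

# Exact posterior draws: tau ~ Gamma(a + N/2, rate); mu | tau ~ N(xbar, 1/(N tau)).
rate = b + 0.5 * (np.square(x).sum() - N * xbar**2)
tau = rng.gamma(a + N / 2, 1.0 / rate, size=S)
mu = xbar + rng.normal(size=S) / np.sqrt(N * tau)

# Per-datum log-likelihoods across draws, shape (S, N).
loglik = (0.5 * np.log(tau / (2 * np.pi))[:, None]
          - 0.5 * tau[:, None] * (x[None, :] - mu[:, None]) ** 2)
psi_hat = (mu - mu.mean()) @ (loglik - loglik.mean(axis=0)) / S
# The most extreme observations have the largest, and noisiest, estimates.
print(psi_hat[np.argsort(np.abs(x - xbar))[-3:]])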

3.B Proofs
3.B.1 Taylor series proofs
Proof of Theorem 3.3.1. At a high level, we rely on Fleming [62, Chapter 5.12, Theorem 5.9]
to interchange integration and differentiation.
Although the theorem statement does not explicitly mention the normalizer, to show that
the quantity of interest is continuously differentiable and compute partial derivatives, it is
necessary to show that the normalizer is continuously differentiable and compute partial
derivatives. To do so, we verify the following conditions on the integrand defining Z(w):
1. For any $\beta$, the mapping $w \mapsto p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)$ is continuously differentiable.
2. There exists a Lebesgue integrable function $f_1$ such that for all $w \in \{w \in [0,1]^N : \max_n w_n \geq \delta\}$, $p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right) \leq f_1(\beta)$.
3. For each $n$, there exists a Lebesgue integrable function $f_2$ such that for all $w \in \{w \in [0,1]^N : \max_n w_n \geq \delta\}$, $\left|\frac{\partial}{\partial w_n}\left(p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)\right)\right| \leq f_2(\beta)$.

The first condition is clearly satisfied. To construct $f_1$ that satisfies the second condition,
we partition the parameter space $\mathbb{R}^P$ into a finite number of disjoint sets. To index these
sets, we use a subset of $\{1, 2, \ldots, N\}$. If the indexing subset were $I = \{n_1, n_2, \ldots, n_M\}$, the
corresponding element of the partition is
$$B_I := \{\beta \in \mathbb{R}^P : \forall n \in I, L(d^{(n)} \mid \beta) \geq 0\}. \tag{3.6}$$
This partition allows us to upper bound the integrand with a function that is independent of
$w$. Suppose $\beta \in B_I$, $I \neq \emptyset$. The maximum of $\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)$ is attained by setting $w_n = 1$
for all $n \in I$ and $w_n = 0$ for all $n \notin I$. Suppose $\beta \in B_\emptyset$. As $L(d^{(n)} \mid \beta) < 0$ for all $1 \leq n \leq N$,
and we are constrained by $\max_n w_n \geq \delta$, the maximum of $\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)$ is attained by
setting $w_n = \delta$ for $\arg\max_n L(d^{(n)} \mid \beta)$ and $w_n = 0$ for all other $n$. In short, our envelope
function is
$$f_1(\beta) := \begin{cases} p(\beta)\prod_{n \in I}\exp(L(d^{(n)} \mid \beta)) & \text{if } \beta \in B_I,\ I \neq \emptyset, \\ p(\beta)\max_{n=1}^N \exp(\delta L(d^{(n)} \mid \beta)) & \text{if } \beta \in B_\emptyset. \end{cases}$$

The last step is to show $f_1$ is integrable. It suffices to show that the integral of $f_1$ on
each $B_I$ is finite. On $B_\emptyset$, integrating $p(\beta)\exp(\delta L(d^{(n)} \mid \beta))$ over $B_\emptyset$ is clearly finite: by
Assumption 3.2.1, the integral of $p(\beta)\exp(\delta L(d^{(n)} \mid \beta))$ over $\mathbb{R}^P$ is finite, and $B_\emptyset$ is a
subset of $\mathbb{R}^P$. As $f_1(\beta)$ is the maximum of a finite number of integrable functions, it is
integrable. Similarly, the integral of $f_1$ over $B_I$ where $I \neq \emptyset$ is at most the integral of
$p(\beta)\prod_{n \in I}\exp(L(d^{(n)} \mid \beta))$ over $\mathbb{R}^P$, which is finite by Assumption 3.2.1. To construct $f_2$ that
satisfies the third condition, we use the same partition of $\mathbb{R}^P$, and the envelope function is
$f_2(\beta) := |L(d^{(n)} \mid \beta)|f_1(\beta)$, since the partial derivative of the integrand is clearly
the product of the $n$-th log likelihood and the integrand itself. The integrability of
$f_2$ follows from Assumption 3.3.2’s guarantee that the expectation of $|L(d^{(n)} \mid \beta)|$ is finite
under different weighted posteriors. In all, we can interchange integration with differentiation,
and the partial derivatives are

$$\frac{\partial Z(w)}{\partial w_n} = Z(w) \times \mathbb{E}_w\left[L(d^{(n)} \mid \beta)\right].$$
We move on to prove that $\mathbb{E}_w g(\beta)$ is continuously differentiable and find its partial
derivatives. The conditions on $g(\beta)\frac{1}{Z(w)}p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)$ that we wish to check
are:
1. For any $\beta$, the mapping $w \mapsto g(\beta)\frac{1}{Z(w)}p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)$ is continuously differentiable.
2. There exists a Lebesgue integrable function $f_3$ such that for all $w \in \{w \in [0,1]^N : \max_n w_n \geq \delta\}$, $\left|g(\beta)\frac{1}{Z(w)}p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)\right| \leq f_3(\beta)$.
3. For each $n$, there exists a Lebesgue integrable function $f_4$ such that for all $w \in \{w \in [0,1]^N : \max_n w_n \geq \delta\}$, $\left|\frac{\partial}{\partial w_n}\left(g(\beta)\frac{1}{Z(w)}p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)\right)\right| \leq f_4(\beta)$.

We have already proven that $Z(w)$ is continuously differentiable: hence, there is nothing to
do for the first condition. It is straightforward to use Assumption 3.3.2 and check that the
second condition is satisfied by the function $f_3(\beta) := \frac{1}{Z(w)}g(\beta)f_1(\beta)$, and the third condition
is satisfied by $f_4(\beta) := \frac{1}{Z(w)}g(\beta)L(d^{(n)} \mid \beta)f_1(\beta)$. Hence, we can interchange integration with
differentiation. The partial derivative of $\mathbb{E}_w g(\beta)$ is equal to the sum of two integrals. The
first part is
$$\begin{aligned}
&\int g(\beta)p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)\frac{\partial Z(w)^{-1}}{\partial w_n}\,d\beta \\
&\quad = -\mathbb{E}_w\left[L(d^{(n)} \mid \beta)\right]\frac{1}{Z(w)}\int g(\beta)p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)d\beta \\
&\quad = -\mathbb{E}_w\left[L(d^{(n)} \mid \beta)\right] \times \mathbb{E}_w\left[g(\beta)\right].
\end{aligned}$$
The second part is
$$\int g(\beta)L(d^{(n)} \mid \beta)\frac{1}{Z(w)}p(\beta)\exp\left(\sum_{n=1}^N w_n L(d^{(n)} \mid \beta)\right)d\beta = \mathbb{E}_w\left[g(\beta)L(d^{(n)} \mid \beta)\right].$$
Putting the two parts together, the partial derivative is equal to a covariance:
$$\frac{\partial \mathbb{E}_w g(\beta)}{\partial w_n} = \mathrm{Cov}_w\left(g(\beta), L(d^{(n)} \mid \beta)\right).$$
The proof that $\mathbb{E}_w g(\beta)^2$ is continuously differentiable is similar to that for $\mathbb{E}_w g(\beta)$. The
partial derivative is
$$\frac{\partial[\mathbb{E}_w g(\beta)^2]}{\partial w_n} = \mathrm{Cov}_w\left(g(\beta)^2, L(d^{(n)} \mid \beta)\right).$$
Since the posterior standard deviation is a continuously differentiable function of the mean
and second moment, it is itself continuously differentiable. The partial derivative of the posterior
standard deviation is a simple application of the chain rule, and we omit the proof for brevity.

3.B.2 First-order accuracy proofs


Proof of Lemma 3.A.1. Our proof finds exact formulas for the posterior mean and the partial
derivatives of the posterior mean with respect to $w_n$. Then, we take the difference between
the posterior mean and its Taylor series.
In the normal model, the total log probability at $w$ is equal to
$$\sum_{n=1}^N w_n\left(\frac{1}{2}\log\frac{1}{2\pi\sigma^2} - \frac{1}{2\sigma^2}\left[(x^{(n)})^2 - 2x^{(n)}\mu + \mu^2\right]\right) = -\frac{\sum_{n=1}^N w_n}{2\sigma^2}\left(\mu - \frac{\sum_{n=1}^N w_n x^{(n)}}{\sum_{n=1}^N w_n}\right)^2 + C,$$

where $C$ is a constant that does not depend on $\mu$. Hence, the distribution of $\mu$ under $w$
is normal with mean $\left(\sum_{n=1}^N w_n x^{(n)}\right)/\left(\sum_{n=1}^N w_n\right)$ and precision $\left(\sum_{n=1}^N w_n\right)/\sigma^2$. The partial
derivative of the posterior mean with respect to $w_n$ is
$$\frac{x^{(n)}\left(\sum_{m=1}^N w_m\right) - \sum_{m=1}^N w_m x^{(m)}}{\left(\sum_{m=1}^N w_m\right)^2}.$$

Plugging in $w = 1_N$, we have that $\psi_n$ is equal to $(x^{(n)} - \bar{x})/N$.
After removing the index set $I$, the actual posterior mean is
$$\frac{N\bar{x} - |I|\bar{x}_I}{N - |I|},$$
while the Taylor series approximation is
$$\bar{x} - \sum_{n \in I}\frac{x^{(n)} - \bar{x}}{N} = \frac{N\bar{x} + |I|(\bar{x} - \bar{x}_I)}{N}.$$
The difference between the actual posterior mean and its approximation is as in the statement
of the lemma.

Proof of Lemma 3.A.2. Similar to the proof of Lemma 3.A.1, we first find exact formulas for
the posterior mean and its Taylor series.
In the normal means model, the total log probability at $w$ is
$$\sum_{g=1}^G\left(\frac{1}{2}\log\frac{1}{2\pi\tau^2} - \frac{1}{2\tau^2}(\theta_g - \mu)^2\right) + \sum_{n=1}^N w_n\left(\frac{1}{2}\log\frac{1}{2\pi\sigma^2} - \frac{1}{2\sigma^2}\left[(x^{(n)})^2 - 2x^{(n)}\theta_{g^{(n)}} + \theta_{g^{(n)}}^2\right]\right).$$

By completing the squares, we know that
• The distribution of $\mu$ is normal:
$$\mathcal{N}\left(\frac{\sum_{g=1}^G \Lambda_g(w)M_g(w)}{\sum_{g=1}^G \Lambda_g(w)}, \frac{1}{\sum_{g=1}^G \Lambda_g(w)}\right).$$
• Conditioned on $\mu$, the group means are independent normals:
$$\theta_g \mid \mu \sim \mathcal{N}\left(\frac{\mu/\tau^2 + [N_g(w)M_g(w)]/\sigma^2}{1/\tau^2 + N_g(w)/\sigma^2}, \frac{1}{1/\tau^2 + N_g(w)/\sigma^2}\right).$$
To express the partial derivative of the posterior mean of $\mu$ with respect to $w_n$, it is helpful
to define the following “intermediate” value between $\mathbb{E}_w\mu$ and $\mathbb{E}_w\theta_g$:
$$\tilde{\mu}_g(w) := \frac{M_g(w)N_g(w)/\sigma^2 + \mathbb{E}_w\mu/\tau^2}{N_g(w)/\sigma^2 + 1/\tau^2}.$$

In addition, we need the partial derivatives of the functions $N_g$, $M_g$, and $\Lambda_g$:
$$\frac{\partial N_g}{\partial w_n} = \begin{cases} 0 & \text{if } g \neq g^{(n)} \\ 1 & \text{if } g = g^{(n)} \end{cases}, \qquad \frac{\partial M_g}{\partial w_n} = \begin{cases} 0 & \text{if } g \neq g^{(n)} \\ \frac{x^{(n)} - M_g(w)}{N_g(w)} & \text{if } g = g^{(n)} \end{cases}, \qquad \frac{\partial \Lambda_g}{\partial w_n} = \begin{cases} 0 & \text{if } g \neq g^{(n)} \\ \sigma^2\frac{\Lambda_g(w)^2}{N_g(w)^2} & \text{if } g = g^{(n)} \end{cases}.$$
If $n$ is in the $k$-th group, the partial derivative of the posterior mean with respect to $w_n$ is
$$\frac{1}{\Lambda(w)}\frac{1}{\sigma^2 + \tau^2 N_k(w)}\left(x^{(n)} - \tilde{\mu}_k(w)\right).$$
After removing only observations from the $k$-th group, the actual posterior mean is
$$\frac{\Lambda_k(q^{-1}(I))M_k(q^{-1}(I)) + \sum_{g \neq k}\Lambda_g(1_N)M_g(1_N)}{\Lambda_k(q^{-1}(I)) + \sum_{g \neq k}\Lambda_g(1_N)}.$$

Between $w = q^{-1}(I)$ and $w = 1_N$, the $N_g$, $M_g$, $\Lambda_g$ functions do not change for $g \neq k$. The
Taylor series approximation of the posterior mean is
$$\frac{\Lambda_k(1_N)\left[M_k(1_N) + \sum_{n \in I}\left(\tilde{\mu}_k(1_N) - x^{(n)}\right)/N_k(1_N)\right] + \sum_{g \neq k}\Lambda_g(1_N)M_g(1_N)}{\Lambda_k(1_N) + \sum_{g \neq k}\Lambda_g(1_N)}.$$
If we denote
$$A_1 := \sum_{g \neq k}\Lambda_g(1_N)M_g(1_N), \qquad A_2 := \sum_{g \neq k}\Lambda_g(1_N),$$
$$B_1 := \Lambda_k(q^{-1}(I))M_k(q^{-1}(I)), \qquad B_2 := \Lambda_k(q^{-1}(I)),$$
$$C_1 := \Lambda_k(1_N)\left[M_k(1_N) + \sum_{n \in I}\left(\tilde{\mu}_k(1_N) - x^{(n)}\right)/N_k(1_N)\right], \qquad C_2 := \Lambda_k(1_N),$$

then $\mathrm{Err}(I)$ is equal to $(A_1 + B_1)/(A_2 + B_2) - (A_1 + C_1)/(A_2 + C_2)$. The last expression is
equal to
$$\frac{A_2(B_1 - C_1) + A_1(C_2 - B_2) + (B_1 C_2 - C_1 B_2)}{(A_2 + B_2)(A_2 + C_2)}.$$
We analyze the differences $C_2 - B_2$, $B_1 C_2 - C_1 B_2$, and $B_1 - C_1$ separately.
$C_2 - B_2$. This difference is
$$\frac{1}{\sigma^2/N_k(1_N) + \tau^2} - \frac{1}{\sigma^2/N_k(q^{-1}(I)) + \tau^2}.$$
Since we remove $|I|$ observations from group $k$, $N_k(q^{-1}(I)) = N_k(1_N) - |I|$. Hence, the difference $C_2 - B_2$
is
$$\sigma^2\Lambda_k(1_N)\Lambda_k(q^{-1}(I))\frac{|I|}{N_k(1_N)(N_k(1_N) - |I|)},$$

which is exactly the $E(I)$ mentioned in the lemma statement.
$B_1 C_2 - C_1 B_2$. The difference is
$$\Lambda_k(1_N)\Lambda_k(q^{-1}(I))\left\{M_k(q^{-1}(I)) - M_k(1_N) - \frac{\sum_{n \in I}[\tilde{\mu}_k(1_N) - x^{(n)}]}{N_k(1_N)}\right\}.$$
We analyze the term in the curly brackets. It is equal to
$$\left\{M_k(q^{-1}(I)) - M_k(1_N) - \frac{\sum_{n \in I}[M_k(1_N) - x^{(n)}]}{N_k(1_N)}\right\} + \sum_{n \in I}\frac{M_k(1_N) - \tilde{\mu}_k(1_N)}{N_k(1_N)}.$$
The left term is equal to
$$\frac{|I|^2(M_k(1_N) - \bar{x}_I)}{N_k(1_N)[N_k(1_N) - |I|]}.$$
The right term is equal to
$$\frac{|I|}{N_k(1_N)}\frac{\sigma^2\Lambda_k(1_N)}{N_k(1_N)}(\mathbb{E}_{1_N}\mu - M_k(1_N)).$$
The sum of the two terms is exactly $F(I)$ mentioned in the lemma statement. Overall, the
difference $B_1 C_2 - C_1 B_2$ is equal to $\Lambda_k(1_N)\Lambda_k(q^{-1}(I))F(I)$.
$B_1 - C_1$. If we introduce $D := \Lambda_k(1_N)M_k(q^{-1}(I))$, then the difference $B_1 - C_1$ is equal to
$(B_1 - D) + (D - C_1)$. The former term is
$$M_k(q^{-1}(I))(B_2 - C_2) = -M_k(q^{-1}(I))E(I).$$
The latter term is
$$\Lambda_k(1_N)\left\{M_k(q^{-1}(I)) - M_k(1_N) - \frac{\sum_{n \in I}[\tilde{\mu}_k(1_N) - x^{(n)}]}{N_k(1_N)}\right\}.$$
We already know that the term in the curly brackets is equal to $F(I)$. Hence $B_1 - C_1$ is equal
to $\Lambda_k(1_N)F(I) - M_k(q^{-1}(I))E(I)$.
With the differences $C_2 - B_2$, $B_1 C_2 - C_1 B_2$, and $B_1 - C_1$, we can now state the final form
of $\mathrm{Err}(I)$. The final numerator is
$$\left[\Lambda_k(q^{-1}(I)) + \sum_{g \neq k}\Lambda_g(1_N)\right]\Lambda_k(1_N)F(I) + \left[\sum_{g \neq k}\Lambda_g(1_N)M_g(1_N) - M_k(q^{-1}(I))\sum_{g \neq k}\Lambda_g(1_N)\right]E(I).$$
Dividing this by the denominator $\left[\sum_g \Lambda_g(1_N)\right]\left[\sum_g \Lambda_g(q^{-1}(I))\right]$, we have proven the lemma.

Proof of Corollary 3.A.1. Under the assumption that $N_g^* \geq \sigma^2/\tau^2$, we have that $\Lambda_g(1_N) \in \left[\frac{1}{2\tau^2}, \frac{1}{\tau^2}\right]$. Since $N_k^* - |I| \geq \sigma^2/\tau^2$, it is also true that $\Lambda_k(q^{-1}(I)) \in \left[\frac{1}{2\tau^2}, \frac{1}{\tau^2}\right]$.
Because of Lemma 3.A.2, an upper bound on $\mathrm{Err}(I)$ is
$$\frac{\Lambda_k(q^{-1}(I))}{\Lambda^*}|F(I)| + \left|\frac{\sum_{g \neq k}\Lambda_g^*\left(M_g^* - M_k(q^{-1}(I))\right)}{\Lambda^*\,\Lambda(q^{-1}(I))}\right||E(I)|.$$
The fraction $\Lambda_k(q^{-1}(I))/\Lambda^*$ is at most $\left(\frac{1}{\tau^2}\right)/\left(\frac{G}{2\tau^2}\right)$, which is equal to $2/G$. The absolute
value $|F(I)|$ is at most
$$\frac{2|I|^2\|x\|_\infty}{(N_k^*)^2} + \frac{2|I|\|x\|_\infty(\sigma^2/\tau^2)}{(N_k^*)^2} \leq \frac{2|I|^2\|x\|_\infty(\sigma^2/\tau^2 + 1)}{(N_k^*)^2}.$$
The absolute value
$$\left|\frac{\sum_{g \neq k}\Lambda_g^*\left(M_g^* - M_k(q^{-1}(I))\right)}{\Lambda^*\,\Lambda(q^{-1}(I))}\right|$$
is at most
$$\frac{G(1/\tau^2)\cdot 2\|x\|_\infty}{G^2(1/(2\tau^2))} \leq \frac{4\|x\|_\infty}{G}.$$
Finally, the absolute value $|E(I)|$ is at most
$$\frac{|I|(\sigma^2/(4\tau^4))}{(N_k^*)^2} \leq \frac{|I|^2(\sigma^2/(4\tau^4))}{(N_k^*)^2}.$$
In all, the constant $C(\|x\|_\infty, \sigma, \tau)$ in the corollary’s statement is
$$\|x\|_\infty\left[4(\sigma^2/\tau^2 + 1) + \sigma^2/\tau^4\right].$$


3.B.3 Consistency and asymptotic normality proofs


The following lemma, on the covariance between sample covariances under i.i.d. sampling, will be
useful for later proofs.
Lemma 3.B.1. Suppose we have $S$ i.i.d. draws $(A^{(s)}, B^{(s)}, C^{(s)})_{s=1}^S$. Let $f_1$ be the (biased)
sample covariance between the $A$’s and the $B$’s. Let $f_2$ be the (biased) sample covariance
between the $A$’s and $C$’s. In other words,
$$f_1 := \frac{1}{S}\sum_{s=1}^S A^{(s)}B^{(s)} - \left(\frac{1}{S}\sum_{s=1}^S A^{(s)}\right)\left(\frac{1}{S}\sum_{s=1}^S B^{(s)}\right),$$
$$f_2 := \frac{1}{S}\sum_{s=1}^S A^{(s)}C^{(s)} - \left(\frac{1}{S}\sum_{s=1}^S A^{(s)}\right)\left(\frac{1}{S}\sum_{s=1}^S C^{(s)}\right).$$

Suppose that the following are finite: $\mathbb{E}[(A - \mathbb{E}[A])^2(B - \mathbb{E}[B])(C - \mathbb{E}[C])]$, $\mathrm{Cov}(B, C)$, $\mathrm{Var}(A)$,
$\mathrm{Cov}(A, B)$, $\mathrm{Cov}(A, C)$. Then, the covariance of $f_1$ and $f_2$ is equal to
$$\frac{(S-1)^2}{S^3}\mathbb{E}[(A - \mathbb{E}[A])^2(B - \mathbb{E}[B])(C - \mathbb{E}[C])] + \frac{S-1}{S^3}\mathrm{Cov}(B, C)\mathrm{Var}(A) - \frac{(S-1)(S-2)}{S^3}\mathrm{Cov}(A, B)\mathrm{Cov}(A, C).$$
Proof of lemma 3.B.1. It suffices to prove the lemma in the case where $\mathbb{E}[A] = \mathbb{E}[B] = \mathbb{E}[C] = 0$. Otherwise, we can subtract the population mean from each random variable:
the values of $f_1$ and $f_2$ would not change (since covariance is invariant to constant additive
changes). In other words, we want to show that the covariance between $f_1$ and $f_2$ is equal to
$$\frac{(S-1)^2}{S^3}\mathbb{E}[A^2BC] + \frac{S-1}{S^3}\mathbb{E}[BC]\mathbb{E}[A^2] - \frac{(S-1)(S-2)}{S^3}\mathbb{E}[AB]\mathbb{E}[AC]. \tag{3.7}$$
Since $f_1$ is the biased sample covariance, $\mathbb{E}f_1 = \frac{S-1}{S}\mathbb{E}[AB]$. Similarly, $\mathbb{E}f_2 = \frac{S-1}{S}\mathbb{E}[AC]$.
To compute $\mathrm{Cov}(f_1, f_2)$, we only need an expression for $\mathbb{E}[f_1 f_2]$. The product $f_1 f_2$ is equal to
the sum of $D_1$, $D_2$, $D_3$, $D_4$ where:
$$D_1 := -\left(\frac{1}{S}\sum_s A^{(s)}B^{(s)}\right)\left(\frac{1}{S}\sum_s A^{(s)}\right)\left(\frac{1}{S}\sum_s C^{(s)}\right),$$
$$D_2 := \left(\frac{1}{S}\sum_s A^{(s)}\right)^2\left(\frac{1}{S}\sum_s B^{(s)}\right)\left(\frac{1}{S}\sum_s C^{(s)}\right),$$
$$D_3 := -\left(\frac{1}{S}\sum_s A^{(s)}C^{(s)}\right)\left(\frac{1}{S}\sum_s A^{(s)}\right)\left(\frac{1}{S}\sum_s B^{(s)}\right),$$
$$D_4 := \left(\frac{1}{S}\sum_s A^{(s)}B^{(s)}\right)\left(\frac{1}{S}\sum_s A^{(s)}C^{(s)}\right).$$

We compute the expectation of each $D_j$.
$D_1$. By expanding $D_1$, we know that $\mathbb{E}D_1 = -\frac{1}{S^3}\sum_{i,j,k}\mathbb{E}[A^{(k)}B^{(k)}A^{(i)}C^{(j)}]$. The value of
$\mathbb{E}[A^{(k)}B^{(k)}A^{(i)}C^{(j)}]$ depends on the triplet $(i, j, k)$ in the following way:
$$\mathbb{E}[A^{(k)}B^{(k)}A^{(i)}C^{(j)}] = \begin{cases} 0 & \text{if } i = k, j \neq k \\ \mathbb{E}[A^2BC] & \text{if } i = k, j = k \\ 0 & \text{if } i \neq k, j = k \\ \mathbb{E}[AB]\mathbb{E}[AC] & \text{if } i \neq k, j \neq k, i = j \\ 0 & \text{if } i \neq k, j \neq k, i \neq j \end{cases}$$
We have used the independence of $(A^{(s)}, B^{(s)}, C^{(s)})_{s=1}^S$ to factorize the expectation $\mathbb{E}[A^{(k)}B^{(k)}A^{(i)}C^{(j)}]$.
For certain triplets, the factorization reveals that the expectation is zero. By accounting for
all triplets, the expectation of $D_1$ is
$$-\frac{1}{S^3}\left(S\,\mathbb{E}[A^2BC] + S(S-1)\mathbb{E}[AB]\mathbb{E}[AC]\right).$$
$D_2$. By expanding $D_2$, we know that $\mathbb{E}D_2 = \frac{1}{S^4}\sum_{i,j,p,q}\mathbb{E}[A^{(i)}A^{(j)}B^{(p)}C^{(q)}]$. We can do a
similar case-by-case analysis of how $\mathbb{E}[A^{(i)}A^{(j)}B^{(p)}C^{(q)}]$ depends on the quartet $(i, j, p, q)$. In
the end, the expectation of $D_2$ is
$$\frac{1}{S^3}\left(\mathbb{E}[A^2BC] + (S-1)\mathbb{E}[A^2]\mathbb{E}[BC] + 2(S-1)\mathbb{E}[AB]\mathbb{E}[AC]\right).$$
$D_3$. By symmetry between $D_1$ and $D_3$, the expectation of $D_3$ is also
$$-\frac{1}{S^3}\left(S\,\mathbb{E}[A^2BC] + S(S-1)\mathbb{E}[AB]\mathbb{E}[AC]\right).$$
$D_4$. By expanding $D_4$, we know that $\mathbb{E}D_4 = \frac{1}{S^2}\sum_{i,j}\mathbb{E}[A^{(i)}B^{(i)}A^{(j)}C^{(j)}]$. The case-by-case
analysis of $\mathbb{E}[A^{(i)}B^{(i)}A^{(j)}C^{(j)}]$ for each $(i, j)$ is simple, and is omitted. The expectation of $D_4$
is
$$\frac{1}{S}\mathbb{E}[A^2BC] + \frac{S-1}{S}\mathbb{E}[AB]\mathbb{E}[AC].$$
Simple algebra reveals that $\sum_{i=1}^4 \mathbb{E}[D_i] - \frac{S-1}{S}\mathbb{E}[AB]\cdot\frac{S-1}{S}\mathbb{E}[AC]$ is equal to eq. (3.7).

Proof of Lemma 3.A.3. In this proof, we will only consider expectations under the full-data
posterior. Hence, to alleviate notation, we shall write $\mathbb{E}$ instead of $\mathbb{E}_{1_N}$; similarly, covariance
and variance evaluations are understood to be at $w = 1_N$.
Applying lemma 3.B.1, the covariance of $\hat{\psi}_n$ and $\hat{\psi}_n$, i.e., the variance of $\hat{\psi}_n$, is equal to
$$\frac{(S-1)^2}{S^3}\mathbb{E}\left\{(g(\beta) - \mathbb{E}[g(\beta)])^2\left(L(d^{(n)} \mid \beta) - \mathbb{E}[L(d^{(n)} \mid \beta)]\right)^2\right\} + \frac{S-1}{S^3}\mathrm{Var}(L(d^{(n)} \mid \beta))\mathrm{Var}(g(\beta)) - \frac{(S-1)(S-2)}{S^3}\mathrm{Cov}\left(g(\beta), L(d^{(n)} \mid \beta)\right)^2.$$
Define the constant $C$ to be the maximum over $n$ of
$$\mathrm{Cov}\left(g(\beta), L(d^{(n)} \mid \beta)\right)^2 + \mathrm{Var}(g(\beta))\mathrm{Var}(L(d^{(n)} \mid \beta)) + \mathbb{E}\left\{(g(\beta) - \mathbb{E}[g(\beta)])^2\left(L(d^{(n)} \mid \beta) - \mathbb{E}[L(d^{(n)} \mid \beta)]\right)^2\right\}.$$
Simple algebra shows that $\mathrm{Var}(\hat{\psi}_n) \leq \frac{C}{S}$.

Proof of Theorem 3.A.1. Similar to the proof of Lemma 3.A.3, expectations (and variances
and covariances) are understood to be taken under the full-data posterior.
Since $\hat{\psi}_n$ is the biased sample covariance, we know that
$$\mathbb{E}\hat{\psi}_n = \frac{S-1}{S}\psi_n.$$
The bias of $\hat{\psi}_n$ goes to zero at rate $1/S$. Because of Lemma 3.A.3, the variance also goes to
zero at rate $1/S$. Then, the application of Chebyshev’s inequality shows that $\hat{\psi}_n \xrightarrow{p} \psi_n$. Since
$N$ is a constant, the pointwise convergence $|\hat{\psi}_n - \psi_n| \xrightarrow{p} 0$ implies the uniform convergence
$\max_{n=1}^N |\hat{\psi}_n - \psi_n| \xrightarrow{p} 0$.

We now prove that $|\hat{\Delta} - \Delta(\alpha)| \xrightarrow{p} 0$. We first recall some notation. The ranks $r_1, r_2, \ldots, r_N$
sort the influences $\psi_{r_1} \leq \psi_{r_2} \leq \ldots \leq \psi_{r_N}$, and $\Delta(\alpha) = -\sum_{m=1}^{\lfloor N\alpha \rfloor}\psi_{r_m}\mathbb{I}\{\psi_{r_m} < 0\}$. Similarly,
$v_1, v_2, \ldots, v_N$ sort the estimates $\hat{\psi}_{v_1} \leq \hat{\psi}_{v_2} \leq \ldots \leq \hat{\psi}_{v_N}$, and $\hat{\Delta} = -\sum_{m=1}^{\lfloor N\alpha \rfloor}\hat{\psi}_{v_m}\mathbb{I}\{\hat{\psi}_{v_m} < 0\}$.
It suffices to prove the convergence when $\lfloor N\alpha \rfloor \geq 1$: in the case $\lfloor N\alpha \rfloor = 0$, both $\hat{\Delta}$ and $\Delta(\alpha)$
are equal to zero, hence the distance between them is identically zero. Denote the $T$ unique
values among the $\psi_n$ by $u_1 < u_2 < \ldots < u_T$. If $T = 1$, i.e., there is only one value, let $\omega := 1$.
Otherwise, let $\omega$ be the smallest gap between subsequent values: $\omega := \min_t(u_{t+1} - u_t)$.
Suppose that $\max_{n=1}^N |\hat{\psi}_n - \psi_n| \leq \omega/3$: let $A$ be the indicator for this event. For any $n$,
each $\hat{\psi}_n$ is in the interval $[\psi_n - \omega/3, \psi_n + \omega/3]$. In the case $T = 1$, clearly all $k$ such that $\hat{\psi}_k$
is in $[\psi_n - \omega/3, \psi_n + \omega/3]$ satisfy $\psi_k = \psi_n$. In the case $T > 1$, since unique values of $\psi_n$ are
at least $\omega$ apart, all $k$ such that $\hat{\psi}_k$ is in $[\psi_n - \omega/3, \psi_n + \omega/3]$ satisfy $\psi_k = \psi_n$. This means
that the ranks $v_1, v_2, \ldots, v_N$, which sort the influence estimates, also sort the true influences
in ascending order: $\psi_{v_1} \leq \psi_{v_2} \leq \ldots \leq \psi_{v_N}$. Since the ranks $r_1, r_2, \ldots, r_N$ also sort the true
influences, it must be true that $\psi_{v_m} = \psi_{r_m}$ for all $m$. Therefore, we can write
$$|\hat{\Delta} - \Delta(\alpha)| = \left|\sum_{m=1}^{\lfloor N\alpha \rfloor}\left(\psi_{v_m}\mathbb{I}\{\psi_{v_m} < 0\} - \hat{\psi}_{v_m}\mathbb{I}\{\hat{\psi}_{v_m} < 0\}\right)\right| \leq \sum_{m=1}^{\lfloor N\alpha \rfloor}\left|\psi_{v_m}\mathbb{I}\{\psi_{v_m} < 0\} - \hat{\psi}_{v_m}\mathbb{I}\{\hat{\psi}_{v_m} < 0\}\right|.$$

We control the absolute values $\left|\psi_{v_m}\mathbb{I}\{\psi_{v_m} < 0\} - \hat{\psi}_{v_m}\mathbb{I}\{\hat{\psi}_{v_m} < 0\}\right|$. For any index $n$, by
the triangle inequality, $\left|\psi_n\mathbb{I}\{\psi_n < 0\} - \hat{\psi}_n\mathbb{I}\{\hat{\psi}_n < 0\}\right|$ is at most
$$\mathbb{I}\{\hat{\psi}_n < 0\}|\psi_n - \hat{\psi}_n| + |\psi_n|\left|\mathbb{I}\{\hat{\psi}_n < 0\} - \mathbb{I}\{\psi_n < 0\}\right|.$$
The first term is at most $|\psi_n - \hat{\psi}_n|$. The second term is at most $\mathbb{I}\{|\psi_n - \hat{\psi}_n| \geq |\psi_n|, \psi_n \neq 0\}$.
We next prove a bound on $\left|\psi_n\mathbb{I}\{\psi_n < 0\} - \hat{\psi}_n\mathbb{I}\{\hat{\psi}_n < 0\}\right|$ that holds across $n$. Our analysis
proceeds differently based on whether the set $\{n : \psi_n \neq 0\}$ is empty or not.
• $\{n : \psi_n \neq 0\}$ is empty. This means $\psi_n = 0$ for all $n$. Hence, $\mathbb{I}\{|\psi_n - \hat{\psi}_n| \geq |\psi_n|, \psi_n \neq 0\}$
is identically zero.
• $\{n : \psi_n \neq 0\}$ is not empty. We then know that $\min_n |\psi_n| > 0$. Hence, $\mathbb{I}\{|\psi_n - \hat{\psi}_n| \geq |\psi_n|, \psi_n \neq 0\}$ is upper bounded by $\mathbb{I}\{|\psi_n - \hat{\psi}_n| \geq \min_n |\psi_n|\}$. Since $|\psi_n - \hat{\psi}_n| \leq \max_n |\psi_n - \hat{\psi}_n|$, this last indicator is at most $\mathbb{I}\{\max_n |\psi_n - \hat{\psi}_n| \geq \min_n |\psi_n|\}$.
To summarize, we have proven the following upper bounds on $|\hat{\Delta} - \Delta(\alpha)|$. When
$\{n : \psi_n \neq 0\}$ is empty, on $A$, $|\hat{\Delta} - \Delta(\alpha)|$ is upper bounded by
$$\lfloor N\alpha \rfloor\max_{n=1}^N |\psi_n - \hat{\psi}_n|. \tag{3.8}$$
When $\{n : \psi_n \neq 0\}$ is not empty, on $A$, $|\hat{\Delta} - \Delta(\alpha)|$ is upper bounded by
$$\lfloor N\alpha \rfloor\max_{n=1}^N |\psi_n - \hat{\psi}_n| + \lfloor N\alpha \rfloor\mathbb{I}\left\{\max_n |\psi_n - \hat{\psi}_n| \geq \min_n |\psi_n|\right\}. \tag{3.9}$$

We are ready to show that $\Pr(|\hat{\Delta} - \Delta(\alpha)| > \epsilon)$ converges to zero. For any positive $\epsilon$, we
know that
$$\Pr(|\hat{\Delta} - \Delta(\alpha)| > \epsilon) \leq \Pr(|\hat{\Delta} - \Delta(\alpha)| > \epsilon, A) + \Pr(A^c).$$
The latter probability goes to zero because $\max_{n=1}^N |\hat{\psi}_n - \psi_n| \xrightarrow{p} 0$.
Suppose that $\{n : \psi_n \neq 0\}$ is empty. Using the upper bound eq. (3.8), we know that the event
in the former probability implies that $\max_{n=1}^N |\hat{\psi}_n - \psi_n| \geq \epsilon/\lfloor N\alpha \rfloor$. The probability of this
event also goes to zero because $\max_{n=1}^N |\hat{\psi}_n - \psi_n| \xrightarrow{p} 0$.
Suppose that $\{n : \psi_n \neq 0\}$ is not empty. Using the upper bound eq. (3.9), we
know that the event in the former probability implies that $\left(\max_{n=1}^N |\hat{\psi}_n - \psi_n| + \mathbb{I}\{\max_n |\psi_n - \hat{\psi}_n| \geq \min_n |\psi_n|\}\right) \geq \epsilon/\lfloor N\alpha \rfloor$. Since $\max_{n=1}^N |\hat{\psi}_n - \psi_n|$ converges to zero in probability,
$\mathbb{I}\{\max_n |\psi_n - \hat{\psi}_n| \geq \min_n |\psi_n|\}$ also converges to zero in probability. Hence, the probability
that $\left(\max_{n=1}^N |\hat{\psi}_n - \psi_n| + \mathbb{I}\{\max_n |\psi_n - \hat{\psi}_n| \geq \min_n |\psi_n|\}\right) \geq \epsilon/\lfloor N\alpha \rfloor$ converges to zero.
In all, $\Pr(|\hat{\Delta} - \Delta(\alpha)| > \epsilon)$ goes to zero both in the case where $\{n : \psi_n \neq 0\}$ is empty and in
the complement case. As the choice of $\epsilon$ was arbitrary, we have shown $\hat{\Delta} \xrightarrow{p} \Delta(\alpha)$.

Proof of Theorem 3.A.2. Similar to the proof of lemma 3.B.1, we only consider expectations
under the full-data posterior. Hence, we will write $\mathbb{E}$ instead of $\mathbb{E}_{1_N}$ to simplify notation.
Variance and covariance operations are also understood to be taken under the full-data
posterior. To lighten the dependence of the notation on the parameter $\beta$, we will write $g(\beta)$
as $g$ and $L(d^{(n)} \mid \beta)$ as $L_n$ when talking about the expectation of $g(\beta)$ and $L(d^{(n)} \mid \beta)$.
Define the following multivariate function:
$$f(\beta) := \left(g(\beta), L(d^{(1)} \mid \beta), g(\beta)L(d^{(1)} \mid \beta), \ldots, L(d^{(N)} \mid \beta), g(\beta)L(d^{(N)} \mid \beta)\right)^T.$$
As defined, $f(\cdot)$ is a mapping from $P$-dimensional space to $(2N+1)$-dimensional space.
Since $(\beta^{(1)}, \ldots, \beta^{(S)})$ is an i.i.d. sample, $f(\beta^{(1)}), f(\beta^{(2)}), \ldots, f(\beta^{(S)})$ is also an i.i.d. sample.
Because of the moment conditions we have assumed, each $f(\beta)$ has finite variance. We apply
the Lindeberg-Feller multivariate central limit theorem [203, Proposition 2.27], and conclude
that
$$\sqrt{S}\left(\frac{1}{S}\sum_s f(\beta^{(s)}) - \mathbb{E}f(\beta)\right) \xrightarrow{D} \mathcal{N}(0, \Xi)$$
where the limit is $S \to \infty$, and $\Xi$ is a symmetric $(2N+1) \times (2N+1)$ dimensional matrix,
which we specify next. It suffices to write down the formula for the $(i, j)$ entry of $\Xi$ where $i \leq j$:
$$\Xi_{i,j} = \begin{cases} \mathrm{Var}(g) & \text{if } i = j = 1 \\ \mathrm{Cov}(g, L_n) & \text{if } i = 1, j > 1 \\ \mathrm{Cov}(L_n, L_m) & \text{if } i = 2n, j = 2m \\ \mathrm{Cov}(L_n, gL_m) & \text{if } i = 2n, j = 2m + 1 \\ \mathrm{Cov}(gL_n, L_m) & \text{if } i = 2n + 1, j = 2m \\ \mathrm{Cov}(gL_n, gL_m) & \text{if } i = 2n + 1, j = 2m + 1 \end{cases}$$
To relate the asymptotic distribution of f(β) to that of the vector ψ̂, we now use the delta method. Define the following function, which acts on (2N + 1)-dimensional vectors and returns N-dimensional vectors:

    h([x_1, x_2, …, x_{2N+1}]ᵀ) := (x_3 − x_1 x_2, x_5 − x_1 x_4, x_7 − x_1 x_6, …, x_{2N+1} − x_1 x_{2N})ᵀ.

Written this way, h(·) clearly transforms the sample mean (1/S) Σ_s f(β^(s)) into the estimated influences: ψ̂ = h((1/S) Σ_s f(β^(s))). Furthermore, h(·) applied to E f(β) yields the vector of true influences: ψ = h(E f(β)). h(·) is continuously differentiable everywhere: its Jacobian is the N × (2N + 1) matrix
    J_h = ⎡ −x_2      −x_1   1     0     0    ⋯     0 ⎤
          ⎢ −x_4       0     0    −x_1   1    ⋯     0 ⎥
          ⎢   ⋮        ⋮     ⋮                ⋱     ⋮ ⎥
          ⎣ −x_{2N}    0     0     ⋯     0   −x_1   1 ⎦ ,

which is non-zero. Therefore, we apply the delta method [203, Theorem 3.1] and conclude that

    √S (ψ̂ − ψ) →_D N( 0, J_h|_{x=Ef(β)} Ξ (J_h|_{x=Ef(β)})ᵀ ).
The (i, j) entry of the asymptotic covariance matrix is the dot product between the i-th row of J_h|_{x=Ef(β)} and the j-th column of Ξ (J_h|_{x=Ef(β)})ᵀ. The former is

    [ −EL_i, 0, 0, …, −Eg, 1, …, 0 ],

where −Eg sits in the 2i-th entry and 1 in the (2i + 1)-th entry. The latter is

    ( (−EL_j) Cov(g, g) − (Eg) Cov(g, L_j) + Cov(g, gL_j),
      ⋮
      (−EL_j) Cov(gL_N, g) − (Eg) Cov(gL_N, L_j) + Cov(gL_N, gL_j) )ᵀ.
Taking the dot product, we have that the (i, j) entry of the asymptotic covariance matrix is equal to

    Cov(gL_i, gL_j) − (Eg) [Cov(gL_i, L_j) + Cov(gL_j, L_i)]
      − [(EL_j) Cov(g, gL_i) + (EL_i) Cov(g, gL_j)]
      + (EL_j)(EL_i) Var(g)
      + (Eg)² Cov(L_i, L_j)
      + (Eg) [(EL_j) Cov(g, L_i) + (EL_i) Cov(g, L_j)].

It is simple to check that the last display is equal to the covariance between (g − E[g])(L_j − E[L_j]) and (g − E[g])(L_i − E[L_i]). □
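As a companion to this proof, the following minimal Python sketch (mine, not the thesis code) computes the two plug-in quantities the theorem concerns from S posterior draws: the estimated influences ψ̂ and an estimate of the asymptotic covariance via the identity that closes the proof. The draws g and L are synthetic placeholders.

import numpy as np

rng = np.random.default_rng(1)
S, N = 5000, 3
g = rng.normal(size=S)                           # g(beta^(s)), s = 1, ..., S
L = 0.5 * g[:, None] + rng.normal(size=(S, N))   # L(d^(n) | beta^(s))

# Estimated influences: psi_hat_n = sample Cov(g, L_n), i.e., h applied to
# the vector of sample means of f(beta).
psi_hat = (g[:, None] * L).mean(axis=0) - g.mean() * L.mean(axis=0)

# Plug-in asymptotic covariance from the closing identity:
# Sigma_ij = Cov((g - Eg)(L_i - EL_i), (g - Eg)(L_j - EL_j)).
U = (g - g.mean())[:, None] * (L - L.mean(axis=0))
Sigma_hat = np.cov(U, rowvar=False)              # N x N estimate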
Proof of Lemma 3.A.4. We use the (shape, rate) parametrization of the gamma distribution. Let the prior over τ be Gamma(α, β) where α, β > 0. Conditioned on observations, the posterior distribution of (µ, τ) is normal-gamma:

    τ ∼ Gamma( α + N/2, β + (N/2) [ (1/N) Σ_{n=1}^N (x^(n))² − x̄² ] ),
    ϵ ∼ N(0, 1),
    µ | τ, ϵ = x̄ + ϵ/√(Nτ).
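As an aside, drawing from this normal-gamma posterior is straightforward; a minimal Python sketch with placeholder data follows (note that numpy parametrizes the gamma distribution by (shape, scale), so the rate must be inverted).

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=50)      # placeholder observations
alpha, beta = 2.0, 1.0                           # prior (shape, rate)
N, xbar = x.size, x.mean()

shape = alpha + N / 2
rate = beta + (N / 2) * ((x ** 2).mean() - xbar ** 2)

tau = rng.gamma(shape, 1.0 / rate, size=10_000)  # posterior draws of tau
eps = rng.standard_normal(10_000)
mu = xbar + eps / np.sqrt(N * tau)               # posterior draws of mu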
In this section, since we only take expectations under the original full-data posterior, we will lighten the notation's dependence on w, and write E instead of E1N. Similarly, covariance and variance operators are understood to be under the full-data posterior.
For completeness, we compute Cov(µ, L(d^(n) | µ, τ)). We know that µ − Eµ = ϵ/√(Nτ). The log likelihood, as a function of τ and ϵ, is

    (1/2) log(τ/(2π)) − (τ/2)(x^(n) − x̄)² − (1/(2N)) ϵ² + ((x^(n) − x̄)/√N) ϵ √τ.

The covariance of µ and L(d^(n) | µ, τ) is equal to the covariance between ϵ/√(Nτ) and L(d^(n) | µ, τ). Since ϵ/√(Nτ) is zero mean, the covariance is equal to the expectation of the product. Since ϵ is independent of τ, many of the terms that form the expectation of the product are zero. After some algebra, the only term that remains is

    E[ ((x^(n) − x̄)/N) ϵ² ] = (x^(n) − x̄)/N.
To compute the asymptotic variance of ψ̂_n, it suffices to compute the expectation of (ϵ²/(Nτ)) (L(d^(n) | µ, τ) − E L(d^(n) | µ, τ))². The calculations are simple but tedious, and we omit them; we will only state the result. The expectation of (ϵ²/(Nτ)) (L(d^(n) | µ, τ) − E L(d^(n) | µ, τ))² is

    (1/(4N²)) E[τ⁻¹(τ − Eτ)²] (x^(n) − x̄)⁴
      + [ (3 + E[τ⁻¹(τ − Eτ)]) / N² − E[τ⁻¹(log τ − E log τ)] / (2N) ] (x^(n) − x̄)²
      + (1/(2N³)) E[τ⁻¹] + (1/(2N)) E[τ⁻¹(log τ − E log τ)²] − (1/N²) E[τ⁻¹(log τ − E log τ)²].
Since the asymptotic variance is equal to this expectation minus the square of the covariance between L(d^(n) | µ, τ) and µ, our final expression for the asymptotic variance Σ_{n,n} is

    (1/(4N²)) E[τ⁻¹(τ − Eτ)²] (x^(n) − x̄)⁴
      + [ (2 + E[τ⁻¹(τ − Eτ)]) / N² − E[τ⁻¹(log τ − E log τ)] / (2N) ] (x^(n) − x̄)²
      + (1/(2N³)) E[τ⁻¹] + (1/(2N)) E[τ⁻¹(log τ − E log τ)²] − (1/N²) E[τ⁻¹(log τ − E log τ)²].
The constants D1, D2, and D3 mentioned in the lemma statement can be read off this last display. It is possible to replace the posterior functionals of τ with quantities that depend only on the prior (α, β) and the observed data. Such formulas might be helpful in studying the behavior of Σ_{n,n} in the limit where some x^(n) becomes very large. □
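The posterior functionals of τ above are one-dimensional gamma expectations, so they are also cheap to estimate by Monte Carlo. A minimal sketch with placeholder posterior (shape, rate) values:

import numpy as np

rng = np.random.default_rng(3)
shape, rate = 26.0, 40.0                         # placeholder posterior values
tau = rng.gamma(shape, 1.0 / rate, size=1_000_000)

E_inv = (1.0 / tau).mean()                                 # E[tau^{-1}]
E_inv_sq = ((1.0 / tau) * (tau - tau.mean()) ** 2).mean()  # E[tau^{-1}(tau - E tau)^2]
clog = np.log(tau) - np.log(tau).mean()
E_inv_log2 = ((1.0 / tau) * clog ** 2).mean()    # E[tau^{-1}(log tau - E log tau)^2]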
3.C Additional Experimental Details
3.C.1 Linear model
Recall that the t location-scale distribution has three hyperparameters: ν, µ, σ. ν is the degrees of freedom, µ is the location, and σ is the scale. The density at y of this distribution is

    [ Γ((ν + 1)/2) / ( Γ(ν/2) √(πν σ²) ) ] ( 1 + (y − µ)²/(ν σ²) )^{−(ν+1)/2}.

Recall that the latent parameters of our model are the baseline µ, the treatment effect θ, and the noise σ. We set the prior over each of µ, θ, and σ to be t location-scale with degrees of freedom 3, location 0, and scale 1000.
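As a sanity check (mine, not the thesis's), this density is exactly scipy's t distribution with loc and scale arguments, so the formula above can be verified numerically:

import numpy as np
from scipy.special import gammaln
from scipy.stats import t

def t_ls_logpdf(y, nu, mu, sigma):
    # Log of the t location-scale density displayed above.
    return (gammaln((nu + 1) / 2) - gammaln(nu / 2)
            - 0.5 * np.log(np.pi * nu * sigma ** 2)
            - (nu + 1) / 2 * np.log1p((y - mu) ** 2 / (nu * sigma ** 2)))

y = np.linspace(-5000.0, 5000.0, 11)
assert np.allclose(t_ls_logpdf(y, 3, 0, 1000), t.logpdf(y, df=3, loc=0, scale=1000))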

3.C.2 Hierarchical model for microcredit data
The entire generative process, from the top down (observations to priors), is as follows:

    |y^(n)| ∼ Log-Normal( µ^(country)_{g^(n)} + τ^(country)_{g^(n)} x^(n), exp(ξ^(country)_{g^(n)} + θ^(country)_{g^(n)} x^(n)) ),
    µ^(country)_k ∼ Normal(µ, σ²_(control)) i.i.d. across k,
    τ^(country)_k ∼ Normal(τ, σ²_(treatment)) i.i.d. across k,
    ξ^(country)_k ∼ Normal(ξ, ψ²_(control)) i.i.d. across k,
    θ^(country)_k ∼ Normal(θ, ψ²_(treatment)) i.i.d. across k,
    µ ∼ Normal(0, 10²),
    τ ∼ Normal(0, 10²),
    σ_(control) ∼ Cauchy(0, 2),
    σ_(treatment) ∼ Cauchy(0, 2),
    ξ ∼ Normal(0, 10²),
    θ ∼ Normal(0, 10²),
    ψ_(control) ∼ Cauchy(0, 2),
    ψ_(treatment) ∼ Cauchy(0, 2).
The observed data are x(n) , g (n) , y (n) ; all other quantities are latent, and estimated by MCMC.
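For concreteness, the numpy sketch below forward-simulates this generative process; the thesis fits the model by MCMC rather than simulating it, the half-Cauchy treatment of the scale parameters (taking absolute values) is my assumption, and all sizes are placeholders.

import numpy as np

rng = np.random.default_rng(4)
K, N = 7, 1000                                   # countries, observations
g = rng.integers(0, K, size=N)                   # country g^(n)
x = rng.integers(0, 2, size=N)                   # treatment indicator x^(n)

mu, tau, xi, theta = rng.normal(0.0, 10.0, size=4)
# Scale parameters: Cauchy(0, 2) priors, taken positive (assumed half-Cauchy).
sig_con, sig_trt, psi_con, psi_trt = np.abs(2.0 * rng.standard_cauchy(4))

mu_k = rng.normal(mu, sig_con, size=K)
tau_k = rng.normal(tau, sig_trt, size=K)
xi_k = rng.normal(xi, psi_con, size=K)
theta_k = rng.normal(theta, psi_trt, size=K)

# |y^(n)| is log-normal with the location and scale displayed above.
abs_y = rng.lognormal(mu_k[g] + tau_k[g] * x, np.exp(xi_k[g] + theta_k[g] * x))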

3.C.3 Hierarchical model for tree mortality data
The likelihood for the n-th observation is exponentially modified Gaussian with standard deviation σ, scale λ, and mean

    µ^(time)_{t^(n)} + µ^(region)_{l^(n)} + µ + ( θ^(time)_{t^(n)} + θ^(location)_{l^(n)} + θ ) x^(n) + f(x^(n)),

with f(x) := Σ_{i=1}^{10} B_i(x) γ_i, where the B_i's are fixed thin plate spline basis functions [207] and the γ_i's are random: γ_i ∼ Normal(0, σ²_(smooth)). In all, the parameters of interest are

• Fixed effects: µ and θ.
• Random effects: time (µ^(time)_{t^(n)}, θ^(time)_{t^(n)}) and location (µ^(region)_{l^(n)}, θ^(location)_{l^(n)}).

• Degree of smoothing: σ_(smooth).

Since there are many regions (nearly 3,000) and periods of time (30), the number of random effects is large. Senf et al. [183] use brms()'s default priors for all parameters; in this default, the fixed effects are given improper uniform priors over the real line. To work with proper distributions, we set the priors for the random effects and the degree of smoothing in the same way as Senf et al. [183]. For the fixed effects, we use t location-scale distributions with degrees of freedom 3, location 0, and scale 1000.
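For concreteness, a minimal sketch of the exponentially modified Gaussian likelihood, simulated directly as Gaussian-plus-exponential noise around the mean structure; reading the scale λ as the exponential component's mean is my assumption, and the mean vector is a placeholder for the full fixed/random-effect structure above.

import numpy as np

rng = np.random.default_rng(5)
n = 500
mean = rng.normal(size=n)    # placeholder for the fixed/random-effect mean
sigma, lam = 0.3, 0.8        # Gaussian standard deviation and exponential scale

y = mean + rng.normal(0.0, sigma, size=n) + rng.exponential(lam, size=n)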

Conclusion

My thesis has taken a modular approach to computational challenges in Bayesian unsupervised
learning. In particular, I have identified three separate and diverse computational issues,
and I have made progress on each of them. (1) If a practitioner has access to many parallel
processors and wishes to use parallelism to speed up estimation, I provide a way to quickly and
accurately calculate expectations in partition-valued models (chapter 1). (2) If the practitioner
wants to use an infinite-dimensional prior for its modeling flexibility, I provide an accurate
and easy-to-use finite approximation (chapter 2). (3) If the practitioner wants to check
whether their conclusions change after removing a small amount of data, I have established
the first step of answering this question in Bayesian supervised learning (chapter 3).
A number of open questions remain. I discuss technical open questions relating to each
project in sections 1.6, 2.7 and 3.6. In what follows, I focus on open questions at the
conceptual level of the full thesis: relating to the high-level goal of speeding up Bayesian
unsupervised learning, e.g. by considering the chapters’ conclusions in tandem.
A first open challenge arises from finishing the work of chapter 3. While the focus of
chapter 3 has been Bayesian supervised learning, conceptually, the approximations made
there also apply to the posterior and MCMC in unsupervised learning. In certain cases,
posterior expectations of the combinatorial variables, such as partition or trait allocation, are
of interest. For instance, in topic modeling, one might report the expected number of topics.
However, in current applications, such expectations normally serve an exploratory purpose,
rather than a decision-making purpose. Contrast general exploration with our framing of the
problem in chapter 3: can dropping a small fraction of data change the conclusions of our
data analysis? Future work in this direction should identify a compelling decision made with
posterior expectations, and detect sensitivity of such decisions.
Additional open challenges arise by considering that the challenges of each chapter often
appear together in practice. For instance, an analyst that is interested in small-data sensitivity
might also hope to spend little time on estimating the posterior in the first place. To address
both challenges, one might extend coupled chains to approximate posterior expectations
and quantify sensitivity, i.e., combining chapter 1 and chapter 3. We have seen cases where
MCMC takes multiple hours to run, such as Senf et al. [183]. Coupled chain estimators, such
as what I proposed in chapter 1, have the promise of reducing MCMC run time. Future work
in this direction should approximate the worst-case quantity of interest using coupled chain
estimators and quantify the MCMC sampling variability of the resulting approximation.
Finally, an analyst modeling combinatorial structures more complicated than partitions,
such as those arising in topic modeling [22, 171] or feature allocations [79], might be interested in using
Bayesian nonparametrics for the adaptability of the observed number of groups and might
also hope to spend little time on estimating the posterior. Both parallelism (chapter 1) and
finite approximations (chapter 2) might help such an analyst. The biggest conceptual obstacle
is extending the coupling scheme in chapter 1 beyond partition-valued problems. I expect
that a good coupling will need to properly handle an analogous label-switching problem, and that ideas from optimal transport will be relevant even outside of partition-valued models.
References

[1] Ayan Acharya, Joydeep Ghosh, and Mingyuan Zhou. Nonparametric Bayesian fac-
tor analysis for dynamic count matrices. In International Conference on Artificial
Intelligence and Statistics, 2015.

[2] José Antonio Adell and Alberto. Lekuona. Sharp estimates in signed Poisson approxi-
mation of Poisson mixtures. Bernoulli, 11(1):47–65, 2005.

[3] D Aldous. Exchangeability and related topics. École d’Été de Probabilités de Saint-Flour
XIII—1983, pages 1–198, 1985.

[4] Horst Alzer. On some inequalities for the gamma and psi functions. Mathematics of
computation, 66(217):373–389, 1997.

[5] Manuela Angelucci, Dean Karlan, Jonathan Zinman, Kerry Brennan, Ellen Degnan,
Alissa Fishbane, Andrew Hillis, Hideto Koizumi, Elana Safran, Rachel Strohm, Braulio
Torres, Asya Troychansky, Irene Velez, Glynis Startz, Sanjeev Swamy, Matthew White,
Anna York, and Compartamos Banco. Microcredit impacts: Evidence from a randomized
microcredit program placement experiment by compartamos banco. American Economic
Journal: Applied Economics, 7:151–82, 2015. URL http://www.compartamos.com/
wps/portal/Grupo/InvestorsRelations/FinancialInformation.

[6] Charles E Antoniak. Mixtures of Dirichlet processes with applications to Bayesian
nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.

[7] Julyan Arbel and Igor Prünster. A moment-matching Ferguson & Klass algorithm.
Statistics and Computing, 27(1):3–17, 2017.

[8] Julyan Arbel, Pierpaolo De Blasi, and Igor Prünster. Stochastic approximations to the
Pitman–Yor process. Bayesian Analysis, 14(4):1201–1219, 2019.

[9] Richard Arratia, Andrew D. Barbour, and Simon Tavaré. Logarithmic Combinatorial
Structures: a Probabilistic Approach, volume 1. European Mathematical Society, 2003.

[10] Orazio Attanasio, Britta Augsburg, Ralph De Haas, Emla Fitzsimons, and Heike
Harmgart. The impacts of microfinance: Evidence from joint-liability lending in
mongolia. American Economic Journal: Applied Economics, 7(1):90–122, 2015. ISSN
19457782, 19457790. URL http://www.jstor.org/stable/43189514.
[11] Britta Augsburg, Ralph De Haas, Heike Harmgart, and Costas Meghir. The impacts
of microcredit: Evidence from bosnia and herzegovina. American Economic Journal:
Applied Economics, 7(1):183–203, January 2015. doi:10.1257/app.20130272. URL
https://www.aeaweb.org/articles?id=10.1257/app.20130272.

[12] Abhijit Banerjee, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan. The miracle
of microfinance? evidence from a randomized evaluation. American Economic Journal:
Applied Economics, 7(1):22–53, January 2015. doi:10.1257/app.20130533. URL https:
//www.aeaweb.org/articles?id=10.1257/app.20130533.

[13] Andrew D. Barbour and Peter Hall. On the rate of Poisson convergence. In Mathe-
matical Proceedings of the Cambridge Philosophical Society, volume 95, pages 473–480.
Cambridge University Press, 1984.

[14] Sanjib Basu, Sreenivasa Rao Jammalamadaka, and Wei Liu. Local Posterior Robustness
with Parametric Priors: Maximum and Average Sensitivity, pages 97–106. Springer
Netherlands, Dordrecht, 1996. ISBN 978-94-015-8729-7. doi:10.1007/978-94-015-8729-
7_6. URL https://doi.org/10.1007/978-94-015-8729-7_6.

[15] Jean Bertoin, T. Fujita, Bernard Roynette, and Marc Yor. On a particular class of
self-decomposable random variables: the durations of Bessel excursions straddling
independent exponential times. Probability and Mathematical Statistics, 26:315–366,
2006.

[16] Peter J. Bickel. On Some Robust Estimates of Location. The Annals of Mathematical
Statistics, 36(3):847 – 858, 1965.

[17] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan,
Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman.
Pyro: deep universal probabilistic programming. Journal of Machine Learning Research,
2018.

[18] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.

[19] David Blackwell and James B. MacQueen. Ferguson distributions via Polya urn schemes.
The Annals of Statistics, 1(2):353–355, 03 1973. doi:10.1214/aos/1176342372.

[20] D. M. Blei, T. L. Griffiths, and M I Jordan. The nested Chinese restaurant process
and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):
1–30, 2010.

[21] David M. Blei and Michael I. Jordan. Variational inference for Dirichlet process mixtures.
Bayesian Analysis, 1(1):121 – 143, 2006.

[22] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003. ISSN 1532-4435.
[23] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A
review for statisticians. Journal of the American Statistical Association, 112(518):
859–877, April 2017. ISSN 1537-274X. doi:10.1080/01621459.2017.1285773. URL
http://dx.doi.org/10.1080/01621459.2017.1285773.

[24] Lennart Bondesson. On simulation from infinitely divisible distributions. Advances in
Applied Probability, 14(4):855–869, 1982.

[25] Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Dis-
placement interpolation using Lagrangian mass transport. In Proceedings of the 2011
SIGGRAPH Asia Conference, 2011.

[26] Anders Brix. Generalized gamma measures and shot-noise cox processes. Advances in
Applied Probability, 31:929–953, 1999.

[27] T. Broderick, Lester Mackey, J. Paisley, and M I Jordan. Combinatorial Clustering
and the Beta Negative Binomial Process. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 37(2):290–306, 2015.

[28] Tamara Broderick, Michael I. Jordan, and Jim Pitman. Beta processes, stick-breaking
and power laws. Bayesian analysis, 7(2):439–476, 2012.

[29] Tamara Broderick, Jim Pitman, and Michael I. Jordan. Feature allocations, probability
functions, and paintboxes. Bayesian Analysis, 8(4):801–836, 2013.

[30] Tamara Broderick, Ashia C. Wilson, and Michael I. Jordan. Posteriors, conjugacy, and
exponential families for completely random measures. Bernoulli, 24(4B):3181–3221, 11
2018. doi:10.3150/16-BEJ855.

[31] Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample
robustness metric: Can dropping a little data change conclusions?, 2020.

[32] Y Burda, Roger B Grosse, and R Salakhutdinov. Importance Weighted Autoencoders.
In International Conference on Learning Representations, 2016.

[33] Trevor Campbell, Diana Cai, and Tamara Broderick. Exchangeable trait allocations.
Electronic Journal of Statistics, 12(2):2290–2322, 2018.

[34] Trevor Campbell, Jonathan H. Huggins, Jonathan P. How, and Tamara Broderick.
Truncated random measures. Bernoulli, 25(2):1256–1288, 05 2019. doi:10.3150/18-
BEJ1020.

[35] Antonio Canale and David B. Dunson. Bayesian kernel mixtures for counts. Journal of
the American Statistical Association, 106(496):1528–1539, 2011.

[36] Clément Canonne. A short note on Poisson tail bounds. Technical report available from
https://ccanonne.github.io/. URL http://www.cs.columbia.edu/~ccanonne/files/misc/
2017-poissonconcentration.pdf.
[37] Bradley P. Carlin and Nicholas G. Polson. An expected utility approach to influence
diagnostics. Source: Journal of the American Statistical Association, 86:1013–1021,
1991.

[38] Edward Carlstein. The use of subseries values for estimating the variance of a general
statistic from a stationary sequence. Annals of Statistics, 14:1171–1179, 1986.

[39] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich,
Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan:
A probabilistic programming language. Journal of Statistical Software, 76:1–32, 2017.

[40] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich,
Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan:
A probabilistic programming language. Journal of Statistical Software, 76:1–32, 2017.

[41] Małgorzata Charytanowicz, Jerzy Niewczas, Piotr Kulczycki, Piotr A Kowalski, Szymon
Łukasik, and Sławomir Żak. Complete gradient clustering algorithm for features analysis
of X-ray images. In Information Technologies in Biomedicine, pages 15–24. Springer,
2010.

[42] Sitan Chen, Michelle Delcourt, Ankur Moitra, Guillem Perarnau, and Luke Postle. Im-
proved bounds for randomly sampling colorings via linear programming. In Proceedings
of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2019.

[43] Zhijun Chen, Yishi Zhang, Chaozhong Wu, and Bin Ran. Understanding individualiza-
tion driving states via latent dirichlet allocation model. IEEE Intelligent Transportation
Systems Magazine, 11:41–53, 6 2019. ISSN 19411197. doi:10.1109/MITS.2019.2903525.

[44] C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated
in the case of the binomial. Biometrika, 26(4):404–413, 1934. ISSN 00063444. URL
http://www.jstor.org/stable/2331986.

[45] Bruno Crépon, Florencia Devoto, Esther Duflo, and William Parienté. Estimating the
impact of microcredit on those who take it up: Evidence from a randomized experiment
in morocco. American Economic Journal: Applied Economics, 7(1):123–50, January
2015. doi:10.1257/app.20130535. URL https://www.aeaweb.org/articles?id=10.1257/
app.20130535.

[46] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In
C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems, volume 26, 2013.

[47] Perry de Valpine, Daniel Turek, Christopher Paciorek, Cliff Anderson-Bergman, Duncan
Temple Lang, and Ras Bodik. Programming with models: writing statistical algorithms
for general model structures with NIMBLE. Journal of Computational and Graphical
Statistics, 26:403–413, 2017. doi:10.1080/10618600.2016.1172487.
[48] Perry de Valpine, Daniel Turek, Christopher J. Paciorek, Clifford Anderson-Bergman,
Duncan Temple Lang, and Rastislav Bodik. Programming With Models: Writing
Statistical Algorithms for General Model Structures With NIMBLE. Journal of
Computational and Graphical Statistics, 26(2):403–413, 2017.

[49] Daryl DeFord, Moon Duchin, and Justin Solomon. Recombination: a fam-
ily of Markov chains for redistricting. Harvard Data Science Review, 3 2021.
https://hdsr.mitpress.mit.edu/pub/1ds8ptxu.

[50] Sameer Deshpande, Soumya Ghosh, Tin D. Nguyen, and Tamara Broderick. Are you
using test log-likelihood correctly? Transactions on Machine Learning Research, 2024.
ISSN 2835-8856. URL https://openreview.net/forum?id=n2YifD4Dxo.

[51] Luc Devroye and Lancelot James. On simulation and properties of the stable law.
Statistical methods & applications, 23(3):307–343, 2014.

[52] P Diaconis and David Freedman. On the consistency of bayes estimates. The Annals
of Statistics, pages 1–26, 1986.

[53] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential families. The
Annals of Statistics, 7:269–281, 1979.

[54] Benjamin Doerr and Frank Neumann. Theory of evolutionary computation: recent
developments in discrete optimization. Springer Nature, 2019.

[55] Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, and Yee Whye Teh. Variational
inference for the Indian buffet process. In International Conference on Artificial
Intelligence and Statistics, 2009.

[56] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http:
//archive.ics.uci.edu/ml.

[57] Bradley Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics,
7:1–26, 1979.

[58] Michael D. Escobar and Mike West. Bayesian density estimation and inference using
mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

[59] T S Ferguson and M J Klass. A representation of independent increment processes
without Gaussian components. The Annals of Mathematical Statistics, 43(5):1634–1643,
1972.

[60] Thomas S Ferguson. A Bayesian analysis of some nonparametric problems. The Annals
of Statistics, 1:209–230, 1973.

[61] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Bois-
bunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo
Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotoma-
monjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Suther-
land, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python Optimal
Transport. Journal of Machine Learning Research, 22(78):1–8, 2021.

[62] W Fleming. Functions of Several Variables. Springer, 2 edition, 1977.

[63] E B Fox, E. Sudderth, M I Jordan, and A. S. Willsky. A Sticky HDP-HMM with
Application to Speaker Diarization. The Annals of Applied Statistics, 5(2A):1020–1056,
2010.

[64] Daniel Freund and Samuel B. Hopkins. Towards practical robustness auditing for linear
regression, 7 2023. URL http://arxiv.org/abs/2307.16315.

[65] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using
multiple sequences. Statistical Science, 7(4):457–472, 1992.

[66] S. Geman and D. Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian
Restoration of Images. Pattern Analysis and Machine Intelligence, IEEE Transactions
on, (6):721–741, 1984.

[67] S. Ghosal, J. K. Ghosh, and R. V. Ramamoorthi. Posterior consistency of Dirichlet
mixtures in density estimation. Annals of Statistics, 27(1):143–158, 1999.

[68] J. K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer Series in
Statistics, 2003.

[69] Alison L. Gibbs. Convergence in the Wasserstein metric for Markov chain Monte Carlo
algorithms with applications to image restoration. Stochastic Models, 20(4):473–492,
2004.

[70] Walter R. Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling.
Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337–348,
1992.

[71] Ryan Giordano and Tamara Broderick. The bayesian infinitesimal jackknife for variance,
May 2023. URL http://arxiv.org/abs/2305.06466.

[72] Ryan Giordano, Tamara Broderick, and Michael I. Jordan. Covariances, robustness, and variational Bayes. Journal of Machine Learning Research, 19:1–49, 2018. URL http://jmlr.org/papers/v19/17-670.html.

[73] Ryan Giordano, Runjing Liu, Michael I. Jordan, and Tamara Broderick. Evaluating
sensitivity to the stick-breaking prior in bayesian nonparametrics (with discussion).
Bayesian Analysis, 18:287–366, 2023. ISSN 19316690. doi:10.1214/22-BA1309.

[74] Peter W. Glynn and Chang-han Rhee. Exact estimation for Markov chain equilibrium
expectations. Journal of Applied Probability, 51(A):377–389, 2014.
[75] Alexander V. Gnedin. On convergence and extensions of size-biased permutations.
Journal of Applied Probability, 35(3):642–650, 1998. doi:10.1239/jap/1032265212.

[76] Louis Gordon. A stochastic approach to the gamma function. The American Mathe-
matical Monthly, 101(9):858–865, 1994. ISSN 00029890, 19300972.

[77] Dilan Görür and Carl E. Rasmussen. Dirichlet process Gaussian mixture models: choice
of the base distribution. Journal of Computer Science and Technology, 25(4):653–664,
2010.

[78] T. L. Griffiths and Z. Ghahramani. The Indian buffet process: an introduction and
review. Journal of Machine Learning Research, 12:1185–1224, 2011.

[79] Thomas L. Griffiths and Zoubin Ghahramani. The Indian Buffet Process: An In-
troduction and Review. Journal of Machine Learning Research, 12(32):1185–1224,
2011.

[80] Paul Gustafson. Local sensitivity of posterior expectations. The Annals of Statistics,
24:195, 1996.

[81] Mubin Ul Haque, Leonardo Horn Iwaya, and M. Ali Babar. Challenges in docker
development: A large-scale study using stack overflow. In Proceedings of the 14th ACM
/ IEEE International Symposium on Empirical Software Engineering and Measurement
(ESEM), ESEM ’20, New York, NY, USA, 2020. Association for Computing Machinery.
ISBN 9781450375801. doi:10.1145/3382494.3410693. URL https://doi.org/10.1145/
3382494.3410693.

[82] Nils Lid Hjort. Nonparametric Bayes estimators based on beta processes in models for
life history data. The Annals of Statistics, 18(3):1259–1294, 1990.

[83] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent
Dirichlet allocation. In Advances in Neural Information Processing Systems, 2010.

[84] Matthew D. Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting
path lengths in hamiltonian monte carlo. Journal of Machine Learning Research, 15(1):
1593–1623, 2014.

[85] Matthew D Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational
inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

[86] Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In 2010 20th
International Conference on Pattern Recognition, pages 2366–2369. IEEE, 2010.

[87] Jonathan Huggins, Mikolaj Kasprzak, Trevor Campbell, and Tamara Broderick. Val-
idated variational inference via practical posterior error bounds. In International
Conference on Artificial Intelligence and Statistics, pages 1792–1802. PMLR, 2020.

[88] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal
of the American Statistical Association, 96:161–173, 2001.
[89] H. Ishwaran and M Zarepour. Exact and approximate sum representations for the
Dirichlet process. Canadian Journal of Statistics, 30(2):269–283, 2002.

[90] Hemant Ishwaran and Mahmoud Zarepour. Markov chain monte carlo in approximate
dirichlet and beta two-parameter process hierarchical models. Biometrika, 87(2):371–390,
2000.

[91] Pierre E. Jacob. Couplings and Monte Carlo. Course Lecture Notes, 2020.

[92] Pierre E. Jacob, John O'Leary, and Yves F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society, Series B, pages 543–600, 2020. URL https://github.com/pierrejacob/unbiasedmcmc.

[93] Pierre E. Jacob, John O’Leary, and Yves F. Atchadé. Unbiased Markov chain Monte
Carlo methods with couplings. Journal of the Royal Statistical Society Series B, 82(3):
543–600, 2020.

[94] Sonia Jain and Radford M Neal. A split-merge Markov chain Monte Carlo procedure for
the Dirichlet process mixture model. Journal of computational and Graphical Statistics,
13(1):158–182, 2004.

[95] L. F. James. Stick-breaking PG(α,ζ)-generalized gamma processes. Available at
arXiv:1308.6570v3, 2013.

[96] L. F. James. Bayesian Poisson calculus for latent feature modeling via generalized
Indian Buffet Process priors. The Annals of Statistics, 45(5):2016–2045, 2017.

[97] L. F. James, Antonio Lijoi, and Igor Prünster. Posterior Analysis for Normalized
Random Measures with Independent Increments. Scandinavian Journal of Statistics,
36(1):76–97, 2009.

[98] Ajay Jasra, Chris C. Holmes, and David A. Stephens. Markov chain Monte Carlo
methods and the label switching problem in Bayesian mixture modeling. Statistical
Science, pages 50–67, 2005.

[99] Mark Jerrum. Mathematical foundations of the Markov chain Monte Carlo method. In
Probabilistic Methods for Algorithmic Discrete Mathematics, pages 116–165. Springer,
1998.

[100] M. J. Johnson and A. S. Willsky. Bayesian nonparametric hidden semi-Markov models.
Journal of Machine Learning Research, 14:673–701, 2013.

[101] N.L. Johnson, A.W. Kemp, and S. Kotz. Univariate Discrete Distributions. Wiley
Series in Probability and Statistics. Wiley, 2005. ISBN 9780471715801.

[102] Wesley Johnson and Seymour Geisser. A predictive view of the detection and char-
acterization of influential observations in regression analysis. Source: Journal of the
American Statistical Association, 78:137–144, 1983.
[103] Terry C. Jones, Guido Biele, Barbara Mühlemann, Talitha Veith, Julia Schneider, Jörn
Beheim-Schwarzbach, Tobias Bleicker, Julia Tesch, Marie Luisa Schmidt, Leif Erik
Sander, Florian Kurth, Peter Menzel, Rolf Schwarzer, Marta Zuchowski, Jörg Hofmann,
Andi Krumbholz, Angela Stein, Anke Edelmann, Victor Max Corman, and Christian
Drosten. Estimating infectiousness throughout sars-cov-2 infection course. Science, 373,
7 2021. ISSN 10959203. doi:10.1126/science.abi5273.
[104] Olav Kallenberg. Foundations of modern probability. Springer, New York, 2nd edition,
2002.
[105] E. L. Kaplan and Paul Meier. Nonparametric estimation from incomplete observations.
Journal of the American Statistical Association, 53(282):457–481, 1958.
[106] Dean Karlan and Jonathan Zinman. Microcredit in theory and practice: Using ran-
domized credit scoring for impact evaluation. Science, 332(6035):1278–1284, 2011.
[107] Damian J. Kelly and Garrett M. O’Neill. The minimum cost flow problem and the
network simplex solution method. PhD thesis, Citeseer, 1991.
[108] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International
Conference on Learning Representations, 2014.
[109] J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21
(1):59–78, 1967.
[110] J. F. C. Kingman. Random discrete distributions. Journal of the Royal Statistical
Society B, 37(1):1–22, 1975.
[111] JFC Kingman. Poisson Processes, volume 3. Clarendon Press, 1992.
[112] M. Kline. Calculus: An Intuitive and Physical Approach. Dover Books on Mathematics.
Dover Publications, 1998. ISBN 9780486404530. URL https://books.google.com/books?
id=YdjK_rD7BEkC.
[113] Ramesh Madhavrao Korwar and Myles Hollander. Contributions to the theory of
dirichlet processes. The Annals of Probability, 1(4):705–711, 1972.
[114] Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference.
Springer, 2008.
[115] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei.
Automatic differentiation variational inference. Journal of Machine Learning Research,
18:1–45, 2017. URL http://jmlr.org/papers/v18/16-107.html.
[116] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei.
Automatic differentiation variational inference. Journal of Machine Learning Research,
18(14):1–45, 2017.
[117] Kenichi Kurihara, Max Welling, and Y. W. Teh. Collapsed variational Dirichlet process
mixture models. In International Joint Conference on Artificial Intelligence, 2007.
[118] S.N. Lahiri. Resampling Methods for Dependent Data. Springer, 2003. URL https:
//link.springer.com/book/10.1007/978-1-4757-3803-2.
[119] Junpeng Lao, Christopher Suter, Ian Langmore, Cyril Chimisov, Ashish Saxena, Pavel
Sountsov, Dave Moore, Rif A. Saurous, Matthew D. Hoffman, and Joshua V. Dillon.
tfp. mcmc: Modern Markov Chain Monte Carlo Tools Built For Modern Hardware.
arXiv preprint arXiv:2002.01184, 2020.
[120] Günter Last and Mathew Penrose. Lectures on the Poisson Process. Institute of
Mathematical Statistics Textbooks. Cambridge University Press, 2017.
[121] Michael Lavine. Local predictive influence in bayesian linear models with conjugate
priors. Communications in Statistics - Simulation and Computation, 21:269–283, 1
1992. ISSN 15324141. doi:10.1080/03610919208813018.
[122] Lucien Le Cam. An approximation theorem for the Poisson binomial distribution.
Pacific J. Math., 10(4):1181–1197, 1960.
[123] Clement Lee and Darren J. Wilkinson. A review of stochastic block models and
extensions for graph clustering, 12 2019. ISSN 23648228.
[124] Juho Lee, Lancelot F. James, and Seungjin Choi. Finite-dimensional BFRY priors and
variational Bayesian inference for power law models. In Advances in Neural Information
Processing Systems, 2016.
[125] Juho Lee, Xenia Miscouridou, and François Caron. A unified construction for series
representations and finite approximations of completely random measures. Bernoulli,
2022.
[126] David A. Levin and Yuval Peres. Markov chains and mixing times, volume 107.
American Mathematical Society, 2017.
[127] Antonio Lijoi and Igor Prünster. Models beyond the Dirichlet process, page 80–136.
Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University
Press, 2010.
[128] Antonio Lijoi, Igor Prünster, and Stephen G. Walker. On consistency of nonparametric
normal mixtures for Bayesian density estimation. Journal of the American Statistical
Association, 100(472):1292–1296, 2005.
[129] Antonio Lijoi, Igor Prünster, and Tommaso Rigon. The pitman–yor multinomial process
for mixture modelling. Biometrika, 107(4):891–906, 2020.
[130] Antonio Lijoi, Igor Prünster, and Tommaso Rigon. Sampling hierarchies of discrete ran-
dom structures. Statistics and Computing, 30(6):1591–1607, nov 2020. ISSN 0960-3174.
doi:10.1007/s11222-020-09961-7. URL https://doi.org/10.1007/s11222-020-09961-7.
[131] Antonio Lijoi, Igor Prünster, and Tommaso Rigon. Finite-dimensional discrete random
structures and bayesian clustering. Journal of the American Statistical Association, 0
(0):1–13, 2023.
[132] Torgny Lindvall. Lectures on the coupling method. Courier Corporation, 2002.

[133] Silvia Liverani, David I. Hastie, Lamiae Azizi, Michail Papathomas, and Sylvia Richard-
son. PReMiuM: An R package for profile regression mixture models using Dirichlet
processes. Journal of Statistical Software, 64(7):1, 2015.

[134] Michel Loeve. Ranking limit problem. In Proceedings of the Third Berkeley Symposium
on Mathematical Statistics and Probability, Volume 2: Contributions to Probability
Theory, pages 177–194, Berkeley, Calif., 1956.

[135] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-
tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):
1145–1190, 2019.

[136] Steven N. MacEachern. Estimating normal means with a conjugate style Dirichlet
process prior. Communications in Statistics - Simulation and Computation, 23(3):
727–741, 1994.

[137] E C Marshall and D J Spiegelhalter. Identifying outliers in bayesian hierarchical models:
a simulation-based approach. Bayesian Analysis, 2:409–444, 2007.

[138] Robert E Mcculloch. Local model influence. Source: Journal of the American Statistical
Association, 84:473–478, 1989.

[139] Rachael Meager. Understanding the average impact of microcredit expansions: A
bayesian hierarchical analysis of seven randomized experiments. American Economic
Journal: Applied Economics, 11:57–91, 2019. ISSN 19457790. doi:10.1257/app.20170299.

[140] Rachael Meager. Aggregating distributional treatment effects: A bayesian hierarchical
analysis of the microcredit literature. American Economic Review, 112(6):1818–47,
June 2022. doi:10.1257/aer.20181811. URL https://www.aeaweb.org/articles?id=10.
1257/aer.20181811.

[141] Marina Meilă. Comparing clusterings—an information based distance. Journal of
Multivariate Analysis, 98(5):873–895, 2007.

[142] Russell B Millar and Wayne S Stewart. Assessment of locally influential observations
in bayesian models. Bayesian Analysis, 2:365–384, 2007.

[143] Jeffrey W Miller and Matthew T Harrison. Mixture models with a prior on the number
of components. Journal of the American Statistical Association, 113(521):340–356,
2018.

[144] Thomas B. Minka, Galit Shmueli, Joseph B. Kadane, Sharad Borle, and
Peter Boatwright. Computing with the com-poisson distribution. Tech-
nical report. URL https://www.microsoft.com/en-us/research/publication/
computing-com-poisson-distribution/.
[145] B. G. Mirkin and L. B. Chernyi. Measurement of the distance between distinct partitions
of a finite set of objects. Automation and Remote Control, 5:120–127, 1970.

[146] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning, 2020. URL https://www.github.com/deepmind/mc_gradients.

[147] Ankur Moitra and Dhruv Rohatgi. Provably auditing ordinary least squares in low
dimensions, 5 2022. URL http://arxiv.org/abs/2205.14284.

[148] Warwick Nash, T.L. Sellers, S.R. Talbot, A.J. Cawthorn, and W.B. Ford. The Population
Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from
the North Coast and Islands of Bass Strait. Sea Fisheries Division, Technical Report,
48, 01 1994.

[149] Radford M Neal. Circularly-coupled markov chain sampling. Technical report, University
of Toronto, 1992.

[150] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[151] Tin D. Nguyen, Brian L. Trippe, and Tamara Broderick. Many processors, little time:
Mcmc for partitions via optimal transport couplings. In Gustau Camps-Valls, Francisco
J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference
on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning
Research, pages 3483–3514. PMLR, 28–30 Mar 2022.

[152] Tin D. Nguyen, Jonathan Huggins, Lorenzo Masoero, Lester Mackey, and Tamara
Broderick. Independent Finite Approximations for Bayesian Nonparametric Inference.
Bayesian Analysis, pages 1 – 38, 2023. doi:10.1214/23-BA1385. URL https://doi.org/
10.1214/23-BA1385.

[153] Peter Orbanz. Conjugate projective limits. Available at arXiv:1012.0363v2, 2010.

[154] James B. Orlin. A faster strongly polynomial minimum cost flow algorithm. Operations
Research, 41(2):338–350, 1993.

[155] John Paisley and Lawrence Carin. Nonparametric factor analysis with beta process
priors. In International Conference on Machine Learning, 2009.

[156] John Paisley, Lawrence Carin, and David Blei. Variational inference for stick-breaking
beta process priors. In International Conference on Machine Learning, 2011.

[157] John Paisley, David M. Blei, and Michael I. Jordan. Stick-breaking beta processes
and the Poisson process. In International Conference on Artificial Intelligence and
Statistics, 2012.

[158] K Palla, D A Knowles, and Z. Ghahramani. An infinite latent attribute model for
network data. In International Conference on Machine Learning, 2012.
[159] Mihael Perman, Jim Pitman, and Marc Yor. Size-biased sampling of poisson point
processes and excursions. Probability Theory and Related Fields, 92(1):21–39, 1992.

[160] Robert Piessens, Elise de Doncker-Kapenga, Christoph W Überhuber, and David K
Kahaner. QUADPACK: a subroutine package for automatic integration, volume 1.
Springer Science & Business Media, 2012.

[161] Jim Pitman. Exchangeable and partially exchangeable random partitions. Probability
theory and related fields, 102(2):145–158, 1995.

[162] Jim Pitman. Some developments of the blackwell-macqueen urn scheme. Lecture
Notes-Monograph Series, pages 245–267, 1996.

[163] Jim Pitman. Combinatorial Stochastic Processes: Ecole d’Eté de Probabilités de Saint-
Flour XXXII-2002. Springer, 2006.

[164] Jim Pitman and Marc Yor. The two-parameter poisson-dirichlet distribution derived
from a stable subordinator. The Annals of Probability, pages 855–900, 1997.

[165] David Pollard. A User’s Guide to Measure Theoretic Probability. Cambridge University
Press, 2001.

[166] David Pollard. Convergence of stochastic processes. Springer Science & Business Media,
2012.

[167] Nicholas G. Polson, James G. Scott, and Jesse Windle. Bayesian inference for logis-
tic models using pólya-gamma latent variables. Journal of the American Statistical
Association, 108:1339–1349, 2013. ISSN 1537274X. doi:10.1080/01621459.2013.829001.

[168] Tenelle Porter, Diego Catalán Molina, Andrei Cimpian, Sylvia Roberts, Afiya Fredericks,
Lisa S. Blackwell, and Kali Trzesniewski. Growth-mindset intervention delivered by
teachers boosts achievement in early adolescence. Psychological Science, 33:1086–1096,
7 2022. ISSN 14679280. doi:10.1177/09567976211061109.

[169] Sandhya Prabhakaran, Elham Azizi, Ambrose Carr, and Dana Pe’er. Dirichlet process
mixture model for correcting technical variation in single-cell gene expression data. In
International Conference on Machine Learning, 2016.

[170] M. T. Pratola, E. I. George, and R. E. McCulloch. Influential observations in bayesian
regression tree models. Journal of Computational and Graphical Statistics, 2023. ISSN
15372715. doi:10.1080/10618600.2023.2210180.

[171] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of Population
Structure Using Multilocus Genotype Data. Genetics, 155(2):945–959, 06 2000.

[172] James Gary Propp and David Bruce Wilson. Exact sampling with coupled Markov
chains and applications to statistical mechanics. Random Structures & Algorithms, 9
(1-2):223–252, 1996.
[173] Maxim Rabinovich, Elaine Angelino, and Michael I Jordan. Variational consensus
monte carlo. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 28. Curran Associates,
Inc., 2015.
[174] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal
of the American Statistical Association, 66(336):846–850, 1971.
[175] Rajesh Ranganath, Sean Gerrish, and D. M. Blei. Black box variational inference. In
International Conference on Artificial Intelligence and Statistics, 2014.
[176] Eugenio Regazzini, Antonio Lijoi, and Igor Prünster. Distributional results for means
of normalized random measures with independent increments. The Annals of Statistics,
31(2):560–585, 2003.
[177] Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Arcand,
David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael
Jones, Anna Klein, Lauren Milechin, Julia Mullen, Andrew Prout, Antonio Rosa, Charles
Yee, and Peter Michaleas. Interactive supercomputing on 40,000 cores for machine
learning and data analysis. In 2018 IEEE High Performance extreme Computing
Conference (HPEC), pages 1–6. IEEE, 2018.
[178] Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Arcand,
David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael
Jones, Anna Klein, Lauren Milechin, Julia Mullen, Andrew Prout, Antonio Rosa, Charles
Yee, and Peter Michaleas. Interactive supercomputing on 40,000 cores for machine
learning and data analysis. In 2018 IEEE High Performance extreme Computing
Conference (HPEC), pages 1–6. IEEE, 2018.
[179] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic back-
propagation and approximate inference in deep generative models. In International
Conference on Machine Learning, 2014.
[180] Anirban Roychowdhury and Brian Kulis. Gamma processes, stick-breaking, and
variational inference. In International Conference on Artificial Intelligence and Statistics,
2015.
[181] Fabrizio Ruggeri and Larry Wasserman. Infinitesimal sensitivity of posterior dis-
tributions. The Canadian Journal of Statistic, 21:195–203, 1993. URL https:
//www.jstor.org/stable/3315811.
[182] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I
George, and Robert E McCulloch. Bayes and big data: The consensus Monte Carlo
algorithm. International Journal of Management Science and Engineering Management,
11(2):78–88, 2016.
[183] Cornelius Senf, Allan Buras, Christian S. Zang, Anja Rammig, and Rupert Seidl. Excess
forest mortality is consistently linked to drought across europe. Nature Communications,
11, 12 2020. ISSN 20411723. doi:10.1038/s41467-020-19924-1.
[184] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:
639–650, 1994.

[185] Miriam Shiffman, Ryan Giordano, and Tamara Broderick. Could dropping a few cells
change the takeaways from differential expression?, 2023.

[186] Galit Shmueli, Thomas P. Minka, Joseph B. Kadane, Sharad Borle, and Peter
Boatwright. A useful distribution for fitting discrete data: revival of the con-
way–maxwell–poisson distribution. Journal of the Royal Statistical Society: Se-
ries C (Applied Statistics), 54(1):127–142, 2005. doi:https://doi.org/10.1111/j.1467-
9876.2005.00474.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.
1467-9876.2005.00474.x.

[187] Guilhem Sommeria-Klein, Lucie Zinger, Eric Coissac, Amaia Iribar, Heidy Schimann,
Pierre Taberlet, and Jérôme Chave. Latent dirichlet allocation reveals spatial and
taxonomic structure in a dna-based census of soil biodiversity from a tropical forest.
Molecular Ecology Resources, 20:371–386, 3 2020. ISSN 17550998. doi:10.1111/1755-
0998.13109.

[188] Sanvesh Srivastava, Cheng Li, and David B. Dunson. Scalable Bayes via barycenter in
Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346, 2018.

[189] Stephen M. Stigler. The Asymptotic Distribution of the Trimmed Mean. The Annals
of Statistics, 1(3):472 – 477, 1973.

[190] Stephen M. Stigler. The 1988 Neyman Memorial Lecture: A Galtonian Perspective on
Shrinkage Estimators. Statistical Science, 5(1):147–155, 1990.

[191] Rainer Storn and Kenneth Price. Differential evolution-a simple and efficient heuristic
for global optimization over continuous spaces. Journal of global optimization, 11(4):
341, 1997.

[192] Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo simulation of spin-
glasses. Physical Review Letters, 57(21):2607, 1986.

[193] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo. A Unified Framework for De-
Duplication and Population Size Estimation (with Discussion). Bayesian Analysis, 15
(2):633 – 682, 2020.

[194] Alessandro Tarozzi, Jaikishan Desai, and Kristin Johnson. The impacts of microcredit:
Evidence from ethiopia. American Economic Journal: Applied Economics, 7(1):54–89,
January 2015. doi:10.1257/app.20130475. URL https://www.aeaweb.org/articles?id=
10.1257/app.20130475.

[195] Y W Teh and D. Görür. Indian buffet processes with power-law behavior. In Advances
in Neural Information Processing Systems, 2009.

[196] Y W Teh, M I Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet Processes.
Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[197] Y W Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian
buffet process. In International Conference on Artificial Intelligence and Statistics,
2007.
[198] R. Thibaux and M I Jordan. Hierarchical beta processes and the Indian buffet process.
In International Conference on Artificial Intelligence and Statistics, 2007.
[199] Zachary M. Thomas, Steven N. MacEachern, and Mario Peruggia. Reconciling curvature
and importance sampling based procedures for summarizing case influence in bayesian
models. Journal of the American Statistical Association, 113:1669–1683, 10 2018. ISSN
1537274X. doi:10.1080/01621459.2017.1360777.
[200] Michalis Titsias. The infinite gamma-poisson feature model. In Advances in Neural
Information Processing Systems, 2008.
[201] John W. Tukey and Donald H. McLaughlin. Less vulnerable confidence and significance
procedures for location based on a single sample: Trimming/winsorization 1. Sankhyā:
The Indian Journal of Statistics, Series A (1961-2002), 25(3):331–352, 1963.
[202] Angelika van der Linde. Local influence on posterior distributions under multiplicative
modes of perturbation. Bayesian Analysis, 2:319–332, 2007. URL http://www.math.
uni-bremen.de/~avdl/.
[203] A W van der Vaart. Asymptotic Statistics. University of Cambridge,
1998. URL https://www.cambridge.org/core/books/asymptotic-statistics/
A3C7DAD3F7E66A1FA60E9C8FE132EE1D.
[204] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy,
David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan
Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman,
Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J
Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef
Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M.
Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0
Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python.
Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.
[205] M. J. Wainwright and M I Jordan. Graphical Models, Exponential Families, and
Variational Inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305,
2008.
[206] Chong Wang, John Paisley, and David Blei. Online variational inference for the
hierarchical Dirichlet process. In International Conference on Artificial Intelligence
and Statistics, 2011.
[207] Simon N. Wood. Thin plate regression splines. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 65(1):95–114, 2003.
doi:https://doi.org/10.1111/1467-9868.00374. URL https://rss.onlinelibrary.wiley.com/
doi/abs/10.1111/1467-9868.00374.
[208] Kai Xu, Tor Erlend Fjelde, Charles Sutton, and Hong Ge. Couplings for multinomial
Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and
Statistics, 2021.

[209] Amit Zeisel, Ana B. Muñoz-Manchado, Simone Codeluppi, Peter Lönnerberg, Gioele
La Manno, Anna Juréus, Sueli Marques, Hermany Munguba, Liqun He, and Christer
Betsholtz. Cell types in the mouse cortex and hippocampus revealed by single-cell
RNA-seq. Science, 347(6226):1138–1142, 2015.

[210] Mingyuan Zhou, Haojun Chen, Lu Ren, Guillermo Sapiro, Lawrence Carin, and John W.
Paisley. Non-parametric Bayesian dictionary learning for sparse image representations.
In Advances in Neural Information Processing Systems. 2009.

[211] Mingyuan Zhou, Lauren Hannah, David Dunson, and Lawrence Carin. Beta-negative
binomial process and Poisson factor analysis. In International Conference on Artificial
Intelligence and Statistics, 2012.
