Toward Faster Methods in Bayesian Unsupervised Learning
by
Tin D. Nguyen
B.S.E. Operations Research and Financial Engineering, Princeton University, 2018
S.M. Electrical Engineering and Computer Science, MIT, 2020
DOCTOR OF PHILOSOPHY
ABSTRACT
Many data analyses can be seen as discovering a latent set of traits in a population.
For example, what are the themes, or topics, behind Wikipedia documents? To encode
structural information in these unsupervised learning problems, such as the hierarchy among
words, documents, and latent topics, one can use Bayesian probabilistic models. The
application of Bayesian unsupervised learning faces three computational challenges. Firstly,
existing works aim to speed up Bayesian inference via parallelism, but these methods
struggle in Bayesian unsupervised learning due to the so-called “label-switching problem”.
Secondly, in Bayesian nonparametrics for unsupervised learning, computers cannot learn
the distribution over the countable infinity of random variables posited by the model in
finite time. Finally, to assess the generalizability of Bayesian conclusions, we might want to
detect the posterior’s sensitivity to the removal of a very small amount of data, but checking
this sensitivity directly takes an intractably long time. My thesis addresses the first two
computational challenges, and establishes a first step in tackling the last one. I utilize a
known representation of the probabilistic model to evade the label-switching problem: when
parallel processors are available, I derive fast estimates of Bayesian posteriors in unsupervised
learning. Generalizing existing works and providing more guidance, I derive accurate and easy-
to-use finite approximations for infinite-dimensional priors. Lastly, I assess generalizability in
supervised Bayesian models, which can be seen as a precursor to the models used in Bayesian
unsupervised learning. In supervised models, I develop and test a computationally efficient
tool to detect sensitivity regarding data removals for analyses based on MCMC.
I am grateful to my advisor, Tamara, for her guidance, support, and encouragement. I am
also grateful to my committee members, Stefanie and Ashia, for their feedback and advice. I
dedicate this thesis to my mom and dad. They have taught me to be kind, to work hard,
and to do a good job for its own sake.
Contents
Title page 1
Abstract 3
List of Figures 11
Introduction 15
Appendices 35
1.A Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.B Functions of Interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.C Unbiasedness Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.D Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.E Label-Switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.E.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.E.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.F Trimming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.G Additional Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.G.1 Target Distributions And Gibbs Conditionals . . . . . . . . . . . . . 43
1.G.2 General Markov Chain Settings . . . . . . . . . . . . . . . . . . . . . 45
1.G.3 Datasets Preprocessing, Hyperparameters, Dataset-Specific Markov
Chain Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.G.4 Visualizing Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . 47
1.H All Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.H.1 gene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.H.2 synthetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.H.3 seed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.H.4 abalone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.H.5 k-regular . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.I Metric Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.I.1 Definition Of Variation Of Information Metric . . . . . . . . . . . . . 49
1.I.2 Impact Of Metric On Meeting Time . . . . . . . . . . . . . . . . . . . 51
1.J Extension to Split-Merge Sample . . . . . . . . . . . . . . . . . . . . . . . . 51
1.K More RMSE Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.K.1 Different Functions Of Interest . . . . . . . . . . . . . . . . . . . . . . 52
1.K.2 Different Minimum Iteration (m) Settings . . . . . . . . . . . . . . . 52
1.K.3 Different Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.K.4 Different DPMM Hyperparameters . . . . . . . . . . . . . . . . . . . 53
1.L More Meeting Time Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.M Estimates of Predictive Density . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.M.1 Data, Target Model, And Definition Of Posterior Predictive . . . . . 54
1.M.2 Estimates Of Posterior Predictive Density . . . . . . . . . . . . . . . 55
1.M.3 Posterior Predictives Become More Like the True Data-Generating Density 56
2.6.4 Discount estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.6.5 Dispersion estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Appendices 93
2.A Additional examples of AIFA construction . . . . . . . . . . . . . . . . . . . 93
2.B Proofs of AIFA convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.B.1 AIFA converges to CRM in distribution . . . . . . . . . . . . . . . . . 95
2.B.2 Differentiability of smoothed indicator . . . . . . . . . . . . . . . . . 100
2.B.3 Normalized AIFA EPPF converges to NCRM EPPF . . . . . . . . . . 101
2.C Marginal processes of exponential CRMs . . . . . . . . . . . . . . . . . . . . 103
2.D Admissible hyperparameters of extended gamma process . . . . . . . . . . . 106
2.E Technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.E.1 Concentration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.E.2 Total variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
2.E.3 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.F Verification of upper bound’s assumptions for additional examples . . . . . . 121
2.F.1 Gamma–Poisson with zero discount . . . . . . . . . . . . . . . . . . . 121
2.F.2 Beta–negative binomial with zero discount . . . . . . . . . . . . . . . 123
2.G Proofs of CRM bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.G.1 Upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.G.2 Lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
2.H DPMM results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.H.1 Upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.H.2 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.I Proofs of DP bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2.I.1 Upper bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
2.I.2 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
2.J More ease-of-use results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
2.J.1 Conceptual results (continued.) . . . . . . . . . . . . . . . . . . . . . 149
2.J.2 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
2.K Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
2.K.1 Image denoising using the beta–Bernoulli process . . . . . . . . . . . 153
2.K.2 Topic modelling with the modified HDP . . . . . . . . . . . . . . . . 154
2.K.3 Comparing IFAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
2.K.4 Beta process hyperparameter estimation . . . . . . . . . . . . . . . . 158
2.K.5 Dispersion estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
2.L Additional experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
2.L.1 Denoising other images . . . . . . . . . . . . . . . . . . . . . . . . . . 163
2.L.2 Effect of AIFA tuning hyperparameters . . . . . . . . . . . . . . . 164
2.L.3 Estimation of mass and concentration . . . . . . . . . . . . . . . . . . 164
3 Sensitivity of MCMC to Small-Data Removals 167
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
3.1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.2.1 Bayesian data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.2.2 Drop-data non-robustness . . . . . . . . . . . . . . . . . . . . . . . . 170
3.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
3.3.1 Taylor series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
3.3.2 Estimating the influence . . . . . . . . . . . . . . . . . . . . . . . . . 175
3.3.3 Confidence intervals for AMIP . . . . . . . . . . . . . . . . . . . . . . 176
3.3.4 Putting everything together . . . . . . . . . . . . . . . . . . . . . . . 179
3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
3.4.1 Estimate coverage of confidence interval for AMIP . . . . . . . . . . . 180
3.4.2 Estimate coverage of confidence intervals for sum-of-influence . . . . . 180
3.4.3 Re-running MCMC on interpolation path . . . . . . . . . . . . . . . . 181
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
3.5.1 Linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
3.5.2 Hierarchical model on microcredit data . . . . . . . . . . . . . . . . . 184
3.5.3 Hierarchical model on tree mortality data . . . . . . . . . . . . . . . 189
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Appendices 197
3.A Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
3.A.1 Accuracy of first-order approximation . . . . . . . . . . . . . . . . . . 197
3.A.2 Estimator properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
3.B Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
3.B.1 Taylor series proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
3.B.2 First-order accuracy proofs . . . . . . . . . . . . . . . . . . . . . . . . 203
3.B.3 Consistency and asymptotic normality proofs . . . . . . . . . . . . . 207
3.C Additional Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . 214
3.C.1 Linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
3.C.2 Hierarchical model for microcredit data . . . . . . . . . . . . . . . . . 214
3.C.3 Hierarchical model for tree mortality data . . . . . . . . . . . . . . . 215
Conclusion 217
References 235
List of Figures
1.1 Lower error at high process count using our estimator (blue) versus using naive
parallelism (red). For details, see Section 1.5.2. . . . . . . . . . . . . . . . . . 20
1.2 Top row and bottom row give results for gene and k-regular, respectively.
The first two columns show that coupled chains provide better point estimates
than naive parallelism. The third column shows that confidence intervals based
on coupled chains are better than those from naive parallelism. The fourth
column shows that OT coupling meets in less time than label-based couplings. 31
1.3 Coupled-chain estimates have large outliers. Meanwhile, naive parallelism
estimates have substantial bias that does not go away with replication. . . . 32
1.F.1 Trimmed mean has better RMSE than sample mean on Example 1.F.1. Left
panel plots RMSE versus J. Right panel gives boxplots for J = 1000. . . . . . 44
1.G.1 Visualizing synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.H.1 Results on gene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.H.2 Results on synthetic. Figure legends are the same as Figure 1.H.1. The
results are consistent with Figure 1.2. . . . . . . . . . . . . . . . . . . . . . . 49
1.H.3 Results on seed. Figure legends are the same as Figure 1.H.1. The results are
consistent with Figure 1.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.H.4 Results on abalone. Similar to Figure 1.2, coupled chains perform better
than naive parallelism with more processes, and our coupling yields smaller
meeting times than label-based couplings. See Figure 1.H.5 for the performance
of trimmed estimators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.H.5 Effect of trimming amount on abalone. . . . . . . . . . . . . . . . . . . . 52
1.H.6 Results on k-regular. Figure legends are the same as Figure 1.H.1. . . . . 53
1.I.1 Hamming and VI metric induce similar meeting time . . . . . . . . . . . . . 54
1.J.1 Split-merge results on gene . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
1.J.2 Split-merge results on synthetic . . . . . . . . . . . . . . . . . . . . . . . . 56
1.K.1 Co-clustering results for clustering data sets. . . . . . . . . . . . . . . . . . 59
1.K.2 Impact of different m on the RMSE. The first two panels are LCP estimation
for seed. The last two panels are CC(0, 1) estimation for synthetic. . . . . 60
1.K.3 RMSE and intervals for gene on k-means initialization. . . . . . . . . . . . 60
1.K.4 The bias in naive parallel estimates is a function of the DPMM hyperparameters. 60
1.L.1 Meeting time under OT coupling is better than alternative couplings on
Erdos–Renyi graphs, indicated by the fast decrease of the survival functions. 61
1.M.1 Posterior predictive density for different numbers of observations N . . . . . 61
2.6.1 AIFA and TFA denoised images have comparable quality. (a) The noiseless
image. (b) The corrupted image. (c,d) Sample denoised images from finite
models with K = 60. We report PSNR (in dB) with respect to the noiseless
image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.6.2 (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K.
Error bars depict 1-standard-deviation ranges across 5 trials. (b,c) How PSNR
evolves during inference across 10 trials, with 5 starting from cold starts and
5 from warm starts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.6.3 (a) Test log-likelihood (testLL) as a function of approximation level K. Error
bars show 1 standard deviation across 5 trials. (b,c) TestLL change during
inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.6.4 (a) The left panel shows the average predictive log-likelihood of the AIFA (blue)
and BFRY IFA (red) as a function of the approximation level K; the average is
across 10 trials with different random seeds for the stochastic optimizer. The
right panel shows highest predictive log-likelihood across the same 10 trials.
(b) The panels are analogous to (a), except the GenPar IFA is in red. . . . . 89
2.6.5 (a) We estimate the discount by maximizing the marginal likelihood of the
AIFA (left) or the full process (right). The solid blue line is the median of the
estimated discounts, while the lower and upper bounds of the error bars are
the 20% and 80% quantiles. The black dashed line is the ideal value of the
estimated discount, equal to the ground-truth discount. (b) In each panel, the
solid red line is the average log of negative log marginal likelihood (LNLML)
across batches. The light red region depicts two standard errors in either
direction from the mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.6.6 Blue histograms show posterior density estimates for τ from MCMC draws.
The ground-truth τ (solid red line) is 0.7 in the overdispersed case (upper
row) and 1.5 in the underdispersed case (lower row). The threshold τ = 1
(dashed black line) marks the transition from overdispersion (τ < 1.0) to
underdispersion (τ > 1.0). The percentile in each panel’s title is the percentile
where the ground truth τ falls in the posterior draws. The approximation size
K of the AIFA increases in the plots from left to right. . . . . . . . . . . . . 91
2.L.1 Sample AIFA and TFA denoised images have comparable quality. (a) shows
the noiseless image. (b) shows the corrupted image. (c,d) are sample denoised
images from finite models with K = 60. PSNR (in dB) is computed with
respect to the noiseless image. . . . . . . . . . . . . . . . . . . . . . . . . . . 163
2.L.2 (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K.
The error bars reflect randomness in both initialization and simulation of the
conditionals across 5 trials. AIFA denoising quality improves as K increases,
and the performance is similar to TFA across approximation levels. Moreover,
the TFA- and AIFA-denoised images are very similar: the PSNR ≈ 50 for TFA
versus AIFA, whereas PSNR < 35 for TFA or AIFA versus the original image.
(b,c) Show how PSNR evolves during inference. The “warm-start” lines
indicate that the AIFA-inferred (respectively, TFA-inferred) parameters are
excellent initializations for TFA (respectively, AIFA) inference. . . . . . . . . 164
2.L.3 Sample AIFA and TFA denoised images have comparable quality. (a) shows
the noiseless image. (b) shows the corrupted image. (c,d) are sample denoised
images from finite models with K = 60. PSNR (in dB) is computed with
respect to the noiseless image. . . . . . . . . . . . . . . . . . . . . . . . . . . 165
2.L.4 (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K.
The error bars reflect randomness in both initialization and simulation of the
conditionals across 5 trials. AIFA denoising quality improves as K increases,
and the performance is similar to TFA across approximation levels. Moreover,
the TFA- and AIFA-denoised images are very similar: the PSNR ≈ 47 for TFA
versus AIFA, whereas PSNR < 31 for TFA or AIFA versus the original image.
(b,c) Show how PSNR evolves during inference. The “warm-start” lines
indicate that the AIFA-inferred (respectively, TFA-inferred) parameters are
excellent initializations for TFA (respectively, AIFA) inference. . . . . . . . . 165
2.L.5 The predictive log-likelihood of AIFA is not sensitive to different settings of
a and bK . Each color corresponds to a combination of a and bK . (a) is the
average across 5 trials with different random seeds for the stochastic optimizer,
while (b) is the best across the same trials. . . . . . . . . . . . . . . . . . . . 166
2.L.6 In fig. 2.L.6a, we estimate the mass by maximizing the marginal likelihood
of the AIFA (left panel) or the full process (right panel). The solid blue line
is the median of the estimated masses, while the lower and upper bounds of
the error bars are the 20% and 80% quantiles. The black dashed line is the
ideal value of the estimated mass, equal to the ground-truth mass. The key
for fig. 2.L.6b is the same, but for concentration instead of mass. . . . . . . . 166
3.5.1 (Linear model) Histogram of treatment effect MCMC draws. The blue line
indicates the sample mean. The dashed red line is the zero threshold. The
dotted blue lines indicate estimates of the approximate credible interval’s endpoints. 183
3.5.2 (Linear model) Confidence interval and refit. At maximum, we remove 1%
of the data. Each panel corresponds to a target conclusion change: ‘sign’ is
the change in sign, ‘sig’ is change in significance, and ‘both’ is the change in
both sign and significance. Error bars are confidence interval for refit after
removing the most extreme data subset. Each ‘x’ is the refit after removing
the proposed data and re-running MCMC. The dotted blue line is the fit on
the full data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
3.5.3 (Linear model) Monte Carlo estimate of AMIP confidence interval’s coverage.
Each panel corresponds to a target conclusion change. The dashed line is the
nominal level η = 0.95. The solid line is the sample mean of the indicator
variable for the event that ground truth is contained in the confidence interval.
The error bars are confidence intervals for the population mean of these indicators. 184
3.5.4 (Linear model) Monte Carlo estimate of sum-of-influence confidence interval’s
coverage. Each panel corresponds to a target conclusion change. The dashed
line is the nominal level η = 0.95. The solid line is the sample mean of the
indicator variable for the event that ground truth is contained in the confidence
interval, and error bars are confidence intervals for the population mean of
these indicators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
3.5.5 (Linear model) Quality of the linear approximation. Each panel corresponds
to a target conclusion change. The solid blue line is the full-data fit. The
horizontal axis is the distance from the weight that represents the full data.
We plot both the refit from rerunning MCMC and the linear approximation of
the refit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
3.5.6 (Hierarchical model for microcredit) Histogram of treatment effect MCMC
draws. See the caption of fig. 3.5.1 for the meaning of the distinguished vertical
lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.5.7 (Hierarchical model for microcredit) Confidence interval and refit. See the
caption of fig. 3.5.2 for meaning of annotated lines. . . . . . . . . . . . . . . 188
3.5.8 (Hierarchical model for microcredit) Monte Carlo estimate of AMIP confidence
interval’s coverage. See the caption of fig. 3.5.3 for the meaning of the error
bars and the distinguished lines. . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.5.9 (Hierarchical model for microcredit) Monte Carlo estimate of sum-of-influence
confidence interval’s coverage. See the caption of fig. 3.5.4 for the meaning of
the panels and the distinguished lines. . . . . . . . . . . . . . . . . . . . . . 189
3.5.10 (Hierarchical model for microcredit) Quality of linear approximation. See the
caption of fig. 3.5.5 for the meaning of the panels and the distinguished lines. 189
3.5.11 (Hierarchical model for tree mortality) Histogram of slope MCMC draws. See
the caption of fig. 3.5.1 for the meaning of the distinguished vertical lines. . . 191
3.5.12 (Hierarchical model for tree mortality) Confidence interval and refit. See the
caption of fig. 3.5.2 for the meaning of the panels and the distinguished lines. 192
3.5.13 (Hierarchical model on subsampled tree mortality) Histogram of effect MCMC
draws. See fig. 3.5.1 for the meaning of the distinguished lines. . . . . . . . . 192
3.5.14 (Hierarchical model on subsampled tree mortality) Confidence interval and refit.
See the caption of fig. 3.5.2 for the meaning of the panels and the distinguished
lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
3.5.15 (Hierarchical model on subsampled tree mortality) Monte Carlo estimate of
coverage of confidence interval for ∆(α). See fig. 3.5.3 for the meaning of the
panels and the distinguished lines. . . . . . . . . . . . . . . . . . . . . . . . . 193
3.5.16 (Hierarchical model on subsampled tree mortality) Monte Carlo estimate of
coverage of confidence interval for sum-of-influence. See fig. 3.5.4 for the
meaning of the panels and the distinguished lines. . . . . . . . . . . . . . . . 194
3.5.17 (Hierarchical model on subsampled tree mortality) Quality of linear approx-
imation. See fig. 3.5.5 for the meaning of the panels and the distinguished
lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Introduction
Discovering the topics behind Wikipedia texts is just one example in which researchers are
interested in latent traits. Other examples include recovering unique speakers across audio
recordings of many meetings, group membership from network data, driver archetypes from
car sensors, co-occurrence of species from environmental DNA, and themes from question-
and-answer online boards [43, 63, 81, 123, 187]. When studying latent traits, we might
model the data hierarchically. For example, one might conceptualize that the words in a
given document are exchangeable, and that different documents are independent realizations
of some underlying distribution. One might use Bayesian probabilistic models to codify
such a hierarchy: they contain conditional independence statements that encode hierarchical
structure in the data-generating process.
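As a toy illustration (the sizes and distributions below are mine, not a model from this thesis), a finite Gaussian mixture shows how such a hierarchy is encoded by conditional independence: observations are independent given their latent trait assignments, which are in turn independent given the global proportions.

```python
import numpy as np

# Toy sketch of the hierarchy: global trait parameters at the top, a
# discrete per-datum trait assignment in the middle, observations at
# the bottom. Sizes and distributions are illustrative only.
rng = np.random.default_rng(0)

K, N = 3, 100                          # number of latent traits, observations
weights = rng.dirichlet(np.ones(K))    # global mixing proportions
means = rng.normal(0.0, 5.0, size=K)   # one latent parameter per trait

z = rng.choice(K, size=N, p=weights)   # assignment of each datum to a trait
x = rng.normal(means[z], 1.0)          # observations, independent given z
```

The posterior over `(weights, means, z)` given `x` then mixes real-valued parameters with discrete assignments, which is the source of the computational difficulties discussed next.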
In general, the application of Bayesian unsupervised learning to data is computationally
intensive. For instance, to draw inference, we need to estimate the posterior distribution.
The posterior is a complex probability distribution over both real-valued parameters, such
as the topics, and discrete parameters, such as the assignment of words or documents to
topics. While recent advances in Bayesian computation, such as Carpenter et al. [39], provide
accurate and fast posterior approximations in models with only real-valued parameters, they
do not solve the inference problem for Bayesian unsupervised learning. Roughly speaking, to
approximate the posterior in such models, there are two options. One might spend a long
time for an accurate estimate, using slow Markov chain Monte Carlo (MCMC) algorithms
such as Gibbs sampling. Or one might spend a little time to get an estimate without clear
accuracy guarantees, using optimization methods such as variational inference [23, 115].
In this thesis, I identify three specific computational challenges. First, while past works
[92, 208] have investigated how to improve the speed of MCMC without degrading accuracy
by using parallelism, their techniques struggle in Bayesian unsupervised learning, due to the
“label-switching” problem: roughly speaking, it can be difficult to usefully combine results
from different processors when they may or may not be handling semantically equivalent
relabelings of traits. Second, when Bayesian nonparametric (BNP) models are used because
we expect the number of latent traits present in a dataset to grow with the number of
observations, computers cannot store an infinity of objects, or learn the distribution of an
infinite collection in finite time. Finally, if a practitioner is interested in checking whether
their conclusions generalize beyond collected data, they might want to quantify the posterior’s
sensitivity to the removal of a very small amount of data. But checking this sensitivity
directly takes an intractably long time: the brute-force approach, which enumerates all
small subsets and re-analyzes the data after each removal, has to search over a combinatorially
large number of subsets.
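To convey the scale (with illustrative numbers of my own choosing), consider dropping just 5 of 1,000 data points; counting the candidate subsets already rules out brute force:

```python
from math import comb

# Number of ways to drop k of N data points: C(N, k).
# Each subset would require a full re-run of the analysis.
N, k = 1000, 5
print(comb(N, k))  # 8,250,291,250,200 subsets -- over 8 trillion re-runs
```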
My thesis addresses the first two computational challenges, and establishes a first step in
tackling the last one. I use a representation of the unsupervised learning model that avoids the
label-switching problem: while this representation is not new, my work is the first to utilize it
to improve MCMC speed while maintaining accuracy. I build upon the existing literature on
finite approximations of BNP to derive an accurate and convenient-to-use approximation of
infinite-dimensional priors. Conceptualizing supervised Bayes as a first step towards Bayesian
unsupervised learning, I develop and test a fast tool to detect sensitivity with respect to data
removals for analyses based on MCMC estimates of supervised Bayes posteriors. Below, I
highlight the key findings of my thesis.
This work, which is in collaboration with Jonathan Huggins, Lorenzo Masoero, Lester Mackey,
and Tamara Broderick, has been published as Nguyen et al. [152]. For more detail, see
Chapter 2.
Chapter 1
Figure 1.1: Lower error at high process count using our estimator (blue) versus using naive
parallelism (red). For details, see Section 1.5.2.
1.1 Introduction
Markov chain Monte Carlo (MCMC) is widely used in applications for exploring distributions
over clusterings, or partitions, of data. For instance, Prabhakaran et al. [169] use MCMC to
approximate a Bayesian posterior over clusters of gene expression data for “discovery and
characterization of cell types”; Chen et al. [42] use MCMC to approximate the number of
k-colorings of a graph; and DeFord et al. [49] use MCMC to identify partisan gerrymandering
via partitioning of geographical units into districts. An appealing feature of MCMC for
many applications is that it yields asymptotically exact expectations in the infinite-time limit.
However, real-life samplers must always be run in finite time, and MCMC mixing is often
prohibitively slow in practice. While this slow mixing has led some practitioners to turn
to other approximations such as variational Bayes [21], these alternative methods can yield
arbitrarily poor approximations of the expectation of interest [87].
A different approach is to speed up MCMC, e.g. by taking advantage of recent computational
advances. While wall-clock time is often at a premium, modern computing environments
increasingly offer massive parallel processing. For example, institute-level compute clusters
commonly make hundreds of processors available to their users simultaneously
[178]. Recent efforts to enable parallel MCMC on graphics processing units [119] offer to
expand parallelism further, with modern commodity GPUs providing over ten thousand cores.
A naive approach to exploiting parallelism is to run MCMC separately on each processor; we
illustrate this approach on a genetics dataset (gene) in Figure 1.1 with full experimental
details in Section 1.5. One might either directly average the resulting estimates across
processors (red solid line in Figure 1.1) or use a robust averaging procedure (red dashed line
in Figure 1.1). Massive parallelism can be used to reduce variance of the final estimate but
does not mitigate the problem of bias, so the final estimate does not improve substantially as
the number of processes increases.
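The bias floor is easy to see in a toy simulation (illustrative numbers, not data from this chapter): each "processor" returns the truth plus a fixed bias plus independent noise.

```python
import numpy as np

# Averaging across more processors shrinks the noise term, but the
# error floor stays at the per-chain bias. Numbers are illustrative.
rng = np.random.default_rng(1)
true_value, bias = 2.0, 0.5

for n_proc in (10, 100, 10_000):
    estimates = true_value + bias + rng.normal(0.0, 1.0, n_proc)
    error = abs(estimates.mean() - true_value)
    print(n_proc, round(error, 3))  # error plateaus near the bias of 0.5
```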
Recently, Jacob et al. [93] built on the work of Glynn and Rhee [74] to eliminate bias in
MCMC with a coupling. The basic idea is to cleverly set up dependence between two MCMC
chains so that they are still practical to run and also meet exactly at a random but finite
time. After meeting, these coupled chains can be used to compute an unbiased estimate of
the expectation of interest. So arbitrarily large reductions in the estimate’s variance due to
massive parallelism translate directly into arbitrarily large reductions in total error. Since a
processor’s computation concludes after the chains meet, a useful coupling relies heavily on
setting up coupled chains that meet quickly.
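A minimal sketch of that estimator on a two-state Markov chain, using a common-random-number coupling, may make the construction concrete. This is purely illustrative: the thesis instead couples chains over partitions, and all names and sizes below are my own.

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])  # toy transition matrix; stationary dist (4/7, 3/7)

def step(state, u):
    # Inverse-CDF transition driven by a uniform u. Sharing u across the
    # two chains is the coupling; once the chains agree they stay equal.
    return int(u > P[state, 0])

def unbiased_estimate(h, m=10):
    # H = h(X_m) + sum_{t=m+1}^{tau-1} [h(X_t) - h(Y_{t-1})], where X
    # leads Y by one step and tau is the first t with X_t == Y_{t-1}.
    x, y = int(rng.integers(2)), int(rng.integers(2))  # X_0, Y_0
    x = step(x, rng.random())                          # X_1
    X, Y = [x], [y]                                    # X[t-1]=X_t, Y[t-1]=Y_{t-1}
    tau = 1 if x == y else None
    while tau is None or len(X) < m:
        u = rng.random()
        x, y = step(x, u), step(y, u)
        X.append(x)
        Y.append(y)
        if tau is None and x == y:
            tau = len(X)
    est = h(X[m - 1])                                  # h(X_m)
    for t in range(m + 1, tau):                        # bias-correction terms
        est += h(X[t - 1]) - h(Y[t - 1])
    return est

# Each copy is exactly unbiased for the stationary expectation, so
# averaging independent copies has no burn-in bias to overcome.
avg = np.mean([unbiased_estimate(lambda s: s) for _ in range(4000)])
```

Here averaging drives the error to zero because each replicate is unbiased for E_pi[h] (equal to 3/7 for h(s) = s under this toy chain); with a biased estimator, as in naive parallelism, no amount of averaging would do so.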
Jacob et al. [93] did not consider MCMC over partitions in particular and Glynn and Rhee
[74] did not work on MCMC. But there is existing work on couplings applied to partitions in
other contexts that can be adapted into the Jacob et al. [93] framework. For instance, [99]
uses maximal couplings on partition labelings to prove convergence rates for graph coloring,
and Gibbs [69] uses a common random number coupling for two-state Ising models. Though
[99] was theoretical rather than practical and Gibbs [69] did not apply to general partition
models, we can adapt the Jacob et al. [93] setup in a straightforward manner to use either
coupling scheme. While this adaptation ensures asymptotically-unbiased MCMC samples,
we will see (Section 1.5.3) that both schemes exhibit slow meeting times in practice. We
attribute this issue to the label-switching problem, which is well-known for plaguing MCMC
over partitions [98]. In particular, many different labelings correspond to the same partition.
In the case of couplings, two chains may nearly agree on the partition but require many
iterations to change label assignments, so the coupling is unnecessarily slow to meet.
Our main contribution, then, is to propose and analyze a practical coupling that uses the
unbiasedness of the [93] framework but operates directly in the true space of interest – i.e.,
the space of partitions – to thereby exhibit fast meeting times. In particular, we define an
optimal transport (OT) coupling in the partition space (Section 1.3). For clustering models,
we prove that our coupling produces unbiased estimates (Section 1.4.1). We provide a big-O
analysis to support the fast meeting times of our coupling (Section 1.4.2). We empirically
demonstrate the benefits of our coupling on a simulated analysis; on Dirichlet process mixture
models applied to real genetic, agricultural, and marine life data; and on a graph coloring
problem. We show that, for a fixed wall time, our coupling provides much more accurate
estimates and confidence intervals than naive parallelism (Section 1.5.2). And we show that
our coupling meets much more quickly than standard label-based couplings for partitions
(Section 1.5.3). Our code is available at https://github.com/tinnguyen96/partition-coupling.
Related work. Couplings of Markov chains have a long history in MCMC. But they
either have primarily been a theoretical tool, do not provide guarantees of consistency in
the limit of many processes, or are not generally applicable to Markov chains over partitions
(Section 1.A). Likewise, much previous work has sought to utilize parallelism in MCMC. But
this work has focused on splitting large datasets into small subsets and running MCMC
separately on each subset. Here, by contrast, our distribution of interest is over partitions of the
data; combining partitions learned separately on multiple processors seems to face much the
same difficulties as the original problem (Section 1.A). Xu et al. [208] have also used OT
techniques within the Jacob et al. [93] framework, but their focus was continuous-valued
random variables. For partitions, OT techniques might most straightforwardly be applied to
the label space – and we expect would fare poorly, like the other label-space couplings in
Section 1.5.3. Our key insight is to work directly in the space of partitions.
1.2 Setup
Before describing our method, we first review random partitions, set up Markov chain Monte
Carlo for partitions (with an emphasis on Gibbs sampling), and review the Jacob et al. [93]
coupling framework.
As an example, consider a Bayesian cluster analysis for N data points {W_n}_{n=1}^N, with
W_n ∈ R^D. A common generative procedure uses a Dirichlet process mixture model (DPMM)
and conjugate Gaussian cluster likelihoods, with hyperparameters α > 0, µ0 ∈ R^D, and
Σ0, Σ1 positive definite D × D matrices. First draw Π = π with probability
α^{|π|} ∏_{A∈π}(|A| − 1)! / [α(α + 1) · · · (α + N − 1)]. Then draw cluster centers
µA ∼ N(µ0, Σ0) i.i.d. for A ∈ Π, and finally draw each data point Wn independently from
N(µA, Σ1), where A is the element of Π containing n.
With this notation, we can write the leave-out conditional distributions of the Gibbs
sampler as pΠ|Π(−n). In particular, take a random partition X. Suppose X(−n) has K − 1
elements. Then the nth data point can either be added to an existing element or form a
new element in the partition. Each of these K options forms a new partition; call the new
partitions {π^k}_{k=1}^K. It follows that there exist a_k ≥ 0 such that

    pΠ|Π(−n)(· | X(−n)) = Σ_{k=1}^K a_k δ_{π^k}(·),    (1.1)
where ℓ is the burn-in length and m sets a minimum number of iterations [93, Equation 2].
ℓ and m are hyperparameters that impact the runtime and variance of Hℓ:m ; for instance,
smaller m is typically associated with smaller runtimes but larger variance. Jacob et al. [93,
Section 6] recommend setting ℓ to be a large quantile of the meeting time and m as a multiple
of ℓ. We follow these recommendations in our work.
One interpretation of Equation (1.2) is as the usual MCMC estimate plus a bias correction.
Since Hℓ:m is unbiased, a direct average of many copies of Hℓ:m computed in parallel can be
made to have arbitrarily small error (for estimating H ∗ ). It remains to apply the idea from
Equation (1.2) to partition-valued chains.
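For concreteness, the time-averaged estimator of Jacob et al. [93, Equation 2] can be sketched as follows; the list-based indexing convention below is illustrative, not a reproduction of our implementation:

```python
def h_ell_m(h, X, Y, ell, m, tau):
    """Unbiased coupled-chain estimator: an MCMC average plus a bias correction.

    X and Y are lists of chain states; tau is the meeting time, after which the
    chains are faithful (X[t] == Y[t - 1] for all t >= tau).
    """
    mcmc_avg = sum(h(X[t]) for t in range(ell, m + 1)) / (m - ell + 1)
    bias_correction = sum(
        min(1.0, (t - ell) / (m - ell + 1)) * (h(X[t]) - h(Y[t - 1]))
        for t in range(ell + 1, tau)
    )
    return mcmc_avg + bias_correction
```

When the chains have already met by iteration ℓ, the correction sum is empty and the estimator reduces to the usual MCMC average.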
1.2.4 Couplings
To create two chains of partitions that evolve together, we will need a joint distribution over
partitions from both chains that respects the marginals of each chain. To that end, we define
a coupling.
Definition 1.2.1. A coupling γ of two discrete distributions, Σ_{k=1}^K a_k δ_{π^k}(·) and
Σ_{k′=1}^{K′} b_{k′} δ_{ν^{k′}}(·), is a distribution over pairs (π^k, ν^{k′}), with atom size
u^{k,k′} at (π^k, ν^{k′}), that satisfies the marginal constraints

    Σ_k u^{k,k′} = b_{k′},    Σ_{k′} u^{k,k′} = a_k,    0 ≤ u^{k,k′} ≤ 1.
Given a coupling function ψ, Algorithm 2 gives the coupled transition from the current
pair of partitions (X, Y) to another pair (X̃, Ỹ). Repeating this algorithm guarantees the
first required property from the Jacob et al. [93] construction in Section 1.2.3: co-evolution of
the two chains with correct marginal distributions. It remains to show that we can construct
an appropriate coupling function and that the chains meet (quickly).
Observe that d(π, ν) is zero when π = ν. More generally, we can construct a graph from
a partition by treating the indices in [N ] as vertex labels and assigning any two indices in
the same partition element to share an edge; then d/2 is equal to the Hamming distance
between the adjacency matrices implied by π and ν [145, Theorems 2–3]. The principal trait
of d for our purposes is that d steadily increases as π and ν become more dissimilar. In
Section 1.I, we discuss other potential metrics and show that an alternative with similar
qualitative behavior yields essentially equivalent empirical results.
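As a concrete sketch, d can be computed directly from block sizes and pairwise intersection sizes, a form consistent with the incremental updates in Equation (1.10); the function name is illustrative:

```python
def partition_distance(pi, nu):
    """d(pi, nu): sum of squared block sizes of each partition, minus twice the
    squared sizes of all pairwise block intersections. Zero exactly when the
    partitions agree, and it grows as the partitions become more dissimilar."""
    blocks_pi = [set(A) for A in pi]
    blocks_nu = [set(B) for B in nu]
    return (sum(len(A) ** 2 for A in blocks_pi)
            + sum(len(B) ** 2 for B in blocks_nu)
            - 2 * sum(len(A & B) ** 2 for A in blocks_pi for B in blocks_nu))
```

On the example partitions of Section 1.E.2, this reproduces the tabulated values, e.g. d(ν1, µ1) = 16 and d(ν3, µ3) = 8.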
In practice, any standard optimal transport¹ solver can be used in ψOT, and we discuss
our particular choice in more detail in Section 1.4.2. To prove unbiasedness of a coupling
(Theorem 1.4.1), it is convenient to ensure that every joint setting of (X, Y) is reachable
from every other joint setting in the sampler. As we discuss after Theorem 1.4.1 and in
Section 1.C, adding a small nugget term to the coupling function accomplishes this goal. To
that end, define the independent coupling ψind to have atom size u^{k,k′} = a_k b_{k′} at (π^k, ν^{k′}).
Let η ∈ (0, 1). Then our final coupling function ψηOT = ψηOT(pΠ, n, X, Y) equals

    ψηOT(X, Y) = ψOT(X, X) if X = Y, and (1 − η) ψOT(X, Y) + η ψind(X, Y) otherwise,    (1.6)

where we elide the dependence on pΠ, n for readability. In practice, we set η to 10⁻⁵, so the
behavior of ψηOT is dominated by ψOT.
As a check, notice that when two chains first meet, the behavior of ψηOT reverts to that of
ψ OT . Since there is a coupling with expected distance zero, that coupling is chosen as the
minimizer in ψ OT . Therefore, the two chains remain faithful going forward.
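As a sketch of Equation (1.6) at the level of transport plans, with SciPy's generic linear-programming solver standing in for the network simplex solver we use in experiments:

```python
import numpy as np
from scipy.optimize import linprog

def ot_plan(a, b, D):
    """Minimize sum(u * D) over couplings u with row sums a and column sums b."""
    K, Kp = len(a), len(b)
    A_eq = np.zeros((K + Kp, K * Kp))
    for i in range(K):
        A_eq[i, i * Kp:(i + 1) * Kp] = 1.0  # row-sum constraint for atom pi^i
    for j in range(Kp):
        A_eq[K + j, j::Kp] = 1.0            # column-sum constraint for atom nu^j
    res = linprog(np.ravel(D), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None))
    return res.x.reshape(K, Kp)

def nugget_coupling(a, b, D, eta=1e-5):
    """Mixture (1 - eta) * OT plan + eta * independent coupling outer(a, b).

    Every atom pair receives positive mass, while both marginals are preserved."""
    return (1 - eta) * ot_plan(a, b, D) + eta * np.outer(a, b)
```

Both marginals of the mixed plan match a and b exactly, since the OT plan and the independent coupling each satisfy them.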
Algorithm 3 Coupled Gibbs Sweep with Split–Merge Move
Inputs:
pΠ ▷ Target
X and Y ▷ Current partitions
1: procedure CoupledSplitMergeSweep(pΠ, X, Y)
2:   X̃ ← X, Ỹ ← Y
3:   (i, j) ← uniformly random pair of data indices
4:   X̃ ∼ SplitMerge(i, j, X̃)
5:   Ỹ ∼ SplitMerge(i, j, Ỹ)
6:   for n ← 1, N do
7:     γ ← ψηOT(pΠ, n, X̃, Ỹ)
8:     (X̃, Ỹ) ∼ γ
9:   end for
10:  return X̃, Ỹ
11: end procedure
1.4.1 Unbiasedness
Jacob et al. [93, Assumptions 1–3] give sufficient conditions for unbiasedness of Equation (1.2).
We next use these to establish sufficient conditions that Hℓ:m (X, Y ) is unbiased when targeting
a DPMM posterior.
Theorem 1.4.1 (Sufficient Conditions for Unbiased Estimation). Let pΠ be the DPMM
posterior in Section 1.2.1. Assume the following two conditions on ψ.
(1) There exists ϵ > 0 such that for all n ∈ [N] and for all X, Y ∈ PN such that X ≠ Y,
the output γ of the coupling function ψ satisfies

    ∀k ∈ [K] and k′ ∈ [K′], u^{k,k′} ≥ ϵ.    (1.7)

(2) The coupling function ψ is faithful: whenever X = Y, the output γ places all of its mass
on pairs of equal partitions, so the chains update to a common state.
We prove Theorem 1.4.1 in Section 1.C. Our proof exploits the discreteness of the sample
space to ensure chains meet. Condition (1) roughly ensures that any joint state in the product
space is reachable from any other joint state under the Gibbs sweep; we use it to establish
that the meeting time τ has sub-geometric tails. Condition (2) implies that the Markov
chains are faithful once they meet.
Corollary 1.4.1. Let pΠ be the DPMM posterior. The Equation (1.2) estimator using
Algorithm 2 with coupling function ψηOT (pΠ , n, X, Y ) is unbiased for H ∗ .
Proof. It suffices to check Theorem 1.4.1’s conditions. We show ψηOT is faithful at the end
of Section 1.3.2. For a partition, the associated leave-out distributions place positive mass
on all K accessible atoms, so marginal transition probabilities are lower bounded by some
ω > 0. The nugget guarantees each u^{k,k′} ≥ ηω² > 0.
Note that the introduction of the nugget allows us to verify the first condition of The-
orem 1.4.1 is met without relying on properties specific to the optimal transport coupling.
We conjecture that one could analogously show unbiased estimates may be obtained using
couplings of Markov chains defined in the label space by introducing a similar nugget to
transitions on this alternative state space. Crucially, though, we will see in Section 1.5.3 that
our coupling in the partition space exhibits much faster meeting times in practice than these
couplings in the label space.
There are two key computations that must happen in any coupled Gibbs step within a
sweep:

(1) computing the atom sizes a_k, b_{k′} and atom locations π^k, ν^{k′} in the sense of Definition 1.2.1
and Definition 1.3.1; and

(2) computing the pairwise distances d(π^k, ν^{k′}) and solving the optimal transport problem
(Equation (1.4)).
Let β(N, K) represent the time it takes to compute the Gibbs conditional pΠ|Π(−n) for a
partition of size K, and let K̃ represent the size of the largest partition visited in any chain,
across all processors, while the algorithm runs. Then part (1) takes O(β(N, K̃)) time to
run. For single chains, computing atom sizes and locations dominates the compute time; the
computation required is of the same order, but is done for one chain, rather than two, on
each processor. We show in Proposition 1.D.1 in Section 1.D that part (2) can be computed
in O(K̃³ log K̃) time. Proposition 1.D.1 follows from efficient use of data structures; naive
implementations are more computationally costly. Note that the total running time for a full
Gibbs sweep (Algorithm 1 or Algorithm 2) will be N times the single-step cost.
The extra cost of a coupled Gibbs step will be small relative to the cost of a single-chain
Gibbs step, then, if O(K̃³ log K̃) is small relative to O(β(N, K̃)).² As an illustrative example,
consider again the DPMM application from Section 1.2.1. We start with a comparison that we
suspect captures typical operating procedure, but we also consider a worst-case comparison.
Standard comparison: The direct cost of a standard Gibbs step is β(N, K) = O(ND + KD³)
(see Proposition 1.D.2 in Section 1.D). By Equation 3.24 in Pitman [163], the number of
clusters in a DPMM grows a.s. as O(log N) as N → ∞.³ If we take K̃ = O(log N), then
O(K̃³ log K̃) will generally be smaller than β(N, K) = O(ND + KD³) for sufficiently large
N.
Worst-case comparison: The complexity of a DPMM Gibbs step can be reduced to
β(N, K) = O(KD² + D³) through careful use of data structures and conditional conjugacy
(see Proposition 1.D.2 in Section 1.D). Still, the coupling cost O(K̃³ log K̃) is not much larger
than the cost of this step whenever K̃ is not much larger than D.
For our experiments, we run the standard rather than optimized Gibbs step due to its
simplicity and use in existing work [e.g. 48]. In, e.g., our gene expression experiment with
D = 50, we expect this choice has little impact on our results. Our Proposition 1.D.1
establishing the O(K̃³ log K̃) cost of the optimal transport solver applies to Orlin's algorithm [154].
However, convenient public implementations are not available. So instead we use the simpler
network simplex algorithm [107] as implemented by Flamary et al. [61]. Although Kelly and
O'Neill [107, Section 3.6] upper bound the worst-case complexity of the network simplex as
O(K̃⁵), the algorithm's average-case performance may be as good as O(K̃²) [25, Figure 6].
²We show in Section 1.D that, while there are also initial setup costs before running any Gibbs sweep,
these costs do not impact the amortized complexity.
³Two caveats: (1) If a Markov chain is run long enough, it will eventually visit all possible cluster
configurations. But if we run in finite time, it will not have time to explore every collection of clusters. So we
assume O(log N) is a reasonable approximation for finite-time runs. (2) Also note that the log N growth is for data
generated from a DPMM, whereas in real life we cannot expect data are perfectly simulated from the model.
1.5 Empirical Results
We now demonstrate empirically that our OT coupling (1) gives more accurate estimates
and confidence intervals for the same wall time and processor budget as naive parallelism
and (2) meets much faster than label-based couplings.
Figure 1.2: Top row and bottom row give results for gene and k-regular, respectively.
The first two columns show that coupled chains provide better point estimates than naive
parallelism. The third column shows that confidence intervals based on coupled chains are
better than those from naive parallelism. The fourth column shows that OT coupling meets
in less time than label-based couplings.
which case we ensure equal cost between approaches. For the coupling on the jth processor,
we run until the chains meet and record the total time ξ j . In the naively parallel case,
then, we run a single chain on the jth processor for time ξ j . In either case, each processor
returns an estimate of H ∗ . We can aggregate these estimates with a sample mean or trimmed
estimator. Let Hc,J represent the coupled estimate after aggregation across J processors and
Hu,J represent the naive parallel (uncoupled) estimate after aggregation across J processors.
To understand the variability of these estimates, we replicate them I times: {H^{(i)}_{c,J}}_{i=1}^I and
{H^{(i)}_{u,J}}_{i=1}^I. In particular, we simulate running on 180,000 processors, so for each J, we let
I = 180,000/J; see Section 1.G.2 for details. For the ith replicate, we compute the squared error
e_{c,i} := (H^{(i)}_{c,J} − H^∗)²; similarly in the uncoupled case.
Better point estimates. The upper left panel of Figure 1.2 shows the behavior of LCP
estimates for gene. The horizontal axis gives the number of processes J. The vertical value
of any solid line is found by taking the square root of the median (across I replicates) of the
squared error and then dividing by the (positive) ground truth. Blue shows the performance
of the aggregated standard-mean coupling estimate; red shows the naive parallel estimate.
The blue regions show the 20% to 80% quantile range. We can see that, at higher numbers
of processors, the coupling estimates consistently yield a lower percentage error than the
naive parallel estimates for a shared wall time. The difference is even more pronounced
for the trimmed estimates (first row, second column of Figure 1.2); here we see that, even
at smaller numbers of processors, the coupling estimates consistently outperform the naive
parallel estimates for a shared wall time. We see the same patterns for estimating CC(2,4) in
k-regular (second row, first two columns of Figure 1.2) and also for synthetic, seed,
and abalone in Figures 1.H.2a, 1.H.3a and 1.H.4a in Section 1.H. We see similar patterns
in the root mean squared error across replicates in Figure 1.1 (which pertains to gene) and
Figure 1.3: Coupled-chain estimates have large outliers. Meanwhile, naive parallelism
estimates have substantial bias that does not go away with replication.
the left panel of Figures 1.H.2b, 1.H.3b, 1.H.4b and 1.H.6b for the remaining datasets.
Figure 1.3 illustrates that the problem with naive parallelism is the bias of the individual
chains; parallelism eliminates only variance. In particular, the histogram on
the right depicts the J estimates returned by the uncoupled chains, one per processor j.
We see that the population mean across these estimates is substantially different from the
ground truth. This observation also clarifies why trimming does not benefit the naive parallel
estimator: trimming can eliminate outliers but not systematic bias across processors.
By contrast, we plot the J coupling estimates returned across each processor j as horizontal
coordinates of points in the left panel of Figure 1.3. Vertical coordinates are random noise to
aid in visualization. By plotting the 1% and 99% quantiles of the J estimators, we can see
that trimming will eliminate a few outliers. But the vast majority of estimates concentrate
near the ground truth.
Better confidence intervals. The third column of Figure 1.2 shows that the confidence
intervals returned by coupling are also substantially improved relative to naive parallelism.
The setup here is slightly different from that of the first two columns. For the first two
columns, we instantiated many replicates of individual users and thereby checked that coupling
generally can be counted upon to beat naive parallelism. But, in practice, an actual user
would run just a single replicate. Here, we evaluate the quality of a confidence interval that
an actual user would construct. We use only the individual estimates sj that make up one
Hc,J , sj = Hℓ:m (X j , Y j ) (or the equivalent for Hu,J ), to form a point estimate of H ∗ and a
notion of uncertainty.
In the third column of Figure 1.2, each solid line shows the sample-average estimate
aggregated across J processors: (1/J) Σ_{j=1}^J s_j. The error bars show ±2 standard errors
of the mean (SEM), where one SEM equals √(Var({s_j}_{j=1}^J)/(J − 1)). Since the individual
coupling estimators (blue) from each processor j are unbiased, we expect the error bars to be
calibrated, and indeed we see appropriate coverage of the ground truth (dashed black line). By
contrast, we again see systematic bias in the naive parallel estimates – and very overconfident
intervals; indeed they are so small as to be largely invisible in the top row of the third
column of Figure 1.2 – i.e., when estimating LCP in the gene dataset. The ground truth is
many standard errors away from the naive parallel estimates. We see the same patterns for
estimating CC(2,4) for k-regular (second row, third column of Figure 1.2). See the right
panel of Figures 1.H.2b, 1.H.3b and 1.H.4b in Section 1.H for similar behaviors in synthetic,
seed, and abalone.
1.6 Discussion
We demonstrated how to efficiently couple partition-valued Gibbs samplers using optimal
transport – to take advantage of parallelism for improved estimation. Multiple directions
show promise for future work. E.g., while we have used CPUs in our experiments here, we
expect that GPU implementations will improve the applicability of our methodology. More
extensive theory on the trimmed estimator could clarify its guarantees and best practical
settings.
Appendix
probability of data points indexed by j1 and j2 , then we let h be the co-clustering indicator.
Namely, if j1 and j2 belong to the same element of Π (i.e. there exists some A ∈ Π such that
j1 , j2 ∈ A), then h(Π) equals 1; otherwise, it equals 0.
In addition to these summary statistics of the partition, we can also estimate cluster-
specific parameters, like cluster centers. For the Gaussian DPMM from Section 1.2.1, suppose
that we care about the mean of clusters that contain a particular data point, say data point
1. This expectation is E(µA s.t. 1 ∈ A). This is equivalent to E[θi | x] in the notation of
MacEachern [136]. In Section 1.2.1, we use µA to denote the cluster center for all elements
i ∈ A, while MacEachern [136] uses individual θi ’s to denote cluster centers for individual
data points, with the possibility that θi = θj if data points i and j belong in the same
partition element. We can rewrite the expectation as E(E[µA s.t. 1 ∈ A | Π]), using the law
of total expectation. E[µA s.t. 1 ∈ A | Π] is the posterior mean of the cluster that contains
data point 1, which is a function only of the partition Π.
Proof of Lemma 1.C.1. For any starting X ∈ PN, we observe that there is positive probability
of staying at the same state after the T(X, ·) transition, i.e., T(X, X) > 0. In the Gaussian DPMM,
because the support of the Gaussian distribution is the whole Euclidean space (see also Equation (1.14)),
when the nth data point is left out (resulting in the conditional pΠ|Π(−n)(· | X(−n))),
there is positive probability that the nth point is re-inserted into the same partition element of X, i.e.,
pΠ|Π(−n)(X | X(−n)) > 0. Since T(X, ·) is the composition of these N leave-outs and re-inserts,
the probability of staying at X is the product of the probabilities under each pΠ|Π(−n)(· | X(−n)),
which is a positive number.

One series of updates that transforms X into 1 in one sweep is to a) assign 1 to its own
cluster and b) assign 2, 3, . . . , N to the same cluster as 1. This series of updates also has
positive probability in the Gaussian DPMM.
On transforming 1 into X, for each component A in X, let c(A) be the smallest element
in the component. For instance, if X = {{1, 2}, {3, 4}} then c({1, 2}) = 1, c({3, 4}) = 3. We
sort the components A by their c(A), to get a list c1 < c2 < . . . < c|X| . For each 1 ≤ n ≤ N ,
let l(n) = c(A) for the component A that contains n. In the previous example, we have c1 = 1
and c2 = 3, while l(1) = 1, l(2) = 1, l(3) = 3, l(4) = 3. One series of updates that transform 1
into X is
• Initialize j = 1.
• for 1 ≤ n ≤ N , if n = cj , then make a new cluster with n and increment j = j + 1. Else,
assign n to the cluster that currently contains l(n).
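This construction can be sketched in code; the tuple representation of assignments is an illustrative choice:

```python
def update_sequence(X):
    """Assignments that transform the one-block partition 1 into X in one sweep.

    Returns, for n = 1..N, ("new", n) when n = c_j opens a new cluster, and
    ("join", l(n)) when n joins the cluster led by l(n)."""
    N = sum(len(A) for A in X)
    anchors = sorted(min(A) for A in X)          # the sorted values c_1 < ... < c_|X|
    leader = {n: min(A) for A in X for n in A}   # l(n) for every data index n
    seq, j = [], 0
    for n in range(1, N + 1):
        if j < len(anchors) and n == anchors[j]:
            seq.append(("new", n))
            j += 1
        else:
            seq.append(("join", leader[n]))
    return seq
```

On X = {{1, 2}, {3, 4}} this recovers the sequence from the example above: 1 and 3 open new clusters, while 2 and 4 join the clusters of l(2) = 1 and l(4) = 3.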
Checking Assumption 1. Because the sample space PN is finite, max_{π∈PN} h(π) is finite.
This means the expectation of any moment of h under the Markov chain is also bounded.
We show that E[h(X^t)] → H^∗ as t → ∞ by standard ergodicity arguments.⁴
• Aperiodic. From Lemma 1.C.1, we know T(X, X) > 0 for any X. This means the Markov
chain is aperiodic [126, Section 1.3].

• Irreducible. From Lemma 1.C.1, for any X, Y, we know that T(X, 1) > 0 and T(1, Y) > 0,
meaning that T²(X, Y) > 0. This means the Markov chain is irreducible.

• Invariant w.r.t. pΠ. The transition kernel T(X, ·) from Algorithm 1 leaves the target pΠ
invariant because each leave-out conditional pΠ|Π(−n) leaves the target pΠ invariant. If
X ∼ pΠ, then X(−n) ∼ pΠ(−n). Hence, if X̃ | X ∼ pΠ|Π(−n)(· | X(−n)), then by integrating out
X, we have X̃ ∼ pΠ.
By Levin and Peres [126, Theorem 4.9], there exist a constant α ∈ (0, 1) and C > 0 such
that

    max_{π∈PN} ∥T^t(π, ·) − pΠ∥_TV ≤ C α^t.

Since the sample space is finite, the total variation bound implies that, for any starting π,
expectations under T^t(π, ·) are close to expectations under pΠ:

    |E[h(X^t)] − H^∗| = |E_{X^0}[E[h(X^t) | X^0] − H^∗]| ≤ E_{X^0}|E[h(X^t) | X^0] − H^∗| ≤ (max_{π∈PN} |h(π)|) C α^t.

Since the right-hand side goes to zero as t → ∞, we have shown that E[h(X^t)] → H^∗.
Checking Assumption 2. To show that the meeting time has geometric tails, we show that
there exists ϵ > 0 such that for any X and Y, under one coupled sweep from Algorithm 2
(i.e., (X̃, Ỹ) ∼ T̂(·, (X, Y))),

    P(X̃ = Ỹ = 1 | X, Y) ≥ ϵ.    (1.8)
⁴MacEachern [136, Theorem 1] states a geometric ergodicity theorem for a Gibbs sampler like Algorithm 1
but does not provide verification of aperiodicity, irreducibility, or stationarity.
If this were true, we would have P(X̃ = Ỹ | X, Y) ≥ ϵ, and

    P(τ > t) = P(∩_{i=0}^t {X^{i+1} ≠ Y^i}) = P(X^1 ≠ Y^0) ∏_{i=1}^t P(X^{i+1} ≠ Y^i | X^i ≠ Y^{i−1}),

where we have used the Markov property to remove conditioning beyond X^i ≠ Y^{i−1}. Since
min_{X,Y} P(X̃ = Ỹ | X, Y) ≥ ϵ, we have P(X^{i+1} ≠ Y^i | X^i ≠ Y^{i−1}) ≤ 1 − ϵ, meaning P(τ > t) ≤ (1 − ϵ)^t.
To see why Equation (1.8) is true: by Lemma 1.C.1, there exists a series of
intermediate partitions x^1, x^2, . . . , x^{N−1} (with x^0 = X and x^N = 1) such that
pΠ|Π(−n)(x^n | x^{n−1}(−n)) > 0 for 1 ≤ n ≤ N. Likewise, there exists a series y^1, y^2, . . . , y^{N−1}
for Y. Because the coupling function ψ satisfies u^{k,k′} ≥ ϵ, for any n there is probability at
least ϵ of transitioning to (x^n, y^n) from (x^{n−1}, y^{n−1}). Overall, there is probability at
least ϵ^N of transitioning from (X, Y) to (1, 1). Since the choice of X, Y was arbitrary, we
have proven Equation (1.8) with ϵ^N in place of ϵ.
    γ^∗ := argmin_{couplings γ} Σ_{k=1}^K Σ_{k′=1}^{K′} u^{k,k′} d(π^k, ν^{k′}) = argmin_{couplings γ} Σ_{k=1}^K Σ_{k′=1}^{K′} u^{k,k′} [d(π^k, ν^{k′}) − c].    (1.9)
We now show that if we set c = d(π(−n), ν(−n)), then we can compute all O(K̃²) values
of d(π^k, ν^{k′}) − c in O(K̃²) time. First, if we use A^k_n and B^{k′}_n to denote the elements of
π^k and ν^{k′}, respectively, that contain data point n, then for any n we may write

    d(π^k, ν^{k′}) = d(π(−n), ν(−n)) + [ |A^k_n|² − (|A^k_n| − 1)² ] + [ |B^{k′}_n|² − (|B^{k′}_n| − 1)² ]
                     − 2[ |A^k_n ∩ B^{k′}_n|² − (|A^k_n ∩ B^{k′}_n| − 1)² ].    (1.10)
Simplifying some terms, we can also write

    d(π^k, ν^{k′}) = d(π(−n), ν(−n)) + [2|A^k_n| − 1] + [2|B^{k′}_n| − 1] − 2[2|A^k_n ∩ B^{k′}_n| − 1]
                  = d(π(−n), ν(−n)) + 2[ |A^k_n| + |B^{k′}_n| − 2|A^k_n ∩ B^{k′}_n| ],

which means

    d(π^k, ν^{k′}) − d(π(−n), ν(−n)) = 2[ |A^k_n| + |B^{k′}_n| − 2|A^k_n ∩ B^{k′}_n| ].
At first it may seem that this still does not solve the problem, as directly computing the
size of the set intersections is O(N) (if cluster sizes scale as O(N)). However, Equation (1.9)
is just our final stepping stone. If we additionally keep track of the sizes of intersections at every
step, updating them as we adapt the partitions, each update takes only constant time. As such,
we are able to form the K × K′ matrix of d(π^k, ν^{k′}) − c in O(K̃²) time.

With the array of values d(π^k, ν^{k′}) − d(π(−n), ν(−n)), we now have enough “data” for the
optimization problem that is the optimal transport. Regardless of N, the optimization itself
may be computed in O(K̃³ log K̃) time with Orlin's algorithm [154].
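As a quick numerical check of the identity above on a toy example (here d recomputes the distance from scratch, while the right-hand side uses only block sizes and one intersection):

```python
def d(pi, nu):
    # partition distance from block sizes and pairwise intersection sizes
    return (sum(len(A) ** 2 for A in pi) + sum(len(B) ** 2 for B in nu)
            - 2 * sum(len(A & B) ** 2 for A in pi for B in nu))

# leave data point n = 3 out, then re-insert it into block A of pi and block B of nu
pi_minus, nu_minus = [{1, 2}, {4, 5}], [{1, 2}, {4, 5}]
A, B = {1, 2, 3}, {3, 4, 5}
pi, nu = [A, {4, 5}], [{1, 2}, B]
lhs = d(pi, nu) - d(pi_minus, nu_minus)
rhs = 2 * (len(A) + len(B) - 2 * len(A & B))
assert lhs == rhs == 8
```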
The next proposition provides estimates of the time taken to construct the Gibbs conditionals
(β(N, K)) for the Gaussian DPMM.

Proposition 1.D.2 (Gibbs conditional runtime with dense Σ0, Σ1). Suppose the covariance
matrices Σ0 and Σ1 are dense, i.e., the number of non-zero entries is Θ(D²). The standard
implementation takes time β(N, K) = O(ND + KD³). By spending O(D³) time precomputing
at the beginning of sampling, and using additional data structures, the time can be reduced to
β(N, K) = O(KD² + D³).
Proof of Proposition 1.D.2. We first mention the well-known posterior formula of a Gaussian
model with known covariances [18, Chapter 2.3]. Namely, if µ ∼ N(µ0, Σ0) and
W1, W2, . . . , WM | µ ∼ N(µ, Σ1) independently, then µ | W1, . . . , WM is Gaussian with
covariance Σc and mean µc satisfying

    Σc = (Σ0⁻¹ + M Σ1⁻¹)⁻¹,
    µc = Σc (Σ0⁻¹ µ0 + Σ1⁻¹ Σ_{m=1}^M Wm).    (1.11)
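Equation (1.11) translates directly to code; a minimal NumPy sketch, for exposition only and without the precomputations discussed below:

```python
import numpy as np

def gaussian_posterior(mu0, Sigma0, Sigma1, W):
    """Posterior mean and covariance of mu given M observations, per Equation (1.11).

    W is an M x D array; Sigma0 is the prior covariance and Sigma1 the
    likelihood covariance."""
    M = W.shape[0]
    P0 = np.linalg.inv(Sigma0)  # prior precision
    P1 = np.linalg.inv(Sigma1)  # likelihood precision
    Sigma_c = np.linalg.inv(P0 + M * P1)
    mu_c = Sigma_c @ (P0 @ mu0 + P1 @ W.sum(axis=0))
    return mu_c, Sigma_c
```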
Suppose |Π| = K. Based on the expressions for the Gibbs conditional in Equation (1.14),
the computational work involved for a held-out observation Wn can be broken down into
three steps:

1. Evaluate the prior predictive density N(Wn | µ0, Σ0 + Σ1).

2. For each cluster c ∈ Π(−n), compute µc, Σc, (Σc + Σ1)⁻¹, and the determinant of (Σc + Σ1)⁻¹.

3. For each cluster c ∈ Π(−n), evaluate the predictive density N(Wn | µc, Σc + Σ1).
Standard implementation. The time to evaluate the prior N(Wn | µ0, Σ0 + Σ1) is O(D³),
as we need to compute the precision matrix (Σ0 + Σ1)⁻¹ and its determinant. With time
O(KD³), we can compute the various cluster-specific covariances, precisions, and determinants
(where D³ is the cost for each cluster). To compute the posterior means µc, we need to
compute the sums Σ_j Wj for all clusters, which takes O(ND), as we need to iterate over
all D coordinates of all N observations. The time to evaluate N(Wn | µc, Σc + Σ1) across
clusters is O(KD²). Overall this leads to O(ND + KD³) runtime.
Optimized implementation. After the O(D³) precomputation, computing each of the K means
µc takes O(D²); hence the time to compute the means is O(KD²). Overall,
the time spent in Step 2 is O(KD² + D³), leading to an overall O(KD² + D³) runtime.

The standard implementation is used, for instance, in de Valpine et al. [48] (see the
CRP_conjugate_dmnorm_dmnorm() function from NIMBLE's source code). Miller and
Harrison [143] use the standard implementation in the univariate case (see the Normal.jl
function).
Corollary 1.D.1 (Gibbs conditional runtime with diagonal Σ0, Σ1). Suppose the covariances
Σ0 and Σ1 are diagonal matrices, i.e., there are only Θ(D) non-zero entries. Then a standard
implementation takes time β(N, K) = O(ND). Using additional data structures, the time
can be reduced to β(N, K) = O(KD).
Proof of Corollary 1.D.1. When the covariance matrices are diagonal, we do not incur the
cubic costs of inverting D × D matrices. The breakdown of computational work is similar to
the proof of Proposition 1.D.2.
Standard implementation. The covariances and precision matrices each take only time
O(D) to compute: as there are K of them, the time taken is O(KD). To compute the
posterior means µc, we iterate through all coordinates of all observations in forming the sums
Σ_j Wj, leading to O(ND) runtime. The time to evaluate the Gaussian likelihoods is just O(D)
because of the diagonal precision matrices. Overall the runtime is O(ND).
1.E Label-Switching
1.E.1 Example 1
Suppose there are 4 data points, indexed by 1, 2, 3, 4. The labeling of the X chain is z1 =
[1, 2, 2, 2], meaning that the partition is {{1}, {2, 3, 4}}. The labeling of the Y chain is
z2 = [2, 1, 1, 2], meaning that the partition is {{1, 4}, {2, 3}}. The Gibbs sampler temporarily
removes data point 4. For both chains, the remaining data points are partitioned into
{{1}, {2, 3}}. We denote π1 = {{1, 4}, {2, 3}}, π2 = {{1}, {2, 3, 4}}, π3 = {{1}, {2, 3}, {4}}:
in the first two partitions, the data point is assigned to an existing cluster, while in the last
partition, the data point is in its own cluster. There exist three positive numbers a1, a2, a3,
summing to one, such that
    pΠ|Π(−4)(· | X(−4)) = pΠ|Π(−4)(· | Y(−4)) = Σ_{k=1}^3 a_k δ_{πk}(·).
Since the two distributions on partitions are the same, couplings based on partitions like
ψηOT will make the chains meet with probability 1 in the next step. However, this is not true
under labeling-based couplings like maximal or common-RNG coupling. In this example, the same
partition is represented with different labels in the two chains. The X chain represents
π1, π2, π3 with the labels 1, 2, 3, respectively. Meanwhile, the Y chain represents π1, π2, π3
with the labels 2, 1, 3, respectively. Let zX be the label assignment of the data point in
question (recall that we have been leaving out 4) under the X chain, and similarly define
zY. Maximal coupling maximizes the probability that zX = zY. However, the coupling that
results in the two chains X and Y meeting is the following:
    Pr(zX = u, zY = v) =
        a3  if u = v = 3,
        a1  if u = 1, v = 2,
        a2  if u = 2, v = 1,
        0   otherwise.
In general, a1 ≠ a2, meaning that the maximal coupling is different from this coupling that
causes the two chains to achieve the same partition after updating the assignment of 4. A
similar phenomenon holds for the common-RNG coupling.
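To make the example concrete, the coupling above can be checked numerically; the numerical values of a1, a2, a3 below are hypothetical:

```python
import numpy as np

a = np.array([0.5, 0.3, 0.2])   # hypothetical values of a1, a2, a3
U = np.zeros((3, 3))            # U[u - 1, v - 1] = Pr(zX = u, zY = v)
U[0, 1] = a[0]                  # zX = 1, zY = 2: both chains reach pi^1
U[1, 0] = a[1]                  # zX = 2, zY = 1: both chains reach pi^2
U[2, 2] = a[2]                  # zX = 3, zY = 3: both chains reach pi^3

# label u in the X chain represents pi^u; labels 1, 2, 3 in Y represent pi^2, pi^1, pi^3
x_part, y_part = [1, 2, 3], [2, 1, 3]
assert np.allclose(U.sum(axis=1), a)             # zX marginal is (a1, a2, a3)
assert np.allclose(U.sum(axis=0), a[[1, 0, 2]])  # zY marginal is (a2, a1, a3)
meet = sum(U[u, v] for u in range(3) for v in range(3) if x_part[u] == y_part[v])
assert abs(meet - 1.0) < 1e-12                   # the partitions meet with probability 1
```

Even though Pr(zX = zY) here is only a3, the induced partitions agree with probability 1.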
1.E.2 Example 2
For the situation in Section 1.E.1, the discussion of Ju et al. in Tancredi et al. [193]
proposes a relabeling procedure to better align the clusters in the two partitions before
constructing couplings. Indeed, if z2 were relabeled [1, 2, 2, 1] (the label of each cluster is
the smallest data index in that cluster), then upon the removal of data point 4, the
label-based and partition-based couplings would agree. However, such a relabeling fix still
suffers from the label-switching problem in general, since the smallest data index does not convey
much information about the cluster. For concreteness, we demonstrate an example where the
best coupling from minimizing label distances is different from the best coupling minimizing
partition distances.
Suppose there are 6 data points, indexed from 1 through 6. The partition of the X
chain is {{1, 3, 4}, {2, 5, 6}}. The partition of the Y chain is {{1, 5, 6}, {2, 3, 4}}. Using the
labeling rule from above, the label vector for X is zX = [1, 2, 1, 1, 2, 2] while that for Y is
zY = [1, 2, 2, 2, 1, 1]. The Gibbs sampler temporarily removes the data point 1. The three
next possible states of the X chain are the partitions ν1, ν2, ν3 where ν1 = {{1, 3, 4}, {2, 5, 6}},
ν2 = {{3, 4}, {1, 2, 5, 6}} and ν3 = {{3, 4}, {2, 5, 6}, {1}}. The labelings of data points 2
through 6 for all three partitions are the same; the only difference between the labeling vectors
is the label of data point 1: for ν1, zX(1) = 1, for ν2, zX(1) = 2, and for ν3, zX(1) = 3. On
the Y side, the three next possible states of the Y chain are the partitions µ1, µ2, µ3 where
µ1 = {{1, 5, 6}, {2, 3, 4}}, µ2 = {{5, 6}, {1, 2, 3, 4}} and µ3 = {{5, 6}, {2, 3, 4}, {1}}. As for
the labeling of 1 under Y, for µ1, zY(1) = 1, for µ2, zY(1) = 2, and for µ3, zY(1) = 3. Suppose
that the marginal assignment probabilities are Pr(zX(1) = 1) = Pr(zY(1) = 1) = 0.45,
Pr(zX(1) = 2) = Pr(zY(1) = 2) = 0.45, and Pr(zX(1) = 3) = Pr(zY(1) = 3) = 0.1.
Under label-based couplings, since Pr(zX(1) = a) = Pr(zY(1) = a) for a ∈ {1, 2, 3}, the
coupling that minimizes the distance between the labels will pick Pr(zX(1) = zY(1)) = 1,
which means the following for the induced partitions:

  Pr(X = ν, Y = µ) =
    0.45  if ν = ν1, µ = µ1,
    0.45  if ν = ν2, µ = µ2,
    0.1   if ν = ν3, µ = µ3.    (1.12)
Under the partition-based transport coupling, the distance between partitions (Equa-
tion (1.5)) is the following.
µ1 µ2 µ3
ν1 16 10 12
ν2 10 16 14
ν3 12 14 8
Notice that the distances d(ν1 , µ1 ) and d(ν2 , µ2 ) are actually larger than d(ν1 , µ2 ) and d(ν2 , µ1 ):
in other words, the label-based coupling from Equation (1.12) proposes a coupling with
larger-than-minimal expected distance. In fact, solving the transport problem, we find that
the coupling that minimizes the expected partition distance is actually
  Pr(X = ν, Y = µ) =
    0.45  if ν = ν1, µ = µ2,
    0.45  if ν = ν2, µ = µ1,
    0.1   if ν = ν3, µ = µ3.    (1.13)
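This small optimal-transport problem can be verified numerically. The sketch below (our own illustration, assuming NumPy and SciPy are available; variable names are ours) solves the transport linear program with marginals (0.45, 0.45, 0.1) on both sides and the distance table above; it recovers the coupling in Equation (1.13) and shows that its expected distance (9.8) beats that of the label-based coupling in Equation (1.12) (15.2):

```python
import numpy as np
from scipy.optimize import linprog

# Distance table d(nu_i, mu_j) from the text.
C = np.array([[16, 10, 12],
              [10, 16, 14],
              [12, 14, 8]], dtype=float)
p = np.array([0.45, 0.45, 0.10])  # marginal over nu_1, nu_2, nu_3
q = np.array([0.45, 0.45, 0.10])  # marginal over mu_1, mu_2, mu_3

# Transport LP: minimize sum_ij C_ij P_ij s.t. row sums = p, column sums = q.
n = 3
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0  # row-sum constraints
    A_eq[n + i, i::n] = 1.0           # column-sum constraints
b_eq = np.concatenate([p, q])
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
P = res.x.reshape(n, n)

label_cost = float((p * np.diag(C)).sum())  # expected distance under Eq. (1.12): 15.2
print(np.round(P, 3))  # optimal coupling, matching Eq. (1.13)
print(res.fun)         # minimal expected distance: 9.8
```

The optimum puts all mass on each row's minimal distance, which is feasible here, so the transport coupling is strictly better than matching labels.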
1.F Trimming
We consider the motivating situation in Example 1.F.1. This is a case where trimming outliers
before taking the average yields a more accurate estimator (in terms of mean squared error)
than the regular sample mean. For reference, the RMSE of an estimator µ̂ of a real-valued
unknown quantity µ is

  √( E‖µ̂ − µ‖² ).
Example 1.F.1 (Mixture distribution with large outliers). For µ > 0, p < 1, consider the
mixture distribution (0.5 − p/2)N(−µ, 1) + pN(0, 1) + (0.5 − p/2)N(µ, 1). The mean is 0.
The variance is 1 + (1 − p)µ². Therefore, the RMSE of the sample mean computed using J
i.i.d. draws is √((1 + (1 − p)µ²)/J).
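The gap between the two estimators can be reproduced with a short simulation. The sketch below is our own illustration (the mixture parameters and trimming fraction are arbitrary choices, not values from the thesis); it compares the sample mean with SciPy's trimmed mean on draws from the mixture of Example 1.F.1:

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(0)
p, mu, J, reps = 0.98, 20.0, 1000, 500  # most mass near 0, rare outliers at +/- mu

def draw_mixture(size):
    # Component indices: 0 -> N(-mu, 1), 1 -> N(0, 1), 2 -> N(mu, 1).
    comp = rng.choice(3, size=size, p=[(1 - p) / 2, p, (1 - p) / 2])
    centers = np.array([-mu, 0.0, mu])
    return centers[comp] + rng.standard_normal(size)

sample_means, trimmed_means = [], []
for _ in range(reps):
    x = draw_mixture(J)
    sample_means.append(x.mean())
    trimmed_means.append(trim_mean(x, 0.02))  # cut 2% from each tail

rmse_sample = float(np.sqrt(np.mean(np.square(sample_means))))   # true mean is 0
rmse_trimmed = float(np.sqrt(np.mean(np.square(trimmed_means))))
print(rmse_sample, rmse_trimmed)  # trimming gives a clearly smaller RMSE
```

With these parameters the theoretical RMSE of the sample mean is √(9/J) ≈ 0.095, while trimming removes the rare ±µ outliers and behaves roughly like the mean of J draws from N(0, 1).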
Figure 1.F.1: Trimmed mean has better RMSE than sample mean on Example 1.F.1. Left
panel plots RMSE versus J. Right panel gives boxplots for J = 1000.
Graph coloring. Let G be an undirected graph with vertices V = [N] and edges E ⊂ V × V,
and let Q = [q] be the set of q colors. A graph coloring is an assignment of a color in Q to each
vertex such that the endpoints of each edge have different colors. We here demonstrate
an application of our method to a Gibbs sampler which explores the uniform distribution over
valid q-colorings of G, i.e. the distribution which places equal mass on every proper coloring
of G.
To employ Algorithm 2, for this problem we need only characterize the p.m.f. on
partitions of the vertices implied by the uniform distribution on its colorings. A partition
corresponds to a proper coloring only if no two adjacent vertices are in the same element of the
partition. As such, we can write

  pΠN(π) ∝ 1{|π| ≤ q and A(π)i,j = 1 → (i, j) ∉ E, ∀i ≠ j} (q choose |π|) |π|!,

where the indicator term checks that π can correspond to a proper coloring and the second
term accounts for the number of unique colorings which induce the partition π. In particular
it is the product of the number of ways to choose |π| unique colors from Q, namely
(q choose |π|) := q!/(|π|!(q − |π|)!), and the number of ways to assign those colors to the
groups of vertices in π.
The Gibbs conditionals have the form

  pΠ|Π(−n)(Π = y | Π(−n)) = [q!/(q − |y|)!] / Σ_{x consistent with Π(−n)} q!/(q − |x|)!
                          = [1/(q − |y|)!] / Σ_{x consistent with Π(−n)} 1/(q − |x|)!.    (1.15)

In Equation (1.15), x and y are partitions of the whole set of N vertices.
In implementations, to simulate from the conditional Equation (1.15), it suffices to
represent the partition with a color vector. Suppose we condition on Π(−n), i.e. the
colors for all but the n-th vertex are fixed, and there are q′ unique colors that have been used
(q′ can be strictly smaller than q). Vertex n can either take on a color in [q′] (as long as the color is
not used by a neighbor), or take on the color q′ + 1 (if q′ < q). The transition probabilities
are computed from the induced partition sizes |x|.
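A minimal sketch of this update (our own code and naming; `adj` is an adjacency list, colors are integers from 0, and `None` marks the vertex being resampled) computes the conditional in Equation (1.15) at the level of induced partitions:

```python
from math import factorial

def gibbs_color_probs(n, colors, adj, q):
    """Probabilities for resampling the color of vertex n, per Eq. (1.15).

    Each candidate color induces a partition x and receives weight 1/(q - |x|)!.
    """
    others = [c for v, c in enumerate(colors) if v != n]
    used = sorted(set(others))               # the q' colors in use among the rest
    forbidden = {colors[v] for v in adj[n]}  # colors taken by n's neighbors
    candidates = [c for c in used if c not in forbidden]
    if len(used) < q:
        candidates.append(max(used) + 1)     # open a brand-new color
    # |x| = number of distinct colors after assigning candidate c to vertex n.
    weights = [1.0 / factorial(q - len(set(others) | {c})) for c in candidates]
    total = sum(weights)
    return {c: w / total for c, w in zip(candidates, weights)}

# Vertices {0, 1, 2}, single edge (0, 1), q = 3 colors, resampling vertex 2:
probs = gibbs_color_probs(2, [0, 1, None], {0: [1], 1: [0], 2: []}, 3)
print(probs)  # all three choices have weight 1, so probability 1/3 each
```

If instead vertex 2 were adjacent to both 0 and 1, the only consistent partition would put it in a new color, and the conditional would be degenerate at that choice.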
Sampler initializations. In clustering, we initialize each chain at the partition in which all
data points belong to a single cluster, i.e. the one-component partition. In graph coloring,
we initialize the Markov chain by greedily coloring the vertices. Our intuition suggests
that coupling should be especially helpful relative to naively parallel chains when samplers
require a large burn-in – since slow mixing induces bias in the uncoupled chains. In general,
one cannot know in advance if that bias is present or not, but we can try to encourage
suboptimal initialization in our experiments to explore its effects. For completeness, we
consider alternative initialization schemes, such as k-means, in Figure 1.K.3.
Simulating many processes. To quantify the sampling variability of the aggregate
estimates (sample or trimmed mean across J processors), we first generate a large number
(V = 180,000) of coupled estimates H_{ℓ:m}(X^j, Y^j) (and V naive parallel estimates U^j, where
the time to construct H_{ℓ:m}(X^j, Y^j) is equal to the time to construct U^j).6 For each J, we
batch up the V estimates in a consistent way across coupled chains and naive parallel, making
sure that the equality between coupled wall time and naive parallel wall time is maintained.
There are I = V/J batches. For the i-th batch, we combine H_{ℓ:m}(X^j, Y^j) (or U^j) for indices
j in the list [(i − 1)J + 1, iJ] to form H_{c,J}^{(i)} (or H_{u,J}^{(i)}) in the sense of Section 1.5.2. By this
batching procedure, smaller values of J have more batches I. The largest J we consider for
gene, k-regular and abalone is 2,750 while that for synthetic and seed is 1,750. This
means the largest J has at least 57 batches.
To generate the survival functions (last column of Figure 1.2), we use 600 draws from the
(censored) meeting time distribution by simulating 600 coupling experiments.
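Given simulated meeting times, the empirical survival function is straightforward to compute. The sketch below uses synthetic meeting times purely for illustration (the geometric distribution and sweep budget are our own stand-ins, not values from the experiments); chains that have not met by the budget are censored and recorded as infinite:

```python
import numpy as np

rng = np.random.default_rng(1)
max_sweeps = 30
# Synthetic meeting times; np.inf marks chains that had not met by the budget.
meetings = rng.geometric(p=0.05, size=600).astype(float)
meetings[meetings > max_sweeps] = np.inf

ts = np.arange(0, max_sweeps + 1)
survival = np.array([(meetings > t).mean() for t in ts])  # S(t) = Pr(meeting time > t)
print(survival[0], survival[-1])  # S(0) = 1; the tail equals the censored fraction
```

Censoring only flattens the curve at its right end, so the survival function remains a valid summary even when some experiments never meet within the budget.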
seed, i.e. wheat seed measurements. The original dataset from [41] has 8 features; we
first remove the “target” feature, which contains label information for supervised learning.
Overall there are N = 210 observations and D = 7 features. We normalize each feature to
have mean 0 and variance 1. We target the posterior of the probabilistic model in Section 1.2.1
with α = 1.0, µ0 = 0D , diagonal covariance matrices Σ0 = 1.0ID , Σ1 = 1.0ID . We set ℓ = 10
and m = 100.
Figure 1.G.1: (a) synthetic data. (b) k-regular data.
gene, the prior variance is larger than the noise variance for synthetic. We set ℓ = 10,
m = 100.
k-regular. Anticipating that regular graphs are hard to color, we experiment with a
4-regular, 6-node graph – see Figure 1.G.1b. The target distribution is the distribution over
vertex partitions induced by uniform colorings using 4 colors. We set ℓ = 1, m = 4.
decomposition, the RMSE for coupled estimates decreases with increasing J because of
unbiasedness, while the RMSE for naive parallel estimates does not go away because of bias.
The right panel of Figure 1.H.1c plots typical d distances between coupled chains under
different couplings as a function of the number of sweeps done. d decreases to zero very
fast under OT coupling, while it is possible for chains under maximal and common RNG
couplings to be far from each other even after many sampling steps.
1.H.2 synthetic
Figure 1.H.2 shows results for LCP estimation on synthetic – see Figure 1.K.1 for results
on co-clustering.
1.H.3 seed
Figure 1.H.3 shows results for LCP estimation on seed – see Figure 1.K.1 for results on
co-clustering.
1.H.4 abalone
Figure 1.H.4 shows results for LCP estimation on abalone. In Figure 1.H.4a and Fig-
ure 1.H.4b, we do not report results for the trimmed estimator with the default trimming
(a) Losses (b) RMSE and intervals
Figure 1.H.2: Results on synthetic. Figure legends are the same as Figure 1.H.1. The
results are consistent with Figure 1.2.
amount (0.01 i.e. 1%). This trimming amount is too large for the application, and in
Figure 1.H.5, we show that trimming the most extreme 0.1% yields much better estimation.
In Figure 1.H.5, the first panel (from the left) plots the errors incurred using the trimmed
mean with the default α = 1%. Trimming of coupled chains is still better than naive
parallelism, but worse than sample mean of coupled chains. In the second panel, we use
α = 0.1%, and the trimming of coupled chains performs much better. In the third panel, we
fix the number of processes to be 2000 and quantify the RMSE as a function of the trimming
amount (expressed in percentages). We see a gradual decrease in the RMSE as the trimming
amount is reduced, indicating that this is a situation in which smaller trimming amounts are
preferred.
1.H.5 k-regular
Figure 1.H.6 shows results for CC(2, 4) estimation on k-regular.
(a) Losses (b) RMSE and intervals
Figure 1.H.3: Results on seed. Figure legends are the same as Figure 1.H.1. The results are
consistent with Figure 1.2.
Denote the clusters in π by {A^1, A^2, . . . , A^K} and the clusters in ν by {B^1, B^2, . . . , B^{K′}}.
For each k ∈ [K] and k′ ∈ [K′], define the number P(k, k′) to be

  P(k, k′) := |A^k ∩ B^{k′}| / N.

|A^k ∩ B^{k′}| is the size of the overlap between A^k and B^{k′}. Because of the normalization by N,
the P(k, k′)'s are non-negative and sum to 1, hence can be interpreted as probability masses.
Summing across all k (or k′) has a marginalization effect, and we define

  P(k) := Σ_{k′=1}^{K′} P(k, k′),   and similarly   P(k′) := Σ_{k=1}^{K} P(k, k′).

  dI(π, ν) = Σ_{k=1}^{K} Σ_{k′=1}^{K′} P(k, k′) log[ P(k, k′) / (P(k) P(k′)) ].    (1.16)
In terms of theoretical properties, Meilă [141, Property 1] shows that dI is a metric for the
space of partitions.
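As a concrete check (our own sketch; natural logarithms assumed), Equation (1.16) can be evaluated on the two partitions from the earlier example, π = {{1, 3, 4}, {2, 5, 6}} and ν = {{1, 5, 6}, {2, 3, 4}}:

```python
import math

def overlap_distance(pi, nu, N):
    """Evaluate Eq. (1.16) from the overlap masses P(k, k') = |A^k ∩ B^k'| / N."""
    P = [[len(A & B) / N for B in nu] for A in pi]
    row = [sum(r) for r in P]                                               # P(k)
    col = [sum(P[k][kp] for k in range(len(pi))) for kp in range(len(nu))]  # P(k')
    return sum(
        P[k][kp] * math.log(P[k][kp] / (row[k] * col[kp]))
        for k in range(len(pi)) for kp in range(len(nu)) if P[k][kp] > 0
    )

pi = [{1, 3, 4}, {2, 5, 6}]
nu = [{1, 5, 6}, {2, 3, 4}]
d = overlap_distance(pi, nu, N=6)
print(round(d, 4))  # ≈ 0.0566
```

Here the overlap masses are P = [[1/6, 2/6], [2/6, 1/6]] with uniform marginals of 1/2, giving (1/3)log(2/3) + (2/3)log(4/3) ≈ 0.0566.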
(a) Losses (b) RMSE and intervals
Figure 1.H.4: Results on abalone. Similar to Figure 1.2, coupled chains perform better
than naive parallelism with more processes, and our coupling yields smaller meeting times
than label-based couplings. See Figure 1.H.5 for the performance of trimmed estimators.
Figure 1.H.5: Effect of trimming amount on abalone.
(a) Losses (b) RMSE and intervals
Figure 1.H.6: Results on k-regular. Figure legends are the same as Figure 1.H.1.
(a) gene (b) synthetic
In Equation (1.17), ΠN +1 denotes the partition of the data W1:(N +1) . To translate Equa-
tion (1.17) into an integral over just the posterior over ΠN (the partition of W1:N ) we break
up ΠN +1 into (ΠN , Z) where Z is the cluster indicator specifying the cluster of ΠN (or a new
(a) Losses (b) RMSE and intervals
(c) Estimates
Each Pr(WN +1 ∈ dx, Z | ΠN , W1:N ) is computed using the prediction rule for the CRP and
Gaussian conditioning. Namely
The first term is computed with the function used during Gibbs sampling to reassign data
points to clusters. In the second term, we ignore the conditioning on W1:N , since Z and W1:N
are conditionally independent given ΠN .
(a) Losses (b) RMSE and intervals
(c) Estimates
plot uncertainty bands. The black dashed curve is the true density of the population i.e. the
10-component Gaussian mixture model density. The grey histogram bins the observed data.
Let P̂n be the posterior predictive distribution of this generative process. Then, P_{f0}-a.s.,

  d_TV(P̂n, P_{f0}) → 0  as n → ∞.
To prove Theorem 1.M.1, we first need some definitions and auxiliary results.
  d_TV(P̂n, P_{f0}) → 0.
The main idea is showing that the posterior Π(f |X1:n ) is strongly consistent and then
leveraging Proposition 1.M.1. For the former, we verify the conditions of Lijoi et al. [128,
Theorem 1].
The first condition of Lijoi et al. [128, Theorem 1] is that f0 is in the K-L support of
the prior over f in Equation (1.20). We use Ghosal et al. [67, Theorem 3]. Clearly f0 is the
convolution of the normal density N(0, σ1²) with the distribution P(·) = Σ_{i=1}^{m} pi δθi. P(·) is
compactly supported since m is finite. Since the support of P(·) is the set {θi}_{i=1}^{m}, which
is contained in R, the support of N(0, σ0²), by Ghosh and Ramamoorthi [68, Theorem 3.2.4], the
conditions on P are satisfied. The condition that the prior over bandwidths covers the true
bandwidth is trivially satisfied since we perfectly specified σ1.
The second condition of Lijoi et al. [128, Theorem 1] is simple: because the prior over P̂
is a DP, it reduces to checking that

  ∫_R |θ| N(θ | 0, σ0²) dθ < ∞,

which is true.
The final condition trivially holds because we have perfectly specified σ1: there is actually
zero probability that σ1 becomes too small, and we never need to worry about setting γ or
the sequence σk .
(a) Losses, CC(0, 21) estimation on gene. (b) RMSE and intervals, CC(0, 21) estimation on gene.
(a) m = 100, seed. (b) m = 150, seed. (c) m = 100, synthetic. (d) m = 150, synthetic.
Figure 1.K.2: Impact of different m on the RMSE. The first two panels are LCP estimation
for seed. The last two panels are CC(0, 1) estimation for synthetic.
Figure 1.K.4: The bias in naive parallel estimates is a function of the DPMM hyperparameters.
(a) N = 25 (b) N = 30
Figure 1.L.1: Meeting time under OT coupling is better than alternative couplings on Erdos–
Renyi graphs, indicated by the fast decrease of the survival functions.
Chapter 2
2.1 Introduction
Many data analysis problems can be seen as discovering a latent set of traits in a population
— for example, recovering topics or themes from scientific papers, ancestral populations from
genetic data, interest groups from social network data, or unique speakers across audio
recordings of many meetings [20, 63, 158]. In all of these cases, we might reasonably expect
the number of latent traits present in a data set to grow with the number of observations.
One might choose a different prior for each data set size, but then model construction potentially
becomes inconvenient and unwieldy. A simpler approach is to choose a single prior that
naturally yields different expected numbers of traits for different numbers of data points.
In theory, Bayesian nonparametric (BNP) priors have exactly this desirable property due
to a countable infinity of traits, so that there are always more traits to reveal through the
accumulation of more data.
However, the infinite-dimensional parameter presents a practical challenge; namely, it is
impossible to store an infinity of random variables in memory or learn the distribution over an
infinite number of variables in finite time. Some authors have developed conjugate priors and
likelihoods [30, 96, 153] to circumvent the infinite representation; in particular, these models
allow marginalization of the infinite collection of latent traits. These models will typically
be part of a more complex generative model where the remaining components are all finite.
Therefore, users can apply approximate inference schemes such as Gibbs sampling. However,
these marginal forms typically limit the user to a constrained family of models; are not
amenable to parallelization; would require substantial new development to use with modern
inference engines like NIMBLE [47]; and are not straightforward to use with variational Bayes.
independent and identically distributed (i.i.d.) representations of the traits together with
their rates within the population; we call these independent finite approximations (IFAs). At
the time of writing, we are aware of two alternative lines of work on generic constructions of
finite approximations using i.i.d. random variables, namely Lijoi et al. [131] and Lee et al.
[124, 125]. Lijoi et al. [131] design approximations for clustering models, characterize the
posterior predictive distribution, and derive tractable inference schemes. However, the authors
have not developed their method for trait allocations, where data points can potentially belong
to multiple traits and can potentially exhibit traits in different amounts. And in particular it
would require additional development to perform inference in trait allocation models using
their approximations.1 Lee et al. [124, 125] construct finite approximations through a novel
augmentation scheme. However, Lee et al. [124, 125] lack explicit constructions in important
situations, such as exponential-family rate measures, because the functions involved in the
augmentation are, in general, only implicitly defined. When the augmentation is implicit,
there is not currently a way to evaluate (up to proportionality constant) the probability
density of the finite-dimensional distribution; therefore standard Markov chain Monte Carlo
and variational approaches for approximate inference are unavailable.
Our contributions. We propose a general-purpose construction for IFAs that subsumes a
number of special cases that have already been successfully used in applications (section 2.3.1).
We call our construction the automated independent finite approximation, or AIFA. We show
that AIFAs can handle a wide variety of models — including homogeneous completely random
measures (CRMs) and normalized CRMs (NCRMs) (section 2.3.3).2 Our construction can
handle (N)CRMs exhibiting power laws and has an especially convenient form for exponential
family CRMs (section 2.3.2). We show that our construction works for useful CRMs not
previously seen in the BNP literature (Example 2.3.4). Unlike marginal representations,
AIFAs do not require conditional conjugacy and can be used with VB. We show that, unlike
TFAs, AIFAs facilitate straightforward derivations within approximate inference schemes
such as MCMC or VB and are amenable to parallelization during inference (section 2.5). In
existing special cases, practitioners report similar predictive performance between AIFAs and
TFAs [117] and that AIFAs are also simpler to use compared to TFAs [63, 100]. In contrast
to the methods of Lee et al. [124, 125], one can always evaluate the probability density (up to
a proportionality constant) of AIFAs; furthermore, in section 2.6.4, AIFAs accurately learn
model hyperparameters by maximizing the marginal likelihood where the methods of Lee
et al. [124, 125] struggle.
In section 2.4, we bound the error induced by approximating an exact infinite-dimensional
prior with an AIFA. Our analysis provides interpretable error bounds with explicit dependence
on the size of the approximation and the data cardinality; our bounds can be used to set the
size of the approximation in practice. Our error bounds reveal that for the worst-case choice of
observation likelihood, to approximate the target to a desired accuracy, it is necessary to use
a large IFA model while a small TFA model would suffice. However, in practical experiments
with standard observation likelihoods, we find that AIFAs and TFAs of equal sizes have
similar performance. Likewise, we find that, when both apply, AIFAs and alternative IFAs
1
We also note that, without modification, their approximation is not suitable for use in statistical models
where the unnormalized atom sizes of the CRM are bounded, as arise when modeling the frequencies (in
[0, 1]) of traits. While model reparameterization may help, it requires (at least) additional steps.
2
NCRMs are also called normalized random measures with independent increments (NRMIs) [97, 176].
[124, 125] exhibit similar predictive performance (section 2.6.3). But AIFAs apply more
broadly and are amenable to hyperparameter learning via optimizing the marginal likelihood,
unlike Lee et al. [124, 125] (section 2.6.4). As a further illustration, we show that we are
able to learn whether a model is over- or underdispersed, and by how much, using an AIFA
approximating a novel BNP prior in section 2.6.5.
2.2 Background
Our work will approximate nonparametric priors, so we first review construction of these
priors from completely random measures (CRMs). Then we cover existing work on the
construction of truncated and independent finite approximations for these CRM priors. For
some space Ψ, let ψi ∈ Ψ represent the i-th trait of interest, and let θi > 0 represent the
corresponding rate or frequency of this trait in the population. If the set of traits is finite, we
let I equal its cardinality; if the set of traits is countably infinite, we let I = ∞. Collect the
pairs of traits and frequencies in a measure Θ that places non-negative mass θi at location
ψi: Θ := Σ_{i=1}^{I} θi δψi, where δψi is a Dirac measure placing mass 1 at location ψi. To perform
Bayesian inference, we need to choose a prior distribution on Θ and a likelihood for the
observed data Y_{1:N} := {Yn}_{n=1}^{N} given Θ. Then, applying a disintegration, we can obtain the
posterior on Θ given the observed data.
Homogeneous completely random measures. Many common BNP priors can be
formulated as completely random measures [109, 127].3 CRMs are constructed from Poisson
point processes,4 which are straightforward to manipulate analytically [111]. Consider a
Poisson point process on R+ := [0, ∞) with rate measure ν(dθ) such that ν(R+) = ∞ and
∫ min(1, θ) ν(dθ) < ∞. Such a process generates a countably infinite set of rates (θi)_{i=1}^{∞} with
θi ∈ R+ and 0 < Σ_{i=1}^{∞} θi < ∞ almost surely. We assume throughout that ψi ~ i.i.d. H for
some diffuse distribution H. The distribution H, called the ground measure, serves as a
prior on the traits in the space Ψ. For example, consider a common topic model. Each trait
ψi represents a latent topic, modeled as a probability vector in the simplex of vocabulary
words. And θi represents the frequency with which the topic ψi appears across documents in
a corpus. H is a Dirichlet distribution over the probability simplex, with dimension given by
the number of words in the vocabulary.
By pairing the rates from the Poisson process with traits drawn from the ground measure,
we obtain a completely random measure and use the shorthand CRM(H, ν) for its law:
Θ = Σ_i θi δψi ~ CRM(H, ν). Since the traits ψi and the rates θi are independent, the CRM
is homogeneous. When the total mass Θ(Ψ) is strictly positive and finite, the corresponding
normalized CRM (NCRM) is Ξ := Θ/Θ(Ψ), which is a discrete probability measure:
Ξ = Σ_i ξi δψi, where ξi = θi/(Σ_j θj) [97, 176].
The CRM prior on Θ is typically combined with a likelihood that generates trait counts for
each data point. Let ℓ(· | θ) be a proper probability mass function on N ∪ {0} for all θ in the
support of ν. The process Xn := Σ_i xni δψi collects the trait counts, where xni | Θ ~ ℓ(· | θi)
3
Conversely, some important priors, such as Pitman-Yor processes, are not CRMs or their normalizations
and are outside the scope of the present paper [8, 129, 164].
4
For brevity, we do not consider the fixed-location and deterministic components of a CRM [109]. When
these are purely atomic, they can be added to our analysis without undue effort.
independently across atom index i and i.i.d. across data index n. We denote the distribution
of Xn as LP(ℓ, Θ), which we call the likelihood process. Together, the prior on Θ and likelihood
on X given Θ form a generative model for allocation of data points to traits; hence, this
generative model is a special case of a trait allocation model [33]. Analogously, when the trait
counts are restricted to {0, 1}, this generative model represents a special case of a feature
allocation model.
Since the trait counts are typically just a latent component in a full generative model
specification, we define the observed data to be Yn | Xn ~ f(· | Xn), independently across n,
for a probability kernel
f (dY | X). Consider the topic modeling example: θi represents the rate of topic ψi in
a document corpus; Θ captures the rates of all topics; Xn captures how many words in
document n are generated from each topic; and Yn gives the observed collection of words for
that document.
Finite approximations. Since the set {θi}_{i=1}^{∞} is countably infinite, it is not possible to
simulate or perform posterior inference for every θi. One approximation scheme uses a finite
approximation ΘK := Σ_{i=1}^{K} ρi δψi. The atom sizes {ρi}_{i=1}^{K} are designed so that ΘK is a good
approximation of Θ in a suitable sense. Since it involves a finite number of parameters
unlike Θ, ΘK can be used directly in standard posterior approximation schemes such as
Markov chain Monte Carlo or variational Bayes. But not using the full CRM Θ introduces
approximation error.
A truncated finite approximation [TFA; 7, 34, 55, 157, 180] requires constructing an ordering
on the set of rates from the Poisson process; let (θi)_{i=1}^{∞} be the corresponding sequence of
rates. The approximation uses ρi = θi for i up to some K; i.e. one keeps the first K rates in
the sequence and ignores the remaining ones. We refer to the number of instantiated atoms
K as the approximation level. Campbell et al. [34] categorizes and analyzes TFAs. TFAs
offer an attractive nested structure: to refine an existing truncation, it suffices to generate
the additional terms in the sequence. However, the complex dependencies between the rates
i=1 potentially make inference more challenging.
(θi )K
We instead develop a family of independent finite approximations (IFAs). An IFA is defined
by a sequence of probability measures ν1, ν2, . . . such that at approximation level K, there are
K atoms whose weights are drawn i.i.d.: ρ1, . . . , ρK ~ νK. The probability measures are chosen
so that the sequence of approximations converges in distribution to the target CRM: ΘK →_D Θ
as K → ∞. For random measures, convergence in distribution can also be characterized by
convergence of integrals under the measures [104, Lemma 12.1 and Theorem 16.16]. The
advantages and disadvantages of IFAs reverse those of TFAs: the atoms are now i.i.d.,
potentially making inference easier, but a completely new approximation must be constructed
if K changes.
Next consider approximating an NCRM Ξ = Σ_i ξi δψi, where ξi = θi/(Σ_j θj), with a finite
approximation. A normalized TFA might be defined in one of two ways. In the first approach,
the rates {ρi}_{i=1}^{K} that target the CRM rates {θi}_{i=1}^{∞} are normalized to form the NCRM
approximation; i.e. the approximation has atom sizes ρi/Σ_{j=1}^{K} ρj [34]. The second approach
directly constructs an ordering over the sequence of normalized rates ξi and truncates this
representation.5 We construct normalized IFAs in a similar manner to the first TFA approach:
5
In this case, Σ_{i=1}^{K} ξi < 1. Therefore, setting the final atom size in the NCRM approximation to be
the NCRM approximation has atom sizes ρi/Σ_{j=1}^{K} ρj, where {ρi}_{i=1}^{K} are the IFA rates.
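For intuition on normalized IFAs, consider the gamma process with mass γ: a classical K-atom IFA draws the rates i.i.d. from Gamma(γ/K, 1), and normalizing them yields a symmetric Dirichlet(γ/K, . . . , γ/K) weight vector, the familiar finite-dimensional approximation to the Dirichlet process. A minimal sketch (our own illustration, not code from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_mass, K = 3.0, 500

# IFA rates for a gamma process: rho_i ~ iid Gamma(gamma/K, 1).
rho = rng.gamma(shape=gamma_mass / K, scale=1.0, size=K)

# Normalized IFA atom sizes: one draw of Dirichlet(gamma/K, ..., gamma/K).
xi = rho / rho.sum()
print(xi.sum())  # 1.0: a valid discrete probability vector
```

Because the shape parameter γ/K is tiny, most normalized weights are negligible and a few atoms carry almost all the mass, mimicking the sparsity of Dirichlet process weights.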
In the past, independent finite approximations have largely been developed on a case-by-case
basis [1, 27, 124, 155]. Our goal is to provide a general-purpose mechanism. Lijoi et al. [131]
and Lee et al. [125] have also recently pursued a more general construction, but we believe
there remains room for improvement. Lijoi et al. [131] focus on NCRMs for clustering; it is not
immediately clear how to adapt this work for inference in trait allocation models. Also, Lijoi
et al. [131, Theorem 1] employ infinitely divisible random variables. Since infinitely divisible
distributions that are not Dirac measures cannot have bounded support, the approximate
rates {ρi}_{i=1}^{K} are not naturally compatible with the trait likelihood ℓ(· | θ) if the support of
the rate measure ν is bounded. But the support of ν is often bounded in applications to trait
allocation models; e.g., θi may represent a feature frequency, taking values in [0, 1], and ℓ(· | θ)
may take the form of a Bernoulli, binomial, or negative binomial distribution. Therefore,
applications of the finite approximations of Lijoi et al. [131, Theorem 1] to these models
may require some additional work. The construction in Lee et al. [125, Proposition 3.2]
yields {ρi}_{i=1}^{K} that are compatible with ℓ(· | θ) and recovers important cases in the literature.
However, outside these special cases, it is unknown if the i.i.d. distributions are tractable
because the densities νK are not explicitly defined; see the discussion around eq. (2.3) for
more details.
Example 2.2.1 (Running example: beta process). For concreteness, we consider the (three-
parameter) beta process 6 [28, 195] as a running example of a CRM. The process BP(γ, α, d)
is defined by a mass parameter γ > 0, discount parameter d ∈ [0, 1), and concentration
parameter α > −d. It has rate measure

  ν(dθ) = γ [Γ(α + 1) / (Γ(1 − d)Γ(α + d))] 1{0 ≤ θ ≤ 1} θ^{−d−1} (1 − θ)^{α+d−1} dθ.    (2.1)
The d = 0 case yields the standard beta process [82, 198]. The beta process is typically paired
with the Bernoulli likelihood process with conditional distribution ℓ(x | θ) = θx (1−θ)1−x 1{x ∈
{0, 1}}. The resulting beta–Bernoulli process has been used in factor analysis models [55, 157]
and for dictionary learning [210].
2.3.1 Applying our approximation to CRMs
Formally, we define IFAs in terms of a fixed, diffuse probability measure H and a sequence of
probability measures ν1 , ν2 , . . . . The K-atom IFA ΘK is
  ΘK := Σ_{i=1}^{K} ρi δψi,   ρi ~ i.i.d. νK,   ψi ~ i.i.d. H,
which we write as ΘK ~ IFAK(H, νK). We consider CRM rate measures ν with densities that,
near zero, are (roughly) proportional to θ^{−1−d}, where d ∈ [0, 1) is the discount parameter.
We will propose a general construction for IFAs given a target random measure and prove
that it converges to the target (Theorem 2.3.1). We first summarize our requirements for
which CRMs we approximate in Assumption 2.3.1. We show in section 2.A that popular
BNP priors satisfy Assumption 2.3.1; specifically, we check the beta, gamma [59, 110, 200],
generalized gamma [26], beta prime [27], and PG(α, ζ)-generalized gamma [95] processes.
Assumption 2.3.1. For d ∈ [0, 1) and η ∈ V ⊆ R^d, we take Θ ~ CRM(H, ν(·; γ, d, η)) for

  ν(dθ; γ, d, η) := γ θ^{−1−d} g(θ)^{−d} [h(θ; η) / Z(1 − d, η)] dθ

such that

1. for ξ > 0 and η ∈ V, Z(ξ, η) := ∫ θ^{ξ−1} g(θ)^{ξ} h(θ; η) dθ < ∞;
2. g is continuous, g(0) = 1, and there exist constants 0 < c∗ ≤ c∗ < ∞ such that
   c∗ ≤ g(θ)^{−1} ≤ c∗(1 + θ);
3. there exists ϵ > 0 such that for all η ∈ V, the map θ ↦ h(θ; η) is continuous and bounded
   on [0, ϵ].
Other than the discount d and mass γ, the rate measure ν potentially depends on additional
hyperparameters η. The finiteness of the normalizer Z is necessary in defining
finite-dimensional distributions whose densities are similar in form to ν. The conditions on
the behaviors of g(θ) and h(θ; η) ensure that the overall rate measure’s behavior near θ = 0
is dominated by the θ−1−d term. The support of the rate measure is implicitly determined by
h(θ; η).
Given a CRM satisfying Assumption 2.3.1, we can construct a sequence of IFAs that converge
in distribution to that CRM.
Theorem 2.3.1. Suppose Assumption 2.3.1 holds. Let

  Sb(θ) = exp( −1 / (1 − (θ − b)²/b²) + 1 )  if θ ∈ (0, b),   Sb(θ) = 1{θ > 0}  otherwise.    (2.2)
See section 2.B.1 for a proof of Theorem 2.3.1. We choose the particular form of Sb (θ) in
eq. (2.2) for concreteness and convenience. But our theory still holds for a more general class
of Sb forms, as we describe in more detail in the proof of Theorem 2.3.1.
Definition 2.3.2. We call the K-atom IFA resulting from Theorem 2.3.1 the automated IFA
(AIFAK ).
Although the normalization constant ZK is not always available analytically, numerical imple-
mentation remains straightforward. When ZK is a quantity of interest, such as in section 2.6.4,
we estimate it using standard numerical integration schemes for a one-dimensional integral
[160, 204]. For other tasks, we need not access ZK directly. In our experiments, we show that
we can use either Markov chain Monte Carlo (sections 2.6.1 and 2.6.5) or variational Bayes
(sections 2.6.2 and 2.6.3) with the unnormalized density.
To illustrate our construction, we next apply Theorem 2.3.1 to BP(γ, α, d) from Example 2.2.1.
In section 2.A, we show how to construct AIFAs for the beta prime, gamma, generalized
gamma, and PG(α, ζ)-generalized gamma processes.
Example 2.3.1 (Beta process AIFA). To apply Assumption 2.3.1, let η = α + d, V = R_+,
g(θ) = 1, h(θ; η) = (1 − θ)^{η−1} 1{θ ≤ 1}, and Z(ξ, η) equal the beta function B(ξ, η). Then
the CRM rate measure ν in Assumption 2.3.1 corresponds to that of BP(γ, α, d) from
Example 2.2.1. Note that we make no additional restrictions on the hyperparameters γ, α, d
beyond those in the original CRM (Example 2.2.1). Observe that h is continuous and bounded
on [0, 1/2], and the normalization function B(ξ, η) is finite for ξ > 0, η ∈ V ; it follows that
Assumption 2.3.1 holds. By Theorem 2.3.1, then, the AIFA density is
(1/Z_K) θ^{−1+c/K−d S_{1/K}(θ−1/K)} (1 − θ)^{α+d−1} 1{0 ≤ θ ≤ 1} dθ,

where c := γ/B(α + d, 1 − d) and Z_K is the normalization constant. The density does not in
general reduce to a beta distribution in θ due to the θ in the exponent.
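To make the construction concrete, here is a small Python sketch (function names are ours) of the smoothed indicator S_b from eq. (2.2) and the unnormalized beta-process AIFA density from this example; when d = 0 the density reduces to an unnormalized Beta(γα/K, α) density, which the final check below confirms.

```python
import math

def smoothed_indicator(b, t):
    # S_b from eq. (2.2): 0 for t <= 0, a smooth ramp on (0, b), 1 for t >= b.
    if t <= 0.0:
        return 0.0
    if t >= b:
        return 1.0
    return math.exp(-1.0 / (1.0 - (t - b) ** 2 / b ** 2) + 1.0)

def beta_fn(a, b):
    # Beta function B(a, b) via log-gamma for numerical stability.
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def aifa_beta_unnorm(theta, gamma_, alpha, d, K):
    # Unnormalized AIFA density for BP(gamma, alpha, d) from Example 2.3.1:
    # theta^(-1 + c/K - d * S_{1/K}(theta - 1/K)) * (1 - theta)^(alpha + d - 1),
    # with c = gamma / B(alpha + d, 1 - d).
    c = gamma_ / beta_fn(alpha + d, 1.0 - d)
    expo = -1.0 + c / K - d * smoothed_indicator(1.0 / K, theta - 1.0 / K)
    return theta ** expo * (1.0 - theta) ** (alpha + d - 1.0)
```

Near θ = 0 the exponent is −1 + c/K, so the density is integrable for every finite K; for θ ≥ 2/K the exponent is −1 + c/K − d, matching the power-law behavior of the target rate measure.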
Comparison to an alternative IFA construction. Lee et al. [125, Proposition 3.2] verify
the validity of a different IFA construction. Their construction requires two functions: (1)
a bivariate function Λ(θ, t) such that, for any t > 0, ∆(t) := ∫ Λ(θ, t) ν(dθ) < ∞; and (2) a
univariate function f(n) such that ∆(f(n)) is bounded above and below by constant multiples of n as
n → ∞. If these functions exist and

ν̃_K(dθ) := Λ(θ, f(K)) ν(dθ) / ∆(f(K)),    (2.3)

then Lee et al. [125, Proposition 3.2] show that IFA_K(H, ν̃_K) converges in distribution to CRM(H, ν)
as K → ∞. The usability of eq. (2.3) in practice depends on the tractability of Λ and f .
There are typically many tractable Λ(θ, t) [125, Section 4]. Proposition B.2 of Lee et al. [125]
lists tractable f for the important cases of the beta process and generalized gamma
process with d > 0. However, the choice of f provided there for general power-law processes is
not tractable because its evaluation requires computing complicated inverses in the asymptotic
regime. Furthermore, for processes without power laws, no general recipe for f is known.
In contrast, the AIFA construction in Theorem 2.3.1 always yields densities that can be
evaluated up to proportionality constants.
Example 2.3.2 (Beta process: an IFA comparison). We next compare our beta process
AIFA to the two separate IFAs proposed by Lee et al. [125] and Lee et al. [124] for disjoint
subcases within the case d > 0. First consider the subcase where α = 0, d > 0. Lee et al.
[124] derive what we call the BFRY IFA. The IFA density, denoted ν_BFRY(dθ), is equal to

(γ/K) (θ^{−d−1}(1 − θ)^{d−1}/B(d, 1 − d)) [1 − exp(−(KΓ(d)d/γ)^{1/d} θ/(1 − θ))] 1{0 ≤ θ ≤ 1} dθ.    (2.4)
Second, consider the subcase where α > 0, d > 0. Lee et al. [125, Section 4.5] derive another
K-atom IFA, which we call the generalized Pareto IFA (GenPar IFA). The IFA density,
denoted ν_GenPar(dθ), is equal to

(γ/K) (θ^{−d−1}(1 − θ)^{α+d−1}/B(1 − d, α + d)) [1 − (1 + (Kd/(γα)) θ/(1 − θ))^{−1/d}] 1{0 ≤ θ ≤ 1} dθ.    (2.5)
Since the BFRY IFA and GenPar IFA apply to disjoint hyperparameter regimes, they are not
directly comparable. Since our AIFA applies to the whole domain α ≥ −d, we can separately
compare it to each of these alternative IFAs; we also highlight that the AIFA still applies
when α ∈ (−d, 0), a case not covered by either the BFRY IFA or GenPar IFA.
We find in Section 2.6.3 that the AIFA and BFRY IFA have comparable predictive performance;
the AIFA and GenPar IFA also have comparable predictive performance. But in Section 2.6.4,
we show that the AIFA is much more reliable than the BFRY IFA or the GenPar IFA
for estimating the discount (d) hyperparameter by maximizing the marginal likelihood.
Conversely, sampling from a BFRY IFA or GenPar IFA prior is easier than sampling from an
AIFA prior since the BFRY and GenPar IFA priors are formed from standard distributions.
µ(θ) ∈ R^{D′} and ln θ form the vector of natural parameters (µ(θ), ln θ)^T, and ⟨µ(θ), t(x)⟩
denotes the standard Euclidean inner product. The rate measure nearly matches the form of
the conjugate prior, but behaves like θ^{−1} near 0:

ν(dθ) := γ′ θ^{−1} exp(⟨(ψ, λ)^T, (µ(θ), −A(θ))^T⟩) 1{θ ∈ U} dθ,    (2.7)

where γ′ > 0, λ > 0, ψ ∈ R^{D′}, and U ⊆ R_+ is the support of ν. eq. (2.7) leads to the
suggestive terminology of exponential family CRMs. The θ−1 dependence near 0 means
that these models lack power-law behavior. Models that can be cast in this form include
the standard beta process with Bernoulli or negative binomial likelihood [27, 211] and the
gamma process with Poisson likelihood [1, 180]. We refer to these models as, respectively,
the beta–Bernoulli, beta–negative binomial, and gamma–Poisson processes.
We now specialize Assumption 2.3.1 and Theorem 2.3.1 to exponential family CRMs in
Assumption 2.3.2 and Corollary 2.3.3, respectively.
Assumption 2.3.2. Let ν be of the form in eq. (2.7) and assume that

1. for any ξ > −1 and any η = (ψ, λ)^T with λ > 0, the normalizer defined as

Z(ξ, η) := ∫_U θ^{ξ} exp(⟨η, (µ(θ), −A(θ))^T⟩) dθ    (2.8)

is finite, and

2. there exists ϵ > 0 such that, for any η = (ψ, λ)^T with λ > 0, the map

ς : θ ↦ exp(⟨η, (µ(θ), −A(θ))^T⟩) 1{θ ∈ U}

is a continuous and bounded function of θ on [0, ϵ].
Corollary 2.3.3. Suppose Assumption 2.3.2 holds. For c := γ′ ς(0), let

ν_K(θ) := θ^{c/K−1} ς(θ)/Z(c/K − 1, η).    (2.9)

If Θ_K ∼ IFA_K(H, ν_K), then Θ_K converges in distribution to Θ.
The density in eq. (2.9) is almost the same as the rate measure of eq. (2.7), except the θ^{−1}
term has become θ^{c/K−1}. As a result, eq. (2.9) is a proper exponential-family distribution.
In section 2.A, we detail the corresponding d = 0 special cases of the AIFA for beta prime,
gamma, generalized gamma, and PG(α,ζ)-generalized gamma processes. We cover the beta
process case next.
Example 2.3.3 (Beta process AIFA for d = 0). Corollary 2.3.3 is sufficient to recover known
IFA results for BP(γ, α, 0): when d = 0, the AIFA from Example 2.3.1 simplifies to ν_K =
Beta(γα/K, α). Doshi-Velez et al. [55] approximate BP(γ, 1, 0) with ν_K = Beta(γ/K, 1).
For BP(γ, α, 0), Griffiths and Ghahramani [78] set ν_K = Beta(γα/K, α), and Paisley and
Carin [155] use ν_K = Beta(γα/K, α(1 − 1/K)). The difference between Beta(γα/K, α) and
Beta(γα/K, α(1 − 1/K)) is negligible for moderately large K.
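As a quick check on this d = 0 approximation, the expected number of features active among N observations has a closed form under the Beta(γα/K, α) IFA with a Bernoulli likelihood: each of the K features is active with probability 1 − B(γα/K, α + N)/B(γα/K, α), since E[(1 − θ)^N] = B(a, b + N)/B(a, b) under Beta(a, b). The sketch below (ours) verifies numerically that, as K grows, this expectation approaches the nonparametric growth rate γ Σ_{n=1}^N α/(n − 1 + α) of the target beta process.

```python
import math

def log_beta(a, b):
    # log of the beta function B(a, b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def expected_active_features(N, K, gamma_, alpha):
    # Under the Beta(gamma*alpha/K, alpha) IFA prior with Bernoulli likelihood,
    # a feature with weight theta stays inactive in all N rounds w.p. E[(1-theta)^N].
    a = gamma_ * alpha / K
    p_inactive = math.exp(log_beta(a, alpha + N) - log_beta(a, alpha))
    return K * (1.0 - p_inactive)

def bp_growth(N, gamma_, alpha):
    # Expected number of features under the target beta process prior.
    return gamma_ * sum(alpha / (n - 1 + alpha) for n in range(1, N + 1))
```

A small K undercounts the expected number of features, and the gap closes as K increases.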
We can also use Corollary 2.3.3 to create a new finite approximation for a nonparametric
process so far not explored in the Bayesian nonparametric literature.
Example 2.3.4 (CMP likelihood and extended gamma process). The CMP likelihood [186]
is given by

ℓ(x | θ) = θ^x / ((x!)^τ Z_τ(θ)),  where Z_τ(θ) := Σ_{y=0}^∞ θ^y/(y!)^τ.    (2.10)

The conjugate CRM prior, which we call an extended gamma (or Xgamma) process, has four
hyperparameters: mass γ, concentration c, maximum T, and shape τ; its rate measure is given in eq. (2.11).
Unlike existing BNP models, the model in eqs. (2.10) and (2.11), which we call Xgamma–CMP
process, is able to capture different dispersion regimes. For τ < 1, the variance of the counts
from ℓ(x | θ) is larger than the mean of the counts, corresponding to overdispersion. For τ > 1,
the variance of the counts from ℓ(x | θ) is smaller than the mean of the counts, corresponding
to underdispersion. As we show in section 2.6.5, the latent shape τ can be inferred using
observed data. Broderick et al. [27], Zhou et al. [211] provide BNP trait allocation models
that handle overdispersion. Canale and Dunson [35] provide a BNP model that handles both
underdispersion and overdispersion, but for clustering rather than traits. We are not aware
of trait allocation models that handle underdispersion, or any trait allocation models that
handle both underdispersion and overdispersion. Following the approach of Broderick et al.
[30], in section 2.D we show that as long as γ > 0, c > 0, T ≥ 1, and τ > 0, the total mass of
the rate measure is infinite and the number of active traits is almost surely finite. Under
these conditions, we show in section 2.A that Corollary 2.3.3 applies to the CRM in eq. (2.11),
and we construct the resulting AIFA.
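The dispersion claims are easy to verify numerically. The Python sketch below (ours) computes the mean and variance of the CMP p.m.f. in eq. (2.10) by truncating the series Z_τ(θ): τ = 1 recovers the Poisson distribution, τ < 1 gives variance above the mean (overdispersion), and τ > 1 gives variance below the mean (underdispersion).

```python
import math

def cmp_moments(theta, tau, x_max=200):
    # Mean and variance of the CMP distribution from eq. (2.10),
    # with the normalizer Z_tau(theta) truncated at x_max terms.
    log_w = [x * math.log(theta) - tau * math.lgamma(x + 1) for x in range(x_max + 1)]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]   # numerically stable unnormalized weights
    z = sum(w)
    mean = sum(x * wx for x, wx in enumerate(w)) / z
    second = sum(x * x * wx for x, wx in enumerate(w)) / z
    return mean, second - mean * mean
```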
the counts ni . Similarly, we let pK (n1 , n2 , . . . , nb ) be the EPPF for the normalized AIFAK .
Note that pK (n1 , n2 , . . . , nb ) = 0 when K < b since the normalized AIFAK at approximation
level K generates at most K blocks.
Theorem 2.3.4. Suppose Assumption 2.3.1 holds. Take any positive integers N, b, {n_i}_{i=1}^b
such that b ≤ N, n_i ≥ 1, and Σ_{i=1}^b n_i = N. Let p be the EPPF of the NCRM Ξ := Θ/Θ(Ψ).
If Θ_K is the AIFA for Θ at approximation level K, and p_K is the EPPF for the corresponding
NCRM approximation Θ_K/Θ_K(Ψ), then

lim_{K→∞} p_K(n_1, n_2, . . . , n_b) = p(n_1, n_2, . . . , n_b).
See section 2.B.3 for the proof. Since the EPPF gives the probability of each partition, the
point-wise convergence in Theorem 2.3.4 certifies that the distribution over partitions induced
by sampling from the normalized AIFAK converges to that induced by sampling from the
target NCRM, for any finite sample size N .
Θ ∼ CRM(H, ν),
X_n | Θ ∼ LP(ℓ, Θ)  i.i.d. across n = 1, 2, . . . , N,    (2.12)
Y_n | X_n ∼ f(· | X_n)  independently across n = 1, 2, . . . , N.

Θ_K ∼ AIFA_K(H, ν_K),
Z_n | Θ_K ∼ LP(ℓ, Θ_K)  i.i.d. across n = 1, 2, . . . , N,    (2.13)
W_n | Z_n ∼ f(· | Z_n)  independently across n = 1, 2, . . . , N.
Active traits in the approximate model are collected in Zn and observations are Wn . Let PN,∞
be the marginal distribution of the observations Y1:N and PN,K be the marginal distribution
of the observations W_{1:N}. The approximation error we analyze is the total variation distance
d_TV(P_{N,K}, P_{N,∞}) := sup_{0≤g≤1} |∫ g dP_{N,K} − ∫ g dP_{N,∞}| between the two observational processes,
one using the CRM and the other using the approximate AIFA_K as the prior. Total
variation is a standard choice of error when analyzing CRM approximations [34, 55, 89, 157].
Small total variation distance implies small differences in expectations of bounded functions.
Conditions. In our analysis, we focus on exponential family CRMs and conjugate like-
lihood processes. We will suppose Assumption 2.3.2 holds. Our analysis guarantees that
dTV (PN,K , PN,∞ ) is small whenever a conjugate exponential family CRM–likelihood pair and
the corresponding AIFA model satisfy certain conditions, beyond those already stated in
Assumption 2.3.2. In the proof of the error bound, these conditions serve as intermediate
results that ultimately lead to small approximation error. Because we can verify the conditions
for common models, we have error bounds in the most prevalent use cases of CRMs. To
express these conditions, we use the marginal process representation of the target and the
approximate model, i.e., the series of conditional distributions of Xn | X1:(n−1) (or Zn | Z1:(n−1) )
with Θ (or ΘK ) integrated out. Corollary 6.2 of Broderick et al. [30] guarantees that the
marginal Xn | X1:(n−1) is a random measure with finite support and with a convenient form.
Since we will use this form to write our conditions (Condition 2.4.1 below), we first review
the requisite notation — and establish analogous notation for Zn | Z1:(n−1) .
We start by defining h and M to describe the conditional distribution X_n | X_{1:(n−1)}. Let
K_{n−1} be the number of unique atom locations in X_1, X_2, . . . , X_{n−1}, and let {ζ_i}_{i=1}^{K_{n−1}} be the
collection of unique atom locations in X_1, X_2, . . . , X_{n−1}. Fix an atom location ζ_j (the choice
of j does not matter). For m with 1 ≤ m ≤ n, let xm be the atom size of Xm at atom
location ζj ; xm may be zero if there is no atom at ζj in Xm . The distribution of xn depends
only on the x1:(n−1) values, which are the atom sizes of previous measures Xm at ζj . We use
h(x | x1:(n−1) ) to denote the probability mass function (p.m.f.) of xn at value x. Furthermore,
Xn has a finite number of new atoms, which can be grouped together by atom size. Consider
any potential atom size x ∈ N. Define pn,x to be the number of atoms of size x. Regardless
of atom size, each atom location is a fresh draw from the ground measure H and pn,x is
Poisson-distributed; we use Mn,x to denote the mean of pn,x .
Next, we define h̃, which governs the conditional distribution of Z_n | Z_{1:(n−1)}. Let 0_{n−1} be the
zero vector with n − 1 components. Although h(x | x_{1:(n−1)}) is defined only for count vectors
x_{1:(n−1)} that are not identically zero, we will see that h̃(x | 0_{n−1}) is well-defined. In particular,
let {ζ_i}_{i=1}^{K_{n−1}} be the union of atom locations in Z_1, Z_2, . . . , Z_{n−1}. Fix an atom location ζ_j. For
1 ≤ m ≤ n, let x_m be the atom size of Z_m at atom location ζ_j. We write the p.m.f. of x_n
at x as h̃(x | x_{1:(n−1)}). In addition, Z_n also has a maximum of K − K_{n−1} new atoms with
locations disjoint from {ζ_i}_{i=1}^{K_{n−1}}, and the distribution of atom sizes is governed by h̃(x | 0_{n−1}).
Note that we reuse the x_n and ζ_j notation from X_n | X_{1:(n−1)} without risk of confusion, since
x_n and ζ_j are dummy variables whose meanings are clear given the context of h or h̃.
In section 2.C, we describe the marginal processes in more detail and give formulas for h, h̃,
and M_{n,x} in terms of the functions that parametrize eqs. (2.6) and (2.7) and the normalizer
eq. (2.8). For the beta–Bernoulli process with d = 0, the functions have particularly convenient
forms.
Example 2.4.1. For the beta–Bernoulli model with d = 0, we have

h(x | x_{1:(n−1)}) = (Σ_{i=1}^{n−1} x_i)/(α − 1 + n) · 1{x = 1} + (α + Σ_{i=1}^{n−1}(1 − x_i))/(α − 1 + n) · 1{x = 0},

h̃(x | x_{1:(n−1)}) = (Σ_{i=1}^{n−1} x_i + γα/K)/(α − 1 + n + γα/K) · 1{x = 1} + (α + Σ_{i=1}^{n−1}(1 − x_i))/(α − 1 + n + γα/K) · 1{x = 0},

M_{n,1} = γα/(α − 1 + n),  M_{n,x} = 0 for x > 1.
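These predictive rules are simple enough to implement directly. The sketch below (ours) encodes h, h̃, and M_{n,1} and checks that both predictives are proper p.m.f.s and that h̃ approaches h as K → ∞.

```python
def h_target(x, xs, alpha):
    # Target predictive p.m.f. of the next count at a seen atom (Example 2.4.1);
    # xs holds the previous n-1 binary counts at that atom.
    n = len(xs) + 1
    if x == 1:
        return sum(xs) / (alpha - 1 + n)
    return (alpha + sum(1 - xi for xi in xs)) / (alpha - 1 + n)

def h_approx(x, xs, alpha, gamma_, K):
    # AIFA predictive p.m.f.; well-defined even for an all-zero history.
    n = len(xs) + 1
    denom = alpha - 1 + n + gamma_ * alpha / K
    if x == 1:
        return (sum(xs) + gamma_ * alpha / K) / denom
    return (alpha + sum(1 - xi for xi in xs)) / denom

def M_new(n, alpha, gamma_):
    # Poisson mean M_{n,1} for the number of new size-1 atoms in X_n.
    return gamma_ * alpha / (alpha - 1 + n)
```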
We now formulate conditions on h, h̃, and M_{n,x} that will yield small d_TV(P_{N,K}, P_{N,∞}).

Condition 2.4.1. There exists a constant C_1 > 0 such that:

1. for all n ∈ N,

Σ_{x=1}^∞ M_{n,x} ≤ C_1/(n − 1 + C_1);    (2.14)

2. for all n ∈ N,

Σ_{x=1}^∞ h̃(x | x_{1:(n−1)} = 0_{n−1}) ≤ (1/K) · C_1/(n − 1 + C_1);    (2.15)

3. for all n ∈ N and all admissible count vectors x_{1:(n−1)},

Σ_{x=0}^∞ |h(x | x_{1:(n−1)}) − h̃(x | x_{1:(n−1)})| ≤ (1/K) · C_1/(n − 1 + C_1);    (2.16)

4. for all n ∈ N,

Σ_{x=1}^∞ |M_{n,x} − K h̃(x | 0_{n−1})| ≤ (1/K) · C_1/(n − 1 + C_1).    (2.17)
Note that the conditions depend only on the functions governing the exponential family
CRM prior and its conjugate likelihood process — and not on the observation likelihood
f. eq. (2.14) constrains the growth rate of the target model since Σ_{n=1}^N Σ_{x=1}^∞ M_{n,x} is the
expected number of components for data cardinality N. Because each Σ_{x=1}^∞ M_{n,x} is at most
O(1/n), the total number of components after N samples is O(ln N). Similarly, eq. (2.15)
constrains the growth rate of the approximate model. The third condition (eq. (2.16)) ensures
that h̃ is a good approximation of h in total variation distance and that there is also a
reduction in the error as n increases. Finally, eq. (2.17) implies that K h̃(x | 0_{n−1}) is an
accurate approximation of M_{n,x}, and there is also a reduction in the error as n increases.
We show that Condition 2.4.1 holds for the most commonly used non-power-law CRM models;
see Example 2.4.2 for the case of the beta–Bernoulli model with discount d = 0 and section 2.F
for the beta–negative binomial and gamma–Poisson models with d = 0. As we detail next, we
believe Condition 2.4.1 is also reasonable beyond these common models. The O(1/n) quantity
in eq. (2.14) is the typical expected number of new features after observing n observations
in non-power-law BNP models. eqs. (2.15) to (2.17) are likely to hold when h̃ is a small
perturbation of h and K h̃ is a small perturbation of M_{n,x}. For instance, in Example 2.4.1,
the functional form of h̃ is very similar to that of h, except that h̃ has the additional γα/K
term in both numerator and denominator. The functional form of K h̃ is very similar to that
of M_{n,x}, except that K h̃ has an additional γα/K term in the denominator.
Example 2.4.2 (Beta–Bernoulli with d = 0, continued). The growth rate of the target model
is

Σ_{x=1}^∞ M_{n,x} = M_{n,1} = γα/(n − 1 + α).

Since h̃ is supported on {0, 1}, the growth rate of the approximate model is

h̃(1 | x_{1:(n−1)} = 0_{n−1}) = (γα/K)/(α − 1 + n + γα/K) ≤ (1/K) · γα/(n − 1 + α).

Finally,

M_{n,1} − K h̃(1 | x_{1:(n−1)} = 0_{n−1}) = γα/(α − 1 + n) − γα/(α − 1 + n + γα/K) ≤ (γ²α/K) · 1/(n − 1 + α).
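The displayed bounds can be checked numerically over a grid of n and K; the sketch below (ours) does so for the beta–Bernoulli model with d = 0.

```python
def check_growth_bounds(gamma_, alpha, n_max=200, Ks=(1, 5, 50, 500)):
    # Verify, for the beta-Bernoulli model with d = 0, that
    #   h_tilde(1 | 0_{n-1}) <= (1/K) * gamma*alpha/(n-1+alpha)
    #   M_{n,1} - K*h_tilde(1 | 0_{n-1}) <= (gamma^2*alpha/K) / (n-1+alpha)
    for K in Ks:
        for n in range(1, n_max + 1):
            m = gamma_ * alpha / (alpha - 1 + n)                            # M_{n,1}
            ht = (gamma_ * alpha / K) / (alpha - 1 + n + gamma_ * alpha / K)
            if ht > (1.0 / K) * gamma_ * alpha / (n - 1 + alpha) + 1e-15:
                return False
            if m - K * ht > (gamma_ ** 2 * alpha / K) / (n - 1 + alpha) + 1e-12:
                return False
    return True
```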
See section 2.G.1 for explicit values of the constants as well as the proof. Theorem 2.4.1
states that the AIFA approximation error grows as O(ln² N) with fixed K, and decreases as
O(ln K/K) for fixed N. The bound accords with our intuition that, for fixed K, the error
should increase as N increases: with more data, the expected number of latent components in
the data increases, demanding finite approximations of increasingly larger sizes. In particular,
O(ln N) is the standard Bayesian nonparametric growth rate for non-power-law models. It is
likely that the O(ln² N) factor can be improved to O(ln N) due to O(ln N) being the natural
growth rate; more generally, we conjecture that the error directly depends on the expected
number of latent components in a model for N observations. On the other hand, for fixed N ,
we expect that error should decrease as K increases and the approximation thus has greater
capacity. This behavior also matches Theorem 2.3.1, which guarantees that sufficiently large
finite models have small error.
We highlight that Theorem 2.4.1 provides upper bounds both (i) for approximations that
were already known in the literature but where bounds were not already known, as in the
case of the beta–negative binomial process, and (ii) for processes and approximations not
previously studied in the literature in any form.
Lower bounds. From the upper bound in Theorem 2.4.1, we know how to set a sufficient
number of atoms for accurate approximations: for the total variation to be less than some ϵ, we
solve for the smallest K such that the right hand side of Theorem 2.4.1 is smaller than ϵ. We
now derive lower bounds on the AIFA approximation error to characterize a necessary number
of atoms for accurate approximations, by looking at worst-case observational likelihoods
f . In particular, Theorem 2.4.1 implies that an AIFA with K = O (poly(ln N )/ϵ) atoms
suffices in approximating the target model to less than ϵ error. In Theorem 2.4.2 below, we
establish that K must grow at least at a ln N rate in the worst case. In Theorem 2.4.3 below,
we establish that the 1/ϵ term is necessary. To the best of our knowledge, Theorems 2.4.2
and 2.4.3 are the first lower bounds on IFA approximation error for any process.
Our lower bounds apply to the beta–Bernoulli process with d = 0. Recall that P_{N,∞} is the
distribution of Y_{1:N} from eq. (2.12) while P_{N,K} is the distribution of W_{1:N} from eq. (2.13). In
what follows, P^{BP}_{N,∞} refers to the marginal distribution of the observations that arises when
we use the prior BP(γ, α, 0). Analogously, P^{BP}_{N,K} is the observational distribution that arises
when we use the AIFA_K approximation in Example 2.3.1. The observational likelihood f
will be clear from context. The worst-case observational likelihoods f are pathological. We
leave to future work lower bounds on the approximation error when more common likelihoods
f, such as Gaussian or Dirichlet, are used.
For the first result, it will be useful to define the growth function, for any N ∈ N and α > 0:

C(N, α) := Σ_{n=1}^N α/(n − 1 + α).    (2.18)
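C(N, α) grows logarithmically in N; for α = 1 it is the harmonic number H_N ≈ ln N + 0.577. A quick sketch (ours):

```python
import math

def growth_C(N, alpha):
    # Growth function from eq. (2.18): C(N, alpha) = sum_{n=1}^N alpha / (n - 1 + alpha).
    return sum(alpha / (n - 1 + alpha) for n in range(1, N + 1))
```

So a threshold proportional to C(N, α), such as the one appearing in our first lower bound, forces the approximation level to grow at least at a ln N rate.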
Theorem 2.4.2 (ln N is necessary). For the beta–Bernoulli process model with d = 0,
there exists an observation likelihood f, independent of K and N, such that for any N, if
K ≤ 0.5γC(N, α), then

d_TV(P^{BP}_{N,∞}, P^{BP}_{N,K}) ≥ 1 − C/N^{γα/8},

where C is a constant depending only on γ and α.
See section 2.G.2 for the proof. The intuition is that, with high probability, the number of
features that manifest in the target X1:N is greater than 0.5γC(N, α). However, the finite
model Z1:N has fewer than 0.5γC(N, α) components. Hence, there is an event where the target
and approximation assign drastically different probability masses. Theorem 2.4.2 implies that
as N grows, if the approximation level K fails to surpass the 0.5γC(N, α) threshold, then
the total variation between the approximate and the target model remains bounded away from
zero; in fact, the error tends to one.
We next show that the 1/K factor in the upper bound from Theorem 2.4.1 is tight (up to
logarithmic factors).
Theorem 2.4.3 (Lower bound of 1/K). For the beta–Bernoulli process model with d = 0,
there exists an observation likelihood f, independent of K and N, such that for any N,

d_TV(P^{BP}_{N,∞}, P^{BP}_{N,K}) ≥ C · (1/(1 + γ/K)²) · (1/K),

where C is a constant that does not depend on K or N.
See section 2.G.2 for the proof. The intuition is that, under the pathological likelihood
f , analyzing the AIFA approximation error is the same as analyzing the binomial–Poisson
approximation error [122]. We then show that 1/K is a lower bound using the techniques
from [13]. Theorem 2.4.3 implies that an AIFA with K = Ω (1/ϵ) atoms is necessary in the
worst case.
Our lower bounds (which apply specifically to the beta–Bernoulli process) are much less
general than our upper bounds. However, as a practical matter, generality in the lower
bounds is not so crucial due to the different roles played by upper and lower bounds. Upper
bounds give control over the approximation error; this control is what is needed to trust
the approximation and to set the approximation level. Whether or not we have access to
lower bounds, general-purpose upper bounds give us this control. Lower bounds, on the other
hand, serve as a helpful check that the upper bounds are not too loose — and reassure us
that we are not inefficiently using too many atoms in a too-large approximation. From that
standpoint, the need for general-purpose lower bounds is not as pressing.
The dependence on the accuracy level in the d = 0 beta–Bernoulli process is worse for AIFAs
than for TFAs. For example, consider the Bondesson approximation [24, 34] of BP(γ, α, 0);
we will see next that this approximation is a TFA with excellent error bounds.
Example 2.4.3 (Bondesson approximation [24]). Fix α ≥ 1, let E_l ∼ Exp(1) i.i.d., and let
Γ_k := Σ_{l=1}^k E_l. The K-atom Bondesson approximation of BP(γ, α, 0) is a TFA Σ_{k=1}^K θ_k δ_{ψ_k},
where θ_k := V_k exp(−Γ_k/(γα)), V_k ∼ Beta(1, α − 1) i.i.d., and ψ_k ∼ H i.i.d.
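For intuition, the Bondesson TFA is easy to simulate; the following Python sketch (ours) draws the K atom weights. Note the sequential coupling: each weight is damped by the running sum Γ_k, so the weights are not independent, in contrast to an IFA.

```python
import math
import random

def bondesson_weights(K, gamma_, alpha, rng):
    # K-atom Bondesson TFA weights for BP(gamma, alpha, 0), alpha > 1:
    # theta_k = V_k * exp(-Gamma_k / (gamma * alpha)),
    # Gamma_k = E_1 + ... + E_k with E_l iid Exp(1), V_k iid Beta(1, alpha - 1).
    weights = []
    big_gamma = 0.0
    for _ in range(K):
        big_gamma += rng.expovariate(1.0)          # Gamma_k
        v = rng.betavariate(1.0, alpha - 1.0)      # V_k ~ Beta(1, alpha - 1)
        weights.append(v * math.exp(-big_gamma / (gamma_ * alpha)))
    return weights
```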
The following result gives a bound on the error of the Bondesson approximation.
Proposition 2.4.4 ([34, Appendix A.1]). For γ > 0, α ≥ 1, let Θ_K be distributed according
to a level-K Bondesson approximation of BP(γ, α, 0), let R_n | Θ_K ∼ LP(ℓ; Θ_K) i.i.d. across n,
and let T_n | R_n ∼ f(· | R_n) independently across n, with N observations. Let Q_{N,K} be the
distribution of the observations T_{1:N}. Then:

d_TV(P^{BP}_{N,∞}, Q_{N,K}) ≤ N γ (γα/(1 + γα))^K.
Proposition 2.4.4 implies that a TFA with K = O (ln{N/ϵ}) atoms suffices in approximating
the target model to less than ϵ error. Up to log factors in N , comparing the necessary 1/ϵ
level for an AIFA and the sufficient ln (1/ϵ) level for a TFA, we conclude that the necessary
size for an AIFA is exponentially larger than the sufficient size for a TFA, in the worst-case
observational likelihood f.
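To see the gap concretely, the sketch below (ours) computes, from Proposition 2.4.4, the smallest K for which the Bondesson TFA bound Nγ(γα/(1 + γα))^K falls below ε. Halving ε adds only a constant to the sufficient TFA size, whereas the Ω(1/ε) scaling from Theorem 2.4.3 doubles the necessary AIFA size.

```python
import math

def tfa_sufficient_K(N, gamma_, alpha, eps):
    # Smallest K with N * gamma * r**K <= eps, where r = gamma*alpha/(1 + gamma*alpha)
    # (Proposition 2.4.4); grows like ln(N / eps).
    r = gamma_ * alpha / (1.0 + gamma_ * alpha)
    return max(1, math.ceil(math.log(N * gamma_ / eps) / math.log(1.0 / r)))

# TFA size grows logarithmically in 1/eps; an AIFA needs Omega(1/eps) atoms.
N_obs, g, a = 10_000, 1.0, 1.0
sizes = [tfa_sufficient_K(N_obs, g, a, 10.0 ** (-p)) for p in range(1, 7)]
```

With γ = α = 1 the ratio is r = 1/2, so each extra decade of accuracy costs only ln 10/ln 2 ≈ 3.3 additional TFA atoms.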
model of Hoffman et al. [85], Wang et al. [206], which is a variant of the hierarchical Dirichlet
process [HDP; 196] and which we refer to as the modified HDP. In the HDP, G is a population
measure with G ∼ DP(ω, H). The measure for the d-th subpopulation is Gd | G ∼ DP(α, G);
the concentrations ω and α are potentially different from each other. The modified HDP is
defined in terms of the truncated stick-breaking (TSB) approximation:
Definition 2.4.5 (Stick-breaking approximation [184]). For i = 1, 2, . . . , K − 1, let v_i ∼
Beta(1, α) i.i.d. Set v_K = 1. Let ξ_i = v_i Π_{j=1}^{i−1}(1 − v_j). Let ψ_k ∼ H i.i.d., and Ξ_K = Σ_{k=1}^K ξ_k δ_{ψ_k}. We
denote the distribution of Ξ_K as TSB_K(α, H).
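A direct simulation of Definition 2.4.5 (sketch ours; atom locations drawn from a uniform base measure for concreteness). Setting v_K = 1 makes the weights telescope to a sum of exactly one.

```python
import random

def tsb_sample(K, alpha, rng):
    # Draw (weights, atoms) from TSB_K(alpha, H) with H = Uniform(0, 1):
    # v_i ~ Beta(1, alpha) for i < K, v_K = 1, xi_i = v_i * prod_{j<i} (1 - v_j).
    weights, remaining = [], 1.0
    for i in range(K):
        v = 1.0 if i == K - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)   # xi_i
        remaining *= 1.0 - v            # unbroken stick length
    atoms = [rng.random() for _ in range(K)]
    return weights, atoms
```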
In the modified HDP, the sub-population measure is distributed as Gd | G ∼ TSBT (α, G).
Wang et al. [206] and Hoffman et al. [85] set T to be small so that inference in the modified
HDP is more efficient than in the HDP, since the number of parameters per group is greatly
reduced. From a modeling standpoint, small T is a reasonable assumption since documents
typically manifest a small number of topics from the corpus, with the total number depending
on the document length and independent of corpus size. For completeness, the generative
process of the modified HDP is
G ∼ DP(ω, H),
H_d | G ∼ TSB_T(α, G)  i.i.d. across d,
β_{dn} | H_d ∼ H_d(·)  independently across d, n,    (2.19)
W_{dn} | β_{dn} ∼ f(· | β_{dn})  independently across d, n.
Hd contains at most T distinct atom locations, all shared with the base measure G.
The finite approximation we consider replaces the population-level Dirichlet process with
FSD_K, keeping the other conditionals intact:¹³

F ∼ FSD_K(ω, H),
F_d | F ∼ TSB_T(α, F)  i.i.d. across d,
β_{dn} | F_d ∼ F_d(·)  independently across d, n,    (2.20)
W_{dn} | β_{dn} ∼ f(· | β_{dn})  independently across d, n.

¹³ Our construction in eq. (2.20) is slightly different from Eqs. 5.5 and 5.6 in Fox et al. [63]. Our document-level process F_d contains at most T topics from the underlying corpus; by contrast, the Fox et al. [63] document-level process contains as many topics as the corpus-level process. However, the novelty of eq. (2.20) is incidental since the replacement of the population-level DP with the FSD in the modified HDP is analogous to the DP case.
Theorem 2.4.6 (Upper bound for modified HDP). For some constants C′, C′′, C′′′, C′′′′ that
depend only on ω,

d_TV(P_{(N,D),∞}, P_{(N,D),K}) ≤ (C′ + C′′ ln²(DT) + C′′′ ln(DT) ln K + C′′′′ ln K)/K.
See section 2.I.1 for explicit values of the constants as well as the theorem's proof. For
fixed K, Theorem 2.4.6 is independent of N, the number of observations in each group, but
scales with the number of groups D like O(poly(ln D)). For fixed D, the approximation error
decreases to zero at a rate no slower than O(ln K/K). The O(ln(DT)) factor is related to the
expected logarithmic growth rate of Dirichlet process mixture models [9, Section 5.2] in the
following way. Since there are D groups, each manifesting at most T distinct atom locations
from an underlying Dirichlet process prior, the situation is akin to generating DT samples
from a common Dirichlet process prior. Hence, the expected number of unique samples
is O(ln(DT)). Similar to Theorem 2.4.1, we speculate that the O(ln²(DT)) factor can be
improved to O(ln(DT)). For error bounds of truncation-based approximations of hierarchical
processes, such as the HDP, we refer to Lijoi et al. [130, Theorem 1].
across the corresponding subscript: x_{·,k} := (x_{n,k})_{n=1}^N denotes trait counts across observations
of the k-th trait. We next consider algorithms to approximate the posterior distribution
P(ρ, ψ, x | y) of the finite approximation.
Gibbs sampling. When all latent parameters are continuous, Hamiltonian Monte Carlo
methods are increasingly standard for performing Markov chain Monte Carlo (MCMC) poste-
rior approximation [40, 84]. However, due to the discreteness of the trait counts x, successful
MCMC algorithms for CRMs or their approximations have been based largely on Gibbs sam-
pling [66]. In particular, blocked Gibbs sampling utilizing the natural Markov blanket structure
is straightforward to implement when the complete conditionals P(ρ | x, ψ, y), P(x | ψ, ρ, y),
and P(ψ | x, ρ, y) are easy to simulate from.¹⁵
¹⁴ The usage of x in this section is different from the usage in the remaining sections: in eq. (2.6), x is a single observation from the likelihood process.
¹⁵ Because of the factorization P(x | ψ, ρ, y) = Π_{n=1}^N P(x_{n,·} | ψ, ρ, y_n), Gibbs sampling over the finite approximation can be an appealing technique even when Gibbs sampling over the marginal process is not. In particular, the wall-time of a Gibbs iteration for the finite approximation can be small by drawing P(x_{n,·} | ψ, ρ, y_n) in parallel. Meanwhile, any iteration to update the trait counts with the marginal process representation needs to sequentially process the data points, prohibiting speed-up through parallelism.
Different finite approximations with the same number of atoms K change only P(ρ) in
the generative model. So, of the conditionals, we expect only P(ρ | x, ψ, y) to differ across
finite approximations. We next show in Proposition 2.5.1 that the form of P(ρ | x, ψ, y) is
particularly tractable for AIFAs. Then we will discuss how Gibbs derivations are substantially
more involved for TFAs.
Furthermore, each P(ρ_k | x_{·,k}) is in the same exponential family as the AIFA prior, with
density proportional to

1{ρ ∈ U} ρ^{c/K + Σ_{n=1}^N ϕ(x_{n,k}) − 1} exp( ⟨ψ + Σ_{n=1}^N t(x_{n,k}), µ(ρ)⟩ + (λ + N)[−A(ρ)] ).    (2.21)
See section 2.J.2 for the proof of Proposition 2.5.1. For common models — such as beta–
Bernoulli, gamma–Poisson, and beta–negative binomial — we see that the complete condi-
tionals over AIFA atom sizes are in forms that are well known and easy to simulate.
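For instance, specializing eq. (2.21) to the beta–Bernoulli AIFA with d = 0 (prior Beta(γα/K, α)), the complete conditional for each atom weight is a Beta distribution with the counts added to the two shape parameters; a sketch (ours):

```python
def aifa_bernoulli_conditional(x_col, gamma_, alpha, K):
    # Complete conditional P(rho_k | x_{.,k}) for the d = 0 beta-Bernoulli AIFA:
    # prior Beta(gamma*alpha/K, alpha), Bernoulli counts x_{n,k} in {0, 1}
    # => posterior Beta(gamma*alpha/K + sum_n x_{n,k}, alpha + sum_n (1 - x_{n,k})).
    n_on = sum(x_col)
    return (gamma_ * alpha / K + n_on, alpha + len(x_col) - n_on)
```

Each of the K conditionals depends only on its own column of counts, which is what makes parallel simulation across atoms possible.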
There are many different types of TFAs, but typical TFA Gibbs updates pose additional
challenges. Even when P(ρ) is easy to sample from, P(ρ | x) can be intractable, as we see in
the following example.
Example 2.5.1 (Stick-breaking approximation [28, 156]). Consider the TFA for BP(γ, α, 0)
given by

Θ_K = Σ_{i=1}^K Σ_{j=1}^{C_i} V_{i,j}^{(i)} Π_{l=1}^{i−1} (1 − V_{i,j}^{(l)}) δ_{ψ_{ij}},
ascent updates [205, Section 6.3] remain popular for cases with discrete variables, including
the present trait counts x.

MFVI posits a factorized distribution q to approximate the exact posterior. In our case, we
approximate P(ρ, ψ, x | y) with q(ρ, ψ, x) = q_ρ(ρ) q_ψ(ψ) q_x(x). We focus on q_ρ(ρ). For fixed
q_ψ(ψ) and q_x(x), the optimal q*_ρ minimizes the (reverse) Kullback–Leibler divergence between
the posterior and q*_ρ q_ψ q_x.
Our next result shows that qρ∗ takes a convenient form when using AIFAs.
Corollary 2.5.2 (AIFA optimal distribution is in exponential family). Suppose the likelihood
is an exponential family (eq. (2.6)) and the AIFA prior ν_K is as in Corollary 2.3.3. Then
the density of q*_ρ is given by

q*_ρ(ρ) = Π_k p̃_k(ρ_k),    (2.23)
That is, when using the AIFA, the optimal q*_ρ factorizes across the K atoms, and each
distribution is in the conjugate exponential family for the likelihood ℓ(x_{n,k} | ρ_k). Users
will typically report summary statistics, such as means or variances, of the variational
approximation q*_ρ; these are straightforward to compute from the exponential family form.
The TFA case is much more complex and requires both more steps in the inference scheme
as well as additional approximations. See section 2.J for two illustrative examples.
Parallelization. We end with a brief discussion on parallelization. In both Proposition 2.5.1
and Corollary 2.5.2, the update distribution for ρ factorizes across the K atoms. Hence,
AIFA updates can be done in parallel across atoms, yielding speed-ups in wall-clock time,
with the gains being greatest when there are many instantiated atoms. For TFAs, due to the
coupling among the atom rates, there is no such benefit from parallelization.
Likewise, we find comparable performance of AIFAs and alternative IFAs in predictive tasks
(section 2.6.3). However, we find that AIFAs can be used to learn model hyperparameters
where alternative IFA approximations fail (section 2.6.4). And we show that AIFAs can be
used to learn model hyperparameters for new models, not previously explored in the BNP
literature (section 2.6.5).
In relation to prior studies, existing empirical work has compared IFAs and TFAs only for
simpler models and smaller data sets (e.g., Doshi-Velez et al. [55, Tables 1 and 2] and Kurihara
et al. [117, Figure 4]). Our comparison is grounded in models with more levels and analyzes
datasets of much larger sizes. For instance, in our topic modeling application, we analyze
nearly 1 million documents, while the comparison in Kurihara et al. [117] utilizes only 200
synthetic data points.
(a) Original (b) Input, 24.64 dB (c) AIFA, 33.81 dB (d) TFA, 34.03 dB
Figure 2.6.1: AIFA and TFA denoised images have comparable quality. (a) The noiseless
image. (b) The corrupted image. (c,d) Sample denoised images from finite models with
K = 60. We report PSNR (in dB) with respect to the noiseless image.
PSNR between either the TFA or AIFA output image and the original image are always very
similar and substantially higher (between 30 and 35) than the PSNR between the original
and corrupted image (below 30). In fact, each TFA denoised image is more similar to the
AIFA denoised image than to the original image; the PSNR between the TFA and AIFA
outputs is about 50. We also see from fig. 2.6.2a that the quality of denoised images improves
with increasing K. The improvement with K is largest for small K, and plateaus for larger
values of K.
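For reference, the PSNR reported throughout is computed directly from the mean squared error between two images; a minimal sketch with synthetic stand-in images:

```python
import numpy as np

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two images."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(64, 64)).astype(float)
noisy = np.clip(clean + rng.normal(0.0, 25.0, size=clean.shape), 0, 255)
print(round(psnr(clean, noisy), 1))   # roughly 20 dB for noise std 25
```
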
In addition to randomly initializing the latent variables at the beginning of the Gibbs sampler
of one model (“cold start”), we can use the last configuration of latent variables visited in
the other model as the initial state of the Gibbs sampler (“warm start”). In fig. 2.6.2b, the
warm-start curve uses the output of inference with the AIFA as an initial value for inference
with the TFA; similarly, the warm-start curve of fig. 2.6.2c uses the output with the TFA to
initialize inference with the AIFA. For both approximations, K = 60. At the end of training,
all latent variables for all patches have been assigned, so for the warm start experiment,
we make all patches available from the start instead of gradually introducing patches. For
both approximations, the Gibbs sampler initialized at the warm start visits candidate images
that essentially have the same PSNR as the starting configuration; the PSNR values never
deviate from the initial PSNR by more than 1%. The early iterates of the cold-start Gibbs
sampler are noticeably lower in quality compared to the warm-start iterates, and the quality
at the plateau is still lower than that of the warm start.17 Each PSNR trace corresponds to a
different set of initial values and simulation of the conditionals. The variation across the 5
warm-start trials is small; the variation across the 5 cold-start trials is larger but still quite
small. In all, the modes of the TFA posterior are good initializations for inference with the AIFA
model, and vice versa.
¹⁷Because the warm start represents the end of the training from the cold start with gradually introduced patches, the gap in final PSNR is due to the gradual patch introduction.
(a) Performance across K (b) TFA training (c) AIFA training
Figure 2.6.2: (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level
K. Error bars depict 1-standard-deviation ranges across 5 trials. (b,c) How PSNR evolves
during inference across 10 trials: 5 starting from cold starts and 5 from warm starts.
(a) Performance across K (b) TFA training (c) AIFA training
Figure 2.6.3: (a) Test log-likelihood (testLL) as a function of approximation level K. Error
bars show 1 standard deviation across 5 trials. (b,c) TestLL change during inference.
(a) BFRY IFA versus AIFA (b) GenPar IFA vs AIFA
Figure 2.6.4: (a) The left panel shows the average predictive log-likelihood of the AIFA
(blue) and BFRY IFA (red) as a function of the approximation level K; the average is across
10 trials with different random seeds for the stochastic optimizer. The right panel shows
highest predictive log-likelihood across the same 10 trials. (b) The panels are analogous to
(a), except the GenPar IFA is in red.
We generate a synthetic dataset so that the ground truth hyperparameter values are known.
The data takes the form of a binary matrix X, with N rows and K̃ columns. We generate
X from an Indian buffet process prior; recall that the Indian buffet process is the marginal
process of a beta process CRM paired with Bernoulli likelihood. To learn the hyperparameter
values with an AIFA, we maximize the marginal likelihood of the observed matrix X implied
by the AIFA. In particular, we compute the marginal likelihood by integrating the Bernoulli
likelihood P(xn,k | θk ) over θk distributed as the K-atom AIFA νK . To quantify the variability
of the estimation procedure, we generate 50 feature matrices and compute the maximum
likelihood estimate for each of these 50 trials. See section 2.K.4 for more experimental details.
Fig. 2.6.5a shows that we can use an AIFA to estimate the underlying discount for a variety
of ground-truth discounts. Since the estimates and error bars are similar whether we use the
AIFA (left) or full nonparametric process (right), we conclude that using the AIFA yields
comparable inference to using the full process.
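To make the estimation procedure concrete, the sketch below maximizes a marginal likelihood over a hyperparameter by numerically integrating a Bernoulli likelihood against a prior. The Beta(a, 1) prior is only an illustrative stand-in for the K-atom AIFA νK, and the binary data are synthetic:

```python
import numpy as np
from scipy import integrate, optimize
from scipy.stats import beta

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.25, size=20)     # synthetic binary feature column
n1 = int(data.sum())
n0 = len(data) - n1

def neg_log_marginal(a, b=1.0):
    # Marginal likelihood of the column: integrate the Bernoulli likelihood
    # theta^n1 (1 - theta)^n0 against a Beta(a, b) prior (AIFA stand-in).
    val, _ = integrate.quad(
        lambda t: t ** n1 * (1.0 - t) ** n0 * beta.pdf(t, a, b), 0.0, 1.0
    )
    return -np.log(val)

res = optimize.minimize_scalar(neg_log_marginal, bounds=(0.01, 10.0),
                               method="bounded")
print(round(res.x, 2))   # maximum-marginal-likelihood estimate of a
```

By conjugacy this marginal is a ratio of beta normalizers, so the quadrature can be cross-checked in closed form; for the AIFA of the text the same one-dimensional integral is computed numerically instead.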
In theory, the marginal likelihood of the BFRY IFA can also be used to estimate the discount,
but in practice we find that this approach is not straightforward and can yield unreliable
estimates. At the time of writing, such an experiment had not yet been attempted; Lee et al.
[124] focus on clustering models and do not discuss strategies to estimate any hyperparameter
in a feature allocation model with a BFRY IFA. We are not aware of a closed-form formula
for the marginal likelihood. Default schemes to numerically integrate P(0 | θk ) against the
BFRY prior for θk fail because of overflow issues: the constant $(K\Gamma(d)d/\gamma)^{1/d}$ is typically very large,
especially for small $d$. Due to finite precision, $1 - \exp\!\left(-(Kd/\gamma)^{1/d}\,\tfrac{\theta}{1-\theta}\right)$
evaluates to 1 on
the quadrature grid used by numerical integrators [160]. In this case, eq. (2.4) behaves as
θ−d−1 near 0, and thus the integral over θ diverges. To create the left panel of fig. 2.6.5b,
we view the marginal likelihood as an expectation and construct Monte Carlo estimates; we
draw 105 BFRY samples to estimate the marginal likelihood, and we take the estimate’s
logarithm as an approximation to the log marginal likelihood (red line). To quantify the
(a) Maximum likelihood estimates (b) Log negative log marginal likelihood
Figure 2.6.5: (a) We estimate the discount by maximizing the marginal likelihood of the
AIFA (left) or the full process (right). The solid blue line is the median of the estimated
discounts, while the lower and upper bounds of the error bars are the 20% and 80% quantiles.
The black dashed line is the ideal value of the estimated discount, equal to the ground-truth
discount. (b) In each panel, the solid red line is the average log of negative log marginal
likelihood (LNLML) across batches. The light red region depicts two standard errors in either
direction from the mean.
uncertainty, we draw 100 batches of 105 samples (light red region). Even for this large
number of Monte Carlo samples, the estimated log marginal likelihood curve is too noisy to
be useful for hyperparameter estimation. By comparison, we can compute the log marginal
likelihood analytically for the IBP (dashed black line); it is much smoother and features a
clear minimum. Moreover, we can compute the AIFA log marginal likelihood via numerical
integration (solid blue line); it is also very smooth and features a clear minimum.
We again consider the BFRY IFA and GenPar IFA separately and generate separate simulated
data for each case due to their disjoint assumptions; we generate data with concentration α = 0
for the BFRY IFA and with α > 0 for the GenPar IFA. An experiment to recover a discount
hyperparameter with the GenPar IFA, analogous to the experiment above with the BFRY
IFA, has also not previously been attempted. There is no analytical formula for the GenPar
IFA marginal likelihood, and we again encounter overflow when trying numerical integration.
Therefore, we resort to Monte Carlo; we find that estimates of the log marginal likelihood are
too noisy for practical use in recovering the discount (the right panel of Figure 2.6.5b).
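The noisiness of this estimator is easy to reproduce in miniature: taking the log of a Monte Carlo average gives a batch-dependent estimate whose spread can be measured directly. The integrand and "prior" below are arbitrary toys, not the BFRY density:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # Arbitrary likelihood-like integrand, tiny for most prior draws.
    return np.exp(-50.0 * theta)

log_estimates = []
for _ in range(100):                         # 100 batches, as in the experiment
    theta = rng.pareto(1.5, size=10**5)      # stand-in heavy-tailed "prior"
    log_estimates.append(np.log(np.mean(f(theta))))
log_estimates = np.array(log_estimates)
print(round(log_estimates.std(), 4))         # batch-to-batch spread
```
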
Figure 2.6.6: Blue histograms show posterior density estimates for τ from MCMC draws.
The ground-truth τ (solid red line) is 0.7 in the overdispersed case (upper row) and 1.5 in
the underdispersed case (lower row). The threshold τ = 1 (dashed black line) marks the
transition from overdispersion (τ < 1.0) to underdispersion (τ > 1.0). The percentile in each
panel’s title is the percentile where the ground truth τ falls in the posterior draws. The
approximation size K of the AIFA increases in the plots from left to right.
2.7 Discussion
We have provided a general construction of automated independent finite approximations
(AIFAs) for completely random measures and their normalizations. Our construction provides
novel finite approximations not previously seen in the literature. For processes without
power-law behavior, we provide approximation error bounds; our bounds show that we can
ensure accurate approximation by setting the number of atoms K to be (1) logarithmic in the
number of observations N and (2) inverse to the error tolerance ϵ. We have discussed how the
independence and automatic construction of AIFA atom sizes lead to convenient inference
schemes. A natural competitor for AIFAs is a truncated finite approximation (TFA). We show
that, for the worst case choice of observational likelihood and the same K, AIFAs can incur
larger error than the corresponding TFAs. However, in our experiments, we find that the two
methods have essentially the same performance in practice. Meanwhile, AIFAs are overall
easier to work with than TFAs, whose coupled atoms complicate the development of inference
schemes. Future work might extend our error bound analysis to conjugate exponential family
CRMs with power-law behavior. An obstacle to upper bounds for the positive-discount case is
the verification of the clauses in Condition 2.4.1. In the positive-discount case, the functions
h and M_{n,x}, which describe the marginal representation of the nonparametric process, take
forms that are straightforwardly amenable to analysis. But the function h̃, which describes
the finite approximations, is complex. In general, h̃ is equal to the ratio of two normalization
constants of different AIFAs. The normalization constants can be computed numerically.
However, to make theoretical statements such as the clauses in Condition 2.4.1, we need to
prove their smoothness properties. Another direction is to tighten the error upper bound by
focusing on specific, commonly-used observational likelihoods — in contrast to the worst-case
analysis we provide here. Finally, more work is required to directly compare the size of error
in the finite approximation to the size of error due to approximate inference algorithms such
as Markov chain Monte Carlo or variational inference.
Appendix
Since h(θ; η) is continuous and bounded on [0, 1], Assumption 2.3.1 holds.
In the positive-discount case, let $c = \frac{\gamma(\eta_1\eta_2)^{1-d}}{\Gamma((1-d)/\eta_2)}$, and the finite-dimensional distribution has density
\[
\frac{1}{Z_K}\,\theta^{c/K - 1 - d S_{1/K}(\theta - 1/K)}\, e^{-(\eta_1\theta)^{\eta_2}}\, d\theta,
\qquad
Z_K := \int_0^\infty \theta^{c/K - 1 - d S_{1/K}(\theta - 1/K)}\, e^{-(\eta_1\theta)^{\eta_2}}\, d\theta.
\]
Example 2.A.4 (Extended gamma process). Taking $V = (0, \infty) \times (1, \infty)$, $g(\theta) = 1$,
$h(\theta; \eta) = Z_\tau^{-c}(\theta)$, $U = [0, T]$, and $Z(\xi, \eta) = \int_0^\infty \theta^{\xi-1} Z_\tau^{-c}(\theta)\, d\theta$ in Theorem 2.3.1 yields
the extended gamma process from eq. (2.11). Since $g(\theta) = 1$, the second condition in
Assumption 2.3.1 holds. For any $\tau$ and $c$, $Z_\tau^{-c}(\theta)$ is continuous and bounded on $[0, 1]$, so
the third condition in Assumption 2.3.1 holds. As for the first condition, we note that
$Z_\tau^{-c}(\theta) \le (1+\theta)^{-c}$, since the minimum of $Z_\tau(\theta)$ with respect to $\tau$ is $1+\theta$, attained at $\tau = \infty$.
Therefore, $Z(\xi, \eta)$ is finite if
\[
\int_0^T \theta^{\xi-1} (1+\theta)^{-c}\, d\theta
\]
is finite. Since $(1+\theta)^{-c} \le 1$, the last integral is at most
\[
\int_0^T \theta^{\xi-1}\, d\theta = \frac{T^\xi}{\xi},
\]
which is finite. Hence, all three conditions of Assumption 2.3.1 hold, and we can apply
Corollary 2.3.3. The AIFA is
\[
\nu_K(d\theta) = \frac{1}{Z_K}\, \theta^{\gamma/K - 1}\, Z_\tau^{-c}(\theta)\, \mathbf{1}\{0 \le \theta \le T\}\, d\theta,
\]
where $Z_K$ is the normalization constant $Z_K = \int_0^T \theta^{\gamma/K-1} Z_\tau^{-c}(\theta)\, d\theta$. More generally, for
$\gamma, c, \tau > 0$ and $T \ge 1$, we use the notation $\mathrm{XGamma}(\gamma, c, \tau, T)$ to denote the real-valued
distribution with density at $\theta$ proportional to $\theta^{\gamma-1} Z_\tau^{-c}(\theta)\, \mathbf{1}\{0 \le \theta \le T\}$.
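The normalizer Z_K has no closed form but is a one-dimensional integral, so it can be computed by quadrature. A minimal sketch with made-up hyperparameter values, assuming the series form Z_τ(θ) = Σ_j θ^j/(j!)^τ (consistent with eq. (2.34) for i = 0):

```python
import numpy as np
from scipy import integrate
from scipy.special import gammaln, logsumexp

# Z_tau(theta) = sum_j theta^j / (j!)^tau, evaluated in log space
# to avoid overflow for moderate theta.
def log_Z_tau(theta, tau, terms=200):
    j = np.arange(1, terms)
    series = j * np.log(theta) - tau * gammaln(j + 1)
    return logsumexp(np.concatenate(([0.0], series)))   # j = 0 term is 1

# Normalizer Z_K of the XGamma(gamma/K, c, tau, T) atom distribution, via
# one-dimensional quadrature (illustrative hyperparameter values).
gamma_, c, tau, T, K = 1.0, 2.0, 1.5, 2.0, 5
integrand = lambda t: t ** (gamma_ / K - 1.0) * np.exp(-c * log_Z_tau(t, tau))
Z_K, _ = integrate.quad(integrand, 0.0, T)
print(round(Z_K, 4))
```

Since Z_τ ≥ 1, the computed value is bounded above by the integral of the bare power term, a quick sanity check on the quadrature.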
2.B Proofs of AIFA convergence
In this appendix, to highlight the fact that the i.i.d. distributions are different across K, we
use $\rho_{K,i}$ to denote the $i$-th atom size in the approximation of level $K$; i.e., the $K$-atom AIFA is
\[
\Theta_K := \sum_{i=1}^K \rho_{K,i}\,\delta_{\psi_{K,i}}, \qquad \rho_{K,i} \overset{\text{i.i.d.}}{\sim} \nu_K, \quad \psi_{K,i} \overset{\text{i.i.d.}}{\sim} H.
\]
Definition 2.B.1. The parameterized function family {Sb }b∈R+ is composed of approximate
indicators if, for any b ∈ R+ , Sb (θ) is a real, non-decreasing function such that Sb (θ) = 0 for
θ ≤ 0 and Sb (θ) = 1 for θ ≥ b.
Valid examples of approximate indicators are the indicator function Sb (θ) = 1{θ > 0} and
the smoothed indicator function from Theorem 2.3.1. Some approximate indicators have a
point of discontinuity; e.g., Sb (θ) = 1{θ > 0}. But the smoothed indicator is both continuous
and differentiable; see section 2.B.2.
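A concrete implementation of an approximate indicator, assuming the bump-function form of the smoothed indicator whose one-sided limits are computed in section 2.B.2, namely S_b(θ) = exp(1 − 1/(1 − (θ−b)²/b²)) on (0, b):

```python
import numpy as np

def smoothed_indicator(theta, b):
    """S_b: 0 for theta <= 0, 1 for theta >= b, smooth bump in between."""
    theta = np.asarray(theta, dtype=float)
    out = np.where(theta >= b, 1.0, 0.0)
    inside = (theta > 0.0) & (theta < b)
    t = theta[inside]
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - (t - b) ** 2 / b ** 2))
    return out

# Approximate-indicator properties: 0 below 0, 1 above b, non-decreasing.
grid = np.linspace(-1.0, 2.0, 1001)
vals = smoothed_indicator(grid, b=1.0)
assert vals[0] == 0.0 and vals[-1] == 1.0
assert np.all(np.diff(vals) >= -1e-12)
```
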
Theorem 2.B.2. Suppose Assumption 2.3.1 holds, and let {Sb }b∈R+ be a family of approxi-
mate indicators. Fix a > 0, and let (bK )K∈N be a decreasing sequence such that bK → 0. For
c := γh(0; η)/Z(1 − d, η), let
\[
\nu_K(d\theta) := \theta^{-1 + cK^{-1} - d S_{b_K}(\theta - aK^{-1})}\; g(\theta)^{cK^{-1} - d}\; h(\theta; \eta)\; Z_K^{-1}\, d\theta.
\]
Then, for $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$ and $\Theta \sim \mathrm{CRM}(H, \nu)$, $\Theta_K \xrightarrow{D} \Theta$ as $K \to \infty$.
Theorem 2.B.2 recovers Theorem 2.3.1 by setting $S_b$ equal to the smoothed indicator, $a = 1$,
and $b_K = 1/K$. See section 2.L.2 for a discussion of the impact of the tuning hyperparameters
on the performance of our IFA.
In order to prove Theorem 2.B.2, we require a few auxiliary results.
Lemma 2.B.3 (Kallenberg [104, Lemma 12.1, Lemma 12.2 and Theorem 16.16]). Let Θ be
a random measure and Θ1 , Θ2 , . . . a sequence of random measures. If for all measurable sets
A and t > 0,
\[
\lim_{K\to\infty} \mathbb{E}\left[e^{-t\Theta_K(A)}\right] = \mathbb{E}\left[e^{-t\Theta(A)}\right],
\]
then $\Theta_K \xrightarrow{D} \Theta$.
For a density $f$, let $\mu(t, f) : \theta \mapsto (1 - e^{-t\theta}) f(\theta)$. In results that follow, we assume all measures
on $\mathbb{R}_+$ have densities with respect to Lebesgue measure. We abuse notation and use the same
symbol to denote the measure and the density.
Proposition 2.B.4. Let Θ ∼ CRM(H, ν) and for K = 1, 2, . . . , let ΘK ∼ IFAK (H, νK )
where ν is a measure and ν1 , ν2 , . . . are probability measures on R+ , all absolutely continuous
with respect to Lebesgue measure. If $\|\mu(1, K\nu_K) - \mu(1, \nu)\|_1 \to 0$, then $\Theta_K \xrightarrow{D} \Theta$.
Proof. Let $t > 0$ and $A$ a measurable set. First, recall that the Laplace functional of the
CRM $\Theta$ is
\[
\mathbb{E}\left[e^{-t\Theta(A)}\right] = \exp\left(-H(A) \int_0^\infty \mu(t,\nu)(\theta)\, d\theta\right).
\]
We have
\[
\mathbb{E}\left[e^{-t\Theta_K(A)}\right] = \left(\mathbb{E}\left[e^{-t\rho_{K,1}\mathbf{1}(\psi_{K,1}\in A)}\right]\right)^K = \left(1 - \frac{H(A)}{K}\int_0^\infty \mu(t, K\nu_K)(\theta)\, d\theta\right)^K.
\]
Since $\frac{|1-e^{-t\theta}|}{|1-e^{-\theta}|} \le \max(1,t)$, it follows by hypothesis that $\|\mu(t,K\nu_K) - \mu(t,\nu)\|_1 \to 0$. Thus, by
dominated convergence and the standard exponential limit,
\begin{align*}
\lim_{K\to\infty} \left(\mathbb{E}\left[e^{-t\rho_{K,1}\mathbf{1}(\psi_{K,1}\in A)}\right]\right)^K
&= \lim_{K\to\infty}\left(1 - \frac{H(A)}{K}\int_0^\infty \mu(t,K\nu_K)(\theta)\, d\theta\right)^K \\
&= \exp\left(-\lim_{K\to\infty} H(A)\int_0^\infty \mu(t,K\nu_K)(\theta)\, d\theta\right) \\
&= \exp\left(-H(A)\int_0^\infty \mu(t,\nu)(\theta)\, d\theta\right).
\end{align*}
Lemma 2.B.5. If there exist measures $\pi(\theta)\,d\theta$ and $\pi'(\theta)\,d\theta$ on $\mathbb{R}_+$ such that for some $\kappa > 0$
and $c, c'$,
then
\[
\|\mu - \mu_K\|_1 \xrightarrow{\,K\to\infty\,} 0.
\]
We rewrite each term in turn. For the first term,
\begin{align*}
\int_0^{a/K} \theta^{-1+cK^{-1}}\, g(\theta)^{-1+cK^{-1}}\, \pi'(d\theta) &= (c/\gamma + o(1)) \int_0^{a/K} \theta^{-1+cK^{-1}}\, d\theta \\
&= (c/\gamma + o(1))\, \frac{K}{c}\left(\frac{a}{K}\right)^{cK^{-1}} \\
&= \frac{K}{\gamma} + o(K).
\end{align*}
Since $\kappa \le 1$ and $S_{b_K} \in [0,1]$, for $\theta \in [a/K, \kappa]$ we have $\theta^{-dS_{b_K}(\theta - aK^{-1})} \le \theta^{-d}$. Since $g(0) = 1$, $c_* \le 1$,
and therefore $g(\theta)^{-1+cK^{-1}} \le c_*^{-1+c}$. Hence the second term is upper bounded by
\[
c_*^{-1+c} \int_{a/K}^{\kappa} \theta^{-1+cK^{-1}-d}\, \pi'(d\theta)
\le c_*^{-1}(c/\gamma + O(1))\, \frac{K^d}{a^d}\, \frac{K}{c}\left(\kappa^{cK^{-1}} - (a/K)^{cK^{-1}}\right)
= O(K^d) \times O(\ln K) = o(K).
\]
To bound the two terms we will use the fact that if $\theta \ge \kappa$, then
\[
\theta g(\theta) \ge \frac{\theta}{c_*(1+\theta)} \ge \frac{\kappa}{c_*(1+\kappa)} =: \tilde\kappa,
\]
and if $\theta \le 1$ then $\theta g(\theta) \le c_* \le 1$. Hence, letting $\psi := \theta g(\theta)$, for the first term in eq. (2.26)
we have
\begin{align*}
\gamma \sup_{\theta\in[\kappa,\infty)} (\theta g(\theta))^{-1}\,\big|1 - (\theta g(\theta))^{cK^{-1}}\big|
&\le \gamma \sup_{\psi\in[\tilde\kappa,\infty)} \psi^{-1}\big|1-\psi^{cK^{-1}}\big| \\
&\le \gamma \sup_{\psi\in[\tilde\kappa,1]} \psi^{-1}\big|1-\psi^{cK^{-1}}\big| + \gamma \sup_{\psi\in[1,\infty)} \psi^{-1}\big|1-\psi^{cK^{-1}}\big| \\
&\le \gamma\tilde\kappa^{-1} \sup_{\psi\in[\tilde\kappa,1]} \big|1-\psi^{cK^{-1}}\big| + \gamma\left(\frac{K-c}{K}\right)^{Kc^{-1}} \frac{c}{K-c} \\
&\le \gamma\tilde\kappa^{-1}\left(1-\tilde\kappa^{cK^{-1}}\right) + O(1)\times\frac{c}{K-c} \\
&= \gamma\tilde\kappa^{-1}\times o(1) + O(K^{-1}) \to 0.
\end{align*}
We bound the first integral in eq. (2.27) in four parts: from $0$ to $aK^{-1}$, from $aK^{-1}$ to
$aK^{-1}+b_K$, from $aK^{-1}+b_K$ to $\kappa - b_K$, and from $\kappa - b_K$ to $\kappa$. The first part is equal to
\begin{align*}
\int_0^{aK^{-1}} \theta^{-d}\,\big|1-\theta^{d+cK^{-1}}\big|\,d\theta &\le \int_0^{aK^{-1}} \left(\theta^{-d} + \theta^{cK^{-1}}\right) d\theta \\
&= \left[\frac{\theta^{1-d}}{1-d} + \frac{K}{c+K}\,\theta^{1+cK^{-1}}\right]_0^{aK^{-1}} \\
&= \frac{(aK^{-1})^{1-d}}{1-d} + \frac{K}{c+K}\,(aK^{-1})^{1+cK^{-1}} \to 0.
\end{align*}
The second part is equal to
\begin{align*}
\int_{aK^{-1}}^{aK^{-1}+b_K} \theta^{-d}\,\big|1-\theta^{cK^{-1}+d-dS_{b_K}(\theta-aK^{-1})}\big|\,d\theta
&\le \int_{aK^{-1}}^{aK^{-1}+b_K} \left(\theta^{-d} + \theta^{cK^{-1}-d}\right) d\theta \\
&\le 2\int_{aK^{-1}}^{aK^{-1}+b_K} \theta^{-d}\,d\theta \\
&= \frac{2}{1-d}\left[\theta^{1-d}\right]_{aK^{-1}}^{aK^{-1}+b_K} \\
&= \frac{2}{1-d}\left(\left(\frac{a}{K}+b_K\right)^{1-d} - \left(\frac{a}{K}\right)^{1-d}\right) \to 0.
\end{align*}
The third part is equal to
\begin{align*}
\int_{aK^{-1}+b_K}^{\kappa-b_K} \theta^{-d}\,\big|1-\theta^{cK^{-1}}\big|\,d\theta
&= \int_{aK^{-1}+b_K}^{\kappa-b_K} \left(\theta^{-d} - \theta^{cK^{-1}-d}\right) d\theta \\
&= \left[\frac{\theta^{1-d}}{1-d} - \frac{K}{c+K(1-d)}\,\theta^{1-d+cK^{-1}}\right]_{aK^{-1}+b_K}^{\kappa-b_K} \\
&= \frac{(\kappa-b_K)^{1-d}}{1-d} - \frac{K}{c+K(1-d)}\,(\kappa-b_K)^{1-d+cK^{-1}} \\
&\quad - \frac{(aK^{-1}+b_K)^{1-d}}{1-d} + \frac{K}{c+K(1-d)}\,(aK^{-1}+b_K)^{1-d+cK^{-1}} \to 0.
\end{align*}
The fourth part is equal to
\[
\int_{\kappa-b_K}^{\kappa} \theta^{-d}\,\big|1-\theta^{cK^{-1}}\big|\,d\theta \le \int_{\kappa-b_K}^{\kappa} \left(\theta^{-d} + \theta^{cK^{-1}-d}\right) d\theta \to 0
\]
using the same argument as the second part. The second integral in eq. (2.27) is upper
bounded by
\[
\gamma e'_K \int_0^{\kappa} \theta^{cK^{-1}-dS_{b_K}(\theta-aK^{-1})}\,d\theta \le \gamma e'_K \int_0^{\kappa} \theta^{-d}\,d\theta = \gamma e'_K\,\frac{\kappa^{1-d}}{1-d} = o(K).
\]
Since $\sup_{\theta\in[0,\kappa]} \pi'(\theta) < \infty$ by the boundedness of $g$ and $h$, and $\pi$ is a probability density
by construction, we conclude using Lemma 2.B.5 that $\|\mu - \mu_K\|_1 \to 0$. It then follows from
Lemma 2.B.3 that $\Theta_K \xrightarrow{D} \Theta$.
is differentiable over the whole real line. Since on the separate domains (−∞, 0), (0, b), and
(b, ∞), the derivative exists and is continuous, we only need to show that the values of the
derivative at θ = 0 and θ = b from either side match.
To start, we show that $S_b(\theta)$ is continuous at $\theta = 0$ and $\theta = b$:
\[
\lim_{\theta\to b^-} S_b(\theta) = \exp\left(1 - \frac{1}{1-0}\right) = 1, \qquad
\lim_{\theta\to 0^+} S_b(\theta) = \exp\left(1 - \frac{1}{0^+}\right) = 0.
\]
On $(0, b)$, the derivative is
\[
\frac{dS_b}{d\theta} = S_b(\theta)\,\frac{-1}{\left[(\theta-b)^2/b^2 - 1\right]^2}\,\frac{2(\theta-b)}{b^2}. \tag{2.28}
\]
The limit of eq. (2.28) as we approach $b$ from the left is 0, since $\lim_{\theta\to b^-} S_b(\theta) = 1$ and the factor
$(\theta - b)$ vanishes. So the one-sided derivative is continuous at $\theta = b$. For $\theta = 0$, the derivative
from the left ($\theta \to 0^-$) is 0, since $S_b$ is constant there. The limit of eq. (2.28) as we approach
0 from the right is also 0: it suffices to show
\[
\lim_{\theta\to 0^+} S_b(\theta)\,\frac{-1}{\left[(\theta-b)^2/b^2 - 1\right]^2} = 0.
\]
Reparametrizing $x = \frac{1}{1 - (\theta-b)^2/b^2}$, we have $x \to \infty$ as $\theta \to 0^+$. The last limit becomes
\[
\lim_{x\to\infty} x^2 \exp(1 - x) = 0,
\]
which is true because the decay of the exponential function is faster than the growth of any
polynomial. The derivatives defined over the disjoint intervals are continuous at the boundary
points, so the overall approximate indicator is differentiable.
By choosing $A = \Psi$, i.e., the ground space, we have that $\Theta_K(\Psi)$ is the total mass of the AIFA
and $\Theta(\Psi)$ is the total mass of the CRM:
\[
\Theta_K(\Psi) = \sum_{i=1}^K \rho_{K,i}, \qquad \Theta(\Psi) = \sum_{i=1}^\infty \theta_i.
\]
Since for any $t > 0$ the Laplace transform of $\Theta_K(\Psi)$ converges to that of $\Theta(\Psi)$, we conclude
that $\Theta_K(\Psi)$ converges to $\Theta(\Psi)$ in distribution [104, Theorem 5.3]:
\[
\sum_{i=1}^K \rho_{K,i} \xrightarrow{D} \Theta(\Psi). \tag{2.29}
\]
Second, we show that the decreasing order statistics of the AIFA atom sizes converge (in finite-dimensional
distributions, i.e., in f.d.d.) to the decreasing order statistics of the CRM atom sizes.
For each $K$, the decreasing order statistics of the AIFA atoms are denoted by $\{\rho_{K,(i)}\}_{i=1}^K$.
We will leverage Loeve [134, Theorem 4 and page 191] to find the limiting distribution of
$\{\rho_{K,(i)}\}_{i=1}^K$ as $K \to \infty$. It is easy to verify the conditions needed to use the theorem: because the
sums $\sum_{i=1}^K \rho_{K,i}$ converge in distribution to a limit, we know that all the $\rho_{K,i}$'s are uniformly
asymptotically negligible [104, Lemma 15.13]. Now, we discuss what the limits are. It
is well-known that $\Theta(\Psi)$ is an infinitely divisible positive random variable with no drift
component and Levy measure exactly $\nu(d\theta)$ [159]. In the terminology of Loeve [134, Equation
2], the characteristics of $\Theta(\Psi)$ are $a = b = 0$ (no drift or Gaussian parts), $L(x) = 0$, and
$M(x) = -\nu([x, \infty))$.
Let $I$ be a counting process in reverse over $(0, \infty)$, defined from the Poisson point process
$\{\theta_i\}_{i=1}^\infty$ in the following way: for any $x$, $I(x)$ is the number of points $\theta_i$ exceeding the threshold
$x$,
\[
I(x) := |\{i : \theta_i \ge x\}|.
\]
We augment $I(0) = \infty$ and $I(\infty) = 0$. As a stochastic process, $I$ has independent increments,
in that for all $0 = t_0 < t_1 < \cdots < t_k$, the increments $I(t_i) - I(t_{i-1})$ are independent;
furthermore, the law of the increments is $I(t_{i-1}) - I(t_i) \sim \mathrm{Poisson}(M(t_i) - M(t_{i-1}))$. These
properties are simple consequences of the counting measure induced by the Poisson point
process. According to Loeve [134, page 191], the limiting distribution of $\{\rho_{K,(i)}\}_{i=1}^K$ is governed
by $I$, in the sense that for any fixed $t \in \mathbb{N}$ and any $x_1, x_2, \ldots, x_t \in [0, \infty)$:
\[
\lim_{K\to\infty} \mathbb{P}(\rho_{K,(1)} < x_1, \rho_{K,(2)} < x_2, \ldots, \rho_{K,(t)} < x_t) = \mathbb{P}(I(x_1) < 1, I(x_2) < 2, \ldots, I(x_t) < t). \tag{2.30}
\]
Because the $\theta_i$'s induce $I$, we can relate the left-hand side to the order statistics of the Poisson
point process. We denote the decreasing order statistics of the $\{\theta_i\}_{i=1}^\infty$ by $\{\theta_{(i)}\}_{i=1}^\infty$. Clearly,
for any $t \in \mathbb{N}$, the event that $I(x)$ is at least $t$ is the same as the event that the top $t$ jumps
among the $\{\theta_i\}_{i=1}^\infty$ exceed $x$: $I(x) \ge t \iff \theta_{(t)} \ge x$. Therefore eq. (2.30) can be rewritten as:
for any fixed $t \in \mathbb{N}$ and any $x_1, x_2, \ldots, x_t \in [0, \infty)$,
\[
\lim_{K\to\infty} \mathbb{P}(\rho_{K,(1)} < x_1, \ldots, \rho_{K,(t)} < x_t) = \mathbb{P}(\theta_{(1)} < x_1, \ldots, \theta_{(t)} < x_t). \tag{2.31}
\]
It is well-known that convergence of the distribution functions implies weak convergence; for
instance, see Pollard [166, Chapter III, Problem 1]. In fact, from Loeve [134, Theorem 5 and
page 194], for any fixed $t \in \mathbb{N}$, the convergence in distribution of $\{\rho_{K,(i)}\}_{i=1}^t$ to $\{\theta_{(i)}\}_{i=1}^t$ holds
jointly with the convergence of $\sum_{i=1}^K \rho_{K,(i)}$ to $\sum_{i=1}^\infty \theta_i$: the two conditions of the theorem,
which are continuity of the distribution function of each $\rho_{K,i}$ and $M(0) = -\infty$,¹⁹ are easily
verified. Therefore, by the continuous mapping theorem, if we define the normalized atom
sizes
\[
p_{K,(s)} := \frac{\rho_{K,(s)}}{\sum_{i=1}^K \rho_{K,i}}, \qquad p_{(s)} := \frac{\theta_{(s)}}{\sum_{i=1}^\infty \theta_i},
\]
then the finite-dimensional distributions of $(p_{K,(s)})$ converge to those of $(p_{(s)})$.
Finally, we show that the EPPFs converge. If we define the size-biased permutations (in the
sense of Gnedin [75, Section 2]) of the normalized atom sizes,
\[
\{\tilde p_{K,i}\} \sim \mathrm{SBP}(p_{K,(s)}), \qquad \{\tilde p_i\} \sim \mathrm{SBP}(p_{(s)}),
\]
then by Gnedin [75, Theorem 1], the finite-dimensional distributions of the size-biased
permutations also converge:
\[
(\tilde p_{K,i})_{i=1}^K \xrightarrow{\text{f.d.d.}} (\tilde p_i)_{i=1}^\infty. \tag{2.32}
\]
Pitman [162, Equation 45] gives the EPPF of $\Xi = \Theta/\Theta(\Psi)$:
\[
p(n_1, n_2, \ldots, n_b) = \mathbb{E}\left[\prod_{i=1}^b \tilde p_i^{\,n_i-1} \prod_{i=1}^{b-1}\left(1 - \sum_{j=1}^i \tilde p_j\right)\right],
\]
perspective removes the need to infer a countably infinite set of target variables. In addition,
the exchangeability of $X_1, X_2, \ldots, X_N$, i.e., the joint distribution's invariance with
respect to the ordering of the observations [3], often enables the development of inference algorithms,
namely Gibbs samplers.
Broderick et al. [30, Corollary 6.2] derive the conditional distributions $X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1$
for general exponential family CRMs (eqs. (2.6) and (2.7)).
Proposition 2.C.1 (Target’s marginal process [30, Corollary 6.2]). For any n, Xn | Xn−1 , . . . , X1
is a random measure with finite support.
1. Let $\{\zeta_i\}_{i=1}^{K_{n-1}}$ be the union of atom locations in $X_1, X_2, \ldots, X_{n-1}$. For $1 \le m \le n-1$, let
$x_{m,j}$ be the atom size of $X_m$ at atom location $\zeta_j$. Denote by $x_{n,i}$ the atom size of $X_n$ at
atom location $\zeta_i$. The $x_{n,i}$'s are independent across $i$, and the p.m.f. of $x_{n,i}$ at $x$ is
\[
h(x \mid x_{1:(n-1)}) = \kappa(x)\,
\frac{Z\!\left(-1 + \sum_{m=1}^{n-1}\phi(x_{m,i}) + \phi(x),\; \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(x_{m,i}) + t(x) \\ n\end{pmatrix}\right)}
{Z\!\left(-1 + \sum_{m=1}^{n-1}\phi(x_{m,i}),\; \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(x_{m,i}) \\ n-1\end{pmatrix}\right)}.
\]
2. For each $x \in \mathbb{N}$, $X_n$ has $p_{n,x}$ atoms whose atom size is exactly $x$. The locations of these
atoms are i.i.d. from $H$: as $H$ is diffuse, they are disjoint from the existing union of atoms
$\{\zeta_i\}_{i=1}^{K_{n-1}}$. The counts $p_{n,x}$ are Poisson-distributed with mean
\[
M_{n,x} = \gamma'\,\kappa(0)^{n-1}\kappa(x)\, Z\!\left(-1 + (n-1)\phi(0) + \phi(x),\; \eta + \begin{pmatrix}(n-1)t(0) + t(x) \\ n\end{pmatrix}\right).
\]
\[
\tilde h(x \mid z_{1:(n-1)}) = \kappa(x)\,
\frac{Z\!\left(c/K - 1 + \sum_{m=1}^{n-1}\phi(z_{m,i}) + \phi(x),\; \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(z_{m,i}) + t(x) \\ n\end{pmatrix}\right)}
{Z\!\left(c/K - 1 + \sum_{m=1}^{n-1}\phi(z_{m,i}),\; \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(z_{m,i}) \\ n-1\end{pmatrix}\right)}.
\]
2. $K - K_{n-1}$ atom locations are generated i.i.d. from $H$. $Z_n$ has $p_{n,x}$ atoms whose size is
exactly $x$ (for $x \in \mathbb{N} \cup \{0\}$) over these $K - K_{n-1}$ atom locations (the $p_{n,0}$ atoms whose
atom size is 0 can be interpreted as not present in $Z_n$). The joint distribution of the $p_{n,x}$ is a
multinomial with $K - K_{n-1}$ trials, with success of type $x$ having probability
\[
\tilde h(x \mid z_{1:(n-1)} = 0^{n-1}) = \kappa(x)\,
\frac{Z\!\left(c/K - 1 + (n-1)\phi(0) + \phi(x),\; \eta + \begin{pmatrix}(n-1)t(0) + t(x) \\ n\end{pmatrix}\right)}
{Z\!\left(c/K - 1 + (n-1)\phi(0),\; \eta + \begin{pmatrix}(n-1)t(0) \\ n-1\end{pmatrix}\right)}.
\]
Proof of Proposition 2.C.2. We only need to prove the conditional distributions for the atom
sizes: that the $K$ distinct atom locations are generated i.i.d. from the base measure is clear.
First we consider $n = 1$. By construction in Corollary 2.3.3, a priori, the trait frequencies
$\{\rho_i\}_{i=1}^K$ are independent, each following the distribution
\[
\mathbb{P}(\rho_i \in d\theta) = \frac{\mathbf{1}\{\theta \in U\}}{Z(c/K - 1, \eta)}\; \theta^{c/K-1} \exp\left(\left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle\right) d\theta.
\]
Conditioned on $\{\rho_i\}_{i=1}^K$, the atom sizes $z_{1,i}$ that $Z_1$ puts on the $i$-th atom location are
independent across $i$, and each is distributed as
\[
\mathbb{P}(z_{1,i} = x \mid \rho_i) = \kappa(x)\,\rho_i^{\phi(x)} \exp\left(\langle \mu(\rho_i), t(x)\rangle - A(\rho_i)\right).
\]
Integrating out $\rho_i$, the marginal distribution of $z_{1,i}$ is
\begin{align*}
\mathbb{P}(z_{1,i} = x) &= \int \mathbb{P}(z_{1,i} = x \mid \rho_i = \theta)\,\mathbb{P}(\rho_i \in d\theta) \\
&= \frac{\kappa(x)}{Z(c/K-1, \eta)} \int_U \theta^{c/K - 1 + \phi(x)} \exp\left(\left\langle \eta + \begin{pmatrix}t(x) \\ 1\end{pmatrix}, \begin{pmatrix}\mu(\theta) \\ -A(\theta)\end{pmatrix}\right\rangle\right) d\theta \\
&= \kappa(x)\, \frac{Z\!\left(c/K - 1 + \phi(x),\; \eta + \begin{pmatrix}t(x) \\ 1\end{pmatrix}\right)}{Z(c/K - 1, \eta)},
\end{align*}
by definition of $Z$ as the normalizer in eq. (2.8).
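This kind of conjugate cancellation can be sanity-checked numerically in a hypothetical beta-Bernoulli instantiation (not the general exponential family of the text), comparing the closed-form ratio of normalizers against brute-force integration:

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta
from scipy.special import betaln

a, b = 0.04, 1.0   # e.g. a = c/K with c = 2, K = 50, in a beta-Bernoulli model

# Brute force: integrate the Bernoulli likelihood P(z = 1 | theta) = theta
# against the Beta(a, b) prior.
brute, _ = integrate.quad(lambda t: t * beta.pdf(t, a, b), 0.0, 1.0)

# Conjugacy: the marginal P(z = 1) is a ratio of normalizers, B(a+1, b)/B(a, b).
closed = np.exp(betaln(a + 1.0, b) - betaln(a, b))

assert abs(brute - closed) < 1e-8
print(round(closed, 4))
```
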
Now we consider $n \ge 2$. The distribution of $z_{n,i}$ only depends on the distribution of
$z_{n-1,i}, z_{n-2,i}, \ldots, z_{1,i}$, since the atom sizes across different atoms are independent of each other
both a priori and a posteriori. The predictive distribution is an integral:
\[
\mathbb{P}(z_{n,i} = x \mid z_{1:(n-1),i}) = \int \mathbb{P}(z_{n,i} = x \mid \rho_i = \theta)\,\mathbb{P}(\rho_i \in d\theta \mid z_{1:(n-1),i}).
\]
Because the prior over $\rho_i$ is conjugate for the likelihood of $z_{i,j} \mid \rho_i$, and the observations $z_{i,j}$
are conditionally independent given $\rho_i$, the posterior $\mathbb{P}(\rho_i \in d\theta \mid z_{1:(n-1),i})$ is in the same
exponential family but with different natural parameters:
\[
\frac{\mathbf{1}\{\theta \in U\}\, \theta^{c/K - 1 + \sum_{m=1}^{n-1}\phi(z_{m,i})} \exp\left(\left\langle \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(z_{m,i}) \\ n-1\end{pmatrix}, \begin{pmatrix}\mu(\theta) \\ -A(\theta)\end{pmatrix}\right\rangle\right) d\theta}
{Z\!\left(c/K - 1 + \sum_{m=1}^{n-1}\phi(z_{m,i}),\; \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(z_{m,i}) \\ n-1\end{pmatrix}\right)}.
\]
This means that the predictive distribution $\mathbb{P}(z_{n,i} = x \mid z_{1:(n-1),i})$ equals
\begin{align*}
&\kappa(x)\,\frac{\int_U \theta^{c/K-1+\sum_{m=1}^{n-1}\phi(z_{m,i})+\phi(x)} \exp\left(\left\langle \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(z_{m,i}) + t(x) \\ n\end{pmatrix}, \begin{pmatrix}\mu(\theta) \\ -A(\theta)\end{pmatrix}\right\rangle\right) d\theta}{Z\!\left(c/K-1+\sum_{m=1}^{n-1}\phi(z_{m,i}),\; \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(z_{m,i}) \\ n-1\end{pmatrix}\right)} \\
&\quad = \kappa(x)\,\frac{Z\!\left(c/K-1+\sum_{m=1}^{n-1}\phi(z_{m,i})+\phi(x),\; \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(z_{m,i}) + t(x) \\ n\end{pmatrix}\right)}{Z\!\left(c/K-1+\sum_{m=1}^{n-1}\phi(z_{m,i}),\; \eta + \begin{pmatrix}\sum_{m=1}^{n-1} t(z_{m,i}) \\ n-1\end{pmatrix}\right)}.
\end{align*}
The predictive distribution $\mathbb{P}(z_{n,i} = x \mid z_{1:(n-1),i})$ governs the distribution of atom sizes at
both known and new atom locations.
This is Broderick et al. [30, A1]. To ensure that the number of active traits is almost surely
finite, it suffices to ensure that the expected number of traits is finite. The condition that
the expected number of active traits is finite reads as
\[
\int_0^\infty \left(1 - Z_\tau^{-1}(\theta)\right) \nu(d\theta) < \infty.
\]
This is Broderick et al. [30, A2]: note that $Z_\tau^{-1}(\theta)$ is exactly the probability that a trait with
rate $\theta$ does not manifest.
Lemma 2.D.1 (Hyperparameters for extended gamma rate measure). For any $\gamma > 0$, $c > 0$,
$T \ge 1$, and $\tau > 0$, the rate measure $\nu(d\theta)$ from eq. (2.11) satisfies:
• $\int_0^\infty \nu(d\theta) = \infty$.
• $\int_0^\infty \left[1 - Z_\tau^{-1}(\theta)\right] \nu(d\theta) < \infty$.
Proof of Lemma 2.D.1. We observe that it suffices to show the two conclusions for γ = 1,
since any positive scaling of the rate measure will preserve the finiteness (or infiniteness) of
the integrals. In addition, we can replace the upper limit of integration, ∞, by T , since the
rate measure is zero for θ > T .
We begin with elementary observations about the monotonicity of Zτ (θ). Zτ (θ) is increasing
in θ but decreasing in τ . In the limit of τ → ∞, Zτ (θ) approaches 1 + θ.
To prove the first statement, we use a simple lower bound on $\int_0^T \nu(d\theta)$, which holds since
$T \ge 1$:
\[
\int_0^T \theta^{-1} Z_\tau^{-c}(\theta)\, d\theta \;\ge\; \int_0^1 \theta^{-1} Z_\tau^{-c}(\theta)\, d\theta \;\ge\; Z_\tau^{-c}(1) \int_0^1 \theta^{-1}\, d\theta = \infty.
\]
For the second statement, we first record the derivatives of $Z_\tau$:
\[
\frac{d^i}{d\theta^i} Z_\tau(\theta) = \sum_{j=0}^\infty \left(\prod_{k=1}^i (j+k)\right)^{1-\tau} \frac{\theta^j}{(j!)^\tau}. \tag{2.34}
\]
It is easy to check that the infinite sums in eq. (2.34) converge for any $\theta$. By absolute
convergence theorems,²⁰ it suffices to inspect $\theta > 0$. By the ratio test, subsequent terms have
ratio
\[
\frac{\theta^{j+1}}{[(j+1)!]^\tau}\left(\prod_{k=1}^i (j+1+k)\right)^{1-\tau} \Bigg/ \frac{\theta^{j}}{(j!)^\tau}\left(\prod_{k=1}^i (j+k)\right)^{1-\tau}
= \frac{\theta\,(j+1+i)^{1-\tau}}{j+1} \xrightarrow{\,j\to\infty\,} 0.
\]
Clearly $Z_\tau(0) = 1$. Hence, for all $\theta$ close enough to 0, $Z_\tau(\theta)$ is strictly positive. Therefore,
$Z_\tau^{-1}(\theta)$ also has derivatives of all orders in an open interval containing $[0, 1]$. Note that
$\frac{d}{d\theta} Z_\tau(\theta)\big|_{\theta=0} = 1$. Therefore
\[
\frac{d}{d\theta} Z_\tau^{-1}(\theta)\bigg|_{\theta=0} = \frac{-\frac{d}{d\theta} Z_\tau(\theta)\big|_{\theta=0}}{Z_\tau^2(0)} = -1.
\]
By Taylor's theorem (Kline [112, Section 20.3]), for any $\theta \in [0, 1]$, there exists a $y$ between 0
and $\theta$ such that
\[
Z_\tau^{-1}(\theta) = 1 - \theta + \frac{1}{2}\,\frac{d^2}{d\theta^2} Z_\tau^{-1}(\theta)\bigg|_{\theta=y}\, \theta^2.
\]
It is clear that the second derivative $\frac{d^2}{d\theta^2} Z_\tau^{-1}(\theta)\big|_{\theta=y}$ is bounded by a constant independent of
$y$ for $y \in [0, 1]$, since
\[
\frac{d^2}{d\theta^2} Z_\tau^{-1}(\theta) = -\frac{\frac{d^2}{d\theta^2} Z_\tau(\theta)}{Z_\tau^2(\theta)} + 2\left(\frac{d}{d\theta} Z_\tau(\theta)\right)^2 \frac{1}{Z_\tau^3(\theta)},
\]
with $Z_\tau(\theta)$ being at least 1 and the derivatives being bounded. This shows eq. (2.33).

²⁰ See, e.g., https://www.whitman.edu/mathematics/calculus_online/section11.06.html
Therefore:
\[
\int_0^T \left[1 - Z_\tau^{-1}(\theta)\right] \nu(d\theta) \le \int_0^1 \left[1 - Z_\tau^{-1}(\theta)\right] \theta^{-1} Z_\tau^{-c}(\theta)\, d\theta + \int_1^T \theta^{-1} Z_\tau^{-c}(\theta)\, d\theta = A + B.
\]
We use the estimate $1 - Z_\tau^{-1}(\theta) \le \theta + \kappa\theta^2$ in the first part (A):
\[
\int_0^1 \left[1 - Z_\tau^{-1}(\theta)\right] \theta^{-1} Z_\tau^{-c}(\theta)\, d\theta \le \int_0^1 (1 + \kappa\theta)\, Z_\tau^{-c}(\theta)\, d\theta.
\]
Since $Z_\tau^{-c}(\theta) \le \exp(-c\theta)$, it follows that A is finite. For the second part (B), we again use
the upper bound $Z_\tau^{-c}(\theta) \le \exp(-c\theta)$ and also $\theta^{-1} \le 1$ to conclude that B is finite. Overall,
$A + B$ is finite.
In the second case, $\tau > 1$. Since $Z_\tau(\theta) \le Z_1(\theta)$, we have $1 - Z_\tau^{-1}(\theta) \le 1 - Z_1^{-1}(\theta) = 1 - \exp(-\theta)$.
In addition, since $Z_\tau(\theta) \ge Z_\infty(\theta)$, we also have $Z_\tau^{-c}(\theta) \le Z_\infty^{-c}(\theta) = \frac{1}{(1+\theta)^c}$. Hence
\[
\int_0^T \left[1 - Z_\tau^{-1}(\theta)\right] \nu(d\theta) \le \int_0^T (1 - \exp(-\theta))\,\theta^{-1}\,\frac{1}{(1+\theta)^c}\, d\theta.
\]
Observe that for any positive $\theta$, $(1 - \exp(-\theta))\,\theta^{-1} \le 1$. Therefore
\[
\int_0^T \left[1 - Z_\tau^{-1}(\theta)\right] \nu(d\theta) \le \int_0^T \frac{1}{(1+\theta)^c}\, d\theta.
\]
The integrand $\frac{1}{(1+\theta)^c}$ is continuous and bounded on $[0, T]$, so the overall integral is finite.
where we used [54, Equation 1.10.13] and the simple observation that $(2/3)\delta < \delta$. By
construction, $X$ is stochastically dominated by $Y$, so the tail probabilities of $X$ are upper
bounded by the tail probabilities of $Y$.
Lemma 2.E.2 (Lower tail Chernoff bound [54, Theorem 1.10.5]). Let $X = \sum_{i=1}^n X_i$, where
$X_i = 1$ with probability $p_i$ and $X_i = 0$ with probability $1 - p_i$, and all $X_i$ are independent. Let
$\mu := \mathbb{E}(X) = \sum_{i=1}^n p_i$. Then for all $\delta \in (0, 1)$:
\[
\mathbb{P}(X \le (1-\delta)\mu) \le \exp\left(-\frac{\delta^2\mu}{2}\right).
\]
Lemma 2.E.3 (Tail bounds for the Poisson distribution). If $X \sim \mathrm{Poisson}(\lambda)$, then for any $x > 0$:
\[
\mathbb{P}(X \ge \lambda + x) \le \exp\left(-\frac{x^2}{2(\lambda+x)}\right),
\]
and for any $0 < x < \lambda$:
\[
\mathbb{P}(X \le \lambda - x) \le \exp\left(-\frac{x^2}{2\lambda}\right).
\]
Proof of Lemma 2.E.3. For $x \ge -1$, let $\psi(x) := 2\left((1+x)\ln(1+x) - x\right)/x^2$.
We first inspect the upper tail bound. If $X \sim \mathrm{Poisson}(\lambda)$, then for any $x > 0$, Pollard [165,
Exercise 3, p. 272] implies that
\[
\mathbb{P}(X \ge \lambda + x) \le \exp\left(-\frac{x^2}{2\lambda}\,\psi\!\left(\frac{x}{\lambda}\right)\right).
\]
To show the upper tail bound, it suffices to prove that $\frac{x^2}{2\lambda}\psi\!\left(\frac{x}{\lambda}\right)$ is greater than $\frac{x^2}{2(\lambda+x)}$, i.e.,
that for all $u \ge 0$,
\[
(u+1)\psi(u) - 1 \ge 0. \tag{2.35}
\]
Since $g''(u) \ge 0$, $g'(u)$ is monotone increasing. Since $g'(0) = 1$, $g'(u) > 0$ for $u \ge 0$; hence
$g(u)$ is monotone increasing. Because $g(0) = 0$, we conclude that $g(u) \ge 0$ for $u > 0$, and
eq. (2.35) holds. Plugging in $u = x/\lambda$:
\[
\psi\!\left(\frac{x}{\lambda}\right) \ge \frac{1}{1 + \frac{x}{\lambda}} = \frac{\lambda}{x+\lambda},
\]
which shows $\frac{x^2}{2\lambda}\psi\!\left(\frac{x}{\lambda}\right) \ge \frac{x^2}{2(\lambda+x)}$.
Now we inspect the lower tail bound. We follow the proof of Canonne [36, Theorem 1]. We
first argue that
\[
\mathbb{P}(X \le \lambda - x) \le \exp\left(-\frac{x^2}{2\lambda}\,\psi\!\left(-\frac{x}{\lambda}\right)\right). \tag{2.36}
\]
For any $\theta$, the moment generating function $\mathbb{E}[\exp(\theta X)]$ is well-defined and well-known:
$\mathbb{E}[\exp(\theta X)] = \exp(\lambda(e^\theta - 1))$. Therefore:
Poisson random variables, where $Y_i$ has mean $np_i$. Then, there exists a coupling $(\widehat X, \widehat Y)$ of
$P_X$ and $P_Y$ such that
\[
\mathbb{P}(\widehat X \ne \widehat Y) \le n\left(\sum_{i=1}^\infty p_i\right)^2.
\]
Proof of Lemma 2.E.4. First, we recognize that both $X$ and $Y$ can be sampled in two steps.
• Regarding $X$, first sample $N_1 \sim \mathrm{Binom}\!\left(n, \sum_{i=1}^\infty p_i\right)$. Then, for each $1 \le k \le N_1$,
independently sample $T_k$ where $\mathbb{P}(T_k = i) = \frac{p_i}{\sum_{j=1}^\infty p_j}$. Then, $X_i = \sum_{k=1}^{N_1} \mathbf{1}\{T_k = i\}$ for
each $i$.
• Regarding $Y$, first sample $N_2 \sim \mathrm{Poisson}\!\left(n \sum_{i=1}^\infty p_i\right)$. Then, for each $1 \le k \le N_2$,
independently sample $T_k$ where $\mathbb{P}(T_k = i) = \frac{p_i}{\sum_{j=1}^\infty p_j}$. Then, $Y_i = \sum_{k=1}^{N_2} \mathbf{1}\{T_k = i\}$ for
each $i$.
The two-step sampling perspective for $X$ comes from rejection sampling: to generate a success
of type $k$, we first generate some type of success, and then re-calibrate to get the right
proportion for type $k$. The two-step perspective for $Y$ comes from the thinning property of the
Poisson distribution [120, Exercise 1.5]. The thinning property implies that for any finite
index set $\mathcal{K}$, the $\{Y_i\}$ for $i \in \mathcal{K}$ are mutually independent and, marginally, $Y_i \sim \mathrm{Poisson}(np_i)$.
Hence the whole collection $\{Y_i\}_{i=1}^\infty$ consists of independent Poisson variables, and the mean of $Y_i$ is $np_i$.
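The thinning equivalence for Y can be verified by simulation; the sketch below uses an arbitrary finite probability vector standing in for the p_i:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
p = np.array([0.02, 0.05, 0.03])      # finite stand-in for the p_i
draws = 20_000

# Construction 1: direct independent Poissons, Y_i ~ Poisson(n p_i).
direct = rng.poisson(n * p, size=(draws, p.size))

# Construction 2: two-step thinning. Sample the total N2 ~ Poisson(n sum p_i),
# then give each of the N2 successes a type with probability p_i / sum_j p_j.
totals = rng.poisson(n * p.sum(), size=draws)
two_step = np.vstack([rng.multinomial(t, p / p.sum()) for t in totals])

# The two constructions agree in distribution; compare coordinate-wise means.
assert np.allclose(direct.mean(axis=0), n * p, atol=0.06)
assert np.allclose(two_step.mean(axis=0), n * p, atol=0.06)
```
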
Observing that the conditional $X \mid N_1 = \bar n$ is the same as $Y \mid N_2 = \bar n$, we propose the coupling
that essentially proves the propagation rule, Lemma 2.E.8. The proposed coupling $(\widehat X, \widehat Y)$ is as follows:
• Sample (N c2 ) from the maximal coupling that attains dTV between the two distribu-
c1 , N
tions: Binom (n, ∞
P∞
i=1 pi ) and Poisson (n
P
i=1 pi ).
• If N
c1 = N c2 , let the common value be n, sample X c1 = n and set Yb = X.
b |N b Else
N c2 , independently sample X
c1 ̸= N c1 and Yb | N
b |N c2 .
Alternatively, we can sample from the conditional $\widehat{X} \mid \widehat{Y}$ in the following way. From $\widehat{Y}$, compute $\widehat{N}_2$, which is just $\sum_{i=1}^{\infty} \widehat{Y}_i$. Sample $\widehat{N}_1$ from the conditional distribution $N_1 \mid N_2$ of the maximal coupling that attains the binomial–Poisson total variation. If $\widehat{N}_1 = \widehat{N}_2$, set $\widehat{X} = \widehat{Y}$. Else sample $\widehat{X}$ from the conditional $\widehat{X} \mid \widehat{N}_1$. It is straightforward to verify that this is the conditional $\widehat{X} \mid \widehat{Y}$ of the joint $(\widehat{X}, \widehat{Y})$ described above.
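The maximal coupling invoked here can be constructed explicitly for discrete distributions. A minimal Python sketch (the pmfs are arbitrary illustrations): put the overlap $\min(p_i, q_i)$ on the diagonal and pair the residual masses, so that the probability of disagreement equals the total variation distance.

```python
def maximal_coupling(p, q):
    # joint[i][j] = P(X = i, Y = j) for a maximal coupling of pmfs p and q
    n = len(p)
    joint = [[0.0] * n for _ in range(n)]
    overlap = [min(pi, qi) for pi, qi in zip(p, q)]
    for i in range(n):
        joint[i][i] = overlap[i]
    # residual masses; each sums to d_TV(p, q)
    rp = [pi - o for pi, o in zip(p, overlap)]
    rq = [qi - o for qi, o in zip(q, overlap)]
    tv = sum(rp)
    if tv > 0:
        for i in range(n):
            for j in range(n):
                joint[i][j] += rp[i] * rq[j] / tv  # pair residuals independently
    return joint

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
joint = maximal_coupling(p, q)
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
mismatch = sum(joint[i][j] for i in range(3) for j in range(3) if i != j)
assert abs(mismatch - tv) < 1e-12                  # P(X != Y) = d_TV exactly
assert all(abs(sum(row) - pi) < 1e-12 for row, pi in zip(joint, p))
```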
Lemma 2.E.5 (Total variation between Poissons [2, Corollary 3.1]). Let $P_1$ be the Poisson distribution with mean $s$, $P_2$ the Poisson distribution with mean $t$. Then:
\[
d_{TV}(P_1, P_2) \le 1 - \exp(-|s - t|) \le |s - t|.
\]
When $P_X$ and $P_Y$ are discrete distributions, the inequality is an equality, and there exist couplings that attain the equality [126, Proposition 4.7].
We first state the chain rule, which will be applied to compare joint distributions that admit
densities.
Lemma 2.E.6 (Chain rule). Suppose $P_{X_1,Y_1}$ and $P_{X_2,Y_2}$ are two distributions that have densities with respect to a common measure over the ground space $A \times B$. Then:
\[
d_{TV}(P_{X_1,Y_1}, P_{X_2,Y_2}) \le d_{TV}(P_{X_1}, P_{X_2}) + \sup_{a \in A} d_{TV}(P_{Y_1 \mid X_1 = a}, P_{Y_2 \mid X_2 = a}).
\]
Proof of Lemma 2.E.6. Because both $P_{X_1,Y_1}$ and $P_{X_2,Y_2}$ have densities, the total variation distance is half of the $L_1$ distance between the densities:
\begin{align*}
d_{TV}(P_{X_1,Y_1}, P_{X_2,Y_2}) &= \frac{1}{2}\int_{A \times B} |P_{X_1,Y_1}(a,b) - P_{X_2,Y_2}(a,b)|\, da\, db \\
&= \frac{1}{2}\int_{A \times B} |P_{X_1,Y_1}(a,b) - P_{X_2}(a)P_{Y_1 \mid X_1}(b \mid a) \\
&\qquad\qquad + P_{X_2}(a)P_{Y_1 \mid X_1}(b \mid a) - P_{X_2,Y_2}(a,b)|\, da\, db \\
&\le \frac{1}{2}\int_{A \times B} P_{Y_1 \mid X_1}(b \mid a)\,|P_{X_1}(a) - P_{X_2}(a)| \\
&\qquad\qquad + P_{X_2}(a)\,|P_{Y_1 \mid X_1}(b \mid a) - P_{Y_2 \mid X_2}(b \mid a)|\, da\, db \\
&= \frac{1}{2}\int_{A \times B} P_{Y_1 \mid X_1}(b \mid a)\,|P_{X_1}(a) - P_{X_2}(a)|\, da\, db \\
&\quad + \frac{1}{2}\int_{A \times B} P_{X_2}(a)\,|P_{Y_1 \mid X_1}(b \mid a) - P_{Y_2 \mid X_2}(b \mid a)|\, da\, db,
\end{align*}
where we have used the triangle inequality. Regarding the first term, using Fubini:
\begin{align*}
\frac{1}{2}\int_{A \times B} P_{Y_1 \mid X_1}(b \mid a)\,|P_{X_1}(a) - P_{X_2}(a)|\, da\, db
&= \frac{1}{2}\int_{a \in A} \Big(\int_{b \in B} P_{Y_1 \mid X_1}(b \mid a)\, db\Big) |P_{X_1}(a) - P_{X_2}(a)|\, da \\
&= \frac{1}{2}\int_{a \in A} |P_{X_1}(a) - P_{X_2}(a)|\, da \\
&= d_{TV}(P_{X_1}, P_{X_2}).
\end{align*}
Regarding the second term:
\begin{align*}
\frac{1}{2}\int_{A \times B} P_{X_2}(a)\,|P_{Y_1 \mid X_1}(b \mid a) - P_{Y_2 \mid X_2}(b \mid a)|\, da\, db
&= \int_{a \in A} \Big(\frac{1}{2}\int_{b \in B} |P_{Y_1 \mid X_1}(b \mid a) - P_{Y_2 \mid X_2}(b \mid a)|\, db\Big) P_{X_2}(a)\, da \\
&\le \sup_{a \in A} d_{TV}(P_{Y_1 \mid X_1 = a}, P_{Y_2 \mid X_2 = a}) \int_{a \in A} P_{X_2}(a)\, da \\
&= \sup_{a \in A} d_{TV}(P_{Y_1 \mid X_1 = a}, P_{Y_2 \mid X_2 = a}).
\end{align*}
The sum of the first and second upper bounds gives the total variation chain rule.
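The chain rule can be sanity-checked numerically on small discrete joints (an illustrative Python sketch; the two pmfs below are arbitrary choices, not from the text):

```python
# two joint pmfs over the 2 x 2 space {0,1} x {0,1}
P1 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
P2 = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def tv(p, q, keys):
    # total variation as half the L1 distance between pmfs
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

keys = list(P1)
lhs = tv(P1, P2, keys)

def marg_x(P):
    return {a: P[(a, 0)] + P[(a, 1)] for a in (0, 1)}

def cond_y(P, a):
    m = P[(a, 0)] + P[(a, 1)]
    return {b: P[(a, b)] / m for b in (0, 1)}

tv_x = tv(marg_x(P1), marg_x(P2), (0, 1))
sup_y = max(tv(cond_y(P1, a), cond_y(P2, a), (0, 1)) for a in (0, 1))
# Lemma 2.E.6: joint TV <= marginal TV + worst-case conditional TV
assert lhs <= tv_x + sup_y + 1e-12
```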
An important consequence of Lemma 2.E.6 arises when the distributions being compared have natural independence structures.
Lemma 2.E.7 (Product rule). Let $P_{X_1,Y_1}$ and $P_{X_2,Y_2}$ be discrete distributions. In addition, suppose $P_{X_1,Y_1}$ factorizes into $P_{X_1}P_{Y_1}$ and similarly $P_{X_2,Y_2} = P_{X_2}P_{Y_2}$. Then:
\[
d_{TV}(P_{X_1,Y_1}, P_{X_2,Y_2}) \le d_{TV}(P_{X_1}, P_{X_2}) + d_{TV}(P_{Y_1}, P_{Y_2}).
\]
Proof of Lemma 2.E.7. Since $P_{X_1,Y_1}$ and $P_{X_2,Y_2}$ are discrete distributions, we can apply Lemma 2.E.6 (the common measure is the counting measure). Because each joint distribution $P_{X_i,Y_i}$ factorizes into $P_{X_i}P_{Y_i}$, for any $a \in A$, the rightmost term in the inequality of Lemma 2.E.6 simplifies into
\[
\sup_{a \in A} d_{TV}(P_{Y_1 \mid X_1 = a}, P_{Y_2 \mid X_2 = a}) = d_{TV}(P_{Y_1}, P_{Y_2}),
\]
• If $\widehat{X}_1 = \widehat{X}_2$, let the common value be $x$. Sample $\widehat{Y}_1$ from the conditional distribution $Y \mid X = x$, and set $\widehat{Y}_2 = \widehat{Y}_1$. Else if $\widehat{X}_1 \ne \widehat{X}_2$, independently sample $\widehat{Y}_1$ from $Y \mid X = \widehat{X}_1$ and $\widehat{Y}_2$ from $Y \mid X = \widehat{X}_2$.
It is easy to verify that the tuple $(\widehat{Y}_1, \widehat{Y}_2)$ is a coupling of $P_{Y_1}$ and $P_{Y_2}$. In addition, $(\widehat{Y}_1, \widehat{Y}_2)$ has the property that
\[
\mathbb{P}(\widehat{Y}_1 \ne \widehat{Y}_2,\, \widehat{X}_1 = \widehat{X}_2) = 0,
\]
since conditioned on $\widehat{X}_1 = \widehat{X}_2$, the values of $\widehat{Y}_1$ and $\widehat{Y}_2$ always agree. Therefore:
\[
\mathbb{P}(\widehat{Y}_1 \ne \widehat{Y}_2) = \mathbb{P}(\widehat{Y}_1 \ne \widehat{Y}_2,\, \widehat{X}_1 \ne \widehat{X}_2) \le \mathbb{P}(\widehat{X}_1 \ne \widehat{X}_2).
\]
This means that $d_{TV}(P_{Y_1}, P_{Y_2})$ is small:
\[
d_{TV}(P_{Y_1}, P_{Y_2}) \le \mathbb{P}(\widehat{X}_1 \ne \widehat{X}_2).
\]
So far $(\widehat{X}_1, \widehat{X}_2)$ has been an arbitrary coupling between $P_{X_1}$ and $P_{X_2}$. The final step is taking the infimum over couplings on the right-hand side. When $P_{X_1}$ and $P_{X_2}$ are discrete distributions, the infimum over couplings is equal to the total variation distance.
The final lemma is the reduction rule, which says that a larger collection of random variables, in general, has larger total variation distance than a smaller one.
Lemma 2.E.9 (Reduction rule). Suppose $P_{X_1,Y_1}$ and $P_{X_2,Y_2}$ are two distributions over the same measurable space $A \times B$. Then:
\[
d_{TV}(P_{X_1}, P_{X_2}) \le d_{TV}(P_{X_1,Y_1}, P_{X_2,Y_2}).
\]
Proof of Lemma 2.E.9. For any measurable $\mathcal{A} \subseteq A$:
\[
P_{X_1}(\mathcal{A}) - P_{X_2}(\mathcal{A}) = P_{X_1,Y_1}(\mathcal{A} \times B) - P_{X_2,Y_2}(\mathcal{A} \times B) \le d_{TV}(P_{X_1,Y_1}, P_{X_2,Y_2}),
\]
since $P_{X_1,Y_1}(\mathcal{A} \times B) - P_{X_2,Y_2}(\mathcal{A} \times B)$ is the difference in probability mass for one measurable event. The final step is taking the supremum of the left-hand side over $\mathcal{A}$.
2.E.3 Miscellaneous
Lemma 2.E.10 (Order of growth of harmonic-like sums).
\[
\alpha\,[\ln N + \ln(\alpha + 1) - \psi(\alpha)] \;\ge\; \sum_{n=1}^{N} \frac{\alpha}{n - 1 + \alpha} \;\ge\; \alpha\,(\ln N - \psi(\alpha) - 1).
\]
Proof of Lemma 2.E.10. Because of the digamma function identity $\psi(z + 1) = \psi(z) + 1/z$ for $z > 0$, we have:
\[
\sum_{n=1}^{N} \frac{\alpha}{n - 1 + \alpha} = \alpha\,[\psi(\alpha + N) - \psi(\alpha)].
\]
Gordon [76, Theorem 5] says that
\[
\psi(\alpha + N) \ge \ln(\alpha + N) - \frac{1}{2(\alpha + N)} - \frac{1}{12(\alpha + N)^2} \ge \ln N - 1.
\]
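Both the digamma identity and the sandwich bounds of Lemma 2.E.10 can be checked numerically. In the Python sketch below, the digamma implementation (recurrence plus a truncated asymptotic series) is our own stand-in, since the standard library lacks one; the values of $\alpha$ and $N$ are arbitrary:

```python
import math

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x, then the
    # standard asymptotic expansion once x is large
    s = 0.0
    while x < 10:
        s -= 1.0 / x
        x += 1.0
    return s + math.log(x) - 1 / (2 * x) - 1 / (12 * x**2) + 1 / (120 * x**4)

alpha, N = 1.5, 1000
total = sum(alpha / (n - 1 + alpha) for n in range(1, N + 1))
# identity: the sum equals alpha * (psi(alpha + N) - psi(alpha))
assert abs(total - alpha * (digamma(alpha + N) - digamma(alpha))) < 1e-8
# sandwich bounds from Lemma 2.E.10
assert total <= alpha * (math.log(N) + math.log(alpha + 1) - digamma(alpha))
assert total >= alpha * (math.log(N) - digamma(alpha) - 1)
```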
We list a collection of technical lemmas that are used when verifying Condition 2.4.1 for the recurring examples.
The first set assists in the beta–Bernoulli model.
• For $m, x, y > 0$, $m \le y$:
\[
\frac{m + x}{y + x} - \frac{m}{y} \le \frac{x}{y}. \tag{2.38}
\]
\[
d_{TV}\big(\mathrm{NB}(m, t^{-1}),\, \mathrm{NB}(m + x, t^{-1})\big) \le x\,\frac{1/t}{1 - 1/t}, \tag{2.41}
\]
\[
\Big|\frac{m}{y} - K\,\frac{\Gamma(m/K + y)}{\Gamma(m/K)\,y!}\Big| \le e\,\frac{m^2}{K}. \tag{2.42}
\]
Proof of eq. (2.39). Set $g(x)$ to $(1 - x)\ln(1 - x) + x$. Then its derivative is $g'(x) = -\ln(1 - x) \ge 0$, meaning the function is monotone increasing. Since $g(0) = 0$, it's true that $g(x) \ge 0$ over $[0, 1)$.
Proof of eq. (2.40). Let $f(p) = (1 - x)^p + p\,\frac{x}{1 - x} - 1$. Then $f'(p) = \ln(1 - x)(1 - x)^p + \frac{x}{1 - x}$. Also $f''(p) = (\ln(1 - x))^2 (1 - x)^p > 0$. So $f'(p)$ is monotone increasing. At $p = 0$,
• $\sum_{i=1}^{N} Y_i \sim \mathrm{NB}(r, \theta)$.
Therefore, by the propagation rule Lemma 2.E.8, to compare $\mathrm{NB}(m, t^{-1})$ with $\mathrm{NB}(m + x, t^{-1})$, it suffices to compare the two generating Poissons.
Proof of eq. (2.42). Since $\Gamma\big(\frac{m}{K} + y\big) = \Gamma\big(\frac{m}{K}\big)\,\frac{m}{K}\prod_{j=1}^{y-1}\big(\frac{m}{K} + j\big)$, we have:
\[
\Big|\frac{m}{y} - K\,\frac{\Gamma(m/K + y)}{\Gamma(m/K)\,y!}\Big| = \frac{m}{y}\Big(\prod_{j=1}^{y-1}\frac{m/K + j}{j} - 1\Big).
\]
Since $\prod_{j=1}^{y-1}\big(1 + \frac{m}{Kj}\big) \le \exp\big(\frac{m}{K}\sum_{j=1}^{y-1}\frac{1}{j}\big) \le \exp\big(\frac{m}{K}(1 + \ln y)\big) = (ey)^{m/K}$,
\[
\Big|\frac{m}{y} - K\,\frac{\Gamma(m/K + y)}{\Gamma(m/K)\,y!}\Big| \le \frac{m}{y}\big((ey)^{m/K} - 1\big).
\]
For $a \in (0, 1)$ and $u \ge 1$:
\[
u^a - 1 \le a(u - 1).
\]
ua − 1 ≤ a(u − 1).
Truly, consider the function $g(u) = a(u - 1) - u^a + 1$. The derivative is $g'(u) = a - a u^{a-1} = a(1 - u^{a-1})$. Since $a \in (0, 1)$ and $u \ge 1$, $g'(u) \ge 0$. Therefore $g(u)$ is monotone increasing. Since $g(1) = 0$, we have reached the conclusion. Applying to our situation:
\[
(ey)^{m/K} - 1 \le \frac{m}{K}(ey - 1).
\]
In all:
\[
\Big|\frac{m}{y} - K\,\frac{\Gamma(m/K + y)}{\Gamma(m/K)\,y!}\Big| \le e\,\frac{m^2}{K}.
\]
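Eq. (2.42) is easy to spot-check with log-gamma arithmetic. The Python sketch below sweeps a small illustrative grid of $m$, $K$, $y$ with $K > m$, as the proof requires $m/K \in (0, 1)$:

```python
import math

def lhs(m, K, y):
    # | m/y - K * Gamma(m/K + y) / (Gamma(m/K) * y!) |
    term = K * math.exp(math.lgamma(m / K + y) - math.lgamma(m / K)
                        - math.lgamma(y + 1))
    return abs(m / y - term)

for m in (0.5, 1.0, 3.0):
    for K in (10, 100):
        for y in range(1, 30):
            # eq. (2.42): the gap is at most e * m^2 / K
            assert lhs(m, K, y) <= math.e * m * m / K + 1e-12
```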
Proof of eq. (2.43). First we prove that for any $x \in [0, 1)$:
\[
\sqrt{1 - x}\,\ln(1 - x) + x \ge 0.
\]
Truly, let $g(x)$ be the function on the left-hand side. Then its derivative is
\[
g'(x) = \frac{2\sqrt{1 - x} - \ln(1 - x) - 2}{2\sqrt{1 - x}}.
\]
Denote the numerator function by $h(x)$. Its derivative is
\[
h'(x) = \frac{1}{1 - x} - \frac{1}{\sqrt{1 - x}} \ge 0,
\]
since $x \in [0, 1)$, meaning $h$ is monotone increasing. Since $h(0) = 0$, it means $h(x) \ge 0$. This means $g'(x) \ge 0$, i.e. $g$ itself is monotone increasing. Since $g(0) = 0$ it's true that $g(x) \ge 0$ for all $x \in [0, 1)$.
Second we prove that for all $x \in [0, 1)$, for all $p \ge 0$:
\[
(1 - x)^p + p\,\frac{x}{\sqrt{1 - x}} - 1 \ge 0. \tag{2.47}
\]
Let $f(p)$ denote the left-hand side, so that $f'(p) = \ln(1 - x)(1 - x)^p + \frac{x}{\sqrt{1 - x}}$ and $f''(p) = (\ln(1 - x))^2(1 - x)^p \ge 0$; hence $f'$ is increasing. By the first step, $f'(0) = \ln(1 - x) + \frac{x}{\sqrt{1 - x}} \ge 0$. Therefore $f'(p) \ge 0$ for all $p$. So $f(p)$ is increasing. Since $f(0) = 0$, it's true that $f(p) \ge 0$ for all $p$.
We finally prove the inequality about beta functions.
\begin{align*}
B(x, y) - B(x, z) &= \int_0^1 \theta^{x-1}(1 - \theta)^{y-1}\big(1 - (1 - \theta)^{z-y}\big)\, d\theta \\
&\le \int_0^1 \theta^{x-1}(1 - \theta)^{y-1}(z - y)\,\theta\,(1 - \theta)^{-0.5}\, d\theta \\
&= (z - y)\int_0^1 \theta^{x}(1 - \theta)^{y-1.5}\, d\theta = (z - y)\,B(x + 1, y - 0.5),
\end{align*}
where we have used $1 - (1 - \theta)^{z-y} \le (z - y)\,\theta\,(1 - \theta)^{-1/2}$ from eq. (2.47). As for $B(x + 1, y - 0.5) \le B(x + 1, y - 1)$, it is because of the monotonicity of the beta function.
Proof of eq. (2.44).
\begin{align*}
\sum_{y=1}^{\infty} \frac{\Gamma(y + r)}{y!\,\Gamma(r)}\,B(y, b + r) &= \int_0^1 \sum_{y=1}^{\infty} \frac{\Gamma(y + r)}{y!\,\Gamma(r)}\,\theta^{y-1}(1 - \theta)^{b+r-1}\, d\theta \\
&= \int_0^1 \theta^{-1}\Big(\sum_{y=1}^{\infty} \frac{\Gamma(y + r)}{y!\,\Gamma(r)}\,\theta^{y}\Big)(1 - \theta)^{b+r-1}\, d\theta \\
&= \int_0^1 \theta^{-1}\Big(\frac{1}{(1 - \theta)^r} - 1\Big)(1 - \theta)^{b+r-1}\, d\theta \\
&= \int_0^1 \theta^{-1}\big(1 - (1 - \theta)^r\big)(1 - \theta)^{b-1}\, d\theta \\
&\le \int_0^1 \theta^{-1}\, r\,\frac{\theta}{\sqrt{1 - \theta}}\,(1 - \theta)^{b-1}\, d\theta \\
&= r\int_0^1 (1 - \theta)^{b-1.5}\, d\theta = \frac{r}{b - 0.5},
\end{align*}
where the identity $\sum_{y=1}^{\infty} \frac{\Gamma(y + r)}{y!\,\Gamma(r)}\,\theta^{y} = \frac{1}{(1 - \theta)^r} - 1$ is due to the normalization constant for negative binomial distributions.
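Eq. (2.44) can be probed numerically; truncating the series only decreases the left-hand side, so the comparison below is conservative (a Python sketch with illustrative values of $r$ and $b$):

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def series(r, b, trunc=2000):
    # truncated sum_{y>=1} Gamma(y+r)/(y! Gamma(r)) * B(y, b + r)
    return sum(math.exp(math.lgamma(y + r) - math.lgamma(y + 1)
                        - math.lgamma(r) + log_beta(y, b + r))
               for y in range(1, trunc))

for r, b in [(0.5, 2.0), (1.0, 5.0), (2.0, 1.0)]:
    # eq. (2.44): the series is at most r / (b - 0.5)
    assert series(r, b) <= r / (b - 0.5) + 1e-9
```

For $r = 1$, $b = 1$ the series telescopes to $\sum_{y \ge 1} 1/(y(y+1)) = 1$, comfortably below the bound $2$.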
\[
\Big|1 - \frac{\Gamma(b)}{\Gamma(b + c/K)}\Big| \le \frac{c}{K}(2 + \ln b). \tag{2.45}
\]
The recursion defining $\Gamma(b)$ allows us to write:
\[
1 - \frac{\Gamma(b)}{\Gamma(b + c/K)} = 1 - \prod_{i=1}^{\lfloor b \rfloor - 1} \frac{b - i}{b + c/K - i}\,\frac{\Gamma(b - \lfloor b \rfloor + 1)}{\Gamma(b + c/K - \lfloor b \rfloor + 1)}.
\]
Else, $\frac{\Gamma(b - \lfloor b \rfloor + 1)}{\Gamma(b + c/K - \lfloor b \rfloor + 1)} < 1$ and we write:
\begin{align*}
1 - \frac{\Gamma(b)}{\Gamma(b + c/K)}
&= 1 - \frac{\Gamma(b - \lfloor b \rfloor + 1)}{\Gamma(b + c/K - \lfloor b \rfloor + 1)} + \frac{\Gamma(b - \lfloor b \rfloor + 1)}{\Gamma(b + c/K - \lfloor b \rfloor + 1)}\Big(1 - \prod_{i=1}^{\lfloor b \rfloor - 1} \frac{b - i}{b + c/K - i}\Big) \\
&\le \Big(1 - \frac{\Gamma(b - \lfloor b \rfloor + 1)}{\Gamma(b + c/K - \lfloor b \rfloor + 1)}\Big) + \frac{c}{K}(\ln b + 1).
\end{align*}
We now argue that for all $x \in [1, 2)$, for all $K \ge c$, $1 - \frac{\Gamma(x)}{\Gamma(x + c/K)} \le \frac{c}{K}$. By convexity of $\Gamma(x)$, we know that $\Gamma(x) \ge \Gamma(x + c/K) - \frac{c}{K}\,\Gamma'(x + c/K)$. Hence $\frac{\Gamma(x)}{\Gamma(x + c/K)} \ge 1 - \frac{c}{K}\,\frac{\Gamma'(x + c/K)}{\Gamma(x + c/K)}$. Since $x + c/K \in [1, 3)$ and $\psi(y) = \frac{\Gamma'(y)}{\Gamma(y)}$, the digamma function, is a monotone increasing function (it is the derivative of $\ln\Gamma(x)$, which is convex), $\frac{\Gamma'(x + c/K)}{\Gamma(x + c/K)} \le \frac{\Gamma'(3)}{\Gamma(3)} \le 1$. Applying this to $x = b - \lfloor b \rfloor + 1$, we conclude that:
\[
1 - \frac{\Gamma(b)}{\Gamma(b + c/K)} \le \frac{c}{K}(2 + \ln b).
\]
We now show that:
\[
\frac{\Gamma(b)}{\Gamma(b + c/K)} - 1 \ge -\frac{c}{K}(\ln b + \ln 2).
\]
Convexity of $\Gamma(y)$ means that:
\[
\Gamma(b) \ge \Gamma(b + c/K) - \frac{c}{K}\,\Gamma'(b + c/K)
\;\Rightarrow\;
\frac{\Gamma(b)}{\Gamma(b + c/K)} - 1 \ge -\frac{c}{K}\,\frac{\Gamma'(b + c/K)}{\Gamma(b + c/K)}.
\]
From Alzer [4, Equation 2.2], we know that $\psi(x) \le \ln(x)$ for positive $x$. Therefore:
\[
-\frac{c}{K}\,\frac{\Gamma'(b + c/K)}{\Gamma(b + c/K)} \ge -\frac{c}{K}\ln(b + c/K) \ge -\frac{c}{K}(\ln b + \ln 2),
\]
since $b + \frac{c}{K} \le 2b$. We combine the two sides of the inequality to conclude that the absolute value is at most $\frac{c}{K}(2 + \ln b)$.
Proof of eq. (2.46).
\begin{align*}
\Big|c - \frac{K}{B(c/K, b)}\Big| &= c\,\Big|\frac{K/c}{\Gamma(c/K)}\,\frac{\Gamma(c/K + b)}{\Gamma(b)} - 1\Big| \\
&= c\,\Big|\frac{K/c}{\Gamma(c/K)}\Big(\frac{\Gamma(c/K + b)}{\Gamma(b)} - 1\Big) + \Big(\frac{K/c}{\Gamma(c/K)} - 1\Big)\Big| \\
&\le c\,\Big[\frac{K/c}{\Gamma(c/K)}\,\Big|\frac{\Gamma(c/K + b)}{\Gamma(b)} - 1\Big| + \Big|\frac{K/c}{\Gamma(c/K)} - 1\Big|\Big].
\end{align*}
On the one hand:
\[
\frac{K/c}{\Gamma(c/K)} = \frac{\Gamma(1)}{\Gamma(1 + c/K)}.
\]
From eq. (2.45), we know:
\[
\Big|\frac{\Gamma(1)}{\Gamma(1 + c/K)} - 1\Big| \le \frac{2c}{K}.
\]
On the other hand, let $y = \Gamma(b)/\Gamma(c/K + b)$. Then:
\[
\Big|\frac{\Gamma(c/K + b)}{\Gamma(b)} - 1\Big| = \Big|\frac{1}{y} - 1\Big| = \frac{|1 - y|}{y},
\]
\[
\Big|\frac{\Gamma(c/K + b)}{\Gamma(b)} - 1\Big| \le \frac{2c}{K}(2 + \ln b).
\]
In all:
\[
\Big|c - \frac{K}{B(c/K, b)}\Big| \le c\Big(1 + \frac{2c}{K}\Big)\frac{2c}{K}(2 + \ln b) + \frac{2c}{K} \le \frac{c}{K}(3\ln b + 8).
\]
2.F Verification of upper bound’s assumptions for additional examples
Recall the definitions of $h$, $\tilde h$, and $M_{n,x}$ for exponential family CRM-likelihood in section 2.C. Therefore, $h$ is
\begin{align*}
h(x_n = x \mid x_{1:(n-1)}) &= \frac{1}{x!}\,\frac{\Gamma\big(\sum_{i=1}^{n-1} x_i + x\big)\,(\lambda + n)^{-\left(\sum_{i=1}^{n-1} x_i + x\right)}}{\Gamma\big(\sum_{i=1}^{n-1} x_i\big)\,(\lambda + n - 1)^{-\sum_{i=1}^{n-1} x_i}} \\
&= \frac{1}{x!}\,\frac{\Gamma\big(\sum_{i=1}^{n-1} x_i + x\big)}{\Gamma\big(\sum_{i=1}^{n-1} x_i\big)}\Big(\frac{1}{\lambda + n}\Big)^{x}\Big(1 - \frac{1}{\lambda + n}\Big)^{\sum_{i=1}^{n-1} x_i},
\end{align*}
and similarly $\tilde h$ is
\begin{align*}
\tilde h(x_n = x \mid x_{1:(n-1)}) &= \frac{1}{x!}\,\frac{\Gamma\big(\sum_{i=1}^{n-1} x_i + x + \gamma\lambda/K\big)\,(\lambda + n)^{-\left(\sum_{i=1}^{n-1} x_i + x + \gamma\lambda/K\right)}}{\Gamma\big(\sum_{i=1}^{n-1} x_i + \gamma\lambda/K\big)\,(\lambda + n - 1)^{-\left(\sum_{i=1}^{n-1} x_i + \gamma\lambda/K\right)}} \\
&= \frac{1}{x!}\,\frac{\Gamma\big(\sum_{i=1}^{n-1} x_i + x + \gamma\lambda/K\big)}{\Gamma\big(\sum_{i=1}^{n-1} x_i + \gamma\lambda/K\big)}\Big(\frac{1}{\lambda + n}\Big)^{x}\Big(1 - \frac{1}{\lambda + n}\Big)^{\sum_{i=1}^{n-1} x_i + \gamma\lambda/K},
\end{align*}
and $M_{n,x}$ is
\[
M_{n,x} = \gamma\lambda\,\frac{1}{x!}\,\Gamma(x)\,(\lambda + n)^{-x} = \frac{\gamma\lambda}{x\,(\lambda + n)^{x}}.
\]
Now, we state the constants so that gamma–Poisson satisfies Condition 2.4.1, and give the proof.
Proposition 2.F.1 (Gamma–Poisson satisfies Condition 2.4.1). The following hold for arbitrary $\gamma, \lambda > 0$. For any $n$:
\[
\sum_{x=1}^{\infty} M_{n,x} \le \frac{\gamma\lambda}{n - 1 + \lambda},
\qquad
\sum_{x=1}^{\infty} \tilde h(x \mid x_{1:(n-1)} = 0_{n-1}) \le \frac{\gamma\lambda}{n - 1 + \lambda}.
\]
For any $K$:
\[
\sum_{x=0}^{\infty} \big|h(x \mid x_{1:(n-1)}) - \tilde h(x \mid x_{1:(n-1)})\big| \le \frac{2\gamma\lambda}{K}\,\frac{1}{n - 1 + \lambda}.
\]
For any $K \ge \gamma\lambda$:
\[
\sum_{x=1}^{\infty} \big|M_{n,x} - K\,\tilde h(x \mid x_{1:(n-1)} = 0_{n-1})\big| \le \frac{\gamma^2\lambda + e\gamma^2\lambda^2}{K}\,\frac{1}{n - 1 + \lambda}.
\]
Proof of Proposition 2.F.1. The growth rate condition of the target model is simple:
\[
\sum_{x=1}^{\infty} M_{n,x} = \gamma\lambda\sum_{x=1}^{\infty} \frac{1}{x(\lambda + n)^x} \le \gamma\lambda\sum_{x=1}^{\infty} \frac{1}{(\lambda + n)^x} = \frac{\gamma\lambda}{n - 1 + \lambda}.
\]
The two negative binomial distributions have the same success probability and only differ in the number of trials. Hence using eq. (2.41), we have:
\[
\sum_{x=0}^{\infty} \big|h(x \mid x_{1:(n-1)}) - \tilde h(x \mid x_{1:(n-1)})\big| \le 2\,\frac{\gamma\lambda}{K}\,\frac{(\lambda + n)^{-1}}{1 - (\lambda + n)^{-1}} = \frac{2\gamma\lambda}{K}\,\frac{1}{n - 1 + \lambda},
\]
where the factor 2 reflects how total variation distance is $1/2$ the $L_1$ distance between p.m.f.'s. For the total variation between $M_{n,\cdot}$ and $K\tilde h(\cdot \mid 0)$ condition,
\begin{align*}
&\sum_{x=1}^{\infty} \big|M_{n,x} - K\,\tilde h(x \mid x_{1:(n-1)} = 0_{n-1})\big| \\
&= \sum_{x=1}^{\infty} \frac{1}{(\lambda + n)^x}\,\Big|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K + x)}{\Gamma(\gamma\lambda/K)\,x!}\Big(1 - \frac{1}{\lambda + n}\Big)^{\gamma\lambda/K}\Big| \\
&\le \sum_{x=1}^{\infty} \frac{1}{(\lambda + n)^x}\Big[\frac{\gamma\lambda}{x}\Big(1 - \Big(1 - \frac{1}{\lambda + n}\Big)^{\gamma\lambda/K}\Big) + \Big|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K + x)}{\Gamma(\gamma\lambda/K)\,x!}\Big|\Big].
\end{align*}
Using eq. (2.41) we can upper bound:
\[
1 - \Big(1 - \frac{1}{\lambda + n}\Big)^{\gamma\lambda/K} \le \frac{\gamma\lambda}{K}\,\frac{1}{\lambda + n - 1},
\qquad
\Big|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K + x)}{\Gamma(\gamma\lambda/K)\,x!}\Big| \le \frac{e\gamma^2\lambda^2}{K}.
\]
This means:
\begin{align*}
\sum_{x=1}^{\infty} \big|M_{n,x} - K\,\tilde h(x \mid x_{1:(n-1)} = 0_{n-1})\big|
&\le \sum_{x=1}^{\infty} \frac{1}{(\lambda + n)^x}\,\frac{\gamma\lambda}{x}\,\frac{\gamma\lambda}{K}\,\frac{1}{\lambda + n - 1} + \sum_{x=1}^{\infty} \frac{1}{(\lambda + n)^x}\,\frac{e\gamma^2\lambda^2}{K} \\
&\le \frac{\gamma^2\lambda^2}{K}\,\frac{1}{(\lambda + n - 1)^2} + \frac{e\gamma^2\lambda^2}{K}\,\frac{1}{\lambda + n - 1} \\
&\le \frac{\gamma^2\lambda + e\gamma^2\lambda^2}{K}\,\frac{1}{n - 1 + \lambda}.
\end{align*}
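The gamma–Poisson conditions just proved can be checked numerically for specific hyperparameters (an illustrative Python sketch; the values $\gamma = 1$, $\lambda = 2$, $n = 3$, $K = 10$ are our choices, with $K \ge \gamma\lambda$):

```python
import math

gamma_, lam, n, K = 1.0, 2.0, 3, 10   # illustrative values with K >= gamma*lam
p = 1.0 / (lam + n)

def M(x):
    # target-model rates M_{n,x} = gamma*lam / (x (lam + n)^x)
    return gamma_ * lam / (x * (lam + n) ** x)

def h_tilde0(x):
    # tilde h(x | 0_{n-1}): negative binomial pmf with shape gamma*lam/K
    a = gamma_ * lam / K
    return (math.exp(math.lgamma(a + x) - math.lgamma(a) - math.lgamma(x + 1))
            * p ** x * (1 - p) ** a)

trunc = 400
assert sum(M(x) for x in range(1, trunc)) <= gamma_ * lam / (n - 1 + lam) + 1e-12
assert sum(h_tilde0(x) for x in range(1, trunc)) <= gamma_ * lam / (n - 1 + lam) + 1e-12
diff = sum(abs(M(x) - K * h_tilde0(x)) for x in range(1, trunc))
bound = (gamma_**2 * lam + math.e * gamma_**2 * lam**2) / K / (n - 1 + lam)
assert diff <= bound + 1e-9
```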
which means that $\kappa(x) = \Gamma(x + r)/(\Gamma(r)\,x!)$, $\phi(x) = x$, $\mu(\theta) = 0$, $A(\theta) = -r\ln(1 - \theta)$. This leads to the normalizer:
\[
Z = \int_0^1 \theta^{\xi}(1 - \theta)^{r\lambda}\, d\theta = B(\xi + 1, r\lambda + 1),
\]
and $\tilde h$ is
\[
\tilde h(x_n = x \mid x_{1:(n-1)}) = \frac{\Gamma(x + r)}{x!\,\Gamma(r)}\,\frac{B\big(\gamma\alpha/K + \sum_{i=1}^{n-1} x_i + x,\; rn + \alpha\big)}{B\big(\gamma\alpha/K + \sum_{i=1}^{n-1} x_i,\; r(n - 1) + \alpha\big)},
\]
and $M_{n,x}$ is
\[
M_{n,x} = \gamma\alpha\,\frac{\Gamma(x + r)}{x!\,\Gamma(r)}\,B(x, rn + \alpha).
\]
Now, we state the constants so that beta–negative binomial satisfies Condition 2.4.1, and
give the proof.
Proposition 2.F.2 (Beta–negative binomial satisfies Condition 2.4.1). The following hold for any $\gamma > 0$ and $\alpha > 1$. For any $n$:
\[
\sum_{x=1}^{\infty} M_{n,x} \le \frac{\gamma\alpha}{n - 1 + (\alpha - 0.5)/r}.
\]
For any $K$:
\[
\sum_{x=0}^{\infty} \big|h(x \mid x_{1:(n-1)}) - \tilde h(x \mid x_{1:(n-1)})\big| \le \frac{2\gamma\alpha}{K}\,\frac{1}{n - 1 + \alpha/r}.
\]
For any $n$, for $K \ge \gamma\alpha(3\ln(r(n - 1) + \alpha) + 8)$:
\[
\sum_{x=1}^{\infty} \big|M_{n,x} - K\,\tilde h(x \mid x_{1:(n-1)} = 0_{n-1})\big| \le \frac{\gamma\alpha}{K}\,\frac{(4\gamma\alpha + 3)\ln(rn + \alpha + 1) + (10 + 2r)\gamma\alpha + 24}{n - 1 + (\alpha - 0.5)/r}.
\]
Proof of Proposition 2.F.2. The growth rate condition for the target model is easy to verify:
\[
\sum_{x=1}^{\infty} M_{n,x} = \gamma\alpha\sum_{x=1}^{\infty} \frac{\Gamma(x + r)}{\Gamma(r)\,x!}\,B(x, rn + \alpha) \le \gamma\alpha\,\frac{r}{r(n - 1) + \alpha - 0.5},
\]
The denominator is large because of eq. (2.46) with $c = \gamma\alpha$, $b = r(n - 1) + \alpha$:
\[
\frac{1}{B(\gamma\alpha/K, r(n - 1) + \alpha)} \le \frac{4\gamma\alpha}{K}.
\]
For the total variation between $h$ and $\tilde h$ condition, we first discuss how each function can be expressed as the p.m.f. of a so-called beta negative binomial, i.e., BNB [101, Section 6.2.3] distribution. Let $A = \sum_{i=1}^{n-1} x_i$. Observe that:
The random variable V1 whose p.m.f at x appears on the right hand side of eq. (2.48) is the
result of a two-step sampling procedure:
For any $p$, we use eq. (2.41) to upper bound the total variation distance between negative binomial distributions. Therefore:
\begin{align*}
d_{TV}\big(h, \tilde h\big) &\le \int_0^1 \frac{\gamma\alpha}{K}\,\frac{p}{1 - p}\,\mathbb{P}(P \in dp) \\
&= \frac{\gamma\alpha}{K}\,\frac{1}{B(r, r(n - 1) + \alpha)}\int_0^1 p^{r}(1 - p)^{r(n-1)+\alpha-2}\, dp \\
&= \frac{\gamma\alpha}{K}\,\frac{B(r + 1, r(n - 1) + \alpha - 1)}{B(r, r(n - 1) + \alpha)} = \frac{\gamma\alpha}{K}\,\frac{1}{n - 1 + \alpha/r}.
\end{align*}
We look at the summand for $x = 1$ and the summation from $x = 2$ through $\infty$ separately. For $x = 1$, we prove that:
\[
\Big|\gamma\alpha\,B(1, rn + \alpha) - K\,\frac{B(1 + \gamma\alpha/K, rn + \alpha)}{B(\gamma\alpha/K, r(n - 1) + \alpha)}\Big|
= \frac{\big|\gamma\alpha\,B(1, rn + \alpha)\,B(\gamma\alpha/K, r(n - 1) + \alpha) - K\,B(1 + \gamma\alpha/K, rn + \alpha)\big|}{B(\gamma\alpha/K, r(n - 1) + \alpha)}. \tag{2.50}
\]
eq. (2.50) is upper bounded by:
\begin{align*}
\frac{2\gamma^2\alpha^2}{rn + \alpha}\,\frac{2 + \ln(rn + \alpha + 1)}{K}\,\frac{\Gamma(\gamma\alpha/K)}{B(\gamma\alpha/K, r(n - 1) + \alpha)}
&= \frac{2\gamma^2\alpha^2}{rn + \alpha}\,\frac{2 + \ln(rn + \alpha + 1)}{K}\,\frac{\Gamma(\gamma\alpha/K + r(n - 1) + \alpha)}{\Gamma(r(n - 1) + \alpha)} \\
&\le \frac{4\gamma^2\alpha^2}{K}\,\frac{2 + \ln(rn + \alpha + 1)}{rn + \alpha},
\end{align*}
since $\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(r(n-1)+\alpha+\gamma\alpha/K)} \ge 1 - \frac{\gamma\alpha}{K}(2 + \ln(r(n - 1) + \alpha)) \ge 0.5$ with $K \ge 2\gamma\alpha(2 + \ln(r(n - 1) + \alpha))$. Combining with $\Gamma(r + 1)/\Gamma(r) = r$, this is the proof of eq. (2.49).
We now move onto the summands from $x = 2$ to $\infty$. By the triangle inequality:
\[
\Big|\gamma\alpha\,B(x, rn + \alpha) - K\,\frac{B(\gamma\alpha/K + x, rn + \alpha)}{B(\gamma\alpha/K, r(n - 1) + \alpha)}\Big| \le T_1(x) + T_2(x),
\]
where:
\[
T_1(x) := B(x, rn + \alpha)\,\Big|\gamma\alpha - \frac{K}{B(\gamma\alpha/K, r(n - 1) + \alpha)}\Big|,
\qquad
T_2(x) := K\,\frac{\big|B(x, rn + \alpha) - B\big(\frac{\gamma\alpha}{K} + x, rn + \alpha\big)\big|}{B(\gamma\alpha/K, r(n - 1) + \alpha)}.
\]
We have:
\[
\Big|\gamma\alpha - \frac{K}{B(\gamma\alpha/K, r(n - 1) + \alpha)}\Big| \le \frac{\gamma\alpha}{K}(3\ln(r(n - 1) + \alpha) + 8),
\]
\[
\frac{K}{B(\gamma\alpha/K, r(n - 1) + \alpha)} \le \gamma\alpha + \frac{\gamma\alpha}{K}(3\ln(r(n - 1) + \alpha) + 8) \le 2\gamma\alpha,
\]
\[
\big|B(x, rn + \alpha) - B(\gamma\alpha/K + x, rn + \alpha)\big| \le \frac{\gamma\alpha}{K}\,B(x - 1, rn + \alpha + 1);
\]
since $K \ge \gamma\alpha(3\ln(r(n - 1) + \alpha) + 8)$, we have applied eq. (2.46) in the first and second inequality and eq. (2.43) in the third one. So for each $x \ge 2$, each summand is at most
and:
\begin{align*}
\sum_{x=2}^{\infty} \frac{\Gamma(x + r)}{\Gamma(r)\,x!}\,B(x - 1, rn + \alpha + 1)
&\le r\sum_{x=2}^{\infty} \frac{\Gamma(x - 1 + r + 1)}{\Gamma(r + 1)\,(x - 1)!}\,B(x - 1, rn + \alpha + 1) \\
&\le r\sum_{z=1}^{\infty} \frac{\Gamma(z + r + 1)}{\Gamma(r + 1)\,z!}\,B(z, rn + \alpha + 1) \\
&\le \frac{r(r + 1)}{r(n - 1) + \alpha - 0.5},
\end{align*}
where we have used eq. (2.44) in each upper bound. So the summation from $x = 2$ to $\infty$ is upper bounded by:
\[
\frac{\gamma\alpha(3\ln(r(n - 1) + \alpha) + 8)}{K}\,\frac{r}{r(n - 1) + \alpha - 0.5} + \frac{2\gamma^2\alpha^2}{K}\,\frac{r(r + 1)}{r(n - 1) + \alpha - 0.5}. \tag{2.51}
\]
K r(n − 1) + α − 0.5 K r(n − 1) + α − 0.5
eqs. (2.49) and (2.51) combine to give:
\[
\sum_{x=1}^{\infty} \big|M_{n,x} - K\,\tilde h(x \mid x_{1:(n-1)} = 0_{n-1})\big| \le \frac{\gamma\alpha}{K}\,\frac{(4\gamma\alpha + 3)\ln(rn + \alpha + 1) + (10 + 2r)\gamma\alpha + 24}{n - 1 + (\alpha - 0.5)/r}.
\]
and
\begin{align*}
C'' &= (\beta + 1)C_1(2C_1 + C_4) + [(\beta + 1)C_1 + C_2]/\ln 2 \\
&\quad + (\beta + 1)\left[C_1(4C_1\ln(1 + 1/C_1) + C_5) + (2C_1 + C_4)C_1\ln(1 + 1/C_1)\right]/\ln 2,
\end{align*}
and
\[
C''' = (\beta + 1)\,2C_1^2\ln(1 + 1/C_1),
\qquad
C'''' = (\beta + 1)\,2C_1^2\ln(1 + 1/C_1) + (\beta + 1)C_1.
\]
By the end of the proof, the reasoning for these constants will be clear.
We will focus on the case where the approximation level $K$ is $\Omega(\ln N)$:
\[
K \ge \max\{(\beta + 1)\max(C(K, C_1), C(N, C_1)),\; C_2(\ln N + C_3)\},
\]
where $C(N, \alpha)$ is the growth function from eq. (2.18). To see why it is sufficient, consider the case where $K < \max\{(\beta + 1)\max(C(K, C_1), C(N, C_1)),\; C_2(\ln N + C_3)\}$. This implies that $K$ is smaller than a sum:
\begin{align*}
K &< (\beta + 1)(C(N, C_1) + C(K, C_1)) + C_2(\ln N + C_3) \\
&\le [(\beta + 1)C_1 + C_2]\ln N + (\beta + 1)C_1\ln K + (\beta + 1)\,2C_1\ln(1 + 1/C_1) + C_2C_3,
\end{align*}
where we have used the upper bound on the growth function from Lemma 2.E.10. Total variation distance is always upper bounded by 1. Hence, $d_{TV}(P_{N,\infty}, P_{N,K})$ is at most
2. Each column corresponds to an atom location: the locations are sorted first according to the index of the first measure $X_i$ to manifest it (counting from $1, 2, \ldots$), and then its atom size in $X_i$.
For illustration, suppose $X_1 = 3\delta_{\psi_1} + 4\delta_{\psi_2} + 4\delta_{\psi_3}$, $X_2 = 2\delta_{\psi_1} + \delta_{\psi_3} + \delta_{\psi_4} + 2\delta_{\psi_5}$ and $X_3 = 6\delta_{\psi_2} + 2\delta_{\psi_3} + \delta_{\psi_5} + 2\delta_{\psi_6} + 3\delta_{\psi_7}$. Then the associated trait-allocation matrix has 3 rows and 7 columns and has entries equal to
\[
\begin{pmatrix}
3 & 4 & 4 & 0 & 0 & 0 & 0 \\
2 & 0 & 1 & 1 & 2 & 0 & 0 \\
0 & 6 & 2 & 0 & 1 & 2 & 3
\end{pmatrix}. \tag{2.54}
\]
The marginal process that describes the atom sizes of $X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1$ in Proposition 2.C.1 is also the description of how the rows of $F$ are generated. The joint distribution $X_1, X_2, \ldots, X_n$ can be two-step sampled. First, the trait-allocation matrix $F$ is sampled. Then, the atom locations are drawn iid from the base measure $H$: each column of $F$ is assigned an atom location, and the latent measure $X_i$ has atom size $F_{i,j}$ on the $j$th atom location. A similar two-step sampling generates $Z_1, Z_2, \ldots, Z_n$, the latent measures under the approximate model: the distribution over the feature-allocation matrix $F'$ follows Proposition 2.C.2 instead of Proposition 2.C.1, but conditioned on the feature-allocation matrix, the process generating atom locations and constructing latent measures is exactly the same. In other words, this implies that the conditional distributions $Y_{1:N} \mid F$ and $W_{1:N} \mid F'$ when $F = F'$ are the same, since both models have the same observational likelihood $f$ given the latent measures 1 through $N$. Denote $P_F$ to be the distribution of the feature-allocation matrix under the target model, and $P_{F'}$ the distribution of the feature-allocation matrix under the approximate model. Lemma 2.E.8 implies that
Next, we parametrize the trait-allocation matrices in a way that is convenient for the analysis of total variation distance. Let $J$ be the number of columns of $F$. Our parametrization involves $d_{n,x}$, for $n \in [N]$ and $x \in \mathbb{N}$, and $s_j$, for $j \in [J]$:
1. For $n = 1, 2, \ldots, N$:
(a) If $n = 1$, for each $x \in \mathbb{N}$, $d_{1,x}$ counts the number of columns $j$ where $F_{1,j} = x$.
(b) For $n \ge 2$, let $J_n = \{j : \forall i < n,\, F_{i,j} = 0\}$, i.e. no observation before $n$ manifests the atom locations indexed by columns in $J_n$. For each $x \in \mathbb{N}$, $d_{n,x}$ counts the number of columns $j \in J_n$ where $F_{n,j} = x$.
2. For $j = 1, 2, \ldots, J$, let $I_j = \min\{i : F_{i,j} > 0\}$, i.e. the first row to manifest the $j$-th atom location. Let $s_j = F_{I_j:N,\, j}$, i.e. the history of the $j$-th atom location.
In words, $d_{n,x}$ is the number of atom locations that are first instantiated by individual $n$ with atom size $x$, while $s_j$ is the history of the $j$-th atom location. $\sum_{n=1}^{N}\sum_{x=1}^{\infty} d_{n,x}$ is exactly $J$, the number of columns. For the example in eq. (2.54):
1. For $n = 1, 2, 3$: $d_{1,3} = 1$, $d_{1,4} = 2$; $d_{2,1} = 1$, $d_{2,2} = 1$; $d_{3,2} = 1$, $d_{3,3} = 1$; all other $d_{n,x}$ are zero.
2. For $j = 1, 2, \ldots, 7$: $s_1 = [3, 2, 0]$, $s_2 = [4, 0, 6]$, $s_3 = [4, 1, 2]$, $s_4 = [1, 0]$, $s_5 = [2, 1]$, $s_6 = [2]$, $s_7 = [3]$.
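The bookkeeping above can be automated. The Python sketch below (not thesis code) reads the $(d, s)$ parametrization off the example matrix of eq. (2.54):

```python
# trait-allocation matrix from eq. (2.54); rows are measures, columns atoms
F = [[3, 4, 4, 0, 0, 0, 0],
     [2, 0, 1, 1, 2, 0, 0],
     [0, 6, 2, 0, 1, 2, 3]]
N, J = len(F), len(F[0])

# I_j: first (0-indexed) row to manifest column j
first_row = [min(i for i in range(N) if F[i][j] > 0) for j in range(J)]

# d[(n, x)]: number of columns first manifested by row n with atom size x
d = {}
for j in range(J):
    key = (first_row[j], F[first_row[j]][j])
    d[key] = d.get(key, 0) + 1

# s[j]: history of column j from its first manifestation onward
s = [[F[i][j] for i in range(first_row[j], N)] for j in range(J)]

assert sum(d.values()) == J                       # sum of d_{n,x} equals J
assert s[0] == [3, 2, 0] and s[1] == [4, 0, 6]    # matches the listed histories
```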
We use the short-hand $d$ to refer to the collection of $d_{n,x}$ and $s$ the collection of $s_j$. There is a one-to-one mapping between $(d, s)$ and the trait-allocation matrix $f$, since we can read off $(d, s)$ from $f$ and use $(d, s)$ to reconstruct $f$. Let $(D, S)$ be the distribution of $d$ and $s$ under the target model, while $(D', S')$ is the distribution under the approximate model. We have that
\[
d_{TV}(P_{N,\infty}, P_{N,K}) \le \inf_{(D,S),(D',S')\ \text{coupling of}\ P_{D,S},\, P_{D',S'}} \mathbb{P}\big((D, S) \ne (D', S')\big).
\]
To find an upper bound on dTV (PN,∞ , PN,K ), we will demonstrate a joint distribution such
that P((D, S) ̸= (D′ , S ′ )) is small. The rest of the proof is dedicated to that end. To start,
we only assume that (D, S, D′ , S ′ ) is a proper coupling, in that marginally (D, S) ∼ PD,S and
(D′ , S ′ ) ∼ PD′ ,S ′ . As we progress, gradually more structure is added to the joint distribution
(D, S, D′ , S ′ ) to control P((D, S) ̸= (D′ , S ′ )).
We first decompose $\mathbb{P}((D, S) \ne (D', S'))$ into other probabilistic quantities which can be analyzed using Condition 2.4.1. Define the typical set:
\[
D^* = \Big\{d : \sum_{n=1}^{N}\sum_{x=1}^{\infty} d_{n,x} \le (\beta + 1)\max(C(K, C_1), C(N, C_1))\Big\}.
\]
$d \in D^*$ means that the trait-allocation matrix $f$ has a small number of columns. The claim is that:
\begin{align*}
\mathbb{P}((D, S) \ne (D', S')) &= \mathbb{P}(D \ne D') + \mathbb{P}(S \ne S',\, D = D') \\
&= \mathbb{P}(D \ne D') + \mathbb{P}(S \ne S',\, D = D',\, D \in D^*) + \mathbb{P}(S \ne S',\, D = D',\, D \notin D^*) \\
&\le \mathbb{P}(D \ne D') + \mathbb{P}(S \ne S' \mid D = D',\, D \in D^*) + \mathbb{P}(D \notin D^*), \tag{2.56}
\end{align*}
The three ideas behind this upper bound are the following. First, because of the growth condition, we can analyze the atypical set probability $\mathbb{P}(D \notin D^*)$. Second, because of the total variation between $h$ and $\tilde h$, we can analyze $\mathbb{P}(S \ne S' \mid D = D',\, D \in D^*)$. Finally, we can analyze $\mathbb{P}(D \ne D')$ because of the total variation between $K\tilde h$ and $M_{n,\cdot}$. In what follows we carry out the program.
Atypical set probability. The $\mathbb{P}(D \notin D^*)$ term in eq. (2.56) is easiest to control. Under the target model Proposition 2.C.1, the $D_{i,x}$'s are independent Poissons with mean $M_{i,x}$, so the sum $\sum_{i=1}^{N}\sum_{x=1}^{\infty} D_{i,x}$ is itself a Poisson with mean $M = \sum_{i=1}^{N}\sum_{x=1}^{\infty} M_{i,x}$. Because of Lemma 2.E.3, for any $x > 0$:
\[
\mathbb{P}\Big(\sum_{i=1}^{N}\sum_{x=1}^{\infty} D_{i,x} > M + x\Big) \le \exp\Big(-\frac{x^2}{2(M + x)}\Big).
\]
\[
\mathbb{P}(D \notin D^*) \le \exp\Big(-\frac{\beta^2}{2(\beta + 1)}\max(C(K, C_1), C(N, C_1))\Big). \tag{2.57}
\]
Difference between histories. To minimize the difference probability between the histories of atom sizes, i.e. the $\mathbb{P}(S \ne S' \mid D = D',\, D \in D^*)$ term in eq. (2.56), we will use eq. (2.16). The claim is, there exists a coupling of $S' \mid D'$ and $S \mid D$ such that:
\[
\mathbb{P}(S \ne S' \mid D = D',\, D \in D^*) \le 2\,\frac{(\beta + 1)\max(C(K, C_1), C(N, C_1))}{K}\,C(N, C_1). \tag{2.58}
\]
Fix some $d \in D^*$ — since we are in the typical set, the number of columns in the trait-allocation matrix is at most $(\beta + 1)\max(C(K, C_1), C(N, C_1))$. Conditioned on $D = d$, there is a finite number of history variables $S$, one for each atom location; similarly for the conditioning of $S'$ on $D' = d$. For both the target and the approximate model, the density of the joint distribution factorizes:
\[
\mathbb{P}(S = s \mid D = d) = \prod_{j=1}^{J} \mathbb{P}(S_j = s_j \mid D = d),
\qquad
\mathbb{P}(S' = s \mid D' = d) = \prod_{j=1}^{J} \mathbb{P}(S'_j = s_j \mid D' = d),
\]
since in both marginal processes, the atom sizes for different atom locations are independent of each other. Each $S_j$ (or $S'_j$) only takes values from a countable set. Therefore, by Lemma 2.E.7,
\[
d_{TV}(P_{S \mid D = d}, P_{S' \mid D' = d}) \le \sum_{j=1}^{J} d_{TV}(P_{S_j \mid D = d}, P_{S'_j \mid D' = d}).
\]
We inspect each $d_{TV}(P_{S_j \mid D = d}, P_{S'_j \mid D' = d})$. Fixing $d$ also fixes $I_j$, the first row to manifest the $j$-th atom location. The history $s_j$ is then an $(N - I_j + 1)$-dimensional integer vector, whose $t$th entry is the atom size over the $j$th atom location of the $(t + I_j - 1)$th row. Because of eq. (2.16), we know that conditioned on the same partial history $S_j(1:(t-1)) = S'_j(1:(t-1)) = s$, the distributions of $S_j(t)$ and $S'_j(t)$ are very similar. The conditional distribution $S_j(t) \mid D = d, S_j(1:(t-1)) = s$ is governed by $h$ (Proposition 2.C.1) while $S'_j(t) \mid D' = d, S'_j(1:(t-1)) = s$ is governed by $\tilde h$ (Proposition 2.C.2). Hence:
\[
d_{TV}\big(P_{S_j(t) \mid D = d,\, S_j(1:(t-1)) = s},\; P_{S'_j(t) \mid D' = d,\, S'_j(1:(t-1)) = s}\big) \le \frac{2}{K}\,\frac{C_1}{t + I_j - 2 + C_1},
\]
for any partial history $s$. To use this conditional bound, we repeatedly use Lemma 2.E.6 to compare the joint $S_j = (S_j(1), S_j(2), \ldots, S_j(N - I_j + 1))$ with the joint $S'_j = (S'_j(1), S'_j(2), \ldots, S'_j(N - I_j + 1))$, peeling off one layer of random variables (indexed by $t$) at a time:
\begin{align*}
d_{TV}(P_{S_j \mid D = d}, P_{S'_j \mid D' = d})
&\le \sum_{t=1}^{N - I_j + 1} \max_{s}\; d_{TV}\big(P_{S_j(t) \mid D = d,\, S_j(1:(t-1)) = s},\; P_{S'_j(t) \mid D' = d,\, S'_j(1:(t-1)) = s}\big) \\
&\le \sum_{t=1}^{N - I_j + 1} \frac{2}{K}\,\frac{C_1}{t + I_j - 2 + C_1} \le 2\,\frac{C(N, C_1)}{K}.
\end{align*}
Multiplying the right hand side by (β + 1) max(C(K, C1 ), C(N, C1 )), the upper bound on J,
we arrive at the same upper bound for the total variation between PS | D=d and PS ′ | D′ =d in
eq. (2.58). Furthermore, our analysis of the total variation can be back-tracked to construct
the coupling between the conditional distributions S | D = d and S ′ | D′ = d which attains
that small probability of difference because all the distributions being analyzed are discrete.
Since the choice of conditioning d ∈ D∗ was arbitrary, we have actually shown eq. (2.58).
Difference between new atom sizes. Finally, to control the difference probability for the distribution over new atom sizes, i.e. the $\mathbb{P}(D \ne D')$ term in eq. (2.56), we will utilize eqs. (2.15) and (2.17). For each $n$, define the short-hand $d_{1:n}$ to refer to the collection $d_{i,x}$ for $i \in [n]$, $x \in \mathbb{N}$, and the typical sets:
\[
D_n^* = \Big\{d_{1:n} : \sum_{i=1}^{n}\sum_{x=1}^{\infty} d_{i,x} \le (\beta + 1)\max(C(K, C_1), C(N, C_1))\Big\}.
\]
The type of expansion performed in eq. (2.56) can be done once here to see that:
\begin{align*}
\mathbb{P}(D \ne D') &= \mathbb{P}\big((D_{1:(N-1)}, D_N) \ne (D'_{1:(N-1)}, D'_N)\big) \\
&\le \mathbb{P}(D_{1:(N-1)} \ne D'_{1:(N-1)}) \\
&\quad + \mathbb{P}\big(D_N \ne D'_N \mid D_{1:(N-1)} = D'_{1:(N-1)},\, D_{1:(N-1)} \in D^*_{N-1}\big) \\
&\quad + \mathbb{P}(D_{1:(N-1)} \notin D^*_{N-1}).
\end{align*}
Apply the expansion once more to $\mathbb{P}(D_{1:(N-1)} \ne D'_{1:(N-1)})$, then to $\mathbb{P}(D_{1:(N-2)} \ne D'_{1:(N-2)})$. If we define:
\[
B_j = \mathbb{P}\big(D_j \ne D'_j \mid D_{1:(j-1)} = D'_{1:(j-1)},\, D_{1:(j-1)} \in D^*_{j-1}\big),
\]
with the special case $B_1$ simply being $\mathbb{P}(D_1 \ne D'_1)$, then:
\[
\mathbb{P}(D \ne D') \le \sum_{j=1}^{N} B_j + \sum_{j=2}^{N} \mathbb{P}(D_{1:(j-1)} \notin D^*_{j-1}). \tag{2.59}
\]
\[
\le \exp\Big(-\frac{\beta^2}{2(\beta + 1)}\max(C(K, C_1), C(N, C_1)) - \ln N\Big).
\]
By Lemma 2.E.10, $\max(C(K, C_1), C(N, C_1)) \ge C_1(\max(\ln N, \ln K) - C_1(\psi(C_1) + 1))$. Since we have set $\beta$ so that $\frac{\beta^2}{\beta + 1}C_1 = 4$, we have
\begin{align*}
\frac{\beta^2}{2(\beta + 1)}\max(C(K, C_1), C(N, C_1)) - \ln N &\ge 2\max(\ln N, \ln K) - 2C_1(\psi(C_1) + 1) - \ln N \\
&\ge \ln K - 2C_1(\psi(C_1) + 1),
\end{align*}
meaning the overall atypical probability is at most
\[
\mathbb{P}(D \notin D^*) + \sum_{j=2}^{N} \mathbb{P}(D_{1:(j-1)} \notin D^*_{j-1}) \le \frac{\exp(2C_1(\psi(C_1) + 1))}{K}. \tag{2.60}
\]
As for the first summation in eq. (2.59), we look at the individual $B_j$'s. For any fixed $d_{1:(j-1)} \in D^*_{j-1}$, we claim that there exists a coupling between the conditionals $D_j \mid D_{1:(j-1)} = d_{1:(j-1)}$ and $D'_j \mid D'_{1:(j-1)} = d_{1:(j-1)}$ such that $\mathbb{P}(D_j \ne D'_j \mid D_{1:(j-1)} = D'_{1:(j-1)} = d_{1:(j-1)})$ is at most
\[
\frac{C_1^2}{K}\,\frac{1}{(j - 1 + C_1)^2} + \frac{1}{K}\,\big[C_4\ln j + C_5 + (\beta + 1)\max(C(K, C_1), C(N, C_1))\big]\,\frac{1}{j - 1 + C_1}. \tag{2.61}
\]
Because the upper bound holds for arbitrary values $d_{1:(j-1)}$, the coupling actually ensures that, as long as $D_{1:(j-1)} = D'_{1:(j-1)}$ for some value in $D^*_{j-1}$, the probability of difference between $D_j$ and $D'_j$ is small, i.e. $B_j$ is at most the right-hand side.
\[
\mathbb{E}(U_x) = \Big(K - \sum_{i=1}^{j-1}\sum_{y=1}^{\infty} d_{i,y}\Big)\,\tilde h(x \mid x_{1:(j-1)} = 0).
\]
For each $x \ge 1$, let $O_x$ be the maximal coupling distribution between $P_{U_x}$ and $P_{D_{j,x}}$, i.e. for $(A, B) \sim O_x$, $\mathbb{P}(A \ne B) = d_{TV}(P_{U_x}, P_{D_{j,x}})$. Such $O_x$ exists because both $P_{U_x}$ and $P_{D_{j,x}}$ are Poisson (hence discrete) distributions. Furthermore, since $O_x$ is itself a discrete distribution, the conditional distribution $D_{j,x} \mid U_x$ exists. Denote the natural zig-zag bijection from $(\mathbb{N} \cup \{0\})^2$ to $\mathbb{N}$ to be $L$.²¹ Denote by $F_x$ the cdf of the distribution of $L(A, B)$ for $(A, B) \sim O_x$. To generate samples from $O_x$, it suffices to generate samples from $F_x$ and transform using the inverse of $L$. Consider the following coupling of $P_U$ and $P_{D_j}$:
Marginally, each $U_x$ (or $D_{j,x}$) is Poisson with the right mean, and across $x$, the $U_x$ (or $D_{j,x}$) are independent of each other because we use i.i.d. uniform random variables. Alternatively, the conditional distribution of $D_j \mid U$ implied by this joint distribution is as follows:
• For $x \ge 1$, sample $D_{j,x} \mid U_x$ from the conditional distribution implied by the maximal coupling $O_x$.
Since the coupling $(U_x, D_{j,x})$ attains $d_{TV}(P_{U_x}, P_{D_{j,x}})$, we are done. From Lemma 2.E.5, we know
\begin{align*}
\sum_{x=1}^{\infty} d_{TV}(P_{U_x}, P_{D_{j,x}})
&\le \sum_{x=1}^{\infty} \Big|M_{j,x} - \Big(K - \sum_{i=1}^{j-1}\sum_{y=1}^{\infty} d_{i,y}\Big)\,\tilde h(x \mid x_{1:(j-1)} = 0)\Big| \\
&\le \sum_{x=1}^{\infty} \big|M_{j,x} - K\,\tilde h(x \mid x_{1:(j-1)} = 0)\big| + \sum_{x=1}^{\infty} \Big(\sum_{i=1}^{j-1}\sum_{y=1}^{\infty} d_{i,y}\Big)\,\tilde h(x \mid x_{1:(j-1)} = 0) \\
&\le \sum_{x=1}^{\infty} \big|M_{j,x} - K\,\tilde h(x \mid x_{1:(j-1)} = 0)\big| + \Big(\sum_{i=1}^{j-1}\sum_{y=1}^{\infty} d_{i,y}\Big)\Big(\sum_{x=1}^{\infty} \tilde h(x \mid x_{1:(j-1)} = 0)\Big). \tag{2.64}
\end{align*}
Combining the two bounds and eq. (2.63) gives the following bound on $\mathbb{P}(U \ne D_j)$:
\[
\mathbb{P}(U \ne D_j) \le \frac{1}{K}\,\frac{C_4\ln j + C_5}{j - 1 + C_1} + \frac{1}{K}\,(\beta + 1)\max(C(K, C_1), C(N, C_1))\,\frac{C_1}{j - 1 + C_1}. \tag{2.65}
\]
²¹ $L(0, 0) = 1$, $L(0, 1) = 2$, $L(1, 0) = 3$, $L(2, 0) = 4$, $L(1, 1) = 5$, $L(0, 2) = 6$ and so on.
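The footnote's bijection admits a short implementation. The closed form below (ours, inferred from the listed values: enumerate diagonals $a + b = 0, 1, 2, \ldots$ in alternating direction) is one explicit choice consistent with them:

```python
def L(a, b):
    # zig-zag bijection from {0,1,2,...}^2 to {1,2,...}
    diag = a + b
    base = diag * (diag + 1) // 2 + 1      # index of the diagonal's first cell
    offset = a if diag % 2 == 1 else diag - a
    return base + offset

assert [L(0, 0), L(0, 1), L(1, 0), L(2, 0), L(1, 1), L(0, 2)] == [1, 2, 3, 4, 5, 6]

def L_inv(k):
    # invert by locating the diagonal containing k, then its offset
    diag = 0
    while (diag + 1) * (diag + 2) // 2 < k:
        diag += 1
    offset = k - (diag * (diag + 1) // 2 + 1)
    a = offset if diag % 2 == 1 else diag - offset
    return (a, diag - a)

assert all(L_inv(L(a, b)) == (a, b) for a in range(10) for b in range(10))
```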
We now show how the combination of eqs. (2.62) and (2.65) imply eq. (2.61). From eq. (2.65), there exists a coupling of $P_U$ and $P_{D_j}$ such that the difference probability is small. From eq. (2.62), there exists a coupling of $P_U$ and $P_{D'_j \mid D'_{1:(j-1)} = d_{1:(j-1)}}$ such that the difference probability is small. In both cases, we can sample from the conditional distribution based on $U$. $D_j \mid U$ exists because of the discussion after eq. (2.63), while $D'_j \mid D'_{1:(j-1)} = d_{1:(j-1)}, U$ exists because of Lemma 2.E.4. Therefore, we can glue the two couplings together, by first sampling $U$, and then sampling from the appropriate conditional distributions. By taking expectations of the simple triangle inequality for the discrete metric, i.e.
\[
\frac{C_1^2}{K}\sum_{j=1}^{N} \frac{1}{(j - 1 + C_1)^2} + \frac{(\beta + 1)\max(C(K, C_1), C(N, C_1))}{K}\,C(N, C_1) + \frac{C_4\ln N + C_5}{K}\,C(N, C_1).
\]
The first term is upper bounded by the trigamma function $\psi_1(\cdot)$:
\[
\frac{C_1^2}{K}\sum_{j=1}^{N} \frac{1}{(j - 1 + C_1)^2} \le \frac{C_1^2\,\psi_1(C_1)}{K}.
\]
\[
\frac{C_1^2\,\psi_1(C_1)}{K} + \frac{\beta + 1}{K}\,C(N, C_1)\,\big[(C_1 + C_4)\ln N + C_1\ln K + 2C_1\ln(1 + 1/C_1) + C_5\big]. \tag{2.66}
\]
K K
Because of eqs. (2.59), (2.60) and (2.66), we can couple $D$ and $D'$ such that $\mathbb{P}(D \ne D') + \mathbb{P}(D \notin D^*)$ is at most
We expand the sum of the last two terms by upper bounding $\max(C(K, C_1), C(N, C_1))$ by $C(K, C_1) + C(N, C_1)$ and using the upper bound of Lemma 2.E.10. The end result is
\[
\tilde C^{(0)} = (\beta + 1)C_1\ln(1 + 1/C_1)\,[4C_1\ln(1 + 1/C_1) + C_5] + C_1^2\,\psi_1(C_1) + \exp(2C_1(\psi(C_1) + 1)),
\]
and
\begin{align*}
\tilde C^{(1)} &= (\beta + 1)\,2C_1^2\ln(1 + 1/C_1), \\
\tilde C^{(2)} &= (\beta + 1)\left[C_1(4C_1\ln(1 + 1/C_1) + C_5) + (2C_1 + C_4)C_1\ln(1 + 1/C_1)\right], \\
\tilde C^{(3)} &= (\beta + 1)\,2C_1^2\ln(1 + 1/C_1), \\
\tilde C^{(4)} &= (\beta + 1)C_1(2C_1 + C_4).
\end{align*}
Since $N$ is a natural number, $N \ge 1$, we can write $\ln N \le (1/\ln 2)\ln^2 N$, to simplify the upper bound on total variation as
\[
\frac{\tilde C^{(0)} + \big(\tilde C^{(4)} + \tilde C^{(2)}/\ln 2\big)\ln^2 N + \tilde C^{(3)}\ln N\ln K + \tilde C^{(1)}\ln K}{K}. \tag{2.68}
\]
Taking the sum of individual coefficients in front of $\ln^2 N$ (et cetera) between eq. (2.68) and eq. (2.53) yields the constants at the beginning of the proof.
In applications, the observational likelihood f and the ground measure H might be random
rather than fixed quantities. For instance, in linear–Gaussian beta–Bernoulli processes without
good prior information, probabilistic models put priors on the variances of the Gaussian
features as well as the noise in observed data. In such cases, the AIFAs remain the same
as in Theorem 2.B.2 (or Corollary 2.3.3) since the rate measure ν is still fixed. The
above proof of Theorem 2.4.1 can be easily extended to the case where f and H are random,
because the argument leading to eq. (2.55) retains validity when f and H have the same
distribution under the target and the approximate model. For completeness, we state the
error bound in such cases where hyper-priors are used.
Corollary 2.G.1 (Upper bound for hyper-priors). Let $\mathcal{H}$ be a prior distribution for ground measures $H$ and $\mathcal{F}$ be a prior distribution for observational likelihoods $f$. Suppose the target model is
\begin{align*}
H &\sim \mathcal{H}(\cdot), \\
f &\sim \mathcal{F}(\cdot), \\
\Theta \mid H &\sim \mathrm{CRM}(H, \nu), \\
X_n \mid \Theta &\overset{\text{i.i.d.}}{\sim} \mathrm{LP}(\ell, \Theta), \quad n = 1, 2, \ldots, N, \\
Y_n \mid f, X_n &\overset{\text{indep}}{\sim} f(\cdot \mid X_n), \quad n = 1, 2, \ldots, N.
\end{align*}
The approximate model, with $\nu_K$ as in Theorem 2.B.2 (or Corollary 2.3.3), is
\begin{align*}
H &\sim \mathcal{H}(\cdot), \\
f &\sim \mathcal{F}(\cdot), \\
\Theta_K \mid H &\sim \mathrm{IFA}_K(H, \nu_K), \\
Z_n \mid \Theta_K &\overset{\text{i.i.d.}}{\sim} \mathrm{LP}(\ell, \Theta_K), \quad n = 1, 2, \ldots, N, \\
W_n \mid f, Z_n &\overset{\text{indep}}{\sim} f(\cdot \mid Z_n), \quad n = 1, 2, \ldots, N.
\end{align*}
If Assumption 2.3.1 and Condition 2.4.1 hold, then there exist positive constants $C', C'', C'''$ depending only on $\{C_i\}_{i=1}^{5}$ such that
\[
d_{TV}(P_{Y_{1:N}}, P_{W_{1:N}}) \le \frac{C' + C''\ln^2 N + C'''\ln N\ln K}{K}.
\]
The upper bound in Corollary 2.G.1 is visually identical to Theorem 2.4.1, and has no
dependence on the hyper-priors H or F.
The proof of Theorem 2.4.3 relies on the ability to compute a lower bound on the total
variation distance between a binomial distribution and a Poisson distribution.
Proposition 2.G.2 (Lower bound on total variation between binomial and Poisson). For all K, it is true that
\[
d_{\mathrm{TV}}\!\left(\mathrm{Poisson}(\gamma),\, \mathrm{Binom}\!\left(K, \frac{\gamma/K}{\gamma/K + 1}\right)\right) \ge C(\gamma)\, K \left(\frac{\gamma/K}{\gamma/K + 1}\right)^{2},
\]
where
\[
C(\gamma) = \frac{1}{8} \cdot \frac{1}{\gamma + \exp(-1)(\gamma + 1)\max(12\gamma^2, 48\gamma, 28)}.
\]
Proof of Proposition 2.G.2. We adapt the proof of [13, Theorem 2] to our setting. Consider the function
\[
x(m) = m \exp\!\left(-\frac{m^2}{\gamma_K \theta}\right),
\]
where θ is a constant which will be specified later. x(m) serves as a test function to lower bound the total variation distance between Poisson(γ) and Binom(K, γ_K/K). Let X_i ∼ Ber(γ_K/K), independently across i from 1 to K, and W = Σ_{i=1}^K X_i. Then W ∼ Binom(K, γ_K/K). The following identity is adapted from [13, Equation 2.1]:
\[
\mathbb{E}[\gamma_K x(W+1) - W x(W)] = \frac{\gamma_K^2}{K^2} \sum_{i=1}^{K} \mathbb{E}[x(W_i + 2) - x(W_i + 1)], \tag{2.72}
\]
where W_i = W − X_i.
We first argue that the right hand side is not too small, i.e. for any i,
\[
\mathbb{E}[x(W_i + 2) - x(W_i + 1)] \ge 1 - \frac{3\gamma_K^2 + 12\gamma_K + 7}{\theta \gamma_K}. \tag{2.73}
\]
Indeed,
\[
\frac{d}{dm} x(m) = \exp\!\left(-\frac{m^2}{\gamma_K \theta}\right)\left(1 - \frac{2m^2}{\gamma_K \theta}\right) \ge 1 - \frac{3m^2}{\theta \gamma_K},
\]
because of the easy-to-verify inequality e^{−x}(1 − 2x) ≥ 1 − 3x for x ≥ 0. This means that
\[
x(W_i + 2) - x(W_i + 1) \ge \int_{W_i + 1}^{W_i + 2} \left(1 - \frac{3m^2}{\theta \gamma_K}\right) dm = 1 - \frac{1}{\theta \gamma_K}\left(3W_i^2 + 9W_i + 7\right).
\]
Taking expectations, noting that E(W_i) ≤ γ_K and E(W_i²) = Var(W_i) + [E(W_i)]² ≤ Σ_{j=1}^K γ_K/K + γ_K² = γ_K² + γ_K, we have proven eq. (2.73).
Now, because of positivity of x, and that γ ≥ γ_K, we trivially have
\[
\mathbb{E}[\gamma\, x(W+1) - W x(W)] \ge \mathbb{E}[\gamma_K\, x(W+1) - W x(W)]. \tag{2.74}
\]
Combining eq. (2.72), eq. (2.73) and eq. (2.74) we have that
\[
\mathbb{E}[\gamma\, x(W + 1) - W x(W)] \ge \frac{\gamma_K^2}{K}\left(1 - \frac{3\gamma_K^2 + 12\gamma_K + 7}{\theta \gamma_K}\right).
\]
Recalling eq. (2.71), for any coupling (W, Z) such that W ∼ Binom(K, (γ/K)/(γ/K + 1)) and Z ∼ Poisson(γ):
\[
\mathbb{E}[\gamma(x(W+1) - x(Z+1)) + Z x(Z) - W x(W)] \ge \frac{\gamma_K^2}{K}\left(1 - \frac{3\gamma_K^2 + 12\gamma_K + 7}{\theta \gamma_K}\right). \tag{2.75}
\]
Suppose (W, Z) is the maximal coupling attaining the total variation distance between P_W and P_Z, i.e. P(W ≠ Z) = d_TV(P_W, P_Z). Clearly,
\[
\mathbb{E}[\gamma(x(W+1) - x(Z+1)) + Z x(Z) - W x(W)] \le 2\big(\gamma + \exp(-1)(\gamma + 1)\,\theta\gamma_K\big)\, \mathbb{P}(W \ne Z), \tag{2.76}
\]
where the last inequality owes to the easy-to-verify x exp(−x) ≤ exp(−1). Combining eq. (2.76) and eq. (2.75) we have that
\[
d_{\mathrm{TV}}\!\left(\mathrm{Binom}\!\left(K, \frac{\gamma/K}{\gamma/K+1}\right), \mathrm{Poisson}(\gamma)\right) \ge \frac{1}{2} \cdot \frac{1 - \dfrac{3\gamma_K^2 + 12\gamma_K + 7}{\theta\gamma_K}}{\gamma + (\gamma+1)\,\theta\gamma_K \exp(-1)} \cdot \frac{\gamma_K^2}{K}.
\]
Finally, we calibrate θ. By selecting θ = max(12γ_K, 28/γ_K, 48) we have that the numerator of the unwieldy fraction is at least 1/4 and its denominator is at most γ + exp(−1)(γ + 1) max(12γ², 48γ, 28), because γ_K < γ. This completes the proof.
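As a sanity check, the bound in Proposition 2.G.2 can be verified numerically by summing probability mass functions directly. The sketch below is our own illustration with arbitrary values of γ and K, not part of the proof.

```python
import math

def tv_poisson_binom(gamma, K):
    """Total variation distance between Poisson(gamma) and
    Binom(K, p) with p = (gamma/K) / (gamma/K + 1), by summing pmfs."""
    p = (gamma / K) / (gamma / K + 1.0)
    upper = K + int(10 * gamma) + 50   # captures essentially all mass
    total = 0.0
    for m in range(upper):
        pois = math.exp(-gamma) * gamma**m / math.factorial(m)
        binom = math.comb(K, m) * p**m * (1 - p)**(K - m) if m <= K else 0.0
        total += abs(pois - binom)
    return 0.5 * total

def lower_bound(gamma, K):
    """The lower bound of Proposition 2.G.2."""
    p = (gamma / K) / (gamma / K + 1.0)
    C = 0.125 / (gamma + math.exp(-1) * (gamma + 1)
                 * max(12 * gamma**2, 48 * gamma, 28))
    return C * K * p**2

gamma, K = 1.0, 20
print(tv_poisson_binom(gamma, K) >= lower_bound(gamma, K))   # True
```

The bound is loose by design; the exact distance is orders of magnitude larger than the guaranteed floor at these values.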
Proof of Theorem 2.4.3. The probability kernel f in the theorem statement maps a latent measure to the Dirac measure at its number of atoms:
\[
f\!\left(\cdot \,\Big|\, \sum_{i=1}^{M} \delta_{\psi_i}\right) := \delta_M(\cdot). \tag{2.77}
\]
Now we show that under such f , the total variation distance is lower bounded. From
Lemma 2.E.9, we know that
\[
d_{\mathrm{TV}}(P^{\mathrm{BP}}_{N,\infty}, P^{\mathrm{BP}}_{N,K}) = d_{\mathrm{TV}}(P_{Y_{1:N}}, P_{W_{1:N}}) \ge d_{\mathrm{TV}}(P_{Y_1}, P_{W_1}).
\]
Recall the generative process defining PY1 and PW1 . Y1 is an observation from the target
beta–Bernoulli model, and the functions h, h̃, and M_{n,x} are given in Example 2.4.1. By
Proposition 2.C.1,
\[
N_T \sim \mathrm{Poisson}(\gamma), \quad \psi_k \overset{\text{i.i.d.}}{\sim} H, \quad X_1 = \sum_{k=1}^{N_T} \delta_{\psi_k}, \quad Y_1 \sim f(\cdot \mid X_1).
\]
2.H DPMM results
We consider Dirichlet process mixture models [6]
\[
\begin{aligned}
\Theta &\sim \mathrm{DP}(\alpha, H), \\
X_n \mid \Theta &\overset{\text{i.i.d.}}{\sim} \Theta, \quad n = 1, 2, \ldots, N, \\
Y_n \mid X_n &\overset{\text{indep}}{\sim} f(\cdot \mid X_n), \quad n = 1, 2, \ldots, N.
\end{aligned} \tag{2.78}
\]
Let PN,∞ be the distribution of the observations Y1:N . Let PN,K be the distribution of the
observations W1:N .
Theorem 2.H.2 (ln N is necessary). There exists a probability kernel f(·), independent of K, N, such that for any N ≥ 2, if K ≤ ½ C(N, α), then
\[
d_{\mathrm{TV}}(P_{N,\infty}, P_{N,K}) \ge 1 - \frac{C'}{N^{\alpha/8}},
\]
where C ′ is a constant only dependent on α.
See section 2.I.2 for the proof. Theorem 2.H.2 implies that as N grows, if the approximation
level K fails to surpass the C(N, α)/2 threshold, then the total variation between the
approximate and the target model remains bounded away from zero; in fact, the error tends
to one. Recall that C(N, α) = Ω(ln N ), so the necessary approximation level is Ω(ln N ).
Theorem 2.H.2 is the analog of Theorem 2.4.2.
We also investigate the tightness of Theorem 2.H.1 in terms of K. In Theorem 2.H.3, our
lower bound indicates that the 1/K factor in Theorem 2.H.1 is tight (up to log factors).
Theorem 2.H.3 (1/K lower bound). There exists a probability kernel f (·), independent of
K, N , such that for any N ≥ 2,
\[
d_{\mathrm{TV}}(P_{N,\infty}, P_{N,K}) \ge \frac{\alpha}{1+\alpha}\,\frac{1}{K}.
\]
See section 2.I.2 for the proof. While Theorem 2.H.1 implies that the normalized AIFA with
K = O (poly(ln N )/ϵ) atoms suffices in approximating the DP mixture model to less than ϵ
error, Theorem 2.H.3 implies that a normalized AIFA with K = Ω (1/ϵ) atoms is necessary in
the worst case. This worst-case behavior is analogous to Theorem 2.4.3 for DP-based models.
The 1/ϵ dependence means that AIFAs are worse than TFAs in theory. It is known that
small TFA models are already excellent approximations of the DP. Definition 2.4.5 is a very
well-known finite approximation whose error is upper bounded in Proposition 2.H.4.
Proposition 2.H.4 ([88, Theorem 2]). Let Ξ_K ∼ TSB_K(α, H), R_n | Ξ_K ~(i.i.d.) Ξ_K, T_n | R_n ~(indep) f(· | R_n), with N observations. Let Q_{N,K} be the distribution of the observations T_{1:N}. Then:
\[
d_{\mathrm{TV}}(P_{N,\infty}, Q_{N,K}) \le 2N \exp\!\left(-\frac{K-1}{\alpha}\right).
\]
Proposition 2.H.4 implies that a TFA with K = O (ln (N/ϵ)) atoms suffices in approximating
the DP mixture model to less than ϵ error. Modulo log factors, comparing the necessary 1/ϵ
level for AIFA and the sufficient ln (1/ϵ) level for TFA, we conclude that the necessary size
for normalized IFA is exponentially larger than the sufficient size for TFA, in the worst case.
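To make the gap concrete, the following back-of-the-envelope sketch compares the two approximation levels. It suppresses constants and log factors, so the formulas are illustrative readings of Theorem 2.H.3 and Proposition 2.H.4, not exact statements.

```python
import math

def aifa_atoms(eps):
    # Necessary order for the normalized AIFA: K = Omega(1/eps)
    # (worst-case regime of Theorem 2.H.3, constants suppressed).
    return math.ceil(1.0 / eps)

def tfa_atoms(N, eps, alpha=1.0):
    # Sufficient level for the TFA: 2N exp(-(K-1)/alpha) <= eps
    # rearranges to K >= 1 + alpha * ln(2N/eps) (Proposition 2.H.4).
    return math.ceil(1.0 + alpha * math.log(2.0 * N / eps))

N = 10_000
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, aifa_atoms(eps), tfa_atoms(N, eps))
```

Even at ε = 10⁻³ the sufficient TFA size stays below twenty atoms while the necessary normalized-IFA size is in the thousands, which is the exponential separation discussed above.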
Proposition 2.I.1 (Blackwell and MacQueen [19]). For n = 1, X_1 ∼ H. For n ≥ 2,
\[
X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1 \sim \frac{\alpha}{n-1+\alpha} H + \sum_j \frac{n_j}{n-1+\alpha} \delta_{\psi_j},
\]
where {ψ_j} is the set of unique values among X_{n−1}, X_{n−2}, …, X_1 and n_j is the cardinality of the set {i : 1 ≤ i ≤ n − 1, X_i = ψ_j}.
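The prediction rule can be simulated directly. The sketch below is our own illustration: it exploits the fact that picking a uniformly random earlier draw is equivalent to picking an existing value with probability proportional to its count n_j, and it uses a uniform variate as a stand-in for a draw from a continuous ground measure H.

```python
import random
from collections import Counter

def blackwell_macqueen(N, alpha, H=random.random, rng=random):
    """Draw X_1,...,X_N from the Polya urn: X_n is a fresh draw from H
    with probability alpha/(n-1+alpha); otherwise it repeats a previous
    value with probability proportional to that value's count."""
    xs, counts = [], Counter()
    for n in range(1, N + 1):
        if rng.random() < alpha / (n - 1 + alpha):
            x = H()                      # new atom location
        else:
            x = xs[rng.randrange(n - 1)] # uniform over past draws = prop. to n_j
        xs.append(x)
        counts[x] += 1
    return xs, counts

random.seed(0)
xs, counts = blackwell_macqueen(1000, alpha=1.0)
print(len(counts))   # number of unique atoms; concentrates at O(ln N)
```

For α = 1 and N = 1000 the expected number of unique values is the harmonic sum Σ_{n=1}^{N} 1/n ≈ 7.5, matching the C(N, α) = Ω(ln N) growth used throughout this section.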
The reasoning for these constants will be clear by the end of the proof.
To begin, observe that the conditional distributions of the observations given the latent
variables are the same across target and approximate models: PY1:N |X1:N is the same as
PW1:N |Z1:N if X1:N = Z1:N . Therefore, using Lemma 2.E.8, we want to show that there exists
a coupling of PX1:N and PZ1:N that has small difference probability.
First, we construct a coupling of P_{X_{1:N}} and P_{Z_{1:N}} such that, for any n ≥ 1 and any x_{1:(n−1)} for which J_n, the number of unique atom locations among x_{1:(n−1)}, is at most K,
\[
\mathbb{P}\big(X_n \ne Z_n \mid X_{1:(n-1)} = Z_{1:(n-1)} = x_{1:(n-1)}\big) \le \frac{\alpha}{K}\,\frac{J_n}{n-1+\alpha}. \tag{2.81}
\]
The case n = 1 reads P(X_1 ≠ Z_1) = 0. Such a coupling exists because the total variation distance between the prediction rules X_n | X_{1:(n−1)} and Z_n | Z_{1:(n−1)} is small. Let {ψ_j}_{j=1}^{J_n} be the unique atom locations in x_{1:(n−1)} and n_j be the number of latents x_i that manifest atom location ψ_j.
manifest atom location ψj . The distribution Xn | X1:(n−1) can be sampled from in two steps:
Still conditioning on X1:(n−1) and Z1:(n−1) , we observe that the distribution of Xn | I1 is the
same as Zn | I2 . Hence, using the propagation argument from Lemma 2.E.8, it suffices to
couple I1 and I2 so that
is small. Since I1 and I2 are categorical distributions, the minimum of the difference probability
is the total variation distance between the two distributions, which equals 1/2 the L1 distance
between marginals
\[
\sum_{j=1}^{J_n} \left| \frac{n_j + \alpha/K}{n-1+\alpha} - \frac{n_j}{n-1+\alpha} \right| + \left| \frac{\alpha}{n-1+\alpha} - \frac{\alpha(1 - J_n/K)}{n-1+\alpha} \right| = 2\,\frac{\alpha}{K}\,\frac{J_n}{n-1+\alpha}.
\]
Dividing the last equation by 2 gives eq. (2.81). The joint coupling of PX1:N and PZ1:N is the
natural gluing of the couplings PXn | X1:(n−1) and PZn | Z1:(n−1) .
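The gluing step uses, at each n, a coupling of two categorical prediction rules whose disagreement probability equals their total variation distance. A generic maximal coupling of two finitely supported distributions can be sketched as follows (our own illustration, not the thesis's construction verbatim; it assumes p ≠ q so the residuals normalize).

```python
import random

def _draw(dist, rng):
    """Inverse-cdf draw from a dict of outcome -> probability."""
    u, acc = rng.random(), 0.0
    for k, v in dist.items():
        acc += v
        if u < acc:
            return k
    return k   # float-rounding fallback

def maximal_coupling(p, q, rng=random):
    """Sample (I, J) with I ~ p, J ~ q and P(I != J) = d_TV(p, q)."""
    keys = set(p) | set(q)
    overlap = {k: min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys}
    w = sum(overlap.values())                   # w = 1 - d_TV(p, q)
    if rng.random() < w:
        k = _draw({k: v / w for k, v in overlap.items()}, rng)
        return k, k                             # agree on the overlap
    rp = {k: (p.get(k, 0.0) - overlap[k]) / (1 - w) for k in keys}
    rq = {k: (q.get(k, 0.0) - overlap[k]) / (1 - w) for k in keys}
    return _draw(rp, rng), _draw(rq, rng)       # disagree via the residuals

random.seed(1)
p, q = {"a": 0.5, "b": 0.5}, {"a": 0.6, "b": 0.4}
samples = [maximal_coupling(p, q) for _ in range(20000)]
print(sum(i != j for i, j in samples) / len(samples))  # close to d_TV = 0.1
```

Applying this at every step of the two prediction rules, and propagating agreement forward, is exactly the gluing described above.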
We now show that for the coupling satisfying eq. (2.81), the overall probability of difference
P(X1:N ≠ Z1:N ) is small. Recall the growth function from eq. (2.18). We will use the notation
of a typical set in the rest of the proof:
\[
D_n := \left\{ x_{1:(n-1)} : J_n \le (1+\delta) \max\big(C(N, \alpha), C(K, \alpha)\big) \right\}.
\]
In other words, the number of unique values among the x_{1:(n−1)} is small. The constant δ satisfies (δ²/(2+δ))·α = 2; such δ always exists and is unique. The following decomposition is used
to investigate the difference probability on the typical set:
The second term can be further expanded:
By recursively applying the expansion initiated in eq. (2.82) to P(X_{1:(N−1)} ≠ Z_{1:(N−1)}), we only need to bound the difference probabilities of the prediction rules on typical sets, together with the atypical set probabilities.
Regarding difference probability of the different prediction rules, being in the typical set
allows us to control Jn in eq. (2.81). Summation across n = 1 through N gives the overall
bound of
\[
\frac{\alpha}{K}\,(1+\delta) \max\big(C(N,\alpha), C(K,\alpha)\big)\, C(N, \alpha). \tag{2.83}
\]
Regarding the atypical set probabilities, because J_{n−1} is stochastically dominated by J_n (the number of unique values at time n is at least the number at time n − 1), all the atypical set probabilities are upper bounded by the last one, i.e. P(X_{1:(N−1)} ∉ D_N). When N > 1, J_N is the sum of independent Poisson trials with overall mean exactly C(N − 1, α), and J_1 is defined to be 0. Since C(N − 1, α) ≤ max(C(N, α), C(K, α)), the atypical event has small probability because of Lemma 2.E.1:
\[
\mathbb{P}\Big(J_N > (1+\delta) \max\big(C(N, \alpha), C(K, \alpha)\big)\Big) \le \exp\!\left(-\frac{\delta^2}{2+\delta} \max\big(C(N, \alpha), C(K, \alpha)\big)\right).
\]
Even accounting for all N atypical events through a union bound, the total probability is still small:
\[
\exp\!\left(-\left[\frac{\delta^2}{2+\delta} \max\big(C(N, \alpha), C(K, \alpha)\big) - \ln N\right]\right).
\]
By Lemma 2.E.10, max(C(N, α), C(K, α)) ≥ α max(ln N, ln K) − α(ψ(α) + 1), so we have
\[
\frac{\delta^2}{2+\delta} \max\big(C(N, \alpha), C(K, \alpha)\big) - \ln N \ge \ln K - \alpha(\psi(\alpha) + 1),
\]
meaning the overall atypical probability is at most
\[
\frac{\exp(\alpha(\psi(\alpha) + 1))}{K}. \tag{2.84}
\]
The overall total variation bound combines eqs. (2.83) and (2.84). We first upper bound
C(N, α) using Lemma 2.E.10 and upper bound max(C(N, α), C(K, α)) by the sum of the two
constituent terms. We also upper bound ln N ≤ ln2 N/ ln 2 to remove the dependence on the
sole ln N factor. After the algebraic manipulations, we arrive at the constants in eq. (2.80).
The main idea is reducing to the Dirichlet process mixture model. We do this in two steps.
First, the conditional distribution of the observations W | H1:D of the target model is the same
as the conditional distribution Z | F1:D of the approximate model if H1:D = F1:D . Second,
there exist latent variables Λ and Φ such that the conditional distribution of H1:D | Λ and
the conditional F1:D | Φ are the same when Λ = Φ. Recall the construction of the Fd in terms
of atom locations ϕd,j and stick-breaking weights γd,j :
Similarly Hd is also constructed in terms of atom locations λd,j and stick-breaking weights
ηd,j :
\[
\begin{aligned}
G &\sim \mathrm{DP}(\omega, H), \\
\lambda_{dj} \mid G &\overset{\text{i.i.d.}}{\sim} G(\cdot) \text{ across } d, j, \\
\eta_{dj} &\overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \alpha) \text{ across } d, j \text{ (except } \eta_{dT} = 1\text{)}, \\
H_d \mid \lambda_{d,\cdot}, \eta_{d,\cdot} &= \sum_{i=1}^{T} \left( \eta_{di} \prod_{j<i} (1 - \eta_{dj}) \right) \delta_{\lambda_{di}}.
\end{aligned}
\]
Therefore, if we set Λ = {λdj }d,j and Φ = {ϕdj }d,j , then H1:D | Λ is the same as the conditional
F1:D | Φ if Λ = Φ.
Overall, this means that W | Λ is the same as Z | Φ. Again by Lemma 2.E.8, we only need to
demonstrate a coupling between PΛ and PΦ such that the difference probability is small.
From the proof of Theorem 2.4.1 in section 2.I.1, we already know how to couple PΛ and PΦ .
On the one hand, since λdj are conditionally iid given G across d, j, the joint distribution of
λdj is from a DPMM (probability kernel f being Dirac f (· | x) = δx (·)) where the underlying
DP has concentration ω. On the other hand, since ϕdj are conditionally iid given GK across
d, j, the joint distribution ϕdj comes from the finite mixture with FSDK . Each observational
process has cardinality DT . Therefore, we can couple PΛ and PΦ such that
Now we discuss why the total variation is lower bounded by the function of N. Let A be the event that there are at least ½ C(N, α) unique components among the latent states:
\[
A := \left\{ x_{1:N} : \#\text{unique values} \ge \tfrac{1}{2} C(N, \alpha) \right\}.
\]
The probabilities assigned to this event by the approximate and the target models are very
different from each other. On the one hand, since K < C(N, α)/2, under FSD_K, A has measure zero:
\[
P_{Z_{1:N}}(A) = 0. \tag{2.85}
\]
On the other hand, under DP, the number of unique atoms drawn is the sum of Poisson
trials with expectation exactly C(N, α). The complement of A is a lower tail event. Hence
by Lemma 2.E.2 with δ = 1/2, µ = C(N, α), we have:
\[
P_{X_{1:N}}(A) \ge 1 - \exp\!\left(-\frac{C(N, \alpha)}{8}\right). \tag{2.86}
\]
By Lemma 2.E.10,
\[
\exp\!\left(-\frac{C(N, \alpha)}{8}\right) \le \exp\!\left(-\frac{\alpha \ln N}{8} + \frac{\alpha(\psi(\alpha) + 1)}{8}\right) = \frac{\exp(\alpha(\psi(\alpha)+1)/8)}{N^{\alpha/8}}.
\]
We now combine eqs. (2.85) and (2.86) and recall that total variation is the maximum over
probability discrepancies.
Proof of Theorem 2.H.3. First we mention which probability kernel f results in the large
total variation distance: the pathological f is the Dirac measure i.e., f (· | x) = δx (.).
Now we show that under such f, the total variation distance is lower bounded. Observe that
it suffices to understand the total variation between PY1 ,Y2 and PW1 ,W2 , because Lemma 2.E.9
already implies
dTV (PN,∞ , PN,K ) ≥ dTV (PY1 ,Y2 , PW1 ,W2 ).
Since f is Dirac, Xn = Yn and Zn = Wn and we have:
dTV (PY1 ,Y2 , PW1 ,W2 ) = dTV (PX1 ,X2 , PZ1 ,Z2 ).
Consider the event that the two latent states are equal. Under the target model,
\[
\mathbb{P}(X_2 = X_1) = \frac{1}{1+\alpha},
\]
while under the approximate one,
\[
\mathbb{P}(Z_2 = Z_1) = \frac{1 + \alpha/K}{1+\alpha}.
\]
They are simple consequences of the prediction rules in Propositions 2.I.1 and 2.I.2. Therefore,
there exists a measurable event where the probability mass assigned by the target and
approximate models differ by
\[
\frac{1 + \alpha/K}{1+\alpha} - \frac{1}{1+\alpha} = \frac{\alpha}{1+\alpha}\,\frac{1}{K}, \tag{2.87}
\]
meaning d_TV(P_{X_1,X_2}, P_{Z_1,Z_2}) ≥ (α/(1+α))(1/K).
distribution of x_{n,k} under q_x(x) (which is a distribution over the whole set (x_{n,k})_{n,k}). Then
\[
\ln q_\rho^*(\rho) = -\ln C + \ln \mathbb{P}(\rho) + \sum_{n,k} \mathbb{E}_{x_{n,k} \sim q_x} \ln \ell(x_{n,k} \mid \rho_k). \tag{2.88}
\]
Proposition 2.J.2. Suppose that the variational distribution q_x(x) factorizes as q_x(x) = Π_{n,k} f_{n,k}(x_{n,k}). For a particular n, k, let f*_{n,k} be the optimal distribution over the (n, k) trait count with all other variational distributions being fixed, i.e.
\[
f^*_{n,k} := \operatorname*{arg\,min}_{f_{n,k}} \mathrm{KL}\!\left( q_\rho\, q_\psi\, f_{n,k} \prod_{(n',k') \ne (n,k)} f_{n',k'} \,\Big\|\, \bar{P} \right),
\]
where P̄ denotes the posterior P(·, ·, · | y). Then the log p.m.f. of f*_{n,k} at x_{n,k} is equal to
\[
-\ln C + \mathbb{E}_{\rho_k \sim q_\rho} \ln \ell(x_{n,k} \mid \rho_k) + \mathbb{E}_{\psi \sim q_\psi,\, x_{n,-k} \sim f_{n,-k}} \ln \mathbb{P}(y_n \mid x_{n,\cdot}, \psi)
\]
for some positive constant C.
See section 2.J.2 for the proof of this proposition.
Under TFAs such as Example 2.5.1, since we cannot identify the log density in eq. (2.88) with a well-known distribution, we do not have formulas for expectations. For Example 2.5.1, strategies to make computing expectations more tractable include introducing auxiliary round indicator variables r_k, replacing the product Π_{l=1}^{i−1}(1 − V_{l,j}) with a more succinct representation, and fixing the functional form q_ρ rather than using optimality conditions [156, Section 3.2]. However, Paisley et al. [156, Section 3.3] still run into intractability issues when evaluating E_{ρ_k∼q_ρ}{ln ℓ(x_{n,k} | ρ_k)} in the beta–Bernoulli process, and additional approximations such as Taylor series expansion are needed.
In our second TFA example, the complete conditional of the atom sizes can be sampled
without auxiliary variables, but important expectations are not analytically tractable.
Example 2.J.1 (Bondesson approximation [55, 197]). When α = 1, the Bondesson approxi-
mation in Example 2.4.3 becomes
\[
\Theta_K = \sum_{i=1}^{K} \rho_i \delta_{\psi_i}, \quad \rho_i = \prod_{j=1}^{i} p_j, \quad p_j \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(\gamma, 1), \quad \psi_i \overset{\text{i.i.d.}}{\sim} H. \tag{2.89}
\]
The atom sizes are dependent because they jointly depend on p1 , . . . , pK , but the complete
conditional of atom sizes P(ρ | x) admits a density proportional to
\[
\mathbb{1}\{0 \le \rho_K \le \rho_{K-1} \le \cdots \le \rho_1 \le 1\} \prod_{j=1}^{K} \rho_j^{\gamma \mathbb{1}\{j=K\} + \sum_{n=1}^N x_{n,j} - 1} (1 - \rho_j)^{N - \sum_{n=1}^N x_{n,j}}.
\]
The conditional distributions P(ρi | ρ−i , x) are truncated betas, so adaptive rejection sampling
[70] can be used as a sub-routine to sample each P(ρi | ρ−i , x) and then sweep over all atom
sizes. However, for this exponential family, expectations of the sufficient statistics are not
tractable. The optimal qρ∗ in the sense of eq. (2.22) has a density proportional to
\[
\mathbb{1}\{0 \le \rho_K \le \rho_{K-1} \le \cdots \le \rho_1 \le 1\} \prod_{j=1}^{K} \rho_j^{\gamma \mathbb{1}\{j=K\} + \sum_{n=1}^N \mathbb{E}_{q_x} x_{n,j} - 1} (1 - \rho_j)^{N - \sum_{n=1}^N \mathbb{E}_{q_x} x_{n,j}}.
\]
We do not know closed-form formulas for E{ln(ρi )} or E{ln(1 − ρi )}. Rather than using
the qρ∗ which comes from optimality arguments, Doshi-Velez et al. [55] fixes the functional
form of the variational distribution. Even then, further approximations such as Taylor series
expansion are necessary to approximate E{ln(ρi )} or E{ln(1 − ρi )}.
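To make the sampling sub-routine concrete, here is a sketch of one Gibbs sweep over the ordered atom sizes. It uses naive rejection in place of adaptive rejection sampling [70] (our simplification), and the small shape guard for empty atoms is also our own choice.

```python
import random

def sample_trunc_beta(a, b, lo, hi, rng=random, max_tries=10000):
    """Sample Beta(a, b) truncated to [lo, hi] by naive rejection.
    Plain rejection can be slow when [lo, hi] carries little mass;
    adaptive rejection sampling is the more efficient choice."""
    for _ in range(max_tries):
        r = rng.betavariate(a, b)
        if lo <= r <= hi:
            return r
    return 0.5 * (lo + hi)   # fallback for vanishing acceptance regions

def gibbs_sweep_rho(rho, counts, N, gamma, rng=random):
    """One sweep over the ordered atom sizes 1 >= rho_1 >= ... >= rho_K >= 0.
    counts[j] = sum_n x_{n,j}; the conditional of rho_j is proportional to
    rho^(gamma*1{j=K}+counts[j]-1) (1-rho)^(N-counts[j]) on [rho_{j+1}, rho_{j-1}]."""
    K = len(rho)
    for j in range(K):
        a = counts[j] + (gamma if j == K - 1 else 0.0)
        a = max(a, 1e-3)                 # guard for atoms with no observations
        b = N - counts[j] + 1.0
        lo = rho[j + 1] if j + 1 < K else 0.0
        hi = rho[j - 1] if j > 0 else 1.0
        rho[j] = sample_trunc_beta(a, b, lo, hi, rng)
    return rho

random.seed(0)
rho = gibbs_sweep_rho([0.9, 0.6, 0.3], counts=[5, 2, 0], N=10, gamma=1.0)
print(rho)   # ordering 1 >= rho_1 >= rho_2 >= rho_3 >= 0 is preserved
```

Updating each ρ_j within the interval bounded by its neighbors keeps the monotonicity constraint satisfied by construction throughout the sweep.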
Other series-based approximations, like thinning or rejection sampling [34], are characterized
by even less tractable dependencies between atom sizes in both the prior and the conditional
P(ρ | x).
2.J.2 Proofs
Proof of Proposition 2.5.1. Because of the Markov blanket, conditioning on x, ψ, y is the
same as conditioning on x:
P(ρ | x, ψ, y) = P(ρ | x).
Conditioned on the atom rates, the trait counts are independent across the atoms. In the
prior over atom rates, the atom rates are independent across the atoms. These facts mean
that the posterior also factorizes across the atoms
\[
\mathbb{P}(\rho \mid x) = \prod_{k=1}^{K} \mathbb{P}(\rho_k \mid x_{\cdot,k}).
\]
k=1
We look at each factor P(ρ_k | x_{·,k}). This is the posterior for ρ_k after observing the N observations (x_{n,k})_{n=1}^N. Since the AIFA prior over ρ_k is the conjugate prior of the trait count likelihood, the
posterior is in the same exponential family, with updated parameters based on the sufficient
statistics and the log partition function.
Proof of Proposition 2.J.1. Minimizing the KL divergence is equivalent to maximizing the
evidence lower bound (ELBO):
ELBO(q) := E(ρ,ψ,x)∼q ln P(y, ρ, ψ, x) − E(ρ,ψ,x)∼q ln q(ρ, ψ, x). (2.90)
The log joint probability P(y, ρ, ψ, x), regardless of the prior over ρ, decomposes as
\[
\ln \mathbb{P}(y, \rho, \psi, x) = \ln \mathbb{P}(\rho) + \sum_k \ln \mathbb{P}(\psi_k) + \sum_{n,k} \ln \mathbb{P}(x_{n,k} \mid \rho_k) + \sum_n \ln \mathbb{P}(y_n \mid x_{n,\cdot}, \psi). \tag{2.91}
\]
Recall that the variational distribution factorizes as q(ρ, ψ, x) = qρ(ρ)qψ(ψ)qx(x). There-
fore, for fixed qψ (ψ) and qx (x), the ELBO from eq. (2.90) depends on qρ (ρ) only through
\[
f(q_\rho) := \mathbb{E}_{\rho \sim q_\rho} \ln \mathbb{P}(\rho) + \sum_{n,k} \mathbb{E}_{x_{n,k} \sim q_x,\, \rho_k \sim q_\rho} \ln \mathbb{P}(x_{n,k} \mid \rho_k) - \mathbb{E}_{\rho \sim q_\rho} \ln q_\rho(\rho).
\]
Here, the notation ρ_k ∼ q_ρ means the marginal distribution of ρ_k under q_ρ. Using Fubini's theorem, we rewrite the last integral as
\[
f(q_\rho) = -\mathbb{E}_{\rho \sim q_\rho} \ln \frac{q_\rho(\rho)}{\mathbb{P}(\rho) \times \exp\!\left(\sum_{n,k} \mathbb{E}_{x_{n,k} \sim q_x} \ln \mathbb{P}(x_{n,k} \mid \rho_k)\right)}.
\]
The denominator P(ρ) × exp(Σ_{n,k} E_{x_{n,k}∼q_x} ln P(x_{n,k} | ρ_k)) is exactly equal to C q_0(ρ) where ln q_0(ρ) = − ln C + ln P(ρ) + Σ_{n,k} E_{x_{n,k}∼q_x} ln ℓ(x_{n,k} | ρ_k). Therefore
\[
f(q_\rho) = -\mathrm{KL}(q_\rho \,\|\, q_0) + \ln C.
\]
This means that the unique maximizer of f (qρ ) is qρ = q0 i.e. the log density of qρ∗ is as given
in eq. (2.88).
Proof of Corollary 2.5.2. We specialize the formula in eq. (2.88) to the AIFA prior.
Recall the exponential-family form of ℓ(x_{n,k} | ρ_k); its expected log under q_x is
\[
\mathbb{E}_{x_{n,k} \sim q_x} \ln \kappa(x_{n,k}) + \mathbb{E}_{x_{n,k} \sim q_x} \phi(x_{n,k}) \times \ln \rho_k + \big\langle \mu(\rho_k), \mathbb{E}_{x_{n,k} \sim q_x} t(x_{n,k}) \big\rangle - A(\rho_k). \tag{2.93}
\]
Recall that the AIFA prior over ρ_k is the conjugate prior for the likelihood in eq. (2.92):
\[
\ln \mathbb{P}(\rho_k) = (c/K - 1)\ln\rho_k + \left\langle \binom{\psi}{\lambda}, \binom{\mu(\rho_k)}{-A(\rho_k)} \right\rangle - \ln Z\!\left(c/K - 1, \binom{\psi}{\lambda}\right), \tag{2.94}
\]
and the prior factorizes across atoms:
\[
\ln \mathbb{P}(\rho) = \sum_k \ln \mathbb{P}(\rho_k).
\]
Accounting for the normalization constant Zk for each dimension k, we arrive at eq. (2.23).
Proof of Proposition 2.J.2. The argument is the same as in the proof of Proposition 2.J.1. In the overall ELBO, the only terms that depend on f_{n,k} are
\[
\mathbb{E}_{x_{n,k} \sim f_{n,k},\, \rho_k \sim q_\rho} \ln \ell(x_{n,k} \mid \rho_k) + \mathbb{E}_{x_{n,k} \sim f_{n,k},\, x_{n,-k} \sim f_{n,-k},\, \psi \sim q_\psi} \ln \mathbb{P}(y_n \mid x_{n,\cdot}, \psi) - \mathbb{E}_{x_{n,k} \sim f_{n,k}} \ln f_{n,k}(x_{n,k}).
\]
We use Fubini's theorem to express the last integral as a negative KL-like quantity, and use optimality of KL arguments to derive the p.m.f. of the minimizer.
denotes the collection of atom locations, (x_{n,k})_{k=1,n=1}^{K,N} denotes the latent trait counts of each
2.K.1 Image denoising using the beta–Bernoulli process
Data. We obtain the “clean” house image from http://sipi.usc.edu/database/. We downscale
the original 512 × 512 image to 256 × 256 and convert colors to gray scale. We add iid
Gaussian noise to the pixels of the clean image, resulting in the noisy input image. We follow
Zhou et al. [210] in extracting the patches. We use patches of size 8 × 8, and flatten each
observed patch yi into a vector in R64 .
We report the performance for K’s between 10 and 100 with spacing 10.
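A minimal sketch of the patch extraction follows. The stride-1 sliding window (so that patches overlap and each pixel is covered by several patches) is our reading of the setup in Zhou et al. [210]; the toy 16 × 16 image is for illustration only.

```python
def extract_patches(image, patch=8, stride=1):
    """Slide a patch x patch window over a 2-d gray-scale image
    (a list of rows) and flatten each window into a length patch*patch vector."""
    H, W = len(image), len(image[0])
    out = []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            out.append([image[r + i][c + j]
                        for i in range(patch) for j in range(patch)])
    return out

# toy 16 x 16 "image" with a gradient of intensities
img = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
patches = extract_patches(img)
print(len(patches), len(patches[0]))   # 81 64
```

For the 256 × 256 house image the same code yields (256 − 8 + 1)² = 62,001 overlapping patches, each flattened into R⁶⁴.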
Ground measure and observational likelihood. Following Zhou et al. [210], we fix the
ground measure but put a hyper-prior (in the sense of Corollary 2.G.1) on the observational
likelihood. The ground measure is a fixed Gaussian distribution:
\[
\psi_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\left(0, \frac{1}{64} I_{64}\right), \quad i = 1, 2, \ldots, K. \tag{2.97}
\]
The observational likelihood involves two Gaussian distributions with random variances:
\[
\begin{aligned}
\gamma_w &\sim \mathrm{Gamma}(10^{-6}, 10^{-6}), \\
\gamma_e &\sim \mathrm{Gamma}(10^{-6}, 10^{-6}), \\
w_{n,i} \mid \gamma_w &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \gamma_w^{-1}) \text{ across } i, n, \\
y_n \mid x_{n,\cdot}, w_{n,\cdot}, \psi, \gamma_e &\overset{\text{indep}}{\sim} \mathcal{N}\!\left(\sum_{i=1}^{K} x_{n,i} w_{n,i} \psi_i,\ \gamma_e^{-1} I_{64}\right) \text{ across } n.
\end{aligned} \tag{2.98}
\]
We use the (shape,rate) parametrization of the gamma distribution. The weights wn,i enable
an observation to manifest a non-integer (and potentially negative) scaled version of the
153
i-th basis element. The precision γw determines the scale of these weights. The precision γe
determines the noise variance of the observations. We are uninformative about the precisions
by choosing the Gamma(10−6 , 10−6 ) priors.
In sum, the full finite models combine either eqs. (2.96) to (2.98) (for AIFA) or eqs. (2.95),
(2.97) and (2.98) (for TFA).
Approximate inference. We use Gibbs sampling to traverse the posterior over all the
latent variables — the ones that are most important for denoising are x, w, ψ. The chosen
ground measure and observational likelihood have the right conditional conjugacies so that
blocked Gibbs sampling is conceptually simple for most of the latent variables. The only
difference between AIFA and TFA is the step to sample the feature proportions ρ: TFA
updates are much more involved compared to AIFA (see section 2.5). The order in which the Gibbs sampler scans through the blocks of variables does not affect the denoising quality.
To generate the PSNR in fig. 2.6.2a, after finishing the gradual introduction of all patches,
we run 150 Gibbs sweeps. We use the final state of the latent variables at the end of these Gibbs sweeps as the warm-start configurations in figs. 2.6.2b and 2.6.2c.
Evaluation metric. We discuss how iterates from Gibbs sampling define output images.
Each configuration of x, w, ψ defines each patch's "noiseless" value:
\[
\tilde{y}_n = \sum_{i=1}^{K} x_{n,i} w_{n,i} \psi_i.
\]
Each pixel in the overall image is covered by a small number of patches. The “noiseless” value
of each pixel is the average of the pixel value suggested by the various patches that cover that
pixel. We aggregate the output images across Gibbs sweeps by a simple weighted averaging
mechanism. We report the PSNR of the output image with the original image following the
formulas from [86].
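The PSNR computation itself is standard; a minimal sketch is below, with the peak value 255 for 8-bit gray scale as an illustrative choice (it assumes the two images differ somewhere, so the mean squared error is nonzero).

```python
import math

def psnr(clean, denoised, max_val=255.0):
    """Peak signal-to-noise ratio between two equally sized gray-scale
    images, given as flat lists of pixel intensities (in dB)."""
    mse = sum((c - d) ** 2 for c, d in zip(clean, denoised)) / len(clean)
    return 10.0 * math.log10(max_val ** 2 / mse)

clean = [10.0, 20.0, 30.0, 40.0]
noisy = [12.0, 18.0, 33.0, 37.0]
print(round(psnr(clean, noisy), 2))   # 40.0
```

Higher PSNR means the output image is closer to the original clean image, which is the quantity tracked in fig. 2.6.2.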
Finite models. We fix the ground measure to be a Dirichlet distribution and the observa-
tional likelihood to be a categorical distribution i.e. no hyper-priors. The AIFA is
while the TFA is
\[
\begin{aligned}
G_0 &\sim \mathrm{TSB}_K(\omega, \mathrm{Dir}(\eta 1_V)), \\
G_d \mid G_0 &\overset{\text{i.i.d.}}{\sim} \mathrm{TSB}_T(\alpha, G_0) \text{ across } d, \\
\beta_{dn} \mid G_d &\overset{\text{indep}}{\sim} G_d(\cdot) \text{ across } d, n, \\
w_{dn} \mid \beta_{dn} &\overset{\text{indep}}{\sim} \mathrm{Categorical}(\beta_{dn}) \text{ across } d, n.
\end{aligned}
\]
We set the hyperparameters η, α, ω, and T following Wang et al. [206], in that η = 0.01, α =
1.0, ω = 1.0, T = 20. We report the performance for K’s between 20 and 300 with spacing 40.
and the final number we report for document d′ is:
\[
\frac{1}{|w_{\mathrm{ho}}|} \sum_{w \in w_{\mathrm{ho}}} \ln \tilde{p}(w \mid \mathcal{D}, w_{\mathrm{obs}}).
\]
\[
\begin{aligned}
\sum_{i=1}^{\infty} \theta_i \delta_{\psi_i} &\sim \mathrm{BP}\big(2, 0, 0.6;\ \mathcal{N}(0, 5 I_5)\big), \\
x_{n,i} \mid \theta_i &\overset{\text{indep}}{\sim} \mathrm{Ber}(\theta_i) \text{ across } n, i, \\
y_n \mid x_{n,\cdot}, \psi &\overset{\text{indep}}{\sim} \mathcal{N}\!\left(\sum_i x_{n,i} \psi_i,\ I_5\right) \text{ across } n.
\end{aligned}
\]
For the AIFA vs GenPar IFA comparison i.e. fig. 2.6.4b, we use the same generative process
except the beta process is BP(2, 1.0, 0.6). We marginalize out the feature proportions θi and
sample the assignment matrix X = {xn,i } from the power-law Indian buffet process [195].
The feature means are Gaussian distributed, with prior mean 0 and prior covariance 5I5 .
Conditioned on the feature combination, the observations are Gaussian with noise variance
I5 . Since the data is exchangeable, without loss of generality, we use y1:1500 for training and
y1501:2000 for evaluation.
Finite approximations. We use finite approximations that have exact knowledge of the
beta process hyperparameters. For instance, for the AIFA versus BFRY IFA comparison, we
use the K-atom AIFA prior with density
\[
\nu_{\mathrm{AIFA}}(d\theta) := \frac{\mathbb{1}\{0 \le \theta \le 1\}}{Z_K}\, \theta^{-1 + c/K - 0.6\, S(\theta - 1/K)} (1 - \theta)^{-0.4}\, d\theta, \tag{2.99}
\]
where c := 2/B(0.6, 0.4),
\[
S(u) = \begin{cases} \exp\!\left(-\dfrac{1}{1 - K^2 u^2} + 1\right) & \text{if } u \in (-1/K, 0), \\[4pt] \mathbb{1}\{u > 0\} & \text{otherwise}, \end{cases}
\]
and Z_K is the suitable normalization constant.
In all, the approximation to the beta–Bernoulli part of the generative process is
\[
\begin{aligned}
\rho_i &\overset{\text{i.i.d.}}{\sim} \tilde{\nu}(\cdot) \text{ for } i \in [K], \\
x_{n,i} \mid \rho_i &\overset{\text{indep}}{\sim} \mathrm{Ber}(\rho_i) \text{ across } n, i,
\end{aligned} \tag{2.100}
\]
where ν̃(·) is either ν_AIFA, ν_BFRY or ν_GenPar. We report the performance for K from 2 to 100.
Ground measure and observational likelihood. We use hyper-priors in the sense of
Corollary 2.G.1. The ground measure is random because we do not fix the variance of the feature means:
\[
\begin{aligned}
\sigma_g &\sim \mathrm{Gamma}(5, 5), \\
\psi_i &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_g^2 I_5) \text{ for } i \in [K].
\end{aligned} \tag{2.101}
\]
The observational likelihood is also random because we do not fix the noise variance of the observed data:
\[
\begin{aligned}
\sigma_c &\sim \mathrm{Gamma}(5, 5), \\
y_n \mid x_{n,\cdot}, \psi, \sigma_c &\overset{\text{indep}}{\sim} \mathcal{N}\!\left(\sum_i x_{n,i} \psi_i,\ \sigma_c^2 I_5\right).
\end{aligned} \tag{2.102}
\]
In eqs. (2.101) and (2.102), we use the (shape, rate) parametrization of the gamma distribution.
The full finite models are described by eqs. (2.100) to (2.102).23
Each variational distribution is in the natural exponential family. Specifically, we have q(σ_c) = Gamma(ν_c^{(0)}, ν_c^{(1)}), q(σ_g) = Gamma(ν_g^{(0)}, ν_g^{(1)}), q(ψ_i) = N(τ_i, ζ_i), q(ρ_i) = Beta(κ_i^{(0)}, κ_i^{(1)}), and q(x_{n,i}) = Ber(ϕ_{n,i}). We set the initial variational parameters using the latent features, feature assignment matrix, and the variances of the features prior and the observations around the feature combination. We use the ADAM optimizer in Pyro (learning rate 0.001, β₁ = 0.9,
clipping gradients if their norms exceed 40) to minimize the KL divergence between the
approximation and exact posterior. We sub-sample 50 data points at a time to form the
objective for stochastic variational inference. We terminate training after processing 5,000
mini-batches of data.
where y_{1:n} are the training data and {y_{n+i}}_{i=1}^{m} are the held-out data points.
We estimate P(yn+i | y1:n ) using Monte Carlo samples, since the predictive likelihood is an
integral of the posterior over training data:
\[
\mathbb{P}(y_{n+i} \mid y_{1:n}) = \int_{x_{n+i}, \sigma, \psi, \rho} \mathbb{P}(y_{n+i} \mid x_{n+i}, \psi, \sigma)\, \mathbb{P}(x_{n+i}, \psi, \sigma, \rho \mid y_{1:n}),
\]
²³ During inference, we add a small tolerance of 10⁻³ to the standard deviations σ_c, σ_g, ζ_i in the model to avoid singular covariance matrices, although this is not strictly necessary if we clip gradients.
where x_{n+i} is the assignment vector of the (n + i)-th test point. Define the S Monte Carlo samples of the variational approximation to the posterior as (x^s_{(n+1):(n+m),·}, ρ^s, ψ^s, σ^s)_{s=1}^S. We jointly estimate P(y_{n+i} | y_{1:n}) across test points y_{n+i} using the S Monte Carlo samples:
\[
\mathbb{P}(y_{n+i} \mid y_{1:n}) \approx \frac{1}{S} \sum_{s=1}^{S} \mathbb{P}(y_{n+i} \mid x^s_{n+i}, \psi^s, \sigma^s).
\]
We use S = 1,000 samples from the (approximate) posterior to estimate the average log
test-likelihood in eq. (2.103).
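The per-point estimate is the log of an average of likelihoods, which is best computed with log-sum-exp. The sketch below is our own illustration of the arithmetic, with a toy one-dimensional Gaussian standing in for the model's conditional likelihood and synthetic "posterior" samples of its mean.

```python
import math, random

def log_predictive(y_new, posterior_samples, loglik):
    """Monte Carlo estimate of ln P(y_new | y_train): the log of the
    average conditional likelihood over posterior samples, computed
    with log-sum-exp for numerical stability."""
    logs = [loglik(y_new, s) for s in posterior_samples]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs) / len(logs))

def loglik(y, mu, sigma=1.0):
    # toy Gaussian log density standing in for P(y | x, psi, sigma)
    return -0.5 * math.log(2 * math.pi * sigma**2) - 0.5 * (y - mu)**2 / sigma**2

random.seed(0)
samples = [random.gauss(0.0, 0.1) for _ in range(1000)]   # stand-in posterior draws
est = log_predictive(0.0, samples, loglik)
print(est)
```

Averaging such estimates over the held-out points gives the average log test-likelihood reported in the experiments.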
AIFA marginal likelihood. The K-atom AIFA rates define a generative process over
feature matrices with N rows and K columns:
\[
\begin{aligned}
\theta_k &\overset{\text{i.i.d.}}{\sim} \mathrm{AIFA}_K \text{ across } k, \\
x_{n,k} \mid \theta_k &\overset{\text{indep}}{\sim} \mathrm{Ber}(\theta_k) \text{ across } n, k.
\end{aligned}
\]
xn,k is the entry in the nth row and kth column of the feature matrix. Treating the beta
process hyperparameters γ, α, d as unknowns, we compute the probability of observing a
particular feature matrix {xn,k } (integrating out the AIFA rates) as a function of γ, α, d. By
symmetry and independence among the columns x.,k , it suffices to compute the probability
of observing just one column, say {x_{n,1}}_{n=1}^N. Conditioned on θ_1, the probability of observing {x_{n,1}}_{n=1}^N is exactly
\[
\prod_{n=1}^{N} \theta_1^{x_{n,1}} (1 - \theta_1)^{1 - x_{n,1}}.
\]
We integrate out θ_1 to compute the marginal likelihood. Recall that c(γ, α, d) = γ/B(α + d, 1 − d) for the beta process AIFA. The marginal likelihood of observing the first column {x_{n,1}}_{n=1}^N is
\[
\mathbb{E}_{\theta \sim \mathrm{AIFA}_K}\!\left[ \prod_{n=1}^{N} \theta_1^{x_{n,1}} (1 - \theta_1)^{1 - x_{n,1}} \right]
= \frac{\displaystyle\int_0^1 \theta^{-1 + c(\gamma,\alpha,d)/K + \sum_n x_{n,1} - d S_{1/K}(\theta - 1/K)} (1-\theta)^{\alpha + d + N - \sum_n x_{n,1} - 1}\, d\theta}{\displaystyle\int_0^1 \theta^{-1 + c(\gamma,\alpha,d)/K - d S_{1/K}(\theta - 1/K)} (1-\theta)^{\alpha + d - 1}\, d\theta}.
\]
In all, if we denote
\[
Z_K(\gamma, \alpha, d;\, x, y) := \int_0^1 \theta^{-1 + c(\gamma,\alpha,d)/K + x - d S_{1/K}(\theta - 1/K)} (1 - \theta)^{\alpha + d + (y - x) - 1}\, d\theta,
\]
then the marginal probability of observing a particular binary matrix {x_{n,k}}, as a function of γ, α, d, is
\[
\prod_{k=1}^{K} \frac{Z_K(\gamma, \alpha, d;\, \sum_n x_{n,k}, N)}{Z_K(\gamma, \alpha, d;\, 0, 0)}. \tag{2.104}
\]
For feature matrices coming from an IBP, the number of columns K̂ is random, and usually (much) smaller than the number of atoms in the approximation. In this section, the approximation level is K = 100,000: the distribution of the number of active features in the finite model (for d ∈ [0, 0.5]) has no noticeable change between K = 100,000 and K > 100,000. The fact that K − K̂ columns are missing is the same as K − K̂ columns being identically zero; hence, when evaluating the marginal probability of matrices that have fewer than K columns, we simply pad the missing columns with zeros.
It remains to show how to compute eq. (2.104) using numerical methods. The bottleneck is
computing ZK (γ, α, d; x, y). We split the integral into two disjoint domains. The first domain
is (0, 1/K): on this domain, the integral is an incomplete beta integral, which is implemented
in libraries such as Virtanen et al. [204]. The second domain is [1/K, 1]. On this domain,
we first compute m*, the maximum value of the integrand θ^{−1+c(γ,α,d)/K+x−dS_{1/K}(θ−1/K)}(1−θ)^{α+d+(y−x)−1}. We then use numerical integration to integrate the rescaled integrand θ^{−1+c(γ,α,d)/K+x−dS_{1/K}(θ−1/K)}(1−θ)^{α+d+(y−x)−1}/m*. We divide by m* to avoid the integrand getting too small, which happens if x or y are large. The last integrand is well-behaved (bounded and smooth), and we expect
numerical integration to be accurate.
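A sketch of the rescaled integration on [1/K, 1], where S_{1/K}(θ − 1/K) = 1, follows. The trapezoid rule, grid size, and all parameter values are illustrative choices of ours; the (0, 1/K) piece, handled as an incomplete beta integral via a library such as SciPy, is omitted.

```python
import math

def log_integral_rescaled(f_log, lo, hi, n=2000):
    """Integrate exp(f_log) on [lo, hi] after dividing by its maximum
    on the grid, as described in the text; returns ln of the integral.
    Trapezoid rule on a uniform grid (illustrative, not the thesis's scheme)."""
    h = (hi - lo) / n
    ts = [lo + i * h for i in range(n + 1)]
    m_star = max(f_log(t) for t in ts)              # ln m*
    vals = [math.exp(f_log(t) - m_star) for t in ts]
    integral = h * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])
    return m_star + math.log(integral)

def f_log(theta, c_over_K=0.05, x=3, d=0.25, alpha=1.0, y=10):
    # log integrand of Z_K on [1/K, 1], where the smooth step S equals 1;
    # parameter values are arbitrary illustrations
    return ((-1 + c_over_K + x - d) * math.log(theta)
            + (alpha + d + (y - x) - 1) * math.log1p(-theta))

K = 20
val = log_integral_rescaled(f_log, 1.0 / K, 1.0 - 1e-9)
print(val)
```

Because the integrand is divided by m* before summation, the grid values stay in a floating-point-friendly range even when x or y drive the unrescaled integrand toward underflow.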
Marginal likelihoods under BFRY IFA (or GenPar IFA) are challenging to estimate.
In theory, for the BFRY IFA, it is also possible to express the marginal likelihood (as a
function of γ, α, d) for an observed feature matrix xn,k under the BFRY IFA prior as a ratio
between normalization constants. However, we run into numerical issues (divergence errors)
computing the BFRY IFA normalization constants that are not present in computing the
AIFA normalization constants. For completeness, the BFRY IFA normalization constants are
of the kind
\[
Z_{\mathrm{BFRY}}(\gamma, d;\, x, y) = \int_0^1 \frac{\gamma/K}{B(d, 1-d)}\, \theta^{x-d-1} (1-\theta)^{y-x+d-1} \left(1 - \exp\!\left(-(Kd/\gamma)^{1/d}\, \frac{\theta}{1-\theta}\right)\right) d\theta. \tag{2.105}
\]
Whether this integral has a closed-form solution is unknown: the closed-form marginal likelihoods from Lee et al. [124] apply to clustering models from normalized CRMs rather than feature-allocation models from unnormalized CRMs. Numerical integration struggles with Equation (2.105) for x = 0. (Kd/γ)^{1/d} is typically very large: when γ = 1, d = 0.1, even K = 100 leads to (Kd/γ)^{1/d} being on the order of 10²⁰. As a result, under standard floating point precision, 1 − exp(−(Kd/γ)^{1/d} θ/(1−θ)) evaluates to 1 on all points of the quadrature grid: this leads to divergent behavior, as the factor θ^{−d−1} by itself grows too fast near 0.
We resort to Monte Carlo to estimate the normalization constant. In each Monte Carlo batch,
we draw K random variables θ1 , θ2 , . . . , θK from the BFRY density eq. (2.4), and estimate
the log of ZBFRY (γ, d; x, y) with
\[
\operatorname*{logsumexp}_{k=1}^{K} \big[ (x - d - 1) \ln \theta_k + (y - x + d - 1) \ln(1 - \theta_k) - \ln K \big].
\]
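The estimator is a standard log-sum-exp computation; a sketch follows. The θ_k here are stand-in Beta draws used only to exercise the arithmetic: sampling from the actual BFRY density of eq. (2.4) is assumed available and is not shown.

```python
import math, random

def log_Z_mc(thetas, x, y, d):
    """log-sum-exp Monte Carlo estimate of ln Z_BFRY(gamma, d; x, y)
    from draws theta_1, ..., theta_K (assumed to follow the BFRY density)."""
    logs = [(x - d - 1) * math.log(t) + (y - x + d - 1) * math.log1p(-t)
            - math.log(len(thetas)) for t in thetas]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

random.seed(0)
# stand-in draws (Beta, NOT the BFRY law) purely to exercise the estimator
thetas = [random.betavariate(0.5, 1.5) for _ in range(1000)]
print(log_Z_mc(thetas, x=3, y=10, d=0.25))
```

Subtracting the maximum before exponentiating keeps the estimate stable even when individual terms underflow in raw form.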
In the left panel of fig. 2.6.5b, we first generate a feature matrix from the IBP with mass 3.0, concentration 0.0 and discount 0.25. We then plot the estimate of the marginal likelihood under BFRY IFA for this feature matrix as a function of d, with mass fixed at 3.0 and concentration fixed at 0.0.
GenPar IFA faces similar problems as BFRY IFA. We are not aware of a closed-form formula
for the marginal likelihood. Namely, we are not able to show that eq. (2.5) is a conjugate
prior for the Bernoulli likelihood: when we observe an observation X = 1 from the model
X ∼ Ber(θ), θ ∼ νGenPar , the posterior density for θ is proportional to
\[
\frac{\theta^{-d}(1-\theta)^{\alpha+d-1}}{B(1-d, \alpha+d)} \left[ 1 - \left( 1 + \left(\frac{Kd}{\gamma\alpha}\right)^{1/d} (\theta^{-1} - 1) \right)^{-\alpha} \right] \mathbb{1}\{0 \le \theta \le 1\}.
\]
This new density is not in the same family as the original generalized Pareto variate. Default schemes to numerically integrate P(0 | θ_k) against the generalized Pareto prior for θ_k fail because of overflow issues associated with the magnitude of the term (Kd/(γα))^{1/d}. In the right panel of fig. 2.6.5b, we first generate a feature matrix from the IBP with mass 3.0, concentration 1.0 and discount 0.25. We then plot the estimate of the marginal likelihood under GenPar IFA for this feature matrix as a function of d, with mass fixed at 3.0 and concentration fixed at 1.0.
Optimization. For AIFA, i.e., the left panel of fig. 2.6.5a, to estimate the beta process hyperparameters given an observed feature matrix, we maximize the marginal probability in eq. (2.104) with respect to γ, α, d by doing a grid search with a fine resolution. The base grid for the triplet (γ, α, d) is the Cartesian product of three lists: [1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 1.0, 1.5, 2.0, 2.5], and [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]. We refine the base grid around the true hyperparameters. For example, in the discount estimation experiment, a true configuration is (3.0, 1.0, 0.4). The refinement here is the Cartesian product of three lists: [2.6, 2.8, 3.0, 3.2, 3.4], [0.8, 0.9, 1.0, 1.1, 1.2], and [0.36, 0.38, 0.4, 0.42, 0.44]. We append the refinement to the base grid by looping through all the configurations. We propose the best hyperparameters by evaluating the marginal likelihood (eq. (2.104)) at all points on the grid, and reporting the maximizer.
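The grid search just described can be sketched in a few lines, with `log_marginal` standing in for an (assumed, not shown) evaluator of eq. (2.104):

```python
import itertools

def grid_search_hyperparams(log_marginal, base_grid, refinement):
    """Maximize a log marginal likelihood over a base grid plus a refinement.

    log_marginal: hypothetical callable (gamma, alpha, d) -> float.
    base_grid, refinement: tuples of three candidate-value lists each.
    """
    candidates = (list(itertools.product(*base_grid))
                  + list(itertools.product(*refinement)))
    # Evaluate every configuration and report the maximizer.
    return max(candidates, key=lambda cfg: log_marginal(*cfg))
```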
For the nonparametric process i.e. right panel of fig. 2.6.5a, the probability of observing a
particular feature matrix under the IBP prior over N rows is given in Broderick et al. [29,
Equation 7]. We maximize this function with respect to γ, α, d using differential evolution
techniques [191, 204].
2.K.5 Dispersion estimation
Generative model. The probabilistic model is

\[
\begin{aligned}
\lambda_k &\overset{\text{i.i.d.}}{\sim} \mathrm{XGamma}(\alpha/K, c, \tau, T) && \text{across } k,\\
\phi_k &\overset{\text{i.i.d.}}{\sim} \mathrm{Dir}(a_\phi 1_V) && \text{across } k,\\
z_{n,k} \mid \lambda_k &\overset{\text{indep}}{\sim} \mathrm{CMP}(\lambda_k, \tau) && \text{across } k, n,\\
x_{n,v} \mid z_{n,:}, \phi &\overset{\text{indep}}{\sim} \mathrm{Poisson}\Bigl(\textstyle\sum_k z_{n,k}\,\phi_{k,v}\Bigr) && \text{across } v, n.
\end{aligned}
\tag{2.106}
\]
Recall the definition of the XGamma variate from eq. (2.25). The observed data is the count matrix (x_{n,v}), where x_{n,v} is the number of times document n manifests vocab word v. The hyperparameters are α, c, τ, T and a_ϕ. To draw data from eq. (2.106), we need to sample from XGamma and CMP, two distributions that are not implemented in standard numerical libraries. The only bottleneck in drawing CMP(θ, τ) is computing Z_τ(θ). We approximate the infinite sum \(\sum_{y=0}^{\infty} \theta^y/(y!)^\tau\) with a truncation \(\sum_{y=0}^{L} \theta^y/(y!)^\tau\), using the bounds from Minka et al. [144] to make sure the contribution of the left-out terms is small. To draw from XGamma, whose unnormalized density has a contribution from Z_τ^{−c}(θ), we use the above approximation of Z_τ(θ) and slice sampling on the approximation of the unnormalized density.
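As a rough illustration, the truncated normalizer can be computed stably in log space. This sketch stops once new terms are negligible relative to the running sum, a simpler stopping rule than the Minka et al. [144] bounds used above:

```python
import math

def log_Z_cmp(theta, tau, max_terms=100_000, tol=1e-12):
    """Truncated log normalizer log sum_{y>=0} theta^y / (y!)^tau of the CMP.

    Accumulating in log space keeps large theta or small tau from overflowing.
    """
    log_theta = math.log(theta)
    log_term = 0.0   # y = 0 term: theta^0 / (0!)^tau = 1
    log_sum = 0.0
    for y in range(1, max_terms):
        log_term += log_theta - tau * math.log(y)
        m = max(log_sum, log_term)   # log-sum-exp update of the running sum
        log_sum = m + math.log(math.exp(log_sum - m) + math.exp(log_term - m))
        if log_term < log_sum + math.log(tol):
            break
    return log_sum
```

A quick sanity check: for τ = 1 the CMP reduces to the Poisson, so Z_τ(θ) = e^θ and the function should return θ.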
When generating synthetic data, we draw N = 600 documents, over a vocabulary of size 100,
from a model with K = 500. The under-dispersed case and the over-dispersed case have the
same following hyperparameters: α = 20, c = 1, T = 1000, aϕ = 0.01. For underdispersion,
τ = 1.5, while for overdispersion, τ = 0.7. Our primary goal of inference is estimating the
topics and the shape τ . As such, during posterior inference, we fix the hyperparameters α, c,
T , and aϕ at the data-generating values, and sample the remaining latent variables (λ, ϕ, z)
and shape τ. We put a uniform (0, 100] prior on the shape τ: τ is always positive, and there is no noticeable difference in the amount of dispersion (the ratio of variance to mean) between τ = 100 and τ > 100. Furthermore, during sampling, the values of τ are much smaller than 100, indicating that inference would have remained the same for different choices of the uniform's upper limit.
Gibbs sampling. During sampling, following Zhou et al. [211, Section 4], we augment
the original model by introducing three additional families of latent variables: s, u and q.
Conditioned on z and ϕ, the pseudocount s_{n,k,v} is Poisson distributed:

\[
s_{n,k,v} \mid z, \phi \overset{\text{indep}}{\sim} \mathrm{Poisson}(z_{n,k}\,\phi_{k,v}), \quad \text{across } n, k, v,
\]

and the s_{n,k,v} add up to x_{n,v} in the following way:

\[
x_{n,v} = \sum_k s_{n,k,v}.
\]
It follows that

\[
\begin{aligned}
u_{n,k} \mid z_{n,k} &\overset{\text{indep}}{\sim} \mathrm{Poisson}(z_{n,k}) && \text{across } n, k,\\
\{s_{n,k,v}\}_{v=1}^{V} \mid u_{n,k}, \phi_k &\overset{\text{indep}}{\sim} \mathrm{Multi}(u_{n,k}; \phi_k) && \text{across } n, k.
\end{aligned}
\tag{2.107}
\]

Summing the pseudocounts across documents, we define

\[
q_{k,v} := \sum_n s_{n,k,v}.
\]
We use a blocked Gibbs sampling strategy. The variable blocks are ϕ, λ, s (which determines u and q), z, and τ. First, we compute the Gibbs conditional of the topics ϕ. Since u is determined by s (eq. (2.107)), conditioned on s, ϕ is independent of the remaining latent variables:

\[
P(\phi \mid x, \lambda, s, z, \tau) = P(\phi \mid s) \propto P(\phi)\,P(s \mid u, \phi)
= \prod_{k=1}^{K} \mathrm{Dir}\bigl(\phi_k \,\big|\, [a_\phi + q_{k,v}]_{v=1}^{V}\bigr).
\]
We compute the Gibbs conditionals of the rates λ. Conditioned on the trait counts z and shape τ, λ is independent of the remaining latent variables:

\[
P(\lambda \mid x, \phi, s, z, \tau) = P(\lambda \mid z, \tau)
= \prod_{k=1}^{K} P(\lambda_k \mid z_{\cdot,k}, \tau)
= \prod_{k=1}^{K} \mathrm{XGamma}\Bigl(\lambda_k \,\Big|\, \frac{\alpha}{K} + \sum_n z_{n,k},\; c + N,\; \tau,\; T\Bigr).
\]
We use the scheme discussed after eq. (2.106) to sample these XGamma variates. The Gibbs conditionals of the trait counts z are

\[
P(z \mid x, \phi, \lambda, s, \tau) = P(z \mid \lambda, s, \tau)
= \prod_{n,k} P(z_{n,k} \mid \lambda_k, u_{n,k}, \tau)
\propto \prod_{n,k} \mathrm{Poisson}(u_{n,k} \mid z_{n,k})\,\mathrm{CMP}(z_{n,k} \mid \lambda_k, \tau).
\]

The multiplicative factors that do not depend on z can be taken out of the sum: we only need to compute \(\sum_{z=0}^{\infty} (\lambda_k/e)^z\, z^{u_{n,k}} / (z!)^\tau\). Similar to the computation of Z_τ(θ), we approximate this infinite sum with a finite truncation, making sure the left-out terms have a small contribution.
The Gibbs conditionals of the pseudocounts s are
P(s | x, ϕ, λ, z, τ ) = P(s | x, ϕ, z)
Y X
= Multi({sn,k,v }K
k=1 | x n,v ; [zn,k ϕk,v / zn,k′ ϕk′ ,v ]).
n,v k′
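Simulating this conditional is a per-entry multinomial split of each observed count across topics. A minimal sketch (array names are illustrative, not from the thesis code):

```python
import numpy as np

def sample_pseudocounts(x, z, phi, rng):
    """Gibbs update for s: allocate each count x[n, v] across the K topics,
    with topic k receiving probability proportional to z[n, k] * phi[k, v].
    """
    N, V = x.shape
    K = phi.shape[0]
    s = np.zeros((N, K, V), dtype=np.int64)
    for n in range(N):
        for v in range(V):
            p = z[n, :] * phi[:, v]
            total = p.sum()
            if total > 0:  # if no topic is active, x[n, v] must be 0 anyway
                s[n, :, v] = rng.multinomial(x[n, v], p / total)
    return s
```

By construction, the draws satisfy x_{n,v} = Σ_k s_{n,k,v}, matching the augmentation above.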
(a) Original (b) Input, 24.68 dB (c) AIFA, 34.62 dB (d) TFA, 34.76 dB
Figure 2.L.1: Sample AIFA and TFA denoised images have comparable quality. (a) shows
the noiseless image. (b) shows the corrupted image. (c,d) are sample denoised images from
finite models with K = 60. PSNR (in dB) is computed with respect to the noiseless image.
In implementations, we omit the contribution from P(λ | τ), since it contributes a very small amount (less than 0.1%) to the overall value of ln P(z | τ, λ) + ln P(λ | τ) + ln P(τ), but takes more time to evaluate than the other two components. In other words, the unnormalized log density of τ conditioned on the other variables is just

\[
\ln P(z \mid \tau, \lambda) + \ln P(\tau)
= \sum_{n,k} \ln \mathrm{CMP}(z_{n,k} \mid \lambda_k, \tau) + \ln \mathbb{1}\{\tau \in (0, 100]\}.
\]
MCMC results. We run 40 chains, each for 50,000 iterations. After discarding the first 25,000 iterations, all chains have an R̂ diagnostic [65] smaller than 1.01. To combat serial correlation, we thin the samples after burn-in, keeping one draw every 2,000 iterations. The effective number of samples remaining after burn-in and thinning is about 1,000.
(a) Performance versus K (b) TFA training (c) AIFA training
Figure 2.L.2: (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K. The error bars reflect randomness in both initialization and simulation of the conditionals across 5 trials. AIFA denoising quality improves as K increases, and the performance is similar to TFA across approximation levels. Moreover, the TFA- and AIFA-denoised images are very similar: PSNR ≈ 50 for TFA versus AIFA, whereas PSNR < 35 for TFA or AIFA versus the original image. (b,c) show how PSNR evolves during inference. The "warm-start" lines indicate that the AIFA-inferred (respectively, TFA-inferred) parameters are excellent initializations for TFA (respectively, AIFA) inference.
(a) Original (b) Input, 24.69 dB (c) AIFA, 30.06 dB (d) TFA, 30.24 dB
Figure 2.L.3: Sample AIFA and TFA denoised images have comparable quality. (a) shows
the noiseless image. (b) shows the corrupted image. (c,d) are sample denoised images from
finite models with K = 60. PSNR (in dB) is computed with respect to the noiseless image.
Figure 2.L.4: (a) Peak signal-to-noise ratio (PSNR) as a function of approximation level K. The error bars reflect randomness in both initialization and simulation of the conditionals across 5 trials. AIFA denoising quality improves as K increases, and the performance is similar to TFA across approximation levels. Moreover, the TFA- and AIFA-denoised images are very similar: PSNR ≈ 47 for TFA versus AIFA, whereas PSNR < 31 for TFA or AIFA versus the original image. (b,c) show how PSNR evolves during inference. The "warm-start" lines indicate that the AIFA-inferred (respectively, TFA-inferred) parameters are excellent initializations for TFA (respectively, AIFA) inference.
(a) Average (b) Best
Figure 2.L.5: The predictive log-likelihood of AIFA is not sensitive to different settings of a
and bK . Each color corresponds to a combination of a and bK . (a) is the average across 5
trials with different random seeds for the stochastic optimizer, while (b) is the best across
the same trials.
Figure 2.L.6: In fig. 2.L.6a, we estimate the mass by maximizing the marginal likelihood of
the AIFA (left panel) or the full process (right panel). The solid blue line is the median of
the estimated masses, while the lower and upper bounds of the error bars are the 20% and
80% quantiles. The black dashed line is the ideal value of the estimated mass, equal to the
ground-truth mass. The key for fig. 2.L.6b is the same, but for concentration instead of mass.
Chapter 3
3.1 Introduction
Consider the following motivating example. In forest management, ecologists study the association between drought and tree canopy death. For instance, Senf et al. [183] analyze data on tree mortality across many years and many regions in Europe, using a Bayesian mixed effects model. To draw inferences, Senf et al. use Markov chain Monte Carlo (MCMC) to approximate posterior functionals. These estimates suggest a positive and significant (in the Bayesian sense) association between drought and tree mortality in the period between 1987 and 2016. For planning, ecologists want to know if such trends can be extrapolated beyond the gathered data. The generalization might be across time: if the association persists after 2016, it could help policymakers make a case that drought needs to be mitigated to prevent tree death in the future. The generalization might be across space: if the association holds in European regions not covered by the data (such as parts of Russia), it could save policymakers from having to repeat the study, which takes time and resources.
This is a specific instance of a broader problem. In many fields, researchers analyze collected data with Bayesian models and MCMC: examples include economics [140], epidemiology [103], psychology [168], and many more. In general, these analysts want to know if their findings can be more broadly applied.
Standard tools to assess generalization do not answer this question entirely. An analyst might use frequentist tools (confidence intervals, p-values) or Bayesian tools (credible intervals, posterior quantiles) to predict whether their inferences hold in the broader population. The validity of these methods technically depends on the assumption that the gathered data is an independent and identically distributed (i.i.d.) sample from the broader population. In practice, we have reason to suspect that this assumption is not met: for instance, in the tree mortality case, the data collected up to 2016 is likely similar, but not identical, to the data that will be collected in the future, perhaps due to climate change.

An analyst might wish that deviations from the i.i.d. assumption are small enough that their conclusions remain the same in the broader population, and that standard tools accurately assess generalization. Conversely, if it were possible to remove a small fraction of data and change conclusions, the analyst would be worried about such deviations and about the assessment from standard tools.
Broderick et al. [31] is the first work to formulate such small-data sensitivity as a check on generalization. Along with the formulation, one contribution of that work is a fast way to detect sensitivity when the analysis in question is based on estimating equations [114, Chapter 13]. Regardless of how estimators are constructed, in general, the brute-force approach to finding an influential small fraction of data is computationally intractable. One would need to enumerate all possible data subsets of a given cardinality and re-run the analysis on each subset: even when the fraction of data removed is small and each analysis takes little time, there are too many such subsets to consider (see the discussion at the end of section 3.3). For estimating equations, Broderick et al. [31] estimate the effect of dropping data with a first-order Taylor series approximation: this approximation can be optimized very efficiently, while the brute-force approach is not at all practical.
Neither Broderick et al. [31] nor subsequent work on small-data removals [64, 147, 185] can be immediately applied to determine sensitivity in MCMC. Since MCMC output cannot be cast as the root of estimating equations or the solution to an optimization problem, neither Broderick et al. [31] nor Shiffman et al. [185] apply to our situation. And since Freund and Hopkins [64] and Moitra and Rohatgi [147] focus on ordinary least squares (OLS), their work does not address our problem either.
Our contributions. We extend Broderick et al. [31] to handle analyses based on MCMC. In section 3.2, we introduce the relevant concepts in Bayesian decision-making and describe the non-robustness concerns. In section 3.3.1, to approximate to first order the effect of removing observations, we use known results on how much a posterior expectation locally changes under small perturbations to the total log likelihood [52, 71, 80, 181]. As this approximation involves posterior covariances, in section 3.3.2, we re-use the MCMC draws that an analyst would have already generated to estimate what happens when data is removed. Recognizing that Monte Carlo errors induce variability in our approximation, in section 3.3.3, we use a variant of the bootstrap of Efron [57] to quantify this uncertainty. For more discussion on how our methodology relates to existing work, see section 3.1.1. Experimentally, we apply our method to three Bayesian analyses. In section 3.5, we detect non-robustness in econometric and ecological studies. However, while our approximation performs well in simple models such as linear regression, it is less reliable in complex models, such as ones with many random effects.
process: in fact, recent work [50] has shown that comparing probability distributions based
on the KL divergence can be misleading if an analyst really cared about the comparison
between the distributions’ means.
3.2 Background
We introduce the problem of drop-data non-robustness in Bayesian data analysis.
A Bayesian analyst might be worried if the substantive decision arising from their data analysis changed after removing some small fraction α of the data. For instance,
• If their decision were based on the sign of the posterior mean, they would be worried if that sign changed.
• If their decision were based on zero falling outside a credible interval, they would be worried if removing data could make the credible interval contain zero.
• If their decision were based on both the sign and the significance, they would be worried if removing data could both change the posterior mean's sign and put a majority of the posterior mass on the opposite side of zero.
Definition 3.2.1. Let \(Z(w) := \int p(\beta) \prod_{n=1}^{N} \exp(w_n L(d^{(n)} \mid \beta))\, d\beta\). If Z(w) < ∞, the weighted posterior distribution associated with w has density

\[
p(\beta \mid w, \{d^{(n)}\}_{n=1}^{N}) := \frac{1}{Z(w)}\, p(\beta) \exp\left( \sum_{n=1}^{N} w_n L(d^{(n)} \mid \beta) \right).
\]

The weight w_n encodes the inclusion of d^{(n)} in the analysis. If w_n = 0, the n-th observation is ignored; if w_n = 1, the n-th observation is fully included. We recover the regular posterior density by setting all weights to 1: w = 1_N = (1, 1, . . . , 1). It is possible that \(p(\beta) \exp(\sum_{n=1}^{N} w_n L(d^{(n)} \mid \beta))\) is not integrable for some w. This is the case when the prior p(β) is improper and all weights have been set to zero: w = 0_N = (0, 0, . . . , 0). In the following, we assume that any contribution of the likelihood is enough to define a proper posterior. This assumption is immediate in the case of a proper prior and standard likelihoods.
The notation p(β | w, {d^{(n)}}_{n=1}^{N}) emphasizes the dependence on w, and will supersede the p(β | {d^{(n)}}_{n=1}^{N}) notation. To indicate expectations under the weighted posterior, we use the subscript w: E_w is the expectation taken with respect to the randomness β ∼ p(β | w, {d^{(n)}}_{n=1}^{N}).

With the weighted posterior notation, we extend concepts from the standard analysis to the new analysis involving weights. The value of a posterior functional depends on w. For instance, the posterior mean under the weighted posterior is E_w g(β), and we recover the standard posterior mean by setting w = 1_N.
The Bayesian analyst's non-robustness concern can be formalized as follows. For α ∈ (0, 1), let W_α denote the set of all weight vectors that correspond to dropping no more than 100α% of the data, i.e.

\[
W_\alpha := \left\{ w \in \{0, 1\}^N : \frac{1}{N} \sum_{n=1}^{N} (1 - w_n) \le \alpha \right\}.
\]

We say the analysis is non-robust if there exists a weight vector w that a) corresponds to dropping a small amount of data (w ∈ W_α) and b) changes the conclusion.
We focus on decision problems that satisfy the following simplifying assumption: there exists a posterior functional, which we denote by ϕ(w), such that ϕ(1_N) < 0 and the conclusion changes if and only if ϕ(w) > 0. Such a functional will be called a "quantity of interest" (QoI). We show how the changes mentioned in this section's beginning fit this framework. To change the conclusion about sign, if the full-data posterior mean (E_{1_N} g(β)) were positive, we take ϕ(w) = −E_w g(β). Since the full-data posterior mean is positive, ϕ(1_N) < 0. And ϕ(w) > 0 is equivalent to the posterior mean (after removing the data) being negative. To change the conclusion about significance, if the approximate credible interval's left endpoint¹ (E_{1_N} g(β) − z_{0.975} √(Var_{1_N} g(β))) were positive, we take ϕ(w) = −(E_w g(β) − z_{0.975} √(Var_w g(β))). Then ϕ(1_N) < 0, and ϕ(w) > 0 is equivalent to moving the left endpoint below zero, thus changing a significant result to a non-significant one. Finally, to change to a significant result of the opposite sign, if the approximate credible interval's left endpoint were positive, we take ϕ(w) = −(E_w g(β) + z_{0.975} √(Var_w g(β))). On the full data, the right endpoint is above zero. For a weight vector such that ϕ(w) > 0, the right endpoint has been moved below zero: the conclusion has changed from a significant positive result to a significant negative result.
Under such assumptions, checking for non-robustness is equivalent to a) finding the maximum value of ϕ(w) subject to w ∈ W_α and b) checking its sign. The outcome of this comparison remains the same if we retain the feasible set, maximize the objective function ϕ(w) − c, and compare the optimal value with −c, for c being any constant that does not depend on the weight. For later convenience, we set c = ϕ(1_N). As in Broderick et al. [31, Section 2], we define the Maximum Influence Perturbation to be the largest change, induced in a quantity of interest, by dropping no more than 100α% of the data. In our notation, it is the optimal value of the following optimization problem:

\[
\max_{w \in W_\alpha} \; \phi(w) - \phi(1_N). \tag{3.1}
\]
If the Maximum Influence Perturbation is more than −ϕ(1N ), then the conclusion is non-
robust to small data removal. The set of observations that achieve Maximum Influence
¹Our approximate credible interval multiplies the posterior standard deviation by z_{0.975}, which is the 97.5% quantile of the standard normal, but we can replace this with other scalings without undue effort.
Perturbation is called the Most Influential Set: to report it, we compute the optimal solution of eq. (3.1), and find its zero indices.

In general, the brute-force approach to Equation (3.1) takes a prohibitively long time. We would need to enumerate every data subset that drops no more than 100α% of the original data and, for each subset, re-run MCMC to re-estimate the quantity of interest. There are more than \(\binom{N}{\lfloor N\alpha \rfloor}\) elements in W_α. One of our later numerical studies involves N = 16,560 observations: even for α = 0.001, there are more than 10^{54} subsets to consider. Each Markov chain already takes a noticeable amount of time to construct: in this analysis, to generate 4,000 samples, we need to run the chain for 1 minute. The total time to compute the Maximum Influence Perturbation would be on the order of 10^{48} years.
3.3 Methods
As the brute force solution to eq. (3.1) is computationally prohibitive, we turn to approximation
methods. In this section, we provide a series of approximations to the Maximum Influence
Perturbation problem.
Assumption 3.3.1. Let g be a function from R^P to the real line. ϕ(w) is a linear combination of the posterior mean and posterior standard deviation, i.e., there exist constants c_1 and c_2, which are independent of w, such that

\[
\phi(w) = c_1\, \mathbb{E}_w\, g(\beta) + c_2 \sqrt{\mathrm{Var}_w\, g(\beta)}.
\]

A typical choice of g is the function that returns the p-th coordinate of a P-dimensional vector.
It might appear that constraining ϕ(w) to be a linear combination of the posterior mean
and standard deviation is overly restrictive. However, this choice encompasses many cases of
practical interest: recall from section 3.2.2 that the quantities of interest for changing sign,
changing significance, and producing a significant result of the opposite sign, take the form of
Assumption 3.3.1. Furthermore, the choice of constraining ϕ(w) to be a linear combination
of the posterior mean and standard deviation in Assumption 3.3.1 is done out of convenience.
Our framework can also handle quantities of interest that involve higher moments of the
posterior distribution, and the function that combines these moments need not be linear,
but we omit these cases for brevity. However, we note that posterior quantiles in general do
not satisfy Assumption 3.3.1 and leave to future work the question of how to diagnose the
sensitivity of such quantities of interest.
Assumption 3.3.2. For any w ∈ [0, 1]^N \ {0_N}, the following functions have finite expectations under the weighted posterior: |g(β)|, g(β)², |L(d^{(n)} | β)| (for all n), |g(β) L(d^{(n)} | β)| (for all n), and |g(β)² L(d^{(n)} | β)| (for all n).
The assumption is mild. It is satisfied by, for instance, linear regression under a Gaussian likelihood with g(β) = β_p.
Under Assumption 3.2.1, Assumption 3.3.1, and Assumption 3.3.2, ϕ(w) is continuously differentiable with respect to w.

Theorem 3.3.1. Assume Assumption 3.2.1, Assumption 3.3.1, and Assumption 3.3.2. For any δ ∈ (0, 1), ϕ(w) is continuously differentiable with respect to w on {w ∈ [0, 1]^N : max_n w_n ≥ δ}. The n-th partial derivative² at w is equal to c_1 f + c_2 s, where

\[
f = \mathrm{Cov}_w\bigl(g(\beta), L(d^{(n)} \mid \beta)\bigr), \tag{3.2}
\]

and

\[
s = \frac{\mathrm{Cov}_w\bigl(g(\beta)^2, L(d^{(n)} \mid \beta)\bigr) - 2\,\mathbb{E}_w g(\beta) \times \mathrm{Cov}_w\bigl(g(\beta), L(d^{(n)} \mid \beta)\bigr)}{\sqrt{\mathrm{Var}_w\, g(\beta)}}. \tag{3.3}
\]
See section 3.B.1 for the proof. This theorem is a specific instance of the sensitivity of posterior expectations with respect to log likelihood perturbations: for further reading, we recommend Basu et al. [14], Diaconis and Freedman [52], Gustafson [80]. Theorem 3.3.1 establishes both the existence of the partial derivatives and their formula. Equation (3.2) is the partial derivative of the posterior mean with respect to the weights, while eq. (3.3) is that for the posterior standard deviation, with the understanding that the derivative is one-sided.

Based on Theorem 3.3.1, we define the n-th influence as the partial derivative of ϕ(w) at w = 1_N:

\[
\psi_n := \left. \frac{\partial \phi(w)}{\partial w_n} \right|_{w = 1_N}.
\]
Then, the first-order Taylor series approximation of ϕ(w) − ϕ(1_N) is

\[
\phi(w) - \phi(1_N) \approx \sum_{n=1}^{N} \psi_n (w_n - 1). \tag{3.4}
\]

This approximation predicts that leaving out the n-th observation (w_n = 0) changes the quantity of interest by −ψ_n. Using eq. (3.4), we approximately solve eq. (3.1) by replacing its objective function but keeping its feasible set:

\[
\begin{aligned}
\max_{w} \quad & \sum_{n=1}^{N} (w_n - 1)\, \psi_n \\
\text{s.t.} \quad & w_n \in \{0, 1\}, \quad \frac{1}{N} \sum_{n=1}^{N} (1 - w_n) \le \alpha.
\end{aligned}
\tag{3.5}
\]

²If w_n lies on the boundary, the partial derivative is understood to be one-sided.
Algorithm 4 Influence Estimate (EI)
Inputs:
    c_1, c_2 ▷ ϕ(w)-defining constants
    (β^{(1)}, . . . , β^{(S)}) ▷ Markov chain
1: procedure EI(c_1, c_2, (β^{(1)}, . . . , β^{(S)}))
2:     m ← (1/S) Σ_{s=1}^{S} g(β^{(s)})
3:     v ← (1/S) Σ_{s=1}^{S} g(β^{(s)})² − m²
4:     ψ̂ ← (0, 0, . . . , 0) ▷ N-dimensional vector
5:     for n ← 1, N do
6:         f ← (1/S) Σ_s g(β^{(s)}) L(d^{(n)} | β^{(s)}) − [(1/S) Σ_s g(β^{(s)})] [(1/S) Σ_s L(d^{(n)} | β^{(s)})]
7:         h ← (1/S) Σ_s g(β^{(s)})² L(d^{(n)} | β^{(s)}) − [(1/S) Σ_s g(β^{(s)})²] [(1/S) Σ_s L(d^{(n)} | β^{(s)})]
8:         ŝ ← (h − 2mf)/√v ▷ Estimate of eq. (3.3)
9:         ψ̂_n ← c_1 f + c_2 ŝ ▷ Estimate of ψ_n
10:     end for
11:     return ψ̂
12: end procedure
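In practice, Algorithm 4 vectorizes naturally. The following sketch assumes the analyst has precomputed g(β^{(s)}) for every draw and the per-datum log likelihoods L(d^{(n)} | β^{(s)}); array names are illustrative:

```python
import numpy as np

def influence_estimates(c1, c2, g_draws, loglik_draws):
    """Monte Carlo influence scores, as in Algorithm 4.

    g_draws: shape (S,), values g(beta^(s)) for each MCMC draw.
    loglik_draws: shape (S, N), values L(d^(n) | beta^(s)).
    """
    m = g_draws.mean()
    v = (g_draws ** 2).mean() - m ** 2
    # Sample covariances of g and g^2 with each per-datum log likelihood
    f = (g_draws[:, None] * loglik_draws).mean(axis=0) - m * loglik_draws.mean(axis=0)
    h = ((g_draws ** 2)[:, None] * loglik_draws).mean(axis=0) \
        - (g_draws ** 2).mean() * loglik_draws.mean(axis=0)
    s = (h - 2.0 * m * f) / np.sqrt(v)  # estimate of eq. (3.3)
    return c1 * f + c2 * s              # estimate of (psi_1, ..., psi_N)
```

The loop over n in Algorithm 4 becomes a single broadcasted pass over the (S, N) array of log likelihoods.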
Solving Equation (3.5) is straightforward. For any w ∈ W_α, the objective function is equal to Σ_{n : w_n = 0} (−ψ_n). Let w(α) be the optimal solution and ∆(α) be the optimal value of eq. (3.5). We denote by U(α) the set of observations omitted according to w(α): U(α) := {d_n : w(α)_n = 0}. Let r_1, r_2, . . . , r_N sort the ψ_n in increasing order: ψ_{r_1} ≤ ψ_{r_2} ≤ . . . ≤ ψ_{r_N}. Let m be the smallest index such that ψ_{r_{m+1}} ≥ 0: if none exists, set m to N. If m ≥ 1, w(α) assigns weight 0 to the observations r_1, r_2, . . . , r_{min(m, ⌊Nα⌋)}, and weight 1 to the remaining ones. Otherwise, m = 0 and w(α) assigns weight 1 to all observations. Following Broderick et al. [31], we call the optimal value ∆(α) of Equation (3.5) the Approximate Maximum Influence Perturbation (AMIP). It is equal to the negative of Σ_{m=1}^{⌊Nα⌋} ψ_{r_m} I{ψ_{r_m} < 0}, where I{·} is the indicator function.
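Given the influence scores, computing the AMIP and the corresponding dropped set reduces to a sort. A minimal sketch (names are ours):

```python
import numpy as np

def amip(psi, alpha):
    """Approximate Maximum Influence Perturbation and dropped indices.

    psi: array of N influence scores; alpha: fraction of data allowed to drop.
    """
    N = len(psi)
    budget = int(np.floor(N * alpha))
    order = np.argsort(psi)                         # most negative influences first
    drop = [i for i in order[:budget] if psi[i] < 0]  # only drop if it helps
    delta = -psi[drop].sum() if drop else 0.0
    return delta, drop
```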
Algorithm 5 Sum of Sorted Influence Estimate (SoSIE)
Inputs:
    c_1, c_2 ▷ ϕ(w)-defining constants
    (β^{(1)}, . . . , β^{(S)}) ▷ Markov chain
    α ▷ Fraction of data to drop
1: procedure SoSIE(c_1, c_2, (β^{(1)}, . . . , β^{(S)}), α)
2:     ψ̂ ← EI(c_1, c_2, (β^{(1)}, . . . , β^{(S)}))
3:     Find ranks v_1, v_2, . . . , v_N such that ψ̂_{v_1} ≤ ψ̂_{v_2} ≤ . . . ≤ ψ̂_{v_N}
4:     Find the smallest p such that ψ̂_{v_{p+1}} ≥ 0. If none exists, set p to N.
5:     If p ≥ 1, Û ← {d_{v_1}, . . . , d_{v_{min(p, ⌊Nα⌋)}}}. Otherwise, Û ← ∅
6:     ∆̂ ← −Σ_{m=1}^{⌊Nα⌋} ψ̂_{v_m} I{ψ̂_{v_m} < 0}
7:     return ∆̂, Û
8: end procedure
the notation ∆̂(β^{(1)}, . . . , β^{(S)}). The estimator ∆̂ is a complex, non-smooth function of the sample: the act of taking the minimum across the estimated influences ψ̂_n is non-smooth. We do not attempt to prove distributional results for this estimator and then use such results to quantify uncertainty. Instead, we appeal to the bootstrap [57], a general-purpose technique to quantify the sampling uncertainty of estimators.

Our confidence interval construction proceeds in three steps. First, we define the so-called bootstrap distribution of ∆̂. Second, we approximate this distribution with an empirical distribution based on Monte Carlo draws. Finally, we use the range spanned by quantiles of this empirical distribution as our confidence interval for ∆(α).
To define the bootstrap distribution, consider the empirical distribution of the sample
(β^{(1)}, . . . , β^{(S)}):

\[
\frac{1}{S} \sum_{i=1}^{S} \delta_{\beta^{(i)}}(\cdot).
\]

We denote one draw from this empirical distribution by β^{∗(s)}. A bootstrap sample is a set of S draws: (β^{∗(1)}, β^{∗(2)}, . . . , β^{∗(S)}). The bootstrap distribution of ∆̂ is the distribution of ∆̂(β^{∗(1)}, β^{∗(2)}, . . . , β^{∗(S)}), where the randomness is taken over the bootstrap sample but is conditional on the original sample (β^{(1)}, . . . , β^{(S)}).

Clearly, the bootstrap distribution is discrete with finite support. If we chose to, we could enumerate its support and compute its probability mass function, by enumerating all possible values a bootstrap sample can take. However, this is time-consuming. It suffices to approximate the bootstrap distribution with Monte Carlo draws. The draw ∆̂(β^{∗(1)}, β^{∗(2)}, . . . , β^{∗(S)}) is abbreviated ∆̂^∗: we generate a total of B such draws. As B increases, the empirical distribution of (∆̂^∗_1, ∆̂^∗_2, . . . , ∆̂^∗_B) becomes a better approximation of the bootstrap distribution. However, the computational cost scales with B. In practice, B in the hundreds is commonplace: our numerical work uses B = 200.
We now define confidence intervals for ∆(α). Each interval is parametrized by η, the nominal coverage level, which is valued in (0, 1). We compute two quantiles of the empirical distribution over (∆̂^∗_1, ∆̂^∗_2, . . . , ∆̂^∗_B), the (1 − η)/2 and (1 + η)/2 quantiles³, and define the interval spanned by these two values as our confidence interval. By default, we set η = 0.95.
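The percentile construction can be sketched as follows, with `delta_hat_fn` standing in for the SoSIE estimator applied to a set of draws (this is the vanilla, i.i.d. resampling version; the block variant of section 3.3.3 swaps only the resampling step):

```python
import numpy as np

def bootstrap_ci(delta_hat_fn, draws, B=200, eta=0.95, rng=None):
    """Percentile bootstrap confidence interval for the estimate.

    delta_hat_fn: maps an (S, ...) array of MCMC draws to the scalar estimate.
    draws: the original MCMC sample; the first axis indexes the S draws.
    """
    rng = np.random.default_rng() if rng is None else rng
    S = draws.shape[0]
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, S, size=S)   # resample S draws with replacement
        stats[b] = delta_hat_fn(draws[idx])
    lo, hi = np.quantile(stats, [(1 - eta) / 2, (1 + eta) / 2])
    return lo, hi
```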
One limitation of our current work is that we do not make theoretical claims regarding
the actual coverage of such confidence intervals. Although bootstrap confidence intervals can
always be computed, whether the actual coverage matches the nominal coverage η depends
on structural properties of the estimator and regularity conditions on the sample. To verify
the quality of these confidence intervals, we turn to numerical simulation. We leave to future
work the task of formulating reasonable assumptions and theoretically analyzing the actual
coverage.
definition of ∆̂: for instance, it is very common for (β^{(1)}_p, β^{(2)}_p, . . . , β^{(S)}_p) to exhibit positive autocorrelation in practice. Therefore, we have reason to be pessimistic about the ability of bootstrap confidence intervals to adequately cover ∆(α).
Fundamentally, the bootstrap fails in the non-i.i.d. case because the draws that form the bootstrap sample do not have any dependence, while the draws that form the original sample do. To improve upon the bootstrap, one option is to resample in a way that respects the original sample's dependence structure. We recognize that the sample in question, (β^{(1)}, . . . , β^{(S)}), is a (multivariate) time series: we focus on methods that perform well under time series dependence. One such scheme is the non-overlapping block bootstrap [38, 118].⁴ The sample (β^{(1)}, . . . , β^{(S)}) is divided into a number of blocks: each block is a vector of contiguous draws. Let L be the number of elements in a block, and let M := ⌊S/L⌋ denote the number of blocks. The m-th block is defined as

\[
B_m := \bigl( \beta^{((m-1)L+1)}, . . . , \beta^{(mL)} \bigr).
\]

To generate one sample from the non-overlapping block bootstrap distribution, we first draw M blocks with replacement from the set of blocks: B^∗_1, . . . , B^∗_M. Then, we write the elements of these drawn blocks in a contiguous series. For example, when (β^{(1)}, . . . , β^{(S)}) = (β^{(1)}, β^{(2)}, β^{(3)}, β^{(4)}) and L = 2, the two blocks are (β^{(1)}, β^{(2)}) and (β^{(3)}, β^{(4)}). The set of possible samples from resampling includes (β^{(1)}, β^{(2)}, β^{(1)}, β^{(2)}) and (β^{(3)}, β^{(4)}, β^{(3)}, β^{(4)}), but not (β^{(1)}, β^{(3)}, β^{(1)}, β^{(3)}).
The name "non-overlapping block bootstrap" comes from the fact that these blocks, viewed as sets, are disjoint from each other. While the name is needed in Lahiri [118] to distinguish this from other blocking rules, moving forward, as we only consider the above blocking rule, we will refer to the procedure simply as the block bootstrap. Intuitively, the block bootstrap sample
is a good approximation of the original sample if the latter has short-term dependence: in
such a case, the original sample itself can be thought of as the concatenation of smaller, i.i.d.
subsamples, and the generation of a block bootstrap sample mimics that. In well-behaved
probabilistic models with well-tuned algorithms, the MCMC draws can be expected to only
have short-term dependence, and the block bootstrap is a good choice.
The block bootstrap has one hyperparameter: the block length L. We would like both L
and M to be large: large L captures time series dependence at larger lags, and large M is
close to having many i.i.d. subsamples. However, since their product is constrained to be S,
the choice of L is a trade-off. In numerical studies, we set L = 10.
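The block resampling scheme above can be sketched as follows (a minimal illustration; names are ours):

```python
import numpy as np

def block_bootstrap_sample(draws, L, rng):
    """One non-overlapping block bootstrap resample of an MCMC chain.

    draws: (S, ...) array of draws; L: block length.
    Only the M = floor(S / L) complete blocks are used.
    """
    S = draws.shape[0]
    M = S // L
    blocks = rng.integers(0, M, size=M)   # draw M block indices with replacement
    idx = np.concatenate([np.arange(m * L, (m + 1) * L) for m in blocks])
    return draws[idx]
```

Because whole contiguous blocks are copied, short-range dependence within each block is preserved in the resample.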
Our construction of confidence intervals for general MCMC proceeds identically to the
previous section’s construction, except for the step of generating the bootstrap sample: instead
of drawing from the vanilla bootstrap, we draw from the block bootstrap. We will denote the
endpoints of such an interval by ∆lb (α) (lower endpoint) and ∆ub (α) (upper endpoint).
Similar to the previous section, we do not make theoretical claims on the actual coverage
of our block bootstrap confidence intervals: we verify the quality of the intervals through
later numerical studies.
4. The original paper, Carlstein [38], did not use the term “non-overlapping block bootstrap” to describe the technique. The name comes from Lahiri [118].
3.3.4 Putting everything together
Now, we chain together the intermediate approximations from the previous sections to form
our final estimate of eq. (3.1). We then explain how to use it to determine non-robustness.
Our final estimate of the Maximum Influence Perturbation is the interval [∆lb(α), ∆ub(α)] constructed in section 3.3.3. This approximation is the result of combining section 3.3.3, where [∆lb(α), ∆ub(α)] approximates ∆(α), with section 3.3.1, where ∆(α) approximates the Maximum Influence Perturbation. Our final estimate of the Most Influential Set is Û, which is an output of algorithm 5. This approximation is the result of combining section 3.3.2, where Û approximates U(α), with section 3.3.1, where U(α) approximates the Most Influential Set.
To determine non-robustness, we use [∆lb (α), ∆ub (α)] as follows. Recall that we have
assumed for simplicity that the decision threshold is zero, and that ϕ(1N ) < 0. We believe that
the interval [ϕ(1N ) + ∆lb (α), ϕ(1N ) + ∆ub (α)] contains the quantity of interest after removing
the most extreme observations. Therefore, our assessment of non-robustness depends on the
relationship between this interval and the threshold zero in the following way:
• ϕ(1N ) + ∆lb (α) > 0. Hence, [ϕ(1N ) + ∆lb (α), ϕ(1N ) + ∆ub (α)] is entirely on the opposite
side of 0 compared to ϕ(1N ). We declare the analysis to be non-robust.
• ϕ(1N ) + ∆ub (α) < 0. Hence, [ϕ(1N ) + ∆lb (α), ϕ(1N ) + ∆ub (α)] is entirely on the same
side of 0 compared to ϕ(1N ). We do not declare non-robustness.
• ϕ(1N ) + ∆lb (α) ≤ 0 ≤ ϕ(1N ) + ∆ub (α). The interval contains 0, and we abstain from
making an assessment about non-robustness. We recommend practitioners run more
MCMC draws to reduce the width of the confidence interval.
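The three-way decision rule above is simple enough to state as code. This is a sketch; the function name and the string labels are our own choices.

```python
def assess_non_robustness(phi_full, delta_lb, delta_ub):
    """Three-way decision rule, assuming the threshold is 0 and phi(1_N) < 0.

    phi_full: the quantity of interest on the full data, phi(1_N);
    delta_lb, delta_ub: endpoints of the confidence interval for the change.
    """
    lo = phi_full + delta_lb
    hi = phi_full + delta_ub
    if lo > 0:
        # Interval entirely on the opposite side of 0 from phi(1_N).
        return "non-robust"
    if hi < 0:
        # Interval entirely on the same side of 0 as phi(1_N).
        return "no non-robustness declared"
    # Interval contains 0: abstain, and recommend more MCMC draws.
    return "abstain"
```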
While [∆lb(α), ∆ub(α)] plays the main role in determining non-robustness, Û plays a supporting role. For problems in which drawing MCMC a second time is not prohibitively expensive, we can refit the analysis without the data points in Û. Doing the refit is one way of verifying the quality of our assessment of non-robustness: if [∆lb(α), ∆ub(α)] declares that the conclusion is non-robust, and the conclusion truly changes after removing Û and refitting, then we conclusively know that our assessment is correct.
walk through what a practitioner would do in practice (although they would choose only one
α and one decision). Our method proposes an influential data subset and a change in the
quantity of interest, represented by a confidence interval.
Ideally, we want to check if our interval includes the result of the worst-case data to leave
out. We are unable to do so, since we do not know how to compute the worst-case result in a
reasonable amount of time. We settle for the following checks.
In the first check, for a particular MCMC run, we plot how the change from re-running MCMC without the proposed data compares to the confidence interval. We recommend the user run this check if re-running MCMC a second time is not too computationally expensive.
Unfortunately, such refitting does not paint a complete picture of approximation quality.
For instance, the MCMC run might be unlucky since MCMC is random. To be more
comprehensive, we run additional checks. We do not expect users to run these tests, as
their computational costs are high. The central question is how frequently (under MCMC
randomness) the confidence interval includes the result after removing the worst-case data.
Since we estimate the worst-case change with a linear approximation, a natural way to answer this question is with two separate checks: section 3.4.1 checks how frequently the confidence interval includes the result of the linear approximation, i.e., the AMIP, while section 3.4.3 checks whether the linear approximation is good. To understand why we observe the coverage in section 3.4.1, in section 3.4.2 we isolate the impact of the sorting step in the construction of our confidence interval.
Σ_{n∈I} ψ_n. On each sample (β^(1), . . . , β^(S)), our point estimate is Σ_{n∈I} ψ̂_n: this estimate does not involve any sorting, while ∆̂ does. We construct the confidence interval, [V_lb, V_ub], from the block bootstrap distribution of Σ_{n∈I} ψ̂_n. The difference between [V_lb, V_ub] and [∆lb(α), ∆ub(α)], which is constructed from the block bootstrap distribution of ∆̂, is that the former is not based on sorting the influence estimates. If the actual coverage of [V_lb, V_ub] is close to the nominal value, we have evidence that the miscoverage of [∆lb(α), ∆ub(α)] is due to this sorting.
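A minimal sketch of how the interval [V_lb, V_ub] can be computed: percentile endpoints of the block bootstrap distribution of the summed influence estimates. The function name, the percentile rule, and the default values below are our assumptions, not details fixed by the text.

```python
import numpy as np

def sum_influence_ci(g_draws, loglik_draws, idx, L=10, B=1000, eta=0.95, seed=0):
    """Block bootstrap confidence interval for the sum of influences over idx.

    g_draws: shape (S,) draws of g(beta); loglik_draws: shape (S, N) with
    loglik_draws[s, n] = L(d^(n) | beta^(s)); idx: list of observation indices.
    """
    rng = np.random.default_rng(seed)
    S = len(g_draws)
    M = S // L
    stats = []
    for _ in range(B):
        # Resample M non-overlapping blocks of length L, with replacement.
        starts = rng.integers(0, M, size=M) * L
        rows = np.concatenate([np.arange(s, s + L) for s in starts])
        g = g_draws[rows]
        Lmat = loglik_draws[np.ix_(rows, idx)]
        # Influence estimate for each n in idx: a sample covariance.
        psi_hat = (g - g.mean()) @ (Lmat - Lmat.mean(axis=0)) / len(g)
        stats.append(psi_hat.sum())
    lo, hi = np.quantile(stats, [(1 - eta) / 2, (1 + eta) / 2])
    return lo, hi
```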
From section 3.4.1, we use ψ^∗_n and the associated ∆^∗(α) and U^∗(α) as replacements for ground truths. We set I to be U^∗(α). We run another set of J Markov chains: for each chain, we construct the confidence interval [V_lb, V_ub] by sampling from the block bootstrap distribution of the estimator Σ_{n∈I} ψ̂_n. We report the sample mean of the indicators I{Σ_{n∈I} ψ^∗_n ∈ [V_lb, V_ub]} as our point estimate of the coverage. We also report a 95% confidence interval for the coverage. This interval is computed using the binomial test of Clopper and Pearson [44], implemented as R's binom.test() function.
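The coverage summary can be reproduced with the standard Clopper–Pearson construction. The sketch below uses SciPy's beta quantile function in place of R's binom.test; the function name is ours.

```python
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion.

    k: number of chains whose interval [V_lb, V_ub] contained the ground
    truth; n: total number of chains J.
    """
    alpha = 1.0 - conf
    # Standard beta-quantile form of the exact binomial interval.
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lo, hi
```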
3.5 Experiments
In our experiments, we find that our approximation works well for a simple linear model. But
we find that it can struggle in hierarchical models with more complex structure.
5. This correspondence is not exact, since for ζ < 1, all observations in U^∗(0.05) are included in the analysis, only with downplayed contributions.
3.5.1 Linear model
We consider a slight variation of a microcredit analysis from Meager [139]. In Meager [139],
conclusions regarding microcredit efficacy were based on ordinary least squares (OLS). We
refer the reader to Broderick et al. [31, Section 4.3.2] for investigations of such conclusions’
non-robustness. Here, we instead consider an analogous Bayesian analysis using MCMC, and
we examine the robustness of conclusions from this analysis.
Our quality checks suggest that our approximation is accurate. Our confidence interval
contains the refit after removing the proposed data. The actual coverage of the confidence
interval for AMIP is close to the nominal coverage. The actual coverage of the confidence
interval for sum-of-influence is also close to the nominal coverage. Even for dropping 5% of
the data, the linear approximation is still adequate.
Figure 3.5.1: (Linear model) Histogram of treatment effect MCMC draws. The blue line indicates the sample mean. The dashed red line is the zero threshold. The dotted blue lines indicate estimates of the approximate credible interval's endpoints.
Figure 3.5.2: (Linear model) Confidence interval and refit. At maximum, we remove 1% of the data. Each panel corresponds to a target conclusion change: ‘sign’ is the change in sign, ‘sig’ is the change in significance, and ‘both’ is the change in both sign and significance. Error bars are confidence intervals for the refit after removing the most extreme data subset. Each ‘x’ is the refit after removing the proposed data and re-running MCMC. The dotted blue line is the fit on the full data.
In fig. 3.5.2, we plot our confidence intervals and the result after removing the proposed
data. Although the confidence intervals are wide, they are still useful. Across quantities of
interest and removal fractions, our intervals contain the refit after removing the proposed
data. For changing sign, our method predicts there exists a data subset of relative size at
most 0.1% such that if we remove it, we change the posterior mean’s sign. Refitting after
removing the proposed data confirms this prediction. For changing significance, our method
predicts there exists a data subset of relative size at most 0.36% such that if we remove it, we
change the sign of the approximate credible interval’s right endpoint: refitting confirms this
prediction. Our method is not able to predict whether the result can be changed to a significant effect of the opposite sign for these α values and this number of samples: we recommend a larger number of MCMC samples.
Figure 3.5.3: (Linear model) Monte Carlo estimate of AMIP confidence interval’s coverage.
Each panel corresponds to a target conclusion change. The dashed line is the nominal level
η = 0.95. The solid line is the sample mean of the indicator variable for the event that ground
truth is contained in the confidence interval. The error bars are confidence intervals for the
population mean of these indicators.
Figure 3.5.4: (Linear model) Monte Carlo estimate of sum-of-influence confidence interval’s
coverage. Each panel corresponds to a target conclusion change. The dashed line is the
nominal level η = 0.95. The solid line is the sample mean of the indicator variable for the
event that ground truth is contained in the confidence interval, and error bars are confidence
intervals for the population mean of these indicators.
Figure 3.5.5: (Linear model) Quality of the linear approximation. Each panel corresponds to
a target conclusion change. The solid blue line is the full-data fit. The horizontal axis is the
distance from the weight that represents the full data. We plot both the refit from rerunning
MCMC and the linear approximation of the refit.
We focus only on how microcredit impacts the households with negative realizations of profit.
Meager [140]’s model is such that to study this impact, it suffices to a) filter out observations
with non-negative profit from the aggregated data and b) use only a model component rather
than the entire model.
The dataset on households with negative profits has 3,493 observations. The relevant
model component from Meager [140] is the following. They model all households in a given
country as exchangeable, and “share strength” across countries. The absolute value of the
profit is modeled as coming from a log-normal distribution. If the household is in country k, this distribution has mean µ^(country)_k + τ^(country)_k x^(n) and variance exp(ξ^(country)_k + θ^(country)_k x^(n)), where (µ^(country)_k, τ^(country)_k, ξ^(country)_k, θ^(country)_k) are latent parameters to be learned. In other
words, the access to microcredit has country-specific impacts on the location and scale of the
log of absolute profit. To borrow strength, the above country-specific parameters are modeled
as coming from a common distribution. For instance, there exists a global parameter, τ, such that the τ^(country)_k's are a priori independent Gaussians centered at τ. For the complete specification of the model, i.e., the list of all global parameters and the prior choices, see section 3.C.
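To make the model component concrete, the per-household log-likelihood can be sketched as follows. This is our own illustrative code; the parameter arrays and the function name are assumptions, and we drop the (country) superscript.

```python
import numpy as np

def loglik_household(abs_profit, x, k, mu, tau, xi, theta):
    """Log-density of one household's |profit| under the country-k component.

    abs_profit > 0 is the household's absolute profit; x is the microcredit
    covariate; mu, tau, xi, theta are arrays of country-level parameters.
    """
    loc = mu[k] + tau[k] * x              # mean of log |profit|
    var = np.exp(xi[k] + theta[k] * x)    # variance of log |profit|
    z = np.log(abs_profit)
    # Log-normal density: normal log-density of log(y), minus the log(y) Jacobian.
    return -0.5 * np.log(2 * np.pi * var) - (z - loc) ** 2 / (2 * var) - z
```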
Roughly speaking, τ is an average treatment effect across countries. We use S = 4000
HMC draws to approximate the posterior. Figure 3.5.6 plots the histogram of the treatment
effect draws and sample summaries. The sample mean is equal to 0.09. The sample standard
deviation is 0.09. These values are in agreement with the mean and standard deviation
estimates obtained from fitting on the original model and data [140]. Our estimate of the
approximate credible interval’s left endpoint is −0.09; our estimate of the right endpoint is
0.27.
Based on the summaries in fig. 3.5.6, an analyst might come to a decision based on either (1) the observation that the posterior mean is positive, or (2) the observation that the uncertainty interval covers zero, and therefore they cannot be confident of the sign of the unknown parameter.
Figure 3.5.6: (Hierarchical model for microcredit) Histogram of treatment effect MCMC
draws. See the caption of fig. 3.5.1 for the meaning of the distinguished vertical lines.
Figure 3.5.7: (Hierarchical model for microcredit) Confidence interval and refit. See the
caption of fig. 3.5.2 for meaning of annotated lines.
Figure 3.5.8: (Hierarchical model for microcredit) Monte Carlo estimate of AMIP confidence
interval’s coverage. See the caption of fig. 3.5.3 for the meaning of the error bars and the
distinguished lines.
lower endpoint of our confidence interval for the true coverage, the worst relative error is
9.1%.
Figure 3.5.9 shows that the confidence interval for sum-of-influence has the right coverage for sign change, but undercovers for significance change and for generating a significant result of the opposite sign. At worst, in the case of ‘sig’, the relative error between the nominal η and our estimate of the true coverage is 14.7%.
Intuitively, the block bootstrap underestimates uncertainty if the block length is not large
enough to overcome the time series dependence in the MCMC samples. The miscoverage
suggests that the default block length, L = 10, is too small for this problem. One potential
reason for the difference in coverage between ‘sign’ and ‘sig’ is that the estimate of influence for ‘sign’ involves fewer objects than that for ‘sig’. While an estimate of influence for ‘sign’ involves g(β) and L(d^(n) | β), an estimate of influence for ‘sig’ involves g(β), L(d^(n) | β), and g(β)². It is possible that the default block length is enough to capture time-series dependence for g(β) and L(d^(n) | β), but is inadequate for g(β)².
Figure 3.5.10 provides evidence that the linear approximation is adequate for ζ less
than 0.3728 for ‘both’ QoI and ‘sig’, but is grossly wrong for larger ζ. Using the rough
Figure 3.5.9: (Hierarchical model for microcredit) Monte Carlo estimate of sum-of-influence
confidence interval’s coverage. See the caption of fig. 3.5.4 for the meaning of the panels and
the distinguished lines.
Figure 3.5.10: (Hierarchical model for microcredit) Quality of linear approximation. See the
caption for fig. 3.5.5 for the meaning of the panels and the distinguished lines.
correspondence between ζ and amount of data dropped, we say that the linear approximation
is adequate until dropping 1.8% of the data. For ‘both’ QoI, the refit plateaus after dropping
1.8%, while the linear approximation continues to decrease. For ‘sig’, the refit decreases after
dropping 1.8%, while the linear approximation continues to increase. The approximation is
good for ‘sign’ even after removing 5% of the data: the refit and the prediction lie on top of
each other for ‘sign’.
run quality checks on the whole dataset. We settle for running quality checks on a subsample
of the data. On the subsampled data, the confidence interval for AMIP undercovers: the
undercoverage is severe for one of the quantities of interest. However, the confidence interval
for sum-of-influence is close to achieving the nominal coverage. For all three quantities of
interest, the linear approximation is good up to removing roughly 1.1% of the data. For two
of the three, it breaks down afterwards: for the remaining one, it continues to be good up to
3%, then falters.
Once again, we think that dropping more than 1% of the data is already removing a large fraction. We are not worried about the Maximum Influence Perturbation for such α. So, the fact that the linear approximation stops working after 1.1% is not a cause for concern.
Figure 3.5.11: (Hierarchical model for tree mortality) Histogram of slope MCMC draws. See
the caption of fig. 3.5.1 for the meaning of the distinguished vertical lines.
credible interval’s left endpoint is −2.81; our estimate of the right endpoint is −0.94.
In our parametrization, if θ were estimated to be negative, it would indicate that the
availability of water is negatively associated with tree death. In other words, drought is
positively associated with tree death. Based on the sample summaries, a forest ecologist might
decide that drought has a positive relationship with canopy mortality, since the posterior
mean is negative, and this relationship is significant, since the approximate credible interval
does not contain zero.
Figure 3.5.12: (Hierarchical model for tree mortality) Confidence interval and refit. See the
caption of fig. 3.5.2 for the meaning of the panels and the distinguished lines.
Figure 3.5.13: (Hierarchical model on subsampled tree mortality) Histogram of effect MCMC
draws. See fig. 3.5.1 for the meaning of the distinguished lines.
do so. Instead, we subsample 2,000 observations at random from the original dataset. Each
MCMC on this subsample takes only 15 minutes, making it possible to run quality checks in
a few hours instead of weeks. We hope that the subsampled data is representative enough of
the original data that the quality checks on the subsampled data are indicative of the quality
checks on the original data.
We use the same probabilistic model to analyze the subsampled data. Figure 3.5.13 plots
the histogram of the association effect draws and sample summaries. Based on the draws,
a forest ecologist might tentatively say that drought is positively associated with canopy
mortality if they relied on the posterior mean, but refrain from conclusively deciding, since
the approximate credible interval contains zero.
Figure 3.5.14 shows our confidence intervals and the actual refits. Similar to fig. 3.5.12,
our confidence intervals predict a more extreme change than realized by the refit. The
overestimation is most severe for ‘both’ QoI.
In fig. 3.5.15, the confidence interval for AMIP undercovers for all quantities of interest.
The actual coverage decreases as α increases. The undercoverage is most severe for ‘sig’ QoI:
while the nominal level is 0.95, the confidence interval for the true coverage only contains
values less than 0.15. This translates to a relative error of over 84%. In other words, our
confidence interval for significance change is too narrow, and rarely contains the AMIP. For
‘both’ QoI and ‘sig’ QoI, the worst-case relative error between the nominal and the estimated
Figure 3.5.14: (Hierarchical model on subsampled tree mortality) Confidence interval and
refit. See the caption of fig. 3.5.2 for the meaning of the panels and the distinguished lines.
Figure 3.5.15: (Hierarchical model on subsampled tree mortality) Monte Carlo estimate of
coverage of confidence interval for ∆(α). See fig. 3.5.3 for the meaning of the panels and the
distinguished lines.
Figure 3.5.16: (Hierarchical model on subsampled tree mortality) Monte Carlo estimate of
coverage of confidence interval for sum-of-influence. See fig. 3.5.4 for the meaning of the
panels and the distinguished lines.
Figure 3.5.17: (Hierarchical model on subsampled tree mortality) Quality of linear approxi-
mation. See fig. 3.5.5 for the meaning of the panels and the distinguished lines.
3.6 Discussion
We have provided a fast approximation to what happens to conclusions made with MCMC in
Bayesian models when a small percentage of data is removed. In real data experiments, our
approximation is accurate in simple models, such as linear regression. In complicated models,
such as hierarchical ones with many random effects, our methods are less accurate. A number
of open questions remain. We suspect that choosing the block length more carefully may
improve performance: how to pick the block length in a data-driven way is an interesting
question for future work. Currently, we can assess sensitivity for quantities of interest based
on posterior expectations and posterior standard deviations. For analysts that use posterior
quantiles to make decisions, we are not able to assess sensitivity. To extend our work to
quantiles, one would need to quantify how much a quantile changes under small perturbations
of the total log likelihood. Finally, we have not fully isolated the source of difficulty in
complex models like those in Senf et al. [183]. In the analysis of the tree mortality data, there are a number of confounding factors.
• The model has a large number of parameters.
• The parameters are organized hierarchically.
To determine if the difficulty comes from high dimensionality or if the difficulty comes from
hierarchical organization, future work might apply our approximation to a high-dimensional
model without hierarchical structure. For instance, one might use MCMC on a linear
regression with many parameters and non-conjugate priors. To check if MCMC is a cause of
difficulty, one could experiment with variational inference (VI). If we choose to approximate the posterior with VI, we can use the machinery developed for estimating equations [31] to assess small-data sensitivity. If the data-dropping approximation works well there, we have evidence that MCMC is part of the problem in complex models.
Appendix
3.A Theory
In this section, we theoretically quantify the approximation errors incurred by our methodology.
Namely, section 3.A.1 analyzes the error made by the first-order approximation, while
section 3.A.2 analyzes the error made by using MCMC to estimate influences.
3.A.1.1 Normal model.
We detail the prior and likelihood of the normal model and the associated quantity of interest. The parameter of interest is the population mean µ. The likelihood of an observation is Gaussian with a known standard deviation σ. In other words, the n-th log-likelihood evaluated at µ is L(d^(n) | µ) = (1/2) log(1/(2πσ²)) − (1/(2σ²))[(x^(n))² − 2x^(n)µ + µ²]. We choose the uniform distribution over the real line as the prior for µ. The quantity of interest is the posterior mean of µ.
In this model, expectations under the weighted posterior have closed forms. We can derive an explicit expression for the error. To display the error, it is convenient to define the sample average of observations as a function of I: for any I ⊂ {1, 2, . . . , N}, let x̄_I := (1/|I|) Σ_{n∈I} x^(n). The sample average of the whole dataset will be denoted by x̄.
While N_g(w) sums up the weights of observations in group g, M_g(w) is the weighted average of observations in this group, and Λ_g(w) will be used to weigh M_g(w) in forming the posterior mean of µ. Section 3.B.2 shows that E_w µ is equal to

(Σ_{g=1}^G Λ_g(w) M_g(w)) / (Σ_{g=1}^G Λ_g(w)).
To avoid writing Σ_{g=1}^G Λ_g(w), we define Λ(w) := Σ_{g=1}^G Λ_g(w). To lighten notation, for expectations under the original posterior, we write µ^∗ instead of E_{1_N} µ and N^∗_g instead of N_g(1_N). The same shorthand applies to M_g(1_N), Λ_g(1_N) and Λ(1_N). In words, µ^∗ is the posterior mean of µ under the full-data posterior, N^∗_g is the number of observations in group g of the original dataset, and so on. We also utilize the x̄_I and x̄ notations defined in the normal model section.
The error in the normal means model is given in the following lemma.
Lemma 3.A.2. In the normal means model, let the index set I be such that there exists
k ∈ {1, 2, . . . , G} such that g (n) = k for all n ∈ I. Define
We prove Lemma 3.A.2 in section 3.B.2. The constraint that all observations in I belong to the same group k is made out of convenience: we can derive the error without this constraint, but the formula would be much more complicated.
A corollary of Lemma 3.A.2 is that the absolute value of the error behaves like |I|²/(G|N^∗_k|²).
Corollary 3.A.1. In the normal means model, for all groups g, assume that N^∗_g ≥ σ²/τ². Let the index set I be such that there exists k ∈ {1, 2, . . . , G} such that g^(n) = k for all n ∈ I. For this k, assume that N^∗_k − |I| ≥ σ²/τ². Then,

|Err(I)| ≤ C(∥x∥_∞, σ, τ) (1/G) (|I|²/|N^∗_k|²).
We prove Corollary 3.A.1 in section 3.B.2. In addition to the assumptions of Lemma 3.A.2, the corollary assumes that the number of observations in each group is not too small, and that after removing I, group k still has enough observations. This condition allows us to approximate Λ^∗_k and Λ_g(q^{−1}(I)) with a constant. The factor ∥x∥_∞ in the bound comes from upper bounding |M^∗_g − M_k(q^{−1}(I))| by 2 max_{n=1,...,N} |x^(n)|.
For two reasons, we conjecture that similar qualitative differences also appear in the comparison between more complicated hierarchical and non-hierarchical models. The fundamental task of estimating the population mean is embedded in many other statistical tasks, such as regression. In addition, the group structure imposed by the normal means model is also found in practically relevant hierarchical models.
3.A.2 Estimator properties
Recall from section 3.3.3 that one concern regarding the quality of ∆̂ is the (β^(1), . . . , β^(S))-induced sampling uncertainty. Theoretically analyzing this uncertainty is difficult, with one obstacle being that ∆̂ is a non-smooth function of (β^(1), . . . , β^(S)). In this section, we settle for the easier goal of analyzing the sampling uncertainty of the influence estimates ψ̂_n. We expect such theoretical characterizations to play a role in the eventual theoretical characterization of ∆̂, but we leave this step to future work.
In this analysis, we make more restrictive assumptions than those needed for Theorem 3.3.1 to hold. We assume that the sample (β^(1), . . . , β^(S)) comes from exact sampling: the independence across draws makes it easier to analyze sampling uncertainty. We focus on the quantity of interest equaling the posterior mean (c1 = 1, c2 = 0 in the sense of Assumption 3.3.1): the scaling c1 = 1 for the posterior mean is made out of convenience, and a similar analysis can be conducted when c2 ≠ 0, but we omit it for brevity. Finally, we need more stringent moment conditions than Assumption 3.3.2.
Assumption 3.A.1. The functions |g(β)2 L(d(i) | β)L(d(j) | β)| (across i, j) have finite
expectation under the full-data posterior.
This moment condition guarantees that the sample covariance of g(β) and L(d^(i) | β) has finite variance under the full-data posterior: it plays the same role as finite kurtosis in proofs of sample variance consistency.
With the assumptions in place, we begin by showing that the sampling uncertainty of ψ̂n
goes to zero in the limit of S → ∞.
Lemma 3.A.3. Assume Assumption 3.2.1, Assumption 3.3.1, Assumption 3.3.2, and Assumption 3.A.1 hold. Let ψ̂ be the output of algorithm 4 for c1 = 1, c2 = 0 and (β^(1), . . . , β^(S)) being an i.i.d. sample. Then, there exists a constant C such that for all n, for all S, Var(ψ̂_n) ≤ C/S.
We prove Lemma 3.A.3 in section 3.B.3. That the variance of individual ψ̂n goes to zero
at the rate of 1/S is not surprising: ψ̂n is a sample covariance, after all.
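For the posterior-mean case (c1 = 1, c2 = 0), the influence estimates are exactly sample covariances between g(β) and the per-observation log-likelihoods, which can be sketched as follows. This is our own illustrative code; algorithm 4's bookkeeping is omitted.

```python
import numpy as np

def influence_estimates(g_draws, loglik_draws):
    """Estimate each observation's influence as a sample covariance.

    g_draws: shape (S,) draws of g(beta^(s)); loglik_draws: shape (S, N)
    with loglik_draws[s, n] = L(d^(n) | beta^(s)). Returns psi_hat, shape (N,).
    """
    g_centered = g_draws - g_draws.mean()
    L_centered = loglik_draws - loglik_draws.mean(axis=0)
    # psi_hat[n] = (1/S) * sum_s g_centered[s] * L_centered[s, n]
    return g_centered @ L_centered / len(g_draws)
```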
We use Lemma 3.A.3 to show consistency of different estimators.
Theorem 3.A.1. Assume Assumption 3.2.1, Assumption 3.3.1, Assumption 3.3.2, and Assumption 3.A.1 hold. Let ψ̂ be the output of algorithm 4 for c1 = 1, c2 = 0 and (β^(1), . . . , β^(S)) being an i.i.d. sample. Then max_{n=1,...,N} |ψ̂_n − ψ_n| converges in probability to 0 in the limit S → ∞, and ∆̂ converges in probability to ∆(α) in the limit S → ∞.
We prove Theorem 3.A.1 in section 3.B.3. Our theorem states that the vector ψ̂ is a consistent estimator for the vector ψ, and ∆̂ is a consistent estimator for ∆(α).
Not only is ψ̂ consistent in estimating ψ, it is also asymptotically normal.
Theorem 3.A.2. Assume Assumption 3.2.1, Assumption 3.3.1, Assumption 3.3.2, and Assumption 3.A.1 hold. Let ψ̂ be the output of algorithm 4 for c1 = 1, c2 = 0 and (β^(1), . . . , β^(S)) being an i.i.d. sample. Then √S(ψ̂ − ψ) converges in distribution to N(0_N, Σ), where Σ is the N × N matrix whose (i, j) entry, Σ_{i,j}, is the covariance between (g(β) − E_{1_N} g(β))(L(d^(i) | β) − E_{1_N} L(d^(i) | β)) and (g(β) − E_{1_N} g(β))(L(d^(j) | β) − E_{1_N} L(d^(j) | β)), taken under the full-data posterior.
3.A.2.1 Normal model with unknown precision.
While √(Σ_{n,n}/S) eventually goes to zero, for finite S, this standard deviation can be large, making ψ̂_n an imprecise estimate of ψ_n. To illustrate this phenomenon, we will derive Σ_{n,n} in the context of a simple probabilistic model: a normal model with unknown precision.
We first introduce the model and the associated quantity of interest. The data is a set of N real values: d^(n) = x^(n), where x^(n) ∈ R. The parameters of interest are the mean µ and the precision τ of the population. The log-likelihood of an observation given µ and τ is Gaussian: (1/2) log(τ/(2π)) − (τ/2)[(x^(n))² − 2x^(n)µ + µ²]. The prior is chosen as follows: µ is distributed uniformly over the real line, and τ is distributed from a gamma distribution. The quantity of interest is the posterior mean of µ.
For this probabilistic model, the assumptions of Theorem 3.A.2 are satisfied. We show
that the variance Σn,n behaves like a quartic function of the observation x(n) .
Lemma 3.A.4. In the normal-gamma model, there exist constants D1, D2, and D3, where D1 > 0, such that for all n, Σ_{n,n} is equal to D1(x^(n) − x̄)⁴ + D2(x^(n) − x̄)² + D3.
We prove Lemma 3.A.4 in section 3.B.3. D1, D2, D3 are based on the posterior expectations: for instance, the proof shows that D1 = E_{1_N}[τ^{−1}(τ − E_{1_N}τ)²]/(4N). It is easy to show that for the normal-gamma model,

Cov_{1_N}(µ, L(d^(n) | µ, τ)) = (x^(n) − x̄)/N.

Hence, while the mean of ψ̂_n behaves like a linear function of x^(n) − x̄, its standard deviation
Hence, while the mean of ψ̂n behaves like a linear function of x(n) − x̄, its standard deviation
behaves like a quadratic function of x(n) − x̄. In other words, the more influence an observation
has, the harder it is to accurately determine its influence!
3.B Proofs
3.B.1 Taylor series proofs
Proof of Theorem 3.3.1. At a high level, we rely on Fleming [62, Chapter 5.12, Theorem 5.9]
to interchange integration and differentiation.
Although the theorem statement does not explicitly mention the normalizer, to show that the quantity of interest is continuously differentiable and to compute its partial derivatives, it is necessary to show that the normalizer is continuously differentiable and to compute its partial derivatives. To do so, we verify the following conditions on the integrand defining Z(w):
1. For any β, the mapping w ↦ p(β) exp(Σ_{n=1}^N w_n L(d^(n) | β)) is continuously differentiable.

2. There exists a Lebesgue integrable function f1 such that for all w ∈ {w ∈ [0, 1]^N : max_n w_n ≥ δ}, p(β) exp(Σ_{n=1}^N w_n L(d^(n) | β)) ≤ f1(β).

3. There exists a Lebesgue integrable function f2 such that for all w ∈ {w ∈ [0, 1]^N : max_n w_n ≥ δ}, the partial derivative of p(β) exp(Σ_{n=1}^N w_n L(d^(n) | β)) with respect to w_n is bounded in absolute value by f2(β).
The first condition is clearly satisfied. To construct f1 that satisfies the second condition, we partition the parameter space R^P into a finite number of disjoint sets. To index these sets, we use a subset of {1, 2, . . . , N}. If the indexing subset were I = {n1, n2, . . . , nM}, the corresponding element of the partition is

B_I := {β ∈ R^P : L(d^(n) | β) ≥ 0 for all n ∈ I, and L(d^(n) | β) < 0 for all n ∉ I}.

This partition allows us to upper bound the integrand with a function that is independent of w. Suppose β ∈ B_I, I ≠ ∅. The maximum of Σ_{n=1}^N w_n L(d^(n) | β) is attained by setting w_n = 1 for all n ∈ I and w_n = 0 for all n ∉ I. Suppose β ∈ B_∅. As L(d^(n) | β) < 0 for all 1 ≤ n ≤ N, and we are constrained by max_n w_n ≥ δ, the maximum of Σ_{n=1}^N w_n L(d^(n) | β) is attained by setting w_n = δ for n = arg max_n L(d^(n) | β) and w_n = 0 for all other n. In short, our envelope function is

f1(β) := p(β) Π_{n∈I} exp(L(d^(n) | β))   if β ∈ B_I, I ≠ ∅;
f1(β) := p(β) max_{n=1,...,N} exp(δ L(d^(n) | β))   if β ∈ B_∅.
The last step is to show f1 is integrable. It suffices to show that the integral of f1 on each B_I is finite. On B_∅, integrating p(β) exp(δ L(d^(n) | β)) over B_∅ is clearly finite: by Assumption 3.2.1, the integral of p(β) exp(δ L(d^(n) | β)) over R^P is finite, and B_∅ is a subset of R^P. As f1(β) is the maximum of a finite number of integrable functions, it is integrable. Similarly, the integral of f1 over B_I where I ≠ ∅ is at most the integral of p(β) Π_{n∈I} exp(L(d^(n) | β)) over R^P, which is finite by Assumption 3.2.1. To construct f2 that satisfies the third condition, we use the same partition of R^P, and the envelope function is f2(β) := |L(d^(n) | β)| f1(β), since the partial derivative of the weighted probability is the product of the n-th log likelihood and the weighted probability itself. The integrability of f2 follows from Assumption 3.3.2's guarantee that the expectation of |L(d^(n) | β)| is finite under different weighted posteriors. In all, we can interchange integration with differentiation, and the partial derivatives are

∂Z(w)/∂w_n = Z(w) × E_w[L(d^(n) | β)].
We move on to prove that $\mathbb{E}_w g(\beta)$ is continuously differentiable and find its partial derivatives. The conditions on $g(\beta)\frac{1}{Z(w)}p(\beta)\exp\left(\sum_{n=1}^{N} w_n L(d^{(n)} \mid \beta)\right)$ that we wish to check are:
1. For any $\beta$, the mapping $w \mapsto g(\beta)\frac{1}{Z(w)}p(\beta)\exp\left(\sum_{n=1}^{N} w_n L(d^{(n)} \mid \beta)\right)$ is continuously differentiable.
2. There exists a Lebesgue integrable function $f_3$ such that, for all $w \in \{w \in [0,1]^N : \max_n w_n \geq \delta\}$, $g(\beta)\frac{1}{Z(w)}p(\beta)\exp\left(\sum_{n=1}^{N} w_n L(d^{(n)} \mid \beta)\right) \leq f_3(\beta)$.
3. There exists a Lebesgue integrable function $f_4$ that similarly dominates, in absolute value, the partial derivatives of this mapping with respect to each $w_n$.
We have already proven that $Z(w)$ is continuously differentiable: hence, there is nothing to do for the first condition. It is straightforward to use Assumption 3.3.2 and check that the second condition is satisfied by the function $f_3(\beta) := \frac{1}{Z(w)}g(\beta)f_1(\beta)$, and the third condition is satisfied by $f_4(\beta) := \frac{1}{Z(w)}g(\beta)L(d^{(n)} \mid \beta)f_1(\beta)$. Hence, we can interchange integration with differentiation. The partial derivative of $\mathbb{E}_w g(\beta)$ is equal to the sum of two integrals. The first part is
$$\int \frac{\partial Z(w)^{-1}}{\partial w_n}\, g(\beta)\, p(\beta) \exp\left(\sum_{m=1}^{N} w_m L(d^{(m)} \mid \beta)\right) d\beta = -\,\mathbb{E}_w\left[L(d^{(n)} \mid \beta)\right] \frac{1}{Z(w)} \int g(\beta)\, p(\beta) \exp\left(\sum_{m=1}^{N} w_m L(d^{(m)} \mid \beta)\right) d\beta.$$
where $C$ is a constant that does not depend on $\mu$. Hence, the distribution of $\mu$ under $w$ is normal with mean $\left(\sum_{m=1}^{N} w_m x^{(m)}\right)/\left(\sum_{m=1}^{N} w_m\right)$ and precision $\left(\sum_{m=1}^{N} w_m\right)/\sigma^2$. The partial derivative of the posterior mean with respect to $w_n$ is
$$\frac{x^{(n)}\sum_{m=1}^{N} w_m - \sum_{m=1}^{N} w_m x^{(m)}}{\left(\sum_{m=1}^{N} w_m\right)^2}.$$
The difference between the actual posterior mean and its approximation is as in the statement of the lemma.
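The closed form above is easy to check numerically: a finite difference of the weighted posterior mean matches the displayed partial derivative. This is a minimal sketch; the data and weights below are arbitrary illustrative values.

```python
# Finite-difference check of the partial derivative of the weighted
# posterior mean (sum_m w_m x^(m)) / (sum_m w_m).
x = [1.0, -0.5, 2.0, 0.3]   # illustrative observations
w = [0.9, 1.0, 0.6, 1.0]    # illustrative weights
n = 2                       # index of the weight we perturb
h = 1e-7                    # finite-difference step

def post_mean(weights):
    return sum(wi * xi for wi, xi in zip(weights, x)) / sum(weights)

# Analytic derivative from the display above.
W = sum(w)
analytic = (x[n] * W - sum(wi * xi for wi, xi in zip(w, x))) / W**2

wp = list(w)
wp[n] += h
numeric = (post_mean(wp) - post_mean(w)) / h

assert abs(numeric - analytic) < 1e-6
```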
Proof of Lemma 3.A.2. Similar to the proof of Lemma 3.A.1, we first find exact formulas for the posterior mean and its Taylor series.
In the normal means model, the total log probability at $w$ is
$$\sum_{g=1}^{G}\left[\frac{1}{2}\log\frac{1}{2\pi\tau^2} - \frac{1}{2\tau^2}(\theta_g - \mu)^2\right] + \sum_{n=1}^{N} w_n\left[\frac{1}{2}\log\frac{1}{2\pi\sigma^2} - \frac{1}{2\sigma^2}\left((x^{(n)})^2 - 2x^{(n)}\theta_{g^{(n)}} + \theta_{g^{(n)}}^2\right)\right].$$
To express the partial derivative of the posterior mean of $\mu$ with respect to $w_n$, it is helpful to define the following "intermediate" value between $\mathbb{E}_w\mu$ and $\mathbb{E}_w\theta_g$:
$$\tilde{\mu}_g(w) := \frac{M_g(w)N_g(w)/\sigma^2 + \mathbb{E}_w\mu/\tau^2}{N_g(w)/\sigma^2 + 1/\tau^2}.$$
In addition, we need the partial derivatives of the functions $N_g$, $\Lambda_g$, and $M_g$:
$$\frac{\partial N_g}{\partial w_n} = \begin{cases} 0 & \text{if } g \neq g^{(n)}, \\ 1 & \text{if } g = g^{(n)}, \end{cases} \qquad \frac{\partial M_g}{\partial w_n} = \begin{cases} 0 & \text{if } g \neq g^{(n)}, \\ \dfrac{x^{(n)} - M_g(w)}{N_g(w)} & \text{if } g = g^{(n)}, \end{cases} \qquad \frac{\partial \Lambda_g}{\partial w_n} = \begin{cases} 0 & \text{if } g \neq g^{(n)}, \\ \dfrac{\sigma^2\,\Lambda_g(w)^2}{N_g(w)^2} & \text{if } g = g^{(n)}. \end{cases}$$
If $n$ is in the $k$-th group, the partial derivative of the posterior mean with respect to $w_n$ is
$$\frac{1}{\Lambda(w)}\,\frac{1}{\sigma^2 + \tau^2 N_k(w)}\,\left(x^{(n)} - \tilde{\mu}_k(w)\right).$$
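These derivative formulas can be verified by finite differences. The sketch below uses $\Lambda_g(w) = 1/(\sigma^2/N_g(w) + \tau^2)$, which is the form consistent with the rest of the proof; the data, weights, and variances are arbitrary illustrative values.

```python
# Finite-difference check of the partial derivatives of N_g, M_g, and
# Lambda_g for a single group, assuming Lambda_g(w) = 1/(sigma^2/N_g + tau^2).
x = [0.5, 1.5, 2.5]          # illustrative observations in group g
w = [0.7, 1.0, 0.4]          # illustrative weights
sigma2, tau2 = 2.0, 0.5
n = 1                        # differentiate with respect to w_n
h = 1e-6

def NML(weights):
    N = sum(weights)
    M = sum(wi * xi for wi, xi in zip(weights, x)) / N
    Lam = 1.0 / (sigma2 / N + tau2)
    return N, M, Lam

N0, M0, L0 = NML(w)
wp = list(w)
wp[n] += h
N1, M1, L1 = NML(wp)

# Analytic derivatives from the display above.
dN = 1.0
dM = (x[n] - M0) / N0
dLam = sigma2 * L0**2 / N0**2

assert abs((N1 - N0) / h - dN) < 1e-6
assert abs((M1 - M0) / h - dM) < 1e-5
assert abs((L1 - L0) / h - dLam) < 1e-5
```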
After removing only observations from the $k$-th group, the actual posterior mean is
$$\frac{\Lambda_k(q^{-1}(I))M_k(q^{-1}(I)) + \sum_{g \neq k}\Lambda_g(1_N)M_g(1_N)}{\Lambda_k(q^{-1}(I)) + \sum_{g \neq k}\Lambda_g(1_N)}.$$
If we write $A_1 := \sum_{g \neq k}\Lambda_g(1_N)M_g(1_N)$, $A_2 := \sum_{g \neq k}\Lambda_g(1_N)$, $B_1 := \Lambda_k(q^{-1}(I))M_k(q^{-1}(I))$, $B_2 := \Lambda_k(q^{-1}(I))$, $C_1 := \Lambda_k(1_N)M_k(1_N)$, and $C_2 := \Lambda_k(1_N)$, then $\mathrm{Err}(I)$ is equal to $(A_1 + B_1)/(A_2 + B_2) - (A_1 + C_1)/(A_2 + C_2)$. The last expression is equal to
$$\frac{A_2(B_1 - C_1) + A_1(C_2 - B_2) + (B_1C_2 - C_1B_2)}{(A_2 + B_2)(A_2 + C_2)}.$$
We analyze the differences C2 − B2 , B1 C2 − C1 B2 , and B1 − C1 separately.
$C_2 - B_2$. This difference is
$$\frac{1}{\sigma^2/N_k(1_N) + \tau^2} - \frac{1}{\sigma^2/N_k(q^{-1}(I)) + \tau^2}.$$
Since we remove $|I|$ observations from group $k$, $N_k(q^{-1}(I)) = N_k(1_N) - |I|$. Hence, the difference $C_2 - B_2$ is
$$\sigma^2\,\Lambda_k(1_N)\,\Lambda_k(q^{-1}(I))\,\frac{|I|}{N_k(1_N)\left(N_k(1_N) - |I|\right)},$$
which is exactly the $E(I)$ mentioned in the lemma statement.
$B_1C_2 - C_1B_2$. The difference is
$$\Lambda_k(1_N)\,\Lambda_k(q^{-1}(I))\left\{M_k(q^{-1}(I)) - M_k(1_N)\right\},$$
and the term in the curly brackets is the sum of
$$\frac{\sum_{n \in I}\left[\tilde{\mu}_k(1_N) - x^{(n)}\right]}{N_k(1_N) - |I|} \qquad \text{and} \qquad -\frac{|I|}{N_k(1_N) - |I|}\,\frac{\sigma^2\,\Lambda_k(1_N)}{N_k(1_N)}\left(\mathbb{E}_{1_N}\mu - M_k(1_N)\right).$$
The sum of the two terms is exactly the $F(I)$ mentioned in the lemma statement. Overall, the difference $B_1C_2 - C_1B_2$ is equal to $\Lambda_k(1_N)\Lambda_k(q^{-1}(I))F(I)$.
$B_1 - C_1$. If we introduce $D := \Lambda_k(1_N)M_k(q^{-1}(I))$, then the difference $B_1 - C_1$ is equal to $(B_1 - D) + (D - C_1)$. The former term is
$$\left[\Lambda_k(q^{-1}(I)) - \Lambda_k(1_N)\right]M_k(q^{-1}(I)) = -M_k(q^{-1}(I))E(I),$$
and the latter term is $\Lambda_k(1_N)\left\{M_k(q^{-1}(I)) - M_k(1_N)\right\}$. We already know that the term in the curly brackets is equal to $F(I)$. Hence $B_1 - C_1$ is equal to $\Lambda_k(1_N)F(I) - M_k(q^{-1}(I))E(I)$.
With the differences $C_2 - B_2$, $B_1C_2 - C_1B_2$, and $B_1 - C_1$, we can now state the final form of $\mathrm{Err}(I)$. The final numerator is
$$\left[\Lambda_k(q^{-1}(I)) + \sum_{g \neq k}\Lambda_g(1_N)\right]\Lambda_k(1_N)F(I) + \left[\sum_{g \neq k}\Lambda_g(1_N)M_g(1_N) - M_k(q^{-1}(I))\sum_{g \neq k}\Lambda_g(1_N)\right]E(I).$$
Dividing this by the denominator $\left[\sum_g \Lambda_g(1_N)\right]\left[\sum_g \Lambda_g(q^{-1}(I))\right]$, we have proven the lemma.
Proof of Corollary 3.A.1. Under the assumption that $N_g^* \geq \sigma^2/\tau^2$, we have that $\Lambda_g(1_N) \in \left[\frac{1}{2\tau^2}, \frac{1}{\tau^2}\right]$. Since $N_k^* - |I| \geq \sigma^2/\tau^2$, it is also true that $\Lambda_k(q^{-1}(I)) \in \left[\frac{1}{2\tau^2}, \frac{1}{\tau^2}\right]$.
Because of Lemma 3.A.2, an upper bound on $|\mathrm{Err}(I)|$ is
$$\frac{\Lambda_k(q^{-1}(I))}{\Lambda^*}\,|F(I)| + \frac{\left|\sum_{g \neq k}\Lambda_g^*\left(M_g^* - M_k(q^{-1}(I))\right)\right|}{\Lambda^*\,\Lambda(q^{-1}(I))}\,|E(I)|.$$
The fraction $\Lambda_k(q^{-1}(I))/\Lambda^*$ is at most $\left(\frac{1}{\tau^2}\right)/\left(G\,\frac{1}{2\tau^2}\right)$, which is equal to $2/G$. The absolute value of the second fraction is at most
$$\frac{G\,(1/\tau^2)\,2\|x\|_\infty}{G^2\,(1/(2\tau^2))} \leq \frac{4\|x\|_\infty}{G}.$$
Finally, the absolute value $|E(I)|$ is at most
$$\|x\|_\infty\left(4(\sigma^2/\tau^2 + 1) + \sigma^2/\tau^4\right).$$
Lemma 3.B.1. Suppose we have $S$ i.i.d. draws $(A^{(s)}, B^{(s)}, C^{(s)})_{s=1}^{S}$. Let $f_1$ be the (biased) sample covariance between the $A$'s and the $B$'s. Let $f_2$ be the (biased) sample covariance between the $A$'s and $C$'s. In other words,
$$f_1 := \frac{1}{S}\sum_{s=1}^{S} A^{(s)}B^{(s)} - \left(\frac{1}{S}\sum_{s=1}^{S} A^{(s)}\right)\left(\frac{1}{S}\sum_{s=1}^{S} B^{(s)}\right),$$
$$f_2 := \frac{1}{S}\sum_{s=1}^{S} A^{(s)}C^{(s)} - \left(\frac{1}{S}\sum_{s=1}^{S} A^{(s)}\right)\left(\frac{1}{S}\sum_{s=1}^{S} C^{(s)}\right).$$
Suppose that the following are finite: $\mathbb{E}[(A - \mathbb{E}[A])^2(B - \mathbb{E}[B])(C - \mathbb{E}[C])]$, $\mathrm{Cov}(B, C)$, $\mathrm{Var}(A)$, $\mathrm{Cov}(A, B)$, $\mathrm{Cov}(A, C)$. Then, the covariance of $f_1$ and $f_2$ is equal to
$$\frac{(S-1)^2}{S^3}\mathbb{E}\left[(A - \mathbb{E}[A])^2(B - \mathbb{E}[B])(C - \mathbb{E}[C])\right] + \frac{S-1}{S^3}\mathrm{Cov}(B, C)\mathrm{Var}(A) - \frac{(S-1)(S-2)}{S^3}\mathrm{Cov}(A, B)\mathrm{Cov}(A, C).$$
Proof of Lemma 3.B.1. It suffices to prove the lemma in the case where $\mathbb{E}[A] = \mathbb{E}[B] = \mathbb{E}[C] = 0$. Otherwise, we can subtract the population mean from each random variable: the values of $f_1$ and $f_2$ would not change (since covariance is invariant to constant additive shifts). In other words, we want to show that the covariance between $f_1$ and $f_2$ is equal to
$$\frac{(S-1)^2}{S^3}\mathbb{E}[A^2BC] + \frac{S-1}{S^3}\mathbb{E}[BC]\mathbb{E}[A^2] - \frac{(S-1)(S-2)}{S^3}\mathbb{E}[AB]\mathbb{E}[AC]. \tag{3.7}$$
Since $f_1$ is the biased sample covariance, $\mathbb{E}f_1 = \frac{S-1}{S}\mathbb{E}[AB]$. Similarly, $\mathbb{E}f_2 = \frac{S-1}{S}\mathbb{E}[AC]$. To compute $\mathrm{Cov}(f_1, f_2)$, we only need an expression for $\mathbb{E}[f_1f_2]$. The product $f_1f_2$ is equal to the sum of $D_1$, $D_2$, $D_3$, $D_4$ where:
$$D_1 := -\left(\frac{1}{S}\sum_s A^{(s)}B^{(s)}\right)\left(\frac{1}{S}\sum_s A^{(s)}\right)\left(\frac{1}{S}\sum_s C^{(s)}\right),$$
$$D_2 := \left(\frac{1}{S}\sum_s A^{(s)}\right)^2\left(\frac{1}{S}\sum_s B^{(s)}\right)\left(\frac{1}{S}\sum_s C^{(s)}\right),$$
$$D_3 := -\left(\frac{1}{S}\sum_s A^{(s)}C^{(s)}\right)\left(\frac{1}{S}\sum_s A^{(s)}\right)\left(\frac{1}{S}\sum_s B^{(s)}\right),$$
$$D_4 := \left(\frac{1}{S}\sum_s A^{(s)}B^{(s)}\right)\left(\frac{1}{S}\sum_s A^{(s)}C^{(s)}\right).$$
$D_1$. The expectation $\mathbb{E}[A^{(k)}B^{(k)}A^{(i)}C^{(j)}]$ depends on the triplet $(i, j, k)$ in the following way:
$$\mathbb{E}[A^{(k)}B^{(k)}A^{(i)}C^{(j)}] = \begin{cases} 0 & \text{if } i = k,\ j \neq k, \\ \mathbb{E}[A^2BC] & \text{if } i = k,\ j = k, \\ 0 & \text{if } i \neq k,\ j = k, \\ \mathbb{E}[AB]\mathbb{E}[AC] & \text{if } i \neq k,\ j \neq k,\ i = j, \\ 0 & \text{if } i \neq k,\ j \neq k,\ i \neq j. \end{cases}$$
We have used the independence of $(A^{(s)}, B^{(s)}, C^{(s)})_{s=1}^{S}$ to factorize the expectation $\mathbb{E}[A^{(k)}B^{(k)}A^{(i)}C^{(j)}]$. For certain triplets, the factorization reveals that the expectation is zero. By accounting for all triplets, the expectation of $D_1$ is
$$-\frac{1}{S^3}\left(S\,\mathbb{E}[A^2BC] + S(S-1)\,\mathbb{E}[AB]\mathbb{E}[AC]\right).$$
$D_2$. By expanding $D_2$, we know that $\mathbb{E}D_2 = \frac{1}{S^4}\sum_{i,j,p,q}\mathbb{E}[A^{(i)}A^{(j)}B^{(p)}C^{(q)}]$. We can do a similar case-by-case analysis of how $\mathbb{E}[A^{(i)}A^{(j)}B^{(p)}C^{(q)}]$ depends on the quartet $(i, j, p, q)$. In the end, the expectation of $D_2$ is
$$\frac{1}{S^3}\left(\mathbb{E}[A^2BC] + (S-1)\,\mathbb{E}[A^2]\mathbb{E}[BC] + 2(S-1)\,\mathbb{E}[AB]\mathbb{E}[AC]\right).$$
$D_3$. By symmetry between $D_1$ and $D_3$, the expectation of $D_3$ is also
$$-\frac{1}{S^3}\left(S\,\mathbb{E}[A^2BC] + S(S-1)\,\mathbb{E}[AB]\mathbb{E}[AC]\right).$$
$D_4$. By expanding $D_4$, we know that $\mathbb{E}D_4 = \frac{1}{S^2}\sum_{i,j}\mathbb{E}[A^{(i)}B^{(i)}A^{(j)}C^{(j)}]$. The case-by-case analysis of $\mathbb{E}[A^{(i)}B^{(i)}A^{(j)}C^{(j)}]$ for each $(i, j)$ is simple, and is omitted. The expectation of $D_4$ is
$$\frac{1}{S}\mathbb{E}[A^2BC] + \frac{S-1}{S}\mathbb{E}[AB]\mathbb{E}[AC].$$
Simple algebra reveals that $\sum_{i=1}^{4}\mathbb{E}[D_i] - \frac{S-1}{S}\mathbb{E}[AB]\cdot\frac{S-1}{S}\mathbb{E}[AC]$ is equal to eq. (3.7).
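The covariance formula of Lemma 3.B.1 can be verified exactly, with no Monte Carlo error, by enumerating all $S$-tuples of draws from a small discrete joint distribution. This is a sketch; the atoms, probabilities, and $S = 3$ below are arbitrary illustrative choices.

```python
import itertools

# Exhaustive check of Cov(f1, f2) for biased sample covariances.
atoms = [(0.0, 1.0, 2.0), (1.0, 3.0, 0.0), (2.0, 0.0, 1.0)]  # (A, B, C) values
probs = [0.2, 0.5, 0.3]
S = 3  # number of i.i.d. draws

def pop_mean(f):
    """Expectation of f(A, B, C) under the discrete distribution."""
    return sum(p * f(a) for p, a in zip(probs, atoms))

EA = pop_mean(lambda t: t[0])
EB = pop_mean(lambda t: t[1])
EC = pop_mean(lambda t: t[2])
m4 = pop_mean(lambda t: (t[0] - EA) ** 2 * (t[1] - EB) * (t[2] - EC))
varA = pop_mean(lambda t: (t[0] - EA) ** 2)
covBC = pop_mean(lambda t: (t[1] - EB) * (t[2] - EC))
covAB = pop_mean(lambda t: (t[0] - EA) * (t[1] - EB))
covAC = pop_mean(lambda t: (t[0] - EA) * (t[2] - EC))

# Exact expectations over all S-tuples of i.i.d. draws.
Ef1 = Ef2 = Ef1f2 = 0.0
for draw in itertools.product(range(len(atoms)), repeat=S):
    p = 1.0
    for i in draw:
        p *= probs[i]
    A = [atoms[i][0] for i in draw]
    B = [atoms[i][1] for i in draw]
    C = [atoms[i][2] for i in draw]
    mA, mB, mC = sum(A) / S, sum(B) / S, sum(C) / S
    f1 = sum(a * b for a, b in zip(A, B)) / S - mA * mB  # biased sample covariance
    f2 = sum(a * c for a, c in zip(A, C)) / S - mA * mC
    Ef1 += p * f1
    Ef2 += p * f2
    Ef1f2 += p * f1 * f2

lhs = Ef1f2 - Ef1 * Ef2  # exact Cov(f1, f2)
rhs = ((S - 1) ** 2 / S**3) * m4 + ((S - 1) / S**3) * covBC * varA \
    - ((S - 1) * (S - 2) / S**3) * covAB * covAC
assert abs(lhs - rhs) < 1e-12
```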
Proof of Lemma 3.A.3. In this proof, we will only consider expectations under the full-data posterior. Hence, to alleviate notation, we shall write $\mathbb{E}$ instead of $\mathbb{E}_{1_N}$; similarly, covariance and variance evaluations are understood to be at $w = 1_N$.
Applying Lemma 3.B.1, the covariance of $\hat{\psi}_n$ with itself, i.e., the variance of $\hat{\psi}_n$, is equal to
$$\frac{(S-1)^2}{S^3}\mathbb{E}\left\{(g(\beta) - \mathbb{E}[g(\beta)])^2\left(L(d^{(n)} \mid \beta) - \mathbb{E}[L(d^{(n)} \mid \beta)]\right)^2\right\} + \frac{S-1}{S^3}\mathrm{Var}(L(d^{(n)} \mid \beta))\mathrm{Var}(g(\beta)) - \frac{(S-1)(S-2)}{S^3}\mathrm{Cov}(g(\beta), L(d^{(n)} \mid \beta))^2.$$
Define the constant $C$ to be the maximum over $n$ of
$$\mathrm{Cov}(g(\beta), L(d^{(n)} \mid \beta))^2 + \mathrm{Var}(g(\beta))\mathrm{Var}(L(d^{(n)} \mid \beta)) + \mathbb{E}\left\{(g(\beta) - \mathbb{E}[g(\beta)])^2\left(L(d^{(n)} \mid \beta) - \mathbb{E}[L(d^{(n)} \mid \beta)]\right)^2\right\}.$$
Proof of Theorem 3.A.1. Similar to the proof of Lemma 3.A.3, expectations (and variances and covariances) are understood to be taken under the full-data posterior.
Since $\hat{\psi}_n$ is the biased sample covariance, we know that
$$\mathbb{E}\hat{\psi}_n = \frac{S-1}{S}\psi_n.$$
The bias of $\hat{\psi}_n$ goes to zero at rate $1/S$. Because of Lemma 3.A.3, the variance also goes to zero at rate $1/S$. Then, the application of Chebyshev's inequality shows that $\hat{\psi}_n \xrightarrow{p} \psi_n$. Since $N$ is a constant, the pointwise convergence $|\hat{\psi}_n - \psi_n| \xrightarrow{p} 0$ implies the uniform convergence $\max_{n=1}^{N}|\hat{\psi}_n - \psi_n| \xrightarrow{p} 0$.
We now prove that $|\hat{\Delta} - \Delta(\alpha)| \xrightarrow{p} 0$. We first recall some notation. The ranks $r_1, r_2, \ldots, r_N$ sort the influences $\psi_{r_1} \leq \psi_{r_2} \leq \ldots \leq \psi_{r_N}$, and $\Delta(\alpha) = -\sum_{m=1}^{\lfloor N\alpha \rfloor}\psi_{r_m}\mathbb{I}\{\psi_{r_m} < 0\}$. Similarly, $v_1, v_2, \ldots, v_N$ sort the estimates $\hat{\psi}_{v_1} \leq \hat{\psi}_{v_2} \leq \ldots \leq \hat{\psi}_{v_N}$, and $\hat{\Delta} = -\sum_{m=1}^{\lfloor N\alpha \rfloor}\hat{\psi}_{v_m}\mathbb{I}\{\hat{\psi}_{v_m} < 0\}$. It suffices to prove the convergence when $\lfloor N\alpha \rfloor \geq 1$: in the case $\lfloor N\alpha \rfloor = 0$, both $\hat{\Delta}$ and $\Delta(\alpha)$ are equal to zero, hence the distance between them is identically zero. Denote the $T$ unique values among the $\psi_n$ by $u_1 < u_2 < \ldots < u_T$. If $T = 1$, i.e., there is only one value, let $\omega := 1$. Otherwise, let $\omega$ be the smallest gap between subsequent values: $\omega := \min_t(u_{t+1} - u_t)$.
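The quantity $\hat{\Delta}$ above can be computed directly from the influence estimates by sorting. This is a minimal sketch; the influence values are illustrative.

```python
import math

def delta_hat(psi_hat, alpha):
    """Negative of the sum of the floor(N*alpha) smallest influence
    estimates, keeping only the negative ones."""
    n_drop = math.floor(len(psi_hat) * alpha)
    ordered = sorted(psi_hat)  # psi_hat[v_1] <= ... <= psi_hat[v_N]
    return -sum(p for p in ordered[:n_drop] if p < 0)

# Illustrative influence estimates.
psi = [0.3, -0.5, 0.1, -0.2, 0.4]
print(delta_hat(psi, 0.4))  # drops floor(5 * 0.4) = 2 smallest: -0.5 and -0.2
```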
Suppose that $\max_{n=1}^{N}|\hat{\psi}_n - \psi_n| \leq \omega/3$: let $A$ be the indicator for this event. For any $n$, each $\hat{\psi}_n$ is in the interval $[\psi_n - \omega/3, \psi_n + \omega/3]$. In the case $T = 1$, clearly all $k$ such that $\hat{\psi}_k$ is in $[\psi_n - \omega/3, \psi_n + \omega/3]$ satisfy $\psi_k = \psi_n$. In the case $T > 1$, since unique values of $\psi_n$ are at least $\omega$ apart, all $k$ such that $\hat{\psi}_k$ is in $[\psi_n - \omega/3, \psi_n + \omega/3]$ satisfy $\psi_k = \psi_n$. This means that the ranks $v_1, v_2, \ldots, v_N$, which sort the influence estimates, also sort the true influences in ascending order: $\psi_{v_1} \leq \psi_{v_2} \leq \ldots \leq \psi_{v_N}$. Since the ranks $r_1, r_2, \ldots, r_N$ also sort the true influences, it must be true that $\psi_{v_m} = \psi_{r_m}$ for all $m$. Therefore, we can write
$$|\hat{\Delta} - \Delta(\alpha)| = \left|\sum_{m=1}^{\lfloor N\alpha \rfloor}\left(\psi_{v_m}\mathbb{I}\{\psi_{v_m} < 0\} - \hat{\psi}_{v_m}\mathbb{I}\{\hat{\psi}_{v_m} < 0\}\right)\right| \leq \sum_{m=1}^{\lfloor N\alpha \rfloor}\left|\psi_{v_m}\mathbb{I}\{\psi_{v_m} < 0\} - \hat{\psi}_{v_m}\mathbb{I}\{\hat{\psi}_{v_m} < 0\}\right|.$$
We control the absolute values $\left|\psi_{v_m}\mathbb{I}\{\psi_{v_m} < 0\} - \hat{\psi}_{v_m}\mathbb{I}\{\hat{\psi}_{v_m} < 0\}\right|$. For any index $n$, by the triangle inequality, $\left|\psi_n\mathbb{I}\{\psi_n < 0\} - \hat{\psi}_n\mathbb{I}\{\hat{\psi}_n < 0\}\right|$ is at most
$$\mathbb{I}\{\hat{\psi}_n < 0\}\,|\psi_n - \hat{\psi}_n| + |\psi_n|\left|\mathbb{I}\{\hat{\psi}_n < 0\} - \mathbb{I}\{\psi_n < 0\}\right|.$$
The first term is at most $|\psi_n - \hat{\psi}_n|$. The second term is at most $\mathbb{I}\{|\psi_n - \hat{\psi}_n| \geq |\psi_n|,\ \psi_n \neq 0\}$.
We next prove a bound on $\left|\psi_n\mathbb{I}\{\psi_n < 0\} - \hat{\psi}_n\mathbb{I}\{\hat{\psi}_n < 0\}\right|$ that holds across $n$. Our analysis proceeds differently based on whether the set $\{n : \psi_n \neq 0\}$ is empty or not.
• $\{n : \psi_n \neq 0\}$ is empty. This means $\psi_n = 0$ for all $n$. Hence, $\mathbb{I}\{|\psi_n - \hat{\psi}_n| \geq |\psi_n|,\ \psi_n \neq 0\}$ is identically zero.
• $\{n : \psi_n \neq 0\}$ is not empty. We then know that $\min_n|\psi_n| > 0$. Hence, $\mathbb{I}\{|\psi_n - \hat{\psi}_n| \geq |\psi_n|,\ \psi_n \neq 0\}$ is upper bounded by $\mathbb{I}\{|\psi_n - \hat{\psi}_n| \geq \min_n|\psi_n|\}$. Since $|\psi_n - \hat{\psi}_n| \leq \max_n|\psi_n - \hat{\psi}_n|$, this last indicator is at most $\mathbb{I}\{\max_n|\psi_n - \hat{\psi}_n| \geq \min_n|\psi_n|\}$.
To summarize, we have proven the following upper bounds on $|\hat{\Delta} - \Delta(\alpha)|$. When $\{n : \psi_n \neq 0\}$ is empty, on $A$, $|\hat{\Delta} - \Delta(\alpha)|$ is upper bounded by
$$\lfloor N\alpha \rfloor \max_n|\psi_n - \hat{\psi}_n|. \tag{3.8}$$
When $\{n : \psi_n \neq 0\}$ is not empty, on $A$, $|\hat{\Delta} - \Delta(\alpha)|$ is upper bounded by
$$\lfloor N\alpha \rfloor \max_n|\psi_n - \hat{\psi}_n| + \lfloor N\alpha \rfloor\,\mathbb{I}\{\max_n|\psi_n - \hat{\psi}_n| \geq \min_n|\psi_n|\}. \tag{3.9}$$
We are ready to show that $\Pr(|\hat{\Delta} - \Delta(\alpha)| > \epsilon)$ converges to zero. For any positive $\epsilon$, we know that
$$\Pr(|\hat{\Delta} - \Delta(\alpha)| > \epsilon) \leq \Pr(|\hat{\Delta} - \Delta(\alpha)| > \epsilon,\ A) + \Pr(A^c).$$
The latter probability goes to zero because $\max_{n=1}^{N}|\hat{\psi}_n - \psi_n| \xrightarrow{p} 0$.
Suppose that $\{n : \psi_n \neq 0\}$ is empty. Using the upper bound eq. (3.8), we know that the event in the former probability implies that $\max_{n=1}^{N}|\hat{\psi}_n - \psi_n| \geq \epsilon/\lfloor N\alpha \rfloor$. The probability of this event also goes to zero because $\max_{n=1}^{N}|\hat{\psi}_n - \psi_n| \xrightarrow{p} 0$.
Suppose that $\{n : \psi_n \neq 0\}$ is not empty. Using the upper bound eq. (3.9), we know that the event in the former probability implies that $\left(\max_{n=1}^{N}|\hat{\psi}_n - \psi_n| + \mathbb{I}\{\max_n|\psi_n - \hat{\psi}_n| \geq \min_n|\psi_n|\}\right) \geq \epsilon/\lfloor N\alpha \rfloor$. Since $\max_{n=1}^{N}|\hat{\psi}_n - \psi_n|$ converges to zero in probability, $\mathbb{I}\{\max_n|\psi_n - \hat{\psi}_n| \geq \min_n|\psi_n|\}$ also converges to zero in probability. Hence, the probability that $\left(\max_{n=1}^{N}|\hat{\psi}_n - \psi_n| + \mathbb{I}\{\max_n|\psi_n - \hat{\psi}_n| \geq \min_n|\psi_n|\}\right) \geq \epsilon/\lfloor N\alpha \rfloor$ converges to zero.
In all, $\Pr(|\hat{\Delta} - \Delta(\alpha)| > \epsilon)$ goes to zero both in the case where $\{n : \psi_n \neq 0\}$ is empty and in the complement case. As the choice of $\epsilon$ was arbitrary, we have shown $\hat{\Delta} \xrightarrow{p} \Delta(\alpha)$.
Proof of Theorem 3.A.2. Similar to the proof of Lemma 3.B.1, we only consider expectations under the full-data posterior. Hence, we will write $\mathbb{E}$ instead of $\mathbb{E}_{1_N}$ to simplify notation. Variance and covariance operations are also understood to be taken under the full-data posterior. To lighten the dependence of the notation on the parameter $\beta$, we will write $g(\beta)$ as $g$ and $L(d^{(n)} \mid \beta)$ as $L_n$ when talking about the expectations of $g(\beta)$ and $L(d^{(n)} \mid \beta)$.
Define the following multivariate function
$$f(\beta) := \left(g(\beta),\ L(d^{(1)} \mid \beta),\ g(\beta)L(d^{(1)} \mid \beta),\ \ldots,\ L(d^{(N)} \mid \beta),\ g(\beta)L(d^{(N)} \mid \beta)\right)^T.$$
As defined, $f(\cdot)$ is a mapping from $P$-dimensional space to $(2N+1)$-dimensional space. Since $(\beta^{(1)}, \ldots, \beta^{(S)})$ is an i.i.d. sample, $f(\beta^{(1)}), f(\beta^{(2)}), \ldots, f(\beta^{(S)})$ is also an i.i.d. sample. Because of the moment conditions we have assumed, each $f(\beta)$ has finite variance. We apply the Lindeberg–Feller multivariate central limit theorem [203, Proposition 2.27], and conclude that
$$\sqrt{S}\left(\frac{1}{S}\sum_s f(\beta^{(s)}) - \mathbb{E}f(\beta)\right) \xrightarrow{D} \mathcal{N}(0, \Xi)$$
where the limit is $S \to \infty$, and $\Xi$ is a symmetric $(2N+1) \times (2N+1)$ dimensional matrix, which we specify next. It suffices to write down the formula for the $(i, j)$ entry of $\Xi$ where $i \leq j$:
$$\Xi_{i,j} = \begin{cases} \mathrm{Var}(g) & \text{if } i = j = 1, \\ \mathrm{Cov}(g, L_n) & \text{if } i = 1,\ j > 1, \\ \mathrm{Cov}(L_n, L_m) & \text{if } i = 2n,\ j = 2m, \\ \mathrm{Cov}(L_n, gL_m) & \text{if } i = 2n,\ j = 2m + 1, \\ \mathrm{Cov}(gL_n, L_m) & \text{if } i = 2n + 1,\ j = 2m, \\ \mathrm{Cov}(gL_n, gL_m) & \text{if } i = 2n + 1,\ j = 2m + 1. \end{cases}$$
To relate the asymptotic distribution of $f(\beta)$ to that of the vector $\hat{\psi}$, we now use the delta method. Define the following function, which acts on $(2N+1)$-dimensional vectors and returns $N$-dimensional vectors:
$$h([x_1, x_2, \ldots, x_{2N+1}]^T) := \left(x_3 - x_1x_2,\ x_5 - x_1x_4,\ x_7 - x_1x_6,\ \ldots,\ x_{2N+1} - x_1x_{2N}\right)^T.$$
Written this way, $h(\cdot)$ clearly transforms the sample mean $\frac{1}{S}\sum_s f(\beta^{(s)})$ into the estimates $\hat{\psi}$, and the true influences satisfy $\psi = h(\mathbb{E}f(\beta))$. $h(\cdot)$ is continuously differentiable everywhere: its Jacobian is the $N \times (2N+1)$ matrix
$$J_h = \begin{pmatrix} -x_2 & -x_1 & 1 & 0 & 0 & \ldots & 0 \\ -x_4 & 0 & 0 & -x_1 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & & & \ddots & \vdots \\ -x_{2N} & 0 & 0 & \ldots & 0 & -x_1 & 1 \end{pmatrix},$$
which is non-zero. Therefore, we apply the delta method [203, Theorem 3.1] and conclude that
$$\sqrt{S}\left(\hat{\psi} - \psi\right) \xrightarrow{D} \mathcal{N}\left(0,\ J_h\big|_{x=\mathbb{E}f(\beta)}\,\Xi\,\left(J_h\big|_{x=\mathbb{E}f(\beta)}\right)^T\right).$$
The $(i, j)$ entry of the asymptotic covariance matrix is the dot product between the $i$-th row of $J_h\big|_{x=\mathbb{E}f(\beta)}$ and the $j$-th column of $\Xi\left(J_h\big|_{x=\mathbb{E}f(\beta)}\right)^T$. The former is
$$[-\mathbb{E}L_i,\ 0,\ 0,\ \ldots,\ \underbrace{-\mathbb{E}g}_{2i\text{-th entry}},\ \underbrace{1}_{(2i+1)\text{-th entry}},\ \ldots,\ 0].$$
The latter is the corresponding combination of columns of $\Xi$. Taking the dot product, we have that the $(i, j)$ entry of the asymptotic covariance matrix is equal to
$$\mathrm{Cov}(gL_i, gL_j) - (\mathbb{E}g)\left[\mathrm{Cov}(gL_i, L_j) + \mathrm{Cov}(gL_j, L_i)\right] - \left[(\mathbb{E}L_j)\mathrm{Cov}(g, gL_i) + (\mathbb{E}L_i)\mathrm{Cov}(g, gL_j)\right] + (\mathbb{E}L_j)(\mathbb{E}L_i)\mathrm{Var}(g) + (\mathbb{E}g)^2\mathrm{Cov}(L_i, L_j) + (\mathbb{E}g)\left[(\mathbb{E}L_j)\mathrm{Cov}(g, L_i) + (\mathbb{E}L_i)\mathrm{Cov}(g, L_j)\right].$$
It is simple to check that the last display is equal to the covariance between $(g - \mathbb{E}[g])(L_j - \mathbb{E}[L_j])$ and $(g - \mathbb{E}[g])(L_i - \mathbb{E}[L_i])$.
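In practice, applying the map $h$ to the vector of Monte Carlo averages amounts to forming a biased sample covariance between the $g$ draws and each log-likelihood column. This is a minimal sketch; the array names and values are illustrative.

```python
import numpy as np

def influence_estimates(g_draws, L_draws):
    """g_draws: shape (S,); L_draws: shape (S, N).
    Returns psi_hat with psi_hat[n] = mean(g * L_n) - mean(g) * mean(L_n),
    i.e. h applied to the vector of sample means."""
    g_draws = np.asarray(g_draws, dtype=float)
    L_draws = np.asarray(L_draws, dtype=float)
    return (g_draws[:, None] * L_draws).mean(axis=0) \
        - g_draws.mean() * L_draws.mean(axis=0)

# Illustrative draws: S = 3 samples, N = 2 observations.
g = [1.0, 2.0, 3.0]
L = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0]]
psi_hat = influence_estimates(g, L)
```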
Proof of Lemma 3.A.4. We use the (shape, rate) parametrization of the gamma distribution. Let the prior over $\tau$ be $\mathrm{Gamma}(\alpha, \beta)$ where $\alpha, \beta > 0$. Conditioned on observations, the posterior distribution of $(\mu, \tau)$ is normal-gamma:
$$\tau \sim \mathrm{Gamma}\left(\alpha + \frac{N}{2},\ \beta + \frac{N}{2}\left[\frac{1}{N}\sum_{n=1}^{N}(x^{(n)})^2 - \bar{x}^2\right]\right), \qquad \epsilon \sim \mathcal{N}(0, 1), \qquad \mu \mid \tau, \epsilon = \bar{x} + \frac{\epsilon}{\sqrt{N\tau}}.$$
In this section, since we only take expectations under the original full-data posterior, we will lighten the notation's dependence on $w$ and write $\mathbb{E}$ instead of $\mathbb{E}_{1_N}$. Similarly, covariance and variance operators are understood to be under the full-data posterior.
For completeness, we compute $\mathrm{Cov}(\mu, L(d^{(n)} \mid \mu, \tau))$. We know that $\mu - \mathbb{E}\mu = \epsilon/\sqrt{N\tau}$. The log likelihood, as a function of $\tau$ and $\epsilon$, is
$$\frac{1}{2}\log\frac{\tau}{2\pi} - \frac{\tau}{2}\left(x^{(n)} - \bar{x}\right)^2 - \frac{1}{2N}\epsilon^2 + \frac{x^{(n)} - \bar{x}}{\sqrt{N}}\,\epsilon\sqrt{\tau}.$$
The covariance of $\mu$ and $L(d^{(n)} \mid \mu, \tau)$ is equal to the covariance between $\epsilon/\sqrt{N\tau}$ and $L(d^{(n)} \mid \mu, \tau)$. Since $\epsilon/\sqrt{N\tau}$ is zero mean, the covariance is equal to the expectation of the product. Since $\epsilon$ is independent of $\tau$, many of the terms that form the expectation of the product are zero. After some algebra, the only term that remains is
$$\mathbb{E}\left[\frac{x^{(n)} - \bar{x}}{N}\,\epsilon^2\right] = \frac{x^{(n)} - \bar{x}}{N}.$$
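The identity $\mathrm{Cov}(\mu, L(d^{(n)} \mid \mu, \tau)) = (x^{(n)} - \bar{x})/N$ can be checked by sampling from the normal-gamma posterior described above. This is a sketch; the data $x$ and the prior hyperparameters are arbitrary illustrative values.

```python
import numpy as np

# Monte Carlo check of Cov(mu, L(d^(n) | mu, tau)) = (x^(n) - xbar)/N
# under the normal-gamma posterior above.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 0.3, 2.0, 1.1])  # illustrative data
N, xbar = len(x), x.mean()
alpha, beta = 2.0, 1.0                    # illustrative Gamma(shape, rate) prior

S = 400_000
shape = alpha + N / 2
rate = beta + 0.5 * (np.sum(x**2) - N * xbar**2)
tau = rng.gamma(shape, 1.0 / rate, size=S)  # numpy's gamma takes a scale
eps = rng.standard_normal(S)
mu = xbar + eps / np.sqrt(N * tau)

n = 3  # check the covariance identity for one observation
L_n = 0.5 * np.log(tau / (2 * np.pi)) - 0.5 * tau * (x[n] - mu) ** 2
mc_cov = np.mean(mu * L_n) - np.mean(mu) * np.mean(L_n)
assert abs(mc_cov - (x[n] - xbar) / N) < 0.02
```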
The constants $D_1$, $D_2$, and $D_3$ mentioned in the lemma statement can be read off this last display. It is possible to replace the posterior functionals of $\tau$ with quantities that depend only on the prior $(\alpha, \beta)$ and the observed data. Such formulas might be helpful in studying the behavior of $\Sigma_{n,n}$ in the limit where some $x^{(n)}$ becomes very large.
The observed data are x(n) , g (n) , y (n) ; all other quantities are latent, and estimated by MCMC.
3.C.3 Hierarchical model for tree mortality data
The likelihood for the $n$-th observation is exponentially modified Gaussian with standard deviation $\sigma$, scale $\lambda$, and mean
$$\mu^{(\text{time})}_{t^{(n)}} + \mu^{(\text{region})}_{l^{(n)}} + \mu + \left(\theta^{(\text{time})}_{t^{(n)}} + \theta^{(\text{location})}_{l^{(n)}} + \theta\right)x^{(n)} + f(x^{(n)}),$$
with $f(x) := \sum_{i=1}^{10} B_i(x)\gamma_i$, where the $B_i$'s are fixed thin plate spline basis functions [207] and the $\gamma_i$'s are random: $\gamma_i \sim \mathrm{Normal}(0, \sigma^2_{(\text{smooth})})$. In all, the parameters of interest are the fixed effects, the random effects, the spline coefficients $\gamma_i$, and the scale parameters $\sigma$, $\lambda$, and $\sigma_{(\text{smooth})}$.
Since there are many regions (nearly 3,000) and periods of time (30), the number of random effects is large. Senf et al. [183] use brms's default priors for all parameters: in this default, the fixed effects are given improper uniform priors over the real line. To work with proper distributions, we set the priors for the random effects and the degree of smoothing in the same way as Senf et al. [183]. For the fixed effects, we use t location-scale distributions with 3 degrees of freedom, location 0, and scale 1000.
Conclusion
also hope to spend little time on estimating the posterior. Both parallelism (chapter 1) and finite approximations (chapter 2) might help such an analyst. The biggest conceptual obstacle is extending the coupling scheme in chapter 1 beyond partition-valued problems. I expect that a good coupling needs to properly handle an analogous label-switching problem, and that ideas from optimal transport will be relevant even outside of partition-valued models.
References
[1] Ayan Acharya, Joydeep Ghosh, and Mingyuan Zhou. Nonparametric Bayesian fac-
tor analysis for dynamic count matrices. In International Conference on Artificial
Intelligence and Statistics, 2015.
[2] José Antonio Adell and Alberto Lekuona. Sharp estimates in signed Poisson approxi-
mation of Poisson mixtures. Bernoulli, 11(1):47–65, 2005.
[3] D. Aldous. Exchangeability and related topics. École d’Été de Probabilités de Saint-Flour
XIII—1983, pages 1–198, 1985.
[4] Horst Alzer. On some inequalities for the gamma and psi functions. Mathematics of
computation, 66(217):373–389, 1997.
[5] Manuela Angelucci, Dean Karlan, Jonathan Zinman, Kerry Brennan, Ellen Degnan,
Alissa Fishbane, Andrew Hillis, Hideto Koizumi, Elana Safran, Rachel Strohm, Braulio
Torres, Asya Troychansky, Irene Velez, Glynis Startz, Sanjeev Swamy, Matthew White,
Anna York, and Compartamos Banco. Microcredit impacts: Evidence from a randomized
microcredit program placement experiment by Compartamos Banco. American Economic
Journal: Applied Economics, 7:151–82, 2015. URL http://www.compartamos.com/
wps/portal/Grupo/InvestorsRelations/FinancialInformation.
[7] Julyan Arbel and Igor Prünster. A moment-matching Ferguson & Klass algorithm.
Statistics and Computing, 27(1):3–17, 2017.
[8] Julyan Arbel, Pierpaolo De Blasi, and Igor Prünster. Stochastic approximations to the
Pitman–Yor process. Bayesian Analysis, 14(4):1201–1219, 2019.
[9] Richard Arratia, Andrew D. Barbour, and Simon Tavaré. Logarithmic Combinatorial
Structures: a Probabilistic Approach, volume 1. European Mathematical Society, 2003.
[10] Orazio Attanasio, Britta Augsburg, Ralph De Haas, Emla Fitzsimons, and Heike
Harmgart. The impacts of microfinance: Evidence from joint-liability lending in
Mongolia. American Economic Journal: Applied Economics, 7(1):90–122, 2015. ISSN
19457782, 19457790. URL http://www.jstor.org/stable/43189514.
[11] Britta Augsburg, Ralph De Haas, Heike Harmgart, and Costas Meghir. The impacts
of microcredit: Evidence from Bosnia and Herzegovina. American Economic Journal:
Applied Economics, 7(1):183–203, January 2015. doi:10.1257/app.20130272. URL
https://www.aeaweb.org/articles?id=10.1257/app.20130272.
[12] Abhijit Banerjee, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan. The miracle
of microfinance? Evidence from a randomized evaluation. American Economic Journal:
Applied Economics, 7(1):22–53, January 2015. doi:10.1257/app.20130533. URL https:
//www.aeaweb.org/articles?id=10.1257/app.20130533.
[13] Andrew D. Barbour and Peter Hall. On the rate of Poisson convergence. In Mathe-
matical Proceedings of the Cambridge Philosophical Society, volume 95, pages 473–480.
Cambridge University Press, 1984.
[14] Sanjib Basu, Sreenivasa Rao Jammalamadaka, and Wei Liu. Local Posterior Robustness
with Parametric Priors: Maximum and Average Sensitivity, pages 97–106. Springer
Netherlands, Dordrecht, 1996. ISBN 978-94-015-8729-7. doi:10.1007/978-94-015-8729-
7_6. URL https://doi.org/10.1007/978-94-015-8729-7_6.
[15] Jean Bertoin, T. Fujita, Bernard Roynette, and Marc Yor. On a particular class of
self-decomposable random variables: the durations of Bessel excursions straddling
independent exponential times. Probability and Mathematical Statistics, 26:315–366,
2006.
[16] Peter J. Bickel. On Some Robust Estimates of Location. The Annals of Mathematical
Statistics, 36(3):847 – 858, 1965.
[17] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan,
Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman.
Pyro: deep universal probabilistic programming. Journal of Machine Learning Research,
2018.
[19] David Blackwell and James B. MacQueen. Ferguson distributions via Polya urn schemes.
The Annals of Statistics, 1(2):353–355, 03 1973. doi:10.1214/aos/1176342372.
[20] D. M. Blei, T. L. Griffiths, and M. I. Jordan. The nested Chinese restaurant process
and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):
1–30, 2010.
[21] David M. Blei and Michael I. Jordan. Variational inference for Dirichlet process mixtures.
Bayesian Analysis, 1(1):121 – 143, 2006.
[22] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022, March 2003. ISSN 1532-4435.
[23] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A
review for statisticians. Journal of the American Statistical Association, 112(518):
859–877, April 2017. ISSN 1537-274X. doi:10.1080/01621459.2017.1285773. URL
http://dx.doi.org/10.1080/01621459.2017.1285773.
[25] Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Dis-
placement interpolation using Lagrangian mass transport. In Proceedings of the 2011
SIGGRAPH Asia Conference, 2011.
[26] Anders Brix. Generalized gamma measures and shot-noise Cox processes. Advances in
Applied Probability, 31:929–953, 1999.
[28] Tamara Broderick, Michael I. Jordan, and Jim Pitman. Beta processes, stick-breaking
and power laws. Bayesian analysis, 7(2):439–476, 2012.
[29] Tamara Broderick, Jim Pitman, and Michael I. Jordan. Feature allocations, probability
functions, and paintboxes. Bayesian Analysis, 8(4):801–836, 2013.
[30] Tamara Broderick, Ashia C. Wilson, and Michael I. Jordan. Posteriors, conjugacy, and
exponential families for completely random measures. Bernoulli, 24(4B):3181–3221, 11
2018. doi:10.3150/16-BEJ855.
[31] Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample
robustness metric: Can dropping a little data change conclusions?, 2020.
[33] Trevor Campbell, Diana Cai, and Tamara Broderick. Exchangeable trait allocations.
Electronic Journal of Statistics, 12(2):2290–2322, 2018.
[34] Trevor Campbell, Jonathan H. Huggins, Jonathan P. How, and Tamara Broderick.
Truncated random measures. Bernoulli, 25(2):1256–1288, 05 2019. doi:10.3150/18-
BEJ1020.
[35] Antonio Canale and David B. Dunson. Bayesian kernel mixtures for counts. Journal of
the American Statistical Association, 106(496):1528–1539, 2011.
[36] Clément Canonne. A short note on Poisson tail bounds. Technical report available from
https://ccanonne.github.io/. URL http://www.cs.columbia.edu/~ccanonne/files/misc/
2017-poissonconcentration.pdf.
[37] Bradley P. Carlin and Nicholas G. Polson. An expected utility approach to influence
diagnostics. Source: Journal of the American Statistical Association, 86:1013–1021,
1991.
[38] Edward Carlstein. The use of subseries values for estimating the variance of a general
statistic from a stationary sequence. Annals of Statistics, 14:1171–1179, 1986.
[39] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich,
Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan:
A probabilistic programming language. Journal of Statistical Software, 76:1–32, 2017.
[40] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich,
Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan:
A probabilistic programming language. Journal of Statistical Software, 76:1–32, 2017.
[41] Małgorzata Charytanowicz, Jerzy Niewczas, Piotr Kulczycki, Piotr A Kowalski, Szymon
Łukasik, and Sławomir Żak. Complete gradient clustering algorithm for features analysis
of X-ray images. In Information Technologies in Biomedicine, pages 15–24. Springer,
2010.
[42] Sitan Chen, Michelle Delcourt, Ankur Moitra, Guillem Perarnau, and Luke Postle. Im-
proved bounds for randomly sampling colorings via linear programming. In Proceedings
of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2019.
[43] Zhijun Chen, Yishi Zhang, Chaozhong Wu, and Bin Ran. Understanding individualization driving states via latent Dirichlet allocation model. IEEE Intelligent Transportation
Systems Magazine, 11:41–53, 6 2019. ISSN 19411197. doi:10.1109/MITS.2019.2903525.
[44] C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated
in the case of the binomial. Biometrika, 26(4):404–413, 1934. ISSN 00063444. URL
http://www.jstor.org/stable/2331986.
[45] Bruno Crépon, Florencia Devoto, Esther Duflo, and William Parienté. Estimating the
impact of microcredit on those who take it up: Evidence from a randomized experiment
in Morocco. American Economic Journal: Applied Economics, 7(1):123–50, January
2015. doi:10.1257/app.20130535. URL https://www.aeaweb.org/articles?id=10.1257/
app.20130535.
[47] Perry de Valpine, Daniel Turek, Christopher Paciorek, Cliff Anderson-Bergman, Duncan
Temple Lang, and Ras Bodik. Programming with models: writing statistical algorithms
for general model structures with NIMBLE. Journal of Computational and Graphical
Statistics, 26:403–413, 2017. doi:10.1080/10618600.2016.1172487.
[48] Perry de Valpine, Daniel Turek, Christopher J. Paciorek, Clifford Anderson-Bergman,
Duncan Temple Lang, and Rastislav Bodik. Programming With Models: Writing
Statistical Algorithms for General Model Structures With NIMBLE. Journal of
Computational and Graphical Statistics, 26(2):403–413, 2017.
[49] Daryl DeFord, Moon Duchin, and Justin Solomon. Recombination: a fam-
ily of Markov chains for redistricting. Harvard Data Science Review, 3 2021.
https://hdsr.mitpress.mit.edu/pub/1ds8ptxu.
[50] Sameer Deshpande, Soumya Ghosh, Tin D. Nguyen, and Tamara Broderick. Are you
using test log-likelihood correctly? Transactions on Machine Learning Research, 2024.
ISSN 2835-8856. URL https://openreview.net/forum?id=n2YifD4Dxo.
[51] Luc Devroye and Lancelot James. On simulation and properties of the stable law.
Statistical methods & applications, 23(3):307–343, 2014.
[52] P. Diaconis and David Freedman. On the consistency of Bayes estimates. The Annals
of Statistics, pages 1–26, 1986.
[53] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential families. The
Annals of Statistics, 7:269–281, 1979.
[54] Benjamin Doerr and Frank Neumann. Theory of evolutionary computation: recent
developments in discrete optimization. Springer Nature, 2019.
[55] Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, and Yee Whye Teh. Variational
inference for the Indian buffet process. In International Conference on Artificial
Intelligence and Statistics, 2009.
[56] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http:
//archive.ics.uci.edu/ml.
[57] Bradley Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics,
7:1–26, 1979.
[58] Michael D. Escobar and Mike West. Bayesian density estimation and inference using
mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.
[60] Thomas S Ferguson. A Bayesian analysis of some nonparametric problems. The Annals
of Statistics, 1:209–230, 1973.
[61] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Bois-
bunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo
Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotoma-
monjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Suther-
land, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python Optimal
Transport. Journal of Machine Learning Research, 22(78):1–8, 2021.
[64] Daniel Freund and Samuel B. Hopkins. Towards practical robustness auditing for linear
regression, 7 2023. URL http://arxiv.org/abs/2307.16315.
[65] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using
multiple sequences. Statistical Science, 7(4):457–472, 1992.
[66] S. Geman and D. Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian
Restoration of Images. Pattern Analysis and Machine Intelligence, IEEE Transactions
on, (6):721–741, 1984.
[69] Alison L. Gibbs. Convergence in the Wasserstein metric for Markov chain Monte Carlo
algorithms with applications to image restoration. Stochastic Models, 20(4):473–492,
2004.
[70] Walter R. Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling.
Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337–348,
1992.
[71] Ryan Giordano and Tamara Broderick. The Bayesian infinitesimal jackknife for variance,
May 2023. URL http://arxiv.org/abs/2305.06466.
[72] Ryan Giordano, Tamara Broderick, Michael I. Jordan, and Mohammad Emtiyaz Khan.
Covariances, robustness, and variational Bayes. Journal of Machine Learning Research,
19:1–49, 2018. URL http://jmlr.org/papers/v19/17-670.html.
[73] Ryan Giordano, Runjing Liu, Michael I. Jordan, and Tamara Broderick. Evaluating
sensitivity to the stick-breaking prior in Bayesian nonparametrics (with discussion).
Bayesian Analysis, 18:287–366, 2023. ISSN 19316690. doi:10.1214/22-BA1309.
[74] Peter W. Glynn and Chang-han Rhee. Exact estimation for Markov chain equilibrium
expectations. Journal of Applied Probability, 51(A):377–389, 2014.
[75] Alexander V. Gnedin. On convergence and extensions of size-biased permutations.
Journal of Applied Probability, 35(3):642–650, 1998. doi:10.1239/jap/1032265212.
[76] Louis Gordon. A stochastic approach to the gamma function. The American Mathe-
matical Monthly, 101(9):858–865, 1994. ISSN 00029890, 19300972.
[77] Dilan Görür and Carl E. Rasmussen. Dirichlet process Gaussian mixture models: choice
of the base distribution. Journal of Computer Science and Technology, 25(4):653–664,
2010.
[78] T. L. Griffiths and Z. Ghahramani. The Indian buffet process: an introduction and
review. Journal of Machine Learning Research, 12:1185–1224, 2011.
[79] Thomas L. Griffiths and Zoubin Ghahramani. The Indian Buffet Process: An In-
troduction and Review. Journal of Machine Learning Research, 12(32):1185–1224,
2011.
[80] Paul Gustafson. Local sensitivity of posterior expectations. The Annals of Statistics,
24:195, 1996.
[81] Mubin Ul Haque, Leonardo Horn Iwaya, and M. Ali Babar. Challenges in Docker
development: A large-scale study using Stack Overflow. In Proceedings of the 14th ACM
/ IEEE International Symposium on Empirical Software Engineering and Measurement
(ESEM), ESEM '20, New York, NY, USA, 2020. Association for Computing Machinery.
ISBN 9781450375801. doi:10.1145/3382494.3410693. URL https://doi.org/10.1145/
3382494.3410693.
[82] Nils Lid Hjort. Nonparametric Bayes estimators based on beta processes in models for
life history data. The Annals of Statistics, 18(3):1259–1294, 1990.
[83] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent
Dirichlet allocation. In Advances in Neural Information Processing Systems, 2010.
[84] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn Sampler: Adaptively setting
path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):
1593–1623, 2014.
[86] Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In 2010 20th
International Conference on Pattern Recognition, pages 2366–2369. IEEE, 2010.
[87] Jonathan Huggins, Mikolaj Kasprzak, Trevor Campbell, and Tamara Broderick. Val-
idated variational inference via practical posterior error bounds. In International
Conference on Artificial Intelligence and Statistics, pages 1792–1802. PMLR, 2020.
[88] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal
of the American Statistical Association, 96:161–173, 2001.
[89] H. Ishwaran and M. Zarepour. Exact and approximate sum representations for the
Dirichlet process. Canadian Journal of Statistics, 30(2):269–283, 2002.
[90] Hemant Ishwaran and Mahmoud Zarepour. Markov chain Monte Carlo in approximate
Dirichlet and beta two-parameter process hierarchical models. Biometrika, 87(2):371–390,
2000.
[91] Pierre E. Jacob. Couplings and Monte Carlo. Course Lecture Notes, 2020.
[92] Pierre E. Jacob, John O'Leary, and Yves F. Atchadé. Unbiased Markov chain
Monte Carlo methods with couplings. Journal of the Royal Statistical Society,
Series B, pages 543–600, 2020. URL https://github.com/pierrejacob/unbiasedmcmc.
[93] Pierre E. Jacob, John O’Leary, and Yves F. Atchadé. Unbiased Markov chain Monte
Carlo methods with couplings. Journal of the Royal Statistical Society Series B, 82(3):
543–600, 2020.
[94] Sonia Jain and Radford M. Neal. A split-merge Markov chain Monte Carlo procedure for
the Dirichlet process mixture model. Journal of Computational and Graphical Statistics,
13(1):158–182, 2004.
[96] L. F. James. Bayesian Poisson calculus for latent feature modeling via generalized
Indian Buffet Process priors. The Annals of Statistics, 45(5):2016–2045, 2017.
[97] L. F. James, Antonio Lijoi, and Igor Prünster. Posterior Analysis for Normalized
Random Measures with Independent Increments. Scandinavian Journal of Statistics,
36(1):76–97, 2009.
[98] Ajay Jasra, Chris C. Holmes, and David A. Stephens. Markov chain Monte Carlo
methods and the label switching problem in Bayesian mixture modeling. Statistical
Science, pages 50–67, 2005.
[99] Mark Jerrum. Mathematical foundations of the Markov chain Monte Carlo method. In
Probabilistic Methods for Algorithmic Discrete Mathematics, pages 116–165. Springer,
1998.
[101] N.L. Johnson, A.W. Kemp, and S. Kotz. Univariate Discrete Distributions. Wiley
Series in Probability and Statistics. Wiley, 2005. ISBN 9780471715801.
[102] Wesley Johnson and Seymour Geisser. A predictive view of the detection and
characterization of influential observations in regression analysis. Journal of the
American Statistical Association, 78:137–144, 1983.
[103] Terry C. Jones, Guido Biele, Barbara Mühlemann, Talitha Veith, Julia Schneider, Jörn
Beheim-Schwarzbach, Tobias Bleicker, Julia Tesch, Marie Luisa Schmidt, Leif Erik
Sander, Florian Kurth, Peter Menzel, Rolf Schwarzer, Marta Zuchowski, Jörg Hofmann,
Andi Krumbholz, Angela Stein, Anke Edelmann, Victor Max Corman, and Christian
Drosten. Estimating infectiousness throughout SARS-CoV-2 infection course. Science, 373,
July 2021. ISSN 1095-9203. doi:10.1126/science.abi5273.
[104] Olav Kallenberg. Foundations of modern probability. Springer, New York, 2nd edition,
2002.
[105] E. L. Kaplan and Paul Meier. Nonparametric estimation from incomplete observations.
Journal of the American Statistical Association, 53(282):457–481, 1958.
[106] Dean Karlan and Jonathan Zinman. Microcredit in theory and practice: Using ran-
domized credit scoring for impact evaluation. Science, 332(6035):1278–1284, 2011.
[107] Damian J. Kelly and Garrett M. O’Neill. The minimum cost flow problem and the
network simplex solution method. PhD thesis, Citeseer, 1991.
[108] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International
Conference on Learning Representations, 2014.
[109] J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21
(1):59–78, 1967.
[110] J. F. C. Kingman. Random discrete distributions. Journal of the Royal Statistical
Society B, 37(1):1–22, 1975.
[111] J. F. C. Kingman. Poisson Processes, volume 3. Clarendon Press, 1992.
[112] M. Kline. Calculus: An Intuitive and Physical Approach. Dover Books on Mathematics.
Dover Publications, 1998. ISBN 9780486404530. URL https://books.google.com/books?
id=YdjK_rD7BEkC.
[113] Ramesh Madhavrao Korwar and Myles Hollander. Contributions to the theory of
Dirichlet processes. The Annals of Probability, 1(4):705–711, 1973.
[114] Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference.
Springer, 2008.
[115] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei.
Automatic differentiation variational inference. Journal of Machine Learning Research,
18:1–45, 2017. URL http://jmlr.org/papers/v18/16-107.html.
[116] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei.
Automatic differentiation variational inference. Journal of Machine Learning Research,
18(14):1–45, 2017.
[117] Kenichi Kurihara, Max Welling, and Y. W. Teh. Collapsed variational Dirichlet process
mixture models. In International Joint Conference on Artificial Intelligence, 2007.
[118] S.N. Lahiri. Resampling Methods for Dependent Data. Springer, 2003. URL https:
//link.springer.com/book/10.1007/978-1-4757-3803-2.
[119] Junpeng Lao, Christopher Suter, Ian Langmore, Cyril Chimisov, Ashish Saxena, Pavel
Sountsov, Dave Moore, Rif A. Saurous, Matthew D. Hoffman, and Joshua V. Dillon.
tfp.mcmc: Modern Markov chain Monte Carlo tools built for modern hardware.
arXiv preprint arXiv:2002.01184, 2020.
[120] Günter Last and Mathew Penrose. Lectures on the Poisson Process. Institute of
Mathematical Statistics Textbooks. Cambridge University Press, 2017.
[121] Michael Lavine. Local predictive influence in Bayesian linear models with conjugate
priors. Communications in Statistics - Simulation and Computation, 21:269–283,
1992. ISSN 1532-4141. doi:10.1080/03610919208813018.
[122] Lucien Le Cam. An approximation theorem for the Poisson binomial distribution.
Pacific Journal of Mathematics, 10(4):1181–1197, 1960.
[123] Clement Lee and Darren J. Wilkinson. A review of stochastic block models and
extensions for graph clustering. December 2019. ISSN 2364-8228.
[124] Juho Lee, Lancelot F. James, and Seungjin Choi. Finite-dimensional BFRY priors and
variational Bayesian inference for power law models. In Advances in Neural Information
Processing Systems, 2016.
[125] Juho Lee, Xenia Miscouridou, and François Caron. A unified construction for series
representations and finite approximations of completely random measures. Bernoulli,
2022.
[126] David A. Levin and Yuval Peres. Markov chains and mixing times, volume 107.
American Mathematical Society, 2017.
[127] Antonio Lijoi and Igor Prünster. Models beyond the Dirichlet process, page 80–136.
Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University
Press, 2010.
[128] Antonio Lijoi, Igor Prünster, and Stephen G. Walker. On consistency of nonparametric
normal mixtures for Bayesian density estimation. Journal of the American Statistical
Association, 100(472):1292–1296, 2005.
[129] Antonio Lijoi, Igor Prünster, and Tommaso Rigon. The Pitman–Yor multinomial process
for mixture modelling. Biometrika, 107(4):891–906, 2020.
[130] Antonio Lijoi, Igor Prünster, and Tommaso Rigon. Sampling hierarchies of discrete
random structures. Statistics and Computing, 30(6):1591–1607, November 2020. ISSN 0960-3174.
doi:10.1007/s11222-020-09961-7. URL https://doi.org/10.1007/s11222-020-09961-7.
[131] Antonio Lijoi, Igor Prünster, and Tommaso Rigon. Finite-dimensional discrete random
structures and Bayesian clustering. Journal of the American Statistical Association,
pages 1–13, 2023.
[132] Torgny Lindvall. Lectures on the coupling method. Courier Corporation, 2002.
[133] Silvia Liverani, David I. Hastie, Lamiae Azizi, Michail Papathomas, and Sylvia Richard-
son. PReMiuM: An R package for profile regression mixture models using Dirichlet
processes. Journal of Statistical Software, 64(7):1, 2015.
[134] Michel Loeve. Ranking limit problem. In Proceedings of the Third Berkeley Symposium
on Mathematical Statistics and Probability, Volume 2: Contributions to Probability
Theory, pages 177–194, Berkeley, Calif., 1956.
[135] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-
tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):
1145–1190, 2019.
[136] Steven N. MacEachern. Estimating normal means with a conjugate style Dirichlet
process prior. Communications in Statistics - Simulation and Computation, 23(3):
727–741, 1994.
[138] Robert E. McCulloch. Local model influence. Journal of the American Statistical
Association, 84:473–478, 1989.
[142] Russell B. Millar and Wayne S. Stewart. Assessment of locally influential observations
in Bayesian models. Bayesian Analysis, 2:365–384, 2007.
[143] Jeffrey W Miller and Matthew T Harrison. Mixture models with a prior on the number
of components. Journal of the American Statistical Association, 113(521):340–356,
2018.
[144] Thomas P. Minka, Galit Shmueli, Joseph B. Kadane, Sharad Borle, and
Peter Boatwright. Computing with the COM-Poisson distribution. Tech-
nical report. URL https://www.microsoft.com/en-us/research/publication/
computing-com-poisson-distribution/.
[145] B. G. Mirkin and L. B. Chernyi. Measurement of the distance between distinct partitions
of a finite set of objects. Automation and Remote Control, 5:120–127, 1970.
[146] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo
gradient estimation in machine learning, 2020. URL https://www.github.com/deepmind/
mc_gradients.
[147] Ankur Moitra and Dhruv Rohatgi. Provably auditing ordinary least squares in low
dimensions, May 2022. URL http://arxiv.org/abs/2205.14284.
[148] Warwick Nash, T.L. Sellers, S.R. Talbot, A.J. Cawthorn, and W.B. Ford. The Population
Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from
the North Coast and Islands of Bass Strait. Sea Fisheries Division, Technical Report,
48, 01 1994.
[149] Radford M. Neal. Circularly-coupled Markov chain sampling. Technical report, University
of Toronto, 1992.
[150] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[151] Tin D. Nguyen, Brian L. Trippe, and Tamara Broderick. Many processors, little time:
MCMC for partitions via optimal transport couplings. In Gustau Camps-Valls, Francisco
J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference
on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning
Research, pages 3483–3514. PMLR, 28–30 Mar 2022.
[152] Tin D. Nguyen, Jonathan Huggins, Lorenzo Masoero, Lester Mackey, and Tamara
Broderick. Independent Finite Approximations for Bayesian Nonparametric Inference.
Bayesian Analysis, pages 1–38, 2023. doi:10.1214/23-BA1385. URL https://doi.org/
10.1214/23-BA1385.
[154] James B. Orlin. A faster strongly polynomial minimum cost flow algorithm. Operations
Research, 41(2):338–350, 1993.
[155] John Paisley and Lawrence Carin. Nonparametric factor analysis with beta process
priors. In International Conference on Machine Learning, 2009.
[156] John Paisley, Lawrence Carin, and David Blei. Variational inference for stick-breaking
beta process priors. In International Conference on Machine Learning, 2011.
[157] John Paisley, David M. Blei, and Michael I. Jordan. Stick-breaking beta processes
and the Poisson process. In International Conference on Artificial Intelligence and
Statistics, 2012.
[158] K Palla, D A Knowles, and Z. Ghahramani. An infinite latent attribute model for
network data. In International Conference on Machine Learning, 2012.
[159] Mihael Perman, Jim Pitman, and Marc Yor. Size-biased sampling of Poisson point
processes and excursions. Probability Theory and Related Fields, 92(1):21–39, 1992.
[161] Jim Pitman. Exchangeable and partially exchangeable random partitions. Probability
theory and related fields, 102(2):145–158, 1995.
[162] Jim Pitman. Some developments of the Blackwell–MacQueen urn scheme. Lecture
Notes-Monograph Series, pages 245–267, 1996.
[163] Jim Pitman. Combinatorial Stochastic Processes: Ecole d’Eté de Probabilités de Saint-
Flour XXXII-2002. Springer, 2006.
[164] Jim Pitman and Marc Yor. The two-parameter Poisson–Dirichlet distribution derived
from a stable subordinator. The Annals of Probability, pages 855–900, 1997.
[165] David Pollard. A User’s Guide to Measure Theoretic Probability. Cambridge University
Press, 2001.
[166] David Pollard. Convergence of stochastic processes. Springer Science & Business Media,
2012.
[167] Nicholas G. Polson, James G. Scott, and Jesse Windle. Bayesian inference for logistic
models using Pólya-Gamma latent variables. Journal of the American Statistical
Association, 108:1339–1349, 2013. ISSN 1537-274X. doi:10.1080/01621459.2013.829001.
[168] Tenelle Porter, Diego Catalán Molina, Andrei Cimpian, Sylvia Roberts, Afiya Fredericks,
Lisa S. Blackwell, and Kali Trzesniewski. Growth-mindset intervention delivered by
teachers boosts achievement in early adolescence. Psychological Science, 33:1086–1096,
July 2022. ISSN 1467-9280. doi:10.1177/09567976211061109.
[169] Sandhya Prabhakaran, Elham Azizi, Ambrose Carr, and Dana Pe’er. Dirichlet process
mixture model for correcting technical variation in single-cell gene expression data. In
International Conference on Machine Learning, 2016.
[171] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of Population
Structure Using Multilocus Genotype Data. Genetics, 155(2):945–959, 06 2000.
[172] James Gary Propp and David Bruce Wilson. Exact sampling with coupled Markov
chains and applications to statistical mechanics. Random Structures & Algorithms, 9
(1-2):223–252, 1996.
[173] Maxim Rabinovich, Elaine Angelino, and Michael I. Jordan. Variational consensus
Monte Carlo. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 28. Curran Associates,
Inc., 2015.
[174] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal
of the American Statistical Association, 66(336):846–850, 1971.
[175] Rajesh Ranganath, Sean Gerrish, and D. M. Blei. Black box variational inference. In
International Conference on Artificial Intelligence and Statistics, 2014.
[176] Eugenio Regazzini, Antonio Lijoi, and Igor Prünster. Distributional results for means
of normalized random measures with independent increments. The Annals of Statistics,
31(2):560–585, 2003.
[177] Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Arcand,
David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael
Jones, Anna Klein, Lauren Milechin, Julia Mullen, Andrew Prout, Antonio Rosa, Charles
Yee, and Peter Michaleas. Interactive supercomputing on 40,000 cores for machine
learning and data analysis. In 2018 IEEE High Performance extreme Computing
Conference (HPEC), pages 1–6. IEEE, 2018.
[179] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic back-
propagation and approximate inference in deep generative models. In International
Conference on Machine Learning, 2014.
[180] Anirban Roychowdhury and Brian Kulis. Gamma processes, stick-breaking, and
variational inference. In International Conference on Artificial Intelligence and Statistics,
2015.
[181] Fabrizio Ruggeri and Larry Wasserman. Infinitesimal sensitivity of posterior
distributions. The Canadian Journal of Statistics, 21:195–203, 1993. URL https:
//www.jstor.org/stable/3315811.
[182] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I
George, and Robert E McCulloch. Bayes and big data: The consensus Monte Carlo
algorithm. International Journal of Management Science and Engineering Management,
11(2):78–88, 2016.
[183] Cornelius Senf, Allan Buras, Christian S. Zang, Anja Rammig, and Rupert Seidl. Excess
forest mortality is consistently linked to drought across Europe. Nature Communications,
11, December 2020. ISSN 2041-1723. doi:10.1038/s41467-020-19924-1.
[184] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:
639–650, 1994.
[185] Miriam Shiffman, Ryan Giordano, and Tamara Broderick. Could dropping a few cells
change the takeaways from differential expression?, 2023.
[186] Galit Shmueli, Thomas P. Minka, Joseph B. Kadane, Sharad Borle, and Peter
Boatwright. A useful distribution for fitting discrete data: revival of the
Conway–Maxwell–Poisson distribution. Journal of the Royal Statistical Society:
Series C (Applied Statistics), 54(1):127–142, 2005. doi:10.1111/j.1467-9876.2005.00474.x.
URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9876.2005.00474.x.
[187] Guilhem Sommeria-Klein, Lucie Zinger, Eric Coissac, Amaia Iribar, Heidy Schimann,
Pierre Taberlet, and Jérôme Chave. Latent Dirichlet allocation reveals spatial and
taxonomic structure in a DNA-based census of soil biodiversity from a tropical forest.
Molecular Ecology Resources, 20:371–386, March 2020. ISSN 1755-0998. doi:10.1111/1755-
0998.13109.
[188] Sanvesh Srivastava, Cheng Li, and David B. Dunson. Scalable Bayes via barycenter in
Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346, 2018.
[189] Stephen M. Stigler. The Asymptotic Distribution of the Trimmed Mean. The Annals
of Statistics, 1(3):472–477, 1973.
[190] Stephen M. Stigler. The 1988 Neyman Memorial Lecture: A Galtonian Perspective on
Shrinkage Estimators. Statistical Science, 5(1):147–155, 1990.
[191] Rainer Storn and Kenneth Price. Differential evolution – a simple and efficient heuristic
for global optimization over continuous spaces. Journal of Global Optimization, 11(4):
341–359, 1997.
[192] Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo simulation of spin-
glasses. Physical Review Letters, 57(21):2607, 1986.
[193] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo. A Unified Framework for De-
Duplication and Population Size Estimation (with Discussion). Bayesian Analysis, 15
(2):633–682, 2020.
[194] Alessandro Tarozzi, Jaikishan Desai, and Kristin Johnson. The impacts of microcredit:
Evidence from Ethiopia. American Economic Journal: Applied Economics, 7(1):54–89,
January 2015. doi:10.1257/app.20130475. URL https://www.aeaweb.org/articles?id=
10.1257/app.20130475.
[195] Y W Teh and D. Görür. Indian buffet processes with power-law behavior. In Advances
in Neural Information Processing Systems, 2009.
[197] Y W Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian
buffet process. In International Conference on Artificial Intelligence and Statistics,
2007.
[198] R. Thibaux and M I Jordan. Hierarchical beta processes and the Indian buffet process.
In International Conference on Artificial Intelligence and Statistics, 2007.
[199] Zachary M. Thomas, Steven N. MacEachern, and Mario Peruggia. Reconciling curvature
and importance sampling based procedures for summarizing case influence in Bayesian
models. Journal of the American Statistical Association, 113:1669–1683, October 2018.
ISSN 1537-274X. doi:10.1080/01621459.2017.1360777.
[200] Michalis Titsias. The infinite gamma-poisson feature model. In Advances in Neural
Information Processing Systems, 2008.
[201] John W. Tukey and Donald H. McLaughlin. Less vulnerable confidence and significance
procedures for location based on a single sample: Trimming/winsorization 1. Sankhyā:
The Indian Journal of Statistics, Series A (1961-2002), 25(3):331–352, 1963.
[202] Angelika van der Linde. Local influence on posterior distributions under multiplicative
modes of perturbation. Bayesian Analysis, 2:319–332, 2007. URL http://www.math.
uni-bremen.de/~avdl/.
[203] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press,
1998. URL https://www.cambridge.org/core/books/asymptotic-statistics/
A3C7DAD3F7E66A1FA60E9C8FE132EE1D.
[204] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy,
David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan
Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman,
Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J
Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef
Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M.
Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0
Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python.
Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.
[205] M. J. Wainwright and M I Jordan. Graphical Models, Exponential Families, and
Variational Inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305,
2008.
[206] Chong Wang, John Paisley, and David Blei. Online variational inference for the
hierarchical Dirichlet process. In International Conference on Artificial Intelligence
and Statistics, 2011.
[207] Simon N. Wood. Thin plate regression splines. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 65(1):95–114, 2003.
doi:10.1111/1467-9868.00374. URL https://rss.onlinelibrary.wiley.com/
doi/abs/10.1111/1467-9868.00374.
[208] Kai Xu, Tor Erlend Fjelde, Charles Sutton, and Hong Ge. Couplings for multinomial
Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and
Statistics, 2021.
[209] Amit Zeisel, Ana B. Muñoz-Manchado, Simone Codeluppi, Peter Lönnerberg, Gioele
La Manno, Anna Juréus, Sueli Marques, Hermany Munguba, Liqun He, and Christer
Betsholtz. Cell types in the mouse cortex and hippocampus revealed by single-cell
RNA-seq. Science, 347(6226):1138–1142, 2015.
[210] Mingyuan Zhou, Haojun Chen, Lu Ren, Guillermo Sapiro, Lawrence Carin, and John W.
Paisley. Non-parametric Bayesian dictionary learning for sparse image representations.
In Advances in Neural Information Processing Systems. 2009.
[211] Mingyuan Zhou, Lauren Hannah, David Dunson, and Lawrence Carin. Beta-negative
binomial process and Poisson factor analysis. In International Conference on Artificial
Intelligence and Statistics, 2012.