Bayesian Data Selection
Abstract
Insights into complex, high-dimensional data can be obtained by discovering features of
the data that match or do not match a model of interest. To formalize this task, we intro-
duce the “data selection” problem: finding a lower-dimensional statistic—such as a subset
of variables—that is well fit by a given parametric model of interest. A fully Bayesian
approach to data selection would be to parametrically model the value of the statistic,
nonparametrically model the remaining “background” components of the data, and per-
form standard Bayesian model selection for the choice of statistic. However, fitting a
nonparametric model to high-dimensional data tends to be highly inefficient, statistically
and computationally. We propose a novel score for performing data selection, the “Stein
volume criterion (SVC)”, that does not require fitting a nonparametric model. The SVC
takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in
place of the Kullback–Leibler divergence. We prove that the SVC is consistent for data
selection, and establish consistency and asymptotic normality of the corresponding gen-
eralized posterior on parameters. We apply the SVC to the analysis of single-cell RNA
sequencing data sets using probabilistic principal components analysis and a spin glass
model of gene regulation.
Keywords: Bayesian nonparametrics, Bayesian theory, consistency, misspecification,
Stein discrepancy
1. Introduction
Scientists often seek to understand complex phenomena by developing working models for
various special cases and subsets. Thus, when faced with a large complex data set, a natural
question to ask is where and when a given working model applies. We formalize this question
statistically by saying that given a high-dimensional data set, we want to identify a lower-
dimensional statistic—such as a subset of variables—that follows a parametric model of
interest (the working model). We refer to this problem as “data selection”, in counterpoint
to model selection, since it requires selecting the aspect of the data to which a given model
applies.
For example, early studies of single-cell RNA expression showed that the expression of
individual genes was often bistable, which suggests that the system of cellular gene expres-
sion might be described with the theory of interacting bistable systems, or spin glasses, with
each gene a separate spin and each cell a separate observation. While it seems implausible
that such a model would hold in full generality, it is quite possible that there are subsets
of genes for which the spin glass model is a reasonable approximation to reality. Finding
such subsets of genes is a data selection problem. In general, a good data selection method
would enable one to (a) discover interesting phenomena in complex data sets, (b) identify
precisely where naive application of the working model to the full data set goes wrong, and
(c) evaluate the robustness of inferences made with the working model.
Perhaps the most natural Bayesian approach to data selection is to employ a semi-
parametric joint model, using the parametric model of interest for the low-dimensional
statistic (the “foreground”) and using a flexible nonparametric model to explain all other
aspects of the data (the “background”). Then, to infer where the foreground model ap-
plies, one would perform standard Bayesian model selection across different choices of the
foreground statistic. However, this is computationally challenging due to the need to in-
tegrate over the nonparametric model for each choice of foreground statistic, making this
approach quite difficult in practice. A natural frequentist approach to data selection would
be to perform a goodness-of-fit test for each choice of foreground statistic. However, this
still requires specifying an alternative hypothesis, even if the alternative is nonparametric,
and ensuring comparability between alternatives used for different choices of foreground
statistics is nontrivial. Moreover, developing goodness-of-fit tests for composite hypotheses
or hierarchical models is often difficult in practice.
In this article, we propose a new score—for both data selection and model selection—
that is similar to the marginal likelihood of a semi-parametric model but does not require
one to specify a background model, let alone integrate over it. The basic idea is to employ a
generalized marginal likelihood where we replace the foreground model likelihood by an ex-
ponentiated divergence with nice properties, and replace the background model’s marginal
likelihood with a simple volume correction factor. For the choice of divergence, we use a
kernelized Stein discrepancy (KSD) since it enables us to provide statistical guarantees and
is easy to estimate compared to other divergences—for instance, the Kullback–Leibler diver-
gence involves a problematic entropy term that cannot simply be dropped. The background
model volume correction arises roughly as follows: if the background model is well-specified,
then asymptotically, its divergence from the empirical distribution converges to zero and all
that remains of the background model’s contribution is the volume of its effective parameter
space. Consequently, it is not necessary to specify the background model, only its effective
dimension. To facilitate computation further, we develop a Laplace approximation for the
foreground model’s contribution to our proposed score.
This article makes a number of novel contributions. We introduce the data selection
problem in broad generality, and provide a thorough asymptotic analysis. We propose a
novel model/data selection score, which we refer to as the Stein volume criterion, that takes
the form of a generalized marginal likelihood using a KSD. We provide new theoretical re-
sults for this generalized marginal likelihood and its associated posterior, complementing
and building upon recent work on the frequentist properties of minimum KSD estima-
tors (Barp et al., 2019). Finally, we provide first-of-a-kind empirical data selection analyses
with two models that are frequently used in single-cell RNA sequencing analysis.
The article is organized as follows. In Section 2, we introduce the data selection problem
and our proposed method. In Section 3 we study the asymptotic properties of Bayesian data
selection methods and compare to model selection. Section 4 provides a review of related
work and Section 5 illustrates the method on a toy example. In Section 6, we prove (a)
consistency results for both data selection and model selection, (b) a Laplace approximation
for the proposed score, and (c) a Bernstein–von Mises theorem for the corresponding gen-
eralized posterior. In Section 7, we apply our method to probabilistic principal components
analysis (pPCA), assess its performance in simulations, and demonstrate it on single-cell
RNA sequencing (scRNAseq) data. In Section 8, we apply our method to a spin glass model
of gene expression, also demonstrated on an scRNAseq data set. Section 9 concludes with
a brief discussion.
2. Method
Suppose the data X (1) , . . . , X (N ) ∈ X are independent and identically distributed (i.i.d.),
where X ⊆ Rd . Suppose the true data-generating distribution P0 has density p0 (x) with
respect to Lebesgue measure, and let {q(x|θ) : θ ∈ Θ} be a parametric model of interest,
where Θ ⊆ Rm . We are interested in evaluating this model when applied to a projection of
the data onto a subspace, XF ⊆ X (the “foreground” space). Specifically, let XF := V⊤X be
a linear projection of a datapoint X ∈ X onto XF , where V is a matrix with orthonormal
columns which defines the foreground space. Let q(xF |θ) denote the distribution of XF
when X ∼ q(x|θ), and likewise, let p0 (xF ) be the distribution of XF when X ∼ p0 (x).
Even when the complete model q(x|θ) is misspecified with respect to p0 (x), it may be that
q(xF |θ) is well-specified with respect to p0 (xF ); see Figure 1 for a toy example. In such
cases, the parametric model is only partially misspecified—specifically, it is misspecified on
the “background” space XB , defined as the orthogonal complement of XF (that is, the set
of all vectors that are orthogonal to every vector in XF ).
Our goal is to find subspaces XF of the data space X for which the model q(xF |θ) is
correctly specified. We are not seeking a subset of datapoints, but rather a projection of all
the datapoints.
A natural Bayesian solution would be to replace the background component of the
assumed model, q(xB |xF , θ), with a more flexible component q̃(xB |xF , φB ) that is guaranteed
to be well-specified with respect to p0 (xB |xF ), such as a nonparametric model. The resulting
[Figure 1: (a) An example for which a bivariate normal model is partially misspecified; basis vectors for XF (foreground) and XB (background) are shown in blue and red, respectively. (c) A univariate normal model is misspecified for the data projection onto XB.]
proposed model/data selection score, termed the “Stein volume criterion” (SVC), is
$$K := \Big(\frac{2\pi}{N}\Big)^{m_B/2} \int \exp\Big(-\frac{N}{T}\,\widehat{\mathrm{NKSD}}\big(p_0(x_F)\,\|\,q(x_F|\theta)\big)\Big)\,\pi(\theta)\,d\theta \qquad (2)$$
where the “temperature” T > 0 is a hyperparameter and mB is the effective dimension of the background model parameter space. $\widehat{\mathrm{NKSD}}(\cdot\,\|\,\cdot)$ is an empirical estimate of the NKSD (Equations 4 and 5), and measures the mismatch between the data and the model over the foreground subspace.
There are three key properties of $\widehat{\mathrm{NKSD}}$ that distinguish it from other estimators of other divergences. First, it estimates the divergence directly, not just up to a data-dependent constant; this is essential for data selection consistency (Section 3.1). For instance, putting the log likelihood in place of $\frac{N}{T}\widehat{\mathrm{NKSD}}$ in Equation 2 fails to provide data selection consistency since it implicitly involves comparing the foreground entropy under P0. Second, $\widehat{\mathrm{NKSD}}$ converges at an O(1/N) rate when the model is correct; this is essential for nested data selection consistency (Section 3.2). In contrast, even if the foreground entropy under P0 is known exactly, using a Monte Carlo estimate of the Kullback–Leibler divergence in place of $\frac{N}{T}\widehat{\mathrm{NKSD}}$ fails since the convergence rate is only O(1/√N). Third, the NKSD exhibits subsystem independence (Section 6.1), which ensures that the SVC is comparable between foreground spaces of different dimension. We are unaware of any other divergence estimator with all three of these key properties.
The integral in Equation 2 can be approximated using techniques discussed in Sec-
tion 2.3. The hyperparameter T can be calibrated by comparing the coverage of the stan-
dard Bayesian posterior to the coverage of the nksd generalized posterior (Section A.1). The
The (2π/N)^{mB/2} factor penalizes higher-complexity background models. In general, we allow
mB to grow with N , particularly when the background model is nonparametric. Crucially,
the likelihood of the background model does not appear in our proposed score, sidestep-
ping the need to fit or even specify the background model—indeed, the only place that the
background model enters into the SVC is through mB .
Thus, rather than specify a background model and then derive mB , one can simply
specify an appropriate value of mB . Reasonable choices of mB can be derived by considering
the asymptotic behavior of a Pitman-Yor process mixture model, a common nonparametric
model that is a natural choice for a background model. A Pitman-Yor process mixture model
with discount parameter α ∈ (0, 1), concentration parameter ν > −α, and D-dimensional
component parameters will asymptotically have expected effective dimension
$$m_B \sim D\,\frac{\Gamma(\nu+1)}{\alpha\,\Gamma(\nu+\alpha)}\,N^{\alpha} \qquad (3)$$
under the prior, where aN ∼ bN means that aN/bN → 1 as N → ∞ and Γ(·) is the gamma function (Pitman, 2002, §3.3). As a default, we recommend setting mB = cB rB √N, where rB is the dimension of XB and cB is a constant chosen to match Equation 3 with α = 1/2. The √N scaling is particularly nice in terms of asymptotic guarantees; see Section 3.2.
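For concreteness, a minimal Python sketch of these two choices of mB is below; the function names are ours, and cB is left as an input since the precise way of matching Equation 3 depends on the background model one has in mind.

```python
import math

def pitman_yor_m_B(N, alpha=0.5, nu=1.0, D=1.0):
    # Expected effective dimension of a Pitman-Yor mixture background model under the
    # prior (Equation 3): m_B ~ D * Gamma(nu + 1) / (alpha * Gamma(nu + alpha)) * N**alpha
    return D * math.gamma(nu + 1.0) / (alpha * math.gamma(nu + alpha)) * N ** alpha

def default_m_B(N, r_B, c_B):
    # Recommended default: m_B = c_B * r_B * sqrt(N), with c_B chosen so that this
    # agrees with pitman_yor_m_B at alpha = 1/2 for the background model in mind.
    return c_B * r_B * math.sqrt(N)

# Example: the Pitman-Yor settings used in the toy example of Section 5 (alpha=0.5, nu=1, D=0.2)
print(pitman_yor_m_B(100, alpha=0.5, nu=1.0, D=0.2))
```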
The SVC uses a novel, normalized version of the ksd between densities p(x) and q(x):
$$\mathrm{NKSD}(p\,\|\,q) := \frac{\mathbb{E}_{X,Y\sim p}\big[(s_q(X)-s_p(X))^\top k(X,Y)\,(s_q(Y)-s_p(Y))\big]}{\mathbb{E}_{X,Y\sim p}\big[k(X,Y)\big]} \qquad (4)$$
where k(x, y) ∈ R is an integrally strictly positive definite kernel, sq (x) := ∇x log q(x), and
sp (x) := ∇x log p(x); see Section 6.1 for details. The numerator corresponds to the standard
ksd (Liu et al., 2016). The denominator, which is strictly positive and independent of q(x),
is a normalization factor that we have introduced to make the divergence comparable across
spaces of different dimension. See Section A.2 for kernel recommendations. Extending the
technique of Liu et al. (2016), we propose to estimate the normalized KSD using U-statistics:
where X^(i) ∼ p(x) i.i.d., the sums are over all i, j ∈ {1, . . . , N} such that i ≠ j, and
Importantly, Equation 5 does not require knowledge of sp (x), which is unknown in practice.
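For illustration, the following NumPy sketch evaluates a U-statistic estimator of this form for the radial basis function kernel used in Section 5; it assumes the standard Stein kernel u_q of Liu et al. (2016) for the numerator and the kernel normalization of Equation 4 for the denominator. The function names are ours, and the exact expression in Equation 5 is the authoritative one.

```python
import numpy as np

def rbf_kernel(x, y):
    # k(x, y) = exp(-||x - y||^2 / 2), the kernel used in the toy example of Section 5
    return np.exp(-0.5 * np.sum((x - y) ** 2))

def stein_u(x, y, score_q):
    # Standard KSD Stein kernel u_q(x, y) of Liu et al. (2016), written out for the RBF
    # kernel above; score_q(x) returns the Stein score s_q(x) = grad_x log q(x).
    d = x.shape[0]
    k = rbf_kernel(x, y)
    sx, sy = score_q(x), score_q(y)
    grad_y_k = (x - y) * k                       # nabla_y k(x, y)
    grad_x_k = (y - x) * k                       # nabla_x k(x, y)
    trace_term = (d - np.sum((x - y) ** 2)) * k  # tr(nabla_x nabla_y k)
    return k * (sx @ sy) + sx @ grad_y_k + grad_x_k @ sy + trace_term

def nksd_hat(X, score_q):
    # Ratio of U-statistics over all ordered pairs i != j: the numerator is the standard
    # KSD U-statistic, and the denominator normalizes by the kernel, as in Equation 4.
    N = len(X)
    num = den = 0.0
    for i in range(N):
        for j in range(N):
            if i != j:
                num += stein_u(X[i], X[j], score_q)
                den += rbf_kernel(X[i], X[j])
    return num / den
```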
$$\widehat{\mathrm{KL}}\big(p_0(x)\,\|\,q(x_F|\theta)\,\tilde q(x_B|x_F,\phi_B)\big) := -\frac{1}{N}\sum_{i=1}^{N}\log\Big(q\big(X_F^{(i)}|\theta\big)\,\tilde q\big(X_B^{(i)}|X_F^{(i)},\phi_B\big)\Big) - H.$$
Since multiplying the marginal likelihoods by a fixed constant does not affect the Bayes
factors, the following expression could be used instead of the marginal likelihood q̃(X (1:N ) |F)
to decide among foreground subspaces:
Now, consider a generalized marginal likelihood where the nksd replaces the kl:
$$\tilde K := \iint \exp\Big(-\frac{N}{T}\,\widehat{\mathrm{NKSD}}\big(p_0(x)\,\|\,q(x_F|\theta)\,\tilde q(x_B|x_F,\phi_B)\big)\Big)\,\pi(\theta)\,\pi_B(\phi_B)\,d\theta\,d\phi_B. \qquad (7)$$
If the foreground and background components are independent under p0 and under the augmented model, that is, p0(x) = p0(xF)p0(xB) and q̃(xB|xF, φB) = q̃(xB|φB), then the theory in Section 6 can be
extended to the full augmented model to show that
$$\frac{\log \tilde K}{\log K} \xrightarrow[N\to\infty]{P_0} 1, \qquad (8)$$
where K is the SVC (Equation 2). Thus, the SVC approximates the nksd marginal
likelihood of the augmented model, suggesting that the SVC may be a convenient al-
ternative to the standard marginal likelihood. Formally, Section 3 shows that the SVC
exhibits consistency properties similar to the standard marginal likelihood, even when
p0(x) ≠ p0(xF)p0(xB).
2.3 Computation
Next, we discuss methods for computing the SVC including exact solutions, Laplace/BIC
approximation, variational approximation, and comparing many possible choices of F. An
attractive feature of the SVC is that, unlike the fully Bayesian augmented model, the
computation time required does not grow with the background dimension mB .
where A, B, and C depend on the data X (1:N ) but not on θ. Therefore, we can compute the
SVC in closed form by choosing a multivariate Gaussian for the prior π(θ) in Equation 2;
see Section A.3.
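As a sketch of how such a closed form arises: assuming, as the passage above implies, that (N/T) times the estimated NKSD is a quadratic form θᵀAθ + Bᵀθ + C in θ, a Gaussian prior N(θ | μ, Σ) turns the integral in Equation 2 into a standard Gaussian integral. The following Python sketch evaluates the resulting log SVC under that assumption; the exact expressions used by the method are given in Section A.3.

```python
import numpy as np

def log_svc_quadratic(A, B, C, mu, Sigma, m_B, N):
    # log K when (N/T) * nksd_hat = theta^T A theta + B^T theta + C and the prior is
    # N(theta | mu, Sigma); includes the (2*pi/N)^(m_B/2) background volume factor.
    Sigma_inv = np.linalg.inv(Sigma)
    P = 2.0 * A + Sigma_inv            # precision matrix of the resulting Gaussian in theta
    b = Sigma_inv @ mu - B
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_P = np.linalg.slogdet(P)
    log_integral = (-0.5 * logdet_Sigma - 0.5 * logdet_P
                    + 0.5 * b @ np.linalg.solve(P, b)
                    - 0.5 * mu @ (Sigma_inv @ mu) - C)
    return 0.5 * m_B * np.log(2.0 * np.pi / N) + log_integral
```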
Now, θN (0) and θN (1) are the minimum Stein discrepancy estimators for F1 and F2 , respec-
tively. Given θN (0), we can approximate θN (1) by applying the implicit function theorem
and a first-order Taylor expansion (Section A.4):
Note that the derivatives of ℓj are often easy to compute with automatic differentiation (Baydin et al., 2018). Note also that when we are comparing one foreground subspace, such as XF1 = X, to many other foreground subspaces XF2, the inverse Hessian ∇²θ ℓ1(θN(0))⁻¹ only needs to be computed once. Thus, Equation 13 provides a fast approximate method
for computing Laplace or BIC approximations to the SVC for a large number of candidate
foregrounds F. We apply this technique in Section 7, where we find that it performs well
in simulation studies and in practice.
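A minimal sketch of this idea using automatic differentiation (here with JAX); the one-step update below, θN(1) ≈ θN(0) − ∇²θℓ1(θN(0))⁻¹ ∇θℓ2(θN(0)), is our reading of the first-order expansion, with the exact expression given by Equation 13 and Section A.4.

```python
import jax
import jax.numpy as jnp

def one_step_theta(theta_0, loss_1, loss_2):
    # theta_0 minimizes loss_1 (so grad loss_1(theta_0) = 0). A single Newton-type step
    # then approximates the minimizer of loss_2. The Hessian of loss_1 can be factored
    # once and reused across many candidate foreground subspaces.
    H1 = jax.hessian(loss_1)(theta_0)
    g2 = jax.grad(loss_2)(theta_0)
    return theta_0 - jnp.linalg.solve(H1, g2)
```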
Note that this variational approximation falls within the framework of generalized variational
inference proposed by Knoblauch et al. (2022).
This variational approximation to the SVC is particularly useful when we are aiming to
find the best subspace XF among a very large number of candidates, since we can jointly
optimize the variational parameters ζ and the choice of foreground subspace XF . Here,
we do not necessarily need to evaluate the SVC for all foreground subspaces XF under
consideration, and can instead rely on optimization methods to search for the best XF from
among a large set of possibilities (see Section 8 for an example). Practically, we recommend
using the local linear approximation in Section 2.3.3 when the goal is to compare SVC
values among many not-too-different foreground subspaces XF , and using the variational
approximation when the goal is to find one best XF from among a large and diverse set.
                                                         Consistency property
Score                                                    d.s.   nested d.s.   m.s.   nested m.s.
q̃(X(1:N)|F)   full marginal likelihood                    ✓          ✓          ✓          ✓
K(a)          foreground marg lik, background volume     ✗          ✗          ✓          ✓
K(b)          foreground marg NKSD                       ✓          ✗          ✓          ✓
K(c)          foreground marg KL, background volume      ✓          ✗          ✓          ✓
K(d)          foreground NKSD, background volume         ✓          ✓          ✓          ✗
K             foreground marg NKSD, background volume    ✓          ✓          ✓          ✓

Table 1: Consistency properties satisfied by various model/data selection scores. Only the Stein volume criterion K and the full marginal likelihood q̃(X(1:N)|F) satisfy all four desiderata. (d.s. = data selection, m.s. = model selection, marg = marginal, lik = likelihood.)
where θj,∗^kl := argminθ kl(p0(xFj)‖q(xFj|θ)) for j ∈ {1, 2}, that is, θj,∗^kl is the parameter value
that minimizes the kl divergence between the projected data distribution p0 (xFj ) and the
projected model q(xFj |θ). Thus, q̃(X (1:N ) |Fj ) asymptotically concentrates on the Fj on
which the projected model can most closely match the data distribution in terms of kl.
In Theorem 17, we show that under mild regularity conditions, the Stein volume criterion
behaves precisely the same way but with the nksd in place of the kl:
$$\frac{1}{N}\log\frac{K_1}{K_2} \xrightarrow[N\to\infty]{P_0} \frac{1}{T}\,\mathrm{NKSD}\big(p_0(x_{F_2})\,\|\,q(x_{F_2}|\theta_{2,*}^{\mathrm{nksd}})\big) - \frac{1}{T}\,\mathrm{NKSD}\big(p_0(x_{F_1})\,\|\,q(x_{F_1}|\theta_{1,*}^{\mathrm{nksd}})\big) \qquad (16)$$
where θj,∗^nksd := argminθ nksd(p0(xFj)‖q(xFj|θ)) for j ∈ {1, 2}. Therefore, q̃(X^(1:N)|F) and K both yield data selection consistency. It is important here that the SVC uses a true
divergence, rather than a divergence up to a data-dependent constant. If we instead used
$$K^{(a)} := \Big(\frac{2\pi}{N}\Big)^{m_B/2} q\big(X_F^{(1:N)}\big), \qquad (17)$$
which employs the foreground marginal likelihood $q(X_F^{(1:N)}) = \int q(X_F^{(1:N)}|\theta)\,\pi(\theta)\,d\theta$ and a
background volume correction, we would get qualitatively different behavior (Section B.2):
$$\frac{1}{N}\log\frac{K_1^{(a)}}{K_2^{(a)}} \xrightarrow[N\to\infty]{P_0} \mathrm{KL}\big(p_0(x_{F_2})\,\|\,q(x_{F_2}|\theta_{2,*}^{\mathrm{kl}})\big) - \mathrm{KL}\big(p_0(x_{F_1})\,\|\,q(x_{F_1}|\theta_{1,*}^{\mathrm{kl}})\big) + H_{F_2} - H_{F_1} \qquad (18)$$
where $H_{F_j} := -\int p_0(x_{F_j})\log p_0(x_{F_j})\,dx_{F_j}$ is the entropy of p0(xFj) for j ∈ {1, 2}. In short,
the naive score K(a) is a bad choice: it decides between data subspaces based not just on
how well the parametric foreground model performs, but also on the entropy of the data
distribution in each space. As a result, K(a) does not exhibit data selection consistency.
subset of another (Vuong, 1989). If the model q(x|θ) is well-specified over XF1 , then it is
guaranteed to be well-specified over any lower-dimensional sub-subspace XF2 ⊂ XF1 ; in this
case, we refer to asymptotic concentration on F1 as “nested data selection consistency”.
In this situation, kl(p0(xFj)‖q(xFj|θj,∗^kl)) and nksd(p0(xFj)‖q(xFj|θj,∗^nksd)) are both zero for
j ∈ {1, 2}, making it necessary to look at higher-order terms in Equations 15 and 16. In
Section B.3, we show that if XF2 ⊂ XF1 , q(x|θ) is well-specified over XF1 , the background
models are well-specified, and their dimensions mB1 and mB2 are constant with respect to
N , then
$$\frac{1}{\log N}\log\frac{\tilde q(X^{(1:N)}\mid F_1)}{\tilde q(X^{(1:N)}\mid F_2)} \xrightarrow[N\to\infty]{P_0} \frac{1}{2}\big(m_{F_2} + m_{B_2} - m_{F_1} - m_{B_1}\big) \qquad (19)$$
where mFj is the effective dimension of the parameter space of q(xFj |θ). In Theorem 17,
we show that under mild regularity conditions, the SVC behaves the same way:
$$\frac{1}{\log N}\log\frac{K_1}{K_2} \xrightarrow[N\to\infty]{P_0} \frac{1}{2}\big(m_{F_2} + m_{B_2} - m_{F_1} - m_{B_1}\big). \qquad (20)$$
Thus, so long as mF2 + mB2 > mF1 + mB1 whenever XF2 ⊂ XF1 , the marginal likelihood and
the SVC asymptotically concentrate on the larger foreground F1 ; hence, they both exhibit
nested data selection consistency. This is a natural assumption since the background model
is generally more flexible—on a per dimension basis—than the foreground model.
The volume correction (2π/N )mB /2 in the definition of the SVC is important for nested
data selection consistency (Equation 20). An alternative score without that correction,
$$K^{(b)} := \int \exp\Big(-\frac{N}{T}\,\widehat{\mathrm{NKSD}}\big(p_0(x_F)\,\|\,q(x_F|\theta)\big)\Big)\,\pi(\theta)\,d\theta, \qquad (21)$$
exhibits data selection consistency (Equation 16 holds for K(b) ), but not nested data se-
lection consistency; see Sections B.2 and B.3. More subtly, the asymptotics of the SVC in
the case of nested data selection also depend on the variance of U-statistics. To illustrate,
consider a score that is similar to the SVC but uses $\widehat{\mathrm{KL}}$ instead of $\widehat{\mathrm{NKSD}}$:
$$K^{(c)} := \Big(\frac{2\pi}{N}\Big)^{m_B/2} \int \exp\Big(-N\,\widehat{\mathrm{KL}}\big(p_0(x_F)\,\|\,q(x_F|\theta)\big)\Big)\,\pi(\theta)\,d\theta \qquad (22)$$
where $\widehat{\mathrm{KL}}(p_0(x_F)\,\|\,q(x_F|\theta)) := -\frac{1}{N}\sum_{i=1}^{N}\log q(X_F^{(i)}|\theta) - H_F$ and HF is required to be known.
The score K(c) exhibits data selection consistency, but not nested data selection consistency. The reason is that the error in estimating the kl is of order 1/√N by the central limit theo-
rem, and this source of error dominates the log N term contributed by the volume correction;
see Section B.3. Meanwhile, the error in estimating the nksd is of order 1/N when the
model is well-specified, due to the rapid convergence rate of the U-statistic estimator. Thus,
in the SVC, this source of error is dominated by the volume correction; see Theorem 12.
The nested data selection results we have described so far assume mB does not depend
on N, or at least mB2 − mB1 does not depend on N (Theorem 17). However, in Section 2.1, we suggest setting mB = cB rB √N where cB is a constant and rB is the dimension of XB.
With this choice, the asymptotics of the SVC for nested data selection become (Theorem 17)
$$\frac{1}{\sqrt{N}\,\log N}\log\frac{K_1}{K_2} \xrightarrow[N\to\infty]{P_0} \frac{c_B}{2}\big(r_{B_2} - r_{B_1}\big). \qquad (23)$$
Since rB1 < rB2 when XF2 ⊂ XF1 , the SVC concentrates on the larger foreground F1 ,
yielding nested data selection consistency. Going beyond the well-specified case, Theorem 17
shows that Equation 23 holds when nksd(p0(xF1)‖q(xF1|θ1,∗^nksd)) = nksd(p0(xF2)‖q(xF2|θ2,∗^nksd)) ≠ 0, that is, when the models are misspecified by the same amount as measured by nksd.
$$\frac{1}{N}\log\frac{K_1}{K_2} \xrightarrow[N\to\infty]{P_0} \frac{1}{T}\,\mathrm{NKSD}\big(p_0(x_F)\,\|\,q_2(x_F|\theta_{2,*}^{\mathrm{nksd}})\big) - \frac{1}{T}\,\mathrm{NKSD}\big(p_0(x_F)\,\|\,q_1(x_F|\theta_{1,*}^{\mathrm{nksd}})\big) \qquad (25)$$
where θj,∗^nksd := argminθj nksd(p0(xF)‖qj(xF|θj)) for j ∈ {1, 2}. Thus, for both scores, con-
centration occurs on the model that comes closer to the data distribution in terms of the
corresponding divergence (kl or nksd).
For nested model selection, suppose both foreground models are well-specified and
mB1 = mB2 . Letting mF ,j be the parameter dimension of qj (xF |θj ), we have (Section B.5)
exhibits model selection consistency (Equation 25 holds for K(d) ) but not nested model
selection consistency (Section B.5). The Laplace and BIC approximations to the SVC
(Equations 10 and 11) explicitly correct for the foreground parameter volume without in-
tegrating.
4. Related work
Projection pursuit methods are closely related to data selection in that they attempt to
identify “interesting” subspaces of the data. However, projection pursuit uses certain pre-
specified objective functions to optimize over projections, whereas our method allows one
to specify a model of interest (Huber, 1985).
Another related line of research is on Bayesian goodness-of-fit (GOF) tests, which com-
pute the posterior probability that the data comes from a given parametric model versus
a flexible alternative such as a nonparametric model. Our setup differs in that it aims to
compare among different semiparametric models. Nonetheless, in an effort to address the
GOF problem, a number of authors have developed nonparametric models with tractable
marginals (Verdinelli and Wasserman, 1998; Berger and Guglielmi, 2001), and using these
models as the background component in an augmented model could in theory solve data se-
lection problems. In practice, however, such models can only be applied to one-dimensional
or few-dimensional data spaces. In Section 7, we show that naively extending the method of
Berger and Guglielmi (2001) to the multi-dimensional setting has fundamental limitations.
There is a sizeable frequentist literature on GOF testing using discrepancies (Gret-
ton et al., 2012; Barron, 1989; Györfi and Van Der Meulen, 1991). Our proposed method
builds directly on the KSD-based GOF test proposed by Liu et al. (2016) and Chwialkowski
et al. (2016). However, using these methods to draw comparisons between different fore-
ground subspaces is non-trivial, since the set of alternative models considered by the GOF
test, though nonparametric, will be different over data spaces with different dimensionality.
Moreover, the Bayesian aspect of the SVC makes it more straightforward to integrate prior
information and employ hierarchical models.
In composite likelihood methods, instead of the standard likelihood, one uses the prod-
uct of the conditional likelihoods of selected statistics (Lindsay, 1988; Varin et al., 2011).
Composite likelihoods have seen widespread use, often for robustness or computational
purposes. However, in composite likelihood methods, the choice of statistics is fixed be-
fore performing inference. In contrast, in data selection the choice of statistics is a central
quantity to be inferred.
Relatedly, our work connects with the literature on robust Bayesian methods. Doksum
and Lo (1990) propose conditioning on the value of an insufficient statistic, rather than
the complete data set, when performing inference; also see Lewis et al. (2021). However,
making an appropriate choice of statistic requires one to know which aspects of the model
are correct; in contrast, our procedure infers the choice of statistic. The nksd posterior also
falls within the general class of Gibbs posteriors, which have been studied in the context of
robustness, randomized estimators, and generalized belief updating (Zhang, 2006a,b; Jiang
and Tanner, 2008; Bissiri et al., 2016; Jewson et al., 2018; Miller and Dunson, 2019).
Our theoretical results also contribute to the emerging literature on Stein discrepan-
cies (Anastasiou et al., 2021). Barp et al. (2019) recently proposed minimum kernelized
Stein discrepancy estimators and established their consistency and asymptotic normality.
In Section 6, we establish a Bayesian counterpart to these results, showing that the nksd
posterior is asymptotically normal (in the sense of Bernstein–von Mises) and admits a
Laplace approximation. To prove this result, we rely on the recent work of Miller (2021) on
the asymptotics of generalized posteriors. Since Barp et al. (2019) show that the kernelized
Stein discrepancy is related to the Hyvärinen divergence in that both are Stein discrep-
ancies, our work bears an interesting relationship to that of Shao et al. (2018), who use
a Bayesian version of the Hyvärinen divergence to perform model selection with improper
priors. They derive a consistency result analogous to Equation 16, however, their model
selection score takes the form of a prequential score, not a Gibbs marginal likelihood as in
the SVC, and cannot be used for data selection.
In independent recent work, Matsubara et al. (2022) propose a Gibbs posterior based on
the KSD and derive a Bernstein-von Mises theorem similar to Theorem 9 using the results
of Miller (2021). Their method is not motivated by the Bayesian data selection problem but
rather by (1) inference for energy-based models with intractable normalizing constants and
(2) robustness to ε-contamination. Their Bernstein–von Mises theorem differs from ours in
that it applies to a V-statistic estimator of the KSD rather than a U-statistic estimator of
the NKSD.
Our linear approximation to the minimum Stein discrepancy estimator (Section 2.3.3) is
inspired by previous work on empirical influence functions and the Swiss Army infinitesimal
jackknife (Giordano et al., 2019; Koh and Liang, 2017). These previous methods similarly
compute the linear response of an extremum estimator with respect to perturbations of the
data set, but focus on the effects of dropping datapoints rather than data subspaces.
5. Toy example
The purpose of this toy example is to illustrate the behavior of the Stein volume criterion,
and compare it to some of the defective alternatives listed in Table 1, in a simple setting
where all computations can be done analytically (Section A.3). In all of the following
experiments, we simulated data from a bivariate normal distribution: X (1) , . . . , X (N ) i.i.d. ∼
N((0, 0)⊤, Σ0).
To set up the Stein volume criterion, we set T = 5 and we choose a radial basis func-
tion kernel, k(x, y) = exp(−(1/2)‖x − y‖²), which factors across dimensions. We considered
both data set size-independent values of mB (in particular, mB = 5 rB ) and data set size-
dependent values of mB (in particular, Equation 3 with α = 0.5, ν = 1, and D = 0.2,
where fractional values of D correspond to shared parameters across components in the
Pitman-Yor mixture model), obtaining very similar results in each case (shown in Figures 2
and 10, respectively). These choices of mB ensure that, except for at very small N , the
background model has more parameters per data dimension than each of the foreground
models considered below, which have just one. In particular, mB > 1 rB for all N (in the
size-independent case) and for N ≥ 5 (in the size-dependent case).
first dimension (defined by the projection matrix VF1 = (1, 0)⊤) or the second dimension (projection matrix VF2 = (0, 1)⊤). The model is only well-specified for F1 (not F2), so a
successful data selection procedure would asymptotically select F1 .
In Figure 2a, we see that the SVC correctly concentrates on F1 as the number of
datapoints N increases, with the log SVC ratio growing linearly in N , as predicted by
Equation 16. Meanwhile, the naive alternative score K(a) (Equation 17) fails since it depends
on the foreground entropies, while K(b) (Equation 21) succeeds since the volume correction
is negligible in this case; see Section 3.1 and Table 1.
Next, we examine the nested data selection case. We use the same model (Equation 29), but
we set Σ0 = I so that the model is well-specified even without being projected. We compare
the complete data space (XF1 = X , projection matrix VF1 = I) to the first dimension alone
(projection matrix VF2 = (1, 0)⊤). Nested data selection consistency demands that the
higher-dimensional data space XF1 be preferred asymptotically, since the model is well-
specified for both XF1 and XF2 . Figure 2b shows that this is indeed the case for the Stein
volume criterion, with the log SVC ratio growing at a log N rate when mB is independent of
N , as predicted by Equation 20. When mB depends on N via the Pitman-Yor expression,
the log SVC ratio grows at a N α log N rate (Figure 10b). Meanwhile, Figure 2b shows that
K(a) and K(b) both fail to exhibit nested data selection consistency, in accordance with our
theory (Section 3.2 and Table 1).
Finally, we examine model selection and nested model selection consistency. We again
set Σ0 = I. We first compare the (well-specified) model q(x|θ) = N (x | θ, I) to the
(misspecified) model q(x|θ) = N (x | θ, 2I), using the prior π(θ) = N (θ | (0, 0)⊤, 10I) for
both models. As shown in Figure 2c, the SVC correctly concentrates on the first model,
with the log SVC ratio growing linearly in N , as predicted by Equation 25. The same
asymptotic behavior is exhibited by K(a) , which is equivalent to the standard Bayesian
marginal likelihood in this setting (Section 3.3). Finally, to check nested model selection
consistency, we compare two well-specified nested models: q(x) = N (x | (0, 0)⊤, I) and
q(x|θ) = N (x | θ, I). Figure 2d shows that the SVC correctly selects the simpler model
(that is, the model with smaller parameter dimension) and the log SVC ratio grows as
log N (Equation 27). This, too, matches the behavior of the standard Bayesian marginal
likelihood, seen in the plot of K(a) .
6. Theory
In this section we describe our formal theoretical results. We start by studying the NKSD
and then the NKSD posterior, before finally establishing data and model selection consis-
tency for the SVC.
[Figure 2 plots: four panels (a)–(d) showing log score ratios versus the number of samples; panels (c) and (d) are titled “Model Selection” and “Nested Model Selection”.]
Figure 2: Behavior of the Stein volume criterion K, the foreground marginal likelihood with
a background volume correction K(a) , and the foreground marginal nksd K(b) on
toy examples. Here, we set mB = 5 rB . The plots show the results for 5 randomly
generated data sets (thin lines) and the average over 100 random data sets (bold
lines).
See Section C.1 for the proof. Subsystem independence is powerful since it separates the
problem of evaluating the foreground model from that of evaluating the background model.
A modified version applies to the estimator $\widehat{\mathrm{NKSD}}(p\,\|\,q)$ (Equation 5); see Proposition 20.
where uθ (x, y) is the u(x, y) function from Equation 5 with qθ in place of q. For the case
of N = 1, we define f1 (θ) = 0 by convention. Note that −N fN (θ) plays the role of the log
likelihood. Also define
$$f(\theta) := \frac{1}{T}\,\mathrm{NKSD}\big(p_0(x)\,\|\,q(x|\theta)\big), \qquad (36)$$
$$z_N := \int_\Theta \exp(-N f_N(\theta))\,\pi(\theta)\,d\theta,$$
$$\pi_N(\theta) := \frac{1}{z_N}\exp(-N f_N(\theta))\,\pi(\theta),$$
where π(θ) is a prior density on Θ. Note that πN (θ)dθ is the NKSD posterior and zN is the
corresponding generalized marginal likelihood employed in the SVC. Denote the gradient
and Hessian of f by f 0 (θ) = ∇θ f (θ) and f 00 (θ) = ∇2θ f (θ), respectively. To ensure that
the nksd posterior is well defined and has an isolated maximum, we assume the following
condition.
Condition 8 Suppose Θ ⊆ Rm is a convex set and (a) Θ is compact or (b) Θ is open and
fN is convex on Θ with probability 1 for all N . Assume zN < ∞ a.s. for all N . Assume f
has a unique minimizer θ∗ ∈ Θ, f 00 (θ∗ ) is invertible, π is continuous at θ∗ , and π(θ∗ ) > 0.
By Proposition 4, f has a unique minimizer whenever {Qθ : θ ∈ Θ} is well-specified and
identifiable, that is, when Qθ = P0 for some θ and θ 7→ Qθ is injective.
In Theorem 9 below, we establish the following results: (1) the minimum nksd[ converges
to the minimum nksd; (2) πN concentrates around the minimizer of the nksd; (3) the
Laplace approximation to zN is asymptotically correct; and (4) πN is asymptotically normal
in the sense of Bernstein–von Mises. The primary regularity conditions we need for this
theorem are restraints on the derivatives of sqθ with respect to θ (Condition 10). Our proof of
Theorem 9 relies on the theory of generalized posteriors developed byPMiller (2021). We use
k·k for the Euclidean–Frobenius norms: for vectors A ∈ RD , kAk = (Pi A2i )1/2 ; for matrices
A ∈ RD×D , kAk = ( i,j A2i,j )1/2 ; for tensors A ∈ RD×D×D , kAk = ( i,j,k A2i,j,k )1/2 ; and so
P
on.
3. $$z_N \sim \frac{\exp(-N f_N(\theta_N))\,\pi(\theta_*)}{|\det f''(\theta_*)|^{1/2}}\Big(\frac{2\pi}{N}\Big)^{m/2} \qquad (38)$$
almost surely, where aN ∼ bN means that aN/bN → 1 as N → ∞, and

4. letting hN denote the density of √N(θ − θN) when θ is sampled from πN, we have that hN converges to N(0, f''(θ∗)⁻¹) in total variation, that is,
$$\int_{\mathbb{R}^m}\big|\,h_N(\tilde\theta) - \mathcal{N}\big(\tilde\theta \mid 0,\ f''(\theta_*)^{-1}\big)\,\big|\,d\tilde\theta \xrightarrow[N\to\infty]{a.s.} 0. \qquad (39)$$
The proof is in Section C.2. We write $\nabla_\theta^2 s_{q_\theta}$ to denote the tensor in $\mathbb{R}^{d\times m\times m}$ in which entry $(i, j, k)$ is $\partial^2 s_{q_\theta}(x)_i/\partial\theta_j\partial\theta_k$. Likewise, $\nabla_\theta^3 s_{q_\theta}$ denotes the tensor in $\mathbb{R}^{d\times m\times m\times m}$ in which entry $(i, j, k, \ell)$ is $\partial^3 s_{q_\theta}(x)_i/\partial\theta_j\partial\theta_k\partial\theta_\ell$. We write $\mathbb{N}$ to denote the set of natural numbers.
Condition 10 (Stein score regularity) Assume sqθ (x) has continuous third-order par-
tial derivatives with respect to the entries of θ on Θ. Suppose that for any compact, convex
subset C ⊆ Θ, there exist continuous functions g0,C , g1,C ∈ L1 (P0 ) such that for all θ ∈ C,
x ∈ X,
Definition 14 (Nested data selection consistency) For j ∈ {1, 2}, consider fore-
ground model projections Mj := {q(xFj |θ) : θ ∈ Θ}. We say that ρ satisfies “nested data
selection consistency” if ρ(M1 , M2 ) → ∞ as N → ∞ when M1 is well-specified with respect
to p0 (xF1 ), XF2 ⊂ XF1 , and dim(XF2 ) < dim(XF1 ).
Definition 15 (Model selection consistency) For j ∈ {1, 2}, consider foreground mod-
els Mj := {qj (xF |θj ) : θj ∈ Θj }. We say that ρ satisfies “model selection consistency” if
ρ(M1 , M2 ) → ∞ as N → ∞ when M1 is well-specified with respect to p0 (xF ) and M2 is
misspecified.
Definition 16 (Nested model selection consistency) For j ∈ {1, 2}, consider fore-
ground models Mj := {qj (xF |θj ) : θj ∈ Θj }. We say that ρ satisfies “nested model selection
consistency” if ρ(M1 , M2 ) → ∞ as N → ∞ when M1 is well-specified with respect to p0 (xF ),
M1 ⊂ M2 , and dim(Θ1 ) < dim(Θ2 ).
In each case, ρ may diverge almost surely (“strong consistency”) or in probability (“weak
consistency”). Note that in Definitions 13–14, the difference between M1 and M2 is the
choice of foreground data space F, whereas in Definitions 15–16, M1 and M2 are over the
same foreground space but employ different model spaces.
In Theorem 17, we show that the SVC has the asymptotic properties outlined in Sec-
tion 3. In combination with the subsystem independence properties of the NKSD (Propo-
sitions 6 and 20), Theorem 17 also leads to the conclusion that the SVC approximates the
NKSD marginal likelihood of the augmented model (Equation 8). Our proof is similar in
spirit to previous results for model selection with the standard marginal likelihood, notably
those of Hong and Preston (2005) and Huggins and Miller (2021), but relies on the special
properties of the nksd marginal likelihood in Theorem 12.
Theorem 17 For j ∈ {1, 2}, assume the conditions of Theorem 12 hold for model Mj defined on XFj, with density qj(xFj|θj) for θj ∈ Θj ⊆ R^{mFj,j}. Let Kj,N be the Stein volume criterion for Mj, with background model penalty mBj = mBj(N), and let θj,∗ := argminθj nksd(p0(xFj)‖qj(xFj|θj)). Then:
$$\frac{1}{N}\log\frac{K_{1,N}}{K_{2,N}} \xrightarrow[N\to\infty]{P_0} \frac{1}{T}\,\mathrm{NKSD}\big(p_0(x_{F_2})\,\|\,q_2(x_{F_2}|\theta_{2,*})\big) - \frac{1}{T}\,\mathrm{NKSD}\big(p_0(x_{F_1})\,\|\,q_1(x_{F_1}|\theta_{1,*})\big).$$

2. If nksd(p0(xF1)‖q1(xF1|θ1,∗)) = nksd(p0(xF2)‖q2(xF2|θ2,∗)) = 0 and mB2 − mB1 does not depend on N, then
$$\frac{1}{\log N}\log\frac{K_{1,N}}{K_{2,N}} \xrightarrow[N\to\infty]{P_0} \frac{1}{2}\big(m_{F_2,2} + m_{B_2} - m_{F_1,1} - m_{B_1}\big).$$

3. If nksd(p0(xF1)‖q1(xF1|θ1,∗)) = nksd(p0(xF2)‖q2(xF2|θ2,∗)), mB1 = cB1 √N, and mB2 = cB2 √N, where cB1 and cB2 are positive and constant in N, then
$$\frac{1}{\sqrt{N}\,\log N}\log\frac{K_{1,N}}{K_{2,N}} \xrightarrow[N\to\infty]{P_0} \frac{1}{2}\big(c_{B_2} - c_{B_1}\big).$$
The proof is in Section C.4. In particular, assuming the conditions of Theorem 12,
we obtain the following consistency results in terms of convergence in probability. Let
Dj := nksd(p0(xFj)‖qj(xFj|θj,∗)) for j ∈ {1, 2}.
• If mBj = o(N/ log N ) then the SVC exhibits data selection consistency and model
selection consistency. This holds by Theorem 17 (part 1) since D2 > D1 = 0.
• If mB1 = mB2 then the SVC exhibits nested model selection consistency. This holds
by Theorem 17 (part 2) since D1 = D2 = 0, mB2 − mB1 = 0, and mF2 ,2 > mF1 ,1 .
$$Z^{(i)} \sim \mathcal{N}(0, I_k), \qquad X^{(i)} \mid Z^{(i)} \sim \mathcal{N}\big(H Z^{(i)},\ v I_d\big), \qquad (46)$$
where U is a d × k matrix with orthonormal columns (that is, it lies on the Stiefel manifold)
and L is a k × k diagonal matrix. We use the priors suggested by Minka (2001),
$$U \sim \mathrm{Uniform}(\mathcal{U}), \qquad L_{ii} \sim \mathrm{InverseGamma}(\alpha/2,\ \alpha/2), \qquad v \sim \mathrm{InverseGamma}\big((\alpha/2+1)(d-k)-1,\ (\alpha/2)(d-k)\big), \qquad (48)$$
where $\mathcal{U}$ is the set of d × k matrices with orthonormal columns and Lii is the ith diagonal
entry of L. We set α = 0.1 in the following experiments, and we use pymanopt (Townsend
et al., 2016) to optimize U over the Stiefel manifold (Section D).
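For illustration, a NumPy sketch of sampling from these priors; drawing U uniformly on the Stiefel manifold via a QR decomposition is a standard device, the (shape, scale) parameterization of the inverse-gamma distribution is an assumption on our part, and the function name is ours.

```python
import numpy as np

def sample_ppca_prior(d, k, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # U ~ Uniform over d x k matrices with orthonormal columns (Stiefel manifold),
    # via QR of a Gaussian matrix with a sign fix so the distribution is Haar.
    G = rng.standard_normal((d, k))
    Q, R = np.linalg.qr(G)
    U = Q * np.sign(np.diag(R))
    # InverseGamma(shape=a, scale=b) sampled as 1 / Gamma(shape=a, rate=b)
    inv_gamma = lambda a, b: 1.0 / rng.gamma(a, 1.0 / b)
    L = np.diag([inv_gamma(alpha / 2, alpha / 2) for _ in range(k)])
    v = inv_gamma((alpha / 2 + 1) * (d - k) - 1, (alpha / 2) * (d - k))
    return U, L, v
```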
7.1 Simulations
In simulations, we evaluate the ability of the SVC to detect partial misspecification. We
set d = 6, draw the first four dimensions from a pPCA model with k = 2 and
$$H = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & 1 \\ -1 & -1 \end{pmatrix}, \qquad (49)$$
and generate dimensions 5 and 6 in such a way that pPCA is misspecified. We consider two
misspecified scenarios: scenario A (Figure 3a) is that
$$W^{(i)} \sim \mathrm{Bernoulli}(0.5), \qquad X^{(i)}_{5:6} \mid W^{(i)} \sim \mathcal{N}\big(0,\ \Sigma_{W^{(i)}}\big), \qquad (50)$$
where $\Sigma_{W^{(i)}} = (0.05)^{W^{(i)}} I_2$. Scenario B (Figure 3d) is the same but with
$$\Sigma_{W^{(i)}} = \begin{pmatrix} 1 & (-1)^{W^{(i)}}\,0.99 \\ (-1)^{W^{(i)}}\,0.99 & 1 \end{pmatrix}. \qquad (51)$$
Scenario B is more challenging because the marginals of the misspecified dimensions are
still Gaussian, and thus, misspecification only comes from the dependence between X5 and
X6 . As illustrated in Figures 3b and 3e, both kinds of misspecification are very hard to see
in the lower-dimensional latent representation of the data.
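For reference, a NumPy sketch of the two simulation scenarios (Equations 49–51); the pPCA observation noise variance v is not specified above, so the value used here is an arbitrary placeholder.

```python
import numpy as np

H = np.array([[1., 0.], [-1., 1.], [0., 1.], [-1., -1.]])  # Equation 49

def simulate(N, scenario="A", v=0.1, seed=0):
    # v is a placeholder noise variance for the pPCA dimensions (not specified in the text).
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((N, 2))                              # Z^(i) ~ N(0, I_2)
    X14 = Z @ H.T + np.sqrt(v) * rng.standard_normal((N, 4))     # dims 1-4: pPCA, Equation 46
    W = rng.binomial(1, 0.5, size=N)                             # W^(i) ~ Bernoulli(0.5)
    if scenario == "A":
        # Sigma_W = 0.05^W * I_2 (Equation 50)
        X56 = np.sqrt(0.05 ** W)[:, None] * rng.standard_normal((N, 2))
    else:
        # unit variances with correlation (-1)^W * 0.99 (Equation 51)
        rho = (-1.0) ** W * 0.99
        e1 = rng.standard_normal(N)
        e2 = rho * e1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(N)
        X56 = np.stack([e1, e2], axis=1)
    return np.hstack([X14, X56])
```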
Our method can be used to both (i) detect misspecified subsets of dimensions, and
(ii) conversely, find a maximal subset of dimensions for which the pPCA model provides a
reasonable fit to the data. We set T = 0.05 in the SVC, based on the calibration procedure
[Figure 3: (a) Scenario A, misspecified dimensions. (b) Scenario A, pPCA latent space. (c) Scenario A, balanced accuracy in detecting misspecified dimensions (SVC vs. Polya Tree, versus number of samples). (d) Scenario B, misspecified dimensions. (e) Scenario B, pPCA latent space. (f) Scenario B, accuracy in detecting misspecified dimensions. (g) Mean runtime (seconds) over 5 repeats (SVC vs. Polya Tree, versus number of samples).]
in Section A.1 (Section D.3). We use the Pitman-Yor mixture model expression for the
background model dimension (Equation 3), with α = 0.5, ν = 1, and D = 0.2. This
value of D ensures that the number of background model parameters per data dimension
is greater than the number of foreground model parameters per data dimension except
for at very small N , since there are two foreground parameters for each additional data
dimension in the pPCA model, and mB > 2 rB for N ≥ 20. We performed leave-one-out
data selection, comparing the foreground space XF0 = X to foreground spaces XFj for
j ∈ {1, . . . , d}, which exclude the jth dimension of the data. We computed the log SVC
ratio log(Kj /K0 ) = log Kj − log K0 using the BIC approximation to the SVC (Section 2.3.2)
and the approximate optima technique (Section 2.3.3). We quantify the performance of the
method in detecting misspecified dimensions in terms of the balanced accuracy, defined as
(TN/N + TP/P)/2, where TN is the number of true negatives (dimension by dimension), N is the number of negatives, TP is the number of true positives, and P is the number of
positives. Experiments were repeated independently five times. Figures 3c and 3f show that
as the sample size increases, the SVC correctly infers that dimensions 1 through 4 should
be included and dimensions 5 and 6 should be excluded.
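For completeness, the balanced accuracy computation as defined above (a trivial Python sketch; variable names are ours):

```python
def balanced_accuracy(selected, truly_misspecified):
    # selected[j] = True if dimension j was flagged as misspecified (excluded from the foreground);
    # truly_misspecified[j] = True for the dimensions generated to violate the pPCA model (5 and 6).
    tp = sum(s and t for s, t in zip(selected, truly_misspecified))
    tn = sum((not s) and (not t) for s, t in zip(selected, truly_misspecified))
    p = sum(truly_misspecified)
    n = len(truly_misspecified) - p
    return 0.5 * (tn / n + tp / p)
```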
To benchmark our method, we compare with an alternative approach that uses an explicit
augmented model. The Pólya tree is a nonparametric model with a closed-form marginal
likelihood that is tractable for one-dimensional data (Lavine, 1992). We define a flexible
background model by sampling each dimension j of the background space independently as
with the Pólya tree constructed as by Berger and Guglielmi (2001) (Section D.4). We set
F = N (0, 10), F̃ = N (0, 10), and η = 1000 so that the model is weighted only very weakly
towards the base distribution.
We performed data selection using the marginal likelihood of the Pólya tree augmented
model, computing the marginal of the pPCA foreground model using the approximation
of Minka (2001). The accuracy results for data selection are in Figures 3c and 3f. On
scenario A (Equation 50), the Pólya tree augmented model requires significantly more data
to detect which dimensions are misspecified. On scenario B (Equation 51) the Pólya tree
augmented model fails entirely, preferring the full data space XF0 = X which includes all
dimensions (Figure 3f). The reason is that the background model is misspecified due to
the assumption of independent dimensions, and thus, the asymptotic data selection results
(Equations 15 and 19) do not hold. This could be resolved by using a richer background
model that allows for dependence between dimensions, however, computing the marginal
likelihood under such a model would be computationally challenging. Even with the inde-
pendence assumption, the Pólya tree approach is already substantially slower than the SVC
(Figure 3g).
Single-cell RNA sequencing (scRNAseq) has emerged as a powerful technology for high-
throughput characterization of individual cells. It provides a snapshot of the transcriptional
state of each cell by measuring the number of RNA transcripts from each gene. PCA is
widely used to study scRNAseq data sets, both as a method for visualizing different cell
types in the data set and as a pre-processing technique, where the latent embedding is used
for downstream tasks like clustering and lineage reconstruction (Qiu et al., 2017; van Dijk
et al., 2018). We applied data selection to answer two practical questions in the application
of probabilistic PCA to scRNAseq data: (1) Where is the pPCA model misspecified? (2)
How does partial misspecification of the pPCA model affect downstream inferences?
[Figure 4 panels: (a,b) empirical densities for two included genes, with the estimated pPCA density overlaid; (c) UBE2V2, rank 199; (d) IRF8, rank 200.]

Figure 4: (a,b) Histograms of gene expression (after pre-processing), i.e., X_j^(1), . . . , X_j^(N),
for genes j selected to be included in the foreground space based on the log SVC
ratio log Kj − log K0 . The estimated density under the pPCA model is shown in
blue. (c,d) Histograms of example genes selected to be excluded. Higher ranks
(in each title) correspond to larger log SVC ratios.
nents. To illustrate the difference between these approaches in practice, we compared the
SVC to a closely analogous measurement of error for the full foreground model (inferred
from XF0 = X ),
$$\log E_j - \log E_0 := -\frac{N}{T}\,\widehat{\mathrm{NKSD}}\big(p_0(x_{F_j})\,\|\,q(x_{F_j}|\theta_{0,N})\big) + \frac{N}{T}\,\widehat{\mathrm{NKSD}}\big(p_0(x)\,\|\,q(x|\theta_{0,N})\big) \qquad (53)$$
Figure 5: Scatterplot comparison and projected marginals of the leave-one-out log SVC
ratio, log Kj − log K0 (with mBj = mF0 − mFj ), and the conventional full model
criticism score, log Ej − log E0 , for each gene.
SVC ratio is
$$\log K_j - \log K_0 \approx -\frac{N}{T}\,\widehat{\mathrm{NKSD}}\big(p_0(x_{F_j})\,\|\,q(x_{F_j}|\theta_{j,N})\big) + \frac{N}{T}\,\widehat{\mathrm{NKSD}}\big(p_0(x)\,\|\,q(x|\theta_{0,N})\big) + \frac{m_{B_j} + m_{F_j} - m_{F_0}}{2}\,\log\frac{2\pi}{N} \qquad (54)$$
where $\theta_{j,N} := \operatorname{argmin}_\theta \widehat{\mathrm{NKSD}}(p_0(x_{F_j})\,\|\,q(x_{F_j}|\theta))$ is the minimum nksd estimator for the projected foreground model applied to the restricted data set, which we approximate as θ0,N
plus the implicit function correction derived in Section 2.3.3. Figure 5 illustrates the differ-
ences between the conventional criticism approach (log Ej − log E0 ) and the log SVC ratio
on an scRNAseq data set. To enable direct comparison of the two methods, we focus on the
lower order terms of Equation 54, that is, we set mBj = mF0 −mFj . We see that the amount
of error contributed by XBj , as judged by the SVC, is often substantially higher than the
amount indicated by the conventional criticism approach, implying that the conventional
criticism approach understates the problems caused by individual genes and, conversely,
overstates the problems with the rest of the model.
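To make the comparison concrete, a small Python sketch evaluating both scores from estimated NKSD values and dimension counts, following Equations 53 and 54 (argument names are ours):

```python
import numpy as np

def conventional_criticism(nksd_Fj, nksd_full, N, T):
    # log E_j - log E_0, Equation 53: both terms use the full-data estimate theta_{0,N}
    return -(N / T) * nksd_Fj + (N / T) * nksd_full

def log_svc_ratio(nksd_Fj_refit, nksd_full, N, T, m_Bj, m_Fj, m_F0):
    # log K_j - log K_0, Equation 54: the foreground term is re-fit (theta_{j,N}) and a
    # volume correction accounts for the change in background/foreground dimensions.
    return (-(N / T) * nksd_Fj_refit + (N / T) * nksd_full
            + 0.5 * (m_Bj + m_Fj - m_F0) * np.log(2 * np.pi / N))
```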
Using the SVC instead of a standard criticism approach can also help clarify trends in
where the proposed model fails. A prominent concern in scRNAseq data analysis is the
common occurrence of cells that show exactly zero expression of a certain gene (Pierson
and Yau, 2015; Hicks et al., 2018). We found a Spearman correlation of ρ = 0.89 between
the conventional criticism log Ej − log E0 for a gene j and the fraction of cells with zero
expression of that gene j, suggesting that this is an important source of model-data mis-
match in this scRNAseq data set, but not necessarily the only source (Figure 6a). However,
Figure 6: (a) Comparison of the conventional criticism score, for each gene j, and the frac-
tion of cells that show zero expression of that gene j in the raw data. Spearman
ρ = 0.89, p < 0.01. (b) Same as (a) but with the log SVC ratio. Spearman
ρ = 0.98, p < 0.01. In orange are genes that would be included when using a
background model with cB = 20 and in blue are genes that would be excluded.
(c) Same as (a) for a data set taken from a MALT lymphoma (Section D.5).
Spearman ρ = 0.81, p < 0.01. (d) Same as (b) for the MALT lymphoma data
set. Spearman ρ = 0.99, p < 0.01.
the log SVC ratio yields a Spearman correlation of ρ = 0.98, suggesting instead that the
amount of model-data mismatch can be entirely explained by the fraction of cells with zero
expression (Figure 6b). These observations are repeatable across different scRNAseq data
sets (Figure 6c, 6d).
Data selection can also be used to evaluate the robustness of the foreground model to
partial model misspecification. This is particularly relevant for pPCA on scRNAseq data,
since the inferred latent embeddings of each cell are often used for downstream tasks such
where ZH,J,µ,τ is the unknown normalizing constant of the model, and the vectors Hj ∈ R2
and matrices Jjj′ ∈ R2×2 are unknown parameters. This model is motivated by exper-
imental observations and is closely related to RNAseq analysis methods that have been
successfully applied in the past (Friedman et al., 2000; Friedman, 2004; Ding and Peng,
2005; Chen et al., 2015; Banerjee et al., 2008; Duvenaud et al., 2008; Liu et al., 2009;
Huynh-Thu et al., 2010; Moignard et al., 2015; Matsumoto et al., 2017). However, from a
biological perspective we can expect that serious problems may occur when applying the
model naively to an scRNAseq data set. Genes need not exhibit bistable expression: it is
straightforward in theory to write down models of gene regulation that do not have just one
Figure 7: (a) Histogram of log SVC ratios log Kj − log K0 for all 200 genes in the data set
(with mBj = mF0 − mFj ). Dotted lines show the value of the volume correction
term in the SVC for different choices of background model complexity cB ; for
each choice, genes with log Kj − log K0 values above the dotted line would be
excluded from the foreground subspace based on the SVC. (b) Posterior mean
of the first two latent variables (z1 and z2 ), with the pPCA model applied to
the genes selected with a background model complexity of cB = 10 (keeping 23
genes in the foreground). (c-e) Same as (b), but with cB = 20 (keeping 38 genes),
cB = 40 (keeping 87 genes) and cB = 60 (keeping all 200 genes). In (a)-(d), the
points are colored using the z1 value when cB = 60.
or two steady states—gene expression may fall on a continuum, or oscillate, or have three
stable states—and many alternative patterns have been well-documented empirically (Alon,
2019). Interactions between genes may also be more complex than the model assumes, in-
volving for instance three-way dependencies between genes. All of these biological concerns
can potentially produce severe violations of the proposed two-state glass model’s assump-
tions. Data selection provides a method for discovering where the proposed model applies.
Applying standard Bayesian inference to the glass model is intractable, since the nor-
malizing constant is unknown (it is an energy-based model). However, the normalizing
constant does not affect the SVC, so we can still perform data selection. We used the
variational approximation to the SVC in Section 2.3.4. We placed a Gaussian prior on H
and a Laplace prior on each entry of J to encourage sparsity in the pairwise gene interac-
tions; we also used Gaussian priors for µ and τ after applying an appropriate transform to
remove constraints (Section E.1). Following the logic of stochastic variational inference, we
optimized the SVC variational approximation using minibatches of the data and a reparam-
eterization gradient estimator (Hoffman et al., 2013; Kingma and Welling, 2014; Kucukelbir
et al., 2017). We also simultaneously stochastically optimized the set of genes included in
the foreground subspace, using Leave-One-Out REINFORCE (Kool et al., 2019; Dimitriev
and Zhou, 2021) to estimate log-odds gradients. We implemented the model and inference
strategy within the probabilistic programming language Pyro by defining a new distribution
with log probability given by the negative NKSD (Bingham et al., 2019). Pyro provides
automated, GPU-accelerated stochastic variational inference, requiring less than an hour
for inference on data sets with thousands of cells. See Section E.1 for more details on these
inference procedures.
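As an illustration of the gene-inclusion gradient estimator, here is a generic NumPy sketch of Leave-One-Out REINFORCE (Kool et al., 2019) for Bernoulli inclusion variables; the objective f stands in for the SVC variational objective, and none of this reflects the actual Pyro implementation.

```python
import numpy as np

def loo_reinforce_grad(logits, f, K=8, seed=0):
    # logits: per-gene inclusion log-odds; f(mask) returns the objective for a binary mask.
    rng = np.random.default_rng(seed)
    probs = 1.0 / (1.0 + np.exp(-logits))
    masks = rng.binomial(1, probs, size=(K, len(logits)))   # K sampled foreground subsets
    rewards = np.array([f(m) for m in masks])
    # Leave-one-out baseline: for each sample, the mean reward of the other K - 1 samples.
    baselines = (rewards.sum() - rewards) / (K - 1)
    # Gradient of log Bernoulli(mask | probs) with respect to the logits is (mask - probs).
    score = masks - probs
    return ((rewards - baselines)[:, None] * score).mean(axis=0)
```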
We examined three scRNAseq data sets, taken from (i) peripheral blood monocytes
(PBMCs) from a healthy donor (2,428 cells), (ii) a MALT lymphoma (7,570 cells), and (iii)
mouse neurons (10,658 cells) (Section E.2). We preprocessed the data following standard
protocols and focused on 200 high expression, high variability genes in each data set, based
on the metric of Gigante et al. (2020). We set T = 0.05 as in Section 7, and used the
Pitman-Yor expression for mB (Equation 3) with α = 0.5, ν = 1 and D = 100. This value
of D ensures that the number of background model parameters per data dimension is larger
than the number of foreground model parameters per data dimension except for at very
small N ; in particular, there are 798 foreground model parameter dimensions associated
with each data dimension (from the 199 interactions Jjj′ that each gene has with each
other gene, plus the contribution of Hj ), and mB > 798 rB for N ≥ 13. Our data selection
procedure selects 65 genes (32.5%) in the PBMC data set, 0 genes in the neuron data
set, and 187 genes (93.5%) in the MALT data set; note that for a lower value of mB , in
particular using D = 10, no genes are selected in the MALT data set. These results suggest
substantial partial misspecification in the PBMC and neuron data sets, and more moderate
partial misspecification in the MALT data set.
We investigated the biological information captured by the foreground model on the
MALT data set. In particular, we looked at the approximate NKSD posterior for the selected
187 genes, and compared it to the approximate NKSD posterior for the model when applied
to all 200 genes. (Note that, since the glass model lacks a tractable normalizing constant,
we cannot compare standard Bayesian posteriors.) Figure 8 shows, for a subset of selected
genes, the posterior mean of the interaction energy ∆Ejj′ := Jjj′,21 + Jjj′,12 − Jjj′,22 − Jjj′,11,
that is, the total difference in energy between two genes being in the same state versus in
opposite states. We focused on strong interactions with |∆Ejj′| > 1, corresponding to just
5% of all possible gene-gene interactions (Figure 12).
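A small sketch of this post-processing step (computing ∆E from the posterior mean couplings and thresholding strong interactions), with array names of our choosing:

```python
import numpy as np

def interaction_energies(J_mean, threshold=1.0):
    # J_mean: array of shape (G, G, 2, 2) holding posterior mean couplings J_{jj'}.
    # Delta E_{jj'} = J_{jj',21} + J_{jj',12} - J_{jj',22} - J_{jj',11} (1-based indices in the text).
    dE = J_mean[:, :, 1, 0] + J_mean[:, :, 0, 1] - J_mean[:, :, 1, 1] - J_mean[:, :, 0, 0]
    strong = np.abs(dE) > threshold   # strong interactions: |Delta E| > 1
    return dE, strong
```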
One foreground gene with especially large loading onto the top principal component
of the ∆E matrix is CD37 (Figure 8). In B-cell lymphomas, of which MALT lymphoma
is an example, CD37 loss is known to be associated with decreased patient survival (Xu-
Monette et al., 2016). Further, previous studies have observed that CD37 loss leads to high
NF-κB pathway activation (Xu-Monette et al., 2016). Consistent with this observation,
Figure 8: Posterior mean interaction energies ∆E_{jj′} := J_{jj′,21} + J_{jj′,12} − J_{jj′,22} − J_{jj′,11} for a subset of the selected genes. For visualization purposes, weak interactions (|∆E_{jj′}| ≤ 1) are set to zero, and genes with fewer than 10 total strong connections are not shown. Genes are sorted based on their (signed) projection onto the top principal component of the ∆E matrix.
the estimated interaction energies in our model suggest that decreasing CD37 will lead to higher expression of REL, an NF-κB transcription factor (∆E_{CD37,REL} = 2.5), decreased expression of NFKBIA, an NF-κB inhibitor (∆E_{CD37,NFKBIA} = −3.6), and higher expression of BCL2A1, a downstream target of the NF-κB pathway (∆E_{CD37,BCL2A1} = 2.1). Separately, a knockout study of Cd37 in B-cell lymphoma in mice does not show IgM expression (de Winde et al., 2016), consistent with our model (∆E_{CD37,IGHM} = −8.2). The same study does show MHC-II expression, and our model predicts the same result, for
Figure 9: Comparison of posterior mean interaction energies ∆E_{jj′} for a model applied to all 200 genes (pre-data selection) to those learned from a model applied to the selected foreground subspace (post-data selection). Each point corresponds to a pairwise interaction between two of the selected 187 genes.
uncertainty (Figure 14a). Instead, the genes excluded by our data selection procedure are
the ones with the highest fraction of cells with zero expression, violating the assumptions
of the foreground model (Figure 14b). These results show how data selection provides a
sound, computationally tractable approach to criticizing and evaluating complex Bayesian
models.
9. Discussion
Statistical modeling is often described as an iterative process, where we design models, infer
hidden parameters, critique model performance, and then use what we have learned from the
critique to design new models and repeat the process (Gelman et al., 2013). This process
has been called “Box’s loop” (Blei, 2014). From one perspective, data selection offers a
new criticism approach. It goes beyond posterior predictive checks and related methods
by changing the model itself, replacing potentially misspecified components with a flexible
background model. This has important practical consequences: since misspecification can
distort estimates of model parameters in unpredictable ways, predictive checks are likely to
indicate mismatch between the model and the data across the entire space X even when the
proposed parametric model is only partially misspecified. Our method, by contrast, reveals
precisely those subspaces of X where model-data mismatch occurs.
From another perspective, data selection is outside the design-infer-critique loop. An
underlying assumption of Box’s loop is that scientists want to model the entire data set. As
data sets get larger, and measurements get more extensive, this desire has led to more and
more complex (and often difficult to interpret) models. In experimental science, however,
scientists have often followed the opposite trajectory: faced with a complicated natural
phenomenon, they attempt to isolate a simpler example of the phenomenon for close study.
Data selection offers one approach to formalizing this intuitive idea in the context of sta-
tistical analysis: we can propose a simple parametric model and then isolate a piece of the
whole data set—a subspace XF —to which this model applies. When working with large,
complicated data sets, this provides a method of searching for simpler phenomena that are
hypothesized to exist.
There are several directions for future work and improvement upon our proposed data
selection approach. First, we have focused in our applied examples on discovering subsets of
data dimensions. However, our theoretical results show that one can perform data selection
on linear subspaces in general; for instance, in the context of scRNAseq, we might find
that a model can describe a certain set of linear gene expression programs. Even more
generally, one might be interested in discovering nonlinear features of the data that the
model can explain—such as a set of nonlinear gene expression programs—and this would
require extending our approach, perhaps by (1) applying a nonlinear volume-preserving
map to the data, and then (2) performing standard linear data selection.
Second, we have focused on choosing one best XF from among a finite set of possi-
bilities. A future direction is to provide rigorous asymptotic guarantees when there are
infinitely many possible choices of XF , such as the set of all linear subspaces of X . Another
future direction is to provide uncertainty quantification of XF , rather than just point esti-
mation. Here, it is important to consider the uncertainty due to having finite data as well
as non-identifiability, since there may exist multiple optimal values of XF ; for instance, this
can occur if the model is well specified over marginals of the data but not over the joint
distribution of the data.
Third, in many applications, researchers will be interested in inferring the parameters
θ of the foreground model when applied to the selected subspace XF . On finite data, it
is conceivable that foreground subspaces XF that are more likely to be selected are also
more likely to have certain values of θ, which could create a “post-data-selection bias” in conclusions about θ, analogous to the bias that occurs in post-selection inference (Yekutieli,
2012). The data selection problem does not fit neatly in the framework of post-selection
inference, however, so further investigation will be required to understand if, when, and to
what extent such bias occurs.
Finally, in comparison to the augmented model marginal likelihood, the SVC makes
different judgments as to what types of model-data mismatch are important. The nksd
and the kl divergence are quite different and do not, in general, coincide or tightly bound
one another, so a model-data mismatch that looks big to one divergence may not look big
to the other, and vice versa (Matsubara et al., 2022). The preference of the nksd for cer-
tain types of errors is not essential to achieving consistent data selection and nested data
selection, but is very relevant to the practical use and interpretation of the SVC. One could
use another divergence instead of the nksd in the definition of the SVC, and this would
typically be expected to yield consistent model selection and nested model selection (Appendix B.1 and Miller, 2021); however, consistent data selection and nested data selection are more challenging, and depend on a combination of special properties that our nksd
estimator possesses (Section 3). Developing data selection approaches with different model-
data mismatch preferences, therefore, remains an open challenge. In summary, Bayesian
data selection is a rich area for future work.
Acknowledgments
The authors wish to thank Jonathan Huggins, Pierre Jacob, Andre Nguyen, Elizabeth
Wood, and the anonymous reviewers for helpful discussions and suggestions. We would like
to thank Debora S. Marks in particular for suggesting the use of a Potts model in RNAseq
analysis. E.N.W. was supported by the Fannie and John Hertz Fellowship. J.W.M. is
supported by the National Institutes of Health grant 5R01CA240299-02.
conditions (Miller, 2021), according to the Bernstein–von Mises theorem, h^{kl}_N converges to a normal distribution in total variation,

∫_{R^m} | h^{kl}_N(x) − N(x | 0, G^{kl}(θ_*)^{−1}) | dx → 0   a.s. as N → ∞.

According to Theorem 9, the generalized posterior associated with the SVC has analogous behavior. Let G^{svc}(θ) := ∇²_θ (1/T) nksd(p_0(x) ‖ q(x|θ)) and let θ^{svc}_N := argmin_θ \widehat{nksd}(p_0(x) ‖ q(x|θ)). Let h^{svc}_N be the density of √N (θ − θ^{svc}_N) when θ ∼ π^{svc}_N. Then by Theorem 9, h^{svc}_N converges to a normal distribution in total variation,

∫_{R^m} | h^{svc}_N(x) − N(x | 0, G^{svc}(θ_*)^{−1}) | dx → 0   a.s. as N → ∞.
For the uncertainty in each posterior to be roughly the same order of magnitude, we want
det G^{kl}(θ_*) ≈ det G^{svc}(θ_*),

or equivalently,

T ≈ ( det ∇²_θ|_{θ=θ_*} nksd(p_0(x) ‖ q(x|θ)) / det ∇²_θ|_{θ=θ_*} E_{X∼p_0}[−log q(X|θ)] )^{1/m}.

To choose a single T value, we simulate true parameters from the prior, generate data from each simulated true parameter, and take the median of the estimated T values. That is, we use the median T̂ across samples drawn as

θ_* ∼ π(θ),
X^{(i)} ∼ q(x|θ_*)   i.i.d.,
T̂ = ( | det ∇²_θ|_{θ=θ_*} \widehat{nksd}(p_0(x) ‖ q(x|θ)) | / | det ∇²_θ|_{θ=θ_*} (1/N) Σ_{i=1}^N −log q(X^{(i)}|θ) | )^{1/m}.    (55)
In practice, we find that the order of magnitude of T̂ is stable across samples θ∗ from π(θ).
See Section D.3 for an example.
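The following is a minimal sketch of the calibration in Equation 55, assuming user-supplied functions nksd_hat(theta, X) and avg_nll(theta, X) (average negative log-likelihood); these names, and the toy stand-ins below, are placeholders rather than the paper's code.

```python
import torch
from torch.autograd.functional import hessian

def estimate_T(theta_star, X, nksd_hat, avg_nll):
    m = theta_star.numel()
    H_nksd = hessian(lambda th: nksd_hat(th, X), theta_star).reshape(m, m)
    H_nll = hessian(lambda th: avg_nll(th, X), theta_star).reshape(m, m)
    return (torch.abs(torch.det(H_nksd)) / torch.abs(torch.det(H_nll))) ** (1.0 / m)

# Toy Gaussian location model q(x|theta) = N(theta, 1), up to additive constants.
def avg_nll(theta, X):
    return 0.5 * ((X - theta) ** 2).mean() * X.shape[1]

# Stand-in for nksd_hat; in practice this is the U-statistic NKSD estimator.
def nksd_hat(theta, X):
    return ((X.mean(0) - theta) ** 2).sum()

theta_star = torch.zeros(3)
X = torch.randn(2000, 3) + theta_star       # data simulated from q(x|theta_star)
T_hat = estimate_T(theta_star, X, nksd_hat, avg_nll)
```

Repeating this over several draws of θ_* from the prior and taking the median of T̂ gives the calibrated value used in the experiments.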
k(x, y) = ∏_{i=1}^d ( c² + (x_i − y_i)² )^{β/d}.

Note that this kernel factors across any subset of dimensions, that is, if S ⊆ {1, . . . , d} and S^c = {1, . . . , d} \ S, then we can write k(x, y) = k_S(x_S, y_S) k_{S^c}(x_{S^c}, y_{S^c}). Thus, if the foreground subspace X_F is the span of a subset of the standard basis, such that x_F = V^⊤ x = x_S for some S ⊆ {1, . . . , d}, then it follows that k factors as k(x, y) = k_F(x_F, y_F) k_B(x_B, y_B); a numerical sketch of this factorization property is given after the proof below. Along with this observation, the next result shows that the factored IMQ satisfies the conditions of Theorem 9 that pertain to k alone.
Proof It is clear that k(x, y) = k(y, x) and k(x, y) > 0. Next, we show that k has continuous and bounded partial derivatives up to order 2. Note that we can write k(x, y) = ∏_{i=1}^d ψ(x_i − y_i) where ψ(r) = (c² + r²)^{β/d} for r ∈ R. Differentiating, we have

ψ′(r) = (β/d) ( 2r / (c² + r²) ) ψ(r),
ψ″(r) = ( (β/d)² − (β/d) ) ( 2r / (c² + r²) )² ψ(r) + (β/d) ( 2 / (c² + r²) ) ψ(r).

Since r² ≥ 0 and β < 0, |ψ(r)| ≤ c^{2β/d} for all r ∈ R. Further, it is straightforward to verify that |ψ′(r)| and |ψ″(r)| are bounded on R by using the fact that |r|/(c² + r²) ≤ 1/(2c). By the chain rule, it follows that for all i, j, the functions k(x, y), |∂k/∂x_i|, and |∂²k/∂x_i ∂y_j| are bounded. Thus, we conclude that k, ‖∇k‖, and ‖∇²k‖ are bounded.

Finally, we show that k is integrally strictly positive definite. First, for any d, for x, y ∈ R^d, the function (x, y) ↦ (c² + ‖x − y‖²₂)^{β/d} is an integrally strictly positive definite kernel (see, for example, Section 3.1 of Sriperumbudur et al., 2010); we refer to this as the standard IMQ kernel. Since the factored IMQ is a product of one-dimensional standard IMQ kernels, it defines a kernel on R^d (Lemma 4.6 of Steinwart and Christmann, 2008) and is positive definite (Theorem 4.16 of Steinwart and Christmann, 2008). By Bochner's theorem (Theorem 3 of Sriperumbudur et al., 2010), a continuous positive definite kernel can be expressed in terms of the Fourier transform of a finite nonnegative Borel measure.
by Fubini’s theorem, where Λ0 is the finite nonnegative Borel measure on R associated with
ψ(r) and Λ = Λ0 × · · · × Λ0 is the resulting product measure on Rd . Applying Bochner’s
theorem in the other direction, we see that Ψ is a positive definite function. Moreover,
since the standard IMQ kernel is characteristic (Theorem 7 of Sriperumbudur et al., 2010),
it follows that the support of Λ0 is R (Theorem 9 of Sriperumbudur et al., 2010), and thus
the support of Λ is Rd . This implies that the factored IMQ kernel k is characteristic (The-
orem 9 of Sriperumbudur et al., 2010) and, since k is also translation invariant, k must be
integrally strictly positive definite (Section 3.4 of Sriperumbudur et al., 2011).
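As a small numerical sketch of the factorization property noted before the proof, the following verifies that the factored IMQ kernel splits exactly into a product over a coordinate subset S and its complement; the values of c, β, and d here are illustrative only.

```python
import numpy as np

def factored_imq(x, y, c=1.0, beta=-0.5):
    d = x.shape[-1]
    return np.prod((c**2 + (x - y) ** 2) ** (beta / d), axis=-1)

rng = np.random.default_rng(0)
d = 6
x, y = rng.normal(size=d), rng.normal(size=d)
S = np.array([0, 2, 3])                      # foreground coordinates
Sc = np.setdiff1d(np.arange(d), S)           # background coordinates

k_full = factored_imq(x, y)
# Each factor keeps the exponent beta/d of the full kernel, so the product over
# S and S^c recovers k(x, y) exactly.
k_S = np.prod((1.0 + (x[S] - y[S]) ** 2) ** (-0.5 / d))
k_Sc = np.prod((1.0 + (x[Sc] - y[Sc]) ** 2) ** (-0.5 / d))
assert np.allclose(k_full, k_S * k_Sc)
```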
Our choice of the factored IMQ kernel is motivated by the analysis of Gorham and
Mackey (2017), which suggests that the standard IMQ is a good default choice for the
kernelized Stein discrepancy, particularly when working with distributions that are (roughly
speaking) very spread out. In particular, it is straightforward to show that the factored
IMQ kernel, like the standard IMQ kernel, meets the conditions of Theorem 3.2 of Huggins
and Mackey (2018). However, we do not pursue further the question of whether the nksd
with the factored IMQ detects convergence and non-convergence since our statistical setting
is different from that of Gorham and Mackey (2017), and we are assuming the data consists
of i.i.d. samples from some underlying distribution rather than correlated samples from an
MCMC chain which may or may not converge.
where A, B, and C depend on the data but not on θ. Since q_θ(x) = q(x|θ) = λ(x) exp(θ^⊤ t(x) − κ(θ)), we have s_{q_θ}(x) = ∇_x log λ(x) + (∇_x t(x))^⊤ θ, where (∇_x t(x))_{ij} = ∂t_i/∂x_j. Thus, we can write

u_θ(x, y) := s_{q_θ}(x)^⊤ s_{q_θ}(y) k(x, y) + s_{q_θ}(x)^⊤ ∇_y k(x, y) + s_{q_θ}(y)^⊤ ∇_x k(x, y) + trace(∇_x ∇_y^⊤ k(x, y))    (57)
= θ^⊤ [ (∇_x t(x)) (∇_y t(y))^⊤ k(x, y) ] θ
+ [ (∇_x log λ(x))^⊤ (∇_y t(y))^⊤ k(x, y) + (∇_y log λ(y))^⊤ (∇_x t(x))^⊤ k(x, y)
+ (∇_x k(x, y))^⊤ (∇_y t(y))^⊤ + (∇_y k(x, y))^⊤ (∇_x t(x))^⊤ ] θ
+ [ (∇_x log λ(x))^⊤ (∇_y log λ(y)) k(x, y) + (∇_y log λ(y))^⊤ (∇_x k(x, y))
+ (∇_x log λ(x))^⊤ (∇_y k(x, y)) + trace(∇_x ∇_y^⊤ k(x, y)) ].    (58)
+ (∇_x log λ(X^{(j)}))^⊤ (∇_x t(X^{(i)}))^⊤ k(X^{(i)}, X^{(j)})
+ (∇_x k(X^{(i)}, X^{(j)}))^⊤ (∇_x t(X^{(j)}))^⊤ + (∇_y k(X^{(i)}, X^{(j)}))^⊤ (∇_x t(X^{(i)}))^⊤

C := ( Σ_{i≠j} k(X^{(i)}, X^{(j)}) )^{−1} Σ_{i≠j} (∇_x log λ(X^{(i)}))^⊤ (∇_x log λ(X^{(j)})) k(X^{(i)}, X^{(j)})
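The following is a minimal sketch of the NKSD U-statistic, i.e., the ratio of Σ_{i≠j} u_θ(X^{(i)}, X^{(j)}) to Σ_{i≠j} k(X^{(i)}, X^{(j)}) with u_θ as in Equation 57, specialized to the factored IMQ kernel and a user-supplied score function. The closed-form kernel derivatives below are worked out for this specific kernel as part of the sketch and are not taken from the paper's code; the O(N²) double loop is for clarity, not efficiency.

```python
import numpy as np

def nksd_hat(X, score, c=1.0, beta=-0.5):
    """U-statistic NKSD estimate for i.i.d. rows of X, factored IMQ kernel."""
    N, d = X.shape
    a = beta / d
    num, den = 0.0, 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            r = X[i] - X[j]
            k = np.prod((c**2 + r**2) ** a)
            g = 2 * a * r / (c**2 + r**2)          # (grad_x k)/k; grad_y k = -k*g
            gp = 2 * a / (c**2 + r**2) - 4 * a * r**2 / (c**2 + r**2) ** 2
            sx, sy = score(X[i]), score(X[j])
            # u = s(x).s(y) k + s(x).grad_y k + s(y).grad_x k + trace(grad_x grad_y^T k)
            u = k * (sx @ sy - sx @ g + sy @ g - np.sum(g**2 + gp))
            num += u
            den += k
    return num / den

# Usage: for data drawn from N(0, I) and the well-specified score s(x) = -x,
# the estimate is close to zero, as expected for a well-specified model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
val = nksd_hat(X, score=lambda x: -x)
```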
for w ∈ [0, 1]. We assume that the conditions of Theorem 9 are met, over both X_{F_1} and X_{F_2}. Since (∂L/∂θ_i)(w, θ_N(w)) = 0, we have

0 = (∂/∂w) (∂L/∂θ_i)(w, θ_N(w)) = (∂²L/∂w ∂θ_i)(w, θ_N(w)) + Σ_j (∂²L/∂θ_i ∂θ_j)(w, θ_N(w)) (∂/∂w) θ_{N,j}(w).
Rearranging, we have

∇_w θ_N(w) = − ( ∇²_θ L(w, θ_N) )^{−1} ∇_θ ∇_w L(w, θ_N).

Applying a first-order Taylor series expansion gives us θ_N(1) ≈ θ_N(0) + ∇_w θ_N(0), which yields Equation 13.
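Below is a small numerical check of this first-order sensitivity formula on a toy objective; the objective L, the interpolation in w, and all values are illustrative stand-ins, and derivatives are approximated by finite differences.

```python
import numpy as np
from scipy.optimize import minimize

A0 = np.array([[2.0, 0.3], [0.3, 1.0]]); b0 = np.array([1.0, 0.0])
A1 = np.array([[1.0, -0.2], [-0.2, 3.0]]); b1 = np.array([0.0, 1.0])

def L(w, theta):
    A = (1 - w) * A0 + w * A1
    b = (1 - w) * b0 + w * b1
    return 0.5 * theta @ A @ theta - b @ theta

def theta_N(w):                      # minimizer of L(w, .) for a given w
    return minimize(lambda th: L(w, th), x0=np.zeros(2)).x

def grad_theta(w, th, eps=1e-5):     # finite-difference gradient in theta
    return np.array([(L(w, th + eps * e) - L(w, th - eps * e)) / (2 * eps) for e in np.eye(2)])

th0 = theta_N(0.0)
dw = 1e-4
H = np.column_stack([(grad_theta(0.0, th0 + dw * e) - grad_theta(0.0, th0 - dw * e)) / (2 * dw)
                     for e in np.eye(2)])                            # Hessian in theta
mixed = (grad_theta(dw, th0) - grad_theta(-dw, th0)) / (2 * dw)      # d/dw of grad_theta
dtheta_dw = -np.linalg.solve(H, mixed)
approx_theta_1 = th0 + dtheta_dw     # first-order prediction of theta_N(1)
exact_theta_1 = theta_N(1.0)         # direct optimization at w = 1, for comparison
```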
B.1 Setup
We first review the asymptotics of the standard marginal likelihood, discussed in depth by
Dawid (2011) and Hong and Preston (2005), for example. Define
f^{kl}_N(θ) := −(1/N) Σ_{i=1}^N log q(X^{(i)}|θ),     θ^{kl}_N := argmin_θ f^{kl}_N(θ),
f^{kl}(θ) := −E_{X∼p_0}[log q(X|θ)],     θ^{kl}_* := argmin_θ f^{kl}(θ).
Let m be the dimension of the parameter space. Under suitable regularity conditions (Miller,
2021), the Laplace approximation to the marginal likelihood is
q(X^{(1:N)}) = ∫ q(X^{(1:N)}|θ) π(θ) dθ ∼ (2π/N)^{m/2} exp(−N f^{kl}_N(θ^{kl}_N)) π(θ^{kl}_*) / | det ∇²_θ f^{kl}(θ^{kl}_*) |^{1/2}.    (59)
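As a quick sanity check on this approximation, the following toy computation compares the (finite-sample) Laplace approximation to the exact marginal likelihood of a conjugate model, X_i ∼ N(θ, 1) with prior θ ∼ N(0, 1); the sketch uses the Hessian of f^{kl}_N at θ^{kl}_N in place of the population Hessian, and all values are illustrative.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
N = 50
X = rng.normal(loc=0.7, scale=1.0, size=N)

# Exact log marginal likelihood: the data vector is jointly N(0, I + 11^T).
exact = multivariate_normal.logpdf(X, mean=np.zeros(N), cov=np.eye(N) + np.ones((N, N)))

# Laplace approximation with f_N(theta) = -(1/N) sum_i log q(X_i | theta).
theta_hat = X.mean()                                  # minimizer of f_N
f_N_hat = -norm.logpdf(X, loc=theta_hat, scale=1.0).mean()
hess = 1.0                                            # d^2 f_N / d theta^2 for this model
m = 1
laplace = (-N * f_N_hat + norm.logpdf(theta_hat, 0.0, 1.0)
           + 0.5 * m * np.log(2 * np.pi / N) - 0.5 * np.log(hess))
# `exact` and `laplace` agree closely here; the discrepancy is O(1/N).
```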
[Figure 10: panels titled “Model Selection” and “Nested Model Selection”, each plotted against the number of samples.]
Figure 10: Behavior of the Stein volume criterion K, the foreground marginal likelihood with a background volume correction K^{(a)}, and the foreground marginal nksd K^{(b)} on toy examples. The plots show the results for 5 randomly generated data sets (thin lines) and the average over 100 random data sets (bold lines). Here, unlike Figure 2, the Pitman–Yor expression for m_B is used (Equation 3), with α = 0.5, ν = 1, and D = 0.2.
As shown by Dawid (2011) and Hong and Preston (2005), under regularity conditions,
N ( f^{kl}_N(θ^{kl}_N) − f^{kl}_N(θ^{kl}_*) ) = O_{P_0}(1),
N ( f^{kl}_N(θ^{kl}_*) − f^{kl}(θ^{kl}_*) ) = O_{P_0}(√N),
N f^{kl}(θ^{kl}_*) = O_{P_0}(N),    (61)
log ( π(θ^{kl}_*) (2π)^{m/2} / | det ∇²_θ f^{kl}(θ^{kl}_*) |^{1/2} ) = O_{P_0}(1).
The nksd marginal likelihood has a similar decomposition. Following Section 6, define

f^{nksd}_N(θ) := (1/T) \widehat{nksd}(p_0(x) ‖ q(x|θ)),     θ^{nksd}_N := argmin_θ f^{nksd}_N(θ),
f^{nksd}(θ) := (1/T) nksd(p_0(x) ‖ q(x|θ)),     θ^{nksd}_* := argmin_θ f^{nksd}(θ).

As shown in Theorem 9,

z_N := ∫ exp(−N f^{nksd}_N(θ)) π(θ) dθ ∼ (2π/N)^{m/2} exp(−N f^{nksd}_N(θ^{nksd}_N)) π(θ^{nksd}_*) / | det ∇²_θ f^{nksd}(θ^{nksd}_*) |^{1/2},

and further, when the model is well-specified, such that nksd(p_0(x) ‖ q(x|θ^{nksd}_*)) = 0,

N ( f^{nksd}_N(θ^{nksd}_*) − f^{nksd}(θ^{nksd}_*) ) = O_{P_0}(1).    (64)

For ease of reference, here are the various scores that we consider for model/data selection.

Marginal likelihood of the augmented model (foreground+background):

q̃(X^{(1:N)} | F) = ∫∫ q(X^{(1:N)}_F | θ) q̃(X^{(1:N)}_B | X^{(1:N)}_F, φ_B) π(θ) π_B(φ_B) dθ dφ_B.
so K^{(a)} does not satisfy data selection consistency. The SVC satisfies data selection consistency by Theorem 17 (part 1). We show that the other scores also satisfy data selection consistency. Since K^{(b)} = (2π/N)^{−m_B/2} K where K is the SVC, by Theorem 17 (part 1),

(1/N) log( K^{(b)}_1 / K^{(b)}_2 ) →_{P_0} (1/T) nksd(p_0(x_{F_2}) ‖ q(x_{F_2}|θ^{nksd}_{2,*})) − (1/T) nksd(p_0(x_{F_1}) ‖ q(x_{F_1}|θ^{nksd}_{1,*})),    (66)

(1/N) log( K^{(d)}_1 / K^{(d)}_2 ) →_{P_0} (1/T) nksd(p_0(x_{F_2}) ‖ q(x_{F_2}|θ^{nksd}_{2,*})) − (1/T) nksd(p_0(x_{F_1}) ‖ q(x_{F_1}|θ^{nksd}_{1,*})),    (68)

(1/N) log( K^{BIC}_1 / K^{BIC}_2 ) →_{P_0} (1/T) nksd(p_0(x_{F_2}) ‖ q(x_{F_2}|θ^{nksd}_{2,*})) − (1/T) nksd(p_0(x_{F_1}) ‖ q(x_{F_1}|θ^{nksd}_{1,*})).    (69)
These methods therefore satisfy data selection consistency. For the marginal likelihood of the augmented model, suppose m_{B_1} and m_{B_2} do not depend on N. Then by Equation 60,

(1/N) log( q̃(X^{(1:N)}|F_1) / q̃(X^{(1:N)}|F_2) ) →_{P_0} E_{X_{F_2}∼p_0}[−log q(X_{F_2}|θ^{kl}_{2,*})] + E_{X∼p_0}[−log q̃(X_{B_2}|X_{F_2}, φ^{kl}_{2,*})] − E_{X_{F_1}∼p_0}[−log q(X_{F_1}|θ^{kl}_{1,*})] − E_{X∼p_0}[−log q̃(X_{B_1}|X_{F_1}, φ^{kl}_{1,*})].    (70)

We can rewrite this in terms of the KL divergence. First note the decomposition,

H = −∫ p_0(x) log p_0(x) dx = −∫ p_0(x_{F_j}) log p_0(x_{F_j}) dx_{F_j} − ∫ p_0(x) log p_0(x_{B_j}|x_{F_j}) dx

for j ∈ {1, 2}. Adding and subtracting the entropy H in Equation 70, and using the fact that the background model is well-specified,

(1/N) log( q̃(X^{(1:N)}|F_1) / q̃(X^{(1:N)}|F_2) ) →_{P_0} kl(p_0(x_{F_2}) ‖ q(x_{F_2}|θ^{kl}_{2,*})) + kl(p_0(x_{B_2}|x_{F_2}) ‖ q̃(x_{B_2}|x_{F_2}, φ^{kl}_{2,*})) − kl(p_0(x_{F_1}) ‖ q(x_{F_1}|θ^{kl}_{1,*})) − kl(p_0(x_{B_1}|x_{F_1}) ‖ q̃(x_{B_1}|x_{F_1}, φ^{kl}_{1,*}))
= kl(p_0(x_{F_2}) ‖ q(x_{F_2}|θ^{kl}_{2,*})) − kl(p_0(x_{F_1}) ‖ q(x_{F_1}|θ^{kl}_{1,*})).    (71)

(1/(log N)) log( K^{(d)}_1 / K^{(d)}_2 ) →_{P_0} (m_{B_2} − m_{B_1}) / 2.    (72)
Since X_{F_2} ⊂ X_{F_1}, we have m_{F_2} ≤ m_{F_1} except perhaps in highly contrived scenarios. If m_{F_2} < m_{F_1} then Equation 75 shows that log(K^{(b)}_1 / K^{(b)}_2) →_{P_0} −∞. On the other hand, if m_{F_2} = m_{F_1}, then by Equations 62 and 63, log(K^{(b)}_1 / K^{(b)}_2) = O_{P_0}(1), so it is not possible to have log(K^{(b)}_1 / K^{(b)}_2) →_{P_0} ∞. Therefore, K^{(b)} does not satisfy nested data selection consistency.
Since K^{(c)} = e^{N H_F} K^{(a)} = e^{N H_F} (2π/N)^{m_B/2} q(X^{(1:N)}_F), then by Equations 60 and 61,

(1/√N) log( K^{(c)}_1 / K^{(c)}_2 ) = √N ( (1/N) Σ_{i=1}^N log( p_0(X^{(i)}_{F_1}) / p_0(X^{(i)}_{F_2}) ) − E[ log( p_0(X_{F_1}) / p_0(X_{F_2}) ) ] ) + O_{P_0}(N^{−1/2} log N).    (76)

If σ² := V_{P_0}( log p_0(X_{F_1})/p_0(X_{F_2}) ) is positive and finite, then by the central limit theorem and Slutsky's theorem, N^{−1/2} log(K^{(c)}_1 / K^{(c)}_2) →_D N(0, σ²). Thus, K^{(c)} randomly selects F_1 or F_2 with equal probability, and therefore, it does not satisfy nested data selection consistency.
For the marginal likelihood of the augmented model, suppose m_{B_1} and m_{B_2} do not depend on N. The marginal likelihood achieves nested data selection consistency because the augmented models are both well-specified and describe the complete data space X; this guarantees that the O_{P_0}(√N) terms in the marginal likelihood decomposition cancel. Specifically, p_0(x) = q(x | θ^{kl}_{j,*}, φ^{kl}_{j,*}, F_j) for j ∈ {1, 2}, and thus, by Equations 60 and 61 applied to the augmented model,

(1/N) log( K^{(a)}_1 / K^{(a)}_2 ) →_{P_0} kl(p_0(x_F) ‖ q_2(x_F|θ^{kl}_{2,*})) − kl(p_0(x_F) ‖ q_1(x_F|θ^{kl}_{1,*})),    (79)

(1/N) log( K^{(b)}_1 / K^{(b)}_2 ) →_{P_0} (1/T) nksd(p_0(x_F) ‖ q_2(x_F|θ^{nksd}_{2,*})) − (1/T) nksd(p_0(x_F) ‖ q_1(x_F|θ^{nksd}_{1,*})),    (80)

(1/N) log( K^{(c)}_1 / K^{(c)}_2 ) →_{P_0} kl(p_0(x_F) ‖ q_2(x_F|θ^{kl}_{2,*})) − kl(p_0(x_F) ‖ q_1(x_F|θ^{kl}_{1,*})),    (81)
(1/N) log( K^{(d)}_1 / K^{(d)}_2 ) →_{P_0} (1/T) nksd(p_0(x_F) ‖ q_2(x_F|θ^{nksd}_{2,*})) − (1/T) nksd(p_0(x_F) ‖ q_1(x_F|θ^{nksd}_{1,*})),    (82)

(1/N) log( K^{BIC}_1 / K^{BIC}_2 ) →_{P_0} (1/T) nksd(p_0(x_F) ‖ q_2(x_F|θ^{nksd}_{2,*})) − (1/T) nksd(p_0(x_F) ‖ q_1(x_F|θ^{nksd}_{1,*})).    (83)
Note that in contrast to the data selection case, K(a) satisfies model selection consistency
since the entropy terms HFj cancel due to the fact that F is fixed. We can think of this as
a consequence of the kl divergence’s subsystem independence; if we are just interested in
modeling a fixed foreground space, there is no problem considering the foreground marginal
likelihood alone (Caticha, 2004, 2011; Rezende, 2018).
p_0(x_F) = q_j(x_F|θ_{j,*}) for j ∈ {1, 2}. Thus, the estimated divergences cancel:

\widehat{nksd}(p_0(x_F) ‖ q_1(x_F|θ^{nksd}_{1,*})) = \widehat{nksd}(p_0(x_F) ‖ q_2(x_F|θ^{nksd}_{2,*})),
Σ_{i=1}^N log q_1(X^{(i)}_F|θ^{kl}_{1,*}) = Σ_{i=1}^N log q_2(X^{(i)}_F|θ^{kl}_{2,*}),
\widehat{kl}(p_0(x_F) ‖ q_1(x_F|θ^{kl}_{1,*})) = \widehat{kl}(p_0(x_F) ‖ q_2(x_F|θ^{kl}_{2,*})).
Using this along with Equations 60–64, under the same conditions on m_B as in Section B.2,

(1/(log N)) log( K^{(a)}_1 / K^{(a)}_2 ) →_{P_0} (m_{F,2} − m_{F,1}) / 2,    (85)
(1/(log N)) log( K^{(b)}_1 / K^{(b)}_2 ) →_{P_0} (m_{F,2} − m_{F,1}) / 2,    (86)
(1/(log N)) log( K^{(c)}_1 / K^{(c)}_2 ) →_{P_0} (m_{F,2} − m_{F,1}) / 2,    (87)
log( K^{(d)}_1 / K^{(d)}_2 ) = O_{P_0}(1),    (88)
(1/(log N)) log( K^{BIC}_1 / K^{BIC}_2 ) →_{P_0} (m_{F,2} − m_{F,1}) / 2,    (89)

where we are using the assumption that the background model is the same in the two augmented models q̃_1 and q̃_2, and so m_{B,1} = m_{B,2}. Only K^{(d)} fails to satisfy nested model selection consistency.
Appendix C. Proofs
Proof of Proposition 3 By assumption, the kernel is bounded, say |k(x, y)| ≤ B, and s_p, s_q ∈ L¹(P). Thus, by the Cauchy–Schwarz inequality,

∫_X ∫_X (s_q(x) − s_p(x))^⊤ (s_q(y) − s_p(y)) k(x, y) p(x) p(y) dx dy ≤ B ( ∫_X ‖s_q(x) − s_p(x)‖ p(x) dx )² < ∞.

Since the kernel is integrally strictly positive definite and |k(x, y)| ≤ B,

0 < ∫_X ∫_X k(x, y) p(x) p(y) dx dy ≤ B < ∞.    (90)

Thus, the nksd is finite. Equation 30 follows from Theorem 3.6 of Liu et al. (2016).

∫_X ∫_X δ(x)^⊤ δ(y) k(x, y) p(x) p(y) dx dy = Σ_{i=1}^d ∫_X ∫_X δ_i(x) δ_i(y) k(x, y) p(x) p(y) dx dy.    (91)

δ_1(x_1) := ∇_{x_1} log q(x) − ∇_{x_1} log p(x) = ∇_{x_1} log q(x_1) − ∇_{x_1} log p(x_1),
δ_2(x_2) := ∇_{x_2} log q(x) − ∇_{x_2} log p(x) = ∇_{x_2} log q(x_2) − ∇_{x_2} log p(x_2).
Let X, Y ∼ p(x) independently. Note that E[k1 (X1 , Y1 )] > 0 and E[k2 (X2 , Y2 )] > 0 since k1
and k2 are integrally strictly positive definite by assumption. Therefore,
nksd(p(x) ‖ q(x)) = E[ (∇_x log q(X) − ∇_x log p(X))^⊤ (∇_x log q(Y) − ∇_x log p(Y)) k(X, Y) ] / E[ k(X, Y) ]
= ( E[ δ_1(X_1)^⊤ δ_1(Y_1) k_1(X_1, Y_1) ] E[k_2(X_2, Y_2)] + E[ δ_2(X_2)^⊤ δ_2(Y_2) k_2(X_2, Y_2) ] E[k_1(X_1, Y_1)] ) / ( E[k_1(X_1, Y_1)] E[k_2(X_2, Y_2)] )
= E[ δ_1(X_1)^⊤ δ_1(Y_1) k_1(X_1, Y_1) ] / E[k_1(X_1, Y_1)] + E[ δ_2(X_2)^⊤ δ_2(Y_2) k_2(X_2, Y_2) ] / E[k_2(X_2, Y_2)]
= nksd(p(x_1) ‖ q(x_1)) + nksd(p(x_2) ‖ q(x_2)).
Proposition 20

\widehat{nksd}(p(x) ‖ q(x)) = \widehat{nksd}(p(x_1) ‖ q(x_1)) + \widehat{nksd}(p(x_2) ‖ q(x_2)),    (92)

where

\widehat{nksd}(p(x_1) ‖ q(x_1)) := Σ_{i≠j} u_1(X^{(i)}_1, X^{(j)}_1) k_2(X^{(i)}_2, X^{(j)}_2) / Σ_{i≠j} k_1(X^{(i)}_1, X^{(j)}_1) k_2(X^{(i)}_2, X^{(j)}_2),
u_1(x_1, y_1) := s_q(x_1)^⊤ s_q(y_1) k_1(x_1, y_1) + s_q(x_1)^⊤ ∇_{y_1} k_1(x_1, y_1) + s_q(y_1)^⊤ ∇_{x_1} k_1(x_1, y_1) + trace(∇_{x_1} ∇_{y_1}^⊤ k_1(x_1, y_1)),
s_q(x_1) := ∇_{x_1} log q(x_1),

and vice versa for \widehat{nksd}(p(x_2) ‖ q(x_2)) with the roles of 1 and 2 swapped.

Since q(x) = q(x_1) q(x_2) and k(x, y) = k_1(x_1, y_1) k_2(x_2, y_2), we have

∇_x log q(x)^⊤ ∇_y log q(y) k(x, y) = [∇_{x_1} log q(x_1)^⊤ ∇_{y_1} log q(y_1) k_1(x_1, y_1)] k_2(x_2, y_2) + [∇_{x_2} log q(x_2)^⊤ ∇_{y_2} log q(y_2) k_2(x_2, y_2)] k_1(x_1, y_1),
∇_x log q(x)^⊤ ∇_y k(x, y) = [∇_{x_1} log q(x_1)^⊤ ∇_{y_1} k_1(x_1, y_1)] k_2(x_2, y_2) + [∇_{x_2} log q(x_2)^⊤ ∇_{y_2} k_2(x_2, y_2)] k_1(x_1, y_1),
∇_x k(x, y)^⊤ ∇_y log q(y) = [∇_{x_1} k_1(x_1, y_1)^⊤ ∇_{y_1} log q(y_1)] k_2(x_2, y_2) + [∇_{x_2} k_2(x_2, y_2)^⊤ ∇_{y_2} log q(y_2)] k_1(x_1, y_1),
trace(∇_x ∇_y^⊤ k(x, y)) = trace(∇_{x_1} ∇_{y_1}^⊤ k_1(x_1, y_1)) k_2(x_2, y_2) + trace(∇_{x_2} ∇_{y_2}^⊤ k_2(x_2, y_2)) k_1(x_1, y_1),
so \widehat{nksd}(p(x_1) ‖ q(x_1)) is an estimator of nksd(p(x_1) ‖ q(x_1)), and likewise for \widehat{nksd}(p(x_2) ‖ q(x_2)).
Proof First, we establish almost sure convergence for the denominator of f_N(θ). Since k is assumed to be bounded and to have bounded derivatives up to order two, we can choose B < ∞ such that B ≥ |k| + ‖∇_x k‖ + ‖∇_x ∇_y^⊤ k‖. In particular, the expected value of the kernel is finite:

∫_X ∫_X |k(x, y)| P_0(dx) P_0(dy) ≤ B < ∞.    (94)

By the strong law of large numbers for U-statistics (Theorem 5.4A of Serfling, 2009),

(1/(N(N−1))) Σ_{i≠j} k(X^{(i)}, X^{(j)}) → ∫_X ∫_X k(x, y) P_0(dx) P_0(dy)   a.s. as N → ∞.    (95)

Note that the limit is positive since k(x, y) > 0 for all x, y ∈ X. For the numerator, we establish bounds on u_θ and ∇_θ u_θ. Let C ⊆ Θ be compact and convex. By Equation 5, for all θ ∈ C and all x, y ∈ X,

|u_θ(x, y)| ≤ |s_{q_θ}(x)^⊤ s_{q_θ}(y) k(x, y)| + |s_{q_θ}(x)^⊤ ∇_y k(x, y)| + |s_{q_θ}(y)^⊤ ∇_x k(x, y)| + |trace(∇_x ∇_y^⊤ k(x, y))|
≤ ‖s_{q_θ}(x)‖ ‖s_{q_θ}(y)‖ B + ‖s_{q_θ}(x)‖ B + ‖s_{q_θ}(y)‖ B + B d    (96)
≤ g_{0,C}(x) g_{0,C}(y) B + g_{0,C}(x) B + g_{0,C}(y) B + B d =: h_{0,C}(x, y),

‖∇_θ u_θ(x, y)‖ ≤ ‖∇_θ (s_{q_θ}(x)^⊤ s_{q_θ}(y)) k(x, y)‖ + ‖∇_θ (s_{q_θ}(x)^⊤ ∇_y k(x, y))‖ + ‖∇_θ (s_{q_θ}(y)^⊤ ∇_x k(x, y))‖ + ‖∇_θ trace(∇_x ∇_y^⊤ k(x, y))‖    (97)
≤ g_{0,C}(x) g_{1,C}(y) B + g_{0,C}(y) g_{1,C}(x) B + g_{1,C}(x) B + g_{1,C}(y) B =: h_{1,C}(x, y).

Note that h_{0,C} and h_{1,C} are continuous and belong to L¹(P_0 × P_0).
sup_{θ∈C} | (1/(N(N−1))) Σ_{i≠j} u_θ(X^{(i)}, X^{(j)}) − ∫_X ∫_X u_θ(x, y) P_0(dx) P_0(dy) | → 0   a.s. as N → ∞,    (98)

and that θ ↦ ∫_X ∫_X u_θ(x, y) P_0(dx) P_0(dy) is continuous. (Note that although Yeo and Johnson (2001) assume X = R, their proof goes through without further modification for any nonempty X ⊆ R^d.) Combining Equations 95 and 98, it follows that sup_{θ∈C} |f_N(θ) − f(θ)| → 0 a.s. To complete the proof, we must show that (A) and (B) are equicontinuous on C.

(A) Since θ ↦ u_θ(x, y) is differentiable on C, then by the mean value theorem, we have that for all θ_1, θ_2 ∈ C and all x, y ∈ S_M,

| u_{θ_1}(x, y) − u_{θ_2}(x, y) | ≤ ‖ ∇_θ|_{θ=θ̃} u_θ(x, y) ‖ ‖θ_1 − θ_2‖ ≤ ( sup_{θ∈C, x,y∈S_M} ‖∇_θ u_θ(x, y)‖ ) ‖θ_1 − θ_2‖,

where θ̃ = γθ_1 + (1 − γ)θ_2 for some γ ∈ [0, 1]. Here, the second inequality holds since θ̃ ∈ C by the convexity of C, and the supremum is finite because a continuous function on a compact set attains its maximum. Therefore, (θ ↦ u_θ(x, y) : x, y ∈ S_M) is equicontinuous on C.

(B) To see that (θ ↦ ∫ u_θ(x, y) P_0(dy) : x ∈ S_M) is equicontinuous on C, first note that

∫ |u_θ(x, y)| P_0(dy) ≤ ∫ h_{0,C}(x, y) P_0(dy) < ∞.

Further, due to Equations 96 and 97, we can apply the Leibniz integral rule (Folland, 1999, Theorem 2.27) and find that ∇_θ ∫ u_θ(x, y) P_0(dy) exists and is equal to ∫ ∇_θ u_θ(x, y) P_0(dy). Now we apply the mean value theorem and the same reasoning as before to find that for all θ_1, θ_2 ∈ C and all x ∈ S_M,

| ∫ u_{θ_1}(x, y) P_0(dy) − ∫ u_{θ_2}(x, y) P_0(dy) | ≤ ‖ ∇_θ|_{θ=θ̃} ∫ u_θ(x, y) P_0(dy) ‖ ‖θ_1 − θ_2‖
≤ ‖θ_1 − θ_2‖ ∫ ‖ ∇_θ|_{θ=θ̃} u_θ(x, y) ‖ P_0(dy)
≤ ‖θ_1 − θ_2‖ sup_{x∈S_M} ∫ h_{1,C}(x, y) P_0(dy) < ∞
where θ̃ = γθ_1 + (1 − γ)θ_2 for some γ ∈ [0, 1]. The supremum is finite since x ↦ ∫ h_{1,C}(x, y) P_0(dy) is continuous, which can easily be seen by plugging in the definition of h_{1,C}. Therefore, (θ ↦ ∫ u_θ(x, y) P_0(dy) : x ∈ S_M) is equicontinuous on C.
Proof First, for any x, y ∈ X, if we define g(θ) = s_{q_θ}(x) and h(θ) = s_{q_θ}(y), then u_θ = (g^⊤ h) k + g^⊤ (∇_y k) + h^⊤ (∇_x k) + trace(∇_x ∇_y^⊤ k). By differentiating, applying Minkowski's inequality to the resulting sum of tensors, and applying the Cauchy–Schwarz inequality to each term, we have

‖∇³_θ u_θ(x, y)‖ ≤ ( ‖∇³g‖ ‖h‖ + 3 ‖∇²g‖ ‖∇h‖ + 3 ‖∇g‖ ‖∇²h‖ + ‖g‖ ‖∇³h‖ ) k + ‖∇³g‖ ‖∇_y k‖ + ‖∇³h‖ ‖∇_x k‖.

Using the symmetry of the kernel to combine like terms, this yields that

‖ Σ_{i≠j} ∇³_θ u_θ(X^{(i)}, X^{(j)}) ‖ ≤ Σ_{i≠j} ( 2 ‖∇³_θ s_{q_θ}(X^{(i)})‖ ‖s_{q_θ}(X^{(j)})‖ B + 6 ‖∇²_θ s_{q_θ}(X^{(i)})‖ ‖∇_θ s_{q_θ}(X^{(j)})‖ B + 2 ‖∇³_θ s_{q_θ}(X^{(i)})‖ B ),

where B < ∞ is such that B ≥ |k| + ‖∇_x k‖ + ‖∇_x ∇_y^⊤ k‖. Since f_N(θ) = 0 when N = 1 by definition, we can assume without loss of generality that N ≥ 2, so 1/(N−1) = (1/N)(1 + 1/(N−1)) ≤ 2/N. Since each term is non-negative, we can add in the i = j terms,

(1/(N(N−1))) ‖ Σ_{i≠j} ∇³_θ u_θ(X^{(i)}, X^{(j)}) ‖
≤ (2B/N²) Σ_{i,j} ( 2 ‖∇³_θ s_{q_θ}(X^{(i)})‖ ‖s_{q_θ}(X^{(j)})‖ + 6 ‖∇²_θ s_{q_θ}(X^{(i)})‖ ‖∇_θ s_{q_θ}(X^{(j)})‖ + 2 ‖∇³_θ s_{q_θ}(X^{(i)})‖ )
= 4B ( (1/N) Σ_i ‖∇³_θ s_{q_θ}(X^{(i)})‖ ) ( (1/N) Σ_j ‖s_{q_θ}(X^{(j)})‖ )    (99)
+ 12B ( (1/N) Σ_i ‖∇²_θ s_{q_θ}(X^{(i)})‖ ) ( (1/N) Σ_j ‖∇_θ s_{q_θ}(X^{(j)})‖ )
+ 4B ( (1/N) Σ_i ‖∇³_θ s_{q_θ}(X^{(i)})‖ ),

and similarly for ( (1/N) Σ_i ‖∇³_θ s_{q_θ}(X^{(i)})‖ : N ∈ N, θ ∈ E ). We show the same for (1/N) Σ_i ‖s_{q_θ}(X^{(i)})‖:

sup_{θ∈Ē} ∫ ‖s_{q_θ}(x)‖ P_0(dx) ≤ ∫ g_{0,Ē}(x) P_0(dx) < ∞.
Hence, by Theorem 1.3.3 of Ghosh and Ramamoorthi (2003), (1/N) Σ_i ‖s_{q_θ}(X^{(i)})‖ converges almost surely. The same argument holds for (1/N) Σ_i ‖∇_θ s_{q_θ}(X^{(i)})‖ using g_{1,Ē}(x). Therefore, by Equation 99, it follows that ‖ (1/(N(N−1))) Σ_{i≠j} ∇³_θ u_θ(X^{(i)}, X^{(j)}) ‖ is uniformly bounded on E. Since k is positive by assumption, (1/(N(N−1))) Σ_{i≠j} k(X^{(i)}, X^{(j)}) > 0 for all N ≥ 2 and, by Equations 94 and 95, (1/(N(N−1))) Σ_{i≠j} k(X^{(i)}, X^{(j)}) converges a.s. to a finite quantity greater than 0. We conclude that almost surely,

‖f‴_N(θ)‖ = (1/T) ‖ (1/(N(N−1))) Σ_{i≠j} ∇³_θ u_θ(X^{(i)}, X^{(j)}) ‖ / ( (1/(N(N−1))) Σ_{i≠j} k(X^{(i)}, X^{(j)}) )

is uniformly bounded over θ ∈ E for all N sufficiently large.
Proof of Theorem 9 We show that the conditions of Theorem 3.2 of Miller (2021) are met, from which the conclusions of this theorem follow immediately.

By Condition 10 and Equation 35, f_N has continuous third-order partial derivatives on Θ. Let E be the set from Condition 10. With probability 1, f_N → f uniformly on E (by Proposition 21 with C = Ē) and (f‴_N) is uniformly bounded on E (by Proposition 22). Note that f is finite on Θ by Proposition 3. Thus, by Theorem 3.4 of Miller (2021), f′ and f″ exist on E and f″_N → f″ uniformly on E with probability 1. Since θ_* is a minimizer of f and θ_* ∈ E, we know that f′(θ_*) = 0 and f″(θ_*) is positive semidefinite; thus, f″(θ_*) is positive definite since it is invertible by assumption.

Case (a): Now, consider the case where Θ is compact. Then almost surely, f_N → f uniformly on Θ by Proposition 21 with C = Θ. Since θ_* is a unique minimizer of f, we have f(θ) > f(θ_*) for all θ ∈ Θ \ {θ_*}. Let H ⊆ E be an open set such that θ_* ∈ H and H̄ ⊆ E. We show that lim inf_N inf_{θ∈Θ\H̄} f_N(θ) > f(θ_*). Since Θ \ H is compact and f is continuous with f > f(θ_*) on Θ \ H, we have ε := inf_{θ∈Θ\H} f(θ) − f(θ_*) > 0. By uniform convergence, with probability 1, there exists N such that for all N′ > N, sup_{θ∈Θ} |f_{N′}(θ) − f(θ)| ≤ ε/2, and thus, inf_{θ∈Θ\H̄} f_{N′}(θ) ≥ inf_{θ∈Θ\H} f(θ) − ε/2 = f(θ_*) + ε/2. Hence, lim inf_N inf_{θ∈Θ\H̄} f_N(θ) > f(θ_*) almost surely. Applying Theorem 3.2 of Miller (2021), the conclusion of the theorem follows. Note that f″_N(θ_N) → f″(θ_*) a.s. since θ_N → θ_* and f″_N → f″ uniformly on E.

Case (b): Alternatively, consider the case where Θ is open and f_N is convex on Θ. For each θ ∈ Θ, with probability 1, f_N(θ) → f(θ) (by Proposition 21 with C = {θ}). However, we need to show that with probability 1, for all θ ∈ Θ, f_N(θ) → f(θ). We follow the argument in the proof of Theorem 6.3 of Miller (2021). Let W be a countable dense subset of Θ. Since W is countable, with probability 1, for all θ ∈ W, f_N(θ) → f(θ). Since f_N is convex, then with probability 1, for all θ ∈ Θ, the limit f̃(θ) := lim_N f_N(θ) exists and is finite,
and f̃ is convex (Theorem 10.8 of Rockafellar, 1970). Since f_N is convex and f(θ) is finite, f is also convex. Since f and f̃ are convex, they are also continuous (Theorem 10.1 of Rockafellar, 1970). Continuous functions that agree on a dense subset of points must be equal. Thus, with probability 1, for all θ ∈ Θ, f_N(θ) → f(θ). Applying Theorem 3.2 of Miller (2021), the conclusion of the theorem follows.
Proof of Theorem 11 Our proof builds on Appendix D.3 of Barp et al. (2019), which establishes a central limit theorem for the ksd when the model is an exponential family. The outline of the proof is as follows. First, we establish bounds on s_{q_θ} and its derivatives, using the assumed bounds on ∇_x t(x) and ∇_x log λ(x). Second, we establish that f″(θ) is positive definite and independent of θ, and that f″_N(θ) converges to it almost surely; from this, we conclude that f″(θ_*) is invertible and f_N(θ) is convex. These results rely on the convergence properties of U-statistics and on Sylvester's criterion.
The assumption that log λ(x) is continuously differentiable on X implies that λ(x) > 0 for x ∈ X. Since q_θ(x) = λ(x) exp(θ^⊤ t(x) − κ(θ)), we have s_{q_θ}(x) = ∇_x log λ(x) + (∇_x t(x))^⊤ θ, where (∇_x t(x))_{ij} = ∂t_i/∂x_j. Thus, s_{q_θ}(x) has continuous third-order partial derivatives with respect to θ, and Equations 41 and 42 are trivially satisfied. Equation 40 holds for all compact C ⊆ Θ since ‖∇_x log λ(x)‖ and ‖∇_x t(x)‖ are continuous functions in L¹(P_0) and

‖s_{q_θ}(x)‖ = ‖∇_x log λ(x) + (∇_x t(x))^⊤ θ‖ ≤ ‖∇_x log λ(x)‖ + ‖∇_x t(x)‖ ‖θ‖,
‖∇_θ s_{q_θ}(x)‖ = ‖∇_x t(x)‖.

Following Equation 58, we can write u_θ(x, y) = θ^⊤ B_2(x, y) θ + B_1(x, y)^⊤ θ + B_0(x, y), where

B_2(x, y) = (∇_x t(x)) (∇_y t(y))^⊤ k(x, y),
B_1(x, y) = (∇_y t(y)) (∇_x log λ(x)) k(x, y) + (∇_x t(x)) (∇_y log λ(y)) k(x, y) + (∇_y t(y)) (∇_x k(x, y)) + (∇_x t(x)) (∇_y k(x, y)),
B_0(x, y) = (∇_x log λ(x))^⊤ (∇_y log λ(y)) k(x, y) + (∇_y log λ(y))^⊤ (∇_x k(x, y)) + (∇_x log λ(x))^⊤ (∇_y k(x, y)) + trace(∇_x ∇_y^⊤ k(x, y)).
‖∇_x t(x)‖ and ‖∇_x log λ(x)‖ are in L¹(P_0). Further, 0 < K < ∞ since 0 < k(x, y) ≤ B < ∞ by assumption. Thus,

f(θ) = (1/(TK)) ∫∫ ( θ^⊤ B_2(x, y) θ + B_1(x, y)^⊤ θ + B_0(x, y) ) P_0(dx) P_0(dy) ∈ R.

Since k is symmetric, B_2(x, y)^⊤ = B_2(y, x). Hence, ∇_θ (θ^⊤ B_2(x, y) θ) = (B_2(x, y) + B_2(y, x)) θ, so by Fubini's theorem,

f′(θ) = (1/(TK)) ∫∫ ( 2 B_2(x, y) θ + B_1(x, y) ) P_0(dx) P_0(dy) ∈ R^m,
f″(θ) = (2/(TK)) ∫∫ B_2(x, y) P_0(dx) P_0(dy) ∈ R^{m×m}.

Here, differentiating under the integral sign is justified simply by linearity of the expectation. Note that f″(θ) is a symmetric matrix since B_2(x, y)^⊤ = B_2(y, x). Next, to show f″(θ) is positive definite, let v ∈ R^m \ {0}. By assumption, the rows of ∇_x t(x) are linearly independent with positive probability under P_0. Thus, there is a set E ⊆ X such that P_0(E) > 0 and (∇_x t(x))^⊤ v ≠ 0 for all x ∈ E. Define g(x) = (∇_x t(x))^⊤ v p_0(x) ∈ R^d. Then ∫_X |g_i(x)| dx > 0 for at least one i, and ∫_X |g_i(x)| dx ≤ ‖v‖ ∫_X ‖∇_x t(x)‖ p_0(x) dx < ∞ for all i. Thus,

v^⊤ f″(θ) v = (2/(TK)) ∫∫ g(x)^⊤ g(y) k(x, y) dx dy = (2/(TK)) Σ_{i=1}^d ∫∫ g_i(x) g_i(y) k(x, y) dx dy > 0.

Thus,

f″_N(θ) = (2/T) Σ_{i≠j} B_2(X^{(i)}, X^{(j)}) / Σ_{i≠j} k(X^{(i)}, X^{(j)}).

By the strong law of large numbers for U-statistics (Theorem 5.4A of Serfling, 2009), we have f″_N(θ) → f″(θ) almost surely, since ∫_X ∫_X ‖B_2(x, y)‖ P_0(dx) P_0(dy) < ∞ and 0 < K < ∞. For a symmetric matrix A, let λ_*(A) denote the smallest eigenvalue. Since λ_*(A) is a continuous function of the entries of A, we have λ_*(f″_N(θ)) → λ_*(f″(θ)) a.s. as N → ∞. Thus, with probability 1, for all N sufficiently large, f″_N(θ) is positive definite, and hence, f_N is convex. Further, for such N, since f_N is a quadratic function with positive definite Hessian, we have M_N := inf_{θ∈Θ} f_N(θ) > −∞ and z_N = ∫_Θ exp(−N f_N(θ)) π(θ) dθ ≤ exp(−N M_N) < ∞.
f′_N(θ_*) = (1/T) ( (1/(N(N−1))) Σ_{i≠j} ∇_θ u_{θ_*}(X^{(i)}, X^{(j)}) ) / ( (1/(N(N−1))) Σ_{i≠j} k(X^{(i)}, X^{(j)}) ).

The denominator converges a.s. to a finite positive constant, as in the proof of Proposition 21. It is straightforward to verify that E_{X,Y∼P_0}[ ‖∇_θ u_{θ_*}(X, Y)‖² ] < ∞ since s_{q_{θ_*}} and ∇_θ|_{θ=θ_*} s_{q_θ} are in L²(P_0). By Theorems 5.5.1A and 5.5.2 of Serfling (2009),

(1/(N(N−1))) Σ_{i≠j} ∇_θ u_{θ_*}(X^{(i)}, X^{(j)}) − E_{X,Y∼P_0}[∇_θ u_{θ_*}(X, Y)] = O_{P_0}(N^{−1/2}).

f_N(θ_*) − f_N(θ_N) = f′_N(θ_N)^⊤ (θ_* − θ_N) + (1/2) (θ_* − θ_N)^⊤ f″_N(θ_N^{++}) (θ_* − θ_N)
= (1/2) (θ_* − θ_N)^⊤ f″_N(θ_N^{++}) (θ_* − θ_N)

for all N sufficiently large, where θ_N^{++} is on the line between θ_N and θ_*. Therefore, using the same reasoning as for Equations 103 and 105,

|f_N(θ_*) − f_N(θ_N)| ≤ (1/2) ‖f″_N(θ_N^{++})‖ ‖θ_* − θ_N‖² = O_{P_0}(N^{−1}).    (106)
This proves the first part of the theorem (Equation 43). Next, consider f_N(θ_*) − f(θ_*). Recall that

f_N(θ_*) = (1/T) ( (1/(N(N−1))) Σ_{i≠j} u_{θ_*}(X^{(i)}, X^{(j)}) ) / ( (1/(N(N−1))) Σ_{i≠j} k(X^{(i)}, X^{(j)}) ).

It is straightforward to verify that E_{X,Y∼P_0}[ |u_{θ_*}(X, Y)|² ] < ∞ since s_{q_{θ_*}} is in L²(P_0). By Theorems 5.5.1A and 5.5.2 of Serfling (2009),

(1/(N(N−1))) Σ_{i≠j} u_{θ_*}(X^{(i)}, X^{(j)}) − E_{X,Y∼P_0}[u_{θ_*}(X, Y)] = O_{P_0}(N^{−1/2}).

It is straightforward to check that the second part of the theorem (Equation 44) follows.
For the third part, our argument follows that of the proof of Theorem 4.1 of Liu et al. (2016). Suppose nksd(p_0(x) ‖ q(x|θ_*)) = 0, and note that P_0 = Q_{θ_*} by Proposition 4. Given a differentiable function g : R^d → R^d, define ∇_x^⊤ g(x) := Σ_{i=1}^d ∂g_i(x)/∂x_i. Then

E_{X∼P_0}[u_{θ_*}(X, y)] = ∫_X s_{p_0}(y)^⊤ ( (∇_x p_0(x)) k(x, y) + p_0(x) (∇_x k(x, y)) ) dx + ∫_X ( (∇_x p_0(x))^⊤ ∇_y k(x, y) + p_0(x) (∇_x^⊤ ∇_y k(x, y)) ) dx
= s_{p_0}(y)^⊤ ∫_X ∇_x ( p_0(x) k(x, y) ) dx + ∫_X ∇_x^⊤ ∇_y ( p_0(x) k(x, y) ) dx.    (107)

The first term on the right-hand side of Equation 107 is zero since, by assumption, k is in the Stein class of P_0 (Condition 2). The second term is also zero since, by the Leibniz integral rule (Folland, 1999, Theorem 2.27), ∫ ∇_y^⊤ ∇_x (p_0(x) k(x, y)) dx = ∇_y^⊤ ∫ ∇_x (p_0(x) k(x, y)) dx, which again equals zero because k is in the Stein class of P_0. Therefore, E_{X∼P_0}[u_{θ_*}(X, y)] = 0 for all y ∈ X, and in particular, the variance of this expression is also zero: V_{Y∼P_0}[ E_{X∼P_0}[u_{θ_*}(X, Y)] ] = 0. By Theorem 5.5.2 of Serfling (2009), it follows that

(1/(N(N−1))) Σ_{i≠j} u_{θ_*}(X^{(i)}, X^{(j)}) = O_{P_0}(N^{−1})    (108)

since E_{X,Y∼P_0}[u_{θ_*}(X, Y)] = 0. Although Serfling (2009) requires V_{X,Y∼P_0}[u_{θ_*}(X, Y)] > 0, Equation 108 holds trivially if V_{X,Y∼P_0}[u_{θ_*}(X, Y)] = 0. As before, since the denominator of f_N(θ_*) converges a.s. to a finite positive constant, we have that f_N(θ_*) = O_{P_0}(N^{−1}). Equation 45 follows since f(θ_*) = 0 when nksd(p_0(x) ‖ q(x|θ_*)) = 0.
Since K_{j,N} = (2π/N)^{m_{B_j}/2} z_{j,N}, this implies

log K_{j,N} + N f_{j,N}(θ_{j,N}) − (1/2) (m_{F_j,j} + m_{B_j}) log(2π/N) + C_j → 0   a.s. as N → ∞.

By Theorem 12, f_{j,N}(θ_{j,N}) →_{P_0} f_j(θ_{j,*}), and therefore,

(1/N) log( K_{1,N} / K_{2,N} ) + f_1(θ_{1,*}) − f_2(θ_{2,*}) →_{P_0} 0.

Plugging in the definition of f_j (Equation 36), this proves part 1 of the theorem.

For part 2, suppose f_1(θ_{1,*}) = f_2(θ_{2,*}) = 0 and m_{B_2} − m_{B_1} does not depend on N. Then by Theorem 12, f_{j,N}(θ_{j,N}) = O_{P_0}(N^{−1}). Using this in Equation 109, we have

(1/(log N)) log( K_{1,N} / K_{2,N} ) + (1/2) (m_{F_1,1} + m_{B_1} − m_{F_2,2} − m_{B_2}) →_{P_0} 0.    (110)

For part 3, suppose f_1(θ_{1,*}) = f_2(θ_{2,*}) and m_{B_j} = c_{B_j} √N. Then by Theorem 12, f_{j,N}(θ_{j,N}) = f_j(θ_{j,*}) + O_{P_0}(N^{−1/2}). Using this in Equation 109, we have

(1/(√N log N)) log( K_{1,N} / K_{2,N} ) + (1/2) (c_{B_1} − c_{B_2}) →_{P_0} 0.    (111)
where I(E) is the indicator function, which equals 1 when E is true and is 0 otherwise. Define the scalars

K̄ := Σ_{i,j=1}^N K_{ij},
K̈ := Σ_{i,j=1}^N I(i ≠ j) Σ_{b=1}^d (∂²k/∂x_b ∂y_b)(X^{(i)}, X^{(j)}).

\widehat{nksd}(p_0(x) ‖ q(x|H, v)) = (1/K̄) ( trace( X^⊤ K X (HH^⊤ + vI_d)^{−1} (HH^⊤ + vI_d)^{−1} ) − 2 trace( X^⊤ K̇ (HH^⊤ + vI_d)^{−1} ) + K̈ ),

where we have used the fact that the kernel is symmetric. The terms X^⊤ K X and X^⊤ K̇ are the only ones that include sums over the entire data set; these can be pre-computed, before optimizing the parameters H and v.

To compute the matrix inversion (HH^⊤ + vI_d)^{−1} we follow the strategy of Minka (2001),

(HH^⊤ + vI_d)^{−1} − v^{−1} I_d = (HH^⊤ + vI_d)^{−1} ( I_d − v^{−1} (HH^⊤ + vI_d) )
= −(HH^⊤ + vI_d)^{−1} HH^⊤ v^{−1}
= −( U(L − vI_k)U^⊤ + vI_d )^{−1} U(L − vI_k)U^⊤ v^{−1}.

Therefore,

(HH^⊤ + vI_d)^{−1} = U(L^{−1} − v^{−1} I_k)U^⊤ + v^{−1} I_d.
Computing L^{−1} is trivial since the matrix is diagonal. Returning to the nksd, we have

\widehat{nksd}(p_0(x) ‖ q(x|U, L, v))
= (1/K̄) ( trace( X^⊤ K X [ U(L^{−1} − v^{−1}I_k)² U^⊤ + 2v^{−1} U(L^{−1} − v^{−1}I_k) U^⊤ + v^{−2} I_d ] ) − 2 trace( X^⊤ K̇ [ U(L^{−1} − v^{−1}I_k) U^⊤ + v^{−1} I_d ] ) + K̈ )
= (1/K̄) ( trace( U^⊤ X^⊤ K X U (L^{−1} − v^{−1}I_k)² ) + trace( U^⊤ [ 2v^{−1} X^⊤ K X − 2 X^⊤ K̇ ] U (L^{−1} − v^{−1}I_k) ) + v^{−1} trace( v^{−1} X^⊤ K X − 2 X^⊤ K̇ ) + K̈ ).
We optimized U , L and v using the trust region method implemented in pymanopt (Townsend
et al., 2016).
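The following is a small numerical check of the low-rank inversion identity above, (HH^⊤ + vI_d)^{−1} = U(L^{−1} − v^{−1}I_k)U^⊤ + v^{−1}I_d, under the assumption HH^⊤ = U(L − vI_k)U^⊤ with orthonormal U; dimensions and values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, v = 8, 3, 0.5
U, _ = np.linalg.qr(rng.normal(size=(d, k)))     # orthonormal columns
L = np.diag(np.array([4.0, 2.5, 1.2]))           # eigenvalues, all > v
H = U @ np.sqrt(L - v * np.eye(k))               # so that H H^T = U (L - v I) U^T

direct = np.linalg.inv(H @ H.T + v * np.eye(d))
lowrank = U @ (np.linalg.inv(L) - np.eye(k) / v) @ U.T + np.eye(d) / v
assert np.allclose(direct, lowrank)
```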
D.3 Calibration
The T hyperparameter was calibrated as in Section A.1. In detail, we sampled 10 indepen-
dent true parameter values from the prior, with α = 1 and d = 6. (We used a slightly less dispersed prior than during inference, where we set α = 0.1, to avoid numerical instabilities in the T̂ estimate.) Then, for each of the true parameter values, we simulated N = 2000
datapoints. For each simulated true parameter value, we tracked the trend in the T̂ es-
timator (Equation 55) with increasing N (Figure 11). The median estimated T value at
N = 2000 was 0.052 across the 10 runs.
Figure 11: Estimated T for increasing number of data samples, for 10 independent param-
eter samples from the prior. The median value at N = 2000 is T̂ = 0.052.
where F̃^{−1} is the inverse c.d.f. of some probability distribution. For all n ∈ {0, 1, 2, . . .} and all ε = (ε_1, . . . , ε_n) ∈ {0, 1}^n, let

Y_ε ∼ Beta(ξ_{ε0}, ξ_{ε1}),

where the ξ's are hyperparameters. We say that a random variable X ∈ R is distributed according to a Pólya tree model if

P(X ∈ B_ε) = ∏_{j=1}^n Y_{ε_{1:(j−1)}}^{I(ε_j = 0)} (1 − Y_{ε_{1:(j−1)}})^{I(ε_j = 1)},

where I(E) is the indicator function, which equals 1 when E is true and is 0 otherwise. We follow Berger and Guglielmi (2001) and use

μ(B_ε) := F( F̃^{−1}( Σ_{j=1}^n ε_j/2^j + 1/2^n ) ) − F( F̃^{−1}( Σ_{j=1}^n ε_j/2^j ) ),
ρ(ε) := (1/η) f( F̃^{−1}( Σ_{j=1}^n ε_j/2^j + 1/2^{n+1} ) )² / μ(B_ε),
ξ_{ε0} := ρ(ε) √( μ(B_{ε0}) / μ(B_{ε1}) ),
ξ_{ε1} := ρ(ε) √( μ(B_{ε1}) / μ(B_{ε0}) ),

where F and f are the c.d.f. and p.d.f. respectively of some probability distribution, and η > 0 is a scale hyperparameter. We denote this complete model as X ∼ PolyaTree(F, F̃, η).
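As a minimal sketch of the Pólya tree mass allocation described above, the following draws one Beta variable per internal node and computes cell probabilities by multiplying branch probabilities along each binary path; for simplicity it uses ξ = 1 everywhere rather than the Berger and Guglielmi (2001) hyperparameter choice, so it only illustrates the product structure.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
depth = 4
# One Beta draw per internal node, indexed by its binary path (root = empty tuple).
Y = {eps: rng.beta(1.0, 1.0) for n in range(depth) for eps in product((0, 1), repeat=n)}

def cell_probability(eps):
    # P(X in B_eps) = prod_j Y_{eps_{1:j-1}} if eps_j = 0, else (1 - Y_{eps_{1:j-1}})
    p = 1.0
    for j, bit in enumerate(eps):
        y = Y[eps[:j]]
        p *= y if bit == 0 else 1.0 - y
    return p

# Cell probabilities at any fixed depth sum to one.
assert np.isclose(sum(cell_probability(eps) for eps in product((0, 1), repeat=depth)), 1.0)
```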
Figure 12: Posterior mean interaction energies ∆E_{jj′} for all selected genes, sorted. Dotted lines show the thresholds for strong interactions (set by visual inspection).
where the expectation is taken with respect to I, where K(I) is the (estimated) SVC when genes with I_j = 1 are included in the foreground space, and φ = (φ_1, . . . , φ_d)^⊤ ∈ R^d is a vector of log-odds. This stochastic approach to discrete optimization has been used extensively in reinforcement learning and related fields. We use the Leave-One-Out REINFORCE (LOORF) estimator, as described in Section 2.1 of Dimitriev and Zhou (2021), to estimate gradients with respect to φ, using 8 samples per step.
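The following is a hedged sketch of the LOORF gradient estimator for the inclusion log-odds φ, assuming a black-box reward svc_estimate(mask) that stands in for the estimated SVC of a binary inclusion vector; all names and values are placeholders, and plain stochastic gradient ascent is used here rather than Adam.

```python
import numpy as np

def loorf_grad(phi, svc_estimate, n_samples=8, rng=np.random.default_rng(0)):
    probs = 1.0 / (1.0 + np.exp(-phi))                       # sigma(phi)
    masks = (rng.random((n_samples, phi.size)) < probs).astype(float)
    rewards = np.array([svc_estimate(m) for m in masks])
    centered = rewards - rewards.mean()                      # leave-one-out baseline
    # grad of log q(mask | phi) for independent Bernoulli(sigma(phi)) is (mask - probs)
    score = masks - probs
    return (centered[:, None] * score).sum(axis=0) / (n_samples - 1)

# Toy usage: ascend phi so that genes 0 and 1 tend to be included.
def svc_estimate(mask):
    return mask[0] + mask[1] - 0.5 * mask[2:].sum()          # stand-in for the estimated SVC

phi = np.zeros(5)
for _ in range(500):
    phi += 0.1 * loorf_grad(phi, svc_estimate)
```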
We interleave updates to the variational approximation and to φ, using the Adam opti-
mizer with step size 0.01 for each. We ran the procedure with 4 random initial seeds, taking
the result with the largest final estimated SVC. We halt optimization using the stopping
rule proposed in Grathwohl et al. (2020), stopping when the estimated mean minus the
estimated variance of the SVC begins to decrease, based on the average over 2000 steps.
Code is available at https://github.com/EWeinstein/data-selection.
Figure 13: Posterior mean interaction energies ∆E_{jj′} for the spin glass model applied to all 200 genes in the MALT data set (rather than the selected 187). Genes shown are the same as in Figure 8, for visual comparison.
among the top 500 most variable genes, according to the Scprep variability score. We log-transform the counts, that is, we define x_{ij} = log(1 + c_{ij}), where c_{ij} is the expression count for gene j in cell i.
Figure 14: Comparison of the 187 selected genes and 13 excluded genes using data selection. (a) Violin plot of σ̄_j over all excluded and selected genes j, respectively, when applying the model to all 200 genes, where σ̄_j is the mean posterior standard deviation of the interaction energies ∆E_{jj′} for gene j, that is, σ̄_j := (1/(d−1)) Σ_{j′≠j} std(∆E_{jj′} | data). (b) Violin plot of f_j over all excluded and selected genes j, respectively, where f_j is the fraction of cells with count equal to zero for gene j. The data selection procedure excluded all genes with more than 85% zeros and selected all genes with fewer than 85% zeros.
References
Uri Alon. An Introduction to Systems Biology: Design Principles of Biological Circuits.
CRC Press, July 2019.
Alessandro Barp, Francois-Xavier Briol, Andrew B Duncan, Mark Girolami, and Lester
Mackey. Minimum Stein discrepancy estimators. arXiv preprint arXiv:1906.08283, June
2019.
Andrew R Barron. Uniformly powerful goodness of fit tests. The Annals of Statistics, 17
(1):107–124, 1989.
Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark
Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine
Learning Research, 18(153), 2018.
James O Berger and Alessandra Guglielmi. Bayesian and conditional frequentist testing of a
parametric model versus nonparametric alternatives. Journal of the American Statistical
Association, 96(453):174–184, 2001.
Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan,
Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman.
Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research,
20(28):1–6, 2019.
Pier G Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating
belief distributions. J. R. Stat. Soc. Series B Stat. Methodol., 78(5):1103–1130, November
2016.
David M Blei. Build, compute, critique, repeat: Data analysis with latent variable models.
Annual Review of Statistics and Its Application, 1(1):203–232, 2014.
Ariel Caticha. Relative entropy and inductive inference. AIP Conference Proceedings, 707
(1):75–96, 2004.
Haifen Chen, Jing Guo, Shital K Mishra, Paul Robson, Mahesan Niranjan, and Jie Zheng.
Single-cell transcriptional analysis to uncover regulatory circuits driving cell fate decisions
in early mouse development. Bioinformatics, 31(7):1060–1066, April 2015.
Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness
of fit. In International Conference on Machine Learning (ICML), pages 2606–2615, 2016.
Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray
gene expression data. Journal of Bioinformatics and Computational Biology, 3(2):185–
205, April 2005.
Kjell A Doksum and Albert Y Lo. Consistent and robust Bayes procedures for location
based on partial information. The Annals of Statistics, 18(1):443–453, 1990.
David Duvenaud, Daniel Eaton, Kevin Murphy, and Mark Schmidt. Causal learning without
DAGs. In NeurIPS workshop on causality, 2008.
Gerald B Folland. Real Analysis: Modern Techniques and Their Applications. John Wiley
& Sons, 1999.
Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303
(5659):799–805, February 2004.
Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe’er. Using bayesian networks to
analyze expression data. J. Comput. Biol., 7(3-4):601–620, 2000.
Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B
Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
Scott Gigante, Daniel Burkhardt, Daniel Dager, Jay Stanley, and Alexander Tong. scprep.
https://github.com/KrishnaswamyLab/scprep, 2020.
Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Brod-
erick. A Swiss Army infinitesimal jackknife. In International Conference on Artificial
Intelligence and Statistics (AISTATS), pages 1139–1147. PMLR, 2019.
Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In Interna-
tional Conference on Machine Learning (ICML), pages 1292–1301, Sydney, NSW, Aus-
tralia, 2017.
Will Grathwohl, Kuan-Chieh Wang, Jorn-Henrik Jacobsen, David Duvenaud, and Richard
Zemel. Learning the Stein discrepancy for training and evaluating energy-based models
without sampling. In International Conference on Machine Learning (ICML), 2020.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander
Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773,
2012.
László Györfi and Edward C Van Der Meulen. A consistent goodness of fit test based on the
total variation distance. In George Roussas, editor, Nonparametric Functional Estimation
and Related Topics, pages 631–645. Springer Netherlands, Dordrecht, 1991.
Stephanie C Hicks, F William Townes, Mingxiang Teng, and Rafael A Irizarry. Missing
data and technical variability in single-cell RNA-sequencing experiments. Biostatistics,
19(4):562–578, October 2018.
Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational
inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
Han Hong and Bruce Preston. Nonnested model selection criteria. 2005.
Peter J Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985.
Jonathan H Huggins and Lester Mackey. Random feature stein discrepancies. In Advances
in Neural Information Processing Systems (NeurIPS), 2018.
Jonathan H Huggins and Jeffrey W Miller. Reproducible model selection using bagged
posteriors. arXiv preprint arXiv:2007.14845, 2021.
Vân Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. Inferring
regulatory networks from expression data using tree-based methods. PLoS One, 5(9),
September 2010.
Pierre E Jacob, Lawrence M Murray, Chris C Holmes, and Christian P Robert. Better to-
gether? Statistical learning in models made of modules. arXiv preprint arXiv:1708.08719,
2017.
Jack Jewson, Jim Q Smith, and Chris Holmes. Principles of Bayesian inference using general
divergence criteria. Entropy, 20(6):442, 2018.
Wenxin Jiang and Martin A Tanner. Gibbs posterior for variable selection in high-
dimensional classification and data mining. The Annals of Statistics, 36(5):2207–2231,
2008.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR,
2015.
Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence func-
tions. In International Conference on Machine Learning (ICML), 2017.
Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a
baseline for free. In ICLR Workshop: Deep Reinforcement Learning Meets Structured
Prediction, 2019.
Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei.
Automatic differentiation variational inference. Journal of Machine Learning Research,
18:1–45, January 2017.
Michael Lavine. Some aspects of Polya tree distributions for statistical modelling. The
Annals of Statistics, 20(3):1222–1235, 1992.
John R Lewis, Steven N MacEachern, and Yoonkyung Lee. Bayesian restricted likelihood
methods: Conditioning on insufficient statistics in Bayesian regression. Bayesian Analy-
sis, 1(1):1–38, 2021.
Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric esti-
mation of high dimensional undirected graphs. Journal of Machine Learning Research,
10(Oct):2295–2328, 2009.
Qiang Liu, Jason D Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-
of-fit tests. In International Conference on Machine Learning, volume 33, pages 276–284,
2016.
Takuo Matsubara, Jeremias Knoblauch, François-Xavier Briol, and Chris J. Oates. Robust
generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 84(3):997–1022, 2022.
Hirotaka Matsumoto, Hisanori Kiryu, Chikara Furusawa, Minoru S H Ko, Shigeru B H Ko,
Norio Gouda, Tetsutaro Hayashi, and Itoshi Nikaido. SCODE: an efficient regulatory net-
work inference algorithm from single-cell RNA-Seq during differentiation. Bioinformatics,
33(15):2314–2321, August 2017.
R Daniel Mauldin, William D Sudderth, and S C Williams. Polya trees and random distri-
butions. The Annals of Statistics, 20(3):1203–1221, 1992.
Jeffrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal
of the American Statistical Association, 114(527):1113–1125, 2019.
Thomas Minka. Old and new matrix algebra useful for statistics, 2000.
Victoria Moignard, Steven Woodhouse, Laleh Haghverdi, Andrew J Lilly, Yosuke Tanaka,
Adam C Wilkinson, Florian Buettner, Iain C Macaulay, Wajid Jawaid, Evangelia Dia-
manti, Shin-Ichi Nishikawa, Nir Piterman, Valerie Kouskoff, Fabian J Theis, Jasmin
Fisher, and Berthold Göttgens. Decoding the regulatory network of early blood de-
velopment from single-cell gene expression measurements. Nature Biotechnology, 33(3):
269–276, March 2015.
Emma Pierson and Christopher Yau. ZIFA: Dimensionality reduction for zero-inflated
single-cell gene expression analysis. Genome Biology, 16:241, November 2015.
Jim Pitman. Combinatorial stochastic processes. Technical Report 621, Dept of Statistics,
UC Berkeley, 2002.
Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah A Pliner, and Cole
Trapnell. Reversed graph embedding resolves complex single-cell trajectories. Nature
Methods, 14(10):979, 2017.
Robert J Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons,
September 2009.
Alex K Shalek, Rahul Satija, Xian Adiconis, Rona S Gertner, Jellert T Gaublomme, Rak-
tima Raychowdhury, Schraga Schwartz, Nir Yosef, Christine Malboeuf, Diana Lu, John J
Trombetta, Dave Gennert, Andreas Gnirke, Alon Goren, Nir Hacohen, Joshua Z Levin,
Hongkun Park, and Aviv Regev. Single-cell transcriptomics reveals bimodality in expres-
sion and splicing in immune cells. Nature, 498(7453):236–240, June 2013.
Stephane Shao, Pierre E Jacob, Jie Ding, and Vahid Tarokh. Bayesian model compari-
son with the Hyvärinen score: Computation and consistency. Journal of the American
Statistical Association, pages 1–24, September 2018.
Zakary S Singer, John Yong, Julia Tischler, Jamie A Hackett, Alphan Altinok, M Azim
Surani, Long Cai, and Michael B Elowitz. Dynamic heterogeneity and DNA methylation
in embryonic stem cells. Molecular Cell, 55(2):319–331, July 2014.
Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert
R G Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal
of Machine Learning Research, 11:1517–1561, 2010.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science &
Business Media, September 2008.
Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi,
William M Mauck, 3rd, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija.
Comprehensive integration of single-cell data. Cell, 177(7):1888–1902.e21, June 2019.
James Townsend, Niklas Koep, and Sebastian Weichwald. Pymanopt: A Python toolbox for
optimization on manifolds using automatic differentiation. Journal of Machine Learning
Research, 17(1):4755–4759, 2016.
David van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J
Carr, Cassandra Burdziak, Kevin R Moon, Christine L Chaffer, Diwakar Pattabiraman,
Brian Bierie, Linas Mazutis, Guy Wolf, Smita Krishnaswamy, and Dana Pe’er. Recovering
gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729.e27, July
2018.
Cristiano Varin, Nancy Reid, and David Firth. An overview of composite likelihood meth-
ods. Statistica Sinica, 21(1):5–42, January 2011.
Isabella Verdinelli and Larry Wasserman. Bayesian goodness-of-fit testing using infinite-
dimensional exponential families. The Annals of Statistics, 26(4):1215–1241, August 1998.
Quang H Vuong. Likelihood ratio tests for model selection and non-nested hypotheses.
Econometrica: Journal of the Econometric Society, 57(2):307–333, 1989.
F Alexander Wolf, Philipp Angerer, and Fabian J Theis. SCANPY: Large-scale single-cell
gene expression data analysis. Genome Biology, 19(1):15, February 2018.
Zijun Y Xu-Monette, Ling Li, John C Byrd, Kausar J Jabbar, Ganiraju C Manyam, Char-
lotte Maria de Winde, Michiel van den Brand, Alexandar Tzankov, Carlo Visco, Jing
Wang, Karen Dybkaer, April Chiu, Attilio Orazi, Youli Zu, Govind Bhagat, Kristy L
Richards, Eric D Hsi, William W L Choi, Jooryung Huh, Maurilio Ponzoni, Andrés J M
Ferreri, Michael B Møller, Ben M Parsons, Jane N Winter, Michael Wang, Frederick B
Hagemeister, Miguel A Piris, J Han van Krieken, L Jeffrey Medeiros, Yong Li, Anne-
miek B van Spriel, and Ken H Young. Assessment of CD37 B-cell antigen and cell of
origin significantly improves risk prediction in diffuse large B-cell lymphoma. Blood, 128
(26):3083–3100, December 2016.
Daniel Yekutieli. Adjusted Bayesian inference for selected parameters. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 74(3):515–541, 2012.
In-Kwon Yeo and Richard A Johnson. A uniform strong law of large numbers for U-statistics
with application to transforming to near symmetry. Statistics & Probability Letters, 51
(1):63–69, 2001.
Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE
Transactions on Information Theory, 52(4):1307–1321, 2006a.