On The Optimality of Kernel-Embedding Based Goodness-of-Fit Tests
Balasubramanian, Li and Yuan
Abstract
The reproducing kernel Hilbert space (RKHS) embedding of distributions offers a gen-
eral and flexible framework for testing problems in arbitrary domains and has attracted a
considerable amount of attention in recent years. To gain insights into their operating char-
acteristics, we study here the statistical performance of such approaches within a minimax
framework. Focusing on the case of goodness-of-fit tests, our analyses show that a vanilla
version of the kernel embedding based test could be minimax suboptimal, when considering
χ2 distance as the separation metric. Hence we suggest a simple remedy by moderating
the embedding. We prove that the moderated approach provides optimal tests for a wide
range of deviations from the null and can also be made adaptive over a large collection
of interpolation spaces. Numerical experiments are presented to further demonstrate the
merits of our approach.
Keywords: Adaptation, goodness of fit, maximum mean discrepancy, optimal rates of
convergence, reproducing kernel Hilbert space.
1. Introduction
In recent years, statistical tests based on the reproducing kernel Hilbert space (RKHS)
embedding of distributions have attracted much attention because of their flexibility and
broad applicability. Like other kernel methods, RKHS embedding based tests present a
general and unifying framework for testing problems in arbitrary domains by using appro-
priate kernels defined on those domains. See Muandet et al. (2017) for a detailed review
of kernel embedding and its applications. The idea of using kernel embedding for compar-
ing probability distributions was initially introduced by Smola et al. (2007); Gretton et al.
(2007, 2012a). Related extensions were also proposed by Harchaoui et al. (2007); Zaremba
et al. (2013). Furthermore, Sejdinovic et al. (2013) established a close relationship between
kernel-based hypothesis tests and the energy distance based test introduced by Székely et al.
(2007). See also Lyons (2013). More recently, motivated by several applications based on
quantifying the convergence of Monte Carlo simulations, Liu et al. (2016), Chwialkowski
et al. (2016) and Gorham and Mackey (2017) proposed goodness-of-fit tests which were
based on combining the kernel based approach with Stein's identity. A linear-time method
for goodness-of-fit testing was also proposed recently by Jitkrittum et al. (2017). Finally, the idea
of kernel embedding has also been used for constructing implicit generative models (e.g.,
Dziugaite et al., 2015; Li et al., 2015).
Despite their popularity, fairly little is known about the statistical performance of these
kernel embedding based tests. Our goal is to fill this void. In particular, we focus on kernel
embedding based goodness-of-fit tests and investigate their power under a general composite
alternative. Our results not only provide new insights on the operating characteristics of
these kernel embedding based tests but also suggest improved testing procedures that are
minimax optimal and adaptive over a large collection of alternatives, when considering χ2
distance as the separation metric.
More specifically, let X1 , · · · , Xn be n independent X -valued observations from a cer-
tain probability measure P . We are interested in testing if the hypothesis H0 : P = P0
holds for a fixed $P_0$. Problems of this kind have a long and illustrious history in statistics
and are often associated with household names such as the Kolmogorov-Smirnov test, Pearson's
chi-square test or Neyman's smooth test. A plethora of other techniques have also been
proposed over the years in both parametric and nonparametric settings (e.g., Ingster and
Suslina, 2003; Lehmann and Romano, 2008). Most of the existing techniques are developed
with the domain $\mathcal{X} = \mathbb{R}$ or $[0, 1]$ in mind and work best in these cases. Modern
applications, however, oftentimes involve domains different from these traditional ones. For
example, when dealing with directional data, which arise naturally in applications such as
diffusion tensor imaging, it is natural to consider X as the unit sphere in R3 (e.g., Jupp,
2005). Another example occurs in the context of ranking or preference data (e.g., Ailon
et al., 2008). In these cases, X can be taken as the group of permutations. Furthermore,
motivated by several applications, combinatorial testing problems have been investigated
recently (e.g., Addario-Berry et al., 2010), where the spaces under consideration are specific
combinatorially structured spaces.
A particularly attractive approach to goodness-of-fit testing problems in general domains
is through RKHS embedding of distributions. Specifically, let K : X × X → R be a
symmetric positive (semi-)definite kernel. The Moore-Aronszajn Theorem indicates that
there is an RKHS, denoted by $(\mathcal{H}(K), \langle \cdot, \cdot \rangle_K)$, uniquely identified with the kernel $K$ (e.g.,
Aronszajn, 1950). The RKHS embedding of a probability measure P with respect to K is
given by
$$\mu_P(\cdot) := \int_{\mathcal{X}} K(x, \cdot)\, dP(x).$$
It is well known that, under mild regularity conditions, $\mu_P \in \mathcal{H}(K)$ and, furthermore,
$$\mathbb{E}_P f(X) = \langle \mu_P, f \rangle_K, \quad \forall\, f \in \mathcal{H}(K),$$
where EP signifies that the expectation is taken over X ∼ P . The so-called maximum mean
discrepancy (MMD) between two probability measures P and Q is defined as
$$\gamma_K(P, Q) := \sup_{f \in \mathcal{H}(K):\, \|f\|_K \le 1} \int_{\mathcal{X}} f(x)\, d(P - Q)(x),$$
where $\|\cdot\|_K$ is the norm associated with $(\mathcal{H}(K), \langle\cdot,\cdot\rangle_K)$. It is not hard to see that
$$\gamma_K(P, Q) = \|\mu_P - \mu_Q\|_K.$$
See, e.g., Sriperumbudur et al. (2010) or Gretton et al. (2012a) for details. The goodness-
of-fit test can be carried out conveniently through RKHS embeddings of P and P0 by first
constructing an estimate of γK (P, P0 ):
$$\gamma_K(\widehat{P}_n, P_0) := \sup_{f \in \mathcal{H}(K):\, \|f\|_K \le 1} \int_{\mathcal{X}} f(x)\, d(\widehat{P}_n - P_0)(x),$$
where Pbn is the empirical distribution of X1 , · · · , Xn , and then rejecting H0 if the estimate
exceeds a threshold calibrated to ensure a certain significance level, say α (0 < α < 1).
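As a concrete illustration, the plug-in estimate $\gamma_K^2(\widehat{P}_n, P_0)$ can be computed from kernel evaluations alone. The following is a minimal sketch, assuming a Gaussian kernel and approximating expectations under $P_0$ by Monte Carlo draws; the kernel choice, bandwidth, and sampling scheme are illustrative, not prescribed here:

```python
import numpy as np

def gaussian_kernel(x, y, bw=1.0):
    """Gaussian (RBF) kernel matrix between rows of x and rows of y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

def mmd_sq(sample, null_sample, bw=1.0):
    """Plug-in estimate of gamma_K^2(P_n, P0) = ||mu_{P_n} - mu_{P0}||_K^2.
    Expectations under P0 are approximated by Monte Carlo draws."""
    kxx = gaussian_kernel(sample, sample, bw).mean()
    kxy = gaussian_kernel(sample, null_sample, bw).mean()
    kyy = gaussian_kernel(null_sample, null_sample, bw).mean()
    return kxx - 2.0 * kxy + kyy  # a squared RKHS norm, hence nonnegative

rng = np.random.default_rng(0)
x = rng.uniform(size=(200, 1))   # observed sample, here drawn from P0 itself
y = rng.uniform(size=(2000, 1))  # Monte Carlo draws from P0 = U[0, 1]
stat = mmd_sq(x, y, bw=0.2)      # small under H0
```

Under $H_0$ the statistic is of order $1/n$, which is exactly what makes the calibrated threshold of order $1/n$ in the test described above.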
In this paper, we investigate the power of the above discussed testing strategy under a
general composite alternative. Following the spirit of Ingster and Suslina (2003), we consider
in particular a set of alternatives that are increasingly close to the null hypothesis. To fix
ideas, we assume hereafter that P is dominated by P0 under the alternative so that the
Radon-Nikodym derivative dP/dP0 is well defined. Recall that the χ2 divergence between
P and P0 is defined as
$$\chi^2(P, P_0) := \int_{\mathcal{X}} \left(\frac{dP}{dP_0}\right)^2 dP_0 - 1.$$
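For intuition, in the discrete case the $\chi^2$ divergence reduces to a finite sum, which the following minimal sketch (with illustrative distributions) computes directly from the definition:

```python
import numpy as np

def chi_sq_divergence(p, p0):
    """chi^2(P, P0) = sum_i (p_i / p0_i)^2 * p0_i - 1 = sum_i p_i^2 / p0_i - 1,
    for a discrete P dominated by P0 (p0_i > 0 wherever p_i > 0)."""
    p, p0 = np.asarray(p, dtype=float), np.asarray(p0, dtype=float)
    return float(np.sum(p ** 2 / p0) - 1.0)

print(chi_sq_divergence([0.5, 0.5], [0.5, 0.5]))    # 0.0 when P = P0
print(chi_sq_divergence([0.5, 0.5], [0.25, 0.75]))  # 1/3, up to floating point
```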
We are particularly interested in the detection boundary, namely how close P and P0 can
be in terms of χ2 distance, under the alternative, so that a test based on a sample of n
observations can still consistently distinguish between the null hypothesis and the alternative. For example, in the parametric setting where $P$ is known up to a finite-dimensional parameter under the alternative, the detection boundary of the likelihood ratio test is $n^{-1}$
under mild regularity conditions (e.g., Theorem 13.5.4 in Lehmann and Romano, 2008, and
the discussion leading to it). We are concerned here with alternatives that are nonparametric in nature. Our first result suggests that the detection boundary for the aforementioned $\gamma_K(\widehat{P}_n, P_0)$ based test is of the order $n^{-1/2}$. However, our main results indicate, perhaps surprisingly at first, that this rate is far from optimal and that the gap between it and the usual parametric rate can be largely bridged.
In particular, we argue that the distinguishability between P and P0 depends on how
close u := dP/dP0 − 1 is to the RKHS H(K). The closeness of u to H(K) can be measured
by the distance from u to an arbitrary ball in H(K). In particular, we shall consider the case
where H(K) is dense in L2 (P0 ), and focus on functions that are polynomially approximable
by H(K) for concreteness. More precisely, for some constants M, θ > 0, denote by F(θ; M )
the collection of functions f ∈ L2 (P0 ) such that for any R > 0, there exists an fR ∈ H(K)
such that
$$\|f_R\|_K \le R, \quad \text{and} \quad \|f - f_R\|_{L_2(P_0)} \le M R^{-1/\theta}.$$
We also adopt the convention that
$$\mathcal{F}(0; M) = \{f \in \mathcal{H}(K) : \|f\|_K \le M\}.$$
See Section 5 for a more concrete example of the space F(θ; M ) when H(K) is the usual
Sobolev space defined over [0, 1]. Interested readers are also referred to Cucker and Zhou
(2007) for further discussion on these so-called interpolation spaces and their use in statis-
tical learning.
We investigate the optimal rate of detection for testing H0 : P = P0 against
$$H_1(\Delta_n, \theta, M) : P \in \mathcal{P}(\Delta_n, \theta, M), \qquad (1)$$
where $\mathcal{P}(\Delta_n, \theta, M)$ is the collection of distributions $P$ on $(\mathcal{X}, \mathcal{B})$ satisfying:
$$dP/dP_0 - 1 \in \mathcal{F}(\theta; M), \quad \text{and} \quad \chi^2(P, P_0) \ge \Delta_n.$$
We call $r_n$ the optimal rate of detection if for any $c > 0$ there exists no consistent test whenever $\Delta_n \le c\, r_n$; and, on the other hand, a consistent test exists as long as $\Delta_n \gg r_n$.
Although one could consider a more general setup, for concreteness, we assume that the eigenvalues of $K$ with respect to $L_2(P_0)$ decay polynomially in that $\lambda_k \asymp k^{-2s}$. We show that the optimal rate of detection for testing $H_0$ against $H_1(\Delta_n, \theta, M)$ for any $\theta \ge 0$ is $n^{-\frac{4s}{4s+\theta+1}}$. This rate of detection, although not achievable with a $\gamma_K(\widehat{P}_n, P_0)$ based test, can
be attained via a moderated version of the MMD based approach. A practical challenge to
the approach, however, is its reliance on the knowledge of θ. Unlike s which is determined
by K and P0 and therefore known a priori, θ depends on u and is not known in advance.
This naturally brings about the issue of adaptation: is there an agnostic approach that can
adaptively attain the optimal detection boundary without the knowledge of $\theta$? We show
that the answer is affirmative although a small price in the form of log log n is required to
achieve such adaptation.
The minimax framework we considered connects our work with the extensive statistics
literature on minimax hypothesis testing. See, e.g., Ingster (1987); Ermakov (1991); Ingster
(1993); Spokoiny (1996); Lepski and Spokoiny (1999); Ingster and Suslina (2000); Ingster
(2000); Baraud (2002); Fromont and Laurent (2006); Fromont et al. (2012, 2013), among
many others. As is customary in other areas of nonparametric statistics, these works usu-
ally start by characterizing function spaces for the alternatives such as Hölder, Sobolev or
more generally Besov spaces, followed by devising an optimal testing procedure according
to the specific class of alternatives. This creates a subtle yet important difference from
the kernel based approaches where a method or algorithm is developed first with a par-
ticular kernel often specific to the applications in mind, and it is therefore of interest to
investigate a posteriori the performance of such a method. The connection between the
two paradigms is well understood in the context of supervised learning thanks to the one-
to-one correspondence between a kernel and a RKHS: the two approaches are essentially
equivalent in that kernel methods with appropriate regularization could achieve minimax
optimality when considering functions from the RKHS with which the kernel identifies. See,
e.g., Wahba (1990); Scholkopf and Smola (2001). In a sense, our work establishes a similar
relationship in the context of kernel methods for hypothesis testing. Indeed, to achieve the
aforementioned optimal rate of detection, we introduce a class of tests based on a modified
MMD similar in spirit to that from Harchaoui et al. (2007). The modification we applied
is akin to the RKHS norm regularization commonly used for supervised learning. The key
idea is to allow the kernel in MMD to evolve with the number of observations so that MMD
becomes increasingly similar to the χ2 distance.
The rest of the paper is organized as follows. We first analyze the power of MMD based
tests in Section 2. This analysis reveals a significant gap between the detection boundary
achieved by the MMD based test and the usual parametric 1/n rate. In turn, this prompts
us to introduce, in Section 3, a class of tests based on a modified MMD that are rate optimal.
To address the practical challenge of choosing an appropriate tuning parameter for these
tests, we investigate the issue of optimal adaptation in Section 4, where we establish the
optimal rates of detection for adaptively testing H0 against a broader set of alternatives
and propose a test based on the modified MMD that can attain these rates. To further
illustrate the implications of our results, we consider in Section 5 the specific case of Sobolev
kernels and compare our results with those known for nonparametric testing within Sobolev
spaces. Numerical experiments are presented in Section 6. We conclude with some remarks
in Section 7. All proofs are relegated to Section 8.
Write
$$\bar{K}(x, x') := K(x, x') - \mathbb{E}_{P_0} K(X, x') - \mathbb{E}_{P_0} K(x, X') + \mathbb{E}_{P_0} K(X, X'),$$
where the subscript $P_0$ signifies the fact that the expectation is taken over $X, X' \sim P_0$ independently. By (4), $\gamma_K^2(P, P_0) = \gamma_{\bar{K}}^2(P, P_0)$. Therefore, without loss of generality, we can focus on kernels that are degenerate under $P_0$, i.e.,
$$\mathbb{E}_{P_0} K(X, x') = 0 \quad \text{for all } x' \in \mathcal{X}. \qquad (5)$$
For brevity, we shall omit the subscript K in γ in the rest of the paper, unless it is nec-
essary to emphasize the dependence of MMD on the reproducing kernel. Passing from a
nondegenerate kernel to a degenerate one however presents a subtlety regarding universality.
Universality of a kernel is essential for MMD by ensuring that dP/dP0 −1 resides in the linear
space spanned by its eigenfunctions. See, e.g., Steinwart (2001) for the definition of uni-
versal kernel and Sriperumbudur et al. (2011) for a detailed discussion of different types of
universality. Observe that dP/dP0 − 1 necessarily lies in the orthogonal complement of con-
stant functions in L2 (P0 ). A degenerate kernel K is universal if its eigenfunctions {ϕk }k≥1
form an orthonormal basis of the orthogonal complement of linear space {c · ϕ0 : c ∈ R}
where ϕ0 (x) = 1 in L2 (P0 ). In what follows, we shall assume that K is both degenerate
and universal.
For the sake of concreteness, we shall also assume that $K$ has infinitely many positive eigenvalues decaying polynomially, i.e.,
$$\lambda_k \asymp k^{-2s} \qquad (6)$$
for some $s > 1/2$. In addition, we also assume that the eigenfunctions of $K$ are uniformly bounded, i.e.,
$$\sup_{k \ge 1} \|\varphi_k\|_\infty < \infty. \qquad (7)$$
Assumption (7) is satisfied by many commonly used kernels, such as those associated with the Sobolev spaces we shall describe in further detail in Section 5. Together with Assumption (6), Assumption (7) ensures that Mercer's decomposition (2) holds uniformly. In general, however, there are also situations under which (7) may not hold. For example, when considering discrete and countable domains, several standard kernels do not satisfy Assumption (7); see, e.g.,
2.24 in Scholkopf and Smola (2001), p. 58. Although it is plausible that most if not all of
our results will continue to hold even without the uniform boundedness of ϕk s, a rigorous
argument to do away with it has so far eluded us.
Note that (5) implies EP0 ϕk (X) = 0, ∀ k ≥ 1. The uniform convergence in (2) together
with (3) give
$$\gamma^2(P, P_0) = \sum_{k \ge 1} \lambda_k \left[\mathbb{E}_P \varphi_k(X)\right]^2$$
for any P . Accordingly, when P is replaced by the empirical distribution Pbn , the empirical
squared MMD can be expressed as
" n #2
2 b
X 1X
γ (Pn , P0 ) = λk ϕk (Xi ) .
n
k≥1 i=1
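This expansion can be evaluated directly once an eigensystem is available. The following minimal sketch truncates the sum; the trigonometric eigensystem used is the one for the periodic Sobolev kernel under $P_0 = U[0,1]$ discussed in Section 5, with $s = 1$ (the truncation level and sample size are illustrative):

```python
import numpy as np

def empirical_mmd_sq(x, lam, phi, n_terms=200):
    """Truncated eigen-expansion of gamma^2(P_n, P0):
    sum_k lam_k * (n^{-1} sum_i phi_k(x_i))^2."""
    return sum(lam(k) * np.mean(phi(k, x)) ** 2 for k in range(1, n_terms + 1))

# Eigensystem under P0 = U[0, 1]: lambda_{2j-1} = lambda_{2j} = (2 pi j)^{-2s}, s = 1,
# with paired cosine/sine eigenfunctions.
lam = lambda k: (2 * np.pi * ((k + 1) // 2)) ** -2.0
def phi(k, x):
    j = (k + 1) // 2
    trig = np.cos if k % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * j * x)

rng = np.random.default_rng(1)
stat = empirical_mmd_sq(rng.uniform(size=500), lam, phi)  # small under H0
```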
It can be shown that, under $H_0$,
$$n\gamma^2(\widehat{P}_n, P_0) \xrightarrow{d} W := \sum_{k \ge 1} \lambda_k Z_k^2,$$
where $Z_k \overset{\text{i.i.d.}}{\sim} N(0, 1)$. Let $T_{\mathrm{MMD}}$ be an MMD based test, which rejects $H_0$ if and only if $n\gamma^2(\widehat{P}_n, P_0)$ exceeds the $1 - \alpha$ quantile $q_{w,1-\alpha}$ of $W$. The above limiting distribution of $n\gamma^2(\widehat{P}_n, P_0)$ immediately suggests that $T_{\mathrm{MMD}}$ is an asymptotic $\alpha$-level test.
The type II error of a test $T$ over a collection $\mathcal{P}$ of alternatives is
$$\beta(T; \mathcal{P}) := \sup_{P \in \mathcal{P}} \mathbb{E}_P (1 - T),$$
where $\mathbb{E}_P$ means taking expectation over $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P$. For brevity, we shall write $\beta(T; \Delta_n, \theta, M)$ instead of $\beta(T; \mathcal{P}(\Delta_n, \theta, M))$ in what follows. The performance of a test $T$
can then be evaluated by its detection boundary, that is, the smallest ∆n under which the
type II error converges to 0 as n → ∞. Our first result establishes the convergence rate
of the detection boundary for TMMD in the case when θ = 0. Hereafter, we abbreviate M
in P(∆n , θ, M ), H1 (∆n , θ, M ) and β(T ; ∆n , θ, M ), unless it is necessary to emphasize the
dependence.
In particular, whenever $n^{1/2} \Delta_n \to \infty$,
$$\beta(T_{\mathrm{MMD}}; \Delta_n, 0) \to 0 \quad \text{as } n \to \infty.$$
Theorem 1 shows that when the alternative H1 (∆n , 0) is considered, the detection
boundary of TMMD is of the order n−1/2 . It is of interest to compare the detection rate
achieved by TMMD with that in a parametric setting where consistent tests are available if
n∆n → ∞. See, e.g., Theorem 13.5.4 in Lehmann and Romano (2008) and the discussion
leading to it. It is natural to ask to what extent such a gap can be attributed to the fundamental difference between parametric and nonparametric testing problems. When considering $\chi^2$ distance as the separation metric, we shall now argue that this gap is largely due to the sub-optimality of $T_{\mathrm{MMD}}$, and that the detection boundary of $T_{\mathrm{MMD}}$ could be significantly improved through a slight modification of the MMD.
To improve upon $T_{\mathrm{MMD}}$, consider
$$\eta_{K,\varrho}(P, Q; P_0) := \sup_{f \in \mathcal{H}(K):\, \varrho^2 \|f\|_K^2 + \|f\|_{L_2(P_0)}^2 \le 1} \int_{\mathcal{X}} f\, d(P - Q) \qquad (8)$$
for a given distribution $P_0$ and a constant $\varrho > 0$. A distance between probability measures of this type was first introduced by Harchaoui et al. (2007) when considering kernel methods for the two-sample test. A subtle difference between $\eta_{K,\varrho}(P, Q; P_0)$ and the distance from Harchaoui et al. (2007) is the set of $f$ that we optimize over on the right-hand side of (8). In the case of the two-sample test, there is no information about $P_0$ and therefore one needs to replace the norm $\|\cdot\|_{L_2(P_0)}$ with the empirical $L_2$ norm.
It is worth noting that $\eta_{K,\varrho}(P, Q; P_0)$ can also be identified with a particular type of MMD. Specifically, $\eta_{K,\varrho}(P, Q; P_0) = \gamma_{\widetilde{K}_\varrho}(P, Q)$, where
$$\widetilde{K}_\varrho(x, x') := \sum_{k \ge 1} \frac{\lambda_k}{\lambda_k + \varrho^2}\, \varphi_k(x)\varphi_k(x').$$
We shall nonetheless still refer to ηK,% (P, Q; P0 ) as a moderated MMD in what follows to
emphasize the critical importance of moderation. We shall also abbreviate the dependence
of η on K and P0 unless necessary. The unit ball in (8) is defined in terms of both RKHS
norm and L2 (P0 ) norm. Recall that u = dP/dP0 − 1 so that
$$\sup_{\|f\|_{L_2(P_0)} \le 1} \int_{\mathcal{X}} f\, d(P - P_0) = \sup_{\|f\|_{L_2(P_0)} \le 1} \int_{\mathcal{X}} f u\, dP_0 = \|u\|_{L_2(P_0)} = \chi(P, P_0).$$
We can therefore expect that a smaller $\varrho$ will make $\eta_\varrho^2(P, P_0)$ closer to $\chi^2(P, P_0)$, since the unit ball to be considered becomes more similar to the unit ball in $L_2(P_0)$. This can also be verified by noticing that
$$\lim_{\varrho \to 0} \eta_\varrho^2(P, P_0) = \lim_{\varrho \to 0} \sum_{k \ge 1} \frac{\lambda_k}{\lambda_k + \varrho^2} \left[\mathbb{E}_P \varphi_k(X)\right]^2 = \sum_{k \ge 1} \left[\mathbb{E}_P \varphi_k(X)\right]^2 = \chi^2(P, P_0).$$
This test statistic is similar in spirit to the homogeneity test proposed previously by Harchaoui et al. (2007), albeit motivated from a different viewpoint. In either case, it is intuitive to expect improved performance over the vanilla version of the MMD when $\varrho_n$ converges to zero at an appropriate rate. The main goal of the present work is to precisely characterize the amount of moderation needed to ensure maximum power. We first argue that letting $\varrho_n$ converge to 0 at an appropriate rate indeed results in a test more powerful than $T_{\mathrm{MMD}}$.
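To make the effect of moderation concrete, here is a minimal sketch of the (truncated) moderated statistic $\sum_k \frac{\lambda_k}{\lambda_k + \varrho^2}\big[\frac{1}{n}\sum_i \varphi_k(X_i)\big]^2$, reusing the illustrative trigonometric eigensystem of Section 5 with $s = 1$. Note that shrinking $\varrho$ increases every weight $\lambda_k/(\lambda_k + \varrho^2)$ toward 1:

```python
import numpy as np

def moderated_mmd_sq(x, lam, phi, rho, n_terms=200):
    """Truncated moderated MMD: sum_k lam_k/(lam_k + rho^2) * (mean phi_k)^2.
    As rho -> 0 the weights tend to 1 and the statistic approaches the
    (truncated) chi-square expansion sum_k (mean phi_k)^2."""
    return sum(lam(k) / (lam(k) + rho ** 2) * np.mean(phi(k, x)) ** 2
               for k in range(1, n_terms + 1))

# Illustrative eigensystem under P0 = U[0, 1] (periodic Sobolev kernel, s = 1).
lam = lambda k: (2 * np.pi * ((k + 1) // 2)) ** -2.0
def phi(k, x):
    j = (k + 1) // 2
    trig = np.cos if k % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * j * x)

rng = np.random.default_rng(2)
x = rng.uniform(size=300)
small_rho = moderated_mmd_sq(x, lam, phi, rho=0.01)
large_rho = moderated_mmd_sq(x, lam, phi, rho=1.0)  # small_rho >= large_rho
```

Since each weight $\lambda_k/(\lambda_k + \varrho^2)$ is decreasing in $\varrho$ and each term is nonnegative, the statistic with the smaller $\varrho$ dominates the one with the larger $\varrho$ term by term.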
Theorem 3 Consider testing $H_0$ against $H_1(\Delta_n, \theta)$ by $T_{\mathrm{M^3d}}$ with $\varrho_n = c\, n^{-\frac{2s(\theta+1)}{4s+\theta+1}}$ for an arbitrary constant $c > 0$. If $n^{\frac{4s}{4s+\theta+1}} \Delta_n \to \infty$, then $T_{\mathrm{M^3d}}$ is consistent in that
$$\beta(T_{\mathrm{M^3d}}; \Delta_n, \theta) \to 0, \quad \text{as } n \to \infty.$$
Theorem 3 indicates that the detection boundary for $T_{\mathrm{M^3d}}$ is $n^{-4s/(4s+\theta+1)}$. In particular, when testing $H_0$ against $H_1(\Delta_n, 0)$, i.e., $\theta = 0$, it becomes $n^{-4s/(4s+1)}$. This is to be contrasted with the detection boundary for $T_{\mathrm{MMD}}$, which, as suggested by Theorem 1, is of the order $n^{-1/2}$. It is also worth noting that the detection boundary for $T_{\mathrm{M^3d}}$ deteriorates as $\theta$ increases, implying that it is harder to test against a larger interpolation space.
Together with Theorem 3, this suggests that $T_{\mathrm{M^3d}}$ is rate optimal in the minimax sense, when considering $\chi^2$ distance as the separation metric and $\mathcal{F}(\theta, M)$ as the regularity condition of the alternative space.
One may also consider a truncated version of $\widetilde{K}_\varrho$:
$$\breve{K}_{\rho,N}(x, x') := \sum_{k=1}^{N} \frac{\lambda_k}{\lambda_k + \rho^2}\, \varphi_k(x)\varphi_k(x').$$
Following the same argument as that of Theorem 2, it can be shown that, under $H_0$ where $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P_0$,
$$\breve{v}_n^{-1/2}\left[n\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0) - \breve{A}_n\right] \xrightarrow{d} N(0, 2),$$
where
$$\breve{v}_n = \sum_{k=1}^{N} \left(\frac{\lambda_k}{\lambda_k + \rho_n^2}\right)^2 \quad \text{and} \quad \breve{A}_n = \frac{1}{n} \sum_{i=1}^{n} \breve{K}_{\rho_n,N}(X_i, X_i),$$
as $n \to \infty$, provided that the conditions of Theorem 2 hold and $\lambda_N \lesssim \rho_n^2$. A test that rejects $H_0$ when
$$2^{-1/2}\, \breve{v}_n^{-1/2}\left[n\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0) - \breve{A}_n\right] \qquad (10)$$
exceeds $z_{1-\alpha}$ is therefore also an asymptotic $\alpha$-level test. By the same argument as that of Theorem 3, this test can also be shown to be consistent whenever $n^{\frac{4s}{4s+\theta+1}} \Delta_n \to \infty$. In other words, $\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0)$ can achieve the same rate of detection as $\eta_{\varrho_n}^2(\widehat{P}_n, P_0)$. Based on the polynomial decay rate assumption (6), the requirement $\lambda_N \lesssim \rho_n^2$ is equivalent to
$$N \gtrsim \varrho_n^{-1/s} \asymp n^{\frac{2(\theta+1)}{4s+\theta+1}}, \qquad (11)$$
suggesting how large $N$ needs to be in order for the truncated version of the M$^3$d test to achieve minimax optimality. This test could be more appealing in practice when the infinite sum defining $\widetilde{K}_\rho$ does not have a convenient closed form expression and requires nontrivial numerical approximation. On the other hand, the presence of the extra tuning parameter $N$ makes the exposition cumbersome in places. For brevity, we shall focus our attention on $T_{\mathrm{M^3d}}$, although all our discussion applies equally to both tests, and more generally to other suitable forms of moderation.
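A minimal sketch of the standardized statistic in (10), again with the illustrative trigonometric eigensystem ($s = 1$, $P_0 = U[0,1]$); under $H_0$ it is approximately standard normal by the normal approximation above (the truncation level, $\rho$, and sample are illustrative):

```python
import numpy as np

def standardized_stat(x, lam, phi, rho, n_terms=100):
    """2^{-1/2} v_n^{-1/2} [ n * eta^2 - A_n ] as in (10), where eta^2 uses
    the truncated shrunk weights w_k = lam_k/(lam_k + rho^2),
    A_n = (1/n) sum_i K_{rho,N}(X_i, X_i), and v_n = sum_k w_k^2."""
    n = len(x)
    w = np.array([lam(k) / (lam(k) + rho ** 2) for k in range(1, n_terms + 1)])
    Phi = np.array([phi(k, x) for k in range(1, n_terms + 1)])  # n_terms x n
    eta2 = np.sum(w * Phi.mean(axis=1) ** 2)
    a_n = np.mean(w @ Phi ** 2)  # (1/n) sum_i sum_k w_k phi_k(x_i)^2
    v_n = np.sum(w ** 2)
    return (n * eta2 - a_n) / np.sqrt(2.0 * v_n)

lam = lambda k: (2 * np.pi * ((k + 1) // 2)) ** -2.0
def phi(k, x):
    j = (k + 1) // 2
    trig = np.cos if k % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * j * x)

rng = np.random.default_rng(3)
t = standardized_stat(rng.uniform(size=500), lam, phi, rho=0.05)
```

The test then compares `t` to the standard normal quantile $z_{1-\alpha}$.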
4. Adaptation
Despite the minimax optimality of $T_{\mathrm{M^3d}}$, a practical challenge in using it is the choice of an appropriate tuning parameter $\varrho_n$. In particular, Theorem 3 suggests that $\varrho_n$ needs to be taken at the order of $n^{-2s(\theta+1)/(4s+\theta+1)}$, which depends on the values of $s$ and $\theta$. On the one hand, since $P_0$ and $K$ are known a priori, so is $s$. On the other hand, $\theta$ reflects the property of $dP/dP_0$, which is typically not known in advance. This naturally brings about the issue of adaptation (see, e.g., Spokoiny, 1996; Ingster, 2000). In other words, we are interested in a single testing procedure that can achieve the detection boundary for testing $H_0$ against $H_1(\Delta_n(\theta), \theta)$ simultaneously over all $\theta \ge 0$. We emphasize the dependence of $\Delta_n$ on $\theta$ since the detection boundary may depend on $\theta$, as suggested by the results from the previous section. To this end, we shall build upon the test statistic introduced before.
More specifically, write
$$\rho_* = \left(\frac{\sqrt{\log\log n}}{n}\right)^{2s},$$
and
$$m_* = \left\lceil \log_2\left[\rho_*^{-1} \left(\frac{\sqrt{\log\log n}}{n}\right)^{\frac{2s}{4s+1}}\right] \right\rceil.$$
Then our test statistic is taken to be the maximum of $T_{n,\varrho_n}$ for $\rho_n = \rho_*, 2\rho_*, 2^2\rho_*, \ldots, 2^{m_*}\rho_*$:
$$\widetilde{T}_n := \max_{0 \le k \le m_*} T_{n, 2^k \rho_*}.$$
It turns out that, if an appropriate rejection threshold is chosen, $\widetilde{T}_n$ can achieve a detection boundary very similar to the one we have before, but now simultaneously over all $\theta \ge 0$.
(ii) on the other hand, there exists a constant $c_1 > 0$ such that
$$\liminf_{n \to \infty} \inf_{P \in \cup_{\theta \ge 0} \mathcal{P}(\Delta_n(\theta), \theta)} P\left(\widetilde{T}_n \ge \sqrt{3 \log\log n}\right) = 1,$$
provided that $\Delta_n(\theta) \ge c_1 \left(n^{-1}\sqrt{\log\log n}\right)^{\frac{4s}{4s+\theta+1}}$.
Theorem 5 immediately suggests that a test that rejects $H_0$ if and only if $\widetilde{T}_n \ge \sqrt{3\log\log n}$ is consistent for testing it against $H_1(\Delta_n(\theta), \theta)$ for all $\theta \ge 0$, provided that $\Delta_n(\theta) \ge c_1 \left(n^{-1}\sqrt{\log\log n}\right)^{\frac{4s}{4s+\theta+1}}$. It is worth noting that the same detection boundary can be attained by replacing $T_{n,2^k\varrho_*}$ with the test statistic defined by (10) with $\rho_n = 2^k \varrho_*$ and $N \gtrsim 2^{-k/s} \varrho_*^{-1/s}$.
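The adaptive construction can be sketched as follows: compute the standardized truncated statistic on the dyadic grid $\rho_*, 2\rho_*, \ldots, 2^{m_*}\rho_*$ and take the maximum. The eigensystem, truncation level, and $s = 1$ below are illustrative, as before:

```python
import numpy as np

def adaptive_stat(x, lam, phi, s=1.0, n_terms=100):
    """Maximum over the dyadic grid rho_*, 2 rho_*, ..., 2^{m_*} rho_* of the
    standardized moderated-MMD statistics; rho_* and m_* follow Section 4."""
    n = len(x)
    base = np.sqrt(np.log(np.log(n))) / n
    rho_star = base ** (2 * s)
    m_star = int(np.ceil(np.log2(base ** (2 * s / (4 * s + 1)) / rho_star)))
    Phi = np.array([phi(k, x) for k in range(1, n_terms + 1)])
    best = -np.inf
    for j in range(m_star + 1):
        rho = rho_star * 2 ** j
        w = np.array([lam(k) / (lam(k) + rho ** 2) for k in range(1, n_terms + 1)])
        stat = (n * np.sum(w * Phi.mean(axis=1) ** 2) - np.mean(w @ Phi ** 2)) \
               / np.sqrt(2.0 * np.sum(w ** 2))
        best = max(best, stat)
    return best

lam = lambda k: (2 * np.pi * ((k + 1) // 2)) ** -2.0
def phi(k, x):
    j = (k + 1) // 2
    trig = np.cos if k % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * j * x)

rng = np.random.default_rng(4)
t_adapt = adaptive_stat(rng.uniform(size=500), lam, phi)
```

The maximum is then compared either to $\sqrt{3\log\log n}$ or to a Monte Carlo estimate of its null quantile, as discussed next.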
We can further calibrate the rejection region to yield a test at a given significance level.
More precisely, letting $\widetilde{q}_{1-\alpha}$ be the $1-\alpha$ quantile of $\widetilde{T}_n$ under $H_0$, we can proceed to reject $H_0$
whenever the observed test statistic exceeds $\widetilde{q}_{1-\alpha}$. Denote such a test by $\widetilde{T}_{\mathrm{M^3d}}$. By definition, $\widetilde{T}_{\mathrm{M^3d}}$ is an $\alpha$-level test. Theorem 5 implies that the type II error of $\widetilde{T}_{\mathrm{M^3d}}$ vanishes as $n \to \infty$ uniformly over all $\theta \ge 0$. In practice, the quantile $\widetilde{q}_{1-\alpha}$ can be evaluated by Monte Carlo methods, as we shall discuss in further detail in Section 6. We note that the detection boundary given in Theorem 5 is similar, but inferior by a factor of $(\log\log n)^{\frac{2s}{4s+\theta+1}}$, to that from Theorem 4. As our next result indicates, such an extra factor is indeed unavoidable and is the price one needs to pay for adaptation.
Theorem 6 Let 0 < θ1 < θ2 < 2s − 1. Then there exists a positive constant c2 , such that
if
$$\limsup_{n \to \infty} \sup_{\theta \in [\theta_1, \theta_2]} \left\{\Delta_n(\theta) \left(\frac{n}{\sqrt{\log\log n}}\right)^{\frac{4s}{4s+\theta+1}}\right\} \le c_2,
$$
then
$$\liminf_{n \to \infty} \inf_{T \in \mathcal{T}_n} \left[\mathbb{E}_{P_0} T + \sup_{\theta \in [\theta_1, \theta_2]} \beta(T; \Delta_n(\theta), \theta)\right] = 1.$$
Similar to Theorem 4, Theorem 6 shows that there is no consistent test for $H_0$ against $H_1(\Delta_n, \theta)$ simultaneously over all $\theta \in [\theta_1, \theta_2]$ if $\Delta_n(\theta) \le c_2 \left(n^{-1}\sqrt{\log\log n}\right)^{\frac{4s}{4s+\theta+1}}$ for all $\theta \in [\theta_1, \theta_2]$, for a sufficiently small $c_2$. Together with Theorem 5, this suggests that the test $\widetilde{T}_{\mathrm{M^3d}}$ is indeed rate optimal.
5. A Specific Example
To better illustrate the implications of our general results, it is instructive to consider a
more specific example where we are interested in testing uniformity on the unit interval
X = [0, 1] using a periodic spline kernel. Recall that the periodic Sobolev space of order s
is given by
$$W_0^s([0,1]) := \left\{u \in L_2([0,1]) : \sum_{m=0}^{s} \int_0^1 \left(u^{(m)}(x)\right)^2 dx < \infty, \ \int_0^1 u^{(m)}(x)\, dx = 0, \ \forall\, 0 \le m < s\right\}.$$
The corresponding reproducing kernel is
$$K(x, x') = \frac{(-1)^{s-1}}{(2s)!} B_{2s}([x - x']),$$
where $B_r$ is the Bernoulli polynomial of degree $r$ and $[t]$ is the fractional part of $t$. See, e.g., Wahba (1990) for further details.
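As a sanity check, for $s = 1$ one has $B_2(t) = t^2 - t + 1/6$, so $K(x, x') = B_2([x - x'])/2!$, which can be compared numerically against the Fourier (Mercer) expansion $\sum_{j \ge 1} 2(2\pi j)^{-2}\cos(2\pi j (x - x'))$; a minimal sketch:

```python
import numpy as np

def k_closed(x, xp):
    """Periodic Sobolev kernel of order s = 1: K = B_2([x - x']) / 2!,
    with B_2(t) = t^2 - t + 1/6 the Bernoulli polynomial of degree 2."""
    t = (x - xp) % 1.0  # fractional part [x - x']
    return (t ** 2 - t + 1.0 / 6.0) / 2.0

def k_series(x, xp, n_terms=20000):
    """Truncated Mercer/Fourier expansion: sum_j 2 (2 pi j)^{-2} cos(2 pi j (x - x'))."""
    j = np.arange(1, n_terms + 1)
    return np.sum(2.0 * (2 * np.pi * j) ** -2.0 * np.cos(2 * np.pi * j * (x - xp)))

diff = abs(k_closed(0.3, 0.1) - k_series(0.3, 0.1))  # should be tiny
```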
In this case, the interpolation spaces $\mathcal{F}(\theta, M)$ are closely related to Sobolev spaces of lower order. Recall that the eigenvalues and eigenfunctions of $K$ with respect to $L_2(P_0)$ are also known explicitly: $\lambda_{2j-1} = \lambda_{2j} = (2\pi j)^{-2s}$ for $j \in \mathbb{N}$, and
$$\varphi_k(x) = \begin{cases} \sqrt{2}\cos(2j\pi x), & k = 2j-1,\ j \in \mathbb{N}, \\ \sqrt{2}\sin(2j\pi x), & k = 2j,\ j \in \mathbb{N}. \end{cases}$$
It is well known that the optimal rate of detection is $n^{-4s/(4s+1)}$ when the $\chi^2$ distance is considered. See, e.g., Ingster (1993). On the other hand, since the alternative hypothesis is exactly $\mathcal{P}(\Delta_n, 0)$, such a detection boundary can also be derived from Theorems 3 and 4, and the proposed M$^3$d test achieves minimax optimality in this specific example.
6. Numerical Experiments
To complement the earlier theoretical development, we also performed several sets of simulation experiments to demonstrate the merits of the proposed adaptive test. As mentioned before, we shall consider the test based on $\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0)$ instead of $\eta_{\varrho_n}^2(\widehat{P}_n, P_0)$ for ease of practical implementation. With slight abuse of notation, we shall still refer to the adaptive test as a test based on moderated MMD and denote it by $\widetilde{T}_n$ for brevity.
Though the form of $\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0)$ looks similar to that of $\gamma^2(\widehat{P}_n, P_0)$, there is a subtle issue in computing it numerically. The kernel $\breve{K}_{\varrho_n,N}(x, x')$ is defined only through its Mercer decomposed form, which is based on the Mercer decomposition of $K(x, x')$. Hence, in order to compute the kernel $\breve{K}_{\varrho_n,N}(x, x')$, we need to first choose a kernel $K(x, x')$ and compute its Mercer decomposition numerically. Specifically, we used the chebfun framework in Matlab (with slight modifications) to compute Mercer decompositions associated with kernels based on their integral operator representations (Driscoll et al., 2014; Trefethen and Battles, 2004).
By construction, the procedure gave a 5%-level test, up to Monte Carlo error. For all
other tests that we compared with, we used the same approach to determine the rejection
threshold.
A similar idea of using resampling methods to decide the rejection threshold has been considered in Fromont et al. (2012) and Fromont et al. (2013), where bootstrap approaches were used to resample in a two-sample problem. To the best of our knowledge, it is proved there that the resulting single kernel tests are exactly of level $\alpha$. Whether we can give a theoretical justification of the resampling method applied to the adaptive test is of strong interest for our future work.
Euclidean data: Consider the one sample test where $P_0$ is the uniform distribution on $[0,1]^d$, with $d = 100$ and $200$. For the alternatives, we followed the example densities put forward in Marron and Wand (1992) in the context of nonparametric density estimation. Specifically, we set the alternative hypothesis to be (1) a mixture of five Gaussians, (2) skewed unimodal, (3) asymmetric claw density and (4) smooth comb density. The value of $\alpha$ was set to 0.05. The sample size $n$ varied from 200 to 1000 (in steps of 200), and for each value of the sample size 100 simulations were conducted to estimate the probability of rejecting $H_0$.
We chose the Gaussian kernel as the original kernel to construct the adaptive test, and for practical purposes its bandwidth was selected via the median heuristic. See, e.g., Gretton et al. (2012a). We now describe the specific tests that we experimented with; all other MMD based tests considered below use the Gaussian kernel as well.
2. MMD1 : Vanilla MMD with the median heuristic for selecting the bandwidth.
4. MMD3 : The method proposed in Sutherland et al. (2017) to select the kernel that
maximizes the power.
We remark that, for MMD2 and MMD3, we split the whole dataset into two parts: we used the training dataset to first select the bandwidth, and then used the testing samples to perform the actual hypothesis testing. See Sutherland et al. (2017) for more details.
We first conducted a type I error analysis. Results in Figure 1 suggest that all tests control the probability of type I error at approximately 0.05. We then investigated the performance of these tests under the alternative hypothesis. Figure 2 plots the estimated probability of accepting the null hypothesis under the alternatives mentioned above for different values of the sample size $n$. We note from Figure 2 that the estimated probability of type II error converges to zero at a faster rate for M$^3$d compared to the other tests in all the different simulation settings considered. Note that it has previously been observed in Gretton et al. (2012a) that the MMD test performs better than the Kolmogorov-Smirnov test in various settings, which we observe here as well.
Figure 1: Estimated probability of type I error versus sample size for different tests with
dimensionality 100 (left) and 200 (right), in the case of Euclidean data.
Directional data: One of the advantages of the proposed RKHS embedding based approach is that it can be used on domains other than $d$-dimensional Euclidean space. For example, when $\mathcal{X} = \mathbb{S}^{d-1}$, the unit sphere in $\mathbb{R}^d$, one can perform hypothesis testing using the above framework, as long as we can compute the Mercer decomposition of a kernel $K$ defined on the domain. In several applications, such as protein folding, data are oftentimes modeled as coming from the unit sphere, and testing goodness-of-fit for such data requires specialized methods different from standard nonparametric testing methods (Mardia and Jupp, 2009; Jupp, 2005).
In order to highlight the advantage of the proposed approach, we assumed P0 to be the
uniform distribution on the unit sphere of dimension d = 100 and 150 respectively. For
each d, we tested P0 against the alternative that data were from:
(1) the multivariate von Mises-Fisher distribution (the Gaussian analogue on the unit sphere), given by $f_{\mathrm{vM\text{-}F}}(x; \mu, \kappa) = C_{\mathrm{vM\text{-}F}}(\kappa) \exp(\kappa \mu^\top x)$ for $x \in \mathbb{S}^{d-1}$, where $\kappa \ge 0$ is the concentration parameter and $\mu$ is the mean parameter. The normalization constant is $C_{\mathrm{vM\text{-}F}}(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)}$, where $I_{d/2-1}$ is the modified Bessel function of the first kind;
(2) the multivariate Watson distribution (used to model axially symmetric data on the sphere), given by $f_W(x; \mu, \kappa) = C_W(\kappa) \exp(\kappa (\mu^\top x)^2)$ for $x \in \mathbb{S}^{d-1}$, where $\kappa \ge 0$ is the concentration parameter and $\mu$ is the mean parameter as before. The normalization constant is $C_W(\kappa) = \frac{\Gamma(d/2)}{2\pi^{d/2} M(1/2, d/2, \kappa)}$, where $M$ is Kummer's confluent hypergeometric function;
(3) mixture of five von Mises-Fisher distributions, which are used in modeling and clustering spherical data (Banerjee et al., 2005);

(4) mixture of five Watson distributions, which are used in modeling and clustering spherical data (Sra and Karp, 2013).
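The normalization constants above can be evaluated with nothing beyond the standard library; a minimal sketch for the von Mises-Fisher density (the function names and the series truncation for the Bessel function are our own choices, not from the paper):

```python
import math

def bessel_i(nu, x, terms=80):
    # modified Bessel function of the first kind, I_nu(x), via its power series
    return sum((x / 2.0) ** (2 * m + nu) / (math.factorial(m) * math.gamma(m + nu + 1.0))
               for m in range(terms))

def vmf_log_const(d, kappa):
    # log C_vMF(kappa) = log[ kappa^{d/2-1} / ((2*pi)^{d/2} * I_{d/2-1}(kappa)) ]
    nu = d / 2.0 - 1.0
    return (nu * math.log(kappa) - (d / 2.0) * math.log(2.0 * math.pi)
            - math.log(bessel_i(nu, kappa)))

def vmf_log_density(x, mu, kappa):
    # log f_vMF(x; mu, kappa) = log C_vMF(kappa) + kappa * <mu, x>, for x on the unit sphere
    return vmf_log_const(len(mu), kappa) + kappa * sum(m * xi for m, xi in zip(mu, x))
```

For $d = 3$ the constant reduces to the closed form $\kappa/(4\pi\sinh\kappa)$, which gives a quick sanity check of the implementation.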
Figure 2: Estimated probability of type II error versus sample size for different tests (M3d, MMD3, MMD2, MMD1, KS): mixture of Gaussians (row 1), skewed unimodal (row 2), asymmetric claw (row 3) and smooth comb (row 4) with dimensionality 100 (left) and 200 (right), in the case of Euclidean data.
Besides the adaptive M3d test and all the other MMD based tests involved in the first simulation study, we also considered the Sobolev test (denoted ST hereafter) proposed in Jupp (2005). The original kernel of M3d and the kernel of MMD1 were again chosen as the Gaussian kernel with bandwidth selected via the median heuristic. The other MMD based tests also used the Gaussian kernel. Note that in this setup one can analytically compute the Mercer decomposition of the Gaussian kernel on the unit sphere with respect to the uniform distribution. Specifically, the eigenvalues are given by Theorem 2 in Minh et al. (2006) and the eigenfunctions are the standard spherical harmonics of order k (see Section 2.1 in Minh et al. (2006) for details). The rest of the simulation setup is similar to the previous setting (of Euclidean data).
The results of the type I error analysis are reported in Figure 3, indicating that all tests are approximately α-level tests. Figure 4 plots the estimated probability of type II error for different sample sizes, from which we see that the adaptive M3d test performs better.
Figure 3: Estimated probability of type I error versus sample size for different tests with dimensionality 100 (left) and 150 (right), in the case of directional data.
Figure 4: Estimated probability of type II error versus sample size for different tests (M3d, MMD3, MMD2, MMD1, ST): von Mises-Fisher distribution (row 1), Watson distribution (row 2), mixture of von Mises-Fisher distributions (row 3) and mixture of Watson distributions (row 4) with dimensionality 100 (left) and 150 (right), in the case of directional data.
are essentially treating P0 with the estimated parameters as being given to us by an oracle and carrying out goodness-of-fit testing. We take this approach because our purpose in this section is mainly to compare the adaptive M3d test against the other tests for a simple P0, not to address the issue of testing a composite null hypothesis. Note that a similar approach was also taken in Kellner and Celisse (2015) for kernel-based testing of Gaussianity.
For the case of Euclidean data, we used the MNIST digits data set from the following webpage: http://yann.lecun.com/exdb/mnist/. Model-based clustering (Fraley and Raftery, 2002) is a widely used and practically successful clustering technique in the literature. Furthermore, the MNIST data set is a standard benchmark for clustering algorithms and consists of images of digits. Several works have implicitly assumed that the data come from a mixture of Gaussian distributions, because of the superior empirical performance observed under such an assumption. But the validity of such a mixture model assumption is invariably not tested statistically. In this experiment we selected three digits (each corresponding to a cluster) at random and, conditioned on the selected digit (cluster), tested whether the data come from a Gaussian distribution (that is, P0 is Gaussian). For our experiments, we downsampled the images and used the pixels as feature vectors of dimensionality 64, as is commonly done in the literature. Table 2 reports the probability with which the null hypothesis is accepted, estimated over 100 trials. The observed results reiterate, in a statistically significant way, that a mixture of Gaussians is a reasonable assumption in this case.
For the case of directional data, we used the Human Fibroblasts dataset from Iyer et al. (1999); Dhillon et al. (2003) and the Yeast Cell Cycle dataset from Spellman et al. (1998). The Fibroblast dataset contains 12-dimensional expression data for 517 samples (genes), recording the response of human fibroblasts following the addition of serum to the growth media. We refer to Iyer et al. (1999) for more details about the scientific procedure by which these data were obtained. The Yeast Cell Cycle dataset consists of 82-dimensional data corresponding to 696 subjects. Previous data analysis studies (Sra and Karp, 2013; Dhillon et al., 2003) have used mixtures of spherical distributions for clustering these data sets. Specifically, it was observed in Sra and Karp (2013) that clustering using a mixture of Watson distributions has superior performance. While that has proved scientifically useful, whether such an assumption is valid was not statistically tested. Here, we conducted a goodness-of-fit test of the Watson distribution (that is, P0 is a Watson distribution) for the largest cluster from each of the above data sets. Table 3 shows the estimated probability of acceptance of the null hypothesis, computed over 100 random trials. The observed results provide a statistical justification for the use of the Watson distribution in modeling the above datasets.
One thing to note is that we do not aim to argue that our proposed test outperforms the other considered tests in the above real-world experiments. The information conveyed by the results lies rather in the following two aspects. First, if we assume that H0 is true, then our proposed test is valid in the sense that the estimated probability of Type I error is controlled around 5%, although two approximations are involved in the whole testing procedure: one is using one half of the data to estimate P0, and the other is using Monte Carlo simulations to determine the rejection region. Second, that H0 is accepted with high probability by our proposed test and the other tests does provide certain evidence that
Sample size      300    400    500  |  300    400    500  |  300    400    500
KS              0.91   0.94   0.96  | 0.91   0.91   0.95  | 0.91   0.92   0.95
MMD1            0.92   0.95   0.96  | 0.91   0.92   0.95  | 0.90   0.92   0.94
MMD2            0.93   0.95   0.96  | 0.92   0.93   0.95  | 0.93   0.94   0.96
MMD3            0.92   0.95   0.97  | 0.90   0.94   0.97  | 0.91   0.92   0.96
M3d             0.94   0.96   0.98  | 0.93   0.95   0.98  | 0.93   0.95   0.98
N                 32     36     40  |   32     36     40  |   32     36     40

Table 2: Estimated probability with which the corresponding test accepts the null hypothesis. The level of the test is α = 0.05. Digit 4 on the left, Digit 6 in the middle and Digit 7 on the right, for various values of the sample size. N refers to the number of eigenvalues/eigenfunctions used in the kernel approximation.
Table 3: Estimated probability with which the corresponding test accepts the null hypothesis. The level of the test is α = 0.05. Human Fibroblasts dataset on the left and Yeast Cell Cycle dataset on the right. N refers to the number of eigenvalues/eigenfunctions used in the kernel approximation.
H0 is true, considering that all these tests tend to have small probabilities of accepting H0
under the alternative hypothesis.
In this sense, these real-world experiments serve mainly as an initial demonstration of the validity of the proposed test; we leave finding examples that further demonstrate its optimality to future work.
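The Monte Carlo calibration of the rejection region mentioned above can be sketched in a few lines (a stdlib-only illustration; the function names and the toy statistic are our own, not the statistic from the paper):

```python
import math
import random

def mc_threshold(stat_fn, sample_null, n, alpha=0.05, reps=999, seed=0):
    # Empirical (1 - alpha)-quantile of the test statistic under H0,
    # estimated by simulating `reps` datasets of size n from P0.
    rng = random.Random(seed)
    sims = sorted(stat_fn(sample_null(n, rng)) for _ in range(reps))
    return sims[int((1.0 - alpha) * (reps - 1))]

def sample_uniform(n, rng):
    # toy null distribution P0: uniform on [0, 1]
    return [rng.random() for _ in range(n)]

def toy_stat(xs):
    # toy statistic: sqrt(n) * |sample mean - 1/2|; large values signal departure from P0
    n = len(xs)
    return math.sqrt(n) * abs(sum(xs) / n - 0.5)
```

Rejecting when the observed statistic exceeds the returned threshold gives an approximately α-level test, up to Monte Carlo error.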
7. Concluding Remarks
In this paper, we investigated the performance of kernel embedding based approaches to
goodness-of-fit testing from a minimax perspective. When considering χ2 distance as the
separation metric, we showed that while the vanilla MMD tests could be suboptimal when
testing against an alternative coming from the RKHS identified with the kernel, a simple
moderation leads to tests that are not only optimal when the alternative resides in the RKHS
but also adaptive with respect to its deviation from the RKHS under suitable distances. Our
analysis provides new insights into the operating characteristics of, as well as a benchmark for evaluating, the popular kernel based approach to hypothesis testing. Our work also points to a number of interesting directions that warrant further investigation.
Our work highlighted the importance of moderation in kernel based testing, akin to
regularization in kernel based supervised learning. The specific type of moderation we
considered here is defined in terms of eigenfunctions of the kernel. In some settings this is
convenient to do as we illustrated in Section 5. In other settings, however, eigenfunctions of
a kernel may not be trivial to compute. It is therefore of interest to investigate alternative
ways of moderating the MMD. For example, one intriguing possibility is to apply appropriate
moderation to the so-called random Fourier features (see, e.g., Chwialkowski et al., 2015;
Jitkrittum et al., 2016; Bach, 2017) instead of eigenfunctions.
Another practical issue is how to devise computationally more efficient goodness-of-fit tests for dealing with large-scale datasets. For example, it would be of interest to investigate whether
the ideas of linear-time approximation (Gretton et al., 2012a) or block test (Zaremba et al.,
2013) can be applied to yield minimax optimal and adaptive goodness of fit tests.
At a more technical level, our analysis has focused on detection boundaries under χ2
distance. There are other commonly used distances suitable for goodness-of-fit tests such
as KL-divergence, or total variation. It would be interesting to examine to what extent the phenomena we observed here occur under these alternative distance measures.
As in any kernel based approaches, the choice of the kernel plays a crucial role. In
the current work we have assumed implicitly that a suitable kernel is selected. How to
choose an appropriate kernel is a critical problem in practice. It would be of great interest
to investigate to what extent the ideas from Sriperumbudur et al. (2009); Gretton et al.
(2012b); Sutherland et al. (2017) can be adapted in the current setting.
Throughout the numerical experiments in our paper, we have considered a truncated version of the moderated kernel. Both $\varrho_n^2$ and the truncation level $N$ play the role of regularization, and we can show that as long as $N$ increases at a proper rate with $n$, even if $\varrho_n$ in $\breve K_{\varrho_n,N}$ is set to 0, the corresponding test can still achieve the optimal rate. Note that in this scenario, $\breve K_{0,N}$ becomes the so-called projection kernel and the MMD$^2$ associated with $\breve K_{0,N}$ is exactly the projection of $\chi^2(P,P_0)$ onto the subspace spanned by $\{\varphi_k\}_{1\le k\le N}$.
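As a concrete illustration of this truncated moderation, consider a stylized setting with eigenvalues $\lambda_k = k^{-2s}$ and a cosine basis that is orthonormal and mean-zero under the uniform $P_0$ on $[0,1]$ (this instantiation and all names below are our own, chosen only to make the statistic computable):

```python
import math

def phi(k, x):
    # cosine basis: orthonormal in L2(P0) for P0 = Uniform[0, 1], with E_{P0} phi_k = 0
    return math.sqrt(2.0) * math.cos(math.pi * k * x)

def moderated_stat(xs, rho, N, s=1.0):
    # n * eta^2 for the truncated moderated kernel sum_{k<=N} lambda_k/(lambda_k+rho^2) phi_k phi_k
    n = len(xs)
    total = 0.0
    for k in range(1, N + 1):
        lam = float(k) ** (-2.0 * s)
        mean_k = sum(phi(k, x) for x in xs) / n
        total += lam / (lam + rho ** 2) * mean_k ** 2
    return n * total

def projection_stat(xs, N):
    # the rho = 0 limit: the projection-kernel statistic on span{phi_1, ..., phi_N}
    n = len(xs)
    return n * sum((sum(phi(k, x) for x in xs) / n) ** 2 for k in range(1, N + 1))
```

Setting rho = 0 recovers the projection-kernel statistic exactly, while rho > 0 shrinks each coordinate by the factor $\lambda_k/(\lambda_k+\varrho^2)$.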
8. Proofs
We now present the proofs of the main results.
Proof [Proof of Theorem 1]
Part (i). The proof of the first part consists of two key steps. First, we show that the population counterpart $n\gamma^2(P,P_0)$ of the test statistic converges to $\infty$ uniformly, i.e.,
$$\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,0)}n\gamma^2(P,P_0) = \infty.$$
Then
$$\sum_{k\ge1}\lambda_k^{-1}a_k^2 = \|u\|_K^2, \qquad\text{and}\qquad \sum_{k\ge1}a_k^2 = \|u\|^2_{L^2(P_0)} = \chi^2(P,P_0).$$
Thus,
$$P\big\{n\gamma^2(\hat P_n,P_0) < q_{w,1-\alpha}\big\} \le P\Bigg\{\sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2} - \sqrt{n\sum_{k\ge1}\lambda_k\Big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\Big]^2} < \sqrt{q_{w,1-\alpha}}\Bigg\} = P\Bigg\{\sqrt{n\sum_{k\ge1}\lambda_k\Big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\Big]^2} > \sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2} - \sqrt{q_{w,1-\alpha}}\Bigg\}.$$
23
Balasubramanian, Li and Yuan
Hence we obtain
$$\lim_{n\to\infty}\beta(T_{\rm MMD};\Delta_n,0) = \lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,0)}P\big\{n\gamma^2(\hat P_n,P_0) < q_{w,1-\alpha}\big\} \le \lim_{n\to\infty}\frac{\sup_{P\in\mathcal P(\Delta_n,0)}E_P\Big\{n\sum_{k\ge1}\lambda_k\big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\big]^2\Big\}}{\inf_{P\in\mathcal P(\Delta_n,0)}\Big(\sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2}-\sqrt{q_{w,1-\alpha}}\Big)^2} = 0.$$
Part (ii). In proving the second part, we will make use of the following lemma, which can be obtained by adapting the argument in Gregory (1977). It gives the limiting distribution of the V-statistic under $P_n$ such that $P_n$ converges to $P_0$ at the order $n^{-1/2}$.

where $X_1,\ldots,X_n\overset{\text{i.i.d.}}{\sim}P_n$, and the $Z_k$'s are independent standard normal random variables.
where $C_1$ is a positive constant and $k_n = \lfloor C_2n^{1/(4s)}\rfloor$ for some positive constant $C_2$; both $C_1$ and $C_2$ will be determined later. Since $\sup_{k\ge1}\|\varphi_k\|_\infty<\infty$ and $\lim_{k\to\infty}\lambda_k=0$, there exists $N_0>0$ such that the $P_n$'s are well-defined probability measures for any $n\ge N_0$.
Note that
$$\|u_n\|_K^2 = \frac{C_1^2}{L^2(k_n)} \le L^{-2}C_1^2$$
and
$$\|u_n\|^2_{L^2(P_0)} = \frac{C_1^2\lambda_{k_n}}{L^2(k_n)} = \frac{C_1^2}{L(k_n)}k_n^{-2s} \ge L^{-1}C_1^2k_n^{-2s} \sim L^{-1}C_1^2C_2^{-2s}n^{-1/2},$$
where $A_n\sim B_n$ means that $\lim_{n\to\infty}A_n/B_n=1$. Thus, by choosing $C_1$ sufficiently small and $c_0=\frac12L^{-1}C_1^2C_2^{-2s}$, we ensure that $P_n\in\mathcal P(c_0n^{-1/2},0)$ for sufficiently large $n$.

To apply Lemma 7, we note that
$$\lim_{n\to\infty}\|u_n\|^2_{L^2(P_0)} = \lim_{n\to\infty}\frac{C_1^2\lambda_{k_n}}{L^2(k_n)} = 0.$$
Proof [Proof of Theorem 2] Let $\tilde K_n(\cdot,\cdot) := \tilde K_{\varrho_n}(\cdot,\cdot)$. Note that
$$v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat P_n,P_0) - A_n\big] = 2(n^2v_n)^{-1/2}\sum_{j=2}^n\sum_{i=1}^{j-1}\tilde K_n(X_i,X_j).$$
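This identity is just the split of the V-statistic double sum into its diagonal part $A_n$ (under the natural reading $A_n = \frac1n\sum_i\tilde K_n(X_i,X_i)$) and twice the upper-triangular sum; a quick numerical check with a toy kernel (the kernel and function names are our own illustrative choices, not the paper's):

```python
import math

def ktilde(x, y):
    # toy stand-in for the moderated kernel K~_n; any symmetric kernel works here
    return 2.0 * math.cos(math.pi * x) * math.cos(math.pi * y)

def vstat_split(xs):
    # full V-statistic (1/n) sum_{i,j}, its diagonal part, and twice the upper triangle
    n = len(xs)
    full = sum(ktilde(a, b) for a in xs for b in xs) / n
    diag = sum(ktilde(a, a) for a in xs) / n
    offdiag = 2.0 * sum(ktilde(xs[i], xs[j]) for j in range(1, n) for i in range(j)) / n
    return full, diag, offdiag
```

The three returned quantities satisfy full = diag + offdiag, which is the decomposition used in the display above.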
Let $\zeta_{nj} = \sum_{i=1}^{j-1}\tilde K_n(X_i,X_j)$ and consider the filtration $\{\mathcal F_j: j\ge1\}$ where $\mathcal F_j = \sigma\{X_i: 1\le i\le j\}$. Due to the assumption that $K$ is degenerate, we have $E\varphi_k(X)=0$ for any $k\ge1$, which implies that
$$E(\zeta_{nj}\,|\,\mathcal F_{j-1}) = \sum_{i=1}^{j-1}E[\tilde K_n(X_i,X_j)\,|\,\mathcal F_{j-1}] = \sum_{i=1}^{j-1}E[\tilde K_n(X_i,X_j)\,|\,X_i] = 0$$
for any $j\ge2$. Write
$$U_{nm} = \begin{cases} 0, & m=1,\\ \sum_{j=2}^m\zeta_{nj}, & m\ge2.\end{cases}$$
Then for any fixed $n$, $\{U_{nm}\}_{m\ge1}$ is a martingale with respect to $\{\mathcal F_m: m\ge1\}$ and
We now apply the martingale central limit theorem to $U_{nn}$. Following the argument from Hall (1984), it can be shown that
$$\Big[\frac12n^2E\tilde K_n^2(X,X')\Big]^{-1/2}U_{nn} \xrightarrow{\,d\,} N(0,1), \tag{14}$$
provided that
$$\big[EG_n^2(X,X') + n^{-1}E\tilde K_n^2(X,X')\tilde K_n^2(X,X'') + n^{-2}E\tilde K_n^4(X,X')\big]\big/\big[E\tilde K_n^2(X,X')\big]^2 \to 0, \tag{15}$$
where the last step holds by considering that $\lambda_k\asymp k^{-2s}$. Hereafter, we shall write $a_n\asymp b_n$ if $0<\varliminf_{n\to\infty}a_n/b_n\le\varlimsup_{n\to\infty}a_n/b_n<\infty$ for two positive sequences $\{a_n\}$ and $\{b_n\}$. Similarly,
$$EG_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^4 \le |\{k:\lambda_k\ge\varrho_n^2\}| + \varrho_n^{-8}\sum_{\lambda_k<\varrho_n^2}\lambda_k^4 \asymp \varrho_n^{-1/s},$$
and
$$E\tilde K_n^2(X,X')\tilde K_n^2(X,X'') = E\bigg\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\varphi_k^2(X)\bigg\}^2 \le \Big(\sup_{k\ge1}\|\varphi_k\|_\infty\Big)^4\bigg\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\bigg\}^2 \asymp \varrho_n^{-2/s}.$$
Consequently,
$$n^{-1}E\tilde K_n^2(X,X')\tilde K_n^2(X,X'')\big/\big[E\tilde K_n^2(X,X')\big]^2 \le C_3n^{-1} \to 0. \tag{17}$$
Moreover,
$$\|\tilde K_n\|_\infty = \sup_x\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\varphi_k^2(x) \le \Big(\sup_{k\ge1}\|\varphi_k\|_\infty\Big)^2\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2} \asymp \varrho_n^{-1/s},$$
so that
$$n^{-2}E\tilde K_n^4(X,X')\big/\big[E\tilde K_n^2(X,X')\big]^2 \le n^{-2}\|\tilde K_n\|_\infty^2\big/E\tilde K_n^2(X,X') \le C_4(n^2\varrho_n^{1/s})^{-1} \to 0 \tag{18}$$
as $n\to\infty$. Together, (16), (17) and (18) ensure that condition (15) holds.
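The rate $\varrho_n^{-1/s}$ appearing in these bounds can be sanity-checked directly from $\lambda_k\asymp k^{-2s}$; a sketch of the two pieces of the bound on $EG_n^2(X,X')$:
$$|\{k:\lambda_k\ge\varrho_n^2\}| \asymp |\{k:k\le\varrho_n^{-1/s}\}| \asymp \varrho_n^{-1/s}, \qquad \varrho_n^{-8}\sum_{k>\varrho_n^{-1/s}}k^{-8s} \asymp \varrho_n^{-8}\cdot\varrho_n^{(8s-1)/s} = \varrho_n^{-1/s},$$
using $\sum_{k>K}k^{-8s}\asymp K^{1-8s}$ for $8s>1$.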
Obviously, $E_PV_1V_2 = 0$. We first argue that the following three statements together imply the desired result:

This immediately suggests that $T_{\rm M3d}$ is consistent. We now show that (19)-(21) indeed hold.
Verifying (19). We begin with (19). Since $v_n\asymp\varrho_n^{-1/s}$ and $V_3=(n-1)\eta^2_{\varrho_n}(P,P_0)$, (19) is equivalent to
$$\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}n\varrho_n^{\frac1{2s}}\eta^2_{\varrho_n}(P,P_0) = \infty.$$
(1) First consider the case $\theta=0$, where
$$\eta^2_{\varrho_n}(P,P_0) \ge \|u\|^2_{L^2(P_0)} - \varrho_n^2M^2.$$
Take $\varrho_n\le\sqrt{\Delta_n/(2M^2)}$ so that $\varrho_n^2M^2\le\frac12\Delta_n$. Then we have
$$\inf_{P\in\mathcal P(\Delta_n,0)}\eta^2_{\varrho_n}(P,P_0) \ge \frac12\inf_{P\in\mathcal P(\Delta_n,0)}\|u\|^2_{L^2(P_0)} = \frac12\Delta_n.$$
(2) Now consider the case when $\theta>0$. For $P\in\mathcal P(\Delta_n,\theta)$, $\forall\,R>0$, $\exists\,f_R\in\mathcal H(K)$ such that $\|u-f_R\|_{L^2(P_0)}\le MR^{-1/\theta}$ and $\|f_R\|_K\le R$. Let $b_k = \langle f_R,\varphi_k\rangle_{L^2(P_0)}$. Then
$$\eta^2_{\varrho_n}(P,P_0) = \sum_{k\ge1}a_k^2 - \sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}a_k^2 \ge \|u\|^2_{L^2(P_0)} - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}(a_k-b_k)^2 - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}b_k^2 \ge \|u\|^2_{L^2(P_0)} - 2\sum_{k\ge1}(a_k-b_k)^2 - 2\varrho_n^2\sum_{k\ge1}\frac{b_k^2}{\lambda_k}.$$
In both cases, with $\varrho_n\le C\Delta_n^{\frac{\theta+1}2}$ for a sufficiently small $C=C(M)>0$, $\lim_{n\to\infty}\varrho_n^{\frac1{2s}}n\Delta_n=\infty$ suffices to ensure that (19) holds. Under the condition that $\lim_{n\to\infty}\Delta_nn^{\frac{4s}{4s+\theta+1}}=\infty$,
$$\varrho_n = cn^{-\frac{2s(\theta+1)}{4s+\theta+1}} \le C\Delta_n^{\frac{\theta+1}2}$$
for sufficiently large $n$, and $\lim_{n\to\infty}\varrho_n^{\frac1{2s}}n\Delta_n=\infty$ holds as well.
Verifying (20). Write
$$V_1 = \frac1n\sum_{\substack{1\le i,j\le n\\ i\ne j}}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[\varphi_k(X_i)-E_P\varphi_k(X)][\varphi_k(X_j)-E_P\varphi_k(X)] := \frac1n\sum_{\substack{1\le i,j\le n\\ i\ne j}}F_n(X_i,X_j).$$
Then
$$E_PV_1^2 = \frac1{n^2}\sum_{\substack{i\ne j\\ i'\ne j'}}E_PF_n(X_i,X_j)F_n(X_{i'},X_{j'}) = \frac{2n(n-1)}{n^2}E_PF_n^2(X,X') \le 2E_PF_n^2(X,X'),$$
where $X,X'\overset{\text{i.i.d.}}{\sim}P$. Recall that, for any two random variables $Y_1,Y_2$ such that $EY_1^2<\infty$, $E[Y_1-E(Y_1|Y_2)]^2\le EY_1^2$. Since
$$F_n(X,X') = \tilde K_n(X,X') - E_P[\tilde K_n(X,X')|X] - E_P[\tilde K_n(X,X')|X'] + E_P\tilde K_n(X,X') = \tilde K_n(X,X') - E_P[\tilde K_n(X,X')|X] - E\Big\{\tilde K_n(X,X') - E_P[\tilde K_n(X,X')|X]\,\Big|\,X'\Big\},$$
we have
$$E_PF_n^2(X,X') \le E_P\tilde K_n^2(X,X'). \tag{22}$$
For any $g\in L^2(P_0)$ and positive definite kernel $G(\cdot,\cdot)$ such that $E_{P_0}G^2(X,X')<\infty$, let
$$\|g\|_G := \sqrt{E_{P_0}[g(X)g(X')G(X,X')]}.$$
By the positive definiteness of $G(\cdot,\cdot)$, the triangle inequality holds for $\|\cdot\|_G$, i.e., for any $g_1,g_2\in L^2(P_0)$,
We now appeal to the following lemma to bound the right hand side of (22):

By Lemma 8, we get
$$E_{P_0}[u(X)u(X')\tilde K_n^2(X,X')] \le C_5\|u\|^2_{L^2(P_0)}\sum_k\frac{\lambda_k}{\lambda_k+\varrho_n^2} \asymp \varrho_n^{-1/s}\|u\|^2_{L^2(P_0)}.$$
Recall that
$$E_{P_0}\tilde K_n^2(X,X') = \sum_k\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 \asymp \varrho_n^{-1/s}.$$
Thus,
$$E_P\tilde K_n^2(X,X') \le 2\big\{E_{P_0}\tilde K_n^2(X,X') + E_{P_0}[u(X)u(X')\tilde K_n^2(X,X')]\big\} \le C_6\,\varrho_n^{-1/s}\big[1+\|u\|^2_{L^2(P_0)}\big].$$
On the other hand, as already shown in the part verifying (19), $\varrho_n\asymp\Delta_n^{\frac{\theta+1}2}$ suffices to ensure that for sufficiently large $n$,
$$\frac14\|u\|^2_{L^2(P_0)} \le \eta^2_{\varrho_n}(P,P_0) \le \|u\|^2_{L^2(P_0)}, \qquad \forall\,P\in\mathcal P(\Delta_n,\theta).$$
Thus, provided that $\lim_{n\to\infty}n^{\frac{4s}{4s+\theta+1}}\Delta_n=\infty$, this immediately implies (20).
Verifying (21). Note that
$$E_PV_2^2 \le 4nE_P\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)][\varphi_k(X)-E_P\varphi_k(X)]\bigg\}^2 \le 4nE_P\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2 = 4nE_{P_0}[1+u(X)]\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2.$$
It is clear that
$$E_{P_0}\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2 = \sum_{k,k'\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\cdot\frac{\lambda_{k'}}{\lambda_{k'}+\varrho_n^2}\,E_P\varphi_k(X)\,E_P\varphi_{k'}(X)\,E_{P_0}[\varphi_k(X)\varphi_{k'}(X)] = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2[E_P\varphi_k(X)]^2 \le \eta^2_{\varrho_n}(P,P_0).$$
Moreover,
$$E_{P_0}\,u(X)\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2 \le \sqrt{E_{P_0}\,u^2(X)\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2}\times\sqrt{E_{P_0}\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2}$$
$$\le \|u\|_{L^2(P_0)}\sup_x\bigg|\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(x)\bigg|\cdot\eta_{\varrho_n}(P,P_0)$$
$$\le \sup_k\|\varphi_k\|_\infty\,\|u\|_{L^2(P_0)}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big|E_P\varphi_k(X)\big|\cdot\eta_{\varrho_n}(P,P_0)$$
$$\le \sup_k\|\varphi_k\|_\infty\,\|u\|_{L^2(P_0)}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]^2}\cdot\eta_{\varrho_n}(P,P_0)$$
$$\le C_7\,\|u\|_{L^2(P_0)}\,\varrho_n^{-\frac1{2s}}\,\eta^2_{\varrho_n}(P,P_0).$$
under the assumption that $\lim_{n\to\infty}n^{\frac{4s}{4s+\theta+1}}\Delta_n = \infty$.
Proof [Proof of Theorem 4] The overall architecture is now standard in establishing minimax lower bounds for nonparametric hypothesis testing. The main idea is to carefully construct a set of points under the alternative hypothesis and argue that a mixture of these alternatives cannot be reliably distinguished from the null. See, e.g., Ingster (1993); Ingster and Suslina (2003); Tsybakov (2008). Without loss of generality, assume $M=1$ and $\Delta_n = cn^{-\frac{4s}{4s+\theta+1}}$ for some $c>0$.
Let us consider the cases of $\theta=0$ and $\theta>0$ separately.

The case of $\theta=0$. We first treat the case when $\theta=0$. Let $B_n = \lfloor C_8\Delta_n^{-1/(2s)}\rfloor$ for a sufficiently small constant $C_8>0$ and $a_n = \sqrt{\Delta_n/B_n}$. For any $\xi_n := (\xi_{n1},\xi_{n2},\cdots,\xi_{nB_n})^\top\in\{\pm1\}^{B_n}$, write
$$u_{n,\xi_n} = a_n\sum_{k=1}^{B_n}\xi_{nk}\varphi_k.$$
It is clear that
$$\|u_{n,\xi_n}\|^2_{L^2(P_0)} = B_na_n^2 = \Delta_n$$
and
$$\|u_{n,\xi_n}\|_\infty \le a_nB_n\sup_k\|\varphi_k\|_\infty \asymp \Delta_n^{\frac{2s-1}{4s}} \to 0.$$
Moreover,
$$\|u_{n,\xi_n}\|_K^2 = a_n^2\sum_{k=1}^{B_n}\lambda_k^{-1} \le 1.$$
Therefore, there exists a probability measure $P_{n,\xi_n}\in\mathcal P(\Delta_n,0)$ such that $dP_{n,\xi_n}/dP_0 = 1+u_{n,\xi_n}$. Following a standard argument for minimax lower bounds, it suffices to show that
$$\varlimsup_{n\to\infty}\,E_{P_0}\bigg\{\frac1{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\bigg\}^2 < \infty. \tag{23}$$
Note that
$$E_{P_0}\bigg\{\frac1{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\bigg\}^2 = E_{P_0}\bigg\{\frac1{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\prod_{i=1}^n[1+u_{n,\xi_n'}(X_i)]\bigg\}$$
$$= \frac1{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\prod_{i=1}^nE_{P_0}\{[1+u_{n,\xi_n}(X_i)][1+u_{n,\xi_n'}(X_i)]\} = \frac1{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\bigg(1+a_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi_{nk}'\bigg)^{\!n}$$
$$\le \frac1{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\exp\bigg(na_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi_{nk}'\bigg) = \bigg[\frac{\exp(na_n^2)+\exp(-na_n^2)}{2}\bigg]^{B_n} \le \exp\Big(\frac12B_nn^2a_n^4\Big).$$
See, e.g., Baraud (2002). With the particular choice of $B_n$, $a_n$, and the conditions on $\Delta_n$, this immediately implies (23).
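The exponent arithmetic behind the last step can be checked directly: since $B_n\asymp\Delta_n^{-1/(2s)}$ and $a_n^2=\Delta_n/B_n$,
$$B_nn^2a_n^4 = \frac{n^2\Delta_n^2}{B_n} \asymp n^2\Delta_n^{2+\frac1{2s}} = n^2\Delta_n^{\frac{4s+1}{2s}} \asymp n^2\big(n^{-\frac{4s}{4s+1}}\big)^{\frac{4s+1}{2s}} = n^2\cdot n^{-2} = O(1),$$
so $\exp(B_nn^2a_n^4/2)$ stays bounded, which is exactly what (23) requires (here $\theta=0$, so $\Delta_n\asymp n^{-4s/(4s+1)}$).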
The case of $\theta>0$. The main idea is similar to before. To find a set of probability measures in $\mathcal P(\Delta_n,\theta)$, we appeal to the following lemma.

Lemma 9 Let $u = \sum_k a_k\varphi_k$. If
$$\sup_{B\ge1}\bigg(\sum_{k=1}^B\frac{a_k^2}{\lambda_k}\bigg)^{1/\theta}\sum_{k\ge B}a_k^2 \le M^2,$$
then $u\in\mathcal F(\theta,M)$.

Similar to before, we shall now take $B_n = \big\lfloor C_{10}\Delta_n^{-\frac{\theta+1}{2s}}\big\rfloor$ and $a_n = \sqrt{\Delta_n/B_n}$. By Lemma 9, we can find $P_{n,\xi_n}\in\mathcal P(\Delta_n,\theta)$ such that $dP_{n,\xi_n}/dP_0 = 1+u_{n,\xi_n}$, for appropriately chosen $C_{10}$. Following the same argument as in the previous case, we can again verify (23).
Proof [Proof of Theorem 5] Without loss of generality, assume that $\Delta_n(\theta) = c_1\big(n^{-1}\sqrt{\log\log n}\big)^{\frac{4s}{4s+\theta+1}}$ for some constant $c_1>0$ to be determined later.
Type I Error. We first prove the first statement, which shows that the Type I error converges to 0. Following the notation defined in the proof of Theorem 2, let
$$N_{n,2} = E\bigg\{\sum_{j=2}^nE\big(\tilde\zeta_{nj}^2\,\big|\,\mathcal F_{j-1}\big) - 1\bigg\}^2, \qquad L_{n,2} = \sum_{j=2}^nE\tilde\zeta_{nj}^4,$$
where $\tilde\zeta_{nj} = \sqrt2\,\zeta_{nj}/(n\sqrt{v_n})$. As shown by Haeusler (1988),

where $\bar\Phi(t)$ is the survival function of the standard normal, i.e., $\bar\Phi(t) = P(Z>t)$ with $Z\sim N(0,1)$. Again by the argument from Hall (1984),
$$E\bigg\{\sum_{j=2}^nE(\zeta_{nj}^2\,|\,\mathcal F_{j-1}) - \frac12n(n-1)v_n\bigg\}^2 \le C_{12}\big[n^4EG_n^2(X,X') + n^3E\tilde K_n^2(X,X')\tilde K_n^2(X,X'')\big],$$
which ensures
$$N_{n,2} = \frac{4E\Big\{\sum_{j=2}^nE(\zeta_{nj}^2\,|\,\mathcal F_{j-1}) - \frac12n(n-1)v_n - \frac12nv_n\Big\}^2}{n^4v_n^2} \le 8\max\Big\{C_{12},\frac14\Big\}\bigg\{\frac{EG_n^2(X,X')}{v_n^2} + \frac{E\tilde K_n^2(X,X')\tilde K_n^2(X,X'')}{nv_n^2} + \frac1{n^2}\bigg\},$$
and
$$L_{n,2} = \frac{4\sum_{j=2}^nE\zeta_{nj}^4}{n^4v_n^2} \le 4C_{13}\bigg\{\frac{E\tilde K_n^4(X,X')}{n^2v_n^2} + \frac{E\tilde K_n^2(X,X')\tilde K_n^2(X,X'')}{nv_n^2}\bigg\}.$$
Therefore,
$$\sup_t\big|P(T_{n,\varrho_n}>t) - \bar\Phi(t)\big| \le C_{14}\big(\varrho_n^{\frac1{5s}} + n^{-\frac15} + n^{-\frac25}\varrho_n^{-\frac1{5s}}\big),$$
It is clear that $\tilde T_n \ge T_{n,\tilde\varrho_n(\theta)}$ for any $\theta\ge0$. It therefore suffices to show that
$$\lim_{n\to\infty}\inf_{\theta\ge0}\inf_{P\in\mathcal P(\Delta_n,\theta)}P\Big\{T_{n,\tilde\varrho_n(\theta)}\ge\sqrt{3\log\log n}\Big\} = 1.$$
$$\frac12\varrho_n(\theta) \le \tilde\varrho_n(\theta) \le \varrho_n(\theta), \tag{26}$$
which immediately suggests
$$E_PT_{n,\tilde\varrho_n(\theta)} \ge C_{17}n[\tilde\varrho_n(\theta)]^{1/(2s)}\eta^2_{\tilde\varrho_n(\theta)}(P,P_0) \ge 2^{-1/(2s)}C_{17}n[\varrho_n(\theta)]^{1/(2s)}\eta^2_{\varrho_n(\theta)}(P,P_0), \tag{27}$$
and
$$\eta^2_{\varrho_n(\theta)}(P,P_0) \ge \frac14\|u\|^2_{L^2(P_0)} \tag{28}$$
provided that $\Delta_n(\theta) \ge C'(M)\Big(\frac{\sqrt{\log\log n}}{n}\Big)^{\frac{4s}{4s+\theta+1}}$.
Therefore,
$$\inf_{P\in\mathcal P(\Delta_n(\theta),\theta)}E_PT_{n,\tilde\varrho_n(\theta)} \ge C_{18}n[\varrho_n(\theta)]^{1/(2s)}\Delta_n(\theta) \ge C_{18}c_1\sqrt{\log\log n} \ge \tilde M\sqrt{\log\log n}$$
if $c_1 \ge C_{18}^{-1}\tilde M$. Hence to ensure (24) holds, it suffices to take
$$c_1 = \max\{C'(M),\,C_{18}^{-1}\tilde M\}.$$
With (26), (27) and (28), the results in the proof of Theorem 3 imply that for sufficiently large $n$
$$\sup_{P\in\mathcal P(\Delta_n^*(\theta),\theta)}\frac{\operatorname{Var}T_{n,\tilde\varrho_n(\theta)}}{E_P^2T_{n,\tilde\varrho_n(\theta)}} \le C_{19}\Big\{\big([\varrho_n(\theta)]^{\frac1{2s}}n\Delta_n^*(\theta)\big)^{-1} + \big([\varrho_n(\theta)]^{\frac1s}n^2\Delta_n^*(\theta)\big)^{-1} + \big(n\Delta_n^*(\theta)\big)^{-1} + \big([\varrho_n(\theta)]^{\frac1{2s}}n\sqrt{\Delta_n^*(\theta)}\big)^{-1}\Big\} \le 2C_{19}\big([\varrho_n(\theta)]^{\frac1{2s}}n\Delta_n^*(\theta)\big)^{-1/2} = 2C_{19}\big(c_1\sqrt{\log\log n}\big)^{-1/2} \to 0,$$
Proof [Proof of Theorem 6] The main idea of the proof is similar to that for Theorem 4. Nevertheless, in order to show that the corresponding error converges to 1 rather than merely being bounded away from 0, we need to find $P_\pi$, the marginal distribution on $\mathcal X^n$ whose conditional distribution is selected from $\cup_{\theta\in[\theta_1,\theta_2]}\mathcal P(\Delta_n(\theta),\theta)$ according to a prior distribution $\pi$, such that the $\chi^2$ distance between $P_\pi$ and $P_0^{\otimes n}$ converges to 0. See Ingster (2000).
To this end, assume, without loss of generality, that
$$\Delta_n(\theta) = c_2\Big(\frac{n}{\sqrt{\log\log n}}\Big)^{-\frac{4s}{4s+\theta+1}}, \qquad \forall\,\theta\in[\theta_1,\theta_2],$$
where $c_2>0$ is a sufficiently small constant to be determined later.

Let $r_n = \lfloor C_{20}\log n\rfloor$ and $B_{n,1} = \big\lfloor C_{21}[\Delta_n(\theta_1)]^{-\frac{\theta_1+1}{2s}}\big\rfloor$ for sufficiently small $C_{20},C_{21}>0$. Set $\theta_{n,1}=\theta_1$. For $2\le r\le r_n$, let
$$B_{n,r} = 2^{r-2}B_{n,1},$$
and let $\theta_{n,r}$ be selected such that
$$B_{n,r} = \Big\lfloor C_{21}[\Delta_n(\theta_{n,r})]^{-\frac{\theta_{n,r}+1}{2s}}\Big\rfloor.$$
Define
$$f_{n,r,\xi_{n,r}} = 1 + a_{n,r}\sum_{k=B^*_{n,r-1}+1}^{B^*_{n,r}}\xi_{n,r,\,k-B^*_{n,r-1}}\varphi_k,$$
and $a_{n,r} = \sqrt{\Delta_n(\theta_{n,r})/B_{n,r}}$. Following the same argument as that in the proof of Theorem 4, we can verify that with a sufficiently small $C_{21}$, each $P_{n,r,\xi_{n,r}}\in\mathcal P(\Delta_n(\theta_{n,r}),\theta_{n,r})$, where $f_{n,r,\xi_{n,r}}$ is the Radon-Nikodym derivative $dP_{n,r,\xi_{n,r}}/dP_0$. With a slight abuse of notation, write
$$f_n(X_1,X_2,\cdots,X_n) = \frac1{r_n}\sum_{r=1}^{r_n}f_{n,r}(X_1,X_2,\cdots,X_n),$$
where
$$f_{n,r}(X_1,X_2,\cdots,X_n) = \frac1{2^{B_{n,r}}}\sum_{\xi_{n,r}\in\{\pm1\}^{B_{n,r}}}\prod_{i=1}^nf_{n,r,\xi_{n,r}}(X_i).$$
Following the same derivation as that in the proof of Theorem 4, we can show that
$$\|f_{n,r}\|^2_{L^2(P_0)} \le \bigg[\frac{\exp(na^2_{n,r})+\exp(-na^2_{n,r})}{2}\bigg]^{B_{n,r}} \le \exp\Big(\frac12B_{n,r}n^2a^4_{n,r}\Big)$$
for sufficiently large $n$. By setting $c_2$ in the expression of $\Delta_n(\theta)$ sufficiently small, we have
$$B_{n,r}n^2a^4_{n,r} \le \log r_n,$$
which ensures that
$$\sum_{1\le r\le r_n}\|f_{n,r}\|^2_{L^2(P_0)} \le r_n^{3/2} = o(r_n^2).$$
Acknowledgments
We would like to thank the editor, Arthur Gretton, and the anonymous reviewers for their insightful comments, which helped greatly improve the paper. We also acknowledge support for this project from the National Science Foundation (NSF grants DMS-1803450 and DMS-2015285).
Appendix A.
Proof [Proof of Lemma 8] We have
$$G^2(x,x') = \sum_{k,l}\mu_k\mu_l\varphi_k(x)\varphi_l(x)\varphi_k(x')\varphi_l(x').$$
Thus
$$\int g(x)g(x')G^2(x,x')\,dP(x)\,dP(x') = \sum_{k,l}\mu_k\mu_l\bigg(\int g(x)\varphi_k(x)\varphi_l(x)\,dP(x)\bigg)^2 \le \mu_1\sum_k\mu_k\sum_l\bigg(\int g(x)\varphi_k(x)\varphi_l(x)\,dP(x)\bigg)^2 \le \mu_1\sum_k\mu_k\bigg(\int g^2(x)\varphi_k^2(x)\,dP(x)\bigg) \le \mu_1\Big(\sum_k\mu_k\Big)\Big(\sup_k\|\varphi_k\|_\infty\Big)^2\|g\|^2_{L^2(P)}.$$
Proof [Proof of Lemma 9] By definition, it suffices to show that $\forall\,R>0$, $\exists\,f_R\in\mathcal H(K)$ such that $\|f_R\|^2_K\le R^2$ and $\|u-f_R\|^2_{L^2(P_0)}\le M^2R^{-2/\theta}$. To this end, let $B$ be such that $l_B^2\le R^2\le l^2_{B+1}$, where $l_B^2 := \sum_{k=1}^Ba_k^2/\lambda_k$, and set
$$f_R = \sum_{k=1}^Ba_k\varphi_k + a^*_{B+1}(R)\varphi_{B+1},$$
where
$$a^*_{B+1}(R) = \operatorname{sgn}(a_{B+1})\sqrt{\lambda_{B+1}(R^2-l_B^2)}.$$
Clearly,
$$\|f_R\|^2_K = \sum_{k=1}^B\frac{a_k^2}{\lambda_k} + \frac{(a^*_{B+1}(R))^2}{\lambda_{B+1}} = R^2,$$
and
$$\|u-f_R\|^2_{L^2(P_0)} = \sum_{k>B+1}a_k^2 + \Big(|a_{B+1}|-\sqrt{\lambda_{B+1}(R^2-l_B^2)}\Big)^2 \le \sum_{k\ge B+1}a_k^2.$$
References
L. Addario-Berry, N. Broutin, L. Devroye, and G. Lugosi. On combinatorial testing prob-
lems. The Annals of Statistics, 38(5):3063–3092, 2010.
F. Bach. On the equivalence between kernel quadrature rules and random feature expan-
sions. Journal of Machine Learning Research, 18(21):1–38, 2017.
A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005.
N. Dunford and J. T. Schwartz. Linear Operators: Part II: Spectral Theory: Self Adjoint
Operators in Hilbert Space. Interscience Publishers, 1963.
M. Fromont and B. Laurent. Adaptive goodness-of-fit tests in a density model. The Annals
of Statistics, 34(2):680–720, 2006.
J. Gorham and L. Mackey. Measuring sample quality with kernels. In International Con-
ference on Machine Learning, 2017.
G. G. Gregory. Large sample theory for U-statistics and tests of fit. The Annals of Statistics,
5(1):110–123, 1977.
E. Haeusler. On the rate of convergence in the central limit theorem for martingales with
discrete and continuous time. The Annals of Probability, 16(1):275–299, 1988.
P. Hall. Central limit theorem for integrated square error of multivariate nonparametric
density estimators. Journal of Multivariate Analysis, 14(1):1–16, 1984.
Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel fisher discrim-
inant analysis. In Advances in Neural Information Processing Systems, pages 609–616,
2007.
Yu. I. Ingster. Minimax testing of hypotheses on the distribution density for ellipsoids in ℓp. Theory of Probability & Its Applications, 39(3):417–436, 1995.
Yu. I. Ingster and I. A. Suslina. Minimax nonparametric hypothesis testing for ellipsoids and Besov bodies. ESAIM: Probability and Statistics, 4:53–135, 2000.
J. Kellner and A. Celisse. A one-sample test for normality with kernel methods. arXiv
preprint arXiv:1507.02904, 2015.
Q. Liu, J. Lee, and M. Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In
International Conference on Machine Learning, pages 276–284, 2016.
K. V. Mardia and P. E. Jupp. Directional Statistics. John Wiley & Sons, New York, NY,
2009.
J. S. Marron and M. P. Wand. Exact mean integrated squared error. The Annals of
Statistics, 20(2):712–736, 1992.
H. Q. Minh, P. Niyogi, and Y. Yao. Mercer’s theorem, feature maps, and smoothing. In
Conference on Learning Theory, pages 154–168, 2006.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany, 2007.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines.
Journal of Machine Learning Research, 2(Nov):67–93, 2001.
I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business
Media, 2008.
G. Wahba. Spline Models for Observational Data, volume 59. Siam, 1990.