On The Optimality of Kernel-Embedding Based Goodness-of-Fit Tests
Balasubramanian, Li and Yuan
Abstract
The reproducing kernel Hilbert space (RKHS) embedding of distributions offers a gen-
eral and flexible framework for testing problems in arbitrary domains and has attracted a
considerable amount of attention in recent years. To gain insights into their operating char-
acteristics, we study here the statistical performance of such approaches within a minimax
framework. Focusing on the case of goodness-of-fit tests, our analyses show that a vanilla
version of the kernel embedding based test could be minimax suboptimal, when considering
χ2 distance as the separation metric. Hence we suggest a simple remedy by moderating
the embedding. We prove that the moderated approach provides optimal tests for a wide
range of deviations from the null and can also be made adaptive over a large collection
of interpolation spaces. Numerical experiments are presented to further demonstrate the
merits of our approach.
Keywords: Adaptation, goodness of fit, maximum mean discrepancy, optimal rates of
convergence, reproducing kernel Hilbert space.
1. Introduction
In recent years, statistical tests based on the reproducing kernel Hilbert space (RKHS)
embedding of distributions have attracted much attention because of their flexibility and
broad applicability. Like other kernel methods, RKHS embedding based tests present a
general and unifying framework for testing problems in arbitrary domains by using appro-
priate kernels defined on those domains. See Muandet et al. (2017) for a detailed review
of kernel embedding and its applications. The idea of using kernel embedding for compar-
ing probability distributions was initially introduced by Smola et al. (2007); Gretton et al.
(2007, 2012a). Related extensions were also proposed by Harchaoui et al. (2007); Zaremba
et al. (2013). Furthermore, Sejdinovic et al. (2013) established a close relationship between
kernel-based hypothesis tests and the energy distance based test introduced by Székely et al.
(2007). See also Lyons (2013). More recently, motivated by several applications based on
quantifying the convergence of Monte Carlo simulations, Liu et al. (2016), Chwialkowski
et al. (2016) and Gorham and Mackey (2017) proposed goodness-of-fit tests which were
based on combining the kernel based approach with Stein's identity. A linear-time method
for goodness-of-fit testing was also proposed recently by Jitkrittum et al. (2017). Finally, the idea
of kernel embedding has also been used for constructing implicit generative models (e.g.,
Dziugaite et al., 2015; Li et al., 2015).
Despite their popularity, fairly little is known about the statistical performance of these
kernel embedding based tests. Our goal is to fill this void. In particular, we focus on kernel
embedding based goodness-of-fit tests and investigate their power under a general composite
alternative. Our results not only provide new insights on the operating characteristics of
these kernel embedding based tests but also suggest improved testing procedures that are
minimax optimal and adaptive over a large collection of alternatives, when considering χ2
distance as the separation metric.
More specifically, let X1 , · · · , Xn be n independent X -valued observations from a cer-
tain probability measure P . We are interested in testing if the hypothesis H0 : P = P0
holds for a fixed $P_0$. Problems of this kind have a long and illustrious history in statistics
and are often associated with household names such as the Kolmogorov-Smirnov test, Pearson's
chi-square test or Neyman's smooth test. A plethora of other techniques have also been
proposed over the years in both parametric and nonparametric settings (e.g., Ingster and
Suslina, 2003; Lehmann and Romano, 2008). Most of the existing techniques are developed
with the domain $\mathcal{X} = \mathbb{R}$ or $[0, 1]$ in mind and work best in these cases. Modern
applications, however, oftentimes involve domains different from these traditional ones. For
example, when dealing with directional data, which arise naturally in applications such as
diffusion tensor imaging, it is natural to consider X as the unit sphere in R3 (e.g., Jupp,
2005). Another example occurs in the context of ranking or preference data (e.g., Ailon
et al., 2008). In these cases, X can be taken as the group of permutations. Furthermore,
motivated by several applications, combinatorial testing problems have been investigated
recently (e.g., Addario-Berry et al., 2010), where the spaces under consideration are specific
combinatorially structured spaces.
A particularly attractive approach to goodness-of-fit testing problems in general domains
is through RKHS embedding of distributions. Specifically, let K : X × X → R be a
symmetric positive (semi-)definite kernel. The Moore-Aronszajn Theorem indicates that
there is an RKHS, denoted by $(\mathcal{H}(K), \langle \cdot, \cdot \rangle_K)$, uniquely identified with the kernel $K$ (e.g.,
Aronszajn, 1950). The RKHS embedding of a probability measure P with respect to K is
given by
$$\mu_P(\cdot) := \int_{\mathcal{X}} K(x, \cdot)\, dP(x).$$
It is well known that, under mild regularity conditions, $\mu_P \in \mathcal{H}(K)$ and, furthermore,
$$\mathbb{E}_P f(X) = \langle \mu_P, f \rangle_K, \quad \forall\, f \in \mathcal{H}(K),$$
where EP signifies that the expectation is taken over X ∼ P . The so-called maximum mean
discrepancy (MMD) between two probability measures P and Q is defined as
$$\gamma_K(P, Q) := \sup_{f \in \mathcal{H}(K):\, \|f\|_K \le 1} \int_{\mathcal{X}} f(x)\, d(P - Q)(x),$$
where $\|\cdot\|_K$ is the norm associated with $(\mathcal{H}(K), \langle\cdot,\cdot\rangle_K)$. It is not hard to see that
$$\gamma_K(P, Q) = \|\mu_P - \mu_Q\|_K.$$
See, e.g., Sriperumbudur et al. (2010) or Gretton et al. (2012a) for details. The goodness-
of-fit test can be carried out conveniently through RKHS embeddings of P and P0 by first
constructing an estimate of γK (P, P0 ):
$$\gamma_K(\widehat{P}_n, P_0) := \sup_{f \in \mathcal{H}(K):\, \|f\|_K \le 1} \int_{\mathcal{X}} f(x)\, d(\widehat{P}_n - P_0)(x),$$
where Pbn is the empirical distribution of X1 , · · · , Xn , and then rejecting H0 if the estimate
exceeds a threshold calibrated to ensure a certain significance level, say α (0 < α < 1).
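As a concrete illustration, the plug-in estimate $\gamma_K^2(\widehat{P}_n, P_0)$ can be computed from kernel evaluations alone. The following is a minimal sketch, assuming a Gaussian kernel and approximating expectations under $P_0$ by Monte Carlo draws; the kernel choice, bandwidth, and sampling scheme are illustrative, not prescribed here:

```python
import numpy as np

def gaussian_kernel(x, y, bw=1.0):
    """Gaussian (RBF) kernel matrix between rows of x and rows of y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

def mmd_sq(sample, null_sample, bw=1.0):
    """Plug-in estimate of gamma_K^2(P_n, P0) = ||mu_{P_n} - mu_{P0}||_K^2.
    Expectations under P0 are approximated by Monte Carlo draws."""
    kxx = gaussian_kernel(sample, sample, bw).mean()
    kxy = gaussian_kernel(sample, null_sample, bw).mean()
    kyy = gaussian_kernel(null_sample, null_sample, bw).mean()
    return kxx - 2.0 * kxy + kyy  # a squared RKHS norm, hence nonnegative

rng = np.random.default_rng(0)
x = rng.uniform(size=(200, 1))   # observed sample, here drawn from P0 itself
y = rng.uniform(size=(2000, 1))  # Monte Carlo draws from P0 = U[0, 1]
stat = mmd_sq(x, y, bw=0.2)      # small under H0
```

Under $H_0$ the statistic is of order $1/n$, which is exactly what makes the calibrated threshold of order $1/n$ in the test described above.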
In this paper, we investigate the power of the above discussed testing strategy under a
general composite alternative. Following the spirit of Ingster and Suslina (2003), we consider
in particular a set of alternatives that are increasingly close to the null hypothesis. To fix
ideas, we assume hereafter that P is dominated by P0 under the alternative so that the
Radon-Nikodym derivative dP/dP0 is well defined. Recall that the χ2 divergence between
P and P0 is defined as
$$\chi^2(P, P_0) := \int_{\mathcal{X}} \left(\frac{dP}{dP_0}\right)^2 dP_0 - 1.$$
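For intuition, in the discrete case the $\chi^2$ divergence reduces to a finite sum, which the following minimal sketch (with illustrative distributions) computes directly from the definition:

```python
import numpy as np

def chi_sq_divergence(p, p0):
    """chi^2(P, P0) = sum_i (p_i / p0_i)^2 * p0_i - 1 = sum_i p_i^2 / p0_i - 1,
    for a discrete P dominated by P0 (p0_i > 0 wherever p_i > 0)."""
    p, p0 = np.asarray(p, dtype=float), np.asarray(p0, dtype=float)
    return float(np.sum(p ** 2 / p0) - 1.0)

print(chi_sq_divergence([0.5, 0.5], [0.5, 0.5]))    # 0.0 when P = P0
print(chi_sq_divergence([0.5, 0.5], [0.25, 0.75]))  # 1/3, up to floating point
```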
We are particularly interested in the detection boundary, namely how close P and P0 can
be in terms of χ2 distance, under the alternative, so that a test based on a sample of n
observations can still consistently distinguish between the null hypothesis and the alternative. For example, in the parametric setting where $P$ is known up to a finite-dimensional parameter under the alternative, the detection boundary of the likelihood ratio test is $n^{-1}$
under mild regularity conditions (e.g., Theorem 13.5.4 in Lehmann and Romano, 2008, and
the discussion leading to it). We are concerned here with alternatives that are nonparametric in nature. Our first result suggests that the detection boundary for the aforementioned $\gamma_K(\widehat{P}_n, P_0)$ based test is of the order $n^{-1/2}$. However, our main results indicate, perhaps surprisingly at first, that this rate is far from optimal and that the gap between it and the usual parametric rate can be largely bridged.
In particular, we argue that the distinguishability between P and P0 depends on how
close u := dP/dP0 − 1 is to the RKHS H(K). The closeness of u to H(K) can be measured
by the distance from u to an arbitrary ball in H(K). In particular, we shall consider the case
where H(K) is dense in L2 (P0 ), and focus on functions that are polynomially approximable
by H(K) for concreteness. More precisely, for some constants M, θ > 0, denote by F(θ; M )
the collection of functions f ∈ L2 (P0 ) such that for any R > 0, there exists an fR ∈ H(K)
such that
$$\|f_R\|_K \le R, \quad \text{and} \quad \|f - f_R\|_{L_2(P_0)} \le M R^{-1/\theta}.$$
We also adopt the convention that
$$\mathcal{F}(0; M) = \{f \in \mathcal{H}(K) : \|f\|_K \le M\}.$$
See Section 5 for a more concrete example of the space F(θ; M ) when H(K) is the usual
Sobolev space defined over [0, 1]. Interested readers are also referred to Cucker and Zhou
(2007) for further discussion on these so-called interpolation spaces and their use in statis-
tical learning.
We investigate the optimal rate of detection for testing H0 : P = P0 against
$$H_1(\Delta_n, \theta, M) : P \in \mathcal{P}(\Delta_n, \theta, M), \qquad (1)$$
where $\mathcal{P}(\Delta_n, \theta, M)$ is the collection of distributions $P$ on $(\mathcal{X}, \mathcal{B})$ satisfying:
$$dP/dP_0 - 1 \in \mathcal{F}(\theta; M), \quad \text{and} \quad \chi^2(P, P_0) \ge \Delta_n.$$
We call $r_n$ the optimal rate of detection if for any $c > 0$ there exists no consistent test whenever $\Delta_n \le c\, r_n$; and, on the other hand, a consistent test exists as long as $\Delta_n \gg r_n$.
Although one could consider a more general setup, for concreteness, we assume that the eigenvalues of $K$ with respect to $L_2(P_0)$ decay polynomially in that $\lambda_k \asymp k^{-2s}$. We show that the optimal rate of detection for testing $H_0$ against $H_1(\Delta_n, \theta, M)$ for any $\theta \ge 0$ is $n^{-\frac{4s}{4s+\theta+1}}$. This rate of detection, although not achievable with a $\gamma_K(\widehat{P}_n, P_0)$ based test, can
be attained via a moderated version of the MMD based approach. A practical challenge to
the approach, however, is its reliance on the knowledge of θ. Unlike s which is determined
by K and P0 and therefore known a priori, θ depends on u and is not known in advance.
This naturally brings about the issue of adaptation: is there an agnostic approach that can
adaptively attain the optimal detection boundary without the knowledge of $\theta$? We show
that the answer is affirmative although a small price in the form of log log n is required to
achieve such adaptation.
The minimax framework we considered connects our work with the extensive statistics
literature on minimax hypothesis testing. See, e.g., Ingster (1987); Ermakov (1991); Ingster
(1993); Spokoiny (1996); Lepski and Spokoiny (1999); Ingster and Suslina (2000); Ingster
(2000); Baraud (2002); Fromont and Laurent (2006); Fromont et al. (2012, 2013), among
many others. As is customary in other areas of nonparametric statistics, these works usu-
ally start by characterizing function spaces for the alternatives such as Hölder, Sobolev or
more generally Besov spaces, followed by devising an optimal testing procedure according
to the specific class of alternatives. This creates a subtle yet important difference from
the kernel based approaches where a method or algorithm is developed first with a par-
ticular kernel often specific to the applications in mind, and it is therefore of interest to
investigate a posteriori the performance of such a method. The connection between the
two paradigms is well understood in the context of supervised learning thanks to the one-
to-one correspondence between a kernel and a RKHS: the two approaches are essentially
equivalent in that kernel methods with appropriate regularization could achieve minimax
optimality when considering functions from the RKHS with which the kernel identifies. See,
e.g., Wahba (1990); Scholkopf and Smola (2001). In a sense, our work establishes a similar
relationship in the context of kernel methods for hypothesis testing. Indeed, to achieve the
aforementioned optimal rate of detection, we introduce a class of tests based on a modified
MMD similar in spirit to that from Harchaoui et al. (2007). The modification we applied
is akin to the RKHS norm regularization commonly used for supervised learning. The key
idea is to allow the kernel in MMD to evolve with the number of observations so that MMD
becomes increasingly similar to the χ2 distance.
The rest of the paper is organized as follows. We first analyze the power of MMD based
tests in Section 2. This analysis reveals a significant gap between the detection boundary
achieved by the MMD based test and the usual parametric 1/n rate. In turn, this prompts
us to introduce, in Section 3, a class of tests based on a modified MMD that are rate optimal.
To address the practical challenge of choosing an appropriate tuning parameter for these
tests, we investigate the issue of optimal adaptation in Section 4, where we establish the
optimal rates of detection for adaptively testing H0 against a broader set of alternatives
and propose a test based on the modified MMD that can attain these rates. To further
illustrate the implications of our results, we consider in Section 5 the specific case of Sobolev
kernels and compare our results with those known for nonparametric testing within Sobolev
spaces. Numerical experiments are presented in Section 6. We conclude with some remarks
in Section 7. All proofs are relegated to Section 8.
Write
$$\bar{K}(x, x') := K(x, x') - \mathbb{E}_{P_0} K(X, x') - \mathbb{E}_{P_0} K(x, X') + \mathbb{E}_{P_0} K(X, X'),$$
where the subscript $P_0$ signifies the fact that the expectation is taken over $X, X' \sim P_0$ independently. By (4), $\gamma_K^2(P, P_0) = \gamma_{\bar{K}}^2(P, P_0)$. Therefore, without loss of generality, we can focus on kernels that are degenerate under $P_0$, i.e.,
$$\mathbb{E}_{P_0} K(X, x') = 0 \quad \text{for all } x' \in \mathcal{X}. \qquad (5)$$
For brevity, we shall omit the subscript K in γ in the rest of the paper, unless it is nec-
essary to emphasize the dependence of MMD on the reproducing kernel. Passing from a
nondegenerate kernel to a degenerate one however presents a subtlety regarding universality.
Universality of a kernel is essential for MMD by ensuring that dP/dP0 −1 resides in the linear
space spanned by its eigenfunctions. See, e.g., Steinwart (2001) for the definition of uni-
versal kernel and Sriperumbudur et al. (2011) for a detailed discussion of different types of
universality. Observe that dP/dP0 − 1 necessarily lies in the orthogonal complement of con-
stant functions in L2 (P0 ). A degenerate kernel K is universal if its eigenfunctions {ϕk }k≥1
form an orthonormal basis of the orthogonal complement of linear space {c · ϕ0 : c ∈ R}
where ϕ0 (x) = 1 in L2 (P0 ). In what follows, we shall assume that K is both degenerate
and universal.
For the sake of concreteness, we shall also assume that $K$ has infinitely many positive eigenvalues decaying polynomially, i.e.,
$$\lambda_k \asymp k^{-2s} \qquad (6)$$
for some $s > 1/2$. In addition, we also assume that the eigenfunctions of $K$ are uniformly bounded, i.e.,
$$\sup_{k \ge 1} \|\varphi_k\|_\infty < \infty. \qquad (7)$$
Assumption (7) is satisfied by many commonly used kernels, such as those associated with the Sobolev spaces we shall describe in further detail in Section 5. Together with Assumption (6), Assumption (7) ensures that Mercer's decomposition (2) holds uniformly. In general, however, there are also situations under which (7) may not hold. For example, when considering discrete and countable domains, several standard kernels do not satisfy Assumption (7); see, e.g.,
2.24 in Scholkopf and Smola (2001), p. 58. Although it is plausible that most if not all of
our results will continue to hold even without the uniform boundedness of ϕk s, a rigorous
argument to do away with it has so far eluded us.
Note that (5) implies EP0 ϕk (X) = 0, ∀ k ≥ 1. The uniform convergence in (2) together
with (3) give
$$\gamma^2(P, P_0) = \sum_{k \ge 1} \lambda_k \left[\mathbb{E}_P \varphi_k(X)\right]^2$$
for any P . Accordingly, when P is replaced by the empirical distribution Pbn , the empirical
squared MMD can be expressed as
" n #2
2 b
X 1X
γ (Pn , P0 ) = λk ϕk (Xi ) .
n
k≥1 i=1
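This expansion can be evaluated directly once an eigensystem is available. The following minimal sketch truncates the sum; the trigonometric eigensystem used is the one for the periodic Sobolev kernel under $P_0 = U[0,1]$ discussed in Section 5, with $s = 1$ (the truncation level and sample size are illustrative):

```python
import numpy as np

def empirical_mmd_sq(x, lam, phi, n_terms=200):
    """Truncated eigen-expansion of gamma^2(P_n, P0):
    sum_k lam_k * (n^{-1} sum_i phi_k(x_i))^2."""
    return sum(lam(k) * np.mean(phi(k, x)) ** 2 for k in range(1, n_terms + 1))

# Eigensystem under P0 = U[0, 1]: lambda_{2j-1} = lambda_{2j} = (2 pi j)^{-2s}, s = 1,
# with paired cosine/sine eigenfunctions.
lam = lambda k: (2 * np.pi * ((k + 1) // 2)) ** -2.0
def phi(k, x):
    j = (k + 1) // 2
    trig = np.cos if k % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * j * x)

rng = np.random.default_rng(1)
stat = empirical_mmd_sq(rng.uniform(size=500), lam, phi)  # small under H0
```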
It can be shown that, under $H_0$,
$$n\gamma^2(\widehat{P}_n, P_0) \xrightarrow{d} W := \sum_{k \ge 1} \lambda_k Z_k^2,$$
where $Z_k \overset{\text{i.i.d.}}{\sim} N(0, 1)$. Let $T_{\mathrm{MMD}}$ be an MMD based test, which rejects $H_0$ if and only if $n\gamma^2(\widehat{P}_n, P_0)$ exceeds the $1 - \alpha$ quantile $q_{w,1-\alpha}$ of $W$. The above limiting distribution of $n\gamma^2(\widehat{P}_n, P_0)$ immediately suggests that $T_{\mathrm{MMD}}$ is an asymptotic $\alpha$-level test.
The type II error of a test $T$ over a collection $\mathcal{P}$ of alternatives is
$$\beta(T; \mathcal{P}) := \sup_{P \in \mathcal{P}} \mathbb{E}_P (1 - T),$$
where $\mathbb{E}_P$ means taking expectation over $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P$. For brevity, we shall write $\beta(T; \Delta_n, \theta, M)$ instead of $\beta(T; \mathcal{P}(\Delta_n, \theta, M))$ in what follows. The performance of a test $T$
can then be evaluated by its detection boundary, that is, the smallest ∆n under which the
type II error converges to 0 as n → ∞. Our first result establishes the convergence rate
of the detection boundary for TMMD in the case when θ = 0. Hereafter, we abbreviate M
in P(∆n , θ, M ), H1 (∆n , θ, M ) and β(T ; ∆n , θ, M ), unless it is necessary to emphasize the
dependence.
In particular, whenever $n^{1/2} \Delta_n \to \infty$,
$$\beta(T_{\mathrm{MMD}}; \Delta_n, 0) \to 0 \quad \text{as } n \to \infty.$$
Theorem 1 shows that when the alternative H1 (∆n , 0) is considered, the detection
boundary of TMMD is of the order n−1/2 . It is of interest to compare the detection rate
achieved by TMMD with that in a parametric setting where consistent tests are available if
n∆n → ∞. See, e.g., Theorem 13.5.4 in Lehmann and Romano (2008) and the discussion
leading to it. It is natural to ask to what extent such a gap can be attributed to the fundamental difference between parametric and nonparametric testing problems. When considering $\chi^2$ distance as the separation metric, we shall now argue that this gap is largely due to the sub-optimality of $T_{\mathrm{MMD}}$, and that the detection boundary of $T_{\mathrm{MMD}}$ could be significantly improved through a slight modification of the MMD.
To improve upon $T_{\mathrm{MMD}}$, consider
$$\eta_{K,\varrho}(P, Q; P_0) := \sup_{f \in \mathcal{H}(K):\, \varrho^2 \|f\|_K^2 + \|f\|_{L_2(P_0)}^2 \le 1} \int_{\mathcal{X}} f\, d(P - Q) \qquad (8)$$
for a given distribution $P_0$ and a constant $\varrho > 0$. A distance between probability measures of this type was first introduced by Harchaoui et al. (2007) when considering kernel methods for the two-sample test. A subtle difference between $\eta_{K,\varrho}(P, Q; P_0)$ and the distance from Harchaoui et al. (2007) is the set of $f$ that we optimize over on the right-hand side of (8). In the case of the two-sample test, there is no information about $P_0$ and therefore one needs to replace the norm $\|\cdot\|_{L_2(P_0)}$ with the empirical $L_2$ norm.
It is worth noting that $\eta_{K,\varrho}(P, Q; P_0)$ can also be identified with a particular type of MMD. Specifically, $\eta_{K,\varrho}(P, Q; P_0) = \gamma_{\widetilde{K}_\varrho}(P, Q)$, where
$$\widetilde{K}_\varrho(x, x') := \sum_{k \ge 1} \frac{\lambda_k}{\lambda_k + \varrho^2}\, \varphi_k(x)\varphi_k(x').$$
We shall nonetheless still refer to ηK,% (P, Q; P0 ) as a moderated MMD in what follows to
emphasize the critical importance of moderation. We shall also abbreviate the dependence
of η on K and P0 unless necessary. The unit ball in (8) is defined in terms of both RKHS
norm and L2 (P0 ) norm. Recall that u = dP/dP0 − 1 so that
$$\sup_{\|f\|_{L_2(P_0)} \le 1} \int_{\mathcal{X}} f\, d(P - P_0) = \sup_{\|f\|_{L_2(P_0)} \le 1} \int_{\mathcal{X}} f u\, dP_0 = \|u\|_{L_2(P_0)} = \chi(P, P_0).$$
We can therefore expect that a smaller $\varrho$ will make $\eta_\varrho^2(P, P_0)$ closer to $\chi^2(P, P_0)$, since the unit ball to be considered becomes more similar to the unit ball in $L_2(P_0)$. This can also be verified by noticing that
$$\lim_{\varrho \to 0} \eta_\varrho^2(P, P_0) = \lim_{\varrho \to 0} \sum_{k \ge 1} \frac{\lambda_k}{\lambda_k + \varrho^2} \left[\mathbb{E}_P \varphi_k(X)\right]^2 = \sum_{k \ge 1} \left[\mathbb{E}_P \varphi_k(X)\right]^2 = \chi^2(P, P_0).$$
This test statistic is similar in spirit to the homogeneity test proposed previously by Harchaoui et al. (2007), albeit motivated from a different viewpoint. In either case, it is intuitive to expect improved performance over the vanilla version of the MMD when $\varrho_n$ converges to zero at an appropriate rate. The main goal of the present work is to precisely characterize the amount of moderation needed to ensure maximum power. We first argue that letting $\varrho_n$ converge to 0 at an appropriate rate indeed results in a test more powerful than $T_{\mathrm{MMD}}$.
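To make the effect of moderation concrete, here is a minimal sketch of the (truncated) moderated statistic $\sum_k \frac{\lambda_k}{\lambda_k + \varrho^2}\big[\frac{1}{n}\sum_i \varphi_k(X_i)\big]^2$, reusing the illustrative trigonometric eigensystem of Section 5 with $s = 1$. Note that shrinking $\varrho$ increases every weight $\lambda_k/(\lambda_k + \varrho^2)$ toward 1:

```python
import numpy as np

def moderated_mmd_sq(x, lam, phi, rho, n_terms=200):
    """Truncated moderated MMD: sum_k lam_k/(lam_k + rho^2) * (mean phi_k)^2.
    As rho -> 0 the weights tend to 1 and the statistic approaches the
    (truncated) chi-square expansion sum_k (mean phi_k)^2."""
    return sum(lam(k) / (lam(k) + rho ** 2) * np.mean(phi(k, x)) ** 2
               for k in range(1, n_terms + 1))

# Illustrative eigensystem under P0 = U[0, 1] (periodic Sobolev kernel, s = 1).
lam = lambda k: (2 * np.pi * ((k + 1) // 2)) ** -2.0
def phi(k, x):
    j = (k + 1) // 2
    trig = np.cos if k % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * j * x)

rng = np.random.default_rng(2)
x = rng.uniform(size=300)
small_rho = moderated_mmd_sq(x, lam, phi, rho=0.01)
large_rho = moderated_mmd_sq(x, lam, phi, rho=1.0)  # small_rho >= large_rho
```

Since each weight $\lambda_k/(\lambda_k + \varrho^2)$ is decreasing in $\varrho$ and each term is nonnegative, the statistic with the smaller $\varrho$ dominates the one with the larger $\varrho$ term by term.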
Theorem 3 Consider testing $H_0$ against $H_1(\Delta_n, \theta)$ by $T_{\mathrm{M^3d}}$ with $\varrho_n = c\, n^{-\frac{2s(\theta+1)}{4s+\theta+1}}$ for an arbitrary constant $c > 0$. If $n^{\frac{4s}{4s+\theta+1}} \Delta_n \to \infty$, then $T_{\mathrm{M^3d}}$ is consistent in that
$$\beta(T_{\mathrm{M^3d}}; \Delta_n, \theta) \to 0, \quad \text{as } n \to \infty.$$
Theorem 3 indicates that the detection boundary for $T_{\mathrm{M^3d}}$ is $n^{-4s/(4s+\theta+1)}$. In particular, when testing $H_0$ against $H_1(\Delta_n, 0)$, i.e., $\theta = 0$, it becomes $n^{-4s/(4s+1)}$. This is to be contrasted with the detection boundary for $T_{\mathrm{MMD}}$, which, as suggested by Theorem 1, is of the order $n^{-1/2}$. It is also worth noting that the detection boundary for $T_{\mathrm{M^3d}}$ deteriorates as $\theta$ increases, implying that it is harder to test against a larger interpolation space.
Together with Theorem 3, this suggests that $T_{\mathrm{M^3d}}$ is rate optimal in the minimax sense, when considering $\chi^2$ distance as the separation metric and $\mathcal{F}(\theta, M)$ as the regularity condition of the alternative space.
One may also consider a truncated version of $\widetilde{K}_\varrho$:
$$\breve{K}_{\rho,N}(x, x') := \sum_{k=1}^{N} \frac{\lambda_k}{\lambda_k + \rho^2}\, \varphi_k(x)\varphi_k(x').$$
Following the same argument as that of Theorem 2, it can be shown that, under $H_0$ where $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P_0$,
$$\breve{v}_n^{-1/2}\left[n\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0) - \breve{A}_n\right] \xrightarrow{d} N(0, 2),$$
where
$$\breve{v}_n = \sum_{k=1}^{N} \left(\frac{\lambda_k}{\lambda_k + \rho_n^2}\right)^2 \quad \text{and} \quad \breve{A}_n = \frac{1}{n} \sum_{i=1}^{n} \breve{K}_{\rho_n,N}(X_i, X_i),$$
as $n \to \infty$, provided that the conditions of Theorem 2 hold and $\lambda_N \lesssim \rho_n^2$. A test that rejects $H_0$ when
$$2^{-1/2}\, \breve{v}_n^{-1/2}\left[n\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0) - \breve{A}_n\right] \qquad (10)$$
exceeds $z_{1-\alpha}$ is therefore also an asymptotic $\alpha$-level test. By the same argument as that of Theorem 3, this test can also be shown to be consistent whenever $n^{\frac{4s}{4s+\theta+1}} \Delta_n \to \infty$. In other words, $\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0)$ can achieve the same rate of detection as $\eta_{\varrho_n}^2(\widehat{P}_n, P_0)$. Based on the polynomial decay rate assumption (6), the requirement $\lambda_N \lesssim \rho_n^2$ is equivalent to
$$N \gtrsim \varrho_n^{-1/s} \asymp n^{\frac{2(\theta+1)}{4s+\theta+1}}, \qquad (11)$$
suggesting how large $N$ needs to be in order for the truncated version of the M$^3$d test to achieve minimax optimality. This test could be more appealing in practice when the infinite sum defining $\widetilde{K}_\rho$ does not have a convenient closed form expression and requires nontrivial numerical approximation. On the other hand, the presence of the extra tuning parameter $N$ makes the exposition cumbersome in places. For brevity, we shall focus our attention on $T_{\mathrm{M^3d}}$, although all our discussion applies equally to both tests, and more generally to other suitable forms of moderation.
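A minimal sketch of the standardized statistic in (10), again with the illustrative trigonometric eigensystem ($s = 1$, $P_0 = U[0,1]$); under $H_0$ it is approximately standard normal by the normal approximation above (the truncation level, $\rho$, and sample are illustrative):

```python
import numpy as np

def standardized_stat(x, lam, phi, rho, n_terms=100):
    """2^{-1/2} v_n^{-1/2} [ n * eta^2 - A_n ] as in (10), where eta^2 uses
    the truncated shrunk weights w_k = lam_k/(lam_k + rho^2),
    A_n = (1/n) sum_i K_{rho,N}(X_i, X_i), and v_n = sum_k w_k^2."""
    n = len(x)
    w = np.array([lam(k) / (lam(k) + rho ** 2) for k in range(1, n_terms + 1)])
    Phi = np.array([phi(k, x) for k in range(1, n_terms + 1)])  # n_terms x n
    eta2 = np.sum(w * Phi.mean(axis=1) ** 2)
    a_n = np.mean(w @ Phi ** 2)  # (1/n) sum_i sum_k w_k phi_k(x_i)^2
    v_n = np.sum(w ** 2)
    return (n * eta2 - a_n) / np.sqrt(2.0 * v_n)

lam = lambda k: (2 * np.pi * ((k + 1) // 2)) ** -2.0
def phi(k, x):
    j = (k + 1) // 2
    trig = np.cos if k % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * j * x)

rng = np.random.default_rng(3)
t = standardized_stat(rng.uniform(size=500), lam, phi, rho=0.05)
```

The test then compares `t` to the standard normal quantile $z_{1-\alpha}$.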
4. Adaptation
Despite the minimax optimality of $T_{\mathrm{M^3d}}$, a practical challenge in using it is the choice of an appropriate tuning parameter $\varrho_n$. In particular, Theorem 3 suggests that $\varrho_n$ needs to be taken at the order of $n^{-2s(\theta+1)/(4s+\theta+1)}$, which depends on the values of $s$ and $\theta$. On the one hand, since $P_0$ and $K$ are known a priori, so is $s$. On the other hand, $\theta$ reflects the property of $dP/dP_0$, which is typically not known in advance. This naturally brings about the issue of adaptation (see, e.g., Spokoiny, 1996; Ingster, 2000). In other words, we are interested in a single testing procedure that can achieve the detection boundary for testing $H_0$ against $H_1(\Delta_n(\theta), \theta)$ simultaneously over all $\theta \ge 0$. We emphasize the dependence of $\Delta_n$ on $\theta$ since the detection boundary may depend on $\theta$, as suggested by the results from the previous section. To this end, we shall build upon the test statistic introduced before.
More specifically, write
$$\rho_* = \left(\frac{\sqrt{\log\log n}}{n}\right)^{2s},$$
and
$$m_* = \left\lceil \log_2\left[\rho_*^{-1} \left(\frac{\sqrt{\log\log n}}{n}\right)^{\frac{2s}{4s+1}}\right] \right\rceil.$$
Then our test statistic is taken to be the maximum of $T_{n,\varrho_n}$ for $\rho_n = \rho_*, 2\rho_*, 2^2\rho_*, \ldots, 2^{m_*}\rho_*$:
$$\widetilde{T}_n := \max_{0 \le k \le m_*} T_{n, 2^k \rho_*}.$$
It turns out that, if an appropriate rejection threshold is chosen, $\widetilde{T}_n$ can achieve a detection boundary very similar to the one we have before, but now simultaneously over all $\theta \ge 0$.
(ii) on the other hand, there exists a constant $c_1 > 0$ such that
$$\liminf_{n \to \infty} \inf_{P \in \cup_{\theta \ge 0} \mathcal{P}(\Delta_n(\theta), \theta)} P\left(\widetilde{T}_n \ge \sqrt{3 \log\log n}\right) = 1,$$
provided that $\Delta_n(\theta) \ge c_1 \left(n^{-1}\sqrt{\log\log n}\right)^{\frac{4s}{4s+\theta+1}}$.
Theorem 5 immediately suggests that a test that rejects $H_0$ if and only if $\widetilde{T}_n \ge \sqrt{3\log\log n}$ is consistent for testing it against $H_1(\Delta_n(\theta), \theta)$ for all $\theta \ge 0$, provided that $\Delta_n(\theta) \ge c_1 \left(n^{-1}\sqrt{\log\log n}\right)^{\frac{4s}{4s+\theta+1}}$. It is worth noting that the same detection boundary can be attained by replacing $T_{n,2^k\varrho_*}$ with the test statistic defined by (10) with $\rho_n = 2^k \varrho_*$ and $N \gtrsim 2^{-k/s} \varrho_*^{-1/s}$.
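The adaptive construction can be sketched as follows: compute the standardized truncated statistic on the dyadic grid $\rho_*, 2\rho_*, \ldots, 2^{m_*}\rho_*$ and take the maximum. The eigensystem, truncation level, and $s = 1$ below are illustrative, as before:

```python
import numpy as np

def adaptive_stat(x, lam, phi, s=1.0, n_terms=100):
    """Maximum over the dyadic grid rho_*, 2 rho_*, ..., 2^{m_*} rho_* of the
    standardized moderated-MMD statistics; rho_* and m_* follow Section 4."""
    n = len(x)
    base = np.sqrt(np.log(np.log(n))) / n
    rho_star = base ** (2 * s)
    m_star = int(np.ceil(np.log2(base ** (2 * s / (4 * s + 1)) / rho_star)))
    Phi = np.array([phi(k, x) for k in range(1, n_terms + 1)])
    best = -np.inf
    for j in range(m_star + 1):
        rho = rho_star * 2 ** j
        w = np.array([lam(k) / (lam(k) + rho ** 2) for k in range(1, n_terms + 1)])
        stat = (n * np.sum(w * Phi.mean(axis=1) ** 2) - np.mean(w @ Phi ** 2)) \
               / np.sqrt(2.0 * np.sum(w ** 2))
        best = max(best, stat)
    return best

lam = lambda k: (2 * np.pi * ((k + 1) // 2)) ** -2.0
def phi(k, x):
    j = (k + 1) // 2
    trig = np.cos if k % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * j * x)

rng = np.random.default_rng(4)
t_adapt = adaptive_stat(rng.uniform(size=500), lam, phi)
```

The maximum is then compared either to $\sqrt{3\log\log n}$ or to a Monte Carlo estimate of its null quantile, as discussed next.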
We can further calibrate the rejection region to yield a test at a given significance level.
More precisely, letting $\widetilde{q}_{1-\alpha}$ be the $1-\alpha$ quantile of $\widetilde{T}_n$ under $H_0$, we can proceed to reject $H_0$
whenever the observed test statistic exceeds $\widetilde{q}_{1-\alpha}$. Denote such a test by $\widetilde{T}_{\mathrm{M^3d}}$. By definition, $\widetilde{T}_{\mathrm{M^3d}}$ is an $\alpha$-level test. Theorem 5 implies that the type II error of $\widetilde{T}_{\mathrm{M^3d}}$ vanishes as $n \to \infty$ uniformly over all $\theta \ge 0$. In practice, the quantile $\widetilde{q}_{1-\alpha}$ can be evaluated by Monte Carlo methods, as we shall discuss in further detail in Section 6. We note that the detection boundary given in Theorem 5 is similar, but inferior by a factor of $(\log\log n)^{\frac{2s}{4s+\theta+1}}$, to that from Theorem 4. As our next result indicates, such an extra factor is indeed unavoidable and is the price one needs to pay for adaptation.
Theorem 6 Let 0 < θ1 < θ2 < 2s − 1. Then there exists a positive constant c2 , such that
if
$$\limsup_{n \to \infty} \sup_{\theta \in [\theta_1, \theta_2]} \left\{\Delta_n(\theta) \left(\frac{n}{\sqrt{\log\log n}}\right)^{\frac{4s}{4s+\theta+1}}\right\} \le c_2,
$$
then
$$\liminf_{n \to \infty} \inf_{T \in \mathcal{T}_n} \left[\mathbb{E}_{P_0} T + \sup_{\theta \in [\theta_1, \theta_2]} \beta(T; \Delta_n(\theta), \theta)\right] = 1.$$
Similar to Theorem 4, Theorem 6 shows that there is no consistent test for $H_0$ against $H_1(\Delta_n, \theta)$ simultaneously over all $\theta \in [\theta_1, \theta_2]$ if $\Delta_n(\theta) \le c_2 \left(n^{-1}\sqrt{\log\log n}\right)^{\frac{4s}{4s+\theta+1}}$ for all $\theta \in [\theta_1, \theta_2]$, for a sufficiently small $c_2$. Together with Theorem 5, this suggests that the test $\widetilde{T}_{\mathrm{M^3d}}$ is indeed rate optimal.
5. A Specific Example
To better illustrate the implications of our general results, it is instructive to consider a
more specific example where we are interested in testing uniformity on the unit interval
X = [0, 1] using a periodic spline kernel. Recall that the periodic Sobolev space of order s
is given by
$$W_0^s([0,1]) := \left\{u \in L_2([0,1]) : \sum_{m=0}^{s} \int_0^1 \left(u^{(m)}(x)\right)^2 dx < \infty, \ \int_0^1 u^{(m)}(x)\, dx = 0, \ \forall\, 0 \le m < s\right\}.$$
The corresponding reproducing kernel is
$$K(x, x') = \frac{(-1)^{s-1}}{(2s)!} B_{2s}([x - x']),$$
where $B_r$ is the Bernoulli polynomial of degree $r$ and $[t]$ is the fractional part of $t$. See, e.g., Wahba (1990) for further details.
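As a sanity check, for $s = 1$ one has $B_2(t) = t^2 - t + 1/6$, so $K(x, x') = B_2([x - x'])/2!$, which can be compared numerically against the Fourier (Mercer) expansion $\sum_{j \ge 1} 2(2\pi j)^{-2}\cos(2\pi j (x - x'))$; a minimal sketch:

```python
import numpy as np

def k_closed(x, xp):
    """Periodic Sobolev kernel of order s = 1: K = B_2([x - x']) / 2!,
    with B_2(t) = t^2 - t + 1/6 the Bernoulli polynomial of degree 2."""
    t = (x - xp) % 1.0  # fractional part [x - x']
    return (t ** 2 - t + 1.0 / 6.0) / 2.0

def k_series(x, xp, n_terms=20000):
    """Truncated Mercer/Fourier expansion: sum_j 2 (2 pi j)^{-2} cos(2 pi j (x - x'))."""
    j = np.arange(1, n_terms + 1)
    return np.sum(2.0 * (2 * np.pi * j) ** -2.0 * np.cos(2 * np.pi * j * (x - xp)))

diff = abs(k_closed(0.3, 0.1) - k_series(0.3, 0.1))  # should be tiny
```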
In this case, the interpolation spaces $\mathcal{F}(\theta, M)$ are closely related to Sobolev spaces of lower order. Recall that the eigenvalues and eigenfunctions of $K$ with respect to $L_2(P_0)$ are also known explicitly: $\lambda_{2j-1} = \lambda_{2j} = (2\pi j)^{-2s}$ for $j \in \mathbb{N}$, and
$$\varphi_k(x) = \begin{cases} \sqrt{2}\cos(2j\pi x), & k = 2j-1,\ j \in \mathbb{N}, \\ \sqrt{2}\sin(2j\pi x), & k = 2j,\ j \in \mathbb{N}. \end{cases}$$
It is well known that the optimal rate of detection is $n^{-4s/(4s+1)}$ when the $\chi^2$ distance is considered. See, e.g., Ingster (1993). On the other hand, since the alternative hypothesis is exactly $\mathcal{P}(\Delta_n, 0)$, such a detection boundary can also be derived from Theorems 3 and 4, and the proposed M$^3$d test achieves minimax optimality in this specific example.
6. Numerical Experiments
To complement the earlier theoretical development, we also performed several sets of simulation experiments to demonstrate the merits of the proposed adaptive test. As mentioned before, we shall consider the test based on $\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0)$ instead of $\eta_{\varrho_n}^2(\widehat{P}_n, P_0)$ for ease of practical implementation. With slight abuse of notation, we shall still refer to the adaptive test as a test based on moderated MMD and denote it by $\widetilde{T}_n$ for brevity.
Though the form of $\breve{\eta}_{\varrho_n,N}^2(\widehat{P}_n, P_0)$ looks similar to that of $\gamma^2(\widehat{P}_n, P_0)$, there is a subtle issue in computing it numerically. The kernel $\breve{K}_{\varrho_n,N}(x, x')$ is defined only through its Mercer decomposed form, which is based on the Mercer decomposition of $K(x, x')$. Hence, in order to compute the kernel $\breve{K}_{\varrho_n,N}(x, x')$, we need to first choose a kernel $K(x, x')$ and compute its Mercer decomposition numerically. Specifically, we used the chebfun framework in Matlab (with slight modifications) to compute Mercer decompositions associated with kernels based on their integral operator representations (Driscoll et al., 2014; Trefethen and Battles, 2004).
By construction, the procedure gave a 5%-level test, up to Monte Carlo error. For all
other tests that we compared with, we used the same approach to determine the rejection
threshold.
A similar idea of using resampling methods to decide the rejection threshold has been considered in Fromont et al. (2012) and Fromont et al. (2013), where bootstrap approaches were used to resample in a two-sample problem. To the best of our knowledge, it is proved there that the resulting single kernel tests are exactly of level $\alpha$. Whether we can give a theoretical justification of the resampling method applied to the adaptive test is of strong interest for our future work.
Euclidean data: Consider the one sample test where $P_0$ is the uniform distribution on $[0,1]^d$, with $d = 100$ and $200$. For the alternatives, we followed the example densities put forward in Marron and Wand (1992) in the context of nonparametric density estimation. Specifically, we set the alternative hypothesis to be (1) a mixture of five Gaussians, (2) skewed unimodal, (3) asymmetric claw density and (4) smooth comb density. The value of $\alpha$ was set to 0.05. The sample size $n$ varied from 200 to 1000 (in steps of 200), and for each value of the sample size 100 simulations were conducted to estimate the probability of rejecting $H_0$.
We chose the Gaussian kernel as the original kernel to construct the adaptive test, and for practical purposes its bandwidth was selected via the median heuristic. See, e.g., Gretton et al. (2012a). We now describe the specific tests that we experimented with; all other MMD based tests considered below use the Gaussian kernel as well.
2. MMD1 : Vanilla MMD with the median heuristic for selecting the bandwidth.
4. MMD3 : The method proposed in Sutherland et al. (2017) to select the kernel that
maximizes the power.
We remark that, for MMD2 and MMD3, we split the whole dataset into two parts: we used the training dataset to first select the bandwidth, and then used the testing samples to perform the actual hypothesis testing. See Sutherland et al. (2017) for more details.
We first conducted a type I error analysis. Results in Figure 1 suggest that all tests control the probability of type I error at approximately 0.05. We then investigated the performance of these tests under the alternative hypothesis. Figure 2 plots the estimated probability of accepting the null hypothesis under the alternatives mentioned above for different values of the sample size $n$. We note from Figure 2 that the estimated probability of type II error converges to zero at a faster rate for M$^3$d compared to the other tests in all the different simulation settings considered. Note that it has previously been observed in Gretton et al. (2012a) that the MMD test performs better than the Kolmogorov-Smirnov test in various settings, which we observe here as well.
Figure 1: Estimated probability of type I error versus sample size for different tests with
dimensionality 100 (left) and 200 (right), in the case of Euclidean data.
Directional data: One of the advantages of the proposed RKHS embedding based approach is that it can be used on domains other than $d$-dimensional Euclidean space. For example, when $\mathcal{X} = \mathbb{S}^{d-1}$, the unit sphere in $\mathbb{R}^d$, one can perform hypothesis testing using the above framework, as long as we can compute the Mercer decomposition of a kernel $K$ defined on the domain. In several applications, such as protein folding, data are oftentimes modeled as coming from the unit sphere, and testing goodness-of-fit for such data requires specialized methods different from standard nonparametric testing methods (Mardia and Jupp, 2009; Jupp, 2005).
In order to highlight the advantage of the proposed approach, we assumed P0 to be the
uniform distribution on the unit sphere of dimension d = 100 and 150 respectively. For
each d, we tested P0 against the alternative that data were from:
(1) the multivariate von Mises-Fisher distribution (the Gaussian analogue on the unit sphere), given by $f_{\mathrm{vM\text{-}F}}(x; \mu, \kappa) = C_{\mathrm{vM\text{-}F}}(\kappa) \exp(\kappa \mu^\top x)$ for $x \in \mathbb{S}^{d-1}$, where $\kappa \ge 0$ is the concentration parameter and $\mu$ is the mean parameter. The normalization constant is $C_{\mathrm{vM\text{-}F}}(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)}$, where $I_{d/2-1}$ is the modified Bessel function of the first kind;
(2) the multivariate Watson distribution (used to model axially symmetric data on the sphere), given by $f_W(x; \mu, \kappa) = C_W(\kappa) \exp(\kappa (\mu^\top x)^2)$ for $x \in \mathbb{S}^{d-1}$, where $\kappa \ge 0$ is the concentration parameter and $\mu$ is the mean parameter as before. The normalization constant is $C_W(\kappa) = \frac{\Gamma(d/2)}{2\pi^{d/2} M(1/2, d/2, \kappa)}$, where $M$ is Kummer's confluent hypergeometric function;
(3) mixture of five von Mises-Fisher distributions, which are used in modeling and clustering spherical data (Banerjee et al., 2005);

(4) mixture of five Watson distributions, which are used in modeling and clustering spherical data (Sra and Karp, 2013).
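The normalization constants above can be evaluated with nothing beyond the standard library; a minimal sketch for the von Mises-Fisher density (the function names and the series truncation for the Bessel function are our own choices, not from the paper):

```python
import math

def bessel_i(nu, x, terms=80):
    # modified Bessel function of the first kind, I_nu(x), via its power series
    return sum((x / 2.0) ** (2 * m + nu) / (math.factorial(m) * math.gamma(m + nu + 1.0))
               for m in range(terms))

def vmf_log_const(d, kappa):
    # log C_vMF(kappa) = log[ kappa^{d/2-1} / ((2*pi)^{d/2} * I_{d/2-1}(kappa)) ]
    nu = d / 2.0 - 1.0
    return (nu * math.log(kappa) - (d / 2.0) * math.log(2.0 * math.pi)
            - math.log(bessel_i(nu, kappa)))

def vmf_log_density(x, mu, kappa):
    # log f_vMF(x; mu, kappa) = log C_vMF(kappa) + kappa * <mu, x>, for x on the unit sphere
    return vmf_log_const(len(mu), kappa) + kappa * sum(m * xi for m, xi in zip(mu, x))
```

For $d = 3$ the constant reduces to the closed form $\kappa/(4\pi\sinh\kappa)$, which gives a quick sanity check of the implementation.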
Figure 2: Estimated probability of type II error versus sample size for different tests (M3d, MMD3, MMD2, MMD1, KS): mixture of Gaussians (row 1), skewed unimodal (row 2), asymmetric claw (row 3) and smooth comb (row 4) with dimensionality 100 (left) and 200 (right), in the case of Euclidean data.
Besides the adaptive M3d test and all the other MMD based tests involved in the first simulation study, we also considered the Sobolev test (denoted ST hereafter) proposed in Jupp (2005). The original kernel of M3d and the kernel of MMD1 were again chosen as the Gaussian kernel with bandwidth selected via the median heuristic. The other MMD based tests also used the Gaussian kernel. Note that in this setup one can analytically compute the Mercer decomposition of the Gaussian kernel on the unit sphere with respect to the uniform distribution. Specifically, the eigenvalues are given by Theorem 2 in Minh et al. (2006) and the eigenfunctions are the standard spherical harmonics of order k (see Section 2.1 in Minh et al. (2006) for details). The rest of the simulation setup is similar to the previous setting (of Euclidean data).
The results of the type I error analysis are reported in Figure 3, indicating that all tests are approximately α-level tests. Figure 4 plots the estimated probability of type II error for different sample sizes, from which we see that the adaptive M3d test performs better.
Figure 3: Estimated probability of type I error versus sample size for different tests with dimensionality 100 (left) and 150 (right), in the case of directional data.
Figure 4: Estimated probability of type II error versus sample size for different tests (M3d, MMD3, MMD2, MMD1, ST): von Mises-Fisher distribution (row 1), Watson distribution (row 2), mixture of von Mises-Fisher distributions (row 3) and mixture of Watson distributions (row 4) with dimensionality 100 (left) and 150 (right), in the case of directional data.
are essentially treating P0 with the estimated parameters as being given to us by an oracle and carrying out goodness-of-fit testing. We take this approach because our purpose in this section is mainly to compare the adaptive M3d test against the other tests for a simple P0, not to address the issue of testing a composite null hypothesis. Note that a similar approach was also taken in Kellner and Celisse (2015) for kernel-based testing of Gaussianity.
For the case of Euclidean data, we used the MNIST digits data set from the following webpage: http://yann.lecun.com/exdb/mnist/. Model-based clustering (Fraley and Raftery, 2002) is a widely used and practically successful clustering technique in the literature. Furthermore, the MNIST data set is a standard benchmark for clustering algorithms and consists of images of digits. Several works have implicitly assumed that the data come from a mixture of Gaussian distributions, because of the superior empirical performance observed under such an assumption. But the validity of such a mixture model assumption is invariably not tested statistically. In this experiment we selected three digits (each corresponding to a cluster) at random and, conditioned on the selected digit (cluster), tested whether the data come from a Gaussian distribution (that is, P0 is Gaussian). For our experiments, we downsampled the images and used the pixels as feature vectors of dimensionality 64, as is commonly done in the literature. Table 2 reports the probability with which the null hypothesis is accepted, estimated over 100 trials. The observed results reiterate, in a statistically significant way, that a mixture of Gaussians is a reasonable assumption in this case.
For the case of directional data, we used the Human Fibroblasts dataset from Iyer et al. (1999); Dhillon et al. (2003) and the Yeast Cell Cycle dataset from Spellman et al. (1998). The Fibroblast dataset contains 12-dimensional expression data for 517 samples (genes), recording the response of human fibroblasts following the addition of serum to the growth media. We refer to Iyer et al. (1999) for more details about the scientific procedure by which these data were obtained. The Yeast Cell Cycle dataset consists of 82-dimensional data corresponding to 696 subjects. Previous data analysis studies (Sra and Karp, 2013; Dhillon et al., 2003) have used mixtures of spherical distributions for clustering these data sets. Specifically, it was observed in Sra and Karp (2013) that clustering using a mixture of Watson distributions has superior performance. While that has proved scientifically useful, whether such an assumption is valid was not statistically tested. Here, we conducted a goodness-of-fit test of the Watson distribution (that is, P0 is a Watson distribution) for the largest cluster from each of the above data sets. Table 3 shows the estimated probability of acceptance of the null hypothesis, computed over 100 random trials. The observed results provide a statistical justification for the use of the Watson distribution in modeling the above datasets.
One thing to note is that we do not aim to argue that our proposed test outperforms the other considered tests in the above real-world experiments. The information conveyed by the results lies rather in the following two aspects. First, if we assume that H0 is true, then our proposed test is valid in the sense that the estimated probability of Type I error is controlled around 5%, although two approximations are involved in the whole testing procedure: one is using one half of the data to estimate P0, and the other is using Monte Carlo simulations to determine the rejection region. Second, that H0 is accepted with high probability by our proposed test and the other tests does provide certain evidence that
Sample size      300    400    500  |  300    400    500  |  300    400    500
KS              0.91   0.94   0.96  | 0.91   0.91   0.95  | 0.91   0.92   0.95
MMD1            0.92   0.95   0.96  | 0.91   0.92   0.95  | 0.90   0.92   0.94
MMD2            0.93   0.95   0.96  | 0.92   0.93   0.95  | 0.93   0.94   0.96
MMD3            0.92   0.95   0.97  | 0.90   0.94   0.97  | 0.91   0.92   0.96
M3d             0.94   0.96   0.98  | 0.93   0.95   0.98  | 0.93   0.95   0.98
N                 32     36     40  |   32     36     40  |   32     36     40

Table 2: Estimated probability with which the corresponding test accepts the null hypothesis. The level of the test is α = 0.05. Digit 4 on the left, Digit 6 in the middle and Digit 7 on the right, for various values of the sample size. N refers to the number of eigenvalues/eigenfunctions used in the kernel approximation.
Table 3: Estimated probability with which the corresponding test accepts the null hypothesis. The level of the test is α = 0.05. Human Fibroblasts dataset on the left and Yeast Cell Cycle dataset on the right. N refers to the number of eigenvalues/eigenfunctions used in the kernel approximation.
H0 is true, considering that all these tests tend to have small probabilities of accepting H0
under the alternative hypothesis.
In this sense, these real-world experiments serve mainly as an initial demonstration of the validity of the proposed test; we leave finding examples that further demonstrate its optimality to future work.
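The Monte Carlo calibration of the rejection region mentioned above can be sketched in a few lines (a stdlib-only illustration; the function names and the toy statistic are our own, not the statistic from the paper):

```python
import math
import random

def mc_threshold(stat_fn, sample_null, n, alpha=0.05, reps=999, seed=0):
    # Empirical (1 - alpha)-quantile of the test statistic under H0,
    # estimated by simulating `reps` datasets of size n from P0.
    rng = random.Random(seed)
    sims = sorted(stat_fn(sample_null(n, rng)) for _ in range(reps))
    return sims[int((1.0 - alpha) * (reps - 1))]

def sample_uniform(n, rng):
    # toy null distribution P0: uniform on [0, 1]
    return [rng.random() for _ in range(n)]

def toy_stat(xs):
    # toy statistic: sqrt(n) * |sample mean - 1/2|; large values signal departure from P0
    n = len(xs)
    return math.sqrt(n) * abs(sum(xs) / n - 0.5)
```

Rejecting when the observed statistic exceeds the returned threshold gives an approximately α-level test, up to Monte Carlo error.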
7. Concluding Remarks
In this paper, we investigated the performance of kernel embedding based approaches to
goodness-of-fit testing from a minimax perspective. When considering χ2 distance as the
separation metric, we showed that while the vanilla MMD tests could be suboptimal when
testing against an alternative coming from the RKHS identified with the kernel, a simple
moderation leads to tests that are not only optimal when the alternative resides in the RKHS
but also adaptive with respect to its deviation from the RKHS under suitable distances. Our
analysis provides new insights into the operating characteristics of, as well as a benchmark for evaluating, the popular kernel based approach to hypothesis testing. Our work also points to a number of interesting directions that warrant further investigation.
Our work highlighted the importance of moderation in kernel based testing, akin to
regularization in kernel based supervised learning. The specific type of moderation we
considered here is defined in terms of eigenfunctions of the kernel. In some settings this is
convenient to do as we illustrated in Section 5. In other settings, however, eigenfunctions of
a kernel may not be trivial to compute. It is therefore of interest to investigate alternative
ways of moderating the MMD. For example, one intriguing possibility is to apply appropriate
moderation to the so-called random Fourier features (see, e.g., Chwialkowski et al., 2015;
Jitkrittum et al., 2016; Bach, 2017) instead of eigenfunctions.
Another practical issue is how to devise computationally more efficient goodness-of-fit tests for dealing with large-scale datasets. For example, it would be of interest to investigate whether
the ideas of linear-time approximation (Gretton et al., 2012a) or block test (Zaremba et al.,
2013) can be applied to yield minimax optimal and adaptive goodness of fit tests.
At a more technical level, our analysis has focused on detection boundaries under χ2
distance. There are other commonly used distances suitable for goodness-of-fit tests such
as KL-divergence, or total variation. It would be interesting to examine to what extent the phenomena we observed here occur under these alternative distance measures.
As in any kernel based approaches, the choice of the kernel plays a crucial role. In
the current work we have assumed implicitly that a suitable kernel is selected. How to
choose an appropriate kernel is a critical problem in practice. It would be of great interest
to investigate to what extent the ideas from Sriperumbudur et al. (2009); Gretton et al.
(2012b); Sutherland et al. (2017) can be adapted in the current setting.
Throughout the numerical experiments in our paper, we have considered a truncated version of the moderated kernel. Both $\varrho_n^2$ and the truncation level $N$ play the role of regularization, and we can show that as long as $N$ increases at a proper rate with $n$, even if $\varrho_n$ in $\breve K_{\varrho_n,N}$ is set to 0, the corresponding test can still achieve the optimal rate. Note that in this scenario, $\breve K_{0,N}$ becomes the so-called projection kernel and the MMD$^2$ associated with $\breve K_{0,N}$ is exactly the projection of $\chi^2(P,P_0)$ onto the subspace spanned by $\{\varphi_k\}_{1\le k\le N}$.
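As a concrete illustration of this truncated moderation, consider a stylized setting with eigenvalues $\lambda_k = k^{-2s}$ and a cosine basis that is orthonormal and mean-zero under the uniform $P_0$ on $[0,1]$ (this instantiation and all names below are our own, chosen only to make the statistic computable):

```python
import math

def phi(k, x):
    # cosine basis: orthonormal in L2(P0) for P0 = Uniform[0, 1], with E_{P0} phi_k = 0
    return math.sqrt(2.0) * math.cos(math.pi * k * x)

def moderated_stat(xs, rho, N, s=1.0):
    # n * eta^2 for the truncated moderated kernel sum_{k<=N} lambda_k/(lambda_k+rho^2) phi_k phi_k
    n = len(xs)
    total = 0.0
    for k in range(1, N + 1):
        lam = float(k) ** (-2.0 * s)
        mean_k = sum(phi(k, x) for x in xs) / n
        total += lam / (lam + rho ** 2) * mean_k ** 2
    return n * total

def projection_stat(xs, N):
    # the rho = 0 limit: the projection-kernel statistic on span{phi_1, ..., phi_N}
    n = len(xs)
    return n * sum((sum(phi(k, x) for x in xs) / n) ** 2 for k in range(1, N + 1))
```

Setting rho = 0 recovers the projection-kernel statistic exactly, while rho > 0 shrinks each coordinate by the factor $\lambda_k/(\lambda_k+\varrho^2)$.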
8. Proofs
We now present the proofs of the main results.
Proof [Proof of Theorem 1]
Part (i). The proof of the first part consists of two key steps. First, we show that the population counterpart $n\gamma^2(P,P_0)$ of the test statistic converges to $\infty$ uniformly, i.e.,
$$\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,0)}n\gamma^2(P,P_0) = \infty.$$
Then
$$\sum_{k\ge1}\lambda_k^{-1}a_k^2 = \|u\|_K^2, \qquad\text{and}\qquad \sum_{k\ge1}a_k^2 = \|u\|^2_{L^2(P_0)} = \chi^2(P,P_0).$$
Thus,
$$P\big\{n\gamma^2(\hat P_n,P_0) < q_{w,1-\alpha}\big\} \le P\Bigg\{\sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2} - \sqrt{n\sum_{k\ge1}\lambda_k\Big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\Big]^2} < \sqrt{q_{w,1-\alpha}}\Bigg\} = P\Bigg\{\sqrt{n\sum_{k\ge1}\lambda_k\Big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\Big]^2} > \sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2} - \sqrt{q_{w,1-\alpha}}\Bigg\}.$$
23
Balasubramanian, Li and Yuan
Hence we obtain
$$\lim_{n\to\infty}\beta(T_{\rm MMD};\Delta_n,0) = \lim_{n\to\infty}\sup_{P\in\mathcal P(\Delta_n,0)}P\big\{n\gamma^2(\hat P_n,P_0) < q_{w,1-\alpha}\big\} \le \lim_{n\to\infty}\frac{\sup_{P\in\mathcal P(\Delta_n,0)}E_P\Big\{n\sum_{k\ge1}\lambda_k\big[\frac1n\sum_{i=1}^n\varphi_k(X_i)-E_P\varphi_k(X)\big]^2\Big\}}{\inf_{P\in\mathcal P(\Delta_n,0)}\Big(\sqrt{n\sum_{k\ge1}\lambda_k[E_P\varphi_k(X)]^2}-\sqrt{q_{w,1-\alpha}}\Big)^2} = 0.$$
Part (ii). In proving the second part, we will make use of the following lemma, which can be obtained by adapting the argument in Gregory (1977). It gives the limiting distribution of the V-statistic under $P_n$ such that $P_n$ converges to $P_0$ at the order $n^{-1/2}$.

where $X_1,\ldots,X_n\overset{\text{i.i.d.}}{\sim}P_n$, and the $Z_k$'s are independent standard normal random variables.
where $C_1$ is a positive constant and $k_n = \lfloor C_2n^{1/(4s)}\rfloor$ for some positive constant $C_2$; both $C_1$ and $C_2$ will be determined later. Since $\sup_{k\ge1}\|\varphi_k\|_\infty<\infty$ and $\lim_{k\to\infty}\lambda_k=0$, there exists $N_0>0$ such that the $P_n$'s are well-defined probability measures for any $n\ge N_0$.
Note that
$$\|u_n\|_K^2 = \frac{C_1^2}{L^2(k_n)} \le L^{-2}C_1^2$$
and
$$\|u_n\|^2_{L^2(P_0)} = \frac{C_1^2\lambda_{k_n}}{L^2(k_n)} = \frac{C_1^2}{L(k_n)}k_n^{-2s} \ge L^{-1}C_1^2k_n^{-2s} \sim L^{-1}C_1^2C_2^{-2s}n^{-1/2},$$
where $A_n\sim B_n$ means that $\lim_{n\to\infty}A_n/B_n=1$. Thus, by choosing $C_1$ sufficiently small and $c_0=\frac12L^{-1}C_1^2C_2^{-2s}$, we ensure that $P_n\in\mathcal P(c_0n^{-1/2},0)$ for sufficiently large $n$.

To apply Lemma 7, we note that
$$\lim_{n\to\infty}\|u_n\|^2_{L^2(P_0)} = \lim_{n\to\infty}\frac{C_1^2\lambda_{k_n}}{L^2(k_n)} = 0.$$
Proof [Proof of Theorem 2] Let $\tilde K_n(\cdot,\cdot) := \tilde K_{\varrho_n}(\cdot,\cdot)$. Note that
$$v_n^{-1/2}\big[n\eta^2_{\varrho_n}(\hat P_n,P_0) - A_n\big] = 2(n^2v_n)^{-1/2}\sum_{j=2}^n\sum_{i=1}^{j-1}\tilde K_n(X_i,X_j).$$
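This identity is just the split of the V-statistic double sum into its diagonal part $A_n$ (under the natural reading $A_n = \frac1n\sum_i\tilde K_n(X_i,X_i)$) and twice the upper-triangular sum; a quick numerical check with a toy kernel (the kernel and function names are our own illustrative choices, not the paper's):

```python
import math

def ktilde(x, y):
    # toy stand-in for the moderated kernel K~_n; any symmetric kernel works here
    return 2.0 * math.cos(math.pi * x) * math.cos(math.pi * y)

def vstat_split(xs):
    # full V-statistic (1/n) sum_{i,j}, its diagonal part, and twice the upper triangle
    n = len(xs)
    full = sum(ktilde(a, b) for a in xs for b in xs) / n
    diag = sum(ktilde(a, a) for a in xs) / n
    offdiag = 2.0 * sum(ktilde(xs[i], xs[j]) for j in range(1, n) for i in range(j)) / n
    return full, diag, offdiag
```

The three returned quantities satisfy full = diag + offdiag, which is the decomposition used in the display above.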
Let $\zeta_{nj} = \sum_{i=1}^{j-1}\tilde K_n(X_i,X_j)$ and consider the filtration $\{\mathcal F_j: j\ge1\}$ where $\mathcal F_j = \sigma\{X_i: 1\le i\le j\}$. Due to the assumption that $K$ is degenerate, we have $E\varphi_k(X)=0$ for any $k\ge1$, which implies that
$$E(\zeta_{nj}\,|\,\mathcal F_{j-1}) = \sum_{i=1}^{j-1}E[\tilde K_n(X_i,X_j)\,|\,\mathcal F_{j-1}] = \sum_{i=1}^{j-1}E[\tilde K_n(X_i,X_j)\,|\,X_i] = 0$$
for any $j\ge2$. Write
$$U_{nm} = \begin{cases} 0, & m=1,\\ \sum_{j=2}^m\zeta_{nj}, & m\ge2.\end{cases}$$
Then for any fixed $n$, $\{U_{nm}\}_{m\ge1}$ is a martingale with respect to $\{\mathcal F_m: m\ge1\}$ and
We now apply the martingale central limit theorem to $U_{nn}$. Following the argument from Hall (1984), it can be shown that
$$\Big[\frac12n^2E\tilde K_n^2(X,X')\Big]^{-1/2}U_{nn} \xrightarrow{\,d\,} N(0,1), \tag{14}$$
provided that
$$\big[EG_n^2(X,X') + n^{-1}E\tilde K_n^2(X,X')\tilde K_n^2(X,X'') + n^{-2}E\tilde K_n^4(X,X')\big]\big/\big[E\tilde K_n^2(X,X')\big]^2 \to 0, \tag{15}$$
where the last step holds by considering that $\lambda_k\asymp k^{-2s}$. Hereafter, we shall write $a_n\asymp b_n$ if $0<\varliminf_{n\to\infty}a_n/b_n\le\varlimsup_{n\to\infty}a_n/b_n<\infty$ for two positive sequences $\{a_n\}$ and $\{b_n\}$. Similarly,
$$EG_n^2(X,X') = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^4 \le |\{k:\lambda_k\ge\varrho_n^2\}| + \varrho_n^{-8}\sum_{\lambda_k<\varrho_n^2}\lambda_k^4 \asymp \varrho_n^{-1/s},$$
and
$$E\tilde K_n^2(X,X')\tilde K_n^2(X,X'') = E\bigg\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\varphi_k^2(X)\bigg\}^2 \le \Big(\sup_{k\ge1}\|\varphi_k\|_\infty\Big)^4\bigg\{\sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2\bigg\}^2 \asymp \varrho_n^{-2/s}.$$
Consequently,
$$n^{-1}E\tilde K_n^2(X,X')\tilde K_n^2(X,X'')\big/\big[E\tilde K_n^2(X,X')\big]^2 \le C_3n^{-1} \to 0. \tag{17}$$
Moreover,
$$\|\tilde K_n\|_\infty = \sup_x\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\varphi_k^2(x) \le \Big(\sup_{k\ge1}\|\varphi_k\|_\infty\Big)^2\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2} \asymp \varrho_n^{-1/s},$$
so that
$$n^{-2}E\tilde K_n^4(X,X')\big/\big[E\tilde K_n^2(X,X')\big]^2 \le n^{-2}\|\tilde K_n\|_\infty^2\big/E\tilde K_n^2(X,X') \le C_4(n^2\varrho_n^{1/s})^{-1} \to 0 \tag{18}$$
as $n\to\infty$. Together, (16), (17) and (18) ensure that condition (15) holds.
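The rate $\varrho_n^{-1/s}$ appearing in these bounds can be sanity-checked directly from $\lambda_k\asymp k^{-2s}$; a sketch of the two pieces of the bound on $EG_n^2(X,X')$:
$$|\{k:\lambda_k\ge\varrho_n^2\}| \asymp |\{k:k\le\varrho_n^{-1/s}\}| \asymp \varrho_n^{-1/s}, \qquad \varrho_n^{-8}\sum_{k>\varrho_n^{-1/s}}k^{-8s} \asymp \varrho_n^{-8}\cdot\varrho_n^{(8s-1)/s} = \varrho_n^{-1/s},$$
using $\sum_{k>K}k^{-8s}\asymp K^{1-8s}$ for $8s>1$.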
Obviously, $E_PV_1V_2 = 0$. We first argue that the following three statements together imply the desired result:

This immediately suggests that $T_{\rm M3d}$ is consistent. We now show that (19)-(21) indeed hold.
Verifying (19). We begin with (19). Since $v_n\asymp\varrho_n^{-1/s}$ and $V_3=(n-1)\eta^2_{\varrho_n}(P,P_0)$, (19) is equivalent to
$$\lim_{n\to\infty}\inf_{P\in\mathcal P(\Delta_n,\theta)}n\varrho_n^{\frac1{2s}}\eta^2_{\varrho_n}(P,P_0) = \infty.$$
(1) First consider the case $\theta=0$, where
$$\eta^2_{\varrho_n}(P,P_0) \ge \|u\|^2_{L^2(P_0)} - \varrho_n^2M^2.$$
Take $\varrho_n\le\sqrt{\Delta_n/(2M^2)}$ so that $\varrho_n^2M^2\le\frac12\Delta_n$. Then we have
$$\inf_{P\in\mathcal P(\Delta_n,0)}\eta^2_{\varrho_n}(P,P_0) \ge \frac12\inf_{P\in\mathcal P(\Delta_n,0)}\|u\|^2_{L^2(P_0)} = \frac12\Delta_n.$$
(2) Now consider the case when $\theta>0$. For $P\in\mathcal P(\Delta_n,\theta)$, $\forall\,R>0$, $\exists\,f_R\in\mathcal H(K)$ such that $\|u-f_R\|_{L^2(P_0)}\le MR^{-1/\theta}$ and $\|f_R\|_K\le R$. Let $b_k = \langle f_R,\varphi_k\rangle_{L^2(P_0)}$. Then
$$\eta^2_{\varrho_n}(P,P_0) = \sum_{k\ge1}a_k^2 - \sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}a_k^2 \ge \|u\|^2_{L^2(P_0)} - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}(a_k-b_k)^2 - 2\sum_{k\ge1}\frac{\varrho_n^2}{\lambda_k+\varrho_n^2}b_k^2 \ge \|u\|^2_{L^2(P_0)} - 2\sum_{k\ge1}(a_k-b_k)^2 - 2\varrho_n^2\sum_{k\ge1}\frac{b_k^2}{\lambda_k}.$$
In both cases, with $\varrho_n\le C\Delta_n^{\frac{\theta+1}2}$ for a sufficiently small $C=C(M)>0$, $\lim_{n\to\infty}\varrho_n^{\frac1{2s}}n\Delta_n=\infty$ suffices to ensure that (19) holds. Under the condition that $\lim_{n\to\infty}\Delta_nn^{\frac{4s}{4s+\theta+1}}=\infty$,
$$\varrho_n = cn^{-\frac{2s(\theta+1)}{4s+\theta+1}} \le C\Delta_n^{\frac{\theta+1}2}$$
for sufficiently large $n$, and $\lim_{n\to\infty}\varrho_n^{\frac1{2s}}n\Delta_n=\infty$ holds as well.
Verifying (20). Write
$$V_1 = \frac1n\sum_{\substack{1\le i,j\le n\\ i\ne j}}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[\varphi_k(X_i)-E_P\varphi_k(X)][\varphi_k(X_j)-E_P\varphi_k(X)] := \frac1n\sum_{\substack{1\le i,j\le n\\ i\ne j}}F_n(X_i,X_j).$$
Then
$$E_PV_1^2 = \frac1{n^2}\sum_{\substack{i\ne j\\ i'\ne j'}}E_PF_n(X_i,X_j)F_n(X_{i'},X_{j'}) = \frac{2n(n-1)}{n^2}E_PF_n^2(X,X') \le 2E_PF_n^2(X,X'),$$
where $X,X'\overset{\text{i.i.d.}}{\sim}P$. Recall that, for any two random variables $Y_1,Y_2$ such that $EY_1^2<\infty$, $E[Y_1-E(Y_1|Y_2)]^2\le EY_1^2$. Since
$$F_n(X,X') = \tilde K_n(X,X') - E_P[\tilde K_n(X,X')|X] - E_P[\tilde K_n(X,X')|X'] + E_P\tilde K_n(X,X') = \tilde K_n(X,X') - E_P[\tilde K_n(X,X')|X] - E\Big\{\tilde K_n(X,X') - E_P[\tilde K_n(X,X')|X]\,\Big|\,X'\Big\},$$
we have
$$E_PF_n^2(X,X') \le E_P\tilde K_n^2(X,X'). \tag{22}$$
For any $g\in L^2(P_0)$ and positive definite kernel $G(\cdot,\cdot)$ such that $E_{P_0}G^2(X,X')<\infty$, let
$$\|g\|_G := \sqrt{E_{P_0}[g(X)g(X')G(X,X')]}.$$
By the positive definiteness of $G(\cdot,\cdot)$, the triangle inequality holds for $\|\cdot\|_G$, i.e., for any $g_1,g_2\in L^2(P_0)$,
We now appeal to the following lemma to bound the right hand side of (22):

By Lemma 8, we get
$$E_{P_0}[u(X)u(X')\tilde K_n^2(X,X')] \le C_5\|u\|^2_{L^2(P_0)}\sum_k\frac{\lambda_k}{\lambda_k+\varrho_n^2} \asymp \varrho_n^{-1/s}\|u\|^2_{L^2(P_0)}.$$
Recall that
$$E_{P_0}\tilde K_n^2(X,X') = \sum_k\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2 \asymp \varrho_n^{-1/s}.$$
Thus,
$$E_P\tilde K_n^2(X,X') \le 2\big\{E_{P_0}\tilde K_n^2(X,X') + E_{P_0}[u(X)u(X')\tilde K_n^2(X,X')]\big\} \le C_6\,\varrho_n^{-1/s}\big[1+\|u\|^2_{L^2(P_0)}\big].$$
On the other hand, as already shown in the part verifying (19), $\varrho_n\asymp\Delta_n^{\frac{\theta+1}2}$ suffices to ensure that for sufficiently large $n$,
$$\frac14\|u\|^2_{L^2(P_0)} \le \eta^2_{\varrho_n}(P,P_0) \le \|u\|^2_{L^2(P_0)}, \qquad \forall\,P\in\mathcal P(\Delta_n,\theta).$$
Thus, provided that $\lim_{n\to\infty}n^{\frac{4s}{4s+\theta+1}}\Delta_n=\infty$, this immediately implies (20).
Verifying (21). Note that
$$E_PV_2^2 \le 4nE_P\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)][\varphi_k(X)-E_P\varphi_k(X)]\bigg\}^2 \le 4nE_P\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2 = 4nE_{P_0}[1+u(X)]\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2.$$
It is clear that
$$E_{P_0}\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2 = \sum_{k,k'\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\cdot\frac{\lambda_{k'}}{\lambda_{k'}+\varrho_n^2}\,E_P\varphi_k(X)\,E_P\varphi_{k'}(X)\,E_{P_0}[\varphi_k(X)\varphi_{k'}(X)] = \sum_{k\ge1}\Big(\frac{\lambda_k}{\lambda_k+\varrho_n^2}\Big)^2[E_P\varphi_k(X)]^2 \le \eta^2_{\varrho_n}(P,P_0).$$
Moreover,
$$E_{P_0}\,u(X)\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2 \le \sqrt{E_{P_0}\,u^2(X)\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2}\times\sqrt{E_{P_0}\bigg\{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(X)\bigg\}^2}$$
$$\le \|u\|_{L^2(P_0)}\sup_x\bigg|\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]\varphi_k(x)\bigg|\cdot\eta_{\varrho_n}(P,P_0)$$
$$\le \sup_k\|\varphi_k\|_\infty\,\|u\|_{L^2(P_0)}\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}\big|E_P\varphi_k(X)\big|\cdot\eta_{\varrho_n}(P,P_0)$$
$$\le \sup_k\|\varphi_k\|_\infty\,\|u\|_{L^2(P_0)}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}}\sqrt{\sum_{k\ge1}\frac{\lambda_k}{\lambda_k+\varrho_n^2}[E_P\varphi_k(X)]^2}\cdot\eta_{\varrho_n}(P,P_0)$$
$$\le C_7\,\|u\|_{L^2(P_0)}\,\varrho_n^{-\frac1{2s}}\,\eta^2_{\varrho_n}(P,P_0).$$
under the assumption that $\lim_{n\to\infty}n^{\frac{4s}{4s+\theta+1}}\Delta_n = \infty$.
Proof [Proof of Theorem 4] The overall architecture is now standard in establishing minimax lower bounds for nonparametric hypothesis testing. The main idea is to carefully construct a set of points under the alternative hypothesis and argue that a mixture of these alternatives cannot be reliably distinguished from the null. See, e.g., Ingster (1993); Ingster and Suslina (2003); Tsybakov (2008). Without loss of generality, assume $M=1$ and $\Delta_n = cn^{-\frac{4s}{4s+\theta+1}}$ for some $c>0$.
Let us consider the cases of $\theta=0$ and $\theta>0$ separately.

The case of $\theta=0$. We first treat the case when $\theta=0$. Let $B_n = \lfloor C_8\Delta_n^{-1/(2s)}\rfloor$ for a sufficiently small constant $C_8>0$ and $a_n = \sqrt{\Delta_n/B_n}$. For any $\xi_n := (\xi_{n1},\xi_{n2},\cdots,\xi_{nB_n})^\top\in\{\pm1\}^{B_n}$, write
$$u_{n,\xi_n} = a_n\sum_{k=1}^{B_n}\xi_{nk}\varphi_k.$$
It is clear that
$$\|u_{n,\xi_n}\|^2_{L^2(P_0)} = B_na_n^2 = \Delta_n$$
and
$$\|u_{n,\xi_n}\|_\infty \le a_nB_n\sup_k\|\varphi_k\|_\infty \asymp \Delta_n^{\frac{2s-1}{4s}} \to 0.$$
Moreover,
$$\|u_{n,\xi_n}\|_K^2 = a_n^2\sum_{k=1}^{B_n}\lambda_k^{-1} \le 1.$$
Therefore, there exists a probability measure $P_{n,\xi_n}\in\mathcal P(\Delta_n,0)$ such that $dP_{n,\xi_n}/dP_0 = 1+u_{n,\xi_n}$. Following a standard argument for minimax lower bounds, it suffices to show that
$$\varlimsup_{n\to\infty}\,E_{P_0}\bigg\{\frac1{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\bigg\}^2 < \infty. \tag{23}$$
Note that
$$E_{P_0}\bigg\{\frac1{2^{B_n}}\sum_{\xi_n\in\{\pm1\}^{B_n}}\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\bigg\}^2 = E_{P_0}\bigg\{\frac1{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\prod_{i=1}^n[1+u_{n,\xi_n}(X_i)]\prod_{i=1}^n[1+u_{n,\xi_n'}(X_i)]\bigg\}$$
$$= \frac1{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\prod_{i=1}^nE_{P_0}\{[1+u_{n,\xi_n}(X_i)][1+u_{n,\xi_n'}(X_i)]\} = \frac1{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\bigg(1+a_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi_{nk}'\bigg)^{\!n}$$
$$\le \frac1{2^{2B_n}}\sum_{\xi_n,\xi_n'\in\{\pm1\}^{B_n}}\exp\bigg(na_n^2\sum_{k=1}^{B_n}\xi_{nk}\xi_{nk}'\bigg) = \bigg[\frac{\exp(na_n^2)+\exp(-na_n^2)}{2}\bigg]^{B_n} \le \exp\Big(\frac12B_nn^2a_n^4\Big).$$
See, e.g., Baraud (2002). With the particular choice of $B_n$, $a_n$, and the conditions on $\Delta_n$, this immediately implies (23).
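The exponent arithmetic behind the last step can be checked directly: since $B_n\asymp\Delta_n^{-1/(2s)}$ and $a_n^2=\Delta_n/B_n$,
$$B_nn^2a_n^4 = \frac{n^2\Delta_n^2}{B_n} \asymp n^2\Delta_n^{2+\frac1{2s}} = n^2\Delta_n^{\frac{4s+1}{2s}} \asymp n^2\big(n^{-\frac{4s}{4s+1}}\big)^{\frac{4s+1}{2s}} = n^2\cdot n^{-2} = O(1),$$
so $\exp(B_nn^2a_n^4/2)$ stays bounded, which is exactly what (23) requires (here $\theta=0$, so $\Delta_n\asymp n^{-4s/(4s+1)}$).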
The case of $\theta>0$. The main idea is similar to before. To find a set of probability measures in $\mathcal P(\Delta_n,\theta)$, we appeal to the following lemma.

Lemma 9 Let $u = \sum_k a_k\varphi_k$. If
$$\sup_{B\ge1}\bigg(\sum_{k=1}^B\frac{a_k^2}{\lambda_k}\bigg)^{1/\theta}\sum_{k\ge B}a_k^2 \le M^2,$$
then $u\in\mathcal F(\theta,M)$.

Similar to before, we shall now take $B_n = \big\lfloor C_{10}\Delta_n^{-\frac{\theta+1}{2s}}\big\rfloor$ and $a_n = \sqrt{\Delta_n/B_n}$. By Lemma 9, we can find $P_{n,\xi_n}\in\mathcal P(\Delta_n,\theta)$ such that $dP_{n,\xi_n}/dP_0 = 1+u_{n,\xi_n}$, for appropriately chosen $C_{10}$. Following the same argument as in the previous case, we can again verify (23).
Proof [Proof of Theorem 5] Without loss of generality, assume that $\Delta_n(\theta) = c_1\big(n^{-1}\sqrt{\log\log n}\big)^{\frac{4s}{4s+\theta+1}}$ for some constant $c_1>0$ to be determined later.
Type I Error. We first prove the first statement, which shows that the Type I error converges to 0. Following the notation defined in the proof of Theorem 2, let
$$N_{n,2} = E\bigg\{\sum_{j=2}^nE\big(\tilde\zeta_{nj}^2\,\big|\,\mathcal F_{j-1}\big) - 1\bigg\}^2, \qquad L_{n,2} = \sum_{j=2}^nE\tilde\zeta_{nj}^4,$$
where $\tilde\zeta_{nj} = \sqrt2\,\zeta_{nj}/(n\sqrt{v_n})$. As shown by Haeusler (1988),

where $\bar\Phi(t)$ is the survival function of the standard normal, i.e., $\bar\Phi(t) = P(Z>t)$ with $Z\sim N(0,1)$. Again by the argument from Hall (1984),
$$E\bigg\{\sum_{j=2}^nE(\zeta_{nj}^2\,|\,\mathcal F_{j-1}) - \frac12n(n-1)v_n\bigg\}^2 \le C_{12}\big[n^4EG_n^2(X,X') + n^3E\tilde K_n^2(X,X')\tilde K_n^2(X,X'')\big],$$
which ensures
$$N_{n,2} = \frac{4E\Big\{\sum_{j=2}^nE(\zeta_{nj}^2\,|\,\mathcal F_{j-1}) - \frac12n(n-1)v_n - \frac12nv_n\Big\}^2}{n^4v_n^2} \le 8\max\Big\{C_{12},\frac14\Big\}\bigg\{\frac{EG_n^2(X,X')}{v_n^2} + \frac{E\tilde K_n^2(X,X')\tilde K_n^2(X,X'')}{nv_n^2} + \frac1{n^2}\bigg\},$$
and
$$L_{n,2} = \frac{4\sum_{j=2}^nE\zeta_{nj}^4}{n^4v_n^2} \le 4C_{13}\bigg\{\frac{E\tilde K_n^4(X,X')}{n^2v_n^2} + \frac{E\tilde K_n^2(X,X')\tilde K_n^2(X,X'')}{nv_n^2}\bigg\}.$$
Therefore,
$$\sup_t\big|P(T_{n,\varrho_n}>t) - \bar\Phi(t)\big| \le C_{14}\big(\varrho_n^{\frac1{5s}} + n^{-\frac15} + n^{-\frac25}\varrho_n^{-\frac1{5s}}\big),$$
It is clear that $\tilde T_n \ge T_{n,\tilde\varrho_n(\theta)}$ for any $\theta\ge0$. It therefore suffices to show that
$$\lim_{n\to\infty}\inf_{\theta\ge0}\inf_{P\in\mathcal P(\Delta_n,\theta)}P\Big\{T_{n,\tilde\varrho_n(\theta)}\ge\sqrt{3\log\log n}\Big\} = 1.$$
$$\frac12\varrho_n(\theta) \le \tilde\varrho_n(\theta) \le \varrho_n(\theta), \tag{26}$$
which immediately suggests
$$E_PT_{n,\tilde\varrho_n(\theta)} \ge C_{17}n[\tilde\varrho_n(\theta)]^{1/(2s)}\eta^2_{\tilde\varrho_n(\theta)}(P,P_0) \ge 2^{-1/(2s)}C_{17}n[\varrho_n(\theta)]^{1/(2s)}\eta^2_{\varrho_n(\theta)}(P,P_0), \tag{27}$$
and
$$\eta^2_{\varrho_n(\theta)}(P,P_0) \ge \frac14\|u\|^2_{L^2(P_0)} \tag{28}$$
provided that $\Delta_n(\theta) \ge C'(M)\Big(\frac{\sqrt{\log\log n}}{n}\Big)^{\frac{4s}{4s+\theta+1}}$.
Therefore,
$$\inf_{P\in\mathcal P(\Delta_n(\theta),\theta)}E_PT_{n,\tilde\varrho_n(\theta)} \ge C_{18}n[\varrho_n(\theta)]^{1/(2s)}\Delta_n(\theta) \ge C_{18}c_1\sqrt{\log\log n} \ge \tilde M\sqrt{\log\log n}$$
if $c_1 \ge C_{18}^{-1}\tilde M$. Hence to ensure (24) holds, it suffices to take
$$c_1 = \max\{C'(M),\,C_{18}^{-1}\tilde M\}.$$
With (26), (27) and (28), the results in the proof of Theorem 3 imply that for sufficiently large $n$
$$\sup_{P\in\mathcal P(\Delta_n^*(\theta),\theta)}\frac{\operatorname{Var}T_{n,\tilde\varrho_n(\theta)}}{E_P^2T_{n,\tilde\varrho_n(\theta)}} \le C_{19}\Big\{\big([\varrho_n(\theta)]^{\frac1{2s}}n\Delta_n^*(\theta)\big)^{-1} + \big([\varrho_n(\theta)]^{\frac1s}n^2\Delta_n^*(\theta)\big)^{-1} + \big(n\Delta_n^*(\theta)\big)^{-1} + \big([\varrho_n(\theta)]^{\frac1{2s}}n\sqrt{\Delta_n^*(\theta)}\big)^{-1}\Big\} \le 2C_{19}\big([\varrho_n(\theta)]^{\frac1{2s}}n\Delta_n^*(\theta)\big)^{-1/2} = 2C_{19}\big(c_1\sqrt{\log\log n}\big)^{-1/2} \to 0,$$
Proof [Proof of Theorem 6] The main idea of the proof is similar to that for Theorem 4. Nevertheless, in order to show that the corresponding error converges to 1 rather than merely being bounded away from 0, we need to find $P_\pi$, the marginal distribution on $\mathcal X^n$ whose conditional distribution is selected from $\cup_{\theta\in[\theta_1,\theta_2]}\mathcal P(\Delta_n(\theta),\theta)$ according to a prior distribution $\pi$, such that the $\chi^2$ distance between $P_\pi$ and $P_0^{\otimes n}$ converges to 0. See Ingster (2000).
To this end, assume, without loss of generality, that
$$\Delta_n(\theta) = c_2\Big(\frac{n}{\sqrt{\log\log n}}\Big)^{-\frac{4s}{4s+\theta+1}}, \qquad \forall\,\theta\in[\theta_1,\theta_2],$$
where $c_2>0$ is a sufficiently small constant to be determined later.

Let $r_n = \lfloor C_{20}\log n\rfloor$ and $B_{n,1} = \big\lfloor C_{21}[\Delta_n(\theta_1)]^{-\frac{\theta_1+1}{2s}}\big\rfloor$ for sufficiently small $C_{20},C_{21}>0$. Set $\theta_{n,1}=\theta_1$. For $2\le r\le r_n$, let
$$B_{n,r} = 2^{r-2}B_{n,1},$$
and let $\theta_{n,r}$ be selected such that
$$B_{n,r} = \Big\lfloor C_{21}[\Delta_n(\theta_{n,r})]^{-\frac{\theta_{n,r}+1}{2s}}\Big\rfloor.$$
Define
$$f_{n,r,\xi_{n,r}} = 1 + a_{n,r}\sum_{k=B^*_{n,r-1}+1}^{B^*_{n,r}}\xi_{n,r,\,k-B^*_{n,r-1}}\varphi_k,$$
and $a_{n,r} = \sqrt{\Delta_n(\theta_{n,r})/B_{n,r}}$. Following the same argument as that in the proof of Theorem 4, we can verify that with a sufficiently small $C_{21}$, each $P_{n,r,\xi_{n,r}}\in\mathcal P(\Delta_n(\theta_{n,r}),\theta_{n,r})$, where $f_{n,r,\xi_{n,r}}$ is the Radon-Nikodym derivative $dP_{n,r,\xi_{n,r}}/dP_0$. With a slight abuse of notation, write
$$f_n(X_1,X_2,\cdots,X_n) = \frac1{r_n}\sum_{r=1}^{r_n}f_{n,r}(X_1,X_2,\cdots,X_n),$$
where
$$f_{n,r}(X_1,X_2,\cdots,X_n) = \frac1{2^{B_{n,r}}}\sum_{\xi_{n,r}\in\{\pm1\}^{B_{n,r}}}\prod_{i=1}^nf_{n,r,\xi_{n,r}}(X_i).$$
Following the same derivation as that in the proof of Theorem 4, we can show that
$$\|f_{n,r}\|^2_{L^2(P_0)} \le \bigg[\frac{\exp(na^2_{n,r})+\exp(-na^2_{n,r})}{2}\bigg]^{B_{n,r}} \le \exp\Big(\frac12B_{n,r}n^2a^4_{n,r}\Big)$$
for sufficiently large $n$. By setting $c_2$ in the expression of $\Delta_n(\theta)$ sufficiently small, we have
$$B_{n,r}n^2a^4_{n,r} \le \log r_n,$$
which ensures that
$$\sum_{1\le r\le r_n}\|f_{n,r}\|^2_{L^2(P_0)} \le r_n^{3/2} = o(r_n^2).$$
Acknowledgments
We would like to thank the editor, Arthur Gretton, and the anonymous reviewers for their insightful comments, which helped greatly improve the paper. We also acknowledge support for this project from the National Science Foundation (NSF grants DMS-1803450 and DMS-2015285).
Appendix A.
Proof [Proof of Lemma 8] We have
$$G^2(x,x') = \sum_{k,l}\mu_k\mu_l\varphi_k(x)\varphi_l(x)\varphi_k(x')\varphi_l(x').$$
Thus
$$\int g(x)g(x')G^2(x,x')\,dP(x)\,dP(x') = \sum_{k,l}\mu_k\mu_l\bigg(\int g(x)\varphi_k(x)\varphi_l(x)\,dP(x)\bigg)^2 \le \mu_1\sum_k\mu_k\sum_l\bigg(\int g(x)\varphi_k(x)\varphi_l(x)\,dP(x)\bigg)^2 \le \mu_1\sum_k\mu_k\bigg(\int g^2(x)\varphi_k^2(x)\,dP(x)\bigg) \le \mu_1\Big(\sum_k\mu_k\Big)\Big(\sup_k\|\varphi_k\|_\infty\Big)^2\|g\|^2_{L^2(P)}.$$
Proof [Proof of Lemma 9] By definition, it suffices to show that $\forall\,R>0$, $\exists\,f_R\in\mathcal H(K)$ such that $\|f_R\|^2_K\le R^2$ and $\|u-f_R\|^2_{L^2(P_0)}\le M^2R^{-2/\theta}$. To this end, let $B$ be such that $l_B^2\le R^2\le l^2_{B+1}$, where $l_B^2 := \sum_{k=1}^Ba_k^2/\lambda_k$, and set
$$f_R = \sum_{k=1}^Ba_k\varphi_k + a^*_{B+1}(R)\varphi_{B+1},$$
where
$$a^*_{B+1}(R) = \operatorname{sgn}(a_{B+1})\sqrt{\lambda_{B+1}(R^2-l_B^2)}.$$
Clearly,
$$\|f_R\|^2_K = \sum_{k=1}^B\frac{a_k^2}{\lambda_k} + \frac{(a^*_{B+1}(R))^2}{\lambda_{B+1}} = R^2,$$
and
$$\|u-f_R\|^2_{L^2(P_0)} = \sum_{k>B+1}a_k^2 + \Big(|a_{B+1}|-\sqrt{\lambda_{B+1}(R^2-l_B^2)}\Big)^2 \le \sum_{k\ge B+1}a_k^2.$$
References
L. Addario-Berry, N. Broutin, L. Devroye, and G. Lugosi. On combinatorial testing prob-
lems. The Annals of Statistics, 38(5):3063–3092, 2010.
F. Bach. On the equivalence between kernel quadrature rules and random feature expan-
sions. Journal of Machine Learning Research, 18(21):1–38, 2017.
A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005.
N. Dunford and J. T. Schwartz. Linear Operators: Part II: Spectral Theory: Self Adjoint
Operators in Hilbert Space. Interscience Publishers, 1963.
M. Fromont and B. Laurent. Adaptive goodness-of-fit tests in a density model. The Annals
of Statistics, 34(2):680–720, 2006.
J. Gorham and L. Mackey. Measuring sample quality with kernels. In International Con-
ference on Machine Learning, 2017.
G. G. Gregory. Large sample theory for U-statistics and tests of fit. The Annals of Statistics,
5(1):110–123, 1977.
E. Haeusler. On the rate of convergence in the central limit theorem for martingales with
discrete and continuous time. The Annals of Probability, 16(1):275–299, 1988.
P. Hall. Central limit theorem for integrated square error of multivariate nonparametric
density estimators. Journal of Multivariate Analysis, 14(1):1–16, 1984.
Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel fisher discrim-
inant analysis. In Advances in Neural Information Processing Systems, pages 609–616,
2007.
Yu. I. Ingster. Minimax testing of hypotheses on the distribution density for ellipsoids in ℓp. Theory of Probability & Its Applications, 39(3):417–436, 1995.
Yu. I. Ingster and I. A. Suslina. Minimax nonparametric hypothesis testing for ellipsoids and Besov bodies. ESAIM: Probability and Statistics, 4:53–135, 2000.
J. Kellner and A. Celisse. A one-sample test for normality with kernel methods. arXiv
preprint arXiv:1507.02904, 2015.
Q. Liu, J. Lee, and M. Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In
International Conference on Machine Learning, pages 276–284, 2016.
K. V. Mardia and P. E. Jupp. Directional Statistics. John Wiley & Sons, New York, NY,
2009.
J. S. Marron and M. P. Wand. Exact mean integrated squared error. The Annals of
Statistics, 20(2):712–736, 1992.
H. Q. Minh, P. Niyogi, and Y. Yao. Mercer’s theorem, feature maps, and smoothing. In
Conference on Learning Theory, pages 154–168, 2006.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany, 2007.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines.
Journal of Machine Learning Research, 2(Nov):67–93, 2001.
I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business
Media, 2008.
G. Wahba. Spline Models for Observational Data, volume 59. Siam, 1990.