A Kernel Two-Sample Test: Arthur Gretton
A Kernel Two-Sample Test: Arthur Gretton
A Kernel Two-Sample Test: Arthur Gretton
Abstract
We propose a framework for analyzing and comparing distributions, which we use to construct sta-
tistical tests to determine if two samples are drawn from different distributions. Our test statistic is
the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert
space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-
free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic
distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear
time approximations are available. Our statistic is an instance of an integral probability metric, and
various classical metrics on distributions are obtained when alternative function classes are used
in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute
matching for databases using the Hungarian marriage method, where they perform strongly. Ex-
cellent performance is also obtained when comparing distributions over graphs, for which these are
the first such tests.
∗. Also at Gatsby Computational Neuroscience Unit, CSML, 17 Queen Square, London WC1N 3AR, UK.
†. This work was carried out while K.M.B. was with the Ludwig-Maximilians-Universität München.
‡. This work was carried out while M.J.R. was with the Graz University of Technology.
§. Also at The Australian National University, Canberra, ACT 0200, Australia.
2012
c Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf and Alexander Smola.
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
Keywords: kernel methods, two-sample test, uniform convergence bounds, schema matching,
integral probability metric, hypothesis testing
1. Introduction
We address the problem of comparing samples from two probability distributions, by proposing
statistical tests of the null hypothesis that these distributions are equal against the alternative hy-
pothesis that these distributions are different (this is called the two-sample problem). Such tests
have application in a variety of areas. In bioinformatics, it is of interest to compare microarray
data from identical tissue types as measured by different laboratories, to detect whether the data
may be analysed jointly, or whether differences in experimental procedure have caused systematic
differences in the data distributions. Equally of interest are comparisons between microarray data
from different tissue types, either to determine whether two subtypes of cancer may be treated as
statistically indistinguishable from a diagnosis perspective, or to detect differences in healthy and
cancerous tissue. In database attribute matching, it is desirable to merge databases containing mul-
tiple fields, where it is not known in advance which fields correspond: the fields are matched by
maximising the similarity in the distributions of their entries.
We test whether distributions p and q are different on the basis of samples drawn from each of
them, by finding a well behaved (e.g., smooth) function which is large on the points drawn from p,
and small (as negative as possible) on the points from q. We use as our test statistic the difference
between the mean function values on the two samples; when this is large, the samples are likely
from different distributions. We call this test statistic the Maximum Mean Discrepancy (MMD).
Clearly the quality of the MMD as a statistic depends on the class F of smooth functions that
define it. On one hand, F must be “rich enough” so that the population MMD vanishes if and only
if p = q. On the other hand, for the test to be consistent in power, F needs to be “restrictive” enough
for the empirical estimate of the MMD to converge quickly to its expectation as the sample size
increases. We will use the unit balls in characteristic reproducing kernel Hilbert spaces (Fukumizu
et al., 2008; Sriperumbudur et al., 2010b) as our function classes, since these will be shown to satisfy
both of the foregoing properties. We also review classical metrics on distributions, namely the
Kolmogorov-Smirnov and Earth-Mover’s distances, which are based on different function classes;
collectively these are known as integral probability metrics (Müller, 1997). On a more practical
note, the MMD has a reasonable computational cost, when compared with other two-sample tests:
given m points sampled from p and n from q, the cost is O(m + n)2 time. We also propose a test
statistic with a computational cost of O(m + n): the associated test can achieve a given Type II error
at a lower overall computational cost than the quadratic-cost test, by looking at a larger volume of
data.
We define three nonparametric statistical tests based on the MMD. The first two tests are
distribution-free, meaning they make no assumptions regarding p and q, albeit at the expense of
being conservative in detecting differences between the distributions. The third test is based on the
asymptotic distribution of the MMD, and is in practice more sensitive to differences in distribution at
small sample sizes. The present work synthesizes and expands on results of Gretton et al. (2007a,b)
and Smola et al. (2007),1 who in turn build on the earlier work of Borgwardt et al. (2006). Note that
1. In particular, most of the proofs here were not provided by Gretton et al. (2007a), but in an accompanying technical
report (Gretton et al., 2008a), which this document replaces.
724
A K ERNEL T WO -S AMPLE T EST
the latter addresses only the third kind of test, and that the approach of Gretton et al. (2007a,b) is
rigorous in its treatment of the asymptotic distribution of the test statistic under the null hypothesis.
We begin our presentation in Section 2 with a formal definition of the MMD. We review the
notion of a characteristic RKHS, and establish that when F is a unit ball in a characteristic RKHS,
then the population MMD is zero if and only if p = q. We further show that universal RKHSs in
the sense of Steinwart (2001) are characteristic. In Section 3, we give an overview of hypothesis
testing as it applies to the two-sample problem, and review alternative test statistics, including the
L2 distance between kernel density estimates (Anderson et al., 1994), which is the prior approach
closest to our work. We present our first two hypothesis tests in Section 4, based on two different
bounds on the deviation between the population and empirical MMD. We take a different approach
in Section 5, where we use the asymptotic distribution of the empirical MMD estimate as the basis
for a third test. When large volumes of data are available, the cost of computing the MMD (quadratic
in the sample size) may be excessive: we therefore propose in Section 6 a modified version of the
MMD statistic that has a linear cost in the number of samples, and an associated asymptotic test.
In Section 7, we provide an overview of methods related to the MMD in the statistics and machine
learning literature. We also review alternative function classes for which the MMD defines a metric
on probability distributions. Finally, in Section 8, we demonstrate the performance of MMD-based
two-sample tests on problems from neuroscience, bioinformatics, and attribute matching using the
Hungarian marriage method. Our approach performs well on high dimensional data with low sample
size; in addition, we are able to successfully distinguish distributions on graph data, for which ours
is the first proposed test.
A Matlab implementation of the tests is at www.gatsby.ucl.ac.uk/ ∼ gretton/mmd/mmd.htm.
725
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
Lemma 1 Let (X, d) be a metric space, and let p, q be two Borel probability measures defined on
X. Then p = q if and only if Ex ( f (x)) = Ey ( f (y)) for all f ∈ C(X), where C(X) is the space of
bounded continuous functions on X.
Although C(X) in principle allows us to identify p = q uniquely, it is not practical to work with such
a rich function class in the finite sample setting. We thus define a more general class of statistic, for
as yet unspecified function classes F, to measure the disparity between p and q (Fortet and Mourier,
1953; Müller, 1997).
In the statistics literature, this is known as an integral probability metric (Müller, 1997). A biased2
empirical estimate of the MMD is obtained by replacing the population expectations with empirical
expectations computed on the samples X and Y ,
!
1 m 1 n
MMDb [F, X,Y ] := sup ∑ f (xi ) − n ∑ f (yi ) .
f ∈F m i=1
(2)
i=1
We must therefore identify a function class that is rich enough to uniquely identify whether p = q,
yet restrictive enough to provide useful finite sample estimates (the latter property will be established
in subsequent sections).
726
A K ERNEL T WO -S AMPLE T EST
p
Lemma 3 If k(·, ·) is measurable and Ex k(x, x) < ∞ then µ p ∈ H.
Proof The linear operator Tp f := Ex f for all f ∈ F is bounded under the assumption, since
p
|Tp f | = |Ex f | ≤ Ex | f | = Ex |h f , φ(x)iH | ≤ Ex k(x, x) k f kH .
Hence by the Riesz representer theorem, there exists a µ p ∈ H such that Tp f = h f , µ p iH . If we set
f = φ(t) = k(t, ·), we obtain µ p (t) = hµ p , k(t, ·)iH = Ex k(t, x): in other words, the mean embedding
of the distribution p is the expectation under p of the canonical feature map.
We next show that the MMD may be expressed as the distance in H between mean embeddings
(Borgwardt et al., 2006).
Lemma 4 Assume the condition in Lemma 3 for the existence of the mean embeddings µ p , µq is
satisfied. Then
2
MMD2 [F, p, q] =
µ p − µq
H .
Proof
" #2
MMD2 [F, p, q] = sup (Ex [ f (x)] − Ey [ f (y)])
k f kH ≤1
" #2
= sup µ p − µq , f H
k f kH ≤1
2
=
µ p − µq
H .
We now establish a condition on the RKHS H under which the mean embedding µ p is injective,
which indicates that MMD [F, p, q] = 0 is a metric3 on the Borel probability measures on X. Evi-
dently, this property will not hold for all H: for instance, a polynomial RKHS of degree two cannot
distinguish between distributions with the same mean and variance, but different kurtosis (Sriperum-
budur et al., 2010b, Example 3). The MMD is a metric, however, when H is a universal RKHSs,
defined on a compact metric space X. Universality requires that k(·, ·) be continuous, and H be
dense in C(X) with respect to the L∞ norm. Steinwart (2001) proves that the Gaussian and Laplace
RKHSs are universal.
Theorem 5 Let F be a unit ball in a universal RKHS H, defined on the compact metric space X,
with associated continuous kernel k(·, ·). Then MMD [F, p, q] = 0 if and only if p = q.
Proof The proof follows Cortes et al. (2008, Supplementary Appendix), whose approach is clearer
than the original proof of Gretton et al. (2008a, p. 4).4 First, it is clear that p = q implies
3. According to Dudley (2002, p. 26) a metric d(x, y) satisfies the following four properties: symmetry, triangle in-
equality, d(x, x) = 0, and d(x, y) = 0 =⇒ x = y. A pseudo-metric only satisfies the first three properties.
4. Note that the proof of Cortes et al. (2008) requires an application the of dominated convergence theorem, rather than
using the Riesz representation theorem to show the existence of the mean embeddings µ p and µq as we did in Lemma
3.
727
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
MMD {F, p, q} is zero. We now prove the converse. By the universality of H, for any given ε > 0
and f ∈ C(X) there exists a g ∈ H such that
k f − gk∞ ≤ ε.
|Ex f (x) − Ey ( f (y))| ≤ |Ex f (x) − Ex g(x)| + |Ex g(x) − Ey g(y)| + |Ey g(y) − Ey f (y)| .
Next, write
Ex g(x) − Ey g(y) = g, µ p − µq H = 0,
since MMD {F, p, q} = 0 implies µ p = µq . Hence
where x′ is an independent copy of x with the same distribution, and y′ is an independent copy of y.
An unbiased empirical estimate is a sum of two U-statistics and a sample average,
m m n n
1 1
MMD2u [F, X,Y ] = ∑ ∑
m(m − 1) i=1 j6=i
k(xi , x j ) +
n(n − 1) i=1 ∑
∑ k(yi , y j )
j6=i
2 m n
− ∑ ∑ k(xi , y j ).
mn i=1
(3)
j=1
When m = n, a slightly simpler empirical estimate may be used. Let Z := (z1 , . . . , zm ) be m i.i.d.
random variables, where z := (x, y) ∼ p × q (i.e., x and y are independent). An unbiased estimate of
MMD2 is
m
1
(m)(m − 1) i6∑
MMD2u [F, X,Y ] = h(zi , z j ), (4)
=j
728
A K ERNEL T WO -S AMPLE T EST
The proof is completed by applying hφ(x), φ(x′ )iH = k(x, x′ ); the empirical estimates follow straight-
forwardly, by replacing the population expectations with their corresponding U-statistics and sample
averages. This statistic is unbiased following Serfling (1980, Chapter 5).
Note that MMD2u may be negative, since it is an unbiased estimator of (MMD[F, p, q])2 . The only
terms missing to ensure nonnegativity, however, are h(zi , zi ), which were removed to remove spuri-
ous correlations between observations. Consequently we have the bound
m
1
MMD2u + ∑
m(m − 1) i=1
k(xi , xi ) + k(yi , yi ) − 2k(xi , yi ) ≥ 0.
Moreover, while the empirical statistic for m = n is an unbiased estimate of MMD2 , it does not have
minimum variance, since we ignore the cross-terms k(xi , yi ), of which there are O(n). From (3),
however, we see the minimum variance estimate is almost identical (Serfling, 1980, Section 5.1.4).
The biased statistic in (2) may also be easily computed following the above reasoning. Substi-
tuting the empirical estimates µX := m1 ∑mi=1 φ(xi ) and µY := n ∑i=1 φ(yi ) of the feature space means
1 n
Note that the U-statistics of (3) have been replaced by V-statistics. Intuitively we expect the empir-
ical test statistic MMD[F, X,Y ], whether biased or unbiased, to be small if p = q, and large if the
distributions are far apart. It costs O((m + n)2 ) time to compute both statistics.
where we have defined µX = m−1 ∑m i=1 φ(xi ), and µY by analogy. The result follows since the unit
vector v maximizing hv, xiH in a Hilbert space is v = x/ kxkH .
We illustrate the behavior of MMD in Figure 1 using a one-dimensional example. The data X
and Y were generated from distributions p and q with equal means and variances, with p Gaussian
729
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
0.8
fˆ∗
p (Gau ss)
q (L ap l ace)
0.4
0.2
−0.2
−0.4
−0.6
−6 −4 −2 0 2 4 6
t
Figure 1: Illustration of the function maximizing the mean discrepancy in the case where a Gaussian
is being compared with a Laplace distribution. Both distributions have zero mean and unit
variance. The function fˆ∗ that witnesses the MMD has been scaled for plotting purposes,
and was computed empirically on the basis of 2 × 104 samples, using a Gaussian kernel
with σ = 0.5.
and q Laplacian. We chose F to be the unit ball in a Gaussian RKHS. The empirical estimate fˆ∗
of the function f ∗ that witnesses the MMD—in other words, the function maximizing the mean
discrepancy in (1)—is smooth, negative where the Laplace density exceeds the Gaussian density (at
the center and tails), and positive where the Gaussian density is larger. The magnitude of fˆ∗ is a
direct reflection of the amount by which one density exceeds the other, insofar as the smoothness
constraint permits it.
3. Background Material
We now present three background results. First, we introduce the terminology used in statistical
hypothesis testing. Second, we demonstrate via an example that even for tests which have asymp-
totically no error, we cannot guarantee performance at any fixed sample size without making as-
sumptions about the distributions. Third, we review some alternative statistics used in comparing
distributions, and the associated two-sample tests (see also Section 7 for an overview of additional
integral probability metrics).
730
A K ERNEL T WO -S AMPLE T EST
samples X ∼ p of size m and Y ∼ q of size n, the statistical test, T(X,Y ) : Xm × Xn 7→ {0, 1} is used
to distinguish between the null hypothesis H0 : p = q and the alternative hypothesis HA : p 6= q.
This is achieved by comparing the test statistic5 MMD[F, X,Y ] with a particular threshold: if the
threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population
MMD indicates p = q). The acceptance region of the test is thus defined as the set of real numbers
below the threshold. Since the test is based on finite samples, it is possible that an incorrect answer
will be returned. A Type I error is made when p = q is rejected based on the observed samples,
despite the null hypothesis having generated the data. Conversely, a Type II error occurs when
p = q is accepted despite the underlying distributions being different. The level α of a test is an
upper bound on the probability of a Type I error: this is a design parameter of the test which must
be set in advance, and is used to determine the threshold to which we compare the test statistic
(finding the test threshold for a given α is the topic of Sections 4 and 5). The power of a test
against a particular member of the alternative class HA (i.e., a specific (p, q) such that p 6= q) is the
probability of wrongly accepting p = q in this instance. A consistent test achieves a level α, and a
Type II error of zero, in the large sample limit. We will see that the tests proposed in this paper are
consistent.
Even if a test is consistent, it is not possible to distinguish distributions with high probability at a
given, fixed sample size (i.e., to provide guarantees on the Type II error), without prior assumptions
as to the nature of the difference between p and q. This is true regardless of the two-sample test
used. There are several ways to illustrate this, which each give insight into the kinds of differences
that might be undetectable for a given number of samples. The following example6 is one such
illustration.
Example 1 Assume we have a distribution p from which we have drawn m i.i.d. observations.
We construct a distribution q by drawing m2 i.i.d. observations from p, and defining a discrete
distribution over these m2 instances with probability m−2 each. It is easy to check that if we now
2
draw m observations from q, there is at least a mm mm!2m > 1 − e−1 > 0.63 probability that we thereby
obtain an m sample from p. Hence no test will be able to distinguish samples from p and q in this
case. We could make the probability of detection arbitrarily small by increasing the size of the
sample from which we construct q.
We next give a brief overview of some earlier approaches to the two sample problem for multivariate
data. Since our later experimental comparison is with respect to certain of these methods, we give
abbreviated algorithm names in italics where appropriate: these should be used as a key to the tables
in Section 8.
731
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
732
A K ERNEL T WO -S AMPLE T EST
on an interval (Anderson et al., 1994, Section 2.4), which implies the kernel k is characteristic
(Sriperumbudur et al., 2010b). The power of the L2 test against local alternatives is greater when
the kernel is held fixed, since for any rate of decrease of hm+n with increasing sample size, δ will
decrease more slowly than for a fixed kernel.
An RKHS-based approach generalizes the L2 statistic in a number of important respects. First,
we may employ a much larger class of characteristic kernels that cannot be written as inner products
between Parzen windows: several examples are given by Steinwart (2001, Section 3) and Micchelli
et al. (2006, Section 3) (these kernels are universal, hence characteristic). We may further generalize
to kernels on structured objects such as strings and graphs (Schölkopf et al., 2004), as done in our
experiments (Section 8). Second, even when the kernel may be written as an inner product of
Parzen windows on Rd , the D22 statistic with fixed bandwidth no longer converges to an L2 distance
between probability density functions, hence it is more natural to define the statistic as an integral
probability metric for a particular RKHS, as in Definition 2. Indeed, in our experiments, we obtain
good performance in experimental settings where the dimensionality greatly exceeds the sample
size, and density estimates would perform very poorly8 (for instance the Gaussian toy example
in Figure 5B, for which performance actually improves when the dimensionality increases; and the
microarray data sets in Table 1). This suggests it is not necessary to solve the more difficult problem
of density estimation in high dimensions to do two-sample testing.
Finally, the kernel approach leads us to establish consistency against a larger class of local
alternatives to the null hypothesis than that considered by Anderson et al. In Theorem 13, we prove
consistency against a class of alternatives encoded in terms of the mean embeddings of p and q,
which applies to any domain on which RKHS kernels may be defined, and not only densities on Rd .
This more general approach also has interesting consequences for distributions on Rd : for instance,
a local departure from H0 occurs when p and q differ at increasing frequencies in their respective
characteristic functions. This class of local alternatives cannot be expressed in the form δg for fixed
g, as in (7). We discuss this issue further in Section 5.
d
MMD2 [F, p, q] = kp − qk2Rd = ∑ (pi − qi )2 . (8)
i=1
Harchaoui et al. (2008, Section 1, long version) note that this L2 statistic may not be the best choice
for finite domains, citing a result of Lehmann and Romano (2005, Theorem 14.3.2) that Pearson’s
is well-defined, and satisfies α < π(c) < 1 for 0 < |c| < ∞, and π(c) → 1 as c → ∞.
8. The L2 error of a kernel density estimate converges as O(n−4/(4+d) ) when the optimal bandwidth is used (Wasserman,
2006, Section 6.5).
733
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
Chi-squared statistic is optimal for the problem of goodness of fit testing for multinomials.9 It would
be of interest to establish whether an analogous result holds for two-sample testing in a wider class
of RKHS feature spaces.
Biau and Gyorfi (2005) (Biau) use as their test statistic the L1 distance between discretized esti-
mates of the probabilities, where the partitioning is refined as the sample size increases. This space
partitioning approach becomes difficult or impossible for high dimensional problems, since there
are too few points per bin. For this reason, we use this test only for low-dimensional problems in
our experiments.
A generalisation of the Wald-Wolfowitz runs test to the multivariate domain was proposed and
analysed by Friedman and Rafsky (1979) and Henze and Penrose (1999) (FR Wolf), and involves
counting the number of edges in the minimum spanning tree over the aggregated data that connect
points in X to points in Y . The resulting test relies on the asymptotic normality of the test statistic,
and is not distribution-free under the null hypothesis for finite samples (the test threshold depends
on p, as with our asymptotic test in Section 5; by contrast, our tests in Section 4 are distribution-
free). The computational cost of this method using Kruskal’s algorithm is O((m + n)2 log(m + n)),
although more modern methods improve on the log(m + n) term: see Chazelle (2000) for details.
Friedman and Rafsky (1979) claim that calculating the matrix of distances, which costs O((m+n)2 ),
dominates their computing time; we return to this point in our experiments (Section 8). Two possible
generalisations of the Kolmogorov-Smirnov test to the multivariate case were studied by Bickel
(1969) and Friedman and Rafsky (1979). The approach of Friedman and Rafsky (FR Smirnov) in
this case again requires a minimal spanning tree, and has a similar cost to their multivariate runs
test.
A more recent multivariate test was introduced by Rosenbaum (2005). This entails computing
the minimum distance non-bipartite matching over the aggregate data, and using the number of pairs
containing a sample from both X and Y as a test statistic. The resulting statistic is distribution-free
under the null hypothesis at finite sample sizes, in which respect it is superior to the Friedman-
Rafsky test; on the other hand, it costs O((m + n)3 ) to compute. Another distribution-free test
(Hall) was proposed by Hall and Tajvidi (2002): for each point from p, it requires computing the
closest points in the aggregated data, and counting how many of these are from q (the procedure is
repeated for each point from q with respect to points from p). As we shall see in our experimental
comparisons, the test statistic is costly to compute; Hall and Tajvidi consider only tens of points in
their experiments.
9. A goodness of fit test determines whether a sample from p is drawn from a known target multinomial q. Pearson’s
Chi-squared statistic weights each term in the sum (8) by its corresponding q−1
i .
734
A K ERNEL T WO -S AMPLE T EST
where PrX,Y denotes the probability over the m-sample X and n-sample Y .
Our next goal is to refine this result in a way that allows us to define a test threshold under the null
hypothesis p = q. Under this circumstance, the constants in the exponent are slightly improved. The
following theorem is proved in Appendix A.3.
Theorem 8 Under the conditions of Theorem 7 where additionally p = q and m = n,
1
q
MMDb [F, X,Y ] ≤ m− 2 2Ex,x′ [k(x, x) − k(x, x′ )] + ε ≤ (2K/m)1/2 + ε,
| {z } | {z }
B1 (F,p) B2 (F,p)
2
both with probability at least 1 − exp − ε4Km .
In this theorem, we illustrate two possible bounds B1 (F, p) and B2 (F, p) on the bias in the empirical
estimate (5). The first inequality is interesting inasmuch as it provides a link between the bias bound
B1 (F, p) and kernel size (for instance, if we were to use a Gaussian kernel with large σ, then k(x, x)
and k(x, x′ ) would likely be close, and the bias small). In the context of testing, however, we would
need to provide an additional bound to show convergence of an empirical estimate of B1 (F, p) to its
population equivalent. Thus, in the following test for p = q based on Theorem 8, we use B2 (F, p)
to bound the bias.10
Corollary 9 A hypothesis test of level α for the null hypothesis
p = q, that
is, for MMD[F, p, q] = 0,
p p
has the acceptance region MMDb [F, X,Y ] < 2K/m 1 + 2 log α −1 .
We emphasize that this test is distribution-free: the test threshold does not depend on the particular
distribution that generated the sample. Theorem 7 guarantees the consistency of the test against fixed
alternatives, and that the Type II error probability decreases to zero at rate O m−1/2 , assuming m =
n. To put this convergence rate in perspective, consider a test of whether two normal distributions
have equal means, given they have unknown but equal variance (Casella and Berger, 2002, Exercise
8.41). In this case, the test statistic has a Student-t distribution with n + m − 2 degrees of freedom,
and its Type II error probability converges at the same rate as our test.
It is worth noting that bounds may be obtained for the deviation between population mean
embeddings µ p and the empirical embeddings µX in a completely analogous fashion. The proof
10. Note that we use a tighter bias bound than Gretton et al. (2007a).
735
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
requires symmetrization by means of a ghost sample, that is, a second set of observations drawn
from the same distribution. While not the focus of the present paper, such bounds can be used to
perform inference based on moment matching (Altun and Smola, 2006; Dudı́k and Schapire, 2006;
Dudı́k et al., 2004).
Theorem 10 Assume 0 ≤ k(xi , x j ) ≤ K, from which it follows −2K ≤ h(zi , z j ) ≤ 2K. Then
2
−t m2
PrX,Y MMD2u (F, X,Y ) − MMD2 (F, p, q) > t ≤ exp
8K 2
where m2 := ⌊m/2⌋ (the same bound applies for deviations of −t and below).
Corollary 11 A hypothesis test of level α for the null hypothesis p = q has the acceptance region
2 √ p
MMDu < (4K/ m) log(α−1 ).
This test is distribution-free. We now compare the thresholds of the above test with that in Corollary
9. We note first that the threshold for the biased statistic applies to an estimate of MMD, whereas
that for the unbiased statistic is for an estimate of MMD2 . Squaring the former threshold to make
the two quantities comparable, the squared threshold in Corollary 9 decreases as m−1 , whereas the
threshold in Corollary 11 decreases as m−1/2 . Thus for sufficiently large11 m, the McDiarmid-based
threshold will be lower (and the associated test statistic is in any case biased upwards), and its Type
II error will be better for a given Type I bound. This is confirmed in our Section 8 experiments.
Note, however, that the rate of convergence of the squared, biased MMD estimate to its population
√
value remains at 1/ m (bearing in mind we take the square of a biased estimate, where the bias
√
term decays as 1/ m).
Finally, we note that the bounds we obtained in this section and the last are rather conservative
for a number of reasons: first, they do not take the actual distributions into account. In fact, they are
finite sample size, distribution-free bounds that hold even in the worst case scenario. The bounds
could be tightened using localization, moments of the distribution, etc.: see, for example, Bousquet
et al. (2005) and de la Peña and Giné (1999). Any such improvements could be plugged straight
into Theorem 19. Second, in computing bounds rather than trying to characterize the distribution of
MMD(F, X,Y ) explicitly, we force our test to be conservative by design. In the following we aim for
an exact characterization of the asymptotic distribution of MMD(F, X,Y ) instead of a bound. While
this will not satisfy the uniform convergence requirements, it leads to superior tests in practice.
736
A K ERNEL T WO -S AMPLE T EST
Theorem 12 Let k̃(xi , x j ) be the kernel between feature space mappings from which the mean em-
bedding of p has been subtracted,
k̃(xi , x j ) := φ(xi ) − µ p , φ(x j ) − µ p H
= k(xi , x j ) − Ex k(xi , x) − Ex k(x, x j ) + Ex,x′ k(x, x′ ), (9)
where al ∼ N(0, 1) and bl ∼ N(0, 1) are infinite sequences of independent Gaussian random vari-
ables, and the λi are eigenvalues of
Z
k̃(x, x′ )ψi (x)d p(x) = λi ψi (x′ ).
X
We illustrate the MMD density under both the null and alternative hypotheses by approximating it
empirically for p = q and p 6= q. Results are plotted in Figure 2.
Our goal is to determine whether the empirical test statistic MMD2u is so large as to be outside
the 1 − α quantile of the null distribution in (10), which gives a level α test. Consistency of this test
against local departures from the null hypothesis is provided by the following theorem, proved in
Appendix B.2.
An example of a local departure from the null hypothesis is described earlier in the discussion of
the L2 distance between Parzen window estimates (Section 3.3.1). The class of local alternatives
considered in Theorem 13 is more general, however: for instance, Sriperumbudur et al. (2010b,
Section 4) and Harchaoui et al. (2008, Section 5, long version) give examples of classes of pertur-
bations gt with decreasing RKHS norm. These perturbations have the property that p differs from q
at increasing frequencies, rather than simply with decreasing amplitude.
One way to estimate the 1 − α quantile of the null distribution is using the bootstrap on the
aggregated data, following Arcones and Giné (1992). Alternatively, we may approximate the null
737
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
2 2
Empirical MMD density under H0 Empirical MMD density under H1
u u
50 10
45 9
40 8
35 7
Prob. density
Prob. density
30 6
25 5
20 4
15 3
10 2
5 1
0 0
−0.04 −0.02 0 0.02 0.04 0.06 0.08 0.1 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4
MMD2 MMD
2
u u
Figure 2: Left: Empirical distribution of the MMD under H0 , with p and q both Gaussians with
unit standard deviation, using 50 samples from each. Right: Empirical distribution of
the MMD under HA , with p a Laplace distribution√ with unit standard deviation, and q
a Laplace distribution with standard deviation 3 2, using 100 samples from each. In
both cases, the histograms were obtained by computing 2000 independent instances of
the MMD.
distribution by fitting Pearson curves to its first four moments (Johnson et al., 1994, Section 18.8).
Taking advantage of the degeneracy of the U-statistic, we obtain for m = n
2 2
E MMD2u Ez,z′ h2 (z, z′ ) and
=
m(m − 1)
3
8(m − 2)
′ ′′ ′ ′′
E MMD2u = 2 E z,z′ h(z, z )Ez′′ h(z, z )h(z , z ) + O(m−4 ) (11)
m (m − 1)2
(see Appendix B.3), where h(z, z′ ) is defined in Lemma 6, z = (x, y) ∼ p × q where x and y are inde-
′ ′′
2 4
pendent, and z , z are independent copies of z. The fourth moment E MMDu is not computed,
since it is both very small, O(m−4 ), and expensive to calculate, O(m4 ). Instead, we replace the kur-
2
tosis12 with a lower bound due to Wilkins (1944), kurt MMD2u ≥ skew MMD2u + 1. In Figure
3, we illustrate the Pearson curve fit to the null distribution: the fit is good in the upper quantiles of
the distribution, where the test threshold is computed. Finally, we note that two alternative empiri-
cal estimates of the null distribution have more recently been proposed by Gretton et al. (2009): a
consistent estimate, based on an empirical computation of the eigenvalues λl in (10); and an alter-
native Gamma approximation to the null distribution, which has a smaller computational cost but is
generally less accurate. Further detail and experimental comparisons are given by Gretton et al.
4
E [MMD2u ]
12. The kurtosis is defined in terms of the fourth and second moments as kurt MMD2u = h 2
i2 − 3.
E [MMD2u ]
738
A K ERNEL T WO -S AMPLE T EST
0.8
P(MMD2 < t)
0.6
u
0.4
0.2
Emp. CDF
Pearson
0
−0.02 0 0.02 0.04 0.06 0.08 0.1 0.12
t
Figure 3: Illustration of the empirical CDF of the MMD and a Pearson curve fit. Both p and q were
Gaussian with zero mean and unit variance, and 50 samples were drawn from each. The
empirical CDF was computed on the basis of 1000 randomly generated MMD values. To
ensure the quality of fit was determined only by the accuracy of the Pearson approxima-
tion, the moments used for the Pearson curves were also computed on the basis of these
1000 samples. The MMD used a Gaussian kernel with σ = 0.5.
Lemma 14 Define m2 := ⌊m/2⌋, assume m = n, and define h(z1 , z2 ) as in Lemma 6. The estimator
1 m2
MMD2l [F, X,Y ] := ∑ h((x2i−1 , y2i−1 ), (x2i, y2i ))
m2 i=1
can be computed in linear time, and is an unbiased estimate of MMD2 [F, p, q].
While it is expected that MMD2l has higher variance than MMD2u (as we will see explicitly later), it
is computationally much more appealing. In particular, the statistic can be used in stream computa-
tions with need for only O(1) memory, whereas MMD2u requires O(m) storage and O(m2 ) time to
compute the kernel h on all interacting pairs.
Since MMD2l is just the average over a set of random variables, Hoeffding’s bound and the cen-
tral limit theorem readily allow us to provide both uniform convergence and asymptotic statements
with little effort. The first follows directly from Hoeffding (1963, Theorem 2).
739
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
The factor of 2 arises since we are averaging over only ⌊m/2⌋ observations. It is instructive to
compare this asymptotic distribution with that of the quadratic time statistic MMD2u under HA ,
when m = n. In this case, MMD2u converges in distribution to a Gaussian according to
1 D
m 2 MMD2u − MMD2 [F, p, q] → N 0, σ2u ,
where σ2u = 4 Ez (Ez′ h(z, z′ ))2 − [Ez,z′ (h(z, z′ ))]2 (Serfling, 1980, Section 5.5). Thus for MMD2u ,
the asymptotic variance is (up to scaling) the variance of Ez′ [h(z, z′ )], whereas for MMD2l it is
Varz,z′ [h(z, z′ )].
We end by noting another potential approach to reducing the cost of computing an empirical
MMD estimate, by using a low rank approximation to the Gram matrix (Fine and Scheinberg, 2001;
Williams and Seeger, 2001; Smola and Schölkopf, 2000). An incremental computation of the MMD
based on such a low rank approximation would require O(md) storage and O(md) computation
(where d is the rank of the approximate Gram matrix which is used to factorize both matrices)
rather than O(m) storage and O(m2 ) operations. That said, it remains to be determined what effect
this approximation would have on the distribution of the test statistic under H0 , and hence on the
test threshold.
740
A K ERNEL T WO -S AMPLE T EST
Definition 17 Let F be a subset of some vector space. The star S[F] of a set F is
S[F] := {α f | f ∈ F and α ∈ [0, ∞)}
Theorem 18 Denote by F the subset of some vector space of functions from X to R for which
S[F] ∩C(X) is dense in C(X) with respect to the L∞ (X) norm. Then MMD [F, p, q] = 0 if and only
if p = q, and MMD[F, p, q] is a metric on the space of probability distributions. Whenever the star
of F is not dense, the MMD defines a pseudo-metric space.
Proof It is clear that p = q implies MMD [F, p, q] = 0. The proof of the converse is very similar
to that of Theorem 5. Define H := S(F) ∩ C(X). Since by assumption H is dense in C(X), there
exists an h∗ ∈ H satisfying kh∗ − f k∞ < ε for all f ∈ C(X). Write h∗ := α∗ g∗ , where g∗ ∈ F. By
assumption, Ex g∗ − Ey g∗ = 0. Thus we have the bound
|Ex f (x) − Ey ( f (y))| ≤ |Ex f (x) − Ex h∗ (x)| + α∗ |Ex g∗ (x) − Ey g∗ (y)| + |Ey h∗ (y) − Ey f (y)|
≤ 2ε
for all f ∈ C(X) and ε > 0, which implies p = q by Lemma 1.
To show MMD [F, p, q] is a metric, it remains to prove the triangle inequality. We have
sup E p f − Eq f + sup Eq g − Er g ≥ sup E p f − Eq f + Eq f − Er
f ∈F g∈F f ∈F
≥ sup |E p f − Er f | .
f ∈F
Note that any uniform convergence statements in terms of F allow us immediately to characterize
an estimator of MMD(F, p, q) explicitly. The following result shows how (this reasoning is also the
basis for the proofs in Section 4, although here we do not restrict ourselves to an RKHS).
Theorem 19 Let δ ∈ (0, 1) be a confidence level and assume that for some ε(δ, m, F) the following
holds for samples {x1 , . . . , xm } drawn from p:
( )
m
1
PrX sup Ex [ f ] − ∑ f (xi ) > ε(δ, m, F) ≤ δ.
f ∈F m i=1
In this case we have that,
PrX,Y {|MMD[F, p, q] − MMDb [F, X,Y ]| > 2ε(δ/2, m, F)} ≤ δ,
where MMDb [F, X,Y ] is taken from Definition 2.
Proof The proof works simply by using convexity and suprema as follows:
|MMD[F, p, q] − MMDb [F, X,Y ]|
1 m 1 n
= sup |Ex [ f ] − Ey [ f ]| − sup ∑ f (xi ) − ∑ f (yi )
f ∈F f ∈F m i=1 n i=1
1 m 1 n
≤ sup Ex [ f ] − Ey [ f ] − ∑ f (xi ) + ∑ f (yi )
f ∈F m i=1 n i=1
1 m 1 n
≤ sup Ex [ f ] − ∑ f (xi ) + sup Ey [ f ] − ∑ f (yi ) .
f ∈F m i=1 f ∈F n i=1
741
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
Bounding each of the two terms via a uniform convergence bound proves the claim.
This shows that MMDb [F, X,Y ] can be used to estimate MMD[F, p, q], and that the quantity is
asymptotically unbiased.
1 m
Fp (z) := Pr {x ≤ z for x ∼ p} and FX (z) := ∑ 1z≤xi .
|X| i=1
It is clear that Fp captures the properties of p. The Kolmogorov metric is simply the L∞ distance
kFX − FY k∞ for two sets of observations X and Y . Smirnov (1939) showed that for p = q the limiting
distribution of the empirical cumulative distribution functions satisfies
n o ∞
kFX − FY k∞ > x = 2 ∑ (−1) j−1 e−2 j x for x ≥ 0,
mn
12 2 2
lim PrX,Y m+n (12)
m,n→∞
j=1
which is distribution independent. This allows for an efficient characterization of the distribution
under the null hypothesis H0 . Efficient numerical approximations to (12) can be found in numerical
analysis handbooks (Press et al., 1994). The distribution under the alternative p 6= q, however, is
unknown.
742
A K ERNEL T WO -S AMPLE T EST
The Kolmogorov metric is, in fact, a special instance of MMD[F, p, q] for a certain Banach
space (Müller, 1997, Theorem 5.2).
13
F be
the class of functions X → R of bounded variation 1.
Proposition 21 Let Then
MMD[F, p, q] = Fp − Fq ∞ .
where
| f (x) − f (y)|
k f kL := sup
x6=y ∈ X ρ(x, y)
is the Lipschitz seminorm14 for real valued f on X.
A simple example of this theorem is as follows (Dudley, 2002, Exercise 1, p. 425).
Example 2 Let X = R with associated ρ(x, y) = |x − y|. Then given f such that k f kL ≤ 1, we use
integration by parts to obtain
Z Z Z
f d(p − q) = (Fp − Fq )(x) f ′ (x)dx ≤ (Fp − Fq ) (x)dx,
13. A function f defined on [a, b] is of bounded variation C if the total variation is bounded by C, that is, the supremum
over all sums
∑ | f (xi ) − f (xi−1 )|,
1≤i≤n
where a ≤ x0 ≤ . . . ≤ xn ≤ b (Dudley, 2002, p. 184).
14. A seminorm satisfies the requirements of a norm besides kxk = 0 only for x = 0 (Dudley, 2002, p. 156).
743
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
where the maximum is attained for the function g with derivative g′ = 2 1Fp >Fq − 1 (and for which
kgkL = 1). We recover the L1 distance between distribution functions,
Z
W (P, Q) = (Fp − Fq ) (x)dx.
One may further generalize Theorem 23 to the set of all laws P(X) on arbitrary metric spaces X
(Dudley, 2002, Proposition 11.3.2).
Definition 24 (Bounded Lipschitz metric) Let p and q be laws on a metric space X. Then
Z
β(p, q) := sup f d(p − q)
k f kBL ≤1
is a metric on P(X), where f belongs to the space of bounded Lipschitz functions with norm
k f kBL := k f kL + k f k∞ .
Empirical estimates of the Monge-Wasserstein and Bounded Lipschitz metrics on Rd are provided
by Sriperumbudur et al. (2010a).
and to use the latter as the basis for a kernel method. This simple approach affords rather good
performance. With the benefit of hindsight, it is now understandable why the kernel
mi ,m j
1
k(Xi , X j ) =
mi m j ∑ k(xiu , x jv )
u,v
produces
useful
results: it is simply the kernel between the empirical means in feature space
µ(Xi ), µ(X j ) (Hein et al., 2004, Equation 4). Jebara and Kondor (2003) later extended this set-
ting by smoothing the empirical densities before computing inner products.
Note, however, that the empirical mean embedding µX may not be the best statistic to use for
MIC: we are only interested in determining whether some instances in the domain have the desired
property, rather than making a statement regarding the distribution over all instances. Taking this
into account leads to an improved algorithm (Andrews et al., 2003).
744
A K ERNEL T WO -S AMPLE T EST
Here we assume V is an RKHS over X×Y with kernel v((x, y), (x′ , y′ )). If x and y are dependent, then
µ[pxy ] 6= µ[px × py ]. Hence we may use ∆(V, pxy , px × py ) := kµ[pxy ] − µ[px × py ]kV as a measure of
dependence.
Now assume that v((x, y), (x′ , y′ )) = k(x, x′ )l(y, y′ ), that is, the RKHS V is a direct product H ⊗ G
of RKHSs on X and Y. In this case it is easy to see that
∆2 (V, pxy , px × py ) = kExy [k(x, ·)l(y, ·)] − Ex [k(x, ·)] Ey [l(y, ·)]k2V
= Exy Ex′ y′ k(x, x′ )l(y, y′ ) − 2Ex Ey Ex′ y′ k(x, x′ )l(y, y′ )
+Ex Ey Ex′ Ey′ k(x, x′ )l(y, y′ ) .
The latter is also the squared Hilbert-Schmidt norm of the cross-covariance operator between RKHSs
(Gretton et al., 2005a): for characteristic kernels, this is zero if and only if x and y are independent.
Theorem 25 Denote by Cxy the covariance operator between random variables x and y, drawn
jointly from pxy , where the functions on X and Y are the reproducing kernel Hilbert spaces F and G
respectively. Then the Hilbert-Schmidt norm kCxy kHS equals ∆(V, pxy , px × py ).
Empirical estimates of this quantity are as follows:
Theorem 26 Denote by K and L the kernel matrices on X and Y respectively, and by H = I − 1/m
the projection matrix onto the subspace orthogonal to the vector with all entries set to 1 (where 1 is
an m × m matrix of ones). Then m−2 tr HKHL is an estimate of ∆2 with bias O(m−1 ). The deviation
from ∆2 is OP (m−1/2 ).
Gretton et al. (2005a) provide explicit constants. In certain circumstances, including in the case of
RKHSs with Gaussian kernels, the empirical ∆2 may also be interpreted in terms of a smoothed
difference between the joint empirical characteristic function (ECF) and the product of the marginal
ECFs (Feuerverger, 1993; Kankainen, 1995). This interpretation does not hold in all cases, however,
for example, for kernels on strings, graphs, and other structured spaces. An illustration of the wit-
ness function f ∗ ∈ V from Section 2.3 is provided in Figure 4, for the case of dependence detection.
This is a smooth function which has large magnitude where the joint density is most different from
the product of the marginals.
745
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
0.05
1 0.04
0.03
0.5
0.02
0.01
Y
0
0
−0.5 −0.01
−0.02
−1 −0.03
−0.04
−1.5
−1.5 −1 −0.5 0 0.5 1 1.5
X
Figure 4: Illustration of the function maximizing the mean discrepancy when MMD is used as a
measure of dependence. A sample from dependent random variables x and y is shown
in black, and the associated function fˆ∗ that witnesses the MMD is plotted as a contour.
The latter was computed empirically on the basis of 200 samples, using a Gaussian kernel
with σ = 0.2.
We remark that a hypothesis test based on the above kernel statistic is more complicated than
for the two-sample problem, since the product of the marginal distributions is in effect simulated
by permuting the variables of the original sample. Further details are provided by Gretton et al.
(2008b).
Shawe-Taylor and Dolia (2007) define a distance between distributions as follows: let H be a set of
functions on X and r be a probability distribution over H. Then the distance between two distribu-
tions p and q is given by
746
A K ERNEL T WO -S AMPLE T EST
That is, we compute the average distance between p and q with respect to a distribution over test
functions. The following result shows the relation with the MMD, and is due to Song et al. (2008,
Section 6).
Proof By definition Ex [ f (x)] = hµ p , f iH . Using linearity of the inner product, Equation (13) equals
Z
µ p − µq , f dr( f )
H
Z
* +
µ − µ
p q
=
µ p − µq
H
,f dr( f ),
µ p − µq
H
H
p µ −µ
q
where the integral is independent of p, q. To see this, note that for any p, q, is a unit vector
kµ p −µq kH
which can be transformed into the first canonical basis vector (for instance) by a rotation which
leaves the integral invariant, bearing in mind that r is rotation invariant.
8. Experiments
We conducted distribution comparisons using our MMD-based tests on data sets from three real-
world domains: database applications, bioinformatics, and neurobiology. We investigated both
uniform convergence approaches (MMDb with the Corollary 9 threshold, and MMD2u H with the
Corollary 11 threshold); the asymptotic approaches with bootstrap (MMD2u B) and moment match-
ing to Pearson curves (MMD2u M), both described in Section 5; and the asymptotic approach using
the linear time statistic (MMD2l ) from Section 6. We also compared against several alternatives from
747
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
the literature (where applicable): the multivariate t-test, the Friedman-Rafsky Kolmogorov-Smirnov
generalisation (Smir), the Friedman-Rafsky Wald-Wolfowitz generalisation (Wolf), the Biau-Györfi
test (Biau) with a uniform space partitioning, and the Hall-Tajvidi test (Hall). See Section 3.3 for
details regarding these tests. Note that we do not apply the Biau-Györfi test to high-dimensional
problems (since the required space partitioning is no longer possible), and that MMD is the only
method applicable to structured data such as graphs.
An important issue in the practical application of the MMD-based tests is the selection of the
kernel parameters. We illustrate this with a Gaussian RBF kernel, where we must choose the kernel
width σ (we use this kernel for univariate and multivariate data, but not for graphs). The empirical
MMD is zero both for kernel size σ = 0 (where the aggregate Gram matrix over X and Y is a unit
matrix), and also approaches zero as σ → ∞ (where the aggregate Gram matrix becomes uniformly
constant). We set σ to be the median distance between points in the aggregate sample, as a compro-
mise between these two extremes: this remains a heuristic, similar to those described in Takeuchi
et al. (2006) and Schölkopf (1997), and the optimum choice of kernel size is an ongoing area of
research. We further note that setting the kernel using the sample being tested may cause changes to
the asymptotic distribution: in particular, the analysis in Sections 4 and 5 assumes the kernel not to
be a function of the sample. An analysis of the convergence of MMD when the kernel is adapted on
the basis of the sample is provided by Sriperumbudur et al. (2009), although the asymptotic distri-
bution in this case remains a topic of research. As a practical matter, however, the median heuristic
has not been observed to have much effect on the asymptotic distribution, and in experiments is
indistinguishable from results obtained by computing the kernel on a small subset of the sample set
aside for this purpose. See Appendix C for more detail.
In our first experiment, we investigated the scaling performance of the various tests as a function
of the dimensionality d of the space X ⊂ Rd , when both p and q were Gaussian. We considered
values of d up to 2500: the performance of the MMD-based tests cannot therefore be explained
in the context of density estimation (as in Section 3.3.1), since the associated density estimates are
necessarily meaningless here. The levels for all tests were set at α = 0.05, m = n = 250 samples were
used, and results were averaged over 100 repetitions. In the first case, the distributions had different
means and unit variance. The percentage of times the null hypothesis was correctly rejected over a
set of Euclidean distances between the distribution means (20 values logarithmically spaced from
0.05 to 50), was computed as a function of the dimensionality of the normal distributions. In case
of the t-test, a ridge was added to the covariance estimate, to avoid singularity (the ratio of largest
to smallest eigenvalue was ensured to be at most 2). In the second case, samples were drawn from
distributions N(0, I) and N(0, σ2 I) with different variance. The percentage of null rejections was
averaged over 20 σ values logarithmically spaced from 100.01 to 10. The t-test was not compared in
this case, since its output would have been irrelevant. Results are plotted in Figure 5.
In the case of Gaussians with differing means, we observe the t-test performs best in low di-
mensions, however its performance is severely weakened when the number of samples exceeds the
number of dimensions. The performance of MMD2u M is comparable to the t-test in low dimen-
sions, and outperforms all other methods in high dimensions. The worst performance is obtained
for MMD2u H, though MMDb also does relatively poorly: this is unsurprising given that these tests
748
A K ERNEL T WO -S AMPLE T EST
A Normal dist. having different means B Normal dist. having different variances
1 1
percent correctly rejecting H0
0.8 0.8
0.6 0.6
0 0
0 1 2 3 4 0 1 2 3 4
10 10 10 10 10 10 10 10 10 10
Dimension Dimension
Figure 5: Type II performance of the various tests when separating two Gaussians, with test level
α = 0.05. A Gaussians having same variance and different means. B Gaussians having
same mean and different variances.
derive from distribution-free large deviation bounds, and the sample size is relatively small. Re-
markably, MMD2l performs quite well compared with the Section 3.3.3 tests in high dimensions.
In the case of Gaussians of differing variance, the Hall test performs best, followed closely
by MMD2u M. FR Wolf and (to a much greater extent) FR Smirnov both have difficulties in high
dimensions, failing completely once the dimensionality becomes too great. The linear-cost test
MMD2l again performs surprisingly well, almost matching the MMD2u M performance at the highest
dimensionality. Both MMD2u H and MMDb perform poorly, the former failing completely: this
is one of several illustrations we will encounter of the much greater tightness of the Corollary 9
threshold over that in Corollary 11.
In our next application of MMD, we performed distribution testing for data integration: the objec-
tive being to aggregate two data sets into a single sample, with the understanding that both original
samples were generated from the same distribution. Clearly, it is important to check this last con-
dition before proceeding, or an analysis could detect patterns in the new data set that are caused
by combining the two different source distributions. We chose several real-world settings for this
task: we compared microarray data from normal and tumor tissues (Health status), microarray data
from different subtypes of cancer (Subtype), and local field potential (LFP) electrode recordings
from the Macaque primary visual cortex (V1) with and without spike events (Neural Data I and
II, as described in more detail by Rasch et al., 2008). In all cases, the two data sets have different
statistical properties, but the detection of these differences is made difficult by the high data dimen-
sionality (indeed, for the microarray data, density estimation is impossible given the sample size and
data dimensionality, and no successful test can rely on accurate density estimates as an intermediate
step).
749
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
Data Set Attr. MMDb MMD2u H MMD2u B MMD2u M t-test Wolf Smir Hall
Neural Data I Same 100.0 100.0 96.5 96.5 100.0 97.0 95.0 96.0
Different 38.0 100.0 0.0 0.0 42.0 0.0 10.0 49.0
Neural Data II Same 100.0 100.0 94.6 95.2 100.0 95.0 94.5 96.0
Different 99.7 100.0 3.3 3.4 100.0 0.8 31.8 5.9
Health status Same 100.0 100.0 95.5 94.4 100.0 94.7 96.1 95.6
Different 100.0 100.0 1.0 0.8 100.0 2.8 44.0 35.7
Subtype Same 100.0 100.0 99.1 96.4 100.0 94.6 97.3 96.5
Different 100.0 100.0 0.0 0.0 100.0 0.0 28.4 0.2
Table 1: Distribution testing for data integration on multivariate data. Numbers indicate the per-
centage of repetitions for which the null hypothesis (p=q) was accepted, given α = 0.05.
Sample size (dimension; repetitions of experiment): Neural I 4000 (63; 100) ; Neural II
1000 (100; 1200); Health Status 25 (12,600; 1000); Subtype 25 (2,118; 1000).
Data Set Attr. MMDb MMD2u H MMD2u B MMD2u M t-test Wolf Smir Hall Biau
BIO Same 100.0 100.0 93.8 94.8 95.2 90.3 95.8 95.3 99.3
Different 20.0 52.6 17.2 17.6 36.2 17.2 18.6 17.9 42.1
FOREST Same 100.0 100.0 96.4 96.0 97.4 94.6 99.8 95.5 100.0
Different 3.9 11.0 0.0 0.0 0.2 3.8 0.0 50.1 0.0
CNUM Same 100.0 100.0 94.5 93.8 94.0 98.4 97.5 91.2 98.5
Different 14.9 52.7 2.7 2.5 19.17 22.5 11.6 79.1 50.5
FOREST10D Same 100.0 100.0 94.0 94.0 100.0 93.5 96.5 97.0 100.0
Different 86.6 100.0 0.0 0.0 0.0 0.0 1.0 72.0 100.0
Table 2: Naive attribute matching on univariate (BIO, FOREST, CNUM) and multivariate (FOR-
EST10D) data. Numbers indicate the percentage of times the null hypothesis p = q was
accepted with α = 0.05, pooled over attributes. Sample size (dimension; attributes; repeti-
tions of experiment): BIO 377 (1; 6; 100); FOREST 538 (1; 10; 100); CNUM 386 (1; 13;
100); FOREST10D 1000 (10; 2; 100).
We applied our tests to these data sets in the following fashion. Given two data sets A and B,
we either chose one sample from A and the other from B (attributes = different); or both samples
from either A or B (attributes = same). We then repeated this process up to 1200 times. Results
are reported in Table 1. Our asymptotic tests perform better than all competitors besides Wolf : in
the latter case, we have greater Type II error for one neural data set, lower Type II error on the
Health Status data (which has very high dimension and low sample size), and identical (error-free)
performance on the remaining examples. We note that the Type I error of the bootstrap test on the
Subtype data set is far from its design value of 0.05, indicating that the Pearson curves provide a
better threshold estimate for these low sample sizes. For the remaining data sets, the Type I errors
of the Pearson and Bootstrap approximations are close. Thus, for larger data sets, the bootstrap is
to be preferred, since it costs O(m2 ), compared with a cost of O(m3 ) for the Pearson curves (due to
the cost of computing (11)). Finally, the uniform convergence-based tests are too conservative, with
MMDb finding differences in distribution only for the data with largest sample size, and MMD2u H
never finding differences.
750
A K ERNEL T WO -S AMPLE T EST
751
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
0
−2 10
10
−2
10
−3
10
−4
10
2 3 4 5 6 2 3 4 5 6
10 10 10 10 10 10 10 10 10 10
Sample # Sample #
−1
10 2
10
Type II error
0
−2 10
10
−2
10
−3
10
−4
10
2 3 4 5 6 2 3 4 5 6
10 10 10 10 10 10 10 10 10 10
Sample # Sample #
−1
10 2
10
Type II error
0
−2 10
10
−2
10
−3
10
−4
10
2 3 4 5 6 2 3 4 5 6
10 10 10 10 10 10 10 10 10 10
Sample # Sample #
Figure 6: Linear-cost vs quadratic-cost MMD. The first column shows Type II performance, and
the second shows runtime. The dashed grey horizontal line indicates zero Type II error
(required due to log y-axis).
752
A K ERNEL T WO -S AMPLE T EST
is also called table matching for tables from different databases. We performed attribute matching
as follows: first, the data set D was split into two halves A and B. Each of the n attributes in A (and
B, resp.) was then represented by its instances in A (resp. B). We then tested all pairs of attributes
from A and from B against each other, to find the optimal assignment of attributes A1 , . . . , An from
A to attributes B1 , . . . , Bn from B. We assumed that A and B contain the same number of attributes.
As a naive approach, we could assume that any possible pair of attributes might correspond,
and thus that every attribute of A needs to be tested against all the attributes of B to find the opti-
mal match. We report results for this naive approach, aggregated over all pairs of possible attribute
matches, in Table 2. We used three data sets: the census income data set from the UCI KDD archive
(CNUM), the protein homology data set from the 2004 KDD Cup (BIO) (Caruana and Joachims,
2004), and the forest data set from the UCI ML archive (Blake and Merz, 1998). For the final data
set, we performed univariate matching of attributes (FOREST) and multivariate matching of tables
(FOREST10D) from two different databases, where each table represents one type of forest. Both
our asymptotic MMD2u -based tests perform as well as or better than the alternatives, notably for
CNUM, where the advantage of MMD2u is large. Unlike in Table 1, the next best alternatives are not
consistently the same across all data: for example, in BIO they are Wolf or Hall, whereas in FOR-
EST they are Smir, Biau, or the t-test. Thus, MMD2u appears to perform more consistently across
the multiple data sets. The Friedman-Rafsky tests do not always return a Type I error close to the
design parameter: for instance, Wolf has a Type I error of 9.7% on the BIO data set (on these data,
MMD2u has the joint best Type II error without compromising the designed Type I performance).
Finally, MMDb performs much better than in Table 1, although surprisingly it fails to reliably detect
differences in FOREST10D. The results of MMD2u H are also improved, although it remains among
the worst performing methods.
A more principled approach to attribute matching is also possible. Assume that
φ(A) = (φ1 (A1 ), φ2 (A2 ), ..., φn (An )): in other words, the kernel decomposes into kernels on the indi-
vidual attributes of A (and also decomposes this way on the attributes of B). In this case, MMD2 can
be written ∑ni=1 kµi (Ai ) − µi (Bi )k2 , where we sum over the MMD terms on each of the attributes.
Our goal of optimally assigning attributes from B to attributes of A via MMD is equivalent to finding
the optimal permutation π of attributes of B that minimizes ∑ni=1 kµi (Ai ) − µi (Bπ(i) )k2 . If we define
Ci j = kµi (Ai ) − µi (B j )k2 , then this is the same as minimizing the sum over Ci,π(i) . This is the linear
assignment problem, which costs O(n3 ) time using the Hungarian method (Kuhn, 1955).
While this may appear to be a crude heuristic, it nonetheless defines a semi-metric on the sample
spaces X and Y and the corresponding distributions p and q. This follows from the fact that matching
distances are proper metrics if the matching cost functions are metrics. We formalize this as follows:
d
∆[F, p, q] := min ∑ MMD[F, pi , qπ(i) ].
π∈Π
i=1
Proof Clearly ∆[F, p, q] is nonnegative, since it is a sum of nonnegative quantities. Next we show
the triangle inequality. Denote by r a third distribution on Rd and let π p,q , πq,r and π p,r be the
753
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
distance minimizing permutations over the associated pairs from {p, q, r}. It follows that
d d
∆[F, p, q] + ∆[F, q, r] = ∑ MMD[F, pi , qπ p,q (i) ] + ∑ MMD[F, qi , rπq,r (i) ]
i=1 i=1
d
≥ ∑ MMD[F, pi , r[π p,q ◦πq,r ](i) ] ≥ ∆[F, p, r].
i=1
MMD[F, pi , qπ p,q (i) ] + MMD[F, qπ p,q (i) , r[π p,q ◦πq,r ](i) ] ≥ MMD[F, pi , r[π p,q ◦πq,r ](i) ].
We tested this ’Hungarian approach’ to attribute matching via MMD2u B on three univariate
data sets (BIO, CNUM, FOREST) and for table matching on a fourth (FOREST10D). To study
MMD2u B on structured data, we used two data sets of protein graphs (PROTEINS and ENZYMES)
and used the graph kernel for proteins from Borgwardt et al. (2005) for table matching via the
Hungarian method (the other tests were not applicable to these graph data). The challenge here is
to match tables representing one functional class of proteins (or enzymes) from data set A to the
corresponding tables (functional classes) in B. Results are shown in Table 3. Besides on the BIO
and CNUM data sets, MMD2u B made no errors.
Data Set Data type No. attributes Sample size Repetitions % correct
BIO univariate 6 377 100 90.0
CNUM univariate 13 386 100 99.8
FOREST univariate 10 538 100 100.0
FOREST10D multivariate 2 1000 100 100.0
ENZYME structured 6 50 50 100.0
PROTEINS structured 2 200 50 100.0
Table 3: Hungarian Method for attribute matching via MMD2u B on univariate (BIO, CNUM,
FOREST), multivariate (FOREST10D), and structured (ENZYMES, PROTEINS) data
(α = 0.05; “% correct” is the percentage of correct attribute matches over all repetitions).
9. Conclusion
We have established three simple multivariate tests for comparing two distributions p and q, based
on samples of size m and n from these respective distributions. Our test statistic is the maximum
mean discrepancy (MMD), defined as the maximum deviation in the expectation of a function eval-
uated on each of the random variables, taken over a sufficiently rich function class: in our case, a
reproducing kernel Hilbert space (RKHS). Equivalently, the statistic can be written as the norm of
the difference between distribution feature means in the RKHS. We do not require density estimates
as an intermediate step. Two of our tests provide Type I error bounds that are exact and distribution-
free for finite sample sizes. We also give a third test based on quantiles of the asymptotic distribution
754
A K ERNEL T WO -S AMPLE T EST
of the associated test statistic. All three tests can be computed in O((m + n)2 ) time, however when
sufficient data are available, a linear time statistic can be used, which in our experiments was able to
achieve a given Type II error at smaller computational cost, by looking at many more samples than
the quadratic-cost tests.
We have seen in Section 7 that several classical metrics on probability distributions can be writ-
ten as integral probability metrics with function classes that are not Hilbert spaces, but rather Banach
or seminormed spaces (for instance the Kolmogorov-Smirnov and Earth Mover’s distances). It is
therefore of interest to establish under what conditions one could write these discrepancies in terms
of norms of differences of mean embeddings. Sriperumbudur et al. (2011b) provide expressions
for the maximum mean discrepancy in terms of mean embeddings in reproducing kernel Banach
spaces. When the Banach space is not an RKBS, the question of establishing a mean embedding
interpretation for the MMD remains open.
We also note (following Section 7.3) that the MMD for RKHSs is associated with a particular
kernel between probability distributions. Hein et al. (2004) describe several further such kernels,
which induce corresponding distances between feature space distribution mappings: these may in
turn lead to new and powerful two-sample tests.
Two recent studies have shown that additional divergence measures between distributions can
be obtained empirically through optimization in a reproducing kernel Hilbert space. Harchaoui
et al. (2008) define a two-sample test statistic arising from the kernel Fisher discriminant, rather
than the difference of RKHS means; and Nguyen et al. (2008) obtain a KL divergence estimate
by approximating the ratio of densities (or its log) with a function in an RKHS. By design, both
these kernel-based statistics prioritise different features of p and q when measuring the divergence
between distributions, and the resulting effects on distinguishability of distributions are therefore of
interest.
Acknowledgments
We would like to thank the anonymous referees, whose suggestions greatly improved the paper;
Bharath Sriperumbudur, for thoroughly proofreading the final draft; Sivaraman Balakrishnan, Philipp
Berens, Olivier Bousquet, Corinna Cortes, Omri Guttman, Peter Hall, Matthias Hein, John Lang-
ford, Mehryar Mohri, Novi Quadrianto, Le Song, and Vishy Vishwanathan, for constructive dis-
cussions; Patrick Warnat (DKFZ, Heidelberg), for providing the microarray data sets; and Nikos
Logothetis, for providing the neural data sets. National ICT Australia is funded through the Aus-
tralian Government’s Backing Australia’s Ability initiative, in part through the Australian Research
Council. This work was supported in part by the IST Programme of the European Community,
under the PASCAL Network of Excellence, IST-2002-506778, and by the Austrian Science Fund
(FWF), project # S9102-N04.
Appendix A. Large Deviation Bounds for Tests with Finite Sample Guarantees
This section contains proofs of the theorems of Section 4.1. We begin in Section A.1 with a review
of McDiarmid’s inequality and the Rademacher average of a function class. We prove Theorem 7
in Section A.2, and Theorem 8 in Section A.3.
755
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
Theorem 29 (McDiarmid’s inequality) Let f : Xm → R be a function such that for all i ∈ {1, . . . , m},
there exist ci < ∞ for which
We also define the Rademacher average of the function class F with respect to the m-sample X.
Definition 30 (Rademacher average of F on X) Let F be the unit ball in an RKHS on the domain
X, with kernel bounded according to 0 ≤ k(x, y) ≤ K. Let X be an i.i.d. sample of size m drawn
according to a probability measure p on X, and let σi be i.i.d and take values in {−1, 1} with equal
probability. We define the Rademacher average
1 m
Rm (F, X) := Eσ sup ∑ σi f (xi )
f ∈F m i=1
≤ (K/m)1/2 ,
where the upper bound is due to Bartlett and Mendelson (2002, Lemma 22), and Eσ denotes the
expectation over all the σi . Similarly, we define
1 m
Rm (F, p) := Ex,σ sup ∑ σi f (xi ) .
f ∈F m i=1
756
A K ERNEL T WO -S AMPLE T EST
Second, we provide an upper bound on the difference between ∆(p, q, X,Y ) and its expectation.
Changing either of xi or yi in ∆(p, q, X,Y ) results in changes in magnitude of at most 2K 1/2 /m
or 2K 1/2 /n, respectively. We can then apply McDiarmid’s theorem, given a denominator in the
exponent of
2 2
1/2 1/2 1 1 m+n
m 2K /m + n 2K /n = 4K + = 4K ,
m n mn
to obtain
ε2 mn
PrX,Y (∆(p, q, X,Y ) − EX,Y [∆(p, q, X,Y )] > ε) ≤ exp − . (15)
2K(m + n)
For our final result, we exploit symmetrisation, following, for example, van der Vaart and Wellner
(1996, p. 108), to upper bound the expectation of ∆(p, q, X,Y ). Denoting by X ′ an i.i.d sample of
size m drawn independently of X (and likewise for Y ′ ), we have
where (a) uses Jensen’s inequality, (b) uses the triangle inequality, (c) substitutes Definition 30 (the
Rademacher average), and (d) bounds the Rademacher averages, also via Definition 30.
Having established our preliminary results, we proceed to the proof of Theorem 7.
Proof (Theorem 7) Combining Equations (15) and (16), gives
h i ε2 mn
PrX,Y ∆(p, q, X,Y ) − 2 (K/m) + (K/n)
1/2 1/2
> ε ≤ exp − .
2K(m + n)
757
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
The McDiarmid bound on the difference between (17) and its expectation is now a function of 2m
2
observations in (17), and has a denominator in the exponent of 2m 2K 1/2 /m = 8K/m. We use a
different strategy in obtaining an upper bound on the expected (17), however: this is now
" #
1 m
EX,X ′ sup ∑ f (xi ) − f (xi′ )
f ∈F m i=1
1
m
EX,X ′
∑ φ(xi ) − φ(xi′ )
=
m
i=1
" #1
1 m m 2
= EX,X ′ ∑ ∑ k(xi , x j ) + k(xi , x j ) − k(xi , x j ) − k(xi , x j )
′ ′ ′ ′
m i=1 j=1
1 1
≤ 2mEx k(x, x) + 2m(m − 1)Ex,x′ k(x, x′ ) − 2m2 Ex,x′ k(x, x′ ) 2
m
1
2 ′
2
= Ex,x′ k(x, x) − k(x, x ) (18)
m
≤ (2K/m)1/2 . (19)
We remark that both (18) and (19) bound the amount by which our biased estimate of the population
MMD exceeds zero under H0 . Combining the three results, we find that under H0 ,
21 ! 2
2 −ε m
PrX,X ′ ′
MMDb (F, X, X ) − ′
Ex,x′ k(x, x) − k(x, x ) >ε < exp and
m 4K
2
−ε m
PrX,X ′ MMDb (F, X, X ) − (2K/m) > ε
′ 1/2
< exp .
4K
758
A K ERNEL T WO -S AMPLE T EST
We begin with the asymptotic analysis of MMD2u [F, X,Y ] under the null hypothesis. This is based
on the reasoning of Anderson et al. (1994, Appendix), bearing in mind the following changes:
• we do not need to deal with the bias terms S1 j in Anderson et al. (1994, Appendix) that vanish
for large sample sizes, since our statistic is unbiased;
• we require greater generality, since our kernels are not necessarily inner products in L2 be-
tween probability density functions (although this is a special case: see Section 3.3.1).
We first transform each term in the sum (20) by centering. Under H0 , both x and y have the same
mean embedding µ p . Thus we replace each instance of k(xi , x j ) in the sum with a kernel k̃(xi , x j )
between feature space mappings from which the mean has been subtracted,
k̃(xi , x j ) := φ(xi ) − µ p , φ(x j ) − µ p H
= k(xi , x j ) − Ex k(xi , x) − Ex k(x, x j ) + Ex,x′ k(x, x′ ).
The centering terms cancel across the three terms (the distance between the two points is unaffected
by an identical global shift in both the points). This gives the equivalent form of the empirical
MMD,
m m n n
1 1
MMD2u [F, X,Y ] = ∑ ∑
m(m − 1) i=1 j6=i
k̃(xi , x j ) +
n(n − 1) i=1 ∑
∑ k̃(yi , y j )
j6=i
2 m n
− ∑ ∑ k̃(xi , y j ),
mn i=1
(21)
j=1
where each of the three sums has expected value zero. Note in particular that the U-statistics in
k̃(xi , x j ) are degenerate, meaning
759
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
According to Reed and Simon (1980, Theorem VI.23), this operator is Hilbert-Schmidt, and hence
compact, if and only if the kernel k̃ is square integrable under p,
k̃ ∈ L2 (X × X, p × p) . (23)
We may write the kernel k̃(xi , x j ) in terms of eigenfunctions ψl (x) with respect to the probability
measure p,
∞
k̃(x, x′ ) = ∑ λl ψl (x)ψl (x′ ), (24)
l=1
where
Z
k̃(x, x′ )ψi (x)d p(x) = λi ψi (x′ ),
X
Z
ψi (x)ψ j (x)d p(x) = δi j , (25)
X
and the convergence is in L2 (X × X, p × p). Since the operator is Hilbert-Schmidt, we have by Reed
and Simon (1980, Theorem VI.22) that ∑ λ2i < ∞.
Using the degeneracy of the U-statistic in (22), then when λi 6= 0,
Z
λi E ψi (x ) =
x′
′
Ex′ k̃(x, x′ )ψi (x)d p(x)
X
= 0,
and hence
Ex ψi (x) = 0. (26)
In other words, the eigenfunctions ψi (x) are zero mean and uncorrelated.
We now use these results to find the asymptotic distribution of (21). First,
1 m m 1 m m ∞
∑ ∑ k̃(xi , x j )
m i=1
= ∑ ∑ ∑ λl ψl (xi )ψl (x j )
m i=1
j6=i j6=i l=1
!2
∞
1
= ∑ λl ∑ ψl (xi ) − ∑ ψ2l (xi )
m l=1 i i
∞
→
D
∑ λl (a2l − 1), (27)
l=1
where al ∼ N(0, 1) are i.i.d., and the final relation denotes convergence in distribution, which is
proved by Serfling (1980, Section 5.5.2) using (25) and (26).16 Given that the random variables
a2l are zero mean with finite variance, it can be shown either via Kolmogorov’s inequality or by
the Martingale convergence theorem that the above sum converges almost surely if ∑∞ l=1 λl < ∞
2
(Grimmet and Stirzaker, 2001, Chapter 7.11 Exercise 30). As we have seen, this is guaranteed
under the assumption (23).
Likewise
∞
1 n n
∑ ∑ i j D ∑ λl (b2l − 1),
n i=1
k̃(y , y ) →
j6=i l=1
16. Simply replace h̃2 (xi , x j ) with k̃(xi , x j ) in Serfling (1980, top of p. 196).
760
A K ERNEL T WO -S AMPLE T EST
both jointly in distribution with (27), where (28) is proved at the end of the section. We now combine
these results. Define t = m + n, and assume limm,n→∞ m/t → ρx and limm,n→∞ n/t → ρy := (1 − ρx )
for fixed 0 < ρx < 1. Then
∞ ∞ ∞
2
x ∑ λl (al − 1) + ρy ∑ λl (bl − 1) − √
tMMD2u [F, X,Y ] → ρ−1 ∑ λl al bl
2 −1 2
D
l=1 l=1 ρx ρy l=1
∞ h i
∑
−1/2 −1/2 −1
= λl (ρx al − ρy bl ) 2
− (ρx ρy ) .
l=1
Proof (Equation 28) The proof is a modification of the result for convergence of degenerate U-
statistics of Serfling (1980, Section 5.5.2). We only provide those details that differ from the proof
of Serfling, and otherwise refer to the steps in the original proof as needed. First, using (24) to
expand out the centred kernel, we may write
1 m n 1 m n ∞
Tmn := √ ∑ ∑ k̃(xi , y j ) = √ ∑ ∑ ∑ λl ψl (xi )ψl (y j ).
mn i=1 j=1 mn i=1 j=1 l=1
1 m n L
TmnL := √ ∑ ∑ ∑ λl ψl (xi )ψl (y j ).
mn i=1 j=1 l=1
where we need to show that for large enough L, each of the three terms vanish.
First term: We first show that for large enough L, Tmn and TmnL are close in distribution. From
Serfling (1980, p. 197),
h i1/2
EX,Y eısTmn − EX,Y eısTmnL ≤ |s| EX,Y (Tmn − TmnL )2 ,
761
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
and we may write the difference between the full sum and its truncation as
!
1 m n L
Tmn − TmnL = √ ∑ ∑ k̃(xi , y j ) − ∑ λl ψl (xi )ψl (y j ) .
mn i=1 j=1 l=1
| {z }
gK (xi ,y j )
Each of the properties (Serfling, 1980, Equations (6a)-(6c) p. 197) still holds for gK , namely
Ex,x′ gK (x, x′ ) = 0,
∞
Ex,x′ g2K (x, x′ ) = ∑ λ2l ,
l=L+1
′
Ex gK (x, x ) = 0.
Then
1 m m n n
EX,Y (Tmn − TmnL )2 = ∑ ∑ ∑ ∑ Exi ,xq ,y j ,yr [gK (xi , y j )gK (xq , yr )]
mn i=1 q=1 j=1 r=1
1 m n
mn ∑i=1 ∑ j=1 Ex,x′ g2K (x, x′ ) i = q and j = r,
=
0 otherwise.
where we have used that p = q under H0 , which allows us to replace Ex,y with Ex,x′ in the final line.
It follows that for large enough L,
" #1/2
h i1/2 m n
1
∑ ∑ Ex,x′ g2K (x, x′ )
|s| EX,Y (Tmn − TmnL )2 = |s|
mn i=1 j=1
" #1/2
∞
= |s| ∑ λ2l
l=L+1
< ε.
1 m 1 n
Wml = √ ∑ ψl (xi ), Wnl′ = √ ∑ ψl (y j ),
m i=1 n i=1
762
A K ERNEL T WO -S AMPLE T EST
Moreover, the vectors Wm and Wn′ are independent. The result (29) then holds by the Lindberg-Lévy
CLT (Serfling, 1980, Theorem 1.9.1A).
Third term: From Serfling (1980, p. 199), we have
h i1/2
Ea,b eısVL − Ea,b eısV ≤ |s| Ea,b (V −VL )2 .
We begin by transforming this statistic by centering the samples X and Y in feature space by µ p and
µq , respectively; unlike the H0 case, however, µ p 6= µq , and the new statistic MMD2c is not the same
as MMD2u . The first term is centered as in (9). The second and third terms are respectively replaced
by
n n
1
∑ ∑
φ(yi ) − µq , φ(y j ) − µq H
n(n − 1) i=1 j6=i
and
2 m n
∑ ∑
φ(xi ) − µ p , φ(y j ) − µq H .
mn i=1 j=1
763
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
1
∑ ∑
MMD2c [F, X,Y ] = φ(xi ) − µ p , φ(x j ) − µ p H
m(m − 1) i=1 j6=i
n n
1 2 m n
∑ ∑ ∑ ∑
+ φ(yi ) − µq , φ(y j ) − µq H − φ(xi ) − µ p , φ(y j ) − µq H .
n(n − 1) i=1 j6=i mn i=1 j=1
in the difference between MMD2u and MMD2c are straightforward sums of independent zero mean
random variables, and have Gaussian asymptotic distribution. Defining uy to be the zero mean
Gaussian random variable associated with the first term,
t n
t −1/2 n gt
∑ gt , φ(yi ) − µq H = n ct ∑
, φ(yi ) − µq
n i=1 i=1 kgt kH H
−1/2
→ cρy uy .
D
Likewise,
t m
∑ hgt , φ(xi ) − µ p iH →
−1/2
cρx ux ,
m i=1 D
where ux is a zero mean Gaussian random variable independent of uy (note, however, that ux and
uy are correlated with terms in S, and are defined on the same probability space as al and bl in this
sum). Finally,
t hgt , gt iH = c2 .
764
A K ERNEL T WO -S AMPLE T EST
which is constant in t, and increases as c → ∞. Thus, kgt kH = ct −1/2 is the minimum distance
between µ p and µq distinguishable by the asymptotic MMD-based test.
and, importantly,
where the first term in the penultimate line is zero due to (31). Note that variance and 2nd moment
are the same under the zero mean assumption. 3
3rd moment: We consider the terms that appear in the expansion of E MMD2u . These are
all of the form 3
2
E(hab hcd he f ),
n(n − 1)
where we shorten hab = h(za , zb ), and we know za and zb are always independent. Most of the terms
vanish due to (30) and (31). The first terms that remain take the form
3
2
E(hab hbc hca ),
n(n − 1)
765
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
References
Y. Altun and A.J. Smola. Unifying divergence minimization and statistical inference via convex
duality. In Proc. Annual Conf. Computational Learning Theory, LNCS, pages 139–153. Springer,
2006.
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepan-
cies between two multivariate probability density functions using kernel-based density estimates.
Journal of Multivariate Analysis, 50:41–54, 1994.
766
A K ERNEL T WO -S AMPLE T EST
0.6 0.6
Type II error
Type II error
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
50 100 150 200 250 300 50 100 150 200 250 300
Sample size m Sample size m
Figure 7: Type II error on the Neural Data I set, for kernel computed via the median heuristic on
the full data set (“All”), kernel computed via the median heuristic on a 10% hold-out
set (“Train”), and kernel computed via the median heuristic on 90% of the data (“Part”).
Results are plotted over 1000 repetitions. Left: Bootstrap results. Right: Pearson curve
results.
M. Arcones and E. Giné. On the bootstrap of u and v statistics. The Annals of Statistics, 20(2):
655–674, 1992.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning
Research, 3:1–48, 2002.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural
results. Journal of Machine Learning Research, 3:463–482, 2002.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adap-
tation. In Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press,
2007.
A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics.
Kluwer, 2004.
G. Biau and L. Gyorfi. On the asymptotic properties of a nonparametric l1 -test statistic of homo-
geneity. IEEE Transactions on Information Theory, 51(11):3965–3973, 2005.
P. Bickel. A distribution free version of the Smirnov two sample test in the p-variate case. The
Annals of Mathematical Statistics, 40(1):1–23, 1969.
767
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. URL
http://www.ics.uci.edu/∼mlearn/MLRepository.html.
G. Casella and R. Berger. Statistical Inference. Duxbury, Pacific Grove, CA, 2nd edition, 2002.
B. Chazelle. A minimum spanning tree algorithm with inverse-Ackermann type complexity. Journal
of the ACM, 47:1028–1047, 2000.
P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287–314, 1994.
C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In
Proceedings of the International Conference on Algorithmic Learning Theory, volume 5254 of
Lecture Notes in Computer Science, pages 38–53. Springer, 2008.
M. Davy, A. Gretton, A. Doucet, and P. J. W. Rayner. Optimized support vector machines for
nonstationary signal classification. IEEE Signal Processing Letters, 9(12):442–445, 2002.
V. de la Peña and E. Giné. Decoupling: from Dependence to Independence. Springer, New York,
1999.
M. Dudı́k and R. E. Schapire. Maximum entropy distribution estimation with generalized regu-
larization. In Proceedings of the Annual Conference on Computational Learning Theory, pages
123–138. Springer Verlag, 2006.
M. Dudı́k, S. Phillips, and R.E. Schapire. Performance guarantees for regularized maximum entropy
density estimation. In Proceedings of the Annual Conference on Computational Learning Theory,
pages 472–486. Springer Verlag, 2004.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
W. Feller. An Introduction to Probability Theory and its Applications. John Wiley and Sons, New
York, 2nd edition, 1971.
A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):
419–433, 1993.
768
A K ERNEL T WO -S AMPLE T EST
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal
of Machine Learning Research, 2:243–264, 2001.
R. Fortet and E. Mourier. Convergence de la réparation empirique vers la réparation théorique. Ann.
Scient. École Norm. Sup., 70:266–285, 1953.
J. Friedman. On multivariate goodness-of-fit and two-sample testing. Technical Report SLAC-
PUB-10325, University of Stanford Statistics Department, 2003.
J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-
sample tests. The Annals of Statistics, 7(4):697–717, 1979.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with
reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In
Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008.
MIT Press.
T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proceedings of
the International Conference on Machine Learning, pages 179–186. Morgan Kaufmann Publish-
ers Inc., 2002.
E. Gokcay and J.C. Principe. Information theoretic clustering. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 24(2):158–171, 2002.
A. Gretton, O. Bousquet, A.J. Smola, and B. Schölkopf. Measuring statistical dependence with
Hilbert-Schmidt norms. In Proceedings of the International Conference on Algorithmic Learning
Theory, pages 63–77. Springer-Verlag, 2005a.
A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring
independence. Journal of Machine Learning Research, 6:2075–2129, 2005b.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-
sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520,
Cambridge, MA, 2007a. MIT Press.
A. Gretton, K. Borgwardt, M. Rasch, B. Schlkopf, and A. Smola. A kernel approach to comparing
distributions. Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), pages
1637–1641, 2007b.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two
sample problem. Technical Report 157, MPI for Biological Cybernetics, 2008a.
A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical
test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592,
Cambridge, MA, 2008b. MIT Press.
A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample
test. In Advances in Neural Information Processing Systems 22, Red Hook, NY, 2009. Curran
Associates Inc.
769
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
G. R. Grimmet and D. R. Stirzaker. Probability and Random Processes. Oxford University Press,
Oxford, third edition, 2001.
P. Hall and N. Tajvidi. Permutation tests for equality of distributions in high-dimensional settings.
Biometrika, 89(2):359–374, 2002.
Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant
analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press,
Cambridge, MA, 2008.
M. Hein, T.N. Lal, and O. Bousquet. Hilbertian metrics on probability measures and their appli-
cation in SVMs. In Proceedings of the 26th DAGM Symposium, pages 270–277, Berlin, 2004.
Springer.
N. Henze and M. Penrose. On the multivariate runs test. The Annals of Statistics, 27(1):290–298,
1999.
W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathe-
matical Statistics, 19(3):293–325, 1948.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the
American Statistical Association, 58:13–30, 1963.
T. Jebara and R. Kondor. Bhattacharyya and expected likelihood kernels. In Proceedings of the
Annual Conference on Computational Learning Theory, volume 2777 of LNCS, pages 57–71,
Heidelberg, Germany, 2003. Springer-Verlag.
D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the
International Conference on Very Large Data Bases, pages 180–191. VLDB Endowment, 2004.
H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quar-
terly, 2:83–97, 1955.
E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer, 3rd edition, 2005.
C. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7:
2651–2667, 2006.
A. Müller. Integral probability metrics and their generating classes of functions. Advances in
Applied Probability, 29(2):429–443, 1997.
770
A K ERNEL T WO -S AMPLE T EST
X.L. Nguyen, M. Wainwright, and M. Jordan. Estimating divergence functionals and the likelihood
ratio by penalized convex risk minimization. In Advances in Neural Information Processing
Systems 20, pages 1089–1096. MIT Press, Cambridge, MA, 2008.
M. Reed and B. Simon. Methods of modern mathematical physics. Vol. 1: Functional Analysis.
Academic Press, San Diego, 1980.
M. Reid and R. Williamson. Information, divergence and risk for binary experiments. Journal of
Machine Learning Research, 12:731–817, 2011.
Y. Rubner, C. Tomasi, and L.J. Guibas. The earth mover’s distance as a metric for image retrieval.
International Journal of Computer Vision, 40(2):99–121, 2000.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press,
Cambridge, MA, 2004.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University
Press, Cambridge, UK, 2004.
H. Shen, S. Jegelka, and A. Gretton. Fast kernel-based independent component analysis. IEEE
Transactions on Signal Processing, 57:3498 – 3511, 2009.
B. W. Silverman. Density Estimation for Statistical and Data Analysis. Monographs on statistics
and applied probability. Chapman and Hall, London, 1986.
N.V. Smirnov. On the estimation of the discrepancy between empirical curves of distribution for
two independent samples. Moscow University Mathematics Bulletin, 2:3–26, 1939. University
of Moscow.
771
G RETTON , B ORGWARDT, R ASCH , S CH ÖLKOPF AND S MOLA
A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Pro-
ceedings of the International Conference on Machine Learning, pages 911–918, San Francisco,
2000. Morgan Kaufmann Publishers.
A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions.
In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754,
pages 13–31. Springer, 2007.
L. Song, X. Zhang, A. Smola, A. Gretton, and B. Schölkopf. Tailoring density estimation via re-
producing kernel moment matching. In Proceedings of the International Conference on Machine
Learning, pages 992–999. ACM, 2008.
B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Learning in Hilbert vs. Banach spaces: A mea-
sure embedding viewpoint. In Advances in Neural Information Processing Systems 24. Curran
Associates Inc., Red Hook, NY, 2011b.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal
of Machine Learning Research, 2:67–93, 2001.
I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics.
Springer, 2008.
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
772
A K ERNEL T WO -S AMPLE T EST
J. E. Wilkins. A note on skewness and kurtosis. The Annals of Mathematical Statistics, 15(3):
333–335, 1944.
C. K. I. Williams and M. Seeger. Using the Nystrom method to speed up kernel machines. In
Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA, 2001.
MIT Press.
773