High-Dimensional, Two-Sample Testing
1 Introduction
Given two samples
$$X_1, \ldots, X_n \sim P, \qquad Y_1, \ldots, Y_m \sim Q,$$
we want to test
$$H_0: P = Q \quad \text{versus} \quad H_1: P \neq Q.$$
Throughout, we will assume that n/(n + m) → π ∈ (0, 1) as the sample size increases.
In low dimensions, there are many tests with good power. For example, we could use the
test statistic
$$T = \sup_t |\hat{F}_n(t) - \hat{G}_m(t)|$$
where $\hat{F}_n$ and $\hat{G}_m$ are the empirical cdf's of the two samples. To find the $\alpha$-level critical value we can use
asymptotic theory or permutation testing. But there are other approaches for the high-
dimensional case.
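Here is a minimal Python sketch of this statistic with a permutation critical value (the function names and the number of permutations are my own choices):

    import numpy as np

    def ks_stat(x, y):
        # T = sup_t |F_hat_n(t) - G_hat_m(t)|, evaluated at the pooled sample points
        z = np.sort(np.concatenate([x, y]))
        F = np.searchsorted(np.sort(x), z, side="right") / len(x)
        G = np.searchsorted(np.sort(y), z, side="right") / len(y)
        return np.max(np.abs(F - G))

    def perm_test(x, y, stat=ks_stat, B=999, seed=0):
        # permutation test: recompute the statistic under random relabelings of the pooled sample
        rng = np.random.default_rng(seed)
        z, n = np.concatenate([x, y]), len(x)
        T = stat(x, y)
        null = np.array([stat(*np.split(rng.permutation(z), [n])) for _ in range(B)])
        return T, (1 + np.sum(null >= T)) / (B + 1)   # statistic and permutation p-value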
2 Metrics
One way to define a test is to first define a metric between distributions. For example
$$d(P, Q) = \sup_{g \in \mathcal{G}} \left| \int g \, dP - \int g \, dQ \right|$$
for some class of functions $\mathcal{G}$. Here are some examples. If $\mathcal{G} = \{g : \|g\|_\infty \leq 1\}$ then $d(P, Q)$ is the total variation distance. If $\mathcal{G}$ is the set of $g$ such that
$$\sup_{x \neq y} \frac{|g(y) - g(x)|}{\|x - y\|} \leq 1$$
then $d(P, Q)$ is the earth-mover distance (or Wasserstein distance). This is equivalent to $\inf_R E_R \|X - Y\|$ where the infimum is over all joint distributions $R$ for $(X, Y)$ with marginals $P$ and $Q$. If $\mathcal{G} = \{I_{(-\infty, t]} : t \in \mathbb{R}^d\}$ then $d(P, Q)$ is the Kolmogorov-Smirnov distance. If $\mathcal{G}$ is the unit ball of a reproducing kernel Hilbert space (RKHS) with kernel $K$, then $d(P, Q)$ is the RKHS distance, also known as the maximum mean discrepancy. See
Sriperumbudur et al (2010) for more examples.
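For one-dimensional samples some of these metrics are easy to estimate directly. For example, here is a quick sketch using scipy's built-in empirical earth-mover distance (the simulated data are just for illustration):

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=500)   # sample from P
    y = rng.normal(0.3, 1.0, size=500)   # sample from Q
    # empirical earth-mover (Wasserstein-1) distance between the two samples
    print(wasserstein_distance(x, y))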
How do we know when to reject $H_0$? One approach is to find the limiting distribution of $T$ under $H_0$. This turns out to be, for the RKHS distance (suitably rescaled),
$$T^2 \rightsquigarrow \sum_{j=1}^{\infty} \lambda_j (Z_j^2 - 1)$$
where the $Z_j$'s are N(0,1) and the $\lambda_j$'s are the eigenvalues defined by
$$\int L(x, y) \psi_j(x) \, dP(x) = \lambda_j \psi_j(y)$$
where $L(x, y) = K(x, y) - E[K(x, X)] - E[K(X, y)] + E[K(X, X')]$, with $X$ and $X'$ independent draws from $P$. This limiting distribution is called a Gaussian chaos. It has infinitely many nuisance parameters (the $\lambda_j$'s), which makes it unusable in practice. Instead, we use the permutation distribution to choose the critical value.
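Concretely, the permutation test recomputes the kernel statistic under random relabelings of the pooled sample. Here is a minimal sketch; the Gaussian kernel, the fixed bandwidth h, and the function names are my own choices, not prescribed by the references:

    import numpy as np

    def mmd2(x, y, h=1.0):
        # squared RKHS (MMD) distance with Gaussian kernel K(u,v) = exp(-||u-v||^2 / (2 h^2))
        # x, y: arrays of shape (n, d) and (m, d)
        def K(a, b):
            d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
            return np.exp(-d2 / (2 * h ** 2))
        return K(x, x).mean() + K(y, y).mean() - 2 * K(x, y).mean()

    def mmd_perm_test(x, y, B=499, h=1.0, seed=0):
        rng = np.random.default_rng(seed)
        z, n = np.vstack([x, y]), len(x)
        T = mmd2(x, y, h)
        null = []
        for _ in range(B):
            idx = rng.permutation(len(z))
            null.append(mmd2(z[idx[:n]], z[idx[n:]], h))
        # permutation p-value
        return T, (1 + np.sum(np.array(null) >= T)) / (B + 1)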
For $\beta$-smooth densities, the minimax separation rate for this testing problem is $n^{-2\beta/(4\beta+d)}$. This was proved by Arias-Castro, Pelletier and Saligrama (2016) based on techniques developed by Ingster (1987). We'll discuss this more below.
The problem is that the kernel is hiding a lot. To see this, note that $T$ is essentially the same as
$$\int (\hat{p}_h(x) - \hat{q}_h(x))^2 \, dx$$
where $\hat{p}_h$ and $\hat{q}_h$ are kernel density estimators. This test was proposed by Anderson, Hall and Titterington (1994). But remember, the kernel has a tuning parameter. If it is Gaussian, there is a bandwidth. The statement $T - d(P, Q) = O_P(1/\sqrt{N})$ assumes we do not change the bandwidth. But to have good power, we need to let the bandwidth go to zero, and then we no longer have the fast rate. The power of the RKHS test in general, nonparametric settings is not well studied.
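To make the connection concrete, here is a rough one-dimensional sketch of the statistic $\int (\hat{p}_h - \hat{q}_h)^2$, approximating the integral by a Riemann sum on a grid; the Gaussian kernel and the grid size are arbitrary choices of mine:

    import numpy as np

    def kde(data, grid, h):
        # Gaussian kernel density estimate evaluated on a grid (1-d data)
        u = (grid[:, None] - data[None, :]) / h
        return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

    def l2_kde_stat(x, y, h, num_grid=512):
        # approximate the integral of (p_hat - q_hat)^2 by a Riemann sum
        lo, hi = min(x.min(), y.min()) - 3 * h, max(x.max(), y.max()) + 3 * h
        grid = np.linspace(lo, hi, num_grid)
        diff = kde(x, grid, h) - kde(y, grid, h)
        return np.sum(diff ** 2) * (grid[1] - grid[0])

The bandwidth h is exactly the tuning parameter discussed above; the critical value would again come from the permutation distribution.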
Now suppose we want a confidence interval for $\theta = d^2(P, Q)$. Unfortunately, there is no known practical method if we use the above estimator. However, we can use the idea in Gretton et al (2012) to get a simple (but statistically inefficient) method. Instead of using a $U$-statistic, we break the sample into blocks of size two. For simplicity, assume that $n_1 = n_2 = n$. Define
$$\hat{\theta} = \frac{2}{n} \sum_j h\big( (X_{2j-1}, Y_{2j-1}), (X_{2j}, Y_{2j}) \big) \equiv \frac{1}{m} \sum_j R_j,$$
where $m = n/2$ is the number of blocks and $h((x, y), (x', y')) = K(x, x') + K(y, y') - K(x, y') - K(x', y)$ is the kernel of the $U$-statistic. The $R_j$'s are i.i.d., so a Normal-approximation confidence interval for $\theta$ can be built from their sample mean and standard error.
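A minimal sketch of this block estimator and the resulting Normal-approximation confidence interval (the Gaussian kernel with a fixed bandwidth is my own choice):

    import numpy as np
    from scipy import stats

    def gauss_k(u, v, h=1.0):
        return np.exp(-np.sum((u - v) ** 2) / (2 * h ** 2))

    def block_mmd_ci(x, y, h=1.0, alpha=0.05):
        # theta_hat = average of R_j over disjoint blocks of size two
        n = (min(len(x), len(y)) // 2) * 2
        R = np.array([gauss_k(x[2*j], x[2*j+1], h) + gauss_k(y[2*j], y[2*j+1], h)
                      - gauss_k(x[2*j], y[2*j+1], h) - gauss_k(x[2*j+1], y[2*j], h)
                      for j in range(n // 2)])
        theta_hat = R.mean()
        se = R.std(ddof=1) / np.sqrt(len(R))
        z = stats.norm.ppf(1 - alpha / 2)
        return theta_hat, (theta_hat - z * se, theta_hat + z * se)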
3 Graph-Based Tests

Another class of tests is based on geometric graphs. Let $Z_1, \ldots, Z_N$ be the combined sample where $N = n + m$. Let $L_i = 1$ if $Z_i$ is from group 1 and $L_i = 2$ if $Z_i$ is from group 2. The test statistic is
$$T = \frac{1}{Nk} \sum_{i=1}^{N} \sum_{r=1}^{k} B_i(r)$$
where $B_i(r) = 1$ if the $r$th nearest neighbor of $Z_i$ has the same label as $Z_i$. This corresponds to forming a $k$ nearest neighbor graph and asking how many of the $k$ nearest neighbors are
from the same group as the node. The probability of getting the same label under H0 is
$\mu = \pi^2 + (1 - \pi)^2$.
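Here is a minimal sketch of the k-NN statistic using scikit-learn; the observed value can then be compared to $\mu$ via the permutation distribution (the function name and the default k are mine):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_stat(x, y, k=3):
        # fraction of the k nearest neighbors (in the pooled sample) that share the node's label
        # x, y: arrays of shape (n, d) and (m, d)
        z = np.vstack([x, y])
        labels = np.r_[np.ones(len(x)), 2 * np.ones(len(y))]
        nn = NearestNeighbors(n_neighbors=k + 1).fit(z)   # +1: each point is its own closest neighbor
        _, idx = nn.kneighbors(z)
        same = labels[idx[:, 1:]] == labels[:, None]
        return same.mean()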
In high dimensions we need to correct the test to account for some strange effects (Mondal, Biswas and Ghosh, 2015). If $P$ concentrates its data on a ring $R$ and $Q$ concentrates its data on a larger ring $S$ that surrounds $R$, then every point from $Q$ can have its nearest neighbor come from $P$.
Here is an example. Let's take $k = 1$ and $n = m$. Let $B_i = 1$ if the nearest neighbor of $Z_i$ is from the same group. The test statistic is $T = (2n)^{-1} \sum_i B_i$. We are testing
$$H_0: P(B_i = 1) = \frac{1}{2} \quad \text{versus} \quad H_1: P(B_i = 1) > \frac{1}{2}.$$
Suppose that $X_1, X_2 \sim N(\mu_1, \sigma_1^2 I)$ and $Y_1, Y_2 \sim N(\mu_2, \sigma_2^2 I)$. Take $\mu_1 = (a, \ldots, a)$ and $\mu_2 = (b, \ldots, b)$. Now, as $d \to \infty$,
$$\frac{1}{d}\|X_1 - X_2\|^2 \stackrel{P}{\to} 2\sigma_1^2, \qquad \frac{1}{d}\|Y_1 - Y_2\|^2 \stackrel{P}{\to} 2\sigma_2^2, \qquad \frac{1}{d}\|X_1 - Y_2\|^2 \stackrel{P}{\to} \sigma_1^2 + \sigma_2^2 + (a - b)^2.$$
Let $a = 0$, $b = 0.2$, $\sigma_1^2 = 1$, $\sigma_2^2 = 1.2$. Then $2\sigma_1^2 = 2$ and $\sigma_1^2 + \sigma_2^2 + (a - b)^2 = 2.24 < 2.4 = 2\sigma_2^2$, so for large $d$ the nearest neighbor of every $X_i$ is another $X$, and the nearest neighbor of every $Y_i$ is also an $X$:
$$\begin{array}{c|cccc|cccc}
 & X_1 & X_2 & \cdots & X_n & Y_1 & Y_2 & \cdots & Y_n \\
\hline
B_i & 1 & 1 & \cdots & 1 & 0 & 0 & \cdots & 0
\end{array}$$
We will not reject $H_0$ in this case since $(2n)^{-1} \sum_i B_i = 1/2$. The problem is that $P(B_i = 1 \mid L_i = 1) = 1$ and $P(B_i = 1 \mid L_i = 2) = 0$ but $P(B_i = 1) = 1/2$. However, if we do a
two-sided test, separately within each group, we would reject. Mondal, Biswas and Ghosh (2015) suggest taking
$$U = (T_1 - \theta)^2 + (T_2 - \theta)^2$$
where $T_j = (nk)^{-1} \sum_{i: L_i = j} \sum_{Z_\ell \in N_i} I(L_i = L_\ell)$ and $N_i$ is the set of $k$ nearest neighbors of $Z_i$. However, this test can have low power in other cases. The best strategy is to use both tests, i.e. $W = T \vee U$.
A similar test, called the cross-match test, was defined by Rosenbaum (2005). We take the pooled sample and partition the data into pairs $W_1 = (Z_1, Z_2), W_2 = (Z_3, Z_4), \ldots$. The partition is chosen to minimize $\sum_j \|Z_{2j} - Z_{2j-1}\|^2$. Let
$$T = \sum_i A_i$$
where $A_i = 1$ if the $i$th pair has differing labels (i.e. (1,2) or (2,1)) and $A_i = 0$ otherwise. We reject when $T$ is small. The exact distribution of $T$ under $H_0$ is known; it is hypergeometric. It can accurately be approximated with a $N(\mu, \sigma^2)$ where
$$\mu = \frac{mn}{N - 1}, \qquad \sigma^2 = \frac{2n(n-1)m(m-1)}{(N-3)(N-1)^2}.$$
This accurate, simple limiting distribution for $T$ under the null is the main advantage of this test. However, it seems to have less power than the NN test. Also, the distribution of $T$ under $H_1$ is not known. We could have defined $T = \sum_i B_i$ where $B_i = 1 - A_i$ and rejected when $T$ is large. This is then the same as the $k$-NN test with $k = 1$ except that each point belongs to exactly one pair, so the pairs do not overlap.
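Given the observed cross-match count $T$, the Normal approximation above makes the p-value immediate. Here is a sketch; the minimum-distance pairing itself must be computed separately with any minimum-weight perfect matching routine:

    import numpy as np
    from scipy import stats

    def crossmatch_pvalue(T, n, m):
        # Normal approximation to the null distribution of the cross-match count T
        N = n + m
        mu = n * m / (N - 1)
        sigma2 = 2 * n * (n - 1) * m * (m - 1) / ((N - 3) * (N - 1) ** 2)
        # we reject for small T, so the p-value is the lower tail
        return stats.norm.cdf((T - mu) / np.sqrt(sigma2))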
4 Smooth Tests
Neyman (1937) introduced a method for testing that takes advantage of smoothness. First,
consider one-dimensional data $Y_1, \ldots, Y_n \sim P$. Suppose we want to test $H_0: P = \mathrm{Unif}(0,1)$.
If we want to have power against smooth alternatives, Neyman proposed that we define
$$p_\theta(x) = c(\theta) \exp\left( \sum_{j=1}^{k} \theta_j \psi_j(x) \right)$$
for fixed basis functions $\psi_1, \ldots, \psi_k$.
The null hypothesis corresponds to $\theta = (\theta_1, \ldots, \theta_k) = (0, \ldots, 0)$. One way to test $H_0$ is to use the likelihood ratio test $T = 2(\ell(\hat{\theta}) - \ell(0))$. Under $H_0$, $T \rightsquigarrow \chi^2_k$. But Neyman pointed out that there is a computationally easier test,
$$U = n \sum_j \overline{\psi}_j^2$$
where
$$\overline{\psi}_j = \frac{1}{n} \sum_i \psi_j(Y_i).$$
This also has the property that, under $H_0$, $U \rightsquigarrow \chi^2_k$. But it avoids having to deal with the normalizing constant.
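Here is a minimal sketch of $U$ for data on $[0,1]$, using the cosine basis $\psi_j(x) = \sqrt{2}\cos(\pi j x)$ (the choice of basis and of $k$ is mine; any orthonormal system with mean zero under the uniform distribution works):

    import numpy as np
    from scipy import stats

    def neyman_U(y, k=4):
        # y: 1-d array of values in [0,1]
        # psi_j(x) = sqrt(2) cos(pi j x), j = 1..k, orthonormal and mean zero under Unif(0,1)
        j = np.arange(1, k + 1)
        psi_bar = np.mean(np.sqrt(2) * np.cos(np.pi * j[None, :] * y[:, None]), axis=0)
        U = len(y) * np.sum(psi_bar ** 2)
        return U, stats.chi2.sf(U, df=k)   # statistic and chi^2_k p-value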
Now we move to the two-sample case. Let $F(t) = P(X \leq t)$ and $G(t) = Q(Y \leq t)$. Let $Z = F(Y)$. Then the cdf of $Z$ is
$$H(z) = \mathbb{P}(Z \leq z) = \mathbb{P}(F(Y) \leq z) = \mathbb{P}(Y \leq R(z)) = G(R(z))$$
where $R(z) = F^{-1}(z)$. Under $H_0$, $Z \sim \mathrm{Unif}(0, 1)$. Now $H$ has density
$$\rho(z) = \frac{q(F^{-1}(z))}{p(F^{-1}(z))}$$
and $\rho(z) = 1$ under $H_0$. Bera, Ghosh and Xiao (2013) suggest using the family
$$\rho_\theta(z) = c(\theta) \exp\left( \sum_{j=1}^{k} \theta_j \psi_j(z) \right).$$
Their test statistic is $m \overline{\psi}^T \overline{\psi}$ where
$$\overline{\psi}_j = \frac{1}{m} \sum_i \psi_j(V_i)$$
and $V_i = \hat{F}_n(Y_i)$. Bera, Ghosh and Xiao (2013) prove that the statistic again has a limiting $\chi^2_k$ distribution.
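The two-sample version simply transforms the $Y$'s through the empirical cdf of the $X$'s and then applies the same machinery. A sketch, again with a cosine basis of my choosing:

    import numpy as np
    from scipy import stats

    def bgx_stat(x, y, k=4):
        # x, y: 1-d arrays; V_i = F_hat_n(Y_i) maps the Y's through the empirical cdf of the X sample
        V = np.searchsorted(np.sort(x), y, side="right") / len(x)
        j = np.arange(1, k + 1)
        psi_bar = np.mean(np.sqrt(2) * np.cos(np.pi * j[None, :] * V[:, None]), axis=0)
        T = len(y) * np.sum(psi_bar ** 2)
        return T, stats.chi2.sf(T, df=k)   # compare to chi^2_k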
Zhou, Zheng and Zhang (arXiv:1509.03459) considered the high-dimensional case. They consider all one-dimensional projections of the data. Their test is
$$T = \sqrt{\frac{nm}{n+m}} \, \sup_u T(u)$$
where the supremum is over the $(d-1)$-dimensional sphere and $T(u)$ is the Bera-Ghosh-Xiao statistic based on the projected one-dimensional data $u^T X_i$ and $u^T Y_i$. They also allow the parameter $k$ to be chosen from the data. (In fact, they maximize the test over $k$.)
5 Histogram Test
Under smoothness assumptions and compact support, Ingster (1987) showed that optimal tests can be obtained using histograms. Arias-Castro, Pelletier and Saligrama (2016) extended this to the multivariate case. Assume smoothness level $\beta$. For simplicity let $m = n$. Form a histogram with $N \approx n^{2/(4\beta+1)}$ bins. Set
$$T = \sum_j (C_j - D_j)^2$$
where $C_j$ is the number of $X_i$'s in bin $j$ and $D_j$ is the number of $Y_i$'s in bin $j$. We reject for $T$ large. This test is, in theory, optimal. In fact, Ingster later showed that the test can be made adaptive to the degree of smoothness.
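A one-dimensional sketch of the histogram statistic; the constant in front of the number of bins is arbitrary, and the critical value would again come from the permutation distribution:

    import numpy as np

    def hist_stat(x, y, beta=2.0):
        # number of bins grows like n^{2/(4 beta + 1)}; the constant is arbitrary
        n = len(x)
        num_bins = max(2, int(np.ceil(n ** (2.0 / (4.0 * beta + 1.0)))))
        edges = np.linspace(min(x.min(), y.min()), max(x.max(), y.max()), num_bins + 1)
        C, _ = np.histogram(x, bins=edges)
        D, _ = np.histogram(y, bins=edges)
        return np.sum((C - D) ** 2)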
6 Sparsity
Let us write
$$X_i = (X_i(1), \ldots, X_i(d)), \qquad Y_i = (Y_i(1), \ldots, Y_i(d)).$$
In some cases, we might suspect that $P$ and $Q$ only differ in a few features. In other words, there is sparsity. If so, the easiest thing to do is all the one-dimensional marginal tests with a Bonferroni correction. Let $T_j$ be your favorite one-dimensional test applied to the $j$th feature only. Then take the statistic to be $T = \vee_j T_j$. This test will have good power in the sparse case and it is very easy to compute.
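A sketch using the two-sample Kolmogorov-Smirnov test as the "favorite one-dimensional test", with the Bonferroni correction:

    import numpy as np
    from scipy import stats

    def marginal_bonferroni_test(x, y, alpha=0.05):
        # x, y: arrays of shape (n, d) and (m, d); test each coordinate separately
        d = x.shape[1]
        pvals = np.array([stats.ks_2samp(x[:, j], y[:, j]).pvalue for j in range(d)])
        # Bonferroni: reject H0 if the smallest marginal p-value is below alpha / d
        return pvals.min() <= alpha / d, pvals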
7 Minimax Theory
What does it mean for a test to be optimal? Just as there is a theory for minimax estimation,
there is also a theory for minimax testing. We discussed this a few weeks ago. I’ll remind
you of a few basic facts.
Recall that a level $\alpha$ test is a function $\phi$ of the data taking values 0 or 1 such that $P(\phi = 1) \leq \alpha$ for every $P \in H_0$. Let $\Phi_n$ denote all level $\alpha$ tests. The minimax type II error, for a set of distributions $\mathcal{P}$, is
$$\beta_n(\epsilon) = \inf_{\phi \in \Phi_n} \sup_{P, Q} P^n(\phi = 0)$$
where the supremum is over all $P, Q \in \mathcal{P}$ such that $d(P, Q) > \epsilon$. Fix any small $\delta > 0$. We say that the minimax separation is $\epsilon_n$ if $\epsilon < \epsilon_n$ implies that $\beta_n(\epsilon) \geq \delta$.
If $\mathcal{P}$ is the $\beta$ smoothness class and $d$ is the $L_2$ distance between densities, then Arias-Castro, Pelletier and Saligrama (2016) show that
$$\epsilon_n \asymp \left( \frac{1}{n} \right)^{\frac{2\beta}{4\beta + d}}.$$
The minimax risk is achieved by the histogram test.
8 Discrete Distributions
Suppose that $X_i$ and $Y_i$ are discrete random variables taking values in $\{1, \ldots, d\}$. Let
$$C_j = \#\{i : X_i = j\}, \qquad D_j = \#\{i : Y_i = j\}.$$
Let $C = (C_1, \ldots, C_d)$ and $D = (D_1, \ldots, D_d)$. These are multinomial and we can test $H_0: P = Q$ using a likelihood ratio test or $\chi^2$ test.

But when $d$ is large, the usual tests might have poor power. Improved tests have been developed by Chan et al (2014) and Diakonikolas and Kane (2016), for example. Moreover, these tests are designed to have good power against alternatives with respect to total variation distance. For example, Chan et al propose the test statistic
$$T = \sum_j \frac{(C_j - D_j)^2 - (C_j + D_j)}{C_j + D_j}.$$
We reject when $T$ is large. They prove that this test has good power as long as $\mathrm{TV}(P, Q) > d^{1/4}/\sqrt{n}$, which is the minimax bound.
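A sketch of this statistic computed from the observed counts; cells with $C_j + D_j = 0$ contribute nothing and are skipped:

    import numpy as np

    def chan_stat(C, D):
        # T = sum_j [(C_j - D_j)^2 - (C_j + D_j)] / (C_j + D_j), over non-empty cells
        C, D = np.asarray(C, dtype=float), np.asarray(D, dtype=float)
        S = C + D
        keep = S > 0
        return np.sum(((C[keep] - D[keep]) ** 2 - S[keep]) / S[keep])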
References
Anderson, Hall and Titterington (1994). Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 41-54.
Berlinet and Thomas-Agnan (2011). Reproducing kernel Hilbert spaces in probability and
statistics, Springer.
Chan, S., et al. (2014). Optimal algorithms for testing closeness of discrete distributions. Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).
Gretton, Borgwardt, Rasch, Schölkopf and Smola (2007). A kernel method for the two-sample-problem. NIPS.
Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor
type coincidences. The Annals of Statistics, 772-783.
Mondal, Biswas and Ghosh (2015). On high dimensional two-sample tests based on nearest
neighbors. Journal of Multivariate Analysis, 168-178.
Sriperumbudur, B. K., et al. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517-1561.
Székely and Rizzo (2004). Testing for equal distributions in high dimension. InterStat, 1-6.