
High-Dimensional, Two-Sample Testing

1 Introduction

We observe two iid samples

X_1, \ldots, X_n \sim P, \qquad Y_1, \ldots, Y_m \sim Q

where X_i, Y_i ∈ R^d. We want to test

H_0: P = Q  versus  H_1: P ≠ Q.

Throughout, we will assume that n/(n + m) → π ∈ (0, 1) as the sample size increases.

In low dimensions, there are many tests with good power. For example, we could use the
test statistic
T = \sup_t |\hat F_n(t) - \hat G_m(t)|

where \hat F_n and \hat G_m are the empirical cdf's. To find the α-level critical value we can use
asymptotic theory or permutation testing. But there are other approaches for the high-
dimensional case.
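
For instance, here is a small Python sketch of this statistic with a permutation-based p-value; the helper names and the Gaussian toy data are just illustrative choices, not part of the notes.

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic sup_t |F_n(t) - G_m(t)|."""
    grid = np.sort(np.concatenate([x, y]))          # evaluate both cdfs at all data points
    F = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    G = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(F - G))

def permutation_pvalue(x, y, stat, n_perm=999, seed=0):
    """Permutation p-value for any two-sample statistic `stat` (valid under H0: P = Q)."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    t_obs = stat(x, y)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(z))
        exceed += stat(z[perm[:len(x)]], z[perm[len(x):]]) >= t_obs
    return (1 + exceed) / (1 + n_perm)

# toy example: two univariate samples from slightly different normals
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100)
y = rng.normal(0.3, 1.0, size=120)
print(ks_statistic(x, y), permutation_pvalue(x, y, ks_statistic))
```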

Why are we interested in two-sample testing? We might be interested in testing whether
two groups are the same for scientific reasons (treatment versus control, for example). Two-sample
testing can also be used to screen features for classification.

2 Metrics

One way to define a test is to first define a metric between distributions. For example

d(P, Q) = \sup_{g ∈ G} \left| \int g\, dP - \int g\, dQ \right|

for some class of functions G. Here are some examples. If G = {g : \|g\|_∞ ≤ 1} then d(P, Q)
is the total variation distance. If G is the set of g such that

\sup_{x ≠ y} \frac{|g(y) - g(x)|}{\|x - y\|} ≤ 1

then d(P, Q) is the earth-mover distance (or Wasserstein distance). This is equivalent to
\inf_R E_R \|X - Y\| where the infimum is over all joint distributions R for (X, Y) with marginals
P and Q. If G = {I_{(-∞, t]} : t ∈ R^d} then d(P, Q) is the Kolmogorov-Smirnov distance. See
Sriperumbudur et al. (2010) for more examples.

In general, estimating d(P, Q) is difficult. But if we take G to be the unit ball of an RKHS defined
by a kernel K, it can be shown that

θ = d^2(P, Q) = \int\int K(x, y)\, dP(x)\, dP(y) + \int\int K(x, y)\, dQ(x)\, dQ(y) - 2 \int\int K(x, y)\, dP(x)\, dQ(y).

The plug-in estimator of d^2(P, Q) is

T = \frac{2}{n(n-1)} \sum_{i<j} K(X_i, X_j) + \frac{2}{m(m-1)} \sum_{i<j} K(Y_i, Y_j) - \frac{2}{nm} \sum_{i,j} K(X_i, Y_j).
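
Here is a rough Python sketch of this estimator. The Gaussian kernel and the median-distance bandwidth are illustrative assumptions, not something fixed by the theory above.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """K(x, y) = exp(-||x - y||^2 / (2 h^2)) on all pairs of rows of a and b."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * bandwidth**2))

def mmd2_plugin(x, y, bandwidth):
    """Plug-in (U-statistic) estimator of d^2(P, Q) for the RKHS distance."""
    n, m = len(x), len(y)
    Kxx = gaussian_kernel(x, x, bandwidth)
    Kyy = gaussian_kernel(y, y, bandwidth)
    Kxy = gaussian_kernel(x, y, bandwidth)
    # drop the diagonal so each term matches 2/(n(n-1)) sum_{i<j} K(X_i, X_j)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    term_xy = 2 * Kxy.sum() / (n * m)
    return term_x + term_y - term_xy

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(100, 5))
y = rng.normal(0.5, 1.0, size=(100, 5))
h = np.median(np.sqrt(((x[:, None, :] - y[None, :, :])**2).sum(-1)))  # median heuristic
print(mmd2_plugin(x, y, h))
```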

A related distance is the energy distance (Székely 1989, 2002) defined by

d^2(P, Q) = 2 E[\|X - Y\|] - E[\|X - X'\|] - E[\|Y - Y'\|].

The advantage of the energy distance is that there is no tuning parameter. (The RKHS
distance actually requires a bandwidth.) The sample estimate is

\frac{2}{n_1 n_2} \sum_i \sum_j \|X_i - Y_j\| - \frac{1}{n_1^2} \sum_i \sum_j \|X_i - X_j\| - \frac{1}{n_2^2} \sum_i \sum_j \|Y_i - Y_j\|.
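
A small sketch of the sample energy statistic (the helper name and toy data are arbitrary):

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between all rows of a and all rows of b."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.sqrt(np.maximum(d2, 0))           # clip tiny negatives from rounding

def energy_statistic(x, y):
    """Sample version of 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||."""
    n1, n2 = len(x), len(y)
    return (2 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).sum() / n1**2
            - pairwise_dist(y, y).sum() / n2**2)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(200, 10))
y = rng.normal(0.2, 1.0, size=(200, 10))
print(energy_statistic(x, y))
```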

How do we know when to reject H_0? One approach is to find the limiting distribution of T
under H_0. This turns out to be, for the RKHS distance,

T \rightsquigarrow \sum_{j=1}^{\infty} 2 λ_j (Z_j^2 - 1)

where the Z_j's are N(0,1) and the λ_j's are the eigenvalues defined by

\int L(x, y) ψ_j(x)\, dP(x) = λ_j ψ_j(y)

where L(x, y) = K(x, y) - E[K(x, X)] - E[K(X, y)] + E[K(X, Y)]. This distribution is called
a Gaussian chaos. This distribution has infinitely many nuisance parameters, which makes
it unusable in practice. Instead, we use the permutation distribution to choose the critical value.
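
Here is a generic permutation-test sketch in Python; the RKHS or energy statistics sketched above could be plugged in for the placeholder statistic used in the demo.

```python
import numpy as np

def permutation_test(x, y, statistic, n_perm=999, seed=0):
    """Generic permutation test: returns the observed statistic and a p-value.
    Labels are shuffled over the pooled sample, which is valid under H0: P = Q."""
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    n = len(x)
    t_obs = statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(z))
        exceed += statistic(z[idx[:n]], z[idx[n:]]) >= t_obs
    return t_obs, (1 + exceed) / (1 + n_perm)

# demo with a deliberately simple statistic; mmd2_plugin or energy_statistic
# can be passed in the same way
stat = lambda a, b: np.linalg.norm(a.mean(0) - b.mean(0))
rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=(80, 20))
y = rng.normal(0.3, 1.0, size=(80, 20))
print(permutation_test(x, y, stat))
```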

It can be shown that

T - d^2(P, Q) = O_P\left(\frac{1}{\sqrt{N}}\right)

where N = n ∧ m. Thus, it appears that the quality of T does not depend on the dimension!
This is false. What matters here is the power. As we shall see below, the minimax separation
rate, that is, the smallest detectable difference, is

\left(\frac{1}{N}\right)^{\frac{2β}{4β+d}}

where β is the smoothness. This was proved by Arias-Castro, Pelletier and Saligrama (2016)
based on techniques developed by Ingster (1987). We'll discuss this more below.

The problem is that the kernel is hiding a lot. To see this, note that T is essentially the
same as

\int (\hat p_h(x) - \hat q_h(x))^2 \, dx

where \hat p_h and \hat q_h are kernel density estimators. This test was proposed by Anderson, Hall and
Titterington (1994). But remember, the kernel has a tuning parameter. If it is Gaussian,
there is a bandwidth. The statement T - d^2(P, Q) = O_P(1/\sqrt{N}) assumes we do not change
the bandwidth. But to have good power, we need to let the bandwidth go to zero, and then we no
longer have the fast rate. The power of the RKHS test in general nonparametric settings is
not well studied.

Now suppose we want a confidence interval for θ = d^2(P, Q). Unfortunately, there is no
known practical method if we use the above estimator. However, we can use the idea in
Gretton et al. (2012) to get a simple (but statistically inefficient) method. Instead of using
a U-statistic, we break the sample into blocks of size two. For simplicity, assume that
n_1 = n_2 = n. Define

\hat θ = \frac{2}{n} \sum_j h\big( (X_{2j-1}, Y_{2j-1}), (X_{2j}, Y_{2j}) \big) \equiv \frac{1}{m} \sum_j R_j

where m = n/2 and

h((x_i, y_i), (x_j, y_j)) = K(x_i, x_j) + K(y_i, y_j) - K(x_i, y_j) - K(x_j, y_i).

It follows from the CLT and Slutsky's theorem that \sqrt{m}(\hat θ - θ)/s \rightsquigarrow N(0, 1) where s^2
is the sample variance of R_1, \ldots, R_m. Hence, an asymptotic 1 - α confidence interval is
\hat θ ± s z_{α/2} / \sqrt{m}.
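
A rough sketch of this block-based interval, assuming a Gaussian kernel with a fixed, arbitrary bandwidth (the function names and toy data are illustrative):

```python
import numpy as np
from scipy import stats

def gauss_k(u, v, bw):
    """Gaussian kernel between two single points."""
    return np.exp(-np.sum((u - v)**2) / (2 * bw**2))

def block_mmd_ci(x, y, bw, alpha=0.05):
    """Block-of-two estimate of theta = d^2(P, Q) with an asymptotic CI.
    Assumes n1 = n2 = n with n even; each block (2j-1, 2j) gives one iid summand R_j."""
    n = len(x)
    m = n // 2
    R = np.array([
        gauss_k(x[2 * j], x[2 * j + 1], bw) + gauss_k(y[2 * j], y[2 * j + 1], bw)
        - gauss_k(x[2 * j], y[2 * j + 1], bw) - gauss_k(x[2 * j + 1], y[2 * j], bw)
        for j in range(m)
    ])
    theta_hat = R.mean()
    s = R.std(ddof=1)
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * s / np.sqrt(m)
    return theta_hat, (theta_hat - half, theta_hat + half)

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=(200, 5))
y = rng.normal(0.5, 1.0, size=(200, 5))
print(block_mmd_ci(x, y, bw=2.0))
```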

3 Graph-Based Tests

Another class of tests is based on geometric graphs. Let Z_1, \ldots, Z_N be the combined sample
where N = n + m. Let L_i = 1 if Z_i is from group 1 and L_i = 2 if Z_i is from group 2.

Let N_i be the set of k nearest neighbors of Z_i. Define

T = \frac{1}{Nk} \sum_{i=1}^{N} \sum_{r=1}^{k} B_i(r)

where B_i(r) = 1 if the rth nearest neighbor of Z_i has the same label as Z_i. This corresponds to
forming a k-nearest-neighbor graph and asking how many of the k nearest neighbors are
from the same group as the node. The probability of getting the same label under H_0 is
µ = π^2 + (1 - π)^2.

It can be shown that, under H_0,

\frac{\sqrt{Nk}\,(T - µ)}{σ} \rightsquigarrow N(0, 1).

The proof is difficult because the test statistic is summing quantities that are not independent.
The variance σ^2 is known but is very, very complicated. See Schilling (1986a, 1986b). In
practice, we can use the permutation distribution to get the critical value. Under H_1, the
mean of T converges to

θ = 1 - 2π(1 - π) \int \frac{p(x) q(x)}{π p(x) + (1 - π) q(x)} \, dx

which is a distance between p and q. In my experience, this test works well even with k = 1.
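
Here is a small sketch of the k-nearest-neighbor test with a permutation p-value. It uses brute-force distance computations, which is fine for illustration; the toy data are arbitrary.

```python
import numpy as np

def knn_indices(z, k):
    """Indices of the k nearest neighbors of each row of z (brute force)."""
    d2 = np.sum((z[:, None, :] - z[None, :, :])**2, axis=-1)
    np.fill_diagonal(d2, np.inf)                 # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def knn_permutation_test(x, y, k=1, n_perm=499, seed=0):
    """T = fraction of k-NN edges joining points with the same label,
    calibrated by permuting the labels over the pooled sample."""
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    labels = np.array([1] * len(x) + [2] * len(y))
    nn = knn_indices(z, k)                       # the graph itself never changes
    same = lambda lab: np.mean(lab[nn] == lab[:, None])
    t_obs = same(labels)
    exceed = sum(same(rng.permutation(labels)) >= t_obs for _ in range(n_perm))
    return t_obs, (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=(60, 10))
y = rng.normal(0.4, 1.0, size=(60, 10))
print(knn_permutation_test(x, y, k=1))
```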

In high dimensions we need to correct the test to account for some strange effects (Mondal,
Biswas and Ghosh, 2015). If P concentrates its mass on a ring R and Q concentrates its mass
on a larger ring S that surrounds R, then every point from Q can be closer to a point from P
than to any other point from Q.

Here is an example. Let's take k = 1 and n = m. Let B_i = 1 if the nearest neighbor of Z_i is from
the same group. The test statistic is T = (2n)^{-1} \sum_i B_i. We are testing

H_0: P(B_i = 1) = \frac{1}{2}  versus  H_1: P(B_i = 1) > \frac{1}{2}.

Suppose that X_1, X_2 \sim N(µ_1, σ_1^2 I) and Y_1, Y_2 \sim N(µ_2, σ_2^2 I). Take µ_1 = (a, \ldots, a) and
µ_2 = (b, \ldots, b). Now,

\frac{1}{d}\|X_1 - X_2\|^2 \xrightarrow{P} 2σ_1^2, \qquad \frac{1}{d}\|Y_1 - Y_2\|^2 \xrightarrow{P} 2σ_2^2, \qquad \frac{1}{d}\|X_1 - Y_2\|^2 \xrightarrow{P} σ_1^2 + σ_2^2 + (a - b)^2.

Let a = 0, b = 0.2, σ_1^2 = 1, σ_2^2 = 1.2. Then

2σ_1^2 < σ_1^2 + σ_2^2 + (a - b)^2 < 2σ_2^2.

So, in high dimensions, every observation from Q is closer to an observation from P than to the
other observations from Q.

The data will look like this:

         X_1  X_2  ...  X_n  Y_1  Y_2  ...  Y_n
  B_i     1    1   ...   1    0    0   ...   0

We will not reject H_0 in this case since (2n)^{-1} \sum_i B_i = 1/2. The problem is that
P(B_i = 1 | L_i = 1) = 1 and P(B_i = 1 | L_i = 2) = 0 but P(B_i = 1) = 1/2. However, if we do a
two-sided test, separately within each group, we would reject. Mondal, Biswas and Ghosh
(2015) suggest taking

U = (T_1 - θ)^2 + (T_2 - θ)^2

where T_j = (nk)^{-1} \sum_{i: L_i = j} \sum_{Z_l ∈ N_i} I(L_i = L_l). However, this test can have low power in
other cases. The best strategy is to use both tests, i.e., W = T ∨ U.

A similar test, called the cross-match test, was defined by Rosenbaum (2005). We take
the pooled sample and partition the data into pairs W_1 = (Z_1, Z_2), W_2 = (Z_3, Z_4), \ldots. The
partition is chosen to minimize \sum_j \|Z_{2j} - Z_{2j-1}\|^2. Let

T = \sum_i A_i

where A_i = 1 if the ith pair has differing labels (i.e., one observation from each group) and
A_i = 0 otherwise. We reject when T is small. The exact distribution of T under H_0 is known;
it is hypergeometric. It can accurately be approximated with a N(µ, σ^2) where

µ = \frac{mn}{N - 1}, \qquad σ^2 = \frac{2n(n - 1)m(m - 1)}{(N - 3)(N - 1)^2}.

This accurate, simple limiting distribution for T under the null is the main advantage of this
test. However, it seems to have less power than the NN test. Also, the distribution of T under
H_1 is not known. We could have defined T = \sum_i B_i where B_i = 1 - A_i and rejected
when T is large. This is then the same as the k-NN test with k = 1 except that the pairs are
not allowed to overlap.
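
Given an observed cross-match count, the normal approximation above is easy to compute. Here is a sketch; the optimal pairing itself (a minimum-weight non-bipartite matching) is not shown, and the numbers in the demo are made up.

```python
import math

def crossmatch_pvalue(t_obs, n, m):
    """Normal approximation to the null distribution of the cross-match count T.
    We reject for small T, so the p-value is P(T <= t_obs) under H0."""
    N = n + m
    mu = n * m / (N - 1)
    sigma2 = 2 * n * (n - 1) * m * (m - 1) / ((N - 3) * (N - 1)**2)
    z = (t_obs - mu) / math.sqrt(sigma2)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))    # lower-tail Phi(z)

# e.g. 100 + 100 points paired into 100 pairs, 35 of which are cross-matches
print(crossmatch_pvalue(35, 100, 100))
```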

4 Smooth Tests

Neyman (1937) introduced a method for testing that takes advantage of smoothness. First,
consider one-dimensional data Y_1, \ldots, Y_n \sim P. Suppose we want to test

H_0: P = Uniform(0, 1)  versus  H_1: P ≠ Uniform(0, 1).

If we want to have power against smooth alternatives, Neyman proposed that we define

p_θ(x) = c(θ) \exp\left( \sum_{j=1}^{k} θ_j ψ_j(x) \right)

where ψ_1, ψ_2, \ldots are orthonormal functions and

c(θ) = \frac{1}{\int \exp\left( \sum_{j=1}^{k} θ_j ψ_j(x) \right) dx}.

The null hypothesis corresponds to θ = (θ_1, \ldots, θ_k) = (0, \ldots, 0). One way to test H_0 is to
use the likelihood ratio test T = 2(\ell(\hat θ) - \ell(0)). Under H_0, T \rightsquigarrow χ^2_k. But Neyman pointed
out that there is a computationally easier test,

U = n \sum_j \bar ψ_j^2

where

\bar ψ_j = \frac{1}{n} \sum_i ψ_j(Y_i).

This also has the property that, under H_0, U \rightsquigarrow χ^2_k. But it avoids having to deal with the
normalizing constant.
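
Here is a sketch of Neyman's statistic using normalized shifted Legendre polynomials as the orthonormal functions; the basis is one convenient choice, and the Beta toy data are just an illustrative smooth departure from uniform.

```python
import numpy as np
from scipy.special import eval_legendre
from scipy.stats import chi2

def neyman_smooth_test(y, k=4):
    """Neyman's smooth test of H0: Y ~ Uniform(0, 1).
    Uses sqrt(2j+1) * P_j(2x - 1), which are orthonormal on [0, 1]."""
    n = len(y)
    U = 0.0
    for j in range(1, k + 1):
        psi_j = np.sqrt(2 * j + 1) * eval_legendre(j, 2 * y - 1)
        U += n * psi_j.mean()**2
    return U, chi2.sf(U, df=k)          # statistic and asymptotic p-value

rng = np.random.default_rng(5)
print(neyman_smooth_test(rng.beta(1.3, 1.0, size=500)))
```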

Now we move to the two-sample case. Let F(t) = P(X ≤ t) and G(t) = Q(Y ≤ t). Let
Z = F(Y). Then the cdf of Z is

H(z) = P(Z ≤ z) = P(F(Y) ≤ z) = P(Y ≤ R(z)) = G(R(z))

where R(z) = F^{-1}(z). Under H_0, Z \sim Unif(0, 1). Now H has density

ρ(z) = \frac{q(F^{-1}(z))}{p(F^{-1}(z))}

and ρ(z) = 1 under H_0. Bera, Ghosh and Xiao (2013) suggest using the family

ρ_θ(z) = c(θ) \exp\left( \sum_{j=1}^{k} θ_j ψ_j(z) \right).

Their test statistic is m \bar ψ^T \bar ψ where \bar ψ = (\bar ψ_1, \ldots, \bar ψ_k),

\bar ψ_j = \frac{1}{m} \sum_i ψ_j(V_i)

and V_i = \hat F_n(Y_i). Bera, Ghosh and Xiao (2013) prove that the statistic again has a limiting
χ^2_k distribution.
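
A sketch of this two-sample smooth statistic for one-dimensional data, again using shifted Legendre polynomials for the ψ_j (the basis and toy data are illustrative choices):

```python
import numpy as np
from scipy.special import eval_legendre
from scipy.stats import chi2

def bgx_statistic(x, y, k=4):
    """Two-sample smooth test: transform Y through the empirical cdf of X,
    then apply the Neyman-type chi-squared statistic m * psibar' psibar."""
    m = len(y)
    x_sorted = np.sort(x)
    v = np.searchsorted(x_sorted, y, side="right") / len(x)     # V_i = F_n(Y_i)
    psibar = np.array([
        np.mean(np.sqrt(2 * j + 1) * eval_legendre(j, 2 * v - 1))
        for j in range(1, k + 1)
    ])
    stat = m * psibar @ psibar
    return stat, chi2.sf(stat, df=k)

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, size=300)
y = rng.normal(0.3, 1.0, size=300)
print(bgx_statistic(x, y))
```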

Zhou, Zheng and Zhang (arXiv:1509.03459) considered the high-dimensional case. They
consider all one-dimensional projections of the data. Their test is

T = \sqrt{\frac{nm}{n + m}} \sup_u T(u)

where the supremum is over the (d - 1)-dimensional sphere and T(u) is the Bera-Ghosh-Xiao
statistic based on the one-dimensional projected data u^T X_i and u^T Y_i. They also allow the
parameter k to be chosen from the data. (In fact, they maximize the test over k.)

The limiting distribution of T under H_0 is complicated: it is the supremum of a Gaussian
process. To get a practical test there are two possibilities. One is to use permutations. The
other is based on a version of the bootstrap called the multiplier bootstrap. Their simulations
suggest that this test works well. But it is unclear how it compares to the other tests.

5 Histogram Test

Under smoothness assumptions and compact support, Ingster (1987) showed that optimal
tests can be obtained using histograms. Arias-Castro, Pelletier and Saligrama (2016) extended
this to the multivariate case. Assume smoothness level β. For simplicity let m = n.
Form a histogram with N ≈ n^{2/(4β+1)} bins. Set

T = \sum_j (C_j - D_j)^2

where C_j is the number of X_i's in bin j and D_j is the number of Y_i's in bin j. We reject
for large T. This test is, in theory, optimal. In fact, Ingster later showed that the test can
be made adaptive to the degree of smoothness.
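
A one-dimensional sketch of the histogram test, calibrated by permutation. In theory the number of bins is tied to the smoothness as above; here it is just a fixed illustrative choice.

```python
import numpy as np

def histogram_test(x, y, n_bins, n_perm=499, seed=0):
    """T = sum_j (C_j - D_j)^2 over a common binning, with a permutation p-value."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    n = len(x)
    edges = np.histogram_bin_edges(z, bins=n_bins)   # bins shared by both samples

    def stat(a, b):
        C, _ = np.histogram(a, bins=edges)
        D, _ = np.histogram(b, bins=edges)
        return np.sum((C - D)**2)

    t_obs = stat(x, y)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(z))
        exceed += stat(z[idx[:n]], z[idx[n:]]) >= t_obs
    return t_obs, (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.0, 1.3, size=500)
print(histogram_test(x, y, n_bins=20))
```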

6 Sparsity

Let us write

X_i = (X_i(1), \ldots, X_i(d)), \qquad Y_i = (Y_i(1), \ldots, Y_i(d)).

In some cases, we might suspect that P and Q only differ in a few features. In other words,
there is sparsity. If so, the easiest thing is to do all the one-dimensional marginal tests and
apply a Bonferroni correction. Let T_j be your favorite one-dimensional test statistic applied to
the jth feature only. Then take the statistic to be T = ∨_j T_j. This test will have good power
in the sparse case and it is very easy to compute.
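
A sketch using the KS test on each marginal with a Bonferroni correction; the choice of KS as the one-dimensional test, and the toy data with a shift in a single feature, are arbitrary.

```python
import numpy as np
from scipy.stats import ks_2samp

def marginal_bonferroni_test(x, y, alpha=0.05):
    """Run a one-dimensional test (here KS) on each feature and Bonferroni-correct.
    Reject H0: P = Q if any marginal p-value falls below alpha / d."""
    d = x.shape[1]
    pvals = np.array([ks_2samp(x[:, j], y[:, j]).pvalue for j in range(d)])
    return pvals.min() < alpha / d, pvals

rng = np.random.default_rng(8)
x = rng.normal(0.0, 1.0, size=(200, 50))
y = rng.normal(0.0, 1.0, size=(200, 50))
y[:, 3] += 0.6                       # the distributions differ only in feature 3
reject, pvals = marginal_bonferroni_test(x, y)
print(reject, pvals.argmin())
```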

7 Minimax Theory

What does it mean for a test to be optimal? Just as there is a theory for minimax estimation,
there is also a theory for minimax testing. We discussed this a few weeks ago. I’ll remind
you of a few basic facts.

To keep it simple, suppose that m = n. We want to test H_0: P = Q. Let P be a set of
distributions and assume that P, Q ∈ P.

Recall that a level α test is a function φ of the data taking values 0 or 1 such that P(φ = 1) ≤ α
for every P ∈ H_0. Let Φ_n denote all level α tests. The minimax type II error, for a
set of distributions P, is

β_n(ε) = \inf_{φ ∈ Φ_n} \sup_{P, Q} P^n(φ = 0)

where the supremum is over all P, Q ∈ P such that d(P, Q) > ε. Fix any small δ > 0. We
say that the minimax separation is ε_n if ε < ε_n implies that β_n(ε) ≥ δ.

If P is the β-smoothness class and d is the L_2 distance between densities, then Arias-Castro,
Pelletier and Saligrama (2016) show that

ε_n \asymp \left(\frac{1}{n}\right)^{\frac{2β}{4β+d}}.

The minimax rate is achieved by the histogram test.
The minimax risk is achieved by the histogram test.

8 Discrete Distributions

Suppose that X_i and Y_i are discrete random variables taking values in {1, \ldots, d}. Let

C_j = \#\{i : X_i = j\}, \qquad D_j = \#\{i : Y_i = j\}.

Let C = (C_1, \ldots, C_d) and D = (D_1, \ldots, D_d). These are multinomial and we can test
H_0: P = Q using a likelihood ratio test or a χ^2 test.

But when d is large, the usual tests might have poor power. Improved tests have been developed
by Chan et al. (2014) and Diakonikolas and Kane (2016), for example. Moreover, these
tests are designed to have good power against alternatives with respect to total variation
distance. For example, Chan et al. propose the test statistic

T = \sum_j \frac{(C_j - D_j)^2 - (C_j + D_j)}{C_j + D_j}.

We reject when T is large. They prove that this test has good power as long as
TV(P, Q) > d^{1/4}/\sqrt{n}, which is the minimax bound.
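
A sketch of this statistic computed from category counts; in practice one would calibrate it by permutation. The toy distributions, which perturb a few cells of a uniform, are made up.

```python
import numpy as np

def chan_statistic(x, y, d):
    """Test statistic of Chan et al. (2014) computed from category counts.
    Cells with C_j + D_j = 0 contribute nothing and are skipped."""
    C = np.bincount(x, minlength=d)
    D = np.bincount(y, minlength=d)
    tot = C + D
    mask = tot > 0
    return np.sum(((C - D)**2 - tot)[mask] / tot[mask])

rng = np.random.default_rng(9)
d = 1000
p = np.full(d, 1 / d)
q = p.copy(); q[:10] *= 3; q /= q.sum()          # perturb a few cells
x = rng.choice(d, size=2000, p=p)
y = rng.choice(d, size=2000, p=q)
print(chan_statistic(x, y, d))
```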

References

Anderson, Hall and Titterington (1994). Two-sample test statistics for measuring discrepancies
between two multivariate probability density functions using kernel-based density estimates.
Journal of Multivariate Analysis, 41-54.

Arias-Castro, Pelletier and Saligrama (2016). arXiv:1607.08156.

Berlinet and Thomas-Agnan (2011). Reproducing Kernel Hilbert Spaces in Probability and
Statistics. Springer.

Chan et al. (2014). Optimal algorithms for testing closeness of discrete distributions.
Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).

Gretton, Borgwardt, Rasch, Schölkopf and Smola (2007). A kernel method for the two-sample
problem. NIPS.

Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor
type coincidences. The Annals of Statistics, 772-783.

Ingster, Y. (1987). Minimax testing of nonparametric hypotheses on a distribution density in
the Lp metrics. Theory of Probability and Its Applications, 333-337.

Mondal, Biswas and Ghosh (2015). On high dimensional two-sample tests based on nearest
neighbors. Journal of Multivariate Analysis, 168-178.

Rosenbaum (2005). An exact distribution-free test comparing two multivariate distributions
based on adjacency. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
515-530.

Schilling, M. F. (1986a). Multivariate two-sample tests based on nearest neighbors. Journal of
the American Statistical Association, 81, 799-806.

Schilling, M. F. (1986b). Mutual and shared neighbor probabilities: finite- and
infinite-dimensional results. Advances in Applied Probability, 18, 388-405.

Sriperumbudur et al. (2010). Hilbert space embeddings and metrics on probability measures.
Journal of Machine Learning Research, 11, 1517-1561.

Székely and Rizzo (2004). Testing for equal distributions in high dimension. InterStat, 1-6.
