
A Higher-Order Kolmogorov-Smirnov Test

Veeranjaneyulu Sadhanala1 Yu-Xiang Wang2 Aaditya Ramdas1 Ryan J. Tibshirani1


1 Carnegie Mellon University   2 University of California at Santa Barbara

Abstract

We present an extension of the Kolmogorov-Smirnov (KS) two-sample test, which can be more sensitive to differences in the tails. Our test statistic is an integral probability metric (IPM) defined over a higher-order total variation ball, recovering the original KS test as its simplest case. We give an exact representer result for our IPM, which generalizes the fact that the original KS test statistic can be expressed in equivalent variational and CDF forms. For small enough orders (k ≤ 5), we develop a linear-time algorithm for computing our higher-order KS test statistic; for all others (k ≥ 6), we give a nearly linear-time approximation. We derive the asymptotic null distribution for our test, and show that our nearly linear-time approximation shares the same asymptotic null. Lastly, we complement our theory with numerical studies.

1 INTRODUCTION

The Kolmogorov-Smirnov (KS) test (Kolmogorov, 1933; Smirnov, 1948) is a classical and celebrated tool for nonparametric hypothesis testing. Let x1, ..., xm ~ P and y1, ..., yn ~ Q be independent samples. Let X(m) and Y(n) denote the two sets of samples, and also let Z(N) = X(m) ∪ Y(n) = {z1, ..., zN}, where N = m + n. The two-sample KS test statistic is defined as

    max_{z ∈ Z(m+n)} | (1/m) Σ_{i=1}^m 1{xi ≤ z} − (1/n) Σ_{i=1}^n 1{yi ≤ z} |.    (1)

In words, this measures the maximum absolute difference between the empirical cumulative distribution functions (CDFs) of X(m) and Y(n), across all points in the joint sample Z(m+n). Naturally, the two-sample KS test rejects the null hypothesis of P = Q for large values of the statistic. The statistic (1) can also be written in the following variational form:

    sup_{f : TV(f) ≤ 1} |Pm f − Qn f|,    (2)

where TV(·) denotes total variation, and we define the empirical expectation operators Pm, Qn via

    Pm f = (1/m) Σ_{i=1}^m f(xi)  and  Qn f = (1/n) Σ_{i=1}^n f(yi).

Later, we will give a general representation result that implies the equivalence of (1) and (2) as a special case.

The KS test is a fast, general-purpose two-sample nonparametric test. But being a general-purpose test also means that it is systematically less sensitive to some types of differences, such as tail differences (Bryson, 1974). Intuitively, this is because the empirical CDFs of X(m) and Y(n) must both tend to 0 as z → −∞ and to 1 as z → ∞, so the gap in the tails will not be large.

The insensitivity of the KS test to tail differences is well-known. Several authors have proposed modifications to the KS test to improve its tail sensitivity, based on variance-reweighting (Anderson and Darling, 1952), or Renyi-type statistics (Mason and Schuenemeyer, 1983; Calitz, 1987), to name a few ideas. In a different vein, Wang et al. (2014) recently proposed a higher-order extension of the KS two-sample test, which replaces the total variation constraint on f in (2) with a total variation constraint on a derivative of f. These authors show empirically that, in some cases, this modification can lead to better tail sensitivity. In the current work, we refine the proposal of Wang et al. (2014), and give theoretical backing for this new test.

A Higher-Order KS Test. Our test statistic has the form of an integral probability metric (IPM). For a function class F, the IPM between distributions P and Q, with respect to F, is defined as (Muller, 1997)

    ρ(P, Q; F) = sup_{f ∈ F} |Pf − Qf|,    (3)

where we define the expectation operators P, Q by

    Pf = E_{X~P}[f(X)]  and  Qf = E_{Y~Q}[f(Y)].

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).
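As a concrete illustration of (1), the statistic can be computed from the pooled sample by evaluating both empirical CDFs at every pooled point and taking the maximum absolute gap. The following is a minimal NumPy sketch, not the authors' code; the function name ks_statistic is our own:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample KS statistic (1): the maximum absolute gap between the
    empirical CDFs of x and y, evaluated over the pooled sample."""
    z = np.concatenate([x, y])
    # right-continuous ECDFs evaluated at every pooled point
    Fx = np.searchsorted(np.sort(x), z, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), z, side="right") / len(y)
    return np.max(np.abs(Fx - Fy))
```

For samples without ties this agrees with the usual definition of the two-sample KS statistic.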

For a given function class F, the IPM ρ(·, · ; F) is a pseudometric on the space of distributions. Note that the KS test in (2) is precisely ρ(Pm, Qn; F0), where Pm, Qn are the empirical distributions of X(m), Y(n), respectively, and F0 = {f : TV(f) ≤ 1}.

Consider an IPM given by replacing F0 with Fk = {f : TV(f^(k)) ≤ 1}, for an integer k ≥ 1 (where we write f^(k) for the kth weak derivative of f). Some motivation is as follows. In the case k = 0, we know that the witness functions in the KS test (2), i.e., the functions in F0 that achieve the supremum, are piecewise constant step functions (cf. the equivalent representation (1)). These functions can only have so much action in the tails. By moving to Fk, which is essentially comprised of the kth order antiderivatives of functions in F0, we should expect that the witness functions over Fk are kth order antiderivatives of piecewise constant functions, i.e., kth degree piecewise polynomial functions, which can have much more sensitivity in the tails.

But simply replacing F0 by Fk and proposing to compute ρ(Pm, Qn; Fk) leads to an ill-defined test. This is due to the fact that Fk contains all polynomials of degree k. Hence, if the ith moments of Pm, Qn differ, for any i ∈ [k] (where we abbreviate [a] = {1, ..., a} for an integer a ≥ 1), then ρ(Pm, Qn; Fk) = ∞.

As such, we must modify Fk to control the growth of its elements. While there are different ways to do this, not all result in computable IPMs. The approach we take yields an exact representer theorem (generalizing the equivalence between (1) and (2)). Define

    Fk = { f : TV(f^(k)) ≤ 1,  f^(j)(0) = 0 for j ∈ {0} ∪ [k−1],  f^(k)(0+) = 0 or f^(k)(0−) = 0 }.    (4)

Here f^(k)(0+) and f^(k)(0−) denote one-sided limits at 0 from above and below, respectively. Informally, the functions in Fk are pinned down at 0, with all lower-order derivatives (and the limiting kth derivative from the right or left) equal to 0, which limits their growth. Now we define the kth-order KS test statistic as

    ρ(Pm, Qn; Fk) = sup_{f ∈ Fk} |Pm f − Qn f|.    (5)

An important remark is that for k = 0, this recovers the original KS test statistic (2), because F0 contains all step functions of the form gt(x) = 1{x ≤ t}, t ≥ 0.

Another important remark is that for any k ≥ 0, the function class Fk in (4) is "rich enough" to make the IPM in (5) a metric. We state this formally next; its proof, as with all other proofs, is in the supplement.

Proposition 1. For any k ≥ 0, and any P, Q with k moments, ρ(P, Q; Fk) = 0 if and only if P = Q.

Motivating Example. Figure 1 shows the results of a simple simulation comparing the proposed higher-order tests (5), of orders k = 1 through 5, against the usual KS test (corresponding to k = 0). For the simulation setup, we used P = N(0, 1) and Q = N(0, 1.44). For 500 repetitions, called "alternative repetitions", we drew m = 100 samples from P, drew n = 100 samples from Q, and computed test statistics; for another 500 repetitions, called "null repetitions", we drew both sets of samples from P, and again computed test statistics. Then for each test, we varied the rejection threshold, calculated its true positive rate as the fraction of rejections made on the alternative repetitions, and calculated its false positive rate similarly using the null repetitions. The oracle ROC curve corresponds to the likelihood ratio test (which knows the exact distributions P, Q). We can see that the power of the higher-order KS test improves as we increase the order from k = 0 up to k = 3, then stops improving by k = 4, 5.

Figure 1: ROC curves from an experiment comparing the proposed higher-order KS tests in (5) (for various k) to the usual KS test, when P = N(0, 1) and Q = N(0, 1.44).

Figure 2 displays the witness function (which achieves the supremum in (5)) for a large-sample version of the higher-order KS test, across orders k = 0 through 5. We used the same distributions as in Figure 1, but now m = n = 10^4. We will prove in Section 2 that, for the kth order test, the witness function is always a kth degree piecewise polynomial (in fact, a rather simple one, of the form gt(x) = (x − t)_+^k or gt(x) = (t − x)_+^k for a knot t). Recall the underlying distributions P, Q here have different variances, and we can see from their witness functions that all higher-order KS tests choose to put weight on tail differences. Of course, the power of any test is determined by the size of the statistic under the alternative, relative to typical fluctuations under the null. As we place greater and greater weight on the tails, it turns out in this particular setting that we see diminishing returns at k = 4, 5, which means the null fluctuations must be too great.
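The ROC construction used in this example (sweep a rejection threshold over null and alternative replications of a statistic, recording false and true positive rates) can be sketched generically as follows. This is our own illustration with a hypothetical helper name roc_curve, not the authors' simulation code:

```python
import numpy as np

def roc_curve(null_stats, alt_stats):
    """Empirical ROC: for each rejection threshold, the false positive rate is
    the fraction of null replications exceeding it, and the true positive rate
    is the fraction of alternative replications exceeding it."""
    thresholds = np.sort(np.concatenate([null_stats, alt_stats]))[::-1]
    fpr = np.array([(null_stats > t).mean() for t in thresholds])
    tpr = np.array([(alt_stats > t).mean() for t in thresholds])
    return fpr, tpr
```

Feeding in, say, 500 null and 500 alternative replications of any of the test statistics reproduces curves of the kind shown in Figure 1.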
Figure 2: Witness functions (normalized for plotting purposes) for the higher-order KS tests, when P = N(0, 1) and Q = N(0, 1.44). They are always of piecewise polynomial form; and here they all place weight on tail differences.

Summary of Contributions. Our contributions in this work are as follows.

• We develop an exact representer theorem for the higher-order KS test statistic (5). This enables us to compute the test statistic in linear time, for all k ≤ 5. For k ≥ 6, we develop a nearly linear-time approximation to the test statistic.

• We derive the asymptotic null distribution of our higher-order KS test statistic, based on empirical process theory. For k ≥ 6, our approximation to the test statistic has the same asymptotic null.

• We provide concentration tail bounds for the test statistic. Combined with the metric property from Proposition 1, this shows that our higher-order KS test is asymptotically powerful against any pair of fixed, distinct distributions P, Q.

• We perform extensive numerical studies to compare the newly proposed tests with several others.

Other Related Work. Recently, IPMs have been gaining in popularity due in large part to energy distance tests (Szekely and Rizzo, 2004; Baringhaus and Franz, 2004) and kernel maximum mean discrepancy (MMD) tests (Gretton et al., 2012), and in fact, there is an equivalence between the two classes (Sejdinovic et al., 2013). An IPM with a judicious choice of F gives rise to a number of common distances between distributions, such as Wasserstein distance or total variation (TV) distance. While IPMs look at differences dP − dQ, tests based on φ-divergences (such as Kullback-Leibler, or Hellinger) look at ratios dP/dQ, but can be hard to efficiently estimate in practice (Sriperumbudur et al., 2009). The TV distance is the only IPM that is also a φ-divergence, but it is impossible to estimate.

There is also a rich class of nonparametric tests based on graphs. Using minimum spanning trees, Friedman and Rafsky (1979) generalized both the Wald-Wolfowitz runs test and the KS test. Other tests are based on k-nearest neighbors graphs (Schilling, 1986; Henze, 1988) or matchings (Rosenbaum, 2005). The Mann-Whitney-Wilcoxon test has a multivariate generalization using the concept of data depth (Liu and Singh, 1993). Bhattacharya (2016) established that many computationally efficient graph-based tests have suboptimal statistical power, but some inefficient ones have optimal scalings. Different computational-statistical tradeoffs were also discovered for IPMs (Ramdas et al., 2015b). Further, as noted by Janssen (2000) (in the context of one-sample testing), every nonparametric test is essentially powerless in an infinity of directions, and has nontrivial power only against a finite subspace of alternatives. In particular, this implies that no single nonparametric test can uniformly dominate all others; improved power in some directions generally implies weaker power in others. This problem only gets worse in high-dimensional settings (Ramdas et al., 2015a; Arias-Castro et al., 2018). Therefore, the question of which test to use for a given problem must be guided by a combination of simulations, computational considerations, a theoretical understanding of the pros/cons of each test, and a practical understanding of the data at hand.

Outline. In Section 2, we give computational details for the higher-order KS test statistic (5). We derive its asymptotic null in Section 3, and give concentration bounds (for the statistic around the population-level IPM) in Section 4. We give numerical experiments in Section 5, and conclude in Section 6 with a discussion.

2 COMPUTATION

Write T = ρ(Pm, Qn; Fk) for the test statistic in (5). In this section, we derive a representer theorem for T, develop a linear-time algorithm for k ≤ 5, and a nearly linear-time approximation for k ≥ 6.

2.1 Representer Theorem

The higher-order KS test statistic in (5) is defined by an infinite-dimensional maximization over Fk in (4). Fortunately, we can restrict our attention to a simpler function class, as we show next.

Theorem 1. Fix k ≥ 0. Let gt^+(x) = (x − t)_+^k / k! and gt^−(x) = (t − x)_+^k / k! for t ∈ R, where we write (a)_+ = max{a, 0}. For the statistic T defined by (5),

    T = max{ sup_{t ≥ 0} |(Pm − Qn) gt^+|,  sup_{t ≤ 0} |(Pm − Qn) gt^−| }.    (6)

The proof of this theorem uses a key result from Mammen (1991), where it is shown that we can construct a spline interpolant to a given function at given points, such that its higher-order total variation is no larger than that of the original function.

Remark 1. When k = 0, note that for t ≥ 0,

    |(Pm − Qn) gt^+| = | (1/m) Σ_{i=1}^m 1{xi > t} − (1/n) Σ_{i=1}^n 1{yi > t} |
                     = | (1/m) Σ_{i=1}^m 1{xi ≤ t} − (1/n) Σ_{i=1}^n 1{yi ≤ t} |,

and similarly for t ≤ 0, |(Pm − Qn) gt^−| reduces to the same expression in the second line above. As we vary t from −∞ to ∞, this only changes at values t ∈ Z(N), which shows (6) and (1) are the same, i.e., Theorem 1 recovers the equivalence between (2) and (1).

Remark 2. For general k ≥ 0, we can interpret (6) as a comparison between truncated kth order moments of the empirical distributions Pm and Qn. The test statistic T is the maximum over all possible truncation locations t. The critical aspect here is truncation, which makes the higher-order KS test statistic a metric (recall Proposition 1). A comparison of moments, alone, would not be enough to ensure such a property.

Theorem 1 itself does not immediately lead to an algorithm for computing T, as the range of t considered in the suprema is infinite. However, through a bit more work, detailed in the next two subsections, we can obtain an exact linear-time algorithm for all k ≤ 5, and a nearly linear-time approximation for k ≥ 6.

2.2 Linear-Time Algorithm for k ≤ 5

The key fact that we will exploit is that the criterion in (6), as a function of t, is a piecewise polynomial of order k with knots in Z(N). Assume without a loss of generality that z1 < ··· < zN. Also assume without a loss of generality that z1 ≥ 0 (this simplifies notation, and the general case follows by repeating the same arguments separately for the points in Z(N) on either side of 0). Define ci = 1{zi ∈ X(m)}/m − 1{zi ∈ Y(n)}/n, i ∈ [N], and

    φi(t) = (1/k!) Σ_{j=i}^N cj (zj − t)^k,  i ∈ [N].    (7)

Then the statistic in (6) can be succinctly written as

    T = max_{i ∈ [N]} sup_{t ∈ [z_{i−1}, z_i]} |φi(t)|,    (8)

where we let z0 = 0 for convenience. Note each φi(t), i ∈ [N], is a kth degree polynomial. We can compute a representation for these polynomials efficiently.

Lemma 1. Fix k ≥ 0. The polynomials in (7) satisfy the recurrence relations

    φi(t) = (1/k!) ci (zi − t)^k + φ_{i+1}(t),  i ∈ [N]

(where φ_{N+1} = 0). Given the monomial expansion

    φ_{i+1}(t) = Σ_{ℓ=0}^k a_{i+1,ℓ} t^ℓ,

we can compute an expansion for φi, with coefficients a_{i,ℓ}, ℓ ∈ {0} ∪ [k], in O(1) time. So we can compute all coefficients a_{i,ℓ}, i ∈ [N], ℓ ∈ {0} ∪ [k], in O(N) time.

To compute T in (8), we must maximize each polynomial φi over its domain [z_{i−1}, z_i], for i ∈ [N], and then compare maxima. Once we have computed a representation for these polynomials, as Lemma 1 ensures we can do in O(N) time, we can use this to analytically maximize each polynomial over its domain, provided the order k is small enough. Of course, maximizing a polynomial over an interval can be reduced to computing the roots of its derivative, which is an analytic computation for any k ≤ 5 (since the roots of any quartic have a closed form, see, e.g., Rosen 1995). The next result summarizes.

Proposition 2. For any 0 ≤ k ≤ 5, the test statistic in (8) can be computed in O(N) time.

Maximizing a polynomial of degree k ≥ 6 is not generally possible in closed form. However, developments in semidefinite optimization allow us to approximate its maximum efficiently, investigated next.

2.3 Linear-Time Approximation for k ≥ 6

Seminal work of Shor (1998) and Nesterov (2000) shows that the problem of maximizing a polynomial over an interval can be cast as a semidefinite program (SDP). The number of variables in this SDP depends only on the polynomial order k, and all constraint functions are self-concordant. Using, say, an interior point method to solve this SDP therefore leads to the following result.

Proposition 3. Fix k ≥ 6 and ε > 0. For each polynomial in (7), we can compute an ε-approximation to its maximum in c_k log(1/ε) time, for a constant c_k > 0 depending only on k. As we can compute a representation for all these polynomials in O(N) time (Lemma 1), this means we can compute an ε-approximation to the statistic in (6) in O(N log(1/ε)) time.
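To make Lemma 1 and the maximization in (8) concrete, here is a sketch in Python. It is our own illustration with hypothetical names, assumes sorted knots z1 < ··· < zN with z1 ≥ 0 as in the text, and substitutes numerical root-finding of the derivative for the closed-form root formulas available when k ≤ 5:

```python
import numpy as np
from math import comb, factorial

def higher_order_ks(z, c, k):
    """Evaluate T in (8) given sorted knots z_1 < ... < z_N (all >= 0) and
    weights c_i = 1{z_i in X}/m - 1{z_i in Y}/n.  Runs the backward recurrence
    of Lemma 1 to build the monomial coefficients of phi_i, then maximizes
    |phi_i| over [z_{i-1}, z_i] via the roots of phi_i'."""
    N = len(z)
    a = np.zeros(k + 1)                     # coefficients of phi_{i+1}; phi_{N+1} = 0
    left = np.concatenate([[0.0], z[:-1]])  # interval left endpoints, with z_0 = 0
    best = 0.0
    for i in range(N - 1, -1, -1):
        # phi_i(t) = c_i (z_i - t)^k / k! + phi_{i+1}(t), expanded in powers of t
        for l in range(k + 1):
            a[l] += c[i] * comb(k, l) * z[i] ** (k - l) * (-1) ** l / factorial(k)
        cands = [left[i], z[i]]             # endpoints of [z_{i-1}, z_i]
        if k >= 2:                          # interior stationary points of phi_i
            for r in np.polynomial.polynomial.Polynomial(a).deriv().roots():
                if abs(r.imag) < 1e-10 and left[i] <= r.real <= z[i]:
                    cands.append(r.real)
        best = max(best, max(abs(np.polynomial.polynomial.polyval(t, a)) for t in cands))
    return best
```

For k = 0 this reduces to the usual KS statistic over nonnegative samples; each iteration does O(1) work in k, matching the O(N) claim of Proposition 2.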

Remark 3. Let T_ε denote the ε-approximation from Proposition 3. Under the null P = Q, we would need to have ε = o(1/√N) in order for the approximation T_ε to share the asymptotic null distribution of T, as we will see in Section 3.3. Taking, say, ε = 1/N, the statistic T_{1/N} requires O(N log N) computational time, and this is why in various places we make reference to a nearly linear-time approximation when k ≥ 6.

2.4 Simple Linear-Time Approximation

We conclude this section by noting a simple approximation to (6), given by

    T* = max{ max_{t ∈ Z'(N), t ≥ 0} |(Pm − Qn) gt^+|,  max_{t ∈ Z'(N), t ≤ 0} |(Pm − Qn) gt^−| },    (9)

where Z'(N) = {0} ∪ Z(N). Clearly, for k = 0 or 1, the maximizing t in (6) must be one of the sample points Z(N), so T* = T and there is no approximation error in (9). For k ≥ 2, we can control the error as follows.

Lemma 2. For k ≥ 2, the statistics in (6), (9) satisfy

    T − T* ≤ (δN / (k−1)!) ( (1/m) Σ_{i=1}^m |xi|^{k−1} + (1/n) Σ_{i=1}^n |yi|^{k−1} ),

where δN is the maximum gap between sorted points in Z'(N).

Remark 4. We would need to have δN = o_P(1/√N) in order for T* to share the asymptotic null of T, see again Section 3.3 (this is assuming that P has k − 1 moments, so the sample moments concentrate for large enough N). This will not be true of δN, the maximum gap, in general. But it does hold when P is continuous, having compact support, and a density bounded from below on its support; here, in fact, δN = O_P(log N / N) (see, e.g., Wang et al. 2014).

Although it does not have the strong guarantees of the approximation from Proposition 3, the statistic in (9) is simple and efficient: it can be computed in O(N) linear time, as a consequence of Lemma 1 (the evaluations of φi(t) at the sample points t ∈ Z(N) are the constant terms a_{i,0}, i ∈ [N], in their monomial expansions), and is likely a good choice for most practical purposes.

3 ASYMPTOTIC NULL

To study the asymptotic null distribution of the proposed higher-order KS test, we will appeal to uniform central limit theorems (CLTs) from the empirical process theory literature, reviewed here for completeness.

For functions f, g in a class F, let G_{P,F} denote a Gaussian process indexed by F with mean and covariance

    E(G_{P,F} f) = 0,  f ∈ F,
    Cov(G_{P,F} f, G_{P,F} g) = Cov_{X~P}(f(X), g(X)),  f, g ∈ F.

For functions l, u, let [l, u] denote the set of functions {f : l(x) ≤ f(x) ≤ u(x), for all x}. Call [l, u] a bracket of size ‖u − l‖_2, where ‖·‖_2 denotes the L2(P) norm, defined as

    ‖f‖_2^2 = ∫ f(x)^2 dP(x).

Finally, let N_[](ε, ‖·‖_2, F) be the smallest number of ε-sized brackets that are required to cover F. Define the bracketing integral of F as

    J_[](‖·‖_2, F) = ∫_0^1 √(log N_[](ε, ‖·‖_2, F)) dε.

Note that this is finite when log N_[](ε, ‖·‖_2, F) grows slower than 1/ε^2. We now state an important uniform CLT from empirical process theory.

Theorem 2 (Theorem 11.1.1 in Dudley 1999). If F is a class of functions with finite bracketing integral, then when P = Q and m, n → ∞, the process

    √(mn/(m+n)) {Pm f − Qn f}_{f ∈ F}

converges weakly to the Gaussian process G_{P,F}. Hence,

    √(mn/(m+n)) sup_{f ∈ F} |Pm f − Qn f| →_d sup_{f ∈ F} |G_{P,F} f|.

3.1 Bracketing Integral Calculation

To derive the asymptotic null of the higher-order KS test, based on its formulation in (5) and Theorem 2, we would need to bound the bracketing integral of Fk. While there are well-known entropy (log covering) number bounds for related function classes (e.g., Birman and Solomyak 1967; Babenko 1979), and the conversion from covering to bracketing numbers is standard, these results unfortunately require the function class to be uniformly bounded in the sup norm, which is certainly not true of Fk.

Note that the representer result in (6) can be written as T = ρ(Pm, Qn; Gk), where

    Gk = {gt^+ : t ≥ 0} ∪ {gt^− : t ≤ 0}.    (10)

We can hence instead apply Theorem 2 to Gk, whose bracketing number can be bounded by direct calculation, assuming enough moments on P.

Lemma 3. Fix k ≥ 0. Assume E_{X~P}|X|^{2k+δ} ≤ M < ∞, for some δ > 0. For the class Gk in (10), there is a constant C > 0 depending only on k, δ such that

    log N_[](ε, ‖·‖_2, Gk) ≤ C log( M^{1 + δ(k−1)/(2k+δ)} / ε^{2+δ} ).
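Returning to the simple approximation T* in (9): it only requires evaluating the truncated-moment gaps |(Pm − Qn) gt^±| at the candidate points t ∈ {0} ∪ Z(N). A minimal NumPy sketch (our own, for k ≥ 1, with a hypothetical function name):

```python
import numpy as np
from math import factorial

def simple_approx_stat(x, y, k):
    """T* in (9) for k >= 1: evaluate |(Pm - Qn) g_t^{+/-}| only at the
    candidate truncation points t in {0} union Z_(N), and take the maximum."""
    t = np.concatenate([[0.0], x, y])[:, None]       # candidate t, as a column
    def gap(feats_x, feats_y):                       # |Pm g_t - Qn g_t| per t
        return np.abs(feats_x.mean(axis=1) - feats_y.mean(axis=1))
    pos = gap(np.clip(x[None, :] - t, 0, None) ** k,   # g_t^+(x) = (x - t)_+^k / k!
              np.clip(y[None, :] - t, 0, None) ** k) / factorial(k)
    neg = gap(np.clip(t - x[None, :], 0, None) ** k,   # g_t^-(x) = (t - x)_+^k / k!
              np.clip(t - y[None, :], 0, None) ** k) / factorial(k)
    t = t.ravel()
    return max(pos[t >= 0].max(), neg[t <= 0].max())
```

The vectorized form above is O(N^2); the O(N) evaluation mentioned in the text instead reads off the constant terms a_{i,0} from the recurrence of Lemma 1.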

3.2 Asymptotic Null for Higher-Order KS

Applying Theorem 2 and Lemma 3 to the higher-order KS test statistic (6) leads to the following result.

Theorem 3. Fix k ≥ 0. Assume E_{X~P}|X|^{2k+δ} < ∞, for some δ > 0. When P = Q, the test statistic in (6) satisfies, as m, n → ∞,

    √(mn/(m+n)) T →_d sup_{g ∈ Gk} |G_{P,k} g|,

where G_{P,k} is an abbreviation for the Gaussian process indexed by the function class Gk in (10).

Remark 5. When k = 0, note that for t ≥ s ≥ 0, the covariance function is

    Cov_{X~P}(1{X > s}, 1{X > t}) = F_P(s)(1 − F_P(t)),

where F_P denotes the CDF of P. For s ≤ t ≤ 0, the covariance function is again equal to F_P(s)(1 − F_P(t)). The supremum of this Gaussian process over t ∈ R is that of a Brownian bridge, so Theorem 3 recovers the well-known asymptotic null distribution of the KS test, which (remarkably) does not depend on P.

Remark 6. When k ≥ 1, it is not clear how strongly the supremum of the Gaussian process from Theorem 3 depends on P; it appears it must depend on the first k moments of P, but it is not clear whether it depends only on these moments. Section 5 investigates this empirically. Currently, we do not have a precise understanding of whether the asymptotic null is usable in practice, and we suggest using a permutation null instead.

3.3 Asymptotic Null Under Approximation

The approximation from Proposition 3 shares the same asymptotic null, provided ε > 0 is small enough.

Corollary 1. Fix k ≥ 0. Assume E_{X~P}|X|^{2k+δ} < ∞, for some δ > 0. When P = Q, as m, n → ∞ such that m/n converges to a positive constant, the test statistic T_ε from Proposition 3 converges at a √N-rate to the supremum of the same Gaussian process in Theorem 3, provided ε = o(1/√N).

The approximation in (9) shares the same asymptotic null, provided P is continuous with compact support.

Corollary 2. Fix k ≥ 0. Assume that P is continuous, compactly supported, with density bounded from below on its support. When P = Q, as m, n → ∞ such that m/n converges to a positive constant, the test statistic T* in (9) converges at a √N-rate to the supremum of the same Gaussian process in Theorem 3.

4 TAIL CONCENTRATION

We examine the convergence of our test statistics to their population analogs. In general, if the population-level IPM ρ(P, Q; Fk) is large, then the concentration bounds below will imply that the empirical statistic ρ(Pm, Qn; Fk) will be large for m, n sufficiently large, and the test will have power.

We first review the necessary machinery, again from empirical process theory. For p ≥ 1, and a function f of a random variable X ~ P, recall the Lp(P) norm is defined as ‖f‖_p = [E(f(X)^p)]^{1/p}. For p > 0, recall the exponential Orlicz norm of order p is defined as

    ‖f‖_{Ψp} = inf{ t > 0 : E[exp(|f(X)|^p / t^p)] − 1 ≤ 1 }.

(These norms depend on the measure P, since they are defined in terms of expectations with respect to X ~ P, though this is not explicit in our notation.) We now state an important concentration result.

Theorem 4 (Theorems 2.14.2 and 2.14.5 in van der Vaart and Wellner 1996). Let F be a class of functions with an envelope function F, i.e., |f| ≤ F for all f ∈ F. Define

    W = √n sup_{f ∈ F} |Pn f − Pf|,

and abbreviate J = J_[](‖·‖_2, F). For p ≥ 2, if ‖F‖_p < ∞, then for a constant c1 > 0,

    [E(W^p)]^{1/p} ≤ c1 ( ‖F‖_2 J + n^{−1/2+1/p} ‖F‖_p ),

and for 0 < p ≤ 1, if ‖F‖_{Ψp} < ∞, then for a constant c2 > 0,

    ‖W‖_{Ψp} ≤ c2 ( ‖F‖_2 J + n^{−1/2} (1 + log n)^{1/p} ‖F‖_{Ψp} ).

The two-sample test statistic T = ρ(Pm, Qn; Gk) satisfies (by a simple argument using convexity)

    |T − ρ(P, Q; Fk)| ≤ ρ(P, Pm; Fk) + ρ(Q, Qn; Fk).

The terms on the right-hand side can each be bounded by Theorem 4, where we can use the envelope function F(x) = |x|^k / k! for Gk. Using Markov's inequality, we can then get a tail bound on the statistic.

Theorem 5. Fix k ≥ 0. Assume that P, Q both have p moments, where p ≥ 2 and p > 2k. For the statistic in (6), for any α > 0, with probability 1 − α,

    |T − ρ(P, Q; Gk)| ≤ c(α) (1/√m + 1/√n),

where c(α) = c0 α^{−1/p}, and c0 > 0 is a constant. If P, Q both have finite exponential Orlicz norms of order 0 < p ≤ 1, then the above holds for c(α) = c0 (log(1/α))^{1/p}.

When we assume k moments, the population IPM for Fk also has a representer in Gk; by Proposition 1, this implies ρ(·, · ; Gk) is also a metric.
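Since Remark 6 suggests calibrating the test with a permutation null rather than the asymptotic one, a generic permutation routine suffices in practice. The sketch below is our own illustration with a hypothetical name; it works for any two-sample statistic, including T or T*:

```python
import numpy as np

def permutation_pvalue(x, y, stat, B=1000, seed=0):
    """Permutation p-value for any two-sample statistic stat(x, y): pool the
    samples, re-split at random B times, and count splits whose statistic is at
    least as large as the observed one (with the +1 finite-sample correction)."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    observed = stat(x, y)
    count = sum(stat(p[:len(x)], p[len(x):]) >= observed
                for p in (rng.permutation(z) for _ in range(B)))
    return (1 + count) / (1 + B)
```

Rejecting when the p-value is below a level α gives an exact finite-sample type I error guarantee, regardless of k or of the underlying distribution.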

Corollary 3. Fix k ≥ 0. Assuming P, Q both have k moments, ρ(P, Q; Fk) = ρ(P, Q; Gk). Therefore, by Proposition 1, ρ(·, · ; Gk) is a metric (over the space of distributions P, Q with k moments).

Putting this metric property together with Theorem 5 gives the following.

Corollary 4. Fix k ≥ 0. For αN = o(1) and 1/αN = o(N^{p/2}), reject when the higher-order KS test statistic (6) satisfies T > c(αN)(1/√m + 1/√n), where c(·) is as in Theorem 5. For any P, Q that meet the moment conditions of Theorem 5, as m, n → ∞ in such a way that m/n approaches a positive constant, we have type I error tending to 0, and power tending to 1, i.e., the higher-order KS test is asymptotically powerful.

5 NUMERICAL EXPERIMENTS

We present numerical experiments that examine the convergence of our test statistic to its asymptotic null, its power relative to other general-purpose nonparametric tests, and its power when P, Q have densities with local differences. Experiments comparing to the MMD test with a polynomial kernel are deferred to the supplement, for space reasons.

Convergence to Asymptotic Null. In Figure 3, we plot histograms of finite-sample higher-order KS test statistics and their asymptotic null distributions, for k = 1, 2. We considered both P = N(0, 1) and P = Unif(−√3, √3) (the uniform distribution standardized to have mean 0 and variance 1). For a total of 1000 repetitions, we drew two sets of samples from P, each of size m = n = 2000, then computed the test statistics. For a total of 1000 times, we also approximated the supremum of the Gaussian process from Theorem 3 via discretization. We see that the finite-sample statistics adhere closely to their asymptotic distributions. Interestingly, we also see that the distributions look roughly similar across all four cases considered. Future work will examine this more thoroughly.

Figure 3: Histograms comparing finite-sample test statistics to their asymptotic null distribution (panels: Normal k=1, Uniform k=1, Normal k=2, Uniform k=2).

Comparison to General-Purpose Tests. In Figures 4 and 5, we compare the higher-order KS tests to the KS test, and to other widely-used nonparametric tests from the literature: the kernel maximum mean discrepancy (MMD) test (Gretton et al., 2012) with a Gaussian kernel, the energy distance test (Szekely and Rizzo, 2004), and the Anderson-Darling test (Anderson and Darling, 1954). The simulation setup is the same as that in the introduction, where we considered P, Q with different variances, except here we study different means: P = N(0, 1), Q = N(0.2, 1), and different fourth moments: P = N(0, 1), Q = Lap(0, 1/√2). The higher-order KS tests generally perform favorably, and in each setting there is a choice of k that yields better power than KS. In the mean difference setting, this is k = 1, and the power degrades for k = 3, 5, likely because these tests are "smoothing out" the mean difference too much; see Proposition 4.

Local Density Differences. In Figures 6 and 7, we examine the higher-order KS tests and the KS test, in cases where P, Q have densities p, q such that p − q has sharp local changes, and m = n = 100. Figure 6 displays a case where p − q is piecewise constant with a few short departures from 0 (see the supplement for a plot). The KS test has large power, and the higher-order KS tests all perform poorly; in fact, the KS test here is more powerful than all of the commonly-used nonparametric tests that we tried (not shown).

Figure 7 displays a case where p − q changes sharply in the right tail (see the supplement for a plot) and m = n = 2000. The power of the higher-order KS tests appears to increase with k, likely because the witness functions are able to better concentrate on sharp departures for large k.

6 DISCUSSION

This paper began by noting the variational characterization of the classical KS test as an IPM with respect to functions of bounded total variation, and then proposed a generalization to higher-order total variation classes. This generalization was nontrivial, with subtleties arising in defining the right class of functions so that the statistic was finite and amenable for simplification via a representer result, challenges in computing the statistic efficiently, and challenges in studying asymptotic convergence and concentration.
A Higher-Order Kolmogorov-Smirnov Test

[Figures 4-7: ROC curves, plotting true positive rate against false positive rate.]

Figure 4: ROC curves for P = N(0, 1), Q = N(0.2, 1). Curves shown: KS k = 0, 1, 3, 5; MMD-RBF; energy distance; Anderson-Darling; oracle.

Figure 5: ROC curves for P = N(0, 1), Q = Lap(0, 1/√2). Curves shown: as in Figure 4.

Figure 6: ROC curves for piecewise constant p − q. Curves shown: KS k = 0, 1, 2, 3, 4, 5; oracle.

Figure 7: ROC curves for tail departure in p − q. Curves shown: as in Figure 6.
function class is not uniformly sup norm bounded. The resulting class of linear-time higher-order KS tests was shown empirically to be more sensitive to tail differences than the usual KS test, and to have competitive power relative to several other popular tests.

In future work, we intend to more formally study the power properties of our new higher-order tests relative to the KS test. The following is a lead in that direction. For k ≥ 1, define I^k to be the kth-order integral operator, acting on a function f, via

    (I^k f)(x) = ∫_0^x ∫_0^{t_k} ··· ∫_0^{t_2} f(t_1) dt_1 dt_2 ··· dt_k.

Denote by F_P, F_Q the CDFs of the distributions P, Q. Notice that the population-level KS test statistic can be written as ρ(P, Q; F_0) = ‖F_P − F_Q‖_∞, where ‖·‖_∞ is the sup norm. Interestingly, a similar representation holds for the higher-order KS tests.

Proposition 4. Assuming P, Q have k moments,

    ρ(P, Q; F_k) = ‖(I^k)^* (F_P − F_Q)‖_∞,

where (I^k)^* is the adjoint of the bounded linear operator I^k, with respect to the usual L^2 inner product. Further, if P, Q are supported on [0, ∞), or their first k moments match, then we have the more explicit representation

    ρ(P, Q; F_k) = sup_{x ∈ R} ∫_x^∞ ∫_{t_k}^∞ ··· ∫_{t_2}^∞ (F_P − F_Q)(t_1) dt_1 dt_2 ··· dt_k.

The representation in Proposition 4 could provide one avenue for power analysis. When P, Q are supported on [0, ∞), or have k matching moments, the representation is particularly simple in form. This form confirms the intuition that detecting higher-order moment differences is hard: as k increases, the k-times integrated CDF difference F_P − F_Q becomes smoother, and hence the differences are less accentuated.

In future work, we also intend to further examine the asymptotic null of the higher-order KS test (the Gaussian process from Theorem 3), and determine to what extent it depends on the underlying distribution P (beyond say, its first k moments). Lastly, some ideas in this paper seem extendable to the multivariate and graph settings, another direction for future work.

Acknowledgments. We thank Alex Smola for several early inspiring discussions. VS and RT were supported by NSF Grant DMS-1554123.
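To make the representation in Proposition 4 concrete, the following is a small numerical sketch of our own (not the paper's linear-time algorithm): it approximates ‖(I^k)^*(F_P − F_Q)‖_∞ from two samples by evaluating the empirical CDF difference on a grid and repeatedly integrating it from the right; the grid resolution and the function name higher_order_ks_stat are our choices here, and a grid approximation is used purely for illustration.

```python
import numpy as np

def higher_order_ks_stat(x, y, k, grid_size=10_000):
    """Grid approximation of || (I^k)^* (F_P - F_Q) ||_inf from two samples.

    Illustrative sketch only: evaluates the empirical CDF difference on a
    uniform grid, then integrates it k times "from the right" (from s to the
    top of the grid), and returns the sup norm of the result.
    """
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    lo, hi = min(x[0], y[0]), max(x[-1], y[-1])
    t = np.linspace(lo, hi, grid_size)
    dt = t[1] - t[0]
    # Empirical CDFs evaluated on the grid.
    F = np.searchsorted(x, t, side="right") / len(x)
    G = np.searchsorted(y, t, side="right") / len(y)
    D = F - G  # beyond the grid, both ECDFs agree (0 or 1), so D = 0 there
    # k-fold right integration: D_j(s) = integral of D_{j-1} over [s, hi].
    for _ in range(k):
        D = np.cumsum(D[::-1])[::-1] * dt
    return np.max(np.abs(D))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
# Lap(0, 1/sqrt(2)) matches the first two moments of N(0, 1), as in Figure 5.
y = rng.laplace(0.0, 1.0 / np.sqrt(2), size=500)
print(higher_order_ks_stat(x, y, k=0))
print(higher_order_ks_stat(x, y, k=2))
```

For k = 0 the loop does not run and this reduces to a gridded version of the classical KS statistic, sup_x |F_P(x) − F_Q(x)|; increasing k illustrates the smoothing effect of repeated integration on the CDF difference discussed above.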
Veeranjaneyulu Sadhanala, Yu-Xiang Wang, Aaditya Ramdas, Ryan J. Tibshirani

References

Theodore W. Anderson and Donald A. Darling. Asymptotic theory of certain goodness of fit criteria based on stochastic processes. Annals of Mathematical Statistics, 23(2):193-212, 1952.

Theodore W. Anderson and Donald A. Darling. A test of goodness of fit. Journal of the American Statistical Association, 49(268):765-769, 1954.

Ery Arias-Castro, Bruno Pelletier, and Venkatesh Saligrama. Remember the curse of dimensionality: the case of goodness-of-fit testing in arbitrary dimension. Journal of Nonparametric Statistics, 30(2):448-471, 2018.

K. Babenko. Theoretical Foundations and Construction of Numerical Algorithms for the Problems of Mathematical Physics. 1979. In Russian.

Ludwig Baringhaus and Carsten Franz. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1):190-206, 2004.

Bhaswar B. Bhattacharya. Power of graph-based two-sample tests. PhD thesis, Stanford University, 2016.

M. Birman and M. Solomyak. Piecewise-polynomial approximations of functions of the classes W_p^α. Mathematics of the USSR-Sbornik, 73(115):331-335, 1967. In Russian.

Maurice C. Bryson. Heavy-tailed distributions: Properties and tests. Technometrics, 16(1):61-68, 1974.

Fred Calitz. An alternative to the Kolmogorov-Smirnov test for goodness of fit. Communications in Statistics: Theory and Methods, 16(12):3519-3534, 1987.

Richard M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.

Jerome H. Friedman and Lawrence C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Annals of Statistics, 7(4):697-717, 1979.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Scholkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012.

Norbert Henze. A multivariate two-sample test based on the number of nearest neighbor type coincidences. Annals of Statistics, 16(2):772-783, 1988.

Arnold Janssen. Global power functions of goodness of fit tests. Annals of Statistics, 28(1):239-253, 2000.

Andrey Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4:83-91, 1933.

Regina Y. Liu and Kesar Singh. A quality index based on data depth and multivariate rank tests. Journal of the American Statistical Association, 88(421):252-260, 1993.

Enno Mammen. Nonparametric regression under qualitative smoothness assumptions. Annals of Statistics, 19(2):741-759, 1991.

David M. Mason and John H. Schuenemeyer. A modified Kolmogorov-Smirnov test sensitive to tail alternatives. Annals of Statistics, 11(3):933-946, 1983.

Alfred Muller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

Yurii Nesterov. Squared Functional Systems and Optimization Problems, pages 405-440. Springer, 2000.

Aaditya Ramdas, Sashank Reddi, Barnabas Poczos, Aarti Singh, and Larry Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. Twenty-Ninth Conference on Artificial Intelligence, pages 3571-3577, 2015a.

Aaditya Ramdas, Sashank Reddi, Barnabas Poczos, Aarti Singh, and Larry Wasserman. Adaptivity and computation-statistics tradeoffs for kernel and distance based high dimensional two sample testing. arXiv preprint arXiv:1508.00655, 2015b.

Michael I. Rosen. Niels Hendrik Abel and equations of the fifth degree. The American Mathematical Monthly, 102(6):495-505, 1995.

Paul R. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B, 67(4):515-530, 2005.

Mark F. Schilling. Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association, 81(395):799-806, 1986.

Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263-2291, 2013.

Naum Z. Shor. Nondifferentiable Optimization and Polynomial Problems. Nonconvex Optimization and Its Applications. Springer, 1998.

Nikolai Smirnov. Table for estimating the goodness of fit of empirical distributions. Annals of Mathematical Statistics, 19(2):279-281, 1948.

Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Scholkopf, and Gert R. G. Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.

Gabor J. Szekely and Maria L. Rizzo. Testing for equal distributions in high dimension. InterStat, 5(16.10):1249-1272, 2004.
Aad van der Vaart and Jon Wellner. Weak Convergence. Springer, 1996.

Yu-Xiang Wang, Alexander Smola, and Ryan J. Tibshirani. The falling factorial basis and its statistical applications. International Conference on Machine Learning, 31, 2014.