In words, this measures the maximum absolute difference between the empirical cumulative distribution functions (CDFs) of X(m) and Y(n), across all points in the joint sample Z(m+n). Naturally, the two-sample

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

A Higher-Order KS Test. Our test statistic has the form of an integral probability metric (IPM). For a function class F, the IPM between distributions P and Q, with respect to F, is defined as (Müller, 1997)

    ρ(P, Q; F) = sup_{f ∈ F} |Pf − Qf|,    (3)

where we define the expectation operators P, Q by Pf = E_{X∼P}[f(X)] and Qf = E_{Y∼Q}[f(Y)].
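To make the k = 0 instance of (3) concrete: with indicator witnesses f_t(u) = 1{u ≤ t}, the empirical IPM is exactly the classical two-sample KS statistic. A minimal numerical check (our own illustration; assumes NumPy and SciPy are available):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100)  # sample from P
y = rng.normal(0.0, 1.2, size=100)  # sample from Q

# Empirical IPM over indicator witnesses f_t(u) = 1{u <= t}:
# the supremum is attained at a point of the joint sample.
z = np.concatenate([x, y])
ipm = max(abs(np.mean(x <= t) - np.mean(y <= t)) for t in z)

# This coincides with the classical max-CDF-difference KS statistic.
assert np.isclose(ipm, ks_2samp(x, y).statistic)
```

Here `ks_2samp` computes the maximum absolute difference of the two empirical CDFs, which matches the indicator-witness supremum up to floating point.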
A Higher-Order Kolmogorov-Smirnov Test
Veeranjaneyulu Sadhanala, Yu-Xiang Wang, Aaditya Ramdas, Ryan J. Tibshirani

For a given function class F, the IPM ρ(·, · ; F) is a pseudometric on the space of distributions. Note that the KS test in (2) is precisely ρ(Pm, Qn; F0), where Pm, Qn are the empirical distributions of X(m), Y(n), respectively, and F0 = {f : TV(f) ≤ 1}.

Consider an IPM given by replacing F0 with Fk = {f : TV(f^(k)) ≤ 1}, for an integer k ≥ 1 (where we write f^(k) for the kth weak derivative of f). Some motivation is as follows. In the case k = 0, we know that the witness functions in the KS test (2), i.e., the functions in F0 that achieve the supremum, are piecewise constant step functions (cf. the equivalent representation (1)). These functions can only have so much action in the tails. By moving to Fk, which is essentially comprised of kth order antiderivatives of functions in F0, we should expect that the witness functions over Fk are kth order antiderivatives of piecewise constant functions, i.e., kth degree piecewise polynomial functions, which can have much more sensitivity in the tails.

But simply replacing F0 by Fk and proposing to compute the resulting IPM is problematic; we must further restrict the growth of its elements. While there are different ways to do this, not all result in computable IPMs. The approach we take yields an exact representer theorem (generalizing the equivalence between (1) and (2)). Define

    Fk = {f : TV(f^(k)) ≤ 1,
          f^(j)(0) = 0, j ∈ {0} ∪ [k − 1],
          f^(k)(0+) = 0 or f^(k)(0−) = 0}.    (4)

Here f^(k)(0+) and f^(k)(0−) denote one-sided limits at 0 from above and below, respectively. Informally, the functions in Fk are pinned down at 0, with all lower-order derivatives (and the limiting kth derivative from the right or left) equal to 0, which limits their growth. Now we define the kth-order KS test statistic as

    ρ(Pm, Qn; Fk) = sup_{f ∈ Fk} |Pm f − Qn f|.    (5)

An important remark is that for k = 0, this recovers the original KS test statistic (2), because F0 contains all step functions of the form gt(x) = 1{x ≤ t}, t ≥ 0.

Another important remark is that for any k ≥ 0, the function class Fk in (4) is "rich enough" to make the IPM in (5) a metric. We state this formally next; its proof, as with all other proofs, is in the supplement.

Proposition 1. For any k ≥ 0, and any P, Q with k moments, ρ(P, Q; Fk) = 0 if and only if P = Q.

Motivating Example. Figure 1 shows the results of a simple simulation comparing the proposed higher-order tests (5), of orders k = 1 through 5, against the usual KS test (corresponding to k = 0). For the simulation setup, we used P = N(0, 1) and Q = N(0, 1.44). For 500 repetitions, called "alternative repetitions", we drew m = 100 samples from P, drew n = 100 samples from Q, and computed test statistics; for another 500 repetitions, called "null repetitions", we drew both sets of samples from P, and again computed test statistics. Then for each test, we varied the rejection threshold, calculated its true positive rate as the fraction of rejections made on the alternative repetitions, and calculated its false positive rate similarly using the null repetitions. The oracle ROC curve corresponds to the likelihood ratio test (which knows the exact distributions P, Q). We can see that the power of the higher-order KS test improves as we increase the order from k = 0 up to k = 3, then stops improving by k = 4, 5.

Figure 1: ROC curves from an experiment comparing the proposed higher-order KS tests in (5) (for various k) to the usual KS test, when P = N(0, 1) and Q = N(0, 1.44).

Figure 2 displays the witness function (which achieves the supremum in (5)) for a large-sample version of the higher-order KS test, across orders k = 0 through 5. We used the same distributions as in Figure 1, but now n = m = 10^4. We will prove in Section 2 that, for the kth order test, the witness function is always a kth degree piecewise polynomial (in fact, a rather simple one, of the form gt(x) = (x − t)^k_+ or gt(x) = (t − x)^k_+ for a knot t). Recall the underlying distributions P, Q here have different variances, and we can see from their witness functions that all higher-order KS tests choose to put weight on tail differences. Of course, the power of any test is determined by the size of the statistic under the alternative, relative to typical fluctuations under the null. As we place greater and greater weight on tails, it turns out in this particular setting that we see diminishing returns at k = 4, 5, which means the null fluctuations must be too great.
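The simulation just described can be sketched for the plain KS test (k = 0); the higher-order statistics of (5) would slot into the same loop. A minimal illustration (our own code, assuming NumPy/SciPy, with repetitions reduced from the paper's 500 to keep it fast):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
m = n = 100
reps = 200

# Null repetitions: both samples from P = N(0, 1).
null_stats = np.array([
    ks_2samp(rng.normal(0, 1, m), rng.normal(0, 1, n)).statistic
    for _ in range(reps)])
# Alternative repetitions: P = N(0, 1) vs Q = N(0, 1.44) (sd 1.2).
alt_stats = np.array([
    ks_2samp(rng.normal(0, 1, m), rng.normal(0, 1.2, n)).statistic
    for _ in range(reps)])

# Sweep the rejection threshold to trace out the ROC curve.
thresholds = np.sort(np.concatenate([null_stats, alt_stats]))
fpr = np.array([(null_stats > t).mean() for t in thresholds])
tpr = np.array([(alt_stats > t).mean() for t in thresholds])
```

Plotting `tpr` against `fpr` reproduces a single (k = 0) curve of the Figure 1 style.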
Figure 2: Witness functions (normalized for plotting purposes) for the higher-order KS tests, when P = N(0, 1) and Q = N(0, 1.44). They are always of piecewise polynomial form; and here they all place weight on tail differences.

Summary of Contributions. Our contributions in this work are as follows.

• We develop an exact representer theorem for the higher-order KS test statistic (5). This enables us to compute the test statistic in linear time, for all k ≤ 5. For k ≥ 6, we develop a nearly linear-time approximation to the test statistic.

• We derive the asymptotic null distribution of our higher-order KS test statistic, based on empirical process theory. For k ≥ 6, our approximation to the test statistic has the same asymptotic null.

• We provide concentration tail bounds for the test statistic. Combined with the metric property from Proposition 1, this shows that our higher-order KS test is asymptotically powerful against any pair of fixed, distinct distributions P, Q.

• We perform extensive numerical studies to compare the newly proposed tests with several others.

Other Related Work. Recently, IPMs have been gaining in popularity due in large part to energy distance tests (Szekely and Rizzo, 2004; Baringhaus and Franz, 2004) and kernel maximum mean discrepancy (MMD) tests (Gretton et al., 2012), and in fact, there is an equivalence between the two classes (Sejdinovic et al., 2013). An IPM with a judicious choice of F gives rise to a number of common distances between distributions, such as Wasserstein distance or total variation (TV) distance. While IPMs look at differences dP − dQ, tests based on φ-divergences (such as Kullback-Leibler, or Hellinger) look at ratios dP/dQ, but can be hard to efficiently estimate in practice (Sriperumbudur et al., 2009). The TV distance is the only IPM that is also a φ-divergence, but it is impossible to estimate.

There is also a rich class of nonparametric tests based on graphs. Using minimum spanning trees, Friedman and Rafsky (1979) generalized both the Wald-Wolfowitz runs test and the KS test. Other tests are based on k-nearest neighbors graphs (Schilling, 1986; Henze, 1988) or matchings (Rosenbaum, 2005). The Mann-Whitney-Wilcoxon test has a multivariate generalization using the concept of data depth (Liu and Singh, 1993). Bhattacharya (2016) established that many computationally efficient graph-based tests have suboptimal statistical power, but some inefficient ones have optimal scalings. Different computational-statistical tradeoffs were also discovered for IPMs (Ramdas et al., 2015b). Further, as noted by Janssen (2000) (in the context of one-sample testing), every nonparametric test is essentially powerless in an infinity of directions, and has nontrivial power only against a finite subspace of alternatives. In particular, this implies that no single nonparametric test can uniformly dominate all others; improved power in some directions generally implies weaker power in others. This problem only gets worse in high-dimensional settings (Ramdas et al., 2015a; Arias-Castro et al., 2018). Therefore, the question of which test to use for a given problem must be guided by a combination of simulations, computational considerations, a theoretical understanding of the pros/cons of each test, and a practical understanding of the data at hand.

Outline. In Section 2, we give computational details for the higher-order KS test statistic (5). We derive its asymptotic null in Section 3, and give concentration bounds (for the statistic around the population-level IPM) in Section 4. We give numerical experiments in Section 5, and conclude in Section 6 with a discussion.

2 COMPUTATION

Write T = ρ(Pm, Qn; Fk) for the test statistic in (5). In this section, we derive a representer theorem for T, develop a linear-time algorithm for k ≤ 5, and a nearly linear-time approximation for k ≥ 6.

2.1 Representer Theorem

The higher-order KS test statistic in (5) is defined by an infinite-dimensional maximization over Fk in (4). Fortunately, we can restrict our attention to a simpler function class, as we show next.
Theorem 1. Fix k ≥ 0. Let gt+(x) = (x − t)^k_+/k! and gt−(x) = (t − x)^k_+/k! for t ∈ R, where we write (a)+ = max{a, 0}. For the statistic T defined by (5),

    T = max{ sup_{t ≥ 0} |(Pm − Qn)gt+|, sup_{t ≤ 0} |(Pm − Qn)gt−| }.    (6)

The proof of this theorem uses a key result from Mammen (1991), where it is shown that we can construct a spline interpolant to a given function at given points, such that its higher-order total variation is no larger than that of the original function.

Remark 1. When k = 0, note that for t ≥ 0,

    |(Pm − Qn)gt+| = |(1/m) Σ_{i=1}^m 1{xi > t} − (1/n) Σ_{i=1}^n 1{yi > t}|
                   = |(1/m) Σ_{i=1}^m 1{xi ≤ t} − (1/n) Σ_{i=1}^n 1{yi ≤ t}|,

and similarly for t ≤ 0, |(Pm − Qn)gt−| reduces to the same expression in the second line above. As we vary t from −∞ to ∞, this only changes at values t ∈ Z(N), which shows (6) and (1) are the same, i.e., Theorem 1 recovers the equivalence between (2) and (1).

Remark 2. For general k ≥ 0, we can interpret (6) as a comparison between truncated kth order moments, between the empirical distributions Pm and Qn. The test statistic T is the maximum over all possible truncation locations t. The critical aspect here is truncation, which makes the higher-order KS test statistic a metric (recall Proposition 1). A comparison of moments, alone, would not be enough to ensure such a property.

Theorem 1 itself does not immediately lead to an algorithm for computing T, as the range of t considered in the suprema is infinite. However, through a bit more work, detailed in the next two subsections, we can obtain an exact linear-time algorithm for all k ≤ 5, and a linear-time approximation for k ≥ 6.

2.2 Linear-Time Algorithm for k ≤ 5

The key fact that we will exploit is that the criterion in (6), as a function of t, is a piecewise polynomial of order k with knots in Z(N). Assume without loss of generality that z1 < · · · < zN. Also assume without loss of generality that z1 ≥ 0 (this simplifies notation, and the general case follows by repeating the same arguments separately for the points in Z(N) on either side of 0). Define ci = 1{zi ∈ X(m)}/m − 1{zi ∈ Y(n)}/n, i ∈ [N], and

    φi(t) = (1/k!) Σ_{j=i}^N cj (zj − t)^k,  i ∈ [N].    (7)

Then the statistic in (6) can be succinctly written as

    T = max_{i ∈ [N]} sup_{t ∈ [z_{i−1}, z_i]} φi(t),    (8)

where we let z0 = 0 for convenience. Note each φi(t), i ∈ [N], is a kth degree polynomial. We can compute a representation for these polynomials efficiently.

Lemma 1. Fix k ≥ 0. The polynomials in (7) satisfy the recurrence relations

    φi(t) = (1/k!) ci (zi − t)^k + φ_{i+1}(t),  i ∈ [N]

(where φ_{N+1} = 0). Given the monomial expansion

    φ_{i+1}(t) = Σ_{ℓ=0}^k a_{i+1,ℓ} t^ℓ,

we can compute an expansion for φi, with coefficients a_{i,ℓ}, ℓ ∈ {0} ∪ [k], in O(1) time. So we can compute all coefficients a_{i,ℓ}, i ∈ [N], ℓ ∈ {0} ∪ [k], in O(N) time.

To compute T in (8), we must maximize each polynomial φi over its domain [z_{i−1}, z_i], for i ∈ [N], and then compare maxima. Once we have computed a representation for these polynomials, as Lemma 1 ensures we can do in O(N) time, we can use this to analytically maximize each polynomial over its domain, provided the order k is small enough. Of course, maximizing a polynomial over an interval can be reduced to computing the roots of its derivative, which is an analytic computation for any k ≤ 5 (since the roots of any quartic have a closed form, see, e.g., Rosen 1995). The next result summarizes.

Proposition 2. For any 0 ≤ k ≤ 5, the test statistic in (8) can be computed in O(N) time.

Maximizing a polynomial of degree k ≥ 6 is not generally possible in closed form. However, developments in semidefinite optimization allow us to approximate its maximum efficiently, investigated next.

2.3 Linear-Time Approximation for k ≥ 6

Seminal work of Shor (1998); Nesterov (2000) shows that the problem of maximizing a polynomial over an interval can be cast as a semidefinite program (SDP). The number of variables in this SDP depends only on the polynomial order k, and all constraint functions are self-concordant. Using say an interior point method to solve this SDP, therefore, leads to the following result.

Proposition 3. Fix k ≥ 6 and ε > 0. For each polynomial in (7), we can compute an ε-approximation to its maximum in c_k log(1/ε) time, for a constant c_k > 0 depending only on k. As we can compute a representation for all these polynomials in O(N) time (Lemma 1), this means we can compute an ε-approximation to the statistic in (6) in O(N log(1/ε)) time.
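The recipe of (7), (8) and Lemma 1 can be sketched as follows, under the Section 2.2 simplifications (distinct, nonnegative sample points, so only the gt+ part is active). The function name `hoks_stat` is ours, and interior maxima are found via `numpy.roots` on the derivative of φi rather than closed-form root formulas:

```python
import numpy as np
from math import comb, factorial

def hoks_stat(x, y, k):
    """Sketch of the exact statistic (8) for 1 <= k <= 5, assuming all
    sample points are distinct and >= 0 (the general case repeats the
    argument on either side of 0, as described in Section 2.2)."""
    m, n = len(x), len(y)
    z = np.concatenate([x, y]).astype(float)
    c = np.concatenate([np.full(m, 1.0 / m), np.full(n, -1.0 / n)])
    order = np.argsort(z)
    z, c = z[order], c[order]

    a = np.zeros(k + 1)  # monomial coefficients of phi_{i+1}; phi_{N+1} = 0
    best = 0.0
    for i in range(m + n - 1, -1, -1):
        # Lemma 1 update: phi_i(t) = (c_i/k!)(z_i - t)^k + phi_{i+1}(t)
        for l in range(k + 1):
            a[l] += c[i] / factorial(k) * comb(k, l) * (-1) ** l * z[i] ** (k - l)
        lo, hi = (z[i - 1] if i > 0 else 0.0), z[i]
        p = a[::-1].copy()  # highest-degree first, as numpy expects
        cand = [lo, hi]
        if k >= 2:  # interior critical points: real roots of phi_i'
            for r in np.roots(np.polyder(p)):
                if abs(r.imag) < 1e-9 and lo <= r.real <= hi:
                    cand.append(r.real)
        best = max(best, max(abs(np.polyval(p, t)) for t in cand))
    return best
```

For x = (1, 3), y = (2, 4) and k = 1 the criterion equals 1 on [0, 1], and `hoks_stat` returns 1.0; for k ≥ 2 the output can be cross-checked against a brute-force grid evaluation of (6).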
Remark 3. Let T_ε denote the ε-approximation from Proposition 3. Under the null P = Q, we would need to have ε = o(1/√N) in order for the approximation T_ε to share the asymptotic null distribution of T, as we will see in Section 3.3. Taking, say, ε = 1/N, the statistic T_{1/N} requires O(N log N) computational time, and this is why in various places we make reference to a nearly linear-time approximation when k ≥ 6.

2.4 Simple Linear-Time Approximation

We conclude this section by noting a simple approximation to (6) given by

    T* = max{ max_{t ∈ Z'(N), t ≥ 0} |(Pm − Qn)gt+|, max_{t ∈ Z'(N), t ≤ 0} |(Pm − Qn)gt−| },    (9)

where Z'(N) = {0} ∪ Z(N). Clearly, for k = 0 or 1, the maximizing t in (6) must be one of the sample points Z(N), so T* = T and there is no approximation error in (9). For k ≥ 2, we can control the error as follows.

Lemma 2. For k ≥ 2, the statistics in (6), (9) satisfy

    T − T* ≤ (δ_N / (k − 1)!) [ (1/m) Σ_{i=1}^m |xi|^{k−1} + (1/n) Σ_{i=1}^n |yi|^{k−1} ],

where δ_N is the maximum gap between sorted points in Z'(N).

Remark 4. We would need to have δ_N = oP(1/√N) in order for T* to share the asymptotic null of T, see again Section 3.3 (this is assuming that P has k − 1 moments, so the sample moments concentrate for large enough N). This will not be true of δ_N, the maximum gap, in general. But it does hold when P is continuous, having compact support, and a density bounded from below on its support; here, in fact, δ_N = OP(log N/N) (see, e.g., Wang et al. 2014).

Although it does not have the strong guarantees of the approximation from Proposition 3, the statistic in (9) is simple and efficient, and is likely a good choice for most practical purposes; we must emphasize that it can be computed in O(N) linear time, as a consequence of Lemma 1 (the evaluations of φi(t) at the sample points t ∈ Z(N) are the constant terms a_{i,0}, i ∈ [N], in their monomial expansions).

3 ASYMPTOTIC NULL

To study the asymptotic null distribution of the proposed higher-order KS test, we will appeal to uniform central limit theorems (CLTs) from the empirical process theory literature, reviewed here for completeness.

For functions f, g in a class F, let G_{P,F} denote a Gaussian process indexed by F with mean and covariance

    E(G_{P,F} f) = 0,  f ∈ F,
    Cov(G_{P,F} f, G_{P,F} g) = Cov_{X∼P}(f(X), g(X)),  f, g ∈ F.

For functions l, u, let [l, u] denote the set of functions {f : l(x) ≤ f(x) ≤ u(x), for all x}. Call [l, u] a bracket of size ||u − l||_2, where ||·||_2 denotes the L_2(P) norm, defined as

    ||f||_2^2 = ∫ f(x)^2 dP(x).

Finally, let N_[](ε, ||·||_2, F) be the smallest number of ε-sized brackets that are required to cover F. Define the bracketing integral of F as

    J_[](||·||_2, F) = ∫_0^1 √(log N_[](ε, ||·||_2, F)) dε.

Note that this is finite when log N_[](ε, ||·||_2, F) grows slower than 1/ε^2. We now state an important uniform CLT from empirical process theory.

Theorem 2 (Theorem 11.1.1 in Dudley 1999). If F is a class of functions with finite bracketing integral, then when P = Q and m, n → ∞, the process

    √(mn/(m+n)) {Pm f − Qn f}_{f ∈ F}

converges weakly to the Gaussian process G_{P,F}. Hence,

    √(mn/(m+n)) sup_{f ∈ F} |Pm f − Qn f| →d sup_{f ∈ F} |G_{P,F} f|.

3.1 Bracketing Integral Calculation

To derive the asymptotic null of the higher-order KS test, based on its formulation in (5), and Theorem 2, we would need to bound the bracketing integral of Fk. While there are well-known entropy (log covering) number bounds for related function classes (e.g., Birman and Solomyak 1967; Babenko 1979), and the conversion from covering to bracketing numbers is standard, these results unfortunately require the function class to be uniformly bounded in the sup norm, which is certainly not true of Fk.

Note that the representer result in (6) can be written as T = ρ(Pm, Qn; Gk), where

    Gk = {gt+ : t ≥ 0} ∪ {gt− : t ≤ 0}.    (10)

We can hence instead apply Theorem 2 to Gk, whose bracketing number can be bounded by direct calculation, assuming enough moments on P.

Lemma 3. Fix k ≥ 0. Assume E_{X∼P}|X|^{2k+δ} ≤ M < ∞, for some δ > 0. For the class Gk in (10), there is a constant C > 0 depending only on k, δ such that

    log N_[](ε, ||·||_2, Gk) ≤ C log( M^{1 + δ(k−1)/(2k+δ)} / ε^{2+δ} ).
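Returning to the simple approximation (9) from Section 2.4: since T* only evaluates the criterion at the points of Z'(N), a binomial expansion with suffix sums computes it in O(N) time for fixed k. A vectorized sketch (our own function name; same simplifying assumptions as in Section 2.2, distinct nonnegative points and the gt+ part only; exact for k ≤ 1):

```python
import numpy as np
from math import comb, factorial

def hoks_simple(x, y, k):
    """Sketch of T* in (9): maximize |(Pm - Qn) g_t^+| over t in {0} ∪ Z(N),
    via suffix sums, assuming distinct sample points >= 0. Exact for k <= 1."""
    m, n = len(x), len(y)
    z = np.concatenate([x, y]).astype(float)
    c = np.concatenate([np.full(m, 1.0 / m), np.full(n, -1.0 / n)])
    order = np.argsort(z)
    z, c = z[order], c[order]
    N = m + n
    t = np.concatenate([[0.0], z])  # candidate truncation points
    vals = np.zeros(N + 1)
    # Binomial expansion: sum_{z_j > t} c_j (z_j - t)^k
    #   = sum_l C(k,l) (-t)^l S_l, with suffix sums S_l = sum_{j>=i} c_j z_j^(k-l).
    for l in range(k + 1):
        suf = np.concatenate([np.cumsum((c * z ** (k - l))[::-1])[::-1], [0.0]])
        vals += comb(k, l) * (-t) ** l * suf
    return np.max(np.abs(vals)) / factorial(k)
```

For k ≤ 1 this matches the exact statistic, e.g., `hoks_simple([1, 3], [2, 4], 1)` returns 1.0.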
3.2 Asymptotic Null for Higher-Order KS

Applying Theorem 2 and Lemma 3 to the higher-order KS test statistic (6) leads to the following result.

Theorem 3. Fix k ≥ 0. Assume E_{X∼P}|X|^{2k+δ} < ∞, for some δ > 0. When P = Q, the test statistic in (6) satisfies, as m, n → ∞,

    √(mn/(m+n)) T →d sup_{g ∈ Gk} |G_{P,k} g|,

where G_{P,k} is an abbreviation for the Gaussian process indexed by the function class Gk in (10).

Remark 5. When k = 0, note that for t ≥ s ≥ 0, the covariance function is

    Cov_{X∼P}(1{X > s}, 1{X > t}) = F_P(s)(1 − F_P(t)),

where F_P denotes the CDF of P. For s ≤ t ≤ 0, the covariance function is again equal to F_P(s)(1 − F_P(t)). The supremum of this Gaussian process over t ∈ R is that of a Brownian bridge, so Theorem 3 recovers the well-known asymptotic null distribution of the KS test, which (remarkably) does not depend on P.

Remark 6. When k ≥ 1, it is not clear how strongly the supremum of the Gaussian process from Theorem 3 depends on P; it appears it must depend on the first k moments of P, but it is not clear whether it depends only on these moments. Section 5 investigates this empirically. Currently, we do not have a precise understanding of whether the asymptotic null is usable in practice, and we suggest using a permutation null instead.

3.3 Asymptotic Null Under Approximation

The approximation from Proposition 3 shares the same asymptotic null, provided ε > 0 is small enough.

Corollary 1. Fix k ≥ 0. Assume E_{X∼P}|X|^{2k+δ} < ∞, for some δ > 0. When P = Q, as m, n → ∞ such that m/n converges to a positive constant, the test statistic T_ε from Proposition 3 converges at a √N-rate to the supremum of the same Gaussian process in Theorem 3, provided ε = o(1/√N).

The approximation in (9) shares the same asymptotic null, provided P is continuous with compact support.

Corollary 2. Fix k ≥ 0. Assume that P is continuous, compactly supported, with density bounded from below on its support. When P = Q, as m, n → ∞ such that m/n converges to a positive constant, the test statistic T* in (9) converges at a √N-rate to the supremum of the same Gaussian process in Theorem 3.

4 TAIL CONCENTRATION

We examine the convergence of our test statistics to their population analogs. In general, if the population-level IPM ρ(P, Q; Fk) is large, then the concentration bounds below will imply that the empirical statistic ρ(Pm, Qn; Fk) will be large for m, n sufficiently large, and the test will have power.

We first review the necessary machinery, again from empirical process theory. For p ≥ 1, and a function f of a random variable X ∼ P, recall the L_p(P) norm is defined as ||f||_p = [E(f(X)^p)]^{1/p}. For p > 0, recall the exponential Orlicz norm of order p is defined as

    ||f||_{Ψp} = inf{ t > 0 : E[exp(|f(X)|^p / t^p)] − 1 ≤ 1 }.

(These norms depend on the measure P, since they are defined in terms of expectations with respect to X ∼ P, though this is not explicit in our notation.) We now state an important concentration result.

Theorem 4 (Theorems 2.14.2 and 2.14.5 in van der Vaart and Wellner 1996). Let F be a class of functions with an envelope function F, i.e., f ≤ F for all f ∈ F. Define

    W = √n sup_{f ∈ F} |Pn f − Pf|,

and abbreviate J = J_[](||·||, F). For p ≥ 2, if ||F||_p < ∞, then for a constant c1 > 0,

    [E(W^p)]^{1/p} ≤ c1 ( ||F||_2 J + n^{−1/2+1/p} ||F||_p ),

and for 0 < p ≤ 1, if ||F||_{Ψp} < ∞, then for a constant c2 > 0,

    ||W||_{Ψp} ≤ c2 ( ||F||_2 J + n^{−1/2} (1 + log n)^{1/p} ||F||_{Ψp} ).

The two-sample test statistic T = ρ(Pm, Qn; Gk) satisfies (following by a simple argument using convexity)

    |T − ρ(P, Q; Fk)| ≤ ρ(P, Pm; Fk) + ρ(Q, Qn; Fk).

The terms on the right-hand side can each be bounded by Theorem 4, where we can use the envelope function F(x) = |x|^k/k! for Gk. Using Markov's inequality, we can then get a tail bound on the statistic.

Theorem 5. Fix k ≥ 0. Assume that P, Q both have p moments, where p ≥ 2 and p > 2k. For the statistic in (6), for any α > 0, with probability 1 − α,

    |T − ρ(P, Q; Gk)| ≤ c(α) (1/√m + 1/√n),

where c(α) = c0 α^{−1/p}, and c0 > 0 is a constant. If P, Q both have finite exponential Orlicz norms of order 0 < p ≤ 1, then the above holds for c(α) = c0 (log(1/α))^{1/p}.

When we assume k moments, the population IPM for Fk also has a representer in Gk; by Proposition 1, this implies ρ(·, · ; Gk) is also a metric.
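In practice the test can be calibrated by a permutation null, as suggested in Remark 6. A minimal sketch (our own helper name `perm_pvalue`; the plain KS statistic from SciPy stands in here, and any of the higher-order statistics could be plugged in for the `stat` argument):

```python
import numpy as np
from scipy.stats import ks_2samp

def perm_pvalue(x, y, stat=lambda a, b: ks_2samp(a, b).statistic,
                n_perm=500, seed=0):
    """Permutation p-value: re-split the pooled sample repeatedly and
    compare the observed statistic to the permutation distribution."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    m = len(x)
    obs = stat(x, y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(z)
        if stat(z[:m], z[m:]) >= obs:
            exceed += 1
    # add-one correction keeps the test valid at finite n_perm
    return (1 + exceed) / (1 + n_perm)
```

Rejecting when the p-value falls below α gives a finite-sample valid test regardless of whether the asymptotic null is usable.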
Putting this metric property together with Theorem 5 gives the following.

Corollary 4. Fix k ≥ 0. For αN = o(1) and 1/αN = o(N^{p/2}), reject when the higher-order KS test statistic (6) satisfies T > c(αN)(1/√m + 1/√n), where c(·) is as in Theorem 5.

[Figure: panels "Normal k=2" and "Uniform k=2", comparing the asymptotic and finite-sample null distributions.]

Figure 4: ROC curves for P = N(0, 1), Q = N(0.2, 1).

Figure 5: ROC curves for P = N(0, 1), Q = Lap(0, 1/√2).

Figure 6: ROC curves for piecewise constant p − q.

Figure 7: ROC curves for tail departure in p − q.

[The ROC panels compare the higher-order KS tests (various k) against Oracle, and in Figures 4 and 5 also against MMD-RBF, Energy Distance, and Anderson-Darling.]
function class is not uniformly sup norm bounded. The resulting class of linear-time higher-order KS tests was shown empirically to be more sensitive to tail differences than the usual KS test, and to have competitive power relative to several other popular tests.

In future work, we intend to more formally study the power properties of our new higher-order tests relative to the KS test. The following is a lead in that direction. For k ≥ 1, define I^k to be the kth order integral operator, acting on a function f, via

    (I^k f)(x) = ∫_0^x ∫_0^{t_k} · · · ∫_0^{t_2} f(t_1) dt_1 dt_2 · · · dt_k.

Denote by F_P, F_Q the CDFs of the distributions P, Q. Notice that the population-level KS test statistic can be written as ρ(P, Q; F0) = ||F_P − F_Q||_∞, where ||·||_∞ is the sup norm. Interestingly, a similar representation holds for the higher-order KS tests.

Proposition 4. Assuming P, Q have k moments,

    ρ(P, Q; Fk) = ||(I^k)* (F_P − F_Q)||_∞,

where (I^k)* is the adjoint of the bounded linear operator I^k, with respect to the usual L_2 inner product. Further, if P, Q are supported on [0, ∞), or their first k moments match, then we have the more explicit representation

    ρ(P, Q; Fk) = sup_{x ∈ R} | ∫_x^∞ ∫_{t_k}^∞ · · · ∫_{t_2}^∞ (F_P − F_Q)(t_1) dt_1 dt_2 · · · dt_k |.

The representation in Proposition 4 could provide one avenue for power analysis. When P, Q are supported on [0, ∞), or have k matching moments, the representation is particularly simple in form. This form confirms the intuition that detecting higher-order moment differences is hard: as k increases, the k-times integrated CDF difference F_P − F_Q becomes smoother, and hence the differences are less accentuated.

In future work, we also intend to further examine the asymptotic null of the higher-order KS test (the Gaussian process from Theorem 3), and determine to what extent it depends on the underlying distribution P (beyond say, its first k moments). Lastly, some ideas in this paper seem extendable to the multivariate and graph settings, another direction for future work.

Acknowledgments. We thank Alex Smola for several early inspiring discussions. VS and RT were supported by NSF Grant DMS-1554123.