On the Choice of Test Statistic for Conditional Moment Inequalities
Timothy B. Armstrong∗
Yale University
July 9, 2021
1 Introduction
This paper compares methods for inference on a parameter θ defined by the conditional
moment inequalities
\[
E(m(W_i, \theta) \mid X_i) \ge 0 \text{ almost surely},
\]
where m : R^{dW+dθ} → R^{dY} is a known function of data Wi and a parameter θ ∈ Θ ⊆ R^{dθ},
and ≥ is defined elementwise. Here, Wi is an R^{dW} valued random variable and Xi is an R^{dX}
∗ email: [email protected]. Support from National Science Foundation Grant SES-1628939 is gratefully acknowledged.
valued random variable. We are given independent, identically distributed (iid) observations
{(Xi′, Wi′)′}ni=1. This defines the identified set
\[
\Theta_0 = \{\theta \in \Theta \mid E(m(W_i, \theta) \mid X_i) \ge 0 \text{ almost surely}\},
\]
where Θ ⊆ R^{dθ} is the parameter space. If Θ0 contains more than one element, the model is
said to be set identified.
Following Imbens and Manski (2004), we are interested in confidence regions Cn that
satisfy the coverage criterion
\[
\liminf_{n \to \infty} \inf_{\theta \in \Theta_0} P(\theta \in C_n) \ge 1 - \alpha. \tag{1}
\]
We consider confidence regions constructed by inverting a family of tests φn (θ) = φn (θ, {Xi , Wi }ni=1 ),
where φn (θ) is a test of H0,θ : θ ∈ Θ0 :
\[
C_n = \{\theta \in \Theta \mid \varphi_n(\theta) = 0\}.
\]
Subject to the coverage criterion (1), we would like the confidence region Cn not to contain
points that are far away from the identified set Θ0 . In particular, if we take a parameter θ0
on the boundary of Θ0 and consider a sequence θn = θ0 + an where an → 0, we would like to
have θn ∉ Cn with high probability for an converging to zero as quickly as possible (so long
as θn approaches Θ0 from the outside, rather than from the interior). Note that
\[
P(\theta_n \notin C_n) = P(\varphi_n(\theta_n) = 1).
\]
Thus, we can determine whether Cn contains points that are far away from Θ0 by examining
the behavior of P (φn (θn ) = 1), which is the power of the test φn (θn ) of H0,θn at the alternative
P.
This paper provides an asymptotic answer to this question by examining the asymptotic
behavior of P (φn (θn ) = 1) as n → ∞. We refer to the limit of P (φn (θn ) = 1) as the local
asymptotic power of the sequence of tests φn (θn ) (note that this terminology differs from
definitions often used in the literature, since the null hypothesis varies with n while the
alternative stays fixed). The local asymptotic power of this sequence of tests will depend on
the distribution P , the parameter θ0 on the boundary of Θ0 to which the sequence θn = θ0 +an
converges, and the sequence an .
This paper considers Cramér-von Mises (CvM) style test statistics, which integrate or
add some function of the negative part of an objective function. These can be compared
with existing results for Kolmogorov-Smirnov (KS) statistics, which take the minimum of an
objective function. The results show that the power P (φn (θn ) = 1) will be greater asymp-
totically for KS statistics when the distribution P satisfies generic smoothness conditions of
the form used in the nonparametric statistics literature. In particular, the results imply that
KS statistics are preferred according to a “minimax within a smoothness class” criterion of
the form used to formulate nonparametric relative efficiency results in papers such as Stone
(1982).
As an example of the types of problems covered by this setup, consider the interval re-
gression model of Manski and Tamer (2002). We observe (Xi , WiL , WiH ) where [WiL , WiH ]
is known to contain the latent variable Wi∗ , which follows the linear regression model
E(Wi∗ |Xi ) = (1, Xi′)θ. This falls into the setup of this paper with Wi = (Xi , WiL , WiH )
and m(Wi , θ) = (WiH − (1, Xi′)θ, (1, Xi′)θ − WiL )′. The identified set is then given by
\[
\Theta_0 = \{\theta \in \Theta \mid E(W_i^L \mid X_i = x) \le (1, x')\theta \le E(W_i^H \mid X_i = x) \text{ for all } x \text{ on the support of } X_i\}.
\]
Thus, a parameter θ0 in the identified set corresponds to a regression line (1, x′)θ0 that is
between the conditional means E(WiL |Xi = x) and E(WiH |Xi = x) for all x on the support
of Xi . If θ0 is on the boundary of the identified set, it will be equal to one of these regression
lines for some value of x. For θn = θ0 + an approaching the boundary of the identified
set from the outside, the regression line (1, x′)θn will be above E(WiH |Xi = x) or below
E(WiL |Xi = x) for some values of x, and we would like the test φn (θn ) to detect this so
that θn ∉ Cn with high probability. We use primitive conditions to apply the general results
in this paper to this setting, thereby giving asymptotic approximations to this probability.
These conditions correspond to smoothness conditions used in the nonparametric statistics
literature and conditions on the shape of these conditional means near points where one of
them is equal to (1, x0 )θ0 (see Section 3.4 and Appendix A.5).
The remainder of this paper is organized as follows. Section 1.1 defines the tests con-
sidered in this paper. Section 1.2 discusses related literature. Section 2 gives an intuitive
description of the power results in this paper and how they are derived. Section 3 states
formally the conditions used in this paper, and provides primitive conditions for the interval
regression model. Section 4 derives the power results. Section 5 reports the results of a
Monte Carlo study. Section 6 concludes. An appendix contains minimax power comparisons
as well as primitive conditions for the results in the main text in additional settings. A
supplementary appendix contains proofs and auxiliary results.
1.1 Definition of Test Statistics
The test statistics considered in this paper are as follows. Given a set G of nonnegative
instruments, the null hypothesis H0,θ : θ ∈ Θ0 implies that E(m(Wi , θ)g(Xi )) ≥ 0 for all
g ∈ G. Thus, under H0,θ : θ ∈ Θ0 , the sample analogue
\[
E_n(m(W_i, \theta) g(X_i)) \equiv \frac{1}{n} \sum_{i=1}^{n} m(W_i, \theta) g(X_i) \tag{2}
\]
should not be too negative for any g ∈ G. The results in this paper use classes of functions
given by kernels with varying bandwidths and location, given by G = {x 7→ k((x − x̃)/h)|x̃ ∈
RdX , h ∈ R+ } for some kernel function k. With this choice of G, H0,θ : θ ∈ Θ0 holds if and
only if E(m(Wi , θ)g(Xi )) ≥ 0 for all g ∈ G, so that (2) can be used to form a consistent test
(see Andrews and Shi, 2013, for a discussion of this and other choices of G).
Alternatively, one can test H0,θ : θ ∈ Θ0 by estimating E(m(Wi , θ)|Xi = x) directly using
the kernel estimate
\[
\hat{\bar m}_j(\theta, x) = \frac{\sum_{i=1}^{n} m_j(W_i, \theta)\, k((X_i - x)/h)}{\sum_{i=1}^{n} k((X_i - x)/h)} \tag{3}
\]
for some sequence h = hn → 0 and kernel function k. If H0,θ holds, (3) should not be too
negative for any x.
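As a concrete illustration, the following sketch computes the kernel estimate (3) for the interval regression moments from the introduction. This is a minimal numerical sketch, not code from the paper: the uniform kernel, the simulated dgp, and all tuning choices are illustrative assumptions.

```python
import numpy as np

def uniform_kernel(u):
    # k(u) = I(|u| <= 1/2): bounded, nonnegative, with finite support
    return (np.abs(u) <= 0.5).astype(float)

def kernel_moment_estimate(X, WL, WH, theta, x, h):
    """Estimate E(m(W_i, theta) | X_i = x) as in (3) for the interval
    regression moments m = (WH - (1, X)'theta, (1, X)'theta - WL)'."""
    fitted = theta[0] + theta[1] * X                 # (1, X_i')theta, scalar X_i
    m = np.column_stack([WH - fitted, fitted - WL])  # n x 2 moment functions
    w = uniform_kernel((X - x) / h)                  # kernel weights at x
    return m.T @ w / w.sum()                         # weighted average, each j

# Illustrative dgp: W* = 0.5 X + u with the interval [W* - 0.2, W* + 0.2] observed
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 1, n)
Wstar = 0.5 * X + rng.normal(scale=0.1, size=n)
WL, WH = Wstar - 0.2, Wstar + 0.2

theta = np.array([0.0, 0.5])  # the true line, inside the identified set
print(kernel_moment_estimate(X, WL, WH, theta, x=0.5, h=n ** (-1 / 5)))
# both components should be near 0.2 > 0, consistent with H_{0,theta}
```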
Thus, a test statistic of the null that θ ∈ Θ0 can be formed by taking any function that
is positive and large in magnitude when (2) is negative and large in magnitude for some
g ∈ G, or when (3) is negative and large in magnitude for some x. One possibility is to use a
CvM statistic that integrates the negative part of (2) over some measure µ on G. This CvM
statistic is given by
"Z dY
#1/p
X
Tn,p,ω,µ (θ) = |En mj (Wi , θ)g(Xi )ωj (θ, g)|p− dµ(g) (4)
j=1
for some p ≥ 1 and weighting ω, where |t|− = | min{t, 0}|. I refer to this as an instrument
based CvM (IV-CvM) statistic. The CvM statistic based on the kernel estimate integrates
the negative part of (3) against some weighting ω, and is given by
"Z dY
#1/p
p
X
Tn,p,kern (θ) = ˆ j (θ, x)ωj (θ, x)
m̄ dx (5)
−
j=1
for some p ≥ 1. I refer to this as a kernel based CvM (kern-CvM) statistic.
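The IV-CvM statistic (4) can be sketched numerically in the same way. In the snippet below, the measure µ over instruments g(x) = k((x − x̃)/h) is approximated by uniform Monte Carlo draws of (x̃, h), the weighting is constant (ωj ≡ 1), and p = 1; all of these are placeholder choices for illustration.

```python
import numpy as np

def iv_cvm_statistic(X, m, p=1.0, n_mu=500, seed=0):
    """IV-CvM statistic (4) with constant weighting: integrates
    |E_n m_j(W_i, theta) g(X_i)|_-^p over instruments g(x) = k((x - xt)/h),
    approximating mu by uniform Monte Carlo draws of (xt, h)."""
    rng = np.random.default_rng(seed)
    locs = rng.uniform(0, 1, n_mu)   # instrument locations xt
    bws = rng.uniform(0, 1, n_mu)    # instrument bandwidths h
    total = 0.0
    for xt, h in zip(locs, bws):
        g = (np.abs((X - xt) / h) <= 0.5).astype(float)    # uniform kernel
        En = (m * g[:, None]).mean(axis=0)                 # E_n m_j(W_i,theta)g(X_i)
        total += (np.abs(np.minimum(En, 0.0)) ** p).sum()  # |t|_- = |min(t, 0)|
    return (total / n_mu) ** (1.0 / p)

# Reusing the interval regression dgp from the previous sketch, but evaluating
# the moments at a theta whose intercept puts the line above E(WH | X = x):
rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(0, 1, n)
Wstar = 0.5 * X + rng.normal(scale=0.1, size=n)
WL, WH = Wstar - 0.2, Wstar + 0.2
fitted = 0.3 + 0.5 * X                       # line outside the identified set
m = np.column_stack([WH - fitted, fitted - WL])
print(np.sqrt(n) * iv_cvm_statistic(X, m))   # scaled as in (10); clearly positive
```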
For the instrument based CvM statistic, the scaling for the power function will depend
on ω. This paper considers both a bounded weighting which, without loss of generality, can
be taken to be constant (the measure µ can absorb any weighting that does not change with
the sample size),
\[
\omega_j(\theta, g) = 1, \tag{6}
\]
as well as the truncated variance weighting used for KS statistics by Armstrong (2014b),
Armstrong and Chan (2016) and Chetverikov (2012), which is given by
\[
\omega_j(\theta, g) = [\hat\sigma_j(\theta, g) \vee \sigma_n]^{-1}, \tag{7}
\]
where
\[
\hat\sigma_j(\theta, g) = \{E_n [m_j(W_i, \theta) g(X_i)]^2 - [E_n m_j(W_i, \theta) g(X_i)]^2\}^{1/2}
\]
and σn is a sequence converging to zero and a ∨ b denotes the maximum of a and b for scalars
a and b.¹
The results for CvM statistics derived in this paper can be compared to power results
for KS statistics derived in Armstrong (2015) and Armstrong (2014b). A KS statistic based
on (2) simply takes the most negative value of that expression over g ∈ G, and is given by
\[
T_{n,\infty,\omega}(\theta) = \max_j \sup_{g \in \mathcal G} |E_n m_j(W_i, \theta) g(X_i)\, \omega_j(\theta, g)|_- . \tag{8}
\]
I refer to this as an instrument based KS (IV-KS) statistic. Similarly, a KS statistic based
on the kernel estimate (3) takes the most negative value of the weighted estimate over x:
\[
T_{n,\infty,\mathrm{kern}}(\theta) = \max_j \sup_x |\hat{\bar m}_j(\theta, x)\, \omega_j(\theta, x)|_- . \tag{9}
\]
I refer to this as a kernel based KS (kern-KS) statistic. As with CvM statistics, the scaling
for the local power function for the instrument based KS test depends on whether a bounded
weighting or a truncated variance weighting is used.
¹ For the critical value of the test, the results in this paper cover any critical value that is of the same order of magnitude asymptotically as a critical value based on the distribution where all moments bind. See Section 3.1 for details.
To complete the definition of these tests, we need to define a critical value. For tests that
use instrument based CvM statistics with bounded weights or inverse variance weights with
p < ∞, the test φn,p,ω,µ (θ), which rejects when φn,p,ω,µ (θ) = 1, is defined as
\[
\varphi_{n,p,\omega,\mu}(\theta) = \begin{cases} 1 & \text{if } \sqrt{n}\, T_{n,p,\omega,\mu}(\theta) > \hat c_{n,p,\omega,\mu}(\theta) \\ 0 & \text{otherwise} \end{cases} \tag{10}
\]
for some critical value ĉn,p,ω,µ (θ). For kernel based CvM statistics, the test φn,p,kern (θ), which
rejects when φn,p,kern (θ) = 1, is defined as
\[
\varphi_{n,p,\mathrm{kern}}(\theta) = \begin{cases} 1 & \text{if } (nh^{d_X})^{1/2}\, T_{n,p,\mathrm{kern}}(\theta) > \hat c_{n,p,\mathrm{kern}}(\theta) \\ 0 & \text{otherwise.} \end{cases} \tag{11}
\]
While all of the new results in this paper are for CvM statistics, I refer to analogous results
for KS statistics at some points for comparison. For KS tests with bounded weights, the
critical value is defined as in (10). For KS tests based on truncated variance weights, the
test φn,∞,(σ∨σn )−1 (θ) is defined as
\[
\varphi_{n,\infty,(\sigma\vee\sigma_n)^{-1}}(\theta) = \begin{cases} 1 & \text{if } \sqrt{n/\log n}\; T_{n,\infty,(\sigma\vee\sigma_n)^{-1}}(\theta) > \hat c_{n,\infty,(\sigma\vee\sigma_n)^{-1}}(\theta) \\ 0 & \text{otherwise.} \end{cases} \tag{12}
\]
requiring uniformly good power in classes of underlying distributions defined by smoothness
properties, the power of CvM tests is much worse (see Section A.5). The results in this
paper show that power comparisons in the set identified case considered here are much
different than settings that have been studied previously. Armstrong (2015), Armstrong
(2011), Armstrong (2014b), Armstrong and Chan (2016), and Chetverikov (2012) derive
power results for KS statistics under conditions similar to those used in this paper, but do
not consider CvM statistics.
The results in this paper are also related to the statistics literature on minimax testing
of hypotheses of the form H0,= : f (x) = 0 all x, H0,≥ : f (x) ≥ 0 all x, H0,↑ : f (x) ≥
f (x′) all x < x′ (and related hypotheses such as convexity of f ), where the function f
is observed with noise. While much of this literature focuses on the Gaussian white noise
model or Gaussian sequence model, the results are closely related to the case where f (x) =
E(Yi |Xi = x), and iid observations of Xi , Yi are available (which falls into our setup if we take
Yi = m(Wi , θ0 )). To formulate the minimax testing problem considered in this literature,
one specifies a smoothness class F for f and a functional ψ : F → [0, ∞) such that ψ(f )
is 0 if f satisfies the null and strictly positive otherwise. For example, for H0,= , one can
take the Lp norm ψ(f ) = [∫ f (x)^p dx]^{1/p} and, for H0,≥ , one can take the one-sided Lp norm
ψ(f ) = [∫ |f (x)|_−^p dx]^{1/p} . The minimax testing problem is to obtain tests that have good worst-
case power over alternatives f in the smoothness class F with ψ(f ) ≥ an for an → 0 as
quickly as possible. Dumbgen and Spokoiny (2001) and Juditsky and Nemirovski (2002)
consider H0,≥ with ψ given by the one-sided L∞ norm ψ(f ) = supx |f (x)|− and the one-
sided Lp norm with p < ∞ respectively, as well as H0,↑ and the hypothesis of convexity with
related distance functions ψ. Lepski and Tsybakov (2000) consider H0,= with ψ(f ) given by
the L∞ norm and by ψ(f ) = |f (x0 )| for a given point x0 . See Ingster and Suslina (2003) for
further results and references to this literature.
In contrast to this literature, the results in this paper have implications for minimax rates
of CvM statistics for testing the null that a given value of θ is in the identified set against the
alternative that the distance between θ and any point in the identified set is at least an (see
Section A.5 in the appendix for a formal statement). Since the dimension of θ is finite and
fixed, the choice of distance (i.e. whether to use Euclidean distance or sup-norm distance
when defining distance of θ from points in the identified set) does not matter for the rate
at which an can approach zero with the test having good power. This contrasts with the
nonparametric testing literature described above, in which the choice of distance function ψ
has implications for relative efficiency of different test statistics, and is part of the reason
that CvM and KS tests can be ranked in this setting. Interestingly, the problem of minimax
inference on θ in the settings considered here appears to be closely related to nonparametric
testing with ψ given by the L∞ norm. See Armstrong (2014a) for further discussion.
To see this in more detail, let us give a heuristic derivation of some of the results in
this setting. Consider the instrument based CvM statistic with bounded weights, where
the measure µ on the instruments g(x) = k((x − x̃)/h) has a density fµ (x̃, h) with re-
spect to the Lebesgue measure, and assume that Xi has a density fX (x). For simplic-
ity, suppose we only base the statistic on the inequality involving WiH . The statistic is
\[
T_n(\theta_n) = \left[ \int\!\!\int \left| E_n (W_i^H - (1, X_i)\theta_n)\, k((X_i - \tilde x)/h) \right|_-^p f_\mu(\tilde x, h)\, d\tilde x\, dh \right]^{1/p},
\]
which is an integral
over a sample expectation. We expect that the test will have power when the integral over
the corresponding population expectation is large relative to the critical value, which, as
discussed below, will be of order n−1/2 . Thus, to have power at θn = (θ0,1 + an,1 , θ0,2 )′, we
expect that
\[
\left[ \int\!\!\int \left| E(W_i^H - (1, X_i)\theta_n)\, k((X_i - \tilde x)/h) \right|_-^p f_\mu(\tilde x, h)\, d\tilde x\, dh \right]^{1/p}
= \left[ \int\!\!\int \left| \int (E(W_i^H \mid X_i = x) - (1, x)\theta_n)\, k((x - \tilde x)/h)\, f_X(x)\, dx \right|_-^p f_\mu(\tilde x, h)\, d\tilde x\, dh \right]^{1/p} \tag{13}
\]
As θn approaches θ0 , only values of x̃ near x0 and values of h near zero will contribute
to the integrand, so that this approximation will hold with increasing accuracy (here x0
denotes the point where the regression line (1, x)θ0 is tangent to E(WiH |Xi = x), and V
denotes the second derivative of E(WiH |Xi = x) − (1, x)θ0 at x0 ). Furthermore, assuming
that fµ and fX are smooth, this means that we can also replace fX (x) with fX (x0 ) and
fµ (x̃, h) with fµ (x0 , 0), and, using a second order expansion, replace E(WiH |Xi = x) − (1, x)θn
with (x − x0 )2 (V /2) − an,1 :
\[
\left[ \int\!\!\int \left| \int ((x - x_0)^2 (V/2) - a_{n,1})\, k((x - \tilde x)/h)\, f_X(x_0)\, dx \right|_-^p f_\mu(x_0, 0)\, d\tilde x\, dh \right]^{1/p}.
\]
² Consider, for example, testing the finite set of moment inequalities H0 : EYi,1 ≥ 0, . . . , EYi,k ≥ 0. Tests based on the statistic $\sum_{j=1}^{k} |\sum_{i=1}^{n} Y_{i,j}|_-^2$ (which is analogous to a CvM statistic) will have more power when each of the inequalities is violated by a small amount, while tests based on the statistic $\max_{j=1}^{k} |\sum_{i=1}^{n} Y_{i,j}|_-$ (which is analogous to a KS statistic) will have more power when a single inequality is violated. See Armstrong (2014a) for details and further references.
Using the change of variables u = (x − x0 )/a_{n,1}^{1/2} , v = (x̃ − x0 )/a_{n,1}^{1/2} , h̃ = h/a_{n,1}^{1/2} , it can be
seen that the above display is equal to
\[
\left[ \int\!\!\int \left| \int (a_{n,1} u^2 (V/2) - a_{n,1})\, k((u - v)/\tilde h)\, f_X(x_0)\, a_{n,1}^{1/2}\, du \right|_-^p f_\mu(x_0, 0)\, a_{n,1}^{1/2}\, dv\, a_{n,1}^{1/2}\, d\tilde h \right]^{1/p}
= a_{n,1}^{3/2 + 1/p} \left[ \int\!\!\int \left| \int (u^2 (V/2) - 1)\, k((u - v)/\tilde h)\, f_X(x_0)\, du \right|_-^p f_\mu(x_0, 0)\, dv\, d\tilde h \right]^{1/p}.
\]
Thus, we expect to get power when a_{n,1}^{3/2+1/p} decreases at least as slowly as n−1/2 , which
corresponds to an,1 decreasing at the rate n−1/(3+2/p) . This is the rate derived formally for
this test in Section 4.1, specialized to this setting (the general results use a smoothness
parameter γ which, in this case, is equal to 2).
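As a quick arithmetic check on this heuristic (not a result from the paper), one can solve a_{n,1}^{3/2+1/p} ≍ n^{−1/2} for the exponent of an,1 symbolically:

```python
from fractions import Fraction

def cvm_rate_exponent(p):
    # Solve a_n^(3/2 + 1/p) = n^(-1/2) for a_n = n^(-q):
    # q * (3/2 + 1/p) = 1/2  =>  q = 1 / (3 + 2/p)
    return Fraction(1, 2) / (Fraction(3, 2) + Fraction(1, p))

for p in (1, 2, 10):
    print(p, cvm_rate_exponent(p))
# 1/5, 1/4, 5/16: increasing in p, approaching the KS exponent 1/3 from below
```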
To understand how this differs from the corresponding KS test based on Tn (θn ) =
sup_{x̃,h} |En (WiH − (1, Xi )θn )k((Xi − x̃)/h)|− , note that similar derivations give the approximation
\[
\sup_{\tilde x, h} \left| \int ((x - x_0)^2 (V/2) - a_{n,1})\, k((x - \tilde x)/h)\, f_X(x_0)\, dx \right|_- ,
\]
and comparing this to n−1/2 (which is the order of the critical value in this case as well) shows
that we will have power when an,1 decreases at the rate n−1/3 . This is shown formally in
Armstrong (2015). Note that the n−1/3 rate for the KS statistic is faster than the n−1/(3+2/p)
rate for the CvM statistic.
3 Assumptions
This section states the conditions used in this paper, and verifies them for the interval re-
gression model defined in the introduction. Section A in the appendix verifies the conditions
in other settings.
This paper considers the power P (φn (θn ) = 1) of a sequence φn (θn ) of tests of H0,θn :
θn ∈ Θ0 under iid data from a fixed dgp P , where θn = θ0 + an is a sequence converging to θ0
on the boundary of Θ0 (where Θ0 is the identified set under the given dgp P ). Thus, we need
conditions on the tests φn (θn ) (in particular, the critical values and weighting functions, etc.
used in forming the test statistics) and the dgp P and the sequence θn . Section 3.1 gives
the conditions on the tests φn (θn ) and Section 3.2 gives the conditions on P and θn . Section
3.3 verifies these conditions for the interval regression model. Section 3.4 explains how the
conditions differ from those encountered in point identified settings.
Assumption 3.1. For some η > 0, the critical value ĉn = ĉn (θn ) defined in (10) or (11),
depending on the weighting and form of the test, satisfies ĉn (θn ) > η with probability ap-
proaching one.
Assumption 3.1 holds for the kernel CvM based test of Lee et al. (2013), which uses the
least favorable null dgp, as well as the tests using instrument based CvM statistics with
bounded weights proposed in Andrews and Shi (2013). Instrument based CvM statistics
with variance weights have not been considered in the literature. In Section C of the supple-
mentary appendix, I consider critical values for this case and show that critical values based
on the least favorable null dgp will satisfy Assumption 3.1.
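To make the idea concrete, here is a hedged sketch of how a least favorable critical value could be simulated: draw data from a dgp in which every conditional moment binds (E(mj (Wi , θ)|Xi ) = 0 for all j) and take the 1 − α quantile of the scaled statistic. The mean-zero Gaussian moments below are purely illustrative placeholders; the Monte Carlo in Section 5 instead uses the exact null distribution under a specific design.

```python
import numpy as np

def least_favorable_cv(stat_fn, n, alpha=0.05, n_sim=200, seed=0):
    """Simulate the 1 - alpha quantile of sqrt(n) * T_n under an illustrative
    least favorable dgp: every conditional moment has mean zero (all bind).
    stat_fn(X, m) should return the unscaled statistic, e.g. the IV-CvM
    statistic sketched earlier."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_sim)
    for s in range(n_sim):
        X = rng.uniform(0, 1, n)
        m = rng.normal(size=(n, 2))   # placeholder: mean-zero binding moments
        draws[s] = np.sqrt(n) * stat_fn(X, m)
    return np.quantile(draws, 1 - alpha)
```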
Assumption 3.1 only gives a lower bound for a critical value. This gives bounds on the
power, but to derive the exact local asymptotic power, we need the following condition, which
gives a limiting value for this critical value. Under mild conditions on the data generating
process and sequence of local alternatives, this assumption will also hold for the methods of
choosing critical values discussed above.
Assumption 3.2. For the critical value ĉn = ĉn (θn ) defined in (10) or (12), depending on
the weighting and form of the test, and some constant c > 0, ĉn (θn ) →p c.
The power properties of the test will also depend on the class of functions G used as
instruments. I derive power results for the case where G consists of kernel functions with
different bandwidths and locations, defined in the following assumption.
Assumption 3.3. For some bounded, nonnegative function k with finite support and
∫ k(u) du > 0, G = {x 7→ k((x − x̃)/h) | x̃ ∈ RdX , h ∈ R+ }, and the covering number
N (ε, G, L1 (Q)) defined in Pollard (1984) satisfies supQ N (ε, G, L1 (Q)) ≤ Aε−W , where the
supremum is over all probability measures.
The covering number assumption in Assumption 3.3 is a technical condition that allows
for uniform convergence of kernel estimates over x and h. A sufficient condition is that the
kernel k takes the form k(x) = r(‖x‖) where r is a monotone decreasing function on
[0, ∞) (see Pollard, 1984, chapter 2, problem 28).
For CvM statistics, I place the following condition on the measure µ over which the
sample means are integrated.
Assumption 3.4. The measure µ has bounded support, and has a density fµ (x̃, h) with
respect to the Lebesgue measure on RdX × [0, ∞) that is bounded and continuous.
Relaxing this assumption would lead to different power properties, although the general
point that Lp statistics perform worse in these models than supremum statistics would go
through.
Assumption 3.5. Let m̄j (θ, x) ≡ E(mj (Wi , θ)|Xi = x). The set X0 ≡ {x | m̄j (θ0 , x) = 0
for some j} is finite, with elements x1 , . . . , xℓ , and, for each xk ∈ X0 , let J(k) ≡ {j |
m̄j (θ0 , xk ) = 0}. Assume that there exist neighborhoods B(xk ) of each xk ∈ X0 such that the
following assumptions hold.

i.) There exists η > 0 such that, for θ in a neighborhood of θ0 , we have (a) m̄j (θ, x) > η
for j ∉ J(k) for x ∈ B(xk ) and (b) m̄j (θ, x) > η for all j for x ∉ ∪ℓk=1 B(xk ).
ii.) For j ∈ J(k), m̄j (θ0 , x) is continuous on the closure of B(xk ) and satisfies
\[
\sup_{\|x - x_k\| \le \delta} \left| \frac{\bar m_j(\theta_0, x) - \bar m_j(\theta_0, x_k)}{\|x - x_k\|^{\gamma(j,k)}} - \psi_{j,k}\!\left(\frac{x - x_k}{\|x - x_k\|}\right) \right| \xrightarrow{\delta \to 0} 0
\]
for some γ(j, k) > 0 and some function ψj,k : {t ∈ RdX | ‖t‖ = 1} → R with ψ̄ ≥
ψj,k (t) ≥ ψ̲ for some ψ̄ < ∞ and ψ̲ > 0. For future reference, define γ = maxj,k γ(j, k)
and J̃(k) = {j ∈ J(k) | γ(j, k) = γ}.
iv.) For j ∈ J(k), s2j (x, θ) ≡ var(mj (Wi , θ)|Xi = x) is strictly positive and continuous at
(xk , θ0 ).
v.) For x in the closure of B(xk ) and θ in a neighborhood of θ0 , m̄(θ, x) has a derivative
as a function of θ that is continuous as a function of (θ, x). Let m̄θ,j (θ, x) denote the
jth row of this derivative matrix (i.e. the derivative of m̄j (θ, x) with respect to θ).
Assumption 3.6. The data are iid and, for some fixed Ȳ < ∞ and θ in some neighborhood
of θ0 , |m(Wi , θ)| ≤ Ȳ with probability one.
The deterministic bound in Assumption 3.6 allows for the use of certain technical results
that are useful in the proofs. It may be possible to relax this assumption, although additional
technical arguments would be needed in some places.
The following assumption, which is used for kernel based statistics, ensures that the
kernel estimators do not encounter boundary problems (cf. Assumption 1(iii) in Lee et al.,
2013).
Assumption 3.7. Xi has a density fX that is bounded, and the weighting function ωj (θ, x)
is continuous for all j and, for some ε > 0, is equal to zero whenever fX (x̃) < ε for some x̃
with ‖x̃ − x‖ < ε.
3.3 Discussion and Primitive Conditions for Interval Regression
In discussing these assumptions, it is useful to keep in mind the interval regression model
introduced in the introduction, in which Wi = (Xi , WiL , WiH ) and m(Wi , θ) = (WiH −
(1, Xi′)θ, (1, Xi′)θ − WiL )′. The following gives a general discussion of these assumptions,
with references to the interval regression model as an example. I then state primitive suffi-
cient conditions in the interval regression model that imply these assumptions with γ = 2.
Section A of the appendix gives primitive conditions in additional settings.
The assumptions used here are similar to the conditions used in Armstrong (2015) to
derive the asymptotic distribution and local power of a KS statistic with bounded weights.
In particular, Assumption 3.5 corresponds to the version of Assumption 3.1 in Armstrong
(2015) used in Section 5 of that paper, in which part (ii) is replaced by Assumption 5.1 in
Armstrong (2015). Part (i) strengthens the version used in Armstrong (2015) by extending
it to a neighborhood of θ0 , and part (v) is an additional condition on the derivative with
respect to θ. These additional conditions are used to derive local power, and are similar to
Assumption 7.1 in Armstrong (2015).
Assumption 3.5 is the main substantive condition that gives rise to the local power
results derived in this paper. It states that the conditional mean of the moment conditions
is equal to zero only at a finite number of points. In the context of the interval regression
model, this holds for θ0 on the boundary of the identified set when the regression line (1, x′)θ0
is tangent to E(WiH |Xi = x) or E(WiL |Xi = x) at a finite number of points. In general,
a sufficient condition for this in the case where Xi has compact support is that m̄j (θ, x)
takes its minimum on the interior of the support of Xi and m̄j (θ, x) is twice continuously
differentiable with a positive definite second derivative matrix at any point where it takes a
minimum (see Section A.1 in the appendix).
The most natural case where this does not hold is where E(WiH |Xi = x) or E(WiL |Xi =
x) is linear and equal to (1, x′)θ on a nondegenerate interval (the other possibility is for
E(WiH |Xi = x) − (1, x′)θ0 to be zero on a set with infinitely many elements, but with zero
probability, such as with the function sin(1/x)). This holds in the point identified case where
P (WiH = WiL |Xi ) = 1 for Xi on a nondegenerate interval (and, in particular, in the special
case where WiH = WiL with probability one, leading to the usual linear regression model).
However, when θ is set identified, this is a knife-edge case: even if E(WiH |Xi ) = (1, Xi′)θ0
for Xi on a nondegenerate interval for a given θ0 on the boundary of the identified set, we
will typically have E(WiH |Xi = x) = (1, x′)θ̃0 only on a finite set for θ̃0 close to θ0 .
This is illustrated by Figures 2 and 3, which are taken directly from Section 2.2 of
Armstrong (2015). Each figure shows the conditional mean E(WiH |Xi = x) for some dgp along
with regression lines corresponding to particular parameter values θ (the lower conditional
mean E(WiL |Xi = x) can be taken to be below the area shown in each figure). In Figure 2,
the regression line (1, x′)θ = θ1 + θ2 x is tangent to the conditional mean at a single point,
and Assumption 3.5 holds for the parameter θ. In Figure 3, the regression line θa,1 + θa,2 x
corresponding to the parameter θa is equal to E(WiH |Xi = x) on a nondegenerate interval,
so that Assumption 3.5 does not hold. However, at nearby parameter values such as θb , the
regression line is equal to E(WiH |Xi = x) at a single point and Assumption 3.5 holds. See
Section 2.2 of Armstrong (2015) for further discussion.
In the case where m̄(θ0 , x) is twice continuously differentiable in x, part (ii) of Assumption
3.5 follows from a second order Taylor expansion at xk , so long as the second derivative
matrix is positive definite. In this case, Assumption 3.5 holds with γ = 2 and ψj,k (u) =
u′Vj (xk )u/2, where Vj (xk ) is the second derivative matrix of x 7→ m̄j (θ0 , x) at xk . In the
interval regression model, the second derivative of m̄1 (θ0 , x) is equal to the second derivative
of E(WiH |Xi = x) (and similarly for m̄2 (θ0 , x) and −E(WiL |Xi = x)), so this translates
directly to an assumption of a positive definite second derivative matrix of E(WiH |Xi = x).
In the case where m̄(θ0 , x) is Lipschitz continuous, part (ii) of Assumption 3.5 will hold with
γ = 1 if we place additional regularity conditions on the one-sided directional derivative of
m̄(θ0 , x). The parameter θ in Figure 2 illustrates a case where Assumption 3.5 holds with
γ = 2, while the parameter θb in Figure 3 illustrates a case where Assumption 3.5 holds
with γ = 1. See Theorem A.1 in Section A.2 of the appendix for a formal statement in the
interval regression model.
The remaining assumptions are regularity conditions that translate easily to primitive
objects in the case of interval regression. For part (v), note that m̄θ,1 (θ, x) = −(1, x′) and
m̄θ,2 (θ, x) = (1, x′), which are clearly continuous, so this assumption holds without further
conditions on the dgp.
The following gives a formal statement of primitive conditions for the interval regression
model in the case where the conditional means are twice differentiable. The proof of this
result uses the ideas in the discussion above, and is given in Section A.2 of the appendix.

Theorem 3.1. Suppose that the following conditions hold.

i.) The conditional means E(WiH |Xi = x) and E(WiL |Xi = x) are twice differentiable with
continuous second derivatives, Xi has a continuous density and compact support, and
WiH and WiL are bounded from above and below by finite constants.
ii.) For any point x̃ such that E(WiH |Xi = x̃) = (1, x̃′)θ0 , x̃ is in the interior of the
support of Xi , var(WiH |Xi = x) is positive and continuous at x̃ and E(WiH |Xi = x)
has a positive definite second derivative matrix at x̃. The same holds for E(WiL |Xi = x)
with “positive definite” replaced by “negative definite.”

Then Assumptions 3.5 and 3.6 hold, with γ = 2 in Assumption 3.5.
In the present setting, the results in this paper show that, even though √n local power
is possible in certain special cases, the minimax (worst-case) power is slower than √n when
one only places bounds on derivatives of certain objects. In particular, while a bound on the
second derivative of E(WiH |Xi = x) and E(WiL |Xi = x) does not imply Assumption 3.5 in
the interval regression model, one can construct a dgp such that Assumption 3.5 holds with
γ = 2 for any nonzero bound on the second derivative. Thus, the minimax rates of local power
for CvM statistics under a bound on the second derivative are at least as slow as the rates
derived in this paper, which are slower than √n. Since the results in Armstrong (2014b) show
that the corresponding KS statistics achieve a better rate for local alternatives uniformly
over dgps with a bound on the second derivative (and additional regularity conditions), this
means that the KS statistic is preferred to the CvM statistic under a minimax criterion in
this class. See Section A.5 in the appendix for formal statements.
Theorem 4.1. Let an = an−γ/{2[dX +γ+(dX +1)/p]} for some vector a ∈ Rdθ , and define
\[
\lambda_{bdd}(a, j, k, p) = \lambda_{bdd}(a, \bar m_{\theta,j}(\theta_0, x_k), \psi_{j,k}, f_X(x_k), f_\mu(x_k, 0), p)
\equiv \int\!\!\int \left| \int \left[ \|x\|^\gamma \psi_{j,k}\!\left(\frac{x}{\|x\|}\right) + \bar m_{\theta,j}(\theta_0, x_k)\, a \right] k((x - \tilde x)/h)\, f_X(x_k)\, dx \right|_-^p f_\mu(x_k, 0)\, d\tilde x\, dh.
\]
Under Assumptions 3.3, 3.4, 3.5, and 3.6,
\[
n^{1/2}\, T_{n,p,1,\mu}(\theta_0 + a_n) \xrightarrow{p} \Big[ \sum_{k=1}^{|X_0|} \sum_{j \in \tilde J(k)} \lambda_{bdd}(a, j, k, p) \Big]^{1/p} \equiv r_{bdd}(a).
\]
Theorem 4.1 has immediate consequences for the power of tests based on CvM statistics
with bounded weightings.
Theorem 4.2. If, in addition to the conditions of Theorem 4.1, Assumption 3.1 holds, the
power
Eφn,p,1,µ (θ0 + an )
of the test φn,p,1,µ (θ0 + an ) will converge to zero for rbdd (a) < c. If a is close enough to zero,
rbdd (a) will be less than c so that the power will converge to zero. If, in addition, Assumption
3.2 holds, the power will converge to 1 for rbdd (a) > c.
The n−γ/{2[dX +γ+(dX +1)/p]} rate for instrument based CvM statistics with bounded weights
is slower than the n−γ/{2[dX +γ]} rate derived for the corresponding KS test in Theorem 14 of
Armstrong (2015) (for γ = 2) and Theorem 5.1 of Armstrong (2014b) (α from that paper
plays the role of γ here). Note also that local power increases as p increases, and the rate
becomes arbitrarily close to the rate for the KS test as p increases.
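The size of this gap is easy to tabulate. The snippet below evaluates both rate exponents for γ = 2 and dX = 1; it is pure arithmetic on the rates quoted above, nothing more:

```python
def q_cvm_bdd(gamma, dX, p):
    # exponent for the instrument CvM test with bounded weights (Theorem 4.2)
    return gamma / (2 * (dX + gamma + (dX + 1) / p))

def q_ks(gamma, dX):
    # exponent for the corresponding instrument KS test (Armstrong, 2015)
    return gamma / (2 * (dX + gamma))

for p in (1, 2, 10, 100):
    print(p, round(q_cvm_bdd(2, 1, p), 4), round(q_ks(2, 1), 4))
# 0.2, 0.25, 0.3125, 0.3311 versus the KS exponent 1/3 = 0.3333
```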
Theorem 4.3. Let an = an−γ/{2[dX /2+γ+(dX +1)/p]} for some vector a ∈ Rdθ , and define
\[
\lambda_{var}(a, j, k, p) \equiv \int\!\!\int \left| \int \left[ \|x\|^\gamma \psi_{j,k}\!\left(\frac{x}{\|x\|}\right) + \bar m_{\theta,j}(\theta_0, x_k)\, a \right] w_j(x_k)\, h^{-d_X/2}\, k((x - \tilde x)/h)\, f_X(x_k)\, dx \right|_-^p f_\mu(x_k, 0)\, d\tilde x\, dh.
\]
Suppose that σn (n/ log n)1/2 → ∞ and Assumptions 3.3, 3.4, 3.5, and 3.6 hold. Then
\[
n^{1/2}\, T_{n,p,(\hat\sigma \vee \sigma_n)^{-1},\mu}(\theta_0 + a_n) \le \Big[ \sum_{k=1}^{|X_0|} \sum_{j \in J(k)} \lambda_{var}(a, j, k, p) \Big]^{1/p} + o_p(1) \equiv r_{var}(a) + o_p(1)
\]
where rvar (a) → 0 as a → 0. If, in addition, σn ndX /{4[dX /2+γ+(dX +1)/p]} → 0, the above display
will hold with the inequality replaced by equality.
The result has immediate consequences for the power of tests based on CvM statistics
with truncated variance weightings.
Theorem 4.4. Let an be defined as in Theorem 4.3 and suppose that the conditions of that
theorem and Assumption 3.1 hold. The power
of the test φn,p,(σ∨σn )−1 ,µ (θ0 +an ) will converge to zero for rvar (a) < c. For a close enough to 0,
rvar (a) will be less than c so that the power will converge to zero. If, in addition, Assumption
3.2 holds and σn ndX /{4[dX /2+γ+(dX +1)/p]} → 0, the power will converge to 1 for rvar (a) > c.
As with bounded weighting functions, the rate for detecting local alternatives with CvM
statistics with variance weights is slower than the rate for the corresponding KS test. The
n−γ/{2[dX /2+γ+(dX +1)/p]} rate for variance weighted CvM statistics derived above contrasts
with the (n/ log n)−γ/[2(dX /2+γ)] rate for the corresponding KS test derived in Armstrong
and Chan (2016) and Armstrong (2014b) (the results from the latter paper on rates of
convergence of confidence regions in the Hausdorff metric imply these local power results).
The rate for CvM statistics approaches the rate for KS statistics as p → ∞.
and
\[
\tilde\lambda_{kern}(a, j, k, p) \equiv \int \left| \left[ \|v\|^\gamma \psi_{j,k}\!\left(\frac{v}{\|v\|}\right) + \bar m_{\theta,j}(\theta_0, x_k)\, a \right] \omega_j(\theta_0, x_k) \right|_-^p dv.
\]
Theorem 4.5. Suppose that Assumptions 3.4, 3.5, 3.6 and 3.7 hold, and that the kernel
function k satisfies Assumption 3.3. In addition, suppose that the bandwidth h satisfies
h/n−s → ch for some 0 < s < 1/dX and ch > 0, that the kernel function k satisfies
∫ k(u) du = 1, and that the functions ψj,k in Assumption 3.5 are continuous. Let an = an−q
for some a ∈ Rdθ where
\[
q = \begin{cases} s\gamma & \text{if } s < 1/[2(\gamma + d_X/p + d_X/2)] \\ (1 - s d_X)/[2(1 + d_X/(p\gamma))] & \text{if } s \ge 1/[2(\gamma + d_X/p + d_X/2)]. \end{cases}
\]
The result has immediate implications for the power of tests based on kernel CvM statis-
tics.
Theorem 4.6. Let an be defined as in Theorem 4.5 and suppose that the conditions of that
theorem and Assumption 3.1 hold. If s > 1/[2(γ + dX /p + dX /2)], the power
Eφn,p,kern (θ0 + an )
of the test φn,p,kern (θ0 + an ) will converge to zero for r̃kern (a) < c. If s = 1/[2(γ + dX /p +
dX /2)], the power given by the above display will converge to zero for r̃kern (a, ch ) < c. If
s < 1/[2(γ + dX /p + dX /2)], the power given by the above display will converge to zero if
r̃kern (a, ch ) = 0 in a neighborhood of (a, ch ). If, in addition, Assumption 3.2 holds, the power
given by the above display will converge to 1 if r̃kern (a) > c, r̃kern (a, ch ) > c, or r̃kern (a, ch ) > 0
in the cases where s is greater than, equal to, or less than 1/[2(γ + dX /p + dX /2)] respectively.
As with instrument based statistics, the rate for detecting local alternatives with the
kernel CvM test is slower than the rate for the corresponding KS statistic. The rate derived
in Theorem 4.5 can be written as max{(nhdX )−1/[2(1+dX /(pγ))] , hγ }, which is slower than the
max{(nhdX / log n)−1/2 , hγ } rate for kernel based KS statistics derived in Armstrong (2014b).
As with the instrument based statistics, the CvM test is more powerful for p larger, and the
rate approaches the rate for the KS test as p goes to ∞.
Theorem 4.5 can be used to choose the optimal bandwidth in this setting. The rate
an = an−q is best when s = 1/[2(γ + dX /p + dX /2)], which gives an exponent in the rate of
\[
q = \frac{\gamma}{2(\gamma + d_X/p + d_X/2)} = \frac{1 - s d_X}{2(1 + d_X/(p\gamma))} = s\gamma.
\]
Note that this rate is faster than the n−γ/{2[dX /2+γ+(dX +1)/p]} rate that can be obtained
with instrument based CvM tests with variance weights. Thus, restricting the class of
instruments using prior knowledge of the data generating process leads to a faster rate
with CvM statistics. In contrast, instrument based KS statistics with variance weights can
achieve the same rate as kernel KS statistics that use prior knowledge of the data generating
process to choose the bandwidth optimally (cf. Armstrong, 2014b; Armstrong and Chan,
2016; Chetverikov, 2012).
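For reference, the optimal bandwidth exponent s∗ = 1/[2(γ + dX /p + dX /2)] and the resulting rate exponent q = s∗γ from the display above can be evaluated directly; the snippet below is plug-in arithmetic only:

```python
def optimal_bandwidth_exponent(gamma, dX, p):
    # s* equating the two branches of q in Theorem 4.5
    return 1.0 / (2 * (gamma + dX / p + dX / 2))

def kern_cvm_rate_exponent(gamma, dX, p):
    # best achievable q = s* gamma for the kernel CvM test
    return optimal_bandwidth_exponent(gamma, dX, p) * gamma

for p in (1, 2, 10):
    s = optimal_bandwidth_exponent(2, 1, p)
    print(p, round(s, 4), round(kern_cvm_rate_exponent(2, 1, p), 4))
# gamma = 2, dX = 1, p = 1: s* = 1/7, q = 2/7, i.e. h ~ n^(-1/7), a_n ~ n^(-2/7)
```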
5 Monte Carlo
This section reports the results of a Monte Carlo study of the finite sample properties of the
statistics considered in this paper. I perform a Monte Carlo based on a median regression
model with potentially endogenously missing data. I use the same data generating processes
as for the Monte Carlo for variance weighted KS statistics in Armstrong and Chan (2016).
A description of the model and data generating processes is repeated here for convenience.
The latent variable Wi∗ follows a linear median regression model given the observed
covariate Xi : q1/2 (Wi∗ |Xi ) = θ1 + θ2 Xi where q1/2 (Wi∗ |Xi ) is the conditional median of Wi∗
given Xi . Define WiH = Wi∗ when Wi∗ is observed and WiH = ∞ otherwise. This gives the
conditional moment inequality E[I(θ1 + θ2 Xi ≤ WiH ) − 1/2|Xi ] ≥ 0 a.s. (a similar inequality
can be formed with the lower bound WiL defined analogously, but with WiL = −∞ when Wi∗
is unobserved, which would give the interval quantile regression setup of Section A.3 of the
appendix; the Monte Carlo focuses on the inequality corresponding to WiH for simplicity).
This model allows for arbitrary correlation between the “missingness” process and (Wi∗ , Xi ),
so that the resulting bounds can be used to assess sensitivity to missingness at random
assumptions that would point identify the model.
Each design uses data from the true model Wi∗ = θ1∗ + θ2∗ Xi + ui , where (θ1∗ , θ2∗ ) = (0, 0)
and ui is independent of Xi with ui ∼ unif(−1, 1). The outcome variable Wi∗ is then set to be
missing independently of Wi∗ with probability p(Xi ) (note that, while the data are generated
according to a missingness at random assumption and a particular parameter value, the tests
are robust to failure of this assumption, which leads to a lack of point identification), where
p(x) is varied in each of three designs:
Design 1: p(x) = .1
Design 2: p(x) = .02 + 2 · .98 · |x − .5|
Design 3: p(x) = .02 + 4 · .98 · (x − .5)2 .
This leads to the identified set Θ0 = {(θ1 , θ2 )′ | θ1 + θ2 x ≤ q1/2 (WiH |Xi = x) all x ∈ [0, 1]},
where q1/2 (WiH |Xi = x) can be calculated for each design as q1/2 (WiH |Xi = x) = 1/(1 −
p(x)) − 1. For each design, the Monte Carlo power of φ(θ) for each test φ under the dgp in the
given design is reported for θ = (θ̲1 + a, 0) where θ̲1 = sup{θ1 | (θ1 , 0) ∈ Θ0 } and a varies over
the set {.1, .2, .3, .4, .5}. This leads to local alternatives that satisfy the conditions of this
paper with γ = 1 for Design 2 and γ = 2 for Design 3. Design 1 leads to a flat conditional
mean for which asymptotic theory predicts the following rates (for the instrument functions
used here): n−1/2 for kernel and instrument based CvM and unweighted instrument based KS
statistics, (n/ log n)−1/2 for variance weighted instrument KS statistics and (nh/ log n)−1/2
for kernel KS statistics (see Andrews and Shi, 2013; Armstrong, 2014b; Chernozhukov et al.,
2013; Lee et al., 2013).
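For concreteness, the following sketch draws one sample from each design and computes the boundary point θ̲1 = sup{θ1 | (θ1 , 0) ∈ Θ0 } = minx q1/2 (WiH |Xi = x). Only the dgp and the formula for q1/2 come from the text above; the implementation details (grid, seed) are illustrative.

```python
import numpy as np

# Missingness probabilities p(x) for the three designs
DESIGNS = {
    1: lambda x: np.full_like(x, 0.1),
    2: lambda x: 0.02 + 2 * 0.98 * np.abs(x - 0.5),
    3: lambda x: 0.02 + 4 * 0.98 * (x - 0.5) ** 2,
}

def draw(design, n, rng):
    """One sample: W* = u ~ unif(-1, 1) (so theta* = (0, 0)), with W* set to
    missing independently with probability p(X); WH = W* if observed, else +inf."""
    X = rng.uniform(0, 1, n)
    Wstar = rng.uniform(-1, 1, n)
    observed = rng.uniform(0, 1, n) >= DESIGNS[design](X)
    WH = np.where(observed, Wstar, np.inf)
    return X, WH

def theta1_bar(design):
    # q_{1/2}(WH | X = x) = 1/(1 - p(x)) - 1; the boundary point for
    # theta = (theta1, 0) is the minimum of this function over the support
    grid = np.linspace(0.001, 0.999, 999)  # interior grid (p(x) -> 1 at the endpoints of Design 2)
    return np.min(1.0 / (1.0 - DESIGNS[design](grid)) - 1.0)

rng = np.random.default_rng(0)
for d in (1, 2, 3):
    X, WH = draw(d, 500, rng)
    print(d, round(theta1_bar(d), 4))
# Design 1: 1/0.9 - 1 = 0.1111; Designs 2 and 3: 1/0.98 - 1 = 0.0204 (at x = 0.5)
```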
For the instrument based statistics, I use the class of functions {x 7→ I(s < x < s+t)|0 ≤
s ≤ s + t ≤ 1} and the Lebesgue measure on {(s, t)|0 ≤ s ≤ s + t ≤ 1} for µ for the
instrument based CvM statistics. This corresponds to the multiscale kernel instruments in
Assumption 3.3 with the uniform kernel. For the kernel based statistics, the uniform kernel
is used, and the supremum or integral is taken over the set [h/2, 1 − h/2], so that the support
of the kernel function is always contained in the support of Xi . For the CvM statistics, the
simulations use the test with Lp exponent p = 1. For each test statistic, the critical value
is taken from the least favorable null distribution, calculated exactly (up to Monte Carlo
error) using the distribution under (θ1 , 0) under Design 1. For the kernel estimators, the
bandwidths n−1/5 , n−1/3 and n−1/2 are used, and, for the truncated variance weighted CvM
statistics, the values n−1/5 /4, n−1/3 /4 and n−1/2 /4 are used for the truncation parameter σn2
(this corresponds to truncating the variance of functions I(s < x < s + t) with t less than
n−1/5 , n−1/3 and n−1/2 ). For comparison, results for the variance weighted instrument KS
statistic, which corresponds to the multiscale statistic of Armstrong and Chan (2016), are
reported as well (taken directly from that paper).
Overall, the Monte Carlo results support the claim that, for the data generating processes
and classes of instrument functions considered in the theoretical results in this paper, KS
statistics perform better than CvM statistics. For Design 2 and Design 3, which follow
the conditions of this paper with γ = 1 and γ = 2 respectively, the instrument based KS
statistic has more power than the instrument based CvM statistic in basically all cases. For
the kernel statistics, the KS test performs better unless the bandwidth is chosen to be much
too small. For example, for Design 3, the optimal bandwidth for the kernel statistic is of
order n−1/5 , and the kernel KS statistic performs better than the kernel CvM statistic with
this bandwidth. However, the kernel statistic performs worse for smaller bandwidths when
the sample size is not too large (although the KS statistic does almost as well or better with
1000 observations, suggesting that the asymptotics of Theorem 4.5 have started to kick in
at this point).
Note also that power in the Monte Carlo is very sensitive to the design, with greater
power for Design 3 than Design 2. This is to be expected given the asymptotic results.
Under Design 3, the assumptions of this paper hold with γ = 2, while, under Design 2,
the assumptions hold with γ = 1. The results of Section 4 show that asymptotic power is
increasing in γ (the rate at which local alternatives may approach the null with nontrivial
power is faster for larger γ) for each of the test statistics considered.
For Design 1, asymptotic results from elsewhere in the literature predict that the instru-
ment based statistics with the instruments used here perform about the same (in terms of
the rate for detecting local alternatives) for KS and CvM statistics, although the variance
weighted KS statistic performs slightly worse (by a log n factor). For kernel statistics, asymp-
totic theory predicts that KS statistics will perform worse than CvM statistics in this case
(the latter can achieve a n−1/2 rate, while the former cannot if the bandwidth goes to zero).
All of these predictions are borne out in the Monte Carlo: instrument based statistics all
perform well with the weighted KS statistics performing slightly worse, while the CvM version
is better for kernel statistics.
The Monte Carlo results also fit well with the prescription of the weighted instrument
KS or “multiscale” statistic of Armstrong (2011), Armstrong (2014b), Armstrong and Chan
(2016) and Chetverikov (2012) as the only test among the ones considered here that comes
close to having the best power among these test statistics for all three Monte Carlo designs
(according to asymptotic approximations, the weighted instrument KS test achieves the
best rate to at least within a log n factor in all three cases, while each of the other statistics
considered here performs worse by a polynomial factor in at least one case). While other
statistics perform slightly better in certain cases, they perform much worse in others (e.g.
the kernel KS statistic performs slightly better in Design 3 with the optimal bandwidth,
n−1/5 , but performs much worse when other bandwidths are chosen, or with any bandwidth
choice in Design 1).
6 Conclusion
This paper derives local power results for tests for conditional moment inequality models
based on several forms of CvM statistics in the set identified case. The power comparisons
hold under conditions that arise naturally in the set identified case, and determine the
minimax rate. The results show that KS tests are preferred to CvM statistics and that
variance weightings are preferred to bounded weightings.
Lemma A.1. Suppose that Xi has compact support and that f is a continuous, nonnegative
function on the support of Xi that is equal to zero only at interior points of the support and
is twice continuously differentiable with a positive definite second derivative matrix at any
point where it is equal to zero. Then the set {x | f (x) = 0} is finite. In particular, this
applies to the binding moment functions x 7→ m̄j (θ, x) in the interval regression model.
Proof. The result follows from the proof of Lemma B.1 in the supplementary appendix of
Armstrong (2015).
Proof of Theorem 3.1. First, note that the set of x such that m̄j (θ, x) = 0 for some j is finite
by Lemma A.1. Part (ii) of Assumption 3.5 follows from a second order Taylor expansion,
and part (i) follows by compactness of the support of Xi and continuity of the first two
derivatives of the conditional means. Part (iv) is immediate from part (ii) of the conditions
of the theorem and the fact that the conditional variance is constant in θ for this model. For
part (v), note that (d/dθ)m̄1 (θ, x) = −(d/dθ)m̄2 (θ, x) = −(1, x′), which is clearly continuous in (θ, x).
Assumption 3.6 is immediate from the bounds on WiH and WiL .
For the Lipschitz case (γ = 1), we can replace the assumption of two derivatives with
a condition on the directional one-sided first derivatives. Here, we make the assumption of
finiteness of the set where the conditional moments bind directly, since arguments involving
second derivatives do not apply. In the following, S^{dX −1} denotes the unit sphere {u ∈
RdX | ‖u‖ = 1}.
Assumption A.1. i.) The conditional means E(WiH |Xi = x) and E(WiL |Xi = x) are
Lipschitz continuous, Xi has a continuous density and compact support, and WiH and
WiL are bounded from above and below by finite constants.
ii.) The set X0 ≡ {x | E(WiH |Xi = x) = (1, x′)θ0 } is finite, and, for any point x̃ ∈ X0 , x̃ is in
the interior of the support of Xi , var(WiH |Xi = x) is positive and continuous at x̃ and
the one-sided directional derivative d/dt+ [E(WiH |Xi = x̃ + tu) − (1, (x̃ + tu)′)θ0 ] is bounded
from below away from zero at t = 0 and is right continuous at t = 0 uniformly over
u ∈ S^{dX −1} . The same holds for E(WiL |Xi = x) with “positive” replaced by “negative”
in the last statement.
Theorem A.1. Under Assumption A.1, Assumptions 3.5 and 3.6 hold, with γ = 1 in
Assumption 3.5.
Proof. Part (ii) of Assumption 3.5 follows from a first order Taylor expansion, and part (i)
follows by compactness of the support of Xi and the continuity and lower bound on the
directional derivatives. The verification of the remaining conditions is the same as in the
twice differentiable case.
Assumption A.2. i.) The conditional quantiles qτ (WiH |Xi = x) and qτ (WiL |Xi = x) are
twice differentiable with continuous second derivatives and Xi has a continuous density
and compact support.
ii.) For any x̃ such that qτ (WiH |Xi = x̃) = (1, x̃′)θ0 , x̃ is in the interior of the support of
Xi and qτ (WiH |Xi = x) has a positive definite second derivative matrix at x̃. The same
holds for qτ (WiL |Xi = x) with “positive definite” replaced by “negative definite.”
In addition, we will also require an assumption on the conditional densities of WiH and
WiL given Xi .
Assumption A.3. For some η > 0, WiH |Xi and WiL |Xi have conditional densities fWiH |Xi (w|x)
and fWiL |Xi (w|x) on {(x, w)|qτ,P (WiH |Xi = x) − η ≤ w ≤ qτ,P (WiH |Xi = x) + η} and
{(x, w)|qτ,P (WiL |Xi = x) − η ≤ w ≤ qτ,P (WiL |Xi = x) + η} respectively that are continuous
as a function of (x, w) and bounded away from zero on these sets.
Theorem A.2. Suppose that Assumptions A.2 and A.3 hold. Then Assumptions 3.5 and
3.6 hold, with γ = 2 in Assumption 3.5.
Proof. Let θ0 ∈ Θ0 satisfy the conditions of the theorem and let x̃ be such that qτ (WiH |Xi =
x̃) = (1, x̃′)θ0 . Let V (x) denote the second derivative matrix of x 7→ qτ (WiH |Xi = x). Then,
for δ small enough and ‖x − x̃‖ ≤ δ,
\[
\bar m_1(\theta_0, x) = \tau - P(W_i^H \le (1, x')\theta_0 \mid X_i = x) = \int_{(1,x')\theta_0}^{q_\tau(W_i^H | X_i = x)} f_{W_i^H | X_i}(w|x)\, dw
= \int_{(1,x')\theta_0}^{(1,x')\theta_0 + (x - \tilde x)' V(\tilde x)(x - \tilde x)/2 + r(x)} f_{W_i^H | X_i}(w|x)\, dw,
\]
where limx→x̃ r(x) = 0 and the last step follows from a second order Taylor expansion. This
expression is bounded from above by f̄ (δ) · [(x − x̃)′V (x̃)(x − x̃)/2 + r̄(δ)] and from below by
f̲ (δ) · [(x − x̃)′V (x̃)(x − x̃)/2 + r̲(δ)], where f̄ (δ) and r̄(δ) are upper bounds for fWiH |Xi (w|x) and
r(x) on {(x, w) | ‖x − x̃‖ ≤ δ, (1, x′)θ0 ≤ w ≤ qτ (WiH |Xi = x)} and f̲ (δ) and r̲(δ) are lower
bounds. As δ → 0, f̄ (δ) and f̲ (δ) converge to fWiH |Xi ((1, x̃′)θ0 |x̃) and r̄(δ) and r̲(δ) converge
to 0, so that
\[
\lim_{\delta \to 0}\ \sup_{\|x - \tilde x\| \le \delta} \left| \frac{\bar m_1(\theta_0, x)}{\|x - \tilde x\|^2} - f_{W_i^H | X_i}((1, \tilde x')\theta_0 | \tilde x)\, \frac{(x - \tilde x)' V(\tilde x)(x - \tilde x)}{2 \|x - \tilde x\|^2} \right| = 0.
\]
Applying this argument to the finite set of values x̃ such that τ − P (WiH ≤ (1, Xi′)θ0 |Xi =
x) = 0 and a symmetric argument for WiL , it follows that part (ii) of Assumption 3.5 holds
with γ = 2.
To verify part (i) of Assumption 3.5, first note that the set X0 = {x | qτ (WiH |Xi = x) =
(1, x′)θ} is finite by Lemma A.1. Using this and similar arguments to those used in the proof
of Theorem 3.1, there exist ε > 0 and δ > 0 such that qτ (WiH |Xi = x) − (1, x′)θ is bounded
away from zero for ‖θ − θ0 ‖ < ε and x such that ‖x − x̃‖ ≥ δ for all x̃ ∈ X0 . It then follows
from Assumption A.3 that τ − P (WiH ≤ (1, Xi′)θ|Xi = x) is bounded away from zero on
such a set. Part (i) of Assumption 3.5 follows from this and a similar argument for WiL .
For part (iv) of Assumption 3.5, note that the conditional variance of the moment function
corresponding to WiH is P (WiH ≤ (1, x′)θ|Xi = x)[1 − P (WiH ≤ (1, x′)θ|Xi = x)], so it
suffices to show that P (WiH ≤ (1, x′)θ|Xi = x) is in the set (0, 1) and is continuous in (θ, x)
at each (θ0 , x̃) such that P (WiH ≤ (1, x̃′)θ0 |Xi = x̃) = τ (i.e. such that m̄1 (θ0 , x̃) = 0). This
follows since, by Assumption A.3, WiH has a continuous conditional density in a neighborhood
of (1, x̃′)θ0 .
For part (v) of Assumption 3.5, note that, for (x, θ) such that WiH has a conditional
density given Xi = x at (1, x′)θ,
\[
\bar m_{\theta,1}(\theta, x) = -\frac{d}{d\theta} P(W_i^H \le (1, x')\theta \mid X_i = x) = -f_{W_i^H | X_i}((1, x')\theta \mid x)\, (1, x').
\]
This is continuous in (θ, x) in a small enough neighborhood of any (θ0 , x̃) with m̄1 (θ0 , x̃) = 0,
since fWiH |Xi (w|x) is continuous for (x, w) in a neighborhood of x = x̃ and w = (1, x̃′)θ0
for any such θ0 and x̃ by Assumption A.3.
benefits) while being independent of the distribution of offer wages.
Let Di denote an indicator variable that is 1 when Yi∗ is observed and 0 otherwise.
We observe (Xi , Yi , Di ) where Yi = Di · Yi∗ . Following Manski (1990), note that, letting
WiH = Yi · Di + Ȳ · (1 − Di ) and WiL = Yi · Di + Y̲ · (1 − Di ), where Y̲ and Ȳ are lower and
upper bounds on Yi∗ , we have WiL ≤ Yi∗ ≤ WiH with probability one. Letting θ = E(Yi∗ )
and using the fact that E(Yi∗ ) = E(Yi∗ |Xi ) a.s., we obtain our setup with m(Wi , Xi , θ) =
(WiH − θ, θ − WiL )′. This is a special case of the
interval regression model of Section A.2, with (θ, 01×dX ) playing the role of θ. That is, we
have the interval regression model with the slope parameter constrained to be zero. Thus,
if we consider a null value θ0 and a sequence of alternatives in the interval regression model
for which the slope parameter is zero, the results of Section A.2 apply immediately to give
primitive conditions for Assumption 3.5 (here Assumption 3.6 holds by construction and the
assumption that Yi∗ is bounded).
Note that E(WiH |Xi = x) = E(Yi∗ Di |Xi = x) + Ȳ · [1 − P (Di = 1|Xi = x)]. Thus,
a sufficient condition for E(WiH |Xi = x) to be twice differentiable (or Lipschitz) is for
P (Di = 1|Xi = x) and E(Yi∗ Di |Xi = x) to be twice differentiable (or Lipschitz). It is
also worth noting that cases where E(WiH |Xi = x) is minimized at the (possibly infinite)
boundary of the support of Xi are often of interest, and arise naturally in this setting (see,
e.g., Andrews and Schafgans 1998 and Heckman 1990). While Assumption 3.5 formally
precludes the possibility that the minimum of E(WiH |Xi = x) is taken at the boundary of
the support of Xi , such cases can be handled for certain forms of instrument based statistics
by transforming the support of Xi (see Section B.3 of Armstrong 2014b for an example of
this type of argument applied to instrument based KS statistics). We leave this extension
for future research.
under a minimax criterion. By results in Armstrong (2014b), it follows that, for certain
classes of alternatives defined by smoothness conditions, the variance weighted KS statistic
of Armstrong (2014b), Armstrong and Chan (2016) and Chetverikov (2012) is preferred to
the CvM statistics considered in this paper under a minimax criterion.
To formalize these ideas, the rest of this section considers classes P of underlying distri-
butions and uses the notation EP and Θ0 (P ) to denote expectations and the identified set
under a distribution P . In the results below, d(θ, θ̃) denotes the Euclidean distance kθ − θ̃k.
Theorem A.3. Let φCvM (θ) be one of the CvM tests defined in (10) or (11) with the critical
value satisfying Assumption 3.1, the class G or kernel function k satisfying Assumption 3.3,
and the measure µ satisfying Assumption 3.4 for the instrument case and the weighting
satisfying Assumption 3.7 for the kernel case. Let P be any class of distributions such that,
for some P ∗ ∈ P and θ0∗ on the boundary of Θ0 (P ∗ ), Assumptions 3.5 and 3.6 hold, and
either (a) θ0∗ is on the boundary of the convex hull of Θ0 (P ∗ ) or (b) for some a ∈ Rdθ and
a constant K, d(θ0∗ , θ0∗ + ar) ≤ K · d(θ0 , θ0∗ + ar) for all θ0 ∈ Θ0 (P ∗ ) and r small enough.
Then, for a small enough constant C∗ > 0,
\[
\limsup_{n \to \infty}\ \inf_{P \in \mathcal P}\ \inf_{\theta\ \mathrm{s.t.}\ d(\theta, \theta_0) \ge C_* r_n\ \mathrm{all}\ \theta_0 \in \Theta_0(P)} E_P\, \varphi_{CvM}(\theta) = 0,
\]
where rn is the rate for the given test in Section 4 with γ given in Assumption 3.5.
Proof. Under condition (b), the result is immediate from the results in the main text, since
the quantity in the display in the theorem is less than lim supn→∞ EP ∗ φCvM (θ0∗ +aC∗ rn K/kak)
for P ∗ , θ0∗ and a given in the theorem. The result follows since condition (a) implies condition
(b) with K = 1. To see this, note that, by the supporting hyperplane theorem, there exists a
vector a with ‖a‖ = 1 such that a′θ̃0 ≤ a′θ0∗ for all θ̃0 in the convex hull of Θ0 (P ∗ ). For this a
and any scalar r > 0 and θ̃0 ∈ Θ0 (P ∗ ),
\[
d(\theta_0^* + ar, \tilde\theta_0)^2 - d(\theta_0^* + ar, \theta_0^*)^2 = \|\theta_0^* + ar - \tilde\theta_0\|^2 - r^2 a'a
= \|\theta_0^* - \tilde\theta_0\|^2 + 2r a'(\theta_0^* - \tilde\theta_0) + r^2 a'a - r^2 a'a \ge \|\theta_0^* - \tilde\theta_0\|^2 \ge 0.
\]
is stated in the next theorem, which follows immediately from results in Armstrong (2014b)
(the results in Armstrong, 2014b consider a stronger notion of coverage and power).
For concreteness, let us consider a specific version of the inverse variance weighted
KS statistic considered in Armstrong (2014b). Let Tn,∞,(σ∨σn )−1 (θ) be given by (8) with
G = {x 7→ I(‖x − x̃‖ ≤ h) | x̃ ∈ RdX , h ∈ [0, ∞)} and ωj (θ, g) = {σ̂j (θ, g) ∨ [(log n)2 /n]}−1 . Let
φn,∞,(σ∨σn )−1 (θ) be given by (12) with this definition of Tn,∞,(σ∨σn )−1 (θ) and with ĉn,∞,(σ∨σn )−1
given by the constant K in Theorem 3.1 in Armstrong (2014b). In the interest of concrete-
ness, the above formulation uses certain conservative constants and tuning parameters in
defining the test φn,∞,(σ∨σn )−1 (θ). Less conservative and data driven methods for choos-
ing these constants have been considered by Armstrong and Chan (2016) and Chetverikov
(2012).
Theorem A.4. Suppose that P satisfies Assumptions 4.1, 4.3, 4.4 and 4.5 in Armstrong
(2014b), with γ taking the place of α in that paper. Then lim supn→∞ supP ∈P supθ0 ∈Θ0 (P )
EP φn,∞,(σ∨σn )−1 (θ0 ) = 0 and, for a large enough constant C ∗ ,
\[
\liminf_{n \to \infty}\ \inf_{P \in \mathcal P}\ \inf_{\theta\ \mathrm{s.t.}\ d(\theta, \theta_0) \ge C^* [(\log n)/n]^{\gamma/(d_X + 2\gamma)}\ \mathrm{all}\ \theta_0 \in \Theta_0(P)} E_P\, \varphi_{n,\infty,(\sigma\vee\sigma_n)^{-1}}(\theta) = 1.
\]
Proof. Since Assumptions 3.1-3.3 in Armstrong (2014b) follow by definition of the statis-
tic, the result follows from Theorem 4.2 in that paper, with Assumption 4.2(i) in Arm-
strong (2014b) following from Theorem 4.3 in that paper (since Assumption 4.6 and 4.2(ii)
in that paper hold by construction). For Cn the setwise confidence set constructed from
φn,∞,(σ∨σn )−1 (θ) in Armstrong (2014b), the quantity inside the limit in the display in the
theorem is equal to
\[
\inf_{P \in \mathcal P}\ \inf_{\theta\ \mathrm{s.t.}\ d(\theta, \theta_0) \ge C^* [(\log n)/n]^{\gamma/(d_X + 2\gamma)}\ \mathrm{all}\ \theta_0 \in \Theta_0(P)} P(\theta \notin C_n),
\]
where dH (A, B) = max{supa∈A inf b∈B d(a, b), supb∈B inf a∈A d(a, b)} is the Hausdorff distance
appearing in the rate results of that paper. This converges to 1 for large enough C ∗ by
Theorem 4.2 in Armstrong (2014b).
The classes P used in Theorem A.4 impose smoothness conditions on the conditional
mean along with a condition on the derivative of the conditional mean with respect to θ
(cases where the latter condition fails appear to favor KS statistics over CvM statistics as
well; see Section A.4 of Armstrong, 2014b). Note that the rate given above for the weighted
KS statistic φn,∞,(σ∨σn )−1 corresponds to the minimax L∞ rate for nonparametric testing
problems (Lepski and Tsybakov, 2000) and to the minimax rate for estimating a conditional
mean (Stone, 1982; see Menzel, 2010 for related results for estimating the identified set in a
setting similar to the one considered here). The results here show that the CvM statistics
considered here do not achieve this rate, and in fact have a minimax rate that is worse by
at least a polynomial amount.
I now turn to the interval regression model and consider primitive conditions. The next
two theorems show that certain classes of underlying distributions for the interval regression
model will always contain a distribution with a sequence of local alternatives that satisfy
the conditions of this paper. The conclusion of Theorem A.3 then follows immediately, since
the identified set is convex in the interval regression model. Theorem A.5 considers the case
where the constraints on the conditional mean embodied in P essentially only restrict the
conditional means of WiH and WiL to a Lipschitz smoothness class. Theorem A.6 considers
the smoother case where a bound is placed on the second derivative. For primitive conditions
for the conditions of Theorem A.4 in the interval regression model for the case where dX = 1
and γ = 1 or 2, see Armstrong (2014b), Section 6.2.
Theorem A.5. Let P be any class of underlying distributions for (Xi , WiH , WiL ) in the
interval regression model such that, for all P ∈ P, WiH and WiL are bounded and Xi has
a continuous density on its support XP . Suppose that, for some set X ⊆ RdX and some
interval [a, b], the following holds: for any function f : X → [a, b] such that
\[
|f(x) - f(\tilde x)| \le K \|x - \tilde x\| \quad \text{for all } x, \tilde x \in \mathcal X,
\]
there exists a P ∈ P such that EP (WiH |Xi ) = f (Xi ) and EP (WiL |Xi ) ≤ a almost surely,
and XP = X . Then there exists a P ∗ ∈ P and θ0∗ ∈ Θ0 (P ∗ ) that satisfies the conditions of
Theorem A.3, with γ = 1 and ψj,k (u) = K in Assumption 3.5.
Proof. Under these assumptions, there exists a distribution P ∈ P such that EP (WiH |Xi =
x) = b − K[(ε − ‖x − x0 ‖) ∨ 0] for some ε > 0 and x0 on the interior of the support of Xi ,
and EP (WiL |Xi = x) is bounded from above away from b − 2ε. For θ = (b − Kε, 0), this
satisfies the conditions of Theorem A.1.
Theorem A.6. Let P be any class of underlying distributions for (Xi , WiH , WiL ) in the
interval regression model such that, for all P ∈ P, WiH and WiL are bounded and Xi has
a continuous density on its support XP . Suppose that, for some set X ⊆ RdX and some
interval [a, b], for any function f : X → [a, b] such that
\[
\frac{d^2}{dt^2} f(x + tu) \le K
\]
for all u ∈ RdX with ‖u‖ = 1, there exists a P ∈ P such that EP (WiH |Xi ) = f (Xi ) and
EP (WiL |Xi ) ≤ a almost surely, and XP = X . Then there exists a P ∗ ∈ P and θ0∗ ∈ Θ0 (P ∗ )
that satisfies the conditions of Theorem A.3, with γ = 2 and ψj,k (u) = K/2 in Assumption
3.5.
Proof. The result follows by arguments similar to those for Theorem A.5, since a function
can be constructed for EP(WiH|Xi = x) that has a unique interior minimum with second
derivative matrix KI at its minimum and takes values between, say, (a + b)/2 and b.
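The proof only asserts that such a function exists; one concrete smooth choice, used here
purely as an illustration (whether it belongs to a particular class P, including the global
second-derivative bound, would need to be checked separately), is a Gaussian bump calibrated
so that the Hessian at the minimum is KI.

```python
import numpy as np

# Illustrative values (assumptions, not from the paper).
a, b, K = 0.0, 1.0, 2.0
x0 = np.array([0.5])
A = (b - a) / 2     # bump depth: keeps values between (a + b)/2 and b
c = K / (2 * A)     # calibrated so the second derivative at the minimum is K

def cond_mean(x):
    """A smooth conditional mean with a unique interior minimum at x0 and
    second derivative K there (the gamma = 2 case of Assumption 3.5)."""
    x = np.atleast_2d(x)
    r2 = np.sum((x - x0) ** 2, axis=1)
    return b - A * np.exp(-c * r2)

# Finite-difference check that the second derivative at x0 is close to K.
eps = 1e-4
d2 = (cond_mean(x0 + eps) + cond_mean(x0 - eps) - 2 * cond_mean(x0)) / eps ** 2
print(d2.item())  # approximately 2.0 = K
```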
References
Andrews, D. W. K. and M. M. A. Schafgans (1998): “Semiparametric Estimation of
the Intercept of a Sample Selection Model,” Review of Economic Studies, 65, 497–517.
Armstrong, T. B. and H. P. Chan (2016): “Multiscale adaptive inference on conditional
moment inequalities,” Journal of Econometrics, 194, 24–43.
Fan, J. (1993): “Local Linear Regression Smoothers and Their Minimax Efficiencies,” The
Annals of Statistics, 21, 196–216.
Heckman, J. (1990): “Varieties of Selection Bias,” The American Economic Review, 80,
313–318.
Kim, K. I. (2008): “Set estimation and inference with models characterized by conditional
moment inequalities,” unpublished working paper.
Lee, S., K. Song, and Y.-J. Whang (2013): “Testing functional inequalities,” Journal
of Econometrics, 172, 14–32.
——— (2015): “Testing for a General Class of Functional Inequalities,” arXiv:1311.1595.
[Figure: E(WiH|Xi = x) over x ∈ [0, 1], together with the regression lines (1, x)θ0 and (1, x)θn.]
[Figure panel: E(WH|X = x) together with the line θ1 + θ2x.]
[Figure panel: E(WH|X = x) together with the lines θa,1 + θa,2x and θb,1 + θb,2x.]
Figure 3: Case where Assumption 3.5 does not hold (θa ) and case where Assumption 3.5
holds with γ = 1 (θb )
θ1 − θ̄1   n = 100   n = 500   n = 1000
0.1        0.196     0.593     0.818
0.2        0.458     0.973     1
0.3        0.775     1         1
0.4        0.952     1         1
0.5        0.995     1         1

Table 1: Power for Unweighted Instrument CvM Test under Design 1
tn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0.207     0.503     0.729
           0.2        0.48      0.954     1
           0.3        0.759     1         1
           0.4        0.956     1         1
           0.5        0.997     1         1
n^(−1/3)   0.1        0.144     0.453     0.63
           0.2        0.378     0.939     0.998
           0.3        0.691     1         1
           0.4        0.886     1         1
           0.5        0.982     1         1
n^(−1/2)   0.1        0.156     0.358     0.502
           0.2        0.348     0.898     0.991
           0.3        0.649     0.999     1
           0.4        0.862     1         1
           0.5        0.974     1         1

Table 4: Power for Weighted Instrument KS Test under Design 1 (from Armstrong and Chan (2016))
hn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0.16      0.439     0.625
           0.2        0.343     0.92      0.997
           0.3        0.62      0.999     1
           0.4        0.883     1         1
           0.5        0.975     1         1
n^(−1/3)   0.1        0.095     0.266     0.481
           0.2        0.201     0.715     0.929
           0.3        0.382     0.976     1
           0.4        0.606     0.999     1
           0.5        0.809     1         1
n^(−1/2)   0.1        0         0.094     0.138
           0.2        0         0.255     0.404
           0.3        0         0.508     0.773
           0.4        0         0.812     0.982
           0.5        0         0.976     1

Table 6: Power for Kernel KS Test under Design 1
σn²             θ1 − θ̄1   n = 100   n = 500   n = 1000
(1/4)n^(−1/5)   0.1        0         0         0
                0.2        0         0         0
                0.3        0.003     0         0
                0.4        0.007     0.006     0.013
                0.5        0.04      0.118     0.294
(1/4)n^(−1/3)   0.1        0         0         0
                0.2        0         0         0
                0.3        0.001     0.001     0
                0.4        0.011     0.009     0.016
                0.5        0.032     0.139     0.371
(1/4)n^(−1/2)   0.1        0         0         0
                0.2        0.001     0         0
                0.3        0.003     0         0
                0.4        0.009     0.003     0.014
                0.5        0.034     0.114     0.288

Table 9: Power for Weighted Instrument CvM Test under Design 2
Table 10: Power for Weighted Instrument KS Test under Design 2 (from Armstrong and
Chan (2016))
hn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0         0         0
           0.2        0.001     0.002     0
           0.3        0.008     0.007     0.024
           0.4        0.012     0.108     0.369
           0.5        0.074     0.484     0.923
n^(−1/3)   0.1        0         0.001     0
           0.2        0.001     0         0
           0.3        0.003     0.009     0.011
           0.4        0.023     0.126     0.273
           0.5        0.062     0.519     0.848
n^(−1/2)   0.1        0         0         0
           0.2        0.001     0         0
           0.3        0.001     0         0
           0.4        0.005     0.007     0.023
           0.5        0.023     0.089     0.308

Table 11: Power for Kernel CvM Test under Design 2
θ1 − θ̄1   n = 100   n = 500   n = 1000
0.1        0.005     0         0.001
0.2        0.031     0.046     0.058
0.3        0.131     0.454     0.743
0.4        0.359     0.914     0.997
0.5        0.619     0.999     1

Table 13: Power for Unweighted Instrument CvM Test under Design 3
tn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0.034     0.064     0.12
           0.2        0.093     0.466     0.704
           0.3        0.272     0.869     0.99
           0.4        0.501     0.994     1
           0.5        0.767     1         1
n^(−1/3)   0.1        0.039     0.104     0.116
           0.2        0.112     0.429     0.64
           0.3        0.257     0.838     0.979
           0.4        0.463     0.994     1
           0.5        0.717     1         1
n^(−1/2)   0.1        0.03      0.083     0.087
           0.2        0.121     0.325     0.523
           0.3        0.24      0.762     0.967
           0.4        0.397     0.984     1
           0.5        0.669     1         1

Table 16: Power for Weighted Instrument KS Test under Design 3 (from Armstrong and Chan (2016))
hn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0.043     0.087     0.161
           0.2        0.099     0.487     0.722
           0.3        0.261     0.876     0.99
           0.4        0.48      0.995     1
           0.5        0.746     1         1
n^(−1/3)   0.1        0.037     0.086     0.122
           0.2        0.079     0.297     0.528
           0.3        0.164     0.646     0.912
           0.4        0.296     0.937     0.999
           0.5        0.507     0.996     1
n^(−1/2)   0.1        0         0.035     0.026
           0.2        0         0.087     0.118
           0.3        0         0.195     0.385
           0.4        0         0.427     0.703
           0.5        0         0.716     0.952

Table 18: Power for Kernel KS Test under Design 3
Supplement to “On the Choice of Test Statistic for
Conditional Moment Inequalities”
Timothy B. Armstrong
Yale University
July 9, 2021
This supplementary appendix contains proofs of the results in the main text as well as
auxiliary results. Section B contains auxiliary results used in the rest of this appendix. These
results are restatements or simple extensions of well known results on uniform convergence,
and do not constitute part of the main novel contribution of the paper. Section C of this
appendix derives critical values for CvM statistics with variance weights. Section D contains
proofs of the results in the body of the paper.
B Auxiliary Results
We state some results on uniform convergence that will be used in the proofs of the main
results. The results in this section are essentially restatements of results used in Armstrong
(2014b), which are in turn minor extensions of results in Pollard (1984). Throughout this
section, we consider iid observations Z1 , . . . , Zn and a sequence of classes of functions Fn on
the sample space. Let σ(f )2 = Ef (Zi )2 − (Ef (Zi ))2 and let σ̂(f )2 = En f (Zi )2 − (En f (Zi ))2 .
Lemma B.1. Suppose that, for some constants A and W, the classes of functions Fn satisfy
$$\sup_Q N(\varepsilon, \mathcal F_n, L_1(Q)) \le A\varepsilon^{-W},$$
where N is the covering number defined in Pollard (1984) and the supremum over Q is over
all probability measures. Let σn be a sequence of constants with σn√(n/ log n) → ∞. Then,
for some constant C,
$$\frac{\sqrt{n}}{\sqrt{\log n}}\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n} \le C$$
with probability approaching one and
$$\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)^2\vee\sigma_n^2} \overset{p}{\to} 0.$$
Proof. The first display follows by applying Lemma A.1 in Armstrong (2014b) to the se-
quence of classes of functions {f − EP f (Zi )|f ∈ Fn }, which satisfies the conditions of that
lemma by Lemma A.5 in Armstrong (2014b). The second display follows from the first
display since
$$\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)^2\vee\sigma_n^2} \le \frac{1}{\sigma_n}\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n} = \frac{\sqrt{\log n}}{\sigma_n\sqrt{n}}\cdot\frac{\sqrt{n}}{\sqrt{\log n}}\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}$$
and √(log n)/(σn√n) → 0.
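As a numerical sanity check on the normalization in Lemma B.1 (my illustration, not part
of the paper), the following sketch computes the weighted supremum for the class of half-line
indicators ft(z) = 1{z ≤ t} with uniform data, for which σ(ft)² = t(1 − t); the supremum
divided by √(log n) should stay bounded as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_sup(n, sigma_n):
    """sup_t sqrt(n) |(E_n - E) f_t| / (sigma(f_t) v sigma_n) for the class
    of half-line indicators f_t(z) = 1{z <= t} with Z_i ~ Uniform(0, 1)."""
    z = np.sort(rng.uniform(size=n))
    # The empirical CDF jumps only at the order statistics, so the supremum
    # over t is attained at (left or right limits of) these points.
    left = np.arange(n) / n
    right = np.arange(1, n + 1) / n
    sigma = np.sqrt(z * (1 - z))          # sd of f_t(Z_i), Bernoulli(t) at t = z
    denom = np.maximum(sigma, sigma_n)
    dev = np.maximum(np.abs(left - z), np.abs(right - z))
    return np.sqrt(n) * np.max(dev / denom)

# sigma_n = n^(-1/3) satisfies sigma_n * sqrt(n / log n) -> infinity.
for n in (1_000, 10_000, 100_000):
    print(n, round(weighted_sup(n, n ** (-1 / 3)) / np.sqrt(np.log(n)), 3))
```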
Lemma B.2. Under the conditions of Lemma B.1,
$$\sup_{f\in\mathcal F_n}\left|\frac{\hat\sigma(f)\vee\sigma_n}{\sigma(f)\vee\sigma_n}-1\right| \overset{p}{\to} 0.$$
Proof. By continuity of t ↦ √t at 1, it suffices to show that supf∈Fn |(σ̂(f)² ∨ σn²)/(σ(f)² ∨ σn²) − 1| →p 0.
Note that
$$\hat\sigma(f)^2 - \sigma(f)^2 = (E_n-E)[f(Z_i)-Ef(Z_i)]^2 - [(E_n-E)f(Z_i)]^2. \tag{14}$$
Since σ[(f − Ef(Zi))²]² ≤ E[f(Zi) − Ef(Zi)]⁴ ≤ 4f̄²σ(f)², we have
$$\sup_{f\in\mathcal F_n}\frac{|(E_n-E)[f(Z_i)-Ef(Z_i)]^2|}{\sigma(f)^2\vee\sigma_n^2} \le \sup_{f\in\mathcal F_n}\frac{|(E_n-E)[f(Z_i)-Ef(Z_i)]^2|}{\sigma[(f-Ef(Z_i))^2]^2\vee\sigma_n^2}\cdot\left[(4\bar f^2)\vee 1\right]$$
which converges in probability to zero by Lemma B.1 (using Lemma A.5 in Armstrong,
2014b, to verify that the sequence of classes of functions {[f − Ef(Zi)]² | f ∈ Fn} satisfies the
conditions of the lemma). Since
$$\sup_{f\in\mathcal F_n}\frac{[(E_n-E)f(Z_i)]^2}{\sigma(f)^2\vee\sigma_n^2} \overset{p}{\to} 0$$
by Lemma B.1, the result now follows from this and the triangle inequality applied to
(14).
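Lemma B.2 can be checked numerically in the same toy setting (again an illustration under
assumed choices, not the paper's construction): for half-line indicators with uniform data,
σ̂(ft)² is the sample analogue of t(1 − t), and the weighted ratio should approach 1 uniformly
in t.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of Lemma B.2 for f_t(z) = 1{z <= t}, Z_i ~ Uniform(0, 1).
for n in (1_000, 5_000, 20_000):
    z = rng.uniform(size=n)
    sigma_n = n ** (-1 / 3)            # satisfies sigma_n * sqrt(n/log n) -> infinity
    t = np.linspace(0.0, 1.0, 501)
    ind = z[:, None] <= t[None, :]     # n-by-grid matrix of f_t(Z_i)
    phat = ind.mean(axis=0)
    sigma_hat = np.sqrt(phat * (1 - phat))   # sample sd of f_t(Z_i)
    sigma = np.sqrt(t * (1 - t))
    ratio = np.maximum(sigma_hat, sigma_n) / np.maximum(sigma, sigma_n)
    print(n, float(np.abs(ratio - 1).max()))  # should shrink toward 0
```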
Lemma B.3. Suppose that |f(Zi)| ≤ f̄ and that σn√n ≥ 1. Then
$$E\left|\frac{\sqrt{n}(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}\right|^p \le C_{p,\bar f}$$
for a constant Cp,f̄ that depends only on p and f̄.

Proof. By a Bernstein-type inequality (using |f(Zi)| ≤ f̄ and σn√n ≥ 1), for t ≥ 1,
$$P\left(\left|\frac{\sqrt{n}(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}\right|^p > t\right)$$
is bounded by exp(−t^{1/p}/(2 + (2/3)·2f̄)). Thus,
$$E\left|\frac{\sqrt{n}(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}\right|^p = \int_0^\infty P\left(\left|\frac{\sqrt{n}(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}\right|^p > t\right)dt \le 1 + \int_1^\infty \exp\left(-\frac{t^{1/p}}{2+\frac{2}{3}\cdot 2\bar f}\right)dt,$$
and the last integral is finite (substituting s = t^{1/p} gives p∫₁^∞ s^{p−1}e^{−s/(2+(4/3)f̄)} ds < ∞),
giving the result.
C Critical Values for CvM Statistics with Variance Weights

Letting σn go to zero generally decreases the rate of convergence to √(n/ log n) for the KS
statistic Tn,∞,ω. In contrast to the KS case, CvM statistics do not behave much differently
if the variance is allowed to go to zero, although some additional arguments are needed to
show this.
To deal with the behavior of the CvM statistic for small variances, I place the following
condition on the measure over which the sample means are integrated.

Assumption C.1. For each θ ∈ Θ0 and each j, µ({g ∈ G | σj(θ, g) ≤ δ}) → 0 as δ → 0.

This condition will hold for the choices of G and µ used in the body of the paper, and
also allows for more general choices of G and µ. I also make the following assumption on the
complexity of the class of functions G, which is also satisfied by the class used in the paper.
Assumption C.2. For some constants A and W, the covering number N(ε, G, L1(Q)) defined
in Pollard (1984) satisfies
$$\sup_Q N(\varepsilon, \mathcal G, L_1(Q)) \le A\varepsilon^{-W},$$
where the supremum over Q is over all probability measures.
Assumption C.3. For some nonrandom constant Ȳ, |mj(Wi, θ)| ≤ Ȳ for each j with
probability one.
Theorem C.1. Suppose that σn√(n/ log n) → ∞ and that Assumptions C.1, C.2 and C.3
hold. Then, for θ ∈ Θ0,
$$n^{1/2}T_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta) \le \left[\int\sum_{j=1}^{d_Y}\left|\frac{\sqrt{n}(E_n-E)m_j(W_i,\theta)g(X_i)}{\hat\sigma_j(\theta,g)\vee\sigma_n}\right|_-^p d\mu(g)\right]^{1/p} \overset{d}{\to} \left[\int\sum_{j=1}^{d_Y}\left|G_j(g,\theta)/\sigma_j(\theta,g)\right|_-^p d\mu(g)\right]^{1/p},$$
where G(·, θ) = (G1(·, θ), . . . , GdY(·, θ)) is a mean-zero Gaussian process with covariance
function
$$\rho(g,\tilde g) = E[m(W_i,\theta)g(X_i) - Em(W_i,\theta)g(X_i)][m(W_i,\theta)\tilde g(X_i) - Em(W_i,\theta)\tilde g(X_i)]'.$$
Proof. The result with the integral truncated to the region where σj(θ, g) > δ for all j
follows immediately from standard arguments using functional central limit theorems. This,
along with Lemma C.1 below gives, letting Zn(δ) be the integral truncated in this way and
Z(δ) be the limiting variable with this truncation,
$$P(Z_n(\delta)\le t-\varepsilon)-\varepsilon \le P(Z_n\le t) \le P(Z_n(\delta)\le t+\varepsilon)+\varepsilon$$
for large enough n for any ε > 0. The lim inf of the left hand side is greater than P(Z(δ) ≤
t − 2ε) − 2ε, and the lim sup of the right hand side is less than P(Z(δ) ≤ t + ε) + ε. We
can bound P(Z(δ) ≤ t − 2ε) − 2ε from below by P(Z ≤ t − 2ε) − 2ε, and we can bound
P(Z(δ) ≤ t + ε) + ε from above by P(Z ≤ t + 2ε) + 2ε by making δ small enough, by a version
of Lemma C.1 for the limiting process. Since ε was arbitrary, this gives the result.
The proof of the theorem above uses the following auxiliary lemma, which shows that
functions g with low enough variance have little effect on the integral asymptotically.

Lemma C.1. Fix j and suppose that Assumptions C.1, C.2 and C.3 hold, and that the null
hypothesis holds under θ. Then, for every ε > 0, there exists a δ > 0 such that
$$P\left(\left[\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\hat\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)\right]^{1/p} > \varepsilon\right) \le \varepsilon.$$
Proof. We have
$$E\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g) = \int_{\sigma_j(\theta,g)\le\delta}E\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)$$
$$\le \int_{\sigma_j(\theta,g)\le\delta}E\left|\sqrt{n}(E_n-E)m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|^p d\mu(g) \le \mu(\{g\,|\,\sigma_j(\theta,g)\le\delta\})\cdot C_{p,\bar Y}$$
for Cp,Ȳ given in Lemma B.3. Applying Markov’s inequality and using Assumption C.1, it
follows that, for any ε > 0, there exists a δ such that
$$P\left(\left[\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)\right]^{1/p} > \varepsilon/2\right) \le \varepsilon/2.$$
The result follows since
$$\left[\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\hat\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)\right]^{1/p} \le \left[\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)\right]^{1/p}\cdot\sup_g\frac{\sigma_j(\theta,g)\vee\sigma_n}{\hat\sigma_j(\theta,g)\vee\sigma_n}$$
and supg(σj(θ, g) ∨ σn)/(σ̂j(θ, g) ∨ σn) ≤ 2 with probability approaching one by Lemma
B.2.
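Before turning to the proofs, here is a minimal sketch of how the variance-weighted CvM
statistic appearing in Theorem C.1 can be computed; the choices dY = 1, a finite grid of
indicator instruments, and µ uniform over that grid are illustrative assumptions rather than
the paper's exact G and µ.

```python
import numpy as np

def cvm_stat(m_vals, x_vals, p=1, sigma_n=0.05):
    """Sketch of T_{n,p,(sigma-hat v sigma_n)^{-1},mu}(theta) for d_Y = 1: the
    negative part of each studentized sample moment E_n m(W_i, theta) g(X_i),
    integrated (here: averaged) over instruments g(x) = 1{|x - c| <= h}."""
    grid = [(c, h) for h in (0.1, 0.2, 0.4) for c in np.linspace(0.0, 1.0, 21)]
    total = 0.0
    for c, h in grid:
        g = (np.abs(x_vals - c) <= h).astype(float)
        mg = m_vals * g
        stud = mg.mean() / max(mg.std(), sigma_n)  # E_n m g / (sigma-hat v sigma_n)
        total += max(-stud, 0.0) ** p              # negative part |.|_-^p
    return (total / len(grid)) ** (1.0 / p)

# Illustrative data satisfying the null E[m(W_i, theta) | X_i] >= 0.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=n)
m = 0.1 + rng.normal(scale=0.2, size=n)
print(n ** 0.5 * cvm_stat(m, x))  # n^{1/2} T_n, the scaling in Theorem C.1
```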
D Proofs
This section contains proofs of the results in the body of the paper. The proofs use a number
of auxiliary lemmas, which are stated and proved first. In the following, θn is always assumed
to be a sequence converging to θ0 .
Lemma D.1. Under the assumptions of Theorem 4.5, there exists a constant C such that
$$\sup_{x\in\mathbb{R}^{d_X}}\frac{\sqrt{n}}{\sqrt{h^{d_X}\log n}}\left|(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)\right| \le C$$
and
$$\sup_{x\in\mathbb{R}^{d_X}}\frac{\sqrt{n}}{\sqrt{h^{d_X}\log n}}\left|(E_n-E)k((X_i-x)/h)\right| \le C$$
with probability approaching one, and
$$\sup_{\{x\,|\,\omega_j(\theta_n,x)>0\ \text{some}\ j\}}\left|\frac{E_n k((X_i-x)/h)}{E k((X_i-x)/h)}-1\right| \overset{p}{\to} 0.$$

Proof. The first two displays follow from Lemma B.1 after noting that
$$\operatorname{var}(m(W_i,\theta_n)k((X_i-x)/h)) \le \bar{Y}^2\bar{k}^2\bar{f}_X B^{d_X}h^{d_X},$$
where k̄ and f̄X are bounds for k and fX, and B is such that k(u) = 0 whenever max1≤j≤dX |uj| >
B/2, and similarly for var(k((Xi − x)/h)), and that √(h^{dX})·√(n/ log n) → ∞ under these
assumptions.

For the last display, note that, for x such that ωj(θn, x) > 0 for some j, Ek((Xi − x)/h) ≥
$\underline{f}_X$h^{dX} ∫k(u) du for large enough n, where $\underline{f}_X$ is a lower bound for the density of Xi (which
can be taken to be ε in Assumption 3.7). Thus, with probability approaching one, the second
display gives
$$\sup_{\{x\,|\,\omega_j(\theta_n,x)>0\ \text{some}\ j\}}\left|\frac{E_n k((X_i-x)/h)}{E k((X_i-x)/h)}-1\right| \le \frac{C\sqrt{h^{d_X}\log n/n}}{\underline{f}_X h^{d_X}\int k(u)\,du} = \frac{C}{\underline{f}_X\int k(u)\,du}\sqrt{\frac{\log n}{nh^{d_X}}} \to 0,$$
which gives the last display.
Let
$$\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta) = \left[\int_{h>0}\int_x\sum_{j=1}^{d_Y}\left|\frac{E_n m(W_i,\theta)k((X_i-x)/h)}{\sigma_j(\theta,x,h)\vee\sigma_n}\right|_-^p f_\mu(x,h)\,dx\,dh\right]^{1/p}$$
and let
$$\tilde{T}_{n,p,\mathrm{kern}}(\theta) = \left[\int_x\sum_{j=1}^{d_Y}\left|\frac{E_n m(W_i,\theta)k((X_i-x)/h)}{E k((X_i-x)/h)}\,\omega_j(\theta,x)\right|_-^p dx\right]^{1/p}.$$
The notation σj(θ, x̃, h) is used to denote σj(θ, g) where g(x) = k((x − x̃)/h).
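For comparison with the weighted CvM sketch above, the kernel statistic can be computed
along the following lines; the uniform kernel, the weight ωj ≡ 1, dX = dY = 1, the grid
approximation to the integral over x, and the use of the sample analogue En k in the
denominator are all illustrative assumptions.

```python
import numpy as np

def kern_stat(m_vals, x_vals, h, p=1):
    """Sketch of the kernel statistic T_{n,p,kern}(theta) for d_X = d_Y = 1
    with a uniform kernel k(u) = 1{|u| <= 1/2}: the negative part of the
    local average E_n m k / E_n k, integrated over x on a grid."""
    xs = np.linspace(x_vals.min(), x_vals.max(), 200)
    total = 0.0
    for c in xs:
        k = (np.abs((x_vals - c) / h) <= 0.5).astype(float)
        if k.mean() > 0:
            ratio = (m_vals * k).mean() / k.mean()  # local sample mean of m
            total += max(-ratio, 0.0) ** p          # negative part |.|_-^p
    dx = xs[1] - xs[0]
    return (total * dx) ** (1.0 / p)

# Illustrative data with conditional mean (x - 1/2)^2 >= 0, so the null holds.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(size=n)
m = (x - 0.5) ** 2 + rng.normal(scale=0.1, size=n)
print(kern_stat(m, x, h=n ** (-1 / 5)))
```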
Lemma D.2. Under the assumptions of Theorem 4.3 (for the first display) and Theorem 4.5
(for the second display),
$$\sqrt{n}\,T_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) = \sqrt{n}\,\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n)(1+o_P(1))$$
and
$$(nh^{d_X})^{1/2}T_{n,p,\mathrm{kern}}(\theta_n) = (nh^{d_X})^{1/2}\tilde{T}_{n,p,\mathrm{kern}}(\theta_n)(1+o_P(1)).$$

Proof. We have
$$\left|\sqrt{n}\,T_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) - \sqrt{n}\,\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n)\right| \le \sqrt{n}\,\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n)\cdot\sup_{x,j}\left|\frac{\sigma_j(\theta_n,x,h)\vee\sigma_n}{\hat\sigma_j(\theta_n,x,h)\vee\sigma_n}-1\right|,$$
and the supremum on the right converges in probability to zero by Lemma B.2. Similarly, for
the second display, the same argument applies with En k((Xi − x)/h) and Ek((Xi − x)/h)
playing the roles of σ̂j(θn, x, h) ∨ σn and σj(θn, x, h) ∨ σn, using the last display of Lemma
D.1.
Let
$$\tilde{\tilde{T}}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta) = \left[\int_{h>0}\int_x\sum_{j=1}^{d_Y}\left|\frac{E m(W_i,\theta)k((X_i-x)/h)}{\sigma_j(\theta,x,h)\vee\sigma_n}\right|_-^p f_\mu(x,h)\,dx\,dh\right]^{1/p}$$
and let
$$\tilde{\tilde{T}}_{n,p,\mathrm{kern}}(\theta) = \left[\int_x\sum_{j=1}^{d_Y}\left|\frac{E m(W_i,\theta)k((X_i-x)/h)}{E k((X_i-x)/h)}\,\omega_j(\theta,x)\right|_-^p dx\right]^{1/p}.$$
Also define
$$\tilde{\tilde{T}}_{n,p,1,\mu}(\theta) = \left[\int_{h>0}\int_x\sum_{j=1}^{d_Y}\left|E m(W_i,\theta)k((X_i-x)/h)\right|_-^p f_\mu(x,h)\,dx\,dh\right]^{1/p}.$$
Lemma D.3. Under Assumptions 3.3, 3.4, 3.5 and 3.6,
$$\sqrt{n}\,\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) = \sqrt{n}\,\tilde{\tilde{T}}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) + o_P(1)$$
and
$$\sqrt{n}\,T_{n,p,1,\mu}(\theta_n) = \sqrt{n}\,\tilde{\tilde{T}}_{n,p,1,\mu}(\theta_n) + o_P(1).$$
Proof. Let σ̃n → 0 be such that σ̃n√(n/ log n) → ∞ and σ̃n/σn → 0 (i.e. σ̃n is chosen to be
much smaller than σn, but such that the assumptions still hold for σ̃n). Note that
$$\sqrt{n}\left|\tilde{\tilde{T}}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) - \tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n)\right| \le \left[\int\int_{(x,h)\in\hat G}\sum_{j=1}^{d_Y}\left|\frac{\sqrt{n}(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p f_\mu(x,h)\,dx\,dh\right]^{1/p}$$
where Ĝ = {(x, h) | Em(Wi, θn)k((Xi − x)/h) < 0 or En m(Wi, θn)k((Xi − x)/h) < 0}.

For any ε > 0, there exists an η > 0 such that, for h > ε and large enough n,
$$E m_j(W_i,\theta_n)k((X_i-x)/h) \ge \eta E k((X_i-x)/h) \ge \eta\cdot\operatorname{var}[m_j(W_i,\theta_n)k((X_i-x)/h)]\cdot\frac{1}{\bar k\bar Y^2},$$
where the last inequality follows since
$$\operatorname{var}[m_j(W_i,\theta_n)k((X_i-x)/h)] \le \bar Y^2 E[k((X_i-x)/h)^2] \le \bar Y^2\bar k\,E k((X_i-x)/h).$$
It follows that En mj(Wi, θn)k((Xi − x)/h) is positive for all (x, h) with h > ε and
σj(θn, x, h) ≥ σ̃n with probability approaching one by Lemma B.1.

From this and the fact that Em(Wi, θn)k((Xi − x)/h) ≥ 0 for all h > ε for large enough
n, it follows that Ĝ ⊆ {(x, h) | h ≤ ε or σj(θn, x, h) < σ̃n} with probability approaching one.
Note that
$$E\int\int_{\{(x,h)|h\le\varepsilon\}}\sum_{j=1}^{d_Y}\left|\frac{\sqrt{n}(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p f_\mu(x,h)\,dx\,dh = \int\int_{\{(x,h)|h\le\varepsilon\}}\sum_{j=1}^{d_Y}E\left|\frac{\sqrt{n}(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p f_\mu(x,h)\,dx\,dh$$
by Fubini’s theorem, and this can be made arbitrarily small by making ε small by Lemma
B.3 and Assumption 3.4. Similarly,
$$E\int\int_{\{(x,h)|\sigma_j(\theta_n,x,h)<\tilde\sigma_n\ \text{some}\ j\}}\sum_{j=1}^{d_Y}\left|\frac{\sqrt{n}(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p f_\mu(x,h)\,dx\,dh$$
$$\le \mu(\mathbb{R}^{d_X}\times[0,\infty))\cdot\sup_{\{(x,h,j)|\sigma_j(\theta_n,x,h)<\tilde\sigma_n\}}E\left|\frac{\sqrt{n}(E_n-E)m_j(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p$$
$$= \mu(\mathbb{R}^{d_X}\times[0,\infty))\cdot\sup_{\{(x,h,j)|\sigma_j(\theta_n,x,h)<\tilde\sigma_n\}}E\left|\frac{\sqrt{n}(E_n-E)m_j(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\tilde\sigma_n}\right|^p\left(\frac{\tilde\sigma_n}{\sigma_n}\right)^p,$$
which converges to zero by Lemma B.3 (since σ̃n/σn → 0). Using this and Markov’s
inequality, it follows that √n|T̃̃n,p,(σ̂∨σn)−1,µ(θn) − T̃n,p,(σ̂∨σn)−1,µ(θn)| can be made arbitrarily
small with probability approaching one by making ε small. This gives the first display of
the lemma.

The second display follows by the same argument with σn set to the supremum of
σj(θ, x, h) over (x, h) on the support of µ, θ in a neighborhood of θ0, and all j.
Lemma D.4. Under Assumptions 3.3, 3.4, 3.5, 3.6 and 3.7,
$$(nh^{d_X})^{1/2}\tilde{T}_{n,p,\mathrm{kern}}(\theta_n) = (nh^{d_X})^{1/2}\tilde{\tilde{T}}_{n,p,\mathrm{kern}}(\theta_n) + o_P(1).$$

Proof. For any ε > 0, there is an η > 0 such that Emj(Wi, θn)k((Xi − x)/h) > ηEk((Xi −
x)/h) for all x ∈ X̄(ε), where X̄(ε) is the set of x with ‖x − xk‖ ≥ ε for all k = 1, . . . , ℓ and
ωj(θn, x) > 0 for some j. Thus, arguing as in Lemma D.3 and using Lemma D.1, it follows
that, with probability approaching one, En mj(Wi, θn)k((Xi − x)/h) ≥ 0 for all x ∈ X̄(ε).
Using Markov’s inequality and Fubini’s theorem along with the fact that ∫_{x∉X̄(ε)} ωj(θn, x) dx
can be made arbitrarily small by making ε small, the result follows so long as
$$E\left|\frac{\sqrt{nh^{d_X}}(E_n-E)m_j(W_i,\theta_n)k((X_i-x)/h)}{E k((X_i-x)/h)}\right|^p$$
can be bounded uniformly over x such that ωj(θn, x) > 0. But this follows from Lemma B.3,
since, by Assumptions 3.3 and 3.7, for some δ > 0, Ek((Xi − x)/h) ≥ δh^{dX} for all x with
ωj(θn, x) > 0.
For the following lemma, recall that wj(xk) = (s²j(xk, θ0) fX(xk) ∫k(u)² du)^{−1/2}, where
s²j(x, θ) = var(mj(Wi, θ)|Xi = x).

Lemma D.5. Under Assumptions 3.3, 3.4, 3.5 and 3.6, for k = 1, . . . , ℓ and any sequence
εn → 0,
$$\sup_{\|(x,h)-(x_k,0)\|\le\varepsilon_n}\left|h^{-d_X/2}\sigma_j(\theta_n,x,h) - w_j^{-1}(x_k)\right| \to 0.$$
Proof. By differentiability of the square root function at wj^{−2}(xk), it suffices to show that
sup_{‖(x,h)−(xk,0)‖≤εn} |h^{−dX}σj²(θn, x, h) − wj^{−2}(xk)| → 0. Note that
$$h^{-d_X}\sigma_j^2(\theta_n,x,h) = h^{-d_X}E[m(W_i,\theta_n)^2k((X_i-x)/h)^2] - h^{-d_X}\{E[m(W_i,\theta_n)k((X_i-x)/h)]\}^2$$
$$= h^{-d_X}\int s_j^2(\tilde x,\theta_n)k((\tilde x-x)/h)^2 f_X(\tilde x)\,d\tilde x + h^{-d_X}\int E[m(W_i,\theta_n)|X_i=\tilde x]^2k((\tilde x-x)/h)^2 f_X(\tilde x)\,d\tilde x$$
$$\quad - h^{-d_X}\left(\int E[m(W_i,\theta_n)|X_i=\tilde x]k((\tilde x-x)/h)f_X(\tilde x)\,d\tilde x\right)^2.$$
By Assumption 3.3 and part (iii) of Assumption 3.5, the second term is bounded by a constant
times sup_{‖(x,h)−(xk,0)‖≤εn} E[m(Wi, θn)|Xi = x]², which converges to zero by continuity of
E[m(Wi, θ)|Xi = x] at (θ0, xk). By Assumptions 3.3 and 3.5, the third term is bounded by
a constant times h^{−dX}·h^{2dX} ≤ εn^{dX} uniformly over (x, h) with ‖(x, h) − (xk, 0)‖ ≤ εn. Using
a change of variables, the first term can be written as ∫ s²j(x + uh, θn)k(u)² fX(x + uh) du,
which converges to wj^{−2}(xk) uniformly over ‖(x, h) − (xk, 0)‖ ≤ εn by continuity of sj and
fX, and by Assumption 3.3.
Lemma D.6. Suppose that Assumptions 3.3, 3.4, 3.5, 3.6 and 3.7 hold, and that ∫k(u) du =
1. Then
$$\sup_{\|x-x_k\|\le\varepsilon}\left|h^{-d_X}E k((X_i-x)/h) - f_X(x_k)\right| \to 0$$
as h → 0 and ε → 0 for k = 1, . . . , ℓ.
Proof. We have
$$h^{-d_X}E k((X_i-x)/h) = h^{-d_X}\int k((\tilde x-x)/h)f_X(\tilde x)\,d\tilde x = \int k(u)f_X(x+uh)\,du,$$
and ∫k(u) du = 1 and fX(x + uh) converges to fX(xk) uniformly over ‖x − xk‖ ≤ ε and u
in the support of k as ε → 0 and h → 0.
For notational convenience in the following lemmas, define, for (j, k) with j ∈ J(k),
$$\tilde\psi_{j,k}(t) = \bar m_j(\theta_0, x_k + t)/\|t\|^\gamma,$$
so that, by Assumption 3.5,
$$\sup_{\|x-x_k\|<\delta}\left|\tilde\psi_{j,k}(x-x_k) - \psi_{j,k}\left(\frac{x-x_k}{\|x-x_k\|}\right)\right| \to 0$$
as δ → 0.
Lemma D.7. Under Assumptions 3.3, 3.4, 3.5 and 3.6, for any a ∈ Rdθ,
$$r^{-[d_X+p(d_X+\gamma)+1]/\gamma}\int\int\sum_{j=1}^{d_Y}\left|E m_j(W_i,\theta_0+ra)k((X_i-\tilde{x})/h)\right|_-^p f_\mu(\tilde{x},h)\,d\tilde{x}\,dh \overset{r\to 0}{\to} \sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in\tilde{J}(k)}\lambda_{\mathrm{bdd}}(a,j,k,p).$$
Proof. For simplicity, assume that γ(j, k) = γ for all j, k. The general result follows from
applying the same arguments to show that areas of (x, h) near (j, k) with γ(j, k) < γ do not
matter asymptotically.

For C large enough, the integrand will be zero unless max{‖x̃ − xk‖, h} < Cr^{1/γ} for some
k with j ∈ J(k). Thus, it suffices to prove the lemma for, fixing (j, k) with j ∈ J(k),
$$\int\int\left|E m_j(W_i,\theta_0+ra)k((X_i-\tilde x)/h)\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh = \int\int\left|\int\bar m_j(\theta_0+ra,x)k((x-\tilde x)/h)f_X(x)\,dx\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh$$
$$= \int\int\left|\int[\|x-x_k\|^\gamma\tilde\psi_{j,k}(x-x_k) + \bar m_{\theta,j}(\theta^*(r),x)ra]k((x-\tilde x)/h)f_X(x)\,dx\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh$$
where the integrals are taken over ‖x̃ − xk‖ < Cr^{1/γ}, h < Cr^{1/γ} and θ∗(r) is between θ0 and
θ0 + ra (we suppress the dependence of θ∗(r) on x in the notation). Using the change of
variables u = (x − xk)/r^{1/γ}, v = (x̃ − xk)/r^{1/γ}, h̃ = h/r^{1/γ}, this is equal to
$$\int\int\left|\int[\|r^{1/\gamma}u\|^\gamma\tilde\psi_{j,k}(r^{1/\gamma}u) + \bar m_{\theta,j}(\theta^*(r),x_k+r^{1/\gamma}u)ra]k((u-v)/\tilde h)f_X(x_k+r^{1/\gamma}u)r^{d_X/\gamma}\,du\right|_-^p f_\mu(x_k+r^{1/\gamma}v,r^{1/\gamma}\tilde h)\,r^{d_X/\gamma}\,dv\,r^{1/\gamma}\,d\tilde h$$
$$= r^{[d_X+1+p(\gamma+d_X)]/\gamma}\int\int\left|\int[\|u\|^\gamma\tilde\psi_{j,k}(r^{1/\gamma}u) + \bar m_{\theta,j}(\theta^*(r),x_k+r^{1/\gamma}u)a]k((u-v)/\tilde h)f_X(x_k+r^{1/\gamma}u)\,du\right|_-^p f_\mu(x_k+r^{1/\gamma}v,r^{1/\gamma}\tilde h)\,dv\,d\tilde h$$
where the integrals are taken over ‖v‖ < C, h̃ < C. The result now follows from the
dominated convergence theorem (here, and in subsequent results involving sequences of the
form ∫|∫gn(z, w) dµ(z)|^p_− dν(w), the dominated convergence theorem is applied first to the
inner integral and then to the outer integral).
Lemma D.8. Under the conditions of Theorem 4.3, for any a ∈ Rdθ,
$$r^{-[d_X+p(d_X/2+\gamma)+1]/\gamma}\int\int\sum_{j=1}^{d_Y}\left|E m_j(W_i,\theta_0+ra)k((X_i-\tilde{x})/h)/(\sigma_j(\theta_0+ra,\tilde{x},h)\vee\sigma_n)\right|_-^p f_\mu(\tilde{x},h)\,d\tilde{x}\,dh \le \sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in\tilde{J}(k)}\lambda_{\mathrm{var}}(a,j,k,p) + o(1)$$
for any r = rn → 0. If, in addition, σn rn^{−dX/(2γ)} → 0, the above display will hold with the
inequality replaced by equality.
Proof. As in the previous lemma, the following argument assumes, for simplicity, that
γ(j, k) = γ for all (j, k) with j ∈ J(k). Let s̃j(r, x̃, h) = σj(θ0 + ra, x̃, h)/h^{dX/2}. As be-
fore, for large enough C, the integrand will be zero unless max{‖x̃ − xk‖, h} < Cr^{1/γ} for
some k with j ∈ J(k). Thus, it suffices to prove the result for, fixing (j, k) with j ∈ J(k),
$$\int\int\left|E m_j(W_i,\theta_0+ra)k((X_i-\tilde x)/h)\left(h^{-d_X/2}\tilde s_j^{-1}(r,\tilde x,h)\wedge\sigma_n^{-1}\right)\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh$$
$$= \int\int\left|\int[\|x-x_k\|^\gamma\tilde\psi_{j,k}(x-x_k) + \bar m_{\theta,j}(\theta^*(r),x)ra]k((x-\tilde x)/h)\left(h^{-d_X/2}\tilde s_j^{-1}(r,\tilde x,h)\wedge\sigma_n^{-1}\right)f_X(x)\,dx\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh$$
where the integral is taken over ‖x̃ − xk‖ < Cr^{1/γ}, h < Cr^{1/γ} and θ∗(r) is between θ0 and
θ0 + ra. Using the change of variables u = (x − xk)/r^{1/γ}, v = (x̃ − xk)/r^{1/γ}, h̃ = h/r^{1/γ}, this
is equal to
$$\int\int\left|\int r[\|u\|^\gamma\tilde\psi_{j,k}(r^{1/\gamma}u) + \bar m_{\theta,j}(\theta^*(r),x_k+ur^{1/\gamma})a]k((u-v)/\tilde h)\left(\left((r^{1/\gamma}\tilde h)^{-d_X/2}\tilde s_j^{-1}(r,x_k+vr^{1/\gamma},r^{1/\gamma}\tilde h)\right)\wedge\sigma_n^{-1}\right)f_X(x_k+ur^{1/\gamma})r^{d_X/\gamma}\,du\right|_-^p f_\mu(x_k+vr^{1/\gamma},r^{1/\gamma}\tilde h)\,r^{d_X/\gamma}\,dv\,r^{1/\gamma}\,d\tilde h$$
$$= r^{[p(\gamma+d_X/2)+d_X+1]/\gamma}\int\int\left|\int[\|u\|^\gamma\tilde\psi_{j,k}(r^{1/\gamma}u) + \bar m_{\theta,j}(\theta^*(r),x_k+ur^{1/\gamma})a]k((u-v)/\tilde h)\left(\left(\tilde h^{-d_X/2}\tilde s_j^{-1}(r,x_k+vr^{1/\gamma},r^{1/\gamma}\tilde h)\right)\wedge\left(r^{d_X/(2\gamma)}\sigma_n^{-1}\right)\right)f_X(x_k+ur^{1/\gamma})\,du\right|_-^p f_\mu(x_k+vr^{1/\gamma},r^{1/\gamma}\tilde h)\,dv\,d\tilde h,$$
where the integral is taken over ‖v‖ < C, h̃ < C. By Lemma D.5 and the dominated
convergence theorem, this converges to λvar(a, j, k, p) if σn rn^{−dX/(2γ)} → 0. If σn rn^{−dX/(2γ)} does
not converge to zero, the above display is bounded from above by the same expression with
σn^{−1} replaced by ∞.
Lemma D.9. Under the conditions of Theorem 4.5, for any a ∈ Rdθ,
$$r^{-(\gamma p+d_X)/\gamma}\int\sum_{j=1}^{d_Y}\left|[E m_j(W_i,\theta_0+ra)k((X_i-x)/h)/E k((X_i-x)/h)]\,\omega_j(\theta_0+ra,x)\right|_-^p dx \to \sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in J(k)}\lambda_{\mathrm{kern}}(a,c_{h,r},j,k,p)$$
as r → 0 with h/r^{1/γ} → ch,r for ch,r > 0. If the limit is zero for (a, ch,r) in a neighborhood
of the given values, the sequence will be exactly equal to zero for small enough r.

If h/r^{1/γ} → 0, then, as r → 0,
$$r^{-(\gamma p+d_X)/\gamma}\int\sum_{j=1}^{d_Y}\left|[E m_j(W_i,\theta_0+ra)k((X_i-x)/h)/E k((X_i-x)/h)]\,\omega_j(\theta_0+ra,x)\right|_-^p dx \to \sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in J(k)}\tilde\lambda_{\mathrm{kern}}(a,j,k,p).$$
Proof. As before, this proof treats the case where J(k) = J̃(k) for ease of exposition. As
with the proofs of Lemmas D.7 and D.8, it suffices to prove the result for, fixing (j, k) with
j ∈ J(k),
$$\int\left|[E m_j(W_i,\theta_0+ra)k((X_i-\tilde x)/h)/E k((X_i-\tilde x)/h)]\,\omega_j(\theta_0+ra,\tilde x)\right|_-^p d\tilde x$$
$$= \int\left|\int[\|x-x_k\|^\gamma\tilde\psi_{j,k}(x-x_k) + \bar m_{\theta,j}(\theta^*(r),x)ra]k((x-\tilde x)/h)f_X(x)\,dx\;h^{-d_X}b(\tilde x)\,\omega_j(\theta_0+ra,\tilde x)\right|_-^p d\tilde x$$
where the integral is over ‖x̃ − xk‖ < Cr^{1/γ} and b(x̃) ≡ h^{dX}/Ek((Xi − x̃)/h) converges
to (fX(xk))^{−1} uniformly over x̃ in any shrinking neighborhood of xk by Lemma D.6. Let
h̃ = h/r^{1/γ}. By the change of variables u = (x − xk)/r^{1/γ}, v = (x̃ − xk)/r^{1/γ}, the above
display is equal to
$$\int\left|\int[\|ur^{1/\gamma}\|^\gamma\tilde\psi_{j,k}(ur^{1/\gamma}) + \bar m_{\theta,j}(\theta^*(r),x_k+ur^{1/\gamma})ra]k((u-v)/\tilde h)f_X(x_k+ur^{1/\gamma})r^{d_X/\gamma}\,du\;(r^{1/\gamma}\tilde h)^{-d_X}b(x_k+vr^{1/\gamma})\,\omega_j(\theta_0+ra,x_k+r^{1/\gamma}v)\right|_-^p r^{d_X/\gamma}\,dv$$
$$= r^{p+d_X/\gamma}\int\left|\int[\|u\|^\gamma\tilde\psi_{j,k}(ur^{1/\gamma}) + \bar m_{\theta,j}(\theta^*(r),x_k+ur^{1/\gamma})a]k((u-v)/\tilde h)f_X(x_k+ur^{1/\gamma})\,du\;\tilde h^{-d_X}b(x_k+vr^{1/\gamma})\,\omega_j(\theta_0+ra,x_k+r^{1/\gamma}v)\right|_-^p dv \qquad(15)$$
where the integral is over ‖v‖ < C. The first display of the lemma (the case where h/r^{1/γ} → ch,r
for ch,r > 0) follows from this and the dominated convergence theorem.
To show that the sequence is exactly zero for small enough r when the limit is zero in
a neighborhood of (a, ch,r), note that, if the limit is zero in a neighborhood of (a, ch,r), we
will have, for all (ã, c̃h,r) in this neighborhood and any v,
$$\int\left[\|u\|^\gamma\psi_{j,k}\left(\frac{u}{\|u\|}\right) + \bar m_{\theta,j}(\theta_0,x_k)\tilde a\right]k((u-v)/\tilde c_{h,r})\,du = \int\left[\tilde c_{h,r}^\gamma\|\tilde u\|^\gamma\psi_{j,k}\left(\frac{\tilde u}{\|\tilde u\|}\right) + \bar m_{\theta,j}(\theta_0,x_k)\tilde a\right]k(\tilde u-\tilde v)\,\tilde c_{h,r}^{d_X}\,d\tilde u \ge 0.$$
Evaluating this at (ã, c̃h,r) such that c̃h,r^γ ≤ ch,r^γ(1 − ε) and (for the case where m̄θ,j(θ0, xk)a
is negative) m̄θ,j(θ0, xk)ã ≤ (m̄θ,j(θ0, xk)a)(1 + ε) shows that
$$\int\left[c_{h,r}^\gamma\|\tilde u\|^\gamma\psi_{j,k}\left(\frac{\tilde u}{\|\tilde u\|}\right)\cdot(1-\varepsilon) + (\bar m_{\theta,j}(\theta_0,x_k)a)(1+\varepsilon)\right]k(\tilde u-\tilde v)\,d\tilde u \ge 0$$
for all ṽ for some ε > 0. The above display is, for small enough r, a lower bound for the
inner integral in (15) times a constant that does not depend on r, so that, for small enough
r, the inner integral in (15) will be nonnegative for all v and (15) will eventually be equal to
zero.
For the case where h̃ = h/r^{1/γ} → 0, multiplying (15) by r^{−(p+dX/γ)} gives, after the change
of variables ũ = (u − v)/h̃,
$$\int\left|\int[\|\tilde h\tilde u+v\|^\gamma\tilde\psi_{j,k}((\tilde h\tilde u+v)r^{1/\gamma}) + \bar m_{\theta,j}(\theta^*(r),x_k+(\tilde h\tilde u+v)r^{1/\gamma})a]k(\tilde u)f_X(x_k+(\tilde u\tilde h+v)r^{1/\gamma})\,d\tilde u\;b(x_k+vr^{1/\gamma})\,\omega_j(\theta_0+ra,x_k+r^{1/\gamma}v)\right|_-^p dv,$$
which converges to
$$\int\left|[\|v\|^\gamma\psi_{j,k}(v/\|v\|) + \bar m_{\theta,j}(\theta_0,x_k)a]\,\omega_j(\theta_0,x_k)\right|_-^p dv$$
by the dominated convergence theorem, giving the second display of the lemma.
Proof of Theorem 4.1. The result follows immediately from Lemmas D.3 and D.7, since
$$\left(n^{-\gamma/\{2[d_X+\gamma+(d_X+1)/p]\}}\right)^{-[d_X+p(d_X+\gamma)+1]/(\gamma p)} = n^{1/2}.$$
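Spelling out the exponent arithmetic behind the last display:
$$\frac{\gamma}{2\left[d_X+\gamma+(d_X+1)/p\right]}\cdot\frac{d_X+p(d_X+\gamma)+1}{\gamma p} = \frac{p(d_X+\gamma)+d_X+1}{2\left[p\,d_X+p\gamma+d_X+1\right]} = \frac{1}{2},$$
so raising n^{−γ/{2[dX+γ+(dX+1)/p]}} to the power −[dX+p(dX+γ)+1]/(γp) indeed gives n^{1/2}.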
Proof of Theorem 4.3. The result follows immediately from Lemmas D.2, D.3 and D.8, since
$$\left(n^{-\gamma/\{2[d_X/2+\gamma+(d_X+1)/p]\}}\right)^{-[d_X+p(d_X/2+\gamma)+1]/(\gamma p)} = n^{1/2}.$$
Proof of Theorem 4.5. The result follows from Lemmas D.2, D.4 and D.9. Note that
(nh^{dX})^{p/2}/(n^{1−dX s})^{p/2} → ch^{dX p/2}, and that, for the case where s ≥ 1/[2(γ + dX/p + dX/2)],
$$(n^{-q})^{-(\gamma p+d_X)/(\gamma p)} = \left(n^{-(1-sd_X)/[2(1+d_X/(p\gamma))]}\right)^{-(\gamma p+d_X)/(\gamma p)} = n^{(1-sd_X)/2}.$$
For the case where s < 1/[2(γ + dX/p + dX/2)], it follows from Lemmas D.2, D.4 and D.9
that
$$n^{q(\gamma p+d_X)/(\gamma p)}T_n(\theta_0+a_n) \overset{p}{\to} \left[\sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in J(k)}\lambda_{\mathrm{kern}}(a,c_h,j,k,p)\right]^{1/p}$$
so that (nh^{dX})^{1/2}Tn(θ0 + an) will converge to ∞ in this case if the limit in the above display
is strictly positive. If the limit in the above display is zero in a neighborhood of (a, ch), it
follows from Lemmas D.2 and D.4 that (nh^{dX})^{1/2}Tn(θ0 + an) is, up to oP(1), equal to a term
that is zero for large enough n by Lemma D.9.