On the Choice of Test Statistic for Conditional Moment Inequalities
Timothy B. Armstrong∗
Yale University
July 9, 2021
1 Introduction
This paper compares methods for inference on a parameter θ defined by the conditional
moment inequalities
\[
E(m(W_i, \theta) \mid X_i) \ge 0 \text{ almost surely},
\]
where m : R^{dW+dθ} → R^{dY} is a known function of data Wi and a parameter θ ∈ Θ ⊆ R^{dθ},
and ≥ is defined elementwise. Here, Wi is an R^{dW} valued random variable and Xi is an R^{dX}
∗ email: [email protected]. Support from National Science Foundation Grant SES-1628939 is gratefully acknowledged.
valued random variable. We are given independent, identically distributed (iid) observations
{(Xi′, Wi′)′}ni=1. This defines the identified set
\[
\Theta_0 = \{\theta \in \Theta \mid E(m(W_i, \theta) \mid X_i) \ge 0 \text{ almost surely}\},
\]
where Θ ⊆ R^{dθ} is the parameter space. If Θ0 contains more than one element, the model is
said to be set identified.
Following Imbens and Manski (2004), we are interested in confidence regions Cn that
satisfy the coverage criterion
\[
\liminf_{n \to \infty} \inf_{\theta \in \Theta_0} P(\theta \in C_n) \ge 1 - \alpha. \tag{1}
\]
We consider confidence regions constructed by inverting a family of tests φn (θ) = φn (θ, {Xi , Wi }ni=1 ),
where φn (θ) is a test of H0,θ : θ ∈ Θ0 :
\[
C_n = \{\theta \in \Theta \mid \varphi_n(\theta) = 0\}.
\]
Subject to the coverage criterion (1), we would like the confidence region Cn not to contain
points that are far away from the identified set Θ0 . In particular, if we take a parameter θ0
on the boundary of Θ0 and consider a sequence θn = θ0 + an where an → 0, we would like to
have θn ∉ Cn with high probability for an converging to zero as quickly as possible (so long
as θn approaches Θ0 from the outside, rather than from the interior). Note that
\[
P(\theta_n \notin C_n) = P(\varphi_n(\theta_n) = 1).
\]
Thus, we can determine whether Cn contains points that are far away from Θ0 by examining
the behavior of P (φn (θn ) = 1), which is the power of the test φn (θn ) of H0,θn at the alternative
P.
This paper provides an asymptotic answer to this question by examining the asymptotic
behavior of P (φn (θn ) = 1) as n → ∞. We refer to the limit of P (φn (θn ) = 1) as the local
asymptotic power of the sequence of tests φn (θn ) (note that this terminology differs from
definitions often used in the literature, since the null hypothesis varies with n while the
alternative stays fixed). The local asymptotic power of this sequence of tests will depend on
the distribution P , the parameter θ0 on the boundary of Θ0 to which the sequence θn = θ0 +an
converges, and the sequence an .
This paper considers Cramér-von Mises (CvM) style test statistics, which integrate or
add some function of the negative part of an objective function. These can be compared
with existing results for Kolmogorov-Smirnov (KS) statistics, which take the minimum of an
objective function. The results show that the power P (φn (θn ) = 1) will be greater asymp-
totically for KS statistics when the distribution P satisfies generic smoothness conditions of
the form used in the nonparametric statistics literature. In particular, the results imply that
KS statistics are preferred according to a “minimax within a smoothness class” criterion of
the form used to formulate nonparametric relative efficiency results in papers such as Stone
(1982).
As an example of the types of problems covered by this setup, consider the interval re-
gression model of Manski and Tamer (2002). We observe (Xi , WiL , WiH ) where [WiL , WiH ]
is known to contain the latent variable Wi∗ , which follows the linear regression model
E(Wi∗ |Xi ) = (1, Xi′)θ. This falls into the setup of this paper with Wi = (Xi , WiL , WiH )
and m(Wi , θ) = (WiH − (1, Xi′)θ, (1, Xi′)θ − WiL )′. The identified set is then given by
\[
\Theta_0 = \{\theta \in \Theta \mid E(W_i^L \mid X_i = x) \le (1, x')\theta \le E(W_i^H \mid X_i = x) \text{ for all } x \text{ on the support of } X_i\}.
\]
Thus, a parameter θ0 in the identified set corresponds to a regression line (1, x′)θ0 that is
between the conditional means E(WiL |Xi = x) and E(WiH |Xi = x) for all x on the support
of Xi . If θ0 is on the boundary of the identified set, it will be equal to one of these regression
lines for some value of x. For θn = θ0 + an approaching the boundary of the identified
set from the outside, the regression line (1, x′)θn will be above E(WiH |Xi = x) or below
E(WiL |Xi = x) for some values of x, and we would like the test φn (θn ) to detect this so
that θn ∉ Cn with high probability. We use primitive conditions to apply the general results
in this paper to this setting, thereby giving asymptotic approximations to this probability.
These conditions correspond to smoothness conditions used in the nonparametric statistics
literature and conditions on the shape of these conditional means near points where one of
them is equal to (1, x0 )θ0 (see Section 3.4 and Appendix A.5).
The remainder of this paper is organized as follows. Section 1.1 defines the tests con-
sidered in this paper. Section 1.2 discusses related literature. Section 2 gives an intuitive
description of the power results in this paper and how they are derived. Section 3 states
formally the conditions used in this paper, and provides primitive conditions for the interval
regression model. Section 4 derives the power results. Section 5 reports the results of a
Monte Carlo study. Section 6 concludes. An appendix contains minimax power comparisons
as well as primitive conditions for the results in the main text in additional settings. A
supplementary appendix contains proofs and auxiliary results.
1.1 Definition of Test Statistics
The test statistics considered in this paper are as follows. Given a set G of nonnegative
instruments, the null hypothesis H0,θ : θ ∈ Θ0 implies that E(m(Wi , θ)g(Xi )) ≥ 0 for all
g ∈ G. Thus, under H0,θ : θ ∈ Θ0 , the sample analogue
\[
E_n(m(W_i, \theta) g(X_i)) \equiv \frac{1}{n} \sum_{i=1}^{n} m(W_i, \theta) g(X_i) \tag{2}
\]
should not be too negative for any g ∈ G. The results in this paper use classes of functions
given by kernels with varying bandwidths and location, given by G = {x 7→ k((x − x̃)/h)|x̃ ∈
RdX , h ∈ R+ } for some kernel function k. With this choice of G, H0,θ : θ ∈ Θ0 holds if and
only if E(m(Wi , θ)g(Xi )) ≥ 0 for all g ∈ G, so that (2) can be used to form a consistent test
(see Andrews and Shi, 2013, for a discussion of this and other choices of G).
Alternatively, one can test H0,θ : θ ∈ Θ0 by estimating E(m(Wi , θ)|Xi = x) directly using
the kernel estimate
\[
\hat{\bar m}_j(\theta, x) = \frac{\sum_{i=1}^{n} m_j(W_i, \theta)\, k((X_i - x)/h)}{\sum_{i=1}^{n} k((X_i - x)/h)} \tag{3}
\]
for some sequence h = hn → 0 and kernel function k. If H0,θ holds, (3) should not be too
negative for any x.
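As a concrete illustration, the following sketch computes the kernel estimate (3) for the interval regression moments from the introduction. This is a minimal numerical sketch, not code from the paper: the uniform kernel, the simulated dgp, and all tuning choices are illustrative assumptions.

```python
import numpy as np

def uniform_kernel(u):
    # k(u) = I(|u| <= 1/2): bounded, nonnegative, with finite support
    return (np.abs(u) <= 0.5).astype(float)

def kernel_moment_estimate(X, WL, WH, theta, x, h):
    """Estimate E(m(W_i, theta) | X_i = x) as in (3) for the interval
    regression moments m = (WH - (1, X)'theta, (1, X)'theta - WL)'."""
    fitted = theta[0] + theta[1] * X                 # (1, X_i')theta, scalar X_i
    m = np.column_stack([WH - fitted, fitted - WL])  # n x 2 moment functions
    w = uniform_kernel((X - x) / h)                  # kernel weights at x
    return m.T @ w / w.sum()                         # weighted average, each j

# Illustrative dgp: W* = 0.5 X + u with the interval [W* - 0.2, W* + 0.2] observed
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 1, n)
Wstar = 0.5 * X + rng.normal(scale=0.1, size=n)
WL, WH = Wstar - 0.2, Wstar + 0.2

theta = np.array([0.0, 0.5])  # the true line, inside the identified set
print(kernel_moment_estimate(X, WL, WH, theta, x=0.5, h=n ** (-1 / 5)))
# both components should be near 0.2 > 0, consistent with H_{0,theta}
```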
Thus, a test statistic of the null that θ ∈ Θ0 can be formed by taking any function that
is positive and large in magnitude when (2) is negative and large in magnitude for some
g ∈ G, or when (3) is negative and large in magnitude for some x. One possibility is to use a
CvM statistic that integrates the negative part of (2) over some measure µ on G. This CvM
statistic is given by
"Z dY
#1/p
X
Tn,p,ω,µ (θ) = |En mj (Wi , θ)g(Xi )ωj (θ, g)|p− dµ(g) (4)
j=1
for some p ≥ 1 and weighting ω, where |t|− = | min{t, 0}|. I refer to this as an instrument
based CvM (IV-CvM) statistic. The CvM statistic based on the kernel estimate integrates
the negative part of (3) against some weighting ω, and is given by
"Z dY
#1/p
p
X
Tn,p,kern (θ) = ˆ j (θ, x)ωj (θ, x)
m̄ dx (5)
−
j=1
for some p ≥ 1. I refer to this as a kernel based CvM (kern-CvM) statistic.
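The IV-CvM statistic (4) can be sketched numerically in the same way. In the snippet below, the measure µ over instruments g(x) = k((x − x̃)/h) is approximated by uniform Monte Carlo draws of (x̃, h), the weighting is constant (ωj ≡ 1), and p = 1; all of these are placeholder choices for illustration.

```python
import numpy as np

def iv_cvm_statistic(X, m, p=1.0, n_mu=500, seed=0):
    """IV-CvM statistic (4) with constant weighting: integrates
    |E_n m_j(W_i, theta) g(X_i)|_-^p over instruments g(x) = k((x - xt)/h),
    approximating mu by uniform Monte Carlo draws of (xt, h)."""
    rng = np.random.default_rng(seed)
    locs = rng.uniform(0, 1, n_mu)   # instrument locations xt
    bws = rng.uniform(0, 1, n_mu)    # instrument bandwidths h
    total = 0.0
    for xt, h in zip(locs, bws):
        g = (np.abs((X - xt) / h) <= 0.5).astype(float)    # uniform kernel
        En = (m * g[:, None]).mean(axis=0)                 # E_n m_j(W_i,theta)g(X_i)
        total += (np.abs(np.minimum(En, 0.0)) ** p).sum()  # |t|_- = |min(t, 0)|
    return (total / n_mu) ** (1.0 / p)

# Reusing the interval regression dgp from the previous sketch, but evaluating
# the moments at a theta whose intercept puts the line above E(WH | X = x):
rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(0, 1, n)
Wstar = 0.5 * X + rng.normal(scale=0.1, size=n)
WL, WH = Wstar - 0.2, Wstar + 0.2
fitted = 0.3 + 0.5 * X                       # line outside the identified set
m = np.column_stack([WH - fitted, fitted - WL])
print(np.sqrt(n) * iv_cvm_statistic(X, m))   # scaled as in (10); clearly positive
```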
For the instrument based CvM statistic, the scaling for the power function will depend
on ω. This paper considers both a bounded weighting which, without loss of generality, can
be taken to be constant (the measure µ can absorb any weighting that does not change with
the sample size),
\[
\omega_j(\theta, g) = 1, \tag{6}
\]
as well as the truncated variance weighting used for KS statistics by Armstrong (2014b),
Armstrong and Chan (2016) and Chetverikov (2012), which is given by
\[
\omega_j(\theta, g) = [\hat\sigma_j(\theta, g) \vee \sigma_n]^{-1}, \tag{7}
\]
where
\[
\hat\sigma_j(\theta, g) = \{E_n [m_j(W_i, \theta) g(X_i)]^2 - [E_n m_j(W_i, \theta) g(X_i)]^2\}^{1/2}
\]
and σn is a sequence converging to zero and a ∨ b denotes the maximum of a and b for scalars
a and b.¹
The results for CvM statistics derived in this paper can be compared to power results
for KS statistics derived in Armstrong (2015) and Armstrong (2014b). A KS statistic based
on (2) simply takes the most negative value of that expression over g ∈ G, and is given by
\[
T_{n,\infty,\omega}(\theta) = \max_j \sup_{g \in \mathcal G} |E_n m_j(W_i, \theta) g(X_i)\, \omega_j(\theta, g)|_- . \tag{8}
\]
I refer to this as an instrument based KS (IV-KS) statistic. Similarly, a KS statistic based
on the kernel estimate (3) takes the most negative value of the weighted estimate over x:
\[
T_{n,\infty,\mathrm{kern}}(\theta) = \max_j \sup_x |\hat{\bar m}_j(\theta, x)\, \omega_j(\theta, x)|_- . \tag{9}
\]
I refer to this as a kernel based KS (kern-KS) statistic. As with CvM statistics, the scaling
for the local power function for the instrument based KS test depends on whether a bounded
weighting or a truncated variance weighting is used.
¹ For the critical value of the test, the results in this paper cover any critical value that is of the same order of magnitude asymptotically as a critical value based on the distribution where all moments bind. See Section 3.1 for details.
To complete the definition of these tests, we need to define a critical value. For tests that
use instrument based CvM statistics with bounded weights or inverse variance weights with
p < ∞, the test φn,p,ω,µ (θ), which rejects when φn,p,ω,µ (θ) = 1, is defined as
\[
\varphi_{n,p,\omega,\mu}(\theta) = \begin{cases} 1 & \text{if } \sqrt{n}\, T_{n,p,\omega,\mu}(\theta) > \hat c_{n,p,\omega,\mu}(\theta) \\ 0 & \text{otherwise} \end{cases} \tag{10}
\]
for some critical value ĉn,p,ω,µ (θ). For kernel based CvM statistics, the test φn,p,kern (θ), which
rejects when φn,p,kern (θ) = 1, is defined as
\[
\varphi_{n,p,\mathrm{kern}}(\theta) = \begin{cases} 1 & \text{if } (nh^{d_X})^{1/2}\, T_{n,p,\mathrm{kern}}(\theta) > \hat c_{n,p,\mathrm{kern}}(\theta) \\ 0 & \text{otherwise.} \end{cases} \tag{11}
\]
While all of the new results in this paper are for CvM statistics, I refer to analogous results
for KS statistics at some points for comparison. For KS tests with bounded weights, the
critical value is defined as in (10). For KS tests based on truncated variance weights, the
test φn,∞,(σ∨σn )−1 (θ) is defined as
\[
\varphi_{n,\infty,(\sigma\vee\sigma_n)^{-1}}(\theta) = \begin{cases} 1 & \text{if } \sqrt{n/\log n}\; T_{n,\infty,(\sigma\vee\sigma_n)^{-1}}(\theta) > \hat c_{n,\infty,(\sigma\vee\sigma_n)^{-1}}(\theta) \\ 0 & \text{otherwise.} \end{cases} \tag{12}
\]
requiring uniformly good power in classes of underlying distributions defined by smoothness
properties, the power of CvM tests is much worse (see Section A.5). The results in this
paper show that power comparisons in the set identified case considered here are much
different than settings that have been studied previously. Armstrong (2015), Armstrong
(2011), Armstrong (2014b), Armstrong and Chan (2016), and Chetverikov (2012) derive
power results for KS statistics under conditions similar to those used in this paper, but do
not consider CvM statistics.
The results in this paper are also related to the statistics literature on minimax testing
of hypotheses of the form H0,= : f (x) = 0 all x, H0,≥ : f (x) ≥ 0 all x, H0,↑ : f (x) ≥
f (x′) all x < x′ (and related hypotheses such as convexity of f ), where the function f
is observed with noise. While much of this literature focuses on the Gaussian white noise
model or Gaussian sequence model, the results are closely related to the case where f (x) =
E(Yi |Xi = x), and iid observations of Xi , Yi are available (which falls into our setup if we take
Yi = m(Wi , θ0 )). To formulate the minimax testing problem considered in this literature,
one specifies a smoothness class F for f and a functional ψ : F → [0, ∞) such that ψ(f )
is 0 if f satisfies the null and strictly positive otherwise. For example, for H0,= , one can
take the Lp norm ψ(f ) = [∫ f (x)^p dx]^{1/p} and, for H0,≥ , one can take the one-sided Lp norm
ψ(f ) = [∫ |f (x)|_−^p dx]^{1/p} . The minimax testing problem is to obtain tests that have good worst-
case power over alternatives f in the smoothness class F with ψ(f ) ≥ an for an → 0 as
quickly as possible. Dumbgen and Spokoiny (2001) and Juditsky and Nemirovski (2002)
consider H0,≥ with ψ given by the one-sided L∞ norm ψ(f ) = supx |f (x)|− and the one-
sided Lp norm with p < ∞ respectively, as well as H0,↑ and the hypothesis of convexity with
related distance functions ψ. Lepski and Tsybakov (2000) consider H0,= with ψ(f ) given by
the L∞ norm and by ψ(f ) = |f (x0 )| for a given point x0 . See Ingster and Suslina (2003) for
further results and references to this literature.
In contrast to this literature, the results in this paper have implications for minimax rates
of CvM statistics for testing the null that a given value of θ is in the identified set against the
alternative that the distance between θ and any point in the identified set is at least an (see
Section A.5 in the appendix for a formal statement). Since the dimension of θ is finite and
fixed, the choice of distance (i.e. whether to use Euclidean distance or sup-norm distance
when defining distance of θ from points in the identified set) does not matter for the rate
at which an can approach zero with the test having good power. This contrasts with the
nonparametric testing literature described above, in which the choice of distance function ψ
has implications for relative efficiency of different test statistics, and is part of the reason
that CvM and KS tests can be ranked in this setting. Interestingly, the problem of minimax
inference on θ in the settings considered here appears to be closely related to nonparametric
testing with ψ given by the L∞ norm. See Armstrong (2014a) for further discussion.
To see this in more detail, let us give a heuristic derivation of some of the results in
this setting. Consider the instrument based CvM statistic with bounded weights, where
the measure µ on the instruments g(x) = k((x − x̃)/h) has a density fµ (x̃, h) with re-
spect to the Lebesgue measure, and assume that Xi has a density fX (x). For simplic-
ity, suppose we only base the statistic on the inequality involving WiH . The statistic is
\[
T_n(\theta_n) = \left[ \int\!\!\int \left| E_n (W_i^H - (1, X_i)\theta_n)\, k((X_i - \tilde x)/h) \right|_-^p f_\mu(\tilde x, h)\, d\tilde x\, dh \right]^{1/p},
\]
which is an integral
over a sample expectation. We expect that the test will have power when the integral over
the corresponding population expectation is large relative to the critical value, which, as
discussed below, will be of order n−1/2 . Thus, to have power at θn = (θ0,1 + an,1 , θ0,2 )′, we
expect that
\[
\left[ \int\!\!\int \left| E(W_i^H - (1, X_i)\theta_n)\, k((X_i - \tilde x)/h) \right|_-^p f_\mu(\tilde x, h)\, d\tilde x\, dh \right]^{1/p}
= \left[ \int\!\!\int \left| \int (E(W_i^H \mid X_i = x) - (1, x)\theta_n)\, k((x - \tilde x)/h)\, f_X(x)\, dx \right|_-^p f_\mu(\tilde x, h)\, d\tilde x\, dh \right]^{1/p} \tag{13}
\]
As θn approaches θ0 , only values of x̃ near x0 and values of h near zero will contribute
to the integrand, so that this approximation will hold with increasing accuracy (here x0
denotes the point where the regression line (1, x)θ0 is tangent to E(WiH |Xi = x), and V
denotes the second derivative of E(WiH |Xi = x) − (1, x)θ0 at x0 ). Furthermore, assuming
that fµ and fX are smooth, this means that we can also replace fX (x) with fX (x0 ) and
fµ (x̃, h) with fµ (x0 , 0), and, using a second order expansion, replace E(WiH |Xi = x) − (1, x)θn
with (x − x0 )2 (V /2) − an,1 :
\[
\left[ \int\!\!\int \left| \int ((x - x_0)^2 (V/2) - a_{n,1})\, k((x - \tilde x)/h)\, f_X(x_0)\, dx \right|_-^p f_\mu(x_0, 0)\, d\tilde x\, dh \right]^{1/p}.
\]
² Consider, for example, testing the finite set of moment inequalities H0 : EYi,1 ≥ 0, . . . , EYi,k ≥ 0. Tests based on the statistic $\sum_{j=1}^{k} |\sum_{i=1}^{n} Y_{i,j}|_-^2$ (which is analogous to a CvM statistic) will have more power when each of the inequalities is violated by a small amount, while tests based on the statistic $\max_{j=1}^{k} |\sum_{i=1}^{n} Y_{i,j}|_-$ (which is analogous to a KS statistic) will have more power when a single inequality is violated. See Armstrong (2014a) for details and further references.
Using the change of variables u = (x − x0 )/a_{n,1}^{1/2} , v = (x̃ − x0 )/a_{n,1}^{1/2} , h̃ = h/a_{n,1}^{1/2} , it can be
seen that the above display is equal to
\[
\left[ \int\!\!\int \left| \int (a_{n,1} u^2 (V/2) - a_{n,1})\, k((u - v)/\tilde h)\, f_X(x_0)\, a_{n,1}^{1/2}\, du \right|_-^p f_\mu(x_0, 0)\, a_{n,1}^{1/2}\, dv\, a_{n,1}^{1/2}\, d\tilde h \right]^{1/p}
= a_{n,1}^{3/2 + 1/p} \left[ \int\!\!\int \left| \int (u^2 (V/2) - 1)\, k((u - v)/\tilde h)\, f_X(x_0)\, du \right|_-^p f_\mu(x_0, 0)\, dv\, d\tilde h \right]^{1/p}.
\]
Thus, we expect to get power when a_{n,1}^{3/2+1/p} decreases at least as slowly as n−1/2 , which
corresponds to an,1 decreasing at the rate n−1/(3+2/p) . This is the rate derived formally for
this test in Section 4.1, specialized to this setting (the general results use a smoothness
parameter γ which, in this case, is equal to 2).
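As a quick arithmetic check on this heuristic (not a result from the paper), one can solve a_{n,1}^{3/2+1/p} ≍ n^{−1/2} for the exponent of an,1 symbolically:

```python
from fractions import Fraction

def cvm_rate_exponent(p):
    # Solve a_n^(3/2 + 1/p) = n^(-1/2) for a_n = n^(-q):
    # q * (3/2 + 1/p) = 1/2  =>  q = 1 / (3 + 2/p)
    return Fraction(1, 2) / (Fraction(3, 2) + Fraction(1, p))

for p in (1, 2, 10):
    print(p, cvm_rate_exponent(p))
# 1/5, 1/4, 5/16: increasing in p, approaching the KS exponent 1/3 from below
```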
To understand how this differs from the corresponding KS test based on Tn (θn ) =
sup_{x̃,h} |En (WiH − (1, Xi )θn )k((Xi − x̃)/h)|− , note that similar derivations give the approximation
\[
\sup_{\tilde x, h} \left| \int ((x - x_0)^2 (V/2) - a_{n,1})\, k((x - \tilde x)/h)\, f_X(x_0)\, dx \right|_- ,
\]
and comparing this to n−1/2 (which is the order of the critical value in this case as well) shows
that we will have power when an,1 decreases at the rate n−1/3 . This is shown formally in
Armstrong (2015). Note that the n−1/3 rate for the KS statistic is faster than the n−1/(3+2/p)
rate for the CvM statistic.
3 Assumptions
This section states the conditions used in this paper, and verifies them for the interval re-
gression model defined in the introduction. Section A in the appendix verifies the conditions
in other settings.
This paper considers the power P (φn (θn ) = 1) of a sequence φn (θn ) of tests of H0,θn :
θn ∈ Θ0 under iid data from a fixed dgp P , where θn = θ0 + an is a sequence converging to θ0
on the boundary of Θ0 (where Θ0 is the identified set under the given dgp P ). Thus, we need
conditions on the tests φn (θn ) (in particular, the critical values and weighting functions, etc.
used in forming the test statistics) and the dgp P and the sequence θn . Section 3.1 gives
the conditions on the tests φn (θn ) and Section 3.2 gives the conditions on P and θn . Section
3.3 verifies these conditions for the interval regression model. Section 3.4 explains how the
conditions differ from those encountered in point identified settings.
Assumption 3.1. For some η > 0, the critical value ĉn = ĉn (θn ) defined in (10) or (11),
depending on the weighting and form of the test, satisfies ĉn (θn ) > η with probability ap-
proaching one.
Assumption 3.1 holds for the kernel CvM based test of Lee et al. (2013), which uses the
least favorable null dgp, as well as the tests using instrument based CvM statistics with
bounded weights proposed in Andrews and Shi (2013). Instrument based CvM statistics
with variance weights have not been considered in the literature. In Section C of the supple-
mentary appendix, I consider critical values for this case and show that critical values based
on the least favorable null dgp will satisfy Assumption 3.1.
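To make the idea concrete, here is a hedged sketch of how a least favorable critical value could be simulated: draw data from a dgp in which every conditional moment binds (E(mj (Wi , θ)|Xi ) = 0 for all j) and take the 1 − α quantile of the scaled statistic. The mean-zero Gaussian moments below are purely illustrative placeholders; the Monte Carlo in Section 5 instead uses the exact null distribution under a specific design.

```python
import numpy as np

def least_favorable_cv(stat_fn, n, alpha=0.05, n_sim=200, seed=0):
    """Simulate the 1 - alpha quantile of sqrt(n) * T_n under an illustrative
    least favorable dgp: every conditional moment has mean zero (all bind).
    stat_fn(X, m) should return the unscaled statistic, e.g. the IV-CvM
    statistic sketched earlier."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_sim)
    for s in range(n_sim):
        X = rng.uniform(0, 1, n)
        m = rng.normal(size=(n, 2))   # placeholder: mean-zero binding moments
        draws[s] = np.sqrt(n) * stat_fn(X, m)
    return np.quantile(draws, 1 - alpha)
```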
Assumption 3.1 only gives a lower bound for a critical value. This gives bounds on the
power, but to derive the exact local asymptotic power, we need the following condition, which
gives a limiting value for this critical value. Under mild conditions on the data generating
process and sequence of local alternatives, this assumption will also hold for the methods of
choosing critical values discussed above.
Assumption 3.2. For the critical value ĉn = ĉn (θn ) defined in (10) or (12), depending on
the weighting and form of the test, and some constant c > 0, ĉn (θn ) →p c.
The power properties of the test will also depend on the class of functions G used as
instruments. I derive power results for the case where G consists of kernel functions with
different bandwidths and locations, defined in the following assumption.
Assumption 3.3. For some bounded, nonnegative function k with finite support and
∫ k(u) du > 0, G = {x 7→ k((x − x̃)/h) | x̃ ∈ RdX , h ∈ R+ }, and the covering number
N (ε, G, L1 (Q)) defined in Pollard (1984) satisfies supQ N (ε, G, L1 (Q)) ≤ Aε−W , where the
supremum is over all probability measures.
The covering number assumption in Assumption 3.3 is a technical condition that allows
for uniform convergence of kernel estimates over x and h. A sufficient condition is that the
kernel k takes the form k(x) = r(‖x‖) where r is a monotone decreasing function on
[0, ∞) (see Pollard, 1984, chapter 2, problem 28).
For CvM statistics, I place the following condition on the measure µ over which the
sample means are integrated.
Assumption 3.4. The measure µ has bounded support, and has a density fµ (x̃, h) with
respect to the Lebesgue measure on RdX × [0, ∞) that is bounded and continuous.
Relaxing this assumption would lead to different power properties, although the general
point that Lp statistics perform worse in these models than supremum statistics would go
through.
Assumption 3.5. Let m̄j (θ, x) ≡ E(mj (Wi , θ)|Xi = x). The set X0 ≡ {x | m̄j (θ0 , x) = 0
for some j} is finite, with elements x1 , . . . , xℓ , and, for each xk ∈ X0 , let J(k) ≡ {j |
m̄j (θ0 , xk ) = 0}. Assume that there exist neighborhoods B(xk ) of each xk ∈ X0 such that the
following assumptions hold.

i.) There exists η > 0 such that, for θ in a neighborhood of θ0 , we have (a) m̄j (θ, x) > η
for j ∉ J(k) for x ∈ B(xk ) and (b) m̄j (θ, x) > η for all j for x ∉ ∪ℓk=1 B(xk ).
ii.) For j ∈ J(k), m̄j (θ0 , x) is continuous on the closure of B(xk ) and satisfies
\[
\sup_{\|x - x_k\| \le \delta} \left| \frac{\bar m_j(\theta_0, x) - \bar m_j(\theta_0, x_k)}{\|x - x_k\|^{\gamma(j,k)}} - \psi_{j,k}\!\left(\frac{x - x_k}{\|x - x_k\|}\right) \right| \xrightarrow{\delta \to 0} 0
\]
for some γ(j, k) > 0 and some function ψj,k : {t ∈ RdX | ‖t‖ = 1} → R with ψ̄ ≥
ψj,k (t) ≥ ψ̲ for some ψ̄ < ∞ and ψ̲ > 0. For future reference, define γ = maxj,k γ(j, k)
and J̃(k) = {j ∈ J(k) | γ(j, k) = γ}.
iv.) For j ∈ J(k), s2j (x, θ) ≡ var(mj (Wi , θ)|Xi = x) is strictly positive and continuous at
(xk , θ0 ).
v.) For x in the closure of B(xk ) and θ in a neighborhood of θ0 , m̄(θ, x) has a derivative
as a function of θ that is continuous as a function of (θ, x). Let m̄θ,j (θ, x) denote the
jth row of this derivative matrix (i.e. the derivative of m̄j (θ, x) with respect to θ).
Assumption 3.6. The data are iid and, for some fixed Ȳ < ∞ and θ in some neighborhood
of θ0 , |m(Wi , θ)| ≤ Ȳ with probability one.
The deterministic bound in Assumption 3.6 allows for the use of certain technical results
that are useful in the proofs. It may be possible to relax this assumption, although additional
technical arguments would be needed in some places.
The following assumption, which is used for kernel based statistics, ensures that the
kernel estimators do not encounter boundary problems (cf. Assumption 1(iii) in Lee et al.,
2013).
Assumption 3.7. Xi has a density fX that is bounded, and the weighting function ωj (θ, x)
is continuous for all j and, for some ε > 0, is equal to zero whenever fX (x̃) < ε for some x̃
with ‖x̃ − x‖ < ε.
3.3 Discussion and Primitive Conditions for Interval Regression
In discussing these assumptions, it is useful to keep in mind the interval regression model
introduced in the introduction, in which Wi = (Xi , WiL , WiH ) and m(Wi , θ) = (WiH −
(1, Xi′)θ, (1, Xi′)θ − WiL )′. The following gives a general discussion of these assumptions,
with references to the interval regression model as an example. I then state primitive suffi-
cient conditions in the interval regression model that imply these assumptions with γ = 2.
Section A of the appendix gives primitive conditions in additional settings.
The assumptions used here are similar to the conditions used in Armstrong (2015) to
derive the asymptotic distribution and local power of a KS statistic with bounded weights.
In particular, Assumption 3.5 corresponds to the version of Assumption 3.1 in Armstrong
(2015) used in Section 5 of that paper, in which part (ii) is replaced by Assumption 5.1 in
Armstrong (2015). Part (i) strengthens the version used in Armstrong (2015) by extending
it to a neighborhood of θ0 , and part (v) is an additional condition on the derivative with
respect to θ. These additional conditions are used to derive local power, and are similar to
Assumption 7.1 in Armstrong (2015).
Assumption 3.5 is the main substantive condition that gives rise to the local power
results derived in this paper. It states that the conditional mean of the moment conditions
is equal to zero only at a finite number of points. In the context of the interval regression
model, this holds for θ0 on the boundary of the identified set when the regression line (1, x′)θ0
is tangent to E(WiH |Xi = x) or E(WiL |Xi = x) at a finite number of points. In general,
a sufficient condition for this in the case where Xi has compact support is that m̄j (θ, x)
takes its minimum on the interior of the support of Xi and m̄j (θ, x) is twice continuously
differentiable with a positive definite second derivative matrix at any point where it takes a
minimum (see Section A.1 in the appendix).
The most natural case where this does not hold is where E(WiH |Xi = x) or E(WiL |Xi =
x) is linear and equal to (1, x′)θ on a nondegenerate interval (the other possibility is for
E(WiH |Xi = x) − (1, x′)θ0 to be zero on a set with infinitely many elements, but with zero
probability, such as with the function sin(1/x)). This holds in the point identified case where
P (WiH = WiL |Xi ) = 1 for Xi on a nondegenerate interval (and, in particular, in the special
case where WiH = WiL with probability one, leading to the usual linear regression model).
However, when θ is set identified, this is a knife-edge case: even if E(WiH |Xi ) = (1, Xi′)θ0
for Xi on a nondegenerate interval for a given θ0 on the boundary of the identified set, we
will typically have E(WiH |Xi = x) = (1, x′)θ̃0 only on a finite set for θ̃0 close to θ0 .
This is illustrated by Figures 2 and 3, which are taken directly from Section 2.2 of
Armstrong (2015). Each figure shows the conditional mean E(WiH |Xi = x) for some dgp along
with regression lines corresponding to particular parameter values θ (the lower conditional
mean E(WiL |Xi = x) can be taken to be below the area shown in each figure). In Figure 2,
the regression line (1, x′)θ = θ1 + θ2 x is tangent to the conditional mean at a single point,
and Assumption 3.5 holds for the parameter θ. In Figure 3, the regression line θa,1 + θa,2 x
corresponding to the parameter θa is equal to E(WiH |Xi = x) on a nondegenerate interval,
so that Assumption 3.5 does not hold. However, at nearby parameter values such as θb , the
regression line is equal to E(WiH |Xi = x) at a single point and Assumption 3.5 holds. See
Section 2.2 of Armstrong (2015) for further discussion.
In the case where m̄(θ0 , x) is twice continuously differentiable in x, part (ii) of Assumption
3.5 follows from a second order Taylor expansion at xk , so long as the second derivative
matrix is positive definite. In this case, Assumption 3.5 holds with γ = 2 and ψj,k (u) =
u′Vj (xk )u/2, where Vj (xk ) is the second derivative matrix of x 7→ m̄j (θ0 , x) at xk . In the
interval regression model, the second derivative of m̄1 (θ0 , x) is equal to the second derivative
of E(WiH |Xi = x) (and similarly for m̄2 (θ0 , x) and −E(WiL |Xi = x)), so this translates
directly to an assumption of a positive definite second derivative matrix of E(WiH |Xi = x).
In the case where m̄(θ0 , x) is Lipschitz continuous, part (ii) of Assumption 3.5 will hold with
γ = 1 if we place additional regularity conditions on the one-sided directional derivative of
m̄(θ0 , x). The parameter θ in Figure 2 illustrates a case where Assumption 3.5 holds with
γ = 2, while the parameter θb in Figure 3 illustrates a case where Assumption 3.5 holds
with γ = 1. See Theorem A.1 in Section A.2 of the appendix for a formal statement in the
interval regression model.
The remaining assumptions are regularity conditions that translate easily to primitive
objects in the case of interval regression. For part (v), note that m̄θ,1 (θ, x) = −(1, x′) and
m̄θ,2 (θ, x) = (1, x′), which are clearly continuous, so this assumption holds without further
conditions on the dgp.
The following gives a formal statement of primitive conditions for the interval regression
model in the case where the conditional means are twice differentiable. The proof of this
result uses the ideas in the discussion above, and is given in Section A.2 of the appendix.

Theorem 3.1. Suppose that the following conditions hold.

i.) The conditional means E(WiH |Xi = x) and E(WiL |Xi = x) are twice differentiable with
continuous second derivatives, Xi has a continuous density and compact support, and
WiH and WiL are bounded from above and below by finite constants.
ii.) For any point x̃ such that E(WiH |Xi = x̃) = (1, x̃′)θ0 , x̃ is in the interior of the
support of Xi , var(WiH |Xi = x) is positive and continuous at x̃ and E(WiH |Xi = x)
has a positive definite second derivative matrix at x̃. The same holds for E(WiL |Xi = x)
with “positive definite” replaced by “negative definite.”

Then Assumptions 3.5 and 3.6 hold, with γ = 2 in Assumption 3.5.
In the present setting, the results in this paper show that, even though √n local power
is possible in certain special cases, the minimax (worst-case) power is slower than √n when
one only places bounds on derivatives of certain objects. In particular, while a bound on the
second derivative of E(WiH |Xi = x) and E(WiL |Xi = x) does not imply Assumption 3.5 in
the interval regression model, one can construct a dgp such that Assumption 3.5 holds with
γ = 2 for any nonzero bound on the second derivative. Thus, the minimax rates of local power
for CvM statistics under a bound on the second derivative are at least as slow as the rates
derived in this paper, which are slower than √n. Since the results in Armstrong (2014b) show
that the corresponding KS statistics achieve a better rate for local alternatives uniformly
over dgps with a bound on the second derivative (and additional regularity conditions), this
means that the KS statistic is preferred to the CvM statistic under a minimax criterion in
this class. See Section A.5 in the appendix for formal statements.
Theorem 4.1. Let an = an−γ/{2[dX +γ+(dX +1)/p]} for some vector a ∈ Rdθ , and define
\[
\lambda_{bdd}(a, j, k, p) = \lambda_{bdd}(a, \bar m_{\theta,j}(\theta_0, x_k), \psi_{j,k}, f_X(x_k), f_\mu(x_k, 0), p)
\equiv \int\!\!\int \left| \int \left[ \|x\|^\gamma \psi_{j,k}\!\left(\frac{x}{\|x\|}\right) + \bar m_{\theta,j}(\theta_0, x_k)\, a \right] k((x - \tilde x)/h)\, f_X(x_k)\, dx \right|_-^p f_\mu(x_k, 0)\, d\tilde x\, dh.
\]
Under Assumptions 3.3, 3.4, 3.5, and 3.6,
\[
n^{1/2}\, T_{n,p,1,\mu}(\theta_0 + a_n) \xrightarrow{p} \Big[ \sum_{k=1}^{|X_0|} \sum_{j \in \tilde J(k)} \lambda_{bdd}(a, j, k, p) \Big]^{1/p} \equiv r_{bdd}(a).
\]
Theorem 4.1 has immediate consequences for the power of tests based on CvM statistics
with bounded weightings.
Theorem 4.2. If, in addition to the conditions of Theorem 4.1, Assumption 3.1 holds, the
power
Eφn,p,1,µ (θ0 + an )
of the test φn,p,1,µ (θ0 + an ) will converge to zero for rbdd (a) < c. If a is close enough to zero,
rbdd (a) will be less than c so that the power will converge to zero. If, in addition, Assumption
3.2 holds, the power will converge to 1 for rbdd (a) > c.
The n−γ/{2[dX +γ+(dX +1)/p]} rate for instrument based CvM statistics with bounded weights
is slower than the n−γ/{2[dX +γ]} rate derived for the corresponding KS test in Theorem 14 of
Armstrong (2015) (for γ = 2) and Theorem 5.1 of Armstrong (2014b) (α from that paper
plays the role of γ here). Note also that local power increases as p increases, and the rate
becomes arbitrarily close to the rate for the KS test as p increases.
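The size of this gap is easy to tabulate. The snippet below evaluates both rate exponents for γ = 2 and dX = 1; it is pure arithmetic on the rates quoted above, nothing more:

```python
def q_cvm_bdd(gamma, dX, p):
    # exponent for the instrument CvM test with bounded weights (Theorem 4.2)
    return gamma / (2 * (dX + gamma + (dX + 1) / p))

def q_ks(gamma, dX):
    # exponent for the corresponding instrument KS test (Armstrong, 2015)
    return gamma / (2 * (dX + gamma))

for p in (1, 2, 10, 100):
    print(p, round(q_cvm_bdd(2, 1, p), 4), round(q_ks(2, 1), 4))
# 0.2, 0.25, 0.3125, 0.3311 versus the KS exponent 1/3 = 0.3333
```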
Theorem 4.3. Let an = an−γ/{2[dX /2+γ+(dX +1)/p]} for some vector a ∈ Rdθ , and define
\[
\lambda_{var}(a, j, k, p) \equiv \int\!\!\int \left| \int \left[ \|x\|^\gamma \psi_{j,k}\!\left(\frac{x}{\|x\|}\right) + \bar m_{\theta,j}(\theta_0, x_k)\, a \right] w_j(x_k)\, h^{-d_X/2}\, k((x - \tilde x)/h)\, f_X(x_k)\, dx \right|_-^p f_\mu(x_k, 0)\, d\tilde x\, dh.
\]
Suppose that σn (n/ log n)1/2 → ∞ and Assumptions 3.3, 3.4, 3.5, and 3.6 hold. Then
\[
n^{1/2}\, T_{n,p,(\hat\sigma \vee \sigma_n)^{-1},\mu}(\theta_0 + a_n) \le \Big[ \sum_{k=1}^{|X_0|} \sum_{j \in J(k)} \lambda_{var}(a, j, k, p) \Big]^{1/p} + o_p(1) \equiv r_{var}(a) + o_p(1)
\]
where rvar (a) → 0 as a → 0. If, in addition, σn ndX /{4[dX /2+γ+(dX +1)/p]} → 0, the above display
will hold with the inequality replaced by equality.
The result has immediate consequences for the power of tests based on CvM statistics
with truncated variance weightings.
Theorem 4.4. Let an be defined as in Theorem 4.3 and suppose that the conditions of that
theorem and Assumption 3.1 hold. The power
of the test φn,p,(σ∨σn )−1 ,µ (θ0 +an ) will converge to zero for rvar (a) < c. For a close enough to 0,
rvar (a) will be less than c so that the power will converge to zero. If, in addition, Assumption
3.2 holds and σn ndX /{4[dX /2+γ+(dX +1)/p]} → 0, the power will converge to 1 for rvar (a) > c.
As with bounded weighting functions, the rate for detecting local alternatives with CvM
statistics with variance weights is slower than the rate for the corresponding KS test. The
n−γ/{2[dX /2+γ+(dX +1)/p]} rate for variance weighted CvM statistics derived above contrasts
with the (n/ log n)−γ/[2(dX /2+γ)] rate for the corresponding KS test derived in Armstrong
and Chan (2016) and Armstrong (2014b) (the results from the latter paper on rates of
convergence of confidence regions in the Hausdorff metric imply these local power results).
The rate for CvM statistics approaches the rate for KS statistics as p → ∞.
and
\[
\tilde\lambda_{kern}(a, j, k, p) \equiv \int \left| \left[ \|v\|^\gamma \psi_{j,k}\!\left(\frac{v}{\|v\|}\right) + \bar m_{\theta,j}(\theta_0, x_k)\, a \right] \omega_j(\theta_0, x_k) \right|_-^p dv.
\]
Theorem 4.5. Suppose that Assumptions 3.4, 3.5, 3.6 and 3.7 hold, and that the kernel
function k satisfies Assumption 3.3. In addition, suppose that the bandwidth h satisfies
h/n−s → ch for some 0 < s < 1/dX and ch > 0, that the kernel function k satisfies
∫ k(u) du = 1, and that the functions ψj,k in Assumption 3.5 are continuous. Let an = an−q
for some a ∈ Rdθ where
\[
q = \begin{cases} s\gamma & \text{if } s < 1/[2(\gamma + d_X/p + d_X/2)] \\ (1 - s d_X)/[2(1 + d_X/(p\gamma))] & \text{if } s \ge 1/[2(\gamma + d_X/p + d_X/2)]. \end{cases}
\]
The result has immediate implications for the power of tests based on kernel CvM statis-
tics.
Theorem 4.6. Let an be defined as in Theorem 4.5 and suppose that the conditions of that
theorem and Assumption 3.1 hold. If s > 1/[2(γ + dX /p + dX /2)], the power
Eφn,p,kern (θ0 + an )
of the test φn,p,kern (θ0 + an ) will converge to zero for r̃kern (a) < c. If s = 1/[2(γ + dX /p +
dX /2)], the power given by the above display will converge to zero for r̃kern (a, ch ) < c. If
s < 1/[2(γ + dX /p + dX /2)], the power given by the above display will converge to zero if
r̃kern (a, ch ) = 0 in a neighborhood of (a, ch ). If, in addition, Assumption 3.2 holds, the power
given by the above display will converge to 1 if r̃kern (a) > c, r̃kern (a, ch ) > c, or r̃kern (a, ch ) > 0
in the cases where s is greater than, equal to, or less than 1/[2(γ + dX /p + dX /2)] respectively.
As with instrument based statistics, the rate for detecting local alternatives with the
kernel CvM test is slower than the rate for the corresponding KS statistic. The rate derived
in Theorem 4.5 can be written as max{(nhdX )−1/[2(1+dX /(pγ))] , hγ }, which is slower than the
max{(nhdX / log n)−1/2 , hγ } rate for kernel based KS statistics derived in Armstrong (2014b).
As with the instrument based statistics, the CvM test is more powerful for p larger, and the
rate approaches the rate for the KS test as p goes to ∞.
Theorem 4.5 can be used to choose the optimal bandwidth in this setting. The rate
an = an−q is best when s = 1/[2(γ + dX /p + dX /2)], which gives an exponent in the rate of
\[
q = \frac{\gamma}{2(\gamma + d_X/p + d_X/2)} = \frac{1 - s d_X}{2(1 + d_X/(p\gamma))} = s\gamma.
\]
Note that this rate is faster than the n−γ/{2[dX /2+γ+(dX +1)/p]} rate that can be obtained
with instrument based CvM tests with variance weights. Thus, restricting the class of
instruments using prior knowledge of the data generating process leads to a faster rate
with CvM statistics. In contrast, instrument based KS statistics with variance weights can
achieve the same rate as kernel KS statistics that use prior knowledge of the data generating
process to choose the bandwidth optimally (cf. Armstrong, 2014b; Armstrong and Chan,
2016; Chetverikov, 2012).
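For reference, the optimal bandwidth exponent s∗ = 1/[2(γ + dX /p + dX /2)] and the resulting rate exponent q = s∗γ from the display above can be evaluated directly; the snippet below is plug-in arithmetic only:

```python
def optimal_bandwidth_exponent(gamma, dX, p):
    # s* equating the two branches of q in Theorem 4.5
    return 1.0 / (2 * (gamma + dX / p + dX / 2))

def kern_cvm_rate_exponent(gamma, dX, p):
    # best achievable q = s* gamma for the kernel CvM test
    return optimal_bandwidth_exponent(gamma, dX, p) * gamma

for p in (1, 2, 10):
    s = optimal_bandwidth_exponent(2, 1, p)
    print(p, round(s, 4), round(kern_cvm_rate_exponent(2, 1, p), 4))
# gamma = 2, dX = 1, p = 1: s* = 1/7, q = 2/7, i.e. h ~ n^(-1/7), a_n ~ n^(-2/7)
```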
5 Monte Carlo
This section reports the results of a Monte Carlo study of the finite sample properties of the
statistics considered in this paper. I perform a Monte Carlo based on a median regression
model with potentially endogenously missing data. I use the same data generating processes
as for the Monte Carlo for variance weighted KS statistics in Armstrong and Chan (2016).
A description of the model and data generating processes is repeated here for convenience.
The latent variable Wi∗ follows a linear median regression model given the observed
covariate Xi : q1/2 (Wi∗ |Xi ) = θ1 + θ2 Xi where q1/2 (Wi∗ |Xi ) is the conditional median of Wi∗
given Xi . Define WiH = Wi∗ when Wi∗ is observed and WiH = ∞ otherwise. This gives the
conditional moment inequality E[I(θ1 + θ2 Xi ≤ WiH ) − 1/2|Xi ] ≥ 0 a.s. (a similar inequality
can be formed with the lower bound WiL defined analogously, but with WiL = −∞ when Wi∗
is unobserved, which would give the interval quantile regression setup of Section A.3 of the
appendix; the Monte Carlo focuses on the inequality corresponding to WiH for simplicity).
This model allows for arbitrary correlation between the “missingness” process and (Wi∗ , Xi ),
so that the resulting bounds can be used to assess sensitivity to missingness at random
assumptions that would point identify the model.
Each design uses data from the true model Wi∗ = θ1∗ + θ2∗ Xi + ui , where (θ1∗ , θ2∗ ) = (0, 0)
and ui is independent of Xi with ui ∼ unif(−1, 1). The outcome variable Wi∗ is then set to be
missing independently of Wi∗ with probability p(Xi ) (note that, while the data are generated
according to a missingness at random assumption and a particular parameter value, the tests
are robust to failure of this assumption, which leads to a lack of point identification), where
p(x) is varied in each of three designs:
Design 1: p(x) = .1
Design 2: p(x) = .02 + 2 · .98 · |x − .5|
Design 3: p(x) = .02 + 4 · .98 · (x − .5)2 .
This leads to the identified set Θ0 = {(θ1 , θ2 )′ | θ1 + θ2 x ≤ q1/2 (WiH |Xi = x) all x ∈ [0, 1]},
where q1/2 (WiH |Xi = x) can be calculated for each design as q1/2 (WiH |Xi = x) = 1/(1 −
p(x)) − 1. For each design, the Monte Carlo power of φ(θ) for each test φ under the dgp in the
given design is reported for θ = (θ̲1 + a, 0) where θ̲1 = sup{θ1 | (θ1 , 0) ∈ Θ0 } and a varies over
the set {.1, .2, .3, .4, .5}. This leads to local alternatives that satisfy the conditions of this
paper with γ = 1 for Design 2 and γ = 2 for Design 3. Design 1 leads to a flat conditional
mean for which asymptotic theory predicts the following rates (for the instrument functions
used here): n−1/2 for kernel and instrument based CvM and unweighted instrument based KS
statistics, (n/ log n)−1/2 for variance weighted instrument KS statistics and (nh/ log n)−1/2
for kernel KS statistics (see Andrews and Shi, 2013; Armstrong, 2014b; Chernozhukov et al.,
2013; Lee et al., 2013).
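For concreteness, the following sketch draws one sample from each design and computes the boundary point θ̲1 = sup{θ1 | (θ1 , 0) ∈ Θ0 } = minx q1/2 (WiH |Xi = x). Only the dgp and the formula for q1/2 come from the text above; the implementation details (grid, seed) are illustrative.

```python
import numpy as np

# Missingness probabilities p(x) for the three designs
DESIGNS = {
    1: lambda x: np.full_like(x, 0.1),
    2: lambda x: 0.02 + 2 * 0.98 * np.abs(x - 0.5),
    3: lambda x: 0.02 + 4 * 0.98 * (x - 0.5) ** 2,
}

def draw(design, n, rng):
    """One sample: W* = u ~ unif(-1, 1) (so theta* = (0, 0)), with W* set to
    missing independently with probability p(X); WH = W* if observed, else +inf."""
    X = rng.uniform(0, 1, n)
    Wstar = rng.uniform(-1, 1, n)
    observed = rng.uniform(0, 1, n) >= DESIGNS[design](X)
    WH = np.where(observed, Wstar, np.inf)
    return X, WH

def theta1_bar(design):
    # q_{1/2}(WH | X = x) = 1/(1 - p(x)) - 1; the boundary point for
    # theta = (theta1, 0) is the minimum of this function over the support
    grid = np.linspace(0.001, 0.999, 999)  # interior grid (p(x) -> 1 at the endpoints of Design 2)
    return np.min(1.0 / (1.0 - DESIGNS[design](grid)) - 1.0)

rng = np.random.default_rng(0)
for d in (1, 2, 3):
    X, WH = draw(d, 500, rng)
    print(d, round(theta1_bar(d), 4))
# Design 1: 1/0.9 - 1 = 0.1111; Designs 2 and 3: 1/0.98 - 1 = 0.0204 (at x = 0.5)
```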
For the instrument based statistics, I use the class of functions {x 7→ I(s < x < s+t)|0 ≤
s ≤ s + t ≤ 1} and the Lebesgue measure on {(s, t)|0 ≤ s ≤ s + t ≤ 1} for µ for the
instrument based CvM statistics. This corresponds to the multiscale kernel instruments in
Assumption 3.3 with the uniform kernel. For the kernel based statistics, the uniform kernel
is used, and the supremum or integral is taken over the set [h/2, 1 − h/2], so that the support
of the kernel function is always contained in the support of Xi . For the CvM statistics, the
simulations use the test with Lp exponent p = 1. For each test statistic, the critical value
is taken from the least favorable null distribution, calculated exactly (up to Monte Carlo
error) using the distribution under (θ1 , 0) under Design 1. For the kernel estimators, the
bandwidths n−1/5 , n−1/3 and n−1/2 are used, and, for the truncated variance weighted CvM
statistics, the values n−1/5 /4, n−1/3 /4 and n−1/2 /4 are used for the truncation parameter σn2
(this corresponds to truncating the variance of functions I(s < x < s + t) with t less than
n−1/5 , n−1/3 and n−1/2 ). For comparison, results for the variance weighted instrument KS
statistic, which corresponds to the multiscale statistic of Armstrong and Chan (2016), are
reported as well (taken directly from that paper).
Overall, the Monte Carlo results support the claim that, for the data generating processes
and classes of instrument functions considered in the theoretical results in this paper, KS
statistics perform better than CvM statistics. For Design 2 and Design 3, which follow
the conditions of this paper with γ = 1 and γ = 2 respectively, the instrument based KS
statistic has more power than the instrument based CvM statistic in basically all cases. For
the kernel statistics, the KS test performs better unless the bandwidth is chosen to be much
too small. For example, for Design 3, the optimal bandwidth for the kernel statistic is of
order n−1/5 , and the kernel KS statistic performs better than the kernel CvM statistic with
this bandwidth. However, the kernel statistic performs worse for smaller bandwidths when
the sample size is not too large (although the KS statistic does almost as well or better with
1000 observations, suggesting that the asymptotics of Theorem 4.5 have started to kick in
at this point).
Note also that power in the Monte Carlo is very sensitive to the design, with greater
power for Design 3 than Design 2. This is to be expected given the asymptotic results.
Under Design 3, the assumptions of this paper hold with γ = 2, while, under Design 2,
the assumptions hold with γ = 1. The results of Section 4 show that asymptotic power is
increasing in γ (the rate at which local alternatives may approach the null with nontrivial
power is faster for larger γ) for each of the test statistics considered.
For Design 1, asymptotic results from elsewhere in the literature predict that the instru-
ment based statistics with the instruments used here perform about the same (in terms of
the rate for detecting local alternatives) for KS and CvM statistics, although the variance
weighted KS statistic performs slightly worse (by a log n factor). For kernel statistics, asymp-
totic theory predicts that KS statistics will perform worse than CvM statistics in this case
(the latter can achieve a n−1/2 rate, while the former cannot if the bandwidth goes to zero).
All of these predictions are borne out in the Monte Carlo: instrument based statistics all
perform well with the weighted KS statistics performing slightly worse, while the CvM version
is better for kernel statistics.
The Monte Carlo results also fit well with the prescription of the weighted instrument
KS or “multiscale” statistic of Armstrong (2011), Armstrong (2014b), Armstrong and Chan
(2016) and Chetverikov (2012) as the only test among the ones considered here that comes
close to having the best power among these test statistics for all three Monte Carlo designs
(according to asymptotic approximations, the weighted instrument KS test achieves the
best rate to at least within a log n factor in all three cases, while each of the other statistics
considered here performs worse by a polynomial factor in at least one case). While other
statistics perform slightly better in certain cases, they perform much worse in others (e.g.
the kernel KS statistic performs slightly better in Design 3 with the optimal bandwidth,
n−1/5 , but performs much worse when other bandwidths are chosen, or with any bandwidth
choice in Design 1).
6 Conclusion
This paper derives local power results for tests for conditional moment inequality models
based on several forms of CvM statistics in the set identified case. The power comparisons
hold under conditions that arise naturally in the set identified case, and determine the
minimax rate. The results show that KS tests are preferred to CvM statistics and that
variance weightings are preferred to bounded weightings.
Lemma A.1. Suppose that Xi has compact support and that f is a continuous, nonnegative
function on the support of Xi that is equal to zero only at interior points of the support and
is twice continuously differentiable with a positive definite second derivative matrix at any
point where it is equal to zero. Then the set {x | f (x) = 0} is finite. In particular, this
applies to the binding moment functions x 7→ m̄j (θ, x) in the interval regression model.
Proof. The result follows from the proof of Lemma B.1 in the supplementary appendix of
Armstrong (2015).
Proof of Theorem 3.1. First, note that the set of x such that m̄j (θ, x) = 0 for some j is finite
by Lemma A.1. Part (ii) of Assumption 3.5 follows from a second order Taylor expansion,
and part (i) follows by compactness of the support of Xi and continuity of the first two
derivatives of the conditional means. Part (iv) is immediate from part (ii) of the conditions
of the theorem and the fact that the conditional variance is constant in θ for this model. For
part (v), note that (d/dθ)m̄1 (θ, x) = −(d/dθ)m̄2 (θ, x) = −(1, x′), which is clearly continuous in (θ, x).
Assumption 3.6 is immediate from the bounds on WiH and WiL .
For the Lipschitz case (γ = 1), we can replace the assumption of two derivatives with
a condition on the directional one-sided first derivatives. Here, we make the assumption of
finiteness of the set where the conditional moments bind directly, since arguments involving
second derivatives do not apply. In the following, S^{dX −1} denotes the unit sphere {u ∈
RdX | ‖u‖ = 1}.
Assumption A.1. i.) The conditional means E(WiH |Xi = x) and E(WiL |Xi = x) are
Lipschitz continuous, Xi has a continuous density and compact support, and WiH and
WiL are bounded from above and below by finite constants.
ii.) The set X0 ≡ {x | E(WiH |Xi = x) = (1, x′)θ0 } is finite, and, for any point x̃ ∈ X0 , x̃ is in
the interior of the support of Xi , var(WiH |Xi = x) is positive and continuous at x̃ and
the one-sided directional derivative d/dt+ [E(WiH |Xi = x̃ + tu) − (1, (x̃ + tu)′)θ0 ] is bounded
from below away from zero at t = 0 and is right continuous at t = 0 uniformly over
u ∈ S^{dX −1} . The same holds for E(WiL |Xi = x) with “positive” replaced by “negative”
in the last statement.
Theorem A.1. Under Assumption A.1, Assumptions 3.5 and 3.6 hold, with γ = 1 in
Assumption 3.5.
Proof. Part (ii) of Assumption 3.5 follows from a first order Taylor expansion, and part (i)
follows by compactness of the support of Xi and the continuity and lower bound on the
directional derivatives. The verification of the remaining conditions is the same as in the
twice differentiable case.
Assumption A.2. i.) The conditional quantiles qτ (WiH |Xi = x) and qτ (WiL |Xi = x) are
twice differentiable with continuous second derivatives and Xi has a continuous density
and compact support.
ii.) For any x̃ such that qτ (WiH |Xi = x̃) = (1, x̃′)θ0 , x̃ is in the interior of the support of
Xi and qτ (WiH |Xi = x) has a positive definite second derivative matrix at x̃. The same
holds for qτ (WiL |Xi = x) with “positive definite” replaced by “negative definite.”
In addition, we will also require an assumption on the conditional densities of WiH and
WiL given Xi .
Assumption A.3. For some η > 0, WiH |Xi and WiL |Xi have conditional densities fWiH |Xi (w|x)
and fWiL |Xi (w|x) on {(x, w)|qτ,P (WiH |Xi = x) − η ≤ w ≤ qτ,P (WiH |Xi = x) + η} and
{(x, w)|qτ,P (WiL |Xi = x) − η ≤ w ≤ qτ,P (WiL |Xi = x) + η} respectively that are continuous
as a function of (x, w) and bounded away from zero on these sets.
Theorem A.2. Suppose that Assumptions A.2 and A.3 hold. Then Assumptions 3.5 and
3.6 hold, with γ = 2 in Assumption 3.5.
Proof. Let θ0 ∈ Θ0 satisfy the conditions of the theorem and let x̃ be such that qτ (WiH |Xi =
x̃) = (1, x̃′)θ0 . Let V (x) denote the second derivative matrix of x 7→ qτ (WiH |Xi = x). Then,
for δ small enough and ‖x − x̃‖ ≤ δ,
\[
\bar m_1(\theta_0, x) = \tau - P(W_i^H \le (1, x')\theta_0 \mid X_i = x) = \int_{(1,x')\theta_0}^{q_\tau(W_i^H | X_i = x)} f_{W_i^H | X_i}(w|x)\, dw
= \int_{(1,x')\theta_0}^{(1,x')\theta_0 + (x - \tilde x)' V(\tilde x)(x - \tilde x)/2 + r(x)} f_{W_i^H | X_i}(w|x)\, dw,
\]
where limx→x̃ r(x) = 0 and the last step follows from a second order Taylor expansion. This
expression is bounded from above by f̄ (δ) · [(x − x̃)′V (x̃)(x − x̃)/2 + r̄(δ)] and from below by
f̲ (δ) · [(x − x̃)′V (x̃)(x − x̃)/2 + r̲(δ)], where f̄ (δ) and r̄(δ) are upper bounds for fWiH |Xi (w|x) and
r(x) on {(x, w) | ‖x − x̃‖ ≤ δ, (1, x′)θ0 ≤ w ≤ qτ (WiH |Xi = x)} and f̲ (δ) and r̲(δ) are lower
bounds. As δ → 0, f̄ (δ) and f̲ (δ) converge to fWiH |Xi ((1, x̃′)θ0 |x̃) and r̄(δ) and r̲(δ) converge
to 0, so that
\[
\lim_{\delta \to 0}\ \sup_{\|x - \tilde x\| \le \delta} \left| \frac{\bar m_1(\theta_0, x)}{\|x - \tilde x\|^2} - f_{W_i^H | X_i}((1, \tilde x')\theta_0 | \tilde x)\, \frac{(x - \tilde x)' V(\tilde x)(x - \tilde x)}{2 \|x - \tilde x\|^2} \right| = 0.
\]
Applying this argument to the finite set of values x̃ such that τ − P (WiH ≤ (1, Xi′)θ0 |Xi =
x) = 0 and a symmetric argument for WiL , it follows that part (ii) of Assumption 3.5 holds
with γ = 2.
To verify part (i) of Assumption 3.5, first note that the set X0 = {x | qτ (WiH |Xi = x) =
(1, x′)θ} is finite by Lemma A.1. Using this and similar arguments to those used in the proof
of Theorem 3.1, there exist ε > 0 and δ > 0 such that qτ (WiH |Xi = x) − (1, x′)θ is bounded
away from zero for ‖θ − θ0 ‖ < ε and x such that ‖x − x̃‖ ≥ δ for all x̃ ∈ X0 . It then follows
from Assumption A.3 that τ − P (WiH ≤ (1, Xi′)θ|Xi = x) is bounded away from zero on
such a set. Part (i) of Assumption 3.5 follows from this and a similar argument for WiL .
For part (iv) of Assumption 3.5, note that the conditional variance of the moment function
corresponding to WiH is P (WiH ≤ (1, x′)θ|Xi = x)[1 − P (WiH ≤ (1, x′)θ|Xi = x)], so it
suffices to show that P (WiH ≤ (1, x′)θ|Xi = x) is in the set (0, 1) and is continuous in (θ, x)
at each (θ0 , x̃) such that P (WiH ≤ (1, x̃′)θ0 |Xi = x̃) = τ (i.e. such that m̄1 (θ0 , x̃) = 0). This
follows since, by Assumption A.3, WiH has a continuous conditional density in a neighborhood
of (1, x̃′)θ0 .
For part (v) of Assumption 3.5, note that, for (x, θ) such that WiH has a conditional
density given Xi = x at (1, x′)θ,
\[
\bar m_{\theta,1}(\theta, x) = -\frac{d}{d\theta} P(W_i^H \le (1, x')\theta \mid X_i = x) = -f_{W_i^H | X_i}((1, x')\theta \mid x)\, (1, x').
\]
This is continuous in (θ, x) in a small enough neighborhood of any (θ0 , x̃) with m̄1 (θ0 , x̃) = 0,
since fWiH |Xi (w|x) is continuous for (x, w) in a neighborhood of x = x̃ and w = (1, x̃′)θ0
for any such θ0 and x̃ by Assumption A.3.
benefits) while being independent of the distribution of offer wages.
Let Di denote an indicator variable that is 1 when Yi∗ is observed and 0 otherwise.
We observe (Xi , Yi , Di ) where Yi = Di · Yi∗ . Following Manski (1990), note that, letting
WiH = Yi · Di + Ȳ · (1 − Di ) and WiL = Yi · Di + Y̲ · (1 − Di ), where Y̲ and Ȳ are lower and
upper bounds on Yi∗ , we have WiL ≤ Yi∗ ≤ WiH with probability one. Letting θ = E(Yi∗ )
and using the fact that E(Yi∗ ) = E(Yi∗ |Xi ) a.s., we obtain our setup with m(Wi , Xi , θ) =
(WiH − θ, θ − WiL )′. This is a special case of the
interval regression model of Section A.2, with (θ, 01×dX ) playing the role of θ. That is, we
have the interval regression model with the slope parameter constrained to be zero. Thus,
if we consider a null value θ0 and a sequence of alternatives in the interval regression model
for which the slope parameter is zero, the results of Section A.2 apply immediately to give
primitive conditions for Assumption 3.5 (here Assumption 3.6 holds by construction and the
assumption that Yi∗ is bounded).
Note that E(WiH |Xi = x) = E(Yi∗ Di |Xi = x) + Ȳ · [1 − P (Di = 1|Xi = x)]. Thus,
a sufficient condition for E(WiH |Xi = x) to be twice differentiable (or Lipschitz) is for
P (Di = 1|Xi = x) and E(Yi∗ Di |Xi = x) to be twice differentiable (or Lipschitz). It is
also worth noting that cases where E(WiH |Xi = x) is minimized at the (possibly infinite)
boundary of the support of Xi are often of interest, and arise naturally in this setting (see,
e.g., Andrews and Schafgans 1998 and Heckman 1990). While Assumption 3.5 formally
precludes the possibility that the minimum of E(WiH |Xi = x) is taken at the boundary of
the support of Xi , such cases can be handled for certain forms of instrument based statistics
by transforming the support of Xi (see Section B.3 of Armstrong 2014b for an example of
this type of argument applied to instrument based KS statistics). We leave this extension
for future research.
under a minimax criterion. By results in Armstrong (2014b), it follows that, for certain
classes of alternatives defined by smoothness conditions, the variance weighted KS statistic
of Armstrong (2014b), Armstrong and Chan (2016) and Chetverikov (2012) is preferred to
the CvM statistics considered in this paper under a minimax criterion.
To formalize these ideas, the rest of this section considers classes P of underlying distri-
butions and uses the notation EP and Θ0 (P ) to denote expectations and the identified set
under a distribution P . In the results below, d(θ, θ̃) denotes the Euclidean distance kθ − θ̃k.
Theorem A.3. Let φCvM (θ) be one of the CvM tests defined in (10) or (11) with the critical
value satisfying Assumption 3.1, the class G or kernel function k satisfying Assumption 3.3,
and the measure µ satisfying Assumption 3.4 for the instrument case and the weighting
satisfying Assumption 3.7 for the kernel case. Let P be any class of distributions such that,
for some P ∗ ∈ P and θ0∗ on the boundary of Θ0 (P ∗ ), Assumptions 3.5 and 3.6 hold, and
either (a) θ0∗ is on the boundary of the convex hull of Θ0 (P ∗ ) or (b) for some a ∈ Rdθ and
a constant K, d(θ0∗ , θ0∗ + ar) ≤ K · d(θ0 , θ0∗ + ar) for all θ0 ∈ Θ0 (P ∗ ) and r small enough.
Then, for a small enough constant C∗ > 0,
\[
\limsup_{n \to \infty}\ \inf_{P \in \mathcal P}\ \inf_{\theta\ \mathrm{s.t.}\ d(\theta, \theta_0) \ge C_* r_n\ \mathrm{all}\ \theta_0 \in \Theta_0(P)} E_P\, \varphi_{CvM}(\theta) = 0,
\]
where rn is the rate for the given test in Section 4 with γ given in Assumption 3.5.
Proof. Under condition (b), the result is immediate from the results in the main text, since
the quantity in the display in the theorem is less than lim supn→∞ EP ∗ φCvM (θ0∗ +aC∗ rn K/kak)
for P ∗ , θ0∗ and a given in the theorem. The result follows since condition (a) implies condition
(b) with K = 1. To see this, note that, by the supporting hyperplane theorem, there exists a
vector a with ‖a‖ = 1 such that a′θ̃0 ≤ a′θ0∗ for all θ̃0 in the convex hull of Θ0 (P ∗ ). For this a
and any scalar r > 0 and θ̃0 ∈ Θ0 (P ∗ ),
\[
d(\theta_0^* + ar, \tilde\theta_0)^2 - d(\theta_0^* + ar, \theta_0^*)^2 = \|\theta_0^* + ar - \tilde\theta_0\|^2 - r^2 a'a
= \|\theta_0^* - \tilde\theta_0\|^2 + 2r a'(\theta_0^* - \tilde\theta_0) + r^2 a'a - r^2 a'a \ge \|\theta_0^* - \tilde\theta_0\|^2 \ge 0.
\]
is stated in the next theorem, which follows immediately from results in Armstrong (2014b)
(the results in Armstrong, 2014b consider a stronger notion of coverage and power).
For concreteness, let us consider a specific version of the inverse variance weighted
KS statistic considered in Armstrong (2014b). Let Tn,∞,(σ∨σn )−1 (θ) be given by (8) with
G = {x 7→ I(‖x − x̃‖ ≤ h) | x̃ ∈ RdX , h ∈ [0, ∞)} and ωj (θ, g) = {σ̂j (θ, g) ∨ [(log n)2 /n]}−1 . Let
φn,∞,(σ∨σn )−1 (θ) be given by (12) with this definition of Tn,∞,(σ∨σn )−1 (θ) and with ĉn,∞,(σ∨σn )−1
given by the constant K in Theorem 3.1 in Armstrong (2014b). In the interest of concrete-
ness, the above formulation uses certain conservative constants and tuning parameters in
defining the test φn,∞,(σ∨σn )−1 (θ). Less conservative and data driven methods for choos-
ing these constants have been considered by Armstrong and Chan (2016) and Chetverikov
(2012).
Theorem A.4. Suppose that P satisfies Assumptions 4.1, 4.3, 4.4 and 4.5 in Armstrong
(2014b), with γ taking the place of α in that paper. Then lim supn→∞ supP ∈P supθ0 ∈Θ0 (P )
EP φn,∞,(σ∨σn )−1 (θ0 ) = 0 and, for a large enough constant C ∗ ,
\[
\liminf_{n \to \infty}\ \inf_{P \in \mathcal P}\ \inf_{\theta\ \mathrm{s.t.}\ d(\theta, \theta_0) \ge C^* [(\log n)/n]^{\gamma/(d_X + 2\gamma)}\ \mathrm{all}\ \theta_0 \in \Theta_0(P)} E_P\, \varphi_{n,\infty,(\sigma\vee\sigma_n)^{-1}}(\theta) = 1.
\]
Proof. Since Assumptions 3.1-3.3 in Armstrong (2014b) follow by definition of the statis-
tic, the result follows from Theorem 4.2 in that paper, with Assumption 4.2(i) in Arm-
strong (2014b) following from Theorem 4.3 in that paper (since Assumption 4.6 and 4.2(ii)
in that paper hold by construction). For Cn the setwise confidence set constructed from
φn,∞,(σ∨σn )−1 (θ) in Armstrong (2014b), the quantity inside the limit in the display in the
theorem is equal to
\[
\inf_{P \in \mathcal P}\ \inf_{\theta\ \mathrm{s.t.}\ d(\theta, \theta_0) \ge C^* [(\log n)/n]^{\gamma/(d_X + 2\gamma)}\ \mathrm{all}\ \theta_0 \in \Theta_0(P)} P(\theta \notin C_n),
\]
where dH (A, B) = max{supa∈A inf b∈B d(a, b), supb∈B inf a∈A d(a, b)} is the Hausdorff distance
appearing in the rate results of that paper. This converges to 1 for large enough C ∗ by
Theorem 4.2 in Armstrong (2014b).
The classes P used in Theorem A.4 impose smoothness conditions on the conditional
mean along with a condition on the derivative of the conditional mean with respect to θ
(cases where the latter condition fails appear to favor KS statistics over CvM statistics as
well; see Section A.4 of Armstrong, 2014b). Note that the rate given above for the weighted
KS statistic φn,∞,(σ∨σn )−1 corresponds to the minimax L∞ rate for nonparametric testing
problems (Lepski and Tsybakov, 2000) and to the minimax rate for estimating a conditional
mean (Stone, 1982; see Menzel, 2010 for related results for estimating the identified set in a
setting similar to the one considered here). The results here show that the CvM statistics
considered here do not achieve this rate, and in fact have a minimax rate that is worse by
at least a polynomial amount.
I now turn to the interval regression model and consider primitive conditions. The next
two theorems show that certain classes of underlying distributions for the interval regression
model will always contain a distribution with a sequence of local alternatives that satisfy
the conditions of this paper. The conclusion of Theorem A.3 then follows immediately, since
the identified set is convex in the interval regression model. Theorem A.5 considers the case
where the constraints on the conditional mean embodied in P essentially only restrict the
conditional means of WiH and WiL to a Lipschitz smoothness class. Theorem A.6 considers
the smoother case where a bound is placed on the second derivative. For primitive conditions
for the conditions of Theorem A.4 in the interval regression model for the case where dX = 1
and γ = 1 or 2, see Armstrong (2014b), Section 6.2.
Theorem A.5. Let P be any class of underlying distributions for (Xi , WiH , WiL ) in the
interval regression model such that, for all P ∈ P, WiH and WiL are bounded and Xi has
a continuous density on its support XP . Suppose that, for some set X ⊆ RdX and some
interval [a, b], the following holds: for any function f : X → [a, b] such that
\[
|f(x) - f(\tilde x)| \le K \|x - \tilde x\| \quad \text{for all } x, \tilde x \in \mathcal X,
\]
there exists a P ∈ P such that EP (WiH |Xi ) = f (Xi ) and EP (WiL |Xi ) ≤ a almost surely,
and XP = X . Then there exists a P ∗ ∈ P and θ0∗ ∈ Θ0 (P ∗ ) that satisfies the conditions of
Theorem A.3, with γ = 1 and ψj,k (u) = K in Assumption 3.5.
Proof. Under these assumptions, there exists a distribution P ∈ P such that EP (WiH |Xi =
x) = b − K[(ε − ‖x − x0 ‖) ∨ 0] for some ε > 0 and x0 on the interior of the support of Xi ,
and EP (WiL |Xi = x) is bounded from above away from b − 2ε. For θ = (b − Kε, 0), this
satisfies the conditions of Theorem A.1.
Theorem A.6. Let P be any class of underlying distributions for (Xi , WiH , WiL ) in the
interval regression model such that, for all P ∈ P, WiH and WiL are bounded and Xi has
a continuous density on its support XP . Suppose that, for some set X ⊆ RdX and some
interval [a, b], for any function f : X → [a, b] such that
\[
\frac{d^2}{dt^2} f(x + tu) \le K
\]
for all u ∈ RdX with ‖u‖ = 1, there exists a P ∈ P such that EP (WiH |Xi ) = f (Xi ) and
EP (WiL |Xi ) ≤ a almost surely, and XP = X . Then there exists a P ∗ ∈ P and θ0∗ ∈ Θ0 (P ∗ )
that satisfies the conditions of Theorem A.3, with γ = 2 and ψj,k (u) = K/2 in Assumption
3.5.
Proof. The result follows by arguments similar to those for Theorem A.5, since a function
can be constructed for EP(WiH|Xi = x) that has a unique interior minimum with second
derivative matrix KI at its minimum and takes values between, say, (a + b)/2 and b.
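The proof only asserts that such a function exists; one concrete smooth choice, used here
purely as an illustration (whether it belongs to a particular class P, including the global
second-derivative bound, would need to be checked separately), is a Gaussian bump calibrated
so that the Hessian at the minimum is KI.

```python
import numpy as np

# Illustrative values (assumptions, not from the paper).
a, b, K = 0.0, 1.0, 2.0
x0 = np.array([0.5])
A = (b - a) / 2     # bump depth: keeps values between (a + b)/2 and b
c = K / (2 * A)     # calibrated so the second derivative at the minimum is K

def cond_mean(x):
    """A smooth conditional mean with a unique interior minimum at x0 and
    second derivative K there (the gamma = 2 case of Assumption 3.5)."""
    x = np.atleast_2d(x)
    r2 = np.sum((x - x0) ** 2, axis=1)
    return b - A * np.exp(-c * r2)

# Finite-difference check that the second derivative at x0 is close to K.
eps = 1e-4
d2 = (cond_mean(x0 + eps) + cond_mean(x0 - eps) - 2 * cond_mean(x0)) / eps ** 2
print(d2.item())  # approximately 2.0 = K
```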
References
Andrews, D. W. K. and M. M. A. Schafgans (1998): “Semiparametric Estimation of
the Intercept of a Sample Selection Model,” Review of Economic Studies, 65, 497–517.
Armstrong, T. B. and H. P. Chan (2016): “Multiscale adaptive inference on conditional
moment inequalities,” Journal of Econometrics, 194, 24–43.
Fan, J. (1993): “Local Linear Regression Smoothers and Their Minimax Efficiencies,” The
Annals of Statistics, 21, 196–216.
Heckman, J. (1990): “Varieties of Selection Bias,” The American Economic Review, 80,
313–318.
Kim, K. I. (2008): “Set estimation and inference with models characterized by conditional
moment inequalities,” unpublished working paper.
Lee, S., K. Song, and Y.-J. Whang (2013): “Testing functional inequalities,” Journal
of Econometrics, 172, 14–32.
——— (2015): “Testing for a General Class of Functional Inequalities,” arXiv:1311.1595.
[Figure: E(WiH|Xi = x) over x ∈ [0, 1], together with the regression lines (1, x)θ0 and (1, x)θn.]
[Figure panel: E(WH|X = x) together with the line θ1 + θ2x.]
[Figure panel: E(WH|X = x) together with the lines θa,1 + θa,2x and θb,1 + θb,2x.]
Figure 3: Case where Assumption 3.5 does not hold (θa ) and case where Assumption 3.5
holds with γ = 1 (θb )
θ1 − θ̄1   n = 100   n = 500   n = 1000
0.1        0.196     0.593     0.818
0.2        0.458     0.973     1
0.3        0.775     1         1
0.4        0.952     1         1
0.5        0.995     1         1

Table 1: Power for Unweighted Instrument CvM Test under Design 1
tn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0.207     0.503     0.729
           0.2        0.48      0.954     1
           0.3        0.759     1         1
           0.4        0.956     1         1
           0.5        0.997     1         1
n^(−1/3)   0.1        0.144     0.453     0.63
           0.2        0.378     0.939     0.998
           0.3        0.691     1         1
           0.4        0.886     1         1
           0.5        0.982     1         1
n^(−1/2)   0.1        0.156     0.358     0.502
           0.2        0.348     0.898     0.991
           0.3        0.649     0.999     1
           0.4        0.862     1         1
           0.5        0.974     1         1

Table 4: Power for Weighted Instrument KS Test under Design 1 (from Armstrong and Chan (2016))
hn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0.16      0.439     0.625
           0.2        0.343     0.92      0.997
           0.3        0.62      0.999     1
           0.4        0.883     1         1
           0.5        0.975     1         1
n^(−1/3)   0.1        0.095     0.266     0.481
           0.2        0.201     0.715     0.929
           0.3        0.382     0.976     1
           0.4        0.606     0.999     1
           0.5        0.809     1         1
n^(−1/2)   0.1        0         0.094     0.138
           0.2        0         0.255     0.404
           0.3        0         0.508     0.773
           0.4        0         0.812     0.982
           0.5        0         0.976     1

Table 6: Power for Kernel KS Test under Design 1
σn²             θ1 − θ̄1   n = 100   n = 500   n = 1000
(1/4)n^(−1/5)   0.1        0         0         0
                0.2        0         0         0
                0.3        0.003     0         0
                0.4        0.007     0.006     0.013
                0.5        0.04      0.118     0.294
(1/4)n^(−1/3)   0.1        0         0         0
                0.2        0         0         0
                0.3        0.001     0.001     0
                0.4        0.011     0.009     0.016
                0.5        0.032     0.139     0.371
(1/4)n^(−1/2)   0.1        0         0         0
                0.2        0.001     0         0
                0.3        0.003     0         0
                0.4        0.009     0.003     0.014
                0.5        0.034     0.114     0.288

Table 9: Power for Weighted Instrument CvM Test under Design 2
Table 10: Power for Weighted Instrument KS Test under Design 2 (from Armstrong and
Chan (2016))
hn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0         0         0
           0.2        0.001     0.002     0
           0.3        0.008     0.007     0.024
           0.4        0.012     0.108     0.369
           0.5        0.074     0.484     0.923
n^(−1/3)   0.1        0         0.001     0
           0.2        0.001     0         0
           0.3        0.003     0.009     0.011
           0.4        0.023     0.126     0.273
           0.5        0.062     0.519     0.848
n^(−1/2)   0.1        0         0         0
           0.2        0.001     0         0
           0.3        0.001     0         0
           0.4        0.005     0.007     0.023
           0.5        0.023     0.089     0.308

Table 11: Power for Kernel CvM Test under Design 2
θ1 − θ̄1   n = 100   n = 500   n = 1000
0.1        0.005     0         0.001
0.2        0.031     0.046     0.058
0.3        0.131     0.454     0.743
0.4        0.359     0.914     0.997
0.5        0.619     0.999     1

Table 13: Power for Unweighted Instrument CvM Test under Design 3
tn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0.034     0.064     0.12
           0.2        0.093     0.466     0.704
           0.3        0.272     0.869     0.99
           0.4        0.501     0.994     1
           0.5        0.767     1         1
n^(−1/3)   0.1        0.039     0.104     0.116
           0.2        0.112     0.429     0.64
           0.3        0.257     0.838     0.979
           0.4        0.463     0.994     1
           0.5        0.717     1         1
n^(−1/2)   0.1        0.03      0.083     0.087
           0.2        0.121     0.325     0.523
           0.3        0.24      0.762     0.967
           0.4        0.397     0.984     1
           0.5        0.669     1         1

Table 16: Power for Weighted Instrument KS Test under Design 3 (from Armstrong and Chan (2016))
hn         θ1 − θ̄1   n = 100   n = 500   n = 1000
n^(−1/5)   0.1        0.043     0.087     0.161
           0.2        0.099     0.487     0.722
           0.3        0.261     0.876     0.99
           0.4        0.48      0.995     1
           0.5        0.746     1         1
n^(−1/3)   0.1        0.037     0.086     0.122
           0.2        0.079     0.297     0.528
           0.3        0.164     0.646     0.912
           0.4        0.296     0.937     0.999
           0.5        0.507     0.996     1
n^(−1/2)   0.1        0         0.035     0.026
           0.2        0         0.087     0.118
           0.3        0         0.195     0.385
           0.4        0         0.427     0.703
           0.5        0         0.716     0.952

Table 18: Power for Kernel KS Test under Design 3
Supplement to “On the Choice of Test Statistic for
Conditional Moment Inequalities”
Timothy B. Armstrong
Yale University
July 9, 2021
This supplementary appendix contains proofs of the results in the main text as well as
auxiliary results. Section B contains auxiliary results used in the rest of this appendix. These
results are restatements or simple extensions of well known results on uniform convergence,
and do not constitute part of the main novel contribution of the paper. Section C of this
appendix derives critical values for CvM statistics with variance weights. Section D contains
proofs of the results in the body of the paper.
B Auxiliary Results
We state some results on uniform convergence that will be used in the proofs of the main
results. The results in this section are essentially restatements of results used in Armstrong
(2014b), which are in turn minor extensions of results in Pollard (1984). Throughout this
section, we consider iid observations Z1 , . . . , Zn and a sequence of classes of functions Fn on
the sample space. Let σ(f )2 = Ef (Zi )2 − (Ef (Zi ))2 and let σ̂(f )2 = En f (Zi )2 − (En f (Zi ))2 .
Lemma B.1. Suppose that, for some constants A and W, the classes of functions Fn satisfy
$$\sup_Q N(\varepsilon, \mathcal F_n, L_1(Q)) \le A\varepsilon^{-W},$$
where N is the covering number defined in Pollard (1984) and the supremum over Q is over
all probability measures. Let σn be a sequence of constants with σn√(n/ log n) → ∞. Then,
for some constant C,
$$\frac{\sqrt{n}}{\sqrt{\log n}}\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n} \le C$$
with probability approaching one and
$$\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)^2\vee\sigma_n^2} \overset{p}{\to} 0.$$
Proof. The first display follows by applying Lemma A.1 in Armstrong (2014b) to the se-
quence of classes of functions {f − EP f (Zi )|f ∈ Fn }, which satisfies the conditions of that
lemma by Lemma A.5 in Armstrong (2014b). The second display follows from the first
display since
$$\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)^2\vee\sigma_n^2} \le \frac{1}{\sigma_n}\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n} = \frac{\sqrt{\log n}}{\sigma_n\sqrt{n}}\cdot\frac{\sqrt{n}}{\sqrt{\log n}}\sup_{f\in\mathcal F_n}\frac{(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}$$
and √(log n)/(σn√n) → 0.
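As a numerical sanity check on the normalization in Lemma B.1 (my illustration, not part
of the paper), the following sketch computes the weighted supremum for the class of half-line
indicators ft(z) = 1{z ≤ t} with uniform data, for which σ(ft)² = t(1 − t); the supremum
divided by √(log n) should stay bounded as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_sup(n, sigma_n):
    """sup_t sqrt(n) |(E_n - E) f_t| / (sigma(f_t) v sigma_n) for the class
    of half-line indicators f_t(z) = 1{z <= t} with Z_i ~ Uniform(0, 1)."""
    z = np.sort(rng.uniform(size=n))
    # The empirical CDF jumps only at the order statistics, so the supremum
    # over t is attained at (left or right limits of) these points.
    left = np.arange(n) / n
    right = np.arange(1, n + 1) / n
    sigma = np.sqrt(z * (1 - z))          # sd of f_t(Z_i), Bernoulli(t) at t = z
    denom = np.maximum(sigma, sigma_n)
    dev = np.maximum(np.abs(left - z), np.abs(right - z))
    return np.sqrt(n) * np.max(dev / denom)

# sigma_n = n^(-1/3) satisfies sigma_n * sqrt(n / log n) -> infinity.
for n in (1_000, 10_000, 100_000):
    print(n, round(weighted_sup(n, n ** (-1 / 3)) / np.sqrt(np.log(n)), 3))
```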
Lemma B.2. Under the conditions of Lemma B.1,
$$\sup_{f\in\mathcal F_n}\left|\frac{\hat\sigma(f)\vee\sigma_n}{\sigma(f)\vee\sigma_n}-1\right| \overset{p}{\to} 0.$$
Proof. By continuity of t ↦ √t at 1, it suffices to show that supf∈Fn |(σ̂(f)² ∨ σn²)/(σ(f)² ∨ σn²) − 1| →p 0.
Note that
$$\hat\sigma(f)^2 - \sigma(f)^2 = (E_n-E)[f(Z_i)-Ef(Z_i)]^2 - [(E_n-E)f(Z_i)]^2. \tag{14}$$
Since σ[(f − Ef(Zi))²]² ≤ E[f(Zi) − Ef(Zi)]⁴ ≤ 4f̄²σ(f)², we have
$$\sup_{f\in\mathcal F_n}\frac{|(E_n-E)[f(Z_i)-Ef(Z_i)]^2|}{\sigma(f)^2\vee\sigma_n^2} \le \sup_{f\in\mathcal F_n}\frac{|(E_n-E)[f(Z_i)-Ef(Z_i)]^2|}{\sigma[(f-Ef(Z_i))^2]^2\vee\sigma_n^2}\cdot\left[(4\bar f^2)\vee 1\right]$$
which converges in probability to zero by Lemma B.1 (using Lemma A.5 in Armstrong,
2014b, to verify that the sequence of classes of functions {[f − Ef(Zi)]² | f ∈ Fn} satisfies the
conditions of the lemma). Since
$$\sup_{f\in\mathcal F_n}\frac{[(E_n-E)f(Z_i)]^2}{\sigma(f)^2\vee\sigma_n^2} \overset{p}{\to} 0$$
by Lemma B.1, the result now follows from this and the triangle inequality applied to
(14).
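Lemma B.2 can be checked numerically in the same toy setting (again an illustration under
assumed choices, not the paper's construction): for half-line indicators with uniform data,
σ̂(ft)² is the sample analogue of t(1 − t), and the weighted ratio should approach 1 uniformly
in t.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of Lemma B.2 for f_t(z) = 1{z <= t}, Z_i ~ Uniform(0, 1).
for n in (1_000, 5_000, 20_000):
    z = rng.uniform(size=n)
    sigma_n = n ** (-1 / 3)            # satisfies sigma_n * sqrt(n/log n) -> infinity
    t = np.linspace(0.0, 1.0, 501)
    ind = z[:, None] <= t[None, :]     # n-by-grid matrix of f_t(Z_i)
    phat = ind.mean(axis=0)
    sigma_hat = np.sqrt(phat * (1 - phat))   # sample sd of f_t(Z_i)
    sigma = np.sqrt(t * (1 - t))
    ratio = np.maximum(sigma_hat, sigma_n) / np.maximum(sigma, sigma_n)
    print(n, float(np.abs(ratio - 1).max()))  # should shrink toward 0
```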
Lemma B.3. Suppose that |f(Zi)| ≤ f̄ and that σn√n ≥ 1. Then
$$E\left|\frac{\sqrt{n}(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}\right|^p \le C_{p,\bar f}$$
for a constant Cp,f̄ that depends only on p and f̄.

Proof. By a Bernstein-type inequality (using |f(Zi)| ≤ f̄ and σn√n ≥ 1), for t ≥ 1,
$$P\left(\left|\frac{\sqrt{n}(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}\right|^p > t\right)$$
is bounded by exp(−t^{1/p}/(2 + (2/3)·2f̄)). Thus,
$$E\left|\frac{\sqrt{n}(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}\right|^p = \int_0^\infty P\left(\left|\frac{\sqrt{n}(E_n-E)f(Z_i)}{\sigma(f)\vee\sigma_n}\right|^p > t\right)dt \le 1 + \int_1^\infty \exp\left(-\frac{t^{1/p}}{2+\frac{2}{3}\cdot 2\bar f}\right)dt,$$
and the last integral is finite (substituting s = t^{1/p} gives p∫₁^∞ s^{p−1}e^{−s/(2+(4/3)f̄)} ds < ∞),
giving the result.
C Critical Values for CvM Statistics with Variance Weights

Letting σn go to zero generally decreases the rate of convergence to √(n/ log n) for the KS
statistic Tn,∞,ω. In contrast to the KS case, CvM statistics do not behave much differently
if the variance is allowed to go to zero, although some additional arguments are needed to
show this.
To deal with the behavior of the CvM statistic for small variances, I place the following
condition on the measure over which the sample means are integrated.

Assumption C.1. For each θ ∈ Θ0 and each j, µ({g ∈ G | σj(θ, g) ≤ δ}) → 0 as δ → 0.

This condition will hold for the choices of G and µ used in the body of the paper, and
also allows for more general choices of G and µ. I also make the following assumption on the
complexity of the class of functions G, which is also satisfied by the class used in the paper.
Assumption C.2. For some constants A and W, the covering number N(ε, G, L1(Q)) defined
in Pollard (1984) satisfies
$$\sup_Q N(\varepsilon, \mathcal G, L_1(Q)) \le A\varepsilon^{-W},$$
where the supremum over Q is over all probability measures.
Assumption C.3. For some nonrandom constant Ȳ, |mj(Wi, θ)| ≤ Ȳ for each j with
probability one.
Theorem C.1. Suppose that σn√(n/ log n) → ∞ and that Assumptions C.1, C.2 and C.3
hold. Then, for θ ∈ Θ0,
$$n^{1/2}T_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta) \le \left[\int\sum_{j=1}^{d_Y}\left|\frac{\sqrt{n}(E_n-E)m_j(W_i,\theta)g(X_i)}{\hat\sigma_j(\theta,g)\vee\sigma_n}\right|_-^p d\mu(g)\right]^{1/p} \overset{d}{\to} \left[\int\sum_{j=1}^{d_Y}\left|G_j(g,\theta)/\sigma_j(\theta,g)\right|_-^p d\mu(g)\right]^{1/p},$$
where G(·, θ) = (G1(·, θ), . . . , GdY(·, θ)) is a mean-zero Gaussian process with covariance
function
$$\rho(g,\tilde g) = E[m(W_i,\theta)g(X_i) - Em(W_i,\theta)g(X_i)][m(W_i,\theta)\tilde g(X_i) - Em(W_i,\theta)\tilde g(X_i)]'.$$
Proof. The result with the integral truncated to the region where σj(θ, g) > δ for all j
follows immediately from standard arguments using functional central limit theorems. This,
along with Lemma C.1 below gives, letting Zn(δ) be the integral truncated in this way and
Z(δ) be the limiting variable with this truncation,
$$P(Z_n(\delta)\le t-\varepsilon)-\varepsilon \le P(Z_n\le t) \le P(Z_n(\delta)\le t+\varepsilon)+\varepsilon$$
for large enough n for any ε > 0. The lim inf of the left hand side is greater than P(Z(δ) ≤
t − 2ε) − 2ε, and the lim sup of the right hand side is less than P(Z(δ) ≤ t + ε) + ε. We
can bound P(Z(δ) ≤ t − 2ε) − 2ε from below by P(Z ≤ t − 2ε) − 2ε, and we can bound
P(Z(δ) ≤ t + ε) + ε from above by P(Z ≤ t + 2ε) + 2ε by making δ small enough, by a version
of Lemma C.1 for the limiting process. Since ε was arbitrary, this gives the result.
The proof of the theorem above uses the following auxiliary lemma, which shows that
functions g with low enough variance have little effect on the integral asymptotically.

Lemma C.1. Fix j and suppose that Assumptions C.1, C.2 and C.3 hold, and that the null
hypothesis holds under θ. Then, for every ε > 0, there exists a δ > 0 such that
$$P\left(\left[\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\hat\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)\right]^{1/p} > \varepsilon\right) \le \varepsilon.$$
Proof. We have
$$E\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g) = \int_{\sigma_j(\theta,g)\le\delta}E\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)$$
$$\le \int_{\sigma_j(\theta,g)\le\delta}E\left|\sqrt{n}(E_n-E)m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|^p d\mu(g) \le \mu(\{g\,|\,\sigma_j(\theta,g)\le\delta\})\cdot C_{p,\bar Y}$$
for Cp,Ȳ given in Lemma B.3. Applying Markov’s inequality and using Assumption C.1, it
follows that, for any ε > 0, there exists a δ such that
$$P\left(\left[\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)\right]^{1/p} > \varepsilon/2\right) \le \varepsilon/2.$$
The result follows since
$$\left[\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\hat\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)\right]^{1/p} \le \left[\int_{\sigma_j(\theta,g)\le\delta}\left|\sqrt{n}\,E_n m_j(W_i,\theta)g(X_i)/(\sigma_j(\theta,g)\vee\sigma_n)\right|_-^p d\mu(g)\right]^{1/p}\cdot\sup_g\frac{\sigma_j(\theta,g)\vee\sigma_n}{\hat\sigma_j(\theta,g)\vee\sigma_n}$$
and supg(σj(θ, g) ∨ σn)/(σ̂j(θ, g) ∨ σn) ≤ 2 with probability approaching one by Lemma
B.2.
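Before turning to the proofs, here is a minimal sketch of how the variance-weighted CvM
statistic appearing in Theorem C.1 can be computed; the choices dY = 1, a finite grid of
indicator instruments, and µ uniform over that grid are illustrative assumptions rather than
the paper's exact G and µ.

```python
import numpy as np

def cvm_stat(m_vals, x_vals, p=1, sigma_n=0.05):
    """Sketch of T_{n,p,(sigma-hat v sigma_n)^{-1},mu}(theta) for d_Y = 1: the
    negative part of each studentized sample moment E_n m(W_i, theta) g(X_i),
    integrated (here: averaged) over instruments g(x) = 1{|x - c| <= h}."""
    grid = [(c, h) for h in (0.1, 0.2, 0.4) for c in np.linspace(0.0, 1.0, 21)]
    total = 0.0
    for c, h in grid:
        g = (np.abs(x_vals - c) <= h).astype(float)
        mg = m_vals * g
        stud = mg.mean() / max(mg.std(), sigma_n)  # E_n m g / (sigma-hat v sigma_n)
        total += max(-stud, 0.0) ** p              # negative part |.|_-^p
    return (total / len(grid)) ** (1.0 / p)

# Illustrative data satisfying the null E[m(W_i, theta) | X_i] >= 0.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=n)
m = 0.1 + rng.normal(scale=0.2, size=n)
print(n ** 0.5 * cvm_stat(m, x))  # n^{1/2} T_n, the scaling in Theorem C.1
```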
D Proofs
This section contains proofs of the results in the body of the paper. The proofs use a number
of auxiliary lemmas, which are stated and proved first. In the following, θn is always assumed
to be a sequence converging to θ0 .
Lemma D.1. Under the assumptions of Theorem 4.5, there exists a constant C such that
$$\sup_{x\in\mathbb{R}^{d_X}}\frac{\sqrt{n}}{\sqrt{h^{d_X}\log n}}\left|(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)\right| \le C$$
and
$$\sup_{x\in\mathbb{R}^{d_X}}\frac{\sqrt{n}}{\sqrt{h^{d_X}\log n}}\left|(E_n-E)k((X_i-x)/h)\right| \le C$$
with probability approaching one, and
$$\sup_{\{x\,|\,\omega_j(\theta_n,x)>0\ \text{some}\ j\}}\left|\frac{E_n k((X_i-x)/h)}{E k((X_i-x)/h)}-1\right| \overset{p}{\to} 0.$$

Proof. The first two displays follow from Lemma B.1 after noting that
$$\operatorname{var}(m(W_i,\theta_n)k((X_i-x)/h)) \le \bar{Y}^2\bar{k}^2\bar{f}_X B^{d_X}h^{d_X},$$
where k̄ and f̄X are bounds for k and fX, and B is such that k(u) = 0 whenever max1≤j≤dX |uj| >
B/2, and similarly for var(k((Xi − x)/h)), and that √(h^{dX})·√(n/ log n) → ∞ under these
assumptions.

For the last display, note that, for x such that ωj(θn, x) > 0 for some j, Ek((Xi − x)/h) ≥
$\underline{f}_X$h^{dX} ∫k(u) du for large enough n, where $\underline{f}_X$ is a lower bound for the density of Xi (which
can be taken to be ε in Assumption 3.7). Thus, with probability approaching one, the second
display gives
$$\sup_{\{x\,|\,\omega_j(\theta_n,x)>0\ \text{some}\ j\}}\left|\frac{E_n k((X_i-x)/h)}{E k((X_i-x)/h)}-1\right| \le \frac{C\sqrt{h^{d_X}\log n/n}}{\underline{f}_X h^{d_X}\int k(u)\,du} = \frac{C}{\underline{f}_X\int k(u)\,du}\sqrt{\frac{\log n}{nh^{d_X}}} \to 0,$$
which gives the last display.
Let
$$\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta) = \left[\int_{h>0}\int_x\sum_{j=1}^{d_Y}\left|\frac{E_n m(W_i,\theta)k((X_i-x)/h)}{\sigma_j(\theta,x,h)\vee\sigma_n}\right|_-^p f_\mu(x,h)\,dx\,dh\right]^{1/p}$$
and let
$$\tilde{T}_{n,p,\mathrm{kern}}(\theta) = \left[\int_x\sum_{j=1}^{d_Y}\left|\frac{E_n m(W_i,\theta)k((X_i-x)/h)}{E k((X_i-x)/h)}\,\omega_j(\theta,x)\right|_-^p dx\right]^{1/p}.$$
The notation σj(θ, x̃, h) is used to denote σj(θ, g) where g(x) = k((x − x̃)/h).
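For comparison with the weighted CvM sketch above, the kernel statistic can be computed
along the following lines; the uniform kernel, the weight ωj ≡ 1, dX = dY = 1, the grid
approximation to the integral over x, and the use of the sample analogue En k in the
denominator are all illustrative assumptions.

```python
import numpy as np

def kern_stat(m_vals, x_vals, h, p=1):
    """Sketch of the kernel statistic T_{n,p,kern}(theta) for d_X = d_Y = 1
    with a uniform kernel k(u) = 1{|u| <= 1/2}: the negative part of the
    local average E_n m k / E_n k, integrated over x on a grid."""
    xs = np.linspace(x_vals.min(), x_vals.max(), 200)
    total = 0.0
    for c in xs:
        k = (np.abs((x_vals - c) / h) <= 0.5).astype(float)
        if k.mean() > 0:
            ratio = (m_vals * k).mean() / k.mean()  # local sample mean of m
            total += max(-ratio, 0.0) ** p          # negative part |.|_-^p
    dx = xs[1] - xs[0]
    return (total * dx) ** (1.0 / p)

# Illustrative data with conditional mean (x - 1/2)^2 >= 0, so the null holds.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(size=n)
m = (x - 0.5) ** 2 + rng.normal(scale=0.1, size=n)
print(kern_stat(m, x, h=n ** (-1 / 5)))
```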
Lemma D.2. Under the assumptions of Theorem 4.3 (for the first display) and Theorem 4.5
(for the second display),
$$\sqrt{n}\,T_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) = \sqrt{n}\,\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n)(1+o_P(1))$$
and
$$(nh^{d_X})^{1/2}T_{n,p,\mathrm{kern}}(\theta_n) = (nh^{d_X})^{1/2}\tilde{T}_{n,p,\mathrm{kern}}(\theta_n)(1+o_P(1)).$$

Proof. We have
$$\left|\sqrt{n}\,T_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) - \sqrt{n}\,\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n)\right| \le \sqrt{n}\,\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n)\cdot\sup_{x,j}\left|\frac{\sigma_j(\theta_n,x,h)\vee\sigma_n}{\hat\sigma_j(\theta_n,x,h)\vee\sigma_n}-1\right|,$$
and the supremum on the right converges in probability to zero by Lemma B.2. Similarly, for
the second display, the same argument applies with En k((Xi − x)/h) and Ek((Xi − x)/h)
playing the roles of σ̂j(θn, x, h) ∨ σn and σj(θn, x, h) ∨ σn, using the last display of Lemma
D.1.
Let
$$\tilde{\tilde{T}}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta) = \left[\int_{h>0}\int_x\sum_{j=1}^{d_Y}\left|\frac{E m(W_i,\theta)k((X_i-x)/h)}{\sigma_j(\theta,x,h)\vee\sigma_n}\right|_-^p f_\mu(x,h)\,dx\,dh\right]^{1/p}$$
and let
$$\tilde{\tilde{T}}_{n,p,\mathrm{kern}}(\theta) = \left[\int_x\sum_{j=1}^{d_Y}\left|\frac{E m(W_i,\theta)k((X_i-x)/h)}{E k((X_i-x)/h)}\,\omega_j(\theta,x)\right|_-^p dx\right]^{1/p}.$$
Also define
$$\tilde{\tilde{T}}_{n,p,1,\mu}(\theta) = \left[\int_{h>0}\int_x\sum_{j=1}^{d_Y}\left|E m(W_i,\theta)k((X_i-x)/h)\right|_-^p f_\mu(x,h)\,dx\,dh\right]^{1/p}.$$
Lemma D.3. Under Assumptions 3.3, 3.4, 3.5 and 3.6,
$$\sqrt{n}\,\tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) = \sqrt{n}\,\tilde{\tilde{T}}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) + o_P(1)$$
and
$$\sqrt{n}\,T_{n,p,1,\mu}(\theta_n) = \sqrt{n}\,\tilde{\tilde{T}}_{n,p,1,\mu}(\theta_n) + o_P(1).$$
Proof. Let σ̃n → 0 be such that σ̃n√(n/ log n) → ∞ and σ̃n/σn → 0 (i.e. σ̃n is chosen to be
much smaller than σn, but such that the assumptions still hold for σ̃n). Note that
$$\sqrt{n}\left|\tilde{\tilde{T}}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n) - \tilde{T}_{n,p,(\hat\sigma\vee\sigma_n)^{-1},\mu}(\theta_n)\right| \le \left[\int\int_{(x,h)\in\hat G}\sum_{j=1}^{d_Y}\left|\frac{\sqrt{n}(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p f_\mu(x,h)\,dx\,dh\right]^{1/p}$$
where Ĝ = {(x, h) | Em(Wi, θn)k((Xi − x)/h) < 0 or En m(Wi, θn)k((Xi − x)/h) < 0}.

For any ε > 0, there exists an η > 0 such that, for h > ε and large enough n,
$$E m_j(W_i,\theta_n)k((X_i-x)/h) \ge \eta E k((X_i-x)/h) \ge \eta\cdot\operatorname{var}[m_j(W_i,\theta_n)k((X_i-x)/h)]\cdot\frac{1}{\bar k\bar Y^2},$$
where the last inequality follows since
$$\operatorname{var}[m_j(W_i,\theta_n)k((X_i-x)/h)] \le \bar Y^2 E[k((X_i-x)/h)^2] \le \bar Y^2\bar k\,E k((X_i-x)/h).$$
It follows that En mj(Wi, θn)k((Xi − x)/h) is positive for all (x, h) with h > ε and
σj(θn, x, h) ≥ σ̃n with probability approaching one by Lemma B.1.

From this and the fact that Em(Wi, θn)k((Xi − x)/h) ≥ 0 for all h > ε for large enough
n, it follows that Ĝ ⊆ {(x, h) | h ≤ ε or σj(θn, x, h) < σ̃n} with probability approaching one.
Note that
$$E\int\int_{\{(x,h)|h\le\varepsilon\}}\sum_{j=1}^{d_Y}\left|\frac{\sqrt{n}(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p f_\mu(x,h)\,dx\,dh = \int\int_{\{(x,h)|h\le\varepsilon\}}\sum_{j=1}^{d_Y}E\left|\frac{\sqrt{n}(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p f_\mu(x,h)\,dx\,dh$$
by Fubini’s theorem, and this can be made arbitrarily small by making ε small by Lemma
B.3 and Assumption 3.4. Similarly,
$$E\int\int_{\{(x,h)|\sigma_j(\theta_n,x,h)<\tilde\sigma_n\ \text{some}\ j\}}\sum_{j=1}^{d_Y}\left|\frac{\sqrt{n}(E_n-E)m(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p f_\mu(x,h)\,dx\,dh$$
$$\le \mu(\mathbb{R}^{d_X}\times[0,\infty))\cdot\sup_{\{(x,h,j)|\sigma_j(\theta_n,x,h)<\tilde\sigma_n\}}E\left|\frac{\sqrt{n}(E_n-E)m_j(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\sigma_n}\right|^p$$
$$= \mu(\mathbb{R}^{d_X}\times[0,\infty))\cdot\sup_{\{(x,h,j)|\sigma_j(\theta_n,x,h)<\tilde\sigma_n\}}E\left|\frac{\sqrt{n}(E_n-E)m_j(W_i,\theta_n)k((X_i-x)/h)}{\sigma_j(\theta_n,x,h)\vee\tilde\sigma_n}\right|^p\left(\frac{\tilde\sigma_n}{\sigma_n}\right)^p,$$
which converges to zero by Lemma B.3 (since σ̃n/σn → 0). Using this and Markov’s
inequality, it follows that √n|T̃̃n,p,(σ̂∨σn)−1,µ(θn) − T̃n,p,(σ̂∨σn)−1,µ(θn)| can be made arbitrarily
small with probability approaching one by making ε small. This gives the first display of
the lemma.

The second display follows by the same argument with σn set to the supremum of
σj(θ, x, h) over (x, h) on the support of µ, θ in a neighborhood of θ0, and all j.
Lemma D.4. Under Assumptions 3.3, 3.4, 3.5, 3.6 and 3.7,
$$(nh^{d_X})^{1/2}\tilde{T}_{n,p,\mathrm{kern}}(\theta_n) = (nh^{d_X})^{1/2}\tilde{\tilde{T}}_{n,p,\mathrm{kern}}(\theta_n) + o_P(1).$$

Proof. For any ε > 0, there is an η > 0 such that Emj(Wi, θn)k((Xi − x)/h) > ηEk((Xi −
x)/h) for all x ∈ X̄(ε), where X̄(ε) is the set of x with ‖x − xk‖ ≥ ε for all k = 1, . . . , ℓ and
ωj(θn, x) > 0 for some j. Thus, arguing as in Lemma D.3 and using Lemma D.1, it follows
that, with probability approaching one, En mj(Wi, θn)k((Xi − x)/h) ≥ 0 for all x ∈ X̄(ε).
Using Markov’s inequality and Fubini’s theorem along with the fact that ∫_{x∉X̄(ε)} ωj(θn, x) dx
can be made arbitrarily small by making ε small, the result follows so long as
$$E\left|\frac{\sqrt{nh^{d_X}}(E_n-E)m_j(W_i,\theta_n)k((X_i-x)/h)}{E k((X_i-x)/h)}\right|^p$$
can be bounded uniformly over x such that ωj(θn, x) > 0. But this follows from Lemma B.3,
since, by Assumptions 3.3 and 3.7, for some δ > 0, Ek((Xi − x)/h) ≥ δh^{dX} for all x with
ωj(θn, x) > 0.
For the following lemma, recall that wj(xk) = (s²j(xk, θ0) fX(xk) ∫k(u)² du)^{−1/2}, where
s²j(x, θ) = var(mj(Wi, θ)|Xi = x).

Lemma D.5. Under Assumptions 3.3, 3.4, 3.5 and 3.6, for k = 1, . . . , ℓ and any sequence
εn → 0,
$$\sup_{\|(x,h)-(x_k,0)\|\le\varepsilon_n}\left|h^{-d_X/2}\sigma_j(\theta_n,x,h) - w_j^{-1}(x_k)\right| \to 0.$$
Proof. By differentiability of the square root function at wj^{−2}(xk), it suffices to show that
sup_{‖(x,h)−(xk,0)‖≤εn} |h^{−dX}σj²(θn, x, h) − wj^{−2}(xk)| → 0. Note that
$$h^{-d_X}\sigma_j^2(\theta_n,x,h) = h^{-d_X}E[m(W_i,\theta_n)^2k((X_i-x)/h)^2] - h^{-d_X}\{E[m(W_i,\theta_n)k((X_i-x)/h)]\}^2$$
$$= h^{-d_X}\int s_j^2(\tilde x,\theta_n)k((\tilde x-x)/h)^2 f_X(\tilde x)\,d\tilde x + h^{-d_X}\int E[m(W_i,\theta_n)|X_i=\tilde x]^2k((\tilde x-x)/h)^2 f_X(\tilde x)\,d\tilde x$$
$$\quad - h^{-d_X}\left(\int E[m(W_i,\theta_n)|X_i=\tilde x]k((\tilde x-x)/h)f_X(\tilde x)\,d\tilde x\right)^2.$$
By Assumption 3.3 and part (iii) of Assumption 3.5, the second term is bounded by a constant
times sup_{‖(x,h)−(xk,0)‖≤εn} E[m(Wi, θn)|Xi = x]², which converges to zero by continuity of
E[m(Wi, θ)|Xi = x] at (θ0, xk). By Assumptions 3.3 and 3.5, the third term is bounded by
a constant times h^{−dX}·h^{2dX} ≤ εn^{dX} uniformly over (x, h) with ‖(x, h) − (xk, 0)‖ ≤ εn. Using
a change of variables, the first term can be written as ∫ s²j(x + uh, θn)k(u)² fX(x + uh) du,
which converges to wj^{−2}(xk) uniformly over ‖(x, h) − (xk, 0)‖ ≤ εn by continuity of sj and
fX, and by Assumption 3.3.
Lemma D.6. Suppose that Assumptions 3.3, 3.4, 3.5, 3.6 and 3.7 hold, and that ∫k(u) du =
1. Then
$$\sup_{\|x-x_k\|\le\varepsilon}\left|h^{-d_X}E k((X_i-x)/h) - f_X(x_k)\right| \to 0$$
as h → 0 and ε → 0 for k = 1, . . . , ℓ.
Proof. We have
$$h^{-d_X}E k((X_i-x)/h) = h^{-d_X}\int k((\tilde x-x)/h)f_X(\tilde x)\,d\tilde x = \int k(u)f_X(x+uh)\,du,$$
and ∫k(u) du = 1 and fX(x + uh) converges to fX(xk) uniformly over ‖x − xk‖ ≤ ε and u
in the support of k as ε → 0 and h → 0.
For notational convenience in the following lemmas, define, for (j, k) with j ∈ J(k),
$$\tilde\psi_{j,k}(t) = \bar m_j(\theta_0, x_k + t)/\|t\|^\gamma,$$
so that, by Assumption 3.5,
$$\sup_{\|x-x_k\|<\delta}\left|\tilde\psi_{j,k}(x-x_k) - \psi_{j,k}\left(\frac{x-x_k}{\|x-x_k\|}\right)\right| \to 0$$
as δ → 0.
Lemma D.7. Under Assumptions 3.3, 3.4, 3.5 and 3.6, for any a ∈ Rdθ,
$$r^{-[d_X+p(d_X+\gamma)+1]/\gamma}\int\int\sum_{j=1}^{d_Y}\left|E m_j(W_i,\theta_0+ra)k((X_i-\tilde{x})/h)\right|_-^p f_\mu(\tilde{x},h)\,d\tilde{x}\,dh \overset{r\to 0}{\to} \sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in\tilde{J}(k)}\lambda_{\mathrm{bdd}}(a,j,k,p).$$
Proof. For simplicity, assume that γ(j, k) = γ for all j, k. The general result follows from
applying the same arguments to show that areas of (x, h) near (j, k) with γ(j, k) < γ do not
matter asymptotically.

For C large enough, the integrand will be zero unless max{‖x̃ − xk‖, h} < Cr^{1/γ} for some
k with j ∈ J(k). Thus, it suffices to prove the lemma for, fixing (j, k) with j ∈ J(k),
$$\int\int\left|E m_j(W_i,\theta_0+ra)k((X_i-\tilde x)/h)\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh = \int\int\left|\int\bar m_j(\theta_0+ra,x)k((x-\tilde x)/h)f_X(x)\,dx\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh$$
$$= \int\int\left|\int[\|x-x_k\|^\gamma\tilde\psi_{j,k}(x-x_k) + \bar m_{\theta,j}(\theta^*(r),x)ra]k((x-\tilde x)/h)f_X(x)\,dx\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh$$
where the integrals are taken over ‖x̃ − xk‖ < Cr^{1/γ}, h < Cr^{1/γ} and θ∗(r) is between θ0 and
θ0 + ra (we suppress the dependence of θ∗(r) on x in the notation). Using the change of
variables u = (x − xk)/r^{1/γ}, v = (x̃ − xk)/r^{1/γ}, h̃ = h/r^{1/γ}, this is equal to
$$\int\int\left|\int[\|r^{1/\gamma}u\|^\gamma\tilde\psi_{j,k}(r^{1/\gamma}u) + \bar m_{\theta,j}(\theta^*(r),x_k+r^{1/\gamma}u)ra]k((u-v)/\tilde h)f_X(x_k+r^{1/\gamma}u)r^{d_X/\gamma}\,du\right|_-^p f_\mu(x_k+r^{1/\gamma}v,r^{1/\gamma}\tilde h)\,r^{d_X/\gamma}\,dv\,r^{1/\gamma}\,d\tilde h$$
$$= r^{[d_X+1+p(\gamma+d_X)]/\gamma}\int\int\left|\int[\|u\|^\gamma\tilde\psi_{j,k}(r^{1/\gamma}u) + \bar m_{\theta,j}(\theta^*(r),x_k+r^{1/\gamma}u)a]k((u-v)/\tilde h)f_X(x_k+r^{1/\gamma}u)\,du\right|_-^p f_\mu(x_k+r^{1/\gamma}v,r^{1/\gamma}\tilde h)\,dv\,d\tilde h$$
where the integrals are taken over ‖v‖ < C, h̃ < C. The result now follows from the
dominated convergence theorem (here, and in subsequent results involving sequences of the
form ∫|∫gn(z, w) dµ(z)|^p_− dν(w), the dominated convergence theorem is applied first to the
inner integral and then to the outer integral).
Lemma D.8. Under the conditions of Theorem 4.3, for any a ∈ Rdθ,
$$r^{-[d_X+p(d_X/2+\gamma)+1]/\gamma}\int\int\sum_{j=1}^{d_Y}\left|E m_j(W_i,\theta_0+ra)k((X_i-\tilde{x})/h)/(\sigma_j(\theta_0+ra,\tilde{x},h)\vee\sigma_n)\right|_-^p f_\mu(\tilde{x},h)\,d\tilde{x}\,dh \le \sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in\tilde{J}(k)}\lambda_{\mathrm{var}}(a,j,k,p) + o(1)$$
for any r = rn → 0. If, in addition, σn rn^{−dX/(2γ)} → 0, the above display will hold with the
inequality replaced by equality.
Proof. As in the previous lemma, the following argument assumes, for simplicity, that
γ(j, k) = γ for all (j, k) with j ∈ J(k). Let s̃j(r, x̃, h) = σj(θ0 + ra, x̃, h)/h^{dX/2}. As be-
fore, for large enough C, the integrand will be zero unless max{‖x̃ − xk‖, h} < Cr^{1/γ} for
some k with j ∈ J(k). Thus, it suffices to prove the result for, fixing (j, k) with j ∈ J(k),
$$\int\int\left|E m_j(W_i,\theta_0+ra)k((X_i-\tilde x)/h)\left(h^{-d_X/2}\tilde s_j^{-1}(r,\tilde x,h)\wedge\sigma_n^{-1}\right)\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh$$
$$= \int\int\left|\int[\|x-x_k\|^\gamma\tilde\psi_{j,k}(x-x_k) + \bar m_{\theta,j}(\theta^*(r),x)ra]k((x-\tilde x)/h)\left(h^{-d_X/2}\tilde s_j^{-1}(r,\tilde x,h)\wedge\sigma_n^{-1}\right)f_X(x)\,dx\right|_-^p f_\mu(\tilde x,h)\,d\tilde x\,dh$$
where the integral is taken over ‖x̃ − xk‖ < Cr^{1/γ}, h < Cr^{1/γ} and θ∗(r) is between θ0 and
θ0 + ra. Using the change of variables u = (x − xk)/r^{1/γ}, v = (x̃ − xk)/r^{1/γ}, h̃ = h/r^{1/γ}, this
is equal to
$$\int\int\left|\int r[\|u\|^\gamma\tilde\psi_{j,k}(r^{1/\gamma}u) + \bar m_{\theta,j}(\theta^*(r),x_k+ur^{1/\gamma})a]k((u-v)/\tilde h)\left(\left((r^{1/\gamma}\tilde h)^{-d_X/2}\tilde s_j^{-1}(r,x_k+vr^{1/\gamma},r^{1/\gamma}\tilde h)\right)\wedge\sigma_n^{-1}\right)f_X(x_k+ur^{1/\gamma})r^{d_X/\gamma}\,du\right|_-^p f_\mu(x_k+vr^{1/\gamma},r^{1/\gamma}\tilde h)\,r^{d_X/\gamma}\,dv\,r^{1/\gamma}\,d\tilde h$$
$$= r^{[p(\gamma+d_X/2)+d_X+1]/\gamma}\int\int\left|\int[\|u\|^\gamma\tilde\psi_{j,k}(r^{1/\gamma}u) + \bar m_{\theta,j}(\theta^*(r),x_k+ur^{1/\gamma})a]k((u-v)/\tilde h)\left(\left(\tilde h^{-d_X/2}\tilde s_j^{-1}(r,x_k+vr^{1/\gamma},r^{1/\gamma}\tilde h)\right)\wedge\left(r^{d_X/(2\gamma)}\sigma_n^{-1}\right)\right)f_X(x_k+ur^{1/\gamma})\,du\right|_-^p f_\mu(x_k+vr^{1/\gamma},r^{1/\gamma}\tilde h)\,dv\,d\tilde h,$$
where the integral is taken over ‖v‖ < C, h̃ < C. By Lemma D.5 and the dominated
convergence theorem, this converges to λvar(a, j, k, p) if σn rn^{−dX/(2γ)} → 0. If σn rn^{−dX/(2γ)} does
not converge to zero, the above display is bounded from above by the same expression with
σn^{−1} replaced by ∞.
Lemma D.9. Under the conditions of Theorem 4.5, for any a ∈ Rdθ,
$$r^{-(\gamma p+d_X)/\gamma}\int\sum_{j=1}^{d_Y}\left|[E m_j(W_i,\theta_0+ra)k((X_i-x)/h)/E k((X_i-x)/h)]\,\omega_j(\theta_0+ra,x)\right|_-^p dx \to \sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in J(k)}\lambda_{\mathrm{kern}}(a,c_{h,r},j,k,p)$$
as r → 0 with h/r^{1/γ} → ch,r for ch,r > 0. If the limit is zero for (a, ch,r) in a neighborhood
of the given values, the sequence will be exactly equal to zero for small enough r.

If h/r^{1/γ} → 0, then, as r → 0,
$$r^{-(\gamma p+d_X)/\gamma}\int\sum_{j=1}^{d_Y}\left|[E m_j(W_i,\theta_0+ra)k((X_i-x)/h)/E k((X_i-x)/h)]\,\omega_j(\theta_0+ra,x)\right|_-^p dx \to \sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in J(k)}\tilde\lambda_{\mathrm{kern}}(a,j,k,p).$$
Proof. As before, this proof treats the case where J(k) = J̃(k) for ease of exposition. As
with the proofs of Lemmas D.7 and D.8, it suffices to prove the result for, fixing (j, k) with
j ∈ J(k),
$$\int\left|[E m_j(W_i,\theta_0+ra)k((X_i-\tilde x)/h)/E k((X_i-\tilde x)/h)]\,\omega_j(\theta_0+ra,\tilde x)\right|_-^p d\tilde x$$
$$= \int\left|\int[\|x-x_k\|^\gamma\tilde\psi_{j,k}(x-x_k) + \bar m_{\theta,j}(\theta^*(r),x)ra]k((x-\tilde x)/h)f_X(x)\,dx\;h^{-d_X}b(\tilde x)\,\omega_j(\theta_0+ra,\tilde x)\right|_-^p d\tilde x$$
where the integral is over ‖x̃ − xk‖ < Cr^{1/γ} and b(x̃) ≡ h^{dX}/Ek((Xi − x̃)/h) converges
to (fX(xk))^{−1} uniformly over x̃ in any shrinking neighborhood of xk by Lemma D.6. Let
h̃ = h/r^{1/γ}. By the change of variables u = (x − xk)/r^{1/γ}, v = (x̃ − xk)/r^{1/γ}, the above
display is equal to
$$\int\left|\int[\|ur^{1/\gamma}\|^\gamma\tilde\psi_{j,k}(ur^{1/\gamma}) + \bar m_{\theta,j}(\theta^*(r),x_k+ur^{1/\gamma})ra]k((u-v)/\tilde h)f_X(x_k+ur^{1/\gamma})r^{d_X/\gamma}\,du\;(r^{1/\gamma}\tilde h)^{-d_X}b(x_k+vr^{1/\gamma})\,\omega_j(\theta_0+ra,x_k+r^{1/\gamma}v)\right|_-^p r^{d_X/\gamma}\,dv$$
$$= r^{p+d_X/\gamma}\int\left|\int[\|u\|^\gamma\tilde\psi_{j,k}(ur^{1/\gamma}) + \bar m_{\theta,j}(\theta^*(r),x_k+ur^{1/\gamma})a]k((u-v)/\tilde h)f_X(x_k+ur^{1/\gamma})\,du\;\tilde h^{-d_X}b(x_k+vr^{1/\gamma})\,\omega_j(\theta_0+ra,x_k+r^{1/\gamma}v)\right|_-^p dv \qquad(15)$$
where the integral is over ‖v‖ < C. The first display of the lemma (the case where h/r^{1/γ} → ch,r
for ch,r > 0) follows from this and the dominated convergence theorem.
To show that the sequence is exactly zero for small enough r when the limit is zero in
a neighborhood of (a, ch,r), note that, if the limit is zero in a neighborhood of (a, ch,r), we
will have, for all (ã, c̃h,r) in this neighborhood and any v,
$$\int\left[\|u\|^\gamma\psi_{j,k}\left(\frac{u}{\|u\|}\right) + \bar m_{\theta,j}(\theta_0,x_k)\tilde a\right]k((u-v)/\tilde c_{h,r})\,du = \int\left[\tilde c_{h,r}^\gamma\|\tilde u\|^\gamma\psi_{j,k}\left(\frac{\tilde u}{\|\tilde u\|}\right) + \bar m_{\theta,j}(\theta_0,x_k)\tilde a\right]k(\tilde u-\tilde v)\,\tilde c_{h,r}^{d_X}\,d\tilde u \ge 0.$$
Evaluating this at (ã, c̃h,r) such that c̃h,r^γ ≤ ch,r^γ(1 − ε) and (for the case where m̄θ,j(θ0, xk)a
is negative) m̄θ,j(θ0, xk)ã ≤ (m̄θ,j(θ0, xk)a)(1 + ε) shows that
$$\int\left[c_{h,r}^\gamma\|\tilde u\|^\gamma\psi_{j,k}\left(\frac{\tilde u}{\|\tilde u\|}\right)\cdot(1-\varepsilon) + (\bar m_{\theta,j}(\theta_0,x_k)a)(1+\varepsilon)\right]k(\tilde u-\tilde v)\,d\tilde u \ge 0$$
for all ṽ for some ε > 0. The above display is, for small enough r, a lower bound for the
inner integral in (15) times a constant that does not depend on r, so that, for small enough
r, the inner integral in (15) will be nonnegative for all v and (15) will eventually be equal to
zero.
For the case where h̃ = h/r^{1/γ} → 0, multiplying (15) by r^{−(p+dX/γ)} gives, after the change
of variables ũ = (u − v)/h̃,
$$\int\left|\int[\|\tilde h\tilde u+v\|^\gamma\tilde\psi_{j,k}((\tilde h\tilde u+v)r^{1/\gamma}) + \bar m_{\theta,j}(\theta^*(r),x_k+(\tilde h\tilde u+v)r^{1/\gamma})a]k(\tilde u)f_X(x_k+(\tilde u\tilde h+v)r^{1/\gamma})\,d\tilde u\;b(x_k+vr^{1/\gamma})\,\omega_j(\theta_0+ra,x_k+r^{1/\gamma}v)\right|_-^p dv,$$
which converges to
$$\int\left|[\|v\|^\gamma\psi_{j,k}(v/\|v\|) + \bar m_{\theta,j}(\theta_0,x_k)a]\,\omega_j(\theta_0,x_k)\right|_-^p dv$$
by the dominated convergence theorem, giving the second display of the lemma.
Proof of Theorem 4.1. The result follows immediately from Lemmas D.3 and D.7, since
$$\left(n^{-\gamma/\{2[d_X+\gamma+(d_X+1)/p]\}}\right)^{-[d_X+p(d_X+\gamma)+1]/(\gamma p)} = n^{1/2}.$$
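Spelling out the exponent arithmetic behind the last display:
$$\frac{\gamma}{2\left[d_X+\gamma+(d_X+1)/p\right]}\cdot\frac{d_X+p(d_X+\gamma)+1}{\gamma p} = \frac{p(d_X+\gamma)+d_X+1}{2\left[p\,d_X+p\gamma+d_X+1\right]} = \frac{1}{2},$$
so raising n^{−γ/{2[dX+γ+(dX+1)/p]}} to the power −[dX+p(dX+γ)+1]/(γp) indeed gives n^{1/2}.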
Proof of Theorem 4.3. The result follows immediately from Lemmas D.2, D.3 and D.8, since
$$\left(n^{-\gamma/\{2[d_X/2+\gamma+(d_X+1)/p]\}}\right)^{-[d_X+p(d_X/2+\gamma)+1]/(\gamma p)} = n^{1/2}.$$
Proof of Theorem 4.5. The result follows from Lemmas D.2, D.4 and D.9. Note that
(nh^{dX})^{p/2}/(n^{1−dX s})^{p/2} → ch^{dX p/2}, and that, for the case where s ≥ 1/[2(γ + dX/p + dX/2)],
$$(n^{-q})^{-(\gamma p+d_X)/(\gamma p)} = \left(n^{-(1-sd_X)/[2(1+d_X/(p\gamma))]}\right)^{-(\gamma p+d_X)/(\gamma p)} = n^{(1-sd_X)/2}.$$
For the case where s < 1/[2(γ + dX/p + dX/2)], it follows from Lemmas D.2, D.4 and D.9
that
$$n^{q(\gamma p+d_X)/(\gamma p)}T_n(\theta_0+a_n) \overset{p}{\to} \left[\sum_{k=1}^{|\mathcal{X}_0|}\sum_{j\in J(k)}\lambda_{\mathrm{kern}}(a,c_h,j,k,p)\right]^{1/p}$$
so that (nh^{dX})^{1/2}Tn(θ0 + an) will converge to ∞ in this case if the limit in the above display
is strictly positive. If the limit in the above display is zero in a neighborhood of (a, ch), it
follows from Lemmas D.2 and D.4 that (nh^{dX})^{1/2}Tn(θ0 + an) is, up to oP(1), equal to a term
that is zero for large enough n by Lemma D.9.