Minimum Distance Estimators
∗
University of Lugano.
†
HEC Genève and Swiss Finance Institute.
Tikhonov Regularisation for Functional Minimum Distance Estimators
Abstract
tional parameter based on a minimum distance principle for nonparametric conditional mo-
ment restrictions. The estimator is computationally tractable and even takes a closed form
in the linear case. We derive its Mean Integrated Squared Error (MISE), its rate of conver-
gence and its pointwise asymptotic normality under a regularisation parameter depending
on the sample size. The optimal value of the regularisation parameter is characterised. We
illustrate our theoretical findings and the small sample properties with simulation results
for two numerical examples. We also discuss two data driven selection procedures of the
the MISE.
Regularisation.
1 Introduction
Minimum distance and extremum estimators have received a lot of attention in the literature
to exploit conditional moment restrictions assumed to hold true on the data generating
process [see e.g. Newey and McFadden (1994) for a review]. In a parametric setting, leading
examples are the Ordinary Least Squares estimator, which takes a closed form, and
the Nonlinear Least Squares estimator, which is computed through numerical optimization.
Corrections for endogeneity are provided by the Instrumental Variable estimator in the linear
case and by the Generalised Method of Moments estimator in the nonlinear case.
In a functional setting, regression curves are inferred by local polynomial estimators and
sieve estimators. A well known example is the Rosenblatt-Parzen kernel estimator. Recently,
several suggestions have been made to correct for endogeneity in the nonparametric context
as well, mainly motivated by the interest for non-parametric IV estimation of structural equa-
tions. Newey and Powell (NP, 2003) consider the problem of estimating non-parametrically
a regression function, which is the conditional expectation of the dependent variable given
non-parametric analog of the Two-Stage Least Square estimator. The NP methodology ex-
tends to the case of general, nonlinear conditional moment restrictions. Ai and Chen (AC,
moment. Although their focus is more on the efficient estimation of the parametric compo-
nent in a semi-parametric conditional moment specification, they show that the estimator
of the functional component converges at a rate faster than $T^{-1/4}$ in an appropriate metric.
Darolles, Florens and Renault (DFR, 2003) and Hall and Horowitz (HH, 2005) also consider
case. Their estimation approach is based on the empirical analog of the conditional moment
restriction, seen as a linear integral equation in the unknown functional parameter. HH de-
rive the optimal rate of convergence of their estimator in quadratic mean. [Cite some other
ity is ill-posedness. Ill-posedness occurs since the mapping of the reduced form parameter
(that is, the distribution of the data) into the structural parameter (the instrumental regres-
sion function) is not continuous. This may have serious consequences, in particular it can
lead to inconsistency of the estimators. The problem of ill-posedness has been addressed
in the literature in different ways. NP and AC propose to introduce bounds on the derivatives
of the functional parameter of interest, which amounts to assuming a compact parameter
space. In the linear case, DFR and HH adopt a regularisation technique, which results in a
The aim of this paper is to introduce a new minimum distance estimator for a func-
type QT (ϕ) + λT G(ϕ), where QT (ϕ) is a minimum distance criterion in the functional pa-
rameter ϕ, G(ϕ) is a penalty function and λT is a positive sequence converging to zero. The
penalty function G(ϕ) corresponds to the Sobolev norm of function ϕ, which involves the
L2 norms of both ϕ and its derivative ∇ϕ. The basic idea behind our estimator is that
the term λT G(ϕ) penalizes highly oscillating components of the estimator, which are oth-
erwise unduly enhanced by the minimum distance criterion QT (ϕ) because of ill-posedness.
Regularised (TiR) estimator, since the penalty term is inspired by the pioneering paper of
Tikhonov (1963) on the regularisation of ill-posed inverse problems. We stress that also the
penalty term involving the L2 norm instead of the Sobolev norm of the parameter. To avoid
confusion, we refer to the DFR and HH estimator as a regularised estimator with L2 norm.
Our paper contributes to the literature along several directions. First, we introduce a
nonparametric estimator for conditional moment restrictions, which admits the following
features: (i) it applies in the general (linear and nonlinear) setting; (ii) the tuning parameter
is allowed to depend on sample size and to be stochastic; (iii) it may have a faster rate
of convergence than the DFR and HH estimator in the linear case; (iv) it admits a closed
form in the linear case. We emphasize that point (ii) is crucial to develop estimators with
data-driven selection of the tuning parameter. This point is not addressed in the setting of
NP and AC, where the tuning parameter is the bound on the Sobolev norm of the estimator
and is assumed fixed in all theoretical results. For the same reason, feature (iv) is not shared
by the NP and AC estimators [see Section 2.4 for more details on the links between the TiR
estimator and the literature]. Concerning point (iii), we give in Section 4 the condition
under which this property holds. In our Monte-Carlo experiments in Section 6, we have
found a superior performance of the TiR estimator compared to the regularised estimator
with L2 norm. 1
Second, we study rather in depth the asymptotic properties of our estimator. In par-
ticular: (i) we prove the consistency of the TiR estimator; (ii) we derive the asymptotic
expansion of the Mean Integrated Squared Error (MISE) as a function of the sample size
and the (deterministic) regularisation parameter; (iii) we prove the pointwise asymptotic
normality of the TiR estimator. To the best of our knowledge, results (ii) and (iii), as well
as (i) for a sequence of stochastic regularisation parameters, are new for non-parametric
estimators of this type. In particular, the asymptotic expansion of the MISE allows us to
study the effect of the regularisation parameter on the variance term and on the bias term
of the TiR estimator, to define the optimal sequence of regularisation parameters, and to
derive the associated optimal rate of convergence of the TiR estimator. The methodology
is easily extended to the case of regularisation with L2 norm, so that these results are in-
teresting also for the study of the properties of the DFR and HH estimators. Finally, the
asymptotic expansion of the MISE suggests a procedure for the data-driven selection of the
Third, we investigate the attractiveness of the TiR estimator from an applied point of
view. In the nonlinear case, the TiR estimator only requires running an unconstrained
optimisation routine instead of a constrained one, and in the linear case it even takes a
closed form. Such numerical tractability is a key advantage in practice, when using heavy
1 The advantage of the Sobolev norm compared to the L2 norm for regularisation is also pointed out in a numerical example in Kress (1999), Example 16.21.
resampling techniques for example. The finite sample properties seem very appealing from
our numerical experiments on two examples and two data driven selection procedures of the
regularisation parameter.
The rest of the paper is organized as follows. In Section 2, we first introduce the general
setting of non-parametric estimation under conditional moment restrictions and the problem
of ill-posedness. We then define the TiR estimator and discuss the links with the literature.
In Section 3 we prove the consistency of the TiR estimator. Section 4 is devoted to the
analysis of the MISE and of the optimal rates of convergence of the TiR estimator. The case
study of the finite sample properties of the TiR estimator. Finally, Section 7 concludes. The
proofs of all results in the paper are gathered in the Appendices. The proofs of technical
In this section we introduce the class of Tikhonov Regularised (TiR) estimators. In Section
Section 2.2 we highlight its main issue, namely ill-posedness. In Section 2.3 TiR estimators
are defined as a regularisation method for the ill-posedness problem. Finally, links with
estimators and results currently available in the literature are discussed in detail in Section
2.4.
2.1 Nonparametric Minimum Distance estimation
Let {(Yt , Xt , Zt ) : t = 1, ..., T } be i.i.d. variables, and let the support of Xt be X = [0, 1].
Suppose that the parameter of interest is a function ϕ0 defined on X , which satisfies the
is assumed that Θ is bounded and closed, and that ϕ0 is the unique function ϕ ∈ Θ that
$$Q_\infty(\varphi) = E_0\left[ m(\varphi, Z)'\,\Omega_0(Z)\, m(\varphi, Z) \right], \quad \varphi \in \Theta, \qquad (2)$$
where m (ϕ, z) = E0 [g (Y, ϕ (X)) | Z = z], and Ω0 (z) is a p.d. matrix for any given z. This
criterion is well-defined if m (ϕ, z) belongs to L2Ω0 (FZ ), for any ϕ ∈ Θ, where L2Ω0 (FZ )
minimizer of the empirical counterpart of Criterion (2). For instance, AC and NP estimate
the conditional moment m (ϕ, z) by an orthogonal polynomials approach, and minimize the
or spline functions.
The main difficulty in non-parametric Minimum Distance estimation is that, contrary to
the standard parametric case, the assumption that function ϕ0 is identified in a bounded and
closed parameter set Θ is not sufficient in general to get the consistency of the estimator.
The goal of this section is to highlight the issue of ill-posedness in Minimum Distance es-
timation [NP; see also Kress (1999), Chapter 15, for a general treatment of ill-posed in-
verse problems, and Carrasco, Florens and Renault (2005) for a survey on inverse problems
in econometrics]. To briefly explain what ill-posedness is, note that solving the equation
which maps the conditional distribution F0 (y, x|z) of (Y, X) given Z = z into the solution
ϕ0 [see Equation (1)]. Ill-posedness arises when this mapping is not continuous. As a conse-
quence, the estimator ϕ̂ of ϕ0 , which is the solution of the inverse problem corresponding to a
NP for a more in-depth discussion along these lines. In this paper, we prefer to emphasize
the link between ill-posedness and a classical concept in econometrics, namely parameter
identification.
To illustrate the main point, let us consider the case of non-parametric linear IV estima-
tion, where g(y, ϕ(x)) = ϕ (x) − y. The moment function m(ϕ, z) = E0 [ϕ (X) − Y | Z = z]
can be written as
where $\Delta\varphi := \varphi - \varphi_0$, operator $A$ is defined by $(A\varphi)(z) = \int \varphi(x) f(w|z)\,dw$ and
$r(z) = \int y f(w|z)\,dw$. Conditional moment restriction (1) identifies $\varphi_0$ if and only if operator $A$ is
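$$Q_\infty(\varphi) = E_0\left[ (A\Delta\varphi)(Z)'\,\Omega_0(Z)\,(A\Delta\varphi)(Z) \right] = \langle \Delta\varphi, A^{*}A\Delta\varphi \rangle, \qquad (4)$$
where $A^{*}$ denotes the adjoint operator of $A$ w.r.t. the scalar products $\langle .,. \rangle$ and $\langle .,. \rangle_{L^2_{\Omega_0}(F_Z)}$.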
(1999), Section 15.3, for the spectral decomposition of compact, self-adjoint operators]. By
point of parameter set Θ. Then, the limit criterion Q∞ (ϕ) can be minimized by a sequence
in Θ such as
ϕn = ϕ0 + εψn , n ∈ N, (5)
$n \to \infty$, but $\|\varphi_n - \varphi_0\| = \varepsilon$, $\forall n$. Since we can choose $\varepsilon > 0$ as small as we want, the usual
is not satisfied. In other words, function ϕ0 is not identified in Θ as an isolated minimum
parameter. Failure of identification condition (6) is due to 0 being a limit point of the eigen-
(1), whenever the linearization of moment function m(ϕ, z) around ϕ = ϕ0 involves a com-
pact operator. This is the maintained assumption in our paper, and is stated below.
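(i) the operator $A$ defined by $(A\Delta\varphi)(z) = \int \frac{\partial g}{\partial v}(y, \varphi_0(x))\, f(w|z)\, \Delta\varphi(x)\,dw$ is a compact
operator in $L^2[0,1]$;
(ii) the second-order term $R(\varphi, z)$ is such that $\sup_{\varphi\in\Theta} \|R(\varphi, .)\|_{L^2_{\Omega_0}(F_Z)} / \|A\Delta\varphi\|_{L^2_{\Omega_0}(F_Z)} < 1$.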
Under Assumption 1, the identification condition (6) is not satisfied, and the Minimum
Distance estimator which minimizes the empirical counterpart of criterion Q∞ (ϕ) over (a
In this paper, we address the issue of ill-posedness by introducing Minimum Distance estima-
criterion of the type QT (ϕ) + λT G (ϕ), where QT (ϕ) is an empirical counterpart of criterion
Q∞ (ϕ) in (2), G (ϕ) is a penalty function introduced to solve the unidentifiability problem
arising from ill-posedness, and λT is a sequence converging to zero as sample size T increases.
timator of the density of (Y, X) given Z = z with kernel K, bandwidth hT , and w = (y, x).
where ΩT (z), T ∈ N, is a sequence of p.d. matrices converging to Ω0 (z), P-a.s., for any z.
Different choices of penalty function G(ϕ) are possible, leading to consistent estimators
under the assumptions of Theorem 1 in Section 3 below. In this paper, we focus on the
Sobolev norm $G(\varphi) = \|\varphi\|_H^2$. More precisely, we assume that $\varphi_0$ belongs to some subset
$\Theta$ of the Sobolev space $H^2[0,1]$, which is defined as the completion of linear space
$\{\varphi \in C^1[0,1] \mid \nabla\varphi \in L^2[0,1]\}$ with respect to the $L^2$ scalar product $\langle .,. \rangle$. Sobolev space
$H^2[0,1]$ is a Hilbert space w.r.t. the scalar product $\langle \varphi, \psi \rangle_H := \langle \varphi, \psi \rangle + \langle \nabla\varphi, \nabla\psi \rangle$, and the
corresponding Sobolev norm is denoted by $\|\varphi\|_H = \langle \varphi, \varphi \rangle_H^{1/2}$.
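As a rough numerical illustration of the Sobolev norm defined above, the sketch below approximates it on a grid of [0, 1]. The function names, the grid size and the test function √2 cos(πx) are assumptions chosen for illustration only; they do not come from the paper.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule for the integral of y over the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def sobolev_norm(phi_vals, x):
    """Approximate ||phi||_H = (<phi, phi> + <grad phi, grad phi>)^(1/2) on a grid."""
    dphi = np.gradient(phi_vals, x)  # finite-difference approximation of the derivative
    return float(np.sqrt(trapezoid(phi_vals**2, x) + trapezoid(dphi**2, x)))

# Hypothetical example: phi(x) = sqrt(2) cos(pi x), for which
# ||phi||^2 = 1 and ||grad phi||^2 = pi^2, so that ||phi||_H = sqrt(1 + pi^2).
x = np.linspace(0.0, 1.0, 2001)
phi = np.sqrt(2.0) * np.cos(np.pi * x)
norm_H = sobolev_norm(phi, x)
```

The numerical value can be compared with the analytical one, sqrt(1 + π²), to gauge the discretisation error of the chosen grid.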
The Minimum Distance estimator under Tikhonov regularisation with Sobolev norm is
defined next.
increasing sequence of subsets of the Sobolev space H 2 [0, 1], which are compact w.r.t. the
L2 -norm k.k.
The name Tikhonov Regularised (TiR) estimators that we use to characterize the Mini-
mum Distance estimators introduced in Definition 1 goes back to Tikhonov (1963), in his pi-
oneering paper on the regularisation of ill-posed inverse problems [see Kress (1999), Chapter
16]. The main intuition is that the term $\lambda_T \|\varphi\|_H^2$ in the criterion penalizes highly oscillating
components of the estimated function, which would be otherwise unduly enhanced, since the
criterion QT (ϕ) becomes asymptotically flat along some directions because of ill-posedness.
For instance, in the linear IV case where Q∞ (ϕ) = h∆ϕ, A∗ A∆ϕi, these directions corre-
large $n$ [see Equation (5) and the discussion in Section 2.2]. Typically, $\psi_n$ is a highly oscillating
function and $\|\psi_n\|_H \to \infty$ as $n \to \infty$, so that these directions are penalized by term
$G(\varphi) = \|\varphi\|_H^2$ in the empirical criterion $Q_T(\varphi) + \lambda_T \|\varphi\|_H^2$. In Theorem 1 in Section 3 below,
we provide precise conditions under which the penalty function $G(\varphi) = \|\varphi\|_H^2$ restores the
validity of the identification condition (6) and ensures the consistency of the TiR estimator.
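The mechanism described above can be sketched numerically in a stylised linear case. The sketch assumes, purely for illustration, that the true function is zero, that the eigenfunctions are ψ_n(x) = √2 cos(πnx), and that the eigenvalues of A*A decay as exp(-n); none of these choices is taken from the paper.

```python
import numpy as np

def limit_criterion(eps, n):
    """Q_inf(phi_0 + eps * psi_n) = eps^2 * mu_n, with assumed eigenvalues mu_n = exp(-n)."""
    return eps**2 * np.exp(-n)

def sobolev_penalty(eps, n):
    """||eps * psi_n||_H^2 for psi_n(x) = sqrt(2) cos(pi n x), taking phi_0 = 0."""
    return eps**2 * (1.0 + (np.pi * n) ** 2)

eps, lam = 0.1, 1e-3
n = np.arange(1, 51)
# Unpenalised criterion: tends to 0 along psi_n although the L2 distance to phi_0 stays at eps.
unpenalised = limit_criterion(eps, n)
# Penalised criterion: the Sobolev term grows with n and keeps the value away from 0.
penalised = unpenalised + lam * sobolev_penalty(eps, n)
```

In this toy setting the unpenalised criterion becomes flat along the oscillating directions, whereas the penalised criterion remains bounded below by λ ε² (1 + π²).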
The sequence $(\lambda_T)$ in Definition 1 controls the amount of regularisation introduced
by term $G(\varphi) = \|\varphi\|_H^2$, and how this depends on sample size $T$. Therefore, $\lambda_T$ can be seen
of λT to zero affects the rate of convergence of the TiR estimator ϕ̂. We will discuss in
Section 4 below the choice of the sequence (λT ) to achieve an optimal rate of convergence of
TiR estimator $\hat\varphi_T$, and we will present two global data driven selection procedures for $\lambda_T$ in
Section 6.
The goal of this Section is to discuss the links between the TiR estimator and the differ-
moment restrictions.
To address the issue of ill-posedness, NP and AC [see also Blundell, Chen and Kristensen
(2005)] suggest considering a compact parameter set Θ. In this case, by the same argument
as in the standard parametric setting, the assumption that ϕ0 is the unique function in
Θ which satisfies (1) implies identification condition (6). Compact sets in L2 [0, 1] can be
defined by imposing a bound on the Sobolev norm $\|\varphi\|_H \le B$ of the functional parameter.
Kuhn-Tucker multiplier.
Our approach by TiR estimators differs from AC and NP along two directions. On the
one hand, for TiR estimators $\lambda_T$ is a free regularisation parameter, whereas $\lambda_T$ is tied down
i) Optimal rates of convergence. Although, for given sample size $T$, selecting a different
$\lambda_T$ amounts to selecting a different $B$ when the constraint is binding, the asymptotic
properties of the TiR estimator and of the estimators with fixed $B$ are different. In particular,
the adoption of a bound $B$ on the Sobolev norm independent of sample size $T$ implies
that the NP and AC estimators share rates of convergence which are slower than that of the
TiR estimator with an optimally selected sequence of regularisation parameters. The optimal
rates of convergence for the TiR estimator are characterized in Section 4. Finally, note that
letting $B = B_T$ grow (slowly) with sample size $T$ is not equivalent to our approach and does
not guarantee the consistency of the estimator. Indeed, when $B_T \to \infty$, the resulting limit
ii) Data-driven selection of tuning parameters. For the TiR estimator, the tuning
parameter λT is allowed to depend on sample size T and sample data, whereas in the theo-
retical setting of NP and AC the tuning parameter B is treated as fixed. Thus, our approach
iii) Computational tractability. Finally, we emphasize that the TiR estimator fea-
inequality constraint $\|\varphi\|_H \le B$ has to be accounted for in the minimization defining estimators
with given $B$. In particular, in the case of linear conditional moment restrictions,
TiR estimators admit a closed form [see Section 5], whereas the computation of the NP and
On the other hand, a second difference is that NP, AC and BCK use finite-dimensional
Sieve estimators, whereas in our approach .... TO BE CONTINUED.
For the special case of non-parametric IV estimation of a single equation model [see Equation
(3)], DFR and HH [see also Carrasco, Florens and Renault (2005)] introduce a regularised
estimator defined by minimization problem (7) with Sobolev norm $\|\varphi\|_H$ replaced by $L^2$ norm
$\|\varphi\|$ in the penalty term, and $\Omega_0(Z) = 1$. Indeed, it is possible to show that the first order
condition for such an estimator corresponds to the linear equation (4.1) in DFR, or to the
estimator defined at p. 4 in HH. DFR and HH study the consistency and the optimal rates
with Sobolev norm and with L2 norm, and give conditions under which the first one is larger.
Finally, note that the techniques used in this paper to study the asymptotic properties of
TiR estimators are easily extended to estimators with L2 regularisation and allow us to derive
new results for the DFR and HH estimators, such as the asymptotic expansion of the Mean
In this section we show the consistency of the TiR estimator. To highlight the main idea,
we first provide in Section 3.1 a consistency theorem for penalized extremum estimators
minimizing the criterion QT (ϕ) + λT G (ϕ) with a general penalty function G (ϕ). Then,
in Section 3.2 the assumptions of the theorem are particularized to the Sobolev penalty
function $G(\varphi) = \|\varphi\|_H^2$ used for the TiR estimator.
where QT (ϕ) , (λT ) and ΘT are as in Definition 1. This estimator is well-defined and
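Theorem 1: Let
(i) $\delta_T := \sup_{\varphi\in\Theta_T} |Q_T(\varphi) - Q_\infty(\varphi)| \overset{p}{\longrightarrow} 0$;
(ii) $\varphi_0 \in \Theta$, and $\bigcup_{T=1}^{\infty} \Theta_T$ is dense in $\Theta \subset H^2[0,1]$;
(iii) For any $\varepsilon > 0$, $C_\varepsilon(\lambda) := \inf_{\varphi\in\Theta:\|\varphi-\varphi_0\|\ge\varepsilon} Q_\infty(\varphi) + \lambda G(\varphi) - Q_\infty(\varphi_0) - \lambda G(\varphi_0) > 0$, for
(iv) $\exists a > 0$ such that $\lim_{\lambda\to 0} \lambda^{-a} C_\varepsilon(\lambda) > 0$, $T^a \delta_T \overset{p}{\longrightarrow} 0$, and $T^a \rho_T \to 0$, for any $\varepsilon > 0$,
Then, under (i)-(iv), for any sequence $(\lambda_T)$ such that $\lambda_T > 0$, $\lambda_T \to 0$, P-a.s., and
$$\lambda_T / T \to 0, \quad P\text{-a.s.}, \qquad (9)$$
the estimator $\hat\varphi$ defined in (8) is consistent, namely $\|\hat\varphi_T - \varphi_0\| \overset{p}{\longrightarrow} 0$.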
estimators [e.g., Wooldridge and White (1991), Corollary 2.6]. Indeed, in this case, Condition
(iii) is the usual identification condition (6), whereas Condition (iv) is satisfied. Theorem 1
extends this consistency result to situations where Condition (6) does not hold, as it is the
case for our ill-posed setting (see Section 2.2). The identification of ϕ0 as an isolated minimum
is restored by including a small additional component λG (ϕ) in the limit criterion. Thus,
Condition (iii) in Theorem 1 is the condition on penalty function G (ϕ) to overcome ill-
posedness and achieve consistency of the estimator ϕ̂. To interpret Condition (iv), note that
in the ill-posed setting we have Cε (λ) → 0 as λ → 0, and the rate of this convergence can
be seen as a measure for the severity of ill-posedness. Thus, Condition (iv) introduces a
bound on ill-posedness severity, related to the rates of uniform convergence $\delta_T \overset{p}{\longrightarrow} 0$ and
conditions to quantify this bound and to verify Conditions (i), (ii), and (iv) of Theorem 1 for
the TiR estimator. Finally, it is important to emphasize that Theorem 1 is more general than
the results currently known in the literature, since sequence (λT ) is allowed to be stochastic,
possibly data dependent, in a fully general way. Condition (9) on λT for consistency requires
The rest of this Section will focus on the key assumption of Theorem 1, that is identifi-
cation assumption (iii). The next Proposition provides a sufficient condition for the validity
of this assumption.
Proposition 2: Assume that the function G is bounded from below. Furthermore, suppose
that, for any ε > 0 and any sequence (ϕn ) in Θ such that kϕn − ϕ0 k ≥ ε for all n ∈ N, we
have
Condition (10) provides a simple intuition to explain why the penalty function G (ϕ)
restores identification. Indeed, it basically requires that the sequences (ϕn ) in Θ, which
minimize Q∞ (ϕ) without converging to ϕ0 , are penalized by function G (ϕ) . In the next
section, we particularize this condition for the penalty function which is relevant for the TiR
When the penalty function G(ϕ) = kϕk2H is used, Condition (10) in Proposition 2 can be
nicely stated in terms of the spectrum of the operator A∗ A, where A is the operator in the
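Assumption 2: Let $\{\psi_j : j \in \mathbb{N}\}$ be an orthonormal basis in $L^2[0,1]$ of eigenfunctions
$\psi_j \in H^2[0,1]$, for any $j \in \mathbb{N}$. Then, $M_n := \inf_{\varphi\in S_n : \|\varphi\|=1} \|\varphi\|_H \to \infty$ as $n \to \infty$, where
$S_n = \mathrm{span}\{\psi_j : j \ge n\}$.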
A∗ A to eigenvalues close to zero consists of highly oscillating functions with large Sobolev
norm. Then, deviations of the estimator ϕ̂ from ϕ0 along such directions are penalized by
$G(\varphi) = \|\varphi\|_H^2$. This compensates for the inability of the empirical criterion $Q_T(\varphi)$ to achieve
In Lemma A.1 in Appendix 1, we show that Assumptions 1 and 2 imply Condition (10)
in Proposition 2. Then, from Theorem 1 and Proposition 2, the consistency of the TiR
estimator follows.
In this section, we derive the Mean Integrated Squared Error (MISE) of the TiR estimator
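Proposition 3: Under Assumptions 1-3, in Appendix B, and the bandwidth conditions
$$h_T^{m} = o\left(\lambda_T\, b(\lambda_T)\right), \qquad (T\lambda_T)^{-1} = o\left(h_T^{d_Z}\right), \qquad (11)$$
the MISE of the TiR estimator $\hat\varphi$ with deterministic sequence $(\lambda_T)$ is given by
$$E\left[\|\hat\varphi - \varphi_0\|^2\right] = \frac{1}{T}\sum_{j=1}^{\infty} \frac{\nu_j}{(\lambda_T + \nu_j)^2}\,\|\phi_j\|^2 + b(\lambda_T)^2 =: M_T(\lambda) \qquad (12)$$
up to terms which are asymptotically negligible w.r.t. the RHS, where $\{\phi_j : j \in \mathbb{N}\}$ are the
ator of $A$ w.r.t. the scalar products $\langle .,. \rangle_H$ and $\langle .,. \rangle_{L^2_{\Omega_0}(F_Z)}$, function $b(\lambda_T)$ is given by
$$b(\lambda_T) = \left\|(\lambda_T + A^{*}_{H}A)^{-1} A^{*}_{H}A\,\varphi_0 - \varphi_0\right\|, \qquad (13)$$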
The asymptotic expansion of the MISE consists of two components, which are a variance
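(i) The bias function $b(\lambda_T)$ is the $L^2$ norm of $(\lambda_T + A^{*}_{H}A)^{-1} A^{*}_{H}A\,\varphi_0 - \varphi_0 =: \varphi^{*} - \varphi_0$. To
interpret function $\varphi^{*}$, note that the quadratic approximation of the limit criterion [see (4)
$$\langle \Delta\varphi, A^{*}A\Delta\varphi \rangle = E_0\left[(A\Delta\varphi)(Z)'\,\Omega_0(Z)\,(A\Delta\varphi)(Z)\right] = \langle \Delta\varphi, A^{*}_{H}A\Delta\varphi \rangle_H, \quad \varphi \in \Theta.$$
Then, function $\varphi^{*}$ minimizes the penalized asymptotic criterion $\langle \Delta\varphi, A^{*}_{H}A\Delta\varphi \rangle_H + \lambda_T \|\varphi\|_H^2$.
Thus, $b(\lambda_T)$ is the asymptotic bias arising from introducing penalty $\lambda_T \|\varphi\|_H^2$ in the criterion.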
It corresponds to the so-called regularisation bias in the theory of Tikhonov regularisation
[see e.g. Kress (1999), Groetsch (1984)]. Under general conditions on operator $A^{*}_{H}A$ and the true
function ϕ0 , the bias function b (λ) is increasing w.r.t. λ and such that b (λ) → 0 as λ → 0.
(ii) The variance term $T^{-1}\sum_{j=1}^{\infty} \|\phi_j\|^2\, \nu_j / (\lambda_T + \nu_j)^2$ involves a weighted sum of the
"regularised" inverse eigenvalues $\nu_j / (\lambda_T + \nu_j)^2$ of operator $A^{*}_{H}A$, with weights $\|\phi_j\|^2$.$^{2}$ To
have an interpretation, note that the inverse of operator $A^{*}_{H}A$ corresponds to the standard
asymptotic variance matrix $(J_0' V_0^{-1} J_0)^{-1}$ of the efficient GMM in the parametric setting,
where $J_0 = E_0[\partial g / \partial\theta']$ and $V_0 = V_0[g]$. In the ill-posed non-parametric setting, the inverse
of operator $A^{*}_{H}A$ is unbounded, and its eigenvalues $1/\nu_j \to \infty$ diverge. The penalty term
$\lambda_T \|\varphi\|_H^2$ in the criterion defining the TiR estimator implies that inverse eigenvalues $1/\nu_j$ are
replaced by $\nu_j / (\lambda_T + \nu_j)^2$.
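The effect of this replacement can be sketched numerically. The example assumes a hypothetical geometric spectrum ν_j = exp(-j) and an arbitrary value of the regularisation parameter, neither of which is taken from the paper. It uses the elementary fact that ν/(λ + ν)² is maximised over ν > 0 at ν = λ, where it equals 1/(4λ).

```python
import numpy as np

def regularised_inverse(nu, lam):
    """Regularised inverse eigenvalue nu / (lam + nu)^2."""
    return nu / (lam + nu) ** 2

nu = np.exp(-np.arange(1, 31))   # assumed geometric spectrum nu_j = exp(-j)
lam = 1e-4                       # illustrative regularisation parameter
naive = 1.0 / nu                 # inverse eigenvalues: diverge as j grows
regularised = regularised_inverse(nu, lam)
bound = 1.0 / (4.0 * lam)        # supremum over nu > 0, attained at nu = lam
```

The naive inverse eigenvalues explode along the spectrum, whereas the regularised ones stay below the bound and eventually decay to zero.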
The variance term $T^{-1}\sum_{j=1}^{\infty} \|\phi_j\|^2\, \nu_j / (\lambda_T + \nu_j)^2$ is a decreasing function of $\lambda_T$. To study
its behaviour when $\lambda_T \to 0$, we introduce the next assumption.
Under Assumption 4, the series $k_T := \sum_{j=1}^{\infty} \|\phi_j\|^2\, \nu_j / (\lambda_T + \nu_j)^2$ diverges as $\lambda_T \to 0$.
When $k_T \to \infty$ such that $k_T / T \to 0$, the variance term converges to zero. However, the rate
of convergence is slower than the parametric rate $1/T$. This slower rate of convergence
is typical in nonparametric estimation. Note, however, that the slower rate of convergence
2 Since $\nu_j / (\lambda_T + \nu_j)^2 \le \nu_j / \lambda_T^2$, the infinite sum converges for any fixed $\lambda_T > 0$ under Assumption B.6 (i) in Appendix B.
is not coming from localization as for kernel estimation, but from the ill-posedness of the
The asymptotic expansion of the MISE of the TiR estimator given in Proposition 3 does
not involve the bandwidth hT , as long as Conditions (11) are satisfied. The variance term is
asymptotically independent of hT since the asymptotic expansion of ϕ̂−ϕ0 involves the kernel
density estimator integrated w.r.t. (Y, X, Z) [see Equation (36) in Appendix 2, first term, and
the proof of Lemma A.3]. The integral averages the localization effect of the bandwidth hT .
On the contrary, kernel estimation m̂(ϕ, z) of the conditional moment function does have an
implies that the estimation bias is asymptotically negligible compared to the regularisation
Finally, it is also possible to derive a similar asymptotic expansion of the MISE for the
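$$E\left[\|\tilde\varphi_T - \varphi_0\|^2\right] = \frac{1}{T}\sum_{j=1}^{\infty} \frac{\mu_j}{(\lambda_T + \mu_j)^2} + \tilde b(\lambda_T)^2, \qquad (14)$$
where $\mu_j$ are the eigenvalues of operator $A^{*}A$, and $\tilde b(\lambda_T) = \left\|(\lambda_T + A^{*}A)^{-1} A^{*}A\,\varphi_0 - \varphi_0\right\|$.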
Let us now come back to the MISE MT (λ) of the TiR estimator in Proposition 3 and dis-
cuss the optimal choice of the regularisation parameter λT . Since the bias term is increasing
in the regularisation parameter, whereas the variance term is decreasing, we face a kind of
given by λ∗T = arg minλ>0 MT (λ), and the corresponding optimal MISE of the TiR estimator
The optimal sequence of regularisation parameters $\lambda_T^{*}$, in particular its rate of convergence
to zero, depends on the decay behaviour of the eigenvalues $\nu_j$ and of the norms of
eigenfunctions $\|\phi_j\|$, as well as on the bias function $b(\lambda)$ close to $\lambda = 0$. In the next section,
optimal MISE MT∗ , and their rate of convergence in a broad class of models.
as j → ∞, for instance geometric or hyperbolic decay. Intuitively, the first type is associated
with a faster convergence of the spectrum to zero, and thus to a more serious problem of
ill-posedness. In this section, we focus our analysis on the case where the eigenvalues $\nu_j$ feature
geometric decay and the norms of eigenfunctions $\|\phi_j\|$ feature hyperbolic decay. Results for
Assumption 5: The eigenvalues $\nu_j$ and the norms of the eigenfunctions $\|\phi_j\|$ of operator
Assumption 5 (i) is satisfied for a large number of models, including for instance the
two examples that we consider below in our Monte-Carlo analysis. In general, it is known
that, under appropriate regularity conditions, compact integral operators with smooth kernel
feature eigenvalues with decay of (at least) exponential type [see Theorem 15.20 in Kress
(1999)].$^{3}$ Assumption 5 (ii) is adopted e.g. in Wahba (1977), and is also satisfied in the
We further assume that the bias function features a power-law behaviour close to λ = 0.
Assumption 6: The bias function is such that $b(\lambda) = C_3 \lambda^{\delta}$, $\delta > 0$, for $\lambda$ close to 0, where
C3 is a positive constant.
Then, the MISE and the optimal sequence of regularisation parameters are characterised in
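(i) The MISE is $M_T(\lambda) = c_1 \dfrac{1}{T} \dfrac{1}{\lambda\,[\log(1/\lambda)]^{\beta}} + c_2 \lambda^{2\delta}$, up to terms which are negligible when
$\lambda \to 0$ and $T \to \infty$.
$$\log \lambda_T^{*} = \log c - \frac{1}{1+2\delta} \log T, \quad T \in \mathbb{N}, \qquad (15)$$
(iii) The optimal MISE is $M_T^{*} = c\, T^{-\frac{2\delta}{1+2\delta}} (\log T)^{-\frac{2\delta\beta}{1+2\delta}}$, up to a term which is negligible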
Proof: See Appendix 3.
The log of the optimal regularisation parameter is linear in the log sample size. The
slope coefficient $\gamma := 1/(1+2\delta)$ is smaller than 1, and depends on the convexity parameter
$\delta$ of the bias function close to $\lambda = 0$. We have $\gamma < 1/2$ when the squared bias function
$b(\lambda)^2$ is convex, that is $2\delta > 1$, respectively $\gamma \ge 1/2$ when $2\delta \le 1$. The optimal MISE
converges to zero as a power of $T$ and of $\log T$. The negative exponent of the dominant term
$T$ is $2\delta/(1+2\delta)$. This rate of convergence is smaller than 1, that is the parametric rate,
because of ill-posedness, and is increasing w.r.t. convexity parameter $\delta$ of the bias function.
Note that the geometric decay rate $\alpha$ affects neither the rate of convergence of the
optimal regularisation sequence, nor that of the MISE, whereas coefficient $\beta$ of eigenfunction
norms affects the exponent of the $\log T$ term in the MISE only. Finally, under Assumptions
5 and 6, the bandwidth conditions (11) are fulfilled for the optimal sequence of regularisation
parameters (15) if $h_T = C \cdot T^{-\eta}$, with $\dfrac{1}{m}\dfrac{1+\delta}{1+2\delta} < \eta < \dfrac{1}{d_Z}\dfrac{2\delta}{1+2\delta}$. This condition can be
satisfied if $\dfrac{m}{d_Z} > \dfrac{1+\delta}{2\delta}$.
To conclude this section, we briefly discuss the optimal rate of convergence of the MISE
when the eigenvalues feature hyperbolic decay, that is ν j = Cj −α , α > 0, or when regu-
larisation with L2 norm is adopted. The results are summarized in Table 1 below, and are
found using Formula (14) and an argument similar to the proof of Proposition 4. In Table 1,
parameter $\beta$ is defined as in Assumption 5 (ii) for the TiR estimator. Parameters $\alpha$ and $\tilde\alpha$
denote the hyperbolic decay rates of the eigenvalues of operator $A^{*}_{H}A$ for the TiR estimator,
to satisfy Assumption 4. Finally, parameters $\delta$ and $\tilde\delta$ are the power-law coefficients of the
bias function $b(\lambda)$ and $\tilde b(\lambda)$ for $\lambda \to 0$ as in Assumption 6, where $b(\lambda)$ is defined in (13) for
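                       TiR estimator                                                              Regularisation with L2 norm
geometric spectrum     $T^{-\frac{2\delta}{1+2\delta}} (\log T)^{-\frac{2\delta\beta}{1+2\delta}}$     $T^{-\frac{2\tilde\delta}{1+2\tilde\delta}}$
hyperbolic spectrum    $T^{-\frac{2\delta}{1+2\delta+(1-\beta)/\alpha}}$                               $T^{-\frac{2\tilde\delta}{1+2\tilde\delta+1/\tilde\alpha}}$

Table 1: Optimal rate of convergence of the MISE. The decay factors are $\alpha$ and $\tilde\alpha$ for the
eigenvalues, $\delta$ and $\tilde\delta$ for the bias, and $\beta$ for the squared norm of the eigenfunctions.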
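The exponents of T in Table 1 are simple to evaluate, and a small helper makes comparisons across parameter values easier. The function names and the parameter values below are hypothetical and serve only as an illustration of the formulas in the table.

```python
def tir_rate_exponent(delta, spectrum, alpha=None, beta=None):
    """Exponent r of the T^(-r) factor in the optimal MISE of the TiR estimator."""
    if spectrum == "geometric":
        return 2.0 * delta / (1.0 + 2.0 * delta)
    if spectrum == "hyperbolic":
        return 2.0 * delta / (1.0 + 2.0 * delta + (1.0 - beta) / alpha)
    raise ValueError("spectrum must be 'geometric' or 'hyperbolic'")

def l2_rate_exponent(delta_tilde, spectrum, alpha_tilde=None):
    """Exponent r of the T^(-r) factor under regularisation with L2 norm."""
    if spectrum == "geometric":
        return 2.0 * delta_tilde / (1.0 + 2.0 * delta_tilde)
    if spectrum == "hyperbolic":
        return 2.0 * delta_tilde / (1.0 + 2.0 * delta_tilde + 1.0 / alpha_tilde)
    raise ValueError("spectrum must be 'geometric' or 'hyperbolic'")

# Illustrative values: delta = 1, alpha = 2, beta = 2.
r_geo = tir_rate_exponent(1.0, "geometric")
r_hyp = tir_rate_exponent(1.0, "hyperbolic", alpha=2.0, beta=2.0)
r_l2_hyp = l2_rate_exponent(1.0, "hyperbolic", alpha_tilde=2.0)
```

With these illustrative values, β > 1 makes the hyperbolic exponent of the TiR estimator larger than the geometric one.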
With hyperbolic spectrum, the rate of convergence (power of T ) of the TiR estimator
features an additional term (1 − β) /α in the denominator, which involves both the α and
β coefficients. When β > 1, the rate of convergence is faster than that with geometric
spectrum. This is an effect of the less severe ill-posedness problem. The rate of convergence
The rate of convergence with L2 regularisation coincides with that of the TiR estimator
metric spectrum, the TiR estimator features a faster rate of convergence than the regularised
convex. Finally, note that with hyperbolic spectrum and $L^2$ regularisation, the formula given
in Table 1 corresponds to that derived by HH, Theorem 4.1.$^{4}$
In this section we derive the TiR estimator when the moment restrictions are linear w.r.t. the
equation model, with g (y, ϕ0 (x)) = ϕ0 (x) − y, and conditional moment as in (3). Then, the
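To simplify the exposition, we assume that $\Omega_0(z) = V_0[Y_t - \varphi_0(X_t) \mid Z = z]^{-1} = 1$ in
Assumption 3. The objective function of the TiR estimator in Definition 1 can be rewritten as
$$Q_T(\varphi) + \lambda_T \|\varphi\|_H^2 = \langle \varphi, \hat{A}^{*}_{H}\hat{A}\varphi \rangle_H - 2\langle \varphi, \hat{A}^{*}_{H}\hat{r} \rangle_H + \lambda_T \langle \varphi, \varphi \rangle_H, \quad \varphi \in H^2[0,1], \qquad (16)$$
up to a term independent of $\varphi$, where $\hat{A}^{*}_{H}$ denotes the linear operator defined on $L^2_{\Omega_0}(F_Z)$ by
$$\langle \varphi, \hat{A}^{*}_{H}\psi \rangle_H = \frac{1}{T}\sum_{t=1}^{T} (\hat{A}\varphi)(Z_t)\,\psi(Z_t), \quad \varphi \in H^2[0,1],\ \psi \in L^2_{\Omega_0}(F_Z). \qquad (17)$$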
Under the regularity conditions in Appendix B, Criterion (16) admits a global minimum ϕ̂
4 To see this, note that their Assumption A.3 corresponds to $\tilde\delta = (2\beta_{HH} - 1)/(\tilde\alpha + \beta_{HH})$, where $\beta_{HH}$ is the $\beta$ coefficient of HH.
This is a Fredholm integral equation of Type II.$^{5}$ The transformation of the ill-posed
problem (1) into the well-posed estimating equation (18) is induced by the penalty term
involving the Sobolev norm. The TiR estimator is the unique solution of Equation (18) and
is given by
$$\hat\varphi = \left(\lambda_T + \hat{A}^{*}_{H}\hat{A}\right)^{-1} \hat{A}^{*}_{H}\hat{r}. \qquad (19)$$
basis of functions $\{P_j : j = 1, ..., K\}$ in $H^2[0,1]$ and solving Equation (19) on the subspace
$$\varphi \simeq \sum_{j=1}^{K} \theta_j P_j =: \theta' P, \quad \theta \in \mathbb{R}^K. \tag{20}$$
by [using (17)]
$$\langle P_i, \hat{\tilde{A}} \hat{A} P_j \rangle_H = \frac{1}{T} \sum_{t=1}^{T} (\hat{A}P_i)(Z_t)\, (\hat{A}P_j)(Z_t) = \frac{1}{T} (\hat{P}'\hat{P})_{i,j}, \quad i, j = 1, ..., K,$$
where $\hat{P}$ is the $T \times K$ matrix with rows $\hat{P}(Z_t)' = \int P(x)' \hat{f}(w|Z_t)\, dw$, $t = 1, ..., T$. Matrix $\hat{P}$ is the matrix of the "fitted values" in the regression of $P(X)$ on $Z$ at the sample points. Then, Equation (18) reduces to a matrix equation
$$\left( \lambda_T D + \frac{1}{T} \hat{P}'\hat{P} \right) \theta = \frac{1}{T} \hat{P}'\hat{R},$$
where $\hat{R} = (\hat{r}(Z_1), ..., \hat{r}(Z_T))'$ and $D$ is the $K \times K$ matrix of Sobolev scalar products $D_{i,j} = \langle P_i, P_j \rangle_H$, $i, j = 1, ..., K$. The solution is given by
$$\hat{\theta} = \left( \lambda_T D + \frac{1}{T} \hat{P}'\hat{P} \right)^{-1} \frac{1}{T} \hat{P}'\hat{R},$$
which yields the approximation of the TiR estimator $\hat{\varphi} \simeq \hat{\theta}' P$.
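The matrix solution for $\hat{\theta}$ can be illustrated with a short numerical sketch. The Python snippet below is not the authors' Gauss code: the function name `tir_sieve_coefficients` and the toy inputs are assumptions made for illustration, and only the linear system itself is taken from the text.

```python
import numpy as np

def tir_sieve_coefficients(P_hat, R_hat, D, lam):
    """Return theta solving (lam * D + P_hat'P_hat / T) theta = P_hat'R_hat / T."""
    P_hat = np.asarray(P_hat, dtype=float)   # T x K matrix of fitted basis values
    R_hat = np.asarray(R_hat, dtype=float)   # length-T vector of r_hat(Z_t)
    D = np.asarray(D, dtype=float)           # K x K penalty (Sobolev Gram) matrix
    T = P_hat.shape[0]
    lhs = lam * D + P_hat.T @ P_hat / T
    rhs = P_hat.T @ R_hat / T
    # Solve the linear system directly instead of forming an explicit inverse.
    return np.linalg.solve(lhs, rhs)

# Toy usage with made-up inputs (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
P_demo = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -0.5, 0.25])
R_demo = P_demo @ theta_true
theta_ridge = tir_sieve_coefficients(P_demo, R_demo, np.eye(3), lam=0.1)
```

With `lam = 0` the function reduces to the unpenalised 2SLS-type solution; a positive `lam` shrinks the coefficients through the penalty matrix `D`.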
5 See e.g. Linton and Mammen (2005), (2006), Gagliardini and Gouriéroux (2006), and the survey by Carrasco, Florens and Renault (2005) for other examples of estimation problems leading to Type II equations.
Estimator θ̂ is a 2SLS estimator with a ridge correction term. It is easy to verify that this is the estimator we obtain if we replace Approximation (20) in Criterion (16) and minimize w.r.t. θ. This latter approach has been considered by NP, AC, and Blundell et al. (2005), who use sieve estimators. However, it is important to emphasize that the
imately the true TiR estimator ϕ̂ in (19), which is a well-defined estimator on the function
sistency) of the estimators proposed by NP, AC, and Blundell et al. (2005), have been derived
only in the case where parameter $\lambda_T$ is tied down by the inequality constraint $\|\hat{\varphi}\|_H \leq \bar{B}$
for fixed $\bar{B}$, whereas, for the TiR estimator, $\lambda_T$ is treated as a free regularization parameter
[0, 1]. The function Φ denotes the cdf of a standard Gaussian variable, and is assumed to be
known. To generate Y , we restrict ourselves to the linear case since a simulation analysis of
a nonlinear case would be very time consuming. We examine two designs
When the correlation ρ between U and V is 50% there is endogeneity in both cases.
E0 [Y − ϕ0 (X) | Z] = 0,
where the functional parameter is ϕ0 (x) = Ba,b (x) in Case 1, and ϕ0 (x) = sin (πx) in Case
2, x ∈ [0, 1].
Since we face an unknown function ϕ0 on [0, 1], we use a series approximation based on
standardized shifted Chebyshev polynomials of the first kind (see Section 22 on orthogonal
polynomials of Abramowitz and Stegun (1970) for their mathematical properties). We take
Chebyshev polynomials of the first kind are
$$T_3^*(x) = -1 + 18x - 48x^2 + 32x^3, \qquad T_4^*(x) = 1 - 32x + 160x^2 - 256x^3 + 128x^4,$$
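The coefficients of the shifted Chebyshev polynomials can be checked with a few lines of Python. The sketch below uses the standard three-term recurrence $T^*_{n+1}(x) = 2(2x-1)T^*_n(x) - T^*_{n-1}(x)$ with $T^*_0 = 1$ and $T^*_1 = 2x - 1$; the function name `shifted_chebyshev` is an assumption introduced for illustration and the polynomials are left unstandardized.

```python
def shifted_chebyshev(n_max):
    """Integer coefficients (ascending powers of x) of T*_0, ..., T*_{n_max} on [0, 1]."""
    polys = [[1], [-1, 2]]               # T*_0(x) = 1, T*_1(x) = 2x - 1
    for _ in range(2, n_max + 1):
        prev, cur = polys[-2], polys[-1]
        nxt = [0] * (len(cur) + 1)
        for k, c in enumerate(cur):      # multiply cur by 2 * (2x - 1)
            nxt[k] -= 2 * c
            nxt[k + 1] += 4 * c
        for k, c in enumerate(prev):     # subtract T*_{n-1}
            nxt[k] -= c
        polys.append(nxt)
    return polys[: n_max + 1]

# Coefficients of T*_3 and T*_4, in ascending powers of x.
cheb = shifted_chebyshev(4)
t3, t4 = cheb[3], cheb[4]
```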
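The (squared) Sobolev norm $\|\varphi\|_H^2 = \int_0^1 \varphi^2 + \int_0^1 (\nabla\varphi)^2$ is approximated by
$$\|\varphi\|_H^2 \simeq \sum_{i=0}^{5} \sum_{j=0}^{5} \theta_i \theta_j \int_0^1 \left( P_i(x) P_j(x) + \nabla P_i(x) \nabla P_j(x) \right) dx.$$
The coefficients in this quadratic form $\theta' D \theta$ take a closed form, and can be computed easily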
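The $L^2$ norm $\|\varphi\|^2$ can be approximated in a similar way with $\theta' B \theta$ where
$$B = \begin{pmatrix}
\frac{1}{\pi} & 0 & -\frac{\sqrt{2}}{3\pi} & 0 & -\frac{\sqrt{2}}{15\pi} & 0 \\
\vdots & \frac{2}{3\pi} & 0 & -\frac{2}{5\pi} & 0 & -\frac{2}{21\pi} \\
 & & \frac{14}{15\pi} & 0 & -\frac{38}{105\pi} & 0 \\
 & & & \frac{34}{35\pi} & 0 & -\frac{22}{63\pi} \\
\vdots & & & & \frac{62}{63\pi} & 0 \\
\cdots & & & \cdots & & \frac{98}{99\pi}
\end{pmatrix}.$$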
Such simple and exact forms ease implementation, improve speed, and contribute to the numerical stability of the estimation procedure (see footnote 6).
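The closed-form entries of $B$, and the analogous entries of the Sobolev matrix $D$, can be reproduced by exact polynomial integration. The Python sketch below is illustrative only. It assumes the standardization $P_0 = T^*_0/\sqrt{\pi}$ and $P_j = T^*_j \sqrt{2/\pi}$ for $j \geq 1$, which is inferred here from the entries of $B$ and is not stated explicitly in the text; all function names are hypothetical.

```python
from fractions import Fraction
import math

def shifted_chebyshev(n_max):
    """Integer coefficients (ascending powers) of T*_0, ..., T*_{n_max} on [0, 1]."""
    polys = [[1], [-1, 2]]
    for _ in range(2, n_max + 1):
        prev, cur = polys[-2], polys[-1]
        nxt = [0] * (len(cur) + 1)
        for k, c in enumerate(cur):      # multiply cur by 2 * (2x - 1)
            nxt[k] -= 2 * c
            nxt[k + 1] += 4 * c
        for k, c in enumerate(prev):     # subtract T*_{n-1}
            nxt[k] -= c
        polys.append(nxt)
    return polys[: n_max + 1]

def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_diff(p):
    return [k * c for k, c in enumerate(p)][1:] or [0]

def integral_01(p):
    """Exact integral over [0, 1] of a polynomial with integer coefficients."""
    return sum(Fraction(c, k + 1) for k, c in enumerate(p))

def gram_matrices(n_max=5):
    """Return exact raw L2 integrals, and float matrices B (L2) and D (Sobolev)."""
    cheb = shifted_chebyshev(n_max)
    n = n_max + 1
    raw_l2 = [[integral_01(poly_mul(cheb[i], cheb[j])) for j in range(n)]
              for i in range(n)]
    raw_d1 = [[integral_01(poly_mul(poly_diff(cheb[i]), poly_diff(cheb[j])))
               for j in range(n)] for i in range(n)]
    # Assumed standardization (inferred from B, see lead-in).
    scale = [1 / math.sqrt(math.pi)] + [math.sqrt(2 / math.pi)] * n_max
    B = [[scale[i] * scale[j] * float(raw_l2[i][j]) for j in range(n)]
         for i in range(n)]
    D = [[B[i][j] + scale[i] * scale[j] * float(raw_d1[i][j]) for j in range(n)]
         for i in range(n)]
    return raw_l2, B, D

raw_l2, B, D = gram_matrices(5)
```

Working with `Fraction` keeps the polynomial integrals exact, so the only floating-point step is the final scaling by powers of $\pi$.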
where h denotes the bandwidth, and K is the Gaussian kernel. This kernel estimator is asymptotically equivalent to the one described in the lines above. We prefer it because of its numerical tractability: it avoids bivariate numerical integration and the choice of two additional bandwidths. The bandwidth is selected via the standard rule of thumb $h = 1.06\, \hat{\sigma}_Z T^{-1/5}$ (Silverman (1986)), where $\hat{\sigma}_Z$ is the empirical standard deviation of $Z_t$.
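The bandwidth rule can be sketched in Python together with a generic Gaussian-kernel regression. This is an illustration under stated assumptions, not the authors' implementation: the estimator shown is a standard Nadaraya-Watson smoother, the sample standard deviation uses `ddof=1`, and the function names are hypothetical.

```python
import numpy as np

def silverman_bandwidth(z):
    """Rule of thumb h = 1.06 * sigma_hat * T**(-1/5) (ddof=1 is an assumption)."""
    z = np.asarray(z, dtype=float)
    return 1.06 * z.std(ddof=1) * z.size ** (-1.0 / 5.0)

def nadaraya_watson(z_eval, z, y, h):
    """Gaussian-kernel estimate of E[y | Z = z_eval]; y may be a vector or T x K matrix."""
    z_eval = np.atleast_1d(np.asarray(z_eval, dtype=float))
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    u = (z_eval[:, None] - z[None, :]) / h
    w = np.exp(-0.5 * u ** 2)            # kernel constants cancel in the ratio
    num = w @ y
    den = w.sum(axis=1)
    return num / den if num.ndim == 1 else num / den[:, None]
```

Passing a T x K matrix of basis values as `y` returns the kernel-smoothed basis at each evaluation point, which is one way to build a matrix of fitted values at the sample points.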
6 The Gauss programs developed for this section are available on request from the authors.
6.3 Simulation results
The sample size is initially fixed at T = 400. Estimator performance is measured in terms
of the Mean Integrated Squared Error (MISE) and the Integrated Squared Bias (ISB) based
Figures 1 to 4 concern Case 1 while Figures 5 to 8 concern Case 2. In each figure the left
panel plots the MISE on a grid of lambda, the central panel the ISB on a grid of lambda, and
the right panel the mean estimated functions and the true function on the unit interval. Mean
estimated functions correspond to averages obtained either from regularised estimates with
a lambda achieving the lowest MISE or from OLS estimates. The regularization schemes use
the Sobolev norm, corresponding to the TiR estimator (odd numbering of the figures), and
the L2 norm (even numbering of the figures). We consider designs exhibiting an endogeneity
endogeneity (ρ = 0).
A couple of remarks can be made. First, the bias of the OLS estimator can be large
under endogeneity. Second, the MISE of the TiR estimator is more convex in lambda than
the one obtained from an L2 norm, and performance is clearly better for the TiR estimator.
The Sobolev norm should be strongly favoured over the L2 norm in order to recover the
shape of the true functions. Third, the fit obtained by the OLS estimator is almost perfect
when endogeneity is absent. Using six polynomials delivers a very good approximation of
We have also examined sample sizes T = 100 and T = 1000, as well as approximations
based on polynomials with orders up to 10 and 15. The above conclusions remain qualita-
tively unaffected. This suggests that as soon as the order of the polynomials is sufficiently
large to deliver a good numerical approximation of the underlying function, it is not necessary to link it with sample size, as explained in Section 5. For example, Figures 9 and 10 are the analogues of Figures 1 and 5 with T = 1000. We can see that the bias term is almost identical, while the variance term decreases by a factor of about 2.5 (the ratio 1000/400 of the two sample sizes), as predicted by Proposition 3.
In Figure 11 we display the six eigenvalues of operator $\tilde{A} A$ and the $L^2$-norms of the corresponding eigenfunctions when the same approximation basis of six polynomials is used. These true quantities have been computed by Monte Carlo integration. The eigenvalues $\nu_j$ feature a geometric decay w.r.t. the order j, whereas the decay of the norms $\|\phi_j\|^2$ is of a hyperbolic type. This conforms to Assumption 5 and the analysis conducted in Proposition 4. A linear fit of the plotted points gives a decay factor $\hat{\alpha} = 2.254$ for the eigenvalues and a
Figure 12 checks whether the line $\log \lambda_T^* = \log c - \gamma \log T$, induced by Proposition 4 (ii), holds in small samples. For ρ = 0.5 the left panel for Case 1 as well as the right panel for Case 2 exhibit a linear relationship between the logarithm of the regularisation parameter minimizing the average MISE over the 1000 Monte Carlo simulations and the logarithm of the sample size, ranging from T = 50 to T = 1000. The OLS estimation of this linear relationship from the plotted pairs delivers ĉ = .226, γ̂ = .752 in Case 1, and ĉ = .012, γ̂ = .428 in Case 2. Both estimated slope coefficients are smaller than 1, and qualitatively consistent with the implications of Proposition 4. Indeed, from Figures 9 and 10 the ISB curve appears to be more convex in Case 2 than in Case 1. This points to a larger δ parameter, and thus to a smaller slope coefficient γ = 1/(1 + 2δ), in Case 2. Inverting the
relationship γ = 1/(1 + 2δ) we get estimates for the decay factor δ, which are δ̂ = .165 and
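The log-linear fit and the inversion of γ = 1/(1 + 2δ) can be sketched as follows. The Python code is a minimal illustration with hypothetical function names; the synthetic pairs in the usage lines are generated from the reported Case 1 estimates and are not the simulation output of the paper.

```python
import math

def fit_log_linear(T_values, lambda_values):
    """OLS fit of log(lambda) = log(c) - gamma * log(T); returns (c, gamma)."""
    x = [math.log(t) for t in T_values]
    y = [math.log(l) for l in lambda_values]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope

def delta_from_gamma(gamma):
    """Invert gamma = 1 / (1 + 2 * delta)."""
    return (1.0 / gamma - 1.0) / 2.0

# Synthetic pairs lying exactly on the line with c = .226 and gamma = .752.
sizes = [50, 100, 200, 400, 1000]
lambdas = [0.226 * t ** (-0.752) for t in sizes]
c_hat, gamma_hat = fit_log_linear(sizes, lambdas)
delta_hat = delta_from_gamma(gamma_hat)
```

For γ̂ = .752 the inversion gives (1/.752 − 1)/2, which rounds to the value .165 quoted in the text.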
By a similar argument, Proposition 4 also explains the better performance of the TiR estimator compared to the $L^2$-regularised estimator that we reported above. Indeed, comparing the ISB curves of the two estimators in Case 1 (Figures 1 and 2) and in Case 2 (Figures 5 and 6), it appears that the TiR estimator features a more convex ISB curve. This implies $\delta > \tilde{\delta}$ and thus a faster rate of convergence of the TiR estimator.
spectral representation of the MISE provided in Proposition 3, and a second method based
on a resampling approximation.
The first data driven selection procedure aims at estimating directly Expression (12) in
order to derive the optimal regularisation parameter. In unreported results we have checked
that the asymptotic MISE, the asymptotic ISB and the asymptotic variance are close to the
ones exhibited in Figures 9 and 10. These true quantities have also been computed by Monte
Carlo integration. We have found an asymptotic optimal lambda equal to .0018 in Case 1
and to .0009 in Case 2, which are of the same magnitudes as .0013 and .0007 in Figures 9
and 10. We have also checked that the linear relationship exhibited in Figure 12 holds true
when deduced from optimizing the asymptotic MISE. The OLS estimation delivers ĉ = .418,
γ̂ = .795 in Case 1, and ĉ = .037, γ̂ = .546 in Case 2, and thus δ̂ = .129 and δ̂ = .418,
respectively.
Algorithm
(i) Perform the spectral decomposition of the matrix $D^{-1} \hat{P}'\hat{P}/T$ to get eigenvalues $\hat{\nu}_j$ and eigenvectors $\hat{w}_j$, normalized to $\hat{w}_j' D \hat{w}_j = 1$, j = 1, ..., K.
(ii) Get a first-step TiR estimator $\bar{\theta}$ using a pilot regularisation parameter $\bar{\lambda}$.
parameter λ can then be estimated with θ̂ instead of θ̄. Besides, if we assume the decay behaviour of Assumptions 5 and 6, the decay factors α and β can be estimated via minus the slopes of the linear fit on the pairs $(\log \hat{\nu}_j, j)$ and on the pairs $(\log \hat{w}_j' B \hat{w}_j, \log j)$, j = 1, ..., K. After getting lambdas minimizing the second-step estimated MISE on a grid of sample sizes we can also estimate γ by regressing the logarithm of lambda on the logarithm of sample size.
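Step (i) of the algorithm and the slope-based estimates of α and β can be sketched in Python. The code is an illustration under stated assumptions: it uses a Cholesky factor of $D$ to turn the problem into a symmetric eigenproblem, which is one standard way of obtaining eigenvectors with $\hat{w}_j' D \hat{w}_j = 1$, and the function names are hypothetical.

```python
import numpy as np

def sobolev_spectrum(P_hat, D):
    """Eigenvalues and D-normalized eigenvectors of D^{-1} P_hat'P_hat / T."""
    P_hat = np.asarray(P_hat, dtype=float)
    D = np.asarray(D, dtype=float)
    T = P_hat.shape[0]
    M = P_hat.T @ P_hat / T
    L = np.linalg.cholesky(D)            # D = L L'
    L_inv = np.linalg.inv(L)
    S = L_inv @ M @ L_inv.T              # symmetric matrix with the same eigenvalues
    nu, U = np.linalg.eigh(S)
    order = np.argsort(nu)[::-1]         # sort eigenvalues in decreasing order
    nu, U = nu[order], U[:, order]
    W = L_inv.T @ U                      # columns w_j satisfy w_j' D w_j = 1
    return nu, W

def decay_factors(nu, sq_norms):
    """Minus the slopes of log(nu_j) on j and of log(sq_norms_j) on log(j)."""
    j = np.arange(1, len(nu) + 1, dtype=float)
    alpha = -np.polyfit(j, np.log(nu), 1)[0]
    beta = -np.polyfit(np.log(j), np.log(sq_norms), 1)[0]
    return alpha, beta
```

Given the $L^2$ Gram matrix `B`, the squared norms entering the second fit can be obtained as `np.sum(W * (B @ W), axis=0)`, that is, the values $\hat{w}_j' B \hat{w}_j$.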
We have used λ̄ ∈ {.0005, .0001} as the pilot regularisation parameter for T = 1000 and
ρ = .5. In Case 1, the average (quartiles) of the selected lambda over 1000 simulations is
equal to .0028 (.0014, .0020, .0033) when λ̄ = .0005, and .0027 (.0007, .0014, .0029) when
λ̄ = .0001. In Case 2, the results are .0009 (.0007, .0008, .0009) when λ̄ = .0005, and .0008
(.0004, .0006, .0009) when λ̄ = .0001. The selection procedure tends to slightly overpenalize
on average, especially in Case 1, but this does not seem to affect the MISE of the
two-step TiR estimator much. Indeed, if we use the optimal data driven regularisation parameter
at each simulation, the MISE based on averages over the 1000 simulations is equal to .0120
for Case 1 and to .0144 for Case 2 when λ̄ = .0005 (resp., .0156 and .0175 when
λ̄ = .0001), which are of the same magnitude as the best MISEs, .0099 and .0121,
in Figures 9 and 10. In Case 1, the tendency of the selection procedure to overpenalize
without unduly affecting efficiency is due to the flatness of the MISE curve.
We also get average values for the decay factors α and β close to the asymptotic ones.
These have been computed through estimating the coefficients of a linear fit for each sim-
ulation, and averaging over the 1000 simulations. For α the average (quartiles) is equal to
2.2502 (2.1456, 2.2641, 2.3628), and for β it is equal to 2.9222 (2.8790, 2.9176, 2.9619).
To compute the average value for the decay factor γ we have used an equally spaced
grid of sample sizes T ∈ {500, 550, ..., 950, 1000} in the variance component of the MISE,
together with the data driven estimate of θ in the bias component of the MISE. Optimizing
on the grid of sample sizes yields an optimal lambda for each sample size per simulation.
The logarithm of the optimal lambda is then regressed on the logarithm of the sample
size, and the estimated slope is averaged over the 1000 simulations to obtain the average
estimated gamma. In Case 1, we get an average (quartiles) of .6081 (.4908, .6134, .6979)
when λ̄ = .0005, and .7224 (.5171, .6517, .7277) when λ̄ = .0001. In Case 2, we get an
average (quartiles) of .5597 (.4918, .5333, .5962) when λ̄ = .0005, and .5764 (.4946, .5416,
The second data driven selection procedure builds on the suggestion of Goh (2004) based
on a subsampling procedure (also called the m-out-of-n (moon) bootstrap). Even if his
theoretical results are derived for semiparametric estimators we believe that they could be
where ϕ̂i,j (x; c, γ) denotes the estimator based on the jth subsample of size mi (mi << T )
original sample of size T with a pilot regularisation parameter λ̄ chosen sufficiently small to
In practice we have chosen 500 subsamples (J = 500) for each subsample size $m_i \in \{50, 60, 70, ..., 100\}$ (I = 6), λ̄ ∈ {.0005, .0001}, and T = 1000. To determine c and γ we
have built a joint grid with values around the OLS estimates coming from Case 1, namely
{.15, .2, .25} × {.7, .75, .8}, and with values around the OLS estimates coming from Case 2,
namely {.005, .01, .015} × {.35, .4, .45}. Note that the two grids yield a similar range for $\lambda_T$.
In the experiments for ρ = 0.5 we want to verify whether the data driven procedure is able
to pick c and γ most of the time in the first set of values in Case 1, and in the second set of
values in Case 2. Over 1000 simulations we have found a frequency of adequate
choices equal to 96% in Case 1 when λ̄ = .0005, and 87% when λ̄ = .0001. In Case 2 we have found 77%
when λ̄ = .0005, and 82% when λ̄ = .0001. These frequencies are scattered among the grid
values.
References
Ai, C., and X. Chen (2003): "Efficient Estimation of Models with Conditional Moment
Restrictions Containing Unknown Functions", Econometrica, 71, 1795-1843.
Carrasco, M., Florens, J.-P., and E. Renault (2005): "Linear Inverse Problems in
Structural Econometrics: Estimation Based on Spectral Decomposition and Regulari-
sation", forthcoming in the Handbook of Financial Econometrics.
Darolles, S., Florens, J.-P., and E. Renault (2003): "Nonparametric Instrumental Re-
gression", D.P.
Hall, P., and J. Horowitz (2005): "Nonparametric Methods for Inference in the Pres-
ence of Instrumental Variables", Annals of Statistics.
White, H., and J. Wooldridge (1991): "Some Results on Sieve Estimation with Depen-
dent Observations", in Nonparametric and Semiparametric Methods in Econometrics
and Statistics, Proceedings of the Fifth International Symposium In Economic Theory
and Econometrics, Cambridge University Press.
[Figure 1 plot. Panels, left to right: MISE against λ (scale x 10^-3, range 0 to 5); ISB against λ (scale x 10^-3, range 0 to 5); estimated and true functions against x in [0, 1].]
Figure 1: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0.5, and sample size is T = 400.
[Figure 2 plot. Panels, left to right: MISE against λ (range 0 to 0.05); ISB against λ (range 0 to 0.05); estimated and true functions against x in [0, 1].]
Figure 2: MISE (left panel), ISB (central panel) and estimated function (right panel) for the
regularised estimator using L2 norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0.5, and sample size is T = 400.
[Figure 3 plot. Panels, left to right: MISE against λ (scale x 10^-3, range 0 to 5); ISB against λ (scale x 10^-3, range 0 to 5); estimated and true functions against x in [0, 1].]
Figure 3: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0, and sample size is T = 400.
[Figure 4 plot. Panels, left to right: MISE against λ (range 0 to 0.05); ISB against λ (range 0 to 0.05); estimated and true functions against x in [0, 1].]
Figure 4: MISE (left panel), ISB (central panel) and estimated function (right panel) for the
regularised estimator using L2 norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0, and sample size is T = 400.
[Figure 5 plot. Panels, left to right: MISE against λ (scale x 10^-3, range 0 to 5); ISB against λ (scale x 10^-3, range 0 to 5); estimated and true functions against x in [0, 1].]
Figure 5: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0.5, and sample size is T = 400.
[Figure 6 plot. Panels, left to right: MISE against λ (range 0 to 0.05); ISB against λ (range 0 to 0.05); estimated and true functions against x in [0, 1].]
Figure 6: MISE (left panel), ISB (central panel) and estimated function (right panel) for the
regularised estimator using L2 norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0.5, and sample size is T = 400.
[Figure 7 plot. Panels, left to right: MISE against λ (scale x 10^-3, range 0 to 5); ISB against λ (scale x 10^-3, range 0 to 5); estimated and true functions against x in [0, 1].]
Figure 7: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0, and sample size is T = 400.
[Figure 8 plot. Panels, left to right: MISE against λ (range 0 to 0.05); ISB against λ (range 0 to 0.05); estimated and true functions against x in [0, 1].]
Figure 8: MISE (left panel), ISB (central panel) and estimated function (right panel) for the
regularised estimator using L2 norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0, and sample size is T = 400.
[Figure 9 plot. Panels, left to right: MISE against λ (scale x 10^-3, range 0 to 5); ISB against λ (scale x 10^-3, range 0 to 5); estimated and true functions against x in [0, 1].]
Figure 9: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0.5, and sample size is T = 1000.
[Figure 10 plot. Panels, left to right: MISE against λ (scale x 10^-3, range 0 to 5); ISB against λ (scale x 10^-3, range 0 to 5); estimated and true functions against x in [0, 1].]
Figure 10: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0.5, and sample size is T = 1000.
[Figure 11 plot. Left panel "Eigenvalues": $\log(\nu_j)$ against j = 1, ..., 6. Right panel "Eigenfunctions": $\log(\|\phi_j\|^2)$ against log(j).]
Figure 11: The six largest eigenvalues (left panel) and the $L^2$-norms of the corresponding eigenfunctions (right panel) of operator $\tilde{A} A$.
[Figure 12 plot. Left panel "Case 1: Beta" and right panel "Case 2: Sin": $\log(\lambda_T)$ against log(T).]
Figure 12: Log of optimal regularisation parameter as a function of log of sample size for
Case 1 (left panel) and Case 2 (right panel). Correlation parameter is ρ = 0.5.
Appendix 1
This covers the special case of the TiR estimator in Definition 1, where G(ϕ) = kϕk2H .
From Theorem 2.2 of White and Wooldridge (1991), the estimator ϕ̂ in (A.1) is well-defined
and measurable if
random variable QT (ϕ) for event ω ∈ Ω, and (Ω, F, P ) is a complete probability space;
(ii) mappings ϕ → G(ϕ) and ϕ → QT (ω, ϕ) are weakly lower semi-continuous on ΘT , P -a.s.,
Proof of Theorem 1: For any T and some given ε > 0, let us define ϕ∗T ∈ ΘT such that
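We have
$$P[\|\hat{\varphi} - \varphi_0\| > \varepsilon] \leq P\Big[ \inf_{\varphi \in \Theta_T : \|\varphi - \varphi_0\| \geq \varepsilon} Q_T(\varphi) + \lambda_T G(\varphi) \leq Q_T(\varphi_T^*) + \lambda_T G(\varphi_T^*) \Big].$$
$$\Longrightarrow \inf_{\varphi \in \Theta : \|\varphi - \varphi_0\| \geq \varepsilon} Q_\infty(\varphi) + \lambda_T G(\varphi) + \inf_{\varphi \in \Theta_T} \Delta Q_T(\varphi) \leq Q_\infty(\varphi_T^*) + \lambda_T G(\varphi_T^*) + \sup_{\varphi \in \Theta_T} |\Delta Q_T(\varphi)|$$
$$= \rho_T + 2\delta_T.$$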
Since $\lambda_T \to 0$ such that $(T\lambda_T)^{-1} \to 0$, P-a.s., for a chosen as in (iv) we have $Z_T \overset{p}{\to} 0$, and we
deduce $P[\|\hat{\varphi} - \varphi_0\| > \varepsilon] \leq P[Z_T \geq 1] \to 0$. Since ε > 0 is arbitrary, the proof is concluded.
is not satisfied. Then there exists ε > 0 and a sequence $(\lambda_n)$ such that $\lambda_n \searrow 0$ and
$$C_\varepsilon(\lambda_n) \leq 0, \quad \forall n \in \mathbb{N}. \tag{21}$$
By definition of function $C_\varepsilon(\lambda)$, for any λ > 0 and η > 0, there exists ϕ ∈ Θ such that
we deduce from (21) that there exists a sequence $(\varphi_n)$ such that $\varphi_n \in \Theta$, $\|\varphi_n - \varphi_0\| \geq \varepsilon$, and
Obviously, the simultaneous holding of (23) and (24) violates Assumption (10). ∎
In this Section we check that the assumptions in A.1.1 and A.1.2 hold for the special case
i) The mapping $\varphi \to \|\varphi\|_H^2$ is lower semi-continuous on $H^2[0,1]$ w.r.t. the norm $\|.\|$ [see
ii) Let us verify that the assumptions of Proposition 2 are satisfied. Clearly function
$G(\varphi) = \|\varphi\|_H^2$ is bounded from below by 0. Let us now check that Assumption (10) in
Proposition 2 is satisfied.
Proof: Let ε > 0 and let $(\varphi_n)$ be a sequence in Θ such that $\|\varphi_n - \varphi_0\| \geq \varepsilon$ for all n ∈ N,
and
$$Q_\infty(\varphi_n) \to 0 \quad \text{as } n \to \infty. \tag{25}$$
We have to prove
$$\|\varphi_n\|_H \to \infty \quad \text{as } n \to \infty. \tag{26}$$
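To this aim, define sequence $e_n = \dfrac{\varphi_n - \varphi_0}{\|\varphi_n - \varphi_0\|}$, n ∈ N. Then, $\|e_n\| = 1$ for all n ∈ N, and from Assumption 1 and (25),
$$\langle e_n, A^* A e_n \rangle = \frac{\langle \Delta\varphi_n, A^* A \Delta\varphi_n \rangle}{\|\varphi_n - \varphi_0\|^2} \leq \frac{1}{c_2 \varepsilon^2} Q_\infty(\varphi_n) \to 0,$$
as n → ∞. Let $\Pi_{(N)}$ denote the orthogonal projection [w.r.t. the scalar product $\langle ., . \rangle$] on
$$\|\Pi_{(N)} e_n\|^2 = \sum_{j=1}^{N} \langle \psi_j, e_n \rangle^2 \leq \frac{1}{\mu_N} \sum_{j=1}^{N} \mu_j \langle \psi_j, e_n \rangle^2 \leq \frac{1}{\mu_N} \sum_{j=1}^{\infty} \mu_j \langle \psi_j, e_n \rangle^2 = \frac{1}{\mu_N} \langle e_n, A^* A e_n \rangle \to 0, \quad \text{as } n \to \infty,$$
that is $\|\Pi_{(N)} e_n\| \to 0$ as n → ∞, for any N ∈ N.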
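Let us now derive a lower bound for the Sobolev norm $\|e_n\|_H$. We have
$$\|e_n\|_H \geq \|\Pi^{\perp}_{(N)} e_n\|_H - \|\Pi_{(N)} e_n\|_H, \tag{27}$$
where $\Pi^{\perp}_{(N)} = 1 - \Pi_{(N)}$ denotes the orthogonal projection on span$\{\psi_j : j \geq N + 1\}$. Let us derive bounds for the two terms in the RHS of (27). We have
$$\|\Pi^{\perp}_{(N)} e_n\|_H = \Big\| \sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle \psi_j \Big\|_H = \Bigg\| \frac{\sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle \psi_j}{\big( \sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle^2 \big)^{1/2}} \Bigg\|_H \Big( \sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle^2 \Big)^{1/2}$$
$$\geq \inf_{\varphi \in S_{N+1} : \|\varphi\| = 1} \|\varphi\|_H \Big( \sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle^2 \Big)^{1/2} = M_{N+1} \big( 1 - \|\Pi_{(N)} e_n\|^2 \big)^{1/2},$$
since $\|e_n\| = 1$, where $S_{N+1} = $ span$\{\psi_j : j \geq N + 1\}$, and $M_{N+1} = \inf_{\varphi \in S_{N+1} : \|\varphi\| = 1} \|\varphi\|_H$.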
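Moreover,
$$\|\Pi_{(N)} e_n\|_H^2 = \Big\| \sum_{j=1}^{N} \langle \psi_j, e_n \rangle \psi_j \Big\|_H^2 = \sum_{j,l=1}^{N} \langle \psi_j, e_n \rangle \langle \psi_l, e_n \rangle \langle \psi_j, \psi_l \rangle_H$$
$$\leq \sum_{j,l=1}^{N} |\langle \psi_j, e_n \rangle|\, |\langle \psi_l, e_n \rangle|\, \|\psi_j\|_H \|\psi_l\|_H \leq \max_{j=1,...,N} \|\psi_j\|_H^2 \sum_{j,l=1}^{N} |\langle \psi_j, e_n \rangle|\, |\langle \psi_l, e_n \rangle|$$
$$= \bar{M}_N \Big( \sum_{j=1}^{N} |\langle \psi_j, e_n \rangle| \Big)^2 \leq N \bar{M}_N \sum_{j=1}^{N} \langle \psi_j, e_n \rangle^2 = N \bar{M}_N \|\Pi_{(N)} e_n\|^2,$$
where $\bar{M}_N = \max_{j=1,...,N} \|\psi_j\|_H^2$. Thus, we get from (27)
$$\|e_n\|_H \geq M_{N+1} \big( 1 - \|\Pi_{(N)} e_n\|^2 \big)^{1/2} - c_N \|\Pi_{(N)} e_n\|, \tag{28}$$
for any N and n ∈ N, where $c_N = \sqrt{N \bar{M}_N}$. Note that, since $\bar{M}_N \geq \|\psi_N\|_H^2 \geq M_N^2$, by
follows
$$\|e_n\|_H \geq M_{N_n+1} \big( 1 - \|\Pi_{(N_n)} e_n\|^2 \big)^{1/2} - c_{N_n} \|\Pi_{(N_n)} e_n\|, \quad \text{for any } n \in \mathbb{N}, \tag{29}$$
for any sequence of integers $(N_n)$.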
Let us now prove that there exists a sequence of integers $(N_n)$ such that the RHS of (29)
$$n(1) = \min \{ n^* \in \mathbb{N} \mid c_1 \|\Pi_{(1)} e_n\| \leq 1 \text{ for all } n \geq n^* \},$$
$$n(N) = \min \{ n^* \in \mathbb{N} \mid n^* > n(N-1),\ c_N \|\Pi_{(N)} e_n\| \leq 1 \text{ for all } n \geq n^* \}, \quad N = 2, ...$$
Since $c_N \|\Pi_{(N)} e_n\| \to 0$ as n → ∞, for any N ∈ N, it follows that n(N) < ∞, for any N ∈ N,
and the sequence n(N), N = 1, 2, ..., is strictly increasing. Then, let the sequence of integers
By construction, we have
$$c_{N_n} \|\Pi_{(N_n)} e_n\| \leq 1, \tag{30}$$
$$\|\Pi_{(N_n)} e_n\| \leq 1/2, \quad \forall n \text{ large enough}. \tag{31}$$
Using Bounds (30) and (31) in Inequality (29), we get $\|e_n\|_H \geq M_{N_n+1} (3/4)^{1/2} - 1 \to \infty$,
as n → ∞, from Assumption 2.
Finally, we get
$$\geq \varepsilon \|e_n\|_H - \|\varphi_0\|_H \to \infty.$$
Therefore, (26) follows, and the proof is concluded. ∎
Appendix 2
In this Appendix we derive the asymptotic expansion of the MISE with deterministic
The objective function of the TiR estimator becomes in the linear case
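$$Q_T(\varphi) + \lambda_T \|\varphi\|_H^2 = \frac{1}{T} \sum_{t=1}^{T} \big[ (\hat{A}\varphi)(Z_t) - \hat{r}(Z_t) \big]^2 + \lambda_T \langle \varphi, \varphi \rangle_H. \tag{33}$$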
Let us now prove that this objective function can be written as a quadratic form in $\varphi \in H^2[0,1]$. To this aim, let us introduce the dual operator $\hat{\tilde{A}}$ of $\hat{A}$.
Lemma A.1: Under regularity conditions, the following properties hold P-a.s.:
(iii) There exists a linear operator $\hat{\tilde{A}}$ from $L^2(F_Z)$ into $H^2[0,1]$, such that
$$\langle h, \hat{\tilde{A}} \psi \rangle_H = \frac{1}{T} \sum_{t=1}^{T} (\hat{A}h)(Z_t)\, \psi(Z_t), \quad \text{for any } \psi \in L^2(F_Z) \text{ and } h \in H^2[0,1];$$
(iv) Operator $\hat{\tilde{A}} \hat{A} : H^2[0,1] \to H^2[0,1]$ is compact.
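Then, from Lemma A.1 i)-iii), Criterion (33) can be rewritten as
$$Q_T(\varphi) + \lambda_T \|\varphi\|_H^2 = \langle \varphi, (\lambda_T + \hat{\tilde{A}} \hat{A}) \varphi \rangle_H - 2 \langle \varphi, \hat{\tilde{A}} \hat{r} \rangle_H, \tag{34}$$
Theorem 3.4]. It follows that the quadratic criterion function (34) admits a global minimum
over $H^2[0,1]$. It is given by the first-order condition $(\hat{\tilde{A}} \hat{A} + \lambda_T) \hat{\varphi} = \hat{\tilde{A}} \hat{r}$, that is
$$\hat{\varphi} = (\lambda_T + \hat{\tilde{A}} \hat{A})^{-1} \hat{\tilde{A}} \hat{r}. \tag{35}$$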
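where $\Delta\hat{f}(w|z) := \hat{f}(w|z) - f(w|z)$. Hence, $\hat{\varphi} = (\lambda_T + \hat{\tilde{A}} \hat{A})^{-1} \hat{\tilde{A}} \hat{\psi} + (\lambda_T + \hat{\tilde{A}} \hat{A})^{-1} \hat{\tilde{A}} \hat{A} \varphi_0$,
which yields
$$\hat{\varphi} - \varphi_0 = (\lambda_T + \tilde{A}A)^{-1} \tilde{A} \hat{\psi} + \big[ (\lambda_T + \tilde{A}A)^{-1} \tilde{A}A \varphi_0 - \varphi_0 \big] + R_T =: V_T + D(\lambda_T) + R_T, \tag{36}$$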
Lemma A.2: Assume the bandwidth conditions $h_T^m = o(\lambda_T)$, $\lambda_T / T = o(h_T^{d_Z})$, where m is
the order of the kernel K, and $d_Z$ the dimension of Z. Then, under regularity assumptions,
$$E[\|R_T\|^2] = o\big( E[\|V_T + D(\lambda_T)\|^2] \big).$$
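$$E[\|\hat{\varphi} - \varphi_0\|^2] = E[\|V_T + D(\lambda_T)\|^2] + E[\|R_T\|^2] + 2E[\langle V_T + D(\lambda_T), R_T \rangle] = E[\|V_T + D(\lambda_T)\|^2] + o\big( E[\|V_T + D(\lambda_T)\|^2] \big),$$
$$E[\|V_T + D(\lambda_T)\|^2] = \big\| (\lambda_T + \tilde{A}A)^{-1} \tilde{A}A \varphi_0 - \varphi_0 + (\lambda_T + \tilde{A}A)^{-1} \tilde{A} E\hat{\psi} \big\|^2 + E\Big[ \big\| (\lambda_T + \tilde{A}A)^{-1} \tilde{A} (\hat{\psi} - E\hat{\psi}) \big\|^2 \Big], \tag{37}$$
we get
$$E[\|\hat{\varphi} - \varphi_0\|^2] = \big\| (\lambda_T + \tilde{A}A)^{-1} \tilde{A}A \varphi_0 - \varphi_0 + (\lambda_T + \tilde{A}A)^{-1} \tilde{A} E\hat{\psi} \big\|^2 + E\Big[ \big\| (\lambda_T + \tilde{A}A)^{-1} \tilde{A} (\hat{\psi} - E\hat{\psi}) \big\|^2 \Big], \tag{38}$$
up to a term which is asymptotically negligible w.r.t. the RHS. This asymptotic expansion
consists of a bias term (regularisation bias plus estimation bias) and a variance term, which
will be analysed separately below in Lemmas A.3 and A.4. Combining these two Lemmas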
Lemma A.3: Under regularity conditions, up to a term which is asymptotically negligible w.r.t. the RHS, we have
$$E\Big[ \big\| (\lambda_T + \tilde{A}A)^{-1} \tilde{A} (\hat{\psi} - E\hat{\psi}) \big\|^2 \Big] = \frac{1}{T} \sum_{j=1}^{\infty} \frac{\nu_j}{(\lambda_T + \nu_j)^2} \|\phi_j\|^2.$$
Proof: See Appendix B.
Lemma A.4: Define $b(\lambda_T) = \big\| (\lambda_T + \tilde{A}A)^{-1} \tilde{A}A \varphi_0 - \varphi_0 \big\|$. Then, under regularity conditions
Appendix 3
i) The next Lemma A.5 characterizes the variance term of the asymptotic expansion of the
MISE in Proposition 3.
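Lemma A.5: Let $\nu_j$ and $\|\phi_j\|^2$ satisfy Assumption 5, and define the function
$$I(\lambda) = \sum_{j=1}^{\infty} \frac{\nu_j}{(\lambda + \nu_j)^2} \|\phi_j\|^2, \quad \lambda > 0.$$
Then,
$$\lim_{\lambda \to 0} \lambda [\log(1/\lambda)]^{\beta} I(\lambda) = \Big( \frac{1}{\alpha} \Big)^{1-\beta} C_2.$$
Proof: See Appendix B.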
From Lemma A.5 and using Assumption 6, we get
$$M_T(\lambda) = c_1 \frac{1}{T} \frac{1}{\lambda [\log(1/\lambda)]^{\beta}} + c_2 \lambda^{2\delta},$$
for λ → 0 and T → ∞, where $c_1 = \big( \tfrac{1}{\alpha} \big)^{1-\beta} C_2$, $c_2 = C_3^2$.
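ii) The optimal sequence $\lambda_T^*$ is obtained by minimizing function $M_T(\lambda)$ w.r.t. λ. We have
$$\frac{dM_T(\lambda)}{d\lambda} = -c_1 \frac{1}{T} \frac{1}{\lambda^2 [\log(1/\lambda)]^{2\beta}} \Big( [\log(1/\lambda)]^{\beta} - \lambda \beta [\log(1/\lambda)]^{\beta-1} \frac{1}{\lambda} \Big) + 2 c_2 \delta \lambda^{2\delta-1}$$
$$= -c_1 \frac{1}{T} \frac{\log(1/\lambda) - \beta}{\lambda^2 [\log(1/\lambda)]^{\beta+1}} + 2 c_2 \delta \lambda^{2\delta-1}.$$
Thus
$$\frac{dM_T(\lambda_T^*)}{d\lambda} = 0 \iff \frac{1}{T} \frac{c_1}{2 c_2 \delta} \frac{\log(1/\lambda_T^*) - \beta}{[\log(1/\lambda_T^*)]^{\beta+1}} = (\lambda_T^*)^{2\delta+1}. \tag{39}$$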
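To solve the latter equation for $\lambda_T^*$, define $\tau_T := \log(1/\lambda_T^*)$. Then $\tau_T$ satisfies
$$\tau_T = c_3 + \frac{1}{1+2\delta} \log T + \frac{1+\beta}{1+2\delta} \log \tau_T - \frac{1}{1+2\delta} \log(\tau_T - \beta),$$
where $c_3 = (1+2\delta)^{-1} \log(2 c_2 \delta / c_1)$. Since $\log \tau_T$ and $\log(\tau_T - \beta)$ both behave like $\log \log T$, it follows that
$$\tau_T = c_4 + \frac{1}{1+2\delta} \log T + \frac{\beta}{1+2\delta} \log \log T + o(\log \log T),$$
$$\log(\lambda_T^*) = -c_4 - \frac{1}{1+2\delta} \log T - \frac{\beta}{1+2\delta} \log \log T + o(\log \log T).$$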
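$$M_T(\lambda_T^*) = c_1 \frac{1}{T} \frac{1}{\lambda_T^* [\log(1/\lambda_T^*)]^{\beta}} + c_2 (\lambda_T^*)^{2\delta} = c_1 \frac{1}{T} \frac{1}{\lambda_T^* \tau_T^{\beta}} + c_2 (\lambda_T^*)^{2\delta}.$$
From (39),
$$\lambda_T^* = \Big( \frac{1}{T} \frac{c_1}{2 c_2 \delta} \frac{\tau_T - \beta}{\tau_T^{\beta+1}} \Big)^{\frac{1}{2\delta+1}} = c_5 T^{-\frac{1}{2\delta+1}} \tau_T^{-\frac{\beta}{2\delta+1}},$$
for some constant $c_5$, up to a term which is negligible w.r.t. the RHS. Thus we get
$$M_T(\lambda_T^*) = \frac{1}{T} \frac{c_1}{c_5} T^{\frac{1}{2\delta+1}} \frac{1}{\tau_T^{-\frac{\beta}{2\delta+1} + \beta}} + c_2 c_5^{2\delta} T^{-\frac{2\delta}{2\delta+1}} \tau_T^{-\frac{2\delta\beta}{2\delta+1}} = c_6 T^{-\frac{2\delta}{2\delta+1}} \tau_T^{-\frac{2\delta\beta}{2\delta+1}} = c_7 T^{-\frac{2\delta}{2\delta+1}} (\log T)^{-\frac{2\delta\beta}{2\delta+1}},$$
for some constants $c_6$ and $c_7$, up to a term which is negligible w.r.t. the RHS.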
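The first-order condition (39) can be checked numerically. The Python sketch below minimizes $M_T(\lambda)$ by a golden-section search over $\tau = \log(1/\lambda)$ and then compares the two sides of (39) at the minimizer. The constants $c_1 = c_2 = 1$, $\delta = 0.5$, $\beta = 1$, $T = 10^6$ and all function names are illustrative assumptions, not values from the paper.

```python
import math

def mise_proxy(lam, T, c1, c2, delta, beta):
    """M_T(lambda) = c1 / (T * lambda * log(1/lambda)**beta) + c2 * lambda**(2*delta)."""
    tau = math.log(1.0 / lam)
    return c1 / (T * lam * tau ** beta) + c2 * lam ** (2.0 * delta)

def optimal_lambda(T, c1, c2, delta, beta, tau_max=40.0, tol=1e-9):
    """Golden-section search over tau = log(1/lambda) on (beta, tau_max]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    f = lambda tau: mise_proxy(math.exp(-tau), T, c1, c2, delta, beta)
    a, b = beta + 1e-6, tau_max
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return math.exp(-(a + b) / 2.0)

def foc_ratio(lam, T, c1, c2, delta, beta):
    """Ratio of the left-hand side to the right-hand side of Equation (39)."""
    tau = math.log(1.0 / lam)
    lhs = (1.0 / T) * (c1 / (2.0 * c2 * delta)) * (tau - beta) / tau ** (beta + 1.0)
    return lhs / lam ** (2.0 * delta + 1.0)

params = dict(T=1e6, c1=1.0, c2=1.0, delta=0.5, beta=1.0)
lam_star = optimal_lambda(**params)
ratio = foc_ratio(lam_star, **params)    # close to 1 when (39) holds
```

Searching over $\tau$ instead of $\lambda$ keeps the problem well scaled, since the minimizer can be several orders of magnitude below one.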