
TIKHONOV REGULARISATION FOR FUNCTIONAL

MINIMUM DISTANCE ESTIMATORS

P. Gagliardini∗ and O. Scaillet†

This version: May 2006

(first version: May 2006)


∗ University of Lugano.

† HEC Genève and Swiss Finance Institute.

Abstract

We study the asymptotic properties of a Tikhonov regularised (TiR) estimator of a func-

tional parameter based on a minimum distance principle for nonparametric conditional mo-

ment restrictions. The estimator is computationally tractable and even takes a closed form

in the linear case. We derive its Mean Integrated Squared Error (MISE), its rate of conver-

gence and its pointwise asymptotic normality under a regularisation parameter depending

on the sample size. The optimal value of the regularisation parameter is characterised. We

illustrate our theoretical findings and the small sample properties with simulation results

for two numerical examples. We also discuss two data driven selection procedures of the

regularisation parameter via a spectral representation and a subsampling approximation of

the MISE.

Keywords and phrases: Minimum Distance, Nonparametric Estimation, Ill-posed

Inverse Problems, Endogeneity, Generalized Method of Moments, Subsampling, Tikhonov

Regularisation.

JEL classification: C13, C14.

AMS 2000 classification: 62G08, 62G20.

1 Introduction

Minimum distance and extremum estimators have received a lot of attention in the litera-

ture to exploit conditional moment restrictions assumed to hold true on the data generating

process [see e.g. Newey and McFadden (1994) for a review]. In a parametric setting, leading examples are the Ordinary Least Squares estimator, which takes a closed form, and the Nonlinear Least Squares estimator, which is computed through numerical optimization. Corrections for endogeneity are provided by the Instrumental Variable estimator in the linear

case and by the Generalised Method of Moments estimator in the nonlinear case.

In a functional setting, regression curves are inferred by local polynomial estimators and

sieve estimators. A well known example is the Rosenblatt-Parzen kernel estimator. Recently,

several suggestions have been made to correct for endogeneity in the nonparametric context

as well, mainly motivated by the interest in non-parametric IV estimation of structural equations. Newey and Powell (NP, 2003) consider the problem of estimating non-parametrically a regression function, which is the conditional expectation of the dependent variable given a set of instruments. They propose a consistent minimum distance estimator, which is a non-parametric analog of the Two-Stage Least Squares estimator. The NP methodology extends to the case of general, nonlinear conditional moment restrictions. Ai and Chen (AC,

2003) follow a similar approach to estimate an unknown function contained in a conditional

moment. Although their focus is more on the efficient estimation of the parametric compo-

nent in a semi-parametric conditional moment specification, they show that the estimator

of the functional component converges at a rate faster than $T^{-1/4}$ in an appropriate metric.
Darolles, Florens and Renault (DFR, 2003) and Hall and Horowitz (HH, 2005) also consider

non-parametric estimation of an instrumental regression function, but focus on the linear

case. Their estimation approach is based on the empirical analog of the conditional moment

restriction, seen as a linear integral equation in the unknown functional parameter. HH de-

rive the optimal rate of convergence of their estimator in quadratic mean. [Cite some other

peripheral papers (Chernozhukov, Vanhems)].

The main theoretical difficulty to overcome in non-parametric estimation with endogene-

ity is ill-posedness. Ill-posedness occurs since the mapping of the reduced form parameter

(that is, the distribution of the data) into the structural parameter (the instrumental regression function) is not continuous. This may have serious consequences: in particular, it can lead to inconsistency of the estimators. The problem of ill-posedness has been addressed in the literature in different ways. NP and AC propose to introduce bounds on the derivatives of the functional parameter of interest, which amounts to assuming a compact parameter

space. In the linear case, DFR and HH adopt a regularisation technique, which results in a

kind of ridge regression approach in a functional setting.

The aim of this paper is to introduce a new minimum distance estimator for a func-

tional parameter identified by conditional moment restrictions. To address the issue of

ill-posedness, we consider penalized extremum estimators which minimize a criterion of the

type $Q_T(\varphi) + \lambda_T G(\varphi)$, where $Q_T(\varphi)$ is a minimum distance criterion in the functional parameter $\varphi$, $G(\varphi)$ is a penalty function and $\lambda_T$ is a positive sequence converging to zero. The penalty function $G(\varphi)$ corresponds to the Sobolev norm of function $\varphi$, which involves the $L^2$ norms of both $\varphi$ and its derivative $\nabla\varphi$. The basic idea behind our estimator is that

the term λT G(ϕ) penalizes highly oscillating components of the estimator, which are oth-

erwise unduly enhanced by the minimum distance criterion QT (ϕ) because of ill-posedness.

The amount of regularisation is tuned by parameter λT . We call our estimator a Tikhonov

Regularised (TiR) estimator, since the penalty term is inspired by the pioneering paper of

Tikhonov (1963) on the regularisation of ill-posed inverse problems. We stress that also the

regularisation approach of DFR and HH is an example of Tikhonov regularisation, but with

penalty term involving the L2 norm instead of the Sobolev norm of the parameter. To avoid

confusion, we refer to the DFR and HH estimator as a regularised estimator with L2 norm.

Our paper contributes to the literature along several directions. First, we introduce a

nonparametric estimator for conditional moment restrictions, which admits the following

features: (i) it applies in the general (linear and nonlinear) setting; (ii) the tuning parameter

is allowed to depend on sample size and to be stochastic; (iii) it may have a faster rate

of convergence than the DFR and HH estimator in the linear case; (iv) it admits a closed

form in the linear case. We emphasize that point (ii) is crucial to develop estimators with

data-driven selection of the tuning parameter. This point is not addressed in the setting of

NP and AC, where the tuning parameter is the bound on the Sobolev norm of the estimator

and is assumed fixed in all theoretical results. For the same reason, feature (iv) is not shared

by NP and AC estimator [see Section 2.4 for more details on the links between the TiR

estimator and the literature]. Concerning point (iii), we give in Section 4 the condition

under which this property holds. In our Monte-Carlo experiments in Section 6, we have

found a superior performance of the TiR estimator compared to the regularised estimator with L2 norm.¹

Second, we study rather in depth the asymptotic properties of our estimator. In par-

ticular: (i) we prove the consistency of the TiR estimator; (ii) we derive the asymptotic

expansion of the Mean Integrated Squared Error (MISE) as a function of the sample size

and the (deterministic) regularisation parameter; (iii) we prove the pointwise asymptotic

normality of the TiR estimator. To the best of our knowledge, results (ii) and (iii), as well

as (i) for a sequence of stochastic regularisation parameters, are new for non-parametric

estimators of this type. In particular, the asymptotic expansion of the MISE allows us to

study the effect of the regularisation parameter on the variance term and on the bias term

of the TiR estimator, to define the optimal sequence of regularisation parameters, and to

derive the associated optimal rate of convergence of the TiR estimator. The methodology

is easily extended to the case of regularisation with L2 norm, so that these results are in-

teresting also for the study of the properties of the DFR and HH estimators. Finally, the

asymptotic expansion of the MISE suggests a procedure for the data-driven selection of the

regularisation parameter, that we implement in the Monte-Carlo study.

Third, we investigate the attractiveness of the TiR estimator from an applied point of view. In the nonlinear case, the TiR estimator only requires running an unconstrained optimisation routine instead of a constrained one, and in the linear case it even takes a closed form. Such numerical tractability is a key advantage in practice, for example when using heavy resampling techniques. The finite sample properties seem very appealing from our numerical experiments on two examples and two data driven selection procedures of the regularisation parameter.

¹ The advantage of the Sobolev norm compared to the L2 norm for regularisation is also pointed out in a numerical example in Kress (1999), Example 16.21.

The rest of the paper is organized as follows. In Section 2, we first introduce the general

setting of non-parametric estimation under conditional moment restrictions and the problem

of ill-posedness. We then define the TiR estimator and discuss the links with the literature.

In Section 3 we prove the consistency of the TiR estimator. Section 4 is devoted to the

analysis of the MISE and of the optimal rates of convergence of the TiR estimator. The case

of linear moment restrictions is detailed in Section 5. In Section 6 we present a Monte-Carlo

study of the finite sample properties of the TiR estimator. Finally, Section 7 concludes. The

proofs of all results in the paper are gathered in the Appendices. The proofs of technical

Lemmas are collected in a document, which is available from the authors on request.

2 Minimum Distance estimators under Tikhonov regularisation

In this section we introduce the class of Tikhonov Regularised (TiR) estimators. In Section

2.1 we present the general setting of non-parametric Minimum Distance estimation. In

Section 2.2 we highlight its main issue, namely ill-posedness. In Section 2.3 TiR estimators

are defined as a regularisation method for the ill-posedness problem. Finally, links with

estimators and results currently available in the literature are discussed in detail in Section

2.4.

2.1 Nonparametric Minimum Distance estimation

Let $\{(Y_t, X_t, Z_t) : t = 1, \dots, T\}$ be i.i.d. variables, and let the support of $X_t$ be $\mathcal{X} = [0, 1]$. Suppose that the parameter of interest is a function $\varphi_0$ defined on $\mathcal{X}$, which satisfies the conditional moment restriction

$$E_0\left[ g\left(Y, \varphi_0(X)\right) \mid Z \right] = 0, \tag{1}$$

where $g$ is a known function. Parameter $\varphi_0$ belongs to a subset $\Theta$ of $L^2[0,1]$, equipped with the $L^2$ scalar product $\langle \varphi, \psi \rangle = \int_{\mathcal{X}} \varphi(x)\psi(x)\,dx$ and the $L^2$ norm $\|\varphi\| = \langle \varphi, \varphi \rangle^{1/2}$. It is assumed that $\Theta$ is bounded and closed, and that $\varphi_0$ is the unique function $\varphi \in \Theta$ that satisfies the conditional moment restriction (1).

The non-parametric Minimum Distance approach to estimate $\varphi_0$ as in AC and NP relies on $\varphi_0$ minimizing the criterion

$$Q_\infty(\varphi) = E_0\left[ m(\varphi, Z)'\, \Omega_0(Z)\, m(\varphi, Z) \right], \quad \varphi \in \Theta, \tag{2}$$

where $m(\varphi, z) = E_0\left[ g(Y, \varphi(X)) \mid Z = z \right]$, and $\Omega_0(z)$ is a p.d. matrix for any given $z$. This criterion is well-defined if $m(\varphi, z)$ belongs to $L^2_{\Omega_0}(F_Z)$, for any $\varphi \in \Theta$, where $L^2_{\Omega_0}(F_Z)$ denotes the $L^2$ space of square integrable vector-valued functions of $Z$ defined by scalar product $\langle \psi_1, \psi_2 \rangle_{L^2_{\Omega_0}(F_Z)} = E_0\left[ \psi_1(Z)'\, \Omega_0(Z)\, \psi_2(Z) \right]$. Then, the idea is to estimate $\varphi_0$ by the minimizer of the empirical counterpart of Criterion (2). For instance, AC and NP estimate

the conditional moment m (ϕ, z) by an orthogonal polynomials approach, and minimize the

empirical criterion over a finite-dimensional Sieve approximation of Θ based on polynomial

or spline functions.

The main difficulty in non-parametric Minimum Distance estimation is that, contrary to

the standard parametric case, the assumption that function ϕ0 is identified in a bounded and

closed parameter set Θ is not sufficient in general to get the consistency of the estimator.

This is due to the so-called ill-posedness of such an estimation problem.

2.2 Unidentifiability and ill-posedness in Minimum Distance estimation

The goal of this section is to highlight the issue of ill-posedness in Minimum Distance es-

timation [NP; see also Kress (1999), Chapter 15, for a general treatment of ill-posed in-

verse problems, and Carrasco, Florens and Renault (2005) for a survey on inverse problems

in econometrics]. To briefly explain what ill-posedness is, note that solving the equation

E0 [g (Y, ϕ (X)) | Z] = 0 for unknown function ϕ ∈ Θ can be seen as an inverse problem,

which maps the conditional distribution F0 (y, x|z) of (Y, X) given Z = z into the solution

ϕ0 [see Equation (1)]. Ill-posedness arises when this mapping is not continuous. As a conse-

quence, the estimator ϕ̂ of ϕ0 , which is the solution of the inverse problem corresponding to a

consistent estimator F̂ of F0 , is not guaranteed to be consistent. Indeed, by non-continuity,

small deviations of F̂ from F0 may result in large deviations of ϕ̂ from ϕ0 . We refer to

NP for a more in-depth discussion along these lines. In this paper, we prefer to emphasize

the link between ill-posedness and a classical concept in econometrics, namely parameter

identification.

To illustrate the main point, let us consider the case of non-parametric linear IV estimation, where $g(y, \varphi(x)) = \varphi(x) - y$. The moment function $m(\varphi, z) = E_0\left[ \varphi(X) - Y \mid Z = z \right]$ can be written as

$$m(\varphi, z) = (A\varphi)(z) - r(z) = (A\Delta\varphi)(z), \tag{3}$$

where $\Delta\varphi := \varphi - \varphi_0$, operator $A$ is defined by $(A\varphi)(z) = \int \varphi(x) f(w|z)\,dw$ and $r(z) = \int y f(w|z)\,dw$. Conditional moment restriction (1) identifies $\varphi_0$ if and only if operator $A$ is injective. The limit criterion in (2) becomes

$$Q_\infty(\varphi) = E_0\left[ (A\Delta\varphi)(Z)'\, \Omega_0(Z)\, (A\Delta\varphi)(Z) \right] = \langle \Delta\varphi, A^* A \Delta\varphi \rangle, \tag{4}$$

where $A^*$ denotes the adjoint operator of $A$ w.r.t. the scalar products $\langle ., . \rangle$ and $\langle ., . \rangle_{L^2_{\Omega_0}(F_Z)}$.

Under weak regularity conditions, operator $A$ is compact. Thus, $A^* A$ is compact and self-adjoint. We denote by $\{\psi_j : j \in \mathbb{N}\}$ an orthonormal basis in $L^2[0,1]$ of eigenfunctions of operator $A^* A$, and by $\mu_1 \geq \mu_2 \geq \cdots$, with $\mu_j > 0$, the corresponding eigenvalues [see Kress (1999), Section 15.3, for the spectral decomposition of compact, self-adjoint operators]. By compactness of $A^* A$, the eigenvalues are such that $\mu_j \to 0$. Assume that $\varphi_0$ is an interior point of parameter set $\Theta$. Then, the limit criterion $Q_\infty(\varphi)$ can be minimized by a sequence in $\Theta$ such as

$$\varphi_n = \varphi_0 + \varepsilon \psi_n, \quad n \in \mathbb{N}, \tag{5}$$

for $\varepsilon > 0$, which does not converge to $\varphi_0$. Indeed, $Q_\infty(\varphi_n) = \varepsilon^2 \langle \psi_n, A^* A \psi_n \rangle = \varepsilon^2 \mu_n \to 0$ as $n \to \infty$, but $\|\varphi_n - \varphi_0\| = \varepsilon$, $\forall n$. Since we can choose $\varepsilon > 0$ as small as we want, the usual identification assumption [e.g., White and Wooldridge (1991)]

$$\inf_{\varphi \in \Theta : \|\varphi - \varphi_0\| \geq \varepsilon} Q_\infty(\varphi) > 0 = Q_\infty(\varphi_0), \quad \text{for } \varepsilon > 0, \tag{6}$$

is not satisfied. In other words, function ϕ0 is not identified in Θ as an isolated minimum

of Q∞ . This is the identification problem of Minimum Distance estimation with functional

parameter. Failure of identification condition (6) is due to 0 being a limit point of the eigen-

values of operator A∗ A. It applies in the general setting of conditional moment restriction

(1), whenever the linearization of moment function m(ϕ, z) around ϕ = ϕ0 involves a com-

pact operator. This is the maintained assumption in our paper, and is stated below.

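Assumption 1 (Ill-posedness): The moment function $m(\varphi, z)$ is such that $m(\varphi, z) = (A\Delta\varphi)(z) + R(\varphi, z)$, for any $\varphi \in \Theta$, where

(i) the operator $A$ defined by $(A\Delta\varphi)(z) = \int \frac{\partial g}{\partial v}\left(y, \varphi_0(x)\right) f(w|z)\, \Delta\varphi(x)\,dw$ is a compact operator in $L^2[0,1]$;

(ii) the second-order term $R(\varphi, z)$ is such that $\sup_{\varphi \in \Theta} \|R(\varphi, .)\|_{L^2_{\Omega_0}(F_Z)} / \|A\Delta\varphi\|_{L^2_{\Omega_0}(F_Z)} < 1$.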

Under Assumption 1, the identification condition (6) is not satisfied, and the Minimum

Distance estimator which minimizes the empirical counterpart of criterion Q∞ (ϕ) over (a

Sieve approximation of) set Θ is not consistent.
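The failure of condition (6) can be checked numerically on a discretised version of the linear IV example. The sketch below is only an illustration under assumptions that are not in the paper: Z is treated as uniform on [0, 1], Ω0 = 1, and the conditional density f(x|z) is a hypothetical Gaussian bump centred at z. It evaluates the limit criterion along the sequence (5) and the corresponding L2 distances.

```python
import numpy as np

def illposedness_demo(eps=0.5, n_grid=200, bandwidth=0.1):
    """Evaluate the limit criterion along phi_n = phi_0 + eps * psi_n on a grid.

    Hypothetical setting: Z and X live on [0, 1], Omega_0 = 1, Z is treated as
    uniform, and f(x|z) is a Gaussian bump in x centred at z.
    """
    x = (np.arange(n_grid) + 0.5) / n_grid            # cell midpoints
    dx = 1.0 / n_grid
    dens = np.exp(-0.5 * ((x[None, :] - x[:, None]) / bandwidth) ** 2)
    dens /= dens.sum(axis=1, keepdims=True) * dx       # each row integrates to one
    A = dens * dx                                      # (A phi)(z_i) ~ sum_k A[i, k] phi(x_k)

    # With uniform grid weights the adjoint of A is its transpose.
    mu, vecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(mu)[::-1]                       # decreasing eigenvalues
    mu = np.clip(mu[order], 0.0, None)                 # remove tiny negative rounding errors
    psi = vecs[:, order] / np.sqrt(dx)                 # eigenfunctions with unit L2 norm

    delta = eps * psi                                  # columns are phi_n - phi_0
    criterion = np.sum((A @ delta) ** 2, axis=0) * dx  # Q(phi_n) = ||A delta||^2
    distance = np.sqrt(np.sum(delta ** 2, axis=0) * dx)
    return mu, criterion, distance
```

In this sketch the criterion values should decay towards zero as n grows while every distance stays equal to eps, which is the behaviour described around Equations (5) and (6).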

2.3 Tikhonov Regularised (TiR) estimators

In this paper, we address the issue of ill-posedness by introducing Minimum Distance estima-

tors based on Tikhonov regularisation. We consider extremum estimators which minimize a

criterion of the type QT (ϕ) + λT G (ϕ), where QT (ϕ) is an empirical counterpart of criterion

Q∞ (ϕ) in (2), G (ϕ) is a penalty function introduced to solve the unidentifiability problem

arising from ill-posedness, and λT is a sequence converging to zero as sample size T increases.

Functions QT (ϕ) and G (ϕ) are defined next.

The conditional moment $m(\varphi, z) = E_0\left[ g(Y, \varphi(X)) \mid Z = z \right]$ can be estimated non-parametrically by $\hat{m}(\varphi, z) = \int g(y, \varphi(x))\, \hat{f}(w|z)\,dw$, where $\hat{f}(w|z)$ denotes a kernel estimator of the density of $(Y, X)$ given $Z = z$ with kernel $K$, bandwidth $h_T$, and $w = (y, x)$. Then, the criterion $Q_T(\varphi)$ is defined by

$$Q_T(\varphi) = \frac{1}{T} \sum_{t=1}^{T} \hat{m}(\varphi, Z_t)'\, \Omega_T(Z_t)\, \hat{m}(\varphi, Z_t),$$

where $\Omega_T(z)$, $T \in \mathbb{N}$, is a sequence of p.d. matrices converging to $\Omega_0(z)$, P-a.s., for any $z$.
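A minimal sketch of the empirical criterion for the scalar linear IV moment g(y, ϕ(x)) = ϕ(x) − y may help fix ideas. It makes simplifying assumptions that are not the paper's: the integral against the kernel density estimator is replaced by a Nadaraya-Watson regression on Z (which drops the smoothing in the w direction), the kernel is Gaussian, and Z is scalar. The names `m_hat` and `q_T` are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def m_hat(phi, y, x, z, z_eval, h):
    """Nadaraya-Watson estimate of E[phi(X) - Y | Z = z_eval] (simplified m-hat)."""
    g = phi(x) - y                                        # moment function at each observation
    w = gaussian_kernel((z_eval[:, None] - z[None, :]) / h)
    return (w @ g) / w.sum(axis=1)                        # kernel-weighted average of g

def q_T(phi, y, x, z, h, omega=None):
    """Empirical criterion: average of omega(Z_t) * m_hat(phi, Z_t)^2 over the sample."""
    m = m_hat(phi, y, x, z, z, h)
    if omega is None:
        omega = np.ones_like(m)                           # Omega_T = 1 by default
    return np.mean(omega * m ** 2)
```

Here `phi` is any vectorised candidate function, for example `lambda v: np.sin(np.pi * v)`, and `h` plays the role of the bandwidth $h_T$.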

Different choices of penalty function $G(\varphi)$ are possible, leading to consistent estimators under the assumptions of Theorem 1 in Section 3 below. In this paper, we focus on the Sobolev norm $G(\varphi) = \|\varphi\|_H^2$. More precisely, we assume that $\varphi_0$ belongs to some subset $\Theta$ of the Sobolev space $H^2[0,1]$, which is defined as the completion of linear space $\{\varphi \in C^1[0,1] \mid \nabla\varphi \in L^2[0,1]\}$ with respect to the L2 scalar product $\langle ., . \rangle$. Sobolev space $H^2[0,1]$ is a Hilbert space w.r.t. the scalar product $\langle \varphi, \psi \rangle_H := \langle \varphi, \psi \rangle + \langle \nabla\varphi, \nabla\psi \rangle$, and the corresponding Sobolev norm is denoted by $\|\varphi\|_H = \langle \varphi, \varphi \rangle_H^{1/2}$.

The Minimum Distance estimator under Tikhonov regularisation with Sobolev norm is

defined next.

Definition 1: The Tikhonov Regularised (TiR) Minimum Distance estimator is defined by

$$\hat{\varphi} = \arg\inf_{\varphi \in \Theta_T} Q_T(\varphi) + \lambda_T \|\varphi\|_H^2, \tag{7}$$

where $\lambda_T$ is a stochastic sequence such that $\lambda_T \geq 0$ and $\lambda_T \to 0$ P-a.s., and $(\Theta_T)$ is an increasing sequence of subsets of the Sobolev space $H^2[0,1]$, which are compact w.r.t. the $L^2$-norm $\|.\|$.

The name Tikhonov Regularised (TiR) estimators that we use to characterize the Mini-

mum Distance estimators introduced in Definition 1 goes back to Tikhonov (1963), in his pi-

oneering paper on the regularisation of ill-posed inverse problems [see Kress (1999), Chapter

16]. The main intuition is that the term $\lambda_T \|\varphi\|_H^2$ in the criterion penalizes highly oscillating components of the estimated function, which would be otherwise unduly enhanced, since the criterion $Q_T(\varphi)$ becomes asymptotically flat along some directions because of ill-posedness. For instance, in the linear IV case where $Q_\infty(\varphi) = \langle \Delta\varphi, A^* A \Delta\varphi \rangle$, these directions correspond to the eigenfunctions $\psi_n$ of operator $A^* A$ to eigenvalues $\mu_n$ close to zero, that is for large $n$ [see Equation (5) and the discussion in Section 2.2]. Typically, $\psi_n$ is a highly oscillating function and $\|\psi_n\|_H \to \infty$ as $n \to \infty$, so that these directions are penalized by term $G(\varphi) = \|\varphi\|_H^2$ in the empirical criterion $Q_T(\varphi) + \lambda_T \|\varphi\|_H^2$. In Theorem 1 in Section 3 below, we provide precise conditions under which the penalty function $G(\varphi) = \|\varphi\|_H^2$ restores the validity of the identification condition (6) and ensures the consistency of the TiR estimator.
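To make the role of the Sobolev penalty concrete, the following sketch solves a discretised linear version of problem (7) on a uniform grid of [0, 1]. It is an illustration under assumptions of our own, not the paper's closed form from Section 5: the operator is represented by a square matrix `A` (quadrature weights included), `r` is a vector of values of r(z) on the same grid, and the derivative is approximated by forward differences. All function names are hypothetical.

```python
import numpy as np

def sobolev_penalty_matrix(n_grid):
    """Matrix P such that dx * phi' P phi approximates ||phi||^2 + ||grad phi||^2."""
    dx = 1.0 / n_grid
    D = np.diff(np.eye(n_grid), axis=0) / dx     # forward differences: (D phi)_i = (phi_{i+1} - phi_i) / dx
    return np.eye(n_grid) + D.T @ D

def penalised_objective(phi, A, r, lam, penalty):
    """Discretised ||A phi - r||^2 + lam * (penalty norm of phi)^2."""
    dx = 1.0 / len(phi)
    resid = A @ phi - r
    return dx * (resid @ resid + lam * phi @ penalty @ phi)

def tir_linear(A, r, lam, penalty):
    """Minimiser of penalised_objective: solves (A'A + lam * P) phi = A'r."""
    return np.linalg.solve(A.T @ A + lam * penalty, A.T @ r)
```

Passing `penalty=np.eye(n_grid)` instead of the Sobolev matrix gives the discretised analogue of regularisation with the L2 norm, so the two penalties can be compared on the same data.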

The sequence $(\lambda_T)$ in Definition 1 controls the amount of regularisation introduced by term $G(\varphi) = \|\varphi\|_H^2$, and how this depends on sample size $T$. Therefore, $\lambda_T$ can be seen as a tuning parameter (or as a sequence of tuning parameters). The rate of convergence of $\lambda_T$ to zero affects the rate of convergence of the TiR estimator $\hat{\varphi}$. We will discuss in Section 4 below the choice of the sequence $(\lambda_T)$ to achieve an optimal rate of convergence of TiR estimator $\hat{\varphi}_T$, and we will present two global data driven selection procedures for $\lambda_T$ in Section 6.

2.4 Links with the literature

The goal of this Section is to discuss the links between the TiR estimator and the differ-

ent approaches proposed in the literature on nonparametric estimation under conditional

moment restrictions.

2.4.1 Regularisation by compactness

To address the issue of ill-posedness, NP and AC [see also Blundell, Chen and Kristensen

(2005)] suggest considering a compact parameter set Θ. In this case, by the same argument

as in the standard parametric setting, the assumption that ϕ0 is the unique function in

$\Theta$ which satisfies (1) implies identification condition (6). Compact sets in $L^2[0,1]$ can be defined by imposing a bound $\|\varphi\|_H \leq B$ on the Sobolev norm of the functional parameter. Then, the estimator is obtained by minimization problem (7), where $\lambda_T$ is interpreted as a Kuhn-Tucker multiplier.

Our approach by TiR estimators differs from AC and NP along two directions. On the one hand, for TiR estimators $\lambda_T$ is a free regularisation parameter, whereas $\lambda_T$ is tied down by the slackness condition in the NP and AC approach: either $\lambda_T = 0$ or $\|\varphi\|_H = B$, P-a.s. As a consequence, the approach by TiR estimators presents three important advantages.

i) Optimal rates of convergence. Although, for given sample size $T$, selecting different $\lambda_T$ amounts to selecting different $B$ when the constraint is binding, the asymptotic properties of the TiR estimator and of the estimators with fixed $B$ are different. In particular, the adoption of a bound $B$ on the Sobolev norm independent of sample size $T$ implies in general the selection of a sub-optimal sequence of regularisation parameters $\lambda_T$. Thus, the NP and AC estimators share rates of convergence which are slower than that of the TiR estimator with optimally selected sequence of regularisation parameters. The optimal rates of convergence for the TiR estimator are characterized in Section 4. Finally, note that letting $B = B_T$ grow (slowly) with sample size $T$ is not equivalent to our approach and does not guarantee the consistency of the estimator. Indeed, when $B_T \to \infty$, the resulting limit parameter set $\Theta$ is not compact.

ii) Data-driven selection of tuning parameters. For the TiR estimator, the tuning

parameter λT is allowed to depend on sample size T and sample data, whereas in the theo-

retical setting of NP and AC the tuning parameter B is treated as fixed. Thus, our approach

allows for a discussion of asymptotic properties of regularised estimators with data-driven

selection of the tuning parameter [see Proposition 3 in Section 3 for consistency].

iii) Computational tractability. Finally, we emphasize that the TiR estimator fea-

tures computational advantages compared to NP and AC estimators. This is because, for

given λT , the TiR estimator is defined by an unconstrained optimization problem, whereas

inequality constraint $\|\varphi\|_H \leq B$ has to be accounted for in the minimization defining estimators with given $B$. In particular, in the case of linear conditional moment restrictions,

TiR estimators admit a closed form [see Section 5], whereas the computation of the NP and

AC estimator requires a numerical constrained quadratic optimization routine.
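The correspondence between a binding bound B and a multiplier λ, for a fixed sample, can be illustrated numerically. The sketch below is a hypothetical construction, not the NP or AC procedure: it works with a discretised linear problem on a uniform grid of [0, 1] (square matrix `A`, vector `r`), and searches by bisection for the λ at which the penalised solution has Sobolev norm exactly B. It relies on the standard fact that the penalty norm of a Tikhonov solution is non-increasing in λ.

```python
import numpy as np

def sobolev_norm(phi, dx):
    """Discretised Sobolev norm: sqrt(||phi||^2 + ||grad phi||^2) with forward differences."""
    grad = np.diff(phi) / dx
    return np.sqrt(dx * (phi @ phi + grad @ grad))

def penalised_solution(A, r, lam):
    """Unconstrained minimiser of ||A phi - r||^2 + lam * ||phi||_H^2 on the grid."""
    n = A.shape[1]
    dx = 1.0 / n
    D = np.diff(np.eye(n), axis=0) / dx
    return np.linalg.solve(A.T @ A + lam * (np.eye(n) + D.T @ D), A.T @ r)

def multiplier_for_bound(A, r, B, lam_lo=1e-6, lam_hi=1e6, n_iter=80):
    """Bisection on log(lambda) for the multiplier at which ||phi||_H = B.

    Assumes the bound is binding, i.e. B lies between the norms obtained
    at lam_hi and lam_lo.
    """
    dx = 1.0 / A.shape[1]
    lo, hi = np.log(lam_lo), np.log(lam_hi)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sobolev_norm(penalised_solution(A, r, np.exp(mid)), dx) > B:
            lo = mid          # norm too large: more penalisation is needed
        else:
            hi = mid
    return np.exp(0.5 * (lo + hi))
```

The constrained route needs this extra search (or a constrained optimiser) for each B, whereas the penalised route is a single linear solve for each λ, which is the computational point made in item iii).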

On the other hand, a second difference is that NP, AC and BCK use finite-dimensional

Sieve estimators, whereas in our approach .... TO BE CONTINUED.

2.4.2 Regularisation with L2 norm

For the special case of non-parametric IV estimation of a single equation model [see Equation

(3)], DFR and HH [see also Carrasco, Florens and Renault (2005)] introduce a regularised

estimator defined by minimization problem (7) with Sobolev norm $\|\varphi\|_H$ replaced by $L^2$ norm $\|\varphi\|$ in the penalty term, and $\Omega_0(Z) = 1$. Indeed, it is possible to show that the first order

condition for such an estimator corresponds to the linear equation (4.1) in DFR, or to the

estimator defined at p. 4 in HH. DFR and HH study the consistency and the optimal rates

of convergence of their estimator for a deterministic sequence of regularisation parameters

λT . In Section 4, we will compare the optimal rate of convergence of regularised estimators

with Sobolev norm and with L2 norm, and give conditions under which the first one is larger.

Finally, note that the techniques used in this paper to study the asymptotic properties of

TiR estimators are easily extended to estimators with L2 regularisation and allow us to derive

new results for the DFR and HH estimators, such as the asymptotic expansion of the Mean

Integrated Square Error (MISE) in Section 4.

3 Consistency of TiR estimators

In this section we show the consistency of the TiR estimator. To highlight the main idea,

we first provide in Section 3.1 a consistency theorem for penalized extremum estimators

minimizing the criterion QT (ϕ) + λT G (ϕ) with a general penalty function G (ϕ). Then,

in Section 3.2 the assumptions of the theorem are particularized to the Sobolev penalty

function $G(\varphi) = \|\varphi\|_H^2$ used for the TiR estimator.

3.1 A general consistency result for penalized extremum estimators

Let us consider an extremum estimator of the TiR-type as in Definition 1 with a general

penalty function G(ϕ)

$$\hat{\varphi} = \arg\inf_{\varphi \in \Theta_T} Q_T(\varphi) + \lambda_T G(\varphi), \tag{8}$$

where QT (ϕ) , (λT ) and ΘT are as in Definition 1. This estimator is well-defined and

measurable under weak conditions [see Appendix 1].

The consistency of estimator ϕ̂ defined in (8) is stated in the next Theorem.

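Theorem 1: Let

(i) $\delta_T := \sup_{\varphi \in \Theta_T} |Q_T(\varphi) - Q_\infty(\varphi)| \xrightarrow{p} 0$;

(ii) $\varphi_0 \in \Theta$, and $\cup_{T=1}^{\infty} \Theta_T$ is dense in $\Theta \subset H^2[0,1]$;

(iii) For any $\varepsilon > 0$, $C_\varepsilon(\lambda) := \inf_{\varphi \in \Theta : \|\varphi - \varphi_0\| \geq \varepsilon} Q_\infty(\varphi) + \lambda G(\varphi) - Q_\infty(\varphi_0) - \lambda G(\varphi_0) > 0$, for any $\lambda > 0$ small enough;

(iv) $\exists a > 0$ such that $\lim_{\lambda \to 0} \lambda^{-a} C_\varepsilon(\lambda) > 0$, $T^a \delta_T \xrightarrow{p} 0$, and $T^a \rho_T \to 0$, for any $\varepsilon > 0$, where $\rho_T := \inf_{\varphi \in \Theta_T : \|\varphi - \varphi_0\| \leq \varepsilon} Q_\infty(\varphi) + |G(\varphi) - G(\varphi_0)|$.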

Then, under (i)-(iv), for any sequence (λT ) such that λT > 0, λT → 0, P-a.s., and

λT /T → 0, P-a.s., (9)

the estimator $\hat{\varphi}$ defined in (8) is consistent, namely $\|\hat{\varphi}_T - \varphi_0\| \xrightarrow{p} 0$.

Proof: See Appendix 1.

If G = 0, Theorem 1 corresponds to the standard result of consistency for extremum

estimators [e.g., White and Wooldridge (1991), Corollary 2.6]. Indeed, in this case, Condition

(iii) is the usual identification condition (6), whereas Condition (iv) is satisfied. Theorem 1

extends this consistency result to situations where Condition (6) does not hold, as it is the

case for our ill-posed setting (see Section 2.2). The identification of ϕ0 as isolated minimum

is restored by including a small additional component λG (ϕ) in the limit criterion. Thus,

Condition (iii) in Theorem 1 is the condition on penalty function G (ϕ) to overcome ill-

posedness and achieve consistency of the estimator ϕ̂. To interpret Condition (iv), note that

in the ill-posed setting we have Cε (λ) → 0 as λ → 0, and the rate of this convergence can

be seen as a measure for the severity of ill-posedness. Thus, Condition (iv) introduces a bound on ill-posedness severity, related to the rates of uniform convergence $\delta_T \xrightarrow{p} 0$ and approximation error $\rho_T \to 0$ of the Sieve $\Theta_T$. In Appendix 1, we provide technical regularity

conditions to quantify this bound and to verify Conditions (i), (ii), and (iv) of Theorem 1 for

the TiR estimator. Finally, it is important to emphasize that Theorem 1 is more general than

the results currently known in the literature, since sequence (λT ) is allowed to be stochastic,

possibly data dependent, in a fully general way. Condition (9) on λT for consistency requires

that λT converges to zero at a rate smaller than 1/T .

The rest of this Section will focus on the key assumption of Theorem 1, that is identifi-

cation assumption (iii). The next Proposition provides a sufficient condition for the validity

of this assumption.

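Proposition 2: Assume that the function $G$ is bounded from below. Furthermore, suppose that, for any $\varepsilon > 0$ and any sequence $(\varphi_n)$ in $\Theta$ such that $\|\varphi_n - \varphi_0\| \geq \varepsilon$ for all $n \in \mathbb{N}$, we have

$$Q_\infty(\varphi_n) \to Q_\infty(\varphi_0) \text{ as } n \to \infty \implies G(\varphi_n) \to \infty \text{ as } n \to \infty. \tag{10}$$

Then, Condition (iii) of Theorem 1 is satisfied.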

Proof: See Appendix 1.

Condition (10) provides a simple intuition to explain why the penalty function G (ϕ)

restores identification. Indeed, it basically requires that the sequences (ϕn ) in Θ, which

minimize Q∞ (ϕ) without converging to ϕ0 , are penalized by function G (ϕ) . In the next

section, we particularize this condition for the penalty function which is relevant for the TiR

estimator in Definition 1, that is the Sobolev norm $G(\varphi) = \|\varphi\|_H^2$.

3.2 Penalization with Sobolev norm

When the penalty function $G(\varphi) = \|\varphi\|_H^2$ is used, Condition (10) in Proposition 2 can be

nicely stated in terms of the spectrum of the operator A∗ A, where A is the operator in the

linearization of the moment function defined in Assumption 1.

Assumption 2: Let $\{\psi_j : j \in \mathbb{N}\}$ be an orthonormal basis in $L^2[0,1]$ of eigenfunctions of operator $A^* A$ to eigenvalues $\mu_j$, ordered such that $\mu_1 \geq \mu_2 \geq \cdots$, and let function $\psi_j \in H^2[0,1]$, for any $j \in \mathbb{N}$. Then, $M_n := \inf_{\varphi \in S_n : \|\varphi\| = 1} \|\varphi\|_H \to \infty$ as $n \to \infty$, where $S_n = \mathrm{span}\{\psi_j : j \geq n\}$.

Assumption 2 basically requires that the subspace spanned by the eigenfunctions of

A∗ A to eigenvalues close to zero consists of highly oscillating functions with large Sobolev

norm. Then, deviations of the estimator ϕ̂ from ϕ0 along such directions are penalized by

$G(\varphi) = \|\varphi\|_H^2$. This compensates for the inability of the empirical criterion $Q_T(\varphi)$ to achieve this task, since it becomes asymptotically flat in such directions.

In Lemma A.1 in Appendix 1, we show that Assumptions 1 and 2 imply Condition (10)

in Proposition 2. Then, from Theorem 1 and Proposition 2, the consistency of the TiR

estimator follows.
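As a worked illustration of Assumption 2, suppose (purely as an assumption for this example, not a result stated in the paper) that the eigenfunctions of $A^* A$ are the cosine functions $\psi_j(x) = \sqrt{2}\cos(j\pi x)$, $j \geq 1$. Their Sobolev norms can then be computed explicitly:

$$\begin{aligned}
\|\psi_j\|^2 &= 2\int_0^1 \cos^2(j\pi x)\,dx = 1, \\
\|\nabla\psi_j\|^2 &= 2 j^2 \pi^2 \int_0^1 \sin^2(j\pi x)\,dx = j^2\pi^2, \\
\|\psi_j\|_H^2 &= 1 + j^2\pi^2 .
\end{aligned}$$

Since the derivatives $\nabla\psi_j$ are mutually orthogonal in $L^2[0,1]$, any $\varphi = \sum_{j \geq n} c_j \psi_j$ with $\sum_{j \geq n} c_j^2 = 1$ satisfies

$$\|\varphi\|_H^2 = \sum_{j \geq n} c_j^2 \left(1 + j^2\pi^2\right) \geq 1 + n^2\pi^2,$$

with equality for $\varphi = \psi_n$. Hence, in this hypothetical case, $M_n = \sqrt{1 + n^2\pi^2} \to \infty$ and Assumption 2 holds.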

4 Mean Integrated Square Error analysis of the TiR estimator

4.1 The Mean Integrated Square Error

In this section, we derive the Mean Integrated Square Error (MISE) of the TiR estimator

with deterministic sequence of regularisation parameters. To simplify the exposition, we

assume that an optimal weighting matrix is used.

Assumption 3: The asymptotic weighting matrix Ω0(z) is V0[g(Yt, ϕ0(Xt)) | Z = z]⁻¹.

The asymptotic expansion of the MISE is characterized in the next Proposition.

Proposition 3: Under Assumptions 1-3, the assumptions in Appendix B, and the bandwidth conditions

$$ h_T^{m} = o\left(\lambda_T\, b(\lambda_T)\right), \qquad (T\lambda_T)^{-1} = o\left(h_T^{d_Z}\right), \tag{11} $$

the MISE of the TiR estimator ϕ̂ with deterministic sequence (λT) is given by

$$ E\left[\|\hat\varphi - \varphi_0\|^2\right] = \frac{1}{T}\sum_{j=1}^{\infty} \frac{\nu_j}{(\lambda_T + \nu_j)^2}\,\|\phi_j\|^2 + b(\lambda_T)^2 =: M_T(\lambda_T) \tag{12} $$

up to terms which are asymptotically negligible w.r.t. the RHS, where {φj : j ∈ N} are the orthonormal eigenfunctions of operator ÃA to eigenvalues νj, Ã denotes the adjoint operator of A w.r.t. the scalar products ⟨., .⟩_H and ⟨., .⟩_{L²_{Ω0}(F_Z)}, function b(λT) is given by

$$ b(\lambda_T) = \left\|(\lambda_T + \tilde{A}A)^{-1}\tilde{A}A\varphi_0 - \varphi_0\right\|, \tag{13} $$

m is the order of the kernel K, and dZ the dimension of Z.

Proof: See Appendix 2.

The asymptotic expansion of the MISE consists of two components, which are a variance

term and a bias term, respectively.

(i) The bias function b(λT) is the L² norm of (λT + ÃA)⁻¹ÃAϕ0 − ϕ0 =: ϕ∗ − ϕ0. To interpret function ϕ∗, note that the quadratic approximation of the limit criterion [see (4) and Assumption 1] can be written as

$$ \langle \Delta\varphi, A^{*}A\Delta\varphi\rangle = E_0\left[(A\Delta\varphi)(Z)'\,\Omega_0(Z)\,(A\Delta\varphi)(Z)\right] = \langle \Delta\varphi, \tilde{A}A\Delta\varphi\rangle_H, \quad \varphi \in \Theta. $$

Then, function ϕ∗ minimizes the penalized asymptotic criterion ⟨∆ϕ, ÃA∆ϕ⟩_H + λT‖ϕ‖²_H. Thus, b(λT) is the asymptotic bias arising from introducing the penalty λT‖ϕ‖²_H in the criterion. It corresponds to the so-called regularisation bias in the theory of Tikhonov regularisation [see e.g. Kress (1999), Groetsch (1984)]. Under general conditions on operator ÃA and true function ϕ0, the bias function b(λ) is increasing w.r.t. λ and such that b(λ) → 0 as λ → 0.
(ii) The variance term $T^{-1}\sum_{j=1}^{\infty}\|\phi_j\|^2\,\nu_j/(\lambda_T+\nu_j)^2$ involves a weighted sum of the "regularised" inverse eigenvalues νj/(λT + νj)² of operator ÃA, with weights ‖φj‖² (see footnote 2). To have an interpretation, note that the inverse of operator ÃA corresponds to the standard asymptotic variance matrix (J0′V0⁻¹J0)⁻¹ of the efficient GMM in the parametric setting, where J0 = E0[∂g/∂θ′] and V0 = V0[g]. In the ill-posed non-parametric setting, the inverse of operator ÃA is unbounded, and its eigenvalues 1/νj → ∞ diverge. The penalty term λT‖ϕ‖²_H in the criterion defining the TiR estimator implies that inverse eigenvalues 1/νj are replaced by νj/(λT + νj)².

The variance term $T^{-1}\sum_{j=1}^{\infty}\|\phi_j\|^2\,\nu_j/(\lambda_T+\nu_j)^2$ is a decreasing function of λT. To study its behaviour when λT → 0, we introduce the next assumption.

Assumption 4: The eigenfunctions φj and the eigenvalues νj of ÃA satisfy

$$ \sum_{j=1}^{\infty} \nu_j^{-1}\,\|\phi_j\|^2 = \infty. $$

Under Assumption 4, the series $k_T := \sum_{j=1}^{\infty}\|\phi_j\|^2\,\nu_j/(\lambda_T+\nu_j)^2$ diverges as λT → 0. When kT → ∞ such that kT/T → 0, the variance term converges to zero. However, the rate of convergence is slower than the parametric rate 1/T. This slower rate of convergence is typical in nonparametric estimation. Note, however, that it does not come from localization, as for kernel estimation, but from the ill-posedness of the problem, which implies νj → 0.

2 Since νj/(λT + νj)² ≤ νj/λT², the infinite sum converges under Assumption B.6 (i) in Appendix B.

The asymptotic expansion of the MISE of the TiR estimator given in Proposition 3 does

not involve the bandwidth hT , as long as Conditions (11) are satisfied. The variance term is

asymptotically independent of hT since the asymptotic expansion of ϕ̂−ϕ0 involves the kernel

density estimator integrated w.r.t. (Y, X, Z) [see Equation (36) in Appendix 2, first term, and

the proof of Lemma A.3]. The integral averages the localization effect of the bandwidth hT .

On the contrary, kernel estimation m̂(ϕ, z) of the conditional moment function does have an

effect on the bias of the TiR estimator. However, the assumption h_T^m = o(λT b(λT)) in (11) implies that the estimation bias is asymptotically negligible compared to the regularisation bias [see Lemma A.4 in Appendix 2].

Finally, it is also possible to derive a similar asymptotic expansion of the MISE for the estimator ϕ̃T regularised by the L² norm. This characterisation is new:

$$ E\left[\|\tilde\varphi_T - \varphi_0\|^2\right] = \frac{1}{T}\sum_{j=1}^{\infty} \frac{\mu_j}{(\lambda_T + \mu_j)^2} + \tilde{b}(\lambda_T)^2, \tag{14} $$

where µj are the eigenvalues of operator A∗A, and b̃(λT) = ‖(λT + A∗A)⁻¹A∗Aϕ0 − ϕ0‖.

Let us now come back to the MISE MT (λ) of the TiR estimator in Proposition 3 and dis-

cuss the optimal choice of the regularisation parameter λT . Since the bias term is increasing

in the regularisation parameter, whereas the variance term is decreasing, we face a kind of

bias-variance trade-off. The optimal sequence of deterministic regularisation parameters is

given by λ∗T = arg minλ>0 MT (λ), and the corresponding optimal MISE of the TiR estimator

is given by MT∗ := MT (λ∗T ).

The optimal sequence of regularisation parameters λ∗T, in particular its rate of convergence to zero, depends on the decay behaviour of the eigenvalues νj and of the norms of eigenfunctions ‖φj‖, as well as on the bias function b(λ) close to λ = 0. In the next section, we characterize the optimal sequence of regularisation parameters λ∗T, the corresponding optimal MISE M∗T, and their rates of convergence in a broad class of models.

4.2 Optimal rates of convergence


The eigenvalues νj and the norms of eigenfunctions ‖φj‖ can feature different types of decay as j → ∞, for instance geometric or hyperbolic decay. Intuitively, the first type is associated with a faster convergence of the spectrum to zero, and thus with a more serious problem of ill-posedness. In this section, we focus our analysis on the case where the eigenvalues νj feature geometric decay and the norms of eigenfunctions ‖φj‖ feature hyperbolic decay. Results for the other cases are summarised at the end of the section.

Assumption 5: The eigenvalues νj and the norms of the eigenfunctions ‖φj‖ of operator ÃA are such that, for j = 1, 2, · · ·, and some positive constants C1, C2,

(i) νj = C1 exp(−αj), α > 0, (ii) ‖φj‖² = C2 j^{−β}, β > 0.

Assumption 5 (i) is satisfied for a large number of models, including for instance the two examples that we consider below in our Monte-Carlo analysis. In general, it is known that, under appropriate regularity conditions, compact integral operators with smooth kernel feature eigenvalues with decay of (at least) exponential type [see Theorem 15.20 in Kress (1999)] (see footnote 3). Assumption 5 (ii) is adopted e.g. in Wahba (1977), and is also satisfied in the examples of our Monte-Carlo analysis.

We further assume that the bias function features a power-law behaviour close to λ = 0.

Assumption 6: The bias function is such that b(λ) = C3 λ^δ, δ > 0, for λ close to 0, where C3 is a positive constant.

Then, the MISE and the optimal sequence of regularisation parameters are characterised in

the next Proposition.

Proposition 4: Under the Assumptions of Proposition 3, Assumptions 5 and 6, for some positive constants c1, c2, c and c̄, we have

(i) The MISE is

$$ M_T(\lambda) = c_1\,\frac{1}{T}\,\frac{1}{\lambda\,[\log(1/\lambda)]^{\beta}} + c_2\,\lambda^{2\delta}, $$

up to terms which are negligible when λ → 0 and T → ∞.

(ii) The optimal sequence of regularisation parameters is

$$ \log\lambda_T^{*} = \log c - \frac{1}{1+2\delta}\,\log T, \quad T \in \mathbb{N}, \tag{15} $$

up to a term which is negligible w.r.t. the RHS.

(iii) The optimal MISE is $M_T^{*} = \bar{c}\,T^{-\frac{2\delta}{1+2\delta}}(\log T)^{-\frac{2\delta\beta}{1+2\delta}}$, up to a term which is negligible w.r.t. the RHS.

Proof: See Appendix 3.

3 In the case of linear IV estimation and regularisation with L² norm, the eigenvalues correspond to the nonlinear canonical correlations of (X, Z). When X and Z are monotonic transformations of variables which are jointly normally distributed with correlation parameter ρ, the canonical correlations of (X, Z) are ρ^j, j ∈ N [see e.g. DFR]. Thus the eigenvalues feature exponential decay.
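The bias-variance trade-off behind Proposition 4 can be explored numerically. The following is a minimal sketch, not the authors' code: it evaluates the variance series of (12) under the decay patterns of Assumptions 5 and 6 and minimizes the resulting MISE over a grid of λ. The constants (α = β = 1, δ = 0.5, C1 = C2 = C3 = 1), the truncation point J and the grid are illustrative assumptions only.

```python
import math

def mise(lam, T, alpha=1.0, beta=1.0, delta=0.5, C1=1.0, C2=1.0, C3=1.0, J=200):
    """M_T(lambda) with the variance series truncated at J terms (assumed values)."""
    var_sum = 0.0
    for j in range(1, J + 1):
        nu = C1 * math.exp(-alpha * j)   # geometric decay of eigenvalues nu_j
        norm2 = C2 * j ** (-beta)        # hyperbolic decay of ||phi_j||^2
        var_sum += norm2 * nu / (lam + nu) ** 2
    bias = C3 * lam ** delta             # power-law bias function b(lambda)
    return var_sum / T + bias ** 2

def optimal_lambda(T, grid=None, **kw):
    """Grid minimizer of M_T(lambda); the default grid spans 1e-6 to 1e-1."""
    if grid is None:
        grid = [10 ** (-6 + 0.02 * k) for k in range(251)]
    return min(grid, key=lambda lam: mise(lam, T, **kw))

if __name__ == "__main__":
    for T in (100, 400, 1000, 4000):
        lam = optimal_lambda(T)
        print(T, lam, mise(lam, T))
```

Plotting log λ∗T against log T for the printed values should give an approximately straight line, in line with (15).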

The log of the optimal regularisation parameter is linear in the log sample size. The slope coefficient γ := 1/(1 + 2δ) is smaller than 1, and depends on the convexity parameter δ of the bias function close to λ = 0. We have γ < 1/2 when the squared bias function b(λ)² is convex, that is 2δ > 1, and γ ≥ 1/2 when 2δ ≤ 1. The optimal MISE converges to zero as a power of T and of log T. The negative exponent of the dominant term T is 2δ/(1 + 2δ). This rate of convergence is smaller than 1, that is the parametric rate, because of ill-posedness, and is increasing w.r.t. the convexity parameter δ of the bias function. Note that the geometric decay rate α affects neither the rate of convergence of the optimal regularisation sequence nor that of the MISE, whereas the coefficient β of the eigenfunction norms affects only the exponent of the log T term in the MISE. Finally, under Assumptions 5 and 6, the bandwidth conditions (11) are fulfilled for the optimal sequence of regularisation parameters (15) if hT = C · T^{−η}, with

$$ \frac{1}{m}\,\frac{1+\delta}{1+2\delta} < \eta < \frac{1}{d_Z}\,\frac{2\delta}{1+2\delta}. $$

This condition can be satisfied if m/dZ > (1 + δ)/(2δ).
To conclude this section, we briefly discuss the optimal rate of convergence of the MISE when the eigenvalues feature hyperbolic decay, that is νj = C j^{−α}, α > 0, or when regularisation with L² norm is adopted. The results are summarized in Table 1 below, and are found using Formula (14) and an argument similar to the proof of Proposition 4. In Table 1, parameter β is defined as in Assumption 5 (ii) for the TiR estimator. Parameters α and α̃ denote the hyperbolic decay rates of the eigenvalues of operator ÃA for the TiR estimator, and of operator A∗A for L² regularisation, respectively. We assume α, α̃ > 1, and α > β − 1 to satisfy Assumption 4. Finally, parameters δ and δ̃ are the power-law coefficients of the bias functions b(λ) and b̃(λ) for λ → 0 as in Assumption 6, where b(λ) is defined in (13) for the TiR estimator, and b̃(λ) in (14) for L² regularisation, respectively.

                    | TiR estimator                                                             | L² regularisation
geometric spectrum  | $T^{-\frac{2\delta}{1+2\delta}}(\log T)^{-\frac{2\delta\beta}{1+2\delta}}$ | $T^{-\frac{2\tilde\delta}{1+2\tilde\delta}}$
hyperbolic spectrum | $T^{-\frac{2\delta}{1+2\delta+(1-\beta)/\alpha}}$                          | $T^{-\frac{2\tilde\delta}{1+2\tilde\delta+1/\tilde\alpha}}$

Table 1: Optimal rate of convergence of the MISE. The decay factors are α and α̃ for the eigenvalues, δ and δ̃ for the bias, and β for the squared norm of the eigenfunctions.

With hyperbolic spectrum, the rate of convergence (power of T ) of the TiR estimator

features an additional term (1 − β) /α in the denominator, which involves both the α and

β coefficients. When β > 1, the rate of convergence is faster than that with geometric

spectrum. This is an effect of the less severe ill-posedness problem. The rate of convergence

with geometric spectrum is recovered letting α → ∞ (up to the log T term).

The rate of convergence with L² regularisation coincides with that of the TiR estimator with β = 0 and coefficients α, δ corresponding to operator A∗A instead of ÃA. With geometric spectrum, the TiR estimator features a faster rate of convergence than the regularised estimator with L² norm if δ > δ̃, that is if the bias function of the TiR estimator is more convex. Finally, note that with hyperbolic spectrum and L² regularisation, the formula given in Table 1 corresponds to that derived by HH, Theorem 4.1 (see footnote 4).
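The power-of-T exponents in Table 1 are easy to compare numerically. The sketch below is an illustration with a hypothetical helper; it ignores the log T factor of the geometric TiR case, and setting beta = 0 reproduces the L² regularisation column.

```python
def rate_exponent(delta, spectrum, alpha=None, beta=0.0):
    """Exponent a such that the optimal MISE behaves like T**(-a) (log terms ignored)."""
    if spectrum == "geometric":
        return 2.0 * delta / (1.0 + 2.0 * delta)
    if spectrum == "hyperbolic":
        if alpha is None:
            raise ValueError("alpha is required for a hyperbolic spectrum")
        return 2.0 * delta / (1.0 + 2.0 * delta + (1.0 - beta) / alpha)
    raise ValueError("spectrum must be 'geometric' or 'hyperbolic'")

if __name__ == "__main__":
    delta = 0.5
    print(rate_exponent(delta, "geometric"))                         # 0.5
    print(rate_exponent(delta, "hyperbolic", alpha=2.0, beta=2.0))   # faster, beta > 1
    print(rate_exponent(delta, "hyperbolic", alpha=2.0, beta=0.0))   # L2-type case
```

Letting alpha grow large drives the hyperbolic exponent back to the geometric one, as noted in the text.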

5 The TiR estimator for linear moment restrictions

In this section we derive the TiR estimator when the moment restrictions are linear w.r.t. the

functional parameter ϕ0 . We consider the case of non-parametric IV estimation of a single

equation model, with g (y, ϕ0 (x)) = ϕ0 (x) − y, and conditional moment as in (3). Then, the

estimated moment function is given by


$$ \hat{m}(\varphi, z) = \int \varphi(x)\,\hat{f}(w|z)\,dw - \int y\,\hat{f}(w|z)\,dw =: (\hat{A}\varphi)(z) - \hat{r}(z). $$

To simplify the exposition, we assume that Ω0(z) = V0[Yt − ϕ0(Xt) | Z = z]⁻¹ = 1 in Assumption 3. The objective function of the TiR estimator in Definition 1 can be rewritten as [see Appendix 2.1]

$$ Q_T(\varphi) + \lambda_T\|\varphi\|_H^2 = \langle\varphi, \hat{\tilde{A}}\hat{A}\varphi\rangle_H - 2\langle\varphi, \hat{\tilde{A}}\hat{r}\rangle_H + \lambda_T\langle\varphi, \varphi\rangle_H, \quad \varphi \in H^2[0,1], \tag{16} $$

up to a term independent of ϕ, where $\hat{\tilde{A}}$ denotes the linear operator defined on L²_{Ω0}(F_Z) by

$$ \langle\varphi, \hat{\tilde{A}}\psi\rangle_H = \frac{1}{T}\sum_{t=1}^{T} (\hat{A}\varphi)(Z_t)\,\psi(Z_t), \quad \varphi \in H^2[0,1],\ \psi \in L^2_{\Omega_0}(F_Z). \tag{17} $$

Under the regularity conditions in Appendix B, Criterion (16) admits a global minimum ϕ̂ on H²[0, 1], which is characterized by the first order condition

$$ \left(\lambda_T + \hat{\tilde{A}}\hat{A}\right)\hat\varphi = \hat{\tilde{A}}\hat{r}. \tag{18} $$

4 To see this, note that their Assumption A.3 corresponds to δ̃ = (2β_HH − 1)/(α̃ + β_HH), where β_HH is the β coefficient of HH.
This is a Fredholm integral equation of Type II (see footnote 5). The transformation of the ill-posed problem (1) into the well-posed estimating equation (18) is induced by the penalty term involving the Sobolev norm. The TiR estimator is the unique solution of Equation (18) and is given by

$$ \hat\varphi = \left(\lambda_T + \hat{\tilde{A}}\hat{A}\right)^{-1}\hat{\tilde{A}}\hat{r}. \tag{19} $$

The TiR estimator can be approximated numerically by introducing a finite-dimensional basis of functions {Pj : j = 1, ..., K} in H²[0, 1] and solving Equation (19) on the subspace spanned by {Pj : j = 1, ..., K}, which yields

$$ \varphi \simeq \sum_{j=1}^{K} \theta_j P_j =: \theta' P, \quad \theta \in \mathbb{R}^K. \tag{20} $$

The K × K matrix corresponding to operator $\hat{\tilde{A}}\hat{A}$ on the subspace spanned by {Pj} is given by [using (17)]

$$ \langle P_i, \hat{\tilde{A}}\hat{A}P_j\rangle_H = \frac{1}{T}\sum_{t=1}^{T} (\hat{A}P_i)(Z_t)\,(\hat{A}P_j)(Z_t) = \frac{1}{T}\left(\hat{P}'\hat{P}\right)_{i,j}, \quad i, j = 1, ..., K, $$

where P̂ is the T × K matrix with rows $\hat{P}(Z_t)' = \int P(x)'\,\hat{f}(w|Z_t)\,dw$, t = 1, ..., T. Matrix P̂ is the matrix of the "fitted values" in the regression of P(X) on Z at the sample points. Then, Equation (18) reduces to the matrix equation

$$ \left(\lambda_T D + \frac{1}{T}\hat{P}'\hat{P}\right)\theta = \frac{1}{T}\hat{P}'\hat{R}, $$

where R̂ = (r̂(Z1), ..., r̂(ZT))′ and D is the K × K matrix of Sobolev scalar products D_{i,j} = ⟨Pi, Pj⟩_H, i, j = 1, ..., K. The solution is given by

$$ \hat\theta = \left(\lambda_T D + \frac{1}{T}\hat{P}'\hat{P}\right)^{-1}\frac{1}{T}\hat{P}'\hat{R}, $$

which yields the approximation of the TiR estimator ϕ̂ ≃ θ̂′P.

5
See e.g. Linton and Mammen (2005), (2006), Gagliardini and Gouriéroux (2006), and the survey by
Carrasco, Florens, Renault (2005) for other examples of estimation problems leading to Type II equations.

Estimator θ̂ is a 2SLS estimator with a ridge correction term. It is easy to verify that this is the estimator we obtain if we substitute Approximation (20) into Criterion (16) and minimize w.r.t. θ. This latter approach has been considered by NP, AC, and Blundell et al (2005), who use sieve estimators. However, it is important to emphasize that the introduction of a series of basis functions as in (20) is simply a method to compute approximately the true TiR estimator ϕ̂ in (19), which is a well-defined estimator on the function space. In particular, when K = KT → ∞ sufficiently fast with T, the asymptotic properties of estimator θ̂′P are the same as those of estimator ϕ̂. Moreover, the asymptotic properties (consistency) of the estimators proposed by NP, AC, and Blundell et al (2005) have been derived only in the case where parameter λT is tied down by the inequality constraint ‖ϕ̂‖_H ≤ B̄ for fixed B̄, whereas, for the TiR estimator, λT is treated as a free regularisation parameter depending on sample size.
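The finite-dimensional solution described in this section amounts to a ridge-corrected 2SLS solve. The sketch below, which assumes NumPy is available, illustrates that linear algebra only: the inputs P_hat (T × K fitted values), R_hat (length-T vector of r̂(Zt)) and D (K × K Sobolev Gram matrix) are taken as given, and the function name is hypothetical. It is not the authors' Gauss implementation.

```python
import numpy as np

def tir_coefficients(P_hat, R_hat, D, lam):
    """Solve (lam * D + P_hat'P_hat / T) theta = P_hat'R_hat / T for theta."""
    T = P_hat.shape[0]
    S = P_hat.T @ P_hat / T          # K x K matrix of the operator on the basis
    rhs = P_hat.T @ R_hat / T
    # A linear solve is used rather than an explicit inverse for stability.
    return np.linalg.solve(lam * D + S, rhs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)   # synthetic inputs, for illustration only
    P_hat = rng.normal(size=(50, 3))
    R_hat = rng.normal(size=50)
    D = np.eye(3)
    print(tir_coefficients(P_hat, R_hat, D, 0.0))    # lam = 0: plain least squares
    print(tir_coefficients(P_hat, R_hat, D, 10.0))   # heavier penalty shrinks theta
```

With lam = 0 and a full-rank P_hat the solve reduces to the unpenalized least-squares fit of R_hat on P_hat; the ridge term lam * D is what keeps the system well conditioned when P_hat'P_hat / T has eigenvalues close to zero.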

6 A Monte Carlo study


6.1 Data generating process

Following NP we draw the errors U and V and the instruments Z as

$$ \begin{pmatrix} U \\ V \\ Z \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right), \quad \rho \in \{0, 0.5\}, $$

and build X∗ = Z + V. Then we map X∗ into a variable X = Φ(X∗), which lives in [0, 1]. The function Φ denotes the cdf of a standard Gaussian variable, and is assumed to be known. To generate Y, we restrict ourselves to the linear case since a simulation analysis of a nonlinear case would be very time consuming. We examine two designs:

Case 1: Y = B_{a,b}(X) + U,

where B_{a,b} denotes the cdf of a Beta(a, b) variable;

Case 2: Y = sin(πX) + U.

The parameters of the beta distribution are chosen equal to a = 2 and b = 5.

When the correlation ρ between U and V is 50% there is endogeneity in both cases. When ρ = 0 there is no need to correct for the endogeneity bias.

The moment condition is

E0 [Y − ϕ0 (X) | Z] = 0,

where the functional parameter is ϕ0 (x) = Ba,b (x) in Case 1, and ϕ0 (x) = sin (πx) in Case

2, x ∈ [0, 1].
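The data generating process above can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction, not the authors' Gauss program: the function names and the seed are assumptions, the correlated pair (U, V) is built from two independent standard normals, and the Beta(2, 5) cdf uses the binomial-sum identity for integer parameters.

```python
import math
import random

def std_normal_cdf(x):
    """Phi(x) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def beta25_cdf(x):
    """Beta(2, 5) cdf, using I_x(a, b) = P(Binomial(a + b - 1, x) >= a)."""
    n = 6
    return sum(math.comb(n, j) * x ** j * (1.0 - x) ** (n - j) for j in range(2, n + 1))

def simulate(T, rho, case, seed=0):
    """Return a list of (Y, X, Z) draws for Case 1 or Case 2."""
    rng = random.Random(seed)
    sample = []
    for _ in range(T):
        e1, e2, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        u = e1
        v = rho * e1 + math.sqrt(1.0 - rho ** 2) * e2   # unit variance, corr(U, V) = rho
        x = std_normal_cdf(z + v)                       # X = Phi(X*), X* = Z + V
        phi0 = beta25_cdf(x) if case == 1 else math.sin(math.pi * x)
        sample.append((phi0 + u, x, z))
    return sample

if __name__ == "__main__":
    print(simulate(5, 0.5, case=1))
```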

6.2 Estimation procedure

Since we face an unknown function ϕ0 on [0, 1], we use a series approximation based on standardized shifted Chebyshev polynomials of the first kind (see Section 22 on orthogonal polynomials of Abramowitz and Stegun (1970) for their mathematical properties). We take orders 0 to 5, which yields six coefficients (K = 6) to be estimated in the approximation $\varphi(x) \simeq \sum_{j=0}^{5}\theta_j P_j(x)$, where $P_0(x) = T_0^{*}(x)/\sqrt{\pi}$, $P_j(x) = T_j^{*}(x)/\sqrt{\pi/2}$, j ≠ 0, and the shifted Chebyshev polynomials of the first kind are

T0∗(x) = 1, T1∗(x) = −1 + 2x, T2∗(x) = 1 − 8x + 8x²,

T3∗(x) = −1 + 18x − 48x² + 32x³, T4∗(x) = 1 − 32x + 160x² − 256x³ + 128x⁴,

T5∗(x) = −1 + 50x − 400x² + 1120x³ − 1280x⁴ + 512x⁵.

The (squared) Sobolev norm $\|\varphi\|_H^2 = \int_0^1 \varphi^2 + \int_0^1 (\nabla\varphi)^2$ is approximated by

$$ \|\varphi\|_H^2 \simeq \sum_{i=0}^{5}\sum_{j=0}^{5} \theta_i\theta_j \int_0^1 \left(P_i(x)P_j(x) + \nabla P_i(x)\nabla P_j(x)\right) dx. $$

The coefficients in this quadratic form θ′Dθ take a closed form, and can be computed easily via integration with a symbolic calculus package (the lower triangle follows by symmetry):

$$ D = \begin{pmatrix}
\frac{1}{\pi} & 0 & -\frac{\sqrt{2}}{3\pi} & 0 & -\frac{\sqrt{2}}{15\pi} & 0 \\
 & \frac{26}{3\pi} & 0 & \frac{38}{5\pi} & 0 & \frac{166}{21\pi} \\
 & & \frac{218}{5\pi} & 0 & \frac{1182}{35\pi} & 0 \\
 & & & \frac{3898}{35\pi} & 0 & \frac{5090}{63\pi} \\
 & & & & \frac{67894}{315\pi} & 0 \\
 & & & & & \frac{82802}{231\pi}
\end{pmatrix}. $$

The L² norm ‖ϕ‖² can be approximated in a similar way with θ′Bθ where

$$ B = \begin{pmatrix}
\frac{1}{\pi} & 0 & -\frac{\sqrt{2}}{3\pi} & 0 & -\frac{\sqrt{2}}{15\pi} & 0 \\
 & \frac{2}{3\pi} & 0 & -\frac{2}{5\pi} & 0 & -\frac{2}{21\pi} \\
 & & \frac{14}{15\pi} & 0 & -\frac{38}{105\pi} & 0 \\
 & & & \frac{34}{35\pi} & 0 & -\frac{22}{63\pi} \\
 & & & & \frac{62}{63\pi} & 0 \\
 & & & & & \frac{98}{99\pi}
\end{pmatrix}. $$

Such simple and exact forms ease implementation, improve on speed, and contribute to the numerical stability of the estimation procedure (see footnote 6).
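The entries of D and B can be reproduced without a symbolic package, since all integrands are polynomials. The following sketch, which uses only the standard library and hypothetical helper names, integrates products of the shifted Chebyshev polynomials (and of their derivatives) exactly on [0, 1] with rational arithmetic and then applies the normalisation of P0 and Pj.

```python
from fractions import Fraction
import math

# Coefficients (lowest degree first) of T*_0, ..., T*_5 as listed in the text.
T_STAR = [
    [1], [-1, 2], [1, -8, 8], [-1, 18, -48, 32],
    [1, -32, 160, -256, 128], [-1, 50, -400, 1120, -1280, 512],
]

def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_diff(p):
    return [k * c for k, c in enumerate(p)][1:] or [0]

def integral01(p):
    """Exact integral of the polynomial p over [0, 1]."""
    return sum(Fraction(c, k + 1) for k, c in enumerate(p))

def gram(sobolev=True):
    """6 x 6 matrix D (sobolev=True) or B (sobolev=False) as nested lists of floats."""
    scale = [1 / math.sqrt(math.pi)] + [math.sqrt(2 / math.pi)] * 5
    G = [[0.0] * 6 for _ in range(6)]
    for i in range(6):
        for j in range(6):
            val = integral01(poly_mul(T_STAR[i], T_STAR[j]))
            if sobolev:
                val += integral01(poly_mul(poly_diff(T_STAR[i]), poly_diff(T_STAR[j])))
            G[i][j] = float(val) * scale[i] * scale[j]
    return G

if __name__ == "__main__":
    D, B = gram(True), gram(False)
    print(D[1][1] * math.pi, B[2][2] * math.pi)   # compare with 26/3 and 14/15
```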

The kernel estimator m̂(ϕ, z) of the conditional moment is approximated through θ′P̂(z) − r̂(z) where

$$ \hat{P}(z) \simeq \frac{\sum_{t=1}^{T} P(X_t)\,K\left(\frac{Z_t - z}{h}\right)}{\sum_{t=1}^{T} K\left(\frac{Z_t - z}{h}\right)}, \qquad \hat{r}(z) \simeq \frac{\sum_{t=1}^{T} Y_t\,K\left(\frac{Z_t - z}{h}\right)}{\sum_{t=1}^{T} K\left(\frac{Z_t - z}{h}\right)}, $$

where h denotes the bandwidth, and K is the Gaussian kernel. This kernel estimator is asymptotically equivalent to the one described in the lines above. We prefer it because of its numerical tractability: it avoids bivariate numerical integration and the choice of two additional bandwidths. The bandwidth is selected via the standard rule of thumb h = 1.06 σ̂Z T^{−1/5} (Silverman (1986)), where σ̂Z is the empirical standard deviation of Zt.
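The two kernel-weighted averages above share the same weights, so a single helper covers both P̂(z) (applied to each basis function in turn) and r̂(z). The sketch below uses only the standard library; the function names are hypothetical, and `statistics.stdev` (divisor T − 1) is an assumption about what "empirical standard deviation" means here.

```python
import math
import statistics

def silverman_bandwidth(z):
    """Rule of thumb h = 1.06 * sd(Z) * T**(-1/5)."""
    return 1.06 * statistics.stdev(z) * len(z) ** (-1.0 / 5.0)

def kernel_average(z_eval, z, values, h):
    """Gaussian-kernel weighted average of `values` at the point z_eval.

    Use values = [P_j(X_t)] for the j-th component of P_hat(z_eval),
    or values = [Y_t] for r_hat(z_eval). If z_eval lies very far from
    every Z_t the weights can underflow to zero and the ratio is undefined.
    """
    weights = [math.exp(-0.5 * ((zt - z_eval) / h) ** 2) for zt in z]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

if __name__ == "__main__":
    z = [-1.2, -0.4, 0.1, 0.7, 1.5]
    y = [0.3, 0.1, 0.0, 0.2, 0.6]
    h = silverman_bandwidth(z)
    print(h, kernel_average(0.0, z, y, h))
```

The normalising constant of the Gaussian density cancels in the ratio, which is why the kernel is written without it.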

The weighting function Ω0 (z) is taken equal to unity, satisfying Assumption 3.

6
The Gauss programs developed for this section are available on request from the authors.

6.3 Simulation results

The sample size is initially fixed at T = 400. Estimator performance is measured in terms

of the Mean Integrated Squared Error (MISE) and the Integrated Squared Bias (ISB) based

on averages over 1000 repetitions. We use a univariate Gauss-Legendre quadrature with 40

knots to compute the integrals.

Figures 1 to 4 concern Case 1 while Figures 5 to 8 concern Case 2. In each figure the left

panel plots the MISE on a grid of lambda, the central panel the ISB on a grid of lambda, and

the right panel the mean estimated functions and the true function on the unit interval. Mean

estimated functions correspond to averages obtained either from regularised estimates with

a lambda achieving the lowest MISE or from OLS estimates. The regularization schemes use

the Sobolev norm, corresponding to the TiR estimator (odd numbering of the figures), and

the L2 norm (even numbering of the figures). We consider designs exhibiting an endogeneity

(ρ = 0.5) in Figures 1, 2, 5, 6, while Figures 3, 4, 7, 8 are dedicated to the designs without

endogeneity (ρ = 0).

A couple of remarks can be made. First, the bias of the OLS estimator can be large

under endogeneity. Second, the MISE of the TiR estimator is more convex in lambda than

the one obtained from an L2 norm, and performance is clearly better for the TiR estimator.

The Sobolev norm should be strongly favoured over the L2 norm in order to recover the

shape of the true functions. Third, the fit obtained by the OLS estimator is almost perfect

when endogeneity is absent. Using six polynomials delivers a very good approximation of

the true functions.

We have also examined sample sizes T = 100 and T = 1000, as well as approximations

based on polynomials with orders up to 10 and 15. The above conclusions remain qualita-

tively unaffected. This suggests that as soon as the order of the polynomials is sufficiently

large to deliver a good numerical approximation of the underlying function, it is not neces-

sary to link it with sample size, as explained in Section 5. For example Figures 9 and 10 are

the analogues of Figures 1 and 5 with T = 1000. We can see that the bias term is almost

identical, while the variance term decreases by a factor about 2.5 as predicted by Proposition

3.

In Figure 11 we display the six eigenvalues of operator ÃA and the L²-norms of the corresponding eigenfunctions when the same approximation basis of six polynomials is used. These true quantities have been computed by Monte Carlo integration. The eigenvalues νj feature a geometric decay w.r.t. the order j, whereas the decay of the norms ‖φj‖² is of a hyperbolic type. This conforms to Assumption 5 and the analysis conducted in Proposition 4. A linear fit of the plotted points gives a decay factor α̂ = 2.254 for the eigenvalues and a decay factor β̂ = 2.911 for the norm of the eigenfunctions.

Figure 12 checks whether the line log λ∗T = log c − γ log T, implied by Proposition 4 (ii), holds in small samples. For ρ = 0.5 the right panel for Case 1 as well as the left panel for Case 2 exhibit a linear relationship between the logarithm of the regularisation parameter minimizing the average MISE on the 1000 Monte Carlo simulations and the logarithm of sample size ranging from T = 50 to T = 1000. The OLS estimation of this linear relationship from the plotted pairs delivers ĉ = .226, γ̂ = .752 in Case 1, and ĉ = .012, γ̂ = .428 in Case 2. Both estimated slope coefficients are smaller than 1, and qualitatively consistent with the implications of Proposition 4. Indeed, from Figures 9 and 10 the ISB curve appears to be more convex in Case 2 than in Case 1. This points to a larger δ parameter, and thus to a smaller slope coefficient γ = 1/(1 + 2δ), in Case 2. Inverting the relationship γ = 1/(1 + 2δ) we get estimates for the decay factor δ, which are δ̂ = .165 and δ̂ = .668 in Case 1 and Case 2, respectively.
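The log-log regression and the inversion of γ = 1/(1 + 2δ) described above can be sketched as follows. The helper names are hypothetical, and the synthetic pairs in the usage example are generated from the reported estimates ĉ = .226 and γ̂ = .752 purely to illustrate the mechanics; they are not the simulation output of the paper.

```python
import math

def fit_log_linear(sample_sizes, lambdas):
    """OLS of log(lambda) on log(T); returns (c, gamma) in lambda = c * T**(-gamma)."""
    xs = [math.log(t) for t in sample_sizes]
    ys = [math.log(lam) for lam in lambdas]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope

def delta_from_gamma(gamma):
    """Invert gamma = 1 / (1 + 2 * delta)."""
    return (1.0 / gamma - 1.0) / 2.0

if __name__ == "__main__":
    sizes = [50, 100, 200, 400, 1000]
    lams = [0.226 * t ** (-0.752) for t in sizes]   # synthetic, for illustration
    c_hat, gamma_hat = fit_log_linear(sizes, lams)
    print(c_hat, gamma_hat, delta_from_gamma(gamma_hat))
```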

By a similar argument, Proposition 4 also explains the better performance of the TiR

estimator compared to the L2 -regularised estimator that we reported above. Indeed, com-

paring the ISB curves of the two estimators in Case 1 (Figures 1 and 2) and in Case 2

(Figures 5 and 6), it appears that the TiR estimator features a more convex ISB curve. This

implies δ > δ̃ and thus a faster rate of convergence of the TiR estimator.

Finally we wish to conclude by a brief discussion on data driven selection procedures

of the regularisation parameter λT . We investigate a first method based on the asymptotic

spectral representation of the MISE provided in Proposition 3, and a second method based

on a resampling approximation.

The first data driven selection procedure aims at estimating directly Expression (12) in

order to derive the optimal regularisation parameter. In unreported results we have checked

that the asymptotic MISE, the asymptotic ISB and the asymptotic variance are close to the

ones exhibited in Figures 9 and 10. These true quantities have also been computed by Monte

Carlo integration. We have found an asymptotic optimal lambda equal to .0018 in Case 1

and to .0009 in Case 2, which are of the same magnitudes as .0013 and .0007 in Figures 9

and 10. We have also checked that the linear relationship exhibited in Figure 12 holds true

when deduced from optimizing the asymptotic MISE. The OLS estimation delivers ĉ = .418,

γ̂ = .795 in Case 1, and ĉ = .037, γ̂ = .546 in Case 2, and thus δ̂ = .129 and δ̂ = .418,

respectively.

The data driven estimation algorithm goes as follows:

Algorithm

(i) Perform the spectral decomposition of the matrix D⁻¹P̂′P̂/T to get eigenvalues ν̂j and eigenvectors ŵj, normalized to ŵj′Dŵj = 1, j = 1, ..., K.

(ii) Get a first-step TiR estimator θ̄ using a pilot regularisation parameter λ̄.

(iii) Estimate the MISE:

$$ \tilde{M}(\lambda) = \frac{1}{T}\sum_{j=1}^{K} \frac{\hat\nu_j}{(\lambda + \hat\nu_j)^2}\,\hat{w}_j' B \hat{w}_j + \bar\theta'\left[\frac{1}{T}\hat{P}'\hat{P}\left(\lambda D + \frac{1}{T}\hat{P}'\hat{P}\right)^{-1} - I\right] B \left[\left(\lambda D + \frac{1}{T}\hat{P}'\hat{P}\right)^{-1}\frac{1}{T}\hat{P}'\hat{P} - I\right]\bar\theta, $$

and minimize it w.r.t. λ to get the optimal regularisation parameter λ̂.

(iv) Compute the second-step TiR estimator θ̂ using regularisation parameter λ̂.
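Steps (i) and (iii) of the algorithm can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the authors' Gauss code: the function names are hypothetical, the generalized eigenproblem is solved through a Cholesky factor of D (one of several equivalent routes), and the minimization in step (iii) is done over a user-supplied grid.

```python
import numpy as np

def spectral_decomposition(S, D):
    """Eigenvalues nu_j and eigenvectors w_j of D^{-1} S, scaled so that w_j' D w_j = 1."""
    L = np.linalg.cholesky(D)                 # D = L L'
    L_inv = np.linalg.inv(L)
    nu, U = np.linalg.eigh(L_inv @ S @ L_inv.T)
    W = L_inv.T @ U                           # columns are the w_j
    return np.clip(nu, 0.0, None), W          # clip tiny negative round-off

def estimated_mise(lam, P_hat, D, B, theta_bar):
    """Variance term plus squared-bias term of the estimated MISE in step (iii)."""
    T, K = P_hat.shape
    S = P_hat.T @ P_hat / T
    nu, W = spectral_decomposition(S, D)
    weights = np.einsum("ij,ik,kj->j", W, B, W)          # w_j' B w_j
    variance = np.sum(nu / (lam + nu) ** 2 * weights) / T
    M = np.linalg.solve(lam * D + S, S) - np.eye(K)      # (lam D + S)^{-1} S - I
    shift = M @ theta_bar
    return variance + shift @ B @ shift

def select_lambda(grid, P_hat, D, B, theta_bar):
    """Grid minimizer of the estimated MISE."""
    return min(grid, key=lambda lam: estimated_mise(lam, P_hat, D, B, theta_bar))
```

Steps (ii) and (iv) are the ridge-corrected solve of Section 5, first with the pilot λ̄ and then with the selected λ̂.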

A second-step estimated MISE viewed as a function of sample size T and regularisation parameter λ can then be estimated with θ̂ instead of θ̄. Besides, if we assume the decay behaviour of Assumptions 5 and 6, the decay factors α and β can be estimated via minus the slopes of the linear fit on the pairs (log ν̂j, j) and on the pairs (log ŵj′Bŵj, log j), j = 1, ..., K. After getting lambdas minimizing the second-step estimated MISE on a grid of sample sizes we can also estimate γ by regressing the logarithm of lambda on the logarithm of sample size.

We have used λ̄ ∈ {.0005, .0001} as the pilot regularisation parameter for T = 1000 and ρ = .5. In Case 1, the average (quartiles) of the selected lambda over 1000 simulations is equal to .0028 (.0014, .0020, .0033) when λ̄ = .0005, and .0027 (.0007, .0014, .0029) when λ̄ = .0001. In Case 2, the results are .0009 (.0007, .0008, .0009) when λ̄ = .0005, and .0008 (.0004, .0006, .0009) when λ̄ = .0001. The selection procedure tends to slightly overpenalize on average, especially in Case 1, but this does not seem to have much impact on the MISE of the two-step TiR estimator. Indeed, if we use the optimal data driven regularisation parameter at each simulation, the MISE based on averages over the 1000 simulations is equal to .0120 for Case 1 and to .0144 for Case 2 when λ̄ = .0005 (resp., .0156 and .0175 when λ̄ = .0001), which are of the same magnitude as the best MISE values, .0099 and .0121, in Figures 9 and 10. In Case 1, the tendency of the selection procedure to overpenalize without unduly affecting efficiency is due to the flatness of the MISE curve.

We also get average values for the decay factors α and β close to the asymptotic ones.

These have been computed through estimating the coefficients of a linear fit for each sim-

ulation, and averaging over the 1000 simulations. For α the average (quartiles) is equal to

2.2502 (2.1456, 2.2641, 2.3628), and for β it is equal to 2.9222 (2.8790, 2.9176, 2.9619).

To compute the average value for the decay factor γ we have used an equally spaced

grid of sample sizes T ∈ {500, 550, ..., 950, 1000} in the variance component of the MISE,

together with the data driven estimate of θ in the bias component of the MISE. Optimizing

on the grid of sample sizes yields an optimal lambda for each sample size per simulation.

The logarithm of the optimal lambda is then regressed on the logarithm of the sample

size, and the estimated slope is averaged over the 1000 simulations to obtain the average

estimated gamma. In Case 1, we get an average (quartile) of .6081 (.4908, .6134, .6979),

when λ̄ = .0005, and .7224 (.5171, .6517, .7277), when λ̄ = .0001. In Case 2, we get an

average (quartile) of .5597 (.4918, .5333, .5962), when λ̄ = .0005, and .5764 (.4946, .5416,

.6203), when λ̄ = .0001.

The second data driven selection procedure builds on the suggestion of Goh (2004) based

on a subsampling procedure (also called the m-out-of-n (moon) bootstrap). Even if his

theoretical results are derived for semiparametric estimators we believe that they could be

extended to our case as well. Recognizing that λT = cT^{−γ} we propose to choose c and γ which minimize the following estimator of the MISE:

$$ \widehat{MISE}(c, \gamma) = \frac{1}{I}\,\frac{1}{J}\sum_{i,j}\int_0^1 \left(\hat\varphi_{i,j}(x; c, \gamma) - \bar\varphi(x)\right)^2 dx, $$

where ϕ̂_{i,j}(x; c, γ) denotes the estimator based on the jth subsample of size mi (mi << T) with regularisation parameter λ_{mi} = c mi^{−γ}, and ϕ̄(x) denotes the estimator based on the original sample of size T with a pilot regularisation parameter λ̄ chosen sufficiently small to eliminate the bias.

In practice we have chosen 500 subsamples (J = 500) for each subsample size mi ∈ {50, 60, 70, ..., 100} (I = 6), λ̄ ∈ {.0005, .0001}, and T = 1000. To determine c and γ we have built a joint grid with values around the OLS estimates coming from Case 1, namely {.15, .2, .25} × {.7, .75, .8}, and with values around the OLS estimates coming from Case 2, namely {.005, .01, .015} × {.35, .4, .45}. Note that the two grids yield a similar range for λT. In the experiments for ρ = 0.5 we want to verify whether the data driven procedure is able to pick, most of the time, c and γ in the first set of values in Case 1, and in the second set of values in Case 2. On 1000 simulations we have found a frequency of adequate choices equal to 96% in Case 1 when λ̄ = .0005, and 87% when λ̄ = .0001. In Case 2 we have found 77% when λ̄ = .0005, and 82% when λ̄ = .0001. These frequencies are scattered among the grid values.

References
Ai, C., and X. Chen (2003): "Efficient Estimation of Models with Conditional Moment
Restrictions Containing Unknown Functions", Econometrica, 71, 1795-1843.

Carrasco, M., Florens, J.-P., and E. Renault (2005): "Linear Inverse Problems in
Structural Econometrics: Estimation Based on Spectral Decomposition and Regulari-
sation", forthcoming in the Handbook of Financial Econometrics.

Darolles, S., Florens, J.-P., and E. Renault (2003): "Nonparametric Instrumental Re-
gression", D.P.

Hall, P., and J. Horowitz (2005): "Nonparametric Methods for Inference in the Pres-
ence of Instrumental Variables", Annals of Statistics.

Kress, R. (1999): Linear Integral Equations, Springer.

Newey, W., and J. Powell (2003): "Instrumental Variable Estimation of Nonparametric Models", Econometrica, 71, 1565-1578.

Reed, M., and B. Simon (1980): Functional Analysis, Academic Press.

White, H., and J. Wooldridge (1991): "Some Results on Sieve Estimation with Depen-
dent Observations", in Nonparametric and Semiparametric Methods in Econometrics
and Statistics, Proceedings of the Fifth International Symposium In Economic Theory
and Econometrics, Cambridge University Press.

[Figure 1 plot omitted. Panels: MISE and ISB against λ ∈ [0, 5 × 10⁻³]; estimated and true functions against x ∈ [0, 1].]

Figure 1: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0.5, and sample size is T = 400.

[Figure 2 plot omitted. Panels: MISE and ISB against λ ∈ [0, 0.05]; estimated and true functions against x ∈ [0, 1].]

Figure 2: MISE (left panel), ISB (central panel) and estimated function (right panel) for the
regularised estimator using L2 norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0.5, and sample size is T = 400.

[Figure 3 plot omitted. Panels: MISE and ISB against λ ∈ [0, 5 × 10⁻³]; estimated and true functions against x ∈ [0, 1].]

Figure 3: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0, and sample size is T = 400.

[Figure 4 plot. Panels: MISE against λ and ISB against λ (λ from 0 to 0.05), and estimated and true functions against x (x from 0 to 1).]

Figure 4: MISE (left panel), ISB (central panel) and estimated function (right panel) for the
regularised estimator using L2 norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0, and sample size is T = 400.

[Figure 5 plot. Panels: MISE against λ and ISB against λ (λ from 0 to 5 × 10^-3), and estimated and true functions against x (x from 0 to 1).]

Figure 5: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0.5, and sample size is T = 400.

[Figure 6 plot. Panels: MISE against λ and ISB against λ (λ from 0 to 0.05), and estimated and true functions against x (x from 0 to 1).]

Figure 6: MISE (left panel), ISB (central panel) and estimated function (right panel) for the
regularised estimator using L2 norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0.5, and sample size is T = 400.

[Figure 7 plot. Panels: MISE against λ and ISB against λ (λ from 0 to 5 × 10^-3), and estimated and true functions against x (x from 0 to 1).]

Figure 7: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0, and sample size is T = 400.

[Figure 8 plot. Panels: MISE against λ and ISB against λ (λ from 0 to 0.05), and estimated and true functions against x (x from 0 to 1).]

Figure 8: MISE (left panel), ISB (central panel) and estimated function (right panel) for the
regularised estimator using L2 norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0, and sample size is T = 400.

[Figure 9 plot. Panels: MISE against λ and ISB against λ (λ from 0 to 5 × 10^-3), and estimated and true functions against x (x from 0 to 1).]

Figure 9: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 1. Correlation
parameter is ρ = 0.5, and sample size is T = 1000.

[Figure 10 plot. Panels: MISE against λ and ISB against λ (λ from 0 to 5 × 10^-3), and estimated and true functions against x (x from 0 to 1).]

Figure 10: MISE (left panel), ISB (central panel) and estimated function (right panel) for
the TiR estimator using Sobolev norm (solid line) and for OLS estimator (dashed line). The
true function is the dotted line in the right panel, and corresponds to Case 2. Correlation
parameter is ρ = 0.5, and sample size is T = 1000.

[Figure 11 plot. Panels: "Eigenvalues", log(ν_j) against j (j = 1, ..., 6), and "Eigenfunctions", log(||φ_j||^2) against log(j).]

Figure 11: The six largest eigenvalues (left panel) and the L2-norms of the corresponding eigenfunctions (right panel) of operator $\tilde{A}A$.

[Figure 12 plot. Panels: "Case 1: Beta" and "Case 2: Sin", each showing log(λ_T) against log(T).]

Figure 12: Log of optimal regularisation parameter as a function of log of sample size for
Case 1 (left panel) and Case 2 (right panel). Correlation parameter is ρ = 0.5.

Appendix 1

Consistency of the TiR estimator

In this Appendix we prove the consistency of penalized extremum estimators

$$\hat{\varphi} = \arg\inf_{\varphi \in \Theta_T} \; Q_T(\varphi) + \lambda_T G(\varphi). \tag{A.1}$$

This covers the special case of the TiR estimator in Definition 1, where $G(\varphi) = \|\varphi\|_H^2$.

A.1.1 Existence and measurability of the estimator

From Theorem 2.2 of White and Wooldridge (1991), the estimator $\hat{\varphi}$ in (A.1) is well-defined and measurable if

(i) the function $Q_T : \Omega \times \Theta_T \to \mathbb{R}$ is Borel-measurable, where $Q_T(\omega, \varphi)$ denotes the value of the random variable $Q_T(\varphi)$ for event $\omega \in \Omega$, and $(\Omega, \mathcal{F}, P)$ is a complete probability space;

(ii) the mappings $\varphi \to G(\varphi)$ and $\varphi \to Q_T(\omega, \varphi)$ are weakly lower semi-continuous on $\Theta_T$, $P$-a.s., for any $T$, w.r.t. the $L^2$ norm $\|\cdot\|$;

(iii) the set $\Theta_T$ is compact w.r.t. the $L^2$ norm $\|\cdot\|$ for any $T$.

A.1.2 Consistency of penalized extremum estimators

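Proof of Theorem 1: For any $T$ and some given $\varepsilon > 0$, let us define $\varphi_T^* \in \Theta_T$ such that

$$Q_\infty(\varphi_T^*) + \lambda_T G(\varphi_T^*) = \inf_{\varphi \in \Theta_T : \|\varphi - \varphi_0\| \le \varepsilon} Q_\infty(\varphi) + \lambda_T G(\varphi).$$

We have

$$P\left[\|\hat{\varphi} - \varphi_0\| > \varepsilon\right] \le P\left[\inf_{\varphi \in \Theta_T : \|\varphi - \varphi_0\| \ge \varepsilon} Q_T(\varphi) + \lambda_T G(\varphi) \le Q_T(\varphi_T^*) + \lambda_T G(\varphi_T^*)\right].$$

Let us bound the probability on the RHS. Denoting $\Delta Q_T := Q_T - Q_\infty$, we get

$$\begin{aligned}
& \inf_{\varphi \in \Theta_T : \|\varphi - \varphi_0\| \ge \varepsilon} Q_T(\varphi) + \lambda_T G(\varphi) \le Q_T(\varphi_T^*) + \lambda_T G(\varphi_T^*) \\
\Longrightarrow\ & \inf_{\varphi \in \Theta : \|\varphi - \varphi_0\| \ge \varepsilon} Q_\infty(\varphi) + \lambda_T G(\varphi) + \inf_{\varphi \in \Theta_T} \Delta Q_T(\varphi) \le Q_\infty(\varphi_T^*) + \lambda_T G(\varphi_T^*) + \sup_{\varphi \in \Theta_T} |\Delta Q_T(\varphi)| \\
\Longrightarrow\ & \inf_{\varphi \in \Theta : \|\varphi - \varphi_0\| \ge \varepsilon} Q_\infty(\varphi) + \lambda_T G(\varphi) - \lambda_T G(\varphi_0) \le \inf_{\varphi \in \Theta_T : \|\varphi - \varphi_0\| \le \varepsilon} Q_\infty(\varphi) + \lambda_T \left[G(\varphi) - G(\varphi_0)\right] + 2 \sup_{\varphi \in \Theta_T} |\Delta Q_T(\varphi)| \\
& \qquad \le \inf_{\varphi \in \Theta_T : \|\varphi - \varphi_0\| \le \varepsilon} Q_\infty(\varphi) + |G(\varphi) - G(\varphi_0)| + 2 \sup_{\varphi \in \Theta_T} |\Delta Q_T(\varphi)| \\
& \qquad = \rho_T + 2\delta_T.
\end{aligned}$$

Thus, from (iii) we get for $a > 0$

$$P\left[\|\hat{\varphi} - \varphi_0\| > \varepsilon\right] \le P\left[C_\varepsilon(\lambda_T) \le \rho_T + 2\delta_T\right] = P\left[1 \le \frac{1}{\lambda_T^{-a} C_\varepsilon(\lambda_T)} \frac{1}{(T\lambda_T)^a} \left(T^a \rho_T + 2T^a \delta_T\right)\right] =: P\left[1 \le Z_T\right].$$

Since $\lambda_T \to 0$ such that $(T\lambda_T)^{-1} \to 0$, $P$-a.s., for $a$ chosen as in (iv) we have $Z_T \overset{p}{\to} 0$, and we deduce $P\left[\|\hat{\varphi} - \varphi_0\| > \varepsilon\right] \le P\left[Z_T \ge 1\right] \to 0$. Since $\varepsilon > 0$ is arbitrary, the proof is concluded.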

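Proof of Proposition 2: By contradiction, assume that Condition (iii) of Theorem 1 is not satisfied. Then there exist $\varepsilon > 0$ and a sequence $(\lambda_n)$ such that $\lambda_n \searrow 0$ and

$$C_\varepsilon(\lambda_n) \le 0, \quad \forall n \in \mathbb{N}. \tag{21}$$

By definition of the function $C_\varepsilon(\lambda)$, for any $\lambda > 0$ and $\eta > 0$, there exists $\varphi \in \Theta$ such that $\|\varphi - \varphi_0\| \ge \varepsilon$ and $Q_\infty(\varphi) + \lambda G(\varphi) - \lambda G(\varphi_0) \le C_\varepsilon(\lambda) + \eta$. Setting $\lambda = \eta = \lambda_n$ for $n \in \mathbb{N}$, we deduce from (21) that there exists a sequence $(\varphi_n)$ such that $\varphi_n \in \Theta$, $\|\varphi_n - \varphi_0\| \ge \varepsilon$, and

$$Q_\infty(\varphi_n) + \lambda_n G(\varphi_n) - \lambda_n G(\varphi_0) \le \lambda_n. \tag{22}$$

Now, since $Q_\infty(\varphi_n) \ge 0$, we get $\lambda_n G(\varphi_n) - \lambda_n G(\varphi_0) \le \lambda_n$, that is

$$G(\varphi_n) \le G(\varphi_0) + 1. \tag{23}$$

Moreover, since $G(\varphi_n) - G(\varphi_0) \ge G_0 - G(\varphi_0)$, we get $Q_\infty(\varphi_n) + \lambda_n G_0 - \lambda_n G(\varphi_0) \le \lambda_n$ from (22), that is $Q_\infty(\varphi_n) \le \lambda_n \left(1 + G(\varphi_0) - G_0\right)$, which implies

$$\lim_{n} Q_\infty(\varphi_n) = 0 = Q_\infty(\varphi_0). \tag{24}$$

Conditions (23) and (24) cannot hold simultaneously without violating Assumption (10). ∎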

A.1.3 Penalization with Sobolev norm

In this Section we check that the assumptions in A.1.1 and A.1.2 hold for the special case $G(\varphi) = \|\varphi\|_H^2$ under Assumptions 1 and 2.

i) The mapping $\varphi \to \|\varphi\|_H^2$ is lower semi-continuous on $H^2[0,1]$ w.r.t. the norm $\|\cdot\|$ [see Reed and Simon (1980), p. 358].

ii) Let us verify that the assumptions of Proposition 2 are satisfied. Clearly the function $G(\varphi) = \|\varphi\|_H^2$ is bounded from below by 0. Let us now check that Assumption (10) in Proposition 2 is satisfied.

Lemma A.1: Assumptions 1 and 2 imply Assumption (10) in Proposition 2.

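Proof: Let $\varepsilon > 0$ and let $(\varphi_n)$ be a sequence in $\Theta$ such that $\|\varphi_n - \varphi_0\| \ge \varepsilon$ for all $n \in \mathbb{N}$, and

$$Q_\infty(\varphi_n) \to 0 \quad \text{as } n \to \infty. \tag{25}$$

We have to prove

$$\|\varphi_n\|_H \to \infty \quad \text{as } n \to \infty. \tag{26}$$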

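To this aim, define the sequence $e_n = \dfrac{\varphi_n - \varphi_0}{\|\varphi_n - \varphi_0\|}$, $n \in \mathbb{N}$. Then $\|e_n\| = 1$ for all $n \in \mathbb{N}$, and from Assumption 1 and (25),

$$\langle e_n, A^* A e_n \rangle = \frac{\langle \Delta\varphi_n, A^* A \Delta\varphi_n \rangle}{\|\varphi_n - \varphi_0\|^2} \le \frac{1}{c^2 \varepsilon^2} Q_\infty(\varphi_n) \to 0,$$

as $n \to \infty$. Let $\Pi_{(N)}$ denote the orthogonal projection [w.r.t. the scalar product $\langle \cdot, \cdot \rangle$] on the subspace spanned by $\{\psi_1, \ldots, \psi_N\}$. Then we have for any $N \in \mathbb{N}$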

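$$\left\|\Pi_{(N)} e_n\right\|^2 = \sum_{j=1}^{N} \langle \psi_j, e_n \rangle^2 \le \frac{1}{\mu_N} \sum_{j=1}^{N} \mu_j \langle \psi_j, e_n \rangle^2 \le \frac{1}{\mu_N} \sum_{j=1}^{\infty} \mu_j \langle \psi_j, e_n \rangle^2 = \frac{1}{\mu_N} \langle e_n, A^* A e_n \rangle \to 0, \quad \text{as } n \to \infty,$$

that is $\left\|\Pi_{(N)} e_n\right\| \to 0$ as $n \to \infty$, for any $N \in \mathbb{N}$.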

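Let us now derive a lower bound for the Sobolev norm $\|e_n\|_H$. We have

$$\|e_n\|_H \ge \left\|\Pi_{(N)}^{\perp} e_n\right\|_H - \left\|\Pi_{(N)} e_n\right\|_H, \tag{27}$$

where $\Pi_{(N)}^{\perp} = 1 - \Pi_{(N)}$ denotes the orthogonal projection on $\mathrm{span}\{\psi_j : j \ge N+1\}$. Let us derive bounds for the two terms in the RHS of (27). We have

$$\begin{aligned}
\left\|\Pi_{(N)}^{\perp} e_n\right\|_H &= \left\|\sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle \psi_j\right\|_H = \left\|\frac{\sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle \psi_j}{\left(\sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle^2\right)^{1/2}}\right\|_H \left(\sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle^2\right)^{1/2} \\
&\ge \inf_{\varphi \in S_{N+1} : \|\varphi\| = 1} \|\varphi\|_H \left(\sum_{j=N+1}^{\infty} \langle \psi_j, e_n \rangle^2\right)^{1/2} = M_{N+1} \left(1 - \left\|\Pi_{(N)} e_n\right\|^2\right)^{1/2},
\end{aligned}$$

since $\|e_n\| = 1$, where $S_{N+1} = \mathrm{span}\{\psi_j : j \ge N+1\}$ and $M_{N+1} = \inf_{\varphi \in S_{N+1} : \|\varphi\| = 1} \|\varphi\|_H$.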

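Moreover,

$$\begin{aligned}
\left\|\Pi_{(N)} e_n\right\|_H^2 &= \left\|\sum_{j=1}^{N} \langle \psi_j, e_n \rangle \psi_j\right\|_H^2 = \sum_{j,l=1}^{N} \langle \psi_j, e_n \rangle \langle \psi_l, e_n \rangle \langle \psi_j, \psi_l \rangle_H \\
&\le \sum_{j,l=1}^{N} |\langle \psi_j, e_n \rangle| \, |\langle \psi_l, e_n \rangle| \, \|\psi_j\|_H \|\psi_l\|_H \le \max_{j=1,\ldots,N} \|\psi_j\|_H^2 \sum_{j,l=1}^{N} |\langle \psi_j, e_n \rangle| \, |\langle \psi_l, e_n \rangle| \\
&= \bar{M}_N \left(\sum_{j=1}^{N} |\langle \psi_j, e_n \rangle|\right)^2 \le N \bar{M}_N \sum_{j=1}^{N} \langle \psi_j, e_n \rangle^2 = N \bar{M}_N \left\|\Pi_{(N)} e_n\right\|^2,
\end{aligned}$$

where $\bar{M}_N = \max_{j=1,\ldots,N} \|\psi_j\|_H^2$. Thus, we get from (27)

$$\|e_n\|_H \ge M_{N+1} \left(1 - \left\|\Pi_{(N)} e_n\right\|^2\right)^{1/2} - c_N \left\|\Pi_{(N)} e_n\right\|, \tag{28}$$

for any $N$ and $n \in \mathbb{N}$, where $c_N = \sqrt{N \bar{M}_N}$. Note that, since $\bar{M}_N \ge \|\psi_N\|_H^2 \ge M_N^2$, by Assumption 2 we have $c_N \to \infty$ as $N \to \infty$. Since the bound (28) holds for any $N \in \mathbb{N}$, it follows

$$\|e_n\|_H \ge M_{N_n+1} \left(1 - \left\|\Pi_{(N_n)} e_n\right\|^2\right)^{1/2} - c_{N_n} \left\|\Pi_{(N_n)} e_n\right\|, \quad \text{for any } n \in \mathbb{N}, \tag{29}$$

for any sequence of integers $(N_n)$.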

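Let us now prove that there exists a sequence of integers $(N_n)$ such that the RHS of (29) diverges. To this goal, define the sequence $n(N)$, $N = 1, 2, \ldots$, recursively by

$$\begin{aligned}
n(1) &= \min\left\{n^* \in \mathbb{N} \;\middle|\; c_1 \left\|\Pi_{(1)} e_n\right\| \le 1 \text{ for all } n \ge n^*\right\}, \\
n(N) &= \min\left\{n^* \in \mathbb{N} \;\middle|\; n^* > n(N-1), \; c_N \left\|\Pi_{(N)} e_n\right\| \le 1 \text{ for all } n \ge n^*\right\}, \quad N = 2, \ldots
\end{aligned}$$

Since $c_N \left\|\Pi_{(N)} e_n\right\| \to 0$ as $n \to \infty$, for any $N \in \mathbb{N}$, it follows that $n(N) < \infty$ for any $N \in \mathbb{N}$, and the sequence $n(N)$, $N = 1, 2, \ldots$, is strictly increasing. Then, let the sequence of integers $(N_n)$, for $n \ge n(1)$, be defined by

$$N_n = \begin{cases} 1 & \text{if } n(1) \le n < n(2), \\ 2 & \text{if } n(2) \le n < n(3), \\ \;\vdots & \end{cases}$$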

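By construction, we have

$$c_{N_n} \left\|\Pi_{(N_n)} e_n\right\| \le 1, \tag{30}$$

for any $n \ge n(1)$. Since $c_N \to \infty$ as $N \to \infty$, we deduce

$$\left\|\Pi_{(N_n)} e_n\right\| \le 1/2, \quad \forall n \text{ large enough}. \tag{31}$$

Using Bounds (30) and (31) in Inequality (29), we get $\|e_n\|_H \ge M_{N_n+1} (3/4)^{1/2} - 1 \to \infty$ as $n \to \infty$, from Assumption 2.

Finally, we get

$$\|\varphi_n\|_H = \big\| \|\varphi_n - \varphi_0\| \, e_n + \varphi_0 \big\|_H \ge \|\varphi_n - \varphi_0\| \, \|e_n\|_H - \|\varphi_0\|_H \ge \varepsilon \|e_n\|_H - \|\varphi_0\|_H \to \infty. \tag{32}$$

Therefore, (26) follows, and the proof is concluded. ∎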

Appendix 2

The MISE of the TiR estimator

In this Appendix we derive the asymptotic expansion of the MISE with a deterministic sequence of regularisation parameters (Proof of Proposition 3). We focus on the linear IV case $m(\varphi, z) = E_0[\varphi(X) - Y \mid Z = z] = (A\varphi)(z) - r(z)$, where $(A\varphi)(z) = \int \varphi(x) f(w|z)\,dw$ and $r(z) = \int y f(w|z)\,dw$.

A.2.1 The first-order condition

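The objective function of the TiR estimator becomes in the linear case

$$Q_T(\varphi) + \lambda_T \|\varphi\|_H^2 = \frac{1}{T} \sum_{t=1}^{T} \left[\left(\hat{A}\varphi\right)(Z_t) - \hat{r}(Z_t)\right]^2 + \lambda_T \langle \varphi, \varphi \rangle_H. \tag{33}$$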

Let us now prove that this objective function can be written as a quadratic form in $\varphi \in H^2[0,1]$. To this aim, let us introduce the dual operator $\hat{\tilde{A}}$ of $\hat{A}$.

Lemma A.1: Under regularity conditions, the following properties hold $P$-a.s.:

(i) Function $\hat{r}$ is in $L^2(F_Z)$;

(ii) Operator $\hat{A}$ maps $H^2[0,1]$ into $L^2(F_Z)$;

(iii) There exists a linear operator $\hat{\tilde{A}}$ from $L^2(F_Z)$ into $H^2[0,1]$, such that $\left\langle h, \hat{\tilde{A}}\psi \right\rangle_H = \dfrac{1}{T} \sum_{t=1}^{T} \left(\hat{A}h\right)(Z_t)\, \psi(Z_t)$, for any $\psi \in L^2(F_Z)$ and $h \in H^2[0,1]$;

(iv) Operator $\hat{\tilde{A}}\hat{A} : H^2[0,1] \to H^2[0,1]$ is compact.

Proof: See Appendix B.

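Then, from Lemma A.1 i)-iii), Criterion (33) can be rewritten as

$$Q_T(\varphi) + \lambda_T \|\varphi\|_H^2 = \left\langle \varphi, \left(\lambda_T + \hat{\tilde{A}}\hat{A}\right)\varphi \right\rangle_H - 2 \left\langle \varphi, \hat{\tilde{A}}\hat{r} \right\rangle_H, \tag{34}$$

up to a term independent of $\varphi$. From Lemma A.1 iv), $\hat{\tilde{A}}\hat{A}$ is a compact operator from $H^2[0,1]$ into itself. Since $\hat{\tilde{A}}\hat{A}$ is positive, the operator $\lambda_T + \hat{\tilde{A}}\hat{A}$ is invertible [Kress (1999), Theorem 3.4]. It follows that the quadratic criterion function (34) admits a global minimum over $H^2[0,1]$. It is given by the first-order condition $\left(\hat{\tilde{A}}\hat{A} + \lambda_T\right)\hat{\varphi} = \hat{\tilde{A}}\hat{r}$, that is

$$\hat{\varphi} = \left(\lambda_T + \hat{\tilde{A}}\hat{A}\right)^{-1} \hat{\tilde{A}}\hat{r}. \tag{35}$$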
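As an illustration of the closed form (35), the following is a minimal sketch in finite-dimensional coordinates, not the implementation used in the paper. It assumes that $\varphi$ is expanded in a finite basis $b_1, \ldots, b_k$ with coefficient vector `theta`, that `P[t, j]` stands in for $(\hat{A} b_j)(Z_t)$, `r[t]` for $\hat{r}(Z_t)$, and that `D[i, j]` holds the Sobolev scalar products $\langle b_i, b_j \rangle_H$. All of these names, and the synthetic data, are hypothetical. Under these assumptions the dual operator has matrix $D^{-1}P'/T$, so that (35) reduces to a ridge-type linear system.

```python
import numpy as np

def tir_coefficients(P, r, D, lam):
    """Closed-form minimiser of (1/T)*||P @ theta - r||^2 + lam * theta' D theta.

    P[t, j] stands in for (A_hat b_j)(Z_t), r[t] for r_hat(Z_t), and
    D[i, j] for the Sobolev scalar product <b_i, b_j>_H (hypothetical inputs).
    """
    T = P.shape[0]
    # Coordinate version of (lambda_T + dual(A_hat) A_hat) phi = dual(A_hat) r_hat,
    # with dual(A_hat) = D^{-1} P' / T; multiplying both sides by D gives:
    lhs = lam * D + P.T @ P / T
    rhs = P.T @ r / T
    return np.linalg.solve(lhs, rhs)

def penalised_criterion(theta, P, r, D, lam):
    """Sample criterion plus penalty, the coordinate analogue of (33)."""
    resid = P @ theta - r
    return resid @ resid / len(r) + lam * theta @ D @ theta

# Synthetic illustration with random inputs (not the paper's Monte Carlo design).
rng = np.random.default_rng(0)
T, k = 200, 5
P = rng.normal(size=(T, k))
B = rng.normal(size=(k, k))
D = B.T @ B + np.eye(k)  # any symmetric positive definite Gram matrix
theta_true = rng.normal(size=k)
r = P @ theta_true + 0.1 * rng.normal(size=T)
lam = 1e-2

theta_hat = tir_coefficients(P, r, D, lam)
# The gradient of the penalised criterion should vanish at the minimiser.
gradient = 2 * (P.T @ (P @ theta_hat - r) / T + lam * D @ theta_hat)
print("max |gradient| at theta_hat:", np.max(np.abs(gradient)))
```

Solving the linear system directly avoids forming an explicit inverse, and the positive definiteness of `lam * D` plays the role of the invertibility of $\lambda_T + \hat{\tilde{A}}\hat{A}$ in the argument above.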

A.2.2 The asymptotic expansion of the first-order condition

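Let us now expand the estimator in (35). We can write

$$\hat{r}(z) = \int (y - \varphi_0(x))\, \Delta\hat{f}(w|z)\,dw + \int \varphi_0(x) \hat{f}(w|z)\,dw =: \hat{\psi}(z) + \left(\hat{A}\varphi_0\right)(z),$$

where $\Delta\hat{f}(w|z) := \hat{f}(w|z) - f(w|z)$. Hence, $\hat{\varphi} = \left(\lambda_T + \hat{\tilde{A}}\hat{A}\right)^{-1} \hat{\tilde{A}}\hat{\psi} + \left(\lambda_T + \hat{\tilde{A}}\hat{A}\right)^{-1} \hat{\tilde{A}}\hat{A}\varphi_0$, which yields

$$\hat{\varphi} - \varphi_0 = \left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}\hat{\psi} + \left[\left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}A\varphi_0 - \varphi_0\right] + R_T =: V_T + D(\lambda_T) + R_T, \tag{36}$$

where the remaining term $R_T$ is given by

$$R_T = \left[\left(\lambda_T + \hat{\tilde{A}}\hat{A}\right)^{-1} \hat{\tilde{A}} - \left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}\right]\hat{\psi} + \left[\left(\lambda_T + \hat{\tilde{A}}\hat{A}\right)^{-1} \hat{\tilde{A}}\hat{A} - \left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}A\right]\varphi_0.$$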

Lemma A.2: Assume the bandwidth conditions $h_T^m = o(\lambda_T)$, $\lambda_T / T = o\left(h_T^{d_Z}\right)$, where $m$ is the order of the kernel $K$, and $d_Z$ the dimension of $Z$. Then, under regularity assumptions, $E\left[\|R_T\|^2\right] = o\left(E\left[\|V_T + D(\lambda_T)\|^2\right]\right)$.

Proof: See Appendix B.

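From (36) we deduce

$$\begin{aligned}
E\left[\|\hat{\varphi} - \varphi_0\|^2\right] &= E\left[\|V_T + D(\lambda_T)\|^2\right] + E\left[\|R_T\|^2\right] + 2E\left[\langle V_T + D(\lambda_T), R_T \rangle\right] \\
&= E\left[\|V_T + D(\lambda_T)\|^2\right] + o\left(E\left[\|V_T + D(\lambda_T)\|^2\right]\right),
\end{aligned}$$

by applying twice the Cauchy-Schwarz inequality and Lemma A.2. Since

$$E\left[\|V_T + D(\lambda_T)\|^2\right] = \left\|\left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}A\varphi_0 - \varphi_0 + \left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A} E\hat{\psi}\right\|^2 + E\left[\left\|\left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}\left(\hat{\psi} - E\hat{\psi}\right)\right\|^2\right], \tag{37}$$

we get

$$E\left[\|\hat{\varphi} - \varphi_0\|^2\right] = \left\|\left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}A\varphi_0 - \varphi_0 + \left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A} E\hat{\psi}\right\|^2 + E\left[\left\|\left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}\left(\hat{\psi} - E\hat{\psi}\right)\right\|^2\right], \tag{38}$$

up to a term which is asymptotically negligible w.r.t. the RHS. This asymptotic expansion consists of a bias term (regularisation bias plus estimation bias) and a variance term, which are analysed separately in Lemmas A.3 and A.4. Combining these two Lemmas with the asymptotic expansion in (38) yields Proposition 3.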

A.2.3 Asymptotic expansion of the variance term

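Lemma A.3: Under regularity conditions, up to a term which is asymptotically negligible w.r.t. the RHS, we have

$$E\left[\left\|\left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}\left(\hat{\psi} - E\hat{\psi}\right)\right\|^2\right] = \frac{1}{T} \sum_{j=1}^{\infty} \frac{\nu_j}{(\lambda_T + \nu_j)^2} \left\|\phi_j\right\|^2.$$

Proof: See Appendix B.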

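A.2.4 Asymptotic expansion of the bias term

Lemma A.4: Define $b(\lambda_T) = \left\|\left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}A\varphi_0 - \varphi_0\right\|$. Then, under regularity conditions and the bandwidth condition $h_T^m = o\left(\lambda_T b(\lambda_T)\right)$, where $m$ is the order of the kernel $K$, we have

$$\left\|\left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A}A\varphi_0 - \varphi_0 + \left(\lambda_T + \tilde{A}A\right)^{-1} \tilde{A} E\hat{\psi}\right\| = b(\lambda_T),$$

up to a term which is asymptotically negligible w.r.t. the RHS.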

Proof: See Appendix B.

Appendix 3

Rate of convergence with geometric spectrum

In this Appendix we prove Proposition 4.

i) The next Lemma A.5 characterizes the variance term of the asymptotic expansion of the MISE in Proposition 3.

Lemma A.5: Let $\nu_j$ and $\|\phi_j\|^2$ satisfy Assumption 5, and define the function

$$I(\lambda) = \sum_{j=1}^{\infty} \frac{\nu_j}{(\lambda + \nu_j)^2} \left\|\phi_j\right\|^2, \quad \lambda > 0.$$

Then, $\lim_{\lambda \to 0} \lambda \left[\log(1/\lambda)\right]^{\beta} I(\lambda) = \left(\dfrac{1}{\alpha}\right)^{1-\beta} C_2$.

Proof: See Appendix B.
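The limit in Lemma A.5 can be explored numerically. The sketch below assumes, purely for illustration, the exact specification $\nu_j = e^{-\alpha j}$ and $\|\phi_j\|^2 = C_2 j^{-\beta}$ as one instance of a geometric spectrum. Assumption 5 itself is stated in the main text and may be more general. The function name `variance_sum`, the truncation at `n_terms`, and the parameter values are hypothetical choices.

```python
import numpy as np

def variance_sum(lam, alpha, beta, C2, n_terms=2000):
    """Truncated version of I(lambda) = sum_j nu_j / (lambda + nu_j)^2 * ||phi_j||^2."""
    j = np.arange(1, n_terms + 1, dtype=float)
    nu = np.exp(-alpha * j)          # assumed geometric eigenvalues
    phi_norm_sq = C2 * j ** (-beta)  # assumed hyperbolic squared norms
    return np.sum(nu / (lam + nu) ** 2 * phi_norm_sq)

alpha, beta, C2 = 0.2, 2.0, 1.0
limit = (1.0 / alpha) ** (1.0 - beta) * C2  # right-hand side of Lemma A.5

for lam in (1e-4, 1e-8, 1e-12):
    scaled = lam * np.log(1.0 / lam) ** beta * variance_sum(lam, alpha, beta, C2)
    print(f"lambda={lam:.0e}  scaled I(lambda)={scaled:.4f}  limit={limit:.4f}")
```

Because the convergence is governed by $\log(1/\lambda)$, the scaled sum approaches the limit slowly, and very small values of $\lambda$ are needed before the two agree to within a few percent.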

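From Lemma A.5 and using Assumption 6, we get

$$M_T(\lambda) = c_1 \frac{1}{T} \frac{1}{\lambda \left[\log(1/\lambda)\right]^{\beta}} + c_2 \lambda^{2\delta},$$

for $\lambda \to 0$ and $T \to \infty$, where $c_1 = \left(\dfrac{1}{\alpha}\right)^{1-\beta} C_2$ and $c_2 = C_3^2$.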
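ii) The optimal sequence $\lambda_T^*$ is obtained by minimizing the function $M_T(\lambda)$ w.r.t. $\lambda$. We have

$$\begin{aligned}
\frac{dM_T(\lambda)}{d\lambda} &= -\frac{1}{T} c_1 \frac{1}{\lambda^2 \left[\log(1/\lambda)\right]^{2\beta}} \left(\left[\log(1/\lambda)\right]^{\beta} - \lambda \beta \left[\log(1/\lambda)\right]^{\beta - 1} \frac{1}{\lambda}\right) + 2 c_2 \delta \lambda^{2\delta - 1} \\
&= -\frac{1}{T} c_1 \frac{\log(1/\lambda) - \beta}{\lambda^2 \left[\log(1/\lambda)\right]^{\beta + 1}} + 2 c_2 \delta \lambda^{2\delta - 1}.
\end{aligned}$$

Thus

$$\frac{dM_T(\lambda_T^*)}{d\lambda} = 0 \iff \frac{1}{T} \frac{c_1}{2 c_2 \delta} \frac{\log(1/\lambda_T^*) - \beta}{\left[\log(1/\lambda_T^*)\right]^{\beta + 1}} = (\lambda_T^*)^{2\delta + 1}. \tag{39}$$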

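To solve the latter equation for $\lambda_T^*$, define $\tau_T := \log(1/\lambda_T^*)$. Then $\tau_T$ satisfies

$$\tau_T = c_3 + \frac{1}{1 + 2\delta} \log T + \frac{1 + \beta}{1 + 2\delta} \log \tau_T - \frac{1}{1 + 2\delta} \log(\tau_T - \beta),$$

where $c_3 = (1 + 2\delta)^{-1} \log(2 c_2 \delta / c_1)$. Since $\log \tau_T - \log(\tau_T - \beta) \to 0$ as $\tau_T \to \infty$, the last two terms combine to $\frac{\beta}{1 + 2\delta} \log \tau_T$ up to $o(1)$, and $\log \tau_T = \log\log T + O(1)$. It follows that

$$\tau_T = c_4 + \frac{1}{1 + 2\delta} \log T + \frac{\beta}{1 + 2\delta} \log\log T + o(\log\log T),$$

for some constant $c_4$, that is

$$\log(\lambda_T^*) = -c_4 - \frac{1}{1 + 2\delta} \log T - \frac{\beta}{1 + 2\delta} \log\log T + o(\log\log T).$$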
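The first-order condition (39) can be checked numerically by minimizing $M_T(\lambda)$ on a grid. The sketch below is only illustrative: the constants `c1`, `c2`, `beta`, `delta` and the sample size `T` are arbitrary values chosen for the example, not values taken from the paper, and the log-spaced grid search in `optimal_lambda` is a simple stand-in for any one-dimensional minimizer.

```python
import numpy as np

def mise(lam, T, c1, c2, beta, delta):
    """Asymptotic MISE M_T(lambda) = c1 / (T * lambda * log(1/lambda)^beta) + c2 * lambda^(2*delta)."""
    L = np.log(1.0 / lam)
    return c1 / (T * lam * L ** beta) + c2 * lam ** (2 * delta)

def optimal_lambda(T, c1, c2, beta, delta, lo=1e-12, hi=0.2, n_grid=200001):
    """Grid minimiser of M_T over a log-spaced grid on [lo, hi]."""
    grid = np.exp(np.linspace(np.log(lo), np.log(hi), n_grid))
    return grid[np.argmin(mise(grid, T, c1, c2, beta, delta))]

# Arbitrary illustrative constants (assumptions, not taken from the paper).
T, c1, c2, beta, delta = 1000, 1.0, 1.0, 1.0, 1.0

lam_star = optimal_lambda(T, c1, c2, beta, delta)
tau = np.log(1.0 / lam_star)

# Both sides of the first-order condition (39) evaluated at the grid minimiser.
foc_lhs = (1.0 / T) * (c1 / (2 * c2 * delta)) * (tau - beta) / tau ** (beta + 1)
foc_rhs = lam_star ** (2 * delta + 1)
print("lambda*:", lam_star)
print("FOC lhs / rhs:", foc_lhs / foc_rhs)
```

The upper grid bound is kept below $e^{-\beta}$ so that $\log(1/\lambda) > \beta$ on the whole grid, which is the region where the variance term is decreasing in $\lambda$ and the expansion for small $\lambda$ is meaningful.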

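iii) Finally, let us compute the MISE corresponding to $\lambda_T^*$. We have

$$M_T(\lambda_T^*) = c_1 \frac{1}{T} \frac{1}{\lambda_T^* \left[\log(1/\lambda_T^*)\right]^{\beta}} + c_2 (\lambda_T^*)^{2\delta} = c_1 \frac{1}{T} \frac{1}{\lambda_T^* \tau_T^{\beta}} + c_2 (\lambda_T^*)^{2\delta}.$$

From (39),

$$\lambda_T^* = \left(\frac{c_1}{2 c_2 \delta}\right)^{\frac{1}{2\delta + 1}} T^{-\frac{1}{2\delta + 1}} \left(\frac{\tau_T - \beta}{\tau_T^{\beta + 1}}\right)^{\frac{1}{2\delta + 1}} = c_5\, T^{-\frac{1}{2\delta + 1}} \tau_T^{-\frac{\beta}{2\delta + 1}},$$

for some constant $c_5$, up to a term which is negligible w.r.t. the RHS. Thus we get

$$\begin{aligned}
M_T(\lambda_T^*) &= \frac{1}{T} \frac{c_1}{c_5} T^{\frac{1}{2\delta + 1}} \frac{1}{\tau_T^{-\frac{\beta}{2\delta + 1} + \beta}} + c_2 c_5^{2\delta}\, T^{-\frac{2\delta}{2\delta + 1}} \tau_T^{-\frac{2\delta\beta}{2\delta + 1}} \\
&= c_6\, T^{-\frac{2\delta}{2\delta + 1}} \tau_T^{-\frac{2\delta\beta}{2\delta + 1}} = c_7\, T^{-\frac{2\delta}{2\delta + 1}} (\log T)^{-\frac{2\delta\beta}{2\delta + 1}},
\end{aligned}$$

for some constants $c_6$ and $c_7$, up to a term which is negligible w.r.t. the RHS.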
