
Functional Estimation
for Density, Regression
Models and Processes

Odile Pons
INRA, France

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI



Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

FUNCTIONAL ESTIMATION FOR DENSITY, REGRESSION MODELS AND PROCESSES
Copyright © 2011 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.

ISBN-13 978-981-4343-73-2
ISBN-10 981-4343-73-0

Printed in Singapore.




Preface

Nonparametric estimators have been intensively used for the statistical analysis of independent or dependent sequences of random variables and for samples of continuous or discrete processes. The optimization of the procedures is based on the choice of a bandwidth that minimizes an estimation error for functionals of their probability distributions.

This book presents new mathematical results about statistical methods for density and regression functions widely studied in the mathematical literature. Its origin undoubtedly benefits from earlier publications and from other subjects I worked on in other models for processes. Some questions of great interest for optimizing the methods motivated much work some years ago; they are mentioned in the introduction and give rise to the new developments of this book. The methods are generalized to estimators with kernel sequences varying on the sample space and to adaptive procedures for estimating the optimal local bandwidth of each model.

More complex models are defined by several nonparametric functions or by vector parameters and nonparametric functions, such as the models for the intensity of point processes and the single-index regression models. New estimators are defined and their convergence rates are compared.

Odile M.-T. Pons



Contents

Preface

1. Introduction
   1.1 Estimation of a density
   1.2 Estimation of a regression curve
   1.3 Estimation of functionals of processes
   1.4 Content of the book

2. Kernel estimator of a density
   2.1 Introduction
   2.2 Risks and optimal bandwidths for the kernel estimator
   2.3 Weak convergence
   2.4 Minimax and histogram estimators
   2.5 Estimation of functionals of a density
   2.6 Density of absolutely continuous distributions
   2.7 Hellinger distance between a density and its estimator
   2.8 Estimation of the density under right-censoring
   2.9 Estimation of the density of left-censored variables
   2.10 Kernel estimator for the density of a process
   2.11 Exercises

3. Kernel estimator of a regression function
   3.1 Introduction and notation
   3.2 Risks and convergence rates for the estimator
   3.3 Optimal bandwidths
   3.4 Weak convergence of the estimator
   3.5 Estimation of a regression curve by local polynomials
   3.6 Estimation in regression models with functional variance
   3.7 Estimation of the mode of a regression function
   3.8 Estimation of a regression function under censoring
   3.9 Proportional odds model
   3.10 Estimation for the regression function of processes
   3.11 Exercises

4. Limits for the varying bandwidths estimators
   4.1 Introduction
   4.2 Estimation of densities
   4.3 Estimation of regression functions
   4.4 Estimation for processes
   4.5 Exercises

5. Nonparametric estimation of quantiles
   5.1 Introduction
   5.2 Asymptotics for the quantile processes
   5.3 Bandwidth selection
   5.4 Estimation of the conditional density of Y given X
   5.5 Estimation of conditional quantiles for processes
   5.6 Inverse of a regression function
   5.7 Quantile function of right-censored variables
   5.8 Conditional quantiles with variable bandwidth
   5.9 Exercises

6. Nonparametric estimation of intensities for stochastic processes
   6.1 Introduction
   6.2 Risks and convergences for estimators of the intensity
       6.2.1 Kernel estimator of the intensity
       6.2.2 Histogram estimator of the intensity
   6.3 Risks and convergences for multiplicative intensities
       6.3.1 Models with nonparametric regression functions
       6.3.2 Models with parametric regression functions
   6.4 Histograms for intensity and regression functions
   6.5 Estimation of the density of duration excess
   6.6 Estimators for processes on increasing intervals
   6.7 Models with varying intensity or regression coefficients
   6.8 Progressive censoring of a random time sequence
   6.9 Exercises

7. Estimation in semi-parametric regression models
   7.1 Introduction
   7.2 Convergence of the estimators
   7.3 Nonparametric regression with a change of variables
   7.4 Exercises

8. Diffusion processes
   8.1 Introduction
   8.2 Estimation for continuous diffusions by discretization
   8.3 Estimation for continuous diffusion processes
   8.4 Estimation of discretely observed diffusions with jumps
   8.5 Continuous estimation for diffusions with jumps
   8.6 Transformations of a non-stationary Gaussian process
   8.7 Exercises

9. Applications to time series
   9.1 Nonparametric estimation of the mean
   9.2 Periodic models for time series
   9.3 Nonparametric estimation of the covariance function
   9.4 Nonparametric transformations for stationarity
   9.5 Change-points in time series
   9.6 Exercises

10. Appendix
    10.1 Appendix A
    10.2 Appendix B
    10.3 Appendix C
    10.4 Appendix D

Notations
Bibliography
Index

Chapter 1

Introduction

The aim of this book is to present within a single approach estimators for the functions defining probability models: density, intensity of point processes, regression curves and diffusion processes. The observations may be continuous for processes, or discretized for samples of densities, regressions and time series with sequential observations over time. The regular sampling scheme of the time series is not common in regression models, where stochastic explanatory variables $X$ are recorded together with a response variable $Y$ according to a random sampling of independent and identically distributed observations $(X_i, Y_i)_{i\le n}$. The discretization of a continuous diffusion process yields a regression model, and the approximation error can be made sufficiently small to extend the estimators of the regression model to the drift and variance functions of a diffusion process. The functions defining the probability models are not specified by parameters and they are estimated in functional spaces.

This chapter is a review of well-known estimators for density and regression functions and a presentation of models for continuous or discrete processes where nonparametric estimators are defined.

On a probability space $(\Omega, \mathcal{A}, P)$, let $X$ be a random variable with distribution function $F(x) = \Pr(X \le x)$ and Lebesgue density $f$, the derivative of $F$. The empirical distribution function and the histogram are the simplest estimators of a distribution function and a density, respectively. With a sample $(X_i)_{i\le n}$ of the variable $X$, the distribution function $F(x)$ is estimated by $\hat F_n(x)$, the proportion of observations smaller than $x$, which converges uniformly to $F$ in probability and almost surely if and only if $F$ is continuous. A histogram with bandwidth $h_n$ relies on a partition of the range of the observations into disjoint subintervals of length $h_n$, where the density is estimated by the proportion of observations $X_i$ in each subinterval, divided by $h_n$. The bandwidth $h_n$ tends to zero as $n$ tends to infinity and $nh_n^2$ tends to infinity, thus the size of the partition tends to infinity with the sample size. For a variable $X$ defined in a metric space $(\mathcal{X}, \mathcal{A}, \mu)$, the histogram is the local nonparametric estimator defined by a set of neighbourhoods $\mathcal{V}_h = \{V_h(x), x \in \mathcal{X}\}$, with $V_h(x) = \{s; d(x,s) \le h\}$ for the metric $d$ of $(\mathcal{X}, \mathcal{A}, \mu)$
$$\hat f_{n,h}(x) = \Bigl(n \int_{V_h(x)} dF_X\Bigr)^{-1} \sum_{i=1}^n 1_{\{X_i \in V_h(x)\}}. \qquad (1.1)$$
The empirical distribution function and the histogram are stepwise estimators, and smooth estimators have later been defined for regular functions.
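As a minimal numerical sketch of these two stepwise estimators (the function names and constants below are illustrative, not from the book):

    import numpy as np

    def empirical_cdf(sample, x):
        # \hat F_n(x): proportion of observations not larger than x
        return np.mean(sample[:, None] <= x[None, :], axis=0)

    def histogram_density(sample, x, h):
        # disjoint bins of length h; estimate f by bin proportions divided by h
        edges = np.arange(sample.min(), sample.max() + h, h)
        counts, edges = np.histogram(sample, bins=edges)
        f_hat = counts / (len(sample) * h)
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(f_hat) - 1)
        return f_hat[idx]

    rng = np.random.default_rng(0)
    X = rng.normal(size=500)
    x = np.linspace(-2.5, 2.5, 6)
    print(empirical_cdf(X, x))
    print(histogram_density(X, x, h=0.4))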

1.1 Estimation of a density

Several kinds of smooth methods have been developed. The first one was the projection of functions onto regular and orthonormal bases of functions $(\phi_k)_{k\ge 0}$. The density of the observations is approximated by a countable projection on the basis, $f_n(x) = \sum_{k=1}^{K_n} a_k \phi_k(x)$, where $K_n$ tends to infinity and the coefficients are defined by the scalar product specific to the orthonormality of the basis, with
$$\int \phi_k^2(x)\,\mu_\phi(x)\,dx = 1, \qquad \int \phi_k(x)\phi_l(x)\,\mu_\phi(x)\,dx = 0, \ \text{for all } k \ne l,$$
then $a_k = \langle f, \phi_k\rangle = \int f(x)\phi_k(x)\,\mu_\phi(x)\,dx$. The coefficients are estimated by integrating the basis with respect to the empirical distribution of the variable $X$
$$\hat a_{kn} = \int \phi_k(x)\,\mu_\phi(x)\,d\hat F_n(x),$$
which yields an estimator of the density $\hat f_n(x) = \sum_{k=1}^{K_n} \hat a_{kn}\phi_k(x)$. The same principle applies to other stepwise estimators of functions. Well known bases of $L^2$-orthogonal functions are

(i) Legendre's polynomials (Legendre, French mathematician, 1752-1833), defined on the interval $[-1,1]$ as solutions of the differential equations
$$(1-x^2)P_n''(x) - 2xP_n'(x) + n(n+1)P_n(x) = 0,$$
with $P_n(1) = 1$. Their solutions have an integral form attributed to Hermite and his student Stieltjes
$$P_n(\cos\theta) = \frac{2}{\pi}\int_0^\theta \frac{\sin\{(n+\frac{1}{2})\phi\}\,d\phi}{\{2\cos\phi - 2\cos\theta\}^{1/2}}.$$
The polynomial $P_n(x)$ has also been expressed as the coefficient of $z^{-(n+1)}$ in the expansion of $(z^2 - 2xz + 1)^{-1/2}$ by Stieltjes (1890). They are orthogonal with the scalar product
$$\langle f, g\rangle = \int_{-1}^1 f(x)g(x)\,dx;$$

(ii) Hermite's polynomials (Hermite, French mathematician, 1822-1901) of degree $n$, defined by the derivatives
$$H_n(x) = (-1)^n e^{x^2/2}\frac{d^n}{dx^n}(e^{-x^2/2}), \quad n\ge 1;$$
they satisfy the recurrence equation $H_{n+1}(x) = xH_n(x) - H_n'(x)$, with $H_0(x) = 1$. They are orthogonal with the scalar product
$$\langle f, g\rangle = \int_{-\infty}^{+\infty} f(x)g(x)\,e^{-x^2/2}\,dx$$
and their squared norm is $\|H_n\|^2 = n!\sqrt{2\pi}$;

(iii) Laguerre's polynomials (Laguerre, French mathematician, 1834-1886), defined by the derivatives
$$L_n(x) = \frac{e^x}{n!}\frac{d^n}{dx^n}(e^{-x}x^n), \quad n\ge 1,$$
and $L_0(x) = 1$. They satisfy the recurrence equation $(n+1)L_{n+1}(x) = (2n+1-x)L_n(x) - nL_{n-1}(x)$ and they are orthogonal with the scalar product
$$\langle f, g\rangle = \int_0^{+\infty} f(x)g(x)\,e^{-x}\,dx.$$

The orthogonal polynomials are normalized by their norm. If the function $f$ is Lipschitz, the polynomial approximations converge to $f$ in $L^2$ and pointwise. The corresponding projection estimators also converge in $L^2$ and pointwise. Though the bases generate functional spaces of smooth integrable functions, the estimation is parametric. The estimation error of the approximating function converges to zero in $L^2$, with the norm $\|\hat f_n - f_n\|_2 = \{\int_{-\infty}^{+\infty} E(\hat f_n - f_n)^2(x)\,\mu_\phi(x)\,dx\}^{1/2}$, if $n^{-1}K_n$ tends to zero, so that
$$\|\hat f_n - f_n\|_2^2 = \int_{-\infty}^{+\infty}\sum_{k=1}^{K_n} E(\hat a_{kn} - a_k)^2\,\phi_k^2(x)\,\mu_\phi(x)\,dx = \sum_{k=1}^{K_n} E(\hat a_{kn} - a_k)^2,$$
$$E(\hat a_{kn} - a_k)^2 = E\Bigl\{\int \phi_k(x)\,\mu_\phi(x)\,d(\hat F_n - F)(x)\Bigr\}^2 = n^{-1}\iint \phi_k(x)\phi_k(y)\,\mu_\phi(x)\mu_\phi(y)\,dC(x,y),$$
where $C(x,y) = F(x\wedge y) - F(x)F(y)$ is the covariance function of the empirical process $n^{1/2}(\hat F_n - F)$. The convergence rate of the norm of the density estimator is the sum of the norm $\|\hat f_n - f_n\|_2 = O(n^{-1/2}K_n^{1/2})$ and of the approximation error $\|f_n - f\|_2 = (\sum_{k=K_n+1}^{\infty} a_k^2)^{1/2}$; it is determined by the convergence rate of the sum of the squared coefficients and therefore by the degree of derivability of the function $f$.

Splines are also bases of functions constrained at fixed points or by a condition of derivability of the function $f$, with an order of integration for its higher derivative. They have been introduced by Whittaker (1923) and developed by Schoenberg (1964), Wold (1975), Wahba and Wold (1975), De Boor (1978), Wahba (1978) and Eubank (1988). They allow the approximation of functions having different degrees of smoothness on different intervals, which can be fixed. A comparison between splines and kernel estimators of densities may be found in Silverman (1984), who established a uniform asymptotic bound for the difference between the kernel function and the weight function of cubic splines, with a kernel bandwidth of order $\lambda^{1/4}$, where $\lambda$ is the smoothing parameter of the splines. Each spline operator corresponds to a kernel operator and the bias and variance of both estimators have the same rate of convergence (Rice and Rosenblatt, 1983, Silverman, 1984). Messer (1991) provides an explicit expression of the kernel corresponding to a cubic sinusoidal spline, with their rates of convergence.
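As a toy illustration of the projection method described above, with the orthonormal cosine basis of $L^2([0,1])$ standing in for the polynomial bases (an illustrative choice, with $\mu_\phi \equiv 1$):

    import numpy as np

    def cosine_basis(k, x):
        # orthonormal basis of L^2([0,1]): phi_0 = 1, phi_k = sqrt(2) cos(k pi x)
        return np.ones_like(x) if k == 0 else np.sqrt(2.0) * np.cos(k * np.pi * x)

    def projection_density(sample, x, K_n):
        # \hat a_{kn} = \int phi_k dF_n : coefficients from the empirical measure
        coeffs = [np.mean(cosine_basis(k, sample)) for k in range(K_n + 1)]
        return sum(a * cosine_basis(k, x) for k, a in enumerate(coeffs))

    rng = np.random.default_rng(1)
    X = rng.beta(2.0, 4.0, size=1000)      # a smooth density on [0, 1]
    x = np.linspace(0.0, 1.0, 5)
    print(projection_density(X, x, K_n=8))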

Kernel estimators of densities have first been introduced and studied by Rosenblatt (1956), Whittle (1958), Parzen (1962), Watson and Leadbetter (1963), and Bickel and Rosenblatt (1973). Consider a real random variable $X$ defined on $(\Omega,\mathcal{A},P)$ with density $f_X$ and distribution function $F_X$. A continuous density $f_X$ is estimated by smoothing the empirical distribution function $\hat F_{X,n}$ of a sample $(X_i)_{1\le i\le n}$ distributed as $X$, by means of its convolution with a kernel $K$, over a bandwidth $h = h_n$ tending to zero as $n$ tends to infinity
$$\hat f_{X,n,h}(x) = \int K_h(x-s)\,d\hat F_{X,n}(s) = \frac{1}{nh}\sum_{i=1}^n K\Bigl(\frac{x - X_i}{h}\Bigr), \qquad (1.2)$$
where $K_h(x) = h^{-1}K(h^{-1}x)$ is the kernel of bandwidth $h$. The weighting kernel is a bounded symmetric density satisfying regularity properties and moment conditions. With a $p$-variate vector $X$, the kernel may be defined on $\mathbb{R}^p$ and $K_h(x) = (h_1\cdots h_p)^{-1}K(h_1^{-1}x_1,\ldots,h_p^{-1}x_p)$, for $p$-dimensional vectors $x = (x_1,\ldots,x_p)$ and $h = (h_1,\ldots,h_p)$. Scott (1992) gives a detailed presentation of the multivariate density estimators with graphical visualizations. Another estimator is based on the topology of the space $(\mathcal{X},\mathcal{A},\mu)$, with (1.1) or using a real function $K$ and
$$K_h(x) = h^{-1}K(h^{-1}\|x\|_\mu), \quad h > 0.$$
The regularity of the kernel $K$ entails the continuity of the estimator $\hat f_{X,n,h}$. All results established for a real valued variable $X$ apply straightforwardly to a variable defined in a metric space.
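A minimal sketch of the estimator (1.2), here with the Bartlett-Epanechnikov kernel discussed below (names are illustrative):

    import numpy as np

    def epanechnikov(u):
        # Bartlett-Epanechnikov kernel with support [-1, 1]
        return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

    def kde(sample, x, h):
        # \hat f_{X,n,h}(x) = (nh)^{-1} sum_i K((x - X_i) / h)
        u = (x[:, None] - sample[None, :]) / h
        return epanechnikov(u).mean(axis=1) / h

    rng = np.random.default_rng(2)
    X = rng.normal(size=400)
    print(kde(X, np.linspace(-3.0, 3.0, 7), h=0.5))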
Deheuvels (1977) presented a review of nonparametric methods of estimation for the density and compared the mean squared errors of several kernel estimators, including the classical polynomial kernels which do not satisfy the above conditions; some of them diverge and their orders differ from those of the density kernels. Classical kernels are the normal density, with support $\mathbb{R}$, and densities with a compact support, such as the Bartlett-Epanechnikov kernel with support $[-1,1]$, $K(u) = 0.75(1-u^2)1_{\{|u|\le 1\}}$; other kernels are presented in Parzen (1962), Prakasa Rao (1983), etc. With a sequence $h_n$ converging to zero at a convenient rate, the estimator $\hat f_{X,n,h}$ is biased, with an asymptotically negligible bias depending on the regularity properties of the density. Constants depending on moments of the kernel function also appear in the bias function $E\hat f_{X,n,h} - f_X$ and in the moments $E\hat f_{X,n,h}^k$ of the estimated density. The variance does not depend on the class of the density. The weak and strong uniform consistency of the kernel density estimator and of its derivatives were proved by Silverman (1978) under derivability conditions for the density. Their performances are measured by several error criteria corresponding to the estimation of the density at a single point or over its whole support. The mean squared error criterion is common for that purpose and it splits into a variance and the square of a bias term
$$MSE(\hat f_{X,n,h}; x, h) = E\{\hat f_{X,n,h}(x) - f_X(x)\}^2 = E\{\hat f_{X,n,h}(x) - E\hat f_{X,n,h}(x)\}^2 + \{E\hat f_{X,n,h}(x) - f_X(x)\}^2.$$
A global random measure of the distance between the estimator $\hat f_{X,n,h}$ and the density $f_X$ is the integrated squared error (ISE) given by
$$ISE(\hat f_{X,n,h}; h) = \int\{\hat f_{X,n,h}(x) - f_X(x)\}^2\,dx. \qquad (1.3)$$
A global error criterion is the mean integrated squared error introduced by Rosenblatt (1956)
$$MISE(\hat f_{X,n,h}; h) = E\{ISE(\hat f_{X,n,h}; h)\} = \int MSE(\hat f_{X,n,h}; x, h)\,dx. \qquad (1.4)$$

The first order approximations of the MSE and the MISE as the sample size increases are the AMSE and the AMISE. Let $(h_n)_n$ be a bandwidth sequence converging to zero and such that $nh$ tends to infinity, and let $K$ be a kernel satisfying $m_{2K} = \int x^2K(x)\,dx < \infty$ and $\kappa_2 = \int K^2(x)\,dx < \infty$. Consider a variable $X$ such that $EX^2$ is finite and the density $f_X$ is twice continuously differentiable; then
$$AMSE(\hat f_{X,n,h}; x) = (nh)^{-1}f_X(x)\kappa_2 + \frac{h^4}{4}m_{2K}^2\,f_X''^2(x).$$
These errors depend on the bandwidth $h$ of the kernel and the AMSE is minimized at the value
$$h_{AMSE}(x) = \Bigl\{\frac{f_X(x)\int K^2(x)\,dx}{n\,m_{2K}^2\,f_X''^2(x)}\Bigr\}^{1/5}.$$
The global optimum of the AMISE is attained at
$$h_{AMISE} = \Bigl\{\frac{\int K^2(x)\,dx}{n\,m_{2K}^2\int f_X''^2(x)\,dx}\Bigr\}^{1/5}.$$
Then the optimal AMSE tends to zero with the order $n^{-4/5}$; it depends on the kernel and on the unknown values at $x$ of the functions $f_X$ and $f_X''$, or on their integrals for the integrated error (Silverman, 1986). If the bandwidth has a smaller order, the variance of the estimator is predominant in the expression of the errors and the variations of the estimator are larger; if the bandwidth is larger than the optimal value, the bias increases and the variance is reduced. The approximation made by suppressing the higher order terms in the expansions of the bias and the variance of the density estimator is obviously another source of error in the choice of the bandwidth: Hall and Marron (1987) proved that $h_{MISE}/h_{AMISE}$ tends to 1 and that $h_{ISE}/h_{AMISE}$ tends to 1 in probability as $n$ tends to infinity. Surveys on kernel density estimators and their risk functions were given by Nadaraya (1989), Rosenblatt (1956, 1971), Prakasa Rao (1983), Hall (1984), Härdle (1991) and Khasminskii (1992). The smoothness conditions for the density are sometimes replaced by Lipschitz or Hölder conditions and the expansions for the MSE are replaced by expansions for an upper bound. Parzen (1962) also proved the weak convergence of the mode of a kernel density estimator. The derivatives of the density are naturally estimated by those of the kernel estimator, and the weak and strong convergence of derivative estimators have been considered by Bhattacharya (1967) and Schuster (1969), among others. The $L^1(\mathbb{R})$ norm of the difference between the kernel estimator and its expectation converges to zero, as a consequence of the properties of the convolution. Devroye (1983) studied the consistency of the $L^1$-norm $\|\hat f_{X,n,h} - f\|_1 = \int|\hat f_{X,n,h} - f|\,dx$; Giné, Mason and Zaitsev (2003) established the weak convergence of the process $n^{1/2}(\|\hat f_{X,n,h} - E\hat f_{X,n,h}\|_1 - E\|\hat f_{X,n,h} - E\hat f_{X,n,h}\|_1)$ to a normal variable with variance depending on $\int K(u)K(u+t)\,du$. Bounds for minimax estimators have been established by Beran (1972). A minimax property of the kernel estimator with the optimal convergence rate $n^{2/5}$ was proved by Bretagnolle and Huber (1981).
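As a numerical illustration of $h_{AMISE}$, a plug-in sketch that evaluates the formula with $\int f''^2$ computed under a normal reference density (an illustrative assumption, not a recommendation of the book):

    import numpy as np

    def h_amise_normal_reference(sample):
        # h_AMISE = { kappa_2 / (n m_{2K}^2 int f''^2) }^{1/5};
        # Epanechnikov kernel: kappa_2 = 3/5 and m_{2K} = 1/5;
        # normal reference: int f''^2 = 3 / (8 sqrt(pi) sigma^5)
        n, sigma = len(sample), sample.std(ddof=1)
        kappa2, m2K = 3.0 / 5.0, 1.0 / 5.0
        roughness = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)
        return (kappa2 / (n * m2K**2 * roughness)) ** 0.2

    rng = np.random.default_rng(3)
    print(h_amise_normal_reference(rng.normal(size=400)))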
Though the estimator of a monotone function is monotone with probability tending to 1 as the number of observations tends to infinity, the number of observations is not always large enough to preserve this property and a monotone kernel estimator is built for monotone density functions by isotonisation of the classical kernel estimator. Monotone estimators for a distribution function and a density have first been defined by Grenander (1956) as the least concave minorant of the empirical distribution function and its derivative. This estimator has been studied by Barlow, Bartholomew, Bremner and Brunk (1972), Kiefer and Wolfowitz (1976), Groeneboom (1989), and Groeneboom and Wellner (1997). The isotonisation of the kernel estimator $\hat f_{n,h}$ for a density function is
$$\hat f_{SI,n,h}(x) = \inf_{v\ge x}\sup_{u\le x}\frac{1}{v-u}\int_u^v \hat f_{n,h}(t)\,dt \qquad (1.5)$$
and $\int 1_{\{t\le x\}}\hat f_{SI,n,h}(t)\,dt$ is the greatest convex minorant of the integrated estimator $\int 1_{\{t\le x\}}\hat f_{n,h}(t)\,dt$. Its convergence rate is $n^{1/3}$ (van der Vaart and van der Laan, 2003). Groeneboom and Wellner studied the weak convergence of local increments of the isotonic estimator of the distribution function. The estimation of a convex decreasing and twice continuously differentiable density on $\mathbb{R}^+$ by a piecewise linear estimator with knots between observation points was studied by Groeneboom, Jongbloed and Wellner (2001); their estimator is $n^{2/5}$-consistent. Dümbgen and Rufibach (2009) proposed a similar estimator for a log-concave density on $\mathbb{R}^+$ and established its convergence rate $(n(\log n)^{-1})^{\beta/(2\beta+1)}$, for Hölder densities of $H_{\beta,M}$.
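The min-max formula (1.5) can be transcribed directly on a grid of estimated values; the following unoptimized sketch (illustrative, with cubic complexity in the grid size) returns the isotonised, nondecreasing version of the input:

    import numpy as np

    def isotonize(f_vals):
        # (1.5) on a grid: inf over v >= x, sup over u <= x, of the
        # average of the estimated values over [u, v]
        n = len(f_vals)
        csum = np.concatenate(([0.0], np.cumsum(f_vals)))
        avg = lambda u, v: (csum[v + 1] - csum[u]) / (v + 1 - u)
        return np.array([
            min(max(avg(u, v) for u in range(x + 1)) for v in range(x, n))
            for x in range(n)
        ])

    print(isotonize(np.array([0.3, 0.1, 0.4, 0.2, 0.6])))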
Stone (1974), De Boor (1975), Bowman (1983) and Marron (1987) introduced automatic data-driven methods for the choice of the global bandwidth. They minimize the integrated random risk ISE, or the cross-validation criterion $CV(h) = \int\hat f_{X,n,h}^2(x)\,dx - 2n^{-1}\sum_{i=1}^n\hat f_{X,n,h,i}(X_i)$, where $\hat f_{X,n,h,i}$ is the kernel estimator based on the data sample without the $i$-th observation, or the empirical version of the Kullback-Leibler loss-function $K(\hat f_{X,n,h}, f) = -E\int\log\hat f_{X,n,h}\,dF_X$ (Bowman, 1983). The $CV(h)$ criterion is an unbiased estimator of the MISE and its minimum is the minimum of the estimated ISE using the empirical distribution function. The global bandwidth estimator $\hat h_{CV}$ minimizing this estimated criterion achieves the bound for the convergence rate of any optimal bandwidth for the ISE, $\hat h_{CV}/h_{MISE} - 1 = O_p(n^{-1/10})$, and $\hat h_{CV} - h_{MISE}$ has a normal asymptotic distribution (Hall and Marron, 1987). The cross-validation is more variable with the data and often leads to oversmoothing or undersmoothing (Hall and Marron, 1987, Hall and Johnstone, 1992). As noticed by Hall and Marron (1987), the estimation of the density and of the mean squared error are different goals and the best bandwidth for the density might not be optimal for the MSE; hence the bandwidth minimizing the cross-validation induces variability of the density estimator. Other methods for selecting the bandwidth have been proposed, such as higher order kernel estimators of the density (Hall and Marron, 1987, 1990) or bootstrap estimations. A uniform weak convergence of the distribution of $\hat f_{n,h}$ was proved using consecutive approximations of empirical processes by Bickel and Rosenblatt (1973); other approaches for the convergence in distribution rely on the small variations of moments of the sample-paths, as in Billingsley (1968) for continuous processes. The Hellinger distance $h(\hat f_{X,n,h}, f)$ between a density and its estimator has been studied by van de Geer (1993, 2000); here the weak convergence of the process $\hat f_{X,n,h} - f_X$ provides a more precise convergence rate for $h(\hat f_{X,n,h}, f)$. All results are extended to the limiting marginal density of a continuous process under ergodicity and mixing conditions.
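A sketch of the cross-validation criterion $CV(h)$ above, assuming a Gaussian kernel so that $\int\hat f_{X,n,h}^2$ has a closed form through the convolution of two normal densities (illustrative names):

    import numpy as np

    def gauss(u, s):
        return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

    def cv_score(sample, h):
        # CV(h) = int \hat f^2 dx - (2/n) sum_i \hat f_{-i}(X_i)
        n = len(sample)
        d = sample[:, None] - sample[None, :]
        int_f2 = gauss(d, np.sqrt(2.0) * h).sum() / n**2    # closed form
        loo = (gauss(d, h).sum(axis=1) - gauss(0.0, h)) / (n - 1.0)
        return int_f2 - 2.0 * loo.mean()

    rng = np.random.default_rng(4)
    X = rng.normal(size=300)
    grid = np.linspace(0.1, 1.0, 10)
    print(grid[np.argmin([cv_score(X, h) for h in grid])])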

Uniform strong consistency of the kernel density estimator requires stronger conditions; results can be found in Silverman (1978), Singh (1979), Prakasa Rao (1983) and Härdle, Janssen and Serfling (1988) for the strong consistency and, for its conditional mode, Ould Saïd (1997). The law of the iterated logarithm has been studied by Hall (1981), Stute (1982) and Bosq (1998).

Theorem 1.1 (Stute 1982). Let $f$ be a continuous density, strictly positive and bounded on a sub-interval $[a,b]$ of its support. Let $(h_n)_n$ be a bandwidth sequence converging to zero and such that $nh_n$ tends to infinity, $\log h_n^{-1} = o(nh_n)$ and $\log h_n^{-1}/(\log\log n)$ tends to infinity. Suppose that $K$ has a compact support and $\int|dK| < \infty$; then
$$\limsup_n\Bigl\{\frac{nh_n}{2\log h_n^{-1}}\Bigr\}^{1/2}\sup_{I_h}|\hat f_{X,n,h}(x) - E\hat f_{X,n,h}(x)|\,f^{-1/2}(x) = \kappa_2^{1/2}, \quad a.s.,$$
with $I_h = [a+h, b-h]$.

A periodic density $f$ on an interval $[-T,T]$ is analyzed in the frequency domain, where it is expanded according to the amplitudes and the frequency or period of its components. Let $T = 2\pi/w$; the density $f$ is expressed as the limit of series due to Fourier (French mathematician, 1768-1830), $f(x) = \sum_{k=-\infty}^{+\infty}c_k e^{iwkx}$, with coefficients $c_k = T^{-1}\int_{-T/2}^{T/2}f(x)e^{-iwkx}\,dx$, and the Fourier transform of $f$ is defined on $\mathbb{R}$ by $\mathcal{F}f(s) = T^{-1}\int_{-T/2}^{T/2}f(x)e^{-iwsx}\,dx$. The inversion formula of the Fourier transform is $f(x) = \int_{-\infty}^{+\infty}\mathcal{F}f(s)e^{iwsx}\,ds$. For a non-periodic density, the Fourier transform and its inverse are defined by
$$\mathcal{F}f(s) = (2\pi)^{-1}\int_{-\infty}^{\infty}f(x)e^{-isx}\,dx, \qquad f(x) = \int_{-\infty}^{+\infty}\mathcal{F}f(s)e^{isx}\,ds.$$
The Fourier transform is an isometry, as expressed by the equality $\int|\mathcal{F}f(s)|^2\,ds = \int|f(s)|^2\,ds$.

Let $(X_k)_{k\le n}$ be a stationary time series with mean zero; the spectral density is defined from the autocorrelation coefficients $\gamma_k = E(X_0X_k)$ by $S(w) = \sum_{k=-\infty}^{+\infty}\gamma_k e^{-iwk}$ and the inverse relationship for the autocorrelations is $\gamma_k = \int_{-\infty}^{\infty}S(w)e^{iwk}\,dw$. The periodogram of the series is defined as $\hat I_n(w) = T^{-1}|\sum_{k=1}^n X_ke^{-2\pi ikw}|^2$ and it is smoothed to yield a regular estimator of the spectral density, $\hat S_n(s) = \int K_h(u-s)\hat I_n(u)\,du$. Brillinger (1975) established that the optimal convergence rate for the bandwidth is $h_n = O(n^{-1/5})$ under regularity conditions and he proved the weak convergence of the process $n^{2/5}(\hat S_n - S)$ to a process defined as a transformed Brownian motion. Robinson (1986, 1991) studied the consistency of kernel estimators for auto-regression and density functions and for nonparametric models of time series. Cross-validation for the choice of the bandwidth was also introduced by Wold (1975). For time series, Chiu (1991) proposed a stabilized bandwidth criterion having a relative convergence rate $n^{-1/2}$, instead of $n^{-1/10}$ for the cross-validation in density estimation. It is defined from the Fourier transform $d_Y$ of the observation series $(Y_i)_i$, using the periodogram of the series $I_Y = d_Y^2/(2\pi n)$ and the Fourier transform $W_h(\lambda)$ of the kernel. The squared sum of errors is equal to $2\pi\sum_j I_Y(\lambda_j)\{1 - W_h(\lambda_j)\}^2$, with $\lambda_j = 2\pi j$ and $W_h(\lambda) = n^{-1}\sum_{j=1}^n\exp(-i\lambda j)K_h(j/n)$.

Multivariate kernel estimators are widely used in the analysis of spatial data, in the comparison and the classification of vectors.
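As an illustration of the smoothed periodogram $\hat S_n$, a sketch with kernel weights normalized over the Fourier frequencies (the normalization is an implementation choice, not prescribed by the text):

    import numpy as np

    def smoothed_spectral_density(series, h, n_out=64):
        # periodogram at the Fourier frequencies, then kernel smoothing
        n = len(series)
        freqs = np.arange(1, n // 2 + 1) / n
        dft = np.fft.rfft(series - series.mean())[1:n // 2 + 1]
        I_n = np.abs(dft) ** 2 / (2.0 * np.pi * n)
        w = np.linspace(freqs[0], freqs[-1], n_out)
        u = (w[:, None] - freqs[None, :]) / h
        k = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)    # Epanechnikov
        return w, (k * I_n).sum(axis=1) / k.sum(axis=1)

    rng = np.random.default_rng(5)
    x = rng.normal(size=512)
    for t in range(1, 512):                              # AR(1) series
        x[t] += 0.6 * x[t - 1]
    w, S_hat = smoothed_spectral_density(x, h=0.05)
    print(S_hat[:3])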

1.2 Estimation of a regression curve

Consider a two-dimensional variable $(X,Y)$ defined on $(\Omega,\mathcal{A},P)$, with values in $\mathbb{R}^2$. Let $f_X$ and $f_{X,Y}$ be the continuous densities of $X$ and, respectively, of $(X,Y)$, and let $F_X$ and $F_{X,Y}$ be their distribution functions. In the nonparametric regression setting, the curve of interest is the relationship between two variables, $Y$ being a response variable for a predictor $X$. A continuous curve is estimated by means of a kernel estimator smoothing the observations of $Y$ for observations of $X$ in the neighborhood of the predictor value. The conditional mean of $Y$ given $X = x$ is the nonparametric regression function defined for every $x$ inside the support of $X$ by
$$m(x) = E(Y\mid X = x) = \int y\,\frac{f_{X,Y}(x,y)}{f_X(x)}\,dy;$$
it is continuous when the density $f_{X,Y}$ is continuous with respect to its first component. It defines regression models for $Y$ with fixed or varying noises according to the model for its variance
$$Y = m(X) + \sigma\varepsilon \qquad (1.6)$$
where $E(\varepsilon\mid X) = 0$ and $Var(\varepsilon\mid X) = 1$ in a model with a constant variance $Var(Y\mid X) = \sigma^2$, or
$$Y = m(X) + \sigma(X)\varepsilon \qquad (1.7)$$
with a varying conditional variance $Var(Y\mid X) = \sigma^2(X)$. The regression function of (1.6) is estimated by the integral of $Y$ with respect to a smoothed empirical distribution function of $Y$ given $X = x$
$$\hat m_{n,h}(x) = \int y\,\frac{K_h(x-s)\,\hat F_{X,Y,n}(ds,dy)}{\hat f_{X,n,h}(x)} = \frac{\sum_{i=1}^n Y_iK_h(x - X_i)}{\sum_{i=1}^n K_h(x - X_i)}.$$
This estimator has been introduced by Watson (1964) and Nadaraya (1964), and detailed presentations can be found in the monographs by Eubank (1977), Nadaraya (1989) and Härdle (1990). The performance of the kernel estimator for the regression curve $m$ is measured by error criteria corresponding to the estimation of the curve at a single point or over its whole support, as for the kernel estimator of a continuous density.
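A minimal sketch of the Watson-Nadaraya estimator just defined (kernel and names are illustrative):

    import numpy as np

    def nadaraya_watson(x_obs, y_obs, x, h):
        # \hat m_{n,h}(x) = sum_i Y_i K_h(x - X_i) / sum_i K_h(x - X_i)
        u = (x[:, None] - x_obs[None, :]) / h
        k = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)    # Epanechnikov
        return (k * y_obs[None, :]).sum(axis=1) / k.sum(axis=1)

    rng = np.random.default_rng(6)
    X = rng.uniform(-2.0, 2.0, 300)
    Y = np.sin(X) + 0.3 * rng.normal(size=300)           # model (1.6)
    print(nadaraya_watson(X, Y, np.linspace(-1.5, 1.5, 7), h=0.3))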
A global random measure of the distance between the estimator $\hat m_{n,h}$ and the regression function $m$ is the integrated squared error (ISE)
$$ISE(\hat m_{n,h}; h) = \int\{\hat m_{n,h}(x) - m(x)\}^2\,dx, \qquad (1.8)$$
whose convergence was studied by Hall (1984) and Härdle (1990). The mean squared error criterion develops as the sum of the variance and the squared bias of the estimator
$$MSE(\hat m_{n,h}; x, h) = E\{\hat m_{n,h}(x) - m(x)\}^2 = E\{\hat m_{n,h}(x) - E\hat m_{n,h}(x)\}^2 + \{E\hat m_{n,h}(x) - m(x)\}^2.$$
A global mean squared error is the mean integrated squared error
$$MISE(\hat m_{n,h}; h) = E\{ISE(\hat m_{n,h}; h)\} = \int MSE(\hat m_{n,h}; x, h)\,dx. \qquad (1.9)$$
Assuming that the curve is twice continuously differentiable, the mean squared error is approximated by the asymptotic MSE (Chapter 3)
$$AMSE(\hat m_{n,h}; x) = (nh)^{-1}\kappa_2 f_X^{-1}(x)\,Var(Y\mid X = x) + \frac{h^4}{4}m_{2K}^2\,f_X^{-2}(x)\{\mu^{(2)}(x) - m(x)f_X^{(2)}(x)\}^2,$$
with $\mu = mf_X$. The AMSE is minimized at a value $h_{m,AMSE}$ which is still of order $n^{-1/5}$ and depends on the values at $x$ of the functions defining the model and of their second order derivatives. Automatic optimal bandwidth selection by cross-validation was developed by Härdle, Hall and Marron (1988), similarly to the density case. Bootstrap methods were also widely studied. Splines were generalized to nonparametric regression by Wahba and Wold (1975) and, for cubic splines, Silverman (1985); the automatic choice of the degree of smoothing is also determined by cross-validation.
In model (1.7) with a random conditional variance $Var(Y\mid X) = \sigma^2(X)$, the estimator of the regression curve $m$ has to be modified and it is defined as a weighted kernel estimator with weighting function $w(x) = \sigma^{-1}(x)$
$$\hat m_{w,n,h}(x) = \frac{\sum_{i=1}^n w(X_i)Y_iK_h(x - X_i)}{\sum_{i=1}^n w(X_i)K_h(x - X_i)},$$
or with a more general function $w$. In Chapter 3, a kernel estimator of $\sigma^{-1}(x)$ is introduced. The bias and variance of the estimator $\hat m_{w,n,h}$ are developed by the same expansions as for the estimator (1.8). The convergence rate of the kernel estimator for $\sigma^2(x)$ is nonparametric and its bias depends on the bandwidths used in its definition, on $Var\{(Y - m(x))^2\mid X = x\}$, and on the functions $f_X$, $\sigma^2$, $m$ and their derivatives.

Results about the almost sure convergence and the $L^2$-errors of kernel estimators, their optimal convergence rates and the optimal bandwidth selection were introduced in Hall (1984) and Nadaraya (1964). Properties similar to those of the density are developed here with sequences of bandwidths converging at specified rates. The methods for estimating a density and a regression curve by means of kernel smoothing have been extensively presented in monographs by Nadaraya (1989), Härdle (1990, 1992), Wand and Jones (1995), Simonoff (1996), and Bowman and Azzalini (1997), among others. In this book, the properties of the estimators are extended with exact expansions, as for density estimation, and to variable bandwidth sequences $(h_n(x))_{n\ge 1}$ converging at a specified rate.

Several monotone kernel estimators for a regression function $m$ have been considered; they are built by kernel smoothing after an isotonisation of the data sample, or by an isotonisation of the classical kernel estimator. The isotonisation of the data consists in a transformation of the observations $(Y_i)_i$ into a monotone set $(Y_i^*)_i$. It is defined by
$$Y_i^* = \min_{v\ge i}\max_{u\le i}\frac{1}{v-u}\sum_{j=u}^v Y_j,$$
and $\sum_{i\le k}Y_i^*$ is the greatest convex minorant of $\sum_{i\le k}Y_i$. The kernel estimator for the regression function built with the isotonic sample $(X_i^*, Y_i^*)_i$ is denoted $\hat m_{IS,n,h}$. The convergence rate of the isotonic estimator for a monotone regression function is $n^{1/3}$ and the variable $n^{1/3}(\hat m_{IS,n,h} - m_{IS,n,h})(x)$ converges to a Gaussian process for every $x$ in $I_X$. The isotonisation of the kernel estimator $\hat m_{n,h}$ for a regression function is
$$\hat m_{SI,n,h}(x) = \inf_{v\ge x}\sup_{u\le x}\frac{1}{v-u}\int_u^v \hat m_{n,h}(t)\,dt \qquad (1.10)$$
and $\int 1_{\{t\le x\}}\hat m_{SI,n,h}(t)\,dt$ is the greatest convex minorant of the process $\int 1_{\{t\le x\}}\hat m_{n,h}(t)\,dt$. Its convergence rate is again $n^{1/3}$ (van der Vaart and van der Laan, 2003). Meyer and Woodroofe (2000) generalized the constraints to larger classes and proved that the variance of the maximum likelihood estimator of a monotone regression function attains the optimal convergence rate $n^{1/3}$.
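The isotonised data $(Y_i^*)_i$ can be computed by the standard pool-adjacent-violators algorithm; a minimal sketch for a nondecreasing fit (illustrative):

    import numpy as np

    def pava(y):
        # pool-adjacent-violators: the partial sums of the output are the
        # greatest convex minorant of the partial sums of y
        merged = []                                      # [block mean, size]
        for v in y:
            merged.append([v, 1])
            while len(merged) > 1 and merged[-2][0] >= merged[-1][0]:
                m2, n2 = merged.pop()
                m1, n1 = merged.pop()
                merged.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
        return np.concatenate([[m] * n for m, n in merged])

    print(pava(np.array([1.0, 3.0, 2.0, 4.0, 3.5, 5.0])))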

In the regression models (1.6) or (1.7) with a multidimensional regression vector $X$, a multidimensional regression function $m(X)$ can be replaced by a semi-parametric single-index model $m(x) = g(\theta^Tx)$, where $\theta^T$ denotes the transpose of a vector $\theta$, or by a more general transformation model $g\circ\varphi_\theta(X)$ with unknown function $g$ and parameter $\theta$. In the single-index model, several estimators for the regression function $m(x)$ have been defined (Ichimura, 1993, Härdle, Hall and Ichimura, 1993, Hristache, Juditsky and Spokoiny, 2001, Delecroix, Härdle and Hristache, 2003); the estimators of the function $g$ and the parameter $\theta$ are iteratively calculated from approximations.

The inverse of the distribution function $F_X$ of a variable $X$, or quantile function, is defined on $[0,1]$ by
$$Q(t) = F_X^{-1}(t) = \inf\{x\in I_X : F_X(x)\ge t\};$$
it is left-continuous with right-hand limits, whereas the distribution function is right-continuous with left-hand limits. For every uniform variable $U$, $F_X^{-1}(U)$ has the distribution function $F_X$ and, if $F$ is continuous, then $F(X)$ has a uniform distribution. The inverse of the distribution function satisfies $F_X\circ F_X^{-1}(x) = x$ for every $x$ in the support of $X$, and $F_X^{-1}\circ F_X = id$ at every continuity point $x$ of $F_X$. The weak convergence of the empirical uniform process and of its functionals has been widely studied (Shorack and Wellner, 1986, van der Vaart and Wellner, 1996). For a differentiable functional $\psi(F_X)$, $n^{1/2}\{\psi(\hat F_{X,n}) - \psi(F_X)\}$ converges weakly to $(\psi'B)\circ F_X$, where $B$ is a Brownian bridge, the limiting distribution of the empirical process $n^{1/2}(\hat F_{X,n} - F_X)$. It follows that the process $n^{1/2}(\hat F_{X,n}^{-1} - F_X^{-1})$ converges weakly to $(B\circ F_X^{-1})(f_X\circ F_X^{-1})^{-1}$. Kiefer (1972) established a law of iterated logarithms for quantiles of probabilities tending to zero; the same result holds for $1-p_n$ as $p_n$ tends to one.

Theorem 1.2 (Kiefer 1972). Let $p_n$ tend to zero with $np_n$ and $h_n$ tending to infinity, and let $\delta = 1$ or $-1$; then
$$\limsup_n\,\delta\,\frac{Q_n(p_n) - np_n}{\{2np_n\log\log n\}^{1/2}} = 1, \quad a.s.$$

The results were extended to conditional distribution functions, and Sheather and Marron (1990) considered kernel quantile estimators. The inverse function of a nonparametric regression curve determines thresholds for $X$ given $Y$ values; it is related to the distribution function of $Y$ conditionally on $X$. The inverse empirical process for a monotone nonparametric regression function has been studied in Pinçon and Pons (2006) and Pons (2008); the main results are presented and generalized in Chapter 5. The behaviour of the threshold estimators $\hat Q_{X,n,h}$ and $\hat Q_{Y,n,h}$ of the conditional distribution is studied, with their bias and variance and the mean squared errors which determine the optimal bandwidths specific to the quantile processes.

The Bahadur representation for the quantile estimators is the expansion
$$\hat F_{X,n}^{-1}(t) = F_X^{-1}(t) + \frac{t - \hat F_{X,n}}{f_X}\circ F_X^{-1}(t) + R_n(t), \quad t\in[0,1],$$
where the main term is a sum of independent and identically distributed random variables and the remainder term $R_n(t)$ is an $o_p(n^{-1/2})$ (Ghosh, 1971); Bahadur (1966) studied its a.s. convergence. Lo and Singh (1986) and Gijbels and Veraverbeke (1988, 1989) extended this approach by differentiation to the Kaplan-Meier estimator of the distribution function of independent and identically distributed right-censored variables.
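The generalized inverse $Q(t) = \inf\{x : F(x)\ge t\}$ defined above has a direct empirical counterpart, $Q_n(t) = X_{(\lceil nt\rceil)}$; a sketch (illustrative):

    import numpy as np

    def empirical_quantile(sample, t):
        # Q_n(t) = inf{x : F_n(x) >= t} = the ceil(nt)-th order statistic
        xs = np.sort(sample)
        idx = np.ceil(t * len(xs)).astype(int) - 1
        return xs[np.clip(idx, 0, len(xs) - 1)]

    rng = np.random.default_rng(7)
    X = rng.exponential(size=1000)
    print(empirical_quantile(X, np.array([0.25, 0.5, 0.9])))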

1.3 Estimation of functionals of processes

Watson and Leadbetter (1964) introduced smooth estimators for the hazard function of a point process. The functional intensity $\lambda(t)$ of an inhomogeneous Poisson point process $N$ is defined by
$$\lambda(t) = \lim_{\delta\to 0}\delta^{-1}P\{N(t+\delta) - N(t^-) = 1\mid N(t^-)\};$$
it is estimated by kernel smoothing, from the sample-path of the point process observed on an interval $[0,T]$. Let $Y(t) = N(T) - N(t)$; then
$$\hat\lambda_h(t) = \int K_h(t-s)1_{\{Y(s)>0\}}Y^{-1}(s)\,dN(s).$$

For a sample of a time variable $T$ with distribution function $F$, let $\bar F$ be the survival function of the variable $T$, $\bar F = 1 - F^-$; the hazard function $\lambda$ is now defined as $\lambda(t) = f(t)/\bar F(t)$. The probability of excess is
$$P_t(t+x) = \Pr(T > t+x\mid T > t) = 1 - \frac{F(t+x) - F(t)}{\bar F(t)} = \exp\Bigl\{-\int_t^{t+x}\lambda(s)\,ds\Bigr\}.$$
The product-limit estimator has been defined for the estimation of the distribution function of a time variable under an independent right-censorship by Kaplan and Meier (1958). Breslow and Crowley (1974) studied the asymptotic behaviour of the process $B_n = n^{1/2}(\hat F_n - F)$; they proved its weak convergence to a Gaussian process $B$ with independent increments, mean zero and a finite variance on every compact sub-interval of $[0, T_{n:n}]$, where $T_{n:n} = \max_{i\le n}T_i$. The weak convergence of $n^{1/2}(\hat{\bar F}_n - \bar F)$ has been extended by Gill (1983) to the interval $[0,T_{n:n}]$, using its expression as a martingale up to the stopping time $T_{n:n}$. Let $\tau_F = \sup\{t; F(t) < 1\}$; for $t < \tau_F$ and if $\int_0^{\tau_F}\bar F^{-1}\,d\Lambda < \infty$, we have
$$\hat\Lambda_n(t) = \int_0^{t\wedge T_{n:n}}\frac{d\hat F_n(s)}{1 - \hat F_n^-(s)},$$
$$\hat F_n(t) = \int_0^{t\wedge T_{n:n}}\hat{\bar F}_n(s^-)\,d\hat\Lambda_n(s),$$
$$\frac{F - \hat F_n}{1-F}(t) = \int_0^{t\wedge T_{n:n}}\frac{1 - \hat F_n(s^-)}{1 - F(s)}\{d\hat\Lambda_n(s) - d\Lambda(s)\};$$
as a consequence, the process $n^{1/2}(F - \hat F_n)\bar F^{-1}$ converges weakly on $[0,\tau_F[$ to a centered Gaussian process $B_F$, with independent increments and variance $v_{\bar F}(t) = \int_0^t\{(1-F)^{-1}\bar F\}^2\,dv_\Lambda$, where $v_\Lambda$ is the asymptotic variance of the process $n^{1/2}(\hat\Lambda_n - \Lambda)$.

The definition of the intensity is generalized to point processes having a random intensity. For a multiplicative intensity $\lambda Y$, with a predictable process $Y$, the hazard function $\lambda$ is estimated by
$$\hat\lambda_h(t) = \int K_h(t-s)Y^{-1}(s)1_{\{Y(s)>0\}}\,dN(s).$$
For a random time sample $(T_i)_{i\le n}$, $N(t) = \sum_{i=1}^n 1_{\{T_i\le t\}}$ and the process $Y$ is $Y(t) = \sum_{i=1}^n 1_{\{T_i\ge t\}}$. Under a right-censorship of a time variable $T$ by an independent variable $C$, only $T\wedge C$ and the indicator $\delta$ of the event $\{T\le C\}$ are observed. Let $X = T\wedge C$; the counting processes for an $n$-sample of $(X,\delta)$ are $N(t) = \sum_i 1_{\{T_i\le t\wedge C_i\}}$ and $Y(t) = \sum_i 1_{\{X_i\ge t\}}$.
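A sketch of the kernel hazard estimator $\hat\lambda_h$ for such a right-censored sample, smoothing the jumps $Y^{-1}\,dN$ (illustrative names; ties are ignored in the at-risk count):

    import numpy as np

    def kernel_hazard(x_obs, delta, t, h):
        # \hat lambda_h(t) = int K_h(t - s) 1{Y(s) > 0} Y^{-1}(s) dN(s)
        order = np.argsort(x_obs)
        x_obs, delta = x_obs[order], delta[order].astype(bool)
        at_risk = len(x_obs) - np.arange(len(x_obs))    # Y at each X_(i)
        tj, yj = x_obs[delta], at_risk[delta]           # jump times of N
        u = (t[:, None] - tj[None, :]) / h
        k = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0) / h
        return (k / yj[None, :]).sum(axis=1)

    rng = np.random.default_rng(8)
    T = rng.exponential(1.0, 200)                       # true lambda = 1
    C = rng.exponential(2.0, 200)                       # independent censoring
    X, d = np.minimum(T, C), (T <= C).astype(int)
    print(kernel_hazard(X, d, np.array([0.5, 1.0, 1.5]), h=0.4))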
Martingale techniques are used to expand the estimation errors, providing optimal convergence rates according to the regularity conditions for the hazard function (Pons, 1986), and the weak convergences with fixed or variable bandwidths (Chapter 6). Regression models for the intensity are classical; they generally have the form $\lambda(t;\beta) = \lambda(t)r_\beta(Z(t))$, with a regressor process $(Z(t))_{t\ge 0}$ and a parametric regression function such as $r_\beta(Z(t)) = r(\beta^TZ(t))$, with an exponential function $r$ in the Cox model (1972). The classical estimators of the Cox model rely on the estimation of the cumulated hazard function $\Lambda(t) = \int_0^t\lambda(s)\,ds$ by the stepwise process $\hat\Lambda_n(t;\beta)$ at fixed $\beta$, and the parameter $\beta$ of the exponential regression function $r_Z(t;\beta) = e^{\beta^TZ(t)}$ is estimated by maximization of an expression similar to the likelihood, where $\lambda$ is replaced by the jump of $\hat\Lambda_n(\beta)$ at $T_i$ (Cox, 1972).
The asymptotic properties of the estimators for the cumulated hazard function and the parameters of the Cox model were established by Andersen and Gill (1982), among others. The estimators presented in this chapter are obtained by minimization of partial likelihoods based on kernel estimators of the baseline hazard function $\lambda$ defined for each model and on histogram estimators. In the multiplicative intensity model, the kernel estimator of $\lambda$ satisfies the same minimax property as the kernel estimator of a density (Pons, 1986), and this property is still satisfied in the multiplicative regression models of the intensity. Pons and Turckheim (1987) proved the asymptotic equivalence of the estimators of an exponential regression model based on the estimated cumulative intensity and on a histogram estimator. The comparison is extended to the new estimators defined from kernel estimators of hazard functions in this book.

For a spatial stationary process $N$ on $\mathbb{R}^d$, the $k$-th moment measures, defined for $k\ge 2$ and for every continuous and bounded function $g$ on $(\mathbb{R}^d)^k$ by
$$\nu_k(g) = E\int_{(\mathbb{R}^d)^k} g(x_1,\ldots,x_k)\,N(dx_1)\cdots N(dx_k),$$
have been intensively studied and they are estimated by empirical moments from observations on a subset $G$ of $\mathbb{R}^d$. The centered moments are immediately obtained from the mean measure $m$ and $\mu_k = \sum_{i=1}^k(-1)^i\binom{k}{i}m^i\nu_{k-i}$. The stationarity of the process implies that the $k$-th moment of $N$ is expressed as the expectation of an integral of a translation of its $(k-1)$-th moment
$$\nu_k(g) = E\int_{(\mathbb{R}^d)^k} g(x_1-x_k,\ldots,x_{k-1}-x_k,0)\,N(dx_1)\cdots N(dx_k),$$
which develops in the form
$$\nu_k(g) = E\int_{\mathbb{R}^d}\Bigl\{\int_{(\mathbb{R}^d)^{k-1}} g_{k-1}\circ T_x(x_1,\ldots,x_{k-1})\,N(dx_1)\cdots N(dx_{k-1})\Bigr\}N(dx) = E\int_{\mathbb{R}^d}\nu_{k-1}(g_{k-1}\circ T_x)\,N(dx),$$
where $g_{k-1}(x_1,\ldots,x_{k-1}) = g(x_1,\ldots,x_{k-1},0)$. Let $\lambda$ be the Lebesgue measure on $\mathbb{R}^d$ and let $T_x$ be the translation by $x$ in $\mathbb{R}^d$; then the moment estimators are built iteratively by the relationship
$$\hat\nu_{k,G}(g) = \{\lambda(G)\}^{-1}\int_G\hat\nu_{k-1,G^{k-1}}(g_{k-1}\circ T_y)\,dN(y).$$
The estimator is consistent and its convergence rate is $\{\lambda(G)\}^{k/2}$. The stationarity of the process and a mixing condition imply that, for every function $g$ of $C_b((\mathbb{R}^d)^k)$, the variable $\{\lambda(G)\}^{k/2}\{\hat\nu_{k,G}(g) - \nu_k(g)\}$ converges weakly to a normal variable with variance $\nu_{2k}(g)$. The densities of the $k$-th moment measures are defined as the derivatives of $\nu_k$ with respect to the Lebesgue measure on $\mathbb{R}^d$ and they are estimated by smoothing the empirical estimator $\hat\nu_{k,G}$ using a kernel $K_h$ on $\mathbb{R}^d$ and, by iterations, on $\mathbb{R}^{kd}$. The convergence rate of the kernel estimator is then $h^{kd/2}\{\lambda(G)\}^{k/2}$, as a consequence of the $kd$-dimensional smoothing.

Consider a diffusion model with nonparametric drift function $\alpha$ and variance function, or diffusion, $\beta$
$$dX_t = \alpha(X_t)\,dt + \beta(X_t)\,dB_t, \quad t\ge 0, \qquad (1.11)$$
where $B$ is the standard Brownian motion. The drift and variance are expressed as limits of variations of $X$
$$\alpha(X_t) = \lim_{h\to 0}h^{-1}E\{(X_{t+h} - X_t)\mid X_t\}, \qquad \beta(X_t) = \lim_{h\to 0}h^{-1}E\{(X_{t+h} - X_t)^2\mid X_t\}.$$
The process $X$ can be approximated by nonparametric regression models with regular or variable discrete sampling schemes of the sample-path of the process $X$. The diffusion equation uniquely defines a continuous process $(X_t)_{t>0}$. Assuming that $E\exp\{-\frac{1}{2}\int_0^t\beta^2(B_s)\,ds\}$ is finite, the Girsanov theorem formulates the density of the process $X$. Parametric diffusion models have been much studied and estimators of the parameters are defined by maximum likelihood from observations at regularly spaced discretization points or at random stopping times. In a discretization scheme with a constant interval of length $\Delta_n$ between observations, nonparametric estimators are defined as with samples of variables in nonparametric regression models (Pons, 2008). Let $(X_{t_i}, Y_i)_{i\le n}$ be discrete observations with $Y_i = X_{t_{i+1}} - X_{t_i}$ defined by equation (1.11); the functions $\alpha$ and $\beta^2$ are estimated by
$$\hat\alpha_n(x) = \frac{\sum_{i=1}^n Y_iK_{h_n}(x - X_{t_i})}{\Delta_n\sum_{i=1}^n K_{h_n}(x - X_{t_i})}, \qquad \hat\beta_n^2(x) = \frac{\sum_{i=1}^n Z_i^2K_{h_n}(x - X_{t_i})}{\Delta_n\sum_{i=1}^n K_{h_n}(x - X_{t_i})},$$
where $Z_i = Y_i - \Delta_n\hat\alpha_n(X_{t_i})$ is the variable of the centered variations for the diffusion process. The variance of the variable $Y_i$ conditionally on $X_{t_i}$ varies with $X_{t_i}$ and weighted estimators are also defined here. Varying sampling intervals or random sampling schemes modify the estimators. Functional models of diffusions with discontinuities were also considered in Pons (2008), where the jump size was assumed to be a squared integrable function of the process $X$ and a nonparametric estimator of this function was defined. Here the estimators of the discretized process are compared to those built with the continuously observed diffusion process $X$ defined by (1.11), on an increasing time interval $[0,T]$. The kernel bandwidth $h_T$ tends to zero as $T$ tends to infinity, with the same rate as $h_n$. In Chapter 8, the MISE of each estimator and its optimal bandwidth are determined. The estimators are compared with those defined for the continuously observed diffusion processes.
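A sketch of the discretized estimators $\hat\alpha_n$ and $\hat\beta_n^2$ above on a simulated Ornstein-Uhlenbeck path; the simulated model and all names are illustrative, and $\hat\alpha_n(X_{t_i})$ is approximated by interpolation on the evaluation grid:

    import numpy as np

    def drift_diffusion_estimates(path, dt, x, h):
        # \hat alpha_n and \hat beta_n^2 from discrete observations of X
        X, Y = path[:-1], np.diff(path)
        u = (x[:, None] - X[None, :]) / h
        k = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)
        denom = dt * k.sum(axis=1)
        alpha = (k * Y[None, :]).sum(axis=1) / denom
        Z = Y - dt * np.interp(X, x, alpha)              # centered variations
        beta2 = (k * Z[None, :] ** 2).sum(axis=1) / denom
        return alpha, beta2

    rng = np.random.default_rng(9)
    dt, n = 0.01, 20000
    X = np.zeros(n)
    for i in range(n - 1):                               # dX = -X dt + 0.5 dB
        X[i + 1] = X[i] - X[i] * dt + 0.5 * np.sqrt(dt) * rng.normal()
    print(drift_diffusion_estimates(X, dt, np.linspace(-0.5, 0.5, 5), h=0.1))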

Nonparametric time and space transformations of a Gaussian process have first been studied by Perrin (1999) and Guyon and Perrin (2000), who estimated the function $\Phi$ of non-stationary processes $Z = X\circ\Phi$, with $X$ a stationary Gaussian process and $\Phi$ a monotone, continuously differentiable function defined on $[0,1]$ or on $[0,1]^3$. The covariance of the process $Z$ is $r(x,y) = R(\Phi(x) - \Phi(y))$, where $R$ is the stationary covariance of $X$, which implies $R(u) = r(0,\Phi^{-1}(u))$ and $R(-u) = R(u)$, with a singularity at zero. The singularity function of $Z$ is the difference $\xi(x)$ of the left and right derivatives of $r(x,x)$, which implies $\Phi(x) = v^{-1}(1)v(x)$, where $v(x) = \int_0^x\xi(u)\,du$. The estimators, based on the covariances of the process $Z$, are built with its quadratic variations. For the time transformation, the estimator of $\Phi(x)$ is defined by linearisation of $V_n(x) = \sum_{k=1}^{[nx]}(\Delta Z_k)^2$, where the variables $\Delta Z_k = Z(n^{-1}k) - Z(n^{-1}(k-1))$ are centered and independent,
$$v_n(x) = V_n(x) + (nx - [nx])(\Delta Z_{[nx]+1})^2, \quad x\in[0,1[, \qquad v_n(1) = V_n(1), \qquad \hat\Phi_n(x) = v_n^{-1}(1)v_n(x); \qquad (1.12)$$
the process $\hat\Phi_n - \Phi$ is uniformly consistent and $n^{1/2}(\hat\Phi_n - \Phi)$ is asymptotically Gaussian. The method was extended to $[0,1]^3$. The diffusion processes cannot be reduced to the same model, but the method for estimating their variance function relies on similar properties of Gaussian processes.

In time series analysis, the models are usually defined by scalar parameters, and a wide range of parametric models for stationary series have been intensively studied for many years. Nonparametric spectral densities of the parametric models have been estimated by smoothing the periodogram calculated from $T$ discrete observations of stationary and mixing series (Wold, 1975, Brillinger, 1981, Robinson, 1986, Herrmann, Gasser and Kneip, 1992, Pons, 2008). The spectral density is supposed to be twice continuously differentiable and the bias, variance and moments of its kernel estimator have been expanded like those of a probability density. It converges weakly at the rate $T^{2/5}$ to a Gaussian process, as a consequence of the weak convergence of the empirical periodogram.

1.4 Content of the book

In each chapter, the classical estimators for samples of independent and identically distributed variables are presented, with approximations of their bias, variance and $L^p$-moments as the sample size $n$ tends to infinity and the bandwidth to zero. In each model, the weak convergence of the whole process is considered, and the limiting distributions are not centered for the optimal bandwidth minimizing the mean integrated squared error.

Chapters 2 and 3 focus on the density and the regression models, respectively. In models with a constant variance, the regression estimator defined as a ratio of kernel estimators is approximated by a weighted sum of two kernel estimators, and its properties are easily deduced. In models with a functional variance, a kernel estimator of the variance is also considered and the estimator of the regression function is modified by an empirical weight. The properties of the modified estimator are detailed. The estimators for independent and identically distributed variables are extended to a stationary continuous process $(X_t)_{t\ge 0}$, continuously observed on an increasing time interval, for the estimation of the ergodic density of the process. The observations at times $s$ and $t$ are dependent, so the methods for independent observations do not adapt immediately. The estimators are defined under the conditions necessary for their convergence and their approximation properties are proved. The optimal bandwidths minimizing the mean squared errors are functional sequences of bandwidths, and for this reason the properties of the kernel estimators are extended to varying bandwidths in Chapter 4.

The estimators of the derivatives of the density, of the regression function and of the functions in the other models are expressed by means of the derivatives of the kernel, so that their convergence rate is modified, the $k$-th derivative of $K_h$ being normalized by $h^{-(k+1)}$ instead of $h^{-1}$ for $K_h$. Functionals of the densities and of the functions in the other models are considered; the asymptotic properties of their estimators are deduced from those of the kernel estimators.

The inverse function defined for an increasing distribution function $F$ is generalized in Chapter 5 to conditional distribution functions and to monotone regression functions. The bias, variance, norms, optimal bandwidths and weak convergences of the quantiles of their kernel estimators are established with detailed proofs. Exact Bahadur-type representations are written, with $L^2$ approximations.

Chapter 6 provides new kernel estimators in nonparametric models for real point processes, which generalize the martingale estimators of the baseline hazard functions already studied. They are compared to new histogram-type estimators built for these functional models. The probability density of the excess duration for a point process and its estimator are defined, and the properties of the estimator are also studied.

The single-index models are nonparametric regression models for linear combinations of the regression variables. The estimators of the parameter vector $\theta$ and of the nonparametric regression function $g$ of the model are proved to be consistent for independent and identically distributed variables. New estimators of $g$ and $\theta$ are considered in Chapter 7, with direct estimation methods and without numerical iteration procedures. The convergence rate of the estimator $\hat\theta_{n,h}$, obtained by minimizing the empirical mean squared estimation error $\hat V_n$, is $(nh^3)^{1/2}$. The estimator $\hat m_{n,h}$ built with this estimator of $\theta$ has the same convergence rate, which is not as small as that of the nonparametric regression estimator with a $d$-dimensional regression variable. A differential empirical squared error criterion provides an estimator of the parameter which converges more quickly, and the estimator of the regression function $m$ has the usual nonparametric convergence rate $(nh)^{1/2}$. More generally, the linear combination of the regressors can be replaced by a parametric change of variable, in a regression model $Y = g\circ\varphi_\theta(X) + \varepsilon$. Replacing the function $g$ by a kernel estimator at fixed $\theta$, the parameter is then estimated by minimizing an empirical version of the error $V(\theta) = \{Y - \hat g_{n,h}\circ\varphi_\theta(X)\}^2$. Its asymptotic properties are similar to those of the single-index model estimators. The optimal bandwidths are made precise.

The estimators of the drift and variance of continuous diffusion processes depend on the sampling scheme of their discretization, and they are compared to the estimators built from the whole sample-path of the diffusion process. New results are presented in Chapter 8 and they are extended to the sum of a diffusion process and a jump process governed by the diffusion. For nonstationary Gaussian models, a kernel estimator is defined for the singularity function of the covariance of the process.

In Chapter 9, classical estimators of covariances and nonparametric re-


gression functions used for stationary time series are generalized to non-
stationary models. The expansions of the bias, variance and Lp -errors are
detailed and optimal bandwidths are defined. Nonparametric estimators are
defined for the stationarization of time series and for their mean function
in auto-regressive models, based on the results of the previous chapters.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Chapter 2

Kernel estimator of a density

2.1 Introduction

Let f be the continuous probability density of a real variable X defined on


a probability space (Ω, A, P ) and F be its distribution functions. Let IX
be the finite or infinite support of the density function f of X with respect
to the Lebesgue measure and IX,h = {s ∈ IX ; [s − h, s + h] ∈ IX }. For a
sample (Xi )1≤i≤n distributed as X and a kernel K, estimators of F and f
are defined on Ω × R as the empirical distribution function
n
X
FbX,n (x) = n−1 1{Xi ≤x} , x ∈ IX
i=1

and the kernel estimator is defined for every x in IX,h as


Z n
b 1X
fX,n,h (x) = Kh (x − s) dFbX,n (s) = Kh (x − Xi ),
n i=1

where Kh (x) = h−1 K(h−1 x) and h = hn tends to zero as n tends to infinity


and 1A is the indicator of a set A. The empirical probability measure is
Pn
PbX,n,h (A) = n−1 i=1 δXi (A), with δXi (A) = 1{Xi ∈A} . Let
Z
b
fn,h (x) = E fn,h (x) = Kh (x − s) dF (s),

the bias of the kernel estimator fbn,h (x) is


Z
bn,h (x) = fn,h (x) − f (x) = K(t){f (x + ht) − f (x)} dt. (2.1)

The Lp -risk of the kernel estimator of the density f of X is its Lp -norm


kfbn,h (x) − f (x)kp = {E|fbn,h (x) − f (x)|p }1/p (2.2)

23
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

24 Functional estimation for density, regression models and processes

and it is bounded by the sum of a p-moment and a bias term. For every
x in IX,h , the pointwise and uniform convergence of the kernel estimator
fbn,h are established under the following conditions about the kernel and
the density.

Condition 2.1.

(1) K is a symmetric density such that |x|2 K(x) → 0 as |x| tends to infinity
or K has a compact support with value zero on its frontier;
(2) The density function f belongs to the class C2 (IX ) of twice continuously
differentiable functions defined in IX .
(3) The kernel
R function satisfies R integrability conditions: Rthe moments
m2K = u2 K(u)du, κα = K α (u)du, for α ≥ 0, and |K 0 (u)|α du,
for α = 1, 2, are finite. As n → ∞, hn → 0 nhn → ∞.
(4) nh5n converges to a finite limit γ.

The next conditions are stronger than Conditions 2.1 (2)-(4), with higher
degrees of differentiability and integrability.

Condition 2.2.

(1) The density function f is Cs (IX ), with a continuous and bounded


derivative of order s, f (s) , on IX .
(2) As n → ∞, nhn → 0 and nh2s+1 n Rconverges to a finite limit γ > 0. The
j
kernel
R function satifies mjK = u K(u)du = 0 for j < s, msK and
|K 0 (u)|α du are finite for α ≤ s.

The conditions may be strengthened to allow a faster rate of convergence


of the bandwidth to zero by replacing the strictly positive limit of nh2s+1
n
by nh2s+1
n = o(1). That question appears crucial in the relative impor-
tance between the bias and the variance in the L2 -risk of fbn,h − f . The
choice of the optimal bandwidth minimizing that risk corresponds to an
equal rate for the squared bias and the variance and implies the rates of
Condition 2.1(4) or 2.2(2) according to the derivability of the density. Con-
sidering the normalized estimator, the reduction of the bias requires a faster
convergence rate.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 25

2.2 Risks and optimal bandwidths for the kernel estimator

Proposition 2.1. Under Conditions 2.1-1 for a continuous density f , the


estimator fbn,h (x) converges in probability to f (x), for every x in IX,h .
Moreover, supx |fbn,h (x) − f (x)| tends a.s. to infinity as n tends to infinity
if and only if f is uniformly continuous.
Proof. The first assertion is a consequence
Z of an integration by parts
1 x−y
sup |fbn,h (x) − fn,h (x)| ≤ sup |Fbn,h (y) − F (y)| |dK( )|
x∈IX,h x∈IX,h h h
Z
1
≤ sup |Fbn,h (y) − F (y)| |dK|.
h y
The Dvoretzky, Kiefer and Wolfowitz (1956) exponential bound implies
that for every λ > 0, Pr(supIX n1/2 |Fbn,h − F | > λ) ≤ 58 exp{−2λ2 }, then
Z
b b
Pr(sup |fn,h − fn,h | > ε) ≤ Pr(sup |Fn,h − F | > ( |dK|)−1 hn ε)
IX,h IX

≤ 58 exp{−αnh2n }
P∞
with α > 0, and n=1 exp{−nα h2n } tends to zero under Condition 2.1 or
2.2. 
Proposition 2.2. Assume hn → 0 and nhn → ∞,
(a) under Conditions 2.1, the bias of fbn,h (x) is
h2
bn,h (x) = m2K f (2) (x) + o(h2 ),
2
denoted h2 bf (x) + o(h2 ), its variance is
V ar{fbn,h (x)} = (nh)−1 κ2 f (x) + o((nh)−1 ),
also denoted (nh)−1 σf2 (x) + o((nh)−1 ), where all approximations are uni-
form. Let K have the compact support [−1, 1], the covariance of fbn,h (x)
and fbn,h (y) is zero if |x − y| > 2h, otherwise it is approximated by
Z
(nh)−1  
{f (x) + f (y)}δx,y K (v − αh K v + αh dv
2
where αh = |x − y|/(2h) and δx,y is the indicator of {x = y}.
(b) Under Conditions 2.2, for every s ≥ 2, the bias of fbn,h (x) is
hs
bn,h (x; s) = msK f (s) (x) + o(hs ),
s!
and
kfbn,h (x) − fn,h (x)kp = 0((nh)−1/p ),
for every p ≥ 2, where the approximations are uniform.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

26 Functional estimation for density, regression models and processes

Proof. The bias as h tends to zero is obtained from a second order ex-
pansion of f (x + ht) under Condition 2.1, and from its s-order expansion
under Condition 2.2. The variance of fbn,h (x) is
Z
b
V ar{fn,h (x)} = n { Kh2 (x − s)f (s) ds − fn,h
−1 2
(x)}.
R
The first term of the sum is n−1 Kh2 (x − u)f (u)du = (nh)−1 κ2 f (x) +
o((nh)−1 ), the second term n−1 f 2 (x) + O(n−1 h) is smaller.
R
The covariance of fbn,h (x) and fbn,h (y) is written n−1 { I 2 Kh (u −
X
x)Kh (u − y)f (u) du − fn,h (x)fn,h (y)}, it is zero if |x − y| > 2h. Otherwise
let αh = |x−y|/(2h) < 1, changing the variables as h−1 (x−u) = v −αh and
h−1 (y − u) = v + αh with v = {(x + y)/2 − u}/h, the covariance develops
as
Z
b b −1 x+y
Cov{fn,h (x), fn,h (y)} = (nh) f ( ) K(v − αh )K(v + αh )dv
2
+ o((nh)−1 ).
If |x − y| ≤ 2h, f ((x + y)/2) = f (x) + o(1) = f (y) + o(1), the covariance is
approximated by
Z
(nh)−1  
{f (x) + f (y)}I{0≤αh <1} K (v − αh K v + αh dv.
2
Due to the compactness of the support of K,  the covariance
is zero if
αh ≥ 1. For x 6= y, αh tends to infinity and I 0 ≤ αh < 1 tends to zero
as n tends to infinity, then the indicator is approximated by the indicator
δx,y of {x = y}. R
For p = 1, E|fbn,h − fn,h |(x) ≤ IX,h |Kh (x − s)| d|Fbn − F |(s) which
converges to zero as n tends to infinity. For p ≥ 3, the Lp -risk of fbn,h (x) is
obtained from the expansion of the sum and by recursion on the order of
the moment in the expansion of
n
X
 p   p
E fbn,h (x) − fn,h (x) = E n−1 Kh (x − Xi ) − fn,h (x) .
i=1

The first moments of the centered estimator fbn,h − fn,h are


 3
E fbn,h (x) − fn,h (x) = (nh)−1 {−2κ2 f 2 (x) + o(1)},
 4
E fbn,h (x) − fn,h (x) = (nh)−1 {κ2 f 3 (x) + o(1)},
 5
E fbn,h (x) − fn,h (x) = (nh)−1 {11κ2 f 3 (x) + o(1)}.
By iterations, the product of moments in the expansion of the risk
p
E fbn,h (x) − fn,h (x) determines its higher order as (nh)−1 . 
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 27

Integrating the
R above expansions entails similar bounds for the inte-
grated norms E |fbn,h (x) − fn,h (x)|p dx = 0((nh)−1 ), for every p > 1.
For p = 2, E{fbn,h (x) − f (x)}2 = V ar{fbn,h (x)} + {fn,h (x) − f (x)}2 and
its first order expansion is n−1 h−1 κ2 f (x) + o(n−1 h−1 ) + 41 m22K h4 f (2)2 (x) +
o(h4 ). The asymptotic mean squared error for fbn,h at x is then
1
AM SE(fbn,h ; x) = (nh)−1 κ2 f (x) + m22K h4 f (2)2 (x),
4
it is minimum for the bandwidth function
 κ2 f (x) 1/5
hAMSE (x) = n−1/5 .
m22K f (2)2 (x)
A smaller order bandwidth increases the variance of the density estimator
and reduces its bias, with the order n−1/5 its asymptotic distribution cannot
be centered. An estimator of the derivative f (k) is defined by the means of
the derivative K (k) of the symmetric kernel, for k ≥ 1. The convergences
rates for estimators of a derivative of the density also depend on the order
of the derivative. Consider the k-order derivative of Kh
(k)
Kh (x) = h−(k+1) K (k) (h−1 x), k ≥ 1.
The estimators of the derivatives of the density are
n
X
(k) (k)
fbn,h (x) = n−1 Kh (x − Xi ). (2.3)
i=1
(k)
The next lemma implies the uniform consistency of fbn,h to f (k) , for every
order of derivability k ≥ 1 and allows to calculate the variance of the
derivative estimators. It is not exhaustive and integrals of higher orders
are easily obtained using integrations by parts.

Lemma 2.1. Let K be a symmetric density R function in class C2 , its


(j)
derivatives
R satisfy the following properties
R 2 2 : K (z) dz = 0, for every
2
j ≥ 1, zK (z) dz = 0, κ22 = z K (z) dz 6= 0 and
Z Z Z
zK (1) (z) dz = −1, z 2 K (1) (z) dz = 0, z 3 K (1) (z) dz = −3m2K ,
Z Z Z
(2) 2 (2)
zK (z) dz = 0, z K (z) dz = 2, z 3 K (2) (z) dz = 0,
Z Z
z 4 K (2) (z) dz = 12m2K , κ11 = z(K 0 K)(z) dz = −κ2 /2,
Z Z
K (1) K dz = 0, K (1)2 dz 6= 0.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

28 Functional estimation for density, regression models and processes

(1) Pn (1)
The sum fbn,h (x) = n−1 i=1 Kh (x − Xi ) converges uniformly on IX,h to
its expectation Z
(1) (1) (1)
fn,h (x) = EKh (x − X) = Kh (u − x)fX (u) du
Z Z
h2
= −f (1) (x) zK (1) (z) dz − f (3) (x) z 3 K (1) (z) dz + o(h2 )
6
2
h
= f (1) (x) + m2K f (3) (x) + o(h2 ),
2
(1) 2
then fbn,h converges uniformly to f (1) (x) and its bias is h2 m2K f (3) (x). Its
R
variance is (nh3 )−1 f (x) K (1)2 (z) dz + o((nh3 )−1 ) and the optimal local
bandwidth for estimating f (1) is deduced as
R (1)2
(1)

−1/7 f (x) K (z) dz 1/7
hAMSE (f ; x) = n 2 (3)2
,
m2K f (x)
thus the estimator of the first density derivative (2.3) has to be computed
with a bandwidth estimating hAMSE (f (1) ; x). For the second derivative,
(2) (2) 2
the expectation of fbn,h is fn,h (x) = f (2) (x) + h2 m2K f (3) (x) + o(h2 ), so it
2
converges uniformly f (2) with the bias h2 m2K f (4) (x)+o(h2 ) and the vari-
R to(2)2
5 −1
ance (nh ) f (x) K (z) dz + o((nh4 )−1 ). More generally, Lemma 2.1
generalizes by induction to higher orders and the rate of optimal band-
widths is deduced as follows.
(k)
Proposition 2.3. Under Conditions 2.1, the estimator fbn,h of the k-
order derivative of a density in class C2 has a bias O(h2 ) and a variance
O((nh2k+1 )−1 ), its optimal local and global bandwidths are O(n−1/(2k+5) ),
for every k ≥ 2.
For a density of class Cs and under Conditions 2.2, the bias is a O(hs ) and
the variance a O((nh2k+1 )−1 ), its optimal bandwidths are O(n−1/(2k+2s+1) )
and the corresponding L2 -risks are O(n−s/(2k+2s+1) ).
(k)
As a consequence the L2 -risk of the estimator fb b is a O(n−2s/(2k+2s+1) )
n,hopt
for every density in Cs , s ≥ 2. If the k-th derivative of the kernel and the
density are lipschitzian with |K (k) (x) − K (k) (y)| ≤ α|x − y| and |f (k) (x) −
f (k) (y)| ≤ α|x − y| for some constant α > 0, then there exists a constant C
such that for every x and y in IX,h
(k) (k)
|fbn,h (x) − fbn,h (y)| ≤ Cαh−(k+1) |x − y|.
R (k)2
The integral θk = f (x) dx of the quadratic k-th derivative of the
density is estimated by Z
(k)2
θbk,n,h = fbn,h (x) dx, (2.4)
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 29

the variance E(θbk,n,h − θk )2 has the same order as the MISE for the estima-
(k)
tor fbn,h of f (k) , hence it converges to θk with the rate O((n1/2 hk+1/2 ) and
the estimator does not achieve the parametric rate of convergence n1/2 .
The Lp -risk of the estimator of the density decreases as s increases and,
for p ≥ 2, a bound of the Lp -norm is
hps
kfbn,h (x) − f (x)kpp ≤ 2p−1 {mpsK f (k)p (x) + o(1)}
(s!)p

+ (nh)−1 {gp (x) + o(1)} ,
P[p/2] P k
where gp (x) = k=2 1<j1 6=...6=jk ≤p; i ji =p κj1 . . . κjl f (x). The optimal
P

bandwidth is still reached when both terms of this bound are of the same
order and minimal. With p = 2 and s = 2, it is
1 2
hn (x) = O(n− 5 ), kfbn,h (x) − f (x)k2 = O(n− 5 ).

For a density of Cs and the L2 risk, it is hn = O(n−1/(2s+1) ) and

kfbn,h (x) − f (x)k2 = O(n−s/(2s+1) ).

The bandwidth and the risk decrease as the order of derivability of the
density increases.
The derivability condition f ∈ Cs in 2.1 can be replaced by the condi-
tion: f belongs to a Hölder class Hα,M with |f (s) (x)−f (s) (y)| ≤ M |x−y|α−s
where s = [α] ≥ 0 is the integer part of α > 0.

Proposition 2.4. Assume f is bounded and belongs to a Hölder class


Hα,M , then the bias of fbn,h is bounded by M m[α]K hα /([α]!) + o(hα ), the
optimal bandwidth is O(n1/(2α+1) ) and the MISE at the optimal bandwidth
is O(nα/(2α+1) ).

2.3 Weak convergence

The Lp -norm of the variations of the process fbn,h − fn,h are bounded by
the same arguments as the bias and the variance. Assume that K has the
support [−1, 1].

Lemma 2.2. Under Conditions 2.1 and 2.1, there exists a constant C such
that for every x and y in IX,h and satisfying |x − y| ≤ 2h

E{fbn,h (x) − fbn,h (y)}2 ≤ C(nh3 )−1 |x − y|2 .


January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

30 Functional estimation for density, regression models and processes

Proof. Let x and y in IX,h , the variance of fbn,h (x) − fbn,h (y) develops
according to their variances given by Proposition 2.2 and the covariance
between both terms which has the same bound by the Cauchy-Schwarz
inequality. The
R second order moment E|fbn,h (x) − fbn,h (y)|2 develops as
the sum n−1 {Khn (x − u) − Khn (y − u)}2 f (u) du + (1 − n−1 ){f R n,hn (x) −
2
fn,hn (y)} . For an approximation of the integral I2 (x, y) = {Khn (x −
u) − Khn (y − u)}2 f (u) du, the Mean Value Theorem implies Khn (x − u) −
(1)
Khn (y − u) = R(x − y)ϕn (z − u) where ϕn (x) = Khn (x), and z is between
x and y, then {Khn (x − u) − Khn (y − u)}2 f (u) du is approximated by
Z Z
2
(x − y) ϕn (z − u)f (u) du = (x − y) hn {f (x) K (1)2 + o(hn )}.
(1)2 2 −3

Since h−1 −1
n |x| and hn |y| are bounded by 1, the order of the second moment
of fbn,h (x) − fbn,h (y) is a O((x − y)2 (nh3n )−1 ) if |x − y| ≤ 2hn and the
covariance is zero otherwise. 

Theorem 2.1. Under Conditions 2.1 and 2.2, for a density f of class
Cs (IX ) and with nh2s+1 converging to a constant γ, the process
Un,h = (nh)1/2 {fbn,h − f }I{IX,h }
converges weakly to Wf +γ 1/2 bf , where Wf is a continuous Gaussian process
on IX with mean zero and covariance E{Wf (x)Wf (x0 )} = δx,x0 σf2 (x), at x
and x0 .

Proof. The finite dimensional distributions of the process Un,h converge


weakly to those of Wf + γ 1/2 bf , as a consequence of Proposition 2.2. The
covariance of Wf at x and x0 is Cf,n (x, x0 ) = limn nhCov{fbn,h (x), fbn,h (x0 )},
and Proposition 2.2 implies that Un,h (x) and Un,h (x0 ) are asymptotically
independent as n tends to infinity.
If the support of X is bounded, let a = inf IX , η > 0 and c >
1/2
γ |bf (a)| + 2η −1 σf2 (a)
1/2
, then

Pr{|Un,h (a)| > c} ≤ Pr (nh)1/2 |(fbn,h − fn,h )(a)| + (nh)1/2 |bn,h (a)| > c
V ar{(nh)1/2 (fbn,h − fn,h )(a)}
≤ ,
{c − (nh)1/2 |bn,h (a)|}2
so that for n sufficiently large
σf2 (a)
Pr{|Un,h (a)| > c} ≤ + o(1) < η,
{c − γ 1/2 |bf (a)|}2

R and the bound {fn,h (x)−


the process Un,h (a) is therefore tight. Lemma 2.2
f (x) − fn,h (y) + f (y)}2 ≤ |f (x) − f (y)|2 + [ K(z){f (x + hz) − f (y +
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 31

hz)}dz]2 ≤ 2|x − y|2 kf (1) k2∞ imply that the mean of the squared variations
of the process Un,h are O(h−2 |x − y|2 ) as |x − y| ≤ 2h < 1, otherwise
the estimators fbnh (x) and fbnh (y) are independent. Billingsley’s Theorem 3
implies the tightness of the process Un,h 1[−h,h] and the convergence is ex-
tended to any compact subinterval of the support. With an unbounded
support for X such that E|X| < ∞, for every η > 0 there exists A such
that P (|X| > A) ≤ η, therefore P (|Un,h (A + 1)| > 0) ≤ η and the same
result still holds on [−A − 1, A + 1] instead of the support of the process
Un,h . 

Corollary 2.1. The process

sup σf−1 (x)|Un,h (x) − γ 1/2 bf (x)|


x∈IX,h

converges weakly to supIX |W1 |, where W1 is the Gaussian process with


mean zero, variance 1 and covariances zero.
For every η > 0, there exists a constant cη > 0 such that

Pr{ sup |σf−1 (Un,h − γ 1/2 bf ) − W1 | > cη }


IX,h

tends to zero as n tends to infinity.

Lemma 2.2 concerning second moments does not depend on the smooth-
ness of the density and it is not modified by the condition of a Hölder
class instead of a class Cs . The variations of the bias are now bounded by
{fn,h (x)−f (x)−fn,h (y)+f (y)}2 ≤ 2M |x−y|2α and the mean of the squared
variations of the process Un,h are O(h−2 |x − y|2 ) for |x − y| ≤ 2h < 1. The
weak convergence of Theorem 2.1 is therefore fulfilled with every α > 1.

With the optimal bandwidth for the global MISE error


 κ 1/5
hAMISE = 2
R 2 ,
nm2K f (2)2 (x) dx
R (2)2
the limit γR of nh5n is κ2 m−2
2K { f (x) dx}−1 . The integral of the second
derivative f (2)2 (x) dx and the bias term bf = 21 m2K f (2) are estimated us-
ing the second derivative of the estimator for f . Furthermore, the variance
σf2 = κ2 f is immediatly estimated. More simply, the asymptotic criterion
is written
Z
AM ISEn (h) = {h4 bf (x) + (nh)−1 σf2 (x)}f −1 (x) dF (x)
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

32 Functional estimation for density, regression models and processes

and it is estimated by the empirical mean


n
X
n−1 {h4 bf (Xi ) + (nh)−1 σf2 (Xi )}f −1 (Xi ).
i=1

This empirical error is estimated by


n
X
b ISEn (h) = n−1
AM {h4bbf,n,h2 (Xi ) + (nh)−1 σ 2
bf,n,h 2
(Xi )}fbh−1
2
(Xi )
i=1

with another bandwidth h2 converging to zero. The global bandwidth


hAMISE is then estimated at the value that achieves the minimum of
b ISEn (h), i.e.
AM
n 4n Pn bb2 b−1 o−1/5
b i=1 f,n,h2 (Xi )fh2 (Xi )
hn = Pn .
i=1 σ
2
bf,n,h 2
(Xi )fbh−1
2
(Xi )

Bootstrap estimators for the bias and the variance provide another estima-
tion of M ISEn (h) and hAMISE . These consistent estimators are then used
for centering and normalizing the process fbn,h −f and provide an estimated
process
bn = (nb
U b−1
hn )1/2 σ b
{fn,bhn γn,bhnbbf,n,bhn }I{IX,bhn }.
−f −b
f,n,b
hn

An uniform confidence interval with a level α for the density f is deduced


from Corollary 2.1, using a quantile of supIX |W1 |.

Theorem 2.2. Under Conditions 2.1 and 2.2, for a density f of class
Cs (IX ) and with nh2s+2k+1 converging to a constant γ, the process
(k) (k)
Un,h = (nh2k+1 )1/2 {fbn,h − f (k) }I{IX,h }

converges weakly to a Gaussian process Wf,k + γ 1/2 bf,k , where Wf,k is a


continuous Gaussian process on IX with mean and covariances zero.

Let X be a vector variable defined in a subset IX of Rd , its density f is


estimated by smoothing its distribution function
n
X
Fbn (x) = 1{X1 ≤x1 ,...,Xd ≤xd } , x = (x1 , . . . , xd ),
i=1

by a multivariate kernel K defined on [−1, 1]d and Kh (x) = h−d K(h−d x),
Qd
with a single bandwidth or Kh (x) = k=1 h−1 −1
k K(hk xk ) with a vector
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 33

bandwidth, for x in IX,h . The derivatives of the density f (k) are arrays
and the rates of their moments depend on the dimension d. If hk = h
hs
bs,n,h (x) = msK f (s) (x) + o(hs ),
s!
V ar{fbn,h (x)} = (nhd )−1 κ2 f (x) + o((nhd )−1 ),
kfbn,h (x) − fn,h (x)kp = 0((nhd )−1/p ), (2.5)
M ISEn (h, x) = O(h2s ) + O((nhd )−1 ).
The optimal bandwidth hn (x) minimizing the M ISEn (h, x) has the order
n−1/(2s+d) where the local MISE reaches the minimal order O(n−2s/(2s+d) ).
The convergence rate of fbn,h − f is (nhd )1/2 and the results of Theorem 2.1
and its corollary still hold with this rate.

2.4 Minimax and histogram estimators

Consider a class F of densities and a risk R(f, fbn ) for the estimation of
a density f of F by an estimator fbn belonging to a space Fb. A minimax
estimator fbn∗ is defined as a minimizer of the maximal risk over F
fb∗ = arg inf sup R(f, fbn ).
n
fbn ∈F
b f ∈F

With an optimal bandwidth related to the risk Rpp , the kernel estimator of
a density of F = Cs , s ≥ 2, provides a Lp -risk of order hspn (x; s, p) and this
is the minimax risk order in a space Fb determined by the regularity of the
kernel, the kernel estimator reaches this bound.
R
The estimator (2.4) of the integral θk = f (k)2 (x) dx of the quadratic k-
th derivative of a density of C2 has therefore the optimal rate of convergence
for an estimator of θk .
The histogram is the older unsmoothed nonparametric estimator of the
density. It is defined as the empirical distribution of the observations cumu-
lated on small intervals of equal length hn , divided by hn , with hn and
nhn converging to zero as n tends to infinity. Let (Bjh )j=1,...,JX,h be a
partition of IX into subintervals of length h and centered at ajh , and let
P
Kh (x) = h−1 j∈JX,h 1Bjh (x) be the kernel corresponding to the histo-
gram, it is therefore defined as
Z
fen,h (x) = hKh (x) Kh (s)dFbn (s).
P
Its bias ebf,h (x) = j∈JX,h 1Bjh (x){f (ajh ) − f (x)} + o(h) = hf (1) (x) + o(h)
is larger than the bias of kernel estimators and its variance vef (x) is a
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

34 Functional estimation for density, regression models and processes

O((nh)−1 ), due to the covariance zero between the empirical


R distribution on
Bjh and Bj 0 h for j 6= j 0 . As n tends to infinity, h−1 n Bjh dV ar(Fbn − F ) =
f (ajh ){1 − 2F (ajh )} + o(1) hence vef,h (x) = (nh)−1 f (x){1 − 2F (x)} +
o((nh)−1 ). Let ebf (x) = f (1) (x) and vef (x) = f (x){1 − 2F (x)}. The normal-
ized histogram (nh)1/2 (fen,h − f − hf (1) )(x) converges weakly to a normal
variable vef,h (x)N (0, 1) and it is asymptotically unbiased with a bandwidth
hn = o(n−1/3 ). Increasing the order of hn reduces the variance of the his-
togram and increases its bias. The asymptotic mean squared error of the
histogram is minimal for the bandwidth
1/3
hn (x) = n−1/3 {2eb2 (x)}−1/3 ve (x) = {2nf (1)2 (x)f −1 (x)}−1/3
f f
then it is approximated by
2/3
vf (x)ebf (x)}2/3 {21/3 + 2−1/3ebf (x)}
M SEopt (x) = n−2/3 {e
= n−2/3 {f (x){1 − 2F (x)}f (1) (x)}2/3
×[21/3 + 2−1/3 {f (1) (x)}2/3 ].
These expressions do not depend on the degree of derivability of the den-
sity. The optimal bandwidth, the bias ebf (x) and the variance vef (x) of the
histogram are estimated by plugging the estimators of the density and its
derivative in their formulae. The Lp moments of the histogram are deter-
mined by the higher order term in the expansion of |fen,h (x) − fn,h (x)|pp , it
is a O((nh)−1 ), for every p ≥ 2.
The derivatives of the density are defined by differences of values of
the histogram. For x in Bjh , f (1) (x) = h−1 {f (aj+1,h ) − f (aj,h )} + o(1) is
estimated by
(1)
fe (x) = h−1 {fen,h (aj+1,h ) − fen,h (aj,h )}
n,h
and the derivatives of higher order are defined in the same way. The bias
(1)
of fen,h is a O(1) and its variance is a O((nh3 )−1 ).

2.5 Estimation of functionals of a density

The estimation of the integral of a squared density


Z Z
2
θ = f (x) dx = f (x) dF (x)
has been considered by many authors and several estimators have been
proposed. The plug-in kernel density estimator
2 X
θbn,h = Kh (Xi − Xj )
n(n − 1)
1≤i<j≤n
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 35

has been introduced by Hall and Marron (1987). A second plug-in estimator
was defined by Bickel and Ritov (1988) as the integral of the square of the
estimated density
2 X Z
θ̄n,h = Kh (x − Xi )Kh (x − Xj ) dx.
n(n − 1)
1≤i<j≤n

Other authors proposed estimators based on projections on orthogonal ba-


sis of L2 (X ). Bickel and Ritov (1988) and Giné and Nickl (2008) proved
the asymptotic equivalence of the estimators and their weak convergence
for bounded densities ofR Rthe space H2α of functions satisfying the inte-
grated Hölder condition |t|−(1+2α) |f (x − t) − f (x)|2 dx dt < ∞ for some
0 < α ≤ 1/2. Then the bias of the estimator θbn,h is (h2α ) and its vari-
ance is (n2 h ∨ n−1 h2α
R ).2 Moreover, if α > 1/4 and hn = O(n
−2/(4α+1)
),
1/2 b
n {2θn,h − θ̄n,h − f (x) dx} converges weakly to a centered Gaussian
R R
variable with variance 4{ f 3 − ( f 2 )2 }.

The integrals of the squared k-th derivatives of the density


Z
θk = f (k)2 (x) dx

are also estimated by the integral of the square of the kernel estimator for
the derivative of the density
2 X Z (k) (k)
θ̄n,h = Kh (x − Xi )Kh (x − Xj ) dx,
n(n − 1)
1≤i<j≤n

which is equivalent to the integral of the squared estimator (2.3).


The mode of a real function f on an open set IX is
Mf = sup |f (x)|. (2.6)
IX

It is a norm on the space of real functions defined on IX , and for functions


with values in a metric space with the norm | · |. The triangular inequality
Pp
provides a bound for the mode of mixture density g = k=1 µk fk , where
Pp
the sum of mixture probabilities µk belonging to ]0, 1[ satisfies k=1 µk = 1
and the densities fk have the same support
p
X
Mg ≤ µk Mfk . (2.7)
k=1

If the supports of the densities fk are sufficiently separated, their modes


can be identified and the mode of g is identical to the mode of one density
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

36 Functional estimation for density, regression models and processes

of the mixture. Otherwise mixture densities with overlapping supports are


cumulated in g and the mode of g is not always located at the mode of a
component. It is the combination of the modes of the component densities
of the mixture only if all modes have the same location.
For a single density f in class C2 (IX ), the mode Mf is estimated by

cf,n,h = M b .
M fn,h

Under Conditions 2.1, the density is locally concave in a neighborhood NM


of the mode and its estimator has the same property for n sufficiently large.
cf,n,h converges to Mf in probability. A Taylor expansion
It follows that M
for x in NM is written f (1) (x) = (x − Mf )f (2) (Mf ) + o(x − Mf ). At the
estimated mode, f (1) (Mf ) = 0 and fbn,h (M
cf,n,h ) = 0, which entails

cf,n,h − Mf ) = f (2)−1 (Mf ) f (1) (M


(M cf,n,h ){1 + o(1)}
cf,n,h ) − fb(1) (M
= f (2)−1 (Mf ) {f (1) (M cf,n,h )}{1 + o(1)}.
n,h

(1)
For every x in IX , the variable U(1),n,h (x) = (nh3 )1/2 (fbn,h − f (1) )(x) con-
verges weakly to a Gaussian variable with a non degenerated variance
κ2 f (x) and a mean m(1) (x) = limn (nh7n )1/2 m2K f (3) (x)/2. It follows that
the convergence rate of (Mcf,n,h − Mf ) is (nh3 )1/2 and

cf,n,h − Mf ) = −f (2)−1 (Mf ) U(1),n,h (M


(nh3 )1/2 (M cf,n,h ) + o(1).

The following proposition is a consequence of the asymptotic behaviour of


the derivative of the estimated density given in Theorem 2.3, the rate of
the optimal bandwidth for the density is lower than the rate assumed in
Parzen (1962) and the limiting distribution
R (1)2 has a non zero mean.
2 (2) −2
Let σM f
= f (M f ){f (M f )} K (z) dz.

Proposition 2.5. Under Conditions 2.1 and 2.2, (nh3 )1/2 (M cf,n,h − Mf )
(2)−1 2
converges weakly to a variable N (f (Mf )m(1) (Mf ), σMf ).

Note that if hn = O(n−1/9 ), the convergence rate of M cf,n,h is n1/3 like


Chernoff’s estimator (1964). The optimal rate for the bandwidth of the
first derivative of the density is n1/(2s+3) as established in Proposition 2.3
for every integer s > 1, that is n1/9 for a density in C3 (IX ). For a smaller
bandwidth of order n−1/r , r > 2, M cf,n,h converges fastly, with the rate
n (1−3/r)/2 cf,n,h ) is
. If the density belongs to C3 (IX ), the bias of f (1) (M
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 37

(1)
deduced from the bias of the process (fbn,h − f (1) ) and it equals
2 2
cf,n,h ) = − h m2K f (3) (M
Ef (1) (M cf,n,h )+o(h2 ) = − h m2K f (3) (Mf )+o(h2 ),
2 2
it does not depend on the degree of derivability of the density f .

The support of a density f can be estimated from its graph defined as


Gf = {(x, y); y = f (x), x ∈ IX }. For a continuous function f defined on
an open interval IX with compact closure, Gf is an open set of R2 with
compact closure. This closed set defines the support of the function f . For
every y such that (x, y) belongs to a closed subset A of Gf , there exist x in
a closed subinterval of IX such that y = f (x). The graph of a sum of two
densities f1 and f2 is the union of their graphs G1 ∪ G2 and by difference
G1 = G1 ∪ G2 − G2 \ G1 , with G2 \ G1 = {(x, y); y = f2 (x) 6= f1 (x), x ∈ IX }.
Let Gbf,n,h = {(x, y); y = fbn,h (x), x ∈ IX } be the graph of the kernel
estimator of an absolutely continuous density f on IX , then
Gfbn,h = Gf ∪ Gfbn,h −f = Gf + Gfbn,h −f \ Gf
hence
Gfbn,h −f = Gbf,n,h − Gf
and it converges a.s. to zero as n tends to infinity. The support of the
density f is consistently estimated by Gbf,n,h and the extrema of the density
are consistently estimated by those of the estimated graph.

2.6 Density of absolutely continuous distributions

Let F0 be a distribution function in a functional space F and Fϕ0 be a distri-


bution function absolutely continuous with respect to F0 , having a density
ϕ0 with respect to F0 . The function ϕ0 belongs to a nonparametric space
of continuous functions Φ and the distribution functionR Fϕ0 belongs to the

nonparametric model PF ,Φ = {(F, ϕ); F ∈ F, ϕ ∈ Φ, 0 ϕ dF = 1}. The
observations are two subsamples X1 , . . . , Xn1 with distribution function F0
and Xn1 +1 , . . . , Xn with distribution function Fϕ0 . The approach extends
straightforwardly to a population stratified in K subpopulations. Estima-
tion of the distributions of stratified populations has already been studied,
in particular by Anderson (1979) with a specific parametric form for ϕθ , by
Gill, Vardi
R · and Wellner (1988) in biased sampling models with group distri-
butions 0 wk dF , where the weight functions are known, by Gilbert (2000)
in biased sampling models with parametric weight functions, by Cheng and
Chu (2004), with the Lebesgue measure and kernel density estimators.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

38 Functional estimation for density, regression models and processes

The density with respect to the Lebesgue measure of a distribution


function F in F is denoted f and the distributions of both samples are
supposed to have the same support. Let n2 = n − n1 increasing with n,
such that limn n−1 n1 = π in ]0, 1[, and let ρ be the sample indicator defined
by ρ = 1 for individuals of the first sample and ρ = 0 for individuals of the
second sample. Let F1 = πF0 and F2 = (1 − π)Fϕ0 be the subdistribution
functions of the two subsamples, they are estimated by the corresponding
empirical subdistribution functions
n
X n1
X
Fb1,n = n−1 ρi 1{Xi ≤t} = n−1 1{Xi1 ≤t} ,
i=1 i=1
Xn
Fb2,n = n−1 (1 − ρi )1{Xi ≤t}
i=1

and πbn = n−1 n1 . Their densities with respect to the Lebesgue measure are
denoted f1 and f2 , and the density of the second sample with respect to
the distribution of the first one is ϕ = π(1 − π)−1 f1−1 f2 . The densities f1
and f2 are estimated by smoothing Fb1,n and Fb2,n , then f0 , fϕ and ϕ are
estimated by
Z
b
f0,n,h (t) = π −1
bn Kh (t − s) dFb1,n (s),
Z
fbn,h (t) = (1 − π
bn )−1 Kh (t − s) dFb2,n (s),

bn,hn (t) = fb0,n,h


ϕ −1
(t)fbn,h (t)
on every compact subset of the support of the densities where f0 is
strictly positive and kϕ bn,h − ϕ0 k converges in probability to 0. R The ex-
pectation of the estimators are approximated by f0;n,h (t) = Kh (t −
2 (2) R
s) dF0 (s) + O(n−1/2 ) = f0 + h2 f0 + o(h2 ) and fn,h (t) = Kh (t −
2 (2)
s) dFϕ (s) + O(n−1/2 ) = fϕ + h2 fϕ + o(h2 ). The bias of ϕ bn,h is ex-
2 (2)
h −1 (1) (1) 2
panded as bn,h = 2 f0 {ϕf0 + 2ϕ f0 } m2K + o(h ), its variance is
vn,h = f0−2 {V arfbn,h + ϕ2 V arfb0,n,h } + o((nh)−1 ) where the variances given
in Proposition 2.2, V arfbj,n,h (t) = (nj h)−1 κ2 fj (t) {1 + o(1)} imply similar
approximations for the variances of the estimators fb0,n,h and fbn,h . The
following approximation with independent subsamples implies the weak
convergence of the estimator of ϕ
bn,h − ϕn,h ) = f0−1 (nh)1/2 {(fbn,h − fϕ,n,h )
(nh)1/2 (ϕ
− ϕ(fb0,n,h − f0,n,h )} + oL (1). 2
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 39

2.7 Hellinger distance between a density and its estimator

Let P and Q be two probability measures and let λ = P + Q be the


dominating measure of their sum. Let F and G be the distribution functions
of a variable X under the probability measures P and Q, respectively, and
let f and g be the densities of P and Q, respectively, with respect to λ.
The Hellinger distance between P and Q is
Z √ p Z p
2 1 2 1 √
h (P, Q) = ( dP − dQ) = ( f − g)2 dλ.
2 2
The affinity of P and Q is
Z p
2
ρ(P, Q) = 1 − h (P, Q) = f g dλ.

The following inequalities were proved by Lecam and Yang (1990)


1
h2 (P, Q) ≤ kP − Qk1 ≤ {1 − ρ2 (P, Q)}1/2 .
2
Applying this inequality to the probability density f of P , absolutely con-
tinuous with respect to the Lebesgues measure λ, and its estimator fbn,h ,
we obtain s s
Z bn,h Z
f fbn,h
h2 (fbn,h , f ) = (1 − ) dF ≤ {1 − ( dF )2 }1/2 .
f f
The convergence to zero of the Hellinger distance h2 (fbn,h , f ) is deduced
from the obvious bound s s
Z bn,h Z
f fbn,h
h2 (fbn,h , f ) = (1 − ) dF ≤ ( − 1) d(Fbn − F ) (2.8)
f f
R q fbn,h
b
which is consequence of the inequality f dFn ≥ 0. This inequality
and the uniform a.s. consistency of the density estimator also imply the
a.s. convergence to zero of n1/2 h2 (fbn,h , f ). By differentiation, estimators
of functionals of the density converges with the same rate as the estimator
of the density, hence h2 (fbn,h , f ) convergences to zero in probability with
1/2
the rate nhn .
Applying these results to the probability measures P0 and P = Pϕ0 of
the previous section, with distribution functions F0 and F , we get similar
formulae Z Z
1 √ √
h2 (P0 , P ) = (1 − ϕ)2 dF0 ≤ {1 − ( ϕ dF0 )2 }1/2 ,
2
Z s Z s
2 ϕbn,h ϕbn,h
h (ϕ bn,h , ϕ) = (1 − ) dF ≤ {1 − ( dF )2 }1/2 .
ϕ ϕ
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

40 Functional estimation for density, regression models and processes

The bound (2.8) is adapted to the density ϕ


Z s
ϕbn,h
2
h (ϕbn,h , ϕ) ≤ ( − 1) d(Fbn − F ),
ϕ
1/2
it follows that the convergence rate of h2 (ϕ
bn,hn , ϕ) is nhn .

2.8 Estimation of the density under right-censoring

On a probability space (Ω, A, P ), let X and C be two independent positive


random variables with densities f and fC and such that P (X < C) is
strictly positive, and let T = X ∧ C, δ = 1{X≤C} denote the observed
variables when X is right-censored by C. Let
X
Nn (t) = δi 1{Ti ≤t}
1≤i≤n

be the number of observations before t and


X
Yn (t) = 1{Ti ≥t}
1≤i≤n

be the number of individuals at risk at t. The survival function F̄ = 1 − F −


of X is now estimated by Kaplan-Meier’s product-limit estimator
Y δi Jn (Ti ) Y
b̄ R (t) =
F 1− = b n (Ti ) , with
1 − ∆Λ
n
Yn (Ti )
Ti ≤t Ti ≤t
Z t
b n (t) = Jn (s)
Λ dNn (s) , (2.9)
0 Yn (s)

and Jn (s) = 1{Y (s)>0} . The process FbnR is also written in an additive
n

form (Pons, 2007) as a right-continuous increasing process identical to the


product-limit estimator
Z t
dNn (s)
FbnR (t) = Pn (2.10)
0 n− bR −1 }
j=1 (1 − δj )1{Tj <s} {1 − (Fn (Tj ))

which is easily calculated. From the martingale property of the process Λbn
and Gill’s expression for the Kaplan-Meier estimator
Z t
FbnR − F 1 − FbnR (s− ) b
(t) = − d(Λn − Λ)(s), t ≤ max Ti , (2.11)
1−F 0 1 − F (s)
it follows that
Z t
1 − FbnR (s− ) b
E {dΛn (s) − dΛ(s)} = 0,
0 1 − F (s)
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 41

so the Kaplan-Meier estimator is unbiased


Rt and for every t ≤ max Ti , its
variance is V arFbnR (t) = (1 − F )2 (t) 0 g −1 dΛ + o(1). For every p ≥ 2, the
Burkhölder-Davis-Gundy inequality implies the existence of a constant cp
such that a bound for the Lp -risk of the Kaplan-Meier estimator is written
Z
 t
1 − FbnR (s− ) 2 −2 p/2
E{F (t) − FbnR (t)}p ≤ cp {1 − F (t)}p E } Yn (s)dNn (s)
0 1 − F (s)

therefore kF (t) − FbnR (t)kp = O(n1/2 ).


The density of T under right-censoring is estimated by smoothing the
Kaplan-Meier estimator FbnR of the distribution function
Z
fbn,h
R
(t) = Kh (t − s) dFbnR (s)

it is explicitly written using either its multiplicative expression


Z
fbn,h
R
(t) = Kh (t − s)FbnR (s− ) dΛ
b n (s),
IX,n,h

or the additive formula (2.10)


n
X Kh (t − Ti )δi 1{Ti ≤t}
fbn,h
R
(t) = Pn .
i=1 n− j=1 (1 − δj )1{Tj <Ti } {1 − (FbnR (Tj ))−1 }

The a.s. uniform consistency of the process FbnR − F , for the Kaplan-Meier
estimator, implies that supIX,h |fbn,h
R
− f | converges in probability to zero,
as n tends to the infinity and h to zero. From (2.11), the estimator fbR n,h
satisfies
Z Z
1 − FbnR− b
s
fbn,h
R
(t) = Kh (t − s)[f (s){1 + d(Λn − Λ)} ds
0 1−F
− {1 − FbnR (s− )} d(Λ
b n − Λ)(s)]
Z
= Kh (t − s) [dF (s) − {1 − FbnR (s− )} d(Λ
b n − Λ)(s)]
Z Z u
1 − FbnR− b n − Λ)(u).
+ { Kh (t − s) dF (s)} (u) d(Λ
h 1−F

The bias of the estimated density fbn,h


R
(t) is then the same as in the uncen-
h2 (2)
sored case bf,n,h (t) = 2 f (t) + o(h2 ), since the Kaplan-Meier estimator
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

42 Functional estimation for density, regression models and processes

b n are unbiased estimators. Its variance develops as


and Λ
Z Z
 s 1 − FbnR−
R
vf,n,h (t) = E Kh2 (t − s)f 2 (s) d(Λb n − Λ) 2 ds
0 1−F
Z
+ E Kh2 (t − s){1 − Fbn (s− )}2 Yn−1 (s) dΛ(s)
Z
(1 − FbnR− )2
− 2E Kh2 (t − s) (s)Yn−1 (s) dΛ(s)
1−F
Z t
≤ (nh)−1 κ2 {f 2 (t) g −1 dΛ + (1 − FC )−1 (t)f (t)
0
− 2f (t)g −1 (t) + o(1)},
where
Z Z
 s
1 − FbnR− b 2 s
1 − FbnR− 2 −1
E d(Λn − Λ)} ≤ E Yn dΛ.
0 1−F 0 1−F
R
The variance is then written vf,n,h (t) = (nh)−1 vfR (t). It follows that the
optimal local and global bandwidths for the estimation of the density under
right-censoring are O(n−1/5 ) and the optimal L2 -risks are O(n−2/5 ).

2.9 Estimation of the density of left-censored variables

Let X be left-censored by C, such that P (C < X) is strictly positive, then


the observations are T = X ∨ C, δ = 1{C<X} . The notations Nn and Yn
for a sample of n independent and identically distributed observations of
(Ti , δi )1≤i≤n are unchanged, with this definition of the variables. The cumu-
lative hazard function λ used for right-censoring is replaced by a cumulative
retro-hazard function (Pons, 2008)
Z ∞
dF
Λ̄(t) = −
t F
R ∞ dF P
or Λ̄(t) = t F − + s>t ∆F (s)
F (s− ) which uniquely defines the distribution
function F of X, for t > mini Ti , as
Y
F (t) = exp{Λ̄c (t)} {1 + ∆Λ̄(s)},
s>t
Q
where Λ̄c (t) is the continuous part of Λ̄(t) and s>t {1 + ∆Λ̄(s)} its right-
continuous discrete part. On the interval In = ]mini Ti , maxi Ti ], the func-
tion Λ̄ is estimated by
Z ∞
b̄ dNn
Λn (t) = 1{Yn < n}
t n − Yn
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 43

and a product-limit estimator of the function F is defined on In from the


b̄ by
expression of Λ n
Yn oδ
FbnL (t) = b̄ (T ) i .
1 + dΛ n i
Ti ≥t
On the interval In = ]mini Ti , maxi Ti ] it satisfies
Z ∞ bL −
F − FbnL Fn (s ) b̄
(t) = {dΛn (s) − dΛ̄(s)}, (2.12)
F t F (s)
and n1/2 (F − FbnL )F −1Rconverges weakly to a centered Gaussian process with

covariance K̄(s, t) = s∧t (F −1 F − )2 (H − )−1 dΛ̄, with 1 − H = (1 − F )(1 −
FC ). From this expression, it follows that Λ b̄ is an unbiased estimator of Λ̄
n
b L
and Fn is an unbiased estimator of the distribution function F , moreover
kFbnL (t) − F (t)kp = O(n1/2 ), for p ≥ 2.
The density of T under left-censoring is estimated by smoothing the
Kaplan-Meier estimator FbnL of the
Z distribution function
b
f (t) = Kh (t − s) dFbL (s).
L
n,h n

The a.s. uniform consistency of the process FbnL − F implies that


supIX,h |fbn,h
L
− f | converges in probability to zero, as n tends to the in-
finity and h to zero. From (2.12), the estimator fbn,h L
satisfies
Z Z ∞ bL−
Fn b̄ − Λ̄)} ds
fbn,h
L
(t) = fn,h (t) + Kh (t − s)[f (s){ d(Λ n
s F
b̄ − Λ̄)(s)]
− FbnL− (s) d(Λ n
Z ∞ b L −
F (s ) b̄
= fn,h (t) + fn,h (s) n d(Λn − Λ̄)(s)
t F (s)
Z
− Kh (t − s)FbnL (s− ) d(Λ b̄ − Λ̄)(s).
n

As a consequence of the uniform consistency of the estimators FbnL− and


b̄ , the bias of the estimated density fbL (t) is then the same as in the
Λ n n,h
2
uncensored case bf,n,h (t) = h2 f (2) (t) + o(h2 ). Its variance is written
L
vf,n,h (t) = (nh)−1 vfL (t), with the expansion
Z ∞
 Fb L (s− ) 2
L
vf,n,h (t) = E 2
fn,h (s) n (n − Yn (s))−1 dΛ̄(s)
t F (s)
Z
+ Kh2 (t − s)FbnL2 (s− )(n − Yn (s))−1 dΛ̄(s)
Z ∞
FbnL2 (s− )
−2 fn,h (s) Kh (t − s)(n − Yn (s))−1 dΛ̄(s)
t F (s)
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

44 Functional estimation for density, regression models and processes

where the last two terms are O((nh)−1 ) and the first one is a O(n−1 ). The
optimal bandwidths for estimating the density under left-censoring are then
also O(n−1/5) and the optimal L2 -risks are O(n−2/5 ).
Under Conditions 2.1 or 2.2 and if the support of K is compact, the
variance vfL belongs to class C2 (IX ) and for every t and t0 in IX,h , there
exists a constant α such that for |t − t0 | ≤ 2h
 L 2
E fbn,h (t) − fbn,h
L
(t0 ) ≤ α(nh3 )−1 |t − t0 |2 .
L
Under the conditions of Theorem 2.1, the process Un,h = (nh)1/2 {fbn,h
L

L 1/2 L
f }I{IX,h } converges weakly to Wf + γ bf , where Wf is a continuous
Gaussian process on IX with mean and covariances zero and with variance
function vfL .

2.10 Kernel estimator for the density of a process

Consider a continuously observed stationary process (Xt )t∈[0,T ] with val-


ues in IX . The stationarity means that the distribution probability of Xt
and Xt+s − Xs are identical for every s and t > 0. For a process with
independent increments, this implies the ergodicity of the process that is
expressed by the convergence of bounded functions of several observations
of the process to a mean value: For every x in IX , there exists a measure
πx on IX \ {x} such that for every bounded and continuous function ψ on
2
IX
Z Z
ET −1 ψ(Xs , Xt ) ds dt → ψ(x, y) dπx (dy)dF (x) (2.13)
[0,T ]2 2
IX

as T tends to infinity. The distribution function F in (2.13) is defined as


the limit of the expectation of the mean sample-path of the process X
Z Z
−1
ET ψ(Xt ) dt → ψ(x) dF (x). (2.14)
[0,T ] IX

The mean marginal density f of the process is the density of the distribution
function F , it is estimated by replacing the integral of a kernel function with
respect to the empirical distribution function of a sample by an integral with
respect to the Lebesgue measure over [0, T ] and the bandwidth sequence is
indexed by T . For every x in IX,T,h
Z
b 1 T
fT,h (x) = Kh (Xs − x) ds, (2.15)
T 0
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 45

R
its expectation is fT,h (x) = IX,n Kh (y − x)f (y) dy so its bias is
Z
hs
bT,h (x) = Kh (y − x){f (y) − f (x)} dy = T msK f (s) (x) + o(hsT )
IX,T s!
under Conditions 2.1-2.2. For a density in a Hölder class Hα,M , bT,h (x)
[α]
Rtends[α]to zero for every α > 0 and it is a O(h ) under the condition
|u| K(u) du < ∞.
Its variance is expressed through the integral of the covariance between
Kh (Xs − x) and Kh (Xt − x). For Xs = Xt , the integral on the diagonal
2
DX of IX,T is a (T hT )−1 κ2 f (x) + o((T hT )−1 ) and the integral outside the
diagonal denoted Io (T ) is expanded using the ergodicity property (2.13).
Let αh (u, v) = |u − v|/2hT , the integral Io (T ) is written
Z Z
ds dt
Kh (u − x)Kh (v − x)fXs ,Xt (u, v) du dv
[0,T ] 2 2
IX,T \DX T T
Z Z Z 1−αh (u,v)
= (T hT )−1 { K(z − αh (u, v))K(z + αh (u, v)) dz
IX IX\{u} −1+αh (u,v)

dπu (v) dF (u)}{1 + o(1)} .


R
For every fixed u 6= v, the integral K(z −αh (u, v))K(z +αh (u, v)) dz tends
to zero since αhT (u, v) tends to infinity as hT tends to zero. If αh (u, v) tends
to zero with hT , then πu (v) also tends to zero and the integral Io (T ) is a
o((T hT )−1 ) as T tends to infinity. The mean squared error of the estimator
at x for a marginal density in Cs is then
M ISET,h (x) = (T hT )−1 κ2 f (x) + h2s
T (s!)
−2 2
msK f (s)2 (x)
+ o((T hT )−1 ) + o(h2s
T )

and the optimal local and global bandwidths minimizing the mean squared
(integrated) errors are O(T 1/(2s+1) ). If hT has the rate of the optimal
bandwidths, the M ISE is a O(T 2s/(2s+1) ). The Lp -norm of the estimator
satisfies kfbT,h (x)−fT,h (x)kp = O((T hT )−1/p ) under an ergodicity condition
k
for (Xt1 , . . . , Xtk ) similar to (2.13) for bounded functions ψ defined on IX
Z
ET −1 ψ(Xt1 , . . . , Xtk ) dt1 . . . dtk (2.16)
[0,T ]k
Z Y
→ ψ(x1 , . . . , xk ) πxj (dxj+1 ) dF (x1 ),
k
IX 1≤j≤k−1

for every integer k = 2, . . . , p. The property (2.16) implies the weak conver-
gence of the finite dimensional distributions of the process (T hT )1/2 (fbT,h −
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

46 Functional estimation for density, regression models and processes

f − bT,h ) to those of a centered Gaussian process with mean zero, co-


variances zero and variance κ2 f (x) at x. The proof is similar to the
proof for a sample of variables, using the above expansions for the vari-
ance and covariances of the process. A lipschitzian bound for increments
E{fbT,h (x) − fbT,h (y)}2 is obtained by the Mean Value Theorem which im-
RT
plies T −2 0 E{Khn (x − Xt ) − Khn (y − Xt )}2 dt = O(|x − y|2 (T h3T )−1 ) as in
Lemma 2.1. Then the process (T hT )1/2 (fbT,h − f − bT,h ) converges weakly
to a centered Gaussian process with covariances zero and variance κ2 f .

The Hellinger distance h2 (fbT,hT , f ) is bounded like (2.8)


s s
Z Z
fbT,hT fbT,hT
h2T (fbT,hT , f ) = (1 − ) dP ≤ ( − 1) d(FbT − F )
f f

where
Z T
FbT (t) = T −1 1{Xt ≤s} dt
0

is the empirical probability distribution of the mean marginal distribution


function of the process (Xt )t≤T
Z
−1
FT = T FXt dt,
[0,T ]

and F is its limit √ under the ergodicity property (2.14). The convergence
rate of FbT −F is T , from the mixing property of the process X. Therefore
1/2
h2 (fbT,hT , f ) convergences to zero in probability with the rate T hT .

2.11 Exercises
R
(1) Let f and g be real functions defined on R R and let f ∗ g(x) = f (x −
y)g(y) dy be their convolution. Calculate f ∗ g(x) dx and prove that,
for 1 ≤ p ≤ ∞, if f belongs to Lp and g to Lq such that p−1 + q −1 = 1,
then supx∈R |f ∗ g(x)| ≤ kf kp kgkq . Assume p is finite and prove that
f ∗ g belongs to the space of continuous functions on R tending to zero
at infinity.
(2) Prove the approximation of the bias in (d) of Proposition 2.2 using a
Taylor expansion and precise the expansions for the Lp -risk.
(3) Prove the results of Equation (2.5).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a density 47

(4) Write the variance of the kernel estimator for the marginal density of
dependent observations (Xi )i≤n in terms of the auto-covariance coeffi-
Pn
cients ρj = n−1 i=1 Cov(Xi , Xi+j ).
(5) Consider a hierarchical sample (Xij , Yij )j=(1,...,Ji ),i=1,...,n , with n in-
dependent and finite sub-samples of Ji dependent observations. Let
Pn −1
P n P Ji
N = i=1 Ji and f = limn N i=1 j=1 fXij be the limiting
marginal mean density of the observations of X. Define an estimator
of the density f and give the first order approximation of the variance
of the estimator
R x under relevant ergodicity conditions.
(6) Let H(x) = −1 k(y) dy be the integrated kernel, F be the distribu-
Pn
tion function of X and Fbnh (x) = n−1 i=1 Hh (Xi − x) be a smooth
estimator of the distribution function. Prove that the bias of Fbnh (x)
is 12 h2 m2K f (1) (x) + o(h2 ) and its variance (nh)−1 κ2 F (x) + o((nh)−1 ).
Define the optimal local and global bandwidths for Fbnh .
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Chapter 3

Kernel estimator of a regression


function

3.1 Introduction and notation

The kernel estimation of nonparametric regression functions is related to


the estimation of the conditional density of a variable and most authors
have studied the asymptotic behaviour of weighted risks, using weights
proportional to the density estimator so that the random denominator of
the regression function disappears. Weighted integrated errors are used for
the empirical choice of a bandwidth and for tests about the regression. In
this chapter, the bias, variance and norms of the kernel regression estimator
are obtained from a linear approximation of the estimator.
Let (Xi , Yi )i=1,...,n be a sample of a variable
R (X, Y ) with joint density
fX,Y . The marginal density of X is fX (x)= fX,Y (x, y)dy and the den-
−1
sity of Y conditionally on X is fY |X =fX fX,Y . Here, the density fX,Y
is supposed to be C2 . Let FXY be the distribution function of (X, Y )
Pn
and FbXY,n (x, y)= n−1 i=1 1{Xi ≤ x, Yi ≤ y} be their empirical distribution
function.
Consider the regression model (1.6)
Y = m(X) + σε
where m is a bounded function and the error variable ε has the conditional
mean E(ε|X) = 0 and a constant conditional variance V ar(ε|X) = σ 2 . Let
IX and IXY be respectively subsets of the supports of the distribution
functions FX and FXY , and let
IX,h = {x ∈ IX ; [x − h, x + h] ∈ IX },
IXY,h = {(x, y) ∈ IXY ; [x − h, x + h] × {y} ∈ IXY }
be subsets of the supports. On an interval IXY,h , a continuous and bounded

49
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

50 Functional estimation for density, regression models and processes

regression function
Z
−1
m(x) = E(Y |X = x) = fX yfXY (x, y) dy

is estimated by the kernel estimator


Pn
Yi Kh (x − Xi )
b n,h (x) = Pi=1
m n . (3.1)
i=1 Kh (x − Xi )
Its numerator is denoted
n Z
1X
bn,h (x) =
µ Yi Kh (x − Xi ) = yKh (x − s) dFbXY,n (s, y)
n i=1

and its denominator is fbX,n,h (x). The mean of µbn,h (x) and its limit are
respectively
Z Z
µn,h (x) = yKh (x − s) dFXY (s, y),
Z
µ(x) = yfXY (x, y) dy = fX (x)m(x),

whereas the mean of m b n,h (x) is denoted mn,h (x). The notations for the
parameters and estimators of the density f are unchanged. The variance
of Y is supposed to be finite and its conditional variance is denoted
σ 2 (x) = E(Y 2 |X = x) − m2 (x),
Z
−1
E(Y 2 |X = x) = fX (x)w2 (x) = y 2 fY |X (y; x) dy, with
Z Z
w2 (x) = y fXY (x, y) dy = fX (x) y 2 fY |X (y; x) dy.
2

Let also σ4 (x) = E[{Y −m(x)}4 | X = x], they are supposed to be bounded
functions. The Lp -risk of the kernel estimator of the regression function m
is defined by its Lp -norm k · kp = {Ek · kp }1/p .

3.2 Risks and convergence rates for the estimator

The following conditions are assumed, in addition to Conditions 2.1 and


2.2 about the kernel and the density.

Condition 3.1. (1). The functions fX , m and µ are twice continuously


differentiable on IX , with bounded second order derivatives; fX is strictly
positive on IX ;
(2). The functions fX , m and σ belong to the class Cs (IX ).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a regression function 51

Proposition 3.1. Under Conditions 2.1, 2.2 and 3.1(1),


(a). supx∈IX,h |b
µn,h (x)− µ(x)| and supx∈IX,h |m
b n,h (x)− m(x)| converge
a.s. to zero if and only if µ and m are uniformly continuous.
(b). The following expansions are satisfied
µn,h (x)
mn,h (x) = + O((nh)−1 ),
fX,n,h (x)
1/2 1/2 −1 
nh {m
b n,h − mn,h }(x) = nh fX (x) (bµn,h − µn,h )(x) (3.2)

b
− m(x)(fX,n,h − fX,n,h )(x) + rn,h
where rn,h = oL2 (1).
(c) For every x in IX and for every integer p > 1, kb µn,h (x) − µ(x)kp
and kmb n,h (x) − m(x)kp converge to zero, the bias of the estimators µ
bn,h (x)
b n,h (x) is uniformly approximated by
and m
bµ,n,h (x) = µn,h (x) − µ(x) = h2 bµ (x) + o(h2 ),
Z
m2K (2) m2K ∂ 2 fXY (x, y)
bµ (x) = µ (x) = y dy, (3.3)
2 2 ∂x2
bm,n,h (x) = mn,h (x) − m(x) = h2 bm (x) + o(h2 ),
−1
bm (x) = fX (x){bµ (x) − m(x)bf (x)}
m2K −1 (2)
= f (x){µ(2) (x) − m(x)fX (x)}, (3.4)
2 X
the covariance between µbn,h (x) and fbX,n,h (x) is
Covµ,fX ,n,h (x) = (nh)−1 {Covµ,fX (x) + o(1)},
Covµ,fX (x) = µ(x)κ2 = m(x)fX (x)κ2 (3.5)
and their variance
vµ,n,h (x) = (nh)−1 {σµ2 (x) + o(1)},
σµ2 (x) = w2 (x)κ2 , (3.6)
−1 2
vm,n,h (x) = (nh) {σm (x) + o(1)},
2 2 −2
σm (x) = {w2 (x) − m (x)f (x)}κ2 fX (x)
−1
= κ2 f X (x)σ 2 (x). (3.7)

Proof. Note that Condition 3.1 implies that the kernel estimator of fX is
bounded away from zero on IX which may be a sub-interval of the support
of the variable X. Proposition 2.2 and the almost sure convergence to
zero of supx∈IX,h |b
µn,h − µn,h |, proved by the same arguments as for the
density, imply the assertion (a). The bias and the variance are similar for
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

52 Functional estimation for density, regression models and processes

bn,h (x) and fbX,n,h (x). For µ


µ bn,h (x), they are a consequence of (b). The first
approximation of (b) comes from the expansion
µn,h (x) µX,n,h − µX,n,h )(x)
(b
b n,h (x) =
m +
fX,n,h (x) fX,n,h (x)
mb n,h (x){fbX,n,h (x) − fX,n,h (x)}

fX,n,h (x)
µn,h (x) µX,n,h − µX,n,h )(x) µ
(b bn,h (x)(fbX,n,h − fX,n,h )(x)
= + − 2
fX,n,h (x) fX,n,h (x) fX,n,h (x)
b n,h (x)(fbX,n,h − fX,n,h )2 (x)
m
+ 2 ,
fX,n,h (x)
µn,h − µn,h )(x)(fbX,n,h − fX,n,h )(x)
(b
− 2 ,
fX,n,h (x)

the expectation of this equality yields

µn,h (x) µn,h − µn,h )(x)(fbX,n,h − fX,n,h )(x)


(b
mn,h (x) = −E 2
fX,n,h (x) fX,n,h (x)
b n,h (x){fbX,n,h (x) − fX,n,h (x)}2
m
+E 2
fX,n,h (x)
µn,h (x) µn,h (x)
= + O((nh)−1 ) = + o(h2 ) (3.8)
fX,n,h (x) fX,n,h (x)
uniformly on IX , for any bounded regression function m. The bias of
b n,h (x) is
m
 µn,h (x)  µn,h (x)
bm,n,h (x) = − m(x) + mn,h (x) − ,
fX,n,h (x) fX,n,h (x)

where the second difference is a o(h2 ), using (3.8). A second order Taylor
−1
expansion of fX,n,h (x) as n tends to infinity leads to

µn,h (x) −1
= m(x) + {bµ,n,h (x) − m(x)bfX ,n,h (x)}fX (x) + o(h2 )
fX,n,h (x)

b n,h (x) follows immediatly.


and the bias of m
The variance vm,n,h (x) of m b n,h (x) is
 µn,h (x) 2  µn,h (x) 2
vm,n,h (x) = E mb n,h (x) − − mn,h (x) − ,
fX,n,h (x) fX,n,h (x)
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Kernel estimator of a regression function 53

where the non random term is a o(h4 ), by (3.8). The first term develops
using twice the equality y −1 = x−1 − (y − x)(xy)−1
 µn,h (x)
fX,n,h (x) m b n,h (x) − =µbn,h (x) − µn,h (x)
fX,n,h (x)

− mn,h (x) fbX,n,h (x) − fX,n,h (x) (3.9)
 
bn,h (x) − µn,h (x) fbX,n,h (x) − fX,n,h (x)
µ

fX,n,h (x)
 2
mb n,h (x) fbX,n,h (x) − fX,n,h (x)
+ ,
fX,n,h (x)
so that
2
 µn,h (x) 2
fX,n,h (x)E mb n,h (x) − = V ar{b
µn,h (x)}
fX,n,h (x)
+ m 2 (x)V ar{fbX,n,h (x)} − 2mn,h (x)Cov{b
n,h µn,h (x), fbX,n,h (x)}
π0,2,2 (x) mn,h (x) π0,2,1 (x)
+ 2 +2 π0,1,2 (x) − 2
fX,n,h (x) fX,n,h (x) fX,n,h (x)
π2,0,4 (x) π1,1,2 (x) π1,0,3 (x)
+ 2 +2 − 2mn,h (x)
fX,n,h (x) fX,n,h (x) fX,n,h (x)
π1,1,3 (x)
−2 2 ,
fX,n,h (x)
where
 k 0 00 
µn,h (x) − µn,h (x)}k {fbX,n,h (x) − fX,n,h (x)}k
b n,h (x){b
πk,k0 ,k00 (x) = E m
for k ≥ 0, k 0 ≥ 0 and k 00 ≥ 0. Since m b n,h (x) is bounded, Cauchy-Schwarz
inequalities and the order of the moments of µ bn,h (x) and fbn,h (x) imply that
0 00
all terms πk,k0 ,k00 (x) in the above expression are O((nh)−(k +k )/2 so they
are o((nh)−1 ) except the covariance term π0,1,1 (x). Using the first order
expansions of the means fX,n,h (x) = fX (x) + O(h2 ) the mean develops as
mn,h (x) = m(x) + O(h2 ). It follows that
 −2 
vm,n,h (x) = fX (x) µn,h (x)} + m2 (x) V ar{fbX,n,h (x)}
V ar{b

− 2m(x) Cov{b µn,h (x), fbX,n,h (x)} + o(n−1 h−1 )
and the convergence to zero of the last term rn,h in (3.2) is satisfied. The
other results are obtained by simple calculus. 
b n,h is established by the same
The minimax property of the estimator m
method as for density estimation.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

54 Functional estimation for density, regression models and processes

For p ≥ 2, let

wp (x) = E(Y p 1{X=x} ) (3.10)

be the p-th moment of Y conditionally on X = x. The Lp risk is calculated


from the approximation (3.2) of Proposition 3.1 and the next lemmas.

Lemma 3.1. For $p \geq 2$,
$$\|\hat\mu_{n,h}(x) - \mu_{n,h}(x)\|_p = O((nh)^{-1/p})$$
and
$$\|\hat f^{-1}_{n,h}(x) - f^{-1}_{X,n,h}(x)\|_p = O((nh)^{-1/p}),$$
where the approximations are uniform.

Proof. By the expansion (3.9), Proposition 2.2 extends to $\hat\mu_{n,h}(x) - \mu_{n,h}(x)$, and the moments of order $p \geq 2$ of $\hat\mu_{n,h}(x) - \mu_{n,h}(x)$ and $\hat f_{X,n,h}(x) - f_{X,n,h}(x)$ are $O((nh)^{-1/p})$, which is decreasing as $p$ increases. Let $a_n = f^{-1}_{n,h}\{\hat f_{n,h} - f_{n,h}\}$, then
$$\{\hat f^{-1}_{n,h} - f^{-1}_{n,h}\}^p = f^{-p}_{n,h}\{(1 + a_n)^{-1} - 1\}^p = f^{-p}_{n,h}\Big\{\sum_{k\geq 1}(-a_n)^k\Big\}^p$$
and the decreasing order of the moments of the kernel estimator of the density implies
$$E\Big|\sum_{k\geq 1}(-a_n)^k\Big|^p = E|a_n|^p + o(E|a_n|^p). \qquad\square$$
The convergence rate of the bandwidth determines the behaviour of the bias term of the process $(nh)^{1/2}(\hat m_{n,h} - m)$, with the following technical results. They generalise Proposition 3.1 to $p$ and $s \geq 2$.

Lemma 3.2. Under Conditions 2.1, the biases of $\hat\mu_{n,h}(x)$ and $\hat m_{n,h}(x)$ are uniformly approximated as
\begin{align*}
b_{\mu,n,h}(x; s) &= \frac{h^s}{s!}m_{sK}\int y\,\frac{\partial^s f_{X,Y}(x, y)}{\partial x^s}\,dy + o(h^s),\\
b_{m,n,h}(x; s) &= \frac{h^s}{s!}m_{sK}f_X^{-1}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\} + o(h^s), \tag{3.11}
\end{align*}
for $s \geq 2$, and their variances develop as in Proposition 3.1.
Proposition 3.2. Under Conditions 2.1 and 3.1 with $s = 2$, for every $x$ in $I_{X,h}$,
$$(nh)^{1/2}(\hat m_{n,h} - m) = (nh)^{1/2}f_X^{-1}\{(\hat\mu_{n,h} - \mu_{n,h}) - m(\hat f_{X,n,h} - f_{X,n,h})\} + (nh^5)^{1/2}b_m + \hat r_{n,h}, \tag{3.12}$$
and the remainder term of (3.12) satisfies
$$\sup_{x\in I_{X,h}}\|\hat r_{n,h}\|_2 = O((nh)^{-1/2}).$$

Proof. Expanding (3.2) yields
\begin{align*}
\hat r_{n,h} &= (nh)^{1/2}(\hat f^{-1}_{X,n,h} - f_X^{-1})\{(\hat\mu_{n,h} - \mu_{n,h}) - m(\hat f_{X,n,h} - f_{X,n,h})\}\\
&\quad + (nh)^{1/2}\hat f^{-1}_{X,n,h}f_{X,n,h}\{f^{-1}_{X,n,h}\mu_{n,h} - m\} - (nh^5)^{1/2}b_\mu\\
&= (nh)^{1/2}(\hat f^{-1}_{X,n,h} - f_X^{-1})\{(\hat\mu_{n,h} - \mu_{n,h}) - m(\hat f_{X,n,h} - f_{X,n,h})\}\\
&\quad + (nh)^{1/2}\{\mu_{n,h}f^{-1}_{X,n,h} - m - h^2 b_\mu\} \tag{3.13}\\
&\quad + (nh)^{1/2}\{f^{-1}_{X,n,h}\mu_{n,h} - m\}\sum_{k\geq 1}\Big\{-\frac{\hat f_{X,n,h} - f_{X,n,h}}{f_{X,n,h}}\Big\}^k.
\end{align*}
By Lemma 3.1 and Proposition 3.1, the first term is a $O((nh)^{-1/2})$. The second order uniform approximation
$$f^{-1}_{X,n,h}(x)\mu_{n,h}(x) - m(x) = h^2 b_\mu(x) + O(h^4) \tag{3.14}$$
implies that the second term in the sum is a $O(h^4) = O((nh)^{-1})$, as a consequence of Condition 2.1. By Lemma 3.1 and (3.14), the third term is a $O((nh)^{1/2}h^2(nh)^{-1/2})$, it is therefore a $O((nh)^{-1/2})$. $\square$
For a regression function of class $C_s$, $s \geq 2$, the $L_2$-norm of the remainder term $\hat r_{n,h}$ is given by the next proposition.

Proposition 3.3. Under Conditions 2.1, 2.2 and 3.1, for every $s \geq 2$ the remainder term of (3.12) satisfies the uniform bound
$$\sup_{I_{X,h}}\|\hat r_{n,h}\|_2 = O((nh)^{-1/2}).$$

Proof. For functions $f_X$ and $\mu$ in $C_s$, the risk of $\hat r_{n,h}$ is modified by the bias terms of the previous expansion (3.13). The second term in the approximation (3.14) is replaced by $f^{-1}_{X,n,h}(x)\mu_{n,h}(x) - m(x) = h^s b_\mu(x) + O(h^{s+1})$, and Conditions 2.2 and 3.2 imply $h^{2s} = O((nh)^{-1})$. By Lemma 3.2, $\sup_x E|\hat r_{n,h}(x)|^2$ is bounded by
$$O(nh)\big[\{O(h^{2s} + (nh)^{-1})\}O((nh)^{-1}) + O(h^{2(2+s)}) + O(h^{2s}(nh)^{-1})\big]$$
which is a $O(h^{2s}) + O(h^4) + O(h^{2s}) = O(h^4)$. $\square$
Propositions 2.2 and 3.1, Equation (3.2), and Propositions 3.2 and 3.3 determine an upper bound for the norm $\|\hat m_{n,h} - m_{n,h}\|_p$ of the estimator of $m$
\begin{align*}
\|\hat m_{n,h} - m_{n,h}\|_p &= \|f_X^{-1}\{(\hat\mu_{n,h} - \mu_{n,h}) - m(\hat f_{X,n,h} - f_{X,n,h})\}\|_p + O((nh)^{-1/2}\|\hat r_{n,h}\|_p)\\
&\leq 2^{p-1}\Big[\sup_{I_X} f_X^{-1}\{\|\hat\mu_{n,h} - \mu_{n,h}\|_p + \sup_{I_X}|m|\,\|\hat f_{X,n,h} - f_{X,n,h}\|_p\}\Big] + O((nh)^{-1/2}\|\hat r_{n,h}\|_p),
\end{align*}
it is therefore a $O((nh)^{-1/2})$. The expression of the $L_p$-norm $\|\hat m_{n,h} - m_{n,h}\|_p$ is obtained by similar expansions and approximations as in the proof of Proposition 3.1.
Under Conditions 2.1, 2.2 and 3.1, for a function $\mu$ in $C_s$ and a density $f_X$ in $C_r$, the bias of $\hat m_{n,h}$ is
$$b_{m,n,h}(x) = f_X^{-1}(x)\Big\{\frac{h^s}{s!}b_\mu(x) - m(x)\frac{h^r}{r!}b_f(x)\Big\} + o(h^{s\wedge r})$$
and its variance does not depend on $r$ and $s$.
The derivability conditions $f_X$ and $\mu \in C_s$ of 3.1 can be replaced by the condition: $f_X$ and $\mu$ belong to a Hölder class $H_{\alpha,M}$.

Proposition 3.4. Assume $f_X$ and $\mu$ are bounded and belong to $H_{\alpha,M}$; then the bias of $\hat m_{n,h}(x)$ is bounded by $M m_{[\alpha]K}h^\alpha([\alpha]!)^{-1}f_X^{-1}(x)\{1 + |m(x)|\} + o(h^\alpha)$, by equation (3.2). The optimal bandwidth is $O(n^{-1/(2\alpha+1)})$ and the MISE at the optimal bandwidth is $O(n^{-2\alpha/(2\alpha+1)})$.

3.3 Optimal bandwidths

The asymptotic mean squared error of $\hat m_{n,h}(x)$, for $p = 2$, is
$$(nh)^{-1}\sigma_m^2(x) + h^4 b_m^2(x) = (nh)^{-1}\kappa_2 f_X^{-2}(x)\{w_2(x) - m^2(x)f(x)\} + \frac{h^4 m^2_{2K}}{4}f_X^{-2}(x)\{\mu^{(2)}(x) - m(x)f_X^{(2)}(x)\}^2$$
and its minimum is reached at the optimal bandwidth
$$h_{AMSE}(x) = \Big(\frac{\kappa_2 n^{-1}\{w_2(x) - m^2(x)f(x)\}}{m^2_{2K}\{\mu^{(2)}(x) - m(x)f_X^{(2)}(x)\}^2}\Big)^{1/5}$$
where $AMSE(x) = O(n^{-4/5})$. The global mean squared error criterion is the integrated error and it is approximated by
$$AMISE = (nh)^{-1}\kappa_2\int f_X^{-2}(x)\{w_2(x) - m^2(x)f(x)\}\,dx + \frac{h^4 m^2_{2K}}{4}\int f_X^{-2}(x)\{\mu^{(2)}(x) - m(x)f_X^{(2)}(x)\}^2\,dx$$
and the optimal global bandwidth is
$$h_{n,AMISE} = \Big(\frac{\kappa_2\,n^{-1}\int f_X^{-1}(x)Var\{Y\,|\,X = x\}\,dx}{m^2_{2K}\int f_X^{-2}(x)\{\mu^{(2)}(x) - m(x)f_X^{(2)}(x)\}^2\,dx}\Big)^{1/5}.$$
For every $s \geq 2$, the asymptotic quadratic risk of the estimator for a regression curve of class $C_s$ is
\begin{align*}
AMSE(x) &= (nh)^{-1}\sigma_m^2(x) + h^{2s}b_{m,s}^2(x)\\
&= (nh)^{-1}\kappa_2 f_X^{-2}(x)\{w_2(x) - m^2(x)f(x)\} + \frac{h^{2s}}{(s!)^2}m^2_{sK}f_X^{-2}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2,
\end{align*}
its minimum is reached at the optimal bandwidth
$$h_{AMSE}(x) = \Big\{\frac{(s!)^2\kappa_2 n^{-1}\{w_2(x) - m^2(x)f(x)\}}{2s\,m^2_{sK}\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2}\Big\}^{1/(2s+1)}$$
where $AMSE(x) = O(n^{-2s/(2s+1)})$. The global mean squared error criterion is the integrated error and it is approximated by
$$AMISE(h, s) = (nh)^{-1}\kappa_2\int f_X^{-1}(x)Var\{Y\,|\,X = x\}\,dx + \frac{h^{2s}m^2_{sK}}{(s!)^2}\int f_X^{-2}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2\,dx$$
and the optimal global bandwidth is
$$h_{n,AMISE}(s) = \Big\{\frac{(s!)^2\kappa_2\,n^{-1}\int f_X^{-1}(x)Var\{Y\,|\,X = x\}\,dx}{2s\,m^2_{sK}\int f_X^{-2}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2\,dx}\Big\}^{1/(2s+1)},$$
and again $AMISE(h_n(s), s) = O(n^{-2s/(2s+1)})$.


In order to estimate the constants of the optimal bandwidths, a non-
parametric estimator of the conditional variance of Y are defined as
Pn
Yi2 Kh (x − Xi )
Vb arn,h (Y |X = x) = Pi=1
n b 2n,h (x).
−m
i=1 K h 2 (x − X i )
More generally, the conditional moment of order $p$, $m_p(x) = E(Y^p|X = x)$, is estimated by
$$\hat m_{p,n,h}(x) = \frac{\sum_{i=1}^n Y_i^p K_h(x - X_i)}{\sum_{i=1}^n K_h(x - X_i)}$$
with a bandwidth $h = h_n$ such that $h_n$ tends to zero and $nh_n^2$ tends to infinity as $n$ tends to infinity. For every $p \geq 2$, the estimator $\hat m_{p,n,h}$ is also written $\hat f^{-1}_{n,h}\hat\mu_{p,n,h}$; it is a.s. uniformly consistent and approximations similar to those of Propositions 3.1 and 3.2 for the regression curve hold for $m_p(x)$
\begin{align*}
(nh)^{1/2}(\hat m_{p,n,h} - m_{p,n,h}) &= (nh^5)^{1/2}b_{p,m} + (nh)^{1/2}f_X^{-1}\{(\hat\mu_{p,n,h} - \mu_{p,n,h}) - m_p(\hat f_{X,n,h} - f_{X,n,h})\} + \hat r_{p,n,h},\\
\sup_{x\in I_{X,h}}\|\hat r_{p,n,h}\|_2 &= O((nh)^{-1/2})
\end{align*}

and for its bias
\begin{align*}
b_{\mu_p,n,h} &= \frac{m_{2K}h^2}{2}\int y^p\,\frac{\partial^2 f_{XY}(\cdot, y)}{\partial x^2}\,dy + o(h^2),\\
b_{m_p,n,h} &= m_{p,n,h} - m_p = \frac{m_{2K}h^2}{2}f_X^{-1}\{b_{\mu_p,n,h} - m_p b_f\} + o(h^2).
\end{align*}
The covariance between $\hat\mu_{p,n,h}(x)$ and $\hat f_{X,n,h}(x)$ is $(nh)^{-1}m_p(x)f_X(x)\kappa_2$ and the variances of the estimators of $\mu_p(x)$ and $m_p(x)$ are
\begin{align*}
v_{\mu_p,n,h}(x) &= (nh)^{-1}\{w_p(x)\kappa_2 + o(1)\},\\
v_{m_p,n,h}(x) &= (nh)^{-1}\big\{\kappa_2 f_X^{-2}(x)\{\sigma^2_{\mu_p}(x) - m_p^2(x)f(x)\} + o(1)\big\}.
\end{align*}
The estimators of the derivatives of the regression function $m$ are
\begin{align*}
\hat m^{(1)}_{n,h}(x) &= \frac{\sum_{i=1}^n Y_i K_h^{(1)}(x - X_i)}{\sum_{i=1}^n K_h(x - X_i)} - \frac{\{\sum_{i=1}^n Y_i K_h(x - X_i)\}\{\sum_{i=1}^n K_h^{(1)}(x - X_i)\}}{\{\sum_{i=1}^n K_h(x - X_i)\}^2}\\
&= \hat f^{-1}_{X,n,h}(x)\{\hat\mu^{(1)}_{n,h}(x) - \hat m_{n,h}(x)\hat f^{(1)}_{X,n,h}(x)\} \tag{3.15}
\end{align*}
and all consecutive derivatives of this expression. The first derivatives
$$\hat\mu^{(1)}_{n,h}(x) = n^{-1}\sum_{i=1}^n Y_i K_h^{(1)}(x - X_i)$$
and $\hat f^{(1)}_{X,n,h}(x) = n^{-1}\sum_{i=1}^n K_h^{(1)}(x - X_i)$ converge uniformly on $I_{X,h}$ to their expectations $\mu^{(1)}_{n,h}(x) = h^{-1}EYK_h^{(1)}(x - X)$ and $f^{(1)}_{X,n,h}(x)$, respectively, where
\begin{align*}
f^{(1)}_{X,n,h}(x) &= f_X^{(1)}(x) + \frac{h^2}{2}m_{2K}f_X^{(3)}(x) + o(h^2),\\
\mu^{(1)}_{n,h}(x) &= h^{-1}\int yK_h'(u - x)f_{X,Y}(u, y)\,du\,dy\\
&= (mf_X)^{(1)}(x) + \frac{h^2}{2}m_{2K}(mf_X)^{(3)}(x) + o(h^2),
\end{align*}
then $\hat m^{(1)}_{n,h}$ converges uniformly to $f_X^{-1}\{(mf_X)^{(1)} - mf_X^{(1)}\} = m^{(1)}$, as $h$ tends to zero. The bias of $\hat m^{(1)}_{n,h}(x)$ is
$$\frac{h^2}{2}m_{2K}f_X^{-1}(x)\{(mf_X)^{(3)} - mf_X^{(3)}\}(x).$$
Its variance is obtained by an application of Proposition 3.1 to equation (3.15), its convergence rate is $(nh^3)^{-1}$ (see Appendix A) and the optimal global bandwidth for estimating $m^{(1)}$ follows. For the second derivative
\begin{align*}
\hat m^{(2)}_{n,h}(x) &= \frac{\sum_{i=1}^n Y_i K_h^{(2)}(x - X_i)}{\sum_{i=1}^n K_h(x - X_i)} - 2\frac{\{\sum_{i=1}^n Y_i K_h^{(1)}(x - X_i)\}\{\sum_{i=1}^n K_h^{(1)}(x - X_i)\}}{\{\sum_{i=1}^n K_h(x - X_i)\}^2}\\
&\quad - \frac{\{\sum_{i=1}^n Y_i K_h(x - X_i)\}\{\sum_{i=1}^n K_h^{(2)}(x - X_i)\}}{\{\sum_{i=1}^n K_h(x - X_i)\}^2} + 2\frac{\{\sum_{i=1}^n Y_i K_h(x - X_i)\}\{\sum_{i=1}^n K_h^{(1)}(x - X_i)\}^2}{\{\sum_{i=1}^n K_h(x - X_i)\}^3},
\end{align*}
the estimators $\hat f^{(2)}_{n,h}$ and $\hat\mu^{(2)}_{n,h}(x) = n^{-1}\sum_{i=1}^n Y_i K_h^{(2)}(x - X_i)$ converge uniformly to $f^{(2)}$ and $\mu^{(2)}$, respectively, with respective biases $\frac{h^2}{2}m_{2K}f_X^{(4)}(x) + o(h^2)$ and $\frac{h^2}{2}m_{2K}\mu^{(4)}(x) + o(h^2)$. The result extends to a general order of derivative $k \geq 1$.
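As an editorial illustration of (3.15), the first derivative can be estimated by differentiating the ratio $\hat\mu_{n,h}/\hat f_{X,n,h}$; the sketch below uses the biweight kernel (an arbitrary choice) and its exact derivative:

    import numpy as np

    def K(u):
        # biweight kernel on [-1, 1]
        return (15.0 / 16.0) * (1.0 - u**2)**2 * (np.abs(u) <= 1.0)

    def K1(u):
        # derivative of the biweight kernel
        return -(15.0 / 4.0) * u * (1.0 - u**2) * (np.abs(u) <= 1.0)

    def m1_hat(x, X, Y, h):
        # equation (3.15): (mu_hat^(1) - m_hat f_hat^(1)) / f_hat, with
        # d/dx K_h(x - X_i) = h^{-2} K'((x - X_i) / h)
        u = (x - X) / h
        f = np.mean(K(u)) / h
        f1 = np.mean(K1(u)) / h**2
        mu = np.mean(K(u) * Y) / h
        mu1 = np.mean(K1(u) * Y) / h**2
        return (mu1 - (mu / f) * f1) / f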

Proposition 3.5. Under Conditions 2.2 and 3.1 with $nh^{2k+2s+1} = O(1)$, for $k \geq 1$, and functions $m$ and $f_X$ in class $C_s(I_X)$, the estimator $\hat m^{(k)}_{n,h}$ is a uniformly consistent estimator of the $k$-th order derivative of the regression function, its bias is a $O(h^s)$, its variance a $O((nh^{2k+1})^{-1})$, and the optimal bandwidth is a $O(n^{-1/(2k+2s+1)})$.

The nonparametric estimator (3.1) is often used in nonparametric time series models with correlated errors. The bias is unchanged but the variance of the estimator depends on the covariances between the observation errors, $E(\varepsilon_i\varepsilon_{i+a}) = \beta_a$, for a weakly stationary process $(Y_i)_i$ corresponding to correlated measurements of $Y = m(X) + \varepsilon$. Now the variance $\sigma_f^2$ is replaced by $S = \sigma_f^2 + 2\sum_{a\geq 1}\beta_a$, assumed to be finite (Billingsley, 1968). A consistent estimator of $S$ was defined by $\hat S_m = \sum_{i=-m}^m\hat\beta_i$, where the correlation is estimated by the mean correlation error, with a mean over the lag between the terms of the product and a sum over observations, and $n^{-1}m^2$ tends to zero (Herrmann, Gasser and Kneip, 1992).
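For instance (an editorial sketch; the residual construction and the truncation lag m are illustrative choices, not those of the cited paper):

    import numpy as np

    def S_hat(residuals, m):
        # hat S_m = sum_{a = -m}^{m} hat beta_a, with hat beta_a the
        # empirical autocovariance of the residuals at lag |a|
        e = residuals - residuals.mean()
        n = e.size
        s = np.dot(e, e) / n              # lag 0
        for a in range(1, m + 1):
            s += 2.0 * np.dot(e[:-a], e[a:]) / n
        return s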
3.4 Weak convergence of the estimator

The weak convergence of the process $U_{n,h} = (nh)^{1/2}\{\hat m_{n,h} - m\}1_{\{I_{X,h}\}}$ relies on bounds for the moments of its increments, which are first proved, as in Lemma 2.2 for the increments of the centered process defined by the kernel estimator, with a kernel having the compact support $[-1, 1]$. For a function or a process $\varphi$ defined on $I_{X,h}$, let $\Delta\varphi(x, y) = \varphi(x) - \varphi(y)$.

Lemma 3.3. Under Conditions 3.1, there exist positive constants $C_1$ and $C_2$ such that for every $x$ and $y$ in $I_{X,h}$ satisfying $|x - y| \leq 2h$
\begin{align*}
E|\Delta(\hat\mu_{n,h} - \mu_{n,h})(x, y)|^2 &\leq C_1(nh^3)^{-1}|x - y|^2,\\
E|\Delta(\hat m_{n,h} - m_{n,h})(x, y)|^2 &\leq C_2(nh^3)^{-1}|x - y|^2;
\end{align*}
if $|x - y| > 2h$, they are $O((nh)^{-1})$ and the estimators at $x$ and $y$ are independent.

Proof. Let $x$ and $y$ in $I_{X,h}$ such that $|x - y| \leq 2h$; $E|\hat\mu_{n,h}(x) - \hat\mu_{n,h}(y)|^2$ develops as the sum $n^{-1}\int w_2(u)\{K_{h_n}(x - u) - K_{h_n}(y - u)\}^2 f(u)\,du + (1 - n^{-1})\{\mu_{n,h_n}(x) - \mu_{n,h_n}(y)\}^2$. For an approximation of the integral, the Mean Value Theorem implies $K_{h_n}(x - u) - K_{h_n}(y - u) = (x - y)\varphi_n^{(1)}(z - u)$ where $z$ is between $x$ and $y$, and
\begin{align*}
\int\{K_{h_n}(x - u) - K_{h_n}(y - u)\}^2 w_2(u)f(u)\,du &= (x - y)^2\int\varphi_n^{(1)2}(z - u)w_2(u)f(u)\,du\\
&= (x - y)^2 h_n^{-3}\Big\{w_2(x)f(x)\int K^{(1)2} + o(h_n)\Big\}.
\end{align*}
For $|x - y| \leq 2h_n$, the order of the second moment $E|\hat\mu_{n,h}(x) - \hat\mu_{n,h}(y)|^2$ is therefore a $O((x - y)^2(nh_n^3)^{-1})$, and otherwise it is the sum of the variances of $\hat\mu_{n,h}(x)$ and $\hat\mu_{n,h}(y)$. This bound and Lemma 2.2 imply the same orders for the estimator of the regression function $m$. $\square$

Theorem 3.1. For $h > 0$, the process $U_{n,h} = (nh)^{1/2}\{\hat m_{n,h} - m\}1_{\{I_{X,h}\}}$ converges in distribution to $\sigma_m W_1 + \gamma^{1/2}b_m$, where $W_1$ is a centered Gaussian process on $I_X$ with variance 1 and covariances zero.

Proof. For any $x\in I_{X,h}$, from the approximation (3.2) of Proposition 3.1 and the weak convergences of $\hat\mu_{n,h} - \mu_{n,h}$ and $\hat f_{X,n,h} - f_{X,n,h}$, the variable $U_{n,h}(x)$ develops as $(nh)^{1/2}\{\hat m_{n,h}(x) - m_{n,h}(x)\} + (nh^5)^{1/2}b_m(x) + o((nh^5)^{1/2})$, and it converges to a non-centered distribution $\{W + \gamma^{1/2}b_m\}(x)$ where $W(x)$ is the Gaussian variable with mean zero and variance $\sigma_m^2(x)$. In the same way, the finite dimensional distributions of the process $U_{n,h}$ converge weakly to those of $\{W + \gamma^{1/2}b_m\}$, where $W$ is a Gaussian process with the same distribution as $W(x)$ at $x$. The covariance matrix $\{\sigma^2(x_k, x_l)\}_{k,l=1,\dots,m}$ between the components $W(x_k)$ and $W(x_l)$ of the limiting process is the limit of
\begin{align*}
Cov\big(U_{n,h}(x_k), U_{n,h}(x_l)\big) &= \frac{nh}{f_X(x_k)f_X(x_l)}\big(Cov\{\hat\mu_{n,h}(x_k), \hat\mu_{n,h}(x_l)\}\\
&\quad - m(x_k)Cov\{\hat f_{X,n,h}(x_k), \hat\mu_{n,h}(x_l)\} - m(x_l)Cov\{\hat\mu_{n,h}(x_k), \hat f_{X,n,h}(x_l)\}\\
&\quad + m(x_k)m(x_l)Cov\{\hat f_{X,n,h}(x_k), \hat f_{X,n,h}(x_l)\} + o(1)\big),
\end{align*}
where the $o(1)$ is deduced from Propositions 3.1, 3.2 and 3.3. For all integers $k$ and $l$, let $\alpha_h = |x_l - x_k|/(2h)$ and let $v = \{(x_l + x_k)/2 - s\}/h$ be in $[0, 1]$, hence $h^{-1}(x_k - s) = v - \alpha_h$ and $h^{-1}(x_l - s) = v + \alpha_h$. By a Taylor expansion in a neighborhood of $(x_l + x_k)/2$, the integral of the first covariance term develops as
$$Cov\{\hat\mu_{n,h}(x_k), \hat\mu_{n,h}(x_l)\} = n^{-1}h^{-1}w_2\Big(\frac{x_k + x_l}{2}\Big)f_X\Big(\frac{x_k + x_l}{2}\Big)\int K(v - \alpha_h)K(v + \alpha_h)\,dv + o(n^{-1}h^{-1})$$
if $0\leq\alpha_h<1$, and zero otherwise, with the notation (3.10). Similar expansions are satisfied for the other terms of the covariance. Using the approximations $w_2(\{x_k + x_l\}/2) = w_2(x_k) + o(1) = w_2(x_l) + o(1)$ and $f_X(\{x_k + x_l\}/2) = f_X(x_k) + o(1) = f_X(x_l) + o(1)$ for $|x_k - x_l| \leq 2h$, the covariance of $U_{n,h}(x_k)$ and $U_{n,h}(x_l)$ is approximated by
$$\frac{Var(Y|X = x_k) + Var(Y|X = x_l)}{f_X(x_k) + f_X(x_l)}\,1_{\{0\leq\alpha_h<1\}}\int K(v - \alpha_h)K(v + \alpha_h)\,dv.$$
Due to the compactness of the support of $K$, the covariance is zero if $\alpha_h \geq 1$. For $x_k \neq x_l$, $\alpha_h$ tends to infinity as $h$ tends to zero and $1_{\{0\leq\alpha_h<1\}}$ tends to zero as $n$ tends to infinity, therefore the covariance of $U_{n,h}(x_k)$ and $U_{n,h}(x_l)$ is equal to $\delta_{k,l} + o(1)$, where $Var\big(U_{n,h}(x_k)\big)$ is defined in Proposition 3.1.
The tightness of the sequence $\{U_{n,h}\}$ on $I_{X,h}$ will follow from (i) the tightness of $\{U_{n,h}(a)\}$ and (ii) a bound of the increments $E|U_{n,h}(x_2) - U_{n,h}(x_1)|^2$ for $|x_2 - x_1| < 2h$. For condition (i), let $\eta > 0$ and $c > \gamma^{1/2}|b_m(a)| + \{2\eta^{-1}\sigma^2(a)\}^{1/2}$; then, by the Bienaymé-Chebyshev inequality,
\begin{align*}
\Pr\{|U_{n,h}(a)| > c\} &\leq \Pr\big\{(nh)^{1/2}|(\hat m_{n,h} - m_{n,h})(a)| + (nh)^{1/2}|b_{n,h}(a)| > c\big\}\\
&\leq \frac{Var\{(nh)^{1/2}(\hat m_{n,h} - m_{n,h})(a)\}}{\{c - (nh)^{1/2}|b_{n,h}(a)|\}^2}
\end{align*}
and for $n$ sufficiently large
$$\Pr\{|U_{n,h}(a)| > c\} \leq \frac{\sigma^2(a)}{\{c - \gamma^{1/2}|b_m(a)|\}^2} + o(1) < \eta.$$
The process $U_{n,h}$ is written $W_{n,h} + (nh)^{1/2}b_{n,h}$ where
$$\{b_{n,h}(x) - b_{n,h}(y)\}^2 \leq kh^{2s}(x - y)^{2s} = O((nh)^{-1})(x - y)^{2s}$$
and $W_{n,h} = (nh)^{1/2}(\hat m_{n,h} - m_{n,h})$. From Lemma 3.3, there exists a constant $C_W$ such that $|x - y| \leq 2h$ entails $E\{W_{n,h}(x) - W_{n,h}(y)\}^2 \leq C_W h^{-2}|x - y|^2$, which implies the tightness of the process $U_{n,h}$ and its weak convergence to a continuous Gaussian process defined on $I_X$. $\square$
Note that the tightness of the process implies the existence of a constant $c_\eta > 0$ such that
$$\Pr\Big\{\sup_{I_{X,h}}|\sigma_m^{-1}(U_{n,h} - \gamma^{1/2}b_m) - W_1| > c_\eta\Big\} \to 0.$$
The limiting distribution of the process $U_{n,h}$ does not depend on the bandwidth $h$, so one can state the following corollary.

Corollary 3.1. $\sup_{h>0:nh^{2s+1}\to\gamma}\sup_{I_{X,h}}\sigma_m^{-1}|U_{n,h} - \gamma^{1/2}b_m|$ converges in distribution to $\sup_{I_X}|W_1|$.

A uniform confidence interval for the regression curve $m$ is deduced as for the density.
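For instance, with an undersmoothing bandwidth (so that the bias term $\gamma^{1/2}b_m$ is negligible), an asymptotic pointwise interval follows from Theorem 3.1; the sketch below is an editorial illustration, with $\kappa_2 = 3/5$ for the Epanechnikov kernel:

    import numpy as np

    def K(u):
        # Epanechnikov kernel, kappa_2 = int K^2 = 3/5
        return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

    def pointwise_band(x, X, Y, h, z=1.96):
        # m_hat(x) +/- z sigma_m(x) (nh)^{-1/2}, with
        # sigma_m^2 = kappa_2 Var(Y | X = x) / f_X(x)
        n = X.size
        w = K((x - X) / h)
        f_hat = w.mean() / h
        m_hat = np.sum(w * Y) / np.sum(w)
        v_hat = np.sum(w * Y**2) / np.sum(w) - m_hat**2
        sigma2 = 0.6 * v_hat / f_hat
        half = z * np.sqrt(sigma2 / (n * h))
        return m_hat - half, m_hat + half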
Let $X$ be a variable defined in a subset $I_X$ of $\mathbb{R}^d$; the regression function $m$ is estimated using a multivariate kernel $K$ defined on $[-1, 1]^d$ and $K_h(x) = h^{-d}K(h^{-1}x)$, for $x = (x_1, \dots, x_d)$ in $I_{X,h}$. The bias is unchanged and the rates of the moments $p \geq 2$ are modified by the dimension $d$
$$Var\{\hat m_{n,h}(x)\} = (nh^d)^{-1}\kappa_2 Var\{Y|X = x\}f_X^{-1}(x) + o((nh^d)^{-1}),$$
and $\|\hat f_{n,h}(x) - f_{n,h}(x)\|_p = O((nh^d)^{-1/p})$. The local and global errors $MISE_n(h)$ are $O(h^{2s}) + O((nh^d)^{-1})$; they are minimal at the optimal bandwidths of order $O(n^{-1/(2s+d)})$ where the MISE reaches the minimal order $O(n^{-2s/(2s+d)})$. The weak convergence of Theorem 3.1 and its corollary still hold with the rate $(nh^d)^{1/2}$.

3.5 Estimation of a regression curve by local polynomials

The regression function $m$ is approximated by a Taylor expansion of order $p$, for every $s$ in a neighborhood $V_{x,h}$ of a fixed $x$, with radius $h$,
$$m(s) = m(x) + (s - x)m'(x) + \dots + \frac{(s - x)^p}{p!}m^{(p)}(x) + o((s - x)^p). \tag{3.16}$$
This expansion is a local polynomial regression where the derivatives at $x$ are considered as parameters. Estimating the derivatives by the derivatives of the estimator $\hat m_{n,h}$ yields an estimator having a variance sum of terms of different orders, whose main term is the variance of $\hat m_{n,h}$.
Let $(H_{k,h})_k$ be a square integrable orthonormal basis of real functions with respect to the distribution function of $X$, with support $V_{x,h}$ for $h$ converging to zero. Let $\delta_{k,l}$ be the Dirac indicator of the equality of $k$ and $l$, $k, l \geq 0$. Equation (3.16) is also written
$$m(s) = \sum_{k=0}^p\theta_k(x)H_k(s - x) + o((s - x)^p) = m_p(s) + o((s - x)^p)$$
for $s$ in $V_{x,h}$, and the properties of the functional basis entail
$$E\{H_k(X - x)H_l(X - x)\} = \int H_k(s - x)H_l(s - x)\,dF(s) = \delta_{k,l}, \quad k, l \geq 0.$$
In the regression model $E(Y|X) = m(X)$,
\begin{align*}
\theta_k(x) &= E\{Y H_k(X - x)\} = E\{H_k(X - x)m(X)\}, \quad k \geq 1,\\
m(x) &= E\{Y H_0(X - x)\} = E\{H_0(X - x)m(X)\}.
\end{align*}
For fixed $x$, $\theta_k(x)$ is considered as a constant parameter. This expansion is an extension of the kernel smoothing if the functional basis has regularity properties. The nonparametric regression function is approximated by an expansion on the first $p$ elements of the basis and its projections satisfy $\theta_k(x) = \int m(s)H_k(x - s)\,dF(s)$. The estimation of the parameters is performed by the projection of the observations of $Y$ onto the first $p$ elements of the orthonormal basis. Let $(X_i, Y_i)_{i=1,\dots,n}$ be a sample for the regression variables $(X, Y)$, so that $Y_i = m(X_i) + \varepsilon_i$ where $\varepsilon_i$ is an observation error having a finite variance $\sigma^2 = E\{Y - m(X)\}^2$ and such that $E(\varepsilon|X) = 0$. An estimator of the parameter is defined as the empirical conditional mean of the projection of $Y$ onto the space generated by the basis. For $k \geq 1$,
$$\hat\theta_{k,n}(x) = n^{-1}\sum_{i=1}^n Y_i H_k(X_i - x)$$
is therefore a consistent estimator of $\theta_k$. Its conditional variance is
$$n^{-1}\big[E\{E(Y^2|X)H_k^2(X - x)\} - \theta_k^2\big]\{1 + o(1)\}.$$
This approach may be compared to the local polynomials defined by minimizing the local smoothed empirical mean squared error
$$ASE(x) = \sum_{i=1}^n\{Y_i - m_p(X_i, \theta)\}^2 K_h(X_i - x).$$
This provides an estimator of $\theta$ with components satisfying
$$\sum_{i=1}^n\{Y_i - m_p(X_i, \theta)\}H_k(X_i - x)K_h(X_i - x) = 0.$$
They are solutions of a system of linear equations and $\theta_{n,k}$ is approximated by
$$\frac{\sum_{i=1}^n Y_i H_k(X_i - x)K_h(X_i - x)}{\sum_{i=1}^n K_h(X_i - x)}$$
if the orthogonality of the basis entails that $EH_k(X - x)H_l(X - x)K_h(X - x)$ converges to zero as $h$ tends to zero, for every $k \neq l \leq p$. This estimator is consistent and its behaviour is further studied by the same method as the estimator of the nonparametric regression.
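In practice the system is solved by weighted least squares; the following editorial sketch uses the monomial basis rather than an orthonormal one, assuming enough observations fall inside the kernel window:

    import numpy as np

    def local_polynomial(x, X, Y, h, p=1):
        # minimize sum_i {Y_i - sum_k theta_k (X_i - x)^k}^2 K_h(X_i - x);
        # theta_0 estimates m(x) and theta_j estimates m^(j)(x) / j!
        u = (X - x) / h
        w = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)   # Epanechnikov weights
        D = np.vander(X - x, N=p + 1, increasing=True)  # design matrix
        A = D.T @ (D * w[:, None])
        b = D.T @ (w * Y)
        return np.linalg.solve(A, b)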
A multidimensional regression function $m(X_1, \dots, X_d)$ can be expanded in sums of univariate regression functions $E(Y\,|\,X_k = x)$ and their interactions, like a nonparametric analysis of variance, if the regression variables $(X_1, \dots, X_d)$ generate orthogonal spaces. The orthogonality is a necessary condition for the estimation of the components of this expansion since
\begin{align*}
E\{Y K_h(x_k - X_k)\} &= \int E(Y\,|\,X = x)\,F_X(dx_1, \dots, x_{k-1}, x_{k+1}, \dots, x_d) + o(1)\\
&= m(x_k)f_k(x_k) + o(1),\\
E\{Y K_h(x_k - X_k)K_h(x_l - X_l)\} &= C_K m(x_k, x_l)f_{X_k,X_l}(x_k, x_l) + o(1),
\end{align*}
where $C_K = \int K(u)K(v)\,du\,dv$, and $E\{Y K_h(x_k - X_k)K_h(x_l - X_l)\}f^{-1}_{X_k,X_l}(x_k, x_l)$ can be factorized or expanded as a sum of regression functions only if $X_k$ and $X_l$ belong to orthogonal spaces. The orthogonalisation of the space generated by a vector variable $X$ can be performed by a preliminary principal component analysis providing orthogonal linear combinations of the initial variables.
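Concretely (an editorial sketch), the principal components are obtained from the singular value decomposition of the centered design matrix:

    import numpy as np

    def pca_orthogonalize(X):
        # X: (n, d) matrix of regressors; returns empirically
        # uncorrelated linear combinations of the columns
        Xc = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt.T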

3.6 Estimation in regression models with functional variance

Consider the nonparametric regression model with an observation error function of the regression variable $X$, $Y = m(X) + \sigma(X)\varepsilon$ defined by (1.7), with $E(\varepsilon|X) = 0$ and $Var(\varepsilon|X) = 1$. The variance $\sigma^2(x) = E[\{Y - m(X)\}^2|X = x]$ is assumed to be continuous and it is estimated by a localisation of the empirical error in a neighborhood of $x$
$$\tilde\sigma^2_{n,h,\delta}(x) = \frac{\sum_{i=1}^n\{Y_i - \hat m_{n,h}(X_i)\}^2 1_{\{X_i\in V_\delta(x)\}}}{\sum_{i=1}^n 1_{\{X_i\in V_\delta(x)\}}}$$
or by smoothing it with a kernel
$$\hat\sigma^2_{n,h,\delta}(x) = \frac{\sum_{i=1}^n\{Y_i - \hat m_{n,h}(X_i)\}^2 K_\delta(x - X_i)}{\sum_{i=1}^n K_\delta(x - X_i)}. \tag{3.17}$$
The estimator is denoted $\hat\sigma^2_{n,h,\delta}(x) = \hat f^{-1}_{X,n,\delta}(x)\hat S_{n,h,\delta}(x)$, with
\begin{align*}
\hat S_{n,h,\delta}(x) &= n^{-1}\sum_{i=1}^n\{Y_i - \hat m_{n,h}(X_i)\}^2 K_\delta(x - X_i)\\
&= \int\{y - \hat m_{n,h}(s)\}^2 K_\delta(x - s)\,d\hat F_{X,Y,n}(s, y).
\end{align*}

The mean of $\hat S_{n,h,\delta}(x)$ is denoted $S_{n,h,\delta}(x)$. By the uniform consistency of $\hat m_{n,h}$, $\hat S_{n,h,\delta}$ converges uniformly to $S$ as $n$ tends to infinity, with $h$ and $\delta$ tending to zero. At $X_j$, it is written $\hat S_{n,h,\delta}(X_j) = n^{-1}\sum_{i\neq j}\{Y_i - \hat m_{n,h}(X_i)\}^2 K_\delta(X_j - X_i) + o((nh)^{-1})$. The rate of convergence of $\delta_n$ to zero is governed by the degree of derivability of the variance function $\sigma^2$.
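The two-bandwidth construction (3.17) can be coded directly; the sketch below is an editorial illustration (the bandwidths h and delta are illustrative, and the full sample is reused instead of the leave-one-out version mentioned above):

    import numpy as np

    def K(u):
        # Epanechnikov kernel
        return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

    def sigma2_hat(x, X, Y, h, delta):
        # first stage: kernel regression residuals with bandwidth h
        def m_hat(t):
            w = K((t - X) / h)
            return np.sum(w * Y) / np.sum(w)
        resid2 = np.array([(Y[i] - m_hat(X[i]))**2 for i in range(X.size)])
        # second stage: smooth the squared residuals with bandwidth delta
        w = K((x - X) / delta)
        return np.sum(w * resid2) / np.sum(w)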

Condition 3.2. For a density $f_X$ in $C_r(I_X)$, a function $\mu$ in $C_s(I_X)$ and a variance $\sigma^2$ in $C_k(I_X)$, with $k, s, r \geq 2$, the bandwidth sequences $(\delta_n)_n$ and $(h_n)_n$ satisfy
$$\delta_n = O(n^{-1/(2k+1)}), \qquad h_n = O(n^{-1/\{2(s\wedge r)+1\}}),$$
as $n$ tends to infinity.

Proposition 3.6. Under Conditions 2.1, 2.2 and 3.1, for every function $\mu$ in $C_s$, density $f_X$ in $C_r$ and variance function $\sigma^2$ in $C_k$,
$$E\{Y - \hat m_{nh}(x)\}^2 = \sigma^2(x) + O(h^{2(s\wedge r)}) + O((nh)^{-1}),$$
the bias of the estimator $\hat S_{n,h,\delta}(x)$ of $\sigma^2(x)$ defined by (3.17) is
\begin{align*}
\beta_{n,h,\delta}(x) &= b^2_{m,n,h}(x)f_X(x) + \sigma^2_{m,n,h}(x)f_X(x) + \frac{\delta^{2k}}{(k!)^2}(\sigma^2(x)f_X(x))^{(2)}\\
&\quad + o(\delta^{2k} + h^{2(s\wedge r)} + (nh)^{-1})
\end{align*}
and its variance is written $(n\delta)^{-1}\{v_{\sigma^2} + o(1)\}$ with $v_{\sigma^2}(x) = \kappa_2 Var\{(Y - m(x))^2|X = x\}$. The process $(n\delta)^{1/2}(\hat\sigma^2_{n,h,\delta} - \sigma^2 - \beta_{n,h,\delta})$ converges weakly to a Gaussian process with mean zero, variance $v_{\sigma^2}$ and covariances zero.
Proof. Using Proposition 2.2 and Lemma 3.2, the mean squared error for $\hat m_{nh}$ at $x$ is $E[\{Y - \hat m_{nh}(x)\}^2\,|\,X = x]$ and it is expanded as $\sigma^2(x) + b^2_{m,n,h}(x) + \sigma^2_{m,n,h}(x) + E[\{Y - m(x)\}\{m(x) - \hat m_{nh}(x)\}\,|\,X = x]$ where the last term is zero. For the variance of $\hat S_{n,h,\delta}(x)$, the fourth conditional moment $E[\{Y - \hat m_{nh}(x)\}^4\,|\,X = x]$ is the conditional expectation of $\{(Y - m(x)) + (m - m_{nh})(x) + (m_{nh} - \hat m_{nh})(x)\}^4$ and it is expanded in a sum of $\sigma_4(x) = E[\{Y - m(x)\}^4\,|\,X = x]$, a bias term $b^4_{m,n,h}(x) = O(h^{8(s\wedge r)})$, $E\{(m_{nh} - \hat m_{nh})(x)\}^4 = O((nh)^{-1})$ by Proposition 3.1, and products of squared terms, the main of which is $\sigma^2(x)\|\hat m_{nh} - m\|_2^2(x)$ of order $O((nh)^{-1}) + O(h^{4(s\wedge r)})$, the others being smaller. The variance of $\hat S_{n,h,\delta}(x)$ follows.
Moreover, for every $i \neq j \leq n$ and every function $\psi$ in $C_2$ integrable with respect to $F_X$, $E\psi(X_j)K_\delta^2(X_i - X_j) = \int\psi(x)K_\delta^2(x - x')\,dF_X(x)\,dF_X(x')$ equals $\kappa_2 E\psi(X) + o(\delta^2)$ and the main term of the variance does not depend on the bandwidth $\delta$. $\square$
The bandwidths $h_n$ and $\delta_n$ appear in the bias and the variance, therefore the mean squared error for the variance is minimum under Condition 3.2.
Note that the function $m$ which achieves the minimum of the empirical mean squared error for the model, $V_{n,h}(x) = n^{-1}\sum_{i=1}^n K_h(x - X_i)\{Y_i - m(x)\}^2$, is the estimator $\hat m_{n,h}$ (3.1), and $V_{n,h}(x)$ converges in probability to $\sigma^2(x)f_X(x)$. In a parametric regression model with a Gaussian error having a constant variance, $V_n(x) = n^{-1}\sum_{i=1}^n\{Y_i - m(x)\}^2$ is the sufficient statistic for the estimation of the parameters of $m$. In a Gaussian regression model with a functional variance $\sigma^2(x)$, each term of the sum defining the error is normalized by a different variance $\sigma(X_i)$ and the sufficient statistic for the estimation of the parameters of the function $m$ is the weighted mean squared error
$$V_{w,n}(\theta) = n^{-1}\sum_{i=1}^n\sigma^{-1}(X_i)\{Y_i - m_\theta(X_i)\}^2.$$
For a nonparametric regression function, an empirical local mean weighted squared error is defined as
$$V_{w,n,h}(x) = n^{-1}\sum_{i=1}^n w(X_i)\{Y_i - m(x)\}^2 K_h(x - X_i)$$
with $w(x) = \sigma^{-1}(x)$. A weighted estimator of the nonparametric regression curve $m$ is then defined as
$$\hat m_{w,n,h}(x) = \frac{\sum_{i=1}^n w(X_i)Y_i K_h(x - X_i)}{\sum_{i=1}^n w(X_i)K_h(x - X_i)}; \tag{3.18}$$
if the variance is known, it achieves the minimum of $V_{w,n,h}(x)$. With an unknown variance, minimizing the weighted squared error leads to the estimator built with the estimated weight $\hat w_n = \hat\sigma^{-1}_{n,h_n,\delta_n}$, using (3.17)
$$\hat m_{\hat w_n,n,h}(x) = \frac{\sum_{i=1}^n\hat w_n(X_i)Y_i K_h(x - X_i)}{\sum_{i=1}^n\hat w_n(X_i)K_h(x - X_i)}. \tag{3.19}$$
The uniform consistency of $\hat w_{n,h}$ implies that $\sup_{I_{n,h}}|\hat m_{\hat w_n,n,h} - m_w|$ tends to zero as $n$ tends to infinity.
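Putting the pieces together (an editorial sketch reusing sigma2_hat from the previous sketch; the floor on the estimated variance is an ad hoc numerical safeguard):

    import numpy as np
    # sigma2_hat as defined in the sketch after (3.17) above

    def m_weighted(x, X, Y, h, delta, floor=1e-3):
        # estimator (3.19) with plug-in weights w_hat = 1 / sigma_hat
        s2 = np.array([sigma2_hat(X[i], X, Y, h, delta)
                       for i in range(X.size)])
        w_hat = 1.0 / np.sqrt(np.maximum(s2, floor))
        u = (x - X) / h
        ker = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)
        return np.sum(w_hat * Y * ker) / np.sum(w_hat * ker)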
Assuming that $\sigma$ belongs to $C_2(I_X)$, the convergence results for $\hat m_{n,h}$ in Propositions 3.1 and 3.2 adapt to the estimator (3.18), with $\mu_w = w\mu$ instead of $\mu$ and $w(x)f_{X,Y}(x, y)$ instead of $f_{X,Y}(x, y)$. The approximation (3.2) is unchanged, hence the bias and the variance of the weighted estimator $\hat m_{w,n,h}$ are
\begin{align*}
b_{m,w,n,h}(x) &= \frac{h^s m_{sK}}{s!\,w(x)f_X(x)}\{(mwf_X)^{(s)}(x) - m(x)(wf_X)^{(s)}(x)\} + o(h^s),\\
v_{m,w,n,h}(x) &= v_{m,n,h}(x).
\end{align*}
In the approximations of Propositions 3.2 and 3.3, the order of convergence of $\sup_{x\in I_{X,h}}\|\hat r_{n,h}\|_2$ is not modified and the weak convergence of Theorem 3.1 is fulfilled for the process $(nh)^{1/2}\{\hat m_{w,n,h} - m\}1_{\{I_{X,h}\}}$, with the modified bias and variance.
With an estimated weight, the mean of the numerator $\hat\mu_{\hat w_n,n,h}(x)$ is $E\hat w_n(X)m(X)K_h(x - X)$ and it equals $E\int\hat w_n(y)m(y)K_h(x - y)f_X(y)\,dy$, since $\hat\sigma^2_{n,h_n,\delta_n}(X_i)$ is equivalent to the estimator of the variance (at $X_i$) calculated from the observations without $X_i$. With an empirical weight $\hat w_n(x) = \psi(\hat\sigma^2_{n,h_n,\delta_n}(x))$, the mean of the numerator of the estimator (3.19) is then
$$EN_n(x) = Ew(X)m(X)K_h(x - X) + E\{(\hat\sigma^2_{n,h_n,\delta_n} - \sigma^2)(X)\psi'(\sigma^2(X))m(X)K_h(x - X)\}\{1 + o(1)\}$$
and the bias of the numerator of (3.19) is modified by adding $m(x)f_X(x)\beta_{n,h,\delta}(x)\psi'(\sigma^2(x))$ to the bias of the expression with a fixed weight $w$. In the same way, the expectation of the denominator is
$$ED_n(x) = Ew(X)K_h(x - X) + E\{(\hat\sigma^2_{n,h_n,\delta_n} - \sigma^2)(X)\psi'(\sigma^2)(X)K_h(x - X)\}\{1 + o(1)\}$$
and it is approximated by $f_X(x)\{w(x) + \beta_{n,h,\delta}(x)\psi'(\sigma^2(x))\}$. Using the approximation (3.2) of Proposition 3.1, the first order approximation of the bias of (3.19) is identical to $b_{m,w,n,h}(x)$. The variances of each term are
\begin{align*}
Var\,N_n(x) &= \frac{\kappa_2}{nh}E\{\hat w_n^2(x)E(Y^2\,|\,X = x)f_X(x)\} + o((nh)^{-1}),\\
Var\,D_n(x) &= \frac{\kappa_2}{nh}E\{\hat w_n^2(x)f_X(x)\} + o((nh)^{-1}),\\
Var\,\hat m_{\hat w_n,n,h}(x) &= \frac{\kappa_2}{nh\,w^2(x)f_X(x)}Var(\hat w_n(x)Y\,|\,X = x) + o((nh)^{-1}).
\end{align*}
The variance of the estimator with an empirical weight is therefore modified by a random factor in the variance of $Y$ and a normalization by $w(x)$. The convergence rates are not modified.

3.7 Estimation of the mode of a regression function

The mode of a real regression function $m$ on $I_X$ is
$$M_m = \arg\max_{x\in I_X} m(x). \tag{3.20}$$
The mode $M_m$ of a regular regression function is estimated by the mode of a regular estimator of the function, $\hat M_{m,n,h} = M_{\hat m_{n,h}}$. Under Conditions 2.1-3.1, the regression function is locally concave in a neighborhood $N_M$ of the mode and its estimator has the same property for $n$ sufficiently large, by the uniform consistency of $\hat m_{n,h}$; hence $m^{(1)}(M_m) = 0$, $m^{(2)}(M_m) < 0$, $\hat m^{(1)}_{n,h}(\hat M_{m,n,h}) = 0$ and $\hat M_{m,n,h}$ converges to $M_m$ in probability. A Taylor expansion of $m^{(1)}$ at the estimated mode implies
$$(\hat M_{m,n,h} - M_m) = \{m^{(2)}(M_m)\}^{-1}\{m^{(1)}(\hat M_{m,n,h}) - \hat m^{(1)}_{n,h}(\hat M_{m,n,h})\} + o(1).$$
The weak convergence of the process $(nh^3)^{1/2}(\hat m^{(1)}_{n,h} - m^{(1)})$ (Proposition 3.5) determines the convergence rate of $(\hat M_{m,n,h} - M_m)$ as $(nh^3)^{-1/2}$ and it implies the asymptotic behaviour of the estimator $\hat M_{m,n,h}$.
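Numerically (an editorial sketch), $\hat M_{m,n,h}$ is obtained by maximizing the kernel regression estimator over a fine grid interior to the support:

    import numpy as np

    def mode_hat(X, Y, h, grid_size=400):
        # argmax of the kernel regression estimator on a grid
        def m_hat(t):
            u = (t - X) / h
            w = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)
            return np.sum(w * Y) / np.sum(w)
        grid = np.linspace(X.min() + h, X.max() - h, grid_size)
        vals = np.array([m_hat(t) for t in grid])
        return grid[np.argmax(vals)]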

Proposition 3.7. Under Conditions 2.1, 2.2 and 3.1, $(nh^3)^{1/2}(\hat M_{m,n,h} - M_m)$ converges weakly to a centered Gaussian variable with finite variance $m^{(2)-2}(M_m)Var\,\hat m^{(1)}_{n,h}(M_m)$.

If the regression function belongs to $C_3(I_X)$, the bias of $m^{(1)}(\hat M_{m,n,h})$ is deduced from the bias of the process $\hat m^{(1)}_{n,h}$ defined by (3.15); it equals
$$Em^{(1)}(\hat M_{m,n,h}) = -\frac{h^2}{2}m_{2K}f_X^{-1}(M_m)\{(mf_X)^{(3)} - mf_X^{(3)}\}(M_m) + o(h^2)$$
and does not depend on the degree of derivability of the regression function $m$. All results are extended to the search of the local maxima and minima of the function $m$, the minima being local maxima of $-m$. The maximization of the function on the interval $I_X$ is then replaced by sequential maximizations or minimizations.
3.8 Estimation of a regression function under censoring

Consider the nonparametric regression (1.6) where the variable $Y$ is right-censored by a variable $C$ independent of $(X, Y)$ and the observed variables are $(X, Y^*, \delta)$ where $Y^* = Y\wedge C$ and $\delta = 1_{\{Y\leq C\}}$. Let $F_{Y|X}$ denote the distribution function of $Y$ conditionally on $X$. The regression function $m(x) = E(Y\,|\,X = x) = \int y\,F_{Y|X}(dy; x)$ is estimated using an estimator of the conditional density of $Y$ given $X$ under right-censoring. Extending the results of Section 2.8 to the nonparametric regression, the conditional distribution function $F_{Y|X}$ defines a cumulative conditional hazard function
$$\Lambda_{Y|X}(y; x) = \int 1_{\{s\leq y\}}\{1 - F_{Y|X}(s; x)\}^{-1}F_{Y|X}(ds; x);$$
conversely, the function $\Lambda_{Y|X}$ uniquely defines the conditional distribution function as
$$1 - F_{Y|X}(y; x) = \exp\{-\Lambda^c_{Y|X}(y; x)\}\prod_{z\leq y}\{1 - \Delta\Lambda_{Y|X}(z^-; x)\},$$
where $\Lambda^c_{Y|X}$ is the continuous part of $\Lambda_{Y|X}$ and $\prod_s\{1 - \Delta\Lambda(s)\}$ its right-continuous discrete part. Let
$$N_n(y; x) = \sum_{1\leq i\leq n}K_h(x - X_i)\delta_i 1_{\{Y_i\leq y\}}, \qquad Y_n(y; x) = \sum_{1\leq i\leq n}K_h(x - X_i)1_{\{Y_i^*\geq y\}}$$
be the counting processes related to the observations of the censored variable $Y^*$, with regressors in a neighborhood $V_h(x)$ of $x$, and let $J_n(y; x)$ be the indicator of $\{Y_n(y; x) > 0\}$. The process $M_n(y; x) = N_n(y; x) - \int_{-\infty}^y Y_n(s; x)\,d\Lambda_{Y|X}(s; x)$ is a centered martingale with respect to the filtration generated by the observed processes up to $y^-$, conditionally on regressors in $V_h(x)$. The functions $\Lambda_{Y|X}$ and $F_{Y|X}$ are estimated by
\begin{align*}
\hat\Lambda_{Y|X,n,h}(y; x) &= \int 1_{\{s\leq y\}}\frac{J_n(s; x)N_n(ds; x)}{Y_n(s; x)},\\
\hat F_{Y|X,n,h}(y; x) &= 1 - \prod_{Y_i\leq y}\{1 - \Delta\hat\Lambda_{Y|X,n,h}(Y_i; x)\};
\end{align*}
the estimator $\hat\Lambda_{Y|X,n,h}$ is unbiased and $\hat F_{Y|X,n,h}$ is the Kaplan-Meier estimator for the distribution function of $Y$ conditional on $\{X = x\}$. The regression function $m$ is then estimated by
\begin{align*}
\hat m_{n,h}(x) &= \int y\,\hat F_{Y|X,n,h}(dy; x)\\
&= \sum_{i=1}^n Y_i\{1 - \hat F_{Y|X,n,h}(Y_i^-; x)\}\frac{J_n(Y_i; x)}{Y_n(Y_i; x)}.
\end{align*}
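A compact transcription of this estimator (an editorial sketch; ties among the observed $Y_i^*$ are ignored for simplicity):

    import numpy as np

    def m_censored(x, X, Ystar, delta, h):
        # hat m_{n,h}(x) = sum_i Y_i {1 - F_hat(Y_i^-; x)} J_n / Y_n
        u = (x - X) / h
        w = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)  # kernel weights
        order = np.argsort(Ystar)
        t, d, w = Ystar[order], delta[order], w[order]
        at_risk = np.cumsum(w[::-1])[::-1]                  # Y_n(t_i; x)
        dLam = np.where(at_risk > 0, w * d / at_risk, 0.0)  # jumps of Lambda_hat
        surv = np.cumprod(1.0 - dLam)                       # 1 - F_hat(t_i; x)
        surv_minus = np.concatenate(([1.0], surv[:-1]))     # 1 - F_hat(t_i^-; x)
        return np.sum(t * surv_minus * dLam)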
The estimators satisfy: $\sup_{I_X\times I}|\hat\Lambda_{Y|X,n,h} - \Lambda_{Y|X}|$, $\sup_I|\hat F_{Y|X,n,h} - F_{Y|X}|$ and $\sup_{I_X}|\hat m_{n,h} - m|$ converge in probability to zero as $n$ tends to infinity, for every compact subinterval $I$ of $I_Y$. For every $y \leq \max Y_i^*$, the conditional Kaplan-Meier estimator, given $x$ in $I_{X,n,h}$, still satisfies
$$\frac{F_{Y|X} - \hat F_{Y|X,n,h}}{1 - F_{Y|X}}(y; x) = \int_{-\infty}^y\frac{1 - \hat F_{Y|X,n,h}(s^-; x)}{1 - F_{Y|X}(s; x)}\,d(\hat\Lambda_{Y|X,n,h} - \Lambda_{Y|X})(s; x). \tag{3.21}$$
The mean of this integral with respect to a centered martingale is zero, so the conditional Kaplan-Meier estimator and $\hat\Lambda_{Y|X,n}$ are unbiased estimators. The bias of the estimator of the regression function for censored variables $Y$ is then a $O(h^2)$.

3.9 Proportional odds model

Consider a regression model with a discrete response variable $Y$ corresponding to a categorization of an unobserved continuous real variable $Z$ in a partition $(I_k)_{k\leq K}$ of its range, with the probabilities $\Pr(Z\in I_k) = \Pr(Y = k)$. With a regression variable $X$ and intervals $I_k = (a_{k-1}, a_k)$, the cumulated conditional probabilities are
$$\pi_k(X) = \Pr(Y\leq k\,|\,X) = \Pr(Z\leq a_k\,|\,X),$$
and $E\pi_K(X) = 1$. The proportional odds model is defined through the logistic model for the probabilities $\pi_k(X) = p(a_k - m(X))$, with the logistic probability $p(y) = \exp(y)/\{1 + \exp(y)\}$ and a regression function $m$. This model is equivalent to $\pi_k(X)\{1 - \pi_k(X)\}^{-1} = \exp\{a_k - m(X)\}$ for every function $\pi_k$ such that $0 < \pi_k(x) < 1$ for every $x$ in $I_X$ and for $1\leq k < K$. This implies that the odds-ratio for the observations $(X_i, Y_i)$ and $(X_j, Y_j)$ with $Y_i$ and $Y_j$ in the same class does not depend on the class
$$\frac{\pi_k(X_i)\{1 - \pi_k(X_j)\}}{\{1 - \pi_k(X_i)\}\pi_k(X_j)} = \exp\{m(X_j) - m(X_i)\},$$
for every $k = 1, \dots, K$; this is the proportional odds model.
For $k = 1, \dots, K$, let $p_k(x) = (\pi_k - \pi_{k-1})(x) = \Pr(Y = k\,|\,X = x)$. Assuming that $p_1(x) > 0$ for every $x$ in $I_X$, the conditional distribution of the discrete variable is also determined by the conditional probabilities $\alpha_k(x) = P(Y = k|X = x)P^{-1}(Y = 1|X = x)$. Equivalently,
$$P(Y = k|X = x) = \frac{\alpha_k(x)}{\sum_{j=1}^K\alpha_j(x)}, \qquad k = 1, \dots, K,$$
with the constraint $\sum_{k=1}^K P(Y = k|X = x) = 1$ for every $x$. This reparametrization of the conditional probabilities $\alpha_k$ is not restrictive, though it is called the logistic model.

Estimating first the support of the regression variable reduces the number of unknown parameters to $2(K - 1)$, the thresholds of the classes and their probabilities, for $k\leq K - 1$, in addition to the nonparametric regression function $m$. The probability functions $\pi_k(x)$ are estimated by the proportions $\hat\pi_{n,k}(x)$ of observations of the variable $Y$ in class $k$, conditionally on the regressor value $x$. Let
$$U_{ik} = \log\frac{\hat\pi_{n,k}(X_i)}{1 - \hat\pi_{n,k}(X_i)}, \qquad i = 1, \dots, n,$$
calculated from the observations $(X_i, Y_i)_{i=1,\dots,n}$ such that $Y_i = k$. The variations of the regression function $m$ between two values $x$ and $y$ are estimated by
$$\hat m_{n,h}(x) - \hat m_{n,h}(y) = K^{-1}\sum_{k=1}^K\Big\{\frac{\sum_{i=1}^n U_{ik}K_h(X_i - x)}{\sum_{i=1}^n K_h(X_i - x)} - \frac{\sum_{i=1}^n U_{ik}K_h(X_i - y)}{\sum_{i=1}^n K_h(X_i - y)}\Big\}.$$
This estimator yields an estimator for the derivative of the regression function, $\hat m^{(1)}_{n,h}(x) = \lim_{|x-y|\to 0}(x - y)^{-1}\{\hat m_{n,h}(x) - \hat m_{n,h}(y)\}$, which is written as the mean over the classes of the derivative estimator (3.15) with response variables $U_{ik}$. Integrating the mean derivative provides a nonparametric estimator of the regression function $m$. The bounds of the classes cannot be identified without observations of the underlying continuous variable $Z$; thus the odds ratio allows one to remove the unidentifiable parameters from the model for the observed variables.
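An editorial variant in code: instead of smoothing the per-observation logits $U_{ik}$, one can smooth the class indicators and take logits of the resulting $\hat\pi_{n,k}$, which gives the same first-order construction for the variation of $m$ between $x$ and $y$ (up to the sign convention, and excluding $k = K$ since $\hat\pi_{n,K} = 1$):

    import numpy as np

    def po_difference(x, y, X, Yclass, h, K_classes):
        # average over k of logit(pi_hat_k(x)) - logit(pi_hat_k(y))
        def pi_hat(t, k):
            u = (t - X) / h
            w = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)
            return np.sum(w * (Yclass <= k)) / np.sum(w)
        logit = lambda p: np.log(p / (1.0 - p))
        diffs = [logit(pi_hat(x, k)) - logit(pi_hat(y, k))
                 for k in range(1, K_classes)]
        return np.mean(diffs)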
With a multidimensional regression variable $X$, the single-index model or a transformation model (Chapter 7) reduce the dimension of the variable and accelerate the convergence of the estimators.

3.10 Estimation for the regression function of processes

Consider a continuously observed stationary and ergodic process $(Z_t)_{t\in[0,T]} = (X_t, Y_t)_{t\in[0,T]}$ with values in $I_{XY}$, and the regression model $Y_t = m(X_t) + \sigma(X_t)\varepsilon_t$ where $(\varepsilon_t)_{t\in[0,T]}$ is a conditional Brownian motion such that $E(\varepsilon_t\,|\,X_t) = 0$ and $E(\varepsilon_t\varepsilon_s\,|\,X_t\wedge X_s) = E\{(\varepsilon_t\wedge\varepsilon_s)^2\,|\,X_t\wedge X_s\} = 1$. The ergodicity property is expressed by (2.13) or (2.16) for the bivariate process $Z$. The regression function $m$ is estimated on an interval $I_{X,Y,T,h}$ by the kernel estimator
$$\hat m_{T,h}(x) = \frac{\int_0^T Y_s K_h(x - X_s)\,ds}{\int_0^T K_h(x - X_s)\,ds}. \tag{3.22}$$
Its numerator is denoted
$$\hat\mu_{T,h}(x) = \frac{1}{T}\int_0^T Y_s K_h(x - X_s)\,ds$$
and its denominator is $\hat f_{X,T,h}(x)$. The mean of $\hat\mu_{T,h}(x)$ and its limit are respectively
\begin{align*}
\mu_{T,h}(x) &= \int_{I_{XY}}yK_h(x - u)\,dF_{XY}(u, y),\\
\mu(x) &= \int_{I_{XY}}yf_{XY}(x, y)\,dy = f_X(x)m(x).
\end{align*}
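In practice the integrals in (3.22) are computed from a discretization $(X_{t_j}, Y_{t_j})$ of the observed path; on an equispaced grid the time step cancels in the ratio (an editorial sketch):

    import numpy as np

    def m_process(x, Xpath, Ypath, h):
        # Riemann approximation of (3.22) on an equispaced time grid
        u = (x - Xpath) / h
        ker = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)
        return np.sum(Ypath * ker) / np.sum(ker)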

Under Conditions 2.1-2.2 and 3.1, the bias of $\hat\mu_{T,h}(x)$ is
$$b_{\mu,T,h}(x) = \int_{I_{XY,T}}yK_h(x - u)\,dF_{XY}(u, y) - \mu(x) = \frac{h_T^s}{s!}m_{sK}\mu^{(s)}(x) + o(h_T^s),$$
its variance is expressed through the integral of the covariance between $Y_sK_h(X_s - x)$ and $Y_tK_h(X_t - x)$. For $X_s = X_t$, the integral on the diagonal $D_X$ of $I^2_{X,T}$ is a $(Th_T)^{-1}\kappa_2 w_2(x) + o((Th_T)^{-1})$ and the integral outside the diagonal, denoted $I_o(T)$, is expanded using the ergodicity property (2.13). Let $\alpha_h(u, v) = |u - v|/(2h_T)$, then
\begin{align*}
I_o(T) &= \int_{[0,T]^2}\int_{I^2_{XY}\setminus D_X}y_1 y_2 K_h(u - x)K_h(v - x)\,dF_{Z_s,Z_t}(u, y_1, v, y_2)\,\frac{ds}{T}\frac{dt}{T}\\
&= (Th_T)^{-1}\Big\{\int_{I_X}\int_{I_X\setminus\{u\}}\int_{-1/2}^{1/2}K(z - \alpha_h(u, v))K(z + \alpha_h(u, v))\,dz\,\mu(u)\mu(v)\,d\pi_u(v)\,dF_X(u)\Big\}\{1 + o(1)\}.
\end{align*}
For every fixed $u\neq v$, $\alpha_{h_T}(u, v)$ tends to infinity as $h_T$ tends to zero, then the integral $\int_{-1/2}^{1/2}K(z - \alpha_h(u, v))K(z + \alpha_h(u, v))\,dz$ tends to zero with $h_T$. If $|u - v| = O(h_T)$, this integral does not tend to zero but the transition probability $\pi_u(v)$ tends to zero as $h_T$ tends to zero, therefore the integral $I_o(T)$ is a $o((Th_T)^{-1})$ as $T$ tends to infinity. The $L_p$-norm of the estimator satisfies $\|\hat\mu_{T,h}(x) - \mu_{T,h}(x)\|_p = O((Th_T)^{-1/p})$ under the ergodicity condition (2.16) for the $k$-uplets of the process $Z$, and the approximation (3.2) is also satisfied for the estimator $\hat m_{T,h}$. It follows that its bias, for $s\geq 2$, and its variance are approximated by
\begin{align*}
b_{m,T,h}(x; s) &= h_T^s b_m(x; s) + o(h_T^s),\\
b_m(x; s) &= f_X^{-1}(x)\{b_\mu(x) - m(x)b_f(x)\} = \frac{m_{sK}}{s!}f_X^{-1}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\},\\
v_{m,T,h}(x) &= (Th_T)^{-1}\{\sigma_m^2(x) + o(1)\},\\
\sigma_m^2(x) &= \kappa_2 f_X^{-1}(x)Var(Y\,|\,X = x)
\end{align*}
and the covariance between $\hat m_{T,h}(x)$ and $\hat m_{T,h}(y)$ tends to zero. The mean squared error of the estimator at $x$ for a marginal density in $C_s$ is then
$$MISE_{T,h_T}(x) = (Th_T)^{-1}\kappa_2 f_X^{-1}(x)Var(Y\,|\,X = x) + h_T^{2s}b_m^2(x; s) + o((Th_T)^{-1}) + o(h_T^{2s})$$
and the optimal local and global bandwidths minimizing the mean squared (integrated) errors are $O(T^{-1/(2s+1)})$
$$h_{AMSE,T}(x) = \Big\{\frac{1}{T}\,\frac{\sigma_m^2(x)}{2s\,b_m^2(x; s)}\Big\}^{1/(2s+1)}$$
and, for the asymptotic mean integrated squared error criterion,
$$h_{AMISE,T} = \Big\{\frac{1}{T}\,\frac{\int\sigma_m^2(x)\,dx}{2s\int b_m^2(x; s)\,dx}\Big\}^{1/(2s+1)}.$$
With the optimal bandwidth rate, the asymptotic mean (integrated) squared errors are $O(T^{-2s/(2s+1)})$. The same expansions as for the variances of $\hat\mu_{T,h}(x)$ and $\hat f_{X,T,h}(x)$ in Section 2.10 prove that the finite dimensional distributions of the process $(Th_T)^{1/2}(\hat f_{T,h} - f - b_{T,h})$ converge to those of a centered Gaussian process with mean zero, covariances zero and variance $\kappa_2 f(x)$ at $x$. Lemma 3.3 generalizes and the increments are approximated as $E|\Delta(\hat m_{T,h} - m_{T,h})(x, y)|^2 = O(|x - y|^2(Th_T^3)^{-1})$ for every $x$ and $y$ in $I_{X,h}$ such that $|x - y| \leq 2h_T$. Then the process $(Th_T)^{1/2}\{\hat m_{T,h} - m\}1_{\{I_{X,T}\}}$ converges weakly to $\sigma_m W_1 + \gamma^{1/2}b_m$ where $W_1$ is a centered Gaussian process on $I_X$ with variance 1 and covariances zero.

3.11 Exercises

(1) Detail the proof for the approximations of the biases and variances of
Proposition 3.1.
(2) Suppose $Y$ is a binary variable with $P(Y = 1|X = x) = p(x)$ and express the bias and the variance of the estimator of the nonparametric probability function $p$.
(3) Consider a discrete variable with values in an infinite countable set. Define an estimator of the function $m$ under suitable conditions and give the expression of its bias and variance.
(4) Define nonparametric estimators for the bias of the function $m$ and its variance.
(5) Define the optimal bandwidths for the estimation of the function $\mu$ and its first order derivative.
(6) Detail the expression of $\|\hat m_{n,h}(x) - m(x)\|_p$ using the orders of the norms established in Section 3.2.
(7) Detail the expressions of the bias and the second order approximation of the variance of $\hat\sigma^2_{n,h,\delta}(x)$ in Proposition 3.6.
(8) Let $F_{Y|X}(y; x) = \Pr(Y\leq y\,|\,X = x)$ be the distribution function of $Y$ conditionally on $X$ and
$$\hat F_{Y|X,n,h}(y; x) = n^{-1}\sum_{i=1}^n 1_{\{Y_i\leq y\}}H_h(X_i - x)$$
be a smooth estimator of the conditional distribution function (see Exercise 2.11-(6)). Find the expression of the bias and the variance of $\hat F_{Y|X,n,h}(y; x)$.
Chapter 4

Limits for the variable bandwidths estimators
4.1 Introduction

The pointwise mean squared error for a density or regression function reaches its minimum at a bandwidth function varying in the domain of the variable $X$. The question of the behaviour of the estimators of density and regression functions with a varying bandwidth is then settled. All results of Chapters 2 and 3 are modified by this function. Consider a density or a regression function of class $C_s(I_X)$. Let $(h_n)_n$ be a sequence of functional bandwidths in $C_1(I_X)$, converging uniformly to zero and uniformly bounded away from zero on $I_X$. In order to have an optimal bandwidth for the estimation of functions of class $C_2$, the functional sequence is assumed to satisfy a uniform convergence condition for the uniform norm $\|h_n\|$.

Condition 4.1. There exists a strictly positive function $h$ in $C_1(I_X)$ such that $\|h\|$ is finite and $\|nh_n^{2s+1} - h\|$ tends to zero as $n$ tends to infinity.

Under this condition, the bandwidth is uniformly approximated as
$$h_n(x) = n^{-1/(2s+1)}h^{1/(2s+1)}(x) + o(n^{-1/(2s+1)}).$$
The increasing intervals $I_{X,h_n}$ are now defined with respect to the uniform norm of the function $h_n$ by $I_{X,h_n} = \{s\in I_X;\,[s - \|h_n\|, s + \|h_n\|]\subset I_X\}$. The main results of the previous chapters are extended to kernel estimators with functional bandwidth sequences satisfying this convergence rate. That is the case of the kernel estimators built with estimated optimal local bandwidths calculated from independent observations.

The second point of this chapter is the definition of an adaptative estimator of the bandwidth, when the degree of derivability of the density varies in its domain of definition, and the behaviour of the estimator of the density with an adaptative estimator. In Chapter 2, the optimal bandwidth was obtained under the assumption that the degree of smoothness of the density is known and constant on the interval of the observations. The last assumption flattens the estimated curve by the use of a too large bandwidth in areas with a smaller derivability order, and the above variable bandwidth $h_n(x)$ does not solve that problem. The cross-validation method allows to define a global bandwidth without knowledge of the class of the density. Other adaptative methods are based on the maximal variations of the estimator as the bandwidth varies in a grid $D_n$ corresponding to a discretization of the possible domain of the bandwidth according to the order of regularity of the density. It can be performed globally or pointwise.
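The cross-validation selector mentioned above has, for regression, a standard least-squares form; the editorial sketch below does a leave-one-out search over a bandwidth grid (valid for bandwidths large enough that every point keeps neighbours within each $h$):

    import numpy as np

    def cv_bandwidth(X, Y, grid):
        # choose h minimizing the leave-one-out prediction error
        scores = []
        for h in grid:
            u = (X[:, None] - X[None, :]) / h
            W = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)
            np.fill_diagonal(W, 0.0)          # leave one out
            m_loo = (W @ Y) / W.sum(axis=1)
            scores.append(np.mean((Y - m_loo)**2))
        return grid[int(np.argmin(scores))]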

4.2 Estimation of densities

Let us consider the random process $U_{n,h_n}(x) = (nh_n(x))^{1/2}\{\hat f_{n,h_n(x)}(x) - f(x)\}$ for $x$ in $I_{X,h_n}$. Under Conditions 2.1 and 4.1, $\sup_I|\hat f_{n,h_n} - f|$ converges a.s. to zero for every compact subinterval $I$ of $I_{X,h_n}$ and $\|\hat f_{n,h_n} - f\|_p$ tends to zero, as $n$ tends to infinity. The bias of $\hat f_{n,h_n(x)}(x)$ is $b_{n,h_n}(x) = \frac{1}{2}h_n^2(x)m_{2K}f^{(2)}(x) + o(\|h_n\|^2)$, its variance is $Var\{\hat f_{n,h_n(x)}(x)\} = (nh_n(x))^{-1}\kappa_2 f(x) + o(n^{-1}\|h_n^{-1}\|)$, and $\|\hat f_{n,h_n(x)}(x) - f_{n,h_n(x)}(x)\|_p = O((n^{-1}\|h_n^{-1}\|)^{1/p})$.
Under Conditions 2.1-4.1, for a density of class $C_s(I_X)$ and for every $x$ in $I_{X,h}$, the moments of order $p\geq 2$ are unchanged and the bias of $\hat f_{n,h_n(x)}(x)$ is modified as
$$b_{n,h_n}(x; s) = \frac{h_n^s(x)}{s!}m_{sK}f^{(s)}(x) + o(\|h_n\|^s).$$
The MISE and the optimal local bandwidth are similar to those of Chapter 2 using these expressions.
For every $u$ in $[-1, 1]$, let $\alpha_n$ and $v$ in $[-1, 1]$, $|u|$ in $[0, \{x + h_n(x)\}\wedge\{y + h_n(y)\}]$, be defined by
\begin{align*}
\alpha_n(x, y, u) &= \frac{1}{2}\{(u - x)h_n^{-1}(x) - (u - y)h_n^{-1}(y)\}, \tag{4.1}\\
v = v_n(x, y, u) &= \frac{1}{2}\{(u - x)h_n^{-1}(x) + (u - y)h_n^{-1}(y)\}\\
&= \frac{1}{2h_n(x)h_n(y)}\big[\{h_n(x) + h_n(y)\}u - xh_n(y) - yh_n(x)\big],\\
u = u_n(x, y, v) &= \{h_n(x) + h_n(y)\}^{-1}\{xh_n(y) + yh_n(x) + 2vh_n(x)h_n(y)\},\\
z_n(x, y) &= \{h_n(x) + h_n(y)\}^{-1}\{xh_n(y) + yh_n(x)\},\\
\delta_n(x, y) &= 2h_n(x)h_n(y)\{h_n(x) + h_n(y)\}^{-1} = o(1),
\end{align*}
hence $\alpha_n(x, y, u)$ is also denoted $\alpha_n(x, y, v)$.

Lemma 4.1. The covariance of $\hat f_{n,h_n}(x)$ and $\hat f_{n,h_n}(y)$ equals
\begin{align*}
\frac{2}{n\{h_n(x) + h_n(y)\}}\Big\{f(z_n(x, y))\int K(v - \alpha_n(v))K(v + \alpha_n(v))\,dv&\\
+ \delta_n(x, y)f^{(1)}(z_n(x, y))\int vK(v - \alpha_n(v))K(v + \alpha_n(v))\,dv &+ o(\|h_n\|)\Big\}.
\end{align*}

Proof. The integral
$$EK_{h_n(x)}(x - X)K_{h_n(y)}(y - X) = \int K_{h_n(x)}(x - u)K_{h_n(y)}(y - u)f_X(u)\,du$$
is expanded changing the variable $u$ in $v$ and it equals
\begin{align*}
&\frac{2}{h_n(x) + h_n(y)}\int K(v - \alpha_n(v))K(v + \alpha_n(v))f(u_n(x, y, v))\,dv\\
&= \frac{2}{h_n(x) + h_n(y)}\Big\{f(z_n(x, y))\int K(v - \alpha_n(v))K(v + \alpha_n(v))\,dv\\
&\quad + \delta_n(x, y)f^{(1)}(z_n(x, y))\int vK(v - \alpha_n(v))K(v + \alpha_n(v))\,dv + o(\|h_n\|)\Big\}. \qquad\square
\end{align*}
Lemma 4.2. For functions of class $C_s(I_X)$, $s\geq 1$, and under Conditions 3.1 and 4.1, for every $x$ and $y$ in $I_{X,h_n}$ the mean variation of $\hat f_{n,h_n}$ between $x$ and $y$ has the order $O(|x - y|)$ and its mean squared variation for $|xh_n^{-1}(x) - yh_n^{-1}(y)|\leq 1$ is $E|\hat f_{n,h_n}(x) - \hat f_{n,h_n}(y)|^2 = O(n^{-1}\|h_n^{-1}\|^3(x - y)^2)$. Otherwise, it is a $O(n^{-1}\|h_n^{-1}\|)$ and the variables $\hat f_{n,h_n}(x)$ and $\hat f_{n,h_n}(y)$ are independent.

Proof. By the Mean Value Theorem, for every $x$ and $y$ in $I_{X,h}$ there exists $s$ between $x$ and $y$ such that $|f_{n,h_n}(x) - f_{n,h_n}(y)| = |x - y|\,|f^{(1)}(s)|$ and
$$|f_{n,h_n}(x) - f_{n,h_n}(y)|\leq|x - y|\,\|f^{(1)}\|.$$
Let $z = \lim_n z_n(x, y)$ be defined in (4.1). The expectation of $|\hat f_{n,h_n}(x) - \hat f_{n,h_n}(y)|^2$ develops as
$$n^{-1}\int\{K_{h_n(x)}(x - u) - K_{h_n(y)}(y - u)\}^2 f(u)\,du + (1 - n^{-1})\{f_{n,h_n}(x) - f_{n,h_n}(y)\}^2.$$
Using the notations (4.1), the first term of this sum is expanded as
$$S_{1n} = \frac{1}{nh_n(x)h_n(y)\{h_n(x) + h_n(y)\}}\int\{h_n(x)K(v - \alpha_n(v)) - h_n(y)K(v + \alpha_n(v))\}^2 f(z_n(v))\,dv.$$
The derivability of the bandwidth functions implies
\begin{align*}
\frac{1}{h_n(x)h_n(y)}\int\{h_n(x)K(v - \alpha_n) - h_n(y)K(v + \alpha_n)\}^2 f(z_n)\,dv&\\
\leq 2\Big[\frac{h_n(x)}{h_n(y)}\int\{K(v - \alpha_n) - K(v + \alpha_n)\}^2 f(z_n)\,dv + \frac{\{h_n(x) - h_n(y)\}^2}{h_n(x)h_n(y)}\int K^2(v - \alpha_n)f(z_n)\,dv\Big]&,\\
S_{1n}\leq\frac{2}{n\{h_n(x) + h_n(y)\}}\Big[\frac{h_n(x)}{h_n(y)}f(z)\int\{K(v - \alpha_n) - K(v + \alpha_n)\}^2\,dv\qquad\qquad&\\
+ (x - y)^2\frac{h_n^{(1)2}(\eta(x - y))}{h_n(x)h_n(y)}\int K^2(v - \alpha_n)f(z_n)\,dv\Big]&,
\end{align*}
where $\eta$ lies in $(-1, 1)$, by the Mean Value Theorem, $h_n^{(1)2}(\eta)$ and $h_n(x)h_n(y)$ have the same order, and
$$\int\{K(v - \alpha_n) - K(v + \alpha_n)\}^2\,dv = 4\alpha_n^2\int K^{(1)2}(v)\,dv = O(|x - y|^2\|h_n^{-1}\|^2).$$
It follows that $S_{1n} = O(n^{-1}\|h_n^{-1}\|^3|x - y|^2)$. Since $h_n^{-1}(x)|x|$ and $h_n^{-1}(y)|y|$ are bounded by 1, the order of $E|\hat f_{n,h}(x) - \hat f_{n,h}(y)|^2$ is $O((x - y)^2 n^{-1}\|h_n^{-1}\|^3)$ if $|xh_n(y) - yh_n(x)|\leq h_n(y)h_n(x)$; otherwise $\hat f_{n,h}(x)$ and $\hat f_{n,h}(y)$ are independent and it is a sum of variances. $\square$

Theorem 4.1. Under the conditions, for a density $f$ of class $C_s(I_X)$ and a varying bandwidth sequence such that $n\|h_n\|^{2s+1}$ converges to $\|h\|$, the process
$$U_{n,h_n}(x) = (nh_n(x))^{1/2}\{\hat f_{n,h_n(x)} - f(x)\}1_{\{x\in I_{X,\|h_n\|}\}}$$
converges weakly to the process defined on $I_X$ as $W_f(x) + h^{1/2}(x)b_f(x)$, where $W_f$ is a continuous centered Gaussian process with covariance $\sigma_f^2(x)\delta_{\{x,x'\}}$ between $W_f(x)$ and $W_f(x')$.

Proof. The weak convergence of the variable $U_{n,h}(x)$ is a consequence of the $L_2$-convergence of $(nh_n(x))^{1/2}\{\hat f_{n,h_n}(x) - f(x)\} - (nh_n^{2s+1}(x))^{1/2}b_f(x)$, with limiting variance $\kappa_2 f(x)$. In the same way, the finite dimensional distributions of the process $U_{n,h}$ converge weakly to those of a centered Gaussian process.

Limits for the varying bandwidths estimators 79

quadratic variations of the bias {fn,hn (x) (x) − f (x) − fn,hn (y) (y) + f (y)}2
are bounded by
Z
| K(z){f (x + hn (x)z) − f (x) − f (y + hn (y)z) − f (y)} dz|2

msK h(x) 2s (s) h(y) (s)


= khn k2s [{ } f (x) − { f (y)}2s ]2
s! khk khk
and it is a O(khn k2s |x − y|2 ). This bound and Lemma 4.2 imply that the
mean of the squared variations of the process Un,h on small intervals are
O(|x − y|2 ), therefore the process Un,h is tight, so it converges weakly to a
centered Gaussian process. The covariance of the limiting process at x and
y is the limit of the covariance between Un,h (x) and Un,h (y) and it equals
1/2 1/2
limn nhn (x)hn (y)Cov{fbn,h (x),Rfbn,h (y)}. The covariance of fbn,h (x) and
fbn,h (y) is approximated by n−1 Khn (x) (x − u)Khn (y) (y − u)f (u)du, for
x 6= y it develops as
Z
1{0≤αn <1}
f (zn (x, y)) K(v − αn )K(v + αn )dv
n{hn (x) + hn (y)}
+ o(n−1 (hn (x) + hn (y))−1 ).
As n tends to infinity, hn (x)h−1 −1
n (y) and hn (y)hn (x) are bounded and
1{0≤αn <1} tends to zero for every x 6= y, hence the covariance of Un,h (x)
and Un,h (y)} converges to zero as n tends to infinity. 
The optimal bandwidth for estimating a density has an order between $n^{-1}$ and $n^{-1/(2\alpha+1)}$ with $\alpha > 1/4$, under the conditions that $nh_n$ tends to infinity and $f$ belongs locally to $H_\alpha$ with $\alpha > 1/4$, or to $C_\alpha$ with $\alpha\geq 2$. All estimators of the bias of a density depend on its regularity through the constant of the bias and the exponent of $h$, and the bias cannot be directly estimated without knowledge of $\alpha$. At the bandwidth minimizing the mean squared error of the estimator $\hat f_{n,h}(x)$, the error is bounded by
$$MSE_{n,h}(x, \alpha) = Var\hat f_{n,h}(x)\{1 + (2\alpha)^{-1}\} \tag{4.2}$$
with an order of smoothness $\alpha > 1/4$, so only a lower bound of the degree is necessary to obtain a bound of the MSE. As the variance of $\hat f_{n,h(x)}(x)$ does not depend on the class of $f$, it can be estimated using a bandwidth function $h_2$ such that $\|h_2\|$ tends to zero and $n\|h_2\|$ tends to infinity, by $\widehat{Var}_{n,h_2}\hat f_{n,h}(x) = (nh(x))^{-1}\kappa_2\hat f_{n,h_2(x)}(x)$. Let $\widehat{MSE}_{n,h,a_n}(x)$ be the estimator of $MSE_{n,h,a_n}(x)$ obtained by plugging in the estimator of $Var\hat f_{n,h}(x)$. It can be compared with the bootstrap estimator of the mean squared error $MSE^*_{n,h}(x) = Var^*\hat f_{n,h}(x) + B^{*2}(\hat f_{n,h})(x)$, calculated from a bootstrap sample of independent variables having the distribution $\hat F_n$. This estimator and the bootstrap estimator $Var^*\hat f_{n,h}(x)$ yield an estimator of $\alpha$, by equation (4.2). An optimal local bandwidth can then be estimated from the estimator of $\alpha$. The choice of the bandwidth function $h_2$ relies on the same procedure and the optimal estimator $\hat h_n(x)$ requires iterations of this procedure, starting from an empirical bandwidth calculated from a discretization of its range. Adaptative estimators of the bandwidth were previously defined using empirical thresholds for the variations of the estimator of the density according to the bandwidth; however the constants in the thresholds were chosen by numerical recursive procedures.
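The bootstrap part is direct to implement (an editorial sketch; the full-sample estimate serves as the bootstrap target for the squared bias):

    import numpy as np

    def boot_mse(x, X, h, B=200, seed=0):
        # MSE*_{n,h}(x) = Var* f_hat(x) + B*^2(f_hat)(x), resampling from F_n_hat
        rng = np.random.default_rng(seed)
        def f_hat(sample):
            u = (x - sample) / h
            return np.mean(0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)) / h
        f0 = f_hat(X)
        reps = np.array([f_hat(rng.choice(X, X.size, replace=True))
                         for _ in range(B)])
        return reps.var() + (reps.mean() - f0)**2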

Another variable bandwidth kernel estimator is defined with a bandwidth function of the variables $X_i$ rather than of $x$
$$\hat f_{X,n,h_n}(x) = \frac{1}{n}\sum_{i=1}^n K_{h_n(X_i)}(x - X_i).$$
Its mean is $Ef_{X,n,h_n}(x) = EK_{h_n(X)}(x - X) = \int K_{h_n(y)}(x - y)f_X(y)\,dy$ and its limit is $f_X(x)$, approximating $y$ by $x$ in the integral. Its bias and variance are not expanded as above; the bandwidth at $y$ is now developed as $h_n(y) = h_n(x)\{1 - zh_n^{(1)}(x)\} + o(\|h_n\|^2)$ where $x - y = h_n(y)z$, hence
\begin{align*}
f_X(y) &= f_X(x - h_n(x)z + h_n(x)h_n^{(1)}(x)z^2) + o(\|h_n\|^2)\\
&= f_X(x) - h_n(x)zf_X^{(1)}(x) + \frac{1}{2}h_n(x)z^2\{h_n(x)f_X^{(2)}(x) + 2h_n^{(1)}(x)f_X^{(1)}(x)\} + o(\|h_n\|^2)
\end{align*}
and the bias of the estimator is
$$b_{\hat f_{X,n,h}}(x) = \frac{m_{2K}}{2}h_n(x)\{h_n(x)f_X^{(2)}(x) + 2h_n^{(1)}(x)f_X^{(1)}(x)\} + o(\|h_n\|^2).$$
Its variance is
$$Var\hat f_{X,n,h_n}(x) = n^{-1}\Big\{\int K^2_{h_n(y)}(x - y)f_X(y)\,dy - E^2\hat f_{X,n,h_n}(x)\Big\},$$
where
\begin{align*}
\int K^2_{h_n(y)}(x - y)f_X(y)\,dy &= \int K^2(z)f_X(x - h_n(x)z - h_n(x)h_n^{(1)}(x)z^2)\,dz + o(\|h_n\|^2)\\
&= \kappa_2 f_X(x) - h_n(x)f_X^{(1)}(x)\int zK^2(z)\,dz + o(\|h_n\|);
\end{align*}
the first order approximations of the variances are identical and their second order approximations have opposite signs.
4.3 Estimation of regression functions

Let us consider the variable bandwidth kernel estimator $\hat m_{n,h_n(x)}(x)$ of the regression function $m$ and the random process related to the estimated regression function $U_{m,n,h_n}(x) = (nh_n(x))^{1/2}\{\hat m_{n,h_n(x)} - m(x)\}1_{\{x\in I_{X,\|h_n\|}\}}$. Conditions 2.1 and 4.1 for kernel estimators of densities with variable bandwidth are supposed to be satisfied, in addition to Conditions 3.1 for kernel estimators of regression functions. Then $\sup_{x\in I_{X,\|h_n\|}}|\hat m_{n,h_n(x)}(x) - m(x)|$ converges a.s. to zero, with the uniform approximations
\begin{align*}
m_{n,h_n(x)}(x) &= \frac{\mu_{n,h_n(x)}(x)}{f_{X,n,h_n(x)}(x)} + O((n\|h_n\|)^{-1}),\\
(nh_n(x))^{1/2}\{\hat m_{n,h_n(x)} - m_{n,h_n(x)}\}(x) &= (nh_n(x))^{1/2}\big\{(\hat\mu_{n,h_n(x)} - \mu_{n,h_n(x)})(x)\\
&\quad - m(x)(\hat f_{X,n,h_n(x)} - f_{X,n,h_n(x)})(x)\big\}f_X^{-1}(x) + r_{n,h_n}(x),
\end{align*}
where $r_{n,h_n} = o_{L_2}(1)$, uniformly.
For every $x$ in $I_{X,\|h_n\|}$ and for every integer $p > 1$, $\|\hat m_{n,h_n(x)}(x) - m(x)\|_p$ converges to zero, the bias of the estimator $\hat m_{n,h_n(x)}(x)$ is uniformly approximated by
\begin{align*}
b_{m,n,h_n(x)}(x) &= m_{n,h_n(x)}(x) - m(x) = h_n^2(x)b_m(x) + o(\|h_n\|^2),\\
b_m(x) &= f_X^{-1}(x)\{b_\mu(x) - m(x)b_f(x)\} = \frac{m_{2K}}{2}f_X^{-1}(x)\{\mu^{(2)}(x) - m(x)f_X^{(2)}(x)\},
\end{align*}
and its variance is deduced from (3.7)
\begin{align*}
v_{m,n,h_n(x)}(x) &= (nh_n(x))^{-1}\{\sigma_m^2(x) + o(1)\},\\
\sigma_m^2(x) &= \kappa_2 f_X^{-2}(x)\{w_2(x) - m^2(x)f(x)\}.
\end{align*}
For a regression function and a density $f_X$ in class $C_s(I_X)$, $s\geq 2$, and under Conditions 2.2, the bias of $\hat m_{n,h_n(x)}(x)$ is uniformly approximated by
$$b_{m,n,h_n(x)}(x; s) = \frac{h_n^s(x)}{s!}m_{sK}f_X^{-1}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\} + o(\|h_n\|^s)$$
and its moments are not modified by the degree of derivability. For every $x$ in $I_{X,\|h_n\|}$
\begin{align*}
(nh_n(x))^{1/2}(\hat m_{n,h_n} - m)(x) &= (nh_n(x))^{1/2}f_X^{-1}(x)\{(\hat\mu_{n,h_n(x)} - \mu_{n,h_n(x)}) - m(\hat f_{X,n,h_n(x)} - f_{X,n,h_n(x)})\}(x)\\
&\quad + (nh_n^{2s+1}(x))^{1/2}b_m(x) + \hat r_{n,h_n(x)}(x),
\end{align*}
and $\sup_{x\in I_{X,\|h_n\|}}\|\hat r_{n,h_n(x)}\|_2 = O((n\|h_n\|)^{-1/2})$. The asymptotic mean squared error of $\hat m_{n,h}(x)$ is
\begin{align*}
(nh_n(x))^{-1}\sigma_m^2(x) + h_n^4(x)b_m^2(x) &= (nh_n(x))^{-1}\kappa_2 f_X^{-2}(x)\{w_2(x) - m^2(x)f(x)\}\\
&\quad + \frac{h_n^4(x)m^2_{2K}}{4}f_X^{-2}(x)\{\mu^{(2)}(x) - m(x)f_X^{(2)}(x)\}^2
\end{align*}
and its minimum is reached at the optimal local bandwidth
$$h_{n,AMSE}(x) = \Big(\frac{\kappa_2 n^{-1}\{w_2(x) - m^2(x)f(x)\}}{m^2_{2K}\{\mu^{(2)}(x) - m(x)f_X^{(2)}(x)\}^2}\Big)^{1/5}$$
where $AMSE(x) = O(n^{-4/5})$. For every $s\geq 2$, the asymptotic quadratic risk of the estimator for a regression curve of class $C_s$ is
\begin{align*}
AMSE(x) &= (nh_n(x))^{-1}\sigma_m^2(x) + h_n^{2s}(x)b_{m,s}^2(x)\\
&= (nh_n(x))^{-1}\kappa_2 f_X^{-2}(x)\{w_2(x) - m^2(x)f(x)\} + \frac{h_n^{2s}(x)}{(s!)^2}m^2_{sK}f_X^{-2}(x)\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2,
\end{align*}
its minimum is reached at the optimal bandwidth
$$h_{n,AMSE}(x) = \Big\{\frac{(s!)^2\kappa_2 n^{-1}\{w_2(x) - m^2(x)f(x)\}}{2s\,m^2_{sK}\{\mu^{(s)}(x) - m(x)f_X^{(s)}(x)\}^2}\Big\}^{1/(2s+1)}$$
where $AMSE(x) = O(n^{-2s/(2s+1)})$.
The covariance of $\hat m_{n,h_n}(x)$ and $\hat m_{n,h_n}(y)$ is calculated as for Theorems 3.1 and 4.1 and it is a $o(1)$ for every $x\neq y$.

Lemma 4.3. The covariance of $\hat m_{n,h_n}(x)$ and $\hat m_{n,h_n}(y)$ equals
\begin{align*}
\frac{2}{n\{h_n(x) + h_n(y)\}}\Big[\sigma_m^2(z_n(x, y))\kappa_2^{-1}\int K(v - \alpha_n(v))K(v + \alpha_n(v))\,dv&\\
+ \delta_n(x, y)f_X^{-2}(z_n(x, y))\{w_2 - m^2 f_X\}^{(1)}(z_n(x, y))\int vK(v - \alpha_n(v))K(v + \alpha_n(v))\,dv &+ o(\|h_n\|)\Big].
\end{align*}

Proof. The integral $EY^2K_{h_n(x)}(x - X)K_{h_n(y)}(y - X) = \int K_{h_n(x)}(x - u)K_{h_n(y)}(y - u)w_2(u)\,du$ is expanded changing the variable $u$ in $v$ and it equals
\begin{align*}
&\frac{2}{h_n(x) + h_n(y)}\int K(v - \alpha_n(v))K(v + \alpha_n(v))w_2(u_n(x, y, v))\,dv\\
&= \frac{2}{h_n(x) + h_n(y)}\Big\{w_2(z_n(x, y))\int K(v - \alpha_n(v))K(v + \alpha_n(v))\,dv\\
&\quad + \delta_n(x, y)w_2^{(1)}(z_n(x, y))\int vK(v - \alpha_n(v))K(v + \alpha_n(v))\,dv + o(\|h_n\|)\Big\},
\end{align*}
then the $L_2$-approximation of $(nh_n(x))^{1/2}\{\hat m_{n,h_n(x)} - m_{n,h_n(x)}\}(x)$ and Lemma 4.1 end the proof. $\square$
bn,hn and m
Lemma 3.3 is extended to µ b n,hn with functional bandwidths
like 4.2 and the weak convergence on IX,khn k of the process with varying
bandwidth Un,hn (x) = (nhn (x))1/2 {fbn,hn(x) (x) − f (x)} is proved as for the
density estimator.

Lemma 4.4. For a regression function m and density fX of class Cs (IX ),


s ≥ 2, and under Conditions 3.1 and 4.1, for every x and y in IX,hn the
mean of the variation of m b n,hn between x and y has the order O(|x−y|) and
E|m b n,h (y)| = O(n−1 kh−1
b n,hn (x)−m 2 3 2 −1 −1
n k (x−y) ) if |xhn (x)−yhn (y)| ≤ 1.
−1 −1
Otherwise, it is a O(n khn k).

Theorem 4.2. Under the conditions, for a density f of class Cs (IX ) and a
varying bandwidth sequence such that nkhn k2s+1 converges to khk, the pro-
cess Un,hn converges weakly to the process defined on IX as Wm + h1/2 bm ,
where Wm is a continuous centered Gaussian process with covariance
2
σm (x)δ{x=x0 } at x and x0 .

The estimators of the derivatives of the regression function are modi-


fied by the derivatives of the bandwidth and the kernel in each term
of the estimators, as detailed in Appendix B, and the first derivative is
(1) (1) (1)
mb = fb−1 {b
n,h µ
n,h b n,h fb }, like in (3.15), with notations of the ap-
−m
n,h n,h
pendix for d{Khn (x) (x)}/dx. The results of Proposition 3.5 are extended
(k)
to the estimator m b n,hn with a varying bandwidth sequence, its bias is a
O(khn k ), and its variance a O((nkh−1
s
n k)
2k+1
), hence the optimal band-
−1/(2k+2s+1)
width is a O(n ) and the optimal mean squared error is a
O(n−2s/(2k+2s+1) ).
In the regression model with a conditional variance function σ 2 (x), the
kernel estimator (3.17) with continuous functional bandwidths hn and δn
can be written
Pn
b n,h (x) (Xi )}2 Kδn (x) (x − Xi )
{Yi − m
bn,hn (x),δn (x) (x) = i=1
σ 2
Pn n ,
i=1 Kδn (x) (x − Xi )

then a new estimator for the regression function is defined using this estima-
−1
tor as a weighting process w bn = σ
bn,h n ,δn
in the estimator of the regression
function
Pn
wbn (Xi )Yi Khn (x) (x − Xi )
b wbn ,n,hn (x) (x) = Pi=1
m n .
i=1 wbn (Xi )Khn (x) (x − Xi )
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

84 Functional estimation for density, regression models and processes

2
The bias and variance of the estimator σ bn,h n (x),δn (x)
(x) and the fixed band-
2
width estimator for σ (x) are still similar. The bias of m b wbn ,n,hn (x) (x)
and m b w,n,hn (x) (x) have the same approximations, the variance of
b w,n,hn (x) (x) is identical to the variance of m
m b n,hn (x) (x) whereas the vari-
ance of m b wbn ,n,hn (x) (x) is modified like with the fixed bandwidth estima-
tor. The weak convergence theorem 4.2 extends to the weighted regression
estimator.

4.4 Estimation for processes

Let (Xt )t∈[0,T ] be a continuously observed stationary and ergodic process


satisfying (2.13), with values in IX . The limiting marginal density defined
by (2.14) is estimated with an optimal bandwidth of order O(T 1/(2s+1) ) as
proved in Section 2.10. For every x in IX,T,khT k
Z
1 T
fbT,hT (x) (x) = KhT (x) (Xs − x) ds (4.3)
T 0
where T 1/(2s+1) khT k = O(1). Conditions 2.1-2.2 are supposed to be satis-
fied, with a density f in class Cs and assuming that the bandwidth function
fulfills Conditions 4.1 with the approximation
hT (x) = T −1/(2s+1) {h1/(2s+1) (x) + o(1)}. (4.4)
The results of the previous sections extends to prove that for every x in
hs (x)
IX,T,khT k , the bias of fbT,h (x) is bT,hT (x) = Ts! msK f (s) (x) + o(khT ks ),
its variance is
V ar{fbT,hT (x)} = (T hT (x))−1 κ2 f (x) + o((T −1 kh−1
T k),

its covariances are o((T −1 kh−1 b


T k) and the Lp -norms are kfT,hT (x) −
−1 −1 1/p
fT,hT (x)kp = 0((T khT k) ). The ergodic property (2.16) for k-
dimensional vectors of values of the process (Xt )t entails the weak con-
vergence of the finite dimensional distributions of the density estimator
fbT,h . Lemma 4.2 extends to the ergodic process and entails the weak con-
vergence of (T hT )1/2 (fbT,h − f ) to a Gaussian process with variance κ2 f (x)
at x and covariances zero.

For a continuously observed stationary and ergodic process (Xt , Yt )t≤T


with values in IX,Y , consider the regression model Yt = m(Xt ) + σ(Xt )εt
where (εt )t∈[0,T ] is a Brownian motion such that E(εt | Xt ) = 0 and E(εt εs |
Xt ∧Xs ) = E{(εt ∧εs )2 | Xt ∧Xs ) = 1. The bivariate process Z is supposed
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Limits for the varying bandwidths estimators 85

to be ergodic, satisfying the properties (2.13) and (2.16). Under the same
conditions as in Chapter 3, the regression function m is estimated on an
interval IX,Y,T,khT k by the kernel estimator
RT
Ys KhT (x) (x − Xs ) ds
b T,hT (x) = 0R T
m .
0
KhT (x) (x − Xs ) ds

The bias and variances established in Section 3.10 for the functions f and m
of class Cs and fixed bandwidth hT are modified, with the notation µ = mf

bm,T,hT (x) (x) = hT (x)s bm (x) + o(khT ks ),


msK −1 (s)
bm (x) = f (x){µ(s) (x) − m(x)fX (x)},
s! X
vm,T,h (x) = (T hT (x))−1 σm 2
(x) + o((T khT k)−1 ),
2 −1
σm (x) = κ2 fX (x)V ar(Y | X = x)

and the covariance of m b T,hT (x) (x) and m b T,hT (x) (y) is a o((T khT k)−1 ). The
weak convergence of the process (T hT (x))1/2 {m b T,hhT (x) (x) − m(x)} is then
proved by the same methods, under the ergodicity properties.
In a model with a variance function, the regression function is also
−1
estimated using a weighting process w bT = σ bT,h T ,δT
in the estimator of the
regression function
RT
2 {Ys − m b T,hT (Xs ) (Xs )}2 KδT (x) (x − Xs ) ds
bT,hT ,δT (x) = 0
σ RT ,
0 KδT (x) (x − Xs ) ds
RT
bT (Xi )Yi KhT (x) (x − Xi )
w
b wbT ,T,hT (x) (x) = R0 T
m .
0
wbT (X i )K h T (x) (x − X i )

The previous modifications of the bias and variance of the estimator extend
to the continuously observed process (Xt )t≤T .

4.5 Exercises

(1) Compute the fixed and varying optimal bandwidths for the estimation
of a density and compare the respective density estimators.
(2) Give the expressions of the first moments of the varying bandwidth
estimator of the conditional probability p(x) = P (Y |X = x) for a Y
binary variable, conditionally on the value of a continuous variable X
(Exercise 3.10-(2)).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

86 Functional estimation for density, regression models and processes

(3) For the hierarchical observations of n independent sub-samples of Ji


dependent observations of Exercise 2.11-(5), determine a varying band-
width estimator for the limiting density f and ergodicity conditions for
the calculus of its bias and variance, and write their first order approx-
imations.
(4) Write the expressions of the bias and the variance of the continuous
estimator FbY |X,n,hn (x) for the distribution function of Y ≤ y condition-
ally on X ≤ x of Exercise 3.10-(8), with a varying bandwidth and prove
its weak convergence.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Chapter 5

Nonparametric estimation of quantiles

5.1 Introduction

Let F be a distribution function with density f on R, Fbn its empirical


distribution function and νn = n1/2 (Fbn − F ) the normalized empirical pro-
cess. The process Fbn − F convergences to zero uniformly a.s. and in L2 ,
and νn converges weakly to B ◦ F , where B is the Brownian motion. The
quantile Qb n is the inverse functional for Fbn , it converges therefore in prob-
ability to the inverse QF of F , uniformly on [0, 1]. The quantile estimator
is approximated as
b n = QF − {νn ◦ QF }{f ◦ QF }−1 + {νn ◦ QF }2 {f 0 ◦ QF }{f ◦ QF }−3 + o(ν 2 ).
Q n

As a consequence, the quantile process


b n − QF ) = −n1/2 νn ◦ QF + rn
n1/2 (Q
f
converges weakly to a centered Gaussian process with covariance function
{F (s ∧ t) − F (s)F (t)}{f ◦ QF (s)}−1 {f ◦ QF (t)}−1 , for every s and t in [0, 1].
The remainder term is such that supt∈[0,1] krn k is a oL2 (1).

Consider the distribution function FY |X of the variable Y conditionally


on the regression variable X, in the model Y = m(X) + ε with a continuous
regression curve m(x) = E(Y |X = x) and an observation error ε such that
E(ε|X) = 0 and V ar(ε|X) = σ 2 (X). It is defined with respect to the
distribution function Fε of ε by
FY |X (y; x) = P (Y ≤ y|X = x) = Fε (y − m(x)). (5.1)
R
The marginal distribution function of Y is FY (y) = Fε (y −R m(s)) dFX (s)
and the joint distribution function of (X, Y ) is FX,Y (x, y) = 1{s≤x} Fε (y −

87
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

88 Functional estimation for density, regression models and processes

m(s)) dFX (s). The estimator of FY |X is defined by smoothing the regression


variable with a kernel is
Pn
b Kh (x − Xi )1{Yi ≤y}
FY |X,n,h (y; x) = i=1Pn
i=1 Kh (x − Xi )
and an estimator of Fε is deduced from those of FY |X , FX and m as
X
Fbε,n,h (s) = n−1 FbY |X,n,h (s + m
b n,h (Xi ); Xi ).
1≤i≤n

In this expression, the estimator of the regression function can be weighted


by the inverse of the square root of the kernel estimator for the variance
function σ 2 . Therefore, all functions of the model, m, σ
b2 , FY |X and Fε , are
easily estimated from the sample (Xi , Yi )i≤n .
The quantile of the conditional distribution function of Y given X are
first defined with respect to Y , then with respect to X. For every t in [0, 1]
and at fixed x in IX , the conditional distribution FY |X (y; x) is increasing
with respect to y and its inverse is defined as
QY (t; x) = FY−1
|X (t; x) = inf{y ∈ IY : FY |X (y; x) ≥ t}. (5.2)
It is right-continuous with left-hand limits, like the FY |X . For every x ∈ IX ,
FY |X ◦ QY (t; x) ≥ t with equality if and only if FY |X (x) is (x, y) belongs
to the support of (X, Y ). Assuming that the function m is monotone by
intervals, the definition (5.1) implies the monotonicity on the same intervals
of the conditional distribution function FY |X with respect to the Y , with
the inverse monotonicity.
On each interval of monotonicity and for every s in the image of IX ,
the quantile QX (y; s) is defined by inversion of the conditional distribution
FY |X in the domain of the variable X, at fixed y, from equation (5.1)

inf{x ∈ IX : FY |X (y; x) ≥ t}, if m is decreasing,
QX (y; s) = (5.3)
sup{x ∈ IX : FY |X (y; x) ≤ t}, if m is increasing.
For every y ∈ IY , QX ◦FY |X (y; x) = x if and only if m and Fε are continuous
on IY and FY |X ◦QX (y; s) = s if and only if m and Fε are strictly monotone,
for every (s, y) in DX,Y . The empirical conditional distribution function
defines in the same way the empirical quantile processes Q b X,n,h and Q
b Y,n,h ,
b
according to (5.3) and (5.2) respectively. If (x, y) belongs to DX,Y,n,h , the
marginal components x and y belong respectively to D bX,n,h and D bY,n,h
b b
which are the domains of QX,n,h and QY,n,h , respectively.
Another question of interest for a regression function m monotone on
an interval Im is to determine its inverse with its distribution properties.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 89

Consider a continuous regression function m, increasing on a sub-interval


Im of the support IX of X, its inverse is defined as
m−1 (t) = inf{x ∈ IX : m(x) ≥ t}. (5.4)
It is increasing and continuous on the image of Im by m and satisfies
m−1 ◦ m = m ◦ m−1 = id.

5.2 Asymptotics for the quantile processes

Let IX,Y,h = {(s, y) ∈ IX,Y ; [s − h, s + h] ∈ IX }. Under conditions simi-


lar to those of the nonparametric regression,RProposition 3.1 applies con-
sidering y as fixed, with fX (x)FY |X (y; x) = 1{ζ≤y} fX,Y (x, ζ) dζ instead
of µ(x) and with the conditional function FY |X (y; x), for every (x, y) in
IX,Y . The weak convergence of the process defined on IX,h , at fixed y,
by (nh)1/2 {FbY |X,n,h (y; ·) − FY |X (y; ·)} is a corollary of Theorem 3.1. The
expressions of the bias and the Lp -norms rely on an expansion up to higher
order terms of its moments.

Proposition 5.1. Let FXY be a distribution function of Cs+1 (IX,Y ). Un-


der Conditions 2.1 for the density fX and 3.1 for the conditional distribu-
tion function FY |X (y; x) at fixed y, the variable supIX,Y,h |FbY |X,n,h − FY |X |
tends to zero a.s., its bias and its variance are
bFY |X ,n,h (y; x) = h2 bF (y; x) + o(h2 ) (5.5)
s+1
1 −1 ∂ FX,Y (x, y) (2)
bF (y; x) = msK fX (x){ − FY |X (x, y)fX (x)},
s! ∂xs+1
vFY |X ,n,h (y; x) = (nh)−1 vF (y; x) + o((nh)−1 ) (5.6)
−1
vF (y; x) = κ2 fX (x)FY |X (y; x){1 − FY |X (y; x)}.
At every fixed y in IY , the process (nh)1/2 {FbY |X,n,h (y) − FY |X (y)}1{IX,h }
converges weakly to a Gaussian process defined in IX , with mean function
limn (nh5 )1/2 bFY |X (y; ·), covariances zero and variance function vFY |X (y; ·).

The results for the bias of the estimator FbY |X,n,h extend to a density fX
in Cs as in Lemma 3.2. The weak convergence of the bivariate process
(nh)1/2 (FbY |X,n,h − FY |X ) defined on IX,Y,h requires an extension of the
previous results as for the empirical distribution function of Y .

Proposition 5.2. The process


νY |X,n,h = (nh)1/2 {FbY |X,n,h − FY |X )}1{IX,Y,h }
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

90 Functional estimation for density, regression models and processes

converges weakly to a Gaussian process Wν on IY,X , with mean function


limn (nh5 )1/2 bF (y; ·), variance vY |X and covariances at fixed x
−1
CovY |X (y, y 0 ; x) = κ2 fX (x){FY |X (y ∧ y 0 ; x) − FY |X (y; x)FY |X (y 0 ; x)},
and zero otherwise.

Proof. This is a consequence of the weak convergence of the finite di-


mensional distributions of νY |X,n,h and of its tightness, due to the bound
obtained for the moments of the squared variations between (x, y) and
(x0 , y 0 ) of the joint empirical process, νY |X,n,h (y; x) − νY |X,n,h (y 0 ; x) −
{νY |X,n,h (y; x0 ) − νY |X,n,h (y 0 ; x0 )} is a O((x0 − x)2 + (y 0 − y)2 ). The bound
O((y 0 − y)2 ) is obtained for the empirical process at fixed x and x0 , and
O((x0 − x)2 ) as in the proof of Lemma 3.3, at fixed y and y 0 . 
Let FY |X (y; x) be monotone with respect to x. If n is sufficiently large,
then FbY |X,n,h is monotone, as proved in the following lemma. The means
are denoted FY |X,n,h , QY,n,h and QX,n,h for E FbY |X,n,h , E Q
b Y,n,h and, re-
spectively, E Qb X,n,h .

Lemma 5.1. If n ≥ n0 large enough, FY |X,n,h is monotone on IX,Y,h .


Moreover, if FY |X is increasing with respect to x in IX then, for every
x1 < x2 and ζ > 0, there exists C > 0 such that
Pr{FbY |X,n,h (x2 ) − FbY |X,n,h (x1 ) > C} ≥ 1 − ζ.

Proof. Let y be considered as fixed in IY , x1 < x2 be in IX,h and such


that FY |X (y; x2 ) − FY |X (y; x1 ) = d > 0. For n large enough the bias
of FbY |X,n,h (y; x2 ) − FbY |X,n,h (y; x1 ) is strictly larger than d/2, by Propo-
sition 5.2. The uniform consistency of Proposition 5.1 implies, for every
η and ζ > 0, the existence of an integer n0 such that for every n ≥ n0 ,
Pr{|FbY |X,n,h (y; x1 )−FY |X (y; x1 )|+|FbY |X,n,h (y; x2 )−FY |X (y; x2 )| > η} < ζ.
For the monotonicity of the empirical conditional distribution function, let
d > η > 0, then
Pr{FbY |X,n,h (y; x2 ) − FbY |X,n,h (y; x1 ) > d − η}
= 1 − Pr{(FbY |X,n,h − FY |X )(y; x1 ) − (FbY |X,n,h − FY |X )(y; x2 ) ≥ η}
≥ 1 − Pr{|FbY |X,n,h − FY |X |(y; x2 ) + |FbY |X,n,h − FY |X |(y; x1 ) ≥ η}
≥ 1 − ζ.

The asymptotic behaviour of the quantile processes follows the same princi-
ples as the distribution functions. We first consider the quantile QY defined
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 91

by (5.2) conditionally on fixed X = x, it is always increasing. The empirical


quantile function is increasing with probability tending to 1, as in Lemma
5.1 and the functions QX,n,h and QY,n,h are monotone, for n large enough.
The results of Section 5.1, are adapted to the empirical quantiles. Another
quantile function is defined for n large enough by

Qe Y,n,h (v; x) = sup y : (x, y) ∈ IX,Y,h , FY |X,n,h (y; x) ≤ v , v ∈ D bY,n,h .
(5.7)
The uniform convergence of FY |X,n,h to FY |X implies that Q e Y,n,h converges
uniformly to QY . The derivative with respect to y of FY |X (y; x) belonging
to C2 (IY ) is fY |X (y; x), for every x in IX . Let bF and vF be defined by
(5.6) and (5.6), respectively.

Proposition 5.3. Let FX|Y be a continuous conditional distribution func-


b Y,n,h − QY |(u; x) converges in probability
tion, the process supDbY,n,h ×IX |Q
to zero. If the density fX,Y of (X, Y ) belongs to Cs (IX,Y ), then for every
x in IX and u in D bY,n,h , the bias of Qb Y,n,h equals

bF
bY (u; x) = −h2 ◦ QY (u; x) + o(h2 ),
fY |X
and its variance is
vF
vY (u; x) = (nh)−1 ◦ QY (u; x) + o((nh)−1 ).
{fY |X }2

Proof. By definition of the inverse function, for every x in IX,n,h and u


bY,n,h , there exists an unique y in IY,n,h such that u = FbY |X,n,h (y; x),
in D
then by derivability of the inverse function
b Y,n,h (u; x) − QY (u; x) = Q
Q b Y,n,h ◦ FbY |X,n,h (y; x) − QY ◦ FbY |X,n,h (x)
= QY ◦ FY |X (y; x) − QY ◦ FbY |X,n,h (y; x)
FbY |X,n,h (y; x) − FY |X (y; x)
=−
fY |X (y; x)
+ o(FbY |X,n,h (y; x) − FY |X (y; x)).

The functions Q e Y,n,h satisfy a similar approximation with the functions


FY |X,n,h . By the uniform convergence in probability of FY |X,n,h to FY |X
on IX,n,h and under the condition that the density is bounded away from
zero on IX , the processes Q b Y,n,h and the functions Q
e Y,n,h convergence
uniformly to QY .
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

92 Functional estimation for density, regression models and processes

bY,n,h . In order to calculate the bias and variance


At fixed x, let u be in D
b
of QY,n,h (u), we first determine the order of the bias and the variance of
the processes
b Y,n,h (u) − FY |X ◦ Q
ηbY,n,h (u) = FY |X ◦ Q e Y,n,h(u) (5.8)
which converge in probability to zero. Then the quantile estimator satisfies
b Y,n,h = QY ◦ (b
Q e Y,n,h ).
ηY,n,h + FY |X ◦ Q (5.9)

Taylor expansions allow to express Q e Y,n,h as a function of Q and Q b Y,n,h as


a function of Qe Y,n,h and of the process ηbY,n,h . Since FbY |X,n,h ◦ Q
b Y,n,h and
e
FY |X,n,h ◦ QY,n,h equal identity, (5.8) is also written
e Y,n,h (u) − {FbY |X,n,h − FY |X } ◦ Q
ηbY,n,h (u) = bF,n,h ◦ Q b Y,n,h(u).
ηY,n,h (u)} = h2 bF ◦ Q
E{b e Y,n,h (u) − h2 E{bF ◦ Q b Y,n,h(u)} + o(h2 ), (5.10)

and the variance of ηbY,n,h (u) equals


h i
ηY,n,h (u)} = E V ar{FbY |X,n,h ◦ Q
V ar{b b Y,n,h (u)|Q
b Y,n,h (u)}

b Y,n,h (u)}
+ V ar{bF,n,h ◦ Q
= (nh)−1 E{vF ◦ Q b Y,n,h (u)}
b Y,n,h (u)} + o(n−1 h−1 + h4 ). (5.11)
+ h4 V ar{bF ◦ Q
The moments of order l ≥ 3 of ηbY,n,h are bounded using an expansion of
the moments of the sum in Equation (5.8) by
n o
e Y,n,h (u)|l + E|(FbY |X,n,h − FY |X ) ◦ Q
2l |bY,n,h ◦ Q b Y,n,h(u)|l ,

the second right hand term E{|(FbY |X,n,h − FY |X,n,h + bY,n,h) ◦ Q


b Y,n,h(u)|l }
is lower than
2l E[E{|(FbY |X,n,h − FY |X,n,h ) ◦ Q
b Y,n,h (u)|l |Q
b Y,n,h (u)}
+ |bY,n,h ◦ Q b Y,n,h(u)|l ],

thus
h n
e Y,n,h (u)|l + 2l E |bY,n,h ◦ Q
ηY,n,h (u)|l ≤ 2l |bY,n,h ◦ Q
E|b b Y,n,h (u)|l
oi
+ E{|(FbY |X,n,h − FY |X,n,h ) ◦ Q b Y,n,h (u)|l |Q
b Y,n,h (u)} .

By Propositions 2.2 and 3.1, the conditional expectation E{|FbY |X,n,h −


FY |X,n,h |l is O((nh)−l/2 ), and both terms in bY,n,h are O(h2l ), hence
ηn,h (u)|l = o((nh)−1 ) for every l ≥ 3. The expression of the bias of
E|b
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 93

FbY |X,n,h implies {FY |X,n,h − FY |X } ◦ Q


e n,h (u) = h2 bF ◦ Q
e n,h (u) + o(h2 ),
therefore
e Y,n,h(u) = F −1 (u − h2 bF ◦ Q
Q e Y,n,h (u)) + o(h2 )
Y |X
e Y,n,h(u)
bF ◦ Q
= QY (u) − h2 + o(h2 ).
fY |X ◦ QY (u)
e Y,n,h (u) = QY (u) + O(h2 ), bF ◦ Q
Since Q e Y,n,h(u) = bF ◦ QY (u) + O(h2 ), so
that
e Y,n,h(u) = QY (u) − h2 bF ◦ QY (u) + o(h2 ).
Q (5.12)
fY |X
Furthermore, by (5.8),
b Y,n,h(u) = F −1 (FY |X ◦ Q
Q e Y,n,h(u) + ηbY,n,h (u))
Y |X

e Y,n,h (u) + ηbY,n,h (u) 2


=Q + O(b
ηY,n,h (u)),
e
fY |X ◦ QY,n,h (u)
and, using (5.12),

b Y,n,h(u) = Q
e Y,n,h (u) + ηbY,n,h (u)
Q
fY |X ◦ QY (u)
+ O(h2 ηbY,n,h (u)) + O(b2
ηY,n,h (u)). (5.13)

The expansion (5.12) implies bF ◦ Q e Y,n,h (u) = bF ◦ QY (u) + o(1). With


(5.13) and since E{b
ηY,n,h (u)} and V ar{bηY,n,h (u)} are o(1)

b Y,n,h (u) = bF ◦ Q
e Y,n,h (u) + ηbY,n,h (u) e Y,n,h(u)
bF ◦ Q bf ◦ Q
FY |X ◦ QY (u)
+ O(h2 ηbY,n,h (u)) + O(b2
ηY,n,h (u)),
b Y,n,h (u)} = bF ◦ Q
E{bF ◦ Q e Y,n,h (u) + o(1) = bF ◦ QY (u) + o(1).

Moreover, V ar{bF ◦ Q b Y,n,h (u)} = O(h4 + n−1 h−1 ) because of the approxi-
mations V ar{bηY,n,h (u)} = O(h4 +n−1 h−1 ) and E{b 4
ηY,n,h (u)} = o(n−1 h−1 ).
From (5.10), the expectation of ηbY,n,h (u) becomes
ηY,n,h (u)} = o(h2 ).
E{b (5.14)
b Y,n,h (u)} = vF ◦ QY (u) + o(1) and
In the expansion (5.11), E{vF ◦ Q
4 b
h V ar{bF ◦ QY,n,h (u)} = O(h + n−1 h3 ) = o(n−1 h−1 ). The variance of
8

ηbY,n,h (u) is then equal to


ηY,n,h (u)} = (nh)−1 vF ◦ QY (u) + o((nh)−1 ).
V ar{b (5.15)
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

94 Functional estimation for density, regression models and processes

b Y,n,h (u) is deduced from (5.9), (5.12), (5.13) and (5.14),


Finally, the bias of Q
which imply
b Y,n,h = QY +{(b
Q e Y,n,h −FY |X ◦QY )/(fY |X ◦QY )}{1+o(1)}
ηY,n,h +FY |X ◦ Q
therefore
b Y,n,h(u)} = QY (u) − h2 bF
E{Q + o(h2 ).
fY |X ◦ QY (u)
b Y,n,h (u).
Equations (5.13) and (5.15) yield the variance of Q 
b Y,n,h − QY }1 b
Theorem 5.1. The process UY,n,h = (nh)1/2 {Q {DY,n,h } con-
Wν + γ 1/2 bF
verges weakly to UY = ◦ QY where Wν is the Gaussian process
fY |X
limit of νY |X,n,h .

Proof. For every x in IX,n,h and for every u in D bY,n,h , there exists an
b
unique y in IY,n,h such that u = FY |X,n,h (y; x) therefore
b Y,n,h − QY )(u; x) = (Q
(Q b Y,n,h ◦ FbY |X,n,h − QY ◦ FbY |X,n,h )(y; x)
= (QY ◦ FY |X − QY ◦ FbY |X,n,h )(y; x).
From the convergence of FbY |X,n,h to FY |X , it follows
1/2
b Y,n,h − QY } = − {νY |X,n,h + γ
(nh)1/2 {Q
bY } b Y,n,h ,
◦Q
fY |X,n,h
and its limit is deduced from Proposition 5.2. 
The representation of the conditional quantile process
b Y,n,h = QY + {(b
Q e Y,n,h − FY |X ◦ QY )/(fY |X ◦ Q
ηY,n,h + FY |X ◦ Q b Y,n,h)}
+ rY,n,h , (5.16)
where ηbY,n,h is defined by (5.8) and where the remainder term rY,n,h is
oL2 ((nhn )−1/2 ), was established to prove Proposition 5.3 and Theorem 5.1.
An analogous representation holds for the quantile process Q b X,n,h
b X,n,h = QX +{(ζbX,n,h +FY |X ◦ Q
Q e X,n,h −FY |X ◦QX )/(fY |X ◦QX )}+rX,n,h

where ζbX,n,h = FY |X ◦ Q
b X,n,h −FY |X ◦ Q
e X,n,h and rX,n,h = oL2 ((nhn )−1/2 ).
The bias bX , the variance vX and the weak convergence of Q b X,n,h are
(1)
deduced. Let FY |X be the derivative of FY |X (y; x) with respect to x.

Proposition 5.4. Let FX|Y be a continuous conditional distribution func-


b X,n,h − QX |(u; x) converges in probability
tion, the process supDbX,n,h ×IY |Q
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 95

to zero. If the density fX,Y of (X, Y ) belongs to Cs (IX,Y ), then for every
x in IX and u in D bX,n,h , the bias of Q
b X,n,h equals
bF
bX (y; u) = −h2 ◦ QX (y; u) + o(h2 ),
∂FY |X /∂x
and its variance is
vF
vX (y; u) = (nh)−1 ◦ QX (y; u) + o((nh)−1 ).
{∂FY |X /∂x}2
b X,n,h − QX }1 b
Theorem 5.2. The process UX,n,h = (nh)1/2 {Q {DX,n,h } con-
Wν + limn (nh5n )1/2 bF
verges weakly to UX = ◦ QX .
∂FY |X /∂x

5.3 Bandwidth selection

The error criteria measuring the accuracy of Q b Y,n,h (u) as an estimator of


Q(u) are generally sums of the variance and the squared bias of Q b Y,n,h (u),
where the variance increases as h tends to zero whereas the bias decreases.
Under the assumption that FY |X is twice continuously differentiable with
respect to y and using results of Proposition 5.3, the mean squared error
n o2
M SEY (h) = E Q b Y,n,h (u) − Q(u) is asymptotically equivalent to

vF ◦ QY (u; x)
AMSEQY (u; x, h) = (nh)−1
{fY |X ◦ QY (u; x)}2
 2
4 bF
+h ◦ QY (u; x) .
fY |X
Its minimization in h leads to an optimal local bandwidth, varying with u
and x
 1/5
−1/5 vF ◦ QY (u; x)
hopt,loc (u; x) = n .
4b2F ◦ QY (u; x)
That this also the optimal local bandwidth minimizing the AMSE of
FbY |X,n,h (u; x) for the unique value of x such that y = QY (u), that is
AMSEF (u; x, h) = (nh)−1 vF (u; x) + h4 b2F (u; x).
If the density fX has a continuous derivative, that is also identical to the
optimal local bandwidth minimizing the AMSE of Q b X,n,h (y; x), at fixed y,
by Proposition 5.4. Since the optimal rate for the bandwidth has the order
n−1/5 , the optimal rate of convergence of Qb Y,n,h to QY is n−4/5 and the
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

96 Functional estimation for density, regression models and processes

process (nh)1/2 {Q b Y,n,h −QY } converges to a non-centered Gaussian process


with an expectation different from zero because nh5 = O(1). Estimating
the bias bF and the variance vF by bootstrap allows to estimate the optimal
bandwidths for the quantile estimator without knowledge of its order s of
derivability. A direct kernel estimator of the variance function of the process
νY |X,n,h is vbF,n,h = κ2 fbX,n,h
−1
Fb|X,n,h (1−Fb|X,n,h ), according to the expression
(5.6) of vF .
For a conditional distribution function FY |X (·; x) having derivatives of
order s, the bias is modified
bs,F,n,h (y; x) = hs bF (y; x) + o(hs )
hs −1 ∂ s+1 FX,Y (x, y)
= msK fX (x){
s! ∂xs+1
(s)
− FY |X (x, y)fX (x)} + o(h2 ),
and the optimal local bandwidth is modified by this s-order bias.
The global mean integrated squared error criteria are defined by inte-
grating over all values of u in D bY,n,h the AM SEQY (u; x, h), conditionally
on a fixed value of x
Z
AMISEQY (x, h) = E{Q b Y,n,h(u; x) − QY (u; x)}2 du
Z
vF ◦ QY (u; x) bY ◦ QY (u; x) 2
= [(nh)−1 + h4 { } ] du
{fY |X ◦ QY (u; x)}2 fY |X ◦ QY (u; x)
Z
AMSEF (y; x, h)
= dy,
IY,n,h fY |X (y; x)
R
which differs from the integral AMISEF (x, h) = IY,n,h AMSEF (y; x, h) dy,
conditional to X = x. The mean of the conditional random criterion
AMISEQY (X, h) is
Z
AMSEF (y; x, h)
dy dFX (x).
IX,Y,n,h fY |X (y; x)
In the same way, the global mean integrated squared error criteria is
defined by integratingR AMSEQX (y, h) over DbX,n,h , for a fixed value of
y, AMISEQX (y, h) = IX,n,h AMSEF (y; x, h){fY |X (y; x)}−1 dx, at fixed y.
bX,Y,n,h
The global AMISE criteria for QX and QY defined as integrals over D
are both equal to
Z
AMSEF (y; x, h)
AMISEQ (h) = dx dy
IX,Y,n,h fY |X (y; x)
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 97

and they differ from the global criterion


Z
AMSEF (Y ; X, h)
AMISEF = AMSEF (y; x, h) dx dy = E .
IX,Y,n,h fX,Y (X, Y )

Some discretized versions of these criteria are the Asymptotic Mean


Average Squared Errors such as the AMASEF corresponding to AMISEF ,
AMASEQY (x, h) corresponding to AMISEQY (x, h) and EAMISEQY (X, h)
are respectively defined by
n
X AMSEF (Yi ; Xi )
AMASEF = n−1 ,
i=1
fX,Y (Xi , Yi , h)
Xn
AMSEF (Yi ; x, h)
AMASEQY (x, h) = n−1 ,
i=1
fY2 |X (Yi ; x, h)
n X
X n
AMSEF (Yi ; Xj , h)
AMASEQY (h) = n−2 ,
i=1 j=1
fY2 |X (Yi ; Xj , h)

which is the empirical mean of AMASEQY (X, h). Similar ones are defined
for QX and other means. Note that no computation of the global errors and
bandwidths require the computation of integrals of errors for the empirical
inverse functions, all are expressed through integrals or empirical means
of AMSEF with various weights depending on the density of X and the
conditional density of Y given X. The optimal window for AMASEQF (h)
is
Pn
−1/5 {vF (fX,Y )−1 }(Xi , Yi , h) 1/5
n [ Pi=1
n ] ,
4 i=1 {b2F (fX,Y )−1 }(Xi , Yi , h)

for AMASEQY (h) it is


Pn Pn
i=1 {vF (fX,Y )−2 }(Xi , Yj , h)
n −1/5
[ Pn Pj=1
n 2 −1 }(X , Y , h)
]1/5 .
4 i=1 j=1 {bF (fX,Y ) i j

The expressions of other optimal global bandwidths are easily written and
all are estimated by plugging estimators of the density, the bias bF and the
variance vF with another bandwidth. The derivatives of the conditional
distribution function are simply the derivatives of the conditional empir-
ical distribution function, as nonparametric regression curves. The mean
squared errors and the optimal bandwidths for the quantile process Q b X,n,h
are written in similar forms, with the bias bX and variance vX .
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

98 Functional estimation for density, regression models and processes

5.4 Estimation of the conditional density of Y given X

The conditional density fY |X (y; x) is deduced from the conditional distribu-


tion function FY |X (y; x) by derivative with respect to y and it is estimated
using the kernel K with another bandwidth h0
Z
fbY |X,n,h,h0 (y; x) = Kh0 (v − y) FbY |X,n,h (dv; x)
n Pn
X Kh (x − Xi )I{Yi ≤Yj }
= Kh0 (Yj − y) i=1
Pn .
j=1 i=1 Kh (x − Xi )

Proposition 5.5. If the conditional density fY |X belongs to the class


Cs (IXY ), s ≥ 2, the process (nhh0 )1/2 (fbY |X,n,h,h0 − fY |X ) converges weakly
to a Gaussian process with mean limn (nhn h0n )1/2 (E fbY |X,n,h,h0 −fY |X ), with
covariances zero and variance function vf = κ2 fY |X (1 − fY |X ).
If h0n = hn and s = 2, the process n1/2 hn (fbY |X,n,hn − fY |X ) converges
weakly to a Gaussian process with mean limn (nh6n )1/2 bf where

1 −1 (2)
bf = m2K fX {∂ 2 fX,Y /∂y 2 + ∂ 2 fX,Y /∂x2 − fY |X fX },
2
with covariances zero and variance function vf . The optimal bandwidth is
O(n−1/6 ).

Proof. By Proposition 5.1, if fY |X belongs to C2 (IXY ), its expectation


develops as
Z
E fbY |X,n,h,h0 (y; x) = Kh0 (v − y) E FbY |X,n,h (dv; x)
Z
= Kh0 (v − y) (FY |X + bF,n,h )(dv; x)
0
h2 ∂ 2 fY |X (y; x)
= fY |X (y; x) + m2K
2 ∂y 2
∂bF,n,h (y; x) 0
+ + o(h2 ) + o(h 2 ).
∂y

Assuming that h0 = h, its bias bf,n,h (y; x) = h2 bf (y; x) + o(h2 ), with

1 −1 (2)
bf = m2K fX {∂ 2 fX,Y /∂y 2 + ∂ 2 fX,Y /∂x2 − fY |X fX }.
2
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 99

Generally, the range of the variables X and Y differs and two distinct
kernels have to be used, the bias is then expressed as a sum of two terms
(1) 0 0
bf,n,h,h0 (y; x) = h2 bF (y; x) + h 2 bf (y; x) + o(h2 ) + o(h 2 ),
(1) 1 −1 (2)
bF = m2K fX {∂ 2 fX,Y /∂x2 − fY |X fX },
2
1 −1 2
bf = m2K fX ∂ fX,Y /∂y 2 .
2
The variance of the estimator is the limit of V arfbY |X,n,h,h0 (y; x) written
Z
Kh0 (u − y)Kh0 (v − y)Cov{FbY |X,n,h (du; x), FbY |X,n,h (dv; x)}
Z
−1
= (nh)−1 κ2 fX (x){ Kh20 (u − y) FY |X (du; x)
Z
− Kh0 (u − y)Kh0 (v − y) FY |X (du; x)FY |X (dv; x)}.
0
The first integral develops as I1 = h −1 {κ2 fY |X (y; x) + o(1)}, the second
R
integral I2 = Kh0 (u − y)Kh0 (v − y) FY |X (du; x)FY |X (dv; x) is the sum
2
of the integral outside the diagonal DY = {(u, v) ∈ IX,T ; |u − v| < 2h0n },
which is zero, and an integral restricted to the diagonal which is expanded
by changing the variables like in the proof of Proposition 2.2.
Let αh0 (u, v) = |u − v|/(2h0 ), u = y + h0 (z + αh0 ), v = y + h0 (z − αh0 )
and z = {(u + v)/2 − y}/(h0 )
Z
I2 = Kh0 (u − x)Kh0 (v − x)fY |X (u; x)fY |X (v; x) du dv
DY
Z
0
−1
=h { K(z − αh0 (u, v))K(z + αh0 (u, v)) dzdufY2 |X (y; x) + o(1)}
DY
0 0
and it is equivalent to h −1 κ2 fY2 |X (y; x) + o(h −1 ). The variance of the
estimator of the conditional density fY |X is then
vfY |X,n,h,h0 = (nhh0 )−1 vf (y; x) + o((nhh0 )−1 ),
vf (y; x) = κ2 fY |X (1 − fY |X ) (5.17)
and its covariances at every y 6= y 0 tends to zero. 
The asymptotic mean squared error for the estimator of the conditional
(1)2 0
density is M SEfY |X y; x; hn , h0n ) = h4n bF (y; x)+hn4 b2f (y; x)+(nhn h0n )−1 vf ,
it is minimal at the optimal bandwidth
( )1/5
0 1 vf
hn,opt,fY |X (y; x) = (y; x) .
nhn 4b2f
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

100 Functional estimation for density, regression models and processes

In this expression, hn can be chosen as the optimal bandwidth for the kernel
estimator of the conditional distribution function FY |X . The convergence
rate (nhn h0n )1/2 of the estimator for the conditional density is smaller than
the convergence rate of a density and than (nh2n )1/2 = O(n2/5 ), at the
optimal bandwidth.
Assuming that h0n = hn , the optimal bandwidth is now
( )1/6
1 vf
hn,opt,fY |X (y; x) = (y; x)
nh 2b2f

and the convergence rate for the estimator of the conditional density fY |X
is n1/3 which is larger than the previous rate with two optimal bandwidths.

The mode of the conditional density fY |X is estimated by the mode of its


estimator and the proof of Proposition 2.5 applies with the modified rates
of convergence and limit. The derivative of fbY |X,n,h (y; x) with respect to y
0
converges with the rate (nhn3 hn )1/2 that is n1/2 h2n for identical bandwidths.

5.5 Estimation of conditional quantiles for processes

Let (Zt )t∈[0,T ] = (Xt , Yt )t∈[0,T ] be a continuously observed stationary and


ergodic process with values in a metric space IXY and the regression model
Yt = m(Xt ) + σ(Xt )εt as in Section 3.10. Under the ergodicity condition
(2.13) for (Zt )t>0 , the conditional distribution function of the limiting dis-
tribution corresponds to FY |X (y; x) for a sample of variables and it is es-
timated from the sample-path of the process on [0, T ], similarly to (3.22),
with a bandwidth indexed by T
RT
1{Ys ≤y} KhT (x − Xs ) ds
FbY |X,T,hT (y; x) = 0
RT . (5.18)
0 KhT (x − Xs ) ds

RT
The numerator of (5.18), µ bRF,T,hT (y; x) = T1 0 1{Ys ≤y} KhT (x − Xs ) ds,
has the mean µF,T,hT (x) = IX KhT (x − u) FXs ,Ys (du, y) = F (y; x)f (x) +
h2T bF (y; x) + o(h2T ) with bµ (y; x) = ∂ 3 {F (x, y)}/∂x3 , for a conditional den-
sity of C2 (IXY ).

Proposition 5.6. Under the ergodicity conditions and for a conditional


density fY |X in class Cs (IXY ), the bias bF,T,hT and the variance vF,T,hT
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 101

of the estimator FbY |X,T,hT are


bF,T,hT (y; x) = hsT bF (y; x) + o(hsT ),
msK −1 (s)
bF (y; x) = f (x){∂ s+1 F (x, y)/∂xs+1 − F (x)fX (x)},
s! X
vF,T,hT (y; x) = (T hT )−1 {σF2 (y; x) + o(1)},
−1
σF2 (x) = κ2 fX (x)FY |X (y; x){1 − FY |X (y; x)}
−1
its covariances are CovY |X (y, y 0 ; x) = κ2 fX (x){FY |X (y ∧ y 0 ; x) −
FY |X (y; x)FY |X (y 0 ; x)} and zero for x 6= x0 .

The weak convergence of Proposition 5.2 is still satisfied with the conver-
gence rate (T hT )1/2 and the notations of Proposition 5.6. The quantile pro-
cesses of Section 5.2 are generalized to the continuous process (Xt , Yt )t>0
and their asymptotic behaviour is deduced by the same arguments from
the weak convergence of (T hT )1/2 (FbY |X,T,hT − FY |X ).
The conditional density fY |X (y; x) of the ergodic limit of the process is
estimated using the kernel KhT , with the same bandwidth as the estimator
of the distribution function FY |X
Z T (R T )
1 K h (x − X s )1 {Y ≤Y }
fbY |X,T,hT (y; x) = 0 T s t
KhT (Yt − y) RT dt.
T 0
0 KhT (x − Xs ) ds

Its expectation is approximated by


Z T
−1
fY |X,T,hT (y; x) = T E KhT (Yt − y){FY |X (Yt ; x) + h2T bF (Yt ; x)} dt
0
+ o(h2T )
Z

=E KhT (v − y) {FY |X + h2T bF }(v; x) dv + o(h2T )
IY ∂v
∂bF (y; x) h2T ∂ 2 fY |X (y; x)
= fY |X (y; x) + h2T + m2K
∂y 2 ∂y 2
+ o(h2T ),
where bF is the bias (5.6) of the estimator FbY |X,n,h . Let vf be defined
by (5.17), the variance of fbY |X,T,hT (y; x) has an expansion similar to the
variance of the estimator fbY |X,n,h (y; x)
vfY |X ,T,hT = (T h2T )−1 vf (y; x) + o((T h2T )−1 ).

Proposition 5.7. Under the ergodicity conditions and for a conditional


density fY |X in class Cs (IXY ), the bias and the variance of the estimator
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

102 Functional estimation for density, regression models and processes

fbY |X,T,hT are


bfY |X ,T,hT (y; x) = hsT bfY |X (y; x) + o(hsT ),
msK −1 (s)
bfY |X (y; x) = f (x){∂ s f (x, y)/∂xs − fY |X fX (x)},
s! X
vfY |X ,T,hT (y; x) = (T h2T )−1 {vf (y; x) + o(1)},
−1
vf (y; x) = κ2 fX (x)fY |X (y; x){1 − fY |X (y; x)}
its covariances are zero for x 6= x0 or y 6= y 0 .
The process T 1/2 hT (fbY |X,T,hT − fY |X ) converges weakly to a Gaussian
process with mean limT T 1/2 hs+1 T bfY |X , variance vf and covariances are
zero.

The optimal bandwidth for fbY |X,T,hT is O(T −1/(2s+2) ) and the convergence
rate of fbY |X,T,hT with the optimal bandwidth is T s/(2s+2) , hence it is T 1/3
for s = 2, and the expression of the optimal bandwidth is hT,opt,fY |X defined
in the previous section.

5.6 Inverse of a regression function

Consider the inverse function (5.4) for a regression function m of the model
(1.6), monotone on a sub-interval Im of the support IX of the regression
variable X. The kernel estimator of the function m is monotone on the
same interval with a probability converging to 1 as n tends to infinity,
by an extension of Lemma 5.1 to an increasing function. The maxima
and minima of the estimated regression function, considered in Section
3.7, define empirical intervals for monotonicity where the inverse of the
regression function is estimated by the inverse of its estimator. Let t belong
to the image Jm by m of an interval Im where m is increasing
b m,n,h (t) = m
Q b −1 (t) = inf{x ∈ Im : mb n,h (x) ≥ t}. (5.19)
n,h

This estimator is continuous like mb n,h , so that m b −1


b n,h ◦ m n,h = id on Jm , and
−1
b n,h ◦m
m b n,h = id on Im . The results proved in Section 5.2 for the conditional
quantiles adapt to the estimator Q b m,n,h . The bias and the variance of the
estimator (5.19) on Jm are deduced from those of the estimator m
b n,h , as
in Proposition 5.3
bm
bQm ,n,h (t) = −h2 (1) ◦ Qm (t) + o(h2 ),
m
2
σm
vQm ,n,h (t) = (nh)−1 (1)2 ◦ Qm (t) + o((nh)−1 ).
m
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 103

The weak convergence of (nh)1/2 (Q b m,n,h − Qm ) is a consequence of The-


orem 3.1 and it is proved by the same arguments as Theorem 5.1 and
proved by the same arguments. Let W1 be the Gaussian process limit of
−1
σm (nh)1/2 (m
b n,h − mn,h ) on Im .

b m,n,h − Qm } con-
Theorem 5.3. On Jm , the process UQm ,n,h = (nh)1/2 {Q
1/2
W1 + γ bm
verges weakly to UQm = ◦ Qm .
m(1)
The inverse of the estimator (3.22) for a regression function of an ergodic
and mixing process (Xt , Yt )t≥0 is (T hT )1/2 -consistent and it satisfies the
same approximations and weak convergence, with the notations and condi-
tions of Section 3.10.
Under derivability conditions for the kernel, the regression function
and the density of the variable X, the estimators m b n,h and its inverse
are differentiable and they belong to the same class which is supposed
to be sufficiently large to allow expansions of order s for estimator of
function m in Ck+s . The derivatives of the quantile are determined
b (1) (1) b m,n,h }−1 ,
by consecutive derivatives of the inverse: Q m,n,h = {mb n,h ◦ Q
Qb (2) = {m (2)
b {m b
(1)3 −1
} }◦Q b m,n,h .
m,n,h n,h n,h

Consider a partition of the sample in J disjoint groups of size nj , and


PJ
let Aj be the indicator of a group j, for j = 1, . . . , J. Let Y = j=1 Yj 1Aj
PJ
and X = j=1 Xj 1Aj where (Xj , Yj ) is the variable set in group j. For
j = 1, . . . , J, the regression model for the variables (Xji , Yji )i=1,...,nj is

Yji = mj (Xji ) + εji

where mj (x) = E(Y | 1Aj X = x) and the expectation in the whole sample
is defined from the probability pj of Aj , the conditional density of X given
Aj and the conditional regression functions given the group Aj
J
X
m(x) = πj (x)mj (x),
j=1
fj (x)
πj (x) = pj = P (Aj | X = x).
f (x)
The density of X in the whole sample is a mixture of J densities condi-
PJ
tionally on the group fX (x) = j=1 pj fj (x) and the ratio f −1 (x)fj (x) is
one if the partition is independent of X. The regression functions and the
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

104 Functional estimation for density, regression models and processes

conditional probability densities are estimated from the sub-samples


Pnj
i=1 Yji Kh (x − Xji )
b j,n,h (x) = P
m nj ,
i=1 Kh (x − Xji )
PJ Pnj
j=1 i=1 Yji Kh (x − Xji )
mb n,h (x) = PJ Pnj ,
j=1 i=1 Kh (x − Xji )
Pnj
Kh (x − Xji )
bj,n,h (x) = PJ i=1
π Pnj .
j=1 i=1 Kh (x − Xji )

The inverse processes Qb j,m,n,h are defined as in Equation (5.19) for each
group. The inverse of the conditional probability densities πj are estimated
using the same arguments.

5.7 Quantile function of right-censored variables

The product-limit estimator Fbn for a differentiable distribution function F


on R+ under right-censorship satisfies Equation (2.11) on [0, max Ti [
Z x
1 − FbnR (s− ) b
Fbn (x) = F − {1 − F (x)} d(Λn − Λ)(s),
0 1 − F (s)
denoted F − ψn where Eψn = 0 and supt≤τ kψn (t)k2 converges a.s. to
zero for every τ < τF = sup{x > 0; F (x) < 1}. Its quantile Q b n converges
therefore to QF in probability uniformly on [0, 1[. Let f be the density
probability for F and let G be the distribution function of the indepen-
dent censoring times, the process n1/2 ψn converges weakly to a centered
Gaussian process with covariance function
Z x∧y
CF (x, y) = {1 − F (x)}{1 − F (y)} {(1 − F )(1 − G− )}−1 dF,
0

at every x and y in [0, τ < τ0 ], where τ0 = τF ∧ τG . As a consequence, the


quantile process

b n − QF ) = −n1/2 ψn
n1/2 (Q ◦ QF + rn (5.20)
f
is unbiased and it converges weakly to a centered Gaussian process with
covariance function c(s, t) = CF (QF (s), QF (t)){f ◦QF (s)}−1 {f ◦QF (t)}−1 ,
for every s and t in [0, F (τ0 )]. The remainder term is such that
supt≤F (τ0 ) krn k is a oL2 (1).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of quantiles 105

A smoothed
R quantile process is defined by integrating the smoothed
process Kh (t−s) dQ bn (s) which is an uniformly consistent estimator of the
(1) R
derivative QF (t) = 1/{f ◦ QF (t)} of QF (t). Its mean is Kh (t − s) dQ(s)
2 (3)
and its bias bQF ,n,h = h2 m2K QF (t) if F belongs to C3 . Its variance
and covariance functions are deduced from the representation (5.20) of the
quantile, for s 6= t and as n tends to infinity
Z tZ s Z 1 Z 1
b n,h (t), Q
Cov{Q b n,h (s)} = n−1 1{u6=v} Kh (u − u0 )
0 0 −1 −1

× Kh (v − v 0 ) du dv d2 c(u0 , v 0 ) + (nh) −1
κ2 c(s ∧ t, s ∧ t) + o(n−1/2 )
= (nh)−1 κ2 c(s ∧ t, s ∧ t) + o(n−1/2 ).

5.8 Conditional quantiles with variable bandwidth

The pointwise conditional mean squared errors for the empirical conditional
distribution function and its inverses reach their minimum at a varying
bandwidth function. So the behaviour of the estimators with such band-
width is now considered. Conditions 4.1 are supposed to be satisfied in ad-
dition to 2.1 or 2.2. The results of Propositions 5.1 and 5.2 still hold with a
functional bandwidth sequence hn and approximation orders o(khn k2 ) for
the bias and o(nkh−1 n k) for the variance and a functional convergence rate
(nhn )1/2 for the process νY |X,n,h . This is an application of Section 4.3 with
the following expansion of the covariances.

Lemma 5.2. The covariance of FbY |X,n,h (y; x1 ) and FbY |X,n,h (y; x2 ) equals

2[nκ2 {hn (x1 ) + hn (x2 )}]−1


Z
× [vF (y; zn (x1 , x2 )) K(v − αn (v))K(v + αn (v)) dv
−2 (1)
+ δn (x1 , x2 )fX (zn (x, y)){(vΛ Y |XfX )(1) − m2 fX }(zn (x, y))
Z
× vK(v − αn (v))K(v + αn (v)) dv + o(khn k)].

The mean squared errors the functional bandwidth sequences are similar
to the MSE and MISE of Section 5.3.
The conditional quantiles are now defined with functional bandwidths
satisfying the convergence condition 4.1. The representation of the condi-
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

106 Functional estimation for density, regression models and processes

tional quantiles becomes


b Y,n,h = QY + (b
Q e Y,n,h − FY |X ◦ QY )(fY |X ◦ QY )−1
ηY,n,h + FY |X ◦ Q
+ oL2 ((nkhn k)−1/2 ),
b X,n,h = QX + (ζbX,n,h + FY |X ◦ Q
Q e X,n,h − FY |X ◦ QX )(fY |X ◦ QX )−1
+ oL2 ((nkhn k)−1/2 )
with ηbY,n,h defined by (5.8) and ζbX,n,h = FY |X ◦ Q
b X,n,h − FY |X ◦ Q
e X,n,h .
The expansions of their bias and variance are also written with the uni-
form norms of the bandwidths, generalizing Propositions 5.3 and 5.4 and
the weak convergence of the quantile processes is proved as for the kernel
regression function with variable bandwidth in Section 4.4.

5.9 Exercises

(1) Consider the quantile process Fbn−1 of a continuous


R 1 distribution function
F and the smooth quantile estimator Tbn (t) = 0 Kh (t − s)Fbn−1 (s) ds,
for t in [0, 1]. Prove its consistency and write expansions for its bias
and its variance.
(2) Determine the limiting distribution of the quantiles with respect to X
and Y for the estimator of the distribution function of Y ≤ y condi-
tionally on X ≤ x.
(3) Determine the limiting distribution of smoothed quantiles with respect
to X and Y for the estimator of the distribution function of Y ≤ y
conditionally on X ≤ x.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Chapter 6

Nonparametric estimation
of intensities for stochastic processes

6.1 Introduction

Let Nn = {Nn (t), t ≥ 0} be a sequence of counting processes defined on a


probability space (Ω, A, P ) associated to a sequence of random time vari-
ables (Ti )1≤i≤n
n
X
Nn (t) = 1{Ti ≤t} , t ≥ 0,
i=1
where Ti = inf{t; Nn (t) = i}, and let Fn = (Fnt )t∈R+ denote the his-
tory generated by observations of Nn and other observed processes before
t. The predictable compensator of Nn with respect to Fn is the unique
Fn− -measurable (or predictable) process N en such that Nn − N en is a Fn -
martingale on (Ω, A, P ). Consider a counting process Nn with a predictable
compensator
n Z t
X
Nen (t) = Yi (s)µ(s, Zi (s)) ds
i=1 0

where Yi and Zi are predictable processes with values in metric spaces Y


and Z and µ(s, z) = λ(s)r(z) is a strictly positive function for s > 0. This
model with a random variable or process Z is classical, when the observa-
P
tions are a right-censored counting process Nn and Yn (t) = ni=1 1{Ti ≥t} is
the counting process of the random times of Nn after t. The right-censorship
is defined by a sequence of random censoring variables (Ci )1≤i≤n so that a
Pn
censoring process NnC (t) = i=1 1{Ci ≤t} is partially observed with the pro-
P
cesses Nn , the observations are the processes Yn (t) = 1≤i≤n 1{Ti ∧Ci ≥t}
and the sequences of times (Ti ∧ Ci )1≤i≤n and indicators (δi )i = (1{Ti ∧Ci } )i
with values 1 if Ti is observed and 0 otherwise. All processes are observed
in an increasing time interval [0, τ ] such that Nn (τ ) = n tends to infinity.

107
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

108 Functional estimation for density, regression models and processes

With independent and identically distributed variables Ti with distribution


function FT and density
R t fT , the relationships between the survival function
1 − FT (t) = exp{− 0 λ(s) ds} and the hazard function λT = (1 − FT )−1 fT
are equivalent. With independent and identically distributed censoring
variables Ci , independent of the time sequence (Ti )1≤i≤n and with distri-
bution function FC , the hazard function of the censored counting process
P
Nn (t) = 1≤i≤n δi 1{Ti ≤t} is identical to λT . The aim of this chapter is to
define smooth estimators for the baseline hazard function and regression
function of intensity models and to compare them with histogram-type
estimators. Several regression models are considered, with parametric or
nonparametric regression functions.
Let Jn (t) = 1{Yn (t)>0} be the indicator of censored times occurring after
t. The baseline intensity λ of an intensity µn (t) = λ(t)Yn (t) is estimated for
t in [h, τ − h] by smoothing R t the Nelson (1972) estimator of the cumulative
hazard function Λ(t) = 0 λ(s) ds, which is asymptotically equivalent to
Rt −1 e
0 Jn (s)Yn (s) dNn (s) as Jn tends R
to 1 in probability. The unbiased Nelson
estimator is defined as Λ b n (t) = t Jn (s)Yn−1 (s) dNn (s), with the convention
0
0/0 = 0, and the function λ is estimated by smoothing Λ bn
Z 1
b
λn,h (t) = Yn−1 (s)Jn (s)Kh (t − s) dNn (s). (6.1)
−1

A stepwise estimator for λ is also defined on an observation time [0, τ ] as


the ratio of integrals over the subintervals (Bjh )j≤Jh of a partition of the
observation interval into Jh = h−1 τ disjoint intervals with length h tending
to zero. For every t belonging to Bjh , the histogram-type estimator of the
funtion λ is estimated at t by
Z Z
e
λn,h (t) = { Jn (s) dNn }{ Yn (s) ds}−1 (6.2)
Bjh Bjh

where the normalizing h of the histogram for a density is replaced by an


integral. Consider a multiplicative intensity µ(t, Zi (t)) = λ(t)r(Zi (t))Yi (t)
for the counting process Ni (t) = 1{Ti ≤t} of the i-th time variable. In the
Cox model, the regression function r defining the point process is exp(β T z),
with an unknown parameter β belonging to an open bounded set of Rd and
z in the metric space (Z, k·, k) of the sample-paths of a regression processes
Zi . The intensity conditionally on Ft is then the semi-parametric process
X
λ(t; β) = 1{Ti ≤t} r(Zi (t))λ(t)
1≤i≤n
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 109

(0)
and the estimators of λ and β are defined by the means of the process Sn
defined by weighting each term of the sum in YT by the regression function
at the jump time
n
X
Sn(0) (t; β) = rZi (t; β)1{Ti ≥t} ,
i=1

with the parametric function rZ (t; β) = exp{β T Z(t)}. For k = 1, 2, let also
(0) Pn ⊗k (0)
Sn (t; β) = i=1 rZi (t; β)Zi (t)1{Ti ≥t} be the derivatives of Sn with
respect to β, let Z ⊗0 = 1, Z ⊗1 = Z and Z ⊗2 be the scalar product. The
true regression parameter value is β0 , or r0 for the function r and the
predictable compensator of the point process Nn is
Z t
Nen (t) = Sn(0) (s; β)λ(s) ds . (6.3)
0
The function λ is estimated by smoothing the cumulative hazard function
of the Cox process, the parameter β by maximizing the partial likelihood
Z 1
bn,h (t; β) =
λ Jn (s){Sn(0) (s; β)}−1 Kh (t − s) dNn (s), (6.4)
−1
Y
βbn,h = arg max {rZi (t; β)λbn,h (Ti ; β)}δi ,
β
Ti ≤τ

with the convention 00 = 1 and Jn = 1{Yn >0} . The hazard function


is estimated by λ bn,h = λ bn,h (βbn,h ). The classical estimators of the Cox
model rely on the estimation of the function Λ(t) by the stepwise process
R
b n (t; β) = τ Jn (s){Sn(0) (s; β)}−1 dNn (s) at fixed β and the parameter β
Λ 0
T
of the exponential regression function rZ (t; β) = eβ Z(t) is estimated by
maximization of an expression similar to (6.4) where λ bn,h (Ti ; β) is replaced
b
by the jump of Λn (β) at Ti
Y
βbnC = arg max {rZi (t; β)Sn(0)−1 (Ti ; β)}δi .
β
Ti ≤τ

A stepwise estimator for the baseline intensity is now defined as


X Z Z
en,h (t; β) =
λ 1Bjh (t)[ Jn (s) dNn (s)][ Sn(0) (s; β) ds]−1 , (6.5)
j≤Jh Bjh Bjh
Y
βen,h = arg max en,h (Ti ; β)}δi ,
{rZi (Ti ; β)λ
β
Ti ≤τ

en,h = λ
and λ en,h (βen,h ). More generally, a nonparametric function r is esti-
mated by a stepwise process rbn,h defined on each set Bjh of the partition
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

110 Functional estimation for density, regression models and processes

(Bjh )j≤Jh , centered at ajh . Let also (Dlh )l≤Lh be a partition of the val-
ues Zi (t), i ≤ n, centered at zlh . The function r is estimated in the form
P
ren,h (Z(t)) = l≤Lh ren,h (zlh )1Dlh (Z(t))
X Z Z
en,h (t; r) =
λ 1Bjh (t) Jn (s) dNn (s) [ Sn(0) (s; r) ds]−1 ,
j≤Jh Bjh Bjh
Y X
ren,h (zlh ) = arg max [{ en,h (Ti ; rlh )]δi ,
rl 1Dlh (Zi (Ti ))}λ (6.6)
rl
Ti ≤τ l≤Lh
(0) P
where Sn (t; r) = ni=1 rZi (t)1{Ti ≥t} is now defined for a nonparametric
regression function, then λ en,h (t, Z) = λen,h (t, ren (Z(t))). A kernel estimator
for the functions λ is similarly defined by
Z 1
b
λn,h (t; r) = Jn (s){Sn(0) (s; r)}−1 Kh (t − s) dNn (s) . (6.7)
−1
An approximation of the covariates values at jump times by z when they are
sufficiently close allows to build a nonparametric estimator of the regression
function r like β in the parametric model for r
X n Z 1
rbn,h (z) = arg max bn (s; rz )}Kh (z − Zi (s)) dNi (s),
{log rz (s) + log λ 2
rz −1
i=1
where h2 = hn2 is a bandwidth sequence satisfying the same conditions as
bn (t, Z) = λ
h, and λ bn (t; rbn (t, Z(t)).
The L2 -risk of the estimators of the intensity functions splits into a
squared bias and a variance term and the minimization of the quadratic risk
provides an optimal bandwidth depending on the parameters and functions
of the models and having similar rates of convergence, following the same
arguments as for the optimal bandwidths for densities.

6.2 Risks and convergences for estimators of the intensity

Conditions for expanding the bias and variance of the estimators are added
to Conditions 2.1 and 2.2 of the previous chapters concerning the kernel and
the bandwidths. For the intensities, they are regularity and boundedness
conditions for the functions of the models and for the processes.

Condition 6.1.
(1) As n tends to infinity, the process Yn is positive with a probability
tending to 1, i.e. P {inf [0,τ ] Yn > 0} tends to 1, and there exists a
function g such that sup[0,τ ] |n−1 Yn − g| tends a.s. to zero;
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 111


(2) 0 g −1 (s)λ(s) ds < ∞
(3) The functions λ and g belong to Cs (R+ ), s ≥ 2.

6.2.1 Kernel estimator of the intensity


Let
Z 1
λn,h (t) = Jn (s)Kh (t − s)λ(s) ds, for t ∈ [h, τ − h],
−1
be the expectation of the kernel estimator (6.1) and let λn (t) = Jn (t)λ(t),
defined as λ(t) on the random interval In,τ = 1{Jn =1 }[0, τ ] which may be
right-censored. Let also In,h,τ = 1{Jn =1 }[h, τ − h] be the interval where all
convergences will be considered. Since Jn (t) − 1 tends uniformly to zero in
probability, supt∈[0,τ ] |λ(t) − λn,h (t)| tends to zero in probability.
Proposition 6.1. Under Conditions 2.1 and 6.1 with hn converging to zero
and nh to infinity
(a) supt∈In,h,τ |λbn,h (t)−λ(t)| converges to zero in probability. (b) For every
t in In,h,τ , the bias λn,h (t) − λ(t) of the estimator is
hs
bλ,n,h (t; s) = msK λ(s) (t) + o(hs ),
s!
denoted h2 bλ (t) + o(h2 ), its variance is
V ar{λbn,h (t)} = (nh)−1 κ2 g −1 (t)λ(t) + o((nh)−1 ),
also denoted (nh)−1 σλ2 (t) + o((nh)−1 ) and the covariance of λ bn,h (t) and
b
λn,h (s) is
Z
Cov{λ bn,h (t)} = (nh)−1 { λ ( s + t ) K(v − αh )K(v + αh )dv + o(1)}
bn,h (s), λ
g 2
if αh = |t − s|/2h ≤ 1 and zero otherwise, with uniform approximations on
In,h,τ . The Lp -norms of the estimator are
sup kλ bn,h (t) − λn,h (t)kp = 0((nh)−1/p )
t∈In,h,τ
bn,h (t) − λ(t)kp = 0((nh)−1/p + hs ).
and supt∈In,h,τ kλ

Proof. (a) Let MΛ,n = Λ b n − Λ, its predictable compensator is N fn . By


Lenglart’s inequality applied to the martingale MΛ,n , for every n ≥ 1
Z 1
P ( sup |λ bn,h − λ| ≥ η) ≤ η −2 E sup Kh2 (t − u)(Jn Yn−1 λ)(u) du
[h,τ −h] t∈[h,τ −h] −1
−2 −1 −1
=η (nh) κ2 sup g (t)λ(t)
t∈[0,τ ]
and the upper bound tends to zero as n tends to infinity.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

112 Functional estimation for density, regression models and processes

(b) For every t in In,h,τ , the bias bλ,n,h (t) = λh (t) − λn,h (t) develops as
Z 1
bλ,n,h (t) = Kh (t − u)E{Jn (u)λ(u) − Jn (t)λ(t)} du
−1
Z 1
= E{Jn (t + hz)λ(t + hz) − Jn (t)λ(t)}K(z) dz
−1
s
h (s)
= λ (t) + o(hs ),
s!
where EYn (s) = P (Yn (s) > 0) = P (maxi≤n Ti > s) = 1 − FTn (s) belongs to
]0, 1[ for every s ≤ τ , for independent times Ti , i ≤ n. Its variance is
Z 1
V ar{λbn,h (t)} = E Kh2 (t − u)Jn (u)Yn−1 (u)λ(u) du
−1
Z
= (nh)−1 K 2 (z)EJn (t + hz)g −1 (t + hz)λ(t + hz) dz

= (nh)−1 κ2 g −1 (t)λ(t) + o((nh)−1 ).


R
The covariance of λ bn,h (t) and λ bn,h (s) is Kh (s − u)Kh (t −
−1
u)Jn (u)Yn (u)λ(u) du, it is zero if αh = |x − y|/(2h) > 1 and, if αh ≤ 1,
it is approximated by a change of variables as in Proposition 2.2 for the
density, under Conditions 6.1. R
The Lp -moment of λ bn,h (t)−λn,h (t) = 1 Kh (t−u)Jn (u)Y −1 (u) d(Nn −
−1 n
Nen )(u) are calculated using the martingale property of Mn = Nn − N en and
its stochastic integrals
bn,h (t) − λn,h (t)|3 = (nh)−1 {κ2 g −1 (t)λ2 (t) + o(1)},
E|λ
bn,h (t) − λn,h (t)|4 = (nh)−1 {κ2 g −1 (t)λ3 (t) + o(1)},
E|λ
bn,h (t) − λn,h (t)|5 = (nh)−1 {κ2 g −1 (t)λ4 (t) + o(1)}.
E|λ

The higher order moments are deduced by iterations. In each case, the
main term is expressed as the integral of the product of a squared kernel
at a time Ti and other kernel terms at Tjk , where all time variables are
independent. 
For p = 2, E{λ bn,h (t) − λn,h (t)}2 develops on In,h,τ as the sum of
a squared bias and a variance terms {λn,h (t) − λ(t)}2 + V ar{λ bn,h (t)}
and its first order expansions are (s!) msK h λ (t) + o(h2s ) +
−2 2 2s (s)2

(nh)−1 κ2 g −1 (t)λ(t) + o(n−1 h−1 ). The asymptotic mean squared error


1
AM SE(t; b
λn,h ) = (nh)−1 κ2 g −1 (t)λ(t) + m2 h2s λ(s)2 (t)
(s!)2 sK
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 113

is minimum for the bandwidth function


 s!(s − 1)! κ2 λ(t) 1/(2s+1)
hAMSE,n (t) = n−1/(2s+1) .
2m2sK g(t)λ(s)2 (t)
The global asymptotic mean integrated squared error for λ bn,h at t is
Z τ Z τ
bn,h ) = (nh)−1 κ2 1
AM ISE(λ g −1 (t)λ(t) dt + 2
m 2
sK h 2s
λ(s)2 (t) dt,
0 (s!) 0
it is minimum for the global bandwidth
R τ −1

−1/(2s+1) s!(s − 1)! κ2 0 g (t)λ(t) dt 1/(2s+1)
hAMISE,n = n 2
Rτ .
(s)2
2msK 0 λ (t) dt
It is estimated by
R τ −1
b
b −1/(2s+1) s!(s − 1)! κ2 0 Yn (t) dΛn (t) 1/(2s+1)
hAMISE,n,h = n [ R τ (s)
] .
2m2sK 0 {λ b (t)}2 dt
n,h
Another integrated asymptotic mean squaredR error is the average of
τ
AM SE(T ) for the intensity, E{AM SE(T )} = 0 AM SE dFT also writ-
ten Z
AM ISEn (h; FT ) = {h2s bλ (t) + (nh)−1 σλ2 (t)}{1 − F (t)} dΛ(t)

and it is estimated by plugging estimators of the intensity, bλ and σλ2 into


the empirical mean
Z τ
n −1 b n,h(t)}Y −1 (t) dNn (t).
{h2s b2λ (t) + (nh)−1 σλ2 (t)} exp{−Λ n
0
Its minimum empirical bandwidth is
n Rτ o1/(2s+1)
b b −2
b −1/(2s+1) 0 λn,h exp{−Λn,h }Yn dNn
hλ,n,h = n R τ (s)
.
2m2sK 0 {λb }2 exp{−Λ b n,h}Yn−1 dNn
n,h

An estimator of the derivative λ(k) or its integral are defined by the means
of the derivatives K (k) of the kernel, for k ≥ 1
Z
b(k) (t) = K (k) (t − s)Jn (s)Y −1 (s) dNn (s).
λn,h h n

Proposition 2.3 established for the densities is generalized to the intensity


λ. Lemma 2.1 allows to develop the mean of the estimator of the first
derivative as Z
(1) (1)
λn,h (t) = Kh (u − t)λ(u) du
Z Z
h2 (3)
= −λ (t) zK (z) dz − λ (t) z 3 K (1) (z) dz + o(h2 )
(1) (1)
6
2
h
= λ(1) (t) + m2K λ(3) (t) + o(h2 ),
2
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

114 Functional estimation for density, regression models and processes

(1) s
for an intensity of C3 or λn,h (t) = λ(1) (t) + hs! msK λ(s) (t) + o(hs ) for an
R
b(1) is (nh3 )−1 g −1 (t)λ(t) K (1)2 (z) dz +
intensity of Cs . The variance of λ n,h
o((nh3 )−1 ). The optimal local bandwidth for estimating λ(1) belonging to
Cs is therefore R
 λ(t) K (1)2 (z) dz 1/(2s+3)
hAMSE (λ(1) ; t) = n−1/(2s+3) s!(s − 1)! .
2m2sK g(t)λ(3)2 (t)
For the second derivative of the intensity, the estimator λb(2) is the deriva-
n,h
b(1) expressed by the means of the second derivative of the ker-
tive of λ n,h
nel. For a function λ in class C4 , the expectation of the estimator
b(2) is λ(2) (t) = λ(2) (t) + h2 m2K λ(4) (t) + o(h2 ), so it converges uni-
λ n,h n,h 2
formly to λ(2) . The bias of λ b(2) is h2 m2K λ(4) (t) + o(h2 ) and its variance
R n,h 2
(nh5 )−1 g −1 (t)λ(t) K (2)2 (z) dz + o((nh5 )−1 ).
Proposition 6.2. Under Conditions 2.2, for every integers k ≥ 0 and
b(k) of the
s ≥ 2 and for intensities belonging to class Cs , the estimator λn,h
k-order derivative λ(k) has a bias O(hs ) and a variance O((nh2k+1 )−1 ) on
In,h,τ . Its optimal local and global bandwidths are O(n−s/(2k+2s+1) ) and
the optimal L2 -risks are O(n−s/(2k+2s+1) ).
Consider the normalized process
bn,h (t) − λ(t)}, t ∈ In,h,τ .
Uλ,n,h (t) = (nh)1/2 {λ
The tightness and the weak convergence of Uλ,n,h on In,h,τ are proved
by studing moments of its variations and the convergence of its finite di-
mensional distributions. For independent and identically distributed obser-
vations of right-censored variables, the intensity of the censored counting
process has the same degree of derivability as the density functions for the
random times of interest.
Lemma 6.1. Under Conditions 6.1, for every intensity of Cs there exists
a constant C such that for every t and t0 in In,h,τ satisfying |t0 − t| ≤ 2h

V ar λ bn,h (s) 2 ≤ C(nh3 )−1 |t − t0 |2 .
bn,h (t) − λ

Proof. Let t0 and t in In,h,τ , the variance of λ bn,h (t0 ) − λ


bn,h (t) develops
according to their variances given by Proposition 6.1 and the covariance
between both terms which is zero if |t − t0 | > 1 as established in the same
proposition. The second order moment E|λ bn,h (t) − λbn,h (t0 )|2 develops as
R
{Kh (t−u)−Kh (t0 −u)}2 Jn (u)Yn−1 (u)λ(u) du and it is a O((t−t0 )2 n−1 h−3n ),
by the same approximation as for the proof of Lemma 2.2 and the uniform
convergence of Jn Yn−1 . 
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 115

Theorem 6.1. Under Conditions 6.1, for a density λ of class Cs (Iτ ) and
with nh2s+1 converging to a constant γ, the process
bn,h − λ}1{I
Uλ,n,h = (nh)1/2 {λ n,h,τ }

converges weakly to Wλ + γ 1/2 bλ , where Wλ is a continuous Gaussian pro-


cess on Iτ with mean zero and covariance E{Wλ (t0 )Wλ (t)} = σλ2 (t)δ{t0 ,t} ,
at t0 and t in Iτ , and σλ2 (t) = g −1 (t)λ(t).

Proof. The weak convergence of the finite dimensional distributionsR of


the process Wλ,n,h (t) = (nh)1/2 (λ bn,h − λn,h )(t) = (nh)1/2 1 Kh (t −
−1
u)Jn (u)Yn−1 (u) dMn (u) on In,h,τ is a consequence of the convergence of its
variance and of the weak convergence of the martingale n−1/2 Mn to a con-
R t∧t0
tinuous Gaussian process with mean zero and covariance 0 g(u)λ(u) du
at t and t0 . The covariance between Wλ,n,h (t) and Wλ,n,h (t0 ), for t 6= t0 , is
approximated by
Z
n−1 Kh (t − u)Kh (t0 − u)g(u)−1 λ(u) du
Z
1{0≤α<1} λ λ 
= { (t) + (t0 )} K(v − α)K(v + α)dv + o((nh)−1 ),
2nh g g
where αRh = |t0 − t|/(2h) tends to infinity as h tends to zero, then the
integral K(v − αh )K(v + αh )dv tends to zero. Lemma 6.1 and the bound
E{λn,h (t)−λ(t)−λn,h (t0 )+λ(t0 )}2 ≤ C 0 (nh)−1 |t−t0 |2 imply that the mean
of the squared variations of Un,h are O(h−2 |t − t0 |2 ) if |t − t0 | ≤ 2hn and
O(|t − t0 |2 ) if |t − t0 | > 2hn . The process Un,h is therefore tight. 

Corollary 6.1. The process


sup σλ−1 (t)|Uλ,n,h (t) − γ 1/2 bλ (t)|
t∈In,h,τ

converges weakly to supIτ |W1 |, where W1 is the Gaussian process with mean
zero, variance 1 and covariances zero.
For every η > 0, there exists a constant cη > 0 such that
Pr{ sup |σλ−1 (Uλ,n,h − γ 1/2 bλ ) − W1 | > cη }
In,h,τ

tends to zero as n tends to infinity.

The Hellinger distance between two probability measures P1 and P2


defined by intensity functions λ1 and λ2 is
Z
λ1 1 − F1 1/2
h2 (P1 , P2 ) = {1 − ( )1/2 ( ) } dF1
λ2 1 − F2
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

116 Functional estimation for density, regression models and processes

R
and it is also written h2 (P1 , P2 ) = {1 − ( λλ12 )1/2 e−(Λ1 −Λ2 )/2 } dF1 . The
estimator of a function λ satisfies
Z b b
bn,h , λ) = {1 − ( λn,h 1 − Fn )1/2 } dF
h2 (λ
λ 1−F
Z b
λn,h 1 − Fbn 1/2
≤ {( ) − 1} d(Fbn − F )
λ 1−F
bn,h , λ) to zero is nh1/2
and the convergence rate of h2 (λ n .

A varying bandwidth estimator is defined for multiplicative intensities


under Condition 4.1, with the optimal convergence rate. The bias and the
variance of the estimator are modified as
hn (t)s
bλ,n,hn (t) (t) = m2K λ(s) (t) + o(khn k2 ),
s!
its variance is
bn,h (t) (t)} = (nhn (t))−1 κ2 g −1 (t)λ(t) + o((n−1 kh−1 k),
V ar{λ n n

and Ekλ bn,h (t) (t) − λn,h (t) (t)kp = 0((n−1 kh−1 k)1/p ). The covariance of
n n n
bn,h (t) (t) and λ
λ bn,h (t) (t0 )} equals
n n
Z
E Khn (t) (t − u)Khn (t0 ) (t0 − u)Yn−1 (u) dΛ(u)
Z
2 −1 0
= {(g λ)(z n (t, t )) K(v − αn (v))K(v + αn (v)) dv
n{hn (t) + hn (t0 )}
Z
+ δn (x, y)(g −1 λ)(1) (zn (x, y)) vK(v − αn (v))K(v + αn (v)) dv + o(khn k)}

with αn (x, y, u) = 21 {(u − x)h−1 −1


n (x) − (u − y)hn (y)} and v = {(u −
−1 −1
x)hn (x) + (u − y)hn (y)}/2. Lemma 4.2 is fulfilled for the mean
squared variations of the process λ bn,h (t) (t) which satisfy E|λbn,h (t) (t) −
n n
b 0 2 −1 −1 3 0 2 −1 0 −1 0
λn,h(t0 ) (t )| = O(n khn k (t − t ) ), if |thn (t) − t hn (t )| ≤ 1. Other-
wise, the mean squared variations of λ bn,h are zero, this implies the weak
1/2 b
convergence of the process (nhn (t)) {λn,hn (t) (t) − λ(t)}I{t ∈ In,khn k,τ }
to the process Wf (t) + h1/2 (t)bf (t), where Wf is a continuous centered
Gaussian process on Iτ with covariance σλ2 (t)δ{t,t0 } at t and t0 .

6.2.2 Histogram estimator of the intensity


The histogram estimator (6.2) for the intensity λ is a consistent estimator
P
as h tends to zero and n to infinity. Let Kh (t) = h−1 j∈Jτ,h 1Bjh (t) be
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 117

the kernel corresponding to the histogram, the histogram estimator (6.2) is


defined as the ratio of two stochastic integrals on the same subintervals of
the partition of [0, τ ]. Its expectation is approximated by the ratio of the
expectations of each integral, for t in Bj,h
R R R
B
g(s)λ(s) ds (n−1 Bj,h Jn dMn ) Bj,h (n−1 Yn − g)(s) ds
Eλen,h (t) = R j,h
−E R
Bj,h g(s) ds ( Bj,h g(s) ds)2
R
en,h (t)(
λ {n−1 Yn − g)(s) ds}2
Bj,h
+E R
( Bj,h g(s) ds)2
R
B
g(s)λ(s) ds
= Rj,h + o(n−1/2 h1/2 ) = λn,h (aj,h ) + o(h) (6.8)
Bj,h g(s) ds

en,h (t) can be approximated by an expan-


uniformly on Iτ,n,h . The bias of λ
sion, assuming only that λ belongs to C1 (R+ ), it is written
X
ebλ,h (t) = 1Bjh (t){λ(ajh ) − λ(t)} + o(h)
j≤Jτ,h
X
= 1Bjh (t)|t − ajh |λ(1) (t) + o(h) = O(h)
j≤Jτ,h

also denoted ebλ,h (t) = hebλ (t) + o(h) and it is larger than the bias of kernel
estimator. Assuming that V ar{n−1/2 (Yn − g)(t)} = O(1) for every t, the
variance of the denominator of λ en,h (aj,h ) equals (nh)−1 V arn−1/2 Yn (ajh ) +
o((nh)−1 ) = O((nh)−1 . For every t in Bj,h , the variance of λ en,h (t) is
R R
 B
g(s)λ(s) ds 2  g(s)λ(s) ds 2
ven,h (t) = E λen,h (t)− R j,h
− Eλ en,h (t)− BRj,h ,
Bj,h g(s) ds Bj,h g(s) ds

and the last term is a o(n−1/2 h1/2 ), by (6.8). Following the same calculus
as for the variance of the nonparametric estimator of a regression function,
Z
 −2 
ven,h (t) = g(t) + o(h) V ar{(nh)−1 Jn dNn }
Bj,h
Z Z

− 2λ(t)Cov{(nh)−1 Jn dNn , (nh)−1 Yn ds}
Bj,h Bj,h
Z
+ λ2 (t)V ar{(nh)−1 Yn ds} + o((nh)−1 )
Bj,h
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

118 Functional estimation for density, regression models and processes

R
with V ar{(nh)−1 Bj,h
Jn dNn } = (nh)−1 (gλ)(ajh ) + o((nh)−1 ) and the
covariance term
Z Z
−2
(nh) E{ Jn (dNn − Yn dΛ), (Yn − g) ds} = O((nh)−1 )
Bj,h Bj,h

en,h (t) can be written ven,h (t) = (nh)−1 veλ (t) +


therefore the variance of λ
−1
o((nh) ). The asymptotic mean squared error of the estimator λ en,h (t) is
minimal for the bandwidth
1/3
hn (t) = n−1/3 {2eb2λ (t)}−1/3 veλ (t).

This expression and the AMSE do not depend on the degree of derivability
of the intensity.

6.3 Risks and convergences for multiplicative intensities

The estimators (6.4) for the exponential regression of the intensity are spe-
cial cases of those defined by (6.7) in a multiplicative intensity with ex-
planatory predictable processes and an unknown regression R function r. For
every t in In,h,τ , the mean of λbn,h (t; r) is still λn,h (t) = 1 Kh (t−s)λ(s) ds
−1
and their degree of derivability is the same as K.
With a parametric regression function r, the convergence in the
first condition of 6.1 is replaced by the a.s. convergence to zero of
(k) (k) (k) (k)
supt∈[0,τ ] supkβ−β0 k≤ε |n−1 Sn (t; β) − s0 (t)|, where s0 = s0 (β0 ), for
k = 0, 1, 2, and ε > 0 is a small real number. In a nonparamet-
ric model, this condition is replaced by the a.s. convergence to zero of
(k) (k) (k)
supt∈[0,τ ] supkr−r0 k≤ε |n−1 Sn (t; r) − s0 (t)|, where s0 = s(k) (r0 ), for
k = 0, 1, 2.
The previous conditions 6.1 are modified by the regression function. For
expansions of the bias and the variance, they are now written as follows.

Condition 6.2.
(k) (k)
(1) As n tends to infinity, the processes n−1 Sn (t; β) and n−1 Sn (t; r) are
positive with a probability tending to 1 and the function defined by
(k)
s(k) (t) = n−1 ESn (t) belongs to class C2 (R+ );
(0)
(2) The function pn (s) = Pr(Sn (s; r) > 0) belongs to class C2 (R+ ) and
p
R nτ(τ, r0 ) converges to 1 in probability;
(3) 0 r(z)g −1 (s)λ(s) ds < ∞
(4) The functions λ and g belong to C2 (R+ ) and r belongs to Cs (Z).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 119

6.3.1 Models with nonparametric regression functions


The regression funtion is estimated by rbn,h (z) = arg maxrz Lbn,h (z; r) where
n Z
X
Lbn,h (z; r) = n−1 bn (s; rz )} Kh (z − Zi (s)) dNi (s)
{log rz (s) + log λ
i=1 Iτ,n,h

for t in In,h,τ . Its expectation Ln (z; r) = E Lbn,h (z; r) is expanded as


Z
Ln (z; r) = E bn (s; rz )} Kh (z − Z(s))S (0) (s; rZ )λ(s) ds
log{rz (s)λ n
Iτ,n,h
Z
bn (s; rz )}{S (0) (s; rz ) + h2
=E log{rz (s)λ n κ2 Sn(2) (s; rz )}
Iτ,n,h 2
× fZ(s) (z)λ(s) ds + o(h2 ),
where fZ(s) is the marginal density of Z(s). It follows that rbn (z) converges
uniformly, in probability, to the value r0 (z) which minimizes the limit of
n−1 Lbn (z; rz )
Z 1
L0 (z; rz ) = {log rz (s) + log λ(s; rz )}s(0) (s; rz )fZ(s) (z) λ(s) ds.
−1
(k)
Let Lbnbe the k-th order derivative of Lbn (z; r) with respect to z, their
(k) (k)
limits are denoted L0 and their expectations Ln , for k = 1, 2.

Proposition 6.3. Under Condition 6.2, the process (nh)1/2 (b rn,h − r)


converges weakly to a Gaussian process with mean zero and variance
(1)
(L(2) )−1 V(1) {L(2) }−1 (z; r) where V(1) = limn→∞ nhV arLbn .

Proof. The first derivative of Lbn with respect to rz and its expectation
bn
depend on the derivative of λ
Z
b(1) (t; rz ) = −
λ Kh (t − s)Sn(1) (s; rz )Sn(0)−2 (s; rz ) dΛ(s),
n
In,h,τ

n Z
X b(1)
1 λn (s; rz )
Lb(1)
n = n
−1
{ + } Kh (z − Zi (s)) dNi (s),
In,h,τ rz (s) bn (s; rz )
λ
i=1
Z 1 b(1)
1 λn (s; rz )
L(1)
n = E { + } Kh (z − Z(s))Sn(0) (s; rz ) fZ(s) (z)λ(s) ds
−1 rt,z (s) b
λn (s; rz )
(1) (1)
is such that V arLbn (z; r) = O((nh)−1 ), therefore (nh)1/2 (Lbn − L(1) )(z; r)
is bounded in probability, and the second derivative is a Op (1). By a Taylor
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

120 Functional estimation for density, regression models and processes

(1)
expansion of Lbn (z; r) in a neighbourhood of the true value r0z ≡ r0 (z) of
the regression function at z
Lb(1)
n (z; r)−L
(1)
(z; r0 ) = {(rz (0)−r0z (s)}T Lb(2) rn,h −r0 )2 (z(s)))
n (z; r0 )+Op ((b

and, by an inversion, the centered estimator (b rn,h − r0 )(z) is approxi-


b(2) −1 (1)
mated by the variable {−Ln (z; r0 )} L )(z; r0 ) the variance of which
is a O((nh)−1 ). For every z in Zn,h = {z ∈ Z; supz0 ∈∂Z kz − z 0 k ≥ h}, the
variable (nh)1/2 (b
rn,h − r0 )(z) converges weakly to a Gaussian variable with
(1)
variance (L ) limn nhn V arLbn {L(2) }−1 (z; r0 ).
(2) −1


Proposition 6.4. The processes (nh)1/2 (λ bn,h − λn,h )(r0 )1I and
τ,n,h
Z (1)
bn,h −λ)+(nh)1/2 (b Sn
(nh)1/2 (λ rn,h −r0 ) (0)3
(s; rbn,h )Kh (·−s) Jn (s) dNn (s)}
Sn
converge weakly to the same continuous and centered Gaussian process on
Iτ , with covariances zero and variance function vλ = κ2 s(0)−1 (r0 )λ.

Proof. The bias of Λb n,h is


Z T
bΛ,n,h (t) = − E{Sn(0) (s, r)Sn(0)−1 (s, rbn,h ) − 1} dΛ(s)
0
Z T (1)
Sn (s, rbn,h ) + Op (n−1/2 )
= rn,h − r)(s)
E{(b (0)
} dΛ(s)
0 Sn (s, r)S, rbn,h )

and it is also equivalent to the mean of


Z T
rn,h − r)(s)Sn(1) (s, rbn,h ){Sn(0) (t, rbn,h )}−3 dNn (s).
(b
0

The bias of the estimator λ bn,h is obtained by smoothing the bias bΛ,n,h (t)
and its first order approximation can be written as the mean of
Z 1 (1)
bbλ,n,h (t) = (b Sn (s, rbn,h )
rn,h − r)(t) Kh (t − s) (0) dNn (s).
−1 {Sn (t, rbn,h )}3 

6.3.2 Models with parametric regression functions


T
In the exponential regression model rβ (z) = eβ z with observations
(Xi , δi )1≤i≤n , the estimator of the parameter β minimizes Lbn (β) which
is written
n
X
Lbn (β) = n−1 bn (Ti ; β)} .
δi {β T Zi (Ti ) + log λ
i=1
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 121

Its mean
Z
Ln (β) = {β T s(1) b s(0)
n (s; β0 ) + E log{λn (s; β)}bn (s; β0 )}λ(s) ds
In,h,τ


converges to L(β) = 0
{β T s(1) (s; β0 ) + {log λ(s; β)}s(0) (s; β0 )}λ(s) ds.

Proposition 6.5. Under Condition 6.2, n1/2 (βbn,h − β0 ) converges weakly


to a Gaussian variable with mean zero. The processes (nh)1/2 (λ bn,h −
λn,h )(β0 ) 1Iτ,n,h and
Z (1)
bn,h −λ)(βbn )+n1/2 (βbn,h −β0 )T Sn
(nh) 1/2
(λ (0)2
(s; βbn,h )Kh (·−s) dNn (s)
Iτ,n,h Sn

converge weakly to the same continuous and centered Gaussian process with
covariances zero and variance function vλ = κ2 s(0)−1 (β0 )λ.

Proof. The derivatives with respect to β of the partial likelihood Lbn are
written
n
X b(1)
λn (Ti ; β)
Lb(1)
n (β) = n
−1
δi {Zi (Ti ) + },
b
λn (Ti ; β)
i=1
b(1)⊗2
λn
b(2)
λn 
Lb(2)
n (β) = −n −1
δ i − (Ti ; β),
b2
λ bn
λ
n

bn,h with respect to β are written


where the derivatives of λ
Z (1)
b(1) (t; β) = −n−1 Sn
λn (0)2
(s; β)Kh (t − s) dNn (s),
Iτ,n,h Sn
Z (1)⊗2 (2)
b(2) (t; β) = n−1 Sn Sn 
λn 2 (0)3
− (0)2
(s; β)Kh (t − s) dNn (s).
Iτ,n,h Sn Sn
(k)
As h tends to zero, the predictable compensators of Lbn (β), k = 1, 2,
develop as
Z b(1)
λn (s; β) (0)
L(1)
n (β) = n
−1
{Sn(1) (s; β0 ) + S (s; β0 )}λ(s) ds,
In,h,τ bn (s; β) n
λ
Z b(1)⊗2 b(2)
λn λn 
L(2)
n (β) = −n
−1
− (s; β)Sn(0) (s; β0 )λ(s) ds.
In,h,τ b2
λ bn
λ
n
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

122 Functional estimation for density, regression models and processes

(1)
bn , k = 1, 2, is deduced from the martingale property
The expectation of λ
e
of Nn − Nn
Z (1)
Sn (s; β) (0)
λ(1)
n (t; β) = (0)2
Sn (s; β0 )λ(s)Kh (t − s) ds,
In,h,τ Sn (s; β)
(1)
Sn (t; β)
= (0)2
Sn(0) (t; β0 )λ(t)
Sn (t; β)
(1)
m2K h2  Sn (t; β) (0) (2)
+ (0)2
Sn (t; β0 )λ(t) + o(h2 ),
2 Sn (t; β)
Z (1)⊗2 (2)
Sn Sn 
λ(2)
n (t; β) = 2 (0)2
− (0)
(s; β)Kh (t − s)λ(s) ds
In,h,τ Sn Sn
(1)⊗2 (2)
Sn Sn 
= 2 (0)2
− (0)
(t; β)λ(t)
Sn Sn
(1)⊗2 (2)
m2K h2 Sn Sn  (2)
+ { 2 (0)2 − (0) (t; β)λ(t) + o(h2 ).
2 Sn Sn
(1) (2)
It follows that Ln (β) and Ln (β) converges to L(1) (t; β)
Z τ
(1)
L (β) = {s(1) (1) (0)−2
n (s; β0 ) − sn (s; β)sn (s; β)s(0)2
n (s; β0 )}λ(s) ds,
0
Z τ
λ(1)⊗2 λ(2) 
L(2) (β) = − − (s; β)s(0)
n (s; β0 )λ(s) ds,
λ2 λ
Z0 τ
L(2) (β0 ) = − s(2) (s; β0 )λ(s) ds,
0

where −L(2) (β0 ) is positive definite and L(1) (t; β0 ) = 0 so that the max-
imum βbn,h of Lbn converges in probability to β0 , the maximum of the
(1)
limit L of Ln . The rate of convergence of βbn,h − β0 is that of Lbn (β0 ).
First n1/2 (Lb(1) )n − L(1) )n )(β0 ) is the sum of stochastic integrals of pre-
dictable processes with respect to centered martingales and it Rconvergences
weakly to a centered Gaussian variable with variance v(1) = In,h,τ {s(2) −
s(1)⊗2 s(0)−1 )(s; β0 )λ(s) ds. Secondly
Z
n1/2 (L(1)
n − L (1)
)(β 0 ) = [n1/2 {n−1 (Sn(1) − s(1) }(s; β0 )
In,h,τ
b(1)
λn
+ n1/2 { n−1 Sn(0) − s(1) }(s; β0 )]λ(s) ds,
b
λn
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 123

(1) (1)
is continuous and independent of n1/2 (Lbn − Ln )(β0 ), its integrand is a
(1)
sum of three terms l1n + l2n + l3n where l1n = n1/2 {n−1 (Sn − s(1) }(s; β0 )
convergences weakly to a centered Gaussian variable with a finite variance,
(1)
λn,h
l2n = n1/2 { n−1 Sn(0) − s(1) }(·; β0 ),
λn,h
b(1)
λ
(1)
λn,h
n,h
l3n = n1/2 [Sn(0) { − }](·; β0 ).
bn,h
λ λn,h
The term l2n convergences weakly to a centered Gaussian variable with a
finite variance and l3n has the same asymptotic distribution as
b(1) − λ(1) ) − λ(1) λ−1 (λ
n1/2 s(0) λ−1 {(λ bn,h − λn,h )}(·; β0 )
n,h n,h
where the process
bn,h − λn,h )(t; β0 )
(nh)1/2 (λ
Z τ
=n 1/2 en )(s)
Sn(0)−1 (s; β0 )Kh (t − s) Yn (s) d(Nn − N
0
R (0)−1
has the mean zero and the variance h Iτ,n,h
Sn (s; β0 )Kh2 (t − s)λ(s) ds
which converges in probability to vλ = κ2 s(0)−1 (t; β0 )λ(t). In the same
way, the process
Z (1)
1/2 b(1) (1) 1/2 Sn en )(s)
n (λn,h − λn,h )(t; β0 ) = n (0)2
(s; β0 )Kh (t − s) d(Nn − N
In,h,τ Sn
is consistent and it has the finite asymptotic variance
vλ,(1) (t) = s(1)⊗2 s(0)−3 (t; β0 )λ(t).
The term l3n with asymptotic variance zero converges in probability to zero.
The proof of the weak convergence of βbn ends as previously. The process
bn,h − λ)(t; βbn ) develops as
(nh)1/2 (λ
Z
n1/2 Jn (s)Sn(0)−1 (s; βbn )Kh (t − s) d(Nn − N
en )(s)
Iτ,n,h
Z
+ {Sn(0)−1 (s; βbn ) − Sn(0)−1 (s; β0 )}Sn(0) (s; β0 )Kh (t − s)λ(s) ds
Iτ,n,h
the first term of the right-hand side converges weakly to a centered Gaus-
sian process with variance κ2 s(0)−1 (t; β0 )λ(t) and covariances zero, and the
second term is expanded into
Z
− n1/2 (βbn,h − β0 )T Sn(1) (s; β0 )Sn(0)−1 (s; β0 )Kh (t − s)λ(s) ds
Iτ,n,h

= −n 1/2
(βbn,h − β0 )T s(1) (t; β0 )s(0)−1 (t; β0 )λ(t) + o(1). 
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

124 Functional estimation for density, regression models and processes

The results are analogous for every parametric regression function rβ


(0) Pn
of C2 , the processes are then defined by Sn (t; β) = i=1 rZi (t; β)1{Ti ≥t}
(k)
and Sn is its derivative of order k with respect to β
n (1) b(1) (s; β)
X rβ (Zi (Ti )) λn,h
Lb(1)
n (β) = n
−1
δi { + } dNn (s),
rβ (Z (T
i i )) b
λn (s; β)
i=1
Z (1)
b(1)
rβ (Zi (t) λn (s; β)
L(1)
n (β) = E { + }
In,h,τ rβ (Zi (t) b
λn (s; β)
× Sn(0) (t; β0 )λ(s) ds,
n
X b(1)2 b(2)
λn λn 
Lb(2)
n (β) = −n
−1
δi − (Ti ; β) .
b2
λ bn
λ
i=1 n

All results of this section are extended to varying bandwidth estimators as


before.

6.4 Histograms for intensity and regression functions

The histogram estimator (6.2) for the intensity λ with a parametric regres-
sion or (6.6), for nonparametric regression of the intensity, are consistent
estimators as h tends to zero and n to infinity. Their expectations are ap-
proximated like (6.8) by a ratio of means. Their variances are calculated
as in Section 6.2.2.
The nonparametric regression function r is estimated by
X
ren,h (z) = ren,h (zlh )1Dlh (z)
l≤Lh

and the histogram estimator for the intensity defines the estimator ren,h of
the regression function by
Y X
ren,h (zlh ) = arg max [{ en,h (Ti ; rlh )]δi ,
rl 1Dlh (Zn (Ti ))}λ
rl
Ti ≤τ l≤Lh
X Z Z
en,h (t; r) =
λ 1Bjh (t) Jn (s) dNn (s) [ Sn(0) (s; r) ds]−1 .
j≤Jh Bjh Bjh

(0) Pn
For every t in Bj,h , let Sn (t; r) = i=1 rZi (t)Yi (t), the limit of
−1 (0) (0)
P
n Sn (t; r) is s (t; r) = l≤Lh rzlh Pr(Z(t) ∈ Dlh ) + o(1) and its vari-
ance is
X
v (0) (t; r) = n−1 rz2lh [Pr(Z(t) ∈ Dlh )g(t)−{Pr(Z(t) ∈ Dlh )g(t)}2 ]+o(1)
l≤Lh
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 125

under Condition 6.1. The mean of λ en,h (t; zlh ) is


R
Bj,h
s(0) (s; rlh )λ(s) ds
λn,h (t; rlh ) = R
(0) (s; r ) ds
+ O((nh)−1 ) = λ(aj,h ) + o(h)
Bj,h s lh

and its bias is ebλ,h (t; rlh ) = hλ(1) (t)+o(h), uniformly on Iτ,n,h . Its variance
is
R
 B
s(0) (s; rlh )λ(s) ds 2
e
ven,h (t; rlh ) = E λn,h (t) − Rj,h + O((nh)−1 )
s (0) (s; r ) ds
Bj,h lh
Z
 (0) −2  −1
= s (t; rlh ) V ar{(nh) Jn (s) dNn }
Bj,h
Z Z
−1 −1

− 2λ(t)Cov{(nh) Jn dNn , (nh) Sn(0) (s; rlh ) ds}
Bj,h Bj,h
Z
+ λ2 (t)V ar{(nh)−1 Sn(0) (s; rlh ) ds} + o((nh)−1 ) + o(h)
Bj,h

where, for t in Bj,h and Z(t) in Dlh


Z
−1
V ar{(nh) Jn dNn } = (nh)−1 s(0) (t; rlh )λ(t) + o((nh)−1 )
Bj,h
Z
V ar{(nh)−1 Sn(0) (s; rlh ) ds} = (nh)−1 v (0) (t; rlh ) + o((nh)−1 ),
Bj,h
Z Z
Cov{(nh)−1 Jn dNn , (nh)−1 Sn(0) (s; rlh ) ds}
Bj,h Bj,h

= (nh)−1 v (0) (t; rlh )λ(t) + o((nh)−1 )


therefore ven,h (t) = (nh)−1 veλ (t) + o((nh)−1 ) with
vλ (t) = s(0)−2 (s; rlh ){s(0) (t; rlh )λ(t) − v (0) (t; rlh )λ2 (t)}
e
en,h −λn,h )(t) converges weakly to a centered Gaussian process
and (nh)1/2 (λ
with variance evλ (t).
The asymptotic mean squared error of the estimator λ en,h (t) is still min-
−1/3 e2 −1/3 1/3
imal for the bandwidth hn (t) = n {2bλ (t)} veλ (t). The stepwise
constant estimator of the nonparametric regression function r maximizes
Z X
Ln,h (r) = { log rl 1Dlh (Zn (s))} Jn (s) dNn (s)
l≤Lh
Z
+ enh (s; rlh ) Jn (s) dNn (s)
λ
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

126 Functional estimation for density, regression models and processes

(1) (1)
and it satisfies Ln,h (er1h , . . . , reLh ,h ) = 0, where Ln,h is a vector with com-
ponents the derivatives with respect to the components of rh = (rlh )l≤Lh
Z X
(1) 1
Ln,h,l (rh ) = { 1D (Zn (s))} Jn (s) dNn (s)
rlh lh
l≤Lh
Z
+ λ e(1) (s; rlh ) Jn (s) dNn (s).
n,h,l

The derivatives of the intensity are consistently estimated by differences of


values of the histogram, in the same way as the derivatives of a density.
The variance of λ e(1) is a O((nh3 )−1 ) and the estimator of the regression
n,h
function converges with that rate.
In the parametric regression model, the histogram estimator for the
function λ and the related estimator of the regression parameter have the
(0)
same form (6.5), where the function r and the process Sn are indexed by
the parameter β. Let t in Bj,h
X Z Z
e
λn,h (t; β) = 1Bjh (t)[ Jn (s) dNn (s)][ Sn(0) (s; β) ds]−1 ,
j≤Jh Bjh Bjh
Y
βen,h = arg max en,h (Ti ; β)}δi .
{rZi (Ti ; β)λ
β
Ti ≤τ

en,h are obtained by deriving Sn(0) with respect to β


The derivative of λ
Z R (1)
(1)
X S (s; β) ds
Bjh n
e
λn,h (s; β) = − 1Bjh (t)[ Jn (s) dNn (s)] R (0)
j≤Jh Bjh [ Bjh Sn (s; β) ds]2
and the derivative of the logarithm of the partial likelihood for β is
Xn Z (1) Z
(1) rZi
Ln,h (β) = n−1 (s; β) dNi (s) + n−1 λe(1) (s; β) dNn (s).
nh
i=1
rZ i

(1)
It is zero at βen,h and its convergence rate is a O((nh3 )), like λ
e . Therefore,
nh
e 3
βn,h has the convergence rate O((nh )) and the estimator of the hazard
function has the convergence rate O((nh)).

6.5 Estimation of the density of duration excess

For the indicator NT of a time variable T , the probability of excess is


Pt (t + x) = Pr(T > t + x | T > t) = 1 − Pr(t < T ≤ t + x | T > t)
= 1 − Pr{NT (t + x) − NT (t) = 1 | NT (t) = 0}.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 127

For a sample of n independent and identically distributed variables (Ti )i≤n ,


Pn
the processes n−1 Nn (t) = n−1 i=1 1Ti ≤t and n−1 Yn (t− ) = 1 − n−1 Nn (t)
converge respectively to the functions F (t) and F̄ (t), and Pt (t + x) is esti-
mated by Pbn,t (t + x) = 1 − {Nn (t + x) − Nn (t)}1{Nn (t)<n} {n − Nn (t)}−1 .
Let T1:n = min1≤i≤n Ti and Tn:n = max1≤i≤n Ti , on every interval [a, b]
included in In = ]T1:n , Tn:n [, the product-limit estimator has values in the
open interval ]0, 1[ and the centered process is approximated as

n1/2 {Pt (t + x) − Pbn,t (t + x)} = n1/2 F̄ −1 (t)[n−1 {Nn (t + x) − Nn (t)}


− F (t + x) + F (t) − Pt (t + x){n−1 Yn (t− ) − F̄ (t)}] + oL2 (1)

and it converges weakly to a Gausian process with mean zero and a finite
R t+x
variance. The probability Pt (t + x) = exp{− t λ(s) ds} is also estimated
by the product-limit estimator on the interval ]T1:n , Tn:n [
Y 1{t<Ti ≤t+x} Jn (Ti )
Pbn,t (t + x) = {1 − }, (6.9)
Yn (Ti )
1≤i≤n

with Yn = 1{Jn >0} . The estimator is constant between jump times Ti


and the size of the jumps is ∆Pbn,t (Ti ) = 1{t<Ti } Pbn,t (Ti− )Jn (Ti )Yn−1 (Ti ).
According to the asymptotic results established for the product-limit
estimator, for every interval [a, b] included in the interval ]T1:n , Tn:n [,
the product-limit estimator has values in the open interval ]0, 1[ and
sup(t,x+t)∈[a,b] |Pbn,t (t + x) − Pt (t + x)| converges a.s. to zero. Let

Pt − Pbn,t
Bn,t (t + x) = n1/2 (t + x) (6.10)
Pt

also written Pbn,t (t + x) = Pt (t + x){1 − n−1/2 Bn,t (t + x)}.

Theorem 6.2. On every interval [a, b] included in the interval ]T1:n , Tn:n [,
the estimator Pbn satisfies
Z t+x
Bn,t (t + x) = n 1/2 b n − Λ).
d(Λ
t

It converges weakly to a Gaussian process B with independent increments,


R t+x
mean zero and a finite variance vB (t + x; t) = t F̄ −1 dF .

b n be the martin-
Proof. Let [a, b] be a sub-interval ofR ]T1:n , Tn:n [ and let Λ
b t −1
gale estimator of Λ = − log F̄ , Λn = 0 Jn Yn dNn is uniformly consistent
b n − Λ) converges weakly to a centered Gaussian process
on [a, b] and n1/2 (Λ
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

128 Functional estimation for density, regression models and processes

Rt
with independent increments and variance vΛ (t) = 0 F̄ −2 dF . The process
Bn,t (t + x) is expanded as
Z t+x Z t+x
− Pt (t + x)Bn,t (t + x) = n1/2 [exp{− b n } − exp{−
dΛ dΛ}]
t t
Z t+x
+ n1/2 [exp{log Pbn,t (t + x)} − exp{− dΛb n }]
t
Z t+x Z t+x
= −n1/2 b n − Λ) exp{−
d(Λ dΛ∗n } {1 + o(1)}
t t
Z t+x b̄ (t + x) Z t+x
F n b n },
+ n1/2 exp{− dΛ∗∗
n } {log − dΛ
t Fb̄ (t) t
n

where Λ∗n is between Λ and Λ b n , so it converges uniformly to Λ, and


R t+x ∗∗ b̄ b̄ (t) and R t+x dΛ
− t Λn is between log F n (t + x)} − log F n
b n . From
t
1/2
Lemma 1.5.1 in Pons (2008), the variable n sup b̄
|−log{F (t)}− Λ b (t)|
[a,b] n n
converges in probability to zero as n tends to infinity then the last term of
the expansion of Bn,t (t + x) converges to zero in probability, uniformly on
[a, b]. The first term converges weakly to −Pt (t + x) BΛR(t, t + x) where the
t+x b n − Λ). 
process BΛ (t, t + x) is the limiting distribution of n1/2 t d(Λ
By Equation (6.10), the covariance of n1/2 (Pbn,t − Pt )(t + x) and
n (Pbn,t − Pt )(t + y) is denoted
1/2

2
CP (t + x, t + y; t) = Pt (t + x)Pt (t + y) lim EBn,t (t + x ∧ y)
n
= Pt (t + x)Pt (t + y)vB (t + x ∧ y; t).
The weak convergence of n1/2 (F̄n − F̄ ) on the interval [0, Tn:n ] allows to
extend theR previous proposition to a weak convergence on [T1:n , Tn:n ]. For
τ
t < τF , if 0 F F̄ −1 dΛ < ∞
Z (t+x)∧Tn:n
Pt − Pbn,t 1 − Fbn (s− ) b
(t + x) = d(Λn − Λ)(s). (6.11)
Pt t∨T1:n 1 − F (s)
Therefore, the process defined for t and t + x in [T1:n , Tn:n ] by n1/2 {(Pt −
Pbn,t )Pt−1 }(t + x) converges weakly on the support of F to a centered Gaus-
sian process BP , with independent increments and variance vF̄ (t + x) −
vF̄ (t). By the product definition of the estimator of the probability Pt (t+x),
it satisfies
Z (t+x)∧Tn:n
b
Pn,t (t + x) = Pbn,t (t + s− ) dΛ
b n (s).
t∨T1:n
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 129

The estimator is extended without modification to samples of right-censored


variables, with an independent censorship by a sequence of independent
and identically distributed variables with survival function Ḡ. Only the
definitions of Nn and Yn and the expression of the variances are modified
by a multiplicative term Ḡ−1 in the integral defining vΛ , as in the variance
of the classical product-limit estimator. For p ≥ 2, by Equation (6.11),
their exists a constant cp such that
Z ( )
 t+x Fb̄ (s− ) 2 dN (s) p/2
b p p n n
E{Pt (t + x) − Pn,t (t + x)} ≤ cp Pt (t + x) E
t F̄ (s) Yn2 (s)
therefore kPt (t + x) − Pbn,t (t + x)kp = O(n−1/2 ).

The probability density of excess duration is the derivative of the prob-


ability Pt (t + x), pt (t + x) = −f (t + x)/F̄ (t). A kernel estimator of pt (t + x)
is now defined, for t < t + x − hn as
Z 1
pbn,h (t + x; t) = Kh (t + x − s) dPbn,t (s)
−1
Z 1
= Kh (t + x − s)Pbn,t (s− ) dΛ
b n (s)
−1
n
X
= Kh (Ti − t − x)Pbn,t (Ti− )Jn (Ti )Yn−1 (Ti )δi 1{t<Ti }
i=1
n
X b̄ (T − ) J
F n i n
= Kh (Ti − t − x)δi 1{t<Ti } Pt (Ti ) (Ti ).
i=1
F̄ (Ti ) Yn
As in Section 2.8, the estimator pbn,h is uniformly consistent. For a density
2
p in class C2 , its bias deduced from (6.11) is bp,n,h (t + x; t) = h2 pt (t + x) +
o(h2 ). Its variance has the bound
Z Z b̄ −
 sF
vp,n,h (t + x; t) ≤ 2[E Kh2 (t + x − s)pt (s)2 n
d(Λb n − Λ) 2 ds
0 F̄
Z 2
b̄ (s− )
F
+ E Kh2 (t + x − s) n Y −1 (s) dΛ(s)] = O((nh)−1 ).
F̄ (t) n
It follows that for every t and x, (nh)1/2 (b pn,h − pn,h )(t + x; t) converges
weakly to a Gaussian variable with mean zero and finite variance vp (t +
x; t) = limn nhvp,n,h (t + x; t). The covariance of pbn,h (t + x; t) and pbn,h (t +
y; t) is written
Z Z
Cn,h,p (t+x, t+y; t) = Kh (t+x−u)Kh (t+y−v)Cov{Pbn,t (du) Pbn,t (dy)}.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

130 Functional estimation for density, regression models and processes

For x < y, the covariance of Pbn,t (x) and Pbn,t (y) is n−1 Pt (t + x)Pt (t +
y)vB (t+ x; t)+ o(n−1 ), hence for |x− y| > 2h, the limit of nCn,h,p converges
to
Z Z
Kh (t + x − u)Kh (t + y − v)pt (t + u)pt (t + v)vB (t + u; t)
(1)
+ Pt (t + u)Pt (t + v)vB (t + u; t)} du dv
(1)
= pt (t + y){pt (t + x)vB (t + x; t)}(1)
(1)
+ pt (t + y){Pt (t + x)vB (t + x; t)}(1) ,
if |x − y| ≤ 2h, Cn,h,p = 0((nh)−1 ). Then, the process (nh)1/2 (b
pn,h − p)
converges weakly to a Gaussian variable with mean zero, covariances zero
and variance function vp .

6.6 Estimators for processes on increasing intervals

Let (N (t))t≤T be the counting processes associated to a sequence of random


time variables (Ti )i≥1
X
NT (t) = 1{Ti ≤t∧T } , t ≥ 0,
i≥1

where T is deterministic or a random R stopping time. Its predictable com-


P
pensator is written NeT (t) = N (T ) t Yi (s)µ(s, Zi (s)) ds as in the previous
i=1 0
PN (T )
sections. In a model without covariates, YT (t) = i=1 Yi (t) and the base-
line intensity λ of the intensity µT (t) = λ(t)YT (t) is estimated for t in
[h, T − h]Rby smoothing the estimator of the cumulative hazard function Λ,
b T (t) = t∧T JT (s)Y −1 (s) dN
Λ eT (s), with an indication JT (t) = 1{Y (t)>0}
0 T T
Z 1
bT,hT (t) =
λ YT−1 (s)JT (s)Kh (t − s) dNT (s). (6.12)
−1
R
The mean of λ bT,h is λT,h (t) = 1 E{Y −1 (s)}Kh (t−s) dΛ(s). Conditions
T T −1 T
for its consistency are an ergodic property for the process (NT , YT ) and Con-
ditions 6.1 written for YsT with limits as T tends to infinity, then its bias de-
h
velops as bλ,T,hT (t) = s!T msK λ(s) (t) + o(hsT ). The variance of the estimator
is V ar{λbT,h (t)} = (T hT )−1 κ2 g −1 (t)λ(t) + o((T hT )−1 ) and the Lp -norms
T

of the estimator satisfy supt∈IT ,h,τ kλ bT,h (t) − λT,h (t)kp = 0((T hT )−1/p )
T T

and sup b
kλT,h (t) − λ(t)kp = 0((T hT ) −1/p
+ hs ). Under a mix-
t∈IT ,hT ,τ T

ing condition for the point process (NT , YT ) which ensures the weak con-
bT,h (t) − λ)
vergence of the process T 1/2 (YT − g), the process (T hT )1/2 (λ T
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 131

converges weakly to the Gaussian process limit of Theorem 6.1. Now the
optimal bandwidth are O(T −1/(2s+1) ) and the minimal mean squared errors
are O(T −2s/(2s+1) ).
In models with covariates, the process YT is modified by a regression
(0) PN (T )
function at the jump times as ST (t; β) = i=1 rZi (t; β)1{Ti ≥t} with a
(0) PN (T )
parametric regression function or ST (t; r) = i=1 rZi (t; )1{Ti ≥t} with
a nonparametric regression
R function. The predictable compensator of NT
becomes N eT (t) = t∧T S (0) (s; r)λ(s) ds. The function λ is estimated by
0 T
smoothing the estimator of the cumulative hazard function
Z 1
bT,h (t; β) = (0)
λ {ST (s; β)}−1 Kh (t − s) dNT (s)
−1

and the regression function is estimated using the parameter estimator


Q
βbT,hT = arg maxβ Ti ≤T {rZi (t; β)λ
bT,h (Ti ; β)}δi or by
T

N (T ) Z T
X
rbT,hT (z) = arg max Kh2 (z − Zi (s)){log rz (s)
rz 0
i=1
bT,h (s; rz )}Kh (z − Zi (s)) dNi (s)
+ log λ T 2

in the nonparametric case.


Assuming that the process Z((Ti ))i≥1 is bounded, ergodic and has fi-
(k) (k)
nite moments up to order 4, the processes T −1 ST (s; β) and T −1 ST (s; r)
converge in probabilty and uniformly on [0, T ] and in a neighbourhood of
(k)
β0 or r0 to finite limits s(k) and T 1/2 (T −1 ST − s(k) ) satisfies a theorem
central limit, for k = 0, 1, 2. All results for the kernel estimators of Section
6.3 are still valid under these conditions replacing n by T .
The empirical estimator of the probability of excess duration and a
kernel estimator for its density are defined by
Y
PbT,t (t + x) = {1 − JT (s)YT−1 (s) dNT (s)},
t≤s≤T ∧(t+x)
Z 1
pbT,hT (t + x; t) = Kh (t + x − s) dPbT,t (s).
−1

The process BT,t (t + x) = T 1/2 {Pt (t + x) − PbT,t (t + x)}Pt−1 (t + x) satisfies


Z t+x b̄ −
F T b T − Λ).
BT,t (t + x) = d(Λ
t F̄
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

132 Functional estimation for density, regression models and processes

6.7 Models with varying intensity or regression coefficients

More complex models are required to describe the distribution of event


times when the conditions may change in time or according to the value
of a variable. Pons (1999, 2002) presented results for two extensions of
the classical exponential regression model for an intensity involving a non-
parametric baseline hazard function and a regression on a p-dimensional
process Z

• a model for the duration X = T 0 − S of a phenomenon starting at


S and ending at T 0 , with a non-stationary baseline hazard depending
non-parametrically on the time S at which the observed phenomenon
starts
T
λX|S,Z (x | S, Z) = λX|S (x; S) eβ Z(S+x)
, (6.13)

• a model where the regression coefficients are smooth functions of an


observed variable X

T
λ(t | X, Z) = λ(t)eβ(X) Z(t)
. (6.14)

The asymptotic properties of the estimators βbn and Λ b n of β and of the


cumulative baseline hazard function follow the classical lines but the kernel
smoothing of the likelihood requires modifications.
In model (6.13), the time T 0 may be right-censored at a random time
C independent of (S, T 0 ) conditionally on Z and non informative for the
parameters β and λX,S , and that S is uncensored. We observe a sam-
ple (Si , Ti , δi , Zi )1≤i≤n drawn from the distribution of (S, T, δ, Z), where
T = T 0 ∧ C and δ = 1{T 0 ≤C} is the censoring indicator. The data are
observed on a finite time interval [0, τ ] strictly included in the support of
the distributions of the variables S, T 0 and C, and (S, T 0 ) belongs to the
triangle Iτ = {(s, x) ∈ [0, τ ] × [0, τ ]; s + x ≤ τ }. For Si in a neighborhood
of s, the baseline hazard λX|S (·; Si ) is approximated by λX|S (·; s), which
yields a local log-likelihood at s ∈ [hn , τ − hn ], defined as
X
ln (s) = Khn (s − Si )δi {log λX|S (Xi ; Si ) + β T Zi (Ti )}
i
Z τ
− Yi (y) exp{β T Zi (Si + y)}λX|S (y; Si ) dy.
0
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 133

ei = Xi ∧ (Ci − Si )
Let X
In,τ = {(s, x); s ∈ [hn , τ − hn ], x ∈ [0, τ − s]} ,
Yi (x) = 1{T 0 ∧Ci ≥Si +x} = 1{Xei ≥x} ,
i
X
(0) −1
Sn (x; s, β) = n Khn (s − Sj )Yj (x) exp{β T Zj (Sj + x)}.
j
Rx
The estimator of Λ0,X|S (x; s) = 0 λ0,X|S (y; s) dy is defined for (s, x) ∈ In,τ
by Λ̂n,X|S (x; s) = Λ̂n,X|S (x; s, β̂n ) with
X Khn (s − Si )1{Si ≤Ci ,Xi ≤x∧(Ci −Si )}
Λ̂n,X|S (x; s, β) = (0)
.
i nSn (Xi ; s, β)

The estimator βbn of the regression coefficient maximizes the following par-
tial likelihood
X h i
ln (β) = δi β T Zi (Ti0 ) − log{nSn(0) (Xi ; Si , β)} εn (Si )
i

where εn (s) = 1[hn ,τ −hn ] (s). The bandwidth h is supposed to con-


verge to zero, with nh2 tends to infinity and h = o(n−1/4 ) as n
tends to infinity, the other conditions are precised by Pons (2002).
The variable n1/2 (βbn − β0 ) converges weakly to a Gaussian variable
N (0, I −1 (β0 )) where the variance I −1 (β0 ), defined as the inverse of the
limit of the second derivative of the partial likelihood ln , is the minimal
variance for a regular estimator of β0 .
The weak convergence of the estimated cumulative hazard function de-
fined along the current time and the duration elapsed between two events
relies on the bivariate empirical processes
X
b n (s, x) = n−1
H δi 1{Si ≤s} 1{Xi ≤x} ,
i
X T
W̄n(0) (s, x) =n −1
eβ0 Zi (Si +x) 1{Si ≤s} 1{Xei ≥x} ,
i
bn = n1/2 (W̄n(0) − W (0) , H
B b n − H) 1In,τ ,

under boundedness and regularity conditions, the process B bn converges


(0)
weakly to a Gaussian limit. With functions λ and s in class C2 , the bias
b n,X|S is a O(h2 ), thus the optimal bandwidth minimizing
of the estimator Λ
the asymptotic mean squared error of Λ is O(n−1/5 ) and it is still written in
terms of the squared bias and the variance of the estimator. If the regressor
Z is a bounded variable, then there exists a sequence of centered Gaussian
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

134 Functional estimation for density, regression models and processes

1/2
bn −Bn kIn,τ = op (hn ). This property im-
processes Bn on In,τ such that kB
b n,X|S − Λ0,X|S )1{I }
plies the weak convergence of the process (nhn )1/2 (Λ n,τ

to a centered Gaussian process.

In model (6.14), Λb n only involves kernel terms through the regression


b
functions but both βn and Λ b n have the same non-parametric rate of con-
vergence. In Pons (1999), the estimator βbn,h (x) was defined as the value of
β which maximizes
X
ln,x (β) = δi Khn (x − Xi )[{β(Xi )}T Zi (Ti ) (6.15)
i≤n
X T
− log{ Khn (x − Xj )Yj (Ti )e{β(Xi )} Zj (Ti )
}] (6.16)
j≤n

where Yi (t) = 1{Ti ≥t} is the risk indicator for individual i at t. Let
(0) P β(Xi )T Zi (t)
Sn (t, β) = i Yi (t)e , an estimator of the integrated baseline
R
hazard function is Λ b n (t) = t Sn(0)−1 (s, βbn,h ) dNn (s). For every x in IX,h ,
0
the process n−1 ln,x converges uniformly to
Z τ
lx (β) = (β − β0 )(x)T s(1) (t, β0 (x), x)
0
s(0) (t, β(x), x)
− s(0) (t, β0 (x), x) log dΛ0 (t)
s(0) (t, β0 (x), x)

which is maximum at β0 hence βbn,h (x) = arg max ln,x (x) converges to
β0 (x). Let Un,h (·, x) and In,h (·, x) be the first two derivatives of the pro-
cess ln,x with respect to β at fixed x in IX,h , the estimator of β(x) satisfies
Un,h (βbn,h (x), x) = 0 and In,h (x) ≤ 0 converges uniformly to a limit I(x).
By a Taylor expansion Un,h (β0 (x), x) = (βbn,h (x) − β0 (x))T {I(β0 , x) + o(1)}
and (βbn,h (x) − β0 (x)) = {In,h (β0 , x) + o(1)}}−1 Un,h (β0 (x). Under the as-
sumptions that the bandwidth is a O(n−1/5 ) and the function β belongs to
the class C2 (IX ), the bias of βbn,h (x) isRapproximated by I −1 (β0 , x)h2 u(x)
τ
where u(x) has the form u(x) = m22K 0 φ(t, x) dΛ0 (t) and its variance is
(nh)−1 κ2 I −1 (β0 , x) + o((nh)
R
−1
). The asymptotic mean integrated squared
error AM ISEw (h) = E Xn,h kβbn,h (x) − β0 (x)kw(x) dx for βbn,h (x) is there-
fore minimal for the bandwidth
R
κ2 Xn,h kI −1 (β0 , x)kw(x) dx
hn,opt = n−1/5 R −1 (β , x)kw(x) dx
Xn,h u(x)kI 0

and the error AM ISEw (hn,opt ) has the order n−2/5 .


January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Nonparametric estimation of intensities for stochastic processes 135

The limiting distributions of the estimators are now expressed in the


(0) (0)
following proposition. Let Gn = (n−1 h)1/2 {Sn (βbn,h ) − Sn (β0 )}.

Proposition 6.6. For every x in IX,n,h , the variable (nhn )1/2 (βbn,h −β0 )(x)
converges weakly to a Gaussian variable N (0, γ2 (K)I0−1 (x)).
b n − Λ0 ) converges weakly to the Gaussian pro-
TheR processR (nhn )1/2 (Λ
·
cess − 0 G(t){ s (t, y) dy}−2 dΛ0 (t), where the process G is the limiting
(0)

distribution of Gn .

The convergence (nhn )1/2 for the estimator of Λ comes from the vari-
R t (0) rate (0)−2
ance E 0 Sn (s, β0 )Sn (s, βbn,h ) dΛ0 (s) of Λ
b n (t) − Λ0 (t) developed by a
first order Taylor expansion.

6.8 Progressive censoring of a random time sequence

Let (Ti )i=1,...,n be a sequence of independent random time variables and


(Tj )j=1,...,m be an independent sequence of independent random censoring
times such that a random number Rj of variables Ti are censored at Tj and
Pm
j=1 Rj = n. Then the censored variables Xi,j = Ti ∧ Cj are no longer
independent, only m sets of variables are independent. Let
Rj
m X Rj
m X
X X
Nn,m (t) = 1{Ti ≤t∧Cj } , Yn,m (t) = 1{Xi,j ≥t} .
j=1 i=1 j=1 i=1
Let F be the common distribution function of the variables Ti and
Gj be the distribution function of Cj , the intensity of the point process
Nn,m is still written λn,m (t) = λYn,m with λ = F̄ −1 f . Conditionally on
the censoring number R = (R1 , . . . , Rm ), the expectations of Nn,m (t) and
Yn,m (t) are
Xm Z t
E{Nn,m (t) | R} = Rj Ḡj dF,
j=1 0

Xm
E{Yn,m (t) | R} = Rj Ḡj (t)F̄ (t).
j=1

Let µR = lim Rj for j = 1, . . . , m, and Jn,m (t) = 1{Yn,m (t)>0} , the estimator
of the cumulative hazard function Λ and its derivative λ are
Z t
b n,m (t) =
Λ −1
Yn,m Jn,m dNn,m ,
0
Z
bn,m (t) = Kh (t − s) dΛ
λ b n,m (s).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

136 Functional estimation for density, regression models and processes

Assuming that there exists an uniform limit for the mean survival func-
Pm
tion Ḡ = limm→∞ m−1 j=1 Ḡj , the process n−1 Yn,m converges uniformly
to its expectation
Rt µY (t) = µR Ḡ(t)F̄ (t), and n−1 Nn,m converges uniformly
to µR 0 Ḡ dF . The estimators Λ b n,m and λ bn,m are then unbiased and uni-
1/2 b
Rt −1
formly consistent. The variance of n (Λn,m −Λ)(t) is 0 nYn,m Jn,m dΛ and
R t −1
it converges to vΛ (t) = 0 µY dΛ. The variance of the estimator λ bn,m (t) is
(1)
(nh)−1 κ2 vΛ (t) + o((nh)−1 ) and both estimator processes converge weakly
to Gausian processes with zero mean and these variances. The process
b n,m − Λ) has independent increments and n1/2 (λ
n1/2 (Λ bn,m − λ) has asymp-
totic covariances zero. All results for multiplicative regression models with
independent censoring times apply to this progressive random censoring
scheme. With nonrandom numbers Rj , the necessary condition for the
Pm
convergence of the processes is the uniform convergence of n−1 j=1 Rj Ḡj .

6.9 Exercises

(1) Define a kernel estimator for a Poisson process N with a functional


intensity λ(t) from the observation of a sample-path on an inter-
val [0, T ], such that T tends to infinity and prove that the process
(T h)−1/2 (λbT,h − λ) converges weakly.
(2) Calculate the bias of the estimator of β which maximizes the process
ln,x defined by (6.15).
(3) Consider a point process with a conditional multiplicative intensity λ(t)
r(β T Z(t)) Y (t), where λ and r are nonparametric real functions and
β is a vector of unkown parameters. Define kernel estimators for the
functions λ and r and an estimator for β by minimization of a partial
likelihood. Determine the order of their risks and prove their weak
convergence.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Chapter 7

Estimation in semi-parametric
regression models

7.1 Introduction

In the single-index regression model, the scalar response variable Y is ex-


pressed as a nonparametric transform for a linear combination of the com-
ponents of a vector X of d regression variables
Yi = g(θT Xi ) + σ(η T Xi )εi , (7.1)
where X is a vector of regression variables in a bounded subset of Rd and
ε is an error variable such that E(ε|X) = 0 and V ar(ε | X = x) = 1, then
V ar(Y | X = x) = σ 2 (η T x). The model includes a similar parametrization
for the mean and the variance functions. The parameters are vectors η and
θ belonging to an open and bounded subset Θ of Rd and g, an unknown
function of C2 (R). Several estimators for the semi-parametric regression
function m(x) = g(θT x) have been defined from approximations and they
are calculated by iterations, without model for the variance. Here the
estimators are defined in a two-step procedure from the weighted estimator
of the regression function g defined by (3.19). The true parameter value is
a vector in R2d
(η0T , θ0T )T = arg min V (η, θ) (7.2)
η,θ∈Θ

where
V (η, θ) = E[σ −1 (η T Xi ){Y − g(θT X)}2 ]
is the mean weighted squared error at a fixed parameters value η and θ.
Several empirical criteria can be defined for estimating V , estimators of θ0
satisfying the same property (7.2) are deduced.
Let (Xi , Yi )i=1,...,n be a sample of observations in Rd+1 , in the model
with known variance function. At fixed θ, let gbn,h (z) be the nonparametric

137
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

138 Functional estimation for density, regression models and processes

regression estimator defined with regressors values θT Xi in a neighborhood


of z, then the parameter θ is estimated by minimizing a mean squared error
of estimation, which is a goodness-of-fit criterium for the model. For ob-
servations such that θT Xi lies in a neighborhood of z = θT x, the regression
function is estimated at fixed η and θ, by
Pn −1
bn,h
i=1 σ (η T Xi )Yi Kh (z − θT Xi )
b
gn,h (z; η, θ) = Pn −1 ,
bn,h
i=1 σ (η T Xi )Kh (z − θT Xi )
−1
bn,h
and σ (η T Xi ) is the estimator (3.19) calculated at fixed η. The global
goodness-of-fit error and the estimator of θ minimizing this error are
n
X
Vbn,h (η, θ) = n−1 −1
bn,h
σ (η T Xi ){Yi − gbn,h (θT Xi ; θ)}2 , (7.3)
i=1

(bT
ηn,h , θbn,h
T
)T = arg min Vbn,h (η, θ), (7.4)
η,θ∈Θ

b n,h (x) = gbn,h (θbn,h


m T
x; θbn,h ).

We first assume that the variance function is known and denoted σ 2 (x),
the error and the estimator of g are then only normalized by σ −1 (x). The
global error (7.3) and the estimator (7.4) have been modified by considering
the mean of local empirical squared errors. In a neighborhood of z, a local
empirical squared error is defined by smoothing (7.3)
n
X
Vbn,h (z; θ) = n−1 σ −1 (Xi ){Yi − b
gn,h (θT Xi ; θ)}2 Kh (z − θT Xi ),
i=1

θbn,h,z = arg min Vbn,h (z; θ).


θ∈Θ

T
A global estimator θ̄n,h
of θ was defined by an empirical mean of the local
estimators at the random point Z bn,i = θbT
n,h,Zi Zi . Then an estimator of the
regression function m is
T
m̄n,h (x) = gbn,h (θ̄n,h x; θ̄n,h ). (7.5)

Another estimator is then obtained by minimizing the differential criterion


n
X
cn,h (θ) = n−2
W {σ 2 (Xi ) + σ 2 (Xj )}−1/2 {Yi − Yj − b
gn,h (θT Xi ; θ)
i6=j=1

gn,h (θT Xj ; θ)}2 Kh (Xi − Xj ),


+b (7.6)
θen,h = arg min W
cn,h (θ),
θ∈Θ
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Estimation in semi-parametric regression models 139

with sums on values individuals having close values of the regression vari-
able. Finally, a third estimator of the regression function is
gn,h (θen,h
e n,h (x) = b
m T
x; θen,h )

with the estimator (7.6) for the vector parameter θ.


The error V (θ0 ) and W (θ0 ) is estimated by plug-in with (7.4) and
Vbn,h = Vbn,h (θbn,h )
cn,h = W
W cn,h (θbn,h )
Pn
and V̄n,h = n−1 k=1 Vbn,h (θ̄n,hT
Xk ; θ̄n,h ) for an empirical mean of local
estimators.
Various forms of the estimation criteria are weighted mean square errors
in order to take into account the unknown variance σ 2 of the variable Y
or other weighting functions. Iterative procedures were also defined with
alternative estimations of the function g and the parameters. In the next
section, the convergence rate of the parameter estimator and the regression
function m b n,h are determined for (7.3) and (7.6) with variance 1. In the
presence of nonparametric estimator of the function g, the convergence rate
of the parameter estimator differs from the parametric rate. The limiting
distributions of the parametric and nonparametric estimators are studied
for the global errors Vbn,h and W
cn,h .
More general models are nonparametric regressions with a parametric
change of variable. In the nonparametric regression with a change of vari-
ables, the linear expression θT X is replaced by a transformation of X using
a parametric family of functions defined in Rd , φ = {ϕθ }θ∈Θ subset of the
class C2 . The semi-parametric regression model is
Y = g ◦ ϕθ (X) + σ(X)ε. (7.7)

The error is still V (θ, σ) = Eσ −1 (X){Y − m(X)}2 with m = g ◦ ϕθ , at fixed


θ and σ, and the parameters can be estimated by minimizing over θ and σ
similar empirical estimators of V (θ) as above.

7.2 Convergence of the estimators

Proposition 2.3 and Theorem 2.1 imply the convergence in probabil-


ity to zero of the empirical error criteria supθ∈Θ |Vbn,h (θ) − V (θ)| and
cn,h (θ) − W (θ)|, the local criterium Vbn,h (z; θ) also converges uni-
supθ∈Θ |W
formly on Z × Θ to V (θ; z) = E[{Y − m(X)}2 |θT X = z]. As the minimum
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

140 Functional estimation for density, regression models and processes

of the limits V (θ) and W (θ) is θ0 , all estimators converge to θ0 . The mini-
(1)
mization of Vbn,h provides an estimator θbn solution of Vbn,h (θ) = 0 where
n
X
(1) (1)
Vbn,h (θ) = 2n−1 {Yi − gbn,h (θT Xi ; θ)} {b
gn,h (θT Xi ; θ)} Xi
i=1

and for the second derivative, let Zi = θT Xi , at fixed θ


n
X
b (2) (2) (1)
Vn,h (θ) = 2n −1
[{Yi − b gn,h (Zi ; θ)}2 ] Xi⊗2 .
gn,h (Zi ; θ) − {b
gn,h (Zi ; θ)} b
i=1
(2)
The limit of Vbn,h (θ) is the second derivative of V (θ) = E[{g(θT X) −
(2)
g(θ0T X)}2 ] which is minimal at θ0 , therefore −Vn,h (θ) is positive definite in
a neighbourhood of θ0 for n large enough since that is true for the limit-
(1)
ing function −V (2) (θ). Expanding Vbn,h (θ), for θ in a neighbourhood of θ0 ,
(1) (1) (2)
implies Vb (θ) = Vb (θ0 ) + (θ − θ0 )T Vb (θ0 ) + OP ((θ − θ0 )2 ). Then the
n,h n,h n,h
(2) (1)
estimators satisfy θbn,h − θ0 = {−Vbn,h (θ0 )}−1 Vbn,h (θ0 ) + OP ((θbn,h − θ0 )2 )
also written
(2) (1)
θbn,h − θ0 = {−Vbn,h (θ0 )}−1 {Vbn,h (θ0 ) − V (1) (θ0 )} + OP ((θbn,h − θ0 )2 ).
(1)
Applying Theorem 2.1, (nh3 )1/2 {bgn,h(θ0 ) − g (1) (θ0 )} converges weakly to a
continuous Gaussian process with zero mean and finite variance. The first
(1) (1) (1)
two moments of Vbn,h are Vn,h (θ) = E Vbn,h (θ), expanded at θ0 as
(1) (1)
Vn,h (θ0 ) = 2E[{m(Xi ) − gn,h (θ0T Xi ; θ0 )} {b
gn,h(θ0T Xi ; θ0 )} Xi ]
(1)
gn,h − gn,h )(θ0T Xi ; θ0 )} {b
− 2E[(b gn,h(θ0T Xi ; θ0 )} Xi ].
(1)
Proposition 10.1 in Appendix A and Appendix C prove that Vn,h (θ0 )
(1)
is a O(h2 ) + O((nh2 )−1 ), and the variance of Vb (θT x; θ0 ) equals n,h 0
(nh3 )−1 ΣV,θ0 + o((nh3 )−1 ), where
Z
−2
ΣV,θ0 = E[X ⊗2 fX (X){m2g (θ0T X) + σg2 (θ0T X)}w2 (θ0T X)]( K (1)2 ).
(1) (1)
It follows that the variable (nh3 )1/2 {Vbn,h (θ0 ) − Vn,h (θ0 )} converges
weakly to a continuous Gaussian variable with mean zero and variance
ΣV,θ0 , and
(2) (1)
(nh3 )1/2 (θbn,h − θ0 ) = 2{Vbn,h (θ0 )}−1 (nh3 )1/2 ){V (1) − Vbn,h }(θ0 ) + op (1).
(7.8)
(2)
Moreover, the second derivative Vbn,h converges uniformly to a bounded
(2)
function Vθ on the parameter space. The result extend to regression
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Estimation in semi-parametric regression models 141

functions of Cs (IX ), s ≥ 2, and the order nh3 related to the variance of


(2)−1 (2)−1
the derivative of gbn,h is unchanged. Let v0 = Vθ0 ΣV,θ0 Vθ0 .

Proposition 7.1. Under Conditions 2.1 and 3.1, and if hn = o(n−1/7 ),


the estimators of the parameter θ and the function m in class Cs (IX )
are consistent, (nh3 )1/2 (θbn,h − θ0 ) converges weakly to a Gaussian vari-
able with with mean zero and variance v0 and (nh3 )1/2 (m b n,h − m) con-
verges weakly to a Gaussian process with mean zero and covariance function
g (1) (θ0T x)vθ0 g (1) (θ0T x0 ) at (x, x0 ).

Proof. b n,h (x) − m(x) splits as the sum of two terms


The process m
gn,h (θbn,h
un,h (x) = b T
x) − g(θbn,h
T
x),
vn,h (x) = g(θbn,h
T
x) − g(θ0T x).

The convergence rate of vn,h is the same as θbn,h − θ, which is (nh3 )1/2 like
(1) (1)
(Vbn,h − V (1) )(θ0 ) since the bias of Vbn,h (θ0 ) disappears with hn = o(n−1/7 ).
The process (nh)1/2 un,h (x) = (nh)1/2 (b gn,h − g)(θ0T x){1 + op (1)} is a Op (1),
then the convergence rate of {m b n,h (x) − m(x)} is (nh3 )1/2 . 
The bandwidth minimizing the sum of the squared bias and the vari-
(1) (1)
ance of Vbn,h (θ0 ) is hV,n = O(n−1/7 ) and the convergence rate of Vbn,h (θ0 ) is
n2/7 in that case. Note that with a bandwidth hn = O(n−1/7 ), the limit is
(2)−1 (1)
a Gaussian variable with the finite mean limn (nh7 )1/2 Vn,h )(θ0 )Vn,h (θ0 ).
This rate hn = O(n−1/7 ) is optimal for estimating the first derivative of
a function of class C2 and the biases were obtained under this assump-
tion. This is a consequence of the approximation of θbn,h − θ0 in terms of
(1)
(Vbn,h − V (1) )(θ). The optimal local and global bandwidths for the estima-
tors gbn,h and m b n,h are O(n−1/(2s+3) ) and they are expressed as a ration
of their (integrated) bias and variance, the mean squared errors for their
estimation are O(n−2s/(2s+3) ).

The estimator minimizing the local error Vbn,h (z; θ) is approximated by


linearization of the error for θT Xi in a neighbourhood of z
n
X
Vbn,h (θ; z) = n−1 Kh (z − θT Xi ){Yi − b
gn,h (z)
i=1
(1)
− (θT Xi − z)b
gn,h (z)}2 + O(h2 ),
and the derivatives of the linear approximation are considered with vari-
ables Zi = θT Xi in a neighborhood of z. The asymptotic behaviour of its
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

142 Functional estimation for density, regression models and processes

derivatives differs from those of the global error due to the kernel smoothing.
The mean estimator has a smaller variance than the estimator minimizing
the global error Vbn,h .

For independent variables such that Xi 6= Xj and |Xi − Xj | ≤ h, let


∆i,j X = Xi − Xj and ∆i,j ϕ(X) = ϕ(Xi )− ϕ(Xj ), for every function ϕ. Let
Z = θT X at fixed θ. With the notations of Proposition 3.1, the differential
cn,h defined by (7.6) has the mean
criterion W
gn,h )(θT X)}2 Kh (Xi − Xj )]
Wn,h (θ) = E[{∆i,j (Y − b
= E[{σ 2 (θT Xi ) − σ 2 (θT Xj )} + {g(θ0T Xi ) − g(θ0T Xj ) − g(θT Xi )
+ g(θT Xj )}2 + E[(nh)−1 {σg2 (θT Xi ) − σg2 (θT Xj )}
+ h4 {b2g,n,h (θT Xi ) − b2g,n,h (θT Xj )}Kh (Xi − Xj )]{1 + o(1)},
where gbn,h is the regression function estimator. By Lemma 10.1 in Ap-
pendix C, it is approximated by O(h2 ) + O(n−1 h). The first derivative of
cn,h is
W
Xn
c (1) (θ) = −2n−2
W {Yi − Yj − gbn,h (Zi ; θ) + gbn,h (Zj ; θ)}
n,h
i6=j=1
(1) (1)
{b
gn,h (Zi ; θ)Xi gn,h (Zj ; θ)Xj } Kh (∆i,j X)
−b
its mean develops as
(1) (1)
Wn,h (θ) = 2E[{∆i,j (m0 (X) − gn,h (Z))}{∆i,j (Xgn,h (Z))} Kh (∆i,j X)]
(1)
− E[{∆i,j (b
gn,h − gn,h )(Z)}{∆i,j (Xb
gn,h (Z))} Kh (∆i,j X)],
denoted E1 + E2 . The first term is expanded as
Z
(1)
E1 = h2 m2K {m0 (x) − gn,h (θT x)}(1) {xgn,h (θT x)}(1) fX (x) dFX (x),
using a first order approximation for the variations, E2 depends on the co-
(2) (1)
variance of b gn,h (Z), with a factor h2 , thus the second term is a
gn,h (Z) and b
(1)
0(n−1 h−2 ). The function Wn,h (θ) is then an uniform O(h2 ) + O((nh2 )−1 )
and it tends to zero uniformly on the bounded parameter space. As
in Appendix C, the main term of its variance ΣW,n,h (θ0 ) depends on
(2) (2) (1) (1)
E{bgn,h (Z) − gn,h (Z)}2 {bgn,h (Z) − gn,h (Z)}2 = O(nh4 )−1 , with a factor
n−2 h3 , thus it is a 0(n−3 h−1 ). The second order derivative is an empirical
mean
n
X
c (2) (θ) = 2n−2
W [{∆i,j {Xb
(1)
gn,h(θT X)}}⊗2 − {Yi − Yj − ∆i,j gbn,h (θT X)}
n,h
i6=j=1
(2)
× ∆i,j {X ⊗2 gbn,h (θT X)}]Kh(∆i,j X),
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Estimation in semi-parametric regression models 143

it is approximated by its expectation


(2) (1)
Wn,h = 2E[∆⊗2 gn,h(θT X)} − {∆i,j {m0 (X) − gbn,h (θT X)}
i,j {Xb
(2)
× ∆i,j {X ⊗2 b
gn,h (θT X)}] Kh (∆i,j X)
= 2E Kh (∆i,j X) [{∆⊗2
i,j {Xg
(1) T
(θ X)}
(2)
− ∆i,j {m0 (X) − gn,h (θT X)}∆i,j {X ⊗2 gn,h (θT X)}
(2)
+ ∆⊗2 gn,h − gn,h )(θT X)}∆i,j {X ⊗2 gbn,h (θT X)}].
i,j {(b

(2)
The sequence (Wn,hn )n is therefore an uniform O(h2 + (nh3 )−1 ). Assuming
(1)
that nh4n tends to infinity with n, Wn,hn = O(h2 + (nh2 )−1 ) = O(h2 ) and
c (1) is that its
a necessary condition for the weak convergence of (n3 h)1/2 W n,h
normalized mean converges, i.e. h = O(n−3/5 ).
Under this condition, nh5 tends to zero and the convergence rate of
(2)
Wn,hn is h2 . Arguing as for the estimator of θbn,h related to Vbn,h , θen,h
minimizing W cn,h is such that (n3 h5 )1/2 (θen,h − θn,h ) is approximated by the
c (2) (θ0 )}−1 n3/2 h1/2 (W
variable h2 {W c (1) − W (1) )(θ0 ). It converges weakly
n,h n,h n,h
to a Gaussian process with variance vθ0 = W (2)−1 (θ0 )ΣW,θ0 W (2)−1 (θ0 ).
Following the arguments for the proof of the previous proposition, we obtain
the following convergences.

Proposition 7.2. Under Conditions 2.1 and 3.1, and if hn = O(n−3/5 ),


the estimators of the parameter θ and of the function m in class C2 are
consistent, (n3 h5 )1/2 (θen,h − θ0 ) and (nh)1/2 (m
e n,h − m) converge weakly to
Gaussian processes with finite variances.

With hn = O(n−3/5 ), the limit of (nh)1/2 (m


e n,h −m) is centered, if moreover
hn = o(n−3/5 ), both limits are centered. The weighted estimator of the
regression function has the same convergence rate as in the model with a
constant variance as proved in Section 3.6, the convergence rates of the
estimators of θ and g of Propositions 7.1 and 7.2 in the single-index model
are therefore unchanged.

7.3 Nonparametric regression with a change of variables

In the semi-parametric regression model (7.7), with m = g ◦ ϕθ , the estima-


tors are built following the same lines as in the single index regression model.
For observations such that ϕθ (Xi ) lies in a neighbourhood of z = ϕθ (x),
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

144 Functional estimation for density, regression models and processes

the regression function is estimated by


Pn −1
bn,h
i=1 σ (Xi )Yi Kh (z − ϕθ (Xi ))
b
gn,h (z; θ) = Pn −1
bn,h (Xi )Kh (z − ϕθ (Xi ))
i=1 σ

with a parametric or semi-parametric estimator for the variance function.


The global goodness-of-fit error and the estimator of θ minimizing this error
are
Xn
Vbn,h (θ) = n−1 −1
bn,h
σ gn,h ◦ ϕθ (Xi )}2 ,
(Xi ){Yi − b
i=1

θbn = arg min Vn,h (θ),


θ∈Θ

gn,h ◦ ϕθbn,h .
b n,h = b
and m
Assume that the variance is constant and let Zi = ϕθ (Xi ) at fixed θ.
The derivatives of Vbn,h are
n
X
(1) (1) (1)
Vbn,h (θ) = −2n−1 {Yi − gbn,h (Zi )} b
gn,h (Zi ) ϕθ (Xi )
i=1
n
X
(2) (2) (1)2 (1)
Vbn,h (θ) = −2n−1 [{Yi − gbn,h (Zi )}b gn,h (Zi )]{ϕθ (Xi )}⊗2
gn,h (Zi ) − b
i=1
n
X (1) (2)
− n−1 {Yi − gbn,h (Zi )} gbn,h (Zi ) ϕθ (Xi )
i=1
(2)
where −Vbn,h (θ) is a positive definite matrix converging to a finite limit
(2) (1)
ΣV (θ) uniformly on the parameter space. The mean of Vbn,h (θ) and its
variance have the same orders as for the derivative of (7.3) in model (7.1)
(1) (1) (1)
Vn,h (θ) = 2E[{Yi − gbn,h ◦ ϕθ (Xi )} b
gn,h ◦ ϕθ (Xi ) ϕθ (Xi )]
(1) (1) (1)
= 2E{[{(m − gn,h )gn,h − Cov(b
gn,h , gbn,h )} ◦ ϕθ ϕθ ](Xi )}
where the first terms in the right-hand side is O(h2 ) and the last term is
(1)
a O((nh2 )−1 ). The variance of Vbn,h (θ) is a O((nh3 )−1 ). An expansion of
Vn,h in a neighborhood of θ0 implies
(2) (1)
(nh3 )1/2 (θbn,h − θ0 ) = {−Vbn,h (θ0 )}−1 (nh3 )1/2 {Vbn,h (θ0 ) − V (1) (θ0 )} + op (1).
(1)
The variance of Vbn,h (θ0 ) is asymptotically equivalent to (nh3 )−1 ΣV,θ0 +
3 −1
o((nh ) ), with a modified notation
Z
(1)⊗2 −2
ΣV,θ0 = 4E[ϕθ0 (X)fX (X){g 2 + σg2 } ◦ ϕθ0 (X)w2 (ϕθ0 (X))]( K (1)2 ).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Estimation in semi-parametric regression models 145

Proposition 7.3. Under Conditions 2.1 and 3.1, and with a bandwidth
hn = o(n−1/7 ) for the estimation of a regression function m in class
C2 , the estimators of the parameter θ and the function m are consistent,
(nh3 )1/2 (θbn,h −θ0 ) converges weakly to a centered Gaussian with variance vθ
and (nh3 )1/2 (m b n,h − mθ0 ) converges weakly to a centered Gaussian process
(1) (1)
with covariance g (1) ◦ ϕθ0 (x)vθ0 g (1) ◦ ϕθ0 (x0 )ϕθ0 (x) ⊗ ϕθ0 (x0 ) at (x, x0 ).
With the optimal bandwidth hn = O(n−1/7 ), the convergence rates of the
estimators are n2/7 and the limiting distributions of the estimators have a
non zero mean, as in the previous section.

cn,h (7.6) adapted to model (7.7) with a con-


The differential criterion W
stant variance is written
X n
cn,h (θ) = n−2
W gn,h ◦ ϕθ (Xi ) + gbn,h ◦ ϕθ (Xj )}2 Kh (Xi − Xj ),
{Yi − Yj − b
i,j=1
it defines the estimators θen,h = arg minθ∈Θ W cn,h (θ) and gen,h at θen,h . Let
Zi = ϕθ (Xi ) at fixed θ, the derivatives of Wcn,h with respect to θ are
X n
c (1) (θ) = −2n−2
W {Yi − Yj − gbn,h (Zi ) + gbn,h (Zj )}
n,h
i,j=1
(1) (1) (1) (1)
{b
gn,h (Zi )ϕθ (Xi ) − b gn,h (Zj )ϕθ (Xj )}Kh (Xi − Xj ),
X n
c (2) (θ) =
W 2n−2
(1) (1) (1) (1)
gn,h (Zi )ϕθ (Xi ) − gbn,h (Zj )ϕθ (Xj )}2
[{b
n,h
i,j=1
(2) (1)2
− {Yi − Yj − b
gn,h (Zi ) + b
gn,h (Zj )}{b
gn,h (Zi )ϕθ (Xi )
(2) (1)2
− gbn,h (Zj )ϕθ (Xj )}2 + {Yi − Yj − b
gn,h (Zi ) + gbn,h (Zj )}
(1) (2) (1) (2)
{b
gn,h (Zi )ϕθ (Xi )−b
gn,h (Zj )ϕθ (Xj )}]Kh (Xi − Xj ).
(1)
The mean of the first derivative is Wn,h (θ) = −E E[{g(Zi ) − g(Zj ) −
(1) (1) (1) (1)
gbn,h(Zi )+b
gn,h (Zj )}{b
gn,h (Zi )ϕθ (Xi )−b gn,h (Zj )ϕθ (Xj )} | Xi , Xj ]Kh (Xi −
Xj ) , its order and the order of its variance are the same as in the single
index model. The second derivative has the mean
(2) (1) (1) (1) (1)
Wn,h (θ) = 2E E[{b gn,h (Zj )ϕθ (Xj )}2 | Xi , Xj ]
gn,h (Zi )ϕθ (Xi ) − b
(2) (1)2
− E[{g(Zi ) − g(Zj ) − b
gn,h (Zi ) + b
gn,h (Zj )}{b
gn,h (Zi )ϕθ (Xi )
(2) (1)2
− gbn,h (Zj )ϕθ (Xj )}2 | Xi , Xj ]
(1) (2)
+ E[{g(Zi ) − g(Zj ) − b
gn,h (Zi ) + b
gn,h (Zj )}{b
gn,h (Zi )ϕθ (Xi )
(1) (2) 
− gbn,h (Zj )ϕθ (Xj )} | Xi , Xj ]Kh (Xi − Xj ) .
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

146 Functional estimation for density, regression models and processes

The results of Proposition 7.2 are similar with these notations for the
asymptotic variances.

In a regression model for processes (Xt , Yt )t≥0 , empirical error processes


are defined as in Section 3.6, with linear combinations of the components
of the d-dimensional regression variable. They are indexed by T , the length
of the time interval and the convergence rates are similar replacing n by T .
Varying bandwidth estimators of the parameter θ and the function m
are defined by a modification of the estimators of the functions g and σ 2
as in Section 4.3. Assuming that the functions g and σ 2 have the same
order of derivability, both functions are estimated with bandwidths of the
same order, and the modified estimators are introduced in the definition of
Vbn,h , where the bandwidths hn (Xi ) and hn (Xj ) differ. Under Condition
4.1, Proposition 7.1 extends to variable bandwidth estimators, with the
convergence rate (nkhn k3 )1/2 for the parameter estimator.
The differential empirical error uses another weight Kh (Xi − Xj ), for
every i 6= j. Its bandwidth was supposed equal to the bandwidth used
for the estimation of g in Section 2. Due to the symmetry of the ker-
nel with respect to the variables Xi and Xj , a variable bandwidth in
the expression Kh (Xi − Xj ) can be chosen as the mean of the band-
widths at Xi and Xj , with |Xi − Xj | < 2khk. As h tends to zero,
h(Xi ) = h(Xj )+(Xi −Xj ){h(1) (Xj )+o(khk)} and the mean bandwidth for
Kh (Xi − Xj ) is equivalent to h(Xi ) or h(Xj ). The expansions of Chapter
4 allow to extend Proposition 7.2 for such bandwidths.

7.4 Exercises

(1) Write the mean, the bias and the variance of the local mean squared
error Vbn,h (θ; z) and define a sequence of optimal bandwidth functions
for this criterium.
(2) Write the derivatives of Vbn,h (θ; z) and an approximation for the esti-
mator θbn,h (z) minimizing Vbn,h (θ; z). Determine the orders of its bias
and its variance.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Chapter 8

Diffusion processes

8.1 Introduction

Let α and β be two functions of class C2 on a functional metric space


(X, k · k), and let B be the standard Brownian motion on R. Their norms
on X, kαk1 , kβk2 , Ekα(X(t))k1 and Ekβ(X(t))k2 are supposed to be finite.
A diffusion process (Xt )t∈[0,T ] is defined as a stochastic differential equation
by
dXt = α(Xt )dt + β(Xt )dBt , t ∈ [0, T ] (8.1)
and its initial value X0 such that E|X0 | < ∞. Equation (8.1) with locally
Lipschitz drift and diffusion functions has a unique solution Xt = X0 +
Rt Rt
0
α(Xs )ds + 0 β(Xs )dBs , it is a continuous Gaussian process wih mean
Rt
E(Xt − X0 ) = 0 Eα(Xs ) ds and variance
Z t Z t
V ar(Xt − X0 ) = V ar{ α(Xs ) ds} + E β 2 (Xs ) ds
0 0
Z t Z t
2 2
≤ E{α (Xs ) + β (Xs )} ds − { Eα(Xs ) ds}2 .
0 0
The existence and unicity of the process X is proved by construction of a
sequence of processes satisfying Equation (8.1) and starting from X0 and
satisfying
Z t Z t
Xn,t −Xn−1,t = {α(Xn,s )−α(Xn−1,s )}ds+ {β(Xn,s )−β(Xn−1,s )}dBt ,
0 0
hence Xn,t is the finite sum from X0 to Xn,t −Xn−1,t , where the convergence
of the sum is a consequence of the Lischitz property of α and β. By a
discretization of the time interval [0, T ] in n sub-intervals of length tending
to zero as n tends to infinity, Equation (8.1) is approximated by
Yi ≡ Yti+1 = Xti+1 − Xti = (ti+1 − ti )α(Xti ) + β(Xti ){Bti+1 − Bti }, (8.2)

147
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

148 Functional estimation for density, regression models and processes

considering the functions α and β as piecewise constant on the intervals


of the partition generated by (ti )i=1,...,n . Let εi = Bti+1 − Bti , it is a
random variable with mean zero and variance (ti+1 − ti ) conditionally
on the σ-algebra Fti generated by the sample-paths of X up to ti , then
Eα(Xti )εi = 0 and V ar(Yi |Xti ) = (ti+1 − ti )β 2 (Xti ). The process Xt
solution of (8.1) is a continuous Gaussian process with independent incre-
ments. Its increments are approximated by the nonparametric regression
model (8.2) with an independent normal error by considering the func-
tions α and β as stepwise constant functions on the partition (ti )1≤i≤n . In
the nonparametric regression model (8.2), EYi = (ti+1 − ti )Eα(Xti ) and
V arYi = (ti+1 − ti )2 V arα(Xti ) + (ti+1 − ti )Eβ 2 (Xti ).
Let t in Ii =]ti , ti+1 ], the approximation errorR of the process Xt − Xti
t
by the discretized sample-path (8.2) is et;ti = ti {α(Xs ) − α(Xti )}ds +
Rt
ti
{β(Xs ) − β(Xti )}dBs , it satisfies
E|et,ti | ≤ (t − ti )2 sup |α(1) (Xs )|,
s∈Ii
Z t Z t
V ar et,ti = V ar {α(Xs ) − α(Xti )}ds + E{β(Xs ) − β(Xti )}2 ds
ti ti
Z t Z t
≤ V ar{α(Xs ) − α(Xti )}ds + E{β(Xs ) − β(Xti )}2 ds
ti ti

and it bounded by (t − ti )kα(1) k2 kβ (1) k2 E(Xt − Xti )2 = O((t − ti )2 ), with


Z t
E(Xt − Xti )2 ≤ E{α2 (Xs ) + β 2 (Xs )} ds.
ti
The following condition allows to express the moments of increments of the
diffusion process as integrals with respect to a mean density.

Condition 8.1. There exists a mean density of the variables (Xti )1≤i≤n
defined as the limit
n
X
f (x) = lim n−1 fXti (x) = EfXt (x).
n→∞
i=1

This condition is satisfied under a mixing property of the process Xt


sup{Pr(B|A) − Pr(B); A ∈ F0t , B ∈ Ft+s

, s, t ∈ R+ } ≤ ϕ(s),
Z T
ϕ(u) du < ∞, (8.3)
0
where the σ-algebras F0t and Ft+s ∞
are respectively generated by {Xu , u ∈
[0, t]} and {Xu , u ∈ [t + s, ∞[}. That property is satisfied for the Brownian
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 149

motion, its sample paths having independent


RT increments. For the diffusion
process Xt , it is sufficient that E 0 β 2 (Xs ) ds < ∞. Moments of discon-
tinuous parts of a diffusion process with jumps require another ergodicity
condition defining another mean density and it is satisfied under the mixing
property (8.3).

8.2 Estimation for continuous diffusions by discretization

The regression model (8.2) with observations at fixed points regularly


spaced on a grid (ti )1≤i≤n , of path ∆n = n−1 T , is written Yi = ∆n α(Xti )+
β(Xti )εi , for i = 1, . . . , n. The variables εi have a normal distribu-
tion N (0, ∆n ), hence Eε2k+1 i = 0 for every integer k, Eε2i = ∆n and
2(k−1)
Eε2k
i = (2k − 1)∆n Eεi for every k ≥ 1, thus Eε4i = 3∆2n . A nonpara-
metric estimator of the function α requires a normalization of Yi by the
scale ∆−1
n
Pn
Y K (x − Xti )
bn,h (x) =
α P i h
i=1
,
∆n ni=1 Kh (x − Xti )
for every x in Xn,h = {y ∈ X ; kx − yk < h}. The approximations and the
bn,h . Let
convergences of Proposition 3.1 are satisfied for the estimator α
−1
αn,h be its mean, it is approximated as αn,h (x) = µα,n,h (x)fX,n,h (x) +
−1
O((T h) ), where
Z Z
−1
µα,n,h (x) = ∆n yKh (x − s) dFXt ,Yt (s, y)

µα (x) = EfXt (x)α(x).

bn,h for a function α in class C2 (X ) has a first


The bias of the estimator α
order expansion

bα,n,h (x) = αn,h (x) − α(x) = h2 bα (x) + o(h2 ),


m2K −1 (2)
bα (x) = f (x){µα (x) − α(x)f (2) (x)}
2
and its variance is

vα,n,h (x) = ∆−1


n (nh)
−1
{σα2 (x) + o(1)},
σα2 (x) = κ2 f −1 (x)V ar(Yt | Xt = x) = κ2 f −1 (x)β 2 (x).

For the estimation of the function β defining the variance of X, let

Zi = Yi − ∆n α
bn,h (Xti ) = ∆n (α − α
bn,h )(Xti ) + β(Xti )εi ,
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

150 Functional estimation for density, regression models and processes

its mean is E{Zi |Xti = x} = ∆n E(α − α bn,h )(Xti ) = ∆n (α − αn,h )(Xti ) its
order is ∆n O(h2 ) and its variance satisfies

∆−1 −1
bn,h )(Xti ) + β(Xti )εi }2
n V ar{Zi |Xti = x} = ∆n E{∆n (αn,h − α
αn,h (Xti ) + β 2 (x)
= ∆n V arb
= (nh)−1 κ2 f −1 (x) + β 2 (x). (8.4)

A consistent estimator of the function β 2 (x) is therefore


Pn
Z 2 K (x − Xti )
βbn,h
2
(x) = Pn i h
i=1
.
∆n i=1 Kh (x − Xti )
The approximations and the convergences of Proposition 3.1 are also sat-
isfied for the estimator βbn . Let βn,h be its mean and let

µβ,n,h (x) = ∆−1 2


n E[Zi Kh (x − Xti )]
= E [β 2 (Xt ) + (nh)−1 κ2 f −1 (Xti )

− ∆n (αn,h − α)2 (Xt )}]Kh (x − Xt )
= µβ (x) + o(1),
µβ (x) = f (x)β 2 (x).
2 −1
The mean βn,h is approximated as βn,h (x) = µβ,n,h (x)fX,n,h (x) +
−1
O((nh) ). Under conditions (2.1) and (3.1) for the functions α and β
in class C2 (X ), the bias of the estimator βbn is
2
bβ 2 ,n,h (x) = βn,h (x) − β 2 (x) = h2 bβ (x) + o(h2 ),
m2K −1 (2)
bβ 2 (x) = f (x){µβ (x) − β 2 (x)f (2) (x)}
2
and its variance is vβ,n,h (x) = (nh)−1 {σβ2 (x)+ o(1)} where σβ2 (x) is the first
term in the expansion of κ2 f −1 (x)∆−2 2
n V ar(Zt | Xt = x), provided by the
−2 2
approximation of ∆n V ar(Zt | Xt = x) by

E{∆2n (b
αn,h − α)4 (Xt ) + β 4 (Xt )∆−2
n ε
4

+ 2β 2 (Xt )(b
αn,h − α)2 (Xt ) | Xt = x} − E 2 (∆−1 2
n Zt | Xt = x)
= β 4 (x)∆−2 4 2
αn,h − α)4 (x) + 2β 2 (x){vα,n,h (x) + b2α,n,h (x)}
n Eε + ∆n E(b
− {β 2 (x) + vα,n,h (x) + b2α,n,h (x)}2
= β 4 (x){∆−2 4 4
n Eε − 1} = 2β (x) + o(1),

thus σβ2 is written in the form σβ2 (x) = 2κ2 f −1 (x)β 4 (x). Under the condi-
tion h = hT = 0(T −1/5 ), let cα = limT →∞ (T h5T )1/2 .
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 151

Proposition 8.1. Under Conditions 2.1, 2.2, 3.1 and 3.2 for the func-
bn,h and βbn,h are uniformly
tions α and β in class Cs (X ), the estimators α
consistent on X , with bias

bα,n,h (x) = αn,h (x) − α(x) = hs bα (x) + o(hs ),


msK −1
bα (x) = f (x){µ(s) α (x) − α(x)f
(s)
(x)},
s!
2
bβ 2 ,n,h (x) = βn,h (x) − β 2 (x) = hs bβ (x) + o(hs ),
msK −1 (s)
bβ 2 (x) = f (x){µβ (x) − β 2 (x)f (s) (x)}
s!
and their variances are vα,n,h (x) and vβ,n,h (x). Moreover

αn,h (x) − αn,h (x)kp = 0((T h)−1/p ) kβbn,h (x) − βn,h (x)kp = 0((nh)−1/p ) ,
kb

for every p ≥ 2, where the approximations are uniform. If h = 0(T −1/5 ),


the process (T h)1/2 (b
αn,h − α − cα bα ) converges weakly to centered Gaussian
process with variance σα2 (x), the process (nh)1/2 (βbn,h
2
− β 2 − γ 1/2 bβ 2 ) con-
verges weakly to centered Gaussian process with variance σβ2 (x) at x, and
the limiting covariances are zero.

The order for the bandwidths is the order of the optimal bandwidth for the
asymptotic mean squared errors of estimation of α. The conditions ensure
a Lipschitz property for the second order moment of the increments of the
processes, similar to Lemma 2.2 for the density. Moreover, the covariances
develop like in the proof of Theorem 2.1.
The variance of the variable Y in model (8.2) being a function of X, the
regression function α is also estimated by the mean of a weighted kernel as in
b ti ) = σα−1 (Xti ). As previously,
Section 3.6, with the weighting variables w(X
the approximations of the bias and variance of the new estimator (3.19)
of the drift function are modified by introducing w bn and its asymptotic
distribution is modified.
With a partition of [0, T ] in subintervals Ii of unequal length ∆n,i vary-
ing with the observation timesti of the process the variable Yi has to be
normalized by ∆n,i , 1 ≤ i ≤ n. For every x in Xn,h , the estimators are
Pn −1
i=1 ∆n,i Yi Kh (x − Xti )
bn,h (x) =
α P n ,
i=1 Kh (x − Xti )
−1/2
Zn,i = ∆n,i {Yi − ∆n,i αbn,h (Xti )},
Pn 2
i=1 Zn,i Kh (x − Xti )
βbn,h
2
(x) = P n .
i=1 Kh (x − Xti )
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

152 Functional estimation for density, regression models and processes

The results of Proposition 8.1 are satisfied, replacing the means of sums
Pn
with terms ∆−1n,i by means with coefficient n
−1 −1
i=1 ∆n,i and assuming
−1
that the lengths ∆n,i have the order n T . The optimal bandwidth for the
estimation of α is O(T −1/(2s+1) ) and its asymptotic mean squared error is
AM SEα (x) = (T h)−1 σα2 (x) + hs2 b2α,s (x), it is minimum for the bandwidth
function
n (s!)2 κ T −1 V ar(∆−1 o1/(2s+1)
2 n,i Yti )
hα,AMSE (x) = .
2sm2sK {µ(s)
α (x) − α(x)f
(s) (x)}2

The optimal local bandwidth for estimating the variance function β 2 of the
diffusion is a O(n−1/(2s+1) ) and it minimizes AM SEβ (x) = (nh)−1 σβ2 (x) +
hs2 b2β,s (x).
A diffusion model including several explanatory processes in the coeffi-
cients α and β may be written using an indicator process (Jt )t with values
in a discrete space {1, . . . , K} as

K
X K
X
dXt = αk (Xt )1{Jt = k}dt + βk (Xt )1{Jt = k}dBt , t ∈ [0, T ]. (8.5)
k=1 k=1

Let Xtk = Xt 1{Jt = k} be the partition of the variable corresponding to


the models for the drift and the variance of equation (8.5). The model is
equivalent to
K
X K
X
dXt = αk (Xtk )dt + βk (Xtk )dBt
k=1 k=1

and the estimators of the 2K functions αk and βk are defined for every for
x in Xn,h by
Pn −1
i=1 ∆n,i Yi Kh (x − Xti ,k )
bk,n,h (x) =
α Pn ,
i=1 Kh (x − Xti ,k )
Pn −1 2
i=1 ∆n,i Zi Kh (x − Xti ,k )
βbk,n,h
2
(x) = Pn .
i=1 Kh (x − Xti ,k )

Their means are approximated as the estimators for model (8.1) by


−1
αk,n,h (x) = µαk ,n,h (x)fX,n,h (x) + O((T h)−1 ),
2 −1
βk,n,h (x) = µβk ,n,h (x)fX,n,h (x) + O((nh)−1 ),
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 153

where
n
X Z Z
µα,k,n,h (x) = En −1
∆−1
n,i yKh (x − s) dFXk,ti ,Yti (s, y)
i=1
h2
= µαk (x) + m2K µ(2) 2
αk (x) + o(h )
2
µαk (x) = f (x)αk (x),
n
X
µβk ,n,h (x) = n−1 ∆−1 2
n,i E[Zi Kh (x − Xk,ti )]
i=1
h2 (2)
= µβk (x) + m2K µβk (x) + o(h2 ),
2
µβk (x) = f (x)βk2 (x).
The norms and the asymptotic behaviour of the estimators is the same as
in Proposition 8.1. The two-dimensional model
dXt = αX (Yt )dt + βX (Xt )dBX (t),
dYt = αY (Yt )dt + βY (Yt )dBY (t)
with independent Brownian processes BX and BY is a special case where
all parameters are estimated as before.

The process (8.1) is generalized with functions depending of the sample-


path and of the current time
dXt = α(t, Xt )dt + β(t, Xt )dBt , t ∈ [0, T ] (8.6)
under similar conditions. A discretization of the time interval [0, T ] leads
to
Yi = Xti+1 − Xti = (ti+1 − ti )α(ti , Xti ) + β(ti , Xti )(Bti+1 − Bti ).
The functions α and β are now defined in (R+ × X) and they are estimated
by
Pn
Y K (x − Xti )Kh2 (t − ti )
bn,h (t, x) =
α P n i h1
i=1
,
∆n i=1 Kh1 (x − Xti )Kh2 (t − ti )
Pn
Z 2 K (x − Xti )Kh2 (t − ti )
βbn,h
2
(t, x) = P n i h1
i=1
,
∆n i=1 Kh1 (x − Xti )Kh2 (t − ti )
with Zi = Yi − ∆n α
bn,h (ti , Xti ) = ∆n (α − α
bn,h )(ti , Xti ) + β(ti , Xti )εi . Their
variance has now the order (nh1 h2 )−1 , their bias is a O((h1 h2 )2 ) and the
convergence rate of the centered processes is (nh1 h2 )1/2 .
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

154 Functional estimation for density, regression models and processes

8.3 Estimation for continuous diffusion processes

The process {Xt , t ∈ [0, 1]} is extended to a time interval [0, T ] by rescaling:
Xt = XT s , with s in [0, 1] and t in [0, T ]. Now the Gaussian process B is
mapped from [0, 1] onto [0, T ] by the same transform and Bs = T 1/2 Bt/T is
the Brownian motion extended from [0, 1] to [0, T ]. The observation of the
sample-path of the process {Xt , t ∈ [0, T ]} allows to construct estimators
similar those of smooth density and regression function in Sections 2.10 and
3.10, under the ergodic property (2.13). The Brownian process (Bt )t≥0
is a martingale with respect to the filtration generated by the (Bu )u<t ,
E(Bt − Bs | Xs ) = 0 for every 0 < s < t. Its moments are EBt2k+1 = 0,
Bt2 = t thus (Bt − B0 )2 has a tχ21 distribution and, for every integer k,
Bt2k = tk G(k) (0) with the generating function G2k (t) = (1 − 2t)−k of the
χ2k distribution, for t < 1/2, hence Bt4 = 3t2 .

Estimators are built like for regression functions of processes with the
response process Yt = dXt , without derivability assumption for the sample-
paths of X since B has only a L2 -derivative. The integrated drift function
Z t
A(t; X) = α(Xs ) ds
0
b X) = Xt − X0 , thus E A(t;
is estimated by A(t; b X) = A(t; X) and its vari-
Rt Rt 2
ance equals V ar{ 0 α(Xs ) ds} + E 0 β (Xs ) ds. The drift function α(Xt )
is estimated by smoothing the sample-path of the process X in a neighbor-
hood of Xt = x
RT
Kh (x − Xs ) dXs
bT,h (x) = R0 T
α . (8.7)
0 K h (x − X s ) ds
The estimators of the density and µα (x) = α(x)fX (x) defining (8.7) are
Z T
fbX,T,h (Xt ) = T −1 Kh (x − Xs ) ds,
0
Z T
−1
bα,T,h (x) = T
µ Kh (x − Xs ) dXs .
0
Their limits are expressed with the mean marginal density of the process
Z T
fX (x) = lim T −1 E fXs (x) ds
T →∞ 0
and the mixing property of the sample path of the process X implies that
Z T
−1
fX (x) − lim T E fXs (x) ds = O(T −1/2 ).
T →∞ 0
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 155

Assuming that the kernel satisfies Conditions 2.1-2.2, their moments are
approximated using Taylor expansions and the properties of the Brownian
motion, with covariance function E(Bs Bt ) = s ∧ t. With a diffusion process
X, their expectations are
Z
E fbX,T,h (x) = Kh (u − x) fX (u) du
IX
h2 (2)
= fX (x) + m2K fX (x) + o(h2 ),
2
Z T
bα,T,h (x) = T −1 E
Eµ Kh (x − Xs ){α(Xs ) ds + β(Xs ) dBs }
0
Z
= α(u)Kh (u − x)fX (u) du
IX
h2
= α(x)fX (x) + m2K (αfX )(2) (x) + o(h2 ),
2
so the bias of the estimator of µα (x) is h2 bµα (x) = h2 m2K (αfX )(2) (x)/2 +
RT
o(h2 ). Its variance T −2 E 0 Kh (Xt − x) {dXt − α(Xt ) dt}2 is expanded
using theR ergodicity property (2.16) as−1 inRSection 2.10, now the covariance
−1 T T
of T 0 Kh (Xs −x)β(Xs ) dBs and T R 0 Kh (Xt −x)β(Xt ) dBt as a sum
T
Id (T ) + Io (T ), where Id (T ) = T −2 E 0 Kh2 (Xt − x)β 2 (Xt ) dt develops as
Z
Id (T ) = T −1 Kh2 (u − x)β 2 (u)fX (u) du
IX
= (T hT )−1 κ2 β 2 (x)fX (x) + o((T hT )−1 )
and
R T Rthe expectation
R T Io (T ) is expanded using the ergodicity property, with
T 2
0 0 d(s∧t) = 2 0 (T −s) ds = T and the notation αh (u, v) = |u−v|/2hT
Z
Io (T ) = Kh (u − x)Kh (v − x)xβ(u)β(v) dFXs ,Xt (u, v) du dv
2 \D
IX X
Z Z 1
= K(z − αh (u, v))K(z + αh (u, v)) dz
IX\{u} −1

β(u)β(v) dπu (v) dFX (u)}{1 + o(1)}.


For every fixed u 6= v, αhT (u, v) tends to infinity as hT tends to zero,
R 1/2
then the integral −1/2 K(z − αh (u, v))K(z + αh (u, v)) dz tends to zero. If
αh (u, v) = O(hT ), this integral does not disappear but πu (v) tends to zero,
therefore the integral Io (T ) is a o((T hT )−1 ) as T tends to infinity. Under
the ergodicity condition (2.16) for sets of k finite dimensional distributions
of X, the Lp -norm of the centered estimator of µα also satisfies kb µα,T,h (x)−
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

156 Functional estimation for density, regression models and processes

µα,T,h (x)kp = O((T hT )−1/p ) and the approximation (3.2) is also satisfied
for the estimator αbT,h . It follows that the estimator α bT,h (x) of a drift
function α in class Cs , for s ≥ 2, has a bias and a variance
bα,T,h (x; s) = hsT bα (x; s) + o(hsT ),
msK −1 (s)
bα (x; s) = f (x){(αfX )(s) (x) − α(x)fX (x)},
s! X
vm,T,h (x) = (T hT )−1 {σα2 (x) + o(1)},
−1
σα2 (x) = κ2 fX (x)β 2 (x)
so they have the same expressions as in the discretized regression model
bT,h (x) and α
(8.2), the covariance of α bT,h (y) tends to zero. Let
Z t Z t Z t
Z t = Xt − X0 − bT,h (Xs ) ds =
α (α − α
bT,h )(Xs ) ds + β(Xt ) dBt ,
0 0 0
(8.8)
its expectation conditionallyRon the filtration generated by the process X
t
up to t− is E(Zt | Ft ) = − 0 bα,T,h (Xs ) ds = O(h2 ) for every t > 0 and
the main term of its conditional variance
Z t Z t
V ar(Zt | Xt ) = V ar{ bT,h (Xs )ds} +
α β 2 (Xs )ds
0 0
Z t Z t
− 2Cov{ (b αT,h )(Xs ) ds, β(Xs ) dBs } + O((T hT )−1 )
0 0
Rt
is 0 β 2 (Xs )ds. The variance function β 2 (Xt ) is therefore consistently es-
timated by
RT
b2 2 0 Zs Kh (Xs − x) dZs
βT,h (x) = RT . (8.9)
0
Kh (Xs − x) ds
Under conditions (2.1) and (3.1) for the functions α and β in class Cs (X ),
the bias of the estimator βbT,h
2
is
bβ,T,h (x) = hs bβ (x; s) + o(hs ),
msK −1
bβ (x; s) = f (x){(f β 2 )(s) (x) − β 2 (x)f (s) (x)}.
s!
Its variance is vβ,T,h (x) = (T h)−1 {σβ2 (x) + o(1)} where σβ2 (x) is the first
term in the expansion of κ2 f −1 (x)V ar(Zt2 | Xt = x) calculated like in
the discrete model, that is σβ2 (x) = 2κ2 f −1 (x)β 4 (x), as in Proposition 8.1.
Under the previous conditions, the processes (T hT )1/2 (b αT,h − α − bα,T,h )
1/2 b2 2
and (T hT ) (βT,h − β − bβ 2 ,T,h ) converge weakly to a centered Gaussian
processes with mean zero, covariances zero and respective variance functions
σα2 and σβ2 .
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 157

The mean squared error of the estimator at x for a marginal density in


Cs is then
−1
M ISET,hT (x) = (T hT )−1 κ2 fX (x)V ar(Y | X = x)
+ h2s
T bα (x; s) + o((T hT )
−1
) + o(h2s
T )
and the optimal local and global bandwidths minimizing the mean squared
(integrated) errors are O(T 1/(2s+1) )
n 1 σ 2 (x) o1/(2s+1)
α
hAMSE,T (x) =
T 2sb2α (x; s)
and, for the asymptotic mean integrated squared error criterion
n 1 R σ 2 (x) dx o1/(2s+1)
hAMISE,T = R α .
T 2s b2α (x; s) dx
With the optimal bandwidth rate, the asymptotic mean (integrated)
squared errors are O(T 2s/(2s+1) ).

The same expansions as for the variance of µ bT,h (x) and fbX,T,h (x)
in Section 2.10 prove that the finite dimension distributions of the pro-
αT,h − α − bα,T,h ) and (T hT )1/2 (βbT,h − β − bβ,T,h ) con-
cess (T hT )1/2 (b
verge to those of a centered Gaussian process with mean zero, covari-
ances zero and variance functions σα2 and σβ2 . Lemma 3.3 generalizes
and the increments E{b αT,h (x) − αbT,h (y)}2 and E{βbT,h (x) − βbT,h (y)}2
are approximated by O(|x − y|2 (T h3T )−1 ) for every x and y in IX,h such
that |x − y| ≤ 2hT . Then the processes (T hT )1/2 {b αT,h − α}I{IX,T } and
1/2 b
(T hT ) {βT,h − β}I{IX,T } converge weakly to σα W1 + γ 1/2 bα and σβ W2 +
γ 1/2 bβ , respectively, where W1 and W2 are centered Gaussian processes
on IX with variance 1 and covariances zero. The covariance Cα,β,T,h (x, y)
of αbT,h (x) and βbT,h (y), with 2|x − y| > hT develops using the approxi-
RT RT
mation (3.2) as {f (x)f (y)T }−2[E{ 0 Kh (Xs − x)β(Xs ) dBs }{ 0 Kh (Xs −
RT
y)(2Zt dZt − β 2 (Xt ) dt)} − Eα(x){ 0 Kh (Xs − y)β(Xs ) dBs }(fbT,h −
RT
fT,h )(x) − Eβ(y){ 0 Kh (Xs − x)(2Zt dZt − β 2 (Xt ) dt)}(fbT,h − fT,h )(y) +
α(x)β(y)(T hT )−1 Cov(fbT,h (x), fbT,h (y)), it is therefore a o((T hT )−1 ).
According to the local optimal bandwidths defined in the previous sec-
tions, the estimators α bn,h and βbn,h are calculated with a functional band-
width sequences (hn (x))n or (hT (x))T . The assumptions for the convergence
of these sequences are similar to the assumptions for the nonparametric re-
gression with a functional bandwidth and the results of Chapter 4 apply
immediatly for the estimators of the discretized or continuous processes
(8.2) and (8.1).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

158 Functional estimation for density, regression models and processes

8.4 Estimation of discretely observed diffusions with jumps

Let α, β and γ be functions of class C2 on a metric space (X, k · k), let B


be the standard Brownian motion, M = N − N e be a centered martingale
associated to a point process N , with predictable compensator N e (t) =
Rt
0
Y dΛ, and such that M is independent of B. The process Y is predictable
and there exists a function g defined on [0, 1] such that sups∈[0,1] |T −1 YT s −
g(s)| converges to zero in probability, the function
R T g and the hazard function
λ is supposed to be in class C2 (R); ET −1 0 γ 2 dN e are finite for every
stopping time T .
The process Xt solution of the stochastic differential equation
dXt = α(Xt )dt + β(Xt )dBt + γ(Xt )dMt , t ∈ [0, T ], (8.10)
has a discrete and a continuous part. A discretization of this equation into
n sub-intervals of length ∆n,i tending to zero as n tends to infinity gives
the approximated equation
Yi = Xti+1 − Xti = ∆n,i α(Xti ) + β(Xti )∆Bti + γ(Xti )∆Mti .
Let εi = ∆Bti = Bti+1 − Bti , with zero mean and variance ∆n,i condi-
tionally on the σ-algebra Fti generated by the sample-paths of X up to ti ,
ηi = η(ti+1 ) defined by Mti+1 − Mti , with expectation zero and variance
Neti+1 − N
eti = O(∆n,i ) conditionally on the σ-algebra Fti generated by the
sample-paths of X; E{α(Xti )εi } = 0, E{β(Xti )ηi } = 0, and the martin-
gales (Bt )t≥0 and (Mt )t≥0 have independent increments, by definition. The
functionals of the martingale M and the process N e are estimated from the
observation of the point process N , as in Chapter 4. The variables XTi are
supposed to satisfy an ergodic property for the random stopping times of
the counting process N , in addition to Conditions 6.2 and 8.1.
Condition 8.2. There exists a mean density of the variables XTi defined
as the limit
Z T
−1
fN (x) = lim T fXs (x) dN (s).
T →∞ 0

This condition is satisfied if the jump part of Rthe process Xt satisfies the
T e (s). The diffu-
property (8.3) and the limit is fN (x) = T −1 E 0 fXs (x) dN
sion process Xt defined by (8.10) has the mean
Z t
µT = EX0 + E α(Xs )ds
0
Z Z t Z
= EX0 + α(x)fXs (x) dt dx = EX0 + t α(x)f (x) dx
X 0 X
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 159

and the variance of the normalized variable T −1/2 (XT − µT ) is finite if the
integrals
Z T Z
Sα = ET −1 α2 (Xs ) ds = α2 (x)f (x) dx + o(1),
0 X
Z T Z
Sβ = ET −1 β 2 (Xs ) ds = β 2 (x)f (x) dx + o(1),
0 X
Z T Z
Sγ = ET −1 e (s) =
γ 2 (Xs ) dN γ 2 (x)fN (x) dx + o(1)
0 X

are finite. Then T 1/2 (T −1 XT −µT ) converges weakly to a centered Gaussian


variable with variance SX = Sα +Sβ +Sγ . Let SX (t) be the function defined
as above with integrals on [0, t]. The process T 1/2 (T −1 XsT − µsT )0≤s≤1 is
a sum of stochastic integrals with respect to the martingales B and M .

Proposition 8.2. The process WT,s = T 1/2 (T −1 XsT − µsT )0≤s≤1 is a


martingale. If SX < ∞, WT,s converges weakly to a Brownian motion BX
with variance function SX (s) on [0, 1].

Let a in R and Ta = inf{s ∈ [0, 1]; BX (s) = a} be a stopping time for the
process BX , then for every θ ≥ 0

E exp{θSX (Ta )} = exp(−a 2θ).
Let a in R and TT,a = inf{s ∈ [0, 1]; WT,s = a} be a stopping time for the
process WT,s .

Corollary
√ 8.1. For every θ ≥ 0, E exp{θSX (TT,a )} converges to
exp(−a 2θ) as T tends to infinity.

Moments of discontinuous parts of a diffusion process with jumps re-


quire another ergodicity condition defining another mean density and it is
satisfied under the mixing property (8.3). Conditionally on Fti , the vari-
ables Yi have the expectation ∆n,i α(Xti ) and the variance
Z ti+1
V ar(Yi |Xti ) = β 2 (Xti )∆n,i + e (s)
γ 2 (Xs ) dN
ti
e (ti ) + o(∆n,i ) = O(∆n,i ).
= β 2 (Xti )∆n,i + γ 2 (Xti ) ∆N
A nonparametric estimator of the function α is the kernel estimator nor-
malized by ∆n,i as in the previous section
Pn −1
i=1 ∆n,i Yi Kh (x − Xti )
αbn,h (x) = Pn ,
i=1 Kh (x − Xti )
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

160 Functional estimation for density, regression models and processes

Pn Rt
for x in Xn,h . Let ∆−1 n denote n
−1 −1 e
i=1 ∆n,i and E Nt = 0 g(s)λ(s) ds,
then the mean of α bn,h (x) is approximated by
h2
αn,h (x) = α(x) + m2K {(f α)(2) (x) − α(x)f (2) (x)} + o(h2 ).
2
The variance of α bn,h (x) is a O((T h)−1 )
Xn
vα,n,h (x) = n−1 ∆−2
n,i (nh)
−1
{σα2 (x) + o(1)},
i=1
σα2 (x) = κ2 f −1
(x)∆−1
n V ar(Yt | Xt = x)
= κ2 f −1 (x){β 2 (x) + γ 2 (x)g(t)λ(t)}
and its covariances tend to zero. The process (T h)1/2 (bαn,h − α) has the
−1
asymptotic variance κ2 σα2 (x)fX (x), at x.
P
The discrete part of X is X d (t) = s≤t γ(Xs )∆Ns and its continuous
Rt Rt Rt
es , with variations
part is X c (t) = 0 α(Xs ) ds + 0 β(Xs ) dBs − 0 γ(Xs ) dN
on (ti , ti+1 )
∆Xic = α(Xti ) ∆n,i + β(Xti ) ∆Bti − γ(Xti ) ∆n,i Y (ti )λ(ti ) = Op (∆n,i ).
Rt
Then the sum its jumps converges to 0 Eγ(Xs ) g(s)dΛs . Let (Ti )1≤i≤N (T )
be the jumps of the process N . The jumps ∆X d (Ti ) = γ(XTi ) yield a
consistent estimator of γ(x), for x in Xn,h
P d
1≤i≤N (T ) ∆X (Ti )Kh (x − XTi )
bn,h (x) =
γ P
1≤i≤N (T ) Kh (x − XTi )
P
1≤i≤N (T ) γ(XTi )Kh (x − XTi )
= P .
1≤i≤N (T ) Kh (x − XTi )
The expectation of b γn,h (x) is approximated by the ratio of the means of
the numerator and the denominator. For the numerator
Z T
−1 h2
ET γ(Xs )Kh (x−Xs ) dNs = (γfN )(x)+ m2K {(γfN )(x)}(2) +o(h2 )
0 2
R T
and, for the denominator ET −1 0 Kh (x − Xs ) dNs = fN (x) +
h2 (2) 2
2 m2K fN (x) + o(h ). The bias of b γn,h (x) is then
h2 (2)
bγ,n,h(x) = m2K {fN (x)}−1 [{γ(x)fN (x)}(2) − γ(x)fN (x)] + o(h2 ),
2
also denoted bγ,n,h (x) = h2 bγ . The variance of b
γn,h (x) is deduced from the
variance of the numerator
Z T
T −2 E Kh2 (x − Xs ) dNs
0
h2 (2)
= (T h)−1 {κ2 fN (x) + κ22 f (x)} + o(T −1 hT )
2 N
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 161

the variance of the denominator


Z T
T −2 E γ 2 (Xs )Kh2 (x − Xs ) dNs
0
= (T h)−1 {κ2 γ 2 (x)fN (x) + h2 κ22 (γ 2 fN )(2) (x)} + o(hT −1 )
and their covariance
Z TZ T
T −2 E γ(Xs )Kh (x − Xs )Kh (x − Xt ) dNs dNt
0 0
−1
= (T h) {κ2 γ(x)fN (x) + h2 κ22 (γfN )(2) (x)} + o(hT −1 ),
therefore vγ,n,h (x) = T −1 hvγ (x) with
(2)
vγ (x) = κ22 {fN (x)}−1 {(γ 2 fN )(x)(2) − γ 2 (x)fN (x)} + o(hT −1 ).
It follows that the process (T h−1 )1/2 (b
γn − γ − cα bγ ) converges weakly to
a centered Gaussian process with variance function vγ (x) and covariances
zero.
For the estimation of the variance function β of model (8.10), let
Zi = Yi − ∆n,i α
bn,h (Xti ) − b
γn,h (Xti )ηi
= ∆n,i (α − α
bn,h )(Xti ) + β(Xti )εi + (γ − γ
bn,h )(Xti )ηi ,
its conditional expectation E(Zi | Xti = x) = ∆n,i (α − αn,h )(Xti ) tends to
zero and its conditional variance satisfies
∆−1 2 4
n,i V ar{Zi | Xti } = β (Xti ) + o(h ) + o((nh)
−1
) + o((T h)−1 ).
An estimator of the function β is deduced for x in Xn,h
P −1 2
b 2 1≤i≤n ∆n,i Zi Kh (x − Xti )
βn,h (x) = Pn .
i=1 Kh (x − Xti )

The previous approximations of the estimator βbn,h given in Proposition 8.1


are modified, its expectation is approximated by
X
2
βn,h (x) = n−1 ∆−1 2 −1 2
n,i EZi Kh (x − Xti )fN (x) + o(h )
1≤i≤n

therefore its bias is E βbn,h


2
− β 2 = bβ,n,h + o(h2 ) with
h2
bβ,n,h = m2K f −1 (x){(f β 2 )(2) (x) − β 2 (x)f (2) (x)} + o(h2 ).
2
Under conditions (2.1) and (3.1) for the function β in class C2 (X ), the
variance of the estimator βbn,h
2
is vβ,n,h (x) = (nh)−1 {σβ2 (x) + o(1)}, with
−1
σβ2 (x) = κ2 fXt
(x)∆−2 2
n V ar(Zt | Xt = x).
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

162 Functional estimation for density, regression models and processes

The normalized variance ∆−2 2


n V ar(Zt | Xt = x) develops as

E{∆2n (b
αn,h − α)4 (x) + β 4 (x)∆−2 4
bn,h )4 (x)∆−2
n ε + (γ − γ n η
4

+ O(h4 ) + O((nh)−1 ) + O(hT −1 )

where the Burkhölder-Davis-Gundy inequality implies that the order of Eηi4


is a O((Eηi2 )2 ) = O(∆2n,i ). Then, from the expression of the moments of the
variable ε, σβ2 (x) = β 4 (x)(∆−2 4 4
n Eε −1)+o(1) = 2β (x)+o(1). The variance
of βb2 is therefore written vβ,n,h (x) = (nh)−1 vβ (x), it is a O((nh)−1 ) and
n,h
the process (nh)1/2 (βbn − β − (nh5 )1/2 bβ ) converges weakly to a centered
Gaussian process with variance function vβ and covariances zero.

8.5 Continuous estimation for diffusions with jumps

In model (8.10), the estimator α bT,hT of Section 8.3 is unchanged and new
estimators of the functions β and γ must be defined from the continuous
observation
Rt of the sample path of X. The discrete part of X is also written
Xtd = 0 γ(Xs )dNs and the point process N is rescaled as Nt = NT s , with
t in [0, T ] and s in [0, 1]. Let

NT (s) = T −1 NT s ,
XT (s) = T −1 XT s , t ∈ [0, T ], s ∈ [0, 1].
R
The predictable compensator of NT is written N eT (t) = T −1 t YT (s)λ(s) ds
0
on [0, 1] and
Rt it is assumed to converge uniformly on [0, 1] to its mean
e d
E NT (t) = 0 g(s)λ(s) ds, in probability. Then XT (t) converges uniformly
Rt
in probability to 0 Eγ(XT (s)) g(s)dΛ(s). The continuous part of X is
dXtc = α(Xt ) dt + β(Xt ) dBt − γ(Xt )Yt λt dt. A consistent estimator of
γ(x), for x in IX,T,h
RT
Kh (x − Xs ) dX d (s)
bT,h (x) = R0 T
γ ,
0
Kh (x − Xs ) dN (s)
RT
Kh (x − Xs )γ(Xs ) dN (s)
= 0 RT ,
0
Kh (x − Xs ) dN (s)
it is identical to the estimator previously defined for the discrete diffusion
process. Its moments calculated in the continuous model (8.10) are iden-
tical to those of Section 8.4 then the process (T h−1
T )
1/2
γT,hT − γ − cα bγ )
(b
converges weakly to a centered Gaussian process with variance function vγ
and covariances zero.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 163

The variance function β 2 (Xt ) is now estimated by smoothing the


squared variations of the process
Z t
Z t = Xt − X0 − bT,h (Xs ) ds
α (8.11)
0
Z t Z t Z t
= (α − α
bT,h )(Xs ) ds + β(Xs ) dBs + (γ − γ
bT,h )(Xs ) dMs .
0 0 0
For every t in [0, T ], its first two conditional moments are
Z t
E(Zt | Ft ) = − bα,T,h (Xs ) ds = O(h2 )
0
and
Z t Z t
V ar(Zt | Ft ) = V ar{ bT,h (Xs )ds} + E
α β 2 (Xs )ds
0 0
Z t
+E (γ − γ es
bT,h )2 (Xs ) dN
0
Z
=t β (x)fXs (x) dx + O((T hT )−1 ) + O(h4T ).
2
X
Furthermore, the Burkhölder-Davis-Gundy inequality implies the existence
of a constant c4 such that
Z t Z
V arZt2 = E{ β(Xs ) dBs }4 − {t β 2 (x)f (x) dx}2
0
Z t Z
≤ c4 E β 4 (Xs ) ds = c4 t β 2 (x)f (x) dx.
0
The variance function β 2 (x) is then consistently estimated smoothing the
process Zt2
RT
b2 Kh (Xs − x) Zs dZs
βT,h (x) = 2 0 R T . (8.12)
0
Kh (Xs − x) ds
Under conditions (2.1) and (3.1) for the function β in class C2 (X ) and
using the ergodicity property (2.13) for the limiting density f of the process
(Xt )t∈[0,T ] , the expectation of the denominator of (8.12) is
Z T Z T
T −1 E Kh (Xs − x) ds = T −1 E fXs (x) ds
0 0
2 Z T
h (2)
+ m2K T −1 E fXs (x) ds + o(h2 )
2 0
h2
= f (x) + m2K f (x) + o(h2 )
(2)
2
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

164 Functional estimation for density, regression models and processes

the expectation of the numerator is


Z T
2T −1 E Kh (Xs − x)Zs dZs
0
Z T Z
= 2T −1 E Kh (u − x) β 2 (u) fXs (u) du + o(h4 )
0 X

2 h2
= β (x) f (x) + m2K (β 2 (x)f (x))(2) + o(h2 )
2
and its bias is denoted bβ,T,h = h2 bβ + o(h2 ), with
1
bβ = m2K f −1 (x){(f β 2 )(2) (x) − β 2 (x)f (2) (x)}.
2
Under conditions (2.1) and (3.1) for the function β in class R C2 , the
variance of the estimator βbT,h is obtained from E(Zt2 | X) = β 2 (Xs ) ds,
V ar(Zt2 | X) = O(t) and expanding
Z TZ
−2
ET Kh2 (x − y)V ar(Zt2 | Xt = y)fXt (y) dy dt = O((hT )−1 ),
0
2
it is therefore written σβ,T,h = (hT )−1 vβ + o((hT )−1 ). Then the process
(T hT )1/2 (βbT,h − β − (T h5T )1/2 bβ ) converges weakly to a centered Gaussian
process with variance function vβ and covariances zero.

8.6 Transformations of a non-stationary Gaussian process

Consider the non-stationary processes Z = X ◦ Φ, where X is a stationary


Gaussian process with covariance R(x, y) = E(Xx Xy ) and Φ is a monotone
function C1 ([0, 1]) with Φ(0) = 0 and Φ(1) = 1. The transform is expressed
as Φ(x) = v −1 (1)v(x) with respect
R x to the integrated singularity
Rx function of
the covariance r(x, x), v(x) = 0 ξ(u) du. Conversely, 0 ξ(u) du = cξ Φ(x)
R1
with cξ = 0 ξ(u) du. A direct estimator of the regularity function ξ is
obtained by smoothing the estimator Φ b n (x) defined by (1.12)
Z 1
ξbn,h (x) = Vn (1) Kh (x − y) dΦb n (y)
0
Z 1
= Kh (x − y) db
vn (y).
0
R1
The expectation of ξbn,h (x) is ξn,h (x) = 0 Kh (x − y) dv(y) and the process
R1
(ξbn,h − ξn,h )(x) = 0 Khn (x − y)d(Φ b n − Φ)(y) is uniformly consistent, since
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Diffusion processes 165

an integration by parts implies


Z 1 Z 1
k b b
Khn (s − y) d(Φn − Φ)(y)k ≤ kΦn − Φk |dKhn (s − y)|
0 0
b n − Φk
+ sup |Khn | kΦ
Z
≤ (sup |K| + |dK(z)|) h−1 b
n kΦn − Φk

which converges to zero in probability, by the weak convergence of


b n − Φk. The process n1/2 (b
n1/2 k√Φ vn − v) converges weakly to the pro-
Rx
cess 2 0 v(y)dW (y) where W is a Gaussian process with mean zero and
1/2
R x∧y 2 x ∧ y at (x, y), then the covariance of the limit of n (b
covariances vn − v)
b
is 2 0 v (y) dy at x 6= y. The limiting variance of ξn,h (x) is
Z 1 Z 1
E{ vn − v)(y)}2 = E
Kh (x − y) d(b Kh2 (x − y) dV ar(b
vn − v)(y)
0 0
Z 1 Z 1
+E Kh (x − y)Kh (x − u) dCov{(b
vn − v)(y), (b
vn − v)(u)
0 0
Z 1
= O(n−1 Kh2 (x − y)v 2 (y) dy) = O((nh)−1 )
0

The convergence rate of the process ξbn,h is therefore (nhn )1/2 and the finite
dimensional distributions of (nhn )1/2 (ξbn,h − ξn,h ) converge to those of a
Gaussian process with mean zero, as normalized sums of the independent
variables defined as the weighted quadratic variations of the increments of
Z. The covariances of (nhn )1/2 (ξbn,h − ξn,h ) are zero except on the interval
[−hn , hn ] where they are bounded, hence the covariance function converges
to zero. The quadratic variations of ξbn,h satisfy a Lipschitz property of
moments

E|(ξbn,h − ξn,h )(x) − (ξbn,h − ξn,h )(y)|2


Z 1
= 2n−1 | {Kh2 (x − u) − Kh2 (y − u)}v 2 (u) du|
0

it is then a O((nh3n )−1 |x − y|2 ) for |x − y| ≤ 2hn . It follows that the process
(nhn )1/2 (ξbn,h − ξn,h ) converges weakly to a continuous process with mean
zero and variance function 2v 2 and covariances zero.
The singularity function of the spatial covariance of a Gaussian process
Z is estimated by smoothing the estimator of the integrated spatial trans-
form of Z on [0, 1]3 , the convergence rate of the estimator is then (nh3 )1/2 .
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

166 Functional estimation for density, regression models and processes

8.7 Exercises

(1) Calculate the moments of the estimators for the continuous process
(8.6) and write the necessary ergodic conditions for the convergences
in this model.
(2) Calculate the bias and variance of derivatives of the estimators of func-
tions α, β and γ in the stochastic differential equations model (8.10).
(3) Prove Proposition 8.2.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Chapter 9

Applications to time series

Let (X, k·k) be a metric space and (Xt )t∈N be a time series defined on XN by
its initial value X0 and a recursive equation Xt = m(Xt−p , . . . , Xt−1 ) + εt
where m is a parameric or nonparametric function defined on Xp for some
p > 1 and (εt )t is a sequence of independent noise variables with mean zero
and variance σ 2 , such that for every t, εt is independent of (Xt−p , . . . , Xt−1 ).

The stationarity of a time series is a property of the joint distribution


of consecutive observations. The weak stationarity is defined by a constant
mean µ and a stationary covariance function
ρs,t = Cov(Xs , Xt ) = Cov(X0 , Xt−s ), for every s < t.
The series (Xt )t is strong stationary if the distributions of the sequences
(Xt1 , . . . , Xtk ) and (Xt1 −s , . . . , Xtk −s ) are identical for every sequence
(t1 , . . . , tk , s) in Nk+1 . The nonparametric estimation of the mean and
the covariances is therefore useful for modelling the time series. The mov-
ing average processes are stationary, they are defined as linear combina-
tions of past and present noise terms such as the MA(q) process Xt =
Pq
εt + k=1 θk εt−k , with independent variables εj such that Eεj = 0 and
Pq
V arεj = σ 2 , for every integer j. The variance of Xt is σq2 = σ 2 2
k=1 θk +1
and it is supposed to be finite. The covariance of Xs and Xt such that
Pq P(t−s+q)∧q 2 
0 < t − s < q is Cov(Xs , Xt ) = σ 2 θ
k=t−s k + k=t−s+1 θ k , it only
depends on the difference t − s. The moving average processes with |θ| < 1
are reversible and the process Xt can be expressed as an auto-regressive
process, sum of εt and an infinite combination of its past values. Generally,
an AR process is not stationary.

In nonstationary series, a nonstationarity may be due to a smooth trend


or regular and deterministic seasonal variations, to discontinuities or to a

167
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

168 Functional estimation for density, regression models and processes

continuous change-points. A transformation such as differencing a stochas-


tic linear trend reduces the nonstationarity of the series, other classical
transformations are the square root or power transformations for data with
increasing variance. Periodic functions of the mean can be estimated after
the identification of the period and nonparametric estimator is proposed
in Section 9.2. Change-points of nonparametric regressions in time or at
thresholds of the series are stronger causes of non regularity and several
phases of the series must be considered separately, with estimation of their
change-points. Their estimators are studied in Section 9.5.

9.1 Nonparametric estimation of the mean

The simplest nonparametric estimators for the mean of a stationary process


are the moving average estimators

k
1 X
bt,k
µ = Xt−i ,
k + 1 i=0

k
for a lag k up to t. The transformed series is Xt − µ bt,k = k+1 Xt −
1
P k
k+1 i=1 Xt−i and it equals (Xt − Xt−1 )/2 for k = 2. A polynomial trend
is estimated by minimizing the empirical mean squared error of the model,
then the transformed series Xt − µ bt,k is expressed by the means of moving
average of higher order, according to the degree of the polynomial model.
Consider the auto-regressive process with nonparametric mean

Xt = µt + αXt−1 + εt , t ∈ N, (9.1)

with an independent sequence of independent errors (εt )t with mean zero


and variance σ 2 . With α 6= 1, its mean µt may be written (1 − α)mt , with
an unknown function mt and the solution Xt of Equation (9.1) is

t−1
X t
X
Xt = µt−k αk + αt X0 + αk εt−k .
k=0 k=1

With a mean and an initial value zero, the covariance of Xs and Xt is


Ps∧t
ρs,t = σ 2 k=1 α2k and it is not stationary. The asymptotic behaviour
of the process X changes as the mean crosses the threshold value 1. For
α = 1, the model is the classical nonparametric regression model.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Applications to time series 169

The parameters of the auto-regressive series AR(1) Xt = µ+αXt−1 +εt ,


with α 6= 1, are estimated by
µ bt )X̄t + t−1 (b
bt = (1 − α αt Xt − X0 ),
m bt )−1 µ
b t = (1 − α bt ,
Pt
(Xk−1 − m b t )(Xk − m b t)
bt = k=1Pt
α , (9.2)
(X − b
m )2
k=1 k−1 t
t
1X
bt2
σ = {Xk − (1 − α bt Xk−1 }2 .
bt )X̄t − α
t
k=1

For |α|6= 1, µ bt )X̄t + Op (t−1 ) and m


bt = (1 − α b t = X̄t . For α = 1, the parame-
trization µ = (1 − α)m is meaningless and the mean is estimated by µ bt =
Pt
t−1 k=1 (Xk − Xk−1 ). The estimators are consistent and asymptotially
Gaussian, with different normalization sequences for the three domains of
Pp
α (α < 1, α = 1, α > 1). In the AR(p) model Xt = µt + j=1 αj Xt−j + εt ,
similar estimators are defined for the regression parameters αj
Pt
k=j (Xk−j − m b t )(Xk − m b t)
bj,t =
α Pt
k=j (Xk−j − m b t )2
and the variance is estimated by the mean squared estimation error.
In model (9.1) with a nonparametric mean function µt = (1 − α)mt ,
Xt − mt = α(Xt−1 − mt ) + εt , then the estimator (9.2) of α is modified by
b k = X̄k by a local moving average mean or by a local mean
replacing m
Pt
j=0 Kh (j − k)Xj
b k = Pt
m
j=1 Kh (j − k)

for every k, and the estimator of α becomes


Pt
(Xk−1 − m b k )(Xk − m b k)
bt = k=1Pt
α .
(X − b
m )2
k=1 k−1 k

Finally, the function µt is estimated by (1 − α b t or by smoothing Xt −


bt )m
bt Xt−1
α
t
X
bt,h,k = t−1
µ Kh (j − k)(Xj − α
bj Xj−1 )
j=0

and the estimator of σ 2 is still defined by (9.2). The asymptotic distri-


butions are modified as a consequence of the asymptotic behaviour of mb k,
with mean tending to mk and variance converging to a finite limit. As
h tends to zero, the weak convergence to centered Gaussian variables of
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

170 Functional estimation for density, regression models and processes

t1/2 (m
b t − mt ), when |α| < 1, and tα−t (mb t − mt ), when |α| > 1, follows
from martingale properties of the time series which imply its ergodicity and
a mixing property (Appendix D). If |α| 6= 1
Pt
(Xk−1 − mb k−1 )((1 − α)(mk − mb k ) + εk )
bt − α = k=1
α Pt
(X − b
m )2
k=1 k−1 k
it is therefore approximated in the same way as in model AR(1) and it
converges weakly with the same rate as in this model.

When Equation (9.1) is defined by a regular parametrization of the mean


µt = (1 − α)mθ (t) for |α| 6= 1, the minimization of squared estimation error
Pt Pt
ε2(t) k2t = k=1 εb2k = k=1 {Xk −(1− α
kb bt Xk−1 }2 yields estimators of
bt )X̄t − α
the parameters α and θ for identically distributed error variables εk . If the
P
variance of εk is σk2 (θ), maximum likelihood estimators minimize k σk−1 ε2k .
The robustness and the bias of the estimators in false models have been
studied for generalized exponential distributions, the same methods are
used in models for time series.
In a nonparametric regression model
Xt = m(Xt−1 ) + εt (9.3)
with an initial random value X0 and with independent and identically dis-
tributed errors εt with mean zero and variance σ 2 , let F be the continuous
distribution function of the variables εt , and f its density. The nonpara-
metric estimator of the function m is still
Pt
k=1 Kh (x − Xk−1 )Xk
b t,h (x) = P
m t .
k=1 Kh (x − Xk−1 )
It is uniformly consistent under the ergodicity condition
t Z Z
1X
ϕ(Xk , Xk−1 ) → ϕ(x, y)F (dx − m(y)) dπ(y)
t
k=1
with the invariant measure π of the process and for every continuous and
bounded function ϕ on R2 . Conditions on the function m and the inde-
pendence of the error variables εi ensure the ergodicity, then the process
(th)1/2 (m
b t,h − m) converges weakly to a continuous centered Gaussian pro-
cess with covariances zero and variance κ2 f −1 (x)V ar{Xk | Xk−1 }, where
V ar{Xk | Xk−1 } = σ 2 . In model (9.3) with a functional variance, the
results of Section 3.6 apply.
The observation of series in several groups or in distinct time intervals
may introduce a group or time effect similar to population effect in regres-
sion samples and sub-regression functions may necessary as in Section 5.6.
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Applications to time series 171

9.2 Periodic models for time series

Let (Xt )t∈N be a periodic auto-regressive time series defined by X0 and


p
X
Xt = ψ(t) + αp Xt−p + εt (9.4)
i=1

where |α| < 1 and ψ is a periodic function defined in N with period τ ,


ψ(t) = ψ(t + kτ ), for every integers t and k. Let α = (α1 . . . , αp ) and
X(p),t = (Xt−1 , . . . , Xt−p ). As ψ(t) = E(Xt − αT X(p),t ), the value of the
function ψ at t is estimated by an empirical mean over the periods, with
a fixed parameter value α. Assuming that K periods are observed and
T = Kτ values of the series are observed, the function ψ is estimated as
a mean over the K periods of the remainder term of the auto-regressive
process. For every t in {1, . . . , τ }
K−1
1 X
ψbK,α (t) = (Xt+kτ − αT X(p),t+kτ ) (9.5)
K
k=0

and the parameter vector is estimated by minimizing the mean squared


error of the model
T
1 X
lK (α) = {Xt − ψbK,α (t) − αT X(p),t }2 .
T t=1
The components of the first two derivatives of lK are
T
2 X ∂ ψbK,α
l̇T,K,t = − {Xt − ψbK,α (t) − αT X(p),t }{ (t) + X(p),t }
T t=1 ∂α
T K−1
2 X 1 X
= {Xt − ψbK,α (t) − αT X(p),t }{ X(p),t+kτ − X(p),t },
T t=1 K
k=0
T
X K−1
X
2 1
l̈T,K,t = { (X(p),t+kτ − X(p),t )}⊗2 .
T t=1
K
k=0

The vector α is estimated by α bT = arg minα∈]−1,1[d lT,K,t (α). For the first
1/2 ˙
order derivative, T lT,K,t (α0 ) converges weakly to a centered limiting
distribution and the second order derivative l̈T,K,t converges in probability
to a positive definite matrix E l̈T,K,t which does not depend on α. Then
the estimator of α satisfies T 1/2 (b −1
αT,K,t − α0 ) = l̈T,K,t T 1/2 l˙T,K,t (α0 ) + o(1).
The estimator αbT is consistent and its weak convergence rate is T 1/2 , if all
components of the vector α have a norm smaller than 1. The function ψ is
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

172 Functional estimation for density, regression models and processes

then consistently estimated by ψbK = ψbK,b


αT and, for every t in {1, . . . , τ },
the weak convergence rate of the estimator ψbK (t) is K 1/2 .
The true period of the function $\psi$ was assumed to be known. With an unknown period, the estimators $\widehat\psi_K$ and $\widehat\alpha_T$ depend on the parameter $\tau$, which is consistently estimated by $\widehat\tau_T = \arg\min_{\tau\le T} l_{[T/\tau]}(\widehat\alpha_{T,\tau})$.
If the function $\psi$ is parametric, its parameter vector $\theta$ is estimated by minimizing the mean squared error between $\widehat\psi_K$ and $\psi_\theta$, $T^{-1}\sum_{t=1}^{T}\{\widehat\psi_K(t) - \psi_\theta(t)\}^2$. As a minimum distance estimator, $\widehat\theta_T$ is consistent and $T^{1/2}(\widehat\theta_T - \theta)$ converges weakly to a centered Gaussian variable.
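A short numerical sketch of this two-step procedure for p = 1 follows. The code is hypothetical: the period tau is taken as known and the grid search over alpha stands in for the minimization of the mean squared error.

```python
import numpy as np

def psi_hat(X, alpha, tau):
    """Estimator (9.5) for p = 1: mean over the K observed periods
    of the auto-regressive remainder X_t - alpha * X_{t-1}."""
    K = (len(X) - 1) // tau
    Z = (X[1:K * tau + 1] - alpha * X[:K * tau]).reshape(K, tau)
    return Z.mean(axis=0)

def mse(X, alpha, tau):
    """Mean squared error of the periodic AR(1) model at fixed alpha."""
    K = (len(X) - 1) // tau
    fitted = alpha * X[:K * tau] + np.tile(psi_hat(X, alpha, tau), K)
    return np.mean((X[1:K * tau + 1] - fitted) ** 2)

def fit_periodic_ar1(X, tau):
    """Profile the error over a grid of alpha values, then plug in."""
    alphas = np.linspace(-0.95, 0.95, 191)
    alpha_hat = alphas[np.argmin([mse(X, a, tau) for a in alphas])]
    return alpha_hat, psi_hat(X, alpha_hat, tau)
```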

The trigonometric series with independent noise are a combination of periodic sine and cosine functions
$$X_t = \sum_{j=1}^{r} M\{\cos(w_j t + \Phi_j) + \sin(w_j t + \Phi_j)\} + \varepsilon_t = \sum_{j=1}^{r}\{A_j\cos(2\pi w_j t) - B_j\sin(2\pi w_j t)\} + \varepsilon_t,$$
where $(w_j)_{j=1,\dots,r}$ are the frequencies $w_j = jt^{-1}$, and $A_j = M\cos\Phi_j$ and $B_j = M\sin\Phi_j$ are such that $A_j^2 + B_j^2 = M^2$, with the magnitude M of the series and its phases $\Phi_j$, for $j = 1,\dots,r$. The estimators of the parameters are defined from the Fourier series, for $j = 1,\dots,r$,
$$\widehat A_{tj} = 2t^{-1}\sum_{k=1}^{t}X_k\cos(2\pi kj/t), \qquad \widehat B_{tj} = 2t^{-1}\sum_{k=1}^{t}X_k\sin(2\pi kj/t),$$
$$\widehat M_t = r^{-1}\sum_{j=1}^{r}(\widehat A_{tj}^2 + \widehat B_{tj}^2)^{1/2}.$$
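These estimators are plain empirical Fourier coefficients and can be computed in a few vectorized lines; the following sketch and the function name fourier_estimators are our own assumptions, not the book's.

```python
import numpy as np

def fourier_estimators(X, r):
    """Empirical Fourier coefficients A_tj, B_tj at frequencies j/t,
    j = 1..r, and the averaged magnitude estimator M_t."""
    t = len(X)
    k = np.arange(1, t + 1)
    phase = 2 * np.pi * np.outer(np.arange(1, r + 1), k) / t  # shape (r, t)
    A = 2.0 / t * (X * np.cos(phase)).sum(axis=1)
    B = 2.0 / t * (X * np.sin(phase)).sum(axis=1)
    M = np.mean(np.sqrt(A ** 2 + B ** 2))
    return A, B, M
```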

9.3 Nonparametric estimation of the covariance function

The classical estimator of the covariance function in a stationary model is similar to the moving average for the mean, with a lag $k \ge 1$ between the variables $X_i$ and $X_{i-k}$, for every $i \ge 1$,
$$\widehat\rho_{k,t} = (t-k)^{-1}\sum_{i=k+1}^{t}(X_i - \bar X_t)(X_{i-k} - \bar X_t).$$
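In code, this empirical autocovariance is essentially one line (a hypothetical sketch):

```python
import numpy as np

def autocov(X, k):
    """Empirical lag-k covariance rho_{k,t} of a stationary series."""
    Xbar = X.mean()
    # pairs (X_i, X_{i-k}) for i = k+1, ..., t, normalized by t - k
    return ((X[k:] - Xbar) * (X[:-k] - Xbar)).sum() / (len(X) - k)
```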

In the auto-regressive model AR(1) with independent errors with mean zero and variance $\sigma^2$, for $k \ge 1$, the variable $X_k$ is expressed from the initial value as
$$X_k - m = \alpha^k(X_0 - m) + S_{k,\alpha}, \quad\text{where } S_{k,\alpha} = \sum_{j=1}^{k}\alpha^{k-j}\varepsilon_j = \sum_{j=0}^{k-1}\alpha^{j}\varepsilon_{k-j}.$$

Let B be the standard Brownian motion. If $|\alpha| < 1$, the process $S_{[ns],\alpha}$, defined up to the integer part of ns, converges weakly to $\sigma\{(1-\alpha^2)^{1/2}\}^{-1}B$. If $\alpha = 1$, the process $n^{-1/2}S_{[ns],1}$ converges weakly to $\sigma B$, and if $|\alpha| > 1$ the process $\alpha^{-[ns]}S_{[ns],\alpha}$ converges weakly to $\sigma\{(\alpha^2-1)^{1/2}\}^{-1}B$. The independence of the error variables $\varepsilon_j$ implies

$$E(X_k - m)(X_{k+s} - m) = \alpha^{2k+s}\,\mathrm{Var}X_0 + \mathrm{Cov}(S_{k,\alpha}, S_{k+s,\alpha}), \qquad (9.6)$$
$$\mathrm{Cov}(S_{k,\alpha}, S_{k+s,\alpha}) = \alpha^{s}E\Big(\sum_{j=1}^{k}\alpha^{k-j}\varepsilon_j\Big)^2 = \alpha^{s}\sigma^2\sum_{j=1}^{k}\alpha^{2(k-j)},$$
so $E(X_k - m)(X_{k+s} - m) = \alpha^{2k+s}\mathrm{Var}X_0 + \alpha^{s}\mathrm{Var}S_{k,\alpha}$ and the covariance function of the series is not stationary. The estimator (9.2) of the variance $\sigma^2$ is defined as the empirical variance of the estimators of the noise variables, which are identically distributed and independent. In the same way, the covariance is estimated by
$$\widehat\rho_{t,k} = \frac{1}{t-k}\sum_{i=k+1}^{t}\{X_i - \widehat m_t - \widehat\alpha_t(X_{i-1} - \widehat m_t)\}\{X_{i-k} - \widehat m_t - \widehat\alpha_t(X_{i-k-1} - \widehat m_t)\};$$

the estimators $\widehat\sigma_t^2$ and $\widehat\rho_{t,k}$ are consistent (Pons, 2008). The estimators are defined in the same way in an auto-regressive model of order p, with scalar products $\widehat\alpha_t^T X_{i-1}$ and $\widehat\alpha_t^T X_{i-k-1}$ for p-dimensional variables $X_{i-1}$ and $X_{i-k-1}$. In model (9.1), the expansion (9.6) of the variables centered by the mean function is not modified and the covariance $E(X_k - m_k)(X_{k+s} - m_{k+s})$ has the same expression, depending only on the variances of the initial value and of $S_{k,\alpha}$, and on $\alpha$ and the rank of the observations. In auto-regressive series with deterministic models of the mean, the covariance estimator is modified by the corresponding estimator of the mean. In model (9.3), the covariance estimator becomes
$$\widehat\rho_{t,k} = \frac{1}{t-k}\sum_{i=k+1}^{t}\{X_i - \widehat m_{t,h}(X_{i-1})\}\{X_{i-k} - \widehat m_{t,h}(X_{i-k-1})\}$$
and the estimators are consistent.



9.4 Nonparametric transformations for stationarity

In the nonparametric regression model (9.3), $X_t = m(X_{t-1}) + \varepsilon_t$ with an initial random value $X_0$ and with independent and identically distributed errors $\varepsilon_t$ with mean zero and variance $\sigma^2$, the covariance between $X_k$ and $X_{k+l}$ is $\rho_{t,k,l} = E\{X_k m^{*l}(X_k)\} - EX_k\,Em^{*l}(X_k)$, with $E\{X_k m(X_{k+l-1})\} = E\{X_k m^{*l}(X_k)\}$, where $m^{*l}$ is the composition of l functions m. The nonstationarity of $\rho_{t,k,l}$ does not allow its estimation by empirical means and it is necessary to remove the functional mean $\mu_t$ before studying the covariance of the series. The centered series
$$Y_t = X_t - \widehat m_t(X_{t-1}) = m(X_{t-1}) - \widehat m_t(X_{t-1}) + \varepsilon_t$$
has a conditional expectation equal to minus the bias of the estimator $\widehat m_t$,
$$E(Y_t \mid X_{t-1}) = -\frac{h^2}{2}m_{2K}\{(m f_{X_{t-1}})^{(2)} - m f^{(2)}_{X_{t-1}}\}(X_{t-1})\,f^{-1}_{X_{t-1}}(X_{t-1}),$$

and it is negligible as t tends to infinity and h to zero. The time series $Y_t$ is then asymptotically equivalent to a random walk with variance parameter $\sigma^2$. The main transformation of the nonstationary series (9.3) with a constant variance is therefore its centering.
With a varying variance function
$$E\varepsilon_i^2 = \sigma_i^2 = \mathrm{Var}(X_i \mid X_{i-1}),$$
the estimator of the mean function of the series has to be weighted by the inverse of the square root of the nonparametric estimator of the variance at $X_i$, where
$$\widehat\sigma^2_{t,h,\delta}(x) = \frac{\sum_{i=1}^{t}\{Y_i - \widehat m_{t,h}(X_i)\}^2 K_\delta(x - X_i)}{\sum_{i=1}^{t}K_\delta(x - X_i)};$$
as in Section 3.6, the estimator of the regression function is
$$\widehat m_{w,t,h}(x) = \frac{\sum_{i=1}^{t}\widehat w_{t,h,\delta}(X_i)Y_i K_h(x - X_i)}{\sum_{i=1}^{t}\widehat w_{t,h,\delta}(X_i)K_h(x - X_i)}$$
and the stationary series for (9.3) is $Y_i = X_i - \widehat m_{w,t,h}(X_{i-1})$. A model for dependent stationary terms $\varepsilon_t$ can then be detailed.
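A sketch of this two-bandwidth weighting scheme, with $\widehat w = \widehat\sigma^{-1}$ as above, follows. This is hypothetical code assuming a Gaussian kernel; the pilot fit, variable names and structure are our choices.

```python
import numpy as np

def gauss(u, h):
    # unnormalized Gaussian kernel; the constants cancel in all ratios below
    return np.exp(-0.5 * (u / h) ** 2)

def weighted_regression(X, Y, grid, h, delta):
    """Pilot kernel fit, local residual variance at bandwidth delta,
    then a kernel regression weighted by the inverse standard deviation."""
    D = X[:, None] - X[None, :]
    K0 = gauss(D, h)
    m0 = (K0 @ Y) / K0.sum(axis=1)              # pilot regression at the X_i
    Kd = gauss(D, delta)
    s2 = (Kd @ (Y - m0) ** 2) / Kd.sum(axis=1)  # local residual variance
    w = 1.0 / np.sqrt(s2)                       # estimated weights at the X_i
    Kg = gauss(grid[:, None] - X[None, :], h)
    return (Kg @ (w * Y)) / (Kg @ w)
```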

9.5 Change-points in time series

A change-point in a time series may occur at an unknown time $\tau$ or at an unknown threshold $\eta$ of the series. In both cases, $X_t$ splits into two processes at the unknown threshold,
$$X_{1,t} = X_t I_t \quad\text{and}\quad X_{2,t} = X_t(1 - I_t),$$
January 31, 2011 17:17 World Scientific Book - 9in x 6in FunctionalEstimation

Applications to time series 175

with $I_t = 1\{X_t \le \eta\}$ for a model with a change-point at a threshold of the series, and $I_t = 1\{t \le \tau\}$ in a model with a time threshold. The p-dimensional parameter vector $\alpha$ is replaced by two vectors $\alpha$ and $\beta$. Both change-point models are written equivalently, with a time change-point or a change-point in the series,
$$\tau_\eta = \sup\{t;\ X_t \le \eta\}, \qquad \eta_\tau = \sup\{x;\ (X_s)_{s\in[0,\tau]} \le x\}. \qquad (9.7)$$
With a change-point, the auto-regressive model AR(p) is modified as
$$X_t = \mu_1 I_t + \mu_2(1 - I_t) + \alpha^T X_{1,t} + \beta^T X_{2,t} + \varepsilon_t \qquad (9.8)$$
or $X_t = \mu + \alpha^T X_{1,t} + \beta^T X_{2,t} + \varepsilon_t$, with $X_{1,t} = X_t I_t$ and $X_{2,t} = X_t(1 - I_t)$, for a model without change-point in the mean. Considering first that the change-point is known, the parameters are $\mu$, or $\mu_1$ and $\mu_2$, $\alpha$, $\beta$ and $\sigma^2$. As t tends to infinity, a change-point at an integer time $\tau$ is denoted $[\gamma t]$ and the sums of variables up to $\tau$ increase with t. For the auto-regressive process of order 1 with a change-point in time, this equation yields a two-phase sample path
$$X_{t,\alpha} = m_\alpha + \alpha^{t}(X_0 - m_\alpha) + \sum_{k=1}^{t}\alpha^{t-k}\varepsilon_k, \qquad t \le \tau,$$
$$X_{t,\beta} = m_\beta + \beta^{t-\tau}(X_{\tau,\alpha} - m_\alpha) + \sum_{k=1}^{t-\tau}\beta^{t-\tau-k}\varepsilon_{k+\tau}, \qquad t > \tau,$$
with $m_\alpha = \mu(1-\alpha)^{-1}$ and $m_\beta = \mu(1-\beta)^{-1}$. With $\alpha = 1$, $X_{t,\alpha} = X_0 + (t-1)\mu + \sum_{k=1}^{t}\varepsilon_k$ and, with $\beta = 1$ and $t > \tau$, $X_{t,\beta} = X_{\tau,\alpha} + (t-\tau-1)\mu + \sum_{k=1}^{t-\tau}\varepsilon_{k+\tau}$. Let $\theta$ be the vector of parameters $\alpha$, $\beta$, $m_\alpha$, $m_\beta$, $\gamma$. The time $\tau$ corresponds either to a change-point of the series or to a stopping time defined by (9.7) for a change-point at a threshold of the process X, and the indicator $I_k$ relative to an unknown threshold $\tau_\eta$ of $X_{t-k}$ is denoted $I_{k,\tau}$.
$$\widehat\alpha_{t,\tau} = \frac{\sum_{k=1}^{t}(I_{k-1,\tau}X_{k-1} - \widehat m_{\alpha,\tau})(I_{k,\tau}X_k - \widehat m_{\alpha,\tau})}{\sum_{k=1}^{\tau}(I_{k-1,\tau}X_{k-1} - \widehat m_{\alpha,\tau})^2},$$
$$\widehat\beta_{t,\tau} = \frac{\sum_{k=1}^{t}((1 - I_{k-1,\tau})X_{k-1} - \widehat m_{\beta,t})((1 - I_{k,\tau})X_k - \widehat m_{\beta,t})}{\sum_{k=1}^{t}((1 - I_{k-1,\tau})X_{k-1} - \widehat m_{\beta,t})^2},$$
where the estimators of $m_\alpha = (1-\alpha)^{-1}\mu$ and $m_\beta = (1-\beta)^{-1}\mu$ are equivalent to
$$\widehat m_{\alpha,\tau} = \bar X_\tau, \qquad \widehat m_{\beta,\tau+k} = \bar X_{\tau,k} = k^{-1}\sum_{j=1}^{k}X_{\tau+j}, \quad\text{for } t = \tau + k \ge \tau,$$
and $\widehat\mu_t = \bar X_\tau - \widehat\alpha_{t,\tau}\bar X_\tau - \widehat\beta_{t,\tau}\bar X_{\tau,t}$ if $|\alpha|$ and $|\beta| \ne 1$.

The estimator of the change-point parameter minimizes the mean squared error of estimation with respect to $\tau$. For $t > \tau$, consider the estimation errors $\widehat\varepsilon_{\tau,k} = X_k - \widehat m_{\widehat\alpha_{t,\tau},\tau} - \widehat\alpha_{t,\tau}X_{k-1}$ if $k \le \tau$ and $\widehat\varepsilon_{t,k} = X_k - \widehat m_{\widehat\beta_{t,\tau},t} - \widehat\beta_{t,\tau}X_{k-1}$ if $k > \tau$. The variance $\sigma^2$ and the change-point parameter are estimated by
$$\widehat\sigma_t^2(\tau) = \tau^{-1}\sum_{k=1}^{\tau}\widehat\varepsilon^2_{\tau,k} + (t-\tau)^{-1}\sum_{k=\tau+1}^{t}\widehat\varepsilon^2_{t,k},$$
$$\widehat\gamma_t = \arg\min_{\tau\in[0,t]}\widehat\sigma_t^2(\tau).$$
The change-point estimator is approximated by
$$\widehat\gamma_t = \arg\min_{\tau\in[0,t]} t^{1/2}\Big\{\frac{1}{\tau}\sum_{k=\tau_0+1}^{\tau}(X_k - \mu_\alpha - \alpha X_{k-1})^2 - \frac{1}{t-\tau}\sum_{k=\tau_0+1}^{\tau}(X_k - \mu_\beta - \beta X_{k-1})^2 - \gamma_0\Big\} + o_p(1),$$

$\widehat\gamma_t - \gamma_0$ is independent of the estimators of the parameter vector $\widehat\xi_t$ of the regression, and all estimators converge weakly to limits bounded in probability.
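The least squares change-point search can be sketched as follows. This is hypothetical code for the AR(1) model with a change-point in time; the minimal segment length min_seg is our safeguard against degenerate fits.

```python
import numpy as np

def ar1_sse(X):
    """Residual sum of squares of the least squares fit of
    X_t = mu + alpha * X_{t-1} + eps_t on a segment."""
    Z = np.column_stack([np.ones(len(X) - 1), X[:-1]])
    coef = np.linalg.lstsq(Z, X[1:], rcond=None)[0]
    return ((X[1:] - Z @ coef) ** 2).sum()

def change_point(X, min_seg=20):
    """Change-point minimizing the pooled mean squared error over tau."""
    t = len(X)
    taus = np.arange(min_seg, t - min_seg)
    err = [ar1_sse(X[:tau]) / tau + ar1_sse(X[tau:]) / (t - tau) for tau in taus]
    return taus[int(np.argmin(err))]
```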
Consider the model (9.8) of order 1 with a change-point at a threshold $\eta$ of the series, with the equivalence (9.7) between the chronological change-point model and the model for a series crossing the threshold $\eta$ at consecutive random stopping times $\tau_1 = \inf\{k \ge 0 : I_k = 0\}$ and $\tau_j = \inf\{k > \tau_{j-1} : I_k = 0\}$, $j \ge 1$. The series have a similar asymptotic behaviour starting from the first value of the series which goes across the threshold $\eta$ at time $s_j = \inf\{k > \tau_{j-1} : I_k = 1\}$ after $\tau_{j-1}$. The estimators of the parameters in the first phase of the model are restricted to the set of random intervals $[s_j, \tau_j]$ where $X_t$ stays below $\eta$; for the second phase, the observations are restricted to the set of random intervals $]\tau_{j-1}, s_j[$ where X remains above $\eta$. The times $\tau_j$ are stopping times of the series defined for $t > s_{j-1}$ by
$$X_t = \begin{cases} m_\alpha + S_{s_{j-1},t-s_{j-1},\alpha} + o_p(1), & \text{if } |\alpha| < 1,\\ X_{s_{j-1}} + (t - s_{j-1} - 1)\mu + S_{s_{j-1},t-s_{j-1},1}, & \text{if } \alpha = 1,\\ m_\alpha + \alpha^{t-s_{j-1}}(X_{s_{j-1}-1} - m_\beta) + S_{s_{j-1},t-s_{j-1},\alpha}, & \text{if } |\alpha| > 1, \end{cases}$$
and the $s_j$ are stopping times defined for $t > \tau_{j-1}$ by
$$X_t = \begin{cases} m_\beta + S_{\tau_{j-1},t-\tau_{j-1},\beta} + o_p(1), & \text{if } |\beta| < 1,\\ X_{\tau_{j-1}} + (t - \tau_{j-1} - 1)\mu + S_{\tau_{j-1},t-\tau_{j-1},1}, & \text{if } \beta = 1,\\ m_\beta + \beta^{t-\tau_{j-1}}(X_{\tau_{j-1}} - m_\alpha) + S_{\tau_{j-1},t-\tau_{j-1},\beta}, & \text{if } |\beta| > 1. \end{cases}$$

The sequences $t^{-1}\tau_j$ and $t^{-1}s_j$ converge to the corresponding stopping times of the limit of $X_t$ as t tends to infinity. The partial sums are therefore defined as sums over indices belonging to countable unions of intervals $[s_j, \tau_j]$ and $]\tau_j, s_{j+1}[$, respectively, for the two phases of the model. Their limits are deduced from integrals on the corresponding sub-intervals, instead of sums of the errors on the interval $(\tau, \tau_0)$. The estimators of the parameters are still expressed through their partial sums. The results generalize to processes of order p with a possible change-point in each of the p components. The estimators and their weak convergences are detailed in Pons (2009). Change-points in nonparametric models for time series are estimated by replacing the estimators of the parameters by estimators of the functions of the models, and only the expression of the errors $\varepsilon_k$ determines their estimator.

With a change-point at an unknown time $\tau_0$ in the nonparametric model (9.3), it is written $X_t = I_{\tau,t}m_1(X_{t-1}) + (1 - I_{\tau,t})m_2(X_{t-1}) + \sigma\varepsilon_t$. For every x of $I_X$, the two regression functions are estimated using a kernel estimator with the same bandwidth h for $m_1$ and $m_2$,
$$\widehat m_{1,t,h}(x,\tau) = \frac{\sum_{i=1}^{t}K_h(x - X_i)I_{\tau,i}Y_i}{\sum_{i=1}^{t}K_h(x - X_i)I_{\tau,i}}, \qquad \widehat m_{2,t,h}(x,\tau) = \frac{\sum_{i=1}^{t}K_h(x - X_i)(1 - I_{\tau,i})Y_i}{\sum_{i=1}^{t}K_h(x - X_i)(1 - I_{\tau,i})}.$$

The behaviour of the estimators $\widehat m_{1,t,h}$ and $\widehat m_{2,t,h}$ is the same as in the model where $\tau_0$ is known, and it is the behaviour described in Section 9.1. The variance $\sigma^2$ is estimated by
$$\widehat\sigma^2_{\tau,t,h} = t^{-1}\sum_{i=1}^{t}\{Y_i - I_{\tau,i}\widehat m_{1,t,h}(X_i,\tau) - (1 - I_{\tau,i})\widehat m_{2,t,h}(X_i,\tau)\}^2$$

at the estimated $\tau$. The change-point parameter $\tau$ is estimated by minimization of the error of the model with a change-point at $\tau$,
$$\widehat\tau_{t,h} = \arg\min_{\tau\le t}\widehat\sigma^2_{\tau,t,h},$$
and the functions $m_1$ and $m_2$ by $\widehat m_{k,t,h}(x) = \widehat m_{k,t,h}(x, \widehat\tau_{t,h})$, for k = 1, 2.
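A nonparametric version of the same least squares search, with a kernel fit on each side of the candidate change-point, can be sketched as follows (hypothetical code with a Gaussian kernel; min_seg is our safeguard against degenerate segments):

```python
import numpy as np

def seg_error(X, tau, h):
    """Pooled squared error with kernel regressions fitted separately
    on {1,...,tau} and {tau+1,...,t}."""
    total = 0.0
    for lo, hi in [(1, tau), (tau + 1, len(X))]:
        lagged, resp = X[lo - 1:hi - 1], X[lo:hi]
        K = np.exp(-0.5 * ((lagged[:, None] - lagged[None, :]) / h) ** 2)
        m_hat = (K @ resp) / K.sum(axis=1)  # kernel fit at each lagged point
        total += ((resp - m_hat) ** 2).mean()
    return total

def np_change_point(X, h, min_seg=50):
    taus = range(min_seg, len(X) - min_seg)
    return min(taus, key=lambda tau: seg_error(X, tau, h))
```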
Let $\gamma = [T^{-1}\tau]$ with the corresponding change-point time $\tau_\gamma = T\gamma$, let $m = (m_1, m_2)$ with true functions $m_0$, and let
$$\sigma_t^2(m,\gamma) = t^{-1}\sum_{i=1}^{t}\{Y_i - I_{\tau_\gamma,i}m_1(X_i) - (1 - I_{\tau_\gamma,i})m_2(X_i)\}^2$$

be the mean squared error for the parameters $(m, \tau)$. The difference of the error from its minimum is
$$l_t(m,\tau) = \sigma_t^2(m,\tau) - \sigma_t^2(m_0,\tau_0)$$
$$= t^{-1}\sum_{i=1}^{t}\big[\{Y_i - I_{\tau,i}m_1(X_i) - (1 - I_{\tau,i})m_2(X_i)\}^2 - \{Y_i - I_{\tau_0,i}m_{10}(X_i) - (1 - I_{\tau_0,i})m_{20}(X_i)\}^2\big]$$
$$= t^{-1}\sum_{i=1}^{t}\big[\{I_{\tau,i}m_1(X_i) - I_{\tau_0,i}m_{10}(X_i)\}^2 + \{(1 - I_{\tau,i})m_2(X_i) - (1 - I_{\tau_0,i})m_{20}(X_i)\}^2\big] \qquad (9.9)$$
$$= t^{-1}\sum_{i=1}^{t}\big[\{(m_1 - m_{10})(X_i)I_{\tau_0,i} - (m_2 - m_{20})(X_i)(1 - I_{\tau_0,i})\}^2 + \{(I_{\tau,i} - I_{\tau_0,i})(m_1 - m_2)(X_i)\}^2\big]\{1 + o(1)\}.$$

It converges a.s. to $l(m,\tau) = E_\alpha(m_1 - m_{10})^2(X) + E_\beta(m_2 - m_{20})^2(X) + |\tau - \tau_0|E(m_1 - m_2)^2(X)$, which is minimal at $(m_0, \tau_0)$, and the estimator $\widehat\tau_{t,h}$ minimizes $l_t(\widehat m_{t,h}, \tau)$. The a.s. consistency of the regression estimators $\widehat m_{t,h} = (\widehat m_{1,t,h}, \widehat m_{2,t,h})$ and of $l_t(m,\tau)$ implies that $\widehat\gamma_{t,h} = [t^{-1}\widehat\tau_{t,h}]$ is an a.s. consistent estimator of $\gamma_0$ in ]0, 1[. It follows that the estimator
$$\widehat m_{t,h}(x) = \widehat m_{1,t,h}(x)I_{\widehat\tau_{t,h}} + \widehat m_{2,t,h}(x)(1 - I_{\widehat\tau_{t,h}})$$
of the regression function $m_0(x) = m_{10}(x)I_{\tau_0} + m_{20}(x)(1 - I_{\tau_0})$ is a.s. uniformly consistent and the process $(th)^{1/2}(\widehat m_{t,h} - m_0)$ converges weakly under $P_{m_0}$ to a Gaussian process $G_m$ on $I_X$, with mean and covariances zero and with variance function $V_m(x) = \kappa_2\mathrm{Var}(Y \mid X = x)$.
For the weak convergence of the change-point estimator, let $\|\varphi\|_X$ be the $L^2(F_X)$-norm of a function $\varphi$ on $I_X$, let $\rho(\theta, \theta') = (|\gamma - \gamma'| + \|m - m'\|_X^2)^{1/2}$ be the distance between $\theta = (m^T, \gamma)^T$ and $\theta' = (m'^T, \gamma')^T$, and let $V_\varepsilon(\theta_0)$ be a neighbourhood of $\theta_0$ with radius $\varepsilon$ for the metric $\rho$. The quadratic function $l_t(m,\tau)$ defined by (9.9) converges to its expectation and
$$l(\widehat m_{t,h}, \widehat\tau_{t,h}) = O(\|\widehat m_{t,h} - m_0\|_X^2 + |\widehat\tau_{t,h} - \tau_0|).$$

The process is bounded in the same way,
$$l_t(m,\tau) = \Big[t^{-1}\sum_{i=1}^{t}\{(m_1 - m_{10})^2(X_i)I_{\tau_0,i} + (1 - I_{\tau_0,i})(m_2 - m_{20})^2(X_i)\} + t^{-1}\sum_{i=1}^{t}(I_{\tau,i} - I_{\tau_0,i})^2(m_1 - m_2)^2(X_i)\Big]\{1 + o(1)\};$$
it is denoted $l_t = (l_{1t} + l_{2t})\{1 + o(1)\}$. The process $W_t(m,\gamma) = t^{1/2}(l_t - l)(m, \tau_\gamma)$ is an $O_p(1)$. The estimator $\widehat m_{t,h}$ is a local maximum likelihood estimator of the nonparametric regression functions and the estimator of the change-point is a maximum likelihood estimator. The variable $l_{1t}(\widehat m_{t,h})$ converges to $l_1(m_0) = 0$ and $l_{2t}(\widehat m_{t,h}, \widehat\tau_{\gamma t})$ converges to zero at the same rate if the convergence rate of $\widehat\gamma_t$ is the same as that of $\widehat m_{t,h}$. We obtain the next bounds.

Lemma 9.1. For every $\varepsilon > 0$, there exists a constant $\kappa_0$ such that $E\sup_{(m,\gamma)\in V_\varepsilon(\theta_0)} l_t(m,\tau_\gamma) \le \kappa_0\varepsilon^2$ and $0 \le l(m,\tau_\gamma) \le \kappa_0\rho^2(\theta,\theta_0)$, for every $\theta$ in $V_\varepsilon(\theta_0)$.

Lemma 9.2. For every $\varepsilon > 0$, there exists a constant $\kappa_1$ such that $E\sup_{(m,\gamma)\in V_\varepsilon(\theta_0)} W_t(m,\gamma) \le \kappa_1\rho(\theta,\theta_0)$.

The lemmas imply that for every $\varepsilon > 0$
$$\limsup_{t\to\infty,\,A\to\infty} P_0(th_t|\widehat\gamma_{t,h_t} - \gamma_0| > A) = 0.$$

The proof is similar to Ibragimov and Has'minskii's (1981) proof for a change-point of a density. It implies that $l_t(\widehat\theta_{t,h}) = (l_{1t} + l_{2t})(\widehat\theta_{t,h}) + o_p(1)$ uniformly. For the weak convergence of $th(\widehat\gamma_{t,h} - \gamma_0)$, let
$$U_t = \{u = (u_m^T, u_\gamma)^T : u_m = (th)^{-1/2}(m - m_0),\ u_\gamma = (th)^{-1}(\gamma - \gamma_0)\}$$
be a bounded set. For every A > 0, let $U_{t,h}^A = \{u \in U_t;\ \|u\|_2 \le A\}$. Then, for every $u = (u_m, u_\gamma)$ belonging to $U_{t,h}^A$, $\theta_{t,u} = (m_{t,u}, \gamma_{t,u})$ with $m_{t,u} = m_0 + (th)^{-1/2}u_m$ and $\gamma_{t,u} = \gamma_0 + (th)^{-1}u_\gamma$. The process $W_t$ defines a map $u \mapsto W_t(\theta_{t,u})$.

Theorem 9.1. For every A > 0, the process $W_t(\theta)$ has the expansion
$$W_t(\theta) = W_{1t}(m) + W_{2t}(\gamma) + o_p(1),$$
where the $o_p$ is uniform on $U_t^A$, as t tends to infinity. Then the change-point estimator of $\gamma_0$ is asymptotically independent of the estimators of the regression functions $\widehat m_{1,t,h}$ and $\widehat m_{2,t,h}$.

Proof. For an ergodic process, the continuous part $l_{1t}$ of $l_t$ converges to
$$l_1(m) = \gamma_0\|m_1 - m_{10}\|^2_{F_{1X}} + (1 - \gamma_0)\|m_2 - m_{20}\|^2_{F_{2X}}$$
and the continuous part of $W_t$ is approximated by $W_{1t}(m) = t^{1/2}(l_{1t} - l_1)(m)$. On $U_{t,h}^A$, it is written
$$W_{1t}(m) = \gamma_0\int(m_1 - m_{10})^2\,d\nu_{1t} + (1 - \gamma_0)\int(m_2 - m_{20})^2\,d\nu_{2t},$$
where $\nu_{kt} = t^{1/2}(\widehat F_{kt} - F_{k0})$, k = 1, 2, are the empirical processes of the series in the two phases, with the ergodic distributions $F_{k0}$ of the process.
The discrete part of $W_t$ is approximated by $W_{2t}(\gamma) = t^{1/2}(l_{2t} - l_2)(\gamma)$, where $l_{2t} = t^{-1}\sum_{i=1}^{t}(I_{\tau,i} - I_{\tau_0,i})^2(m_{10} - m_{20})^2(X_i) + o_p(|\tau - \tau_0|)$, and the sum is developed with the notation $a_i = (m_{10} - m_{20})^2(X_i)$,
$$t^{-1}\sum_{i=1}^{t}(I_{\tau,i} - I_{\tau_0,i})^2 a_i = \frac{1}{t}\Big\{\sum_{i=1}^{\tau_0}(1 - I_{\tau,i})a_i + \sum_{i=1+\tau_0}^{t}I_{\tau,i}a_i\Big\}$$
$$= \frac{1}{t}\Big\{1_{\{\tau_{th}<\tau_0\}}\sum_{i=1+\tau_{th}}^{\tau_0}a_i + 1_{\{\tau_0<\tau_{th}\}}\sum_{i=1+\tau_0}^{\tau_{th}}a_i\Big\}$$
$$= \frac{1}{t}\Big\{1_{\{\tau_{th}<\tau_0\}}\sum_{i=1+\tau_0-[h^{-1}u_\gamma]}^{\tau_0}a_i + 1_{\{\tau_0<\tau_{th}\}}\sum_{i=1+\tau_0}^{\tau_0+[h^{-1}u_\gamma]}a_i\Big\}.$$

Then $l_{2t}$ converges uniformly to
$$l_2(\gamma) = |\gamma - \gamma_0|\Big\{\int_X(m_{10} - m_{20})^2\,dF_{2X} - \int_X(m_{10} - m_{20})^2\,dF_{1X}\Big\}$$
as $h_t$ tends to zero. Let $\nu_{\tau,kt}$, k = 1, 2, be the empirical processes restricted to the variables between $\tau_0$ and $\tau_0 + [h^{-1}u_\gamma]$, according to the sign of $u_\gamma$, and normalized by $|\tau - \tau_0|^{1/2}$. Then the process $W_{2t}$ is approximated by
$$W_{2t}(\gamma) = |\gamma - \gamma_0|^{-1/2}\Big\{\int_X(m_{10} - m_{20})^2\,d\nu_{2t} - \int_X(m_{10} - m_{20})^2\,d\nu_{\tau,1t}\Big\}. \qquad\square$$
The limit of the process $W_{1t}$ is a Gaussian distribution $G_m$. The estimator of the change-point satisfies
$$\widehat u_{\gamma t} = th(\widehat\gamma_t - \gamma_0) = \arg\min_{u_\gamma\in\mathbb{R}} W_{2t}(u_\gamma) + o_p(1).$$
Its convergence rate is th and the asymptotic behaviour of the estimator $\widehat u_{\gamma t}$ is deduced from Theorem 9.1 and the continuity of the minimum.

Theorem 9.2. The change-point estimator $th(\widehat\gamma_{t,h} - \gamma_0)$ converges weakly to $\arg\min_{u\in\mathbb{R}} Q(u)$, where Q denotes the limiting process of $W_{2t}$, and it is an $O_p(1)$.

9.6 Exercises

(1) Let Xt be an AR(1) process Xt = αXt−1 + εt , with an initial random


variable X0 with mean µ0 and variance σ02 . Write it as a moving average
and calculate the mean and the variance of the MA process.
(2) Consider the moving average process Xt = εt − θ1 εt−1 − θ2 εt−2 with
independent and identically distributed errors with mean zero and vari-
ance σ 2 . Calculate the variance of Xt and the covariances between Xt
and Xt−k and their empirical estimators. Extend to a model MA(q) of
order q.
(3) Let Xt = pk,t + θεt−1 + σεt , where pk,t is a polynomial of degree k.
Define moving average estimators of pk,t .
(4) Let Xt be an ARMA(2,2) process Xt = µt +αXt−1 +βXt−2 +εt −θεt−1 ,
t in N, with a constant or a varying mean and with noise variables with
first moments (0, σ 2 ). Define estimators for the parameters α, β and
σ 2 and describe the series as infinite combinations of their past values.
(5) Invert the AR(1) model for Xt as a moving average model depending on X0 and the noise variables, and identify the parameters of both models in order to estimate them as in Section 9.1.
(6) Define maximum likelihood estimators for the parameters α, β and
σ 2 of the model Xt = µt (θ) + αXt−1 + βXt−2 + σεt , t in N, with a
parametric varying mean, independent identically distributed normal
errors εi . For a polynomial trend µt (θ), use differences of the series
before the estimation of α and σ 2 .

Chapter 10

Appendix

10.1 Appendix A

The moments of the derivatives of the kernel estimator of the regression function are presented in Chapter 3; here the proofs are detailed. The variance of $\widehat m^{(1)}_{n,h}(x) = \widehat f^{-1}_{n,h}(x)\{\widehat\mu^{(1)}_{n,h}(x) - \widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\}$ is obtained by an approximation similar to (3.2) in Proposition 3.1,
$$\mathrm{Var}\,\widehat m^{(1)}_{n,h}(x) = f_X^{-2}\big[\mathrm{Var}\{\widehat\mu^{(1)}_{n,h}(x) - \widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\} + \{\mu^{(1)}_{n,h}(x) - m_{n,h}(x)f^{(1)}_{n,h}(x)\}^2\,\mathrm{Var}\,\widehat f_{n,h}(x)$$
$$\quad - 2\{\mu^{(1)}_{n,h}(x) - m_{n,h}(x)f^{(1)}_{n,h}(x)\}\,\mathrm{Cov}\{\widehat f_{n,h}(x),\ \widehat\mu^{(1)}_{n,h}(x) - \widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\}\big]\{1 + o(1)\},$$
where the variances $\mathrm{Var}\,\widehat\mu^{(1)}_{n,h}(x)$ and $\mathrm{Var}\,\widehat f^{(1)}_{n,h}(x)$ are $O((nh^3)^{-1})$, $\mathrm{Var}\,\widehat f_{n,h}(x) = O((nh)^{-1})$, $E\{\widehat f^{(1)}_{n,h}(x) - f^{(1)}_{n,h}(x)\}^4 = O((nh^3)^{-1})$ and $E\{\widehat m_{n,h}(x) - m_{n,h}(x)\}^4 = O((nh)^{-1})$,
$$\mathrm{Var}\{\widehat\mu^{(1)}_{n,h}(x) - \widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\} = \mathrm{Var}\,\widehat\mu^{(1)}_{n,h}(x) + \mathrm{Var}\{\widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\} - 2\,\mathrm{Cov}\{\widehat\mu^{(1)}_{n,h}(x), \widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\},$$
$$\mathrm{Var}\{\widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\} \le \big[E\{\widehat m_{n,h}(x)\}^4 E\{\widehat f^{(1)}_{n,h}(x)\}^4\big]^{1/2} - E^2\{\widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\} = O((nh^2)^{-1}).$$
Therefore $\mathrm{Var}\{\widehat\mu^{(1)}_{n,h}(x) - \widehat m_{n,h}(x)\widehat f^{(1)}_{n,h}(x)\}$ and $\mathrm{Var}\,\widehat m^{(1)}_{n,h}(x)$ are $O((nh^3)^{-1})$.

Proposition 10.1.
$$\mathrm{Var}\,\widehat m^{(1)}_{n,h}(x) = f_X^{-2}\,\mathrm{Var}\,\widehat\mu^{(1)}_{n,h}(x) + o((nh^3)^{-1}) = (nh^3)^{-1}\Big\{f_X^{-2}(x)w_2(x)\int K^{(1)2} + o(h)\Big\}.$$
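The $(nh^3)^{-1}$ order can be checked by simulation on the simpler density derivative estimator, whose variance has the same order. This is a hypothetical Monte Carlo sketch of our own; $nh^3\,\mathrm{Var}$ should stabilize as h decreases.

```python
import numpy as np

def f_prime_hat(x, sample, h):
    """Kernel estimator of f'(x) with the derivative of a Gaussian kernel:
    f'_{n,h}(x) = (n h^2)^{-1} sum K'((x - X_i)/h), with K'(u) = -u phi(u)."""
    u = (x - sample) / h
    return -(u * np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).mean() / h ** 2

rng = np.random.default_rng(1)
n, reps = 5000, 300
for h in (0.4, 0.2, 0.1):
    est = [f_prime_hat(0.5, rng.standard_normal(n), h) for _ in range(reps)]
    print(h, n * h ** 3 * np.var(est))  # approximately constant in h
```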


10.2 Appendix B

In Chapter 4, the bandwidth is a real function defined on $I_X$ and the normalized kernel is $\varphi_n(x) = K_{h_n(x)}(x)$. Its derivative with respect to x is
$$\varphi^{(1)}_n(x) = \frac{1}{h_n(x)}\frac{d}{dx}K\Big(\frac{x}{h_n(x)}\Big) - \frac{h^{(1)}_n(x)}{h_n^2(x)}K\Big(\frac{x}{h_n(x)}\Big) = \frac{1}{h_n^2(x)}K^{(1)}\Big(\frac{x}{h_n(x)}\Big)\Big\{1 - x\frac{h^{(1)}_n(x)}{h_n(x)}\Big\} - \frac{h^{(1)}_n(x)}{h_n(x)}K_{h_n(x)}(x).$$
As $\|h_n\|$ tends to zero, a Taylor expansion of the density in a neighborhood of x and Lemma 2.1 yield
$$\int\varphi^{(1)}_n(x - u)f(u)\,du = f^{(1)}(x) - h_n(x)m_{2K}\Big\{\frac{h_n(x)}{2}f^{(3)}(x) + 2h^{(1)}_n(x)f^{(2)}(x) + o(\|h_n\| + \|h^{(1)}_n\|)\Big\}.$$

The expectation of the quadratic variations $|\widehat f_{n,h}(x) - \widehat f_{n,h}(y)|^2$ develops as the sum
$$n^{-1}\int\{K_{h_n(x)}(x - u) - K_{h_n(y)}(y - u)\}^2 f(u)\,du + (1 - n^{-1})\{f_{n,h_n}(x) - f_{n,h_n}(y)\}^2.$$


For an approximation of the first term, the Mean Value Theorem implies $K_{h_n(x)}(x - u) - K_{h_n(y)}(y - u) = (x - y)\varphi^{(1)}_n(z - u)$, where z is between x and y; then $\int\{K_{h_n(x)}(x - u) - K_{h_n(y)}(y - u)\}^2 f(u)\,du$ is approximated by
$$(x - y)^2\int\varphi^{(1)2}_n(z - u)f(u)\,du = (x - y)^2 h_n^{-3}(x)\Big\{f(x)\int K^{(1)2} + o(\|h_n\|)\Big\}.$$
Since $h_n^{-1}(x)|x|$ and $h_n^{-1}(y)|y|$ are bounded by 1, the order of $E|\widehat f_{n,h}(x) - \widehat f_{n,h}(y)|^2$ is $O((x - y)^2 n^{-1}\|h_n\|^{-3})$ if $|xh_n(y) - yh_n(x)| \le 2h_n(y)h_n(x)$, and it is a sum of variances otherwise.

10.3 Appendix C

In the single-index model studied in Chapter 7, the precise order of the mean and variance of $\widehat V^{(1)}_{n,h}$, defined in Section 7.2, requires expansions. The empirical mean squared error of the estimated function g, at fixed $\theta$, has the derivative
$$\widehat V^{(1)}_{n,h}(\theta) = n^{-1}\sum_{i=1}^{n}\{Y_i - \widehat g_{n,h}(\theta^T X_i;\theta)\}\,\widehat g^{(1)}_{n,h}(\theta^T X_i;\theta)\,X_i.$$
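For reference, this derivative can be evaluated directly, computing $\widehat g$ and $\widehat g^{(1)}$ by the quotient rule from a Gaussian kernel fit. This sketch and its function name V_dot are hypothetical, not the book's notation.

```python
import numpy as np

def V_dot(theta, X, Y, h):
    """Empirical derivative V^(1)_{n,h}(theta) of the mean squared error
    of a single-index fit with index Z = X theta."""
    Z = X @ theta
    D = (Z[:, None] - Z[None, :]) / h
    K = np.exp(-0.5 * D ** 2)                   # Gaussian kernel weights
    S = K.sum(axis=1)
    g = (K @ Y) / S                             # ghat(Z_i)
    Kp = -D * K / h                             # derivative of the weights in z
    gp = ((Kp @ Y) - g * Kp.sum(axis=1)) / S    # ghat'(Z_i) by the quotient rule
    return X.T @ ((Y - g) * gp) / len(Y)
```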

Let $Z = \theta^T X$ at fixed $\theta$. The mean of $\widehat V^{(1)}_{n,h}$ is
$$V^{(1)}_{n,h}(\theta) = E\,E[\{g(Z_i) - \widehat g_{n,h}(Z_i;\theta)\}\,\widehat g^{(1)}_{n,h}(Z_i;\theta)\,X_i \mid X_i] = E\big[\{(g - g_{n,h})g^{(1)}_{n,h}(Z_i;\theta) - \mathrm{Cov}(\widehat g_{n,h}, \widehat g^{(1)}_{n,h})(Z_i;\theta)\}X_i\big].$$
Its variance is $n^{-1}\mathrm{Var}[\{Y_i - \widehat g_{n,h}(Z_i;\theta)\}\,\widehat g^{(1)}_{n,h}(Z_i;\theta)\,X_i]$ and
$$\mathrm{Var}[\{Y_i - \widehat g_{n,h}(Z_i;\theta)\}\,\widehat g^{(1)}_{n,h}(Z_i;\theta)] = O(\mathrm{Var}\{Y_i\,\widehat g^{(1)}_{n,h}(Z_i;\theta)\}) + O(\mathrm{Var}\{\widehat g_{n,h}(Z_i;\theta)\,\widehat g^{(1)}_{n,h}(Z_i;\theta)\}),$$
$$\mathrm{Var}\{Y_i\,\widehat g^{(1)}_{n,h}(Z_i;\theta)\} = O(\mathrm{Var}\{\widehat g^{(1)}_{n,h}(Z_i;\theta)\}) = O((nh^3)^{-1}).$$

The expansions (3.2) for $\widehat g_{n,h}$ and (3.15) for $\widehat g^{(1)}_{n,h}$ are written
$$\{\widehat g_{n,h} - g_{n,h}\}(z) = f_X^{-1}(z)\big\{(\widehat\mu_{n,h} - \mu_{n,h})(z) - g(z)(\widehat f_{X,n,h} - f_{X,n,h})(z)\big\} + o_{L^2}((nh)^{-1/2}),$$
$$\{\widehat g^{(1)}_{n,h} - g^{(1)}_{n,h}\}(z) = f_X^{-1}(z)\big[(\widehat\mu^{(1)}_{n,h} - \mu^{(1)}_{n,h})(z) - \{\widehat g_{n,h}\widehat f^{(1)}_{X,n,h} - E(\widehat g_{n,h}\widehat f^{(1)}_{X,n,h})\}(z) - g^{(1)}(z)(\widehat f_{X,n,h} - f_{X,n,h})(z)\big] + o_{L^2}((nh^3)^{-1/2}),$$
with
$$\widehat g_{n,h}(z)\widehat f^{(1)}_{X,n,h}(z) = \{\widehat f^{(1)}_{X,n,h} - f^{(1)}_{X,n,h}\}(z)\big[g_{n,h} + f_X^{-1}\{(\widehat\mu_{n,h} - \mu_{n,h}) - g(\widehat f_{X,n,h} - f_{X,n,h})\}\big](z)$$
$$\quad + f^{(1)}_{X,n,h}(z)\big[g_{n,h} + f_X^{-1}\{(\widehat\mu_{n,h} - \mu_{n,h}) - g(\widehat f_{X,n,h} - f_{X,n,h})\}\big](z) + o_{L^2}((nh)^{-1/2}),$$
$$E\{\widehat g_{n,h}\widehat f^{(1)}_{X,n,h}\}(z) = f_X^{-1}(z)E\big[\{\widehat f^{(1)}_{X,n,h} - f^{(1)}_{X,n,h}\}(z)\{(\widehat\mu_{n,h} - \mu_{n,h}) - g(\widehat f_{X,n,h} - f_{X,n,h})\}(z)\big] + f^{(1)}_{X,n,h}(z)g_{n,h}(z) + o((nh^2)^{-1})$$
$$= f^{(1)}_{X,n,h}(z)g_{n,h}(z) + O((nh^2)^{-1}).$$
Then
$$\{\widehat g_{n,h}\widehat f^{(1)}_{X,n,h} - E(\widehat g_{n,h}\widehat f^{(1)}_{X,n,h})\}(z) = \{\widehat f^{(1)}_{X,n,h} - f^{(1)}_{X,n,h}\}(z)\big[g_{n,h} + f_X^{-1}\{(\widehat\mu_{n,h} - \mu_{n,h}) - g(\widehat f_{X,n,h} - f_{X,n,h})\}\big](z)$$
$$\quad + f^{(1)}_{X,n,h}(z)f_X^{-1}\{(\widehat\mu_{n,h} - \mu_{n,h}) - g(\widehat f_{X,n,h} - f_{X,n,h})\}(z) + o_{L^2}((nh)^{-1/2}),$$
$$\{\widehat g^{(1)}_{n,h} - g^{(1)}_{n,h}\}(z) = f_X^{-1}(z)\Big[(\widehat\mu^{(1)}_{n,h} - \mu^{(1)}_{n,h})(z) - (\widehat f^{(1)}_{X,n,h} - f^{(1)}_{X,n,h})(z)\Big(\big[g_{n,h} + f_X^{-1}\{(\widehat\mu_{n,h} - \mu_{n,h}) - g(\widehat f_{X,n,h} - f_{X,n,h})\}\big](z)$$
$$\quad + \big[f^{(1)}_{X,n,h}f_X^{-1}\{(\widehat\mu_{n,h} - \mu_{n,h}) - g(\widehat f_{X,n,h} - f_{X,n,h})\}\big](z)\Big) - g^{(1)}(z)(\widehat f_{X,n,h} - f_{X,n,h})(z)\Big] + o_{L^2}((nh)^{-1/2}),$$
$$\mathrm{Cov}(\widehat g_{n,h}, \widehat g^{(1)}_{n,h})(z) = -f_X^{-2}(z)\big[\mathrm{Cov}(\widehat\mu_{n,h}, \widehat\mu^{(1)}_{n,h}) - g^{(1)}\mathrm{Cov}(\widehat\mu_{n,h}, \widehat f_{X,n,h}) - g\,\mathrm{Cov}(\widehat f_{X,n,h}, \widehat\mu^{(1)}_{n,h}) - gg^{(1)}\mathrm{Var}\,\widehat f_{X,n,h}$$
$$\quad - g^{(1)}_{n,h}\mathrm{Cov}(\widehat f_{X,n,h}, \widehat\mu_{n,h}) + gg^{(1)}_{n,h}\mathrm{Cov}(\widehat f_{X,n,h}, \widehat f_{n,h}) + f_{n,h}\mathrm{Var}\,\widehat g^{(1)}_{n,h} + E\{(\widehat f_{n,h} - f_{n,h})(\widehat g^{(1)}_{n,h} - g^{(1)}_{n,h})\}\big] + o((nh^2)^{-1}),$$
where the main term has the same order as $\mathrm{Cov}(\widehat\mu_{n,h}, \widehat\mu^{(1)}_{n,h})$, namely $O((nh^2)^{-1})$, and the variances and covariances of the terms without derivatives are $O((nh)^{-1})$. The product has mean $E\{\widehat g_{n,h}\widehat g^{(1)}_{n,h}\}(z) = g_{n,h}(z)g^{(1)}_{n,h}(z) + O((nh^2)^{-1})$; then
$$\mathrm{Var}\{\widehat g_{n,h}\widehat g^{(1)}_{n,h}\}(z) = E[\{\widehat g_{n,h}(z) - g_{n,h}(z)\}^2\{\widehat g^{(1)}_{n,h}(z) - g^{(1)}_{n,h}(z)\}^2] + g^2_{n,h}(z)\mathrm{Var}\{\widehat g^{(1)}_{n,h}(z)\} + g^{(1)2}_{n,h}(z)\mathrm{Var}\{\widehat g_{n,h}(z)\}$$
$$\le E[\{\widehat g_{n,h}(z) - g_{n,h}(z)\}^4]^{1/2}E[\{\widehat g^{(1)}_{n,h}(z) - g^{(1)}_{n,h}(z)\}^4]^{1/2} + g^2_{n,h}(z)\mathrm{Var}\{\widehat g^{(1)}_{n,h}(z)\} + g^{(1)2}_{n,h}(z)\mathrm{Var}\{\widehat g_{n,h}(z)\}.$$
Finally,
$$E[\{\widehat g_{n,h}(z) - g_{n,h}(z)\}^4] = O((nh)^{-1}), \qquad E[\{\widehat g^{(1)}_{n,h}(z) - g^{(1)}_{n,h}(z)\}^4] = O((nh^3)^{-1}),$$
$$\mathrm{Var}\{\widehat g_{n,h}\widehat g^{(1)}_{n,h}\}(z) = O((nh^3)^{-1});$$
therefore the main term of the variance of the product $\widehat g_{n,h}(z)\widehat g^{(1)}_{n,h}(z)$ is $g^2_{n,h}(z)\mathrm{Var}\{\widehat g^{(1)}_{n,h}(z)\} = O((nh^3)^{-1})$ and the mean $V^{(1)}_{n,h}(\theta)$ is an $O(h^2) + O((nh^2)^{-1})$.
Lemma 10.1. For $1 \le i \ne j \le n$,
$$E\{\varphi(X_i) - \varphi(X_j)\}K_h(X_i - X_j) = \frac{h^2 m_{2K}}{2}E(f_X\varphi^{(2)} + f_X^{(1)}\varphi^{(1)})(X) + r_{1n},$$
$$E\{\varphi(X_i) - \varphi(X_j)\}K_h^2(X_i - X_j) = E[\{f_X\varphi^{(1)}\}(X)]\int zK^2(z)\,dz + r_{2n},$$
where $r_{1n} = o(h^2)$ and
$$r_{2n} = h\int z^2K^2(z)\,dz\,E[\{f_X\varphi^{(2)} + 2\varphi^{(1)}f_X^{(1)}\}(X)] + o(h^2).$$

Proof. Using Lemma 2.1 and an expansion for $|X_i - X_j| \le h$,
$$E\{\varphi(X_i) - \varphi(X_j)\}K_h^2(X_i - X_j) = \int\!\!\int\{\varphi(x) - \varphi(y)\}K_h^2(x - y)f_X(x)f_X(y)\,dx\,dy$$
$$= h^{-1}\int\!\!\int\Big\{hz\varphi^{(1)} + \frac{z^2h^2}{2}\varphi^{(2)} + o(h)\Big\}\{f_X + hzf_X^{(1)} + o(h^2)\}K^2(z)\,dF_X\,dz.$$
The higher orders are obtained from further terms in the expansion. □

10.4 Appendix D

The ergodicity and mixing conditions for the convergence of functionals of a process $(X_t)_{t\ge0}$ or a sequence of dependent variables $(X_i)_{1\le i\le n}$ are expressed in the following conditions A and B. Let $(X_i)_{i\ge1}$ be a sequence of dependent random variables with values in a metric space $(\mathbb{X}, \mathcal{A}, \mu)$ and such that $EX_i^2$ is finite for every i. Let $\mathcal{M}_1^j$ and $\mathcal{M}_{j+k}^\infty$ be the $\sigma$-algebras generated by the sub-samples $(X_i)_{1\le i\le j}$ and $(X_i)_{j+k\le i}$, respectively. The sequence $(X_i)_{i\ge1}$ is $\varphi$-mixing if there exists a real sequence $(\varphi_k)_{k\in\mathbb{N}}$ converging to zero as k tends to infinity and such that the conditional probabilities of the sample $(X_i)_{i\in\mathbb{N}}$ satisfy
$$\sup\{|P(B\mid A) - P(B)|;\ A \in \mathcal{M}_1^j,\ B \in \mathcal{M}_{j+k}^\infty,\ j, k \in \mathbb{N}\} < \varphi_k.$$

A1 The sequence $(X_i)_{i\ge1}$ is ergodic if there exists a probability $\nu$ on $(\mathbb{X}, \mathcal{A}, \mu)$ such that for every real bounded function $\phi$ on $\mathbb{X}$
$$\frac{1}{n}\sum_{i=1}^{n}\phi(X_i) \xrightarrow[n\to\infty]{P} E_\nu\phi(X_1).$$

B1 The sequence $(X_i)_{i\ge0}$ is $\varphi$-mixing with a sequence $(\varphi_k)_{k\ge1}$ satisfying $\sum_{k\ge1}(k+1)^2\varphi_k^{1/2} < \infty$.

The $\varphi$-mixing property and condition B1 are defined in Billingsley (1968); they imply the weak convergence of the normalized variable $n^{1/2}\{n^{-1}\sum_{i=1}^{n}\phi(X_i) - E_\nu\phi(X_1)\}$ to a centered normal variable with variance $\sigma_\phi^2$.

Let $(X_t)_{t\ge0}$ be a time-indexed process such that, for every t > 0, $X_t$ is a random variable in a metric space $(\mathbb{X}, \mathcal{A}, \mu)$ and $EX_t^2$ is finite. Let $\mathcal{M}_0^s$ and $\mathcal{M}_{s+t}^\infty$ be the $\sigma$-algebras generated by the sample paths of the process observed on the time intervals [0, s] and $[s+t, \infty[$ respectively, $\mathcal{M}_0^s = \sigma\{(X_u)_{0\le u\le s}\}$ and $\mathcal{M}_{s+t}^\infty = \sigma\{(X_u)_{u\ge s+t}\}$, with s and t > 0. The process $(X_t)_{t\ge0}$ is $\varphi$-mixing if there exists a sequence $(\varphi_t)_{t\ge0}$ converging to zero as t tends to infinity and such that the marginal distributions of the process $(X_t)_{t\ge0}$ satisfy
$$\sup\{|P(B\mid A) - P(B)|;\ A \in \mathcal{M}_0^s,\ B \in \mathcal{M}_{s+t}^\infty,\ s, t \ge 0\} < \varphi_t.$$

The ergodicity and mixing conditions are modified as follows.


A2 The process $(X_t)_{t\ge0}$ is ergodic if there exists a probability $\nu$ on $(\mathbb{X}, \mathcal{A}, \mu)$ such that for every real bounded function $\phi$ defined on the space $\mathbb{X}$
$$\frac{1}{t}\int_0^t\phi(X_s)\,ds \xrightarrow[t\to\infty]{P} \int_{\mathbb{X}}\phi(x)\,d\nu(x).$$

B2 The process $(X_t)_{t\ge0}$ is $\varphi$-mixing with a sequence $(\varphi_t)_{t\ge0}$ satisfying $\int_0^\infty(t+1)^2\varphi_t^{1/2}\,dt < \infty$.
The ergodic property is strengthened to allow the convergence of functionals of the joint distributions of the process at several observation times.

A2' The process $(X_t)_{t\ge0}$ is ergodic if for every integer k there exists a probability $\nu_k$ on $(\mathbb{X}^{\otimes k}, \mathcal{A}_k, \mu)$, with the Borel $\sigma$-algebra $\mathcal{A}_k$ on $\mathbb{X}^{\otimes k}$, such that for every real bounded function $\phi$ defined on the space $\mathbb{X}^{\otimes k}$
$$\frac{1}{t^k}\int_{[0,t]^k}\phi(X_{s_1}, \dots, X_{s_k})\,ds_1\cdots ds_k \xrightarrow[t\to\infty]{P} E_{\nu_k}\phi(X_1, \dots, X_k).$$

The expectation with respect to the limit $\nu_k$ is an integral on $\mathbb{X}^{\otimes k}$, $E_{\nu_k}\phi(X_1, \dots, X_k) = \int_{\mathbb{X}^{\otimes k}}\phi(x_1, \dots, x_k)\,\nu_k(dx_1, \dots, dx_k)$. For every integer k > 0, this property is the consistency of the k-th moment of the process X. For k = 2, it implies the convergence in probability of the covariance function of the process X. Condition B2 entails the weak convergence of the process $t^{1/2}\{t^{-1}\int_0^t\phi(X_s)\,ds - E_\nu\phi(X_1)\}$ to a normal process $\sigma_\phi^2 W_1$. The $\varphi$-mixing property and condition B1 imply the consistency and the weak convergence of the partial sums of the sequence of variables $Z_k = X(n^{-1}k) - X(n^{-1}(k-1))$, $1 \le k \le n$; under these conditions, the process $S_n(x) = n^{-1/2}\sum_{k=1}^{[nx]}Z_k$, x in [0, 1], converges weakly to the Brownian motion defined on [0, 1] with covariance function $C(s, t) = s \wedge t$.

Notations

$1_A$  indicator of a set A,
a.s.  almost surely,
$(B_t)_{t\ge0}$  Brownian motion,
$\mathrm{Cov}(X_i, X_j)$  covariance of $X_i$ and $X_j$: $E(X_i - EX_i)(X_j - EX_j)$,
$C^s(I)$  class of real functions on I having bounded and continuous derivatives of order s,
$\Delta f(x, y)$  variation of f: $f(y) - f(x)$,
$EX$  expectation (or mean) of a variable X: $\int x\,dF_X(x)$,
$F_X(x)$  probability of the random set $\{X \le x\}$,
$f^{(s)}$  derivative of order s of a function f,
$\widehat F_{X,n}$  empirical distribution function,
$m^{-1}$  either 1/m or the inverse of a monotone function m,
$H_{\alpha,M}$  class of real functions f such that, for all x and y, $|f^{(s)}(x) - f^{(s)}(y)| \le M|x - y|^{\alpha-s}$, $s = [\alpha]$,
$K_h = h^{-1}K(h^{-1}\cdot)$  normalized kernel with bandwidth h,
$\Lambda = \int_0^{\cdot}\frac{dF}{1 - F^-}$  cumulative hazard function for the distribution function F,
$N = (N_t)_{t\ge0}$  point process,
$\widehat\nu_n$  empirical process $n^{1/2}(\widehat F_{X,n} - F_X)$,
$L_n$  partial likelihood of N at t such that $N_t = n$,
$(\widetilde N_t)_{t\ge0}$  predictable compensator of a point process N,
$\Omega$  sample space,
$\rho(i, j)$  correlation of the variables $X_i$ and $X_j$: $\mathrm{Cov}(X_i, X_j)\{\mathrm{Var}X_i\,\mathrm{Var}X_j\}^{-1/2}$,
$\mathrm{Var}X$  variance of a variable X: $E(X - EX)^2$,
$\widehat V_{n,h}$  empirical mean squared error for a regression,


$(X_t)_{t\ge0}$  continuously observed process or time series,
$(W_t)_{t\ge0}$  Gaussian process,
$Z_h = \{z \in Z;\ \sup_{z'\in\partial Z}\|z - z'\| \ge h\}$, with $\partial Z$ the frontier of Z.

Bibliography

Andersen, P. and Gill, R. D. (1982). Cox’s regression model for counting processes:
a large sample study, Ann. Statist. 10, pp. 1100–1120.
Bahadur, R. R. (1966). A note on quantiles in large samples, Ann. Math. Statist.
37, pp. 577–580.
Barlow, R., Bartholomew, D. J., Bremmer, J. and Brunk, H. D. (1972). Statistical
Inference under Order Restrictions (Wiley, New York).
Beran, R. J. (1972). Upper and lower risks and minimax procedures, Proceedings
of the sixth Berkeley Symposium on Mathematical Statistics, L. Lecam, J.
Neyman and E. Scott (eds) , pp. 1–16.
Bickel, P. and Rosenblatt, P. (1973). On some global measures of the deviations
of density functions estimates, Ann. Statist. 1, pp. 1071–1095.
Billingsley, P. (1968). Convergence of probability measures (Wiley, New York).
Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes (2nd edition,
Springer, New York).
Bowman, A. W. (1984). An alternative method of cross-validation for the smooth-
ing of density estimates, Biometrika 71, pp. 353–360.
Bowman, A. W. and Azalini, A. (1997). Applied Smoothing Techniques for Data
Analysis. The Kernel Approach with S-Plus Illustrations (Oxford Statistical
Science Series 18).
Breslow, N. and Crowley, J. (1974). A large sample study of the life table and
product limit estimates under random censorship, Ann. Statist. 2, pp. 437–
453.
Bretagnolle, J. and Huber, C. (1981). Estimation de densités : risque minimax,
Z. Wahrsch. Verw. Geb. 47, pp. 119–139.
Brillinger, D. R. (1981). Time Series Data Analysis and Theory (Holt, Rinehart
and Winston, New York).
Chaudhuri, P. (1991). Nonparametric estimates of regression quantiles and their
local Bahadur representation, Ann. Statist. 19, pp. 760–777.
Chernoff, H. (1964). Estimation of the mode, Ann. Inst. Statist. Math. 16, pp.
31–41.
Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data (Chapman and Hall,
London).


Cox, R. D. (1972). Regression model and life tables, J. Roy. Statist. Soc. Ser. B
34, pp. 187–220.
De Boor, C. (1978). A Practical Guide to Splines (Springer, New York).
Deheuvels, P. (1977). Estimation non paramétrique de la densité par his-
togrammes généralisés, Rev. Statist. Appl 25, pp. 5–42.
Delecroix, M., Härdle, W. and Hristache, M. (2003). Optimal smoothing in single-
index models, J. Multiv. Anal. 286, pp. 213–226.
Devroye, L. (1983). The equivalence of weak, strong and complete convergence in
l1 for kernel density estimates, Ann. Statist. 11, pp. 896–904.
Dümbgen, L. and Rufibach, K. (2009). Maximum likelihood estimation of a log-concave density and its distribution function: Basic properties and uniform consistency, Bernoulli 15, pp. 40–68.
Dvoretski, A., Kiefer, J. and Wolfowitz, J. (1956). Asymptotic minimax char-
acter of the sample distribution functions and of the classical multinomial
estimator, Ann. Math. Statist. 27, pp. 642–669.
Eubank, R. (1977). Spline Smoothing and Nonparametric Regression (Dekker,
New York).
Fan, J. and Gijbels, I. (1996). Polynomial Modelling and Its Applications (Chap-
man and Hall CRC).
Ghosh, J. K. (1966). A new proof of the Bahadur representation of quantiles and
an application, Ann. Math. Statist. 42, pp. 1957–1961.
Gijbels, I. and Veraverbeke, N. (1988a). Almost sure asymptotic representation
for a class of functionals of the product-limit estimator, Ann. Statist. 19,
pp. 1457–1470.
Gijbels, I. and Veraverbeke, N. (1988b). Weak asymptotic representations for
quantiles of the product-limit estimator, J. Statist. Plann. Inf. 18, pp.
151–160.
Groeneboom, P. (1989). Brownian motion with a parabolic drift and Airy functions, Probab. Theory Related Fields 81, pp. 79–109.
Groeneboom, P., Jongbloed, G. and Wellner, J. (2001). Estimation of a convex function: Characterization and asymptotic theory, Ann. Statist. 29, pp. 1653–1698.
Groeneboom, P. and Wellner, J. (1990). Empirical Processes (Birkhäuser, Basel).
Guyon, X. and Perrin, O. (2000). Identification of space deformation using linear
and superficial quadratic variations, Statist. Prob. Lett. 47, pp. 307–316.
Hall, P. (1981). Law of the iterated logarithm for nonparametric density estima-
tors, Stoch. Proc. Appl. 56, pp. 47–61.
Hall, P. (1984). Integrated square error properties of kernel estimators of regres-
sion functions, Ann. Statist. 12, pp. 241–260.
Hall, P. and Huang, L.-S. (2001). Nonparametric kernel regression subject to monotonicity constraints, Ann. Statist. 29, pp. 624–647.
Hall, P. and Johnstone, I. (1992). Empirical functionals and efficient smoothing
parameter selection, J. Roy. Statist. Soc. Ser. B 54, pp. 475–530.
Hall, P. and Marron, J. M. (1987). Estimation of integrated squared density
derivatives, Statist. Probab. Lett. 6, pp. 109–115.

Härdle, W. (1990). Applied Nonparametric Regression (Cambridge University


Press).
Härdle, W. (1991). Smoothing Methods in Statistics (Cambridge University Press,
UK).
Härdle, W., Hall, P. and Ichimura, H. (1993). Optimal smoothing in single-index models, Ann. Statist. 21, pp. 157–178.
Härdle, W., Hall, P. and Marron, J. M. (1988). How far are automatically chosen regression smoothers from their optimum? (with discussion), J. Amer. Statist. Soc. 6, pp. 109–115.
Hristache, M., Juditsky, A. and Spokoiny, V. (2001). Direct estimation of the index coefficients in a single-index model, Ann. Statist. 29, pp. 595–623.
Ibragimov, I. and Has’minskii, R. (1981). Statistical Estimation: Asymptotic The-
ory (Springer, New York).
Ichimura, H. (1993). Semi-parametric least squares and weighted sls estimation
of single-index models, J. Econometrics 58, pp. 71–120.
Jones, M. C., Marron, J. S. and Park, B. U. (1991). A simple root n bandwidth
selector, Ann. Statist. 19, pp. 1919–1932.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations, J. Am. Statist. Ass. 53, pp. 457–481.
Khasminskii, R. Z. (1992). Topics in nonparametric estimation (American Math-
ematical Society).
Kiefer, J. (1972). Iterated logarithm analogues for sample quantiles when pn ↓ 0,
Proceedings of the sixth Berkeley Symposium on Mathematical Statistics, L.
Lecam, J. Neyman and E. Scott (eds) , pp. 227–244.
Kiefer, J. and Wolfowitz, J. (1976). Asymptotically minimax estimation of con-
cave and convex distribution functions, Z. Wahrsch. Verw. Gebiete 34, pp.
73–85.
Kim, J. and Pollard, D. (1990). Cube root asymptotics, Ann. Statist. 18, pp.
191–219.
Lecam, L. (1990). On the asymptotic theory of estimation (Springer, New York).
Lo, S.-H. and Singh, K. (1986). The product-limit estimator and the bootstrap:
some asymptotic representations, Prob. Theor. Rel. Fields 71, pp. 455–465.
Mammen, E. (1991). Estimating a smooth monotone regression function, Ann.
Statist. 19, pp. 724–740.
Marron, J. S. (1988). Improvement of a data-based bandwidth selector, Preprint
Univ. North Carolina, Chapel Hill , pp. 1–31.
Messer, K. (1991). A comparison of a spline estimate to its equivalent kernel
estimate, Ann. Statist. 19, pp. 817–829.
Meyer, M. and Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression, Ann. Statist. 28, pp. 1083–1104.
Nadaraya, E. A. (1964). On estimating regression, Theor. Probab. Appl. 9, pp.
141–142.
Nadaraya, E. A. (1989). Nonparametric estimation of probability densities and
regression curves (Kluwer Academic Publisher, Boston).
Ould Saïd, E. (1997). A note on ergodic processes prediction via estimation of the conditional mode, Scand. J. Statist. 24, pp. 231–239.

Parzen, E. A. (1962). On the estimation of probability density and mode, Ann.


Math. Statist. 33, pp. 1065–1076.
Parzen, E. A. (1979). Nonparametric statistical data modeling, J. Amer. Statist.
Assoc. 74, pp. 105–131.
Perrin, O. (1999). Quadratic variations for gaussian processes and application to
time deformation, Stoch. Proc. Appl. 82, pp. 293–305.
Pinçon, C. and Pons, O. (2006). Nonparametric estimator of a quantile function
for the probability of event with repeated data, Dependence in Probability
and Statistics, Lect. N. Statist. 17. Springer, New York , pp. 475–489.
Plancherel, M. and Rotach, W. (1929). Sur les valeurs asymptotiques des
polynômes d’Hermite, Commentarii Mathem. Helvet. 1, pp. 227–254.
Pons, O. (1986). Vitesse de convergence des estimateurs à noyau pour l’intensité
d’un processus ponctuel, Statistics 17, pp. 577–584.
Pons, O. (2000). Nonparametric estimation in a varying-coefficient Cox model,
Mathematical Methods of Statistics 9, pp. 376–398.
Pons, O. (2007a). Estimation of absolutely continuous distributions for cen-
sored variables in two-samples nonparametric and semi-parametric models,
Bernoulli 13, pp. 92–114.
Pons, O. (2007b). Estimation of the distribution function of one and two di-
mensional censored variables or sojourn times of markov renewal processes,
Comm. Statist.–Theory Methods 55, pp. 1–18.
Pons, O. (2009a). Estimation and tests in distribution mixtures and change-points
models (O. Pons, Viroflay, F).
Pons, O. (2009b). Nonparametric Estimation for Renewal and Markov Processes
(O. Pons, Viroflay, F).
Pons, O. and de Turckheim, E. (1987). Estimation in Cox’s periodic model with
a histogram-type estimator for the underlying intensity, Scand. J. Statist.
14, pp. 329–345.
Prakasa Rao, B. L. S. (1983). Nonparametric Functional Estimation (Academic
Press, New York).
Rebolledo, R. (1978). Sur les applications de la théorie des martingales à
l’étude statistique d’une famille de processus ponctuels, Journée de Statis-
tique des Processus Stochastiques, Lecture Notes in Mathematics 636,
pp. 27–70.
Rice, J. and Rosenblatt, M. (1983). Smoothing splines: regression, derivatives
and deconvolution, Ann. Statist. 11, pp. 141–156.
Robinson, P. M. (1991). Automatic frequency domain inference on semiparamet-
ric and nonparametric models, Econometrika 59, pp. 1329–1363.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density
function, Ann. Math. Statist. 27, pp. 832–837.
Rosenblatt, M. (1975). A quadratic measures of deviation of two-dimensional
density estimates, Ann. Statist. 3, pp. 1–14.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators,
Scand. J. Statist. 9, pp. 65–78.
Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression
(Cambridge University Press, UK).

Schoenberg, I. (1964). Spline functions and the problem of graduation, Proc. Nat.
Acad. Sci. USA 52, pp. 947–950.
Schuster, E. F. (1969). Estimation of a probability density function and its deriva-
tives, Ann. Math. Statist. 40, pp. 1187–1195.
Scott, D. W. (1992). Multivariate density estimation: theory, practice, and visu-
alization (Wiley, New York).
Sheather, S. J. and Marron, J. S. (1990). Kernel quantile estimators, J. Amer.
Statist. Assoc. 85, pp. 410–416.
Shorack, G. R. and Wellner, J. A. (1986). Empirical processes and applications
to statistics (Wiley, New York).
Silverman, B. W. (1978a). On a Gaussian process related to multivariate probability density estimation, Math. Proc. Cambridge Philos. Soc. 80, pp. 136–144.
Silverman, B. W. (1978b). Weak and strong uniform consistency of the kernel estimate of a density and its derivatives, Ann. Statist. 6, pp. 177–184.
Silverman, B. W. (1984). Spline smoothing: The equivalent variable kernel
method, Ann. Statist. 12, pp. 898–916.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to the
nonparametric regression curve fitting, J. Roy. Statist. Soc. Ser. B 47, pp.
1–22.
Simonoff, J. S. (1996). Smoothing Methods in Statistics (Springer-Verlag, New
York).
Singh, R. S. (1979). On necessary and sufficient conditions for uniform strong
consistency of estimators of a density and its derivatives, J. Multiv. Anal.
9, pp. 157–164.
Stieltjes, T.-J. (1890). Sur les polynômes de Legendre, Ann. Fac. Sci. Toulouse,
1e série 4 G, pp. 1–17.
Stone, M. (1974). Cross-validation choice and assessment of statistical prediction
(with discussion), J. Roy. Statist. Soc. Ser. B 36, pp. 111–147.
Stute, W. (1982). A law of the logarithm for kernel density estimators, Ann.
Probab. 10, pp. 414–422.
van de Geer, S. (1993). Hellinger consistency of certain nonparametric maximum
likelihood estimators, Ann. Statist. 21, pp. 14–44.
van de Geer, S. (1996). Applications of empirical process theory (Cambridge uni-
versity press).
van der Vaart, A. and van der Laan, M. (2003). Smooth estimation of a monotone
density, Statistics 37, pp. 189–203.
van der Vaart, A. and Wellner, J. A. (1996). Weak convergence and Empirical
Processes (Springer, New York).
Wahba, G. (1977). Optimal smoothing of density estimates, Classification and
clustering, (ed.) J. Van Ryzin. Academic Press, New York , pp. 423–458.
Wahba, G. and Wold, S. (1975). A completely automatic french curve: Fitting
spline functions by cross-validation, Comm. Statist. 4, pp. 1–17.
Walker, A. M. (1971). On the estimation of a harmonic component in a time
series with stationary independent residuals, Biometrika 58, pp. 26–36.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing (Chapman and Hall,
CRC).

Watson, G. S. (1964). Smooth regression analysis, Sankhyã A26, pp. 359–372.


Watson, G. S. and Leadbetter, M. (1963). On the estimation of a probability density, Ann. Math. Statist. 34, pp. 480–491.
Watson, G. S. and Leadbetter, M. R. (1964). Hazard analysis, Biometrika 51, pp. 175–184.
Whittaker, E. T. (1923). On a new method of graduation, Proc. Edinburgh Math.
Soc. (2) 41, pp. 63–75.
Whittle, P. (1958). On the smoothing of probability density functions, J. Roy.
Statist. Soc., Ser. B 20, pp. 334–343.
Wold, S. (1975). Periodic splines for the spectral density estimation: The use of
cross-validation for determining the degree of smoothness, Comm. Statist.
4, pp. 125–141.

Index

bias of estimators
  conditional distribution, 89
  conditional density, 98
  conditional quantile, 91
  diffusion, 150, 161
  drift, 149, 160
  histogram, 34
  histogram for intensity, 117
  jump size, 160
  kernel density, 80
  kernel density estimator, 23
  kernel intensity, 111
  kernel regression, 51, 81
  weighted kernel, 67
change-point
  estimator, 179
  nonparametric time series, 177
consistency
  kernel density estimator, 25
  kernel intensity estimator, 111
diffusion process, 17, 147
  approximation, 147, 159
  continuous, 154
  continuous estimation, 154, 162
  drift estimation, 149, 160
  ergodic density, 155
  estimators, 18
  jumps size estimation, 160
  variance estimation, 150, 161
  with jumps, 158
duration excess
  density, 126
  probability, 15, 126
ergodic density, 44
ergodic measure, 44, 45, 188
  diffusion process, 158, 187
  regression process, 72, 84
Gaussian process
  diffusion, 159
  singularity function, 19, 164
  transformation, 19, 164
hazard function, 14
  kernel estimator, 109
Hellinger distance
  density estimator, 39
  intensity estimator, 116
Hermite polynomials, 3
histogram, 33
  hazard function, 110
  intensity, 108, 117
  regression, 124
Hölderian density, 29
intensity
  Cox process, 109
  multiplicative, 109
inverse regression function, 102
isotonic estimator
  density, 7
  regression, 13
Kaplan-Meier estimator, 15
kernel estimator
  continuous diffusion process, 155
  density, 5, 23
  density derivatives, 27
  density support, 37
  ergodic density, 44
  product-limit, 41
  regression, 11, 50
  regression derivatives, 58
  regression for processes, 71, 84
  varying bandwidth, 75, 80
  weighted kernel, 12
kernel function, 24, 27
  moments, 17
Laguerre polynomials, 3
left-censoring
  density, 42
  intensity, 43
Legendre polynomials, 3
likelihood
  approximation, 179
  multiplicative intensity, 109
linear representation
  kernel regression estimator, 51, 55
local polynomial estimator, 62
minimax estimator, 33
MISE
  density estimator, 6
  regression estimator, 56, 73
mode
  density, 35
  regression function, 68
MSE, 27
  conditional quantile estimator, 95
  density estimator, 5
  quantile, 104
  regression estimator, 11
norm
  density estimator, 26, 29
  kernel regression estimator, 54
optimal bandwidth
  conditional quantile, 95
  density, 6, 27, 29, 31
  derivatives, 28
  diffusion process, 157
  kernel regression estimator, 56
orthonormal basis, 2
parametric estimation, 140
periodic
  density, 9
  time series, 9, 171
point process
  estimation, 108
  intensity, 15
progressive censoring, 135
proportional odds model, 70
quantile
  conditional distribution, 87
quantile estimator, 14
  approximation, 87
  Bahadur representation, 14
  conditional processes, 100
  distribution function, 13
regression
  change of variables, 143
  differential mean squares, 138
  intensity, 132
  kernel estimator, 50
  mean squares, 138
  model, 49
  multiplicative intensity, 119
  single-index, 13, 137
retro-hazard function, 42
right-censoring
  density, 40
  regression, 69
stopping time, 15, 128, 159
time series, 167
  auto-regressive, 168, 175
  change-points, 175
  covariance estimation, 173
  nonparametric regression, 170
  stationarity, 167
  transformation, 174
  weak convergence, 180
variance of estimators
  conditional density, 98
  conditional distribution, 89
  conditional quantile, 91
  density, 80
  diffusion, 150, 161
  drift, 149, 160
  histogram for intensity, 117
  intensity, 111
  jump size, 161
  regression, 25, 51, 66, 72, 81
  weighted kernel, 65
variation
  kernel density estimator, 29, 77
  kernel regression estimator, 60, 83
weak convergence, 15, 128
  conditional density estimator, 98
  conditional distribution, 89
  density estimator, 32
  diffusion estimator, 151, 156, 162
  diffusion process, 159
  kernel intensity estimator, 114
  regression estimator, 60
  single-index estimator, 141, 143
  varying bandwidth estimator, 78
weighted kernel estimator, 64
  regression estimation, 67
  time series, 174
