The Annals of Statistics 10.1214/009053606000000830 Institute of Mathematical Statistics
arXiv:math/0702650v1 [math.ST] 22 Feb 2007
The Annals of Statistics
2006, Vol. 34, No. 5, 2159-2179
DOI: 10.1214/009053606000000830
© Institute of Mathematical Statistics, 2006
PREDICTION IN FUNCTIONAL LINEAR REGRESSION
By T. Tony Cai$^{1}$ and Peter Hall
University of Pennsylvania and Australian National University
There has been substantial recent work on methods for estimating the slope
function in linear regression for functional data analysis. However, as in
the case of more conventional finite-dimensional regression, much of the
practical interest in the slope centers on its application for the purpose
of prediction, rather than on its significance in its own right. We show
that the problems of slope-function estimation, and of prediction from an
estimator of the slope function, have very different characteristics. While
the former is intrinsically nonparametric, the latter can be either
nonparametric or semiparametric. In particular, the optimal mean-square
convergence rate of predictors is $n^{-1}$, where $n$ denotes sample size,
if the predictand is a sufficiently smooth function. In other cases,
convergence occurs at a polynomial rate that is strictly slower than
$n^{-1}$. At the boundary between these two regimes, the mean-square
convergence rate is less than $n^{-1}$ by only a logarithmic factor. More
generally, the rate of convergence of the predicted value of the mean
response in the regression model, given a particular value of the
explanatory variable, is determined by a subtle interaction among the
smoothness of the predictand, of the slope function in the model, and of
the autocovariance function for the distribution of explanatory variables.
1. Introduction. In the problem of functional linear regression we observe
data $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where the $X_i$'s are independent
and identically distributed as a random function $X$, defined on an interval
$I$, and the $Y_i$'s are generated by the regression model,
$$Y_i = a + \int_I b X_i + \varepsilon_i. \tag{1.1}$$
Received August 2004; revised October 2005.
$^{1}$Supported in part by NSF Grant DMS-03-06576 and a grant from the
Australian Research Council.
AMS 2000 subject classifications. Primary 62J05; secondary 62G20.
Key words and phrases. Bootstrap, covariance, dimension reduction,
eigenfunction, eigenvalue, eigenvector, functional data analysis, intercept,
minimax, optimal convergence rate, principal components analysis, rate of
convergence, slope, smoothing, spectral decomposition.
This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in The Annals of Statistics,
2006, Vol. 34, No. 5, 2159-2179. This reprint differs from the original in
pagination and typographic detail.
Here, $a$ is a constant, denoting the intercept in the model, and $b$ is a
square-integrable function on $I$, representing the slope function. The
majority of attention usually focuses on estimating $b$, typically by methods
based on functional principal components. See, for example, [28], Chapter 10,
and [29].
In functional linear regression, perhaps as distinct from more conventional
linear regression, there is significant interest in $b$ in its own right. In
particular, since $b$ is a function rather than a scalar, then knowing where
$b$ takes large or small values provides information about where a future
observation $x$ of $X$ will have greatest leverage on the value of
$\int_I bx$. Such information can be very useful for understanding the role
played by the functional explanatory variable. Nevertheless, as this example
suggests, the greatest overall interest lies, as in conventional linear
regression, in using an estimator $\hat b$ as an aid to predicting, either
qualitatively or quantitatively, a future value of $\int_I bx$.
Thus, while there is extensive literature on properties of $\hat b$, for
example on convergence rates of $\hat b$ to $b$ (see, e.g., [11, 13, 15, 20]),
there is arguably a still greater need to understand the manner in which
$\hat b$ should be constructed in order to optimize the prediction of
$\int_I bx$, or of $a + \int_I bx$. This is the problem addressed in the
present paper.
Estimation of $b$ is intrinsically an infinite-dimensional problem.
Therefore, unlike slope estimation in conventional finite-dimensional
regression, it involves smoothing or regularization. The smoothing step is
used to reduce dimension, and the extent to which this should be done depends
on the use to which the estimator of $b$ will be put, as well as on the
smoothness of $b$. It is in this way that the problem of estimating
$\int_I bx$ is quite different from that of estimating $b$. The operation of
integration, in computing $\int_I \hat b x$ from $\hat b$, confers additional
smoothness, with the result that if we smooth $\hat b$ optimally for
estimating $b$ then it will usually be oversmoothed for estimating
$\int_I bx$. Therefore the construction of $\hat b$, as a prelude to
estimating $\int_I bx$, should involve significant undersmoothing relative to
the amount of smoothing that would be used if we wished only to estimate $b$
itself. In fact, as we shall show, the degree of undersmoothing can be so
great that it enables $\int_I bx$ to be estimated root-$n$ consistently, even
though $b$ itself could not be estimated at such a fast rate.
However, root-$n$ consistency is not always possible when estimating
$\int_I bx$. The optimal convergence rate depends on a delicate balance among
the smoothness of $b$, the smoothness of $x$, and the smoothness of the
autocovariance of the stochastic process $X$, all measured with respect to
the same sequence of basis functions. In a qualitative sense, $\int_I bx$ can
be estimated root-$n$ consistently if and only if $x$ is sufficiently smooth
relative to the degree of smoothness of the autocovariance. If $x$ is less
smooth than this, then the optimal rate at which $\int_I bx$ can be estimated
is determined jointly by the smoothnesses of $b$, $x$ and the autocovariance,
and becomes faster as the smoothnesses of $x$ and of $b$ increase, and also
as the smoothness of the covariance decreases.
These results are made explicitly clear in Section 4, which gives upper
bounds to rates of convergence for specific estimators of $\int_I bx$, and
lower bounds (of the same order as the upper bounds) to rates of convergence
for general estimators. Section 2 describes construction of the specific
estimators of $b$, which are then substituted for $b$ in the formula
$\int_I bx$. Practical choice of smoothing parameters is discussed in
Section 3.
In this brief account of the problem we have omitted mention of the role of
the intercept, $a$, in the prediction problem. It turns out that from a
theoretical viewpoint the role is minor. Given an estimator $\hat b$ of $b$,
we can readily estimate $a$ by $\hat a = \bar Y - \int_I \hat b \bar X$,
where $\bar X$ and $\bar Y$ denote the means of the samples of $X_i$'s and
$Y_i$'s, respectively. Taking this approach, it emerges that the rate of
convergence of our estimator of $a + \int_I bx$ is identical to that of our
estimator of $\int_I bx$, up to terms that converge to zero at the parametric
rate $n^{-1/2}$. This point will be discussed in greater detail in
Section 4.1.
The approach taken in this paper to estimating b is based on functional
principal components. While other methods could be used, the PC technique
is currently the most popular. It goes back to work of Besse and Ramsay
[1], Ramsay and Dalzell [27], Rice and Silverman [31] and Silverman [32,
33]. There are a great many more recent contributions, including those of
Brumback and Rice [5], Cardot [7], Cardot, Ferraty and Sarda [8, 9, 10],
Girard [19], James, Hastie and Sugar [23], Boente and Fraiman [3] and He,
Müller and Wang [21].
Other recent work on regression for functional data includes that of Ferré
and Yao [18], who introduced a functional version of sliced inverse
regression; Preda and Saporta [26], who discussed linear regression on
clusters of functional data; Escabias, Aguilera and Valderrama [14] and
Ratcliffe, Heller and Leader [30], who described applications of functional
logistic regression; and Ferraty and Vieu [16, 17] and Masry [24], who
addressed various aspects of nonparametric regression for functional data.
Müller and Stadtmüller [25] introduced the generalized functional linear
model, where the response $Y_i$ is a general smooth function of
$a + \int_I bX_i$, plus an error. See also [22] and [12]. The methods
developed in the present paper could be extended to this setting.
2. Model and estimators. We shall assume model (1.1), and suppose that the
errors $\varepsilon_i$ are independent and identically distributed with zero
mean and finite variance. It will be assumed too that the errors are
independent of the $X_i$'s and that $\int_I E(X^2) < \infty$.
Conventionally, estimation of $b$ is undertaken using a principal components
approach, as follows. We take the covariance function of $X$ to be positive
definite, in which case it admits a spectral decomposition in terms of
strictly positive eigenvalues $\theta_j$,
$$K(u,v) \equiv \operatorname{cov}\{X(u), X(v)\}
  = \sum_{j=1}^{\infty} \theta_j \phi_j(u)\phi_j(v), \quad u, v \in I, \tag{2.1}$$
where $(\theta_j, \phi_j)$ are (eigenvalue, eigenfunction) pairs for the
linear operator with kernel $K$, the eigenvalues are ordered so that
$\theta_1 > \theta_2 > \cdots$ (in particular, we assume there are no ties
among the eigenvalues), and the functions $\phi_1, \phi_2, \ldots$ form an
orthonormal basis for the space of all square-integrable functions on $I$.
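Empirical versions of $K$ and of its spectral decomposition are
$$\hat K(u,v) \equiv \frac{1}{n}\sum_{i=1}^{n}
  \{X_i(u) - \bar X(u)\}\{X_i(v) - \bar X(v)\}
  = \sum_{j=1}^{\infty} \hat\theta_j \hat\phi_j(u)\hat\phi_j(v), \quad u, v \in I,$$
where $\bar X = n^{-1}\sum_i X_i$. Analogously to the case of $K$,
$(\hat\theta_j, \hat\phi_j)$ are (eigenvalue, eigenfunction) pairs for the
linear operator with kernel $\hat K$, ordered such that
$\hat\theta_1 \ge \hat\theta_2 \ge \cdots$. Moreover, $\hat\theta_j = 0$ for
$j \ge n + 1$. We take $(\hat\theta_j, \hat\phi_j)$ to be our estimator of
$(\theta_j, \phi_j)$. The function $b$ can be expressed in terms of its
Fourier series, as $b = \sum_{j \ge 1} b_j \phi_j$, where
$b_j = \int_I b\phi_j$. We estimate $b$ as
$$\hat b = \sum_{j=1}^{m} \hat b_j \hat\phi_j, \tag{2.2}$$
where $m$, lying in the range $1 \le m \le n$, denotes a frequency cut-off
and $\hat b_j$ is an estimator of $b_j$.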
To construct $\hat b_j$ we note that $b_j = \theta_j^{-1} g_j$, where $g_j$
denotes the $j$th Fourier coefficient of $g(u) = \int_I K(u,v)b(v)\,dv$. A
consistent estimator of $g$ is given by
$$\hat g(t) = \frac{1}{n}\sum_{i=1}^{n} \{X_i(t) - \bar X(t)\}(Y_i - \bar Y),$$
and so, for $1 \le j \le m$, we take $\hat b_j = \hat\theta_j^{-1}\hat g_j$,
where $\hat g_j = \int_I \hat g \hat\phi_j$.
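The construction leading to (2.2) can be sketched numerically as follows.
This is only a minimal illustration, not the authors' implementation: it
assumes the curves are observed without noise on a common, equally spaced
grid, it replaces integrals over $I$ by Riemann sums, and the function name
`fpca_slope` and its arguments are hypothetical.

```python
import numpy as np

def fpca_slope(X, Y, grid, m):
    """Principal-components estimate of the slope b on an equally spaced grid.

    X    : (n, k) array, row i holds X_i evaluated on `grid`
    Y    : (n,) array of responses
    grid : (k,) equally spaced points in I (assumption of this sketch)
    m    : frequency cut-off, 1 <= m <= n
    """
    n = X.shape[0]
    w = grid[1] - grid[0]                    # Riemann quadrature weight
    Xc = X - X.mean(axis=0)                  # X_i - X_bar
    K_hat = Xc.T @ Xc / n                    # empirical covariance K_hat(u, v)
    # K_hat * w approximates the integral operator with kernel K_hat.
    evals, evecs = np.linalg.eigh(K_hat * w)
    order = np.argsort(evals)[::-1]          # decreasing eigenvalues
    theta_hat = evals[order][:m]
    # Rescale so that each phi_hat_j has unit L2 norm: sum(phi^2) * w = 1.
    phi_hat = evecs[:, order][:, :m] / np.sqrt(w)
    g_hat = Xc.T @ (Y - Y.mean()) / n        # g_hat(t)
    g_coef = phi_hat.T @ g_hat * w           # g_hat_j = int g_hat * phi_hat_j
    b_coef = g_coef / theta_hat              # b_hat_j = g_hat_j / theta_hat_j
    b_hat = phi_hat @ b_coef                 # b_hat = sum_j b_hat_j phi_hat_j
    return b_hat, b_coef, theta_hat, phi_hat
```

Because each $\hat b_j$ divides by $\hat\theta_j$, small eigenvalues amplify
noise, which is why the choice of the cut-off $m$ matters.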
While the problem of estimating $b$ is of intrinsic interest, it is arguably
not of as much practical importance as that of prediction, that is,
estimating
$$p(x) \equiv E(Y \mid X = x) = a + \int_I bx$$
for a particular function $x$. To accomplish this task we require an
estimator of $a$,
$$\hat a = \bar Y - \int_I \hat b \bar X
  = a - \int_I (\hat b - b)\bar X + \bar\varepsilon.$$
Here, $\bar Y$ and $\bar\varepsilon$ are the respective means of the
sequences $Y_i$ and $\varepsilon_i$. Our estimator of $p(x)$, for a given
function $x$, is
$$\hat p(x) = \hat a + \int_I \hat b x.$$
In Section 4 we shall introduce three parameters, $\alpha$, $\beta$ and
$\gamma$, describing the smoothness of $K$, $b$ and $x$, respectively. In
each case, smoothness is measured in the context of generalized Fourier
expansions in the basis $\phi_1, \phi_2, \ldots$, and the larger the value of
the parameter, the smoother the associated function. We shall show in
Theorem 4.1 that if $x$ is sufficiently smooth relative to $K$, specifically
if $\gamma > \frac{1}{2}(\alpha + 1)$, then $\int_I bx$ can be estimated
root-$n$ consistently. For smaller values of $\gamma$, the optimal
convergence rate is slower than $n^{-1/2}$.
3. Numerical implementation and simulation study. There is a variety of
possible approaches to empirical choice of the cut-off, $m$, although not all
are directly suited to estimation of $\int_I bx$. Potential methods include
those based on simple least-squares, on the bootstrap or on cross-validation.
In some instances where $\int_I$ [...] $\hat\theta_j \ge t \equiv Cn^{-c}$,
with $\hat I_j = 0$ otherwise. Since the sequence
$\hat\theta_1, \hat\theta_2, \ldots$ is nonincreasing and $\hat\theta_j = 0$
for $j \ge n + 1$, then $\hat I_1, \hat I_2, \ldots$ is a sequence of
$\hat m$, say, 1's, followed by an infinite sequence of 0's. Therefore the
threshold
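algorithm implicitly gives an empirical rule for choosing the cut-off,
$\hat m$. Our estimator of $\int_I bx$ is $\int_I \hat b x$, where
$\hat b = \sum_{1 \le j \le \hat m} \hat b_j \hat\phi_j$. Note that the
estimator satisfies
$$\int_I \hat b x = \sum_j \hat I_j \hat b_j \hat x_j
  = \sum_{1 \le j \le \hat m} \hat b_j \hat x_j,$$
where $\hat x_j = \int_I x \hat\phi_j$. This form is often easier to use in
numerical calculations.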
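The thresholded form of the estimator can be computed directly from the
eigen-decomposition, as in the following sketch. It makes the same
simplifying assumptions as any grid-based approximation (noise-free curves on
an equally spaced grid, integrals replaced by Riemann sums); the name
`thresholded_integral` and the argument `thresh`, which plays the role of the
threshold $t$, are hypothetical.

```python
import numpy as np

def thresholded_integral(X, Y, x, grid, thresh):
    """Estimate int_I b x by sum_j I_hat_j * b_hat_j * x_hat_j.

    Components are kept while theta_hat_j >= thresh; returns the estimate
    and the implied cut-off m_hat.
    """
    n = X.shape[0]
    w = grid[1] - grid[0]                        # Riemann quadrature weight
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc * (w / n))
    order = np.argsort(evals)[::-1]
    theta_hat = evals[order]                     # nonincreasing sequence
    keep = theta_hat >= thresh                   # indicators I_hat_j
    m_hat = int(keep.sum())                      # the 1's form a leading block
    phi_hat = evecs[:, order][:, :m_hat] / np.sqrt(w)
    g_hat = Xc.T @ (Y - Y.mean()) / n
    b_coef = (phi_hat.T @ g_hat * w) / theta_hat[:m_hat]   # b_hat_j
    x_coef = phi_hat.T @ x * w                              # x_hat_j
    return float(b_coef @ x_coef), m_hat
```

Only the coefficient vectors are needed here, so the function $\hat b$ itself
never has to be reconstructed on the grid.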
To appreciate the size of $\hat m$ chosen by this rule, let us suppose that
$\theta_j = \text{const.}\, j^{-\alpha}$ [...]
$\hat\theta_j = \text{const.}\, j^{-\alpha}\{1 + o_p(1)\}$ uniformly in
$1 \le j \le \hat m + k$, for each integer $k \ge 1$. Therefore,
$\hat m = \text{const.}\, n^{c/\alpha}\{1 + o_p(1)\}$. It follows that the
order of magnitude of $\hat m$ changes a great deal as we vary $c$.
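To see how strongly the implied cut-off depends on $c$, the following sketch
counts the eigenvalues that exceed the threshold $t = Cn^{-c}$ under the
idealised assumption that $\theta_j = \text{const.}\, j^{-\alpha}$ holds
exactly. The function name `implied_cutoff` and the default constants are
hypothetical, chosen only for illustration.

```python
import numpy as np

def implied_cutoff(n, c, alpha, C=1.0, const=1.0, j_max=10**5):
    """Number of indices j with const * j**(-alpha) >= C * n**(-c)."""
    t = C * n ** (-c)                              # threshold t = C n^{-c}
    j = np.arange(1, j_max + 1, dtype=float)
    return int(np.sum(const * j ** (-alpha) >= t))

# With n = 100 and alpha = 2 the cut-off grows roughly like n^{c/alpha}.
for c in (0.25, 0.5, 0.75):
    print(c, implied_cutoff(100, c, 2.0))
```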
It can be proved too that, under the conditions of Theorem 4.1, and assuming
that $\alpha \ge 2$, $\gamma \ge \frac{3}{2}(\alpha + 2)$ and
$\beta + \gamma \ge (\alpha/2c) + 1$,
$$\sum_{j=1}^{\hat m} \hat b_j \hat x_j = \int_I bx + O_p(n^{-1/2}). \tag{3.1}$$
This result demonstrates the root-$n$ consistency of the estimator on the
left-hand side, for a range of different orders of magnitude of $\hat m$. Of
course, (3.1) continues to hold if the number of terms, $\hat m$, is replaced
by a deterministic quantity, say $m \equiv \text{const.}\, n^{c/\alpha}$.
Note too that the conditions $\gamma \ge \frac{3}{2}(\alpha + 2)$ and
$\beta + \gamma \ge (\alpha/2c) + 1$ are both implied by
$\gamma \ge \max(3/2, 1/2c)\,\alpha + 3$, which asserts simply that the
function $x$ is sufficiently smooth relative to $K$.
The case where the functions $X_i$ are observed on a regular grid of $k$
points with additive white noise may be treated similarly. Indeed, it can be
proved that if continuous approximations to the $X_i$'s are generated by
passing a local-linear smoother through noisy, gridded data, and if we take
$c = \frac{1}{2}$, then all the results discussed above remain true provided
$n = O(k)$. That is, $k$ should be of the same order as, or of larger order
than, $n$. Details are given in the Appendix of [6]. Similar results are
obtained using smoothing methods based on splines or orthogonal series.
A simulation study was carried out to investigate the finite-sample
performance of the thresholding procedure given above. The study considered
the model (1.1) in two cases. In the first, the predictor $X_i$ was observed
continuously without error. Specifically, random samples of size $n = 100$
were generated from the model (1.1), where $I = [0, 1]$, the random functions
$X_i$ were distributed as $X = \sum_j Z_j 2^{1/2}\cos(j\pi t)$, the $Z_j$'s
were independent and normal $N(0, 4j^{-2})$,
$b = \sum_j j^{-4} 2^{1/2}\cos(j\pi t)$, and the errors $\varepsilon_i$ were
independent and normal $N(0, 4)$. The future observation of $X$ was taken to
be $x = \sum_j j^{-2} 2^{1/2}\cos(j\pi t)$, in which case the conditional
mean of $y$ given $X = x$ was 1.0141.
Table 1
Comparison of average squared errors
Threshold 0.001 0.01 0.05 0.1 0.15 0.2
X continuous 0.026 0.019 0.015 0.014 0.013 0.015
X discrete with noise 0.035 0.022 0.016 0.017 0.015 0.016
The example in the second case was the same as that for the first, except
that each $X_i$ was observed discretely on an equally-spaced grid of 200
points with additive $N(0, 1)$ random noise. We used an orthogonal-series
smoother to estimate each $X_i$ from the corresponding discrete data. Table 1
gives
values of averaged squared error of the estimator of the conditional mean,
computed by averaging 500 Monte Carlo simulations. It is clear from these
results that the procedure is robust against discretization, random errors
and choice of the threshold.
Earlier in this section we discussed the robustness of $\hat b$ to choice of
smoothing parameter in the prediction problem. This robustness is not shared
in cases where $\hat b$ is of interest in its own right, rather than a tool
for prediction. To make this comparison explicit, and to compare the levels
of smoothing appropriate for prediction and estimation, we extended the
simulation study above. We selected $X$ as before, but took
$b = 10\sum_j j^{-2} 2^{1/2}\cos(j\pi t)$ and
$x = \sum_j j^{-1.6} 2^{1/2}\cos(j\pi t)$. In the case of noisy, discrete
observations we took the noise to be $N(0, 1)$ and the grid to consist of
500 points. Sample size was $n = 100$.
For the thresholds $t = 0.001, 0.01, 0.05, 0.1, 0.15, 0.2$ used to construct
Table 1, mean squared prediction error was relatively constant; respective
values were 0.013, 0.008, 0.007, 0.010, 0.015, 0.022. However, mean integrated
squared error of $\hat b$ was as high as 168 when $t = 0.001$, dropping to
6.67 at $t = 0.01$ and reaching its minimum, 0.639, at $t = 0.1$. Similar
results were achieved in the case of noisy, discrete data; values of mean
squared prediction error there were 0.014, 0.008, 0.009, 0.013, 0.019, 0.028
for the respective values of $t$, and mean integrated squared error of
$\hat b$ was elevated by about 30% across the range, the minimum again
occurring when $t = 0.1$.
These results also indicate the advantages of undersmoothing when making
predictions, as opposed to estimating $b$ in its own right. In particular,
the numerical value of the optimal threshold for prediction is a little less
than that for estimating $b$. Discussion of theoretical aspects of this point
will be given in Section 4.
4. Convergence rates.
4.1. Effect of the intercept, $a$. In terms of convergence rates, the
problems of estimating $a + \int_I bx$ and $\int_I bx$ are not intrinsically
different. To appreciate this point, define $\mu = E(X)$, let the functionals
$p$ and $\hat p$ be as in Section 2, and put $q(x) = \int_I b(x - \mu)$ and
$\hat q(x) = \int_I \hat b(x - \mu)$. Since
$\hat a = a - \int_I(\hat b - b)\bar X + \bar\varepsilon$, we have
$$E\bigl|\hat p(x) - p(x) - \{\hat q(x) - q(x)\}\bigr|
  = E\Bigl|-\int_I (\hat b - b)(\bar X - \mu) + \bar\varepsilon\Bigr|
  \le (E\|\hat b - b\|^2)^{1/2}(E\|\bar X - \mu\|^2)^{1/2}
      + (E\bar\varepsilon^2)^{1/2}. \tag{4.1}$$
Provided only that $E\|\hat b - b\|^2$ is bounded, the right-hand side of
(4.1) equals $O(n^{-1/2})$. Hence, (4.1) shows that, up to terms that
converge to zero at the parametric rate $n^{-1/2}$, the rates of convergence
of $\hat p(x)$ to $p(x)$ and of $\hat q(x)$ to $q(x)$ are identical. This
result, and the fact that $q(x)$ is identical to $\int bx$ provided $x$ is
replaced by $x - \mu$, imply that when addressing convergence rates in the
prediction problem it is sufficient to treat estimation of $\int_I bx$.
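4.2. Estimation of $\int bx$. Recall that our estimator of $\int bx$ is
$\int \hat b x$. Suppose the eigenvalues $\theta_j$ in the spectral
decomposition (2.1) satisfy
$$C^{-1} j^{-\alpha} \le \theta_j \le C j^{-\alpha}, \qquad
  \theta_j - \theta_{j+1} \ge C^{-1} j^{-\alpha-1} \quad \text{for } j \ge 1. \tag{4.2}$$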
For example, if $\theta_j = Dj^{-\alpha}$ for a constant $D > 0$, then
$\theta_j - \theta_{j+1}$ is bounded below by a constant multiple of
$j^{-\alpha-1}$, and so (4.2) holds. The second part of (4.2) asks that the
spacings among eigenvalues not be too small. Methods based on a frequency
cut-off $m$ can have difficulty when spacings equal zero, or are close to
zero. To appreciate why, note that if
$\theta_{j+1} = \cdots = \theta_{j+k}$ then
$\phi_{j+1}, \ldots, \phi_{j+k}$ are not individually identifiable (although
the set of these $k$ functions is identifiable). In particular, individual
functions cannot be estimated consistently. This can cause problems when
estimating $\int_I bx$ if the frequency cut-off lies strictly between $j$ and
$j + k$.
Let $Z$ have the distribution of a generic $X_i - E(X_i)$. Then we may write
$Z = \sum_{j \ge 1} \xi_j \phi_j$, where $\xi_j = \int Z\phi_j$ is the $j$th
principal component, or Karhunen-Loève coefficient, of $Z$. We assume that
all the moments of $X$ are finite, and more specifically that

  for each $r \ge 2$ and each $j \ge 1$,
  $E|\xi_j|^{2r} \le C(r)\theta_j^{r}$, where $C(r)$ does not depend on $j$;
  and, for any sequence $j_1, \ldots, j_4$,
  $E(\xi_{j_1} \cdots \xi_{j_4}) = 0$ unless each index $j_k$ is repeated.   (4.3)
In particular, (4.3) holds if $X$ is a Gaussian process. Let $\beta > 1$ and
$C_1 > 0$, and let
$$B = B(C_1, \beta) = \Bigl\{ b : b = \sum_{j \ge 1} b_j \phi_j,
  \text{ with } |b_j| \le C_1 j^{-\beta} \text{ for each } j \ge 1 \Bigr\}. \tag{4.4}$$
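We can interpret $B(C_1, \beta)$ as a smoothness class of functions, where
the functions become smoother (measured in the sense of generalized Fourier
expansions in the basis $\phi_1, \phi_2, \ldots$) as $\beta$ increases. We
suppose too that the fixed function $x$ satisfies
$$x = \sum_{j=1}^{\infty} x_j \phi_j \quad \text{with } |x_j| \le C_2 j^{-\gamma} \tag{4.5}$$
[...]
$$\tilde b = \begin{cases}
  \hat b, & \text{if } \|\hat b\| \le C_4 n^{C_5}, \\
  C_3, & \text{otherwise},
\end{cases} \tag{4.7}$$
where, for a function $\psi$ on $I$, $\|\psi\|^2 = \int_I \psi^2$. This
truncation of $\hat b$ serves to ensure that all moments of $\tilde b$ are
finite.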
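Theorem 4.1. Assume the eigenvalues $\theta_j$ satisfy (4.2), that (4.3)
holds and that all moments of the distribution of the errors $\varepsilon_i$
are finite. Let $\alpha$, $\beta$ and $\gamma$ be as in (4.2), (4.4) and
(4.5), respectively. Suppose that $\alpha > 1$, $\beta \ge \alpha + 2$ and
$\gamma > \frac{1}{2}$, and that the ratio of $m$ to $m_0$ is bounded away
from zero and infinity as $n \to \infty$. Then, for each given
$C, C_1, \ldots, C_5 > 0$, as $n \to \infty$, the estimator $\tilde b$ given
in (4.7) satisfies
$$\sup_{b \in B(C_1, \beta)}
  E\Bigl(\int_I \tilde b x - \int_I bx\Bigr)^2 = O(\tau), \tag{4.8}$$
where $\tau = \tau(n)$ is given by
$$\tau = \begin{cases}
  n^{-1}, & \text{if } \alpha + 1 < 2\gamma, \\
  n^{-1}\log n, & \text{if } \alpha + 1 = 2\gamma, \\
  n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}, & \text{if } \alpha + 1 > 2\gamma.
\end{cases} \tag{4.9}$$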
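The three regimes in (4.9) can be evaluated with a small helper such as the
one below. It is only a sketch of the formula, with constants ignored; the
function name `prediction_rate` is hypothetical. Note that at the boundary
$\alpha + 1 = 2\gamma$ the polynomial exponent
$2(\beta+\gamma-1)/(\alpha+2\beta-1)$ equals 1, so the three cases join up
apart from the logarithmic factor.

```python
import math

def prediction_rate(n, alpha, beta, gamma):
    """Order of the mean-square rate tau(n) in (4.9), ignoring constants."""
    if alpha + 1 < 2 * gamma:          # x smooth relative to K: parametric
        return 1.0 / n
    if alpha + 1 == 2 * gamma:         # boundary case (exact equality only)
        return math.log(n) / n
    # x rougher than the boundary: polynomial rate slower than 1/n
    return n ** (-2.0 * (beta + gamma - 1.0) / (alpha + 2.0 * beta - 1.0))

print(prediction_rate(100, 2, 4, 2))   # parametric regime
print(prediction_rate(100, 3, 5, 1))   # slower polynomial regime
```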
The smoothing-parameter choices suggested by (4.6) are different from those
that would be used if our aim were to estimate $b$ rather than $\int_I bx$.
In particular, to optimize the $L_2$ convergence rate of $\hat b$ to $b$ we
would take $m$ to be of size $n^{1/(\alpha+2\beta)}$ in each of the three
settings addressed by (4.6). See, for example, [20]. In the critical cases
where $\alpha + 1 \ge 2\gamma$, this provides an order of magnitude more
smoothing than is suggested by (4.6). The intuition behind this result is
that the integration step, in the definition $\int_I \hat b x$, provides
additional smoothing no matter what level is used when constructing
$\hat b$, and so less smoothing is needed for $\hat b$.
The case $\alpha + 1 < 2\gamma$ is more difficult to discuss in these terms,
since a variety of different orders of magnitude of $m$ can lead to the same
optimal mean-square convergence rate of $n^{-1}$. Further discussion of this
issue is given in Section 3.
Of course, there are other related problems where similar phenomena are
observed. Consider, for example, the problem of estimating a distribution
function by integrating a kernel density estimator. In order to achieve the
same parametric convergence rate as the empirical distribution function, we
should, when constructing the density estimator, use a substantially smaller
bandwidth than would be appropriate if we wanted a good estimator of the
density itself. The operation of integrating the density estimator provides
additional smoothing, over and above that accorded by the bandwidth, and
so if the net result is not to be an oversmoothed distribution-function esti-
mator then we should smooth less at the density estimation step. The same
is true in the problem of prediction in functional regression; the operation
of integrating $\hat b x$ provides additional smoothing, and so to get the
right amount of smoothing in the end we should undersmooth when computing the
slope-function estimator. A curious feature of the regression prediction
problem is that, unlike the distribution estimation one, it is not always
parametric, and in some cases the optimal convergence rate lies strictly
between that for the nonparametric problem of slope estimation and the
parametric $n^{-1/2}$ rate.
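The distribution-function analogy can be made concrete. Integrating a
Gaussian kernel density estimator with bandwidth $h$ gives the smoothed
distribution function $n^{-1}\sum_i \Phi\{(x - X_i)/h\}$, which approaches the
empirical distribution function as $h$ shrinks. The sketch below assumes a
Gaussian kernel, which the text does not specify, and the name
`smoothed_cdf` is hypothetical.

```python
import math
import numpy as np

# Standard normal distribution function, applied elementwise.
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def smoothed_cdf(sample, x, h):
    """Integral of a Gaussian kernel density estimate, evaluated at points x."""
    pts = np.asarray(x, dtype=float)[:, None]
    obs = np.asarray(sample, dtype=float)[None, :]
    return _norm_cdf((pts - obs) / h).mean(axis=1)

sample = [0.1, 0.4, 0.7]
# A very small bandwidth reproduces the empirical distribution function
# away from the data points; a larger one smooths the jumps.
print(smoothed_cdf(sample, [0.0, 0.5, 1.0], 1e-4))
print(smoothed_cdf(sample, [0.0, 0.5, 1.0], 0.2))
```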
4.3. Lower bounds. We adopt notation from Sections 4.1 and 4.2, and in
particular take $x = \sum_{j \ge 1} x_j \phi_j$ to be a function and define
$B$ as at (4.4). Recall that the functions $\phi_j$ form an orthonormal basis
for square-integrable functions on $I$. Assume that, for a constant
$C_6 > 1$,
$$C_6^{-1} \le j^{\alpha}\theta_j \le C_6 \quad \text{and} \quad
  C_6^{-1} \le j^{\gamma}|x_j| \le C_6 \quad \text{for all } j \ge 1.$$
Let $\hat T$ denote any estimator of $T(b) = \int_I bx$, and define
$\tau = \tau(n)$ as at (4.9). Our main result in this section provides a lower
bound to the convergence rate of $\hat T$ to $T(b)$, complementing the upper
bound given by Theorem 4.1 in the case $\hat T = \int_I \tilde b x$, where
$\tilde b$ is given by (4.7). We make relatively specific assumptions about
the nature of the model, for example that $X$ is a Gaussian process and the
intercept, $a$, vanishes, bearing in mind that in the case of a lower bound,
the strength of the result is increased, from some viewpoints, through
imposing relatively narrow conditions.
Theorem 4.2. Let $\alpha$, $\beta$ and $\gamma$ be as in (4.2), (4.4) and
(4.5), respectively, and assume $\alpha, \beta > 1$ and
$\gamma > \frac{1}{2}$. Suppose too that the process $X$ is Gaussian and that
the errors $\varepsilon_i$ in the model (1.1) are Normal with zero mean and
strictly positive variance; and take $a = 0$. Then there exists a constant
$C_7 > 0$ such that, for any estimator $\hat T$ and for all sufficiently
large $n$,
$$\sup_{b \in B(C_1, \beta)} E\{\hat T - T(b)\}^2 \ge C_7 \tau,$$
where $\tau = \tau(n)$ is given as in (4.9).
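A comparison of the lower bound given above with the upper bound given in
Theorem 4.1 yields the result that the minimax risk of estimating $\int bx$
satisfies
$$\inf_{\hat T} \sup_{b \in B(C_1, \beta)}
  E\Bigl(\hat T - \int bx\Bigr)^2 \asymp \begin{cases}
  n^{-1}, & \text{if } \alpha + 1 < 2\gamma, \\
  n^{-1}\log n, & \text{if } \alpha + 1 = 2\gamma, \\
  n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}, & \text{if } \alpha + 1 > 2\gamma,
\end{cases}$$
where, for positive sequences $a_n$ and $b_n$, $a_n \asymp b_n$ means that
$a_n/b_n$ is bounded away from zero and infinity as $n \to \infty$.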
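5. Proof of Theorem 4.1.
5.1. Preliminaries. Define $\hat\Delta = \hat K - K$,
$|||\hat\Delta|||^2 = \int_{I^2} \hat\Delta^2$ and
$\delta_j = \min_{k \le j}(\theta_k - \theta_{k+1})$. It may be shown from
results of Bhatia, Davis and McIntosh [2] that
$$\sup_{j \ge 1} |\hat\theta_j - \theta_j| \le |||\hat\Delta|||, \qquad
  \sup_{j \ge 1} \delta_j \|\hat\phi_j - \phi_j\| \le 8^{1/2} |||\hat\Delta|||. \tag{5.1}$$
For simplicity in our proof we shall take $m = m_0$, as defined in (4.6). Note
that in this setting $m \le n^{1/(\alpha+2\beta-1)}$ in each of the three
cases in (4.6).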
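The first bound in (5.1) has a familiar finite-dimensional counterpart: for
symmetric matrices, ordered eigenvalues differ by at most the Frobenius norm
of the perturbation. The sketch below checks this numerically for a
discretised kernel. It is an illustration of the matrix analogue only, not of
the operator result cited from [2], and the name `eigen_gap_check` is
hypothetical.

```python
import numpy as np

def eigen_gap_check(K, K_hat):
    """Return (sup_j |theta_hat_j - theta_j|, Frobenius norm of K_hat - K)."""
    ev = np.sort(np.linalg.eigvalsh(K))[::-1]          # decreasing order
    ev_hat = np.sort(np.linalg.eigvalsh(K_hat))[::-1]
    return float(np.max(np.abs(ev_hat - ev))), float(np.linalg.norm(K_hat - K, "fro"))

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
K = A @ A.T                                            # symmetric "kernel"
E = 0.1 * rng.normal(size=(6, 6))
K_hat = K + (E + E.T) / 2                              # symmetric perturbation
gap, hs = eigen_gap_check(K, K_hat)
print(gap <= hs)                                       # eigenvalue gap is dominated
```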
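Expand $x$ with respect to both the orthonormal series
$\phi_1, \phi_2, \ldots$ and $\hat\phi_1, \hat\phi_2, \ldots$, obtaining
$x = \sum_{j \ge 1} x_j \phi_j = \sum_{j \ge 1} \hat x_j \hat\phi_j$, where
$x_j = \int_I x\phi_j$ and $\hat x_j = \int_I x\hat\phi_j$. Put
$\check g_j = \int_I \hat g \phi_j$. In this notation
$$\int_I (\hat b - b)x = \sum_{j=1}^{m} (\hat b_j \hat x_j - b_j x_j)
  - \sum_{j=m+1}^{\infty} b_j x_j,$$
whence it follows that
$$\Bigl|\int_I (\hat b - b)x\Bigr|
  \le \Bigl|\sum_{j=1}^{m} (\hat b_j - b_j)x_j\Bigr|
  + \Bigl|\sum_{j=m+1}^{\infty} b_j x_j\Bigr|
  + \Bigl|\sum_{j=1}^{m} b_j(\hat x_j - x_j)\Bigr|
  + \sum_{j=1}^{m} |\hat b_j - b_j|\,|\hat x_j - x_j|. \tag{5.2}$$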
It is straightforward to show that
$|\sum_{j \ge m+1} b_j x_j| = O(m^{-(\beta+\gamma-1)})$. This quantity equals
$O\{(n^{-1}\log n)^{1/2}\}$ if $\alpha + 1 = 2\gamma$, equals
$O(n^{-(\beta+\gamma-1)/(\alpha+2\beta-1)})$ if $\alpha + 1 > 2\gamma$ and
equals $o(n^{-1/2})$ otherwise. We shall complete the derivation of
Theorem 4.1 by obtaining bounds for second moments of the other three terms
on the right-hand side of (5.2). Our analysis will show that the first and
second terms determine the convergence rate, and that the third and fourth
terms are asymptotically negligible. In the arguments leading to the bounds
we shall use the notation "const." to denote a constant, the value of which
does not depend on $b \in B$. In particular, the bounds we shall give are
valid uniformly in $b$, although we shall not mention that property
explicitly.
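5.2. Bound for $|\sum_{j \le m} (\hat b_j - b_j)x_j|$. Note that
$$\hat b_j - b_j = (\hat\theta_j^{-1} - \theta_j^{-1})(\hat g_j - g_j)
  + \theta_j^{-1}(\hat g_j - g_j)
  + (\hat\theta_j^{-1} - \theta_j^{-1})g_j, \tag{5.3}$$
$$\hat g_j - g_j = \check g_j - g_j
  + \int_I (\hat g - g)(\hat\phi_j - \phi_j)
  + \int_I g(\hat\phi_j - \phi_j). \tag{5.4}$$
Therefore, defining $\hat\Delta_g = \|\hat g - g\|$, we have
$$\Bigl|\hat g_j - g_j - \int_I g(\hat\phi_j - \phi_j)\Bigr|
  \le 3\hat\Delta_g. \tag{5.5}$$
If the event
$$\mathcal{E} = \{|\hat\theta_j - \theta_j| \le \tfrac{1}{2}\theta_j
  \text{ for all } 1 \le j \le m\} \tag{5.6}$$
holds, then
$|\hat\theta_j^{-1} - \theta_j^{-1}| \le 2|\hat\theta_j - \theta_j|/\theta_j^2
\le \theta_j^{-1}$. It can be proved, using this result, (5.1), (5.4) and
(5.5), that if $\mathcal{E}$ holds,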
1
6
j=1
(
b
j
b
j
)x
j
j=1
( g
j
g
j
)x
j
1
j
j=1
x
j
1
j
_
I
g(
j
j
)
+||||||
m
j=1
_
I
g(
j
j
)
|x
j
|
2
j
(5.7)
+8
1/2
||||||
m
j=1
(
g
1
j
+|g
j
|
1
j
)|x
j
|
1
j
.
For each real number $r$, define
$$t_r(m) = \begin{cases}
  m^{r+1}, & \text{if } r > -1, \\
  \log m, & \text{if } r = -1, \\
  1, & \text{if } r < -1.
\end{cases}$$
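The function $t_r(m)$ records the order of magnitude of the partial sum
$\sum_{j \le m} j^{r}$, which is how it enters the moment bounds. A small
sketch, with the hypothetical name `t_r`, makes the three cases and that
interpretation explicit.

```python
import math

def t_r(m, r):
    """Order of sum_{j <= m} j**r: m^(r+1), log m, or a constant."""
    if r > -1:
        return m ** (r + 1)
    if r == -1:
        return math.log(m)
    return 1.0

# The ratio of the partial sum to t_r(m) stays bounded as m grows.
for m in (10, 100, 1000):
    partial = sum(j ** 0.5 for j in range(1, m + 1))
    print(m, partial / t_r(m, 0.5))
```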
Standard moment calculations, noting that S
1
(g)
jm
( g
j
g
j
)x
j
1
j
may be expressed as a sum of n independent and identically distributed ran-
dom variables with zero mean, show that E{S
1
(g)
2
} const.n
1
t
2
(m),
uniformly in g. Moreover, denoting by S
2
(g) the last term on the right-hand
side of (5.7), we deduce that
E{S
2
(g)
2
} E
_
||||||
m
j=1
(
g
1
j
+|g
j
|
1
j
)|x
j
|
1
j
_
2
(5.8)
const.{n
2
t
2+1
(m)
2
+n
1
t
(m)
2
}.
If then t
(m) t
2
(m), and if < then, since >
1
2
( +1),
< 1, implying that t
(m) const.t
2
(m). Moreover,
t
2+1
(m) const.t
2
(m)m
+1
, and by assumption, n m
+1
. There-
fore, n
1
t
2+1
(m) const.t
2
(m). Hence, (5.8) implies that E{S
2
(g)
2
}
const.n
1
t
2
(m). Combining this bound with that for E{S
1
(g)
2
}, and
with (5.7), and writing I(F) for the indicator function of any subset F E,
we deduce that
E
_
I(F)
_
m
j=1
(
b
j
b
j
)x
j
_
2
_
const.
_
E
_
I(F)
_
m
j=1
x
j
1
j
_
I
g(
j
j
)
_
2
_
(5.9)
+E
_
I(F)||||||
2
_
m
j=1
_
I
g(
j
j
)
|x
j
|
2
j
_
2
_
+n
1
t
2
(m)
_
.
Note too that if E holds,
m
j=1
(
b
j
b
j
)
2
const.
m
j=1
2
j
_
( g
j
g
j
)
2
+
_
I
g(
j
j
)
2
_
(5.10)
+const.||||||
2
{
g
2
t
4+2
(m) +t
22
(m)},
and also that
j=1
b
j
( x
j
x
j
)
j=1
b
j
_
x(
j
j
)
, (5.11)
m
j=1
( x
j
x
j
)
2
=
m
j=1
__
I
x(
j
j
)
_
2
. (5.12)
Let p = g or x, and dene = + and = in the respective cases.
Let q
1
, q
2
, . . . denote constants satisfying |q
j
| const.j
j
j
|
1
2
C
1
j
1
for all 1 j m}.
Comparing (5.6) and (5.13), and noting (4.2), we see that F E. We shall
show in Section 5.5 that, uniformly in 1 j const.n
1/(+1)
,
E
_
I(F)
_
I
p(
j
j
)
_
2
const.n
1
j
(1 +j
2+22
), (5.14)
and also,
E
_
I(F)
_
m
j=1
q
j
_
I
p(
j
j
)
_
2
_
const.n
1
t
2
(m). (5.15)
Next we use (5.15) to bound the rst term on the right-hand side of (5.9):
E
_
I(F)
_
m
j=1
x
j
1
j
_
I
g(
j
j
)
_
2
_
const.n
1
t
2
(m). (5.16)
To bound the second term, it can be proved from (5.14) that
E
_
I(F)
_
m
j=1
_
I
g(
j
j
)
|x
j
|
2
j
_
2
_
(5.17)
const.n
2{(3/2)}/(+21)
.
Going back to the denition of F at (5.13), and taking <{(3/2)}/(+
2 1), we deduce from (5.17) that
E
_
I(F)||||||
2
_
m
j=1
_
I
g(
j
j
)
|x
j
|
2
j
_
2
_
const.n
1
. (5.18)
Results (5.9), (5.16) and (5.18) imply that
E
_
I(F)
_
m
j=1
(
b
j
b
j
)x
j
_
2
_
const.n
1
t
2
(m). (5.19)
5.3. Bounds for |
jm
b
j
( x
j
x
j
)| and
jm
|
b
j
b
j
|| x
j
x
j
|. Noting
that =( +) when p =x, we may also use (5.15) and (5.14) to bound
the expected values of the squares of the right-hand sides of (5.11) and
(5.12), respectively, multiplied by I(F):
E
_
I(F)
_
m
j=1
b
j
_
x(
j
j
)
_
2
_
const.n
1
, (5.20)
E
_
I(F)
m
j=1
__
I
x(
j
j
)
_
2
_
const.n
1
t
+32
(m). (5.21)
Noting that + 2 and E( g
j
g
j
)
2
const.n
1
j
, we can show from
(5.10) and (5.14) that
E
_
I(F)
m
j=1
(
b
j
b
j
)
2
_
const.n
1
m
+1
. (5.22)
From (5.21) and (5.22) it follows that
E
_
I(F)
_
m
j=1
|
b
j
b
j
|| x
j
x
j
|
_
2
_
E
_
I(F)
m
j=1
(
b
j
b
j
)
2
_
E
_
I(F)
m
j=1
( x
j
x
j
)
2
_
(5.23)
const.n
1
m
+1
n
1
t
+32
(m) const.n
1
.
5.4. Completion of the proof of Theorem 4.1. Combining (5.2), (5.19),
(5.20) and (5.23) we deduce that
E
_
I(F)
__
I
(
b b)x
_
2
_
const.n
1
t
2
(m). (5.24)
The proof of Theorem 4.1 will be complete if we show that the factor I(F)
can be removed from the left-hand side. Since, in view of (4.7), our estimator
b satises
b C
4
n
C
5
, then it suces to prove that, for all D >0, P(F) =
1 O(n
D
). Now the rst part of (5.1) and (5.13) imply that if we dene
G ={|||||| min(n
(1/2)
, cC
1
m
1
)},
then G F. Since mn
1/(+21)
and 2(+1) <+2 1, then for some
>0, m
1
n
(1/2)
. Therefore, if >0 is suciently small, there exists
n
0
1 such that, if we dene H = {|||||| n
(1/2)
}, then for all n n
0
,
HG. Since we assumed all moments of the principal components
j
and
the errors
i
to be nite, then Markovs inequality is readily used to show
that P(H) = 1 O(n
D
) for all D >0. It follows that P(F) = 1 O(n
D
),
and so (5.24) implies (4.8).
5.5. Proof of (5.14) and (5.15). Dene
j
by
j
(t) =
j
(t) +
k : k=j
(
j
k
)
1
k
(t)
_
j
k
+
j
(t). (5.25)
It may be proved that
j
j
=
k : k=j
(
j
k
)
1
k
_
k
+
j
_
I
(
j
j
)
j
,
from which it follows that
j
=
k : k=j
{(
j
k
)
1
(
j
k
)
1
}
k
_
I
k
+
k : k=j
(
j
k
)
1
k
_
I
(
j
j
)
k
+
j
_
I
(
j
j
)
j
.
If F holds then so too does the event E and, in view of (4.2), |
j
k
|
2|
j
k
| for all 1 j m and all k = j. Therefore, writing p =
j1
p
j
j
and using (5.1), we deduce that
_
I
p
2|
j
j
|
_
k : k=j
(
j
k
)
4
p
2
k
_
1/2
p
j
_
I
(
j
j
)
j
(5.26)
+
_
k : k=j
(
j
k
)
2
p
2
k
_
1/2 _
_
_
_
_
(
j
j
)
_
_
_
_
.
Since |p
j
| const.j
k : k=j
(
j
k
)
d
p
2
k
const.{t
d2
(j) +j
d+d2
}
const.(1 +j
d+d2
).
Moreover,
j
j
+(
j
), E
j
2
const.n
1
j
, and if F
holds, (
j
) const.||||||
2
1
j
. We shall show in Section 5.6 that
if , in the denition of F at (5.13), is chosen suciently
small, then whenever F holds, |
_
I
(
j
j
)
j
| C
0
a
j
for
1 j m, where C
0
>0 is a constant depending on neither
j nor n, and a
j
is a nonnegative random variable satisfying
E( a
2
j
) n
2
j
4
.
(5.27)
Combining (5.26) and the results in this paragraph, we deduce that
E
_
I(F)
__
I
p
j
_
2
_
const.{n
2
j
(1 +n
1
j
3+2
)(1 +j
4+42
) (5.28)
+n
2
j
+1
(1 +j
2+22
) +n
2
j
42
}.
Note too that
E
_
k : k=j
(
j
k
)
1
p
k
_
j
k
_
2
_
k : k=j
(
j
k
)
2
p
2
k
_
E
_
_
_
_
_
j
_
_
_
_
2
(5.29)
const.n
1
j
(1 +j
2+22
).
When p = g we may substitute = + into (5.28). Then we can de-
duce from (5.28) that, assuming + 2 as well as the bound j m
n
1/(+21)
, the right-hand side of (5.28) is bounded above by a constant
multiple of n
1
j
j=1
q
j
_
I
p
j
_
2
mconst.{n
2
t
+2+1
(m) +n
2
t
3+2+42
(m) (5.30)
+n
3
t
2+2+2
(m) +n
3
t
6+2+62
}.
Now, =(+) if p =g, and it equals (++) if p =x. Therefore,
if p =g then 3 + 2 + 4 2 = 3 + 4 2( +) <( +2 1) 1, and
6 + 2 + 6 2 = 2{3 + 3 ( +)} < 2( + 2 1) 1. [We subtract
the extra 1 to account for the factor m on the right-hand side of (5.30).]
These two results, and the fact that m
+21
n, imply that the terms in
mn
2
t
3+2+42
(m) and mn
3
t
6+2+62
in (5.30) may be replaced by
n
1
without aecting the validity of the bound when p = g. Furthermore,
when p =g, +2 +1 = 3 2 +1 <( +2 1) 1 and 2 +2 +2 =
4 2 +2 <2( +2 1) 1, and so the terms in mn
2
t
+2+1
(m) and
mn
3
t
2+2+2
may also be replaced by n
1
. Therefore the right-hand side of (5.30) may be replaced by $n^{-1}$ when $p=g$. An identical argument shows this also to be the case when $p=x$. Hence, in either setting,
E
_
I(F)
m
j=1
q
j
_
I
p
j
_
2
const.n
1
. (5.31)
Using (4.3) it can be proved that
nE
_
m
j=1
k : k=j
(
j
k
)
1
q
j
p
k
_
j
k
_
2
const.t
2
(m). (5.32)
Combining (5.25), (5.31) and (5.32) we obtain (5.15).
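5.6. Proof of (5.27). It may be proved from (5.25) that $\|\hat\phi_j-\phi_j\|^2=\hat u_j^2+\hat v_j^2$, where
\[
\hat u_j^2=\sum_{k:k\ne j}(\hat\theta_j-\theta_k)^{-2}\hat w_{jk}^2,\qquad
\hat v_j^2=\Big\{\int_{\mathcal I}(\hat\phi_j-\phi_j)\phi_j\Big\}^2
\]
and $\hat w_{jk}=\int\Delta\,\hat\phi_j\phi_k$. Since both $\hat\phi_j$ and $\phi_j$ are of unit length then $\hat v_j^2=2\{1-(1-\hat u_j^2)^{1/2}\}-\hat u_j^2\le\hat u_j^4$, which implies that for all $j\ge 1$,
\[
\|\hat\phi_j-\phi_j\|^2\le 2\hat u_j^2,\qquad \hat v_j^2\le\hat u_j^4. \tag{5.33}
\]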
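The two inequalities in (5.33) are elementary consequences of the identity for $\hat v_j^2$ and can be checked numerically. The following is a minimal sketch only; the function names `v_squared` and `check_533` are illustrative and not part of the paper, and the grid check is a sanity test rather than a proof.

```python
import math


def v_squared(u2: float) -> float:
    """Return v^2 = 2{1 - (1 - u^2)^(1/2)} - u^2 for 0 <= u^2 <= 1."""
    return 2.0 * (1.0 - math.sqrt(1.0 - u2)) - u2


def check_533(num_points: int = 1001, tol: float = 1e-12) -> bool:
    """Check v^2 <= u^4 and u^2 + v^2 <= 2 u^2 on a grid of u^2 in [0, 1]."""
    for i in range(num_points):
        u2 = i / (num_points - 1)
        v2 = v_squared(u2)
        # Second part of (5.33): v^2 <= u^4.
        if v2 > u2 ** 2 + tol:
            return False
        # First part of (5.33): ||phi_hat - phi||^2 = u^2 + v^2 <= 2 u^2.
        if u2 + v2 > 2.0 * u2 + tol:
            return False
    return True


print(check_533())
```

Both bounds hold with equality at $\hat u_j^2=1$, which is why a small tolerance is used in the comparison.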
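If the event F obtains then $|\hat\theta_j-\theta_k|^{-1}\le 2|\theta_j-\theta_k|^{-1}$ for all $j,k$ such that $j\ne k$ and $1\le j\le m$. For the same range of values of $j$ and $k$, $|\theta_j-\theta_k|^{-1}\le D\theta_m^{-1}m$. Here $D=C^2$, where $C$ is as in (4.2). Defining $\hat x_{jk}=\int\Delta\,\phi_j\phi_k$ and $\hat y_{jk}=\int\Delta\,(\hat\phi_j-\phi_j)\phi_k$, we have $\hat w_{jk}^2\le 2(\hat x_{jk}^2+\hat y_{jk}^2)$, and hence, assuming F holds, we have for $1\le j\le m$,
\[
\begin{aligned}
\hat u_j^2
&\le 8\sum_{k:k\ne j}(\theta_j-\theta_k)^{-2}(\hat x_{jk}^2+\hat y_{jk}^2)
\le 8\hat A_j+8D^2\theta_m^{-2}m^2\hat c_j\\
&\le 8\hat A_j+8D^2\theta_m^{-2}m^2\,|\!|\!|\Delta|\!|\!|^2\,\|\hat\phi_j-\phi_j\|^2,
\end{aligned}
\tag{5.34}
\]
where
\[
\hat A_j=\sum_{k:k\ne j}(\theta_j-\theta_k)^{-2}\hat x_{jk}^2
\quad\text{and}\quad
\hat c_j=\sum_{k:k\ne j}\hat y_{jk}^2\le|\!|\!|\Delta|\!|\!|^2\,\|\hat\phi_j-\phi_j\|^2.
\]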
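Condition (4.3) implies that $nE(\hat x_{jk}^2)\le\mathrm{const.}\,\theta_j\theta_k$, where the constant does not depend on $j$, $k$ or $n$. Moreover,
\[
\sum_{k:k\ne j}(\theta_j-\theta_k)^{-2}\theta_k
\le\mathrm{const.}\sum_{k:k\ne j}\Big\{\frac{\max(j,k)}{\max(\theta_j,\theta_k)\,|j-k|}\Big\}^2\theta_k
\le\mathrm{const.}\,j^2\theta_j^{-1}.
\]
Therefore, $E(\hat A_j)\le\mathrm{const.}\,n^{-1}j^2$ for $1\le j\le m$, and similar calculations show that
\[
E(\hat A_j^2)\le D_1^2n^{-2}j^4, \tag{5.35}
\]
where $D_1>0$ depends on neither $j$ nor $n$.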
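The order of the sum $\sum_{k\ne j}(\theta_j-\theta_k)^{-2}\theta_k$ can be illustrated numerically. The sketch below assumes the polynomial model $\theta_j=j^{-\alpha}$ with $\alpha=2$ (an illustrative choice), truncates the infinite sum at a hypothetical cut-off `k_max`, and rescales by $\theta_j j^{-2}$; if the sum is of order $j^2\theta_j^{-1}$, the rescaled values should stay bounded in $j$. The helper names are not from the paper.

```python
def theta(j: int, alpha: float) -> float:
    """Eigenvalue model theta_j = j^(-alpha); an assumption for this sketch."""
    return float(j) ** (-alpha)


def scaled_sum(j: int, alpha: float = 2.0, k_max: int = 5000) -> float:
    """Return theta_j * j^(-2) * sum_{k != j, k <= k_max} theta_k / (theta_j - theta_k)^2."""
    tj = theta(j, alpha)
    total = 0.0
    for k in range(1, k_max + 1):
        if k == j:
            continue  # the sum excludes the diagonal term k = j
        tk = theta(k, alpha)
        total += tk / (tj - tk) ** 2
    return tj * total / j ** 2


# Rescaled sums for a few indices j; boundedness is the point of interest.
ratios = [scaled_sum(j) for j in (1, 2, 3, 5, 10, 20, 50)]
print([round(r, 3) for r in ratios])
```

This only illustrates the order of magnitude for one choice of $\alpha$; it does not replace the spacing argument based on (4.2).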
Combining (5.34) with the first part of (5.33) we deduce that if F holds,
\[
\|\hat\phi_j-\phi_j\|^2\le 16\hat A_j+16D^2\theta_m^{-2}m^2\,|\!|\!|\Delta|\!|\!|^2\,\|\hat\phi_j-\phi_j\|^2 \tag{5.36}
\]
for $1\le j\le m$. However, if $c>0$ is given, and if the positive constant in the definition of F at (5.13) is chosen sufficiently small, then for all sufficiently large $m$, F implies $|\!|\!|\Delta|\!|\!|\le cm^{-1}\theta_m$. Hence, by (5.36), if F holds, then for $1\le j\le m$,
\[
(1-16D^2c^2)\,\|\hat\phi_j-\phi_j\|^2\le 16\hat A_j.
\]
Choosing $c$ so small that $16D^2c^2\le\frac12$, we deduce that if F holds, then for $1\le j\le m$, $\|\hat\phi_j-\phi_j\|^2\le 32\hat A_j$. Combining this result with (5.34), and noting the choice of $c$, we deduce that if F holds, then for $1\le j\le m$, $\hat u_j^2\le 16\hat A_j$.
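The last step can be spelled out as follows. This is only a restatement of the argument above, assuming that F holds, that $|\!|\!|\Delta|\!|\!|\le cm^{-1}\theta_m$ and that $c$ satisfies $16D^2c^2\le\frac12$:

```latex
\begin{align*}
\hat u_j^2
 &\le 8\hat A_j + 8D^2\theta_m^{-2}m^2\,|\!|\!|\Delta|\!|\!|^2\,\|\hat\phi_j-\phi_j\|^2
   && \text{by (5.34)}\\
 &\le 8\hat A_j + 8D^2c^2\,\|\hat\phi_j-\phi_j\|^2
   && \text{since } |\!|\!|\Delta|\!|\!|\le c\,m^{-1}\theta_m\\
 &\le 8\hat A_j + \tfrac14\cdot 32\hat A_j = 16\hat A_j
   && \text{since } 8D^2c^2\le\tfrac14 \text{ and } \|\hat\phi_j-\phi_j\|^2\le 32\hat A_j .
\end{align*}
```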
From this property and the second part of (5.33) we conclude that if F holds, then for $1\le j\le m$,
\[
\Big|\int_{\mathcal I}(\hat\phi_j-\phi_j)\phi_j\Big|\le\hat u_j^2\le\|\hat\phi_j-\phi_j\|^2\le 32\hat A_j. \tag{5.37}
\]
Taking $\hat a_j=D_1^{-1}\hat A_j$, where $D_1$ is as at (5.35), and letting $C_0=32D_1$, we see that (5.27) follows from (5.35) and (5.37).
6. Proof of Theorem 4.2. We shall treat only the cases $2\gamma<\alpha+1$ and $2\gamma=\alpha+1$, since the third setting, $2\gamma>\alpha+1$, is relatively straightforward. For notational simplicity we shall assume that $C_1$, in the definition of $\mathcal B(C_1,\beta)$, satisfies $C_1\ge 1$, and take $\theta_j=j^{-\alpha}$ and $x_j=j^{-\gamma}$. More general cases are easily addressed.
Since $X$ is Gaussian then we may write $X_i=\sum_{j\ge 1}\xi_{ij}\phi_j$ for $i\ge 1$, where the variables $\xi_{ij}$ are independent and normal with zero mean and respective variances $\theta_j$ for $j\ge 1$. Define $\nu$ to be the integer part of $n^{1/(\alpha+2\beta-1)}$, and let $B_0\equiv 0$ and $B_1=\sum_{\nu+1\le j\le 2\nu}j^{-\beta}\phi_j$; both are functions in $\mathcal B(C_1,\beta)$. Note that $T(B_0)=0$ and that for large $n$,
\[
T(B_1)\ge\mathrm{const.}\,n^{-(\beta+\gamma-1)/(\alpha+2\beta-1)}, \tag{6.1}
\]
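where, here and below, "const." denotes a finite, strictly positive, generic constant. Write $\rho_i=\sum_{\nu+1\le j\le 2\nu}\xi_{ij}j^{-\beta}$ and consider the quantity $\sigma^{-2}\sum_{i=1}^n\rho_i^2$, where $\sigma^2$ denotes the variance of the error distribution. The variables $\rho_i$ are independent and normally distributed with zero means and variance $V_n$, where $nV_n=n\sum_{\nu+1\le j\le 2\nu}j^{-\alpha-2\beta}\to\mathrm{const.}$ as $n\to\infty$. Indeed,
\[
E_1\{d(P_0,P_1)\}\le\mathrm{const.}, \tag{6.2}
\]
where $E_t$ denotes expectation in the model with $b=B_t$, for $t=0$ or 1. Let $\hat T$ be an estimator of $T(b)$ satisfying
\[
E_0\{\hat T-T(B_0)\}^2\le Dn^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}. \tag{6.3}
\]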
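The orders of magnitude behind (6.1) and the limit of $nV_n$ can be checked numerically. The sketch below assumes $T(b)=\int_{\mathcal I}bx$, so that $T(B_1)=\sum_{\nu+1\le j\le 2\nu}j^{-\beta-\gamma}$ under the simplification $x_j=j^{-\gamma}$, and it takes $n=\nu^{\alpha+2\beta-1}$ exactly to avoid rounding in the integer part. The parameter values $\alpha=2$, $\beta=4$, $\gamma=1$ and the function names are illustrative assumptions.

```python
def block_sum(nu: int, exponent: float) -> float:
    """Sum of j^(-exponent) over nu + 1 <= j <= 2 * nu."""
    return sum(float(j) ** (-exponent) for j in range(nu + 1, 2 * nu + 1))


def scaled_quantities(nu: int, alpha: float, beta: float, gamma: float):
    """Return (T(B_1) * nu^(beta+gamma-1), n * V_n) with n = nu^(alpha+2*beta-1)."""
    n = float(nu) ** (alpha + 2.0 * beta - 1.0)
    # T(B_1) = sum_j j^(-beta) x_j with x_j = j^(-gamma) (assumed form of T).
    t_b1 = block_sum(nu, beta + gamma)
    # V_n = sum_j theta_j j^(-2*beta) with theta_j = j^(-alpha).
    v_n = block_sum(nu, alpha + 2.0 * beta)
    return t_b1 * float(nu) ** (beta + gamma - 1.0), n * v_n


# Both rescaled quantities should settle near positive constants as nu grows.
for nu in (10, 100, 1000):
    print(nu, scaled_quantities(nu, alpha=2.0, beta=4.0, gamma=1.0))
```

Comparing each sum with the integral of $j^{-s}$ over $[\nu,2\nu]$ suggests the limits $(1-2^{1-s})/(s-1)$ with $s=\beta+\gamma$ and $s=\alpha+2\beta$ respectively, which is consistent with $T(B_1)$ being of order $\nu^{-(\beta+\gamma-1)}$ and $nV_n$ converging.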
Put
\[
\lambda=\frac{2[E_0\{\hat T-T(B_0)\}^2\,E_1\{d(P_0,P_1)\}]^{1/2}}{|T(B_1)-T(B_0)|}.
\]
It follows from (6.1), (6.2) and the fact that $T(B_0)=0$, that if $D$ in (6.3) is chosen sufficiently small, $\lambda\le\frac12$. In this case,
\[
E_1\{\hat T-T(B_1)\}^2\ge\{T(B_1)-T(B_0)\}^2(1-\lambda)
\ge\mathrm{const.}\,n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}, \tag{6.4}
\]
where the first inequality follows from the constrained-risk lower bound of Brown and Low [4], and the second uses (6.1) and the property $T(B_0)=0$.
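The arithmetic behind the first inequality in (6.4) can be illustrated with a small sketch. It assumes that the Brown and Low bound takes the form $(\delta-\epsilon\chi)^2$ when $\epsilon\chi\le\delta$, where $\epsilon^2$ bounds the risk under $B_0$, $\chi^2$ stands for $E_1\{d(P_0,P_1)\}$ and $\delta=|T(B_1)-T(B_0)|$; this form is our reading of [4] and should be checked there. The function name and the numerical inputs are hypothetical.

```python
def brown_low_bounds(eps: float, chi: float, delta: float):
    """Return (lam, sharp, relaxed) for the assumed constrained-risk bound.

    lam     = 2 * eps * chi / delta, the ratio controlled in the proof
    sharp   = (delta - eps * chi)^2, the assumed Brown-Low lower bound
    relaxed = delta^2 * (1 - lam), the weaker form appearing in (6.4)
    """
    lam = 2.0 * eps * chi / delta
    sharp = (delta - eps * chi) ** 2 if eps * chi <= delta else 0.0
    relaxed = delta ** 2 * (1.0 - lam)
    return lam, sharp, relaxed


# Hypothetical inputs chosen so that lam <= 1/2.
lam, sharp, relaxed = brown_low_bounds(eps=0.1, chi=2.0, delta=1.0)
print(lam, sharp, relaxed)
```

Since $(1-x)^2\ge 1-2x$, the relaxed form never exceeds the sharp one, and $\lambda\le\frac12$ leaves at least half of $\delta^2$ as a lower bound.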
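Consequently, writing $E_b$ for expectation when the slope function is $b\in\mathcal B(C_1,\beta)$, for any estimator $\hat T$,
\[
\sup_{b\in\mathcal B(C_1,\beta)}E_b\{\hat T-T(b)\}^2
\ge\max_{t=0,1}E_t\{\hat T-T(B_t)\}^2
\ge\mathrm{const.}\,n^{-2(\beta+\gamma-1)/(\alpha+2\beta-1)}.
\]
The case $2\gamma=\alpha+1$ may be treated similarly, by taking $\nu=(n/\log n)^{1/(\alpha+2\beta-1)}$ and replacing $n$ by $n/\log n$ in (6.1), (6.3) and (6.4).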
Acknowledgment. This work was done while Tony Cai was visiting the
Mathematical Sciences Institute of the Australian National University.
REFERENCES
[1] Besse, P. and Ramsay, J. O. (1986). Principal components analysis of sampled
functions. Psychometrika 51 285-311. MR0848110
[2] Bhatia, R., Davis, C. and McIntosh, A. (1983). Perturbation of spectral subspaces
and solution of linear operator equations. Linear Algebra Appl. 52/53 45-67.
MR0709344
[3] Boente, G. and Fraiman, R. (2000). Kernel-based functional principal components.
Statist. Probab. Lett. 48 335-345. MR1771495
[4] Brown, L. D. and Low, M. G. (1996). A constrained risk inequality with
applications to nonparametric functional estimation. Ann. Statist. 24 2524-2535.
MR1425965
[5] Brumback, B. A. and Rice, J. A. (1998). Smoothing spline models for the analysis
of nested and crossed samples of curves (with discussion). J. Amer. Statist.
Assoc. 93 961-994. MR1649194
[6] Cai, T. T. and Hall, P. (2005). Prediction in functional linear regression.
Technical report. Available at
stat.wharton.upenn.edu/tcai/paper/FLR-Tech-Report.pdf.
[7] Cardot, H. (2000). Nonparametric estimation of smoothed principal components
analysis of sampled noisy functions. J. Nonparametr. Statist. 12 503-538.
MR1785396
[8] Cardot, H., Ferraty, F. and Sarda, P. (1999). Functional linear model. Statist.
Probab. Lett. 45 11-22. MR1718346
[9] Cardot, H., Ferraty, F. and Sarda, P. (2000). Étude asymptotique d'un
estimateur spline hybride pour le modèle linéaire fonctionnel. C. R. Acad. Sci. Paris
Sér. I Math. 330 501-504. MR1756966
[10] Cardot, H., Ferraty, F. and Sarda, P. (2003). Spline estimators for the
functional linear model. Statist. Sinica 13 571-591. MR1997162
[11] Cardot, H. and Sarda, P. (2003). Linear regression models for functional data.
Unpublished manuscript.
[12] Cardot, H. and Sarda, P. (2005). Estimation in generalized linear models for
functional data via penalized likelihood. J. Multivariate Anal. 92 24-41. MR2102242
[13] Cuevas, A., Febrero, M. and Fraiman, R. (2002). Linear functional regression:
The case of fixed design and functional response. Canad. J. Statist. 30 285-300.
MR1926066
[14] Escabias, M., Aguilera, A. M. and Valderrama, M. J. (2005). Modeling
environmental data by functional principal component logistic regression.
Environmetrics 16 95-107. MR2146901
[15] Ferraty, F. and Vieu, P. (2000). Dimension fractale et estimation de la régression
dans des espaces vectoriels semi-normés. C. R. Acad. Sci. Paris Sér. I Math.
330 139-142. MR1745172
[16] Ferraty, F. and Vieu, P. (2002). The functional nonparametric model and
application to spectrometric data. Comput. Statist. 17 545-564. MR1952697
[17] Ferraty, F. and Vieu, P. (2004). Nonparametric models for functional data, with
application in regression, time-series prediction and curve discrimination. J.
Nonparametr. Statist. 16 111-125. MR2053065
[18] Ferré, L. and Yao, A. F. (2003). Functional sliced inverse regression analysis.
Statistics 37 475-488. MR2022235
[19] Girard, S. (2000). A nonlinear PCA based on manifold approximation. Comput.
Statist. 15 145-167. MR1794107
[20] Hall, P. and Horowitz, J. L. (2004). Methodology and convergence rates for
functional linear regression. Unpublished manuscript.
[21] He, G., Müller, H.-G. and Wang, J.-L. (2003). Functional canonical analysis
for square integrable stochastic processes. J. Multivariate Anal. 85 54-77.
MR1978177
[22] James, G. M. (2002). Generalized linear models with functional predictors. J. R.
Stat. Soc. Ser. B Stat. Methodol. 64 411-432. MR1924298
[23] James, G. M., Hastie, T. J. and Sugar, C. A. (2000). Principal component models
for sparse functional data. Biometrika 87 587-602. MR1789811
[24] Masry, E. (2005). Nonparametric regression estimation for dependent functional
data: Asymptotic normality. Stochastic Process. Appl. 115 155-177. MR2105373
[25] Müller, H.-G. and Stadtmüller, U. (2005). Generalized functional linear models.
Ann. Statist. 33 774-805. MR2163159
[26] Preda, C. and Saporta, G. (2004). PLS approach for clusterwise linear regression
on functional data. In Classification, Clustering, and Data Mining Applications
(D. Banks, L. House, F. R. McMorris, P. Arabie and W. Gaul, eds.) 167-176.
Springer, Berlin. MR2113607
[27] Ramsay, J. O. and Dalzell, C. J. (1991). Some tools for functional data analysis
(with discussion). J. Roy. Statist. Soc. Ser. B 53 539-572. MR1125714
[28] Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. Springer,
New York.
[29] Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis:
Methods and Case Studies. Springer, New York. MR1910407
[30] Ratcliffe, S. J., Heller, G. Z. and Leader, L. R. (2002). Functional data
analysis with application to periodically stimulated foetal heart rate data. II.
Functional logistic regression. Statistics in Medicine 21 1115-1127.
[31] Rice, J. A. and Silverman, B. W. (1991). Estimating the mean and covariance
structure nonparametrically when the data are curves. J. Roy. Statist. Soc.
Ser. B 53 233-243. MR1094283
[32] Silverman, B. W. (1995). Incorporating parametric effects into functional principal
components analysis. J. Roy. Statist. Soc. Ser. B 57 673-689. MR1354074
[33] Silverman, B. W. (1996). Smoothed functional principal components analysis by
choice of norm. Ann. Statist. 24 1-24. MR1389877
Department of Statistics
The Wharton School
University of Pennsylvania
Philadelphia, Pennsylvania 19104-6340
USA
E-mail: [email protected]
Centre for Mathematics
and Its Applications
Australian National University
Canberra, ACT 0200
Australia
E-mail: [email protected]