
8
Nonparametric Inference Using Orthogonal Functions

8.1 Introduction
In this chapter we use orthogonal function methods for nonparametric inference. Specifically, we use an orthogonal basis to convert regression and density estimation into a Normal means problem and then we construct estimates and confidence sets using the theory from Chapter 7. In the regression case, the resulting estimators are linear smoothers and thus are a special case of the estimators described in Section 5.2. We discuss another approach to orthogonal function regression based on wavelets in the next chapter.

8.2 Nonparametric Regression


The particular version of orthogonal function regression that we consider here was developed by Beran (2000) and Beran and Dümbgen (1998). They call the method react, which stands for Risk Estimation and Adaptation after Coordinate Transformation. Similar ideas have been developed by Efromovich (1999). In fact, the basic idea is quite old; see Čencov (1962), for example.
Suppose we observe

Y_i = r(x_i) + \sigma\epsilon_i   (8.1)

where \epsilon_i \sim N(0,1) are iid. For now, we assume a regular design, meaning that x_i = i/n, i = 1, \ldots, n.
Let φ_1, φ_2, ... be an orthonormal basis for L_2(0,1). In our examples we will often use the cosine basis:

\phi_1(x) \equiv 1, \qquad \phi_j(x) = \sqrt{2}\,\cos((j-1)\pi x), \quad j \ge 2.   (8.2)

Expand r as

r(x) = \sum_{j=1}^{\infty} \theta_j \phi_j(x)   (8.3)

where \theta_j = \int_0^1 \phi_j(x) r(x)\,dx.
First, we approximate r by

r_n(x) = \sum_{j=1}^{n} \theta_j \phi_j(x),

which is the projection of r onto the span of {φ_1, ..., φ_n}.¹ This introduces an integrated squared bias of size

B_n(\theta) = \int_0^1 (r(x) - r_n(x))^2\,dx = \sum_{j=n+1}^{\infty} \theta_j^2.

If r is smooth, this bias is quite small.

8.4 Lemma. Let Θ(m, c) be a Sobolev ellipsoid.² Then,

\sup_{\theta \in \Theta(m,c)} B_n(\theta) = O\left(\frac{1}{n^{2m}}\right).   (8.5)

In particular, if m > 1/2 then \sup_{\theta \in \Theta(m,c)} B_n(\theta) = o(1/n).

Hence this bias is negligible and we shall ignore it for the rest of the chapter. More precisely, we will focus on estimating r_n rather than r. Our next task is to estimate θ = (θ_1, ..., θ_n). Let

Z_j = \frac{1}{n}\sum_{i=1}^{n} Y_i \phi_j(x_i), \quad j = 1, \ldots, n.   (8.6)

¹ More generally we could take r_n(x) = \sum_{j=1}^{p(n)} \theta_j \phi_j(x) where p(n) → ∞ at an appropriate rate.
² See Definition 7.2.

As we saw in equation (7.15) of Chapter 7,

Z_j \approx N\left(\theta_j, \frac{\sigma^2}{n}\right).   (8.7)
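This map from data to coefficients is a single matrix multiplication. Below is a minimal sketch in Python, assuming NumPy; the helper names (cosine_basis, empirical_coefficients) are ours, not from the text, and the later snippets in this chapter reuse them.

    import numpy as np

    def cosine_basis(x, J):
        # Evaluate phi_1,...,phi_J of the cosine basis (8.2) at the points x.
        x = np.asarray(x, dtype=float)
        Phi = np.ones((len(x), J))
        for j in range(2, J + 1):
            Phi[:, j - 1] = np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)
        return Phi

    def empirical_coefficients(y, x):
        # Z_j = (1/n) sum_i Y_i phi_j(x_i) for j = 1,...,n, as in (8.6).
        n = len(y)
        return cosine_basis(x, n).T @ np.asarray(y, dtype=float) / n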
We know from the previous chapter that the mle Z = (Z_1, ..., Z_n) has large risk. One possibility for improving on the mle is to use the James–Stein estimator \hat\theta^{JS} defined in (7.41). We can think of the James–Stein estimator as the estimator that minimizes the estimated risk over all estimators of the form (bZ_1, ..., bZ_n). react generalizes this idea by minimizing the risk over a larger class of estimators, called modulators.
A modulator is a vector b = (b_1, ..., b_n) such that 0 ≤ b_j ≤ 1, j = 1, ..., n. A modulation estimator is an estimator of the form

\hat\theta = bZ = (b_1 Z_1, b_2 Z_2, \ldots, b_n Z_n).   (8.8)

A constant modulator is a modulator of the form (b, ..., b). A nested subset selection modulator is a modulator of the form

b = (1, \ldots, 1, 0, \ldots, 0).

A monotone modulator is a modulator of the form

1 \ge b_1 \ge \cdots \ge b_n \ge 0.

The set of constant modulators is denoted by M_{CONS}, the set of nested subset modulators is denoted by M_{NSS} and the set of monotone modulators is denoted by M_{MON}.
Given a modulator b = (b_1, ..., b_n), the function estimator is

\hat r_n(x) = \sum_{j=1}^{n} \hat\theta_j \phi_j(x) = \sum_{j=1}^{n} b_j Z_j \phi_j(x).   (8.9)

Observe that

\hat r_n(x) = \sum_{i=1}^{n} Y_i \ell_i(x)   (8.10)

where

\ell_i(x) = \frac{1}{n}\sum_{j=1}^{n} b_j \phi_j(x) \phi_j(x_i).   (8.11)

Hence, \hat r_n is a linear smoother as described in Section 5.2.


Modulators shrink the Z_j's towards 0 and, as we saw in the last chapter, shrinking tends to smooth the function. Thus, choosing the amount of shrinkage corresponds to the problem of choosing a bandwidth that we faced in Chapter 5. We shall address the problem using Stein's unbiased risk estimator (Section 7.4) instead of cross-validation.
Let

R(b) = E_\theta\left[\sum_{j=1}^{n} (b_j Z_j - \theta_j)^2\right]

denote the risk of the estimator \hat\theta = (b_1 Z_1, \ldots, b_n Z_n). The idea of react is to estimate the risk R(b) and choose b to minimize the estimated risk over a class of modulators M. Minimizing over M_{CONS} yields the James–Stein estimator, so react is a generalization of James–Stein estimation.
To proceed, we need to estimate σ. Any of the methods discussed in Chapter 5 can be used. Another estimator, well-suited for the present framework, is

\hat\sigma^2 = \frac{n}{J_n} \sum_{i=n-J_n+1}^{n} Z_i^2.   (8.12)

This estimator is consistent as long as J_n → ∞ and n − J_n → ∞ as n → ∞. As a default value, J_n = n/4 is not unreasonable. The intuition is that if r is smooth then we expect θ_j ≈ 0 for large j, and hence Z_j^2 = (\theta_j + \sigma\epsilon_j/\sqrt{n})^2 \approx (\sigma\epsilon_j/\sqrt{n})^2 = \sigma^2\epsilon_j^2/n. Therefore,

E(\hat\sigma^2) = \frac{n}{J_n} \sum_{i=n-J_n+1}^{n} E(Z_i^2) \approx \frac{1}{J_n} \sum_{i=n-J_n+1}^{n} \sigma^2 E(\epsilon_i^2) = \sigma^2.
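In code, continuing the helpers above (a sketch, using the n/J_n scaling that makes the expectation calculation work out):

    def estimate_sigma2(Z, Jn=None):
        # Variance estimator (8.12): rescaled sum of the last J_n squared
        # coefficients, which are mostly noise when r is smooth.
        n = len(Z)
        if Jn is None:
            Jn = n // 4            # the default J_n = n/4 suggested above
        return (n / Jn) * np.sum(Z[n - Jn:] ** 2)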

Now we can estimate the risk function.

8.13 Theorem. The risk of a modulator b is

R(b) = \sum_{j=1}^{n} \theta_j^2 (1 - b_j)^2 + \frac{\sigma^2}{n} \sum_{j=1}^{n} b_j^2.   (8.14)

The (modified)³ sure estimator of R(b) is

\hat R(b) = \sum_{j=1}^{n} \left( Z_j^2 - \frac{\hat\sigma^2}{n} \right)_+ (1 - b_j)^2 + \frac{\hat\sigma^2}{n} \sum_{j=1}^{n} b_j^2   (8.15)

where \hat\sigma^2 is a consistent estimate of σ², such as (8.12).

³ We call this a modified risk estimator since we have inserted an estimate \hat\sigma of σ and we replaced (Z_j^2 - \hat\sigma^2/n) with (Z_j^2 - \hat\sigma^2/n)_+, which usually improves the risk estimate.
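The estimator (8.15) is one line of code; a sketch using the helpers above:

    def estimated_risk(b, Z, sigma2):
        # Modified SURE formula (8.15) for any modulator b in [0,1]^n.
        n = len(Z)
        theta2 = np.maximum(Z ** 2 - sigma2 / n, 0.0)   # (Z_j^2 - sigma^2/n)_+
        return np.sum(theta2 * (1 - b) ** 2) + (sigma2 / n) * np.sum(b ** 2)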

8.16 Definition. Let M be a set of modulators. The modulation estimator of θ is \hat\theta = (\hat b_1 Z_1, ..., \hat b_n Z_n) where \hat b = (\hat b_1, ..., \hat b_n) minimizes \hat R(b) over M. The react function estimator is

\hat r_n(x) = \sum_{j=1}^{n} \hat\theta_j \phi_j(x) = \sum_{j=1}^{n} \hat b_j Z_j \phi_j(x).


For a fixed b, we expect that \hat R(b) approximates R(b). But for the react estimator we require more: we want \hat R(b) to approximate R(b) uniformly for b ∈ M. If so, then \inf_{b \in M} \hat R(b) \approx \inf_{b \in M} R(b) and the b that minimizes \hat R(b) should be nearly as good as the b that minimizes R(b). This motivates the next result.

8.17 Theorem (Beran and Dümbgen, 1998). Let M be one of M_{CONS}, M_{NSS} or M_{MON}. Let R(b) denote the true risk of the estimator (b_1 Z_1, ..., b_n Z_n). Let b^* minimize R(b) over M and let \hat b minimize \hat R(b) over M. Then

|R(\hat b) - R(b^*)| \to 0

as n → ∞. For M = M_{CONS} or M = M_{MON}, the estimator \hat\theta = (\hat b_1 Z_1, ..., \hat b_n Z_n) achieves the Pinsker bound (7.29).

To implement this method, we need to find \hat b to minimize \hat R(b). The minimum of \hat R(b) over M_{CONS} is the James–Stein estimator. To minimize \hat R(b) over M_{NSS}, we compute \hat R(b) for every modulator of the form (1, 1, ..., 1, 0, ..., 0) and then the minimum is found. In other words, we find \hat J to minimize

\hat R(J) = \frac{J \hat\sigma^2}{n} + \sum_{j=J+1}^{n} \left( Z_j^2 - \frac{\hat\sigma^2}{n} \right)_+   (8.18)

and set \hat r_n(x) = \sum_{j=1}^{\hat J} Z_j \phi_j(x). It is a good idea to plot the estimated risk as a function of J. To minimize \hat R(b) over M_{MON}, note that \hat R(b) can be written as

\hat R(b) = \sum_{i=1}^{n} (b_i - g_i)^2 Z_i^2 + \frac{\hat\sigma^2}{n} \sum_{i=1}^{n} g_i   (8.19)

where g_i = (Z_i^2 - (\hat\sigma^2/n))/Z_i^2. So it suffices to minimize

\sum_{i=1}^{n} (b_i - g_i)^2 Z_i^2

subject to b_1 ≥ ··· ≥ b_n. This is simply a weighted least squares problem subject to a monotonicity constraint. There is a well-known algorithm called the pooled-adjacent-violators (PAV) algorithm for performing this minimization; see Robertson et al. (1988).
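One way to carry out this step, assuming scikit-learn is available (its isotonic regression routine implements PAV); the helper is our sketch, and it clips the solution to [0, 1] since a modulator must lie in that range:

    from sklearn.isotonic import IsotonicRegression

    def monotone_modulator(Z, sigma2):
        # Weighted least squares fit of g_i with weights Z_i^2, subject to
        # b_1 >= ... >= b_n and 0 <= b_i <= 1; assumes all Z_i are nonzero.
        n = len(Z)
        g = 1.0 - (sigma2 / n) / Z ** 2   # g_i = (Z_i^2 - sigma^2/n) / Z_i^2
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=False)
        return iso.fit_transform(np.arange(n), g, sample_weight=Z ** 2)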
Usually, monotone modulators lead to estimates that are close to the NSS
modulators and the latter are very easy to implement. Thus, as a default, the
NSS method is reasonable. At this point, we can summarize the whole react
procedure.

Summary of react

1. Let Z_j = n^{-1} \sum_{i=1}^{n} Y_i \phi_j(x_i) for j = 1, ..., n.

2. Find \hat J to minimize the risk estimator \hat R(J) given by equation (8.18).

3. Let

\hat r_n(x) = \sum_{j=1}^{\hat J} Z_j \phi_j(x).
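Putting the three steps together, a compact sketch of the NSS procedure (our code, built on the helpers defined earlier):

    def react_nss(y, x):
        n = len(y)
        Z = empirical_coefficients(y, x)                # step 1
        sigma2 = estimate_sigma2(Z)
        tail = np.maximum(Z ** 2 - sigma2 / n, 0.0)
        # Risk estimate (8.18) for J = 0,1,...,n; step 2 takes the minimizer.
        risks = np.arange(n + 1) * sigma2 / n + \
            np.concatenate(([tail.sum()], tail.sum() - np.cumsum(tail)))
        J_hat = int(np.argmin(risks))
        r_hat = cosine_basis(x, n)[:, :J_hat] @ Z[:J_hat]   # step 3
        return r_hat, J_hat, risks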

8.20 Example (Doppler function). Recall that the Doppler function from Example 5.63 is

r(x) = \sqrt{x(1-x)} \, \sin\left( \frac{2.1\pi}{x + 0.05} \right).

The top left panel in Figure 8.1 shows the true function. The top right panel shows 1000 data points. The data were simulated from the model Y_i = r(i/n) + σε_i with σ = 0.1 and ε_i ∼ N(0,1). The bottom left panel shows the estimated risk for the NSS modulator as a function of the number of terms in the fit. The risk was minimized by using the modulator

\hat b = (\underbrace{1, \ldots, 1}_{187}, \underbrace{0, \ldots, 0}_{813}).

The bottom right panel shows the react fit. Compare with Figure 5.6.
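An experiment in this spirit can be run with the sketch above; our seed, and hence the exact minimizer, will of course differ from the figure:

    rng = np.random.default_rng(0)
    n = 1000
    x = np.arange(1, n + 1) / n
    r = np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))
    y = r + 0.1 * rng.standard_normal(n)
    r_hat, J_hat, risks = react_nss(y, x)
    # J_hat plays the role of the 187 retained coefficients in the text's run.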

8.21 Example (CMB data). Let us compare react to local smoothing for the CMB data from Example 4.4. The estimated risk (for NSS) is minimized by using J = 6 basis functions. The fit is shown in Figure 8.2 and is similar to the fits obtained in Chapter 5. (We are ignoring the fact that the variance is not constant.) The plot of the risk reveals that there is another local minimum around J = 40. The bottom right plot shows the fit using 40 basis functions. This fit appears to undersmooth.
FIGURE 8.1. Doppler test function. Top left: true function. Top right: 1000 data points. Bottom left: estimated risk as a function of the number of terms in the fit. Bottom right: final react fit.

There are several ways to construct a confidence set for r. We begin with confidence balls. First, construct a confidence ball B_n for θ = (θ_1, ..., θ_n) using any of the methods in Section 7.8. Then define

C_n = \left\{ r = \sum_{j=1}^{n} \theta_j \phi_j : (\theta_1, \ldots, \theta_n) \in B_n \right\}.   (8.22)

It follows that C_n is a confidence ball for r_n. If we use the pivotal method from Section 7.8 we get the following.

8.23 Theorem (Beran and Dümbgen, 1998). Let \hat\theta be the MON or NSS estimator and let \hat\sigma^2 be the estimator of σ² defined in (8.12). Let

B_n = \left\{ \theta = (\theta_1, \ldots, \theta_n) : \sum_{j=1}^{n} (\hat\theta_j - \theta_j)^2 \le s_n^2 \right\}   (8.24)

where

s_n^2 = \hat R(\hat b) + \frac{z_\alpha \hat\tau}{\sqrt{n}},

\hat\tau^2 = \frac{2\hat\sigma^4}{n} \sum_j \left[ (2\hat b_j - 1)(1 - c_j) \right]^2 + 4\hat\sigma^2 \sum_j \left( Z_j^2 - \frac{\hat\sigma^2}{n} \right) \left[ (1 - \hat b_j) + (2\hat b_j - 1) c_j \right]^2

and

c_j = \begin{cases} 0 & \text{if } j \le n - J \\ 1/J & \text{if } j > n - J \end{cases}

(here J is the integer J_n used in (8.12)). Then, for any c > 0 and m > 1/2,

\lim_{n \to \infty} \sup_{\theta \in \Theta(m,c)} \left| P_\theta(\theta \in B_n) - (1 - \alpha) \right| = 0.

FIGURE 8.2. CMB data using react. Top left: NSS fit using J = 6 basis functions. Top right: estimated risk. Bottom left: NSS fit using J = 40 basis functions.

To construct confidence bands, we use the fact that \hat r_n is a linear smoother and we can then use the method from Section 5.7. The band is given by (5.99), namely,

I(x) = \left( \hat r_n(x) - c\,\hat\sigma\,\|\ell(x)\|, \; \hat r_n(x) + c\,\hat\sigma\,\|\ell(x)\| \right)   (8.25)

where

\|\ell(x)\|^2 \approx \frac{1}{n} \sum_{j=1}^{n} b_j^2 \phi_j^2(x)   (8.26)

and c is from equation (5.102).
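A sketch of this band for the NSS fit, taking the constant c from (5.102) as given:

    def confidence_band(x, Z, J_hat, sigma2, c):
        # Band (8.25), with ||l(x)||^2 approximated as in (8.26) for the
        # NSS modulator b = (1,...,1,0,...,0) having J_hat ones.
        n = len(Z)
        Phi = cosine_basis(x, n)
        r_hat = Phi[:, :J_hat] @ Z[:J_hat]
        ell2 = np.sum(Phi[:, :J_hat] ** 2, axis=1) / n
        half = c * np.sqrt(sigma2 * ell2)
        return r_hat - half, r_hat + half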

8.3 Irregular Designs

So far we have assumed a regular design x_i = i/n. Now let us relax this assumption and deal with an irregular design. There are several ways to handle this case. The simplest is to use a basis {φ_1, ..., φ_n} that is orthogonal with respect to the design points x_1, ..., x_n. That is, we choose a basis for L_2(P_n) where P_n = n^{-1} \sum_{i=1}^{n} \delta_{x_i} and \delta_{x_i} is a point mass at x_i. This requires that

\|\phi_j\|^2 = 1, \quad j = 1, \ldots, n

and

\langle \phi_j, \phi_k \rangle = 0, \quad 1 \le j < k \le n,

where

\langle f, g \rangle = \int f(x) g(x)\,dP_n(x) = \frac{1}{n} \sum_{i=1}^{n} f(x_i) g(x_i)

and

\|f\|^2 = \int f^2(x)\,dP_n(x) = \frac{1}{n} \sum_{i=1}^{n} f^2(x_i).

We can construct such a basis by Gram–Schmidt orthogonalization as follows. Let g_1, ..., g_n be any convenient orthonormal basis for R^n. Let

\phi_1(x) = \frac{\psi_1(x)}{\|\psi_1\|} \quad \text{where } \psi_1(x) = g_1(x),

and for 2 ≤ r ≤ n define

\phi_r(x) = \frac{\psi_r(x)}{\|\psi_r\|} \quad \text{where } \psi_r(x) = g_r(x) - \sum_{j=1}^{r-1} a_{r,j} \phi_j(x)

and a_{r,j} = \langle g_r, \phi_j \rangle. Then, φ_1, ..., φ_n form an orthonormal basis with respect to P_n.
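In matrix form the construction is just a QR decomposition of the design matrix. A sketch, assuming the design points are distinct so that the starting matrix has full rank; we start from the cosine basis, but any convenient functions would do:

    def design_orthonormal_basis(x):
        # Columns of the result satisfy (1/n) sum_i phi_j(x_i) phi_k(x_i)
        # = delta_{jk}, i.e., they are orthonormal in L2(P_n).
        n = len(x)
        G = cosine_basis(x, n)      # starting functions at the design points
        Q, _ = np.linalg.qr(G)      # Gram-Schmidt on the columns
        return np.sqrt(n) * Q       # rescale so each L2(P_n) norm is 1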


Now, as before, we define

Z_j = \frac{1}{n} \sum_{i=1}^{n} Y_i \phi_j(x_i), \quad j = 1, \ldots, n.   (8.27)

It follows that

Z_j \approx N\left( \theta_j, \frac{\sigma^2}{n} \right)

and we can then use the methods that we developed in this chapter.

8.4 Density Estimation

Orthogonal function methods can also be used for density estimation. Let X_1, ..., X_n be an iid sample from a distribution F with density f with support on (0,1). We assume that f ∈ L_2(0,1) so we can expand f as

f(x) = \sum_{j=1}^{\infty} \theta_j \phi_j(x)

where, as before, φ_1, φ_2, ... is an orthonormal basis. Let

Z_j = \frac{1}{n} \sum_{i=1}^{n} \phi_j(X_i), \quad j = 1, 2, \ldots, n.   (8.28)

Then,

E(Z_j) = \int \phi_j(x) f(x)\,dx = \theta_j

and

V(Z_j) = \frac{1}{n} \left[ \int \phi_j^2(x) f(x)\,dx - \theta_j^2 \right] \equiv \sigma_j^2.

As in the regression case, we take θ_j = 0 for j > n and we estimate θ = (θ_1, ..., θ_n) using a modulation estimator \hat\theta = bZ = (b_1 Z_1, ..., b_n Z_n). The risk of this estimator is

R(b) = \sum_{j=1}^{n} b_j^2 \sigma_j^2 + \sum_{j=1}^{n} (1 - b_j)^2 \theta_j^2.   (8.29)

We estimate σ_j² by

\hat\sigma_j^2 = \frac{1}{n^2} \sum_{i=1}^{n} (\phi_j(X_i) - Z_j)^2

and θ_j² by Z_j^2 - \hat\sigma_j^2; then we can estimate the risk by

\hat R(b) = \sum_{j=1}^{n} b_j^2 \hat\sigma_j^2 + \sum_{j=1}^{n} (1 - b_j)^2 \left( Z_j^2 - \hat\sigma_j^2 \right)_+.   (8.30)

Finally, we choose \hat b by minimizing \hat R(b) over some class of modulators M. The density estimate can be negative. We can fix this by performing surgery: remove the negative part of the density and renormalize it to integrate to 1. Better surgery methods are discussed in Glad et al. (2003).
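A compact sketch of the whole density procedure with NSS modulators and the surgery step (our code; the evaluation grid is arbitrary):

    def react_density(X):
        n = len(X)
        Phi = cosine_basis(X, n)
        Z = Phi.mean(axis=0)                            # (8.28)
        sig2 = ((Phi - Z) ** 2).sum(axis=0) / n ** 2    # estimates of sigma_j^2
        theta2 = np.maximum(Z ** 2 - sig2, 0.0)
        # Risk estimate (8.30) for the NSS modulators (1,...,1,0,...,0).
        risks = [sig2[:J].sum() + theta2[J:].sum() for J in range(n + 1)]
        J_hat = max(int(np.argmin(risks)), 1)           # keep at least phi_1
        grid = np.linspace(0.0, 1.0, 512)
        f_hat = cosine_basis(grid, J_hat) @ Z[:J_hat]
        f_hat = np.maximum(f_hat, 0.0)                  # drop the negative part
        return grid, f_hat / np.trapz(f_hat, grid)      # renormalize to 1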

8.5 Comparison of Methods

The methods we have introduced for nonparametric regression so far are local regression (Section 5.4), spline smoothing (Section 5.5), and orthogonal function smoothing. In many ways, these methods are very similar. They all involve a bias–variance tradeoff and they all require choosing a smoothing parameter. Local polynomial smoothers have the advantage that they automatically correct for boundary bias. It is possible to modify orthogonal function estimators to alleviate boundary bias by changing the basis slightly; see Efromovich (1999). An advantage of orthogonal function smoothing is that it converts nonparametric regression into the many Normal means problem, which is simpler, at least for theoretical purposes. There are rarely huge differences between the approaches, especially when these differences are judged relative to the width of confidence bands. Each approach has its champions and its detractors. It is wise to use all available tools for each problem. If they agree then the choice of method is one of convenience or taste; if they disagree then there is value in figuring out why they differ.
Finally, let us mention that there is a formal relationship between these approaches. For example, orthogonal function smoothing can be viewed as kernel smoothing with a particular kernel and vice versa. See Härdle et al. (1998) for details.

8.6 Tensor Product Models

The methods in this chapter extend readily to higher dimensions although our previous remarks about the curse of dimensionality apply here as well. Suppose that r(x_1, x_2) is a function of two variables. For simplicity, assume that 0 ≤ x_1, x_2 ≤ 1. If φ_0, φ_1, ... is an orthonormal basis for L_2(0,1) then the functions

\{ \phi_{j,k}(x_1, x_2) = \phi_j(x_1) \phi_k(x_2) : j, k = 0, 1, \ldots \}

form an orthonormal basis for L_2([0,1] × [0,1]), called the tensor product basis. The basis can be extended to d dimensions in the obvious way.
Suppose that φ_0 = 1. Then a function r ∈ L_2([0,1] × [0,1]) can be expanded in the tensor product basis as

r(x_1, x_2) = \sum_{j,k=0}^{\infty} \beta_{j,k} \phi_j(x_1) \phi_k(x_2)
= \beta_{0,0} + \sum_{j=1}^{\infty} \beta_{j,0} \phi_j(x_1) + \sum_{j=1}^{\infty} \beta_{0,j} \phi_j(x_2) + \sum_{j,k=1}^{\infty} \beta_{j,k} \phi_j(x_1) \phi_k(x_2).

This expansion has an ANOVA-like structure consisting of a mean, main effects, and interactions. This structure suggests a way to get better estimators. We could put stronger smoothness assumptions on higher-order interactions to get better rates of convergence (at the expense of more assumptions). See Lin (2000), Wahba et al. (1995), Gao et al. (2001), and Lin et al. (2000).
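For fitting, one only needs the tensor product design matrix; a sketch using the cosine basis, with φ_1 ≡ 1 playing the role of φ_0:

    def tensor_basis(x1, x2, J1, J2):
        # Columns are phi_j(x1) * phi_k(x2) for j = 1,...,J1 and k = 1,...,J2.
        P1 = cosine_basis(x1, J1)
        P2 = cosine_basis(x2, J2)
        return np.einsum('ij,ik->ijk', P1, P2).reshape(len(x1), J1 * J2)

The coefficients β_{j,k} are then estimated from the responses exactly as in the one-dimensional case.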

8.7 Bibliographic Remarks

The react method is developed in Beran (2000) and Beran and Dümbgen (1998). A different approach to using orthogonal functions is discussed in Efromovich (1999). react confidence sets are extended to nonconstant variance in Genovese et al. (2004), to wavelets in Genovese and Wasserman (2005), and to density estimation in Jang et al. (2004).

8.8 Exercises
1. Prove Lemma 8.4.

2. Prove Theorem 8.13.

3. Prove equation (8.19).

4. Prove equation (8.26).

5. Show that the estimator (8.12) is consistent.

6. Show that the estimator (8.12) is uniformly consistent over Sobolev ellipsoids.

7. Get the data on fragments of glass collected in forensic work from the
book website. Let Y be refractive index and let x be aluminium con-
tent (the fourth variable). Perform a nonparametric regression to fit the
model Y = r(x) + . Use react and compare to local linear smoothing.
Estimate the variance. Construct 95 percent confidence bands for your
estimate.

8. Get the motorcycle data from the book website. The covariate is time
(in milliseconds) and the response is acceleration at time of impact. Use
react to fit the data. Compute 95 percent confidence bands. Compute
a 95 percent confidence ball. Can you think of a creative way to display
the confidence ball?

9. Generate 1000 observations from the model Y_i = r(x_i) + σε_i where x_i = i/n, ε_i ∼ N(0,1) and r is the Doppler function. Make three data sets corresponding to σ = .1, σ = 1 and σ = 3. Estimate the function using local linear regression and using react. In each case, compute a 95 percent confidence band. Compare the fits and the confidence bands.

10. Repeat the previous exercise but use Cauchy errors instead of Normal
errors. How might you change the procedure to make the estimators
more robust?

11. Generate n = 1000 data points from (1/2)N (0, 1) + (1/2)N (µ, 1). Com-
pare kernel density estimators and react density estimators. Try µ =
0, 1, 2, 3, 4, 5.

12. Recall that a modulator is any vector of the form b = (b_1, ..., b_n) such that 0 ≤ b_j ≤ 1, j = 1, ..., n. The greedy modulator is the modulator b^* = (b_1^*, ..., b_n^*) chosen to minimize the risk R(b) over all modulators.

(a) Find b^*.

(b) What happens if we try to estimate b^* from the data? In particular, consider taking \hat b^* to minimize the estimated risk \hat R. Why will this not work well? (The problem is that we are now minimizing \hat R over a very large class and \hat R does not approximate R uniformly over such a large class.)

13. Let

Y_i = r(x_{1i}, x_{2i}) + \epsilon_i

where ε_i ∼ N(0,1), x_{1i} = x_{2i} = i/n and r(x_1, x_2) = x_1 + \cos(x_2). Generate 1000 observations. Fit a tensor product model with J_1 basis elements for x_1 and J_2 basis elements for x_2. Use sure (Stein's unbiased risk estimator) to choose J_1 and J_2.

14. Download the air quality data set from the book website. Model ozone
as a function of solar R, wind and temperature. Use a tensor product
basis.
