
Statistics and Computing (1991) 1, 47-62

Computational methods for local regression

WILLIAM S. CLEVELAND and E. GROSSE
AT&T Bell Laboratories, Murray Hill, NJ 07974, USA

Received April 1990 and accepted May 1991

Local regression is a nonparametric method in which the regression surface is estimated by
fitting parametric functions locally in the space of the predictors using weighted least squares
in a moving fashion similar to the way that a time series is smoothed by moving averages.
Three computational methods for local regression are presented. First, fast surface fitting and
evaluation is achieved by building a k-d tree in the space of the predictors, evaluating the
surface at the corners of the tree, and then interpolating elsewhere by blending functions.
Second, surfaces are made conditionally parametric in any proper subset of the predictors by
a simple alteration of the weighting scheme. Third, degree-of-freedom quantities that would
be extremely expensive to compute exactly are approximated, not by numerical methods, but
through a statistical model that predicts the quantities from the trace of the hat matrix,
which can be computed easily.

Keywords: Nonparametric regression, loess, k-d tree, blending function, semi-parametric model

1. Introduction

Local regression is a nonparametric approach to estimating regression functions, or surfaces (Macauley, 1931; Watson, 1964; Stone, 1977; Cleveland, 1979; Friedman, 1984; Hastie and Tibshirani, 1990; Cleveland and Grosse, in press). Two examples are shown in Figs 1 and 2. In Fig. 1, there is one predictor, and the fitted function is the curve. In Fig. 2, there are two predictors, and the fitted surface is shown by a contour plot. These two examples will be explained in detail later.

We suppose the response and predictors are related by

    yi = g(xi) + εi

for i = 1 to n, where yi is the ith observation of the response, xi is the ith observation of p predictors, g is the regression surface, and εi is a random error. If x is any point in the space of the predictors, g(x) is the value of the surface at x. Predictors can be numerical variables or categorical. However, when categorical predictors are present, the observations are divided into subsets, one for each combination of levels of the categorical variables, and the response is fitted separately to the numerical predictors for each subset. Thus no additional computational issues arise as a result of categorical predictors, so in this paper we suppose that all predictors are numerical.

In using local regression methods, we specify properties of g and εi, that is, we make assumptions about them. In practice, of course, it is important to verify these specifications using diagnostic methods (Cleveland et al., 1991). There is a fundamental specification of g, however, that characterizes the approach as local regression; we suppose that for each x, the regression surface is well approximated in a certain neighborhood of x by a function from a parametric class. Here, we will allow the specification of one of two general classes of parametric functions: linear and quadratic polynomials. For example, suppose there are two predictors, u and v. If we specify linear, the class is made up of three monomials: a constant, u, and v. If we specify quadratic, the class is made up of six monomials: a constant, u, v, uv, u², and v². We will let λ denote the degree of the polynomial class and τ be the number of monomials.

The method for fitting local regression surfaces that is treated in this paper is loess, which is short for local regression (Cleveland, 1979; Cleveland et al., 1988). Straightforward implementation of loess leads to an unrealistic computational burden. In this paper we present
0960-3174/91 $03.00 + .12 © 1991 Chapman and Hall Ltd.

methods that make the method computationally practical, even in personal computer environments. In the remainder of this section we describe the computational problems and the contents of the paper.

Fig. 1. Local regression model with one predictor: fitted curve and 99% pointwise confidence intervals

Fig. 2. Local regression model with two predictors: contours of fitted surface

1.1. Specifications, fitting and inference

In Section 2 we describe specifications of local regression, the loess fitting method, and how inferences based on this fitting method are carried out. The details of the loess fitting method in a particular application are determined by the specifications of the surface and error terms, but the general approach is quite simple; g is estimated at x by fitting a linear or quadratic polynomial using weighted least squares, where the weight for the observation (xi, yi) decreases as the distance of x from xi increases.

1.2. Surface computation

In principle, loess fitting involves a weighted least-squares computation at each point where the surface is to be evaluated; for example, to make the contour plot in Fig. 2, the loess estimate was evaluated at 5841 values. Typically, direct evaluation at all points is too expensive in the computing environments that most practitioners use today. In Section 3 we present a computational method for evaluating a loess fit. A set of points, typically small in number, is selected for direct computation using the loess fitting method, and a surface is provided by interpolation. The space of the predictors is divided into rectangular cells using an algorithm based on k-d trees (Friedman et al., 1977), the loess fit is evaluated at the vertices of the tree, and then blending functions (Gordon, 1969) do the interpolation.

Suppose there are two or more predictors. One specification we can impose on g is that the surface be conditionally parametric in a proper subset of the predictors (Cleveland et al., 1991). This means that given the values of the predictors not in the subset, the surface is a member of a parametric class as a function of the subset. If we change the conditioning values, the surface is still a function in the same class, although the parameters might change. For example, suppose the predictors are u and v. Suppose λ = 2, and we specify the surface to be conditionally parametric in u. Then, given v, the surface is quadratic in u; the general form of the surface in this case is h0(v) + h1(v)u + h2(v)u². It makes sense to specify a surface to be conditionally parametric in one or more predictors if exploration of the data or a priori information suggests that the surface is globally a very smooth function of those predictors. Making such a specification when it is valid can result in a more parsimonious description of a regression surface.

If we have specified a surface to be conditionally parametric, we need a computational procedure to ensure that the fitted surface is conditionally parametric. In Section 3 we present a method for altering the basic loess computational method to accomplish this.
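For illustration, one simple weight alteration with the property described above is to drop the conditionally parametric predictors from the distance computation, so that the neighborhood weights, and hence the fitted local polynomial, never localize in those directions. The sketch below makes this assumption explicit; it is an illustration consistent with the description above, not a transcription of the paper's Section 3 procedure, and the function names are ours.

```python
import numpy as np

def tricube(u):
    """Tricube weight T(u) = (1 - u^3)^3 on [0, 1), zero for u >= 1."""
    u = np.asarray(u, dtype=float)
    return np.where(u < 1.0, (1.0 - np.clip(u, 0.0, 1.0)**3)**3, 0.0)

def neighborhood_weights(x, X, alpha, cond_param=()):
    """Neighborhood weights w_i(x) for alpha <= 1.

    Predictor columns listed in cond_param are ignored in the distance,
    so two observations differing only in those coordinates always get
    the same weight: the local fit is globally parametric in them.
    """
    keep = [j for j in range(X.shape[1]) if j not in cond_param]
    d = np.sqrt(((X[:, keep] - np.asarray(x)[keep])**2).sum(axis=1))
    q = int(alpha * X.shape[0])      # alpha * n truncated to an integer
    d_q = np.sort(d)[q - 1]          # the qth smallest distance
    return tricube(d / d_q)
```

With `cond_param=(1,)`, for example, observations that differ only in the second predictor receive identical weights at every evaluation point x.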

1.3. Computing degrees of freedom

Associated with a loess fit is an n × n matrix, L, that takes the vector of observations of the response, yi, to the vector of fitted values, ŷi = ĝ(xi), where ĝ is the fitted loess surface. Following the terminology for least-squares fitting, we refer to L as the hat matrix. Inferences about g utilize two quantities:

    δk = tr[(I − L)′(I − L)]^k

for k = 1 and 2. δ1 and δ2 are used in the computation of degrees of freedom for t-intervals for g(x) and for F-tests to compare two loess fits. A supercomputer environment would be needed to perform exact computations of the δk routinely. In Section 4 we present a method of approximation. Instead of using tools of numerical analysis, as we do to approximate the fitted surface, we use a statistical approach. We generated a large number of data sets, each with a response and one or more predictors, and computed the δk for each. Then we fitted a semi-parametric model to the δk based on predictors that can be computed cheaply. The central predictor is the trace of L.

1.4. Discussion

In the final section of the paper we cite other work on computational methods for nonparametric regression. Also, information is given on electronic access to FORTRAN routines that implement the computational methods of this paper.

2. Local regression: specifications, fitting and inference

2.1. Local regression

In carrying out local regression we specify properties of the surface and errors, that is, we make certain assumptions about them. In all cases we suppose the εi are independent random variables with mean 0. We can specify a probability distribution for the εi. Here, we treat just one case; we assume that the εi are independent normal random variables with variance σ². Computationally, other cases typically involve iterations of normal-error estimates.

As we stated earlier, we suppose that in a certain neighborhood of x, the regression surface is well approximated by linear or quadratic polynomials. Thus we must specify the degree, λ. Sometimes, it is useful to specify an intermediate model between λ = 1 and λ = 2 by discarding squares of some predictors (Cleveland et al., 1991), but we do not consider this specification here. The overall sizes of the neighborhoods are specified by a parameter, α, that will be defined in Section 2.2. Size, of course, implies a metric, and we will use Euclidean distance, but a variety of metrics in the space of the raw data can be achieved by transformation of the original variables, which amounts to a change in the definition of the xi. At the very least, we typically want to standardize the scales of the predictor observations. Finally, we can specify a subset of the predictors to be conditionally parametric.

2.2. Loess

The loess fitting method is a definition of ĝ(x), the estimate of g at a specific value of x. The smoothness of the loess fit depends on the specification of the neighborhood parameter, α, the specification of λ, and the specification of conditionally parametric predictors.

Let Δi(x) be the Euclidean distance of x to xi in the space of the predictors. Let Δ(i)(x) be the values of these distances ordered from smallest to largest, and let

    T(u) = (1 − u³)³ for 0 ≤ u < 1, and T(u) = 0 for u ≥ 1

be the tricube weight function.

The neighborhood parameter, α, is positive. Suppose α ≤ 1. Let q be αn truncated to an integer. We define a weight for (xi, yi) by

    wi(x) = T(Δi(x)/Δ(q)(x))

For α > 1, the wi(x) are defined in the same manner but Δ(q)(x) is replaced by Δ(n)(x)α^(1/p), where p is the number of predictors. The wi(x), which we will call the neighborhood weights, decrease or remain constant as xi increases in distance from x.

If we have specified λ to be 1, a linear polynomial in x is fitted to yi using weighted least squares with the weights wi(x); the value of this fitted polynomial at x is ĝ(x). If λ is 2, a quadratic is fitted.

2.3. Statistical inference

The normal-error loess estimate, ĝ(x), is linear in yi:

    ĝ(x) = Σ_{i=1}^{n} li(x) yi

where the li(x) do not depend on the yi. Suppose diagnostic methods have revealed that the specifications of the surface have not resulted in appreciable lack of fit of ĝ(x); we take this to mean that E ĝ(x) − g(x) is small. Suppose further that diagnostic checking has verified the specifications of the error terms in the model. Then the linearity of ĝ(x) results in distributional properties of the estimate that are very similar to those for classical parametric fitting. We will briefly review the properties.

Since ĝ(x) is linear in yi, the fitted value at xi can be written

    ŷi = Σ_{j=1}^{n} lj(xi) yj

Let L be the matrix whose (i, j)th element is lj(xi) and let L̃ = I − L, where I is the n × n identity matrix.
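To make the fitting method of Section 2.2 and the linearity just described concrete, here is a minimal sketch of the direct computation at a single point for λ = 1 and α ≤ 1. The function name and the use of a pseudo-inverse for the weighted least-squares solve are ours; the paper's implementation is FORTRAN.

```python
import numpy as np

def loess_point(x, X, y, alpha=0.5):
    """Direct loess at one point, lambda = 1, alpha <= 1 (a sketch).

    Returns ghat(x) and the operator row l(x), so that ghat(x) = l(x) @ y;
    stacking such rows over evaluation points gives the hat matrix L.
    """
    n = X.shape[0]
    d = np.sqrt(((X - x)**2).sum(axis=1))          # Delta_i(x)
    q = int(alpha * n)                             # alpha*n truncated
    u = d / np.sort(d)[q - 1]                      # scaled by Delta_(q)(x)
    w = np.where(u < 1.0, (1.0 - np.clip(u, 0.0, 1.0)**3)**3, 0.0)  # tricube
    M = np.column_stack([np.ones(n), X - x])       # monomials centred at x
    sw = np.sqrt(w)
    P = np.linalg.pinv(sw[:, None] * M)            # weighted LS via pinv
    l_row = P[0] * sw                              # beta_0 = local fit's value at x
    return l_row @ y, l_row
```

Because the monomials are centred at x, the intercept of the local fit is exactly ĝ(x), and the returned row is the vector of coefficients li(x) of Section 2.3.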

For k = 1 and 2, let

    δk = tr(L̃′L̃)^k

We estimate σ by the scale estimate

    s = (Σ_{i=1}^{n} ε̂i² / δ1)^(1/2)

from the residuals ε̂i = yi − ŷi. The standard deviation of ĝ(x) is

    σ(x) = σ (Σ_{i=1}^{n} li²(x))^(1/2)

We estimate σ(x) by

    s(x) = s (Σ_{i=1}^{n} li²(x))^(1/2)

Let

    ρ = δ1²/δ2

The distribution of

    (ĝ(x) − g(x)) / s(x)

is well approximated by a t-distribution with ρ degrees of freedom; we can use this result to form confidence intervals for g(x) based on ĝ(x).

We can use the analysis of variance to test a null loess fit against an alternative one. Let the parameters of the null fit be α(n), λ(n), δ1(n), and δ2(n). Let the parameters of the alternative fit be α, λ, δ1 and δ2. For the test to make sense, the null model should be nested in the alternative (Cleveland et al., 1991). Let rss be the residual sum of squares of the alternative model, and let rss(n) be the residual sum of squares of the null model. The test statistic, which is analogous to that for the analysis of variance in the parametric case, is

    F = [(rss(n) − rss)/(δ1(n) − δ1)] / (rss/δ1)

F has a distribution that is well approximated by an F-distribution with denominator look-up degrees of freedom ρ, defined earlier, and numerator look-up degrees of freedom

    ν = (δ1(n) − δ1)² / (δ2(n) − δ2)

2.4. Examples

The data in Fig. 1 are from an industrial experiment studying exhaust from an experimental one-cylinder engine (Brinkman, 1981). The response, NOx, is the concentration of nitric oxide, NO, plus the concentration of nitrogen dioxide, NO2, normalized by the amount of work of the engine. The predictor is the equivalence ratio, E, at which the engine was run; E is a measure of the richness of the air and fuel mixture. For the loess fit shown in the figure, λ is 2 and α is 2; these parameters were chosen based on graphical diagnostics that reveal lack and surplus of fit of regression functions and that explore the assumptions about the error terms of the model (Cleveland et al., 1991).

The surface in Fig. 2 was fitted to measurements of the radial velocity of the galaxy NGC 7531 (Buta, 1987). Figure 3 shows locations where 323 measurements were made. The original positions, which lie along seven intersecting lines, have been jittered slightly to reduce overplotting. The vertical and horizontal axes are north-south and east-west position, respectively, on the celestial sphere. For the loess fit shown in the figure, λ is 2 and α is 0.2; again, the appropriateness of these parameters was demonstrated by graphical diagnostics.

Fig. 3. Galaxy data: locations of velocity measurements

3. Surface computation

The strategy of our algorithm is to select a few places to do the loess calculation directly and interpolate elsewhere. Since the surface, ĝ, is smooth by construction, it should be plausible that one can directly calculate ĝ on a coarse grid and interpolate cheaply to a finer grid for graphics or other purposes. It pays to use an adaptive grid that takes account of the location of the xi, rather than a uniform rectangular one, because loess fitting adapts in such a way due to the nearest-neighbor feature. The adaptation is achieved by using k-d trees (Friedman et al., 1977) to partition the domain into cells. Then blending functions

(Lancaster and Salkauskas, 1986) are fitted to the cells to give a C¹ surface. This algorithm is apparently new, although it resembles transfinite elements on irregular grids (Birkhoff et al., 1974; Cavendish, 1975) and regression trees (Friedman, 1979) in some respects. The algorithm may be viewed as belonging to the category of two-stage methods (Schumaker, 1976), in which typically one maps scattered data onto a rectangular grid and then uses a grid-based interpolation scheme.

In this section, we first discuss certain numerical and algorithmic issues that need careful attention to evaluate the surface at a single point, then we describe the k-d tree and blending methods, then we treat performance, and finally we describe a procedure to modify the loess fitting method to produce a conditionally parametric surface.

3.1. Least squares

To get the value of the loess surface, ĝ(x), at a particular point x, the computational steps are the following: identify neighbors of x by calculating n distances, sort, and then solve a q × τ linear least-squares problem. Note that, unlike some other nonparametric methods such as smoothing splines, which require the solution of a global linear system (Wahba, 1978), the loess surface is locally defined, which would allow large speed-up on parallel computers. For each x, the dominant cost is a classic linear algebra problem for which well-vectorized subroutines are widely available. The methods presented here, however, have been developed on sequential machines.

Even with an efficient O(n log n) procedure, the sorting to get Δ(i)(x) can become a bottleneck, slowing down the overall computation noticeably. A little time can be saved by avoiding the square roots and instead sorting Δi²(x). On some machines, a little more time can be saved by arranging the distance calculation so that it vectorizes. A much more significant speedup is achieved by applying a classic algorithm (Floyd and Rivest, 1975) for selecting the qth largest in an unordered list.

If we have found the neighborhood for a point x and next need the neighborhood for a nearby point x′, can we do better than starting from scratch? One possibility would be to take advantage somehow of the triangle inequality, only updating the distances for points near the first neighborhood boundary. We have pursued another approach. During the partial sort we update a permutation vector instead of explicitly reordering the distance vector. By using the final permutation for x rather than the identity when starting the partial sort for x′, fewer comparisons and swaps may be needed.

What happens if there are several points on the boundary of the neighborhood? That is to say, suppose there are distinct i and j such that Δ(j)(x) = Δ(i)(x) = Δ(q)(x). The procedure just described breaks ties arbitrarily and hence selects at random just enough points on the boundary to bring the total to q. Since the tricube weight function falls to 0 at the boundary, this is justified; boundary points make no difference in the least-squares local fit. In implementing a uniform weight function, in contrast, more care would be needed to ensure that all points on the boundary receive equal treatment.

Badly distributed observations may not uniquely determine all τ monomial terms in the local regression. Consider locally-quadratic fitting in one predictor. If, because of multiplicities, there are only two distinct sample locations inside a neighborhood, then a quadratic polynomial is not uniquely determined. In more variables, something similar can occur if there are patterns in the distribution of predictors. The most common such situation involves a discrete variable, so that a neighborhood that otherwise seems large enough contains just points lying on one or two lines. Again there is no unique local quadratic. In such a situation, we use the pseudo-inverse to produce stable values for ĝ(x). Since q is typically much larger than τ, a preliminary factorization X = QR into a τ × τ matrix R and a q × τ matrix Q with Q′Q = Iτ followed by SVD of R allows the pseudo-inverse to be computed efficiently (Chan, 1982). The interpolation scheme to be described later uses not only ĝ(x), but also the slope of the local model at x. The pseudo-inverse has an effect like dropping the square term in the transverse predictor. This is the best that can be automatically done with the available local information, but does imply the interpolation has to work with less information. It is wise in such circumstances to make a surface plot and verify that no undesired flat spots result. Of course, the data analyst can eliminate the use of the pseudo-inverse by adding rows to X (increasing α) or removing columns from X (reducing λ or specifying conditionally parametric predictors).

It is possible to save linear algebra cost for multiple data sets at the same sample points with the same α, at the cost of increased workspace requirements, by saving the pseudo-inverse coefficients. In the implementation of our computational methods, we provide this option. Associated with each vertex in the k-d tree is a τ × q coefficient matrix and a length-q integer vector.
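The rank-deficient case above can be sketched numerically: factor the sqrt-weighted q × τ design as QR, then take the SVD of the small τ × τ factor R, which is far cheaper than an SVD of the full design when q is much larger than τ. The function name and the rcond cutoff are ours; this is an illustration of the idea credited to Chan (1982), not the paper's FORTRAN code.

```python
import numpy as np

def wls_pseudo(Xd, w, y, rcond=1e-10):
    """Weighted least squares with a pseudo-inverse fallback.

    QR of the sqrt-weighted design, then SVD of the tau x tau factor R;
    tiny singular values are zeroed, giving stable (minimum-norm)
    coefficients when the local design loses rank.
    """
    A = np.sqrt(w)[:, None] * Xd
    b = np.sqrt(w) * y
    Q, R = np.linalg.qr(A)                  # A = Q R, Q'Q = I_tau
    U, s, Vt = np.linalg.svd(R)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ (Q.T @ b)))
```

With duplicated sample locations, for example two identical design columns, the zeroed singular values split the coefficient evenly between the duplicates instead of blowing up.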

3.2. k-d trees

The k-d tree is a particular data structure (Friedman et al., 1977) for partitioning space by recursively cutting cells in half by a hyperplane orthogonal to one of the coordinate axes. It was originally designed for answering nearest-neighbor queries but we use it in a different way. (Ironically, the tree is not of assistance in identifying the neighborhoods. The trouble is that α is commonly large enough that each neighborhood intersects most of the cells in the k-d tree.)

Let C be a rectangular cell containing the xi and let h be a number between 0 and 1. To build the k-d tree, call partition(C), the recursive function described informally by

    partition(C)
        j := maximizer of (max_{x in C} [x]j − min_{x in C} [x]j);    (the component with greatest spread)
        μ := median{[x]j : x in C};
        cut C = L ∪ R by the hyperplane that intersects the jth axis at μ;
        left subcell L ⊆ {x in C : [x]j ≤ μ};
        right subcell R ⊆ {x in C : [x]j > μ};
        if size(L) > nh
            partition(L);
        if size(R) > nh
            partition(R);

Thus h controls how far to refine the partition; if more than nh points are inside the cell, we refine. Larger values make the method run faster by generating fewer vertices; smaller values yield a blending surface closer to the direct surface. Also, as h increases, the tree tends to change only at values of h around a power of 1/2, provided the number of points on cell boundaries is small, because the number of points inside a cell tends to be around n/2^k after the kth round of partitioning.

The procedure is illustrated in Fig. 4, where the circles graph xi with two predictors, with n = 100 and h = 0.05. Consider the top panel of the figure. First a bounding rectangle is drawn which encloses the sample points. This first cell is taller than it is wide, so it is bisected at the median of the second predictor; this results in the horizontal line inside the box. Then the two cells are considered. Each is wider than it is tall, so each is bisected at median values of the first predictor. The result at this point is the four cells shown in the top panel. The final tree is shown in the bottom panel.

Fig. 4. A k-d tree (top panel: the first four cells; bottom panel: the final tree)

For readers more familiar with quadtrees, which bisect simultaneously at the midpoint in all the coordinates, it is worth noting that the k-d cells do not all have the same shape. In compensation, the cells do not vary widely in the number of points they contain; a quadtree can contain mostly empty cells if the samples are non-uniformly distributed.

An additional stopping criterion is needed in a FORTRAN implementation. Only a fixed-size array is available for storing vertices, so there may be no room to continue refining the tree. For this reason, we refine in breadth-first order so that stopping at any time spreads the error uniformly. Ordinarily, k-d tree algorithms have been implemented in depth-first order, but breadth-first is no harder.

Once the k-d tree is built, ĝ(x) is computed directly at the vertices. By 'vertex', we just mean a corner of a cell; 'vertex' seems a better term because a corner of one cell typically lies in the middle of a side of an adjacent cell. In addition to computing ĝ(x) at the vertices, derivatives of ĝ are estimated there by taking the slopes of the locally linear or locally quadratic fit. These slopes are a natural by-product of the least-squares computation and cost nothing extra to obtain.
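The partition routine above, refined in the breadth-first order just described, can be sketched as follows; the sketch returns the index set of each leaf cell, and the names are ours.

```python
import numpy as np
from collections import deque

def kd_partition(X, h):
    """Breadth-first k-d partition (a sketch of partition() in the text).

    Any cell holding more than n*h points is cut at the median of the
    coordinate with greatest spread; a fixed-size vertex array could stop
    the refinement at any point while leaving the error spread uniformly.
    """
    n = len(X)
    leaves, queue = [], deque([np.arange(n)])
    while queue:
        idx = queue.popleft()
        if len(idx) <= n * h:
            leaves.append(idx)
            continue
        pts = X[idx]
        j = np.argmax(pts.max(axis=0) - pts.min(axis=0))   # greatest spread
        mu = np.median(pts[:, j])
        left = idx[pts[:, j] <= mu]
        right = idx[pts[:, j] > mu]
        if len(left) == 0 or len(right) == 0:   # degenerate cut: stop refining
            leaves.append(idx)
            continue
        queue.append(left)
        queue.append(right)
    return leaves
```

The FIFO queue gives exactly the breadth-first refinement order; replacing it with a stack would reproduce the conventional depth-first construction.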

The k-d tree is designed to adapt to very non-uniformly distributed data. Nevertheless, if the samples are scattered in an elliptical cloud, we could help the tree along in the early stages by first rotating the points to bring the principal axes into alignment with the coordinate axes. (It is apparent from the definition that ĝ itself is rotationally invariant.) An even more sophisticated rotation would align principal axes of clusters in the data (Art et al., 1982). Rarely do we bother with such rotations.

We hope the number of vertices will be much smaller than n. This is at least true asymptotically, because the number of cells needed to fit the data properly depends on the smoothness of g(x), not n. In Fig. 4 there are 66 vertices, so we solve just 66 least-squares problems instead of the thousands necessary to draw a contour plot.

Because of symmetries in the sample locations, it may happen that when building the k-d tree, two adjacent cells are cut in the same direction and at the same value. This introduces a double vertex on the shared cell side. A good rule of thumb in computational geometry is never to compute anything twice. Consistency rather than optimization is the crucial reason, and in this case we also wish to avoid double counting when computing tr(L) later. So before introducing a vertex we check the neighboring cell and use a pointer to an existing vertex if possible.

3.3. Blending

We now sketch the scheme used to interpolate based on the computation of ĝ at the cells. We use a method called blending, or Coons patch, or transfinite interpolation, developed in the automobile industry (Gordon, 1969) for handling cross-sections of clay models of a car. It is a method for generating smooth surfaces given function data along grid lines.

It is helpful to think of the case p = 2. Each cell boundary consists of four or more segments that meet at vertices. On each segment, function values are interpolated using the unique cubic polynomial determined by the function and derivative data at the vertices. Normal derivatives are interpolated linearly along the segment. In this way, we obtain function and derivative values all along the cell boundaries. Values are consistent from segment to segment and from cell to cell.

Finally, blending functions interpolate across the cell proper. This technique takes a certain combination of univariate interpolants in each variable separately to build, in our application, a surface with continuous first derivative. Although the calculation is not actually done this way, each cell is implicitly subdivided and on each piece a cubic polynomial is constructed.

We estimate partial derivatives at the vertices by slopes of the local model. This is quick but not the same as taking true derivatives of the underlying direct loess surface. It is interesting to note that with just a few more backsolves but no extra matrix factorizations, the correct derivatives can be obtained (Lancaster and Salkauskas, 1986, para. 10.4). Our implementation continues to use slopes rather than the true derivatives because we have not found any examples where the approximation leads to a poor smooth. In fact, as presently defined, the direct smooth ĝ will have discontinuous derivatives resulting from the discontinuous nature of the neighborhood radius. The radius would need to be smoothed somehow before exact derivative formulas would be useful.

A related issue is that we implicitly use cubic interpolation with zero values for the cross-derivatives at the rectangle corners. This is satisfactory for our purposes but produces flat spots that might be visible in a carefully rendered image of the surface, particularly if there are reflection lines that happen to pass near the corners. For a discussion of this phenomenon, see Barnhill (1977) and Farin (1990). If desired, it would be possible to provide correct cross-derivatives.

A special problem arises for higher dimensions that does not occur for p ≤ 2. Introducing vertices only on the sides of a cell can lead to adjacent cells in which sides introduced by independent cuts intersect. This is illustrated in Fig. 5. It is necessary to add all such points of intersection as vertices; otherwise, independent interpolation along the two edges gives inconsistent values at the common point.

Fig. 5. In three dimensions, extra vertices are introduced by division of adjacent cells

The blending method is based on cubic polynomial interpolation and it is well known that polynomial extrapolation is dangerous. So our computer implementation refuses to evaluate ĝ(x) for x outside the bounding rectangle of the observations. To avoid nuisances caused by rounding, the bounding box is expanded by 1% in each variable.

vertex data, it is only a modest amount of extra effort to accumulate any element of L.

If we have just smoothed at a point xi, then we have a singular value decomposition X = UΣV′. The smooth value ĝi is just β̂1, the coefficient of the constant term, and this is linearly related to the data y via the ith row of L:

    L_i· = v_1· Σ⁺ U′W    (1)

A few backsolves and an inner product suffice to compute this row.

The calculation for the interpolation method is only slightly more complicated. Note that smoothing at vertex j uses a linear combination of data values, which we may write as β̂_j = F_j y, and blending uses a linear combination of vertex data. So there is a matrix F composed of the F_j and a (sparse) matrix B such that L = BF. A diagonal element, L_ii = e_i′BFe_i, may be computed by using the same interpolation algorithm on 'vertex data' F_j e_i and summing over j.

3.5. Performance

Since the scheme is local and reproduces quadratics, it can be shown that m leaf cells can achieve overall error O(m⁻³). Hence, as n → ∞ and for a fixed error tolerance, the number of vertices is bounded. Partial sorting and linear algebra are each O(n), and blending at each sample point involves a tree walk and fixed arithmetic. So overall runtime is proportional to the number of data points. Since the constant involves the number of k-d tree vertices, one needs to be cautious in trying to use this result as the basis for comparison with other computational methods. However, it is a basis for optimism when attempting to analyze huge data sets.

Profiling a typical computation revealed that 30% of the time was spent in linear algebra, 30% in building the k-d tree and interpolating by blending, 20% in sorting, and 20% in reading the input data. Total time for smoothing 600 points with p = 2, λ = 2, α = 0.5 is under 10 s on current workstations.

We have presented the loess interpolation method as an approximation to the direct loess method. But the interpolation method is a perfectly sensible nonparametric regression procedure in its own right. If we choose the cell-size parameter, h, to be small, then we get a close approximation of the direct fit; more specifically, in our experimentation we have found that taking h to be α/5 typically results in a close approximation. As h increases, the interpolation surface typically becomes smoother and departs more and more from the direct surface. But for practice, the important question is how well the interpolation surface fits the data, not how well the interpolation surface approximates the direct one; thus diagnostic methods target the question of surplus and lack of fit of the interpolation surface. Also, the interpolation surface has a hat matrix, L, and inferences, of course, are based on it and not the hat matrix of the direct surface. Still, it is comforting that the interpolation surface can give a good approximation of the direct, since the direct method has been extensively investigated and its properties are well understood.

We will use the galaxy data and NOx data sets discussed in Sections 1 and 2 to illustrate the relationship of the direct and interpolation surfaces as h changes.

[Figure 6: two panels, Galaxy Data and NOx Data, graphing maximum absolute difference against log2 h.]

Fig. 6. Comparison of direct and interpolation methods for galaxy data
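The row computation of equation (1) can be exercised in a small numerical sketch. The code below is our own formulation, with numpy standing in for the backsolves of the implementation: form the weighted local design matrix for a quadratic fit at x0, take its singular value decomposition, and read off the row of L that produces the smooth value there.

```python
import numpy as np

def hat_row(xs, x0, w):
    """Row of the smoother matrix L for a locally weighted quadratic fit
    at x0; the fitted value there is row @ y (cf. equation (1))."""
    B = np.column_stack([np.ones_like(xs), xs - x0, (xs - x0) ** 2])
    sw = np.sqrt(w)
    U, S, Vt = np.linalg.svd(sw[:, None] * B, full_matrices=False)
    # beta-hat = V diag(1/S) U' (sw * y); the smooth value is beta-hat[0]
    return (Vt[:, 0] / S) @ U.T * sw

xs = np.linspace(0.0, 1.0, 11)
x0 = 0.4
d = np.abs(xs - x0)
w = (1 - (d / d.max()) ** 3) ** 3    # tricube neighborhood weights
row = hat_row(xs, x0, w)
```

The row sums to 1 and reproduces quadratic data exactly, which is a convenient check on the linear algebra.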

For the galaxy data, the spindly configuration of the points in the space of the predictors presents a challenge to the interpolation method to approximate the direct surface. For the fit shown in Fig. 2, λ = 2, α = 0.2, and h = 0.2/5 = 0.04, the standard value. Let us consider values of h of the form 0.95(1/2)^k for k = 1 to 6. This spans a large range; the largest value, 0.475, results in a tree with just four cells, and the smallest value, 0.015, is less than the standard value of h = 0.04. Also, any value of h in this range yields a tree that is the same as that for one of the six values we have selected. On the top panel of Fig. 6, the maximum absolute difference of the direct and interpolation surfaces evaluated at the xi is graphed against log2 h. The vertical dashed line segment shows the standard value of h. For this value, the absolute difference is 16 km s⁻¹, which is an insignificant departure in this application. Most importantly, diagnostics showed the standard surface fitted the data.

The bottom panel shows a similar analysis for the NOx data, where α = 2/3, λ = 2, and h = α/5. For the standard surface, the maximum absolute difference is 0.022 μg/J. This is 0.4% of the range of the fitted values, so if we added the direct curve to Fig. 1, the two curves would not be visually distinguishable.

3.6. Conditionally parametric surfaces

The method for making a loess fit conditionally parametric in a proper subset of the predictors is simple. The subset is ignored in computing the Euclidean distances that are used in the definition of the neighborhood weights, wi(x). Let us use an example to explain why this produces a conditionally parametric surface. Suppose that there are two predictors, u and v, and λ = 2. Suppose we specify u to be a conditionally parametric predictor. Since the weight function ignores the variable u, the ith weight, wi(u, v), for the fit at (u, v), is the same as the ith weight, wi(u + t, v), for the fit at (u + t, v). Thus the quadratic polynomial that is fitted locally for the fit at (u, v) is the same as the quadratic polynomial that is fitted locally for the fit at (u + t, v), whatever the value of t. Thus for the fixed value of v, the surface is exactly this quadratic as a function of the first predictor. Finally, it is easy to see that our interpolation method preserves this property.

4. Computing degrees of freedom

In this section we present a method for computing δ1 and δ2 approximately. To make notation and formulas somewhat simpler, we will work with γk = n − δk.

We could have attempted to develop a numerical method to compute the γk approximately. Instead, because it seemed more promising, we took a statistical approach. We randomly generated predictor data sets; in other words, we generated values of xi. For each data set, values of α and λ were assigned, and γ1 and γ2 were computed. We developed a semi-parametric model with γ1 and γ2 as a single response, γ, and with several predictors. The central predictor is tr(L), which can be computed cheaply. The other predictors are p, λ, n, and k, a categorical variable with two levels that describes whether γ1 or γ2 is being predicted. We fitted the model to the data, and the fitted model is used in applications to compute predicted values of the γk based on the values of tr(L), p, λ, n, and k.

4.1. The predictor tr(L)

The model for γ was identified by a combination of empirical study, which consisted of generating pilot data sets and studying them graphically, and of theory. The central idea of the model, however, using tr(L) to predict γ, arose from theoretical considerations, which we now describe.

Let ρ be the number of monomial terms in the parametric class being fitted locally, including the constant term. If λ = 1, ρ = p + 1, and if λ = 2, ρ = (p + 1)(p + 2)/2. Suppose that, instead of fitting a loess surface, we carried out a least-squares fit of the ρ polynomial terms. Then L is the hat matrix of the fit and

    γ1 = γ2 = tr(L) = ρ

This equality is the limiting case for loess as α → ∞ since the loess fit tends to the least-squares polynomial fit. Let us consider the other extreme. Suppose there are no ties among the xi and α is sufficiently small that the only positive neighborhood weight for the fit at xi is wi(xi) = 1. Then ĝ(xi) = yi, L is the identity matrix, and

    γ1 = γ2 = tr(L) = n

In between these extremes, the γk and tr(L) depend in a highly complex way on the xi and on the specification of the surface, that is, the choices of α, λ, and which predictors, if any, are conditionally parametric; the complexity defies characterizing the dependence.

We can, however, look at a special case where the xi have a regular structure, and gain some insight. Suppose there is one predictor and the xi are equally spaced. Also, suppose that λ = 1. As before, let q be αn truncated to an integer, that is, the number of points in a neighborhood. For fixed q, as n grows large, the operator matrix becomes well approximated by a circulant matrix. (In fact, if the points are equally spaced on a circle, and we take the distance between two points to be the minimum distance around the circle, the operator matrix is exactly circulant.) Let W̄ be an altered form of the tricube weight function:

    W̄(u) = c(1 − |2u|³)³ for |u| < 1/2, and 0 otherwise
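The α⁻¹ behaviour of tr(L) in this equally spaced setting is easy to check numerically. The sketch below is our own code, not the paper's: it places n points around a circle, forms the hat row of a local linear, tricube-weighted fit at one point, and uses the circulant symmetry, under which every diagonal element of L is equal; the computed trace comes out close to 1.72α⁻¹.

```python
import numpy as np

n, q = 2000, 200                                   # alpha = q / n = 0.1
j = np.arange(n)
d = np.minimum(j, n - j).astype(float)             # wrap-around distances to point 0
dq = np.sort(d)[q - 1]                             # distance to the q-th nearest point
w = np.clip(1.0 - (d / dq) ** 3, 0.0, None) ** 3   # tricube neighborhood weights
x = np.where(j <= n // 2, j, j - n).astype(float)  # signed coordinate around the circle
B = np.column_stack([np.ones(n), x])               # local linear fit (lambda = 1)
G = B.T @ (w[:, None] * B)
row = (np.linalg.solve(G, B[0]) @ B.T) * w         # hat row of the fit at point 0
trace_L = n * row[0]                               # all diagonal elements are equal
```

With these settings trace_L multiplied by α lands close to the constant 1.72 quoted below for the circulant approximation.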

where c is a constant that makes W̄ integrate to unity. Thus we have simply rescaled the domain and range of T. Let W̄* be the Fourier transform of W̄. Then for fixed q and large n,

    tr L^k ≈ α⁻¹ ∫_{-∞}^{∞} W̄*^k(ω) dω

Using this result we have

    tr(L) ≈ 1.72 α⁻¹

    γ2 ≈ 2.06 α⁻¹

and a similar constant for γ1. Now, instead of equality of these three quantities we have

    log(γ1/tr(L)) ≈ 0.17

    log(γ2/tr(L)) ≈ 0.17

One thing this does is to dispel the notion that we can predict the γk simply by taking the value of tr(L); an error of 17% is larger than we can tolerate. In fact, as we shall see shortly, the error would be far greater for p > 1. But the above log ratio will be helpful in selecting and interpreting the model for γ.

4.2. Model selection

We initially generated a variety of pilot data sets, explored them with graphical methods, and on the basis of this exploration and the properties of γ described earlier, chose the form of the model. The model response is

    r = log(γ/tr(L))

It made sense to study percentage deviations of γ from tr(L) because of the near-equality of the two for large and small values of α. The central predictor, tr(L), was transformed by the inverse square root since it resulted in a relatively simple structure for the dependence of γ on the predictors. It was convenient to scale the transformed variable so that it varied between 0 and 1; since tr(L) varies between ρ and n, the resulting transformation was

    z = (tr(L)^{-1/2} − n^{-1/2}) / (ρ^{-1/2} − n^{-1/2})

We selected three other predictors that were treated as categorical in the fitting of the selected model. The first is p, the number of predictors in the data set; for the data that we generated, p ranged from 1 to 4 since applications of local regression with more than four predictors are rare due to dimensionality problems. The second is λ, the degree of the parametric class being fitted locally, so λ is 1 or 2. The third, k, is 1 if γ1 is being predicted and is 2 if γ2 is being predicted. In effect, these three predictors divide the model up into 4 × 2 × 2 = 16 cases. Let ri be an observation of the response and let zi be the corresponding observation of z. The model we selected is semi-parametric:

    ri = θ_{pλk} zi^{α_{pλk}} (1 − zi)^{β_{pλk}} f(zi) εi

for p = 1 to 4, λ = 1 to 2, and k = 1 to 2, where f is a nonparametric function of z, and the εi are i.i.d. random variables. To make f identifiable, we add the boundary conditions f(0) = f(1) = 1. This form was chosen since plots of r against z for the pilot data sets showed patterns that looked like beta functions. This is not surprising since, from the results of Section 4.1, as α → ∞, we have z → 1 and r → 0, and as α gets small, we have z → 0 and r → 0.

4.3. Data for fitting the model

To fit the model, we generated data different from the data used in the pilot studies to identify the model. For each combination of p, λ, and k, we generated 32 data sets, selected a value of α for each, and computed values of r and z. For each data set, the sample size, n, was 300, and the xi were generated from a normal distribution with covariance matrix equal to the identity matrix. We wanted the resulting 32 values of z to be as close to uniformly distributed as possible. Thus we invoked the result of Section 4.1 that tr(L) behaves like α⁻¹ times a constant as α changes, and chose 30 values of α so that α^{1/2} was equally spaced from 1 to a value sufficiently small that z would be sufficiently close to 0, but not so small that we would encounter numerical problems; then to make z sufficiently close to 1, we added two values of α, 1.5 and 3.

Figure 7 is a graph of the values of r against the values of z for the 16 combinations of the categorical variables. Notice the extreme regularity of the dependence of r on z for the 16 cases, and the small amount of noise about the overall pattern; this is the phenomenon that allows our method to work.

4.4. Fitting the model

The model was fitted by the backfitting algorithm (Friedman and Stuetzle, 1981; Hastie and Tibshirani, 1990), with each iteration consisting of two steps. In one step, f was estimated by a loess fit with α = 1/4 and λ = 2. In the second step, the α and β parameters were estimated.

The estimation, to make it simpler, was done on the log scale. That is, we fitted the model

    log(ri) = θ′_{pλk} + α_{pλk} log(zi) + β_{pλk} log(1 − zi) + f′(zi) + ε′i
[Figure 7: sixteen panels, p = 1 to 4 by the combinations of λ and k, each graphing the response r against the predictor z.]

Fig. 7. Data generated for fitting of semi-parametric model



where θ′_{pλk} = log(θ_{pλk}), f′ = log(f), and ε′i = log(εi). f′ was updated by fitting loess to

    log(ri) − θ̂′_{pλk} − α̂_{pλk} log(zi) − β̂_{pλk} log(1 − zi)

as a function of zi, where θ̂′_{pλk}, α̂_{pλk}, and β̂_{pλk} were the current estimates of the parameters, and the parameter estimates were updated by least-squares regression of log(ri) − f̂′(zi) on a constant, log(zi), and log(1 − zi). Thus we fitted a model which, given p, λ, and k, is a generalized additive model with a parametric part and a nonparametric part (Hastie and Tibshirani, 1990). The successive estimates of f in the backfittings were made to satisfy the boundary constraints at z = 0 and z = 1. This was done by adding the values (0, 1) and (1, 1) to the (zi, ri) values, and giving these two points prior weight 100 and the remaining points prior weight 1; the effect of this is that any neighborhood weight for these two points is multiplied by 100, forcing the fit to go through 1 at the ends. Taking the log of r, which itself is a log of γ/tr(L), is not as outlandish as it might first seem. The generated values of γ/tr(L) were all greater than 1, which is not surprising since we showed that this is so for the equally spaced circular case in Section 4.1. (Note that we do not allow values of α that can actually yield a z of 0 or 1.)

4.5. The fitted model

The solid curves in Fig. 8 show the fitted model, which consists of 16 univariate functions. The dashed curves show the terms

    θ̂_{pλk} z^{α̂_{pλk}} (1 − z)^{β̂_{pλk}}

Figure 9 graphs f̂(z). The fitted model agrees with the numerical values we derived in Section 4.1 for λ = 1 and p = 1. There we saw that

    log(γk/tr(L)) ≈ 0.17

for k = 1 and 2. Figure 8 shows the maximum values of log(γk/tr(L)) are in fact in the vicinity of 0.17 for p = 1.

The model provides an excellent fit to the data. This is illustrated in Fig. 10, which graphs the sample distribution function of log(γ̂/γ), where γ̂ is the prediction of γ resulting from the fitted model. All values lie between ±0.03 and most lie between ±0.01. Since the natural log transformation is used in the figure, and since log(1 + x) ≈ x for |x| ≤ 0.25, the graph shows that the model predicts the observed values of γ to within ±3% in all cases and to within ±1% in most. This is more than enough accuracy for the purpose at hand, which is to compute degrees of freedom for t-intervals and F-tests.

4.6. From the laboratory to the field

The computer-generated data used in the pilot studies and model fitting are not representative of the entire population of data sets that users of loess fitting encounter. For one thing, the predictors were generated from normal distributions. Second, while we changed n during the pilot studies, it was equal to 300 for the data to which we fitted the model. Nor is it realistic to suppose that it would be possible to generate a representative collection. For this reason we make no attempt to use statistical inference based on the model fitting to characterize the errors of our method in practice. Instead, we simply went into the field, tried the predictive model on a large number of data sets, and compared with the actual values of γ1 and γ2 obtained by explicitly forming L. The resulting errors, not surprisingly, were larger, as a collection, than the errors for the generated data, but were still well within the limits of acceptable accuracy. Figure 11 shows results for the NOx data and the galaxy data studied in earlier sections. Values of log(γ̂/γ) are graphed against α, where γ̂ is the predicted value and γ the actual value. For the NOx data, deviations are within ±4.5%, and for the galaxy data, deviations are within ±12.4%. Both are acceptable accuracies for degree-of-freedom computations.

5. Discussion

5.1. Other work on computational methods for nonparametric regression

Much work has been reported on computational methods for nonparametric regression procedures of all types, particularly local regression procedures, smoothing splines, regression splines, kernel estimates, and projection pursuit. Since Hastie and Tibshirani (1990) have recently provided a comprehensive review, we cite here just a few examples. Projection pursuit regression (Friedman and Stuetzle, 1981; Friedman et al., 1983) is a highly computer-intensive approach that requires many new computational methods including a very fast updating method for local regression fitting for one predictor. Hastie and Tibshirani (1990) developed a local scoring algorithm for fitting generalized additive models. Bates et al. (1987) and Gu and Wahba (1988) discuss methods for carrying out generalized cross-validation for smoothing spline fits. Friedman and Silverman (1989), Friedman (1991), and Breiman (to appear) have investigated adaptive fitting methods using regression splines; a number of novel computational methods have been devised to carry out knot selection in an automated way. For pointers to the large numerical approximation literature, see Franke and Schumaker (1987) and Grosse (1990).
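Section 4.6 checks the model's predictions against values of γ1 and γ2 obtained by explicitly forming L. For small n that computation is direct. The sketch below is our own code and assumes the usual loess definitions δ1 = tr[(I − L)′(I − L)] and δ2 = tr{[(I − L)′(I − L)]²}, which are consistent with γk = n − δk and with the two limiting cases of Section 4.1; for a global least-squares quadratic fit it recovers γ1 = γ2 = tr(L) = ρ exactly.

```python
import numpy as np

n = 40
x = np.linspace(0.0, 1.0, n)
B = np.column_stack([np.ones(n), x, x ** 2])   # quadratic: rho = 3 monomial terms
L = B @ np.linalg.solve(B.T @ B, B.T)          # hat matrix of the least-squares fit
R = (np.eye(n) - L).T @ (np.eye(n) - L)
delta1 = np.trace(R)
delta2 = np.trace(R @ R)
gamma1, gamma2 = n - delta1, n - delta2
# limiting case of Section 4.1: gamma1 = gamma2 = tr(L) = rho = 3
```

For a loess operator matrix in place of the projection above, the same traces give the exact γk that the fitted model approximates.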
[Figure 8: sixteen panels, p = 1 to 4 by the combinations of λ and k, each graphing the fitted model against the predictor z.]

Fig. 8. Fitted semi-parametric model

[Figure 9: the estimate f̂(z) graphed against the predictor z.]

Fig. 9. Loess estimate of nonparametric function in the semi-parametric model

[Figure 10: sample distribution function of the natural log of prediction/actual.]

Fig. 10. Deviations of predicted values of γ from the actual values

[Figure 11: four panels, (a) k = 1, λ = 1; (b) k = 2, λ = 1; (c) k = 1, λ = 2; (d) k = 2, λ = 2; deviations graphed against the neighborhood parameter.]

Fig. 11. Deviations of predicted values of γ from the actual values. The circles graph results for the NOx data, and the plus signs graph results for the galaxy data
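The two-step backfitting fit of Section 4.4 can be sketched in a few lines. The code below is our own illustration: a crude running-mean smoother stands in for the loess step, the identifiability constraint is imposed by centering rather than by the prior-weight device described above, and the data are synthetic with f ≡ 1 and no noise, so the parametric step recovers θ, α, and β essentially exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, a, b = 1.5, 0.8, 0.6                  # true parameters (illustrative)
z = rng.uniform(0.05, 0.95, 200)
r = theta * z ** a * (1 - z) ** b            # model with f = 1 and no noise

def running_mean(u, v, k=15):
    """Crude stand-in for the loess smoother used in the paper."""
    order = np.argsort(u)
    vs = v[order]
    out = np.empty_like(v)
    for i in range(len(v)):
        lo, hi = max(0, i - k), min(len(v), i + k + 1)
        out[order[i]] = vs[lo:hi].mean()
    return out

X = np.column_stack([np.ones_like(z), np.log(z), np.log(1 - z)])
fhat = np.zeros_like(z)                      # current estimate of log f
for _ in range(10):                          # backfitting iterations
    # parametric step: regress log(r) - fhat on a constant, log z, log(1 - z)
    coef, *_ = np.linalg.lstsq(X, np.log(r) - fhat, rcond=None)
    # nonparametric step: smooth the residuals
    fhat = running_mean(z, np.log(r) - X @ coef)
    fhat -= fhat.mean()                      # crude identifiability constraint
theta_hat, a_hat, b_hat = np.exp(coef[0]), coef[1], coef[2]
```

With noisy data and a genuine loess smoother in the nonparametric step, this loop is the backfitting algorithm used for the model of Section 4.4.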
5.2. Future work

Although we have focused in this paper on k-d trees and blending to accelerate loess smoothing, it should be apparent that the same strategy could be pursued in several other directions. Other partitioning and interpolation techniques might be used, such as subsampling from the sample points and using finite element interpolation on a Delaunay triangulation. More radically, one could try k-d trees and blending with other local smoothing procedures, or abandon the local computations altogether and solve for coefficients as a single, sparse linear least-squares problem. In our method, the k-d tree is determined strictly by the sample locations and ignores the values of the smooth surface. This is the method we currently use in our implementation, and is necessary if linear algebra is to be shared across multiple data sets. It is a natural choice because commonly more samples will be taken in regions where more variation is anticipated. A better tactic might be to refine based on the remaining interpolation error. When making a cut, immediately smooth at the new vertex and compare with the interpolated smooth value there based on the parent cell. If the difference is below some tolerance, no further refinement is done. In some sense this would make the process nonlinear, but because there is a nearby fixed (and linearly determined) surface, the statistical procedures remain asymptotically valid.

5.3. FORTRAN routines

FORTRAN routines are available that implement the methods described here. They may be obtained electronically by sending the message 'send dloess from a' to the address [email protected]. To get the single-precision version, replace 'dloess' by 'loess' in the netlib request. If you wish to learn more about the mechanics of this distribution scheme, consult Dongarra and Grosse (1987).

References

Art, D., Gnanadesikan, R. and Kettenring, J. R. (1982) Data-based metrics for hierarchical cluster analysis. Utilitas Mathematica, 21A, 75-99.

Barnhill, R. E. (1977) Representation and approximation of surfaces, in Mathematical Software III, Academic Press, New York.

Bates, D. M., Lindstrom, M. J., Wahba, G. and Yandell, B. S. (1987) GCVPACK: routines for generalized cross-validation. Comm. Stat.-Simula., 16, 263-297.

Birkhoff, G., Cavendish, J. C. and Gordon, W. J. (1974) Multivariate approximation by locally blended univariate interpolants. Proc. National Academy of Sciences USA, 71, 3423-3425.

Breiman, L. (to appear) The Π method for estimating multivariate functions from noisy data. Technometrics.

Brinkman, N. D. (1981) Ethanol fuel: a single-cylinder engine study of efficiency and exhaust emissions. SAE Transactions, 90, no. 810345, 1410-1424.

Buta, R. (1987) The structure and dynamics of ringed galaxies, III: surface photometry and kinematics of the ringed nonbarred spiral NGC7531. The Astrophysical Journal, Supplement Ser., 64, 1-37.

Cavendish, J. C. (1975) Local mesh refinement using rectangular blended finite elements. J. Comp. Physics, 19, 211-228.

Chan, T. F.-C. (1982) Algorithm 581: An improved algorithm for computing the singular value decomposition. ACM Transactions on Mathematical Software, 8, 84-88.

Cleveland, W. S. (1979) Robust locally-weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74, 829-836.

Cleveland, W. S. and Grosse, E. (in press) Fitting Functions to Data, Wadsworth, Pacific Grove, Calif.

Cleveland, W. S., Devlin, S. J. and Grosse, E. (1988) Regression by local fitting: methods, properties, and computational algorithms. Journal of Econometrics, 37, 87-114.

Cleveland, W. S., Grosse, E. and Shyu, W. M. (1991) Local regression models, in Statistical Models in S, Chambers, J. M. and Hastie, T. (eds), Wadsworth, Pacific Grove, Calif.

Dongarra, J. J. and Grosse, E. (1987) Distribution of mathematical software via electronic mail. Communications of the ACM, 30, 403-407.

Farin, G. (1990) Curves and Surfaces for Computer Aided Design: A Practical Guide (2nd edn), Academic Press, New York.

Floyd, R. W. and Rivest, R. L. (1975) Expected time bounds for selection. Communications of the ACM, 18, 165-172.

Franke, R. and Schumaker, L. L. (1987) A bibliography of multivariate approximation, in Topics in Multivariate Approximation, Chui, C. K., Schumaker, L. L. and Utreras, F. (eds), Academic Press, New York.

Friedman, J. H. (1979) A tree-structured approach to nonparametric multiple regression, in Smoothing Techniques for Curve Estimation, Gasser, T. and Rosenblatt, M. (eds), Springer Verlag, New York.

Friedman, J. H. (1984) A variable span smoother. Technical Report LCS5, Dept. of Statistics, Stanford.

Friedman, J. H. (1991) Multivariate adaptive regression splines (with discussion). Ann. Statist., 19, 1-141.

Friedman, J. H., Bentley, J. L. and Finkel, R. A. (1977) An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3, 209-226.

Friedman, J. H., Grosse, E. H. and Stuetzle, W. (1983) Multidimensional additive spline approximation. SIAM J. Sci. Stat. Comp., 4, 291-301.

Friedman, J. H. and Silverman, B. W. (1989) Flexible parsimonious smoothing and additive modelling (with discussion). Technometrics, 31, 3-39.

Friedman, J. H. and Stuetzle, W. (1981) Projection pursuit regression. J. Amer. Statist. Assoc., 76, 817-823.

Gordon, W. J. (1969) Distributive lattices and the approximation of multivariate functions, in Proceedings of the Symposium on Approximation with Special Emphasis on Splines, Schoenberg, I. J. (ed), Academic Press, New York.

Grosse, E. (1990) A catalogue of algorithms for approximation, in Algorithms for Approximation II, Mason, J. and Cox, M. (eds), Chapman and Hall, London.

Gu, C. and Wahba, G. (1988) Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. Technical Report 847, Department of Statistics, University of Wisconsin.

Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models, Chapman and Hall, London.

Lancaster, P. and Salkauskas, K. (1986) Curve and Surface Fitting: An Introduction, Academic Press, New York.

Macauley, F. R. (1931) The Smoothing of Time Series, National Bureau of Economic Research, New York.

McLain, D. H. (1974) Drawing contours from arbitrary data points. Computer J., 17, 318-324.

Schumaker, L. L. (1976) Fitting surfaces to scattered data, in Approximation Theory II, Lorentz, G. G., Chui, C. K. and Schumaker, L. L. (eds), Academic Press, New York.

Stone, C. J. (1977) Consistent nonparametric regression. Ann. Stat., 5, 595-620.

Wahba, G. (1978) Improper priors, spline smoothing, and the problem of guarding against model errors in regression. J. R. Stat. Soc. B, 40, 364-372.

Watson, G. S. (1964) Smooth regression analysis. Sankhya A, 26, 359-372.
