Computational Methods for Local Regression
surface. Following the terminology for least-squares fitting, we refer to L as the hat matrix. Inferences about g utilize two quantities:

δ_k = tr [(I − L)′(I − L)]^k

for k = 1 and 2. δ_1 and δ_2 are used in the computation of degrees of freedom for t-intervals for g(x) and for F-tests to compare two loess fits. A supercomputer environment would be needed to perform exact computations of the δ_k routinely. In Section 4 we present a method of approximation. Instead of using tools of numerical analysis, as we do to approximate the fitted surface, we use a statistical approach. We generated a large number of data sets, each with a response and one or more predictors, and computed the δ_k for each. Then we fitted a semi-parametric model to the δ_k based on predictors that can be computed cheaply. The central predictor is the trace of L.

1.4. Discussion

In the final section of the paper we cite other work on computational methods for nonparametric regression. Also, information is given on electronic access to FORTRAN routines that implement the computational methods of this paper.

2. Local regression: specifications, fitting and inference

2.1. Local regression

In carrying out local regression we specify properties of the surface and errors, that is, we make certain assumptions about them. In all cases we suppose the ε_i are independent random variables with mean 0. We can specify a probability distribution for the ε_i. Here, we treat just one case; we assume that the ε_i are independent normal random variables with variance σ². Computationally, other cases typically involve iterations of normal-error estimates.

As we stated earlier, we suppose that in a certain neighborhood of x, the regression surface is well approximated by linear or quadratic polynomials. Thus we must specify the degree, λ. Sometimes it is useful to specify an intermediate model between λ = 1 and λ = 2 by discarding squares of some predictors (Cleveland et al., 1991), but we do not consider this specification here. The overall sizes of the neighborhoods are specified by a parameter, α, that will be defined in Section 2.2. Size, of course, implies a metric, and we will use Euclidean distance, but a variety of metrics in the space of the raw data can be achieved by transformation of the original variables, which amounts to a change in the definition of the x_i. At the very least, we typically want to standardize the scales of the predictor observations. Finally, we can specify a subset of the predictors to be conditionally parametric.

2.2. Loess

The loess fitting method is a definition of ĝ(x), the estimate of g at a specific value of x. The smoothness of the loess fit depends on the specification of the neighborhood parameter, α, the specification of λ, and the specification of conditionally parametric predictors.

Let Δ_i(x) be the Euclidean distance of x to x_i in the space of the predictors. Let Δ_(i)(x) be the values of these distances ordered from smallest to largest, and let

T(u) = (1 − u³)³ for 0 ≤ u < 1, and T(u) = 0 for u ≥ 1,

be the tricube weight function.

The neighborhood parameter, α, is positive. Suppose α ≤ 1. Let q be αn truncated to an integer. We define a weight for (x_i, y_i) by

w_i(x) = T(Δ_i(x) / Δ_(q)(x)).

For α > 1, the w_i(x) are defined in the same manner, but Δ_(q)(x) is replaced by Δ_(n)(x) α^(1/p), where p is the number of predictors. The w_i(x), which we will call the neighborhood weights, decrease or remain constant as x_i increases in distance from x.

If we have specified λ to be 1, a linear polynomial in x is fitted to the y_i using weighted least squares with the weights w_i(x); the value of this fitted polynomial at x is ĝ(x). If λ is 2, a quadratic is fitted.
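The computation just described is compact enough to sketch directly. The following minimal Python illustration is ours, not the distributed FORTRAN: it handles only the case α ≤ 1, the function names are our own, and the local polynomial is fitted by weighted least squares with the design centered at x so that ĝ(x) is the constant coefficient.

```python
# Sketch of the pointwise loess computation of Section 2.2 (alpha <= 1).
# An illustration only, not the distributed FORTRAN implementation.
import numpy as np

def tricube(u):
    """T(u) = (1 - u^3)^3 for 0 <= u < 1, and 0 for u >= 1."""
    return np.where(u < 1.0, (1.0 - np.clip(u, 0.0, 1.0)**3)**3, 0.0)

def loess_point(x0, X, y, alpha=0.5, lam=1):
    """Evaluate g-hat at x0 by locally weighted least squares.

    X: (n, p) predictors; y: (n,) response; alpha: neighborhood
    parameter (here <= 1); lam: the degree lambda, 1 or 2.
    """
    n, p = X.shape
    d = np.sqrt(((X - x0)**2).sum(axis=1))    # Delta_i(x0)
    q = max(2, int(alpha * n))                # q = alpha*n, truncated
    dq = np.sort(d)[q - 1]                    # Delta_(q)(x0)
    w = tricube(d / dq)                       # neighborhood weights w_i(x0)

    Z = X - x0                                # center so beta_0 = g-hat(x0)
    cols = [np.ones(n)] + [Z[:, j] for j in range(p)]
    if lam == 2:                              # all squares and cross products
        cols += [Z[:, j] * Z[:, k] for j in range(p) for k in range(j, p)]
    M = np.column_stack(cols)

    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(M * sw[:, None], y * sw, rcond=None)
    return beta[0]
```

For λ = 2 the design has r = (p + 1)(p + 2)/2 columns, the count that reappears in Section 4.1.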
2.3. Statistical inference

The normal-error loess estimate, ĝ(x), is linear in the y_i:

ĝ(x) = Σ_{i=1}^{n} l_i(x) y_i,

where the l_i(x) do not depend on the y_i. Suppose diagnostic methods have revealed that the specifications of the surface have not resulted in appreciable lack of fit of ĝ(x); we take this to mean that E ĝ(x) − g(x) is small. Suppose further that diagnostic checking has verified the specifications of the error terms in the model. Then the linearity of ĝ(x) results in distributional properties of the estimate that are very similar to those for classical parametric fitting. We will briefly review the properties.

Since ĝ(x) is linear in the y_i, the fitted value at x_i can be written

ŷ_i = Σ_{j=1}^{n} l_j(x_i) y_j.

Let L be the matrix whose (i, j)th element is l_j(x_i), and let L̃ = I − L.
We estimate σ by

s = (Σ_{i=1}^{n} ε̂_i² / δ_1)^(1/2)

from the residuals ε̂_i = y_i − ŷ_i. The standard deviation of ĝ(x) is

σ(x) = σ (Σ_{i=1}^{n} l_i²(x))^(1/2).

We estimate σ(x) by

s(x) = s (Σ_{i=1}^{n} l_i²(x))^(1/2).

Let ρ = δ_1²/δ_2; then (g(x) − ĝ(x))/s(x) is distributed approximately as a t random variable with ρ degrees of freedom.

[Figure: scatterplot with horizontal axis EW (arc seconds).]
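For modest n the inference quantities above can be assembled directly from an explicit operator matrix, as in the sketch below; forming L explicitly is exactly the expense that Section 4 is designed to avoid, and returning ρ = δ_1²/δ_2 follows the t-interval reading given above.

```python
# Inference quantities of Section 2.3 from an explicit operator matrix
# (rows l_j(x_i)); affordable only for modest n.
import numpy as np

def loess_inference(L, y, lx):
    """Return s, s(x), and rho = delta_1^2 / delta_2.

    L:  (n, n) operator matrix; y: (n,) responses;
    lx: (n,) coefficients l_i(x) at the evaluation point x.
    """
    n = len(y)
    Ltilde = np.eye(n) - L             # I - L
    R = Ltilde.T @ Ltilde
    delta1 = np.trace(R)               # delta_1
    delta2 = np.trace(R @ R)           # delta_2
    resid = y - L @ y                  # residuals eps-hat_i
    s = np.sqrt((resid**2).sum() / delta1)
    sx = s * np.sqrt((lx**2).sum())    # s(x) = s ||l(x)||
    rho = delta1**2 / delta2           # degrees of freedom for t-intervals
    return s, sx, rho
```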
3. Computational methods

Local fits are computed at the vertices of the tree, and blending functions (Lancaster and Salkauskas, 1986) are fitted to the cells to give a C¹ surface. This algorithm is apparently new, although it resembles transfinite elements on irregular grids (Birkhoff et al., 1974; Cavendish, 1975) and regression trees (Friedman, 1979) in some respects. The algorithm may be viewed as belonging to the category of two-stage methods (Schumaker, 1976), in which typically one maps scattered data onto a rectangular grid and then uses a grid-based interpolation scheme.

In this section, we first discuss certain numerical and algorithmic issues that need careful attention to evaluate the surface at a single point, then we describe the k-d tree and blending methods, then we treat performance, and finally we describe a procedure to modify the loess fitting method to produce a conditionally parametric surface.
3.1. Least squares

To get the value of the loess surface, ĝ(x), at a particular point x, the computational steps are the following: identify neighbors of x by calculating n distances, sort, and then solve a q × r linear least-squares problem. Note that, unlike some other nonparametric methods such as smoothing splines, which require the solution of a global linear system (Wahba, 1978), the loess surface is locally defined, which would allow large speed-up on parallel computers. For each x, the dominant cost is a classic linear algebra problem for which well-vectorized subroutines are widely available. The methods presented here, however, have been developed on sequential machines.

Even with an efficient O(n log n) procedure, the sorting to get Δ_(i)(x) can become a bottleneck, slowing down the overall computation noticeably. A little time can be saved by avoiding the square roots and instead sorting Δ_i²(x). On some machines, a little more time can be saved by arranging the distance calculation so that it vectorizes. A much more significant speedup is achieved by applying a classic algorithm (Floyd and Rivest, 1975) for selecting the qth largest in an unordered list.
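In a modern vector language the selection idea is a single call: numpy's argpartition, an introselect that stands in here for the Floyd-Rivest algorithm, finds the q nearest points in expected linear time, and working on squared distances means only one square root is taken.

```python
# Neighborhood identification without a full sort (Section 3.1).
import numpy as np

def neighborhood(X, x0, q):
    """Indices of the q nearest points to x0 and the radius Delta_(q)(x0)."""
    d2 = ((X - x0)**2).sum(axis=1)        # squared distances; vectorizes
    idx = np.argpartition(d2, q - 1)[:q]  # selection, not an O(n log n) sort
    return idx, np.sqrt(d2[idx].max())    # one square root, for the radius
```

Ties at the boundary radius are broken arbitrarily, which, as discussed below, is harmless under the tricube weight.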
If we have found the neighborhood for a point x and next need the neighborhood for a nearby point x′, can we do better than starting from scratch? One possibility would be to take advantage somehow of the triangle inequality, only updating the distances for points near the first neighborhood boundary. We have pursued another approach. During the partial sort we update a permutation vector instead of explicitly reordering the distance vector. By using the final permutation for x rather than the identity when starting the partial sort for x′, fewer comparisons and swaps may be needed.

What happens if there are several points on the boundary of the neighborhood? That is to say, suppose there are distinct i and j such that Δ_(j)(x) = Δ_(i)(x) = Δ_(q)(x). The procedure just described breaks ties arbitrarily and hence selects at random just enough points on the boundary to bring the total to q. Since the tricube weight function falls to 0 at the boundary, this is justified; boundary points make no difference in the least-squares local fit. In implementing a uniform weight function, in contrast, more care would be needed to ensure that all points on the boundary receive equal treatment.

Badly distributed observations may not uniquely determine all r monomial terms in the local regression. Consider locally-quadratic fitting in one predictor. If, because of multiplicities, there are only two distinct sample locations inside a neighborhood, then a quadratic polynomial is not uniquely determined. In more variables, something similar can occur if there are patterns in the distribution of predictors. The most common such situation involves a discrete variable, so that a neighborhood that otherwise seems large enough contains just points lying on one or two lines. Again there is no unique local quadratic. In such a situation, we use the pseudo-inverse to produce stable values for ĝ(x). Since q is typically much larger than r, a preliminary factorization X = QR into an r × r matrix R and a q × r matrix Q with Q′Q = I_r, followed by an SVD of R, allows the pseudo-inverse to be computed efficiently (Chan, 1982). The interpolation scheme to be described later uses not only ĝ(x), but also the slope of the local model at x. The pseudo-inverse has an effect like dropping the square term in the transverse predictor. This is the best that can be done automatically with the available local information, but it does imply that the interpolation has to work with less information. It is wise in such circumstances to make a surface plot and verify that no undesired flat spots result. Of course, the data analyst can eliminate the use of the pseudo-inverse by adding rows to X (increasing α) or removing columns from X (reducing λ or specifying conditionally parametric predictors).
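A sketch of that factorization strategy follows, with numpy routines standing in for the authors' code; the tolerance rule for discarding small singular values is our assumption.

```python
# Stable local coefficients via a thin QR of the q x r weighted design
# followed by an SVD of the small r x r factor, in the spirit of Chan (1982).
import numpy as np

def local_coeffs(M, b, tol=1e-10):
    """Minimum-norm least-squares solution of M beta ~ b, M of size q x r."""
    Q, R = np.linalg.qr(M)              # Q: q x r with Q'Q = I_r; R: r x r
    U, sig, Vt = np.linalg.svd(R)       # SVD of the small factor only
    cutoff = tol * (sig[0] if sig.size else 0.0)
    sig_inv = np.zeros_like(sig)
    np.divide(1.0, sig, out=sig_inv, where=sig > cutoff)  # pseudo-inverse
    return Vt.T @ (sig_inv * (U.T @ (Q.T @ b)))
```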
It is possible to save linear algebra cost for multiple data sets at the same sample points with the same α, at the cost of increased workspace requirements, by saving the pseudo-inverse coefficients. In the implementation of our computational methods, we provide this option. Associated with each vertex in the k-d tree is an r × q coefficient matrix and a length-q integer vector.

3.2. k-d trees

The k-d tree is a particular data structure (Friedman et al., 1977) for partitioning space by recursively cutting cells in half by a hyperplane orthogonal to one of the coordinate axes. It was originally designed for answering nearest-neighbor queries, but we use it in a different way. (Ironically, the tree is not of assistance in identifying the neighborhoods. The trouble is that α is commonly large enough that each neighborhood intersects most of the cells in the k-d tree.)

Let C be a rectangular cell containing the x_i and let h be a number between 0 and 1. To build the k-d tree, C is cut recursively, as in the sketch below.
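One plausible reading of the construction is sketched below, with the longest side of a cell cut at the median of its points (as in Friedman et al., 1977) until no cell holds more than hn points; both details are assumptions on our part.

```python
# Sketch of a k-d tree build for Section 3.2; the cut rule and the
# stopping rule are assumptions.
import numpy as np

class Cell:
    def __init__(self, lo, hi, idx):
        self.lo, self.hi, self.idx = lo, hi, idx  # cell bounds, member points
        self.cut_dim = self.cut_val = None
        self.left = self.right = None

def _build(X, idx, lo, hi, max_in_cell):
    cell = Cell(lo, hi, idx)
    if len(idx) <= max_in_cell:
        return cell                                # leaf cell
    j = int(np.argmax(hi - lo))                    # longest side of the cell
    cut = float(np.median(X[idx, j]))              # median cut (assumption)
    left, right = idx[X[idx, j] <= cut], idx[X[idx, j] > cut]
    if len(left) == 0 or len(right) == 0:
        return cell                                # degenerate cut: stop
    cell.cut_dim, cell.cut_val = j, cut
    hi_l, lo_r = hi.copy(), lo.copy()
    hi_l[j] = cut
    lo_r[j] = cut
    cell.left = _build(X, left, lo, hi_l, max_in_cell)
    cell.right = _build(X, right, lo_r, hi, max_in_cell)
    return cell

def build_kd(X, h):
    n = X.shape[0]
    return _build(X, np.arange(n), X.min(axis=0), X.max(axis=0),
                  max(1, int(h * n)))
```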
An even more sophisticated rotation would align principal axes of clusters in the data (Art et al., 1982). Rarely do we bother with such rotations.

We hope the number of vertices will be much smaller than n. This is at least true asymptotically, because the number of cells needed to fit the data properly depends on the smoothness of g(x), not on n. In Fig. 4 there are 66 vertices, so we solve just 66 least-squares problems instead of the thousands necessary to draw a contour plot.

Because of symmetries in the sample locations, it may happen that when building the k-d tree, two adjacent cells are cut in the same direction and at the same value. This introduces a double vertex on the shared cell side. A good rule of thumb in computational geometry is never to compute anything twice. Consistency rather than optimization is the crucial reason, and in this case we also wish to avoid double counting when computing tr(L) later. So before introducing a vertex we check the neighboring cell and use a pointer to an existing vertex if possible.
3.3. Blending

We now sketch the scheme used to interpolate based on the computation of ĝ at the cells. We use a method called blending, or Coons patch, or transfinite interpolation, developed in the automobile industry (Gordon, 1969) for handling cross-sections of clay models of a car. It is a method for generating smooth surfaces given function data along grid lines.

It is helpful to think of the case p = 2. Each cell boundary consists of four or more segments that meet at vertices. On each segment, function values are interpolated using the unique cubic polynomial determined by the function and derivative data at the vertices. Normal derivatives are interpolated linearly along the segment. In this way, we obtain function and derivative values all along the cell boundaries. Values are consistent from segment to segment and from cell to cell.
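On a single rectangle whose data sit only at the four corners, the construction reduces to a bicubic Hermite patch; the sketch below shows that special case, with the cross-derivatives set to zero, a choice the text returns to below. The full method also carries data at extra vertices along cell sides, which this sketch omits.

```python
# A one-cell special case of the blending interpolant of Section 3.3:
# bicubic Hermite interpolation on the unit square from corner values
# and slopes, with zero cross-derivatives.
import numpy as np

def hermite_basis(t):
    """Cubic Hermite basis (value at 0, value at 1, slope at 0, slope at 1)."""
    return np.array([2*t**3 - 3*t**2 + 1,
                     -2*t**3 + 3*t**2,
                     t**3 - 2*t**2 + t,
                     t**3 - t**2])

def patch_value(u, v, g, gu, gv):
    """g, gu, gv: 2 x 2 corner values and partial slopes, indexed [i, j]
    for the corner at (u, v) = (i, j)."""
    G = np.zeros((4, 4))
    G[:2, :2] = g          # function values at the corners
    G[:2, 2:] = gv         # d/dv at the corners
    G[2:, :2] = gu         # d/du at the corners
    # G[2:, 2:] stays zero: the cross-derivatives
    return hermite_basis(u) @ G @ hermite_basis(v)
```

Adjacent cells receive identical data along a shared edge, which is what yields first-derivative continuity across cell boundaries in this sketch.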
Finally, blending functions interpolate across the cell proper. This technique takes a certain combination of univariate interpolants in each variable separately to build, in our application, a surface with continuous first derivative. Although the calculation is not actually done this way, each cell is implicitly subdivided and on each piece a cubic polynomial is constructed.

We estimate partial derivatives at the vertices by slopes of the local model. This is quick but not the same as taking true derivatives of the underlying direct loess surface. It is interesting to note that with just a few more backsolves but no extra matrix factorizations, the correct derivatives can be obtained (Lancaster and Salkauskas, 1986, para. 10.4). Our implementation continues to use slopes rather than the true derivatives because we have not found any examples where the approximation leads to a poor smooth. In fact, as presently defined, the direct smooth ĝ will have discontinuous derivatives resulting from the discontinuous nature of the neighborhood radius. The radius would need to be smoothed somehow before exact derivative formulas would be useful.

A related issue is that we implicitly use cubic interpolation with zero values for the cross-derivatives ∂²f/∂u∂v at the rectangle corners. This is satisfactory for our purposes but produces flat spots that might be visible in a carefully rendered image of the surface, particularly if there are reflection lines that happen to pass near the corners. For a discussion of this phenomenon, see Barnhill (1977) and Farin (1990). If desired, it would be possible to provide correct cross-derivatives.

A special problem arises for higher dimensions that does not occur for p ≤ 2. Introducing vertices only on the sides of a cell can lead to adjacent cells in which sides introduced by independent cuts intersect. This is illustrated in Fig. 5. It is necessary to add all such points of intersection as vertices; otherwise, independent interpolation along the two edges gives inconsistent values at the common point.

Fig. 5. In three dimensions, extra vertices are introduced by division of adjacent cells.

The blending method is based on cubic polynomial interpolation, and it is well known that polynomial extrapolation is dangerous. So our computer implementation refuses to evaluate ĝ(x) for x outside the bounding rectangle of the observations. To avoid nuisances caused by rounding, the bounding box is expanded by 1% in each variable.

3.4. Statistical quantities

To carry out the computational method of the next section we need the diagonal of the hat matrix, L. The linear dependence of the blending coefficients at each vertex on q nearby data values can be computed from a least-squares system, Xβ = Wy. Since the blending interpolant at an arbitrary point x ultimately is just a linear combination of vertex data, it is only a modest amount of extra effort to accumulate any element of L.
If we have just smoothed at a point x_i, then we have a singular value decomposition X = UΣV′. The smooth value ĝ_i is just β̂_1, the coefficient of the constant term, and this is linearly related to the data y via the ith row of L:

L_i· = v_1 Σ⁺ U′ W,   (1)

where v_1 is the first row of V. A few backsolves and an inner product suffice to compute this row.
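With the conventions of the reconstruction above, X the weighted local design with its constant column first and W the diagonal neighborhood weights applied to y, the row can be sketched as follows; the exact weighting convention inside the authors' code is not visible here, so treat the scaling as an assumption.

```python
# Equation (1): the row of L expressing the smooth at x_i as a linear
# combination of the q responses in its neighborhood (full rank assumed).
import numpy as np

def hat_row(Xd, w):
    """Xd: (q, r) weighted design, constant column first; w: (q,) weights."""
    U, sig, Vt = np.linalg.svd(Xd, full_matrices=False)
    v1 = Vt[:, 0]                   # first row of V
    return ((v1 / sig) @ U.T) * w   # v_1 Sigma^+ U' W; g-hat_i = row @ y
```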
The calculation for the interpolation method is only slightly more complicated. Note that smoothing at vertex j uses a linear combination of data values, which we may write as β̂_j = F_j y, and blending uses a linear combination of vertex data. So there is a matrix F composed of the F_j and a (sparse) matrix B such that L = BF. A diagonal element of L is therefore an element of the hat matrix of the interpolated surface and not the hat matrix of the direct surface. Still, it is comforting that the interpolation surface can give a good approximation of the direct, since the direct method has been extensively investigated and its properties are well understood.

[Figure: Galaxy Data panel of Fig. 6.]

We will use the galaxy data and NOx data sets discussed in Sections 1 and 2 to illustrate the relationship of the direct and interpolation surfaces as h changes.
For the galaxy data, the spindly configuration of the points in the space of the predictors presents a challenge to the interpolation method to approximate the direct surface. For the fit shown in Fig. 2, λ = 2, α = 0.2, and h = 0.2/5 = 0.04, the standard value. Let us consider values of h of the form 0.95(1/2)^k for k = 1 to 6. This spans a large range; the largest value, 0.475, results in a tree with just four cells, and the smallest value, 0.015, is less than the standard value of h = 0.04. Also, any value of h in this range yields a tree that is the same as that for one of the six values we have selected. On the top panel of Fig. 6, the maximum absolute difference of the direct and interpolation surfaces evaluated at the x_i is graphed against log₂ h. The vertical dashed line segment shows the standard value of h. For this value, the absolute difference is 16 km s⁻¹, which is an insignificant departure in this application. Most importantly, diagnostics showed the standard surface fitted the data.

The bottom panel shows a similar analysis for the NOx data, where α = 2, λ = 2, and h = α/5. For the standard surface, the maximum absolute difference is 0.022 μg/J. This is 0.4% of the range of the fitted values, so if we added the direct curve to Fig. 1, the two curves would not be visually distinguishable.
3.6. Conditionally parametric surfaces

The method for making a loess fit conditionally parametric in a proper subset of the predictors is simple. The subset is ignored in computing the Euclidean distances that are used in the definition of the neighborhood weights, w_i(x). Let us use an example to explain why this produces a conditionally parametric surface. Suppose that there are two predictors, u and v, and λ = 2. Suppose we specify u to be a conditionally parametric predictor. Since the weight function ignores the variable u, the ith weight, w_i(u, v), for the fit at (u, v), is the same as the ith weight, w_i(u + t, v), for the fit at (u + t, v). Thus the quadratic polynomial that is fitted locally for the fit at (u, v) is the same as the quadratic polynomial that is fitted locally for the fit at (u + t, v), whatever the value of t. Thus for the fixed value of v, the surface is exactly this quadratic as a function of the first predictor. Finally, it is easy to see that our interpolation method preserves this property.
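In code the device is a one-liner: the ignored columns simply drop out of the distance computation (the mask interface below is our own).

```python
# Conditionally parametric predictors (Section 3.6): distances ignore
# the specified subset of the predictors.
import numpy as np

def distances_cond_par(X, x0, cond_par):
    """cond_par: boolean mask of length p; True marks ignored columns."""
    keep = ~np.asarray(cond_par)
    return np.sqrt(((X[:, keep] - x0[keep])**2).sum(axis=1))
```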
4. Computing degrees of freedom

In this section we present a method for computing δ_1 and δ_2 approximately. To make notation and formulas somewhat simpler, we will work with γ_k = n − δ_k.

We could have attempted to develop a numerical method to compute the γ_k approximately. Instead, because it seemed more promising, we took a statistical approach. We randomly generated predictor data sets; in other words, we generated values of x_i. For each data set, values of α and λ were assigned, and γ_1 and γ_2 were computed. We developed a semi-parametric model with γ_1 and γ_2 as a single response, γ, and with several predictors. The central predictor is tr(L), which can be computed cheaply. The other predictors are p, λ, n, and k, a categorical variable with two levels that describes whether γ_1 or γ_2 is being predicted. We fitted the model to the data, and the fitted model is used in applications to compute predicted values of the γ_k based on the values of tr(L), p, λ, n, and k.

4.1. The predictor tr(L)

The model for γ was identified by a combination of empirical study, which consisted of generating pilot data sets and studying them graphically, and of theory. The central idea of the model, however, using tr(L) to predict γ, arose from theoretical considerations, which we now describe.

Let r be the number of monomial terms in the parametric class being fitted locally, including the constant term. If λ = 1, r = p + 1, and if λ = 2, r = (p + 1)(p + 2)/2. Suppose that, instead of fitting a loess surface, we carried out a least-squares fit of the r polynomial terms. Then L is the hat matrix of the fit and

γ_1 = γ_2 = tr(L) = r.

This equality is the limiting case for loess as α → ∞, since the loess fit tends to the least-squares polynomial fit. Let us consider the other extreme. Suppose there are no ties among the x_i and α is sufficiently small that the only positive neighborhood weight for the fit at x_i is w_i(x_i) = 1. Then ĝ(x_i) = y_i, L is the identity matrix, and

γ_1 = γ_2 = tr(L) = n.

In between these extremes, the γ_k and tr(L) depend in a highly complex way on the x_i and on the specification of the surface, that is, the choices of α, λ, and which predictors, if any, are conditionally parametric; the complexity defies characterizing the dependence.

We can, however, look at a special case where the x_i have a regular structure, and gain some insight. Suppose there is one predictor and the x_i are equally spaced. Also, suppose that λ = 1. As before, let q be αn truncated to an integer, that is, the number of points in a neighborhood. For fixed q, as n grows large, the operator matrix becomes well approximated by a circulant matrix. (In fact, if the points are equally spaced on a circle, and we take the distance between two points to be the minimum distance around the circle, the operator matrix is exactly circulant.) Let W be an altered form of the tricube weight function:

W(u) = c(1 − |2u|³)³ for |u| < 1/2, and W(u) = 0 otherwise,
where c is a constant that makes W integrate to unity. Thus we have simply rescaled the domain and range of T. Let W* be the Fourier transform of W. Then for fixed q and large n,

tr(L^k) ≈ α⁻¹ ∫_{−∞}^{∞} W*^k(ω) dω.

Using this result we have

tr(L) ≈ 1.72 α⁻¹  and  γ_2 ≈ 2.06 α⁻¹.

Now, instead of equality of these three quantities we have

log(γ_1 / tr(L)) ≈ 0.17  and  log(γ_2 / tr(L)) ≈ 0.17.
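These limits are easy to probe numerically by forming the operator matrix for a small equally spaced design, as in the sketch below (a locally linear fit; the helper is ours, not the authors' procedure). Boundary effects mean the circulant constants are only approached in the interior.

```python
# Numerical probe of the Section 4.1 limits: one predictor, lambda = 1,
# equally spaced x_i. The operator matrix is formed explicitly, an
# illustration only; compare tr(L) with 1.72/alpha.
import numpy as np

def operator_matrix(x, alpha):
    n = len(x)
    q = max(2, int(alpha * n))
    L = np.zeros((n, n))
    for i in range(n):
        d = np.abs(x - x[i])
        dq = np.partition(d, q - 1)[q - 1]           # Delta_(q)(x_i)
        w = np.where(d / dq < 1, (1 - (d / dq)**3)**3, 0.0)
        M = np.column_stack([np.ones(n), x - x[i]])  # locally linear design
        sw = np.sqrt(w)
        L[i] = np.linalg.pinv(M * sw[:, None])[0] * sw
    return L

x = np.linspace(0.0, 1.0, 200)
for alpha in (0.2, 0.4, 0.8):
    L = operator_matrix(x, alpha)
    Lt = np.eye(len(x)) - L
    gamma1 = len(x) - np.trace(Lt.T @ Lt)            # gamma_1 = n - delta_1
    print(alpha, np.trace(L), 1.72 / alpha, gamma1)
```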
The second predictor, λ, is the degree of the parametric class being fitted locally, so λ is 1 or 2. The third, k, is 1 if γ_1 is being predicted and is 2 if γ_2 is being predicted. In effect, these three predictors divide the model up into 4 × 2 × 2 = 16 cases. Let r_i be an observation of the response and let z_i be the corresponding observation of z. The model we selected is semi-parametric:

r_i = θ_{pλk} z_i^{α_{pλk}} (1 − z_i)^{β_{pλk}} f(z_i) ε_i

for p = 1 to 4, λ = 1 to 2, and k = 1 to 2, where f is a nonparametric function of z, and the ε_i are i.i.d. random variables. To make f identifiable, we add the boundary conditions f(0) = f(1) = 1. This form was chosen since plots of r against z for the pilot data sets showed patterns that looked like beta functions. This is not surprising since, from the results of Section 4.1, as α → ∞, we have z → 1 and r → 0, and as α gets small, we have z → 0 and r → 0.

[Figure: r plotted against z in sixteen panels, labelled by λ = 1 and λ = 2.]
On the log scale the model is additive:

log(r_i) = θ′_{pλk} + α_{pλk} log(z_i) + β_{pλk} log(1 − z_i) + f′(z_i) + ε′_i,

where θ′_{pλk} = log(θ_{pλk}), f′ = log(f), and ε′_i = log(ε_i). f̂′ was updated by fitting loess to

log(r_i) − θ̂′_{pλk} − α̂_{pλk} log(z_i) − β̂_{pλk} log(1 − z_i)

as a function of z_i, where θ̂′_{pλk}, α̂_{pλk}, and β̂_{pλk} were the current estimates of the parameters, and the parameter estimates were updated by least-squares regression of log(r_i) − f̂′(z_i) on a constant, log(z_i), and log(1 − z_i). Thus we fitted a model which, given p, λ, and n, is a generalized additive model with a parametric part and a nonparametric part (Hastie and Tibshirani, 1990). The successive estimates of f in the backfittings were made to satisfy the boundary constraints at z = 0 and z = 1. This was done by adding the values (0, 1) and (1, 1) to the (z_i, r_i) values, and giving these two points prior weight 100 and the remaining points prior weight 1; the effect of this is that any neighborhood weight for these two points is multiplied by 100, forcing the fit to go through 1 at the ends. Taking the log of r, which itself is a log of γ/tr(L), is not as outlandish as it might first seem. The generated values of γ/tr(L) were all greater than 1, which is not surprising since we showed that this is so for the equally spaced circular case in Section 4.1. (Note that we do not allow values of α that can actually yield a z of 0 or 1.)
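A sketch of the backfitting cycle for one (p, λ, k) cell follows. The smoother loess1d is a hypothetical stand-in that accepts prior weights and returns a callable, and pinning the appended boundary points at 0 on the log scale, that is at f = 1, is our reading of the device just described.

```python
# One backfitting loop for a single (p, lambda, k) cell of the
# semi-parametric model; `loess1d(x, y, w)` is a hypothetical smoother.
import numpy as np

def fit_cell(z, r, loess1d, n_iter=10):
    """Estimate theta', a, b and f' = log f in
    log r = theta' + a log z + b log(1 - z) + f'(z) + noise.
    The z_i must lie strictly inside (0, 1), as the text requires."""
    lr, lz, l1z = np.log(r), np.log(z), np.log(1.0 - z)
    theta, a, b = 0.0, 0.0, 0.0
    fprime = np.zeros_like(z)
    M = np.column_stack([np.ones_like(z), lz, l1z])
    for _ in range(n_iter):
        # update f': smooth the partial residuals against z, with boundary
        # points appended at height 0 and prior weight 100
        part = lr - (theta + a * lz + b * l1z)
        smooth = loess1d(np.concatenate([z, [0.0, 1.0]]),
                         np.concatenate([part, [0.0, 0.0]]),
                         np.concatenate([np.ones_like(z), [100.0, 100.0]]))
        fprime = smooth(z)
        # update the parametric part by least-squares regression
        coef, *_ = np.linalg.lstsq(M, lr - fprime, rcond=None)
        theta, a, b = coef
    return theta, a, b, fprime
```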
4.6. From the laboratory to the field

The computer-generated data used in the pilot studies and model fitting are not representative of the entire population of data sets that users of loess fitting encounter. For one thing, the predictors were generated from normal distributions. Second, while we changed n during the pilot studies, it was equal to 300 for the data to which we fitted the model. Nor is it realistic to suppose that it would be possible to generate a representative collection. For this reason we make no attempt to use statistical inference based on the model fitting to characterize the errors of our method in practice. Instead, we simply went into the field, tried the predictive model on a large number of data sets, and compared with the actual values of γ_1 and γ_2 obtained by explicitly forming L. The resulting errors, not surprisingly, were larger, as a collection, than the errors for the generated data, but were still well within the limits of acceptable accuracy. Figure 11 shows results for the NOx data and the galaxy data studied in earlier sections. Values of log(γ̂/γ) are graphed against α, where γ̂ is the predicted value and γ is the actual value. For the NOx data, deviations are within ±4.5%, and for the galaxy data, deviations are within ±12.4%. Both are acceptable accuracies for degree-of-freedom computations.
Fig. 9. Loess estimate of the nonparametric function in the semi-parametric model.

Fig. 10. Deviations of predicted values of γ from the actual values.
[Figure panels: log(γ̂/γ) graphed against α; panels include (c) k = 1, λ = 2 and (d) k = 2, λ = 2.]

Fig. 11. Deviations of predicted values of γ from the actual values. The circles graph results for the NOx data, and the plus signs graph results for the galaxy data.
5.2. Future work

Although we have focused in this paper on k-d trees and blending to accelerate loess smoothing, it should be apparent that the same strategy could be pursued in several other directions. Other partitioning and interpolation techniques might be used, such as subsampling from the sample points and using finite element interpolation on a Delaunay triangulation. More radically, one could try k-d trees and blending with other local smoothing procedures, or abandon the local computations altogether and solve for coefficients as a single, sparse linear least-squares problem. In our method, the k-d tree is determined strictly by the sample locations and ignores the values of the smooth surface. This is the method we currently use in our implementation, and it is necessary if linear algebra is to be shared across multiple data sets. It is a natural choice because commonly more samples will be taken in regions where more variation is anticipated. A better tactic might be to refine based on the remaining interpolation error: when making a cut, immediately smooth at the new vertex and compare with the interpolated smooth value there based on the parent cell; if the difference is below some tolerance, no further refinement is done. In some sense this would make the process nonlinear, but because there is a nearby fixed (and linearly determined) surface, the statistical procedures remain asymptotically valid.
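The proposed refinement rule is easy to state as a predicate; the sketch below is ours, and the helpers smooth_at and blend_from_parent are hypothetical.

```python
# Error-driven refinement (Section 5.2, proposed): keep a cut only when
# the direct smooth at the new vertex disagrees with the value the
# parent cell's interpolant would assign there.
def keep_cut(vertex, parent_cell, smooth_at, blend_from_parent, tol):
    direct = smooth_at(vertex)                       # local regression
    interpolated = blend_from_parent(parent_cell, vertex)
    return abs(direct - interpolated) > tol          # refine only if needed
```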
5.3. FORTRAN routines

FORTRAN routines are available that implement the methods described here. They may be obtained electronically by sending the message 'send dloess from a' to the address [email protected]. To get the single-precision version, replace 'dloess' by 'loess' in the netlib request. If you wish to learn more about the mechanics of this distribution scheme, consult Dongarra and Grosse (1987).
References

Art, D., Gnanadesikan, R. and Kettenring, J. R. (1982) Data-based metrics for hierarchical cluster analysis. Utilitas Mathematica, 21A, 75-99.
Barnhill, R. E. (1977) Representation and approximation of surfaces, in Mathematical Software III, Academic Press, New York.
Bates, D. M., Lindstrom, M. J., Wahba, G. and Yandell, B. S. (1987) GCVPACK: routines for generalized cross-validation. Comm. Stat.-Simula., 16, 263-297.
Birkhoff, G., Cavendish, J. C. and Gordon, W. J. (1974) Multivariate approximation by locally blended univariate interpolants. Proc. National Academy of Sciences USA, 71, 3423-3425.
Breiman, L. (to appear) The Π method for estimating multivariate functions from noisy data. Technometrics.
Brinkman, N. D. (1981) Ethanol fuel: a single-cylinder engine study of efficiency and exhaust emissions. SAE Transactions, 90, no. 810345, 1410-1424.
Buta, R. (1987) The structure and dynamics of ringed galaxies, III: surface photometry and kinematics of the ringed nonbarred spiral NGC7531. The Astrophysical Journal, Supplement Ser., 64, 1-37.
Cavendish, J. C. (1975) Local mesh refinement using rectangular blended finite elements. J. Comp. Physics, 19, 211-228.
Chan, T. F.-C. (1982) Algorithm 581: An improved algorithm for computing the singular value decomposition. ACM Transactions on Mathematical Software, 8, 84-88.
Cleveland, W. S. (1979) Robust locally-weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74, 829-836.
Cleveland, W. S. and Grosse, E. (in press) Fitting Functions to Data, Wadsworth, Pacific Grove, Calif.
Cleveland, W. S., Devlin, S. J. and Grosse, E. (1988) Regression by local fitting: methods, properties, and computational algorithms. Journal of Econometrics, 37, 87-114.
Cleveland, W. S., Grosse, E. and Shyu, W. M. (1991) Local regression models, in Statistical Models in S, Chambers, J. M. and Hastie, T. (eds), Wadsworth, Pacific Grove, Calif.
Dongarra, J. J. and Grosse, E. (1987) Distribution of mathematical software via electronic mail. Communications of the ACM, 30, 403-407.
Farin, G. (1990) Curves and Surfaces for Computer Aided Design: A Practical Guide (2nd edn), Academic Press, New York.
Floyd, R. W. and Rivest, R. L. (1975) Expected time bounds for selection. Communications of the ACM, 18, 165-172.
Franke, R. and Schumaker, L. L. (1987) A bibliography of multivariate approximation, in Topics in Multivariate Approximation, Chui, C. K., Schumaker, L. L. and Utreras, F. (eds), Academic Press, New York.
Friedman, J. H. (1979) A tree-structured approach to nonparametric multiple regression, in Smoothing Techniques for Curve Estimation, Gasser, T. and Rosenblatt, M. (eds), Springer Verlag, New York.
Friedman, J. H. (1984) A variable span smoother. Technical Report LCS5, Dept. of Statistics, Stanford.
Friedman, J. H. (1991) Multivariate adaptive regression splines (with discussion). Ann. Statist., 19, 1-141.
Friedman, J. H., Bentley, J. L. and Finkel, R. A. (1977) An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3, 209-226.
Friedman, J. H., Grosse, E. H. and Stuetzle, W. (1983) Multidimensional additive spline approximation. SIAM J. Sci. Stat. Comp., 4, 291-301.
Friedman, J. H. and Silverman, B. W. (1989) Flexible parsimonious smoothing and additive modelling (with discussion). Technometrics, 31, 3-39.
Friedman, J. H. and Stuetzle, W. (1981) Projection pursuit regression. J. Amer. Statist. Assoc., 76, 817-823.
Gordon, W. J. (1969) Distributive lattices and the approximation of multivariate functions, in Proceedings of the Symposium on Approximation with Special Emphasis on Splines, Schoenberg, I. J. (ed), Academic Press, New York.
Grosse, E. (1990) A catalogue of algorithms for approximation, in Algorithms for Approximation II, Mason, J. and Cox, M. (eds), Chapman and Hall, London.
Gu, C. and Wahba, G. (1988) Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. Technical Report 847, Department of Statistics, University of Wisconsin.
Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models, Chapman and Hall, London.
Lancaster, P. and Salkauskas, K. (1986) Curve and Surface Fitting: An Introduction, Academic Press, New York.
Macauley, F. R. (1931) The Smoothing of Time Series, National Bureau of Economic Research, New York.
McLain, D. H. (1974) Drawing contours from arbitrary data points. Computer J., 17, 318-324.
Schumaker, L. L. (1976) Fitting surfaces to scattered data, in Approximation Theory II, Lorentz, G. G., Chui, C. K. and Schumaker, L. L. (eds), Academic Press, New York.
Stone, C. J. (1977) Consistent nonparametric regression. Ann. Stat., 5, 595-620.
Wahba, G. (1978) Improper priors, spline smoothing, and the problem of guarding against model errors in regression. J. R. Stat. Soc. B, 40, 364-372.
Watson, G. S. (1964) Smooth regression analysis. Sankhya A, 26, 359-372.