
Projection Pursuit Regression

Author(s): Jerome H. Friedman and Werner Stuetzle

Source: Journal of the American Statistical Association, Vol. 76, No. 376 (Dec., 1981), pp. 817-823

Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association

Stable URL: https://www.jstor.org/stable/2287576

Projection Pursuit Regression
JEROME H. FRIEDMAN and WERNER STUETZLE*

A new method for nonparametric multiple regression is presented. The procedure models the regression surface as a sum of general smooth functions of linear combinations of the predictor variables in an iterative manner. It is more general than standard stepwise and stagewise regression procedures, does not require the definition of a metric in the predictor space, and lends itself to graphical interpretation.

KEY WORDS: Nonparametric regression; Smoothing; Projection pursuit; Surface approximation.

1. INTRODUCTION

In the regression problem, one is given a p-dimensional random vector X, the components of which are called predictor variables, and a random variable Y, which is called the response. The aim of regression analysis is to estimate the conditional expectation of Y given X on the basis of a sample {(x_i, y_i): i = 1, 2, ..., n}. Typically, one assumes that the functional form of the regression surface is known, reducing the problem to that of estimating a set of parameters. To the extent that this model is correct, such parametric procedures can be successful; unfortunately, model correctness is difficult to verify in practice, and an incorrect model can yield misleading results. For this reason, there is a growing interest in nonparametric methods, which make only a few very general assumptions about the regression surface.

The most extensively studied nonparametric regression techniques (kernel, nearest-neighbor, and spline smoothing) are based on p-dimensional local averaging: the estimate of the regression surface at a point x_0 is the average of the responses of those observations with predictors in a neighborhood of x_0. These techniques can be shown to have desirable asymptotic properties (Stone 1977). In high-dimensional settings, however, they do not perform well for reasonable sample sizes. The reason is the inherent sparsity of high-dimensional samples. This is illustrated by the following simple example: let X be uniformly distributed over the unit hypercube in R^10, and consider local averaging over hypercubical neighborhoods. If the dimensions of the neighborhood are chosen to cover 10 percent of the range of each coordinate, then it will (on the average) contain only (.1)^10 of the sample, and thus will nearly always be empty. If, on the other hand, one adjusts the neighborhood to contain 10 percent of the sample, it will cover (on the average) (.1)^{1/10}, or about 80 percent, of the range of each coordinate. This problem of sparsity basically limits the success of direct p-dimensional local averaging. In addition, these methods do not provide any comprehensible information about the nature of the regression surface.
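The arithmetic of this example is easy to verify. The short Python sketch below is an illustration added to this text, not material from the original paper; it simply reproduces the two numbers quoted above.

    # Sparsity arithmetic for local averaging in p = 10 dimensions
    # (illustration of the example in the text, not the authors' code).
    p = 10

    # A neighborhood spanning 10 percent of each coordinate's range contains,
    # on average, this fraction of a uniform sample on the unit hypercube:
    print(0.1 ** p)             # 1e-10: the neighborhood is nearly always empty

    # A neighborhood that must capture 10 percent of the sample instead spans
    # this fraction of each coordinate's range:
    print(0.1 ** (1.0 / p))     # about 0.794, i.e., roughly 80 percent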
The successful nonparametric regression procedures that have been proposed are based on successive refinement. A hierarchy of models of increasing complexity is formulated. The complexity of a model is the number of degrees of freedom used to fit it. The aim is to find the particular model that, when estimated from the data, best approximates the regression surface. The search usually proceeds through the hierarchy in a stepwise manner. At each step, the model of the subsequent level of the hierarchy that best fits the data is selected. Since the sample size limits the complexity of the models that can be used, these procedures will be successful to the extent that the regression surface can be approximated by models on levels of low complexity in the hierarchy.

Applying this concept with a hierarchy of polynomial functions of the predictors leads to the stepwise, stagewise, and all-subsets polynomial regression procedures. These procedures have proven to be successful in many applications. Unfortunately, regression surfaces occurring in practice often are not represented well by low-order polynomials (e.g., surfaces with asymptotes); use of higher-order polynomials is limited by considerations of sample size and computational feasibility.

A hierarchy of piecewise constant (Sonquist 1970) or piecewise linear (Breiman and Meisel 1976; Friedman 1979) models leads to recursive partitioning regression. These procedures basically operate as follows: for a particular predictor and a value of this predictor, the predictor space is split into two regions, one projecting to the left and the other to the right of the value. A separate constant or linear model is fit to the sample points lying in each region. The particular predictor and splitting value are chosen to minimize the residual sum of squares over the sample. The procedure is then recursively applied to each of the regions so obtained.

* Jerome H. Friedman is Group Leader, Computation Research Group, Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305. Werner Stuetzle is with the Stanford Linear Accelerator Center and is Assistant Professor, Department of Statistics, Stanford University, Stanford, CA 94305. The authors' work was supported by the Department of Energy under Contract DE-AC03-76SF00515. The authors gratefully acknowledge the help of Mark Jacobson during the early stages of this work and thank the editor for many helpful comments.

© Journal of the American Statistical Association, December 1981, Volume 76, Number 376, Theory and Methods Section

These recursive partitioning methods can be viewed as local averaging procedures, but unlike kernel and nearest-neighbor procedures, the local regions are adaptively constructed based on the nature of the response variation. In many situations, this results in dramatically improved performance. However, as each split reduces the sample over which further fitting can take place, the number of regions, and thus the number of separate models, is limited.

In this paper we apply the successive refinement concept in a new way that attempts to overcome the limitations of polynomial regression and recursive partitioning. The procedure is presented in Section 2. Univariate smoothing is discussed in Section 3; implementation specifics are considered in Section 4. In Section 5 we illustrate the procedure by applying it to several data sets. The merits of this procedure, relative to other nonparametric procedures, are discussed in Section 6. In Section 7 we relate projection pursuit regression to the projection pursuit technique for cluster analysis presented by Friedman and Tukey (1974).

2. THE ALGORITHM

The regression surface is approximated by a sum of empirically determined univariate functions S_{α_m} of linear combinations of the predictors:

\[ \varphi(X) = \sum_{m=1}^{M} S_{\alpha_m}(\alpha_m \cdot X), \tag{1} \]

where α_m · X denotes the inner product. The approximation is constructed in an iterative manner.

1. Initialize the current residuals and the term counter:

\[ r_i \leftarrow y_i, \quad i = 1, \ldots, n; \qquad M \leftarrow 0. \]

(We assume that the response is centered: \( \sum_{i=1}^{n} y_i = 0 \).)

2. Search for the next term in the model. For a given linear combination Z = α · X, construct a smooth representation S_α(Z) of the current residuals as ordered in ascending value of Z (see Sec. 3). Take as a figure of merit (criterion of fit) I(α) for this linear combination the fraction of so far unexplained variance that is explained by S_α:

\[ I(\alpha) = 1 - \sum_{i=1}^{n} \bigl( r_i - S_\alpha(\alpha \cdot x_i) \bigr)^2 \Big/ \sum_{i=1}^{n} r_i^2. \tag{2} \]

Find the coefficient vector (projection pursuit)

\[ \alpha_{M+1} = \operatorname*{arg\,max}_{\|\alpha\| = 1} I(\alpha) \]

and the corresponding smooth S_{α_{M+1}}.

3. Termination. If the figure of merit is smaller than a user-specified threshold, stop. (The last term is not included in the model.) Otherwise, update the current residuals and the term counter,

\[ r_i \leftarrow r_i - S_{\alpha_{M+1}}(\alpha_{M+1} \cdot x_i), \quad i = 1, \ldots, n; \qquad M \leftarrow M + 1, \]

and go to Step 2.

This procedure directly follows the successive refinement concept outlined in the previous section: the models at the mth level of the hierarchy are sums of m smooth functions of arbitrary linear combinations of the predictors.
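To make Steps 1 through 3 concrete, here is a minimal Python sketch of the iteration. It is an illustration added to this text, not the authors' FORTRAN program: smoother stands for any univariate scatterplot smoother (Section 3 describes the variable bandwidth smoother actually used), and find_direction stands for the numerical search over unit vectors described in Section 4.

    import numpy as np

    def fit_ppr(x, y, smoother, find_direction, threshold=0.1, max_terms=10):
        """Sketch of projection pursuit regression (Steps 1-3).

        x: (n, p) predictor matrix; y: (n,) centered response.
        smoother(z, r): returns a callable giving smoothed values of r vs. z.
        find_direction(x, r, smoother): returns a unit vector maximizing I(alpha).
        """
        r = y.copy()                  # Step 1: residuals initialized to the response
        terms = []                    # accepted (alpha, smooth) pairs; M = len(terms)
        while len(terms) < max_terms:
            alpha = find_direction(x, r, smoother)   # Step 2: projection pursuit
            z = x @ alpha
            s = smoother(z, r)        # smooth of current residuals vs. projection
            fit = s(z)
            merit = 1.0 - np.sum((r - fit) ** 2) / np.sum(r ** 2)  # I(alpha), eq. (2)
            if merit < threshold:     # Step 3: stop; the last term is not included
                break
            r = r - fit               # update residuals and continue
            terms.append((alpha, s))
        return terms

    def predict_ppr(terms, x):
        """Evaluate the model of eq. (1): a sum of smooths of projections."""
        return sum(s(x @ alpha) for alpha, s in terms)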
aging." The parameter k defines the bandwidth of the
smoother.
(We assume that the response is centered: E Yi 0.)
2. Search for the next term in the model. For a given The assumption underlying traditional smoothing pro-
linear combination Z = a * X, construct a smooth rep- cedures is that the observed responses yi are generated
resentation Sa(Z) of the current residuals as ordered in according to the model yi = f(xi) + Ei, Ei iid, E(Ei) = 0,
ascending value of Z (see Sec. 3). Take as a figure of f smooth. The resulting smooth S is then taken as an
merit (criterion of fit) I(a) for this linear combination estimate
the for f. Choosing too small a bandwidth will tend
fraction of so far unexplained variance that is explained to increase the variance -component of the mean squared
by Sa: error of the estimate, whereas too large a bandwidth may
n n increase the bias. The optimum bandwidth will, of course,
I(a) = 1 - z (ri - Sa(a . xi))2 ri2. (2) depend on f and the variance of es which are generally
i=l1 i=l unknown. Formal methods for estimating the optimal
bandwidth using cross-validation have been proposed
Find the coefficient
(projection pursuit) aM+ 1 = maxj`1I(a), and the cor- (Wahba and Wold 1975). Often, however, the degree of
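A quick numerical check of this decomposition (added here for illustration) confirms that the two quadratic ridge functions reproduce the interaction exactly:

    import numpy as np

    # Verify X1*X2 = S1(alpha1 . X) + S2(alpha2 . X) for the directions and
    # ridge functions given in the text.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(5, 2))

    alpha1 = np.array([1.0, 1.0]) / np.sqrt(2.0)
    alpha2 = np.array([1.0, -1.0]) / np.sqrt(2.0)
    s1 = 0.5 * (X @ alpha1) ** 2      # S1(Z) =  Z^2 / 2
    s2 = -0.5 * (X @ alpha2) ** 2     # S2(Z) = -Z^2 / 2

    print(np.allclose(s1 + s2, X[:, 0] * X[:, 1]))    # True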
3. UNIVARIATE SMOOTHING

The purpose of smoothing a set of observations {y_i, z_i}_{i=1}^{n}, sequenced in ascending order of z, is to produce a decomposition y_i = S(z_i) + r_i, where S is a smooth function and the r_i are called residuals. The degree of smoothness of a function S can be formally defined (e.g., \( \int S''^2(z)\,dz \)), but for the purpose of this discussion an intuitive notion of smoothness will be sufficient. Many procedures for smoothing have been described (Tukey 1977; Cleveland 1979; Gasser and Rosenblatt 1979). They are based on the notion of local averaging:

\[ S(z_i) = \operatorname*{AVE}_{i-k \le j \le i+k} (y_j), \]

with suitable adjustment for the boundaries. Here AVE can denote the mean, median, or other ways of "averaging." The parameter k defines the bandwidth of the smoother.

The assumption underlying traditional smoothing procedures is that the observed responses y_i are generated according to the model y_i = f(x_i) + ε_i, with the ε_i iid, E(ε_i) = 0, and f smooth. The resulting smooth S is then taken as an estimate for f. Choosing too small a bandwidth will tend to increase the variance component of the mean squared error of the estimate, whereas too large a bandwidth may increase the bias. The optimum bandwidth will, of course, depend on f and the variance of the ε_i, which are generally unknown. Formal methods for estimating the optimal bandwidth using cross-validation have been proposed (Wahba and Wold 1975). Often, however, the degree of smoothing is determined experimentally. One attempts to use as large a bandwidth as possible, subject to the smooth not lying systematically above or below the data in any region (oversmoothing).

Our design of a smoother is guided by the fact that the model underlying traditional smoothing procedures is not appropriate. Our model seeks to explain response variability by not just one smoothed sequence, but by a sum of smooths of several sequencings of the response (as induced by the several linear combinations of the predictors).

High local variability encountered in a particular sequence may be caused by smooth dependence of the response on other linear combinations. In order to preserve the ability of fitting this structure in further iterations, it is important to avoid accounting for it by spurious fits along existing directions. Consequently, we use a variable bandwidth smoother. An average smoother bandwidth is specified by the user. The actual bandwidth used for local averaging at a particular value of Z can be larger or smaller than the average bandwidth. Larger bandwidths are used in regions of high local variability of the response.

To reduce bias, especially at the ends of the sequence, we smooth by locally linear, rather than locally constant, fitting (Cleveland 1979). Furthermore, each observation is omitted from the local average that determines its smoothed value. This cross-validation makes the average squared residual a more realistic indicator of variability about the smooth (e.g., it is not possible to make the average squared residual arbitrarily small by reducing the bandwidth). To protect against isolated outliers, we use running medians of three (Tukey 1977) as a first pass in our smoother.

Our smoothing algorithm thus makes four passes over the data (a code sketch follows the list):

1. Running medians of three;
2. Estimating response variability at each point by the average squared residual of a locally linear fit with constant bandwidth;
3. Smoothing these variance estimates by a fixed-bandwidth moving average; and
4. Smoothing the sequence obtained by pass (1) by locally linear fits with bandwidths determined by the smoothed local variance estimates obtained in pass (3).
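The Python sketch below mimics these four passes. It is a simplified stand-in written for this text, not the authors' implementation: the boundary handling, the window shapes, and the mapping from smoothed local variance to bandwidth are assumptions made for illustration.

    import numpy as np

    def local_linear_fit(z, y, i, k):
        """Leave-one-out locally linear fit at index i with half-window k."""
        lo, hi = max(0, i - k), min(len(y), i + k + 1)
        idx = np.arange(lo, hi)
        idx = idx[idx != i]                    # omit the point itself
        slope, intercept = np.polyfit(z[idx], y[idx], 1)
        return intercept + slope * z[i]

    def variable_bandwidth_smooth(z, y, avg_span=0.3):
        """Simplified four-pass variable bandwidth smoother (Section 3)."""
        order = np.argsort(z)
        z, y = z[order], y[order]
        n = len(y)
        k = max(2, int(avg_span * n / 2))      # average half-bandwidth

        # Pass 1: running medians of three (endpoints left unchanged).
        y1 = y.astype(float).copy()
        for i in range(1, n - 1):
            y1[i] = np.median(y[i - 1:i + 2])

        # Pass 2: local variability = squared residual of a constant-bandwidth
        # locally linear fit at each point.
        resid2 = np.array([(y1[i] - local_linear_fit(z, y1, i, k)) ** 2
                           for i in range(n)])

        # Pass 3: smooth the variance estimates with a fixed-bandwidth
        # moving average.
        var = np.convolve(resid2, np.ones(2 * k + 1) / (2 * k + 1), mode="same")

        # Pass 4: locally linear fits with bandwidths enlarged where the
        # smoothed local variance is high (a crude but monotone mapping).
        scale = np.sqrt(var / (np.mean(var) + 1e-12))
        smooth = np.array([local_linear_fit(z, y1, i,
                                            int(np.clip(k * scale[i], 2, n // 2)))
                           for i in range(n)])
        return z, smooth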
4. IMPLEMENTATION

For a particular linear combination, the smoother yields a residual sum of squares from the corresponding smooth. The optimal linear combination is sought by numerical optimization. Considerations governing the choice of the optimization algorithm are that (a) the function evaluations are expensive (each one requires several passes over the data); (b) the search usually starts far from the solution; and (c) the search can be restricted to the unit sphere in R^p. For these reasons we chose a Rosenbrock method (Rosenbrock 1960) modified to search on the unit sphere. The search is started at the best coordinate direction. On any given search there is no guarantee that the global optimum will be found. If the local optimum is not acceptable, the search is restarted at random directions. This guards against premature termination. If the local optimum is acceptable but not identical to the global optimum, no great harm is done, because a new search is performed in the next iteration on an object function for which the previous optima have been deflated.

Projection pursuit regression can be implemented with or without readjustment of the smooths along previously determined linear combinations when a new linear combination has been found (backfitting). In the terminology of linear regression, this would correspond to the difference between a stepwise and a stagewise procedure. We have implemented the stepwise version.

In some situations it may be useful to restrict the search for solution directions to the set of predictors (projection selection) rather than allowing for linear combinations. Although the resulting additive model cannot represent completely general regression surfaces, it is still more general than linear regression in allowing for general smooth functions rather than only linear functions of the predictors. Projection selection is computationally less expensive than full projection pursuit, and the resulting models are often more easily interpreted. One could also run projection selection, followed by projection pursuit, thereby separating the additive and interactive parts of the model. Another strategy would be to run projection pursuit and get some easily interpreted linear combinations (as in Sec. 5, second example, with X_1 - X_2, X_4, X_5) and then run projection selection on these directions to see how much is lost. Forming a parametric model based on these directions is another possibility.
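The sketch below is again an illustration, not the authors' code: it substitutes a crude random-restart hill climb on the unit sphere for the modified Rosenbrock search, starting from the best coordinate direction as described above. It can serve as the find_direction helper assumed in the Section 2 sketch.

    import numpy as np

    def find_direction(x, r, smoother, n_restarts=5, n_steps=100, seed=0):
        """Search for a unit vector alpha maximizing the figure of merit I(alpha)."""
        rng = np.random.default_rng(seed)
        n, p = x.shape

        def merit(alpha):
            z = x @ alpha
            s = smoother(z, r)
            return 1.0 - np.sum((r - s(z)) ** 2) / np.sum(r ** 2)

        # Start at the best coordinate direction, then at random directions
        # (the restarts guard against premature termination).
        coords = [np.eye(p)[j] for j in range(p)]
        starts = [max(coords, key=merit)]
        starts += [v / np.linalg.norm(v) for v in rng.normal(size=(n_restarts, p))]

        best, best_val = None, -np.inf
        for alpha in starts:
            val, step = merit(alpha), 0.5
            for _ in range(n_steps):
                cand = alpha + step * rng.normal(size=p)  # perturb the direction ...
                cand /= np.linalg.norm(cand)              # ... and project to the sphere
                cval = merit(cand)
                if cval > val:
                    alpha, val = cand, cval
                else:
                    step *= 0.9                           # shrink the step on failure
            if val > best_val:
                best, best_val = alpha, val
        return best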
5. EXAMPLES

In this section we present and discuss the results of applying projection pursuit regression (PPR) to three data sets. (A FORTRAN program implementing the PPR procedure is available from the authors on request.) For all three examples the iteration was terminated when the figure of merit for the next term was less than .1. The average bandwidth of the one-dimensional smoother was taken to be 30 percent for the first two examples and 10 percent for the third. All predictors were standardized to have median zero and interquartile range one. (Widely different scales can cause problems for the numerical optimizer.)

The first example is artificially constructed to be especially simple in order to illustrate how PPR models interactions between predictors. A sample of 200 observations was generated according to the simplest interaction model Y = X_1 X_2 + ε with (X_1, X_2) uniformly distributed in (-1, 1) × (-1, 1) and ε ~ N(0, .04).
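Putting the earlier sketches together on this synthetic example (again as an illustration; the smoother is the simplified one sketched in Section 3, and the recovered directions are only identified up to sign):

    import numpy as np

    # Generate the first example's data: Y = X1*X2 + eps, 200 observations,
    # eps ~ N(0, .04), i.e., standard deviation .2.
    rng = np.random.default_rng(1)
    X = rng.uniform(-1.0, 1.0, size=(200, 2))
    y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.2, size=200)
    y = y - y.mean()                       # center the response

    # Adapt the four-pass smoother sketch to the interface used by fit_ppr:
    # return a callable via linear interpolation of the smoothed values.
    def smoother(z, r):
        zs, ss = variable_bandwidth_smooth(z, r)
        return lambda znew: np.interp(znew, zs, ss)

    terms = fit_ppr(X, y, smoother, find_direction, threshold=0.1)
    for alpha, _ in terms:
        print(np.round(alpha, 2))    # expect directions near (.71, .71) and (.71, -.71)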
Figure 1a shows Y plotted against X_2 with the corresponding smooth. Figure 1b shows Y plotted against the first linear combination Z_1 = α_1 · X, α_1 = (.71, .70), found by projection pursuit, with the corresponding smooth S_{α_1}(α_1 · X). Figure 1c shows the residuals r_1 = Y - S_{α_1}(α_1 · X) plotted against the second linear combination Z_2 = α_2 · X, α_2 = (.72, -.69), together with S_{α_2}(α_2 · X). Figure 1d shows the residuals r_2 = Y - S_{α_1}(α_1 · X) - S_{α_2}(α_2 · X) plotted against the third linear combination with the corresponding smooth. This projection was not accepted because the figure of merit was below the threshold. (Note that the figure of merit, as defined in equation (2), measures the improvement in goodness of fit.)

It is evident from inspection of Figure 1d that this projection does not substantially contribute to the model. The pure quadratic shapes of S_{α_1} and S_{α_2}, together with the corresponding coefficient vectors α_1 and α_2, reveal that PPR has essentially expressed the model Y = X_1 X_2 in the additive form Y = (1/4)(X_1 + X_2)² - (1/4)(X_1 - X_2)².

Figure 1a. Y = X_1 X_2 + ε, ε ~ N(0, .04), vs. X_2. (Y is plotted on the vertical axis, X_2 on the horizontal axis; + symbols represent data points, numbers indicate more than one data point, and the smooth is represented by * symbols. Scatterplot not reproduced.)

Figure 1b. Y vs. First Solution Linear Combination α_1 · X, α_1 = (.71, .70). (Scatterplot not reproduced.)

Figure 1c. Residuals From First Solution Smooth vs. Second Solution Linear Combination α_2 · X, α_2 = (.72, -.69). (Scatterplot not reproduced.)

Figure 1d. Residuals From First Two Solution Smooths vs. Third Solution Linear Combination α_3 · X, α_3 = (-.016, .99). (Scatterplot not reproduced.)

In the second example, PPR was applied to air pollution data. The data (213 observations) were taken from the contaminant and weather summary of the Bay Area Pollution Control District (Technical Services Division, 993 Ellis Street, San Francisco, CA 94109). In this example we study the relation between the amount of suspended particulate matter (Y) and the predictor variables mean wind speed (X_1), average temperature (X_2), insolation (X_3), and wind direction at 4:00 A.M. (X_4) and 4:00 P.M. (X_5) at the San Jose measuring station.

Three projections were accepted. Figures 2a through 2c show the three final smooths (after backfitting) plotted against their corresponding linear combinations. The points plotted are obtained by adding the residuals from the final model to each smooth. The first projection (Fig. 2a) shows that a good indicator of suspended particulate matter is (standardized) temperature minus wind speed. For small values of this indicator, the amount of pollution is seen to be roughly constant; for higher values, there is a strong linear dependence. The second smooth (Fig. 2b) and the corresponding direction (essentially X_4) show a much smaller pollutant dependence on 4:00 A.M. wind direction. The third projection (Fig. 2c) suggests an additional dependence on the 4:00 P.M. wind direction, but the effect, if any, is clearly small.

Figure 2a. Air Pollution (suspended particulate matter): First Solution Smooth S_{α_1}, α_1 = (.83, -.55, .0, .0, .10), With Residuals Added. (Scatterplot not reproduced.)

Figure 2b. Air Pollution: Second Solution Smooth S_{α_2}, α_2 = (.16, .29, .17, .91, .16), With Residuals Added. (Scatterplot not reproduced.)

Figure 2c. Air Pollution: Third Solution Smooth S_{α_3}, α_3 = (.16, .21, .01, -.05, .96), With Residuals Added. (Scatterplot not reproduced.)

In order to illustrate PPR on highly structured data, which are common in the physical sciences, we apply it to data taken from a particle physics experiment (Ballam et al. 1971). This data set (500 observations) is described in Friedman and Tukey (1974). Here we study the combined energy of the three π mesons (Y) as a function of the six other variables.

Figure 3a shows Y plotted against the first linear combination and the corresponding smooth found in the first iteration. Figures 3b through 3d show the final smooths (after backfitting) for the first three of the nine accepted projections. As in Figures 2a through 2c, we show the residuals from the final model added to the final smooths. Note the substantial change in the first smooth due to backfitting, which readjusts for later projections. Note also the striking nonlinearity in Figures 3c and 3d and the high degree of structuring in the data, expressed by the fact that the model explains over 99 percent of the variance.

Figure 3a. Combined Energy of Three π Mesons E_{3π} (particle physics data) vs. First Solution Linear Combination, α_1 = (.83, .54, .0, -.16, .0, .0), With Corresponding Smooth Found on the First Iteration. (Scatterplot not reproduced.)

Figure 3b. Particle Physics Data: First Solution Smooth S_{α_1}, α_1 = (.83, .54, .0, -.16, .0, .0), With Residuals Added. (Scatterplot not reproduced.)

Figure 3c. Particle Physics Data: Second Solution Smooth S_{α_2}, α_2 = (.0, .82, -.05, .0, -.33, .46), With Residuals Added. (Scatterplot not reproduced.)

Figure 3d. Particle Physics Data: Third Solution Smooth S_{α_3}, α_3 = (.14, .39, .69, .51, -.16, .26), With Residuals Added. (Scatterplot not reproduced.)

6. DISCUSSION

Although simple in concept, projection pursuit regression overcomes many limitations of other nonparametric regression procedures.

The sparsity limitation of kernel and nearest-neighbor techniques is not encountered, since all estimation (smoothing) is performed in a univariate setting. PPR does not require specification of a metric in the predictor space. Unlike recursive partitioning, PPR does not split the sample, thereby allowing, when necessary, more complex models. In addition, interactions of predictors are directly considered.

One can view linear regression, projection selection, and full projection pursuit as a group of regression procedures ordered in ascending generality. Linear regression models the regression surface as a sum of linear functions of the predictors.

Projection selection allows for nonlinearity by modeling with general smooth functions of the predictors. Full projection pursuit allows for interactions by modeling with general smooth functions of linear combinations of the predictors.

PPR is computationally quite feasible. For increasing sample size n, dimensionality p, and number of iterations M, the computation required to construct the model grows as M p n log(n).

As seen in the examples, an important feature of PPR is that the results of each iteration can be represented graphically, facilitating interpretation. This pictorial output can be used to adjust the main parameters of the procedure, that is, the average smoother bandwidth and the termination threshold.

The average bandwidth should be chosen as large as possible, subject to the avoidance of oversmoothing. In any projection, whether the smooth systematically deviates from the data is easily detected by visual inspection. Whether a particular projection effects a significant improvement in the model can be judged subjectively by viewing its smooth and the corresponding residuals. Lack of a systematic tendency of the smooth indicates that including this projection in the model would only increase the variance, while not reducing the bias. One can also employ a more formal procedure based on cross-validation (see Stone 1981).

The PPR procedure can clearly be applied to the residuals from any initial model. If the initial model does not fit the data well, PPR will so indicate by augmenting the model.

All stepwise procedures have difficulties modeling regression surfaces that cannot be well represented by models of low complexity in their hierarchy. Because models in PPR are sums of functions, each varying only along a single linear combination of the predictors, PPR has difficulties modeling regression surfaces that vary with equal strength along all possible linear combinations.

7. PROJECTION PURSUIT PROCEDURES

The idea of projection pursuit is not a new one. Interpreting high-dimensional data through the use of well-chosen lower-dimensional projections is a standard procedure in multivariate data analysis. The choice of a projection is usually guided by an appropriate figure of merit. If the goal is to preserve interpoint distances as well as possible, then the appropriate figure of merit is the variance of the projected data, leading to projection on the largest principal component. If the purpose is to separate two Gaussian samples with equal covariance matrices, the figure of merit is the error rate of a one-dimensional classification rule in the projection, leading to linear discriminant analysis. In both cases the figure of merit is especially simple, and the solution can be found by linear algebra. In a similar spirit, Friedman and Tukey (1974) suggest detecting clusters by searching for clustered projections. Their figure of merit measuring the degree of clustering in a projection (P index) is too complex to be optimized by linear algebra. Instead, the optimal projection was sought by numerical optimization; this was referred to as projection pursuit. As multivariate structure often will not be completely reflected in one projection, it is important to remove structure already discovered (deflate previous optima of the figure of merit), allowing the algorithm to find additional interesting projections. Friedman and Tukey suggest splitting the data into clusters, once a clustered projection has been found, and applying the procedure to the data in each of the clusters separately.

Projection pursuit regression follows a similar prescription. It constructs a model of the regression surface based on projections of the data onto planes spanned by the response Y and a linear combination α · X of the predictors. Here the figure of merit for a projection is the fraction of variance explained by a smooth of Y versus α · X. Structure is removed by forming the residuals from the smooth and substituting them for the response. The model at each iteration is the sum of the smooths that were previously subtracted and thus incorporates the structure so far found.

[Received February 1980. Revised April 1981.]

REFERENCES

BALLAM, J., CHADWICK, G.B., GUIRAGOSSIAN, Z.G.T., JOHNSON, W.B., LEITH, D.W.G.S., and MORIYASU, K. (1971), "Van Hove Analysis of the Reactions π⁻p → π⁻π⁻π⁺p and π⁺p → π⁺π⁺π⁻p at 16 GeV/c," Physical Review D, 4, 1946-1947.

BREIMAN, L., and MEISEL, W.S. (1976), "General Estimates of the Intrinsic Variability of Data in Nonlinear Regression Models," Journal of the American Statistical Association, 71, 301-307.

CLEVELAND, W.S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 829-836.

FRIEDMAN, J.H. (1979), "A Tree-Structured Approach to Nonparametric Multiple Regression," in Smoothing Techniques for Curve Estimation, eds. Th. Gasser and M. Rosenblatt, New York: Springer-Verlag, 5-22.

FRIEDMAN, J.H., and TUKEY, J.W. (1974), "A Projection Pursuit Algorithm for Exploratory Data Analysis," IEEE Transactions on Computers, C-23, 881-890.

GASSER, T., and ROSENBLATT, M. (eds.) (1979), Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics 757, New York: Springer-Verlag.

ROSENBROCK, H.H. (1960), "An Automatic Method for Finding the Greatest or Least Value of a Function," Computer Journal, 3, 175-184.

SONQUIST, J. (1970), "Multivariate Model Building: The Validation of a Search Strategy," report, Institute for Social Research, University of Michigan, Ann Arbor.

STONE, C.J. (1977), "Nonparametric Regression and Its Applications" (with discussion), Annals of Statistics, 5, 595-645.

STONE, C.J. (1981), "Admissible Selection of an Accurate and Parsimonious Normal Linear Regression Model," Annals of Statistics, 9, in press.

TUKEY, J.W. (1977), Exploratory Data Analysis, Reading, Mass.: Addison-Wesley.

WAHBA, G., and WOLD, S. (1975), "A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation," Communications in Statistics, 4, 1-17.

