Greedy Function Approximation: A Gradient Boosting Machine
Author(s): Jerome H. Friedman
Source: The Annals of Statistics, Vol. 29, No. 5 (Oct., 2001), pp. 1189-1232
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/2699986

The Annals of Statistics
2001, Vol. 29, No. 5, 1189-1232

1999 REITZ LECTURE


GREEDY FUNCTION APPROXIMATION:
A GRADIENT BOOSTING MACHINE¹

BY JEROME H. FRIEDMAN

Stanford University
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and Friedman, Hastie and Tibshirani are discussed.

1. Function estimation. In the function estimation or "predictive learning" problem, one has a system consisting of a random "output" or "response" variable y and a set of random "input" or "explanatory" variables x = {x_1, ..., x_n}. Using a "training" sample {y_i, x_i}_1^N of known (y, x)-values, the goal is to obtain an estimate or approximation F̂(x), of the function F^*(x) mapping x to y, that minimizes the expected value of some specified loss function L(y, F(x)) over the joint distribution of all (y, x)-values,

(1)  F^* = \arg\min_F E_{y,x} L(y, F(x)) = \arg\min_F E_x\big[ E_y( L(y, F(x)) ) \mid x \big].

Frequently employed loss functions L(y, F) include squared-error (y - F)^2 and absolute error |y - F| for y ∈ R^1 (regression) and negative binomial log-likelihood, log(1 + e^{-2yF}), when y ∈ {-1, 1} (classification).

A common procedure is to restrict F(x) to be a member of a parameterized class of functions F(x; P), where P = {P_1, P_2, ...} is a finite set of parameters whose joint values identify individual class members.

Received May 1999; revised April 2001.
¹Supported in part by CSIRO Mathematical and Information Science, Australia; Department of Energy Contract DE-AC03-76SF00515; and NSF Grant DMS-97-64431.
AMS 2000 subject classifications. 62-02, 62-07, 62-08, 62G08, 62H30, 68T10.
Key words and phrases. Function estimation, boosting, decision trees, robust nonparametric regression.

on "additive" expansions of the form


M
(2) F(x;{fP3m,am}m) = E 3mh(X;am)
m=l

The (generic) function h(x; a) in (2) is usually a simple parameterized function of the input variables x, characterized by parameters a = {a_1, a_2, ...}. The individual terms differ in the joint values a_m chosen for these parameters. Such expansions (2) are at the heart of many function approximation methods such as neural networks [Rumelhart, Hinton and Williams (1986)], radial basis functions [Powell (1987)], MARS [Friedman (1991)], wavelets [Donoho (1993)] and support vector machines [Vapnik (1995)]. Of special interest here is the case where each of the functions h(x; a_m) is a small regression tree, such as those produced by CART™ [Breiman, Friedman, Olshen and Stone (1983)]. For a regression tree the parameters a_m are the splitting variables, split locations and the terminal node means of the individual trees.

1.1. Numerical optimization. In general, choosing a parameterized model F(x; P) changes the function optimization problem to one of parameter optimization,

(3)  P^* = \arg\min_P \Phi(P),

where

\Phi(P) = E_{y,x} L(y, F(x; P))

and then

F^*(x) = F(x; P^*).

For most F(x; P) and L, numerical optimization methods must be applied to solve (3). This often involves expressing the solution for the parameters in the form

(4)  P^* = \sum_{m=0}^{M} p_m,

where p_0 is an initial guess and {p_m}_1^M are successive increments ("steps" or "boosts"), each based on the sequence of preceding steps. The prescription for computing each step p_m is defined by the optimization method.

1.2. Steepest-descent. Steepest-descent is one of the simplest of the frequently used numerical minimization methods. It defines the increments {p_m}_1^M (4) as follows. First the current gradient g_m is computed:

g_m = \{g_{jm}\} = \left\{ \left[ \frac{\partial \Phi(P)}{\partial P_j} \right]_{P = P_{m-1}} \right\},

where

P_{m-1} = \sum_{i=0}^{m-1} p_i.

The step is taken to be

p_m = -\rho_m g_m,

where

(5)  \rho_m = \arg\min_\rho \Phi(P_{m-1} - \rho g_m).

The negative gradient -g_m is said to define the "steepest-descent" direction and (5) is called the "line search" along that direction.

2. Numerical optimization in function space. Here we take a "nonparametric" approach and apply numerical optimization in function space. That is, we consider F(x) evaluated at each point x to be a "parameter" and seek to minimize

\Phi(F) = E_{y,x} L(y, F(x)) = E_x\big[ E_y( L(y, F(x)) ) \mid x \big],

or equivalently,

\phi(F(x)) = E_y[ L(y, F(x)) \mid x ]

at each individual x, directly with respect to F(x). In function space there are an infinite number of such parameters, but in data sets (discussed below) only a finite number {F(x_i)}_1^N are involved. Following the numerical optimization paradigm we take the solution to be

F^*(x) = \sum_{m=0}^{M} f_m(x),

where f_0(x) is an initial guess, and {f_m(x)}_1^M are incremental functions ("steps" or "boosts") defined by the optimization method.
For steepest-descent,

(6)  f_m(x) = -\rho_m g_m(x)

with

g_m(x) = \left[ \frac{\partial \phi(F(x))}{\partial F(x)} \right]_{F(x)=F_{m-1}(x)} = \left[ \frac{\partial E_y[ L(y, F(x)) \mid x ]}{\partial F(x)} \right]_{F(x)=F_{m-1}(x)}

and

F_{m-1}(x) = \sum_{i=0}^{m-1} f_i(x).


Assuming sufficient regularity that one can interchange differentiation and integration, this becomes

(7)  g_m(x) = E_y\left[ \frac{\partial L(y, F(x))}{\partial F(x)} \,\Big|\, x \right]_{F(x)=F_{m-1}(x)}.

The multiplier \rho_m in (6) is given by the line search

(8)  \rho_m = \arg\min_\rho E_{y,x} L\big( y, F_{m-1}(x) - \rho g_m(x) \big).

3. Finite data. This nonparametric approach breaks down when the joint distribution of (y, x) is estimated by a finite data sample {y_i, x_i}_1^N. In this case E_y[· | x] cannot be estimated accurately by its data value at each x_i, and even if it could, one would like to estimate F^*(x) at x values other than the training sample points. Strength must be borrowed from nearby data points by imposing smoothness on the solution. One way to do this is to assume a parameterized form such as (2) and do parameter optimization as discussed in Section 1.1 to minimize the corresponding data based estimate of expected loss,

\{\beta_m, a_m\}_1^M = \arg\min_{\{\beta'_m, a'_m\}_1^M} \sum_{i=1}^{N} L\Big( y_i, \sum_{m=1}^{M} \beta'_m h(x_i; a'_m) \Big).

In situations where this is infeasible one can try a "greedy stagewise" approach. For m = 1, 2, ..., M,

(9)  (\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} L\big( y_i, F_{m-1}(x_i) + \beta h(x_i; a) \big)

and then

(10)  F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m).
Note that this stagewise strategy is different from stepwise approaches that readjust previously entered terms when new ones are added.

In signal processing this stagewise strategy is called "matching pursuit" [Mallat and Zhang (1993)] where L(y, F) is squared-error loss and the {h(x; a_m)}_1^M are called basis functions, usually taken from an overcomplete waveletlike dictionary. In machine learning, (9), (10) is called "boosting" where y ∈ {-1, 1} and L(y, F) is either an exponential loss criterion e^{-yF} [Freund and Schapire (1996), Schapire and Singer (1998)] or negative binomial log-likelihood [Friedman, Hastie and Tibshirani (2000), hereafter referred to as FHT00]. The function h(x; a) is called a "weak learner" or "base learner" and is usually a classification tree.
Suppose that for a particular loss L(y, F) and/or base learner h(x; a) the solution to (9) is difficult to obtain. Given any approximator F_{m-1}(x), the function \beta_m h(x; a_m) (9), (10) can be viewed as the best greedy step toward the data-based estimate of F^*(x) (1), under the constraint that the step "direction" h(x; a_m) be a member of the parameterized class of functions h(x; a). It can thus be regarded as a steepest descent step (6) under that constraint. By construction, the data-based analogue of the unconstrained negative gradient (7),

-g_m(x_i) = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)},

gives the best steepest-descent step direction -g_m = \{-g_m(x_i)\}_1^N in the N-dimensional data space at F_{m-1}(x). However, this gradient is defined only at the data points \{x_i\}_1^N and cannot be generalized to other x-values. One possibility for generalization is to choose that member of the parameterized class h(x; a_m) that produces h_m = \{h(x_i; a_m)\}_1^N most parallel to -g_m ∈ R^N. This is the h(x; a) most highly correlated with -g_m(x) over the data distribution. It can be obtained from the solution

(11)  a_m = \arg\min_{a, \beta} \sum_{i=1}^{N} \big[ -g_m(x_i) - \beta h(x_i; a) \big]^2.

This constrained negative gradient h(x; a_m) is used in place of the unconstrained one -g_m(x) (7) in the steepest-descent strategy. Specifically, the line search (8) is performed

(12)  \rho_m = \arg\min_\rho \sum_{i=1}^{N} L\big( y_i, F_{m-1}(x_i) + \rho h(x_i; a_m) \big)

and the approximation updated,

F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m).

Basically, instead of obtaining the solution under a smoothness constraint (9), the constraint is applied to the unconstrained (rough) solution by fitting h(x; a) to the "pseudoresponses" \{\tilde{y}_i = -g_m(x_i)\}_1^N (7). This permits the replacement of the difficult function minimization problem (9) by least-squares function minimization (11), followed by only a single parameter optimization based on the original criterion (12). Thus, for any h(x; a) for which a feasible least-squares algorithm exists for solving (11), one can use this approach to minimize any differentiable loss L(y, F) in conjunction with forward stagewise additive modeling. This leads to the following (generic) algorithm using steepest-descent.

ALGORITHM 1 (Gradient_Boost).
1. F_0(x) = \arg\min_\rho \sum_{i=1}^N L(y_i, \rho)
2. For m = 1 to M do:
3.   \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)},  i = 1, N
4.   a_m = \arg\min_{a, \beta} \sum_{i=1}^N [\tilde{y}_i - \beta h(x_i; a)]^2
5.   \rho_m = \arg\min_\rho \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m))
6.   F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)
7. endFor
end Algorithm
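To make the generic procedure concrete, the following is a minimal Python sketch of Algorithm 1 (not from the paper; function and parameter names are illustrative). It uses a least-squares regression tree from scikit-learn as the base learner h(x; a) and a one-dimensional numerical line search in place of the exact minimizations in lines 1 and 5.

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, neg_gradient, M=100, max_leaf_nodes=11):
    # loss(y, F) -> scalar; neg_gradient(y, F) -> vector of pseudoresponses (line 3).
    F0 = minimize_scalar(lambda c: loss(y, np.full(len(y), c))).x   # line 1
    F = np.full(len(y), F0)
    learners, rhos = [], []
    for m in range(M):                                              # line 2
        ytilde = neg_gradient(y, F)                                 # line 3
        h = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, ytilde)  # line 4
        hx = h.predict(X)
        rho = minimize_scalar(lambda r: loss(y, F + r * hx)).x      # line 5: line search
        F = F + rho * hx                                            # line 6
        learners.append(h)
        rhos.append(rho)
    return F0, learners, rhos

# Example losses: least-squares (Section 4.1) and LAD (Section 4.2).
ls_loss = lambda y, F: 0.5 * np.sum((y - F) ** 2)
ls_grad = lambda y, F: y - F
lad_loss = lambda y, F: np.sum(np.abs(y - F))
lad_grad = lambda y, F: np.sign(y - F)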

Note that any fitting criterion that estimates conditional expectation (given x) could in principle be used to estimate the (smoothed) negative gradient (7) at line 4 of Algorithm 1. Least-squares (11) is a natural choice owing to the superior computational properties of many least-squares algorithms.

In the special case where y ∈ {-1, 1} and the loss function L(y, F) depends on y and F only through their product L(y, F) = L(yF), the analogy of boosting (9), (10) to steepest-descent minimization has been noted in the machine learning literature [Rätsch, Onoda and Müller (1998), Breiman (1999)]. Duffy and Helmbold (1999) elegantly exploit this analogy to motivate their GeoLev and GeoArc procedures. The quantity yF is called the "margin" and the steepest-descent is performed in the space of margin values, rather than the space of function values F. The latter approach permits application to more general loss functions where the notion of margins is not apparent. Drucker (1997) employs a different strategy of casting regression into the framework of classification in the context of the AdaBoost algorithm [Freund and Schapire (1996)].

4. Applications: additive modeling. In this section the gradient boosting strategy is applied to several popular loss criteria: least-squares (LS), least absolute deviation (LAD), Huber (M), and logistic binomial log-likelihood (L). The first serves as a "reality check", whereas the others lead to new boosting algorithms.

4.1. Least-squares regression. Here L(y, F) = (y - F)^2/2. The pseudoresponse in line 3 of Algorithm 1 is \tilde{y}_i = y_i - F_{m-1}(x_i). Thus, line 4 simply fits the current residuals and the line search (line 5) produces the result \rho_m = \beta_m, where \beta_m is the minimizing \beta of line 4. Therefore, gradient boosting on squared-error loss produces the usual stagewise approach of iteratively fitting the current residuals.

ALGORITHM 2 (LS_Boost).
F_0(x) = \bar{y}
For m = 1 to M do:
  \tilde{y}_i = y_i - F_{m-1}(x_i),  i = 1, N
  (\rho_m, a_m) = \arg\min_{a, \rho} \sum_{i=1}^N [\tilde{y}_i - \rho h(x_i; a)]^2
  F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)
endFor
end Algorithm
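A sketch of Algorithm 2 under the same assumptions as above (illustrative names, scikit-learn trees); with squared-error loss the fitted leaf means of the least-squares tree already absorb the multiplier \rho_m, so no explicit line search is needed.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost(X, y, M=100, max_leaf_nodes=11):
    F0 = float(np.mean(y))                    # F_0(x) = ybar
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):
        residual = y - F                      # pseudoresponse: current residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residual)
        F = F + tree.predict(X)               # leaf means play the role of rho_m * b_jm
        trees.append(tree)
    return F0, trees

def ls_boost_predict(F0, trees, Xnew):
    return F0 + np.sum([t.predict(Xnew) for t in trees], axis=0)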

4.2. Least absolute deviation (LAD) regression. For the loss function L(y, F) = |y - F|, one has

(13)  \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} = \mathrm{sign}\big( y_i - F_{m-1}(x_i) \big).

This implies that h(x; a) is fit (by least-squares) to the sign of the current residuals in line 4 of Algorithm 1. The line search (line 5) becomes

(14)  \rho_m = \arg\min_\rho \sum_{i=1}^N \big| y_i - F_{m-1}(x_i) - \rho\, h(x_i; a_m) \big|
          = \arg\min_\rho \sum_{i=1}^N |h(x_i; a_m)| \cdot \left| \frac{y_i - F_{m-1}(x_i)}{h(x_i; a_m)} - \rho \right|
          = \mathrm{median}_W \left\{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; a_m)} \right\}, \qquad w_i = |h(x_i; a_m)|.

Here \mathrm{median}_W\{\cdot\} is the weighted median with weights w_i. Inserting these results [(13), (14)] into Algorithm 1 yields an algorithm for least absolute deviation boosting, using any base learner h(x; a).
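The weighted median in (14) can be computed by sorting and accumulating weights. A small illustrative helper (hypothetical, not from the paper):

import numpy as np

def weighted_median(values, weights):
    # Value v minimizing sum_i w_i * |values_i - v|, i.e., the weighted median in (14).
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, 0.5 * cum[-1])]

# Line search (14): rho_m = weighted_median of (y_i - F_{m-1}(x_i)) / h(x_i; a_m)
# with weights w_i = |h(x_i; a_m)| (observations with h(x_i; a_m) = 0 contribute nothing).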

4.3. Regression trees. Here we consider the special case where each base learner is a J-terminal node regression tree [Breiman, Friedman, Olshen and Stone (1983)]. Each regression tree model itself has the additive form

(15)  h\big( x; \{b_j, R_j\}_1^J \big) = \sum_{j=1}^{J} b_j \, 1(x \in R_j).

Here \{R_j\}_1^J are disjoint regions that collectively cover the space of all joint values of the predictor variables x. These regions are represented by the terminal nodes of the corresponding tree. The indicator function 1(\cdot) has the value 1 if its argument is true, and zero otherwise. The "parameters" of this base learner (15) are the coefficients \{b_j\}_1^J, and the quantities that define the boundaries of the regions \{R_j\}_1^J. These are the splitting variables and the values of those variables that represent the splits at the nonterminal nodes of the tree. Because the regions are disjoint, (15) is equivalent to the prediction rule: if x \in R_j then h(x) = b_j.

For a regression tree, the update at line 6 of Algorithm 1 becomes

(16)  F_m(x) = F_{m-1}(x) + \rho_m \sum_{j=1}^{J} b_{jm} \, 1(x \in R_{jm}).

Here \{R_{jm}\}_1^J are the regions defined by the terminal nodes of the tree at the mth iteration. They are constructed to predict the pseudoresponses \{\tilde{y}_i\}_1^N (line 3) by least-squares (line 4). The \{b_{jm}\} are the corresponding least-squares coefficients,

b_{jm} = \mathrm{ave}_{x_i \in R_{jm}} \tilde{y}_i.

The scaling factor \rho_m is the solution to the "line search" at line 5.


The update (16) can be alternatively expressed as

(17)  F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} \, 1(x \in R_{jm})

with \gamma_{jm} = \rho_m b_{jm}. One can view (17) as adding J separate basis functions at each step \{1(x \in R_{jm})\}_1^J, instead of a single additive one as in (16). Thus, in this case one can further improve the quality of the fit by using the optimal coefficients for each of these separate basis functions (17). These optimal coefficients are the solution to

\{\gamma_{jm}\}_1^J = \arg\min_{\{\gamma_j\}_1^J} \sum_{i=1}^{N} L\Big( y_i, F_{m-1}(x_i) + \sum_{j=1}^{J} \gamma_j \, 1(x_i \in R_{jm}) \Big).

Owing to the disjoint nature of the regions produced by regression trees, this reduces to

(18)  \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L\big( y_i, F_{m-1}(x_i) + \gamma \big).

This is just the optimal constant update in each terminal node region, based on the loss function L, given the current approximation F_{m-1}(x).

For the case of LAD regression (18) becomes

\gamma_{jm} = \mathrm{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i) \},

which is simply the median of the current residuals in the jth terminal node at the mth iteration. At each iteration a regression tree is built to best predict the sign of the current residuals y_i - F_{m-1}(x_i), based on a least-squares criterion. Then the approximation is updated by adding the median of the residuals in each of the derived terminal nodes.

ALGORITHM 3 (LAD_TreeBoost).
F_0(x) = \mathrm{median}\{y_i\}_1^N
For m = 1 to M do:
  \tilde{y}_i = \mathrm{sign}(y_i - F_{m-1}(x_i)),  i = 1, N
  \{R_{jm}\}_1^J = J-terminal node tree(\{\tilde{y}_i, x_i\}_1^N)
  \gamma_{jm} = \mathrm{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i) \},  j = 1, J
  F_m(x) = F_{m-1}(x) + \sum_{j=1}^J \gamma_{jm} 1(x \in R_{jm})
endFor
end Algorithm

This algorithm is highly robust. The trees use only order information on the individual input variables x_j, and the pseudoresponses ỹ_i (13) have only two values, ỹ_i ∈ {-1, 1}. The terminal node updates are based on medians.
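One possible realization of a single LAD_TreeBoost iteration with scikit-learn trees is sketched below (illustrative names, not the paper's implementation; `apply` is used only to recover each observation's terminal region so the leaf values can be replaced by medians):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lad_treeboost_step(X, y, F, max_leaf_nodes=11):
    # One iteration of Algorithm 3: least-squares tree on the sign of the residuals,
    # then each terminal node's value is replaced by the median residual in that node.
    residual = y - F
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, np.sign(residual))
    leaf = tree.apply(X)                                        # terminal region of each x_i
    gamma = {j: float(np.median(residual[leaf == j])) for j in np.unique(leaf)}
    F_new = F + np.array([gamma[j] for j in leaf])              # add gamma_jm 1(x in R_jm)
    return F_new, tree, gamma

def lad_update_on_new_data(tree, gamma, Xnew):
    return np.array([gamma[j] for j in tree.apply(Xnew)])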


An alternative approach would be to build a tree to directly minimize the loss criterion,

\mathrm{tree}_m(x) = \arg\min_{J\text{-node tree}} \sum_{i=1}^N \big| y_i - F_{m-1}(x_i) - \mathrm{tree}(x_i) \big|,

F_m(x) = F_{m-1}(x) + \mathrm{tree}_m(x).

However, Algorithm 3 is much faster since it uses least-squares to induce the trees. Squared-error loss is much more rapidly updated than mean absolute deviation when searching for splits during the tree building process.

4.4. M-Regression. M-regression techniques attempt resistance to long-tailed error distributions and outliers while maintaining high efficiency for normally distributed errors. We consider the Huber loss function [Huber (1964)]

(19)  L(y, F) = \begin{cases} \tfrac{1}{2}(y - F)^2, & |y - F| \le \delta, \\ \delta\big( |y - F| - \delta/2 \big), & |y - F| > \delta. \end{cases}

Here the pseudoresponse is

\tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} = \begin{cases} y_i - F_{m-1}(x_i), & |y_i - F_{m-1}(x_i)| \le \delta, \\ \delta \cdot \mathrm{sign}(y_i - F_{m-1}(x_i)), & |y_i - F_{m-1}(x_i)| > \delta, \end{cases}

and the line search becomes

(20)  \rho_m = \arg\min_\rho \sum_{i=1}^N L\big( y_i, F_{m-1}(x_i) + \rho h(x_i; a_m) \big)

with L given by (19). The solution to (19), (20) can be obtained by standard iterative methods [see Huber (1964)].

The value of the transition point δ defines those residual values that are considered to be "outliers," subject to absolute rather than squared-error loss. An optimal value will depend on the distribution of y - F*(x), where F* is the true target function (1). A common practice is to choose the value of δ to be the α-quantile of the distribution of |y - F*(x)|, where (1 - α) controls the breakdown point of the procedure. The "breakdown point" is the fraction of observations that can be arbitrarily modified without seriously degrading the quality of the result. Since F*(x) is unknown one uses the current estimate F_{m-1}(x) as an approximation at the mth iteration. The distribution of |y - F_{m-1}(x)| is estimated by the current residuals, leading to

\delta_m = \mathrm{quantile}_\alpha \big\{ |y_i - F_{m-1}(x_i)| \big\}_1^N.
With regression trees as base learners we use the strategy of Section 4.3, that is, a separate update (18) in each terminal node R_{jm}. For the Huber loss (19) the solution to (18) can be approximated by a single step of the standard iterative procedure [Huber (1964)] starting at the median

\tilde{r}_{jm} = \mathrm{median}_{x_i \in R_{jm}} \{ r_{m-1}(x_i) \},

where \{r_{m-1}(x_i)\}_1^N are the current residuals

r_{m-1}(x_i) = y_i - F_{m-1}(x_i).

The approximation is

\gamma_{jm} = \tilde{r}_{jm} + \frac{1}{N_{jm}} \sum_{x_i \in R_{jm}} \mathrm{sign}\big( r_{m-1}(x_i) - \tilde{r}_{jm} \big) \cdot \min\big( \delta_m, \mathrm{abs}(r_{m-1}(x_i) - \tilde{r}_{jm}) \big),

where N_{jm} is the number of observations in the jth terminal node. This gives the following algorithm for boosting regression trees based on Huber loss (19).

ALGORITHM 4 (M_TreeBoost).
F_0(x) = \mathrm{median}\{y_i\}_1^N
For m = 1 to M do:
  r_{m-1}(x_i) = y_i - F_{m-1}(x_i),  i = 1, N
  \delta_m = \mathrm{quantile}_\alpha \{ |r_{m-1}(x_i)| \}_1^N
  \tilde{y}_i = \begin{cases} r_{m-1}(x_i), & |r_{m-1}(x_i)| \le \delta_m \\ \delta_m \cdot \mathrm{sign}(r_{m-1}(x_i)), & |r_{m-1}(x_i)| > \delta_m \end{cases},  i = 1, N
  \{R_{jm}\}_1^J = J-terminal node tree(\{\tilde{y}_i, x_i\}_1^N)
  \tilde{r}_{jm} = \mathrm{median}_{x_i \in R_{jm}} \{ r_{m-1}(x_i) \},  j = 1, J
  \gamma_{jm} = \tilde{r}_{jm} + \frac{1}{N_{jm}} \sum_{x_i \in R_{jm}} \mathrm{sign}(r_{m-1}(x_i) - \tilde{r}_{jm}) \cdot \min(\delta_m, \mathrm{abs}(r_{m-1}(x_i) - \tilde{r}_{jm})),  j = 1, J
  F_m(x) = F_{m-1}(x) + \sum_{j=1}^J \gamma_{jm} 1(x \in R_{jm})
endFor
end Algorithm
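A sketch of one M_TreeBoost iteration under the same assumptions as the earlier sketches (illustrative names; alpha plays the role of the breakdown parameter):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def m_treeboost_step(X, y, F, alpha=0.9, max_leaf_nodes=11):
    # One iteration of Algorithm 4 (M_TreeBoost); illustrative sketch.
    r = y - F                                                     # current residuals
    delta = np.quantile(np.abs(r), alpha)                         # delta_m
    ytilde = np.where(np.abs(r) <= delta, r, delta * np.sign(r))  # Huber pseudoresponse
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, ytilde)
    leaf = tree.apply(X)
    gamma = {}
    for j in np.unique(leaf):
        rj = r[leaf == j]
        rmed = np.median(rj)                                      # start at the leaf median
        gamma[j] = rmed + np.mean(np.sign(rj - rmed) *
                                  np.minimum(delta, np.abs(rj - rmed)))  # one Huber step
    return F + np.array([gamma[j] for j in leaf]), tree, gamma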

According to the motivations underlying robust regression, this algorithm should have properties similar to that of least-squares boosting (Algorithm 2) for normally distributed errors, and similar to that of least absolute deviation regression (Algorithm 3) with very long-tailed distributions. For error distributions with only moderately long tails it can have performance superior to both (see Section 6.2).

4.5. Two-class logistic regression and classification. Here the loss function is negative binomial log-likelihood (FHT00),

L(y, F) = \log\big( 1 + \exp(-2yF) \big),  y ∈ {-1, 1},

where

(21)  F(x) = \frac{1}{2} \log\left[ \frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)} \right].

The pseudoresponse is

(22)  \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} = 2y_i \big/ \big( 1 + \exp(2 y_i F_{m-1}(x_i)) \big).

The line search becomes

\rho_m = \arg\min_\rho \sum_{i=1}^N \log\big( 1 + \exp(-2 y_i (F_{m-1}(x_i) + \rho h(x_i; a_m))) \big).

With regression trees as base learners we again use the strategy (Section 4.3) of separate updates in each terminal node R_{jm}:

(23)  \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} \log\big( 1 + \exp(-2 y_i (F_{m-1}(x_i) + \gamma)) \big).

There is no closed-form solution to (23). Following FHT00, we approximate it by a single Newton-Raphson step. This turns out to be

\gamma_{jm} = \sum_{x_i \in R_{jm}} \tilde{y}_i \Big/ \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|)

with ỹ_i given by (22). This gives the following algorithm for likelihood gradient boosting with regression trees.

ALGORITHM 5 (L2_TreeBoost).
F_0(x) = \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}}
For m = 1 to M do:
  \tilde{y}_i = 2 y_i / (1 + \exp(2 y_i F_{m-1}(x_i))),  i = 1, N
  \{R_{jm}\}_1^J = J-terminal node tree(\{\tilde{y}_i, x_i\}_1^N)
  \gamma_{jm} = \sum_{x_i \in R_{jm}} \tilde{y}_i \big/ \sum_{x_i \in R_{jm}} |\tilde{y}_i|(2 - |\tilde{y}_i|),  j = 1, J
  F_m(x) = F_{m-1}(x) + \sum_{j=1}^J \gamma_{jm} 1(x \in R_{jm})
endFor
end Algorithm
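A sketch of one iteration of Algorithm 5 for labels y ∈ {-1, +1} (illustrative names, not the paper's implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_treeboost_step(X, y, F, max_leaf_nodes=11):
    # One iteration of Algorithm 5; y takes values in {-1, +1}.
    ytilde = 2.0 * y / (1.0 + np.exp(2.0 * y * F))        # pseudoresponse (22)
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, ytilde)
    leaf = tree.apply(X)
    gamma = {}
    for j in np.unique(leaf):
        t = ytilde[leaf == j]
        gamma[j] = t.sum() / np.sum(np.abs(t) * (2.0 - np.abs(t)))  # Newton step per node
    return F + np.array([gamma[j] for j in leaf]), tree, gamma

# F_0(x) = 0.5 * log((1 + ybar) / (1 - ybar)); probabilities from the final score via
# p_plus(x) = 1 / (1 + exp(-2 * F_M(x))).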

The final approximation F_M(x) is related to log-odds through (21). This can be inverted to yield probability estimates

p_+(x) = \Pr(y = 1 \mid x) = 1/(1 + e^{-2F_M(x)}),

p_-(x) = \Pr(y = -1 \mid x) = 1/(1 + e^{2F_M(x)}).

These in turn can be used for classification,

\hat{y}(x) = 2 \cdot 1\big[ c(-1, 1)\, p_+(x) > c(1, -1)\, p_-(x) \big] - 1,

where c(\hat{y}, y) is the cost associated with predicting \hat{y} when the truth is y.


4.5.1. Influence trimming. The empirical loss function for the two-class logistic regression problem at the mth iteration is

(24)  \phi_m(\rho, a) = \sum_{i=1}^N \log\big[ 1 + \exp(-2 y_i F_{m-1}(x_i)) \cdot \exp(-2 y_i \rho h(x_i; a)) \big].

If y_i F_{m-1}(x_i) is very large, then (24) has almost no dependence on \rho h(x_i; a) for small to moderate values near zero. This implies that the ith observation (y_i, x_i) has almost no influence on the loss function, and therefore on its solution

(\rho_m, a_m) = \arg\min_{\rho, a} \phi_m(\rho, a).

This suggests that all observations (y_i, x_i) for which y_i F_{m-1}(x_i) is relatively very large can be deleted from all computations of the mth iteration without having a substantial effect on the result. Thus,

(25)  w_i = \exp(-2 y_i F_{m-1}(x_i))

can be viewed as a measure of the "influence" or weight of the ith observation on the estimate \rho_m h(x; a_m).
More generally, from the nonparametric function space perspective of Section 2, the parameters are the observation function values \{F(x_i)\}_1^N. The influence on an estimate of changes in a "parameter" value F(x_i) (holding all the other parameters fixed) can be gauged by the second derivative of the loss function with respect to that parameter. Here this second derivative at the mth iteration is |\tilde{y}_i|(2 - |\tilde{y}_i|) with ỹ_i given by (22). Thus, another measure of the influence or "weight" of the ith observation on the estimate \rho_m h(x; a_m) at the mth iteration is

(26)  w_i = |\tilde{y}_i| (2 - |\tilde{y}_i|).

Influence trimming deletes all observations with w_i-values less than w_{l(\alpha)}, where l(\alpha) is the solution to

(27)  \sum_{i=1}^{l(\alpha)} w_{(i)} = \alpha \sum_{i=1}^{N} w_i.

Here \{w_{(i)}\}_1^N are the weights \{w_i\}_1^N arranged in ascending order. Typical values are \alpha \in [0.05, 0.2]. Note that influence trimming based on (25), (27) is identical to the "weight trimming" strategy employed with Real AdaBoost, whereas (26), (27) is equivalent to that used with LogitBoost, in FHT00. There it was seen that 90% to 95% of the observations were often deleted without sacrificing accuracy of the estimates, using either influence measure. This results in a corresponding reduction in computation by factors of 10 to 20.
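A sketch of the influence-trimming rule (27) (illustrative helper; the exact tie-handling at the threshold is an implementation detail not specified here):

import numpy as np

def influence_trim_mask(w, alpha=0.1):
    # Approximately implements (27): drop the observations with the smallest
    # influences w_i whose cumulative influence is at most alpha * sum(w).
    order = np.argsort(w)
    cum = np.cumsum(w[order])
    n_drop = np.searchsorted(cum, alpha * w.sum())
    keep = np.ones(len(w), dtype=bool)
    keep[order[:n_drop]] = False
    return keep

# Two-class influence measures: w_i = exp(-2 * y_i * F(x_i))          (25)
# or                            w_i = abs(yt_i) * (2 - abs(yt_i))      (26)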


4.6. Multiclass logistic regression and classification. Here we develop a gradient-descent boosting algorithm for the K-class problem. The loss function is

(28)  L\big( \{y_k, F_k(x)\}_1^K \big) = -\sum_{k=1}^{K} y_k \log p_k(x),

where y_k = 1(class = k) ∈ {0, 1}, and p_k(x) = \Pr(y_k = 1 \mid x). Following FHT00, we use the symmetric multiple logistic transform

(29)  F_k(x) = \log p_k(x) - \frac{1}{K} \sum_{l=1}^{K} \log p_l(x),

or equivalently

(30)  p_k(x) = \exp(F_k(x)) \Big/ \sum_{l=1}^{K} \exp(F_l(x)).

Substituting (30) into (28) and taking first derivatives one has

(31)  \tilde{y}_{ik} = -\left[ \frac{\partial L\big( \{y_{il}, F_l(x_i)\}_1^K \big)}{\partial F_k(x_i)} \right]_{\{F_l(x)=F_{l,m-1}(x)\}_1^K} = y_{ik} - p_{k,m-1}(x_i),

where p_{k,m-1}(x) is derived from F_{k,m-1}(x) through (30). Thus, K trees are induced at each iteration m to predict the corresponding current residuals for each class on the probability scale. Each of these trees has J terminal nodes, with corresponding regions \{R_{jkm}\}_{j=1}^J. The model updates \gamma_{jkm} corresponding to these regions are the solution to

\{\gamma_{jkm}\} = \arg\min_{\{\gamma_{jk}\}} \sum_{i=1}^{N} \sum_{k=1}^{K} \phi\Big( y_{ik}, F_{k,m-1}(x_i) + \sum_{j=1}^{J} \gamma_{jk} 1(x_i \in R_{jm}) \Big),

where \phi(y_k, F_k) = -y_k \log p_k from (28), with F_k related to p_k through (30). This has no closed form solution. Moreover, the regions corresponding to the different class trees overlap, so that the solution does not reduce to a separate calculation within each region of each tree in analogy with (18). Following FHT00, we approximate the solution with a single Newton-Raphson step, using a diagonal approximation to the Hessian. This decomposes the problem into a separate calculation for each terminal node of each tree. The result is

(32)  \gamma_{jkm} = \frac{K-1}{K} \cdot \frac{\sum_{x_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} |\tilde{y}_{ik}| (1 - |\tilde{y}_{ik}|)}.

This leads to the following algorithm for K-class logistic gradient boosting.

ALGORITHM 6 (LK_TreeBoost).
F_{k0}(x) = 0,  k = 1, K
For m = 1 to M do:
  p_k(x) = \exp(F_k(x)) \big/ \sum_{l=1}^K \exp(F_l(x)),  k = 1, K
  For k = 1 to K do:
    \tilde{y}_{ik} = y_{ik} - p_k(x_i),  i = 1, N
    \{R_{jkm}\}_1^J = J-terminal node tree(\{\tilde{y}_{ik}, x_i\}_1^N)
    \gamma_{jkm} = \frac{K-1}{K} \sum_{x_i \in R_{jkm}} \tilde{y}_{ik} \Big/ \sum_{x_i \in R_{jkm}} |\tilde{y}_{ik}|(1 - |\tilde{y}_{ik}|),  j = 1, J
    F_{km}(x) = F_{k,m-1}(x) + \sum_{j=1}^J \gamma_{jkm} 1(x \in R_{jkm})
  endFor
endFor
end Algorithm
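A sketch of one iteration of Algorithm 6, keeping an N x K score matrix (illustrative names; the max-subtraction in the softmax is a standard numerical-stability device, not part of the paper):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lk_treeboost_step(X, Y, Fs, max_leaf_nodes=11):
    # One iteration of Algorithm 6. Y is an N x K one-hot matrix, Fs an N x K
    # matrix of current scores F_{k,m-1}(x_i).
    K = Y.shape[1]
    P = np.exp(Fs - Fs.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                    # p_k(x) via (30)
    trees = []
    for k in range(K):
        ytilde = Y[:, k] - P[:, k]                       # pseudoresponse (31)
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, ytilde)
        leaf = tree.apply(X)
        for j in np.unique(leaf):
            t = ytilde[leaf == j]
            gamma = ((K - 1) / K) * t.sum() / np.sum(np.abs(t) * (1.0 - np.abs(t)))  # (32)
            Fs[leaf == j, k] += gamma
        trees.append(tree)
    return Fs, trees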

The final estimates \{F_{kM}(x)\}_1^K can be used to obtain corresponding probability estimates \{p_{kM}(x)\}_1^K through (30). These in turn can be used for classification,

\hat{k}(x) = \arg\min_{1 \le k \le K} \sum_{k'=1}^{K} c(k, k') \, p_{k'M}(x),

where c(k, k') is the cost associated with predicting the kth class when the truth is k'. Note that for K = 2, Algorithm 6 is equivalent to Algorithm 5.

Algorithm 6 bears a close similarity to the K-class LogitBoost procedure of FHT00, which is based on Newton-Raphson rather than gradient descent in function space. In that algorithm K trees were induced, each using corresponding pseudoresponses

(33)  \tilde{y}_{ik} = \frac{K-1}{K} \cdot \frac{y_{ik} - p_k(x_i)}{p_k(x_i)(1 - p_k(x_i))}

and a weight

(34)  w_k(x_i) = p_k(x_i)(1 - p_k(x_i))

applied to each observation (y_{ik}, x_i). The terminal node updates were

\gamma_{jkm} = \frac{\sum_{x_i \in R_{jkm}} w_k(x_i) \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} w_k(x_i)},

which is equivalent to (32). The difference between the two algorithms is the splitting criterion used to induce the trees and thereby the terminal regions \{R_{jkm}\}_1^J.
The least-squares improvement criterion used to evaluate potential splits of a currently terminal region R into two subregions (R_l, R_r) is

(35)  i^2(R_l, R_r) = \frac{w_l w_r}{w_l + w_r} (\bar{y}_l - \bar{y}_r)^2,

where \bar{y}_l, \bar{y}_r are the left and right daughter response means respectively, and w_l, w_r are the corresponding sums of the weights. For a given split, using (31) with unit weights, or (33) with weights (34), give the same values for \bar{y}_l, \bar{y}_r. However, the weight sums w_l, w_r are different. Unit weights (LK_TreeBoost) favor splits that are symmetric in the number of observations in each daughter node, whereas (34) (LogitBoost) favors splits for which the sums of the currently estimated response variances \mathrm{var}(\tilde{y}_{ik}) = p_k(x_i)(1 - p_k(x_i)) are more equal.
LK_TreeBoost has an implementation advantage in numerical stability. LogitBoost becomes numerically unstable whenever the value of (34) is close to zero for any observation x_i, which happens quite frequently. This is a consequence of the difficulty that Newton-Raphson has with vanishing second derivatives. Its performance is strongly affected by the way this problem is handled (see FHT00, page 352). LK_TreeBoost has such difficulties only when (34) is close to zero for all observations in a terminal node. This happens much less frequently and is easier to deal with when it does happen.

Influence trimming for the multiclass procedure is implemented in the same way as that for the two-class case outlined in Section 4.5.1. Associated with each "observation" (y_{ik}, x_i) is an influence w_{ik} = |\tilde{y}_{ik}|(1 - |\tilde{y}_{ik}|), which is used for deleting observations (27) when inducing the kth tree at the current iteration m.

5. Regularization. In prediction problems, fitting the training data too closely can be counterproductive. Reducing the expected loss on the training data beyond some point causes the population-expected loss to stop decreasing and often to start increasing. Regularization methods attempt to prevent such "overfitting" by constraining the fitting procedure. For additive expansions (2) a natural regularization parameter is the number of components M. This is analogous to stepwise regression where the \{h(x; a_m)\}_1^M are considered explanatory variables that are sequentially entered. Controlling the value of M regulates the degree to which expected loss on the training data can be minimized. The best value for M can be estimated by some model selection method, such as using an independent "test" set, or cross-validation.

Regularizing by controlling the number of terms in the expansion places an implicit prior belief that "sparse" approximations involving fewer terms are likely to provide better prediction. However, it has often been found that regularization through shrinkage provides superior results to that obtained by restricting the number of components [Copas (1983)]. In the context of additive models (2) constructed in a forward stagewise manner (9), (10), a simple shrinkage strategy is to replace line 6 of the generic algorithm (Algorithm 1) with

(36)  F_m(x) = F_{m-1}(x) + \nu \cdot \rho_m h(x; a_m),  0 < \nu \le 1,

and to make the corresponding equivalent changes in all of the specific algorithms (Algorithms 2-6). Each update is simply scaled by the value of the "learning rate" parameter ν.
Introducing shrinkage into gradient boosting (36) in this manner provides two regularization parameters, the learning rate ν and the number of components M. Each one can control the degree of fit and thus affect the best value for the other one. Decreasing the value of ν increases the best value for M. Ideally one should estimate optimal values for both by minimizing a model selection criterion jointly with respect to the values of the two parameters. There are also computational considerations; increasing the size of M produces a proportionate increase in computation.
We illustrate this ν-M trade-off through a simulation study. The training sample consists of 5000 observations {y_i, x_i} with

y_i = F^*(x_i) + \varepsilon_i.

The target function F^*(x), x ∈ R^{10}, is randomly generated as described in Section 6.1. The noise ε was generated from a normal distribution with zero mean, and variance adjusted so that

E|\varepsilon| = \tfrac{1}{2} E_x |F^*(x) - \mathrm{median}_x F^*(x)|,

giving a signal-to-noise ratio of 2/1. For this illustration the base learner h(x; a) is taken to be an 11-terminal node regression tree induced in a best-first manner (FHT00). A general discussion of tree size choice appears in Section 7.
Figure 1 shows the lack of fit (LOF) of LS_TreeBoost, LAD_TreeBoost, and L2_TreeBoost as a function of number of terms (iterations) M, for several values of the shrinkage parameter ν ∈ {1.0, 0.25, 0.125, 0.06}. For the first two methods, LOF is measured by the average absolute error of the estimate F_M(x) relative to that of the optimal constant solution

(37)  A(F_M(x)) = \frac{E_x |F^*(x) - F_M(x)|}{E_x |F^*(x) - \mathrm{median}_x F^*(x)|}.

For logistic regression the y-values were obtained by thresholding at the median of F^*(x) over the distribution of x-values; F^*(x_i) values greater than the median were assigned y_i = 1; those below the median were assigned y_i = -1. The Bayes error rate is thus zero, but the decision boundary is fairly complicated. There are two LOF measures for L2_TreeBoost; minus twice log-likelihood ("deviance") and the misclassification error rate E_x[1(y ≠ sign(F_M(x)))]. The values of all LOF measures were computed by using an independent validation data set of 10,000 observations.
As seen in Figure 1, smaller values of the shrinkage parameter ν (more shrinkage) are seen to result in better performance, although there is a diminishing return for the smallest values. For the larger values, behavior characteristic of overfitting is observed; performance reaches an optimum at some value of M and thereafter diminishes as M increases beyond that point. This effect is much less pronounced with LAD_TreeBoost, and with the error rate criterion of L2_TreeBoost. For smaller values of ν there is less overfitting, as would be expected.
[Figure 1 appears here: four panels (LS_TreeBoost, LAD_TreeBoost, L2_TreeBoost deviance, L2_TreeBoost error rate) plotting lack of fit against the number of iterations, 0 to 1000.]

FIG. 1. Performance of three gradient boosting algorithms as a function of number of iterations M. The four curves correspond to shrinkage parameter values of ν ∈ {1.0, 0.25, 0.125, 0.06} and are in that order (top to bottom) at the extreme right of each plot.

Although difficult to see except for ν = 1, the misclassification error rate (lower right panel) continues to decrease well after the logistic likelihood has reached its optimum (lower left panel). Thus, degrading the likelihood by overfitting actually improves misclassification error rate. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality. Error rate depends only on the sign of F_M(x) whereas likelihood is affected by both its sign and magnitude. Apparently, overfitting degrades the quality of the magnitude estimate without affecting (and sometimes improving) the sign. Thus, misclassification error is much less sensitive to overfitting.
Table 1 summarizes the simulation results for several values of ν including those shown in Figure 1. Shown for each ν-value (row) are the iteration number at which the minimum LOF was achieved and the corresponding minimizing value (pairs of columns).


TABLE 1
Iteration number giving the best fit and the best fit value for several shrinkage parameter ν-values, with three boosting methods

  ν       LS: M, A(F_M(x))   LAD: M, A(F_M(x))   L2: M, -2 log(like)   L2: M, error rate
  1.0       15   0.48          19   0.57            20   0.60            436   0.111
  0.5       43   0.40          19   0.44            80   0.50            371   0.106
  0.25      77   0.34          84   0.38           310   0.46            967   0.099
  0.125    146   0.32         307   0.35           570   0.45            580   0.098
  0.06     326   0.32         509   0.35          1000   0.44            994   0.094
  0.03     855   0.32         937   0.35          1000   0.45            979   0.097

The ν-M trade-off is clearly evident; smaller values of ν give rise to larger optimal M-values. They also provide higher accuracy, with a diminishing return for ν < 0.125. The misclassification error rate is very flat for M > 200, so that optimal M-values for it are unstable.

Although illustrated here for just one target function and base learner (11-terminal node tree), the qualitative nature of these results is fairly universal. Other target functions and tree sizes (not shown) give rise to the same behavior. This suggests that the best value for ν depends on the number of iterations M. The latter should be made as large as is computationally convenient or feasible. The value of ν should then be adjusted so that LOF achieves its minimum close to the value chosen for M. If LOF is still decreasing at the last iteration, the value of ν or the number of iterations M should be increased, preferably the latter. Given the sequential nature of the algorithm, it can easily be restarted where it finished previously, so that no computation need be repeated. LOF as a function of iteration number is most conveniently estimated using a left-out test sample.
As illustrated here, decreasing the learning rate clearly improves performance, usually dramatically. The reason for this is less clear. Shrinking the model update (36) at each iteration produces a more complex effect than direct proportional shrinkage of the entire model,

(38)  \bar{F}_\nu(x) = \bar{y} + \nu \cdot (F_M(x) - \bar{y}),

where F_M(x) is the model induced without shrinkage. The update \rho_m h(x; a_m) at each iteration depends on the specific sequence of updates at the previous iterations. Incremental shrinkage (36) produces very different models than global shrinkage (38). Empirical evidence (not shown) indicates that global shrinkage (38) provides at best marginal improvement over no shrinkage, far from the dramatic effect of incremental shrinkage. The mystery underlying the success of incremental shrinkage is currently under investigation.

6. Simulation studies. The performance of any function estimation method depends on the particular problem to which it is applied. Important characteristics of problems that affect performance include training sample size N, true underlying "target" function F^*(x) (1), and the distribution of the departures, ε, of y | x from F^*(x). For any given problem, N is always known and sometimes the distribution of ε is also known, for example when y is binary (Bernoulli). When y is a general real-valued variable the distribution of ε is seldom known. In nearly all cases, the nature of F^*(x) is unknown.

In order to gauge the value of any estimation method it is necessary to accurately evaluate its performance over many different situations. This is most conveniently accomplished through Monte Carlo simulation where data can be generated according to a wide variety of prescriptions and resulting performance accurately calculated. In this section several such studies are presented in an attempt to understand the properties of the various Gradient_TreeBoost procedures developed in the previous sections. Although such a study is far more thorough than evaluating the methods on just a few selected examples, real or simulated, the results of even a large study can only be regarded as suggestive.

6.1. Random function generator. One of the most important characteristics of any problem affecting performance is the true underlying target function F^*(x) (1). Every method has particular targets for which it is most appropriate and others for which it is not. Since the nature of the target function can vary greatly over different problems, and is seldom known, we compare the merits of regression tree gradient boosting algorithms on a variety of different randomly generated targets. Each one takes the form

(39)  F^*(x) = \sum_{l=1}^{20} a_l g_l(z_l).

The coefficients \{a_l\}_1^{20} are randomly generated from a uniform distribution a_l \sim U[-1, 1]. Each g_l(z_l) is a function of a randomly selected subset, of size n_l, of the n input variables x. Specifically,

z_l = \{ x_{P_l(j)} \}_{j=1}^{n_l},

where each P_l is a separate random permutation of the integers {1, 2, ..., n}. The size of each subset n_l is itself taken to be random, n_l = \lfloor 1.5 + r \rfloor, with r being drawn from an exponential distribution with mean λ = 2. Thus, the expected number of input variables for each g_l(z_l) is between three and four. However, most often there will be fewer than that, and somewhat less often, more. This reflects a bias against strong very high-order interaction effects. However, for any realized F^*(x) there is a good chance that at least a few of the 20 functions g_l(z_l) will involve higher-order interactions. In any case, F^*(x) will be a function of all, or nearly all, of the input variables.

Each g_l(z_l) is an n_l-dimensional Gaussian function

(40)  g_l(z_l) = \exp\Big( -\tfrac{1}{2} (z_l - \mu_l)^T V_l^{-1} (z_l - \mu_l) \Big),

where each of the mean vectors \{\mu_l\}_1^{20} is randomly generated from the same distribution as that of the input variables x. The n_l \times n_l covariance matrix V_l is also randomly generated. Specifically,

V_l = U_l D_l U_l^T,

where U_l is a random orthonormal matrix (uniform on Haar measure) and D_l = \mathrm{diag}\{d_{1l}, \ldots, d_{n_l l}\}. The square roots of the eigenvalues are randomly generated from a uniform distribution \sqrt{d_{jl}} \sim U[a, b], where the limits a, b depend on the distribution of the input variables x.

For all of the studies presented here, the number of input variables was taken to be n = 10, and their joint distribution was taken to be standard normal, x \sim N(0, I). The eigenvalue limits were a = 0.1 and b = 2.0. Although the tails of the normal distribution are often shorter than that of data encountered in practice, they are still more realistic than uniformly distributed inputs often used in simulation studies. Also, regression trees are immune to the effects of long-tailed input variable distributions, so shorter tails give a relative advantage to competitors in the comparisons.
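A sketch of the random target generator (39), (40) under the settings of this section (illustrative, not the paper's code; the subset size is capped at n here purely as an implementation convenience):

import numpy as np

def random_target(n=10, n_terms=20, a_lim=(-1, 1), eig_lim=(0.1, 2.0), seed=None):
    # Returns a function F_star(X) built from randomly generated Gaussian terms (39), (40).
    rng = np.random.default_rng(seed)
    terms = []
    for _ in range(n_terms):
        a = rng.uniform(*a_lim)                                  # coefficient a_l
        n_l = min(n, int(np.floor(1.5 + rng.exponential(2.0))))  # subset size n_l
        vars_l = rng.permutation(n)[:n_l]                        # variables entering z_l
        mu = rng.standard_normal(n_l)                            # mean mu_l ~ input distribution
        Q, _ = np.linalg.qr(rng.standard_normal((n_l, n_l)))     # random orthonormal U_l
        d = rng.uniform(*eig_lim, size=n_l) ** 2                 # eigenvalues (sqrt ~ U[a, b])
        Vinv = Q @ np.diag(1.0 / d) @ Q.T                        # V_l^{-1}
        terms.append((a, vars_l, mu, Vinv))
    def F_star(X):
        out = np.zeros(len(X))
        for a, v, mu, Vinv in terms:
            Z = X[:, v] - mu
            out += a * np.exp(-0.5 * np.einsum('ij,jk,ik->i', Z, Vinv, Z))
        return out
    return F_star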
In the simulation studies below, 100 target functions F^*(x) were randomly generated according to the above prescription (39), (40). Performance is evaluated in terms of the distribution of approximation inaccuracy [relative approximation error (37) or misclassification risk] over these different targets. This approach allows a wide variety of quite different target functions to be generated in terms of the shapes of their contours in the ten-dimensional input space. Although lower order interactions are favored, these functions are not especially well suited to additive regression trees. Decision trees produce tensor product basis functions, and the components g_l(z_l) of the targets F^*(x) are not tensor product functions. Using the techniques described in Section 8, visualizations of the dependencies of the first randomly generated function on some of its more important arguments are shown in Section 8.3.

Although there are only ten input variables, each target is a function of all of them. In many data mining applications there are many more than ten inputs. However, the relevant dimensionalities are the intrinsic dimensionality of the input space, and the number of inputs that actually influence the output response variable y. In problems with many input variables there are usually high degrees of collinearity among many of them, and the number of roughly independent variables (approximate intrinsic dimensionality) is much smaller. Also, target functions often strongly depend only on a small subset of all of the inputs.

6.2. Error distribution. In this section, LS_TreeBoost, LAD_TreeBoost, and M_TreeBoost are compared in terms of their performance over the 100 target functions for two different error distributions. Best-first regression trees with 11 terminal nodes were used with all algorithms. The breakdown parameter for the M_TreeBoost was set to its default value α = 0.9. The learning rate parameter (36) was set to ν = 0.1 for all TreeBoost procedures in all of the simulation studies.


One hundred data sets \{y_i, x_i\}_1^N were generated according to

y_i = F^*(x_i) + \varepsilon_i,

where F^*(x) represents each of the 100 target functions randomly generated as described in Section 6.1. For the first study, the errors ε_i were generated from a normal distribution with zero mean, and variance adjusted so that

(41)  E|\varepsilon| = E_x |F^*(x) - \mathrm{median}_x F^*(x)|,

giving a 1/1 signal-to-noise ratio. For the second study the errors were generated from a "slash" distribution, ε_i = s · (u/v), where u ~ N(0, 1) and v ~ U[0, 1]. The scale factor s is adjusted to give a 1/1 signal-to-noise ratio (41). The slash distribution has very thick tails and is often used as an extreme to test robustness. The training sample size was taken to be N = 7500, with 5000 used for training, and 2500 left out as a test sample to estimate the optimal number of components M. For each of the 100 trials an additional validation sample of 5000 observations was generated (without error) to evaluate the approximation inaccuracy (37) for that trial.

The left panels of Figure 2 show boxplots of the distribution of approximation inaccuracy (37) over the 100 targets for the two error distributions for each of the three methods. The shaded area of each boxplot shows the interquartile range of the distribution with the enclosed white bar being the median.

Normal Normal

LS LAD M LS LAD M

Slash Slash

L1j

-= -

LS LAD M LS LAD M

FIG. 2. Distribution of absolute approximation error (leftpanels) and error relative to the best
(rightpanels) forLSJTheeBoost,LAD_TheeBoostand M_TreeBoostfor normal and slash errordis-
tributions.LSJTreeBoost,performsbest with the normal error distribution.LADJTreeBoostand
M_TreeBoostbothperformwell with slash errors. M_TreeBoostis veryclose to the best for both
errordistributions.Note the use of logarithmicscale in the lower rightpanel.


The outer hinges represent the points closest to (plus/minus) 1.5 interquartile range units from the (upper/lower) quartiles. The isolated bars represent individual points outside this range (outliers).

These plots allow the comparison of the overall distributions, but give no information concerning relative performance for individual target functions. The right two panels of Figure 2 attempt to provide such a summary. They show distributions of error ratios, rather than the errors themselves. For each target function and method, the error for the method on that target is divided by the smallest error obtained on that target, over all of the methods (here three) being compared. Thus, for each of the 100 trials, the best method receives a value of 1.0 and the others receive a larger value. If a particular method was best (smallest error) for all 100 target functions, its resulting distribution (boxplot) would be a point mass at the value 1.0. Note that the logarithm of this ratio is plotted in the lower right panel.

From the left panels of Figure 2 one sees that the 100 targets represent a fairly wide spectrum of difficulty for all three methods; approximation errors vary by over a factor of two. For normally distributed errors LS_TreeBoost is the superior performer, as might be expected. It had the smallest error in 73 of the trials, with M_TreeBoost best the other 27 times. On average LS_TreeBoost was 0.2% worse than the best, M_TreeBoost 0.9% worse, and LAD_TreeBoost was 7.4% worse than the best.

With slash-distributed errors, things are reversed. On average the approximation error for LS_TreeBoost was 0.95, thereby explaining only 5% target variation. On individual trials however, it could be much better or much worse. The performance of both LAD_TreeBoost and M_TreeBoost was much better and comparable to each other. LAD_TreeBoost was best 32 times and M_TreeBoost 68 times. On average LAD_TreeBoost was 4.1% worse than the best, M_TreeBoost 1.0% worse, and LS_TreeBoost was 364.6% worse than the best, over the 100 targets.

The results suggest that of these three, M_TreeBoost is the method of choice. In both the extreme cases of very well-behaved (normal) and very badly behaved (slash) errors, its performance was very close to that of the best. By comparison, LAD_TreeBoost suffered somewhat with normal errors, and LS_TreeBoost was disastrous with slash errors.

6.3. LS_TreeBoost versus MARS. All Gradient_TreeBoost algorithms produce piecewise constant approximations. Although the number of such pieces is generally much larger than that produced by a single tree, this aspect of the approximating function F_M(x) might be expected to represent a disadvantage with respect to methods that provide continuous approximations, especially when the true underlying target F^*(x) (1) is continuous and fairly smooth. All of the randomly generated target functions (39), (40) are continuous and very smooth. In this section we investigate the extent of the piecewise constant disadvantage by comparing the accuracy of Gradient_TreeBoost with that of MARS [Friedman (1991)] over these 100 targets. Like TreeBoost, MARS produces a tensor product based approximation. However, it uses continuous functions as the product factors, thereby producing a continuous approximation. It also uses a more involved (stepwise) strategy to induce the tensor products.

Since MARS is based on least-squares fitting, we compare it to LS_TreeBoost using normally distributed errors, again with a 1/1 signal-to-noise ratio (41). The experimental setup is the same as that in Section 6.2. It is interesting to note that here the performance of MARS was considerably enhanced by using the 2500 observation test set for model selection, rather than its default generalized cross-validation (GCV) criterion [Friedman (1991)].
The top left panel of Figure 3 compares the distribution of MARS average absolute approximation errors, over the 100 randomly generated target functions (39), (40), to that of LS_TreeBoost from Figure 2. The MARS distribution is seen to be much broader, varying by almost a factor of three. There were many targets for which MARS did considerably better than LS_TreeBoost, and many for which it was substantially worse. This further illustrates the fact that the nature of the target function strongly influences the relative performance of different methods. The top right panel of Figure 3 shows the distribution of errors, relative to the best for each target. The two methods exhibit similar performance based on average absolute error. There were a number of targets where each one substantially outperformed the other.

The bottom two panels of Figure 3 show corresponding plots based on root mean squared error. This gives proportionally more weight to larger errors in assessing lack of performance. For LS_TreeBoost the two error measures have close to the same values for all of the 100 targets. However with MARS, root mean squared error is typically 30% higher than average absolute error. This indicates that MARS predictions tend to be either very close to, or far from, the target. The errors from LS_TreeBoost are more evenly distributed. It tends to have fewer very large errors or very small errors. The latter may be a consequence of the piecewise constant nature of the approximation which makes it difficult to get arbitrarily close to very smoothly varying targets with approximations of finite size. As Figure 3 illustrates, relative performance can be quite sensitive to the criterion used to measure it.

These results indicate that the piecewise constant aspect of TreeBoost approximations is not a serious disadvantage. In the rather pristine environment of normal errors and normal input variable distributions, it is competitive with MARS. The advantage of the piecewise constant approach is robustness; specifically, it provides immunity to the adverse effects of wide tails and outliers in the distribution of the input variables x. Methods that produce continuous approximations, such as MARS, can be extremely sensitive to such problems. Also, as shown in Section 6.2, M_TreeBoost (Algorithm 4) is nearly as accurate as LS_TreeBoost for normal errors while, in addition, being highly resistant to output y-outliers. Therefore in data mining applications where the cleanliness of the data is not assured and x- and/or y-outliers may be present, the relatively high accuracy, consistent performance and robustness of M_TreeBoost may represent a substantial advantage.


[Figure 3 appears here: boxplots for LS_TreeBoost and MARS, based on absolute error (top panels) and root mean squared error (bottom panels).]

FIG. 3. Distribution of approximation error (left panels) and error relative to the best (right panels) for LS_TreeBoost and MARS. The top panels are based on average absolute error, whereas the bottom ones use root mean squared error. For absolute error the MARS distribution is wider, indicating more frequent better and worse performance than LS_TreeBoost. MARS performance as measured by root mean squared error is much worse, indicating that it tends to more frequently make both larger and smaller errors than LS_TreeBoost.

6.4. LK-TreeBoost versus K-class LogitBoost and AdaBoost.MH. In this section the performance of LK-TreeBoost is compared to that of K-class LogitBoost (FHT00) and AdaBoost.MH [Schapire and Singer (1998)] over the 100 randomly generated targets (Section 6.1). Here K = 5 classes are generated by thresholding each target at its 0.2, 0.4, 0.6 and 0.8 quantiles over the distribution of input x-values. There are N = 7500 training observations for each trial (1500 per class) divided into 5000 for training and 2500 for model selection (number of iterations, M). An independently generated validation sample of 5000 observations was used to estimate the error rate for each target.
The Bayes error rate is zero for all targets, but the induced decision boundaries can become quite complicated, depending on the nature of each individual target function F*(x). Regression trees with 11 terminal nodes were used for each method.

Figure 4 shows the distribution of error rate (left panel), and its ratio to the smallest (right panel), over the 100 target functions, for each of the three methods. The error rate of all three methods is seen to vary substantially over these targets. LK-TreeBoost is seen to be the generally superior performer. It had the smallest error for 78 of the trials and on average its error rate was 0.6% higher than the best for each trial. LogitBoost was best on 21 of the targets and there was one tie. Its error rate was 3.5% higher than the best on average. AdaBoost.MH was never the best performer, and on average it was 15% worse than the best.
Figure 5 shows a corresponding comparison, with the LogitBoost and AdaBoost.MH procedures modified to incorporate incremental shrinkage (36), with the shrinkage parameter set to the same (default) value v = 0.1 used with LK-TreeBoost. Here one sees a somewhat different picture. Both LogitBoost and AdaBoost.MH benefit substantially from shrinkage. The performance of all three procedures is now nearly the same, with LogitBoost perhaps having a slight advantage. On average its error rate was 0.5% worse than the best; the corresponding values for LK-TreeBoost and AdaBoost.MH were 2.3% and 3.9%, respectively. These results suggest that the relative performance of these methods is more dependent on their aggressiveness, as parameterized by learning rate, than on their structural differences. LogitBoost has an additional


FIG. 4. Distribution of error rate on a five-class problem (left panel) and error rate relative to the best (right panel) for LK_TreeBoost, LogitBoost, and AdaBoost.MH. LK_TreeBoost exhibits superior performance.



FIG. 5. Distribution of error rate on a five-class problem (left panel), and error rate relative to the best (right panel), for LK_TreeBoost, and with proportional shrinkage applied to LogitBoost and RealAdaBoost. Here the performance of all three methods is similar.

internal shrinkage associated with stabilizing its pseudoresponse (33) when the denominator is close to zero (FHT00, page 352). This may account for its slight superiority in this comparison. In fact, when increased shrinkage is applied to LK-TreeBoost (v = 0.05) its performance improves, becoming identical to that of LogitBoost shown in Figure 5. It is likely that when the shrinkage parameter is carefully tuned for each of the three methods, there would be little performance differential between them.

7. Tree boosting. The GradientBoost procedure (Algorithm 1) has two primary metaparameters, the number of iterations M and the learning rate parameter v (36). These are discussed in Section 5. In addition to these, there are the metaparameters associated with the procedure used to estimate the base learner h(x; a). The primary focus of this paper has been on the use of best-first induced regression trees with a fixed number of terminal nodes, J. Thus, J is the primary metaparameter of this base learner. The best choice for its value depends most strongly on the nature of the target function, namely the highest order of the dominant interactions among the variables.
Consider an ANOVA expansion of a function

(42)    F(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots.

The first sum is called the "main effects" component of F(x). It consists of a sum of functions that each depend on only one input variable. The particular functions {f_j(x_j)} are those that provide the closest approximation to F(x)
under this additive constraint. This is sometimes referred to as an "additive" model because the contributions of each x_j, f_j(x_j), add to the contributions of the others. This is a different and more restrictive definition of "additive" than (2). The second sum consists of functions of pairs of input variables. They are called the two-variable "interaction effects." They are chosen so that along with the main effects they provide the closest approximation to F(x) under the limitation of no more than two-variable interactions. The third sum represents three-variable interaction effects, and so on.

The highest interaction order possible is limited by the number of input variables n. However, especially for large n, many target functions F*(x) encountered in practice can be closely approximated by ANOVA decompositions of much lower order. Only the first few terms in (42) are required to capture the dominant variation in F*(x). In fact, considerable success is often achieved with the additive component alone [Hastie and Tibshirani (1990)]. Purely additive approximations are also produced by the "naive"-Bayes method [Warner, Toronto, Veasey and Stephenson (1961)], which is often highly successful in classification. These considerations motivated the bias toward lower-order interactions in the randomly generated target functions (Section 6.1) used for the simulation studies.
The goal of function estimation is to produce an approximation F(x) that closely matches the target F*(x). This usually requires that the dominant interaction order of F(x) be similar to that of F*(x). In boosting regression trees, the interaction order can be controlled by limiting the size of the individual trees induced at each iteration. A tree with J terminal nodes produces a function with interaction order at most min(J - 1, n). The boosting process is additive, so the interaction order of the entire approximation can be no larger than the largest among its individual components. Therefore, with any of the TreeBoost procedures, the best tree size J is governed by the effective interaction order of the target F*(x). This is usually unknown so that J becomes a metaparameter of the procedure to be estimated using a model selection criterion such as cross-validation or on a left-out subsample not used in training. However, as discussed above, it is unlikely that large trees would ever be necessary or desirable.
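This selection of J (and of M) on a left-out subsample can be mimicked with any boosting implementation that exposes the tree size and a staged prediction path. The following is a minimal sketch under the assumption that scikit-learn is available; it is not the implementation used in this paper. Here max_leaf_nodes plays the role of J and learning_rate the role of the shrinkage parameter v.

```python
# Minimal sketch (assumes scikit-learn; loss "squared_error" is called "ls" in
# older versions).  max_leaf_nodes corresponds to the tree size J, learning_rate to v.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def select_tree_size(X, y, J_values=(2, 3, 6, 11, 21), M=500, nu=0.1, seed=0):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3,
                                                random_state=seed)
    best = None
    for J in J_values:
        gbm = GradientBoostingRegressor(loss="squared_error",   # LS boosting
                                        n_estimators=M,
                                        learning_rate=nu,
                                        max_leaf_nodes=J,
                                        random_state=seed)
        gbm.fit(X_tr, y_tr)
        # Validation absolute error after each iteration m = 1, ..., M.
        errs = [np.mean(np.abs(y_val - pred))
                for pred in gbm.staged_predict(X_val)]
        m_best = int(np.argmin(errs)) + 1
        if best is None or errs[m_best - 1] < best[2]:
            best = (J, m_best, errs[m_best - 1])
    return best   # (estimated J, estimated M, left-out absolute error)
```

The staged predictions make it cheap to select M jointly with J from a single fit per candidate tree size.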
Figure 6 illustrates the effect of tree size on approximation accuracy for the 100 randomly generated functions (Section 6.1) used in the simulation studies. The experimental set-up is the same as that used in Section 6.2. Shown is the distribution of absolute errors (37) (left panel), and errors relative to the lowest for each target (right panel), for J ∈ {2, 3, 6, 11, 21}. The first value J = 2 produces additive main effects components only; J = 3 produces additive and two-variable interaction terms, and so on. A J terminal node tree can produce interaction levels up to a maximum of min(J - 1, n), with typical values being less than that, especially when J - 1 < n.

As seen in Figure 6 the smallest trees J ∈ {2, 3} produce lower accuracy on average, but their distributions are considerably wider than the others. This means that they produce more very accurate, and even more very inaccurate,



FIG. 6. Distribution of absolute approximation error (left panel) and error relative to the best (right panel) for LS_TreeBoost with different sized trees, as measured by number of terminal nodes J. The distribution using the smallest trees J ∈ {2, 3} is wider, indicating more frequent better and worse performance than with the larger trees, all of which have similar performance.

approximations. The smaller trees, being restricted to low-order interactions, are better able to take advantage of targets that happen to be of low interaction level. However, they do quite badly when trying to approximate the high-order interaction targets. The larger trees J ∈ {6, 11, 21} are more consistent. They sacrifice some accuracy on low-order interaction targets, but do much better on the higher-order functions. There is little performance difference among the larger trees, with perhaps some slight deterioration for J = 21. The J = 2 trees produced the most accurate approximation eight times; the corresponding numbers for J ∈ {3, 6, 11, 21} were 2, 30, 31, 29, respectively. On average the J = 2 trees had errors 23.2% larger than the lowest for each target, while the others had corresponding values of 16.4%, 2.4%, 2.2% and 3.7%, respectively. Higher accuracy should be obtained when the best tree size J is individually estimated for each target. In practice this can be accomplished by evaluating the use of different tree sizes with an independent test data set, as illustrated in Section 9.

8. Interpretation. In many applications it is useful to be able to interpret the derived approximation F(x). This involves gaining an understanding of those particular input variables that are most influential in contributing to its variation, and the nature of the dependence of F(x) on those influential inputs. To the extent that F(x) at least qualitatively reflects the nature of the target function F*(x) (1), such tools can provide information concerning the underlying relationship between the inputs x and the output variable y. In this section, several tools are presented for interpreting TreeBoost approximations.
Although they can be used for interpreting single decision trees, they tend to be more effective in the context of boosting (especially small) trees. These interpretative tools are illustrated on real data examples in Section 9.

8.1. Relative importance of input variables. Among the most useful descriptions of an approximation F(x) are the relative influences I_j, of the individual inputs x_j, on the variation of F(x) over the joint input variable distribution. One such measure is

(43)    I_j = \left( E_x \left[ \frac{\partial F(x)}{\partial x_j} \right]^2 \cdot \mathrm{var}_x[x_j] \right)^{1/2}.
For piecewise constant approximations produced by decision trees, (43) does not strictly exist and it must be approximated by a surrogate measure that reflects its properties. Breiman, Friedman, Olshen and Stone (1983) proposed

(44)    \hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, 1(v_t = j),

where the summation is over the nonterminal nodes t of the J-terminal node tree T, v_t is the splitting variable associated with node t, and \hat{i}_t^2 is the corresponding empirical improvement in squared error (35) as a result of the split. The right-hand side of (44) is associated with squared influence so that its units correspond to those of (43). Breiman, Friedman, Olshen and Stone (1983) used (44) directly as a measure of influence, rather than squared influence. For a collection of decision trees \{T_m\}_1^M, obtained through boosting, (44) can be generalized by its average over all of the trees,

(45)    \hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m),

in the sequence.
The motivation for (44), (45) is based purely on heuristic arguments. As a partial justification we show that it produces expected results when applied in the simplest context. Consider a linear target function

(46)    F*(x) = a_0 + \sum_{j=1}^{n} a_j x_j,

where the covariance matrix of the inputs is a multiple of the identity

E_x[(x - \bar{x})(x - \bar{x})^T] = c I_n.

In this case the influence measure (43) produces

(47)    I_j = |a_j|.


Table 2 shows the results of a small simulation study similar to those in Section 6, but with F*(x) taken to be linear (46) with coefficients

(48)    a_j = (-1)^j j

and a signal-to-noise ratio of 1/1 (41). Shown are the mean and standard deviation of the values of (44), (45) over ten random samples, all with F*(x) given by (46), (48). The influence of the estimated most influential variable x_{j*} is arbitrarily assigned the value I_{j*} = 100, and the estimated values of the others scaled accordingly. The estimated importance ranking of the input variables was correct on every one of the ten trials. As can be seen in Table 2, the estimated relative influence values are consistent with those given by (47) and (48).
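The flavor of this check can be reproduced with any package that reports the impurity-based importance underlying (44), (45). The sketch below assumes scikit-learn, whose feature_importances_ behave like normalized squared-influence totals, so square roots are taken before rescaling the largest value to 100 as in Table 2. It is an illustration of the idea, not the code used for the table.

```python
# Minimal sketch (assumes scikit-learn): linear target with coefficients (48),
# identity input covariance as in (46)-(47), and signal-to-noise ratio 1/1.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, N = 10, 5000
X = rng.normal(size=(N, n))
a = np.array([(-1) ** j * j for j in range(1, n + 1)])    # coefficients (48)
signal = X @ a
y = signal + rng.normal(scale=signal.std(), size=N)       # 1/1 signal-to-noise

gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                max_leaf_nodes=6, random_state=0).fit(X, y)

# feature_importances_ act like squared influence; take square roots and rescale
# so the most influential variable gets the value 100, as in Table 2.
imp = np.sqrt(gbm.feature_importances_)
imp = 100.0 * imp / imp.max()
for j in np.argsort(-imp):
    print(f"x{j + 1:2d}: {imp[j]:6.1f}")    # expect roughly 100, 90, 80, ..., 10
```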
In Breiman, Friedman, Olshen and Stone (1983), the influence measure (44) is augmented by a strategy involving surrogate splits intended to uncover the masking of influential variables by others highly associated with them. This strategy is most helpful with single decision trees where the opportunity for variables to participate in splitting is limited by the size J of the tree in (44). In the context of boosting, however, the number of splitting opportunities is vastly increased (45), and surrogate unmasking is correspondingly less essential.
In K-class logistic regression and classification (Section 4.6) there are K (logistic) regression functions {F_{kM}(x)}_{k=1}^{K}, each described by a sequence of M trees. In this case (45) generalizes to

(49)    \hat{I}_{jk} = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j(T_{km}),

where T_{km} is the tree induced for the kth class at iteration m. The quantity \hat{I}_{jk} can be interpreted as the relevance of predictor variable x_j in separating class k from the other classes. The overall relevance of x_j can be obtained by

TABLE 2
Estimated mean and standard deviation of input variable
relative influence for a linear target function

Variable     Mean    Standard deviation
   10       100.0          0.0
    9        90.3          4.3
    8        80.0          4.1
    7        69.8          3.9
    6        62.1          2.3
    5        51.7          2.0
    4        40.3          4.2
    3        31.3          2.9
    2        22.2          2.8
    1        13.0          3.2


averaging over all classes,

I_j = \frac{1}{K} \sum_{k=1}^{K} \hat{I}_{jk}.

However, the individual \hat{I}_{jk} themselves can be quite useful. It is often the case that different subsets of variables are highly relevant to different subsets of classes. This more detailed knowledge can lead to insights not obtainable by examining only overall relevance.
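When a multi-class model is fit as K parallel tree sequences, the per-class relevances (49) and their average over classes can be assembled directly from the individual trees. A sketch, under the assumption that scikit-learn's GradientBoostingClassifier is used (its estimators_ array holds one regression tree per class and per iteration, collapsing to a single column for binary problems):

```python
# Minimal sketch (assumes scikit-learn); estimators_ has shape (M, K) for a
# K-class GradientBoostingClassifier, one regression tree per class per stage.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def per_class_relevance(gbc: GradientBoostingClassifier) -> np.ndarray:
    """Return an (n_features, K) array of relevances in the spirit of (49)."""
    M, K = gbc.estimators_.shape
    relevance = np.zeros((gbc.n_features_in_, K))
    for k in range(K):
        for m in range(M):
            relevance[:, k] += gbc.estimators_[m, k].feature_importances_
        relevance[:, k] /= M
    return relevance

# Overall relevance: average over the classes, as in the equation above.
# overall = per_class_relevance(gbc).mean(axis=1)
```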

8.2. Partial dependence plots. Visualization is one of the most powerful interpretational tools. Graphical renderings of the value of F(x) as a function of its arguments provides a comprehensive summary of its dependence on the joint values of the input variables. Unfortunately, such visualization is limited to low-dimensional arguments. Functions of a single real-valued variable x, F(x), can be plotted as a graph of the values of F(x) against each corresponding value of x. Functions of a single categorical variable can be represented by a bar plot, each bar representing one of its values, and the bar height the value of the function. Functions of two real-valued variables can be pictured using contour or perspective mesh plots. Functions of a categorical variable and another variable (real or categorical) are best summarized by a sequence of ("trellis") plots, each one showing the dependence of F(x) on the second variable, conditioned on the respective values of the first variable [Becker and Cleveland (1996)].

Viewing functions of higher-dimensional arguments is more difficult. It is therefore useful to be able to view the partial dependence of the approximation F(x) on selected small subsets of the input variables. Although a collection of such plots can seldom provide a comprehensive depiction of the approximation, it can often produce helpful clues, especially when F(x) is dominated by low-order interactions (Section 7).
Let z_l be a chosen "target" subset, of size l, of the input variables x,

z_l = \{z_1, \ldots, z_l\} \subset \{x_1, \ldots, x_n\},

and z_{\setminus l} be the complement subset

z_{\setminus l} \cup z_l = x.

The approximation F(x) in principle depends on variables in both subsets

F(x) = F(z_l, z_{\setminus l}).

If one conditions on specific values for the variables in z_{\setminus l}, then F(x) can be considered as a function only of the variables in the chosen subset z_l,

(50)    F_{z_{\setminus l}}(z_l) = F(z_l \mid z_{\setminus l}).


In general, the functional form of F_{z_{\setminus l}}(z_l) will depend on the particular values chosen for z_{\setminus l}. If, however, this dependence is not too strong then the average function

(51)    \bar{F}_l(z_l) = E_{z_{\setminus l}}[F(x)] = \int F(z_l, z_{\setminus l}) \, p_{\setminus l}(z_{\setminus l}) \, dz_{\setminus l}

can represent a useful summary of the partial dependence of F(x) on the chosen variable subset z_l. Here p_{\setminus l}(z_{\setminus l}) is the marginal probability density of z_{\setminus l},

(52)    p_{\setminus l}(z_{\setminus l}) = \int p(x) \, dz_l,

where p(x) is the joint density of all of the inputs x. This complement marginal density (52) can be estimated from the training data, so that (51) becomes

(53)    \bar{F}_l(z_l) = \frac{1}{N} \sum_{i=1}^{N} F(z_l, z_{i,\setminus l}).
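Equation (53) translates directly into a data pass: fix z_l at each point of a grid, leave the remaining columns at their training values, and average the predictions. A short sketch, assuming only some fitted model object with a predict method:

```python
# Minimal sketch of the data-averaging estimate (53) of partial dependence.
# `model` is any fitted regressor with a .predict method; `target_cols` indexes z_l.
import numpy as np

def partial_dependence_brute(model, X, target_cols, grid):
    """Average predictions over the data with z_l fixed at each grid point.

    grid: array of shape (n_points, len(target_cols)) of joint z_l values.
    Returns an array of length n_points: the estimate (53) at each grid point.
    """
    X_work = np.array(X, dtype=float, copy=True)
    out = np.empty(len(grid))
    for i, z in enumerate(grid):
        X_work[:, target_cols] = z       # fix z_l; z_\l keeps its data values
        out[i] = model.predict(X_work).mean()
    return out
```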

In the special cases where the dependence of F(x) on z_l is additive,

(54)    F(x) = F_l(z_l) + F_{\setminus l}(z_{\setminus l}),

or multiplicative,

(55)    F(x) = F_l(z_l) \cdot F_{\setminus l}(z_{\setminus l}),

the form of F_{z_{\setminus l}}(z_l) (50) does not depend on the joint values of the complement variables z_{\setminus l}. Then \bar{F}_l(z_l) (51) provides a complete description of the nature of the variation of F(x) on the chosen input variable subset z_l.
An alternative way of summarizing the dependence of F(x) on a subset z_l is to directly model F(x) as a function of z_l on the training data

(56)    \tilde{F}_l(z_l) = E_x[F(x) \mid z_l] = \int F(x) \, p(z_{\setminus l} \mid z_l) \, dz_{\setminus l}.

However, averaging over the conditional density in (56), rather than the marginal density in (51), causes \tilde{F}_l(z_l) to reflect not only the dependence of F(x) on the selected variable subset z_l, but in addition, apparent dependencies induced solely by the associations between them and the complement variables z_{\setminus l}. For example, if the contribution of z_l happens to be additive (54) or multiplicative (55), \tilde{F}_l(z_l) (56) would not evaluate to the corresponding term or factor F_l(z_l), unless the joint density p(x) happened to be the product

(57)    p(x) = p_l(z_l) \cdot p_{\setminus l}(z_{\setminus l}).
Partial dependence functions (51) can be used to help interpret models produced by any "black box" prediction method, such as neural networks, support vector machines, nearest neighbors, radial basis functions, etc. When there are a large number of predictor variables, it is very useful to have a measure of
relevance (Section 8.1) to reduce the potentially large number of variables and variable combinations to be considered. Also, a pass over the data (53) is required to evaluate each \bar{F}_l(z_l) for each set of joint values z_l of its argument. This can be time-consuming for large data sets, although subsampling could help somewhat.
For regression trees based on single-variable splits, however, the partial dependence of F(x) on a specified target variable subset z_l (51) is straightforward to evaluate given only the tree, without reference to the data itself (53). For a specific set of values for the variables z_l, a weighted traversal of the tree is performed. At the root of the tree, a weight value of 1 is assigned. For each nonterminal node visited, if its split variable is in the target subset z_l, the appropriate left or right daughter node is visited and the weight is not modified. If the node's split variable is a member of the complement subset z_{\setminus l}, then both daughters are visited and the current weight is multiplied by the fraction of training observations that went left or right, respectively, at that node.

Each terminal node visited during the traversal is assigned the current value of the weight. When the tree traversal is complete, the value of \bar{F}_l(z_l) is the corresponding weighted average of the F(x) values over those terminal nodes visited during the tree traversal. For a collection of M regression trees, obtained through boosting, the results for the individual trees are simply averaged.
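One plausible rendering of this weighted traversal, assuming a fitted scikit-learn regression tree (whose tree_ attribute exposes the children, split features, thresholds, node values, and training-sample weights used below), is the following sketch; it illustrates the idea rather than reproducing the author's implementation.

```python
# Minimal sketch of the weighted tree traversal for partial dependence,
# assuming a fitted scikit-learn DecisionTreeRegressor (estimator.tree_).
import numpy as np

def tree_partial_dependence(tree, z, target_features):
    """Partial dependence of a single regression tree at the point z_l = z.

    tree: estimator.tree_ of a fitted regression tree
    z: dict mapping feature index -> value, for the target subset z_l
    target_features: set of feature indices belonging to z_l
    """
    t = tree
    def recurse(node, weight):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                                  # terminal node
            return weight * t.value[node][0, 0]
        f = t.feature[node]
        if f in target_features:                        # split on a target variable:
            child = left if z[f] <= t.threshold[node] else right
            return recurse(child, weight)               # follow one branch, weight unchanged
        # split on a complement variable: visit both branches, weighting each by
        # the fraction of training observations that went that way at this node.
        n = t.weighted_n_node_samples
        return (recurse(left,  weight * n[left]  / n[node]) +
                recurse(right, weight * n[right] / n[node]))
    return recurse(0, 1.0)

# For a boosted model the per-tree values are combined just as the trees are in
# the ensemble, e.g. for a hypothetical scikit-learn fit `gbm`:
#   lr = gbm.learning_rate
#   pd = sum(lr * tree_partial_dependence(est.tree_, z, {0})
#            for est in gbm.estimators_[:, 0])   # plus the initial constant fit
```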
For purposes of interpretation through graphical displays, input variable subsets of low cardinality (l ≤ 2) are most useful. The most informative of such subsets would likely be comprised of the input variables deemed to be among the most influential (44), (45) in contributing to the variation of F(x). Illustrations are provided in Sections 8.3 and 9.

The closer the dependence of F(x) on the subset z_l is to being additive (54) or multiplicative (55), the more completely the partial dependence function \bar{F}_l(z_l) (51) captures the nature of the influence of the variables in z_l on the derived approximation F(x). Therefore, subsets z_l that group together those influential inputs that have complex [nonfactorable (55)] interactions between them will provide the most revealing partial dependence plots. As a diagnostic, both \bar{F}_l(z_l) and \bar{F}_{\setminus l}(z_{\setminus l}) can be separately computed for candidate subsets. The value of the multiple correlation over the training data between F(x) and \{\bar{F}_l(z_l), \bar{F}_{\setminus l}(z_{\setminus l})\} and/or \bar{F}_l(z_l) \cdot \bar{F}_{\setminus l}(z_{\setminus l}) can be used to gauge the degree of additivity and/or factorability of F(x) with respect to a chosen subset z_l. As an additional diagnostic, F_{z_{\setminus l}}(z_l) (50) can be computed for a small number of z_{\setminus l}-values randomly selected from the training data. The resulting functions of z_l can be compared to \bar{F}_l(z_l) to judge the variability of the partial dependence of F(x) on z_l, with respect to changing values of z_{\setminus l}.
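One way to operationalize this additivity/factorability diagnostic, given the values of F(x), \bar{F}_l(z_l) and \bar{F}_{\setminus l}(z_{\setminus l}) evaluated over the training data, is sketched below; the particular choice of an R-squared for the additive part and a simple correlation for the product is an assumption made for illustration.

```python
# Minimal sketch of the additivity / factorability diagnostic described above.
# F, F_l, F_cl are arrays of F(x), F_l(z_l), F_\l(z_\l) over the training data.
import numpy as np

def dependence_diagnostics(F, F_l, F_cl):
    # Additivity: squared multiple correlation of F with {F_l, F_\l}.
    design = np.column_stack([np.ones_like(F_l), F_l, F_cl])
    coef, *_ = np.linalg.lstsq(design, F, rcond=None)
    r2_additive = 1.0 - np.var(F - design @ coef) / np.var(F)
    # Factorability: correlation of F with the product F_l * F_\l.
    r_multiplicative = np.corrcoef(F, F_l * F_cl)[0, 1]
    return r2_additive, r_multiplicative
```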
In K-class logistic regression and classification (Section 4.6) there are K (logistic) regression functions {F_k(x)}_{k=1}^{K}. Each is logarithmically related to p_k(x) = Pr(y = k | x) through (29). Larger values of F_k(x) imply higher
probability of observing class k at x. Partial dependence plots of each F_k(x) on variable subsets z_l most relevant to that class (49) provide information on how the input variables influence the respective individual class probabilities.

8.3. Randomly generated function. In this section the interpretational tools described in the preceding two sections are applied to the first (of the 100) randomly generated functions (Section 6.1) used for the Monte Carlo studies of Section 6.

Figure 7 shows the estimated relative importance (44), (45) of the 10 input predictor variables. Some are seen to be more influential than others, but no small subset appears to dominate. This is consistent with the mechanism used to generate these functions.
Figure 8 displays single variable (l = 1) partial dependence plots (53) on the six most influential variables. The hash marks at the base of each plot represent the deciles of the corresponding predictor variable distribution. The piecewise constant nature of the approximation is evident. Unlike most approximation methods, there is no explicit smoothness constraint imposed upon TreeBoost models. Arbitrarily sharp discontinuities can be accommodated. The generally smooth trends exhibited in these plots suggest that a smooth approximation best describes this target. This is again consistent with the way these functions were generated.


FIG. 7. Relative importance of the input predictor variables for the first randomly generated function used in the Monte Carlo studies.


FIG. 8. Single-variable partial dependence plots for the six most influential predictor variables for the first randomly generated function used in the simulation studies.

Figure 9 displays two-variable (l = 2) partial dependence plots on some of the more influential variables. Interaction effects of varying degrees are indicated among these variable pairs. This is in accordance with the way in which these target functions were actually generated (39), (40).

Given the general complexity of these generated targets as a function of their arguments, it is unlikely that one would ever be able to uncover their complete detailed functional form through a series of such partial dependence plots. The goal is to obtain an understandable description of some of the important aspects of the functional relationship. In this example the target function was generated from a known prescription, so that at least qualitatively we can verify that this is the case here.

9. Real data. In this section the TreeBoost regression algorithms are illustrated on two moderate-sized data sets. The results in Section 6.4 suggest that the properties of the classification algorithm LK-TreeBoost are very similar to those of LogitBoost, which was extensively applied to data in FHT00. The first (scientific) data set consists of chemical concentration measurements on rock samples, and the second (demographic) is sample survey questionnaire data. Both data sets were partitioned into a learning sample consisting of two-thirds of the data, with the remaining data being used as a test sample for choosing the model size (number of iterations M). The shrinkage parameter (36) was set to v = 0.1.


FIG. 9. Two-variable partial dependence plots on a few of the important predictor variables for the first randomly generated function used in the simulation studies.

9.1. Garnet data. This data set consists of a sample of N = 13317 garnets collected from around the world [Griffin, Fisher, Friedman, Ryan and O'Reilly (1997)]. A garnet is a complex Ca-Mg-Fe-Cr silicate that commonly occurs as a minor phase in rocks making up the earth's mantle. The variables associated with each garnet are the concentrations of various chemicals and the tectonic plate setting where the rock was collected:

(TiO2, Cr2O3, FeO, MnO, MgO, CaO, Zn, Ga, Sr, Y, Zr, tec).

The first eleven variables representing concentrations are real-valued. The last variable (tec) takes on three categorical values: "ancient stable shields," "Proterozoic shield areas," and "young orogenic belts." There are no missing values in these data, but the distributions of many of the variables tend to be highly skewed toward larger values, with many outliers.

The purpose of this exercise is to estimate the concentration of titanium (TiO2) as a function of the joint concentrations of the other chemicals and the tectonic plate index.


TABLE 3
Average absolute error of LS-TreeBoost, LAD-TreeBoost, and M-TreeBoost on the
garnet data for varying numbers of terminal nodes in the individual trees

Terminal nodes     LS      LAD      M
       2          0.58     0.57    0.57
       3          0.48     0.47    0.46
       4          0.49     0.45    0.45
       6          0.48     0.44    0.43
      11          0.47     0.44    0.43
      21          0.46     0.43    0.43

Table 3 shows the average absolute error in predicting the output y-variable, relative to the optimal constant prediction,

(58)    A(y, F(x)) = \frac{E \, |y - F(x)|}{E \, |y - \mathrm{median}(y)|},

based on the test sample, for LS-TreeBoost, LAD-TreeBoost, and M-TreeBoost for several values of the size (number of terminal nodes) J of the constituent trees. Note that this prediction error measure (58) includes the additive irreducible error associated with the (unknown) underlying target function F*(x) (1). This irreducible error adds the same amount to all entries in Table 3. Thus, differences in those entries reflect a proportionally greater improvement in approximation error (37) on the target function itself.
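The error measure (58) is straightforward to compute on a test sample; the short sketch below simply restates the formula and is not tied to any particular model.

```python
# Relative average absolute error (58): mean |y - F(x)| divided by the mean
# absolute deviation of y about its median (the best constant predictor under L1).
import numpy as np

def relative_abs_error(y, y_pred):
    y = np.asarray(y, dtype=float)
    return np.mean(np.abs(y - y_pred)) / np.mean(np.abs(y - np.median(y)))
```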
For all three methods the additive (J = 2) approximation is distinctly inferior to that using larger trees, indicating the presence of interaction effects (Section 7) among the input variables. Six terminal node trees are seen to be adequate and using only three terminal node trees is seen to provide accuracy within 10% of the best. The errors of LAD-TreeBoost and M-TreeBoost are smaller than those of LS-TreeBoost and similar to each other, with perhaps M-TreeBoost having a slight edge. These results are consistent with those obtained in the simulation studies as shown in Figures 2 and 6.
Figure 10 shows the relative importance (44), (45) of the 11 input variables in predicting TiO2 concentration based on the M-TreeBoost approximation using six terminal node trees. Results are very similar for the other models in Table 3 with similar errors. Ga and Zr are seen to be the most influential with MnO being somewhat less important. The top three panels of Figure 11 show the partial dependence (51) of the approximation F(x) on these three most influential variables. The bottom three panels show the partial dependence of F(x) on the three pairings of these variables. A strong interaction effect between Ga and Zr is clearly evident. F(x) has very little dependence on either variable when the other takes on its smallest values. As the value of one of them is increased, the dependence of F(x) on the other is correspondingly amplified. A somewhat smaller interaction effect is seen between MnO and Zr.



FIG. 10. Relative influence of the eleven input variables on the target variation for the garnet data. Ga and Zr are much more influential than the others.


FIG. 11. Partial dependence plots for the three most influential input variables in the garnet data. Note the different vertical scales for each plot. There is a strong interaction effect between Zr and Ga, and a somewhat weaker one between Zr and MnO.


TABLE 4
Variables for the demographic data

Variable    Demographic                    Number of values    Type
    1       sex                                   2            cat
    2       marital status                        5            cat
    3       age                                   7            real
    4       education                             6            real
    5       occupation                            9            cat
    6       income                                9            real
    7       years in Bay Area                     5            real
    8       dual incomes                          2            cat
    9       number in household                   9            real
   10       number in household < 18              9            real
   11       householder status                    3            cat
   12       type of home                          5            cat
   13       ethnic classification                 8            cat
   14       language in home                      3            cat

9.2. Demographic data. This data set consists of N = 9409 questionnaires filled out by shopping mall customers in the San Francisco Bay Area [Impact Resources, Inc., Columbus, Ohio (1987)]. Here we use answers to the first 14 questions, relating to demographics, for illustration. These questions are listed in Table 4. The data are seen to consist of a mixture of real and categorical variables, each with a small number of distinct values. There are many missing values.

We illustrate TreeBoost on these data by modeling income as a function of the other 13 variables. Table 5 shows the average absolute error in predicting income, relative to the best constant predictor (58), for the three regression TreeBoost algorithms.
There is little difference in performance among the three methods. Owing to the highly discrete nature of these data, there are no outliers or long-tailed distributions among the real-valued inputs or the output y. There is also very little reduction in error as the constituent tree size J is increased, indicating
TABLE 5
Average absolute error of LS-TreeBoost, LAD-TreeBoost, and M-TreeBoost on the
demographic data for varying numbers of terminal nodes in the individual trees

Terminal nodes     LS      LAD      M
       2          0.60     0.63    0.61
       3          0.60     0.62    0.59
       4          0.59     0.59    0.59
       6          0.59     0.58    0.59
      11          0.59     0.57    0.58
      21          0.59     0.58    0.58



FIG. 12. Relative influence of the 13 input variables on the target variation for the demographic data. No small group of variables dominates.

lack of interactions among the input variables; an approximation additive in the individual input variables (J = 2) seems to be adequate.

Figure 12 shows the relative importance of the input variables in predicting income, based on the (J = 2) LS-TreeBoost approximation. There is no small subset of them that dominates. Figure 13 shows partial dependence plots on the six most influential variables. Those for the categorical variables are represented as bar plots, and all plots are centered to have zero mean over the data. Since the approximation consists of main effects only [first sum in (42)], these plots completely describe the corresponding contributions f_j(x_j) of each of these inputs.

There do not appear to be any surprising results in Figure 13. The dependencies for the most part confirm prior suspicions and suggest that the approximation is intuitively reasonable.

10. Data mining. As "off-the-shelf" tools for predictive data mining, the TreeBoost procedures have some attractive properties. They inherit the favorable characteristics of trees while mitigating many of the unfavorable ones. Among the most favorable is robustness. All TreeBoost procedures are invariant under all (strictly) monotone transformations of the individual input variables. For example, using x_j, log x_j, e^{x_j}, or x_j^a as the jth input variable yields the same result. Thus, the need for considering input variable transformations is eliminated. As a consequence of this invariance, sensitivity to long-tailed
[Figure 13 comprises six panels: Occupation, Household status, Marital status, Age, Education, and Type of home.]

FIG. 13. Partial dependence plots for the six most influential input variables in the demographic data. Note the different vertical scales for each plot. The abscissa values for age and education are codes representing consecutive equal intervals. The dependence of income on age is nonmonotonic, reaching a maximum at the value 5, representing the interval 45-54 years old.

distributions and outliers is also eliminated. In addition, LAD-TreeBoost is completely robust against outliers in the output variable y as well. M-TreeBoost also enjoys a fair measure of robustness against output outliers.

Another advantage of decision tree induction is internal feature selection. Trees tend to be quite robust against the addition of irrelevant input variables. Also, tree-based models handle missing values in a unified and elegant manner [Breiman, Friedman, Olshen and Stone (1983)]. There is no need to consider external imputation schemes. TreeBoost clearly inherits these properties as well.

The principal disadvantage of single tree models is inaccuracy. This is a consequence of the coarse nature of their piecewise constant approximations, especially for smaller trees, and instability, especially for larger trees, and the fact that they involve predominately high-order interactions. All of these are mitigated by boosting. TreeBoost procedures produce piecewise constant approximations, but the granularity is much finer. TreeBoost enhances stability by using small trees and by the effect of averaging over many of them. The interaction level of TreeBoost approximations is effectively controlled by limiting the size of the individual constituent trees.

Among the purported biggest advantages of single tree models is interpretability, whereas boosted trees are thought to lack this feature. Small trees
can be easily interpreted, but due to instability such interpretations should be treated with caution. The interpretability of larger trees is questionable [Ripley (1996)]. TreeBoost approximations can be interpreted using partial dependence plots in conjunction with the input variable relative importance measure, as illustrated in Sections 8.3 and 9. While not providing a complete description, they at least offer some insight into the nature of the input-output relationship. Although these tools can be used with any approximation method, the special characteristics of tree-based models allow their rapid calculation. Partial dependence plots can also be used with single regression trees, but as noted above, more caution is required owing to greater instability.
After sorting of the input variables, the computation of the regression TreeBoost procedures (LS, LAD, and M-TreeBoost) scales linearly with the number of observations N, the number of input variables n and the number of iterations M. It scales roughly as the logarithm of the size of the constituent trees J. In addition, the classification algorithm LK-TreeBoost scales linearly with the number of classes K; but it scales highly sublinearly with the number of iterations M, if influence trimming (Section 4.5.1) is employed. As a point of reference, applying M-TreeBoost to the garnet data of Section 9.1 (N = 13317, n = 11, J = 6, M = 500) required 20 seconds on a 933 MHz Pentium III computer.
As seen in Section 5, many boosting iterations (M ≈ 500) can be required to obtain optimal TreeBoost approximations, based on small values of the shrinkage parameter v (36). This is somewhat mitigated by the very small size of the trees induced at each iteration. However, as illustrated in Figure 1, improvement tends to be very rapid initially and then levels off to slower increments. Thus, nearly optimal approximations can be achieved quite early (M ≈ 100) with correspondingly much less computation. These near-optimal approximations can be used for initial exploration and to provide an indication of whether the final approximation will be of sufficient accuracy to warrant continuation. If lack of fit improves very little in the first few iterations (say 100), it is unlikely that there will be dramatic improvement later on. If continuation is judged to be warranted, the procedure can be restarted where it left off previously, so that no computational investment is lost. Also, one can use larger values of the shrinkage parameter to speed initial improvement for this purpose. As seen in Figure 1, using v = 0.25 provided accuracy within 10% of the optimal (v = 0.1) solution after only 20 iterations. In this case however, boosting would have to be restarted from the beginning if a smaller shrinkage parameter value were to be subsequently employed.
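The restart-where-it-left-off strategy corresponds to the warm-start facility found in common implementations. A sketch under the assumption that scikit-learn is used (with warm_start=True, increasing n_estimators and refitting adds trees to the existing sequence rather than starting over):

```python
# Minimal sketch (assumes scikit-learn): boost in installments, monitoring a
# test sample, and stop adding trees once test error no longer improves.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def boost_in_installments(X_tr, y_tr, X_te, y_te, step=100, max_M=1000, nu=0.1):
    gbm = GradientBoostingRegressor(n_estimators=step, learning_rate=nu,
                                    max_leaf_nodes=6, warm_start=True,
                                    random_state=0)
    best_err, best_M = np.inf, 0
    for M in range(step, max_M + step, step):
        gbm.set_params(n_estimators=M)     # only the new trees are fit
        gbm.fit(X_tr, y_tr)
        err = np.mean(np.abs(y_te - gbm.predict(X_te)))
        if err < best_err:
            best_err, best_M = err, M
        else:
            break                          # test error stopped improving
    return gbm, best_M, best_err
```

As emphasized in the text, the independent test sample does the stopping; the same warm-start call applied to newly arrived data is one way to approximate the continued-boosting strategy discussed in the following paragraphs.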
The ability of TreeBoost procedures to give a quick indication of potential predictability, coupled with their extreme robustness, makes them a useful preprocessing tool that can be applied to imperfect data. If sufficient predictability is indicated, further data cleaning can be invested to render it suitable for more sophisticated, less robust, modeling procedures.
If more data become available after modeling is complete, boosting can be continued on the new data starting from the previous solution. This usually improves accuracy provided an independent test data set is used to monitor
improvement to prevent overfitting on the new data. Although the accuracy increase is generally less than would be obtained by redoing the entire analysis on the combined data, considerable computation is saved.

Boosting on successive subsets of data can also be used when there is insufficient random access main memory to store the entire data set. Boosting can be applied to "arcbites" of data [Breiman (1997)] sequentially read into main memory, each time starting at the current solution, recycling over previous subsets as time permits. Again, it is crucial to use an independent test set to stop training on each individual subset at that point where the estimated accuracy of the combined approximation starts to diminish.

Acknowledgments. Helpful discussions with Trevor Hastie, Bogdan Popescu and Robert Tibshirani are gratefully acknowledged.

REFERENCES
BECKER, R. A. and CLEVELAND, W. S. (1996). The design and control of Trellis display. J. Comput. Statist. Graphics 5 123-155.
BREIMAN, L. (1997). Pasting bites together for prediction in large data sets and on-line. Technical report, Dept. Statistics, Univ. California, Berkeley.
BREIMAN, L. (1999). Prediction games and arcing algorithms. Neural Comp. 11 1493-1517.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. and STONE, C. (1983). Classification and Regression Trees. Wadsworth, Belmont, CA.
COPAS, J. B. (1983). Regression, prediction, and shrinkage (with discussion). J. Roy. Statist. Soc. Ser. B 45 311-354.
DONOHO, D. L. (1993). Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data. In Different Perspectives on Wavelets. Proceedings of Symposium in Applied Mathematics (I. Daubechies, ed.) 47 173-205. Amer. Math. Soc., Providence, RI.
DRUCKER, H. (1997). Improving regressors using boosting techniques. In Proceedings of the Fourteenth International Conference on Machine Learning (D. Fisher, Jr., ed.) 107-115. Morgan Kaufmann, San Francisco.
DUFFY, N. and HELMBOLD, D. (1999). A geometric approach to leveraging weak learners. In Computational Learning Theory. Proceedings of 4th European Conference EuroCOLT99 (P. Fischer and H. U. Simon, eds.) 18-33. Springer, New York.
FREUND, Y. and SCHAPIRE, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148-156. Morgan Kaufmann, San Francisco.
FRIEDMAN, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1-141.
FRIEDMAN, J. H., HASTIE, T. and TIBSHIRANI, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion). Ann. Statist. 28 337-407.
GRIFFIN, W. L., FISHER, N. I., FRIEDMAN, J. H., RYAN, C. G. and O'REILLY, S. (1999). Cr-Pyrope garnets in lithospheric mantle. J. Petrology 40 679-704.
HASTIE, T. and TIBSHIRANI, R. (1990). Generalized Additive Models. Chapman and Hall, London.
HUBER, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.
MALLAT, S. and ZHANG, Z. (1993). Matching pursuits with time frequency dictionaries. IEEE Trans. Signal Processing 41 3397-3415.
POWELL, M. J. D. (1987). Radial basis functions for multivariate interpolation: a review. In Algorithms for Approximation (J. C. Mason and M. G. Cox, eds.) 143-167. Clarendon Press, Oxford.
RATSCH, G., ONODA, T. and MULLER, K. R. (1998). Soft margins for AdaBoost. NeuroCOLT Technical Report NC-TR-98-021.


RIPLEY, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press.
RUMELHART, D. E., HINTON, G. E. and WILLIAMS, R. J. (1986). Learning representations by back-propagating errors. Nature 323 533-536.
SCHAPIRE, R. and SINGER, Y. (1998). Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM, New York.
VAPNIK, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
WARNER, J. R., TORONTO, A. E., VEASEY, L. R. and STEPHENSON, R. (1961). A mathematical model for medical diagnosis-application to congenital heart disease. J. Amer. Med. Assoc. 177 177-184.

DEPARTMENT OF STATISTICS
SEQUOIA HALL
STANFORD UNIVERSITY
STANFORD, CALIFORNIA 94305
E-MAIL: [email protected]
