Greedy Function Approximation: A Gradient Boosting Machine
Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and extend access to The
Annals of Statistics.
BY JEROME H. FRIEDMAN
Stanford University
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and of Friedman, Hastie and Tibshirani are discussed.
where

    P_{m-1} = Σ_{i=0}^{m-1} p_i   and   p_m = −ρ_m g_m,

and the function-space analogue of the accumulated parameter vector is the additive expansion

    F_{m-1}(x) = Σ_{i=0}^{m-1} f_i(x).
This constrained negative gradient h(x; a_m) is used in place of the unconstrained one −g_m(x) (7) in the steepest-descent strategy. Specifically, the line search (8) is performed

    (12) ρ_m = arg min_ρ Σ_{i=1}^N L(y_i, F_{m-1}(x_i) + ρ h(x_i; a_m)).
ALGORITHM 1 (Gradient_Boost).
1. F_0(x) = arg min_ρ Σ_{i=1}^N L(y_i, ρ)
2. For m = 1 to M do:
3.   ỹ_i = −[∂L(y_i, F(x_i))/∂F(x_i)]_{F(x)=F_{m-1}(x)}, i = 1, N
4.   a_m = arg min_{a,β} Σ_{i=1}^N [ỹ_i − β h(x_i; a)]²
5.   ρ_m = arg min_ρ Σ_{i=1}^N L(y_i, F_{m-1}(x_i) + ρ h(x_i; a_m))
6.   F_m(x) = F_{m-1}(x) + ρ_m h(x; a_m)
7. endFor
end Algorithm
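The generic procedure admits a very short sketch. The following illustration (not the paper's implementation; the names `fit_stump`, `gradient_boost`, and `predict` are ours) uses a single-split regression stump on one input as the base learner h(x; a), with the loss supplied through its negative gradient (line 3) and line search (line 5):

```python
import numpy as np

def fit_stump(x, ytil):
    """Least-squares fit (line 4) of a two-leaf base learner h(x; a) to the
    pseudoresponses; here a = (split point, left mean, right mean)."""
    best_sse, best = np.inf, None
    for s in np.unique(x)[:-1]:
        m = x <= s
        left, right = ytil[m], ytil[~m]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (s, left.mean(), right.mean())
    s, bl, br = best
    return lambda x: np.where(x <= s, bl, br)

def gradient_boost(x, y, F0, neg_gradient, line_search, M=50):
    """Algorithm 1 with a pluggable loss: neg_gradient supplies the
    pseudoresponses (line 3), line_search supplies rho_m (line 5)."""
    F = np.full(len(y), float(F0))
    stages = []
    for _ in range(M):
        ytil = neg_gradient(y, F)          # line 3
        h = fit_stump(x, ytil)             # line 4
        rho = line_search(y, F, h(x))      # line 5
        F = F + rho * h(x)                 # line 6
        stages.append((rho, h))
    return F0, stages

def predict(model, x):
    """Evaluate the accumulated additive expansion F_M(x)."""
    F0, stages = model
    F = np.full(len(x), float(F0))
    for rho, h in stages:
        F = F + rho * h(x)
    return F
```

For squared-error loss the pseudoresponses are y_i − F_{m-1}(x_i) and the line search has the closed form ρ = Σ ỹ_i h_i / Σ h_i², recovering the least-squares special case that follows.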
ALGORITHM 2 (LS_Boost).
F_0(x) = ȳ
For m = 1 to M do:
  ỹ_i = y_i − F_{m-1}(x_i), i = 1, N
  (ρ_m, a_m) = arg min_{a,ρ} Σ_{i=1}^N [ỹ_i − ρ h(x_i; a)]²
  F_m(x) = F_{m-1}(x) + ρ_m h(x; a_m)
endFor
end Algorithm
4.2. Least absolute deviation (LAD) regression. For the loss function L(y, F) = |y − F|, one has

    (13) ỹ_i = sign(y_i − F_{m-1}(x_i)).

This implies that h(x; a) is fit (by least-squares) to the sign of the current residuals in line 4 of Algorithm 1. The line search (line 5) becomes

    ρ_m = arg min_ρ Σ_{i=1}^N |y_i − F_{m-1}(x_i) − ρ h(x_i; a_m)|
    (14)    = median_w{ (y_i − F_{m-1}(x_i)) / h(x_i; a_m) },   w_i = |h(x_i; a_m)|.

Here median_w{·} is the weighted median with weights w_i. Inserting these results [(13), (14)] into Algorithm 1 yields an algorithm for least absolute deviation boosting, using any base learner h(x; a).
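The weighted median in the line search is easy to compute by sorting; a possible sketch (function names are ours):

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: the smallest value at which the cumulative weight
    reaches half of the total weight."""
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def lad_line_search(y, F, h):
    """rho_m = median_w{(y_i - F(x_i)) / h(x_i)} with weights |h(x_i)|,
    restricted to observations where h is nonzero."""
    nz = h != 0
    return weighted_median((y[nz] - F[nz]) / h[nz], np.abs(h[nz]))
```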
4.3. Regression trees. Here we consider the special case where each base learner is a J-terminal node regression tree [Breiman, Friedman, Olshen and Stone (1983)]. Each regression tree model itself has the additive form

    (15) h(x; {b_j, R_j}_1^J) = Σ_{j=1}^J b_j 1(x ∈ R_j).

Here {R_j}_1^J are disjoint regions that collectively cover the space of all joint values of the predictor variables x. These regions are represented by the terminal nodes of the corresponding tree. The indicator function 1(·) has the value 1 if its argument is true, and zero otherwise. The "parameters" of this base learner (15) are the coefficients {b_j}_1^J, and the quantities that define the boundaries of the regions {R_j}_1^J. These are the splitting variables and the values of those variables that represent the splits at the nonterminal nodes of the tree. Because the regions are disjoint, (15) is equivalent to the prediction rule: if x ∈ R_j then h(x) = b_j.
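As a concrete illustration of the additive form (15), a hypothetical 3-terminal-node tree on two inputs can be written directly as disjoint regions paired with constants (the regions and values below are invented for illustration):

```python
# Each entry is (membership predicate for R_j, coefficient b_j).
regions = [
    (lambda x: x[0] <= 0.0,                    1.5),   # R_1, b_1
    (lambda x: (x[0] > 0.0) & (x[1] <= 0.5),  -0.7),   # R_2, b_2
    (lambda x: (x[0] > 0.0) & (x[1] > 0.5),    2.1),   # R_3, b_3
]

def h(x):
    """Because the regions are disjoint, exactly one indicator fires:
    h(x) = sum_j b_j 1(x in R_j), i.e. 'if x in R_j then h(x) = b_j'."""
    return sum(b * float(pred(x)) for pred, b in regions)
```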
For a regression tree, the update at line 6 of Algorithm 1 becomes

    (16) F_m(x) = F_{m-1}(x) + ρ_m Σ_{j=1}^J b_jm 1(x ∈ R_jm).

Here {R_jm}_1^J are the regions defined by the terminal nodes of the tree at the mth iteration. They are constructed to predict the pseudoresponses {ỹ_i}_1^N (line 3) by least-squares (line 4). The {b_jm} are the corresponding least-squares coefficients,

    b_jm = ave_{x_i ∈ R_jm} ỹ_i.

The update (16) can be expressed as

    (17) F_m(x) = F_{m-1}(x) + Σ_{j=1}^J γ_jm 1(x ∈ R_jm)

with γ_jm = ρ_m b_jm. One can view (17) as adding J separate basis functions at each step {1(x ∈ R_jm)}_1^J, instead of a single additive one as in (16). Thus, in this case one can further improve the quality of the fit by using the optimal coefficients for each of these separate basis functions (17). These optimal coefficients are the solution to

    {γ_jm}_1^J = arg min_{γ_j} Σ_{i=1}^N L(y_i, F_{m-1}(x_i) + Σ_{j=1}^J γ_j 1(x_i ∈ R_jm)).
Owing to the disjoint nature of the regions produced by regression trees, this reduces to

    (18) γ_jm = arg min_γ Σ_{x_i ∈ R_jm} L(y_i, F_{m-1}(x_i) + γ).

This is just the optimal constant update in each terminal node region, based on the loss function L, given the current approximation F_{m-1}(x). For the case of LAD regression, (18) becomes the median of the current residuals in each terminal node.
ALGORITHM 3 (LAD_TreeBoost).
F_0(x) = median{y_i}_1^N
For m = 1 to M do:
  ỹ_i = sign(y_i − F_{m-1}(x_i)), i = 1, N
  {R_jm}_1^J = J-terminal node tree({ỹ_i, x_i}_1^N)
  γ_jm = median_{x_i ∈ R_jm} {y_i − F_{m-1}(x_i)}, j = 1, J
  F_m(x) = F_{m-1}(x) + Σ_{j=1}^J γ_jm 1(x ∈ R_jm)
endFor
end Algorithm
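A minimal sketch of Algorithm 3 using the smallest (J = 2) trees, with a hand-rolled stump in place of a full tree induction (all names are ours; a sketch, not the paper's implementation):

```python
import numpy as np

def fit_stump_regions(x, ytil):
    """Least-squares stump on the pseudoresponses: returns a split point s,
    i.e. two terminal regions R_1 = {x <= s}, R_2 = {x > s}."""
    best_sse, best_s = np.inf, None
    for s in np.unique(x)[:-1]:
        m = x <= s
        sse = ytil[m].var() * m.sum() + ytil[~m].var() * (~m).sum()
        if sse < best_sse:
            best_sse, best_s = sse, s
    return best_s

def lad_treeboost(x, y, M=100):
    """Algorithm 3 with 2-terminal-node trees: fit the sign of the current
    residuals, then set each leaf value to the median of the raw residuals
    falling in that leaf."""
    F0 = np.median(y)
    F = np.full(len(y), F0)
    stages = []
    for _ in range(M):
        r = y - F
        s = fit_stump_regions(x, np.sign(r))              # tree on sign residuals
        left = x <= s
        gl, gr = np.median(r[left]), np.median(r[~left])  # median leaf updates
        F = F + np.where(left, gl, gr)
        stages.append((s, gl, gr))
    return F0, stages

def lad_predict(model, x):
    F0, stages = model
    F = np.full(len(x), F0)
    for s, gl, gr in stages:
        F = F + np.where(x <= s, gl, gr)
    return F
```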
This algorithm is highly robust. The trees use only order information on the individual input variables x_j, and the pseudoresponses ỹ_i (13) have only two values, ỹ_i ∈ {−1, 1}. The terminal node updates are based on medians.
    (20) ỹ_i = { y_i − F_{m-1}(x_i),              |y_i − F_{m-1}(x_i)| ≤ δ_m,
               { δ_m · sign(y_i − F_{m-1}(x_i)),  |y_i − F_{m-1}(x_i)| > δ_m,

with L given by (19). The solution to (19), (20) can be obtained by standard iterative methods [see Huber (1964)].
The value of the transition point δ defines those residual values that are considered to be "outliers," subject to absolute rather than squared-error loss. An optimal value will depend on the distribution of y − F*(x), where F* is the true target function (1). A common practice is to choose the value of δ to be the α-quantile of the distribution of |y − F*(x)|, where (1 − α) controls the breakdown point of the procedure. The "breakdown point" is the fraction of observations that can be arbitrarily modified without seriously degrading the quality of the result. Since F*(x) is unknown, one uses the current estimate F_{m-1}(x) as an approximation at the mth iteration. The distribution of |y − F_{m-1}(x)| is estimated by the current residuals, leading to

    δ_m = quantile_α{ |y_i − F_{m-1}(x_i)| }_1^N.
With regression trees as base learners we use the strategy of Section 4.3, that is, a separate update (18) in each terminal node R_jm. For the Huber loss (19) the solution to (18) can be approximated by a single step of the standard iterative procedure [Huber (1964)] starting at the median

    r̃_jm = median_{x_i ∈ R_jm} {r_{m-1}(x_i)},

where

    r_{m-1}(x_i) = y_i − F_{m-1}(x_i).

The approximation is

    (21) γ_jm = r̃_jm + (1/N_jm) Σ_{x_i ∈ R_jm} sign(r_{m-1}(x_i) − r̃_jm) · min(δ_m, |r_{m-1}(x_i) − r̃_jm|),

where N_jm is the number of observations in the jth terminal node. This gives the following algorithm for boosting regression trees based on Huber loss (19).
ALGORITHM 4 (M_TreeBoost).
F_0(x) = median{y_i}_1^N
For m = 1 to M do:
  r_{m-1}(x_i) = y_i − F_{m-1}(x_i), i = 1, N
  δ_m = quantile_α{ |r_{m-1}(x_i)| }_1^N
  ỹ_i = { r_{m-1}(x_i),              |r_{m-1}(x_i)| ≤ δ_m
        { δ_m · sign(r_{m-1}(x_i)),  |r_{m-1}(x_i)| > δ_m,   i = 1, N
  {R_jm}_1^J = J-terminal node tree({ỹ_i, x_i}_1^N)
  r̃_jm = median_{x_i ∈ R_jm} {r_{m-1}(x_i)}, j = 1, J
  γ_jm = r̃_jm + (1/N_jm) Σ_{x_i ∈ R_jm} sign(r_{m-1}(x_i) − r̃_jm) · min(δ_m, |r_{m-1}(x_i) − r̃_jm|), j = 1, J
  F_m(x) = F_{m-1}(x) + Σ_{j=1}^J γ_jm 1(x ∈ R_jm)
endFor
end Algorithm
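The two Huber-specific ingredients, the clipped pseudoresponse and the single-step terminal node update starting from the leaf median, can be sketched as follows (function names are ours; this is an illustration of our reading of the update, not the paper's code):

```python
import numpy as np

def huber_pseudo_response(r, delta):
    """Negative gradient of the Huber loss: r where |r| <= delta,
    delta * sign(r) otherwise."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def huber_leaf_update(r_leaf, delta):
    """Single iterative step starting from the leaf median:
    gamma = median + mean of the clipped deviations from it."""
    med = np.median(r_leaf)
    d = r_leaf - med
    return med + np.mean(np.sign(d) * np.minimum(delta, np.abs(d)))
```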
The pseudoresponse is

    (22) ỹ_i = −[∂L(y_i, F(x_i))/∂F(x_i)]_{F(x)=F_{m-1}(x)} = 2y_i/(1 + exp(2y_i F_{m-1}(x_i))).
With regression trees as base learners we again use the strategy (Section 4.3) of separate updates in each terminal node R_jm:

ALGORITHM 5 (L2_TreeBoost).
F_0(x) = (1/2) log((1 + ȳ)/(1 − ȳ))
For m = 1 to M do:
  ỹ_i = 2y_i/(1 + exp(2y_i F_{m-1}(x_i))), i = 1, N
  {R_jm}_1^J = J-terminal node tree({ỹ_i, x_i}_1^N)
  γ_jm = Σ_{x_i ∈ R_jm} ỹ_i / Σ_{x_i ∈ R_jm} |ỹ_i|(2 − |ỹ_i|), j = 1, J
  F_m(x) = F_{m-1}(x) + Σ_{j=1}^J γ_jm 1(x ∈ R_jm)
endFor
end Algorithm
    (24) φ_m(ρ, a) = Σ_{i=1}^N log[1 + exp(−2y_i F_{m-1}(x_i)) · exp(−2y_i ρ h(x_i; a))]

and

    (ρ_m, a_m) = arg min_{ρ,a} φ_m(ρ, a).
This suggests that all observations (y_i, x_i) for which y_i F_{m-1}(x_i) is relatively very large can be deleted from all computations of the mth iteration without having a substantial effect on the result. Thus,

    (25) w_i = exp(−2y_i F_{m-1}(x_i))

can be viewed as a measure of influence on the estimate; observations with the l(α) smallest weights are deleted, where l(α) is the solution to

    (27) Σ_{i=1}^{l(α)} w_(i) = α Σ_{i=1}^N w_i.
Here {w_(i)}_1^N are the weights {w_i}_1^N arranged in ascending order. Typical values are α ∈ [0.05, 0.2]. Note that influence trimming based on (25), (27) is identical to the "weight trimming" strategy employed with Real AdaBoost, whereas (26), (27) is equivalent to that used with LogitBoost, in FHT00. There it was seen that 90% to 95% of the observations were often deleted without sacrificing accuracy of the estimates, using either influence measure. This results in a corresponding reduction in computation by factors of 10 to 20.
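Influence trimming amounts to sorting the weights and deleting the smallest ones whose sum stays within the fraction α of the total, as in (27); a possible sketch (function name is ours):

```python
import numpy as np

def influence_trim(weights, alpha=0.1):
    """Return a boolean mask keeping the influential observations: the
    l(alpha) smallest weights, summing to at most alpha of the total
    weight, are deleted."""
    w = np.asarray(weights, dtype=float)
    order = np.argsort(w)                  # ascending influence
    cum = np.cumsum(w[order])
    l_alpha = np.searchsorted(cum, alpha * w.sum(), side="right")
    keep = np.ones(len(w), dtype=bool)
    keep[order[:l_alpha]] = False
    return keep
```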
or equivalently

    (30) p_k(x) = exp(F_k(x)) / Σ_{l=1}^K exp(F_l(x)).
Substituting (30) into (28) and taking first derivatives one has

    (31) ỹ_ik = y_ik − p_{k,m-1}(x_i),

where p_{k,m-1}(x) is derived from F_{k,m-1}(x) through (30). Thus, K trees are induced at each iteration m to predict the corresponding current residuals for each class on the probability scale. Each of these trees has J terminal nodes, with corresponding regions {R_jkm}_{j=1}^J. The model updates γ_jkm corresponding to these regions are the solution to

    {γ_jkm} = arg min_{γ_jk} Σ_{i=1}^N Σ_{k=1}^K φ(y_ik, F_{k,m-1}(x_i) + Σ_{j=1}^J γ_jk 1(x_i ∈ R_jkm)),
where φ(y_k, F_k) = −y_k log p_k from (28), with F_k related to p_k through (30). This has no closed form solution. Moreover, the regions corresponding to the different class trees overlap, so that the solution does not reduce to a separate calculation within each region of each tree in analogy with (18). Following FHT00, we approximate the solution with a single Newton-Raphson step, using a diagonal approximation to the Hessian. This decomposes the problem into a separate calculation for each terminal node of each tree. The result is

    (32) γ_jkm = ((K − 1)/K) · Σ_{x_i ∈ R_jkm} ỹ_ik / Σ_{x_i ∈ R_jkm} |ỹ_ik|(1 − |ỹ_ik|).
ALGORITHM 6 (LK_TreeBoost).
F_k0(x) = 0, k = 1, K
For m = 1 to M do:
  p_k(x) = exp(F_k(x)) / Σ_{l=1}^K exp(F_l(x)), k = 1, K
  For k = 1 to K do:
    ỹ_ik = y_ik − p_k(x_i), i = 1, N
    {R_jkm}_{j=1}^J = J-terminal node tree({ỹ_ik, x_i}_1^N)
    γ_jkm = ((K − 1)/K) · Σ_{x_i ∈ R_jkm} ỹ_ik / Σ_{x_i ∈ R_jkm} |ỹ_ik|(1 − |ỹ_ik|), j = 1, J
    F_km(x) = F_{k,m-1}(x) + Σ_{j=1}^J γ_jkm 1(x ∈ R_jkm)
  endFor
endFor
end Algorithm
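A compact sketch of Algorithm 6 with two-leaf trees (stumps) on a single input; `best_split` and `leaf_gamma` are our helper names, the softmax (30) is computed in a numerically stabilized form, and the leaf update follows (32). A sketch, not the paper's implementation:

```python
import numpy as np

def best_split(x, ytil):
    """Least-squares stump split on the class-k residuals."""
    best_sse, best_s = np.inf, None
    for s in np.unique(x)[:-1]:
        m = x <= s
        sse = ytil[m].var() * m.sum() + ytil[~m].var() * (~m).sum()
        if sse < best_sse:
            best_sse, best_s = sse, s
    return best_s

def leaf_gamma(ytil, K):
    """Terminal-node update (32), guarded against a vanishing denominator."""
    denom = np.sum(np.abs(ytil) * (1.0 - np.abs(ytil)))
    return 0.0 if denom == 0 else (K - 1) / K * ytil.sum() / denom

def lk_treeboost(x, Y, M=100, K=3):
    """K parallel score functions F_k, each boosted on its residual
    y_ik - p_k(x_i); Y has shape (K, N) of 0/1 class indicators."""
    N = len(x)
    F = np.zeros((K, N))
    for _ in range(M):
        P = np.exp(F - F.max(axis=0))      # (30), stabilized softmax
        P /= P.sum(axis=0)
        for k in range(K):
            ytil = Y[k] - P[k]             # (31): current residuals
            s = best_split(x, ytil)
            left = x <= s
            gl, gr = leaf_gamma(ytil[left], K), leaf_gamma(ytil[~left], K)
            F[k] = F[k] + np.where(left, gl, gr)
    return F                               # training-sample scores
```

Classification then assigns each x to the class with the largest F_k(x) (for equal misclassification costs).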
where c(k, k′) is the cost associated with predicting the kth class when the truth is k′. Note that for K = 2, Algorithm 6 is equivalent to Algorithm 5.
Algorithm 6 bears a close similarity to the K-class LogitBoost procedure of FHT00, which is based on Newton-Raphson rather than gradient descent in function space. In that algorithm K trees were induced, each using corresponding pseudoresponses
    (33) ỹ_ik = ((K − 1)/K) · (y_ik − p_k(x_i)) / (p_k(x_i)(1 − p_k(x_i)))

and a weight

    (34) w_k(x_i) = p_k(x_i)(1 − p_k(x_i))

applied to each observation (ỹ_ik, x_i). The terminal node updates were

    γ_jkm = Σ_{x_i ∈ R_jkm} w_k(x_i) ỹ_ik / Σ_{x_i ∈ R_jkm} w_k(x_i),
each daughter node, whereas (34) (LogitBoost) favors splits for which the sums of the currently estimated response variances var(ỹ_ik) = p_k(x_i)(1 − p_k(x_i)) are more equal.
LK_TreeBoost has an implementation advantage in numerical stability. LogitBoost becomes numerically unstable whenever the value of (34) is close to zero for any observation x_i, which happens quite frequently. This is a consequence of the difficulty that Newton-Raphson has with vanishing second derivatives. Its performance is strongly affected by the way this problem is handled (see FHT00, page 352). LK_TreeBoost has such difficulties only when (34) is close to zero for all observations in a terminal node. This happens much less frequently and is easier to deal with when it does happen.
Influence trimming for the multiclass procedure is implemented in the same way as that for the two-class case outlined in Section 4.5.1. Associated with each "observation" (y_ik, x_i) is an influence w_ik = |ỹ_ik|(1 − |ỹ_ik|) which is used for deleting observations (27) when inducing the kth tree at the current iteration m.
model selection criterion jointly with respect to the values of the two parameters. There are also computational considerations; increasing the size of M produces a proportionate increase in computation.
We illustrate this ν-M trade-off through a simulation study. The training sample consists of 5000 observations {y_i, x_i} with

    y_i = F*(x_i) + ε_i.

The approximation inaccuracy is measured by

    (37) A(F_M(x)) = E_x|F*(x) − F_M(x)| / E_x|F*(x) − median_x F*(x)|.
[Figure: four panels — LS_TreeBoost, LAD_TreeBoost, and L2_TreeBoost (two panels); horizontal axes: Iterations, 0 to 1000.]
TABLE 1
Iteration number giving the best fit, and the best fit value, for several shrinkage parameter ν-values, with three boosting methods

ν        LS: M, A(F_M(x))    LAD: M, A(F_M(x))    L2: M, −2 log(like)    L2: M, error rate
1.0         15, 0.48             19, 0.57              20, 0.60              436, 0.111
0.5         43, 0.40             19, 0.44              80, 0.50              371, 0.106
0.25        77, 0.34             84, 0.38             310, 0.46              967, 0.099
0.125      146, 0.32            307, 0.35             570, 0.45              580, 0.098
0.06       326, 0.32            509, 0.35            1000, 0.44              994, 0.094
0.03       855, 0.32            937, 0.35            1000, 0.45              979, 0.097
The ν-M trade-off is clearly evident; smaller values of ν give rise to larger optimal M-values. They also provide higher accuracy, with a diminishing return for ν < 0.125. The misclassification error rate is very flat for M > 200, so that optimal M-values for it are unstable.
Although illustrated here for just one target function and base learner (11-terminal node tree), the qualitative nature of these results is fairly universal. Other target functions and tree sizes (not shown) give rise to the same behavior. This suggests that the best value for ν depends on the number of iterations M. The latter should be made as large as is computationally convenient or feasible. The value of ν should then be adjusted so that LOF achieves its minimum close to the value chosen for M. If LOF is still decreasing at the last iteration, the value of ν or the number of iterations M should be increased, preferably the latter. Given the sequential nature of the algorithm, it can easily be restarted where it finished previously, so that no computation need be repeated. LOF as a function of iteration number is most conveniently estimated using a left-out test sample.
As illustrated here, decreasing the learning rate clearly improves performance, usually dramatically. The reason for this is less clear. Shrinking the model update (36) at each iteration produces a more complex effect than direct proportional shrinkage of the entire model.
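The effect described here is easy to reproduce: the only change to least-squares boosting is scaling each update by ν as in (36), while tracking lack-of-fit on a left-out sample. A sketch (function names are ours, stumps as base learners):

```python
import numpy as np

def best_split(x, r):
    """Least-squares stump split on the current residuals."""
    best_sse, best_s = np.inf, None
    for s in np.unique(x)[:-1]:
        m = x <= s
        sse = r[m].var() * m.sum() + r[~m].var() * (~m).sum()
        if sse < best_sse:
            best_sse, best_s = sse, s
    return best_s

def ls_boost_shrink(x, y, xtest, ytest, nu=0.1, M=200):
    """LS boosting with the shrunken update (36):
    F_m(x) = F_{m-1}(x) + nu * h_m(x).
    Returns held-out mean squared lack-of-fit after each iteration."""
    F = np.full(len(y), y.mean())
    Ft = np.full(len(ytest), y.mean())
    lof = []
    for _ in range(M):
        r = y - F
        s = best_split(x, r)
        left = x <= s
        bl, br = r[left].mean(), r[~left].mean()
        F = F + nu * np.where(left, bl, br)            # shrunken update (36)
        Ft = Ft + nu * np.where(xtest <= s, bl, br)
        lof.append(np.mean((ytest - Ft) ** 2))
    return np.array(lof)
```

Comparing the returned curves for ν = 1.0 and ν = 0.1 on noisy data reproduces the trade-off: the smaller ν reaches its held-out minimum at a much larger M.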
    z_l = {x_{P_l(j)}}_{j=1}^{n_l},

where each of the mean vectors {μ_l}_1^20 is randomly generated from the same distribution as that of the input variables x. Each n_l × n_l covariance matrix V_l is specified by

    V_l = U_l D_l U_l^T,
    y_i = F*(x_i) + ε_i,

giving a 1/1 signal-to-noise ratio. For the second study the errors were generated from a "slash" distribution, ε_i = s · (u/v), where u ~ N(0, 1) and v ~ U[0, 1]. The scale factor s is adjusted to give a 1/1 signal-to-noise ratio (41). The slash distribution has very thick tails and is often used as an extreme to test robustness. The training sample size was taken to be N = 7500, with 5000 used for training, and 2500 left out as a test sample to estimate the optimal number of components M. For each of the 100 trials an additional validation sample of 5000 observations was generated (without error) to evaluate the approximation inaccuracy (37) for that trial.
The left panels of Figure 2 show boxplots of the distribution of approximation inaccuracy (37) over the 100 targets for the two error distributions for each of the three methods. The shaded area of each boxplot shows the interquartile range of the distribution with the enclosed white bar being the median.
FIG. 2. Distribution of absolute approximation error (left panels) and error relative to the best (right panels) for LS_TreeBoost, LAD_TreeBoost and M_TreeBoost for normal and slash error distributions. LS_TreeBoost performs best with the normal error distribution. LAD_TreeBoost and M_TreeBoost both perform well with slash errors. M_TreeBoost is very close to the best for both error distributions. Note the use of logarithmic scale in the lower right panel.
The outer hinges represent the points closest to (plus/minus) 1.5 interquartile range units from the (upper/lower) quartiles. The isolated bars represent individual points outside this range (outliers).
These plots allow the comparison of the overall distributions, but give no information concerning relative performance for individual target functions. The right two panels of Figure 2 attempt to provide such a summary. They show distributions of error ratios, rather than the errors themselves. For each target function and method, the error for the method on that target is divided by the smallest error obtained on that target, over all of the methods (here three) being compared. Thus, for each of the 100 trials, the best method receives a value of 1.0 and the others receive a larger value. If a particular method was best (smallest error) for all 100 target functions, its resulting distribution (boxplot) would be a point mass at the value 1.0. Note that the logarithm of this ratio is plotted in the lower right panel.
From the left panels of Figure 2 one sees that the 100 targets represent a fairly wide spectrum of difficulty for all three methods; approximation errors vary by over a factor of two. For normally distributed errors LS_TreeBoost is the superior performer, as might be expected. It had the smallest error in 73 of the trials, with M_TreeBoost best the other 27 times. On average LS_TreeBoost was 0.2% worse than the best, M_TreeBoost 0.9% worse, and LAD_TreeBoost was 7.4% worse than the best.
With slash-distributed errors, things are reversed. On average the approximation error for LS_TreeBoost was 0.95, thereby explaining only 5% of the target variation. On individual trials however, it could be much better or much worse. The performance of both LAD_TreeBoost and M_TreeBoost was much better and comparable to each other. LAD_TreeBoost was best 32 times and M_TreeBoost 68 times. On average LAD_TreeBoost was 4.1% worse than the best, M_TreeBoost 1.0% worse, and LS_TreeBoost was 364.6% worse than the best, over the 100 targets.
The results suggest that of these three, M_TreeBoost is the method of choice. In both the extreme cases of very well-behaved (normal) and very badly behaved (slash) errors, its performance was very close to that of the best. By comparison, LAD_TreeBoost suffered somewhat with normal errors, and LS_TreeBoost was disastrous with slash errors.
[Figure: four panels — Abs-error (left, right) and Rms-error (left, right).]
Bayes error rate is zero for all targets, but the induced decision boundaries can become quite complicated, depending on the nature of each individual target function F*(x). Regression trees with 11 terminal nodes were used for each method.
Figure 4 shows the distribution of error rate (left panel), and its ratio to the smallest (right panel), over the 100 target functions, for each of the three methods. The error rate of all three methods is seen to vary substantially over these targets. LK_TreeBoost is seen to be the generally superior performer. It had the smallest error for 78 of the trials and on average its error rate was 0.6% higher than the best for each trial. LogitBoost was best on 21 of the targets and there was one tie. Its error rate was 3.5% higher than the best on average. AdaBoost.MH was never the best performer, and on average it was 15% worse than the best.
Figure 5 shows a corresponding comparison, with the LogitBoost and AdaBoost.MH procedures modified to incorporate incremental shrinkage (36), with the shrinkage parameter set to the same (default) value ν = 0.1 used with LK_TreeBoost. Here one sees a somewhat different picture. Both LogitBoost and AdaBoost.MH benefit substantially from shrinkage. The performance of all three procedures is now nearly the same, with LogitBoost perhaps having a slight advantage. On average its error rate was 0.5% worse than the best; the corresponding values for LK_TreeBoost and AdaBoost.MH were 2.3% and 3.9%, respectively. These results suggest that the relative performance of these methods is more dependent on their aggressiveness, as parameterized by learning rate, than on their structural differences. LogitBoost has an addi-
FIG. 6. Distribution of absolute approximation error (left panel) and error relative to the best (right panel) for LS_TreeBoost with different sized trees, as measured by number of terminal nodes J. The distribution using the smallest trees J ∈ {2, 3} is wider, indicating more frequent better and worse performance than with the larger trees, all of which have similar performance.
tions. Although they can be used for interpreting single decision trees, they tend to be more effective in the context of boosting (especially small) trees. These interpretative tools are illustrated on real data examples in Section 9.
8.1. Relative importance of input variables. Among the most useful descriptions of an approximation F(x) are the relative influences I_j of the individual inputs x_j on the variation of F(x) over the joint input variable distribution. One such measure is

    (43) I_j = (E_x[∂F(x)/∂x_j]² · var_x[x_j])^{1/2}.

For the tree-based approximations considered here, the single-tree influence measure (44) is averaged over the trees {T_m}_1^M,

    (45) Î_j² = (1/M) Σ_{m=1}^M Î_j²(T_m),

in the sequence.
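With stumps as trees, the single-tree measure (44) reduces to the squared-error improvement of the one split, attributed to its split variable; averaging over the boosted sequence gives (45). A sketch (function names are ours; the scaling of the largest influence to 100 follows the convention used below):

```python
import numpy as np

def stump_importance(x_cols, ytil):
    """For a 2-node tree, (44) reduces to the squared-error improvement of
    its single split. Returns (improvement, split variable j, split s)."""
    best = (0.0, None, None)
    base = ytil.var() * len(ytil)
    for j, xj in enumerate(x_cols):
        for s in np.unique(xj)[:-1]:
            m = xj <= s
            sse = ytil[m].var() * m.sum() + ytil[~m].var() * (~m).sum()
            if base - sse > best[0]:
                best = (base - sse, j, s)
    return best

def boosted_importance(x_cols, y, M=80, nu=0.1):
    """(45): average the per-tree squared importances over the boosting
    sequence, using shrunken least-squares stumps."""
    I2 = np.zeros(len(x_cols))
    F = np.full(len(y), y.mean())
    for _ in range(M):
        r = y - F
        imp, j, s = stump_importance(x_cols, r)
        I2[j] += imp / M
        left = x_cols[j] <= s
        F = F + nu * np.where(left, r[left].mean(), r[~left].mean())
    return 100.0 * np.sqrt(I2 / I2.max())   # most influential scaled to 100
```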
The motivation for (44), (45) is based purely on heuristic arguments. As a partial justification we show that it produces expected results when applied in the simplest context. Consider a linear target function

    (46) F*(x) = a_0 + Σ_{j=1}^n a_j x_j,

for which the relative influence (43) of x_j is

    (47) I_j = |a_j|,

and take the coefficients to be

    (48) a_j = (−1)^j j
and a signal-to-noise ratio of 1/1 (41). Shown are the mean and standard deviation of the values of (44), (45) over ten random samples, all with F*(x) given by (46), (48). The influence of the estimated most influential variable x_j* is arbitrarily assigned the value I_j* = 100, and the estimated values of the others scaled accordingly. The estimated importance ranking of the input variables was correct on every one of the ten trials. As can be seen in Table 2, the estimated relative influence values are consistent with those given by (47) and (48).
In Breiman, Friedman, Olshen and Stone (1983), the influence measure (44) is augmented by a strategy involving surrogate splits intended to uncover the masking of influential variables by others highly associated with them. This strategy is most helpful with single decision trees where the opportunity for variables to participate in splitting is limited by the size J of the tree in (44). In the context of boosting, however, the number of splitting opportunities is vastly increased (45), and surrogate unmasking is correspondingly less essential.
In K-class logistic regression and classification (Section 4.6) there are K (logistic) regression functions {F_kM(x)}_{k=1}^K, each described by a sequence of M trees. In this case (45) generalizes to

    (49) Î_jk² = (1/M) Σ_{m=1}^M Î_j²(T_km),

where T_km is the tree induced for the kth class at iteration m. The quantity I_jk can be interpreted as the relevance of predictor variable x_j in separating class k from the other classes. The overall relevance of x_j can be obtained by
TABLE 2
Estimated mean and standard deviation of input variable relative influence for a linear target function

 j    mean    standard deviation
10   100.0    0.0
 9    90.3    4.3
 8    80.0    4.1
 7    69.8    3.9
 6    62.1    2.3
 5    51.7    2.0
 4    40.3    4.2
 3    31.3    2.9
 2    22.2    2.8
 1    13.0    3.2
averaging over all of the classes:

    I_j = (1/K) Σ_{k=1}^K I_jk.

Here z_l denotes a chosen target subset of the input variables and z_{\l} its complement, so that z_l ∪ z_{\l} = x.
However, averaging over the conditional density in (56), rather than the marginal density in (51), causes F̃_l(z_l) to reflect not only the dependence of F(x) on the selected variable subset z_l, but in addition, apparent dependencies induced solely by the associations between them and the complement variables z_{\l}. For example, if the contribution of z_l happens to be additive (54) or multiplicative (55), F̃_l(z_l) (56) would not evaluate to the corresponding term or factor F_l(z_l), unless the joint density p(x) happened to be the product

    (57) p(x) = p_l(z_l) · p_{\l}(z_{\l}).
Partial dependence functions (51) can be used to help interpret models produced by any "black box" prediction method, such as neural networks, support vector machines, nearest neighbors, radial basis functions, etc. When there are a large number of predictor variables, it is very useful to have a measure of relevance (Section 8.1) to reduce the potentially large number of variables and variable combinations to be considered. Also, a pass over the data (53) is required to evaluate each F̄_l(z_l) for each set of joint values z_l of its argument. This can be time-consuming for large data sets, although subsampling could help somewhat.
For regression trees based on single-variable splits, however, the partial dependence of F(x) on a specified target variable subset z_l (51) is straightforward to evaluate given only the tree, without reference to the data itself (53). For a specific set of values for the variables z_l, a weighted traversal of the tree is performed. At the root of the tree, a weight value of 1 is assigned. For each nonterminal node visited, if its split variable is in the target subset z_l, the appropriate left or right daughter node is visited and the weight is not modified. If the node's split variable is a member of the complement subset z_{\l}, then both daughters are visited and the current weight is multiplied by the fraction of training observations that went left or right, respectively, at that node.
Each terminal node visited during the traversal is assigned the current value of the weight. When the tree traversal is complete, the value of F̄_l(z_l) is the corresponding weighted average of the F(x) values over those terminal nodes visited during the tree traversal. For a collection of M regression trees, obtained through boosting, the results for the individual trees are simply averaged.
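The weighted traversal just described can be sketched directly. The tree below is a hypothetical 3-leaf example (all names and the `left_frac` storage convention are ours); each internal node stores the fraction of training observations that went left at that node:

```python
# Internal nodes carry (var, split, left_frac); leaves carry the constant.
tree = {"var": 0, "split": 0.5, "left_frac": 0.6,
        "left":  {"var": 1, "split": 0.0, "left_frac": 0.5,
                  "left": {"leaf": 1.0}, "right": {"leaf": 3.0}},
        "right": {"leaf": -2.0}}

def partial_dependence(node, target_vars, z, weight=1.0):
    """Weighted traversal: splits on target variables follow the supplied
    values z; splits on complement variables descend both ways, weighted
    by the observed left/right training fractions."""
    if "leaf" in node:
        return weight * node["leaf"]
    if node["var"] in target_vars:
        branch = "left" if z[node["var"]] <= node["split"] else "right"
        return partial_dependence(node[branch], target_vars, z, weight)
    f = node["left_frac"]
    return (partial_dependence(node["left"], target_vars, z, weight * f)
            + partial_dependence(node["right"], target_vars, z, weight * (1 - f)))
```

For a boosted collection of M trees, the same routine would be applied to each tree and the results averaged.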
For purposes of interpretation through graphical displays, input variable subsets of low cardinality (l ≤ 2) are most useful. The most informative of such subsets would likely be comprised of the input variables deemed to be among the most influential (44), (45) in contributing to the variation of F(x). Illustrations are provided in Sections 8.3 and 9.
The closer the dependence of F(x) on the subset z_l is to being additive (54) or multiplicative (55), the more completely the partial dependence function F̄_l(z_l) (51) captures the nature of the influence of the variables in z_l on the derived approximation F(x). Therefore, subsets z_l that group together those influential inputs that have complex [nonfactorable (55)] interactions between them will provide the most revealing partial dependence plots. As a diagnostic, both F̄_l(z_l) and F̄_{\l}(z_{\l}) can be separately computed for candidate subsets. The value of the multiple correlation over the training data between F(x) and {F̄_l(z_l), F̄_{\l}(z_{\l})} and/or F̄_l(z_l) · F̄_{\l}(z_{\l}) can be used to gauge the degree of additivity and/or factorability of F(x) with respect to a chosen subset z_l. As an additional diagnostic, F_{z_{\l}}(z_l) (50) can be computed for a small number of z_{\l}-values randomly selected from the training data. The resulting functions of z_l can be compared to F̄_l(z_l) to judge the variability of the partial dependence of F(x) on z_l, with respect to changing values of z_{\l}.
In K-class logistic regression and classification (Section 4.6) there are K (logistic) regression functions {F_k(x)}_{k=1}^K. Each is logarithmically related to p_k(x) = Pr(y = k | x) through (29). Larger values of F_k(x) imply higher probability of observing class k at x.
[Figure 7 bar chart; abscissa (input variable) in decreasing order of relative importance: 7, 2, 8, 9, 3, 5, 4, 1, 10, 6.]
FIG. 7. Relative importance of the input predictor variables for the first randomly generated function used in the Monte Carlo studies.
[Figure panels for var 7, var 2, var 5, var 0, var 3, var 1; abscissae span −2 to 2.]
FIG. 9. Two-variable partial dependence plots on a few of the important predictor variables for the first randomly generated function used in the simulation studies.
9.1. Garnet data. This data set consists of a sample of N = 13317 garnets collected from around the world [Griffin, Fisher, Friedman, Ryan and O'Reilly (1999)]. A garnet is a complex Ca-Mg-Fe-Cr silicate that commonly occurs as a minor phase in rocks making up the earth's mantle. The variables associated with each garnet are the concentrations of various chemicals and the tectonic plate setting where the rock was collected:

(TiO2, Cr2O3, FeO, MnO, MgO, CaO, Zn, Ga, Sr, Y, Zr, tec).
TABLE 3
Average absolute error of LS_TreeBoost, LAD_TreeBoost, and M_TreeBoost on the garnet data for varying numbers of terminal nodes in the individual trees

    (58) A(y, F(x)) = E|y − F(x)| / E|y − median(y)|,
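Measure (58) is a one-liner; values below 1.0 indicate improvement over the best constant (median) predictor (function name is ours):

```python
import numpy as np

def abs_error_ratio(y, F):
    """(58): mean absolute prediction error, scaled by the error of the
    best constant (median) predictor."""
    return np.mean(np.abs(y - F)) / np.mean(np.abs(y - np.median(y)))
```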
based on the test sample, for LS_TreeBoost, LAD_TreeBoost, and M_TreeBoost for several values of the size (number of terminal nodes) J of the constituent trees. Note that this prediction error measure (58) includes the additive irreducible error associated with the (unknown) underlying target function F*(x) (1). This irreducible error adds the same amount to all entries in Table 3. Thus, differences in those entries reflect a proportionally greater improvement in approximation error (37) on the target function itself.
For all three methods the additive (J = 2) approximation is distinctly inferior to that using larger trees, indicating the presence of interaction effects (Section 7) among the input variables. Six terminal node trees are seen to be adequate, and using only three terminal node trees is seen to provide accuracy within 10% of the best. The errors of LAD_TreeBoost and M_TreeBoost are smaller than those of LS_TreeBoost and similar to each other, with perhaps M_TreeBoost having a slight edge. These results are consistent with those obtained in the simulation studies as shown in Figures 2 and 6.
Figure 10 shows the relative importance (44), (45) of the 11 input variables in predicting TiO2 concentration based on the M_TreeBoost approximation using six terminal node trees. Results are very similar for the other models in Table 3 with similar errors. Ga and Zr are seen to be the most influential with MnO being somewhat less important. The top three panels of Figure 11 show the partial dependence (51) of the approximation F(x) on these three most influential variables. The bottom three panels show the partial dependence of F(x) on the three pairings of these variables. A strong interaction effect between Ga and Zr is clearly evident. F(x) has very little dependence on either variable when the other takes on its smallest values. As the value of one of them is increased, the dependence of F(x) on the other is correspondingly amplified. A somewhat smaller interaction effect is seen between MnO and Zr.
FIG. 10. Relative influence of the eleven input variables on the target variation for the garnet data. Ga and Zr are much more influential than the others.
[Figure 11 panels over MnO, Ga, and Zr.]
FIG. 11. Partial dependence plots for the three most influential input variables in the garnet data. Note the different vertical scales for each plot. There is a strong interaction effect between Zr and Ga, and a somewhat weaker one between Zr and MnO.
TABLE 4
Variables for the demographic data

no.  variable                    values  type
 1   sex                           2     cat
 2   marital status                5     cat
 3   age                           7     real
 4   education                     6     real
 5   occupation                    9     cat
 6   income                        9     real
 7   years in Bay Area             5     real
 8   dual incomes                  2     cat
 9   number in household           9     real
10   number in household < 18      9     real
11   householder status            3     cat
12   type of home                  5     cat
13   ethnic classification         8     cat
14   language in home              3     cat
TABLE 5
Average absolute error of LS_TreeBoost, LAD_TreeBoost, and M_TreeBoost on the demographic data for varying numbers of terminal nodes in the individual trees
[Figure 12 bar chart; abscissa: occ, hsld, mar, age, edu, hme, eth, lan, dinc, num, sex, <18, yBA.]
FIG. 12. Relative influence of the 13 input variables on the target variation for the demographic data. No small group of variables dominates.
[Figure 13 panels; category labels include Unemployed, Retired, Military, Single, Widowed, Live with family, Other, Mobile home, Apartment, Condo.]
FIG. 13. Partial dependence plots for the six most influential input variables in the demographic data. Note the different vertical scales for each plot. The abscissa values for age and education are codes representing consecutive equal intervals. The dependence of income on age is nonmonotonic, reaching a maximum at the value 5, representing the interval 45-54 years old.
REFERENCES
BECKER, R. A. and CLEVELAND, W. S. (1996). The design and control of Trellis display. J. Comput. Statist. Graphics 5 123-155.
BREIMAN, L. (1997). Pasting bites together for prediction in large data sets and on-line. Technical report, Dept. Statistics, Univ. California, Berkeley.
BREIMAN, L. (1999). Prediction games and arcing algorithms. Neural Comp. 11 1493-1517.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. and STONE, C. (1983). Classification and Regression Trees. Wadsworth, Belmont, CA.
COPAS, J. B. (1983). Regression, prediction, and shrinkage (with discussion). J. Roy. Statist. Soc. Ser. B 45 311-354.
DONOHO, D. L. (1993). Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data. In Different Perspectives on Wavelets. Proceedings of Symposia in Applied Mathematics (I. Daubechies, ed.) 47 173-205. Amer. Math. Soc., Providence, RI.
DRUCKER, H. (1997). Improving regressors using boosting techniques. In Proceedings of the Fourteenth International Conference on Machine Learning (D. Fisher, Jr., ed.) 107-115. Morgan Kaufmann, San Francisco.
DUFFY, N. and HELMBOLD, D. (1999). A geometric approach to leveraging weak learners. In Computational Learning Theory. Proceedings of the 4th European Conference EuroCOLT99 (P. Fischer and H. U. Simon, eds.) 18-33. Springer, New York.
FREUND, Y. and SCHAPIRE, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148-156. Morgan Kaufmann, San Francisco.
FRIEDMAN, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1-141.
FRIEDMAN, J. H., HASTIE, T. and TIBSHIRANI, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion). Ann. Statist. 28 337-407.
GRIFFIN, W. L., FISHER, N. I., FRIEDMAN, J. H., RYAN, C. G. and O'REILLY, S. (1999). Cr-Pyrope garnets in lithospheric mantle. J. Petrology 40 679-704.
HASTIE, T. and TIBSHIRANI, R. (1990). Generalized Additive Models. Chapman and Hall, London.
HUBER, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.
MALLAT, S. and ZHANG, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing 41 3397-3415.
POWELL, M. J. D. (1987). Radial basis functions for multivariate interpolation: a review. In Algorithms for Approximation (J. C. Mason and M. G. Cox, eds.) 143-167. Clarendon Press, Oxford.
RÄTSCH, G., ONODA, T. and MÜLLER, K. R. (1998). Soft margins for AdaBoost. NeuroCOLT Technical Report NC-TR-98-021.
DEPARTMENT OF STATISTICS
SEQUOIA HALL
STANFORD UNIVERSITY
STANFORD, CALIFORNIA 94305
E-MAIL: [email protected]