
Learning Nonlinear Functions Using Regularized Greedy Forest

Rie Johnson
RJ Research Consulting
Tong Zhang
Rutgers University
September 25, 2012
Abstract
We consider the problem of learning a forest of nonlinear decision rules with general loss functions.
The standard methods employ boosted decision trees such as AdaBoost for exponential loss and Friedman's gradient boosting for general loss. In contrast to these traditional boosting algorithms that treat a
tree learner as a black box, the method we propose directly learns decision forests via fully-corrective
regularized greedy search using the underlying forest structure. Our method achieves higher accuracy
and smaller models than gradient boosting on many of the datasets we have tested on.
1 Introduction
Many application problems in machine learning require learning nonlinear functions from data. A popular method to solve this problem is decision tree learning (such as CART [4] and C4.5 [23]), which has an important advantage: it handles heterogeneous data with ease when different features come from different sources. This makes decision trees a popular off-the-shelf machine learning method that can be readily applied to any data without much tuning; in comparison, alternative algorithms such as neural networks require significantly more tuning. However, a disadvantage of decision tree learning is that it does not generally achieve the most accurate prediction performance when compared to other methods. A remedy for this problem is boosting [12, 15, 25], where one builds an additive model of decision trees by sequentially adding trees one by one. In general, boosted decision trees are regarded as the most effective off-the-shelf nonlinear learning method for a wide range of application problems.
In the boosted tree approach, one considers an additive model over multiple decision trees, and thus we will refer to the resulting function as a decision forest. Other approaches to learning decision forests include bagging and random forests [5, 6]. In this context, we may view boosted decision tree algorithms as methods to learn decision forests by applying a greedy algorithm (boosting) on top of a decision tree base learner. This indirect approach is sometimes referred to as a wrapper approach (in this case, wrapping a boosting procedure around a decision tree base learner); the boosting wrapper simply treats the decision tree base learner as a black box and does not take advantage of the tree structure itself. The advantage of such a wrapper approach is that the underlying base learner can be changed to other procedures with the same wrapper; the disadvantage is that for any specific base learner, which may have additional structure to exploit, a generic wrapper might not be the optimal aggregator.
Due to the practical importance of boosted decision trees in applications, it is natural to ask whether one can design a more direct procedure that specifically learns decision forests without using a black-box decision tree learner under the wrapper. The purpose of doing so is that, by directly taking advantage of the underlying tree structure, we should be able to design a more effective algorithm for learning the final
nonlinear decision forest. This paper attempts to address this issue: we propose a direct decision forest learning algorithm called Regularized Greedy Forest (RGF). We are specifically interested in an approach that can handle general loss functions (while, for example, AdaBoost is specific to a certain loss function), which leads to a wider range of applicability. An existing method with this property is the gradient boosting decision tree (GBDT) [15]. We show that RGF can deliver better results than GBDT on a number of datasets we have tested on.
2 Problem Setup
We consider the problem of learning a single nonlinear function $h(x)$ on some input vector $x = [x[1], \ldots, x[d]] \in \mathbb{R}^d$ from a set of training examples. In supervised learning, we are given a set of input vectors $X = [x_1, \ldots, x_n]$ with labels $Y = [y_1, \ldots, y_m]$ (here $m$ may not equal $n$). Our training goal is to find a nonlinear prediction function $h(x)$ from a function class $\mathcal{H}$ that minimizes a risk function
$$\hat{h} = \arg\min_{h \in \mathcal{H}} L(h(X), Y). \qquad (1)$$
Here $\mathcal{H}$ is a pre-defined nonlinear function class, $h(X) = [h(x_1), \ldots, h(x_n)]$ is a vector of size $n$, and $L(h, \cdot)$ is a general loss function of the vector $h \in \mathbb{R}^n$.
The loss function $L(\cdot, \cdot)$ is given by the underlying problem. For example, for regression problems, we have $y_i \in \mathbb{R}$ and $m = n$. If we are interested in the conditional mean of $y$ given $x$, then the underlying loss function corresponds to least squares regression as follows:
$$L(h(X), Y) = \sum_{i=1}^{n} (h(x_i) - y_i)^2.$$
In binary classification, we assume that $y_i \in \{\pm 1\}$ and $m = n$. We may consider the logistic regression loss function as follows:
$$L(h(X), Y) = \sum_{i=1}^{n} \ln\left(1 + e^{-h(x_i) y_i}\right).$$
Another important problem that has drawn much attention in recent years is pair-wise preference learning (for example, see [19, 13]), where the goal is to learn a nonlinear function $h(x)$ so that $h(x) > h(x')$ when $x$ is preferred over $x'$. In this case, $m = n(n-1)$, and the labels encode pair-wise preference as $y_{(i,i')} = 1$ when $x_i$ is preferred over $x_{i'}$, and $y_{(i,i')} = 0$ otherwise. For this problem, we may consider the following loss function, which suffers a loss when $h(x) \le h(x') + 1$; that is, the formulation encourages the separation of $h(x)$ and $h(x')$ by a margin when $x$ is preferred over $x'$:
$$L(h(X), Y) = \sum_{(i,i'):\, y_{(i,i')} = 1} \max\left(0,\ 1 - (h(x_i) - h(x_{i'}))\right)^2.$$
Given data $(X, Y)$ and a general loss function $L(\cdot, \cdot)$ in (1), there are two basic questions to address for nonlinear learning. The first is the form of the nonlinear function class $\mathcal{H}$, and the second is the learning/optimization algorithm. This paper achieves nonlinearity by using additive models of the form:
$$\mathcal{H} = \left\{ h(\cdot) : h(x) = \sum_{j=1}^{K} \alpha_j g_j(x);\ \forall j,\ g_j \in \mathcal{C} \right\}, \qquad (2)$$
where each $\alpha_j \in \mathbb{R}$ is a coefficient that can be optimized, and each $g_j(x)$ is by itself a nonlinear function (which we may refer to as a nonlinear basis function or an atom) taken from a base function class $\mathcal{C}$. The base function class typically has a simple form that can be used in the underlying algorithm. This work considers decision rules as the underlying base function class, of the form
$$\mathcal{C} = \left\{ g(\cdot) : g(x) = \prod_{j} I(x[i_j] \le t_j) \prod_{k} I(x[i_k] > t_k) \right\}, \qquad (3)$$
where $\{(i_j, t_j), (i_k, t_k)\}$ are a set of (feature-index, threshold) pairs, and $I(\cdot)$ denotes the indicator function: $I(p) = 1$ if $p$ is true; $0$ otherwise.
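To make the base function class concrete, here is a minimal sketch (our own illustration, not from the paper's implementation; Python is used for all code sketches in this document) that evaluates a decision rule of the form (3):

```python
import numpy as np

def decision_rule(x, le_conds, gt_conds):
    """Evaluate a decision rule of the form (3): a product of indicator
    functions over (feature-index, threshold) pairs.

    le_conds: list of (i, t) pairs encoding I(x[i] <= t)
    gt_conds: list of (i, t) pairs encoding I(x[i] > t)
    Returns 1 if all conditions hold, 0 otherwise."""
    ok = all(x[i] <= t for i, t in le_conds) and \
         all(x[i] > t for i, t in gt_conds)
    return 1 if ok else 0

# Example: g(x) = I(x[0] <= 0.5) * I(x[2] > 1.0)
x = np.array([0.3, 9.9, 1.4])
print(decision_rule(x, le_conds=[(0, 0.5)], gt_conds=[(2, 1.0)]))  # 1
```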
Since the space of decision rules is rather large, for computational purposes we have to employ a structured search over the set of decision rules. The optimization procedure we propose is a structured greedy search algorithm which we call regularized greedy forest (RGF). To introduce RGF, we first discuss the pros and cons of the existing method for general loss, gradient boosting [15], in the next section.
3 Gradient Boosted Decision Tree
Gradient boosting is a method to minimize (1) with the additive model (2), assuming that there exists a nonlinear base learner (or oracle) $\mathcal{A}$ that satisfies Assumption 1.

Assumption 1. A base learner for the nonlinear function class $\mathcal{C}$ is a regression optimization method that takes as input any pair $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]$ and $\tilde{Y} = [\tilde{y}_1, \ldots, \tilde{y}_n]$ and outputs a nonlinear function $\tilde{g} = \mathcal{A}(\tilde{X}, \tilde{Y})$ that approximately solves the regression problem:
$$\tilde{g} \approx \arg\min_{g \in \mathcal{C}} \min_{\beta \in \mathbb{R}} \sum_{j=1}^{n} \left(\beta g(\tilde{x}_j) - \tilde{y}_j\right)^2.$$
The gradient boosting method is a wrapper (boosting) algorithm that solves (1) with a base learner $\mathcal{A}$ defined above and the additive model defined in (2). The general algorithm is described in Algorithm 1. Of special interest for this paper and for general applications is the decision tree base learner, for which $\mathcal{C}$ is the class of $J$-leaf decision trees, with each node associated with a decision rule of the form (3). In order to take advantage of the fact that each element of $\mathcal{C}$ contains $J$ (rather than one) decision rules, Algorithm 1 can be modified by adding a partially-corrective update step that optimizes all $J$ coefficients associated with the $J$ decision rules returned by $\mathcal{A}$. This adaptation was suggested by Friedman. We shall refer to this modification as gradient boosted decision tree (GBDT); the details are listed in Algorithm 2.
Gradient boosting may be regarded as a functional generalization of the gradient descent method
$$h_k \leftarrow h_{k-1} - s_k \left.\frac{\partial L(h)}{\partial h}\right|_{h = h_{k-1}},$$
where the shrinkage parameter $s$ corresponds to the step size $s_k$ in gradient descent, and $\left.\partial L(h)/\partial h\right|_{h=h_{k-1}}$ is approximated using the regression tree output. The shrinkage parameter $s > 0$ is a tuning parameter that can affect performance, as noticed by Friedman. In fact, convergence of the algorithm generally requires choosing $s_k \to 0$, as indicated in the theoretical analysis of [28]; this is also natural when we consider that it is analogous to the step size in gradient descent. This is consistent with Friedman's own observation: he argued that in order to achieve good prediction performance (rather than computational efficiency), one should take as small a step size as possible (preferably an infinitesimal step size each time), and the resulting procedure is often referred to as $\epsilon$-boosting.
GBDT constructs a decision forest which is an additive model of $K$ decision trees. The method has been very successful for many application problems, and its main advantage is that the method can automatically find nonlinear interactions via decision tree learning (which can easily deal with heterogeneous data); moreover, it has relatively few tuning parameters for a nonlinear learning scheme (the main tuning parameters are the shrinkage parameter $s$, the number of terminals per tree $J$, and the number of trees $K$). However, it has a number of disadvantages as well. First, there is no explicit regularization in the algorithm; in fact, it is argued in [28] that the shrinkage parameter $s$ plus early stopping (that is, $K$) interact together as a form of regularization. In addition, the number of nodes $J$ can also be regarded as a form of regularization. The interaction of these parameters in terms of regularization is unclear, and the resulting implicit regularization may not be effective. The second issue is a consequence of using a small step size $s$ as implicit regularization. Use of small $s$ can lead to a huge number of trees, which is very undesirable as it leads to high computational cost in applications (i.e., making predictions). Note that in order to achieve good performance, it is often necessary to choose a small shrinkage parameter $s$ and hence large $K$; in the extreme scenario of $\epsilon$-boosting, one needs an infinite number of trees. Third, the regression tree learner is treated as a black box, and its only purpose is to return $J$ nonlinear terminal decision rule basis functions. This again may not be effective because the procedure separates tree learning and forest learning, and hence the algorithm itself is not necessarily the most effective method to construct the decision forest.

Algorithm 1: Generic Gradient Boosting [15]
  $h_0(x) \leftarrow \arg\min_{\rho} L(\rho, Y)$
  for $k = 1$ to $K$ do
    $\tilde{Y}_k \leftarrow -\left.\partial L(h, Y)/\partial h\right|_{h = h_{k-1}(X)}$
    $g_k \leftarrow \mathcal{A}(X, \tilde{Y}_k)$
    $\beta_k \leftarrow \arg\min_{\beta \in \mathbb{R}} L(h_{k-1}(X) + \beta g_k(X), Y)$
    $h_k(x) \leftarrow h_{k-1}(x) + s \beta_k g_k(x)$   // $s$ is a shrinkage parameter
  end
  return $h(x) = h_K(x)$

Algorithm 2: Gradient Boosted Decision Tree (GBDT) [15]
  $h_0(x) \leftarrow \arg\min_{\rho} L(\rho, Y)$
  for $k = 1$ to $K$ do
    $\tilde{Y}_k \leftarrow -\left.\partial L(h, Y)/\partial h\right|_{h = h_{k-1}(X)}$
    Build a $J$-leaf decision tree $T_k \leftarrow \mathcal{A}(X, \tilde{Y}_k)$ with leaf-nodes $\{g_{k,j}\}_{j=1}^{J}$
    for $j = 1$ to $J$ do
      $\beta_{k,j} \leftarrow \arg\min_{\beta \in \mathbb{R}} L(h_{k-1}(X) + \beta g_{k,j}(X), Y)$
    $h_k(x) \leftarrow h_{k-1}(x) + s \sum_{j=1}^{J} \beta_{k,j} g_{k,j}(x)$   // $s$ is a shrinkage parameter
  end
  return $h(x) = h_K(x)$
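For concreteness, here is a minimal sketch of Algorithm 2 specialized to square loss; we use scikit-learn's DecisionTreeRegressor as a stand-in for the base learner $\mathcal{A}$ (an assumption for illustration only; the experiments in Section 6 use the R package gbm):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_square_loss(X, y, K=100, J=10, s=0.1):
    """Minimal GBDT (Algorithm 2) for square loss L = sum_i (h(x_i)-y_i)^2/2.
    For square loss the negative gradient tilde{Y}_k is the residual
    y - h(X), and the optimal per-leaf coefficient beta_{k,j} is the mean
    residual in that leaf, which is exactly what a regression tree fit to
    the residuals stores in its leaves."""
    base = float(np.mean(y))                 # h_0 = arg min_rho L(rho, Y)
    h = np.full(len(y), base)
    trees = []
    for _ in range(K):
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, y - h)
        trees.append(tree)
        h += s * tree.predict(X)             # shrinkage by s
    return base, trees

def gbdt_predict(Xnew, base, trees, s=0.1):
    """h(x) = h_0 + s * sum_k sum_j beta_{k,j} g_{k,j}(x)."""
    return base + s * sum(t.predict(Xnew) for t in trees)
```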
4 Fully-Corrective Greedy Update and Structured Sparsity Regularization
As mentioned above, one disadvantage of gradient boosting is that in order to achieve good performance in practice, the shrinkage parameter $s$ often needs to be small, and Friedman himself argued for infinitesimal step size. This practical observation is supported by the theoretical analysis in [28], which showed that if we vary the shrinkage $s$ for each iteration $k$ as $s_k$, then for general loss functions with appropriate regularity conditions, the procedure converges as $k \to \infty$ if we choose the sequence $s_k$ such that $\sum_k s_k |\beta_k| = \infty$ and $\sum_k s_k^2 \beta_k^2 < \infty$. This condition is analogous to a related condition for the step size of the gradient descent method, which also requires the step size to approach zero. The fully-corrective greedy algorithm is a modification of gradient boosting that can avoid the potential small-step-size problem. The procedure is described in Algorithm 3.
Algorithm 3: Fully-Corrective Gradient Boosting [26]
  $h_0(x) \leftarrow \arg\min_{\rho} L(\rho, Y)$
  for $k = 1$ to $K$ do
    $\tilde{Y}_k \leftarrow -\left.\partial L(h, Y)/\partial h\right|_{h = h_{k-1}(X)}$
    $g_k \leftarrow \mathcal{A}(X, \tilde{Y}_k)$
    let $\mathcal{H}_k = \left\{ \sum_{j=1}^{k} \beta_j g_j(x) : \beta_j \in \mathbb{R} \right\}$
    $h_k(x) \leftarrow \arg\min_{h \in \mathcal{H}_k} L(h(X), Y)$   // fully-corrective step
  end
  return $h(x) = h_K(x)$
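For square loss, the fully-corrective step has a closed form: it is a least-squares refit of all coefficients collected so far. A minimal sketch (our own illustration) of that step:

```python
import numpy as np

def fully_corrective_step(G, y):
    """Fully-corrective step of Algorithm 3 for square loss.
    G is the n-by-k matrix whose columns are g_1(X), ..., g_k(X); the
    returned beta minimizes ||G beta - y||^2 over all of H_k, so that
    h_k(X) = G @ beta."""
    beta, *_ = np.linalg.lstsq(G, y, rcond=None)
    return beta
```

For a general loss, this step has no closed form and becomes a $k$-dimensional convex optimization, which can be performed with coordinate descent as RGF does in Section 5.3.2.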
In gradient boosting of Algorithm 1 (or its variation with a tree base learner in Algorithm 2), the algorithm only performs a partial corrective step that optimizes either the coefficient of the last basis function $g_k$ (or the last $J$ coefficients). The main difference of fully-corrective gradient boosting is the fully-corrective step that, at each iteration $k$, optimizes all coefficients $\{\beta_j\}_{j=1}^{k}$ for the basis functions $\{g_j\}_{j=1}^{k}$ obtained so far. It was noticed empirically that such a fully-corrective step can significantly accelerate the convergence of boosting procedures [27]. This observation was theoretically justified in [26], where the following rate of convergence was obtained under suitable conditions: there exists a constant $C_0$ such that
$$L(h_k(X), Y) \le \inf_{h \in \mathcal{H}} \left[ L(h(X), Y) + \frac{C_0 \|h\|_{\mathcal{C}}^2}{k} \right],$$
where $C_0$ is a constant that depends on properties of $L(\cdot, \cdot)$ and the function class $\mathcal{H}$, and
$$\|h\|_{\mathcal{C}} = \inf\left\{ \sum_j |\beta_j| : h(X) = \sum_j \beta_j g_j(X);\ g_j \in \mathcal{C} \right\}.$$
In comparison, with only partial corrective optimization as in the original gradient boosting, no such convergence rate is possible. Therefore the fully-corrective step is not only intuitively sensible but also important theoretically. The use of the fully-corrective update (combined with regularization) automatically removes the need for the undesirably small step size $s$ required in the traditional gradient boosting approach. However, such an aggressive greedy procedure will lead to quick overfitting of the data if not appropriately regularized (in gradient boosting, an implicit regularization effect is achieved by the small step size $s$, as argued in [28]). Therefore we are forced to impose explicit regularization to prevent overfitting.
This leads to the second idea in our approach, which is to impose explicit regularization via the concept of structured sparsity, which has drawn much attention in recent years [3, 22, 21, 1, 2, 20]. The general idea of structured sparsity is that in a situation where a sparse solution is assumed, one can take advantage of the sparsity structure underlying the task. In our setting, we seek a sparse combination of decision rules (i.e., a compact model), and we have the forest structure to explore, which can be viewed as a graph sparsity structure. Moreover, the problem can be considered a variable selection problem. Searching over all nonlinear interactions (atoms) in $\mathcal{C}$ is computationally difficult or infeasible; one has to impose a structured search over atoms. The idea of structured sparsity is that by exploiting the fact that not all sparsity patterns are equally likely, one can select appropriate variables (corresponding to decision rules in our setting) more effectively by preferring certain sparsity patterns over others. For our purpose, one may impose structured regularization and search to prefer one sparsity pattern over another, exploiting the underlying forest structure.
Algorithmically, one can explore the use of an underlying graph structure that connects different variables in order to search over sparsity patterns. The general concept of graph-structured sparsity was considered in [1, 2, 20]. A simple example is presented in Figure 1, where each node of the graph indicates a variable (a nonlinear decision rule), and each gray node denotes a selected variable. The graph structure is used to select variables that form a connected region; that is, we may grow the region by following the edges from the variables that have been selected already, and new variables are selected one by one. Figure 1 indicates the order of selection. This approach reduces both statistical complexity and algorithmic complexity. The algorithmic advantage is quite obvious; statistically, using the information-theoretic argument in [20], one can show that the generalization performance is characterized by $O\big(\sum_{j \in \{\text{selected nodes}\}} \log_2 \mathrm{degree}(\mathrm{parent}(j))\big)$, while without structure it would be $O(\#\{\text{selected nodes}\} \cdot \ln p)$, where $p$ is the total number of atoms in $\mathcal{C}$. Based on this general idea, [2] considered the problem of learning with nonlinear kernels induced by an underlying graph.
Figure 1: Graph Structured Sparsity (gray nodes denote selected variables, numbered in the order of selection, forming a connected region)
This work considers the special but important case of learning a forest of nonlinear decision rules; although this may be considered a special case of general structured sparsity learning with an underlying graph, the problem itself is rich and important enough to require a dedicated investigation. Specifically, we integrate this framework with specific tree-structured regularization and structured greedy search to obtain an effective algorithm that can outperform the popular and important gradient boosting method. In the context of nonlinear learning with graph-structured sparsity, we note that a variant of boosting was proposed in [14], where the idea is to split trees not only at the leaf nodes but also at the internal nodes at every step. However, that method is prone to overfitting due to the lack of regularization, and is computationally expensive due to the multiple splitting of internal nodes. We avoid such a strategy in this work.
5 Regularized Greedy Forest
The method we propose addresses the issues of the standard method GBDT described above by directly
learning a decision forest via fully-corrective regularized greedy search. The key ideas discussed in Section
4 can be summarized as follows.
First, we introduce an explicit regularization functional on the nonlinear function $h$ and optimize
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \left[ L(h(X), Y) + \mathcal{R}(h) \right] \qquad (4)$$
instead of (1). In particular, we define regularizers that explicitly take advantage of individual tree structures.

Second, we employ a fully-corrective greedy algorithm which repeatedly re-optimizes the coefficients of all the decision rules obtained so far while rules are added into the forest by greedy search. Although such an aggressive greedy procedure could lead to quick overfitting if not appropriately regularized, our formulation includes explicit regularization to avoid overfitting and the problem of huge models caused by small $s$.

Third, we perform structured greedy search directly over forest nodes based on the forest structure (graph sparsity structure), employing the concept of structured sparsity. At the conceptual level, our nonlinear function $h(x)$ is explicitly defined as an additive model on forest nodes (rather than trees), consistent with the underlying forest structure. In this framework, it is also possible to build a forest by growing multiple trees simultaneously.

Before going into more detail, we shall introduce some definitions and notation that allow us to formally define the underlying formulations and procedures.
5.1 Definitions and notation
A forest is an ensemble of multiple decision trees $T_1, \ldots, T_K$. The forest shown in Figure 2 contains three trees $T_1$, $T_2$, and $T_3$. Each tree edge $e$ is associated with a variable $k_e$ and a threshold $t_e$, and denotes a decision of the form $I(x[k_e] \le t_e)$ or $I(x[k_e] > t_e)$. Each node denotes a nonlinear decision rule of the form (3), which is the product of the decisions along the edges leading from the root to this node.

Figure 2: Decision Forest (three trees $T_1$, $T_2$, $T_3$)
Mathematically, each node $v$ of the forest is associated with a decision rule of the form
$$g_v(x) = \prod_{j} I(x[i_j] \le t_{i_j}) \prod_{k} I(x[i_k] > t_{i_k}),$$
which serves as a basis function or atom for the additive model considered in this paper. Note that if $v_1$ and $v_2$ are the two children of $v$, then $g_v(x) = g_{v_1}(x) + g_{v_2}(x)$. This means that any internal node is redundant in the sense that an additive model with basis functions $g_v(x)$, $g_{v_1}(x)$, $g_{v_2}(x)$ can be represented as an additive model over the basis functions $g_{v_1}(x)$ and $g_{v_2}(x)$ alone (by adding $v$'s coefficient to each child's coefficient). Therefore it can be shown that an additive model over all tree nodes always has an equivalent model (equivalent in terms of output) over leaf nodes only. This property is important for computational efficiency because it implies that we only have to consider additive models over leaf nodes.
Let $F$ represent a forest, where each node $v$ of $F$ is associated with $(g_v, \alpha_v, \theta_v)$. Here $g_v$ is the basis function that this node represents; $\alpha_v$ is the weight or coefficient assigned to this node; and $\theta_v$ represents other attributes of this node, such as depth. The additive model of the forest $F$ considered in this paper is $h_F(x) = \sum_{v \in F} \alpha_v g_v(x)$, with $\alpha_v = 0$ for any internal node $v$. We rewrite the regularized loss in (4) as follows, replacing the regularization term $\mathcal{R}(h_F)$ with $\mathcal{G}(F)$ to emphasize that the regularizer depends on the underlying forest structure:
$$Q(F) = L(h_F(X), Y) + \mathcal{G}(F). \qquad (5)$$
5.2 Algorithmic framework
The training objective of RGF is to build a forest that minimizes $Q(F)$ defined in (5). Since the exact optimum solution is difficult to find, we greedily select the basis functions and optimize the weights. At a high level, we may summarize RGF in the generic procedure of Algorithm 4. It essentially has two main components, as follows.
- Fix the weights, and change the structure of the forest (which changes basis functions) so that the loss $Q(F)$ is reduced the most (Lines 2-4).
- Fix the structure of the forest, and change the weights so that the loss $Q(F)$ is minimized (Line 5).

Algorithm 4: Regularized greedy forest framework
  1: $F \leftarrow \{\}$.
     repeat
  2:   $\hat{o} \leftarrow \arg\min_{o \in \mathcal{O}(F)} Q(o(F))$, where $\mathcal{O}(F)$ is the set of all structure-changing operations applicable to $F$.
  3:   if $Q(\hat{o}(F)) \ge Q(F)$ then break   // Leave the loop if $\hat{o}$ does not reduce the loss.
  4:   $F \leftarrow \hat{o}(F)$.   // Perform the optimum operation.
  5:   if some criterion is met then optimize the leaf weights in $F$ to minimize the loss $Q(F)$.
     until some exit criterion is met;
     Optimize the leaf weights in $F$ to minimize the loss $Q(F)$.
     return $h_F(x)$
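The control flow of Algorithm 4 can be sketched as follows (hypothetical callback interfaces of our own; the released RGF implementation is not in Python):

```python
def rgf_framework(F, candidate_ops, Q, apply_op, optimize_weights,
                  correction_due, finished):
    """Control-flow skeleton of Algorithm 4 (hypothetical interfaces):
      F                   -> the initial (empty) forest            (Line 1)
      candidate_ops(F)    -> the structure-changing operations O(F)
      Q(F)                -> regularized loss (5)
      apply_op(F, o)      -> forest obtained by applying o to F
      optimize_weights(F) -> F with leaf weights re-optimized      (Line 5)
      correction_due(F)   -> True when weight correction should run
      finished(F)         -> exit criterion"""
    while not finished(F):
        # Line 2: the operation with the smallest resulting loss
        o = min(candidate_ops(F), key=lambda op: Q(apply_op(F, op)))
        if Q(apply_op(F, o)) >= Q(F):   # Line 3: no loss reduction -> stop
            break
        F = apply_op(F, o)              # Line 4: perform the optimum operation
        if correction_due(F):           # Line 5: periodic weight correction
            F = optimize_weights(F)
    return optimize_weights(F)          # final weight optimization
```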
5.3 Specific implementation
There may be more than one way to instantiate useful algorithms based on Algorithm 4. Below, we describe what we found effective and efficient.

5.3.1 Search for the optimum structure change (Line 2)
For computational efficiency, we only allow the following two types of operations in the search strategy:
- to split an existing leaf node,
- to start a new tree (i.e., add a new stump to the forest).
The operations include assigning weights to the new leaf nodes and setting the weight of the node that was split to zero. Search is done with the weights of all the existing leaf nodes fixed, by repeatedly evaluating the maximum loss reduction over all the possible structure changes. When it is prohibitively expensive to search the entire forest (and that is often the case in practical applications), we limit the search to the most recently-created $t$ trees, with the default choice of $t = 1$. This is the strategy in our current implementation. For example, Figure 3 shows that at the same stage as Figure 2, we may either consider splitting one of the leaf nodes marked with the symbol X or grow a new tree $T_4$ (split $T_4$'s root).
Figure 3: Decision Forest Splitting Strategy (we may either split a leaf in $T_3$, marked with X, or start a new tree $T_4$)
Note that RGF does not require the tree size parameter needed in GBDT. With RGF, the size of each
tree is automatically determined as a result of minimizing the regularized loss.
Computation Consider the evaluation of the loss reduction obtained by splitting a node associated with $(g, \alpha, \theta)$ into two nodes associated with $(g_{u_1}, \alpha + \delta_1, \theta_{u_1})$ and $(g_{u_2}, \alpha + \delta_2, \theta_{u_2})$. The model associated with the new forest $\tilde{F} = o(F)$ after splitting the node can be written as:
$$h_{\tilde{F}}(x) = h_F(x) - \alpha g(x) + \sum_{k=1}^{2} (\alpha + \delta_k)\, g_{u_k}(x) = h_F(x) + \sum_{k=1}^{2} \delta_k\, g_{u_k}(x). \qquad (6)$$
Recall that our additive models are over leaf nodes only. The node that was split is no longer a leaf, and therefore $\alpha g(x)$ is removed from the model. The second equality follows from $g(x) = g_{u_1}(x) + g_{u_2}(x)$, due to the parent-child relationship. To emphasize that $\delta_1$ and $\delta_2$ are the only variables in $\tilde{F}$ for the current computation, let us write $\tilde{F}(\delta_1, \delta_2)$ for the new forest. Our immediate goal here is to find $\arg\min_{\delta_1, \delta_2} Q(\tilde{F}(\delta_1, \delta_2))$.
The actual computation depends on $Q(F)$. In general, there may not be an analytical solution to this optimization problem, whereas we need to find the solution in an inexpensive manner because this computation is repeated frequently. For fast computation, one may employ the gradient-descent approximation used in gradient boosting. However, the sub-problem we are looking at is simpler, and thus instead of the simpler gradient-descent approximation, we perform one Newton step, which is more accurate; namely, we obtain the approximately optimum $\hat{\delta}_k$ ($k = 1, 2$) as:
$$\hat{\delta}_k = -\frac{\left.\dfrac{\partial Q(\tilde{F}(\delta_1, \delta_2))}{\partial \delta_k}\right|_{\delta_1 = 0,\, \delta_2 = 0}}{\left.\dfrac{\partial^2 Q(\tilde{F}(\delta_1, \delta_2))}{\partial \delta_k^2}\right|_{\delta_1 = 0,\, \delta_2 = 0}}. \qquad (7)$$
In particular, suppose that the loss function is for either regression or classification tasks; then $Q(F)$ can be written in the following form:
$$Q(F) = \frac{1}{n}\sum_{i=1}^{n} \ell(h_F(x_i), y_i) + \mathcal{G}(F).$$
In this case, $\frac{\partial h_{\tilde{F}}(x)}{\partial \delta_k} = g_{u_k}(x)$, $\frac{\partial^2 h_{\tilde{F}}(x)}{\partial \delta_k^2} = 0$, and $h_F(x) = h_{\tilde{F}(0,0)}(x)$ by (6); thus $\hat{\delta}_k$ in (7) can be rewritten as:
$$\hat{\delta}_k = -\frac{\sum_{g_{u_k}(x_i)=1} \left.\dfrac{\partial \ell(h,y)}{\partial h}\right|_{h=h_F(x_i),\, y=y_i} + n \left.\dfrac{\partial \mathcal{G}(\tilde{F}(\delta_1,\delta_2))}{\partial \delta_k}\right|_{\delta_1=0,\,\delta_2=0}}{\sum_{g_{u_k}(x_i)=1} \left.\dfrac{\partial^2 \ell(h,y)}{\partial h^2}\right|_{h=h_F(x_i),\, y=y_i} + n \left.\dfrac{\partial^2 \mathcal{G}(\tilde{F}(\delta_1,\delta_2))}{\partial \delta_k^2}\right|_{\delta_1=0,\,\delta_2=0}}.$$
For example, with square loss $\ell(h, y) = (h - y)^2/2$ and the $L_2$ regularization penalty $\mathcal{G}(F) = \lambda \sum_{v \in F} \alpha_v^2/2$, we have
$$\hat{\delta}_k = \frac{\sum_{g_{u_k}(x_i)=1} \left(y_i - h_F(x_i)\right) - \lambda n \alpha}{\sum_{g_{u_k}(x_i)=1} 1 + \lambda n},$$
which is the exact optimum for the given split.
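The closed form above is easy to implement; a small sketch (our own illustration, with the weight $\alpha$ of the node being split passed in explicitly):

```python
import numpy as np

def split_weights_square_loss(residual, in_leaf1, in_leaf2, lam, alpha):
    """Exact additive weights (delta_1, delta_2) for the two new leaves
    under square loss l(h,y) = (h-y)^2/2 and G(F) = lam * sum_v alpha_v^2/2.

    residual: array of y_i - h_F(x_i) over all n training points
    in_leaf1, in_leaf2: boolean masks of the points in each new leaf
    alpha: the weight of the node being split"""
    n = len(residual)
    return tuple((residual[mask].sum() - lam * n * alpha)
                 / (mask.sum() + lam * n)
                 for mask in (in_leaf1, in_leaf2))
```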
5.3.2 Weight optimization/correction (Line 5)
With the basis functions fixed, the weights can be optimized using a standard procedure if the regularization penalty is standard (e.g., an $L_1$- or $L_2$-penalty). In our implementation we perform coordinate descent, which iteratively goes through the basis functions and in each iteration updates a weight by a Newton step with a small step size $\eta$:
$$\alpha_v \leftarrow \alpha_v + \eta \cdot \left( -\frac{\left.\dfrac{\partial Q(F(\delta_v))}{\partial \delta_v}\right|_{\delta_v=0}}{\left.\dfrac{\partial^2 Q(F(\delta_v))}{\partial \delta_v^2}\right|_{\delta_v=0}} \right), \qquad (8)$$
where $\delta_v$ is the additive change to $\alpha_v$. Computation of the Newton step is similar to that in Section 5.3.1.

Since the initial weights of the new leaf nodes set in Line 4 are approximately optimal at that moment, it is not necessary to perform weight correction in every iteration, which would be relatively expensive. We found that correcting the weights every time $k$ new leaf nodes are added works well. The interval between fully-corrective updates, $k$, was fixed to 100 in our experiments.
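A minimal sketch of this coordinate-descent correction for square loss with the $L_2$ penalty (our own illustration; eta is the small step size of (8)):

```python
import numpy as np

def correct_weights(G, y, alpha, lam, eta=0.5, sweeps=10):
    """Coordinate descent per (8) for square loss l(h,y) = (h-y)^2/2 and
    G(F) = lam * sum_v alpha_v^2 / 2. One sweep visits every basis function.
    G: n-by-m 0/1 matrix whose column v is g_v(X); alpha: leaf weights."""
    n = len(y)
    h = G @ alpha                       # current predictions h_F(X)
    for _ in range(sweeps):
        for v in range(G.shape[1]):
            mask = G[:, v] == 1
            grad = (h[mask] - y[mask]).sum() / n + lam * alpha[v]
            hess = mask.sum() / n + lam
            delta = -eta * grad / hess  # damped Newton step as in (8)
            alpha[v] += delta
            h[mask] += delta
    return alpha
```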
5.4 Tree-structured regularization
Explicit regularization is a crucial component of this framework. To simplify notation, we define regularizers over a single tree; the regularizer over a forest can be obtained by adding up the regularizers described here over all the trees. Therefore, suppose that we are given a tree $T$ with an additive model over leaf nodes:
$$h_T(x) = \sum_{v \in T} \alpha_v g_v(x), \qquad \alpha_v = 0 \ \text{for} \ v \notin L_T,$$
where $L_T$ denotes the set of leaf nodes in $T$.

Figure 4: Example of Equivalent Models (a leaf-only model with weights 1.6, 1.2, and -1.2 is equivalent to an all-node model with root weight 1.2, internal-node weight 0.1, and leaf weights 0.3, -0.1, and -2.4, since 1.2+0.1+0.3 = 1.6, 1.2+0.1-0.1 = 1.2, and 1.2-2.4 = -1.2)

To consider useful regularizers, first recall that for any additive model over leaf nodes only, there always exist equivalent models over all the nodes of the same tree that produce the same output. More precisely, let
$A(v)$ denote the set of ancestor nodes of $v$ and $v$ itself, and let $T(\beta)$ be a tree that has the same topological structure as $T$ but whose node weights $\{\alpha_v\}$ are replaced by $\{\beta_v\}$. Then we have
$$\forall u \in L_T:\ \sum_{v \in A(u)} \beta_v = \alpha_u \iff h_{T(\beta)}(x) \equiv h_T(x),$$
as illustrated in Figure 4. Our basic idea is that it is natural to give the same regularization penalty to all equivalent models defined on the same tree topology. One way to define a regularizer that satisfies this condition is to choose a model with some desirable properties as the unique representation for all the equivalent models and to define the regularization penalty based on this unique representation. This is the high-level strategy we take. That is, we consider the following form of regularization:
$$\mathcal{G}(T) = \sum_{v \in T} r(\beta_v, \theta_v)\ :\ h_{T(\beta)}(x) \equiv h_T(x).$$
Here node $v$ ranges over both internal and leaf nodes; the additive model $h_{T(\beta)}(x)$ serves as the unique representation of the set of equivalent models; and $r(\cdot, \cdot)$ is a penalty function of $v$'s weight and attributes. Each $\beta_v$ is a function of the given leaf weights $\{\alpha_u\}_{u \in L_T}$, though the function may not have a closed form. Since regularizers of this form utilize the entire tree including its topological structure, we call them tree-structured regularizers. Below, we describe three tree-structured regularizers using three distinct unique representations.
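To make the equivalence concrete, this small sketch (our own toy tree encoding) converts an all-node model $\{\beta_v\}$ into the equivalent leaf-only model by summing weights along root-to-leaf paths, reproducing the numbers of Figure 4:

```python
def leaf_only_weights(parent, beta):
    """Given node weights beta (dict node -> weight) and parent links
    (dict node -> parent, root maps to None), return the equivalent
    leaf-only weights: alpha_u = sum of beta_v over the ancestors A(u)."""
    children = {}
    for v, p in parent.items():
        children.setdefault(p, []).append(v)
    leaves = [v for v in parent if v not in children]
    def path_sum(u):
        return beta[u] + (path_sum(parent[u]) if parent[u] is not None else 0)
    return {u: path_sum(u) for u in leaves}

# Figure 4's all-node model: root 1.2; children 0.1 and -2.4 (leaf);
# grandchildren 0.3 and -0.1 under the 0.1 node.
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
beta = {"r": 1.2, "a": 0.1, "b": -2.4, "c": 0.3, "d": -0.1}
print(leaf_only_weights(parent, beta))
# {'b': -1.2, 'c': 1.6, 'd': 1.2} (up to float rounding)
```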
5.4.1 $L_2$ regularization on leaf-only models
The first regularizer we introduce simply chooses the given leaf-only model as the unique representation (namely, $\beta_v = \alpha_v$) and sets $r(\beta_v, \theta_v) = \lambda \beta_v^2/2$ based on standard $L_2$ regularization. This leads to
$$\mathcal{G}(T) = \sum_{v \in T} \lambda \beta_v^2/2 = \sum_{v \in L_T} \lambda \alpha_v^2/2,$$
where $\lambda$ is a constant controlling the strength of regularization. A desirable property of this unique representation is that among the equivalent models, the leaf-only model is often (but not always¹) the one with the smallest number of basis functions, i.e., the most sparse.

¹ For example, consider a leaf-only model on a stump whose two sibling leaf nodes have the same weight $\beta \neq 0$. Its equivalent model with the fewest basis functions (with nonzero coefficients) is the one whose weight $\beta$ is on the root and zero is on the two leaf nodes.
5.4.2 Minimum-penalty regularization
Another approach we consider is to choose the model that minimizes some penalty as the unique representative of all the equivalent models, as it is the most preferable model according to the defined penalty. We call this type of regularizer a min-penalty regularizer. In the following min-penalty regularizer, the complexity of a basis function is explicitly regularized via the node depth:
$$\mathcal{G}(T) = \min_{\{\beta_v\}} \left\{ \frac{\lambda}{2} \sum_{v \in T} \gamma^{d_v} \beta_v^2 \ :\ h_{T(\beta)}(x) \equiv h_T(x) \right\}. \qquad (9)$$
Here $d_v$ is the depth of node $v$, which is its distance from the root, and $\gamma$ is a constant. A larger $\gamma > 1$ penalizes deeper nodes more severely, as they are associated with more complex decision rules; we assume that $\gamma \ge 1$.
Computation To derive an algorithm for computing this regularizer, first we introduce auxiliary variables $\{\bar{\beta}_v\}_{v \in T}$, recursively defined as:
$$\bar{\beta}_{o_T} = \beta_{o_T}, \qquad \bar{\beta}_v = \beta_v + \bar{\beta}_{p(v)},$$
where $o_T$ is $T$'s root and $p(v)$ is $v$'s parent node, so that we have
$$h_{T(\beta)} \equiv h_T \iff \forall v \in L_T.\ \left[\bar{\beta}_v = \alpha_v\right], \qquad (10)$$
and (9) can be rewritten as:
$$\mathcal{G}(T) = \min_{\{\bar{\beta}_v\}} \left\{ f(\{\bar{\beta}_v\}) \ :\ \forall v \in L_T.\ \left[\bar{\beta}_v = \alpha_v\right] \right\}, \qquad (11)$$
where
$$f(\{\bar{\beta}_v\}) = \frac{\lambda}{2} \sum_{v \neq o_T} \gamma^{d_v} \left(\bar{\beta}_v - \bar{\beta}_{p(v)}\right)^2 + \frac{\lambda}{2}\, \bar{\beta}_{o_T}^2. \qquad (12)$$
Setting $f$'s partial derivatives to zero, we obtain that at the optimum, for $v \notin L_T$:
$$\bar{\beta}_v = \begin{cases} \dfrac{\bar{\beta}_{p(v)} + \gamma \sum_{p(w)=v} \bar{\beta}_w}{1 + 2\gamma} & v \neq o_T \\[2ex] \dfrac{\gamma \sum_{p(w)=v} \bar{\beta}_w}{1 + 2\gamma} & v = o_T \end{cases} \qquad (13)$$
i.e., essentially, $\bar{\beta}_v$ is a weighted average of its neighbors. This naturally leads to the iterative algorithm summarized in Algorithm 5. Convergence of this algorithm and some more computational details of this regularizer are given in the Appendix.

Algorithm 5:
  for $v \in T$ do
    $\bar{\beta}_{v,0} \leftarrow \alpha_v$ if $v \in L_T$;  $0$ if $v \notin L_T$
  for $i = 1$ to $m$ do
    for $v \in L_T$ do  $\bar{\beta}_{v,i} \leftarrow \alpha_v$
    for $v \notin L_T$ do
      $\bar{\beta}_{v,i} \leftarrow \dfrac{\bar{\beta}_{p(v),i-1} + \gamma \sum_{p(w)=v} \bar{\beta}_{w,i-1}}{1 + 2\gamma}$ if $v \neq o_T$;  $\dfrac{\gamma \sum_{p(w)=v} \bar{\beta}_{w,i-1}}{1 + 2\gamma}$ if $v = o_T$
  end
  return $\{\bar{\beta}_{v,m}\}$
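A direct transcription of Algorithm 5 (our own toy tree encoding: parent links, children lists, and leaf weights alpha):

```python
def min_penalty_aux(parent, children, alpha, gamma, root, m=100):
    """Algorithm 5: iterate the fixed-point equations (13). Leaves stay
    pinned at bar{beta}_v = alpha_v; each internal node becomes a weighted
    average of its parent and children from the previous iteration."""
    bb = {v: alpha.get(v, 0.0) for v in parent}        # bar{beta}_{v,0}
    for _ in range(m):
        prev = dict(bb)
        for v in parent:
            if v in alpha:                              # leaf: keep alpha_v
                bb[v] = alpha[v]
            elif v == root:
                bb[v] = gamma * sum(prev[w] for w in children[v]) / (1 + 2 * gamma)
            else:
                bb[v] = (prev[parent[v]]
                         + gamma * sum(prev[w] for w in children[v])) / (1 + 2 * gamma)
    return bb
```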
Figure 5: Example of Sum-to-zero Sibling Model (each sibling pair's weights, such as +1.3/-1.3 and +0.2/-0.2, sum to zero)

5.4.3 Min-penalty regularization with sum-to-zero sibling constraints
Another regularizer we introduce is based on the same basic idea as above but is computationally simpler. We add to (9) the constraint that the sum of weights for every sibling pair must be zero:
$$\mathcal{G}(T) = \min_{\{\beta_v\}} \left\{ \frac{\lambda}{2} \sum_{v \in T} \gamma^{d_v} \beta_v^2 \ :\ h_{T(\beta)}(x) \equiv h_T(x);\ \forall v \notin L_T.\ \Big[\sum_{p(w)=v} \beta_w = 0\Big] \right\},$$
as illustrated in Figure 5. The intuition behind the sum-to-zero sibling constraints is that less redundant models are preferable, and that a model is least redundant when the branches at every internal node lead to completely opposite actions, namely, adding $x$ to versus subtracting $x$ from the output value.

Using the auxiliary variables $\{\bar{\beta}_v\}$ defined above, it is straightforward to show that any set of equivalent models has exactly one model that satisfies the sum-to-zero sibling constraints. This model, whose coefficients are $\{\beta_v\}$, can be obtained through the following recursive computation:
$$\bar{\beta}_v = \begin{cases} \alpha_v & v \in L_T \\ \sum_{p(w)=v} \bar{\beta}_w / 2 & v \notin L_T \end{cases} \qquad (14)$$
More computational details of this regularizer are described in the Appendix.
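Unlike Algorithm 5, the recursion (14) needs only a single bottom-up pass. A sketch (same toy encoding as before; the stump example is our own):

```python
def sum_to_zero_model(children, alpha, root):
    """One bottom-up pass computing bar{beta}_v per (14), then recovering
    node weights via beta_root = bar{beta}_root and
    beta_v = bar{beta}_v - bar{beta}_{p(v)} (from the definition of the
    auxiliary variables); every sibling pair then sums to zero."""
    bb = {}
    def up(v):
        bb[v] = alpha[v] if v not in children else \
                sum(up(w) for w in children[v]) / 2.0
        return bb[v]
    up(root)
    beta = {root: bb[root]}
    for v, kids in children.items():
        for w in kids:
            beta[w] = bb[w] - bb[v]
    return beta

# A stump with leaf weights 1.4 and -1.2:
# sum_to_zero_model({"r": ["a", "b"]}, {"a": 1.4, "b": -1.2}, "r")
# -> {'r': 0.1, 'a': 1.3, 'b': -1.3} (up to float rounding)
```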
5.5 Extension of regularized greedy forest
We introduce an extension, which allows the process of forest growing and the process of weight correction
to have different regularization parameters. The motivation is that the regularization parameter optimum
for weight correction may not necessarily be optimal for forest growing, as the former is fully-corrective
and therefore global whereas the latter is greedy and is localized to the leaf nodes of interest. Therefore, it
is sensible to allow distinct regularization parameters for these two distinct processes. Furthermore, there
could be an extension that allows one to change the strength of regularization as the forest grows, though we
did not pursue this direction in the current work.
6 Experiments
This section reports empirical studies of RGF in comparison with GBDT and several tree ensemble methods.
In particular, we report the results of entering competitions using RGF. Our implementation of RGF used
for the experiments is available from http://riejohnson.com/rgf_download.html.
6.1 On synthesized datasets controlling the complexity of target functions
First we study the performance of the methods in relation to the complexity of target functions using synthesized datasets. To synthesize datasets, first we defined the target function by randomly generating 100 $q$-leaf regression trees; then we randomly generated data points and applied the target function to them to assign the output/target values. The dimensionality of the data points was 10. Note that a larger tree size $q$ makes the target function more complex.

              RGF     GBDT
5-leaf data   0.2200  0.2597
10-leaf data  0.3480  0.3968
20-leaf data  0.4578  0.4942
Table 1: Regression results on synthesized datasets. RMSE. Average of 3 runs.

Figure 6: Regression results in relation to model size (RMSE vs. the number of leaf nodes, for RGF-$L_2$, RGF min-penalty, and GBDT). One particular run on the data synthesized from 10-leaf trees.
The results shown in Table 1 are root mean square error (RMSE) averaged over three runs. In each run, 2K randomly chosen data points were used for training, and the number of test data points was 20K. The parameters were chosen by 2-fold cross validation on the training data. Since the task is regression, the loss functions for RGF and GBDT were set to square loss. The RGF used here is the most basic version, which does $L_2$ regularization with one parameter $\lambda$ for both forest growing and weight correction. $\lambda$ was chosen from {1, 0.1, 0.01}. For GBDT, we used the R package gbm² [24]. The tree size (in terms of the number of leaf nodes) and the shrinkage parameter were chosen from {5, 10, 15, 20, 25} and {0.5, 0.1, 0.05, 0.01, 0.005, 0.001}, respectively. Table 1 shows that RGF achieves smaller error than GBDT on all types of datasets.

RGF with the min-penalty regularizer with the sibling constraints further improves RMSE over RGF-$L_2$ by 0.0315, 0.0210, and 0.0033 on the 5-leaf, 10-leaf, and 20-leaf synthesized datasets, respectively. RGF with the min-penalty regularizer without the sibling constraints also achieved similar performance. Based on the amount of improvement, the min-penalty regularizer appears to be more effective on simpler targets. Figure 6 plots RMSE in relation to the model size in terms of the number of basis functions, or leaf nodes.
The synthesized datasets used in this section are provided with the RGF software.
6.2 Regression and 2-way classification tasks on real-world datasets
The first suite of real-world experiments uses relatively small training sets of 2K data points to facilitate experimenting with a wide variety of datasets. The criteria of data choice were (1) having over 5000 data points in total, to ensure a decent amount of test data, and (2) covering a variety of domains. The datasets and tasks are summarized in Table 2. All except Houses (downloaded from http://lib.stat.cmu.edu) are from the UCI repository [11]. All the results are the average of 3 runs, each of which used a randomly-drawn 2K training set. For multi-class data, binary tasks were generated as in Table 2. The official test sets were used as test sets where available (Letter, Adult, and MSD). For the relatively large Nursery and Houses, 5K data points were held out as test sets. For the relatively small Musk and Waveform, in each run, 2K data points were randomly chosen as training sets, and the rest were used as test sets. The exact partitions of training and test data are provided with the RGF software.

² In the rest of the paper, gbm was used for the GBDT experiments unless otherwise specified.

Name               Dim       Regression tasks
CT slices          384       Target: relative location of CT slices
California Houses  6         Target: log(median house price)
YearPredictionMSD  90        Target: year when the song was released

Name               Dim       Binary classification tasks
Adult              14 (168)  Is income > $50K?
Letter             16        A-M vs. N-Z
Musk               166       Musk or not
Nursery            8 (24)    Special priority or not
Waveform           40        Class 2 vs. Classes 1 & 3
Table 2: Real-world datasets. We report the average of 3 runs, each of which uses 2K training data points. The numbers in parentheses indicate the dimensionality after converting categorical attributes to indicator vectors.

                   RGF     GBDT    RandomF.  BART
CT slices          7.2037  7.6861  7.5029    8.6006
California Houses  0.3417  0.3454  0.3453    0.3536
YearPredictionMSD  9.5523  9.6846  9.9779    9.6126
Table 3: Regression results. RMSE. Average of 3 runs. 2K training data points. The best and second best results are in bold and italic, respectively.
All the parameters were chosen by 2-fold cross validation on the training data. The RGF tested here is RGF-$L_2$ with the extension in which the processes of forest growing and weight correction can have regularization parameters of different values, which we call $\lambda_g$ ($g$ for growing) and $\lambda$, respectively. The value of $\lambda$ was chosen from {10, 1, 0.1, 0.01} with square loss, and from {10, 1, 0.1, 0.01, 1e-10, 1e-20, 1e-30} with logistic loss and exponential loss. $\lambda_g$ was chosen from {$\lambda$, $\lambda/100$}. The tree size for GBDT was chosen from {5, 10, 15, 20, 25}, and the shrinkage parameter from {1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001}.

          RGF                       GBDT                      Random   BART   ada
          Square  Logistic  Expo.   Square  Logistic  Expo.   forests
Adult     85.62   85.63     85.20   85.62   85.88     85.78   85.29    85.62  84.68
Letter    92.50   92.19     92.48   91.20   91.72     91.86   90.33    85.06  92.12
Musk      97.83   97.91     97.83   97.14   96.79     97.39   96.23    95.56  97.13
Nursery   98.63   99.97     99.95   98.13   99.90     99.88   97.44    99.12  99.51
Waveform  90.28   90.21     90.06   89.56   90.06     90.14   90.20    90.49  90.52
Table 4: Binary classification results. Accuracy (%). Average of 3 runs. 2K training data points. The best and second best results are in bold and italic, respectively.

Regression: RMSE
          GBDT    GBDT w/ post-proc.
Houses    0.3451  0.3479 (4.3%)

Classification: accuracy (%)
          GBDT    GBDT w/ post-proc.
Adult     85.62   85.62 (4.6%)
Letter    91.03   90.78 (6.9%)
Musk      97.09   96.67 (7.1%)
Nursery   98.06   97.71 (5.8%)
Waveform  89.56   89.73 (2.9%)
Table 5: Comparison of GBDT with and without fully-corrective post-processing; RMSE/accuracy (%) and model sizes (in parentheses) relative to those without post-processing. Square loss. Average of 3 runs.
In addition to GBDT, we also tested two other tree ensemble methods: random forests [7] and Bayesian additive regression trees (BART) [10]. We used the R package randomForest [7] and performed random forest training with the number of randomly-drawn features in {$d/4$, $d/3$, $d/2$, $3d/5$, $7d/10$, $4d/5$, $9d/10$, $\sqrt{d}$}, where $d$ is the feature dimensionality; the number of trees was set to 1000; and the other parameters were set to their default values. BART is a Bayesian approach to tree ensemble learning. The motivation to test BART was that it shares some high-level strategies with RGF, such as explicit regularization and a non-black-box approach to tree learners. We used the R package BayesTree [9] and chose the parameter $k$, which adjusts the degree of regularization, from {1, 2, 3}.

On the binary classification tasks, we also tested AdaBoost using the R package ada. The parameter cp, which controls the degree of regularization of the tree learner, was chosen from {0.1, 0.01, 0.001}; the number of trees was set to 1000; and the other parameters were set to their default values.
Table 3 shows the regression results in RMSE. RGF achieves lower error than all the others.

Table 4 shows the binary classification results in accuracy (%). RGF achieves the best performance on three datasets, whereas GBDT achieves the best performance on only one dataset. In addition, we noticed that random forests and BART require far larger models than RGF to achieve the performances shown in Table 4; for example, all the BART models consist of over 400K leaf nodes, whereas the RGF models reach their best performance with 20K leaf nodes or fewer.

The min-penalty regularizer was found to be effective on Musk, improving the accuracy of RGF-$L_2$ with square loss from 97.83% to 98.39%, but it did not improve performance on the other datasets. Based on the synthesized data experiments in the previous section, we presume that this is because the target functions underlying these real-world datasets are mostly complex.
6.2.1 GBDT with post-processing of fully-corrective updates
A two-stage approach was proposed in [17, 18]³ that, in essence, first performs GBDT to learn basis functions and then fits their weights with an $L_1$ penalty in a post-processing stage. Note that, by contrast, RGF generates basis functions and optimizes their weights in an interleaving manner, so that fully-corrected weights can influence the generation of the next basis functions.

³ Although [18] discusses various techniques regarding rules, we focus on the aspect of the two-stage approach which [18] derives from [17], since it is the most relevant portion to our work due to its contrast with our interleaving approach.

              #train   #test   dim     #team
Bond prices   762,678  61,146  91      265
Bio response  3,751    2,501   1,776   703
Health Prize  71,435   70,942  50,468  1,327
Table 6: Competition data statistics. The dim (feature dimensionality) and #train are shown for the data used by one particular run for each competition for which we show the Leaderboard performance in Section 6.3.
Table 5 shows the performance results of the two-stage approach on the regression and 2-way classification tasks described in Section 6.2. As is well known, $L_1$ regularization has a feature selection effect, assigning zero weights to more and more features as the regularization becomes stronger. After performing GBDT⁴ with the parameters chosen by cross validation on the training data, we used the R package glmnet [16] to compute the entire $L_1$ path, in which the regularization parameter goes down gradually so that more and more basis functions obtain nonzero weights. The table shows the best RMSE or accuracy (%) on the $L_1$ path computed by glmnet. The numbers in parentheses compare the sizes of the best models with and without the post-processing of the two-stage approach; for example, in the first row, the size of the best model after post-processing is 4.3% of that of the best GBDT model without post-processing. The results show that the $L_1$ post-processing makes the models smaller, but it often degrades accuracy. We view these results as supporting RGF's interleaving approach.

⁴ We used our own implementation of GBDT for this purpose, as gbm does not have the functionality to output the features generated by tree learning.
6.3 RGF in competitions
To further test RGF in practical settings, we entered three machine learning competitions and obtained good results. The competitions were held in the Netflix Prize style. That is, participants submit predictions on the test data (whose labels are not disclosed) and receive performance results on the public portion of the test data as feedback on the public Leaderboard. The goal is to maximize performance on the private portion of the test data, and neither the private score nor the standing on the private Leaderboard is disclosed until the competition ends.
Bond price prediction We were awarded the First Prize in the Benchmark Bond Trade Price Challenge (www.kaggle.com/c/benchmark-bond-trade-price-challenge). The task was to predict bond trade prices based on information such as past trade recordings.

The evaluation metric was weighted mean absolute error, $\sum_i w_i |y_i - f(x_i)|$, with the weights set to be larger for the bonds whose price prediction is considered to be harder. We trained RGF with the L1-L2 hybrid loss [8], $\sqrt{1 + r^2} - 1$ where $r$ is the residual, which behaves like $L_2$ loss when $|r|$ is small and like $L_1$ loss when $|r|$ is large. Our winning submission was the average of 62 RGF runs, each of which used different data pre-processing.
Leaderboard: WMAE
                    Public   Private
RGF (single run)    0.69273  0.68847
Second best team    0.69380  0.69062
GBDT (single run)   0.69504  0.69582

In the table above, RGF (single run) is one of the RGF runs used to make the winning submission, and GBDT (single run) is GBDT⁵ using exactly the same features as RGF (single run). RGF produces smaller error than GBDT on both the public and private portions. Furthermore, by comparison with the performance of the second best team (which blended random forest runs and GBDT runs), we observe that not only the average of the 62 RGF runs but also the single RGF run could have won the first prize, whereas the single GBDT run would have fallen behind the second best team.

⁵ As gbm does not support the L1-L2 loss, we used our own implementation.
Biological response prediction The task of Predicting a Biological Response (www.kaggle.com/c/bioresponse) was to predict a biological response (1/0) of molecules from their chemical properties. We finished in fourth place, with a small difference from first place.

Our best submission combined the predictions of RGF and other methods with some data conversion. For the purpose of this paper, we show the performance of RGF and GBDT on the original data for easy reproduction of the results. Although the evaluation metric was log loss, $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(f(x_i)) + (1 - y_i)\log(1 - f(x_i))\right]$, we found that with both RGF and GBDT, better results can be obtained by training with square loss and then calibrating the predictions by: $g(x) = (0.05 + x)/2$ if $x < 0.05$; $g(x) = (0.95 + x)/2$ if $x > 0.95$; $g(x) = x$ otherwise. The log loss results shown in the table below were obtained this way. RGF produces better results than GBDT on both the public and private sets.

Leaderboard: Log loss
       Public   Private
RGF    0.42224  0.39361
GBDT   0.43576  0.40105
Predicting days in a hospital: $3M Grand Prize As of this writing (September 2012), we are the preliminary second-prize winner of the Round 3 Milestone of the Heritage Provider Network Health Prize (www.heritagehealthprize.com/c/hhp), in the phase of verification by the judging panel towards becoming the actual winner. The task is to predict the number of days people will spend in a hospital in the next year based on historical claims data. This is a two-year-long competition, from April 2011 until April 2013, with a $3,000,000 Grand Prize; currently there are over 1300 teams competing.

The competition is ongoing, and private scores will not be disclosed until the competition ends. We show the public Leaderboard performance of an RGF run and a GBDT run applied to the same features in the table below. Both runs were part of the second-prize submission. We also show the 5-fold cross validation results of our own testing using training data on the same features. RGF achieves lower error than GBDT in both.

       Public LB RMSE   Cross validation
RGF    0.459877         0.440874
GBDT   0.460997         0.441612
Furthermore, we have 5-fold cross validation results of RGF and GBDT on 53 datasets, each of which uses features composed differently. Their corresponding official runs are all part of the winning submission. On all of the 53 datasets, RGF produced lower error than GBDT, with an average error difference of 0.0005. Though the difference is small, the superiority of RGF is consistent across these datasets. This provided us enough competitive advantage to do well in all three Kaggle competitions we have entered so far.
7 Conclusion
This paper introduced a new method that learns a nonlinear function by using an additive model over nonlinear decision rules. Unlike the traditional boosted decision tree approach, the proposed method directly works with the underlying forest structure. The resulting method, which we refer to as regularized greedy forest (RGF), integrates two ideas: one is to include tree-structured regularization in the learning formulation; the other is to employ a fully-corrective regularized greedy algorithm. Since this approach is able to take advantage of the special structure of the decision forest, the resulting learning method is both effective and principled. Our empirical studies showed that the new method can achieve more accurate predictions than the existing methods we tested.
References
[1] Francis Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS 2008, 2008.
[2] Francis Bach. High-dimensional non-linear variable selection through hierarchical kernel learning. Technical Report 00413473, HAL, 2009.
[3] Richard Baraniuk, Volkan Cevher, Marco F. Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56:1982-2001, 2010.
[4] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth Advanced Books and Software, Belmont, CA, 1984.
[5] Leo Breiman. Bagging predictors. Machine Learning, 24:123-140, August 1996.
[6] Leo Breiman. Random forests. Machine Learning, 45:5-32, 2001.
[7] Leo Breiman, Adele Cutler, Andy Liaw, and Matthew Wiener. Package randomForest, 2010.
[8] K. Bube and R. Langan. Hybrid L1/L2 minimization with application to tomography. Geophysics, 62:1183-1195, 1997.
[9] Hugh Chipman and Robert McCulloch. Package BayesTree, 2010.
[10] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266-298, 2010.
[11] A. Frank and A. Asuncion. UCI machine learning repository [http://archive.ics.uci.edu/ml], 2010. University of California, Irvine, School of Information and Computer Sciences.
[12] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119-139, 1997.
[13] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. JMLR, 4:933-969, 2003.
[14] Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In ICML '99, pages 124-133, 1999.
[15] Jerome Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 2001.
[16] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Package glmnet, 2011.
[17] Jerome H. Friedman and Bogdan E. Popescu. Importance sampled learning ensembles. Technical report, 2003.
[18] Jerome H. Friedman and Bogdan E. Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916-954, 2008.
[19] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115-132. MIT Press, 2000.
[20] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. JMLR, 12:3371-3412, 2011.
[21] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In Proceedings of ICML, 2009.
[22] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904, 2009.
[23] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[24] Greg Ridgeway. Generalized boosted models: A guide to the gbm package, 2007.
[25] Robert E. Schapire. The boosting approach to machine learning: An overview. Nonlinear Estimation and Classification, 2003.
[26] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807-2832, 2010.
[27] M. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[28] Tong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33:1538-1579, 2005.
20
A Appendix
A.1 Running time
We have shown that RGF often achieves higher accuracy than GBDT, but this is done at the cost of additional
computational complexity mainly for fully-corrective weight updates. Good news is that running time is still
linear in the number of training data points. Belowwe analyze running time in terms of the following factors:
, the number of leaf nodes generated during training; d, dimensionality of the original input space; n, the
number of training data points; c, how many times the fully-corrective weight optimization is done; and z,
the number of leaf nodes in one tree, or tree size. In RGF, tree size depends on the characteristics of data
and strength of regularization. Although tree size can differ from tree to tree, for simplicity we treat it as
one quantity, which should be approximated by the average tree size in applications.
In typical tree ensemble learning implementation, for efciency, the data points are sorted according to
feature values at the beginning of training. The following analysis assumes that this pre-sorting has been
done. Pre-sorting runs in O(nd log(n)), but its actual running time seems practically negligible compared
with the other part of training even when n is as large as 100,000.
Recall that RGF training consists of two major parts: one grows the forest, and the other optimizes/corrects the weights of the leaf nodes. The part that grows the forest, excluding regularization, runs in $O(nd)$, the same as GBDT. Weight optimization takes place $c$ times, and each time we have an optimization problem over $n$ data points, each of which has at most $\ell/z$ nonzero entries (one per tree); therefore, the running time for optimization, excluding regularization, is $O(nc\ell/z)$ using coordinate descent implemented with a sparse matrix representation.
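To make the $O(nc\ell/z)$ claim concrete, the sketch below performs one such fully-corrective pass for squared loss with an $L_2$ penalty (Python/SciPy; the matrix layout and all names are our own illustration, not the actual RGF code):

import numpy as np
from scipy.sparse import csc_matrix

def fully_corrective_cd(F, y, w, lam, epochs=5):
    # F: n-by-ell 0/1 leaf-membership matrix; each row has one nonzero per
    # tree, i.e., at most ell/z nonzeros.  Minimizes
    #   0.5 * ||y - F w||^2 + 0.5 * lam * ||w||^2
    # by coordinate descent, keeping the residual r = y - F w up to date.
    F = csc_matrix(F)
    r = y - F @ w
    for _ in range(epochs):
        for v in range(F.shape[1]):
            idx = F.indices[F.indptr[v]:F.indptr[v + 1]]  # points in leaf v
            n_v = idx.size
            if n_v == 0:
                continue
            w_new = (r[idx].sum() + n_v * w[v]) / (n_v + lam)
            r[idx] -= w_new - w[v]    # O(n_v) residual maintenance
            w[v] = w_new
    return w

Each epoch touches every stored nonzero of $F$ a constant number of times, i.e., $O(n\ell/z)$ work per pass, which gives the $O(nc\ell/z)$ total when the optimization is invoked $c$ times.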
During forest building, the partial derivatives and the reduction of the regularization penalty are referred to $O(nd)$ times. During weight optimization, the partial derivatives of the penalty are required $O(c\ell)$ times. With RGF-$L_2$, computation of these quantities is practically negligible. Computation of the min-penalty regularizers involves $O(z)$ nodes; however, with an efficient implementation that stores and reuses invariant quantities, the extra running time for min-penalty regularizers during forest building can be reduced from $O(ndz)$ to $O(nd) + O(z^2)$. The extra running time during weight optimization is $O(c\ell z)$, but the constant factor can be substantially reduced by efficient implementation. Details of the efficient implementation are given in A.3 and A.4.
A.2 Convergence of Algorithm 5
To show that Algorithm 5 converges, let us express the algorithm using matrix multiplication as follows. Let $J$ be the number of internal nodes, and define a matrix $A \in \mathbb{R}^{J \times J}$ and a column vector $b \in \mathbb{R}^{J}$ so that:
$$A[v,w] = \begin{cases} \frac{1}{1+2\gamma} & w = p(v) \\ \frac{\gamma}{1+2\gamma} & p(w) = v \\ 0 & \text{otherwise} \end{cases}, \qquad b[v] = \sum_{p(w)=v,\; w \in L_T} \frac{\gamma \beta_w}{1+2\gamma}.$$
Define $B \in \mathbb{R}^{(J+1) \times (J+1)}$ and $\bar\alpha_0 \in \mathbb{R}^{J+1}$ so that:
$$B = \begin{pmatrix} A & b \\ 0 & 1 \end{pmatrix}, \qquad \bar\alpha_0 = (0, \ldots, 0, 1)^{\top}.$$
Then, since we have
$$B^{m} = \begin{pmatrix} A^{m} & \sum_{i=0}^{m-1} A^{i} b \\ 0 & 1 \end{pmatrix},$$
$m$ iterations of Algorithm 5 are equivalent to:
$$\alpha_{v,m} = \begin{cases} \beta_v & v \in L_T \\ \left(B^{m} \bar\alpha_0\right)[v] = \left(\sum_{i=0}^{m-1} A^{i} b\right)[v] & v \notin L_T. \end{cases}$$
It is well known that for any square matrix $Z$, if $\|Z\|_p < 1$ for some $p \ge 1$, then $I - Z$ is invertible and $(I-Z)^{-1} = \sum_{k=0}^{\infty} Z^{k}$. Therefore, it suffices to show that $\|A\|_p < 1$ for some $p \ge 1$.
First, consider the case $\gamma = 1$. In this case, $A$ is symmetric, and column $v$ (and row $v$) has $|N(v)|$ non-zero entries, where $N(v)$ denotes the set of internal nodes adjacent to $v$; all the non-zero entries are $1/3$. Using the fact that $|N(v)| \le 3$ for $v \ne o_T$ and $|N(o_T)| \le 2$, for any nonzero $x \in \mathbb{R}^{J}$ we have:
$$\|x\|_2^2 - \|Ax\|_2^2 = \sum_j \left( x[j]^2 - \frac{1}{9}\Big(\sum_{k \in N(j)} x[k]\Big)^{2} \right) > \frac{2}{3}\sum_j x[j]^2 - \frac{1}{9}\sum_j \sum_{k,\ell \in N(j),\, k<\ell} 2\,x[k]\,x[\ell] > \frac{1}{9}\sum_j \sum_{k,\ell \in N(j),\, k<\ell} \big(x[k]-x[\ell]\big)^{2} \ge 0.$$
Therefore, $\|A\|_2 < 1$.
Next, suppose that $\gamma > 1$. Then we have:
$$\|A\|_1 = \max_j \sum_{i=1}^{J} |A[i,j]| \le 2 \cdot \frac{1}{1+2\gamma} + \frac{\gamma}{1+2\gamma} = \frac{2+\gamma}{1+2\gamma} < 1.$$
Hence, Algorithm 5 converges for $\gamma \ge 1$.
Another way to look at this is that the conditions on $\{\tilde\alpha_v\}$ in (10) and (13), with $\tilde\alpha_v = \beta_v$ for $v \in L_T$, can be expressed in the matrix notation above as $\tilde\alpha = A \tilde\alpha + b$. As shown above, $I - A$ is invertible, and therefore $\{\tilde\alpha_v\}$ with the desired properties can be obtained by $\tilde\alpha = (I - A)^{-1} b$; Algorithm 5 computes this iteratively. Our implementation uses a slight variant (Algorithm 6), as it converges faster.
Algorithm 6:
forall the $v \in T$ do
    $\alpha_v \leftarrow \begin{cases} \beta_v & v \in L_T \\ 0 & v \notin L_T \end{cases}$
end
for $i = 1$ to $m$ do
    forall the $v \notin L_T$ in some fixed order do
        $\alpha_v \leftarrow \begin{cases} \left( \alpha_{p(v)} + \gamma \sum_{p(w)=v} \alpha_w \right) / (1+2\gamma) & v \ne o_T \\ \gamma \sum_{p(w)=v} \alpha_w \,/\, (1+2\gamma) & v = o_T \end{cases}$
    end
end
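As a concrete check of the argument above, the following self-contained sketch (Python/NumPy; the toy tree, the leaf weights, and all names are our own, not part of the paper's code) builds $A$ and $b$ for a small tree with $\gamma = 1$ and confirms that iterating $\alpha \leftarrow A\alpha + b$, which is what Algorithms 5 and 6 implement, reaches the direct solution $(I-A)^{-1} b$:

import numpy as np

# Toy binary tree: nodes 0..2 are internal (0 = root o_T), 3..6 are leaves.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
internal, leaves = [0, 1, 2], [3, 4, 5, 6]
beta = {3: 1.0, 4: -2.0, 5: 0.5, 6: 3.0}   # arbitrary leaf weights
gamma = 1.0
J = len(internal)

A = np.zeros((J, J))
b = np.zeros(J)
for v in internal:
    for w in internal:
        if parent.get(v) == w:                 # w is v's parent
            A[v, w] = 1.0 / (1.0 + 2.0 * gamma)
        elif parent.get(w) == v:               # w is an internal child of v
            A[v, w] = gamma / (1.0 + 2.0 * gamma)
    b[v] = sum(gamma * beta[w] for w in leaves
               if parent[w] == v) / (1.0 + 2.0 * gamma)

alpha = np.zeros(J)          # alpha_v = 0 initially for internal nodes
for _ in range(200):         # m iterations of the fixed-point map
    alpha = A @ alpha + b
assert np.allclose(alpha, np.linalg.solve(np.eye(J) - A, b))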
A.3 Computational detail of the min-penalty regularization in Section 5.4.2
To optimize weights according to (8), we need to obtain the derivatives of the regularization penalty, $\frac{\partial G(T(\delta_u))}{\partial \delta_u}\big|_{\delta_u=0}$ and $\frac{\partial^2 G(T(\delta_u))}{\partial \delta_u^2}\big|_{\delta_u=0}$, where $\delta_u$ is the additive change to $\beta_u$, the weight of a leaf node $u$, and $T$ is the tree to which node $u$ belongs. Let $\{\alpha_v\} = \arg\min_{\{\tilde\alpha_v\}} \big\{ f(\{\tilde\alpha_v\}) : \forall v \in L_T \,.\, [\tilde\alpha_v = \beta_v] \big\}$ so that $G(T) = f(\{\alpha_v\})$, where $f$ and $G(T)$ are defined in terms of auxiliary variables as in (11) and (12).
From the derivation in the previous section, we know that $\alpha_w$ is linear in the leaf weights $\{\beta_v\}$. In particular, $\alpha_w$ for an internal node $w$ can be written in the form $\alpha_w = \sum_{v \in L_T} c_{w,v} \beta_v$ with coefficients $c_{w,v}$ that are independent of the leaf weights and depend only on the tree topology. Also considering $\alpha_v = \beta_v$ for $v \in L_T$, we have:
$$\frac{\partial \alpha_w}{\partial \beta_u} = \begin{cases} c_{w,u} & w \notin L_T \\ 1 & w = u \\ 0 & w \ne u \ \&\ w \in L_T. \end{cases}$$
The coefficient $c_{w,u}$ can be obtained by Algorithm 5 (or 6) with input: $\beta_u = 1$; $\beta_t = 0$ for $t \ne u$; and the topological structure of $T$. Also using $\frac{\partial^2 \alpha_w}{\partial \beta_v^2} = 0$, it is straightforward to derive from the definition of $f$ in (12) that:
$$\frac{\partial G(T(\delta_u))}{\partial \delta_u}\bigg|_{\delta_u=0} = \sum_{w \in T} \lambda \gamma^{d_w}\, \theta_w \frac{\partial \theta_w}{\partial \beta_u}, \qquad \frac{\partial^2 G(T(\delta_u))}{\partial \delta_u^2}\bigg|_{\delta_u=0} = \sum_{w \in T} \lambda \gamma^{d_w} \left( \frac{\partial \theta_w}{\partial \beta_u} \right)^{2}, \quad (15)$$
where
$$\theta_w = \begin{cases} \alpha_w - \alpha_{p(w)} & w \ne o_T \\ \alpha_w & \text{otherwise}. \end{cases} \quad (16)$$
The optimal change to the leaf weight $\beta_u$ can be computed using these quantities. The loss reduction caused by a node split can be estimated similarly.
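Since $c_{w,u}$ depends only on the tree topology, it can be read off by running the same fixed-point iteration with a unit weight on leaf $u$. A sketch reusing the $A$, parent, internal, and gamma objects from the toy example in A.2 (our own illustration, not the paper's code):

import numpy as np

def leaf_coefficients(A, parent, internal, u, gamma, iters=200):
    # Returns c with c[w] = c_{w,u} = d(alpha_w)/d(beta_u) for internal w,
    # obtained by running Algorithm 5 with beta_u = 1 and every other
    # leaf weight set to 0; only leaf u's parent receives a contribution.
    J = len(internal)
    b = np.zeros(J)
    b[parent[u]] = gamma / (1.0 + 2.0 * gamma)
    c = np.zeros(J)
    for _ in range(iters):
        c = A @ c + b
    return c

Combined with (15) and (16), these coefficients give the first and second derivatives needed for the weight update in closed form.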
Efficient implementation. The key to efficient implementation is to exploit the fact that, in the course of training, certain quantities are locally invariant, by storing and reusing those invariant quantities. First, since the iterative algorithm is relatively complex, it should be executed as infrequently as possible. As noted above, the partial derivatives of $\alpha_w$ depend only on the topological structure of the tree, so they need to be recomputed only when the tree topology changes. When $\delta_v$ is added to $\beta_v$, $\alpha_w$ should be updated through $\alpha_w \leftarrow \alpha_w + \delta_v \frac{\partial \alpha_w}{\partial \beta_v}$ instead of rerunning the iterative algorithm. Second, consider the process of evaluating the loss reduction of all possible splits of some node, which is fixed during this process. Using the notation in Section A.1, the partial derivatives of the regularization penalty similar to (15) are referred to $O(nd)$ times in this process, but they are invariant and need to be computed just once. The change in penalty caused by a node split is also evaluated $O(nd)$ times, and one can make each evaluation run in $O(1)$ instead of $O(z)$ by storing invariant quantities. To see this, as in Section 5.3.1, consider splitting a node associated with weight $\beta$, and let $u_k$ for $k = 1, 2$ be the new leaf nodes after the split, with weights $\beta + \delta_k$. Write $\tilde{T}(\delta_1, \delta_2)$ for the new tree. Define $\alpha_w$ and $\theta_w$ as above but on $\tilde{T}(0, 0)$ instead of $T$. To simplify notation, let $\mu_{w,k} = \frac{\partial \theta_w}{\partial \beta_{u_k}}$. Due to symmetry, we have $\mu_{w,1} = \mu_{w,2}$ for $w \ne u_1 \ \&\ w \ne u_2$.
The change in penalty is $G(\tilde{T}(\delta_1, \delta_2)) - G(T) = \big[ G(\tilde{T}(\delta_1, \delta_2)) - G(\tilde{T}(0,0)) \big] + \big[ G(\tilde{T}(0,0)) - G(T) \big]$. Since the second term is invariant during this process, it needs to be computed only once. To ease notation, let
$$\Delta_v = \lambda \gamma^{d_v} \left( \theta_v \sum_{k=1}^{2} \delta_k \mu_{v,k} + \frac{1}{2} \Big( \sum_{k=1}^{2} \delta_k \mu_{v,k} \Big)^{2} \right).$$
Then the first term in the penalty change can be written as:
$$G(\tilde{T}(\delta_1, \delta_2)) - G(\tilde{T}(0,0)) = \frac{1}{2} \sum_{v} \lambda \gamma^{d_v} \Big( \theta_v + \sum_{k=1}^{2} \delta_k \mu_{v,k} \Big)^{2} - \frac{1}{2} \sum_{v} \lambda \gamma^{d_v} \theta_v^{2} = \sum_{v} \Delta_v$$
$$= (\delta_1 + \delta_2) \sum_{v \in \tilde{T} \setminus \{u_1, u_2\}} \lambda \gamma^{d_v} \theta_v \mu_{v,1} + \frac{1}{2} (\delta_1 + \delta_2)^{2} \sum_{v \in \tilde{T} \setminus \{u_1, u_2\}} \lambda \gamma^{d_v} \mu_{v,1}^{2} + \Delta_{u_1} + \Delta_{u_2}.$$
Here $\sum_{v \in \tilde{T} \setminus \{u_1, u_2\}} \lambda \gamma^{d_v} \theta_v \mu_{v,1}$ and $\sum_{v \in \tilde{T} \setminus \{u_1, u_2\}} \lambda \gamma^{d_v} \mu_{v,1}^{2}$ are invariant during this process and can be precomputed; therefore, each evaluation of the penalty difference runs in $O(1)$.
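Schematically, the $O(1)$ evaluation looks as follows (Python; all argument names are our own, and the invariants are assumed to have been precomputed once per node being split, as described above):

def split_penalty_change(d1, d2, S1, S2, lam_gam_u, theta_u, mu_u):
    # S1 = sum over v outside {u1, u2} of lam * gamma**d_v * theta_v * mu_{v,1}
    # S2 = sum over v outside {u1, u2} of lam * gamma**d_v * mu_{v,1}**2
    # lam_gam_u = lam * gamma**d_u (the two new leaves share one depth);
    # theta_u[k] and mu_u[k][j] are theta_{u_{k+1}} and mu_{u_{k+1}, j+1}
    # on the zero-weight split tree; all are fixed while scanning splits.
    s = d1 + d2
    change = s * S1 + 0.5 * s * s * S2       # contribution of the two sums
    for k in range(2):                       # add Delta_{u_1} + Delta_{u_2}
        lin = d1 * mu_u[k][0] + d2 * mu_u[k][1]
        change += lam_gam_u * (theta_u[k] * lin + 0.5 * lin * lin)
    return change

Only d1 and d2 vary across the candidate splits of a node, so each call is $O(1)$.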
A.4 Computational detail of the regularizer in Section 5.4.3
The same basic ideas can be applied to obtain an efficient implementation of the regularizer with the sum-to-zero sibling constraints. To optimize the additive change $\delta_u$ of the leaf weight $\beta_u$, let $\{\alpha_v\}$ be the arguments that minimize (9), so that $G(T) = \sum_{v \in T} \lambda \gamma^{d_v} \alpha_v^{2} / 2$. Then the partial derivatives of $G(T(\delta_u))$ are obtained by (15). From (14), we have:
$$\frac{\partial \alpha_w}{\partial \beta_u} = \begin{cases} 2^{-d_u} & w = o_T \\ 2^{\,d_w - d_u - 1} & w \ne o_T,\ w \in A(u) \quad \text{($w$ is either $u$ or $u$'s ancestor)} \\ -2^{\,d_w - d_u - 1} & w \notin A(u),\ p(w) \in A(u) \quad \text{($w$ is the sibling of $u$ or of one of $u$'s ancestors)} \\ 0 & \text{otherwise}. \end{cases}$$
Optimization can be done using these quantities.
Regarding efficient implementation, one difference from the regularizer without the sibling constraints above is that $G(\tilde{T}(0,0)) = G(T)$ with this regularizer, and therefore $G(\tilde{T}(0,0))$ does not have to be computed.
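As a quick illustration, the case analysis above translates directly into code (Python; the tree representation and all names are our own assumption):

def dalpha_dbeta(w, u, depth, ancestors, parent, root):
    # depth[v] = d_v; ancestors[u] = A(u), the set containing u and all of
    # its ancestors. Returns d(alpha_w)/d(beta_u) for the regularizer with
    # sum-to-zero sibling constraints. The case order matters: the root is
    # in A(u) but must be handled by the first branch.
    if w == root:
        return 2.0 ** (-depth[u])
    if w in ancestors[u]:
        return 2.0 ** (depth[w] - depth[u] - 1)
    if parent[w] in ancestors[u]:
        return -(2.0 ** (depth[w] - depth[u] - 1))
    return 0.0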