Survey - Gradient Boosting Machine
1 Point Zero One Technology
2 Fordham University
3 Imperial College London
4 Massachusetts Institute of Technology
Abstract
In this survey, we discuss several types of gradient boosting algorithms and illustrate
their mathematical frameworks in detail: (1) an introduction to gradient boosting,
(2) objective function optimization, (3) loss function estimation, (4) model
construction, and (5) the application of boosting to ranking.
1 Introduction
Proposed by Freund and Schapire (1997), boosting addresses the general problem of
constructing an extremely accurate prediction rule by combining numerous roughly accurate
ones. As described by Friedman (2001, 2002) and Natekin and Knoll (2013), Gradient
Boosting Machines (GBM) seek to build predictive models through back-fitting and
non-parametric regression. Instead of building a single model, the GBM starts with an
initial model and repeatedly fits new models by minimizing a loss function, producing an
increasingly precise model (Natekin and Knoll, 2013).
This survey concentrates on the mathematical derivations of the gradient boosting
algorithms. In Section 2, we analyze the optimization methods for parametric and
non-parametric models. Section 3 covers the definitions of different types of loss
functions. In Section 4, we present different types of boosting algorithms, while in
Section 5, we explore the combination of boosting and ranking algorithms to rank
real-world data.
2 Basic Framework
The ultimate goal of the GBM is to find a function $F(x)$ that minimizes the expected
value of its loss function $L(y, F(x))$:
$$F^* = \arg\min_F \mathbb{E}_{y,x}\, L(y, F(x)).$$
The function $F$ is built from base learners $h(x; a)$, each parameterized by $a$.
Regarded as weak learners, the base learners produce hypotheses that predict only
slightly better than random guessing, and it has been proved that recursive learning
with weak learners can perform just as well as a strong learning algorithm
(Schapire, 1990).
If the base learner is a regression tree, the parameter $a$ usually describes the
splitting nodes of the tree branches (Friedman, 2002). Tree-based models divide the input
variable space into regions and apply a series of rules to identify the regions that have
the strongest responses to the inputs (Elith et al., 2008). Each region is then fitted
with the mean response of the observations that fall in it (Elith et al., 2008). Decision
trees are constructed through binary splits, and recursive splitting generates a large
tree, which is then pruned to remove the weak branches.
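For a parametric model $F(x; P)$, the optimization process can be written as
$$P^* = \arg\min_P \Phi(P), \qquad \Phi(P) = \mathbb{E}_{y,x}\, L(y, F(x; P)),$$
where $P^* = \sum_{m=0}^{M} p_m$, and $p_m$ are the consecutive boosting steps.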
One of the approaches to generate these steps, $p_m$, is to use the steepest-descent
algorithm by calculating the gradient
$$g_m = \{g_{jm}\} = \left\{ \left[ \frac{\partial \Phi(P)}{\partial P_j} \right]_{P = P_{m-1}} \right\},$$
where $P_{m-1} = \sum_{i=0}^{m-1} p_i$. The boosting step is then given by
$$p_m = -\rho_m g_m,$$
and $\rho_m = \arg\min_\rho \Phi(P_{m-1} - \rho g_m)$ is the line search along the
direction of steepest descent.
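The parametric steepest-descent procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quadratic objective `phi`, its gradient `grad`, the starting point, and the golden-section routine used for the line search are all hypothetical choices made for the example, not part of the original derivation.

```python
import math

def golden_section(f, lo=0.0, hi=1.0, iters=80):
    """Approximate the argmin of a unimodal function f on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

def steepest_descent(phi, grad, p0, steps=100):
    """Accumulate boosting steps p_m = -rho_m * g_m, starting from p0."""
    P = list(p0)
    for _ in range(steps):
        g = grad(P)                      # g_m evaluated at P_{m-1}
        if max(abs(gj) for gj in g) < 1e-10:
            break
        # rho_m = argmin_rho Phi(P_{m-1} - rho * g_m), searched on [0, 1]
        rho = golden_section(
            lambda r: phi([pj - r * gj for pj, gj in zip(P, g)])
        )
        P = [pj - rho * gj for pj, gj in zip(P, g)]
    return P

# Hypothetical objective: Phi(P) = (P_0 - 3)^2 + 2 * (P_1 + 1)^2
phi = lambda P: (P[0] - 3) ** 2 + 2 * (P[1] + 1) ** 2
grad = lambda P: [2 * (P[0] - 3), 4 * (P[1] + 1)]
P_star = steepest_descent(phi, grad, [0.0, 0.0])
```

The search interval [0, 1] is sufficient for this particular objective because its curvature keeps the exact step size between 1/4 and 1/2; other objectives would need a different bracket.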
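In the non-parametric case, the function $F$ is solved by minimizing
$$\Phi(F) = \mathbb{E}_{y,x}\, L(y, F(x)) = \mathbb{E}_x \left[ \mathbb{E}_y \big( L(y, F(x)) \big) \mid x \right],$$
and the optimum is reached at $F^*(x) = \sum_{m=0}^{M} f_m(x)$, where
$f_m(x) = -\rho_m g_m(x)$.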
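The gradient of the non-parametric model is
$$g_m(x) = \left[ \frac{\partial\, \mathbb{E}_y [ L(y, F(x)) \mid x ]}{\partial F(x)} \right]_{F(x) = F_{m-1}(x)}.$$
Under sufficient regularity, differentiation and integration can be interchanged, and the
gradient function simplifies to
$$g_m(x) = \mathbb{E}_y \left[ \frac{\partial L(y, F(x))}{\partial F(x)} \,\Big|\, x \right]_{F(x) = F_{m-1}(x)}.$$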
3 Estimation
Under the framework of GBM, different loss functions can be applied to solve different
tasks (Friedman, 2002; Koenker and Hallock, 2001; Natekin and Knoll, 2013). One example
is the $L_1$ loss function
$$L(y, F)_{L_1} = |y - F|,$$
which is the absolute value of the residual between the explained variable $y$ and the
predictive function $F$, while the most commonly used squared-error $L_2$ loss function
is defined as
$$L(y, F)_{L_2} = \frac{1}{2}(y - F)^2.$$
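In addition, the Huber loss function, which merges the $L_1$ and $L_2$ loss functions
described above, can be a robust alternative to the $L_2$ loss function:
$$L(y, F)_{\mathrm{Huber},\delta} = \begin{cases} \frac{1}{2}(y - F)^2, & |y - F| \le \delta \\ \delta\left(|y - F| - \frac{\delta}{2}\right), & |y - F| > \delta \end{cases}$$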
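As a quick illustration, the three losses can be written as plain Python functions. This is a minimal sketch that assumes the linear branch of the Huber loss is $\delta(|y - F| - \delta/2)$, so that the two branches meet at $|y - F| = \delta$; the function names are chosen for this example only.

```python
def l1_loss(y, f):
    """Absolute-error loss |y - F|."""
    return abs(y - f)

def l2_loss(y, f):
    """Squared-error loss (1/2) * (y - F)^2."""
    return 0.5 * (y - f) ** 2

def huber_loss(y, f, delta):
    """Quadratic for small residuals, linear beyond delta."""
    r = abs(y - f)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - delta / 2)

# A large residual is penalized far less by Huber than by L2.
small = huber_loss(0.0, 0.5, delta=1.0)   # quadratic branch
large = huber_loss(0.0, 3.0, delta=1.0)   # linear branch
```

With `delta=1.0`, a residual of 3 costs 2.5 under the Huber loss but 4.5 under the $L_2$ loss, which is the sense in which the Huber loss is less sensitive to outliers.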
Quantile loss is particularly useful in ordering and sorting problems because of its
robustness. It is defined as
$$L(y, f)_\alpha = \begin{cases} (1 - \alpha)|y - f|, & y - f \le 0 \\ \alpha |y - f|, & y - f > 0 \end{cases}$$
where $\alpha$ designates the targeted quantile in the conditional distribution. Taking
$\alpha = 0.5$ reduces this loss to the $L_1$ loss up to a constant factor of $1/2$.
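The role of $\alpha$ can be checked numerically. The sketch below implements the quantile loss as defined above and, as an illustrative assumption, searches only over the observed values for the constant prediction with the smallest total loss; the helper `best_constant` and the toy data are hypothetical.

```python
def quantile_loss(y, f, alpha):
    """alpha * |y - f| when y > f, (1 - alpha) * |y - f| otherwise."""
    r = y - f
    return alpha * abs(r) if r > 0 else (1 - alpha) * abs(r)

def best_constant(ys, alpha):
    """Observed value that minimizes the total quantile loss over ys."""
    return min(ys, key=lambda c: sum(quantile_loss(y, c, alpha) for y in ys))

ys = list(range(1, 11))            # toy responses 1, 2, ..., 10
q75 = best_constant(ys, 0.75)      # lands near the upper quartile
half_l1 = quantile_loss(2.0, 0.0, 0.5)
```

For these data the minimizer at $\alpha = 0.75$ is 8, and at $\alpha = 0.5$ the loss of a residual of 2 is 1, half of the $L_1$ loss.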
4 Methodology
AdaBoost
Gradient boosting is a generalization of Adaboost. Adaboost (Freund and Schapire, 1997),
the original boosting algorithm, is designed to find a hypothesis with low prediction
error relative to a given distribution over the training samples. Freund and Schapire
(1997) illustrated their algorithm with a horse-racing example, in which a gambler wishes
to bet on the horse that has the greatest chance of winning. To increase the probability
of winning, the gambler is encouraged to gather expert opinions before placing a bet.
Collecting information from different experts in this way is similar to forming an
ensemble of weak classifiers. In Adaboost, each expert's opinion corresponds to a
training set (Wang, 2012). Each sample is initialized with a weight, and the sample
weights are adjusted after each iteration, such that the weights of misclassified samples
are increased, while the weights of correctly classified samples are decreased.
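The reweighting step can be made concrete with a short sketch. The text above does not spell out the update rule, so the code assumes the standard discrete Adaboost formulation with labels in {-1, +1}: the weak hypothesis receives the coefficient $\alpha = \frac{1}{2}\ln\frac{1 - \epsilon}{\epsilon}$, where $\epsilon$ is its weighted error, and each weight is multiplied by $e^{-\alpha y_i h(x_i)}$ before renormalization.

```python
import math

def reweight(weights, y_true, y_pred):
    """One Adaboost reweighting step (labels and predictions in {-1, +1}).

    Assumes the weighted error lies strictly between 0 and 1.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p = -1) are scaled up, the rest scaled down.
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [w / total for w in raw], alpha

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]             # the last sample is misclassified
new_weights, alpha = reweight(weights, y_true, y_pred)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.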
In each iteration of boosting, the weak learner of Adaboost chooses a weak hypothesis
from the entire set of weak hypotheses rather than only from those found up to that
point. Since searching the entire hypothesis space can be an enormous amount of work, it
is often suitable to apply weak learners that approximately cover the whole set
(Collins et al., 2002).
Boosting algorithms with certain modifications perform well under high bias and high
variance settings. When weighted sampling is implemented for the training data, the per-
formance of boosting is determined by its ability to reduce variance (Friedman et al., 2000).
Meanwhile, boosting performance depends on bias reduction when the weighted sampling is
replaced with weighted tree fitting (Friedman et al., 2000).
Additionally, Adaboost is prone to overfitting because of its exponential loss. The
overfitting may be mitigated by minimizing the normalized sigmoid cost function instead
(Mason et al., 2000),
$$C(F) = \frac{1}{m} \sum_{i=1}^{m} \left[ 1 - \tanh(\lambda y_i F(x_i)) \right].$$
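A small numerical comparison shows why a bounded cost is less sensitive to badly misclassified points. The sketch below evaluates the normalized sigmoid cost next to the exponential cost $\frac{1}{m}\sum_i e^{-y_i F(x_i)}$; the exponential form is the usual Adaboost cost and is an assumption here, since the text only names it.

```python
import math

def sigmoid_cost(ys, scores, lam):
    """Normalized sigmoid cost: mean of 1 - tanh(lam * y_i * F(x_i))."""
    return sum(1 - math.tanh(lam * y * s) for y, s in zip(ys, scores)) / len(ys)

def exponential_cost(ys, scores):
    """Exponential cost: mean of exp(-y_i * F(x_i))."""
    return sum(math.exp(-y * s) for y, s in zip(ys, scores)) / len(ys)

# One sample with a very negative margin, for example a mislabeled point.
ys, scores = [1], [-10.0]
bounded = sigmoid_cost(ys, scores, lam=1.0)
unbounded = exponential_cost(ys, scores)
```

Each term of the sigmoid cost stays below 2 no matter how negative the margin is, whereas the exponential cost of the same point exceeds 20000.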
In the above function, $F$ is a convex combination of weak hypotheses, and the parameter
$\lambda$ controls the steepness of the margin cost function $c(z) = 1 - \tanh(\lambda z)$.
Through their experiments, Mason et al. (2000) showed that DOOM II, a boosting algorithm
that optimizes the normalized sigmoid cost, performed better overall than Adaboost.
According to Mason et al. (1999), AnyBoost is a general boosting algorithm that performs
gradient descent in an inner product space. The inner product space $S$ consists of all
linear combinations of weak hypotheses, so it contains both the weak hypotheses and their
combination $F$. The inner product can thus be written as
$$\langle F, G \rangle := \frac{1}{m} \sum_{i=1}^{m} F(x_i) G(x_i),$$
where $F$ and $G$ belong to the set of all linear combinations of weak hypotheses. The
AnyBoost algorithm that uses this inner product together with the normalized sigmoid cost
function is referred to as DOOM II (Mason et al., 2000).
Arc-x4
Arcing, a concept introduced by Breiman (1996) and used in Adaboost, is a technique for
adaptively reweighting the training samples. Arc-x4 (Breiman, 1997) performs similarly to
the original boosting algorithm in reducing training error and generalization error. At
each boosting step, a new training sample is drawn from the training set, with case $n$
selected with probability
$$p(n) = \frac{1 + m(n)^4}{\sum_{n'} \left( 1 + m(n')^4 \right)},$$
where $m(n)$ is the number of times case $n$ has been misclassified by the classifiers
built so far.
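The Arc-x4 sampling probabilities are simple to compute. The sketch below assumes that `miscounts[n]` holds $m(n)$, the misclassification count of case $n$; the resampling helper and the toy counts are hypothetical and only show how the probabilities would be used.

```python
import random

def arc_x4_probabilities(miscounts):
    """p(n) proportional to 1 + m(n)^4."""
    raw = [1 + m ** 4 for m in miscounts]
    total = sum(raw)
    return [r / total for r in raw]

def resample(cases, miscounts, rng):
    """Draw a new training sample of the same size, with replacement."""
    probs = arc_x4_probabilities(miscounts)
    return rng.choices(cases, weights=probs, k=len(cases))

probs = arc_x4_probabilities([0, 1, 2])          # raw weights 1, 2, 17
sample = resample(["a", "b", "c"], [0, 1, 2], random.Random(0))
```

A case misclassified twice already receives 85% of the sampling mass in this toy example, which shows how quickly the fourth power concentrates attention on hard cases.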
equation
$$(\rho_m, a_m) = \arg\min_{a, \rho} \sum_{i=1}^{N} \left[ \tilde{y}_i - \rho\, h(x_i; a) \right]^2.$$
Solving for $F(x)$, we obtain a stage-wise model
$$F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m).$$
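The stage-wise update can be sketched for the squared-error loss. Several assumptions are made here and are not taken from the text: $\tilde{y}_i$ is interpreted as the current residual $y_i - F_{m-1}(x_i)$, the base learner $h(x; a)$ is a one-split regression stump on a single input, and the multiplier $\rho_m$ is absorbed into the leaf means of the stump, which is what the least-squares fit yields.

```python
def fit_stump(xs, targets):
    """Least-squares stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:          # needs at least two distinct xs
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=20):
    """Stage-wise model F_m = F_{m-1} + stump fitted to the residuals."""
    f0 = sum(ys) / len(ys)                  # initial constant model
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        preds = [p + (lm if x <= t else rm) for x, p in zip(xs, preds)]
        stumps.append((t, lm, rm))
    return f0, stumps

def predict(model, x):
    f0, stumps = model
    return f0 + sum(lm if x <= t else rm for t, lm, rm in stumps)

xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]                    # toy target y = x^2
model = boost(xs, ys, rounds=20)
```

Each round can only lower the training squared error, because the best stump fits the residuals at least as well as a zero update.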
Logitboost
Another well-known boosting algorithm is Logitboost. Similar to other boosting
algorithms, Logitboost adopts regression trees as the weak learners. Derived from
logistic regression, Logitboost uses the negative log-likelihood of the class
probabilities as its loss (Li, 2012). The class probability $p$ is formulated as
$$p_{i,k} = \Pr(y_i = k \mid x_i) = \frac{e^{F_{i,k}(x_i)}}{\sum_{s=0}^{K-1} e^{F_{i,s}(x_i)}},$$
where $y_i$ is the output and $x_i$ is the input vector. Thus, the loss function of
Logitboost can be written as
$$L = \sum_{i=1}^{N} L_i, \qquad L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k},$$
where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise. A stagewise model follows as
$$F_{i,k} = F_{i,k} + v\, \frac{K-1}{K} \left( f_{i,k} - \frac{1}{K} \sum_{s=0}^{K-1} f_{i,s} \right),$$
where $v$ is a shrinkage parameter, and $f_{i,k}$ is the value fitted by the base learner
for class $k$.
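The class probabilities, the per-sample loss, and the stagewise update can be sketched for a single sample as follows. The function names are illustrative, the scores are plain lists of length $K$, and subtracting the maximum score before exponentiating is a numerical-stability detail added here rather than part of the formulas.

```python
import math

def class_probabilities(scores):
    """Softmax of the K class scores F_{i,0..K-1} for one sample."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def logit_loss(scores, label):
    """L_i = -log p_{i,label}, since r_{i,k} is 1 only for k = label."""
    return -math.log(class_probabilities(scores)[label])

def logitboost_update(F, f, v):
    """F_k <- F_k + v * (K-1)/K * (f_k - mean of f) for one sample."""
    K = len(F)
    f_bar = sum(f) / K
    return [Fk + v * (K - 1) / K * (fk - f_bar) for Fk, fk in zip(F, f)]

F = [0.0, 0.0, 0.0]
loss_before = logit_loss(F, label=0)          # log(3) for uniform scores
F = logitboost_update(F, f=[3.0, 0.0, -3.0], v=0.3)
loss_after = logit_loss(F, label=0)
```

Because the update subtracts the mean of the fitted values, the scores keep summing to zero, which matches the sum-to-zero constraint discussed for Logitboost.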
Besides the class probabilities, another important factor in Logitboost is the dense
Hessian matrix, which arises when computing the tree split gain and fitting the node
values. However, certain modifications are required in order to incorporate these factors
into the optimization. The sum-to-zero constraint on the classifier, implied by the
sum-to-one constraint on the class probabilities, can be handled by adopting a vector
tree at each boosting iteration. In the vector tree, a sum-to-zero vector is fitted at
each split node in the K-dimensional space. Moreover, adding the vector tree allows
explicit computation of the split gain and node fitting, which becomes a secondary
problem when fitting a new tree. Such secondary problems can then be used to cope with
the dense Hessian matrix, where only two coordinates are allowed for each of the
secondary problems (Sun et al., 2012).
LAD Regression
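The LAD regression proposed by Friedman (2002) has the loss function
$L(y, F) = |y - F|$, where $F(x)$ is solved by
$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} 1(x \in R_{jm}), \qquad \gamma_{jm} = \rho_m b_{jm}.$$
Moreover, in the LAD regression, the gamma parameter is
$$\gamma_{jm} = \operatorname{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i) \}.$$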
M-Regression
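M-Regression (Friedman, 2002) is designed to work with the Huber loss function, with
$$\gamma_{jm} = \tilde{r}_{jm} + \frac{1}{N_{jm}} \sum_{x_i \in R_{jm}} \operatorname{sign}\big(r_{m-1}(x_i) - \tilde{r}_{jm}\big) \cdot \min\big(\delta_m, |r_{m-1}(x_i) - \tilde{r}_{jm}|\big),$$
where $\tilde{r}_{jm} = \operatorname{median}_{x_i \in R_{jm}} \{ r_{m-1}(x_i) \}$ and
$r_{m-1}(x_i) = y_i - F_{m-1}(x_i)$.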
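The terminal-node values for the two regressions can be compared on a toy region. The sketch takes the residuals $r_{m-1}(x_i)$ of the observations in one region $R_{jm}$ as a plain list; the LAD value is assumed to be the median of those residuals, and the threshold $\delta_m$ is passed in directly because its choice is not discussed here.

```python
import math
import statistics

def lad_gamma(residuals):
    """LAD terminal-node value: median of the residuals in the region."""
    return statistics.median(residuals)

def huber_gamma(residuals, delta):
    """M-Regression value: median plus the mean of the clipped deviations."""
    r_med = statistics.median(residuals)
    clipped = [math.copysign(min(delta, abs(r - r_med)), r - r_med)
               for r in residuals]
    return r_med + sum(clipped) / len(residuals)

residuals = [0.0, 1.0, 2.0, 100.0]      # toy region with one outlier
g_lad = lad_gamma(residuals)
g_huber = huber_gamma(residuals, delta=2.0)
g_mean = sum(residuals) / len(residuals)
```

The outlier drags the plain mean to 25.75, while the LAD value stays at 1.5 and the Huber-based value moves only slightly, to 1.625.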
5 Ranking Problem
One of the most discussed problems in machine learning is teaching a computer to rank.
Two sets of data are required before constructing a ranking algorithm (Zheng et al., 2008),
i.e., the preference data containing a set of features, and the ranked targets. Based on these
two datasets, the ranking function can be computed for each dataset under an optimization
problem.
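The objective function for the ranking problem is
$$R(h) = \frac{w}{2} \sum_{i=1}^{N} \left( \max\{0, h(y_i) - h(x_i) + \tau\} \right)^2 + \frac{1 - w}{2} \sum_{i=1}^{n} \left( l_i - h(z_i) \right)^2,$$
where $x_i$ and $y_i$ are the features in the preference data, and $l_i$ is the ranked
target for the features $z_i$. The first term vanishes whenever
$h(x_i) \ge h(y_i) + \tau$, which is the desired condition if $x_i$ is ranked higher
than $y_i$.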
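The objective can be evaluated directly once a ranking function is given. In the sketch below, each preference pair is a tuple `(x, y)` in which `x` is ranked higher than `y`, each labeled example is a tuple `(z, l)`, and `h` is any callable; the identity function and the toy numbers are assumptions made purely for illustration.

```python
def ranking_objective(h, preference_pairs, labeled, w, tau):
    """R(h) = w/2 * pairwise hinge-squared term + (1-w)/2 * regression term."""
    pair_term = sum(max(0.0, h(y) - h(x) + tau) ** 2
                    for x, y in preference_pairs)
    label_term = sum((l - h(z)) ** 2 for z, l in labeled)
    return w / 2 * pair_term + (1 - w) / 2 * label_term

h = lambda v: v                                  # hypothetical ranking function
satisfied = ranking_objective(h, [(3.0, 1.0)], [], w=0.5, tau=1.0)
violated = ranking_objective(h, [(1.0, 3.0)], [(2.0, 2.5)], w=0.5, tau=1.0)
```

A pair whose scores already respect the margin contributes nothing, while the violated pair and the labeled example together give 2.3125 in this toy case.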
Wu et al. (2008) proposed a highly effective ranking algorithm, LambdaMART, which
integrates the LambdaRank function with boosting. The LambdaRank function aims to
maximize the Normalized Discounted Cumulative Gain (NDCG)
$$N_i := n_i \sum_{j=1}^{T} \frac{2^{r(j)} - 1}{\log(1 + j)},$$
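A short sketch of the NDCG computation is given here for concreteness. It assumes that `relevances[j-1]` is the relevance of the item placed at rank position $j$, that $T$ truncates the list, and that the normalizer $n_i$ is the reciprocal of the sum obtained for the ideal ordering, so that a perfect ranking scores 1. The natural logarithm is used; the base cancels after normalization.

```python
import math

def dcg(relevances, T):
    """Sum over positions j = 1..T of (2^r(j) - 1) / log(1 + j)."""
    return sum((2 ** r - 1) / math.log(1 + j)
               for j, r in enumerate(relevances[:T], start=1))

def ndcg(relevances, T):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), T)
    return dcg(relevances, T) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1, 0], T=4)     # already sorted by relevance
reversed_order = ndcg([0, 1, 2, 3], T=4)
```

The perfectly ordered list scores exactly 1, and any other ordering of the same items scores lower.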
where $r(j)$ represents the rating of the target at rank position $j$, $T$ is the
truncation level, and $n_i$ is a normalization constant. The gamma gradient used in the
optimization is
$$\gamma_{ij} := S_{ij} \left| \Delta\mathrm{NDCG}\, \frac{\partial C_{ij}}{\partial o_{ij}} \right|,$$
where $S_{ij}$ takes the value 1 or $-1$ depending on the relative relevance of the
items, and $\Delta\mathrm{NDCG}$ is the change in NDCG obtained by swapping items $i$ and
$j$. For example, in ranking webpages, the gamma gradient is used to determine the
relevance of information retrieved online. If a piece of information $i$ is more relevant
than another piece $j$, then $S_{ij}$ equals 1; otherwise $S_{ij}$ equals $-1$. Here
$o_{ij} := s_i - s_j$ is the difference between the ranking scores $s_i = F(x_i)$ and
$s_j = F(x_j)$ predicted by the ranking function, and
$C_{ij} := C(o_{ij}) = s_j - s_i + \log(1 + e^{s_i - s_j})$. Moreover, the gamma gradient
of a specific item $i$ is
$$\gamma_i = \sum_{j \in P} \gamma_{ij}.$$
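The gamma gradient can be sketched once the derivative of the pairwise cost is written out. Differentiating $C(o) = -o + \log(1 + e^{o})$ gives $\partial C / \partial o = -1 / (1 + e^{o})$, which is what the code uses; it also assumes the absolute value is taken over the product of $\Delta\mathrm{NDCG}$ and that derivative. The $\Delta\mathrm{NDCG}$ values are passed in as numbers, and the helper names and toy inputs are hypothetical.

```python
import math

def pairwise_gamma(s_i, s_j, S_ij, delta_ndcg):
    """gamma_ij = S_ij * |delta_ndcg * dC/do| with o_ij = s_i - s_j."""
    o_ij = s_i - s_j
    dC_do = -1.0 / (1.0 + math.exp(o_ij))
    return S_ij * abs(delta_ndcg * dC_do)

def item_gamma(s_i, partners):
    """gamma_i: sum of gamma_ij over partners (s_j, S_ij, delta_ndcg)."""
    return sum(pairwise_gamma(s_i, s_j, S_ij, d) for s_j, S_ij, d in partners)

# Item i currently ties with two partners: one less relevant, one more relevant.
g_pair = pairwise_gamma(0.0, 0.0, 1, 0.2)
g_item = item_gamma(0.0, [(0.0, 1, 0.2), (0.0, -1, 0.4)])
```

In this toy case the pair with the larger NDCG swap gain dominates, so the summed gradient for item $i$ is negative.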
6 Conclusion
In this paper, we summarize gradient boosting algorithms from several aspects, including
general function optimization, the objective functions, and different loss functions.
Additionally, we present a set of boosting algorithms with distinct loss functions, and
we derive their predictive models accordingly.
References
Breiman, L. (1996). Bias, Variance, and Arcing Classifiers. Statistics Department, University
of California, Berkeley, CA, USA. Tech. Rep. 460.
Collins, M., Schapire, R. E., & Singer, Y. (2002). Logistic Regression, AdaBoost and
Bregman Distances. Machine Learning 48, 253–285.
Elith, J., Leathwick, J. R., & Hastie, T. (2008). A Working Guide to Boosted Regression
Trees. Journal of Animal Ecology 77, 802–813.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive Logistic Regression: A
Statistical View of Boosting. The Annals of Statistics 28, 337–407.
Koenker, R., & Hallock, K. F. (2001). Quantile Regression. Journal of Economic
Perspectives 15, 143–156.
Li, P. (2012). Robust Logitboost and Adaptive Base Class (abc) Logitboost. arXiv preprint
arXiv:1203.3491.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting Algorithms as Gradient
Descent in Function Space.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (2000). Boosting Algorithms as Gradient
Descent. Advances in Neural Information Processing Systems, 512–518.
Natekin, A., & Knoll, A. (2013). Gradient Boosting Machines, a Tutorial. Frontiers in
Neurorobotics 7, 21.
Sun, P., Reid, M. D., & Zhou, J. (2012). AOSO-LogitBoost: Adaptive One-Vs-One LogitBoost
for Multi-Class Problem. arXiv preprint arXiv:1110.3907.
Wang, R. (2012). AdaBoost for Feature Selection, Classification and Its Relation with SVM,
A Review. Physics Procedia 25, 800–807.
Wu, Q., Burges, C. J., Svore, K. M., & Gao, J. (2008). Ranking, Boosting, and Model
Adaptation.
Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Chen, K., & Sun, G. (2008). A General Boosting
Method and Its Application to Learning Ranking Functions for Web Search. Advances in
Neural Information Processing Systems, 1697–1704.