

mlcourse.ai (https://mlcourse.ai) – Open Machine Learning Course

Author: Alexey Natekin (https://www.linkedin.com/in/natekin/), OpenDataScience founder, Machine Learning Evangelist. Translated and edited by Olga Daykhovskaya (https://www.linkedin.com/in/odaykhovskaya/), Anastasia Manokhina (https://www.linkedin.com/in/anastasiamanokhina/), Yury Kashnitsky (https://yorko.github.io), Egor Polusmak (https://www.linkedin.com/in/egor-polusmak/), and Yuanyuan Pao (https://www.linkedin.com/in/yuanyuanpao/). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose.

You can also check out the latest version of this notebook in the course repository
(https://github.com/Yorko/mlcourse.ai).


Topic 10. Gradient Boosting

Today we are going to have a look at one of the most popular and practical machine learning algorithms:
gradient boosting.


Outline

We recommend going over this article in the order described below, but feel free to jump around between sections.

1. Introduction and history of boosting
   - History of Gradient Boosting Machine
2. GBM algorithm
   - ML problem statement
   - Functional gradient descent
   - Friedman's classic GBM algorithm
   - Step-by-step example of the GBM algorithm
3. Loss functions
   - Regression loss functions
   - Classification loss functions
   - Weights
4. Conclusion
5. Assignment #10
6. Useful resources


1. Introduction and history of boosting

Almost everyone in machine learning has heard about gradient boosting. Many data scientists include this algorithm in their toolbox because of the good results it yields on any given (unknown) problem.

Furthermore, XGBoost is often the standard recipe for winning (https://github.com/dmlc/xgboost/blob/master/demo/README.md#usecases) ML competitions (http://blog.kaggle.com/tag/xgboost/). It is so popular that the idea of stacking XGBoosts has become a meme. Moreover, boosting is an important component in many recommender systems (https://en.wikipedia.org/wiki/Learning_to_rank#Practical_usage_by_search_engines); sometimes, it is even considered a brand (https://yandex.com/company/technologies/matrixnet/). Let's look at the history and development of boosting.


Boosting was born out of the question (http://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf): is it possible to get one strong model from a large number of relatively weak and simple models?

By "weak models", we do not mean simple basic models like decision trees but models with poor accuracy, where poor means only a little better than random guessing.

A positive mathematical answer (http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf) to this question was identified, but it took a few years to develop fully functioning algorithms based on this solution, e.g. AdaBoost. These algorithms take a greedy approach: first, they build a linear combination of simple models (base algorithms) by re-weighting the input data. Then, each new model (usually a decision tree) is built with larger weights on the objects that were previously predicted incorrectly.

Many machine learning courses study AdaBoost, the ancestor of GBM (Gradient Boosting Machine). However, since AdaBoost was later generalized into GBM, it has become apparent that AdaBoost is just a particular variation of GBM.

The algorithm itself has a very clear visual interpretation and intuition for defining weights. Let's have a look at the following toy classification problem, where we are going to use decision trees of depth 1 (also known as 'stumps') on each iteration of AdaBoost. For the first two iterations, we have the following picture:


The size of a point corresponds to its weight, which is increased after an incorrect prediction. On each iteration, we can see that these weights grow because the stumps cannot cope with this problem on their own. However, if we take a weighted vote of the stumps, we get the correct classification:

Pseudocode:

- Initialize sample weights $w_i^{(0)} = \frac{1}{l},\ i = 1, \dots, l$.
- For all $t = 1, \dots, T$:
  - Train the base algorithm $b_t$, let $\epsilon_t$ be its training error.
  - $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
  - Update sample weights: $w_i^{(t)} = w_i^{(t-1)} e^{-\alpha_t y_i b_t(x_i)},\ i = 1, \dots, l$.
  - Normalize sample weights: $w_0^{(t)} = \sum_{j=1}^{l} w_j^{(t)}, \quad w_i^{(t)} = \frac{w_i^{(t)}}{w_0^{(t)}},\ i = 1, \dots, l$.
- Return $\sum_{t=1}^{T} \alpha_t b_t$
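
To make the procedure concrete, here is a minimal sketch that follows this pseudocode with depth-1 scikit-learn trees as the base algorithms. The toy two-class data and the helper names (`fit_adaboost`, `adaboost_predict`) are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
from sklearn.datasets import make_moons          # assumed stand-in for the toy data
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=17)
y = 2 * y - 1                                    # labels in {-1, +1}

def fit_adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # w_i^(0) = 1/l
    stumps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))            # weighted training error of b_t
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
        w = w * np.exp(-alpha * y * pred)        # up-weight the misclassified objects
        w = w / w.sum()                          # normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # weighted vote of the stumps
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))

stumps, alphas = fit_adaboost(X, y)
```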

Here (https://www.youtube.com/watch?v=k4G2VCuOMMg) is a more detailed example of AdaBoost where, as we iterate, we can see the weights increase, especially on the border between classes.

AdaBoost works well, but the lack (https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/boosting-survey.pdf) of explanation for why the algorithm is successful sowed the seeds of doubt. Some considered it a super-algorithm, a silver bullet, but others were skeptical and believed AdaBoost was just overfitting.

The overfitting problem did indeed exist, especially when the data had strong outliers. Therefore, in those types of problems, AdaBoost was unstable. Fortunately, a few professors in the statistics department at Stanford, who had created Lasso, Elastic Net, and Random Forest, started researching the algorithm. In 1999, Jerome Friedman came up with a generalization of boosting algorithms: Gradient Boosting (Machine), also known as GBM. With this work, Friedman set up the statistical foundation for many algorithms, providing the general approach of boosting for optimization in the functional space.

CART, bootstrap, and many other algorithms have originated from Stanford's statistics department. In
doing so, the department has solidified their names in future textbooks. These algorithms are very
practical, and some recent works have yet to be widely adopted. For example, check out glinternet
(https://arxiv.org/abs/1308.2719).

Not many video recordings of Friedman are available. However, there is a very interesting interview (https://www.youtube.com/watch?v=8hupHmBVvb0) with him about the creation of CART and how they solved statistics problems (which is similar to data analysis and data science today) more than 40 years ago.

There is also a great lecture (https://www.youtube.com/watch?v=zBk3PK3g-Fc) from Hastie, a retrospective on data analysis from one of the creators of the methods that we use every day.

In general, there has been a transition from engineering and algorithmic research to a full-fledged approach to building and studying algorithms. From a mathematical perspective, this is not a big change: we are still adding (or boosting) weak algorithms and enlarging our ensemble with gradual improvements for the parts of the data where the model was inaccurate. But, this time, the next simple model is not just built on re-weighted objects; instead, it improves its approximation of the gradient of the overall objective function. This concept greatly opens up our algorithms for imagination and extension.


History of GBM

It took more than 10 years after the introduction of GBM for it to become an essential part of the data
science toolbox.
GBM was extended to apply to different statistics problems: GLMboost and GAMboost for strengthening
already existing GAM models, CoxBoost for survival curves, and RankBoost and LambdaMART for
ranking.
Many realizations of GBM also appeared under different names and on different platforms: Stochastic
GBM, GBDT (Gradient Boosted Decision Trees), GBRT (Gradient Boosted Regression Trees), MART
(Multiple Additive Regression Trees), and more. In addition, the ML community was very segmented and
dissociated, which made it hard to track just how widespread boosting had become.

At the same time, boosting had been actively used in search ranking. This problem was rewritten in
terms of a loss function that penalizes errors in the output order, so it became convenient to simply
insert it into GBM. AltaVista was one of the first companies that introduced boosting to ranking. Soon, the ideas spread to Yahoo, Yandex, Bing, and others. Once this happened, boosting became one of the main algorithms used not only in research but also in core industry technologies.

ML competitions, especially Kaggle, played a major role in boosting's popularization. Now, researchers had a common platform where they could compete on different data science problems with a large number of participants from around the world. With Kaggle, one could test new algorithms on real data, giving algorithms the opportunity to "shine", and share full information about model performance across competition data sets. This is exactly what happened to boosting when it was used at Kaggle (http://blog.kaggle.com/2011/12/21/score-xavier-conort-on-coming-second-in-give-me-some-credit/) (check interviews with Kaggle winners starting from 2011 who mostly used boosting). The XGBoost (https://github.com/dmlc/xgboost) library quickly gained popularity after its appearance. XGBoost is not a new, unique algorithm; it is just an extremely effective implementation of classic GBM with additional heuristics.

This algorithm has gone through the typical path for ML algorithms today: from a mathematical problem and algorithmic craftsmanship to successful practical applications and mass adoption years after its first appearance.


2. GBM algorithm

ML problem statement

We are going to solve the problem of function approximation in a general supervised learning setting. We have a set of features $x$ and target variables $y$, $\{(x_i, y_i)\}_{i=1,\ldots,n}$, which we use to restore the dependence $y = f(x)$. We restore the dependence by approximating $\hat{f}(x)$ and by understanding which approximation is better when we use the loss function $L(y, f)$, which we want to minimize:

$$ y \approx \hat{f}(x), \quad \hat{f}(x) = \underset{f(x)}{\arg\min} \ L(y, f(x)) $$

At this moment, we do not make any assumptions regarding the type of dependence $f(x)$, the model of our approximation $\hat{f}(x)$, or the distribution of the target variable $y$. We only expect that the function $L(y, f)$ is differentiable. Our formula is very general; let's define it for a particular data set by taking the expectation over the data. Our expression for minimizing the loss on the data is the following:

$$ \hat{f}(x) = \underset{f(x)}{\arg\min} \ \mathbb{E}_{x,y}[L(y, f(x))] $$

Unfortunately, the number of functions $f(x)$ is not just large, but its functional space is infinite-dimensional. That is why it is acceptable for us to limit the search space to some family of functions $f(x, \theta),\ \theta \in \mathbb{R}^d$. This simplifies the objective a lot because now we have a solvable optimization over parameter values:

$$ \hat{f}(x) = f(x, \hat{\theta}), \quad \hat{\theta} = \underset{\theta}{\arg\min} \ \mathbb{E}_{x,y}[L(y, f(x, \theta))] $$

Simple analytical solutions for finding the optimal parameters $\hat{\theta}$ often do not exist, so the parameters are usually approximated iteratively. To start, we write down the empirical loss function $L_{\theta}(\hat{\theta})$ that will allow us to evaluate our parameters using our data. Additionally, let's write out our approximation $\hat{\theta}$ for a number of $M$ iterations as a sum:

$$ \hat{\theta} = \sum_{i=1}^{M} \hat{\theta}_i, \quad L_{\theta}(\hat{\theta}) = \sum_{i=1}^{N} L(y_i, f(x_i, \hat{\theta})) $$

Then, the only thing left is to find a suitable, iterative algorithm to minimize $L_{\theta}(\hat{\theta})$. Gradient descent is the simplest and most frequently used option. We define the gradient as $\nabla L_{\theta}(\hat{\theta})$ and add our iterative evaluations $\hat{\theta}_i$ to it (since we are minimizing the loss, we add the minus sign). Our last step is to initialize our first approximation $\hat{\theta}_0$ and choose the number of iterations $M$. Let's review the steps for this inefficient and naive algorithm for approximating $\hat{\theta}$:

1. Define the initial approximation of the parameters $\hat{\theta} = \hat{\theta}_0$
2. For every iteration $t = 1, \dots, M$, repeat steps 3-7:
3. Calculate the gradient of the loss function $\nabla L_{\theta}(\hat{\theta})$ for the current approximation $\hat{\theta}$: $\nabla L_{\theta}(\hat{\theta}) = \left[ \frac{\partial L(y, f(x, \theta))}{\partial \theta} \right]_{\theta = \hat{\theta}}$
4. Set the current iterative approximation $\hat{\theta}_t$ based on the calculated gradient: $\hat{\theta}_t \leftarrow -\nabla L_{\theta}(\hat{\theta})$
5. Update the approximation of the parameters: $\hat{\theta} \leftarrow \hat{\theta} + \hat{\theta}_t = \sum_{i=0}^{t} \hat{\theta}_i$
6. Save the result of the approximation $\hat{\theta} = \sum_{i=0}^{M} \hat{\theta}_i$
7. Use the function that was found: $\hat{f}(x) = f(x, \hat{\theta})$
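
As a tiny illustration of these steps, here is a sketch of this naive parameter-space gradient descent for a linear model under squared loss. The toy data, the learning rate, and the use of the mean (rather than the sum) of the per-object losses are assumptions made purely to keep the example short and numerically stable.

```python
import numpy as np

# Assumed toy data for a linear model f(x, theta) = theta[0] + theta[1] * x.
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=300)
y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=300)

def grad_L(theta, X, y):
    # Gradient of the mean squared loss w.r.t. theta; using the mean instead of the
    # sum from the text only rescales the gradient and keeps the step size stable.
    resid = y - (theta[0] + theta[1] * X)
    return np.array([-2 * resid.mean(), -2 * (resid * X).mean()])

theta = np.zeros(2)                      # step 1: initial approximation theta_0
lr, M = 0.05, 500                        # learning rate and number of iterations
for t in range(M):                       # step 2: iterate
    theta_t = -lr * grad_L(theta, X, y)  # steps 3-4: move against the gradient
    theta = theta + theta_t              # step 5: accumulate the increments
print(theta)                             # steps 6-7: theta_hat is roughly [1, 2]
```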


Functional gradient descent

Let's imagine for a second that we can perform optimization in the function space and iteratively search for the approximations $\hat{f}(x)$ as functions themselves. We will express our approximation as a sum of incremental improvements, each being a function. For convenience, we will immediately start with the sum from the initial approximation $\hat{f}_0(x)$:

$$ \hat{f}(x) = \sum_{i=0}^{M} \hat{f}_i(x) $$

Nothing has happened yet; we have only decided that we will search for our approximation $\hat{f}(x)$ not as a big model with plenty of parameters (as an example, a neural network), but as a sum of functions, pretending we move in functional space.

In order to accomplish this task, we need to limit our search to some function family $\hat{f}(x) = h(x, \theta)$. There are a few issues here: first of all, the sum of models can be more complicated than any single model from this family; secondly, the general objective is still in functional space. Let's note that, on every step, we will need to select an optimal coefficient $\rho \in \mathbb{R}$. For step $t$, the problem is the following:

$$ \hat{f}(x) = \sum_{i=0}^{t-1} \hat{f}_i(x), $$
$$ (\rho_t, \theta_t) = \underset{\rho,\theta}{\arg\min} \ \mathbb{E}_{x,y}[L(y, \hat{f}(x) + \rho \cdot h(x, \theta))], $$
$$ \hat{f}_t(x) = \rho_t \cdot h(x, \theta_t) $$
Here is where the magic happens. We have defined all of our objectives in general terms, as if we could train any kind of model $h(x, \theta)$ for any type of loss function $L(y, f(x, \theta))$. In practice, this is extremely difficult, but, fortunately, there is a simple way to solve this task.

Knowing the expression of the loss function's gradient, we can calculate its value on our data. So, let's train the models such that our predictions will be more correlated with this gradient (with a minus sign). In other words, we will use least squares to correct the predictions with these residuals. For classification, regression, and ranking tasks, we will minimize the squared difference between the pseudo-residuals $r$ and our predictions. For step $t$, the final problem looks like the following:

$$ \hat{f}(x) = \sum_{i=0}^{t-1} \hat{f}_i(x), $$
$$ r_{it} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}(x)}, \quad \text{for } i = 1, \dots, n, $$
$$ \theta_t = \underset{\theta}{\arg\min} \ \sum_{i=1}^{n} (r_{it} - h(x_i, \theta))^2, $$
$$ \rho_t = \underset{\rho}{\arg\min} \ \sum_{i=1}^{n} L(y_i, \hat{f}(x_i) + \rho \cdot h(x_i, \theta_t)) $$
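
A minimal sketch of a single functional gradient descent step under squared loss might look as follows; the toy data, the tree depth, and the variable names are assumptions, and `scipy.optimize.minimize_scalar` stands in for the one-dimensional line search over $\rho$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data and current ensemble prediction F; squared loss L(y, f) = (y - f)^2.
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=(300, 1))
y = np.cos(X[:, 0]) + rng.normal(scale=0.45, size=300)
F = np.full(300, y.mean())                  # current approximation f_hat(x), here just f_0

# Pseudo-residuals: minus the gradient of the loss w.r.t. the current predictions.
# For squared loss this is 2 * (y - F); the constant factor is absorbed by rho_t below.
r = y - F

# theta_t: least-squares fit of the base learner to the pseudo-residuals.
h = DecisionTreeRegressor(max_depth=2).fit(X, r)

# rho_t: one-dimensional line search over the original loss.
rho = minimize_scalar(lambda p: np.sum((y - (F + p * h.predict(X))) ** 2)).x
F_new = F + rho * h.predict(X)              # f_hat(x) <- f_hat(x) + rho_t * h(x, theta_t)
```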


Friedman's classic GBM algorithm

We can now define the classic GBM algorithm suggested by Jerome Friedman in 1999. It is a supervised algorithm that has the following components:

- a dataset $\{(x_i, y_i)\}_{i=1,\ldots,n}$;
- a number of iterations $M$;
- a choice of loss function $L(y, f)$ with a defined gradient;
- a choice of function family of base algorithms $h(x, \theta)$ with their training procedure;
- additional hyperparameters of $h(x, \theta)$ (for example, the tree depth for decision trees).

The only thing left is the initial approximation $f_0(x)$. For simplicity, a constant value $\gamma$ is used for the initial approximation. The constant value, as well as the optimal coefficient $\rho$, are identified via binary search or another line search algorithm over the initial loss function (not a gradient). So, we have our GBM algorithm described as follows:

1. Initialize GBM with the constant value $\hat{f}(x) = \hat{f}_0$, $\hat{f}_0 = \gamma$, $\gamma \in \mathbb{R}$: $\hat{f}_0 = \underset{\gamma}{\arg\min} \ \sum_{i=1}^{n} L(y_i, \gamma)$
2. For each iteration $t = 1, \dots, M$, repeat:
   - Calculate pseudo-residuals $r_t$: $r_{it} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}(x)}, \quad \text{for } i = 1, \dots, n$
   - Build a new base algorithm $h_t(x)$ as regression on the pseudo-residuals $\{(x_i, r_{it})\}_{i=1,\ldots,n}$
   - Find the optimal coefficient $\rho_t$ at $h_t(x)$ with respect to the initial loss function: $\rho_t = \underset{\rho}{\arg\min} \ \sum_{i=1}^{n} L(y_i, \hat{f}(x_i) + \rho \cdot h(x_i, \theta))$
   - Save $\hat{f}_t(x) = \rho_t \cdot h_t(x)$
   - Update the current approximation: $\hat{f}(x) \leftarrow \hat{f}(x) + \hat{f}_t(x) = \sum_{i=0}^{t} \hat{f}_i(x)$
3. Compose the final GBM model: $\hat{f}(x) = \sum_{i=0}^{M} \hat{f}_i(x)$
4. Conquer Kaggle and the rest of the world
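
Below is a bare-bones sketch of this algorithm with scikit-learn regression trees as the base algorithms. The `SimpleGBM` class, its `loss`/`grad` callables, and the toy data are assumptions for illustration, not a reference implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor


class SimpleGBM:
    """Sketch of the algorithm above; `loss(y, f)` returns per-object losses and
    `grad(y, f)` its derivative w.r.t. f (both callables are assumed, not a library API)."""

    def __init__(self, loss, grad, n_iter=3, max_depth=2):
        self.loss, self.grad = loss, grad
        self.n_iter, self.max_depth = n_iter, max_depth

    def fit(self, X, y):
        # Step 1: initialize with the constant gamma that minimizes the loss.
        self.f0 = minimize_scalar(lambda g: self.loss(y, np.full(len(y), g)).sum()).x
        F = np.full(len(y), self.f0)
        self.trees, self.rhos = [], []
        for _ in range(self.n_iter):
            r = -self.grad(y, F)                              # pseudo-residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, r)
            h = tree.predict(X)
            rho = minimize_scalar(                            # line search for rho_t
                lambda p: self.loss(y, F + p * h).sum()).x
            F = F + rho * h                                   # update the approximation
            self.trees.append(tree)
            self.rhos.append(rho)
        return self

    def predict(self, X):
        F = np.full(X.shape[0], self.f0)
        for rho, tree in zip(self.rhos, self.trees):
            F = F + rho * tree.predict(X)
        return F


# Usage with squared loss on noisy cosine data (similar to the toy example below):
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=(300, 1))
y = np.cos(X[:, 0]) + rng.normal(scale=0.45, size=300)
gbm = SimpleGBM(loss=lambda y, f: (y - f) ** 2,
                grad=lambda y, f: -2 * (y - f),
                n_iter=3).fit(X, y)
```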

Step-By-Step example: How GBM Works

Let's see an example of how GBM works. In this toy example, we will restore a noisy function $y = \cos(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \frac{1}{5})$, $x \in [-5, 5]$.

This is a regression problem with a real-valued target, so we will choose to use the mean squared error
loss function. We will generate 300 pairs of observations and approximate them with decision trees of
depth 2. Let's put together everything we need to use GBM:

- Toy data $\{(x_i, y_i)\}_{i=1,\ldots,300}$ ✓
- Number of iterations $M = 3$ ✓
- The mean squared error loss function $L(y, f) = (y - f)^2$ ✓
- The gradient of the $L_2$ loss is just the residuals $r = (y - f)$ ✓
- Decision trees as base algorithms $h(x)$ ✓
- Hyperparameters of the decision trees: tree depth is equal to 2 ✓

For the mean squared error, both the initialization $\gamma$ and the coefficients $\rho_t$ are simple. We will initialize GBM with the average value $\gamma = \frac{1}{n} \sum_{i=1}^{n} y_i$, and set all coefficients $\rho_t$ to 1.

We will run GBM and draw two types of graphs: the current approximation $\hat{f}(x)$ (blue graph) and every tree $\hat{f}_t(x)$ built on its pseudo-residuals (green graph). The graph's number corresponds to the iteration number:
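
A rough way to reproduce this setup with scikit-learn's `GradientBoostingRegressor` is sketched below; the notebook's own plotting code is not shown, the random seed and noise scale are assumptions, and `learning_rate=1.0` mimics setting all $\rho_t$ to 1.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# 300 noisy cosine observations, 3 depth-2 trees, squared-error loss (the default).
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=(300, 1))
y = np.cos(X[:, 0]) + rng.normal(scale=np.sqrt(0.2), size=300)   # eps ~ N(0, 1/5)

gbm = GradientBoostingRegressor(n_estimators=3, max_depth=2, learning_rate=1.0)
gbm.fit(X, y)

# staged_predict yields f_hat(x) after each iteration -- the blue curves on the graphs.
x_grid = np.linspace(-5, 5, 500).reshape(-1, 1)
stage_predictions = list(gbm.staged_predict(x_grid))
```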


By the second iteration, our trees have recovered the basic form of the function. However, at the first iteration, we see that the algorithm built only the "left branch" of the function ($x \in [-5, -4]$). This was due to the fact that our trees simply did not have enough depth to build a symmetrical branch at once, so they focused on the left branch with the larger error. Therefore, the right branch appeared only after the second iteration.

The rest of the process goes as expected -- on every step, our pseudo-residuals decreased, and GBM
approximated the original function better and better with each iteration. However, by construction, trees
cannot approximate a continuous function, which means that GBM is not ideal in this example. To play
with GBM function approximations, you can use the awesome interactive demo in this blog called
Brilliantly wrong (http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html):


3. Loss functions

If we want to solve a classification problem instead of regression, what would change? We only need to choose a suitable loss function $L(y, f)$. This is the most important, high-level decision that determines exactly how we will optimize and what characteristics we can expect in the final model.

As a rule, we do not need to invent this ourselves – researchers have already done it for us. Today, we
will explore loss functions for the two most common objectives: regression y ∈ ℝ and binary
classification y ∈ {−1, 1}.

Regression loss functions

Let's start with a regression problem for $y \in \mathbb{R}$. In order to choose the appropriate loss function, we need to consider which of the properties of the conditional distribution $(y|x)$ we want to restore. The most common options are:

- $L(y, f) = (y - f)^2$, a.k.a. $L_2$ loss or Gaussian loss. It is the classical conditional mean, which is the simplest and most common case. If we do not have any additional information or requirements for a model to be robust, we can use the Gaussian loss.
- $L(y, f) = |y - f|$, a.k.a. $L_1$ loss or Laplacian loss. At first glance, this function does not seem to be differentiable, but it actually defines the conditional median. The median, as we know, is robust to outliers, which is why this loss function is better in some cases. The penalty for big variations is not as heavy as it is in $L_2$.
- $L(y, f) = \begin{cases} \alpha \cdot |y - f|, & \text{if } y - f > 0 \\ (1 - \alpha) \cdot |y - f|, & \text{if } y - f \leq 0 \end{cases}, \ \alpha \in (0, 1)$, a.k.a. $L_q$ loss or quantile loss. Instead of the median, it uses quantiles. For example, $\alpha = 0.75$ corresponds to the 75%-quantile. We can see that this function is asymmetric and penalizes the observations which are on the right side of the defined quantile.


Let's use the loss function $L_q$ on our data. The goal is to restore the conditional 75%-quantile of the cosine. Let us put everything together for GBM:

- Toy data $\{(x_i, y_i)\}_{i=1,\ldots,300}$ ✓
- Number of iterations $M = 3$ ✓
- Loss function for quantiles: $L_{0.75}(y, f) = \begin{cases} 0.75 \cdot |y - f|, & \text{if } y - f > 0 \\ 0.25 \cdot |y - f|, & \text{if } y - f \leq 0 \end{cases}$ ✓
- Gradient of $L_{0.75}(y, f)$: a step function weighted by $\alpha = 0.75$. We are going to train a tree-based model on these pseudo-residuals: $r_i = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}(x)} = \alpha I(y_i > \hat{f}(x_i)) - (1 - \alpha) I(y_i \leq \hat{f}(x_i)), \ \text{for } i = 1, \dots, 300$ ✓
- Decision trees as base algorithms $h(x)$ ✓
- Hyperparameter of the trees: depth = 2 ✓

For our initial approximation, we will take the needed quantile of $y$. However, we do not know anything about the optimal coefficients $\rho_t$, so we'll use standard line search. The results are the following:
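
For reference, here is a sketch of the same experiment using scikit-learn's built-in quantile loss. Note that, unlike the description above, scikit-learn uses a fixed learning rate rather than a line search for $\rho_t$, and the data generation details are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Conditional 75%-quantile of the noisy cosine with the built-in quantile loss.
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=(300, 1))
y = np.cos(X[:, 0]) + rng.normal(scale=np.sqrt(0.2), size=300)

q_gbm = GradientBoostingRegressor(loss='quantile', alpha=0.75,
                                  n_estimators=3, max_depth=2, learning_rate=1.0)
q_gbm.fit(X, y)

def quantile_pseudo_residuals(y, f, alpha=0.75):
    # The step-function gradient from the list above: alpha above the fit, -(1 - alpha) below.
    return np.where(y > f, alpha, -(1 - alpha))
```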


We can observe that, on each iteration, $r_i$ takes only two possible values, but GBM is still able to restore our initial function.

The overall results of GBM with the quantile loss function are the same as the results with the quadratic loss function offset by $\approx 0.135$. But if we were to use the 90%-quantile, we would not have enough data due to the fact that the classes would become unbalanced. We need to remember this when we deal with non-standard problems.

For regression tasks, many loss functions have been developed, some of them with extra properties. For example, they can be robust like the Huber loss (https://en.wikipedia.org/wiki/Huber_loss). For a small number of outliers, the loss function works as $L_2$, but, after a defined threshold, the function changes to $L_1$. This allows for decreasing the effect of outliers and focusing on the overall picture. We can illustrate this with the following example. Data is generated from the function $y = \frac{\sin(x)}{x}$ with added noise, a mixture of normal and Bernoulli distributions. We show the functions on graphs A-D and the relevant GBM on F-H (graph E represents the initial function):

Original size (https://habrastorage.org/web/130/05b/222/13005b222e8a4eb68c3936216c05e276.jpg).

In this example, we used splines as the base algorithm. See, it does not always have to be trees for boosting? We can clearly see the difference between the $L_2$, $L_1$, and Huber losses. If we choose optimal parameters for the Huber loss, we can get the best possible approximation among all of our options. The difference can be seen as well in the 10%, 50%, and 90%-quantiles. Unfortunately, the Huber loss function is supported only by very few popular libraries/packages; h2o supports it, but XGBoost does not. The same applies to other, more exotic things like conditional expectiles (https://www.slideshare.net/charthur/quantile-and-expectile-regression), but it may still be interesting knowledge.
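
Although XGBoost lacks it, the Huber loss is available in scikit-learn's `GradientBoostingRegressor`; the sketch below fits it to sinc-like data with injected outliers (the exact noise mixture from the figure is not reproduced, so the data here is an assumption).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# sin(x)/x with Gaussian noise plus a small fraction of large outliers (assumed mixture).
rng = np.random.RandomState(17)
x = np.linspace(-10, 10, 400)
y = (np.sinc(x / np.pi)                                  # np.sinc(x/pi) == sin(x)/x
     + rng.normal(scale=0.1, size=400)
     + rng.binomial(1, 0.1, size=400) * rng.normal(scale=2.0, size=400))

# loss='huber' behaves quadratically for small errors and linearly past a threshold;
# `alpha` sets the quantile at which the switch happens.
huber_gbm = GradientBoostingRegressor(loss='huber', alpha=0.9,
                                      n_estimators=100, max_depth=2)
huber_gbm.fit(x.reshape(-1, 1), y)
```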

Classification loss functions


Now, let's look at the binary classification problem, $y \in \{-1, 1\}$. We saw that GBM can even optimize non-differentiable loss functions. Technically, it is possible to solve this problem with a regression $L_2$ loss, but it would not be correct.

The distribution of the target variable requires us to use the log-likelihood, so we need to have different loss functions for targets multiplied by their predictions: $y \cdot f$. The most common choices are the following:

- $L(y, f) = \log(1 + \exp(-2yf))$, a.k.a. logistic loss or Bernoulli loss. This has an interesting property: it penalizes even correctly predicted classes, which helps not only to optimize the loss but also to move the classes apart further, even if all classes are predicted correctly.
- $L(y, f) = \exp(-yf)$, a.k.a. AdaBoost loss. The classic AdaBoost is equivalent to GBM with this loss function. Conceptually, this function is very similar to the logistic loss, but it has a bigger exponential penalization if the prediction is wrong.

Let's generate some new toy data for our classification problem. As a basis, we will take our noisy cosine, and we will use the sign function for the classes of the target variable. Our toy data looks like the following (jitter-noise is added for clarity):


We will use the logistic loss to look at what we actually boost. So, again, we put together what we will use for GBM:

- Toy data $\{(x_i, y_i)\}_{i=1,\ldots,300}$, $y_i \in \{-1, 1\}$ ✓
- Number of iterations $M = 3$ ✓
- Logistic loss as the loss function; its gradient is computed the following way: $r_i = \frac{2 \cdot y_i}{1 + \exp(2 \cdot y_i \cdot \hat{f}(x_i))}, \ \text{for } i = 1, \dots, 300$ ✓
- Decision trees as base algorithms $h(x)$ ✓
- Hyperparameters of the decision trees: tree depth is equal to 2 ✓

This time, the initialization of the algorithm is a little bit harder. First, our classes are imbalanced (63% versus 37%). Second, there is no known analytical formula for the initialization of our loss function, so we have to find $\hat{f}_0 = \gamma$ via search:
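
A quick sketch of this search under logistic loss is shown below; the toy labels are regenerated here for illustration, so the optimum only approximately matches the value quoted next.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Regenerated toy labels: sign of the noisy cosine, in {-1, +1} (an assumed reproduction).
rng = np.random.RandomState(17)
x = rng.uniform(-5, 5, size=300)
y = np.sign(np.cos(x) + rng.normal(scale=np.sqrt(0.2), size=300))

def logistic_loss(y, f):
    return np.log(1 + np.exp(-2 * y * f))

# Numeric search for the constant initial approximation gamma.
gamma = minimize_scalar(lambda g: logistic_loss(y, g).sum()).x
print(gamma)   # negative, since predicting the majority (negative) class is more profitable
```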


Our optimal initial approximation is around -0.273. You could have guessed that it would be negative because it is more profitable to predict everything as the most popular class, but there is no formula for the exact value. Now let's finally start GBM and look at what actually happens under the hood:


The algorithm successfully restored the separation between our classes. You can see how the "lower"
areas are separating because the trees are more confident in the correct prediction of the negative
class and how the two steps of mixed classes are forming. It is clear that we have a lot of correctly
classified observations and some amount of observations with large errors that appeared due to the
noise in the data.

Weights
Sometimes, there is a situation where we want a more specific loss function for our problem. For
example, in financial time series, we may want to give bigger weight to large movements in the time
series; for churn prediction, it is more useful to predict the churn of clients with high LTV (or lifetime
value: how much money a client will bring in the future).

The statistical warrior would invent their own loss function, write out the gradient for it (for more
effective training, include the Hessian), and carefully check whether this function satisfies the required
properties. However, there is a high probability of making a mistake somewhere, running up against
computational difficulties, and spending an inordinate amount of time on research.


In lieu of this, a very simple instrument was invented (which is rarely remembered in practice): weighing
observations and assigning weight functions. The simplest example of such weighting is the setting of
weights for class balance. In general, if we know that some subset of data, both in the input variables x
and in the target variable y, has greater importance for our model, then we just assign them a larger
weight w(x, y). The main goal is to fulfill the general requirements for weights:
$$ w_i \in \mathbb{R}, \quad w_i \geq 0 \quad \text{for } i = 1, \dots, n, \quad \sum_{i=1}^{n} w_i > 0 $$

Weights can significantly reduce the time spent adjusting the loss function for the task we are solving and also encourage experiments with the target models' properties. Assigning these weights is entirely a matter of creativity. We simply add scalar weights:

$$ L_w(y, f) = w \cdot L(y, f), $$
$$ r_{it} = -w_i \cdot \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}(x)}, \quad \text{for } i = 1, \dots, n $$

It is clear that, for arbitrary weights, we do not know the statistical properties of our model. Often, linking the weights to the values of $y$ can be too complicated. For example, the usage of weights proportional to $|y|$ in the $L_1$ loss function is not equivalent to the $L_2$ loss because the gradient will not take into account the values of the predictions themselves: $\hat{f}(x)$.

We mention all of this so that we can understand our possibilities better. Let's create some very exotic weights for our toy data. We will define a strongly asymmetric weight function as follows:

$$ w(x) = \begin{cases} 0.1, & \text{if } x \leq 0 \\ 0.1 + |\cos(x)|, & \text{if } x > 0 \end{cases} $$

With these weights, we expect to get two properties: less detail for negative values of $x$ and a shape of the function similar to the initial cosine. We take the other GBM settings from our previous classification example, including the line search for optimal coefficients. Let's look at what we've got:
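
A sketch of this weighted experiment with scikit-learn might look as follows; passing the weight function through `sample_weight` is an assumed stand-in for the notebook's own weighted-GBM code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Regenerated toy classification data: sign of the noisy cosine (assumed reproduction).
rng = np.random.RandomState(17)
x = rng.uniform(-5, 5, size=300)
y = np.sign(np.cos(x) + rng.normal(scale=np.sqrt(0.2), size=300))

def w(x):
    # The asymmetric weight function from above: 0.1 for x <= 0, 0.1 + |cos(x)| for x > 0.
    return np.where(x > 0, 0.1 + np.abs(np.cos(x)), 0.1)

weighted_gbm = GradientBoostingClassifier(n_estimators=3, max_depth=2)
weighted_gbm.fit(x.reshape(-1, 1), y, sample_weight=w(x))
```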


We achieved the result that we expected. First, we can see how strongly the pseudo-residuals differ; on
the initial iteration, they look almost like the original cosine. Second, the left part of the function's graph
was often ignored in favor of the right one, which had larger weights. Third, the function that we got on
the third iteration received enough attention and started looking similar to the original cosine (also
started to slightly overfit).


Weights are a powerful but risky tool that we can use to control the properties of our model. If you want to optimize your loss function, it is often worth first trying to solve a simpler, standard problem while adding weights to the observations at your discretion.

4. Conclusion

Today, we learned the theory behind gradient boosting. GBM is not just some specific algorithm but a
common methodology for building ensembles of models. In addition, this methodology is sufficiently
flexible and expandable -- it is possible to train a large number of models, taking into consideration
different loss-functions with a variety of weighting functions.

Practice and ML competitions show that, in standard problems (except for image, audio, and very sparse
data), GBM is often the most effective algorithm (not to mention stacking and high-level ensembles,
where GBM is almost always a part of them). Also, there are many adaptations of GBM for
Reinforcement Learning (https://arxiv.org/abs/1603.04119) (Minecraft, ICML 2016). By the way, the
Viola-Jones algorithm, which is still used in computer vision, is based on AdaBoost
(https://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework#Learning_algorithm).

In this article, we intentionally omitted questions concerning GBM's regularization, stochasticity, and hyperparameters. It was not accidental that we used a small number of iterations $M = 3$ throughout. If we used 30 trees instead of 3 and trained the GBM as described, the result would not be that predictable:


http://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html


5. Assignment #10

Your task (https://www.kaggle.com/kashnitsky/assignment-10-gradient-boosting-and-flight-delays) is to beat at least 2 benchmarks in this Kaggle Inclass competition (https://www.kaggle.com/c/flight-delays-spring-2018). Here you won't be provided with detailed instructions. We only give you a brief description of how the second benchmark was achieved using XGBoost.

6. Useful links

- Original article (https://statweb.stanford.edu/~jhf/ftp/trebst.pdf) about GBM from Jerome Friedman
- "Gradient boosting machines, a tutorial" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/), paper by Alexey Natekin and Alois Knoll
- Chapter in Elements of Statistical Learning (http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf) from Hastie, Tibshirani, Friedman (page 337)
- Wiki (https://en.wikipedia.org/wiki/Gradient_boosting) article about Gradient Boosting
- Frontiers tutorial (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/) article about GBM
- Video lecture by Hastie (https://www.youtube.com/watch?v=wPqtzj5VZus) about GBM at an h2o.ai conference
- CatBoost vs. Light GBM vs. XGBoost (https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db) on "Towards Data Science"
- Benchmarking and Optimization of Gradient Boosting Decision Tree Algorithms (https://arxiv.org/abs/1809.04559) and XGBoost: Scalable GPU Accelerated Learning (https://arxiv.org/abs/1806.11248) - benchmarking CatBoost, Light GBM, and XGBoost (no 100% winner)

Support course creators

You can make a monthly (Patreon) or one-time (Ko-Fi) donation ↓

(https://www.patreon.com/ods_mlcourse)

(https://ko-fi.com/mlcourse_ai)
