Gradient Boosting Tutorial
You can also check out the latest version of this notebook in the course repository
(https://github.com/Yorko/mlcourse.ai).
Today we are going to have a look at one of the most popular and practical machine learning algorithms:
gradient boosting.
Outline
We recommend going over this article in the order described below, but feel free to jump around
between sections.
Almost everyone in machine learning has heard about gradient boosting. Many data scientists include this algorithm in their toolbox because of the good results it yields on almost any problem it is given.
Many machine learning courses study AdaBoost, the ancestor of GBM (Gradient Boosting Machine). However, after GBM emerged, it became apparent that AdaBoost is just a particular variation of GBM.
The algorithm itself has a very clear visual interpretation and intuition for defining weights. Let's have a look at the following toy classification problem, where we split the data with trees of depth 1 (also known as 'stumps') on each iteration of AdaBoost. For the first two iterations, we have the following picture:
The size of a point corresponds to its weight, which is increased after an incorrect prediction. On each iteration, we can see that these weights keep growing -- the stumps cannot cope with this problem on their own. However, if we take a weighted vote over the stumps, we get the correct classification:
Pseudocode:

- Initialize the sample weights: $w_i^{(0)} = \frac{1}{n}$, $i = 1, \ldots, n$
- For each $t = 1, \ldots, T$:
  - train the base classifier $b_t$ on the weighted sample and compute its weighted error $\epsilon_t$;
  - set the classifier's weight $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$;
  - increase the weights of the misclassified objects, $w_i^{(t)} = w_i^{(t-1)} e^{-\alpha_t y_i b_t(x_i)}$, and renormalize them
- Return $\sum_{t=1}^{T} \alpha_t b_t$
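To make the weighted-vote idea concrete, here is a minimal sketch using scikit-learn's AdaBoostClassifier over depth-1 trees; the two-class dataset below is an assumed stand-in for the toy problem in the figure, not the notebook's original code:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed toy two-class problem (a stand-in for the dataset in the figure)
X, y = make_moons(n_samples=300, noise=0.2, random_state=17)

# AdaBoost over stumps: each base learner is a depth-1 tree, and the final
# prediction is the weighted vote sum_t alpha_t * b_t(x)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # `base_estimator` in older scikit-learn versions
    n_estimators=50,
    random_state=17,
)
ada.fit(X, y)
print("Training accuracy:", ada.score(X, y))
```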
The overfitting problem did indeed exist, especially when the data contained strong outliers; in those kinds of problems, AdaBoost was unstable. Fortunately, a few professors in the statistics department at Stanford, the same group that had created Lasso, Elastic Net, and Random Forest, started researching the algorithm. In 1999, Jerome Friedman came up with a generalization of boosting algorithms: Gradient Boosting (Machine), also known as GBM. With this work, Friedman laid the statistical foundation for a whole family of algorithms by framing boosting as optimization in function space.
CART, the bootstrap, and many other methods originated in Stanford's statistics department, and their authors have secured their place in future textbooks. These algorithms are very practical, and some recent works from the department have yet to be widely adopted. For example, check out glinternet (https://arxiv.org/abs/1308.2719).
Not many video recordings of Friedman are available. However, there is a very interesting interview (https://www.youtube.com/watch?v=8hupHmBVvb0) with him about the creation of CART and how they solved statistics problems (which is similar to data analysis and data science today) more than 40 years ago.
In general, there has been a transition from engineering and algorithmic research to a full-fledged approach to building and studying algorithms. From a mathematical perspective, this is not a big change -- we are still adding (or boosting) weak algorithms and enlarging our ensemble with gradual improvements for the parts of the data where the model was inaccurate. But this time, the next simple model is not just built on re-weighted objects; it approximates the gradient of the overall objective function. This concept greatly opens the algorithm up to imagination and extensions.
History of GBM
It took more than 10 years after the introduction of GBM for it to become an essential part of the data
science toolbox.
GBM was extended to apply to different statistics problems: GLMboost and GAMboost for strengthening
already existing GAM models, CoxBoost for survival curves, and RankBoost and LambdaMART for
ranking.
Many implementations of GBM also appeared under different names and on different platforms: Stochastic GBM, GBDT (Gradient Boosted Decision Trees), GBRT (Gradient Boosted Regression Trees), MART (Multiple Additive Regression Trees), and more. In addition, the ML community was very fragmented and dispersed, which made it hard to track just how widespread boosting had become.
At the same time, boosting had been actively used in search ranking. This problem was rewritten in terms of a loss function that penalizes errors in the output order, so it became convenient to simply plug it into GBM. AltaVista was one of the first companies to introduce boosting to ranking. Soon, the ideas spread to Yahoo, Yandex, Bing, etc. Once this happened, boosting became one of the main algorithms used not only in research but also in core industry technologies.
This algorithm has followed a path that is typical for ML algorithms today: from a mathematical problem and algorithmic craftsmanship to successful practical applications and mass adoption years after its first appearance.
2. GBM algorithm
ML problem statement
We are going to solve the problem of function approximation in a general supervised learning setting. We have a set of features $x$ and target variables $y$, $\{(x_i, y_i)\}_{i=1,\ldots,n}$, which we use to restore the dependence $y = f(x)$. We restore the dependence by building an approximation $\hat{f}(x)$ and judging how good it is with a loss function $L(y, f)$ that we want to minimize.

At this moment, we do not make any assumptions regarding the type of the dependence $f(x)$, the model of our approximation $\hat{f}(x)$, or the distribution of the target variable $y$. We only expect that the function $L(y, f)$ is differentiable. Our formula is very general; let's define it for a particular data set with a population mean. Our expression for minimizing the loss over the data is the following:

$$\hat{f}(x) = \underset{f(x)}{\arg\min} \; \mathbb{E}_{x,y}\left[L(y, f(x))\right]$$

Unfortunately, the number of candidate functions $f(x)$ is not just large -- their functional space is infinite-dimensional. That is why it is acceptable for us to limit the search space to some family of functions $f(x, \theta)$, $\theta \in \mathbb{R}^d$. This simplifies the objective a lot because now we have a solvable optimization over parameter values:

$$\hat{f}(x) = f(x, \hat{\theta}), \qquad \hat{\theta} = \underset{\theta}{\arg\min} \; \mathbb{E}_{x,y}\left[L(y, f(x, \theta))\right]$$

Simple analytical solutions for the optimal parameters $\hat{\theta}$ often do not exist, so the parameters are usually approximated iteratively. We write down the empirical loss function $L_{\theta}(\hat{\theta})$ that evaluates our parameters on the data, and we write the approximation $\hat{\theta}$ obtained after $M$ iterations as a sum of incremental updates:

$$\hat{\theta} = \sum_{i=1}^{M} \hat{\theta}_i, \qquad L_{\theta}(\hat{\theta}) = \sum_{i=1}^{n} L\left(y_i, f(x_i, \hat{\theta})\right)$$
Let's imagine for a second that we can perform optimization in the function space and iteratively search for the approximations $\hat{f}(x)$ as functions themselves. We will express our approximation as a sum of incremental improvements, each being a function. For convenience, we will immediately start the sum from the initial approximation $\hat{f}_0(x)$:

$$\hat{f}(x) = \sum_{i=0}^{M} \hat{f}_i(x)$$
Nothing has happened yet; we have only decided that we will search for our approximation $\hat{f}(x)$ not as one big model with plenty of parameters (for example, a neural network), but as a sum of functions, pretending we move in functional space.

In order to accomplish this task, we need to limit our search to some function family $\hat{f}(x) = h(x, \theta)$. There are a few issues here -- first of all, the sum of models can be more complicated than any single model from this family; secondly, the general objective is still in functional space.
Let's note that, on every step, we will need to select an optimal coefficient ρ ∈ ℝ. For step t, the
problem is the following:
$$\hat{f}(x) = \sum_{i=0}^{t-1} \hat{f}_i(x),$$

$$(\rho_t, \theta_t) = \underset{\rho, \theta}{\arg\min} \; \mathbb{E}_{x,y}\left[L\left(y, \hat{f}(x) + \rho \cdot h(x, \theta)\right)\right],$$

$$\hat{f}_t(x) = \rho_t \cdot h(x, \theta_t)$$
Here is where the magic happens. We have defined all of our objectives in general terms, as if we could train any kind of model $h(x, \theta)$ for any type of loss function $L(y, f(x, \theta))$. In practice, this is extremely difficult, but, fortunately, there is a simple way to solve this task.

Knowing the expression for the loss function's gradient, we can calculate its value on our data. So, let's train the models such that our predictions are better correlated with this gradient (taken with a minus sign). In other words, we will use least squares to correct our predictions based on these residuals. For classification, regression, and ranking tasks, we will minimize the squared difference between the pseudo-residuals $r$ and our predictions. For step $t$, the final problem looks like the following:
$$\hat{f}(x) = \sum_{i=0}^{t-1} \hat{f}_i(x),$$

$$r_{it} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=\hat{f}(x)}, \quad \text{for } i = 1, \ldots, n,$$

$$\theta_t = \underset{\theta}{\arg\min} \sum_{i=1}^{n} \left(r_{it} - h(x_i, \theta)\right)^2,$$

$$\rho_t = \underset{\rho}{\arg\min} \sum_{i=1}^{n} L\left(y_i, \hat{f}(x_i) + \rho \cdot h(x_i, \theta_t)\right)$$
We can now define the classic GBM algorithm suggested by Jerome Friedman in 1999. It is a supervised algorithm that has the following components: a data set $\{(x_i, y_i)\}_{i=1,\ldots,n}$, a number of iterations $M$, a loss function $L(y, f)$ with a defined gradient, a family of base algorithms $h(x, \theta)$ with their training procedure, and the hyperparameters of $h(x, \theta)$ (e.g., tree depth). The algorithm proceeds as follows:

Initialize GBM with a constant value $\hat{f}(x) = \hat{f}_0$, $\hat{f}_0 = \gamma$, $\gamma \in \mathbb{R}$:

$$\hat{f}_0 = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma)$$

For each iteration $t = 1, \ldots, M$, repeat:

- Calculate the pseudo-residuals $r_t$:
  $$r_{it} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=\hat{f}(x)}, \quad \text{for } i = 1, \ldots, n$$
- Build a new base algorithm $h_t(x)$ as a regression on the pseudo-residuals $\{(x_i, r_{it})\}_{i=1,\ldots,n}$
- Find the optimal coefficient $\rho_t$ for $h_t(x)$ with respect to the initial loss function
- Save $\hat{f}_t(x) = \rho_t \cdot h_t(x)$ and update the current approximation $\hat{f}(x) \leftarrow \hat{f}(x) + \hat{f}_t(x)$

The final GBM model is the sum $\hat{f}(x) = \sum_{t=0}^{M} \hat{f}_t(x)$.
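As a rough illustration (not the notebook's original code), here is a minimal from-scratch sketch of this loop for the MSE loss, where the pseudo-residuals are simply $y - \hat{f}(x)$ and $\rho_t$ can be fixed to 1; all names are ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit_mse(X, y, n_iter=3, max_depth=2):
    """Minimal GBM sketch for the MSE loss (rho_t is fixed to 1)."""
    f0 = y.mean()                        # constant initialization: argmin_gamma of sum (y_i - gamma)^2
    current = np.full(len(y), f0)        # current approximation f_hat(x_i)
    trees = []
    for _ in range(n_iter):
        residuals = y - current          # pseudo-residuals = -dL/df for the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)           # regression on {(x_i, r_it)}
        current += tree.predict(X)       # update the approximation, rho_t = 1
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees):
    return f0 + sum(tree.predict(X) for tree in trees)
```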
Let's see an example of how GBM works. In this toy example, we will restore a noisy function

$$y = \cos(x) + \epsilon, \quad \epsilon \sim \mathcal{N}\left(0, \tfrac{1}{5}\right), \quad x \in [-5, 5].$$

This is a regression problem with a real-valued target, so we will choose to use the mean squared error loss function. We will generate 300 pairs of observations and approximate them with decision trees of depth 2. Let's put together everything we need to use GBM:

For the mean squared error, both the initialization $\gamma$ and the coefficients $\rho_t$ are simple. We will initialize GBM with the average value $\gamma = \frac{1}{n}\sum_{i=1}^{n} y_i$, and set all coefficients $\rho_t$ to 1.
We will run GBM and draw two types of graphs: the current approximation $\hat{f}(x)$ (blue graph) and every tree $\hat{f}_t(x)$ built on its pseudo-residuals (green graph). The graph's number corresponds to the iteration number:
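A sketch of how this toy setup could be reproduced with scikit-learn's GradientBoostingRegressor; the random seed, noise interpretation, and printed metric are our assumptions rather than the notebook's exact code:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(17)
X = np.linspace(-5, 5, 300).reshape(-1, 1)
y = np.cos(X).ravel() + rng.normal(0, np.sqrt(0.2), size=300)  # noisy cosine, noise variance ~1/5

# M = 3 iterations, depth-2 trees, squared error loss, rho_t = 1 (learning_rate=1.0)
gbm = GradientBoostingRegressor(loss="squared_error",  # "ls" in older scikit-learn versions
                                n_estimators=3, max_depth=2, learning_rate=1.0)
gbm.fit(X, y)

# staged_predict yields the current approximation f_hat after each iteration
for t, approx in enumerate(gbm.staged_predict(X), start=1):
    print(f"iteration {t}: MSE = {np.mean((y - approx) ** 2):.3f}")
```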
By the second iteration, our trees have recovered the basic form of the function. However, at the first iteration, we see that the algorithm has built only the "left branch" of the function ($x \in [-5, -4]$). This is because our trees simply did not have enough depth to build a symmetrical branch at once, so they focused on the left branch, which had the larger error. The right branch therefore appeared only after the second iteration.
The rest of the process goes as expected -- on every step, our pseudo-residuals decrease, and GBM approximates the original function better and better with each iteration. However, by construction, trees produce piecewise-constant predictions and cannot reproduce a smooth function exactly, which means that GBM is not ideal in this example. To play with GBM function approximations, you can use the awesome interactive demo in the blog Brilliantly wrong (http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html):
3. Loss functions
If we want to solve a classification problem instead of regression, what would change? We only need to choose a suitable loss function $L(y, f)$. This is the most important high-level decision: it determines exactly how we will optimize and what characteristics we can expect in the final model.
As a rule, we do not need to invent this ourselves – researchers have already done it for us. Today, we
will explore loss functions for the two most common objectives: regression y ∈ ℝ and binary
classification y ∈ {−1, 1}.
$$L(y, f) = \begin{cases} \alpha \cdot |y - f|, & \text{if } y - f > 0 \\ (1 - \alpha) \cdot |y - f|, & \text{if } y - f \leq 0 \end{cases}, \quad \alpha \in (0, 1)$$

a.k.a. $L_q$ loss or quantile loss. Instead of the median, it uses quantiles. For example, $\alpha = 0.75$ corresponds to the 75%-quantile. We can see that this function is asymmetric and penalizes the observations which are on the right side of the defined quantile.
Let's use the loss function $L_q$ on our data. The goal is to restore the conditional 75%-quantile of the cosine. Let us put everything together for GBM:

- Number of iterations $M = 3$ ✓;
- Loss function for quantiles: $L_{0.75}(y, f) = \begin{cases} 0.75 \cdot |y - f|, & \text{if } y - f > 0 \\ 0.25 \cdot |y - f|, & \text{if } y - f \leq 0 \end{cases}$ ✓;
- Gradient of $L_{0.75}(y, f)$, an indicator function weighted by $\alpha = 0.75$, on which we train our tree-based model: $r_i = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=\hat{f}(x)} = \alpha \, I(y_i > \hat{f}(x_i)) - (1 - \alpha) \, I(y_i \leq \hat{f}(x_i)), \text{ for } i = 1, \ldots, 300$ ✓;
- Decision tree as the base algorithm $h(x)$ ✓;
- Tree hyperparameter: depth = 2 ✓;

For our initial approximation, we will take the needed quantile of $y$. However, we do not know anything about the optimal coefficients $\rho_t$, so we'll use a standard line search. The results are the following:
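For reference, a hedged sketch of a comparable setup in scikit-learn, which supports the quantile loss directly (note that scikit-learn fits an optimal value per tree leaf rather than a single line-search coefficient, so it only approximates the procedure described above):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(17)
X = np.linspace(-5, 5, 300).reshape(-1, 1)
y = np.cos(X).ravel() + rng.normal(0, np.sqrt(0.2), size=300)

# alpha=0.75: the model approximates the conditional 75%-quantile instead of the mean
gbm_q = GradientBoostingRegressor(loss="quantile", alpha=0.75,
                                  n_estimators=3, max_depth=2, learning_rate=1.0)
gbm_q.fit(X, y)
print(gbm_q.predict(X[:5]))
```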
We can observe that, on each iteration, the $r_i$ take only two possible values, but GBM is still able to restore our initial function.
The overall results of GBM with the quantile loss function are the same as the results with the quadratic loss function offset by $\approx 0.135$. But if we were to use the 90%-quantile, we would not have enough data because the classes would become unbalanced. We need to remember this when we deal with non-standard problems.
For regression tasks, many loss functions have been developed, some of them with extra properties. For example, they can be robust like the Huber loss function (https://en.wikipedia.org/wiki/Huber_loss): for small residuals it works as $L_2$, but beyond a defined threshold it switches to $L_1$. This decreases the effect of outliers and lets the model focus on the overall picture.
We can illustrate this with the following example. Data is generated from the function $y = \frac{\sin(x)}{x}$ with added noise, a mixture of normal and Bernoulli distributions. We show the functions on graphs A-D and the relevant GBM on F-H (graph E represents the initial function):
Original size (https://habrastorage.org/web/130/05b/222/13005b222e8a4eb68c3936216c05e276.jpg).
In this example, we used splines as the base algorithm. See, boosting does not always have to use trees! We can clearly see the difference between the $L_2$, $L_1$, and Huber losses. If we choose optimal parameters for the Huber loss, we can get the best possible approximation among all our options. The difference can be seen as well in the 10%, 50%, and 90%-quantiles. Unfortunately, the Huber loss function is supported by only a few popular libraries/packages; h2o supports it, but XGBoost does not. The same applies to other, more exotic options like conditional expectiles (https://www.slideshare.net/charthur/quantile-and-expectile-regression), but they may still be interesting to know about.
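As a side note (an addition of ours, not from the notebook), scikit-learn's GradientBoostingRegressor does expose a Huber option, where the switch between the $L_2$ and $L_1$ regimes is controlled through the alpha quantile of the residuals; a minimal sketch, with the sin(x)/x data as an assumption:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(17)
x = np.linspace(-10, 10, 500)
y = np.sinc(x / np.pi)                      # sin(x)/x, since np.sinc(t) = sin(pi*t)/(pi*t)
y += rng.normal(0, 0.1, size=x.size)        # regular noise
y[rng.rand(x.size) < 0.05] += 3.0           # occasional large outliers (Bernoulli-style corruption)

# Huber loss: quadratic for small residuals, linear beyond the `alpha` residual quantile
gbm_huber = GradientBoostingRegressor(loss="huber", alpha=0.9,
                                      n_estimators=100, max_depth=2)
gbm_huber.fit(x.reshape(-1, 1), y)
```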
Now, let's look at the binary classification problem, $y \in \{-1, 1\}$. We saw that GBM can even optimize non-differentiable loss functions. Technically, it is possible to solve this problem with a regression $L_2$ loss, but it wouldn't be correct.
The distribution of the target variable requires us to use the log-likelihood, so we need loss functions defined on the target multiplied by the prediction, the margin $y \cdot f$. The most common choices would be the following:
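The notebook's own list of losses is not reproduced here; for reference, two standard choices written in terms of the margin are the logistic loss and the exponential (AdaBoost) loss:

$$L(y, f) = \log\left(1 + e^{-yf}\right) \;\; \text{(logistic loss)}, \qquad L(y, f) = e^{-yf} \;\; \text{(exponential, i.e., AdaBoost loss)}$$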
Let's generate some new toy data for our classification problem. As a basis, we will take our noisy
cosine, and we will use the sign function for classes of the target variable. Our toy data looks like the
following (jitter-noise is added for clarity):
We will use the logistic loss to look at what we actually boost. So, again, we put together what we will use for GBM:
This time, the initialization of the algorithm is a little bit harder. First, our classes are imbalanced (63% versus 37%). Second, there is no known analytical formula for the initialization with our loss function, so we have to look for $\hat{f}_0 = \gamma$ via search:
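A minimal sketch of such a search (the data generation, grid, and logistic-loss form are our assumptions; the notebook's own code is not shown here):

```python
import numpy as np

rng = np.random.RandomState(17)
x = np.linspace(-5, 5, 300)
y_cls = np.sign(np.cos(x) + rng.normal(0, np.sqrt(0.2), size=300))  # labels in {-1, +1}

def total_log_loss(gamma):
    # total logistic loss of a constant prediction gamma
    return np.log1p(np.exp(-y_cls * gamma)).sum()

grid = np.linspace(-2, 2, 401)
f0 = grid[np.argmin([total_log_loss(g) for g in grid])]
print("Optimal constant initialization f0 ≈", round(f0, 3))
```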
Our optimal initial approximation is around -0.273. You could have guessed that it would be negative, because it is more profitable to predict everything as the most frequent class, but there is no formula for the exact value. Now let's finally start GBM and look at what actually happens under the hood:
The algorithm successfully restored the separation between our classes. You can see how the "lower"
areas are separating because the trees are more confident in the correct prediction of the negative
class and how the two steps of mixed classes are forming. It is clear that we have a lot of correctly
classified observations and some amount of observations with large errors that appeared due to the
noise in the data.
Weights
Sometimes, there is a situation where we want a more specific loss function for our problem. For
example, in financial time series, we may want to give bigger weight to large movements in the time
series; for churn prediction, it is more useful to predict the churn of clients with high LTV (or lifetime
value: how much money a client will bring in the future).
The statistical warrior would invent their own loss function, write out the gradient for it (for more
effective training, include the Hessian), and carefully check whether this function satisfies the required
properties. However, there is a high probability of making a mistake somewhere, running up against
computational difficulties, and spending an inordinate amount of time on research.
Instead, a very simple instrument was invented (although it is rarely remembered in practice): weighting observations and assigning weight functions. The simplest example of such weighting is setting weights for class balance. In general, if we know that some subset of the data, both in the input variables $x$ and in the target variable $y$, has greater importance for our model, then we just assign it a larger weight $w(x, y)$. The main goal is to fulfill the general requirements for weights:

$$w_i \in \mathbb{R}, \quad w_i \geq 0 \quad \text{for } i = 1, \ldots, n, \quad \sum_{i=1}^{n} w_i > 0$$
Weights can significantly reduce the time spent adjusting the loss function for the task we are solving and also encourage experiments with the target model's properties. Assigning these weights is entirely a matter of creativity. We simply add scalar weights:

$$L_w(y, f) = w \cdot L(y, f)$$
It is clear that, for arbitrary weights, we do not know the statistical properties of our model. Often, linking the weights to the values of $y$ can be too complicated. For example, using weights proportional to $|y|$ in the $L_1$ loss function is not equivalent to the $L_2$ loss because the gradient will not take into account the values of the predictions themselves, $\hat{f}(x)$.
We mention all of this so that we can understand our possibilities better. Let's create some very exotic
weights for our toy data. We will define a strongly asymmetric weight function as follows:
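The exact weight function from the notebook's figure is not reproduced here; as an assumed illustration in the same spirit (much smaller weights for negative $x$), weighted training only requires passing sample_weight:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(17)
x = np.linspace(-5, 5, 300)
y_cls = np.sign(np.cos(x) + rng.normal(0, np.sqrt(0.2), size=300))

# Strongly asymmetric weights (illustrative choice): the right half of the axis matters much more
w = np.where(x < 0, 0.1, 1.0 + np.abs(np.cos(x)))

gbm_w = GradientBoostingClassifier(n_estimators=3, max_depth=2, learning_rate=1.0)
gbm_w.fit(x.reshape(-1, 1), y_cls, sample_weight=w)
```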
With these weights, we expect to get two properties: less detail for negative values of $x$ and a shape similar to the initial cosine. We take the other GBM settings from our previous classification example, including the line search for the optimal coefficients. Let's look at what we've got:
We achieved the result that we expected. First, we can see how strongly the pseudo-residuals differ; on
the initial iteration, they look almost like the original cosine. Second, the left part of the function's graph
was often ignored in favor of the right one, which had larger weights. Third, the function that we got on
the third iteration received enough attention and started looking similar to the original cosine (also
started to slightly overfit).
Weights are a powerful but risky tool that we can use to control the properties of our model. If you want to optimize a custom loss function, it is worth first trying to solve a simpler, standard problem with weights added to the observations at your discretion.
4. Conclusion
Today, we learned the theory behind gradient boosting. GBM is not just some specific algorithm but a common methodology for building ensembles of models. In addition, this methodology is sufficiently flexible and expandable -- it is possible to train a large number of models, taking into consideration different loss functions with a variety of weighting functions.
Practice and ML competitions show that, in standard problems (except for image, audio, and very sparse data), GBM is often the most effective algorithm (not to mention stacking and high-level ensembles, of which GBM is almost always a part). Also, there are many adaptations of GBM for
Reinforcement Learning (https://arxiv.org/abs/1603.04119) (Minecraft, ICML 2016). By the way, the
Viola-Jones algorithm, which is still used in computer vision, is based on AdaBoost
(https://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework#Learning_algorithm).
In this article, we intentionally omitted questions concerning GBM's regularization, stochasticity, and hyperparameters. It was not accidental that we used a small number of iterations $M = 3$ throughout. If we used 30 trees instead of 3 and trained the GBM as described, the result would not be that predictable:
http://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html
5. Assignment #10
6. Useful links
Original article about GBM by Jerome Friedman
"Gradient boosting machines, a tutorial" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/) by Alexey Natekin and Alois Knoll
Chapter on boosting in The Elements of Statistical Learning (http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf) by Hastie, Tibshirani, and Friedman (page 337)
Wiki (https://en.wikipedia.org/wiki/Gradient_boosting) article about Gradient Boosting
Frontiers tutorial (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/) article about GBM
Video-lecture by Hastie (https://www.youtube.com/watch?v=wPqtzj5VZus) about GBM at
h2o.ai conference
CatBoost vs. Light GBM vs. XGBoost (https://towardsdatascience.com/catboost-vs-light-gbm-
vs-xgboost-5f93620723db) on "Towards Data Science"
Benchmarking and Optimization of Gradient Boosting Decision Tree Algorithms
(https://arxiv.org/abs/1809.04559), XGBoost: Scalable GPU Accelerated Learning
(https://arxiv.org/abs/1806.11248) - benchmarking CatBoost, Light GBM, and XGBoost (no 100%
winner)