

mlcourse.ai (https://mlcourse.ai) – Open Machine Learning Course

Author: Alexey Natekin (https://www.linkedin.com/in/natekin/), OpenDataScience founder, Machine Learning Evangelist. Translated and edited by Olga Daykhovskaya (https://www.linkedin.com/in/odaykhovskaya/), Anastasia Manokhina (https://www.linkedin.com/in/anastasiamanokhina/), Yury Kashnitsky (https://yorko.github.io), Egor Polusmak (https://www.linkedin.com/in/egor-polusmak/), and Yuanyuan Pao (https://www.linkedin.com/in/yuanyuanpao/). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose.

You can also check out the latest version of this notebook in the course repository
(https://github.com/Yorko/mlcourse.ai).


Topic 10. Gradient Boosting

Today we are going to have a look at one of the most popular and practical machine learning algorithms:
gradient boosting.


Outline

We recommend going over this article in the order described below, but feel free to jump around between sections.

1. Introduction and history of boosting
   - History of Gradient Boosting Machine
2. GBM algorithm
   - ML problem statement
   - Functional gradient descent
   - Friedman's classic GBM algorithm
   - Step-by-step example of the GBM algorithm
3. Loss functions
   - Regression loss functions
   - Classification loss functions
   - Weights
4. Conclusion
5. Assignment #10
6. Useful resources


1. Introduction and history of boosting

Almost everyone in machine learning has heard about gradient boosting. Many data scientists include this algorithm in their toolbox because of the good results it yields on any given (unknown) problem.

Furthermore, XGBoost is often the standard recipe for winning (https://github.com/dmlc/xgboost/blob/master/demo/README.md#usecases) ML competitions (http://blog.kaggle.com/tag/xgboost/). It is so popular that the idea of stacking XGBoosts has become a meme. Moreover, boosting is an important component in many recommender systems (https://en.wikipedia.org/wiki/Learning_to_rank#Practical_usage_by_search_engines); sometimes, it is even considered a brand (https://yandex.com/company/technologies/matrixnet/). Let's look at the history and development of boosting.


Boosting was born out of the question (http://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf): is it possible to get one strong model from a large number of relatively weak and simple models?

By "weak models", we do not mean simple basic models like decision trees but models with poor accuracy, where poor means only a little better than random guessing.

A positive mathematical answer (http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf) to this question was identified, but it took a few years to develop fully functioning algorithms based on this solution, e.g. AdaBoost. These algorithms take a greedy approach: first, they build a linear combination of simple models (base algorithms) by re-weighting the input data. Then, each new model (usually a decision tree) is built with larger weights on the objects that were previously predicted incorrectly.

Many machine learning courses study AdaBoost, the ancestor of GBM (Gradient Boosting Machine). However, since AdaBoost was later generalized into GBM, it has become apparent that AdaBoost is just a particular variation of GBM.

The algorithm itself has a very clear visual interpretation and intuition for defining weights. Let's have a look at the following toy classification problem, where we are going to use decision trees of depth 1 (also known as 'stumps') on each iteration of AdaBoost. For the first two iterations, we have the following picture:


The size of a point corresponds to its weight, which is increased after an incorrect prediction. On each iteration, we can see that these weights grow because the stumps cannot cope with this problem on their own. However, if we take a weighted vote of the stumps, we get the correct classification:

Pseudocode:

- Initialize sample weights $w_i^{(0)} = \frac{1}{l},\ i = 1, \dots, l$.
- For all $t = 1, \dots, T$:
  - Train the base algorithm $b_t$, let $\epsilon_t$ be its training error.
  - $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
  - Update sample weights: $w_i^{(t)} = w_i^{(t-1)} e^{-\alpha_t y_i b_t(x_i)},\ i = 1, \dots, l$.
  - Normalize sample weights: $w_0^{(t)} = \sum_{j=1}^{l} w_j^{(t)}, \quad w_i^{(t)} = \frac{w_i^{(t)}}{w_0^{(t)}},\ i = 1, \dots, l$.
- Return $\sum_{t=1}^{T} \alpha_t b_t$
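
To make the procedure concrete, here is a minimal sketch that follows this pseudocode with depth-1 scikit-learn trees as the base algorithms. The toy two-class data and the helper names (`fit_adaboost`, `adaboost_predict`) are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
from sklearn.datasets import make_moons          # assumed stand-in for the toy data
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=17)
y = 2 * y - 1                                    # labels in {-1, +1}

def fit_adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # w_i^(0) = 1/l
    stumps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))            # weighted training error of b_t
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
        w = w * np.exp(-alpha * y * pred)        # up-weight the misclassified objects
        w = w / w.sum()                          # normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # weighted vote of the stumps
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))

stumps, alphas = fit_adaboost(X, y)
```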

Here (https://www.youtube.com/watch?v=k4G2VCuOMMg) is a more detailed example of AdaBoost where, as we iterate, we can see the weights increase, especially on the border between classes.

AdaBoost works well, but the lack (https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/boosting-survey.pdf) of explanation for why the algorithm is successful sowed the seeds of doubt. Some considered it a super-algorithm, a silver bullet, but others were skeptical and believed AdaBoost was just overfitting.

The overfitting problem did indeed exist, especially when the data had strong outliers. Therefore, in those types of problems, AdaBoost was unstable. Fortunately, a few professors in the statistics department at Stanford, who had created Lasso, Elastic Net, and Random Forest, started researching the algorithm. In 1999, Jerome Friedman came up with a generalization of boosting algorithms: Gradient Boosting (Machine), also known as GBM. With this work, Friedman set up the statistical foundation for many algorithms, providing the general approach of boosting for optimization in the functional space.

CART, bootstrap, and many other algorithms have originated from Stanford's statistics department. In
doing so, the department has solidified their names in future textbooks. These algorithms are very
practical, and some recent works have yet to be widely adopted. For example, check out glinternet
(https://arxiv.org/abs/1308.2719).

Not many video recordings of Friedman are available. However, there is a very interesting interview (https://www.youtube.com/watch?v=8hupHmBVvb0) with him about the creation of CART and how they solved statistics problems (which is similar to data analysis and data science today) more than 40 years ago.

There is also a great lecture (https://www.youtube.com/watch?v=zBk3PK3g-Fc) from Hastie, a retrospective on data analysis from one of the creators of the methods that we use every day.

In general, there has been a transition from engineering and algorithmic research to a full-fledged approach to building and studying algorithms. From a mathematical perspective, this is not a big change: we are still adding (or boosting) weak algorithms and enlarging our ensemble with gradual improvements for the parts of the data where the model was inaccurate. But, this time, the next simple model is not just built on re-weighted objects; instead, it improves its approximation of the gradient of the overall objective function. This concept greatly opens up our algorithms for imagination and extension.


History of GBM

It took more than 10 years after the introduction of GBM for it to become an essential part of the data
science toolbox.
GBM was extended to apply to different statistics problems: GLMboost and GAMboost for strengthening
already existing GAM models, CoxBoost for survival curves, and RankBoost and LambdaMART for
ranking.
Many realizations of GBM also appeared under different names and on different platforms: Stochastic
GBM, GBDT (Gradient Boosted Decision Trees), GBRT (Gradient Boosted Regression Trees), MART
(Multiple Additive Regression Trees), and more. In addition, the ML community was very segmented and
dissociated, which made it hard to track just how widespread boosting had become.

At the same time, boosting had been actively used in search ranking. This problem was rewritten in
terms of a loss function that penalizes errors in the output order, so it became convenient to simply
insert it into GBM. AltaVista was one of the first companies that introduced boosting to ranking. Soon, the ideas spread to Yahoo, Yandex, Bing, and others. Once this happened, boosting became one of the main algorithms used not only in research but also in core industry technologies.

ML competitions, especially Kaggle, played a major role in boosting's popularization. Now, researchers had a common platform where they could compete on different data science problems with a large number of participants from around the world. With Kaggle, one could test new algorithms on real data, giving algorithms the opportunity to "shine", and share full information about model performance across competition data sets. This is exactly what happened to boosting when it was used at Kaggle (http://blog.kaggle.com/2011/12/21/score-xavier-conort-on-coming-second-in-give-me-some-credit/) (check interviews with Kaggle winners starting from 2011 who mostly used boosting). The XGBoost (https://github.com/dmlc/xgboost) library quickly gained popularity after its appearance. XGBoost is not a new, unique algorithm; it is just an extremely effective implementation of classic GBM with additional heuristics.

This algorithm has gone through the typical path for ML algorithms today: from a mathematical problem and algorithmic craftsmanship to successful practical applications and mass adoption years after its first appearance.


2. GBM algorithm

ML problem statement

We are going to solve the problem of function approximation in a general supervised learning setting. We have a set of features $x$ and target variables $y$, $\{(x_i, y_i)\}_{i=1,\ldots,n}$, which we use to restore the dependence $y = f(x)$. We restore the dependence by approximating $\hat{f}(x)$ and by understanding which approximation is better when we use the loss function $L(y, f)$, which we want to minimize:

$$ y \approx \hat{f}(x), \quad \hat{f}(x) = \underset{f(x)}{\arg\min} \ L(y, f(x)) $$

At this moment, we do not make any assumptions regarding the type of dependence $f(x)$, the model of our approximation $\hat{f}(x)$, or the distribution of the target variable $y$. We only expect that the function $L(y, f)$ is differentiable. Our formula is very general; let's define it for a particular data set by taking the expectation over the data. Our expression for minimizing the loss on the data is the following:

$$ \hat{f}(x) = \underset{f(x)}{\arg\min} \ \mathbb{E}_{x,y}[L(y, f(x))] $$

Unfortunately, the number of functions $f(x)$ is not just large, but its functional space is infinite-dimensional. That is why it is acceptable for us to limit the search space to some family of functions $f(x, \theta),\ \theta \in \mathbb{R}^d$. This simplifies the objective a lot because now we have a solvable optimization over parameter values:

$$ \hat{f}(x) = f(x, \hat{\theta}), \quad \hat{\theta} = \underset{\theta}{\arg\min} \ \mathbb{E}_{x,y}[L(y, f(x, \theta))] $$

Simple analytical solutions for finding the optimal parameters $\hat{\theta}$ often do not exist, so the parameters are usually approximated iteratively. To start, we write down the empirical loss function $L_{\theta}(\hat{\theta})$ that will allow us to evaluate our parameters using our data. Additionally, let's write out our approximation $\hat{\theta}$ for a number of $M$ iterations as a sum:

$$ \hat{\theta} = \sum_{i=1}^{M} \hat{\theta}_i, \quad L_{\theta}(\hat{\theta}) = \sum_{i=1}^{N} L(y_i, f(x_i, \hat{\theta})) $$

Then, the only thing left is to find a suitable, iterative algorithm to minimize $L_{\theta}(\hat{\theta})$. Gradient descent is the simplest and most frequently used option. We define the gradient as $\nabla L_{\theta}(\hat{\theta})$ and add our iterative evaluations $\hat{\theta}_i$ to it (since we are minimizing the loss, we add the minus sign). Our last step is to initialize our first approximation $\hat{\theta}_0$ and choose the number of iterations $M$. Let's review the steps for this inefficient and naive algorithm for approximating $\hat{\theta}$:

1. Define the initial approximation of the parameters $\hat{\theta} = \hat{\theta}_0$
2. For every iteration $t = 1, \dots, M$, repeat steps 3-7:
3. Calculate the gradient of the loss function $\nabla L_{\theta}(\hat{\theta})$ for the current approximation $\hat{\theta}$: $\nabla L_{\theta}(\hat{\theta}) = \left[ \frac{\partial L(y, f(x, \theta))}{\partial \theta} \right]_{\theta = \hat{\theta}}$
4. Set the current iterative approximation $\hat{\theta}_t$ based on the calculated gradient: $\hat{\theta}_t \leftarrow -\nabla L_{\theta}(\hat{\theta})$
5. Update the approximation of the parameters: $\hat{\theta} \leftarrow \hat{\theta} + \hat{\theta}_t = \sum_{i=0}^{t} \hat{\theta}_i$
6. Save the result of the approximation $\hat{\theta} = \sum_{i=0}^{M} \hat{\theta}_i$
7. Use the function that was found: $\hat{f}(x) = f(x, \hat{\theta})$
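
As a tiny illustration of these steps, here is a sketch of this naive parameter-space gradient descent for a linear model under squared loss. The toy data, the learning rate, and the use of the mean (rather than the sum) of the per-object losses are assumptions made purely to keep the example short and numerically stable.

```python
import numpy as np

# Assumed toy data for a linear model f(x, theta) = theta[0] + theta[1] * x.
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=300)
y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=300)

def grad_L(theta, X, y):
    # Gradient of the mean squared loss w.r.t. theta; using the mean instead of the
    # sum from the text only rescales the gradient and keeps the step size stable.
    resid = y - (theta[0] + theta[1] * X)
    return np.array([-2 * resid.mean(), -2 * (resid * X).mean()])

theta = np.zeros(2)                      # step 1: initial approximation theta_0
lr, M = 0.05, 500                        # learning rate and number of iterations
for t in range(M):                       # step 2: iterate
    theta_t = -lr * grad_L(theta, X, y)  # steps 3-4: move against the gradient
    theta = theta + theta_t              # step 5: accumulate the increments
print(theta)                             # steps 6-7: theta_hat is roughly [1, 2]
```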


Functional gradient descent

Let's imagine for a second that we can perform optimization in the function space and iteratively search for the approximations $\hat{f}(x)$ as functions themselves. We will express our approximation as a sum of incremental improvements, each being a function. For convenience, we will immediately start with the sum from the initial approximation $\hat{f}_0(x)$:

$$ \hat{f}(x) = \sum_{i=0}^{M} \hat{f}_i(x) $$

Nothing has happened yet; we have only decided that we will search for our approximation $\hat{f}(x)$ not as a big model with plenty of parameters (as an example, a neural network), but as a sum of functions, pretending we move in functional space.

In order to accomplish this task, we need to limit our search to some function family $\hat{f}(x) = h(x, \theta)$. There are a few issues here: first of all, the sum of models can be more complicated than any single model from this family; secondly, the general objective is still in functional space. Let's note that, on every step, we will need to select an optimal coefficient $\rho \in \mathbb{R}$. For step $t$, the problem is the following:

$$ \hat{f}(x) = \sum_{i=0}^{t-1} \hat{f}_i(x), $$
$$ (\rho_t, \theta_t) = \underset{\rho,\theta}{\arg\min} \ \mathbb{E}_{x,y}[L(y, \hat{f}(x) + \rho \cdot h(x, \theta))], $$
$$ \hat{f}_t(x) = \rho_t \cdot h(x, \theta_t) $$
Here is where the magic happens. We have defined all of our objectives in general terms, as if we could train any kind of model $h(x, \theta)$ for any type of loss function $L(y, f(x, \theta))$. In practice, this is extremely difficult, but, fortunately, there is a simple way to solve this task.

Knowing the expression of the loss function's gradient, we can calculate its value on our data. So, let's train the models such that our predictions will be more correlated with this gradient (with a minus sign). In other words, we will use least squares to correct the predictions with these residuals. For classification, regression, and ranking tasks, we will minimize the squared difference between the pseudo-residuals $r$ and our predictions. For step $t$, the final problem looks like the following:

$$ \hat{f}(x) = \sum_{i=0}^{t-1} \hat{f}_i(x), $$
$$ r_{it} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}(x)}, \quad \text{for } i = 1, \dots, n, $$
$$ \theta_t = \underset{\theta}{\arg\min} \ \sum_{i=1}^{n} (r_{it} - h(x_i, \theta))^2, $$
$$ \rho_t = \underset{\rho}{\arg\min} \ \sum_{i=1}^{n} L(y_i, \hat{f}(x_i) + \rho \cdot h(x_i, \theta_t)) $$
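
A minimal sketch of a single functional gradient descent step under squared loss might look as follows; the toy data, the tree depth, and the variable names are assumptions, and `scipy.optimize.minimize_scalar` stands in for the one-dimensional line search over $\rho$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data and current ensemble prediction F; squared loss L(y, f) = (y - f)^2.
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=(300, 1))
y = np.cos(X[:, 0]) + rng.normal(scale=0.45, size=300)
F = np.full(300, y.mean())                  # current approximation f_hat(x), here just f_0

# Pseudo-residuals: minus the gradient of the loss w.r.t. the current predictions.
# For squared loss this is 2 * (y - F); the constant factor is absorbed by rho_t below.
r = y - F

# theta_t: least-squares fit of the base learner to the pseudo-residuals.
h = DecisionTreeRegressor(max_depth=2).fit(X, r)

# rho_t: one-dimensional line search over the original loss.
rho = minimize_scalar(lambda p: np.sum((y - (F + p * h.predict(X))) ** 2)).x
F_new = F + rho * h.predict(X)              # f_hat(x) <- f_hat(x) + rho_t * h(x, theta_t)
```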


Friedman's classic GBM algorithm

We can now define the classic GBM algorithm suggested by Jerome Friedman in 1999. It is a supervised algorithm that has the following components:

- a dataset $\{(x_i, y_i)\}_{i=1,\ldots,n}$;
- a number of iterations $M$;
- a choice of loss function $L(y, f)$ with a defined gradient;
- a choice of function family of base algorithms $h(x, \theta)$ with their training procedure;
- additional hyperparameters of $h(x, \theta)$ (for example, the tree depth for decision trees).

The only thing left is the initial approximation $f_0(x)$. For simplicity, a constant value $\gamma$ is used for the initial approximation. The constant value, as well as the optimal coefficient $\rho$, are identified via binary search or another line search algorithm over the initial loss function (not a gradient). So, we have our GBM algorithm described as follows:

1. Initialize GBM with the constant value $\hat{f}(x) = \hat{f}_0$, $\hat{f}_0 = \gamma$, $\gamma \in \mathbb{R}$: $\hat{f}_0 = \underset{\gamma}{\arg\min} \ \sum_{i=1}^{n} L(y_i, \gamma)$
2. For each iteration $t = 1, \dots, M$, repeat:
   - Calculate pseudo-residuals $r_t$: $r_{it} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}(x)}, \quad \text{for } i = 1, \dots, n$
   - Build a new base algorithm $h_t(x)$ as regression on the pseudo-residuals $\{(x_i, r_{it})\}_{i=1,\ldots,n}$
   - Find the optimal coefficient $\rho_t$ at $h_t(x)$ with respect to the initial loss function: $\rho_t = \underset{\rho}{\arg\min} \ \sum_{i=1}^{n} L(y_i, \hat{f}(x_i) + \rho \cdot h(x_i, \theta))$
   - Save $\hat{f}_t(x) = \rho_t \cdot h_t(x)$
   - Update the current approximation: $\hat{f}(x) \leftarrow \hat{f}(x) + \hat{f}_t(x) = \sum_{i=0}^{t} \hat{f}_i(x)$
3. Compose the final GBM model: $\hat{f}(x) = \sum_{i=0}^{M} \hat{f}_i(x)$
4. Conquer Kaggle and the rest of the world
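
Below is a bare-bones sketch of this algorithm with scikit-learn regression trees as the base algorithms. The `SimpleGBM` class, its `loss`/`grad` callables, and the toy data are assumptions for illustration, not a reference implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor


class SimpleGBM:
    """Sketch of the algorithm above; `loss(y, f)` returns per-object losses and
    `grad(y, f)` its derivative w.r.t. f (both callables are assumed, not a library API)."""

    def __init__(self, loss, grad, n_iter=3, max_depth=2):
        self.loss, self.grad = loss, grad
        self.n_iter, self.max_depth = n_iter, max_depth

    def fit(self, X, y):
        # Step 1: initialize with the constant gamma that minimizes the loss.
        self.f0 = minimize_scalar(lambda g: self.loss(y, np.full(len(y), g)).sum()).x
        F = np.full(len(y), self.f0)
        self.trees, self.rhos = [], []
        for _ in range(self.n_iter):
            r = -self.grad(y, F)                              # pseudo-residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, r)
            h = tree.predict(X)
            rho = minimize_scalar(                            # line search for rho_t
                lambda p: self.loss(y, F + p * h).sum()).x
            F = F + rho * h                                   # update the approximation
            self.trees.append(tree)
            self.rhos.append(rho)
        return self

    def predict(self, X):
        F = np.full(X.shape[0], self.f0)
        for rho, tree in zip(self.rhos, self.trees):
            F = F + rho * tree.predict(X)
        return F


# Usage with squared loss on noisy cosine data (similar to the toy example below):
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=(300, 1))
y = np.cos(X[:, 0]) + rng.normal(scale=0.45, size=300)
gbm = SimpleGBM(loss=lambda y, f: (y - f) ** 2,
                grad=lambda y, f: -2 * (y - f),
                n_iter=3).fit(X, y)
```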

Step-By-Step example: How GBM Works

Let's see an example of how GBM works. In this toy example, we will restore a noisy function $y = \cos(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \frac{1}{5})$, $x \in [-5, 5]$.

This is a regression problem with a real-valued target, so we will choose to use the mean squared error
loss function. We will generate 300 pairs of observations and approximate them with decision trees of
depth 2. Let's put together everything we need to use GBM:

- Toy data $\{(x_i, y_i)\}_{i=1,\ldots,300}$ ✓
- Number of iterations $M = 3$ ✓
- The mean squared error loss function $L(y, f) = (y - f)^2$ ✓
- The gradient of the $L_2$ loss is just the residuals $r = (y - f)$ ✓
- Decision trees as base algorithms $h(x)$ ✓
- Hyperparameters of the decision trees: tree depth is equal to 2 ✓

For the mean squared error, both the initialization $\gamma$ and the coefficients $\rho_t$ are simple. We will initialize GBM with the average value $\gamma = \frac{1}{n} \sum_{i=1}^{n} y_i$, and set all coefficients $\rho_t$ to 1.

We will run GBM and draw two types of graphs: the current approximation $\hat{f}(x)$ (blue graph) and every tree $\hat{f}_t(x)$ built on its pseudo-residuals (green graph). The graph's number corresponds to the iteration number:
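
A rough way to reproduce this setup with scikit-learn's `GradientBoostingRegressor` is sketched below; the notebook's own plotting code is not shown, the random seed and noise scale are assumptions, and `learning_rate=1.0` mimics setting all $\rho_t$ to 1.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# 300 noisy cosine observations, 3 depth-2 trees, squared-error loss (the default).
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=(300, 1))
y = np.cos(X[:, 0]) + rng.normal(scale=np.sqrt(0.2), size=300)   # eps ~ N(0, 1/5)

gbm = GradientBoostingRegressor(n_estimators=3, max_depth=2, learning_rate=1.0)
gbm.fit(X, y)

# staged_predict yields f_hat(x) after each iteration -- the blue curves on the graphs.
x_grid = np.linspace(-5, 5, 500).reshape(-1, 1)
stage_predictions = list(gbm.staged_predict(x_grid))
```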


By the second iteration, our trees have recovered the basic form of the function. However, at the first iteration, we see that the algorithm built only the "left branch" of the function ($x \in [-5, -4]$). This was due to the fact that our trees simply did not have enough depth to build a symmetrical branch at once, so they focused on the left branch with the larger error. Therefore, the right branch appeared only after the second iteration.

The rest of the process goes as expected -- on every step, our pseudo-residuals decreased, and GBM
approximated the original function better and better with each iteration. However, by construction, trees
cannot approximate a continuous function, which means that GBM is not ideal in this example. To play
with GBM function approximations, you can use the awesome interactive demo in this blog called
Brilliantly wrong (http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html):


3. Loss functions

If we want to solve a classification problem instead of regression, what would change? We only need to choose a suitable loss function $L(y, f)$. This is the most important, high-level decision that determines exactly how we will optimize and what characteristics we can expect in the final model.

As a rule, we do not need to invent this ourselves – researchers have already done it for us. Today, we
will explore loss functions for the two most common objectives: regression y ∈ ℝ and binary
classification y ∈ {−1, 1}.

Regression loss functions

Let's start with a regression problem for $y \in \mathbb{R}$. In order to choose the appropriate loss function, we need to consider which of the properties of the conditional distribution $(y|x)$ we want to restore. The most common options are:

- $L(y, f) = (y - f)^2$, a.k.a. $L_2$ loss or Gaussian loss. It is the classical conditional mean, which is the simplest and most common case. If we do not have any additional information or requirements for a model to be robust, we can use the Gaussian loss.
- $L(y, f) = |y - f|$, a.k.a. $L_1$ loss or Laplacian loss. At first glance, this function does not seem to be differentiable, but it actually defines the conditional median. The median, as we know, is robust to outliers, which is why this loss function is better in some cases. The penalty for big variations is not as heavy as it is in $L_2$.
- $L(y, f) = \begin{cases} \alpha \cdot |y - f|, & \text{if } y - f > 0 \\ (1 - \alpha) \cdot |y - f|, & \text{if } y - f \leq 0 \end{cases}, \ \alpha \in (0, 1)$, a.k.a. $L_q$ loss or quantile loss. Instead of the median, it uses quantiles. For example, $\alpha = 0.75$ corresponds to the 75%-quantile. We can see that this function is asymmetric and penalizes the observations which are on the right side of the defined quantile.


Let's use the loss function $L_q$ on our data. The goal is to restore the conditional 75%-quantile of the cosine. Let us put everything together for GBM:

- Toy data $\{(x_i, y_i)\}_{i=1,\ldots,300}$ ✓
- Number of iterations $M = 3$ ✓
- Loss function for quantiles: $L_{0.75}(y, f) = \begin{cases} 0.75 \cdot |y - f|, & \text{if } y - f > 0 \\ 0.25 \cdot |y - f|, & \text{if } y - f \leq 0 \end{cases}$ ✓
- Gradient of $L_{0.75}(y, f)$: a step function weighted by $\alpha = 0.75$. We are going to train a tree-based model on these pseudo-residuals: $r_i = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}(x)} = \alpha I(y_i > \hat{f}(x_i)) - (1 - \alpha) I(y_i \leq \hat{f}(x_i)), \ \text{for } i = 1, \dots, 300$ ✓
- Decision trees as base algorithms $h(x)$ ✓
- Hyperparameter of the trees: depth = 2 ✓

For our initial approximation, we will take the needed quantile of $y$. However, we do not know anything about the optimal coefficients $\rho_t$, so we'll use standard line search. The results are the following:
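
For reference, here is a sketch of the same experiment using scikit-learn's built-in quantile loss. Note that, unlike the description above, scikit-learn uses a fixed learning rate rather than a line search for $\rho_t$, and the data generation details are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Conditional 75%-quantile of the noisy cosine with the built-in quantile loss.
rng = np.random.RandomState(17)
X = rng.uniform(-5, 5, size=(300, 1))
y = np.cos(X[:, 0]) + rng.normal(scale=np.sqrt(0.2), size=300)

q_gbm = GradientBoostingRegressor(loss='quantile', alpha=0.75,
                                  n_estimators=3, max_depth=2, learning_rate=1.0)
q_gbm.fit(X, y)

def quantile_pseudo_residuals(y, f, alpha=0.75):
    # The step-function gradient from the list above: alpha above the fit, -(1 - alpha) below.
    return np.where(y > f, alpha, -(1 - alpha))
```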


We can observe that, on each iteration, $r_i$ takes only two possible values, but GBM is still able to restore our initial function.

The overall results of GBM with the quantile loss function are the same as the results with the quadratic loss function offset by $\approx 0.135$. But if we were to use the 90%-quantile, we would not have enough data due to the fact that the classes would become unbalanced. We need to remember this when we deal with non-standard problems.

For regression tasks, many loss functions have been developed, some of them with extra properties. For example, they can be robust like the Huber loss (https://en.wikipedia.org/wiki/Huber_loss). For a small number of outliers, the loss function works as $L_2$, but, after a defined threshold, the function changes to $L_1$. This allows for decreasing the effect of outliers and focusing on the overall picture. We can illustrate this with the following example. Data is generated from the function $y = \frac{\sin(x)}{x}$ with added noise, a mixture of normal and Bernoulli distributions. We show the functions on graphs A-D and the relevant GBM on F-H (graph E represents the initial function):

Original size (https://habrastorage.org/web/130/05b/222/13005b222e8a4eb68c3936216c05e276.jpg).

In this example, we used splines as the base algorithm. See, it does not always have to be trees for boosting? We can clearly see the difference between the $L_2$, $L_1$, and Huber losses. If we choose optimal parameters for the Huber loss, we can get the best possible approximation among all of our options. The difference can be seen as well in the 10%, 50%, and 90%-quantiles. Unfortunately, the Huber loss function is supported only by very few popular libraries/packages; h2o supports it, but XGBoost does not. The same applies to other, more exotic things like conditional expectiles (https://www.slideshare.net/charthur/quantile-and-expectile-regression), but it may still be interesting knowledge.
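
Although XGBoost lacks it, the Huber loss is available in scikit-learn's `GradientBoostingRegressor`; the sketch below fits it to sinc-like data with injected outliers (the exact noise mixture from the figure is not reproduced, so the data here is an assumption).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# sin(x)/x with Gaussian noise plus a small fraction of large outliers (assumed mixture).
rng = np.random.RandomState(17)
x = np.linspace(-10, 10, 400)
y = (np.sinc(x / np.pi)                                  # np.sinc(x/pi) == sin(x)/x
     + rng.normal(scale=0.1, size=400)
     + rng.binomial(1, 0.1, size=400) * rng.normal(scale=2.0, size=400))

# loss='huber' behaves quadratically for small errors and linearly past a threshold;
# `alpha` sets the quantile at which the switch happens.
huber_gbm = GradientBoostingRegressor(loss='huber', alpha=0.9,
                                      n_estimators=100, max_depth=2)
huber_gbm.fit(x.reshape(-1, 1), y)
```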

Classification loss functions


Now, let's look at the binary classification problem, $y \in \{-1, 1\}$. We saw that GBM can even optimize non-differentiable loss functions. Technically, it is possible to solve this problem with a regression $L_2$ loss, but it would not be correct.

The distribution of the target variable requires us to use the log-likelihood, so we need to have different loss functions for targets multiplied by their predictions: $y \cdot f$. The most common choices are the following:

- $L(y, f) = \log(1 + \exp(-2yf))$, a.k.a. logistic loss or Bernoulli loss. This has an interesting property: it penalizes even correctly predicted classes, which helps not only to optimize the loss but also to move the classes apart further, even if all classes are predicted correctly.
- $L(y, f) = \exp(-yf)$, a.k.a. AdaBoost loss. The classic AdaBoost is equivalent to GBM with this loss function. Conceptually, this function is very similar to the logistic loss, but it has a bigger exponential penalization if the prediction is wrong.

Let's generate some new toy data for our classification problem. As a basis, we will take our noisy cosine, and we will use the sign function for the classes of the target variable. Our toy data looks like the following (jitter-noise is added for clarity):


We will use the logistic loss to look at what we actually boost. So, again, we put together what we will use for GBM:

- Toy data $\{(x_i, y_i)\}_{i=1,\ldots,300}$, $y_i \in \{-1, 1\}$ ✓
- Number of iterations $M = 3$ ✓
- Logistic loss as the loss function; its gradient is computed the following way: $r_i = \frac{2 \cdot y_i}{1 + \exp(2 \cdot y_i \cdot \hat{f}(x_i))}, \ \text{for } i = 1, \dots, 300$ ✓
- Decision trees as base algorithms $h(x)$ ✓
- Hyperparameters of the decision trees: tree depth is equal to 2 ✓

This time, the initialization of the algorithm is a little bit harder. First, our classes are imbalanced (63% versus 37%). Second, there is no known analytical formula for the initialization of our loss function, so we have to find $\hat{f}_0 = \gamma$ via search:
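
A quick sketch of this search under logistic loss is shown below; the toy labels are regenerated here for illustration, so the optimum only approximately matches the value quoted next.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Regenerated toy labels: sign of the noisy cosine, in {-1, +1} (an assumed reproduction).
rng = np.random.RandomState(17)
x = rng.uniform(-5, 5, size=300)
y = np.sign(np.cos(x) + rng.normal(scale=np.sqrt(0.2), size=300))

def logistic_loss(y, f):
    return np.log(1 + np.exp(-2 * y * f))

# Numeric search for the constant initial approximation gamma.
gamma = minimize_scalar(lambda g: logistic_loss(y, g).sum()).x
print(gamma)   # negative, since predicting the majority (negative) class is more profitable
```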


Our optimal initial approximation is around -0.273. You could have guessed that it would be negative because it is more profitable to predict everything as the most popular class, but there is no formula for the exact value. Now let's finally start GBM and look at what actually happens under the hood:


The algorithm successfully restored the separation between our classes. You can see how the "lower"
areas are separating because the trees are more confident in the correct prediction of the negative
class and how the two steps of mixed classes are forming. It is clear that we have a lot of correctly
classified observations and some amount of observations with large errors that appeared due to the
noise in the data.

Weights
Sometimes, there is a situation where we want a more specific loss function for our problem. For
example, in financial time series, we may want to give bigger weight to large movements in the time
series; for churn prediction, it is more useful to predict the churn of clients with high LTV (or lifetime
value: how much money a client will bring in the future).

The statistical warrior would invent their own loss function, write out the gradient for it (for more
effective training, include the Hessian), and carefully check whether this function satisfies the required
properties. However, there is a high probability of making a mistake somewhere, running up against
computational difficulties, and spending an inordinate amount of time on research.


In lieu of this, a very simple instrument was invented (which is rarely remembered in practice): weighing
observations and assigning weight functions. The simplest example of such weighting is the setting of
weights for class balance. In general, if we know that some subset of data, both in the input variables x
and in the target variable y, has greater importance for our model, then we just assign them a larger
weight w(x, y). The main goal is to fulfill the general requirements for weights:
$$ w_i \in \mathbb{R}, \quad w_i \geq 0 \quad \text{for } i = 1, \dots, n, \quad \sum_{i=1}^{n} w_i > 0 $$

Weights can significantly reduce the time spent adjusting the loss function for the task we are solving and also encourage experiments with the target models' properties. Assigning these weights is entirely a matter of creativity. We simply add scalar weights:

$$ L_w(y, f) = w \cdot L(y, f), $$
$$ r_{it} = -w_i \cdot \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}(x)}, \quad \text{for } i = 1, \dots, n $$

It is clear that, for arbitrary weights, we do not know the statistical properties of our model. Often, linking the weights to the values of $y$ can be too complicated. For example, the usage of weights proportional to $|y|$ in the $L_1$ loss function is not equivalent to the $L_2$ loss because the gradient will not take into account the values of the predictions themselves: $\hat{f}(x)$.

We mention all of this so that we can understand our possibilities better. Let's create some very exotic weights for our toy data. We will define a strongly asymmetric weight function as follows:

$$ w(x) = \begin{cases} 0.1, & \text{if } x \leq 0 \\ 0.1 + |\cos(x)|, & \text{if } x > 0 \end{cases} $$

With these weights, we expect to get two properties: less detail for negative values of $x$ and a shape of the function similar to the initial cosine. We take the other GBM settings from our previous classification example, including the line search for optimal coefficients. Let's look at what we've got:
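
A sketch of this weighted experiment with scikit-learn might look as follows; passing the weight function through `sample_weight` is an assumed stand-in for the notebook's own weighted-GBM code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Regenerated toy classification data: sign of the noisy cosine (assumed reproduction).
rng = np.random.RandomState(17)
x = rng.uniform(-5, 5, size=300)
y = np.sign(np.cos(x) + rng.normal(scale=np.sqrt(0.2), size=300))

def w(x):
    # The asymmetric weight function from above: 0.1 for x <= 0, 0.1 + |cos(x)| for x > 0.
    return np.where(x > 0, 0.1 + np.abs(np.cos(x)), 0.1)

weighted_gbm = GradientBoostingClassifier(n_estimators=3, max_depth=2)
weighted_gbm.fit(x.reshape(-1, 1), y, sample_weight=w(x))
```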


We achieved the result that we expected. First, we can see how strongly the pseudo-residuals differ; on
the initial iteration, they look almost like the original cosine. Second, the left part of the function's graph
was often ignored in favor of the right one, which had larger weights. Third, the function that we got on
the third iteration received enough attention and started looking similar to the original cosine (also
started to slightly overfit).


Weights are a powerful but risky tool that we can use to control the properties of our model. If you want to optimize your loss function, it is often worth first trying to solve a simpler, standard problem while adding weights to the observations at your discretion.

4. Conclusion

Today, we learned the theory behind gradient boosting. GBM is not just some specific algorithm but a
common methodology for building ensembles of models. In addition, this methodology is sufficiently
flexible and expandable -- it is possible to train a large number of models, taking into consideration
different loss-functions with a variety of weighting functions.

Practice and ML competitions show that, in standard problems (except for image, audio, and very sparse
data), GBM is often the most effective algorithm (not to mention stacking and high-level ensembles,
where GBM is almost always a part of them). Also, there are many adaptations of GBM for
Reinforcement Learning (https://arxiv.org/abs/1603.04119) (Minecraft, ICML 2016). By the way, the
Viola-Jones algorithm, which is still used in computer vision, is based on AdaBoost
(https://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework#Learning_algorithm).

In this article, we intentionally omitted questions concerning GBM's regularization, stochasticity, and hyperparameters. It was not accidental that we used a small number of iterations $M = 3$ throughout. If we used 30 trees instead of 3 and trained the GBM as described, the result would not be that predictable:


http://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html


5. Assignment #10

Your task (https://www.kaggle.com/kashnitsky/assignment-10-gradient-boosting-and-flight-delays) is to beat at least 2 benchmarks in this Kaggle Inclass competition (https://www.kaggle.com/c/flight-delays-spring-2018). Here you won't be provided with detailed instructions. We only give you a brief description of how the second benchmark was achieved using XGBoost.

6. Useful links

- Original article (https://statweb.stanford.edu/~jhf/ftp/trebst.pdf) about GBM from Jerome Friedman
- "Gradient boosting machines, a tutorial" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/), paper by Alexey Natekin and Alois Knoll
- Chapter in Elements of Statistical Learning (http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf) from Hastie, Tibshirani, Friedman (page 337)
- Wiki (https://en.wikipedia.org/wiki/Gradient_boosting) article about Gradient Boosting
- Frontiers tutorial (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/) article about GBM
- Video lecture by Hastie (https://www.youtube.com/watch?v=wPqtzj5VZus) about GBM at an h2o.ai conference
- CatBoost vs. Light GBM vs. XGBoost (https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db) on "Towards Data Science"
- Benchmarking and Optimization of Gradient Boosting Decision Tree Algorithms (https://arxiv.org/abs/1809.04559) and XGBoost: Scalable GPU Accelerated Learning (https://arxiv.org/abs/1806.11248) - benchmarking CatBoost, Light GBM, and XGBoost (no 100% winner)

Support course creators

You can make a monthly (Patreon) or one-time (Ko-Fi) donation ↓

(https://www.patreon.com/ods_mlcourse)

(https://ko-fi.com/mlcourse_ai)
