
Gradient Boosting

Gradient Boosting is a powerful boosting algorithm that combines several weak learners into a strong learner. Each new model is trained to minimize the loss function of the current ensemble, such as mean squared error or cross-entropy, using gradient descent. In each iteration, the algorithm computes the gradient of the loss function with respect to the predictions of the current ensemble and then trains a new weak model to fit the negative of this gradient. The predictions of the new model are then added to the ensemble, and the process is repeated until a stopping criterion is met.
In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels. A widely used variant is Gradient Boosted Trees, whose base learner is CART (Classification and Regression Trees). The diagram below illustrates how gradient-boosted trees are trained for regression problems.

Gradient Boosted Trees for Regression

The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the labels y.
The predictions labeled y1(hat) are used to determine the training set residual errors r1. Tree2
is then trained using the feature matrix X and the residual errors r1 of Tree1 as labels. The
predicted results r1(hat) are then used to determine the residual r2. The process is repeated
until all the M trees forming the ensemble are trained. An important parameter in this technique is Shrinkage: the prediction of each tree in the ensemble is shrunk by multiplying it by the learning rate (eta), which ranges between 0 and 1. There is a trade-off between eta and the number of estimators; a lower learning rate must be compensated for with more estimators to reach a given level of model performance. Once all the trees are trained, predictions can be made. Each tree predicts a label, and the final prediction is given by the formula,

y(pred) = y1(hat) + (eta * r1(hat)) + (eta * r2(hat)) + ....... + (eta * rM(hat))
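To make this training loop concrete, here is a minimal Python sketch of gradient-boosted trees for squared-error regression. It uses scikit-learn's DecisionTreeRegressor as the CART base learner; the function names and hyperparameter values are illustrative assumptions, not part of the original text.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbt(X, y, n_trees=100, eta=0.1, max_depth=3):
    # Base prediction: the mean of the targets (the "average of the target column").
    f0 = float(np.mean(y))
    residuals = y - f0                  # r1: errors of the base prediction
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                           # fit the next tree on current residuals
        residuals = residuals - eta * tree.predict(X)    # shrink its contribution by eta
        trees.append(tree)
    return f0, trees

def predict_gbt(X, f0, trees, eta=0.1):
    # y(pred) = y1(hat) + eta * r1(hat) + eta * r2(hat) + ...
    pred = np.full(len(X), f0)
    for tree in trees:
        pred = pred + eta * tree.predict(X)
    return pred

Each tree is fit to the current residuals, and its contribution is shrunk by eta before the residuals are updated, which is exactly the shrinkage trade-off described above.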

Gradient Boosting Algorithm


Errors play a major role in any machine learning algorithm. There are mainly two types of errors: bias error and variance error. The gradient boosting algorithm helps us minimize the bias error of the model. The main idea behind this algorithm is to build models sequentially, where each subsequent model tries to reduce the errors of the previous one. But how do we do that? How do we reduce the error? This is done by building a new model on the errors, or residuals, of the previous model.
When the target column is continuous, we use the Gradient Boosting Regressor, whereas when it is a classification problem, we use the Gradient Boosting Classifier. The only difference between the two is the loss function. The objective is to minimize this loss function by adding weak learners using gradient descent. Since the method is driven by the loss function, regression problems use losses such as mean squared error (MSE), while classification problems use losses such as the log-likelihood.
Understanding Gradient Boosting Regression Algorithm with an Example
Let’s understand the intuition behind the gradient boosting algorithm in machine learning with the help of an example. Here our target column is continuous, hence we will use the gradient boosting regressor.
Following is a sample from a random dataset where we have to predict the car price based on
various features. The target column is price and other features are independent features.

Step 1: Build a Base Model


The first step in gradient boosting is to build a base model to predict the observations in the
training dataset. For simplicity, we take an average of the target column and assume that to be
the predicted value as shown below:
Why did I say we take the average of the target column? Well, there is math involved in this.
Mathematically the first step can be written as:

Here L is our loss function,


Gamma is our predicted value, and
arg min means we have to find a predicted value/gamma for which the loss function is
minimum.
Since the target column is continuous our loss function will be:

Here yi is the observed value, and gamma is the predicted value.


Now we need to find the value of gamma for which this loss function is minimum. We differentiate the loss function and set the derivative equal to 0.

Let’s see how to do this with the help of our example. Remember that y_i is our observed value and gamma is our predicted value; by plugging the values into the above formula we get:
We end up with the average of the observed car prices, and this is why I asked you to take the average of the target column and assume it to be your first prediction.
Hence, for gamma = 14500 the loss function is minimum, so this value becomes our prediction for the base model.
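(The formula images referenced above are not reproduced in this copy; written out, with the squared-error loss the text describes, the initialization step is:)

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma), \qquad L(y_i, \gamma) = \tfrac{1}{2}\,(y_i - \gamma)^2

\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} \tfrac{1}{2}\,(y_i - \gamma)^2 \;=\; \sum_{i=1}^{n} (\gamma - y_i) \;=\; 0 \;\;\Rightarrow\;\; \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i

Setting the derivative to zero gives gamma equal to the average of the observed prices, which is 14500 here.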
Step 2: Compute Pseudo Residuals
The next step is to calculate the pseudo residuals which are (observed value – predicted value).

Again the question arises: why only observed minus predicted? This, too, has a mathematical basis. Let's see where this formula comes from. This step can be written as:

Here F(x_i) is the previous model's prediction and m is the index of the decision tree being made.
We are just taking the derivative of loss function w.r.t the predicted value and we have already
calculated this derivative:

Looking at the formula for the residuals above, we see that the derivative of the loss function is multiplied by a negative sign, so we get:

The predicted value here is the prediction made by the previous model. In our example the
prediction made by the previous model (initial base model prediction) is 14500, to calculate
the residuals our formula becomes:

Step 3: Build a Model on Calculated Residuals


In the next step, we will build a model on these pseudo-residuals and make predictions. Why do we do this? Because we want to minimize these residuals, and minimizing the residuals will eventually improve the model's accuracy and predictive power. So, using the residuals as the target and the original features (cylinder number, cylinder height, and engine location), we will generate new predictions. Note that the predictions, in this case, will be the error values, not the predicted car price values, since our target column is now the error.
Let’s say hm(x) is our Decision tree made on these residuals.
Step 4: Compute Decision Tree Output
In this step, we find the output value for each leaf of our decision tree. Because one leaf may receive more than one residual, we need to find the final output for every leaf. To find the output, we can simply take the average of all the numbers in a leaf, whether it contains one number or several.
Let’s see why we take the average of all the numbers. Mathematically this step can be
represented as:

Here h_m(x_i) is the decision tree made on the residuals and m is the index of the tree. When m = 1 we are talking about the first tree, and when it is M we are talking about the last tree.
The output value for a leaf is the value of gamma that minimizes the loss function. The left-hand-side gamma is the output value of a particular leaf. On the right-hand side, [F_{m-1}(x_i) + gamma * h_m(x_i)] is similar to step 1, but here the difference is that we take the previous predictions into account, whereas earlier there was no previous prediction.
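(In symbols, the step the text is describing is:)

\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i,\, F_{m-1}(x_i) + \gamma\big)

For the squared-error loss, setting the derivative with respect to gamma to zero gives gamma_{jm} equal to the average of the residuals that fall in leaf R_{jm}, which is exactly what the worked example below computes.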
Example of Calculating Regression Tree Output
Let’s understand this even better with the help of an example. Suppose this is our regressor
tree:

We see the 1st residual goes in R1,1, the 2nd and 3rd residuals go in R2,1, and the 4th residual goes in R3,1.
Let’s calculate the output for the first leaf, that is, R1,1.
Now we need to find the value for gamma for which this function is minimum. So we find the
derivative of this equation w.r.t gamma and put it equal to 0.

Hence the leaf R1,1 has an output value of -2500. Now let’s solve for R2,1.

Let’s take the derivative to get the minimum value of gamma for which this function is
minimum:

We end up with the average of the residuals in the leaf R2,1 . Hence if we get any leaf with more
than 1 residual, we can simply find the average of that leaf and that will be our final output.
Now after calculating the output of all the leaves, we get:
Step 5: Update Previous Model Predictions
This is finally the last step where we have to update the predictions of the previous model. It
can be updated as:

where m is the number of decision trees made.


Since we have just started building our model, m = 1. Now, to make a new decision tree, our new predictions will be:

Here F_{m-1}(x) is the prediction of the previous model; since m = 1, F_{1-1} = F_0 is our base model, hence the previous prediction is 14500.
nu is the learning rate, usually selected between 0 and 1. It reduces the effect each tree has on the final prediction, and this improves accuracy in the long run. Let's take nu = 0.1 in this example.
h_m(x) is the most recent decision tree made on the residuals.
Let’s calculate the new prediction now:

Suppose we want to find a prediction of our first data point which has a car height of 48.8. This
data point will go through this decision tree and the output it gets will be multiplied by the
learning rate and then added to the previous prediction.
Now let’s say m=2 which means we have built 2 decision trees and now we want to have new
predictions.
This time we will add the previous prediction that is F1(x) to the new DT made on residuals.
We will iterate through these steps again and again until the loss is negligible.
Here is a hypothetical example just to illustrate how the model predicts for a new data point:
If a new data point comes in, say with height = 1.40, it will go through all the trees and then give the prediction. Here we have only 2 trees, hence the data point will go through these 2 trees and the final output will be F2(x).
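In practice, this whole regression procedure is available off the shelf. Below is a hedged scikit-learn sketch; the feature values are placeholders standing in for the car-price table, not the article's actual data.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data standing in for the car-price table (not the article's actual values).
X = np.array([[48.8, 4], [49.2, 6], [50.1, 4], [47.5, 8]])   # e.g. car height, cylinder count
y = np.array([14000.0, 15000.0, 14500.0, 15500.0])           # price

model = GradientBoostingRegressor(
    n_estimators=100,   # M, the number of trees
    learning_rate=0.1,  # the shrinkage factor nu
    max_depth=3,        # depth of each regression tree
)                        # the default loss is squared error, matching the example
model.fit(X, y)
print(model.predict(np.array([[48.8, 4]])))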

Gradient Boosting in Classification


Gradient Boosting has three main components:
• Loss Function - The role of the loss function is to estimate how good the model is at
making predictions with the given data. This could vary depending on the problem at
hand. For example, if we’re trying to predict the weight of a person depending on some
input variables (a regression problem), then the loss function would be something that
helps us find the difference between the predicted weights and the observed weights.
On the other hand, if we’re trying to categorize if a person will like a certain movie
based on their personality, we’ll require a loss function that helps us understand how
accurate our model is at classifying people who did or didn’t like certain movies.
• Weak Learner - A weak learner is one that classifies our data but does so poorly, perhaps no better than random guessing. In other words, it has a high error rate. These are typically shallow decision trees (sometimes just decision stumps, single-split trees that are far less complicated than typical decision trees).
• Additive Model - This is the iterative and sequential approach of adding the trees (weak
learners) one step at a time. After each iteration, we need to be closer to our final model.
In other words, each iteration should reduce the value of our loss function.
An Intuitive Understanding: Visualizing Gradient Boost
Let’s start by looking at one of the most common binary classification machine learning problems. It aims at predicting the fate of the passengers on the Titanic based on a few features: their age, gender, etc. We will take only a subset of the dataset and choose certain columns for convenience. Our dataset looks something like this:
Titanic Passenger Data
• Pclass, or Passenger Class, is categorical: 1, 2, or 3.
• Age is the age of the passenger when they were on the Titanic.
• Fare is the Passenger Fare.
• Sex is the gender of the person.
• Survived refers to whether or not the person survived the sinking; 0 if they did not, 1 if they did.
Now let’s look at how the Gradient Boosting algorithm solves this problem.
We start with one leaf node that predicts the initial value for every individual passenger. For a classification problem, it will be the log(odds) of the target value. log(odds) is the equivalent of the average in a classification problem. Since four passengers in our case survived and two did not survive, the log(odds) that a passenger survived would be:
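(The calculation, filled in here since the image is not reproduced, is:)

\log(\text{odds}) = \log\!\left(\frac{4}{2}\right) = \log(2) \approx 0.7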

This becomes our initial leaf.

Initial Leaf Node


The easiest way to use the log(odds) for classification is to convert it to a probability. To do so,
we’ll use this formula:
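(The formula is not reproduced in this copy; it is the standard logistic transform:)

p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} = \frac{e^{0.7}}{1 + e^{0.7}} \approx 0.7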
Note: Please bear in mind that we have rounded off everything to one decimal place here, and
hence the log(odds) and probability are the same, which may not be the case always.
If the probability of surviving is greater than 0.5, then we first classify everyone in the training
dataset as survivors. (0.5 is a common threshold used for classification decisions made based
on probability; note that the threshold can easily be taken as something else.)
Now we need to calculate the Pseudo Residual, i.e, the difference between the observed value
and the predicted value. Let us draw the residuals on a graph.

The blue and the yellow dots are the observed values. The blue dots are the passengers who did
not survive with the probability of 0 and the yellow dots are the passengers who survived with
a probability of 1. The dotted line here represents the predicted probability, which is 0.7.
We need to find the residuals, which would be:
Here, 1 denotes Yes and 0 denotes No.
We will use this residual to get the next tree. It may seem absurd that we are considering the
residual instead of the actual value, but we shall throw more light ahead.

Branching out data points using the residual values


We use a limit of two leaves here to simplify our example, but in reality, Gradient Boost typically uses between 8 and 32 leaves.
Because of the limit on leaves, one leaf can have multiple values. Predictions are in terms of log(odds), but these leaves are derived from probabilities, which causes a disparity. So, we can't just add the single leaf we got earlier and this tree to get new predictions, because they're derived from different sources. We have to use some kind of transformation. The most common form of transformation used in Gradient Boost for classification is:

The numerator in this equation is the sum of the residuals in that particular leaf.

The denominator is the sum of (previous prediction probability for each residual) * (1 - the same previous prediction probability).
The first leaf has only one residual value, 0.3, and since this is the first tree, the previous probability is the value from the initial leaf and is thus the same for all residuals. Hence,
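(Plugging in the numbers given above, one residual of 0.3 and a previous probability of 0.7, the first leaf's output works out to:)

\gamma = \frac{0.3}{0.7 \times (1 - 0.7)} = \frac{0.3}{0.21} \approx 1.43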

For the second leaf,

Similarly, for the last leaf:

Now the transformed tree looks like:

Transformed tree
Now that we have transformed it, we can add our initial leaf to our new tree, scaled by a learning rate.

The learning rate is used to scale the contribution from the new tree. This results in a small step in the right direction of prediction. Empirical evidence shows that taking lots of small steps in the right direction results in better predictions on a test dataset (i.e., a dataset the model has never seen) than making a supposedly perfect prediction in one step. The learning rate is usually a small number such as 0.1.
We can now calculate new log(odds) prediction and hence a new probability.
For example, for the first passenger, the old tree's prediction is 0.7. The learning rate, which remains the same for all records, is equal to 0.1, and by scaling the new tree we find its value to be -0.16. Hence, substituting in the formula we get:
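(Taking the stated -0.16 as the already-scaled contribution of the new tree, the update is:)

\text{new } \log(\text{odds}) = 0.7 + (-0.16) = 0.54

which can then be converted back to a probability with the logistic formula above.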

Similarly, we substitute and find the new log(odds) for each passenger and hence find the
probability. Using the new probability, we will calculate the new residuals.
This process repeats until we have made the maximum number of trees specified or the
residuals get super small.
A Mathematical Understanding
We shall go through each step, one at a time and try to understand them.

x_i - These are the input variables that we feed into our model.

y_i - This is the target variable that we are trying to predict.
We can compute the log likelihood of the data given the predicted probability:

y_i is the observed value (0 or 1).
p is the predicted probability.
The goal is to maximize this log likelihood. Hence, if we want to use the log(likelihood) as our loss function, where smaller values represent better-fitting models, we take its negative:

Now the log(likelihood) is a function of the predicted probability p, but we need it to be a function of the predicted log(odds). So, let us try to convert the formula:
We know that:

Substituting,

Now,

Hence,

Loss function in terms of log odds

L(y_i, log(odds)) = -y_i * log(odds) + log(1 + e^(log(odds)))


Now that we have converted the p to log(odds), this becomes our Loss Function.
We have to show that this is differentiable.
This can also be written as:

Now we can proceed to the actual steps of the model building.


Step 1: Initialize model with a constant value

Here, y_i is the observed value, L is the loss function, and gamma is the value for log(odds).
We sum the loss function, i.e., we add up the loss for each observed value.
argmin over gamma means that we need to find the log(odds) value that minimizes this sum.
Then, we take the derivative of each loss function:

… and so on.
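(Setting the summed derivative to zero, using the derivative of the loss derived above, gives the familiar starting value:)

\sum_{i=1}^{n} (-y_i + p) = 0 \;\;\Rightarrow\;\; p = \frac{1}{n}\sum_{i=1}^{n} y_i \;\;\Rightarrow\;\; F_0(x) = \log\!\left(\frac{p}{1-p}\right)

For the Titanic subset, p = 4/6, so F_0(x) = log(4/2) ≈ 0.7, matching the initial leaf from the intuitive walkthrough.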
Step 2: for m = 1 to M
(A)

This step requires us to calculate the residuals using the given formula. We have already found the loss function to be:

Hence,
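(Substituting the derivative, the pseudo-residual is:)

r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - p_{i,m-1}

i.e., the observed value minus the previously predicted probability.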
(B) Fit a regression tree to the residual values and create terminal regions

Because the number of leaves in a tree is limited, we might have more than one value in a particular terminal region.
In our first tree, m = 1, and j is a unique number for each terminal node: R1,1, R2,1, and so on.
C)

For each leaf in the new tree, we calculate gamma, which is the output value. The summation should be only over those records that go into making that leaf. In theory, we could find the derivative with respect to gamma to obtain the value of gamma, but that would be extremely tedious because of the terms involved in our loss function.
Substituting the loss function and i=1 in the equation above, we get:

We use second order Taylor Polynomial to approximate this Loss Function:

There are three terms in our approximation. Taking derivative with respect to gamma gives us:

We equate this to 0 and subtract the single-derivative term from both sides.
Then, gamma will be equal to:

The gamma equation may look humongous but in simple terms, it is:

We just substitute the value of the derivative of the loss function:

Now we shall solve for the second derivative of the Loss Function. After some heavy
computations, we get:

We have simplified the numerator as well as the denominator. The final gamma solution looks
like:
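(The preceding Taylor-expansion steps amount to a Newton-style update, gamma ≈ -L'/L''; filling in the first and second derivatives, the simplified single-residual solution the text refers to is:)

\gamma = \frac{y_i - p}{p\,(1 - p)}

i.e., the residual divided by the previous probability times one minus that probability.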

We were trying to find the value of gamma that, when added to the most recent predicted log(odds), minimizes our loss function. This gamma works when our terminal region has only one residual value and hence one predicted probability. But recall from our example above that, because of the restricted number of leaves in Gradient Boosting, one terminal region may contain many values. Then the generalized formula would be:
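(Written out, and consistent with the standard result for the log-loss, the generalized leaf output is:)

\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} r_{im}}{\sum_{x_i \in R_{jm}} p_{i,m-1}\,(1 - p_{i,m-1})}

i.e., the sum of the residuals in the leaf divided by the sum of previous probability times (1 - previous probability), exactly the transformation used in the intuitive section.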
Hence, we have calculated the output values for each leaf in the tree.
(D)

This formula asks us to update our predictions. In the first pass, m = 1, and we substitute F_0(x), the common prediction for all samples (i.e., the initial leaf value), plus nu, the learning rate, multiplied by the output value from the tree we built previously. The summation handles the case where a single sample ends up in multiple leaves.
Now we will use this new F1(x) value to get new predictions for each sample.
The new predicted value should get us a little closer to the actual value. Note that, in contrast to the single tree in our example, gradient boosting builds many trees, and M could be as large as 100 or more.
This completes the loop in Step 2 and we are ready for the final step of Gradient Boosting.
Step 3: Output

If we get new data, we use this value to predict whether the passenger survived or not. This gives us the log(odds) that the person survived. Plugging it into the 'p' formula:

If the resultant value lies above our threshold, then the person survived; otherwise they did not.
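For completeness, here is a hedged scikit-learn sketch of the classification case; the toy rows only loosely mirror the six-passenger subset and are not taken from the article.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for the Titanic subset: [Pclass, Age, Fare, Sex (0 = male, 1 = female)]
X = np.array([[3, 22, 7.25, 0], [1, 38, 71.3, 1], [3, 26, 7.9, 1],
              [1, 35, 53.1, 1], [3, 35, 8.05, 0], [2, 27, 13.0, 0]])
y = np.array([0, 1, 1, 1, 0, 1])   # Survived: four 1s and two 0s, as in the example

clf = GradientBoostingClassifier(
    n_estimators=100,   # M, the number of trees
    learning_rate=0.1,  # scales each tree's contribution to the log(odds)
    max_depth=3,
)                        # the default loss is the log-loss discussed above
clf.fit(X, y)
print(clf.predict_proba(np.array([[2, 30, 15.0, 1]])))  # [P(not survived), P(survived)]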
Comparing and Contrasting AdaBoost and Gradient Boost
Both AdaBoost and Gradient Boost learn sequentially from a set of weak learners. A strong learner is obtained from the additive model of these weak learners. The main focus is to learn from the shortcomings at each step of the iteration.
AdaBoost requires users to specify a set of weak learners (alternatively, it will randomly generate a set of weak learners before the real learning process). It increases the weights of the wrongly predicted instances and decreases those of the correctly predicted instances. The weak learner thus focuses more on the difficult instances. After being trained, the weak learner is added to the strong one according to its performance (its so-called alpha weight). The better it performs, the more it contributes to the strong learner.
On the other hand, gradient boosting doesn’t modify the sample distribution. Instead of training
on a newly sampled distribution, the weak learner trains on the remaining errors of the strong
learner. It is another way to give more importance to the difficult instances. At each iteration,
the pseudo-residuals are computed and a weak learner is fitted to these pseudo-residuals. Then,
the contribution of the weak learner to the strong one isn’t computed according to its
performance on the newly distributed sample but using a gradient descent optimization process.
The computed contribution is the one minimizing the overall error of the strong learner.
Adaboost is more about ‘voting weights’ and gradient boosting is more about
‘adding gradient optimization’.
AdaBoost vs Gradient Boost
• AdaBoost: An additive model where shortcomings of previous models are identified by high-weight data points. Gradient Boost: An additive model where shortcomings of previous models are identified by the gradient.
• AdaBoost: The trees are usually grown as decision stumps. Gradient Boost: The trees are grown to a greater depth, usually ranging from 8 to 32 terminal nodes.
• AdaBoost: Each classifier has different weights assigned to the final prediction based on its performance. Gradient Boost: All classifiers are weighed equally, and their predictive capacity is restricted with a learning rate to increase accuracy.
• AdaBoost: It gives weights to both classifiers and observations, thus capturing the maximum variance within the data. Gradient Boost: It builds trees on the previous classifier's residuals, thus capturing variance in the data.

Advantages of Gradient Boosting


• Often provides predictive accuracy that is difficult to beat.
• Lots of flexibility - it can optimize different loss functions and provides several hyperparameter tuning options that make the function fit very flexible.
• No data pre-processing required - it often works well with categorical and numerical values as is.
• Handles missing data - imputation is not required.

Disadvantages of Gradient Boosting


• Gradient Boosting models will keep improving to minimize all errors. This can overemphasize outliers and cause overfitting.
• Computationally expensive - it often requires many trees (>1000), which can be time- and memory-intensive.
• The high flexibility results in many parameters that interact and heavily influence the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning; a small sketch of such a search follows this list.
• Less interpretable in nature, although this is easily addressed with various tools.
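As referenced in the list above, tuning usually means a grid search over these interacting parameters. A small, illustrative scikit-learn sketch (with synthetic data and arbitrary grid values, not recommendations) might look like this:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data just so the sketch runs end to end.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Illustrative grid only; the values are arbitrary.
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)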
