Gradient Boosting
Gradient Boosting is a powerful boosting algorithm that combines several weak learners into a strong learner. Each new model is trained to reduce a loss function, such as mean squared error or cross-entropy, of the current ensemble using gradient descent. In each iteration, the algorithm computes the gradient of the loss function with respect to the predictions of the current ensemble and then trains a new weak model to fit this negative gradient. The predictions of the new model are added to the ensemble, and the process is repeated until a stopping criterion is met.
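As a concrete illustration, this loop is what scikit-learn's GradientBoostingRegressor implements; a minimal usage sketch on synthetic data (the dataset and parameter values are arbitrary choices, not from this article):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Each of the 100 trees is fitted to the negative gradient of the
# squared-error loss (i.e. the residuals of the current ensemble),
# and its contribution is shrunk by learning_rate.
model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X, y)
print(model.predict(X[:3]))
```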
In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels. A popular variant, Gradient Boosted Trees, uses CART (Classification and Regression Trees) as the base learner. The diagram below shows how gradient-boosted trees are trained for regression problems.
The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the labels y. Its predictions, denoted y1(hat), are used to compute the training-set residual errors r1. Tree2 is then trained using the feature matrix X and the residuals r1 of Tree1 as labels. Its predictions, r1(hat), are in turn used to compute the next residuals r2. The process is repeated until all M trees in the ensemble are trained. An important parameter in this technique is shrinkage: the prediction of each tree in the ensemble is shrunk by multiplying it by the learning rate (eta), which ranges between 0 and 1. There is a trade-off between eta and the number of estimators; a smaller learning rate must be compensated by more estimators to reach a given level of model performance. Once all trees are trained, predictions can be made. Each tree predicts a label, and the final prediction is given by the formula

y(pred) = y1(hat) + eta * r1(hat) + eta * r2(hat) + ... + eta * rN(hat)
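Plugging hypothetical numbers into this combination rule (the per-tree outputs below are made up for illustration; only the 14500 base prediction comes from the article's example):

```python
# Hypothetical per-tree outputs for one sample (not from the article's data).
base_prediction = 14500.0          # mean of the target column
tree_outputs = [-2500.0, -1700.0]  # r1(hat), r2(hat) for this sample
eta = 0.1                          # learning rate (shrinkage)

# y(pred) = base + eta * r1(hat) + eta * r2(hat) + ...
y_pred = base_prediction + sum(eta * out for out in tree_outputs)
print(y_pred)  # 14500 + 0.1*(-2500) + 0.1*(-1700) = 14080.0
```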
Let's see how to do this with the help of our example. Remember that y_i is our observed value and gamma is our predicted value. Plugging the values into the above formula and setting the derivative of the squared-error loss to zero, we get:

gamma = (1/n) * sum of y_i

We end up with the average of the observed car prices, and this is why I asked you to take the average of the target column and assume it to be your first prediction.
Hence, for gamma = 14500 the loss function is at its minimum, so this value becomes the prediction of our base model.
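A quick numerical check that the mean minimizes the squared-error loss, using hypothetical car prices whose average is 14500 (the article's actual data isn't shown here):

```python
import numpy as np

# Hypothetical car prices with mean 14500 (assumed values for illustration).
y = np.array([13000.0, 14200.0, 14800.0, 16000.0])
gamma0 = y.mean()

def loss(gamma):
    # Squared-error loss summed over all observations.
    return np.sum((y - gamma) ** 2)

# The mean gives a lower loss than nearby candidate values.
print(gamma0, loss(gamma0))
```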
Step 2: Compute Pseudo Residuals
The next step is to calculate the pseudo-residuals, which are (observed value - predicted value). Again the question comes: why only observed - predicted? Everything is mathematically proven; let's see where this formula comes from. This step can be written as:

r_im = -[ dL(y_i, F(x_i)) / dF(x_i) ], evaluated at F(x) = F_{m-1}(x)

Here F(x_i) is the prediction of the previous model and m is the index of the decision tree being built. We are just taking the derivative of the loss function with respect to the predicted value, and we have already calculated this derivative:

dL / dF(x_i) = -(y_i - F(x_i))

Since the residual formula multiplies this derivative by a negative sign, we get:

r_im = y_i - F(x_i)

The predicted value here is the prediction made by the previous model. In our example, the prediction made by the previous model (the initial base-model prediction) is 14500, so to calculate the residuals our formula becomes:

r_i1 = y_i - 14500
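A sketch of this residual computation, using hypothetical observed prices (assumed values; only the 14500 base prediction is from the example):

```python
import numpy as np

# Hypothetical observed car prices (mean 14500, matching the base prediction).
y = np.array([13000.0, 14200.0, 14800.0, 16000.0])
f0 = 14500.0  # base-model prediction, the same for every sample

# Pseudo-residuals for squared-error loss: observed minus predicted.
residuals = y - f0
print(residuals)  # [-1500. -300. 300. 1500.]
```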
Here h_m(x_i) is the decision tree built on the residuals and m is the tree's index: m = 1 refers to the first tree and m = M to the last. The output value for each leaf is the value of gamma that minimizes the loss function. The gamma on the left-hand side is the output value of a particular leaf. The right-hand side, [F_{m-1}(x_i) + gamma * h_m(x_i)], is similar to Step 1, but the difference is that we now take the previous predictions into account, whereas earlier there was no previous prediction.
Example of Calculating Regression Tree Output
Let's understand this even better with the help of an example. Suppose this is our regression tree:
We see that the 1st residual goes into R1,1, the 2nd and 3rd residuals go into R2,1, and the 4th residual goes into R3,1.
Let's calculate the output for the first leaf, R1,1:
Now we need to find the value for gamma for which this function is minimum. So we find the
derivative of this equation w.r.t gamma and put it equal to 0.
Hence the leaf R1,1 has an output value of -2500. Now let's solve for R2,1.
Let’s take the derivative to get the minimum value of gamma for which this function is
minimum:
We end up with the average of the residuals in leaf R2,1. Hence, whenever a leaf contains more than one residual, we simply take their average, and that is the leaf's final output.
After calculating the outputs of all the leaves, we get:
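The leaf-averaging rule can be sketched as follows; the residual values are hypothetical, except for R1,1's output of -2500, which matches the example above:

```python
import numpy as np

# Hypothetical residuals routed to each leaf of the first tree
# (the 2nd and 3rd residuals land in the same leaf, R2,1).
leaves = {
    "R1,1": [-2500.0],
    "R2,1": [-1100.0, -900.0],
    "R3,1": [1500.0],
}

# Each leaf's output is the gamma minimizing the squared-error loss,
# i.e. the mean of the residuals that fall into it.
outputs = {leaf: float(np.mean(r)) for leaf, r in leaves.items()}
print(outputs)  # {'R1,1': -2500.0, 'R2,1': -1000.0, 'R3,1': 1500.0}
```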
Step 5: Update Previous Model Predictions
This is finally the last step, where we update the predictions of the previous model. The update rule is:

F_m(x) = F_{m-1}(x) + nu * h_m(x)

Here F_{m-1}(x) is the prediction of the previous model; since F_{1-1} = F_0 is our base model, the previous prediction is 14500.
nu is the learning rate, usually selected between 0 and 1. It reduces the effect each tree has on the final prediction, which improves accuracy in the long run. Let's take nu = 0.1 in this example.
h_m(x) is the most recent decision tree, the one fitted to the residuals.
Let's calculate the new prediction now. Suppose we want to find the prediction for our first data point, which has a car height of 48.8. This data point goes through the decision tree, and the leaf output it receives is multiplied by the learning rate and added to the previous prediction.
Now let's say m = 2, which means we have built two decision trees and want updated predictions.
This time we add the new decision tree built on the residuals to the previous prediction F1(x), giving F2(x).
We iterate through these steps again and again until the loss is negligible.
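The whole regression loop (base prediction, residuals, tree fitting, shrunken update) can be sketched from scratch; a minimal version on made-up data, using scikit-learn's DecisionTreeRegressor as the base learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

nu, M = 0.1, 50                 # learning rate and number of trees
f = np.full_like(y, y.mean())   # Step 1: base prediction = mean of y
trees = []

for m in range(M):
    residuals = y - f                    # pseudo-residuals of current model
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)               # fit the next tree to the residuals
    f = f + nu * tree.predict(X)         # shrunken update of the predictions
    trees.append(tree)

# Training error shrinks as trees are added.
print(np.mean((y - f) ** 2))
```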
Here is a hypothetical example just to show how the model predicts for a new dataset. If a new data point comes in, say height = 1.40, it goes through all the trees and then gives the prediction. Here we have only two trees, so the data point goes through both, and the final output is F2(x).
The blue and the yellow dots are the observed values: the blue dots are the passengers who did not survive (probability 0) and the yellow dots are the passengers who survived (probability 1). The dotted line represents the predicted probability, which is 0.7.
We need to find the residuals, which are:

residual = observed value - predicted probability

Here, 1 denotes Yes and 0 denotes No.
We will use these residuals to build the next tree. It may seem absurd to train on the residuals instead of the actual values, but we shall throw more light on this ahead.
Transformed tree
Now that we have transformed the tree, we can combine our initial leaf with the new tree, scaled by a learning rate.
The learning rate is used to scale the contribution of the new tree. This results in a small step in the right direction of prediction. Empirical evidence shows that taking many small steps in the right direction yields better predictions on a testing dataset (i.e., data the model has never seen) than making a perfect prediction in the first step. The learning rate is usually a small number such as 0.1.
We can now calculate the new log(odds) prediction and hence a new probability.
For example, for the first passenger the old tree gives 0.7. The learning rate, which remains the same for all records, is 0.1, and the scaled output of the new tree comes to -0.16. Substituting into the formula, we get the new log(odds).
Similarly, we substitute to find the new log(odds) for each passenger and hence the new probability. Using the new probabilities, we calculate the new residuals.
This process repeats until we have made the maximum number of trees specified or the
residuals get super small.
A Mathematical Understanding
We shall go through each step, one at a time and try to understand them.
y_i is the observed value (0 or 1) and p is the predicted probability.
The goal is to maximize the log-likelihood. Hence, if we use the negative log-likelihood as our loss function, where smaller values represent better-fitting models, then:

L = -[ y_i * log(p) + (1 - y_i) * log(1 - p) ]

Substituting p = e^(log(odds)) / (1 + e^(log(odds))) and simplifying, the loss written in terms of log(odds) becomes:

L = -y_i * log(odds) + log(1 + e^(log(odds)))
Here, y_i is the observed value, L is the loss function, and gamma is the value of log(odds).
We sum the loss function over all observed values. The argmin over gamma means we need to find the log(odds) value that minimizes this sum.
Then we take the derivative of each loss term:
… and so on.
Step 2: for m = 1 to M
(A)
This step asks you to calculate the pseudo-residuals using the given formula. We have already found the loss function to be

L = -y_i * log(odds) + log(1 + e^(log(odds)))

and its derivative with respect to log(odds) to be p - y_i. Hence, the residual is:

r_im = y_i - p
(B) Fit a regression tree to the residual values and create terminal regions
Because the number of leaves per tree is restricted, we might have more than one value in a particular terminal region.
In our first tree m = 1, and j is a unique index for each terminal node, giving R11, R21, and so on.
(C)
For each leaf in the new tree, we calculate gamma, the leaf's output value. The summation runs only over the records that fall into that leaf. In theory we could take the derivative with respect to gamma directly to obtain its value, but that would be extremely tedious given the hefty terms in our loss function, so we approximate the loss with a second-order Taylor expansion instead.
Substituting the loss function and i=1 in the equation above, we get:
There are three terms in our approximation. Taking derivative with respect to gamma gives us:
Equating this to 0 and subtracting the single-derivative term from both sides:
Then, gamma will be equal to:
The gamma equation may look humongous, but in simple terms it is the residual divided by the second derivative of the loss function. Now we solve for that second derivative; after some heavy computation, it comes out to p(1 - p). With the numerator and the denominator simplified, the final gamma solution looks like:

gamma = (y_i - p) / (p * (1 - p))

We were trying to find the value of gamma that, when added to the most recent predicted log(odds), minimizes our loss function. This gamma works when our terminal region has only one residual value and hence one predicted probability. But recall from the example above that, because of the restricted number of leaves in gradient boosting, one terminal region may contain many values. The generalized formula then becomes:

gamma_jm = (sum of residuals in the leaf) / (sum of p * (1 - p) over the same records)
Hence, we have calculated the output values for each leaf in the tree.
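The generalized leaf-output formula can be sketched as follows (the residuals and previous probabilities below are hypothetical):

```python
# Hypothetical leaf: two records fell into the same terminal region.
residuals = [0.33, -0.67]   # y_i - p_i for each record
prev_probs = [0.67, 0.67]   # previous predicted probability p_i

# Leaf output: sum of residuals over sum of p * (1 - p).
gamma = sum(residuals) / sum(p * (1 - p) for p in prev_probs)
print(round(gamma, 3))
```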
(D)
This formula asks us to update our predictions. In the first pass, m = 1, and we substitute F0(x), the common prediction for all samples (i.e., the initial leaf value), and add nu, the learning rate, multiplied by the output value of the tree we built previously. The summation handles the case where a single sample ends up in multiple leaves.
Now we will use this new F1(x) value to get new predictions for each sample.
The new predicted value should get us a little closer to the actual value. Note that, in contrast to the single tree in our walkthrough, gradient boosting builds many trees, and M can be as large as 100 or more.
This completes the loop in Step 2, and we are ready for the final step of Gradient Boosting.
Step 3: Output
If we get new data, we use the final output value to predict whether the passenger survived. It gives us the log(odds) that the person survived; plugging it into the formula

p = e^(log(odds)) / (1 + e^(log(odds)))

gives the probability. If the resulting value lies above our threshold, we predict that the person survived; otherwise, we predict they did not.
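This final conversion from log(odds) to a survival prediction might look like the following (the log(odds) values are made up; 0.5 is the assumed threshold):

```python
import math

def predict_survival(log_odds, threshold=0.5):
    # Convert the ensemble's log(odds) output to a probability (sigmoid)
    # and apply the classification threshold.
    p = 1 / (1 + math.exp(-log_odds))
    return p, p > threshold

print(predict_survival(0.54))   # positive log(odds) -> p > 0.5 -> survived
print(predict_survival(-1.2))   # negative log(odds) -> p < 0.5 -> did not
```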
Comparing and Contrasting AdaBoost and Gradient Boosting
Both AdaBoost and Gradient Boost learn sequentially from a weak set of learners. A strong
learner is obtained from the additive model of these weak learners. The main focus here is to
learn from the shortcomings at each step in the iteration.
AdaBoost requires the user to specify a set of weak learners (alternatively, it can randomly generate a set of weak learners before the real learning process). It increases the weights of wrongly predicted instances and decreases those of correctly predicted instances, so the next weak learner focuses more on the difficult instances. After being trained, the weak learner is added to the strong one according to its performance (its so-called alpha weight): the better it performs, the more it contributes to the strong learner.
On the other hand, gradient boosting doesn’t modify the sample distribution. Instead of training
on a newly sampled distribution, the weak learner trains on the remaining errors of the strong
learner. It is another way to give more importance to the difficult instances. At each iteration,
the pseudo-residuals are computed and a weak learner is fitted to these pseudo-residuals. Then,
the contribution of the weak learner to the strong one isn’t computed according to its
performance on the newly distributed sample but using a gradient descent optimization process.
The computed contribution is the one minimizing the overall error of the strong learner.
AdaBoost is more about 'voting weights', while gradient boosting is more about 'adding gradient optimization'.
AdaBoost:
- An additive model where the shortcomings of previous models are identified by high-weight data points.
- The trees are usually grown as decision stumps.
- Each classifier has a different weight assigned to the final prediction based on its performance.

Gradient Boost:
- An additive model where the shortcomings of previous models are identified by the gradient.
- The trees are grown to a greater depth, usually ranging from 8 to 32 terminal nodes.
- All classifiers are weighted equally; their predictive capacity is restricted with a learning rate to increase accuracy.