
GRADIENT TREE BOOSTING

Gradient tree boosting is also known as Gradient Boosting Machine (GBM) or Gradient Boosted
Regression Tree (GBRT). Gradient boosting is a combination of two procedures: Gradient Descent
and Boosting. AdaBoost was the first boosting algorithm; it works by boosting the weights of the
data instances misclassified by a weak learner and feeding them to subsequent weak learners in a
sequential fashion in order to increase the accuracy of predictions, and the final prediction is a
combination of the individual weak learners' predictions, each weighted by its accuracy.
Gradient boosting, or gradient tree boosting, is also a boosting algorithm which employs weak learners
in a sequential fashion, but it optimizes the loss function, i.e., minimizes the error, at each stage by
moving in the direction opposite to the gradient in order to reach the minimum. The algorithm boosts
on the residual errors made by a weak learner rather than on the weights of misclassified data
instances as the AdaBoost algorithm does.
The general idea of gradient boosting is to learn from mistakes and to reduce the loss function, here
the average of the squared errors.
The general formula for the mean squared error or loss function is:

MSE (Mean Squared Error) = loss = (1/N) ∑_{i=1}^{N} (y_actual - y_predicted)²

MSE is a measure of the average of the squared errors between the actual values and the predicted
values.
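As a quick illustration (not part of the original text), the loss can be computed in a few lines of Python; numpy is assumed to be available:

import numpy as np

def mse(y_actual, y_predicted):
    # Average of the squared differences between actual and predicted values.
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return np.mean((y_actual - y_predicted) ** 2)

For instance, mse([9.5, 8.2], [9.08, 7.98]) returns approximately 0.1124, the average of 0.42² and 0.22².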
The weak learners used in gradient tree boosting are single-split decision trees called decision
stumps. The prediction made by each weak learner is always combined with the predictions made by
the previous weak learners. Gradient boosting then estimates the residual errors of the combined
predictions and trains another weak learner on the estimated residuals. This procedure is followed
iteratively until the loss function is optimized.
There exist several implementations of the gradient tree boosting framework, such as Gradient
Boosting Machines (GBM), eXtreme Gradient Boosting (XGBoost), LightGBM and CatBoost.

Algorithm:
Input: Training data set T with N data instances; number of weak learners M.
Step 1: Train a weak learner (decision stump) on the training data set T. Estimate the
Decision Tree prediction T1 = DT(X1).
Step 2: Compute the residual error of this weak learner, R1 = Y - T1,
where R1 is the difference between the actual target value Y and the target
value T1 predicted by the weak learner.
Step 3: Train another weak learner with the residual error Ri (1 ≤ i < M) as the target variable.
Estimate the Decision Tree prediction Ti+1 = DT(Xi+1).
Step 4: Compute the combined prediction of the weak learners, Ci = Ci-1 + Ti+1,
where C0 = T1 (so C1 = T1 + T2, C2 = C1 + T3, and so on).
Step 5: Compute the residual error of the combined prediction, Ri+1 = Y - Ci.
Step 6: Estimate the Mean Squared Error (MSE) or loss function of Ri+1.
Step 7: Repeat Step 3 to Step 6 until the MSE becomes constant or the number of trees M
we set to train is reached.
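The steps above can be sketched in Python. This is an illustrative sketch rather than the textbook's code: it uses scikit-learn's DecisionTreeRegressor with max_depth=1 as the decision stump, assumes the feature matrix X is already numeric (categorical attributes would need to be encoded first), and uses the squared-error loss:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boosting(X, y, M=3):
    # Steps 1 and 3: fit a decision stump to the current residuals.
    # Step 4: add its prediction to the combined prediction.
    # Steps 5 and 6: recompute the residuals and the MSE.
    y = np.asarray(y, dtype=float)
    combined = np.zeros(len(y))    # combined prediction C
    residual = y.copy()            # before any tree, the residual is Y itself
    stumps = []
    for m in range(M):             # Step 7: repeat for M weak learners
        stump = DecisionTreeRegressor(max_depth=1)   # single-split weak learner
        stump.fit(X, residual)
        combined += stump.predict(X)
        residual = y - combined
        stumps.append(stump)
        print(f"Stage {m + 1}: MSE = {np.mean(residual ** 2):.4f}")
    return stumps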

Example 1: Consider a training dataset of 10 data instances that describes the skills of individual
students with the attributes Interactiveness, Practical Knowledge and Aptitude, as shown in Table
1. The target variable is CGPA, which is a continuous-valued variable. Based on the skill set of a
student, predict the CGPA that can be scored. This is a regression problem.
Table 1: Training Dataset

S.No.  Interactiveness  Practical Knowledge  Aptitude  CGPA
1      Yes              Good                 Good      9.5
2      No               Average              Good      8.2
3      No               Good                 Good      9.1
4      No               Average              Poor      6.8
5      Yes              Good                 Poor      8.5
6      Yes              Good                 Good      9.5
7      Yes              Average              Poor      7.9
8      No               Good                 Good      9.1
9      Yes              Good                 Good      8.8
10     Yes              Average              Poor      9

Solution:

Stage 1:
Train a Decision Stump based on the attribute Practical Knowledge on the training dataset T as shown
in Figure 1. Estimate the Decision Tree prediction T1 = DT(X1).
Figure 1: Decision stump on the attribute Practical Knowledge

Mean of the target variable CGPA for (X1 = Practical Knowledge ∈ {Good}) = (9.5 + 9.1 + 8.5 + 9.5 + 9.1 + 8.8) / 6 = 9.08.
Mean of the target variable CGPA for (X1 = Practical Knowledge ∈ {Average}) = (8.2 + 6.8 + 7.9 + 9) / 4 = 7.98.
Use the decision stump DT(X1) built on Practical Knowledge and estimate the prediction T1.
Calculate the residual error of this decision stump, R1 = Y - T1, as shown in Table 2.

Table 2: Residual Error Estimation

S.No.  Target      X1 = Practical   Tree 1 prediction   Residual Error
       Y = CGPA    Knowledge        T1 = DT(X1)         R1 = Y - T1
1      9.5         Good             9.08                 0.42
2      8.2         Average          7.98                 0.22
3      9.1         Good             9.08                 0.02
4      6.8         Average          7.98                -1.18
5      8.5         Good             9.08                -0.58
6      9.5         Good             9.08                 0.42
7      7.9         Average          7.98                -0.08
8      9.1         Good             9.08                 0.02
9      8.8         Good             9.08                -0.28
10     9           Average          7.98                 1.02

Mean Squared Error or loss of R1 = (1/N) ∑_{i=1}^{N} (y_actual - y_predicted)² is calculated as 0.3256.
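Stage 1 can be reproduced with pandas (an illustrative sketch, not part of the original example): grouping the CGPA values by Practical Knowledge gives the stump's predictions, and the residuals follow directly. The code keeps full precision instead of rounding the group means to 9.08 and 7.98, and its MSE still agrees with the value above to four decimal places:

import pandas as pd

# Training data from Table 1 (CGPA is the target variable).
df = pd.DataFrame({
    "Interactiveness":    ["Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes"],
    "PracticalKnowledge": ["Good", "Average", "Good", "Average", "Good",
                           "Good", "Average", "Good", "Good", "Average"],
    "Aptitude":           ["Good", "Good", "Good", "Poor", "Poor",
                           "Good", "Poor", "Good", "Good", "Poor"],
    "CGPA":               [9.5, 8.2, 9.1, 6.8, 8.5, 9.5, 7.9, 9.1, 8.8, 9.0],
})

# Stage 1: the stump on Practical Knowledge predicts the mean CGPA of each branch.
T1 = df.groupby("PracticalKnowledge")["CGPA"].transform("mean")  # ~9.08 (Good), ~7.98 (Average)
R1 = df["CGPA"] - T1                                             # residual errors of Table 2
print(round((R1 ** 2).mean(), 4))                                # 0.3256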
Stage 2:
Train a Decision Stump based on the attribute Interactiveness on the residual error calculated in the
previous stage as shown in Figure 2.

Figure 2: Decision stump on the attribute Interactiveness

Mean of the residual error R1 for (X2 = Interactiveness ∈ {Yes}) = (0.42 - 0.58 + 0.42 - 0.08 - 0.28 + 1.02) / 6 = 0.153.
Mean of the residual error R1 for (X2 = Interactiveness ∈ {No}) = (0.22 + 0.02 - 1.18 + 0.02) / 4 = -0.23.
Use the decision stump DT(X2) built on Interactiveness, as shown in Figure 2, and estimate the prediction T2.
Calculate the combined prediction C1 = T1 + T2.
Calculate the residual error of the combined prediction by the two weak learners, R2 = Y - C1.

Table 3 shows the combined prediction by the two decision stumps and the estimation of residual
error.

Table 3: Residual Error

S.No.  Target     Tree 1 prediction  Residual     X2 =             Tree 2 prediction  Combined prediction  Residual Error
       Y = CGPA   T1 = DT(X1)        R1 = Y - T1  Interactiveness  T2 = DT(X2)        C1 = T1 + T2         R2 = Y - C1
1      9.5        9.08                0.42        Yes               0.153             9.233                 0.267
2      8.2        7.98                0.22        No               -0.23              7.75                  0.45
3      9.1        9.08                0.02        No               -0.23              8.85                  0.25
4      6.8        7.98               -1.18        No               -0.23              7.75                 -0.95
5      8.5        9.08               -0.58        Yes               0.153             9.233                -0.733
6      9.5        9.08                0.42        Yes               0.153             9.233                 0.267
7      7.9        7.98               -0.08        Yes               0.153             8.133                -0.233
8      9.1        9.08                0.02        No               -0.23              8.85                  0.25
9      8.8        9.08               -0.28        Yes               0.153             9.233                -0.433
10     9          7.98                1.02        Yes               0.153             8.133                 0.867

Mean Squared Error or loss R2 is calculated as 0.2903334.
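Continuing the pandas sketch from Stage 1 (df, T1 and R1 as defined there), Stage 2 groups the residuals by Interactiveness. Because it keeps full precision rather than the rounded 0.153 and -0.23, the resulting MSE comes out at about 0.2906, close to the 0.2903 obtained above with rounded intermediate values:

# Stage 2: the stump on Interactiveness predicts the mean residual of each branch.
T2 = R1.groupby(df["Interactiveness"]).transform("mean")  # ~0.153 (Yes), ~-0.23 (No)
C1 = T1 + T2                                              # combined prediction of the two stumps
R2 = df["CGPA"] - C1                                      # residual errors of Table 3
print(round((R2 ** 2).mean(), 4))                         # ~0.2906 (0.2903 in Table 3)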

Stage 3:
Train a Decision Stump based on the attribute Aptitude on the residual error calculated in Stage 2,
as shown in Figure 3.

Figure 3: Decision stump on the attribute Aptitude

Mean of the residual error R2 for (X3 = Aptitude ∈ {Good}) = (0.267 + 0.25 + 0.267 + 0.25 - 0.433 + 0.45) / 6 = 0.175.
Mean of the residual error R2 for (X3 = Aptitude ∈ {Poor}) = (-0.95 - 0.733 - 0.233 + 0.867) / 4 = -0.262.
Use the decision stump DT(X3) built on Aptitude and estimate the prediction T3.
Calculate the combined prediction C2 = C1 + T3.
Calculate the residual error of the combined prediction by the three weak learners, R3 = Y - C2.

Table 4 shows the combined prediction and the final residual after training with three decision
stumps.

Table 4: Residual Error

S.No.  Target     Combined prediction  Residual     X3 =      Tree 3 prediction  Combined prediction  Final Residual
       Y = CGPA   C1 = T1 + T2         R2 = Y - C1  Aptitude  T3 = DT(X3)        C2 = C1 + T3         R3 = Y - C2
1      9.5        9.233                 0.267       Good       0.175             9.408                 0.092
2      8.2        7.75                  0.45        Good       0.175             7.925                 0.275
3      9.1        8.85                  0.25        Good       0.175             9.025                 0.075
4      6.8        7.75                 -0.95        Poor      -0.262             7.488                -0.688
5      8.5        9.233                -0.733       Poor      -0.262             8.971                -0.471
6      9.5        9.233                 0.267       Good       0.175             9.408                 0.092
7      7.9        8.133                -0.233       Poor      -0.262             7.871                 0.029
8      9.1        8.85                  0.25        Good       0.175             9.025                 0.075
9      8.8        9.233                -0.433       Good       0.175             9.408                -0.608
10     9          8.133                 0.867       Poor      -0.262             7.871                 1.129

Mean Squared Error or loss R3 is calculated as approximately 0.2444.

We can observe that the MSE is reduced as we add the predictions of the weak learners
(0.3256 → 0.2903 → 0.2444). This procedure is followed iteratively, adding a weak learner and
optimizing the loss function at each stage, until the loss becomes constant or is reduced to 0.
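For comparison, the same data can be fed to scikit-learn's GradientBoostingRegressor (an illustrative sketch, reusing the DataFrame df built in the Stage 1 sketch above). The categorical attributes are one-hot encoded with pd.get_dummies; the library also starts from the mean CGPA and picks its own split attributes, so the individual numbers differ slightly from the hand-worked stages, but the training MSE shows the same downward trend as trees are added:

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X = pd.get_dummies(df[["Interactiveness", "PracticalKnowledge", "Aptitude"]])  # one-hot encode
y = df["CGPA"]

for m in (1, 2, 3, 10):
    gbr = GradientBoostingRegressor(n_estimators=m, max_depth=1, learning_rate=1.0)
    gbr.fit(X, y)
    print(m, round(mean_squared_error(y, gbr.predict(X)), 4))  # training MSE shrinks with more trees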

Advantages
It reduces bias and variance.
Disadvantages
It is a greedy algorithm that can overfit the training data.
XGBoost

XGBoost stands for eXtreme Gradient Boosting. It is a tree ensemble model that belongs to the family
of Gradient Boosting Decision Tree (GBDT) models. It is a scalable tree boosting system that
combines hardware optimizations and algorithmic optimizations to provide computational speed
and high performance with optimized usage of resources. The model improves on gradient boosting
by adding regularization to avoid overfitting and bias, and it is provided as an open-source,
end-to-end, scalable tree boosting system. The algorithm has become very popular in Kaggle
competitions because of its high performance and scalability. It solves both regression and
classification problems on real-world data sets with many instances and features.
Some of the important features added to XGBoost that make the model perform better than other
boosting algorithms are:
• Regularization
Regularization has been added to the boosting model to smooth the final learnt weights and
avoid over-fitting. It penalizes more complex models through both LASSO (L1) and Ridge
(L2) regularization.

• In-built cross validation
Built-in cross validation at each boosting iteration removes the need for a separate, explicit
cross-validation step and further improves performance.

• Post pruning
The model employs post pruning of trees using depth-first search (DFS). It allows a tree to grow
to the maximum depth and then prunes it backward by recursively removing leaf nodes with
negative gain.

It adds two algorithmic optimizations to gradient boosting:

• Weighted Approximate Quantile Sketch, a procedure that handles weighted instances for
approximate tree learning by finding optimal candidate split points during tree construction.

• Sparsity-aware split finding, an algorithm that handles missing data values or sparse features
more efficiently.

It also supports hardware optimizations such as:

• Out-of-core computing
This supports handling huge datasets that do not fit into main memory by optimizing the use of
the available disk space when processing large amounts of data.

• Parallelization
It parallelizes the tree construction process across multiple CPU cores during training.

• Cache awareness
Cache optimization is provided by allocating internal buffers in each thread to store gradient
statistics.

• Distributed computing
It uses a cluster of machines for training very large models.
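As an illustration of how the system is typically used (not from the original text), the sketch below calls XGBoost's scikit-learn-style interface. The xgboost package is assumed to be installed, X_train, y_train and X_test are placeholders for your own numeric data, and the parameter values are arbitrary examples:

import xgboost as xgb

# reg_alpha and reg_lambda are the L1 (LASSO) and L2 (Ridge) penalties discussed above.
model = xgb.XGBRegressor(
    n_estimators=100,    # number of boosted trees
    max_depth=3,         # maximum depth of each tree before pruning
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    reg_alpha=0.0,       # L1 regularization on leaf weights
    reg_lambda=1.0,      # L2 regularization on leaf weights
)
model.fit(X_train, y_train)          # X_train, y_train: placeholder training data
predictions = model.predict(X_test)  # X_test: placeholder test features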

SUMMARY

1. Gradient tree boosting, also known as Gradient Boosting Machine (GBM) or Gradient
Boosted Regression Tree (GBRT), is a boosting algorithm which employs weak learners in a
sequential fashion but optimizes the loss function, or minimizes the error, at each stage by
moving in the direction opposite to the gradient to reach the minimum.
2. XGBoost, which stands for eXtreme Gradient Boosting, is a tree ensemble model that belongs
to the family of Gradient Boosting Decision Tree (GBDT) models.
3. XGBoost is a scalable tree boosting system that combines hardware optimizations and
algorithmic optimizations to provide computational speed and high performance with
optimized usage of resources.
