Gradient Boosting in ML
Gradient Boosting is a popular boosting algorithm in machine learning used for classification and
regression tasks. Boosting is an ensemble learning method that trains models sequentially, with each
new model trying to correct the errors of the previous one, so that several weak learners combine
into a strong learner. The two most popular boosting algorithms are:
1. AdaBoost
2. Gradient Boosting
Gradient Boosting
Gradient Boosting is a powerful boosting algorithm that combines several weak learners into a strong
learner. Each new model is trained to minimize a loss function, such as mean squared error or
cross-entropy, of the current ensemble using gradient descent. In each iteration, the algorithm
computes the gradient of the loss function with respect to the predictions of the current ensemble
and then trains a new weak model to predict this negative gradient. The predictions of the new model
are then added to the ensemble, and the process is repeated until a stopping criterion is met.
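For regression with squared-error loss, this negative gradient is simply the residual y - F(x), so the whole loop fits in a few lines. The sketch below is only an illustration of the idea, not code from this article; it assumes scikit-learn's DecisionTreeRegressor as the weak learner, and the function names and default parameters (gradient_boost_fit, n_stages, learning_rate) are made up for the example.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Fit a gradient-boosted tree ensemble for squared-error regression."""
    # Initial prediction: the constant that minimizes squared error is the mean.
    f0 = float(np.mean(y))
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        # For squared error, the negative gradient with respect to the current
        # predictions is just the residual y - F(x).
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Shrink the new tree's contribution by the learning rate and
        # add it to the ensemble.
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # Start from the constant prediction and add each shrunken tree.
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred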
In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each
predictor is trained using the residual errors of its predecessor as labels. A widely used variant of
this technique is Gradient Boosted Trees, whose base learner is CART (Classification and Regression
Trees). The following walk-through shows how gradient-boosted trees are trained for regression
problems.
The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the labels y. Its
predictions, y1(hat), are used to compute the training-set residual errors r1. Tree2 is then trained
using the feature matrix X and the residual errors r1 of Tree1 as labels. Its predictions, r1(hat),
are then used to determine the residuals r2. The process is repeated until all M trees forming the
ensemble are trained. An important parameter in this technique is shrinkage: the prediction of each
tree in the ensemble is shrunk by multiplying it by the learning rate (eta), which ranges between 0
and 1. There is a trade-off between eta and the number of estimators: a lower learning rate must be
compensated by more estimators to reach a given level of model performance. Once all the trees are
trained, predictions can be made. Each tree predicts a label and the final prediction is given by the
formula

y(pred) = y1(hat) + eta * r1(hat) + eta * r2(hat) + ... + eta * rM-1(hat)
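The trade-off between eta and the number of estimators can be seen directly with scikit-learn's GradientBoostingRegressor. The snippet below is a small illustration on a synthetic dataset; the particular parameter pairs and the make_regression settings are arbitrary choices for the example, not values from the article.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A smaller learning rate (eta) generally needs more trees to reach comparable
# accuracy, while a larger eta needs fewer trees but can overfit sooner.
for eta, n_trees in [(0.5, 50), (0.1, 250), (0.05, 500)]:
    model = GradientBoostingRegressor(learning_rate=eta, n_estimators=n_trees,
                                      max_depth=3, random_state=42)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"eta={eta}, trees={n_trees}, test MSE={mse:.2f}")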
AdaBoost vs Gradient Boosting
AdaBoost: during each iteration, the weights of incorrectly classified samples are increased, so that
the next weak learner focuses more on these samples. AdaBoost is more susceptible to noise and
outliers in the data, as it assigns high weights to misclassified samples.
Gradient Boosting: instead of reweighting samples, each iteration computes the negative gradient of
the loss function with respect to the predicted output and fits the next learner to it. Gradient
Boosting is generally more robust, as these gradient-based updates are less sensitive to outliers.
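Both algorithms are available in scikit-learn with very similar interfaces (AdaBoostClassifier and GradientBoostingClassifier), which makes a quick empirical comparison easy. The snippet below is only a sketch: the synthetic dataset, the amount of label noise (flip_y) and the hyperparameters are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# A noisy toy classification problem (flip_y adds label noise).
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)

# AdaBoost reweights the training samples; Gradient Boosting fits each new
# tree to the negative gradient of the loss instead.
ada = AdaBoostClassifier(n_estimators=200, random_state=0)
gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0)

print("AdaBoost CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
print("Gradient Boosting CV accuracy:", cross_val_score(gbc, X, y, cv=5).mean())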
Gradient Boosting Algorithm
Step 1:
Let's assume X and Y are the input and target, having N samples. Our goal is to learn a function f(x)
that maps the input features X to the target variable y. The model is a sum of trees, i.e. boosted
trees. For regression, the loss function measures the difference between the actual and the predicted
values, for example the squared error

L(f) = sum over i = 1..N of (y_i - f(x_i))^2

and we want the f that minimizes this loss. If our gradient boosting algorithm runs for M stages,
then to improve the current model F_m the algorithm can add a new estimator h_m, with 1 <= m <= M,
giving

F_{m+1}(x_i) = F_m(x_i) + h_m(x_i)
Step 4: Solution
For the squared-error loss, the negative gradient of the loss with respect to the current prediction
F_m(x_i) is simply the residual y_i - F_m(x_i), so the new estimator h_m is fit to these residuals.
Similarly, applying the update for all M trees gives the final model

F_M(x) = F_{M-1}(x) + h_M(x)
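For reference, the generic procedure for an arbitrary differentiable loss L can be written compactly as follows. This is the standard textbook form of gradient boosting (with eta as the shrinkage factor), given here as a summary rather than taken verbatim from the steps above.

\begin{align*}
F_0(x) &= \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma) && \text{initialize with a constant} \\
r_{im} &= -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} && \text{pseudo-residuals, } i = 1, \dots, N \\
h_m &\;\text{is a weak learner fit to } \{(x_i, r_{im})\}_{i=1}^{N} && \\
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{N} L\bigl(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr) && \text{line search} \\
F_m(x) &= F_{m-1}(x) + \eta\, \gamma_m\, h_m(x) && m = 1, \dots, M
\end{align*}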