Boosting: An Ensemble Learning Technique

Kaleab Tadesse
FTP0848/14
Addis Ababa Science and Technology University
Email: [email protected]

INTRODUCTION

Boosting is an ensemble learning method. It involves combining a set of low-accuracy classifiers to create a highly accurate classifier. The core idea is to build models sequentially, where each subsequent model attempts to correct the errors of the model before it. This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first, and so on, adding models until the training set is predicted perfectly or a maximum number of models is reached. Boosting is primarily a bias reduction technique; in contrast, bagging (as in Random Forests) is primarily a variance reduction technique. While boosting reduces bias, boosting too much will eventually increase variance. Highly accurate classifiers produced by boosting can achieve an error rate close to 0. Boosting algorithms can track which base models failed to predict accurately, and the combined model is less affected by overfitting than the individual weak models.

Some of the boosting techniques used to train ML models are:

1. AdaBoost (Adaptive Boosting)

• AdaBoost is based on the observation that finding many rough rules of thumb can be easier than finding a single highly accurate rule. It is a general method for improving the accuracy of any given learning algorithm.
• The algorithm calls a base learning algorithm repeatedly, each time feeding it a different subset of training examples or, more precisely, a different distribution or weighting over the training examples.
• The key idea is to maintain a distribution or set of weights over the training set. On each round, the weights of incorrectly classified examples are increased, forcing the base learner to focus on the "hard" examples.
• Different models are obtained by reweighting the training data every iteration. This process aims to reduce underfitting by focusing on the "hard" training examples.
• After each round, AdaBoost chooses a parameter (αt) that measures the importance assigned to the base classifier produced in that round.
• The final or combined classifier is a weighted majority vote of the base classifiers, with (αt) being the weight assigned to the classifier (ht).
• Base models used in AdaBoost should be simple, so that different instance weights lead to different models. Each simple model acts as an "expert" on some parts of the data.
• The final AdaBoost model is an additive model, where predictions are the sum of base model predictions, and each base model receives a unique weight related to its weighted error rate.
• AdaBoost works by reducing the margin of training examples, especially those with the smallest margins. Larger margins on the training set translate into a better upper bound on generalization error.
• Bias-variance analysis: AdaBoost reduces bias (it targets underfitting problems).
• There is a close connection between AdaBoost and logistic regression: AdaBoost minimizes an exponential loss function, which is an upper bound on the logistic loss minimized in logistic regression.
• AdaBoost has practical advantages: it is fast, simple, and easy to program. It generally has no parameters to tune except the number of rounds, requires no prior knowledge about the base learner, and can be combined flexibly with various methods.
• AdaBoost can identify outliers, i.e., examples that are mislabeled or inherently ambiguous, by focusing weight on the hardest examples. However, it can be susceptible to noise when the number of outliers is large. Variants such as Gentle AdaBoost and BrownBoost exist to handle noisy data and de-emphasize outliers.
• AdaBoost can be extended to multiclass classification using methods such as AdaBoost.M1, AdaBoost.MH, and AdaBoost.M2. Some variants use real-valued outputs, which can speed up boosting.
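To make the reweighting scheme above concrete, here is a minimal sketch of discrete AdaBoost for binary labels in {-1, +1}, using scikit-learn decision stumps as the simple base models. The function names and the 50-round default are illustrative assumptions, not anything prescribed above.

```python
# A minimal sketch of discrete AdaBoost for binary labels y in {-1, +1}.
# Names (adaboost_fit, adaboost_predict) and defaults are illustrative choices.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                 # distribution over the training set
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # a simple "rule of thumb"
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)  # importance of this round's classifier
        w = w * np.exp(-alpha * y * pred)      # raise weights of misclassified examples
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # weighted majority vote of the base classifiers
    score = sum(alpha * stump.predict(X) for stump, alpha in zip(stumps, alphas))
    return np.sign(score)
```

Each round the misclassified examples gain weight, and the returned (αt) values are exactly the per-round importances used in the final weighted majority vote.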
2. Gradient Boosting

• Gradient Boosting is an ensemble method where models are built sequentially, with each model fixing the remaining mistakes of the previous ones.
• In each iteration, the task of the new model is to predict the residual error of the current ensemble's prediction.
• Pseudo-residuals are computed based on a differentiable loss function, such as least squares loss for regression or log loss for classification.
• The algorithm uses a gradient descent approach: predictions are updated step by step until convergence.
• The final model is an additive model, with predictions being the sum of the base model predictions.
• A learning rate is often used, which scales the contribution of each new model. Small updates with a learning rate typically work better as they reduce variance.
• Base models should generally be low variance and flexible enough to accurately predict the residuals, such as decision trees of depth 2-5.
• For regression using square loss, the pseudo-residuals are simply the prediction errors. A new regression tree is fitted to these errors.
• For classification using log loss, the base models predict the probability of the positive class. The pseudo-residuals are the difference between the true class (0 or 1) and the predicted probability.
• Bias-variance analysis: Gradient Boosting is highly effective at reducing bias error. Like other boosting methods, boosting too much can eventually increase variance.
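As an illustration of the square-loss case described above, the following minimal sketch fits a small regression tree to the current prediction errors in each round and adds its contribution scaled by the learning rate. Names and default settings are illustrative assumptions.

```python
# A minimal gradient boosting sketch for regression with square loss, where the
# pseudo-residuals are simply the current prediction errors.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    y = np.asarray(y, dtype=float)
    f0 = y.mean()                          # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred               # pseudo-residuals for square loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # new base model targets the remaining error
        pred = pred + learning_rate * tree.predict(X)   # small, shrunken update
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # additive model: initial guess plus the sum of scaled base model predictions
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```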
3. Extreme Gradient Boosting (XGBoost)

• It is a variant of gradient tree boosting.
• The most significant factor behind XGBoost's success is its scalability in all scenarios. It can run much faster than existing solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. It can process terabyte-size datasets.
• Factors that contribute to its scalability:
  – A sparsity-aware algorithm to handle sparse data efficiently. It adds a default direction to tree nodes for missing values and processes only non-missing entries, making complexity linear in the number of non-missing entries.
  – A theoretically justified weighted quantile sketch for efficient approximate tree learning. This method proposes candidate split points based on percentiles of feature values, weighted by the second-order gradient statistics (hi). This is a novel approach for handling weighted data in quantile computation.
  – An effective cache-aware block structure for parallel and out-of-core learning. Data is stored in blocks in a compressed column (CSC) format, sorted by feature value. This structure reduces sorting costs and optimizes split finding. Cache-aware prefetching is used for the exact greedy algorithm to improve performance on large datasets, and block size is tuned for the approximate algorithm to balance parallelization and cache performance.
  – Support for out-of-core computation by dividing data into blocks stored on disk, using independent pre-fetching threads to overlap computation and disk reading. Techniques like block compression and sharding data onto multiple disks are used to improve disk I/O throughput.
• XGBoost incorporates a regularized learning objective beyond traditional gradient boosting, which penalizes the complexity of the tree models (number of leaves and L2 norm of leaf weights) to help prevent overfitting.
• It also utilizes shrinkage and feature (column) subsampling to further prevent overfitting. Column subsampling can also speed up computation.
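As a usage-level sketch, the snippet below shows how these ideas surface in the xgboost Python package's scikit-learn style interface. The synthetic data and all parameter values are arbitrary placeholders; reg_lambda, colsample_bytree, subsample, and learning_rate map to the regularized objective, column subsampling, row subsampling, and shrinkage discussed above.

```python
# A hedged usage sketch with the xgboost package's scikit-learn interface.
# The synthetic data and every parameter value below are arbitrary placeholders.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1000)
X[rng.random(X.shape) < 0.1] = np.nan    # leave missing entries as NaN; the
                                         # sparsity-aware splits learn a default
                                         # direction for them

model = xgb.XGBRegressor(
    n_estimators=200,        # number of boosting rounds
    max_depth=4,             # shallow trees as base models
    learning_rate=0.1,       # shrinkage on each tree's contribution
    reg_lambda=1.0,          # L2 penalty on leaf weights (regularized objective)
    colsample_bytree=0.8,    # feature (column) subsampling
    subsample=0.8,           # row subsampling
)
model.fit(X, y)
print(model.predict(X[:5]))
```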
SUMMARY

In summary, Boosting is a powerful ensemble learning technique that constructs a strong predictive model by sequentially combining predictions from multiple simpler, typically low-performing base models. The fundamental principle involves fitting each subsequent model to correct the errors or residual errors made by the models before it in the sequence, a process that primarily aims to reduce bias. Key boosting algorithms include AdaBoost, which adaptively recomputes weights to focus on previously misclassified examples, and Gradient Boosting, which builds models to predict the negative gradient of a loss function with respect to the current ensemble's prediction. Advanced implementations like XGBoost enhance gradient boosting with optimizations for scalability, sparsity handling, and regularization to achieve high performance efficiently. Ultimately, this iterative error correction process enables boosting methods to yield improved accuracy compared to individual base models.

REFERENCES

[1] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16. ACM, Aug. 2016, pp. 785–794. [Online]. Available: http://dx.doi.org/10.1145/2939672.2939785
[2] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001. [Online]. Available: http://www.jstor.org/stable/2699986
