2.4 Ensemble methods: lecture notes
J.A. Matos
Contents
1 Motivation
   1.1 Setup
   1.2 Ensemble models
   1.3 Combining homogeneous classifiers
3 Random Trees
   3.1 Random Forests
   3.2 Extremely Randomized Trees
   3.3 Feature importance evaluation
   3.4 Implementation
4 Boosting methods
   4.1 Idea
   4.2 AdaBoost
   4.3 Gradient boosting
   4.4 Regularization
   4.5 Implementation
   4.6 Perturbation of the test examples
1 Motivation
1.1 Setup
Basic idea
The main idea behind any multiple (predictive) model is based on the observation that different
learning algorithms explore different regions of the hypothesis space and therefore tend to make different errors on the same data.
Definition
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain
better predictive performance than could be obtained from any of the constituent learning algo-
rithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a
machine learning ensemble consists of only a concrete finite set of alternative models, but typically
allows for much more flexible structure to exist among those alternatives.
From https://en.wikipedia.org/wiki/Ensemble_learning
Error correlation
Given a set of $n$ classifiers, let $\hat{f}_i(x)$ denote the prediction of classifier $i$, for $i = 1, \dots, n$, and let $f(x)$ denote the true class of record $x$.
We define the error correlation between classifiers $i$ and $j$ as
$$\varphi_{ij} = p\!\left(\hat{f}_i(x) = \hat{f}_j(x) \,\middle|\, \hat{f}_i(x) \neq f(x) \vee \hat{f}_j(x) \neq f(x)\right).$$
This is the probability that both classifiers commit the same error, given that at least one of them fails.
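As a rough illustration, $\varphi_{ij}$ can be estimated empirically from the predictions of two fitted classifiers on a validation set. The sketch below is only a minimal example; the function and variable names (and the classifiers clf_a, clf_b) are illustrative, not part of the notes.

```python
import numpy as np

def error_correlation(pred_i, pred_j, y_true):
    """Estimate phi_ij = P(f_i == f_j | f_i != y or f_j != y) on a sample."""
    pred_i, pred_j, y_true = map(np.asarray, (pred_i, pred_j, y_true))
    # records on which at least one of the two classifiers fails
    at_least_one_wrong = (pred_i != y_true) | (pred_j != y_true)
    if not at_least_one_wrong.any():
        return 0.0  # neither classifier ever fails on this sample
    # among those records, the fraction where both classifiers agree
    # (if they agree and one is wrong, they committed the same error)
    return float(np.mean(pred_i[at_least_one_wrong] == pred_j[at_least_one_wrong]))

# hypothetical usage with two already fitted classifiers clf_a and clf_b:
# phi = error_correlation(clf_a.predict(X_val), clf_b.predict(X_val), y_val)
```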
Purpose
• Combine models that produce uncorrelated errors or, preferably, negatively correlated errors;
• Each model should perform better than random guessing.
• In averaging methods, the driving principle is to build several estimators independently and
then to average their predictions. On average, the combined estimator is usually better than
any single base estimator because its variance is reduced.
• By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce
the bias of the combined estimator. The motivation is to combine several weak models to
produce a powerful ensemble.
Classification vs regression
Here we will discuss mainly classification models, but the same ideas apply equally well to regression models.
To understand how the relation works, recall the first method that we studied (k-nearest neighbours) and how regression is implemented there compared with classification. E.g., the base models can be combined by averaging their outputs (for regression) or by voting (for classification).
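To make this concrete, here is a minimal sketch (names and array shapes are only illustrative) of combining the predictions of several already fitted models by voting or averaging:

```python
import numpy as np

def majority_vote(predictions):
    """predictions: array of shape (n_models, n_samples) with integer-encoded class labels."""
    predictions = np.asarray(predictions)
    # for each sample (column), pick the most frequent label among the models
    return np.array([np.bincount(column).argmax() for column in predictions.T])

def average_outputs(predictions):
    """predictions: array of shape (n_models, n_samples) with real-valued outputs."""
    # simple average across the models, one value per sample
    return np.mean(np.asarray(predictions), axis=0)
```

scikit-learn wraps essentially this logic in the VotingClassifier and VotingRegressor meta-estimators mentioned at the end of these notes.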
Optimization methods
It should be noted that, by construction, all the methods presented here are optimization methods.
Note
The ensemble methods are optimization methods even if the underlying methods are not.
E.g., consider an ensemble of nearest-neighbour methods. The ensemble is an optimization
method although the nearest-neighbour models themselves are not.
• One of the ways to ensure that we generate different models is by manipulating the training dataset (see the bootstrap sketch after this list).
• The learning algorithm is run several times using different distributions of the training dataset.
This works well if the algorithm is unstable, i.e. an algorithm where a small change in the training
dataset can induce a large change in the output representation (as we saw with Decision
Trees, where the tree structure can change a lot with small changes in the input), though not necessarily in the
output prediction.
• Injecting Randomness;
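As a small illustration of manipulating the training dataset, the sketch below draws bootstrap samples (sampling with replacement), so that each run of the learning algorithm sees a slightly different distribution of the data; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    """Draw a bootstrap sample: same size as the data, indices drawn with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]  # X and y are assumed to be NumPy arrays

# each call yields a different training set; an unstable learner
# (e.g. a decision tree) trained on each sample can therefore differ a lot
```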
Definition
Bootstrap aggregating, also called bagging (from bootstrap aggregating), is a machine learning ensemble meta-algorithm
designed to improve the stability and accuracy of machine learning algorithms used in statistical
classification and regression. It also reduces variance and helps to avoid overfitting. Although it
is usually applied to decision tree methods, it can be used with any type of method. Bagging is a
special case of the model averaging approach.
From https://en.wikipedia.org/wiki/Bootstrap_aggregating
Bagging
2.2 Implementation
Python Implementation
This can be used for several classes of estimators that we have already used, see:
• https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator
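A minimal sketch of the bagging meta-estimator described on that page, wrapping a k-nearest-neighbours base estimator (the parameter values are only illustrative; scikit-learn >= 1.2 is assumed, older versions use base_estimator instead of estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 20 k-NN models, each trained on a random 50% subset of the training samples
bagging = BaggingClassifier(
    estimator=KNeighborsClassifier(),
    n_estimators=20,
    max_samples=0.5,
    random_state=0,
)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))
```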
Advantages:
• Many weak learners aggregated together typically outperform a single learner over the entire set, and
overfit less;
• Reduces variance for high-variance, low-bias weak learners, which can improve statistical
efficiency;
• Can be performed in parallel, as each separate bootstrap sample can be processed on its own before
combination.
Disadvantages:
• For a weak learner with high bias, bagging will also carry that high bias into its aggregate;
3 Random Trees
3.1 Random Forests
Random forests
One of the main methods that applies bagging is Random Forests.
This extension combines bagging with:
• random selection of features, in order to construct a collection of decision trees with controlled
variance.
https://en.wikipedia.org/wiki/Random_forest
Example
The Random Forest model uses bagging with decision trees, which are models with high variance.
It also makes a random selection of features to grow each tree (injecting randomness). Several such random
trees make a Random Forest.
3.4 Implementation
Python
In random forests (see RandomForestClassifier and RandomForestRegressor classes), each
tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from
the training set.
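A minimal usage sketch of these classes (the data set and parameter values are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 100 trees, each grown on a bootstrap sample, with a random subset of
# the features considered at every split
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```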
4 Boosting methods
4.1 Idea
Boosting
4.2 AdaBoost
Example: AdaBoost
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only
slightly better than random guessing, such as small decision trees) on repeatedly modified versions
of the data. The predictions from all of them are then combined through a weighted majority vote
(or sum) to produce the final prediction. The data modifications at each so-called boosting iteration
consist of applying weights $w_1, w_2, \dots, w_N$ to each of the training samples. Initially, those weights are all set
to $w_i = 1/N$, so that the first step simply trains a weak learner on the original data. For each successive
iteration, the sample weights are individually modified and the learning algorithm is reapplied to
the reweighted data. At a given step, those training examples that were incorrectly predicted by the
boosted model induced at the previous step have their weights increased, whereas the weights are
decreased for those that were predicted correctly. As iterations proceed, examples that are difficult
to predict receive ever-increasing influence.
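A minimal sketch using scikit-learn's AdaBoostClassifier with a depth-1 decision tree (a "decision stump") as the weak learner; the parameter values are only illustrative, and scikit-learn >= 1.2 is assumed for the estimator keyword:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 boosting iterations; at each one the sample weights are updated
# and a new stump is fitted to the reweighted data
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```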
4.3 Gradient boosting
General idea
Let us consider a gradient boosting algorithm with $M$ stages. At each stage $m$ ($1 \le m \le M$)
of gradient boosting, suppose some imperfect model $F_m$ (for low $m$, this model may simply return
$\hat{y}_i = \bar{y}$, where the right-hand side is the mean of $y$). In order to improve $F_m$, our algorithm should
add some new estimator, $h_m(x)$. Thus,
$$F_{m+1}(x) = F_m(x) + h_m(x) = y,$$
or, equivalently,
$$h_m(x) = y - F_m(x).$$
Therefore, gradient boosting will fit $h_m$ to the residual $y - F_m(x)$. As in other boosting variants,
each $F_{m+1}$ attempts to correct the errors of its predecessor $F_m$. A generalization of this idea to
loss functions other than squared error, and to classification and ranking problems, follows from
the observation that the residuals $h_m(x)$ for a given model are proportional to the negative
gradients of the mean squared error (MSE) loss function (with respect to $F(x)$):
$$L_{\mathrm{MSE}} = \frac{1}{n}\bigl(y - F(x)\bigr)^2,$$
$$-\frac{\partial L_{\mathrm{MSE}}}{\partial F} = \frac{2}{n}\bigl(y - F(x)\bigr) = \frac{2}{n}\,h_m(x).$$
So, gradient boosting could be specialized to a gradient descent algorithm, and generalizing it entails
"plugging in" a different loss and its gradient.
We are usually given a training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$ of known sample values of $x$ and
corresponding values of $y$. In accordance with the empirical risk minimization principle, the
method tries to find an approximation $\hat{F}(x)$ that minimizes the average value of the loss function
on the training set, i.e., minimizes the empirical risk. It does so by starting with a model
consisting of a constant function $F_0(x)$, and incrementally expands it in a greedy fashion:
$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma),$$
$$F_m(x) = F_{m-1}(x) + \arg\min_{h_m \in H} \left[ \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) + h_m(x_i)\bigr) \right],$$
where $h_m \in H$ is a base learner function.
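To make the derivation concrete, here is a from-scratch sketch of gradient boosting for regression with the squared-error loss, where each stage fits a small tree to the current residuals (all names and parameter values are only illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, learning_rate=0.1):
    # F_0(x): for the squared-error loss the minimizing constant is the mean of y
    f0 = float(np.mean(y))
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residual = y - prediction      # negative gradient of the MSE loss (up to a factor 2/n)
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residual)          # h_m approximates the residuals
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # the same learning_rate used at fit time must be used here
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```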
4.4 Regularization
Regularization
Fitting the training set too closely can lead to degradation of the model's generalization ability.
Several so-called regularization techniques reduce this overfitting effect by constraining the fitting
procedure.
• One natural regularization parameter is the number of gradient boosting iterations $M$ (i.e.
the number of trees in the model when the base learner is a decision tree). Increasing $M$
reduces the error on the training set, but setting it too high may lead to overfitting. An optimal
value of $M$ is often selected by monitoring the prediction error on a separate validation data set.
Besides controlling $M$, several other regularization techniques are used.
• Another regularization parameter is the depth of the trees. The higher this value, the more
likely the model is to overfit the training data. Both parameters are illustrated in the sketch below.
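A sketch of how these two parameters appear in scikit-learn's gradient boosting estimator, with the number of iterations limited by early stopping on an internal validation split (the parameter values are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# n_estimators is the upper bound on M, max_depth is the depth of each tree;
# training stops early once the score on a 20% validation split has not
# improved for 10 consecutive iterations
gb = GradientBoostingClassifier(
    n_estimators=500,
    max_depth=3,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
)
gb.fit(X, y)
print(gb.n_estimators_)  # number of boosting stages actually used
```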
4.5 Implementation
Implementations
Gradient boosting usually gives very good results and is one of the most used methods in machine learning
problems, together with neural networks and random forests. In Python there are two well-regarded
implementations:
scikit-learn https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosted-trees
Some of those algorithms support missing values and categorical data, removing the need for additional preprocessing such as imputation (see the sketch below).
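For instance, scikit-learn's histogram-based gradient boosting estimators accept missing values encoded as NaN directly; a minimal sketch (the toy data are illustrative, and scikit-learn >= 1.0 is assumed):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# toy data with a missing value: no imputation step is needed
X = np.array([[0.0, 1.0], [1.0, np.nan], [2.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 1])

clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(clf.predict(X))
```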
Individual decision trees intrinsically perform feature selection by selecting appropriate split
points. This information can be used to measure the importance of each feature; the basic idea
is: the more often a feature is used in the split points of a tree, the more important that feature
is. This notion of importance can be extended to decision tree ensembles by simply averaging the
impurity-based feature importance of each tree (a short sketch follows the links below).
• https://scikit-learn.org/stable/modules/ensemble.html#interpretation-with-feature-importance
• https://scikit-learn.org/stable/modules/ensemble.html#random-forest-feature-importance
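A minimal sketch of reading the impurity-based importances from a fitted forest, as described on the pages above (the data set is only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# impurity-based importance of each feature, averaged over the trees
for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```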
For combining heterogeneous models, scikit-learn also provides voting-based meta-estimators (a usage sketch follows):
• Classification: VotingClassifier
• Regression: VotingRegressor
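A minimal usage sketch of VotingClassifier, combining heterogeneous models by majority ("hard") vote; the choice of base models is only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# hard voting (the default): each model gets one vote, the majority class wins
voting = VotingClassifier(estimators=[
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(voting, X, y, cv=5).mean())
```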