
Ensemble methods

J.A. Matos

Modelling and Data Analysis II


Master in Finance (FEP)

Contents
1 Motivation
  1.1 Setup
  1.2 Ensemble models
  1.3 Combining homogeneous classifiers

2 Bootstrap Aggregating - Bagging
  2.1 Introduction
  2.2 Implementation
  2.3 Advantages and disadvantages

3 Random Trees
  3.1 Random Forests
  3.2 Extremely Randomized Trees
  3.3 Feature importance evaluation
  3.4 Implementation

4 Boosting methods
  4.1 Idea
  4.2 AdaBoost
  4.3 Gradient boosting
  4.4 Regularization
  4.5 Implementation
  4.6 Perturbation of the test examples

5 Combining Heterogeneous Estimators
  5.1 Stacked generalization
  5.2 Voting methods

1 Motivation
1.1 Setup
Basic idea
The main idea behind any multiple (predictive) model approach is based on the observation that
different learning algorithms explore:

• different representation languages;

• different search spaces;

• different hypothesis assessment functions.

Can we take advantage of these differences?


The goal of ensemble methods is to combine the predictions of several base estimators built with
a given learning algorithm in order to improve generalizability/robustness over a single estimator.

Definition
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain
better predictive performance than could be obtained from any of the constituent learning algo-
rithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a
machine learning ensemble consists of only a concrete finite set of alternative models, but typically
allows for much more flexible structure to exist among those alternatives.
From https://en.wikipedia.org/wiki/Ensemble_learning

Error correlation
Given a set of classifiers, let the prediction of classifier i be f̂i(x) = y, i = 1, . . . , n, and let f(x) = y denote the true class of record x.
We define the error correlation between classifiers as

    φij = p( f̂i(x) = f̂j(x) | f̂i(x) ≠ f(x) ∨ f̂j(x) ≠ f(x) ).

This is the probability that both classifiers make the same error, given that at least one of them fails.
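
As a concrete illustration, here is a minimal sketch (not part of the original notes) that estimates φij empirically from the predictions of two classifiers on a labelled sample; the function name and the toy data are purely illustrative.

```python
# A minimal sketch: empirical estimate of the error correlation phi_ij
# between two classifiers, given their predictions and the true labels.
import numpy as np

def error_correlation(pred_i, pred_j, y_true):
    """Estimate p( f_i(x) == f_j(x) | f_i(x) != f(x) or f_j(x) != f(x) ) on a sample."""
    pred_i, pred_j, y_true = map(np.asarray, (pred_i, pred_j, y_true))
    at_least_one_wrong = (pred_i != y_true) | (pred_j != y_true)
    if not at_least_one_wrong.any():
        return 0.0  # neither classifier ever fails on this sample
    return float((pred_i == pred_j)[at_least_one_wrong].mean())

# Two classifiers that tend to fail on the same records -> high error correlation
y_true = np.array([0, 1, 1, 0, 1, 0])
pred_a = np.array([0, 1, 0, 0, 0, 0])
pred_b = np.array([0, 1, 0, 0, 0, 1])
print(error_correlation(pred_a, pred_b, y_true))  # agrees on 2 of the 3 "failure" records
```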

Purpose

• Combine models that produce uncorrelated errors or, preferably, negatively correlated errors;

• Each model should perform better than random guessing.


1.2 Ensemble models


Ensemble methods
Two families of ensemble methods are usually distinguished:

• In averaging methods, the driving principle is to build several estimators independently and
then to average their predictions. On average, the combined estimator is usually better than
any single base estimator because its variance is reduced.

• By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce
the bias of the combined estimator. The motivation is to combine several weak models to
produce a powerful ensemble.

Classification vs regression
Here we will discuss mainly classification models, but the same ideas apply equally well to regression models.
To see how the relation works, recall the first method that we studied (k-nearest neighbours) and how regression is implemented compared with classification: the individual models can be combined by averaging the outputs (for regression) or by voting (for classification).

Optimization methods
It should be noted that, by construction, all the methods presented here are optimization methods.
Note
Ensemble methods are optimization methods even if the underlying methods are not.
E.g., consider an ensemble of nearest-neighbour models: the ensemble is an optimization
method although nearest neighbours itself is not.

1.3 Combining homogeneous classifiers


Motivation
Here we combine models generated by a single algorithm.
Diversity is one of the requirements when using multiple models:

• One of the ways to ensure that we generate different models is by manipulating the training dataset.

• The learning algorithm is run several times using different distributions of the training dataset.

This works well if the algorithm is unstable, i.e. if a small change in the training dataset can induce a large change in the output representation (as we saw with Decision Trees, where the tree structure can change a lot with small changes in the input) and not necessarily in the output prediction.


Sampling the training dataset


The techniques used to diversify the base classifiers can be broadly divided into the following types:

• Sampling from the training dataset;


– Bagging;
– Boosting;

• Sampling from the set of attributes;

• Injecting Randomness;

• Perturbation of the test examples.


2 Bootstrap Aggregating - Bagging


2.1 Introduction
Bagging meaning
Bagging is an abbreviation of Bootstrap Aggregating.
In ensemble algorithms, bagging methods form a class of algorithms which build several in-
stances of a black-box estimator on random subsets of the original training set and then aggregate
their individual predictions to form a final prediction. These methods are used as a way to re-
duce the variance of a base estimator (e.g., a decision tree), by introducing randomization into its
construction procedure and then making an ensemble out of it.

Definition
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm
designed to improve the stability and accuracy of machine learning algorithms used in statistical
classification and regression. It also reduces variance and helps to avoid overfitting. Although it
is usually applied to decision tree methods, it can be used with any type of method. Bagging is a
special case of the model averaging approach.
From https://en.wikipedia.org/wiki/Bootstrap_aggregating

Bagging

2.2 Implementation
Python Implementation
This can be used for several classes of estimators that we have already used, see:
• https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator
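
As a minimal sketch of how this could look in practice (the dataset and parameter values below are illustrative and not from the notes; depending on the scikit-learn version the first argument is named `estimator` or, in older releases, `base_estimator`):

```python
# A minimal sketch: bagging an unstable base estimator (a decision tree).
# Dataset and hyperparameter values are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # the black-box base estimator ("base_estimator" in older versions)
    n_estimators=100,                    # number of bootstrap replicates
    max_samples=0.8,                     # fraction of the training set drawn per replicate
    bootstrap=True,                      # sample with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("test accuracy:", bagging.score(X_test, y_test))
```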


2.3 Advantages and disadvantages


Advantages and disadvantages
We can generate bagging models, for example, from Bayesian Classifiers or from Decision trees.
Advantages:

• Many weak learners aggregated typically outperform a single learner over the entire set, and
overfit less;

• Reduces variance for high-variance, low-bias weak learners, which can improve statistical
efficiency;

• Can be performed in parallel, as each separate bootstrap can be processed on its own before
combination.

Disadvantages:

• For weak learners with high bias, bagging will also carry the high bias into its aggregate;

• Loss of interpretability of a model;

• Can be computationally expensive depending on the data set.


3 Random Trees
3.1 Random Forests
Random forests
One of the main methods that apply bagging is the Random Forest.
This extension combines:

• the "bagging" idea and

• random selection of features in order to construct a collection of decision trees with con-
trolled variance.

https://en.wikipedia.org/wiki/Random_forest

Example
The Random Forest model uses bagging with decision trees, which are high-variance models, as base learners. It grows each tree using a random selection of features (injecting randomness). Several such random trees make a Random Forest.

3.2 Extremely Randomized Trees


Motivation
In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes),
randomness goes one step further in the way splits are computed. As in random forests, a random
subset of candidate features is used, but instead of looking for the most discriminative thresholds,
thresholds are drawn at random for each candidate feature and the best of these randomly-generated
thresholds is picked as the splitting rule. This usually allows the variance of the model to be reduced a
bit more, at the expense of a slightly greater increase in bias.
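
A minimal comparison sketch (synthetic data and settings are illustrative only), contrasting the extremely randomized trees mentioned above with a standard random forest:

```python
# A minimal sketch comparing a random forest with extremely randomized trees,
# which use random split thresholds. Data and settings are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
et = ExtraTreesClassifier(n_estimators=200, random_state=0)  # thresholds drawn at random

print("random forest:", cross_val_score(rf, X, y, cv=5).mean())
print("extra trees  :", cross_val_score(et, X, y, cv=5).mean())
```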

3.3 Feature importance evaluation


3.4 Implementation
Python
In random forests (see RandomForestClassifier and RandomForestRegressor classes), each
tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from
the training set.
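
A minimal sketch of the random forest estimators mentioned above; the dataset and parameter values are illustrative, not from the notes. The out-of-bag score uses the samples left out of each bootstrap draw as a built-in validation set.

```python
# A minimal sketch: a random forest where each tree is fit on a bootstrap
# sample and a random subset of features is considered at each split.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(
    n_estimators=300,     # number of trees, each fit on a bootstrap sample
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # draw samples with replacement
    oob_score=True,       # evaluate on the left-out (out-of-bag) samples
    random_state=0,
)
forest.fit(X_train, y_train)
print("OOB R^2 :", forest.oob_score_)
print("test R^2:", forest.score(X_test, y_test))
```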


4 Boosting methods
4.1 Idea
Boosting

4.2 AdaBoost
Example: AdaBoost
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only
slightly better than random guessing, such as small decision trees) on repeatedly modified versions
of the data. The predictions from all of them are then combined through a weighted majority vote
(or sum) to produce the final prediction. The data modifications at each so-called boosting iteration
consist of applying weights w1, w2, …, wN to each of the training samples. Initially, those weights are all set
to wi = 1/N, so that the first step simply trains a weak learner on the original data. For each successive
iteration, the sample weights are individually modified and the learning algorithm is reapplied to
the reweighted data. At a given step, those training examples that were incorrectly predicted by the
boosted model induced at the previous step have their weights increased, whereas the weights are
decreased for those that were predicted correctly. As iterations proceed, examples that are difficult
to predict receive ever-increasing influence.
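
A minimal sketch of this in scikit-learn, using decision stumps as the weak learners (data and parameter values are illustrative; depending on the scikit-learn version the first argument is named `estimator` or, in older releases, `base_estimator`):

```python
# A minimal sketch of AdaBoost with shallow decision trees ("stumps") as weak
# learners. Dataset and hyperparameter values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a weak learner (decision stump)
    n_estimators=200,    # number of boosting iterations
    learning_rate=0.5,   # shrinks the contribution of each weak learner
    random_state=0,
)
print("cross-validated accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```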

Example: AdaBoost


4.3 Gradient boosting


Gradient boosting
Gradient boosting is a machine learning technique for regression, classification and other tasks,
which produces a prediction model in the form of an ensemble of weak prediction models, typically
decision trees.
The idea of gradient boosting originated in the observation that boosting can be interpreted as
an optimization algorithm on a suitable cost function.
One possible example is to consider mean squared error as the cost function, where the goal is to "teach" a model F to predict values of the form ŷ = F(x) by minimizing

    (1/n) Σi (ŷi − yi)²,

where i indexes over some training set of size n of actual values of the output variable y:
• ŷi is the predicted value F(xi);

• yi is the observed value;

• n is the number of samples in y.

General idea
Let us consider a gradient boosting algorithm with M stages. At each stage m (1 ≤ m ≤ M) of gradient boosting, suppose some imperfect model Fm (for low m, this model may simply return ŷi = ȳ, where the right-hand side is the mean of y). In order to improve Fm, our algorithm should add some new estimator, hm(x). Thus,

    Fm+1(x) = Fm(x) + hm(x) = y

or, equivalently,

    hm(x) = y − Fm(x).


Therefore, gradient boosting will fit hm to the residual y − Fm(x). As in other boosting variants, each Fm+1 attempts to correct the errors of its predecessor Fm. A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that the residuals hm(x) for a given model are proportional to the negative gradients of the mean squared error (MSE) loss function (with respect to F(x)):

    LMSE = (1/n) (y − F(x))²

    −∂LMSE/∂F = (2/n) (y − F(x)) = (2/n) hm(x).

So, gradient boosting could be specialized to a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.

Steepest descent method


The gradient boosting method assumes a real-valued y and seeks an approximation F̂ (x) in the
form of a weighted sum of functions hi (x) from some class H, called base (or weak) learners:
    F̂(x) = Σ_{i=1}^{M} γi hi(x) + const.

We are usually given a training set {(x1 , y1 ), . . . , (xn , yn )} of known sample values of x and
corresponding values of y. In accordance with the empirical risk minimization principle, the
method tries to find an approximation F̂ (x) that minimizes the average value of the loss func-
tion on the training set, i.e., minimizes the empirical risk. It does so by starting with a model
consisting of a constant function F0(x), and incrementally expanding it in a greedy fashion:
    F0(x) = argmin_γ Σ_{i=1}^{n} L(yi, γ),

    Fm(x) = Fm−1(x) + argmin_{hm ∈ H} [ Σ_{i=1}^{n} L(yi, Fm−1(xi) + hm(xi)) ],

where hm ∈ H is a base learner function.


Unfortunately, choosing the best function h at each step for an arbitrary loss function L is a
computationally infeasible optimization problem in general. Therefore, we restrict our approach to
a simplified version of the problem.
The idea is to apply a steepest descent step to this minimization problem (functional gradient
descent).

4.4 Regularization
Regularization
Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.

• One natural regularization parameter is the number of gradient boosting iterations M (i.e.
the number of trees in the model when the base learner is a decision tree). Increasing M
reduces the error on the training set, but setting it too high may lead to overfitting. An optimal
value of M is often selected by monitoring prediction error on a separate validation data set.
Besides controlling M, several other regularization techniques are used.

• Another regularization parameter is the depth of the trees. The higher this value, the more
likely the model is to overfit the training data (see the sketch below).
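
A minimal sketch of how these two parameters look in scikit-learn's gradient boosting; the values are illustrative, and the early-stopping options shown are just one way of selecting M on an internal validation split.

```python
# A minimal sketch: controlling the number of boosting iterations M and the
# tree depth, with early stopping on an internal validation split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound on M, the number of trees
    max_depth=3,              # depth of each tree: deeper trees overfit more easily
    learning_rate=0.05,
    validation_fraction=0.2,  # hold out part of the training data ...
    n_iter_no_change=10,      # ... and stop when the validation score stops improving
    random_state=0,
)
gbm.fit(X, y)
print("trees actually used:", gbm.n_estimators_)
```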

4.5 Implementation
Implementations
Gradient boosting usually gives very good results and is one of the most widely used methods in machine learning,
together with neural networks and random forests. In Python there are two well-regarded
implementations:

XGBoost (eXtreme Gradient Boosting) https://en.wikipedia.org/wiki/XGBoost

scikit-learn https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosted-trees

Some of these algorithms support missing values and categorical data, removing the need for additional preprocessing such as imputation.
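
A minimal sketch using scikit-learn's histogram-based gradient boosting, which handles missing values natively; the synthetic data below is illustrative only. The xgboost package offers a similar fit/predict interface.

```python
# A minimal sketch: histogram-based gradient boosting on data containing
# missing values, with no imputation step. Synthetic data, illustrative only.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan          # introduce missing values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1, random_state=0)
clf.fit(X_train, y_train)                      # NaNs are handled during split finding
print("test accuracy:", clf.score(X_test, y_test))
```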

Bringing back some interpretability


Individual decision trees can be interpreted easily by simply visualizing the tree structure. Gradient boosting models, however, comprise hundreds of regression trees; thus they cannot be easily interpreted by visual inspection of the individual trees. Fortunately, a number of techniques have been proposed to summarize and interpret gradient boosting models.
Often features do not contribute equally to predicting the target response; in many situations the majority of the features are in fact irrelevant. When interpreting a model, the first question usually is: what are the important features and how do they contribute to predicting the target response?

Individual decision trees intrinsically perform feature selection by selecting appropriate split
points. This information can be used to measure the importance of each feature; the basic idea
is: the more often a feature is used in the split points of a tree the more important that feature
is. This notion of importance can be extended to decision tree ensembles by simply averaging the
impurity-based feature importance of each tree.

• https://scikit-learn.org/stable/modules/ensemble.html#interpretation-with-feature-importance

• https://scikit-learn.org/stable/modules/ensemble.html#random-forest-feature-importance
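
A minimal sketch of the impurity-based feature importances described above, averaged over the trees of an ensemble; the dataset and model settings are illustrative only.

```python
# A minimal sketch: impurity-based feature importances from a boosted ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# feature_importances_ averages the impurity-based importance over all trees
ranked = sorted(
    zip(data.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked[:5]:
    print(f"{name:25s} {score:.3f}")
```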


4.6 Perturbation of the test examples


Perturbation of the test examples
Another way to generate diversity is to keep the original estimator and to slightly perturb the test examples, adding white noise to each record and looking at the corresponding distribution of outcomes.
That is, instead of considering a record x, we consider a set of test examples of the form x + ε, where ε corresponds to the white-noise perturbation.
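
A minimal sketch of this idea (model, noise level and data are illustrative only): predict on many noisy copies of a record and inspect the distribution of the outputs.

```python
# A minimal sketch: perturb a test record with white noise and look at the
# distribution of the model's predictions over the noisy copies.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
x = X[0]                                  # the record to perturb
sigma = 0.01 * X.std(axis=0)              # white-noise scale per feature (illustrative)
noisy_copies = x + rng.normal(scale=sigma, size=(500, x.size))

probabilities = model.predict_proba(noisy_copies)[:, 1]
print("mean P(class 1):", probabilities.mean(), "+/-", probabilities.std())
```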


5 Combining Heterogeneous Estimators


5.1 Stacked generalization
Stacked generalization
Stacked generalization is a method for combining estimators to reduce their biases. More pre-
cisely, the predictions of each individual estimator are stacked together and used as input to a final
estimator to compute the prediction. This final estimator is trained through cross-validation (to be
seen in the next module).
See scikit-learn implementation: https://scikit-learn.org/stable/modules/ensemble.html#stacked-generalization
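
A minimal sketch of stacked generalization with heterogeneous base estimators; the dataset and the particular choice of estimators are illustrative only.

```python
# A minimal sketch: heterogeneous base estimators whose predictions are
# stacked and fed to a final (meta) estimator, built with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # trained on the stacked predictions
    cv=5,  # cross-validation used to build the meta-features
)
print("stacked accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```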

5.2 Voting methods


Elections!

• Classification: VotingClassifier

• Regression: VotingRegressor
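
A minimal sketch of a voting ensemble over heterogeneous classifiers; the choice of base models and the dataset are illustrative only.

```python
# A minimal sketch: combining heterogeneous classifiers by voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities; "hard" uses a majority vote
)
print("voting accuracy:", cross_val_score(vote, X, y, cv=5).mean())
```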
