
Ensemble Learning and Random Forests


Ensemble Learning: Introduction
❖ Ensemble learning is a machine learning technique that combines multiple individual models to
improve overall predictive performance.

❖ Law of Large Numbers:

The law of large numbers states that as the number of independent and identically distributed
(i.i.d.) random variables increases, their average converges to the expected value.

In ensemble learning, each model's prediction can be viewed as such a random variable: even if every
individual model is only slightly better than random guessing, aggregating many reasonably
independent models can push the ensemble's accuracy well above that of any single member.
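As a quick illustration, consider a minimal simulation sketch: suppose 1,000 classifiers each predict
the correct class independently with probability 0.51; a simple majority vote is right far more often
than any individual classifier.

import numpy as np

rng = np.random.default_rng(42)

# 1,000 weak classifiers, each independently correct with probability 0.51,
# voting on 10,000 instances
n_classifiers, n_instances = 1000, 10_000
correct = rng.random((n_classifiers, n_instances)) < 0.51

# The majority vote is correct whenever more than half of the classifiers are correct
ensemble_accuracy = (correct.sum(axis=0) > n_classifiers / 2).mean()
print(ensemble_accuracy)  # around 0.73 here, well above the individual 0.51

In practice the individual models are never fully independent (they are trained on the same data),
which is why ensemble methods put so much effort into making their predictors diverse.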

❖ Reduction of Bias and Variance:


By combining diverse models, ensemble methods can reduce bias by capturing different aspects of
the data and reduce variance by averaging out individual model errors. This leads to more robust
and accurate predictions.

❖ Ensemble learning methods:


Voting, Bagging (Bootstrap Aggregating), Boosting and Stacking
Voting Classifiers
▪ In voting, multiple models are trained independently
on the same dataset.
▪ Predictions are aggregated through majority voting
(for classification) or averaging (for regression).

▪ There are two main types of voting classifiers: hard voting and soft voting.

▪ In hard voting (see Fig. 2), each individual classifier predicts the class label for a given
instance, and the final prediction is determined by a majority vote.

Fig. 1: Training diverse classifiers
Fig. 2: Hard voting classifier predictions

▪ In soft voting, instead of predicting class labels, the individual classifiers provide probability
estimates for each class. The final prediction is determined by averaging the probability estimates
across all classifiers and selecting the class label with the highest average probability.

For example, suppose classifiers A, B, and C predict the probabilities for two classes as
[0.3, 0.7], [0.6, 0.4], and [0.4, 0.6] for a given data point.
Average probability for Class 1: (0.3 + 0.6 + 0.4) / 3 ≈ 0.43
Average probability for Class 2: (0.7 + 0.4 + 0.6) / 3 ≈ 0.57
Soft voting therefore predicts Class 2.
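A quick way to check this arithmetic with NumPy (a small sketch; the probability values are the
hypothetical ones from the example above):

import numpy as np

# Hypothetical probability estimates from classifiers A, B, and C for [Class 1, Class 2]
probas = np.array([[0.3, 0.7],
                   [0.6, 0.4],
                   [0.4, 0.6]])

avg = probas.mean(axis=0)    # array([0.4333..., 0.5666...])
print(avg.argmax() + 1)      # 2 -> soft voting picks Class 2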
Voting Classifiers: scikit-learn

Training voting classifier

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)

Individual classifier score

>>> for name, clf in voting_clf.named_estimators_.items():
...     print(name, "=", clf.score(X_test, y_test))
...
lr = 0.864
rf = 0.896
svc = 0.896

Hard and soft voting classifier scores

>>> voting_clf.predict(X_test[:1])
array([1])

>>> [clf.predict(X_test[:1]) for clf in voting_clf.estimators_]
[array([1]), array([1]), array([0])]

>>> voting_clf.score(X_test, y_test)
0.912

>>> voting_clf.voting = "soft"
>>> voting_clf.named_estimators["svc"].probability = True
>>> voting_clf.fit(X_train, y_train)
>>> voting_clf.score(X_test, y_test)
0.92
Bagging and Pasting

❖ A diverse set of predictors can also be built by using the same training algorithm for every
predictor but training each one on a different random subset of the training set.

❖ The key difference between bagging and pasting lies in how these subsets are sampled:
Bagging (short for bootstrap aggregating): Sampling with replacement
Pasting: Sampling without replacement.
❖ Both bagging and pasting allow training instances to
be sampled several times across multiple predictors.

❖ However, only bagging allows a training instance to be sampled several times for the same
predictor.

❖ Generally, bagging introduces more diversity in the subsets, and is often preferred over pasting.

❖ The ensemble can make a prediction just like a voting classifier, by aggregating the outputs of
the different predictors.

❖ The ensemble has a similar bias but a lower variance than a single predictor trained on the
original training set.

❖ The different predictors of the ensemble can all be trained in parallel, via different CPU cores
or even different servers, so the algorithm scales very well.

Fig. 3: Bagging and pasting involve training several predictors on different random samples of the
training set
Bagging and Pasting: scikit-learn

Training a bagging classifier

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)

▪ If you want to use pasting instead, just set bootstrap=False.

▪ A BaggingClassifier automatically performs soft voting instead of hard voting if the base
classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is
the case with decision tree classifiers.
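For instance, based on the note above, a pasting ensemble only changes the bootstrap argument
(paste_clf is just an illustrative name):

# Same ensemble as above, but sampling without replacement (pasting)
paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                              max_samples=100, bootstrap=False,
                              n_jobs=-1, random_state=42)
paste_clf.fit(X_train, y_train)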

Fig. 4: A single decision tree (left) versus a bagging ensemble of 500 trees (right)
Out-of-Bag Evaluation

❖ The BaggingClassifier samples training instances with replacement, so roughly 63% of the training
instances are sampled on average for each predictor (the probability that a given instance is never
drawn in m samples with replacement is (1 − 1/m)^m, which approaches e⁻¹ ≈ 37% as m grows).

❖ The remaining ~37% are termed out-of-bag (OOB) instances and can be used to evaluate the ensemble
without a separate validation set, giving a built-in estimate of its predictive performance.

❖ Out-of-bag evaluation therefore provides a convenient and efficient way to estimate the
performance of the ensemble without the need to hold out part of the training data.

OOB evaluation

>>> bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
...                             oob_score=True, n_jobs=-1, random_state=42)
>>> bag_clf.fit(X_train, y_train)
>>> bag_clf.oob_score_
0.896

Test-set evaluation

>>> from sklearn.metrics import accuracy_score
>>> y_pred = bag_clf.predict(X_test)
>>> accuracy_score(y_test, y_pred)
0.92

❖ The OOB evaluation was a bit pessimistic in this case (89.6% vs. 92.0% accuracy on the test set).


Random Forests: Bagging of Decision Trees

❖ Random Forests: A variant of bagging classifier with decision trees as the base learner.

❖ However, in contrast to a plain bagging classifier, random forests also sample a random subset of
the features at each decision node of the trees, by default. This process is known as feature
bagging or feature subsampling.
❖ By default, a RandomForestClassifier samples √n features at each node (where n is the total number
of features).

❖ The algorithm results in greater tree diversity, which (again) trades a higher bias for a lower variance.

❖ A RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how
trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

❖ Extra-Trees: It is possible to make trees even more random by also using random thresholds for each
feature at the decision nodes. For this, simply set splitter="random" when creating a
DecisionTreeClassifier.

❖ A forest of such extremely random trees is called an extremely randomized trees (or extra-trees for
short) ensemble.

❖ Along with a further reduction in variance, this also makes extra-trees classifiers much faster to
train than regular random forests, since searching for the best possible threshold at every node is
no longer required.
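As a sketch of this, scikit-learn provides an ExtraTreesClassifier with the same API as
RandomForestClassifier, so it can be used as a drop-in replacement (reusing the X_train/y_train
split from the earlier moons example):

from sklearn.ensemble import ExtraTreesClassifier

ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16,
                               n_jobs=-1, random_state=42)
ext_clf.fit(X_train, y_train)
y_pred_ext = ext_clf.predict(X_test)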
Random Forests: scikit-learn

Training of Random Forests

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

Equivalent Bagging Classifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)
Random Forests: Feature Importance

❖ A very useful aspect of random forests is that they make it easy to measure the relative
importance of each feature.

❖ Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that
feature reduce impurity on average, across all trees in the forest.

❖ More precisely, it is a weighted average, where each node’s weight is proportional to the number
of training samples that are associated with it.
❖ Scikit-Learn computes this score automatically for
each feature after training, then it scales the results
so that the sum of all importances is equal to 1.
>>> from sklearn.datasets import load_iris
>>> iris = load_iris(as_frame=True)
>>> rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
>>> rnd_clf.fit(iris.data, iris.target)
>>> for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
...     print(round(score, 2), name)
...
0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)

Fig. 5: MNIST pixel importance (according to a random forest classifier)
Boosting: AdaBoost

❖ Boosting refers to any ensemble method that can combine several weak learners into a strong
learner.

❖ The general idea of most boosting methods is to train predictors sequentially, each trying to correct
its predecessor.

❖ There are many boosting methods available, but by far the most popular are AdaBoost (short for
adaptive boosting) and gradient boosting.
❖ AdaBoost:
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training
instances that the predecessor underfit. This results in new predictors focusing more and more on
the hard cases.

Fig. 6: Decision boundaries of consecutive predictors


AdaBoost Algorithm

❖ Weighted error rate of the jth predictor:
Each instance weight w(i) is initially set to 1/m (for m training instances). The error rate r_j of
each predictor is then computed by summing the weights of the instances it predicts incorrectly,
divided by the total weight.

❖ Predictor weight:
• The predictor weight α_j is larger for smaller values of the error rate r_j, and vice versa.
• For a random predictor (r_j = 0.5), the predictor weight is zero.
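In the standard AdaBoost formulation (with η denoting the learning rate), these two quantities are:

$$r_j = \frac{\sum_{\substack{i=1 \\ \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}
\qquad\qquad
\alpha_j = \eta \, \log\frac{1 - r_j}{r_j}$$

where $\hat{y}_j^{(i)}$ is the jth predictor's prediction for the ith instance. Note that α_j is zero
when r_j = 0.5 and becomes negative when the predictor does worse than random guessing.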

❖ Instance weight update:
The weight of every misclassified instance is multiplied by exp(α_j), while the weights of correctly
classified instances are left unchanged. After the update, all weights are normalized so that they
sum to 1.

❖ AdaBoost predictions:
• The prediction is based on the sum of the predictor weights α_j over the predictors that vote for
a given class value k.
• The predicted class is the one for which this sum is the maximum.

Building AdaBoost Classifier

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=30,
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
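In the standard formulation, the instance-weight update and the final prediction read:

$$w^{(i)} \leftarrow
\begin{cases}
w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}
\qquad \text{for } i = 1, \dots, m$$

(followed by normalization so that the weights sum to 1), and

$$\hat{y}(\mathbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j=1 \\ \hat{y}_j(\mathbf{x}) = k}}^{N} \alpha_j$$

where N is the number of predictors in the ensemble.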
