Ensemble Learning
In ensemble learning, each model can be considered as an independent random variable making
predictions.
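This view can be made concrete with a simple simulation: if many weak voters are each correct independently with probability slightly above 0.5, a majority vote over all of them is correct far more often than any individual voter. A minimal sketch (the 51% accuracy, the number of voters, and the number of trials are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(42)
# 1,000 independent weak "classifiers", each correct with probability 0.51
votes = rng.random((10_000, 1_000)) < 0.51
# Majority vote across the ensemble, repeated over 10,000 trials
majority_correct = votes.sum(axis=1) > 500
print(majority_correct.mean())  # well above 0.51 when the voters' errors are independent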
❖ A diverse set of predictors can also be built by using the same training algorithm for every predictor but
training them on different random subsets of the training set.
❖ The key difference between bagging and pasting lies in how these subsets are sampled:
Bagging (short for bootstrap aggregating): Sampling with replacement
Pasting: Sampling without replacement.
❖ Both bagging and pasting allow training instances to be sampled several times across multiple predictors,
but only bagging allows training instances to be sampled several times for the same predictor.
Fig. 4: A single decision tree (left) versus a bagging ensemble of 500 trees (right)
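In scikit-learn, bagging and pasting are both provided by BaggingClassifier; the only difference is the bootstrap flag. A minimal sketch, assuming the moons dataset and illustrative hyperparameters (neither is specified in the slides):

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each tree is trained on a bootstrap sample (sampling WITH replacement)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)

# Pasting: same setup, but each tree sees a subset drawn WITHOUT replacement
paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                              bootstrap=False, max_samples=0.8,
                              n_jobs=-1, random_state=42)
paste_clf.fit(X_train, y_train)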
Out-of-Bag Evaluation
❖ The BaggingClassifier samples training instances with replacement, resulting in roughly 63% of
instances being sampled on average for each predictor.
❖ The remaining ~37% are termed out-of-bag (OOB) instances and can be used for evaluation
without a separate validation set. This built-in validation allows the ensemble's predictive
performance to be assessed accurately.
❖ Out-of-bag evaluation therefore provides a convenient and efficient way to estimate the
performance of the ensemble without having to set aside a separate validation set.
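With scikit-learn, out-of-bag evaluation is requested via oob_score=True at construction time; a minimal sketch, reusing the training split from the bagging example above:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)

# Accuracy estimated on the instances each predictor never saw during training
print(bag_clf.oob_score_)
# Per-instance OOB class probabilities are also available
print(bag_clf.oob_decision_function_[:3])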
❖ Random Forests: A variant of the bagging classifier with decision trees as the base learner.
❖ However, in contrast to a plain bagging classifier, random forests by default also sample a random subset
of the features at each decision node of the trees. This process is known as feature bagging or
feature subsampling.
❖ By default, it samples √𝑛 features at each node (where 𝑛 is the total number of features).
❖ The algorithm results in greater tree diversity, which (again) trades a higher bias for a lower variance.
❖ Extra-Trees: It is possible to make trees even more random by also using random thresholds for each
feature at the decision nodes. For this, simply set splitter="random" when creating a
DecisionTreeClassifier.
❖ A forest of such extremely random trees is called an extremely randomized trees (or extra-trees for
short) ensemble.
❖ Along with a further reduction in variance (at the cost of a bit more bias), it also makes extra-trees
classifiers much faster to train than regular random forests, since the costly search for the best split
threshold at every node is skipped.
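scikit-learn implements this ensemble directly as ExtraTreesClassifier, whose API mirrors RandomForestClassifier; a minimal sketch with illustrative hyperparameters:

from sklearn.ensemble import ExtraTreesClassifier

# Same interface as RandomForestClassifier, but split thresholds are chosen at random
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16,
                               n_jobs=-1, random_state=42)
ext_clf.fit(X_train, y_train)
y_pred_ext = ext_clf.predict(X_test)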
Random Forests: scikit-learn
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rnd_clf = RandomForestClassifier(n_estimators=500,
                                 max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

# Roughly equivalent BaggingClassifier built from feature-subsampling decision trees
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt",
                           max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)
Random Forests: Feature Importance
❖ A very useful aspect of random forests is that they make it easy to measure the relative
importance of each feature.
❖ Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that
feature reduce impurity on average, across all trees in the forest.
❖ More precisely, it is a weighted average, where each node’s weight is proportional to the number
of training samples that are associated with it.
❖ Scikit-Learn computes this score automatically for
each feature after training, then it scales the results
so that the sum of all importances is equal to 1.
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> iris = load_iris(as_frame=True)
>>> rnd_clf = RandomForestClassifier(n_estimators=500,
random_state=42)
>>> rnd_clf.fit(iris.data, iris.target)
>>> for score, name in zip(rnd_clf.feature_importances_,
iris.data.columns):
... print(round(score, 2), name)
...
0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)
Fig. 4: MNIST pixel importance (according to a random forest classifier)
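The MNIST pixel-importance plot referenced above can be reproduced along these lines; the dataset fetch and the plotting choices below are assumptions, not taken from the slides:

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

X_mnist, y_mnist = fetch_openml("mnist_784", as_frame=False, return_X_y=True)
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rnd_clf.fit(X_mnist, y_mnist)

# Each of the 784 pixels is a feature; reshape importances back into a 28x28 image
plt.imshow(rnd_clf.feature_importances_.reshape(28, 28), cmap="hot")
plt.axis("off")
plt.colorbar()
plt.show()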
Boosting: AdaBoost
❖ Boosting refers to any ensemble method that can combine several weak learners into a strong
learner.
❖ The general idea of most boosting methods is to train predictors sequentially, each trying to correct
its predecessor.
❖ There are many boosting methods available, but by far the most popular are AdaBoost (short for
adaptive boosting) and gradient boosting.
❖ AdaBoost:
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training
instances that the predecessor underfit. This results in new predictors focusing more and more on
the hard cases.
❖ Predictor weight: αⱼ = η log((1 − rⱼ) / rⱼ), where rⱼ is the weighted error rate of the jᵗʰ predictor and
η is the learning rate. So the predictor weight αⱼ is larger for smaller values of the error rate rⱼ, and vice versa.
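A minimal AdaBoost sketch with scikit-learn's AdaBoostClassifier (the stump depth, number of estimators, and learning rate are illustrative assumptions):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Decision stumps trained sequentially; each round boosts the weights of the
# training instances the previous predictors got wrong
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=30, learning_rate=0.5,
                             random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)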