
Ensemble Methods in Machine Learning

Dr. Arundhati Mahesh


Senior Lecturer
Bioinformatics
SRET
SRIHER
Ensemble Methods: Elegant Techniques to
Produce Improved Machine Learning Results
Ensemble means a group of elements viewed as a whole rather than individually. An ensemble method creates multiple
models and combines them to solve a problem, producing improved results. Ensemble methods help to improve the
robustness/generalizability of the model, and they usually produce more accurate solutions than a single model
would.
Combine Model Predictions Into Ensemble
Predictions
The three most popular methods for combining the predictions from different models are:

● Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
● Boosting. Building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the chain.
● Voting. Building multiple models (typically of differing types) and using simple statistics (like calculating the mean) to combine their predictions.
Bagging Algorithms
Bootstrap Aggregation or bagging involves taking multiple samples from your training dataset (with replacement)
and training a model for each sample.

The final output prediction is averaged across the predictions of all of the sub-models.
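As a quick, hedged illustration of the bootstrap sampling step (the toy array of row indices below is invented for the example and is separate from the diabetes scripts that follow):

# Toy illustration of bootstrap sampling: draw resamples with replacement from a dataset
import numpy as np

rng = np.random.default_rng(7)
rows = np.arange(10)  # pretend these are the row indices of a small training set

for i in range(3):
    # each bootstrap sample is the same size as the original data, drawn with replacement,
    # so some rows appear several times and others not at all
    sample = rng.choice(rows, size=len(rows), replace=True)
    print(f"bootstrap sample {i}: {sorted(sample)}")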

The three bagging models covered in this section are as follows:

1. Bagged Decision Trees
2. Random Forest
3. Extra Trees
Bagged Decision Trees

Bagging performs best with algorithms that have high variance. A popular example is decision trees, often
constructed without pruning.

The example below uses the BaggingClassifier with the Classification and Regression Trees algorithm
(DecisionTreeClassifier). A total of 100 trees are created.
# Bagged Decision Trees for Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column names, kept for reference
dataframe = pandas.read_csv('diabetes.csv')
array = dataframe.values
X = array[:, 0:8]  # eight input features
Y = array[:, 8]    # class label
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
cart = DecisionTreeClassifier()
num_trees = 100
# note: the estimator argument was named base_estimator in older scikit-learn releases
model = BaggingClassifier(estimator=cart, n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Random Forest
Random forest is an extension of bagged decision trees.

Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the
correlation between the individual classifiers. Specifically, rather than greedily choosing the best split point in the
construction of each tree, only a random subset of features is considered for each split.

You can construct a Random Forest model for classification using the RandomForestClassifier class.

The example below demonstrates Random Forest for classification with 100 trees and split points chosen from a
random selection of 3 features.
# Random Forest Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
# url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column names, kept for reference
dataframe = pandas.read_csv('diabetes.csv')
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
num_trees = 100
max_features = 3  # number of features considered at each split
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Extra Trees
Extra Trees (Extremely Randomized Trees) are another modification of bagging in which highly randomized trees are
constructed from the training dataset; split points are chosen at random rather than by searching for the best split.

You can construct an Extra Trees model for classification using the ExtraTreesClassifier class.

The example below provides a demonstration of extra trees with the number of trees set to 100 and splits chosen
from 7 random features.
# Extra Trees Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import ExtraTreesClassifier
# url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column names, kept for reference
dataframe = pandas.read_csv('diabetes.csv')
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
num_trees = 100
max_features = 7  # number of features considered at each split
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Boosting Algorithms

Boosting ensemble algorithms create a sequence of models, each attempting to correct the mistakes of the models
before it in the sequence.

Once created, the models make predictions which may be weighted by their demonstrated accuracy and the
results are combined to create a final output prediction.

The two most common boosting ensemble machine learning algorithms are:

1. AdaBoost
2. Stochastic Gradient Boosting
AdaBoost

AdaBoost was perhaps the first successful boosting ensemble algorithm. It generally works by weighting
instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less
attention to them in the construction of subsequent models.
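As a minimal, hedged sketch of this re-weighting idea (the labels and predictions below are toy values invented for illustration, and this is the classic discrete AdaBoost/SAMME update rather than scikit-learn's exact internals):

# Toy sketch of AdaBoost-style instance re-weighting (illustrative values, not scikit-learn internals)
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])   # toy labels
y_pred = np.array([1, -1, -1, 1, 1])   # toy predictions from one weak learner
weights = np.full(len(y_true), 1 / len(y_true))  # start with uniform instance weights

err = np.sum(weights * (y_true != y_pred))              # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)                   # learner weight: larger when the learner is more accurate
weights = weights * np.exp(alpha * (y_true != y_pred))  # up-weight the misclassified instances
weights = weights / weights.sum()                       # renormalize so the weights sum to 1
print(alpha, weights)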

You can construct an AdaBoost model for classification using the AdaBoostClassifier class.

The example below demonstrates the construction of a sequence of 30 decision trees (decision stumps by default)
using the AdaBoost algorithm.
# AdaBoost Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
# url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column names, kept for reference
dataframe = pandas.read_csv('diabetes.csv')
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
num_trees = 30
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Stochastic Gradient Boosting

Stochastic Gradient Boosting (also called Gradient Boosting Machines) is one of the most sophisticated
ensemble techniques. It is also proving to be perhaps one of the best techniques available for improving
performance via ensembles.

You can construct a Gradient Boosting model for classification using the GradientBoostingClassifier class.

The example below demonstrates Stochastic Gradient Boosting for classification with 100 trees.
# Stochastic Gradient Boosting Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
# url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column names, kept for reference
dataframe = pandas.read_csv('diabetes.csv')
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
num_trees = 100
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# set subsample < 1.0 to fit each tree on a random fraction of the data (the "stochastic" part)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Voting Ensemble

Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms.

It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then
be used to wrap your models and average the predictions of the sub-models when asked to make predictions for
new data.

You can create a voting ensemble model for classification using the VotingClassifier class.

The code below provides an example of combining the predictions of logistic regression, classification and
regression trees (CART) and support vector machines for a classification problem.
# Voting Ensemble for Classification
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
# url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column names, kept for reference
dataframe = pandas.read_csv('diabetes.csv')
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# create the sub-models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model (hard/majority voting by default)
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())
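By default, VotingClassifier uses hard (majority) voting. As a short follow-on sketch (it continues the script above, reusing X, Y, kfold and the imports already defined there), the snippet below switches to soft voting, which averages the predicted class probabilities; SVC needs probability=True for this to work.

# Soft voting: average predicted class probabilities instead of counting majority votes
# (continues the script above; reuses X, Y, kfold and the imports already defined there)
soft_ensemble = VotingClassifier(estimators=[
    ('logistic', LogisticRegression()),
    ('cart', DecisionTreeClassifier()),
    ('svm', SVC(probability=True)),  # probability=True enables predict_proba for SVC
], voting='soft')
soft_results = model_selection.cross_val_score(soft_ensemble, X, Y, cv=kfold)
print(soft_results.mean())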
Stacking
A stacking classifier is an ensemble method where the output from multiple classifiers is passed as an input
to a meta-classifier for the task of the final classification. The individual classification models are trained
based on the complete training set, then the meta-classifier is fitted based on the outputs (meta-features) of
the individual classification models.
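As a minimal, hedged sketch of these mechanics (illustrative base models and a simple holdout split chosen just for the example; scikit-learn's StackingClassifier, used in the full example later, automates this with cross-validated meta-features):

# Minimal sketch of stacking: base-model predictions become the meta-classifier's inputs
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, random_state=7)

# level 0: train the individual classifiers
base_models = [DecisionTreeClassifier(), KNeighborsClassifier()]
for m in base_models:
    m.fit(X_train, y_train)

# meta-features: each base model's predictions form one new input column
meta_features = np.column_stack([m.predict(X_holdout) for m in base_models])

# level 1: the meta-classifier learns how to combine the base predictions
meta_model = LogisticRegression().fit(meta_features, y_holdout)
print(meta_model.score(meta_features, y_holdout))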

Stacking has been used successfully in several machine learning competitions on Kaggle and is definitely a
must-know technique. Stacking is an ensemble technique that uses a new model to learn how to best combine
the predictions from two or more models trained on your dataset.

Voting Vs Stacking

Voting - the combination rule is fixed in advance (for example, a majority vote or a simple average); nothing is learned about how to combine the base models' answers.

Stacking - the base models' predictions are treated as a new representation of the problem, and an additional abstraction layer (the meta-classifier) learns how to predict the correct label from those k votes.
Stacking vs bagging and boosting
1. Bagging (stands for Bootstrap Aggregating): we use bagging for combining weak learners (base models) of high
variance. Bagging aims to produce a model with lower variance than the individual weak models. Bagging takes
advantage of the bootstrapping technique: sampling different sets of data from a given training set with
replacement. After bootstrapping the training dataset, we train a model on each of the different sets and aggregate
the results. Unlike bagging, in stacking the models are typically different (e.g. not all decision trees) and fit on the
same dataset (instead of samples of the training dataset).
2. Boosting: in boosting the learners are trained sequentially. The algorithm learns models sequentially in a very
adaptive way (a base model depends on the previous ones) and combines them following a deterministic strategy.
Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the
contributing models (e.g. instead of correcting the predictions of prior models).

Bagging trains models in parallel, boosting trains the models sequentially. Stacking creates a new meta-model.
Stacking Scikit-Learn
Below, stacking is applied to two machine learning problems with the help of Scikit-Learn. Scikit-learn is a free
software machine learning library for the Python programming language. It features various classification,
regression and clustering algorithms, including support vector machines, linear regression, logistic
regression, k-means clustering and many more.

The first problem is the famous iris problem, in which, given some attributes, we have to classify an iris
flower as Setosa, Versicolor, or Virginica, its three species. The second problem is wine recognition, in which
we have to classify a wine into three categories. Both of these datasets are available in the Scikit-learn
library.
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import StackingClassifier
from matplotlib import pyplot
from sklearn.datasets import load_wine, load_iris
from matplotlib.pyplot import figure
# create a large figure up front; the first boxplot below will draw on it
figure(num=2, figsize=(16, 12), dpi=80, facecolor='w', edgecolor='k')


# get a stacking ensemble of models
def get_stacking():
    # define the base (level-0) models
    level0 = list()
    level0.append(('lr', LogisticRegression()))
    level0.append(('knn', KNeighborsClassifier()))
    level0.append(('cart', DecisionTreeClassifier()))
    level0.append(('svm', SVC()))
    level0.append(('bayes', GaussianNB()))
    # define the meta-learner (level-1) model
    level1 = LogisticRegression()
    # define the stacking ensemble
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
    return model


# get a dictionary of models to evaluate
def get_models():
    models = dict()
    models['LogisticRegression'] = LogisticRegression()
    models['KNeighborsClassifier'] = KNeighborsClassifier()
    models['Decision tree'] = DecisionTreeClassifier()
    models['svm'] = SVC()
    models['GaussianNB'] = GaussianNB()
    models['stacking'] = get_stacking()
    return models


# evaluate a given model on both datasets using repeated stratified cross-validation
def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    scores1 = cross_val_score(model, X1, y1, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores, scores1


# define the datasets (both bundled with scikit-learn)
X, y = load_wine().data, load_wine().target
X1, y1 = load_iris().data, load_iris().target
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names, results1 = list(), list(), list()
for name, model in models.items():
    scores, scores1 = evaluate_model(model)
    results.append(scores)
    results1.append(scores1)
    names.append(name)
    print('>%s -> %.3f (%.3f)---Wine dataset' % (name, mean(scores), std(scores)))
    print('>%s -> %.3f (%.3f)---Iris dataset' % (name, mean(scores1), std(scores1)))
# plot model performance for comparison
pyplot.rcParams["figure.figsize"] = (15, 6)
pyplot.boxplot(results, labels=[s + "-wine" for s in names], showmeans=True)
pyplot.show()
pyplot.boxplot(results1, labels=[s + "-iris" for s in names], showmeans=True)
pyplot.show()
