Ensemble Learning
Inductive learning systems attempt to derive a generalized rule from a set of observations.
• Inductive Learning Algorithms (ILAs) are used to generate a set of classification rules. These generated rules are in the "If this, then that" format.
• There are four processes that a learner goes through when given an inductive learning activity: (1) observe, (2) hypothesize, (3) collect evidence, and (4) generalize.
• There are two methods for obtaining knowledge in the real world: first,
from domain experts, and second, from machine learning.
• Domain experts are not very useful or reliable for large amounts of data.
• Machine learning replicates the logic of 'experts' in algorithms, but this work can be very complex, time-consuming, and expensive.
• Computational Learning Theory (CoLT) is a branch of Artificial Intelligence research that focuses on formal studies of the design of computer programs that can learn.
• Essentially, it evaluates which problems can be learned using mathematical frameworks and develops the theoretical background of learning algorithms with the aim of increasing accuracy.
• Computational learning theory is a subfield of artificial intelligence
(AI) that is concerned with how computers can learn from data. The
main goals of computational learning theory are to develop algorithms
that can learn from data and to understand the limits of what can be
learned from data.
The individual models that we combine are known as weak learners. We call them weak learners because they have either high bias or high variance. Because of this, weak learners cannot learn efficiently and perform poorly.
• Both high bias and high variance models thus cannot generalize properly.
• Weak learners will either make incorrect generalizations or fail to generalize altogether. Because of this, the predictions of weak learners cannot be relied on by themselves.
• In the bias-variance trade-off, an underfit model has high bias and low variance, while an overfit model has high variance and low bias. In either case, there is no balance between bias and variance; for there to be a balance, both the bias and the variance need to be low. Ensemble learning tries to balance this bias-variance trade-off by reducing either the bias or the variance.
• A machine learning model analyses the data, finds patterns in it, and makes predictions. During training, the model learns these patterns in the dataset and applies them to the test data when predicting. When making predictions, a difference occurs between the values predicted by the model and the actual/expected values; this difference is known as the bias error, or error due to bias (see the sketch below).
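A minimal sketch of this idea, assuming NumPy is available; the prediction and target values below are made-up toy numbers, not from the source:

    import numpy as np

    # Hypothetical toy values: model predictions vs. actual/expected values.
    actual = np.array([3.0, 5.0, 7.0, 9.0])
    predicted = np.array([2.5, 4.0, 6.0, 8.0])

    # Bias error: the systematic difference between predictions and actual values.
    bias_error = np.mean(predicted - actual)
    print(bias_error)  # a non-zero value means the model is systematically off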
• Low Bias: A low bias model will make fewer assumptions about the form of
the target function.
• High Bias: A model with a high bias makes more assumptions, and the
model becomes unable to capture the important features of our dataset. A
high bias model also cannot perform well on new data.
• Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours, and Support Vector Machines. At the same time, algorithms with high bias are Linear Regression, Linear Discriminant Analysis, and Logistic Regression.
• Ways to reduce high bias:
• High bias mainly occurs when the model is too simple. Below are some ways to reduce high bias (a small sketch follows this list):
• Increase the number of input features, as the model is underfitted.
• Decrease the regularization term.
• Use more complex models, for example by including some polynomial features.
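A minimal sketch of these ideas, assuming scikit-learn is available; the toy data, the polynomial degree, and the alpha value are illustrative choices, not from the source:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Ridge

    # Toy non-linear data that a plain linear model would underfit (high bias).
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

    # Reduce bias: add polynomial features (a more complex model) and
    # use a small regularization term (alpha).
    model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=0.01))
    model.fit(X, y)
    print(model.score(X, y))  # noticeably higher R^2 than a plain linear fit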
Variance specifies the amount of variation in the prediction if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between the input and output variables. Variance errors are either low variance or high variance (see the sketch below).
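A minimal sketch of this idea, assuming scikit-learn and NumPy are available; the data, the model class, and the query point are illustrative only:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X).ravel() + 0.3 * rng.normal(size=300)
    x_query = np.array([[1.0]])  # a single query point

    # Train the same model class on several different training samples and
    # record its prediction at the query point each time.
    preds = []
    for _ in range(20):
        idx = rng.choice(len(X), size=len(X), replace=True)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(tree.predict(x_query)[0])

    # Variance: how much the prediction moves when the training data changes.
    print(np.var(preds))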
• Low variance means there is a small variation in the prediction of the
target function with changes in the training data set.
• High variance shows a large variation in the prediction of the target
function with changes in the training dataset.
• A model that shows high variance learns a lot and performs well on the training dataset, but does not generalize well to unseen data. As a result, such a model gives good results on the training dataset but shows high error rates on the test dataset.
• Since, with high variance, the model learns too much from the dataset, it leads to overfitting of the model. A model with high variance has the following problems:
• A high variance model leads to overfitting.
• It increases model complexity.
• Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. At the same time, algorithms with high variance are Decision Trees, Support Vector Machines, and k-Nearest Neighbours.
• Ways to reduce high variance (a small sketch follows this list):
• Reduce the number of input features or parameters, as the model is overfitted.
• Do not use an overly complex model.
• Increase the training data.
• Increase the regularization term.
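A minimal sketch of the 'use a simpler model' point, assuming scikit-learn is available; the toy data and the max_depth value are illustrative only:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X).ravel() + 0.3 * rng.normal(size=300)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # An unconstrained tree overfits: large gap between train and test scores.
    full = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
    # A depth-limited (simpler) tree trades a little bias for much lower variance.
    pruned = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

    print(full.score(X_tr, y_tr), full.score(X_te, y_te))
    print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))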
• Low-Bias, Low-Variance:
The combination of low bias and low variance gives an ideal machine learning model. However, it is very difficult to achieve in practice.
• Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters, and it leads to overfitting.
• High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses very few parameters. It leads to underfitting problems in the model.
• High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also
inaccurate on average.
• Ensemble learning tries to balance this bias-variance trade-off by reducing
either the bias or the variance.
• Ensemble learning will aim to reduce the bias if we have a weak model
with high bias and low variance. Ensemble learning will aim to reduce the
variance if we have a weak model with high variance and low bias.
• Ensemble learning improves a model’s performance in mainly three ways:
• By reducing the variance of weak learners
• By reducing the bias of weak learners
• By improving the overall accuracy of strong learners.
• Bagging is used to reduce the variance of weak learners. Boosting is used to reduce the bias of weak learners. Stacking is used to improve the overall accuracy of strong learners (a short stacking sketch follows below).
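Stacking is not elaborated further in this section, so here is only a minimal sketch, assuming scikit-learn is available; the base learners, the meta-learner (final_estimator), and the Iris dataset are illustrative choices, not from the source:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Stacking: base learners make predictions, and a final meta-learner
    # learns how to combine those predictions into the overall output.
    stack = StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
        final_estimator=LogisticRegression(max_iter=1000),
    )
    stack.fit(X_tr, y_tr)
    print(stack.score(X_te, y_te))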
• Bagging aims to produce a model with lower variance than the individual weak models. These weak learners are homogeneous, meaning they are of the same type.
• Bagging is also known as bootstrap aggregating. It consists of two steps: bootstrapping and aggregation.
• Bootstrapping: subsets of data are taken from the initial dataset. These subsets of data are called bootstrapped datasets or, simply, bootstraps. They are resampled 'with replacement', which means an individual data point can be sampled multiple times. Each bootstrap dataset is used to train a weak learner (see the sketch below).
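A minimal sketch of resampling with replacement, assuming NumPy is available; the ten-point toy dataset is illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.arange(10)  # a toy initial dataset of 10 points

    # Draw one bootstrap sample: same size as the original, with replacement,
    # so some points appear more than once and others not at all.
    bootstrap = rng.choice(data, size=len(data), replace=True)
    print(bootstrap)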
• Aggregating
• The individual weak learners are trained independently of each other. Each learner makes independent predictions. The results of those predictions are aggregated at the end to get the overall prediction. The predictions are aggregated using either max voting or averaging.
• Max voting is commonly used for classification problems. It consists of taking the mode of the predictions (the most frequent prediction). It is called voting because, as in an election, the premise is that 'the majority rules'. Each model makes a prediction, and a prediction from each model counts as a single 'vote'. The most frequent 'vote' is chosen as the representative for the combined model.
• Averaging is generally used for regression problems. It involves taking the average of the predictions. The resulting average is used as the overall prediction for the combined model (a small sketch of both follows below).
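A minimal sketch of both aggregation schemes, assuming NumPy is available; the predictions below are made-up toy values:

    import numpy as np

    # Hypothetical class predictions from three weak learners for four samples.
    clf_preds = np.array([[1, 0, 1, 1],
                          [1, 1, 1, 0],
                          [0, 0, 1, 1]])
    # Max voting (classification): the most frequent prediction per sample.
    votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, clf_preds)
    print(votes)  # [1 0 1 1]

    # Averaging (regression): the mean of the learners' predictions per sample.
    reg_preds = np.array([[2.1, 3.0], [1.9, 3.4], [2.0, 3.2]])
    print(reg_preds.mean(axis=0))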
Steps of Bagging
• The steps of bagging are as follows:
• We have an initial training dataset containing n instances.
• We create m subsets of data from the training set. For each subset, we take a sample of n points from the initial dataset. Each subset is sampled with replacement, which means that a specific data point can be sampled more than once.
• For each subset of data, we train the corresponding weak learners independently. These models are homogeneous, meaning that they are of the same type.
• Each model makes a prediction.
• The predictions are aggregated into a single prediction. For this, either max voting or averaging is used (a small bagging sketch follows this list).
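A minimal sketch of these steps using scikit-learn's off-the-shelf bagging implementation, assuming a recent scikit-learn (older versions name the first parameter base_estimator instead of estimator); the dataset and the n_estimators value are illustrative only:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Bagging: m homogeneous weak learners (here decision trees), each trained on
    # a bootstrap sample; their predictions are combined by max voting.
    bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=10, bootstrap=True, random_state=0)
    bag.fit(X_tr, y_tr)
    print(bag.score(X_te, y_te))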
• Boosting is used for combining weak learners with high bias. Boosting aims to produce a model with a lower bias than that of the individual models.
• Boosting involves sequentially training weak learners.
• Each subsequent learner improves on the errors of previous learners in the sequence.
• A sample of data is first taken from the initial dataset. This sample is
used to train the first model, and the model makes its prediction. The
samples can either be correctly or incorrectly predicted. The samples
that are wrongly predicted are reused for training the next model. In this
way, subsequent models can improve on the errors of previous models.
• Unlike bagging, which aggregates prediction results at the end, boosting
aggregates the results at each step. They are aggregated using weighted
averaging.
• Weighted averaging involves giving the models different weights depending on their predictive power. In other words, it gives more weight to the model with the higher predictive power, because the learner with the highest predictive power is considered the most important (see the sketch below).
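A minimal sketch of weighted averaging, assuming NumPy is available; the predictions and weights are made-up toy values:

    import numpy as np

    # Hypothetical predictions from three learners and their weights,
    # where a higher weight reflects higher predictive power.
    predictions = np.array([2.4, 3.0, 2.6])
    weights = np.array([0.5, 0.3, 0.2])  # weights sum to 1

    # Weighted average: sum of weight * prediction.
    combined = np.average(predictions, weights=weights)
    print(combined)  # 2.62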
• Boosting works with the following steps (a sketch using a standard boosting implementation follows this list):
• We sample m subsets from an initial training dataset.
• Using the first subset, we train the first weak learner.
• We test the trained weak learner using the training data. As a result of the testing, some data points will be incorrectly predicted.
• Each data point with a wrong prediction is sent into the second subset of data, and this subset is updated.
• Using this updated subset, we train and test the second weak learner.
• We continue with the following subsets until the total number of subsets is reached.
• We now have the total prediction. The overall prediction has already been aggregated at each step, so there is no need to calculate it again.
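A minimal sketch of boosting using scikit-learn's AdaBoost implementation (one standard boosting algorithm), assuming a recent scikit-learn (older versions name the first parameter base_estimator); the dataset, the depth-1 trees, and n_estimators are illustrative choices:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Boosting: shallow trees (high-bias weak learners) are trained in sequence,
    # each focusing more on the samples the previous learners got wrong, and
    # their outputs are combined by weighted voting.
    boost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                               n_estimators=50, random_state=0)
    boost.fit(X_tr, y_tr)
    print(boost.score(X_te, y_te))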
Support Vector Machine (SVM)
• SVM applies statistical algorithms to learn the hidden patterns and relationships in the dataset, whereas a neural network uses an artificial neural network architecture to learn them.
• SVM takes less time to train the model, whereas a neural network takes more time to train.
• SVM is less complex and its results are easy to interpret, whereas a neural network is more complex: it works like a black box, and interpretations of its results are not easy.