2025 Ensemble Learning
● Ensemble methods work best when the base models make different types of errors, allowing
the combined model to compensate for individual weaknesses.
● Predictions from individual models are combined using techniques like averaging, voting, or
stacking to make the final prediction (a minimal voting sketch follows this list).
● Ensembles are generally more robust to overfitting and to noise in the data.
● The most popular ensemble methods are bagging, boosting, and stacking.
● Ensemble methods are ideal for regression and classification, where they reduce bias and
variance to boost the accuracy of models.
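To make the voting/averaging idea concrete, here is a minimal sketch, assuming scikit-learn is available; the synthetic dataset, the choice of base models, and all parameter values are illustrative assumptions, not part of the original notes.

# Hard voting: each base model casts one vote, the majority class wins.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("svm", SVC()),
    ],
    voting="hard",  # majority vote; voting="soft" would average predicted probabilities instead
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))

Soft voting corresponds to the averaging technique mentioned above; stacking is covered later in its own section.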
Categories of Ensemble Methods
Parallel ensemble techniques – Random Forest
Variance: The error is due to sensitivity to small fluctuations in the training data. High variance can cause the
model to overfit, capturing noise instead of the underlying pattern.
A model with overfitting: Complex wavy line that fits every point, including noise.
A model with good / best fit: Smooth curve capturing the main trend.
ii. It classifies data by majority voting across the trees, whereas for numeric prediction it uses an
average method.
Random forest is a supervised learning algorithm that is used for both classification and
regression; however, it is mainly used for classification problems. Just as a forest is made up of trees,
a random forest algorithm creates decision trees on data samples, gets a prediction from each of them,
and finally selects the best solution by voting. It is an ensemble method that is better than a single
decision tree because it reduces over-fitting by averaging the results.
A random forest is a model made up of many decision trees. Rather than simply averaging the
predictions of the trees, it relies on two ideas that give it the name random: random sampling of the
training data points when building each tree, and random subsets of the features considered when
splitting each node.
For the training phase, each tree in a random forest learns from a random sample of the data
points. The samples are drawn with replacement, known as bootstrapping, which means that some
samples will be used multiple times in a single tree while others may not be used at all.
A subset of all the features is considered for splitting each node in each decision tree. Generally,
this is set to sqrt(n_features) for classification. For example, if there are 16 features, at each node
in each tree, only 4 features will be considered for splitting the node.
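As a hedged sketch of these two ideas (bootstrapped samples plus sqrt(n_features) splits), the following uses scikit-learn's RandomForestClassifier on synthetic data; the dataset and parameter values are assumptions for illustration only.

# Random forest: bootstrap sampling per tree + random feature subsets per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    max_features="sqrt",  # with 16 features, about 4 are considered at each split
    bootstrap=True,       # each tree sees a sample drawn with replacement
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))  # classification: majority vote across the trees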
Var A has a value >= 5 for 12 records out of 16, and the remaining 4 records have a value < 5.
● For Var A >= 5 & class == positive: 5/12
● For Var A >= 5 & class == negative: 7/12
o gini(5,7) = 1 - ( (5/12)^2 + (7/12)^2 ) = 0.4861
● For Var A < 5 & class == positive: 3/4
● For Var A < 5 & class == negative: 1/4
o gini(3,1) = 1 - ( (3/4)^2 + (1/4)^2 ) = 0.375
o Weighted Gini for the Var A split = (12/16) * 0.4861 + (4/16) * 0.375 = 0.458
Var B has a value >= 3 for 12 records out of 16, and the remaining 4 records have a value < 3.
Var C has a value >= 4.2 for 6 records out of 16, and the remaining 10 records have a value < 4.2.
Var D has a value >= 1.4 for 5 records out of 16, and the remaining 11 records have a value < 1.4.
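The Gini arithmetic for the Var A split above can be reproduced with a short sketch in plain Python; the class counts are taken from the example, everything else is illustrative.

# Gini impurity of a node from its class counts, then the weighted split impurity.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

left = [5, 7]   # Var A >= 5: 5 positive, 7 negative (12 records)
right = [3, 1]  # Var A < 5: 3 positive, 1 negative (4 records)

g_left = gini(left)    # 0.4861
g_right = gini(right)  # 0.375
weighted = (12 / 16) * g_left + (4 / 16) * g_right  # about 0.458
print(g_left, g_right, weighted)

The split with the lowest weighted Gini would be chosen for the node.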
F1 Score: In most real-life classification problems an imbalanced class distribution exists, and
thus the F1-score is a better metric to evaluate our model on.
Accuracy can be used when the class distribution is similar, while the F1-score is a better metric
when there are imbalanced classes, as in the above case.
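A minimal sketch of why the two metrics disagree on imbalanced data, assuming scikit-learn; the toy labels below are made up for illustration.

# A model that almost always predicts the majority class looks accurate,
# but its F1-score reveals the poor minority-class performance.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10  # 90% majority class, 10% minority class
y_pred = [0] * 98 + [1] * 2   # only 2 of the 10 minority records are caught

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.92
print("F1-score:", f1_score(y_true, y_pred))        # about 0.33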
1. Bagging
○ Take random subsets of the data (with replacement) to create different datasets (called
bootstraps).
○ Train the same type of model (e.g., decision trees) on each dataset.
○ Random Forest is an extension of bagging that uses decision trees. It also introduces a
random subset of features at each split, which makes the individual trees less correlated (a minimal sketch follows this list).
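A minimal bagging sketch, assuming scikit-learn (the estimator keyword assumes a recent release; older versions call it base_estimator); the data and settings are illustrative.

# Bagging: train the same base model on many bootstrap samples and vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # same type of model for every bootstrap
    n_estimators=50,
    bootstrap=True,  # subsets drawn with replacement
    random_state=1,
)
bagging.fit(X, y)
print(bagging.predict(X[:5]))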
2. Boosting
How does it work?
● The key difference from bagging is that models are trained sequentially, and each new model
focuses on correcting the errors made by the models before it.
3. Stacking
How does it work?
1. Train multiple different models (e.g., decision tree, logistic regression, SVM) on the same
dataset.
2. Use the predictions from these models as input features for a meta-model (e.g., logistic
regression).
3. The meta-model learns to make the final predictions by combining the outputs of the base
models.
● Key difference: Stacking uses multiple different types of models and combines them with a
meta-model, unlike bagging and boosting, which typically use the same type of model.
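As a hedged sketch of these three steps, the following uses scikit-learn's StackingClassifier; the base models, meta-model, and data are illustrative assumptions.

# Stacking: different base models, combined by a logistic-regression meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=2)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(),  # meta-model trained on the base models' outputs
)
stack.fit(X, y)
print(stack.predict(X[:5]))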
Tree stumps are 1-level decision trees; they are the weak learners typically used in AdaBoost. The main
idea is that at each step we want to find the stump that best separates the classes. For each feature we
build a stump and decide which one becomes the base learning model. [This decision is made based on
the Gini Index, entropy, or information gain.]
The initial weight of each observation is 1/7 in this case (seven records, all weighted equally).
Suppose the first stump misclassifies one record. The Total Error (TE) is the sum of the weights of the
misclassified records, so TE = 1/7.
Performance of stump = ½ ln [(1 - TE) / TE] [The difference between log and ln is that log is defined for
base 10 and ln is defined for base e. For example, the log of base 2 is represented as log2 and the log of
base e is represented as ln.]
= ½ ln [(1 - (1/7)) / (1/7)]
= ½ ln [6]
≈ 0.896
Now we can update the weights: for every wrongly classified record we increase the weight, and for
every correctly classified record we decrease it.
Below is the way to calculate the new weight for each correctly classified record:
= 1/7 * exp(-0.896)
= 0.058
Below is the way to calculate the new weight for each incorrectly classified record:
= 1/7 * exp(+0.896)
= 0.350
To normalize the weights, we need to divide each record weight by the sum of all the record weights.
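The weight-update arithmetic above can be checked with a short plain-Python sketch; the assumption that exactly one of the seven records is misclassified follows the worked example.

# AdaBoost-style weight update for 7 records, one of them misclassified.
import math

n = 7
weights = [1 / n] * n  # initial weight 1/7 per record
TE = 1 / n             # total error = sum of weights of misclassified records
performance = 0.5 * math.log((1 - TE) / TE)  # about 0.896

new_weights = []
for i, w in enumerate(weights):
    misclassified = (i == 0)  # assume record 0 is the wrongly classified one
    factor = math.exp(performance) if misclassified else math.exp(-performance)
    new_weights.append(w * factor)  # about 0.350 if wrong, about 0.058 if right

total = sum(new_weights)
normalized = [w / total for w in new_weights]  # divide by the sum so weights add to 1
print(round(performance, 3), [round(w, 3) for w in normalized])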
Finally, each boosting technique differs from the others in its baseline classifier, its split criterion, and the
way it updates the model from one round to the next.
AdaBoost is a boosting algorithm that increases accuracy by giving more weight to the target records
that were misclassified by the previous weak learner.
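A minimal AdaBoost sketch, assuming scikit-learn (again, the estimator keyword assumes a recent release); the stump depth mirrors the walkthrough above, but the data and other settings are illustrative.

# AdaBoost with 1-level decision trees (stumps) as weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=3)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # tree stump
    n_estimators=50,
    learning_rate=1.0,
    random_state=3,
)
ada.fit(X, y)
print(ada.score(X, y))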
The Gradient Boosting algorithm increases accuracy by minimizing the loss function (the error, i.e., the
difference between the actual and predicted values) and using these residuals as the targets for building
the next decision tree.
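A minimal gradient-boosting sketch, assuming scikit-learn and a synthetic regression problem; all settings are illustrative.

# Gradient boosting: each new tree is fitted to the residuals of the current model.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=4)

gbr = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,  # shrinks each tree's contribution to the running prediction
    max_depth=3,
    random_state=4,
)
gbr.fit(X, y)
print(gbr.predict(X[:3]))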