Random Forest
Ideally, a model that makes predictions with zero error is said to have a good fit on
the data. This situation is achievable at a spot between overfitting and underfitting.
To understand it, we have to look at the performance of our model over time as it
learns from the training dataset. As training continues, the model keeps learning, and
the error on both the training and the testing data keeps decreasing. If it learns for
too long, however, the model becomes more prone to overfitting because it starts to
pick up noise and less useful details, and its performance on unseen data decreases.
To get a good fit, we stop at the point just before the testing error starts increasing.
At this point, the model is said to perform well on the training dataset as well as on
our unseen testing dataset.
[Figure: Overfitting, Appropriate fitting, Underfitting]
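The point where training error keeps falling while validation error starts to rise can be located by tracking both errors as the model keeps learning. Below is a minimal sketch of that idea, assuming scikit-learn and a synthetic dataset (not from the original material); it uses a gradient-boosting model only because its staged predictions make the "learning over time" behaviour easy to observe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real training/testing split
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

model = GradientBoostingClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

# staged_predict yields the model's predictions after each boosting stage,
# i.e. snapshots of the model as it keeps learning over time.
train_errors = [np.mean(pred != y_train) for pred in model.staged_predict(X_train)]
val_errors = [np.mean(pred != y_val) for pred in model.staged_predict(X_val)]

# Stop just before the validation error starts increasing: the "good fit" spot
best_stage = int(np.argmin(val_errors))
print(f"best stage: {best_stage + 1}")
print(f"train error: {train_errors[best_stage]:.3f}, "
      f"validation error: {val_errors[best_stage]:.3f}")
```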
Ensemble Learning
Ensemble learning is a machine learning technique that enhances accuracy and resilience in
forecasting by merging predictions from multiple models. It aims to mitigate the errors or
biases that may exist in individual models by leveraging the collective intelligence of the
ensemble. The underlying idea is to combine the outputs of diverse models into a more precise
prediction: by considering multiple perspectives and exploiting the strengths of different
models, ensemble learning improves the overall performance of the learning system and provides
resilience against uncertainties in the data. This has made it a powerful tool in many domains,
offering more robust and reliable forecasts.
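As a concrete illustration of merging predictions from multiple models, the sketch below (assuming scikit-learn and toy data, not part of the original text) lets three diverse classifiers vote on each prediction, so that individual errors tend to cancel out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three diverse models; the ensemble takes a majority vote over their outputs
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))
```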
Bagging and Boosting
Bagging is an ensemble learning technique that aims to reduce error by training a set of
homogeneous machine learning algorithms. The key idea of bagging is to use multiple base
learners, each trained separately on a random sample of the training set, and to combine
them through a voting or averaging approach to produce a more stable and accurate model,
as sketched below.
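A minimal sketch of bagging, assuming scikit-learn and toy data: fifty decision trees, each trained on a random bootstrap sample of the rows, combined by majority vote. The base-learner argument is named estimator in recent scikit-learn releases (older versions call it base_estimator).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # homogeneous base learner
    n_estimators=50,      # number of base learners trained separately
    max_samples=0.8,      # each learner sees a random sample of the rows
    bootstrap=True,       # sampling with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))
```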
Boosting is an ensemble learning technique that, like bagging, makes use of a set
of base learners to improve the stability and effectiveness of an ML model. The idea
behind a boosting architecture is to generate hypotheses sequentially, where each
hypothesis tries to improve on or correct the mistakes made by the previous one.
The classic AdaBoost illustration (datasets B1 to B4) explains the algorithm in a very simple way. Let's walk through it step by step:
• B1 consists of 10 data points of two types, plus (+) and minus (-), with 5 of each, and every point is initially assigned an equal weight. The first model tries to classify the data points and generates a vertical separator line, but it wrongly classifies 3 pluses (+) as minuses (-).
• B2 consists of the same 10 data points, in which the 3 wrongly classified pluses (+) are weighted more heavily so that the current model tries harder to classify them correctly. This model generates a vertical separator line that correctly classifies the previously misclassified pluses (+), but in this attempt it wrongly classifies three minuses (-).
• B3 consists of the same 10 data points, in which the 3 wrongly classified minuses (-) are weighted more heavily so that the current model tries harder to classify them correctly. This model generates a horizontal separator line that correctly classifies the previously misclassified minuses (-).
• B4 combines B1, B2, and B3 to build a strong prediction model that is much better than any individual model used, as sketched in the code after this list.
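A short sketch of the same idea, assuming scikit-learn and toy data: AdaBoost fits shallow "stump" learners sequentially, re-weighting misclassified points so the next learner focuses on them, and finally combines all learners much as B4 combines B1, B2, and B3. The base-learner argument is named estimator in recent scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a stump, like the single separator lines above
    n_estimators=50,     # number of sequential weak learners
    learning_rate=1.0,
    random_state=0,
)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```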
The main difference between a single decision tree and a Random Forest is that Random
Forest is a bagging method that uses subsets of the original dataset to make predictions,
and this property helps Random Forest overcome overfitting. Instead of building a single
decision tree, Random Forest builds a number of decision trees, each on a different set
of observations. One big advantage of this algorithm is that it can be used for
classification as well as regression problems.
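A brief sketch, assuming scikit-learn and synthetic data, showing that the same Random Forest idea serves both problem types.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the forest takes a majority vote over its trees
Xc, yc = make_classification(n_samples=500, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc, yc)

# Regression: the forest averages the trees' predictions
Xr, yr = make_regression(n_samples=500, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)

print(clf.predict(Xc[:3]), reg.predict(Xr[:3]))
```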
Random Forest
Step-1 – We first make subsets of our original data by doing row sampling and
feature sampling: we select rows with replacement (bootstrapping) and a random
subset of the features, and use them to create subsets of the training dataset,
as sketched below.
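A hand-rolled sketch of Step-1 using only NumPy (the array sizes here are illustrative assumptions): draw a bootstrap sample of the rows and a random subset of the features for one tree.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # toy data: 100 rows, 10 features

n_rows, n_features = X.shape
row_idx = rng.choice(n_rows, size=n_rows, replace=True)   # row sampling with replacement
col_idx = rng.choice(n_features, size=int(np.sqrt(n_features)),
                     replace=False)                        # random subset of features

X_subset = X[np.ix_(row_idx, col_idx)]   # training subset for one tree
print(X_subset.shape)
```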
To get the OOB evaluation we need to set a parameter called oob_score to True. We can see that the score we
get from the OOB samples and the score on the test dataset are roughly the same. In this way, we can use these
left-out samples to evaluate our model.
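A minimal sketch of the OOB evaluation described above, assuming scikit-learn and synthetic data: with oob_score=True the forest scores each tree on the rows left out of its bootstrap sample, and the resulting oob_score_ is usually close to the score on a held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print("OOB score:", rf.oob_score_)              # evaluated on left-out samples
print("Test score:", rf.score(X_test, y_test))  # usually roughly the same
```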