Lecture Notes - Random Forests
Random Forests
You are familiar with decision trees; now it's time to learn about random forests, which are collections of decision
trees. The great thing about random forests is that they almost always outperform a single decision tree in terms of
accuracy.
Ensembles
An ensemble means a group of things viewed as a whole rather than individually. In ensembles, a collection of
models is used to make predictions, rather than individual models. Arguably, the most popular in the family of
ensemble models is the random forest: an ensemble made by the combination of a large number of decision trees.
For an ensemble to work, each model of the ensemble should comply with the following conditions:
1. Each model should be diverse. Diversity ensures that the models serve complementary purposes, which
means that the individual models make predictions independent of each other.
2. Each model should be acceptable. Acceptability implies that each model is at least better than a random
model.
Consider a binary classification problem where the response variable is either 0 or 1. You have an ensemble of three
models, where each model has an accuracy of 0.7, i.e. it is correct 70% of the time. The following table shows all the
possible cases that can occur while classifying a test data point as 1 or 0. The rightmost column shows the
probability of each case.
Figure 1 - Ensemble models
In the table, there are 4 cases each where the decision of the final model (ensemble) is correct or wrong. Let p be the
probability of the ensemble being correct and q the probability of the ensemble being wrong. With majority voting, the
ensemble is correct whenever at least two of the three models are correct, so
p = 3 x (0.7^2 x 0.3) + 0.7^3 = 0.441 + 0.343 = 0.784, and q = 1 - p = 0.216.
You can see how an ensemble of just three models boosts the accuracy from 70% to 78.4%. In general, the more
models in the ensemble, the higher its accuracy, as long as the models stay diverse and acceptable.
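To make the arithmetic concrete, here is a small Python sketch (not part of the original notes; the accuracy of 0.7 is the value used above, and the model counts are illustrative) that computes the majority-vote accuracy of an ensemble of n independent models using the binomial distribution:

from math import comb

def majority_vote_accuracy(n_models, p=0.7):
    # Probability that a strict majority of n independent models is correct,
    # when each model is individually correct with probability p.
    k_min = n_models // 2 + 1
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(k_min, n_models + 1))

print(majority_vote_accuracy(3))    # 0.784, as computed above
print(majority_vote_accuracy(15))   # ~0.95: accuracy keeps rising with more models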
Random forests are created using a special ensemble method called bagging, which stands for Bootstrap
Aggregation. Bootstrapping means creating bootstrap samples from a given data set; a bootstrap sample is created
by sampling the given data set uniformly and with replacement. A bootstrap sample typically contains about 30-70%
of the data from the original data set. Aggregation means combining the results of the different models present in the ensemble.
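The sampling step can be illustrated in a few lines of Python (a sketch, not from the original notes; the data set size and the 60% sample fraction are arbitrary choices within the 30-70% range):

import numpy as np

rng = np.random.default_rng(42)
N = 1000                               # number of observations in the training set
sample_size = int(0.6 * N)             # assumed bootstrap sample size (60% of N)

# Sampling row indices uniformly WITH replacement gives a bootstrap sample.
bootstrap_idx = rng.choice(N, size=sample_size, replace=True)

# Observations that were never drawn are "out-of-bag" for this sample.
oob_idx = np.setdiff1d(np.arange(N), bootstrap_idx)
print(len(np.unique(bootstrap_idx)), "unique rows in the bootstrap sample,",
      len(oob_idx), "rows out-of-bag")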
A random forest is an ensemble of many decision trees. It is created in the following way (a code sketch follows the list):
1. Create a bootstrap sample from the training set.
2. Construct a decision tree using the bootstrap sample. While splitting a node of the tree, consider only a
random subset of features. Every time a node is split, a different random subset of features is considered.
3. Repeat steps 1 and 2 n times to construct n trees in the forest. Since each tree is constructed
independently, the trees can be built in parallel.
4. To predict on a test case, each tree predicts individually, and the final prediction is given by the majority
vote of all the trees.
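Below is a minimal from-scratch sketch of this procedure (not the original course code); scikit-learn's DecisionTreeClassifier is used for the individual trees, and the data set, number of trees and 60% sample fraction are illustrative choices:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees, sample_frac = 50, 0.6
trees = []

for _ in range(n_trees):
    # Step 1: bootstrap sample (uniform sampling with replacement).
    idx = rng.choice(len(X_train), size=int(sample_frac * len(X_train)), replace=True)
    # Step 2: a tree that considers a random subset (sqrt(M)) of features at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=int(rng.integers(10**9)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 4: majority vote of the individual tree predictions (labels here are 0/1,
# so the vote reduces to checking whether the mean prediction is at least 0.5).
all_preds = np.array([t.predict(X_test) for t in trees])
forest_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("Forest accuracy:", (forest_pred == y_test).mean())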
The OOB (out-of-bag) error is calculated by using each observation of the training set as a test observation. Since each tree is
built on a bootstrap sample, an observation can be used as a test observation by those trees which did not have it in
their bootstrap sample. All such trees predict on this observation, which gives an error for that single observation. The
final OOB error is obtained by computing this error for every observation and aggregating the results.
It turns out that the OOB error is about as good an estimate of model performance as the cross-validation error.
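In scikit-learn this estimate is available directly through the oob_score option of RandomForestClassifier (a sketch; the data set and the number of trees are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True makes the forest score each training observation using only the
# trees that did not see it in their bootstrap sample (bootstrap=True is the default).
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)   # comparable to a cross-validated estimate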
To construct a forest of S trees on a data set with M features and N observations, the time taken depends
on the following factors (these map onto scikit-learn parameters, as sketched after the list):
1. The number of trees. The time is directly proportional to the number of trees, but it can be reduced
by building the trees in parallel.
2. The size of the bootstrap sample. Generally the size of a bootstrap sample is 30-70% of N; the smaller the
sample, the faster the forest is built.
3. The size of the subset of features considered while splitting a node. Generally this is taken as √M for classification and M/3
for regression.
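For reference, these three factors correspond to scikit-learn's n_estimators, max_samples and max_features parameters (a sketch with illustrative values, not from the original notes):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(
    n_estimators=100,      # S: more trees take longer, but n_jobs builds them in parallel
    max_samples=0.5,       # bootstrap sample size as a fraction of N (here 50%)
    max_features="sqrt",   # about sqrt(M) features tried at each split (classification)
    n_jobs=-1,             # use all CPU cores
)

reg = RandomForestRegressor(
    n_estimators=100,
    max_features=1.0 / 3,  # roughly M/3 features tried at each split (regression)
    n_jobs=-1,
)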
Hyperparameter Tuning
The following hyperparameters are present in a random forest classifier: n_estimators, max_depth, max_features,
min_samples_split and min_samples_leaf. Note that most of these are actually hyperparameters of the individual
decision trees that make up the forest.
Tuning max_depth
Very deep trees tend to memorise their bootstrap samples and overfit, just as a single deep decision tree overfits the
training set. Thus, controlling the depth of the constituent trees (max_depth) helps reduce overfitting in the forest.
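One common way to see this is to grid-search over max_depth and compare the train and cross-validation scores (a sketch, not the original course code; the data set and the depth range are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10, None]},
    cv=5, scoring="accuracy", return_train_score=True,
)
grid.fit(X, y)

# A widening gap between the mean train and test scores indicates overfitting.
for depth, tr, te in zip(grid.cv_results_["param_max_depth"],
                         grid.cv_results_["mean_train_score"],
                         grid.cv_results_["mean_test_score"]):
    print(depth, round(tr, 3), round(te, 3))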
Tuning n_estimators
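Adding trees generally improves and stabilises accuracy, but with diminishing returns, so n_estimators is usually increased until the gain flattens out. A minimal sketch of examining this with the OOB score (the data set and tree counts are illustrative, not from the original notes):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Watch how the OOB accuracy changes as more trees are added to the forest.
for n in [10, 50, 100, 200, 400]:
    rf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(n, "trees -> OOB accuracy:", round(rf.oob_score_, 4))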
Running a grid search over these hyperparameters together, the best combination found was:
# We can get accuracy of 0.818285714286 using {'max_features': 10, 'n_estimators': 200, 'max_depth': 8, 'min_samples_split': 200, 'min_samples_leaf': 100}
Fitting the final model with the best parameters obtained from the grid search:
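A sketch of that final fit with the reported best parameters; X_train and y_train are assumed to be the already-prepared training data from the original exercise (not shown in these notes):

from sklearn.ensemble import RandomForestClassifier

# Best parameters reported by the grid search above.
rf_final = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    max_features=10,
    min_samples_split=200,
    min_samples_leaf=100,
    random_state=0,        # assumption: any fixed seed, for reproducibility
)
rf_final.fit(X_train, y_train)
print("Training accuracy:", rf_final.score(X_train, y_train))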