Random Forest
Contents
• What is Random Forest?
• Building RF in Scikit-learn
What is Random Forest?
• Random Forest is a supervised learning algorithm capable of performing both regression and classification tasks.
Ensemble method
• Use multiple learning algorithms to obtain better predictions.
• Train several different models and aggregate their predictions to improve stability and predictive power.
Bagging
• The idea behind bagging is to combine the results of multiple models (for instance, many decision trees) to get a generalized result.
• Bagging uses a sampling technique called bootstrapping.
• Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement.
• Bagging (Bootstrap Aggregating) uses these subsets (bags) to get a fair idea of the distribution of the complete set.
• The size of each subset created for bagging may be the same as or smaller than the original set, as the sketch below illustrates.
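To make bootstrapping concrete, here is a minimal sketch (the use of NumPy and a toy dataset of 10 observations are illustrative assumptions, not part of the slides):

import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(10)  # toy "dataset" of 10 observations

# Each bag is the same size as the original and is drawn with
# replacement, so some observations repeat and others are left out.
for i in range(3):
    bag = rng.choice(data, size=len(data), replace=True)
    print(f"bag {i}: {sorted(bag)}")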
Bagging
• Multiple subsets are created from the original dataset, selecting
observations with replacement.
Bagging
• A base model (weak model) is created on each of these subsets.
• The models run in parallel and are independent of each other.
• The final predictions are determined by combining the predictions from all the models, as in the sketch below.
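A minimal sketch of this idea using scikit-learn's BaggingClassifier (toy data and a recent scikit-learn, where the base-model parameter is named estimator, are assumptions here):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data

# 25 independent trees, each fit on a bootstrap sample of X;
# predictions are combined by majority vote. n_jobs=-1 fits in parallel.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=25,
    bootstrap=True,
    n_jobs=-1,
    random_state=0,
)
bagging.fit(X, y)
print(bagging.predict(X[:5]))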
How does Random Forest work?
• RF consists of multiple decision trees which act as base learners.
• Each decision tree is given a random subset of samples from the dataset (hence the name random).
• Random Forest then trains each base learner (i.e., a decision tree) on a different sample of the data, and the sampling of data points happens with replacement (see the sketch below).
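A minimal sketch with scikit-learn (the iris data is an illustrative assumption):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the
# training data (sampling with replacement, as described above).
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))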
Example
• Consider a training dataset: [X1, X2, X3, … X10, Y].
• Random Forest will create decision trees, each taking its input from a subset drawn using bagging, as described above.
Hyper-Parameters of Random Forest
• Optimization of RF depends on a few built-in parameters.
Parameters of Random Forest
• max_depth is the maximum depth of each tree. The deeper the tree, the more splits it has and the more information it captures about the data, as the sketch below illustrates.
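A small sketch of the effect of max_depth (the iris data and the chosen depths are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# A larger max_depth allows more splits; compare the depth actually
# reached by the first tree in each forest. None grows trees fully.
for depth in (2, 5, None):
    rf = RandomForestClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(depth, rf.estimators_[0].get_depth())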
Cross-Validation (CV)
• Cross-validation is a statistical method used to estimate the performance of
machine learning models.
• Normally, we split the data into train and test sets.
• In K-fold CV, the training data is further split into K subsets, called folds.
Cross-Validation (CV)
• We then iteratively fit the model K times, each time training on K-1 of the folds and evaluating on the remaining fold (called the validation data).
• For example, suppose the training data is split into 5 folds (K = 5).
• 1st iteration: train on the first four folds and evaluate on the fifth.
• 2nd iteration: train on the first, second, third, and fifth folds and evaluate on the fourth, and so on (see the sketch below).
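A minimal sketch using scikit-learn's cross_val_score (the iris data is an illustrative assumption):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5 fits the model 5 times: each fold serves once as the
# validation data while the other 4 folds are used for training.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())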
Cross-Validation (CV)
• 5-Fold Cross-Validation:
• If we have 10 sets of hyperparameters and are using 5-fold CV, that represents 10 × 5 = 50 training loops.
GridSearchCV
• Grid search is used to find the hyperparameters of a model that result in the most ‘accurate’ predictions.
• The first step is to create a dictionary of all the parameters and the corresponding sets of values you want to test for best performance, as in the sketch below.
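A minimal sketch with scikit-learn's GridSearchCV (the grid values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Step 1: a dictionary of parameters and the candidate values to test.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

# 6 parameter combinations x 5 folds = 30 training loops.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)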
Pros & Cons
Pros:
• Random Forest is less prone to overfitting than a single decision tree.
• The same Random Forest algorithm can be used for both classification and regression tasks.
• Random Forest can be used to identify the most important features in the training dataset, which helps in feature engineering (see the sketch after this list).
Cons:
• Random Forest is difficult to interpret. Because it averages the results of many trees, it is hard to figure out why a random forest is making the predictions it does.
• Random Forest takes longer to train and is computationally expensive compared to a single decision tree.
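A minimal sketch of the feature-importance point from the pros above (the iris data is an illustrative assumption):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# Impurity-based importances, one per feature, summing to 1.
for name, score in zip(data.feature_names, rf.feature_importances_):
    print(f"{name}: {score:.3f}")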