Random Forest

Random Forest is an ensemble learning method that constructs multiple decision trees and outputs the class that is the mode of the individual trees' predicted classes (classification) or the mean of their predictions (regression). It works by building decision trees on various bootstrap sub-samples of the dataset and aggregating their predictions. Key hyperparameters are the number of trees, the maximum depth of each tree, and the number of features considered at each split. Cross-validation and grid search are used to tune hyperparameters and evaluate performance.

Random Forest

1
Contents
• What is Random Forest?

• Ensemble Methods - Bagging

• How does Random Forest work?

• Hyper-Parameters in Random Forest

• Parameter Tuning - Cross-Validation & GridSearchCV

• Building RF in Scikit-learn

• Pros and Cons

2
What is Random Forest?
• Random Forest is a supervised learning algorithm capable of performing both regression and classification tasks.

• As the name suggests, the Random Forest algorithm creates a forest out of a number of decision trees.

3
Ensemble method
• Use multiple learning algorithms to obtain better predictions.
• Train several different models and aggregate their predictions to improve stability and predictive power.

• To do this we need a number of models (learners) whose predictive power is only slightly better than random chance. Such learners are called weak learners.
• We combine such weak learners to build one strong learner.

4
Bagging
• The idea behind bagging is to combine the results of multiple models (for instance, many decision trees) to get a more generalized result.
• Bagging uses a sampling technique called bootstrapping.
• Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement.
• The Bagging (Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution of the complete set.
• The size of the subsets created for bagging may be the same as or smaller than the original set.

5
Bagging
• Multiple subsets are created from the original dataset, selecting
observations with replacement.

6
Bagging
• A base model (weak model) is created on
each of these subsets.
• The models run in parallel and are
independent of each other.
• The final predictions are determined by
combining the predictions from all the
models.
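• A minimal scikit-learn sketch of this idea, using decision trees as the base models (the dataset and parameter values are illustrative assumptions, not from the slides; the parameter is named base_estimator in older scikit-learn versions):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 10 decision trees, each fit on a bootstrap sample drawn with replacement
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base_estimator= in older scikit-learn
    n_estimators=10,
    bootstrap=True,
    n_jobs=-1,
    random_state=42,
)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))  # combined prediction quality on held-out data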

7
How does Random Forest work?
• RF consists of multiple decision trees which act as base learners.

• Each decision tree is given a random subset of samples from the dataset (hence the name random).

• The RF algorithm uses an ensemble method: Bagging (Bootstrap Aggregating).

• Random Forest then trains each base learner (i.e. each decision tree) on a different sample of the data, and the sampling of data points happens with replacement.
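• A minimal "Building RF in Scikit-learn" sketch (the toy dataset and settings below are assumptions chosen for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample of the training data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.predict(X_test[:5]))    # class predictions from the forest (majority vote)
print(rf.score(X_test, y_test))  # accuracy on the held-out test set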

8
Example
• Consider a training dataset: [X1, X2, X3, …, X10, Y].

• Random Forest will create decision trees, each taking its input from a bootstrap subset of this data, as sketched below:
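• The original slide illustrates this with a diagram; a small NumPy sketch of how such bootstrap subsets could be drawn (row labels are hypothetical) is:

import numpy as np

rng = np.random.default_rng(0)
n_rows = 10                      # think of observations X1 ... X10
data = np.arange(1, n_rows + 1)  # stand-in for the training observations

# Draw 3 bootstrap subsets, each the same size as the original, with replacement
for tree_id in range(3):
    subset = rng.choice(data, size=n_rows, replace=True)
    print(f"tree {tree_id + 1} trains on rows: {sorted(subset)}")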

9
Hyper-Parameters in Random Forest
• Optimization of RF depends on a few built-in parameters.

• n_estimators – the number of decision trees the algorithm builds. As the number of trees increases, the predictions become more stable and performance generally improves, but computation slows down.

• max_features – the maximum number of features considered when splitting a node.

• n_jobs – the number of jobs to run in parallel. If n_jobs=1, one processor is used; if n_jobs=-1, the number of jobs is set to the number of available cores.

10
Parameters in Random Forest
• max_depth – the maximum depth of each tree. The deeper the tree, the more splits it has and the more information it captures about the data.

• criterion – the function used to measure the quality of a split. Supported criteria are “gini” for Gini impurity and “entropy” for information gain.
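• These parameters map directly onto the RandomForestClassifier constructor; the specific values below are only example settings, not recommendations:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features="sqrt",  # features considered at each split
    max_depth=10,         # maximum depth of each tree
    criterion="gini",     # split quality measure ("gini" or "entropy")
    n_jobs=-1,            # use all available cores
    random_state=42,
)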

11
Cross-Validation (CV)
• Cross-validation is a statistical method used to estimate the performance of machine learning models.

• It is a resampling procedure used to evaluate machine learning models on a limited data sample.

• The most common method is K-Fold CV.

• Normally, we split the data into train and test sets.

• In K-Fold CV, the training data is further split into K subsets, called folds.

12
Cross-Validation (CV)
• We then iteratively fit the model K times, each time training on K-1 of the folds and evaluating on the Kth fold (called the validation data).

• For example, consider the training data split into 5 folds (K = 5).

• 1st iteration – train on the first four folds and evaluate on the fifth.

• 2nd iteration – train on the first, second, third, and fifth folds and evaluate on the fourth.

• And repeat the same procedure for the remaining folds.

• At the end of training, we average the performance on each of the folds to come up with the final validation metrics for the model.
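• A 5-fold CV sketch using scikit-learn's cross_val_score (the dataset and model settings are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fits the model 5 times, each time holding out a different fold for validation
scores = cross_val_score(rf, X, y, cv=5)
print(scores)         # accuracy on each of the 5 validation folds
print(scores.mean())  # averaged to give the final validation metric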

13
Cross-Validation (CV)
• 5-Fold Cross-Validation:

• For hyperparameter tuning, we perform many iterations of the entire K-Fold CV process, each time using different model settings.

• If we have 10 sets of hyperparameters and are using 5-Fold CV, that represents 50 training runs.

14
GridSearchCV
• Grid search is used to find the optimal hyperparameters of a model, i.e. those which result in the most ‘accurate’ predictions.

• To implement the grid search algorithm we need to import the GridSearchCV class from the sklearn.model_selection module.

• The first step is to create a dictionary of all the parameters and their corresponding sets of values that you want to test for best performance.
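• A GridSearchCV sketch; the parameter grid below is only an example search space, not a recommendation:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Step 1: dictionary of parameters and the candidate values to test
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2"],
}

# Step 2: exhaustive search over the grid, with 5-fold CV for each combination
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)

print(grid.best_params_)  # hyperparameter combination with the best CV score
print(grid.best_score_)   # mean cross-validated score of that combination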

15
Pros & Cons
Pros:
• The Random Forest algorithm reduces overfitting compared to a single decision tree.
• The same Random Forest algorithm can be used for both classification and regression tasks.
• Random Forest can be used to identify the most important features in the training dataset, which helps in feature engineering (see the sketch after this list).

Cons:
• Random Forest is difficult to interpret. Because the results of many trees are averaged, it is hard to figure out why a random forest makes the predictions it does.
• Random Forest takes longer to train and is computationally expensive compared to a single Decision Tree.
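• A short sketch of reading feature importances from a fitted forest (the iris dataset and its feature names are used here only for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(data.data, data.target)

# Impurity-based importance of each input feature, summing to 1.0
for name, importance in zip(data.feature_names, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")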

16
