
Lecture Notes

Random Forests
You are familiar with decision trees; now it is time to learn about random forests, which are collections of decision trees. The great thing about random forests is that they almost always outperform a single decision tree in terms of accuracy.

Ensembles

An ensemble means a group of things viewed as a whole rather than individually. In ensembles, a collection of
models is used to make predictions, rather than individual models. Arguably, the most popular in the family of
ensemble models is the random forest: an ensemble made by the combination of a large number of decision trees.

For an ensemble to work, each model of the ensemble should comply with the following conditions:
1. Each model should be diverse. Diversity ensures that the models serve complementary purposes, which means that the individual models make their predictions independently of each other.
2. Each model should be acceptable. Acceptability implies that each model is at least better than a random model.

Consider a binary classification problem where the response variable is either 0 or 1. You have an ensemble of three models, where each model has an accuracy of 0.7, i.e. it is correct 70% of the time. The following table shows all the possible cases that can occur while classifying a test data point as 1 or 0. The rightmost column shows the probability of each case.
Figure 1- Ensemble models

In the table, there are four cases where the majority decision of the ensemble is correct and four where it is wrong. Let's denote the probability of the ensemble being correct by p and the probability of it being wrong by q.

For the data in the table, p and q can be calculated as follows:

p = 0.343 + 0.147 + 0.147 + 0.147 = 0.784

q = 0.027 + 0.063 + 0.063 + 0.063 = 0.216 = 1 – p

You can see how an ensemble of just three models boosts the accuracy from 70% to 78.4%. In general, the larger the number of (diverse, acceptable) models, the higher the accuracy of the ensemble. The short sketch below verifies this calculation.
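As a quick check, here is a minimal Python sketch that computes the majority-vote accuracy of an ensemble of n independent models, each with individual accuracy p. The helper function and the ensemble sizes are illustrative assumptions, not part of these notes.

from math import comb

def majority_vote_accuracy(n_models, p):
    # Probability that a majority of n independent models is correct.
    # Assumes an odd number of models so that there are no ties.
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(n_models // 2 + 1, n_models + 1))

print(majority_vote_accuracy(3, 0.7))   # 0.784, matching the calculation above
print(majority_vote_accuracy(7, 0.7))   # the accuracy keeps improving with more models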

Creating a Random Forest

Random forests are created using an ensemble method called bagging, which stands for Bootstrap Aggregation. Bootstrapping means creating bootstrap samples from a given data set. A bootstrap sample is created by sampling the given data set uniformly and with replacement; it typically contains about 30-70% of the observations of the original data set. Aggregation means combining the results of the different models present in the ensemble. A minimal sketch of drawing a bootstrap sample is shown below.
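For illustration, here is a small NumPy sketch of bootstrapping; the toy data set and the sample size are hypothetical choices, not values from these notes.

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)                              # a toy data set of 10 observations

# sample uniformly and with replacement; the size here is ~70% of the data set
idx = rng.choice(len(data), size=7, replace=True)
bootstrap_sample = data[idx]
print(bootstrap_sample)                           # some observations repeat, some are left out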

A random forest is an ensemble of many decision trees and is created in the following way:
1. Create a bootstrap sample from the training set.

Figure 2 - Training set

Figure 3- Bootstrap sample

2. Now construct a decision tree using the bootstrap sample. While splitting a node of the tree, only consider a
random subset of features. Every time a node has to split, a different random subset of features will be
considered.
3. Repeat steps 1 and 2 n times to construct n trees in the forest. Remember that each tree is constructed independently, so it is possible to construct the trees in parallel.
4. While predicting a test case, each tree predicts individually, and the final prediction is given by the majority vote of all the trees. A self-contained sketch of this procedure follows.
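To make the steps concrete, here is a minimal, self-contained sketch built on scikit-learn's DecisionTreeClassifier. The synthetic data set, the number of trees, and the feature-subset size are illustrative assumptions, not values from these notes.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
n_trees = 25
trees = []

for _ in range(n_trees):
    # Step 1: create a bootstrap sample of the training set
    idx = rng.choice(len(X), size=len(X), replace=True)
    # Step 2: build a tree that considers a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 4: each tree predicts individually; the final prediction is the majority vote
all_preds = np.stack([t.predict(X) for t in trees])      # shape: (n_trees, n_samples)
majority = (all_preds.mean(axis=0) >= 0.5).astype(int)   # works for 0/1 labels
print("training accuracy of the hand-rolled forest:", (majority == y).mean())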

There are several advantages of a random forest:


1. A random forest is more stable than any single decision tree because the results get averaged out; it is much less affected by the instability (high variance) of an individual tree.
2. A random forest copes far better with the curse of dimensionality than a single tree, since only a random subset of features is considered at each node split.
3. You can parallelize the training of a forest since each tree is constructed independently.
4. You can calculate the OOB (Out-of-Bag) error using the training set which gives a really good estimate of the
performance of the forest on unseen data. Hence there is no need to split the data into training and
validation; you can use all the data to train the forest.

OOB (Out-of-Bag) Error

The OOB error is calculated by using each observation of the training set as a test observation. Since each tree is built on a bootstrap sample, every observation can be used as a test observation by those trees that did not have it in their bootstrap sample. All such trees predict on this observation, giving an error for that single observation. The final OOB error is obtained by computing this error for every observation and aggregating the results.

It turns out that the OOB error is about as good an estimate as the cross-validation error. In scikit-learn you can obtain it directly, as sketched below.
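A minimal sketch, assuming the same X_train and y_train used in the lab section below; oob_score=True is an actual RandomForestClassifier option that computes the OOB estimate during training.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rfc.fit(X_train, y_train)

# OOB accuracy, estimated from the training data alone
print("OOB score:", rfc.oob_score_)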

Time taken to build a forest

To construct a forest of S trees on a data set that has M features and N observations, the time taken depends on the following factors (a hedged scikit-learn sketch of these three knobs follows the list):

1. The number of trees. The time is directly proportional to the number of trees, but it can be reduced by creating the trees in parallel.
2. The size of the bootstrap sample. Generally, the size of a bootstrap sample is 30-70% of N; the smaller the sample, the faster the forest is built.
3. The size of the feature subset considered while splitting a node. Generally, this is taken as √M for classification and M/3 for regression.
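For illustration, here is how these three knobs map onto RandomForestClassifier arguments in scikit-learn. The specific values are assumptions for the sketch; max_samples is a real parameter but is only available in newer scikit-learn versions.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # 1. number of trees; n_jobs=-1 builds them in parallel
    max_samples=0.5,       # 2. bootstrap sample size, here 50% of N (requires bootstrap=True, the default)
    max_features="sqrt",   # 3. size of the feature subset at each split (≈ √M)
    n_jobs=-1,
)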

Random forests lab

Random Forest with default hyperparameters

# Importing random forest classifier from sklearn library
from sklearn.ensemble import RandomForestClassifier

# Running the random forest with default hyperparameters
rfc = RandomForestClassifier()

# fit
rfc.fit(X_train, y_train)

# Making predictions
predictions = rfc.predict(X_test)

# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Let's check the report of our default model
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

# Printing confusion matrix
print(confusion_matrix(y_test, predictions))

Hyperparameter Tuning

The following hyperparameters are present in a random forest classifier. Note that most of these hyperparameters actually belong to the individual decision trees in the forest. An example instantiation using several of them is sketched after the list.

• n_estimators: integer, optional (default=10): The number of trees in the forest.
• criterion: string, optional (default="gini"): The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific.
• max_features: int, float, string or None, optional (default="auto"): The number of features to consider when looking for the best split:
    • If int, then consider max_features features at each split.
    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    • If "auto", then max_features=sqrt(n_features).
    • If "sqrt", then max_features=sqrt(n_features) (same as "auto").
    • If "log2", then max_features=log2(n_features).
    • If None, then max_features=n_features.
• max_depth: integer or None, optional (default=None): The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
• min_samples_split: int, float, optional (default=2): The minimum number of samples required to split an internal node:
    • If int, then consider min_samples_split as the minimum number.
    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
• min_samples_leaf: int, float, optional (default=1): The minimum number of samples required to be at a leaf node:
    • If int, then consider min_samples_leaf as the minimum number.
    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
• min_weight_fraction_leaf: float, optional (default=0.): The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
• max_leaf_nodes: int or None, optional (default=None): Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by relative reduction in impurity. If None, there is an unlimited number of leaf nodes.
• min_impurity_split: float: Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold; otherwise it is a leaf.
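For illustration, here is a minimal sketch that sets several of these hyperparameters explicitly; the values are arbitrary choices for demonstration, not recommendations.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    n_estimators=200,        # number of trees in the forest
    criterion="gini",        # impurity measure used to evaluate splits
    max_features="sqrt",     # number of features considered at each split
    max_depth=8,             # limit tree depth to control overfitting
    min_samples_leaf=50,     # every leaf must contain at least 50 samples
)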

Tuning max_depth

# GridSearchCV to find optimal max_depth


from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# specify number of folds for k-fold CV


n_folds = 5

# parameters to build the model on


parameters = {'max_depth': range(2, 20, 5)}

# instantiate the model


rf = RandomForestClassifier()

# fit tree on training data


rf = GridSearchCV(rf, parameters,
cv=n_folds,
scoring="accuracy")
rf.fit(X_train, y_train)

Figure 4- Tuning max_depth


You can see that as we increase the value of max_depth, both the train and test scores increase up to a point, after which the test score starts to decrease: the ensemble starts to overfit as max_depth grows.

Thus, controlling the depth of the constituent trees helps reduce overfitting in the forest. One way to produce such a plot from the fitted grid search is sketched below.
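A minimal sketch of inspecting the cross-validation results, assuming the grid search above was run with return_train_score=True (a real GridSearchCV option; train scores are not returned by default in newer scikit-learn versions) and that pandas and matplotlib are available.

import pandas as pd
import matplotlib.pyplot as plt

# cv_results_ holds the mean train/test score for every value of max_depth tried
scores = pd.DataFrame(rf.cv_results_)

plt.plot(scores["param_max_depth"], scores["mean_train_score"], label="train accuracy")
plt.plot(scores["param_max_depth"], scores["mean_test_score"], label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()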

Tuning n_estimators

# GridSearchCV to find optimal n_estimators


from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# specify number of folds for k-fold CV


n_folds = 5

# parameters to build the model on


parameters = {'n_estimators': range(100, 1500, 400)}

# instantiate the model (note we are specifying a max_depth)


rf = RandomForestClassifier(max_depth=4)

# fit tree on training data


rf = GridSearchCV(rf, parameters,
cv=n_folds,
scoring="accuracy")
rf.fit(X_train, y_train)

Figure 5- Tuning n_estimators


Tuning max_features

# GridSearchCV to find optimal max_features


from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# specify number of folds for k-fold CV


n_folds = 5

# parameters to build the model on


parameters = {'max_features': [4, 8, 14, 20, 24]}

# instantiate the model


rf = RandomForestClassifier(max_depth=4)

# fit tree on training data


rf = GridSearchCV(rf, parameters,
cv=n_folds,
scoring="accuracy")
rf.fit(X_train, y_train)

Figure 6-Tuning max_features


Grid Search to Find Optimal Hyperparameters

We can now find the optimal hyperparameters using GridSearchCV.

# Create the parameter grid based on the results of the individual searches above


param_grid = {
'max_depth': [4,8,10],
'min_samples_leaf': range(100, 400, 200),
'min_samples_split': range(200, 500, 200),
'n_estimators': [100,200, 300],
'max_features': [5, 10]
}
# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=1)
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# printing the optimal accuracy score and hyperparameters


print('We can get accuracy of', grid_search.best_score_, 'using', grid_search.best_params_)

# We can get accuracy of 0.818285714286 using {'max_features': 10, 'n_estimators': 200,
# 'max_depth': 8, 'min_samples_split': 200, 'min_samples_leaf': 100}
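Since refit=True by default, grid_search.best_estimator_ already holds a forest refit on the entire training set with these best parameters, so you could also use it directly; a minimal usage sketch:

# predict with the refit best estimator instead of retraining manually
best_rf = grid_search.best_estimator_
predictions = best_rf.predict(X_test)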

Fitting the final model with the best parameters obtained from grid search.

# model with the best hyperparameters


from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(bootstrap=True,
                             max_depth=8,
                             min_samples_leaf=100,
                             min_samples_split=200,
                             max_features=10,
                             n_estimators=200)
# fit
rfc.fit(X_train,y_train)
# predict
predictions = rfc.predict(X_test)
# evaluation metrics
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))
