Da MS
27-737
Random Forest
Lecture 6
Revised: 21 April 2021
Recap: linear regression, prediction

Useful sources of information (both in Canvas):
• The algorithm for random forests is presented on page 588 of Hastie et al., The Elements of Statistical Learning.
• Another useful resource for learning about random forests is: Leo Breiman, Random forests, Machine Learning, 45, 5–32 (2001).
Resources
A decision tree is a decision-support tool that uses a tree-like model of decisions and their possible consequences. A tree is built from three kinds of nodes:
• Decision nodes
• Chance nodes
• End nodes
Decision trees are all about information and how to use it in a structured way. We mention them here because they are the building blocks of the random forest model and are useful in their own right.
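As a concrete illustration, the sketch below fits a small decision tree with scikit-learn and prints its decision and end nodes; the tiny play-tennis-style dataset and its numeric encoding are invented for illustration, not taken from the lecture.

```python
# Minimal sketch: fitting and inspecting a decision tree with scikit-learn.
# The toy "play tennis" data and its encoding are made up for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook (0=sunny, 1=overcast, 2=rain), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0]]
y = [0, 1, 1, 0, 1, 1]  # 0 = don't play, 1 = play

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# export_text prints the learned decision nodes and leaf (end) nodes.
print(export_text(tree, feature_names=["outlook", "humidity"]))
print(tree.predict([[0, 0]]))  # prediction for a sunny, normal-humidity day
```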
Decision trees
To play tennis or not to play tennis? “What feature will split the observations in a way that the resulting groups are as different from each other as possible (and the members of each resulting subgroup are as similar to each other as possible)?” Splitting stops when the data cannot be split further.

In a decision tree model, splits are chosen to maximize information gain. For a regression problem, the residual sum of squares (RSS) can be used, and for a classification problem, the Gini index or entropy would apply, as sketched below. (See the talk on SlideShare: https://www.slideshare.net/marinasantini1/lecture-4-decision-trees-2-entropy-information-gain-gain-ratio-55241087.)

Pruning decision trees is discussed at:
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
https://en.wikipedia.org/wiki/Decision_tree_pruning
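To make the split criteria concrete, here is a minimal NumPy sketch of the Gini index, entropy, information gain, and the RSS used for regression splits; the labels and the candidate split are made up for illustration.

```python
# Sketch: split criteria for decision trees (labels and split chosen arbitrarily).
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy (in bits) of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def rss(y):
    """Residual sum of squares around the mean; for a regression split,
    sum the RSS of the two child nodes and pick the split that minimizes it."""
    y = np.asarray(y, dtype=float)
    return np.sum((y - y.mean()) ** 2)

def information_gain(parent, left, right, impurity=entropy):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

parent = np.array([1, 1, 1, 0, 0, 1, 0, 1])  # e.g. play / don't play
left, right = parent[:4], parent[4:]         # one candidate split
print(gini(parent), entropy(parent), information_gain(parent, left, right))
```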
Decision trees
High entropy alloy dataset (we have seen this in the discussion of regular expressions), with compositions including 24 elements in five phases.
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
Random forest allows individual trees to randomly sample the dataset with replacement (bootstrap sampling).
For example, suppose we have a training dataset with N = 6 points: {1, 2, 3, 4, 5, 6}. Randomly sampling the dataset with replacement might lead to something like {1, 2, 2, 5, 5, 6}, which still contains N = 6 points but repeats some observations and omits others.
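A minimal NumPy sketch of this bootstrap step; the random seed and the printed draws are arbitrary.

```python
# Sketch: bootstrap sampling (with replacement), as used for each tree in bagging.
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 2, 3, 4, 5, 6])  # the N = 6 training points above

bootstrap = rng.choice(data, size=len(data), replace=True)
print(bootstrap)                     # same size as data; duplicates are expected

# Out-of-bag points: those never drawn for this particular bootstrap sample.
print(np.setdiff1d(data, bootstrap))

# Bagging can also work on subsets of the data (sampling without replacement).
print(rng.choice(data, size=4, replace=False))
```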
Note that bagging can also be used by taking subsets of the data, as we see on the
next slide.
Best practice:
• each bagged tree makes use of around 2/3 of the observations
• the remaining ~1/3 of the observations are referred to as the out-of-bag (OOB) observations
• each individual tree has high variance but low bias; averaging these trees reduces the variance
• this reduces overfitting without increasing the bias, easing the bias-variance trade-off
• see the later comments on using OOB data to test accuracy and estimate feature importance; a minimal sketch follows this list
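The sketch below shows one way to use the OOB observations as a built-in validation set, assuming scikit-learn's RandomForestClassifier; the synthetic data from make_classification simply stands in for a real dataset.

```python
# Sketch: out-of-bag (OOB) accuracy and feature importance with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,
    oob_score=True,   # evaluate each tree on the ~1/3 of points it never saw
    random_state=0,
)
rf.fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
print("Feature importances:", rf.feature_importances_)
```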
Bagging: advantages
“Random forests are bagged decision tree models that split on a subset of features on each split.”
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
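One way to see the role of that feature subset is to compare bagged trees (all features considered at every split) with a random forest (a random subset per split). A hedged sketch using scikit-learn, again on synthetic stand-in data:

```python
# Sketch: bagging vs. random forest differs only in max_features,
# the number of features considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=1)

bagged = RandomForestClassifier(n_estimators=100, max_features=None, random_state=1)    # all features: plain bagged trees
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)  # random subset: random forest

print("bagged trees :", cross_val_score(bagged, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```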
• Works well with small to medium datasets, unlike neural networks, which typically require large amounts of data
• Has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
Pick your favorite program and search for random forest in the documentation.
“Despite the recent fast progress in materials informatics and data science, data-driven molecular design of organic photovoltaic (OPV) materials remains challenging. We report a screening of conjugated molecules for polymer–fullerene OPV applications by supervised learning methods (artificial neural network (ANN) and random forest (RF)).”