
Data Analytics for Materials Science

27-737

A.D. (Tony) Rollett, R.A. LeSar (Iowa State Univ.)


Dept. Materials Sci. Eng., Carnegie Mellon University

Random Forest

Lecture 6
Revised: 21st Apr., 2021

1 Do not re-distribute these slides without instructor permission


To date, we have discussed:
• linear algebra
• linear regression: prediction
• multiple linear regression: prediction

Recap 2
Useful sources of information (both in Canvas):
• The algorithm for random forests is presented on Page
588 of Hastie et al. Elements of Statistical Learning.
• Another useful resource for learning about random
forests is: Leo Breiman, Random forests, Machine
learning, 45, 5–32 (2001).

Resources 3
A decision tree is a tool for making decisions that uses a tree-like model of decisions
and their possible consequences.

A formal decision tree consists of three types of nodes: [1]

• Decision nodes
• Chance nodes
• End nodes

Decision trees are all about information and how to use it in a structured way.

We mention them here because they are the building blocks of the random forest
model and useful in their own right.

Decision trees 4
To play tennis or not to play tennis? “What feature will split the observations in a way that the resulting groups are as different from each other as possible (and the members of each resulting subgroup are as similar to each other as possible)?”

In a decision tree model, splits are chosen to maximize information gain. For a regression problem, the residual sum of squares (RSS) can be used, and for a classification problem, the Gini index or entropy would apply. (See the talk at https://www.slideshare.net/marinasantini1/lecture-4-decision-trees-2-entropy-information-gain-gain-ratio-55241087 on SlideShare.)

Splitting stops when the data cannot be split further. Pruning decision trees is discussed at:
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
https://en.wikipedia.org/wiki/Decision_tree_pruning
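
To make the split criteria concrete, the short Python sketch below (not from the slides; the class counts are made up) computes the Gini impurity and the entropy-based information gain of one candidate split by hand:

import numpy as np

def gini(counts):
    # Gini impurity of a node, given its class counts.
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    # Shannon entropy (in bits) of a node, given its class counts.
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical tennis example: the parent node has 9 "play" / 5 "don't play" days,
# and a candidate split produces children with counts [6, 1] and [3, 4].
parent, left, right = [9, 5], [6, 1], [3, 4]
n = sum(parent)
gain = entropy(parent) - (sum(left) / n) * entropy(left) - (sum(right) / n) * entropy(right)
print(f"Gini(parent) = {gini(parent):.3f}, information gain of split = {gain:.3f}")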

Decision trees 5
High entropy alloy dataset
(we have seen this in the
discussion of regular
expressions) with
composition including 24
elements in five phases.

Can we predict Vickers hardness based on composition and rule-of-mixtures (ROM) density?
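
A minimal sketch of how such a prediction could be set up with a decision tree (the file name and column names are hypothetical placeholders, not the actual course dataset):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical layout: one column per element fraction, plus ROM density and Vickers hardness.
df = pd.read_csv("hea_dataset.csv")
X = df.drop(columns=["hardness_HV"])   # composition fractions + "rom_density"
y = df["hardness_HV"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)
print("Held-out R^2:", tree.score(X_test, y_test))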

Decision trees in materials research 6


“The Greedy Approach is based on the concept of Heuristic Problem Solving by making an optimal local choice at each node. By making these local optimal choices, we reach the approximate optimal solution globally.”

The algorithm can be summarized as follows (a short code sketch appears after the citation below):
1. At each stage (node), pick out the best feature as the test condition.
2. Now split the node into the possible outcomes (internal nodes).
3. Repeat the above steps until all the test conditions have been exhausted into leaf nodes.
see: https://www.slideshare.net/marinasantini1/lecture-4-decision-trees-2-
entropy-information-gain-gain-ratio-55241087 Courtesy of Tony Rollett.
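
The greedy recursion can be sketched in a few lines of Python (an illustrative toy for a classification setting with a Gini criterion; the helper names are invented, and real work would use an existing library instead):

import numpy as np

def gini(y):
    # Gini impurity of a set of class labels.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Greedy step: scan every feature/threshold and keep the split with the
    # lowest weighted child impurity.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best  # (weighted impurity, feature index, threshold), or None

def grow_tree(X, y, depth=0, max_depth=3):
    # Stop when the node is pure, cannot be split, or the depth limit is reached.
    split = None if gini(y) == 0 or depth == max_depth else best_split(X, y)
    if split is None:
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]  # leaf: majority class
    _, j, t = split
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": grow_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth)}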

Decision trees in materials research 7


“Random forests are bagged decision tree models that split on a subset of features
on each split.” https://towardsdatascience.com/why-random-forest-is-my-favorite-machine-learning-model-b97651fa3706

“Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction (see figure).”

https://towardsdatascience.com/understanding-random-
forest-58381e0602d2
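
As a toy illustration of this voting step (the class labels below are made up, loosely styled after phase labels):

from collections import Counter

# Hypothetical class predictions from nine individual trees for one observation.
tree_predictions = ["FCC", "BCC", "FCC", "FCC", "HCP", "FCC", "BCC", "FCC", "FCC"]

# The forest's prediction is the class with the most votes across its trees.
forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(forest_prediction)  # FCC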

Random Forest model: basic idea 8


The basic concept behind random forest is based on the wisdom of crowds.

Random forest takes a large number of uncorrelated trees (models) that operate as a committee, which will outperform any of the individual models.

A key feature is that the models must have low correlation between them.

The low correlation between trees protects each of them from their individual errors.
https://towardsdatascience.com/understanding-
random-forest-58381e0602d2

Random Forest model: uncorrelated trees 9


Decision trees are very sensitive to the data they are trained on: small changes in the training set can produce trees with very different structures.

Random forest allows each individual tree to randomly sample the dataset with replacement.

For example, suppose we have a training dataset with N=6 points: {1,2,3,4,5,6}.
Random sampling the data set with replacement might lead to something like
{1,2,2,5,5,6}, in which N=6.
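
A minimal sketch of drawing such a bootstrap sample (using NumPy; the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 2, 3, 4, 5, 6])

# Draw N = 6 points with replacement: some points repeat, others are left out.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # e.g. [1 2 2 5 5 6]; the exact draw depends on the seed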
Note that bagging can also be used by taking subsets of the data, as we see on the
next slide.

Random Forest model: bootstrap aggregating (bagging) 10


“Instead of building a single smoother from the complete data set, 100 bootstrap samples of the data were drawn. Each sample is different from the original data set, yet resembles it in distribution and variability. For each bootstrap sample, a LOESS smoother was fit. Predictions from these 100 smoothers were then made across the range of the data. The first 10 predicted smooth fits appear as grey lines in the figure below. The lines are clearly very wiggly and they overfit the data - a result of the bandwidth being too small.”

By taking the average of the 100 smoothers, we arrive at one bagged predictor (red line). Clearly, the mean is more stable and there is less overfit.

https://en.wikipedia.org/wiki/Bootstrap_aggregating
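
A toy version of the same experiment can be sketched with a flexible polynomial fit standing in for the LOESS smoother (synthetic data, purely illustrative):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = np.sin(x) + rng.normal(scale=0.4, size=x.size)   # noisy synthetic data

grid = np.linspace(0, 10, 200)
fits = []
for _ in range(100):
    idx = rng.integers(0, x.size, size=x.size)        # bootstrap sample (with replacement)
    coeffs = np.polyfit(x[idx], y[idx], deg=7)        # deliberately flexible, so each fit is wiggly
    fits.append(np.polyval(coeffs, grid))

bagged = np.mean(fits, axis=0)                        # averaging the fits gives a much more stable predictor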

Bootstrap aggregating (bagging) 11


Reducing variance
• a natural way to reduce the variance and hence increase the prediction accuracy
of a statistical learning method is to take many training sets from the population,
build a separate prediction model using each training set, and average the
resulting predictions

Best practice:
• each bagged tree makes use of around 2/3 of the observations
• remaining 1/3 of the observations are referred to as the out-of-bag (OOB)
observations
• each individual tree has high variance but low bias; averaging these trees reduces the variance
• reduces overfitting: the variance is lowered while the bias stays low, easing the bias-variance trade-off
• See later comments for use of OOB data for testing accuracy and feature
importance
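
In scikit-learn, for example, the OOB observations can be used directly for an internal accuracy estimate (a minimal sketch on synthetic data):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# bootstrap=True (the default) bags each tree on roughly 2/3 of the data;
# oob_score=True scores each observation using only the trees that never saw it.
rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
print("Out-of-bag R^2:", rf.oob_score_)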

Bagging: advantages 12
“Random forests are bagged decision tree
models that split on a subset of features
on each split.”

In addition to bagging, each tree in a random forest bases its splits on a random subset of features.

In the example, while a decision tree would consider all 4 features, each tree in a random forest would base its splits on a subset of the features.
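
In scikit-learn this per-split feature subsampling is controlled by the max_features parameter (a brief sketch; "sqrt" is one common choice, not a recommendation from the slides):

from sklearn.ensemble import RandomForestClassifier

# At each split, every tree considers only a random subset of about sqrt(n_features)
# candidate features, which helps decorrelate the trees in the ensemble.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)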

Random Forest model: basic idea 13


The basic concept behind random forest is based on the wisdom of crowds.

Random forest takes a large number of uncorrelated trees (models) that operate as a committee, which will outperform any of the individual models.

A key feature is that the models must have low correlation between them.

The low correlation between trees protects each of them from their individual errors.
https://towardsdatascience.com/understanding-
random-forest-58381e0602d2

Random Forest model: uncorrelated trees 14


“The random forest is a classification algorithm consisting of many decision trees.
It uses bagging and feature randomness when building each individual tree to try
to create an uncorrelated forest of trees whose prediction by committee is more
accurate than that of any individual tree.”

https://towardsdatascience.com/understanding-
random-forest-58381e0602d2

Random Forest model: summary 15


Decision trees:
• trees give insight into decision rules
• rather fast computationally
• predictions of trees tend to have high variance

Random Forest:
• a “black box”: rather hard to gain insight into the decision rules
• rather slow computationally
• has smaller prediction variance and thus usually a better performance

Decision trees versus Random Forest 16


• No statistical assumptions

• Works with any kind of data – continuous / categorical – intrinsically multiclass

• Can express any function – regression / classification

• Works well with small to medium datasets, unlike neural networks, which typically require large amounts of data

• Can handle thousands of input variables without variable selection


- provides feature importance

• It has an effective method for estimating missing data and maintains accuracy
when a large proportion of the data are missing

Random Forest: attributes 17


1. How much each feature decreases the variance in a tree
• For a forest, the variance decrease from each feature can be averaged and
the features are ranked according to this measure
• Biased towards preferring variables with more categories
(Bias in random forest variable importance measures: Illustrations, sources and a solution — on Canvas)
• When the dataset has two (or more) correlated features, one may show up as highly important while the other appears unimportant (this applies to other methods too)
-The effect of this phenomenon is somewhat reduced by the random selection of features at each node creation
2. Random shuffling of the variables
• permute the values of each feature and measure how much the permutation
decreases the accuracy of the model
• The OOB data is passed along each tree to determine the "test error" (since the
OOB were not used to train). See section 15.3.1 in Hastie et al.
• For each variable, the values are permuted in the OOB to evaluate the sensitivity
to that variable (from the increase in the test error).
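
Both measures are exposed by scikit-learn (a minimal sketch on a built-in dataset; note that scikit-learn's permutation_importance permutes whichever data you pass it, typically a held-out set, rather than the OOB samples described in Hastie et al.):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# 1. Mean decrease in impurity/variance, averaged over the trees of the forest.
print(dict(zip(X.columns, rf.feature_importances_.round(3))))

# 2. Permutation importance: shuffle one feature at a time and measure the drop in score.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(dict(zip(X.columns, perm.importances_mean.round(3))))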

Random Forest model: interpretation 18


R: randomForest package (available on CRAN)

Matlab: TreeBagger selects a random subset of predictors to use at each decision split, as in the random forest algorithm (see the documentation).

Mathematica: use Predict[] with Method -> “RandomForest”

There are also implementations in Python, …

Pick your favorite program and search for random forest in the documentation.
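
For instance, a minimal end-to-end example with Python's scikit-learn (using a built-in demonstration dataset, not a course dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)
print("Test accuracy:", rf.score(X_test, y_test))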

Random Forest model: availability 19


QUESTIONS?

20
“Despite the recent fast progress in materials informatics and data science, data-driven
molecular design of organic photovoltaic (OPV) materials remains challenging. We report
a screening of conjugated molecules for polymer−fullerene OPV applications by
supervised learning methods (artificial neural network (ANN) and random forest (RF)).

Approximately 1000 experimental parameters including power conversion efficiency
(PCE), molecular weight, and electronic properties are manually collected from the
literature and subjected to machine learning with digitized chemical structures. Contrary
to the low correlation coefficient in ANN, RF yields an acceptable accuracy, which is twice
that of random classification.”

Results based on 1200 points from 500 papers.


Computer-Aided Screening of Conjugated Polymers for Organic Solar Cell: Classification by Random Forest, S. Nagasawa et al., J. Phys. Chem. Lett. 9, 2639 (2018)

Random Forest model: examples from materials research 21

Artificial Neural Nets (ANN) led to a relation with r=0.37, which is not acceptable.

They represented PCE in 4 groups (e) and used the RF in (d).

Based in part on the RF results, they demonstrated an alternative approach to the design of polymers for OPVs.

Random Forest model: examples from materials research 22

1. How much each feature decreases the variance in a tree
• For a forest, the variance decrease from each feature can be averaged and
the features are ranked according to this measure
• Biased towards preferring variables with more categories
(Bias in random forest variable importance measures: Illustrations, sources and a solution — on Canvas)
• When the dataset has two (or more) correlated features, one may show up as highly important while the other appears unimportant (this applies to other methods too)
-The effect of this phenomenon is somewhat reduced by the random selection of features at each node creation
2. Random shuffling of the variables
• permute the values of each feature and measure how much the permutation
decreases the accuracy of the model

Random Forest model: interpretation 23


Lecture 17: RF models part II

24
