Phys361 S24 Lecture 17 Random Forests

This lecture covers decision tree ensembles, focusing on Random Forests and Gradient Boosted Trees, which are effective for tabular data applications. It discusses the advantages of tree-based models, such as their ability to handle mixed data types, capture non-linear relationships, and provide interpretability. The lecture also highlights the application of these models in estimating the redshift of galaxies using photometric data.


Physics 361 - Machine Learning in Physics
Lecture 17 – Random Forests and Gradient Boosted Trees
Mar. 19th 2024

Moritz Münchmeyer
Decision tree ensembles

1. Motivation
Recall decision trees
(Recall the example from Gary's lecture.)
• At each step we consider all possible splits on all possible features and choose the one that leads to the highest reduction in impurity of the resulting branches.

• Classification: common metric is Gini impurity


• Regression: common metric is the MSE

• Typical Hyperparameters:
• Minimum reduction of impurity to do a split
• Minimum number of samples required in a leaf node
• Maximum depth of the tree
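
As a minimal illustration (not from the lecture materials), these hyperparameters map directly onto scikit-learn's DecisionTreeClassifier arguments; the dataset and values below are only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: the iris dataset stands in for any tabular problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The hyperparameters listed above map to scikit-learn arguments
tree = DecisionTreeClassifier(
    criterion="gini",            # impurity metric for classification
    min_impurity_decrease=1e-3,  # minimum reduction of impurity to do a split
    min_samples_leaf=5,          # minimum number of samples in a leaf node
    max_depth=4,                 # maximum depth of the tree
)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))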
Power of decision tree methods
• Decision tree methods are still the state of the art for many tabular data applications.
• Gradient boosted decision trees (GBDTs) are the current state of the art on
tabular data.
• They are used in many Kaggle competitions and are the go-to model for many
data scientists, as they tend to get better performance than neural networks while
being easier and faster to train.
• Neural networks, on the other hand, are the state of the art in many other tasks,
such as image classification, natural language processing, and speech
recognition.

• Why do tree-based models still outperform deep learning on tabular data? (https://arxiv.org/abs/2207.08815)
While deep learning has enabled tremendous progress on text and image datasets, its superiority on
tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning
methods as well as tree-based models such as XGBoost and Random Forests, across a large
number of datasets and hyperparameter combinations.… Results show that tree-based models
remain state-of-the-art on medium-sized data (∼ 10K samples) even without accounting for their
superior speed. To understand this gap, we conduct an empirical investigation into the differing
inductive biases of tree-based models and Neural Networks (NNs).
Reasons for their success
• Structure of Tabular Data: Tabular data often contain a mix of categorical and numerical
features. Tree-based models can inherently handle these different types of data and their
interactions effectively.

• Non-Linearity: Tree ensembles are particularly good at capturing non-linear relationships and interactions between variables without needing to explicitly engineer these features. Deep learning models can also capture non-linearities but often require large amounts of data and complex architectures to do so effectively.

• Efficiency with Small to Medium-Sized Datasets: Deep learning models excel in domains
with abundant data (like images, text, and audio) where they can learn complex patterns
and representations. However, many tabular datasets are relatively small or medium-sized,
where deep learning models might overfit or may not have enough data to adequately learn.

• Other advantages:
• Interpretability
• Speed
• Feature Importance is easy to evaluate
• Robust to outliers and missing data
• Simplicity
Decision tree ensembles
2. Random Forests
(“Bagging”)
Ensemble methods
• One way to boost the performance (both for classification and regression) is to
aggregate the response of several models.

• For example, in classification we could take the majority vote, perhaps weighted by some factor if different models have different precision.

• This combined model often gives better accuracy than any of the single constituents.

• Reasons:
• Aggregating several base learners generally reduces the variance.
• Single models may get stuck in different local minima.
• The combined model has a higher capacity than the constituents and may fit the
data better.
• Single models may be biased in opposite directions so the biases may cancel out in
some situations.

• A popular set of models for ensemble training are decision trees. A set of decision trees
is called a forest :)
Trees for regression problems
“Decision trees” are also useful for regression problems. They are then often
called “Regression trees”.

Regression trees assign a continuous value to each leaf.

They thus approximate the function as piecewise constant.

https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
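
A minimal sketch along the lines of that scikit-learn example (illustrative data, not the course notebook):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1D regression problem: a noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# A regression tree assigns one constant value per leaf, so its
# prediction is a piecewise-constant approximation of sin(x)
reg = DecisionTreeRegressor(max_depth=3)  # deeper trees -> more, finer pieces
reg.fit(X, y)

X_grid = np.linspace(0, 5, 200).reshape(-1, 1)
y_pred = reg.predict(X_grid)  # step-like curve following the sine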
Random Forests
• Random Forests are a collection of randomized decision trees.
• Randomization ("bootstrapping") occurs in two ways:
• Take many different random subsets of the training data (drawn with replacement, so elements can repeat).
• Take random subsets of the features to consider at each split.

• We train many randomized trees on these randomized data sets.

• The final outcome is the "mean" of the many trees.
• The approach we just described is called "bagging". The name comes from Bootstrap AGGregating.

• Typical hyperparameters: the number of trees and the number of features in the bootstrap subset.

• Implementation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
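
A minimal usage sketch (illustrative data and settings, not the course notebook):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n_estimators = number of trees; max_features = size of the random
# feature subset considered at each split (two typical hyperparameters)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

# The ensemble prediction is the majority vote over all trees
# (RandomForestRegressor, which averages, is the regression counterpart)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())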
How to combine trees

• The final prediction (class or number) is typically
• the average of all predictions (for a regression problem),
• or the majority vote (for a classification problem).
Feature Importance
A nice feature of random forests and other ensembles of decision trees is that one can
evaluate which features of the data are more important than others.

The result of this analysis is somewhat algorithm dependent but is still informative.

More important features appear earlier in the tree, since they lead to a large decrease in
impurity.

To quantify the importance we can sum up the impurity improvements of all the splits associated with a given variable. One can then rank the features. Correlations between features can complicate the interpretation.

Example: predicting the median house value (target) given some information about the neighborhoods (blocks), such as the average number of rooms, the latitude, the longitude, or the median income of people in the neighborhood.
https://inria.github.io/scikit-learn-mooc/python_scripts/dev_features_importance.html
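
A sketch of this kind of analysis with scikit-learn's California housing data (downloaded on first use; this is not the notebook from the link above):

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Median house value per block, with features such as median income,
# average number of rooms, latitude and longitude
data = fetch_california_housing()
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Impurity-based importances: total impurity decrease from all splits
# on each feature, averaged over the trees (and normalized)
for name, imp in sorted(zip(data.feature_names, forest.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name:12s} {imp:.3f}")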
Decision tree ensembles
3. Gradient Boosted Decision Trees ("Boosting")
Boosting vs Bagging
• Bagging means that we average over many weaker models, which are
trained independently.

• Boosting works sequentially: a weak (simple) learner is created to make predictions. It is then progressively improved by adding further weak learners that focus on getting the problematic examples right ("boosting" the success rate).

• Two popular algorithms are Adaptive Boosting (AdaBoost) and Gradient Boosting.

• I will focus on Gradient Boosting, which seems to be the dominant method. The leading software implementations of Gradient Boosting are currently XGBoost and LightGBM.
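
To make the contrast concrete, here is a minimal scikit-learn sketch with both strategies on the same (illustrative) data; in practice XGBoost or LightGBM would replace the simple GradientBoostingClassifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: independent trees on bootstrap samples, predictions averaged/voted
bagging = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: shallow trees added sequentially, each one correcting the
# mistakes of the current ensemble
boosting = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                      learning_rate=0.1, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())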
XGBoost
XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from
the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman.
My introduction is based on https://xgboost.readthedocs.io/en/stable/tutorials/model.html (which contains mathematical details we have to skip over for time reasons).

Example: function fitting with a tree, finding the optimal tree complexity.
CART in XGBoost
The tree ensemble model of XGBoost consists of a set of classification and regression trees (CART).
A CART is a bit different from a plain decision tree, in which the leaves only contain decision values. In a CART, a real-valued score is associated with each of the leaves, which gives us richer interpretations that go beyond classification.
Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble
model, which sums the prediction of multiple trees together.
Tree Boosting
• We want to optimize the loss function by adjusting the parameters of the trees.
• What are the parameters of trees? The structure of the tree and the leaf scores.
• Learning the tree structure is much harder than a traditional optimization problem where you can simply take the gradient. It is intractable to learn all the trees at once. Instead, we use an additive strategy: fix what we have learned, and add one new tree at a time.
• It remains to ask: which tree do we want at each step? A natural thing is to add the one that
optimizes our objective, i.e. the sum of the loss function and the regularization.

• The regularization is given by the complexity of the tree. One way to measure this (the definition used in XGBoost) is Ω(f) = γT + (1/2) λ Σ_j w_j², where T is the number of leaves and w_j are the scores of the leaves.
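
For reference (following the notation of the XGBoost tutorial linked above), the model is a sum of trees, and each boosting step t adds the tree f_t that minimizes the regularized objective:

\[
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad
\mathrm{obj}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
\]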
Tree Boosting: Greedy algorithm
The iteratively added tree is found with a greedy algorithm:

Step 1: Initialization
• The algorithm starts with all training instances in the root node.
• At each node, it evaluates all possible splits across all features.
Step 2: Evaluating Splits
• For each feature, the potential splits are considered. The algorithm sorts the values of the feature and then
iteratively evaluates the possible split positions between these sorted values. The "gain" from making a split is
calculated based on how much it would reduce the loss function.

Step 3: Choosing the Best Split
• The algorithm selects the split with the highest gain. If no split results in a gain that meets the regularization criteria, the node is not split and becomes a leaf.

Step 4: Recursion
• This process is recursively applied to each resulting subset of the data (corresponding to each branch of the split) until one of the stopping criteria is met: the maximum tree depth is reached, no split improves the loss by a significant amount, or a node has too few samples.

Step 5: Outputting the Leaf Values
• Once the tree is fully grown and no more splits are made, the algorithm calculates the optimal output value for each leaf.

XGBoost improves upon this basic greedy algorithm by introducing several optimizations.
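
A minimal usage sketch with the xgboost package (pip install xgboost); the parameter values are illustrative, not tuned, but they map onto the quantities above:

import numpy as np
from xgboost import XGBRegressor

# Illustrative tabular data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=1000)

model = XGBRegressor(
    n_estimators=300,   # number of trees added sequentially
    learning_rate=0.1,  # shrink each new tree's contribution
    max_depth=4,        # stopping criterion: maximum tree depth
    gamma=0.1,          # minimum gain required to make a split (the gamma*T term)
    reg_lambda=1.0,     # L2 penalty on leaf scores (the lambda*sum(w^2) term)
)
model.fit(X, y)
print(model.predict(X[:5]))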
Decision tree ensembles
4. Application: Redshift estimation
Slides and Python notebook from: Viviana Acquaviva, "Machine Learning for Physics and Astronomy", chapter 6.
Redshift of a galaxy
Astronomers measure the distance of a
galaxy using its redshift. Because the
universe is expanding, galaxies that are
farther away have a higher redshift.

A spectrum is a high-resolution chart of brightness vs. wavelength.

For galaxies that are further away, the spectrum is stretched (all the wavelengths are longer).
(all the wavelengths are longer).

Spectra contain spikes and dips, which correspond to known transitions in basic atoms (e.g., H, O).

If I can identify the emission lines I see (from the structure – one is not enough!), I can calculate the amount of stretch, which is 1 + z (a Doppler effect, essentially!). z is called the "redshift parameter".

(Figure: example spectrum from the SDSS survey.)
Photometric redshifts
In this case, we only have the average brightness over wide ranges of wavelengths, called filters or bands (1000s of Angstroms).

Much more challenging/less accurate, but a lot cheaper to obtain!

Photo-z can be derived for billions of galaxies.

Spectroscopic redshifts (derived from line identification) can be used as a learning set for photometric redshifts.
Learning task
Input data:
A collection of photometric intensities in 6 bands (i.e., 6 numbers per galaxy).

Target data:
The true redshift of the galaxy, obtained from more expensive spectroscopy: 1 number, called z.
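
An illustrative sketch of this learning task with a random forest regressor (this is not the notebook from the book; the arrays photometry and z_spec below are hypothetical placeholders for the real catalog):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

n_gal = 5000
photometry = np.random.rand(n_gal, 6)  # 6 photometric bands per galaxy (placeholder values)
z_spec = 2.0 * np.random.rand(n_gal)   # spectroscopic ("true") redshifts (placeholder values)

X_train, X_test, z_train, z_test = train_test_split(photometry, z_spec, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, z_train)
z_photo = model.predict(X_test)  # photometric redshift estimates
print("R^2 on held-out galaxies:", model.score(X_test, z_test))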
Colab notebook
The rest of this lecture will be on Colab. We will use notebooks from the book Viviana Acquaviva, "Machine Learning for Physics and Astronomy". The notebooks can be downloaded on the course website and on the book website https://press.princeton.edu/books/paperback/9780691206417/machine-learning-for-physics-and-astronomy.
Course logistics

• Reading for this lecture:
• This lecture was based mostly on Viviana Acquaviva, "Machine Learning for Physics and Astronomy", chapter 6.
