Phys361 S24 Lecture 17 Random Forests
Physics
Lecture 17 – Random Forests and
Gradient Boosted Trees
Mar. 19th 2024
Moritz Münchmeyer
Decision tree ensembles
1. Motivation
Recall decision trees
Recall example from Gary’s lecture:
• At each step we consider all possible splits on all possible features and choose the one that leads to the highest reduction in impurity of the resulting branches.
• Typical hyperparameters (see the sketch below):
• Minimum reduction of impurity required to make a split
• Minimum number of samples required in a leaf node
• Maximum depth of the tree
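A minimal sketch of how these hyperparameters appear in scikit-learn's DecisionTreeClassifier (the iris dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    min_impurity_decrease=0.01,  # minimum impurity reduction to make a split
    min_samples_leaf=5,          # minimum number of samples in a leaf node
    max_depth=4,                 # maximum depth of the tree
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```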
Power of decision tree methods
• Decision tree methods are still the state of the art for many tabular data applications.
• Gradient boosted decision trees (GBDTs) are the current state of the art on
tabular data.
• They are used in many Kaggle competitions and are the go-to model for many
data scientists, as they tend to get better performance than neural networks while
being easier and faster to train.
• Neural networks, on the other hand, are the state of the art in many other tasks,
such as image classification, natural language processing, and speech
recognition.
• Efficiency with small to medium-sized datasets: Deep learning models excel in domains with abundant data (like images, text, and audio), where they can learn complex patterns and representations. However, many tabular datasets are relatively small or medium-sized, where deep learning models may overfit or simply not have enough data to learn adequately.
• Other advantages:
• Interpretability
• Speed
• Feature Importance is easy to evaluate
• Robust to outliers and missing data
• Simplicity
Decision tree ensembles
2. Random Forests (“Bagging”)
Ensemble methods
• One way to boost performance (for both classification and regression) is to aggregate the responses of several models.
• For example, in classification we could take the majority vote, perhaps weighted by some factor if different models have different precision (see the voting sketch after this slide).
• This combined model often achieves better accuracy than any single constituent.
• Reasons:
• Aggregating several base learners generally reduces the variance.
• Single models may get stuck in different local minima.
• The combined model has a higher capacity than the constituents and may fit the
data better.
• Single models may be biased in opposite directions so the biases may cancel out in
some situations.
• A popular set of models for ensemble training are decision trees. A set of decision trees
is called a forest :)
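A hedged illustration of majority voting, using scikit-learn's VotingClassifier and a synthetic dataset (chosen only for illustration, not anything specific to this course):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # plain majority vote; "soft" would average predicted probabilities
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```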
Trees for regression problems
“Decision trees” are also useful for regression problems. They are then often
called “Regression trees”.
https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
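A minimal sketch in the spirit of the linked scikit-learn example: a regression tree fit to a noisy sine curve, giving a piecewise-constant approximation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((80, 1)), axis=0)        # 1D inputs
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)  # noisy sine targets

reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)

X_test = np.linspace(0, 5, 200).reshape(-1, 1)
y_pred = reg.predict(X_test)  # piecewise-constant approximation of the sine
```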
Random Forests
• Random Forests are a collection of randomized decision trees.
• Randomization (“Bootstrapping”) occurs in two ways:
• Take many different random subsets of the training data (where elements can repeat).
• Take random subsets of the features.
• Typical hyperparameters: the number of trees and the number of features in the bootstrap subset (see the sketch below).
• Implementation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
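A minimal sketch with scikit-learn's RandomForestClassifier, showing where the hyperparameters above enter (the breast-cancer dataset is used only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree sees a bootstrap sample of the training data
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```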
How to combine trees
More important features appear earlier in the tree, since they lead to a large decrease in impurity. To quantify the importance of a feature, we can sum up the impurity improvements of all the splits associated with that feature, and then rank the features. The result of this analysis is somewhat algorithm dependent but still informative. Correlation between features can complicate the interpretation.
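A small sketch of this ranking with scikit-learn, which exposes the summed impurity improvements as feature_importances_ after fitting (the dataset is chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# rank features by their importance and print the top five
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:5]:
    print(f"{data.feature_names[i]:25s} {forest.feature_importances_[i]:.3f}")
```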
• The regularization is given by the complexity of the tree. One way to measure this is
$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$,
where T is the number of leaves and w are the scores of the leaves.
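As a hedged sketch, assuming the XGBoost library: the two terms of this complexity penalty correspond to the gamma (per-leaf) and reg_lambda (L2 on leaf scores) arguments; the data here is synthetic, purely for illustration.

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=4,
    gamma=1.0,       # penalty per leaf (the gamma * T term)
    reg_lambda=1.0,  # L2 penalty on leaf scores (the lambda * sum w^2 term)
)
model.fit(X, y)
```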
Tree Boosting: Greedy algorithm
The iteratively added tree is found with a greedy algorithm:
Step 1: Initialization
• The algorithm starts with all training instances in the root node.
• At each node, it evaluates all possible splits across all features.
Step 2: Evaluating Splits
• For each feature, the potential splits are considered. The algorithm sorts the values of the feature and then
iteratively evaluates the possible split positions between these sorted values. The "gain" from making a split is
calculated based on how much it would reduce the loss function.
Step 3: Making the Split
• The split with the highest gain is chosen and the node is divided accordingly.
Step 4: Recursion
• This process is applied recursively to each resulting subset of the data (corresponding to each branch of the split) until a stopping criterion is met: the maximum tree depth is reached, no split improves the loss by a significant amount, or a node contains too few samples.
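A simplified from-scratch sketch of this greedy split search, using a squared-error loss as a stand-in for the boosting loss (illustrative only, not XGBoost's actual implementation):

```python
import numpy as np

def node_loss(y):
    # squared-error loss of predicting the node mean (regression case)
    return np.sum((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(X, y):
    """Return (feature, threshold, gain) of the split that reduces the loss most."""
    best = (None, None, 0.0)
    parent = node_loss(y)
    for j in range(X.shape[1]):                  # loop over all features
        for t in np.unique(X[:, j])[:-1]:        # loop over candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            gain = parent - node_loss(left) - node_loss(right)
            if gain > best[2]:
                best = (j, t, gain)
    return best
```

In the real algorithm this search is applied recursively to the left and right subsets until one of the stopping criteria above is met.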
XGBoost improves upon this basic greedy algorithm by introducing several optimizations.
Decision tree ensembles
3. Application: Redshift estimation
Slides and Python notebook from: Viviana Acquaviva, Machine Learning for Physics and Astronomy, chapter 6
Redshift of a galaxy
Astronomers measure the distance of a
galaxy using its redshift. Because the
universe is expanding, galaxies that are
farther away have a higher redshift.
Target data:
The true redshift of the galaxy, obtained from more expensive spectroscopy. This is a single number, called z.
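A hedged sketch of this kind of analysis: a random forest regressor mapping photometric measurements to the spectroscopic redshift z. The file name and column names below are hypothetical placeholders, not those used in the notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("photometry.csv")  # hypothetical photometric catalog
X = df[["u_mag", "g_mag", "r_mag", "i_mag", "z_mag"]]  # hypothetical magnitude columns
y = df["z_spec"]                    # spectroscopic redshift (the regression target)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out galaxies
```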
Colab notebook
The rest of this lecture will be on Colab. We will use notebooks from Viviana Acquaviva's book “Machine Learning for Physics and Astronomy”. The notebooks can be downloaded from the course website and from the book website: https://press.princeton.edu/books/paperback/9780691206417/machine-learning-for-physics-and-astronomy
Course logistics