Decision Trees
Inference of a decision tree model is computed by routing an example from the root (at the top) to
one of the leaf nodes (at the bottom) according to the conditions. The value of the reached leaf is the
decision tree's prediction. The set of visited nodes is called the inference path.
Mathematically speaking, decision trees use hyperplanes that run parallel to one of the feature
axes to divide the feature space into regions, effectively forming hyper-cuboids that separate data
points based on feature thresholds.
ASSUMPTIONS
1. **Feature Independence**: Decision trees assume that each feature is independently split and
contributes to the decision-making process.
2. **No Assumption on Data Distribution**: Unlike models like Naive Bayes, decision trees do not
assume any specific distribution for the features (e.g., Gaussian distribution).
3. **Non-linear Relationships**: Decision trees can handle non-linear relationships between features
and the target variable, as they perform piecewise constant approximations.
4. **Categorical or Continuous Features**: Decision trees can handle both categorical and continuous
variables.
5. **Monotonicity**: In some cases, decision trees assume that the relationship between features and
target may be monotonic, but this assumption isn't strictly enforced.
STRUCTURE
Root Node: The top node where the data is split first based on a feature that best separates
the data.
Internal Nodes: Represent decisions or tests on features that help in further splitting the
data. Each internal node has two or more branches.
Branches: The links between nodes, representing the decision paths based on feature values.
Leaf Nodes: The final nodes of the tree, where the model assigns a predicted value or class
label based on the input features. No further splits occur at these nodes.
HOW IT WORKS
A Decision Tree works by splitting the dataset at each node based on the feature that best separates
the data. This process continues recursively until a stopping criterion is met (e.g., a maximum depth
or minimum samples per leaf).
In a Decision Tree, each internal node represents a feature, and the branches coming out of that node
represent the possible values that the feature can take.
- **For categorical features**: The number of branches equals the number of unique categories for
that feature.
- **For numerical features**: There are typically two branches—one for values less than a certain
number and another for values greater than or equal to it.
The leaf nodes at the end of the tree indicate the final class label or prediction.
The feature used at each node is chosen based on its **Information Gain**, which measures how well
it separates the data. Features with the highest Information Gain are placed closer to the root, as
they are the most important for making decisions.
This ensures the tree is structured to split the data efficiently at each step.
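To make this concrete, here is a minimal scikit-learn sketch (the toy feature values and labels are made up for illustration): a small tree is fit, a new example is routed from the root to a leaf, and the learned splits are printed as text.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: two numerical features and a binary label (values are illustrative).
X = [[2, 0], [4, 1], [6, 0], [8, 1], [1, 1], [7, 0]]
y = [0, 0, 1, 1, 0, 1]

# Fit a small tree; each internal node tests a single feature against a threshold.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Inference routes an example from the root to a leaf along the matching branches.
print(tree.predict([[5, 1]]))                         # predicted class label
print(export_text(tree, feature_names=["x1", "x2"]))  # text view of the learned splits
```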
PURPOSE OF SPLITTING
Splitting is done to partition the data into smaller, more homogeneous subsets, making the data
within each subset as similar as possible.
It uses criteria like Gini impurity or entropy to evaluate the splits. The goal is to minimize impurity
after a split, making the resulting child nodes as pure as possible.
Criteria: Gini Impurity, Entropy (Information Gain), and Mean Squared Error (for regression) are
common criteria used to evaluate the best split at each node.
GINI IMPURITY
Gini impurity is a measure used in decision tree algorithms to quantify a dataset's impurity level or
disorder.
For binary classification, it is calculated as:
Gini = 1 − (p1² + p2²)
where p1 and p2 are the proportions of each class in the node. A lower Gini value indicates a better split.
For example -
G1 = 1 − (4/25 + 9/25) = 12/25 = 0.48
G2 = 1 − (1/25 + 16/25) = 8/25 = 0.32
G2 is better than G1 because its impurity is lower.
The Gini curve has the same shape as the entropy curve; the only difference is that, for two classes, Gini ranges from 0 to 0.5 while entropy ranges from 0 to 1.
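A small sketch of the Gini calculation that reproduces the G1/G2 example above (assuming those fractions come from class proportions 2/5, 3/5 and 1/5, 4/5 in two candidate nodes):

```python
def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)

# Node 1: proportions 2/5 and 3/5 -> 1 - (4/25 + 9/25) = 0.48
# Node 2: proportions 1/5 and 4/5 -> 1 - (1/25 + 16/25) = 0.32
print(gini([2 / 5, 3 / 5]))  # 0.48
print(gini([1 / 5, 4 / 5]))  # 0.32, the lower (purer) of the two
```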
ENTROPY
Entropy measures the disorder or impurity in a dataset and is used to calculate Information Gain when
splitting nodes. A lower entropy indicates a more homogeneous dataset (more knowledge means less
entropy).
It is calculated as :
Entropy = − ∑ pi log2(pi)
where 𝑝𝑖 is the probability of class 𝑖 in a node. A lower entropy indicates a more pure node. Decision
Trees use entropy to guide the selection of the best split.
For example, if our data has only 2 class labels, 'Yes' and 'No', with pyes = 1/5 and pno = 4/5:
E(D) = −pyes log2(pyes) − pno log2(pno)
E(D) = −(1/5) log2(1/5) − (4/5) log2(4/5) ≈ 0.72
Observation:
More uncertainty means more entropy
For a 2-class problem the min entropy is 0 and the max entropy is 1.
For more than 2 classes the min entropy is 0 but the max entropy can be greater than 1
Both log2 and loge can be used to calculate entropy
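A short sketch of the entropy calculation that matches the example and observations above (0·log 0 is treated as 0):

```python
import math

def entropy(proportions, base=2):
    """Shannon entropy of a node's class proportions (0 * log(0) treated as 0)."""
    return -sum(p * math.log(p, base) for p in proportions if p > 0)

print(entropy([1 / 5, 4 / 5]))         # ~0.72: mostly one class, low entropy
print(entropy([0.5, 0.5]))             # 1.0: maximum entropy for two classes
print(entropy([1 / 3, 1 / 3, 1 / 3]))  # ~1.58: can exceed 1 with more than two classes
```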
Entropy for continuous variables
Practical Considerations
Entropy and Feature Selection:
Features with higher entropy may carry more information, but this depends on the problem
and the target variable. In decision trees, for example, entropy is used to measure
information gain, helping to choose features that best reduce uncertainty about the target.
Normalized Entropy:
To compare entropy across variables, it's common to normalize it (e.g., divide by the
maximum possible entropy given the range).
INFORMATION GAIN
Information Gain measures the reduction in entropy (or impurity) achieved by a split, and it is used
to decide which feature provides the most useful information at each node. It is calculated as:
Information Gain = Entropy(Parent) − ∑ (|Subset| / |Parent|) × Entropy(Subset)
where the subsets are the child nodes. The feature that provides the highest information gain is
chosen for the split.
For example –
E(S) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.97
E(O) = −(5/5) log2(5/5) − (0/5) log2(0/5) = 0   (treating 0·log 0 as 0)
E(R) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.97
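A sketch of the full information-gain computation; the parent and child label lists below are hypothetical and only illustrate the size-weighting, they are not the dataset behind E(S), E(O), and E(R) above:

```python
import math

def entropy(proportions, base=2):
    return -sum(p * math.log(p, base) for p in proportions if p > 0)

def node_entropy(labels):
    n = len(labels)
    return entropy([labels.count(c) / n for c in set(labels)])

def information_gain(parent_labels, child_label_groups):
    """Entropy(parent) minus the size-weighted entropy of the child nodes."""
    n_parent = len(parent_labels)
    weighted_children = sum(
        len(child) / n_parent * node_entropy(child) for child in child_label_groups
    )
    return node_entropy(parent_labels) - weighted_children

# Hypothetical split of 10 labels into two children of 5 each (illustrative only).
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes", "yes", "yes", "yes", "no"], ["yes", "no", "no", "no", "no"]]
print(information_gain(parent, children))  # ~0.278
```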
TYPES OF CONDITIONS
1. AXIS-ALIGNED AND OBLIQUE CONDITIONS
An axis-aligned condition involves only a single feature. An oblique condition involves multiple
features.
For example,
the following is an axis-aligned condition: num_legs ≥ 2
While the following is an oblique condition: num_legs ≥ num_fingers
Often, decision trees are trained with only axis-aligned conditions. However, oblique splits are more
powerful because they can express more complex patterns. Oblique splits sometimes produce better
results at the expense of higher training and inference costs.
YDF Code - In YDF, decision trees are trained with axis-aligned conditions by default. You can enable
oblique decision trees with the split_axis="SPARSE_OBLIQUE" parameter.
2. BINARY AND NON-BINARY CONDITIONS
Non-binary conditions have more than two possible outcomes and therefore have more discriminative
power than binary conditions. Decision trees containing one or more non-binary conditions are called
non-binary decision trees.
Conditions with too much power are also more likely to overfit. For this reason, decision forests
generally use binary decision trees, so this course will focus on them.
Note: A non-binary condition can be emulated with multiple binary conditions; therefore, binary trees
are not inherently less powerful than non-binary trees.
The most common type of condition is the threshold condition expressed as:
feature ≥ threshold
PRUNING
Pruning aims to simplify the decision tree by removing parts of it that do not provide significant
predictive power, thus improving its ability to generalize to new data.
Decision tree pruning removes unwanted nodes from an overfitted decision tree to make it smaller,
which results in faster, more accurate, and more effective predictions.
By pruning early (pre-pruning), we end up with a simpler tree that is less likely to overfit the
training data.
Post-pruning simplifies an already grown tree while preserving its accuracy. Pruning improves the
performance and interpretability of decision trees by reducing their complexity and avoiding
overfitting; proper pruning leads to simpler, more robust models that generalize better to unseen data.
Set Constraints
Max Depth: Limit the depth of the tree to avoid too many splits. Deeper trees are more
likely to overfit, while shallow trees may underfit.
Min Samples Split: Specify the minimum number of samples needed to split a node.
Min Samples Leaf: Set the minimum number of samples required in a leaf node. The
minimum samples per leaf parameter ensures that a node will only be split if
it contains at least a certain number of samples, reducing the risk of
overfitting.
Max Features: Limit the number of features considered at each split.
Cross-Validation - Cross-validation ensures that the Decision Tree model generalizes well by
training and evaluating the model on different subsets of the data.
Gather more data: additional training examples make it harder for the tree to memorize noise, reducing overfitting.
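A hedged scikit-learn sketch of the constraints listed above combined with cross-validation; the dataset and parameter values are illustrative, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Constrain the tree to limit overfitting; the exact values are illustrative.
tree = DecisionTreeClassifier(
    max_depth=4,           # cap the number of levels
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    max_features="sqrt",   # consider only a subset of features at each split
    random_state=0,
)

# 5-fold cross-validation estimates how well the constrained tree generalizes.
scores = cross_val_score(tree, X, y, cv=5)
print(scores.mean(), scores.std())
```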
Most decision tree training algorithms follow a greedy approach, using a divide-and-conquer strategy.
Here’s how it works:
1. **Start with the root node:** The algorithm begins by creating a single starting point for the tree.
2. **Grow the tree step by step:** The tree expands by adding nodes one at a time, working
recursively from the root.
3. **Evaluate all possible conditions:** At each node, the algorithm looks at every possible decision it
could make (conditions) and assigns a score to each.
4. **Choose the best condition:** The condition with the highest score is selected to split the data at
that node.
The score is calculated using a metric that aligns with the task (such as classification or regression).
The goal is to choose conditions that maximize this score, leading to a more effective decision tree.
Let's go through the steps of training a particular decision tree in more detail.
Step 1: Create a root node.
Step 2: Grow node #1. The condition "x1 ≥ 1" is found, and two child nodes are created.
Step 3: Grow node #2. No satisfying condition is found, so the node is turned into a leaf.
Step 4: Grow node #3. The condition "x2 ≥ 0.5" is found, and two child nodes are created.
Other methods exist to grow decision trees. A popular alternative is to optimize nodes globally
instead of using a divide and conquer strategy.
YDF Code - In YDF, decision trees are grown with the "greedy divide and conquer" algorithm
described above by default. Alternatively, you can enable global growth with
the growing_strategy="BEST_FIRST_GLOBAL" parameter.
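The YDF options mentioned above might be used roughly as follows. This is a hedged sketch that assumes the `ydf` Python package exposes `CartLearner` with the `split_axis` and `growing_strategy` hyperparameters; check the YDF documentation for the exact API.

```python
import numpy as np
import pandas as pd
import ydf  # assumption: the YDF Python package (pip install ydf)

# Illustrative training frame; "label" is an assumed target column name.
rng = np.random.default_rng(0)
train_df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
train_df["label"] = (train_df["x1"] + train_df["x2"] > 0).astype(str)

# Oblique splits instead of the default axis-aligned conditions.
oblique_model = ydf.CartLearner(label="label", split_axis="SPARSE_OBLIQUE").train(train_df)

# Global (best-first) growth instead of the default greedy divide-and-conquer growth.
global_model = ydf.CartLearner(label="label", growing_strategy="BEST_FIRST_GLOBAL").train(train_df)
```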
Depending on the number and type of input features, the number of possible conditions for a given
node can be huge, generally infinite. For example, given a threshold condition featurei ≥ t, the
combination of all the possible threshold values for t ∈ ℝ is infinite.
The routine responsible for finding the best condition is called the splitter. Because it needs to test a
lot of possible conditions, splitters are the bottleneck when training a decision tree.
The score maximized by the splitter depends on the task. For example:
Information gain and Gini (both covered later) are commonly used for classification.
Mean squared error is commonly used for regression.
There are many splitter algorithms, each with varying support for:
The type of features; for example, numerical, categorical, text
The task; for example, binary classification, multi-class classification, regression
The type of condition; for example, threshold condition, in-set condition, oblique condition
The regularization criteria; for example, exact or approximated splitters for threshold
conditions
In addition, there are equivalent splitter variants with different trade-offs regarding memory usage,
CPU usage, computation distribution, and so on.
YDF Code - In YDF, splitters are selected implicitly from the automatically detected (or manually
specified) type of the feature, the hyperparameters, and the estimated speed of each splitter (during
training, YDF runs a small model to predict the speed of each candidate splitter).
https://developers.google.com/machine-learning/decision-forests/binary-classification
MULTICOLLINEARITY
**Handling Multicollinearity in Decision Trees**
Decision trees inherently handle multicollinearity during their feature selection process, making them
less directly impacted by it compared to linear models. Here's how they manage it:
**Splitting Criteria:**
The splitting criteria ensure that the tree chooses the feature that optimally separates the data at
each node. If two features are correlated, they often have similar splitting scores. In such cases, the
tree may randomly pick one, effectively reducing the impact of multicollinearity.
DETECTION:
Detecting multicollinearity is an important step in ensuring the reliability of a regression model.
A common method is the correlation matrix:
Calculate the correlation coefficient between each pair of predictor variables.
Values close to 1 or -1 indicate a high degree of correlation.
Identify pairs of variables with high correlation coefficients (e.g., greater than 0.7 or less
than -0.7).
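A quick pandas sketch of the correlation-matrix check; the DataFrame and its columns are made-up placeholders for your predictors:

```python
import numpy as np
import pandas as pd

# Illustrative predictors; x2 is deliberately almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.95 * x1 + rng.normal(scale=0.1, size=100),  # highly correlated with x1
    "x3": rng.normal(size=100),
})

corr = df.corr()
print(corr)

# Flag pairs whose absolute correlation exceeds 0.7.
flagged = [(a, b) for a in corr.columns for b in corr.columns
           if a < b and abs(corr.loc[a, b]) > 0.7]
print(flagged)  # expected: [('x1', 'x2')]
```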
Hyperparameters
These are parameters that you set manually before training begins. They influence how the learning
process takes place but are not updated by the model itself. Examples include:
- Learning Rate: Controls how quickly the model updates learnable parameters.
- Regularization Strength: Helps prevent overfitting by penalizing complex models.
- Optimization Algorithm: Determines how the model updates the learnable parameters (e.g., Adam,
SGD).
Hyperparameters play a key role in shaping the model's performance and behavior. Choosing the
right hyperparameters can significantly impact the model's ability to learn effectively.
WHY: Tuning hyperparameters is essential for decision trees for the following reasons:
**Reduced Overfitting:** Decision trees are prone to overfitting, where they memorize the noise in
the training data instead of learning generalizable patterns. Hyperparameter tuning helps mitigate
overfitting by controlling the tree’s complexity (e.g., using `max_depth`) and preventing excessive
detail (e.g., through `min_samples_split`).
**Enhanced Generalization:** The goal is for the decision tree to perform well on unseen data. Tuning
hyperparameters helps balance model complexity and flexibility, enabling the tree to capture key
trends without overfitting to the training data. This leads to improved performance on new, unseen
data.
**Addressing Class Imbalance:** In cases of class imbalance, where one class has significantly fewer
samples than the others, tuning hyperparameters like `min_weight_fraction_leaf` allows the model to
adjust for sample weights. This helps prevent the model from being biased towards the majority
class, improving predictions for the minority class.
**Tailoring the Model to Specific Tasks:** Different tasks may require specific decision tree
behaviors. Hyperparameter tuning allows you to adjust the tree’s structure and learning process to
suit the needs of your task. For example, you can prioritize capturing complex relationships by
adjusting `max_depth` for a more complex classification problem.
Hyperparameter tuning helps find the best values for parameters like maximum depth, minimum
samples per leaf, and split criterion, improving model performance and preventing overfitting.
max_depth: Maximum depth of the tree. It can be a positive integer or 'None', which means
there is no depth limit.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
max_features: The number of features to consider when looking for the best split.
criterion: The function to measure the quality of a split (Gini impurity or entropy).
splitter: Strategy used to split at each node (best or random).
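As a sketch, these scikit-learn hyperparameters can be tuned with an exhaustive grid search and cross-validation; the dataset and candidate values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate values are illustrative, not recommendations.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```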
MAX_DEPTH
The **`max_depth`** hyperparameter in a decision tree specifies the maximum depth (number of
levels) that the tree can grow. It directly influences the complexity, interpretability, and performance
of the tree. Here's how it affects the decision tree:
- **High `max_depth`:** A higher value allows the tree to grow deeper, so more splits can be made and
more complex patterns captured. This reduces bias, but it increases variance and the risk of
**overfitting**: the tree becomes sensitive to small changes in the training data, may memorize its
noise and specifics, and then performs well on the training set but poorly on unseen data.
- **Interpretability:** A deeper tree is harder to interpret due to its complexity. While it may
provide more accurate predictions, it can become very difficult to visualize or explain the
decision-making process.
- **Computational cost:** A deeper tree requires more resources and takes more time to train and
make predictions, especially if the data has many features and complex relationships.
### Summary
- **Low `max_depth`:** The tree is simpler, faster, and less likely to overfit, but may suffer from
underfitting.
- **High `max_depth`:** The tree can model more complex relationships, but is more prone to
overfitting and may be computationally expensive and harder to interpret.
By carefully tuning the `max_depth` parameter, you can achieve a good balance between underfitting
and overfitting, leading to a more effective and efficient decision tree model.
MIN_SAMPLES_SPLIT
The **`min_samples_split`** hyperparameter in a decision tree specifies the minimum number of
samples an internal node must contain before it can be split. This hyperparameter affects the tree's
structure, complexity, and generalization.
### Summary
- **Low `min_samples_split`:** More splits, higher complexity, risk of overfitting.
- **High `min_samples_split`:** Fewer splits, simpler tree, improved generalization, but potentially
underfitting.
MIN_SAMPLES_LEAF
The **`min_samples_leaf`** hyperparameter in a decision tree controls the minimum number of
samples required to be at a leaf node (the terminal node of the tree). This parameter plays a key role
in determining the structure and complexity of the tree, and it affects the model's ability to
generalize. Here's how it impacts the decision tree:
Increasing `min_samples_leaf` forces the tree to have more samples at each leaf, leading to a simpler
tree with fewer splits, helping prevent overfitting by smoothing the model’s predictions.
### Summary
- **Low `min_samples_leaf`:** More splits, higher complexity, risk of overfitting.
- **High `min_samples_leaf`:** Fewer splits, simpler tree, improved generalization, but potentially
underfitting.
Choosing the right value for `min_samples_leaf` is crucial for finding the right balance between
underfitting and overfitting, leading to better generalization and model performance.
MAX_FEATURES
The **`max_features`** hyperparameter in a decision tree specifies the maximum number of features
to consider when looking for the best split at each node. This parameter directly influences the tree’s
structure, performance, and generalization. Here's how it affects a decision tree:
- **High `max_features`:** If `max_features` is set to a high number (such as the total number of
features or close to it), the decision tree will have access to many features when splitting at each
node. This allows the tree to make splits based on more information, which can improve the model's
performance but also increases the risk of **overfitting**. The tree might learn noise or very specific
patterns from the data that don't generalize well to new, unseen data.
- **High `max_features`:** The tree will have lower **bias** because it can use a wider range of
features for splitting. However, it will have higher **variance**, meaning the model could be overly
sensitive to training data and may overfit, resulting in poor generalization.
- **High `max_features`:** When more features are considered, the model will take more time to
evaluate splits and train, especially as the number of features grows. This increases the
computational cost.
### Interpretability
- **Low `max_features`:** Trees built with a smaller set of features per split can be easier to interpret
because they rely on fewer features, which might make the tree simpler to analyze and understand.
- **High `max_features`:** Larger numbers of features considered at each split can make the tree
more complex and harder to interpret, especially when the tree uses many features to make
decisions.
### Summary
- **Low `max_features`:** The tree is more likely to generalize better and prevent overfitting, but it
might not capture all the important patterns, leading to underfitting.
- **High `max_features`:** The tree can model more complex patterns, but it risks overfitting, and the
model may become computationally expensive and harder to interpret.
Tuning the `max_features` parameter helps to strike the right balance between model complexity,
generalization, and computational efficiency.
CRITERION
The criterion hyperparameter in a decision tree determines the function used to measure the quality
of a split at each node. It plays a crucial role in how the tree makes decisions about which features to
use and what thresholds to set when splitting the data. The choice of criterion can affect the
accuracy, complexity, and interpretability of the tree.
The two most common values for the criterion parameter are:
1. gini (Gini impurity)
2. entropy (Information gain)
4. Impact on Overfitting
Both Gini impurity and entropy aim to improve the purity of the splits, but the tree's depth (and thus
the risk of overfitting) will depend on other factors like max_depth, min_samples_split, and
min_samples_leaf. However:
Gini tends to produce slightly less complex trees because it is quicker to compute and less
sensitive to small variations in the data.
Entropy might result in deeper trees, as it focuses more on minimizing uncertainty and might
keep splitting the data to achieve maximum homogeneity in each node.
5. Interpretability
Gini impurity: Trees built with the Gini criterion tend to be slightly easier to interpret
because they often produce more balanced and simpler trees with fewer levels.
Entropy: Trees built with the entropy criterion may be more complex, potentially making
them harder to interpret due to deeper branches and more splits.
6. Performance Considerations
Gini impurity is generally faster to compute than entropy, which can be beneficial for large
datasets or when computational efficiency is crucial.
Entropy may take longer to compute, but the trees it generates can sometimes achieve
slightly better predictive accuracy, particularly in complex datasets.
Summary of Effects:
gini (Gini Impurity):
o Faster to compute
o Tends to result in simpler and more balanced trees
o Often preferred for practical use in decision trees
entropy (Information Gain):
o Slower to compute due to the logarithmic calculations
o Tends to create deeper trees with potentially better accuracy
o Focuses more on maximizing information gain at each split
Conclusion
The criterion parameter affects how the decision tree evaluates and splits the data. Both Gini
impurity and entropy are designed to optimize the purity of the splits, but they differ in computational
efficiency and how they balance tree complexity. Gini is generally faster and simpler, while entropy
may result in better accuracy but is computationally more intensive. The choice depends on the
trade-off between model performance, speed, and interpretability.
SPLITTER
The **`splitter`** hyperparameter in a decision tree controls the strategy used to split the nodes at
each level of the tree. It determines how the best feature to split on is selected during the tree-
building process. The two common options for the `splitter` parameter are:
1. **`best`**
2. **`random`**
### Conclusion
The choice of **`splitter`** depends on the trade-off between accuracy, speed, and generalization:
- Use **`best`** if accuracy is your primary goal and you want to ensure the tree captures as much
useful information as possible.
- Use **`random`** if you need a faster model or want to ensure the tree doesn't overfit, especially
when working with large datasets or when computational resources are limited.
Grid Search
Grid search exhaustively evaluates every combination of hyperparameter values in a predefined grid,
typically scoring each combination with cross-validation. It is simple and thorough, but its cost
grows rapidly with the size of the grid.
Randomized Search
In contrast, randomized search performs a more flexible search over hyperparameters by sampling
from predefined distributions. This approach continues until a predefined limit or the desired
performance is reached. Randomized search is generally more efficient than grid search, especially
when hyperparameters are not uniformly distributed, as it allows independent allocation of the
search limit for each parameter.
Additionally, the nature of randomized search makes it easy to parallelize, which helps in saving time
and resources. While grid search might explore all possibilities exhaustively, randomized search often
achieves better results in less time, particularly when the search space is large.
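A hedged sketch of randomized search with scikit-learn's RandomizedSearchCV; the distributions and trial budget are illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Sample hyperparameters from distributions instead of an exhaustive grid.
param_distributions = {
    "max_depth": randint(2, 20),
    "min_samples_split": randint(2, 50),
    "min_samples_leaf": randint(1, 20),
    "max_features": [None, "sqrt", "log2"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=25,      # the predefined limit on sampled settings
    cv=5,
    n_jobs=-1,      # trials are independent, so they parallelize easily
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```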
Bayesian Optimization
Bayesian optimization takes a more sophisticated approach by using probabilistic models to identify
the best set of hyperparameters in a more computationally efficient way. Unlike grid and random
search, it’s a sequential process designed to find the global optimum with fewer trials, thus reducing
the number of computations needed.
While this method can be highly efficient in terms of computational cost, it’s also more complex to
implement. It's particularly useful in scenarios where you have a large search space and need to
balance performance and computational efficiency.
Conclusion
For most cases, Grid Search and Randomized Search are excellent starting points due to their
simplicity and ease of implementation. Grid Search is ideal when you have a small to moderate
search space and can afford the computational cost, while Randomized Search is better when
dealing with a larger search space and when efficiency is a concern. If computational cost becomes a
bottleneck and you need fewer trials for optimization, Bayesian Optimization is a strong choice, but
it requires more expertise to implement effectively.
EVALUATION
Precision, Recall, F1-Score, Accuracy, AUC-ROC, MSE
Bias refers to the error introduced by simplifying assumptions (e.g., underfitting), while variance
refers to the error introduced by the model’s sensitivity to fluctuations in the training data (e.g.,
overfitting). Decision Trees typically have high variance but low bias.
BAGGING
Bagging (Bootstrap Aggregating) involves training multiple Decision Trees on different random
subsets of the data and averaging their predictions to reduce variance.
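A minimal bagging sketch with scikit-learn (assuming scikit-learn 1.2+, where the base model parameter is named `estimator`); the dataset is illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree is trained on a different bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=0),
    n_estimators=100,
    random_state=0,
)
print(cross_val_score(bagging, X, y, cv=5).mean())
```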
BOOSTING
A sequential ensemble technique (e.g., Gradient Boosting) in which multiple weak learners (usually
shallow Decision Trees) are trained one after another, with each subsequent tree focusing on
correcting the errors of the previous ones, for example by giving more weight to misclassified
examples. Combining these weak learners yields a strong model and improves accuracy, but it may lead
to overfitting if not controlled.
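A minimal boosting sketch using scikit-learn's GradientBoostingClassifier; the dataset and settings are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Shallow trees are fit sequentially; each one corrects the errors of the ensemble so far.
# learning_rate and n_estimators control how aggressively the model fits (and overfits).
boosting = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
print(cross_val_score(boosting, X, y, cv=5).mean())
```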
FEATURE IMPORTANCE
The importance of a feature is accumulated by summing up its contribution across all splits in the
tree.
1. Split Impurity Reduction
Each split in a Decision Tree is associated with a measure of impurity, such as:
Gini Impurity
Entropy (Information Gain)
Mean Squared Error (for regression trees)
When a feature is used to split a node, it reduces the impurity of the data at that node. The
reduction in impurity for that split is calculated as:
Impurity Reduction = Impurity of Parent Node − (Weighted Impurity of Child Nodes)
2. Aggregate Across the Tree
For each feature:
Calculate the total impurity reduction contributed by that feature across all the splits
where it is used.
The importance of a feature is proportional to this total reduction.
3. Normalize the Importance scores
Once the total impurity reductions are computed for all features:
Normalize them so that they sum to 1. This allows easy comparison of feature
importances.
Normalized Importance = (Total Impurity Reduction for a Feature) / (∑ of Impurity Reductions for All Features)
Key Takeaway
Features that result in larger reductions in impurity are deemed more important because they
contribute more to making accurate splits in the tree. This calculation is an inherent part of building
the Decision Tree, and libraries like Scikit-learn compute these importance scores automatically.
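As a sketch, scikit-learn exposes these normalized impurity-based importances through `feature_importances_`; the dataset is illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(data.data, data.target)

# Impurity-based importances, normalized so that they sum to 1.
ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")

print(tree.feature_importances_.sum())  # ~1.0
```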