
DECISION TREES


WHAT IS A DECISION TREE?


A Decision Tree is a supervised learning algorithm used for classification and regression tasks. It
recursively splits the data into subsets based on the most significant feature, using criteria like Gini
impurity or entropy.

Inference of a decision tree model is computed by routing an example from the root (at the top) to
one of the leaf nodes (at the bottom) according to the conditions. The value of the reached leaf is the
decision tree's prediction. The set of visited nodes is called the inference path.

Mathematically speaking, decision trees use hyperplanes that run parallel to one of the feature axes to divide the feature space into regions, effectively forming hyper-cuboids that separate data points based on feature thresholds.

ASSUMPTIONS
1. **Feature Independence**: Decision trees split on one feature at a time, so each feature is assumed to contribute to the decision-making process independently of the others.

2. **No Assumption on Data Distribution**: Unlike models like Naive Bayes, decision trees do not
assume any specific distribution for the features (e.g., Gaussian distribution).

3. **Non-linear Relationships**: Decision trees can handle non-linear relationships between features
and the target variable, as they perform piecewise constant approximations.

4. **Categorical or Continuous Features**: Decision trees can handle both categorical and continuous
variables.

5. **Monotonicity**: In some cases, decision trees assume that the relationship between features and
target may be monotonic, but this assumption isn't strictly enforced.

STRUCTURE
 Root Node: The top node, where the data is first split on the feature that best separates it.
 Internal Nodes: Represent decisions or tests on features that further split the data. Each internal node has one or more branches.
 Branches: The links between nodes, representing the decision paths based on feature values.
 Leaf Nodes: The final nodes of the tree, where no further splits occur. Each leaf assigns the predicted value or class label for the inputs that reach it.

HOW DOES IT WORK?
A Decision Tree works by splitting the dataset at each node based on the feature that best separates
the data. This process continues recursively until a stopping criterion is met (e.g., a maximum depth
or minimum samples per leaf).

In a Decision Tree, each internal node represents a feature, and the branches coming out of that node
represent the possible values that the feature can take.

- **For categorical features**: The number of branches equals the number of unique categories for
that feature.
- **For numerical features**: There are typically two branches—one for values less than a certain
number and another for values greater than or equal to it.

The leaf nodes at the end of the tree indicate the final class label or prediction.
The feature used at each node is chosen based on its **Information Gain**, which measures how well
it separates the data. Features with the highest Information Gain are placed closer to the root, as
they are the most important for making decisions.
This ensures the tree is structured to split the data efficiently at each step.
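As a concrete illustration, here is a minimal sketch (assuming scikit-learn and the Iris dataset, neither of which is specified in these notes) of fitting a Decision Tree and inspecting the learned splits:

```python
# Minimal sketch: fit a decision tree and route test examples to leaves.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each internal node tests a single feature against a threshold;
# criterion="gini" is the scikit-learn default.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))         # the learned splits, node by node
print(tree.predict(X_test[:5]))  # inference: each example follows one root-to-leaf path
```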

ADVANTAGES OF DECISION TREES


Decision Trees are easy to understand and interpret, can handle both numerical and categorical data, and require minimal data preparation. The cost of using a tree for inference is logarithmic in the number of data points used to train it, and trees are non-parametric (they do not require prior assumptions about the data).

DISADVANTAGES OF DECISION TREES


They are prone to overfitting, especially when the tree is deep. They can also be unstable, as small
variations in the data can result in different splits.

PURPOSE OF SPLITTING
Splitting is done to partition the data into smaller, more homogeneous subsets, making the data
within each subset as similar as possible.
It uses criteria like Gini impurity or entropy to evaluate the splits. The goal is to minimize impurity
after a split, making the resulting child nodes as pure as possible.
Criteria: Gini Impurity, Entropy (Information Gain), and Mean Squared Error (for regression) are
common criteria used to evaluate the best split at each node.

GINI IMPURITY
Gini impurity is a measure used in decision tree algorithms to quantify a dataset's impurity level or
disorder.

For a binary classification, it is calculated as:
Gini = 1 - (p1² + p2²)
where p1 and p2 are the proportions of each class in the node. A lower Gini value indicates a better split.

For example, for a node with class proportions 2/5 and 3/5:
G1 = 1 - (4/25 + 9/25) = 0.48
and for a node with class proportions 1/5 and 4/5:
G2 = 1 - (1/25 + 16/25) = 0.32
G2 is lower than G1, so the second node is the purer (better) one.

The Gini curve has the same shape as the entropy curve; the main difference is that, for binary classification, Gini ranges from 0 to 0.5 while entropy ranges from 0 to 1.
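A quick sketch of the Gini calculation above (the class proportions 2/5, 3/5 and 1/5, 4/5 are inferred from the fractions used in the example):

```python
# Gini impurity for the two candidate nodes in the example above.
def gini(proportions):
    return 1 - sum(p ** 2 for p in proportions)

g1 = gini([2 / 5, 3 / 5])   # 1 - (4/25 + 9/25) = 0.48
g2 = gini([1 / 5, 4 / 5])   # 1 - (1/25 + 16/25) = 0.32
print(g1, g2)               # g2 is lower, so the second node is purer
```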

ENTROPY
Entropy measures the disorder or impurity in a dataset. It is used to calculate Information Gain when splitting nodes. A lower entropy indicates a more homogeneous dataset (more knowledge, less entropy).
It is calculated as:
Entropy = - ∑ pi log2(pi)
where pi is the probability of class i in the node. A lower entropy indicates a purer node. Decision Trees use entropy to guide the selection of the best split.

For example, suppose our data has only two class labels, 'Yes' and 'No', with pyes = 1/5 and pno = 4/5:
E(D) = -pyes log2(pyes) - pno log2(pno)
E(D) = -(1/5) log2(1/5) - (4/5) log2(4/5) ≈ 0.72
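The same calculation as a small sketch (plain Python, no library assumptions):

```python
# Entropy of a node, using the 1 "Yes" / 4 "No" example above.
import math

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1 / 5, 4 / 5]))   # ≈ 0.72
print(entropy([0.5, 0.5]))       # 1.0 -- maximum entropy for a 2-class node
print(entropy([1.0]))            # 0.0 -- a pure node
```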

Observation:
 More uncertainty more entropy
 For a 2-class problem the min entropy is 0 and the max entropy is 1.
 For more than 2 classes the min entropy is 0 but the max entropy can be greater than 1
 Both log2 and loge can be used to calculate entropy
Entropy for continuous variables

Relationship Between Entropy and Feature Distribution


The entropy of a continuous variable is directly related to the spread or variability of its distribution:
1. Higher Variability → Higher Entropy:
o Distributions with a larger spread (e.g., a uniform distribution) tend to have higher
entropy because the uncertainty is greater.
o Example: A variable uniformly distributed between [0, 10] has higher entropy than
one concentrated around a mean (e.g., Gaussian with a small variance).
2. Lower Variability → Lower Entropy:
o Narrow or concentrated distributions have lower entropy because they exhibit less
uncertainty.
o Example: A delta function, where all values are concentrated at a single point, has
zero entropy.
3. Shape of the Distribution:
o Symmetric distributions like the normal distribution typically have higher entropy
compared to skewed distributions with the same range.
o Multimodal distributions can also affect entropy since they spread uncertainty across
multiple peaks.

Practical Considerations
 Entropy and Feature Selection:
Features with higher entropy may carry more information, but this depends on the problem
and the target variable. In decision trees, for example, entropy is used to measure
information gain, helping to choose features that best reduce uncertainty about the target.
 Normalized Entropy:
To compare entropy across variables, it's common to normalize it (e.g., divide by the
maximum possible entropy given the range).

INFORMATION GAIN
Information Gain measures the reduction in entropy (or impurity) achieved by a split. It helps decide which feature provides the most useful information at each node. It is calculated as:
Information Gain = Entropy(Parent) - ∑ (|Child| / |Parent|) × Entropy(Child)
where the sum runs over the child nodes produced by the split. The feature that provides the highest information gain is chosen for the split.

For example, consider splitting a 14-row dataset on the Outlook column into the child nodes Sunny (S), Overcast (O), and Rain (R):
E(S) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97
E(O) = -4/4 log2(4/4) - 0/4 log2(0/4) = 0
E(R) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

Weighted entropy of children


Weighted entropy = 5/14 * 0.97 + 4/14 * 0 + 5/14 * 0.97 (here 5/14, 4/14, and 5/14 are the fractions of the parent dataset in each child node)
W.E. (Children) = 0.69
The Overcast node (O) becomes a leaf node, as its entropy is 0.

Information gain = E(Parent) - Weighted Entropy(Children) = 0.97 - 0.69 = 0.28


So the information gain (or decrease in entropy/impurity) when you split this data on the basis of
Outlook condition/column is 0.28.
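A sketch of the same calculation (the child class counts Sunny 2/3, Overcast 4/0, Rain 3/2 are assumed from the classic 14-row Play Tennis example; the parent entropy is taken as 0.97, as given above):

```python
# Information gain of the Outlook split, reproducing the numbers above.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

children = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
n_total = sum(sum(c) for c in children.values())                          # 14 rows
weighted = sum(sum(c) / n_total * entropy(c) for c in children.values())
print(round(weighted, 2))                                                 # 0.69

parent_entropy = 0.97                                                     # as stated in the notes
print(round(parent_entropy - weighted, 2))                                # information gain ≈ 0.28
```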

What is the difference between Gini Impurity and Entropy?


Gini Impurity is faster to compute and typically results in more balanced splits, while
Entropy tends to prioritize the most informative features. Both are used to evaluate splits,
but Gini Impurity is more commonly used in practice.
Both Gini and Entropy aim to reduce impurity during splits, but Gini is computationally
simpler as it does not involve logarithmic calculations. Entropy, on the other hand, provides a
finer distinction when splits occur in the dataset. Both lead to similar results but may behave
differently in edge cases.
Entropy is slower to compute than Gini because of the logarithm; in practice, both criteria usually produce very similar trees.

How do you calculate the Gini index for a split?


The Gini index for a split is the weighted sum of the Gini indices of the child nodes. For each
child node, the Gini index is calculated based on the proportions of classes in that node, and
the final Gini index is the sum of these weighted values.
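For illustration, a small sketch with made-up class counts:

```python
# Gini index of a candidate split = weighted sum of child-node Gini impurities.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

left, right = [8, 2], [1, 9]        # hypothetical class counts in each child node
n = sum(left) + sum(right)
split_gini = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(round(split_gini, 2))          # 0.25 -- lower is better when comparing splits
```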

TYPES OF CONDITIONS
1. AXIS-ALIGNED AND OBLIQUE CONDITIONS
An axis-aligned condition involves only a single feature. An oblique condition involves multiple
features.
For example,
the following is an axis-aligned condition: num_legs ≥ 2
While the following is an oblique condition: num_legs ≥ num_fingers
Often, decision trees are trained with only axis-aligned conditions. However, oblique splits are more
powerful because they can express more complex patterns. Oblique splits sometimes produce better
results at the expense of higher training and inference costs.

YDF Code - In YDF, decision trees are trained with axis-aligned condition by default. You can enable
decision oblique trees with the split_axis="SPARSE_OBLIQUE" parameter.

2. BINARY AND NON-BINARY CONDITIONS


Conditions with two possible outcomes (for example, true or false) are called binary conditions.
Decision trees containing only binary conditions are called binary decision trees.

Non-binary conditions have more than two possible outcomes. Therefore, non-binary conditions have
more discriminative power than binary conditions. Decision trees containing one or more non-binary
conditions are called non-binary decision trees.
Conditions with too much power are also more likely to overfit. For this reason, decision forests
generally use binary decision trees, so this course will focus on them.

Note: A non-binary condition can be emulated with multiple binary conditions; therefore, binary trees
are not inherently less powerful than non-binary trees.

The most common type of condition is the threshold condition expressed as:
feature ≥ threshold

PRUNING
Pruning aims to simplify the decision tree by removing parts of it that do not provide significant
predictive power, thus improving its ability to generalize to new data.

Decision Tree pruning removes unwanted nodes from an overfitted decision tree to make it smaller, which results in faster, more accurate, and more effective predictions.

Types Of Decision Tree Pruning - Pre-Pruning and Post-Pruning.

Pre-Pruning (Early Stopping)


Sometimes the growth of the decision tree is stopped before it gets too complex; this is called pre-pruning. It is important for preventing overfitting of the training data, which would result in poor performance when the model is exposed to new data.

Some common pre-pruning techniques include:


 Maximum Depth: It limits the maximum level of depth in a decision tree.
 Minimum Samples per Leaf: Set a minimum threshold for the number of samples in each leaf
node.
 Minimum Samples per Split: Specify the minimal number of samples needed to break up a node.
 Maximum Features: Restrict the quantity of features considered for splitting.

By pruning early, we end up with a simpler tree that is less likely to overfit the training data.

Post-Pruning (Reducing Nodes)


After the tree is fully grown, post-pruning involves removing branches or nodes to improve the
model's ability to generalize. Some common post-pruning techniques include:
 Cost-Complexity Pruning (CCP): This method assigns a cost to each subtree based on its accuracy and complexity, then selects the subtree with the lowest cost.
 Reduced Error Pruning: Removes branches that do not significantly affect the overall accuracy.
 Minimum Impurity Decrease: Prunes nodes if the decrease in impurity (Gini impurity or entropy) is below a certain threshold.
 Minimum Leaf Size: Removes leaf nodes with fewer samples than a specified threshold.

Post-pruning simplifies the tree while preserving its accuracy. Decision tree pruning helps to improve
the performance and interpretability of decision trees by reducing their complexity and avoiding
overfitting. Proper pruning can lead to simpler and more robust models that generalize better to
unseen data.
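A sketch of post-pruning via cost-complexity pruning, assuming scikit-learn (which exposes it through cost_complexity_pruning_path and the ccp_alpha parameter); the dataset and the use of the test split to pick alpha are illustrative only, since a validation set or cross-validation would normally be used:

```python
# Post-pruning sketch: grow a full tree, then pick a pruning strength (alpha).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas: larger alpha => more aggressive pruning of the grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```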

PREVENT UNDER OR OVER-FITTING


Prune the tree – Pruning is the process of removing branches from a fully grown tree to
improve generalization and reduce overfitting.
Pre-pruning (Early stopping) - Stop the tree from growing when certain conditions are met,
such as:
 Maximum depth of the tree.
 Minimum number of samples required to split a node.
 Minimum number of samples in a leaf node.
Post-pruning - Grow the full tree and then remove branches that contribute little to
overall accuracy.

Set Constraints
 Max Depth: Limit the depth of the tree to avoid too many splits. Deeper trees are more
likely to overfit, while shallow trees may underfit.
 Min Samples Split: Specify the minimum number of samples needed to split a node.
 Min Samples Leaf: Set the minimum number of samples required in a leaf node. This parameter keeps a split only if each resulting leaf contains at least the specified number of samples, reducing the risk of overfitting.
 Max Features: Limit the number of features considered at each split.

Use Ensemble Methods


 Random Forests: Build multiple Decision Trees on random subsets of data and average
their results.
 Gradient Boosting: Combine weak learners iteratively to improve overall performance
while reducing overfitting.

Cross-Validation - Cross-validation ensures that the Decision Tree model generalizes well by
training and evaluating the model on different subsets of the data.
Gather more data
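A sketch combining these ideas, assuming scikit-learn (the constraint values are arbitrary examples, not recommendations):

```python
# Pre-pruning constraints plus cross-validation to check generalization.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

constrained = DecisionTreeClassifier(
    max_depth=4,            # cap the number of levels
    min_samples_split=10,   # need at least 10 samples to split a node
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    max_features="sqrt",    # consider only a subset of features per split
    random_state=0,
)

scores = cross_val_score(constrained, X, y, cv=5)
print(scores.mean(), scores.std())
```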

GROWING DECISION TREES


Decision trees, like all supervised machine learning models, are designed to explain patterns in
training data. However, finding the best possible decision tree is extremely complex (an NP-hard
problem). As a result, decision trees are typically trained using simpler methods called heuristics.
These methods don’t guarantee the perfect tree but aim to create one that’s good enough.

Most decision tree training algorithms follow a greedy approach, using a divide-and-conquer strategy.
Here’s how it works:

1. **Start with the root node:** The algorithm begins by creating a single starting point for the tree.
2. **Grow the tree step by step:** The tree expands by adding nodes one at a time, working
recursively from the root.
3. **Evaluate all possible conditions:** At each node, the algorithm looks at every possible decision it
could make (conditions) and assigns a score to each.
4. **Choose the best condition:** The condition with the highest score is selected to split the data at
that node.
The score is calculated using a metric that aligns with the task (such as classification or regression).
The goal is to choose conditions that maximize this score, leading to a more effective decision tree.

Let's go through the steps of training a particular decision tree in more detail.
Step 1: Create a root node.

Step 2: Grow node #1. The condition "x1 ≥ 1" is found, and two child nodes are created.

Step 3: Grow node #2. No satisfying condition is found, so the node becomes a leaf.

Step 4: Grow node #3. The condition "x2 ≥ 0.5" is found, and two child nodes are created.

Other methods exist to grow decision trees. A popular alternative is to optimize nodes globally
instead of using a divide and conquer strategy.

YDF Code - In YDF, decision trees are grown with the "greedy divide and conquer" algorithm
described above by default. Alternatively, you can enable global growth with
the growing_strategy="BEST_FIRST_GLOBAL" parameter.

Depending on the number and type of input features, the number of possible conditions for a given
node can be huge, generally infinite. For example, given a threshold condition featureᵢ ≥ t, the
combination of all possible threshold values for t ∈ ℝ is infinite.
The routine responsible for finding the best condition is called the splitter. Because it needs to test a
lot of possible conditions, splitters are the bottleneck when training a decision tree.

The score maximized by the splitter depends on the task. For example:

 Information gain and Gini (both covered later) are commonly used for classification.
 Mean squared error is commonly used for regression.

There are many splitter algorithms, each with varying support for:
 The type of features; for example, numerical, categorical, text
 The task; for example, binary classification, multi-class classification, regression
 The type of condition; for example, threshold condition, in-set condition, oblique condition
 The regularization criteria; for example, exact or approximated splitters for threshold
conditions

In addition, there are equivalent splitter variants with different trade-offs regarding memory usage,
CPU usage, computation distribution, and so on.

YDF Code - In YDF, splitters are selected implicitly from the automatically detected (or manually
specified) type of the feature, the hyperparameters, and the estimated speed of each splitter (during
training, YDF runs a small model to predict the speed of each candidate splitter).
https://developers.google.com/machine-learning/decision-forests/binary-classification

MULTICOLLINEARITY
**Handling Multicollinearity in Decision Trees**

Decision trees inherently handle multicollinearity during their feature selection process, making them
less directly impacted by it compared to linear models. Here's how they manage it:

1. **Feature Importance and Redundancy:**


Decision trees evaluate feature importance based on criteria like information gain or Gini impurity.
When two features are highly correlated (i.e., multicollinear), they provide redundant information for
splitting. The tree typically selects one of these features to split the data and ignores the other, as
using both would not add any significant value in reducing impurity.

2. **Splitting Criteria:**
The splitting criteria ensure that the tree chooses the feature that optimally separates the data at
each node. If two features are correlated, they often have similar splitting scores. In such cases, the
tree may randomly pick one, effectively reducing the impact of multicollinearity.

3. **Tree Structure and Feature Filtering:**


As the tree grows, it naturally filters out redundant features. Once a feature is used to split the data
and reduce impurity effectively, its correlated counterparts are less likely to be selected in
subsequent splits, as they contribute little additional value.

4. **Limitations and Sensitivity:**


While decision trees manage multicollinearity reasonably well, they are sensitive to small changes
in the dataset, which could impact performance. This sensitivity might occasionally amplify the
effects of multicollinearity, particularly in shallow trees or datasets with limited samples.

5. **Using Ensemble Methods for Stability:**


Ensemble methods like **Random Forests** further mitigate the effects of multicollinearity. By
building multiple trees on different subsets of data and averaging their predictions, random forests
reduce the sensitivity of individual trees to redundant or correlated features, improving robustness
and predictive performance.

**Summary for the Interviewer:**


Decision trees address multicollinearity by focusing on feature importance and selecting splits that
maximize information gain or impurity reduction. However, for greater stability and robustness,
especially in the presence of strong multicollinearity, ensemble methods like random forests are often
preferred. This approach leverages the strengths of decision trees while mitigating their sensitivity to
correlated features.


DETECTION:
Detecting multicollinearity is an important step in ensuring the reliability of your regression model.
Here are two common methods for detecting multicollinearity:

1. Correlation Matrix:
 Calculate the correlation coefficient between each pair of predictor variables.
 Values close to 1 or -1 indicate a high degree of correlation.
 Identify pairs of variables with high correlation coefficients (e.g., greater than 0.7 or less
than -0.7).

2. Variance Inflation Factor (VIF):


 VIF measures how much the variance of an estimated regression coefficient is increased
due to multicollinearity.
 Calculate the VIF for each predictor variable.
 VIF values greater than 5 or 10 are often used as thresholds to indicate multicollinearity.
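A sketch of both detection methods, assuming pandas and statsmodels (the small DataFrame below is hypothetical):

```python
# Detect multicollinearity with a correlation matrix and VIF.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],   # nearly collinear with x1
    "x3": [5, 3, 6, 2, 7, 1],
})

print(df.corr())                  # 1. correlation matrix: look for |r| > 0.7

# 2. VIF per predictor (constant added, as is usual for VIF); > 5-10 flags a problem
X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif)
```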
PARAMETERS
Learnable Parameters
These are the parameters that the model automatically learns and updates during the training
process. They are responsible for capturing patterns and relationships in the training data. Examples
include the weights and biases in a neural network. The model adjusts these parameters on its own,
without needing manual intervention, to optimize its predictions.

Hyperparameters
These are parameters that you set manually before training begins. They influence how the learning
process takes place but are not updated by the model itself. Examples include:
- Learning Rate: Controls how quickly the model updates learnable parameters.
- Regularization Strength: Helps prevent overfitting by penalizing complex models.
- Optimization Algorithm: Determines how the model updates the learnable parameters (e.g., Adam,
SGD).

Hyperparameters play a key role in shaping the model's performance and behavior. Choosing the
right hyperparameters can significantly impact the model's ability to learn effectively.

HYPERPARAMETER TUNING OF DECISION TREE


When training machine learning models, different datasets and models require different sets of
hyperparameters. To find the best combination of hyperparameters for a given model, we conduct
multiple experiments. This process of selecting the optimal hyperparameters is known as
hyperparameter tuning.

WHY: Tuning hyperparameters is essential for decision trees for the following reasons:

**Improved Performance:** Untuned hyperparameters can result in suboptimal decision trees. By


tuning the hyperparameters, you can find the settings that best match your data, leading to a model
that captures patterns more effectively and makes better predictions.

**Reduced Overfitting:** Decision trees are prone to overfitting, where they memorize the noise in
the training data instead of learning generalizable patterns. Hyperparameter tuning helps mitigate
overfitting by controlling the tree’s complexity (e.g., using `max_depth`) and preventing excessive
detail (e.g., through `min_samples_split`).

**Enhanced Generalization:** The goal is for the decision tree to perform well on unseen data. Tuning
hyperparameters helps balance model complexity and flexibility, enabling the tree to capture key
trends without overfitting to the training data. This leads to improved performance on new, unseen
data.

**Addressing Class Imbalance:** In cases of class imbalance, where one class has significantly fewer
samples than the others, tuning hyperparameters like `min_weight_fraction_leaf` allows the model to
adjust for sample weights. This helps prevent the model from being biased towards the majority
class, improving predictions for the minority class.

**Tailoring the Model to Specific Tasks:** Different tasks may require specific decision tree
behaviors. Hyperparameter tuning allows you to adjust the tree’s structure and learning process to
suit the needs of your task. For example, you can prioritize capturing complex relationships by
adjusting `max_depth` for a more complex classification problem.

Hyperparameter tuning helps find the best values for parameters like maximum depth, minimum
samples per leaf, and split criterion, improving model performance and preventing overfitting.
 max_depth: Maximum depth of the tree. It can be a positive integer or 'None', which indicates no depth limit.
 min_samples_split: The minimum number of samples required to split an internal node.
 min_samples_leaf: The minimum number of samples required to be at a leaf node.
 max_features: The number of features to consider when looking for the best split.
 criterion: The function used to measure the quality of a split (Gini impurity or entropy).
 splitter: The strategy used to choose the split at each node (best or random).
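As an illustration, a grid-search sketch over these hyperparameters, assuming scikit-learn and an arbitrary built-in dataset (the candidate values are examples, not recommendations):

```python
# Grid search over common decision-tree hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
    "max_features": [None, "sqrt"],
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```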
MAX_DEPTH
The **`max_depth`** hyperparameter in a decision tree specifies the maximum depth (number of
levels) that the tree can grow. It directly influences the complexity, interpretability, and performance
of the tree. Here's how it affects the decision tree:

### 1. **Controlling Tree Complexity**


- **Low `max_depth`:** When `max_depth` is set to a small value, the tree will be shallow, meaning it
won’t make many splits. This prevents the model from capturing complex relationships in the data,
potentially leading to **underfitting**. The tree might not be able to model important patterns, and it
may have high bias, performing poorly on both training and unseen data.

- **High `max_depth`:** A higher value allows the tree to grow deeper, meaning more splits can be
made and more complex patterns can be captured. While this increases the tree’s ability to fit the
data (reducing bias), it also increases the risk of **overfitting**, where the tree learns the noise and
specificities of the training data, which doesn't generalize well to new data.

### 2. **Preventing Overfitting**


- **Small `max_depth`:** Limiting the depth of the tree helps prevent overfitting by ensuring the tree
doesn’t become too complex. With a shallow tree, the model generalizes better and is less likely to fit
noise or outliers in the data.

- **Large `max_depth`:** A deeper tree allows more detailed splits, which can lead to overfitting if
the tree becomes too sensitive to the training data. The model may perform very well on training data
but poorly on unseen data, as it has memorized the specifics of the training set.

### 3. **Bias-Variance Trade-off**


- **Small `max_depth`:** With a shallow tree (low depth), the model has high **bias** (it
oversimplifies the data) but low **variance** (it doesn’t fluctuate too much between different data
sets). This can lead to underfitting, where the model fails to capture important trends in the data.

- **Large `max_depth`:** A deeper tree reduces bias by allowing the model to learn more complex
patterns but increases variance, as it becomes more sensitive to small changes in the training data.
This can lead to overfitting, where the model is too tailored to the training data and performs poorly
on new data.

### 4. **Improved Generalization**


Setting an appropriate `max_depth` helps the decision tree balance between fitting the training data
well and generalizing to unseen data. A tree that is neither too shallow nor too deep typically
performs best on both training and test datasets.

### 5. **Effect on Interpretability**


- **Small `max_depth`:** A shallower tree is easier to interpret, as it has fewer nodes and branches.
It provides a simpler, more understandable model, but it may sacrifice accuracy if the complexity of
the data requires deeper splits.

- **Large `max_depth`:** A deeper tree is harder to interpret due to its complexity. While it may
provide more accurate predictions, it can become very difficult to understand, especially when trying
to visualize or explain the decision-making process.

### 6. **Impact on Computational Efficiency**


- **Small `max_depth`:** A shallower tree requires fewer resources to train and predict, as there are
fewer nodes to evaluate.

- **Large `max_depth`:** A deeper tree requires more resources and takes more time to train and
make predictions, especially if the data has many features and complex relationships.

### Summary
- **Low `max_depth`:** The tree is simpler, faster, and less likely to overfit, but may suffer from
underfitting.
- **High `max_depth`:** The tree can model more complex relationships, but is more prone to
overfitting and may be computationally expensive and harder to interpret.
By carefully tuning the `max_depth` parameter, you can achieve a good balance between underfitting
and overfitting, leading to a more effective and efficient decision tree model.
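A sketch of this trade-off, assuming scikit-learn and a synthetic dataset: as max_depth grows, training accuracy keeps rising while test accuracy eventually plateaus or drops.

```python
# How max_depth moves a tree from underfitting to overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    # A large gap between train and test accuracy signals overfitting.
    print(depth,
          round(model.score(X_train, y_train), 3),
          round(model.score(X_test, y_test), 3))
```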

MIN_SAMPLES_SPLIT
The **`min_samples_split`** hyperparameter in a decision tree specifies the minimum number of
samples required to split an internal node. This hyperparameter affects the tree's structure,
complexity, and generalization. Here's how it influences the decision tree:

### 1. **Controlling Tree Depth and Complexity**


- **Low `min_samples_split`:** A smaller value (e.g., 2) allows the tree to make more splits, resulting
in a deeper and more complex tree. This can help capture finer patterns in the data but increases the
risk of overfitting, especially if the tree starts to model noise or minor fluctuations in the data.
- **High `min_samples_split`:** A larger value requires more samples to split a node, which reduces
the number of splits and results in a shallower tree. This can prevent the model from becoming overly
complex and reduce the chance of overfitting. However, if set too high, it might also lead to
underfitting, where the tree fails to capture important patterns in the data.

### 2. **Preventing Overfitting**


- **Low `min_samples_split`:** The model is more likely to overfit the training data. By allowing very
small subsets of data to form their own splits, the tree can end up being very detailed, capturing
noise or random fluctuations in the data that do not generalize well to new data.
- **High `min_samples_split`:** A higher value forces the tree to require larger groups of data to
make a split, leading to fewer and more general splits. This results in a simpler model, which can
help avoid overfitting and improve generalization to unseen data.

### 3. **Bias-Variance Trade-off**


- **Smaller `min_samples_split`:** The tree tends to have higher variance, meaning it can fit the
training data very well (low bias) but may not perform as well on new data due to overfitting.
- **Larger `min_samples_split`:** The tree tends to have higher bias, meaning it may not fit the
training data as well (less flexibility), but it could generalize better and perform more reliably on
unseen data.

### 4. **Effect on Node Size**


When `min_samples_split` is set to a low value, the decision tree may create nodes with only a few
samples, which could cause instability and variability in the predictions. If the value is set too high,
the model might fail to create meaningful splits, reducing its ability to capture important
relationships between features.

### 5. **Impact on Computational Efficiency**


- **Low `min_samples_split`:** The tree may become deeper and more complex, leading to higher
computational costs, both in terms of time and memory.
- **High `min_samples_split`:** The tree will generally have fewer splits and shallower depth, leading
to better computational efficiency during both training and prediction.

### Summary
- **Low `min_samples_split`:** More splits, higher complexity, risk of overfitting.
- **High `min_samples_split`:** Fewer splits, simpler tree, improved generalization, but potentially
underfitting.

Adjusting `min_samples_split` helps balance model complexity and generalization, contributing to better performance and reduced overfitting or underfitting.

MIN_SAMPLES_LEAF
The **`min_samples_leaf`** hyperparameter in a decision tree controls the minimum number of
samples required to be at a leaf node (the terminal node of the tree). This parameter plays a key role
in determining the structure and complexity of the tree, and it affects the model's ability to
generalize. Here's how it impacts the decision tree:

### 1. **Preventing Overfitting**


A lower value for `min_samples_leaf` (e.g., 1) allows the tree to create very deep branches, which can
result in overfitting. This means the model may memorize the training data, including noise or
outliers, and perform poorly on unseen data.

Increasing `min_samples_leaf` forces the tree to have more samples at each leaf, leading to a simpler
tree with fewer splits, helping prevent overfitting by smoothing the model’s predictions.

### 2. **Tree Complexity**


- **Low `min_samples_leaf`:** With a low value (like 1), the tree can create more splits and
potentially more nodes, resulting in a highly detailed and complex tree that captures even the
smallest patterns (including noise) in the training data.
- **High `min_samples_leaf`:** A higher value means the tree will require more samples in each leaf
node, leading to fewer splits and a simpler tree. This can help improve generalization by not allowing
the model to become too specific to the training data.

### 3. **Bias-Variance Trade-off**


- **Smaller `min_samples_leaf`:** The tree might have high variance, as it becomes more flexible and
fits the noise in the training set. While this may lead to a better fit on training data, it could hurt the
performance on new, unseen data.
- **Larger `min_samples_leaf`:** The tree might exhibit higher bias because it’s less flexible, but this
could lead to better performance on unseen data by avoiding overfitting.

### 4. **Handling Small or Rare Classes**


When dealing with class imbalance, setting an appropriate value for `min_samples_leaf` ensures that
the tree doesn't create leaf nodes that only represent a few samples, which might correspond to rare
or noisy classes. A higher value can help the model treat small classes more reasonably by combining
samples into fewer, broader nodes.

### Summary
- **Low `min_samples_leaf`:** More splits, higher complexity, risk of overfitting.
- **High `min_samples_leaf`:** Fewer splits, simpler tree, improved generalization, but potentially
underfitting.

Choosing the right value for `min_samples_leaf` is crucial for finding the right balance between
underfitting and overfitting, leading to better generalization and model performance.

MAX_FEATURES
The **`max_features`** hyperparameter in a decision tree specifies the maximum number of features
to consider when looking for the best split at each node. This parameter directly influences the tree’s
structure, performance, and generalization. Here's how it affects a decision tree:

### 1. **Controlling Model Complexity**


- **Low `max_features`:** If `max_features` is set to a small number, the model will only consider a
subset of features when splitting each node. This forces the tree to focus on only a few features at
each decision point, leading to a more diverse set of decision trees and potentially reducing the
likelihood of overfitting. However, it might result in **underfitting** if the tree isn't able to capture
enough relevant features for optimal splits.

- **High `max_features`:** If `max_features` is set to a high number (such as the total number of
features or close to it), the decision tree will have access to many features when splitting at each
node. This allows the tree to make splits based on more information, which can improve the model's
performance but also increases the risk of **overfitting**. The tree might learn noise or very specific
patterns from the data that don't generalize well to new, unseen data.

### 2. **Effect on Overfitting and Underfitting**


- **Low `max_features`:** Limiting the number of features considered at each split can help reduce
overfitting. It encourages the model to look at different, smaller subsets of the data at each step,
making it harder to memorize the training set and helping the model generalize better. However,
setting it too low can result in underfitting, where the tree doesn't have enough information to make
meaningful splits.
- **High `max_features`:** A larger value allows the tree to consider more features at each node,
which can increase the model’s complexity. This often leads to better fitting of the training data but
can result in overfitting, where the model becomes too specialized to the training data and performs
poorly on new data.

### 3. **Bias-Variance Trade-off**


- **Low `max_features`:** The tree will have higher **bias** because it is limited in the number of
features it can use for each decision, potentially missing important relationships in the data. It will
have lower **variance**, meaning the model is less sensitive to the specifics of the training data.

- **High `max_features`:** The tree will have lower **bias** because it can use a wider range of
features for splitting. However, it will have higher **variance**, meaning the model could be overly
sensitive to training data and may overfit, resulting in poor generalization.

### 4. **Improved Generalization (Especially in Random Forests)**


In ensemble models like **Random Forests**, where multiple decision trees are trained, setting
`max_features` to a value less than the total number of features helps create diversity among the
trees. This helps improve generalization by ensuring that different trees in the forest focus on
different subsets of features, reducing the model's tendency to overfit.

### 5. **Impact on Computational Efficiency**


- **Low `max_features`:** When fewer features are considered for each split, the tree will require
less computational power to evaluate potential splits, resulting in faster training and prediction
times.

- **High `max_features`:** When more features are considered, the model will take more time to
evaluate splits and train, especially as the number of features grows. This increases the
computational cost.

### 6. **Interpretability**
- **Low `max_features`:** Trees built with a smaller set of features per split can be easier to interpret
because they rely on fewer features, which might make the tree simpler to analyze and understand.

- **High `max_features`:** Larger numbers of features considered at each split can make the tree
more complex and harder to interpret, especially when the tree uses many features to make
decisions.

### Summary
- **Low `max_features`:** The tree is more likely to generalize better and prevent overfitting, but it
might not capture all the important patterns, leading to underfitting.
- **High `max_features`:** The tree can model more complex patterns, but it risks overfitting, and the
model may become computationally expensive and harder to interpret.

Tuning the `max_features` parameter helps to strike the right balance between model complexity,
generalization, and computational efficiency.

CRITERION
The criterion hyperparameter in a decision tree determines the function used to measure the quality
of a split at each node. It plays a crucial role in how the tree makes decisions about which features to
use and what thresholds to set when splitting the data. The choice of criterion can affect the
accuracy, complexity, and interpretability of the tree.

The two most common values for the criterion parameter are:
1. gini (Gini impurity)
2. entropy (Information gain)

1. gini (Gini Impurity)


Gini impurity measures the "impurity" or "purity" of a node. It calculates the probability of a random
sample being incorrectly classified if it were randomly labeled according to the distribution of labels
in that node. For binary classification the Gini index ranges from 0 to 0.5 (and it stays below 1 even with many classes):
 0 means perfect purity (i.e., all samples at a node belong to the same class).
 Higher values mean greater impurity (i.e., the samples are more evenly distributed among the classes).

Effect on the Tree:


o Gini impurity tends to favor splits that result in larger class homogeneity (pure nodes).
o It is computationally faster to calculate than entropy, making it a more efficient choice when
building large decision trees.
o Decision trees using the Gini index tend to result in slightly more balanced trees with fewer
nodes.

2. entropy (Information Gain)


Entropy is based on the concept of information theory. It measures the disorder or uncertainty in the
node. The goal is to minimize entropy (i.e., reduce uncertainty) by finding splits that separate the
classes as clearly as possible. For binary classification the entropy ranges from 0 to 1 (with more than two classes it can exceed 1):
 0 means no uncertainty (i.e., all samples at a node belong to the same class).
 1 means maximum uncertainty for two classes (i.e., the samples are evenly split between them).

Effect on the Tree:


o Entropy tends to favor splits that create more homogeneity in the child nodes, but it may
create slightly deeper trees than Gini impurity.
o Information gain (used in entropy) might sometimes result in less balanced trees compared to
Gini, as it focuses on achieving the highest possible reduction in entropy.

3. Impact on Tree Performance


The choice of criterion impacts the structure and behavior of the tree:
 Gini impurity often results in trees with a faster training time and slightly more balanced
splits.
 Entropy tends to be more focused on reducing disorder and can sometimes result in deeper
trees with potentially more accurate predictions, though it might require more time to
compute.

4. Impact on Overfitting
Both Gini impurity and entropy aim to improve the purity of the splits, but the tree's depth (and thus
the risk of overfitting) will depend on other factors like max_depth, min_samples_split, and
min_samples_leaf. However:
 Gini tends to produce slightly less complex trees because it is quicker to compute and less
sensitive to small variations in the data.
 Entropy might result in deeper trees, as it focuses more on minimizing uncertainty and might
keep splitting the data to achieve maximum homogeneity in each node.

5. Interpretability
 Gini impurity: Trees built with the Gini criterion tend to be slightly easier to interpret
because they often produce more balanced and simpler trees with fewer levels.
 Entropy: Trees built with the entropy criterion may be more complex, potentially making
them harder to interpret due to deeper branches and more splits.

6. Performance Considerations
 Gini impurity is generally faster to compute than entropy, which can be beneficial for large
datasets or when computational efficiency is crucial.
 Entropy may take longer to compute, but the trees it generates can sometimes achieve
slightly better predictive accuracy, particularly in complex datasets.

Summary of Effects:
 gini (Gini Impurity):
o Faster to compute
o Tends to result in simpler and more balanced trees
o Often preferred for practical use in decision trees
 entropy (Information Gain):
o Slower to compute due to the logarithmic calculations
o Tends to create deeper trees with potentially better accuracy
o Focuses more on maximizing information gain at each split
Conclusion
The criterion parameter affects how the decision tree evaluates and splits the data. Both Gini
impurity and entropy are designed to optimize the purity of the splits, but they differ in computational
efficiency and how they balance tree complexity. Gini is generally faster and simpler, while entropy
may result in better accuracy but is computationally more intensive. The choice depends on the
trade-off between model performance, speed, and interpretability.

SPLITTER
The **`splitter`** hyperparameter in a decision tree controls the strategy used to split the nodes at
each level of the tree. It determines how the best feature to split on is selected during the tree-
building process. The two common options for the `splitter` parameter are:

1. **`best`**
2. **`random`**

### 1. **`best` Splitter (Default)**


When the **`splitter`** is set to **`best`**, the decision tree algorithm selects the feature and
threshold that best splits the data at each node, according to the chosen **criterion** (such as Gini
impurity or entropy).

- **Effect on the Tree**:


- The algorithm evaluates all possible splits at each node and chooses the one that provides the best
separation (i.e., the one that reduces the impurity most).
- This results in the most optimal splits, potentially leading to a more accurate model.
- However, it can also lead to **overfitting** if the tree grows too complex because the algorithm is
selecting the best possible split at each step without any randomness, which can lead to highly
specific patterns learned from the training data.
- It may increase computational cost since the algorithm evaluates all splits at each node.

### 2. **`random` Splitter**


When the **`splitter`** is set to **`random`**, the algorithm randomly selects a subset of features
and chooses the best split from that subset, rather than considering all features.

- **Effect on the Tree**:


- By introducing randomness, the decision tree becomes less likely to overfit since the splits are less
likely to perfectly fit the noise in the training data.
- The tree may be less accurate compared to a tree using the **`best`** splitter because it might not
always select the optimal feature and threshold.
- **Faster training**: Since fewer features are evaluated at each node, the training process can be
faster compared to the **`best`** splitter, making this approach suitable for large datasets or when
faster model training is required.
- It increases **variance** in the tree, which can lead to a less stable model but can also help in
generalization by reducing overfitting.

### 3. **Impact on Overfitting and Underfitting**


- **`best` Splitter**: Tends to overfit if the tree is allowed to grow too deep because it always chooses
the best possible split, which could capture noise from the data.
- **`random` Splitter**: Helps to reduce overfitting by introducing randomness in the splitting
process. However, it might result in a less accurate model compared to the `best` splitter, especially
if the tree is not tuned properly.

### 4. **Impact on Performance and Speed**


- **`best` Splitter**: The decision tree will evaluate all features at each split, which can be
computationally expensive, especially for large datasets with many features.
- **`random` Splitter**: Evaluates only a random subset of features at each split, making the training
process faster, particularly on large datasets.

### 5. **Effect on Model Interpretability**


- **`best` Splitter**: The resulting tree will likely be more interpretable because it chooses the
optimal splits for each node, meaning each decision point in the tree is as informative as possible.
- **`random` Splitter**: The tree might be less interpretable because the splits are selected randomly,
and the resulting decisions may not be as clear-cut as when using the `best` strategy.

### Summary of Effects:


- **`best` Splitter**: Results in the most accurate tree (within the scope of the training data), but may
cause overfitting and is computationally more expensive.
- **`random` Splitter**: Introduces randomness, potentially reducing overfitting, speeding up
training, and helping to generalize better, but may lead to less accurate splits.

### Conclusion
The choice of **`splitter`** depends on the trade-off between accuracy, speed, and generalization:
- Use **`best`** if accuracy is your primary goal and you want to ensure the tree captures as much
useful information as possible.
- Use **`random`** if you need a faster model or want to ensure the tree doesn't overfit, especially
when working with large datasets or when computational resources are limited.
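A brief sketch comparing the two strategies, assuming scikit-learn and an arbitrary built-in dataset:

```python
# splitter="best" vs splitter="random" on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for strategy in ["best", "random"]:
    model = DecisionTreeClassifier(splitter=strategy, random_state=0)
    print(strategy, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```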

METHODS OF HYPERPARAMETER TUNING


Grid Search
Grid search is a popular and fundamental technique for hyperparameter tuning, where we
exhaustively evaluate all possible combinations of predefined hyperparameters. This approach is
reliable for finding the optimal hyperparameters, and when sufficient computational resources are
available, it often leads to highly accurate predictions.
One key advantage of grid search is that each trial runs independently, which allows it to be
parallelized, reducing overall runtime. However, the main drawback is its high computational cost,
especially when the parameter space is large or high-dimensional, making it less efficient in those
scenarios.

Randomized Search
In contrast, randomized search performs a more flexible search over hyperparameters by sampling
from predefined distributions. This approach continues until a predefined limit or the desired
performance is reached. Randomized search is generally more efficient than grid search, especially
when hyperparameters are not uniformly distributed, as it allows independent allocation of the
search limit for each parameter.
Additionally, the nature of randomized search makes it easy to parallelize, which helps in saving time
and resources. While grid search might explore all possibilities exhaustively, randomized search often
achieves better results in less time, particularly when the search space is large.

Bayesian Optimization
Bayesian optimization takes a more sophisticated approach by using probabilistic models to identify
the best set of hyperparameters in a more computationally efficient way. Unlike grid and random
search, it’s a sequential process designed to find the global optimum with fewer trials, thus reducing
the number of computations needed.
While this method can be highly efficient in terms of computational cost, it’s also more complex to
implement. It's particularly useful in scenarios where you have a large search space and need to
balance performance and computational efficiency.

Conclusion
For most cases, Grid Search and Randomized Search are excellent starting points due to their
simplicity and ease of implementation. Grid Search is ideal when you have a small to moderate
search space and can afford the computational cost, while Randomized Search is better when
dealing with a larger search space and when efficiency is a concern. If computational cost becomes a
bottleneck and you need fewer trials for optimization, Bayesian Optimization is a strong choice, but
it requires more expertise to implement effectively.
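For comparison with the grid-search sketch earlier, a randomized-search sketch assuming scikit-learn and scipy (the distributions and budget are illustrative):

```python
# Randomized search: sample hyperparameter combinations instead of enumerating them.
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_distributions = {
    "max_depth": randint(2, 20),
    "min_samples_split": randint(2, 50),
    "min_samples_leaf": randint(1, 20),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=50,       # budget: number of sampled combinations
    cv=5,
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```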

EVALUATION
Classification trees are typically evaluated with Precision, Recall, F1-Score, Accuracy, and AUC-ROC; regression trees with MSE.

BIAS – VARIANCE TRADE OFF


Bias-Variance Tradeoff: Decision Trees are prone to overfitting (high variance) if they grow
too deep, capturing noise in the training data. Conversely, shallow trees might oversimplify
relationships (high bias), leading to underfitting.
Increasing Depth: As depth increases:
 Bias decreases (tree models complex patterns better).
 Variance increases (model becomes more sensitive to data fluctuations).

Bias refers to the error introduced by simplifying assumptions (e.g., underfitting), while variance
refers to the error introduced by the model’s sensitivity to fluctuations in the training data (e.g.,
overfitting). Decision Trees typically have high variance but low bias.

BAGGING
Bagging (Bootstrap Aggregating) involves training multiple Decision Trees on different random
subsets of the data and averaging their predictions to reduce variance.

BOOSTING
A sequential ensemble technique (e.g., Gradient Boosting) where each tree corrects the errors of the
previous ones, focusing more on misclassified examples. It improves accuracy but may lead to
overfitting if not controlled.

Boosting combines many weak learners (usually shallow Decision Trees) into a strong model by training them sequentially, with each new tree focusing on the examples the previous trees misclassified.
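A sketch contrasting a single tree with a bagged and a boosted ensemble, assuming scikit-learn and an arbitrary built-in dataset:

```python
# Single tree vs bagging (random forest) vs boosting (gradient boosting).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging (random forest)": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```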

FEATURE IMPORTANCE
The importance of a feature is accumulated by summing up its contribution across all splits in the
tree.
1. Split Impurity Reduction
Each split in a Decision Tree is associated with a measure of impurity, such as:
 Gini Impurity
 Entropy (Information Gain)
 Mean Squared Error (for regression trees)
When a feature is used to split a node, it reduces the impurity of the data at that node. The
reduction in impurity for that split is calculated as:
Impurity Reduction=Impurity of Parent Node−(Weighted Impurity of Child Nodes)
2. Aggregate Across the Tree
For each feature:
 Calculate the total impurity reduction contributed by that feature across all the splits
where it is used.
 The importance of a feature is proportional to this total reduction.
3. Normalize the Importance scores
Once the total impurity reductions are computed for all features:
 Normalize them so that they sum to 1. This allows easy comparison of feature
importances.
Normalized Importance = (Total Impurity Reduction for the Feature) / (∑ of Impurity Reductions for All Features)
Key Takeaway
Features that result in larger reductions in impurity are deemed more important because they
contribute more to making accurate splits in the tree. This calculation is an inherent part of building
the Decision Tree, and libraries like Scikit-learn compute these importance scores automatically.
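A sketch of reading these scores, assuming scikit-learn (which exposes the normalized values as feature_importances_) and the Iris dataset:

```python
# Impurity-based feature importances; the values sum to 1.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```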

Techniques for Handling Imbalanced Classes


1. Class Weights: Assign higher weights to underrepresented classes.
2. Oversampling/Undersampling: Balance the dataset using methods like SMOTE.
3. Custom Splitting Criteria: Adjust metrics to prioritize minority class performance.
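A short sketch of option 1, assuming scikit-learn and a synthetic imbalanced dataset:

```python
# Class weights: penalize mistakes on the minority class more heavily.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# "balanced" weights classes inversely to their frequencies;
# an explicit dict such as {0: 1, 1: 10} can also be passed.
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)
```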

What is the "out-of-bag" error in Decision Trees?


Out-of-bag error is the error rate of a bagged ensemble of Decision Trees (e.g., a Random Forest) measured on the samples left out of each tree's bootstrap sample, providing an internal validation mechanism.
What is the importance of feature selection in Decision Trees?
Feature selection helps improve the interpretability and accuracy of a Decision Tree by
reducing irrelevant or redundant features, preventing overfitting, and speeding up training.
How do Decision Trees handle missing data?
Decision Trees can handle missing data by using strategies like surrogate splits (splits based
on secondary criteria) or assigning missing values to the most likely class.
How would you handle imbalanced data in Decision Trees?
Imbalanced data can be handled by using techniques like class weighting, resampling
(oversampling/undersampling), or changing the decision threshold to improve the model’s
performance on minority classes.
How do Decision Trees handle categorical variables?
Decision Trees can handle categorical variables directly by evaluating splits based on the
distinct categories or creating binary splits for each category.
What is the computational complexity of Decision Trees?
The time complexity for building a Decision Tree is O(n * m * log n), where n is the number
of samples and m is the number of features. The complexity may vary with specific
implementations.
How do you improve the stability of Decision Trees?
To improve stability, techniques like Random Forest (ensemble method) or bagging
(bootstrap aggregation) can be used to reduce variance and make the model more robust.
How do Decision Trees handle continuous variables?
Continuous variables are split by comparing the feature value to a threshold. The tree tries
different thresholds to find the one that minimizes the impurity.
How would you handle correlated features in Decision Trees?
Correlated features may result in inefficient splits. Feature selection, Principal Component
Analysis (PCA), or using ensemble methods like Random Forest can help manage this.
What is feature importance in Decision Trees?
Feature importance measures the contribution of each feature to the predictive power of the
model. It is calculated based on the amount of impurity reduction caused by a feature during
splits.

What is the difference between the "CART" and "ID3" algorithms?


CART (Classification and Regression Trees) uses binary splits and can handle both
classification and regression, while ID3 (Iterative Dichotomiser 3) uses multi-way
splits and is generally used for classification tasks.
What is the role of Random Forest in improving Decision Trees?
Random Forest builds multiple Decision Trees and combines their outputs, reducing the risk
of overfitting and improving model stability and accuracy.
How do Decision Trees differ from Neural Networks?
Decision Trees are interpretable and involve simple if-else logic, while Neural Networks are
complex models with layers of interconnected nodes that are harder to interpret.
