Data II - Decision Trees and Rules
Split
Imagine you're a marketing manager trying to target ads for a new fitness tracker. You
have data on 100 past customers, including their age, income, and whether they bought
the tracker (1) or not (0). Your goal is to build a decision tree that predicts whether a
future customer will purchase the tracker based on these attributes.
1. Calculate the entropy of the parent node.
➔ Think of the parent node as the entire customer pool (100 people).
➔ We know 40 bought the tracker (class 1) and 60 didn't (class 0).
➔ Calculate the entropy using the formula:
Entropy(S) = - (0.4 * log2(0.4) + 0.6 * log2(0.6)) = 0.971
This value (0.971) tells us how mixed up the classes are in the parent node. Closer
to 1 means more uncertainty, with classes evenly distributed.
2. For each potential split feature, calculate the entropy of each child
node. A child node is a group of data points that result from splitting the
parent node on a particular feature value.
For an age split, the resulting child-node entropies indicate that the "young" group is
less certain (more mixed classes), while the "old" group is more certain (mostly didn't buy).
3. Calculate the information gain for each potential split. Information gain
is the difference between the entropy of the parent node and the
weighted average of the entropy of the child nodes. The split with the
highest information gain is the best split, because it results in the most
homogeneous child nodes.
➔ Recall that information gain measures how much a specific feature (age
in this case) helps make the data less mixed.
➔ Use the formula:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) * Entropy(S_v)
where S is the parent set, A is the candidate split feature, and S_v is the child subset of S for which A takes value v.
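To make these steps concrete, here is a minimal Python sketch. The parent counts (40 bought, 60 did not) come from the example above, but the child counts for the "young"/"old" age split are invented for illustration, since the notes do not list them:

```python
import math

def entropy(pos, neg):
    """Binary entropy of a node with `pos` positives and `neg` negatives."""
    total = pos + neg
    ent = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            ent -= p * math.log2(p)
    return ent

# Parent node: 40 bought, 60 did not  ->  entropy ~ 0.971
parent = entropy(40, 60)

# Hypothetical child counts for an age split (illustrative only):
# "young": 35 bought / 15 did not, "old": 5 bought / 45 did not
young, old = (35, 15), (5, 45)
n_young, n_old = sum(young), sum(old)
n = n_young + n_old

# Weighted average of the child entropies, then the information gain
weighted_children = (n_young / n) * entropy(*young) + (n_old / n) * entropy(*old)
info_gain = parent - weighted_children

print(f"parent entropy          = {parent:.3f}")              # 0.971
print(f"weighted child entropy  = {weighted_children:.3f}")   # ~0.675
print(f"information gain (age)  = {info_gain:.3f}")           # ~0.296
```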
Another splitting criterion:
The Gini index, or Gini impurity, is a measure used in decision trees to quantify
the impurity or disorder within a dataset.
Gini(S) = 1 - Σ (p_i)^2
Where:
● p_i is the proportion of data points in S that belong to class i (the sum runs over all classes).
A Gini of 0 means the node is pure (all one class); for two classes the maximum is 0.5, reached when the classes are split evenly.
Example: To illustrate the use of the Gini index in building a decision tree, we return to
the marketing scenario for the new fitness tracker and its dataset of 100 past
customers.
For simplicity, let's assume the dataset has been divided based on age into two
groups: "Under 30" and "30 and Over", and based on income into two groups:
"High" and "Low". We also have the purchase outcome for each group.
**Age Groups**:
- Under 30: 40 customers, 30 bought the tracker (1), 10 did not (0).
- 30 and Over: 60 customers, 20 bought the tracker (1), 40 did not (0).
**Income Groups**:
- High: 50 customers, 35 bought the tracker (1), 15 did not (0).
- Low: 50 customers, 15 bought the tracker (1), 35 did not (0).
**Under 30**: 30/40 bought → Gini = 1 - ((30/40)^2 + (10/40)^2) = 0.375
**30 and Over**: 20/60 bought → Gini = 1 - ((20/60)^2 + (40/60)^2) ≈ 0.444
**High Income**: 35/50 bought → Gini = 1 - ((35/50)^2 + (15/50)^2) = 0.420
**Low Income**: 15/50 bought → Gini = 1 - ((15/50)^2 + (35/50)^2) = 0.420
Weighted Gini for the age split: 0.4 * 0.375 + 0.6 * 0.444 ≈ 0.417. Weighted Gini for the income split: 0.5 * 0.420 + 0.5 * 0.420 = 0.420. The age split yields the (slightly) lower weighted impurity, so it is chosen first.
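As a quick check of this arithmetic, here is a minimal Python sketch using the group counts listed above; it computes each group's Gini impurity and the weighted impurity of the two candidate splits:

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# (bought, did not buy) counts from the example above
groups = {
    "Under 30":    (30, 10),
    "30 and Over": (20, 40),
    "High income": (35, 15),
    "Low income":  (15, 35),
}
for name, counts in groups.items():
    print(f"{name:12s} Gini = {gini(counts):.3f}")

def weighted_gini(*children):
    """Weighted average impurity of the child nodes of a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(f"Age split    : {weighted_gini((30, 10), (20, 40)):.3f}")   # ~0.417
print(f"Income split : {weighted_gini((35, 15), (15, 35)):.3f}")   # 0.420
```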
1. First Split on Age: Divide the dataset into "Under 30" and "30 and Over".
2. Further Splits: For each of these groups, we could further analyze the data (possibly considering
income or other attributes if available) to create additional splits, aiming to increase the purity of
the nodes.
3. Terminal Nodes (Leaves): Continue splitting until reaching a point where further splits do not
significantly increase purity or when a node has reached a minimum size.
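The sketch below is one way these steps might look in code, assuming scikit-learn and a synthetic dataset generated to roughly match the group proportions above (the actual customer data is not reproduced in these notes); criterion="gini" makes the tree use the Gini index from this section:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic stand-in for the 100-customer dataset: age in years,
# income coded 0 = Low / 1 = High, and a purchase label that loosely
# follows the group proportions above (Under 30 buy more often).
age = rng.integers(18, 65, size=100)
income = rng.integers(0, 2, size=100)
buy_prob = np.where(age < 30, 0.75, 1 / 3)
purchased = (rng.random(100) < buy_prob).astype(int)

X = np.column_stack([age, income])
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, purchased)

# Inspect which splits the Gini criterion actually chose
print(export_text(tree, feature_names=["age", "income"]))
```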
Underfitting and overfitting:
Underfitting occurs when the model is too simple to capture the underlying
structure of the data. This usually happens when the tree is not deep enough,
leading to large bias and poor performance on both training and unseen data.
Characteristics of Underfitting:
● High bias: the tree is too shallow to capture real patterns in the data.
● Poor accuracy on both the training data and unseen data.
Overfitting is the opposite problem: the tree is grown so deep that it memorizes noise in the
training data rather than the underlying pattern.
Characteristics of Overfitting:
● Very high accuracy on the training data but noticeably worse accuracy on unseen data.
● High variance: small changes in the training data produce very different trees.
Common ways to control overfitting include:
● Pruning (removing parts of the tree that don't provide additional power)
● Setting a maximum depth for the tree
● Requiring a minimum number of samples to split a node.
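A minimal sketch of both failure modes, assuming scikit-learn and a synthetic dataset (a stand-in for whatever data the course uses): comparing trees of different depths typically shows underfitting at depth 1 and overfitting with an unrestricted tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the customer dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):  # very shallow, moderate, unrestricted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train acc = {tree.score(X_train, y_train):.2f}, "
          f"test acc = {tree.score(X_test, y_test):.2f}")

# Typical pattern: depth 1 scores poorly on both sets (underfitting),
# the unrestricted tree scores ~1.0 on training but worse on test
# (overfitting), and a moderate depth sits in between.
```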
Problem: Decision trees can overfit the training data, leading to poor
performance on unseen data.
Solution: Pruning reduces the size and complexity of the tree, improving
generalization.
Techniques:
● Pre-pruning (Early stopping): Stop growing the tree early based on size
or data purity.
● Post-pruning: Grow a large tree, then remove suboptimal branches based
on error rates.
Benefits: a smaller, more interpretable tree and better generalization to unseen data.
Trade-offs: pre-pruning may stop too early and miss useful splits; post-pruning requires
growing (and then cutting back) a full tree, which costs more computation; pruning too
aggressively can lead to underfitting.
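As a rough sketch of how these techniques look in practice (assuming scikit-learn; the course may use a different tool), pre-pruning corresponds to constructor arguments such as max_depth and min_samples_split, while post-pruning can be done with cost-complexity pruning via ccp_alpha:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Pre-pruning (early stopping): limit depth and minimum samples per split
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=20,
                                    random_state=1).fit(X_train, y_train)

# Post-pruning: grow a full tree, then prune with cost-complexity pruning.
# Candidate ccp_alpha values come from the pruning path of the full tree.
full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# For simplicity alpha is chosen on the test set here; in practice a
# separate validation set (or cross-validation) would be used.
best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha,
                                    random_state=1).fit(X_train, y_train)
    acc = pruned.score(X_test, y_test)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(f"pre-pruned test acc : {pre_pruned.score(X_test, y_test):.2f}")
print(f"full tree test acc  : {full.score(X_test, y_test):.2f}")
print(f"best ccp_alpha      : {best_alpha:.4f} (test acc {best_acc:.2f})")
```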
SSR method:
Choosing the best feature (and threshold) for a split in a regression decision tree means
finding the split that minimizes the spread of the target variable (what you're trying to
predict) within the resulting groups. The spread is measured by the Sum of Squared
Residuals (SSR): for each child group, sum the squared differences between each target
value and that group's mean, then add the sums together:
SSR(split) = Σ_left (y_i - mean(y_left))^2 + Σ_right (y_i - mean(y_right))^2
Note : Sum of Squared Residuals (SSR): Measures how "spread out" the
target variable is within the resulting groups after a split. We want to
minimize SSR because the smaller the SSR, the more homogeneous
(similar) the target variable is within each group, leading to better
predictions.
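A minimal Python sketch of this idea, with a made-up numeric feature (age) and target (weekly exercise hours) purely for illustration: it scans candidate thresholds on one feature and returns the threshold whose split minimizes the total SSR of the two child groups.

```python
import numpy as np

def ssr(values):
    """Sum of squared residuals of a group around its own mean."""
    return float(np.sum((values - values.mean()) ** 2)) if len(values) else 0.0

def best_split(feature, target):
    """Return the threshold on one numeric feature whose split
    minimizes the total SSR of the two child groups."""
    order = np.argsort(feature)
    feature, target = feature[order], target[order]
    # Candidate thresholds: midpoints between consecutive sorted values
    candidates = np.unique((feature[:-1] + feature[1:]) / 2)
    best_t, best_ssr = None, np.inf
    for t in candidates:
        left, right = target[feature <= t], target[feature > t]
        total = ssr(left) + ssr(right)
        if total < best_ssr:
            best_t, best_ssr = t, total
    return best_t, best_ssr

# Hypothetical example: predict weekly exercise hours from age
age = np.array([22, 25, 31, 38, 45, 52, 60], dtype=float)
hours = np.array([6.0, 5.5, 4.0, 3.5, 2.0, 1.5, 1.0])
threshold, total_ssr = best_split(age, hours)
print(f"best threshold: age <= {threshold:.1f}, total SSR = {total_ssr:.2f}")
```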