
Data Mining: Decision Tree Yesmine Chalgham

Decision Trees and Rules


Decision Tree:

● Decision tree learning: build tree-shaped models for prediction.


● Model: uses "if-then" decisions (nodes) with branches for different
choices.
● Final nodes (leaves): represent the predicted outcome (e.g., class label).
● Classification: data follows branches based on features, reaching a leaf
and its prediction.
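
To make the "if-then" structure concrete, here is a minimal Python sketch (not part of the original notes); the feature names, thresholds, and labels are invented purely for illustration.

def predict_purchase(age, income):
    """A toy decision tree written out as nested if-then rules.
    Each if-test is an internal node; each return is a leaf with its class label."""
    if age < 30:                  # root node: split on age
        if income == "high":      # internal node: split on income
            return 1              # leaf: predicted to buy
        return 0                  # leaf: predicted not to buy
    return 0                      # leaf: predicted not to buy

print(predict_purchase(age=25, income="high"))   # a new example follows the branches to a leaf -> 1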
How do we get this decision tree?

(Figure: the training data is divided by a first split, and each resulting subset is divided again by further splits.)

Choosing the best split

● Identifying the Best Split:


The primary challenge for a decision tree is determining which feature divides
the dataset most effectively, aiming for partitions whose examples belong
predominantly to a single class, i.e., partitions that are as pure as possible.
I. Information Gain method:
● Measuring Purity with Entropy:
Entropy measures the impurity within a dataset segment S:

Entropy(S) = - (P1 * log2(P1) + P2 * log2(P2) + ... + Pn * log2(Pn)), where Pi is
the proportion of class i within S.

Entropy values range from 0 (completely homogeneous) to 1 (maximum disorder for a
two-class segment).
● Information Gain:
It measures the reduction in entropy from the pre-split dataset (S1) to the post-
split partitions (S2), aiming to maximize homogeneity in the resulting groups:

InformationGain = Entropy(S1) - Entropy(S2)

Calculating Total Entropy Post-Split:

Entropy(S2) = w1 * Entropy(partition 1) + w2 * Entropy(partition 2) + ..., where
wi is the fraction of the examples that fall into partition i.

The method, explained in steps with an example:


Imagine you're a marketing manager trying to target ads for a new fitness tracker. You
have data on 100 past customers, including their age, income, and whether they bought
the tracker (1) or not (0). Your goal is to build a decision tree that can predict
whether a future customer will purchase the tracker based on these attributes.

1. Calculate the entropy of the parent node. Entropy is a measure of how


mixed up the classes are in a dataset. A dataset with only one class has an
entropy of 0, while a dataset with an equal number of each class has an
entropy of 1.

➔ Think of the parent node as the entire customer pool (100 people).
➔ We know 40 bought the tracker (class 1) and 60 didn't (class 0).
➔ Calculate the entropy using the formula:
Entropy(S) = - (0.4 * log2(0.4) + 0.6 * log2(0.6)) = 0.971

This value (0.971) tells us how mixed up the classes are in the parent node. Closer
to 1 means more uncertainty, with classes evenly distributed.

2. For each potential split feature, calculate the entropy of each child
node. A child node is a group of data points that result from splitting the
parent node on a particular feature value.

➔ We'll consider splitting the data by age (<30 and 30+).


➔ Calculate entropy for each group based on their purchase behavior:
- Young (<30): 30 bought, 20 didn't (Entropy = 0.971)
- Old (30+): 10 bought, 40 didn't (Entropy = 0.722)

These values indicate that the "young" group is less certain (more mixed
classes), while the "old" group is more certain (mostly didn't buy).

3. Calculate the information gain for each potential split. Information gain
is the difference between the entropy of the parent node and the
weighted average of the entropy of the child nodes. The split with the
highest information gain is the best split, because it results in the most
homogeneous child nodes.


➔ Recall that information gain measures how much a specific feature (age
in this case) helps make the data less mixed.
➔ Use the formula:

InformationGain(age) = Entropy(parent) - (weight_young * Entropy(young) +
weight_old * Entropy(old))

Each weight is the fraction of customers in that group; both groups contain 50 of
the 100 customers, so each weight is 0.5:

= 0.971 - (0.5 * 0.971 + 0.5 * 0.722) ≈ 0.125

➔ We interpret this as: splitting by age reduces the overall uncertainty
(entropy) by about 0.125.
4. Repeat for Income Split
➔ Perform similar calculations for splitting based on income levels.
➔ Compare the information gain of both splits (age and income).
5. Choose the Best Split
➔ In this example, suppose the information gain for the income split is 0.025.
➔ Since 0.125 > 0.025, splitting by age leads to a greater reduction in
uncertainty.
➔ The higher the Information Gain (IG), the more an attribute effectively
reduces uncertainty about the target variable's outcome in a decision
tree model.
This is just the first step! The process continues iteratively for each branch,
considering further splits based on other features until a stopping criterion is
met (e.g., a minimum information gain threshold or a certain level of purity in
each leaf node).
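
As a sanity check on these numbers, here is a minimal Python sketch (not part of the original notes) that recomputes the entropy values and the information gain for the age split, using the same counts assumed in the example.

from math import log2

def entropy(counts):
    """Entropy of a group given its class counts, e.g. [bought, did_not_buy]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent = [40, 60]                  # 40 bought, 60 didn't
young, old = [30, 20], [10, 40]    # split by age: <30 and 30+

n = sum(parent)
weighted_children = (sum(young) / n) * entropy(young) + (sum(old) / n) * entropy(old)
info_gain_age = entropy(parent) - weighted_children

print(round(entropy(parent), 3))   # 0.971
print(round(entropy(young), 3))    # 0.971
print(round(entropy(old), 3))      # 0.722
print(round(info_gain_age, 3))     # 0.125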

Another example:

II. Gini index method:


The Gini index, or Gini impurity, is a measure used in decision trees to quantify
the impurity or disorder within a dataset:

Gini(S) = 1 - ((P1)^2 + (P2)^2 + ... + (Pn)^2)

Where:

● S is the dataset for which the Gini index is being calculated.
● n is the number of different classes (outcomes) in the dataset.
● Pi is the proportion (probability) of class i within the dataset S.

Example: To illustrate the use of the Gini index in building a decision tree,
consider the same marketing scenario involving a new fitness tracker and its
dataset of 100 past customers.
For simplicity, let's assume the dataset has been divided based on age into two
groups: "Under 30" and "30 and Over", and based on income into two groups:
"High" and "Low". We also have the purchase outcome for each group.
**Age Groups**:
- Under 30: 40 customers, 30 bought the tracker (1), 10 did not (0).
- 30 and Over: 60 customers, 20 bought the tracker (1), 40 did not (0).

**Income Groups**:
- High: 50 customers, 35 bought the tracker (1), 15 did not (0).
- Low: 50 customers, 15 bought the tracker (1), 35 did not (0).

1. Calculating Gini Index for Age Groups

First, we calculate the Gini index for splitting by age:

**Under 30**: Gini = 1 - ((30/40)^2 + (10/40)^2) = 1 - (0.5625 + 0.0625) = 0.375

**30 and Over**: Gini = 1 - ((20/60)^2 + (40/60)^2) = 1 - (0.111 + 0.444) ≈ 0.444

➔ Weighted Gini for Age: (40/100) * 0.375 + (60/100) * 0.444 ≈ 0.417
2. Calculating Gini Index for Income Groups


Next, we calculate the Gini index for splitting by income:

**High Income**: Gini = 1 - ((35/50)^2 + (15/50)^2) = 1 - (0.49 + 0.09) = 0.42

**Low Income**: Gini = 1 - ((15/50)^2 + (35/50)^2) = 1 - (0.09 + 0.49) = 0.42

**Weighted Gini for Income**: (50/100) * 0.42 + (50/100) * 0.42 = 0.42


Since the dataset is evenly split between high and low income, the weighted
Gini for income is simply the Gini of either group: 0.42.
3. Decision on the First Split
Comparing the weighted Gini indices for age and income:
- Weighted Gini for Age: 0.417
- Weighted Gini for Income: 0.42
➔ A lower Gini index indicates a better attribute for splitting the data.
This means that, according to our example, age is a slightly better predictor of
whether a customer will buy the fitness tracker than income.
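
As a quick check, here is a minimal Python sketch (not part of the original notes) that reproduces the two weighted Gini values from the group counts stated above.

def gini(counts):
    """Gini impurity of a group given its class counts, e.g. [bought, did_not_buy]."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    """Weighted Gini of a split: each group is weighted by its share of the data."""
    n = sum(sum(g) for g in groups)
    return sum((sum(g) / n) * gini(g) for g in groups)

# Age split: Under 30 = [30 bought, 10 not], 30 and Over = [20 bought, 40 not]
print(round(weighted_gini([[30, 10], [20, 40]]), 3))   # 0.417
# Income split: High = [35 bought, 15 not], Low = [15 bought, 35 not]
print(round(weighted_gini([[35, 15], [15, 35]]), 3))   # 0.42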

4. Building the Decision Tree

1. First Split on Age: Divide the dataset into "Under 30" and "30 and Over".
2. Further Splits: For each of these groups, we could further analyze the data (possibly considering
income or other attributes if available) to create additional splits, aiming to increase the purity of
the nodes.
3. Terminal Nodes (Leaves): Continue splitting until reaching a point where further splits do not
significantly increase purity or when a node has reached a minimum size.
Another method was also explained in class.


Fitting/Underfitting/Overfitting the training data

Underfitting in Decision Trees


Underfitting occurs when the model is too simple to capture the underlying
structure of the data. This usually happens when the tree is not deep enough,
leading to high bias and poor performance on both training and unseen data.

Characteristics of Underfitting:

● The decision tree has very few nodes.


● It makes very broad generalizations about the data.

Overfitting in Decision Trees


Overfitting occurs when the model is too complex, capturing noise in the
training data as if it were a real pattern. This leads to high variance and poor
generalization to new data. Overfitting is common in decision trees that are
allowed to grow without constraints, creating highly specific rules that apply
only to the training data.

Characteristics of Overfitting:

● The decision tree is very deep, with many nodes.


● It makes highly specific splits that reflect noise or outliers in the training
data.

Fitting Decision Trees Properly


The key to using decision trees effectively is to balance underfitting and
overfitting. This can be achieved through techniques such as:

● Pruning (removing parts of the tree that don't provide additional predictive power)
● Setting a maximum depth for the tree
● Requiring a minimum number of samples to split a node.
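
As an illustration, here is a minimal scikit-learn sketch; the notes do not name a specific library, so treating scikit-learn as the toolchain is an assumption, and the tiny dataset below is made up.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: columns are [age, income in thousands], labels are bought (1) / not (0)
X = np.array([[25, 60], [45, 80], [23, 30], [52, 40], [31, 75], [60, 20]])
y = np.array([1, 0, 1, 0, 1, 0])

tree = DecisionTreeClassifier(
    max_depth=3,           # cap tree depth so branches stay general
    min_samples_split=4,   # a node needs at least 4 samples before it may be split
    min_samples_leaf=2,    # every leaf must keep at least 2 samples
    ccp_alpha=0.01,        # cost-complexity pruning strength (post-pruning)
)
tree.fit(X, y)
print(tree.predict([[28, 55]]))   # predicted class for a new customer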

Pruning the decision tree


Problem: Decision trees can overfit the training data, leading to poor
performance on unseen data.

Solution: Pruning reduces the size and complexity of the tree, improving
generalization.

Techniques:

● Pre-pruning (Early stopping): Stop growing the tree early based on size
or data purity.
● Post-pruning: Grow a large tree, then remove suboptimal branches based
on error rates.
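
To make post-pruning concrete, here is a minimal scikit-learn sketch (again an assumed toolchain, with made-up data) of cost-complexity pruning: grow the full tree, inspect the candidate pruning strengths, and refit with a non-zero alpha to obtain a smaller tree.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data with some noise so the fully grown tree overfits a little
X = np.array([[25, 60], [45, 80], [23, 30], [52, 40], [31, 75], [60, 20], [29, 50], [48, 65]])
y = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Post-pruning: grow an unconstrained tree first, then compute its pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)   # candidate alphas; 0 means no pruning, larger values prune more

# Refit with a non-zero alpha: branches whose benefit falls below alpha are removed
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(full_tree.get_n_leaves(), pruned.get_n_leaves())   # leaf counts before vs. after pruning

In practice the alpha would be chosen by validation on held-out data rather than read directly off the path.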

Benefits:

● Improves generalization: less prone to overfitting.
● Simpler model: easier to interpret and faster to predict.

Trade-offs:

● Pre-pruning might miss important patterns.
● Post-pruning requires growing a large tree, consuming more resources.

Key Parameters and Stopping Rules:

● Max Depth: Limits tree depth to prevent overfitting.


● Minimum Observations per Node: Requires each node to contain at least a set
number of observations before it is split, reducing sensitivity to noise
(e.g., every node must contain at least x observations).
● Convergence Criteria: This criterion monitors how much the tree's
internal rules (like split conditions) are changing between training
iterations. If the changes are minimal for a set number of iterations, it
suggests the tree is no longer improving and might have converged.

Regression Decision Tree

A regression tree is a type of decision tree used in machine learning for
predicting continuous values.

Reference: Regression Trees, Clearly Explained!!! - YouTube

SSR method:


Choosing the best feature for splitting in a regression decision tree involves
finding the split that minimizes the variance (spread) of the target variable
(what you're trying to predict) within the resulting groups. The quantity being
minimized is the Sum of Squared Residuals (SSR).

1. Calculate SSR for All Possible Splits:


● Divide the data into two groups based on the chosen split condition.
● For each group:
a. Calculate the average target value
b. Find the squared difference between each data point's actual
target value and the average target value for its group (this
represents the squared residual).
c. Sum all these squared residuals for each group to get the total
SSR for that specific split.
● Repeat this for every candidate split, each of which produces its own two groups.
2. Choose the Split with Minimum SSR:
○ After calculating the SSR for every possible split across all features,
compare the SSR values for each split.
○ The feature-split combination that results in the lowest overall SSR
is chosen as the splitting point for the current node in the decision
tree. This split minimizes the variance of the target variable within
the resulting groups, leading to better predictions later.
Example:
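
As a stand-in illustration (the feature values and targets below are made up, not taken from the notes), here is a minimal Python sketch of computing the SSR for every candidate threshold on one numeric feature and picking the best split.

def ssr(values):
    """Sum of squared residuals of a group around its own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

# Made-up data: feature = hours of exercise per week, target = calories burned per day
feature = [1, 2, 3, 4, 5, 6]
target = [250, 300, 500, 550, 600, 620]

# Try a threshold between each pair of consecutive feature values, keep the lowest total SSR
best_threshold, best_ssr = None, None
for i in range(1, len(feature)):
    threshold = (feature[i - 1] + feature[i]) / 2
    left = [t for f, t in zip(feature, target) if f < threshold]
    right = [t for f, t in zip(feature, target) if f >= threshold]
    total = ssr(left) + ssr(right)
    if best_ssr is None or total < best_ssr:
        best_threshold, best_ssr = threshold, total

print(best_threshold, best_ssr)   # 2.5 9925.0 for this toy data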


Note : Sum of Squared Residuals (SSR): Measures how "spread out" the
target variable is within the resulting groups after a split. We want to
minimize SSR because the smaller the SSR, the more homogeneous
(similar) the target variable is within each group, leading to better
predictions.
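
To tie this together, here is a minimal scikit-learn sketch (an assumed toolchain and a recent version; the data are the same made-up numbers as in the sketch above) of fitting a regression tree whose splits minimize this squared-error criterion.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])    # hours of exercise per week
y = np.array([250, 300, 500, 550, 600, 620])    # calories burned per day

# "squared_error" chooses splits by minimizing the within-group SSR described above
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=2).fit(X, y)
print(reg.predict([[3.5]]))   # the prediction is the mean target of the leaf the input falls into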
