Breaking Down Decision Tree Algorithm
Decision Tree
Hemant Thapa
Importing Libraries
In [1]: import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
Decision Trees consist of a tree-like structure where each internal node represents a decision rule based on one of the data features, and each
leaf node corresponds to the predicted outcome or class. The process of constructing a Decision Tree involves selecting the best features to
split the data and optimizing the decision rules at each node to maximize predictive accuracy.
1. Root Node:
The root node is the topmost node in a decision tree.
It represents the entire dataset or a subset of the data at the beginning of the tree-building process.
The root node serves as the starting point for the tree's construction.
At the root node, a decision is made to split the data into subsets based on the values of a specific feature (attribute). This feature is
chosen because it results in the best separation of data according to a certain criterion (e.g., Gini impurity, information gain, variance
reduction).
It's like the most important question we ask when trying to make a decision. The root node helps us decide how to divide our data into
smaller groups (subsets).
Question - What's the best way to divide our data into more manageable groups?
Answer 1 - We choose the feature (attribute) that provides the most useful information for making decisions.
Answer 2 - This feature is selected based on the highest Information Gain or Gini Gain, which helps us make the best possible split.
2. Decision Nodes:
Decision nodes are the internal nodes of the decision tree, situated between the root node and the terminal nodes.
Each decision node represents a specific decision or condition based on a feature's value.
These nodes serve as points where the dataset is split into subsets according to the decision or condition.
The decision node's role is to determine which branch of the tree to follow based on whether the condition is true or false for a particular
data point.
Decision nodes are like the middle managers in our decision-making process. They help us decide whether we need to split our data into
more detailed groups or not. Decision nodes act as traffic controllers for data, guiding it down different paths.
Gini Impurity:
Gini impurity is a measure of the disorder or impurity within a dataset.
In the context of a decision tree, it quantifies the probability of incorrectly classifying a randomly chosen element from the dataset.
A Gini impurity of 0 indicates that the dataset is perfectly pure, meaning all elements belong to the same class. The impurity is highest
when elements are evenly distributed among the classes (0.5 for two classes, approaching 1 as the number of classes grows).
When deciding how to split a dataset at a decision node, the algorithm calculates the Gini impurity for each possible split and selects the
split that minimises the impurity, resulting in a more homogeneous subset.
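To make this concrete, here is a small sketch (not part of the original notebook; the helper name gini_impurity is ours) that computes the Gini impurity of a list of class labels:
from collections import Counter

def gini_impurity(labels):
    #Gini impurity: 1 minus the sum of squared class probabilities
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["Yes", "Yes", "Yes", "Yes"]))   #0.0 -> perfectly pure
print(gini_impurity(["Yes", "Yes", "No", "No"]))     #0.5 -> as impure as a two-class set can be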
Information Gain:
Information gain measures the reduction in uncertainty or entropy achieved by a particular split in the data.
Entropy is a concept from information theory that quantifies the disorder or randomness in a dataset. High entropy indicates high
disorder, while low entropy indicates order or purity.
Information gain calculates the difference in entropy before and after a split. A high information gain means that the split results in a
more orderly separation of data.
Decision tree algorithms aim to maximise information gain when choosing which feature to split on. They select the feature that leads to
the greatest reduction in uncertainty or entropy within the child nodes.
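The following sketch (again not from the original notebook; the helpers entropy and information_gain are ours) shows how information gain compares a parent node with the subsets produced by a candidate split:
import math
from collections import Counter

def entropy(labels):
    #Shannon entropy (in bits) of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    #entropy of the parent minus the size-weighted entropy of the child subsets
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_groups)
    return entropy(parent_labels) - weighted

parent = ["Yes", "Yes", "Yes", "No", "No"]
print(information_gain(parent, [["Yes", "Yes", "Yes"], ["No", "No"]]))   #~0.971: a perfect split removes all uncertainty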
5. Advantages:
Interpretability: Decision Trees offer a clear and interpretable model, making it easy to understand why a particular decision was made.
This transparency is crucial for applications where understanding the reasoning behind predictions is essential, such as in healthcare and
finance.
Versatility: Decision Trees can handle both categorical and numerical data, as well as mixed data types. They are robust against outliers
and missing values, making them suitable for real-world datasets.
Feature Importance: Decision Trees provide a measure of feature importance, which helps identify which features have the most
significant impact on predictions. This information can guide feature selection and engineering efforts.
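As an illustration, scikit-learn exposes this measure through the feature_importances_ attribute of a fitted tree; the sketch below uses the bundled iris dataset purely as an example, not data from this notebook:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
iris_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

#impurity-based importance of each feature; the values sum to 1
for name, importance in zip(iris.feature_names, iris_tree.feature_importances_):
    print(f"{name}: {importance:.3f}")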
6. Challenges:
Overfitting: Decision Trees are prone to overfitting, where the model captures noise in the training data, leading to poor generalisation on
unseen data. Techniques like pruning and setting a maximum depth are used to mitigate overfitting (see the sketch after this list).
Instability: Small changes in the data can result in different tree structures, making Decision Trees somewhat unstable. Ensemble methods
like Random Forest and Gradient Boosting address this issue by combining multiple trees to improve stability and predictive performance.
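As a rough sketch of the pruning and depth-limiting mitigations mentioned above (the dataset, max_depth and ccp_alpha values here are illustrative, not taken from this notebook), a constrained tree can be compared with an unconstrained one:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

#the unconstrained tree typically fits the training data almost perfectly,
#while the depth-limited, pruned tree usually generalises better to the test set
print("deep  :", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("pruned:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))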
ID3 is one of the earliest decision tree algorithms developed by Ross Quinlan. It is primarily used for classification tasks.
ID3 uses the entropy and information gain concepts to determine the best features for splitting the data at each node.
It builds a tree by recursively selecting attributes that result in the most significant information gain.
CART is a versatile decision tree algorithm that can be used for both classification and regression problems.
Instead of entropy, CART uses Gini impurity to evaluate the quality of splits in classification tasks. For regression, it assesses the mean
squared error reduction.
CART constructs binary trees, meaning that each internal node has exactly two child nodes.
CHAID is a decision tree algorithm primarily used for categorical target variables and categorical predictors.
It employs statistical tests like the Chi-Square test to determine the best splits based on the significance of relationships between
variables.
CHAID is particularly useful for exploring interactions between categorical variables.
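CHAID itself is not available in scikit-learn; as an illustration of the underlying idea (with toy data invented for this sketch), scipy's chi2_contingency can score how strongly a categorical predictor relates to the target:
import pandas as pd
from scipy.stats import chi2_contingency

#toy data: does the destination type relate to whether customers enjoyed the trip?
trips = pd.DataFrame({
    "destination": ["Beach", "Beach", "Mountains", "Mountains", "Beach", "Mountains"],
    "enjoyed": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
})

table = pd.crosstab(trips["destination"], trips["enjoyed"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")   #smaller p-values suggest a stronger candidate split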
Reduction in variance is a splitting criterion specific to regression problems and is often used with tree algorithms such as CART.
Instead of impurity measures, it focuses on reducing the variance of the target variable within each split.
In each node, it aims to find a split that results in the smallest total variance within the child nodes, leading to a more accurate regression
model.
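A minimal sketch of this criterion (the helper name variance_reduction and the example values are ours):
import numpy as np

def variance_reduction(parent, left, right):
    #variance of the parent node minus the size-weighted variance of the two children
    parent, left, right = np.asarray(parent), np.asarray(left), np.asarray(right)
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(parent)
    return parent.var() - weighted

#splitting at a good threshold leaves children whose target values vary very little
prices = [100, 105, 110, 300, 310, 320]
print(variance_reduction(prices, prices[:3], prices[3:]))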
8.1 Homogeneity
Homogeneity is the quality or state of being uniform, consistent, or similar in nature. In various contexts, homogeneity implies that the
elements or components within a group or system are alike or exhibit a high degree of similarity with respect to a particular characteristic
or property.
Homogeneity of variance refers to the assumption that the variances (the spread or dispersion of data) are approximately equal across
different groups or samples being compared. Violations of this assumption can impact the validity of statistical tests.
We can calculate the variance or standard deviation of the data points within each group. If the variances are similar or within an
acceptable range, it suggests homogeneity. You may also use graphical methods such as histograms or box plots to visualise the
distribution of data within each group.
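A minimal sketch of such a check, using illustrative group data:
import pandas as pd
import matplotlib.pyplot as plt

groups = pd.DataFrame({
    "group_a": [12.1, 11.8, 12.4, 12.0, 11.9],
    "group_b": [12.3, 12.7, 11.9, 12.2, 12.5],
})

#similar sample variances (and similar box plots) suggest homogeneity of variance
print(groups.var(ddof=1).round(3))
groups.plot(kind="box")
plt.show()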
Entropy is used as a measure of impurity or disorder within a dataset. It helps decide how to split data at decision nodes to create more
homogeneous subsets.
The function of entropy is to guide the construction of decision trees by identifying features that result in the greatest reduction of
uncertainty (entropy) in the child nodes after a split. It helps select the most informative features for classification or regression tasks.
Entropy(S)
The more uncertain or unpredictable the outcomes, the higher the entropy. In general, for classes with probabilities $p_i$:
$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$
For a dataset $S$ containing $p$ positive and $n$ negative examples this becomes:
$$\text{Entropy}(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$
Dataset S:
First Term:
$-\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right)$
$\frac{p}{p+n}$ is the probability of an element being in the positive class.
$\log_2\!\left(\frac{p}{p+n}\right)$ calculates the information (in bits) for this class.
The whole term represents the contribution of the positive class to the entropy of the dataset.
Second Term:
$-\frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$
$\frac{n}{p+n}$ is the probability of an element being in the negative class.
This term represents the contribution of the negative class to the entropy.
Overall Entropy:
The sum of these two terms gives the total entropy of the dataset, measuring its impurity or disorder.
Example:
In [2]: #counts of positive and negative examples (not shown in the original cell; chosen to match the worked example below)
positive = 30
negative = 10
In [3]: #probabilities
p_prob = positive / (positive + negative)
n_prob = negative / (positive + negative)
In [4]: #entropy
entropy = (-p_prob * math.log2(p_prob)) - (n_prob * math.log2(n_prob))
print("Entropy:", entropy)
Entropy: 0.8112781244591328
$$\text{Entropy}(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$
1. Calculate Probabilities:
$$\frac{p}{p+n} = \frac{30}{30+10} = \frac{30}{40} = 0.75$$
$$\frac{n}{p+n} = \frac{10}{30+10} = \frac{10}{40} = 0.25$$
2. Calculate Entropy(S):
$$\text{Entropy}(S) = -\frac{30}{40}\log_2\!\left(\frac{30}{40}\right) - \frac{10}{40}\log_2\!\left(\frac{10}{40}\right)$$
$$\log_2(0.75) \approx -0.415, \qquad \log_2(0.25) = -2$$
$$\text{Entropy}(S) \approx 0.75 \times 0.415 + 0.25 \times 2 = 0.31125 + 0.5 = 0.81125$$
Example 2
Our dataset consists of three attributes: A1 represents the type of destination (e.g., Beach or Mountains, encoded here simply as Red and Blue), A2 represents the average temperature
of the destination (Hot or Cold), and Class indicates whether the customer enjoyed their vacation, with values Yes (enjoyed) or No (did not
enjoy).
The data has three columns: A1, A2, and Class. Each column represents a distinct feature with categorical values.
A1: This column contains categorical data with possible values Red and Blue, standing in for two types of vacation destinations.
A2: This column represents temperature categories with possible values Hot and Cold. It tells us the temperature climate of the vacation
destination.
Class: The target variable, indicating a binary outcome with possible values Yes and No. It indicates whether the customers enjoyed their
vacation experience.
In [5]: dataset = {
"A1": ["Red", "Red", "Blue", "Blue", "Red"],
"A2": ["Hot", "Cold", "Hot", "Cold", "Hot"],
"Class": ["Yes", "No", "Yes", "Yes", "No"]
}
In [6]: df = pd.DataFrame(dataset)
In [9]: print(df)
A1 A2 Class
0 Red Hot Yes
1 Red Cold No
2 Blue Hot Yes
3 Blue Cold Yes
4 Red Hot No
Class Entropy
We count how many times each outcome appears in the Class category of our dataset.
We find that Yes appears 3 times and No appears 2 times.
We then add these counts together to get the total number of entries in the Class category, which in this case is 5 (3 Yes and 2 No).
The concept of entropy is a way to measure how mixed or uncertain the Class category is.
High entropy means that the outcomes are very mixed (like having an equal number of Yes and No), indicating more unpredictability.
Low entropy means that the outcomes are not very mixed (like having mostly Yes and very few No), indicating less unpredictability.
An entropy value of 0.9710 is relatively high, suggesting a significant level of mixture or diversity in the outcomes. It implies that the Class
category contains a fairly balanced mix of Yes and No outcomes, though not perfectly balanced. If it were perfectly balanced, the entropy would
be exactly 1.
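The notebook cell that produces this 0.9710 value is not shown above; the following is a minimal sketch consistent with the 3 Yes / 2 No counts, reusing the df and imports defined earlier (the variable names are ours):
#class entropy from the value counts of the target column
counts = df["Class"].value_counts()              #Yes: 3, No: 2
probabilities = counts / counts.sum()
class_entropy = -(probabilities * np.log2(probabilities)).sum()
print("Class entropy:", class_entropy)           #approximately 0.9710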
A1 Red Entropy
For A1 = Red, Yes appears once and No appears twice, so the entropy is about 0.9183, indicating a high level of uncertainty.
A1 Blue Entropy
Since both Yes and No occur equally (1 time each) for Blue, the entropy is expected to be at its maximum.
An entropy of 1.0 for A1 BLUE indicates that the outcomes are perfectly balanced and hence highly unpredictable.
$$I(\text{Attribute}) = \sum_i \frac{p_i + n_i}{p + n}\,\text{Entropy}(A_i)$$
print(i_a1)
0.9509775004326937
By taking this weighted average entropy, we get a sense of how much A1 as a whole contributes to predicting the Class.
A value close to 1 suggests a relatively high level of uncertainty or diversity in the Class outcomes as explained by the A1 feature
alone.
A2 Hot Entropy
A2 Cold Entropy
The entropy for the category Hot in A2 is 0.9183, which is relatively high, while the entropy for the category Cold in A2 is 0, indicating no
uncertainty at all.
This high entropy value suggests that when A2 is Hot, there's a significant level of uncertainty in predicting whether the class will be Yes or
No. In other words, Hot does not strongly lean towards a single Class outcome, as it is associated with Yes twice and No once.
This zero entropy value implies that when A2 is Cold, the class outcome is completely predictable. In this dataset, Cold is always
associated with Yes, and there are no instances of No for Cold. Knowing that A2 is Cold gives a clear indication that the class will be Yes.
$$I(\text{Attribute}) = \sum_i \frac{p_i + n_i}{p + n}\,\text{Entropy}(A_i)$$
print(i_a2)
0.5509775004326937
gain_a1 is 0.01997309402197489, indicating how much considering A1 (the destination attribute) reduces the uncertainty about customer satisfaction.
gain_a2 is 0.4199730940219749, indicating how much considering A2 (temperature) reduces the uncertainty about customer satisfaction.
Out[22]: A1 A2 Class
1 Red Cold No
4 Red Hot No
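The cells that encode the categorical columns are not shown above; the following is a minimal sketch of one possible encoding that would produce df_encoded and the feature matrix X used below (the exact encoding in the original notebook may differ):
#map each binary category to 0/1 so scikit-learn can work with the data
df_encoded = df.copy()
df_encoded["A1"] = df_encoded["A1"].map({"Red": 0, "Blue": 1})
df_encoded["A2"] = df_encoded["A2"].map({"Hot": 0, "Cold": 1})
df_encoded["Class"] = df_encoded["Class"].map({"No": 0, "Yes": 1})

X = df_encoded[["A1", "A2"]].values   #feature matrix; the target y is extracted below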
In [24]: df_encoded
Feature Selection
In [26]: y = df_encoded['Class'].values
y
In [27]: #80 percent for training and 20 percent for test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
In [28]: X_train
In [29]: X_test
Training Model
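The training cell itself is not shown above; the following is a minimal sketch consistent with the Out[30] representation below, using scikit-learn's default settings:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()   #default settings: Gini criterion, no depth limit
clf.fit(X_train, y_train)
clf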
Out[30]: DecisionTreeClassifier()
Accuracy of Model
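The evaluation cell is not shown above; the following is a minimal sketch using scikit-learn's accuracy_score on the held-out test set (its output is not reproduced here):
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))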
#inspect the structure of the fitted tree exposed by clf.tree_
n_nodes, impurity = clf.tree_.node_count, clf.tree_.impurity
children_left, children_right = clf.tree_.children_left, clf.tree_.children_right
print("Node count:", n_nodes)
for i in range(n_nodes):
    if children_left[i] != children_right[i]:  #internal (split) nodes only
        print(f"Node {i} impurity: {impurity[i]}")
Node count: 5
Node 0 impurity: 0.5
Node 1 impurity: 0.4444444444444444
In [35]: df
Out[35]: A1 A2 Class
1 Red Cold No
4 Red Hot No
In [40]: print(entropies)