Decision Tree-31-01-2025
Duration: 55 min AIML Credit: 4 ML | AI3201
• The goal of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).
• In Decision Trees, to predict a class label for a record we start from the root of the tree. We compare the value of the root attribute with the record's corresponding attribute and, on the basis of the comparison, follow the branch for that value and jump to the next node.
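• A minimal Python sketch of this traversal (the node structure, attribute names, and the example tree below are hypothetical, for illustration only):

def predict(node, record):
    # Internal nodes are dicts that test one attribute; a leaf is a plain class label.
    while isinstance(node, dict):
        attribute = node["attribute"]       # attribute tested at this node
        value = record[attribute]           # the record's value for that attribute
        node = node["branches"][value]      # follow the branch matching the value
    return node                             # reached a leaf: return its class label

# Hypothetical tree and record.
tree = {"attribute": "Outlook",
        "branches": {"Sunny": {"attribute": "Humidity",
                               "branches": {"High": "No", "Normal": "Yes"}},
                     "Overcast": "Yes",
                     "Rain": "Yes"}}
print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))   # -> Yes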
• Types of decision trees are based on the type of target variable we have. It can be of two types:
1. Categorical Variable Decision Tree: a Decision Tree that has a categorical target variable is called a Categorical Variable Decision Tree.
2. Continuous Variable Decision Tree: a Decision Tree that has a continuous target variable is called a Continuous Variable Decision Tree.
• Example: Let's say we have a problem of predicting whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the income of customers is a significant variable, but the insurance company does not have income details for all customers. Since income is an important variable, we can build a decision tree to predict customer income based on occupation, product, and various other variables. In this case, we are predicting values for a continuous variable, so this is a Continuous Variable Decision Tree (whereas the yes/no renewal prediction itself would use a Categorical Variable Decision Tree).
Leaf / Terminal Node: Nodes that do not split further are called Leaf or Terminal nodes.
•Decision trees classify the examples by sorting them down the tree from the root to
some leaf/terminal node, with the leaf/terminal node providing the classification of
the example.
•Each node in the tree acts as a test case for some attribute, and each edge
descending from the node corresponds to the possible answers to the test case. This
process is recursive in nature and is repeated for every subtree rooted at the new
node.
Assumptions while creating Decision Tree
•Below are some of the assumptions we make while using a Decision Tree:
•In the beginning, the whole training set is considered as the root.
•Feature values are preferred to be categorical. If the values are continuous then they are
discretized prior to building the model.
•Records are distributed recursively on the basis of attribute values.
•The order of placing attributes as the root or as internal nodes of the tree is decided using a statistical approach.
•Decision Trees follow a Sum of Products (SOP) representation, also known as Disjunctive Normal Form (DNF). For a class, every branch from the root of the tree to a leaf node having that class is a conjunction (product) of attribute values, and the different branches ending in that class form a disjunction (sum); a small illustrative rule is shown after this list.
•The primary challenge in decision tree implementation is to identify which attribute to consider as the root node and at each level. Handling this is known as attribute selection. We have different attribute selection measures to identify the attribute that can be considered as the root node at each level.
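• As a small illustration of the SOP / DNF view (the attributes here are hypothetical, in the style of the classic play-tennis example): each root-to-leaf path is a conjunction (AND) of attribute tests, and all paths ending in the same class are joined by a disjunction (OR), e.g.

(Outlook = Overcast) \lor (Outlook = Sunny \land Humidity = Normal) \Rightarrow Play = Yes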
How do Decision Trees work?
•The decision of making strategic splits heavily affects a tree’s accuracy. The decision
criteria are different for classification and regression trees.
•Decision trees use multiple algorithms to decide to split a node into two or more sub-
nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In
other words, we can say that the purity of the node increases with respect to the target
variable. The decision tree splits the nodes on all available variables and then selects the
split which results in the most homogeneous sub-nodes.
•The algorithm selection is also based on the type of target variable. Let us look at some
algorithms used in Decision Trees:
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (multivariate adaptive regression splines)
The ID3 algorithm builds decision trees using a top-down greedy search approach through
the space of possible branches with no backtracking. A greedy algorithm, as the name
suggests, always makes the choice that seems to be the best at that moment.
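• A minimal Python sketch of this greedy, top-down choice, assuming records are given as (attribute-dictionary, label) pairs (the data layout and the toy rows are assumptions for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute):
    # Entropy before the split minus the weighted entropy of the resulting subsets.
    before = entropy([label for _, label in rows])
    after = 0.0
    for value in {features[attribute] for features, _ in rows}:
        subset = [label for features, label in rows if features[attribute] == value]
        after += (len(subset) / len(rows)) * entropy(subset)
    return before - after

def best_attribute(rows, attributes):
    # The greedy step: pick the locally best split, with no backtracking.
    return max(attributes, key=lambda a: information_gain(rows, a))

# Hypothetical toy records.
rows = [({"Outlook": "Sunny", "Windy": "No"}, "No"),
        ({"Outlook": "Sunny", "Windy": "Yes"}, "No"),
        ({"Outlook": "Overcast", "Windy": "No"}, "Yes"),
        ({"Outlook": "Rain", "Windy": "No"}, "Yes")]
print(best_attribute(rows, ["Outlook", "Windy"]))   # -> Outlook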
Attribute Selection Measures
• If the dataset consists of N attributes then deciding which attribute to place at the root
or at different levels of the tree as internal nodes is a complicated step. Just randomly selecting any node to be the root can't solve the issue; if we follow a random approach, it may give us bad results with low accuracy.
• For solving this attribute selection problem, researchers worked and devised some solutions. They suggested using criteria like:
• Entropy,
• Information gain,
• Gini index,
• Gain Ratio,
• Reduction in Variance
• Chi-Square
• These criteria calculate a value for every attribute. The values are sorted, and attributes are placed in the tree following that order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root.
• While using Information Gain as a criterion, we assume attributes to be categorical,
and for the Gini index, attributes are assumed to be continuous.
Entropy
• Entropy is a measure of the randomness in the information being processed. The higher the
entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an
example of an action that provides information that is random.
• For a binary variable, the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because that reflects perfect randomness in the data and there is no chance of perfectly determining the outcome.
• ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy more than zero needs further splitting.
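• In its standard form (for a set S with c classes and class proportions p_i), the entropy referred to above is:

Entropy(S) = - \sum_{i=1}^{c} p_i \log_2 p_i

For a binary variable this reduces to H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p), which is 0 at p = 0 or p = 1 and reaches its maximum of 1 bit at p = 0.5.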
Information Gain
• Information gain or IG is a statistical property that measures how well a given
attribute separates the training examples according to their target classification.
Constructing a decision tree is all about finding an attribute that returns the highest
information gain and the smallest entropy.
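• In the standard weighted form (where each subset's entropy is weighted by its share of the records, w_j = |subset j| / |before|), the gain of a split is:

Gain(before, after) = Entropy(before) - \sum_{j=1}^{K} w_j \, Entropy(j, after)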
• Where “before” is the dataset before the split, K is the number of subsets generated by the
split, and (j, after) is subset j after the split.
Gini Index
• You can understand the Gini index as a cost function used to evaluate splits in the
dataset. It is calculated by subtracting the sum of the squared probabilities of each
class from one. It favours larger partitions and is easy to implement, whereas information gain favours smaller partitions with distinct values.
• Gini Index works with the categorical target variable “Success” or “Failure”. It
performs only Binary splits.
• A higher value of the Gini index implies higher inequality and higher heterogeneity.
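• In its standard form (for c classes with proportions p_i), the Gini index is:

Gini = 1 - \sum_{i=1}^{c} p_i^2

For a binary "Success"/"Failure" node with proportions p and q this is 1 - (p^2 + q^2); the steps below work with the complementary node score p^2 + q^2, where a higher score means a purer node.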
1. Calculate the Gini score for the sub-nodes from the probabilities of success (p) and failure (q): p² + q².
2. Calculate the Gini score for the split using the weighted Gini score of each node of that split; a short sketch of both steps is given below.
• CART (Classification and Regression Tree) uses the Gini index method to
create split points.
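• A minimal Python sketch of the two steps above, assuming a binary "Success"/"Failure" target (the labels and the candidate split below are hypothetical):

def gini_node(labels):
    # Step 1: node score p^2 + q^2 from the proportions of success (p) and failure (q).
    n = len(labels)
    p = labels.count("Success") / n
    q = labels.count("Failure") / n
    return p * p + q * q

def gini_split(left_labels, right_labels):
    # Step 2: weight each sub-node's score by its share of the records.
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_node(left_labels) \
         + (len(right_labels) / n) * gini_node(right_labels)

# Hypothetical candidate split; CART would prefer the candidate split with the higher
# weighted score (equivalently, the lower Gini impurity 1 - score).
left = ["Success", "Success", "Failure"]
right = ["Failure", "Failure", "Failure", "Success"]
print(round(gini_split(left, right), 3))   # 0.595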
Gain ratio
• Information gain is biased towards choosing attributes with a large number of
values as root nodes. It means it prefers the attribute with a large number of
distinct values.
• C4.5, an improvement of ID3, uses Gain ratio which is a modification of
Information gain that reduces its bias and is usually the best option. Gain ratio
overcomes the problem with information gain by taking into account the number
of branches that would result before making the split. It corrects information gain
by taking the intrinsic information of a split into account.
• Consider a dataset of users and their movie genre preferences based on variables like gender, age group, rating, and so on. With the help of information gain, you split at ‘Gender’ (assuming it has the highest information gain); now the variables ‘Age Group’ and ‘Rating’ could be equally important, and the gain ratio penalizes the variable with more distinct values, which helps us decide the split at the next level.
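• In its standard C4.5 form, with w_j = |subset j| / |before| as before, the gain ratio divides the information gain by the split information (the entropy of the split itself):

GainRatio(before, after) = \frac{Entropy(before) - \sum_{j=1}^{K} w_j \, Entropy(j, after)}{- \sum_{j=1}^{K} w_j \log_2 w_j}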
• Where “before” is the dataset before the split, K is the number of subsets
generated by the split, and (j, after) is subset j after the split.
Reduction in Variance
• Reduction in Variance is used when the target variable is continuous (regression trees): the candidate split whose sub-nodes have the lowest weighted variance is selected. In the variance formula below, X̄ (X-bar) is the mean of the values, X is an actual (individual) value, and n is the number of values.
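• The variance of a node, in the standard form:

Variance = \frac{\sum (X - \bar{X})^2}{n}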
Chi-Square
• The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is
one of the oldest tree classification methods. It finds out the statistical significance of the differences between sub-nodes and the parent node. We
measure it by the sum of squares of standardized differences between observed
and expected frequencies of the target variable.
• It works with the categorical target variable “Success” or “Failure”. It can perform two or more splits. The higher the value of Chi-Square, the higher the statistical significance of the differences between the sub-node and the parent node.
• It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
• Mathematically, Chi-squared is represented as:
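\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}

(the standard Pearson form, summing over the classes of the target variable in each sub-node)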
• In the pruning example, the ‘Age’ attribute on the left-hand side of the tree has been pruned because it carries more importance on the right-hand side of the tree, hence removing overfitting.
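• As an illustrative, library-level sketch of pruning (not part of the original slides): scikit-learn's cost-complexity pruning parameter ccp_alpha cuts back branches that contribute little, trading a bit of training fit for better generalization. The dataset and the alpha value here are arbitrary choices for demonstration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree versus a cost-complexity pruned tree.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("unpruned test accuracy:", unpruned.score(X_test, y_test))
print("pruned   test accuracy:", pruned.score(X_test, y_test))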
Random Forest
• Random Forest is an example of ensemble learning, in which we combine
multiple machine learning algorithms to obtain better predictive performance.
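• A minimal scikit-learn sketch of this idea (the dataset and parameters are illustrative assumptions, not from the slides): many decision trees are trained on bootstrap samples, each split considers a random subset of features, and the trees vote on the final class.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# An ensemble of 100 trees; each split looks at a random sqrt-sized subset of features.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())   # mean cross-validated accuracy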
Bagging
• A technique known as bagging is used to create an ensemble of trees, where multiple training sets are generated by sampling with replacement.
• In the bagging technique, N bootstrap samples are drawn from the data set using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each sample. Finally, the resulting predictions are combined in parallel using voting (for classification) or averaging (for regression).
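• A minimal scikit-learn sketch of bagging (illustrative only; by default BaggingClassifier uses decision trees as the base models):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# 50 bootstrap samples drawn with replacement; one tree per sample (the default
# base estimator is a decision tree); predictions are combined by majority vote.
X, y = load_iris(return_X_y=True)
bagged = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
print(cross_val_score(bagged, X, y, cv=5).mean())   # mean cross-validated accuracy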