
ML

LECTURE-12
BY
Dr. Ramesh Kumar Thakur
Assistant Professor (II)
School Of Computer Engineering
v Decision Tree is a Supervised learning technique that can be used for both classification and
regression problems, but it is mostly preferred for solving classification problems (a short usage sketch follows this list).
v It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
v In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are
used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those
decisions and do not contain any further branches.
v The decisions or the test are performed on the basis of features of the given dataset.
v It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
v It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further
branches and constructs a tree-like structure.
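
v As a rough illustration of how such a classifier is used in practice, here is a minimal sketch with scikit-learn (the choice of library is our own assumption; the tiny feature matrix and labels below are invented for illustration and only mirror the Fever/Cough/Breathing Issues example used later in this lecture):

```python
# A minimal sketch, assuming scikit-learn is installed; the toy rows below are made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [Fever, Cough, Breathing Issues], encoded as 1 = YES, 0 = NO
X = [[1, 0, 1],
     [0, 1, 0],
     [1, 1, 1],
     [0, 0, 0]]
y = ["Infected", "Not Infected", "Infected", "Not Infected"]  # target classes

# criterion="entropy" makes the splits use information gain, in the spirit of ID3
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

# Internal nodes test features, branches are decision rules, leaves are outcomes
print(export_text(clf, feature_names=["Fever", "Cough", "Breathing Issues"]))
print(clf.predict([[1, 1, 0]]))  # classify a new, unseen row
```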

v Reasons for using the Decision tree:


v Decision Trees usually mimic the human thinking process while making a decision, so they are easy to understand.
v The logic behind the decision tree can be easily understood because it shows a tree-like structure.
v Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
v Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be split further after reaching a
leaf node.
v Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
v Branch/Sub-tree: A subtree formed by splitting a node of the tree.
v Pruning: Pruning is the process of removing the unwanted branches from the tree.
v Parent/Child node: A node that is divided into sub-nodes is known as a parent node, and the sub-
nodes emerging from it are referred to as child nodes. The parent node represents a decision or
condition, while the child nodes represent the potential outcomes or further decisions based on that
condition.
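
v To make the terms above concrete, a tiny hypothetical tree can be written as a nested dictionary (this representation is our own choice, anticipating the Fever/Breathing Issues example used later in this lecture):

```python
# Root and decision (parent) nodes are dict keys, branches are their values,
# and plain strings are leaf (child/output) nodes.
tree = {
    "Breathing Issues": {                # root node
        "YES": {                         # branch -> child decision node
            "Fever": {
                "YES": "Infected",       # leaf node
                "NO": "Not Infected",    # leaf node
            }
        },
        "NO": "Not Infected",            # branch -> leaf node
    }
}

def classify(node, row):
    """Walk from the root, following the branch that matches the row's feature value."""
    while isinstance(node, dict):
        feature = next(iter(node))           # the feature tested at this decision node
        node = node[feature][row[feature]]   # follow the matching branch
    return node                              # a leaf: the final output

print(classify(tree, {"Breathing Issues": "YES", "Fever": "NO"}))  # Not Infected
```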
v ID3 stands for Iterative Dichotomiser 3 and is so named because the algorithm iteratively
(repeatedly) dichotomizes (divides) features into two or more groups at each step.

v Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree.

v In simple words, the top-down approach means that we start building the tree from the top and the
greedy approach means that at each iteration we select the best feature at the present moment to
create a node.

v ID3 is generally used only for classification problems with nominal (categorical) features.
v ID3 algorithm selects the best feature at each step while building a Decision tree.

v So the answer to the question: ‘How does ID3 select the best feature?’ is that ID3 uses Information Gain
or just Gain to find the best feature.

v Information Gain calculates the reduction in the entropy and measures how well a given feature
separates or classifies the target classes.

v The feature with the highest Information Gain is selected as the best one.

v In simple words, Entropy is the measure of disorder and the Entropy of a dataset is the measure of
disorder in the target feature of the dataset.

v In the case of binary classification (where the target column has only two classes), entropy is 0 if
all values in the target column are homogeneous (the same) and 1 if the target column has an equal
number of rows for each class.
v Denoting our dataset as S, entropy is calculated as:
Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
v where,
v n is the total number of classes in the target column (in our case n = 2, i.e. YES and NO)
v pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the target column” to the
“total number of rows” in the dataset.
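
v A small Python helper mirroring this formula (the function name and the counts-based interface are our own; a sketch, not part of the lecture):

```python
from math import log2

def entropy(class_counts):
    """Entropy(S) = - sum(p_i * log2(p_i)) over the classes of the target column."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:                 # 0 * log2(0) is treated as 0
            p = count / total         # p_i = (rows with class i) / (total rows)
            result -= p * log2(p)
    return result

# Binary target: homogeneous column -> entropy 0, perfectly balanced column -> entropy 1
print(entropy([14, 0]), entropy([7, 7]))  # 0.0 1.0
```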

v Information Gain for a feature column A is calculated as:


IG(S, A) = Entropy(S) - ∑ᵥ ((|Sᵥ| / |S|) * Entropy(Sᵥ))
v where the sum runs over the values v taken by feature A, Sᵥ is the set of rows in S for which the feature
column A has value v, |Sᵥ| is the number of rows in Sᵥ, and likewise |S| is the number of rows in S.
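
v Building on the entropy helper sketched above, Information Gain can be written as follows (the list-of-dicts representation of S is our own assumption):

```python
from collections import Counter

def information_gain(rows, feature, target):
    """IG(S, A) = Entropy(S) - sum over values v of A of (|S_v| / |S|) * Entropy(S_v).

    `rows` is a list of dicts, e.g. {"Fever": "YES", ..., "Infected": "NO"};
    `entropy` is the helper defined in the previous sketch.
    """
    total = len(rows)
    parent = entropy(list(Counter(r[target] for r in rows).values()))   # Entropy(S)

    weighted = 0.0
    for value in set(r[feature] for r in rows):
        subset = [r for r in rows if r[feature] == value]               # S_v
        counts = list(Counter(r[target] for r in subset).values())      # class counts in S_v
        weighted += len(subset) / total * entropy(counts)               # (|S_v|/|S|) * Entropy(S_v)

    return parent - weighted
```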
I. Calculate the Information Gain of each feature.

II. Considering that all rows don’t belong to the same class, split the dataset S into subsets using the feature
for which the Information Gain is maximum.

III. Make a decision tree node using the feature with the maximum Information gain.

IV. If all or most of the rows belong to the same class, make the current node a leaf node with that class as
its label.

V. Repeat these steps for the remaining features until we run out of features or every branch of the decision tree ends in a leaf node (a recursive sketch of these steps is shown below).
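
v A compact recursive sketch that strings steps I–V together, reusing the entropy and information_gain helpers above (the nested-dict tree format is our own, not a standard API):

```python
from collections import Counter

def id3(rows, features, target):
    """Return a decision tree as nested dicts, built by the steps above."""
    classes = [r[target] for r in rows]
    majority = Counter(classes).most_common(1)[0][0]

    # Step IV: all rows share one class, or no features remain -> leaf node
    if len(set(classes)) == 1 or not features:
        return majority

    # Steps I and III: pick the feature with maximum Information Gain as the node
    best = max(features, key=lambda f: information_gain(rows, f, target))
    tree = {best: {}}

    # Steps II and V: split S on the chosen feature and recurse on each subset
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [f for f in features if f != best], target)

    return tree
```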
v The first step is to find the best feature i.e. the one that has the maximum Information Gain(IG).

v We’ll calculate the IG for each of the features now, but for that, we first need to calculate the entropy of S.

v From the total of 14 rows in our dataset S, there are 8 rows with the target value YES and 6 rows with the
target value NO. The entropy of S is calculated as:

Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99

v Note: If all the values in our target column are the same, the entropy will be zero (meaning that it has no
randomness).
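
v The same number can be checked directly in Python (a quick verification, not part of the lecture slides):

```python
from math import log2

# Dataset S: 8 rows with target YES and 6 rows with target NO, out of 14 rows
entropy_S = -(8/14) * log2(8/14) - (6/14) * log2(6/14)
print(round(entropy_S, 2))  # 0.99
```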

v We now calculate the Information Gain for each feature.


v IG calculation for Fever:
v In this (Fever) feature there are 8 rows having value YES and 6 rows having value NO.
v In the 8 rows with YES for Fever, there are 6 rows having target value YES and 2 rows having target
value NO.
v In the 6 rows with NO, there are 2 rows having target value YES and 4 rows having target value NO.
v |S| = 14
v For v = YES, |Sᵥ| = 8
v Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81
v For v = NO, |Sᵥ| = 6
v Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.92

v Expanding the summation in the IG formula:


v IG(S, Fever) = Entropy(S) - (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) - (|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)
v ∴ IG(S, Fever) = 0.99 - (8/14) * 0.81 - (6/14) * 0.92 = 0.13
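
v Checking the Fever calculation numerically (a quick verification, not part of the lecture slides):

```python
from math import log2

# Fever = YES subset: 8 rows, of which 6 have target YES and 2 have target NO
entropy_fever_yes = -(6/8) * log2(6/8) - (2/8) * log2(2/8)
# Fever = NO subset: 6 rows, of which 2 have target YES and 4 have target NO
entropy_fever_no = -(2/6) * log2(2/6) - (4/6) * log2(4/6)
# Entropy of the full dataset S (8 YES, 6 NO)
entropy_S = -(8/14) * log2(8/14) - (6/14) * log2(6/14)

ig_fever = entropy_S - (8/14) * entropy_fever_yes - (6/14) * entropy_fever_no
print(round(ig_fever, 2))  # 0.13
```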
v Next, we calculate the IG for the features “Cough” and “Breathing issues”.
v IG(S, Cough) = 0.04
v IG(S, BreathingIssues) = 0.40
v Since the feature Breathing Issues has the highest Information Gain, it is used to create the root node.
v Hence, after this initial step our tree looks like this:

v Next, from the remaining two unused features, namely, Fever and Cough, we decide which one is the best
for the left branch of Breathing Issues.
v Since the left branch of Breathing Issues denotes YES, we will work with the subset of the original data, i.e.
the set of rows having YES as the value in the Breathing Issues column. These 8 rows are shown below:
v Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ (Set Breathing Issues Yes)
v Note: For IG calculation the Entropy will be calculated from the subset Sʙʏ and not the original dataset S.
v IG(Sʙʏ, Fever) = 0.20
v IG(Sʙʏ, Cough) = 0.09
v IG of Fever is greater than that of Cough, so we select Fever as the left branch of Breathing Issues.
v Our tree now looks like this:

v Next, we find the feature with the maximum IG for the right branch of Breathing Issues. But, since there is
only one unused feature left we have no other choice but to make it the right branch of the root node.
v So our tree now looks like this:

v There are no more unused features, so we stop here and jump to the final step of creating the leaf nodes.
v For the left leaf node of Fever, we look at the subset of rows from the original dataset that have both
Breathing Issues and Fever values as YES.

v Since all the values in the target column are YES, we label the left leaf node as YES, but to make it
more logical we label it Infected.
v Similarly, for the right node of Fever we see the subset of rows from the original data set that have
Breathing Issues value as YES and Fever as NO.

v Here not all but most of the values are NO, hence NO or Not Infected becomes our right leaf node.
v Our tree, now, looks like this:

v We repeat the same process for the node Cough; however, here both the left and right leaves turn out to be
the same, i.e. NO or Not Infected, as shown below:

v The right node of Breathing Issues is effectively just a leaf node with class ‘Not Infected’. This is one
of the drawbacks of ID3: it does not perform pruning.
v Pruning is a mechanism that reduces the size and complexity of a Decision tree by removing
unnecessary nodes.
v Another drawback of ID3 is overfitting, or high variance, i.e. it learns the training dataset so well that it fails
to generalize to new data; this can be mitigated by using the Random Forest algorithm (see the sketch below).
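
v As a hedged illustration of these two remedies, the sketch below limits tree size (pruning-style controls) and trains a Random Forest with scikit-learn; the toy data and parameter values are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Toy rows: [Fever, Cough, Breathing Issues] with 1 = YES, 0 = NO (made up for illustration)
X = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
y = ["Infected", "Not Infected", "Infected", "Not Infected"]

# max_depth caps the number of layers; ccp_alpha enables cost-complexity pruning,
# both of which reduce complexity and the risk of overfitting.
pruned_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, ccp_alpha=0.01)
pruned_tree.fit(X, y)

# A Random Forest averages many trees trained on random subsets, lowering variance.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(pruned_tree.predict([[1, 0, 0]]), forest.predict([[1, 0, 0]]))
```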
v Advantages of the Decision Tree
1. It is simple to understand, as it follows the same process that a human follows while making any
decision in real life.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a problem.
4. It requires less data cleaning compared to other algorithms.

v Disadvantages of the Decision Tree


1. A decision tree may contain many layers, which makes it complex.
2. It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
3. With more class labels, the computational complexity of the decision tree may increase.
4. It may contain some unnecessary nodes, which can be addressed by pruning.
