ML Lec-12
LECTURE-12
BY
Dr. Ramesh Kumar Thakur
Assistant Professor (II)
School Of Computer Engineering
v Decision Tree is a supervised learning technique that can be used for both classification and
regression problems, but it is mostly preferred for solving classification problems.
v It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
v In a Decision tree, there are two types of nodes: the decision node and the leaf node. Decision nodes are
used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those
decisions and do not contain any further branches.
v The decisions or tests are performed on the basis of the features of the given dataset.
v It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
v It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further
branches and constructs a tree-like structure.
v Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree.
v In simple words, the top-down approach means that we start building the tree from the top and the
greedy approach means that at each iteration we select the best feature at the present moment to
create a node.
v In general, ID3 is used only for classification problems with nominal (categorical) features.
v ID3 algorithm selects the best feature at each step while building a Decision tree.
v So the answer to the question: ‘How does ID3 select the best feature?’ is that ID3 uses Information Gain
or just Gain to find the best feature.
v Information Gain calculates the reduction in the entropy and measures how well a given feature
separates or classifies the target classes.
v The feature with the highest Information Gain is selected as the best one.
v In simple words, Entropy is the measure of disorder and the Entropy of a dataset is the measure of
disorder in the target feature of the dataset.
v In the case of binary classification (where the target column has only two types of classes), entropy is 0 if
all values in the target column are homogeneous (i.e. belong to the same class) and is 1 if the target column
has an equal number of values for both classes.
v Denoting our dataset as S, the entropy is calculated as:
Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
v where,
v n is the total number of classes in the target column (in our case n = 2, i.e. YES and NO)
v pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the target column” to the
“total number of rows” in the dataset.
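v As a small illustration (not part of the lecture slides), the entropy of a target column can be computed with a
few lines of Python; the entropy() helper below is an assumed name:

import math
from collections import Counter

def entropy(target_values):
    """Entropy of a list of class labels, e.g. ['YES', 'NO', 'YES', ...]."""
    total = len(target_values)
    counts = Counter(target_values)
    # Sum of -p_i * log2(p_i) over the classes present in the column
    return sum(-(count / total) * math.log2(count / total) for count in counts.values())

print(entropy(['YES'] * 7 + ['NO'] * 7))   # 1.0 -> equal split, maximum disorder
print(entropy(['YES'] * 14))               # 0.0 -> homogeneous column, no disorder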
I. Calculate the Information Gain of each feature.
II. Considering that all rows don't belong to the same class, split the dataset S into subsets using the feature
for which the Information Gain is maximum.
III. Make a decision tree node using the feature with the maximum Information Gain.
IV. If all (or most) of the rows belong to the same class, make the current node a leaf node with that class as
its label.
V. Repeat for the remaining features until we run out of features, or the decision tree has all leaf nodes (see
the sketch below).
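v A minimal Python sketch of these steps is given below, assuming the dataset is a list of row dictionaries and
reusing the entropy() helper sketched above. The names information_gain, id3, rows, features and target are
illustrative (not from the lecture), and the gain used is the standard ID3 one:
IG(S, A) = Entropy(S) − ∑ᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

def information_gain(rows, feature, target):
    """Entropy of the whole set minus the weighted entropy of each subset split on `feature`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [row for row in rows if row[feature] == value]
        remainder += (len(subset) / len(rows)) * entropy([row[target] for row in subset])
    return base - remainder

def id3(rows, features, target):
    labels = [row[target] for row in rows]
    # Step IV: a pure node, or one with no unused features left, becomes a leaf with the majority class
    if len(set(labels)) == 1 or not features:
        return max(set(labels), key=labels.count)
    # Steps I and III: pick the feature with the maximum Information Gain and make a node for it
    best = max(features, key=lambda f: information_gain(rows, f, target))
    tree = {best: {}}
    # Steps II and V: split on each value of the chosen feature and recurse on the remaining features
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, [f for f in features if f != best], target)
    return tree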
v The first step is to find the best feature, i.e. the one that has the maximum Information Gain (IG).
v We’ll calculate the IG for each of the features now, but for that, we first need to calculate the entropy of S.
v From the total of 14 rows in our dataset S, there are 8 rows with the target value YES and 6 rows with the
target value NO. The entropy of S is calculated as:
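v With pYES = 8/14 and pNO = 6/14, this works out (using the entropy formula above) to:
Entropy(S) = − (8/14) · log₂(8/14) − (6/14) · log₂(6/14) ≈ 0.99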
v Note: If all the values in our target column are the same, the entropy will be zero (meaning that it has no
randomness).
v Next, from the remaining two unused features, namely, Fever and Cough, we decide which one is the best
for the left branch of Breathing Issues.
v Since the left branch of Breathing Issues denotes YES, we will work with the subset of the original data, i.e.
the set of rows having YES as the value in the Breathing Issues column. These 8 rows are shown below:
v Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ (Set Breathing Issues Yes)
v Note: For IG calculation the Entropy will be calculated from the subset Sʙʏ and not the original dataset S.
v IG(Sʙʏ, Fever) = 0.20
v IG(Sʙʏ, Cough) = 0.09
v IG of Fever is greater than that of Cough, so we select Fever as the left branch of Breathing Issues.
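v (A hypothetical illustration, reusing the information_gain() helper sketched earlier; sby_rows and 'Infected'
are illustrative names for the Sʙʏ subset and its target column.)

best_feature = max(['Fever', 'Cough'],
                   key=lambda f: information_gain(sby_rows, f, 'Infected'))
# picks 'Fever', since its Information Gain (0.20) is larger than that of Cough (0.09)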
v Our tree now looks like this:
v Next, we find the feature with the maximum IG for the right branch of Breathing Issues. But, since there is
only one unused feature left, we have no other choice but to make it the right branch of the root node.
v So our tree now looks like this:
v There are no more unused features, so we stop here and jump to the final step of creating the leaf nodes.
v For the left leaf node of Fever, we see the subset of rows from the original dataset that have both Breathing
Issues and Fever values as YES.
v Since all the values in the target column are YES, we label the left leaf node as YES, but to make it
more logical we label it Infected.
v Similarly, for the right node of Fever we see the subset of rows from the original data set that have
Breathing Issues value as YES and Fever as NO.
v Here, not all but most of the values are NO; hence NO, i.e. Not Infected, becomes our right leaf node.
v Our tree, now, looks like this:
v We repeat the same process for the node Cough; however, here both the left and right leaves turn out to be
the same, i.e. NO or Not Infected, as shown below:
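v Putting the branches together, the full tree described above can be sketched in text as follows (a plain-text
reconstruction of the figure):

Breathing Issues
├── YES → Fever
│          ├── YES → Infected
│          └── NO  → Not Infected
└── NO  → Cough
           ├── YES → Not Infected
           └── NO  → Not Infected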
v The right node of Breathing Issues is as good as just a leaf node with the class 'Not Infected'. This is one
of the drawbacks of ID3: it does not do pruning.
v Pruning is a mechanism that reduces the size and complexity of a Decision tree by removing
unnecessary nodes.
v Another drawback of ID3 is overfitting, or high variance, i.e. it learns the training dataset so well that it fails
to generalize to new data; this can be mitigated using the Random Forest algorithm.
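v As an illustration of that last point (not from the lecture), scikit-learn's RandomForestClassifier averages many
trees grown on bootstrap samples; below is a minimal sketch on toy data reusing the same column names
(scikit-learn grows CART trees rather than ID3 trees, and the nominal features are one-hot encoded first):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative toy rows; the real dataset from the lecture has 14 rows
df = pd.DataFrame({
    "Fever":            ["YES", "NO", "YES", "NO"],
    "Cough":            ["NO", "YES", "YES", "NO"],
    "Breathing Issues": ["YES", "YES", "NO", "NO"],
    "Infected":         ["YES", "YES", "NO", "NO"],
})
X = pd.get_dummies(df.drop(columns=["Infected"]))   # one-hot encode nominal features
y = df["Infected"]

# An ensemble of trees on bootstrap samples reduces the variance of a single tree
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(X[:1]))   # predicted class for the first toy row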
v Advantages of the Decision Tree
1. It is simple to understand, as it follows the same process that a human follows while making a
decision in real life.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a problem.
4. It requires less data cleaning compared to other algorithms.