ML Unit 3 Notes
Lecture - 13
Pallavi Shukla
Assistant Professor
Introduction to Decision Tree
• A decision tree is used to create a learning model that can predict (test) the class or value of the target variable.
• The decision tree uses prior training data to predict the class of a new example.
A DECISION TREE
• “A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute and each branch represents the outcome of that test.”
• The end node (called leaf node) represents a class label.
• Decision tree is a supervised learning method.
• It is used for both classification and regression tasks in machine learning.
DECISION TREE LEARNING
• It is a method for approximating a discrete-valued target function (concept), in which the learned function is represented by a decision tree.
TERMINOLOGIES IN DECISION TREE
• Root Node: It represents the entire population (or sample) which gets
further divided into two or more sets.
• Splitting: It is the process of dividing a node into two or more sub-nodes to grow the tree.
• Decision Nodes: When a sub-node splits into further sub-nodes, it is called a decision node.
• Leaf/ Terminal Node: The end nodes that do not split are called leaf or
terminal nodes.
TERMINOLOGIES IN DECISION TREE
• Pruning: The removal of sub-nodes to reduce the size of the tree is called pruning.
• Parent Node: A node that is divided into sub-nodes is called a parent node.
• Child Nodes: The sub-nodes of a parent node are called child nodes.
How does the Decision Tree algorithm Work?
• In a decision tree, to predict the class of a given example, the algorithm starts from the root node of the tree.
• The algorithm compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with those of the other sub-nodes and moves further.
• It continues the process until it reaches the leaf node of the tree.
• The complete process can be better understood using the algorithm below:
How does the Decision Tree algorithm Work?
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are called leaf nodes (a minimal code sketch of this recursion follows).
Example
• Example: Suppose a candidate has a job offer and wants to decide whether to accept it. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (Distance from the office) and one leaf node. The next decision node further splits into one decision node (Cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer). The same tree is sketched as nested conditions below.
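A hedged sketch of how this example tree classifies a single offer as nested conditions; the boolean parameters and the branches on which the offer is declined are illustrative assumptions based only on the description above.

```python
def classify_offer(salary_ok, distance_ok, cab_facility):
    """Walk the example tree from the root (Salary) down to a leaf.

    The three boolean arguments are hypothetical stand-ins for the tests
    described above: salary acceptable, distance from the office acceptable,
    and cab facility provided.
    """
    if not salary_ok:            # root node: test on the Salary attribute
        return "Declined offer"  # leaf reached directly from the root
    if not distance_ok:          # decision node: distance from the office
        return "Declined offer"
    if cab_facility:             # decision node: cab facility
        return "Accepted offer"  # leaf
    return "Declined offer"      # leaf

print(classify_offer(salary_ok=True, distance_ok=True, cab_facility=True))  # Accepted offer
```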
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes.
• To solve such problems there is a technique called the Attribute Selection Measure (ASM).
• Using this measure, we can easily select the best attribute for the nodes of the tree.
• There are two popular techniques for ASM (both are illustrated in the sketch after this list):
• Information Gain
• Gini Index
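For reference, a minimal sketch of the impurity measures behind these two techniques: entropy (on which information gain is based) and the Gini index. The function names and the example class counts are illustrative assumptions.

```python
import math

def entropy(counts):
    """Entropy of a node, given its class counts, e.g. [5, 5]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini_index(counts):
    """Gini impurity of a node: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(entropy([5, 5]))     # 1.0  (a perfectly balanced two-class node)
print(gini_index([5, 5]))  # 0.5
```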
Advantages of the Decision Tree
• It is simple to understand, as it follows the same process that a human follows while making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
• The decision tree can contain many levels, which makes it complex.
• It may have an overfitting problem, which can be reduced by pruning or by ensemble methods such as Random Forest.
• For more class labels, the computational complexity of the decision tree may increase.
ENTROPY
• Entropy (E) is a measure of the impurity (disorder) of a dataset.
• What happens when all observations belong to the same class? In such a case, the entropy is always zero.
• Such a dataset has no impurity.
• This implies that such a dataset would not be useful for learning.
• If we have a dataset with two classes, half made up of yellow examples and the other half purple, the entropy will be one.
Example 2
• Calculate the entropy E of the single attribute “Play Golf” when the following data is given.
Play Golf: Yes = 9, No = 5
Solution
E(S) = - Σ pi log2(pi)
where S = current state and pi = probability of event i of state S
Entropy(Play Golf) = E(9, 5)
Probability of Play Golf = Yes: 9/14 = 0.64
Probability of Play Golf = No: 5/14 = 0.36
Entropy = - (0.64) log2(0.64) - (0.36) log2(0.36) = 0.41 + 0.53
E(S) = 0.94
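A quick check of this result (a sketch; the entropy helper simply mirrors the formula above):

```python
import math

def entropy(counts):
    """E(S) = -sum(p_i * log2(p_i)) over the class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Play Golf: 9 Yes, 5 No
print(round(entropy([9, 5]), 2))  # 0.94
```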
Example 3
• Calculate the entropy of multiple attributes for the “Play Golf” problem, i.e. the entropy of Play Golf with respect to the attribute Outlook, given the data set below.
Play Golf counts (Yes, No) by Outlook:
Outlook = Sunny: Yes = 3, No = 2
Outlook = Overcast: Yes = 4, No = 0
Outlook = Rainy: Yes = 2, No = 3
Solution
E(Play Golf, Outlook) = P(Sunny) · E(3,2) + P(Overcast) · E(4,0) + P(Rainy) · E(2,3)
P(Sunny) · E(3,2) = (5/14) × [ - (3/5) log2(3/5) - (2/5) log2(2/5) ] = (5/14) × 0.971 = 0.347
P(Overcast) · E(4,0) = (4/14) × 0 = 0
P(Rainy) · E(2,3) = (5/14) × 0.971 = 0.347
E(Play Golf, Outlook) ≈ 0.693
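A short Python check of this weighted entropy (a sketch; the entropy helper is repeated so the snippet runs on its own, and the final line anticipates the information-gain idea covered next by reusing E(9, 5) = 0.94 from Example 2):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (Yes, No) counts of Play Golf for each value of Outlook
outlook = {"Sunny": [3, 2], "Overcast": [4, 0], "Rainy": [2, 3]}
total = sum(sum(c) for c in outlook.values())  # 14 examples in all

weighted = sum((sum(c) / total) * entropy(c) for c in outlook.values())
parent = entropy([9, 5])                       # 0.94, from Example 2

print(round(weighted, 2))           # 0.69  (≈ 0.693, E(Play Golf, Outlook))
print(round(parent - weighted, 3))  # 0.247 -> information gain of Outlook
```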
INFORMATION GAIN
• In machine learning and decision trees, the information gain (IG) is defined as the reduction (decrease) in entropy.
• Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• Information gain helps to determine the order of attributes in the nodes
of a decision tree.
• According to the value of information gain, we split the node and build
the decision tree.
INFORMATION GAIN
• The main node is referred to as the parent node, whereas sub-nodes are
known as child nodes.
• We can use information gain to determine how good the splitting of nodes in a decision tree is.
• Information Gain = E(parent) - E(children), where E(parent) is the entropy of the parent node and E(children) is the weighted average entropy of the child nodes.
Example
Suppose we have a dataset with two classes: 5 purple and 5 yellow examples.
Since the dataset is balanced, we expect its entropy to be 1.
• The entropy before the split, which we refer to as the initial entropy, is E_initial = 1.
• After splitting, the weighted average entropy of the child nodes is 0.39.
• We can now get our information gain, which is the entropy we “lost” after splitting: IG = 1 - 0.39 = 0.61 (a code sketch follows).
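A hedged sketch that reproduces these numbers. The notes do not state the exact split, so the child counts below (one child with 4 purple examples, the other with 1 purple and 5 yellow) are a hypothetical split chosen only to be consistent with the 0.39 figure.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """IG = E(parent) - weighted average entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum((sum(c) / total) * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

parent = [5, 5]              # 5 purple, 5 yellow -> E_initial = 1
children = [[4, 0], [1, 5]]  # hypothetical split consistent with 0.39
print(round(information_gain(parent, children), 2))  # 0.61
```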
Example
The more the entropy removed, the greater the information gain.
The higher the information gain, the better the split.
Decision Tree Algorithms
Pallavi Shukla
Assistant Professor
Types of Decision Tree Algorithms
• 1. Iterative Dichotomizer 3 (ID3) Algorithm
• 2. C4.5 Algorithm
• 3. Classification and Regression Tree (CART) Algorithm
General Decision Tree Algorithm Steps
1. Calculate the entropy (E) of every attribute (A) of the dataset (S).
2. Split the dataset (S) into subsets using the attribute for which the resulting entropy after splitting is minimized (equivalently, for which the information gain is maximum); a short sketch of this selection step follows.
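A minimal sketch of these two steps: compute the entropy that would remain after splitting on each attribute and pick the attribute that minimizes it. The Outlook counts come from Example 3; the second attribute (Windy) and its counts are hypothetical, added only so there is something to compare against.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_after_split(value_counts):
    """Weighted entropy of the dataset after splitting on one attribute."""
    total = sum(sum(c) for c in value_counts.values())
    return sum((sum(c) / total) * entropy(c) for c in value_counts.values())

# (Yes, No) counts per attribute value.
attributes = {
    "Outlook": {"Sunny": [3, 2], "Overcast": [4, 0], "Rainy": [2, 3]},
    "Windy":   {"False": [6, 2], "True": [3, 3]},   # hypothetical counts
}

best = min(attributes, key=lambda a: entropy_after_split(attributes[a]))
print(best)  # Outlook -> splitting on it leaves the least entropy
```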
Iterative Dichotomizer 3 Algorithm
Inductive Bias
Pallavi Shukla
Assistant Professor
Inductive Bias in Decision Tree Learning
• The inductive bias of a machine learning algorithm is the set of assumptions that the learner uses to predict outputs for given inputs that it has not encountered.
• An approximation of inductive bias of ID3 decision tree algorithm: “Shorter
trees are preferred over longer trees. Trees that place high information gain
attributes close to the root are preferred over those that do not.”
• Inductive bias is a “policy” by which the decision tree algorithm generalizes
from observed training examples to classify unseen instances.