Session 5b Classification by Decision Tree Induction
INTRODUCTION
LEARNING OUTCOMES
DECISION TREES
HOW DOES THE DECISION TREE ALGORITHM WORK?
• Given a tuple X (where X is a set of attribute values associated with an unknown class label), the attribute values of X are tested against the decision tree:
1. The decision tree algorithm compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the matching branch and jumps to the next node.
2. At the next node, the algorithm again compares the attribute value with those of the sub-nodes and moves further down the tree.
3. It continues this process until it reaches a leaf node of the tree, whose label is the predicted class.
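A minimal sketch of this traversal in Python, assuming the tree is stored as nested dictionaries; the attribute names, values, and class labels below are illustrative, not taken from the slides.

def classify(tree, x):
    """Walk the tree from the root, following the branch that matches each
    tested attribute value, until a leaf (class label) is reached."""
    node = tree
    while isinstance(node, dict):               # internal (decision) node
        attribute = node["attribute"]           # attribute tested at this node
        node = node["branches"][x[attribute]]   # follow the matching branch
    return node                                 # leaf node: the class label

# Illustrative tree: the root tests "outlook"; one sub-node tests "wind".
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": "play",
        "rainy": {
            "attribute": "wind",
            "branches": {"weak": "play", "strong": "do not play"},
        },
    },
}

print(classify(tree, {"outlook": "rainy", "wind": "strong"}))   # -> do not play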
Example: Suppose a candidate has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by an attribute selection measure, ASM). The root node splits further into the next decision node (Distance from the office) and one leaf node, based on the corresponding labels. The next decision node further splits into one decision node (Cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer).
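A minimal scikit-learn sketch of this example, using a small hand-made dataset; the numeric encodings and sample values are assumptions for illustration only, not part of the original example.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical offers: [salary, distance to office (km), cab facility (1 = yes)]
X = [[12, 5, 1], [12, 25, 1], [12, 25, 0], [6, 5, 1], [6, 20, 0], [15, 30, 1]]
y = [1, 1, 0, 0, 0, 1]   # 1 = accepted offer, 0 = declined offer

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)  # entropy-based splits
clf.fit(X, y)

print(export_text(clf, feature_names=["salary", "distance", "cab"]))  # text view of the learned splits
print(clf.predict([[10, 8, 1]]))   # classify a new offer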
DECISION TREES
History
ATTRIBUTE SELECTION MEASURES
Information Gain
• Information gain measures the change in entropy after a dataset is segmented on an attribute.
• It quantifies how much information a feature provides about the class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize information gain, and the node/attribute with the highest information gain is split first.
• ID3 uses information gain as its attribute selection
measure.
• This measure is based on pioneering work by Claude
Shannon on information theory, which studied the
value or “information content” of messages.
• Let node N represent (or hold) the tuples of partition D: the attribute with the highest information gain is chosen as the splitting attribute for node N.
• The expected information needed to classify a tuple in D is given by:
Info(D) = - Σ_{i=1..m} p_i log2(p_i)
• where
• p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i
• Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
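A minimal sketch of computing Info(D) in Python, assuming the class labels of D are given as a plain list; the 9/5 class split in the usage line is only an illustration.

import math
from collections import Counter

def info(labels):
    """Info(D) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

# 9 tuples of one class and 5 of the other:
print(info(["yes"] * 9 + ["no"] * 5))   # ≈ 0.940 bits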
• Information gain is a decrease in entropy.
• It computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset on the given attribute values:
Gain(A) = Info(D) - Info_A(D), where Info_A(D) = Σ_{j=1..v} (|D_j|/|D|) × Info(D_j)
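A minimal, self-contained sketch of Gain(A): split D on the values of attribute A, take the weighted average entropy of the partitions, and subtract it from Info(D). The attribute values and labels below are illustrative assumptions.

import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    """Gain(A) = Info(D) - Info_A(D), the reduction in entropy from splitting on A."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)          # D_j for each value of A
    info_a = sum(len(d_j) / n * info(d_j) for d_j in partitions.values())
    return info(labels) - info_a

# Illustrative attribute and class labels:
wind = ["weak", "strong", "weak", "weak", "strong", "strong"]
play = ["yes", "no", "yes", "yes", "no", "yes"]
print(info_gain(wind, play))   # ≈ 0.459

The attribute with the largest info_gain value over all candidate attributes would be chosen as the splitting attribute for the node.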
Example: entropy calculated for a single attribute (using the Info(D) formula above).
Example: entropy calculated across multiple attributes (the weighted average Info_A(D) over the partitions).
Gain Ratio
• Information gain is biased towards choosing attributes with a large number of values as splitting attributes.
• That is, it prefers attributes with many distinct values.
• C4.5, an improvement of ID3, uses the gain ratio, a modification of information gain that reduces this bias and is usually the best option.
• Gain ratio overcomes the problem with information
gain by taking into account the number of branches
that would result before making the split.
• It corrects information gain by taking the intrinsic
information of a split into account.
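A minimal sketch of the gain ratio under the usual C4.5 definition, GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo_A(D) = -Σ_j (|D_j|/|D|) log2(|D_j|/|D|) is the intrinsic information of the split; the data below are illustrative assumptions.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of values (class labels or attribute values)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(d_j) / n * entropy(d_j) for d_j in partitions.values())
    split_info = entropy(attribute_values)    # intrinsic information of the split itself
    return gain / split_info

# An ID-like attribute with a unique value per tuple has maximal information gain,
# but its large SplitInfo pulls the gain ratio back down:
ids = ["a", "b", "c", "d", "e", "f"]
play = ["yes", "no", "yes", "yes", "no", "yes"]
print(gain_ratio(ids, play))   # ≈ 0.36 rather than the raw gain of ≈ 0.92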
Gini Index
• Gini index is a cost function used to evaluate splits in
the dataset.
• It is calculated by subtracting the sum of the squared
probabilities of each class from one.
• It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with many distinct values.
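A minimal sketch of the Gini index of a partition, Gini(D) = 1 - Σ_i p_i²; the 9/5 class split in the usage line is only an illustration.

from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum over classes of p_i squared."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes"] * 9 + ["no"] * 5))   # ≈ 0.459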
Overfitting Problem in Decision Trees
• The overfitting problem reduces accuracy when predicting samples that are not part of the training set.
• Since a problem usually has a large set of features, this results in a large number of splits, which in turn gives a huge tree.
• Such trees are complex and can lead to overfitting.
But when do we stop splitting/growing the tree?
– Random Forest (an ensemble of many trees) is one way to reduce overfitting.
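Another common answer is to pre-prune the tree with explicit stopping criteria; a minimal scikit-learn sketch follows, where the particular limits (depth, sample counts, pruning strength) are arbitrary illustrations, not recommended settings.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=3,            # stop splitting below this depth
    min_samples_split=10,   # a node needs at least 10 samples to be split further
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    ccp_alpha=0.01,         # cost-complexity (post-)pruning strength
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())   # the resulting tree stays small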
Pruning Decision Trees
DECISION TREES
Why are decision tree classifiers so popular?
• The construction of decision tree classifiers does not
require any domain knowledge or parameter setting,
and therefore is appropriate for exploratory
knowledge discovery.
• Decision trees can handle multidimensional data.
• Their representation of acquired knowledge in tree
form is intuitive and generally easy to assimilate by
humans.
• The learning and classification steps of decision tree
induction are simple and fast.
• Decision tree classifiers have good accuracy.
SUMMARY
THANK YOU
REFERENCES