DECISION TREE
LEARNING
MODULE - 2
INTRODUCTION
• Decision tree learning is one of the most widely used methods for inductive
inference
• It is used to approximate discrete-valued functions, and is robust to noisy
data
• It is capable of learning disjunctive expressions
• The learned function is represented by a decision tree.
• Disjunctive Expressions – (A ∧ B ∧ C) ∨ (D ∧ E ∧ F)
CONSIDER THE DATASET
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
DECISION TREE REPRESENTATION
• Each internal node tests an
attribute
• Each branch corresponds to an
attribute value
• Each leaf node assigns a
classification
PlayTennis: This decision tree classifies Saturday mornings
according to whether or not they are suitable for playing tennis
REPRESENTATION
internal node = attribute test
branch = attribute value
leaf node = classification
DECISION TREE REPRESENTATION -
CLASSIFICATION
• An example is classified by
sorting it through the tree from
the root to the leaf node
• Example – (Outlook = Sunny,
Humidity = High) =>
(PlayTennis = No)
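The sorting procedure above can be sketched in code. The nested-dict encoding of the PlayTennis tree below is our own choice, not something given on the slides:

```python
# PlayTennis tree from the figure: internal node = (attribute, {value: subtree}),
# leaf = class label. The encoding is illustrative, not from the slides.
TREE = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, example):
    """Sort the example down the tree, following the branch that matches
    its value for each tested attribute, until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

print(classify(TREE, {"Outlook": "Sunny", "Humidity": "High"}))  # No
```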
DECISION TREE REPRESENTATION
• In general, decision trees represent a disjunction of conjunctions
of constraints on the attribute values of instances
• Example – the PlayTennis tree corresponds to:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
APPROPRIATE PROBLEMS FOR DECISION
TREE LEARNING
• Instances are represented by attribute-value pairs
• Target function has discrete output values
• Disjunctive hypothesis may be required
• Possibly noisy data
• Training data may contain errors
• Training data may contain missing attribute values
• Examples – Classification Problems
1. Equipment or medical diagnosis
2. Credit risk analysis
DECISION TREE ALGORITHMS
• Decision tree algorithms employ a top-down greedy
search through the space of possible decision trees
• ID3 (Iterative Dichotomizer 3) by Quinlan (1986) and
C4.5 by Quinlan (1993) use this approach
• Most algorithms that have been developed for
learning decision trees are variations of this core
algorithm
BASIC ID3 LEARNING ALGORITHM APPROACH
• Top-down construction of the tree, beginning with the question "which
attribute should be tested at the root of the tree?"
• Each instance attribute is evaluated using a statistical test to determine
how well it alone classifies the training examples.
• The best attribute is selected and used as the test at the root node of the
tree.
• A descendant of the root node is then created for each possible value of
this attribute.
• The training examples are sorted to the appropriate descendant node
• The entire process is then repeated at the descendant node using the
training examples associated with each descendant node
• GREEDY Approach
• No Backtracking - So we may get a suboptimal solution.
ID3 ALGORITHM
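The slide's figure shows Mitchell's ID3 pseudocode; the following is a runnable Python sketch of the same procedure. The nested-tuple tree encoding and helper names are our own assumptions, not the textbook's:

```python
# Runnable sketch of ID3: greedy, top-down, no backtracking.
# Internal node = (attribute, {value: subtree}), leaf = class label.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all one class: leaf
        return labels[0]
    if not attrs:                             # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    return (best, {value: id3([r for r in rows if r[best] == value],
                              [a for a in attrs if a != best], target)
                   for value in {r[best] for r in rows}})

RAW = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
COLS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
rows = [dict(zip(COLS, line.split())) for line in RAW.splitlines()]
tree = id3(rows, COLS[:-1], "PlayTennis")
print(tree[0])  # Outlook
```

Run on the PlayTennis table, this reproduces the tree derived later in the module: Outlook at the root, Humidity under Sunny, Wind under Rain.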
TOP-DOWN INDUCTION OF DECISION TREES
1. Find A = the best decision attribute for next node
2. Assign A as decision attribute for node
3. For each value of A create new descendants of node
4. Sort the training examples to the leaf nodes.
5. If training examples classified perfectly, STOP else
iterate over the new leaf nodes.
WHICH ATTRIBUTE IS THE BEST CLASSIFIER?
• Information Gain – A statistical property that measures
how well a given attribute separates the training
examples according to their target classification.
• This measure is used to select among the candidate
attributes at each step while growing the tree.
ENTROPY
• Entropy (E) is the minimum number of bits needed in order to classify an arbitrary
example as yes or no
• Entropy is commonly used in information theory. It characterizes the (im)purity of an
arbitrary collection of examples.
• S is a sample of training examples
• p⊕ is the proportion of positive examples in S
• p⊖ is the proportion of negative examples in S
• Then the entropy measures the impurity of S:
Entropy(S) = -p⊕ log2(p⊕) - p⊖ log2(p⊖)
• If the target attribute can take c different values:
Entropy(S) = Σ(i=1..c) -pi log2(pi)
ENTROPY - EXAMPLE
• Entropy([29+, 35-]) = - (29/64) log2(29/64) - (35/64) log2(35/64)
= 0.994
• Note: 0 log 0 is defined to be 0
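The two-class entropy formula can be transcribed directly; this sketch reproduces the [29+, 35-] example above (function name is ours):

```python
# Entropy of a two-class sample; 0*log2(0) is treated as 0, as noted above.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:              # skip the 0*log2(0) = 0 term
            result -= p * log2(p)
    return result

print(round(entropy(29, 35), 3))  # 0.994
print(entropy(14, 0))             # 0.0: all members in one class
print(entropy(7, 7))              # 1.0: evenly mixed sample
```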
ENTROPY
• The entropy varies between 0 and 1.
• If all the members belong to the same class, the entropy is 0. In this
case p⊕ = 1 and p⊖ = 0:
Entropy(S) = -1·log2(1) - 0·log2(0) = -1·0 - 0 = 0
• If there are equal numbers of positive and negative examples,
the entropy is 1.
INFORMATION GAIN
• Information gain measures the expected reduction in entropy
• It measures the effectiveness of an attribute in classifying the
training data
• Gain(S,A) = expected reduction in entropy by
partitioning the examples according to attribute A:
Gain(S,A) = Entropy(S) - Σ(v ∈ Values(A)) (|Sv|/|S|) Entropy(Sv)
• Here Values(A) is the set of all possible values for attribute A, and Sv
is the subset of S for which attribute A has value v
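The gain formula can be sketched over an attribute given as (value, label) pairs; the `pairs` encoding and helper names are ours. The example recomputes the Wind attribute of the PlayTennis table, whose Weak/Strong split is [6+, 2-] / [3+, 3-]:

```python
# Gain(S, A): Entropy(S) minus the weighted entropy of each value's subset Sv.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(pairs):
    """Expected reduction in entropy from partitioning on the attribute."""
    labels = [label for _, label in pairs]
    subsets = {}
    for value, label in pairs:
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(sv) / len(pairs) * entropy(sv)
                    for sv in subsets.values())
    return entropy(labels) - remainder

# Wind in the PlayTennis table: Weak = [6+, 2-], Strong = [3+, 3-]
wind = [("Weak", "Yes")] * 6 + [("Weak", "No")] * 2 \
     + [("Strong", "Yes")] * 3 + [("Strong", "No")] * 3
print(round(gain(wind), 3))  # 0.048
```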
INFORMATION GAIN - EXAMPLE
• The figure showed a sample S with E = 0.994 split by two candidate
attributes: A1 yields subsets of 26 and 38 examples with entropies 0.706
and 0.742; A2 yields subsets of 51 and 13 examples with entropies 0.937
and 0.619.
• Gain(S,A1) = 0.994 – (26/64)*0.706 – (38/64)*0.742 = 0.266
• Gain(S,A2) = 0.994 – (51/64)*0.937 – (13/64)*0.619 = 0.121
• Information gained by partitioning along attribute A1 (0.266) is greater
than along A2 (0.121), so A1 is the better classifier.
An Illustrative Example
DECISION TREE LEARNING
■ Let’s Try an Example!
■ Let
– E([X+,Y-]) represent a sample with X positive training elements
and Y negative elements.
■ Therefore the entropy for the training data, E(S), can be
represented as E([9+,5-]), because of the 14 training examples 9 of
them are yes and 5 of them are no.
DECISION TREE LEARNING:
A SIMPLE EXAMPLE
■ Let’s start off by calculating the Entropy of the Training Set.
■ E(S) = E([9+,5-]) = (-9/14 log2 9/14) + (-5/14 log2 5/14)
■ = 0.94
■ Next we will need to calculate the information gain G(S,A) for
each attribute A where A is taken from the set {Outlook,
Temperature, Humidity, Wind}.
DECISION TREE LEARNING:
A SIMPLE EXAMPLE
■ The information gain for Outlook is:
– Gain(S,Outlook) = E(S) – [5/14 * E(Outlook=sunny) + 4/14 *
E(Outlook=overcast) + 5/14 * E(Outlook=rain)]
– Gain(S,Outlook) = E([9+,5-]) – [5/14*E([2+,3-]) + 4/14*E([4+,0-]) +
5/14*E([3+,2-])]
– E([2+,3-]) = (-2/5 log2(2/5)) + (-3/5 log2(3/5)) = 0.971
– Gain(S,Outlook) = 0.94 – [5/14*0.971 + 4/14*0.0 + 5/14*0.971]
– Gain(S,Outlook) = 0.246
DECISION TREE LEARNING:
A SIMPLE EXAMPLE
■ Gain(S,Temperature) = 0.94 – [4/14*E(Temperature=hot) +
6/14*E(Temperature=mild) + 4/14*E(Temperature=cool)]
■ Gain(S,Temperature) = 0.94 – [4/14*E([2+,2-]) + 6/14*E([4+,2-]) +
4/14*E([3+,1-])]
■ Gain(S,Temperature) = 0.94 – [4/14*1.0 + 6/14*0.918 + 4/14*0.811]
■ Gain(S,Temperature) = 0.029
DECISION TREE LEARNING:
A SIMPLE EXAMPLE
■ Gain(S,Humidity) = 0.94 – [7/14*E(Humidity=high) +
7/14*E(Humidity=normal)]
■ Gain(S,Humidity) = 0.94 – [7/14*E([3+,4-]) + 7/14*E([6+,1-])]
■ Gain(S,Humidity) = 0.94 – [7/14*0.985 + 7/14*0.592]
■ Gain(S,Humidity) = 0.151
DECISION TREE LEARNING:
A SIMPLE EXAMPLE
■ Gain(S,Wind) = 0.94 – [8/14*E([6+,2-]) + 6/14*E([3+,3-])]
■ Gain(S,Wind) = 0.94 – [8/14*0.811 + 6/14*1.00]
■ Gain(S,Wind) = 0.048
AN ILLUSTRATIVE EXAMPLE
• Gain(S, Outlook) = 0.246
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
• Since Outlook attribute provides the
best prediction of the target attribute,
PlayTennis, it is selected as the
decision attribute for the root node, and
branches are created with its possible
values (i.e., Sunny, Overcast, and
Rain).
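The four gains above can be recomputed directly from the PlayTennis table; the results agree with the slide's values up to rounding in the last digit. Names in this sketch are our own:

```python
# Recomputing the root-level information gains for the PlayTennis table.
from collections import Counter
from math import log2

RAW = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
rows = [dict(zip(ATTRS + ["PlayTennis"], line.split()))
        for line in RAW.splitlines()]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    labels = [r["PlayTennis"] for r in rows]
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r["PlayTennis"] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

for attr in ATTRS:
    print(attr, round(gain(rows, attr), 3))   # Outlook scores highest
```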
ROOT NODE
AN ILLUSTRATIVE EXAMPLE
For Overcast – all examples are Yes, so the decision class can be obtained directly (leaf = Yes)
Day Outlook Temp. Humidity Wind Decision
3 Overcast Hot High Weak Yes
7 Overcast Cool Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
AN ILLUSTRATIVE EXAMPLE
For Sunny – decision class cannot be obtained (labels are mixed):
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes
For Rain – decision class cannot be obtained (labels are mixed):
Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
10 Rain Mild Normal Weak Yes
14 Rain Mild High Strong No
AN ILLUSTRATIVE EXAMPLE
INTERMEDIATE NODE COMPUTATION
Ssunny = {D1, D2, D8, D9, D11}
Entropy(Ssunny) = (-2/5 log2(2/5)) + (-3/5 log2(3/5)) = 0.970
• Gain(Ssunny, Humidity)
■ = 0.970 – [(3/5)*E([0+,3-]) + (2/5)*E([2+,0-])]
■ = 0.970 – (3/5)(0.0) – (2/5)(0.0) = 0.970
• Gain(Ssunny, Temperature)
■ = 0.970 – (2/5)(0.0) – (2/5)(1.0) – (1/5)(0.0)
■ = 0.570
• Gain(Ssunny, Wind)
■ = 0.970 – (2/5)(1.0) – (3/5)(0.918)
■ = 0.019
INTERMEDIATE NODE COMPUTATION
For the rightmost branch: Rain
• Gain (SRain, Temperature) = 0.019
• Gain (SRain, Humidity) = 0.019
• Gain (SRain, Wind)= 0.97
FINAL DECISION TREE :
DECISION FOR LEFT BRANCH (HUMIDITY)
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
Day Outlook Temp. Humidity Wind Decision
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes
DECISION FOR RIGHT BRANCH (WIND)
Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
Day Outlook Temp. Humidity Wind Decision
6 Rain Cool Normal Strong No
14 Rain Mild High Strong No
FINAL DECISION TREE :
Test Instance: (Outlook = Rain, Temperature = Hot, Humidity = High, Wind =
Strong) => PlayTennis = No (the Rain branch tests Wind, and Wind = Strong
leads to a No leaf)
EXAMPLE 2:
DECISION TREE LEARNING
[Figures: step-by-step construction of the decision tree for Example 2]
HYPOTHESIS SPACE SEARCH IN DECISION
TREE LEARNING
ID3 – Capabilities and Limitations
• ID3’s hypothesis space of all decision trees is a complete space
of finite discrete-valued functions
• ID3 maintains a single current hypothesis as it searches
through the space of decision trees
• ID3 in its pure form performs no backtracking in its
search
• ID3 uses all training examples at each step in the search
to make statistically based decisions
INDUCTIVE BIAS IN ID3
• Inductive bias is the set of assumptions that along with
the training data justify the classifications assigned by the
learner to future instances.
• Given a collection of training examples, there are several
decision trees consistent with these examples
• Inductive bias of ID3:
• Which of these decision trees does ID3 choose?
INDUCTIVE BIAS IN ID3
Approximate inductive bias of ID3
• ID3 prefers short trees over larger trees
• Trees that place high-information-gain attributes closer to
the root are preferred over those that do not.
INDUCTIVE BIAS IN ID3
Restriction Biases and Preference Biases
• There is an interesting difference between the types of inductive
bias exhibited by ID3 and the Candidate-Elimination
algorithm:
• ID3 – searches a complete hypothesis space
incompletely
• CE – searches an incomplete hypothesis space
completely
INDUCTIVE BIAS IN ID3
• ID3 – Preference Biases
• CE – Restriction Biases
Which type of inductive bias is preferable?
• Combination is also possible
OCCAM’S RAZOR – WHY PREFER SHORTER
HYPOTHESIS
• William of Occam was a 14th-century English logician
• The principle states that, “All other things being equal, the
simplest solution is the best”
• When multiple competing theories are equal in other
respects, the principle recommends selecting the theory
that introduces the fewest assumptions.
OCCAM’S RAZOR
• Occam’s Razor: Prefer the simplest hypothesis that fits the
data
• Argument in favor –
• A short hypothesis that fits the data is unlikely to be a coincidence
• A long hypothesis that fits the data might be a coincidence
• Argument opposed –
• There are many ways to define small sets of hypotheses
• Two learners that perceive the same training examples in terms of
different internal representations may produce two different
hypotheses
ISSUES IN DECISION TREE LEARNING
• Overfitting
• Incorporating Continuous-valued attributes
• Alternative measures for selecting attributes
• Handling attributes with costs
• Handling examples with missing attribute values
OVERFITTING
• Consider a hypothesis h over
• Training data: errortrain(h)
• Entire distribution D of data: errorD(h)
• The hypothesis h ∈ H overfits the training data if there is an
alternative hypothesis h’ ∈ H such that
errortrain(h) < errortrain(h’)  AND  errorD(h) > errorD(h’)
OVERFITTING
Definition:
Given a hypothesis space H, a hypothesis h belonging to H
is said to overfit the training data if there exists some
alternative hypothesis h’ belonging to H, such that h has a
smaller error than h’ over the training examples, but h’ has a
smaller error than h over the entire distribution of instances.
OVERFITTING IN DECISION TREE LEARNING
AVOIDING OVERFITTING
REDUCED-ERROR PRUNING
• Split data into training and validation sets
• Do until further pruning is harmful
1. Evaluate impact of pruning each possible node on
validation set
2. Greedily remove the one that most improves the validation
set accuracy
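The two-step loop above can be sketched as follows, assuming the nested-tuple tree encoding (attribute, {value: subtree}) with string leaves. Note one simplification: pruning a node here replaces it with the majority label of its subtree's leaves, whereas Mitchell's version uses the majority class of the training examples at that node.

```python
# Minimal sketch of reduced-error pruning over a nested-tuple decision tree.
from collections import Counter

def classify(tree, example):
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

def accuracy(tree, validation):
    return sum(classify(tree, ex) == ex["label"]
               for ex in validation) / len(validation)

def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [leaf for sub in tree[1].values() for leaf in leaves(sub)]

def internal_paths(tree, path=()):
    """Yield the branch-value path to every internal node (pruning candidates)."""
    if isinstance(tree, str):
        return
    yield path
    for value, sub in tree[1].items():
        yield from internal_paths(sub, path + (value,))

def collapsed(tree, path):
    """Copy of `tree` with the node at `path` replaced by a majority leaf."""
    if not path:
        return Counter(leaves(tree)).most_common(1)[0][0]
    attribute, branches = tree
    new_branches = dict(branches)
    new_branches[path[0]] = collapsed(branches[path[0]], path[1:])
    return (attribute, new_branches)

def reduced_error_prune(tree, validation):
    while True:              # stop when no single pruning improves accuracy
        best, best_acc = None, accuracy(tree, validation)
        for path in internal_paths(tree):
            candidate = collapsed(tree, path)
            if accuracy(candidate, validation) > best_acc:
                best, best_acc = candidate, accuracy(candidate, validation)
        if best is None:
            return tree
        tree = best

# A noisy split on Wind that the validation set does not support:
tree = ("Outlook", {"Sunny": ("Wind", {"Weak": "No", "Strong": "Yes"}),
                    "Rain": "Yes"})
validation = [{"Outlook": "Sunny", "Wind": "Weak", "label": "Yes"},
              {"Outlook": "Sunny", "Wind": "Strong", "label": "Yes"},
              {"Outlook": "Rain", "Wind": "Weak", "label": "Yes"}]
pruned = reduced_error_prune(tree, validation)
print(pruned)  # "Yes": the whole tree collapses to a single leaf
```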
EFFECT OF REDUCED-ERROR PRUNING
RULE POST-PRUNING
• The major drawback of Reduced-Error Pruning is that when
data is limited, withholding a validation set further reduces
the number of examples available for training.
Hence Rule Post-Pruning
RULE POST-PRUNING
CONTINUOUS VALUED-ATTRIBUTES
• Create a discrete-valued attribute to test the continuous one
• Dynamically define new discrete-valued attributes
that partition the continuous-valued attribute into a
discrete set of intervals
• For an attribute A and candidate threshold c, create a new
boolean attribute Ac that is true if A < c and false otherwise
• Example (Temperature): sort the examples by the continuous value and
place candidate thresholds midway between adjacent examples whose
classifications differ:
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
• (48+60)/2 = 54 – threshold
• (80+90)/2 = 85 – threshold
• So if Temperature = 75 (between the two thresholds)
• We can infer that PlayTennis = Yes
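The threshold-selection step can be sketched as below; the Temperature/PlayTennis values here match the two thresholds quoted above ((48+60)/2 = 54 and (80+90)/2 = 85), and the function names are ours:

```python
# Candidate thresholds for a continuous attribute: midpoints between
# adjacent sorted examples whose class labels differ.
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

pairs = sorted(zip(temps, labels))
thresholds = [(t1 + t2) / 2
              for (t1, l1), (t2, l2) in zip(pairs, pairs[1:])
              if l1 != l2]                    # class boundary between t1, t2
print(thresholds)  # [54.0, 85.0]

def make_attribute(c):
    """The new boolean attribute A_c: true if A < c."""
    return lambda value: value < c

temp_below_54 = make_attribute(54)
print(temp_below_54(75))  # False
```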
ALTERNATIVE MEASURES FOR SELECTING
ATTRIBUTES
• Problem:
• If an attribute has many values, Gain will select it even when it
generalizes poorly
• Example – using a Date attribute, which has a distinct value for
every example
• One approach – Gain Ratio:
GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
SplitInformation(S,A) = -Σ(i=1..c) (|Si|/|S|) log2(|Si|/|S|)
• where Si is the subset of S for which A has value vi
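Gain ratio (Gain divided by SplitInformation, where SplitInformation is the entropy of the attribute's value partition) can be sketched as below; the `pairs` encoding and names are ours. The example shows the penalty on a Date-like attribute with a unique value per example: its raw gain is the full sample entropy (0.940), but SplitInformation is log2(14), shrinking the ratio:

```python
# Gain ratio: penalizes many-valued attributes such as Date.
from collections import Counter
from math import log2

def entropy(items):
    n = len(items)
    return -sum(c / n * log2(c / n) for c in Counter(items).values())

def gain_ratio(pairs):
    labels = [label for _, label in pairs]
    values = [value for value, _ in pairs]
    subsets = {}
    for value, label in pairs:
        subsets.setdefault(value, []).append(label)
    g = entropy(labels) - sum(len(sv) / len(pairs) * entropy(sv)
                              for sv in subsets.values())
    return g / entropy(values)  # SplitInformation = entropy over the values

# Date-like attribute on a [9+, 5-] sample: one distinct value per example.
date = [(i, label) for i, label in enumerate(["Yes"] * 9 + ["No"] * 5)]
print(round(gain_ratio(date), 3))  # 0.247, down from a raw gain of 0.940
```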
EXAMPLES WITH MISSING ATTRIBUTE
VALUES
• What if some examples are missing values of attribute A?
• Use the training examples anyway and sort them through the tree
• If node n tests A, assign the example the most common value of A
among the examples at node n
• Or assign a probability pi to each possible value vi of A, and
assign a fraction pi of the example to each descendant in the tree
ATTRIBUTES WITH COSTS
• Problem:
• Medical diagnosis – BloodTest has cost $150
• Robotics – Width_from_1ft has cost 23 sec.
• One approach – replace Gain with a cost-sensitive measure:
• Tan and Schlimmer (1990): Gain²(S,A) / Cost(A)
• Nunez (1988): (2^Gain(S,A) − 1) / (Cost(A) + 1)^w
• where w ∈ [0, 1] is a constant that determines the relative importance of cost versus information
gain.
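The two commonly cited cost-sensitive measures, Tan and Schlimmer's Gain²/Cost and Nunez's (2^Gain − 1)/(Cost + 1)^w, can be written as small functions; the gain and cost values below are placeholder numbers, not computed from data:

```python
# Cost-sensitive attribute-selection measures (placeholder inputs).
def tan_schlimmer(gain, cost):
    """Gain^2(S,A) / Cost(A) (Tan and Schlimmer, 1990)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """(2^Gain(S,A) - 1) / (Cost(A) + 1)^w (Nunez, 1988), w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# With equal gain, the cheaper test scores higher; with w = 0, cost is ignored.
print(nunez(0.5, 1) > nunez(0.5, 150))             # True
print(nunez(0.5, 1, w=0) == nunez(0.5, 150, w=0))  # True
```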