Chapter 3 Decision Trees
What is a Decision Tree?
• Decision trees are powerful and popular tools for
classification and prediction.
[Figure: a decision tree for predicting commute time. Root node "Leave At"
with branches 10 AM → Stall? (No → Short, Yes → Long), 9 AM → Accident?
(No → Medium, Yes → Long), 8 AM → Long.]
If we leave at 10 AM and there are no cars stalled on the road, what will
our commute time be?
3
Inductive Learning
• In this decision tree, we made a series of Boolean
decisions and followed the corresponding branch
– Did we leave at 10 AM?
– Did a car stall on the road?
– Is there an accident on the road?
4
Learning in Decision Trees
• Goal: Build a decision tree for classifying examples as positive or
negative instances of a concept, using supervised learning from a
training set.
• A decision tree is a tree where
– each non-leaf node is associated with an attribute (feature)
– each leaf node is associated with a classification (+ or -)
– each arc is associated with one of the possible values of the attribute
at the node from which the arc originates
[Figure: an example tree with root Color (values red, green, blue); one
branch leads to a Size test (big → -, small → +), one to a + leaf, and one
to a Shape test (square → a further Size test with big → - and small → +,
round → +).]
• Generalization: allow for > 2 classes
– e.g., {sell, hold, buy}
5
Decision Trees as Rules
• We did not have to represent this tree graphically; it can also be
written as a set of rules (a sketch follows below)
6
Decision Tree as a Rule Set
7
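As an illustration, the commute-time tree reconstructed in the earlier figure can be written directly as a rule set. The Python sketch below is ours (the function and argument names are illustrative, not from the slides), and it assumes the branch layout shown in that figure.

```python
# A hedged sketch: the commute-time decision tree expressed as if/then rules.
# The branch layout follows the figure reconstructed earlier in this chapter.

def commute_time(leave_at, stall, accident):
    """Return the predicted commute time for one example."""
    if leave_at == "10 AM":
        # The 10 AM branch tests whether a car is stalled on the road.
        return "Long" if stall else "Short"
    if leave_at == "9 AM":
        # The 9 AM branch tests whether there is an accident on the road.
        return "Long" if accident else "Medium"
    # The 8 AM branch is a leaf: the commute is always Long.
    return "Long"

# "If we leave at 10 AM and no cars are stalled, the commute will be Short."
print(commute_time("10 AM", stall=False, accident=False))  # Short
```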
How to Create a Decision Tree
• We first make a list of attributes that we
can measure
– These attributes (for now) must be discrete
• We then choose a target attribute that
we want to predict
• Then create an experience table that
lists what we have seen in the past
8
Sample Experience Table
Example  Leave At  Weather  Accident  Stall  Commute (Target)
D1       8 AM      Sunny    No        No     Long
D3       10 AM     Sunny    No        No     Short
D6       10 AM     Sunny    No        No     Short
D7       10 AM     Cloudy   No        No     Short
D8       9 AM      Rainy    No        No     Medium
10
Choosing Attributes
• Methods for selecting attributes (which will be
described later) show that weather is not a
discriminating attribute
• We use the principle of Occam’s Razor:
– Given a number of competing hypotheses, the
simplest one is preferable
11
Choosing Attributes
• The basic structure of creating a decision tree
is the same for most decision tree algorithms
• The difference lies in how we select the
attributes for the tree
• We will focus on the ID3 algorithm developed
by Ross Quinlan in 1975
12
Decision Tree Algorithms
• The basic idea behind any decision tree
algorithm is as follows:
– Choose the best attribute(s) to split the
remaining instances and make that attribute a
decision node
– Repeat this process recursively for each child (see the sketch after
this list)
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances
13
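A minimal sketch of this generic build loop in Python. The names (build_tree, choose_best_attribute) and the (feature_dict, label) example format are our own conventions; the attribute-selection step is left abstract on purpose, since that is exactly where the algorithms differ.

```python
from collections import Counter

def build_tree(examples, attributes, choose_best_attribute):
    """examples: non-empty list of (feature_dict, label); attributes: feature names."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]                                   # all instances agree: leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]        # no attributes left: majority leaf
    best = choose_best_attribute(examples, attributes)     # e.g. highest information gain
    node = {"attribute": best, "children": {}}
    for value in {feats[best] for feats, _ in examples}:   # one branch per observed value
        subset = [(f, lab) for f, lab in examples if f[best] == value]
        rest = [a for a in attributes if a != best]
        node["children"][value] = build_tree(subset, rest, choose_best_attribute)
    return node
```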
Identifying the Best Attributes
• Refer back to our original decision tree
[Figure: the commute-time tree again. Root node "Leave At" with branches
10 AM → Stall? (No → Short, Yes → Long), 9 AM → Accident? (No → Medium,
Yes → Long), 8 AM → Long.]
16
Entropy
• A measure of the impurity (lack of homogeneity) of a set of examples
with respect to the target attribute.
17
Entropy
• Entropy is maximized when there is an equal
chance of all values for the target attribute
(i.e. the result is random)
– If commute time = short in 3 instances, medium in
3 instances and long in 3 instances, entropy is
maximized
– The entropy is maximal when we have no knowledge of the system (i.e.,
every outcome is equally likely).
18
Entropy
• Calculation of entropy
– Entropy(S) = − Σ (i = 1 to l) (|Si| / |S|) · log2(|Si| / |S|)
• S = the set of examples
• Si = the subset of S whose target attribute takes value vi
• l = the number of distinct values of the target attribute
19
Entropy - Example
• Given a set S of positive and negative examples of some target concept
(a 2-class problem), the entropy of set S relative to this binary
classification is
Entropy(S) = −p+ · log2(p+) − p− · log2(p−)
where p+ and p− are the proportions of positive and negative examples in S.
21
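A small sketch of this calculation in Python (standard library only; the function name is ours). The 0·log2(0) = 0 convention is handled implicitly, because only values that actually occur contribute a term.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over target values of (|Si|/|S|) * log2(|Si|/|S|)."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# The 14-example tennis set used later has 9 positive and 5 negative instances.
print(round(entropy(["+"] * 9 + ["-"] * 5), 3))  # 0.94
print(entropy(["+"] * 7 + ["-"] * 7))            # 1.0  (maximally mixed)
print(entropy(["+"] * 14))                       # 0.0  (perfectly homogeneous)
```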
Information Gain
• Information gain measures the expected reduction in entropy, or
uncertainty:
Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)
– the first term in the equation for Gain is just the entropy of the
original collection S
– the second term is the expected value of the entropy after S is
partitioned using attribute A
22
Information Gain - cont
• It measures how well a given attribute
separates the training examples according to
their target classification
• This measure is used to select among the
candidate attributes at each step while
growing the tree
• It is simply the expected reduction in entropy caused by partitioning
the examples according to this attribute (a small sketch follows below).
23
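A sketch of this measure in Python, following the Gain(S, A) formula given above. It repeats a small entropy helper so the block stands alone; the names and the (feature_dict, label) example format are our own conventions, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv|/|S|) * Entropy(Sv)."""
    labels = [lab for _, lab in examples]
    before = entropy(labels)                                   # entropy of the whole collection S
    total = len(examples)
    after = 0.0
    for value in {feats[attribute] for feats, _ in examples}:  # partition S by the values of A
        subset = [lab for feats, lab in examples if feats[attribute] == value]
        after += (len(subset) / total) * entropy(subset)       # weighted entropy of each part
    return before - after                                      # expected reduction in entropy
```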
Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Entropy Example
• Entropy (disorder) is bad; homogeneity is good, and information gain
rewards splits that produce more homogeneous subsets
• Let S be a set of examples
• Entropy(S) = −P log2(P) − N log2(N)
– P is the proportion of positive examples
– N is the proportion of negative examples
– by convention, 0 · log2(0) = 0
26
Gain of Splitting on Wind
Values(Wind) = {Weak, Strong}
S = [9+, 5−]
S_weak = [6+, 2−]
S_strong = [3+, 3−]

Gain(S, Wind)
= Entropy(S) − Σ (v ∈ {Weak, Strong}) (|Sv| / |S|) · Entropy(Sv)
= Entropy(S) − (8/14) · Entropy(S_weak) − (6/14) · Entropy(S_strong)
= 0.940 − (8/14) · 0.811 − (6/14) · 1.00
= 0.048
(This arithmetic is verified in the sketch below.)

Day  Wind    Tennis?
d1   weak    no
d2   strong  no
d3   weak    yes
d4   weak    yes
d5   weak    yes
d6   strong  no
d7   strong  yes
d8   weak    no
d9   weak    yes
d10  weak    yes
d11  strong  yes
d12  strong  yes
d13  weak    yes
d14  strong  no
27
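The numbers above can be checked in a few lines of Python. The entropy2 helper below is ours; it works directly from the positive/negative counts [9+, 5−], [6+, 2−] and [3+, 3−].

```python
import math

def entropy2(p, n):
    """Binary entropy of a set with p positive and n negative examples (0*log2(0) = 0)."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            result -= (count / total) * math.log2(count / total)
    return result

gain_wind = entropy2(9, 5) - (8 / 14) * entropy2(6, 2) - (6 / 14) * entropy2(3, 3)
print(round(entropy2(9, 5), 3))  # 0.94
print(round(entropy2(6, 2), 3))  # 0.811
print(round(gain_wind, 3))       # 0.048
```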
Gain of Splitting on Humidity and Wind
S = [9+, 5−], Entropy(S) = 0.940

Gain(S, Humidity) = 0.151     Gain(S, Wind) = 0.048
Gain(S, Outlook) = 0.246      Gain(S, Temp) = 0.029

[Figure: Outlook has the highest gain and becomes the root. Its branches
partition the examples as Sunny → [2+, 3−], Overcast → [4+] Play,
Rain → [3+, 2−]; the Sunny and Rain branches still need further work.]
31
Second step
▪ Working on Outlook=Sunny node:
Gain(S_Sunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(S_Sunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Gain(S_Sunny, Temp) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
▪ Humidity provides the best prediction for the target
▪ Let's grow the tree:
▪ add to the tree a successor for each possible value of
Humidity
▪ partition the training samples according to the value of
Humidity
32
One Step Later
Good day for tennis?
[Figure: the tree after one step. Outlook at the root; Sunny → Humidity
(branches High and Normal) over [2+, 3−]; Overcast → Play [4+];
Rain → [3+, 2−], not yet expanded.]
33
Recurse Again
Good day for tennis?
[Figure: the tree so far has Outlook at the root, Sunny → Humidity
(High / Normal), Overcast → Play. We now recurse on the Rain branch,
whose examples are:]
Day  Temp  Humid   Wind    Tennis?
d4   Mild  High    Weak    Yes
d5   Cool  Normal  Weak    Yes
d6   Cool  Normal  Strong  No
d10  Mild  Normal  Weak    Yes
d14  Mild  High    Strong  No
34
One Step Later: Final Tree
Good day for tennis?
[Figure: the final tree. Outlook at the root; Sunny → Humidity (High →
Don't Play, Normal → Play); Overcast → Play [4+]; Rain → Wind (Strong →
Don't Play, Weak → Play).]
35
Pruning Trees
• There is another technique for reducing the
number of attributes used in a tree - pruning
• Two types of pruning:
– Pre-pruning (forward pruning)
– Post-pruning (backward pruning)
36
Prepruning
• In prepruning, we decide during the building process
when to stop adding attributes (possibly based on
their information gain)
38
Subtree Replacement
• Entire subtree is replaced by a single leaf node
[Figure: before-and-after examples of subtree replacement. An entire
subtree, such as the one rooted at C with leaves 1, 2 and 3, is replaced
by a single leaf node.]
42
Reduced Error Pruning Example
[Figure: the tennis tree after reduced error pruning. Outlook at the root;
Sunny → Humidity (High / Normal); Overcast → Play; the subtree under Rain
has been replaced by a single Play leaf. A sketch of the procedure follows
below.]
44
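A hedged sketch of reduced error pruning with subtree replacement, written for the dict-based tree format used in the earlier build_tree sketch. The helper names are ours, and for simplicity the replacement leaf takes the majority label of the held-out validation examples reaching the node.

```python
from collections import Counter

def classify(tree, feats):
    """Follow branches until a leaf label (or None for an unseen value) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(feats[tree["attribute"]])
    return tree

def num_correct(tree, examples):
    return sum(classify(tree, feats) == label for feats, label in examples)

def reduced_error_prune(tree, validation):
    """Return a pruned tree; validation holds the (feature_dict, label) pairs reaching this node."""
    if not isinstance(tree, dict) or not validation:
        return tree
    attr = tree["attribute"]
    children = {}
    for value, child in tree["children"].items():          # prune bottom-up first
        reaching = [(f, lab) for f, lab in validation if f[attr] == value]
        children[value] = reduced_error_prune(child, reaching)
    pruned = {"attribute": attr, "children": children}
    # Candidate subtree replacement: a single leaf with the majority validation label here.
    leaf = Counter(lab for _, lab in validation).most_common(1)[0][0]
    if num_correct(leaf, validation) >= num_correct(pruned, validation):
        return leaf
    return pruned
```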
Problems with Decision Trees
• While decision trees classify quickly, the time needed to build a tree
may be higher than for other types of classifiers
45
Error Propagation
• Since decision trees work by a series of local
decisions, what happens when one of these
local decisions is wrong?
– Every decision from that point on may be wrong
– We may never return to the correct path of the
tree
46
Problems with ID3
• ID3 is not optimal
– Uses expected entropy reduction, not actual
reduction
• Must use discrete (or discretized) attributes
– What if we left for work at 9:30 AM?
– We could break down the attributes into smaller
values…
47
Problems with ID3
• If we broke down leave time to the minute,
we might get something like this:
48
Problems with ID3
• We can use a technique known as discretization
• We choose cut points, such as 9 AM, for splitting continuous attributes
• These cut points generally lie in a subset of boundary points: a
boundary point lies between two adjacent instances, in a list sorted by
the attribute, that have different target attribute values
49
Problems with ID3
• Consider the sorted leave times below, each labelled with its
commute-time class; boundary points occur where the label changes (see
the sketch below)
8:00 (L), 8:02 (L), 8:07 (M), 9:00 (S), 9:20 (S), 9:25 (S), 10:00 (S), 10:02 (M)
50
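A small sketch of locating boundary points for this example in Python: with the instances sorted by the continuous attribute (the list above is already sorted), a boundary lies between every pair of adjacent instances whose target labels differ. The data is copied from the slide; the variable names are ours.

```python
# Leave times paired with their commute-time class (L = Long, M = Medium, S = Short).
times = [("8:00", "L"), ("8:02", "L"), ("8:07", "M"), ("9:00", "S"),
         ("9:20", "S"), ("9:25", "S"), ("10:00", "S"), ("10:02", "M")]

# A boundary point lies between adjacent instances with different labels.
boundaries = [(a, b) for (a, la), (b, lb) in zip(times, times[1:]) if la != lb]
print(boundaries)  # [('8:02', '8:07'), ('8:07', '9:00'), ('10:00', '10:02')]
```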
Issues with Decision Trees
• Missing data
• Real-valued attributes
• Many-valued features
• Evaluation
• Overfitting
51
Missing Data 1
Day  Temp  Humid  Wind    Tennis?
d1   h     h      weak    n
d2   h     h      s       n
d8   m     h      weak    n
d9   c     ?      weak    yes
d11  m     n      s       yes

Strategy: assign the most common value of the attribute at this node, so
d9's missing Humid value becomes h (a sketch follows below).
52
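A minimal sketch of this strategy, assuming the rows above are the examples that reach the current (Outlook = Sunny) node; the dictionary layout and variable names are our own conventions.

```python
from collections import Counter

node_examples = [                       # the examples reaching this node; Humid may be missing
    {"Day": "d1", "Humid": "h"}, {"Day": "d2", "Humid": "h"},
    {"Day": "d8", "Humid": "h"}, {"Day": "d9", "Humid": "?"},
    {"Day": "d11", "Humid": "n"},
]

known = [ex["Humid"] for ex in node_examples if ex["Humid"] != "?"]
most_common = Counter(known).most_common(1)[0][0]    # 'h'
for ex in node_examples:
    if ex["Humid"] == "?":
        ex["Humid"] = most_common                    # d9's Humid becomes 'h'
```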
Missing Data 2
Day  Temp  Humid  Wind    Tennis?
d1   h     h      weak    n
d2   h     h      s       n
d8   m     h      weak    n
d9   c     ?      weak    yes
d11  m     n      s       yes

Alternative strategy: distribute d9 across the branches in proportion to
the observed Humid values, giving fractional counts of [0.75+, 3−] for
Humid = h and [1.25+, 0−] for Humid = n.

Real-Valued Attributes
Wind  25  12  12  11  10  10  8   7   7   7   7   6   6   5
Play  n   y   y   n   y   n   n   y   y   y   y   y   y   n
Candidate thresholds: Wind >= 12 gives Gain = 0.0004, while Wind >= 10
gives Gain = 0.048 (both are checked in the sketch below).
54
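The two gains quoted for the Wind thresholds can be reproduced directly. The sketch below is ours: it evaluates Gain for a candidate threshold t by splitting the examples into Wind >= t and Wind < t and reusing the entropy calculation from earlier.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

wind = [25, 12, 12, 11, 10, 10, 8, 7, 7, 7, 7, 6, 6, 5]
play = ["n", "y", "y", "n", "y", "n", "n", "y", "y", "y", "y", "y", "y", "n"]

def gain_at_threshold(t):
    above = [p for w, p in zip(wind, play) if w >= t]    # examples with Wind >= t
    below = [p for w, p in zip(wind, play) if w < t]     # examples with Wind < t
    weighted = (len(above) * entropy(above) + len(below) * entropy(below)) / len(play)
    return entropy(play) - weighted

print(round(gain_at_threshold(12), 4))  # 0.0005, matching the ~0.0004 quoted above up to rounding
print(round(gain_at_threshold(10), 3))  # 0.048
```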
Many-valued Attributes
• Problem:
– If an attribute has many values, Gain will tend to select it
– Imagine using Date = June_6_1996
• An attribute with so many values
– divides the examples into tiny sets
– the sets are likely to be uniform => high information gain
– but it is a poor predictor on new data
• Penalize these attributes (e.g., by using gain ratio instead of raw gain)
55
Evaluation
• Training accuracy
– How many training instances can be correctly classified based on the
available data?
– It is high when the tree is deep/large, or when there is little
conflict among the training instances.
– However, higher training accuracy does not mean good generalization
• Testing accuracy
– Given a number of new instances, how many of them can we correctly
classify?
– Estimated via cross validation (a sketch follows below)
56
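A minimal k-fold cross-validation sketch in Python. Here train_tree and classify are placeholders for any decision-tree learner and its classifier; they are assumptions for illustration, not functions defined in these slides.

```python
import random

def cross_validate(examples, train_tree, classify, k=5, seed=0):
    """Average test accuracy over k folds; assumes len(examples) >= k."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]                 # k roughly equal folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        tree = train_tree(train)                           # fit on the other k-1 folds
        hits = sum(classify(tree, feats) == label for feats, label in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k                             # estimated testing accuracy
```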
Overfitting Definition
• A decision tree DT is overfit when there exists another tree DT’ such that
– DT has smaller error than DT’ on the training examples, but
– DT has bigger error than DT’ on the test examples
• Causes of overfitting
– Noisy data, or
– Training set is too small
• Solutions
– Reduced error pruning
– Early stopping
– Rule post pruning
57
Summary
• Decision trees can be used to predict the class of new, unseen instances
• The trees are easy to understand
• Decision trees work more efficiently with
discrete attributes
• The trees may suffer from error propagation
58
Strengths
• can generate understandable rules
• perform classification without much
computation
• can handle continuous and categorical
variables
• provide a clear indication of which fields are
most important for prediction or classification
59
Weaknesses
• Not well suited to predicting a continuous target attribute.
• Perform poorly with many classes and a small data set.
• Computationally expensive to train.
– At each node, each candidate splitting field must be sorted
before its best split can be found.
– In some algorithms, combinations of fields are used and a
search must be made for optimal combining weights.
– Pruning algorithms can also be expensive since many
candidate sub-trees must be formed and compared.
• Do not handle non-rectangular (non-axis-parallel) regions well.
60
Assignment 1
• The following attributes of instances are used to determine the best day
for playing tennis
Day Outlook Temp Humid Wind PlayTennis?
• Use Information Gain to select the best attributes, and draw the resultant
decision tree for the above data
NB: Show all the calculations
61
Assignment should be hand written, scanned and submitted as PDF.