
Chapter 3

Decision Trees
What is a Decision Tree?
• Decision trees are powerful and popular tools for
classification and prediction.

• An inductive learning task
  – Use particular facts to make more generalized conclusions
• A predictive model based on a branching series of Boolean tests
  – These smaller Boolean tests are less complex than a one-stage classifier
• Let’s look at a sample decision tree…


2
Predicting Commute Time

Leave At:
  8 AM  → Long
  9 AM  → Accident? (No → Medium, Yes → Long)
  10 AM → Stall?    (No → Short,  Yes → Long)

If we leave at 10 AM and there are no cars stalled on the road, what will our
commute time be?

3
Inductive Learning
• In this decision tree, we made a series of Boolean
decisions and followed the corresponding branch
– Did we leave at 10 AM?
– Did a car stall on the road?
– Is there an accident on the road?

• By answering each of these yes/no questions, we then came to a conclusion on
  how long our commute might take

4
Learning in Decision Trees
• Goal: Build a decision tree for classifying examples as positive or negative
  instances of a concept using supervised learning from a training set.
• A decision tree is a tree where
  – each non-leaf node is associated with an attribute (feature)
  – each leaf node is associated with a classification (+ or -)
  – each arc is associated with one of the possible values of the attribute at
    the node the arc is directed from.
• Generalization: allow for > 2 classes
  – e.g., {sell, hold, buy}

Example tree: Color at the root with branches red, green, blue;
  red   → Size (big → -, small → +)
  green → +
  blue  → Shape (square → Size (big → -, small → +), round → +)
5
Decision Trees as Rules
• We did not have to represent this tree graphically
• We could have represented it as a set of rules. However, this may be much
  harder to read…

6
Decision Tree as a Rule Set

if hour == 8am
  commute time = long
else if hour == 9am
  if accident == yes
    commute time = long
  else
    commute time = medium
else if hour == 10am
  if stall == yes
    commute time = long
  else
    commute time = short

• Notice that not all attributes need to be used in each path of the decision.
• As we will see, some attributes may not even appear in the tree.
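A minimal Python version of this rule set (the function and argument names are
illustrative, not from the slides):

def commute_time(hour, accident, stall):
    # Direct translation of the rule set above.
    if hour == "8am":
        return "long"
    elif hour == "9am":
        return "long" if accident else "medium"
    elif hour == "10am":
        return "long" if stall else "short"

print(commute_time("10am", accident=False, stall=False))   # -> short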

7
How to Create a Decision Tree
• We first make a list of attributes that we
can measure
– These attributes (for now) must be discrete
• We then choose a target attribute that
we want to predict
• Then create an experience table that
lists what we have seen in the past

8
Sample Experience Table
Example  Hour   Weather  Accident  Stall  Commute (target)
D1       8 AM   Sunny    No        No     Long
D2       8 AM   Cloudy   No        Yes    Long
D3       10 AM  Sunny    No        No     Short
D4       9 AM   Rainy    Yes       No     Long
D5       9 AM   Sunny    Yes       Yes    Long
D6       10 AM  Sunny    No        No     Short
D7       10 AM  Cloudy   No        No     Short
D8       9 AM   Rainy    No        No     Medium
D9       9 AM   Sunny    Yes       No     Long
D10      10 AM  Cloudy   Yes       Yes    Long
D11      10 AM  Rainy    No        No     Short
D12      8 AM   Cloudy   Yes       No     Long
D13      9 AM   Sunny    No        No     Medium

Choosing Attributes
• The previous experience table showed 4 attributes: hour, weather, accident
  and stall
• But the decision tree only showed 3
attributes: hour, accident and stall
• Why is that?

10
Choosing Attributes
• Methods for selecting attributes (which will be
described later) show that weather is not a
discriminating attribute
• We use the principle of Occam’s Razor:
– Given a number of competing hypotheses, the
simplest one is preferable

11
Choosing Attributes
• The basic structure of creating a decision tree
is the same for most decision tree algorithms
• The difference lies in how we select the
attributes for the tree
• We will focus on the ID3 algorithm developed
by Ross Quinlan in 1975

12
Decision Tree Algorithms
• The basic idea behind any decision tree algorithm is as follows (a minimal
  sketch in Python appears after this list):
  – Choose the best attribute(s) to split the remaining instances and make that
    attribute a decision node
  – Repeat this process recursively for each child
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances
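A minimal Python sketch of this recursion. The names (build_tree,
choose_attribute) are illustrative; the attribute-selection heuristic itself is
introduced on the following slides and is passed in as a function:

from collections import Counter

def build_tree(examples, attributes, target, choose_attribute):
    # examples: list of dicts mapping attribute names to values
    # choose_attribute(examples, attributes, target) picks the split attribute,
    # e.g. by information gain (see the following slides)
    labels = [e[target] for e in examples]
    if not examples or not attributes or len(set(labels)) == 1:
        # leaf: all instances agree, or nothing is left to split on
        return Counter(labels).most_common(1)[0][0] if labels else None
    a = choose_attribute(examples, attributes, target)
    tree = {a: {}}
    for v in set(e[a] for e in examples):
        subset = [e for e in examples if e[a] == v]
        remaining = [x for x in attributes if x != a]
        tree[a][v] = build_tree(subset, remaining, target, choose_attribute)
    return tree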
13
Identifying the Best Attributes
• Refer back to our original decision tree

Leave At:
  8 AM  → Long
  9 AM  → Accident? (No → Medium, Yes → Long)
  10 AM → Stall?    (No → Short,  Yes → Long)

◼ How did we know to split on "leave at" and then on stall and accident, and
  not on weather?
14
Choosing the Best Attribute
• The key problem is choosing which attribute to split a
given set of examples.
• Intuitively: A good attribute splits the examples into
subsets that are (ideally) all positive or all negative.
• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number
of possible values (fewer branches)
– Most-Values: Choose the attribute with the largest number
of possible values (smaller subsets)
– Max-Gain: Choose the attribute that has the largest expected
information gain, i.e. select attribute that will result in the
smallest expected size of the subtrees rooted at its children.
15
ID3 Heuristic
• To determine the best attribute, we look at
the ID3 heuristic
• ID3 splits attributes based on their entropy.
• Entropy is a measure of disorder…

16
Entropy
• A measure of homogeneity of the set of examples.
• Entropy is minimized when all values of the target attribute are the same.
  – If we know that commute time will always be short, then entropy = 0
  – The entropy is 0 if the outcome is "certain".

17
Entropy
• Entropy is maximized when there is an equal
chance of all values for the target attribute
(i.e. the result is random)
– If commute time = short in 3 instances, medium in
3 instances and long in 3 instances, entropy is
maximized
– The entropy is maximum if we have no knowledge of the system (i.e., every
  outcome is equally likely).

18
Entropy
• Calculation of entropy
– Entropy(S) = ∑(i=1 to l)-|Si|/|S| * log2(|Si|/|S|)
• S = set of examples
• Si = subset of S with value vi under the target attribute
• l = size of the range of the target attribute

19
Entropy - Example
• Given a set S of positive and negative examples of
some target concept (a 2-class problem), the entropy of
set S relative to this binary classification is

E(S) = - p(P)log2 p(P) – p(N)log2 p(N)

• Suppose S has 25 examples, 15 positive and 10 negative [15+, 10-]. Then the
  entropy of S relative to this classification is

  E(S) = −(15/25) log2(15/25) − (10/25) log2(10/25)
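A small Python check of this calculation (a sketch of the two-class entropy;
by convention 0 · log2 0 = 0):

import math

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                 # convention: 0 * log2(0) = 0
            e -= p * math.log2(p)
    return e

print(entropy(15, 10))   # ~0.971 for the [15+, 10-] set above
print(entropy(25, 0))    # 0.0  -> a "certain" outcome
print(entropy(10, 10))   # 1.0  -> maximum for two equally likely classes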


20
ID3
• ID3 splits on the attribute with the lowest weighted entropy
• We calculate the entropy over all values of an attribute as the weighted sum
  of subset entropies, as follows:
  ∑ (i = 1 to k) (|Si| / |S|) Entropy(Si)
  – where k is the number of values of the attribute we are testing
• Equivalently, we can measure information gain (which is largest when the
  weighted subset entropy is smallest) as follows:
  – Entropy(S) − ∑ (i = 1 to k) (|Si| / |S|) Entropy(Si)

21
Information Gain
• Information gain measures the expected reduction in entropy, or uncertainty:

  Gain(S, A) = Entropy(S) − ∑ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)

  – Values(A) is the set of all possible values for attribute A, and Sv is the
    subset of S for which attribute A has value v: Sv = {s in S | A(s) = v}
  – the first term in the equation for Gain is just the entropy of the original
    collection S
  – the second term is the expected value of the entropy after S is partitioned
    using attribute A
22
Information Gain - cont
• It measures how well a given attribute
separates the training examples according to
their target classification
• This measure is used to select among the
candidate attributes at each step while
growing the tree
• It is simply the expected reduction in entropy
caused by partitioning the examples according
to this attribute.
23
Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Entropy Example
• Entropy (disorder) is bad, Homogeneity (Information Gain) is
good
• Let S be a set of examples
• Entropy(S) = -P log2(P) - N log2(N)
– P is proportion of pos example
– N is proportion of neg examples
– 0 log 0 == 0

• Example: S has 9 positive and 5 negative examples

  Entropy([9+, 5-]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
25
Information Gain
• Measure of expected reduction in entropy
• Resulting from splitting along an attribute

Gain(S, A) = Entropy(S) − ∑ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)

where Entropy(S) = −P log2(P) − N log2(N)

26
Gain of Splitting on Wind
Values(Wind) = {Weak, Strong}
S = [9+, 5-]
Sweak = [6+, 2-]
Sstrong = [3+, 3-]

Gain(S, Wind)
  = Entropy(S) − ∑ (v ∈ {Weak, Strong}) (|Sv| / |S|) Entropy(Sv)
  = Entropy(S) − (8/14) Entropy(Sweak) − (6/14) Entropy(Sstrong)
  = 0.940 − (8/14)(0.811) − (6/14)(1.00)
  = 0.048

Day / Wind / Tennis? (from the training table):
d1 weak no, d2 strong no, d3 weak yes, d4 weak yes, d5 weak yes, d6 strong no,
d7 strong yes, d8 weak no, d9 weak yes, d10 weak yes, d11 strong yes,
d12 strong yes, d13 weak yes, d14 strong no

27
Gain of Splitting on Humidity and Wind
Splitting on Humidity (S = [9+, 5-], E = 0.940):
  High:   [3+, 4-], E = 0.985
  Normal: [6+, 1-], E = 0.592
  Gain(S, Humidity) = 0.940 − (7/14)(0.985) − (7/14)(0.592) = 0.151

Splitting on Wind (S = [9+, 5-], E = 0.940):
  Weak:   [6+, 2-], E = 0.811
  Strong: [3+, 3-], E = 1.0
  Gain(S, Wind) = 0.940 − (8/14)(0.811) − (6/14)(1.0) = 0.048

NB: Humidity provides greater information gain than Wind with respect to the
target classification.
28
Gain of splitting on Outlook
Splitting on Outlook (S = [9+, 5-], E = 0.940):
  Sunny:    [2+, 3-], E = 0.971
  Overcast: [4+, 0-], E = 0.0
  Rain:     [3+, 2-], E = 0.971

Gain(S, Outlook) = 0.940 − (5/14)(0.971) − (4/14)(0.0) − (5/14)(0.971) = 0.247

• The information gain on Temperature is calculated the same way
29
Evaluating Attributes
Gains of the four candidate attributes at the root:
  Gain(S, Outlook)  = 0.247
  Gain(S, Humidity) = 0.151
  Gain(S, Wind)     = 0.048
  Gain(S, Temp)     = 0.029

▪ Outlook provides the best prediction for the target
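These gains can be reproduced with a short Python sketch over the 14 training
examples from the earlier table (the function and variable names are
illustrative):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target="Tennis"):
    # Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv)
    base = entropy([r[target] for r in rows])
    subset = lambda v: [r[target] for r in rows if r[attr] == v]
    return base - sum(len(subset(v)) / len(rows) * entropy(subset(v))
                      for v in set(r[attr] for r in rows))

# The 14 training examples (D1..D14) from the table above.
cols = ("Outlook", "Temp", "Humidity", "Wind", "Tennis")
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
S = [dict(zip(cols, row)) for row in data]

for a in ("Outlook", "Humidity", "Wind", "Temp"):
    print(a, round(gain(S, a), 3))
# Outlook 0.247, Humidity 0.152 (0.151 with the slide's rounding),
# Wind 0.048, Temp 0.029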


30
Resulting Tree
▪ Let’s grow the tree:
  ▪ add to the tree a successor for each possible value of Outlook
  ▪ partition the training samples according to the value of Outlook

Outlook:
  Sunny    → Don’t Play [2+, 3-]
  Overcast → Play [4+]
  Rain     → Don’t Play [3+, 2-]
31
Second step
▪ Working on Outlook=Sunny node:
Gain(SSunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(SSunny, Wind)     = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(SSunny, Temp)     = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
▪ Humidity provides the best prediction for the target
▪ Let’s grow the tree:
▪ add to the tree a successor for each possible value of
Humidity
▪ partition the training samples according to the value of
Humidity
32
One Step Later
Good day for tennis?

Outlook:
  Sunny [2+, 3-] → Humidity:
                     High   → Don’t play [3-]
                     Normal → Play [2+]
  Overcast [4+]  → Play
  Rain           → Don’t Play

33
Recurse Again
Good day for tennis?

Outlook:
  Sunny    → Humidity (High / Low)
  Overcast → Play
  Rain     → recurse on the examples below

Day   Temp  Humid  Wind  Tennis?
d4    m     h      weak  yes
d5    c     n      weak  yes
d6    c     n      s     n
d10   m     n      weak  yes
d14   m     h      s     n

34
One Step Later: Final Tree
Good day for tennis?

Outlook:
  Sunny    → Humidity:
               High   → Don’t play [3-]
               Normal → Play [2+]
  Overcast → Play [4+]
  Rain     → Wind:
               Strong → Don’t play [2-]
               Weak   → Play [3+]

35
Pruning Trees
• There is another technique for reducing the
number of attributes used in a tree - pruning
• Two types of pruning:
– Pre-pruning (forward pruning)
– Post-pruning (backward pruning)

36
Prepruning
• In prepruning, we decide during the building process
when to stop adding attributes (possibly based on
their information gain)

• However, this may be problematic – why?
  – Sometimes attributes individually do not contribute much to a decision,
    but combined they may have a significant impact
37
Postpruning

• Postpruning waits until the full decision tree has been built and then prunes
  the attributes
• Two techniques:
– Subtree Replacement
– Subtree Raising

38
Subtree Replacement
• Entire subtree is replaced by a single leaf node

Before:  A → B;  B → C, 4, 5;  C → 1, 2, 3
After:   A → B;  B → 6, 4, 5

• Node 6 replaced the subtree rooted at C (and its leaves 1, 2, 3)
• Generalizes the tree a little more, but may increase accuracy
39
Subtree Raising
• Entire subtree is raised onto another node

Before:  A → B;  B → C, 4, 5;  C → 1, 2, 3
After:   A → C;  C → 1, 2, 3   (the subtree rooted at C is raised to replace B)
• This process is very time consuming
40
Reduced Error Pruning Example
Outlook:
  Sunny    → Humidity: (High → Don’t play, Low → Play)
  Overcast → Play
  Rain     → Wind:     (Strong → Don’t play, Weak → Play)

Validation set accuracy = 0.75


41
Reduced Error Pruning Example
Outlook:
  Sunny    → Don’t play   (the Humidity subtree has been replaced by a leaf)
  Overcast → Play
  Rain     → Wind: (Strong → Don’t play, Weak → Play)

Validation set accuracy = 0.80

42
Reduced Error Pruning Example
Outlook:
  Sunny    → Humidity: (High → Don’t play, Low → Play)
  Overcast → Play
  Rain     → Play   (the Wind subtree has been replaced by a leaf)

Validation set accuracy = 0.70


43
Reduced Error Pruning Example
Outlook:
  Sunny    → Don’t play
  Overcast → Play
  Rain     → Wind: (Strong → Don’t play, Weak → Play)

Use this as the final tree (it has the highest validation set accuracy, 0.80)
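A sketch of reduced error pruning in Python, assuming the nested-dict tree
representation from the earlier build_tree sketch (classify, internal_nodes,
and the other helper names are illustrative). It repeatedly replaces the
internal node whose removal best preserves validation accuracy with the
majority class of the training examples reaching that node:

import copy
from collections import Counter

def classify(tree, example):
    while isinstance(tree, dict):                 # descend until a leaf label
        attr = next(iter(tree))
        tree = tree[attr][example[attr]]
    return tree

def accuracy(tree, data, target):
    return sum(classify(tree, e) == e[target] for e in data) / len(data)

def internal_nodes(tree, path=()):
    # yield the path ((attr, value), ...) of every internal node
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, child in tree[attr].items():
            yield from internal_nodes(child, path + ((attr, value),))

def replace_node(tree, path, leaf):
    # return a copy of the tree with the node at `path` turned into `leaf`
    if not path:
        return leaf
    new = copy.deepcopy(tree)
    node = new
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = leaf
    return new

def reduced_error_prune(tree, train, validation, target):
    improved = True
    while improved:
        improved = False
        best, best_acc = tree, accuracy(tree, validation, target)
        for path in internal_nodes(tree):
            subset = train
            for attr, value in path:              # training examples reaching the node
                subset = [e for e in subset if e[attr] == value]
            if not subset:
                continue
            leaf = Counter(e[target] for e in subset).most_common(1)[0][0]
            candidate = replace_node(tree, path, leaf)
            if accuracy(candidate, validation, target) >= best_acc:
                best = candidate                  # prefer smaller trees on ties
                best_acc = accuracy(candidate, validation, target)
                improved = True
        tree = best
    return tree

This sketch assumes every attribute value seen in the validation set also
appears as a branch in the tree.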

44
Problems with Decision Trees
• While decision trees classify quickly, the time for building a tree may be
  higher than for other types of classifiers

• Decision trees suffer from a problem of errors propagating throughout the tree
– A very serious problem as the number of classes
increases

45
Error Propagation
• Since decision trees work by a series of local
decisions, what happens when one of these
local decisions is wrong?
– Every decision from that point on may be wrong
– We may never return to the correct path of the
tree

46
Problems with ID3
• ID3 is not optimal
– Uses expected entropy reduction, not actual
reduction
• Must use discrete (or discretized) attributes
– What if we left for work at 9:30 AM?
– We could break down the attributes into smaller
values…

47
Problems with ID3
• If we broke down leave time to the minute,
we might get something like this:

Leave At:
  8:02 AM → Long      8:03 AM → Medium     9:05 AM → Short
  9:07 AM → Long      9:09 AM → Long       10:02 AM → Short

• Since entropy is very low for each branch, we end up with n branches and
  n leaves.
• This would not be helpful for predictive modeling.

48
Problems with ID3
• We can use a technique known as discretization
• We choose cut points, such as 9AM for splitting
continuous attributes
• These cut points generally lie in a subset of boundary points, where a
  boundary point is a point between two adjacent instances (in a sorted list)
  that have different target attribute values

49
Problems with ID3
• Consider the attribute commute time

8:00 (L), 8:02 (L), 8:07 (M), 9:00 (S), 9:20 (S), 9:25 (S), 10:00 (S), 10:02 (M)

• When we split at these cut points, we increase the entropy of the resulting
  subsets, so we do not end up with a decision tree that has as many cut points
  as leaves

50
Issues with Decision Trees
• Missing data
• Real-valued attributes
• Many-valued features
• Evaluation
• Overfitting

51
Missing Data 1
Day  Temp  Humid  Wind  Tennis?
d1   h     h      weak  n
d2   h     h      s     n
d8   m     h      weak  n
d9   c     ?      weak  yes
d11  m     n      s     yes

Option 1: assign the most common value of the attribute at this node: ? => h
Option 2: assign the most common value among examples of the same class: ? => n

52
Missing Data 2
Day  Temp  Humid  Wind  Tennis?
d1   h     h      weak  n
d2   h     h      s     n
d8   m     h      weak  n
d9   c     ?      weak  yes
d11  m     n      s     yes

Among the known values, Humid = h 75% of the time and n 25% of the time, so d9
is split fractionally: the Humid = h subset becomes [0.75+, 3-] and the
Humid = n subset becomes [1.25+, 0-].

• Use these fractional counts in the gain calculations
• Further subdivide if other attributes are missing
• Use the same approach to classify a test example with a missing attribute
  – Classification is the most probable classification
  – Summing over the leaves where it got divided
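A sketch of the fractional-count idea in Python (the (features, label, weight)
representation and the helper name distribute_missing are illustrative, not
from the slides):

from collections import Counter

def distribute_missing(examples, attr):
    # examples: list of (features_dict, label, weight); a missing value is None
    observed = Counter()
    for feats, _, w in examples:
        if feats.get(attr) is not None:
            observed[feats[attr]] += w
    total = sum(observed.values())
    out = []
    for feats, label, w in examples:
        if feats.get(attr) is not None:
            out.append((feats, label, w))
        else:                                  # split into fractional copies
            for value, count in observed.items():
                out.append((dict(feats, **{attr: value}), label, w * count / total))
    return out

# d9's Humid is missing; the four known values are h, h, h, n (75% h, 25% n),
# so d9 contributes 0.75 to the Humid = h subset and 0.25 to the Humid = n
# subset, giving the [0.75+, 3-] and [1.25+, 0-] counts above.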
53
Real-valued Features
• Discretize?
• Threshold split using observed values?
Wind:  8  25  7  6  6  10  12  5  7  7  12  10  7  11
Play:  n  n   y  y  y  n   y   n  y  y  y   y   y  n

Sorted by Wind:
Wind:  25  12  12  11  10  10  8  7  7  7  7  6  6  5
Play:  n   y   y   n   y   n   n  y  y  y  y  y  y  n

Candidate splits:  Wind >= 12: Gain = 0.0004     Wind >= 10: Gain = 0.048
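A quick Python check of the two candidate thresholds on the slide
(threshold_gain is an illustrative helper that evaluates the binary test
value >= t):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def threshold_gain(values, labels, t):
    # information gain of the binary test `value >= t` on a numeric attribute
    left = [l for v, l in zip(values, labels) if v >= t]
    right = [l for v, l in zip(values, labels) if v < t]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) \
                           - len(right) / n * entropy(right)

wind = [8, 25, 7, 6, 6, 10, 12, 5, 7, 7, 12, 10, 7, 11]
play = list("nnyyynynyyyyyn")
print(round(threshold_gain(wind, play, 12), 4))  # ~0.0005 (the slide rounds to 0.0004)
print(round(threshold_gain(wind, play, 10), 3))  # 0.048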

54
Many-valued Attributes
• Problem:
– If attribute has many values, Gain will select it
– Imagine using Date = June_6_1996
• So many values
– Divides examples into tiny sets
– Sets are likely uniform => high info gain
– Poor predictor
• Penalize these attributes
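One common penalty (the gain ratio used in C4.5) divides the gain by the
"split information", the entropy of the partition sizes themselves; a sketch:

import math

def split_info(subset_sizes):
    # entropy of the partition itself: many tiny subsets -> large value
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    si = split_info(subset_sizes)
    return gain / si if si > 0 else 0.0

# A Date-like attribute that puts each of 14 examples in its own subset has
# split_info = log2(14) ~ 3.81, which shrinks even a near-maximal gain:
print(round(gain_ratio(0.940, [1] * 14), 3))   # ~0.247
print(round(gain_ratio(0.247, [5, 4, 5]), 3))  # Outlook: ~0.157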

55
Evaluation
• Training accuracy
  – How many training instances can we correctly classify based on the
    available data?
  – It is high when the tree is deep/large, or when there is little conflict in
    the training instances.
  – However, higher training accuracy does not mean good generalization
• Testing accuracy
  – Given a number of new instances, how many of them can we correctly classify?
  – Cross validation
56
Overfitting Definition
• A decision tree DT is overfit when there exists another tree DT’ such that
  – DT has smaller error than DT’ on the training examples, but
  – DT has bigger error than DT’ on the test examples
• Causes of overfitting
– Noisy data, or
– Training set is too small
• Solutions
– Reduced error pruning
– Early stopping
– Rule post pruning
57
Summary
• Decision trees can be used to help predict the
future
• The trees are easy to understand
• Decision trees work more efficiently with
discrete attributes
• The trees may suffer from error propagation

58
Strengths
• can generate understandable rules
• perform classification without much
computation
• can handle continuous and categorical
variables
• provide a clear indication of which fields are
most important for prediction or classification

59
Weaknesses
• Not suitable for predicting continuous attributes.
• Perform poorly with many classes and small data.
• Computationally expensive to train.
– At each node, each candidate splitting field must be sorted
before its best split can be found.
– In some algorithms, combinations of fields are used and a
search must be made for optimal combining weights.
– Pruning algorithms can also be expensive since many
candidate sub-trees must be formed and compared.
• Do not handle non-rectangular regions well.

60
Assignment 1
• The following attributes of instances are used to determine the best day
for playing tennis
Day Outlook Temp Humid Wind PlayTennis?

d1 sunny hot high weak no


d2 sunny hot high strong no
d3 overcast hot high weak yes
d4 rainy medium high weak yes
d5 rainy cool normal weak yes
d6 rainy cool normal strong no
d7 overcast cool normal strong yes
d8 sunny medium high weak no
d9 sunny cool normal weak yes
d10 rainy medium normal weak yes
d11 sunny medium normal strong yes
d12 overcast medium high strong yes
d13 overcast hot normal weak yes
d14 rainy medium high strong no

• Use Information Gain to select the best attributes, and draw the resultant
decision tree for the above data
NB: Show all the calculations
61
Assignment should be handwritten, scanned, and submitted as PDF.
