Decision Trees
Today's lecture
Information Gain for measuring association
between inputs and outputs
Learning a decision tree classifier from data
Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
Copyright 2001, Andrew W. Moore
Data Mining
Data Mining is all about automating the
process of searching for patterns in the
data.
Which patterns are interesting?
That's what we'll look at right now.
And the answer will turn out to be the engine that
drives decision tree learning.
Information Gain
Bits
You are watching a set of independent random samples of X
You see that X has four possible values
Fewer Bits
It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? Use codewords of different lengths, e.g.:
0
10
110
111
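The symbol probabilities are not listed above, but an average of 1.75 bits per symbol is exactly what the prefix code 0, 10, 110, 111 achieves when P = (1/2, 1/4, 1/8, 1/8); a minimal sketch checking this (the symbol names A-D and the probabilities are assumptions, not from the slide):

```python
import math

# Assumed probabilities: these give exactly the quoted 1.75 bits/symbol.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}   # a prefix code

# Expected code length: sum of p(symbol) * codeword length
expected_len = sum(p * len(code[s]) for s, p in probs.items())

# Shannon entropy: the lower bound on average bits per symbol
entropy = -sum(p * math.log2(p) for p in probs.values())

print(expected_len)  # 1.75
print(entropy)       # 1.75 -- this code meets the entropy bound exactly
```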
Fewer Bits
Now suppose X has only three values, coded naively with 2 bits each:
00
01
10
Can you think of a coding that would need only 1.6 bits per symbol on average?
General Case
Suppose X can take one of m values V1 ... Vm, with
P(X=V1) = p1
P(X=V2) = p2
...
P(X=Vm) = pm

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm = -Σj pj log2 pj   (sum over j = 1..m)

H(X) = the entropy of X.

High entropy: a histogram of the frequency distribution of values of X would be flat ..and so the values sampled from it would be hard to predict.
Low entropy: a histogram of the frequency distribution of values of X would have many lows and one or two highs ..and so the values sampled from it would be more predictable.
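The formula is easy to compute directly; a minimal sketch (the helper name `entropy` is my own, not from the slides):

```python
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_j p_j * log2(p_j), with p_j estimated from observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# A flat distribution over four values needs the full 2 bits per symbol...
print(entropy(["A", "B", "C", "D"]))                      # 2.0
# ...while a peaked distribution needs fewer.
print(entropy(["A", "A", "A", "A", "A", "A", "B", "C"]))  # about 1.06
```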
Entropy in a nut-shell
Low Entropy ..the values (locations of soup) sampled entirely from within the soup bowl
High Entropy
X = College Major
Y = Likes Gladiator

  X         Y
  Math      Yes
  History   No
  CS        Yes
  Math      No
  Math      No
  CS        Yes
  History   No
  Math      Yes

Note: H(X) = 1.5, H(Y) = 1
X = College Major, Y = Likes Gladiator
(same eight records as above)

Example:
H(Y|X=Math) = 1
H(Y|X=History) = 0
H(Y|X=CS) = 0
Conditional Entropy
Definition of Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y
       = Σj Prob(X=vj) H(Y | X = vj)

X = College Major, Y = Likes Gladiator

Example:
  vj        Prob(X=vj)   H(Y | X = vj)
  Math      0.5          1
  History   0.25         0
  CS        0.25         0
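Plugging the table into the definition gives the value that reappears on the Information Gain slide below:

H(Y|X) = 0.5 × 1 + 0.25 × 0 + 0.25 × 0 = 0.5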
Information Gain
X = College Major, Y = Likes Gladiator
(same eight records as above)

Definition of Information Gain:
IG(Y|X) = H(Y) - H(Y|X)

Example:
H(Y) = 1
H(Y|X) = 0.5
Thus IG(Y|X) = 1 - 0.5 = 0.5
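A minimal sketch reproducing these numbers for the eight (X, Y) records above; the helper names are my own, not from the slides:

```python
import math
from collections import Counter

records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
           ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(pairs):
    """H(Y|X) = sum over v of Prob(X=v) * H(Y | X=v)."""
    n = len(pairs)
    total = 0.0
    for v in set(x for x, _ in pairs):
        ys_v = [y for x, y in pairs if x == v]   # records where X = v
        total += (len(ys_v) / n) * entropy(ys_v)
    return total

ys = [y for _, y in records]
print(entropy(ys))                                 # H(Y)    = 1.0
print(conditional_entropy(records))                # H(Y|X)  = 0.5
print(entropy(ys) - conditional_entropy(records))  # IG(Y|X) = 0.5
```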
Back to Decision Trees
40 Records
[Table: 40 records of the MPG dataset with discretized attributes — mpg (good/bad), weight and three other numeric attributes binned into low/medium/high, modelyear (70to74/75to78/79to83), and maker (america/asia/europe).]
Suppose we want to predict MPG.
Look at all the information gains.
A Decision Stump
Recursion Step
Take the Original Dataset.. and partition it according to the value of the attribute we split on:
Records in which cylinders = 4
Records in which cylinders = 5
Records in which cylinders = 6
Records in which cylinders = 8
Recursion Step
Then build a tree from each of these partitions:
Records in which cylinders = 4
Records in which cylinders = 5
Records in which cylinders = 6
Records in which cylinders = 8
Base Case One
Don't split a node if all matching records have the same output value.
Base Case Two
Don't split a node if none of the attributes can create multiple non-empty children.
Base Cases
y = a XOR b

  a  b  y
  0  0  0
  0  1  1
  1  0  1
  1  1  0
BuildTree(DataSet, Output)
If all output values are the same in DataSet, return a leaf node that says "predict this unique output".
If all input values are the same, return a leaf node that says "predict the majority output".
Else find attribute X with highest Info Gain.
Suppose X has nX distinct values (i.e. X has arity nX). Create a non-leaf node with nX children; build the i'th child by calling BuildTree(DSi, Output), where DSi contains the records in DataSet for which X = the i'th distinct value of X.
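A runnable sketch of this recursion for categorical inputs; the data layout (a list of dicts plus the name of the output column) and the helper names are my own assumptions, not from the slides:

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(records, attr, output):
    """IG(Y|X) = H(Y) - sum over v of Prob(X=v) * H(Y | X=v)."""
    h_cond = 0.0
    for v in set(r[attr] for r in records):
        sub = [r[output] for r in records if r[attr] == v]
        h_cond += (len(sub) / len(records)) * entropy(sub)
    return entropy([r[output] for r in records]) - h_cond

def build_tree(records, attrs, output):
    ys = [r[output] for r in records]
    # Base Case One: all outputs identical -> leaf predicting that value
    if len(set(ys)) == 1:
        return {"predict": ys[0]}
    # Base Case Two: no attribute can create multiple non-empty children
    if all(len(set(r[a] for r in records)) == 1 for a in attrs):
        return {"predict": Counter(ys).most_common(1)[0][0]}
    # Otherwise split on the attribute with the highest information gain,
    # creating one child per distinct value (the arity of X)
    best = max(attrs, key=lambda a: info_gain(records, a, output))
    children = {v: build_tree([r for r in records if r[best] == v], attrs, output)
                for v in set(r[best] for r in records)}
    return {"split_on": best, "children": children}

# Usage with the College Major / Gladiator records from earlier
data = [{"major": m, "likes": y} for m, y in
        [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
         ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]]
print(build_tree(data, ["major"], "likes"))
```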
MPG Training error
An artificial example
We'll create a training dataset of 32 records.
Output y = copy of e, except a random 25% of the records have y set to the opposite of e.
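A sketch of how such a dataset could be generated. That there are five binary inputs a..e enumerated in all 32 combinations is an assumption (the slides only show 32 records and splits on a and e); treat the details as illustrative:

```python
import itertools
import random

random.seed(0)

# All 32 combinations of five binary inputs a..e (assumed layout)
records = [dict(zip("abcde", bits)) for bits in itertools.product([0, 1], repeat=5)]

# Output y = copy of e ...
for r in records:
    r["y"] = r["e"]
# ... except a random 25% of the records have y set to the opposite of e
for r in random.sample(records, len(records) // 4):
    r["y"] = 1 - r["e"]

print(len(records))                            # 32
print(sum(r["y"] != r["e"] for r in records))  # 8 corrupted records
```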
Suppose we build a full tree (we always split until base case 2):
Root: split on e
  e=0: split on a (a=0, a=1), and so on down the tree
  e=1: split on a (a=0, a=1), and so on down the tree
1/16 of the test set will be correctly predicted for the wrong reasons.
3/16 of the test set will be wrongly predicted because the test record is corrupted.
Suppose instead we stop after the first split, so the tree only splits on e. Each child of the Root gets 16 of the 32 records:
In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0.
In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1.
Overfitting
Avoiding overfitting
Usually we do not know in advance which are the irrelevant variables, and it may depend on the context.

Consider this split
A chi-squared test

                CS      Non CS
  Likes Matrix  15972   145643
  Hates Matrix  3       37
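A sketch of running the test on this 2x2 contingency table with SciPy; the cell placement above was reconstructed from a garbled slide, so treat the counts as illustrative:

```python
from scipy.stats import chi2_contingency

#              CS      Non CS
observed = [[15972, 145643],   # Likes "Matrix"
            [    3,     37]]   # Hates "Matrix"

# Chi-squared test of independence: a small p-value means the apparent
# association between the two variables is unlikely to be explained by chance.
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)
```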
Pruning example
Note the improved test set accuracy compared with the unpruned tree.

MaxPchance
Good news:
Bad news:
Good news:
MaxPchance
The simplest tree structure for which all within-leaf-node disagreements can be explained by chance.
Decreasing MaxPchance: High Bias.
Increasing MaxPchance: High Variance.
Simple proof:
Real-Valued inputs
What should we do if some of the inputs are
real-valued?
[Table: the MPG dataset again, this time with real-valued inputs — mpg (good/bad), plus numeric attributes including cylinders, horsepower, weight, acceleration and modelyear, and maker (america/asia/europe).]
Example with MPG

Computational Issues
The best threshold for a real-valued attribute can be found in time
R log R + 2 R ny
where
R is the number of records in the node under consideration
ny is the arity (number of distinct values) of Y.
How? Sort records according to increasing values of X. Then create a 2 x ny contingency table corresponding to the computation of IG(Y|X:xmin). Then iterate through the records, testing for each threshold between adjacent values of X, incrementally updating the contingency table as you go. For a minor additional speedup, only test between values of Y that differ.
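A sketch of this sort-and-scan procedure; the function names and the use of Counters for the 2 x ny contingency table are my own choices, not from the slides:

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def best_threshold(xs, ys):
    """Scan every threshold between adjacent sorted values of X, keeping the
    2 x ny contingency table (Y counts below / at-or-above the threshold)
    up to date incrementally. Returns (best information gain, best threshold)."""
    pairs = sorted(zip(xs, ys))                  # the R log R sort
    n = len(pairs)
    below = Counter()                            # Y counts for X < threshold
    above = Counter(y for _, y in pairs)         # Y counts for X >= threshold
    h_y = entropy_from_counts(above)             # H(Y) over all records
    best_ig, best_t = 0.0, None
    for i in range(1, n):
        x_prev, y_prev = pairs[i - 1]
        below[y_prev] += 1                       # move one record across the threshold
        above[y_prev] -= 1
        x_cur = pairs[i][0]
        if x_cur == x_prev:
            continue                             # no threshold between equal X values
        h_cond = (i / n) * entropy_from_counts(below) \
               + ((n - i) / n) * entropy_from_counts(above)
        if h_y - h_cond > best_ig:
            best_ig, best_t = h_y - h_cond, (x_prev + x_cur) / 2
    return best_ig, best_t

# Toy usage: the cleanest threshold is between 3 and 4
print(best_threshold([1, 2, 3, 4, 5, 6], ["bad", "bad", "bad", "good", "good", "good"]))
```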
Predicting age from census
Example:
Root
  Attribute equals value
  Attribute doesn't equal value
Conclusions
Easy to understand
Easy to implement
Easy to use
Computationally cheap
Discussion
Instead of using information gain, why not choose the
splitting attribute to be the one with the highest prediction
accuracy?
Instead of greedily, heuristically, building the tree, why not
do a combinatorial search for the optimal tree?
If you build a decision tree to predict wealth, and marital
status, age and gender are chosen as attributes near the
top of the tree, is it reasonable to conclude that those
three inputs are the major causes of wealth?
..would it be reasonable to assume that attributes not
mentioned in the tree are not causes of wealth?
..would it be reasonable to assume that attributes not
mentioned in the tree are not correlated with wealth?
What about multi-attribute splits?