22.InfoTheory DecisionTrees Short
Jihoon Yang
Machine Learning Research Laboratory
Department of Computer Science & Engineering
Sogang University
Email: [email protected]
Decision tree representation
• In general
– Each internal node corresponds to a test (on input instances)
with mutually exclusive and exhaustive outcomes – tests may be
univariate or multivariate
– Each branch corresponds to an outcome of a test
– Each leaf node corresponds to a class label (a minimal code sketch follows below)
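As a rough illustration (not the lecture's own code), here is a minimal Python sketch of this representation: an internal node holds a univariate test on one attribute, its branches are the mutually exclusive and exhaustive outcomes, and leaves hold class labels. The names Node, Leaf, and classify are my own.

```python
class Leaf:
    def __init__(self, label):
        self.label = label           # class label stored at the leaf

class Node:
    def __init__(self, attr, children):
        self.attr = attr             # index of the attribute tested at this node
        self.children = children     # dict: test outcome -> subtree

def classify(tree, instance):
    """Follow the branch matching each test's outcome until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.children[instance[tree.attr]]
    return tree.label

# Example: the tree "if x1 = 1 then A else B" for binary instances (x1, x2)
tree = Node(0, {1: Leaf("A"), 0: Leaf("B")})
print(classify(tree, (1, 0)))        # A
```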
Examples: instances described by two binary attributes x1, x2 with class labels

  Example   x1   x2   Class
  1         1    1    A
  2         0    1    B
  3         1    0    A
  4         0    0    B

[Figure: several different decision trees, each consistent with these four examples, built from tests on x1 and x2 with leaves labeled c=A or c=B]
• There are far too many trees that are consistent with a training set
• Searching for the simplest tree that is consistent with the training
set is typically not computationally feasible
• Solution
– Use a greedy algorithm – not guaranteed to find the simplest
tree – but works well in practice (see the sketch below)
– Or restrict the space of hypotheses to a subset of simple trees
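A hedged sketch of such a greedy procedure, assuming nominal attributes and univariate tests, and choosing each test by the entropy criterion developed in the slides that follow; grow_tree and its helpers are illustrative names, not the lecture's code.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(data, attrs):
    """Greedily grow a tree on (instance, label) pairs: choose the attribute
    whose split yields the lowest weighted entropy, split, and recurse.
    Greedy choice -- no guarantee of finding the simplest consistent tree."""
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not attrs:            # pure node, or no tests left
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class

    def weighted_entropy(a):                          # H(class | attribute a)
        groups = {}
        for x, y in data:
            groups.setdefault(x[a], []).append(y)
        return sum(len(ys) / len(data) * entropy(ys) for ys in groups.values())

    best = min(attrs, key=weighted_entropy)
    partition = {}
    for x, y in data:
        partition.setdefault(x[best], []).append((x, y))
    return (best, {v: grow_tree(subset, [a for a in attrs if a != best])
                   for v, subset in partition.items()})
```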
The information conveyed by an outcome of probability $p_i$ is
$I(p_i) = -\log_2 p_i$ provided $p_i \neq 0$, and $I(p_i) = 0$ otherwise.

Let $P = (p_1, \dots, p_n)$ be a discrete probability distribution.
The entropy of the distribution $P$ is given by
$H(P) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log_2 p_i$

Examples:
$H\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\sum_{i=1}^{2} p_i \log_2 p_i = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit
$H(0, 1) = -\sum_{i=1}^{2} p_i \log_2 p_i = 1 \cdot I(1) + 0 \cdot I(0) = 0$ bits
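A minimal Python sketch of this computation (the entropy helper is my own naming):

```python
import math

def entropy(probs):
    """H(P) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit
print(entropy([0.0, 1.0]))   # 0.0 bits
```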
[Figure: learning setup – Nature provides labeled instances; the training data S is partitioned by a test into subsets S1, S2, ..., Sm]
Training Data
Instances are ordered 3-tuples of attribute values corresponding to
Height (tall, short), Hair (dark, blonde, red), Eye (blue, brown)

  Instance   (Height, Hair, Eye)   Class label
  I1         (t, d, l)             +
  I2         (s, d, l)             +
  I3         (t, b, l)             -
  I4         (t, r, l)             -
  I5         (s, b, l)             -
  I6         (t, b, w)             +
  I7         (t, d, w)             +
  I8         (s, b, w)             +
For the full training set {I1, ..., I8} (5 positive and 3 negative examples):
$H(X) = -\tfrac{3}{8}\log_2\tfrac{3}{8} - \tfrac{5}{8}\log_2\tfrac{5}{8} = 0.954$ bits

Splitting on Height gives $S_t = \{I1, I3, I4, I6, I7\}$ and $S_s = \{I2, I5, I8\}$:
$H(X \mid Height = t) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$ bits
$H(X \mid Height = s) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.918$ bits
$H(X \mid Height) = \tfrac{5}{8} H(X \mid Height = t) + \tfrac{3}{8} H(X \mid Height = s) = \tfrac{5}{8}(0.971) + \tfrac{3}{8}(0.918) = 0.95$ bits

Similarly,
$H(X \mid Hair) = \tfrac{3}{8} H(X \mid Hair = d) + \tfrac{4}{8} H(X \mid Hair = b) + \tfrac{1}{8} H(X \mid Hair = r) = 0.5$ bits
$H(X \mid Eye) = 0.607$ bits
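These conditional entropies can be checked numerically; a sketch, with the eight instances transcribed from the table above and helper names of my own:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# (Height, Hair, Eye) -> class, as in the training data above
data = [(("t", "d", "l"), "+"), (("s", "d", "l"), "+"),
        (("t", "b", "l"), "-"), (("t", "r", "l"), "-"),
        (("s", "b", "l"), "-"), (("t", "b", "w"), "+"),
        (("t", "d", "w"), "+"), (("s", "b", "w"), "+")]

def conditional_entropy(data, attr):
    """H(X | A): entropy of each subset induced by attribute A, weighted by size."""
    groups = {}
    for x, y in data:
        groups.setdefault(x[attr], []).append(y)
    return sum(len(ys) / len(data) * entropy(ys) for ys in groups.values())

for name, a in [("Height", 0), ("Hair", 1), ("Eye", 2)]:
    print(name, round(conditional_entropy(data, a), 3))
# Height 0.951, Hair 0.5, Eye 0.607 -> splitting on Hair gives the largest gain
```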
The resulting decision tree:

  Hair?
    d -> +
    r -> -
    b -> Eye?
           l -> -
           w -> +

Compare the result with Naïve Bayes.
• As we move further away from the root, the data set used to choose
the best test becomes smaller, which leads to poor estimates of entropy
• Types of attributes
– Nominal – values are names
– …
Attribute T   40   48   50   54   60   70
Class         N    N    Y    Y    Y    N

Candidate splits are placed at the midpoints between consecutive values where the class changes: $T \le 49$? and $T \le 65$?

$E(S \mid T \le 49?) = \tfrac{2}{6}(0) + \tfrac{4}{6}\left(-\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.54$ bits
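A sketch of how such candidate thresholds can be enumerated and scored, assuming the usual midpoint convention for placing thresholds:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

values = [40, 48, 50, 54, 60, 70]
classes = ["N", "N", "Y", "Y", "Y", "N"]

# Candidate thresholds: midpoints between consecutive values where the class changes
candidates = [(values[i] + values[i + 1]) / 2
              for i in range(len(values) - 1) if classes[i] != classes[i + 1]]
print(candidates)   # [49.0, 65.0]

for t in candidates:
    left = [c for v, c in zip(values, classes) if v <= t]
    right = [c for v, c in zip(values, classes) if v > t]
    e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
    print(f"T <= {t}: {e:.3f} bits")
# T <= 49: 0.541 bits, T <= 65: 0.809 bits -> T <= 49 is the better test
```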
• For each attribute, find the test which yields the lowest entropy
• Solutions
– Only two-way splits (CART): A = value versus A ≠ value
– Use the gain ratio instead of the information gain:

$GainRatio(S, A) = \dfrac{Gain(S, A)}{SplitInformation(S, A)}$

$SplitInformation(S, A) = -\sum_{i=1}^{|Values(A)|} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$
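A short Python sketch of the gain ratio computation (helper names are my own; computing SplitInformation as the entropy of the attribute's value distribution is simply a compact equivalent of the formula above):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(data, attr):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A) on (instance, label) pairs."""
    labels = [y for _, y in data]
    groups = {}
    for x, y in data:
        groups.setdefault(x[attr], []).append(y)
    gain = entropy(labels) - sum(len(ys) / len(data) * entropy(ys)
                                 for ys in groups.values())
    # SplitInformation(S, A): entropy of the partition of S induced by A;
    # it penalizes attributes that split S into many small subsets
    split_info = entropy([x[attr] for x, _ in data])
    return gain / split_info if split_info > 0 else 0.0
```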
• Boosting
• Simple
• Fast (linear in the size of the tree, the size of the training set,
and the number of attributes)
• Good for generating simple predictive rules from data with lots of
attributes