
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Decision Trees

Today's lecture:
- Information Gain for measuring association between inputs and outputs
- Learning a decision tree classifier from data

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
July 30, 2001
Copyright 2001, Andrew W. Moore

Data Mining
Data Mining is all about automating the process of searching for patterns in the data.
- Which patterns are interesting? Deciding whether a pattern is interesting is what we'll look at right now, and the answer will turn out to be the engine that drives decision tree learning.
- Which might be mere illusions?
- And how can they be exploited?

Information Gain
We will use information theory
- A very large topic, originally used for compressing signals
- But more recently used for data mining
(The topic of Information Gain will now be discussed, but you will find it in a separate Andrew Handout.)

Bits
You are watching a set of independent random samples of X. You see that X has four possible values:

P(X=A) = 1/4   P(X=B) = 1/4   P(X=C) = 1/4   P(X=D) = 1/4

So you might see: BAACBADCDADDDA...

You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11):

0100001001001110110011111100

Fewer Bits
Someone tells you that the probabilities are not equal:

P(X=A) = 1/2   P(X=B) = 1/4   P(X=C) = 1/8   P(X=D) = 1/8

It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?

A = 0   B = 10   C = 110   D = 111

(This is just one of several ways.)
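The 1.75-bit claim can be checked directly: the expected length of the prefix code above equals the entropy of the distribution. A minimal sketch in Python:

```python
from math import log2

# Distribution and prefix-free code from the slide
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

# Expected bits per symbol under this code
expected_len = sum(p * len(code[s]) for s, p in probs.items())

# Entropy of the distribution: the theoretical lower bound
entropy = -sum(p * log2(p) for p in probs.values())

print(expected_len, entropy)  # both are 1.75
```

Because each probability is a power of two, this code meets the entropy bound exactly.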

Fewer Bits
Suppose there are three equally likely values:

P(X=B) = 1/3   P(X=C) = 1/3   P(X=D) = 1/3

Here's a naive coding, costing 2 bits per symbol:

B = 00   C = 01   D = 10

Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.

General Case
Suppose X can have one of m values V1, V2, ... Vm:

P(X=V1) = p1,  P(X=V2) = p2,  ...,  P(X=Vm) = pm

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm
     = -sum_{j=1}^{m} p_j log2 p_j

H(X) = the entropy of X.
- High entropy means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
- Low entropy means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable.

Entropy in a nutshell
- Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
- High entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room.
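The entropy formula above translates directly into code. A small sketch, using the distributions from the preceding slides:

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_j p_j log2 p_j; terms with p_j = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform4 = entropy([0.25, 0.25, 0.25, 0.25])  # flat histogram: 2.0 bits
skewed = entropy([0.5, 0.25, 0.125, 0.125])   # peaks and valleys: 1.75 bits
three = entropy([1/3, 1/3, 1/3])              # three equal values: ~1.58496 bits
```

Note that the uniform distribution gives the highest entropy for a given number of values, matching the "flat histogram" picture above.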

Specific Conditional Entropy H(Y|X=v)
Suppose I'm trying to predict output Y and I have input X:
X = College Major, Y = Likes "Gladiator"

X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

Let's assume this reflects the true probabilities. E.g. from this data we estimate:
P(LikeG = Yes) = 0.5
P(Major = Math & LikeG = No) = 0.25
P(Major = Math) = 0.5
P(LikeG = Yes | Major = History) = 0

Note: H(X) = 1.5 and H(Y) = 1.

Definition of Specific Conditional Entropy:
H(Y|X=v) = the entropy of Y among only those records in which X has value v.

Example (College Major data):
H(Y|X=Math) = 1
H(Y|X=History) = 0
H(Y|X=CS) = 0

Conditional Entropy H(Y|X)

Definition of Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y
= if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X
= expected number of bits to transmit Y if both sides will know the value of X
= sum_j Prob(X=v_j) H(Y | X=v_j)

Conditional Entropy example
X = College Major, Y = Likes "Gladiator" (the same eight records as before)

v_j       Prob(X=v_j)   H(Y | X=v_j)
Math      0.5           1
History   0.25          0
CS        0.25          0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5

Information Gain

Definition of Information Gain:
IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?

IG(Y|X) = H(Y) - H(Y|X)

Example (College Major data):
H(Y) = 1
H(Y|X) = 0.5
Thus IG(Y|X) = 1 - 0.5 = 0.5

Back to Decision Trees
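The whole chain H(Y), H(Y|X), IG(Y|X) can be reproduced from the eight College Major records. A minimal sketch:

```python
from collections import Counter
from math import log2

# The eight (College Major, Likes Gladiator) records from the slides
data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(pairs):
    # H(Y|X) = sum_j P(X=v_j) H(Y | X=v_j)
    n = len(pairs)
    groups = {}
    for x, y in pairs:
        groups.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

H_Y = entropy([y for _, y in data])       # 1.0
H_Y_given_X = conditional_entropy(data)   # 0.5
IG = H_Y - H_Y_given_X                    # 0.5
```

Knowing the major saves half a bit per transmitted answer, exactly the 0.5 computed on the slide.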

Learning Decision Trees
- A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.
- To decide which attribute should be tested first, simply find the one with the highest information gain.
- Then recurse...

A small dataset: Miles Per Gallon
(40 records; the first eight and last ten are shown)

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

From the UCI repository (thanks to Ross Quinlan)

Suppose we want to predict MPG. Look at all the information gains.

A Decision Stump

Recursion Step
Take the original dataset and partition it according to the value of the attribute we split on:
- records in which cylinders = 4
- records in which cylinders = 5
- records in which cylinders = 6
- records in which cylinders = 8

Recursion Step
Build a tree from the records in which cylinders = 4; build another from the records in which cylinders = 5; another from those in which cylinders = 6; and another from those in which cylinders = 8.

Second level of tree
Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia. (Similar recursion in the other cases.)

Base Case One
Don't split a node if all matching records have the same output value.

The final tree

Base Case Two
Don't split a node if none of the attributes can create multiple non-empty children.

Base Case Two: no attributes can distinguish the records.

Base Cases
- Base Case One: If all records in the current data subset have the same output then don't recurse.
- Base Case Two: If all records have exactly the same set of input attributes then don't recurse.

Base Cases: An idea
- Proposed Base Case 3: If all attributes have zero information gain then don't recurse.
- Is this a good idea?

The problem with Base Case 3
y = a XOR b:

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

The information gains: IG(y|a) = 0 and IG(y|b) = 0, so Proposed Base Case 3 would refuse to split and the resulting decision tree is a single root node.

If we omit Base Case 3, the resulting decision tree splits on a and then on b, and classifies every record correctly.
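That both single-attribute gains are exactly zero on the XOR data is easy to confirm. A small sketch:

```python
from collections import Counter
from math import log2

rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # (a, b, y) with y = a XOR b

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(xs, ys):
    # IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v)
    n = len(ys)
    gain = entropy(ys)
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

a = [r[0] for r in rows]
b = [r[1] for r in rows]
y = [r[2] for r in rows]
print(info_gain(a, y), info_gain(b, y))  # 0.0 0.0
```

Knowing a alone (or b alone) tells you nothing about y, yet together they determine y completely, which is exactly why Base Case 3 is a bad idea.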

Basic Decision Tree Building Summarized

BuildTree(DataSet, Output)
- If all output values are the same in DataSet, return a leaf node that says "predict this unique output".
- If all input values are the same, return a leaf node that says "predict the majority output".
- Else find the attribute X with highest Info Gain. Suppose X has nX distinct values (i.e. X has arity nX):
  - Create and return a non-leaf node with nX children.
  - The ith child should be built by calling BuildTree(DSi, Output), where DSi consists of all those records in DataSet for which X = the ith distinct value of X.
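The BuildTree pseudocode maps directly onto a short recursive implementation. A minimal sketch (the tuple-based tree representation and the predict helper are my own illustrative choices, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(records, outputs, attr):
    n = len(records)
    gain = entropy(outputs)
    for v in set(r[attr] for r in records):
        sub = [y for r, y in zip(records, outputs) if r[attr] == v]
        gain -= len(sub) / n * entropy(sub)
    return gain

def build_tree(records, outputs, attrs):
    if len(set(outputs)) == 1:          # Base Case One: unique output
        return outputs[0]
    # Base Case Two: no attribute can create multiple non-empty children
    splittable = [a for a in attrs if len(set(r[a] for r in records)) > 1]
    if not splittable:
        return Counter(outputs).most_common(1)[0][0]  # majority output
    best = max(splittable, key=lambda a: info_gain(records, outputs, a))
    children = {}
    for v in set(r[best] for r in records):  # one child per distinct value
        idx = [i for i, r in enumerate(records) if r[best] == v]
        children[v] = build_tree([records[i] for i in idx],
                                 [outputs[i] for i in idx], attrs)
    return (best, children)

def predict(tree, record):
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[record[attr]]
    return tree
```

On the XOR data from the previous slide, build_tree recovers a two-level tree that classifies all four records correctly; note there is deliberately no "Proposed Base Case 3" here.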

Training Set Error
- For each record, follow the decision tree to see what it would predict.
- For what number of records does the decision tree's prediction disagree with the true value in the database?
- This quantity is called the training set error. The smaller the better.

MPG Training error

Stop and reflect: Why are we doing this learning anyway?
- It is not usually in order to predict the training data's output on data we have already seen.
- It is more commonly in order to predict the output value for future data we have not yet seen.

Warning: A common data mining misperception is that the above two bullets are the only possible reasons for learning. There are at least a dozen others.

Test Set Error
- Suppose we are forward thinking: we hide some data away when we learn the decision tree.
- But once learned, we see how well the tree predicts that data.
- This is a good simulation of what happens when we try to predict future data. And it is called Test Set Error.

MPG Test set error
The test set error is much worse than the training set error. Why?

An artificial example
We'll create a training dataset:

- 32 records
- Five inputs, all bits, are generated in all 32 possible combinations
- Output y = copy of e, except a random 25% of the records have y set to the opposite of e

In our artificial example
Suppose someone generates a test set according to the same method:
- The test set is identical, except that some of the y's will be different.
- Some y's that were corrupted in the training set will be uncorrupted in the testing set.
- Some y's that were uncorrupted in the training set will be corrupted in the test set.

Building a tree with the artificial training set
Suppose we build a full tree (we always split until base case 2): the root splits on e, each branch then splits on a (a=0, a=1), and so on through the remaining bits, down to 32 single-record leaves. 25% of these leaf node labels will be corrupted.

Training set error for our artificial tree

All the leaf nodes contain exactly one record, and so we would have a training set error of zero.

Testing the tree with the test set

                                 1/4 of the tree nodes         3/4 are fine
                                 are corrupted
1/4 of the test set records     1/16 of the test set will      3/16 of the test set will be
are corrupted                   be correctly predicted for     wrongly predicted because the
                                the wrong reasons              test record is corrupted
3/4 are fine                    3/16 of the test predictions   9/16 of the test predictions
                                will be wrong because the      will be fine
                                tree node is corrupted

In total, we expect to be wrong on 3/8 of the test set predictions.
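The 3/8 figure is just the probability that exactly one of the two independent corruptions (the memorised leaf label or the test copy of the record) occurred: 1/4 * 3/4 + 3/4 * 1/4 = 3/8. A quick Monte Carlo check of that arithmetic:

```python
import random

random.seed(0)
trials = 200_000
wrong = 0
for _ in range(trials):
    leaf_corrupted = random.random() < 0.25  # the leaf memorised a flipped y
    test_corrupted = random.random() < 0.25  # the test copy of the record is flipped
    if leaf_corrupted != test_corrupted:     # disagree iff exactly one is flipped
        wrong += 1
error_rate = wrong / trials  # close to 3/8 = 0.375
```

When both copies are flipped the tree is "right for the wrong reasons", which is why the error is 3/8 rather than 1/4 + 1/4.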

What's this example shown us?
- It explains the discrepancy between training and test set error.
- But more importantly, it indicates there's something we should do about it if we want to predict well on future data.

Suppose we had less data
Let's not look at the irrelevant bits:
- These bits are hidden
- 32 records
- Output y = copy of e, except a random 25% of the records have y set to the opposite of e
What decision tree would we learn now?

Without access to the irrelevant bits
The root splits on e (e=0, e=1); these nodes will be unexpandable.
- In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0.
- In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1.

Testing with the test set: almost certainly none of the tree nodes are corrupted.
- 1/4 of the test set records are corrupted, so 1/4 of the test set will be wrongly predicted because the test record is corrupted.
- 3/4 are fine, so 3/4 of the test predictions will be fine.
In total, we expect to be wrong on only 1/4 of the test set predictions.

Overfitting
- Definition: If your machine learning algorithm fits noise (i.e. pays attention to parts of the data that are irrelevant) it is overfitting.
- Fact (theoretical and empirical): If your machine learning algorithm is overfitting then it may perform less well on test set data.

Avoiding overfitting
- Usually we do not know in advance which are the irrelevant variables, and it may depend on the context. For example, if y = a AND b then b is an irrelevant variable only in the portion of the tree in which a = 0.
- But we can use simple statistics to warn us that we might be overfitting.

Consider this split:

A chi-squared test
- Suppose that mpg was completely uncorrelated with maker.
- What is the chance we'd have seen data of at least this apparent level of association anyway?
- By using a particular kind of chi-squared test, the answer is 13.5%.

What is a Chi-Square test?
- Google "chi square" for excellent explanations.
- It takes into account the surprise that a feature generates: ((unsplit-number - split-number)^2 / unsplit-number).
- It gives the probability that the rate you saw was generated by luck of the draw.
- Does likes-Matrix predict CS grad?

               CS      Non-CS
Likes Matrix   15972   145643
Hates Matrix   3       37

               CS      Non-CS
Likes Matrix   21543   145643
Hates Matrix   3       173

Using Chi-squared to avoid overfitting
- Build the full decision tree as before.
- But when you can grow it no more, start to prune:
  - Beginning at the bottom of the tree, delete splits in which pchance > MaxPchance.
  - Continue working your way up until there are no more prunable nodes.
- MaxPchance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise.

Pruning example
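One way to estimate a pchance-style number without chi-squared tables is a permutation test: compute the chi-squared statistic for the split, then see how often randomly shuffled (deliberately uncorrelated) labels produce at least as much apparent association. A sketch; the maker/mpg counts below are made-up toy values, not the slide's MPG data:

```python
import random
from collections import Counter

def chi2_statistic(xs, ys):
    # sum over cells of (observed - expected)^2 / expected
    n = len(xs)
    obs = Counter(zip(xs, ys))
    cx, cy = Counter(xs), Counter(ys)
    return sum((obs[(x, y)] - cx[x] * cy[y] / n) ** 2 / (cx[x] * cy[y] / n)
               for x in cx for y in cy)

# Hypothetical toy data: does maker predict mpg?
maker = ["asia"] * 10 + ["america"] * 20 + ["europe"] * 10
mpg = (["good"] * 6 + ["bad"] * 4 + ["good"] * 5 + ["bad"] * 15
       + ["good"] * 4 + ["bad"] * 6)

observed = chi2_statistic(maker, mpg)
random.seed(0)
shuffled, hits, trials = mpg[:], 0, 2000
for _ in range(trials):
    random.shuffle(shuffled)
    if chi2_statistic(maker, shuffled) >= observed:
        hits += 1
pchance = hits / trials  # estimated P(this much association by luck alone)
```

A large pchance means the split's apparent association is easily produced by chance, which is exactly the condition under which the pruning rule above deletes the split.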

With MaxPchance = 0.1, you will see the following MPG decision tree. Note the improved test set accuracy compared with the unpruned tree.

MaxPchance
- Good news: the decision tree can automatically adjust its pruning decisions according to the amount of apparent noise and data.
- Bad news: the user must come up with a good value of MaxPchance. (Note: Andrew usually uses 0.05, which is his favorite value for any magic parameter.)
- Good news: but with extra work, the best MaxPchance value can be estimated automatically by a technique called cross-validation.

MaxPchance
Technical note (dealt with in other lectures): MaxPchance is a regularization parameter. Expected test set error is high at both extremes: decreasing MaxPchance gives high bias, increasing it gives high variance.

The simplest tree
Note that this pruning is heuristically trying to find the simplest tree structure for which all within-leaf-node disagreements can be explained by chance.

This is not the same as saying "the simplest classification scheme": decision trees are biased to prefer classifiers that can be expressed as trees.

Expressiveness of Decision Trees
- Assume all inputs are Boolean and all outputs are Boolean.
- What is the class of Boolean functions that it is possible to represent by decision trees?
- Answer: all Boolean functions.
Simple proof:
1. Take any Boolean function.
2. Convert it into a truth table.
3. Construct a decision tree in which each row of the truth table corresponds to one path through the decision tree.

Real-Valued inputs
What should we do if some of the inputs are real-valued?

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
:     :          :             :           :       :             :          :
good  4          120           79          2625    18.6          82         america
bad   8          455           225         4425    10            70         america
good  4          107           86          2464    15.5          76         europe
bad   5          131           103         2830    15.9          78         europe

Idea One: Branch on each possible real value.

(Idea One is hopeless: with such a high branching factor we will shatter the dataset and overfit. Note that pchance is 0.222 in the above; if MaxPchance was 0.05 that would end up pruning away to a single root node.)

A better idea: thresholded splits
- Suppose X is real valued.
- Define IG(Y|X:t) as H(Y) - H(Y|X:t), where
  H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
- IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t.
- Then define IG*(Y|X) = max_t IG(Y|X:t)
- For each real-valued attribute, use IG*(Y|X) for assessing its suitability as a split.
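IG*(Y|X) can be found by trying a threshold between each pair of adjacent distinct values of X. A sketch; the horsepower/mpg values are taken from the first rows of the table above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold_ig(xs, ys):
    """Return (IG*(Y|X), argmax t), scanning midpoints of adjacent distinct X values."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    h_y = entropy([y for _, y in pairs])
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        gain = h_y - (len(left) / n * entropy(left)
                      + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

hp = [75, 90, 110, 175, 95, 94, 95, 139]
mpg = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad"]
gain, t = best_threshold_ig(hp, mpg)  # t = 82.5 separates the lone "good" record
```

This naive version recomputes entropies from scratch at each candidate threshold; the next slide's incremental contingency-table trick removes that overhead.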

Computational Issues
You can compute IG*(Y|X) in time R log R + 2 R ny, where:
- R is the number of records in the node under consideration
- ny is the arity (number of distinct values) of Y
How? Sort the records according to increasing values of X. Then create a 2 x ny contingency table corresponding to the computation of IG(Y|X:xmin). Then iterate through the records, testing each threshold between adjacent values of X, incrementally updating the contingency table as you go. For a minor additional speedup, only test between values of Y that differ.

Example with MPG

Unpruned tree using reals

Pruned tree using reals

Binary categorical splits
One of Andrew's favorite tricks: allow splits of the following form. The root has one branch where the attribute equals a value, and one branch where the attribute doesn't equal that value.

Example: Predicting age from census

Predicting gender from census

Predicting wealth from census
Conclusions
- Decision trees are the single most popular data mining tool:
  - Easy to understand
  - Easy to implement
  - Easy to use
  - Computationally cheap
- It's possible to get in trouble with overfitting.
- They do classification: predict a categorical output from categorical and/or real inputs.

What you should know
- What information gain is, and why we use it
- The recursive algorithm for building an unpruned decision tree
- What training and test set errors are
- Why test set errors can be bigger than training set errors
- Why pruning can reduce test set error
- How to exploit real-valued inputs

What we haven't discussed
- It's easy to have real-valued outputs too; these are called Regression Trees*
- Bayesian Decision Trees can take a different approach to preventing overfitting
- Computational complexity (straightforward and cheap)*
- Alternatives to Information Gain for splitting nodes
- How to choose MaxPchance automatically*
- The details of Chi-Squared testing*
- Boosting: a simple way to improve accuracy*
* = discussed in other Andrew lectures

For more information
Two nice books:
- L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
- J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.
Dozens of nice papers, including:
- Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol 2, pages 63-73.
- Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.
Dozens of software implementations are available on the web, for free and commercially, at prices ranging between $50 and $300,000.

Discussion
- Instead of using information gain, why not choose the splitting attribute to be the one with the highest prediction accuracy?
- Instead of greedily, heuristically building the tree, why not do a combinatorial search for the optimal tree?
- If you build a decision tree to predict wealth, and marital status, age and gender are chosen as attributes near the top of the tree, is it reasonable to conclude that those three inputs are the major causes of wealth?
  - Would it be reasonable to assume that attributes not mentioned in the tree are not causes of wealth?
  - Would it be reasonable to assume that attributes not mentioned in the tree are not correlated with wealth?
- What about multi-attribute splits?