Decision Trees
Today's lecture
Information Gain for measuring association
between inputs and outputs
Learning a decision tree classifier from data
Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
Copyright 2001, Andrew W. Moore
Data Mining
Data Mining is all about automating the
process of searching for patterns in the
data.
Which patterns are interesting?
That's what we'll look at right now.
And the answer will turn out to be the engine that
drives decision tree learning.
Information Gain
Bits
You are watching a set of independent random samples of X
You see that X has four possible values
Fewer Bits
It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? Use codewords of different lengths, e.g.:
0
10
110
111
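The symbol probabilities are not listed above, but an average of 1.75 bits per symbol is exactly what the prefix code 0, 10, 110, 111 achieves when P = (1/2, 1/4, 1/8, 1/8); a minimal sketch checking this (the symbol names A-D and the probabilities are assumptions, not from the slide):

```python
import math

# Assumed probabilities: these give exactly the quoted 1.75 bits/symbol.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}   # a prefix code

# Expected code length: sum of p(symbol) * codeword length
expected_len = sum(p * len(code[s]) for s, p in probs.items())

# Shannon entropy: the lower bound on average bits per symbol
entropy = -sum(p * math.log2(p) for p in probs.values())

print(expected_len)  # 1.75
print(entropy)       # 1.75 -- this code meets the entropy bound exactly
```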
Fewer Bits
Now suppose X has only three values, coded naively with 2 bits each:
00
01
10
Can you think of a coding that would need only 1.6 bits per symbol on average?
General Case
Suppose X can take one of m values V1 ... Vm, with
P(X=V1) = p1
P(X=V2) = p2
...
P(X=Vm) = pm

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm = -Σj pj log2 pj   (sum over j = 1..m)

H(X) = the entropy of X.

High entropy: a histogram of the frequency distribution of values of X would be flat ..and so the values sampled from it would be hard to predict.
Low entropy: a histogram of the frequency distribution of values of X would have many lows and one or two highs ..and so the values sampled from it would be more predictable.
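The formula is easy to compute directly; a minimal sketch (the helper name `entropy` is my own, not from the slides):

```python
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_j p_j * log2(p_j), with p_j estimated from observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# A flat distribution over four values needs the full 2 bits per symbol...
print(entropy(["A", "B", "C", "D"]))                      # 2.0
# ...while a peaked distribution needs fewer.
print(entropy(["A", "A", "A", "A", "A", "A", "B", "C"]))  # about 1.06
```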
Entropy in a nut-shell
Low Entropy ..the values (locations of soup) sampled entirely from within the soup bowl
High Entropy
X = College Major
Y = Likes Gladiator

  X         Y
  Math      Yes
  History   No
  CS        Yes
  Math      No
  Math      No
  CS        Yes
  History   No
  Math      Yes

Note: H(X) = 1.5, H(Y) = 1
X = College Major, Y = Likes Gladiator
(same eight records as above)

Example:
H(Y|X=Math) = 1
H(Y|X=History) = 0
H(Y|X=CS) = 0
Conditional Entropy
Definition of Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y
       = Σj Prob(X=vj) H(Y | X = vj)

X = College Major, Y = Likes Gladiator

Example:
  vj        Prob(X=vj)   H(Y | X = vj)
  Math      0.5          1
  History   0.25         0
  CS        0.25         0
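Plugging the table into the definition gives the value that reappears on the Information Gain slide below:

H(Y|X) = 0.5 × 1 + 0.25 × 0 + 0.25 × 0 = 0.5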
Information Gain
X = College Major, Y = Likes Gladiator
(same eight records as above)

Definition of Information Gain:
IG(Y|X) = H(Y) - H(Y|X)

Example:
H(Y) = 1
H(Y|X) = 0.5
Thus IG(Y|X) = 1 - 0.5 = 0.5
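A minimal sketch reproducing these numbers for the eight (X, Y) records above; the helper names are my own, not from the slides:

```python
import math
from collections import Counter

records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
           ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(pairs):
    """H(Y|X) = sum over v of Prob(X=v) * H(Y | X=v)."""
    n = len(pairs)
    total = 0.0
    for v in set(x for x, _ in pairs):
        ys_v = [y for x, y in pairs if x == v]   # records where X = v
        total += (len(ys_v) / n) * entropy(ys_v)
    return total

ys = [y for _, y in records]
print(entropy(ys))                                 # H(Y)    = 1.0
print(conditional_entropy(records))                # H(Y|X)  = 0.5
print(entropy(ys) - conditional_entropy(records))  # IG(Y|X) = 0.5
```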
Back to Decision Trees
40 Records
[Table: 40 records of the MPG dataset with discretized attributes — mpg (good/bad), weight and three other numeric attributes binned into low/medium/high, modelyear (70to74/75to78/79to83), and maker (america/asia/europe).]
Suppose we want to predict MPG.
Look at all the information gains.
A Decision Stump
Recursion Step
Take the Original Dataset.. and partition it according to the value of the attribute we split on:
Records in which cylinders = 4
Records in which cylinders = 5
Records in which cylinders = 6
Records in which cylinders = 8
Recursion Step
Then build a tree from each of these partitions:
Records in which cylinders = 4
Records in which cylinders = 5
Records in which cylinders = 6
Records in which cylinders = 8
Base Case One
Don't split a node if all matching records have the same output value.
Base Case Two
Don't split a node if none of the attributes can create multiple non-empty children.
Base Cases
y = a XOR b

  a  b  y
  0  0  0
  0  1  1
  1  0  1
  1  1  0
BuildTree(DataSet, Output)
If all output values are the same in DataSet, return a leaf node that says "predict this unique output".
If all input values are the same, return a leaf node that says "predict the majority output".
Else find attribute X with highest Info Gain.
Suppose X has nX distinct values (i.e. X has arity nX). Create a non-leaf node with nX children; build the i'th child by calling BuildTree(DSi, Output), where DSi contains the records in DataSet for which X = the i'th distinct value of X.
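A runnable sketch of this recursion for categorical inputs; the data layout (a list of dicts plus the name of the output column) and the helper names are my own assumptions, not from the slides:

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(records, attr, output):
    """IG(Y|X) = H(Y) - sum over v of Prob(X=v) * H(Y | X=v)."""
    h_cond = 0.0
    for v in set(r[attr] for r in records):
        sub = [r[output] for r in records if r[attr] == v]
        h_cond += (len(sub) / len(records)) * entropy(sub)
    return entropy([r[output] for r in records]) - h_cond

def build_tree(records, attrs, output):
    ys = [r[output] for r in records]
    # Base Case One: all outputs identical -> leaf predicting that value
    if len(set(ys)) == 1:
        return {"predict": ys[0]}
    # Base Case Two: no attribute can create multiple non-empty children
    if all(len(set(r[a] for r in records)) == 1 for a in attrs):
        return {"predict": Counter(ys).most_common(1)[0][0]}
    # Otherwise split on the attribute with the highest information gain,
    # creating one child per distinct value (the arity of X)
    best = max(attrs, key=lambda a: info_gain(records, a, output))
    children = {v: build_tree([r for r in records if r[best] == v], attrs, output)
                for v in set(r[best] for r in records)}
    return {"split_on": best, "children": children}

# Usage with the College Major / Gladiator records from earlier
data = [{"major": m, "likes": y} for m, y in
        [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
         ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]]
print(build_tree(data, ["major"], "likes"))
```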
MPG Training error
An artificial example
We'll create a training dataset of 32 records.
Output y = copy of e, except a random 25% of the records have y set to the opposite of e.
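A sketch of how such a dataset could be generated. That there are five binary inputs a..e enumerated in all 32 combinations is an assumption (the slides only show 32 records and splits on a and e); treat the details as illustrative:

```python
import itertools
import random

random.seed(0)

# All 32 combinations of five binary inputs a..e (assumed layout)
records = [dict(zip("abcde", bits)) for bits in itertools.product([0, 1], repeat=5)]

# Output y = copy of e ...
for r in records:
    r["y"] = r["e"]
# ... except a random 25% of the records have y set to the opposite of e
for r in random.sample(records, len(records) // 4):
    r["y"] = 1 - r["e"]

print(len(records))                            # 32
print(sum(r["y"] != r["e"] for r in records))  # 8 corrupted records
```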
Suppose we build a full tree (we always split until base case 2):
Root: split on e
  e=0: split on a (a=0, a=1), and so on down the tree
  e=1: split on a (a=0, a=1), and so on down the tree
1/16 of the test set will be correctly predicted for the wrong reasons.
3/16 of the test set will be wrongly predicted because the test record is corrupted.
Suppose instead we stop after the first split, so the tree only splits on e. Each child of the Root gets 16 of the 32 records:
In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0.
In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1.
Overfitting
Avoiding overfitting
Usually we do not know in advance which are the irrelevant variables, and it may depend on the context.

Consider this split
A chi-squared test

                CS      Non CS
  Likes Matrix  15972   145643
  Hates Matrix  3       37
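A sketch of running the test on this 2x2 contingency table with SciPy; the cell placement above was reconstructed from a garbled slide, so treat the counts as illustrative:

```python
from scipy.stats import chi2_contingency

#              CS      Non CS
observed = [[15972, 145643],   # Likes "Matrix"
            [    3,     37]]   # Hates "Matrix"

# Chi-squared test of independence: a small p-value means the apparent
# association between the two variables is unlikely to be explained by chance.
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)
```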
Pruning example
Note the improved test set accuracy compared with the unpruned tree.

MaxPchance
Good news:
Bad news:
Good news:
MaxPchance
The simplest tree structure for which all within-leaf-node disagreements can be explained by chance.
Decreasing MaxPchance: High Bias.
Increasing MaxPchance: High Variance.
Simple proof:
Real-Valued inputs
What should we do if some of the inputs are
real-valued?
[Table: the MPG dataset again, this time with real-valued inputs — mpg (good/bad), plus numeric attributes including cylinders, horsepower, weight, acceleration and modelyear, and maker (america/asia/europe).]
Example with MPG

Computational Issues
The best threshold for a real-valued attribute can be found in time
R log R + 2 R ny
where
R is the number of records in the node under consideration
ny is the arity (number of distinct values) of Y.
How? Sort records according to increasing values of X. Then create a 2 x ny contingency table corresponding to the computation of IG(Y|X:xmin). Then iterate through the records, testing for each threshold between adjacent values of X, incrementally updating the contingency table as you go. For a minor additional speedup, only test between values of Y that differ.
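A sketch of this sort-and-scan procedure; the function names and the use of Counters for the 2 x ny contingency table are my own choices, not from the slides:

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def best_threshold(xs, ys):
    """Scan every threshold between adjacent sorted values of X, keeping the
    2 x ny contingency table (Y counts below / at-or-above the threshold)
    up to date incrementally. Returns (best information gain, best threshold)."""
    pairs = sorted(zip(xs, ys))                  # the R log R sort
    n = len(pairs)
    below = Counter()                            # Y counts for X < threshold
    above = Counter(y for _, y in pairs)         # Y counts for X >= threshold
    h_y = entropy_from_counts(above)             # H(Y) over all records
    best_ig, best_t = 0.0, None
    for i in range(1, n):
        x_prev, y_prev = pairs[i - 1]
        below[y_prev] += 1                       # move one record across the threshold
        above[y_prev] -= 1
        x_cur = pairs[i][0]
        if x_cur == x_prev:
            continue                             # no threshold between equal X values
        h_cond = (i / n) * entropy_from_counts(below) \
               + ((n - i) / n) * entropy_from_counts(above)
        if h_y - h_cond > best_ig:
            best_ig, best_t = h_y - h_cond, (x_prev + x_cur) / 2
    return best_ig, best_t

# Toy usage: the cleanest threshold is between 3 and 4
print(best_threshold([1, 2, 3, 4, 5, 6], ["bad", "bad", "bad", "good", "good", "good"]))
```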
Predicting age from census
Example:
Root
  Attribute equals value
  Attribute doesn't equal value
Conclusions
Easy to understand
Easy to implement
Easy to use
Computationally cheap
Discussion
Instead of using information gain, why not choose the
splitting attribute to be the one with the highest prediction
accuracy?
Instead of greedily, heuristically, building the tree, why not
do a combinatorial search for the optimal tree?
If you build a decision tree to predict wealth, and marital
status, age and gender are chosen as attributes near the
top of the tree, is it reasonable to conclude that those
three inputs are the major causes of wealth?
..would it be reasonable to assume that attributes not
mentioned in the tree are not causes of wealth?
..would it be reasonable to assume that attributes not
mentioned in the tree are not correlated with wealth?
What about multi-attribute splits?