
Decision Trees

Jihoon Yang
Machine Learning Research Laboratory
Department of Computer Science & Engineering
Sogang University
Email: [email protected]
Decision tree representation

• In the simplest case
  – Each internal node tests on an attribute
  – Each branch corresponds to an attribute value
  – Each leaf node corresponds to a class label

• In general
  – Each internal node corresponds to a test (on input instances) with mutually exclusive and exhaustive outcomes – tests may be univariate or multivariate
  – Each branch corresponds to an outcome of a test
  – Each leaf node corresponds to a class label

Decision tree representation

Data set (attributes x, y; class c):

  Example   x   y   c
  1         1   1   A
  2         0   1   B
  3         1   0   A
  4         0   0   B

Tree 1 – test x at the root:
  x = 1 → c = A
  x = 0 → c = B

Tree 2 – test y at the root, then test x in each branch:
  y = 1:  x = 1 → c = A,  x = 0 → c = B
  y = 0:  x = 1 → c = A,  x = 0 → c = B

• Should we choose Tree 1 or Tree 2? Why?

Learning decision tree classifiers

• Ockham’s razor recommends that we pick the simplest decision tree that is consistent with the training set

• There are far too many trees that are consistent with a training set

• Searching for the simplest tree that is consistent with the training set is not typically computationally feasible

• Solution
  – Use a greedy algorithm – not guaranteed to find the simplest tree – but works well in practice
  – Or restrict the space of hypotheses to a subset of simple trees

Information and Shannon Entropy

• Suppose we have a message that conveys the result of a random experiment with m possible discrete outcomes, with probabilities p_1, p_2, ..., p_m

• The expected information content of such a message is called the entropy of the probability distribution

  H(p_1, p_2, ..., p_m) = \sum_{i=1}^{m} p_i I(p_i)

  I(p_i) = -\log_2 p_i   provided p_i > 0
  I(p_i) = 0             otherwise
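A minimal Python sketch of these two definitions (the helper names `surprisal` and `entropy` are mine, not from the slides):

```python
import math

def surprisal(p):
    """I(p) = -log2(p) for p > 0, and 0 otherwise."""
    return -math.log2(p) if p > 0 else 0.0

def entropy(probs):
    """H(p_1, ..., p_m) = sum_i p_i * I(p_i)."""
    return sum(p * surprisal(p) for p in probs)

print(entropy([0.5, 0.5]))   # 1.0 bit, as computed on the next slide
print(entropy([0.0, 1.0]))   # 0.0 bits
print(entropy([3/8, 5/8]))   # ~0.954 bits, reused in the worked example later
```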

Shannon’s entropy as a measure of information


Let P = (p_1, ..., p_n) be a discrete probability distribution.
The entropy of the distribution P is given by

  H(P) = \sum_{i=1}^{n} p_i \log_2 (1 / p_i) = -\sum_{i=1}^{n} p_i \log_2 (p_i)

  H(1/2, 1/2) = -\sum_{i=1}^{2} p_i \log_2 (p_i) = -(1/2) \log_2 (1/2) - (1/2) \log_2 (1/2) = 1 bit

  H(0, 1) = -\sum_{i=1}^{2} p_i \log_2 (p_i) = 1 \cdot I(1) + 0 \cdot I(0) = 0 bits

The entropy for a binary variable
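This slide's figure is the binary entropy curve. A minimal matplotlib sketch that reproduces it, assuming the plot is H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p) against p (it peaks at 1 bit when p = 0.5 and is 0 at p = 0 or p = 1):

```python
import math
import matplotlib.pyplot as plt

ps = [i / 1000 for i in range(1001)]
H = [0.0 if p in (0.0, 1.0)
     else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
     for p in ps]

plt.plot(ps, H)
plt.xlabel("p (probability of one of the two outcomes)")
plt.ylabel("H(p) in bits")
plt.title("Entropy of a binary variable")
plt.show()
```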

Learning decision tree classifiers

• On average, the information needed to convey the class membership of a random instance drawn from nature is H(P)

  [Figure: Nature → Instance → Classifier → Class label, with the training data S partitioned by class into S1, S2, ..., Sm]

  H(\hat{P}) = -\sum_{i=1}^{m} \hat{p}_i \log_2 (\hat{p}_i) = H(X)

  where \hat{P} is an estimate of P and X is a random variable with distribution \hat{P}

  Si is the multi-set of training examples belonging to class Ci, so \hat{p}_i = |Si| / |S|
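A small sketch of estimating this entropy from the empirical class distribution of a training multiset (the function name and label encoding are my own):

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy of the empirical class distribution p_i = |S_i| / |S|."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# 5 positive and 3 negative examples, matching the worked example below
print(class_entropy(['+', '+', '-', '-', '-', '+', '+', '+']))   # ~0.954 bits
```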


Learning decision tree classifiers

• The task of the learner then is to extract the needed information from the training set and store it in the form of a decision tree for classification

• Information gain based decision tree learner

  Start with the entire training set at the root
  Recursively add nodes to the tree corresponding to tests that yield the greatest expected reduction in entropy (or the largest expected information gain)
  until some termination criterion is met (e.g. the training data at every leaf node has zero entropy)
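A compact sketch of this greedy learner (an ID3-style recursion; the function names and the dataset format – a list of (attribute-dict, label) pairs plus a set of attribute names – are my own assumptions, not code from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(examples, attr):
    """Expected entropy of the class label after testing attr."""
    groups = {}
    for x, y in examples:
        groups.setdefault(x[attr], []).append(y)
    n = len(examples)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def grow_tree(examples, attrs):
    """examples: list of (attribute dict, class label); attrs: set of attribute names."""
    labels = [y for _, y in examples]
    if entropy(labels) == 0 or not attrs:               # termination criterion
        return Counter(labels).most_common(1)[0][0]     # leaf labelled with majority class
    best = min(attrs, key=lambda a: split_entropy(examples, a))   # largest information gain
    tree = {'test': best, 'branches': {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        tree['branches'][value] = grow_tree(subset, attrs - {best})
    return tree
```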

Learning decision tree classifiers – Example

Instances are ordered 3-tuples of attribute values corresponding to
  Height (tall, short)
  Hair (dark, blonde, red)
  Eye (blue, brown)

Training Data
  Instance        Class label
  I1 (t, d, l)    +
  I2 (s, d, l)    +
  I3 (t, b, l)    −
  I4 (t, r, l)    −
  I5 (s, b, l)    −
  I6 (t, b, w)    +
  I7 (t, d, w)    +
  I8 (s, b, w)    +

Learning decision tree classifiers – Example


All eight instances I1, ..., I8 reach the root:

  H(X) = -(3/8) \log_2 (3/8) - (5/8) \log_2 (5/8) = 0.954 bits

Test on Height: branch t holds St = {I1, I3, I4, I6, I7}, branch s holds Ss = {I2, I5, I8}

  H(X | Height = t) = -(3/5) \log_2 (3/5) - (2/5) \log_2 (2/5) = 0.971 bits
  H(X | Height = s) = -(2/3) \log_2 (2/3) - (1/3) \log_2 (1/3) = 0.918 bits

  H(X | Height) = (5/8) H(X | Height = t) + (3/8) H(X | Height = s) = (5/8)(0.971) + (3/8)(0.918) = 0.95 bits

Similarly,

  H(X | Hair) = (3/8) H(X | Hair = d) + (4/8) H(X | Hair = b) + (1/8) H(X | Hair = r) = 0.5 bits

  H(X | Eye) = 0.607 bits

• Hair is the most informative attribute because it yields the largest reduction in entropy; the test on the value of Hair is therefore chosen as the root of the decision tree
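A short check of these numbers, reusing the `entropy` and `split_entropy` helpers sketched above (repeated here so the snippet stands alone; the dictionary encoding of the data is mine):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(examples, attr):
    groups = {}
    for x, y in examples:
        groups.setdefault(x[attr], []).append(y)
    n = len(examples)
    return sum(len(g) / n * entropy(g) for g in groups.values())

# (Height, Hair, Eye) -> class, transcribed from the training data slide
data = [({'Height': 't', 'Hair': 'd', 'Eye': 'l'}, '+'),
        ({'Height': 's', 'Hair': 'd', 'Eye': 'l'}, '+'),
        ({'Height': 't', 'Hair': 'b', 'Eye': 'l'}, '-'),
        ({'Height': 't', 'Hair': 'r', 'Eye': 'l'}, '-'),
        ({'Height': 's', 'Hair': 'b', 'Eye': 'l'}, '-'),
        ({'Height': 't', 'Hair': 'b', 'Eye': 'w'}, '+'),
        ({'Height': 't', 'Hair': 'd', 'Eye': 'w'}, '+'),
        ({'Height': 's', 'Hair': 'b', 'Eye': 'w'}, '+')]

print(entropy([y for _, y in data]))        # ~0.954 bits
for a in ('Height', 'Hair', 'Eye'):
    print(a, split_entropy(data, a))        # ~0.951, 0.500, ~0.607 -> Hair wins
```

Running `grow_tree(data, {'Height', 'Hair', 'Eye'})` from the earlier sketch on this data reproduces the tree on the next slide, with Hair at the root and Eye tested under Hair = b.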
Learning decision tree classifiers – Example

Resulting tree:

  Hair
    d → +
    b → Eye
          l → −
          w → +
    r → −

  (Compare the result with Naïve Bayes)

• In practice, we need some way to prune the tree to avoid overfitting the training data

Learning, generalization, overfitting

• Consider the error of a hypothesis h over
  – Training data: ErrorTrain(h)
  – Entire distribution D of data: ErrorD(h)

• Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

  ErrorTrain(h) < ErrorTrain(h')

  and

  ErrorD(h) > ErrorD(h')

Overfitting in decision tree learning
(e.g. diabetes dataset)

Causes of overfitting

• As we move further away from the root, the data set used to choose the best test becomes smaller → poor estimates of entropy

• Noisy examples can further exacerbate overfitting

Minimizing overfitting

• Use roughly the same size sample at every node to estimate entropy – when there is a large data set from which we can sample

• Stop when a further split fails to yield statistically significant information gain (estimated from a validation set)

• Grow the full tree, then prune

• Minimize size(tree) + size(exceptions(tree))

Rule post-pruning

• Convert the tree to an equivalent set of rules

  IF (Outlook = Sunny) ∧ (Humidity = High)
    THEN PlayTennis = No
  IF (Outlook = Sunny) ∧ (Humidity = Normal)
    THEN PlayTennis = Yes
  ...

Rule post-pruning

1. Convert the tree to an equivalent set of rules

2. Prune each rule independently of the others by dropping one condition at a time, as long as doing so does not reduce estimated accuracy (at the desired confidence level)

3. Sort the final rules in order of lowest to highest error for classifying new instances

• Advantage – can potentially correct bad choices made close to the root

• Post-pruning based on a validation set is the most commonly used method in practice
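A minimal sketch of step 2 – greedily dropping conditions from one rule, judged on a held-out validation set (the rule representation, a list of (attribute, value) conditions plus a predicted class, is my own assumption):

```python
def rule_matches(conditions, x):
    """A rule fires when every (attribute, value) condition holds for instance x."""
    return all(x[a] == v for a, v in conditions)

def rule_accuracy(conditions, label, validation):
    """Accuracy of the rule on the validation instances it covers (1.0 if it covers none)."""
    covered = [(x, y) for x, y in validation if rule_matches(conditions, x)]
    if not covered:
        return 1.0
    return sum(y == label for _, y in covered) / len(covered)

def prune_rule(conditions, label, validation):
    """Greedily drop conditions while estimated accuracy does not decrease."""
    conditions = list(conditions)
    improved = True
    while improved and conditions:
        improved = False
        base = rule_accuracy(conditions, label, validation)
        for c in list(conditions):
            rest = [d for d in conditions if d != c]
            if rule_accuracy(rest, label, validation) >= base:
                conditions = rest
                improved = True
                break
    return conditions
```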

Classification of instances

• Unique classification – possible when each leaf has zero entropy and there are no missing attribute values

• Most likely or probabilistic classification – based on the distribution of classes at a node when there are no missing attributes

Handling different types of attribute values

• Types of attributes
– Nominal – values are names

– Ordinal – values are ordered

– Cardinal (numeric) – values are numbers (hence ordered)

– …

Handling numeric attributes

  Attribute T   40   48   50   54   60   70
  Class         N    N    Y    Y    Y    N

Candidate splits:   T < (48 + 50) / 2 = 49 ?      T < (60 + 70) / 2 = 65 ?

  E(S | T < 49 ?) = (2/6)(0) + (4/6) [ -(3/4) \log_2 (3/4) - (1/4) \log_2 (1/4) ] ≈ 0.54 bits

• Sort instances by value of numeric attribute under consideration

• For each attribute, find the test which yields the lowest entropy

• Greedily choose the best test across all attributes
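A minimal sketch of threshold selection for a single numeric attribute, placing candidate thresholds at midpoints between consecutive sorted values whose class labels differ (function and variable names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, expected entropy) of the best binary split 'value < t'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float('inf'))
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue                      # only class boundaries are candidates
        t = (v1 + v2) / 2
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best[1]:
            best = (t, e)
    return best

print(best_threshold([40, 48, 50, 54, 60, 70], list("NNYYYN")))   # (49.0, ~0.54)
```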


Handling numeric attributes

[Figure: two scatter plots of classes C1 and C2 – one separated by an axis-parallel split, the other by an oblique split]

• Oblique splits cannot be realized by univariate tests

Two-way versus multi-way splits

• Entropy criterion favors many-valued attributes
  – Pathological behavior – what if, in a medical diagnosis data set, social security number is one of the candidate attributes?

• Solutions
  – Only two-way splits (CART): A = value versus A ≠ value
  – Gain ratio (C4.5)

  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

  SplitInformation(S, A) = -\sum_{i=1}^{|Values(A)|} (|S_i| / |S|) \log_2 (|S_i| / |S|)
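A small sketch of this computation (my own illustration of the formula, not C4.5's code; the dataset format matches the earlier snippets):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, attr):
    """Information gain of attr divided by its split information."""
    labels = [y for _, y in examples]
    n = len(examples)
    groups = {}
    for x, y in examples:
        groups.setdefault(x[attr], []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0
```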

See5/C5.0 [1997]

• Boosting

• New data types (e.g. dates), N/A values, variable misclassification costs, attribute pre-filtering

• Unordered rulesets: all applicable rules are found and voted

• Improved scalability: multi-threading, multi-core/CPUs

Summary of decision trees

• Simple

• Fast (linear in size of the tree, linear in the size of the training set,
linear in the number of attributes)

• Produce easy to interpret rules

• Good for generating simple predictive rules from data with lots of
attributes

• Popular extensions: GBDT (Friedman, 2001), XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017)

