Decision Trees
Representation of Concepts
• Concept learning: a concept as a conjunction of attributes
  – e.g., (Sunny AND Hot AND Humid AND Windy) → +
Rectangle learning
• A conjunction corresponds to a single rectangle in attribute space
• A disjunction of conjunctions corresponds to a union of rectangles
[Figure: positive examples enclosed by a single rectangle vs. by a union of rectangles, surrounded by negative examples]
Training Examples
[Table of the 14 PlayTennis training examples D1–D14: attributes Outlook, Temp, Humidity, Wind; label PlayTennis]
• Can be represented by logical formulas
[Figure: decision tree with root Outlook — sunny → Humidity, overcast → Yes, rain → Wind]
Representation in decision trees
Applications of Decision Trees
Decision Trees
[Figure: a given distribution of positive and negative training instances over Attribute 1 and Attribute 2]
Decision Tree Structure
• Decision node = a condition (a test on an attribute)
• Decision leaf = a box: the collection of examples satisfying the conditions along its path
• Alternate splits are possible
[Figure: the (Attribute 1, Attribute 2) plane partitioned at Attribute 1 = 20 and 40 and at Attribute 2 = 30 into boxes of + and − examples, with the corresponding tree]
Decision Tree Construction
• Given a training data set, find the best tree structure
Top-Down Construction
Best attribute to split?
• Compare candidate splits of the training data, e.g. A2 > 30?
• A split can produce pure boxes/nodes (examples of a single class) and mixed boxes/nodes (examples of both classes)
[Figure: the distribution of + and − examples over Attribute 1 and Attribute 2 under different candidate splits, with pure and mixed boxes marked]
Principle of Decision Tree Construction
• In the end we want to form pure leaves
  – Correct classification
• Greedy approach to reach correct classification (a minimal code sketch follows this list)
  1. Initially treat the entire data set as a single box
  2. For each box, choose the split that reduces its impurity (in terms of class labels) by the maximum amount
  3. Split the box with the highest reduction in impurity
  4. Go to Step 2
  5. Stop when all boxes are pure
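A minimal sketch of this greedy idea, under assumptions not stated on the slide: it is the recursive top-down variant that splits every impure box (rather than always splitting the globally best box), rows are dicts of discrete attribute values, and `impurity` is any impurity measure over a list of labels (entropy and the Gini index are defined later in these notes).

```python
from collections import Counter

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def impurity_reduction(rows, labels, attr, impurity):
    """Drop in weighted impurity obtained by splitting on `attr`."""
    n = len(labels)
    after = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        after += len(subset) / n * impurity(subset)
    return impurity(labels) - after

def build_tree(rows, labels, attrs, impurity):
    # Stop when the box is pure (or no attributes remain): emit a leaf label.
    if len(set(labels)) == 1 or not attrs:
        return majority_label(labels)
    # Greedy step: choose the split with the largest impurity reduction.
    best = max(attrs, key=lambda a: impurity_reduction(rows, labels, a, impurity))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best],
                                       impurity)
    return tree
```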
Choosing Best Attribute?
• Consider 64 examples (29+, 35−) and two candidate binary attributes, A1 and A2, each with values t and f
• For two different ways in which A1 and A2 could partition the examples: which attribute is better?
[Figure: the sample [29+, 35−] split by A1 and by A2 into t and f branches, in two scenarios]
Entropy
• A measure of
  – uncertainty
  – purity
  – information content
• Information theory: an optimal-length code assigns (−log2 p) bits to a message having probability p
• S is a sample of training examples
  – p+ is the proportion of positive examples in S
  – p− is the proportion of negative examples in S
• Entropy of S: the average optimal number of bits to encode information about the certainty/uncertainty of S

  Entropy(S) = −p+ log2 p+ − p− log2 p−

• Can be generalized to more than two values
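A minimal sketch of this entropy computation over a list of class labels; the print line uses the (29+, 35−) sample that appears on the next slides as a check.

```python
import math

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i); works for any number of classes."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# The 64-example sample (29+, 35-) used on the next slides:
print(entropy(["+"] * 29 + ["-"] * 35))   # ~0.993
```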
Choosing Best Attribute?
• Consider the 64 examples (29+, 35−), E(S) = 0.993, and compute the entropies of the children under each split
• Which one is better?
  – A1:  t → (25+, 5−), E = 0.650;   f → (4+, 30−), E = 0.522
  – A2:  t → (15+, 19−), E = 0.989;  f → (14+, 16−), E = 0.997
• Which is better?
  – A1:  t → (21+, 5−), E = 0.708;   f → (8+, 30−), E = 0.742
  – A2:  t → (18+, 33−), E = 0.937;  f → (11+, 2−), E = 0.619
Information Gain
• Gain(S, A): reduction in entropy after choosing attribute A

  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

• For the splits above (E(S) = 0.993):
  – A1:  t → (25+, 5−), E = 0.650;   f → (4+, 30−), E = 0.522;   Gain: 0.395
  – A2:  t → (15+, 19−), E = 0.989;  f → (14+, 16−), E = 0.997;  Gain: 0.000
  – A1:  t → (21+, 5−), E = 0.708;   f → (8+, 30−), E = 0.742;   Gain: 0.265
  – A2:  t → (18+, 33−), E = 0.937;  f → (11+, 2−), E = 0.619;   Gain: 0.121
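A minimal sketch of this gain computation, reusing the entropy function sketched earlier; the usage line reproduces the A1 split of (29+, 35−) into (21+, 5−) and (8+, 30−) from the slide.

```python
def information_gain(parent_labels, child_label_lists):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(parent_labels)
    remainder = sum(len(sv) / n * entropy(sv) for sv in child_label_lists)
    return entropy(parent_labels) - remainder

# A1 splits (29+, 35-) into (21+, 5-) and (8+, 30-):
S = ["+"] * 29 + ["-"] * 35
print(information_gain(S, [["+"] * 21 + ["-"] * 5,
                           ["+"] * 8 + ["-"] * 30]))   # ~0.265
```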
Gain function
• Gain is a measure of how much a split can reduce uncertainty
  – Its value lies between 0 and 1
• What is the significance of a gain of 0?
  – Example: a 50/50 split of +/− both before and after discriminating on the attribute's values
• What is the significance of a gain of 1?
  – Example: going from "perfect uncertainty" to perfect certainty after splitting on a perfectly predictive attribute
• Splitting on high-gain attributes finds "patterns" in the training examples relating to attribute values
  – This moves toward a locally minimal representation of the training examples
Training Examples
[Table of the 14 PlayTennis training examples D1–D14: attributes Outlook, Temp, Humidity, Wind; label PlayTennis]
Sort the Training Examples
• Root: [9+, 5−], examples D1,…,D14; split on Outlook
[Figure: Outlook at the root with branches Sunny → ?, Overcast → Yes, Rain → ?]
• Ssunny = {D1, D2, D8, D9, D11}
  – Gain(Ssunny, Humidity) = 0.970
  – Gain(Ssunny, Temp) = 0.570
  – Gain(Ssunny, Wind) = 0.019
Final Decision Tree for Example
• Outlook = Sunny → test Humidity
  – High → No
  – Normal → Yes
• Outlook = Overcast → Yes
• Outlook = Rain → test Wind
  – Strong → No
  – Weak → Yes
Hypothesis Space Search (ID3)
• Hypothesis space (all possible trees) is complete!
– The target function is contained in it
Hypothesis Space Search in Decision Trees
• Conduct a search of the space of decision trees which
can represent all possible discrete functions.
Restriction bias vs. Preference bias
• Restriction bias (or Language bias)
– Incomplete hypothesis space
• Preference (or search) bias
– Incomplete search strategy
• Candidate Elimination has restriction bias
• ID3 has preference bias
• In most cases, we have both a restriction and a
preference bias.
Inductive Bias in ID3
Overfitting the Data
• Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance
  – There may be noise in the training data that the tree is fitting
  – The algorithm might be making decisions based on very little data
• A hypothesis h is said to overfit the training data if there is another hypothesis h' such that h has smaller error than h' on the training data but larger error than h' on the test data
[Figure: accuracy on training data and on testing data as a function of the complexity of the tree]
Overfitting
[Figure: an overfitted partitioning of the training instances over Attribute 1 and Attribute 2]
When to stop splitting further?
[Figure: training instances over Attribute 1 and Attribute 2]
Overfitting in Decision Trees
• Consider adding a noisy training example (it should be +):
Day Outlook Temp Humidity Wind Tennis?
D15 Sunny Hot Normal Strong No
[Figure: decision tree rooted at Outlook]
Overfitting - Example
[Figure: the tree now includes an extra Wind test (Strong → No, Weak → Yes) to accommodate the noisy example D15]
Avoiding Overfitting
Reduced-Error Pruning
• A post-pruning, cross-validation approach (a minimal code sketch follows below)
  – Partition the training data into a "grow" set and a "validation" set
  – Build a complete tree from the "grow" data
  – Until accuracy on the validation set decreases, do:
      For each non-leaf node in the tree:
        Temporarily prune the tree below it; replace it by a majority vote
        Test the accuracy of the hypothesis on the validation set
      Permanently prune the node giving the greatest increase in accuracy on the validation set
• Problem: uses less data to construct the tree
• Sometimes done at the rule level
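A minimal sketch of this procedure under assumptions not in the slides: trees are stored in a hypothetical Node class whose internal nodes already record the majority label of the grow examples that reached them, so replacing a subtree by a majority vote is a constant-time operation. A prune that leaves validation accuracy unchanged is accepted, matching "until accuracy decreases".

```python
class Node:
    def __init__(self, attr=None, children=None, majority=None, label=None):
        self.attr = attr                  # test attribute (internal nodes)
        self.children = children or {}    # attribute value -> child Node
        self.majority = majority          # majority label of grow examples here
        self.label = label                # set only at leaves

def predict(node, row):
    while node.label is None:
        node = node.children.get(row[node.attr]) or Node(label=node.majority)
    return node.label

def accuracy(root, rows, labels):
    return sum(predict(root, r) == y for r, y in zip(rows, labels)) / len(labels)

def internal_nodes(node):
    if node.label is None:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def reduced_error_prune(root, val_rows, val_labels):
    """Greedily prune while validation accuracy does not decrease."""
    while True:
        base = accuracy(root, val_rows, val_labels)
        best_node, best_acc = None, base
        for node in internal_nodes(root):
            saved = (node.attr, node.children, node.label)
            node.attr, node.children, node.label = None, {}, node.majority  # try pruning
            acc = accuracy(root, val_rows, val_labels)
            node.attr, node.children, node.label = saved                    # undo
            if acc >= best_acc:
                best_node, best_acc = node, acc
        if best_node is None:   # every possible prune would hurt validation accuracy
            return root
        best_node.attr, best_node.children, best_node.label = None, {}, best_node.majority
```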
Rule post-pruning
Example of rule post-pruning
• IF (Outlook = Sunny) ^ (Humidity = High)
– THEN PlayTennis = No
• IF (Outlook = Sunny) ^ (Humidity = Normal)
– THEN PlayTennis = Yes
[Figure: the decision tree rooted at Outlook from which these rules are read off]
Extensions of basic algorithm
Continuous Valued Attributes
• Create a discrete attribute from a continuous variable
  – E.g., define a critical Temperature = 82.5
• Candidate thresholds
  – chosen by the gain function
  – can have more than one threshold
  – typically where values change quickly (see the sketch below)

  Temp:     40   48   60   72   80   90
  Tennis?   N    N    Y    Y    Y    N

  Candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85
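A minimal sketch of selecting those candidate thresholds (midpoints between consecutive values where the label changes); the print line reproduces the Temp example above.

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
            if y1 != y2]

# The Temp example above:
print(candidate_thresholds([40, 48, 60, 72, 80, 90], list("NNYYYN")))  # [54.0, 85.0]
```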
Attributes with Many Values
• Problem:
  – If an attribute has many values, Gain will tend to select it (why?)
  – E.g., a birthdate attribute with 365 possible values
Attributes with many values
• Problem: Gain will select the attribute with many values
• One approach: use GainRatio instead

  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

  SplitInformation(S, A) = − Σ_{i=1}^{c} (|S_i| / |S|) · log2(|S_i| / |S|)

  where S_1,…,S_c are the subsets of S induced by the c values of A. SplitInformation is the entropy of the partitioning itself, so it penalizes a higher number of partitions.
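A minimal sketch of these two quantities, reusing the entropy and information_gain sketches from earlier; when SplitInformation is 0 (a single non-empty partition) the ratio is undefined and the sketch simply returns 0.

```python
import math   # entropy() and information_gain() are the sketches given earlier

def split_information(child_label_lists):
    """Entropy of the partitioning itself: -sum_i |S_i|/|S| * log2(|S_i|/|S|)."""
    n = sum(len(s) for s in child_label_lists)
    return -sum(len(s) / n * math.log2(len(s) / n)
                for s in child_label_lists if s)

def gain_ratio(parent_labels, child_label_lists):
    si = split_information(child_label_lists)
    if si == 0:                      # single non-empty partition: ratio undefined
        return 0.0
    return information_gain(parent_labels, child_label_lists) / si
```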
Attributes with Costs
• Consider
  – medical diagnosis: BloodTest has a cost of $150, Pulse a cost of $5
  – robotics: Width-From-1ft has a cost of 23 sec., Width-From-2ft a cost of 10 sec.
• How to learn a consistent tree with low expected cost?
• Replace Gain by a cost-sensitive criterion:
  – Tan and Schlimmer (1990):  Gain²(S, A) / Cost(A)
  – Nunez (1988):  (2^Gain(S, A) − 1) / (Cost(A) + 1)^w,  where w determines the importance of cost
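A minimal sketch of the two criteria as plain functions of a precomputed gain and attribute cost (the names tan_schlimmer and nunez are mine, not from the slide).

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=1.0):
    """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w."""
    return (2 ** gain - 1) / (cost + 1) ** w
```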
Gini Index
• Another sensible measure of impurity (i and j are classes):

  Gini(S) = Σ_{i ≠ j} p_i · p_j  =  1 − Σ_i p_i²
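A minimal sketch using the equivalent 1 − Σ p_i² form; the print line evaluates it on the (29+, 35−) sample used earlier.

```python
def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2 (equivalently the sum over pairs of distinct classes)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(["+"] * 29 + ["-"] * 35))   # ~0.496 for the (29+, 35-) sample
```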
Gini Index for Color
[Figure: example of computing the Gini index when splitting on Color, with branches red, green, and yellow]
Gain of Gini Index
• Analogous to information gain: GiniGain(S, A) = Gini(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Gini(S_v)
Three Impurity Measures
Regression Tree
• Similar to classification
• Use a set of attributes to predict the value (instead
of a class label)
• Instead of computing information gain, compute
the sum of squared errors
• Partition the attribute space into a set of
rectangular subspaces, each with its own predictor
– The simplest predictor is a constant value
Rectilinear Division
• A regression tree is a piecewise constant function of the input attributes
[Figure: the (X1, X2) plane partitioned by thresholds t1–t4 into rectangles r1–r5, and the corresponding tree with one leaf per rectangle]
Growing Regression Trees
• The best split is the one that reduces the variance the most (a minimal sketch follows below):

  I(LS, A) = var_{y | LS}{y} − Σ_a (|LS_a| / |LS|) · var_{y | LS_a}{y}
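A minimal sketch of this variance-reduction score for a candidate partition of the output values.

```python
def variance(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(parent_ys, child_y_lists):
    """I(LS, A) = var(parent) - sum_a |LS_a|/|LS| * var(LS_a)."""
    n = len(parent_ys)
    return variance(parent_ys) - sum(len(c) / n * variance(c)
                                     for c in child_y_lists)
```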
Regression Tree Pruning
• Exactly the same algorithms apply: pre-pruning and post-pruning
• In post-pruning, the tree that minimizes the squared error on the validation set (VS) is selected
• In practice, pruning is more important in regression because full trees are much more complex (often every object has a different output value, and hence the full tree has as many leaves as there are objects in the learning sample)
When Are Decision Trees Useful?
• Advantages
– Very fast: can handle very large datasets with many
attributes
– Flexible: several attribute types, classification and
regression problems, missing values…
– Interpretability: provide rules and attribute importance
• Disadvantages
– Instability of the trees (high variance)
– Not always competitive with other algorithms in terms
of accuracy
History of Decision Tree Research
• Hunt and colleagues in Psychology used full-search decision tree methods to model human concept learning in the 1960s
Summary
• Decision trees are practical for concept learning
• Basic information measure and gain function for best-first search of the space of decision trees
• ID3 procedure
  – Search space is complete
  – Preference for shorter trees
• Overfitting is an important issue with various solutions
• Many variations and extensions possible
Software
• In R:
– Packages tree and rpart
• C4.5:
– http://www.cse.unwe.edu.au/~quinlan
• Weka
– http://www.cs.waikato.ac.nz/ml/weka