Unit 3 Decision Trees-3
Representation of Concepts
• Concept learning: a concept is represented as a conjunction of attribute values
– e.g., (Sunny AND Hot AND Humid AND Windy) → +
Rectangle learning
• Conjunctions correspond to a single axis-parallel rectangle enclosing the positive examples
• Disjunctions of conjunctions correspond to a union of such rectangles
[Figure: a single rectangle around one + region vs. a union of rectangles around several + regions, with − examples outside]
Training Examples
• Can be represented by logical formulas
[Figure: decision tree with root Outlook (sunny / overcast / rain); the sunny branch tests Humidity, overcast predicts Yes, and the rain branch tests Wind]
Representation in decision trees
Applications of Decision Trees
Decision Trees
• Given a distribution of training instances, draw axis-parallel lines to separate the instances of each class
[Figure: 2-D plot over Attribute 1 and Attribute 2 with blocks of + and − examples]
Decision Tree Structure
[Figure: the same 2-D plot; axis-parallel splits separate the instances of each class into boxes]
Decision Tree Structure
• Decision node = a condition on an attribute = a box = the collection of examples satisfying the conditions along the path
• Decision leaf = a terminal box assigned a class label
• Alternate splits are possible
[Figure: the 2-D plot with decision nodes at Attribute 1 = 20 and 40 and at Attribute 2 = 30]
Decision Tree Construction
• Given a training data set, find the best tree structure
Top-Down Construction
Best attribute to split?
[Figure: the 2-D plot over Attribute 1 and Attribute 2; candidate axis-parallel splits are compared]
Best attribute to split?
• A split such as A2 > 30? produces one pure box/node and one mixed box/node
• A pure node is already a pure leaf: no further need to split
• A mixed node must be split further
[Figure: the 2-D plot split at Attribute 2 = 30 into a pure box and a mixed box]
Principle of Decision Tree Construction
• Finally we want to form pure leaves
– Correct classification
• Greedy approach to reach correct classification (a minimal sketch follows this list)
1. Initially treat the entire data set as a single box
2. For each box, choose the split that reduces its impurity (in terms of class labels) by the maximum amount
3. Split the box having the highest reduction in impurity
4. Go back to Step 2
5. Stop when all boxes are pure
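As a concrete illustration of this greedy procedure, here is a minimal Python sketch. It is not the exact algorithm behind these slides: it assumes examples are (attribute-dict, label) pairs with discrete attributes, and it uses the majority-class misclassification rate as the impurity measure (entropy, introduced below, could be substituted).

from collections import Counter

def impurity(examples):
    """Fraction of examples in a box that do not belong to its majority class."""
    if not examples:
        return 0.0
    counts = Counter(label for _, label in examples)
    return 1.0 - counts.most_common(1)[0][1] / len(examples)

def best_split(examples, attributes):
    """Return (attribute, impurity reduction) of the best split of one box."""
    base = impurity(examples)
    best_attr, best_reduction = None, 0.0
    for a in attributes:
        children = {}
        for x, y in examples:
            children.setdefault(x[a], []).append((x, y))
        weighted = sum(len(c) / len(examples) * impurity(c) for c in children.values())
        if len(children) > 1 and base - weighted > best_reduction:
            best_attr, best_reduction = a, base - weighted
    return best_attr, best_reduction

def grow_boxes(examples, attributes):
    """Greedy box splitting: repeatedly split the box whose best split reduces
    impurity the most, until every box is pure (or cannot be split further)."""
    boxes = [examples]                                   # Step 1: one box = whole data set
    while True:
        candidates = [(*best_split(b, attributes), i)    # Step 2: best split per impure box
                      for i, b in enumerate(boxes) if impurity(b) > 0]
        candidates = [c for c in candidates if c[0] is not None]
        if not candidates:                               # Step 5: all boxes pure
            return boxes
        attr, _, i = max(candidates, key=lambda c: c[1])  # Step 3: largest reduction wins
        children = {}
        for x, y in boxes.pop(i):
            children.setdefault(x[attr], []).append((x, y))
        boxes.extend(children.values())                  # Step 4: continue with the new boxes

# Hypothetical toy data: each example is ({attribute: value}, label)
data = [({"A1": "t", "A2": "t"}, "+"), ({"A1": "t", "A2": "f"}, "+"),
        ({"A1": "f", "A2": "t"}, "-"), ({"A1": "f", "A2": "f"}, "-")]
print([Counter(y for _, y in box) for box in grow_boxes(data, ["A1", "A2"])])

On this toy data the single initial box is split on A1 into two pure boxes.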
Choosing Best Attribute?
• Consider a set of examples with 29+ and 35−
• Which binary split is better: one on A1 (branches t / f) or one on A2 (branches t / f)?
[Figure: two pairs of candidate splits of the 64 examples, one on A1 and one on A2 in each pair]
Entropy
• A measure for
– uncertainty
– purity
– information content
• Information theory: an optimal-length code assigns (−log2 p) bits to a message having probability p
• S is a sample of training examples
– p+ is the proportion of positive examples in S
– p− is the proportion of negative examples in S
• Entropy of S: the average optimal number of bits needed to encode the class of an example drawn from S
Entropy(S) = p+(−log2 p+) + p−(−log2 p−) = −p+ log2 p+ − p− log2 p−
• Can be generalized to more than two values
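As a quick check of the definition, a small Python helper (the function name and the example counts are ours; 29+/35− matches the sample used on the next slides):

import math

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p- for a sample with `pos` positive
    and `neg` negative examples (0 log 0 is taken to be 0)."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(29, 35))                # ~0.993, the E(S) used on the following slides
print(entropy(7, 7), entropy(14, 0))  # 1.0 for a 50/50 sample, 0.0 for a pure sample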
Choosing Best Attribute?
• Consider the examples (29+, 35−), with E(S) = 0.993, and compute the entropies of the children of each candidate split
• Which one is better?
– A1: t → 25+, 5− (E = 0.650); f → 4+, 30− (E = 0.522)
– A2: t → 15+, 19− (E = 0.989); f → 14+, 16− (E = 0.997)
• Which is better?
– A1: t → 21+, 5− (E = 0.708); f → 8+, 30− (E = 0.742)
– A2: t → 18+, 33− (E = 0.937); f → 11+, 2− (E = 0.619)
Information Gain
• Gain(S, A): reduction in entropy after choosing attribute A

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

• For the splits above (E(S) = 0.993):
– A1 with children 25+, 5− (E = 0.650) and 4+, 30− (E = 0.522): Gain = 0.395
– A2 with children 15+, 19− (E = 0.989) and 14+, 16− (E = 0.997): Gain = 0.000
– A1 with children 21+, 5− (E = 0.708) and 8+, 30− (E = 0.742): Gain = 0.265
– A2 with children 18+, 33− (E = 0.937) and 11+, 2− (E = 0.619): Gain = 0.121
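The gains above can be reproduced with a short script; this is a sketch with our own helper names, using the child counts from the slide:

import math

def entropy(pos, neg):
    """Two-class entropy of a sample with pos positive and neg negative examples."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def gain(parent, children):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).
    parent and each child are (pos, neg) counts."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

S = (29, 35)                             # E(S) ~ 0.993
print(gain(S, [(25, 5), (4, 30)]))       # first comparison, split A1 (the larger gain)
print(gain(S, [(15, 19), (14, 16)]))     # first comparison, split A2: ~0.000
print(gain(S, [(21, 5), (8, 30)]))       # second comparison, split A1: ~0.265
print(gain(S, [(18, 33), (11, 2)]))      # second comparison, split A2: ~0.121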
Gain function
• Gain is a measure of how much we can
– Reduce uncertainty
▪ Its value lies between 0 and 1
▪ Significance of a gain of 0: e.g., a 50/50 split of +/− both before and after discriminating on the attribute's values
▪ Significance of a gain of 1: e.g., going from "perfect uncertainty" to perfect certainty after splitting on a perfectly predictive attribute
– Find "patterns" in the training examples relating to attribute values
▪ Move toward a locally minimal representation of the training examples
Training Examples
• After splitting on Outlook, the Overcast branch is already pure (Yes); the Sunny and Rain branches still need to be split
• Ssunny = {D1, D2, D8, D9, D11}
– Gain(Ssunny, Humidity) = 0.970
– Gain(Ssunny, Temp) = 0.570
– Gain(Ssunny, Wind) = 0.019
Final Decision Tree for Example
Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes
Hypothesis Space Search (ID3)
• Hypothesis space (all possible trees) is complete!
– The target function is surely included in it
Hypothesis Space Search in Decision Trees
• Conduct a search of the space of decision trees which
can represent all possible discrete functions.
Restriction bias vs. Preference bias
• Restriction bias (or Language bias)
– Incomplete hypothesis space
• Preference (or search) bias
– Incomplete search strategy
• Candidate Elimination has a restriction bias
• ID3 has a preference bias
• In most cases, we have both a restriction and a
preference bias.
Inductive Bias in ID3
Overfitting the Data
• Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance
– There may be noise in the training data that the tree is fitting
– The algorithm might be making decisions based on very little data
• A hypothesis h is said to overfit the training data if there is another hypothesis h′ such that h has smaller error than h′ on the training data but larger error than h′ on the test data
[Plot: accuracy on training data and on testing data as a function of the complexity of the tree]
Overfitting
[Figure: 2-D plot over Attribute 1 and Attribute 2 illustrating an overfitted partition]
When to stop splitting further?
[Figure: 2-D plot over Attribute 1 and Attribute 2]
Overfitting in Decision Trees
• Consider adding a noisy training example (it should be +):
Day   Outlook   Temp   Humidity   Wind     Tennis?
D15   Sunny     Hot    Normal     Strong   No
[Figure: the decision tree rooted at Outlook]
Overfitting - Example
[Figure: the tree grown after adding D15; an extra Wind test (Strong → No, Weak → Yes) appears]
Avoiding Overfitting
Reduced-Error Pruning
• A post-pruning, cross-validation approach (a minimal sketch follows this list)
– Partition the training data into a "grow" set and a "validation" set
– Build a complete tree from the "grow" data
– Until accuracy on the validation set decreases, do:
    For each non-leaf node in the tree:
        Temporarily prune the tree below it; replace it by a majority-vote leaf
        Test the accuracy of the hypothesis on the validation set
    Permanently prune the node giving the greatest increase in accuracy on the validation set
• Problem: uses less data to construct the tree
• Sometimes done at the rule level
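A minimal sketch of this procedure over a simple nested-dict tree. The tree encoding, the stored majority labels, and the toy validation examples are our own assumptions for illustration, not the representation used in the slides.

from copy import deepcopy

# A node is either a class label (leaf) or a dict with the test attribute,
# its children, and the majority label of the grow-set examples at that node.
tree = {"attr": "Outlook", "majority": "Yes", "children": {
    "Sunny":    {"attr": "Humidity", "majority": "No",
                 "children": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"attr": "Wind", "majority": "Yes",
                 "children": {"Strong": "No", "Weak": "Yes"}},
}}

def predict(node, x):
    while isinstance(node, dict):
        node = node["children"].get(x[node["attr"]], node["majority"])
    return node

def accuracy(node, data):
    return sum(predict(node, x) == y for x, y in data) / len(data)

def internal_nodes(node, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["children"].items():
            yield from internal_nodes(child, path + (value,))

def pruned_copy(node, path):
    """Return a copy of the tree with the node at `path` replaced by its majority leaf."""
    new = deepcopy(node)
    parent, last, target = None, None, new
    for value in path:
        parent, last = target, value
        target = target["children"][value]
    leaf = target["majority"]
    if parent is None:
        return leaf
    parent["children"][last] = leaf
    return new

def reduced_error_prune(node, validation):
    """Greedily prune the node whose removal gives the best validation accuracy,
    as long as accuracy does not decrease."""
    while True:
        base = accuracy(node, validation)
        candidates = [pruned_copy(node, p) for p in internal_nodes(node)]
        best = max(candidates, key=lambda t: accuracy(t, validation), default=None)
        if best is None or accuracy(best, validation) < base:
            return node
        node = best

# Hypothetical validation examples: ({attribute: value}, label)
validation = [({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}, "No"),
              ({"Outlook": "Rain", "Humidity": "High", "Wind": "Strong"}, "Yes"),
              ({"Outlook": "Rain", "Humidity": "Normal", "Wind": "Weak"}, "Yes")]
print(reduced_error_prune(tree, validation))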
Rule post-pruning
Example of rule post-pruning
• IF (Outlook = Sunny) ^ (Humidity = High)
– THEN PlayTennis = No
• IF (Outlook = Sunny) ^ (Humidity = Normal)
– THEN PlayTennis = Yes
[Figure: the decision tree rooted at Outlook from which these rules are read]
Extensions of basic algorithm
Continuous Valued Attributes
• Create a discrete attribute from a continuous variable
– E.g., define a critical Temperature = 82.5
• Candidate thresholds
– chosen by the gain function
– there can be more than one threshold
– typically placed where the class labels change (see the sketch after the table)

Temp     40   48   60   72   80   90
Tennis?  N    N    Y    Y    Y    N

– Candidate thresholds here: (48+60)/2 = 54 and (80+90)/2 = 85
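A small sketch of how the candidate thresholds and their gains could be computed for the Temp example above (helper names are ours):

import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain_for_threshold(values, labels, t):
    """Information gain of the split value <= t vs. value > t."""
    left = [y for v, y in zip(values, labels) if v <= t]
    right = [y for v, y in zip(values, labels) if v > t]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

temp =   [40,  48,  60,  72,  80,  90]
tennis = ["N", "N", "Y", "Y", "Y", "N"]

# Candidate thresholds: midpoints between consecutive values where the class changes
candidates = [(a + b) / 2
              for (a, ya), (b, yb) in zip(zip(temp, tennis), zip(temp[1:], tennis[1:]))
              if ya != yb]
print(candidates)                                           # [54.0, 85.0]
for t in candidates:
    print(t, round(gain_for_threshold(temp, tennis, t), 3)) # 54.0 gives the larger gain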
Attributes with Many Values
• Problem:
– If an attribute has many values, Gain will tend to select it (why?)
– E.g., a birthdate attribute with 365 possible values
Attributes with many values
• Problem: Gain will select the attribute with many values
• One approach: use GainRatio instead

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) = − Σ_{i=1}^{c} (|S_i| / |S|) log2(|S_i| / |S|)

– SplitInformation is the entropy of the partitioning itself, so it penalizes a higher number of partitions
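A small sketch of SplitInformation and GainRatio on subset sizes (function names and the example sizes are ours):

import math

def entropy_from_counts(counts):
    """Entropy of a distribution given by a list of counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def split_information(subset_sizes):
    """SplitInformation(S, A): entropy of the partition sizes themselves."""
    return entropy_from_counts(subset_sizes)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(subset_sizes)

# An attribute splitting 64 examples into 2 subsets vs. one splitting them into 64 singletons:
print(split_information([30, 34]))   # ~1.0
print(split_information([1] * 64))   # 6.0 -> large penalty for an attribute with many values
print(gain_ratio(0.993, [1] * 64))   # a birthdate-like attribute's gain is divided by 6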
Attributes with Costs
• Consider
– medical diagnosis: BloodTest has a cost of $150, Pulse has a cost of $5
– robotics: Width-From-1ft has a cost of 23 sec., Width-From-2ft a cost of 10 sec.
• How to learn a consistent tree with low expected cost?
• Replace gain by
– Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
– Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost
Gini Index for Color
[Figure: a split Color? with branches red, green, and yellow]
Gain of Gini Index
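The Gini index can play the same role as entropy when computing the gain of a split. A minimal sketch, with made-up class counts for a Color? split:

def gini(counts):
    """Gini index of a node with the given class counts: 1 - sum_k p_k^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent_counts, children_counts):
    """Reduction in Gini impurity obtained by a split (e.g., on Color)."""
    n = sum(parent_counts)
    return gini(parent_counts) - sum(sum(child) / n * gini(child)
                                     for child in children_counts)

# Hypothetical class counts (positive, negative) at the parent and in the
# red / green / yellow branches of a Color? split:
parent = (10, 10)
children = [(6, 1), (2, 5), (2, 4)]
print(gini(parent))                 # 0.5
print(gini_gain(parent, children))  # > 0: the split reduces impurity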
Regression Tree
• Similar to classification
• Use a set of attributes to predict the value (instead
of a class label)
• Instead of computing information gain, compute
the sum of squared errors
• Partition the attribute space into a set of
rectangular subspaces, each with its own predictor
– The simplest predictor is a constant value
Rectilinear Division
• A regression tree is a piecewise constant function of the input attributes
[Figure: the (X1, X2) input space partitioned into rectangles r1–r5 by thresholds t1–t4, and the corresponding tree: root X1 ≤ t1; the left branch tests X2 ≤ t2 (leaves r1, r2); the right branch tests X1 ≤ t3 (leaf r3) and then X2 ≤ t4 (leaves r4, r5)]
Growing Regression Trees
• The best split is the one that reduces the variance the most:

I(LS, A) = var_{y|LS}{y} − Σ_a (|LS_a| / |LS|) · var_{y|LS_a}{y}

where LS is the learning sample reaching the node and LS_a is the subset of LS for which attribute A takes value a
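A minimal sketch of this variance-reduction score for a single discrete attribute (names and the toy learning sample are ours):

def variance(ys):
    """Population variance of a list of output values."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(examples, attribute):
    """I(LS, A) = var(y over LS) - sum_a |LS_a|/|LS| * var(y over LS_a),
    where examples are ({attribute: value}, y) pairs."""
    ys = [y for _, y in examples]
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[attribute], []).append(y)
    weighted = sum(len(s) / len(examples) * variance(s) for s in subsets.values())
    return variance(ys) - weighted

# Hypothetical learning sample: a binary attribute that separates low from high outputs
LS = [({"A": "t"}, 1.0), ({"A": "t"}, 1.2), ({"A": "f"}, 5.0), ({"A": "f"}, 5.4)]
print(variance_reduction(LS, "A"))   # large reduction: "A" is a good split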
Regression Tree Pruning
• Exactly the same algorithms apply: pre-pruning and post-pruning
• In post-pruning, the tree that minimizes the squared error on the validation set is selected
• In practice, pruning is more important in regression because full trees are much more complex (often every object has a different output value, so the full tree has as many leaves as there are objects in the learning sample)
When Are Decision Trees Useful?
• Advantages
– Very fast: can handle very large datasets with many
attributes
– Flexible: several attribute types, classification and
regression problems, missing values…
– Interpretability: provide rules and attribute importance
• Disadvantages
– Instability of the trees (high variance)
– Not always competitive with other algorithms in terms
of accuracy
History of Decision Tree Research
• Hunt and colleagues in psychology used full-search decision tree methods to model human concept learning in the 1960s
Summary
• Decision trees are practical for concept learning
• A basic information measure and gain function for best-first search of the space of decision trees
• ID3 procedure
– search space is complete
– Preference for shorter trees
• Overfitting is an important issue with various solutions
• Many variations and extensions possible
Software
• In R:
– Packages tree and rpart
• C4.5:
– http://www.cse.unwe.edu.au/~quinlan
• Weka
– http://www.cs.waikato.ac.nz/ml/weka