What Did We Learn?: Learning Problem

Entropy(S) = - p+ log(p+) - p- log(p-)

In general, when p_i is the fraction of examples labeled i:

Entropy({p_1, p_2, ..., p_k}) = - Σ_{i=1}^{k} p_i log(p_i)

Entropy can be viewed as the number of bits required, on average, to encode the
class of labels (logarithms are base 2). If the probability for + is 0.5, a single bit is
required for each example; if it is 0.8, fewer than 1 bit is needed on average.
Entropy

Entropy (impurity, disorder) of a set of examples, S, relative to a binary
classification is:

Entropy(S) = - p+ log(p+) - p- log(p-)

where p+ is the proportion of positive examples in S and p- is the proportion of
negatives.

If all the examples belong to the same category: Entropy = 0
If all the examples are equally mixed (0.5, 0.5): Entropy = 1

[Figure: three example sets of +/- labels with different degrees of mixing]

High Entropy -> high level of uncertainty
Low Entropy -> no uncertainty
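To make the definition concrete, here is a minimal Python sketch (not from the original slides; the function name and the list-of-counts input are my own choices):

import math

def entropy(counts):
    """Entropy, in bits, of a label distribution given as a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    result = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            result -= p * math.log2(p)
    return result

print(entropy([10, 0]))   # all examples in one class -> 0.0
print(entropy([5, 5]))    # equally mixed (0.5, 0.5)  -> 1.0
print(entropy([8, 2]))    # p+ = 0.8 -> about 0.72 bits per example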
Information Gain

The information gain of an attribute a is the expected reduction in entropy caused
by partitioning on this attribute:

Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|S_v| / |S|) Entropy(S_v)

where S_v is the subset of S for which attribute a has value v, and the entropy of
partitioning the data is calculated by weighting the entropy of each partition by its
size relative to the original set.

Partitions of low entropy (imbalanced splits) lead to high gain.

Go back to check which of the A, B splits is better.

[Figure: the Outlook attribute splitting the examples into Sunny, Overcast, and Rain branches]
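As an illustration (again not from the slides), a Python sketch of Gain(S, a); each example is assumed to be a dictionary of attribute values plus a "label" key, which is my representation rather than the course's:

import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Expected reduction in entropy from partitioning `examples` on `attribute`."""
    base = entropy_of([e["label"] for e in examples])
    partitions = defaultdict(list)
    for e in examples:
        partitions[e[attribute]].append(e["label"])
    weighted = sum(len(part) / len(examples) * entropy_of(part)
                   for part in partitions.values())
    return base - weighted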
An Illustrative Example
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
An Illustrative Example (II)

The full data set above contains 9 positive and 5 negative examples (9+, 5-), so

Entropy(S) = - (9/14) log(9/14) - (5/14) log(5/14) = 0.94
Splitting S (9+, 5-, E = .94) on Humidity and on Wind (only the Humidity, Wind,
and Play Tennis columns of the table are needed):

Humidity = High:   3+, 4-   E = .985
Humidity = Normal: 6+, 1-   E = .592

Wind = Weak:   6+, 2-   E = .811
Wind = Strong: 3+, 3-   E = 1.0

Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|S_v| / |S|) Entropy(S_v)

Gain(S, Humidity) = .94 - (7/14) 0.985 - (7/14) 0.592 = 0.151
Gain(S, Wind)     = .94 - (8/14) 0.811 - (6/14) 1.0   = 0.048
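The numbers above can be checked with a few lines of Python (this script is mine; it encodes only the Humidity, Wind, and Play Tennis columns of the table):

import math
from collections import Counter, defaultdict

# (Humidity, Wind, PlayTennis) for days 1..14, copied from the table above.
data = [("High","Weak","No"), ("High","Strong","No"), ("High","Weak","Yes"),
        ("High","Weak","Yes"), ("Normal","Weak","Yes"), ("Normal","Strong","No"),
        ("Normal","Strong","Yes"), ("High","Weak","No"), ("Normal","Weak","Yes"),
        ("Normal","Weak","Yes"), ("Normal","Strong","Yes"), ("High","Strong","Yes"),
        ("Normal","Weak","Yes"), ("High","Strong","No")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    parts = defaultdict(list)
    for r in rows:
        parts[r[col]].append(r[-1])
    return entropy(labels) - sum(len(p) / len(rows) * entropy(p) for p in parts.values())

print(round(entropy([r[-1] for r in data]), 3))  # 0.94
print(round(gain(data, 0), 3))  # Gain(S, Humidity): 0.152 (0.151 on the slide, which rounds the entropies first)
print(round(gain(data, 1), 3))  # Gain(S, Wind): 0.048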
An Illustrative Example (III)

Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029
Gain(S, Outlook)     = 0.246

Outlook has the highest gain and is chosen as the root attribute.
Splitting on Outlook:

Sunny:    days 1, 2, 8, 9, 11    (2+, 3-)  -> ?
Overcast: days 3, 7, 12, 13      (4+, 0-)  -> Yes
Rain:     days 4, 5, 6, 10, 14   (3+, 2-)  -> ?

Continue until:
Every attribute is included in the path, or
All examples in the leaf have the same label.
An Illustrative Example (IV)

The Sunny branch (days 1, 2, 8, 9, 11; 2+, 3-; E = .97) still needs a test:

Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

Gain(S_sunny, Humidity) = .97 - (3/5) 0 - (2/5) 0    = .97
Gain(S_sunny, Temp)     = .97 - 0 - (2/5) 1          = .57
Gain(S_sunny, Wind)     = .97 - (2/5) 1 - (3/5) .92  = .02
An Illustrative Example (V, VI)

Humidity is chosen for the Sunny branch and Wind for the Rain branch, giving the
final tree:

Outlook
  Sunny (1, 2, 8, 9, 11; 2+, 3-):   Humidity
      High   -> No
      Normal -> Yes
  Overcast (3, 7, 12, 13; 4+, 0-):  Yes
  Rain (4, 5, 6, 10, 14; 3+, 2-):   Wind
      Weak   -> Yes
      Strong -> No
Summary: ID3 (Iterative Dichotomiser 3)

ID3(Examples, Attributes, Label)
  Let S be the set of Examples
  Label is the target attribute (the prediction)
  Attributes is the set of measured attributes
  Create a Root node for the tree
  If all examples are labeled the same, return a single-node tree with that label
  Otherwise Begin
    A = the attribute in Attributes that best classifies S
    For each possible value v of A
      Add a new tree branch corresponding to A = v
      Let Sv be the subset of examples in S with A = v
      If Sv is empty: add a leaf node with the most common value of Label in S
      Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label)
  End
  Return Root
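A compact Python sketch of the procedure above (illustrative only; the dictionary-based examples and the (attribute, {value: subtree}) node representation are my own choices):

import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, label):
    parts = defaultdict(list)
    for e in examples:
        parts[e[attr]].append(e[label])
    labels = [e[label] for e in examples]
    return entropy(labels) - sum(len(p) / len(examples) * entropy(p)
                                 for p in parts.values())

def id3(examples, attributes, label):
    """Return a leaf (a class value) or an internal node (attribute, {value: subtree})."""
    labels = [e[label] for e in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                        # pure leaf, or no attributes left
    best = max(attributes, key=lambda a: gain(examples, a, label))
    # Note: only values seen in `examples` get a branch here; for unseen values the
    # pseudocode above adds a leaf with the majority label of S instead.
    branches = {}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best], label)
    return (best, branches)

Run on the 14-example PlayTennis table with attributes Outlook, Temperature, Humidity, and Wind, this sketch should reproduce the tree constructed above: Outlook at the root, Humidity under Sunny, Wind under Rain.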
Hypothesis Space in Decision Tree Induction

Conducts a search of the space of decision trees, which can represent all possible
discrete functions (pros and cons).
Goal: to find the best decision tree.
Finding a minimal decision tree consistent with a set of data is NP-hard.
Performs a greedy heuristic search: hill climbing without backtracking.
Makes statistically based decisions using all the data.
Bias in Decision Tree Induction

The bias is for trees of minimal depth; however, greedy search introduces
complications: it positions features with high information gain high in the tree
and may not find the minimal tree.
This implements a preference bias (a search bias) as opposed to a restriction bias
(a language bias).
Occam's razor can be defended on the basis that there are relatively few simple
hypotheses compared to complex ones. Therefore, a simple hypothesis that is
consistent with the data is less likely to be a statistical coincidence (but...)
History of Decision Tree Research

Hunt and colleagues in psychology used full-search decision tree methods to model
human concept learning in the 60s.
Quinlan developed ID3, with the information gain heuristic, in the late 70s to learn
expert systems from examples.
Breiman, Friedman and colleagues in statistics developed CART (Classification And
Regression Trees) at around the same time.
A variety of improvements in the 80s: coping with noise, continuous attributes,
missing data, non-axis-parallel splits, etc.
Quinlan's updated algorithm, C4.5 (1993), is commonly used (newer: C5.0).
Boosting (or bagging) over decision trees is a very good general-purpose algorithm.
Overfitting the Data

Learning a tree that classifies the training data perfectly may not lead to the tree
with the best generalization performance:
There may be noise in the training data that the tree is fitting.
The algorithm might be making decisions based on very little data.

A hypothesis h is said to overfit the training data if there is another hypothesis h'
such that h has a smaller error than h' on the training data, but h has a larger error
than h' on the test data.

[Figure: accuracy vs. complexity of the tree, on the training data and on testing data]
Overfitting - Example

Suppose a new, noisy training example arrives:

Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, label = No

The tree built above classifies it as Yes (Outlook = Sunny, Humidity = Normal -> Yes).
To fit this example, the tree is grown further: below Outlook = Sunny,
Humidity = Normal, a new Wind test is added.

This can always be done -- it may fit noise or other coincidental regularities.
Avoiding Overfitting
Two basic approaches
Prepruning: Stop growing the tree at some point during
construction when it is determined that there is not enough data
to make reliable choices.
Postpruning: Grow the full tree and then remove nodes that seem
not to have sufficient evidence.
Methods for evaluating subtrees to prune
Cross-validation: Reserve hold-out set to evaluate utility
Statistical testing: Test if the observed regularity can be dismissed
as likely to occur by chance
Minimum Description Length: Is the additional complexity of the
hypothesis smaller than remembering the exceptions?
This is related to the notion of regularization that we will see in other contexts:
keep the hypothesis simple.
How can this be avoided with linear classifiers?
Trees and Rules

Decision trees can be represented as rules:

If (outlook = sunny) and (humidity = normal) then YES
If (outlook = rain) and (wind = strong) then NO

[Figure: the final PlayTennis tree from which these rules are read off]
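A small Python sketch of reading rules off a tree, reusing the nested (attribute, {value: subtree}) representation from the earlier ID3 sketch (the function and variable names are mine):

def tree_to_rules(tree, conditions=()):
    """Yield (conditions, label) pairs; each condition is an (attribute, value) test."""
    if not isinstance(tree, tuple):               # leaf: tree is a class label
        yield (list(conditions), tree)
        return
    attr, branches = tree
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attr, value),))

# Hand-written tree matching the PlayTennis example above.
play_tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Weak": "Yes", "Strong": "No"})})

for conds, label in tree_to_rules(play_tree):
    print("If", " and ".join(f"({a} = {v})" for a, v in conds), "then", label)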
Reduced-Error Pruning

A post-pruning, cross-validation approach (a code sketch follows below):
  Partition the training data into a grow set and a validation set.
  Build a complete tree from the grow set.
  Until accuracy on the validation set decreases, do:
    For each non-leaf node in the tree:
      Temporarily prune the subtree below it; replace it by a majority-vote leaf.
      Test the accuracy of the hypothesis on the validation set.
    Permanently prune the node whose removal gives the greatest increase in accuracy
    on the validation set.
Problem: Uses less data to construct the tree
Sometimes done at the rules level
Rules are generalized by erasing a condition (different!)
General Strategy: Overfit and Simplify
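A hedged Python sketch of the reduced-error pruning loop over the same nested-tuple trees (the helper names, the grow/validation data format, and the tie-breaking are my assumptions, not the course's exact algorithm):

from collections import Counter

def classify(tree, example):
    """Follow the tree to a leaf; an unseen branch value falls through to None."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(example[attr])
    return tree

def accuracy(tree, examples, label):
    return sum(classify(tree, e) == e[label] for e in examples) / len(examples)

def prune_candidates(tree, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(tree, tuple):
        yield path
        for value, subtree in tree[1].items():
            yield from prune_candidates(subtree, path + (value,))

def replace_at(tree, path, leaf):
    """Return a copy of `tree` with the node at `path` replaced by `leaf`."""
    if not path:
        return leaf
    attr, branches = tree
    new_branches = dict(branches)
    new_branches[path[0]] = replace_at(branches[path[0]], path[1:], leaf)
    return (attr, new_branches)

def majority_at(tree, path, examples, label):
    """Majority label among grow-set examples that reach the node at `path`."""
    for value in path:
        attr, _ = tree
        examples = [e for e in examples if e[attr] == value]
        tree = tree[1][value]
    labels = [e[label] for e in examples]
    return Counter(labels).most_common(1)[0][0] if labels else None

def reduced_error_prune(tree, grow, validation, label):
    best_acc = accuracy(tree, validation, label)
    while True:
        best = None
        for path in prune_candidates(tree):
            leaf = majority_at(tree, path, grow, label)
            candidate = replace_at(tree, path, leaf)
            acc = accuracy(candidate, validation, label)
            if acc >= best_acc:      # prune greedily while validation accuracy does not drop
                best_acc, best = acc, candidate
        if best is None:
            return tree
        tree = best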
Continuous Attributes
Real-valued attributes can, in advance, be discretized into
ranges, such as big, medium, small
Alternatively, one can develop splitting nodes based on
thresholds of the form A<c that partition the data into examples
that satisfy A<c and A>=c. The information gain for these splits
is calculated in the same way and compared to the information
gain of discrete splits.
How to find the split with the highest gain?
For each continuous feature A:
Sort examples according to the value of A
For each ordered pair (x,y) with different labels
Check the mid-point as a possible threshold, i.e., split S into S_{a <= x} and S_{a >= y}.
Continuous Attributes

Example:
  Length (L): 10  15  21  28  32  40  50
  Class:       -   +   +   -   +   +   -

Check thresholds: L < 12.5; L < 24.5; L < 30; L < 45
For each candidate threshold, compute the resulting subsets of examples and their
k+, j- splits, and pick the split with the highest gain.
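A Python sketch of the threshold search on the Length example (variable and function names are mine):

import math
from collections import Counter

values = [10, 15, 21, 28, 32, 40, 50]
labels = ["-", "+", "+", "-", "+", "+", "-"]

def entropy(lab):
    n = len(lab)
    return -sum(c / n * math.log2(c / n) for c in Counter(lab).values()) if n else 0.0

def threshold_gain(threshold):
    left  = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

# Candidate thresholds: mid-points between consecutive values with different labels.
pairs = zip(values, values[1:], labels, labels[1:])
candidates = [(x + y) / 2 for x, y, lx, ly in pairs if lx != ly]
print(candidates)                            # [12.5, 24.5, 30.0, 45.0]
best = max(candidates, key=threshold_gain)
print(best, round(threshold_gain(best), 3))  # 12.5 0.198 (45.0 gives the same gain)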
Missing Values

Diagnosis = <fever, blood_pressure, ..., blood_test = ?, ...>

Many times values are not available for all attributes during training or testing
(e.g., in medical diagnosis).

Training: evaluate Gain(S, a) when some of the examples do not have a value for a.
Missing Values

In the Sunny branch, suppose the Humidity value of day 8 is missing:

Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild ??? Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

Gain(S_sunny, Temp) = .97 - 0 - (2/5) 1 = .57 is unaffected, but how should
Gain(S_sunny, Humidity) be computed?

Fill in the most common value (?) -- High:
  Gain(S_sunny, Humidity) = .97 - (3/5) Ent[+0, -3] - (2/5) Ent[+2, -0] = .97

Fill in Normal:
  Gain(S_sunny, Humidity) = .97 - (2/5) Ent[+0, -2] - (3/5) Ent[+2, -1] < .97

Fractional counts: 0.5 Normal, 0.5 High:
  Gain(S_sunny, Humidity) = .97 - (2.5/5) Ent[+0, -2.5] - (2.5/5) Ent[+2, -.5] < .97
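A small Python sketch of the fractional-count computation above (an illustration only, not Quinlan's exact C4.5 procedure):

import math

def entropy(pos, neg):
    """Entropy of a (possibly fractional) positive/negative count."""
    total = pos + neg
    ent = 0.0
    for c in (pos, neg):
        if c > 0:
            p = c / total
            ent -= p * math.log2(p)
    return ent

# Sunny subset: day 8's Humidity is missing, so it contributes weight 0.5
# to each branch. Counts are (positives, negatives).
high   = (0, 2 + 0.5)   # days 1, 2 plus half of day 8 (all negative)
normal = (2, 0 + 0.5)   # days 9, 11 plus half of day 8
total  = 5.0

gain = (entropy(2, 3)
        - (sum(high) / total) * entropy(*high)
        - (sum(normal) / total) * entropy(*normal))
print(round(gain, 2))   # about 0.61 -- lower than the .97 obtained by filling in "High"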
Missing Values

Testing: classify an example without knowing the value of a.

Other suggestions?
Missing Values

Classifying with the final tree:

Outlook = ???, Temp = Hot, Humidity = Normal, Wind = Strong, label = ??
  Follow all three Outlook branches: 1/3 Yes + 1/3 Yes + 1/3 No = Yes

Outlook = Sunny, Temp = Hot, Humidity = ???, Wind = Strong, label = ??
  Follow both the Normal and the High branches of Humidity.
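A sketch of classifying with a missing attribute value by sending the example down every branch and taking a weighted vote; the tree format follows the earlier ID3 sketch, and the uniform 1/3 weights mirror the slide (weighting branches by training frequencies is another common choice):

from collections import Counter

def classify_soft(tree, example):
    """Return a Counter of label weights; missing values are split across branches."""
    if not isinstance(tree, tuple):                  # leaf
        return Counter({tree: 1.0})
    attr, branches = tree
    value = example.get(attr)
    if value in branches:                            # known value: follow one branch
        return classify_soft(branches[value], example)
    votes = Counter()                                # missing value: split evenly
    for subtree in branches.values():
        for label, w in classify_soft(subtree, example).items():
            votes[label] += w / len(branches)
    return votes

play_tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Weak": "Yes", "Strong": "No"})})

votes = classify_soft(play_tree, {"Temp": "Hot", "Humidity": "Normal", "Wind": "Strong"})
print(votes)   # about {'Yes': 0.667, 'No': 0.333} -> predict Yes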
Other Issues

Attributes with different costs
  Change information gain so that low-cost attributes are preferred.
Alternative measures for selecting attributes
  When different attributes have different numbers of values, information gain tends
  to prefer those with many values.
Oblique Decision Trees
  Decisions are not axis-parallel.
Incremental Decision Tree Induction
  Update an existing decision tree to account for new examples incrementally
  (maintain consistency?).
Decision Trees as Features

Rather than using decision trees to represent the target function, it is becoming
common to use small decision trees as features.
When learning over a large number of features, learning decision trees is difficult
and the resulting tree may be very large (overfitting).
Instead, learn small decision trees with limited depth.
Treat them as experts; they are correct, but only on a small region of the domain.
(Which DTs to learn? The same ones every time?)
Then learn another function, typically a linear function, over these trees as features.
Boosting (but also other linear learners) is used on top of the small decision trees
(with either Boolean or real-valued features).
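A hedged sketch of the idea using scikit-learn, assuming it is available; the random feature subsets, depths, and counts are illustrative choices of mine, not a prescription from the course:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                      # toy data with many features
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)

# Learn several small (depth-limited) trees, each on a random subset of features.
stumps, subsets = [], []
for _ in range(25):
    cols = rng.choice(X.shape[1], size=5, replace=False)
    tree = DecisionTreeClassifier(max_depth=2).fit(X[:, cols], y)
    stumps.append(tree)
    subsets.append(cols)

# Each tree's prediction becomes one Boolean feature for a linear learner.
Z = np.column_stack([t.predict(X[:, c]) for t, c in zip(stumps, subsets)])
linear = LogisticRegression().fit(Z, y)
print(linear.score(Z, y))   # training accuracy of the combined model (illustration only)

Each shallow tree acts as a local expert, and the linear model learns how to weigh their votes.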
Decision Trees - Summary
Hypothesis Space:
Variable size (contains all functions)
Deterministic; Discrete and Continuous attributes
Search Algorithm
ID3 - batch, constructive search
Extensions: missing values
Issues:
What is the goal?
When to stop? How to guarantee good generalization?
Did not address:
How are we doing? (Correctness-wise, Complexity-wise)