
DECISION TREES CSE463 2014 Fall

What Did We Learn?


Learning problem: find a function that best separates the data.
  What function?
  What's best?
  How to find it?
A possibility: define the learning problem to be:
  Find a (linear) function that best separates the data.
Linear: x = the data representation; w = the classifier; y = sgn(w^T x)
* Slides courtesy of Dan Roth, University of Illinois at Urbana-Champaign
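For concreteness, a minimal sketch of this linear rule in Python; the weight vector and instance below are made-up values, not anything from the slides:

```python
import numpy as np

def linear_classify(w, x):
    """y = sgn(w^T x): predict +1 or -1 with a linear separator."""
    return 1 if np.dot(w, x) >= 0 else -1

# Made-up 2-d weight vector and instance, for illustration only.
w = np.array([0.5, -1.0])
x = np.array([2.0, 0.5])
print(linear_classify(w, x))  # -> 1, since w.x = 0.5 >= 0
```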
Machine Learning - Big Picture
[Diagram: training data {x, y} is fed into a model (a hypothesis space H, searched e.g. by gradient descent) to produce h, the best hypothesis; test (or real) data {x, ?} is then labeled as {x, h(x)} and evaluated for accuracy (xx.x%).]
Announcement
Please form a group and choose a dataset from Kaggle (due: Sep 29).
http://www.youtube.com/watch?v=PoD84TVdD-4
Please put your team name and members in the link:
https://docs.google.com/spreadsheets/d/1DsldlV4wy819mnDvU-9gR6m7ATJVEn_T2IeecyB4V3U/edit?usp=sharing
Then, inform our TA ([email protected]).
If you are unable to form a group, please let us know today.
Be prepared for your presentation!
Introduction - Summary
We introduced the technical part of the class with two examples of (very different) approaches to linear discrimination. There are many other solutions.
Question 1: Our solution assumed that the target functions are linear. Can we learn a function that is more flexible in terms of what it does with the feature space?
Question 2: Can we say something about the quality of what we learn (sample complexity, time complexity, quality)?
From Linear to Decision Trees
We decoupled the generation of the feature space from the learning.
We argued that we can map the given examples into another space in which the target functions are linearly separable.
Do we always want to do it?
How do we determine what good mappings are?
The study of decision trees may shed some light on this.
Learning is done directly from the given data representation.
The algorithm "transforms" the data itself.
What's the best learning algorithm?
Decision Trees
A hierarchical data structure that represents data by implementing a divide-and-conquer strategy.
Can be used as a non-parametric classification and regression method.
Given a collection of examples, learn a decision tree that represents it.
Use this representation to classify new examples.
[Figure: data partitioned into regions labeled A, B, and C.]
The Representation
Decision trees are classifiers for instances represented as feature vectors (color= ; shape= ; label= ).
Nodes are tests on feature values.
There is one branch for each value of the feature.
Leaves specify the categories (labels).
Can categorize instances into multiple disjoint categories.
[Figure: an example decision tree. The root tests Color (blue / red / green); two of its children test Shape (square / triangle / circle); leaves carry the category labels A, B, C. Callouts contrast "Evaluation of a Decision Tree" on the instance (color=red; shape=triangle) with "Learning a Decision Tree".]
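To make "evaluation" concrete, here is a minimal sketch of a tree node and its evaluation; the Node class and the toy tree are hypothetical illustrations, not the exact tree in the figure:

```python
class Node:
    """A decision-tree node: either a leaf with a label, or a test on one feature."""
    def __init__(self, label=None, feature=None, children=None):
        self.label = label               # set for leaves
        self.feature = feature           # feature tested at internal nodes
        self.children = children or {}   # feature value -> child Node

def classify(node, instance):
    """Walk from the root to a leaf by following the branch for each feature value."""
    while node.label is None:
        node = node.children[instance[node.feature]]
    return node.label

# Hypothetical toy tree: test color, then shape on the blue branch.
tree = Node(feature="color", children={
    "blue": Node(feature="shape", children={
        "square": Node(label="A"),
        "circle": Node(label="B"),
    }),
    "red": Node(label="C"),
})
print(classify(tree, {"color": "blue", "shape": "circle"}))  # -> B
```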
Boolean Decision Trees
As Boolean functions, decision trees can represent any Boolean function.
They can be rewritten as rules in Disjunctive Normal Form (DNF):
  Green ∧ square → positive
  Blue ∧ circle → positive
  Blue ∧ square → positive
The disjunction of these rules is equivalent to the decision tree.
What did we show?
[Figure: the same Color/Shape tree with + and - labels at its leaves, corresponding to the three rules above.]
Decision Boundaries
Usually, instances are represented as attribute-value pairs, e.g. (color=blue, shape=square, +).
Numerical values can be used either by discretizing or by using thresholds for splitting nodes.
In this case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels.
[Figure: a tree with threshold tests (X < 3, Y < 5, Y > 7, X < 1) and the corresponding partition of the X-Y plane into axis-parallel rectangles, each labeled + or -.]
Decision Trees
Output is a discrete category. Real-valued outputs are possible (regression trees).
There are efficient algorithms for processing large amounts of data (but not too many features).
There are methods for handling noisy data (classification noise and attribute noise) and for handling missing attribute values.
[Figure: the Color/Shape tree from the previous slides, with +/- labels at its leaves.]
An example
[Image: http://graphics8.nytimes.com/images/2008/04/16/us/0416-nat-subOBAMA.jpg]
Another example: Housing price
[Figures source: http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf]
Decision Trees
Can represent any Boolean function.
Can be viewed as a way to compactly represent a lot of data.
Advantage: handles non-metric data.
Natural representation (think of "20 questions").
The evaluation of the decision tree classifier is easy.
Clearly, given data, there are many ways to represent it as a decision tree.
Learning a good representation from data is the challenge.
[Figure: the PlayTennis tree. Outlook at the root; Overcast -> Yes; Sunny -> Humidity (High -> No, Normal -> Yes); Rain -> Wind (Weak -> Yes, Strong -> No).]
Representing Data
Think about a large table with N attributes, and assume you want to know something about the people represented as entries in this table, e.g., whether they own an expensive car or not.
Simplest way: a histogram on the first attribute (own).
Then, a histogram on the first and second attributes (own & gender).
But what if the number of attributes is larger, say N = 16?
How large are the 1-d histograms (contingency tables)? 16 numbers.
How large are the 2-d histograms? 16-choose-2 = 120 numbers.
How many 3-d tables? 560 numbers.
With 100 attributes, the 3-d tables need 161,700 numbers.
We need to figure out a way to represent data in a better way, and figure out which are the important attributes to look at first.
Information theory has something to say about this; we will use it to better represent the data.
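These table counts are just binomial coefficients; a quick check (not from the slides):

```python
from math import comb

# Number of k-attribute contingency tables over N attributes is C(N, k).
print(comb(16, 2))   # 120     (2-d tables for N = 16)
print(comb(16, 3))   # 560     (3-d tables for N = 16)
print(comb(100, 3))  # 161700  (3-d tables for N = 100)
```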
Basic Decision Tree Learning Algorithm
Data is processed in batch (i.e., all the data is available).
Recursively build a decision tree top-down.
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
[Figure: the resulting PlayTennis tree (Outlook at the root; Overcast -> Yes; Sunny -> Humidity; Rain -> Wind).]
Basic Decision Tree Algorithm
Let S be the set of examples.
Label is the target attribute (the prediction); Attributes is the set of measured attributes.
Create a Root node for the tree.
If all examples are labeled the same, return a single-node tree with that label.
Otherwise begin:
  A = the attribute in Attributes that best classifies S; assign A to Root.
  For each possible value v of A:
    Add a new tree branch corresponding to A = v.
    Let Sv be the subset of examples in S with A = v.
    If Sv is empty: add a leaf node with the most common value of Label in S.
    Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label).
End
Return Root
Picking the Root Attribute
The goal is to have the resulting decision tree be as small as possible (Occam's razor).
Finding the minimal decision tree consistent with the data is NP-hard.
The recursive algorithm is a greedy heuristic search for a simple tree, but it cannot guarantee optimality.
The main decision in the algorithm is the selection of the next attribute to condition on.
Picking the Root Attribute
Consider data with two Boolean attributes (A,B).
  < (A=0,B=0), - >: 50 examples
  < (A=0,B=1), - >: 50 examples
  < (A=1,B=0), - >: 0 examples
  < (A=1,B=1), + >: 100 examples
What should be the first attribute we select?
  Splitting on A: we get purely labeled nodes.
  Splitting on B: we don't get purely labeled nodes.
What if we have <(A=1,B=0), - >: 3 examples?
[Figure: the two candidate single-split trees, one rooted at A and one rooted at B.]
Picking the Root Attribute
Consider data with two Boolean attributes (A,B).
  < (A=0,B=0), - >: 50 examples
  < (A=0,B=1), - >: 50 examples
  < (A=1,B=0), - >: 3 examples (instead of 0)
  < (A=1,B=1), + >: 100 examples
[Figure: the two candidate single-split trees. Splitting on A: A=0 gets 100 examples (all -), A=1 gets 103 (100+, 3-). Splitting on B: B=0 gets 53 examples (all -), B=1 gets 150 (100+, 50-).]
The trees look structurally similar; which attribute should we choose?
Advantage: A. But ...
We need a way to quantify things.
Picking the Root Attribute
The goal is to have the resulting decision tree be as small as possible (Occam's razor).
The main decision in the algorithm is the selection of the next attribute to condition on.
We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
The most popular heuristic is based on information gain, and originated with the ID3 system of Quinlan.
Implementations (Google "github decision tree ID3"):
  https://github.com/bonz0/Decision-Tree (C++)
  https://github.com/Steve525/decision-tree/ (Java)
Entropy
Entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is
  Entropy(S) = -p_+ log(p_+) - p_- log(p_-)
where p_+ is the proportion of positive examples in S and p_- is the proportion of negative examples.
If all the examples belong to the same category: Entropy = 0.
If the examples are equally mixed (0.5, 0.5): Entropy = 1.
In general, when p_i is the fraction of examples labeled i:
  Entropy({p_1, p_2, ..., p_k}) = -sum_{i=1}^{k} p_i log(p_i)
Entropy can be viewed as the number of bits required, on average, to encode the class labels. If the probability for + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than 1 bit.
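A small sketch of this definition in Python, using base-2 logarithms to match the bit-counting interpretation (the function name is mine):

```python
from math import log2

def entropy(probs):
    """Entropy of a label distribution given as a list of proportions."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0  (equally mixed)
print(entropy([1.0]))                   # 0.0  (all one category)
print(round(entropy([9/14, 5/14]), 2))  # 0.94, the PlayTennis root entropy used later
```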
[Figure: the binary entropy curve as a function of p_+, which is 0 when p_+ is 0 or 1 and peaks at 1 when p_+ = 0.5.]
High entropy: high level of uncertainty.
Low entropy: no uncertainty.
Information Gain
The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:
  Gain(S, a) = Entropy(S) - sum_{v in values(a)} (|S_v| / |S|) Entropy(S_v)
where S_v is the subset of S for which attribute a has value v, and the entropy of partitioning the data is calculated by weighting the entropy of each partition by its size relative to the original set.
Partitions of low entropy (imbalanced splits) lead to high gain.
Go back and check which of the A, B splits is better.
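A sketch of the gain computation built on the same base-2 entropy; the helper names and the dict-based example format are my own choices:

```python
from collections import Counter, defaultdict
from math import log2

def entropy_of(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, label_key):
    """Gain(S, a): entropy of S minus the size-weighted entropy of each partition S_v."""
    labels = [ex[label_key] for ex in examples]
    partitions = defaultdict(list)
    for ex in examples:
        partitions[ex[attribute]].append(ex[label_key])
    weighted = sum(len(part) / len(examples) * entropy_of(part)
                   for part in partitions.values())
    return entropy_of(labels) - weighted
```

On the PlayTennis table this should reproduce the Humidity and Wind gains worked out on the following slides.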
An Illustrative Example
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
An Illustrative Example (II)
The full sample S (the 14-day table above) has 9 positive and 5 negative examples (9+, 5-):
  Entropy(S) = -(9/14) log(9/14) - (5/14) log(5/14) = 0.94
Now compare splitting on Humidity and on Wind.
  Humidity: High gives 3+, 4- (E = 0.985); Normal gives 6+, 1- (E = 0.592).
  Wind: Weak gives 6+, 2- (E = 0.811); Strong gives 3+, 3- (E = 1.0).
Using Gain(S, a) = Entropy(S) - sum_{v} (|S_v| / |S|) Entropy(S_v):
  Gain(S, Humidity) = 0.94 - (7/14)(0.985) - (7/14)(0.592) = 0.151
  Gain(S, Wind) = 0.94 - (8/14)(0.811) - (6/14)(1.0) = 0.048
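As a check on these numbers, a self-contained sketch that recomputes both gains from the table (helper names are mine):

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

# (Humidity, Wind, PlayTennis) rows transcribed from the 14-day table.
rows = [("High","Weak","No"), ("High","Strong","No"), ("High","Weak","Yes"),
        ("High","Weak","Yes"), ("Normal","Weak","Yes"), ("Normal","Strong","No"),
        ("Normal","Strong","Yes"), ("High","Weak","No"), ("Normal","Weak","Yes"),
        ("Normal","Weak","Yes"), ("Normal","Strong","Yes"), ("High","Strong","Yes"),
        ("Normal","Weak","Yes"), ("High","Strong","No")]
labels = [p for _, _, p in rows]

def gain(col):
    """Gain(S, a) for attribute column index col (0 = Humidity, 1 = Wind)."""
    total = H(labels)
    for v in set(r[col] for r in rows):
        part = [r[2] for r in rows if r[col] == v]
        total -= len(part) / len(rows) * H(part)
    return total

print(round(gain(0), 3))  # 0.152 -- the slides' 0.151 comes from rounding the intermediate entropies
print(round(gain(1), 3))  # 0.048
```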
An Illustrative Example (III)
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Gain(S, Outlook) = 0.246
Outlook has the highest gain, so it is selected as the root.
An Illustrative Example (III)
[Figure: Outlook at the root. Sunny branch: days 1,2,8,9,11 (2+, 3-), still undecided (?); Overcast branch: days 3,7,12,13 (4+, 0-), labeled Yes; Rain branch: days 4,5,6,10,14 (3+, 2-), still undecided (?).]
An Illustrative Example (III)
[Figure: the same partially built tree as above.]
Continue splitting until:
  every attribute is already included along the path, or
  all examples in the leaf have the same label.
An Illustrative Example (IV)
For the Sunny branch (S_sunny = days 1, 2, 8, 9, 11; 2+, 3-; Entropy = 0.97):
  Gain(S_sunny, Humidity) = 0.97 - (3/5) 0 - (2/5) 0 = 0.97
  Gain(S_sunny, Temp) = 0.97 - (2/5) 0 - (2/5) 1 - (1/5) 0 = 0.57
  Gain(S_sunny, Wind) = 0.97 - (2/5) 1 - (3/5) 0.92 = 0.02
Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes
[Figure: the partially built tree, with the Sunny branch (2+, 3-) currently being expanded.]
An Illustrative Example (V)
[Figure: the tree so far. Outlook at the root; Overcast -> Yes; the Sunny and Rain branches are still undecided (?).]
An Illustrative Example (V)
[Figure: Humidity, the highest-gain attribute, is chosen for the Sunny branch: High -> No, Normal -> Yes. The Rain branch is still undecided (?).]
An Illustrative Example (VI)
[Figure: the final tree. Outlook at the root; Overcast -> Yes; Sunny -> Humidity (High -> No, Normal -> Yes); Rain -> Wind (Strong -> No, Weak -> Yes).]
Summary: ID3
(The ID3 pseudocode given earlier on the Basic Decision Tree Algorithm slide, repeated as a summary: recursively pick the attribute that best classifies S, branch on its values, and recurse on each subset.)
Iterative Dichotomiser 3
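Putting the pieces together, a compact, illustrative ID3 sketch with information-gain selection; it uses dict-based examples, branches only on attribute values observed in the data, and does no pruning, so treat it as a minimal sketch rather than the course's reference implementation:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, target):
    """Expected entropy reduction from splitting the examples on attr."""
    parts = defaultdict(list)
    for ex in examples:
        parts[ex[attr]].append(ex[target])
    before = entropy([ex[target] for ex in examples])
    after = sum(len(p) / len(examples) * entropy(p) for p in parts.values())
    return before - after

def id3(examples, attributes, target):
    """Return a nested-dict tree: either a label (leaf) or {attr: {value: subtree}}."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # all examples share one label
        return labels[0]
    if not attributes:                 # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(subset, [a for a in attributes if a != best], target)
    return tree
```

On the PlayTennis table this should select Outlook at the root and reproduce the tree built above.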
Hypothesis Space in Decision Tree Induction
ID3 conducts a search of the space of decision trees, which can represent all possible discrete functions (pros and cons).
Goal: to find the best decision tree.
Finding a minimal decision tree consistent with a set of data is NP-hard.
It performs a greedy heuristic search: hill climbing without backtracking.
It makes statistically based decisions using all the data.
Bias in Decision Tree Induction
The bias is for trees of minimal depth; however, greedy search introduces complications: it positions features with high information gain high in the tree, and may not find the minimal tree.
This implements a preference bias (a search bias), as opposed to a restriction bias (a language bias).
Occam's razor can be defended on the basis that there are relatively few simple hypotheses compared to complex ones; therefore, a simple hypothesis that is consistent with the data is less likely to be a statistical coincidence (but ...).
History of Decision Tree Research
Hunt and colleagues in psychology used full-search decision tree methods to model human concept learning in the 60s.
Quinlan developed ID3, with the information gain heuristic, in the late 70s to learn expert systems from examples.
Breiman, Friedman and colleagues in statistics developed CART (Classification And Regression Trees) around the same time.
A variety of improvements in the 80s: coping with noise, continuous attributes, missing data, non-axis-parallel splits, etc.
Quinlan's updated algorithm, C4.5 (1993), is commonly used (newer: C5).
Boosting (or bagging) over decision trees is a very good general-purpose algorithm.
Overfitting the Data
Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance.
  There may be noise in the training data that the tree is fitting.
  The algorithm might be making decisions based on very little data.
A hypothesis h is said to overfit the training data if there is another hypothesis h' such that h has a smaller error than h' on the training data, but h has a larger error than h' on the test data.
[Figure: accuracy vs. tree complexity; training accuracy keeps increasing while test accuracy peaks and then falls.]
Overfitting - Example
Suppose a new training example arrives: Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, label = No.
The tree above classifies it as Yes (Sunny -> Humidity = Normal -> Yes), so it is inconsistent with the tree.
[Figure: the tree is grown to fit it, replacing the Yes leaf under Sunny / Humidity = Normal with a further Wind test (Weak -> Yes, Strong -> No).]
This can always be done, but it may fit noise or other coincidental regularities.
Avoiding Overfitting
Two basic approaches:
  Pre-pruning: stop growing the tree at some point during construction, when it is determined that there is not enough data to make reliable choices.
  Post-pruning: grow the full tree and then remove nodes that seem not to have sufficient evidence.
Methods for evaluating subtrees to prune:
  Cross-validation: reserve a hold-out set to evaluate utility.
  Statistical testing: test whether the observed regularity can be dismissed as likely to occur by chance.
  Minimum Description Length: is the additional complexity of the hypothesis smaller than remembering the exceptions?
This is related to the notion of regularization that we will see in other contexts: keep the hypothesis simple.
How can this be avoided with linear classifiers?
Trees and Rules
Decision trees can be represented as rules; each root-to-leaf path gives one rule, for example:
  If (outlook = sunny) and (humidity = normal) then Yes
  If (outlook = rain) and (wind = strong) then No
[Figure: the PlayTennis tree the rules are read off from.]
Reduced-Error Pruning
A post-pruning, cross-validation approach.
Partition the training data into a "grow" set and a validation set.
Build a complete tree from the grow data.
Until accuracy on the validation set decreases, do:
  For each non-leaf node in the tree:
    Temporarily prune the tree below it; replace it by a majority-vote leaf.
    Test the accuracy of the hypothesis on the validation set.
  Permanently prune the node whose removal gives the greatest increase in accuracy on the validation set.
Problem: uses less data to construct the tree.
Sometimes done at the rules level instead; rules are generalized by erasing a condition (a different operation!).
General strategy: overfit and simplify.
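A sketch of this loop, operating on the nested-dict trees produced by the id3 sketch above; the helper functions are my own and this is only one reasonable reading of the procedure:

```python
import copy
from collections import Counter

def predict(tree, ex):
    """Follow a nested-dict tree ({attr: {value: subtree}}) down to a leaf label."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(ex[attr])   # unseen value -> None (counted as an error)
    return tree

def accuracy(tree, examples, target):
    return sum(predict(tree, ex) == ex[target] for ex in examples) / len(examples)

def leaf_labels(tree):
    """All leaf labels under a (sub)tree, for majority voting."""
    if not isinstance(tree, dict):
        return [tree]
    attr = next(iter(tree))
    return [lab for sub in tree[attr].values() for lab in leaf_labels(sub)]

def internal_paths(tree, path=()):
    """(attr, value) paths to every internal node, including the root (empty path)."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, sub in tree[attr].items():
            yield from internal_paths(sub, path + ((attr, value),))

def prune_at(tree, path):
    """Copy of the tree with the subtree at `path` replaced by its majority leaf."""
    new = copy.deepcopy(tree)
    parent, node = None, new
    for attr, value in path:
        parent, last = node, (attr, value)
        node = node[attr][value]
    majority = Counter(leaf_labels(node)).most_common(1)[0][0]
    if parent is None:
        return majority
    parent[last[0]][last[1]] = majority
    return new

def reduced_error_prune(tree, validation, target):
    """Greedily prune whichever node most helps validation accuracy; stop when it hurts."""
    while isinstance(tree, dict):
        base = accuracy(tree, validation, target)
        candidates = [prune_at(tree, p) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation, target))
        if accuracy(best, validation, target) < base:
            return tree
        tree = best
    return tree
```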
Continuous Attributes
Real-valued attributes can be discretized in advance into ranges, such as big, medium, small.
Alternatively, one can use splitting nodes based on thresholds of the form A < c, which partition the data into the examples that satisfy A < c and those with A >= c. The information gain for these splits is calculated in the same way and compared to the information gain of discrete splits.
How to find the split with the highest gain?
  For each continuous feature A:
    Sort the examples according to the value of A.
    For each ordered pair (x, y) of adjacent values with different labels:
      Check the mid-point as a possible threshold, i.e., consider the split into S_{A <= x} and S_{A >= y}.
Continuous Attributes
Example:
  Length (L): 10  15  21  28  32  40  50
  Class:       -   +   +   -   +   +   -
Check thresholds: L < 12.5; L < 24.5; L < 45.
For each candidate threshold, form the two subsets of examples and their k+, j- counts, and score the split by information gain as before.
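A small sketch of the threshold-candidate procedure on this example, scoring each mid-point by information gain (helper names are mine; note that the rule also surfaces the 28/32 mid-point, L < 30, besides the three thresholds listed):

```python
from collections import Counter
from math import log2

def H(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

data = [(10, "-"), (15, "+"), (21, "+"), (28, "-"), (32, "+"), (40, "+"), (50, "-")]
data.sort()
labels = [y for _, y in data]

# Candidate thresholds: mid-points between adjacent examples with different labels.
for (x1, y1), (x2, y2) in zip(data, data[1:]):
    if y1 != y2:
        t = (x1 + x2) / 2
        left = [y for v, y in data if v < t]
        right = [y for v, y in data if v >= t]
        g = H(labels) - len(left) / len(data) * H(left) - len(right) / len(data) * H(right)
        print(f"L < {t}: gain = {g:.3f}")
```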
Missing Values
Diagnosis = <fever, blood_pressure, ..., blood_test=?, ...>
Many times values are not available for all attributes during training or testing (e.g., in medical diagnosis).
Training: evaluate Gain(S, a) where for some of the examples a value for a is not given.
Missing Values
[Figure: the partially built tree; the Sunny branch (days 1, 2, 8, 9, 11; 2+, 3-) is being expanded, but day 8's Humidity value is missing.]
Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild ??? Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes
Gain(S_sunny, Temp) = 0.97 - 0 - (2/5) 1 = 0.57 (unaffected by the missing value).
For Gain(S_sunny, Humidity), using Gain(S, a) = Ent(S) - sum_{v} (|S_v| / |S|) Ent(S_v):
  Fill in the most common value, High: 0.97 - (3/5) Ent[0+, 3-] - (2/5) Ent[2+, 0-] = 0.97
  Fill in Normal instead: 0.97 - (2/5) Ent[0+, 2-] - (3/5) Ent[2+, 1-] < 0.97
  Fractional counts (0.5 Normal, 0.5 High): 0.97 - (2.5/5) Ent[0+, 2.5-] - (2.5/5) Ent[2+, 0.5-] < 0.97
Missing Values
Diagnosis = <fever, blood_pressure, ..., blood_test=?, ...>
Many times values are not available for all attributes during training or testing (e.g., in medical diagnosis).
Training: evaluate Gain(S, a) where for some of the examples a value for a is not given.
Testing: classify an example without knowing the value of a.
Other suggestions?
Missing Values
[Figure: the final PlayTennis tree (Outlook at the root; Overcast -> Yes; Sunny -> Humidity; Rain -> Wind).]
Outlook = ???, Temp = Hot, Humidity = Normal, Wind = Strong, label = ??
  Send the example down all three branches: 1/3 Yes + 1/3 Yes + 1/3 No = Yes.
Outlook = Sunny, Temp = Hot, Humidity = ???, Wind = Strong, label = ??
  At the Humidity node, consider both values (Normal / High).
Other Issues
Attributes with different costs:
  Change information gain so that low-cost attributes are preferred.
Alternative measures for selecting attributes:
  When different attributes have different numbers of values, information gain tends to prefer those with many values.
Oblique decision trees:
  Decisions are not axis-parallel.
Incremental decision tree induction:
  Update an existing decision tree to account for new examples incrementally (maintain consistency?).
Decision Trees as Features
Rather than using decision trees to represent the target function, it is becoming common to use small decision trees as features.
When learning over a large number of features, learning decision trees is difficult and the resulting tree may be very large (overfitting).
Instead, learn small decision trees with limited depth.
Treat them as experts: they are correct, but only on a small region of the domain. (Which trees to learn? The same ones every time?)
Then learn another function, typically a linear function, over these trees as features.
Boosting (but also other linear learners) is used on top of the small decision trees (either Boolean or real-valued features).
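One possible (hedged) instantiation of this idea is boosting depth-1 trees (stumps) with scikit-learn; this assumes scikit-learn is available and is not the specific setup the slide refers to:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data stands in for a real feature space.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each boosting stage fits a depth-1 tree (a stump); the ensemble combines their outputs linearly.
model = GradientBoostingClassifier(max_depth=1, n_estimators=200, learning_rate=0.1)
model.fit(X, y)
print(model.score(X, y))  # training accuracy of the stump ensemble
```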
Decision Trees - Summary
Hypothesis space:
  Variable size (contains all functions); deterministic; discrete and continuous attributes.
Search algorithm:
  ID3: batch, constructive search.
  Extensions: missing values.
Issues:
  What is the goal?
  When to stop? How to guarantee good generalization?
Did not address:
  How are we doing? (correctness-wise, complexity-wise)