What Did We Learn?: Learning Problem

Entropy(S) = - p+ log(p+) - p- log(p-)

In general, when p_i is the fraction of examples labeled i:

Entropy({p_1, p_2, ..., p_k}) = - Σ_{i=1}^{k} p_i log(p_i)

Entropy can be viewed as the number of bits required, on average, to encode the
class of labels (logarithms are base 2). If the probability for + is 0.5, a single bit is
required for each example; if it is 0.8, fewer than 1 bit is needed on average.
Entropy

Entropy (impurity, disorder) of a set of examples, S, relative to a binary
classification is:

Entropy(S) = - p+ log(p+) - p- log(p-)

where p+ is the proportion of positive examples in S and p- is the proportion of
negatives.

If all the examples belong to the same category: Entropy = 0
If all the examples are equally mixed (0.5, 0.5): Entropy = 1

[Figure: three example sets of +/- labels with different degrees of mixing]

High Entropy -> high level of uncertainty
Low Entropy -> no uncertainty
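To make the definition concrete, here is a minimal Python sketch (not from the original slides; the function name and the list-of-counts input are my own choices):

import math

def entropy(counts):
    """Entropy, in bits, of a label distribution given as a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    result = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            result -= p * math.log2(p)
    return result

print(entropy([10, 0]))   # all examples in one class -> 0.0
print(entropy([5, 5]))    # equally mixed (0.5, 0.5)  -> 1.0
print(entropy([8, 2]))    # p+ = 0.8 -> about 0.72 bits per example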
Information Gain

The information gain of an attribute a is the expected reduction in entropy caused
by partitioning on this attribute:

Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|S_v| / |S|) Entropy(S_v)

where S_v is the subset of S for which attribute a has value v, and the entropy of
partitioning the data is calculated by weighting the entropy of each partition by its
size relative to the original set.

Partitions of low entropy (imbalanced splits) lead to high gain.

Go back to check which of the A, B splits is better.

[Figure: the Outlook attribute splitting the examples into Sunny, Overcast, and Rain branches]
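As an illustration (again not from the slides), a Python sketch of Gain(S, a); each example is assumed to be a dictionary of attribute values plus a "label" key, which is my representation rather than the course's:

import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Expected reduction in entropy from partitioning `examples` on `attribute`."""
    base = entropy_of([e["label"] for e in examples])
    partitions = defaultdict(list)
    for e in examples:
        partitions[e[attribute]].append(e["label"])
    weighted = sum(len(part) / len(examples) * entropy_of(part)
                   for part in partitions.values())
    return base - weighted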
An Illustrative Example
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
An Illustrative Example (II)

The full data set above contains 9 positive and 5 negative examples (9+, 5-), so

Entropy(S) = - (9/14) log(9/14) - (5/14) log(5/14) = 0.94
Splitting S (9+, 5-, E = .94) on Humidity and on Wind (only the Humidity, Wind,
and Play Tennis columns of the table are needed):

Humidity = High:   3+, 4-   E = .985
Humidity = Normal: 6+, 1-   E = .592

Wind = Weak:   6+, 2-   E = .811
Wind = Strong: 3+, 3-   E = 1.0

Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|S_v| / |S|) Entropy(S_v)

Gain(S, Humidity) = .94 - (7/14) 0.985 - (7/14) 0.592 = 0.151
Gain(S, Wind)     = .94 - (8/14) 0.811 - (6/14) 1.0   = 0.048
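The numbers above can be checked with a few lines of Python (this script is mine; it encodes only the Humidity, Wind, and Play Tennis columns of the table):

import math
from collections import Counter, defaultdict

# (Humidity, Wind, PlayTennis) for days 1..14, copied from the table above.
data = [("High","Weak","No"), ("High","Strong","No"), ("High","Weak","Yes"),
        ("High","Weak","Yes"), ("Normal","Weak","Yes"), ("Normal","Strong","No"),
        ("Normal","Strong","Yes"), ("High","Weak","No"), ("Normal","Weak","Yes"),
        ("Normal","Weak","Yes"), ("Normal","Strong","Yes"), ("High","Strong","Yes"),
        ("Normal","Weak","Yes"), ("High","Strong","No")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    parts = defaultdict(list)
    for r in rows:
        parts[r[col]].append(r[-1])
    return entropy(labels) - sum(len(p) / len(rows) * entropy(p) for p in parts.values())

print(round(entropy([r[-1] for r in data]), 3))  # 0.94
print(round(gain(data, 0), 3))  # Gain(S, Humidity): 0.152 (0.151 on the slide, which rounds the entropies first)
print(round(gain(data, 1), 3))  # Gain(S, Wind): 0.048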
An Illustrative Example (III)

Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029
Gain(S, Outlook)     = 0.246

Outlook has the highest gain and is chosen as the root attribute.
Splitting on Outlook:

Sunny:    days 1, 2, 8, 9, 11    (2+, 3-)  -> ?
Overcast: days 3, 7, 12, 13      (4+, 0-)  -> Yes
Rain:     days 4, 5, 6, 10, 14   (3+, 2-)  -> ?

Continue until:
Every attribute is included in the path, or
All examples in the leaf have the same label.
An Illustrative Example (IV)

The Sunny branch (days 1, 2, 8, 9, 11; 2+, 3-; E = .97) still needs a test:

Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

Gain(S_sunny, Humidity) = .97 - (3/5) 0 - (2/5) 0    = .97
Gain(S_sunny, Temp)     = .97 - 0 - (2/5) 1          = .57
Gain(S_sunny, Wind)     = .97 - (2/5) 1 - (3/5) .92  = .02
An Illustrative Example (V, VI)

Humidity is chosen for the Sunny branch and Wind for the Rain branch, giving the
final tree:

Outlook
  Sunny (1, 2, 8, 9, 11; 2+, 3-):   Humidity
      High   -> No
      Normal -> Yes
  Overcast (3, 7, 12, 13; 4+, 0-):  Yes
  Rain (4, 5, 6, 10, 14; 3+, 2-):   Wind
      Weak   -> Yes
      Strong -> No
Summary: ID3 (Iterative Dichotomiser 3)

ID3(Examples, Attributes, Label)
  Let S be the set of Examples
  Label is the target attribute (the prediction)
  Attributes is the set of measured attributes
  Create a Root node for the tree
  If all examples are labeled the same, return a single-node tree with that label
  Otherwise Begin
    A = the attribute in Attributes that best classifies S
    For each possible value v of A
      Add a new tree branch corresponding to A = v
      Let Sv be the subset of examples in S with A = v
      If Sv is empty: add a leaf node with the most common value of Label in S
      Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label)
  End
  Return Root
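A compact Python sketch of the procedure above (illustrative only; the dictionary-based examples and the (attribute, {value: subtree}) node representation are my own choices):

import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, label):
    parts = defaultdict(list)
    for e in examples:
        parts[e[attr]].append(e[label])
    labels = [e[label] for e in examples]
    return entropy(labels) - sum(len(p) / len(examples) * entropy(p)
                                 for p in parts.values())

def id3(examples, attributes, label):
    """Return a leaf (a class value) or an internal node (attribute, {value: subtree})."""
    labels = [e[label] for e in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                        # pure leaf, or no attributes left
    best = max(attributes, key=lambda a: gain(examples, a, label))
    # Note: only values seen in `examples` get a branch here; for unseen values the
    # pseudocode above adds a leaf with the majority label of S instead.
    branches = {}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best], label)
    return (best, branches)

Run on the 14-example PlayTennis table with attributes Outlook, Temperature, Humidity, and Wind, this sketch should reproduce the tree constructed above: Outlook at the root, Humidity under Sunny, Wind under Rain.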
Hypothesis Space in Decision Tree Induction

Conducts a search of the space of decision trees, which can represent all possible
discrete functions (pros and cons).
Goal: to find the best decision tree.
Finding a minimal decision tree consistent with a set of data is NP-hard.
Performs a greedy heuristic search: hill climbing without backtracking.
Makes statistically based decisions using all the data.
Bias in Decision Tree Induction

The bias is for trees of minimal depth; however, greedy search introduces
complications: it positions features with high information gain high in the tree
and may not find the minimal tree.
This implements a preference bias (a search bias) as opposed to a restriction bias
(a language bias).
Occam's razor can be defended on the basis that there are relatively few simple
hypotheses compared to complex ones. Therefore, a simple hypothesis that is
consistent with the data is less likely to be a statistical coincidence (but...)
History of Decision Tree Research

Hunt and colleagues in psychology used full-search decision tree methods to model
human concept learning in the 60s.
Quinlan developed ID3, with the information gain heuristic, in the late 70s to learn
expert systems from examples.
Breiman, Friedman and colleagues in statistics developed CART (Classification And
Regression Trees) at around the same time.
A variety of improvements in the 80s: coping with noise, continuous attributes,
missing data, non-axis-parallel splits, etc.
Quinlan's updated algorithm, C4.5 (1993), is commonly used (newer: C5.0).
Boosting (or bagging) over decision trees is a very good general-purpose algorithm.
Overfitting the Data

Learning a tree that classifies the training data perfectly may not lead to the tree
with the best generalization performance:
There may be noise in the training data that the tree is fitting.
The algorithm might be making decisions based on very little data.

A hypothesis h is said to overfit the training data if there is another hypothesis h'
such that h has a smaller error than h' on the training data, but h has a larger error
than h' on the test data.

[Figure: accuracy vs. complexity of the tree, on the training data and on testing data]
Overfitting - Example

Suppose a new, noisy training example arrives:

Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, label = No

The tree built above classifies it as Yes (Outlook = Sunny, Humidity = Normal -> Yes).
To fit this example, the tree is grown further: below Outlook = Sunny,
Humidity = Normal, a new Wind test is added.

This can always be done -- it may fit noise or other coincidental regularities.
Avoiding Overfitting
Two basic approaches
Prepruning: Stop growing the tree at some point during
construction when it is determined that there is not enough data
to make reliable choices.
Postpruning: Grow the full tree and then remove nodes that seem
not to have sufficient evidence.
Methods for evaluating subtrees to prune
Cross-validation: Reserve hold-out set to evaluate utility
Statistical testing: Test if the observed regularity can be dismissed
as likely to occur by chance
Minimum Description Length: Is the additional complexity of the
hypothesis smaller than remembering the exceptions?
This is related to the notion of regularization that we will see in other contexts:
keep the hypothesis simple.
How can this be avoided with linear classifiers?
Trees and Rules

Decision trees can be represented as rules:

If (outlook = sunny) and (humidity = normal) then YES
If (outlook = rain) and (wind = strong) then NO

[Figure: the final PlayTennis tree from which these rules are read off]
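A small Python sketch of reading rules off a tree, reusing the nested (attribute, {value: subtree}) representation from the earlier ID3 sketch (the function and variable names are mine):

def tree_to_rules(tree, conditions=()):
    """Yield (conditions, label) pairs; each condition is an (attribute, value) test."""
    if not isinstance(tree, tuple):               # leaf: tree is a class label
        yield (list(conditions), tree)
        return
    attr, branches = tree
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attr, value),))

# Hand-written tree matching the PlayTennis example above.
play_tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Weak": "Yes", "Strong": "No"})})

for conds, label in tree_to_rules(play_tree):
    print("If", " and ".join(f"({a} = {v})" for a, v in conds), "then", label)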
Reduced-Error Pruning

A post-pruning, cross-validation approach (a code sketch follows below):
  Partition the training data into a grow set and a validation set.
  Build a complete tree from the grow set.
  Until accuracy on the validation set decreases, do:
    For each non-leaf node in the tree:
      Temporarily prune the subtree below it; replace it by a majority-vote leaf.
      Test the accuracy of the hypothesis on the validation set.
    Permanently prune the node whose removal gives the greatest increase in accuracy
    on the validation set.
Problem: Uses less data to construct the tree
Sometimes done at the rules level
Rules are generalized by erasing a condition (different!)
General Strategy: Overfit and Simplify
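A hedged Python sketch of the reduced-error pruning loop over the same nested-tuple trees (the helper names, the grow/validation data format, and the tie-breaking are my assumptions, not the course's exact algorithm):

from collections import Counter

def classify(tree, example):
    """Follow the tree to a leaf; an unseen branch value falls through to None."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(example[attr])
    return tree

def accuracy(tree, examples, label):
    return sum(classify(tree, e) == e[label] for e in examples) / len(examples)

def prune_candidates(tree, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(tree, tuple):
        yield path
        for value, subtree in tree[1].items():
            yield from prune_candidates(subtree, path + (value,))

def replace_at(tree, path, leaf):
    """Return a copy of `tree` with the node at `path` replaced by `leaf`."""
    if not path:
        return leaf
    attr, branches = tree
    new_branches = dict(branches)
    new_branches[path[0]] = replace_at(branches[path[0]], path[1:], leaf)
    return (attr, new_branches)

def majority_at(tree, path, examples, label):
    """Majority label among grow-set examples that reach the node at `path`."""
    for value in path:
        attr, _ = tree
        examples = [e for e in examples if e[attr] == value]
        tree = tree[1][value]
    labels = [e[label] for e in examples]
    return Counter(labels).most_common(1)[0][0] if labels else None

def reduced_error_prune(tree, grow, validation, label):
    best_acc = accuracy(tree, validation, label)
    while True:
        best = None
        for path in prune_candidates(tree):
            leaf = majority_at(tree, path, grow, label)
            candidate = replace_at(tree, path, leaf)
            acc = accuracy(candidate, validation, label)
            if acc >= best_acc:      # prune greedily while validation accuracy does not drop
                best_acc, best = acc, candidate
        if best is None:
            return tree
        tree = best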
Continuous Attributes
Real-valued attributes can, in advance, be discretized into
ranges, such as big, medium, small
Alternatively, one can develop splitting nodes based on
thresholds of the form A<c that partition the data into examples
that satisfy A<c and A>=c. The information gain for these splits
is calculated in the same way and compared to the information
gain of discrete splits.
How to find the split with the highest gain?
For each continuous feature A:
Sort examples according to the value of A
For each ordered pair (x,y) with different labels
Check the mid-point as a possible threshold, i.e., split S into S_{a <= x} and S_{a >= y}.
Continuous Attributes

Example:
  Length (L): 10  15  21  28  32  40  50
  Class:       -   +   +   -   +   +   -

Check thresholds: L < 12.5; L < 24.5; L < 30; L < 45
For each candidate threshold, compute the resulting subsets of examples and their
k+, j- splits, and pick the split with the highest gain.
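A Python sketch of the threshold search on the Length example (variable and function names are mine):

import math
from collections import Counter

values = [10, 15, 21, 28, 32, 40, 50]
labels = ["-", "+", "+", "-", "+", "+", "-"]

def entropy(lab):
    n = len(lab)
    return -sum(c / n * math.log2(c / n) for c in Counter(lab).values()) if n else 0.0

def threshold_gain(threshold):
    left  = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

# Candidate thresholds: mid-points between consecutive values with different labels.
pairs = zip(values, values[1:], labels, labels[1:])
candidates = [(x + y) / 2 for x, y, lx, ly in pairs if lx != ly]
print(candidates)                            # [12.5, 24.5, 30.0, 45.0]
best = max(candidates, key=threshold_gain)
print(best, round(threshold_gain(best), 3))  # 12.5 0.198 (45.0 gives the same gain)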
Missing Values

Diagnosis = <fever, blood_pressure, ..., blood_test = ?, ...>

Many times values are not available for all attributes during training or testing
(e.g., in medical diagnosis).

Training: evaluate Gain(S, a) when some of the examples do not have a value for a.
Missing Values

In the Sunny branch, suppose the Humidity value of day 8 is missing:

Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild ??? Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

Gain(S_sunny, Temp) = .97 - 0 - (2/5) 1 = .57 is unaffected, but how should
Gain(S_sunny, Humidity) be computed?

Fill in the most common value (?) -- High:
  Gain(S_sunny, Humidity) = .97 - (3/5) Ent[+0, -3] - (2/5) Ent[+2, -0] = .97

Fill in Normal:
  Gain(S_sunny, Humidity) = .97 - (2/5) Ent[+0, -2] - (3/5) Ent[+2, -1] < .97

Fractional counts: 0.5 Normal, 0.5 High:
  Gain(S_sunny, Humidity) = .97 - (2.5/5) Ent[+0, -2.5] - (2.5/5) Ent[+2, -.5] < .97
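A small Python sketch of the fractional-count computation above (an illustration only, not Quinlan's exact C4.5 procedure):

import math

def entropy(pos, neg):
    """Entropy of a (possibly fractional) positive/negative count."""
    total = pos + neg
    ent = 0.0
    for c in (pos, neg):
        if c > 0:
            p = c / total
            ent -= p * math.log2(p)
    return ent

# Sunny subset: day 8's Humidity is missing, so it contributes weight 0.5
# to each branch. Counts are (positives, negatives).
high   = (0, 2 + 0.5)   # days 1, 2 plus half of day 8 (all negative)
normal = (2, 0 + 0.5)   # days 9, 11 plus half of day 8
total  = 5.0

gain = (entropy(2, 3)
        - (sum(high) / total) * entropy(*high)
        - (sum(normal) / total) * entropy(*normal))
print(round(gain, 2))   # about 0.61 -- lower than the .97 obtained by filling in "High"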
Missing Values

Testing: classify an example without knowing the value of a.

Other suggestions?
Missing Values

Classifying with the final tree:

Outlook = ???, Temp = Hot, Humidity = Normal, Wind = Strong, label = ??
  Follow all three Outlook branches: 1/3 Yes + 1/3 Yes + 1/3 No = Yes

Outlook = Sunny, Temp = Hot, Humidity = ???, Wind = Strong, label = ??
  Follow both the Normal and the High branches of Humidity.
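A sketch of classifying with a missing attribute value by sending the example down every branch and taking a weighted vote; the tree format follows the earlier ID3 sketch, and the uniform 1/3 weights mirror the slide (weighting branches by training frequencies is another common choice):

from collections import Counter

def classify_soft(tree, example):
    """Return a Counter of label weights; missing values are split across branches."""
    if not isinstance(tree, tuple):                  # leaf
        return Counter({tree: 1.0})
    attr, branches = tree
    value = example.get(attr)
    if value in branches:                            # known value: follow one branch
        return classify_soft(branches[value], example)
    votes = Counter()                                # missing value: split evenly
    for subtree in branches.values():
        for label, w in classify_soft(subtree, example).items():
            votes[label] += w / len(branches)
    return votes

play_tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Weak": "Yes", "Strong": "No"})})

votes = classify_soft(play_tree, {"Temp": "Hot", "Humidity": "Normal", "Wind": "Strong"})
print(votes)   # about {'Yes': 0.667, 'No': 0.333} -> predict Yes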
Other Issues

Attributes with different costs
  Change information gain so that low-cost attributes are preferred.
Alternative measures for selecting attributes
  When different attributes have different numbers of values, information gain tends
  to prefer those with many values.
Oblique Decision Trees
  Decisions are not axis-parallel.
Incremental Decision Tree Induction
  Update an existing decision tree to account for new examples incrementally
  (maintain consistency?).
Decision Trees as Features

Rather than using decision trees to represent the target function, it is becoming
common to use small decision trees as features.
When learning over a large number of features, learning decision trees is difficult
and the resulting tree may be very large (overfitting).
Instead, learn small decision trees with limited depth.
Treat them as experts; they are correct, but only on a small region of the domain.
(Which DTs to learn? The same ones every time?)
Then learn another function, typically a linear function, over these trees as features.
Boosting (but also other linear learners) is used on top of the small decision trees
(with either Boolean or real-valued features).
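A hedged sketch of the idea using scikit-learn, assuming it is available; the random feature subsets, depths, and counts are illustrative choices of mine, not a prescription from the course:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                      # toy data with many features
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)

# Learn several small (depth-limited) trees, each on a random subset of features.
stumps, subsets = [], []
for _ in range(25):
    cols = rng.choice(X.shape[1], size=5, replace=False)
    tree = DecisionTreeClassifier(max_depth=2).fit(X[:, cols], y)
    stumps.append(tree)
    subsets.append(cols)

# Each tree's prediction becomes one Boolean feature for a linear learner.
Z = np.column_stack([t.predict(X[:, c]) for t, c in zip(stumps, subsets)])
linear = LogisticRegression().fit(Z, y)
print(linear.score(Z, y))   # training accuracy of the combined model (illustration only)

Each shallow tree acts as a local expert, and the linear model learns how to weigh their votes.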
Decision Trees - Summary
Hypothesis Space:
Variable size (contains all functions)
Deterministic; Discrete and Continuous attributes
Search Algorithm
ID3 - batch, constructive search
Extensions: missing values
Issues:
What is the goal?
When to stop? How to guarantee good generalization?
Did not address:
How are we doing? (Correctness-wise, Complexity-wise)