ML Lecture 3
1
Decision Tree Representation
[Figure: a decision tree for PlayTennis — Outlook at the root with branches Sunny, Overcast, and Rain; the Sunny branch tests Humidity (High / Normal) and the Rain branch tests Wind (Strong / Weak)]
3
ID3: The Basic Decision Tree
Learning Algorithm
Training database: see the PlayTennis examples in [Mitchell, p. 59]
4
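As a concrete illustration of the kind of database ID3 consumes, here is a minimal Python sketch of attribute-value training examples in the spirit of the PlayTennis table in [Mitchell, p. 59]; the encoding (a list of dicts) and the particular rows shown are assumptions of this sketch, not a verbatim copy of the table.

```python
# A minimal sketch of how ID3's input "database" can be encoded in Python:
# each training example maps attribute names to discrete values, plus a
# boolean target (PlayTennis). The rows are illustrative, not a verbatim
# copy of the table in [Mitchell, p. 59].
EXAMPLES = [
    {"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Weak",   "PlayTennis": False},
    {"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Strong", "PlayTennis": False},
    {"Outlook": "Overcast", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": True},
    {"Outlook": "Rain",     "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": True},
    {"Outlook": "Rain",     "Humidity": "Normal", "Wind": "Strong", "PlayTennis": False},
]

ATTRIBUTES = ["Outlook", "Humidity", "Wind"]  # candidate attributes to split on
TARGET = "PlayTennis"                          # the attribute ID3 tries to predict
```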
ID3 (Cont’d)
[Figure: partially learned tree after choosing Outlook at the root; the 14 training examples D1–D14 are sorted to its branches: Sunny {D1, D2, D8, D9, D11}, Overcast {D3, D7, D12, D13}, Rain {D4, D5, D6, D10, D14}]
7
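A possible Python rendering of the step pictured above: splitting a set of examples into subsets according to the value of one attribute. The function name `partition` and the dict-of-subsets result are illustrative choices, not notation from the slides.

```python
from collections import defaultdict

def partition(examples, attribute):
    """Group examples by their value for `attribute` (e.g. Outlook ->
    Sunny / Overcast / Rain), mirroring the branches drawn above."""
    subsets = defaultdict(list)
    for ex in examples:
        subsets[ex[attribute]].append(ex)
    return dict(subsets)

# Example (with the EXAMPLES list sketched on the earlier slide):
# partition(EXAMPLES, "Outlook")
# -> {"Sunny": [...], "Overcast": [...], "Rain": [...]}
```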
What Attribute to choose to
“best” split a node? (Cont’d)
We want to use this measure to choose an attribute
that minimizes the disorder in the partitions it
creates. Let {S_i | 1 ≤ i ≤ n} be the partition of S induced by a particular
attribute. The disorder associated with this partition is:

V({S_i | 1 ≤ i ≤ n}) = Σ_{i=1..n} (|S_i| / |S|) · I(P(S_i), N(S_i))

where P(S_i) is the set of positive examples in S_i and N(S_i) is the set of
negative examples in S_i.
8
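A possible Python rendering of this disorder measure, assuming I(P, N) is the usual two-class entropy of the proportions of positive and negative examples; the function names entropy_I, disorder_V, and best_attribute are just illustrative labels for I, V, and the attribute-selection step.

```python
from collections import defaultdict
from math import log2

def entropy_I(n_pos, n_neg):
    """I({P, N}): entropy of a set with n_pos positive and n_neg negative examples."""
    total = n_pos + n_neg
    if total == 0 or n_pos == 0 or n_neg == 0:
        return 0.0
    p, n = n_pos / total, n_neg / total
    return -p * log2(p) - n * log2(n)

def disorder_V(examples, attribute, target="PlayTennis"):
    """V({S_i}): the |S_i|/|S|-weighted entropy of the subsets S_i obtained
    by splitting `examples` on `attribute`."""
    subsets = defaultdict(list)
    for ex in examples:
        subsets[ex[attribute]].append(ex)
    total = len(examples)
    v = 0.0
    for s_i in subsets.values():
        pos = sum(1 for ex in s_i if ex[target])
        neg = len(s_i) - pos
        v += (len(s_i) / total) * entropy_I(pos, neg)
    return v

def best_attribute(examples, attributes, target="PlayTennis"):
    """Choose the attribute whose partition has minimum disorder
    (equivalently, maximum information gain)."""
    return min(attributes, key=lambda a: disorder_V(examples, a, target))
```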
Hypothesis Space Search in
Decision Tree Learning
Hypothesis Space: the set of all possible decision trees (i.e., the complete
space of finite discrete-valued functions).
Search Method: simple-to-complex hill-climbing search (unlike the
candidate-elimination method, only a single current hypothesis is
maintained). No backtracking!
Evaluation Function: the information gain measure.
Batch Learning: ID3 uses all training examples at each step to make
statistically based decisions (unlike the candidate-elimination method,
which makes decisions incrementally) ==> the search is less sensitive to
errors in individual training examples.
9
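To connect this search description to code: a minimal, self-contained Python sketch of the greedy, simple-to-complex search, in which each recursive call commits to the locally best attribute (minimum weighted entropy, i.e. maximum information gain) and never backtracks. The helper names and the nested-dict tree encoding are assumptions of this sketch, not notation from the slides.

```python
from collections import Counter, defaultdict
from math import log2

def _entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def _split(examples, attribute):
    subsets = defaultdict(list)
    for ex in examples:
        subsets[ex[attribute]].append(ex)
    return subsets

def _disorder(examples, attribute, target):
    # |S_i|/|S|-weighted entropy of the subsets produced by `attribute`
    subsets = _split(examples, attribute)
    total = len(examples)
    return sum((len(s) / total) * _entropy(s, target) for s in subsets.values())

def id3(examples, attributes, target):
    """Greedy top-down induction: pick the best attribute for the current node,
    recurse on each value's subset, and never revisit the choice (no backtracking)."""
    labels = [ex[target] for ex in examples]
    # Base cases: pure node, or no attributes left -> majority-label leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Hill-climbing step: the attribute with minimum disorder (maximum gain).
    best = min(attributes, key=lambda a: _disorder(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value, subset in _split(examples, best).items():
        tree[best][value] = id3(subset, remaining, target)
    return tree
```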
Inductive Bias in Decision Tree
Learning
ID3’s Inductive Bias: Shorter trees are preferred over
longer trees. Trees that place high information gain
attributes close to the root are preferred over those that
do not.
Note: this type of bias is different from the type of bias
used by Candidate-Elimination: the inductive bias of ID3
follows from its search strategy (preference or search
bias) whereas the inductive bias of the Candidate-
Elimination algorithm follows from the definition of its
hypothesis space (restriction or language bias).
10
Why Prefer Short Hypotheses?
Occam’s razor: Prefer the
simplest hypothesis that fits the data [William of Occam
(Philosopher), circa 1320]
Scientists seem to do this: e.g., physicists seem to prefer simple explanations
for the motions of the planets over more complex ones.
Argument: Since there are fewer short hypotheses than long ones, it is less
likely that one will find a short hypothesis that coincidentally fits the training
data.
Problem with this argument: it can be made about many other constraints.
Why is the “short description” constraint more relevant than others?
Nevertheless: Occam’s razor was shown experimentally to be a successful
strategy!
11
Issues in Decision Tree Learning:
I. Avoiding Overfitting the Data
Definition: Given a hypothesis space H, a hypothesis h ∈ H is
said to overfit the training data if there exists some alternative
hypothesis h' ∈ H, such that h has smaller error than h' over the
training examples, but h' has a smaller error than h over the
entire distribution of instances. (See the curves in [Mitchell, p. 67].)
There are two approaches for overfitting avoidance in Decision
Trees:
Stop growing the tree before it perfectly fits the data
Allow the tree to overfit the data, and then post-prune it.
12
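The slides do not commit to a particular post-pruning procedure; one common choice, reduced-error pruning (also discussed in Mitchell's chapter), can be sketched as follows, assuming the nested-dict tree encoding from the ID3 sketch above and a held-out validation set. All helper names are illustrative.

```python
from collections import Counter

def classify(tree, example, default=None):
    """Walk a nested-dict tree (as built by the ID3 sketch above) to a leaf label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        branches = tree[attribute]
        value = example.get(attribute)
        if value not in branches:
            return default          # unseen attribute value
        tree = branches[value]
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(tree, validation, target):
    """Bottom-up: replace a subtree by the majority label of the validation
    examples reaching it whenever that does not hurt validation accuracy."""
    if not isinstance(tree, dict):
        return tree                 # already a leaf
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        subset = [ex for ex in validation if ex.get(attribute) == value]
        if subset:
            tree[attribute][value] = reduced_error_prune(subtree, subset, target)
    if not validation:
        return tree
    majority = Counter(ex[target] for ex in validation).most_common(1)[0][0]
    if accuracy(majority, validation, target) >= accuracy(tree, validation, target):
        return majority             # prune: collapse this subtree to a leaf
    return tree
```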
Issues in Decision Tree Learning:
II. Other Issues
Incorporating Continuous-Valued Attributes (see the sketch after this list)
Alternative Measures for Selecting Attributes
Handling Training Examples with Missing Attribute Values
Handling Attributes with Differing Costs
13
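For the first of these issues, continuous-valued attributes, a standard approach is to derive boolean candidate tests of the form A < c, considering thresholds midway between adjacent (sorted) values whose labels differ and keeping the one with the lowest weighted entropy (highest gain). A minimal sketch under that assumption, with illustrative names and numbers:

```python
from math import log2

def _entropy(labels):
    total = len(labels)
    probs = [labels.count(v) / total for v in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_threshold(values, labels):
    """Pick a cut point c for a continuous attribute: candidate thresholds lie
    midway between adjacent (sorted) values whose labels differ; return the
    candidate minimizing the weighted entropy of the A < c / A >= c split."""
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    best_c, best_disorder = None, float("inf")
    for i in range(1, total):
        (v_prev, l_prev), (v_next, l_next) = pairs[i - 1], pairs[i]
        if l_prev == l_next or v_prev == v_next:
            continue                       # only cut where the label changes
        c = (v_prev + v_next) / 2
        left = [l for v, l in pairs if v < c]
        right = [l for v, l in pairs if v >= c]
        disorder = ((len(left) / total) * _entropy(left)
                    + (len(right) / total) * _entropy(right))
        if disorder < best_disorder:
            best_c, best_disorder = c, disorder
    return best_c, best_disorder

# e.g. Temperature readings from a handful of examples (illustrative numbers):
# best_threshold([40, 48, 60, 72, 80, 90], ["No", "No", "Yes", "Yes", "Yes", "No"])
```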