19 - Decision Tree - ID3
Machine Learning Course – CS/SE 7th
Decision Trees
Dr Abid Ali
PAF-IAST
Decision Trees
• Earlier, we decoupled the generation of the feature space from the
learning.
• Argued that we can map the given examples into another space, in
which the target functions are linearly separable.
This Lecture
• Decision trees for (binary) classification
– Non-linear classifiers
• Overfitting
– Some experimental issues
Introduction to Decision Trees
Representing Data
• Think about a large table, N attributes, and assume you want to know
something about the people represented as entries in this table.
• E.g., do they own an expensive car or not?
• Simplest way: Histogram on the first attribute – own
• Then, histogram on first and second (own & gender)
• But, what if the # of attributes is larger: N=16
• How large are the 1-d histograms (contingency tables) ? 16 numbers
• How large are the 2-d histograms? 16-choose-2 = 120 numbers
• How many 3-d tables? 560 numbers
• With 100 attributes, the 3-d tables need 161,700 numbers
– We need to figure out a way to represent the data better, and figure
out which attributes are important to look at first.
– Information theory has something to say about it – we will use it to better
represent the data.
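As a quick check of the counts above, here is a small Python sketch (not part of the original slides); it assumes each d-dimensional table corresponds to choosing d of the N attributes:

from math import comb

# Number of d-attribute contingency tables over N attributes: "N choose d".
for n, d in [(16, 1), (16, 2), (16, 3), (100, 3)]:
    print(f"N={n}, d={d}: {comb(n, d)} tables")
# N=16, d=1: 16 tables
# N=16, d=2: 120 tables
# N=16, d=3: 560 tables
# N=100, d=3: 161700 tables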
Decision Trees
– A hierarchical data structure that represents data by
implementing a divide and conquer strategy
– Can be used as a non-parametric classification and
regression method
– Given a collection of examples, learn a decision tree that
represents it.
– Use this representation to classify new examples
The Representation
• Decision Trees are classifiers for instances represented as feature vectors
– color={red, blue, green} ; shape={circle, triangle, rectangle} ; label= {A, B, C}
• Nodes are tests for feature values
• There is one branch for each value of the feature
• Leaves specify the category (labels)
• Can categorize instances into multiple disjoint categories
[Figure: learning and evaluation of a decision tree with root Color, internal Shape nodes, and leaf labels A, B, C]
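To make the evaluation procedure concrete, here is a minimal Python sketch of classifying an instance with a decision tree; the dictionary-based tree and the example instance are hypothetical, loosely mirroring the Color/Shape illustration:

def classify(tree, example):
    # Walk from the root: follow the branch matching the tested feature's
    # value until a leaf (a plain label string) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["test"]]]
    return tree

# Hypothetical tree: test Color at the root, Shape below two of its branches.
tree = {"test": "color", "branches": {
    "blue": "B",
    "red": {"test": "shape",
            "branches": {"circle": "A", "triangle": "B", "rectangle": "C"}},
    "green": {"test": "shape",
              "branches": {"circle": "A", "triangle": "B", "rectangle": "C"}}}}

print(classify(tree, {"color": "red", "shape": "circle"}))   # -> A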
Expressivity of Decision Trees
• As Boolean functions they can represent any Boolean function.
• Can be rewritten as rules in Disjunctive Normal Form (DNF)
– Green ∧ Square → positive
– Blue ∧ Circle → positive
– Blue ∧ Square → positive
• The disjunction of these rules is equivalent to the Decision Tree
• What did we show? What is the hypothesis space here?
– 2 dimensions: color and shape
– 3 values each: color(red, blue, green), shape(triangle, square, circle)
– |X| = 9: (red, triangle), (red, circle), (blue, square) …
– |Y| = 2: + and -
– |H| = 2^9 = 512
[Figure: decision tree with root Color and internal Shape nodes representing these rules]
Decision Trees
[Figure: points labeled +/- on the real line (marks at 1 and 3) and a decision tree with threshold tests such as X<1 separating them]
Today’s key concepts
• Learning decision trees (ID3 algorithm)
– Greedy heuristic (based on information gain)
– Originally developed for discrete features
• Overfitting
– What is it? How do we deal with it?
• Principles of Experimental ML
Learning decision trees (ID3 algorithm)
Decision Trees
• Can represent any Boolean Function
• Can be viewed as a way to compactly represent a lot of
data.
• Natural representation (think of the game of 20 questions)
• The evaluation of the Decision Tree Classifier is easy
• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.
[Figure: a decision tree for "Will I play tennis today?" – Outlook (Sunny/Overcast/Rain); Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes)]
Will I play tennis today?
• Features
– Outlook: {Sun, Overcast, Rain}
– Temperature: {Hot, Mild, Cool}
– Humidity: {High, Normal, Low}
– Wind: {Strong, Weak}
• Labels
– Binary classification task: Y = {+, -}
Will I play tennis today?
 #   O  T  H  W   Play?
 1   S  H  H  W   -
 2   S  H  H  S   -
 3   O  H  H  W   +
 4   R  M  H  W   +
 5   R  C  N  W   +
 6   R  C  N  S   -
 7   O  C  N  S   +
 8   S  M  H  W   -
 9   S  C  N  W   +
 10  R  M  N  W   +
 11  S  M  N  S   +
 12  O  M  H  S   +
 13  O  H  N  W   +
 14  R  M  H  S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e., all the data is available)
• Algorithm?
[Figure: a dataset split two ways – on attribute A and on attribute B – with the resulting label counts in each branch]
• Advantage A. But… we need a way to quantify things.
• One way to think about it: the number of queries required to
label a random data point.
• If we choose A we have less uncertainty about the labels.
Picking the Root Attribute
• The goal is to have the resulting decision tree as
small as possible (Occam’s Razor)
– The main decision in the algorithm is the selection of the
next attribute to condition on.
• We want attributes that split the examples into sets
that are relatively pure in one label; this way we
are closer to a leaf node.
– The most popular heuristic is based on information
gain, which originated with the ID3 system of Quinlan.
Entropy
• Entropy (impurity, disorder) of a set of examples, S, relative to a
binary classification is:

    Entropy(S) = - p+ log2(p+) - p- log2(p-)

where p+ is the proportion of positive examples in S and p- is the
proportion of negative examples in S.
• (Convince yourself that the maximum value would be 1, attained when p+ = p- = 1/2.)
• (Also note that the base of the log only introduces a constant factor; therefore, we'll
think about base 2.)
[Figure: example label distributions and their entropy values]
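A minimal Python sketch of this computation (the function name and the example label lists are illustrative, not from the slides):

from math import log2

def entropy(labels):
    # Entropy of a list of class labels, in bits (log base 2):
    # -sum over labels of p * log2(p), where p is the label's proportion.
    n = len(labels)
    result = 0.0
    for y in set(labels):
        p = labels.count(y) / n
        result -= p * log2(p)
    return result

print(entropy(['+', '+', '-', '-']))     # 1.0  -> maximal uncertainty
print(entropy(['+', '+', '+', '+']))     # 0.0  -> no uncertainty
print(entropy(['+'] * 9 + ['-'] * 5))    # ~0.940 (the tennis data: 9+, 5-)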
Information Gain
High Entropy – high level of uncertainty
Low Entropy – no uncertainty
• The information gain of an attribute a is the expected reduction in entropy
caused by partitioning on that attribute:

    Gain(S, a) = Entropy(S) - Σ_{v ∈ Values(a)} (|Sv| / |S|) · Entropy(Sv)

• Where:
– Sv is the subset of S for which attribute a has value v, and
– the entropy of partitioning the data is calculated by weighing
the entropy of each partition by its size relative to the
original set
[Figure: the Outlook and Humidity splits from the tennis example]
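Continuing the sketch, information gain can be computed by reusing entropy() from above; `examples` is assumed to be a list of (feature-dict, label) pairs, a representation chosen here purely for illustration:

def information_gain(examples, attribute):
    labels = [y for _, y in examples]
    gain = entropy(labels)
    # Subtract the entropy of each partition, weighted by its relative size.
    for value in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

On the tennis data above this gives Gain(S, Outlook) ≈ 0.246, the largest gain of the four attributes, which is why Outlook ends up at the root in Quinlan's classic example.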
induceDecisionTree(S)
• 1. Does S uniquely define a class?
     if all s ∈ S have the same label y: return S;
• 2. Find the feature with the most information gain:
     i = argmax_j Gain(S, Xj)
• 3. Add children to S:
     for k in Values(Xi):
         Sk = {s ∈ S | xi = k}
         addChild(S, Sk)
         induceDecisionTree(Sk)
     return S;
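A runnable Python version of this pseudocode – a sketch that reuses entropy() and information_gain() from the earlier snippets; tie-breaking and exhausted-attribute cases are handled naively:

from collections import Counter

def induce_decision_tree(examples, attributes):
    labels = [y for _, y in examples]
    # 1. If all examples share one label, return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:                       # no features left: majority label
        return Counter(labels).most_common(1)[0][0]
    # 2. Pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a))
    # 3. Add one child per observed value of the chosen attribute.
    children = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        children[value] = induce_decision_tree(
            subset, [a for a in attributes if a != best])
    return {"test": best, "branches": children}

The result uses the same dictionary representation as the earlier classify() sketch; run on the 14 tennis examples it should select Outlook at the root, matching the illustrative example.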
An Illustrative Example (VI)
[Figure: the partially built tree after choosing Outlook as the root]
Example
• Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong
• label: NO
• This example doesn’t exist in the tree
[Figure: learned tree – Outlook: Sunny → examples 1,2,8,9,11 (2+,3-) → Humidity (High → No, Normal → Yes); Overcast → examples 3,7,12,13 (4+,0-) → Yes; Rain → examples 4,5,6,10,14 (3+,2-) → Wind (Strong → No, Weak → Yes)]
Overfitting - Example
• Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong
• label: NO
• This example doesn’t exist in the tree
• This can always be done – we may fit noise or other coincidental regularities
[Figure: the tree extended to fit this example – under Outlook = Sunny, Humidity = Normal now leads to a further Wind test (Strong → No, Weak → Yes) instead of a Yes leaf]
[Figure: our training data shown as a small sample of the full instance space]
Overfitting the Data
• Learning a tree that classifies the training data perfectly may not lead to the
tree with the best generalization performance.
– There may be noise in the training data that the tree is fitting
– The algorithm might be making decisions based on very little data
• A hypothesis h is said to overfit the training data if there is another hypothesis
h’, such that h has a smaller error than h’ on the training data but h has larger
error on the test data than h’.
[Figure: accuracy on training data vs. on testing data as a function of the complexity of the tree]
Reasons for overfitting
• Too much variance in the training data
– Training data is not a representative sample
of the instance space
– We split on features that are actually irrelevant
The i.i.d. assumption
• Training and test items are independently and
identically distributed (i.i.d.):
– There is a distribution P(X, Y) from which the data D = {(x, y)}
is generated.
• Sometimes it’s useful to rewrite P(X, Y) as P(X)P(Y|X)
– Usually P(X, Y) is unknown to us (we just know it exists)
– Training and test data are samples drawn from the same P(X,
Y): they are identically distributed
– Each (x, y) is drawn independently from P(X, Y)
Overfitting
• Why this shape of curves?
[Figure: accuracy on training data vs. on test data as a function of the size of the tree]
Model complexity
[Figure: empirical error as a function of model complexity]
[Figure: expected error as a function of model complexity]
• Expected error:
What percentage of items drawn from P(x,y) do we expect to be
misclassified by f?
• (That’s what we really care about – generalization)
Variance of a learner (informally)
[Figure: variance as a function of model complexity]
[Figure: expected error, with its bias and variance components, as a function of model complexity]
[Figure: bias as a function of model complexity]
Continuous Attributes
• Real-valued attributes can, in advance, be discretized into
ranges, such as big, medium, small
• Alternatively, one can develop splitting nodes based on
thresholds of the form A<c that partition the data into examples
that satisfy A<c and A>=c.
– The information gain for these splits is calculated in the same way and
compared to the information gain of discrete splits.
• How to find the split with the highest gain?
• For each continuous feature A:
– Sort examples according to the value of A
– For each ordered pair of consecutive values (x, y) with different labels
• Check the mid-point (x+y)/2 as a possible threshold, i.e.,
• split into S_{A < (x+y)/2} and S_{A >= (x+y)/2}
Continuous Attributes
• Example:
– Length (L): 10 15 21 28 32 40 50
– Class: - + + - + + -
– Check thresholds: L < 12.5; L < 24.5; L < 45
– Subset of Examples= {…}, Split= k+,j-
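A small Python sketch of this candidate-threshold search, applied to the example above (the function name is illustrative):

def candidate_thresholds(values, labels):
    # Sort examples by the attribute value; midpoints between consecutive
    # examples with different labels are the candidate split thresholds.
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
            if y1 != y2]

L      = [10, 15, 21, 28, 32, 40, 50]
labels = ['-', '+', '+', '-', '+', '+', '-']
print(candidate_thresholds(L, labels))   # [12.5, 24.5, 30.0, 45.0]

Note that the procedure also yields 30.0 (between the 28/- and 32/+ examples) in addition to the three thresholds listed on the slide; the information gain of each candidate split is then computed exactly as for discrete splits.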
Metrics
• We train on our training data Train = {(xi, yi)}, i = 1, …, m
• We test on Test data.
• We often set aside part of the training data as a development set,
especially when the algorithms require tuning.
– In the HW we asked you to present results also on the Training; why?
• When we deal with binary classification we often measure
performance simply using Accuracy:

    Accuracy = (# of correctly predicted examples) / (total # of examples)
Confusion Matrix
• Given a dataset of P positive instances and N negative instances:

                      Predicted Class
                      Yes      No
  Actual Class  Yes   TP       FN
                No    FP       TN

• The notion of a confusion matrix can be usefully extended to the
multiclass case: the (i,j) cell indicates how many of the i-labeled
examples were predicted to be j.
• Imagine using the classifier to identify positive cases (i.e., for
information retrieval):
– Precision: the probability that a randomly selected positive
prediction is indeed positive, TP / (TP + FP)
– Recall: the probability that a randomly selected positive is
identified, TP / (TP + FN)
Relevant Metrics
• It makes sense to consider Recall
and Precision together or combine
them into a single metric.
• Recall-Precision Curve:
[Figure: precision plotted against recall]
• F-Measure:
– A measure that combines precision and recall: the harmonic mean of
precision and recall,

    F1 = 2 · Precision · Recall / (Precision + Recall)
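A minimal Python sketch of these metrics computed from confusion-matrix counts (the example counts are hypothetical):

def precision(tp, fp):
    return tp / (tp + fp)    # fraction of positive predictions that are correct

def recall(tp, fn):
    return tp / (tp + fn)    # fraction of actual positives that are identified

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)           # harmonic mean of precision and recall

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical counts:
tp, fp, fn, tn = 40, 10, 20, 30
print(accuracy(tp, fp, fn, tn))   # 0.7
print(precision(tp, fp))          # 0.8
print(recall(tp, fn))             # 0.666...
print(f1(tp, fp, fn))             # 0.727...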
N-fold cross validation
• Instead of a single test-training split:
[Figure: the data partitioned into train and test portions]
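To make this concrete, here is a minimal Python sketch of N-fold cross validation; train_fn and score_fn are hypothetical placeholders for any learner (e.g., the ID3 sketch above) and any metric (e.g., accuracy), and shuffling is omitted for brevity:

def cross_validate(data, n_folds, train_fn, score_fn):
    fold_size = len(data) // n_folds
    scores = []
    for i in range(n_folds):
        # Hold out fold i for testing, train on the rest.
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return sum(scores) / len(scores)       # average score across the N folds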