Lecture 2: Decision Trees
Decision Trees
– A hierarchical data structure that represents data by implementing a
divide and conquer strategy
– Can be used as a non-parametric classification and regression method
– Given a collection of examples, learn a decision tree that represents it.
– Use this representation to classify new examples
2
Learning decision trees (ID3 algorithm)
Will I play tennis today?
• Features
– Outlook: {Sun, Overcast, Rain}
– Temperature: {Hot, Mild, Cool}
– Humidity: {High, Normal, Low}
– Wind: {Strong, Weak}
• Labels
– Binary classification task: Y = {+, -}
4
Will I play tennis today?
 #   O   T   H   W   Play?
 1   S   H   H   W    -
 2   S   H   H   S    -
 3   O   H   H   W    +
 4   R   M   H   W    +
 5   R   C   N   W    +
 6   R   C   N   S    -
 7   O   C   N   S    +
 8   S   M   H   W    -
 9   S   C   N   W    +
10   R   M   N   W    +
11   S   M   N   S    +
12   O   M   H   S    +
13   O   H   N   W    +
14   R   M   H   S    -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)
5
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top-down.
Algorithm?

(Data: the 14-example table above.)

The resulting decision tree:
Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind
               Strong → No
               Weak   → Yes
Basic Decision Tree Algorithm
• Let S be the set of examples
  – Label is the target attribute (the prediction)
  – Attributes is the set of measured attributes
• ID3(S, Attributes, Label)
    If all examples in S have the same label, return a single-node tree with that label.
    Otherwise:
      A = the attribute in Attributes that best classifies S; create a Root node for the tree that tests A.
      For each possible value v of A:
        Add a new tree branch corresponding to A = v.
        Let Sv be the subset of examples in S with A = v.
        If Sv is empty: add a leaf node labeled with the most common value of Label in S
          (why? so the tree can still make a prediction at evaluation time).
        Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label).
      Return Root.
7
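A minimal Python sketch of the ID3 procedure above, using the information-gain heuristic defined in the following slides. The data representation (examples as dicts, a tree as either a label or an (attribute, children) pair) and all helper names are assumptions for illustration, not the lecture's code; for simplicity it only creates branches for attribute values actually observed in S, so the empty-Sv case never arises.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, attribute):
    """Expected reduction in entropy from splitting on `attribute`."""
    expected = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [y for ex, y in zip(examples, labels) if ex[attribute] == v]
        expected += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - expected

def id3(examples, labels, attributes):
    """Return a tree: either a label (leaf) or (attribute, {value: subtree})."""
    if len(set(labels)) == 1:              # all examples share one label
        return labels[0]
    if not attributes:                     # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    children = {}
    for v in set(ex[best] for ex in examples):   # only values observed in S
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        children[v] = id3([examples[i] for i in idx],
                          [labels[i] for i in idx],
                          [a for a in attributes if a != best])
    return (best, children)

# Tiny usage example on a fragment of the tennis data:
X = [{'Outlook': 'S', 'Wind': 'W'}, {'Outlook': 'S', 'Wind': 'S'},
     {'Outlook': 'O', 'Wind': 'W'}, {'Outlook': 'R', 'Wind': 'W'}]
y = ['-', '-', '+', '+']
print(id3(X, y, ['Outlook', 'Wind']))    # splits on Outlook
```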
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– But, finding the minimal decision tree consistent with the data is NP-hard.
• The recursive algorithm is a greedy heuristic search for a
simple tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next
attribute to condition on.
8
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– The main decision in the algorithm is the selection of the next attribute
to condition on.
• We want attributes that split the examples to sets that are
relatively pure in one label; this way we are closer to a leaf
node.
– The most popular heuristic is based on information gain, and originated with
the ID3 system of Quinlan.
9
Entropy
• Entropy (impurity, disorder) of a set of examples, S, relative to a binary
classification is:

  $\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$

  where $p_{+}$ and $p_{-}$ are the proportions of positive and negative examples in S.

• Entropy can be viewed as the number of bits required, on average, to encode the
class of labels. If the probability for + is 0.5, a single bit is required for each
example; if it is 0.8, we can use less than 1 bit.
10
Entropy
• Entropy (impurity, disorder) of a set of examples, S, relative to a binary classification is:
  [Figure: example sets with different proportions of + and - labels and their entropies]
11
Entropy
(Convince yourself that the max value would be 1, attained when the two classes are equally likely.)
(Also note that the base of the log only introduces a constant factor; therefore, we'll think about base 2.)
12
Information Gain
High Entropy – High level of Uncertainty
Low Entropy – No Uncertainty.
13
Will I play tennis today?
(Data: the 14-example table above, with 9 positive and 5 negative examples.)

• Calculate the current entropy:
  $p_{+} = 9/14$, $p_{-} = 5/14$
  $\mathrm{Entropy}(\mathrm{Play}) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.94$
15
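A quick, self-contained numeric check of this value (the counts come from the table above):

```python
import math

# Entropy of the full tennis data: 9 positive and 5 negative examples.
p_pos, p_neg = 9 / 14, 5 / 14
H = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(H, 3))   # 0.94
```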
Information Gain: Outlook
$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

Outlook = Sunny: 2+, 3-     →  Entropy(O = S) = 0.971
Outlook = Overcast: 4+, 0-  →  Entropy(O = O) = 0
Outlook = Rainy: 3+, 2-     →  Entropy(O = R) = 0.971

Expected entropy = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.694

Information gain = 0.940 - 0.694 = 0.246
16
Information Gain: Humidity
$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

Humidity = High: 3+, 4-    →  Entropy(H = H) = 0.985
Humidity = Normal: 6+, 1-  →  Entropy(H = N) = 0.592

Expected entropy = (7/14)×0.985 + (7/14)×0.592 = 0.7885

Information gain = 0.940 - 0.7885 = 0.151
17
Which feature to split on?
(Data: the 14-example table above.)

Information gain:
  Outlook:     0.246
  Humidity:    0.151
  Wind:        0.048
  Temperature: 0.029

→ Split on Outlook
18
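The four gains above can be re-derived mechanically. Below is a small, self-contained sketch; the data literals simply restate the table above, and entropy and gain follow the definitions given earlier.

```python
import math
from collections import Counter

data = [  # (Outlook, Temperature, Humidity, Wind, Play)
    ('S','H','H','W','-'), ('S','H','H','S','-'), ('O','H','H','W','+'),
    ('R','M','H','W','+'), ('R','C','N','W','+'), ('R','C','N','S','-'),
    ('O','C','N','S','+'), ('S','M','H','W','-'), ('S','C','N','W','+'),
    ('R','M','N','W','+'), ('S','M','N','S','+'), ('O','M','H','S','+'),
    ('O','H','N','W','+'), ('R','M','H','S','-'),
]
columns = {'Outlook': 0, 'Temperature': 1, 'Humidity': 2, 'Wind': 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    expected = 0.0
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        expected += len(sub) / len(rows) * entropy(sub)
    return entropy(labels) - expected

for name, col in columns.items():
    print(f"{name}: {gain(data, col):.3f}")
# Outlook: 0.246, Temperature: 0.029, Humidity: 0.151, Wind: 0.048
```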
An Illustrative Example (III)
Gain(S, Outlook)     = 0.246
Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029

→ The root of the tree is Outlook.
19
An Illustrative Example (III)
Outlook
  Sunny:    examples 1,2,8,9,11   (2+, 3-)  → ?
  Overcast: examples 3,7,12,13    (4+, 0-)  → Yes
  Rain:     examples 4,5,6,10,14  (3+, 2-)  → ?

(Data: the 14-example table above.)
20
An Illustrative Example (III)
Outlook
  Sunny:    examples 1,2,8,9,11   (2+, 3-)  → ?
  Overcast: examples 3,7,12,13    (4+, 0-)  → Yes
  Rain:     examples 4,5,6,10,14  (3+, 2-)  → ?

Continue until:
• every attribute is included in the path, or
• all examples in the leaf have the same label.
21
An Illustrative Example (IV)
Outlook
  Sunny:    examples 1,2,8,9,11   (2+, 3-)  → ?
  Overcast: examples 3,7,12,13    (4+, 0-)  → Yes
  Rain:     examples 4,5,6,10,14  (3+, 2-)  → ?

Within the Sunny branch (Entropy(S_sunny) = 0.97):
  Gain(S_sunny, Humidity) = 0.97 - (3/5)·0 - (2/5)·0 = 0.97
  Gain(S_sunny, Temp)     = 0.97 - (2/5)·0 - (2/5)·1 - (1/5)·0 = 0.57
  Gain(S_sunny, Wind)     = 0.97 - (2/5)·1 - (3/5)·0.92 = 0.02

→ Split on Humidity
22
An Illustrative Example (V)
Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → ?
24
induceDecisionTree(S)
• 1. Does S uniquely define a class?
     if all s ∈ S have the same label y: return S;
• 2. Find the attribute with the most information gain:
     i = argmax_i Gain(S, Xi)
• 3. Add children to S:
     for k in Values(Xi):
       Sk = {s ∈ S | xi = k}
       addChild(S, Sk)
       induceDecisionTree(Sk)
     return S;
25
An Illustrative Example (VI)
Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind
               Strong → No
               Weak   → Yes
26
Hypothesis Space in Decision Tree Induction
• Conduct a search of the space of decision trees which can
represent all possible discrete functions. (pros and cons)
• Goal: to find the best decision tree
– Best could be “smallest depth”
– Best could be “minimizing the expected number of tests”
• Finding a minimal decision tree consistent with a set of data is
NP-hard.
• Performs a greedy heuristic search: hill climbing without
backtracking
• Makes statistically based decisions using all data
27
History of Decision Tree Research
• Hunt and colleagues in Psychology used full search decision tree
methods to model human concept learning in the 60s
– Quinlan developed ID3, with the information gain heuristics in the late 70s to
learn expert systems from examples
– Breiman, Friedman, and colleagues in statistics developed CART (Classification
And Regression Trees) around the same time
• A variety of improvements in the 80s: coping with noise, continuous
attributes, missing data, non-axis-parallel splits, etc.
– Quinlan’s updated algorithm, C4.5 (1993) is commonly used (New: C5)
• Boosting (or Bagging) over DTs is a very good general purpose
algorithm
28
Overfitting Example
• Suppose we get a new example:
  Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, label: No
• This example does not exist in the training data; the learned tree
  (Outlook = Sunny, Humidity = Normal → Yes) classifies it incorrectly.
30
Overfitting - Example
• This can always be fixed by growing the tree further, e.g., splitting the Sunny/Normal leaf
  on Wind (Strong → No, Weak → Yes)
  – but doing so may fit noise or other coincidental regularities.
31
Our training data
[Figure: the training examples]
32
The instance space
[Figure: the full instance space]
33
Overfitting the Data
• Learning a tree that classifies the training data perfectly may not lead to the tree with the best
generalization performance.
– There may be noise in the training data that the tree is fitting
– The algorithm might be making decisions based on very little data
• A hypothesis h is said to overfit the training data if there is another hypothesis h’, such that h has a
smaller error than h’ on the training data but h has larger error on the test data than h’.
[Figure: accuracy vs. complexity of the tree; accuracy on the training data keeps increasing, while accuracy on the testing data peaks and then declines]
34
Reasons for overfitting
• Too much variance in the training data
– Training data is not a representative sample
of the instance space
– We split on features that are actually irrelevant
36
Avoiding Overfitting
How can this be avoided with linear classifiers?
• Two basic approaches
– Pre-pruning: Stop growing the tree at some point during construction when it is determined that there is
not enough data to make reliable choices.
– Post-pruning: Grow the full tree and then remove nodes that seem not to have sufficient evidence.
• Methods for evaluating subtrees to prune
– Cross-validation: Reserve hold-out set to evaluate utility
– Statistical testing: Test if the observed regularity can be dismissed as likely to occur by chance
– Minimum Description Length: Is the additional complexity of the hypothesis smaller than remembering
the exceptions?
• This is related to the notion of regularization that we will see in other contexts – keep the hypothesis
simple.
38
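A minimal sketch of one post-pruning strategy mentioned above (reduced-error pruning against a held-out set). It assumes the (attribute, {value: subtree}) tree representation from the ID3 sketch earlier; all function names are hypothetical, and this is an illustration rather than the lecture's prescribed procedure.

```python
from collections import Counter

def predict(tree, example, default='+'):
    """Walk the tree; `default` is used when a branch value was never seen."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children.get(example.get(attribute), default)
    return tree

def accuracy(tree, xs, ys):
    if not ys:                       # no held-out examples reach this node
        return 1.0
    return sum(predict(tree, x) == y for x, y in zip(xs, ys)) / len(ys)

def prune(tree, train_x, train_y, val_x, val_y):
    """Bottom-up: replace a subtree by a majority-label leaf whenever doing so
    does not hurt accuracy on the validation examples that reach this node."""
    if not isinstance(tree, tuple):
        return tree
    attribute, children = tree
    for v in list(children):
        ti = [i for i, x in enumerate(train_x) if x.get(attribute) == v]
        vi = [i for i, x in enumerate(val_x) if x.get(attribute) == v]
        children[v] = prune(children[v],
                            [train_x[i] for i in ti], [train_y[i] for i in ti],
                            [val_x[i] for i in vi], [val_y[i] for i in vi])
    # candidate replacement: majority training label at this node
    leaf = Counter(train_y).most_common(1)[0][0]
    if accuracy(leaf, val_x, val_y) >= accuracy(tree, val_x, val_y):
        return leaf                  # prune: the leaf does no worse on held-out data
    return tree
```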
The i.i.d. assumption
• Training and test items are independently and identically
distributed (i.i.d.):
– There is a distribution P(X, Y) from which the data D = {(x, y)} is generated.
• Sometimes it’s useful to rewrite P(X, Y) as P(X)P(Y|X)
Usually P(X, Y) is unknown to us (we just know it exists)
– Training and test data are samples drawn from the same P(X, Y): they are
identically distributed
– Each (x, y) is drawn independently from P(X, Y)
42
Overfitting
[Figure: accuracy vs. size of the tree; on training data accuracy keeps increasing, on test data it peaks and then falls]
Why this shape of curves?
(Size of the tree is our measure of model complexity here.)
44
Overfitting
[Figure: empirical (training) error vs. model complexity; empirical error decreases as the model grows more complex]
45
Overfitting
[Figure: expected error vs. model complexity; it decreases at first, then increases]
• Expected error:
What percentage of items drawn from P(x,y) do we expect to
be misclassified by f?
• (That’s what we really care about – generalization)
46
Variance of a learner (informally)
• Informally: how much does the learned hypothesis change when the training sample changes (for samples drawn from the same distribution)?
[Figure: variance grows with model complexity]
47
Bias of a learner (informally)
• Informally: the error that remains even with plenty of data, because the hypotheses the learner can produce are systematically off from the target.
[Figure: bias shrinks with model complexity]
48
Impact of bias and variance
[Figure: expected error vs. model complexity, decomposed into bias (decreasing with complexity) and variance (increasing with complexity)]
49
Model complexity
[Figure: the same bias/variance decomposition; expected error is smallest at an intermediate model complexity]
50
Underfitting and Overfitting
[Figure: expected error vs. model complexity; underfitting (high bias) at low complexity, overfitting (high variance) at high complexity, with the best model in between]
55
Continuous Attributes
• Example:
– Length (L): 10 15 21 28 32 40 50
– Class: - + + - + + -
– Check thresholds: L < 12.5; L < 24.5; L < 45
– Subset of Examples= {…}, Split= k+,j-
57
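A sketch of the standard recipe for a continuous attribute: sort the examples by the attribute's value, take midpoints where the class label changes as candidate thresholds, and score each candidate split by information gain. Applied to the Length example above it produces the thresholds listed on the slide plus one more (30.0); the code is illustrative, not the lecture's.

```python
import math
from collections import Counter

values = [10, 15, 21, 28, 32, 40, 50]
labels = ['-', '+', '+', '-', '+', '+', '-']

def entropy(ys):
    return -sum(c / len(ys) * math.log2(c / len(ys)) for c in Counter(ys).values())

pairs = sorted(zip(values, labels))
# candidate thresholds: midpoints between consecutive values with different labels
candidates = [(a + b) / 2 for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
print(candidates)    # [12.5, 24.5, 30.0, 45.0]

for t in candidates:
    left  = [y for v, y in pairs if v < t]
    right = [y for v, y in pairs if v >= t]
    expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
    print(f"L < {t}: gain = {entropy(labels) - expected:.3f}")
```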
Missing Values

$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
• Many times values are not available for all attributes during
training or testing (e.g., medical diagnosis)
59
Missing Values
• Example at evaluation time: Outlook = Sunny, Temp = Hot, Humidity = ???, Wind = Strong. What label does the tree assign?
• The missing Humidity value could be Normal or High; one simple option is to fill it in (e.g., with the most common value among the training examples at that node) and classify as usual.
63
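Another common way to handle a missing value at prediction time is to send the example down every branch, weighted by how often each value was seen in training, and average the resulting predictions. The sketch below assumes a tree representation in which each branch stores its training-set weight (children as {value: (weight, subtree)}); it illustrates the idea and is not the lecture's prescribed method.

```python
def predict_with_missing(tree, example):
    """Return P(label = '+') for `example`, which may lack some attribute values.
    A leaf is a label string; an internal node is (attribute, {value: (weight, subtree)}),
    where the weights over a node's branches sum to 1."""
    if not isinstance(tree, tuple):              # leaf
        return 1.0 if tree == '+' else 0.0
    attribute, children = tree
    if attribute in example:                     # value observed: follow that branch
        _, subtree = children[example[attribute]]
        return predict_with_missing(subtree, example)
    # value missing: weighted average over all branches
    return sum(w * predict_with_missing(sub, example) for w, sub in children.values())
```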
Experimental Machine Learning
• Machine Learning is an Experimental Field and we will spend some time
(in Problem sets) learning how to run experiments and evaluate results
– First hint: be organized; write scripts
• Basics:
– Split your data into three sets:
• Training data (often 70-90%)
• Test data (often 10-20%)
• Development data (10-20%)
• You need to report performance on test data, but you are not allowed to
look at it.
– You are allowed to look at the development data (and use it to tune parameters)
64
Metrics
Methodologies
Statistical Significance
Metrics
• We train on our training data Train = {(x_i, y_i)}, i = 1, …, m
• We test on Test data.
• We often set aside part of the training data as a development set, especially when
the algorithms require tuning.
– In the HW we asked you to present results also on the Training; why?
• When we deal with binary classification, we often measure performance simply using Accuracy:
  Accuracy = (# examples classified correctly) / (# examples) = (TP + TN) / (TP + TN + FP + FN)
67
Example
• 100 examples, 5% are positive.
• Imagine using the classifier to identify the positive cases (e.g., for information retrieval).

Confusion matrix (rows: actual class, columns: predicted class):

                 Predicted positive   Predicted negative
  Actual: Yes           TP                   FN
  Actual: No            FP                   TN

The notion of a confusion matrix can be usefully extended to the multiclass case, where cell (i, j) indicates how many of the i-labeled examples were predicted to be j.
69
Relevant Metrics
• It makes sense to consider Recall and Precision together, or to combine them
into a single metric:
  Precision = TP / (TP + FP),   Recall = TP / (TP + FN)
• Recall-Precision Curve: plots the trade-off between the two as the classifier's
decision threshold varies.
• F-Measure: a measure that combines precision and recall as their harmonic mean:
  F1 = 2 · Precision · Recall / (Precision + Recall)
71
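A quick sketch of these metrics computed from confusion-matrix counts; the counts below are made up to match the "100 examples, 5% positive" scenario above.

```python
# tp + fn = 5 actual positives; fp + tn = 95 actual negatives (illustrative counts)
tp, fp, fn, tn = 4, 6, 1, 89

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)     # of the predicted positives, how many are correct
recall    = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  "
      f"recall={recall:.2f}  F1={f1:.2f}")
```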
N-fold cross validation
• Instead of a single train/test split, partition the data into N folds: each fold in turn is held out as the test set while the model is trained on the remaining N - 1 folds, and the N results are averaged.
[Figure: the data divided into folds, with a different fold used for testing in each run]
72
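A minimal sketch of the procedure (no shuffling or stratification; train_and_evaluate is a hypothetical function that trains on one list of examples and returns accuracy on another):

```python
def n_fold_cross_validation(data, n_folds, train_and_evaluate):
    """Split `data` into `n_folds` contiguous folds and average the accuracies."""
    fold_size = len(data) // n_folds
    accuracies = []
    for i in range(n_folds):
        test  = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        accuracies.append(train_and_evaluate(train, test))
    return sum(accuracies) / len(accuracies), accuracies
```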
Evaluation: significance tests
• You have two different classifiers, A and B
• You train and test them on the same data set using N-fold
cross-validation
• For the n-th fold:
accuracy(A, n), accuracy(B, n)
pn = accuracy(A, n) - accuracy(B, n)
• Is the difference between A and B’s accuracies significant?
73
Hypothesis testing
• You want to show that a hypothesis H is true, based on your data.
  – In practice, you formulate the opposite claim as a null hypothesis H0 and ask whether the data allows you to reject it (next slide).
74
Rejecting H0
• H0 defines a distribution P(M |H0) over some statistic M
– (e.g. M= the difference in accuracy between A and B)
• Select a significance value S
– (e.g. 0.05, 0.01, etc.)
– You can only reject H0 if P(m |H0) ≤ S
• Compute the test statistic m from your data
– e.g. the average difference in accuracy over your N folds
• Compute P(m |H0)
• Refute H0 with p ≤ S if P(m |H0) ≤ S
75
Paired t-test
• Null hypothesis (H0; to be refuted):
– There is no difference between A and B, i.e. the expected accuracies of
A and B are the same
• That is, the expected difference (over all possible data sets)
between their accuracies is 0:
H0: E[pD] = 0
76
Paired t-test
• Null hypothesis H0: E[diff_D] = μ = 0
• m: our estimate of μ based on N samples of diff_D:

  $m = \frac{1}{N}\sum_{n=1}^{N} \mathit{diff}_n$

• The estimated variance S²:

  $S^2 = \frac{1}{N-1}\sum_{n=1}^{N} (\mathit{diff}_n - m)^2$

• Accept the null hypothesis at significance level a if the statistic

  $t = \frac{m}{\sqrt{S^2 / N}}$

  lies in $(-t_{a/2,\,N-1},\ +t_{a/2,\,N-1})$.
77
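A small sketch of this test on hypothetical per-fold accuracies for classifiers A and B (N = 10 folds); the numbers are invented for illustration.

```python
import math

acc_A = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.79]
acc_B = [0.78, 0.80, 0.81, 0.77, 0.80, 0.76, 0.79, 0.82, 0.78, 0.77]

diffs = [a - b for a, b in zip(acc_A, acc_B)]
N  = len(diffs)
m  = sum(diffs) / N                                  # mean difference
S2 = sum((d - m) ** 2 for d in diffs) / (N - 1)      # unbiased variance estimate
t  = m / math.sqrt(S2 / N)                           # paired t statistic

print(f"t = {t:.2f} with {N - 1} degrees of freedom")
# Compare |t| with the critical value t_{a/2, N-1} from a t-table
# (e.g. 2.262 for a = 0.05 and N - 1 = 9); if |t| exceeds it, reject H0.
```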
Decision Trees - Summary
• Hypothesis Space:
– Variable size (contains all functions)
– Deterministic; Discrete and Continuous attributes
• Search Algorithm
– ID3 - batch
– Extensions: missing values
• Issues:
– What is the goal?
– When to stop? How to guarantee good generalization?
• Did not address:
– How are we doing? (Correctness-wise, Complexity-wise)
78