Lec 16,17
[Figure: the classification framework. A model is induced from a training set of labelled records (Tid, Attrib1, Attrib2, Attrib3, Class) and then applied to a test set whose class labels are unknown.]

Apply Model to Test Data
[Figure: the decision tree with root Refund (Yes → leaf NO; No → MarSt), MarSt (Single, Divorced → TaxInc; Married → leaf NO) and TaxInc (< 80K → leaf NO; > 80K → leaf YES) is applied to the test record Refund = No, Marital Status = Married, Taxable Income = 80K. The record follows Refund = No and then MarSt = Married to the leaf NO, so Cheat is assigned "No".]
[Figure: a decision tree on Outlook (sunny / overcast / rain) with Play and No Play leaves.]
Underfitting: when the model is too simple, both training and test errors are large. The training error (TE) and generalization error (GE) are large when the size of the tree is very small.
Underfitting occurs because the model has not yet learned the true structure of the data; as a result it performs poorly on both the training and test sets.
Figure taken from text book (Tan, Steinbach, Kumar)
Overfitting the Data
Model M3: TE = 0%, GE = 30%. Find out why?
[Figure: decision tree M3 with root Body Temp (warm-blooded / cold-blooded), then Hibernates (Yes / No), then 4-legged (Yes / No), with leaves Mammals and Non-mammals.]
Humans, elephants, and dolphins are misclassified because the DT classifies all warm-blooded vertebrates that do not hibernate as non-mammals. The DT arrives at this decision because there is only one training record with such characteristics.
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region.
An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Figure taken from text book (Tan, Steinbach, Kumar)
How to Address Overfitting
- Pre-Pruning (Early Stopping Rule)
  - Stop the algorithm before it becomes a fully-grown tree
  - Typical stopping conditions for a node:
    - Stop if all instances belong to the same class
    - Stop if all the attribute values are the same
  - More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold
    - Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test; see the sketch below)
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
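As a concrete illustration of the χ² stopping condition, here is a minimal sketch (not from the slides) using scipy.stats.chi2_contingency; the contingency counts are made up for illustration, and the attribute/threshold choices are assumptions.

```python
# Sketch: chi-square test of independence as a pre-pruning stopping condition.
# Assumes scipy is installed; the contingency counts are illustrative only.
from scipy.stats import chi2_contingency

# Rows: candidate attribute values; columns: class counts (Yes, No) at the node.
contingency = [
    [20, 18],   # attribute value A
    [22, 20],   # attribute value B
]

chi2, p_value, dof, expected = chi2_contingency(contingency)

# If the class distribution is statistically independent of the attribute,
# stop expanding the node instead of splitting on this attribute.
ALPHA = 0.05
if p_value > ALPHA:
    print(f"p = {p_value:.3f} > {ALPHA}: stop expanding (class independent of attribute)")
else:
    print(f"p = {p_value:.3f} <= {ALPHA}: split on this attribute")
```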
How to Address Overfitting…
- Post-pruning
  - Grow the decision tree to its entirety
  - Trim the nodes of the decision tree in a bottom-up fashion
  - If the generalization error improves after trimming, replace the sub-tree by a leaf node
    - The class label of the leaf node is determined from the majority class of instances in the sub-tree
  - Can use MDL for post-pruning
Post-pruning
[Figure: subtree replacement. A subtree under an Alt node containing a Price test ($ / $$ / $$$ with leaves Yes / Yes / No) is replaced by a single leaf labelled Yes.]
Post-pruning
- Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent
[Figure: subtree raising. In a tree Alt → Res → (No / Price), the Price subtree ($ / $$ / $$$ with leaves Yes / Yes / No) is raised to replace its parent Res node.]
Overfitting: Example
Presence of Noise: Training Set
Name | Body Temperature | Gives Birth | 4-legged | Hibernates | Class Label (mammal)
Porcupine | Warm-blooded | Y | Y | Y | Y
… (remaining rows of the table omitted)
Hunt's Algorithm
Step 2: If Dt (the set of training records that reach node t) contains records that belong to more than one class, use an attribute test condition to split the data into smaller subsets. Recursively apply the procedure to each child node.
[Figure: Hunt's algorithm grows the tree on the Refund / Marital Status / Taxable Income data, step by step refining nodes into Cheat and Don't Cheat leaves.]
- Issues
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
How to determine the Best Split?
- Greedy approach:
  - Nodes with a homogeneous class distribution are preferred
- Need a measure of node impurity:
  - Example: a node with class distribution (C0: 5, C1: 5) is non-homogeneous and has a high degree of impurity, whereas a node with (C0: 9, C1: 1) is homogeneous and has a low degree of impurity
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
Measures of Node Impurity
- Based on the degree of impurity of the child nodes
- Less impurity → more skewed class distribution
  - A node with class distribution (1, 0) has zero impurity, whereas a node with class distribution (0.5, 0.5) has the highest impurity
- Gini Index
- Entropy
- Misclassification error
Measures of Node Impurity
- Gini Index: GINI(t) = 1 − Σ_j [p(j | t)]²
- Entropy: Entropy(t) = − Σ_j p(j | t) log p(j | t)
- Misclassification error: Error(t) = 1 − max_i p(i | t)
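To make the three impurity measures concrete, here is a minimal Python sketch (my own, not part of the slides) that computes them from the class counts at a node; the function names are assumptions and entropy is taken in base 2.

```python
# Sketch: node impurity measures computed from the class counts at a node.
import math

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0*log0 treated as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    """Error(t) = 1 - max_j p(j|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

# A (5, 5) node has maximum impurity; a (9, 1) node is nearly pure.
print(gini([5, 5]), entropy([5, 5]), misclassification_error([5, 5]))  # 0.5, 1.0, 0.5
print(gini([9, 1]), entropy([9, 1]), misclassification_error([9, 1]))  # 0.18, ~0.47, 0.1
```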
Comparison among Splitting Criteria
For a 2-class problem:
[Figure: Gini index, entropy, and misclassification error plotted as a function of the fraction of records in one class; all three are maximal at a (0.5, 0.5) distribution and zero for a pure node.]

When splitting a parent node with impurity M0, compare candidate tests A? and B?: the children of A? (Yes / No) have impurities M1 and M2, which combine (weighted by child size) into M12; the children of B? combine into M34.
Gain = M0 − M12 vs. M0 − M34: choose the test with the larger gain.
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
Measure of Impurity: GINI
- Gini Index for a given node t: GINI(t) = 1 − Σ_j [p(j | t)]²
Examples (two-class node counts and their Gini values):
  (C1 = 0, C2 = 6): Gini = 0.000
  (C1 = 1, C2 = 5): Gini = 0.278
  (C1 = 2, C2 = 4): Gini = 0.444
  (C1 = 3, C2 = 3): Gini = 0.500
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
Examples for computing GINI
GINI(t) = 1 − Σ_j [p(j | t)]²
[Table: Gini index evaluated at the candidate split positions of a sorted continuous attribute; the class counts on either side of each position give Gini values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420, and the position with the minimum Gini (0.300) is chosen as the split point.]
Gain Ratio:
GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = − Σ_{i=1..k} (n_i / n) log(n_i / n)
(the split produces k partitions, n_i is the number of records in partition i, and n is the number of records at the parent node)
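The sketch below (my own, not from the slides) computes SplitINFO and the gain ratio for a candidate k-way split from the partition sizes and an already-computed information gain; the numeric values are illustrative only.

```python
# Sketch: GainRATIO = GAIN_split / SplitINFO,
# where SplitINFO = -sum_i (n_i/n) * log2(n_i/n).
import math

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum((ni / n) * math.log2(ni / n) for ni in partition_sizes if ni > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# Illustrative values: a 3-way split of 14 records into partitions of 5, 4 and 5,
# with an (assumed) information gain of 0.246.
print(round(split_info([5, 4, 5]), 3))         # ~1.577
print(round(gain_ratio(0.246, [5, 4, 5]), 3))  # ~0.156
```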
Computing the Gini index of a split: node N1 has class counts (C1 = 3, C2 = 0) and node N2 has (C1 = 4, C2 = 3).
Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves: the parent node has counts (C1 = 7, C2 = 3) and Gini = 1 − 0.7² − 0.3² = 0.42 > 0.342.
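The arithmetic above can be checked with a short sketch (mine, not from the slides) that computes the weighted Gini of the two children N1 = (3, 0) and N2 = (4, 3):

```python
# Sketch: weighted Gini of children for the split N1=(C1=3, C2=0), N2=(C1=4, C2=3).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

n1, n2 = [3, 0], [4, 3]
n_total = sum(n1) + sum(n2)  # 10 records at the parent
weighted = (sum(n1) / n_total) * gini(n1) + (sum(n2) / n_total) * gini(n2)

print(gini(n1))                 # 0.0
print(round(gini(n2), 3))       # 0.49 (1 - (4/7)^2 - (3/7)^2 = 24/49)
print(round(weighted, 3))       # 0.343 (the slide's 0.342 rounds Gini(N2) to 0.489 first)
print(round(gini([7, 3]), 3))   # 0.42  (parent Gini, so the split improves it)
```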
Decision Tree Based Classification
- Advantages:
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5
- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data to fit in memory
- Unsuitable for large datasets
  - Needs out-of-core sorting
[Figure: decision boundary of a decision tree on a two-dimensional data set (attributes x and y on [0, 1]) with tests such as x < 0.43 and y < 0.33; each rectangular region is labelled with the class counts of the records it contains.]
- The borderline between two neighbouring regions of different classes is known as the decision boundary
- The decision boundary is parallel to the axes because each test condition involves a single attribute at a time (see the sketch below)
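To see that every internal node tests one attribute against a threshold, and hence induces axis-parallel boundaries, here is a hedged sketch using scikit-learn (assumed available; the data is synthetic and the thresholds 0.43 / 0.33 merely echo the figure):

```python
# Sketch: each test in an ordinary decision tree compares a single attribute
# to a threshold, so the induced decision boundaries are axis-parallel.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.random((200, 2))                                   # two attributes: x and y in [0, 1)
y = ((X[:, 0] < 0.43) ^ (X[:, 1] < 0.33)).astype(int)      # checkerboard-like labelling

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["x", "y"]))
# Every printed rule has the form "x <= t" or "y <= t": one attribute at a time.
```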
Oblique Decision Trees
- The test condition may involve more than one attribute (e.g., x + y < 1)
[Figure: a two-dimensional data set separated by the oblique boundary x + y = 1; one region is Class = + and the other Class = −.]

Tree Replication
[Figure: a tree with tests P, Q, R, S in which the same subtree rooted at Q is replicated in more than one branch.]
- The same subtree appears in multiple branches
- Is the split using P redundant? Remove P in post-pruning
Metrics for Performance Evaluation
- Focus on the predictive capability of a model
  - Rather than how long it takes to classify or build models, scalability, etc.
- Confusion Matrix:

                         PREDICTED CLASS
                         Class=Yes    Class=No
  ACTUAL   Class=Yes     a (TP)       b (FN)
  CLASS    Class=No      c (FP)       d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Weighted accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
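A small sketch (not from the slides) that evaluates a model from the confusion-matrix cells a (TP), b (FN), c (FP), d (TN), using the weighted-accuracy form above; the counts and weights are illustrative assumptions.

```python
# Sketch: accuracy and weighted accuracy from confusion-matrix cells.
# a = TP, b = FN, c = FP, d = TN; counts and weights below are illustrative.
a, b, c, d = 50, 10, 5, 100

accuracy = (a + d) / (a + b + c + d)

# Weighted accuracy: w1..w4 weight the four cells (here FN errors count double).
w1, w2, w3, w4 = 1, 2, 1, 1
weighted_accuracy = (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

print(f"accuracy          = {accuracy:.3f}")           # 0.909
print(f"weighted accuracy = {weighted_accuracy:.3f}")  # 0.857
```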
Handling Overfitting in DTs
- Pre-pruning (Early Stopping Rule)
  - Stop the tree-growing algorithm before generating a fully grown tree that perfectly fits the entire training data
- Post-pruning
  - Grow the tree to its maximum size and then prune it in a bottom-up fashion
Handling Overfitting in DTs
Pre-pruning (Early Stopping Rule)
- How can it be done?
  - Stop expanding a leaf node when it becomes "sufficiently" pure
  - OR when the improvement in the GE falls below a threshold
- Advantage: avoids generating overly complex subtrees that overfit the training data
- Issue: it is difficult to choose the right threshold for early termination (see the scikit-learn sketch below)
  - Too high a threshold gives underfitted models
  - Too low a threshold may not be sufficient to overcome overfitting
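For reference, here is a hedged sketch of how such early-stopping thresholds look in scikit-learn (assumed available; not part of the slides): min_samples_split, min_samples_leaf, and min_impurity_decrease play the role of the user-specified thresholds discussed above, and the dataset is just a built-in example.

```python
# Sketch: pre-pruning via early-stopping hyperparameters in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pre_pruned = DecisionTreeClassifier(
    min_samples_split=20,        # do not split nodes with fewer than 20 records
    min_samples_leaf=10,         # each leaf must keep at least 10 records
    min_impurity_decrease=0.01,  # stop if the split barely improves impurity
    random_state=0,
).fit(X_tr, y_tr)

print("depth:", pre_pruned.get_depth(), "leaves:", pre_pruned.get_n_leaves())
print("test accuracy:", pre_pruned.score(X_te, y_te))
```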
Handling Overfitting in DTs
Post-pruning
- Grow the tree fully
- Prune the tree in a bottom-up fashion (an analogous sketch follows this list)
  - Replace a subtree with a new leaf node whose class label is determined from the majority class of the records affiliated with the subtree
  - OR replace the subtree with its most frequently used branch
- Stop pruning when no further improvement is observed
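Post-pruning can be approximated in scikit-learn with minimal cost-complexity pruning; this is a different pruning criterion than the GE/MDL-based procedure above, so treat the following as an analogous sketch rather than the slides' exact method (dataset and split are illustrative):

```python
# Sketch: grow a full tree, then post-prune it with cost-complexity pruning.
# (Analogous to, not identical to, pruning by generalization error or MDL.)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
alphas = full_tree.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

# Keep the pruned tree that generalizes best on held-out validation data.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr) for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("full tree leaves:", full_tree.get_n_leaves(), "pruned leaves:", best.get_n_leaves())
print("validation accuracy:", best.score(X_val, y_val))
```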
Handling Overfitting in DTs
Pre-pruning vs. Post-pruning
- Post-pruning tends to give better results than pre-pruning because it makes pruning decisions based on a fully grown tree
- Pre-pruning can suffer from premature termination of the tree-growing process
- Post-pruning wastes the additional computation needed to grow the tree fully when the tree is subsequently pruned
Decision Tree Example
Age Income Student Credit_rating Class:Buys_comp
Youth HIGH N FAIR N
Youth HIGH N EXCELLENT N
Middle_aged HIGH N FAIR Y
Senior MEDIUM N FAIR Y
Senior LOW Y FAIR Y
Senior LOW Y EXCELLENT N
Middle_aged LOW Y EXCELLENT Y
Youth MEDIUM N FAIR N
Youth LOW Y FAIR Y
Senior MEDIUM Y FAIR Y
Youth MEDIUM Y EXCELLENT Y
Middle_aged MEDIUM N EXCELLENT Y
Middle_aged HIGH Y FAIR Y
Senior MEDIUM N EXCELLENT N
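As a worked check on the table above, the sketch below (my own code, not from the slides) computes the entropy of Buys_comp and the information gain of splitting on Age; with 9 Y and 5 N records the class entropy comes out to about 0.940 bits and the gain for Age to roughly 0.246 bits.

```python
# Sketch: entropy and information gain for the Age attribute of the table above.
import math
from collections import Counter

# (Age, Class) pairs taken from the 14 rows of the example table.
records = [
    ("Youth", "N"), ("Youth", "N"), ("Middle_aged", "Y"), ("Senior", "Y"),
    ("Senior", "Y"), ("Senior", "N"), ("Middle_aged", "Y"), ("Youth", "N"),
    ("Youth", "Y"), ("Senior", "Y"), ("Youth", "Y"), ("Middle_aged", "Y"),
    ("Middle_aged", "Y"), ("Senior", "N"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

labels = [cls for _, cls in records]
parent = entropy(labels)  # entropy of the class distribution (9 Y, 5 N)

# Weighted entropy of the children produced by splitting on Age.
children = 0.0
for age in {a for a, _ in records}:
    subset = [cls for a, cls in records if a == age]
    children += (len(subset) / len(records)) * entropy(subset)

print(f"Entropy(Buys_comp) = {parent:.3f}")             # ~0.940
print(f"Gain(Age)          = {parent - children:.3f}")  # ~0.246
```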