DM Lect8
Classification
Lec 8
Mohammed
Taiz University
Outlines
• Define classification
• Decision Trees
• Evaluate the performance of a classifier
Introduction
• Given a collection of records (the training set)
• Each record contains a set of attributes; one of them is the class
What is classification?
• Classification is the task of learning a target function f that maps an attribute set x to one of the predefined class labels y
Example: a decision tree built from training data with two categorical attributes (Refund, Marital Status), one continuous attribute (Taxable Income), and a class label (Cheat):

Refund?
├─ Yes → NO
└─ No → MarSt?
   ├─ Single, Divorced → TaxInc?
   │  ├─ < 80K → NO
   │  └─ > 80K → YES
   └─ Married → NO

Each internal node tests an attribute, each branch is a test outcome, and each leaf carries a class label.
Training Data → Model: Decision Tree
Another Example of Decision Tree
The same training data can be fit by more than one tree. An alternative tree splits on MarSt first:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
   ├─ Yes → NO
   └─ No → TaxInc?
      ├─ < 80K → NO
      └─ > 80K → YES

This classifies the training records exactly as the earlier tree (which splits on Refund first): there could be more than one tree that fits the same data.
Apply Model to Test Data
Start from the root of the tree and follow the branch that matches each attribute test in the test record:

Refund?
├─ Yes → NO
└─ No → MarSt?
   ├─ Single, Divorced → TaxInc?
   │  ├─ < 80K → NO
   │  └─ > 80K → YES
   └─ Married → NO

For a test record with Refund = No and MarSt = Married, the path ends at the Married leaf: assign Cheat to “No”.
Classification Techniques
• Decision Tree based methods
• Rule-based methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Tree Induction
• Greedy strategy:
• Split the records based on an attribute test that optimizes a certain criterion
• Many algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
General Structure of Hunt’s Algorithm
● Let Dt be the set of training records that reach a node t
● General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
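The general procedure above can be sketched in Python (a minimal sketch with hypothetical names; here the attribute test is simply the next unused attribute, whereas real decision-tree learners pick the split that optimizes an impurity criterion such as Gini or entropy):

```python
from collections import Counter

def hunt(records, labels, attributes):
    # Dt contains records of a single class yt -> leaf labeled yt
    if len(set(labels)) == 1:
        return labels[0]
    # No attribute tests left -> label the leaf with the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise, use an attribute test to split Dt into smaller subsets
    # (for illustration only: the first remaining attribute)
    attr = attributes[0]
    node = {"test": attr, "branches": {}}
    for value in {r[attr] for r in records}:
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        node["branches"][value] = hunt(
            [r for r, _ in subset],   # records reaching the child node
            [y for _, y in subset],   # their class labels
            attributes[1:],           # remaining attribute tests
        )
    return node

# Tiny illustration: one attribute, two classes
tree = hunt(
    [{"Refund": "Yes"}, {"Refund": "Yes"}, {"Refund": "No"}],
    ["No", "No", "Yes"],
    ["Refund"],
)
```

On this toy data the sketch returns a one-level tree that tests Refund and sends Yes to the “No” leaf and No to the “Yes” leaf.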
Hunt’s Algorithm
Worked example on the training data, showing class counts (Cheat = No, Cheat = Yes) at each node:
– Root: (7, 3) — impure, so split on Refund
– Refund = Yes: (3, 0) — pure leaf; Refund = No: (4, 3) — still impure, so split on Marital Status
– MarSt = Married: (3, 0) — pure leaf; MarSt = Single, Divorced: (1, 3) — still impure, so split on Taxable Income
– TaxInc < 80K: (1, 0); TaxInc > 80K: (0, 3) — all leaves are now pure
Design Issues of Decision Tree Induction
● How should training records be split?
– Method for expressing the test condition, depending on attribute type
– Measure for evaluating the goodness of a test condition

Test Condition for Nominal Attributes
● Binary split:
– Divides the values into two subsets. For Marital Status, the possible groupings are:
{Married} vs {Single, Divorced}, or {Single} vs {Married, Divorced}, or {Single, Married} vs {Divorced}
Test Condition for Ordinal Attributes
● Multi-way split:
– Use as many partitions as distinct values. For Shirt Size: Small, Medium, Large, Extra Large
● Binary split:
– Divides the values into two subsets
– Must preserve the order property among attribute values, e.g. {Small, Medium} vs {Large, Extra Large}
– The grouping {Small, Large} vs {Medium, Extra Large} violates the order property
Test Condition for Continuous Attributes
● A binary decision (A < v or A ≥ v), as in the TaxInc < 80K test, or a multi-way split obtained by discretizing the values into ordinal ranges
How to determine the Best Split
● Prefer nodes with a homogeneous (pure) class distribution: a node with counts C0: 9, C1: 1 has low impurity, while a node with counts C0: 5, C1: 5 has high impurity
● Measures of node impurity:
– Gini index
– Entropy
– Misclassification error
Finding the Best Split
1. Compute the impurity measure (P) of the parent node before splitting
2. Compute the impurity measure (M) after splitting
● Compute the impurity measure of each child node
● M is the weighted impurity of the child nodes
3. Choose the split with the highest gain:
Gain = P − M
or, equivalently, the lowest impurity measure after splitting (M)
Finding the Best Split
● Before splitting, the parent node has impurity P
● Candidate splits A? and B? (each with Yes/No outcomes) produce weighted child impurities M1 and M2
● Compare Gain = P − M1 vs. P − M2 and pick the attribute with the larger gain
Measure of Impurity: GINI
● Gini index for a given node t:
GINI(t) = 1 − Σ_j [p(j | t)]²
where p(j | t) is the relative frequency of class j at node t
● Examples for a two-class node:
C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
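The Gini values in the table can be checked with a few lines of Python (a minimal sketch):

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, with p(j|t) = count_j / total
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Class counts (C1, C2) from the table above
values = [gini([0, 6]), gini([1, 5]), gini([2, 4]), gini([3, 3])]
```

Rounded to three decimals these give 0.000, 0.278, 0.444, and 0.500, matching the table.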
Computing Gini Index of a Single Node
Example: a split B? sends the 12 records of a parent node (7 of class C1, 5 of C2; Gini = 0.486) to two children: node N1 with counts (5, 1) and node N2 with counts (2, 4).

Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444

Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361

Gain = 0.486 − 0.361 = 0.125
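The whole gain computation can be reproduced in Python (a minimal sketch; `gini_gain` is a hypothetical helper name):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent, children):
    # Gain = Gini(parent) - weighted Gini of the child nodes,
    # each child weighted by its share of the parent's records
    total = sum(sum(child) for child in children)
    weighted = sum(sum(child) / total * gini(child) for child in children)
    return gini(parent) - weighted

# The example above: parent (7, 5) split into N1 = (5, 1) and N2 = (2, 4)
gain = gini_gain([7, 5], [[5, 1], [2, 4]])
```

Here `gain` evaluates to 0.125, as on the slide.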
Measure of Impurity: Entropy
● Entropy for a given node t:
Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)
where p(j | t) is the relative frequency of class j at node t
Computing Entropy of a Single Node
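As an illustration (a minimal sketch, reusing the (5, 1) node from the Gini example):

```python
import math

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) log2 p(j|t); empty classes contribute 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

node_entropy = entropy([5, 1])   # a mostly pure node
```

This evaluates to about 0.65. A pure node such as (6, 0) has entropy 0, and a perfectly mixed two-class node (3, 3) has entropy 1.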
Evaluation
Underfitting and Overfitting
• Underfitting: the model is too simple and makes inaccurate predictions even on the training data
• Can occur when the training data is small
• May need more training time
• Overfitting: an induced tree may overfit the training data
• Too many branches, some of which may reflect anomalies due to noise or outliers
• Poor accuracy for unseen samples
Model Evaluation
• Metrics for Performance Evaluation
• How to evaluate the performance of a model?
• Confusion matrix:

                         PREDICTED CLASS
                         Class=Yes    Class=No
ACTUAL    Class=Yes      a            b
CLASS     Class=No       c            d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
• With the confusion matrix cells a = TP, b = FN, c = FP, d = TN:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Example of confusion matrix

                  Predicted C1       Predicted C2
Actual C1         True positive      False negative
Actual C2         False positive     True negative
• Recall (completeness)
• What % of positive tuples did the classifier label as positive?
• Perfect score is 1.0

Precision (p) = a / (a + c) = TP / (TP + FP)
Recall (r) = a / (a + b) = TP / (TP + FN)
F-measure (F) = 2 / (1/r + 1/p) = 2rp / (r + p) = 2a / (2a + b + c) = 2TP / (2TP + FP + FN)
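All four metrics can be computed together in Python (a minimal sketch; the function name `evaluate` and the example counts are hypothetical):

```python
def evaluate(tp, fn, fp, tn):
    # Cells of the confusion matrix: a = TP, b = FN, c = FP, d = TN
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # (a + d) / (a + b + c + d)
    precision = tp / (tp + fp)                   # p = a / (a + c)
    recall = tp / (tp + fn)                      # r = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)  # 2rp / (r + p)
    return accuracy, precision, recall, f_measure

# Hypothetical counts: 70 TP, 30 FN, 10 FP, 90 TN
acc, p, r, f = evaluate(70, 30, 10, 90)
```

With these counts, accuracy = 160/200 = 0.8, precision = 70/80 = 0.875, recall = 70/100 = 0.7, and F = 140/180 ≈ 0.778.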