Lecture Notes For Chapter 4 Introduction To Data Mining: by Tan, Steinbach, Kumar
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, the task is to find a model for the class attribute as a function of the values of the other attributes, so that previously unseen records can be assigned a class as accurately as possible. A test set is used to estimate the accuracy of the model.

Examples of Classification Task
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
Decision Tree based methods (the focus of these notes); other techniques include rule-based methods, memory-based reasoning, neural networks, naive Bayes and Bayesian belief networks, and support vector machines.
Example of a Decision Tree
Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

  Tid  Refund  Marital Status  Taxable Income  Cheat
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single           70K            No
   4   Yes     Married         120K            No
   5   No      Divorced         95K            Yes
   6   No      Married          60K            No
   7   Yes     Divorced        220K            No
   8   No      Single           85K            Yes
   9   No      Married          75K            No
  10   No      Single           90K            Yes
Model: a decision tree. The splitting attributes are Refund, MarSt, and TaxInc:
  Refund = Yes -> leaf NO
  Refund = No -> MarSt:
    Married -> leaf NO
    Single, Divorced -> TaxInc:
      < 80K -> leaf NO
      > 80K -> leaf YES
Training Data
Another Example of Decision Tree
Using the same training data, a different tree fits equally well:
  MarSt = Married -> leaf NO
  MarSt = Single, Divorced -> Refund:
    Yes -> leaf NO
    No -> TaxInc:
      < 80K -> leaf NO
      > 80K -> leaf YES
There can be more than one tree that fits the same data.
Decision Tree Classification Task
[Figure: a tree-induction algorithm learns a decision tree (the model) from the training set; the model is then applied to the test set to deduce class labels.]
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Starting at the root, the record is routed down the tree: Refund = No takes the right branch to MarSt; MarSt = Married reaches the leaf NO. Assign Cheat to "No".
Decision Tree Induction
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3 (Iterative Dichotomiser), C4.5 (Quinlan 1986, 1993); C5.0 / See5 is the commercial successor
- SLIQ, SPRINT (IBM, 1996)
General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t.
General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt is an empty set, then t is a leaf node labeled by the default class yd.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
Hunt's Algorithm (illustrated on the tax-cheat data)
Step 1: a single leaf predicting the majority class, Don't Cheat.
Step 2: split on Refund — Yes -> Don't Cheat; No -> Don't Cheat.
Step 3: under Refund = No, split on Marital Status — Married -> Don't Cheat; Single, Divorced -> Cheat.
Step 4: under Single, Divorced, split on Taxable Income — < 80K -> Don't Cheat; >= 80K -> Cheat.
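The recursive procedure above can be captured in a few lines. The sketch below is a minimal, illustrative implementation of Hunt's procedure for categorical attributes, not the authors' code; it uses the Gini index (introduced later) as the impurity measure for concreteness.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attributes, default=None):
    """Sketch of Hunt's recursive tree growing on categorical attributes.
    records: list of dicts (attribute name -> value); labels: parallel class labels."""
    if not records:                          # Dt is empty -> leaf labelled with default class
        return {"leaf": default}
    if len(set(labels)) == 1:                # all records in one class -> leaf
        return {"leaf": labels[0]}
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                       # no tests left -> majority-class leaf
        return {"leaf": majority}

    def weighted_gini(attr):                 # impurity of the split induced by attr
        n = len(labels)
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[attr], []).append(y)
        return sum(len(g) / n * gini(g) for g in groups.values())

    attr = min(attributes, key=weighted_gini)          # greedy choice of splitting attribute
    node = {"split": attr, "children": {}}
    for value in {r[attr] for r in records}:
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        node["children"][value] = hunt([r for r, _ in subset],
                                       [y for _, y in subset],
                                       [a for a in attributes if a != attr],
                                       majority)
    return node
```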
Tree Induction
Greedy strategy: a greedy search through the space of possible decision trees, splitting the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition, and how to determine the best split.
- Determine when to stop splitting.
How to Specify the Test Condition?
Depends on attribute types: nominal, ordinal, continuous.
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
Binary split: divide the values into two subsets, e.g. CarType -> {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.
Splitting Based on Ordinal Attributes
Multi-way split: Size -> {Small}, {Medium}, {Large}.
Binary split: e.g. Size -> {Small, Medium} vs {Large}, or {Small} vs {Medium, Large}. A grouping such as {Small, Large} vs {Medium} violates the order and is usually not allowed.
Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute (static: discretize once at the start; dynamic: at each node).
- Binary decision: (A < v) or (A >= v); consider all possible split points v and pick the best one (can be compute intensive).
How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.
Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error
How to Find the Best Split
Before splitting, the parent node has impurity M0. A candidate split on attribute A produces nodes N1 and N2 with impurities M1 and M2 (weighted combination M12); a split on B produces nodes N3 and N4 with impurities M3 and M4 (weighted combination M34).
Gain = M0 − M12 vs. M0 − M34: choose the attribute test with the larger gain, i.e. the lower weighted child impurity.
Measure of Impurity: GINI
Gini index for a node t:

  GINI(t) = 1 − Σ_j [ p(j | t) ]²

where p(j | t) is the relative frequency of class j at node t.

  C1 = 0, C2 = 6:  Gini = 0.000
  C1 = 1, C2 = 5:  Gini = 0.278
  C1 = 2, C2 = 4:  Gini = 0.444
  C1 = 3, C2 = 3:  Gini = 0.500
Examples for Computing GINI
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1  ->  Gini = 1 − 0² − 1² = 0.000
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6          ->  Gini = 1 − (1/6)² − (5/6)² = 0.278
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6          ->  Gini = 1 − (2/6)² − (4/6)² = 0.444
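A quick way to verify the Gini values above is a small helper like this sketch:

```python
def gini_index(counts):
    """Gini index of a node given per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Reproduces the example nodes above:
print(gini_index([0, 6]))   # 0.0
print(gini_index([1, 5]))   # 0.2777...
print(gini_index([2, 4]))   # 0.4444...
print(gini_index([3, 3]))   # 0.5
```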
Splitting Based on GINI
When a parent node is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i is the number of records at child i and n is the number of records at the parent node.
Binary Attributes: Computing the GINI Index
Splitting into two partitions; weighting the partitions favours larger and purer ones.
Parent: C1 = 6, C2 = 6, Gini = 0.500.
Split B? into Node N1 (C1 = 5, C2 = 2) and Node N2 (C1 = 1, C2 = 4):

  Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
  Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
Categorical Attributes: Computing the Gini Index
For each distinct value, gather the counts for each class, then use the count matrix to evaluate candidate splits.

Multi-way split (CarType):
             Family  Sports  Luxury
  C1            1       2       1
  C2            4       1       1
  Gini = 0.393

Two-way split ({Sports, Luxury} vs {Family}):
             {Sports, Luxury}  {Family}
  C1                3              1
  C2                2              4
  Gini = 0.400

Two-way split ({Sports} vs {Family, Luxury}):
             {Sports}  {Family, Luxury}
  C1             2              2
  C2             1              5
  Gini = 0.419
Continuous Attributes: Computing the Gini Index
For efficient computation: sort the attribute values, scan them linearly, updating the count matrix and computing the Gini index at each candidate position, then choose the split position with the smallest Gini.

  Cheat labels (sorted by Taxable Income): No, No, No, Yes, Yes, Yes, No, No, No, No
  Sorted Taxable Income values:            60, 70, 75, 85, 90, 95, 100, 120, 125, 220
  Candidate split positions:               55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230
  Gini at each split position:             0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420

The best split is Taxable Income <= 97 (Gini = 0.300).
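The sorted scan described above can be sketched as follows; this simplified illustration (not the book's implementation) returns the candidate threshold with the lowest weighted Gini, taking midpoints between distinct sorted values as candidates.

```python
def gini_from_counts(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_threshold(values, labels):
    """Single sorted pass over a continuous attribute.
    values: numeric attribute values; labels: parallel class labels."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    right = {c: 0 for c in classes}
    for _, y in pairs:
        right[y] += 1
    left = {c: 0 for c in classes}
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        v_prev, y_prev = pairs[i - 1]
        left[y_prev] += 1                        # move one record from right to left
        right[y_prev] -= 1
        if pairs[i][0] == v_prev:                # only split between distinct values
            continue
        threshold = (v_prev + pairs[i][0]) / 2   # candidate split = midpoint
        g = (i / n) * gini_from_counts(list(left.values())) + \
            ((n - i) / n) * gini_from_counts(list(right.values()))
        best = min(best, (g, threshold))
    return best                                  # (weighted Gini, threshold)

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(income, cheat))             # ~ (0.30, 97.5)
```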
Alternative Splitting Criterion Based on INFO (Entropy)
Entropy at a node t:

  Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)

where p(j | t) is the relative frequency of class j at node t. It measures the homogeneity of a node: maximum (log n_c) when records are equally distributed among all classes, minimum (0.0) when all records belong to one class.
Examples for Computing Entropy
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1  ->  Entropy = − 0 log 0 − 1 log 1 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6          ->  Entropy = − (1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6          ->  Entropy = − (2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
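The same check works for entropy; a minimal sketch using the base-2 logarithm, as in the examples above:

```python
from math import log2

def entropy(counts):
    """Entropy of a node given per-class record counts (log base 2)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # 0.650...
print(entropy([2, 4]))   # 0.918...
```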
Splitting Based on INFO: Information Gain

  GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) Entropy(i)

The parent node p is split into k partitions; n_i is the number of records in partition i. The gain measures the reduction in entropy achieved by the split; choose the split that maximizes GAIN (used in ID3 and C4.5). Disadvantage: it tends to prefer splits that result in a large number of partitions, each small but pure.
Splitting Based on INFO: Gain Ratio

  GainRATIO_split = GAIN_split / SplitINFO,   where  SplitINFO = − Σ_{i=1..k} (n_i / n) log₂(n_i / n)

The parent node is split into k partitions; n_i is the number of records in partition i. SplitINFO penalizes partitionings into many small partitions (higher entropy of the partitioning), adjusting the information gain accordingly.
Used in C4.5; designed to overcome the disadvantage of Information Gain.
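Putting the two formulas together, a sketch of information gain and gain ratio for a candidate split (child nodes given as lists of per-class counts; the function names are illustrative, not C4.5's actual code):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_and_gain_ratio(parent_counts, children_counts):
    """parent_counts: per-class counts at the parent node.
    children_counts: list of per-class counts, one entry per partition."""
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
    gain = entropy(parent_counts) - weighted
    split_info = -sum(sum(ch) / n * log2(sum(ch) / n) for ch in children_counts)
    return gain, (gain / split_info if split_info > 0 else 0.0)

# Two-way split of a 10-record node (7 of class C1, 3 of class C2):
print(gain_and_gain_ratio([7, 3], [[3, 0], [4, 3]]))
```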
Splitting Criterion Based on Classification Error
Classification error at a node t:

  Error(t) = 1 − max_i P(i | t)

Maximum (1 − 1/n_c) when records are equally distributed among all classes; minimum (0.0) when all records belong to one class.
Examples for Computing Error
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1  ->  Error = 1 − max(0, 1) = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6          ->  Error = 1 − max(1/6, 5/6) = 1/6
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6          ->  Error = 1 − max(2/6, 4/6) = 1/3
Misclassification Error vs Gini
Parent: C1 = 7, C2 = 3, Gini = 0.42.
Split A? into Node N1 (C1 = 3, C2 = 0) and Node N2 (C1 = 4, C2 = 3):

  Gini(N1) = 1 − (3/3)² − (0/3)² = 0
  Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
  Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, even though the misclassification error stays the same (3/10 both before and after the split).
Determine When to Stop Splitting
Stop expanding a node when all of its records belong to the same class, or when all records have similar attribute values; early-termination criteria can also be used (discussed with overfitting below).
Decision Tree Based Classification
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5
- Simple depth-first construction
- Uses Information Gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory
- Unsuitable for large data sets; needs out-of-core sorting
- You can download the software from http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
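For quick experimentation, the same ideas are available in standard libraries. A minimal sketch with scikit-learn (assuming scikit-learn is installed): this is not C4.5 itself but sklearn's CART-style learner with the entropy criterion, and the integer coding of the categorical attributes is a simplification since sklearn treats them as ordered numbers.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tax-cheat training data: Refund (1=Yes, 0=No), MarSt (0=Single, 1=Married, 2=Divorced), Income (K)
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)
print(export_text(clf, feature_names=["Refund", "MarSt", "TaxInc"]))
print(clf.predict([[0, 1, 80]]))   # the earlier test record: Refund=No, Married, 80K
```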
Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification
Overfitting Example
Circular points: 0.5 <= sqrt(x1² + x2²) <= 1
Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5
Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model becomes too complex, the training error keeps decreasing while the test error begins to increase.
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region: with insufficient training records in the region, the decision tree predicts the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- Need new ways of estimating errors.
Occam's Razor
- Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
- For complex models, there is a greater chance that the model was fitted accidentally by errors in the data.
- Therefore, model complexity should be taken into account when evaluating a model.
Minimum Description Length (MDL)
[Figure: records X1 … Xn with known class labels y on one side and the same records with unknown labels on the other; a compact model (a small decision tree) is preferred as the cheapest way to transmit the labels.]
How to Address Overfitting: Post-pruning
- Grow the decision tree to its entirety.
- Trim the nodes of the decision tree in a bottom-up fashion.
- If the generalization error improves after trimming, replace the sub-tree by a leaf node.
- The class label of the leaf node is determined from the majority class of instances in the sub-tree.
Example of Post-Pruning
Before splitting (node A with Class = Yes: 20, Class = No: 10):
  Training error = 10/30
  Pessimistic error = (10 + 0.5)/30 = 10.5/30
After splitting on A into four children A1 … A4:
  Training error = 9/30
  Pessimistic error = (9 + 4 × 0.5)/30 = 11/30
Since the pessimistic error increases (11/30 > 10.5/30), PRUNE the split and keep the single leaf.
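The pruning decision above is just an arithmetic comparison; a sketch, with the 0.5 penalty per leaf following the example:

```python
def pessimistic_error(errors, leaves, n, penalty=0.5):
    """Pessimistic error estimate: (training errors + penalty per leaf) / n."""
    return (errors + penalty * leaves) / n

before = pessimistic_error(errors=10, leaves=1, n=30)   # node kept as a single leaf
after  = pessimistic_error(errors=9,  leaves=4, n=30)   # after splitting into A1..A4
print(before, after)                                    # 0.35  0.3666...
print("PRUNE" if after >= before else "KEEP SPLIT")     # PRUNE
```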
Examples of Post-Pruning
For each case, compare the optimistic (training) error and the pessimistic error of the sub-tree against the parent node to decide whether to prune.
  Case 1: left child C0: 11, C1: 3; right child C0: 2, C1: 4.
  Case 2: left child C0: 14, C1: 3; right child C0: 2, C1: 2.
Optimistic error? Pessimistic error?
Handling Missing Attribute Values
Missing values affect decision tree construction in three ways: how impurity measures are computed, how an instance with missing values is distributed to child nodes, and how a test instance with missing values is classified.
Computing the Impurity Measure with a Missing Value
Class counts by Refund (one record has Refund = ?, Class = Yes):
  Refund = Yes:  Class = Yes: 0,  Class = No: 3
  Refund = No:   Class = Yes: 2,  Class = No: 4
  Refund = ?:    Class = Yes: 1,  Class = No: 0
Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813
Split on Refund (using only the 9 records with a known Refund value):
  Entropy(Refund = Yes) = 0
  Entropy(Refund = No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
  Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551
  Gain = 0.9 × (0.8813 − 0.551) = 0.3303
Distribute Instances
[Figure: the training record with Refund = ? is sent down both branches of the Refund node — to Refund = Yes with weight 3/9 and to Refund = No with weight 6/9, in proportion to the records with known Refund values.]
Classify Instances
New record: Refund = ?, Marital Status = Married, Cheat = ?
Class counts at the MarSt node (the missing-Refund training record contributes fractionally):

             Married   Single   Divorced   Total
  Class=No      3         1        0          4
  Class=Yes    6/9        1        1         2.67
  Total        3.67       2        1         6.67

Probability that Marital Status = Married is 3.67/6.67; probability that Marital Status = {Single, Divorced} is 3/6.67. Following the Married branch of the tree, the new record is assigned the label NO.
Other Issues
- Data Fragmentation
- Search Strategy
- Expressiveness
- Tree Replication
Data Fragmentation
- The number of instances gets smaller as you traverse down the tree.
- The number of instances at a leaf node could be too small to make any statistically significant decision.
Search Strategy
- Finding an optimal decision tree is NP-hard.
- The algorithms presented so far use a greedy, top-down, recursive partitioning strategy to induce a reasonable solution.
- Other strategies? Bottom-up, bi-directional.
Expressiveness
Decision Boundary
The border between two neighboring regions of different classes is known as the decision boundary. For the trees considered so far, the decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees
A test condition may involve multiple attributes, e.g. x + y < 1: Class = + on one side of the line, Class = − on the other. Such trees are more expressive, but finding good oblique test conditions is computationally expensive.
Tree Replication
[Figure: the same subtree (rooted at a test P) appears in multiple branches of the decision tree.]
Model Evaluation
- Metrics for Performance Evaluation: how to evaluate the performance of a model?
- Methods for Performance Evaluation: how to obtain reliable estimates?
- Methods for Model Comparison: how to compare the relative performance of competing models?
Metrics for Performance Evaluation: Confusion Matrix

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes     a (TP)      b (FN)
  CLASS    Class=No      c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

using the confusion-matrix counts above.
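As a sketch, computing the confusion-matrix cells and accuracy from predictions:

```python
def confusion_counts(y_true, y_pred, positive="Yes"):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

def accuracy(y_true, y_pred, positive="Yes"):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + tn + fp + fn)

y_true = ["Yes", "Yes", "No", "No", "No"]
y_pred = ["Yes", "No", "No", "No", "Yes"]
print(confusion_counts(y_true, y_pred))   # (1, 1, 1, 2)
print(accuracy(y_true, y_pred))           # 0.6
```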
Limitation of Accuracy
Consider a 2-class problem:
  Number of Class 0 examples = 9990
  Number of Class 1 examples = 10
If a model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect any Class 1 example.
Cost Matrix

                        PREDICTED CLASS
  C(i|j)                Class=Yes    Class=No
  ACTUAL   Class=Yes    C(Yes|Yes)   C(No|Yes)
  CLASS    Class=No     C(Yes|No)    C(No|No)

C(i|j): the cost of misclassifying a class j example as class i.
Computing the Cost of Classification

Cost matrix C(i|j):        PREDICTED CLASS
                             +       −
  ACTUAL        +           −1      100
  CLASS         −            1        0

Model M1:                  PREDICTED CLASS
                             +       −
  ACTUAL        +           150      40
  CLASS         −            60     250
  Accuracy = 80%, Cost = 3910

Model M2:                  PREDICTED CLASS
                             +       −
  ACTUAL        +           250      45
  CLASS         −             5     200
  Accuracy = 90%, Cost = 4255
Cost vs Accuracy

Count:                     PREDICTED CLASS
                           Class=Yes   Class=No
  ACTUAL   Class=Yes          a           b
  CLASS    Class=No           c           d
  N = a + b + c + d,   Accuracy = (a + d)/N

Cost:                      PREDICTED CLASS
                           Class=Yes   Class=No
  ACTUAL   Class=Yes          p           q
  CLASS    Class=No           q           p

Accuracy is proportional to cost if C(Yes|Yes) = C(No|No) = p and C(Yes|No) = C(No|Yes) = q:
  Cost = p (a + d) + q (b + c)
       = p (a + d) + q (N − a − d)
       = q N − (q − p)(a + d)
       = N [ q − (q − p) × Accuracy ]
Cost-Sensitive Measures

  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2 r p / (r + p) = 2a / (2a + b + c)

  Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
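A sketch of these measures in terms of the confusion-matrix cells a, b, c, d:

```python
def cost_sensitive_measures(a, b, c, d, w1=1, w2=1, w3=1, w4=1):
    """a=TP, b=FN, c=FP, d=TN; w1..w4 weight the four cells."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    weighted_acc = (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)
    return precision, recall, f_measure, weighted_acc

print(cost_sensitive_measures(a=150, b=40, c=60, d=250))
```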
Methods for Performance Evaluation
How to obtain a reliable estimate of performance? The performance of a model may depend on factors other than the learning algorithm: the class distribution, the cost of misclassification, and the size of the training and test sets.
Learning Curve
A learning curve shows how accuracy changes with varying sample size. It requires a sampling schedule for creating the curve:
- Arithmetic sampling (Langley et al.)
- Geometric sampling (Provost et al.)
Effect of small sample size: bias in the estimate and variance of the estimate.
Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation: partition the data into k disjoint subsets; k-fold: train on k−1 partitions, test on the remaining one; leave-one-out: k = n
- Stratified sampling: oversampling vs undersampling
- Bootstrap: sampling with replacement
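A minimal sketch of k-fold cross-validation on top of any fit/predict-style model; the `model_factory` argument is a hypothetical callable returning a fresh, untrained model (e.g. `lambda: DecisionTreeClassifier()`):

```python
import random

def k_fold_accuracy(X, y, model_factory, k=10, seed=0):
    """Estimate accuracy with k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]             # k disjoint partitions
    scores = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = model_factory()
        model.fit([X[j] for j in train], [y[j] for j in train])
        preds = model.predict([X[j] for j in test])
        correct = sum(p == y[j] for p, j in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / k                             # average over the k test folds
```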
Methods for Model Comparison: ROC (Receiver Operating Characteristic)
An ROC curve characterizes the trade-off between positive hits and false alarms by plotting the true positive rate (y-axis) against the false positive rate (x-axis).
ROC Curve
Example: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive.
At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88.
ROC Curve
(TP, FP):
- (0, 0): declare everything to be the negative class
- (1, 1): declare everything to be the positive class
- (1, 0): ideal
Diagonal line: random guessing. Below the diagonal line: the prediction is the opposite of the true class.
Using ROC for Model Comparison
Neither model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
Area under the ROC curve: ideal area = 1; random guess area = 0.5.
How to Construct an ROC Curve
- Use a classifier that produces a posterior probability P(+|A) for each test instance A.
- Sort the instances in decreasing order of P(+|A): 0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25 (each instance also has a known true class, + or −).
- Apply a threshold at each unique value of P(+|A); at each threshold count TP, FP, TN, FN and compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN).
- Plotting the (FPR, TPR) pairs as the threshold drops from 1.00 to 0.25 yields the ROC curve.
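A sketch of the construction: sweep the threshold over the sorted scores and collect the (FPR, TPR) pairs. The label list below is an illustrative placeholder, not the slide's actual labels.

```python
def roc_points(scores, labels, positive="+"):
    """Return (FPR, TPR) pairs obtained by thresholding at each unique score."""
    P = sum(l == positive for l in labels)
    N = len(labels) - P
    points = [(0.0, 0.0)]                      # threshold above the highest score
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    for i, (s, l) in enumerate(ranked):
        if l == positive:
            tp += 1
        else:
            fp += 1
        # emit a point only when the next score differs (tied scores share a threshold)
        if i == len(ranked) - 1 or ranked[i + 1][0] != s:
            points.append((fp / N, tp / P))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]   # illustrative labels
print(roc_points(scores, labels))
```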
Test of Significance
Given two models M1 and M2: can we say one is better than the other? How much confidence can we place on their measured accuracies? Can the difference in performance be explained as a result of random fluctuations in the test set?
Confidence Interval for Accuracy
Each prediction can be regarded as a Bernoulli trial (correct or wrong), so the number of correct predictions x follows a binomial distribution: x ~ Bin(N, p).
Example: toss a fair coin 50 times — how many heads would turn up? Expected number of heads = N × p = 50 × 0.5 = 25.
For large N, the observed accuracy acc = x/N is approximately normal with mean p and variance p(1 − p)/N. With confidence 1 − α (the area between the critical values Z_{α/2} and Z_{1−α/2}):

  P( Z_{α/2} ≤ (acc − p) / sqrt( p(1 − p)/N ) ≤ Z_{1−α/2} ) = 1 − α

Solving this inequality for p gives a confidence interval for the true accuracy.
Confidence Interval for Accuracy: Example
Critical values of the standard normal distribution:

  1 − α :  0.99   0.98   0.95   0.90
  Z     :  2.58   2.33   1.96   1.65

For a model with observed accuracy acc = 0.8, the 95% confidence interval for its true accuracy depends on the number of test instances N:

  N        :   50     100    500    1000   5000
  p(lower) :  0.670  0.711  0.763  0.774  0.789
  p(upper) :  0.888  0.866  0.833  0.824  0.811
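The interval endpoints in the table can be reproduced by inverting the normal approximation, i.e. solving the quadratic in p. A sketch, assuming that is how the table was computed:

```python
from math import sqrt

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, given observed acc on n test records
    (solves the normal-approximation inequality for p)."""
    z2 = z * z
    centre = 2 * n * acc + z2
    spread = z * sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (centre - spread) / denom, (centre + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.8, n)
    print(n, round(lo, 3), round(hi, 3))   # reproduces the p(lower)/p(upper) table above
```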
Comparing the Performance of Two Models
Given two models M1 and M2, tested on independent test sets D1 (size n1) and D2 (size n2) with observed error rates e1 and e2. If n1 and n2 are sufficiently large, the error rates are approximately normal:

  e1 ~ N(μ1, σ1),   e2 ~ N(μ2, σ2)

with the variances approximated by

  σ̂_i² = e_i (1 − e_i) / n_i
To test whether the performance difference d = e1 − e2 is statistically significant, note that d is approximately normal with variance (since D1 and D2 are independent)

  σ̂_d² = e1 (1 − e1)/n1 + e2 (1 − e2)/n2

At the (1 − α) confidence level, the true difference lies in the interval

  d_t = d ± Z_{α/2} σ̂_d
An Illustrative Example
Given: M1 with n1 = 30, e1 = 0.15; M2 with n2 = 5000, e2 = 0.25.
d = |e2 − e1| = 0.1 (2-sided test)

  σ̂_d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043

At the 95% confidence level, Z_{α/2} = 1.96:

  d_t = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128

The interval contains 0, so the difference may not be statistically significant.
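The same computation as a sketch:

```python
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference in error rates of two models
    tested on independent test sets."""
    d = abs(e1 - e2)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half = z * sqrt(var)
    return d - half, d + half

lo, hi = difference_interval(0.15, 30, 0.25, 5000)
print(round(lo, 3), round(hi, 3))                               # about -0.028 .. 0.228
print("significant" if lo > 0 else "not significant at 95%")    # interval contains 0
```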
Comparing the Performance of Two Algorithms
Each learning algorithm may produce k models, for example one per fold when the algorithms are trained and tested on the same k test sets. For each test set j, compute the difference in error rates d_j = e_{1j} − e_{2j}. The d_j have mean d_t and variance σ_t², estimated by

  σ̂_t² = Σ_{j=1..k} (d_j − d̄)² / ( k (k − 1) )

At the (1 − α) confidence level, using the t-distribution with k − 1 degrees of freedom:

  d_t = d̄ ± t_{1−α, k−1} σ̂_t
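For two algorithms compared over the same k folds, a sketch of the paired interval; the per-fold error rates are hypothetical, and the t critical value is passed in rather than looked up to keep the example dependency-free.

```python
from math import sqrt

def paired_difference_interval(errors1, errors2, t_crit):
    """errors1, errors2: per-fold error rates of two algorithms on the same k folds.
    t_crit: critical value of the t-distribution with k-1 degrees of freedom."""
    k = len(errors1)
    diffs = [a - b for a, b in zip(errors1, errors2)]
    d_bar = sum(diffs) / k
    var = sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1))
    half = t_crit * sqrt(var)
    return d_bar - half, d_bar + half

# Hypothetical per-fold error rates for two algorithms over k = 5 folds:
e1 = [0.20, 0.22, 0.18, 0.25, 0.21]
e2 = [0.24, 0.26, 0.23, 0.27, 0.25]
print(paired_difference_interval(e1, e2, t_crit=2.776))   # 95% interval, 4 degrees of freedom
```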