M6 Classification Alternative
Rule-Based Classifiers
Name         | Blood Type | Give Birth | Can Fly | Live in Water | Class
hawk         | warm       | no         | yes     | no            | ?
grizzly bear | warm       | yes        | no      | no            | ?
Coverage of a rule: fraction of records that satisfy the antecedent of the rule.
Accuracy of a rule: fraction of the covered records that also satisfy the consequent of the rule.
Example: (Status=Single) → No has Coverage = 40%, Accuracy = 50% (a small sketch follows).
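A minimal sketch of these two measures; the records below are illustrative, chosen only so that the rule (Status=Single) → No reproduces the quoted 40% and 50%:

```python
# Illustrative (marital status, class) records, not the original data set.
records = [("Single", "No"), ("Married", "No"), ("Single", "No"), ("Married", "No"),
           ("Divorced", "Yes"), ("Married", "No"), ("Divorced", "No"),
           ("Single", "Yes"), ("Married", "No"), ("Single", "Yes")]

# Records that satisfy the antecedent Status=Single
covered = [cls for status, cls in records if status == "Single"]

coverage = len(covered) / len(records)        # 4/10 = 0.40
accuracy = covered.count("No") / len(covered) # 2/4  = 0.50
print(coverage, accuracy)
```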
Name          | Blood Type | Give Birth | Can Fly | Live in Water | Class
lemur         | warm       | yes        | no      | no            | ?
turtle        | cold       | no         | no      | sometimes     | ?
dogfish shark | cold       | yes        | no      | yes           | ?
Exhaustive rules
– Classifier has exhaustive coverage if it
accounts for every possible combination of
attribute values
– Each record is covered by at least one rule
Characteristics of Rule Sets: Strategy 2
Name   | Blood Type | Give Birth | Can Fly | Live in Water | Class
turtle | cold       | no         | no      | sometimes     | ?
Rule Ordering Schemes
Rule-based ordering
– Individual rules are ranked based on their quality
Class-based ordering
– Rules that belong to the same class appear together
Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte’s 1R
Indirect Method:
◆ Extract rules from other classification models (e.g.,
decision trees, neural networks).
◆ Examples: C4.5rules
[Figure: rule growing (general-to-specific). Start from the empty rule {} (Yes: 3, No: 4), add the conjunct Refund=No, then evaluate further conjuncts: Status=Single (Yes: 2, No: 1), Status=Divorced (Yes: 1, No: 0), Status=Married (Yes: 0, No: 3), Income>80K (Yes: 3, No: 1). The grown rule is (Refund=No, Status=Single) → Class=Yes. In the specific-to-general direction, growing starts from specific rules such as (Refund=No, Status=Single, Income=85K) → Class=Yes or (Refund=No, Status=Single, Income=90K) → Class=Yes and drops conjuncts.]
– FOIL's information gain:
  Gain(R0, R1) = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]
  where p0, n0 are the positive and negative examples covered by rule R0, and p1, n1 those covered by the extended rule R1.
Growing a rule:
– Start from empty rule
– Add conjuncts as long as they improve FOIL’s
information gain
– Stop when rule no longer covers negative examples
– Prune the rule immediately using incremental reduced
error pruning
– Measure for pruning: v = (p-n)/(p+n)
◆ p: number of positive examples covered by the rule in
the validation set
◆ n: number of negative examples covered by the rule in
the validation set
– Pruning method: delete any final sequence of
conditions that maximizes v (a small sketch of the growing and pruning measures follows below)
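A minimal sketch of the two measures used above (FOIL's information gain for growing and v for pruning); the example counts are assumptions for illustration only:

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain when a rule covering p0 positives and n0
    negatives is extended to one covering p1 positives and n1 negatives."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def pruning_metric(p, n):
    """Incremental reduced error pruning measure v = (p - n) / (p + n),
    computed on the validation set."""
    return (p - n) / (p + n)

# Hypothetical counts: extending a rule that covered (3 pos, 4 neg)
# into one that covers (2 pos, 1 neg)
print(foil_gain(p0=3, n0=4, p1=2, n1=1))
print(pruning_metric(p=2, n=1))   # v on a hypothetical validation set
```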
Direct Method: RIPPER
[Figure: converting a decision tree (root P with subtrees Q and R) into a rule set.]

[Figure: decision tree for the vertebrate data. Give Birth? = Yes → Mammals; = No → Live In Water? = Yes → Fishes, = Sometimes → Amphibians, = No → Can Fly? = Yes → Birds, = No → Reptiles.]

C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
( ) → Mammals
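As a sketch of how an ordered rule set such as the RIPPER output above is applied (first matching rule fires; the empty rule is the default), with an illustrative test record:

```python
# Ordered rule list from the RIPPER output above: (conditions, class).
ripper_rules = [
    ({"Live in Water": "yes"}, "Fishes"),
    ({"Have Legs": "no"}, "Reptiles"),
    ({"Give Birth": "no", "Can Fly": "no", "Live in Water": "no"}, "Reptiles"),
    ({"Can Fly": "yes", "Give Birth": "no"}, "Birds"),
    ({}, "Mammals"),   # empty antecedent = default rule
]

def classify(record, rules):
    """Return the class of the first rule whose conditions all match."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return None

# Hypothetical test record (a hawk-like animal)
print(classify({"Give Birth": "no", "Can Fly": "yes",
                "Live in Water": "no", "Have Legs": "yes"}, ripper_rules))  # Birds
```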
Instance-Based Learning
Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
[Figure: nearest-neighbor classification; compute the distance between the test record and the training records and take the k closest.]
Example vector pairs: 111111111110 vs 011111111111, and 000000000001 vs 100000000000.
Both pairs differ in exactly two positions, so their Euclidean distances are equal, yet the first pair shares almost all of its 1s while the second pair shares none; the choice of proximity measure matters.
Nearest neighbor
classifiers are local
classifiers
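A minimal k-nearest-neighbor sketch over numeric attributes; the toy points and the value of k are illustrative, not taken from the slides:

```python
import math
from collections import Counter

def knn_predict(test_point, training_data, k=3):
    """Classify test_point by majority vote among its k nearest training points."""
    neighbors = sorted(
        training_data,
        key=lambda pt: math.dist(test_point, pt[0])  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: ((x1, x2), class)
train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((3.5, 3.0), "-")]
print(knn_predict((1.1, 1.1), train, k=3))   # "+"
```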
Bayesian Classifiers
by
Tan, Steinbach, Karpatne, Kumar
Bayes Classifier
• Approach:
– compute the posterior probability P(Y | X1, X2, ..., Xd) using Bayes' theorem:

  P(Y | X1 X2 ... Xd) = P(X1 X2 ... Xd | Y) P(Y) / P(X1 X2 ... Xd)
Estimating class-conditional probabilities:
– If one of the conditional probabilities is zero, the whole product vanishes, e.g. P(Refund=Yes | Yes) = 0.
– For a continuous attribute, a normal distribution can be assumed; for example, with sample mean 110 and sample variance 2975 for Income in class No:
  P(Income=120 | No) = (1 / (sqrt(2π) · 54.54)) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072
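A small sketch of this Gaussian estimate, using the mean and variance quoted above:

```python
import math

def gaussian_density(x, mean, variance):
    """Class-conditional density under a normal distribution assumption."""
    return (1.0 / math.sqrt(2 * math.pi * variance)) * \
           math.exp(-(x - mean) ** 2 / (2 * variance))

# P(Income=120 | No) with sample mean 110 and sample variance 2975
print(gaussian_density(120, mean=110, variance=2975))   # ~0.0072
```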
Naive Bayes example (vertebrate data):

Name          | Give Birth | Can Fly | Live in Water | Have Legs | Class
human         | yes        | no      | no            | yes       | mammals
python        | no         | no      | no            | no        | non-mammals
salmon        | no         | no      | yes           | no        | non-mammals
whale         | yes        | no      | yes           | no        | mammals
frog          | no         | no      | sometimes     | yes       | non-mammals
komodo        | no         | no      | no            | yes       | non-mammals
bat           | yes        | yes     | no            | yes       | mammals
pigeon        | no         | yes     | no            | yes       | non-mammals
cat           | yes        | no      | no            | yes       | mammals
leopard shark | yes        | no      | yes           | no        | non-mammals
turtle        | no         | no      | sometimes     | yes       | non-mammals
penguin       | no         | no      | sometimes     | yes       | non-mammals
porcupine     | yes        | no      | no            | yes       | mammals
eel           | no         | no      | yes           | no        | non-mammals
salamander    | no         | no      | sometimes     | yes       | non-mammals
gila monster  | no         | no      | no            | yes       | non-mammals
platypus      | no         | no      | no            | yes       | mammals
owl           | no         | yes     | no            | yes       | non-mammals
dolphin       | yes        | no      | yes           | no        | mammals
eagle         | no         | yes     | no            | yes       | non-mammals

A: attributes of the test record (Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no)
M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027
Since P(A | M) P(M) > P(A | N) P(N), the test record is classified as a mammal.
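A minimal naive Bayes sketch that reproduces these numbers from the table above (attribute values abbreviated, class labels M/N):

```python
# (give_birth, can_fly, live_in_water, have_legs, class) from the table above
data = [
    ("yes","no","no","yes","M"), ("no","no","no","no","N"), ("no","no","yes","no","N"),
    ("yes","no","yes","no","M"), ("no","no","sometimes","yes","N"), ("no","no","no","yes","N"),
    ("yes","yes","no","yes","M"), ("no","yes","no","yes","N"), ("yes","no","no","yes","M"),
    ("yes","no","yes","no","N"), ("no","no","sometimes","yes","N"), ("no","no","sometimes","yes","N"),
    ("yes","no","no","yes","M"), ("no","no","yes","no","N"), ("no","no","sometimes","yes","N"),
    ("no","no","no","yes","N"), ("no","no","no","yes","M"), ("no","yes","no","yes","N"),
    ("yes","no","yes","no","M"), ("no","yes","no","yes","N"),
]

def class_conditional(test, label):
    """Product of P(attribute value | class) estimated by frequency counts."""
    rows = [r for r in data if r[-1] == label]
    prob = 1.0
    for i, value in enumerate(test):
        prob *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prob

test = ("yes", "no", "yes", "no")   # Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no
for label in ("M", "N"):
    prior = sum(1 for r in data if r[-1] == label) / len(data)
    print(label, class_conditional(test, label) * prior)   # M: ~0.021, N: ~0.0027
```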
[Figure: a Bayesian network with nodes A, B, C, D, where D is a parent of C, A is a child of C, B is a descendant of D, and D is an ancestor of A.]
[Figure: naive Bayes as a Bayesian network; the class node Y is the parent of the attribute nodes X1, X2, X3, X4, ..., Xd.]
[Figure: Bayesian network example with nodes including Exercise, Diet, Blood Pressure, and Chest Pain.]
• Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
[Figures: a linearly separable two-class data set with candidate decision boundaries B1 and B2; B1's margin is bounded by hyperplanes b11 and b12, B2's by b21 and b22, and B1 has the larger margin.]
Decision boundary: w · x + b = 0
Margin hyperplanes: w · x + b = +1 and w · x + b = −1

  f(x) =  1  if w · x + b ≥ 1
         −1  if w · x + b ≤ −1

  Margin = 2 / ||w||
Linear SVM
• Linear model:
  f(x) =  1  if w · x + b ≥ 1
         −1  if w · x + b ≤ −1
Support vectors
x1     | x2     | y  | λ (Lagrange multiplier)
0.3858 | 0.4687 |  1 | 65.5261
0.4871 | 0.6110 | -1 | 65.5261
0.9218 | 0.4103 | -1 | 0
0.7382 | 0.8936 | -1 | 0
0.1763 | 0.0579 |  1 | 0
0.4057 | 0.3529 |  1 | 0
0.9355 | 0.8132 | -1 | 0
0.2146 | 0.0099 |  1 | 0
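A small sketch recovering the model from this table, using the standard relations w = Σ λ_i y_i x_i over the support vectors (the first two rows, with nonzero λ) and y_i (w · x_i + b) = 1 for the bias:

```python
import numpy as np

# (x1, x2), y, lambda for the two support vectors in the table above
X = np.array([[0.3858, 0.4687], [0.4871, 0.6110]])
y = np.array([1, -1])
lam = np.array([65.5261, 65.5261])

w = (lam * y) @ X          # w = sum_i lambda_i * y_i * x_i
b = y[0] - w @ X[0]        # from y_1 * (w . x_1 + b) = 1

print(w, b)                   # roughly w ≈ (-6.64, -9.32), b ≈ 7.93
print(2 / np.linalg.norm(w))  # margin = 2 / ||w||
```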
  y_i =  1  if w · x_i + b ≥ 1 − ξ_i
        −1  if w · x_i + b ≤ −1 + ξ_i
◆ If k is 1 or 2, this leads to similar objective function
as linear SVM but with different constraints (see
textbook)
Decision boundary (in the transformed space):
  w · Φ(x) + b = 0
Learning Nonlinear SVM
• Optimization problem:
• Issues:
– What type of mapping function should be
used?
– How to do the computation in high
dimensional space?
◆ Most computations involve the dot product Φ(x_i) · Φ(x_j)
◆ Curse of dimensionality?
• Kernel Trick:
– Φ(x_i) · Φ(x_j) = K(x_i, x_j)
– K(x_i, x_j) is a kernel function (expressed in
terms of the coordinates in the original space)
◆ Examples: polynomial kernel K(x, y) = (x · y + 1)^p, Gaussian (RBF) kernel K(x, y) = exp(−||x − y||² / (2σ²))
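A minimal illustration of the kernel trick using scikit-learn (assumed available) on a toy data set that is not linearly separable in the original space:

```python
import numpy as np
from sklearn.svm import SVC

# Toy "ring" data: the class depends on distance from the origin, so no
# linear boundary exists in the original 2-D space.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)

# RBF kernel K(x, y) = exp(-gamma * ||x - y||^2); the mapping phi(x) is
# never constructed explicitly.
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)
print(clf.score(X, y))   # near-perfect fit on this toy data
```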
• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• SVM can handle irrelevant and redundant attributes better than many
other techniques
• The user needs to provide the type of kernel function and cost function
• Difficult to handle missing values
Ensemble Techniques
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
[Figure: a decision stump that tests attribute x_k and predicts y_left if the test is true, y_right otherwise.]
Bagging Example
Bagging Round 1:
x: 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y: 1   1   1   1   -1  -1  -1  -1  1   1
Stump: x <= 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 2:
x: 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1   1   1
y: 1   1   1   -1  -1  -1  1   1   1   1
Stump: x <= 0.7 → y = 1; x > 0.7 → y = 1

Bagging Round 3:
x: 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y: 1   1   1   -1  -1  -1  -1  -1  1   1
Stump: x <= 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 4:
x: 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y: 1   1   1   -1  -1  -1  -1  -1  1   1
Stump: x <= 0.3 → y = 1; x > 0.3 → y = -1

Bagging Round 5:
x: 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1   1   1
y: 1   1   1   -1  -1  -1  -1  1   1   1
Stump: x <= 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 6:
x: 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y: 1   -1  -1  -1  -1  -1  -1  1   1   1
Stump: x <= 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 7:
x: 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y: 1   -1  -1  -1  -1  1   1   1   1   1
Stump: x <= 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 8:
x: 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y: 1   1   -1  -1  -1  -1  -1  1   1   1
Stump: x <= 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 9:
x: 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1   1
y: 1   1   -1  -1  -1  -1  -1  1   1   1
Stump: x <= 0.75 → y = -1; x > 0.75 → y = 1
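A minimal bagging sketch in the spirit of this example: bootstrap samples of the 1-D data, one decision stump per sample, and a majority vote. The 10 training points below are reconstructed from the rounds above; the bootstrap draws themselves are random, not the exact rounds shown:

```python
import random
from statistics import mode

# 10 training points (x, y) consistent with the bagging rounds above
data = [(0.1, 1), (0.2, 1), (0.3, 1), (0.4, -1), (0.5, -1),
        (0.6, -1), (0.7, -1), (0.8, 1), (0.9, 1), (1.0, 1)]

def fit_stump(sample):
    """Pick the split x <= t (and leaf labels) minimizing error on the sample."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for left, right in [(1, -1), (-1, 1)]:
            err = sum(1 for x, y in sample if (left if x <= t else right) != y)
            if best is None or err < best[0]:
                best = (err, t, left, right)
    return best[1:]                      # (threshold, left label, right label)

def bagging(data, n_rounds=10, seed=1):
    random.seed(seed)
    stumps = [fit_stump(random.choices(data, k=len(data))) for _ in range(n_rounds)]
    def predict(x):
        return mode(left if x <= t else right for t, left, right in stumps)
    return predict

predict = bagging(data)
print([predict(x) for x, _ in data])     # ensemble predictions for the 10 points
```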
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
Importance of a classifier:
  α_i = (1/2) · ln( (1 − ε_i) / ε_i )
where ε_i is the weighted training error of classifier i.
Weight update:
  w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(−α_j) if record i is classified correctly by C_j, and × exp(+α_j) if it is misclassified,
where Z_j is a normalization factor so that the updated weights sum to 1.
AdaBoost Example
Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1
Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1
Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
AdaBoost Example
Weights
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009
Classification
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted Class (sign of Sum): 1 1 1 -1 -1 -1 -1 1 1 1
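A short sketch reproducing this combination step: each round's stump votes with weight α, and the predicted class is the sign of the weighted sum (split points and α values taken from the summary table above):

```python
import math

# (split point, left class, right class, alpha) from the summary table above
rounds = [(0.75, -1,  1, 1.7380),
          (0.05,  1,  1, 2.7784),
          (0.30,  1, -1, 4.1195)]

def adaboost_predict(x):
    """Weighted vote of the boosted stumps; the predicted class is its sign."""
    s = sum(alpha * (left if x <= split else right)
            for split, left, right, alpha in rounds)
    return s, int(math.copysign(1, s))

for x in [0.1, 0.4, 0.8]:
    print(x, adaboost_predict(x))   # sums ~5.16, ~-3.08, ~0.40 as in the table
```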
Key Challenge:
– Evaluation measures such as accuracy are not well-suited for imbalanced classes
Confusion Matrix:
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes       a           b
              Class=No        c           d
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     a (TP)      b (FN)
              Class=No      c (FP)      d (TN)
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Problem with Accuracy
Consider a 2-class problem
– Number of Class NO examples = 990
– Number of Class YES examples = 10
If a model predicts everything to be class NO, accuracy is
990/1000 = 99 %
– This is misleading because this trivial model does not detect any class
YES example
– Detecting the rare class is usually more interesting (e.g., frauds,
intrusions, defects, etc)
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes       0          10
              Class=No        0          990
Which model is better?
Model A:
                          PREDICTED
                          Class=Yes   Class=No
ACTUAL        Class=Yes       0          10
              Class=No        0          990
Accuracy: 99%

Model B:
                          PREDICTED
                          Class=Yes   Class=No
ACTUAL        Class=Yes      10           0
              Class=No      500          490
Accuracy: 50%
Which model is better?
Model A:
                          PREDICTED
                          Class=Yes   Class=No
ACTUAL        Class=Yes       5           5
              Class=No        0          990

Model B:
                          PREDICTED
                          Class=Yes   Class=No
ACTUAL        Class=Yes      10           0
              Class=No      500          490
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes       a           b
              Class=No        c           d
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
Alternative Measures
Example 1:
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      10           0
              Class=No       10          980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) = 0.67
Accuracy = 990 / 1000 = 0.99

Example 2:
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes       1           9
              Class=No        0          990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) = 0.18
Accuracy = 991 / 1000 = 0.991
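A tiny helper, as a sketch, that reproduces these numbers from the four confusion-matrix counts (a = TP, b = FN, c = FP, d = TN):

```python
def summarize(a, b, c, d):
    """Precision, recall, F-measure, and accuracy from confusion-matrix counts
    a = TP, b = FN, c = FP, d = TN."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, f_measure, accuracy

print(summarize(10, 0, 10, 980))   # Example 1: (0.5, 1.0, ~0.67, 0.99)
print(summarize(1, 9, 0, 990))     # Example 2: (1.0, 0.1, ~0.18, 0.991)
```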
Which of these classifiers is better?
Model A:
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      40          10
              Class=No       10          40

Precision (p) = 0.8, Recall (r) = 0.8, F-measure (F) = 0.8, Accuracy = 0.8

Model B:
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      40          10
              Class=No     1000        4000

Precision (p) ≈ 0.04, Recall (r) = 0.8, F-measure (F) ≈ 0.08, Accuracy ≈ 0.8
                          PREDICTED CLASS
                          Yes     No
ACTUAL CLASS  Yes         TP      FN
              No          FP      TN
Model A:
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      10          40
              Class=No       10          40

Precision (p) = 0.5, TPR = Recall (r) = 0.2, FPR = 0.2, F-measure = 0.28

Model B:
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      25          25
              Class=No       25          25

Precision (p) = 0.5, TPR = Recall (r) = 0.5, FPR = 0.5, F-measure = 0.5
(TPR,FPR):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal
Diagonal line:
– Random guessing
– Below diagonal line:
◆ prediction is opposite
of the true class
[Figure: decision tree with splits such as x2 < 12.63, x1 < 7.24, x2 < 8.64, x1 < 13.29, x2 < 17.35, x1 < 12.11, x2 < 1.38, x1 < 6.56, x1 < 2.15, and x1 < 18.88, and estimated positive-class probabilities at the leaves (e.g., 0.059, 0.220, 0.071, 0.107, 0.727, 0.164, 0.143, 0.669, 0.271, 0.654, 0); thresholding these leaf probabilities at different values yields the points of an ROC curve.]
At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
How to Construct an ROC curve
Counts at successive score thresholds (5 positive and 5 negative instances):
TP   5    4    4    3    3    3    3    2    2    1    0
FP   5    5    4    4    3    2    1    1    0    0    0
TN   0    0    1    1    2    3    4    4    5    5    5
FN   0    1    1    2    2    2    2    3    3    4    5
TPR  1    0.8  0.8  0.6  0.6  0.6  0.6  0.4  0.4  0.2  0
FPR  1    1    0.8  0.8  0.6  0.4  0.2  0.2  0    0    0
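A minimal sketch of this construction: sort the instances by classifier score, lower the threshold one instance at a time, and record (FPR, TPR) at each step. The scores and labels below are illustrative, not the ones behind the table above:

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the threshold one instance at a time."""
    P = sum(1 for y in labels if y == 1)
    N = len(labels) - P
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

# Illustrative scores for 5 positive and 5 negative instances
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25, 0.10]
labels = [1, 1, -1, -1, 1, -1, 1, -1, 1, -1]
print(roc_points(scores, labels))
```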
ROC Curve:
– No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR