
INTRODUCTION TO DATA MINING


UNIT # 4

FALL 2020 - Sajjad Haider

ACKNOWLEDGEMENT

• Most of the slides in this presentation are taken from material provided by
  • Han and Kamber (Data Mining: Concepts and Techniques) and
  • Tan, Steinbach and Kumar (Introduction to Data Mining)


TODAY’S AGENDA

• Recap
• Handling Multi-State Variables
• Confusion Matrix and Accuracy Computation
• Recall and Precision
• Sensitivity and Specificity
• ROC Curve


CATEGORICAL ATTRIBUTES: COMPUTING GINI INDEX

• From a historical perspective, the Gini index always created a binary tree.
• As a result, in the case of multiple values, it merged them together to find the best binary split.
• For each distinct value, gather counts for each class in the dataset.

Multi-way split:

          Family   Sports   Luxury
   C1        1        2        1
   C2        4        1        1
   Gini = 0.393

Two-way split (find best partition of values):

          {Sports, Luxury}   {Family}
   C1            3              1
   C2            2              4
   Gini = 0.400

          {Sports}   {Family, Luxury}
   C1        2              2
   C2        1              5
   Gini = 0.419
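
The three Gini values above can be reproduced directly. Below is a minimal sketch in plain Python; the class counts are copied from the tables, nothing else is assumed:

def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(partitions):
    """Weighted Gini of a split; each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Counts are [C1, C2] per partition, taken from the tables above.
print(weighted_gini([[1, 4], [2, 1], [1, 1]]))  # multi-way: ~0.393
print(weighted_gini([[3, 2], [1, 4]]))          # {Sports,Luxury} vs {Family}: 0.400
print(weighted_gini([[2, 1], [2, 5]]))          # {Sports} vs {Family,Luxury}: ~0.419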


HANDLING OF MULTI-STATE VARIABLE

• The Gini index (and entropy) become biased toward variables having many states.
• To overcome this, the following approach was recommended (in C4.5 using entropy, but it can be generalized to the Gini index as well):

  Gain = SR(D) - SR_A(D)

  where SR = splitting rule metric
        D = class variable
        A = an attribute on which the splitting rule is conditioned

• Gain Ratio = Gain / SplitInfo


SPLITINFO

• Gini (class) = 0.46

• Gini_outlook (class) = 0.34 : Gain = 0.12
• Gini_temperature (class) = 0.44 : Gain = 0.02
• Gini_humidity (class) = 0.37 : Gain = 0.09
• Gini_windy (class) = 0.43 : Gain = 0.03

• SplitInfo is the splitting rule applied unconditionally to the attribute itself. If one is using Gini, then it becomes:
  • SplitInfo (outlook) = Gini (outlook) = 0.66
  • SplitInfo (temperature) = Gini (temperature) = 0.65
  • SplitInfo (humidity) = Gini (humidity) = 0.5
  • SplitInfo (windy) = Gini (windy) = 0.49


GAIN_RATIO

• To obtain the gain ratio, we divide gain by SplitInfo:

• Gain_ratio (outlook) = 0.12 / 0.66 = 0.18
• Gain_ratio (temperature) = 0.02 / 0.65 = 0.03
• Gain_ratio (humidity) = 0.09 / 0.5 = 0.18
• Gain_ratio (windy) = 0.03 / 0.49 = 0.06
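
The numbers on this and the two previous slides appear to come from the classic 14-row weather ("play tennis") dataset; assuming that, the following minimal sketch reproduces the gains and gain ratios:

from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain_ratio(rows, attr, target):
    """Gain and gain ratio for one attribute, using Gini as the splitting rule."""
    cond = sum(
        len(sub) / len(rows) * gini([r[target] for r in sub])
        for v in set(r[attr] for r in rows)
        for sub in [[r for r in rows if r[attr] == v]]
    )
    gain = gini([r[target] for r in rows]) - cond
    split_info = gini([r[attr] for r in rows])  # unconditional Gini of the attribute
    return gain, gain / split_info

# Assumed dataset: the classic 14-row weather data.
data = [
    ("sunny","hot","high",False,"no"), ("sunny","hot","high",True,"no"),
    ("overcast","hot","high",False,"yes"), ("rainy","mild","high",False,"yes"),
    ("rainy","cool","normal",False,"yes"), ("rainy","cool","normal",True,"no"),
    ("overcast","cool","normal",True,"yes"), ("sunny","mild","high",False,"no"),
    ("sunny","cool","normal",False,"yes"), ("rainy","mild","normal",False,"yes"),
    ("sunny","mild","normal",True,"yes"), ("overcast","mild","high",True,"yes"),
    ("overcast","hot","normal",False,"yes"), ("rainy","mild","high",True,"no"),
]
cols = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [dict(zip(cols, r)) for r in data]

for a in cols[:-1]:
    gain, ratio = gini_gain_ratio(rows, a, "play")
    print(f"{a}: gain={gain:.2f}, gain_ratio={ratio:.2f}")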


EXAMPLE
Attribute 1   Attribute 2   Attribute 3   Class
A             70            T             C1
A             90            T             C2
A             85            F             C2
A             95            F             C2
A             70            F             C1
B             90            T             C1
B             78            F             C1
B             65            T             C1
B             75            F             C1
C             80            T             C2
C             70            T             C2
C             80            F             C1
C             80            F             C1
C             96            F             C1


EXAMPLE II

Height   Hair    Eyes    Class
Short    Blond   Blue    +
Tall     Blond   Brown   -
Tall     Red     Blue    +
Short    Dark    Blue    -
Tall     Dark    Blue    -
Tall     Blond   Blue    +
Tall     Dark    Brown   -
Short    Blond   Brown   -


ACCURACY OR ERROR RATES

• Partition: training and testing
  • Use two independent data sets, e.g., a training set (2/3) and a test set (1/3)
  • Used for data sets with a large number of examples



METRICS FOR PERFORMANCE EVALUATION…

                     Predicted Label
                     Positive (+)           Negative (-)
True Label
Positive (+)         True Positive (TP)     False Negative (FN)
Negative (-)         False Positive (FP)    True Negative (TN)

• Most widely-used metric:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
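
As a concrete illustration, here is a minimal sketch in plain Python (the two label vectors are made up for the example) that tallies the four confusion-matrix cells and computes accuracy from them:

def confusion_counts(actual, predicted, positive=True):
    """Tally TP, FN, FP, TN for a binary problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
tp, fn, fp, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, fn, fp, tn, accuracy)  # 2 1 1 2 -> accuracy 4/6, about 0.67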

CLASS IMBALANCE PROBLEM

• A class imbalance problem occurs when one or more classes have very low proportions in the training data as compared to the other classes.
  • In online advertising, an advertisement presented to a viewer creates an impression. The click-through rate, the number of times an ad was clicked divided by the total number of impressions, tends to be very low.



LIMITATION OF ACCURACY

• Consider a 2-class problem:
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10

• If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%

• Accuracy is misleading because the model does not detect any Class 1 example
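
A quick sketch reproducing this trap with scikit-learn, using synthetic labels that mirror the slide's counts:

from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 9990 negatives, 10 positives.
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * 10000          # a "model" that always predicts Class 0

print(accuracy_score(y_true, y_pred))             # 0.999
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0 -- no Class 1 detected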


COST-SENSITIVE MEASURES

Precision (p) = TP / (TP + FP)

Recall (r) = TP / (TP + FN)

F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FN + FP)
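
A minimal sketch of these three measures as code (the counts in the example call are made-up illustrative numbers):

def precision_recall_f(tp, fn, fp):
    """Precision, recall and F-measure from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * r * p / (r + p)   # equivalently 2*tp / (2*tp + fn + fp)
    return p, r, f

# Example: 4 true positives, 2 false negatives, 3 false positives.
print(precision_recall_f(tp=4, fn=2, fp=3))  # (0.571..., 0.666..., 0.615...)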



RECALL AND PRECISION

Actual   Prediction
T        T
T        F
F        T
F        F
F        T
T        T
T        T
T        F
F        T
T        T

• Recall = 4 / 6
• Precision = 4 / 7
• F-Measure = 8 / 13
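
These values can be checked with scikit-learn; a minimal sketch with the table above encoded as 1 for T and 0 for F:

from sklearn.metrics import precision_score, recall_score, f1_score

# The Actual/Prediction table above, row by row.
y_true = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1]

print(recall_score(y_true, y_pred))     # 4/6, about 0.667
print(precision_score(y_true, y_pred))  # 4/7, about 0.571
print(f1_score(y_true, y_pred))         # 8/13, about 0.615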

TERMINOLOGY

• True Positive: the number of positive examples correctly predicted by the classification model.
• False Negative: the number of positive examples wrongly predicted as negative by the classification model.
• False Positive: the number of negative examples wrongly predicted as positive by the classification model.
• True Negative: the number of negative examples correctly predicted by the classification model.



TERMINOLOGY (CONT’D)

• The true positive rate (TPR) or sensitivity is defined as TPR = TP / (TP + FN).
• The true negative rate (TNR) or specificity is defined as TNR = TN / (TN + FP).
• The false positive rate (FPR) is defined as FPR = FP / (TN + FP).
• The false negative rate (FNR) is defined as FNR = FN / (TP + FN).
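
A small sketch of these four definitions as code (the counts passed in the example call are made-up illustrative numbers):

def rates(tp, fn, fp, tn):
    """Sensitivity, specificity, FPR and FNR from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # sensitivity
    tnr = tn / (tn + fp)   # specificity
    fpr = fp / (tn + fp)   # = 1 - specificity
    fnr = fn / (tp + fn)   # = 1 - sensitivity
    return tpr, tnr, fpr, fnr

print(rates(tp=4, fn=2, fp=3, tn=1))  # (0.667, 0.25, 0.75, 0.333)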


ROC (RECEIVER OPERATING CHARACTERISTIC)

• Developed in the 1950s for signal detection theory to analyze noisy signals
• Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)
• Remember that TPR represents sensitivity while FPR represents 1 - specificity



ROC CURVES

• Suppose sensitivity in a given scenario is poor (40%) while specificity is fairly high (92.9%).
• The values are calculated from classes that are determined with the default 50% probability threshold.
• Lowering the threshold to 30% results in a model with 60% sensitivity and 79.3% specificity.


ROC CURVE (CONT’D)

• The ROC curve is created by evaluating the class probabilities for the model across a continuum of thresholds.
• For each candidate threshold, the resulting true-positive rate (sensitivity) and false-positive rate (1 - specificity) are plotted against each other.
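
A sketch of this effect on hypothetical scores (the score distributions below are assumptions, chosen only to show the trade-off as the threshold moves from 50% to 30%):

import numpy as np

def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when predicting positive if score >= threshold."""
    pred = scores >= threshold
    pos, neg = (y_true == 1), (y_true == 0)
    sensitivity = (pred & pos).sum() / pos.sum()
    specificity = (~pred & neg).sum() / neg.sum()
    return sensitivity, specificity

rng = np.random.default_rng(0)
# Hypothetical scores: positives tend to score higher, with heavy overlap.
y_true = np.array([1] * 50 + [0] * 50)
scores = np.concatenate([rng.normal(0.55, 0.2, 50),
                         rng.normal(0.35, 0.2, 50)]).clip(0, 1)

for t in (0.5, 0.3):
    # Sensitivity rises and specificity drops as the threshold is lowered.
    print(t, sens_spec(y_true, scores, t))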



ROC CURVE (CONT’D)

• It is important to remember that altering the threshold only has the effect of making samples more positive (or negative, as the case may be).
• In the confusion matrix, it cannot move samples out of both off-diagonal cells at once: there is almost always a decrease in either sensitivity or specificity as the other is increased.


ROC CURVE (CONT’D)

• The optimal model should be shifted towards the upper-left corner of the plot.
• Alternatively, the model with the largest area under the ROC curve would be the most effective.
• The ROC curve is only defined for two-class problems but has been extended to handle three or more classes.



HOW TO CONSTRUCT AN ROC CURVE

Instance   P(+|A)   True Class
    1       0.95        +
    2       0.93        +
    3       0.87        -
    4       0.852       -
    5       0.851       -
    6       0.850       +
    7       0.76        -
    8       0.53        +
    9       0.43        -
   10       0.25        +

• Use a classifier that produces a posterior probability P(+|A) for each test instance
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP / (TP + FN)
• FP rate, FPR = FP / (FP + TN)

HOW TO CONSTRUCT AN ROC CURVE

Class            +     -     +     -     +     -     -     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.850 0.851 0.852 0.87  0.93  0.95  1.00
TP                5     4     4     3     3     2     2     2     2     1     0
FP                5     5     4     4     3     3     2     1     0     0     0
TN                0     0     1     1     2     2     3     4     5     5     5
FN                0     1     1     2     2     3     3     3     3     4     5
TPR               1   0.8   0.8   0.6   0.6   0.4   0.4   0.4   0.4   0.2     0
FPR               1     1   0.8   0.8   0.6   0.6   0.4   0.2     0     0     0

(Each class label sits under the threshold equal to its own P(+|A); at threshold t, an instance is predicted positive when P(+|A) >= t. The ROC curve is obtained by plotting each (FPR, TPR) pair.)
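
The table can be reproduced programmatically. A minimal sketch in plain Python, with the instance data copied from the previous slide:

# (P(+|A), true class) pairs from the instance table above.
instances = [(0.95, "+"), (0.93, "+"), (0.87, "-"), (0.852, "-"), (0.851, "-"),
             (0.850, "+"), (0.76, "-"), (0.53, "+"), (0.43, "-"), (0.25, "+")]

pos = sum(1 for _, c in instances if c == "+")
neg = len(instances) - pos

# One threshold per unique probability, plus 1.00 for the "predict nothing" end.
thresholds = sorted({p for p, _ in instances} | {1.0})
for t in thresholds:
    # Predict "+" whenever P(+|A) >= threshold t.
    tp = sum(1 for p, c in instances if p >= t and c == "+")
    fp = sum(1 for p, c in instances if p >= t and c == "-")
    print(f"t={t:5.3f}  TP={tp}  FP={fp}  TN={neg - fp}  FN={pos - tp}  "
          f"TPR={tp / pos:.1f}  FPR={fp / neg:.1f}")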

