Introduction To Data Mining Unit 4
ACKNOWLEDGEMENT
Most of the slides in this presentation are taken from material provided by
Han and Kamber (Data Mining: Concepts and Techniques) and
Tan, Steinbach and Kumar (Introduction to Data Mining)
9/28/2020
TODAY’S AGENDA
Recap
Handling Multi-State Variables
Confusion Matrix and Accuracy Computation
Recall and Precision
Sensitivity and Specificity
ROC Curve
Gini Index (and Entropy) are biased toward variables having many states.
To overcome this, the following approach was recommended (in C4.5 using Entropy,
but it can be generalized to the Gini Index as well).
Gain = $SR(D) - SR_A(D)$
where
SR = splitting rule metric
D = class variable
A = an attribute on which the splitting rule is conditioned
SPLITINFO
GAIN_RATIO
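For reference, the standard C4.5 definitions (as given in Han and Kamber) can be written as below, on the assumption that the slides follow them. The notation is not from the slides: here $D$ is read as the set of training records, $D_j$ is the subset of records taking the $j$-th of the $v$ values of attribute $A$, and $\mathrm{Gain}(A)$ is the gain defined above with Entropy as the splitting rule metric.

```latex
\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\log_2\frac{|D_j|}{|D|}
\qquad
\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}
```

An attribute with many states has a large SplitInfo, so dividing by it penalizes the gain of such attributes.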
EXAMPLE
Attribute 1   Attribute 2   Attribute 3   Class
A             70            T             C1
A             90            T             C2
A             85            F             C2
A             95            F             C2
A             70            F             C1
B             90            T             C1
B             78            F             C1
B             65            T             C1
B             75            F             C1
C             80            T             C2
C             70            T             C2
C             80            F             C1
C             80            F             C1
C             96            F             C1
FALL 2020 Sajjad Haider
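The gain ratio of the two categorical attributes in this table can be checked with a short script. This is a minimal sketch that assumes Entropy as the splitting rule metric and the usual C4.5 split information (the entropy of the attribute's own value distribution); the function names are illustrative and not from the slides. Attribute 2 is numeric and would first need a split point, so it is left out.

```python
from collections import Counter
from math import log2

# Rows of the example table: (Attribute 1, Attribute 2, Attribute 3, Class)
rows = [
    ("A", 70, "T", "C1"), ("A", 90, "T", "C2"), ("A", 85, "F", "C2"),
    ("A", 95, "F", "C2"), ("A", 70, "F", "C1"), ("B", 90, "T", "C1"),
    ("B", 78, "F", "C1"), ("B", 65, "T", "C1"), ("B", 75, "F", "C1"),
    ("C", 80, "T", "C2"), ("C", 70, "T", "C2"), ("C", 80, "F", "C1"),
    ("C", 80, "F", "C1"), ("C", 96, "F", "C1"),
]

def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(rows, attr_index, class_index=3):
    """Return (gain, split_info, gain_ratio) for one categorical attribute."""
    n = len(rows)
    # Group the class labels by attribute value
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[class_index])
    # Entropy of the class conditioned on the attribute
    conditional = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy([row[class_index] for row in rows]) - conditional
    # Split information: entropy of the attribute's value distribution
    split_info = entropy([row[attr_index] for row in rows])
    # Assumption: an attribute with a single value gets a ratio of 0
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, split_info, ratio

gain1, split1, ratio1 = gain_ratio(rows, 0)  # Attribute 1
gain3, split3, ratio3 = gain_ratio(rows, 2)  # Attribute 3
print(f"Attribute 1: gain={gain1:.3f} split_info={split1:.3f} ratio={ratio1:.3f}")
print(f"Attribute 3: gain={gain3:.3f} split_info={split3:.3f} ratio={ratio3:.3f}")
```

Under these assumptions Attribute 1 has the higher gain (about 0.247 versus 0.048), but its three states also give it the larger split information (about 1.577 versus 0.985), which is what the ratio corrects for.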
EXAMPLE II
Partition: Training-and-testing
Use two independent data sets, e.g., a training set (2/3) and a test set (1/3).
Used for data sets with a large number of examples.
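A 2/3 to 1/3 partition can be sketched in a few lines. This is a minimal illustration that assumes a random shuffle before cutting; the function name `holdout_split`, the seed, and the 30 dummy records are hypothetical and not part of the slides.

```python
import random

def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and cut them into disjoint training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 30 dummy record ids stand in for a real data set
train, test = holdout_split(range(30))
print(len(train), len(test))
```

The model is fitted on `train` only, and `test` is kept aside so that the reported accuracy reflects records the model has not seen.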
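                                Predicted Label
                      Positive (+)           Negative (-)
True Label
  Positive (+)        True Positive (TP)     False Negative (FN)
  Negative (-)        False Positive (FP)    True Negative (TN)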
A class imbalance problem occurs when one or more classes have very
low proportions in the training data compared to the other classes.
In online advertising, an advertisement presented to a viewer creates an
impression. The click-through rate is the number of times an ad was clicked
divided by the total number of impressions, and it tends to be very low.
LIMITATION OF ACCURACY
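A small calculation shows why accuracy can mislead on imbalanced data. The numbers are hypothetical (10,000 impressions with 50 clicks, not taken from the slides), and the "classifier" is the trivial one that always predicts "no click".

```python
# Hypothetical click-through data: the positive class (click) is very rare
impressions = 10_000
clicks = 50

# A trivial model that predicts "no click" for every impression
tp, fn = 0, clicks                # every real click is missed
tn, fp = impressions - clicks, 0  # every non-click is trivially correct

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

The model scores 99.5% accuracy while finding none of the clicks, which is why measures that focus on the rare class are needed.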
COST-SENSITIVE MEASURES
Precision (p) = $\frac{TP}{TP + FP}$
Recall (r) = $\frac{TP}{TP + FN}$
F-measure (F) = $\frac{2rp}{r + p} = \frac{2\,TP}{2\,TP + FN + FP}$
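These three measures translate directly into code. The sketch below uses exact fractions so that the two forms of the F-measure can be compared without rounding; the counts in the usage lines are hypothetical, and the functions assume the denominators are non-zero.

```python
from fractions import Fraction

def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return Fraction(tp, tp + fp)

def recall(tp, fn):
    """Fraction of true positives that the model recovers."""
    return Fraction(tp, tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * r * p / (r + p)

# Hypothetical counts, chosen only to exercise the formulas
tp, fp, fn = 8, 2, 4
p, r, f = precision(tp, fp), recall(tp, fn), f_measure(tp, fp, fn)
print(p, r, f)

# The count-based form 2TP / (2TP + FN + FP) gives the same value
assert f == Fraction(2 * tp, 2 * tp + fn + fp)
```

None of the three measures uses TN, so a large number of easy negatives cannot inflate them.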
Actual Prediction
T T
T F
F T
F F
F T
T T
T T
T F
F T
T T
Recall = 4 / 6
Precision = 4 / 7
F-Measure = 8 / 13
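The values for this example can be verified by counting the four confusion-matrix cells from the ten Actual/Prediction pairs. The sketch below treats T as the positive class, which is an assumption consistent with the stated results; the variable names are illustrative.

```python
from fractions import Fraction

# The ten (Actual, Prediction) pairs from the example, in order
actual    = ["T", "T", "F", "F", "F", "T", "T", "T", "F", "T"]
predicted = ["T", "F", "T", "F", "T", "T", "T", "F", "T", "T"]

pairs = list(zip(actual, predicted))
tp = pairs.count(("T", "T"))  # actual positive, predicted positive
fn = pairs.count(("T", "F"))  # actual positive, predicted negative
fp = pairs.count(("F", "T"))  # actual negative, predicted positive
tn = pairs.count(("F", "F"))  # actual negative, predicted negative

recall = Fraction(tp, tp + fn)
precision = Fraction(tp, tp + fp)
f_measure = Fraction(2 * tp, 2 * tp + fn + fp)
accuracy = Fraction(tp + tn, len(pairs))

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"recall={recall} precision={precision} F={f_measure} accuracy={accuracy}")
```

Python prints fractions in lowest terms, so the recall of 4 / 6 appears as 2/3.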
TERMINOLOGY
TERMINOLOGY (CONT’D)
ROC CURVES
The ROC curve is created by evaluating the class probabilities for the
model across a continuum of thresholds.
For each candidate threshold, the resulting true-positive rate (sensitivity)
and the false-positive rate (1-specificity) are plotted against each other.
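The threshold sweep described above can be sketched as follows. The labels and scores are made up for illustration (1 marks the positive class), and each distinct score is used as a candidate threshold, plus one threshold above every score so that the curve ends at (0, 0).

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per candidate threshold, lowest threshold first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Each distinct score is a threshold; infinity yields the "predict nothing" point
    thresholds = sorted(set(scores)) + [float("inf")]
    points = []
    for t in thresholds:
        # A record is predicted positive when its score is >= the threshold
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical class labels and predicted probabilities of the positive class
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting TPR against FPR for these points gives the ROC curve, which always runs from (1, 1) at the lowest threshold to (0, 0) at the highest.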
The ROC curve of a better model lies closer to the upper-left corner of
the plot.
Alternatively, the model with the largest area under the ROC curve
would be the most effective.
The ROC curve is only defined for two-class problems but has been
extended to handle three or more classes.
• FP rate, FPR = FP/(FP + TN)

Class            +      -      +      -      -      -      +      -      +      +
Threshold P >=   0.25   0.43   0.53   0.76   0.850  0.851  0.852  0.87   0.93   0.95   1.00
TP               5      4      4      3      3      2      2      2      2      1      0
FP               5      5      4      4      3      3      2      1      0      0      0
TN               0      0      1      1      2      2      3      4      5      5      5
FN               0      1      1      2      2      3      3      3      3      4      5
TPR              1      0.8    0.8    0.6    0.6    0.4    0.4    0.4    0.4    0.2    0
ROC Curve
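The points of this ROC curve, and the area under it, can be derived from the TP, FP, TN and FN rows of the table. This is a minimal sketch that takes the four rows exactly as listed, uses TPR = TP/(TP + FN) alongside the FPR definition above, and assumes the trapezoidal rule for the area; the variable names are illustrative.

```python
from fractions import Fraction

# Count rows copied from the table, one entry per threshold (0.25 ... 1.00)
tp = [5, 4, 4, 3, 3, 2, 2, 2, 2, 1, 0]
fp = [5, 5, 4, 4, 3, 3, 2, 1, 0, 0, 0]
tn = [0, 0, 1, 1, 2, 2, 3, 4, 5, 5, 5]
fn = [0, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5]

tpr = [Fraction(t, t + f) for t, f in zip(tp, fn)]  # TP / (TP + FN)
fpr = [Fraction(f, f + t) for f, t in zip(fp, tn)]  # FP / (FP + TN)

# Order the (FPR, TPR) points from (0, 0) to (1, 1) before integrating
points = sorted(zip(fpr, tpr))

# Trapezoidal rule: sum the area under each segment of the curve
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))

print([float(x) for x in fpr])
print([float(y) for y in tpr])
print("AUC =", float(auc))
```

With these counts the area comes out at 0.52, only slightly above the 0.5 of a random classifier.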