
INTRODUCTION TO DATA MINING


UNIT # 4

FALL 2020 - Sajjad Haider

ACKNOWLEDGEMENT

• Most of the slides in this presentation are taken from material provided by
  • Han and Kamber (Data Mining: Concepts and Techniques) and
  • Tan, Steinbach and Kumar (Introduction to Data Mining)


TODAY’S AGENDA

• Recap
• Handling Multi-State Variables
• Confusion Matrix and Accuracy Computation
• Recall and Precision
• Sensitivity and Specificity
• ROC Curve


CATEGORICAL ATTRIBUTES: COMPUTING GINI INDEX

• From a historical perspective, the Gini index always created a binary tree.
• As a result, in the case of multiple values, it merged them together to find the best binary split.
• For each distinct value, gather counts for each class in the dataset.

Multi-way split:

          Family   Sports   Luxury
   C1        1        2        1
   C2        4        1        1
   Gini = 0.393

Two-way split (find best partition of values):

          {Sports, Luxury}   {Family}
   C1            3              1
   C2            2              4
   Gini = 0.400

          {Sports}   {Family, Luxury}
   C1        2              2
   C2        1              5
   Gini = 0.419
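
The three Gini values above can be reproduced directly. Below is a minimal sketch in plain Python; the class counts are copied from the tables, nothing else is assumed:

def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(partitions):
    """Weighted Gini of a split; each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Counts are [C1, C2] per partition, taken from the tables above.
print(weighted_gini([[1, 4], [2, 1], [1, 1]]))  # multi-way: ~0.393
print(weighted_gini([[3, 2], [1, 4]]))          # {Sports,Luxury} vs {Family}: 0.400
print(weighted_gini([[2, 1], [2, 5]]))          # {Sports} vs {Family,Luxury}: ~0.419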


HANDLING OF MULTI-STATE VARIABLE

• The Gini index (and entropy) become biased toward variables having many states.
• To overcome this, the following approach was recommended (in C4.5 using entropy, but it can be generalized to the Gini index as well):

  Gain = SR(D) - SR_A(D)

  where SR = splitting rule metric
        D = class variable
        A = an attribute on which the splitting rule is conditioned

• Gain Ratio = Gain / SplitInfo


SPLITINFO

• Gini (class) = 0.46

• Gini_outlook (class) = 0.34 : Gain = 0.12
• Gini_temperature (class) = 0.44 : Gain = 0.02
• Gini_humidity (class) = 0.37 : Gain = 0.09
• Gini_windy (class) = 0.43 : Gain = 0.03

• SplitInfo is the splitting rule applied unconditionally to the attribute itself. If one is using Gini, then it becomes:
  • SplitInfo (outlook) = Gini (outlook) = 0.66
  • SplitInfo (temperature) = Gini (temperature) = 0.65
  • SplitInfo (humidity) = Gini (humidity) = 0.5
  • SplitInfo (windy) = Gini (windy) = 0.49


GAIN_RATIO

• To obtain the gain ratio, we divide gain by SplitInfo:

• Gain_ratio (outlook) = 0.12 / 0.66 = 0.18
• Gain_ratio (temperature) = 0.02 / 0.65 = 0.03
• Gain_ratio (humidity) = 0.09 / 0.5 = 0.18
• Gain_ratio (windy) = 0.03 / 0.49 = 0.06
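
The numbers on this and the two previous slides appear to come from the classic 14-row weather ("play tennis") dataset; assuming that, the following minimal sketch reproduces the gains and gain ratios:

from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain_ratio(rows, attr, target):
    """Gain and gain ratio for one attribute, using Gini as the splitting rule."""
    cond = sum(
        len(sub) / len(rows) * gini([r[target] for r in sub])
        for v in set(r[attr] for r in rows)
        for sub in [[r for r in rows if r[attr] == v]]
    )
    gain = gini([r[target] for r in rows]) - cond
    split_info = gini([r[attr] for r in rows])  # unconditional Gini of the attribute
    return gain, gain / split_info

# Assumed dataset: the classic 14-row weather data.
data = [
    ("sunny","hot","high",False,"no"), ("sunny","hot","high",True,"no"),
    ("overcast","hot","high",False,"yes"), ("rainy","mild","high",False,"yes"),
    ("rainy","cool","normal",False,"yes"), ("rainy","cool","normal",True,"no"),
    ("overcast","cool","normal",True,"yes"), ("sunny","mild","high",False,"no"),
    ("sunny","cool","normal",False,"yes"), ("rainy","mild","normal",False,"yes"),
    ("sunny","mild","normal",True,"yes"), ("overcast","mild","high",True,"yes"),
    ("overcast","hot","normal",False,"yes"), ("rainy","mild","high",True,"no"),
]
cols = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [dict(zip(cols, r)) for r in data]

for a in cols[:-1]:
    gain, ratio = gini_gain_ratio(rows, a, "play")
    print(f"{a}: gain={gain:.2f}, gain_ratio={ratio:.2f}")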


EXAMPLE
Attribute 1   Attribute 2   Attribute 3   Class
A             70            T             C1
A             90            T             C2
A             85            F             C2
A             95            F             C2
A             70            F             C1
B             90            T             C1
B             78            F             C1
B             65            T             C1
B             75            F             C1
C             80            T             C2
C             70            T             C2
C             80            F             C1
C             80            F             C1
C             96            F             C1


EXAMPLE II

Height   Hair    Eyes    Class
Short    Blond   Blue    +
Tall     Blond   Brown   -
Tall     Red     Blue    +
Short    Dark    Blue    -
Tall     Dark    Blue    -
Tall     Blond   Blue    +
Tall     Dark    Brown   -
Short    Blond   Brown   -


ACCURACY OR ERROR RATES

• Partition: training and testing
  • Use two independent data sets, e.g., a training set (2/3) and a test set (1/3)
  • Used for data sets with a large number of examples



METRICS FOR PERFORMANCE EVALUATION…

                     Predicted Label
                     Positive (+)           Negative (-)
True Label
Positive (+)         True Positive (TP)     False Negative (FN)
Negative (-)         False Positive (FP)    True Negative (TN)

• Most widely-used metric:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
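
As a concrete illustration, here is a minimal sketch in plain Python (the two label vectors are made up for the example) that tallies the four confusion-matrix cells and computes accuracy from them:

def confusion_counts(actual, predicted, positive=True):
    """Tally TP, FN, FP, TN for a binary problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
tp, fn, fp, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, fn, fp, tn, accuracy)  # 2 1 1 2 -> accuracy 4/6, about 0.67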

CLASS IMBALANCE PROBLEM

• A class imbalance problem occurs when one or more classes have very low proportions in the training data as compared to the other classes.
  • In online advertising, an advertisement presented to a viewer creates an impression. The click-through rate, the number of times an ad was clicked divided by the total number of impressions, tends to be very low.



LIMITATION OF ACCURACY

• Consider a 2-class problem:
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10

• If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%

• Accuracy is misleading because the model does not detect any Class 1 example
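
A quick sketch reproducing this trap with scikit-learn, using synthetic labels that mirror the slide's counts:

from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 9990 negatives, 10 positives.
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * 10000          # a "model" that always predicts Class 0

print(accuracy_score(y_true, y_pred))             # 0.999
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0 -- no Class 1 detected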


COST-SENSITIVE MEASURES

Precision (p) = TP / (TP + FP)

Recall (r) = TP / (TP + FN)

F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FN + FP)
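
A minimal sketch of these three measures as code (the counts in the example call are made-up illustrative numbers):

def precision_recall_f(tp, fn, fp):
    """Precision, recall and F-measure from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * r * p / (r + p)   # equivalently 2*tp / (2*tp + fn + fp)
    return p, r, f

# Example: 4 true positives, 2 false negatives, 3 false positives.
print(precision_recall_f(tp=4, fn=2, fp=3))  # (0.571..., 0.666..., 0.615...)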



RECALL AND PRECISION

Actual   Prediction
T        T
T        F
F        T
F        F
F        T
T        T
T        T
T        F
F        T
T        T

• Recall = 4 / 6
• Precision = 4 / 7
• F-Measure = 8 / 13
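
These values can be checked with scikit-learn; a minimal sketch with the table above encoded as 1 for T and 0 for F:

from sklearn.metrics import precision_score, recall_score, f1_score

# The Actual/Prediction table above, row by row.
y_true = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1]

print(recall_score(y_true, y_pred))     # 4/6, about 0.667
print(precision_score(y_true, y_pred))  # 4/7, about 0.571
print(f1_score(y_true, y_pred))         # 8/13, about 0.615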

TERMINOLOGY

• True Positive: the number of positive examples correctly predicted by the classification model.
• False Negative: the number of positive examples wrongly predicted as negative by the classification model.
• False Positive: the number of negative examples wrongly predicted as positive by the classification model.
• True Negative: the number of negative examples correctly predicted by the classification model.



TERMINOLOGY (CONT’D)

• The true positive rate (TPR) or sensitivity is defined as TPR = TP / (TP + FN).
• The true negative rate (TNR) or specificity is defined as TNR = TN / (TN + FP).
• The false positive rate (FPR) is defined as FPR = FP / (TN + FP).
• The false negative rate (FNR) is defined as FNR = FN / (TP + FN).
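
A small sketch of these four definitions as code (the counts passed in the example call are made-up illustrative numbers):

def rates(tp, fn, fp, tn):
    """Sensitivity, specificity, FPR and FNR from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # sensitivity
    tnr = tn / (tn + fp)   # specificity
    fpr = fp / (tn + fp)   # = 1 - specificity
    fnr = fn / (tp + fn)   # = 1 - sensitivity
    return tpr, tnr, fpr, fnr

print(rates(tp=4, fn=2, fp=3, tn=1))  # (0.667, 0.25, 0.75, 0.333)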


ROC (RECEIVER OPERATING CHARACTERISTIC)

• Developed in the 1950s for signal detection theory to analyze noisy signals
• Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)
• Remember that TPR represents sensitivity while FPR represents 1 - specificity



ROC CURVES

• Suppose sensitivity in a given scenario is poor (40%) while specificity is fairly high (92.9%).
• The values are calculated from classes that are determined with the default 50% probability threshold.
• Lowering the threshold to 30% results in a model with 60% sensitivity and 79.3% specificity.


ROC CURVE (CONT’D)

• The ROC curve is created by evaluating the class probabilities for the model across a continuum of thresholds.
• For each candidate threshold, the resulting true-positive rate (sensitivity) and false-positive rate (1 - specificity) are plotted against each other.
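
A sketch of this effect on hypothetical scores (the score distributions below are assumptions, chosen only to show the trade-off as the threshold moves from 50% to 30%):

import numpy as np

def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when predicting positive if score >= threshold."""
    pred = scores >= threshold
    pos, neg = (y_true == 1), (y_true == 0)
    sensitivity = (pred & pos).sum() / pos.sum()
    specificity = (~pred & neg).sum() / neg.sum()
    return sensitivity, specificity

rng = np.random.default_rng(0)
# Hypothetical scores: positives tend to score higher, with heavy overlap.
y_true = np.array([1] * 50 + [0] * 50)
scores = np.concatenate([rng.normal(0.55, 0.2, 50),
                         rng.normal(0.35, 0.2, 50)]).clip(0, 1)

for t in (0.5, 0.3):
    # Sensitivity rises and specificity drops as the threshold is lowered.
    print(t, sens_spec(y_true, scores, t))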



ROC CURVE (CONT’D)

• It is important to remember that altering the threshold only has the effect of making samples more positive (or negative, as the case may be).
• In the confusion matrix, it cannot move samples out of both off-diagonal cells at once: there is almost always a decrease in either sensitivity or specificity as the other is increased.


ROC CURVE (CONT’D)

• The optimal model should be shifted towards the upper-left corner of the plot.
• Alternatively, the model with the largest area under the ROC curve would be the most effective.
• The ROC curve is only defined for two-class problems but has been extended to handle three or more classes.



HOW TO CONSTRUCT AN ROC CURVE

Instance   P(+|A)   True Class
    1       0.95        +
    2       0.93        +
    3       0.87        -
    4       0.852       -
    5       0.851       -
    6       0.850       +
    7       0.76        -
    8       0.53        +
    9       0.43        -
   10       0.25        +

• Use a classifier that produces a posterior probability P(+|A) for each test instance
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP / (TP + FN)
• FP rate, FPR = FP / (FP + TN)

HOW TO CONSTRUCT AN ROC CURVE

Class            +     -     +     -     +     -     -     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.850 0.851 0.852 0.87  0.93  0.95  1.00
TP                5     4     4     3     3     2     2     2     2     1     0
FP                5     5     4     4     3     3     2     1     0     0     0
TN                0     0     1     1     2     2     3     4     5     5     5
FN                0     1     1     2     2     3     3     3     3     4     5
TPR               1   0.8   0.8   0.6   0.6   0.4   0.4   0.4   0.4   0.2     0
FPR               1     1   0.8   0.8   0.6   0.6   0.4   0.2     0     0     0

(Each class label sits under the threshold equal to its own P(+|A); at threshold t, an instance is predicted positive when P(+|A) >= t. The ROC curve is obtained by plotting each (FPR, TPR) pair.)
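
The table can be reproduced programmatically. A minimal sketch in plain Python, with the instance data copied from the previous slide:

# (P(+|A), true class) pairs from the instance table above.
instances = [(0.95, "+"), (0.93, "+"), (0.87, "-"), (0.852, "-"), (0.851, "-"),
             (0.850, "+"), (0.76, "-"), (0.53, "+"), (0.43, "-"), (0.25, "+")]

pos = sum(1 for _, c in instances if c == "+")
neg = len(instances) - pos

# One threshold per unique probability, plus 1.00 for the "predict nothing" end.
thresholds = sorted({p for p, _ in instances} | {1.0})
for t in thresholds:
    # Predict "+" whenever P(+|A) >= threshold t.
    tp = sum(1 for p, c in instances if p >= t and c == "+")
    fp = sum(1 for p, c in instances if p >= t and c == "-")
    print(f"t={t:5.3f}  TP={tp}  FP={fp}  TN={neg - fp}  FN={pos - tp}  "
          f"TPR={tp / pos:.1f}  FPR={fp / neg:.1f}")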

