
Precision and Recall

Contents
1 Precision and Recall
2 Precision and Recall for Information Retrieval
2.1 Precision/Recall Curves
2.2 Average Precision
3 Precision and Recall for Classification
3.1 Precision
3.2 Recall
4 F Measure
4.1 Motivation: Precision and Recall
4.2 Example
5 Precision and Recall for Clustering
5.1 Example
6 Multi-Class Problems
6.1 Averaging
7 Sources

Precision and Recall


Precision and Recall are quality metrics used across many domains:

originally from Information Retrieval


also used in Machine Learning

Precision and Recall for Information Retrieval


IR system has to be:

precise: all returned documents should be relevant


efficient: all relevant documents should be returned

Given a test collection, the quality of an IR system is evaluated with:

Precision: % of relevant documents in the result


Recall: % of relevant documents that were retrieved

More formally,

given a collection of documents C


If X ⊆ C is the output of the IR system and Y ⊆ C is the set of all relevant documents, then define
precision as P = |X ∩ Y| / |X| and recall as R = |X ∩ Y| / |Y|

both P and R are defined w.r.t a set of retrieved documents
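As a minimal sketch of these definitions (the document IDs below are made up for illustration), precision and recall can be computed directly from the retrieved set X and the relevant set Y:

```python
# Minimal sketch: set-based precision and recall for an IR result.
# The document IDs are made up for illustration.
retrieved = {"d1", "d2", "d3", "d4"}        # X: output of the IR system
relevant  = {"d1", "d3", "d5", "d6", "d7"}  # Y: all relevant documents

hits = retrieved & relevant                  # X ∩ Y

precision = len(hits) / len(retrieved)       # |X ∩ Y| / |X| = 2/4 = 0.5
recall    = len(hits) / len(relevant)        # |X ∩ Y| / |Y| = 2/5 = 0.4
print(precision, recall)
```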


Precision/Recall Curves

If we retrieve more documents, we improve recall (if we return all docs, R = 1)


if we retrieve fewer documents, we improve precision, but reduce recall
so there's a trade-off between them

Let k be the number of retrieved documents

then by varying k from 0 to N = |C | we can draw P vs R and obtain the Precision/recall curve:

[Figure: Precision/Recall curve; source: [1]]

the closer the curve to the (1, 1) point - the better the IR system performance

source: Information Retrieval (UFRT), lecture 2

Area under P/R Curve:

Analogously to ROC Curves we can calculate the area under the P/R Curve
the closer the AUPR is to 1, the better
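A sketch of how the curve and its area can be obtained with scikit-learn, assuming binary relevance labels and system scores (both made up here):

```python
# Sketch: Precision/Recall curve and the area under it (AUPR) with scikit-learn.
from sklearn.metrics import precision_recall_curve, auc

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = relevant document
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]   # retrieval scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
aupr = auc(recall, precision)   # area under the P/R curve, closer to 1 is better
print(aupr)
```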

Average Precision

Top-k precision is insensitive to changes in the ranks of the relevant documents within the top k

how to measure overall performance of an IR system?

avg P = (1/K) · Σ_{k=1..K} (k / r_k)

where r_k is the rank of the k-th relevant document in the result

Since a test collection usually contains a set of queries, we calculate the average over them and obtain the
Mean Average Precision: MAP
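A sketch of average precision computed from the ranks of the relevant documents, following the formula above (the ranks and queries below are made up):

```python
# Sketch: average precision avg P = (1/K) * sum_k k / r_k, and MAP over queries.
def average_precision(relevant_ranks):
    """relevant_ranks: 1-based ranks of the K relevant documents in the result."""
    ranks = sorted(relevant_ranks)
    K = len(ranks)
    return sum(k / r_k for k, r_k in enumerate(ranks, start=1)) / K

print(average_precision([1, 3, 6]))   # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722

# MAP: mean of the average precision over a set of queries
queries = [[1, 3, 6], [2, 5]]
print(sum(average_precision(r) for r in queries) / len(queries))
```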

Precision and Recall for Classification


The precision and recall metrics can also be applied to Machine Learning: to binary classifiers

Diagnostic Testing Measures

                                Actual class: Positive               Actual class: Negative
Test outcome hθ(x) positive     True Positive (TP)                   False Positive (FP, Type I error)     Precision = #TP / (#TP + #FP)
Test outcome hθ(x) negative     False Negative (FN, Type II error)   True Negative (TN)                    Negative predictive value = #TN / (#FN + #TN)
                                Sensitivity = #TP / (#TP + #FN)      Specificity = #TN / (#FP + #TN)       Accuracy = (#TP + #TN) / #TOTAL

Main values of this matrix:

True Positive - we predicted "+" and the true class is "+"


True Negative - we predicted "-" and the true class is "-"
False Positive - we predicted "+" and the true class is "-" (Type I error)
False Negative - we predicted "-" and the true class is "+" (Type II error)
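A minimal sketch of counting these four values for a binary classifier (the labels below are made up; 1 stands for "+", 0 for "-"):

```python
# Sketch: counting TP, FP, FN, TN for a binary classifier.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # predicted classes

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
print(TP, FP, FN, TN)   # 3 1 1 3
```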

Two Classes: C+ and C−


Precision

π = P( f(x) = C+ | hθ(x) = C+ )

given that we predict x is +


what's the probability that the decision is correct
we estimate precision as P = #TP / #predicted positives = #TP / (#TP + #FP)

Interpretation

Out of all the people we predicted to have cancer, how many actually had it?
High precision is good:
we don't tell many people that they have cancer when they actually don't

Recall

ρ = P( hθ(x) = C+ | f(x) = C+ )

given a positive instance x


what's the probability that we predict correctly
we estimate recall as R = #TP / #actual positives = #TP / (#TP + #FN)

Interpretation

Out of all the people who actually have cancer, how many did we identify?
The higher the better:
we don't fail to spot many people who actually have cancer

For a classifier that always predicts zero (i.e. hθ(x) = 0), the recall would be zero
So recall gives us a more useful evaluation metric than accuracy alone
And we can be much more confident that a classifier scoring well is actually useful
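A short sketch of estimating both metrics from the counts above, reusing the made-up labels from the earlier sketch; it also shows that the all-zero classifier gets recall 0:

```python
# Sketch: precision and recall from confusion-matrix counts.
def precision_recall(y_true, y_pred):
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    P = TP / (TP + FP) if TP + FP else 0.0   # 0 by convention when undefined
    R = TP / (TP + FN) if TP + FN else 0.0
    return P, R

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
y_zero = [0] * len(y_true)               # classifier that always predicts 0

print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
print(precision_recall(y_true, y_zero))  # (0.0, 0.0) -- recall exposes it
```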

F Measure
P and R don't make sense in isolation from each other

a higher ρ may be obtained by lowering π, and vice versa

Suppose we have a ranking classifier that produces some score for x

we decide whether to classify it as C+ or C− based on some threshold parameter τ


by varying τ we will get different precision and recall
improving recall will lead to worse precision
improving precision will lead to worse recall
how to pick the threshold?
combine P and R into one measure (also see ROC Analysis)

F_β = (β² + 1) · P · R / (β² · P + R)

β controls the trade-off between P and R


if β is close to 0, then we give more importance to P
F0 = P
if β is closer to +∞, we give more importance to R

When β = 1 we have F1 score:

The F1 -score is a single measure of performance of the test.


it's the harmonic mean of precision P and recall R:

F1 = 2 · P · R / (P + R)
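A sketch of F_β and F1 computed from precision and recall (the P and R values below are made up):

```python
# Sketch: F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R).
def f_beta(P, R, beta=1.0):
    if P == 0 and R == 0:
        return 0.0
    return (beta**2 + 1) * P * R / (beta**2 * P + R)

P, R = 0.5, 0.4
print(f_beta(P, R, beta=1.0))   # F1 = 2PR/(P+R) ≈ 0.444
print(f_beta(P, R, beta=0.0))   # F0 = P = 0.5
print(f_beta(P, R, beta=2.0))   # beta > 1 weights recall more, ≈ 0.417
```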

Motivation: Precision and Recall

Let's say we trained a Logistic Regression classifier

we predict 1 if hθ (x) ⩾ 0.5

we predict 0 if hθ (x) < 0.5

Suppose we want to predict y = 1 (i.e. people have cancer) only if we're very confident

we may change the threshold to 0.7


we predict 1 if hθ (x) ⩾ 0.7
we predict 0 if hθ (x) < 0.7
We'll have higher precision in this case (those for whom we predict y = 1 are more likely to actually have cancer)
But lower recall (we'll fail to spot more patients who actually have cancer)

Let's consider the opposite

Suppose we want to avoid missing too many cases of y=1 (i.e. we want to avoid false negatives)
So we may change the threshold to 0.3
we predict 1 if hθ (x) ⩾ 0.3
we predict 0 if hθ (x) < 0.3
That leads to
Higher recall (we'll correctly flag a higher fraction of the patients with cancer)
Lower precision (a higher fraction of those we flag will turn out not to have cancer)
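A small sketch of this trade-off, assuming made-up predicted probabilities hθ(x) and trying several thresholds:

```python
# Sketch: varying the decision threshold trades precision against recall.
y_true   = [1, 1, 1, 0, 1, 0, 0, 0]                      # 1 = has cancer
h_scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]    # h_theta(x), made up

def precision_recall_at(threshold):
    y_pred = [1 if s >= threshold else 0 for s in h_scores]
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    P = TP / (TP + FP) if TP + FP else 0.0
    R = TP / (TP + FN) if TP + FN else 0.0
    return P, R

for tau in (0.3, 0.5, 0.7):
    print(tau, precision_recall_at(tau))
# 0.3 -> (0.67, 1.0), 0.5 -> (0.75, 0.75), 0.7 -> (1.0, 0.5)
```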

Questions

Is there a way to automatically choose the threshold for us?


How to compare precision and recall numbers and decide which algorithm is better?
at the beginning we had a single number (the error rate), but now we have two numbers and need to decide
how to trade them off
the F1 score helps to decide since it combines them into one number

Example

Suppose we have 3 algorithms A1 , A2 , A3 , and we captured the following metrics:

     P     R     Avg    F1
A1   0.5   0.4   0.45   0.444   ← our choice
A2   0.7   0.1   0.40   0.175
A3   0.02  1.0   0.51   0.0392

Here the best algorithm is A1 because it has the highest F1 score (choosing by the plain average of P and R would wrongly favour A3)

Precision and Recall for Clustering


Can use precision and recall to evaluate the result of clustering

Correct decisions:

TP = decision to assign two similar documents to the same cluster


TN = assign two dissimilar documents to different clusters

Errors:

FP: assign two dissimilar documents to the same cluster


FN: assign two similar documents to different clusters

So the confusion matrix is:

                     Same cluster    Different clusters
Same class           TP              FN
Different classes    FP              TN

Example

Consider the following example (from the IR book [3])


there are n(n − 1)/2 = 136 pairs of documents (here n = 17)

TP + FP = (6 choose 2) + (6 choose 2) + (5 choose 2) = 15 + 15 + 10 = 40

TP = (5 choose 2) + (4 choose 2) + (3 choose 2) + (2 choose 2) = 10 + 6 + 3 + 1 = 20

etc

So we have the following contingency table:

                     Same cluster    Different clusters
Same class           TP = 20         FN = 24
Different classes    FP = 20         TN = 72

Thus,

P = 20/40 = 0.5 and R = 20/44 ≈ 0.455

F1 score is F1 ≈ 0.48
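A sketch that reproduces these counts from pairwise decisions; the cluster composition below is an assumption based on how the IR book example is usually reproduced (three clusters over 17 documents with classes x, o and ⋄, written 'd' here):

```python
# Sketch: pairwise TP/FP/FN/TN for a clustering result.
from itertools import combinations

clusters = [                                  # assumed composition of the
    ["x", "x", "x", "x", "x", "o"],           # IR book example: cluster 1
    ["x", "o", "o", "o", "o", "d"],           # cluster 2
    ["x", "x", "d", "d", "d"],                # cluster 3
]

# one (cluster_id, class_label) tuple per document
docs = [(i, label) for i, members in enumerate(clusters) for label in members]

TP = FP = FN = TN = 0
for (c1, l1), (c2, l2) in combinations(docs, 2):
    same_cluster, same_class = c1 == c2, l1 == l2
    if same_cluster and same_class:       TP += 1
    elif same_cluster:                    FP += 1
    elif same_class:                      FN += 1
    else:                                 TN += 1

print(TP, FP, FN, TN)                     # 20 20 24 72
P, R = TP / (TP + FP), TP / (TP + FN)
print(P, R, 2 * P * R / (P + R))          # 0.5, ~0.455, F1 ~ 0.48
```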

Multi-Class Problems
How do we adapt precision and recall to multi-class problems?

let f (⋅) be the target unknown function and hθ (⋅) the model
let C1 , . . . , CK be labels we want to predict (K labels)

Precision w.r.t class Ci is

P( f(x) = Ci | hθ(x) = Ci )

probability that given that we classified x as Ci


the decision is indeed correct

Recall w.r.t. class Ci is

P( hθ(x) = Ci | f(x) = Ci )

given an instance x belongs to Ci


what's the probability that we predict correctly
We estimate these probabilities using a contingency table w.r.t each class Ci

Contingency Table for Ci :

let C+ be Ci and
let C− be all other classes, i.e. C− = {Cj} ∖ {Ci} (all classes except Ci)
then we create a contingency table
and calculate TPi , FPi , FNi , TNi for them

Now estimate precision and recall for class Ci

Pi = TPi / (TPi + FPi)

Ri = TPi / (TPi + FNi)
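A sketch of obtaining per-class counts and the per-class precision/recall via one-vs-rest counting (the labels below are made up):

```python
# Sketch: per-class TP_i, FP_i, FN_i and the resulting P_i, R_i.
from collections import defaultdict

y_true = ["a", "b", "a", "c", "b", "a", "c", "c"]
y_pred = ["a", "a", "a", "c", "b", "b", "c", "a"]

counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
for t, p in zip(y_true, y_pred):
    if t == p:
        counts[t]["TP"] += 1
    else:
        counts[p]["FP"] += 1   # predicted class p, but the true class differs
        counts[t]["FN"] += 1   # true class t, but we predicted something else

for c, d in sorted(counts.items()):
    P_i = d["TP"] / (d["TP"] + d["FP"]) if d["TP"] + d["FP"] else 0.0
    R_i = d["TP"] / (d["TP"] + d["FN"]) if d["TP"] + d["FN"] else 0.0
    print(c, round(P_i, 3), round(R_i, 3))
# a 0.5 0.667, b 0.5 0.5, c 1.0 0.667
```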

Averaging
These precision and recall are calculated for each class separately
how to combine them?

Micro-averaging

calculate TP, ... etc globally and then calculate Precision and Recall
let

TP = Σi TPi
FP = Σi FPi
FN = Σi FNi
TN = Σi TNi

and then calculate precision and recall as


P^μ = TP / (TP + FP)

R^μ = TP / (TP + FN)
Macro-averaging

similar to the One-vs-All Classification technique


calculate Pi and Ri "locally" for each Ci
and then let P^M = (1/K) Σi Pi and R^M = (1/K) Σi Ri
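A sketch contrasting the two averaging schemes on made-up per-class counts, where one class is frequent and one is rare:

```python
# Sketch: micro- vs macro-averaged precision and recall.
per_class = {
    "frequent": {"TP": 50, "FP": 10, "FN": 5},
    "rare":     {"TP": 2,  "FP": 8,  "FN": 6},
}

# micro: sum the counts globally, then compute P and R once
TP = sum(d["TP"] for d in per_class.values())
FP = sum(d["FP"] for d in per_class.values())
FN = sum(d["FN"] for d in per_class.values())
print("micro:", TP / (TP + FP), TP / (TP + FN))   # dominated by the frequent class

# macro: compute P_i and R_i per class, then take the unweighted mean
P_macro = sum(d["TP"] / (d["TP"] + d["FP"]) for d in per_class.values()) / len(per_class)
R_macro = sum(d["TP"] / (d["TP"] + d["FN"]) for d in per_class.values()) / len(per_class)
print("macro:", P_macro, R_macro)                 # the rare class pulls it down
```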

Micro and macro averaging behave quite differently and may give different results

the ability to perform well on categories with low generality (fewer training examples) is emphasized by
macro-averaging and much less so by micro-averaging
which one to use depends on the application

These averaging schemes are often used in Document Classification

Weighted-averaging

Calculate metrics for each label, and find their average weighted by support
(the number of true instances for each label).
it is useful when the classes/labels are imbalanced
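For completeness, a sketch of all three averaging options as exposed by scikit-learn (the labels below are made up; this assumes scikit-learn is installed):

```python
# Sketch: micro, macro and weighted averaging with scikit-learn.
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 2, 1, 2, 2]

for avg in ("micro", "macro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    print(avg, round(p, 3), round(r, 3))
```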
