
Precision and Recall

Contents
1 Precision and Recall
2 Precision and Recall for Information Retrieval
2.1 Precision/Recall Curves
2.2 Average Precision
3 Precision and Recall for Classification
3.1 Precision
3.2 Recall
4 F Measure
4.1 Motivation: Precision and Recall
4.2 Example
5 Precision and Recall for Clustering
5.1 Example
6 Multi-Class Problems
6.1 Averaging
7 Sources

Precision and Recall


Precision and Recall are quality metrics used across many domains:

originally from Information Retrieval


also used in Machine Learning

Precision and Recall for Information Retrieval


IR system has to be:

precise: all returned documents should be relevant


efficient: all relevant documents should be returned

Given a test collection, the quality of an IR system is evaluated with:

Precision: % of relevant documents in the result


Recall: % of relevant documents that were retrieved

More formally,

given a collection of documents C


If X ⊆ C is the output of the IR system and Y ⊆ C is the set of all relevant documents, then define
precision as P = |X ∩ Y| / |X| and recall as R = |X ∩ Y| / |Y|

both P and R are defined w.r.t a set of retrieved documents
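As a minimal sketch of these definitions (the document IDs below are made up for illustration), precision and recall can be computed directly from the retrieved set X and the relevant set Y:

```python
# Minimal sketch: set-based precision and recall for an IR result.
# The document IDs are made up for illustration.
retrieved = {"d1", "d2", "d3", "d4"}        # X: output of the IR system
relevant  = {"d1", "d3", "d5", "d6", "d7"}  # Y: all relevant documents

hits = retrieved & relevant                  # X ∩ Y

precision = len(hits) / len(retrieved)       # |X ∩ Y| / |X| = 2/4 = 0.5
recall    = len(hits) / len(relevant)        # |X ∩ Y| / |Y| = 2/5 = 0.4
print(precision, recall)
```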


Precision/Recall Curves

If we retrieve more documents, we improve recall (if we return all docs, R = 1)


if we retrieve fewer documents, we improve precision, but reduce recall
so there's a trade-off between them

Let k be the number of retrieved documents

then by varying k from 0 to N = |C | we can draw P vs R and obtain the Precision/recall curve:

[Figure: Precision/Recall curve; source: [1]]

the closer the curve to the (1, 1) point - the better the IR system performance

source: Information Retrieval (UFRT), lecture 2

Area under P/R Curve:

Analogously to ROC Curves we can calculate the area under the P/R Curve
the closer the AUPR is to 1, the better
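A sketch of how the curve and its area can be obtained with scikit-learn, assuming binary relevance labels and system scores (both made up here):

```python
# Sketch: Precision/Recall curve and the area under it (AUPR) with scikit-learn.
from sklearn.metrics import precision_recall_curve, auc

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = relevant document
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]   # retrieval scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
aupr = auc(recall, precision)   # area under the P/R curve, closer to 1 is better
print(aupr)
```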

Average Precision

Top-k precision is insensitive to changes in the ranks of the relevant documents within the top k

how to measure overall performance of an IR system?

avg P = (1/K) · Σ_{k=1..K} (k / r_k)

where r_k is the rank of the k-th relevant document in the result

Since a test collection usually contains a set of queries, we calculate the average over them and obtain the
Mean Average Precision: MAP
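A sketch of average precision computed from the ranks of the relevant documents, following the formula above (the ranks and queries below are made up):

```python
# Sketch: average precision avg P = (1/K) * sum_k k / r_k, and MAP over queries.
def average_precision(relevant_ranks):
    """relevant_ranks: 1-based ranks of the K relevant documents in the result."""
    ranks = sorted(relevant_ranks)
    K = len(ranks)
    return sum(k / r_k for k, r_k in enumerate(ranks, start=1)) / K

print(average_precision([1, 3, 6]))   # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722

# MAP: mean of the average precision over a set of queries
queries = [[1, 3, 6], [2, 5]]
print(sum(average_precision(r) for r in queries) / len(queries))
```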

Precision and Recall for Classification


The precision and recall metrics can also be applied to Machine Learning: to binary classifiers

Diagnostic Testing Measures

                                Actual class: Positive               Actual class: Negative
Test outcome hθ(x) positive     True Positive (TP)                   False Positive (FP, Type I error)     Precision = #TP / (#TP + #FP)
Test outcome hθ(x) negative     False Negative (FN, Type II error)   True Negative (TN)                    Negative predictive value = #TN / (#FN + #TN)
                                Sensitivity = #TP / (#TP + #FN)      Specificity = #TN / (#FP + #TN)       Accuracy = (#TP + #TN) / #TOTAL

Main values of this matrix:

True Positive - we predicted "+" and the true class is "+"


True Negative - we predicted "-" and the true class is "-"
False Positive - we predicted "+" and the true class is "-" (Type I error)
False Negative - we predicted "-" and the true class is "+" (Type II error)
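A minimal sketch of counting these four values for a binary classifier (the labels below are made up; 1 stands for "+", 0 for "-"):

```python
# Sketch: counting TP, FP, FN, TN for a binary classifier.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # predicted classes

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
print(TP, FP, FN, TN)   # 3 1 1 3
```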

Two Classes: C+ and C−


Precision

π = P( f(x) = C+ | hθ(x) = C+ )

given that we predict x is +


what's the probability that the decision is correct
we estimate precision as P = #TP / #predicted positives = #TP / (#TP + #FP)

Interpretation

Out of all the people we predicted to have cancer, how many actually had it?
High precision is good:
we don't tell many people that they have cancer when they actually don't

Recall

ρ = P( hθ(x) = C+ | f(x) = C+ )

given a positive instance x


what's the probability that we predict correctly
we estimate recall as R = #TP / #actual positives = #TP / (#TP + #FN)

Interpretation

Out of all the people who actually have cancer, how many did we identify?
The higher the better:
we don't fail to spot many people who actually have cancer

For a classifier that always predicts zero (i.e. hθ(x) = 0), the recall would be zero
So recall gives us a more useful evaluation metric than accuracy alone
And we can be much more confident that a classifier scoring well is actually useful
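A short sketch of estimating both metrics from the counts above, reusing the made-up labels from the earlier sketch; it also shows that the all-zero classifier gets recall 0:

```python
# Sketch: precision and recall from confusion-matrix counts.
def precision_recall(y_true, y_pred):
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    P = TP / (TP + FP) if TP + FP else 0.0   # 0 by convention when undefined
    R = TP / (TP + FN) if TP + FN else 0.0
    return P, R

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
y_zero = [0] * len(y_true)               # classifier that always predicts 0

print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
print(precision_recall(y_true, y_zero))  # (0.0, 0.0) -- recall exposes it
```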

F Measure
P and R don't make sense in isolation from each other

a higher ρ may be obtained by lowering π, and vice versa

Suppose we have a ranking classifier that produces some score for x

we decide whether to classify it as C+ or C− based on some threshold parameter τ


by varying τ we will get different precision and recall
improving recall will lead to worse precision
improving precision will lead to worse recall
how to pick the threshold?
combine P and R into one measure (also see ROC Analysis)

F_β = (β² + 1) · P · R / (β² · P + R)

β controls the trade-off between P and R


if β is close to 0, then we give more importance to P
F0 = P
if β is closer to +∞, we give more importance to R

When β = 1 we have F1 score:

The F1 -score is a single measure of performance of the test.


it's the harmonic mean of precision P and recall R:

F1 = 2 · P · R / (P + R)
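A sketch of F_β and F1 computed from precision and recall (the P and R values below are made up):

```python
# Sketch: F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R).
def f_beta(P, R, beta=1.0):
    if P == 0 and R == 0:
        return 0.0
    return (beta**2 + 1) * P * R / (beta**2 * P + R)

P, R = 0.5, 0.4
print(f_beta(P, R, beta=1.0))   # F1 = 2PR/(P+R) ≈ 0.444
print(f_beta(P, R, beta=0.0))   # F0 = P = 0.5
print(f_beta(P, R, beta=2.0))   # beta > 1 weights recall more, ≈ 0.417
```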

Motivation: Precision and Recall

Let's say we trained a Logistic Regression classifier

we predict 1 if hθ (x) ⩾ 0.5

we predict 0 if hθ (x) < 0.5

Suppose we want to predict y = 1 (i.e. people have cancer) only if we're very confident

we may change the threshold to 0.7


we predict 1 if hθ (x) ⩾ 0.7
we predict 0 if hθ (x) < 0.7
We'll have higher precision in this case (those for whom we predict y = 1 are more likely to actually have cancer)
But lower recall (we'll fail to spot more patients who actually have cancer)

Let's consider the opposite

Suppose we want to avoid missing too many cases of y=1 (i.e. we want to avoid false negatives)
So we may change the threshold to 0.3
we predict 1 if hθ (x) ⩾ 0.3
we predict 0 if hθ (x) < 0.3
That leads to
Higher recall (we'll correctly flag a higher fraction of the patients with cancer)
Lower precision (a higher fraction of those we flag will turn out not to have cancer)
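A small sketch of this trade-off, assuming made-up predicted probabilities hθ(x) and trying several thresholds:

```python
# Sketch: varying the decision threshold trades precision against recall.
y_true   = [1, 1, 1, 0, 1, 0, 0, 0]                      # 1 = has cancer
h_scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]    # h_theta(x), made up

def precision_recall_at(threshold):
    y_pred = [1 if s >= threshold else 0 for s in h_scores]
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    P = TP / (TP + FP) if TP + FP else 0.0
    R = TP / (TP + FN) if TP + FN else 0.0
    return P, R

for tau in (0.3, 0.5, 0.7):
    print(tau, precision_recall_at(tau))
# 0.3 -> (0.67, 1.0), 0.5 -> (0.75, 0.75), 0.7 -> (1.0, 0.5)
```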

Questions

Is there a way to automatically choose the threshold for us?


How to compare precision and recall numbers and decide which algorithm is better?
at the beginning we had a single number (the error rate), but now we have two numbers and need to decide
how to trade them off
the F1 score helps to decide since it combines them into one number

Example

Suppose we have 3 algorithms A1 , A2 , A3 , and we captured the following metrics:

     P     R     Avg    F1
A1   0.5   0.4   0.45   0.444   ← our choice
A2   0.7   0.1   0.40   0.175
A3   0.02  1.0   0.51   0.0392

Here the best algorithm is A1 because it has the highest F1 score (choosing by the plain average of P and R would wrongly favour A3)

Precision and Recall for Clustering


Can use precision and recall to evaluate the result of clustering

Correct decisions:

TP = decision to assign two similar documents to the same cluster


TN = assign two dissimilar documents to different clusters

Errors:

FP: assign two dissimilar documents to the same cluster


FN: assign two similar documents to different clusters

So the confusion matrix is:

                     Same cluster    Different clusters
Same class           TP              FN
Different classes    FP              TN

Example

Consider the following example (from the IR book [3])


there are n(n − 1)/2 = 136 pairs of documents (here n = 17)

TP + FP = (6 choose 2) + (6 choose 2) + (5 choose 2) = 15 + 15 + 10 = 40

TP = (5 choose 2) + (4 choose 2) + (3 choose 2) + (2 choose 2) = 10 + 6 + 3 + 1 = 20

etc

So we have the following contingency table:

                     Same cluster    Different clusters
Same class           TP = 20         FN = 24
Different classes    FP = 20         TN = 72

Thus,

P = 20/40 = 0.5 and R = 20/44 ≈ 0.455

F1 score is F1 ≈ 0.48
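A sketch that reproduces these counts from pairwise decisions; the cluster composition below is an assumption based on how the IR book example is usually reproduced (three clusters over 17 documents with classes x, o and ⋄, written 'd' here):

```python
# Sketch: pairwise TP/FP/FN/TN for a clustering result.
from itertools import combinations

clusters = [                                  # assumed composition of the
    ["x", "x", "x", "x", "x", "o"],           # IR book example: cluster 1
    ["x", "o", "o", "o", "o", "d"],           # cluster 2
    ["x", "x", "d", "d", "d"],                # cluster 3
]

# one (cluster_id, class_label) tuple per document
docs = [(i, label) for i, members in enumerate(clusters) for label in members]

TP = FP = FN = TN = 0
for (c1, l1), (c2, l2) in combinations(docs, 2):
    same_cluster, same_class = c1 == c2, l1 == l2
    if same_cluster and same_class:       TP += 1
    elif same_cluster:                    FP += 1
    elif same_class:                      FN += 1
    else:                                 TN += 1

print(TP, FP, FN, TN)                     # 20 20 24 72
P, R = TP / (TP + FP), TP / (TP + FN)
print(P, R, 2 * P * R / (P + R))          # 0.5, ~0.455, F1 ~ 0.48
```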

Multi-Class Problems
How do we adapt precision and recall to multi-class problems?

let f (⋅) be the target unknown function and hθ (⋅) the model
let C1 , . . . , CK be labels we want to predict (K labels)

Precision w.r.t class Ci is

P( f(x) = Ci | hθ(x) = Ci )

probability that given that we classified x as Ci


the decision is indeed correct

Recall w.r.t. class Ci is

P( hθ(x) = Ci | f(x) = Ci )

given an instance x belongs to Ci


what's the probability that we predict correctly
We estimate these probabilities using a contingency table w.r.t each class Ci

Contingency Table for Ci :

let C+ be Ci and
let C− be all other classes, i.e. C− = {Cj} ∖ {Ci} (all classes except Ci)
then we create a contingency table
and calculate TPi , FPi , FNi , TNi for them

Now estimate precision and recall for class Ci

Pi = TPi / (TPi + FPi)

Ri = TPi / (TPi + FNi)
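A sketch of obtaining per-class counts and the per-class precision/recall via one-vs-rest counting (the labels below are made up):

```python
# Sketch: per-class TP_i, FP_i, FN_i and the resulting P_i, R_i.
from collections import defaultdict

y_true = ["a", "b", "a", "c", "b", "a", "c", "c"]
y_pred = ["a", "a", "a", "c", "b", "b", "c", "a"]

counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
for t, p in zip(y_true, y_pred):
    if t == p:
        counts[t]["TP"] += 1
    else:
        counts[p]["FP"] += 1   # predicted class p, but the true class differs
        counts[t]["FN"] += 1   # true class t, but we predicted something else

for c, d in sorted(counts.items()):
    P_i = d["TP"] / (d["TP"] + d["FP"]) if d["TP"] + d["FP"] else 0.0
    R_i = d["TP"] / (d["TP"] + d["FN"]) if d["TP"] + d["FN"] else 0.0
    print(c, round(P_i, 3), round(R_i, 3))
# a 0.5 0.667, b 0.5 0.5, c 1.0 0.667
```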

Averaging
These precision and recall are calculated for each class separately
how to combine them?

Micro-averaging

calculate TP, ... etc globally and then calculate Precision and Recall
let

TP = Σi TPi
FP = Σi FPi
FN = Σi FNi
TN = Σi TNi

and then calculate precision and recall as


P^μ = TP / (TP + FP)

R^μ = TP / (TP + FN)
Macro-averaging

similar to the One-vs-All Classification technique


calculate Pi and Ri "locally" for each Ci
and then let P^M = (1/K) Σi Pi and R^M = (1/K) Σi Ri
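A sketch contrasting the two averaging schemes on made-up per-class counts, where one class is frequent and one is rare:

```python
# Sketch: micro- vs macro-averaged precision and recall.
per_class = {
    "frequent": {"TP": 50, "FP": 10, "FN": 5},
    "rare":     {"TP": 2,  "FP": 8,  "FN": 6},
}

# micro: sum the counts globally, then compute P and R once
TP = sum(d["TP"] for d in per_class.values())
FP = sum(d["FP"] for d in per_class.values())
FN = sum(d["FN"] for d in per_class.values())
print("micro:", TP / (TP + FP), TP / (TP + FN))   # dominated by the frequent class

# macro: compute P_i and R_i per class, then take the unweighted mean
P_macro = sum(d["TP"] / (d["TP"] + d["FP"]) for d in per_class.values()) / len(per_class)
R_macro = sum(d["TP"] / (d["TP"] + d["FN"]) for d in per_class.values()) / len(per_class)
print("macro:", P_macro, R_macro)                 # the rare class pulls it down
```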

Micro and macro averaging behave quite differently and may give different results

the ability to perform well on categories with low generality (fewer training examples) is emphasized by
macro-averaging and much less so by micro-averaging
which one to use depends on the application

These averaging schemes are often used in Document Classification

Weighted-averaging

Calculate metrics for each label, and find their average weighted by support
(the number of true instances for each label).
it is useful when the classes/labels are imbalanced
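For completeness, a sketch of all three averaging options as exposed by scikit-learn (the labels below are made up; this assumes scikit-learn is installed):

```python
# Sketch: micro, macro and weighted averaging with scikit-learn.
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 2, 1, 2, 2]

for avg in ("micro", "macro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    print(avg, round(p, 3), round(r, 3))
```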
