CH 4
Contents
1 Precision and Recall
2 Precision and Recall for Information Retrieval
2.1 Precision/Recall Curves
2.2 Average Precision
3 Precision and Recall for Classification
3.1 Precision
3.2 Recall
4 F Measure
4.1 Motivation: Precision and Recall
4.2 Example
5 Precision and Recall for Clustering
5.1 Example
6 Multi-Class Problems
6.1 Averaging
7 Sources
Precision and Recall for Information Retrieval
Precision/Recall Curves
Suppose the IR system returns a ranked list of documents; at each cut-off k we can compute precision and recall over the top k results. Then, by varying k from 0 to N = |C|, we can draw P vs R and obtain the Precision/Recall curve:
(figure: Precision/Recall curve; source: [1])
The closer the curve is to the (1, 1) point, the better the IR system's performance. (UFRT, lecture 2)
Analogously to ROC curves, we can calculate the area under the P/R curve (AUPR);
the closer the AUPR is to 1, the better.
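To make this concrete, here is a minimal Python sketch, assuming the query result is given as a ranked list of binary relevance judgments (1 = relevant); the function name pr_curve and the example list are mine. It computes the precision/recall point at every cut-off k; plotting these points gives the P/R curve.

```python
def pr_curve(relevance, n_relevant_total=None):
    """Precision/recall at every cut-off k of a ranked list of 0/1 relevance labels."""
    if n_relevant_total is None:
        # assume every relevant document appears somewhere in the ranked list
        n_relevant_total = sum(relevance)
    points = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        points.append((hits / n_relevant_total, hits / k))  # (recall@k, precision@k)
    return points

# hypothetical ranked result of one query: 1 = relevant, 0 = not relevant
ranked = [1, 0, 1, 1, 0, 0, 1, 0]
for recall, precision in pr_curve(ranked):
    print(f"R = {recall:.2f}  P = {precision:.2f}")
```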
Average Precision
$$\operatorname{avg} P = \frac{1}{K} \sum_{k=1}^{K} \frac{k}{r_k}$$
where K is the number of relevant documents and r_k is the rank at which the k-th relevant document appears.
Since a test collection usually contains a set of queries, we calculate the average of avg P over all of them and get the Mean Average Precision: MAP
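A minimal sketch of both formulas in Python; the helper names average_precision and mean_average_precision and the relevance lists are hypothetical, and it assumes every relevant document appears somewhere in the ranked list.

```python
def average_precision(relevance):
    """avg P = (1/K) * sum_k (k / r_k), where the k-th relevant document sits at rank r_k."""
    K = sum(relevance)          # total number of relevant documents (assumed all retrieved)
    if K == 0:
        return 0.0
    ap, hits = 0.0, 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1           # this is the k-th relevant document
            ap += hits / rank   # precision at its rank: k / r_k
    return ap / K

def mean_average_precision(queries):
    """MAP: the mean of avg P over a set of queries."""
    return sum(average_precision(q) for q in queries) / len(queries)

queries = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]   # hypothetical relevance judgments per query
print(mean_average_precision(queries))
```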
Precision and Recall for Classification
Let f(⋅) be the target (unknown) function, hθ(⋅) our model, and C+ the positive class (e.g. "has cancer").
Precision
$$\pi = P\big(f(x) = C_+ \mid h_\theta(x) = C_+\big)$$
Interpretation
Out of all the people we predicted to have cancer, how many actually had it?
High precision is good:
we don't tell many people that they have cancer when they actually don't
Recall
$$\rho = P\big(h_\theta(x) = C_+ \mid f(x) = C_+\big)$$
Interpretation
Out of all the people that actually have cancer, how many did we identify?
The higher the better:
We don't fail to spot many people that actually have cancer
For a classifier that always returns zero (i.e. hθ (x) = 0), the recall would be zero.
That gives us a more useful evaluation metric than plain accuracy: a trivial classifier is exposed immediately,
and if both precision and recall are high we can be much more sure the model is actually doing something useful.
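To make the two definitions concrete, here is a small Python sketch; the encoding 1 = C+, 0 = C− and the example arrays are assumptions of mine.

```python
def precision_recall(y_true, y_pred):
    """Precision = P(f(x)=C+ | h(x)=C+), Recall = P(h(x)=C+ | f(x)=C+), with 1 = C+."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # a model that always predicts 0 gets recall 0
    return precision, recall

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual diagnoses (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (hypothetical)
print(precision_recall(y_true, y_pred))   # -> (0.75, 0.75)
```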
F Measure
P and R don't make sense in isolation from each other, so we combine them into a single measure, the Fβ score:
$$F_\beta = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}$$
For β = 1 this is the F1 score, the harmonic mean of P and R: F1 = 2PR / (P + R).
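A quick sketch of how β shifts the balance between P and R; the function name f_beta and the example values are mine.

```python
def f_beta(p, r, beta=1.0):
    """F_beta = ((beta^2 + 1) * P * R) / (beta^2 * P + R)."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 0.7, 0.3                     # hypothetical precision and recall
print(f_beta(p, r, beta=1.0))       # F1: treats P and R equally
print(f_beta(p, r, beta=2.0))       # beta > 1 weights recall more heavily
print(f_beta(p, r, beta=0.5))       # beta < 1 weights precision more heavily
```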
Suppose we want to predict y = 1 (i.e. a person has cancer) only if we're very confident.
Then we raise the decision threshold (e.g. predict 1 only if hθ (x) ⩾ 0.7), which gives higher precision but lower recall.
Suppose instead we want to avoid missing too many cases of y = 1 (i.e. we want to avoid false negatives).
Then we may lower the threshold to 0.3:
we predict 1 if hθ (x) ⩾ 0.3
we predict 0 if hθ (x) < 0.3
That leads to:
Higher recall (we'll correctly flag a higher fraction of the patients with cancer)
Lower precision (a higher fraction of the flagged patients will turn out not to have cancer)
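A small sketch of this trade-off, assuming we have the model's predicted probabilities hθ(x) and the true labels (both arrays below are hypothetical):

```python
# Sweep the decision threshold and watch precision and recall move in opposite directions.
probs  = [0.95, 0.85, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1, 0.05]  # hypothetical h_theta(x)
labels = [1,    1,    0,   1,   0,    1,   0,    0,   1,   0]      # hypothetical truth

for threshold in (0.7, 0.5, 0.3):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(labels, preds) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={threshold}: P={precision:.2f} R={recall:.2f}")
# lowering the threshold raises recall and lowers precision
```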
Example
Question: given several algorithms with different precision/recall trade-offs, how do we decide which one is best?
(table comparing the algorithms' P, R, average (P + R)/2 and F1 scores)
Precision and Recall for Clustering
For clustering, we look at all pairs of documents and check whether the decision to put a pair into the same or into different clusters agrees with the pair's true classes.
Correct decisions: TP (two documents of the same class in the same cluster), TN (two documents of different classes in different clusters)
Errors: FP (two documents of different classes in the same cluster), FN (two documents of the same class in different clusters)

                     same cluster    different clusters
same class           TP              FN
different classes    FP              TN
Example
Counting over all document pairs of an example clustering, we get:

                     same cluster    different clusters
same class           TP = 20         FN = 24
different classes    FP = 20         TN = 72
Thus,
P = TP / (TP + FP) = 20 / 40 = 0.5
R = TP / (TP + FN) = 20 / 44 ≈ 0.45
and the F1 score is F1 = 2PR / (P + R) ≈ 0.48
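A sketch of the pair-counting computation in Python. The cluster assignment and class labels below are an assumption of mine (mirroring the classic textbook example), chosen so that they reproduce the counts above:

```python
from itertools import combinations

def pair_counts(clusters, classes):
    """Count document pairs by cluster decision (same/different) vs true class (same/different)."""
    tp = fp = fn = tn = 0
    for (c1, k1), (c2, k2) in combinations(zip(clusters, classes), 2):
        same_cluster, same_class = c1 == c2, k1 == k2
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# hypothetical assignment: 3 clusters over 17 documents with true classes x, o, d
clusters = [1]*6 + [2]*6 + [3]*5
classes  = ["x","x","x","x","x","o",  "x","o","o","o","o","d",  "x","x","d","d","d"]

tp, fp, fn, tn = pair_counts(clusters, classes)   # -> (20, 20, 24, 72)
p, r = tp / (tp + fp), tp / (tp + fn)
print(p, r, 2 * p * r / (p + r))                  # -> 0.5  0.4545...  0.476...
```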
Multi-Class Problems
How do we adapt precision and recall to multi-class problems?
let f (⋅) be the target unknown function and hθ (⋅) the model
let C1 , . . . , CK be labels we want to predict (K labels)
precision for class Ci: $$\pi_i = P\big(f(x) = C_i \mid h_\theta(x) = C_i\big)$$
recall for class Ci: $$\rho_i = P\big(h_\theta(x) = C_i \mid f(x) = C_i\big)$$
let C+ be Ci, and
let C− be all the other classes, i.e. C− = {Cj : j ≠ i}
then we create a contingency table
and calculate TPi , FPi , FNi , TNi for them
$$P_i = \frac{TP_i}{TP_i + FP_i}$$
$$R_i = \frac{TP_i}{TP_i + FN_i}$$
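A sketch of this one-vs-rest reduction in Python; the label names and example arrays are hypothetical:

```python
def per_class_counts(y_true, y_pred, label):
    """Treat `label` as C+ and every other class as C-, then count TP, FP, FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn

y_true = ["a", "a", "b", "b", "b", "c", "c", "a"]   # hypothetical true labels
y_pred = ["a", "b", "b", "b", "c", "c", "a", "a"]   # hypothetical predictions
for label in ("a", "b", "c"):
    tp, fp, fn = per_class_counts(y_true, y_pred, label)
    p_i = tp / (tp + fp) if tp + fp else 0.0
    r_i = tp / (tp + fn) if tp + fn else 0.0
    print(f"{label}: P_i={p_i:.2f} R_i={r_i:.2f}")
```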
Averaging
These precision and recall values are calculated for each class separately;
how do we combine them?
Micro-averaging
calculate TP, FP, FN, TN globally (summed over all classes) and then compute Precision and Recall from these totals:
let
$$TP = \sum_i TP_i, \quad FP = \sum_i FP_i, \quad FN = \sum_i FN_i, \quad TN = \sum_i TN_i$$
then
$$P^\mu = \frac{TP}{TP + FP} \qquad R^\mu = \frac{TP}{TP + FN}$$
Macro-averaging
calculate Precision and Recall for each class separately and then take their unweighted mean over the K classes:
$$P^M = \frac{1}{K} \sum_i P_i \qquad R^M = \frac{1}{K} \sum_i R_i$$
Micro- and macro-averaging behave quite differently and may give different results:
the ability to behave well on categories with low generality (fewer training examples) is emphasized by macro-averaging and much less so by micro-averaging
Which one to use depends on the application.
Weighted-averaging
Calculate metrics for each label and find their average weighted by support
(the number of true instances for each label).
This is useful when the classes/labels are imbalanced.
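A sketch comparing the three averaging schemes on the same hypothetical predictions; the helper names and arrays are mine. If scikit-learn is available, precision_recall_fscore_support with average='micro', 'macro' or 'weighted' should give the same numbers.

```python
from collections import Counter

def counts(y_true, y_pred, label):
    """One-vs-rest TP, FP, FN for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

y_true = ["a", "a", "a", "a", "a", "b", "b", "c"]   # hypothetical, imbalanced classes
y_pred = ["a", "a", "a", "b", "c", "b", "c", "c"]
labels = sorted(set(y_true))
support = Counter(y_true)                            # number of true instances per class

per_class = {l: counts(y_true, y_pred, l) for l in labels}
P = {l: tp / (tp + fp) if tp + fp else 0.0 for l, (tp, fp, fn) in per_class.items()}
R = {l: tp / (tp + fn) if tp + fn else 0.0 for l, (tp, fp, fn) in per_class.items()}

# micro: pool the counts first, then compute P and R once
TP = sum(c[0] for c in per_class.values())
FP = sum(c[1] for c in per_class.values())
FN = sum(c[2] for c in per_class.values())
print("micro:    P=%.2f R=%.2f" % (TP / (TP + FP), TP / (TP + FN)))

# macro: unweighted mean of the per-class values
print("macro:    P=%.2f R=%.2f" % (sum(P.values()) / len(labels), sum(R.values()) / len(labels)))

# weighted: per-class values weighted by support
n = len(y_true)
print("weighted: P=%.2f R=%.2f" % (
    sum(P[l] * support[l] for l in labels) / n,
    sum(R[l] * support[l] for l in labels) / n))
```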