
2. Binary classification and related tasks

What's next?

2 Binary classification and related tasks


Classification
Assessing classification performance
Visualising classification performance
Scoring and ranking
Assessing and visualising ranking performance
Tuning rankers
Class probability estimation
Assessing class probability estimates

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 67 / 291


2. Binary classification and related tasks

Symbols used in the following slides

The following symbols are used in the slides ahead:

• X – the instance space (the universe of all instances)
• L – the label space (the set of all labels)
• C – the set of all classes
• Y – the output space (the set of all outputs)
• Tr – training set of labelled instances (x, l(x)), where l : X → L is the true labelling function
• Te – test set of labelled instances (x, l(x)), where l : X → L

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 68 / 291


2. Binary classification and related tasks

Table 2.1, p.52 Predictive machine learning scenarios

Task                     Label space   Output space     Learning problem
Classification           L = C         Y = C            learn an approximation ĉ : X → C to the true labelling function c
Scoring and ranking      L = C         Y = R^|C|        learn a model that outputs a score vector over classes
Probability estimation   L = C         Y = [0,1]^|C|    learn a model that outputs a probability vector over classes
Regression               L = R         Y = R            learn an approximation f̂ : X → R to the true labelling function f

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 69 / 291


2. Binary classification and related tasks 2.1 Classification

What’s next?

2 Binary classification and related tasks


Classification
Assessing classification performance
Visualising classification performance
Scoring and ranking
Assessing and visualising ranking performance
Tuning rankers
Class probability estimation
Assessing class probability estimates

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 70 / 291


2. Binary classification and related tasks 2.1 Classification

Classification

A classifier is a mapping ĉ : X → C, where C = {C1, C2, ..., Ck} is a finite and usually small set of class labels. We will sometimes also use Ci to indicate the set of examples of that class.

We use the 'hat' to indicate that ĉ(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance (sometimes contaminated by noise).

Learning a classifier involves constructing the function ĉ such that it matches c as closely as possible (and not just on the training set, but ideally on the entire instance space X).

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 71 / 291


2. Binary classification and related tasks 2.1 Classification

Figure 2.1, p.53 A decision tree

(left, as a text rendering)
'Viagra' = 1              → spam: 20, ham: 5
'Viagra' = 0, 'lottery' = 0 → spam: 20, ham: 40
'Viagra' = 0, 'lottery' = 1 → spam: 10, ham: 5

(right, with majority-class labels)
'Viagra' = 1              → ĉ(x) = spam
'Viagra' = 0, 'lottery' = 0 → ĉ(x) = ham
'Viagra' = 0, 'lottery' = 1 → ĉ(x) = spam

(left) A feature tree with training set class distribution in the leaves. (right) A decision tree obtained using the majority class decision rule.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 72 / 291


2. Binary classification and related tasks 2.1 Classification

Assessing classification performance

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 73 / 291


2. Binary classification and related tasks 2.1 Classification

Table 2.2, p.54 Contingency table

(left)
              Predicted ⊕   Predicted ⊖
Actual ⊕           30            20        50
Actual ⊖           10            40        50
                   40            60       100

(right)
              Predicted ⊕   Predicted ⊖
Actual ⊕           20            30        50
Actual ⊖           20            30        50
                   40            60       100

(left) A two-class contingency table or confusion matrix depicting the performance of the decision tree in Figure 2.1. Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns prediction errors. (right) A contingency table with the same marginals but independent rows and columns.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 74 / 291


2. Binary classification and related tasks 2.1 Classification

Statistics from contingency table

Let's label the counts of a classifier's predictions on a test set as in the table:

              Predicted ⊕   Predicted ⊖
Actual ⊕          TP            FN        Pos
Actual ⊖          FP            TN        Neg
                TP + FP       FN + TN    |Te|

where the abbreviations stand for:

• TP – true positives
• FP – false positives
• FN – false negatives
• TN – true negatives
• Pos – number of positive examples
• Neg – number of negative examples
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 75 / 291
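As an illustration (not from the slides), these counts can be computed directly from paired lists of actual and predicted labels. A minimal Python sketch; the '+'/'−' encoding and the toy data are assumptions:

```python
# Sketch: contingency-table counts from paired label lists.
# The '+'/'-' label encoding and the example data are illustrative assumptions.

def contingency_counts(actual, predicted, pos_label="+"):
    """Return (TP, FN, FP, TN) for binary labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == pos_label and p == pos_label)
    fn = sum(1 for a, p in zip(actual, predicted) if a == pos_label and p != pos_label)
    fp = sum(1 for a, p in zip(actual, predicted) if a != pos_label and p == pos_label)
    tn = sum(1 for a, p in zip(actual, predicted) if a != pos_label and p != pos_label)
    return tp, fn, fp, tn

# Tiny check: Pos = TP + FN and Neg = FP + TN by construction.
actual    = ["+", "+", "-", "-", "+"]
predicted = ["+", "-", "-", "+", "+"]
print(contingency_counts(actual, predicted))  # (2, 1, 1, 1)
```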
2. Binary classification and related tasks 2.1 Classification

Table 2.3, p.57 Performance measures I

• number of positives: Pos = Σ_{x∈Te} I[c(x) = ⊕]
• number of negatives: Neg = Σ_{x∈Te} I[c(x) = ⊖] = |Te| − Pos
• number of true positives: TP = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕]
• number of true negatives: TN = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊖]
• number of false positives: FP = Σ_{x∈Te} I[ĉ(x) = ⊕, c(x) = ⊖] = Neg − TN
• number of false negatives: FN = Σ_{x∈Te} I[ĉ(x) = ⊖, c(x) = ⊕] = Pos − TP
• proportion of positives: pos = (1/|Te|) Σ_{x∈Te} I[c(x) = ⊕] = Pos/|Te|; estimates P(c(x) = ⊕)
• proportion of negatives: neg = 1 − pos; estimates P(c(x) = ⊖)
• class ratio: clr = pos/neg = Pos/Neg
• accuracy (*): acc = (1/|Te|) Σ_{x∈Te} I[ĉ(x) = c(x)]; estimates P(ĉ(x) = c(x))
• error rate (*): err = 1 − acc; estimates P(ĉ(x) ≠ c(x))
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 76 / 291
2. Binary classification and related tasks 2.1 Classification

Table 2.3, p.57 Performance measures II

• true positive rate, sensitivity, recall: tpr = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕] / Σ_{x∈Te} I[c(x) = ⊕] = TP/Pos; estimates P(ĉ(x) = ⊕ | c(x) = ⊕)
• true negative rate, specificity: tnr = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊖] / Σ_{x∈Te} I[c(x) = ⊖] = TN/Neg; estimates P(ĉ(x) = ⊖ | c(x) = ⊖)
• false positive rate, false alarm rate: fpr = Σ_{x∈Te} I[ĉ(x) = ⊕, c(x) = ⊖] / Σ_{x∈Te} I[c(x) = ⊖] = FP/Neg = 1 − tnr; estimates P(ĉ(x) = ⊕ | c(x) = ⊖)
• false negative rate: fnr = Σ_{x∈Te} I[ĉ(x) = ⊖, c(x) = ⊕] / Σ_{x∈Te} I[c(x) = ⊕] = FN/Pos = 1 − tpr; estimates P(ĉ(x) = ⊖ | c(x) = ⊕)
• precision, confidence: prec = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕] / Σ_{x∈Te} I[ĉ(x) = ⊕] = TP/(TP + FP); estimates P(c(x) = ⊕ | ĉ(x) = ⊕)

Table: A summary of different quantities and evaluation measures for classifiers on a test set Te. Symbols starting with a capital letter denote absolute frequencies (counts), while lower-case symbols denote relative frequencies or ratios. All except those indicated with (*) are defined only for binary classification.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 77 / 291
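All of these measures follow from the four counts. A hedged sketch building on the previous one; the function name and dictionary layout are my own:

```python
# Sketch: deriving the evaluation measures of Table 2.3 from the four counts.
# Assumes TP, FN, FP, TN have already been obtained (e.g. as above).

def measures(tp, fn, fp, tn):
    pos_count, neg_count = tp + fn, fp + tn
    n = pos_count + neg_count
    return {
        "pos": pos_count / n,           # proportion of positives
        "neg": neg_count / n,           # proportion of negatives
        "clr": pos_count / neg_count,   # class ratio
        "acc": (tp + tn) / n,           # accuracy
        "err": (fp + fn) / n,           # error rate
        "tpr": tp / pos_count,          # sensitivity, recall
        "tnr": tn / neg_count,          # specificity
        "fpr": fp / neg_count,          # false alarm rate
        "fnr": fn / pos_count,          # false negative rate
        "prec": tp / (tp + fp),         # precision, confidence
    }

# Contingency table of the decision tree in Table 2.2 (left): TP=30, FN=20, FP=10, TN=40.
print(measures(30, 20, 10, 40))  # acc = 0.7, tpr = 0.6, tnr = 0.8, ...
```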


2. Binary classification and related tasks 2.1 Classification

Example 2.1, p.56 Accuracy as a weighted average

Suppose a classifier’s predictions on a test set are as in the following table:

              Predicted ⊕   Predicted ⊖
Actual ⊕           60            15        75
Actual ⊖           10            15        25
                   70            30       100

From this table, we see that the true positive rate is tpr = 60/75 = 0.80 and the true negative rate is tnr = 15/25 = 0.60. The overall accuracy is acc = (60 + 15)/100 = 0.75, which is no longer the average of the true positive and true negative rates. However, taking into account the proportion of positives pos = 0.75 and the proportion of negatives neg = 1 − pos = 0.25, we see that

acc = pos · tpr + neg · tnr

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 78 / 291
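A quick numerical check of the identity acc = pos · tpr + neg · tnr on this table (illustrative code, not from the book):

```python
# Sketch verifying Example 2.1: accuracy as a class-prior-weighted average of tpr and tnr.
tp, fn, fp, tn = 60, 15, 10, 15
pos, neg = (tp + fn) / 100, (fp + tn) / 100   # 0.75 and 0.25
tpr, tnr = tp / (tp + fn), tn / (fp + tn)     # 0.80 and 0.60
acc = (tp + tn) / 100                         # 0.75
assert abs(acc - (pos * tpr + neg * tnr)) < 1e-12
```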


2. Binary classification and related tasks 2.1 Classification

Visualising classification performance

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 79 / 291


2. Binary classification and related tasks 2.1 Classification

Degrees of freedom

The following contingency table:

              Predicted ⊕   Predicted ⊖
Actual ⊕          TP            FN        Pos
Actual ⊖          FP            TN        Neg
                TP + FP       FN + TN    |Te|

contains 9 values, but some of them depend on others: the marginal sums are determined by the rows and columns, respectively. In fact, only 4 values are needed to determine the rest, so we say that this table has 4 degrees of freedom. In general, a table with (k + 1)² entries has k² degrees of freedom.

In the following, we assume that Pos, Neg, TP and FP are known; these suffice to reconstruct the whole table.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 80 / 291
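A small sketch of this reconstruction (names are illustrative): given Pos, Neg, TP and FP, the remaining five entries follow.

```python
# Sketch: with four degrees of freedom, (Pos, Neg, TP, FP) determine the whole table.
def full_table(pos_count, neg_count, tp, fp):
    fn, tn = pos_count - tp, neg_count - fp
    return [[tp, fn, pos_count],
            [fp, tn, neg_count],
            [tp + fp, fn + tn, pos_count + neg_count]]

for row in full_table(50, 50, 30, 10):  # recovers Table 2.2 (left)
    print(row)
```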


2. Binary classification and related tasks 2.1 Classification

Figure 2.2, p.58 A coverage plot


Let there be classifiers C1, C2 and C3.

[Coverage plots: each classifier Ci is a point (FPi, TPi); the x-axis counts negatives from 0 to Neg, the y-axis counts positives from 0 to Pos. Left panel: C1 and C2. Right panel: C3.]

(left) A coverage plot depicting the two contingency tables in Table 2.2. The plot is
square because the class distribution is uniform. From the plot we immediately see that
C1 is better than C2. (right) Coverage plot for Example 2.1, with a class ratio clr = 3 .

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 81 / 291


2. Binary classification and related tasks 2.1 Classification

Figure 2.3, p.59 An ROC plot


[Plots: (left) coverage plot with classifiers C1, C2, C3 as points (FPi, TPi) in absolute counts; (right) the same classifiers as points (fpri, tpri) in the ROC unit square.]

(left) C1 and C3 dominate C2, but neither of them dominates the other. The line with slope 1 indicates that all classifiers on it achieve equal accuracy. (right) Receiver Operating Characteristic (ROC) plot: a merger of the two coverage plots in Figure 2.2, employing normalisation to deal with the different class distributions. Here the line with slope 1 indicates that all classifiers on it have the same average recall (the average of positive and negative recall).
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 82 / 291
2. Binary classification and related tasks 2.1 Classification

Figure 2.4, p.61 Comparing coverage and ROC plots


[Plots: (left) coverage plot with C1, C2, C3 and isometrics drawn in absolute counts; (right) the corresponding ROC plot with the same classifiers and isometrics.]

(left) In a coverage plot, accuracy isometrics have a slope of 1, and average recall
isometrics are parallel to the ascending diagonal. (right) In the corresponding ROC plot,
average recall isometrics have a slope of 1; the accuracy isometric here has a slope of
3, corresponding to the ratio of negatives to positives in the data set.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 83 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

What’s next?

2 Binary classification and related tasks


Classification
Assessing classification performance
Visualising classification performance
Scoring and ranking
Assessing and visualising ranking performance
Tuning rankers
Class probability estimation
Assessing class probability estimates

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 84 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Scoring classifier

A scoring classifier is a mapping ŝ : X → R^k, i.e., a mapping from the instance space to a k-vector of real numbers.

The boldface notation indicates that a scoring classifier outputs a vector ŝ(x) = (ŝ1(x), ..., ŝk(x)) rather than a single number; ŝi(x) is the score assigned to class Ci for instance x. This score indicates how likely it is that class label Ci applies.

If we only have two classes, it usually suffices to consider the score for only one of the classes; in that case, we use ŝ(x) to denote the score of the positive class for instance x.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 85 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Figure 2.5, p.62 A scoring tree

(left, as a text rendering)
'Viagra' = 1              → spam: 20, ham: 5
'Viagra' = 0, 'lottery' = 0 → spam: 20, ham: 40
'Viagra' = 0, 'lottery' = 1 → spam: 10, ham: 5

(right, with scores)
'Viagra' = 1              → ŝ(x) = +2
'Viagra' = 0, 'lottery' = 0 → ŝ(x) = −1
'Viagra' = 0, 'lottery' = 1 → ŝ(x) = +1

(left) A feature tree with training set class distribution in the leaves. (right) A scoring tree using the logarithm of the class ratio as scores; spam is taken as the positive class.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 86 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Margins and loss functions

If we take the true class c(x) as +1 for positive examples and −1 for negative examples, then the quantity z(x) = c(x)·ŝ(x) is positive for correct predictions and negative for incorrect predictions: this quantity is called the margin assigned by the scoring classifier to the example.

We would like to reward large positive margins and penalise large negative ones. This is achieved by means of a so-called loss function L : R → [0, ∞), which maps each example's margin z(x) to an associated loss L(z(x)).

We will assume that L(0) = 1, which is the loss incurred by having an example on the decision boundary. We furthermore have L(z) ≥ 1 for z < 0, and usually also 0 ≤ L(z) < 1 for z > 0 (Figure 2.6).

The average loss over a test set Te is (1/|Te|) Σ_{x∈Te} L(z(x)).

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 87 / 291
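A minimal sketch of margins and average loss, assuming the ±1 class encoding above and an arbitrary loss function passed as a parameter; data and names are illustrative:

```python
# Sketch: margins z(x) = c(x)·ŝ(x) and average loss over a test set.

def average_loss(classes, scores, loss):
    """classes: +1/-1 true labels; scores: real-valued ŝ(x); loss: a function L(z)."""
    margins = [c * s for c, s in zip(classes, scores)]
    return sum(loss(z) for z in margins) / len(margins)

zero_one = lambda z: 1.0 if z <= 0 else 0.0
# Margins are +2, -1, -0.5: two examples misclassified, so average 0-1 loss is 2/3.
print(average_loss([+1, -1, +1], [2.0, 1.0, -0.5], zero_one))
```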


2. Binary classification and related tasks 2.2 Scoring and ranking

Figure 2.6, p.63 Loss functions

[Figure: the loss functions below, plotted as L(z) against the margin z over the range −2 ≤ z ≤ 2.]

From bottom-left:
(i) 0–1 loss: L01(z) = 1 if z ≤ 0, and L01(z) = 0 if z > 0;
(ii) hinge loss: Lh(z) = 1 − z if z ≤ 1, and Lh(z) = 0 if z > 1;
(iii) logistic loss: Llog(z) = log2(1 + exp(−z));
(iv) exponential loss: Lexp(z) = exp(−z).

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 88 / 291
2. Binary classification and related tasks 2.2 Scoring and ranking

Assessing and visualising ranking performance

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 89 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Example 2.2, p.64 Ranking example

• The scoring tree in Figure 2.5 produces the following ranking: [20+, 5−][10+, 5−][20+, 40−]. Here, 20+ denotes a sequence of 20 positive examples, and instances in square brackets [...] are tied.
• By selecting a split point in the ranking we can turn the ranking into a classification. In this case there are four possibilities:
  (A) setting the split point before the first segment, and thus assigning all segments to the negative class;
  (B) assigning the first segment to the positive class, and the other two to the negative class;
  (C) assigning the first two segments to the positive class; and
  (D) assigning all segments to the positive class.
• In terms of actual scores, this corresponds to (A) choosing any score larger than 2 as the threshold; (B) choosing a threshold between 1 and 2; (C) setting the threshold between −1 and 1; and (D) setting it lower than −1.
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 90 / 291
2. Binary classification and related tasks 2.2 Scoring and ranking

Example 2.3, p.65 Ranking accuracy

The ranking error rate is defined as

rank-err = Σ_{x∈Te⊕, x′∈Te⊖} ( I[ŝ(x) < ŝ(x′)] + (1/2)·I[ŝ(x) = ŝ(x′)] ) / (Pos · Neg)

• The 5 negatives in the right leaf are scored higher than the 10 positives in the middle leaf and the 20 positives in the left leaf, resulting in 50 + 100 = 150 ranking errors.
• The 5 negatives in the middle leaf are scored higher than the 20 positives in the left leaf, giving a further 100 ranking errors.
• In addition, the left leaf makes 800 half ranking errors (because 20 positives and 40 negatives get the same score), the middle leaf 50 and the right leaf 100.
• In total we have 725 ranking errors out of a possible 50 · 50 = 2500, corresponding to a ranking error rate of 29% or a ranking accuracy of 71%.
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 91 / 291
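A sketch that reproduces the 29% figure by brute-force pair counting (quadratic in the test set size, which is fine for this example; the function name is mine):

```python
# Sketch: ranking error rate by counting misranked and tied positive-negative pairs.
# The scores reproduce the three leaves of the scoring tree in Figure 2.5.

def rank_error_rate(pos_scores, neg_scores):
    errors = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp < sn:
                errors += 1.0   # full ranking error
            elif sp == sn:
                errors += 0.5   # half error for a tie
    return errors / (len(pos_scores) * len(neg_scores))

pos = [+2] * 20 + [+1] * 10 + [-1] * 20   # 50 positives
neg = [+2] * 5 + [+1] * 5 + [-1] * 40     # 50 negatives
print(rank_error_rate(pos, neg))          # 725 / 2500 = 0.29
```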
2. Binary classification and related tasks 2.2 Scoring and ranking

Figure 2.7, p.66 Coverage curve

[Plots: (left) a Pos × Neg grid of positive–negative pairs, with positives sorted on decreasing score along one axis and negatives along the other; (right) the coverage curve through the labelled points A, B, C and D, one per possible threshold.]

(left) Each cell in the grid denotes a unique pair of one positive and one negative example: the green cells indicate pairs that are correctly ranked by the classifier, the red cells represent ranking errors, and the orange cells are half-errors due to ties. (right) The coverage curve of a tree-based scoring classifier has one line segment for each leaf of the tree, and one (FP, TP) pair for each possible threshold on the score.
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 92 / 291
2. Binary classification and related tasks 2.2 Scoring and ranking

Important point to remember

The ROC curve is obtained from the coverage curve by normalising both axes to the range [0, 1].
The area under the ROC curve equals the ranking accuracy.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 93 / 291
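Equivalently, the area can be computed directly from the per-leaf segments of the coverage curve with the trapezoidal rule. A sketch, assuming the segments are given in decreasing score order (function name is mine):

```python
# Sketch: AUC as the area under the axis-normalised coverage curve,
# accumulated one trapezoid per leaf segment.

def auc_from_segments(segments, pos_total, neg_total):
    """segments: (positives, negatives) per leaf, in decreasing score order."""
    area, tp = 0.0, 0
    for p, n in segments:
        # Each segment adds a trapezoid of width n/Neg and mean height (tp + p/2)/Pos.
        area += (n / neg_total) * (tp + p / 2) / pos_total
        tp += p
    return area

# The scoring tree of Figure 2.5: AUC = 0.71 = 1 - 0.29, matching Example 2.3.
print(auc_from_segments([(20, 5), (10, 5), (20, 40)], 50, 50))
```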


2. Binary classification and related tasks 2.2 Scoring and ranking

Example 2.4, p.67 Class imbalance

• Suppose we feed the scoring tree in Figure 2.5 an extended test set, with an additional batch of 50 negatives.
• The added negatives happen to be identical to the original ones, so the net effect is that the number of negatives in each leaf doubles.
• As a result the coverage curve changes (because the class ratio changes), but the ROC curve stays the same (Figure 2.8).
• Note that the ranking accuracy stays the same as well: while the classifier makes twice as many ranking errors, there are also twice as many positive–negative pairs, so the ranking error rate doesn't change.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 94 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Figure 2.8, p.67 Class imbalance


[Plots: (left) coverage curve on the extended test set, in absolute counts; (right) the corresponding ROC curve with both axes normalised to [0, 1].]

(left) A coverage curve obtained from a test set with class ratio clr = 1/2 . (right) The
corresponding (axis-normalised) ROC curve is the same as the one corresponding to
the coverage curve in Figure 2.7 (right). The ranking accuracy is the Area Under the
ROC Curve (AUC).

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 95 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Rankings from grading classifiers

Figure 2.9 (left) shows a linear classifier (the decision boundary is denoted B)
applied to a small data set of five positive and five negative examples, achieving
an accuracy of 0.80.

We can derive a score from this linear classifier by taking the distance of an
example from the decision boundary; if the example is on the negative side we
take the negative distance. This means that the examples are ranked in the
following order: p1 – p2 – p3 – n1 – p4 – n2 – n3 – p5 – n4 – n5.

This ranking incurs four ranking errors: n1 before p4, and n1, n2 and n3 before
p5. Figure 2.9 (right) visualises these four ranking errors in the top-left corner.
The AUC of this ranking is 21/25 = 0.84.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 96 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Figure 2.9, p.68 Rankings from grading classifiers

[Figure: (left) scatter plot of five positives (p1–p5) and five negatives (n1–n5) with weight vector w and three parallel decision boundaries A, B and C; (right) the 5 × 5 grid of positive–negative pairs, with positives p1–p5 on one axis and negatives n1–n5 on the other.]

(left) A linear classifier induces a ranking by taking the signed distance to the decision
boundary as the score. This ranking only depends on the orientation of the decision
boundary: the three lines result in exactly the same ranking. (right) The grid of correctly
ranked positive–negative pairs (in green) and ranking errors (in red).

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 97 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Tuning rankers

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 98 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Example 2.5, p.70 Tuning your spam filter I

You have carefully trained your Bayesian spam filter, and all that remains is
setting the decision threshold. You select a set of six spam and four ham e-mails
and collect the scores assigned by the spam filter. Sorted on decreasing score
these are 0.89 (spam), 0.80 (spam), 0.74 (ham), 0.71 (spam), 0.63 (spam), 0.49
(ham), 0.42 (spam), 0.32 (spam), 0.24 (ham), and 0.13 (ham).

If the class ratio of 6 spam against 4 ham is representative, you can select the
optimal point on the ROC curve using an isometric with slope 4/6. As can be
seen in Figure 2.11, this leads to putting the decision boundary between the
sixth spam e-mail and the third ham e-mail, and we can take the average of their
scores as the decision threshold (0.28).

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 99 / 291


2. Binary classification and related tasks 2.2 Scoring and ranking

Example 2.5, p.70 Tuning your spam filter II

An alternative way of finding the optimal point is to iterate over all possible split points – from before the top-ranked e-mail to after the bottom one – and calculate the number of correctly classified examples at each split: 4 – 5 – 6 – 5 – 6 – 7 – 6 – 7 – 8 – 7 – 6. The maximum is achieved at the same split point, yielding an accuracy of 0.80.

A useful trick to find out which accuracy an isometric in an ROC plot represents
is to intersect the isometric with the descending diagonal. Since accuracy is a
weighted average of the true positive and true negative rates, and since these
are the same in a point on the descending diagonal, we can read off the
corresponding accuracy value on the y -axis.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 100 / 291
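A sketch reproducing this calculation (it assumes, as holds here, that the best split is not at either extreme of the ranking):

```python
# Sketch of Example 2.5: iterate over all split points in the ranking
# and count correctly classified e-mails at each split.

scores = [0.89, 0.80, 0.74, 0.71, 0.63, 0.49, 0.42, 0.32, 0.24, 0.13]
labels = ["spam", "spam", "ham", "spam", "spam", "ham", "spam", "spam", "ham", "ham"]

correct = []
for split in range(len(scores) + 1):   # split i: top i e-mails predicted spam
    hits = sum(1 for l in labels[:split] if l == "spam")
    hits += sum(1 for l in labels[split:] if l == "ham")
    correct.append(hits)
print(correct)                          # [4, 5, 6, 5, 6, 7, 6, 7, 8, 7, 6]

best = correct.index(max(correct))      # 8 correct at split 8
threshold = (scores[best - 1] + scores[best]) / 2
print(threshold)                        # (0.32 + 0.24) / 2 = 0.28
```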


2. Binary classification and related tasks 2.2 Scoring and ranking

Figure 2.11, p.71 Finding the optimal point

[Figure: ROC plot for the ten e-mails, with true positive rate against false positive rate, two dotted accuracy isometrics and the descending diagonal.]

Selecting the optimal point on an ROC curve. The top dotted line is the accuracy
isometric, with a slope of 2/3. The lower isometric doubles the value (or prevalence) of
negatives, and allows a choice of thresholds. By intersecting the isometrics with the
descending diagonal we can read off the achieved accuracy on the y -axis.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 101 / 291


2. Binary classification and related tasks 2.3 Class probability estimation

What’s next?

2 Binary classification and related tasks


Classification
Assessing classification performance
Visualising classification performance
Scoring and ranking
Assessing and visualising ranking performance
Tuning rankers
Class probability estimation
Assessing class probability estimates

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 102 / 291


2. Binary classification and related tasks 2.3 Class probability estimation

Class probability estimation

A class probability estimator – or probability estimator in short – is a scoring classifier that outputs probability vectors over classes, i.e., a mapping p̂ : X → [0, 1]^k. We write p̂(x) = (p̂1(x), ..., p̂k(x)), where p̂i(x) is the probability assigned to class Ci for instance x, and Σ_{i=1}^{k} p̂i(x) = 1.

If we have only two classes, the probability associated with one class is 1 minus the probability of the other class; in that case, we use p̂(x) to denote the estimated probability of the positive class for instance x.

As with scoring classifiers, we usually do not have direct access to the true probabilities pi(x).

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 103 / 291


2. Binary classification and related tasks 2.3 Class probability estimation

Figure 2.12, p.73 Probability estimation tree

(as a text rendering)
'Viagra' = 1              → p̂(x) = 0.80
'Viagra' = 0, 'lottery' = 0 → p̂(x) = 0.33
'Viagra' = 0, 'lottery' = 1 → p̂(x) = 0.67

A probability estimation tree derived from the feature tree in Figure 1.4.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 104 / 291


2. Binary classification and related tasks 2.3 Class probability estimation

Assessing class probability estimates

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 105 / 291


2. Binary classification and related tasks 2.3 Class probability estimation

Mean squared probability error


We can define the squared error (SE) of the predicted probability vector p̂(x) = (p̂1(x), ..., p̂k(x)) as

SE(x) = (1/2) Σ_{i=1}^{k} (p̂i(x) − I[c(x) = Ci])²

and the mean squared error (MSE) as the average squared error over all instances in the test set:

MSE(Te) = (1/|Te|) Σ_{x∈Te} SE(x)

The factor 1/2 in Equation 2.6 ensures that the squared error per example is normalised between 0 and 1: the worst possible situation is that a wrong class is predicted with probability 1, which means two 'bits' are wrong. For two classes this reduces to a single term (p̂(x) − I[c(x) = ⊕])², referring only to the positive class.
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 106 / 291
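A direct transcription of SE into code, checked against the numbers in Example 2.6 on the next slide (function name is mine):

```python
# Sketch: squared error of a predicted probability vector (Example 2.6 data).

def squared_error(p_hat, true_class):
    """p_hat: predicted probability vector; true_class: index of the actual class."""
    return 0.5 * sum((p - (1.0 if i == true_class else 0.0)) ** 2
                     for i, p in enumerate(p_hat))

print(squared_error([0.70, 0.10, 0.20], 0))  # 0.07
print(squared_error([0.99, 0.00, 0.01], 0))  # 0.0001
print(squared_error([0.70, 0.10, 0.20], 2))  # 0.57
print(squared_error([0.99, 0.00, 0.01], 2))  # 0.98
```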
2. Binary classification and related tasks 2.3 Class probability estimation

Example 2.6, p.74 Squared error

Suppose one model predicts (0.70, 0.10, 0.20) for a particular example x in a three-class task, while another appears much more certain by predicting (0.99, 0, 0.01).

• If the first class is the actual class, the second prediction is clearly better than the first: the SE of the first prediction is ((0.70 − 1)² + (0.10 − 0)² + (0.20 − 0)²)/2 = 0.07, while for the second prediction it is ((0.99 − 1)² + (0 − 0)² + (0.01 − 0)²)/2 = 0.0001. The first model gets punished more because, although mostly right, it isn't quite sure of it.
• However, if the third class is the actual class, the situation is reversed: now the SE of the first prediction is ((0.70 − 0)² + (0.10 − 0)² + (0.20 − 1)²)/2 = 0.57, and of the second ((0.99 − 0)² + (0 − 0)² + (0.01 − 1)²)/2 = 0.98. The second model gets punished more for not just being wrong, but being so confident about it.

cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 107 / 291
2. Binary classification and related tasks 2.3 Class probability estimation

Which probabilities achieve lowest MSE ?


Returning to the probability estimation tree in Figure 2.12, we calculate the squared error per leaf as follows (left to right):

SE1 = 20(0.33 − 1)² + 40(0.33 − 0)² = 13.33
SE2 = 10(0.67 − 1)² + 5(0.67 − 0)² = 3.33
SE3 = 20(0.80 − 1)² + 5(0.80 − 0)² = 4.00

which leads to a mean squared error of MSE = (1/100)(SE1 + SE2 + SE3) = 0.21.

Changing the predicted probabilities in the left-most leaf to 0.40 for spam and 0.60 for ham, or to 0.20 for spam and 0.80 for ham, results in a higher squared error:

SE′1 = 20(0.40 − 1)² + 40(0.40 − 0)² = 13.6
SE″1 = 20(0.20 − 1)² + 40(0.20 − 0)² = 14.4

Predicting the probabilities obtained from the class distributions in each leaf is therefore optimal.
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 108 / 291
2. Binary classification and related tasks 2.3 Class probability estimation

Why predicting empirical probabilities is optimal

The reason for this becomes obvious if we rewrite the expression for the two-class squared error of a leaf as follows, using the notation n⊕ and n⊖ for the numbers of positive and negative examples in the leaf:

n⊕(p̂ − 1)² + n⊖p̂² = (n⊕ + n⊖)p̂² − 2n⊕p̂ + n⊕
                    = (n⊕ + n⊖)[p̂² − 2ṗp̂ + ṗ]
                    = (n⊕ + n⊖)[(p̂ − ṗ)² + ṗ(1 − ṗ)]

where ṗ = n⊕/(n⊕ + n⊖) is the relative frequency of the positive class among the examples covered by the leaf, also called the empirical probability. As the term ṗ(1 − ṗ) does not depend on the predicted probability p̂, we see immediately that we achieve the lowest squared error in the leaf if we assign p̂ = ṗ.
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 109 / 291
2. Binary classification and related tasks 2.3 Class probability estimation

Smoothing empirical probabilities

It is almost always a good idea to smooth these relative frequencies. The most common way to do this is by means of the Laplace correction:

ṗi(S) = (ni + 1) / (|S| + k)

In effect, we are adding uniformly distributed pseudo-counts to each of the k alternatives, reflecting our prior belief that the empirical probabilities will turn out uniform.

We can also apply non-uniform smoothing by setting

ṗi(S) = (ni + m·πi) / (|S| + m)

This smoothing technique, known as the m-estimate, allows the choice of the number of pseudo-counts m as well as the prior probabilities πi. The Laplace correction is a special case of the m-estimate with m = k and πi = 1/k.
cs.bris.ac.uk/ flach/mlbook/ Machine Learning: Making Sense of Data 110 / 291
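Both corrections in a few lines of illustrative Python (the function names are mine):

```python
# Sketch of the Laplace correction and the more general m-estimate.

def m_estimate(counts, m, priors):
    """counts: per-class counts in a segment S; priors: prior probabilities pi_i."""
    total = sum(counts)
    return [(n + m * pi) / (total + m) for n, pi in zip(counts, priors)]

def laplace(counts):
    k = len(counts)
    return m_estimate(counts, m=k, priors=[1.0 / k] * k)

# Left-most leaf of Figure 2.12: 20 spam, 40 ham.
print(laplace([20, 40]))                              # ~[0.339, 0.661] instead of [0.33, 0.67]
print(m_estimate([20, 40], m=10, priors=[0.5, 0.5]))  # stronger pull towards uniform
```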
