What's next?: Binary Classification and Related Tasks

Classification
We use the 'hat' to indicate that ĉ(x) is an estimate of the true but unknown
function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X
is an instance and c(x) is the true class of the instance (sometimes
contaminated by noise).
[Figure: a feature tree splitting on 'Viagra' and 'lottery'. Leaf class distributions and majority-class predictions:
'Viagra' = 1 — spam: 20, ham: 5, ĉ(x) = spam;
'Viagra' = 0, 'lottery' = 0 — spam: 20, ham: 40, ĉ(x) = ham;
'Viagra' = 0, 'lottery' = 1 — spam: 10, ham: 5, ĉ(x) = spam.]
(left) A feature tree with training set class distribution in the leaves. (right) A decision
tree obtained using the majority class decision rule.
(left)
              Predicted ⊕   Predicted ⊖
Actual ⊕           30            20        50
Actual ⊖           10            40        50
                   40            60       100

(right)
              Predicted ⊕   Predicted ⊖
Actual ⊕           20            30        50
Actual ⊖           20            30        50
                   40            60       100
(left) A two-class contingency table or confusion matrix depicting the performance of the
decision tree in Figure 2.1. Numbers on the descending diagonal indicate correct
predictions, while the ascending diagonal concerns prediction errors. (right) A
contingency table with the same marginals but independent rows and columns.
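To see how the right-hand table arises, each of its cells can be computed from the marginals alone; a minimal sketch (variable names are ours):

```python
# Expected cell counts when rows and columns are independent:
# each cell equals (row total * column total) / grand total.
row_totals = {"actual +": 50, "actual -": 50}
col_totals = {"pred +": 40, "pred -": 60}
grand_total = 100

for row, r in row_totals.items():
    for col, c in col_totals.items():
        print(f"{row}, {col}: {r * c / grand_total:.0f}")
# actual +, pred +: 20   actual +, pred -: 30
# actual -, pred +: 20   actual -, pred -: 30
```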
              Predicted ⊕   Predicted ⊖
Actual ⊕          TP             FN          Pos
Actual ⊖          FP             TN          Neg
                TP+FP          FN+TN     Pos+Neg
where the abbreviations stand for:

- TP – true positives
- FP – false positives
- FN – false negatives
- TN – true negatives
- Pos – number of positive examples
- Neg – number of negative examples
              Predicted ⊕   Predicted ⊖
Actual ⊕           60            15        75
Actual ⊖           10            15        25
                   70            30       100
From this table, we see that the true positive rate is tpr = 60/75 = 0.80 and
the true negative rate is tnr = 15/25 = 0.60. The overall accuracy is
acc = (60 + 15)/100 = 0.75, which is no longer the average of the true
positive and true negative rates. However, taking into account the proportion of
positives pos = 0.75 and the proportion of negatives neg = 1 − pos = 0.25, we see that

acc = pos · tpr + neg · tnr = 0.75 · 0.80 + 0.25 · 0.60 = 0.75.
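A small sketch that reproduces these numbers from the contingency table above (variable names are ours):

```python
# Contingency table from Example 2.1
TP, FN = 60, 15   # actual positives
FP, TN = 10, 15   # actual negatives

Pos, Neg = TP + FN, FP + TN           # 75, 25
tpr = TP / Pos                        # 0.80
tnr = TN / Neg                        # 0.60
acc = (TP + TN) / (Pos + Neg)         # 0.75

# Accuracy is the class-proportion-weighted average of tpr and tnr.
pos, neg = Pos / (Pos + Neg), Neg / (Pos + Neg)   # 0.75, 0.25
assert abs(acc - (pos * tpr + neg * tnr)) < 1e-12
```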
Degrees of freedom
The two-class contingency table together with its marginals,

              Predicted ⊕   Predicted ⊖
Actual ⊕          TP             FN          Pos
Actual ⊖          FP             TN          Neg
                TP+FP          FN+TN     Pos+Neg

contains 9 values, but some of them depend on the others: the marginal sums are
determined by the rows and columns, respectively. In fact only 4 values are needed
to determine the rest, so we say that this table has 4 degrees of freedom. In general,
a table with (k + 1)² entries (where k is the number of classes) has k² degrees of freedom.
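For instance, the four counts TP, FN, FP and TN already fix every other entry; a minimal sketch using the numbers from Example 2.1:

```python
# Four counts fix the whole table: all marginals follow from them.
TP, FN, FP, TN = 60, 15, 10, 15

Pos, Neg = TP + FN, FP + TN            # row marginals
PredPos, PredNeg = TP + FP, FN + TN    # column marginals
Total = Pos + Neg                      # grand total
print(Pos, Neg, PredPos, PredNeg, Total)   # 75 25 70 30 100
```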
[Coverage plots: false positives on the x-axis (Negatives), true positives on the y-axis (Positives); classifiers C1, C2 and C3 shown as points.]
(left) A coverage plot depicting the two contingency tables in Table 2.2. The plot is
square because the class distribution is uniform. From the plot we immediately see that
C1 is better than C2. (right) Coverage plot for Example 2.1, with a class ratio clr = 3.
[Left: coverage plot with classifiers C1, C2 and C3 (false positives against true positives). Right: the corresponding ROC plot, with false positive rate on the x-axis and true positive rate on the y-axis.]
(left) C1 and C3 dominate C2, but neither dominates the other. The diagonal line
with slope 1 indicates that all classifiers on this line achieve equal accuracy.
(right) Receiver Operating Characteristic (ROC) plot: a merger of the two coverage plots
in Figure 2.2, employing normalisation to deal with the different class distributions. The
diagonal line with slope 1 indicates that all classifiers on this line have the
same average recall (average of positive and negative recalls).
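The conversion from coverage space to ROC space is just a division by the class totals. The sketch below uses made-up counts for C1, C2 and C3 (not the ones behind the figures) to show the normalisation and a dominance check:

```python
# Coverage space: a classifier is a point (FP, TP); ROC space: (fpr, tpr).
# The counts below are illustrative only.
Pos, Neg = 50, 100
classifiers = {"C1": (10, 40), "C2": (30, 40), "C3": (20, 45)}   # (FP, TP)

# Normalise each axis by the corresponding class total.
roc = {name: (FP / Neg, TP / Pos) for name, (FP, TP) in classifiers.items()}
print(roc)

def dominates(a, b):
    """a dominates b if it has no more false positives and no fewer true positives."""
    (fp_a, tp_a), (fp_b, tp_b) = a, b
    return fp_a <= fp_b and tp_a >= tp_b and a != b

print(dominates(classifiers["C1"], classifiers["C2"]))   # True
print(dominates(classifiers["C1"], classifiers["C3"]),
      dominates(classifiers["C3"], classifiers["C1"]))    # False False: neither dominates
```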
(left) In a coverage plot, accuracy isometrics have a slope of 1, and average recall
isometrics are parallel to the ascending diagonal. (right) In the corresponding ROC plot,
average recall isometrics have a slope of 1; the accuracy isometric here has a slope of
3, corresponding to the ratio of negatives to positives in the data set.
What's next?

Scoring classifier

A scoring classifier outputs a vector of scores ŝ(x) = (ŝ₁(x), . . . , ŝₖ(x)) rather than
a single number; ŝᵢ(x) is the score assigned to class Cᵢ for instance x.
This score indicates how likely it is that class label Cᵢ applies.
If we only have two classes, it usually suffices to consider the score for only one
of the classes; in that case, we use ŝ(x) to denote the score of the positive
class for instance x.
[Figure: the same feature tree with scores in the leaves:
'Viagra' = 1 — spam: 20, ham: 5, ŝ(x) = +2;
'Viagra' = 0, 'lottery' = 0 — spam: 20, ham: 40, ŝ(x) = −1;
'Viagra' = 0, 'lottery' = 1 — spam: 10, ham: 5, ŝ(x) = +1.]
(left) A feature tree with training set class distribution in the leaves. (right) A scoring tree
using the logarithm of the class ratio as scores; spam is taken as the positive class.
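The scores in the right-hand tree follow directly from the leaf distributions. A minimal sketch, assuming base-2 logarithms (consistent with the values +2, −1 and +1 above):

```python
from math import log2

# (spam, ham) counts in the three leaves of the feature tree.
leaves = {
    "Viagra=1": (20, 5),
    "Viagra=0, lottery=0": (20, 40),
    "Viagra=0, lottery=1": (10, 5),
}

# Score = logarithm of the class ratio, with spam as the positive class.
for leaf, (spam, ham) in leaves.items():
    print(leaf, log2(spam / ham))
# Viagra=1: 2.0   Viagra=0, lottery=0: -1.0   Viagra=0, lottery=1: 1.0
```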
If we take the true class c(x) as +1 for positive examples and −1 for negative
examples, then the quantity z(x) = c(x)·ŝ(x) is positive for correct predictions
and negative for incorrect predictions: this quantity is called the margin assigned
by the scoring classifier to the example.
We would like to reward large positive margins and penalise large negative ones.
This is achieved by means of a so-called loss function L: ℝ → [0, ∞) which maps
each example's margin z(x) to an associated loss L(z(x)).
We will assume that L(0) = 1, which is the loss incurred by having an example
on the decision boundary. We furthermore have L(z) ≥ 1 for z < 0, and usually
also 0 ≤ L(z) < 1 for z > 0 (Figure 2.6).
The average loss over a test set Te is (1/|Te|) Σ_{x∈Te} L(z(x)).
[Figure 2.6: loss functions L(z) plotted against the margin z.]
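A few loss functions that satisfy these conditions, written out as a sketch; the particular selection is illustrative rather than a reproduction of Figure 2.6:

```python
from math import exp, log2

# Loss functions of the margin z = c(x) * s_hat(x); all satisfy L(0) = 1.
def zero_one_loss(z):
    """1 for a wrong or boundary prediction, 0 for a correct one."""
    return 1.0 if z <= 0 else 0.0

def hinge_loss(z):
    """Linear penalty for margins below 1, zero beyond that."""
    return max(0.0, 1.0 - z)

def log_loss(z):
    """Smooth, always positive; base-2 logarithm so that L(0) = 1."""
    return log2(1 + exp(-z))

for z in (-2, 0, 2):
    print(z, zero_one_loss(z), hinge_loss(z), log_loss(z))
```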
[Figure: left — the grid of positive–negative pairs; right — the coverage curve of the scoring tree, with positives sorted on decreasing score on the y-axis and negatives on the x-axis.]
(left) Each cell in the grid denotes a unique pair of one positive and one negative
example: the green cells indicate pairs that are correctly ranked by the classifier, the red
cells represent ranking errors, and the orange cells are half-errors due to ties. (right)
The coverage curve of a tree-based scoring classifier has one line segment for each leaf
of the tree, and one (FP, TP) pair for each possible threshold on the score.
The ROC curve is obtained from the coverage curve by normalising both axes to the
range [0, 1]. The area under the ROC curve is the ranking accuracy.
- Suppose we feed the scoring tree in Figure 2.5 an extended test set, with an additional batch of 50 negatives.
- The added negatives happen to be identical to the original ones, so the net effect is that the number of negatives in each leaf doubles.
- As a result the coverage curve changes (because the class ratio changes), but the ROC curve stays the same (Figure 2.8).
- Note that the ranking accuracy stays the same as well: while the classifier makes twice as many ranking errors, there are also twice as many positive–negative pairs, so the ranking error rate doesn't change.
(left) A coverage curve obtained from a test set with class ratio clr = 1/2 . (right) The
corresponding (axis-normalised) ROC curve is the same as the one corresponding to
the coverage curve in Figure 2.7 (right). The ranking accuracy is the Area Under the
ROC Curve (AUC).
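Ranking accuracy counts the fraction of correctly ranked positive–negative pairs, with ties counted as half. The sketch below, with illustrative scores, shows that duplicating every negative leaves it unchanged:

```python
def ranking_accuracy(pos_scores, neg_scores):
    """Fraction of correctly ranked positive-negative pairs; ties count as half."""
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos_scores for n in neg_scores)
    return correct / (len(pos_scores) * len(neg_scores))

pos = [2, 2, 1, -1]    # illustrative scores of positive test examples
neg = [1, -1, -1]      # illustrative scores of negative test examples

print(ranking_accuracy(pos, neg))       # unchanged when ...
print(ranking_accuracy(pos, neg * 2))   # ... every negative is duplicated
```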
Figure 2.9 (left) shows a linear classifier (the decision boundary is denoted B)
applied to a small data set of five positive and five negative examples, achieving
an accuracy of 0.80.
We can derive a score from this linear classifier by taking the distance of an
example from the decision boundary; if the example is on the negative side we
take the negative distance. This means that the examples are ranked in the
following order: p1 – p2 – p3 – n1 – p4 – n2 – n3 – p5 – n4 – n5.
This ranking incurs four ranking errors: n1 before p4, and n1, n2 and n3 before
p5. Figure 2.9 (right) visualises these four ranking errors in the top-left corner.
The AUC of this ranking is 21/25 = 0.84.
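The same pair counting applied to the ranking above reproduces the figure of 21/25; a minimal sketch:

```python
# Ranking induced by the linear classifier, from highest to lowest score.
ranking = ["p1", "p2", "p3", "n1", "p4", "n2", "n3", "p5", "n4", "n5"]

# A positive-negative pair is correctly ranked when the positive
# appears before the negative in the ranking.
correct = sum(1 for i, a in enumerate(ranking) for b in ranking[i + 1:]
              if a.startswith("p") and b.startswith("n"))

pos = sum(x.startswith("p") for x in ranking)   # 5 positives
neg = len(ranking) - pos                        # 5 negatives
print(correct, correct / (pos * neg))           # 21 0.84
```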
[Figure: left — a scatter plot of positives p1–p5 and negatives n1–n5 with weight vector w and three parallel decision boundaries A, B and C; right — the 5×5 grid of positive–negative pairs.]
(left) A linear classifier induces a ranking by taking the signed distance to the decision
boundary as the score. This ranking only depends on the orientation of the decision
boundary: the three lines result in exactly the same ranking. (right) The grid of correctly
ranked positive–negative pairs (in green) and ranking errors (in red).
Tuning rankers
You have carefully trained your Bayesian spam filter, and all that remains is
setting the decision threshold. You select a set of six spam and four ham e-mails
and collect the scores assigned by the spam filter. Sorted on decreasing score
these are 0.89 (spam), 0.80 (spam), 0.74 (ham), 0.71 (spam), 0.63 (spam), 0.49
(ham), 0.42 (spam), 0.32 (spam), 0.24 (ham), and 0.13 (ham).
If the class ratio of 6 spam against 4 ham is representative, you can select the
optimal point on the ROC curve using an isometric with slope 4/6. As can be
seen in Figure 2.11, this leads to putting the decision boundary between the
sixth spam e-mail and the third ham e-mail, and we can take the average of their
scores as the decision threshold (0.28).
An alternative way of finding the optimal point is to iterate over all possible split
points – from before the top-ranked e-mail to after the bottom one – and calculate
the number of correctly classified examples at each split: 4 – 5 – 6 – 5 – 6 – 7 – 6 – 7 – 8 – 7 – 6.
The maximum is achieved at the same split point, yielding an accuracy of 0.80.
A useful trick to find out which accuracy an isometric in an ROC plot represents
is to intersect the isometric with the descending diagonal. Since accuracy is a
weighted average of the true positive and true negative rates, and since these
are the same in a point on the descending diagonal, we can read off the
corresponding accuracy value on the y -axis.
Selecting the optimal point on an ROC curve. The top dotted line is the accuracy
isometric, with a slope of 2/3. The lower isometric doubles the value (or prevalence) of
negatives, and allows a choice of thresholds. By intersecting the isometrics with the
descending diagonal we can read off the achieved accuracy on the y -axis.
What's next?

Class probability estimation

A class probability estimator outputs, for each instance x, a vector of class probability
estimates p̂(x) = (p̂₁(x), . . . , p̂ₖ(x)). As with scoring classifiers, we usually do not have
direct access to the true probabilities pᵢ(x).
[Figure: the same feature tree with probability estimates in the leaves:
'Viagra' = 1 — p̂(x) = 0.80;
'Viagra' = 0, 'lottery' = 0 — p̂(x) = 0.33;
'Viagra' = 0, 'lottery' = 1 — p̂(x) = 0.67.]
A probability estimation tree derived from the feature tree in Figure 1.4.
and the mean squared error (MSE) as the average squared error over all
instances in the test set:
MSE(Te) = (1/|Te|) Σ_{x∈Te} SE(x)
The factor 1/2 in Equation 2.6 ensures that the squared error per example is
normalised between 0 and 1: the worst possible situation is that a wrong class is
predicted with probability 1, which means two ‘bits’ are wrong.
For two classes this reduces to a single term (p̂(x) − I[c(x) = ⊕])², referring only
to the positive class.
Suppose one model predicts (0.70, 0.10, 0.20) for a particular example x in a
three-class task, while another appears much more certain by predicting (0.99, 0, 0.01).

- If the first class is the actual class, the second prediction is clearly better than
the first: the SE of the first prediction is
((0.70 − 1)² + (0.10 − 0)² + (0.20 − 0)²)/2 = 0.07, while for the second prediction it is
((0.99 − 1)² + (0 − 0)² + (0.01 − 0)²)/2 = 0.0001. The first model gets punished more
because, although mostly right, it isn't quite sure of it.
- However, if the third class is the actual class, the situation is reversed: now the SE
of the first prediction is
((0.70 − 0)² + (0.10 − 0)² + (0.20 − 1)²)/2 = 0.57, and of the second
((0.99 − 0)² + (0 − 0)² + (0.01 − 1)²)/2 = 0.98. The second model gets punished more
for not just being wrong, but for being so confident in a wrong prediction.
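A small sketch of the per-example squared error of Equation 2.6, reproducing the four numbers above:

```python
def squared_error(p_hat, true_class):
    """Per-example squared error (Equation 2.6): half the squared distance
    between the predicted probability vector and the 0/1 target vector."""
    target = [1.0 if i == true_class else 0.0 for i in range(len(p_hat))]
    return sum((p - t) ** 2 for p, t in zip(p_hat, target)) / 2

model1 = (0.70, 0.10, 0.20)
model2 = (0.99, 0.00, 0.01)

print(squared_error(model1, 0), squared_error(model2, 0))   # ~0.07   ~0.0001
print(squared_error(model1, 2), squared_error(model2, 2))   # ~0.57   ~0.98
```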
The reason for this becomes obvious if we rewrite the two-class squared error of a
leaf as follows, using n⊕ and n⊖ for the numbers of positive and negative examples
in the leaf, and ṗ = n⊕/(n⊕ + n⊖) for the relative frequency of positives:

n⊕(p̂ − 1)² + n⊖p̂² = (n⊕ + n⊖)p̂² − 2n⊕p̂ + n⊕
                   = (n⊕ + n⊖)[p̂² − 2ṗp̂ + ṗ]
                   = (n⊕ + n⊖)[(p̂ − ṗ)² + ṗ(1 − ṗ)]

which is minimised by taking p̂ = ṗ.
It is almost always a good idea to smooth these relative frequencies. The most
common way to do this is by means of the Laplace correction:
ṗᵢ(S) = (nᵢ + 1) / (|S| + k)

where nᵢ is the number of examples of class Cᵢ in S and k is the number of classes.
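A minimal sketch of the Laplace correction applied to the leaf distributions of the feature tree (counts as in Figure 2.5, k = 2 classes):

```python
def laplace_corrected(counts):
    """Smoothed class probabilities: add one to each class count."""
    k = len(counts)
    total = sum(counts)
    return [(n + 1) / (total + k) for n in counts]

# (spam, ham) counts in the three leaves of the feature tree.
for leaf, counts in {"Viagra=1": (20, 5),
                     "Viagra=0, lottery=0": (20, 40),
                     "Viagra=0, lottery=1": (10, 5)}.items():
    print(leaf, laplace_corrected(counts))
# Viagra=1: ~[0.78, 0.22]; Viagra=0, lottery=0: ~[0.34, 0.66]; Viagra=0, lottery=1: ~[0.65, 0.35]
```

Compare these with the unsmoothed relative frequencies 0.80, 0.33 and 0.67 in the probability estimation tree above: the correction pulls each estimate slightly towards the uniform probability 1/k.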