What's next?: Binary Classification and Related Tasks

Classification
We use the 'hat' to indicate that ĉ(x) is an estimate of the true but unknown
function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X
is an instance and c(x) is the true class of the instance (sometimes
contaminated by noise).
[Figure: a feature tree splitting on 'Viagra' and 'lottery'. Leaf class distributions and majority-class predictions:
'Viagra' = 1 — spam: 20, ham: 5, ĉ(x) = spam;
'Viagra' = 0, 'lottery' = 0 — spam: 20, ham: 40, ĉ(x) = ham;
'Viagra' = 0, 'lottery' = 1 — spam: 10, ham: 5, ĉ(x) = spam.]
(left) A feature tree with training set class distribution in the leaves. (right) A decision
tree obtained using the majority class decision rule.
(left)
              Predicted ⊕   Predicted ⊖
Actual ⊕           30            20        50
Actual ⊖           10            40        50
                   40            60       100

(right)
              Predicted ⊕   Predicted ⊖
Actual ⊕           20            30        50
Actual ⊖           20            30        50
                   40            60       100
(left) A two-class contingency table or confusion matrix depicting the performance of the
decision tree in Figure 2.1. Numbers on the descending diagonal indicate correct
predictions, while the ascending diagonal concerns prediction errors. (right) A
contingency table with the same marginals but independent rows and columns.
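To see how the right-hand table arises, each of its cells can be computed from the marginals alone; a minimal sketch (variable names are ours):

```python
# Expected cell counts when rows and columns are independent:
# each cell equals (row total * column total) / grand total.
row_totals = {"actual +": 50, "actual -": 50}
col_totals = {"pred +": 40, "pred -": 60}
grand_total = 100

for row, r in row_totals.items():
    for col, c in col_totals.items():
        print(f"{row}, {col}: {r * c / grand_total:.0f}")
# actual +, pred +: 20   actual +, pred -: 30
# actual -, pred +: 20   actual -, pred -: 30
```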
              Predicted ⊕   Predicted ⊖
Actual ⊕          TP             FN          Pos
Actual ⊖          FP             TN          Neg
                TP+FP          FN+TN     Pos+Neg
where the abbreviations stand for:

- TP – true positives
- FP – false positives
- FN – false negatives
- TN – true negatives
- Pos – number of positive examples
- Neg – number of negative examples
              Predicted ⊕   Predicted ⊖
Actual ⊕           60            15        75
Actual ⊖           10            15        25
                   70            30       100
From this table, we see that the true positive rate is tpr = 60/75 = 0.80 and
the true negative rate is tnr = 15/25 = 0.60. The overall accuracy is
acc = (60 + 15)/100 = 0.75, which is no longer the average of the true
positive and true negative rates. However, taking into account the proportion of
positives pos = 0.75 and the proportion of negatives neg = 1 − pos = 0.25, we see that

acc = pos · tpr + neg · tnr = 0.75 · 0.80 + 0.25 · 0.60 = 0.75.
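A small sketch that reproduces these numbers from the contingency table above (variable names are ours):

```python
# Contingency table from Example 2.1
TP, FN = 60, 15   # actual positives
FP, TN = 10, 15   # actual negatives

Pos, Neg = TP + FN, FP + TN           # 75, 25
tpr = TP / Pos                        # 0.80
tnr = TN / Neg                        # 0.60
acc = (TP + TN) / (Pos + Neg)         # 0.75

# Accuracy is the class-proportion-weighted average of tpr and tnr.
pos, neg = Pos / (Pos + Neg), Neg / (Pos + Neg)   # 0.75, 0.25
assert abs(acc - (pos * tpr + neg * tnr)) < 1e-12
```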
Degrees of freedom
The two-class contingency table together with its marginals,

              Predicted ⊕   Predicted ⊖
Actual ⊕          TP             FN          Pos
Actual ⊖          FP             TN          Neg
                TP+FP          FN+TN     Pos+Neg

contains 9 values, but some of them depend on the others: the marginal sums are
determined by the rows and columns, respectively. In fact only 4 values are needed
to determine the rest, so we say that this table has 4 degrees of freedom. In general,
a table with (k + 1)² entries (where k is the number of classes) has k² degrees of freedom.
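For instance, the four counts TP, FN, FP and TN already fix every other entry; a minimal sketch using the numbers from Example 2.1:

```python
# Four counts fix the whole table: all marginals follow from them.
TP, FN, FP, TN = 60, 15, 10, 15

Pos, Neg = TP + FN, FP + TN            # row marginals
PredPos, PredNeg = TP + FP, FN + TN    # column marginals
Total = Pos + Neg                      # grand total
print(Pos, Neg, PredPos, PredNeg, Total)   # 75 25 70 30 100
```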
[Coverage plots: false positives on the x-axis (Negatives), true positives on the y-axis (Positives); classifiers C1, C2 and C3 shown as points.]
(left) A coverage plot depicting the two contingency tables in Table 2.2. The plot is
square because the class distribution is uniform. From the plot we immediately see that
C1 is better than C2. (right) Coverage plot for Example 2.1, with a class ratio clr = 3.
[Left: coverage plot with classifiers C1, C2 and C3 (false positives against true positives). Right: the corresponding ROC plot, with false positive rate on the x-axis and true positive rate on the y-axis.]
(left) C1 and C3 dominate C2, but neither dominates the other. The diagonal line
with slope 1 indicates that all classifiers on this line achieve equal accuracy.
(right) Receiver Operating Characteristic (ROC) plot: a merger of the two coverage plots
in Figure 2.2, employing normalisation to deal with the different class distributions. The
diagonal line with slope 1 indicates that all classifiers on this line have the
same average recall (average of positive and negative recalls).
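The conversion from coverage space to ROC space is just a division by the class totals. The sketch below uses made-up counts for C1, C2 and C3 (not the ones behind the figures) to show the normalisation and a dominance check:

```python
# Coverage space: a classifier is a point (FP, TP); ROC space: (fpr, tpr).
# The counts below are illustrative only.
Pos, Neg = 50, 100
classifiers = {"C1": (10, 40), "C2": (30, 40), "C3": (20, 45)}   # (FP, TP)

# Normalise each axis by the corresponding class total.
roc = {name: (FP / Neg, TP / Pos) for name, (FP, TP) in classifiers.items()}
print(roc)

def dominates(a, b):
    """a dominates b if it has no more false positives and no fewer true positives."""
    (fp_a, tp_a), (fp_b, tp_b) = a, b
    return fp_a <= fp_b and tp_a >= tp_b and a != b

print(dominates(classifiers["C1"], classifiers["C2"]))   # True
print(dominates(classifiers["C1"], classifiers["C3"]),
      dominates(classifiers["C3"], classifiers["C1"]))    # False False: neither dominates
```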
(left) In a coverage plot, accuracy isometrics have a slope of 1, and average recall
isometrics are parallel to the ascending diagonal. (right) In the corresponding ROC plot,
average recall isometrics have a slope of 1; the accuracy isometric here has a slope of
3, corresponding to the ratio of negatives to positives in the data set.
What's next?

Scoring classifier

A scoring classifier outputs a vector of scores ŝ(x) = (ŝ₁(x), . . . , ŝₖ(x)) rather than
a single number; ŝᵢ(x) is the score assigned to class Cᵢ for instance x.
This score indicates how likely it is that class label Cᵢ applies.
If we only have two classes, it usually suffices to consider the score for only one
of the classes; in that case, we use ŝ(x) to denote the score of the positive
class for instance x.
[Figure: the same feature tree with scores in the leaves:
'Viagra' = 1 — spam: 20, ham: 5, ŝ(x) = +2;
'Viagra' = 0, 'lottery' = 0 — spam: 20, ham: 40, ŝ(x) = −1;
'Viagra' = 0, 'lottery' = 1 — spam: 10, ham: 5, ŝ(x) = +1.]
(left) A feature tree with training set class distribution in the leaves. (right) A scoring tree
using the logarithm of the class ratio as scores; spam is taken as the positive class.
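The scores in the right-hand tree follow directly from the leaf distributions. A minimal sketch, assuming base-2 logarithms (consistent with the values +2, −1 and +1 above):

```python
from math import log2

# (spam, ham) counts in the three leaves of the feature tree.
leaves = {
    "Viagra=1": (20, 5),
    "Viagra=0, lottery=0": (20, 40),
    "Viagra=0, lottery=1": (10, 5),
}

# Score = logarithm of the class ratio, with spam as the positive class.
for leaf, (spam, ham) in leaves.items():
    print(leaf, log2(spam / ham))
# Viagra=1: 2.0   Viagra=0, lottery=0: -1.0   Viagra=0, lottery=1: 1.0
```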
If we take the true class c(x) as +1 for positive examples and −1 for negative
examples, then the quantity z(x) = c(x)·ŝ(x) is positive for correct predictions
and negative for incorrect predictions: this quantity is called the margin assigned
by the scoring classifier to the example.
We would like to reward large positive margins and penalise large negative ones.
This is achieved by means of a so-called loss function L: ℝ → [0, ∞) which maps
each example's margin z(x) to an associated loss L(z(x)).
We will assume that L(0) = 1, which is the loss incurred by having an example
on the decision boundary. We furthermore have L(z) ≥ 1 for z < 0, and usually
also 0 ≤ L(z) < 1 for z > 0 (Figure 2.6).
The average loss over a test set Te is (1/|Te|) Σ_{x∈Te} L(z(x)).
[Figure 2.6: loss functions L(z) plotted against the margin z.]
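A few loss functions that satisfy these conditions, written out as a sketch; the particular selection is illustrative rather than a reproduction of Figure 2.6:

```python
from math import exp, log2

# Loss functions of the margin z = c(x) * s_hat(x); all satisfy L(0) = 1.
def zero_one_loss(z):
    """1 for a wrong or boundary prediction, 0 for a correct one."""
    return 1.0 if z <= 0 else 0.0

def hinge_loss(z):
    """Linear penalty for margins below 1, zero beyond that."""
    return max(0.0, 1.0 - z)

def log_loss(z):
    """Smooth, always positive; base-2 logarithm so that L(0) = 1."""
    return log2(1 + exp(-z))

for z in (-2, 0, 2):
    print(z, zero_one_loss(z), hinge_loss(z), log_loss(z))
```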
[Figure: left — the grid of positive–negative pairs; right — the coverage curve of the scoring tree, with positives sorted on decreasing score on the y-axis and negatives on the x-axis.]
(left) Each cell in the grid denotes a unique pair of one positive and one negative
example: the green cells indicate pairs that are correctly ranked by the classifier, the red
cells represent ranking errors, and the orange cells are half-errors due to ties. (right)
The coverage curve of a tree-based scoring classifier has one line segment for each leaf
of the tree, and one (FP, TP) pair for each possible threshold on the score.
The ROC curve is obtained from the coverage curve by normalising both axes to the
range [0, 1]. The area under the ROC curve is the ranking accuracy.
- Suppose we feed the scoring tree in Figure 2.5 an extended test set, with an additional batch of 50 negatives.
- The added negatives happen to be identical to the original ones, so the net effect is that the number of negatives in each leaf doubles.
- As a result the coverage curve changes (because the class ratio changes), but the ROC curve stays the same (Figure 2.8).
- Note that the ranking accuracy stays the same as well: while the classifier makes twice as many ranking errors, there are also twice as many positive–negative pairs, so the ranking error rate doesn't change.
(left) A coverage curve obtained from a test set with class ratio clr = 1/2 . (right) The
corresponding (axis-normalised) ROC curve is the same as the one corresponding to
the coverage curve in Figure 2.7 (right). The ranking accuracy is the Area Under the
ROC Curve (AUC).
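Ranking accuracy counts the fraction of correctly ranked positive–negative pairs, with ties counted as half. The sketch below, with illustrative scores, shows that duplicating every negative leaves it unchanged:

```python
def ranking_accuracy(pos_scores, neg_scores):
    """Fraction of correctly ranked positive-negative pairs; ties count as half."""
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos_scores for n in neg_scores)
    return correct / (len(pos_scores) * len(neg_scores))

pos = [2, 2, 1, -1]    # illustrative scores of positive test examples
neg = [1, -1, -1]      # illustrative scores of negative test examples

print(ranking_accuracy(pos, neg))       # unchanged when ...
print(ranking_accuracy(pos, neg * 2))   # ... every negative is duplicated
```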
Figure 2.9 (left) shows a linear classifier (the decision boundary is denoted B)
applied to a small data set of five positive and five negative examples, achieving
an accuracy of 0.80.
We can derive a score from this linear classifier by taking the distance of an
example from the decision boundary; if the example is on the negative side we
take the negative distance. This means that the examples are ranked in the
following order: p1 – p2 – p3 – n1 – p4 – n2 – n3 – p5 – n4 – n5.
This ranking incurs four ranking errors: n1 before p4, and n1, n2 and n3 before
p5. Figure 2.9 (right) visualises these four ranking errors in the top-left corner.
The AUC of this ranking is 21/25 = 0.84.
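The same pair counting applied to the ranking above reproduces the figure of 21/25; a minimal sketch:

```python
# Ranking induced by the linear classifier, from highest to lowest score.
ranking = ["p1", "p2", "p3", "n1", "p4", "n2", "n3", "p5", "n4", "n5"]

# A positive-negative pair is correctly ranked when the positive
# appears before the negative in the ranking.
correct = sum(1 for i, a in enumerate(ranking) for b in ranking[i + 1:]
              if a.startswith("p") and b.startswith("n"))

pos = sum(x.startswith("p") for x in ranking)   # 5 positives
neg = len(ranking) - pos                        # 5 negatives
print(correct, correct / (pos * neg))           # 21 0.84
```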
[Figure: left — a scatter plot of positives p1–p5 and negatives n1–n5 with weight vector w and three parallel decision boundaries A, B and C; right — the 5×5 grid of positive–negative pairs.]
(left) A linear classifier induces a ranking by taking the signed distance to the decision
boundary as the score. This ranking only depends on the orientation of the decision
boundary: the three lines result in exactly the same ranking. (right) The grid of correctly
ranked positive–negative pairs (in green) and ranking errors (in red).
Tuning rankers
You have carefully trained your Bayesian spam filter, and all that remains is
setting the decision threshold. You select a set of six spam and four ham e-mails
and collect the scores assigned by the spam filter. Sorted on decreasing score
these are 0.89 (spam), 0.80 (spam), 0.74 (ham), 0.71 (spam), 0.63 (spam), 0.49
(ham), 0.42 (spam), 0.32 (spam), 0.24 (ham), and 0.13 (ham).
If the class ratio of 6 spam against 4 ham is representative, you can select the
optimal point on the ROC curve using an isometric with slope 4/6. As can be
seen in Figure 2.11, this leads to putting the decision boundary between the
sixth spam e-mail and the third ham e-mail, and we can take the average of their
scores as the decision threshold (0.28).
An alternative way of finding the optimal point is to iterate over all possible split
points – from before the top-ranked e-mail to after the bottom one – and calculate
the number of correctly classified examples at each split: 4 – 5 – 6 – 5 – 6 – 7 – 6 – 7 – 8 – 7 – 6.
The maximum is achieved at the same split point, yielding an accuracy of 0.80.
A useful trick to find out which accuracy an isometric in an ROC plot represents
is to intersect the isometric with the descending diagonal. Since accuracy is a
weighted average of the true positive and true negative rates, and since these
are the same in a point on the descending diagonal, we can read off the
corresponding accuracy value on the y -axis.
Selecting the optimal point on an ROC curve. The top dotted line is the accuracy
isometric, with a slope of 2/3. The lower isometric doubles the value (or prevalence) of
negatives, and allows a choice of thresholds. By intersecting the isometrics with the
descending diagonal we can read off the achieved accuracy on the y -axis.
What's next?

Class probability estimation

A class probability estimator outputs, for each instance x, a vector of class probability
estimates p̂(x) = (p̂₁(x), . . . , p̂ₖ(x)). As with scoring classifiers, we usually do not have
direct access to the true probabilities pᵢ(x).
[Figure: the same feature tree with probability estimates in the leaves:
'Viagra' = 1 — p̂(x) = 0.80;
'Viagra' = 0, 'lottery' = 0 — p̂(x) = 0.33;
'Viagra' = 0, 'lottery' = 1 — p̂(x) = 0.67.]
A probability estimation tree derived from the feature tree in Figure 1.4.
and the mean squared error (MSE) as the average squared error over all
instances in the test set:
MSE(Te) = (1/|Te|) Σ_{x∈Te} SE(x)
The factor 1/2 in Equation 2.6 ensures that the squared error per example is
normalised between 0 and 1: the worst possible situation is that a wrong class is
predicted with probability 1, which means two ‘bits’ are wrong.
For two classes this reduces to a single term (p̂(x) − I[c(x) = ⊕])², referring only
to the positive class.
Suppose one model predicts (0.70, 0.10, 0.20) for a particular example x in a
three-class task, while another appears much more certain by predicting (0.99, 0, 0.01).

- If the first class is the actual class, the second prediction is clearly better than
the first: the SE of the first prediction is
((0.70 − 1)² + (0.10 − 0)² + (0.20 − 0)²)/2 = 0.07, while for the second prediction it is
((0.99 − 1)² + (0 − 0)² + (0.01 − 0)²)/2 = 0.0001. The first model gets punished more
because, although mostly right, it isn't quite sure of it.
- However, if the third class is the actual class, the situation is reversed: now the SE
of the first prediction is
((0.70 − 0)² + (0.10 − 0)² + (0.20 − 1)²)/2 = 0.57, and of the second
((0.99 − 0)² + (0 − 0)² + (0.01 − 1)²)/2 = 0.98. The second model gets punished more
for not just being wrong, but for being so confident in a wrong prediction.
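A small sketch of the per-example squared error of Equation 2.6, reproducing the four numbers above:

```python
def squared_error(p_hat, true_class):
    """Per-example squared error (Equation 2.6): half the squared distance
    between the predicted probability vector and the 0/1 target vector."""
    target = [1.0 if i == true_class else 0.0 for i in range(len(p_hat))]
    return sum((p - t) ** 2 for p, t in zip(p_hat, target)) / 2

model1 = (0.70, 0.10, 0.20)
model2 = (0.99, 0.00, 0.01)

print(squared_error(model1, 0), squared_error(model2, 0))   # ~0.07   ~0.0001
print(squared_error(model1, 2), squared_error(model2, 2))   # ~0.57   ~0.98
```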
The reason for this becomes obvious if we rewrite the two-class squared error of a
leaf as follows, using n⊕ and n⊖ for the numbers of positive and negative examples
in the leaf, and ṗ = n⊕/(n⊕ + n⊖) for the relative frequency of positives:

n⊕(p̂ − 1)² + n⊖p̂² = (n⊕ + n⊖)p̂² − 2n⊕p̂ + n⊕
                   = (n⊕ + n⊖)[p̂² − 2ṗp̂ + ṗ]
                   = (n⊕ + n⊖)[(p̂ − ṗ)² + ṗ(1 − ṗ)]

which is minimised by taking p̂ = ṗ.
It is almost always a good idea to smooth these relative frequencies. The most
common way to do this is by means of the Laplace correction:
ṗᵢ(S) = (nᵢ + 1) / (|S| + k)

where nᵢ is the number of examples of class Cᵢ in S and k is the number of classes.
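A minimal sketch of the Laplace correction applied to the leaf distributions of the feature tree (counts as in Figure 2.5, k = 2 classes):

```python
def laplace_corrected(counts):
    """Smoothed class probabilities: add one to each class count."""
    k = len(counts)
    total = sum(counts)
    return [(n + 1) / (total + k) for n in counts]

# (spam, ham) counts in the three leaves of the feature tree.
for leaf, counts in {"Viagra=1": (20, 5),
                     "Viagra=0, lottery=0": (20, 40),
                     "Viagra=0, lottery=1": (10, 5)}.items():
    print(leaf, laplace_corrected(counts))
# Viagra=1: ~[0.78, 0.22]; Viagra=0, lottery=0: ~[0.34, 0.66]; Viagra=0, lottery=1: ~[0.65, 0.35]
```

Compare these with the unsmoothed relative frequencies 0.80, 0.33 and 0.67 in the probability estimation tree above: the correction pulls each estimate slightly towards the uniform probability 1/k.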