Classifier Performance Analysis with R
Sutaip L.C. Saw
Classification problems are common in business, medicine, science, engineering, and other sectors of the economy. Data scientists and machine learning professionals solve these problems through the use of classifiers. Choosing one of these data-driven classification algorithms for a given problem is a challenging task. An important aspect involved in this task is classifier performance analysis (CPA).
Key Features:
• An introduction to binary and multiclass classification problems is provided, including some classifiers based on statistical, machine, and ensemble learning.
• Commonly used techniques for binary and multiclass CPA are covered, some from less well-known but useful points of view. Coverage also includes important topics that have not received much attention in textbook accounts of CPA.
• Limitations of some commonly used performance measures are highlighted.
• Coverage includes performance parameters and inferential techniques for them.
• Also covered are techniques for comparative analysis of competing classifiers.
• A key contribution involves the use of key R meta-packages like tidyverse and tidymodels for CPA, particularly the very useful yardstick package.
This is a useful resource for upper-level undergraduate and masters level students in data science, machine learning, and related disciplines. Practitioners interested in learning how to use R to evaluate classifier performance can also potentially benefit from the book. The material and references in the book can also serve the needs of researchers in CPA.
CHAPMAN & HALL/CRC DATA SCIENCE SERIES
Reflecting the interdisciplinary nature of the field, this book series brings together researchers, practitioners, and instructors from statistics, computer science, machine learning, and analytics. The series will publish cutting-edge research, industry applications, and textbooks in data science.
The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes titles in the areas of machine learning, pattern recognition, predictive analytics, business analytics, Big Data, visualization, programming, software, learning analytics, data wrangling, interactive graphics, and reproducible research.
DOI: 10.1201/9781003518679
Contents

List of Tables
Preface
Author

1 Introduction to Classification
   1.1 Classification Overview
       1.1.1 Binary and Multiclass Problems
       1.1.2 Classification Rules
   1.2 Classification with Statistical Learners
       1.2.1 Logit Model Classifier
       1.2.2 Multinomial Logistic Classifier
   1.3 An Overview of Binary CPA
       1.3.1 Required Information
       1.3.2 Binary Confusion Matrix
       1.3.3 Binary Performance Measures
       1.3.4 Binary Performance Curves
   1.4 An Overview of Multiclass CPA
       1.4.1 Required Information
       1.4.2 Multiclass Confusion Matrix
       1.4.3 Multiclass Performance Measures
       1.4.4 Multiclass Performance Curves
   1.5 Exercises

A Appendix
   A.1 Required Libraries
   A.2 Alternative yardstick Functions
   A.3 Some Additional R Functions
       A.3.1 The con_mat() Function
       A.3.2 The kcv_fn() Function
       A.3.3 The acc_ci() Function
   A.4 Some Useful Web Links

Bibliography
Index
List of Tables

6.1 Node Class Counts Before and After a Binary Split
Preface
This book provides an introduction to CPA and the use of the R software
system to accomplish the analysis. Much of what is known about CPA is
scattered throughout the published literature, including books and journals in
disciplines other than DS and ML. Having the relevant material in one book, with expanded introductory discussions and elementary theoretical support where necessary, and illustrated with help from R, is certainly useful for those who have to engage in CPA, particularly those who have yet to master the techniques and conceptual foundations underlying the analysis.
Target Audience
The book is primarily targeted at senior undergraduate and masters level students in DS and/or ML (it is not a monograph on CPA for professionals in these and related disciplines). It can serve as a supplementary text or reference, especially since coverage of CPA is somewhat limited in most published introductory textbooks on DS and ML.
Aspiring data scientists, machine learning professionals, and others who
have to analyze performance of classification algorithms should find the book
a useful resource. Early career professionals in DS, ML, and related disciplines
will find material in the book that can help consolidate their understanding
of CPA. Practitioners and researchers, especially the experienced ones, will
probably be familiar with much if not all of the material in this introductory
book. However, they can also potentially benefit from the book, e.g., using it
as a reference for the use of key packages and meta-packages in R for CPA
(the extensive bibliography given for this topic is also useful).
Why R?
Python is a programming language commonly used by DS and ML professionals, and, judging by the publications and other media the author has seen on this software, it does a great job for what it was designed to do. However, R and its superlative integrated development environment RStudio offer some appealing competitive advantages. Its excellence in serving the needs of analysts in DS, ML, and other disciplines engaged in computational statistics and data mining is unquestionable. In particular, it can be used to train a wide variety of classifiers and, for our purpose, it provides a powerful and well-integrated collection of tools for binary and multiclass CPA given the availability of packages like yardstick and meta-packages like tidyverse and tidymodels.
On a personal note, the author regards R as the best choice for students learning to solve problems in computational statistics, data science, machine learning, and related disciplines. When making this judgement, he has in mind other software systems (like Fortran, APL, C, Visual Basic, Gauss, SAS, SPSS, and Stata) that he has used in an educational environment. This judgement remains unchanged when his experience with S and S-Plus is also taken into account, even though R (like S-Plus) is based on S.
Chapter Outlines
• What can students in DS, ML, and related disciplines hope to learn from
the book?
◦ The relevance and importance of classification problems and how to
solve them with data-driven classifiers.
◦ The different categories of classification algorithms and examples of
some classifiers that are based on them.
◦ The key measures/curves and resampling techniques that can be used
to assess generalizability of trained binary and multiclass classifiers.
◦ Relative merits of various performance metrics (i.e., measures and
curves) for CPA and the impact of class imbalance on the metrics.
◦ Alternative and complementary views on the measures and curves
that CPA relies on.
◦ Some techniques for statistical inferences on performance parameters.
◦ Key modeling issues that impact the training and evaluation of classifiers.
Supplementary Materials
Readers can access soft copies of the code in the book, including the code for exercises and the relevant datasets, from the publisher’s website.
Acknowledgments
I am grateful to commissioning editor Lucy McClune and editorial assistant
Danielle Zarfati from Taylor and Francis for the significant roles they played
with publication aspects related to this book. Without Lucy’s expeditious
response to my book proposal and her helpful guidance in getting the book
project underway, this book would probably never have seen the light of day. Danielle
was of great help in this endeavor with her prompt and helpful responses to
the numerous questions I had regarding publication issues. I also wish to thank
Ashraf Reza and the production team for not only the good job they did in
preparing the proofs for the book, but also for their help in dealing with the
issues that arose from the proofs.
Author
1
Introduction to Classification
When faced with a binary classification problem, you can begin by identi-
fying the class of interest, and when presented with a case, ask the question
whether the case belongs to the class of interest.1 You can regard a “Yes”
response to mean that the case belongs to the positive class regardless of the
substantive nature of the class in question. For example, the two classes may
be labeled as “spam” and “non-spam” for a spam filtering problem. If the
former is the class of interest, then you can label it as the positive class even
though the usual thinking is to regard getting spam as a negative outcome.
There are, however, alternative criteria that have been used to define the
positive class. For imbalanced binary problems, this class is usually the minor-
ity class since it is often the one of primary interest; for example, in medical
diagnosis, the focus is usually on an individual belonging to a (minority) dis-
eased group. This convention is widely adopted by data scientists and machine
learning professionals. The sidebar titled Bad Positives and Harmless Nega-
tives in Provost and Fawcett [90, p. 188] provides an informative perspective
on this issue. Whatever convention is adopted, it is important to make it clear
when using class-specific performance measures to evaluate binary classifiers
because such measures attempt to quantify the rate at which true positives
or true negatives occur, for example.
In practice, you can also encounter problems that involve a target variable
with more than two levels. Such multiclass classification problems can arise
when you have questions like the following ones.
1 When discussing classification problems in this book, a case refers to an individual or
object, or some other entity. Readers in epidemiology, for example, should be mindful since
the term has a different meaning in their discipline. A case is often referred to as an instance
when data mining is used to solve classification problems. It has also been referred to as an
example in other application areas.
Ŷ = f̂(X),    (1.1)
Assign a case with feature vector x to the positive class if S(x) > t,
2 For binary classification, other choices for C that have been adopted in the literature
include {0, 1} and {−1, 1}. The former choice, though convenient when modeling a Bernoulli
target variable, is not mandatory. Our choice is consistent with what is given here for the
multiclass problem.
where S(x) is a score assigned by the classifier to a case with given feature
vector x (this vector contains attributes of a case that are relevant for the
classification problem) and t is a suitable threshold. The prediction function
associated with the above classification rule may be expressed as
FIGURE 1.1
Training Data for a Binary Classification Problem
FIGURE 1.2
Training Data for a 3-Class Classification Problem
ln[P(Y = 1)/P(Y = 2)] = η(x),    (1.3)
where η(x) represents the linear predictor and Y is a target variable that
takes the value 1 or 2 according to whether the case belongs to the positive or
negative class, respectively. The linear predictor determines the logit model,
and it is a key specification that is needed for the corresponding classifier. It
is determined by the features of a case and the parameters of the model; see
(1.5) for a simple example.
It follows from (1.3) that the probability that a case belongs to the positive
class is
P(Y = 1) = exp(η(x)) / [1 + exp(η(x))].
Thus, a case may be assigned to the positive class if this probability is sufficiently large. This amounts to saying that such a class assignment is made by a LM classifier if

exp(η(x)) / [1 + exp(η(x))] > c,    (1.4)
where c is a suitable threshold. The usual default for c is 0.5, but note that
this value is not suitable when you have significant class imbalance in your
problem. During our discussion on cost-sensitive learning in the last chapter,
we show one way to obtain an optimal value for this threshold.
To illustrate training of a logit model (LM) classifier, consider the data
that is displayed in Figure 1.1. A partial listing of the data is given below.
## # A tibble: 700 x 3
## age income group
## <dbl> <dbl> <fct>
## 1 23.3 3.60 No
## 2 39.5 4.24 Yes
## 3 22.9 3.00 No
## # ... with 697 more rows
For the binary classification problem under consideration, the feature vec-
tor is x = (age, income), and a possible linear predictor (we ignore possible
interaction effects) is

η(x) = β0 + β1 × age + β2 × income.    (1.5)

The βi's on the right-hand side of the above expression are unknown model
parameters. Here, we use the training data and the glm() function in the
stats package to obtain estimates of these parameters; a good discussion on
estimation of logit model parameters may be found in Charpentier and Tufféry
[12, p. 175], for example.4
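As a concrete illustration, here is a minimal sketch of how such a fit might be obtained; the name train_df for the training tibble (with columns age, income, and group, as in the listing above) is an assumption made for illustration.

library(tidyverse)
library(broom)

# Fit the logit model with glm(); train_df is an assumed name for the
# training tibble listed above (columns: age, income, group)
lm_fit <- glm(group ~ age + income, data = train_df,
              family = binomial(link = "logit"))

# Tidy summary of the estimated coefficients and their p-values
tidy(lm_fit) %>% select(term, estimate, p.value)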
When we use the training data to estimate the parameters in linear pre-
dictor (1.5), we obtain
that the function models the logit of the second level of the categorical class variable and
make any required adjustments, i.e., let the second level of the class variable refer to the
positive class and change later when using yardstick package to evaluate performance, if
required (the reason for the change will be explained later).
## # A tibble: 3 x 3
## term estimate p.value
## <chr> <dbl> <dbl>
## 1 (Intercept) -11.4 2.48e-32
## 2 age 0.301 1.44e-31
## 3 income -0.0240 6.29e- 1
On examining the P-values, we note that the estimated coefficient for income
is not statistically significant. Refitting the model without this feature yields
the following results (notice that the estimated intercept and age coefficients
change only slightly when income is omitted).
## # A tibble: 2 x 3
## term estimate p.value
## <chr> <dbl> <dbl>
## 1 (Intercept) -11.5 2.88e-32
## 2 age 0.299 2.15e-32
In light of the above results, we see that the trained LM classifier is one which assigns an individual to the positive class if

exp(−11.5 + 0.299 × age) / [1 + exp(−11.5 + 0.299 × age)] > c.    (1.6)

The left-hand side of (1.6) represents the estimated positive class membership probability of an individual with a given value for age. We can re-express (1.6) as
−11.5 + 0.299 × age > t, (1.7)
where t = ln(c/(1 − c)). Thus, the LM classifier is a scoring classifier with
scores that can be defined by the left-hand side of (1.6) or (1.7). When scores
are probabilities, the classification rule is also referred to as a probabilistic
classifier [16]. The trained LM classifier given by (1.6) with c = 0.5 will be
used to illustrate some techniques for evaluating classifier performance in later
sections of this chapter.
Before considering another example of classifier training in the next sec-
tion, it should be noted that such training in practice is often an involved
process that requires you to do more than just apply a classification algo-
rithm to a training dataset. This is certainly true if you want to obtain the
best performing classifier for your problem. Often, if that is your practical
objective, you will also need to do some combination of the following: feature
engineering, data exploration and preprocessing, hyperparameter tuning (this
requires use of resampling methods like bootstrapping or cross-validation), and
selection of suitable performance measures to evaluate the trained classifier.
To avoid unnecessary complications in this book, we will take a limited ap-
proach to train the classifiers that will be used to illustrate various CPA tech-
niques (when applicable, we will discuss some of the abovementioned training
aspects in subsequent chapters). Clearly, this is not unreasonable since it suf-
fices to have a trained classifier for such illustrations (i.e., you do not need to
have the “best” classifier).
h = argmax_i {P(Y = i | x), i = 1, . . . , k},

where

P(Y = i | x) = exp(ηi(x)) / [exp(η1(x)) + · · · + exp(ηk(x))],   i = 1, 2, . . . , k.
## # A tibble: 700 x 3
## age income group
## <dbl> <dbl> <fct>
## 1 23.9 0.307 A
## 2 27.2 0.390 A
## 3 24.1 0.671 A
## # ... with 697 more rows
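Coefficient estimates of the kind reported below can be obtained with, for example, the multinom() function in the nnet package; the following sketch assumes the 3-class training tibble is named train3_df (a hypothetical name).

library(nnet)
library(broom)

# Fit a multinomial logistic model; train3_df is an assumed name for the
# 3-class training tibble listed above (columns: age, income, group)
ml_fit <- multinom(group ~ age + income, data = train3_df, trace = FALSE)

# Tidied coefficients for the non-baseline classes ("B" and "C" versus "A")
tidy(ml_fit) %>% select(y.level, term, estimate, p.value)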
## # A tibble: 6 x 4
## y.level term estimate p.value
## <chr> <chr> <dbl> <dbl>
## 1 B (Intercept) -45.6 2.41e- 6
## 2 B age 1.51 4.15e- 6
## 3 B income 1.10 3.31e- 3
## 4 C (Intercept) -71.6 1.44e-12
## 5 C age 2.05 8.58e-10
## 6 C income 1.14 3.07e- 3
These results show that the estimated model coefficients are statistically sig-
nificant. Thus, the estimated probabilities of class membership are given by
(class “A”, “B” and “C” are referred to as class 1, 2 and 3, respectively)
exp(η̂i(x)) / [exp(η̂1(x)) + exp(η̂2(x)) + exp(η̂3(x))],   i = 1, 2, 3,

where η̂1(x) = 0 (class “A” is the baseline),

η̂2(x) = −45.6 + 1.51 × age + 1.10 × income,

and

η̂3(x) = −71.6 + 2.05 × age + 1.14 × income.
Hence, the trained ML classifier assigns a case to the h-th class if
h = argmax_i { exp(η̂i(x)) / [exp(η̂1(x)) + exp(η̂2(x)) + exp(η̂3(x))], i = 1, 2, 3 }.    (1.8)
## # A tibble: 300 x 3
## prob_Yes pred_class group
## <dbl> <fct> <fct>
## 1 0.143 No Yes
## 2 0.531 Yes Yes
## 3 0.139 No Yes
## # ... with 297 more rows
The given tibble was obtained when the LM classifier was applied to the cases
in a test dataset. The column labeled prob_Yes in the partially listed tibble contains predicted positive class membership probabilities for the 300 cases in the test dataset, and the second column labeled pred_class contains the corresponding predicted classes. Information on the actual classes is contained
in the third column. With the given information, you can obtain a useful array
summary and extract descriptive performance measures from it. You can also
construct suitable performance curves and obtain relevant summaries from
them.
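For concreteness, the following sketch shows one way such a tibble of test-set predictions might be assembled; the names lm_fit (the fitted logit model) and test_df (the test tibble) are assumptions made for illustration.

# Predicted positive class probabilities and predicted classes on the test set
LM_pv <- tibble(
  prob_Yes   = predict(lm_fit, newdata = test_df, type = "response"),
  pred_class = factor(if_else(prob_Yes > 0.5, "Yes", "No"),
                      levels = c("No", "Yes")),
  group      = test_df$group
)
LM_pv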
TABLE 1.1
Confusion Matrix for a Binary
Classifier
Actual
Predicted Yes No
Yes tp fp
No fn tn
TABLE 1.2
Confusion Matrix for the LM Classifier
Actual
Predicted Yes No
Yes 118 14
No 32 136
FIGURE 1.3
Confusion Matrices with Formats Different from Table 1.1
TABLE 1.3
Key Count Vectors for Totally Useless, Random and
Perfect Classifiers
Type of Classifier Key Count Vector
(tp, fn, fp, tn)
Totally Useless (0, m, n, 0)
Random (mθ, m(1 − θ), nθ, n(1 − θ))
Perfect (m, 0, 0, n)
## Actual
## Predicted Yes No
## Yes 81 23
## No 252 9644
TABLE 1.4
Expected Confusion Matrix for a Random
Classifier
Actual
Predicted Yes No
Yes mθ nθ
No m(1 − θ) n(1 − θ)
as is done when using a package like yardstick (more on this later). We can demonstrate this by using our con_mat() function to construct these objects using the key counts from Table 1.2.
The result from the first command is a "matrix" object. It seems to be the obvious one to use since it is, after all, for a confusion matrix. However, we find the "table" object from the second command to be more useful for a number of reasons which will become clear in due course (note there is one obvious difference). To obtain a "conf_mat" object, leave out the type argument in the call to con_mat(). Also, note that the "table" and "conf_mat" representations look the same, but they can produce different results depending on the function you subsequently apply.
The choice of object representation is important because what you get
when applying functions like prop.table() or summary() depends on the ob-
ject that you supply as argument to these functions. In general, the applicable
representation is determined by the task at hand and the R package that is
being used.
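The commands referred to above are not reproduced in this excerpt; the following base R sketch illustrates the same distinction between the "matrix" and "table" representations, built from the key counts in Table 1.2 (the object names are chosen for illustration).

# Key counts for the LM classifier (Table 1.2)
tp <- 118; fn <- 32; fp <- 14; tn <- 136

# "matrix" representation (cells arranged as in Table 1.1)
cm_matrix <- matrix(c(tp, fn, fp, tn), nrow = 2,
                    dimnames = list(Predicted = c("Yes", "No"),
                                    Actual    = c("Yes", "No")))
class(cm_matrix)

# "table" representation (note the difference in how it prints)
cm_table <- as.table(cm_matrix)
class(cm_table)

# Column profiles via prop.table() require a "table" or "matrix" object
prop.table(cm_table, margin = 2)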
TABLE 1.5
Performance Measures for Totally Useless, Random and
Perfect Classifiers
Type of Classifier accuracy tpr tnr
Totally Useless 0 0 0
Random ((m − n)θ + n)/(m + n) θ 1 − θ
Perfect 1 1 1
For the LM classifier given by (1.6), the value for accuracy shows that
about 85% of cases in the test dataset were correctly classified by the classifier.
The tpr value shows that about 3 out of 4 positive cases in the dataset were
correctly classified, and the tnr value shows that about 91% of the negative
cases were correctly classified. Thus, the LM classifier did an excellent job at
classifying test data cases in the negative class but not quite as well for those
in the positive class. Overall, its classification accuracy is relatively high.
# Key Counts
tp <- 118; fn <- 32; fp <- 14; tn <- 136
We can infer two false rates from the true rates in the above discussion. These are the false negative rate and false positive rate defined by the fractional expressions in

fnr = fn/(tp + fn) = 1 − tpr    (1.13)

and

fpr = fp/(fp + tn) = 1 − tnr,    (1.14)
respectively. The alternative expressions for these false rates follow immediately from the corresponding true rates in (1.11) and (1.12). We can interpret fnr (fpr) as the classification error rate for test data cases in the positive (negative) class. For the medical diagnosis example discussed earlier, fnr (fpr) is the rate at which a classifier used for the diagnosis incorrectly classifies a diseased (healthy) individual.
Given the true rates that we calculated for the LM classifier, we find that fnr = 0.213 and fpr = 0.093 for the classifier. Of course, given these numbers, you can also conclude that sensitivity = 0.787 and specificity = 0.907 for the classifier. Regardless of your preference for false or true rates, these numbers suggest that the classifier is doing a better job at identifying negative cases than positive ones.
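A quick numerical check of these values from the key counts of the LM classifier (a small sketch):

# Key counts for the LM classifier (Table 1.2)
tp <- 118; fn <- 32; fp <- 14; tn <- 136

fnr <- fn / (tp + fn)   # false negative rate
fpr <- fp / (fp + tn)   # false positive rate
c(fnr = fnr, fpr = fpr, sensitivity = 1 - fnr, specificity = 1 - fpr)
# approximately 0.213, 0.093, 0.787, 0.907 (the values quoted above)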
FIGURE 1.4
ROC Curve for the LM Classifier
FIGURE 1.5
Ideal ROC Curve
descriptive summary from the ROC curve is not unreasonable for a given classifier whose performance has been established by other coherent measures). Hand [50, 51] highlighted the relatively unknown fact that AUC “is fundamentally incoherent in terms of misclassification costs”; actually, AUC as an inconsistent criterion was noted earlier in a paper by Hilden [60]. Further discussion of the problem with AUC will be taken up in Chapter 3.
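As an aside, an ROC curve and its AUC can be obtained with functions from the yardstick package; the sketch below assumes the LM test-set predictions are in a tibble named LM_pv with columns prob_Yes and group, and that "Yes" is the first level of group.

library(yardstick)
library(ggplot2)

# ROC curve and area under the curve for the LM classifier
LM_pv %>% roc_curve(truth = group, prob_Yes) %>% autoplot()
LM_pv %>% roc_auc(truth = group, prob_Yes)
# if "Yes" is the second level instead, add event_level = "second"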
## # A tibble: 300 x 5
## prob_A prob_B prob_C pred_class group
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 0.999 0.000681 1.34e- 9 A A
## 2 1.00 0.00000238 1.19e-12 A A
## 3 1.00 0.000217 3.62e-10 A A
## # ... with 297 more rows
prob_A contains the predicted probabilities that cases in the test dataset belong to class “A”. Together with information in group, these probabilities may be used to obtain some performance curves for the classifier.
TABLE 1.7
Confusion Matrix for the 3-Class ML Classifier
Actual
Predicted A B C
A 74 4 0
B 1 143 8
C 0 3 67
FIGURE 1.6
OvR Collection of Confusion Matrices for the ML Classifier
(74 + 143 + 67) / 300 = 0.947.
This shows that about 95% of cases in the test dataset were correctly classified.
This is one indication of excellent performance.
Measures like tpr and tnr do not have such straightforward extensions. Without taking an OvR perspective, the notion of positive and negative class does not make sense for a multiclass problem. However, when you take such a perspective, you can obtain a multiclass sensitivity measure by averaging the class-specific sensitivities from the one-vs-rest confusion matrices, i.e., by macro averaging.
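A sketch of how such multiclass summaries might be computed with yardstick, assuming the 3-class test-set predictions are in a tibble named ML_pv with columns pred_class and group (assumed names):

library(yardstick)

# Multiclass accuracy and macro-averaged sensitivity for the ML classifier
ML_pv %>% accuracy(truth = group, estimate = pred_class)
ML_pv %>% sens(truth = group, estimate = pred_class, estimator = "macro")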
FIGURE 1.7
ROC Curves by Class Reference Formulation
1.5 Exercises
1. As noted earlier in Figure 1.3, confusion matrices can have different
formats. In this exercise, we explore potential consequences of using
different formats for these array summaries.
(a) In what way do the formats of the confusion matrices in Fig-
ure 1.3 differ from that in Table 1.1?
(b) Identify the key count vectors for each confusion matrix in Fig-
ure 1.3.
(c) Obtain estimates of accuracy, tpr and tnr and provide a sub-
stantive interpretation of the performance measures you com-
puted.
2. The key count vector (5134, 2033, 6385, 25257) is from a confusion
matrix given in Kuhn and Johnson [68, p. 39]. The classification problem they considered involved determining whether a person’s profession is in one of the following disciplines: science, technology, engineering, or math (STEM).
(a) Define the positive class for the classification problem. Explain
the reasoning for your answer.
(b) Use the con_mat() function in Appendix A.3 to reproduce the
confusion matrix. What is the R object representation of the
confusion matrix you obtain?
(c) Estimate the rate at which (i) false positives and (ii) false neg-
atives occur. What is the overall error rate?
3. Nwanganga and Chapple [83, p. 238] gave an example of the application of a k-nearest neighbors classifier to a heart disease classification problem; see Rhys [95, p. 56] for a good discussion of k-NN classifiers. The substantive issue for this problem is to determine whether an individual suffers from heart disease or not. The format of the confusion matrix given by the authors differs from ours. However, we can easily identify the following key counts:
(a) Use the above counts and the con_mat() function in Appendix A.3 to obtain the confusion matrix in the same format
as used in Table 1.1. Use a "table" object representation.
(b) Obtain the column profiles of the resulting "table" object by
dividing entries in each column by the corresponding column
total. Interpret the entries on the diagonal.
(c) What is the fraction of correct positive classifications (i.e., what
fraction of individuals classified as having heart disease actu-
ally have the disease)? What is the fraction of correct negative
classifications?
4. In this exercise, a comparison of two classifiers that were trained
using data displayed in Figure 1.1 will be made; both classifiers were
trained using age as the only predictor. When the classifiers were
applied to test data, the key count vectors shown in the following
tibble were obtained for a logit model (LM) classifier and a decision
tree (DT) classifier (the next chapter has some information on this
machine learner).
## classifier tp fn fp tn
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 LM 118 32 14 136
## 2 DT 104 46 9 141
Note that the key counts for the LM classifier are from Table 1.2.
(a) Which classifier has a smaller overall error rate?
(b) Which classifier has a smaller false positive rate?
(c) Which classifier has a larger true positive rate?
Answer the above questions by (i) inspecting the relevant key counts
and (ii) computing relevant performance measures.
## [[1]]
## Actual
## Predicted No Yes
## No 873 50
## Yes 68 9
##
## [[2]]
## Actual
## Predicted No Yes
## No 920 54
## Yes 21 5
##
## [[3]]
## Actual
## Predicted No Yes
## No 930 55
## Yes 11 4
## Actual
## Predicted A B C
## A 20 1 0
## B 3 10 2
## C 0 0 8
2
Classifier Performance Measures
In the last chapter, we provided a glimpse of what is involved when you at-
tempt to evaluate performance of a trained binary classifier. We saw the key
role played by the confusion matrix that is obtained from predicted and ac-
tual classes of cases in a test dataset. This useful array summary allows you
to derive several descriptive performance measures like accuracy, sensitivity,
and specificity.1 These are essentially measures of the number of correct classifications as a fraction of the entire test dataset or similarly defined fractions
for each class in this dataset. They are examples of threshold performance
measures; see Section 2.3 for further discussion of this important category of
measures.
Other relevant measures and issues have been examined and reported in
the literature. For example, the review by Tharwat [108] considered several
other performance measures and examined the influence of balanced and im-
balanced data on each metric. As part of their study, Sokolova and Lapalme
[103] considered the issue of reliable evaluation of classifiers by examining the
invariance properties of several performance measures. Concepts and measures
such as informedness and markedness that reflect the likelihood that a clas-
sification is informed versus chance was examined by Powers [88]. Recently,
Aydemir [3] proposed the polygon area metric (PAM) to facilitate CPA when
there are competing classifiers for a given problem. Given the wide variety of
performance measures, it should come as no surprise that different measures
quantify different aspects of performance (this is a noteworthy point); see the
excellent review by Hand [52] for further discussion on this issue and other
aspects like choice of and comparisons between performance measures.
In this chapter, we extend coverage of threshold performance measures
to include, among others, those that help to answer the third and fourth
questions in Section 1.3. Another important aim is to show how R may be used
to analyze classifier performance. In particular, we demonstrate functionality
in the yardstick package for this purpose. Some of the measures we cover
may also be viewed as estimates of certain performance parameters. For these
metrics, we also discuss some procedures that can be used to make inferences
1 Equivalent measures that are complementary to the listed ones include error rate, false negative rate, and false positive rate.
FIGURE 2.1
Decision Tree from Simulated Training Data
FIGURE 2.2
Simulated Training Data
in Figure 2.2 (in this dataset, group is the target variable with two levels and
the features are the numeric variables X1 and X2).
The inverted tree structure in Figure 2.1 is typical of a decision tree; it is
made up of three types of nodes that are linked by branches. The root node
is at the top and the terminal nodes are at the bottom. The nodes in between
these two types are called internal nodes. For the given decision tree, we have
one root node, one internal node, and three terminal nodes.
The root node of a decision tree contains all the data prior to splitting. A
suitable criterion is used to split this node into two branches, each of which
leads to another node. Internal nodes are also split in a similar fashion. Nodes
that do not split further make up the terminal nodes. The whole process is
governed by a recursive partitioning algorithm like rpart. This version is an
open source implementation of CART, the classification and regression trees
algorithm developed by Breiman et al. [9].2
A measure of impurity like Gini index that quantifies how heterogeneous
classes are within a node may be used in the node-splitting process. This
measure is defined by

Gini index = Σi∈C pi(1 − pi) = 1 − Σi∈C pi²,
where C denotes the set of classes and pi is the proportion of cases in the node that belong to class i (the index is small when purity is high). The feature used to split a root or internal node is the one with
the best value for Gini gain (this is the difference between the Gini index of a
parent node and a linear combination of that for the two nodes resulting from
the split). Rhys [95, p. 170] has a good example that demonstrates calculation
of this criterion.
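As a small illustration of the idea (not the example from Rhys), the following sketch computes the Gini index and the Gini gain for a candidate binary split using hypothetical class counts:

# Gini index for a node, given a vector of class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical counts before and after a binary split
parent <- c(yes = 40, no = 60)
left   <- c(yes = 35, no = 10)
right  <- c(yes = 5,  no = 50)

# Gini gain: parent impurity minus the size-weighted impurity of the children
n_left <- sum(left); n_right <- sum(right); n <- n_left + n_right
gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)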
To classify a case with the DT classifier, start at the root node and move
down along each branch (take the left branch if the condition associated with a
node is true) until you reach a terminal node. The label in the latter determines
the class of the case. For example, a case is assigned to the “Yes” class if it
ends up in the right-most terminal node in Figure 2.1. The recorded percentage shows that 28% of cases in the training sample wind up in this node. A case
that winds up in this node is certain to belong to the “Yes” class as shown
by the estimated probability (i.e., number below the label for the node) of
belonging to the “Yes” class given the node conditions satisfied by the case.
Finally, note that decision trees tend not to perform well due to the ten-
dency to overfit. Hence, they tend to have high variance (a small change in the
training data can produce big changes in the fitted tree and hence highly vari-
able predictions). To guard against extravagant tree building, you can employ
suitable stopping criteria (e.g., minimum number of cases in a node before
splitting, maximum depth of the tree, minimum improvement in performance
for a split, minimum number of cases in a leaf). Alternatively, you can ap-
ply suitable pruning techniques to reduce the size of the decision tree; see
Boehmke and Greenwell [6, p. 181], for example.
# Data Preparation
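# (The original data preparation code is not reproduced in this excerpt; the
#  sketch below indicates what it might look like. The use of the titanic
#  package and the variables retained are assumptions made for illustration.)
library(tidyverse)
library(titanic)

Titanic_df <- titanic_train %>%
  select(Survived, Sex, Age) %>%
  mutate(Survived = factor(if_else(Survived == 1, "Yes", "No"),
                           levels = c("No", "Yes")),
         Sex = factor(Sex))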
The Titanic_df data frame created in the above code segment has some missing values.3 For convenience, we impute the missing values using the
3 Note the use of the basic pipe operator %>% from the magrittr package. We make
frequent use of this and other pipe operators (e.g., %T>% and %$%) in this book.
replace na() function from the tidyr package so that there is a complete
dataset to start with. There is, of course, a better way to deal with the missing
data in practice. You can, for instance, incorporate the required imputation in
a preprocessing pipeline as part of the workflow when you use the tidymod-
els meta-package to train and test your classifier as shown in Section 6.2.2.
The difference does not matter here because our aim is illustrate various tech-
niques for classifier performance analysis. For this purpose, it suffices to have
a trained classifier and a test dataset.
FIGURE 2.3
DT Classifier for the Titanic Survival Classification Problem
tree construction process. These include those that determine tree complexity,
maximum depth and minimum number of cases in a node for it to be split
further. For the moment, we use default values for these hyperparameters.
Figure 2.3 provides a display of the fitted decision tree. You can also refer to
it as a display of the trained DT classifier for the Titanic survival classification
problem since the figure is a display of the rule set associated with the DT
classifier for this problem.4
As can be seen from the displayed decision tree, an individual starting
from the root node will reach the left-most leaf node at the bottom of the
decision tree if Sex = male and Age ≥ 13 are both true for the individual.
The numbers in this leaf node show that 61% of individuals in the training
4 See Roiger and Geatz [97, p. 11] for a simple example that shows how a decision tree can be mapped to a set of classification rules.
FIGURE 2.4
Variable Importance Plot
When working with functions in the yardstick package for CPA, keep in
mind that by default the first level of the target variable is the reference level.
If you want the second level to be the reference in binary CPA, you need to
specify event level = "second" as one of the arguments in your call to the
performance function when applicable. Henceforth, for convenience, we will
make the positive class the first level of factor variables that define the pre-
dicted and actual classes (since this avoids the need for the extra argument).
As shown in the next example, the “Yes” level (which defines the positive class for our problem) is the second level of both pred_class and Survived. We can make it the first level by using the factor reversal function fct_rev() in the forcats package.
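The code segment referred to in the next paragraph is not reproduced in this excerpt; a minimal sketch of what it might look like follows (DT_pv denotes the tibble of predicted and actual classes, and the details are assumptions made for illustration).

library(magrittr)    # provides the exposition pipe %$%
library(yardstick)

# Make "Yes" the first level of both the predicted and actual class factors
DT_pv <- DT_pv %>%
  mutate(pred_class = fct_rev(pred_class), Survived = fct_rev(Survived))

# First approach: a "table" object via table()
DT_pv %$% table(Predicted = pred_class, Actual = Survived)

# Second approach: a "conf_mat" object via yardstick's conf_mat()
DT_pv_cm <- DT_pv %>% conf_mat(truth = Survived, estimate = pred_class)
autoplot(DT_pv_cm, type = "heatmap")   # heat map display (Figure 2.5)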
You can obtain the required array summary in a number of ways. As shown in the preceding code segment, the first approach uses the table() function and hence yields a "table" object (note the use of the exposition pipe %$% here). There are situations when it is advantageous to take this approach. The second approach that uses the conf_mat() function from the yardstick package is also very useful. It yields a "conf_mat" object and Figure 2.5 shows a heat map display of this object. Such a display is one of several benefits that you have when conf_mat() is used to obtain a confusion matrix.
There is another approach that is not widely known that yields a "matrix" object; it involves use of the indicator matrices that we can get for pred_class and group with information in DT_pv (we demonstrate this later when we discuss multiclass CPA). This shows the important role that these indicator matrices play in the determination of CPA measures. In particular, we
see in the next section that a measure called Matthews correlation coefficient
relies on them. Reliance on these matrices is necessary because the formula
that we will see later in this chapter for this measure in terms of key counts is
not very useful when it comes to generalization for use with multiclass classi-
fiers. The resulting approach is conceptually and computationally appealing.
The essential information to extract from a confusion matrix are the key
counts since they play an important role when you need to compute perfor-
mance measures that depend on these counts, and you use the formulas that
define them. This approach is sometimes convenient to use even though you
have a package like yardstick to facilitate calculations of the required mea-
sures. When attempting to identify the key counts for a given binary problem,
note that identification of what constitutes the positive class is an important
first step. This is easily done with the confusion matrix in Figure 2.5 in light of
the format we adopted in Table 1.1 of Chapter 1 for this array summary. You
have to be careful when your confusion matrix is not in the adopted format.
FIGURE 2.5
Confusion Matrix for the DT Classifier
The key counts from Figure 2.5 are contained in the kcv vector below.
tp <- 57; fn <- 22; fp <- 12; tn <- 132 # key counts
kcv <- c(tp, fn, fp, tn) # vector of key counts
names(kcv) <- c("tp", "fn", "fp", "tn")
kcv
## tp fn fp tn
## 57 22 12 132
The key counts in the above code segment allow you to obtain a preliminary
assessment of the DT classifier. If required, you can compute an overall mea-
sure of classification accuracy and the rates at which false negatives and false
positives occur by using these counts and the formulas given by (1.10), (1.13),
and (1.14). As we’ll see in the subsequent sections of this chapter, there are
other useful performance measures that you can also compute.
performance of interest; see Hand [50] for an example where this requirement
is violated.5 Other criteria may also be relevant, e.g., measure invariance [103],
applicability to imbalanced classification problems [10], and sensitivity to costs
of different misclassification errors [36].
Measures like accuracy and tpr satisfy most of the abovementioned re-
quirements; some issues with these measures will be discussed later. Recall,
they provide answers to the first two questions given in Section 1.3. We need
to expand our coverage of performance measures to answer the remaining
two questions and others that are concerned with classifier performance. The
threshold measures we cover may be further divided into three groups, namely,
those that are class-specific, those that measure overall performance, and com-
posite measures. Typically, definition of measures in the last two groups in-
volves all of the four key counts and those in the first group usually rely on a
subset of these counts.
We take a slightly different approach when we review the measures men-
tioned above and introduce new ones in this section. The approach assumes
familiarity with what is involved in getting the column/row profiles of a rectan-
gular array like Table 1.1. This approach has some advantages like the ability
to easily obtain estimates of multiple CPA measures and to obtain them even
without relying on the formulas that define the measures, or use functions in
a package like yardstick. The approach also allows for several relationships
among CPA measures to be easily deduced (e.g., the relationship between true
and false rates).
As shown in Table 2.1, the fractions involved in the above definitions ap-
pear in the column profiles of the generic confusion matrix that was given in
Table 1.1. Thus, the four measures listed above form a natural group of frac-
tional performance measures. A useful practical implication is the fact that you
5 This example is important and we’ll return to it later since the discussion involves the
area under the ROC curve (a topic that we’ll cover with more detail in the next chapter).
TABLE 2.1
Column Profiles of the Confusion Matrix in
Table 1.1
Actual
Predicted Yes No
Yes    tp/(tp + fn)    fp/(fp + tn)
No     fn/(tp + fn)    tn/(fp + tn)
can obtain these measures easily with software like R since its prop.table()
function may be used to calculate column (and row) profiles.6 As will be seen
later, this is not the only advantage when you adopt this perspective.
The tpr measure provides an answer to the second question given at the
beginning of Section 1.3 (replacing positive in the question by negative leads to
another question whose answer is provided by tnr). Thus, tpr may be viewed
as a measure of classification accuracy for the positive class. You can also
answer the question just mentioned if you know f nr since
tpr + fnr = 1.
This relationship follows from Table 2.1 since the first (and second) column
total for this table is clearly equal to 1. Thus, another advantage of the column
profiles perspective is that it allows you to easily deduce relationships between
true and false rates.
In practice, some measures are often referred to by different names that
make sense in a particular application area. As noted in the last chapter,
terms like sensitivity and specif icity are used in certain disciplines instead
of tpr and tnr. A measure like sensitivity is important in binary classifica-
tion problems that arise in medical applications like disease detection. In such
applications, a diseased individual belongs to the positive class. For some dis-
eases, it is necessary to use a classifier with high sensitivity to ensure that
false negatives occur rarely or not at all (this is particularly important for
harmful diseases with high infection rate).
There are other class-specific measures that you can obtain from a binary
confusion matrix. For example, you can consider the measures from the row
profiles of Table 1.1. The entries in Table 2.2 provide another set of four performance measures: positive predictive value (ppv), false discovery rate (fdr), false omission rate (for), and negative predictive value (npv).
TABLE 2.2
Row Profiles of the Confusion Matrix in
Table 1.1
Actual
Predicted Yes No
Yes    tp/(tp + fp)    fp/(tp + fp)
No     fn/(fn + tn)    tn/(fn + tn)
The answer to the third question at the beginning of Section 1.3 is provided
by the ppv measure. Replacing positive in the question by negative yields
another related question whose answer is provided by npv. Also, note that in general, fpr ≠ fdr even though the number of false positives and the number of false discoveries are both determined by fp in Table 1.1. Similarly, fnr ≠ for in general. It suffices to examine the denominators in the definitions of the relevant rates to see why this is so.
In text classification and information retrieval, ppv is also referred to
as precision [90]. In such applications, recall is an alternative term for
sensitivity. Also, in database record linkage applications, precision and recall
(including their harmonic mean) are popular measures because the underlying
classification problem is often very unbalanced [46].
In applications such as spam classification, precision takes precedence
over recall. This is because it is important to have a high fraction of cor-
rect classifications when emails are classified as spam, i.e., we want a very
high precision value. Another way of saying this is that we require a very low false discovery rate (this measure is equal to 1 − ppv). However, it is possible for your business goal to be stated in terms of sensitivity and specificity for a spam classification problem; see the conclusion resulting from the illustrative dialogue process in Table 6.3 of Zumel and Mount [119], for example.
As noted earlier, calculation of binary class-specific measures by the col-
umn/row profiles approach may be performed by using the prop.table()
function. The array summary you supply to this function as one of its argu-
ments must be of class "table" or "matrix".
The confusion matrix for the DT classifier was obtained earlier in the last section; see Figure 2.5. This figure is a display of DT_pv_cm that we created earlier. It is an object of class "conf_mat", not one of the two that is required. Fortunately, you can use the pluck() function from the purrr package to extract the required "table" object from DT_pv_cm before calculating the required profiles. This is demonstrated next.7
7 If necessary, refer to Table 2.1 and Table 2.2 for help to interpret the entries in the output below.
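The demonstration itself is not reproduced in this excerpt; a minimal sketch of what it might look like (the intermediate name DT_cm_tbl is an assumption):

library(purrr)

# Extract the "table" object from the "conf_mat" object
DT_cm_tbl <- DT_pv_cm %>% pluck(1)

# Column profiles (cf. Table 2.1): tpr, fnr, fpr, tnr
prop.table(DT_cm_tbl, margin = 2)

# Row profiles (cf. Table 2.2): ppv, fdr, for, npv
prop.table(DT_cm_tbl, margin = 1)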
Thus, for the trained DT classifier, the tpr (tnr) value tells you that the
fraction of survivors (non-survivors) in the test dataset that were given the
correct classification is 0.722 (0.917). The classifier did a very good job at
classifying non-survivors but not as well for survivors. Furthermore, the value
obtained for ppv (npv) from the row profiles show that the fraction of correct
survivor (non-survivor) classifications is 0.826 (0.857). Classifying a passenger
as a non-survivor is more likely to be correct than classifying a passenger as
a survivor.
Figure 2.6 provides a visualization of the column profiles from the confusion
matrix for the DT classifier. The relative magnitudes of the relevant rates are
easily seen in this figure. For example, you can see that false positives occur at
a lower rate than false negatives. Use the following code to produce the figure.
FIGURE 2.6
Bar Chart of False & True Positive Rates
# Bar chart of false & true positive rates (column profiles) by actual class
DT_pv %>%
rename(Actual = Survived, Predicted = pred_class) %>%
mutate_at(.vars = c("Actual","Predicted"), .funs = fct_rev) %>%
ggplot(aes(x = Actual, fill = Predicted)) + theme_bw() +
scale_fill_manual(values = c("#CC0000","#5DADE2")) +
geom_bar(position = "fill") + ylab("Proportion") +
ggtitle("FPR & TPR")
You can use the vtree() function to obtain a more informative display of
the true and false rates (in percentage terms) from the column profiles; see the variable tree plot in Figure 2.7. The plot is quite informative since the
key counts are also displayed in addition to estimates of the class priors (i.e.,
prevalence of the classes) and the size of the test dataset.8
FIGURE 2.7
Variable Tree Display of True and False Rates
library(vtree)
DT_pv_cm %>% pluck(1) %>%
crosstabToCases() %>%
vtree("Actual Predicted", imagewidth="3in", imageheight="2in",
title = "Passengers in \n the Test Dataset")
Note that the magrittr package was loaded in the above code segment to
enable use of the exposition pipe operator %$%.9 Also, note that for each
measure we calculated, there are two functions in the yardstick package for
it. For example, you can also obtain sensitivity as shown below.
The above code segment shows you how to easily get a number of perfor-
mance measures in the same format as that shown in the above alternative
9 For the basic pipe operator %>%, you do not need to do this loading to use it if you have already loaded a package, such as dplyr or the tidyverse collection, that makes it available.
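For instance, sensitivity can be computed with either of two equivalent yardstick functions; a sketch using the vector versions, as elsewhere in this chapter:

# Two equivalent yardstick functions for the same measure
DT_pv %$% sens_vec(Survived, pred_class)
DT_pv %$% sensitivity_vec(Survived, pred_class)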
# accuracy
DT_pv %$% accuracy_vec(Survived, pred_class)
## [1] 0.848
Imbalanced datasets can pose a problem especially when the minority class
is very small. When the dataset is highly imbalanced, the trivial classifier that
assigns every case to the dominant class will have a high value for accuracy.
For example, an accuracy of 99.99% can easily be obtained by classifying
all record pairs as non-matches when record linkage is performed with large
databases [46]. As another example, the trivial classifier for the spam clas-
sification problem that assigns every email to the non-spam class will have
high accuracy when prevalence of spam is very low, i.e., when you have an
insignificant fraction of spam emails.
The true rates given by equations (1.11) and (1.12) have the following simple relationship with accuracy:

accuracy = tpr × prevalence + tnr × (1 − prevalence).    (2.3)
FIGURE 2.8
Accuracy vs Prevalence for Given True Rates
Clearly, accuracy varies linearly with prevalence. The intercept and slope
are given by tnr and (tpr − tnr), respectively. Figure 2.8 displays the linear
relationship for five different sets of values for the true rates. The plot shows
that accuracy is equal to the common value of the true rates when they
are equal regardless of prevalence. When tpr is greater (smaller) than tnr,
accuracy increases (decreases) with prevalence. One of the latter alternatives
is usually the case.
Note that there is also a simple relationship between accuracy, ppv, and
npv. This is given by
kappa = (Â − Ê) / (1 − Ê),    (2.4)

where Â is the accuracy measure (1.10) and Ê is an estimate of expected accuracy under chance agreement that is given by

Ê = [(tp + fp)(tp + fn) + (fn + tn)(fp + tn)] / (tp + fn + fp + tn)².

This formula is a special case of that given in Chapter 5.
This formula is a special case of that given in Chapter 5.
Values of the kappa measure above 0.5 indicate moderate to very good
performance, while values below 0.5 indicate very poor to fair performance.
Chicco and Jurman [14] compared accuracy, mcc and the F1-measure for
balanced and imbalanced datasets. They concluded that mcc is superior to the
other performance measures for binary classifiers since it is unaffected by im-
balanced datasets and produces a more informative and truthful performance
measure. Delgado and Tibau [22] showed that kappa exhibits an undesirable
behaviour in unbalanced situations, i.e., a worse classifier gets a higher value
for kappa, unlike what is seen for mcc in the same situation.
For the DT classifier, the values of these measures are positive but rela-
tively low in contrast to the optimistic 0.848 value obtained for accuracy.
# Cohen’s kappa
A <- (tp + tn) / (tp + fn + fp + tn)
E <- ((tp + fp)*(tp + fn) + (fn + tn)*(fp + tn)) /
(tp + fn + fp + tn)^2
(A - E) / (1 - E)
## [1] 0.657
You can, of course, use functions in the yardstick package to calculate these
overall measures. The required commands are given below.
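The commands themselves are not reproduced in this excerpt; a sketch using the vector versions of the relevant yardstick functions:

# Overall measures via yardstick (vector versions)
DT_pv %$% kap_vec(Survived, pred_class)   # Cohen's kappa
DT_pv %$% mcc_vec(Survived, pred_class)   # Matthews correlation coefficient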
balanced accuracy = (tpr + tnr)/2.    (2.6)
In light of (2.3), this measure may be interpreted as the value of accuracy that is obtained when the classification problem is perfectly balanced (i.e., when prevalence = 0.5). A related measure is the J-index defined by

J-index = tpr + tnr − 1.
Composite measures such as the F1-measure, the harmonic mean of ppv and tpr given by

F1-measure = 2 [(1/ppv) + (1/tpr)]^(−1) = 2 (ppv × tpr)/(ppv + tpr),    (2.8)

are often more relevant (taking the geometric mean yields a related measure
called G-measure). It is often a measure of choice when the focus in a binary
classification problem is on the positive class. This measure allows you to make
a trade-off between a classifier’s ability to identify cases in the positive class
and its ability to deliver correct positive predictions.
For example, with information retrieval systems, the goal is to retrieve as
many relevant items as possible and as few nonrelevant items as possible in
response to a request [93]. This requires use of classifiers that have high recall
and precision (hence, F1 -measure) because you want the classifier to identify
most of the relevant items and do it with high degree of accuracy.
Since (2.8) is focused on the positive class, we can also view it as a class-
specific performance measure. Also, note that it may be expressed as [46]
where p = (tp + fn)/(2tp + fn + fp); the expression for p was given in Hand and Christen [46] using different notation. This points to one significant conceptual
weakness of the F1 -measure, namely, the fact that the relative importance
assigned to precision and recall is classifier dependent. Hand and Christen [46]
highlighted this issue when evaluating classification algorithms for database
record linkage and suggested that the same p should be used for all methods
in order to make a fair comparison of the algorithms; see their article for how
this can be accomplished.
One way to obtain the composite measures for the DT classifier is through
the definitions for these measures. This, of course, is the obvious approach. It
is a useful one to take initially since it reminds you of how the measures are
defined.
# Class-Specific Measures
tpr <- tp / (tp + fn)
tnr <- tn / (fp + tn)
ppv <- tp / (tp + fp)
# balanced accuracy
(tpr + tnr) / 2
## [1] 0.819
# J-index
tpr + tnr -1
## [1] 0.638
# F1-measure
2*(ppv * tpr) / (ppv + tpr)
## [1] 0.77
Alternatively, you can use yardstick functions to obtain the measures.
# Composite Measures Using yardstick Functions
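# (The original commands are not shown in this excerpt; a sketch using the
#  vector versions of the relevant yardstick functions follows.)
DT_pv %$% bal_accuracy_vec(Survived, pred_class)   # balanced accuracy
DT_pv %$% j_index_vec(Survived, pred_class)        # J-index
DT_pv %$% f_meas_vec(Survived, pred_class)         # F1-measure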
FIGURE 2.9
Confusion Matrices for Totally Useless, Random, and Perfect Classifiers
Note that, when defining the function to obtain the performance measures
from a given key count vector, we did not use function() (this is usually
used to define R functions). Instead, we use the function composition syntax
from the magrittr package. For more information on this convenient syntax
for composing R functions, see Mailund [75, p. 79], for example. Note the
use of a period placeholder for the argument to the function. In this case, it represents a key count vector. You can infer this since it is the first argument to the con_mat() function.
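A sketch of what such a composed function might look like, assuming the con_mat() helper described in Appendix A.3 (the exact body used for the table of measures is not shown in this excerpt):

library(magrittr)

# A functional sequence: the period placeholder stands for a key count vector.
# Roughly equivalent to function(kcv) summary(con_mat(kcv)).
measures_fn <- . %>% con_mat() %>% summary()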
The second and fourth columns in the above tibble contain measures ob-
tained for the totally useless (TUC) and perfect classifier (PC), respectively.
The values in these columns represent the lower and upper limits of the range
of values for the corresponding metric. All measures for the perfect classi-
fier are equal to the upper limit of 1. With the exception of kappa, mcc and
J-index, measures for the totally useless classifier are all equal to 0 (the ex-
ceptions have the value −1 as lower limit). For the selected value of θ, all
measures for the random classifier (RC) take values at the mid-point of their
respective ranges (in general, the measures depend on the value of θ).
As a general rule, we would like our trained classifier to have values for
these measures that are (significantly) greater than those for the random clas-
sifier and as close to 1 as possible. This requirement is partially satisfied by
the LM classifier. As can be seen by the values for kappa, mcc and J-index,
the closeness to 1 requirement falls somewhat short of the ideal. The accuracy
value is relatively high for the classifier and is greater than the estimated “No
Information Rate” (whose estimated value here is 0.646). On the other hand,
the other two overall measures for the classifier suggest mediocre performance.
The class-specific measures provide additional insights. We have already
commented on sensitivity and specificity of the LM classifier in the first
chapter. The relatively high precision is offset by the mediocre recall
value. We see this reflected in the F1-measure. Although the classifier does a
relatively poor job at identifying positive cases, the correctness of the positive
classifications it does make is quite high.
In general, the practical relevance of the different class-specific measures
depends on the substantive problem. When deciding which to focus on, you
need to take into account the relative cost of classification errors, i.e., false neg-
atives versus false positives. You can also think of comparing the cost of false
omissions versus that of false discoveries. For example, in spam classification,
false discovery is more serious than false omission. Hence, for this applica-
tion, precision takes precedence over recall. On the other hand, sensitivity
is more important for classification tasks involving disease detection, since it is
important to control the occurrence of false negatives.
and
$$P(Y = 2) = 1 - P(Y = 1), \qquad P(\hat{Y} = 1 \mid Y = 2) = 1 - P(\hat{Y} = 2 \mid Y = 2).$$
When you use information in Table 2.3 to substitute the probabilities by the
corresponding performance parameters, you get the desired result for PPV.
Similarly, by applying these rules to re-express P(Y = 2 | Ŷ = 2), you get a
similar expression for NPV; see Kuhn and Johnson [68, p. 41] for the expression
that you wind up with.
where p, q, and r are positive fractions. From this jpmf, it easily follows that
$$P(Y = 1) = p, \qquad P(\hat{Y} = 1 \mid Y = 1) = q, \qquad P(\hat{Y} = 2 \mid Y = 2) = r.$$
Note that p represents Prevalence and, in light of Table 2.3, it follows that
q and r coincide with True Positive Rate and True Negative Rate, respectively.
These three probabilities also determine all other performance parameters
in Table 2.3. For example, you can re-express (2.9) as
$$PPV = \frac{qp}{qp + (1 - r)(1 - p)}.$$
$$\begin{aligned}
Accuracy &= P(\hat{Y} = Y) \\
&= P(\hat{Y} = 1, Y = 1) + P(\hat{Y} = 2, Y = 2) \\
&= P(\hat{Y} = 1 \mid Y = 1)P(Y = 1) + P(\hat{Y} = 2 \mid Y = 2)P(Y = 2) \\
&= qp + r(1 - p).
\end{aligned}$$
results when predicted and actual responses are cross-tabulated. Given a re-
alization (tp, f n, f p, tn) of this random vector, the likelihood function may be
expressed as
$$L(p, q, r) \propto (pq)^{tp}\,[p(1 - q)]^{fn}\,[(1 - p)(1 - r)]^{fp}\,[(1 - p)r]^{tn}.$$
Next, set the right-hand side of (2.11) equal to the 3 × 1 zero vector and solve
the resulting system of equations to obtain
$$\hat{p} = \frac{tp + fn}{tp + fn + fp + tn}, \qquad \hat{q} = \frac{tp}{tp + fn}, \qquad \hat{r} = \frac{tn}{fp + tn}.$$
The Hessian matrix of ln L is
$$\begin{pmatrix}
-\dfrac{fn + tp}{p^2} - \dfrac{fp + tn}{(1 - p)^2} & 0 & 0 \\
0 & -\dfrac{tp}{q^2} - \dfrac{fn}{(1 - q)^2} & 0 \\
0 & 0 & -\dfrac{tn}{r^2} - \dfrac{fp}{(1 - r)^2}
\end{pmatrix}.$$
This matrix is negative definite if p̂, q̂, and r̂ are all positive fractions. Under
this condition, the solutions given above for the likelihood equations are the
maximum likelihood (ML) estimates of p, q, and r (the ML estimators follow
when you replace the key counts by the corresponding random variables).
Clearly, these estimates coincide with the descriptive summaries prevalence, tpr,
and tnr, respectively, that we considered earlier.
By the invariance property of MLEs, you can get the MLEs of the re-
maining estimands in Table 2.3 by substituting p̂, q̂ and r̂ where appropriate.
As seen in Table 2.4, the resulting MLEs coincide with the estimators in Ta-
ble 2.3. For example, after making the required substitutions and simplifying,
the MLE of Accuracy is
$$(1 - \hat{p})\hat{r} + \hat{p}\hat{q} = \frac{TN + TP}{TN + FP + FN + TP}.$$
Finally, note that the expected value of the negative Hessian matrix is
equal to nI(p, q, r), where
$$I(p, q, r) = \begin{pmatrix}
\dfrac{1}{p(1 - p)} & 0 & 0 \\
0 & \dfrac{p}{q(1 - q)} & 0 \\
0 & 0 & \dfrac{1 - p}{r(1 - r)}
\end{pmatrix}.$$
# Key Counts
tp <- 57; fn <- 22; fp <- 12; tn <- 132
# MLE of p, q and r
(p <- (tp + fn) / (tp + fn + fp + tn)) # Prevalence
## [1] 0.354
(q <- tp / (tp + fn)) # TPR
TABLE 2.4
MLEs of Binary Performance Parameters

Performance Metric                 Estimand                                MLE
Accuracy                           pq + (1 − p)r                           (TP + TN)/(TP + FN + FP + TN)
Detection Prevalence (DP)          (1 − p)(1 − r) + pq                     (TP + FP)/(TP + FN + FP + TN)
Detection Rate (DR)                pq                                      TP/(TP + FN + FP + TN)
True Positive Rate (TPR)           q                                       TP/(TP + FN)
False Negative Rate (FNR)          1 − q                                   FN/(TP + FN)
False Positive Rate (FPR)          1 − r                                   FP/(FP + TN)
True Negative Rate (TNR)           r                                       TN/(FP + TN)
Positive Predicted Value (PPV)     pq/[(1 − p)(1 − r) + pq]                TP/(TP + FP)
False Discovery Rate (FDR)         (1 − p)(1 − r)/[(1 − p)(1 − r) + pq]    FP/(TP + FP)
False Omission Rate (FOR)          p(1 − q)/[(1 − p)r + p(1 − q)]          FN/(FN + TN)
Negative Predicted Value (NPV)     (1 − p)r/[(1 − p)r + p(1 − q)]          TN/(FN + TN)
## [1] 0.722
(r <- tn / (fp + tn)) # TNR
## [1] 0.917
You can obtain the maximum likelihood estimates of other key parameters
like Accuracy, P P V , and N P V in two ways. The first approach is to use the
relevant formulas in the third column of Table 2.4, i.e., use the formulas we
have seen before. Here, we used the second approach which makes use of the
formulas in the second column and the MLEs of p, q, and r.
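For completeness, a minimal sketch of the second approach follows (the layout is illustrative; the key counts are those used above, and the approximate values in the comments are obtained by direct arithmetic).
# Second approach: plug the MLEs of p, q and r into the second-column
# formulas of Table 2.4 (key counts as given earlier)
tp <- 57; fn <- 22; fp <- 12; tn <- 132
p <- (tp + fn) / (tp + fn + fp + tn)   # Prevalence
q <- tp / (tp + fn)                    # TPR
r <- tn / (fp + tn)                    # TNR
p * q + (1 - p) * r                          # Accuracy (approx. 0.848)
(p * q) / ((1 - p) * (1 - r) + p * q)        # PPV (approx. 0.826)
((1 - p) * r) / ((1 - p) * r + p * (1 - q))  # NPV (approx. 0.857)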
techniques for some key parameters will be discussed. To obtain the required
inferences for the DT classifier, the relevant information is contained in the
DT_pv tibble, which holds the actual and predicted classes obtained
when the classifier was applied to the test dataset; see Section 2.2.3.
$$\sum_{x=0}^{x_0} \binom{n}{x}\, \theta_U^{\,x}\,(1 - \theta_U)^{\,n-x} = \frac{\alpha}{2},$$
for positive integers x0 and n and 0 < θU < 1. This result is a re-expression of
that given in Olkin et al. [86, p. 461] for the relationship between cumulative
probabilities of the beta and binomial distributions.
Hence, the limits of the required confidence interval may be obtained by
finding the relevant quantiles from the appropriate beta distributions, i.e.,
$$(\theta_L, \theta_U) = \left( Q_B\!\left(\tfrac{\alpha}{2};\; x_0,\; n - x_0 + 1\right),\; Q_B\!\left(1 - \tfrac{\alpha}{2};\; x_0 + 1,\; n - x_0\right) \right),$$
where Q_B(p; a, b) is the p-th quantile of a Beta(a, b) distribution.
The above confidence interval may be found using the acc_ci() function
given in Appendix A.3. Using data in DT_pv, we obtain the 95% confidence
interval for the Accuracy parameter given below.
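The acc_ci() code itself is not reproduced here, but the interval can be sketched directly from the beta-quantile expression above (a minimal illustration using the DT key counts; x0 is the number of correct classifications and n the number of test cases).
# Clopper-Pearson style interval for Accuracy via beta quantiles
tp <- 57; fn <- 22; fp <- 12; tn <- 132
x0 <- tp + tn              # number of correct classifications
n  <- tp + fn + fp + tn    # number of test cases
alpha <- 0.05
c(lower = qbeta(alpha / 2, x0, n - x0 + 1),
  upper = qbeta(1 - alpha / 2, x0 + 1, n - x0))
# base R's binom.test(x0, n)$conf.int returns the same interval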
When both performance measures are relevant, you can consider testing
$$H_0 : (q, r) \notin \Psi \quad \text{versus} \quad H_1 : (q, r) \in \Psi, \qquad (2.14)$$
where
$$\Psi = \{(q, r) : q_0 < q < 1,\; r_0 < r < 1\}.$$
A special case of this problem is
where, as usual, Φ(·) is the standard normal CDF. The above approximate
P-value follows from the asymptotic normality and independence of the MLEs.
For the DT classifier, the P-value of the test of (2.15) with c0 = 0.5 shows
that H0 can be rejected at 5% level of significance.
For binary classification problems, you can also use the test proposed
by McNemar [79] to test the hypothesis of marginal homogeneity, i.e., that the
marginal distributions of the actual and predicted responses are the same. Since
we are dealing with binary responses, it suffices to state the null hypothesis
as
$$H_0 : P(\hat{Y} = 1) = P(Y = 1). \qquad (2.16)$$
This null hypothesis is equivalent to stating that the chance of getting a false
positive is equal to that of getting a false negative. To see this, note that H0
may be expressed as P(Ŷ = 1, Y = 2) = P(Ŷ = 2, Y = 1), and the corresponding
alternative is
$$H_1 : P(\hat{Y} = 1, Y = 2) \neq P(\hat{Y} = 2, Y = 1).$$
The test statistic is
$$T = \frac{(FP - FN)^2}{FP + FN},$$
and T ∼ χ²₁ provided FP + FN is sufficiently large (i.e., at least 25).
Marginal homogeneity of predicted and actual responses is one of several
criteria that can be used to assess the quality of a classifier. Its presence
provides one indication of good performance by the classifier.
In light of the P-value computed below, the null hypothesis of homogene-
ity for the marginal distributions of the actual and predicted responses is not
rejected for the DT classifier at the usual 5% level of significance. Thus, for
this classifier, there is no evidence that the chance of getting a false positive
differs from that of getting a false negative.
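The P-value calculation is not shown above; a minimal sketch based on the statistic T defined earlier (using the DT key counts and no continuity correction) is given below.
fp <- 12; fn <- 22                            # key counts for the DT classifier
T_stat <- (fp - fn)^2 / (fp + fn)             # McNemar test statistic
pchisq(T_stat, df = 1, lower.tail = FALSE)    # approx. 0.086 > 0.05
# stats::mcnemar.test() gives a similar result (with a continuity correction)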
Other statistical tests have been used in CPA. A number of these arise
when the problem involves comparative analysis of competing classifiers and
in multiclass CPA. We’ll mention some of these in the fourth and fifth chapters.
2.6 Exercises
1. The following confusion matrix from Zumel and Mount [119, p. 176]
was obtained for a logit model classifier in a spam classification
problem.
prediction
truth FALSE TRUE
non-spam 264 14
spam 22 158
(a) Show how to put the given confusion matrix in the format
shown in Table 1.1.
(b) What is the precision of the logit model classifier under evalu-
ation?
(c) Of the emails that were non-spam, what fraction was incorrectly
classified?
2. Consider classifiers A and B with the key count vectors given below.
The counts are from two confusion matrices that were given in
Provost and Fawcett [90, p. 191]. The authors used these array
summaries to illustrate issues with the accuracy measure when class
imbalance is significant.
(a) Obtain the confusion matrices and take note of accuracy and
prevalence for the two classifiers.
(b) Compute tpr and tnr for the two classifiers.
(c) Compare the accuracy of the two classifiers if prevalence is 0.1
instead of 0.5. Assume the same true rates as those obtained in
part (b) for the classifiers.
(d) Comment on the results obtained.
3. You can use the crosstabToCases() function in the vtree package
to obtain the actual and predicted classes when given a confusion
matrix as a "table" object. Part of the code for the con_mat()
function in Appendix A.3 makes use of this function.
where
$$\rho_{+} = \frac{sensitivity}{1 - specificity} \quad \text{and} \quad \rho_{-} = \frac{1 - sensitivity}{specificity}$$
refer to composite performance measures called the positive likelihood
and negative likelihood, respectively.
(a) Write an R function to compute dp. Assume the input to your
function is a vector of key counts from a binary confusion ma-
trix. Allow your function to return either dp or a list containing
this value and the likelihoods.
(b) Apply your function to compute dp for the decision tree classi-
fier that we covered in the text; see Figure 2.5 for the confusion
matrix. Interpret the value you obtain. Note that some guid-
ance is available in Sokolova et al. [102].
8. The three confusion matrices in Fernandez et al. [36, p. 50] are not
in our preferred format; see Table 1.1.
(a) Identify the corresponding 1×4 key count vectors. Use the same
format as that in (1.9).
(b) Construct the corresponding confusion matrices in our pre-
ferred format.
(c) Obtain the corresponding overall threshold measures.
(d) Comment on your results.
3
Classifier Performance Curves
The performance measures discussed in the previous chapter have one serious
disadvantage for an important category of classifiers. For binary classifica-
tion problems, scoring classifiers rely on choice of an appropriate threshold
to decide whether a case belongs to the positive class. Threshold performance
measures like accuracy and class-specific ones like sensitivity and specif icity
depend on choice of the threshold. To see this, it suffices to note that for
classifiers with classification rules like that given by (1.2) in Chapter 1, the
number of positive classifications decreases as we increase the threshold t.
Performance curves like those that we will discuss in this chapter provide a
way to deal with the abovementioned issue. Another advantage of such curves
is the fact that they not only give you a visualization of classifier perfor-
mance, but they also provide informative scalar metrics that cover aspects of
performance not included in threshold measures. The ROC curve of a scoring
classifier is a prime example of performance curves. Figure 1.4 in the first
chapter is one example of such a curve.
In this chapter, we provide further details on ROC curves and performance
measures that you can derive from them. We also discuss Precision-Recall
(PR) curves that may be more useful in some applications. We omit discus-
sion of other performance curves like profit and lift curves. Interested readers
may find some information on these curves in Provost and Fawcett [90], for
example.
chapter. Its evaluation will be completed in the next chapter when we compare
its performance with that of three other classifiers for the same classification
problem.
In the above code segment, we obtain Train_data by (i) using recipe()
and step_range() to create a "recipe" object, i.e., a data structure that rep-
resents the required preprocessing pipeline, (ii) applying prep() to this object
to obtain any estimates required for the preprocessing (here, this means find-
ing the minimum and maximum values), and (iii) applying juice() to the result
to extract the preprocessed training data. All the abovementioned functions
are from the recipes package.
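Since that code segment appears earlier in the book and is not reproduced here, the following is a rough sketch of the pipeline just described (the name Titanic_train for the raw training data is hypothetical).
library(recipes)
# (i) specify the recipe: rescale numeric predictors to [0, 1]
# (ii) prep() estimates the required min/max values from the training data
# (iii) juice() extracts the preprocessed training data
Train_data <- recipe(Survived ~ ., data = Titanic_train) %>%
  step_range(all_numeric_predictors()) %>%
  prep() %>%
  juice()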
Training of the classifier was done using the mlp() function (from the parsnip
package) with nnet as the engine. To display the fitted NN classifier, we used
the plotnet() function from the NeuralNetTools package.
FIGURE 3.1
Neural Network Classifier for Titanic Survival Classification
where t is a suitable threshold (with 0.5 as default value). The left-hand side
of (3.1) is the value of the output node (the right-most node). This value is
the positive class membership score that is assigned by the NN classifier to a
passenger with features captured by the input nodes (the five left-most nodes).
Here, it tells you how strong the evidence is for a passenger to be classified
as a survivor (recall, survivors of the Titanic sinking belong to the positive
class). The various terms on the left-hand side of (3.1) are as follows:
• ϕ_h(·) is the activation function for the h-th hidden node that is used to
transform $\beta_h + \sum_{i=1}^{5} w_{ih} x_i$ to obtain the node output,
• ω_h is the weight given to the connection between the h-th hidden node and
the output node,
• α is the bias attached to the output node,
• ϕ_0(·) is the activation function to transform the value of the expression
within the outer brackets to obtain the score for the output node.
The bias terms and connection weights are determined by the backpropa-
gation algorithm; see Chapter 8 in Roiger and Geatz [97] for an outline of this
algorithm and a detailed example illustrating it. For the fitted neural network
in Figure 3.1, the complete set of values for the weights and bias terms may
be obtained by using the code below (output omitted).
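That code is not reproduced here; one hedged way to inspect the weights (assuming NN_fit is the fitted parsnip model object) is sketched below.
library(NeuralNetTools)
# Pull out the underlying nnet object and list its weights and bias terms
NN_fit %>%
  parsnip::extract_fit_engine() %>%
  neuralweights()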
# Obtain Predictions
NN_pv <-
bind_cols(
predict(NN_fit, new_data = Test_data, type = "prob"),
predict(NN_fit, new_data = Test_data, type = "class"),
Test_data %$% Survived
) %>%
set_names(c("prob_No", "prob_Yes", "pred_class", "Survived"))
The variables in NN_pv have the same labels as in the DT_pv tibble of the last
chapter. The information we need in this chapter is contained in prob_Yes
and Survived. Note that "Yes" is the second level for the two factor variables
in NN_pv. Since we want it to be the first level, we make the required change
for both factors.1
1 The change is also made for pred_class since we will use it in the next chapter to
compare threshold performance measures for the NN classifier with others for the problem
in Section 2.2.
By varying t and plotting the resulting pairs, you can obtain the ROC curve
of the NN classifier. This is how you can proceed in the absence of a function
like roc_curve() from the yardstick package. When we use this function, we
obtain the performance curve shown in Figure 3.2.
library(yardstick)
NN_pv %>% roc_curve(Survived, prob_Yes) %>% autoplot()
FIGURE 3.2
ROC Curve for the NN Classifier
benefits (i.e., high tpr) are rapidly gained at low cost (i.e., low fpr). This
is a desirable quality. However, it is offset by the slowness in gaining
maximal benefit as cost increases. In other words, the curve falls somewhat
short of what is expected of the ideal ROC curve given in the first chapter.
The shortfall in performance may be quantified by examining the AUC, i.e.,
the area under the curve (we discuss this in the next section).
What else can you learn from the ROC curve? The answer is that the curve
has something to say about the separability of scores (or class membership
probabilities, when applicable). More precisely, it tells you something about
the separation between the distribution of positive class membership scores
(or probabilities) for cases in positive and negative classes. Good separation
(in the right direction) is needed for good performance to be supported by the
ROC curve (and by its AU C). Figures 3.3 and 3.4 demonstrate this point.
Figure 3.3 illustrates different degrees of separation in the distributions
of positive class membership scores for the two classes. Each ROC curve in
Figure 3.4 is identified by the quality of separation. This quality refers to how
well the distributions of positive class membership scores separate for the two
classes in a binary classification problem.
Good separation is required in order for the ROC curve to lie in the upper
left rectangle of the unit square. This is a basic requirement of a good classifier.
It is important to note that the separation needs to be in the right direction.
The plots in the top right panel of both figures show why this is so (this does
not mean that the corresponding classifier is useless because you can modify
the classification rule for it to function like a classifier with the ideal ROC
curve). For classifiers like those represented by (1.2) in the first chapter, the
distribution of positive class membership scores for the positive class should
be to the right of the one for the negative class.
FIGURE 3.3
Separability of Positive Class Membership Score Distributions
FIGURE 3.4
Impact of Difference in Separability on ROC Curves
Although ROC curves are widely used to evaluate classifiers in the presence
of class imbalance, they can be optimistic for severely imbalanced classification
problems with few cases of the minority class [10, 19]. To overcome this problem,
Drummond and Holte [26] suggested the use of cost curves.
## # A tibble: 4 x 5
## plot AUC accuracy precision recall
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Good Separation 1 0.99 1 0.975
## 2 Good Separation But? 0 0 0 0
## 3 Poor Separation 0.519 0.55 0.442 0.475
## 4 Reasonable Separation 0.713 0.64 0.534 0.775
There are several points to note from the above results. First, they show that
good separation in score distributions can result in AUC = 1 (the value one
gets for the ideal ROC curve) even though the corresponding accuracy is
less than ideal. Second, good separation but incorrect relative positioning of the
score distributions yields a totally useless classifier if you do not re-define
the positive class. Third, when there is poor separation in the distributions,
the classifier functions like one that makes random classifications.
For the NN classifier, the AU C is equal to 0.829. You can infer from
this that separability of the score distributions is more than reasonable for
this classifier. The AU C value is significantly better than the 0.5 value for a
random classifier. Following the interpretation guidelines given in Table 4 in
Nahm [82], you can conclude that the value indicates good performance by
the classifier.
$$\hat{A} = \frac{S - n_1(n_1 + 1)/2}{n_1 n_2}, \qquad (3.3)$$
where S is the sum of the ranks of the $S_{1|1}$ scores when these values are combined
with the $S_{1|2}$ scores and the resulting set of (n1 + n2) values is arranged in
increasing order (the authors also provided a brief argument that links (3.3)
to the AUC of a step function ROC curve). Note that scores in Hand and Till
[56] refer to estimated class membership probabilities. The same is true for
scores in Kleiman and Page [66]; see their paper for an alternative formula for
the estimator of the right-hand side of (3.2).
To illustrate calculation of (3.3), consider using information in the small
dataset given in Nahm [82, p. 29]; the author used the data to construct the
ROC curve in his article. The simple step function form of the curve allows
you to easily calculate the area under it by elementary geometric arguments.
Thus, you can easily check the correctness of the value that we obtain for Â in
the following code segment.2
2 This check is left for the reader to do in Exercise 3; two other ways of getting the AU C
## # A tibble: 10 x 3
## case cancer marker
## <int> <fct> <dbl>
## 1 1 No 25.8
## 2 2 No 26.6
## 3 3 No 28.1
## 4 4 Yes 29
## 5 5 No 30.5
## 6 6 No 31
## 7 7 No 33.6
## 8 8 No 39.3
## 9 9 Yes 43.3
## 10 10 Yes 45.8
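The code segment itself is not shown above; a minimal sketch of the calculation (assuming the data are stored in a tibble called nahm_tb with the columns shown, and treating "Yes" as the positive class) is as follows.
# Estimate of A-hat in (3.3): S is the sum of the ranks of the positive-class
# marker values among all n1 + n2 values
n1 <- sum(nahm_tb$cancer == "Yes")
n2 <- sum(nahm_tb$cancer == "No")
S  <- sum(rank(nahm_tb$marker)[nahm_tb$cancer == "Yes"])
(S - n1 * (n1 + 1) / 2) / (n1 * n2)   # approx. 0.81 for these data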
$$AUC = 1 - \frac{L_d}{2\pi(1 - \pi)}, \qquad (3.6)$$
3 Henceforth, we’ll use the term H-measure as in Hand and Anagnostopoulos [53] rather
where L_d is what you get for (3.5) when you set the cost distribution w(·)
equal to the mixture score density function d(·).
Here, f(·) and g(·) are the density functions corresponding to F(·) and G(·), re-
spectively. It is important to note that, in general, the density function d(·) is
not the same for different classifiers [50, 53]. When used to obtain L_d, this
means the result is obtained using a cost distribution that differs across
classifiers! This is a fundamental flaw that points to the incoherence of
AUC highlighted by Hand [50], since the cost distribution is a property of the
classification problem, not the classifier.
With sufficient domain knowledge, you can specify w(·) but, although this
may be ideal, practical application of this approach can be problematic. Thus,
it is useful to have a standard specification that produces the same summary
measure from the same data [53]. Such a specification is given by the de-
fault beta distribution recommended by Hand and Anagnostopoulos [54]. In
practice, their suggested Beta(α, β) distribution with density function
$$w(c) = \frac{c^{\alpha - 1}(1 - c)^{\beta - 1}}{B(\alpha, \beta)}, \qquad 0 < c < 1,$$
where α = 1 + π and β = 2 − π, is conditional on π because of the uncertainty
in prevalence; this uncertainty may be modeled by a Beta(2, 2) distribution.
In any case, the same weight distribution should be used when comparing
different classifiers for a given problem.
A Monte Carlo approach based on random variates from the two beta
distributions mentioned earlier may be used to estimate H-measure; see Hand
and Anagnostopoulos [53] for a revised expression of the measure and other
relevant details involved in this approach. Fortunately, we are spared the effort
since the HMeasure() function in the hmeasure package may be used to
obtain the required estimate (for convenience, we use the wrapper function
HM_fn() that is given below because we want only the value of the H-measure
to be returned when a call is made to calculate it).
library(hmeasure)
HM_fn <- function(grp, prob_Y){
HMeasure(grp, prob_Y) %>% .$metrics %$% H
}
NN_pv %$% HM_fn(Survived, prob_Yes)
## [1] 0.483
The above calculation shows that the reduction in expected minimum mis-
classification loss for the NN classifier is about 48% of the corresponding loss
for a random classifier. This improvement in performance is rather mediocre.
FIGURE 3.5
PR Curve for the LM Classifier in Chapter 1
## # A tibble: 3 x 4
## c t recall precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.3 -0.847 0.887 0.773
## 2 0.5 0 0.787 0.894
## 3 0.7 0.847 0.680 0.953
The PR curve for the neural network (NN) classifier is given in Figure 3.6.
The AU C in this case is slightly more than twice the estimated prevalence
but still falls somewhat short of the ideal value of 1.
# Prevalence
NN_pv %$% table(Survived) %>% prop.table() %>% pluck("Yes")
## [1] 0.354
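The PR AUC referred to above can be computed with yardstick's pr_auc() function; a minimal sketch (output omitted) is:
library(yardstick)
# Area under the PR curve for the NN classifier
NN_pv %>% pr_auc(Survived, prob_Yes)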
FIGURE 3.6
PR Curve for the NN Classifier
PR curves are more informative than ROC curves when dealing with highly
imbalanced data. Saito and Rehmsmeier [99] noted that use of ROC curves for
imbalanced problems can lead to deceptive conclusions about the reliability
of classification performance; on the other hand, PR curves are more reliable
because they involve the fraction of correct positive predictions (this is what
precision quantifies). Davis and Goadrich [19] noted that looking at PR curves
can expose differences between classification algorithms that are not apparent
in ROC space; see Figure 1 in their paper for an illustration of this point.
They also showed that a curve dominates in ROC space if and only if
it dominates in PR space, and gave a counterexample to show that an
algorithm that optimizes the area under the ROC curve is not guaranteed to
optimize the area under the PR curve.
3.4 Exercises
1. Consider the decision tree classifier displayed in Figure 2.3 for the
Titanic survival classification problem. Use the predictions con-
tained in the DT_pv.csv file for this exercise (the file is available
on the publisher's website). This file contains the same information
as in the DT_pv tibble that was obtained earlier when the classifier
was applied to some test data.
(a) Obtain density plots of positive class membership probabilities
for the two classes.
(b) Obtain the ROC curve and corresponding AU C.
(c) Compute the H-measure.
## # A tibble: 10 x 3
## prob_Yes pred_class group
## <dbl> <fct> <fct>
## 1 0.28 No No
## 2 0.547 Yes No
## 3 0.43 Yes No
## 4 0.222 No No
## 5 0.225 No No
## 6 0.674 Yes Yes
## 7 0.276 No No
## 8 0.613 Yes Yes
## 9 0.754 Yes Yes
## 10 0.845 Yes Yes
Data on the first three variables are from Fawcett [35, p. 864].
4
Comparative Analysis of Classifiers
When solving a classification problem, you would typically use the same
dataset to train several different classifiers and then perform a comparative
analysis of them with the same validation (e.g., test) dataset. This, of course,
is an essential part of the process of classifier construction for the problems
that you encounter in practice. Such analysis is also relevant when you attempt
to evaluate a new classification algorithm.
One important issue you face when attempting to compare classifiers is
the choice of performance metric to use in the comparison. Some questions
you need to keep in mind when dealing with this issue for binary classification
include the following.
• Do you have an imbalanced classification problem?
• Is your focus on predicting class labels or positive class membership proba-
bilities?
• Is the positive class more important?
• Are the two misclassification errors (i.e., false positives and false negatives)
equally important?
• When errors have unequal cost, which is more costly?
For imbalanced problems, you can find some guidance on choosing a per-
formance measure in the flowchart given by Brownlee [10, p. 46]. The flowchart
makes use of the answers to the last four questions listed above to guide you
in your selection. Typically, these answers require you to take into account the
business goals and/or research objectives underlying the classification problem
you want to solve.
For example, assuming the usual definition of a positive case, minimizing
occurrence of false negative errors is important in coronary artery disease
prediction; see Akella and Akella [1] for the reason why. On the other hand,
in spam filtering applications, it is important for precision to be very high
in order to have good control over occurrence of false discoveries (which you
can achieve by giving more emphasis to controlling the occurrence of false
positives).
One issue that you can raise with Brownlee's flowchart is the fact that the
available metrics to choose from are restricted to one of the following: accuracy,
Fβ-score for β ∈ {0.5, 1, 2}, G-mean, Brier score, and the AUCs for ROC
and PR curves. This limitation is in part due to the focus on imbalanced clas-
sification. Another issue worth keeping in mind when making your selection
from the highlighted list of measures is the fact that problems can arise
with some of them; see Provost et al. [91], Hand [50], and Hand and
Christen [46], for example.
There are other measures that you can use when faced with problematic
metrics. Some of these arise in connection with questions like the following:
• Which classifier provides the best predictive accuracy after you account for
the possibility of correct prediction by chance alone?
• Which classifier provides greatest reduction in expected minimum misclas-
sification loss?
• Which classifier has the best ability to distinguish between positive and
negative cases?
The above questions lead to measures like Cohen’s kappa, H-measure and
Discriminant P ower. We will revisit the use of the measures highlighted so
far and others in this chapter.
To show how R can be brought to bear on the problem, we will ignore the
measure selection issue and proceed to demonstrate how to use the software
and various measures in a comparative analysis. Of course, in practice, you will
usually base your analysis on a narrower selection of metrics (e.g., see Akella
and Akella [1]) after taking into account the relevant factors and questions
you have about your problem. However, it is important to keep in mind that
the various measures deal with different aspects of performance, and empirical
comparisons between measures that capture different aspects are of limited value
[52]. This remark is noteworthy in light of studies such as those by Halimu
et al. [45] and Chicco and Jurman [15]; their studies resulted in different
conclusions about the relative merits of AU C and mcc.
To illustrate what is involved in a comparative analysis of competing clas-
sifiers, we revisit the Titanic survival classification problem. We have already
developed two machine learners for this problem, namely, a decision tree (DT)
classifier in Chapter 2 and a neural network (NN) classifier in Chapter 3. In
this chapter, we train two additional classifiers, and compare the relative per-
formance of the available classifiers with help from the measures and curves
that we studied in the last two chapters. We will initially take a descriptive
approach in Section 4.3, and leave it to Section 4.4 to discuss some examples
on whether any observed difference in performance is statistically significant.
Next, we check the levels of pred_class and Survived to see whether they
are in the required order.
FIGURE 4.1
Illustrating a Random Forest Classifier
As can be seen, "Yes" is the second level of the two factor (i.e., categorical)
variables. This is a consequence of leaving it at that level when fitting the logit
model, because the glm() function in the stats package models the logit of
the outcome defined by the second level.
Since we'll be using yardstick for performance analysis, it is convenient to
make "Yes" the first level of pred_class and Survived, since it determines the
positive class for our problem. This makes it the reference level and obviates
the need to specify the event_level argument when using the functions in
this package.
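A minimal sketch of this releveling step (LM_pv is a hypothetical name for the tibble holding the logit model's predictions; fct_relevel() is from the forcats package):
library(dplyr)
library(forcats)
# Make "Yes" the first (reference) level of both factors
LM_pv <- LM_pv %>%
  mutate(across(c(pred_class, Survived), ~ fct_relevel(.x, "Yes")))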
from the ensemble is returned as the assigned class. Figure 4.1 illustrates
what is involved; see Zumel and Mount [119, p. 362] for an illustration of the
process involved in growing a random forest.
When constructing a random forest classifier, you typically have several
decisions to make on issues like decision tree complexity, number of trees in
the ensemble, number of features to consider for a given split, and so on. There
are hyperparameters that allow you to control these aspects when training such
a classifier. Usually, a resampling technique (e.g., 10-fold cross-validation) is
used to obtain suitable values for them. An example of how to do this will
be given in Chapter 6. For now, we train the RF classifier (for the Titanic
survival classification problem) using default values for the hyperparameters.
When you run the code, you’ll see agreement with the DT classifier on what
constitutes the most important feature for the classification problem. There
is, however, disagreement on the relative importance of the remaining three
features.
With pv_lst at hand, you can easily obtain array summaries, performance
measures, and curves for the competing classifiers. We demonstrate how to
obtain the required metrics in later sections of this chapter.
the information in the tibble by classifiers. To obtain such a tibble, you can
proceed as follows.
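A minimal sketch of one way to do this (assuming pv_lst is a named list of per-classifier prediction tibbles with identical columns):
library(dplyr)
# Stack the per-classifier tibbles; .id records the classifier label
pv_tb <- bind_rows(pv_lst, .id = "classifier")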
needed. Exercise 1 explores simpler alternative ways to get the required array summaries.
FIGURE 4.2
Tile Plots of the Confusion Matrices
A quick scan of Figure 4.2 or the tibble in the above code segment shows
that the DT classifier has a slight edge over the NN and RF classifiers, with the
worst performing one being the LM classifier due to its relatively high number
of false positives.4 The tibble makes it easier to compare the key counts of
the competing binary classifiers. The relatively similar FN values are clearly
evident. We can say the same for three of the FP values; the one outlier is
due to the poor performance of the LM classifier.
between actual and predicted classes (for a given threshold, usually the de-
fault). The relevant measures to use in the comparative analysis depend on
issues like whether there is serious class imbalance, the relative cost of classi-
fication errors and so on.
A convenient way to get the measures that were discussed in Chapter 2 for
the competing classifiers is to use information in the pv_tb from Section 4.2.2
and the conf_mat() function to obtain a "conf_mat" object for each classifier,
followed by application of the summary() function to extract the measures.
The results you obtain may then be assembled into a single tibble
as shown below.
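A minimal sketch of this step (assuming pv_tb has columns classifier, Survived, and pred_class; the book's own code may differ in details):
library(tidyverse)
library(yardstick)
pm_tb <- pv_tb %>%
  nest(data = -classifier) %>%
  mutate(pm = map(data, ~ .x %>%
                    conf_mat(Survived, pred_class) %>%   # "conf_mat" object
                    summary())) %>%                      # 13 measures each
  select(classifier, pm) %>%
  unnest(pm) %>%
  select(classifier, .metric, .estimate)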
For each classifier, there are 13 measures in pm_tb but note that there
are some redundancies (e.g., both ppv and precision are included). A useful
approach to use when you proceed to examine the measures is to look at
relevant subgroups of measures. For example, you can start by looking at
measures that tell you something about overall performance like those given
in the next code segment and displayed in Figure 4.3. Your choice on which
measure to focus on in this group depends on its interpretability, the problem,
and what aspect of overall performance matters to you.
Usually, accuracy is the first measure that comes to mind. But, is it ap-
propriate for the problem you face? This largely depends on the extent of
class imbalance in your classification problem. When imbalance is not severe,
FIGURE 4.3
Bar Plots of Overall Performance Measures
accuracy does provide useful information; see Provost et al. [91] for issues in-
volved in use of this measure. For the problem at hand, there is some degree of
class imbalance but it is not severe (if it was, accuracy would not be a suitable
measure to consider). The accuracy values of all the competing classifiers are
significantly greater than 0.646 (this is the estimated NIR that we saw in Sec-
tion 2.5.2). When compared, their values reflect the initial impression on rela-
tive performance that was obtained when we examined the confusion matrices
in Section 4.3.1. The same can be said about the other two overall measures.
The agreement in relative rankings of the competing classifiers shown by the
three overall measures cannot be expected to occur for other comparisons.
This means that you often have to make a choice between these measures.
If you have to narrow your choice to one of the three overall measures,
which should you choose? Chicco and Jurman [14] have argued that mcc is
preferred to accuracy (and the F1-measure). Arguments against the use of kappa
have been provided by Delgado and Tibau [22] and others. Furthermore, re-
search by Powers [88] suggests that mcc is one of the best balanced measures.
Given the arguments presented by these researchers, the measure that mer-
its serious consideration is mcc, unless there are compelling issues in your
problem that suggest otherwise.
What other groups of measures should you examine? One approach to deal
with this issue is to consider complementary pairs of class-specific metrics (we
consider two such pairs in the sequel). As noted by Cichosz [16], it takes a
complementary pair of indicators (i.e., metrics) to adequately measure the
performance of a classifier.
Often, the choice of measures depends on the relative cost of the pos-
sible classification errors. If your goal is to minimize false negatives and
false positives, then the group of measures in the next example is what you
probably want to examine.5 Here, the complementary pair is sensitivity and
5 The code you need to produce the results exhibited is similar to what was given for
FIGURE 4.4
Bar Plots of J-index, Sensitivity and Specificity
specificity, and J-index is the composite measure that combines this pair.
These measures for the competing classifiers are given below and displayed in
Figure 4.4.
## # A tibble: 3 x 5
## Measure DT LM NN RF
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 j_index 0.638 0.568 0.612 0.619
## 2 sens 0.722 0.734 0.709 0.709
## 3 spec 0.917 0.833 0.903 0.910
The "best performing" DT classifier that we have identified does well on
specificity but not quite as well on sensitivity. On balance, as shown by
the J-index values, this classifier does better than the others.
On the other hand, if you want to minimize false discovery errors and yet
keep occurrence of false negatives as low as possible, then you may wish to
consider precision and recall, and possibly F1 -measure as shown below and
in Figure 4.5.
## # A tibble: 3 x 5
## Measure DT LM NN RF
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 f_meas 0.770 0.720 0.752 0.757
## 2 precision 0.826 0.707 0.8 0.812
## 3 recall 0.722 0.734 0.709 0.709
FIGURE 4.5
Bar Plots of F1 -measure, Precision and Recall
The DT classifier has the highest precision and the LM classifier the lowest
(the difference is descriptively significant).6 On the other hand, sensitivity or
recall is highest for the LM classifier. Corresponding values of these measures
for the NN and RF classifiers are quite comparable.
formance. For statistical significance, you need to account for variability in the estimated
measure and evaluate the probability of getting a difference as large as what is observed.
FIGURE 4.6
ROC Curves of the Competing Classifiers
the RF classifier still holds when PR curves and the corresponding AUCs are
compared; see Exercise 2.
More important to note is the fact that the RF classifier provides the
greatest reduction in expected minimum misclassification loss, as can be seen
when you compare the values of H-measure (as given by HM). Comparisons
based on this alternative measure are preferred given the problems with the
incoherence of AUC, and when ROC curves cross one another.
Finally, note that Figure 4.7 displays density plots of the positive class
membership probabilities and illustrates their separability.
FIGURE 4.7
Distributions of Positive Class Membership Probabilities
is the diagnostic odds ratio. Note that DOR may be re-expressed as shown in [102].
The first expression for DOR shows that, like AUC, it is also influenced by
the separability of the positive class membership distributions of the two classes.
The nature of this apparent connection between the two measures is worth
investigating.
When used to evaluate a classifier, you can use the rule of thumb given in
Sokolova et al. [102] to interpret DP which states that “the algorithm is a poor
discriminant if DP < 1, limited if DP < 2, fair if DP < 3 and good in other
cases”. Also, from Figure 1 in their paper (with change in notation), they gave
criteria involving plr and nlr to facilitate comparison of two classifiers. For
example, one rule states that classifier 1 is superior overall to classifier 2 if
plr1 > plr2 and nlr1 < nlr2 .
This amounts to saying that classifier 1 is overall superior if DOR1 > DOR2 .
In terms of discriminant power, the equivalent condition is DP1 > DP2 . We
discuss these measures for the competing classifiers next.
The DP values reported in the last code segment suggest limited dis-
criminant ability for each of the competing classifiers. When this measure is
used to rank the classifiers, you wind up with the same ranking as that ob-
tained when the usual overall performance measures are used as shown in
Section 4.3.2 (note that agreement also holds with J-index and F1 -measure).
The final measure we consider in this section for the comparative analysis
is the Brier score defined by
$$BS = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$
where ŷi is the predicted positive class membership probability for the i-th
case and yi is equal to 1 or 0, according to whether the case is positive or not.
This measure belongs to the probabilistic metric category as defined by the
taxonomy that is given in Ferri et al. [37]. It is a “negative oriented” measure,
which means that smaller values of the score indicate better predictions [98].
In the next code segment, we compare BS for the competing classifiers.7
pv_tb %>%
  group_by(classifier) %>%   # compute BS separately for each classifier
  mutate(yhat = prob_Yes) %>%
  mutate(y = ifelse(Survived == "Yes", 1, 0)) %>%
  summarize(BS = sum((y - yhat)^2) / length(y))
# A tibble: 4 x 2
classifier BS
<chr> <dbl>
1 DT 0.128
2 LM 0.147
3 NN 0.166
4 RF 0.129
7 Comparisons can also be based on the “positive oriented” Brier skill score (BSS) but
this requires you to take into account the BS for a reference prediction; see Roulston [98], for
example. You have to consider this adjustment if you are not willing to consider “negative
oriented” measures.
The DT classifier has the smallest BS with the RF classifier a close second.
The ranking of classifiers that results from BS differs from that noted for DP
and several other measures.
## # A tibble: 9 x 5
## Measure DT LM NN RF
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 accuracy 1 4 3 2
## 2 kap 1 4 3 2
## 3 mcc 1 4 3 2
## 4 j_index 1 4 3 2
## 5 f_meas 1 4 3 2
## 6 AUC 4 3 2 1
## 7 HM 2 4 3 1
## 8 DP 1 4 3 2
## 9 BS 1 3 4 2
As shown in the above results, all the threshold measures gave the same
rankings to the four classifiers with the DT classifier ranked as the “best”
and the LM classifier ranked as the “worst”. On the other hand, the RF
classifier was ranked the “best” by AU C and HM (i.e., H-measure), but
there is disagreement with the ranking of the other classifiers by these two
measures. In particular, AU C ranked the DT classifier as worst of the four,
unlike the rankings by threshold measures (this contrarian finding is possibly a
consequence of the incoherence of AU C). The ranking by BS partially agrees
with that given by the threshold measures.
Of course, the usefulness of the rankings that we noted above is predicated
on the assumption that differences in the values of a particular measure among
the classifiers are statistically significant, but are they? In this section, we ex-
amine this issue for two measures, namely, accuracy and AU C by comparing
confidence intervals for the corresponding probability metric. Despite issues
with the measures involved here, we proceed with the comparisons for illustra-
tive purposes. You can perform similar analysis if you can obtain confidence
intervals for other suitable measures (e.g., H-measure).
FIGURE 4.8
95% Confidence Intervals for Accuracy
where Ŷ1 and Ŷ2 are the predicted target variables by the DT and RF clas-
sifiers, respectively, of the actual target Y for a case being classified. The
specified null hypothesis is an alternative expression of that in Dietterich [24].
Given its acceptable control of Type I error, we test (4.2) versus the alter-
native hypothesis
$$H_1 : P(\hat{Y}_1 \neq Y,\, \hat{Y}_2 = Y) \neq P(\hat{Y}_1 = Y,\, \hat{Y}_2 \neq Y).$$
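The test itself is not shown above; a minimal sketch using mcnemar.test() from the stats package (assuming pv_lst holds the per-classifier prediction tibbles with columns Survived and pred_class):
# Indicators of correct classification by the DT and RF classifiers
dt_ok <- with(pv_lst$DT, pred_class == Survived)
rf_ok <- with(pv_lst$RF, pred_class == Survived)
# McNemar's test of (4.2); the off-diagonal counts of this 2 x 2 table are
# the discordant cases referred to in the hypotheses
mcnemar.test(table(dt_ok, rf_ok))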
library(pROC)
pv_lst %>% map(~ ci.auc(.$Survived, .$prob_Yes))
## $DT
## 95% CI: 0.757-0.877 (DeLong)
##
## $LM
## 95% CI: 0.751-0.885 (DeLong)
##
## $NN
## 95% CI: 0.765-0.893 (DeLong)
##
## $RF
## 95% CI: 0.797-0.916 (DeLong)
8 Note that this application of McNemar’s test differs from that discussed in Section 2.5.2.
FIGURE 4.9
95% Confidence Intervals for AUC
4.5 Exercises
1. You saw one way to obtain a display of the confusion matrices
for competing classifiers in Section 4.3.1; see Figure 4.2. There are
simple quicker alternatives that you can use to obtain the required
array summaries. These alternatives are explored in this exercise.
110 Comparative Analysis of Classifiers
(a) Show how to use pv_lst from Section 4.2.1 and the con_mat()
function in Appendix A.3 to obtain a list of confusion matrices
for the competing classifiers (for display on the console).
(b) You can produce heat maps of the confusion matrices in the list
obtained in part (a). Show how to do this (the do.call() func-
tion and the gridExtra package are useful for this purpose).
2. In this exercise, we will tie up some “loose ends” that involve omit-
ted code for some results presented in this chapter, and other anal-
ysis you can perform.
(a) Show how to use pv_tb from Section 4.2.2 and the ggplot2
package to obtain Figure 4.7.
(b) Show how to obtain Figure 4.9. You may assume the limits of
the confidence intervals given in Section 4.4.2.
(c) Compare the PR curves for the competing classifiers considered
in this chapter and obtain the corresponding AU Cs. Comment
on the results you obtain.
3. To illustrate discussion in their paper, Sokolova et al. [102] consid-
ered the use of traditional and discriminant measures to compare
the performance of two classifiers (support vector machine versus
naive Bayes).9 The confusion matrices involved in the comparison
were given in Table 2 of their paper.
(a) Verify the values of the four threshold measures given in Table 3
of Sokolova et al. [102].
(b) Compare the discriminant power of the two classifiers.
(c) What do the results in parts (a) and (b) tell you about the
relative merits of the two classifiers?
4. Customer churn in a business offering cell phone service occurs when
customers switch from one company to another at the end of their
contract. One way to manage the problem is to make special offers
to retain customers prior to expiration of their contracts (since re-
cruiting new customers to replace those that are lost is more costly).
Identifying customers to make the offers to gives rise to a classifi-
cation problem.
This exercise is based on confusion matrices reported in Provost
and Fawcett [90, p. 227] for two classifiers in connection with such
a classification problem; assume those planning to churn belong to
the positive class. The corresponding key count vectors for the naive
Bayes (NB) and k-nearest neighbors (k-NN) classifier are given in
the following tibble.
9 See Chapter 6 of Rhys [95] for some information on these classifiers.
## # A tibble: 2 x 5
## classifier tp fn fp tn
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 NB 127 200 848 3518
## 2 kNN 3 324 15 4351
Note that the given key counts are actually from one fold in the
ten-fold cross-validation analysis done by the authors based on data
from the KDD Cup 2009 churn problem.
(a) Obtain heat maps of the confusion matrices.
(b) Compare the overall threshold measures. Is accuracy a suitable
measure to use in the comparison?
(c) Compare the F2 -measure for the classifiers. Is this a suitable
measure to use?
(d) Compare the discriminant power of the classifiers. Comment
on the dp values you obtain.
5. The following tibble contains key count vectors from five confusion
matrices given in Baumer et al. [5].
## # A tibble: 5 x 5
## classifier tp fn fp tn
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 DT 3256 3061 990 18742
## 2 kNN 3320 2997 857 18875
## 3 NB 2712 3605 1145 18587
## 4 NN 4189 2128 1861 17871
## 5 RF 4112 1261 2205 18471
The labels used for the classifiers refer to decision tree (DT), k-
nearest neighbors (kNN), naive Bayes (NB), neural network (NN)
and random forest (RF). The classifiers were for the problem of
classifying whether an individual is a high-income earner (such in-
dividuals belong to the positive class).
An important point to note about the given key counts is the fact
that they were derived from training data rather than test data.
This fact should be of concern if one is interested in generalizability
of the classifiers, but we will ignore it since the purpose of this
exercise is to demonstrate techniques for the following comparisons
in a comparative analysis of the five classifiers.
(a) Compare the overall performance measures.
(b) Compare sensitivity, specif icity, and J-index.
(c) Compare precision, recall, and F1 -measure.
## TP FN FP TN
## DT 32 2 5 561
## LM 24 2 13 561
## NB 26 4 11 559
## NN 29 6 8 557
## SVM 27 2 10 561
Classification problems that involve more than two classes are also quite com-
mon. You saw examples of some substantive questions that give rise to such
problems in the first chapter. As with binary problems, classifiers for a multi-
class problem may be based on statistical, machine, or ensemble learning, and
techniques to evaluate them are to some extent similar to what was covered
earlier for binary classifier performance analysis.
You can expect some issues when you attempt to apply techniques learned
for binary CPA to problems that involve more than two classes. This is not
surprising because the increased number of classes presents conceptual, defi-
nitional and visualization problems when dealing with multiclass performance
measures and surfaces.
Conceptually, overall measures like accuracy, kappa, and mcc that we
encountered earlier remain unchanged when you consider them for multiclass
CPA. However, for the last two measures, you need to consider more general
definitional formulas since the definitions for binary classifiers usually involve
the key counts from Table 1.1, e.g., the mcc formula given by (2.5) for binary
classifiers is given in terms of these counts. The required generalizations are
discussed in Section 5.2.2.
Class-specific measures like sensitivity and specificity are intrinsically
binary performance metrics. Recall that the alternative terms for these measures
are true positive rate and true negative rate, respectively, and the definition of
these rates relies on the positive versus negative class dichotomy. You can
also encounter use of the term sensitivity when analyzing performance of
multiclass classifiers. When used in this context, think of it as an estimate
obtained by suitable averaging of tpr values from an OvR collection of binary
confusion matrices. This was illustrated during our overview of multiclass
CPA in the first chapter. Extensions to other class-specific measures will be
discussed in Section 5.2.1.
Other issues also arise when you attempt ROC analysis for multiclass
classifiers. When the number of classes is large, you face visualization problems
because of the high-dimensional hypersurfaces that arise in the analysis. A
related issue is calculation of the volume under these surfaces. We discuss
some alternatives in Section 5.3 like the ROC curves from a class reference
formulation [35] for multiclass ROC and the M -measure [56] for multiclass
AU C.
We begin in the next section with a 3-class problem concerned with predic-
tion of diabetic status of an individual. This problem will be used as a running
example throughout this chapter. In addition to data preparation, training,
and predictions, we discuss some techniques for data exploration (so far, this
aspect of the analysis has been ignored). For most of the remaining sections,
we focus on techniques to deal with the problems that were highlighted in the
above introduction. In the last section, we consider some inferential aspects
for performance parameters in the multiclass context.
# Data Preparation
Next, we use 70% of the data in diabetes_tb for training and reserve the rest
for testing. As in the second chapter, data splitting is done with help from
functions in the rsample package. Note the use of the strata argument in
the initial_split() function to specify the variable involved in deciding how
splitting is accomplished with stratification (this is usually done when class
imbalance is serious enough).
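A minimal sketch of this step (the seed is arbitrary, diabetes_tb is assumed to hold the full dataset with target group, and the object names match those used later in the chapter):
library(rsample)
set.seed(2024)   # arbitrary seed for reproducibility
diabetes_split <- initial_split(diabetes_tb, prop = 0.7, strata = group)
diabetes_train <- training(diabetes_split)
Test_data      <- testing(diabetes_split)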
library(scatterplot3d)
brg <- c("steelblue", "red", "green")
brg_vec <- brg[as.numeric(diabetes_train$group)]
pts <- c(16, 0, 17)
pts_vec <- pts[as.numeric(diabetes_train$group)]
diabetes_train %>%
select(-group) %>%
scatterplot3d(pch = pts_vec, color = brg_vec)
legend("top", legend = levels(diabetes_train$group), col = brg,
pch = c(16,0,17), inset = -0.22, xpd = TRUE, horiz = TRUE)
FIGURE 5.1
3-D Scatter Plot Display of the Training Data
diabetes_train %>%
gather(key = "Variable", value = "Value", -group) %>%
ggplot(aes(Value, group)) +
facet_wrap(~ Variable, scales = "free_x") +
geom_boxplot(col = "blue", fill = "lightblue") +
labs(x = "") + theme_bw()
FIGURE 5.2
Boxplots of the Feature Variables
The preceding code segment yields the marginal distribution of the feature
variables for each level of group as shown by the boxplots in Figure 5.2. The
plots provide some insight into the predictive value of each feature. For example,
we expect sspg to have good predictive value since its distributions for the three
classes are quite well separated. The same can also be said about glucose.
As shown in Figure 5.3, the correlation between sspg and glucose is 0.79.
Although quite high, the (absolute) correlation is not high enough to warrant
dropping one of these two features from the training dataset.1 The correlation
between insulin and glucose is weakly negative, and that between insulin and
sspg is practically non-existent.
library(corrplot)
diabetes_train %>%
select(-group) %>%
cor() %>%
corrplot.mixed(lower = "ellipse", upper = "number",
tl.pos = "lt")
Finally, as shown by the pie chart in Figure 5.4, prevalence of the “normal”
1 Note that when there are many numeric predictors, you can use a function like
step_corr() in the recipes package to identify predictors for removal from those that
result in absolute pairwise correlations greater than a pre-specified threshold.
FIGURE 5.3
Correlation Plot of the Feature Variables
class is about 52.5% (there is some imbalance in the data but it is not severe).
This class determines the reference level since it is the first one for the group
variable.
diabetes_train %>%
count(group) %>%
mutate(pct=100*n/nrow(diabetes_train)) %>%
ggplot(aes(x="", y=pct, fill=group)) +
scale_fill_manual(values=c("#85C1E9","#E74C3C","#ABEBC6")) +
coord_polar("y") +
geom_bar(width=1, size=1, col="white", stat="identity") +
geom_text(aes(label=paste0(round(pct, 1), "%")),
position=position_stack(vjust=0.5)) +
labs(x=NULL, y=NULL, title="") + theme_bw() +
theme(axis.text=element_blank(), axis.ticks=element_blank())
FIGURE 5.4
Pie Chart of the Target Variable
$$\frac{\exp(\hat{\eta}_i(x))}{\sum_{j=1}^{3} \exp(\hat{\eta}_j(x))}, \quad i = 1, 2, 3,$$
or, equivalently, if
$$h = \arg\max_{i}\, \{\hat{\eta}_i(x),\; i = 1, 2, 3\}.$$
significance of the estimated coefficients, and decide what to do with those that are not
significant. Here, we will ignore this issue since our purpose is to use the fitted model to
illustrate multiclass CPA.
ML_pv <-
bind_cols(
ML_fit %>% predict(newdata = Test_data, type = "probs") %>%
as_tibble() %>%
set_names(c("prob_normal","prob_chemical","prob_overt")),
tibble(
pred_class = ML_fit %>% predict(newdata = Test_data,
type = "class"),
group = Test_data %$% group
)
)
ML_pv %>% print(n = 3)
## # A tibble: 44 x 5
## prob_normal prob_chemical prob_overt pred_class group
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 1.00 0.00000356 1.45e-11 normal norm~
## 2 0.961 0.0394 1.42e- 5 normal norm~
## 3 1.00 0.000429 7.84e- 9 normal norm~
## # ... with 41 more rows
library(yardstick)
ML_pv_cm <- ML_pv %>% conf_mat(group, pred_class,
dnn = c("Predicted", "Actual"))
ML_pv_cm %>% autoplot(type = "heatmap")
FIGURE 5.5
Confusion Matrix for the ML Classifier
You can also obtain a confusion matrix from the indicator matrices asso-
ciated with the predicted and actual classes for the cases in the evaluation
dataset (surprisingly, this fact is not widely known). We illustrate the ap-
proach by re-computing the confusion matrix for the ML classifier.3
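A minimal sketch of this computation (using model.matrix() to build the indicator matrices from ML_pv; the book's own demonstration may differ):
# Indicator (dummy) matrices for predicted and actual classes
X <- model.matrix(~ pred_class - 1, data = ML_pv)   # n x k, predicted
Y <- model.matrix(~ group - 1, data = ML_pv)        # n x k, actual
# Cross-product gives the confusion matrix (rows: Predicted, cols: Actual)
t(X) %*% Y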
3 Note that in the demonstration, you can also use a function from the ncpen package
respectively.
Alternatively, you can consider the OvR collection of key count vectors
given by
{kcv_i : kcv_i = (tp_i, fn_i, fp_i, tn_i), i = 1, . . . , k}.
Here, kcv_i contains the key counts from the i-th binary confusion matrix, i.e.,
they are those given in Table 5.1.
Next, we use the kcv_fn() function in Appendix A.3 to obtain the OvR
collection of key counts from the 3-class confusion matrix displayed in Fig-
ure 5.5. The corresponding OvR collection of confusion matrices is given in
Figure 5.6.
TABLE 5.1
The i-th OvR Binary Confusion Matrix

                     Actual
Predicted      Yes        No
Yes            tp_i       fp_i
No             fn_i       tn_i
FIGURE 5.6
OvR Collection of Binary Confusion Matrices
In the next section, we use the collection of array summaries in Figure 5.6
to obtain estimates of some performance measures for multiclass CPA.
Next, we adapt the approach taken in Kuhn and Silge [69, p. 119] to obtain
the required estimates.
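The code below is a minimal sketch of one way to produce the required estimates
(it is not necessarily the exact code used here); it assumes the OvR key counts
are stored in a tibble ovr_tb with columns group, tp, fn, fp, and tn.
sens_tb <- ovr_tb %>%
  mutate(total  = tp + fn,                    # actual cases in each class
         weight = total / sum(total),         # class prevalence (weights)
         sens   = tp / (tp + fn))             # OvR (per-class) sensitivity
sens_tb
sens_tb %>%
  summarise(macro    = mean(sens),                # simple average
            macro_wt = sum(weight * sens),        # prevalence-weighted average
            micro    = sum(tp) / sum(tp + fn))    # pooled key counts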
The results from commands in the preceding code segment are given below.
## # A tibble: 3 x 8
## group tp fn fp tn total weight sens
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <table> <dbl>
## 1 normal 21 2 0 21 23 0.523 0.913
## 2 chemical 10 1 2 31 11 0.250 0.909
## 3 overt 10 0 1 33 10 0.227 1
## # A tibble: 1 x 3
## macro macro_wt micro
## <dbl> <dbl> <dbl>
## 1 0.941 0.932 0.932
The above sensitivity estimates for the ML classifier may also be obtained
by using functions in the yardstick package. The functions in this package
use the standard approach to obtain macro/micro estimates (the alternative
calculation of macro F1-measure given later illustrates use of a non-standard
approach). To demonstrate the use of functions in this package, consider how
to obtain macro/micro average estimates of specificity (the approach used
to obtain multiclass sensitivity estimates in the preceding code segment may
also be used to obtain estimates of multiclass specificity; see Exercise 2).
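For instance, assuming the ML_pv tibble from earlier, the following calls (a
sketch; output not shown) return the macro, macro-weighted, and micro averaged
estimates of specificity.
ML_pv %>% spec(group, pred_class, estimator = "macro")
ML_pv %>% spec(group, pred_class, estimator = "macro_weighted")
ML_pv %>% spec(group, pred_class, estimator = "micro")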
In the above formula, $\bar{X}_j$ and $\bar{Y}_j$ are respectively the means of the j-th columns
of the indicator matrices $X = [X_{ij}]_{n \times k}$ and $Y = [Y_{ij}]_{n \times k}$ (these matrices were
used earlier to demonstrate an alternative computation of confusion matrices).
Equation (5.2) is equivalent to
$$mcc = \frac{n \times \sum_{i=1}^{k} n_{ii} - \sum_{i=1}^{k} (n_{i\cdot} \times n_{\cdot i})}{\sqrt{\left(n^2 - \sum_{i=1}^{k} n_{i\cdot}^2\right) \times \left(n^2 - \sum_{i=1}^{k} n_{\cdot i}^2\right)}}. \tag{5.3}$$
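As a hedged illustration of (5.3), the following sketch computes mcc directly
from a k x k confusion-matrix table whose rows are predicted classes and whose
columns are actual classes (the function name mcc_fn is ours, not the book's).
mcc_fn <- function(cm_tab) {
  n    <- sum(cm_tab)
  n_ii <- sum(diag(cm_tab))        # diagonal (correct) counts
  ni.  <- rowSums(cm_tab)          # predicted-class totals
  n.i  <- colSums(cm_tab)          # actual-class totals
  num  <- n * n_ii - sum(ni. * n.i)
  den  <- sqrt((n^2 - sum(ni.^2)) * (n^2 - sum(n.i^2)))
  num / den
}
# mcc_fn(ML_pv_cm$table)   # should agree with yardstick's mcc()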
The distribution of group shows that there is some imbalance, but it is not
to a significant degree.
The summary below shows that classification accuracy for the ML classifier
is very high. The remaining two measures are on the high end and they are
quite close to one another (this is not surprising since the confusion matrix
for the ML classifier is nearly symmetric).
In light of the above overall measures and the macro/micro averaged esti-
mates obtained in the last section, it appears that the ML classifier has done a
good job at solving the diabetes classification problem (confirming the initial
impression obtained earlier from the confusion matrix in Figure 5.5). In the
next section, we see whether further support for this assessment can be found
when we examine some relevant performance curves and ranking measures
derived from them.
confusion matrix.
FIGURE 5.7
Plot Equivalent to the NN Classifier ROC Curve in Chapter 3
the remaining classes and false rates involving them are not as important for
the problem at hand.
Given the limited utility of ROC hypersurfaces (at least for now), we will
not pursue them further here (although we'll briefly revisit their use when
discussing multiclass AUC). Instead, consider the popular alternative approach
to multiclass ROC analysis based on an OvR collection of binary ROC curves.
Recall that this approach is referred to as the class reference formulation by
Fawcett [34]. The curves in such a collection may be displayed in a single plot
like that shown in Figure 1.7 or separately as shown in Figure 5.8.
FIGURE 5.8
Class Reference ROC Curves for the ML Classifier
where S(x) is the score of a case with feature vector x, c1 and c2 are specified
thresholds, and I(·) is the indicator function. Varying the thresholds results
in the ROC surface given in Figure 3 of their paper (the axes for their plot
are determined by the true rates). The volume under this surface is given by
where, for i = 1, 2, 3, Si is the score for a random case from the i-th class.
where $n_i$ is the number of scores from the i-th class. Note that (5.4) assumes
the $S_i$'s are continuous random variables; see Liu et al. [73] for a more general
expression. Next, we use some synthetic data to illustrate calculation of $\widehat{VUS}$.
# Levels of group
classes # this variable was created earlier
## [1] "normal" "chemical" "overt"
Note that the OvR collection of ROC curves that we obtained earlier with the
roc_auc() function may also be produced with help from the OvR_pv_fn()
function, and the grid.arrange() function from the gridExtra package. We
leave this as an exercise for the reader.
In practice, a more common approach to multiclass AUC is to use the
M-measure proposed by Hand and Till [56]. This measure is calculated by de-
fault when you use the roc_auc() function from yardstick. This is illustrated
next for the ML classifier.
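A sketch of the call is shown below (assuming the ML_pv tibble and probability
columns created earlier); with a multiclass truth column, roc_auc() uses the
hand_till estimator by default.
ML_pv %>%
  roc_auc(group, prob_normal, prob_chemical, prob_overt)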
where
$$\hat{A}_{ij} = \frac{\hat{A}_{i|j} + \hat{A}_{j|i}}{2}$$
is an estimate of the measure of separability between class i and j. Here, $\hat{A}_{i|j}$
and $\hat{A}_{j|i}$ are estimates of $P(S_{i|i} > S_{i|j})$ and $P(S_{j|j} > S_{j|i})$, respectively, where
an expression like $S_{r|s}$ denotes the class r membership score for a randomly
selected case from class s (we implicitly assume that cases are assigned to
class r when such scores exceed a pre-specified threshold).
pv_tb <-
tibble(
A = c(0.501,0.505,0.341,0.168,0.178,0.22,0.22,0.207,
0.338,0.355),
B = c(0.228,0.251,0.408,0.458,0.474,0.362,0.489,0.511,
0.17,0.204),
C = c(0.271,0.244,0.251,0.374,0.348,0.418,0.291,0.282,
0.492,0.441),
group = c("A","A","A","B","B","B","B","C","C","C") %>%
as.factor()
)
# Compute Ahat_1|2
n1 <- 3; n2 <- 4
S <- AB_tb %>% filter(group == "A") %$% sum(rankA)
(S - n1*(n1+1)/2) / (n1*n2)
## [1] 1
# Compute Ahat_2|1
n1 <- 4; n2 <- 3
S <- AB_tb %>% filter(group == "B") %$% sum(rankB)
(S - n1*(n1+1)/2) / (n1*n2)
## [1] 0.917
The above calculations show that $\hat{A}_{1|2} = 1$ and $\hat{A}_{2|1} = 0.917$ (we refer to class
“A”, “B” and “C” as class 1, 2, and 3, respectively). Similar calculations show
that $\hat{A}_{1|3} = 0.889$, $\hat{A}_{3|1} = 1$, $\hat{A}_{2|3} = 0.667$ and $\hat{A}_{3|2} = 0.667$. Hence,
$$M = \frac{2}{3(3-1)} \left( \frac{1 + 0.917}{2} + \frac{0.889 + 1}{2} + \frac{0.667 + 0.667}{2} \right) = 0.857.$$
The value you get when you use the roc_auc() function is 0.856.
Kleiman and Page [66] noted some issues with the M-measure. In particular,
they observed that this measure can fail to return the ideal value of 1, even
7 Note that we used class membership scores in our discussion of this measure instead of
estimated class membership probabilities as was done by Hand and Till [56]. As noted by
the authors, the use of scores in the formulation is also valid.
when for every case, a classifier gives the correct class the highest probabil-
ity. They also noted that the measure is not the probability that randomly
selected cases will be ranked correctly. As an alternative, they proposed their
AUCµ measure. This measure was motivated by the need to deal with computational
complexity and the issue of what properties of AUC for binary
classifiers are worth preserving. It is based on the relationship between AUC
and the Mann-Whitney U-statistic (generalizing this statistic was a key step
in the development of AUCµ). The authors claim that AUCµ is a fast, reliable,
and easy-to-interpret method for assessing the performance of a multiclass
classifier. However, despite this claim, their measure has yet to gain wide
acceptance, perhaps because it was only recently proposed.
Perhaps a more problematic issue is the one that arises due to incoherence
of AUC when used as a performance measure for binary classifiers; our
earlier discussion in Chapter 3 showed why this is a concerning issue. Binary
AUCs are involved (whether directly or indirectly) in the calculation of macro-
weighted AUC, M-measure, and the AUCµ measure. It stands to reason that
the incoherence of binary AUCs leads to similar issues with these multiclass
AUC estimates. We need a better understanding of how the incoherence of
binary AUC leads to problems with the multiclass analogs considered so far.
This is possibly, if not likely, an area of current research into multiclass AUC.
In the meantime, we can consider the use of a macro-weighted H-measure.
This is demonstrated below for the ML classifier.
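The code for this demonstration is sketched below under some assumptions: it
averages OvR binary H-measures (computed with the HMeasure() function from the
hmeasure package) using class prevalences as weights, which is one way to define
a macro-weighted H-measure.
library(hmeasure)
ovr_h <- sapply(classes, function(cls) {
  truth <- as.integer(ML_pv$group == cls)            # OvR (0/1) labels
  score <- ML_pv[[paste0("prob_", cls)]]             # class membership scores
  HMeasure(truth, score)$metrics$H                   # binary H-measure
})
weights <- prop.table(table(ML_pv$group))[classes]   # class prevalences
sum(weights * ovr_h)                                 # macro-weighted H-measure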
The above value for macro-weighted H-measure supports the earlier evidence
provided for the excellent performance of the ML classifier.
problem, the joint probability mass function (PMF) of these random variables
is given in Table 5.2 where
$$p_{ij} = P(\hat{Y} = i, Y = j), \quad i, j = 1, \ldots, k.$$
Given the joint distribution in Table 5.2, the marginal distribution of $\hat{Y}$ is
given by
$$P(\hat{Y} = i) = \sum_{j=1}^{k} p_{ij} = p_{i\cdot}, \quad i = 1, \ldots, k,$$
and that for $Y$ is given by
$$P(Y = j) = \sum_{i=1}^{k} p_{ij} = p_{\cdot j}, \quad j = 1, \ldots, k.$$
The joint probabilities pij are the parameters that define the joint distri-
bution in Table 5.2. A number of performance parameters (e.g., Accuracy and
Kappa) for the multiclass problem are functions of these parameters. In this
section, we consider some inferential procedures for these parameters.
which simplifies to
$$E = \sum_{i=1}^{k} P(\hat{Y} = i) P(Y = i) = \sum_{i=1}^{k} p_{i\cdot}\, p_{\cdot i}$$
parameter by the corresponding $\hat{p}_{ij}$'s where applicable. For example, the MLE
of Kappa is $(\hat{A} - \hat{E})/(1 - \hat{E})$ where
$$\hat{A} = \sum_{i=1}^{k} \frac{n_{ii}}{n} \quad \text{and} \quad \hat{E} = \sum_{i=1}^{k} \left( \sum_{j=1}^{k} \frac{n_{ij}}{n} \right) \left( \sum_{j=1}^{k} \frac{n_{ji}}{n} \right).$$
The commented portion at the bottom of the above code segment shows how
to use the yardstick package to obtain the required MLEs.
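Since that code segment is not reproduced here, the sketch below computes the
MLEs directly from the confusion-matrix counts (assumed to be stored in cm_tab,
rows = predicted, columns = actual); the commented lines indicate the
corresponding yardstick calls.
# e.g., cm_tab <- ML_pv_cm$table
n     <- sum(cm_tab)
A_hat <- sum(diag(cm_tab)) / n                           # MLE of Accuracy
E_hat <- sum(rowSums(cm_tab) * colSums(cm_tab)) / n^2    # expected agreement
(A_hat - E_hat) / (1 - E_hat)                            # MLE of Kappa
# ML_pv %>% accuracy(group, pred_class)
# ML_pv %>% kap(group, pred_class)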
Takahashi et al. [106] considered the interval estimation problem for three
multiclass F1-measure parameters. Their confidence intervals were based on
the asymptotic normal distribution of the maximum likelihood (ML) estima-
tors of the pij ’s in Table 5.2, and use of the multivariate delta-method [71, p.
315]. They gave the following limits
$$\sum_{i=1}^{k} \hat{p}_{ii} \pm z_{\alpha/2} \sqrt{\frac{1}{n} \left( \sum_{i=1}^{k} \hat{p}_{ii} \right) \left( 1 - \sum_{i=1}^{k} \hat{p}_{ii} \right)}$$
• $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ denote B bootstrap replicates of $\hat{\theta}$,
• $q_L$ and $q_U$ denote the lower and upper $\alpha/2$-quantiles obtained from the $\hat{\theta}_i^*$'s.
Two $100(1-\alpha)\%$ bootstrap confidence intervals for $\theta$ are $(2\hat{\theta} - q_U,\ 2\hat{\theta} - q_L)$
and $(q_L, q_U)$. The first is called the basic bootstrap interval and the second is
the percentile interval.
The boot.ci() function in the package may be used to obtain the above-mentioned
bootstrap confidence intervals. Instead of doing this, we use the
boot() function in this package to obtain the bootstrap replicates for the
macro F1-score in the following code segment, and use the replicates to obtain
estimates of $q_L$ and $q_U$. These estimates are then used together with the
corresponding MLE to obtain a basic bootstrap confidence interval for macro
F1-measure.
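A hedged sketch of this computation is given below (the seed and the number of
replicates are assumptions; the book's code may differ).
library(boot)
macro_f1 <- function(data, idx) {
  data[idx, ] %>%
    f_meas(group, pred_class, estimator = "macro") %>%
    pull(.estimate)
}
set.seed(4321)                                   # assumed seed
boot_out  <- boot(ML_pv, macro_f1, R = 2000)     # bootstrap replicates
theta_hat <- macro_f1(ML_pv, seq_len(nrow(ML_pv)))
q <- quantile(boot_out$t, c(0.025, 0.975))       # estimates of q_L and q_U
c(2 * theta_hat - q[2], 2 * theta_hat - q[1])    # basic bootstrap interval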
You get slightly different results when you use the boot.ci() function because
the boot package uses its own quantile function. Note that this function also
5.5 Exercises
1. Construct a decision tree classifier for the diabetes classification
problem. Use the same training and test datasets (i.e., Train_data
and Test_data) as those used for the ML classifier given by (5.1).
(a) Obtain a display of the decision tree.
(b) Obtain the confusion matrix and threshold measures.
(c) Obtain the class reference ROC curves and multiclass AUC.
2. (a) Use the sens() function in the yardstick package to verify the
multiclass sensitivity estimates given for the ML classifier.
(b) Estimate multiclass specificity for the ML classifier without
using yardstick.
3. Consider the following confusion matrix reported in Zhang et al.
[118] for a 5-class classifier.
## Actual
## Predicted W N1 N2 N3 REM
## W 5022 407 130 13 103
## N1 577 2468 630 0 258
## N2 188 989 27254 1236 609
## N3 19 4 1021 6399 0
## REM 395 965 763 5 9611
(a) Consider the parameters that are defined by (5.8) and (5.9).
Obtain the MLE of these parameters.
(b) Compute the harmonic mean of the estimates obtained in part
(a). Comment on what you have estimated.
(c) Obtain a 95% confidence interval for the metric in part (b).
5. Do the following comparisons between the ML classifier given by
(5.1) and the DT classifier that you obtained in your answer to
Exercise 1.
(a) Compare (i) the confusion matrices, and (ii) overall and com-
posite performance measures.
(b) Compare the Hand and Till [56] AUCs.
(c) Comment on the results you obtained for the comparisons.
6. The tibble given below contains point and interval estimation results
from Takahashi et al. [106] for the F1-measure that were based on
maximum likelihood estimation and macro averaging. These results
are part of what the authors obtained for the temporal sleep stage
classification problem discussed in Dong et al. [25].
## # A tibble: 4 x 4
## classifier estimate lower_limit upper_limit
## <chr> <dbl> <dbl> <dbl>
## 1 MNN 0.805 0.801 0.809
## 2 SVM 0.75 0.746 0.754
## 3 RF 0.724 0.72 0.729
## 4 MLP 0.772 0.768 0.776
Knowing the material we have covered in the last five chapters is necessary but
not sufficient if you want to analyze performance of classification algorithms.
Whether you are involved with binary or multiclass CPA, the topics we’ll
discuss in this chapter are relevant because they help you deal with questions
like the following.1
• What is data leakage and how can it impact performance analysis of classi-
fiers?
• How can you detect and prevent overfitting of the underlying model that
your classifier is based on?
• What is feature engineering and why is it important?
• What is the bias-variance trade-off in the context of predictive modeling?
• How can you properly evaluate your classifier without using the test dataset?
• How useful are resampling techniques for CPA and how can they be used
for this purpose?
• What is hyperparameter tuning for a classifier and how does one perform
it?
• Why is class imbalance an important issue when classifiers are trained and
evaluated?
• How do we deal with the class imbalance problem in practice?
If you are new to CPA, the basic issues underlying some of the above
questions are probably not the ones that come to mind initially. This is not
surprising because, unlike obvious concerns with accuracy and classification
errors, not all of these questions deal directly with performance measurement
even though their relevance to classifier performance is unquestionable. How-
ever, with further exposure to CPA, you will encounter these issues and it is
a matter of time before you have to address them. For instance, this might
1 This list is not exhaustive (e.g., we omitted questions about resampling techniques for
comparing classifiers), but we will focus on the listed questions in this chapter.
happen when you attempt to examine the reasons why a particular classifica-
tion algorithm is underperforming for your problem (the cause could be due
to overfitting or class imbalance or both, for example).
Thus, in this section, we will address the questions that were raised since
they have relevance to classifier performance. Some topics like cross-validation
have direct relevance to CPA, while others like data or classifier level ap-
proaches to deal with class imbalance do not have the same degree of relevance
but are nonetheless important if you are interested in classifier performance.
There is a good chance you have data leakage when the model you created
yields suspiciously optimistic results. One technique you can use to avoid it is
the one we applied in earlier chapters, where we used data splitting to create
separate datasets for training and testing. The second dataset is usually set
aside for the sole purpose of evaluating the model you independently obtained
from the training data.
Data leakage can occur when you employ k-fold cross-validation (we’ll
discuss this further later) to evaluate your model if you do not carry out
any required preprocessing correctly. It occurs when you preprocess the entire
dataset before using cross-validation to evaluate the model. The correct proce-
dure is to perform preprocessing within each fold; see Figure 3.10 in Boehmke
and Greenwell [6, p. 69] for an illustration of what is involved.
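A sketch of the correct procedure with tidymodels is shown below; the
preprocessing steps are attached to a workflow so that they are re-estimated
within each fold, and the names train_data, Survived, and Age are illustrative
assumptions.
library(tidymodels)
leak_free_rec <- recipe(Survived ~ ., data = train_data) %>%
  step_impute_median(Age) %>%                  # imputation estimated per fold
  step_normalize(all_numeric_predictors())     # scaling estimated per fold
folds <- vfold_cv(train_data, v = 5)
workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_recipe(leak_free_rec) %>%
  fit_resamples(resamples = folds)             # no leakage across folds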
simple, but this is “less frequently a problem than overfitting” [95, p. 67].
while the latter continues to decrease; see Provost and Fawcett [90, p. 117] for
a figure illustrating this point and a more in-depth discussion of overfitting in
tree induction.
How can you prevent overfitting? Zumel and Mount [119] suggest the use
of simple models that generalize better whenever possible. This makes sense in
light of the above discussion on how the problem can arise with tree induction.
Other suggestions they gave include using techniques like regularization and
bagging.
Regularization allows you to constrain estimated coefficients in your model
thereby reducing variance and decreasing out-of-sample error [6]; this can be
achieved, for example, by using a classifier based on the penalized logistic
regression model (such models shrink coefficients of less beneficial predictors to
zero by imposing suitable penalties). Boehmke and Greenwell [6] provide some
discussion on how to implement regularization with linear models through the
use of ridge penalty, lasso penalty, or elastic nets.
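For instance, a penalized logistic regression classifier can be specified with
parsnip as sketched below; the penalty and mixture values are arbitrary
illustrations rather than tuned choices.
logistic_reg(penalty = 0.01, mixture = 0.5) %>%   # mixture = 0 ridge, 1 lasso
  set_engine("glmnet") %>%
  set_mode("classification")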
Bagging refers to bootstrap aggregation that underpins early attempts to
create ensemble learners. A random forest classifier is an example of such a
learner that accomplishes model averaging (a key aspect of ensemble learning)
through bagging. This reduces variance and minimizes overfitting [6].
variables so that they have zero means and unit standard deviations; see Ex-
ercise 1 for precise definitions.
# Demonstrate Normalization of Features
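A minimal sketch of this demonstration is given below (not necessarily the
book's code); the target variable name drv and the use of step_range() are
assumptions.
recipe(drv ~ ., data = mpg_tb) %>%
  step_range(all_numeric_predictors()) %>%   # rescale each feature to [0, 1]
  prep() %>%
  juice() %>%
  summary()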
Notice that the first argument in the call to the recipe() function is a
model formula. This formula helps R distinguish between the response (i.e.,
target) and feature variables in the mpg_tb tibble. Since we only normalize
the features, the summary statistics show that these variables are now in the
range we expect.
Next consider standardization of features in the mpg_tb tibble.3 As shown
in the following code segment, the means and standard deviations of these
variables take on values that we expect.
3 Note that the reference to standardization here is in line with how the term is commonly
used in statistics. Also, note that the helper function in recipes to perform standardization
uses the misleading name step_normalize().
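A matching sketch for the standardization step is given below (again assuming
the mpg_tb tibble and target used above); despite its name, step_normalize()
centers and scales each feature to zero mean and unit standard deviation.
recipe(drv ~ ., data = mpg_tb) %>%
  step_normalize(all_numeric_predictors()) %>%   # zero mean, unit sd
  prep() %>%
  juice() %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))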
By design, we consider the first two variables in the wxy_tb tibble that
was just created as the features involved in the dummy and one-hot encoding
demonstrations below.
# Dummy Encoding
recipe(y ~ w + x, data = wxy_tb) %>%
step_dummy(all_nominal_predictors()) %>%
prep() %>% juice()
## # A tibble: 5 x 4
## y w_M x_B x_C
## <fct> <dbl> <dbl> <dbl>
## 1 Y 0 0 1
## 2 N 0 1 0
## 3 N 1 0 1
## 4 N 0 0 0
## 5 N 1 1 0
# One-Hot Encoding
recipe(y ~ w + x, data = wxy_tb) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
prep() %>% juice()
## # A tibble: 5 x 6
## y w_F w_M x_A x_B x_C
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Y 1 0 0 0 1
## 2 N 1 0 0 1 0
## 3 N 0 1 0 0 1
## 4 N 1 0 1 0 0
## 5 N 0 1 0 1 0
The datasets that you encounter in practice often have missing values for
some variables. Earlier, we used the replace_na() function from the tidyr
package to impute approximations for these values when preparing data for
the Titanic survival classification problem. However, you can also use the
recipe() function to perform such imputations through the use of steps specified
by helper functions like step_medianimpute(), step_knnimpute() or
step_bagimpute(); see Boehmke and Greenwell [6, p. 50] for further discussion
of these alternatives.
The discussion up to this point is only a brief account of what is possible
with feature engineering. A more comprehensive coverage of this important
topic may be found in Kuhn and Johnson [68]. A good introductory account
may be found in Chapter 8 of Kuhn and Silge [69].
Kuhn and Silge [69] refer to model bias as “the difference between the
true pattern or relationships in data and the type of patterns that the model
can emulate”. In other words, it is the discrepancy between actual patterns in
the data and their representation by a model. Models with low bias tend to
be very flexible, and this quality allows them to provide good fits to various
patterns in the data including those that are nonlinear and non-monotonic.
Examples of such models include decision trees and k-nearest neighbors. On
the other hand, models based on regression and discriminant analysis tend to
have high bias because of limited flexibility and adaptability.
Model variance refers to variability of results produced by a model when
supplied with the same or slightly different inputs. Linear models (such models
are linear in the parameters) like linear or logistic regression usually have
low variance unlike models like decision trees, k-nearest neighbors, or neural
networks. The latter group of models is prone to overfitting.
When you consider the examples cited above, you see that models with
low (high) bias tend to have high (low) variance. This is the bias-variance
trade-off that you should take note of when deciding on choice of classifier for
your problem. It helps to keep in mind that there are steps you can take to
minimize the effects of the abovementioned model properties. For example,
you can identify overfitting and reduce its risk by using resampling techniques
[6, 95]. We take up such techniques in the next section.
be used to evaluate performance of the logit model for the Titanic classifica-
tion problem. Following this, we illustrate the use of 5-fold cross-validation
to tune a random forest classifier for the same problem. Finally, a brief dis-
cussion will be given that covers other resampling techniques like repeated
cross-validation, Monte-Carlo cross-validation, and validation set resampling.
where A is the event that a case is selected at least once for a bootstrap re-
sample. For sufficiently large n, this probability is approximately 0.632. Thus,
as noted in Efron and Tibshirani [30], bootstrap samples are supported by
approximately 0.632n of the original data points.
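Since the probability in question is P(A) = 1 − (1 − 1/n)^n, a quick numerical
check of the 0.632 value is straightforward.
n <- c(10, 100, 1000, 10000)
round(1 - (1 - 1/n)^n, 4)   # approaches 1 - exp(-1) = 0.6321 as n grows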
boot_smpl
## # A tibble: 10 x 5
## B1 B2 B3 B4 B5
## <chr> <chr> <chr> <chr> <chr>
## 1 C G J E C
## 2 H J I A J
## 3 C A F F F
## 4 D I B J B
## 5 C E I D F
## 6 I F D C F
## 7 G H G E H
## 8 B G E E J
## 9 A B I D B
## 10 G F A D D
For illustration, consider the five bootstrap resamples that we just listed.
The resamples were obtained by random sampling with replacement from 10
cases that were labeled using the first ten letters of the alphabet. The features
and target variables for the cases are not listed since they are not relevant to
the present discussion.
As shown by the column labeled B1, the case labeled “C” is the first one
selected from the original 10 cases for the first bootstrap resample, and it
appears a total of three times in the resample since sampling was done with
replacement. Furthermore, because of random sampling, you get a different
collection of cases in subsequent resamples. What you should note here is that
all resamples have the same size (i.e., 10) and that some cases appear more
than once in each resample.
What about cases not in each bootstrap resample? A case has about a 0.368 chance
of not being included in a bootstrap resample (assuming n is sufficiently large).
For classification problems, such out-of-bag cases play a useful role in model
evaluations and hyperparameter tuning. For example, when evaluating a clas-
sifier, each of b bootstrap resamples plays the role of the analysis (i.e., train-
ing) set and the corresponding out-of-bag cases constitute the assessment (i.e.,
test) set.
# Out-of-Bag Cases
## $B1
## [1] "E" "F" "J"
##
## $B2
## [1] "C" "D"
##
The above listing shows the out-of-bag cases associated with the first two
bootstrap resamples given earlier. Because of random sampling with replace-
ment, you see a difference in number of out-of-bag cases associated with B1
and B2.
As an application, consider the use of bootstrap resampling to estimate the
Accuracy performance parameter (as defined by the probability in Table 2.3)
for the logit model classifier that is under consideration for the Titanic survival
classification problem. Estimation will be based on 100 bootstrap resamples from
the Titanic_train dataset that was obtained in Section 2.2.2.
As shown in the next code segment, we begin by defining a function to
compute accuracy for the classifier when given a training and a test dataset.
Next, we use the bootstraps() function from the rsample package to obtain
the splits for the resamples, and the required analysis and assessment datasets
are extracted using functions with the same names (i.e., analysis() and
assessment()). For each iteration in the loop, this extraction is done and the
accuracy estimate is obtained using the acc_fn() function.
158 Additional Topics in CPA
accuracy_vec(Survived, pred_class)
}
# library(rsample)
set.seed(19322)
load("Titanic_tb.rda") # Titanic_tb from Ch 2
Titanic_train <- Titanic_tb %>% initial_split() %>% training()
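A hedged sketch of the remaining steps is given below; it uses purrr's map_dbl()
rather than an explicit loop and assumes acc_fn() takes a training and a test
dataset, as described above.
boot_splits <- bootstraps(Titanic_train, times = 100)        # 100 resamples
acc_est <- map_dbl(boot_splits$splits,
                   ~ acc_fn(analysis(.x), assessment(.x)))   # out-of-bag accuracy
mean(acc_est)   # simple (unweighted) bootstrap estimate of Accuracy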
Given the list of Accuracy estimates from the bootstrap resamples, you
can use a simple or weighted average (averaging aims to reduce variance) to
obtain the required bootstrap estimate of Accuracy. The weighted version
FIGURE 6.1
Illustrating 5-Fold Cross-Validation
use to find the “best” setting for the hyperparameters of a classification al-
gorithm. This is often accomplished through the use of a suitable resampling
technique.
When you use cross-validation for hyperparameter tuning, the process be-
gins by splitting all available data for classifier construction into training and
test sets. For k-fold cross-validation, you next split the training set into k dis-
joint subsets of roughly equal size; these subsets are referred to as folds. You
can associate a resample with each of these k folds. The i-th fold serves as
the assessment set for the i-th resample, and the remaining k − 1 folds serve
as the analysis set. A good illustration of the overall process that is involved may
be found in Figure 3.5 of Kuhn and Johnson [68, p. 48]; B in their figure
corresponds to our k.
For further illustration, see Figure 6.1 which shows the resamples for 5-
fold cross-validation; the figure was adapted from Provost and Fawcett [90,
p. 128]. In each of the resamples, the training folds make up the analysis
set and the test fold constitutes the assessment set. When used for model
evaluation or hyperparameter tuning, you train each of the five models using
the same classification algorithm and corresponding analysis set, and evaluate
each trained model using the corresponding test fold and a performance mea-
sure like accuracy or AUC; see Cichosz [16, p. 150] for an example on the use
of 10-fold cross-validation to evaluate a decision tree classifier. The discussion
below on the use of 5-fold cross-validation to tune the hyperparameters of a
random forest classifier is another example.
In general, the number of hyperparameters you need to tune depends on
the software implementation of a classification algorithm. For example, when
you use the rand_forest() function in the parsnip package to train a random
forest classifier, the relevant hyperparameters are given by:
• trees, the number of trees in the ensemble (default = 500),
• min_n, the minimum number of cases in a node for further splitting,
• mtry, the number of randomly selected predictors at each split for a
constituent decision tree.
For classification, the defaults for min_n and mtry are 10 and ⌊√p⌋, respectively,
where p is the number of predictors. A discussion of the roles played by the
above hyperparameters in the training of an RF classifier may be found in
Boehmke and Greenwell [6, p. 206], for example. Here, we demonstrate the
use of 5-fold cross-validation (in practice, using 10 folds is quite common) to
tune these hyperparameters for the problem in Section 2.2.
We begin by splitting Titanic_df (instead of Titanic_tb, as was done
in the second chapter) to create the training and test datasets. This is done
so that we can demonstrate how to handle the missing values in Titanic_df
through the use of a preprocessing recipe. When you take this preferred
approach, the required packages are loaded when you load the tidymodels
meta-package.
# Recreate Training and Test Datasets for the DT Classifier
# Model Specification
rf_mod_spec <-
rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>%
set_mode("classification") %>%
set_engine("randomForest")
# Preprocessing Recipe
rf_rec <-
recipe(Survived ~ ., data = Titanic_train) %>%
step_impute_median(Age) # impute missing values
# Workflow
rf_wf <-
workflow() %>%
add_model(rf_mod_spec) %>%
add_recipe(rf_rec)
The setup phase of the tuning process starts with the required preprocessing
to obtain Train_data (here, this is Titanic_train with missing values
imputed). This is followed by creating training/test folds for the 5-fold cross-
validation and a grid of 75 (= 3 × 5 × 5) hyperparameter combinations.
# Tune Hyperparameters for a RF Classifier
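A hedged sketch of the setup and the tuning call is shown below; the seed and
the grid ranges are assumptions, chosen only so that the grid has the 75
combinations mentioned above.
set.seed(1001)                                   # assumed seed
folds <- vfold_cv(Titanic_train, v = 5)          # training/test folds
rf_grid <- grid_regular(trees(range = c(250, 750)),
                        min_n(range = c(2, 10)),
                        mtry(range = c(1, 5)),
                        levels = c(3, 5, 5))     # 3 x 5 x 5 = 75 combinations
rf_rs <- rf_wf %>%
  tune_grid(resamples = folds,
            grid = rf_grid,
            control = control_grid(save_pred = TRUE))  # keep fold predictions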
in Cichosz [16, p. 150]. The given code is quite instructive even though it is for model
evaluation rather than hyperparameter tuning.
Results from tuning are collected into the nested tibble rf_rs. The listing
of this tibble shows that the first three resamples contain 534 cases in each
set of training folds and 134 cases in each test fold. The corresponding numbers
for the last two resamples are 535 and 133, respectively. The .metrics
column is a list-column of 5 tibbles, each of which contains accuracy and
AUC values for the 75 hyperparameter combinations. Corresponding test fold
predictions are contained in the list column labeled .predictions. Use the
collect_metrics() or collect_predictions() functions if you want the details
for the various hyperparameter combinations. The second function will
be used later to extract the test fold predictions for the “best” model; see the
definition of the best_rf_rs_pv tibble that is given later.
You now have sufficient information to determine the “best” combination
of hyperparameters to use for the required RF classifier. The combination
depends on the metric used to make the selection. In general, you get differ-
ent combinations with different metrics. As noted earlier, you might even get
different combinations with the same maximal value for a given performance
metric.
Note from the above results that there is a second combination of hyperparam-
eters that yields the same mean accuracy value. To resolve the non-uniqueness
problem, you need to consider some other criteria (e.g., preference for smaller
number of trees).
For illustration, we show how to obtain the mean and standard error of
average accuracy for the first model in the tibble listed above. To do this,
you first need to use the collect_predictions() function to extract the
predictions from the rf_rs tibble that was obtained earlier, and then follow
up by using the filter() function to extract the resampling results for the
model.
mutate_if(is.factor, fct_rev)
Given the resampling results for Preprocessor1_Model20 that are contained
in the best_rf_rs_pv tibble, you can obtain estimates of accuracy from
predicted and actual classes in each test fold, and then obtain the required
summaries from the resulting five accuracy estimates.
# Mean accuracy and Standard Error for the "Best" Model (cont’d)
FIGURE 6.2
ROC Curves from the Test Folds for the “Best” Model
values range from 0.824 to 0.906. The average of the five AUCs is 0.865 with
a standard error of 0.0143. We leave it to the reader to verify these results
in Exercise 3.
# Test Fold AUCs for the "Best" Model with Summary Statistics
## # A tibble: 5 x 2
## id auc
## <chr> <dbl>
## 1 Fold1 0.844
## 2 Fold2 0.879
## 3 Fold3 0.824
## 4 Fold4 0.906
## 5 Fold5 0.871
## # A tibble: 1 x 2
## Mean StdErr
## <dbl> <dbl>
## 1 0.865 0.0143
training data after you update the model specification and workflow, and evaluate it with
the test dataset; see Kuhn and Silge [69], for example.
substantive problem (e.g., credit card fraud usually occurs at relatively low
frequency) and the data you have for classifier training is representative of the
population it came from.
How serious must the class imbalance be before you take steps to account for
it when you train and evaluate your classifier? To answer this question, you
have to first decide whether the problem exists, and if so, proceed to judge its
seriousness. The existence decision for binary classification may be based on
the imbalance ratio (IR) that you can obtain by dividing the number of cases
in the negative class by the number in the positive class (this usually refers to
the minority class); this is also known as the class or skew ratio [40].
As a rough rule of thumb, class imbalance can be said to exist if the IR
is more than 1.5. According to this rule, the Titanic_train dataset that we
used in Section 2.2.2 is imbalanced since its IR is 1.54, but is it concerning?
The answer is probably not, since the difference between this value and the
suggested cutoff is negligible. For comparison, consider the five project datasets in the
book by Brownlee [10] on imbalanced classification. The IR for these datasets
ranges from 2.33 to 42. The creditcardfraud dataset mentioned earlier is a
more extreme example since IR = 578 for it.
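As a quick hedged check (assuming the Titanic_train tibble from Section 2.2.2
with target Survived), the IR can be computed as follows.
Titanic_train %>%
  count(Survived) %>%
  summarise(IR = max(n) / min(n))   # reported in the text to be about 1.54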
In this section, we provide some background information on class imbalance
issues and a brief overview of some techniques to deal with the problem.
For a more complete coverage of the available techniques, see Fernández [36],
for example. Readers with background in Python may find the tutorials in
Brownlee [10] a useful resource. His tutorial on a systematic framework for
dealing with imbalanced classification problems is noteworthy.
## 0.606 0.394
Under-sampling, on the other hand, allows you to reduce the size of the
majority class by removal (of redundant, borderline, or noisy cases) or reten-
tion (of useful cases). For example, deletion can be done by random under-
sampling, i.e., randomly eliminate cases from the majority class. An obvious
issue with this non-heuristic approach is the potential loss of information when
you remove useful cases (i.e., those that facilitate classifier training), but there
are techniques to overcome this problem (e.g., data decontamination by rela-
beling some cases).
Heuristic methods are also available for under-sampling, like the one that
makes use of pairs of cross-class nearest neighbors, the so-called Tomek links
[109]. By removing cases that belong to the majority class from each cross-
pair, you increase separation between the classes. Such pairs may be combined
with SMOTE to create class clusters that are better defined [4]; see Figure 2
in the cited article for a good demonstration of the basic idea underlying this
hybrid approach.8 Another heuristic you can use is to apply the condensed
nearest neighbor rule [58] to keep cases in the majority class that are useful
together with all cases in the minority class. The resulting dataset is referred
to as a consistent subset.
class imbalance. For example, the critical mechanism responsible for the bias
in tree-induction algorithms is the splitting criterion.
Decision trees with splits determined by measures of impurity based on
Gini index or entropy (as used by the CART or C4.5 algorithm, respectively)
perform poorly when you have class imbalance. The issue is more problematic
with use of Gini index as demonstrated by the isometrics of Gini-split in Fig-
ure 6(b) of Flach [40]. You can obtain skew-insensitive performance by using
an alternative splitting criterion. For binary classification, a good alternative
is to split based on Hellinger distance. This criterion may be expressed as
sr 2 r 2
r r
nLP nLN nRP nRN
dH = − + − , (6.3)
nP nN nP nN
where the counts on the right-hand side are given in Table 6.1. The above
formula may be expressed as [17]
r
√ p 2 p p 2
dH (tpr, f pr) = tpr − f pr + 1 − tpr − 1 − f pr .
For this to make sense, you need to adopt another interpretation of the counts
in the second and third rows of Table 6.1, i.e., regard them as counts from
a confusion matrix with key count vector $(n_{LP}, n_{RP}, n_{LN}, n_{RN})$. Hence, the
definitions of tpr and fpr in $d_H(tpr, fpr)$ follow from (1.11) and (1.14) in
Chapter 1 when you interpret this vector according to what is given by (1.9).9
One advantage of the alternative formula is that it allows you to obtain iso-
metric plots for Hellinger distance as shown in Figure 1 of Cieslak et al. [18,
p. 141]. The given figure demonstrates the robustness of this distance measure
in the presence of skew.
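As a hedged illustration of (6.3), the sketch below computes the Hellinger
distance for a candidate split from the class counts in the left and right child
nodes (the function name is ours).
hellinger_split <- function(nLP, nLN, nRP, nRN) {
  nP <- nLP + nRP   # positive-class total
  nN <- nLN + nRN   # negative-class total
  sqrt((sqrt(nLP / nP) - sqrt(nLN / nN))^2 +
       (sqrt(nRP / nP) - sqrt(nRN / nN))^2)
}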
For tree induction, dH is a useful splitting criterion since it is a good
measure of the ability of a feature to separate the classes. An algorithm to
construct a Hellinger distance decision tree (HDDT) that makes use of this
criterion may be found in Cieslak and Chawla [17]. Based on experiments
they carried out, Cieslak et al. [18] concluded that HDDTs are robust and
9 Here, rather than think of tpr and fpr as classifier performance measures, view them
simply as relative frequencies derived from the confusion matrix representation of class
distributions in the child nodes that result from a decision tree split.
where $c_{ij}$ is the cost incurred when a class j case is classified as class i (here,
class 1 refers to the positive class and, for imbalanced problems, this coincides
with the minority class). The $c_{ij}$'s represent values of a suitable measure of
cost which, in general, need not be in monetary terms. The off-diagonal entries
in C are particularly important because $c_{12}$ and $c_{21}$ refer to the cost of a false
positive and the cost of a false negative, respectively. Since incorrect labeling
of a case should cost more, a requirement for the $c_{ij}$'s to satisfy is
This IR-based heuristic accounts for more costly false negatives by making
cost of such errors proportional to the imbalance ratio.
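One simple instance of this heuristic is sketched below: correct classifications
cost nothing, a false positive costs one unit, and a false negative costs IR units
(the IR value shown is the one quoted earlier for the Titanic training data).
IR <- 1.54
C  <- matrix(c(0,  1,     # c11, c12 (false-positive cost)
               IR, 0),    # c21 (false-negative cost), c22
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("pos", "neg"),
                             Actual    = c("pos", "neg")))
C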
Halimu et al. [45], but they concluded that AUC is a better measure. Huang
and Ling [61] compared AUC and accuracy in their study, and argued that
optimizing AUC is a better option in practice. Unfortunately, these studies
have one problem with them. In terms of the taxonomy proposed by Ferri
et al. [37], these studies involve comparisons between threshold and ranking
measures. Hence, the measures being compared quantify different aspects of
classifier performance and, as noted by Hand [52], empirical studies engaged
in such comparisons have limited value.
In an earlier study, Chicco and Jurman [14] compared three commonly
used threshold measures and concluded that mcc is more reliable compared
to F1-measure and accuracy. The shortcomings of accuracy have been known
for some time. It is not a useful performance measure when class imbalance is
serious, and when the cost of misclassifications matters. Furthermore, its use in
comparative performance analysis of competing classifiers is not recommended,
as noted by Provost et al. [91]; they preferred the use of ROC analysis instead.
In their study involving mcc and kappa, Delgado and Tibau [22] highlighted
the incoherence of kappa and argued against the use of this measure. Thus, of
the three overall threshold measures that we considered in the second chapter,
the available evidence seems to favor mcc.
There are also issues that arise with class-specific measures. For some
problems, the goal is to minimize both the false discovery of positive cases
and occurrence of false negatives. In this situation, the relevant measures to
consider are precision and recall, and possibly F1-measure, their harmonic
mean. Powers [88] noted some biases associated with these measures, e.g.,
failure to provide information about how well negative cases are handled by a
classifier. Their use to evaluate classifiers for information retrieval systems can
be problematic when retrieval results are weakly ordered as noted by Raghavan
et al. [93]; see their article for probabilistic measures designed to deal with
the problem. Hand and Christen [46] highlighted a conceptual weakness of
F1-measure when used to evaluate record linkage algorithms and suggested
alternative measures to overcome the problem.
If minimizing both the occurrence of false negatives and occurrence of
false positives is an important goal, then sensitivity and specificity are the
measures to consider, including composite measures derived from them like
J-index or discriminant power. If you want more than a pair of estimates
of the first two measures, then you can consider the ROC curve; this curve
is determined (directly or indirectly) by these measures. The area under the
ROC curve provides a useful descriptive summary for a single classifier, but
issues arise if you try to compare classifiers using AUC. In the latter situation,
the H-measure proposed by Hand [50] provides a solution to the problem.
There are alternatives to the ROC curves. One important example is the
Precision-Recall (PR) curve. According to Davis and Goadrich [19], these
curves provide a more informative view of a classifier’s performance when you
have significant class imbalance. This assessment is supported by Saito and
Rehmsmeier [99].
6.5 Exercises
1. Let $x_{min}$, $x_{max}$, $\bar{x}$, and s denote the minimum, maximum, mean,
and standard deviation obtained from a numerical data sample that
is represented by $x_1, \ldots, x_n$. To normalize the $x_i$'s, you perform the
following transformation(s)
$$x_i \mapsto \frac{x_i - x_{min}}{x_{max} - x_{min}}, \quad i = 1, \ldots, n,$$
and to standardize the $x_i$'s, you perform
$$x_i \mapsto \frac{x_i - \bar{x}}{s}, \quad i = 1, \ldots, n.$$
During our discussion on feature engineering, we showed how to use
the recipes package to normalize and standardize features. Show
how to perform these transformations without using the stated
functionality. Use the mpg_tb tibble that was created in the text example.
2. For this exercise, consider the LM classifier in Chapter 4 for the
Titanic survival classification problem in Section 2.2.
(a) Interpret the probability given on the right-hand side of (3.2)
for the classification problem under consideration.
(b) Obtain the plain and .632 bootstrap estimates of the probability
mentioned in part (a).
(c) Obtain a percentile interval and a basic bootstrap interval
for the probability. Use a 90% confidence level.
3. From the example on hyperparameter tuning for a random forest
classifier by 5-fold cross-validation, we noted that the model identified
by .config == "Preprocessor1_Model20" is the “best” one
for the problem in Section 2.2 according to the accuracy measure.
(a) Show how to obtain Figure 6.2 and the corresponding AUCs.
Also, verify the given summary statistics for the AUC values.
(b) Obtain a list of the five test fold confusion matrices for this
model.
(c) Obtain the five test fold values for Cohen's kappa, and hence
the mean and standard error from these values.
4. Consider the same RF hyperparameter tuning example as used for
Exercise 3.
(a) What is the “best” hyperparameter combination according to
the AUC performance metric?
(b) Use your answer to part (a) and the Train_data tibble to fit
the RF classifier and obtain the variable importance plot.
(c) Use the fitted RF classifier and Test_data to obtain the confusion
matrix and performance measures.
(d) Obtain the ROC curve and AUC for the fitted RF classifier.
5. The hpc_cv data frame from the yardstick package contains resampling
results from 10-fold cross-validation for a 4-class problem.
The data includes columns for the true class (obs), the class prediction
(pred), class membership probabilities (in columns VF, F, M,
and L), and a column that identifies the folds (Resample).
(a) Extract the hpc_cv data frame and convert it to a tibble.
(b) Obtain the four PR curves (associated with the four classes)
from the resampling results in each fold of Resample.
(c) Obtain the resampling estimates of PR AUC from the results in
each fold of Resample. Calculate the mean and standard error
from the estimated AUC values.
A
Appendix
• https://www.tidyverse.org/packages/
• https://www.tidymodels.org/packages/
library(tidyverse)
library(tidymodels)
After doing this you can access functionality in the core packages like dplyr
and yardstick without the need to load them individually.2
Some of the other required packages for the code examples in this book
include the following:
• magrittr for data wrangling,
1 Note that the included core packages are listed in the control panel when you load the
meta-packages.
2 Commands for loading of individual core packages were given in some of the code
segments in this book. This was done to highlight the relevant package. You can ignore the
commands if you have loaded the corresponding meta-package.
# Class-Specific Measures
DT_pv %>% sens(Survived, pred_class)
DT_pv %>% spec(Survived, pred_class)
DT_pv %>% ppv(Survived, pred_class)
DT_pv %>% npv(Survived, pred_class)
# Overall Measures
DT_pv %>% accuracy(Survived, pred_class)
DT_pv %>% kap(Survived, pred_class)
DT_pv %>% mcc(Survived, pred_class)
# Composite Measures
DT_pv %>% bal_accuracy(Survived, pred_class)
DT_pv %>% j_index(Survived, pred_class)
DT_pv %>% f_meas(Survived, pred_class)
# con_mat(): the function header and the type == "table" branch are restored
# here; the default value of the type argument is an assumption.
con_mat <- function(counts, outcomes, type = "conf_mat") {
  if (type == "matrix") {
    k <- sqrt(length(counts))
    cm <- matrix(counts, k, k)
    rownames(cm) <- colnames(cm) <- outcomes
  } else if (type == "table") {
    k <- length(outcomes)
    cm <- counts
    dim(cm) <- c(k, k)
    class(cm) <- "table"
    dimnames(cm) <- list(Predicted = outcomes, Actual = outcomes)
  } else {
    cm <-
      con_mat(counts, outcomes, type = "table") %>%
      vtree::crosstabToCases() %>%
      yardstick::conf_mat(Actual, Predicted,
                          dnn = c("Predicted", "Actual"))
  }
  return(cm)
}
Alternatively, you can extract all the performance measures from a given
confusion matrix of class "conf_mat" as shown below for a 3-class array
summary. This is useful when you want to extract performance measures from
published confusion matrices in the literature.
You can obtain the key count vectors for the OvR collection from Table 1.7
as shown below. The resulting vectors are those for the OvR collection of
confusion matrices in Figure 1.6.
You can use the following pipeline to obtain the OvR key count vectors,
confusion matrices and corresponding row profiles (which yields the OvR
sensitivity and specificity values that you can use to obtain corresponding
macro and macro-weighted measures).
1:nrow(CM) %>%
map(~ kcv_fn(CM, .)) %T>%
print() %>%
map(~ con_mat(., c("Yes", "No"), type = "table")) %T>%
print() %>%
map(~ prop.table(., 2))
# Clopper-Pearson confidence limits for a binomial proportion (a wrapper is
# restored around the published fragment; the function name and argument
# defaults are assumptions).
cp_ci_fn <- function(x0, n, alpha = 0.05, type = "two-sided") {
  estimate <- x0 / n
  if (type == "lower") {
    lower_limit <- qbeta(alpha, x0, n - x0 + 1)      # one-sided lower limit
    upper_limit <- 1
  } else {
    lower_limit <- qbeta(alpha/2, x0, n - x0 + 1)    # two-sided limits
    upper_limit <- qbeta(1 - alpha/2, x0 + 1, n - x0)
  }
  c(estimate = estimate, lower = lower_limit, upper = upper_limit)
}
Given information in the CM confusion matrix below, you can use the fol-
lowing code to obtain 95% two-sided and lower one-sided confidence intervals
for the Accuracy performance parameter.
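A hedged usage sketch (assuming the cp_ci_fn() wrapper above): x0 is the number
of correct classifications and n the total number of cases in CM.
x0 <- sum(diag(CM)); n <- sum(CM)
cp_ci_fn(x0, n, alpha = 0.05)                   # 95% two-sided interval
cp_ci_fn(x0, n, alpha = 0.05, type = "lower")   # 95% lower one-sided interval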
• Random Forests in R
https://www.youtube.com/watch?v=6EXPYzbfLCE
Bibliography
[7] A. P. Bradley. The use of the area under the ROC curve in the evaluation
of machine learning algorithms. Pattern Recognition, 30(6):1145–1159,
1997.
[8] P. Branco, L. Torgo, and R. P. Ribeiro. A survey of predictive modeling
under imbalanced distributions. ACM Computing Surveys, 45:1–50, May
2015.
[9] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification
and Regression Trees. Wadsworth, California, USA, 1984.
[10] J. Brownlee. Imbalanced Classification with Python. Jason Brownlee
eBook, 2020.
[11] A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class prob-
ability estimation and classification: structure and applications. Tech-
nical report, Statistics Department, The Wharton School, University of
Pennsylvania, 2005.
[72] J. Li and J. P. Fine. ROC analysis with multiple classes and multiple
tests: methodology and its application in microarray studies. Biostatis-
tics, 9(3):556–576, 2008.
[73] S. Liu, H. Zhui, K. Yi, X. Sun, W. Xu, and C. Wang. Fast and unbiased
estimation of volume under ordered three-class ROC surface (VUS) with
continuous or discrete measurements. IEEE Access, 8:136206–136222,
2020.
[74] Y. Liu, Y. Zhou, S. Wen, and C. Tang. A strategy on selecting perfor-
mance metrics for classifier evaluation. International Journal of Mobile
Computing and Multimedia Communications, 6:20–35, 2014.
binary perfect, 15
classification problems, 2 performance analysis (CPA), 4
confusion matrix, 12 performance measures, 29
CPA, 10 probabilistic, 8, 174
performance curves, 19 optimal threshold, 174
performance measures, 16 random, 15
random forest, 92
case, 2
scoring, 3, 8
class
totally useless, 15
positive, 2
classifier performance analysis, 4
actual, 11 binary CPA, 10
imbalance, 47 comparative analysis, 87
minority, 2
performance curves, 69
predicted, 11
performance measures, 29
priors, 24
multiclass CPA, 21, 113
class imbalance problem, 168 performance measures, 125
Brownlee’s flowchart, 169 performance surfaces, 131
classifier level approaches, 171 required information, 11, 21
HDDT, 172
resampling techniques, 155
isometric plots, 172
k-fold cross-validation, 160
cost-sensitive learning, 173
bootstrap approach, 156
cost matrix, 173 hyperparameter tuning, 160
optimal threshold, 174 resubstitution approach, 155
data level approaches, 170
Cohen’s kappa, 48
over-sampling, 170
comparative analysis, 87
SMOTE, 170
array summaries, 94
Tomek links, 171
collect predictions, 93
under-sampling, 171 competing classifiers, 89
imbalance ratio, 169 McNemar’s test, 108
class membership probabilities
performance curves, 100
distribution & separability, 75
performance measures, 96, 102
predicted, 11, 21
statistical significance
classification errors
AU C differences, 108
false discovery/omission, 13, 54 accuracy differences, 106
false positive/negative, 12, 54 confusion matrix
classification problems
binary, 12
binary, 2
alternative formats, 14
Titanic survival, 32
key count vector, 14
multiclass, 2 key counts, 12, 13, 39
diabetes classification, 114 column profiles, 42
classification rules, 3 multiclass, 22, 121
classifier
object representation, 15
decision tree, 30, 35
preferred format, 14
logit model, 6
random classifier, 15
multinomial logistic, 9 row profiles, 42
neural network, 71