Jds 1022
6339/21-JDS1022
October 2021 Data Science Reviews
Abstract
Machine learning methods are increasingly applied to medical data analysis to reduce human
effort and improve our understanding of disease propagation. When the data are complicated and
unstructured, shallow learning methods may not be suitable or feasible. Deep learning neural
networks, such as the multilayer perceptron (MLP) and the convolutional neural network (CNN),
have been incorporated in medical diagnosis and prognosis for better health care practice. For a
binary outcome, these learning methods directly output predicted probabilities for a patient’s
health condition. Investigators still need to choose an appropriate decision threshold to split the
predicted probabilities into positive and negative regions. We review methods to select the
cut-off values, including the relatively automatic methods based on optimization of ROC curve
criteria and the utility-based methods with a net benefit curve. In particular, decision curve
analysis (DCA) is now acknowledged in medical studies as a good complement to ROC analysis
for the purpose of decision making. In this paper, we provide R code to illustrate how to
perform the statistical learning methods, select a decision threshold to yield the binary
prediction, and evaluate the accuracy of the resulting classification. This article will help
medical decision makers understand different classification methods and use them in real-world
scenarios.
Keywords deep learning; machine learning; net benefit; ROC; threshold
1 Introduction
Data science has expanded quickly due to the increase in data storage capacities and the
exploration of computational technologies and algorithms. Medical researchers can now access
large volumes of data and analyze them in real time. Data mining techniques help to extract
important information from patient health data and to make promising predictions. In a
particular clinical investigation, we may follow a three-step procedure: (i) build a risk
prediction model from available data by using machine learning methods; (ii) decide a threshold
value to split the predicted probability into either a positive or a negative region; (iii) apply
the trained model and selected threshold for actual diagnosis and screening. In this paper we
provide a systematic review of how to conduct steps (i) and (ii).
To perform step (i), one may consider many machine learning methods. When data are
in a standard format, e.g., accessible via an Excel sheet, most shallow learning tools can be
readily applied, including the familiar logistic regression and classification trees.
These methods are traditionally covered in the course curriculum of most graduate programs
in statistics and biostatistics. On the other hand, when the data become complicated, we may
need deep learning neural networks such as the multilayer perceptron (MLP) and the convolutional
neural network (CNN). These more advanced learning methods can be used in clinical diagnostic
tasks to improve on the prediction results of shallow learning methods or to analyze
unstructured data such as medical imaging data. Medical images can come from computed
tomography, magnetic resonance, ultrasound, X-ray, etc. Medical image segmentation, pattern
recognition and visualization have become vital for the early detection, diagnosis and treatment
of many diseases. In this study we illustrate the traditional shallow learning methods and the
deep learning methods on the Pima Indian diabetes dataset and the Malaria blood smear image
dataset, respectively.
To perform step (ii), one may choose the decision threshold based on two types of
consideration. The first option is to directly optimize some criterion related to the Receiver
Operating Characteristic (ROC) curve. ROC analysis has been frequently used for the evaluation
of diagnostic models (Pepe et al., 2003; Zhou et al., 2009; Erkanli et al., 2006; Krzanowski and
Hand, 2009; O’Malley and Zou, 2006; Zou et al., 2012; Li et al., 2019). The ROC curve is a plot
of sensitivity against 1 − specificity across all the possible decision thresholds of a
diagnostic model. The area under the ROC curve (AUC) is a commonly reported measure that
summarizes the overall accuracy of the diagnostic model. One may then select an “optimal”
cut-off value from the constructed ROC graph at which satisfactory levels of sensitivity and
specificity are achieved.
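The construction of the ROC curve and its AUC can be sketched in a few lines. The snippet below is a minimal illustration in plain Python rather than the R code accompanying this paper; the function names `roc_points` and `auc` and the toy data are hypothetical, and a subject is taken to be positive when its predicted probability is at least the threshold.

```python
def roc_points(y, p):
    """Return (1 - specificity, sensitivity) pairs over all observed thresholds."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    points = [(0.0, 0.0)]  # threshold above every prediction: nobody is positive
    for c in sorted(set(p), reverse=True):
        tp = sum(1 for yi, pi in zip(y, p) if pi >= c and yi == 1)
        fp = sum(1 for yi, pi in zip(y, p) if pi >= c and yi == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (invented for illustration): true status and predicted probabilities.
y = [1, 1, 0, 1, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = roc_points(y, p)
print(auc(curve))
```

For these toy data, eight of the nine diseased/non-diseased pairs are ranked correctly, so the trapezoidal area agrees with the rank-based interpretation of the AUC.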
The second option for deciding a threshold is via a utility-based analysis. Despite its
vast popularity, the ROC curve fails to account for the clinical consequences and the practical
cost-utility of the learned models (Vickers et al., 2019). The cost of a decision could mean the
acquisition costs of the product, but also the costs of any additional tests or visits to
health-care professionals that a patient needs to make. The benefit or utility of the decision
is understood as the gain in health a patient receives. It is usually difficult for medical
practitioners to conduct a utility analysis without external data. Decision curve analysis (DCA;
Vickers and Elkin, 2006) is a formal statistical tool that overcomes this problem by comparing
the model with the two default strategies of treating all patients and treating no one,
irrespective of the model. The net benefit function (Vickers et al., 2012) is constructed to
place benefit and harm on the same scale over the possible range of threshold probabilities of
the disease (the probability at which the decision is made whether or not to undergo treatment;
Zhang et al., 2018; Fitzgerald et al., 2015). DCA helps to compare the default strategies with
the learned model over the range of thresholds in terms of model utility. One can then make an
informed decision by examining the net benefit graphs. Inference for net benefit has been
established in Sande et al. (2020).
This study provides a systematic review of statistical learning tools for predicting
patients’ medical conditions and of decision thresholding methods, for the aforementioned
purposes (i) and (ii). To this end, the paper is organized as follows. Section 2 introduces the
data setting assumed throughout this paper. In Section 3, we provide a comprehensive review of
the shallow supervised learning methods for binary classification. In Section 4, we sketch the
crucial components involved in training a deep learning model. In Section 5, we discuss
different approaches to find the threshold probability. Then, in Section 6, we analyze two
medical datasets for illustration. We conclude with some discussion in Section 7.
2 Set Up
Consider a set of predictors $(X_1, X_2, \ldots, X_m)$ such that $X_j \in \mathbb{R}$, and let
$Y$ be the binary response variable which takes values 1 and 0 for diseased and non-diseased
status, respectively.
636 Sande, S.Z. et al.
Let the prevalence of the disease be $\pi = \Pr(Y = 1)$. Suppose we have a training sample
$X = \{x_{ij};\ 1 \le i \le n,\ 1 \le j \le m\}$, where the covariate vector for the $i$th row is
denoted by $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$. The columns denote the $m$ predictor
variables, where the $j$th predictor is denoted by $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$,
$1 \le j \le m$. We also observe the corresponding response variable
$Y = (Y_1, Y_2, \ldots, Y_n)^T$ for the sample. We train a model $M$ based on a subset of the
predictors $\{X_1, X_2, \ldots, X_m\}$ using any of the techniques to be reviewed in the
following sections. We can then obtain the model-based predictive probability as
$p = \Pr(Y = 1 \mid M)$. For a sample $\{X, Y\}$, the predicted probability for the $i$th
individual is given as $p_i = \Pr(Y_i = 1 \mid M)$. Denote $p = (p_1, p_2, \ldots, p_n)^T$
as the predicted probabilities for the $n$ individuals in the sample. If we consider a risk
threshold $c \in (0, 1)$ for assessment, the binary decision rule is: the $i$th subject with
$p_i \ge c$ will be assigned to class 1; otherwise to class 0.
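As a minimal sketch of this decision rule (plain Python, not the R code accompanying this paper), the function below turns predicted probabilities into class labels. The name `classify` and the example values are hypothetical, and the sketch assumes that a probability exactly equal to the threshold is assigned to class 1.

```python
def classify(p, c):
    """Assign class 1 when the predicted probability reaches the threshold c."""
    if not 0.0 < c < 1.0:
        raise ValueError("threshold c must lie in (0, 1)")
    # Assumption: ties at the threshold (p_i == c) go to class 1.
    return [1 if pi >= c else 0 for pi in p]


# Hypothetical predicted probabilities for four subjects.
p_hat = [0.12, 0.47, 0.50, 0.83]
labels = classify(p_hat, c=0.5)
print(labels)
```

Lowering `c` moves more subjects into class 1, which is why the choice of threshold directly trades sensitivity against specificity.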
In the next two sections we consider different statistical learning methods to obtain p. We
will introduce how to select the threshold c in Section 5.
$$g(\mu) = \beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m,$$
where the $\beta_j$’s are the regression coefficients in the GLM. In this case with a binary
response we have $\mu = p$. There are a few choices of link function $g(\cdot)$ available for a
binary outcome. The most popular choice in medical studies is the logit link. We may consider the
logistic regression model by using such a link function as
$$g(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m.$$
The model can be fitted by maximizing the log-likelihood function or, equivalently, minimizing
the negative log-likelihood given by
$$l\Big(Y_i,\ \beta_0 + \sum_{j=1}^{m} x_{ij}\beta_j\Big) = -Y_i\Big(\beta_0 + \sum_{j=1}^{m} x_{ij}\beta_j\Big) + \log\Big\{1 + \exp\Big(\beta_0 + \sum_{j=1}^{m} x_{ij}\beta_j\Big)\Big\}.$$
After fitting the model, we obtain the regression coefficient estimates
$\hat\beta = \{\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_m\}$, and the estimates of the
predictive probability $\hat p$ are then obtained for all the subjects in the sample by
substituting the $\beta_j$’s in (1) with the $\hat\beta_j$’s. In R, the glm() function can be
used for binary logistic regression by specifying family = binomial(link = "logit") in the
arguments.
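To make the two quantities above concrete, the following sketch evaluates the logistic predicted probability and the negative log-likelihood for given coefficients. It is a plain Python illustration, not a substitute for glm(); the function names and the toy data are hypothetical, and no optimization is performed.

```python
import math


def predict_prob(beta0, beta, x):
    """Inverse logit of the linear predictor beta0 + sum_j x_j * beta_j."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def neg_log_lik(beta0, beta, X, Y):
    """Sum over subjects of -Y_i * eta_i + log(1 + exp(eta_i))."""
    total = 0.0
    for x, y in zip(X, Y):
        eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
        total += -y * eta + math.log(1.0 + math.exp(eta))
    return total


# Toy data (invented): three subjects, two predictors.
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3]]
Y = [1, 0, 1]

# With all coefficients at zero every subject has p = 0.5,
# so the negative log-likelihood equals n * log(2).
print(predict_prob(0.0, [0.0, 0.0], X[0]))
print(neg_log_lik(0.0, [0.0, 0.0], X, Y))
```

A fitting routine would search for the coefficients that make `neg_log_lik` as small as possible; the predicted probabilities then follow from `predict_prob` evaluated at those estimates.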
Statistical Learning in Medical Research with Decision Threshold 637
Another popular link function for GLM is the probit link under which we attain a probit
regression model. The probit link function is given as follows
where $f_1(\cdot), f_2(\cdot), \ldots, f_m(\cdot)$ are unspecified smooth non-parametric
functions. GLM can be understood as a special case of GAM in which each $f$ function is linear.
Clearly, the main advantage of GAM over GLM is that the covariate effects need not be linear.
Typically, for a binary response, we have $\mu = p$ and we may still use the logit link function
$g(p) = \log(p/(1 - p))$.
The building block of GAM is the estimation of the functional effects of individual
covariates. These functions are usually assumed to be smooth with derivatives. The most commonly
adopted smoothers in the literature are the running line smoother, the locally estimated
scatterplot smoother (loess), regression splines and smoothing splines. We offer a more detailed
review in the supplementary file.
In R, generalized additive models are fitted by using the gam package (Hastie, 2020a) or the
mgcv package (Wood, 2003). The gam package uses the AIC criterion for model selection, while the
mgcv package uses any of the GCV/UBRE/REML/AIC criteria available through the option method of
the function gam().
$$W\psi(X) + B = 0,$$
where $B$ is a bias term, $W$ is the weight vector and $\psi(\cdot)$ is a fixed transformation
function such that $\psi : \mathcal{X} \to \mathcal{Z}$, where $\mathcal{X}$ is an
$m$-dimensional input space and $\mathcal{Z}$ is a $p$-dimensional feature space. The objective
is to classify the data points in such a way that the distance between the two classes is as
wide as possible.
Instead of maximizing the margin, we can equivalently minimize the Euclidean norm of the
weight vector $W$ under the constraint that the SVM predicted value and the actual response
value share the same sign:
$$\min_{W,B}\ \frac{1}{2}\|W\|^2, \quad \text{subject to } Y_i^*\{W\psi(X_i) + B\} \ge 1,\ i = 1, \ldots, n.$$
We solve the above optimization problem by the Lagrangian method, where the Lagrangian is given by
$$L(W, B, \alpha) = \frac{1}{2} W^T W - \sum_{i=1}^{n} \alpha_i \big[ Y_i^* \{W\psi(X_i) + B\} - 1 \big].$$
where $\alpha_i^*$ are the optimal Lagrange multipliers. We define a real-valued kernel function
$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the property
$K(X_i, X_j) = \psi(X_i)^T \psi(X_j)$. Hence, we do not need the explicit coordinates in the
feature space $\mathcal{Z}$ or the transformation function $\psi(\cdot)$. The kernel function
directly calculates the value of the dot product of the transformed data points in the feature
space. The following are some common kernel functions used in SVM.
Linear Kernel: $K(X_i, X_j) = \sum_{k=1}^{m} x_{ik} x_{jk}$.
Radial/Gaussian Kernel: $K(X_i, X_j) = \exp\big\{-\gamma \sum_{k=1}^{m} (x_{ik} - x_{jk})^2\big\}$.
Polynomial Kernel: $K(X_i, X_j) = \big(b_0 + \gamma \sum_{k=1}^{m} x_{ik} x_{jk}\big)^d$.
Sigmoid Kernel: $K(X_i, X_j) = \tanh\big(b_0 + \gamma \sum_{k=1}^{m} x_{ik} x_{jk}\big)$.
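The four kernels can be written directly from their definitions. The sketch below is a plain Python illustration (the SVM fits in this paper use svm() from e1071); the function names are hypothetical, and `gamma`, `b0` and `d` play the roles of γ, b0 and d above.

```python
import math


def dot(xi, xj):
    """Inner product sum_k x_ik * x_jk of two covariate vectors."""
    return sum(a * b for a, b in zip(xi, xj))


def linear_kernel(xi, xj):
    return dot(xi, xj)


def radial_kernel(xi, xj, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))


def polynomial_kernel(xi, xj, gamma, b0, d):
    return (b0 + gamma * dot(xi, xj)) ** d


def sigmoid_kernel(xi, xj, gamma, b0):
    return math.tanh(b0 + gamma * dot(xi, xj))


# Two toy covariate vectors (invented for illustration).
xi, xj = [1.0, 2.0], [3.0, -1.0]
print(linear_kernel(xi, xj))
print(radial_kernel(xi, xj, gamma=0.5))
print(polynomial_kernel(xi, xj, gamma=1.0, b0=1.0, d=2))
print(sigmoid_kernel(xi, xj, gamma=1.0, b0=0.0))
```

Each function returns the inner product of the transformed points without ever constructing the transformation itself, which is the practical appeal of the kernel formulation.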
SVM can be implemented in R using several packages. For example, one can use the e1071
package (Meyer et al., 2019), where the svm() function provides a rigid interface to libsvm
(Chang and Lin, 2011) along with visualization and parameter tuning methods. Package kernlab
(Karatzoglou et al., 2004) features a variety of kernel-based methods and includes an SVM method
based on the optimizers used in libsvm and bsvm (Hsu and Lin, 2002). It aims to provide a
flexible and extensible SVM implementation. Package klaR (Weihs et al., 2005) includes an
interface to SVMlight, a popular SVM implementation, and additionally offers classification
tools such as Regularized Discriminant Analysis. Another package, svmpath (Hastie, 2020b),
provides an algorithm that fits the entire path of the SVM solution. We have used the function
svm() from the package e1071 for the numerical work in this paper.
of that attribute. This is usually done by greedy search, direct customizing, or direct discriminate
dissection.
Different methods have been proposed to improve the accuracy of the posterior probabilities,
including smoothing methods, specialized trees, combined methods (decision trees combined
with other methods such as Naive Bayes), fuzzy methods and ensemble methods (Bagging and
Boosting). Some of these make drastic changes in the fundamental properties of trees and
compute probabilities by modifying the tree itself. Distance based probability estimates
(Alvarez et al., 2007) have also been provided.
To implement a classification tree in R, we may use the rpart package (Therneau and Atkinson,
2019), and we can obtain posterior probabilities by using the predict() function. This package
implements univariate splitting. There are other R packages which implement multivariate
splitting, such as mvpart (Breiman et al., 1984), optpart (Roberts, 2020), partDSA (Molinaro
et al., 2010) and party (Hothorn et al., 2006). Complexity pruning can be performed with the
prune() function in rpart by choosing an optimal complexity parameter (cp).
of GBM, is designed to focus on computational speed and model efficiency. Gradient boosting
also allows one to optimise a user-specified cost function, instead of a loss function that does
not necessarily correspond to real-world applications.
Owing to its better performance over other boosting methods, we only implement the extreme
gradient boosting algorithm. To implement extreme gradient boosting in R, we may use the xgboost
package (Chen et al., 2021). We prepare the training and test data by creating xgb.DMatrix()
objects with the feature and label data. The model is fitted on the training set by using the
xgb.train() function.
Through the computation of individual terms in the product, we obtain the posterior
probabilities as the predicted probabilities. Different versions of naive Bayes classifiers
differ mainly by the assumptions they make regarding the distribution of $\Pr(X_j \mid Y)$. This
probability distribution can be Gaussian, multinomial or Bernoulli.
To implement Naive Bayes in R, we use the package e1071 (Meyer et al., 2021) in this
paper. We can use the naiveBayes() function from the package to fit the model and then obtain
the predictive probabilities with the predict() function. One may use other R packages such as
naivebayes (Majka, 2019), which allows more distribution options.
assumes the density functions of the two classes are multivariate normal distributions and
directly uses the density ratio to form classifiers. LDA further assumes a condition called
‘homoscedasticity’, where the covariance matrices for the two classes are equal. If we relax
this condition and allow $\Sigma_0 \neq \Sigma_1$, we may derive the quadratic discriminant
analysis (QDA).
LDA can be implemented in R by using the MASS package (Venables and Ripley, 2002). The lda()
function in the package is used to train the model and predict() is used for the predictions.
The three activation functions are displayed in Figure 1. If we do not apply an activation
function, the output signal would simply be a linear function of the input. Linear functions are
easy to solve, but they are limited in their complexity and have less power to learn complex
functional mappings from data.
ANN is a feed-forward neural network in which information moves in only one direction
(forward), as input → hidden layer → output, and there are no cycles or loops in the network.
As a representative case of ANN, the multilayer perceptron (MLP; Gardner and Dorling, 1998)
consists of at least three layers of nodes: an input layer, a hidden layer and an output layer.
Figure 2 gives a graphical illustration of MLP where we consider three hypothetical input
features passing through N hidden layers.
We may consider a recursive formula to describe the DL process. In particular, the $i$th
neuron in the $L$th layer is given by
$$Z_i^{(L)} = \phi\Big(\sum_{j} w_{j,i}^{(L-1)} Z_j^{(L-1)} + b_i^{(L-1)}\Big),$$
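The recursion can be traced with a small forward pass. The sketch below is written in plain Python for illustration only (the models in this paper are fitted in R); it assumes a sigmoid activation for φ, and the function names, network size and weights are all hypothetical.

```python
import math


def sigmoid(z):
    """Assumed activation function phi for this sketch."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(z_prev, weights, biases, phi=sigmoid):
    """Compute Z_i = phi(sum_j w[j][i] * z_prev[j] + b[i]) for every neuron i."""
    return [
        phi(sum(weights[j][i] * z_prev[j] for j in range(len(z_prev))) + biases[i])
        for i in range(len(biases))
    ]


def mlp_forward(x, layers):
    """Apply the recursion layer by layer; `layers` is a list of (weights, biases)."""
    z = x
    for weights, biases in layers:
        z = layer_forward(z, weights, biases)
    return z


# Three input features, one hidden layer with two neurons, one output neuron.
# weights[j][i] links neuron j of the previous layer to neuron i of the current one.
hidden = ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1])
output = ([[1.0], [-1.0]], [0.0])
p_hat = mlp_forward([1.0, 2.0, 3.0], [hidden, output])[0]
print(p_hat)
```

With a sigmoid on the output neuron, the final value lies strictly between 0 and 1 and can be read as a predicted probability for the binary outcome.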
Cross-Entropy is widely used for logistic regression or GAM type of learning tasks. Hinge loss
or Squared Hinge cost functions are primarily designed for SVM type of learning tasks. For
Cross-Entropy loss the outcome should be coded as {0, 1} while for hinge and squared hinge loss
the outcome should be coded as {−1, 1}.
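The difference in outcome coding is easiest to see in code. The following plain Python sketch uses the standard textbook forms of the three losses, which are assumed here rather than copied from this paper; the function names are hypothetical.

```python
import math


def cross_entropy(y, p):
    """Mean cross-entropy; y coded in {0, 1}, p a predicted probability in (0, 1)."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)


def hinge(y, f, squared=False):
    """Mean hinge (or squared hinge) loss; y coded in {-1, 1}, f a real-valued score."""
    losses = [max(0.0, 1.0 - yi * fi) for yi, fi in zip(y, f)]
    if squared:
        losses = [loss ** 2 for loss in losses]
    return sum(losses) / len(losses)


def recode(y01):
    """Map a {0, 1} outcome to the {-1, 1} coding required by the hinge losses."""
    return [2 * yi - 1 for yi in y01]


y01 = [1, 0]
print(cross_entropy(y01, [0.5, 0.5]))              # uninformative predictions
print(hinge(recode(y01), [2.0, 0.5]))              # second subject is on the wrong side
print(hinge(recode(y01), [2.0, 0.5], squared=True))
```

The hinge losses are zero for any subject whose score has the correct sign and a margin of at least one, whereas cross-entropy keeps rewarding more confident correct probabilities.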
A key computation step for DL is to repeatedly optimize the cost function at the current
layer to obtain the updated weight coefficients. The cost function is usually a non-convex
function of these parameters with multiple local minima, and we usually have to adopt the
backpropagation method to optimize it.
Any ANN is trained by an algorithm called backpropagation (backward propagation of
errors), which requires the use of an optimization algorithm. Initially all the weights and
biases are randomly assigned. For every input in the training dataset, the ANN is activated to
yield an output. This output is compared with the desired output through a cost function, and
the error is propagated back to the previous layer. Optimizers calculate the gradient of the cost
function with respect to all the parameters (weights and biases) to minimize the cost function.
Once the cost function is minimized, we have a ‘learned’ neural network, which we consider
ready to work with ‘new’ inputs. We can optimize the cost function with one of the
following optimisers:
• Gradient Descent: Gradient Descent is the most important technique and the foundation
of how we train and optimize ANN (Bengio, 2012; Andrychowicz et al., 2016). The gradient
can be easily derived using the chain rule for differentiation. The weights or biases in the
neural network are updated as
$$\theta_t = \theta_{t-1} - \eta \nabla C(\theta_{t-1}),$$
where $\theta_t$ is the updated weight (or bias parameter), $\theta_{t-1}$ is the previous weight
(or bias parameter), $\nabla C(\theta_{t-1})$ is the gradient of the cost function and $\eta$ is
the learning rate.
• Stochastic Gradient Descent: Stochastic Gradient Descent (SGD) on the other hand
performs a parameter update for each training record (Bottou, 1991, 2010). Due to these
frequent updates, it has high variance and causes the cost function to fluctuate with different
intensities. This helps to discover new and possibly better local minima but sometimes it
could prohibit the convergence to the exact minimum.
• Mini-batch Gradient Descent: To complement SGD and Gradient Descent, Mini-batch
Gradient Descent (Khirirat et al., 2017; Ruder, 2016) is often used, as it combines the best of
both techniques and performs a parameter update for each mini-batch obtained by dividing the
entire training sample according to the provided batch size. It reduces the variance of the
parameter updates, which can ultimately lead to a better and more stable convergence.
• AdaGrad: This optimiser uses a different learning rate for every parameter θ at a particular
step based on the past gradients which are already computed for that parameter (Duchi et al.,
2011; Dean et al., 2012). Its main weakness is that its learning rate η is always decaying. The
learning rate eventually becomes so small that the model stops learning entirely and acquires
no additional information.
• Adadelta: This is an extension of AdaGrad which tends to correct the decaying learning
rate problem (Zeiler, 2012; Schaul et al., 2013). Instead of accumulating all previous squared
gradients, Adadelta limits the window of accumulated past gradients to some fixed size.
• Adam: Adaptive Moment Estimation (Adam) is another method that computes adaptive
learning rates for each parameter (Kingma and Ba, 2014; Reddi et al., 2019). In addition
to storing an exponentially decaying average of past squared gradients like Adadelta, Adam
also keeps an exponentially decaying average of past gradients, similar to momentum. It is
often regarded as one of the most efficient optimization algorithms available so far.
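The update rules in the list above can be compared on a toy problem. The sketch below, in plain Python, applies the gradient descent update and the Adam update of Kingma and Ba (2014) to the one-dimensional cost C(θ) = (θ − 3)², which is invented for illustration; the function names, starting values and step counts are hypothetical, and the Adam constants are the commonly used defaults.

```python
import math


def grad(theta):
    """Gradient of the toy cost C(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, eta, steps):
    """Repeat theta_t = theta_{t-1} - eta * grad C(theta_{t-1})."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


def adam(theta, eta, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update with bias-corrected first and second moment estimates."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # decaying average of past gradients
        v = beta2 * v + (1 - beta2) * g * g    # decaying average of past squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias corrections for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta


theta_gd = gradient_descent(theta=0.0, eta=0.1, steps=200)
theta_adam = adam(theta=0.0, eta=0.1, steps=1000)
print(theta_gd, theta_adam)
```

Both runs should approach the minimizer θ = 3; on a real network the scalar `theta` is replaced by all weights and biases, and `grad` is supplied by backpropagation.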
Any optimization or learning algorithm has a number of hyperparameters that control the
DL process. Two important hyperparameters are the batch size and the number of epochs. Both
are integer values. We need them when the data volume is too big and we cannot pass all the
data to the computer at once. The batch size is a hyperparameter that defines the number
of samples of training data to work through before updating the internal model parameters.
A training dataset can be divided into one or more batches. When all training samples are used
to create one batch, the learning is usually called batch gradient descent. When the batch is
the size of one sample, the learning is stochastic gradient descent. When the batch size is
more than one sample and less than the size of the training dataset, the learning is mini-batch
gradient descent. The number of epochs is the number of complete passes through the
training dataset. As the number of epochs increases, the weights in the ANN are updated more
times and the learning architecture develops from the underfitting phase, to the optimal phase,
and then to the overfitting phase.
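The roles of the batch size and the number of epochs can be seen in a bare training loop. The plain Python sketch below estimates a single parameter by minimizing a squared error; the function names, data and settings are hypothetical and chosen only to show how many parameter updates each configuration performs.

```python
import random


def minibatches(n, batch_size, rng):
    """Yield index lists that partition a shuffled training set of size n."""
    idx = list(range(n))
    rng.shuffle(idx)
    for start in range(0, n, batch_size):
        yield idx[start:start + batch_size]


def fit_mean(data, batch_size, epochs, eta=0.1, seed=1):
    """Minimise the average of (theta - x)^2 with one update per batch."""
    rng = random.Random(seed)
    theta, n_updates = 0.0, 0
    for _ in range(epochs):                                   # one epoch = one full pass
        for batch in minibatches(len(data), batch_size, rng):
            g = sum(2.0 * (theta - data[i]) for i in batch) / len(batch)
            theta -= eta * g                                  # one update per batch
            n_updates += 1
    return theta, n_updates


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
theta_batch, k_batch = fit_mean(data, batch_size=8, epochs=50)   # batch gradient descent
theta_mini, k_mini = fit_mean(data, batch_size=4, epochs=50)     # mini-batch gradient descent
theta_sgd, k_sgd = fit_mean(data, batch_size=1, epochs=50)       # stochastic gradient descent
print(k_batch, k_mini, k_sgd)
```

For the same number of epochs, a smaller batch size means more updates per pass through the data, at the price of noisier individual updates.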
The learning rate is a configurable hyperparameter used in the training of neural networks.
It has a small positive value, often in the range between 0.0 and 1.0, and controls how quickly
the model is adapted to the problem. Smaller learning rates require more training epochs given
the smaller changes made to the weights during each update, whereas larger learning rates result
in rapid changes and require fewer training epochs. A learning rate that is too large can cause
the model to converge too quickly to a sub-optimal solution, whereas a learning rate that is too
small can cause the process to slow down and get stuck. The challenge of training DL neural
networks therefore involves carefully selecting the learning rate together with the number of
epochs.
Now that the input image has been converted into a suitable form for the MLP, which serves as
the fully connected layer of the CNN, we can transform the output from the last pooling layer
into a column vector (flattening). This vector is fed to a feed-forward neural network, and
backpropagation is applied in every iteration of training. Over a series of epochs, the model
becomes able to distinguish between dominating and certain low-level features in images and
classifies them using the softmax function as the final output.
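The two operations named above, flattening and the softmax output, are simple to write down. The sketch below is a plain Python illustration with hypothetical function names and invented feature maps; it is not the Keras code used for the CNN in this paper.

```python
import math


def flatten(feature_maps):
    """Turn a list of 2-D feature maps into a single vector (flattening)."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def softmax(scores):
    """Convert real-valued scores into probabilities that sum to one."""
    shift = max(scores)                         # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two pooled 2 x 2 feature maps (invented values).
pooled = [[[0.2, 0.0], [1.3, 0.7]], [[0.5, 0.1], [0.0, 0.9]]]
vector = flatten(pooled)

# Scores of the two output neurons for one image (invented values).
probs = softmax([2.0, 0.0])
print(len(vector), probs)
```

For two classes the softmax probability of the first class equals the logistic function of the score difference, so the CNN output can be thresholded in the same way as the predicted probabilities of the shallow methods.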
In R, we may implement CNN by using the Keras interface for R (Chollet et al., 2017). The
code to implement CNN in R is given in the Appendix.
Figure 4: Sorted predicted probabilities for Pima Indian diabetes data obtained from a random
forest classifier.
The intersection between the ROC curve and the descending diagonal in the unit square
gives $\hat{c}_{TF}$. Recall that $se(c)$ is a decreasing function of $c$ while $sp(c)$ is an
increasing function of $c$. This choice of threshold thus achieves a balance between true
positive and false positive fractions.
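On the descending diagonal of the ROC plot, sensitivity equals specificity, so this threshold can be located by searching for the cut-off at which the two are closest. The plain Python sketch below does this over the observed predicted probabilities; the function names and toy data are hypothetical, and a subject is taken to be positive when its probability is at least the threshold.

```python
def se_sp(y, p, c):
    """Sensitivity and specificity of the rule 'positive if p >= c'."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= c)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < c)
    n_pos = sum(y)
    return tp / n_pos, tn / (len(y) - n_pos)


def balanced_threshold(y, p):
    """Observed threshold at which |se(c) - sp(c)| is smallest."""
    def gap(c):
        se, sp = se_sp(y, p, c)
        return abs(se - sp)
    return min(sorted(set(p)), key=gap)


# Toy data (invented): true status and predicted probabilities.
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]
c_tf = balanced_threshold(y, p)
print(c_tf, se_sp(y, p, c_tf))
```

Because sensitivity can only fall and specificity can only rise as the threshold increases, the gap between them changes sign at most once, which is what makes this simple search adequate.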
Figure 5: Different decision thresholds plotted in ROC and Decision curves by random forest
model for Pima Indian Diabetes Dataset.
6 Case Studies
6.1 Pima Indian Diabetes Dataset
This dataset (Smith et al., 1988) is originally from the National Institute of Diabetes and
Digestive and Kidney Diseases. All patients are females at least 21 years old of Pima Indian
heritage. The purpose of this study is to diagnostically predict whether or not a patient has
diabetes, based on her demographic and clinical measurements. This is a public dataset
available in the Kaggle datasets repository
(https://www.kaggle.com/uciml/pima-indians-diabetes-database?select=diabetes.csv). The data
include the diagnosis results of 768 women with 9 variables, summarized in Table 1.
Table 1: Information on the variables in the Pima Indian Diabetes dataset (more information on
the calculation of the Diabetes Pedigree Function can be found in Smith et al., 1988).

Variable                             Summary
                                     Mean ± SD
Pregnancies                          3.84 ± 3.36
Diabetes Pedigree Function           0.47 ± 0.33
Age (years)                          33.24 ± 11.76
Glucose (mg/dl)                      121.51 ± 30.55
Diastolic Blood Pressure (mmHg)      72.43 ± 12.30
Triceps skin fold thickness (mm)     28.67 ± 10.34
Insulin (mu U/ml)                    149.68 ± 108.56
BMI (kg/m^2)                         32.44 ± 6.91
                                     Proportion
Diabetes                             Yes: 34.9%
Table 2: Training and test set (with 7:3 proportion) AUC for all the classification methods along
with the 95% confidence interval (CI).

                  Training (70%)                Test (30%)
Methods           AUC      95% CI               AUC      95% CI
GLM-Logit         0.8472   (0.8145, 0.8798)     0.8430   (0.7920, 0.8941)
GLMNET-LASSO      0.8453   (0.8126, 0.8781)     0.8425   (0.7914, 0.8935)
GAM               0.8453   (0.8122, 0.8784)     0.8430   (0.7917, 0.8944)
LDA               0.8468   (0.8141, 0.8795)     0.8431   (0.7920, 0.8942)
Naive Bayes       0.8209   (0.7858, 0.8560)     0.8210   (0.7670, 0.8750)
XGBoost           1.0000   (1.0000, 1.0000)     0.8180   (0.7622, 0.8738)
Decision Tree     0.8679   (0.8353, 0.9005)     0.7767   (0.7138, 0.8397)
SVM               0.7084   (0.6815, 0.7353)     0.6701   (0.6153, 0.7249)
RF                1.0000   (1.0000, 1.0000)     0.8391   (0.7872, 0.8910)
KNN               0.8617   (0.8310, 0.8924)     0.8250   (0.7709, 0.8791)
MLP               0.8262   (0.7914, 0.8611)     0.8195   (0.7648, 0.8742)
The full data were divided into training and test sets with a ratio of 7:3. We trained all
the classification models described previously on the training set and then obtained the
predicted probabilities for the training and test sets separately. The classification was
repeated 100 times for different randomly partitioned training and test sets, and the average
AUC values for the training and test sets were then calculated. We report the results in
Table 2. For this dataset, XGBoost and random forest (RF) achieve an AUC of 1.0000 on the
training sample, i.e., they separate the diabetes status perfectly. However, on the test set
XGBoost (AUC 0.8180) is not as good as LDA (AUC 0.8431).
We plotted the ROC and decision curves for the test data, as shown in Figures 6(a) and 6(b)
respectively, for a single sample split for the purpose of illustration. Specifically, the ROC
plot is obtained using the ROCR package (Sing et al., 2005). Next, we produced decision curves
for all the classification methods using the dca package (Sjoberg, 2021). In general, all
methods yield similar but distinct ROC curves and decision curves. One may select a sensible
threshold for each of these methods based on the approaches introduced in the preceding section.
We note that decision curves offer more valuable information on utility on top of the usual
accuracy assessment provided by ROC curves. If the risk threshold preferred by a physician or
patient lies within a range where a certain classifier has a higher net benefit than the others,
then the treatment decision should be based on the predicted risks from this classifier.
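The comparison described above rests on the net benefit at a given risk threshold. The plain Python sketch below uses the standard decision curve definition, net benefit = TP/n − (FP/n) · c/(1 − c), which is assumed here rather than quoted from this paper; the function names and toy data are hypothetical.

```python
def net_benefit(y, p, c):
    """Net benefit TP/n - FP/n * c / (1 - c) of treating subjects with p >= c."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= c)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= c)
    return tp / n - fp / n * c / (1.0 - c)


def net_benefit_treat_all(y, c):
    """Default strategy: treat everyone regardless of the model."""
    prevalence = sum(y) / len(y)
    return prevalence - (1.0 - prevalence) * c / (1.0 - c)


# Toy data (invented): true status and predicted probabilities.
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]

# Net benefit of the model, of treating all, and of treating none (always 0).
for c in (0.1, 0.3, 0.5):
    print(c, net_benefit(y, p, c), net_benefit_treat_all(y, c), 0.0)
```

Plotting these three quantities against the threshold gives the decision curve; a classifier is useful at a given threshold only if its net benefit exceeds both default strategies there.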
Figure 6: Test Set ROC and Decision curves for Pima Indian Diabetes Dataset.
Figure 7: Sample thin blood smear slide images of parasitized and uninfected cells.
methods could serve as an effective diagnostic aid. The segmented cells from the thin blood
smear slide images for the parasitized and uninfected classes are open-source and available
at https://ceb.nlm.nih.gov/repositories/malaria-datasets/. The data contain a total of 27,558
cell images with equal instances of parasitized and uninfected cells. Sample images of
parasitized (Y = 1) and uninfected (Y = 0) cells are shown in Figure 7.
In this case shallow learning methods are no longer applicable. We used the Keras interface
for R (Chollet et al., 2017) with a TensorFlow backend to apply CNN (please refer to the
Supplementary Material for the R code). The images were available in varying sizes, between
100 and 150 pixels in height and width. We first processed the image data by resizing the
images to 100 × 100 pixels and organizing them as required for the CNN model in Keras, for which
we used the R package EBImage (Pau et al., 2010). Then the dataset was divided randomly into
training and test sets with the ratio 8:2. After the random split, the training data contain
22,056 thin blood smear slide images while the test data contain 5,502 images. We then fitted
the CNN model to the training data (the CNN model we used contains 2 sets of two
convolution-pooling layers plus a dropout layer, followed by the fully connected layer) and
predicted the disease outcome on the test data.
Figure 8: ROC and DCA curves for the CNN model on Malaria thin blood smear image dataset.
The performance of the CNN model in classifying parasitized and uninfected cells was evaluated
with ROC and DCA. The ROC curve and decision curve are presented in Figures 8(a) and 8(b)
respectively. The overall AUC is 0.977, suggesting very accurate classification results for the
test data.
To make a dichotomous classification for individual subjects, we may apply the methods
introduced in Section 5 to select an appropriate decision threshold. In this case, we have
$\hat{c}_J = 0.068$, $\hat{c}_D = 0.068$ and $\hat{c}_{TF} = 0.005$ based on the ROC curve
(Figure 8(a)). We can also choose a threshold based on the decision curve analysis. Inspecting
Figure 8(b), we note that the net benefit function remains relatively flat over a wide range of
threshold values, essentially suggesting that all these threshold values lead to a net benefit
value between 0.4 and 0.5. In particular, the threshold corresponding to $\phi(c) = 0.46$ is
0.307, for which the sensitivity and specificity are 0.93 and 0.96, respectively. The three
cut-off values identified from the ROC curve all have similar net benefit values of about 0.47.
7 Discussion
Machine learning and deep learning have attracted a lot of attention in science. To appreciate
such advanced algorithms, a basic understanding of machine learning and neural networks is
necessary. In this paper we provided a brief review of the machine learning methods for clinical
predictive analytics. It is important to evaluate these different methods in practical
conditions and to acknowledge the limitations of the currently available methods and the
research topics that are needed by the field. In particular, GLM and regularized GLM depend on a
linearity assumption and may be more restrictive than other nonparametric classifiers; a
decision tree is usually quite a weak classifier and needs to be coupled with bagging and/or
boosting to achieve satisfactory performance; LDA and Bayes methods rely on distributional
assumptions such as normality and may not be valid when the distribution is mis-specified; SVM
and deep learning, though enjoying superior performance most of the time, may require the
specification of many hyperparameters. Furthermore, there is no theoretical guarantee that one
method is consistently more predictive than other methods. We need to be mindful in choosing
appropriate methods for the problems.
Sometimes relying on a single algorithm or method can mislead the decision. It is often more
effective to consider multiple models and assess their predictions jointly. All the methods
reviewed in our paper can serve as potential candidates for addressing prediction problems in
medicine and other scientific fields. This study also shows that a complex model is not always
better. A simple shallow learning model such as logistic regression or LDA can also perform
quite well, with high AUC values, when its assumptions are met in an application. On the other
hand, when dealing with complicated data sets such as brain images, almost all shallow learning
methods could fail and we have to invoke a well-designed deep learning computation to achieve
reasonable risk predictions. Arranging accurate and reliable learning procedures for real data
will lead to more sensible decision threshold selection using the methods we reviewed in this
paper.
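As an illustration of assessing several models jointly, the following is a minimal base R
sketch under stated assumptions, not the analysis used in this paper. The data are simulated,
and the two logistic models, the simple averaging of predicted probabilities, and the helper
functions auc() and youden_threshold() are hypothetical choices made only for this example.
The threshold is chosen on the same test data for brevity; in practice it should be selected
on separate validation data.

```r
# Sketch: compare two classifiers by AUC, average their predictions,
# and pick a threshold by the Youden index. Simulated data, hypothetical names.
set.seed(1)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1.3 + 1.2 * x1 + 0.8 * x2^2))
dat   <- data.frame(y = y, x1 = x1, x2 = x2)
train <- 1:200
test  <- 201:400

# Model A: logistic regression, linear in both predictors
fit_a <- glm(y ~ x1 + x2, family = binomial, data = dat[train, ])
# Model B: logistic regression with a quadratic term in x2
fit_b <- glm(y ~ x1 + I(x2^2), family = binomial, data = dat[train, ])

p_a   <- predict(fit_a, newdata = dat[test, ], type = "response")
p_b   <- predict(fit_b, newdata = dat[test, ], type = "response")
p_avg <- (p_a + p_b) / 2  # simple joint assessment: average the probabilities

# AUC via the Mann-Whitney rank formula
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Cut-off maximizing the Youden index J = sensitivity + specificity - 1,
# classifying a subject as positive when the predicted probability >= cut-off
youden_threshold <- function(p, y) {
  cuts <- sort(unique(p))
  j <- sapply(cuts, function(cc) {
    mean(p[y == 1] >= cc) + mean(p[y == 0] < cc) - 1
  })
  cuts[which.max(j)]
}

y_test  <- dat$y[test]
aucs    <- c(A = auc(p_a, y_test), B = auc(p_b, y_test), avg = auc(p_avg, y_test))
cut_avg <- youden_threshold(p_avg, y_test)
print(round(aucs, 3))
print(cut_avg)
```

Comparing the three AUC values shows whether the averaged prediction is competitive with the
individual models before a cut-off is fixed. The same cut-off search could be replaced by a
utility-based criterion such as net benefit.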
Supplementary Materials
The online supplementary material includes a review of the different smoothers used in
generalized additive models, installation details for the R interface to Keras and TensorFlow,
and the data and R code needed to reproduce the results.
Acknowledgments
We are grateful to the Editorial Board and two reviewers for their constructive comments.
Funding
The work was partly supported by Academic Research Funds R-155-000-205-114, R-155-000-
195-114 and Tier 2 MOE funds in Singapore MOE2017-T2-2-082: R-155-000-197-112 (Direct
cost) and R-155-000-197-113 (IRC).
References
Allyn J, Allou N, Augustin P, Philip I, Martinet O, Belghiti M, et al. (2017). A comparison of
a machine learning model with EuroSCORE II in predicting mortality after elective cardiac
surgery: A decision curve analysis. PLoS ONE, 12(1): e0169772.
Alvarez I, Bernard S, Deffuant G (2007). Keep the decision tree and estimate the class proba-
bilities using its decision boundary. In: IJCAI, 654–659.
Andrychowicz M, Denil M, Colmenarejo SG, Hoffman MW, Pfau D, Schaul T, et al. (2016).
Learning to learn by gradient descent by gradient descent. CoRR, arXiv preprint:
https://arxiv.org/abs/1606.04474.
Baker SG, Kramer BS (2007). Peirce, Youden, and receiver operating characteristic curves.
American Statistician, 61(4): 343–346.
Bengio Y (2012). Practical recommendations for gradient-based training of deep architectures.
In: Neural Networks: Tricks of the Trade (G Montavon, G Orr, KR Müller, eds.), 437–478.
Springer.
Bottou L (1991). Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes,
91(8): 12.
654 Sande, S.Z. et al.
Bottou L (2010). Large-scale machine learning with stochastic gradient descent. In: Proceedings
of COMPSTAT’2010 (Y Lechevallier, G Saporta, eds.), 177–186. Springer.
Breiman L (1996). Bagging predictors. Machine Learning, 24(2): 123–140.
Breiman L (2001). Random forests. Machine Learning, 45(1): 5–32.
Breiman L (2017). Classification and Regression Trees. Routledge.
Breiman L, Friedman J, Stone C, Olshen R (1984). Classification and Regression Trees. The
Wadsworth and Brooks-Cole Statistics-Probability Series. Taylor & Francis.
Chang CC, Lin CJ (2011). LIBSVM: A library for support vector machines. ACM Transactions
on Intelligent Systems and Technology, 2(3): 1–27.
Chen J, Huang H, Tian S, Qu Y (2009). Feature selection for text classification with naïve Bayes.
Expert Systems with Applications, 36(3): 5432–5435.
Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, et al. (2021). xgboost: Extreme
Gradient Boosting. R package version 1.3.2.1.
Chollet F, Allaire J, et al. (2017). R interface to keras, https://github.com/rstudio/keras.
Cortes C, Vapnik V (1995). Support-vector networks. Machine Learning, 20(3): 273–297.
Dean J, Corrado G, Monga R, Chen K, Devin M, Mao M, et al. (2012). Large scale dis-
tributed deep networks. In: Advances in Neural Information Processing Systems (F Pereira,
CJC Burges, L Bottou, KQ Weinberger, eds.), volume 25. Curran Associates, Inc.
Duchi J, Hazan E, Singer Y (2011). Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(7): 2121–2159.
Erkanli A, Sung M, Costello E, Angold A (2006). Bayesian semi-parametric ROC analysis.
Statistics in Medicine, 25: 3905–3928.
Fitzgerald M, Saville BR, Lewis RJ (2015). Decision curve analysis. JAMA, 313(4): 409–410.
Friedman J, Hastie T, Tibshirani R (2001). The Elements of Statistical Learning, volume 1.
Springer Series in Statistics. Springer, New York.
Friedman J, Hastie T, Tibshirani R (2010). Regularization paths for generalized linear models
via coordinate descent. Journal of Statistical Software, 33(1): 1.
Friedman JH (2001). Greedy function approximation: A gradient boosting machine. The Annals
of Statistics, 29(5): 1189–1232.
Gardner MW, Dorling S (1998). Artificial neural networks (the multilayer perceptron) a review
of applications in the atmospheric sciences. Atmospheric Environment, 32(14–15): 2627–2636.
Hastie T (2020a). gam: Generalized Additive Models. R package version 1.20.
Hastie T (2020b). svmpath: The SVM Path Algorithm. R package version 0.970.
Hastie T, Tibshirani R (1986). Generalized additive models. Statistical Science, 1(3): 297–310.
Hastie TJ (2017). Generalized additive models. In: Statistical Models in S (JM Chambers, TJ
Hastie, eds.), 249–307. Routledge.
Hester J, Csárdi G, Wickham H, Chang W, Morgan M, Tenenbaum D (2021). remotes: R
Package Installation from Remote Repositories, Including ‘GitHub’. R package version 2.3.0.
Hothorn T, Hornik K, Zeileis A (2006). Unbiased recursive partitioning: A conditional inference
framework. Journal of Computational and Graphical Statistics, 15(3): 651–674.
Hsu CW, Lin CJ (2002). A simple decomposition method for support vector machines. Machine
Learning, 46(1): 291–314.
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004). kernlab – An S4 package for kernel methods
in R. Journal of Statistical Software, 11(9): 1–20.
Kass GV (1980). An exploratory technique for investigating large quantities of categorical data.
Journal of the Royal Statistical Society. Series C. Applied Statistics, 29(2): 119–127.
Keller JM, Gray MR, Givens JA (1985). A fuzzy k-nearest neighbor algorithm. IEEE Transac-
tions on Systems, Man and Cybernetics, SMC-15(4): 580–585.
Kerr KF, Brown MD, Zhu K, Janes H (2016). Assessing the clinical impact of risk prediction
models with decision curves: Guidance for correct interpretation and appropriate use. Journal
of Clinical Oncology, 34(21): 2534.
Khirirat S, Feyzmahdavian HR, Johansson M (2017). Mini-batch gradient descent: Faster con-
vergence under data sparsity. In: 2017 IEEE 56th Annual Conference on Decision and Control
(CDC), 2880–2887.
Kingma DP, Ba J (2014). Adam: A method for stochastic optimization. CoRR, arXiv preprint:
https://arxiv.org/abs/1412.6980.
Krzanowski WJ, Hand DJ (2009). ROC Curves for Continuous Data. Chapman and Hall/CRC.
Kuhn M (2020). caret: Classification and Regression Training. R package version 6.0-86.
Li J, Fine JP (2010). Weighted area under the receiver operating characteristic curve and its
application to gene selection. Journal of the Royal Statistical Society. Series C. Applied Statis-
tics, 59: 673–692.
Li J, Gao M, D’Agostino R (2019). Evaluating classification accuracy for modern learning ap-
proaches. Statistics in Medicine, 38: 2477–2503.
Li J, Zhou XH (2009). Nonparametric and semiparametric estimation of the three way receiver
operating characteristic surface. Journal of Statistical Planning and Inference, 139: 4133–4142.
Liaw A, Wiener M (2002). Classification and regression by randomForest. R News, 2(3): 18–22.
Majka M (2019). naivebayes: High Performance Implementation of the Naive Bayes Algorithm
in R. R package version 0.9.7.
Mehta M, Rissanen J, Agrawal R (1995). MDL-based decision tree pruning. In: Proceedings of
the First International Conference on Knowledge Discovery and Data Mining, KDD’95 (U
Fayyad, R Uthurusamy, eds.), 216–221. AAAI Press.
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2019). e1071: Misc Functions of the
Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package
version 1.7-2.
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2021). e1071: Misc Functions of the
Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package
version 1.7-6.
Mika S, Ratsch G, Weston J, Scholkopf B, Mullers K (1999). Fisher discriminant analysis with
kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal
Processing Society Workshop (Cat. No.98TH8468) (YH Hu, ed.), 41–48.
Molinaro AM, Lostritto K, van der Laan M (2010). partDSA: deletion/substitution/addition
algorithm for partitioning the covariate space in prediction. Bioinformatics, 26(10): 1357–1363.
Nakas CT, Alonzo TA, Yiannoutsos CT (2010). Accuracy and cut-off point selection in three-
class classification problems using a generalization of the Youden index. Statistics in Medicine,
29: 2946–2955.
Nakas CT, Dalrymple-Alford JC, Anderson TJ, Alonzo TA (2012). Generalization of Youden in-
dex for multiple-class classification problems applied to the assessment of externally validated
cognition in Parkinson disease screening. Statistics in Medicine, 95: 995–1003.
Nelder JA, Wedderburn RW (1972). Generalized linear models. Journal of the Royal Statistical
Society. Series A. General, 135(3): 370–384.
Niblett T, Bratko I (1987). Learning decision rules in noisy domains. In: Proceedings of Expert
Systems ’86, The 6th Annual Technical Conference on Research and Development in Expert
Systems III (MA Bramer, ed.), 25–34. Cambridge University Press, USA.
O’Malley A, Zou K (2006). Bayesian multivariate hierarchical transformation models for ROC
analysis. Statistics in Medicine, 25: 459–479.
Pau G, Fuchs F, Sklyar O, Boutros M, Huber W (2010). Ebimage—an R package for image
processing with applications to cellular phenotypes. Bioinformatics, 26(7): 979–981.
Pepe MS, et al. (2003). The Statistical Evaluation of Medical Tests for Classification and Pre-
diction. Medicine.
Perkins NJ, Schisterman EF (2006). The inconsistency of “optimal” cut-points using
two ROC based criteria. American Journal of Epidemiology, 163: 670–675.
Quinlan JR (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, 38: 48.
Reddi SJ, Kale S, Kumar S (2019). On the convergence of Adam and beyond. CoRR, arXiv
preprint: https://arxiv.org/abs/1904.09237.
Roberts DW (2020). optpart: Optimal Partitioning of Similarity Relations. R package version
3.0-3.
Rousson V, Zumbrunn T (2011). Decision curve analysis revisited: overall net benefit, relation-
ships to ROC curve analysis, and application to case-control studies. BMC Medical Informatics
and Decision Making, 11(1): 1–9.
Ruder S (2016). An overview of gradient descent optimization algorithms. CoRR, arXiv preprint:
https://arxiv.org/abs/1609.04747.
Safavian SR, Landgrebe D (1991). A survey of decision tree classifier methodology. IEEE Trans-
actions on Systems, Man and Cybernetics, 21(3): 660–674.
Salzberg SL (1994). C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kauf-
mann Publishers, Inc., 1993. Machine Learning, 16(3): 235–240.
Sanchez IE (2016). Optimal threshold estimation for binary classifiers using game theory.
F1000Research, 5.
Sande SZ, Li J, D’Agostino R, Yin Wong T, Cheng CY (2020). Statistical inference for deci-
sion curve analysis, with applications to cataract diagnosis. Statistics in Medicine, 39(22):
2980–3002.
Schalkoff RJ (1997). Artificial Neural Networks. McGraw-Hill Higher Education.
Schapire RE, Freund Y, Bartlett P, Lee WS, et al. (1998). Boosting the margin: A new expla-
nation for the effectiveness of voting methods. The Annals of Statistics, 26(5): 1651–1686.
Schaul T, Zhang S, LeCun Y (2013). No more pesky learning rates. Proceedings of Machine
Learning Research 28(3): 343–351.
Sing T, Sander O, Beerenwinkel N, Lengauer T (2005). ROCR: visualizing classifier performance
in R. Bioinformatics, 21(20): 3940–3941.
Sjoberg DD (2021). dca: Decision Curve Analysis. R package version 0.1.0.9000.
Smith JW, Everhart J, Dickson W, Knowler W, Johannes R (1988). Using the ADAP learning
algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the Annual Sympo-
sium on Computer Application in Medical Care, volume 261. American Medical Informatics
Association.
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. (2010). As-
sessing the performance of prediction models: a framework for some traditional and novel
measures. Epidemiology, 21(1): 128.
Talluri R, Shete S (2016). Using the weighted area under the net benefit curve for decision curve
analysis. BMC Medical Informatics and Decision Making, 16(1): 94.
Therneau T, Atkinson B (2019). rpart: Recursive Partitioning and Regression Trees. R package
version 4.1-15.
Van Calster B, Vickers AJ, Pencina MJ, Baker SG, Timmerman D, Steyerberg EW (2013).
Evaluation of markers and risk prediction models: Overview of relationships between NRI
and decision-analytic measures. Medical Decision Making, 33(4): 490–501.
Van Calster B, Wynants L, Verbeek JF, Verbakel JY, Christodoulou E, Vickers AJ, et al.
(2018). Reporting and interpreting decision curve analysis: A guide for investigators. European
Urology, 74(6): 796–804.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S. Springer, New York, fourth
edition. 0-387-95457-0.
Vickers AJ, Cronin AM, Gönen M (2012). A simple decision analytic solution to the comparison
of two binary diagnostic tests. Statistics in Medicine, 32(11): 1865–1876.
Vickers AJ, Elkin EB (2006). Decision curve analysis: A novel method for evaluating prediction
models. Medical Decision Making, 26(6): 565–574. PMID: 17099194.
Vickers AJ, van Calster B, Steyerberg EW (2019). A simple, step-by-step guide to interpreting
decision curve analysis. Diagnostic and Prognostic Research, 3(1): 1–8.
Weihs C, Ligges U, Luebke K, Raabe N (2005). klaR analyzing German business cycles.
In: Data Analysis and Decision Support (D Baier, R Decker, L Schmidt-Thieme, eds.), 335–343.
Springer Berlin Heidelberg, Berlin, Heidelberg.
Wood SN (2003). Thin-plate regression splines. Journal of the Royal Statistical Society, Series
B, 65(1): 95–114.
Youden WJ (1950). Index for rating diagnostic tests. Cancer, 3(1): 32–35.
Yu T, Li J, Ma S (2017). Accounting for clinical covariates and interactions in ranking ge-
nomic markers using ROC. Communications in Statistics. Simulation and Computation, 46(5):
3735–3755.
Zeiler MD (2012). ADADELTA: An adaptive learning rate method. CoRR, arXiv preprint:
https://arxiv.org/abs/1212.5701.
Zhang Z, Rousson V, Lee WC, Ferdynus C, Chen M, Qian X, et al. (2018). Decision curve
analysis: a technical note. Annals of Translational Medicine, 6(15): 308.
Zhou XH, McClish DK, Obuchowski NA (2009). Statistical Methods in Diagnostic Medicine,
volume 569. John Wiley & Sons.
Zou K, Liu A, Bandos A, Ohno-Machado L, Rockette H (2012). Statistical Evaluation of Diag-
nostic Performance: Topics in ROC Analysis. CRC Press.