Machine Learning Algorithm Validation: From Essentials to Advanced Applications and Implications for Regulatory Certification and Deployment
Farhad Maleki, PhD a,1; Nikesh Muthukrishnan, MEng a,1; Katie Ovens, PhD b; Caroline Reinhold, MD, MSc a,c; Reza Forghani, MD, PhD a,c,d,e,f,*
KEYWORDS
Reproducibility • Ability to generalize • Machine learning • Evaluation • Validation • Cross-validation • Deep learning • Artificial intelligence
KEY POINTS
• Understanding and following the best practices for evaluating machine learning (ML) models is essential for developing reproducible and generalizable ML applications.
• The reliability and robustness of a ML application will depend on multiple factors, including dataset size and variety as well as a well-conceived design for ML algorithm development and evaluation.
• A rigorously designed ML model development and evaluation process using large and representative training, validation, and test datasets will increase the likelihood of developing a reliable and generalizable ML application and will also facilitate future regulatory certification.
• Scalable, auditable, and transparent platforms for building and sharing multi-institutional datasets will be a crucial step in developing generalizable solutions in the health care domain.
Funding: R. Forghani is a clinical research scholar (chercheur-boursier clinicien) supported by the Fonds de recherche en santé du Québec (FRQS) and has an operating grant jointly funded by the FRQS and the Fondation de l'Association des radiologistes du Québec (FARQ).
a Augmented Intelligence & Precision Health Laboratory (AIPHL), Department of Radiology & Research Institute of the McGill University Health Centre, 5252 Boulevard de Maisonneuve Ouest, Montreal, Quebec H4A 3S5, Canada; b Department of Computer Science, University of Saskatchewan, 176 Thorvaldson Bldg, 110 Science Place, Saskatoon S7N 5C9, Canada; c Department of Radiology, McGill University, 1650 Cedar Avenue, Montreal, Quebec H3G 1A4, Canada; d Segal Cancer Centre, Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Cote Ste-Catherine Road, Montreal, Quebec H3T 1E2, Canada; e Gerald Bronfman Department of Oncology, McGill University, Suite 720, 5100 Maisonneuve Boulevard West, Montreal, Quebec H4A 3T2, Canada; f Department of Otolaryngology - Head and Neck Surgery, Royal Victoria Hospital, McGill University Health Centre, 1001 Decarie Boulevard, Montreal, Quebec H3A 3J1, Canada
1 F. Maleki and N. Muthukrishnan contributed equally to this article.
* Corresponding author. Room C02.5821, 1001 Decarie Boulevard, Montreal, Quebec H4A 3J1, Canada.
E-mail address: [email protected]
…improve their performance with new experiences. A performance measure, which is defined quantitatively, drives the model building and evaluation process.3

Developing a ML model requires 3 major components: representation, evaluation, and optimization.4 The representation component involves deciding on a type of model or algorithm that is used to represent the association between the input data and the outcomes of interest. Examples of such models are support vector machines, random forests, and neural networks.5 The evaluation component concerns defining and calculating quantitative performance measures that show the goodness of a representation; that is, the capability of a given model to represent the association between inputs and outputs. Among performance measures that are commonly used for model evaluations are accuracy, precision, recall, mean squared error, and the Jaccard index.6,7 The aim of the optimization component is to update the parameters of a given representation (ie, model) with the goal of increasing the performance measures of interest. Examples of approaches used for optimization are gradient descent–based methods and the Newton method.8

In developing ML models, the available data are often partitioned into 3 disjoint sets commonly referred to as training, validation, and test sets. The data from the training set are used to train the model. A model is often trained through an iterative process. In each iteration, a performance measure reflecting the error made by the model when applied to the data in the training set is calculated. This measure is used to update the model parameters in order to reduce the model error when applied to the data in the training set. The model parameters are a set of variables associated with the model, and their values are learned during the training process. Besides model parameters, there might be other variables associated with a model whose values are not updated during training. These variables are referred to as hyperparameters. The optimal or near-optimal values for hyperparameters are determined using data in the validation set. This process is often referred to as hyperparameter tuning. After training and fine-tuning the model, data from the test set are used to evaluate the model for ability to generalize (ie, the performance on unseen data).

This review article first describes the fundamental concepts required for understanding the model evaluation process. Then it explains the main challenges that might affect the ability to generalize of ML models. Next, it highlights common workflows for evaluation of ML models. In addition, it discusses the implications and importance of a robust experimental design to facilitate future certification and strategies required for deployment of ML models in clinical settings.

ESTIMATING ERROR IN MODEL EVALUATION

In ML applications, available data are often partitioned into training, validation, and test sets. A performance measure is used to reflect the model error when applied to data in these sets. The error made by a model when applied to the data in the training set is referred to as training error, and the error made by a model when applied to data in a test set is referred to as test error. The test error is used as an estimate for the generalization error (ie, the error of the model when applied to unseen data). Therefore, it is essential that data in the test set are not used during training and fine-tuning of the model. Irreducible error, also referred to as Bayes error, is another type of error resulting from the inherent noise in the data. Irreducible error is the lowest possible error achievable for a given task using the available data. This error is independent of the model being used and often cannot be mathematically calculated. It is often estimated by the error made by a group of humans with the domain expertise for the task at hand. The resulting estimate is considered an upper bound for irreducible error. Understanding these error types is important for developing and evaluating ML models.

Underfitting and overfitting are defined based on the error types described earlier. An underfitted model achieves a training error that is much higher than the irreducible error, and an overfitted model achieves a training error that is much lower than the test error. These concepts are associated with model complexity; that is, the capacity of a model to represent associations between model inputs and outputs. The complexity of different models can be compared by their number of parameters and the way these parameters interact in the model (eg, linear, nonlinear). Models with high complexity often tend to be too sensitive to the dataset used for training. Often the predictions of a model trained using different datasets, all sampled from the same population, have a high variance, introducing error. Models with high complexity and consequently high error variance tend to overfit. In contrast, low-complexity models may be biased toward learning simpler associations between inputs and outputs that might not be sufficient for representing the true associations. For example, a linear model cannot represent an exponential association between inputs and outputs. Low-complexity models tend to underfit.
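To make these error types concrete, the following minimal sketch (not taken from the article) fits models of increasing complexity to simulated data with an exponential association between input and output and compares their training and test errors under a holdout split; the synthetic data, the polynomial degrees, and mean squared error as the performance measure are illustrative assumptions.

```python
# Minimal sketch: training vs test error for models of different complexity.
# The synthetic data, polynomial degrees, and split ratio are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.exp(3 * X[:, 0]) + rng.normal(scale=0.5, size=30)  # exponential association + irreducible noise

# Holdout split: the test set is locked away and never used for training or fine-tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

models = {
    "linear (low complexity)": LinearRegression(),
    "degree-3 polynomial": make_pipeline(PolynomialFeatures(degree=3), LinearRegression()),
    "degree-15 polynomial (high complexity)": make_pipeline(PolynomialFeatures(degree=15), LinearRegression()),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # Underfitting: both errors high. Overfitting: training error far below test error.
    print(f"{name}: training MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

In a typical run, the linear model shows a high error on both sets (underfitting), whereas the high-degree polynomial shows a near-zero training error but a much larger test error (overfitting).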
Table 1. Suggested terminology for machine learning evaluation.
After the model is trained and fine-tuned, it is evaluated on the test set to provide an estimate of the model generalization error (ie, the error of the resulting model when applied to unseen data). Therefore, it is essential that the test data are not used during training and fine-tuning of the models; otherwise, the estimate for the generalization error would be overoptimistic and unreliable.23

The holdout validation approach is commonly used when training deep learning models with large-scale datasets because it is computationally less demanding. However, for small datasets, this approach is criticized for not using the whole dataset. A small test set might not provide a reliable estimate of model performance, and the resulting performance measures might be sensitive to the choice of the test set. For small datasets, selecting a test set large enough to be representative of the underlying data is often impossible. Further, using a larger test set means that fewer samples are available to be used for training the model, which negatively affects the performance of the resulting model. Also, when fine-tuning a model using this approach, the resulting model may be sensitive to the choice of the validation set, resulting in models with low ability to generalize.

Cross-Validation

Cross-validation is a resampling approach used for the evaluation of ML models. The aim of cross-validation is to provide an unbiased estimate of model performance. Compared with holdout validation, this approach tends to provide a more accurate estimate of generalization error when dealing with small datasets. Various cross-validation techniques for the evaluation of ML models are reviewed next.

K-fold cross-validation

In k-fold cross-validation (KFCV), data points are randomly assigned to k disjoint groups (Fig. 3). In an iterative process, each time one of these k groups is selected as the validation set, the remaining k − 1 groups are combined and used as the training set. This process is iterated k times so that each group is selected once as the validation set. The average of the performance measures across the k iterations is used as the estimate for the validation error. Compared with holdout validation, this approach is computationally more demanding because it requires training and evaluation of the model k times. However, because the model evaluation is performed k times, the variance of the performance measure is reduced and the resulting estimate is more reliable. The value of k is often chosen such that each of the resulting k groups is a representative sample of the dataset. Another factor that plays a role in determining the value of k is the availability of computational resources. Also, KFCV can be run in parallel to speed up the model evaluation process, which can be accomplished because each iteration of KFCV is independent of the other iterations. The 10-fold and 5-fold cross-validations are the most widely used KFCVs for evaluating ML models.5

Stratified k-fold cross-validation

Class imbalance is a common phenomenon in ML. Class imbalance occurs when there is a substantial difference between the number of samples for the majority class and the minority class, where the majority class is defined as the class with the highest number of samples and the minority class is defined as the class with the lowest number of samples. In such a setting, KFCV might lead to unstable performance measures. There might be zero or very few samples from the minority class in 1 or a few of the data folds, which would substantially affect the evaluation metrics for such folds. In the stratified KFCV (SKFCV), each of the k groups of data points is sampled so that the distribution of the classes in each fold closely mirrors the distribution of classes in the whole dataset.

Leave-one-out cross-validation

Although KFCV provides more reliable estimates for generalization error, the resulting model only uses k − 1 groups for training and validation. Leave-one-out cross-validation (LOOCV) uses k = n, where n is the number of samples in the dataset; therefore, all but 1 sample are used for model training. LOOCV is computationally more demanding because it requires training n models. Therefore, it cannot be used when the dataset is very large or the training process for a single model is computationally expensive. LOOCV has been recommended for small or imbalanced datasets.24

Leave-p-out cross-validation

Leave-p-out cross-validation (LPOCV) is an extended form of LOOCV, where validation sets can have p elements instead of 1. It is an exhaustive approach designed to use all of the possible validation sets of size p for the evaluation of ML models. For a dataset of n distinct data points, the number of distinct sets of size p, where p = n/k, is as follows:

$$C(n, p) = \frac{(n - p + 1)(n - p + 2) \cdots n}{1 \cdot 2 \cdots p}$$

Even for moderately large datasets when p > 1, this value grows exponentially, and LPOCV quickly becomes impractical. For small datasets, a value of p = 2, which is known as leave-pair-out cross-validation, is often used to achieve a robust estimate of the model performance.25 Note that for p = 1, this approach is equivalent to LOOCV.
Fig. 3. Three-fold cross-validation (red bounding box). In practice, a portion of samples is locked away for calculating an unbiased estimate of the generalization error. The cross-validation method takes the remaining data as input and randomly assigns them to k disjoint groups (k = 3 in this example). In an iterative process, each time one of these k groups is selected as the validation set (yellow box) and the remaining k − 1 groups are combined and used as the training set (purple boxes). This process is iterated k times so that each group is selected as a validation set once. The average of the model error on the validation sets then can be used as an estimate of the validation error. Note that, in practice, training and validation data in each iteration are used for learning model parameters as well as selecting hyperparameters for the model. Therefore, the resulting estimate is considered an estimate for the validation error, not for the test error, because the validation data are used for both learning the model parameters and hyperparameters; therefore, it might provide an overoptimistic estimate of generalization error, which is the reason why a test set is locked away before conducting cross-validation. In practice, only in rare cases, such as developing a simple linear regression model that includes a fixed set of variables where the model has no hyperparameters to tune, is a test set not locked away. In such cases, the average of model error (performance measure) on the k validation sets can be used as an unbiased estimate of the generalization error (performance measure).
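As a concrete companion to these resampling schemes, the following minimal sketch (not part of the article) instantiates KFCV, SKFCV, LOOCV, and LPOCV with scikit-learn; the toy imbalanced dataset, the logistic regression model, and the specific values of k and p are illustrative assumptions.

```python
# Minimal sketch of common cross-validation schemes with scikit-learn.
from math import comb

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, LeavePOut,
                                     StratifiedKFold, cross_val_score)

# Small, imbalanced toy dataset (roughly 90% majority class, 10% minority class).
X, y = make_classification(n_samples=100, n_features=10, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# KFCV: k models, each trained on k-1 folds and validated on the held-out fold.
kfcv = KFold(n_splits=5, shuffle=True, random_state=0)
print("KFCV mean accuracy:", cross_val_score(model, X, y, cv=kfcv).mean())

# SKFCV: preserves the class distribution in every fold (important with class imbalance).
skfcv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("SKFCV mean accuracy:", cross_val_score(model, X, y, cv=skfcv).mean())

# LOOCV: k = n, so n models must be trained; demanding for large datasets.
print("LOOCV model fits required:", LeaveOneOut().get_n_splits(X))

# LPOCV: exhaustively uses every validation set of size p; C(n, p) grows very quickly.
print("LPOCV (p=2) model fits required:", LeavePOut(p=2).get_n_splits(X))  # C(100, 2) = 4950
print("C(100, 5) =", comb(100, 5))  # already 75,287,520 candidate validation sets
```

The last lines also show how quickly C(n, p) grows with p, which is why LPOCV is rarely practical beyond p = 2.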
Leave-one-group-out cross-validation

In some applications, there might be samples in the dataset that are not independent of each other and are somehow related. In such scenarios, knowledge of one sample from a group might reveal information about the status of other samples in the same group. For example, different pathology slides for the same patient or different MRI scans of a patient during the course of treatment might reveal information about the patient's disease. Having samples from these groups scattered across the training, validation, and test sets results in overoptimistic performance evaluations and leads to a lack of ability to generalize. Leave-one-group-out cross-validation (LOGOCV) is similar to LOOCV but, instead of leaving 1 data point out, it leaves 1 group of samples out, which requires that, for each sample in the dataset, a group identifier be provided. These group identifiers can represent domain-specific stratification of the samples. For example, when developing a model for classifying MRI scans into cancerous and noncancerous, all scans for a patient during an unsuccessful course of treatment should have the same group identifier.

Nested cross-validation

Most ML models rely on several hyperparameters. Tuning these hyperparameters is a common practice in building ML solutions. Often, hyperparameter values that lead to the best performance are experimentally sought. In a traditional cross-validation, where data are split into training and validation sets, experimenting with several models and searching for their optimal hyperparameter values often makes the resulting validation error an overoptimistic performance measure if used for estimating generalization error. Therefore, a test set should be locked away and not be used for model training and hyperparameter tuning. The model performance on this test set can be used as a reliable estimate of generalization error. Selecting a single subset of data as the test set for small datasets leads to estimates for generalization error that have high variance and are sensitive to the composition of the test set. Nested cross-validation (NCV) is used to address this challenge (Fig. 4).

NCV consists of an outer cross-validation loop and an inner cross-validation loop. The outer loop uses different train, validation, and test splits. The inner loop takes a train and validation set chosen by the outer loop; the model with different hyperparameters is then trained using the training set, and the best hyperparameters are chosen based on the performance of the trained models on the validation set. In the outer loop, generalization error is estimated by averaging test error over the test sets in the outer loop. Fig. 4 shows a 4-fold outer with 3-fold inner NCV.
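The following minimal sketch (not from the article) shows one way to combine these ideas with scikit-learn: a leave-one-group-out split keyed on patient identifiers, and a nested scheme in which an inner grid search tunes hyperparameters while an outer group-aware loop estimates generalization error; the synthetic dataset, the patient grouping, the support vector machine, and the hyperparameter grid are all illustrative assumptions.

```python
# Minimal sketch: leave-one-group-out splitting and nested cross-validation.
# Dataset, grouping, model, and hyperparameter grid are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, GroupKFold, LeaveOneGroupOut,
                                     cross_val_score)
from sklearn.svm import SVC

# Toy dataset: 120 samples (eg, image patches) drawn from 30 patients, 4 samples per patient.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)
groups = np.repeat(np.arange(30), 4)  # group identifier = patient ID

# LOGOCV: all samples of a patient stay together on one side of each split.
logocv = LeaveOneGroupOut()
print("LOGOCV folds:", logocv.get_n_splits(X, y, groups))  # one fold per patient

# Nested CV: the inner loop (GridSearchCV) tunes hyperparameters; the outer loop
# estimates generalization error on folds never seen by the tuning step.
# For simplicity the inner split here is a plain 3-fold; in a real grouped setting the
# inner split should also respect the patient groups.
inner_model = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=3)
outer_cv = GroupKFold(n_splits=4)  # outer folds keep each patient's samples intact
scores = cross_val_score(inner_model, X, y, groups=groups, cv=outer_cv)
print("Nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Each outer fold thus yields a test estimate obtained with hyperparameters tuned only on that fold's training data, mirroring the scheme depicted in Fig. 4.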
Fig. 4. A 4-fold outer and 3-fold inner NCV. First, the samples in the dataset are randomly shuffled. Then the
outer loop uses different train (purple box), validation (yellow box), and test (orange box) splits. The outer folds
1, 2, 3, and 4 are depicted in the top-left, top-right, bottom-left, and bottom-right corners, respectively. For each
outer fold, a 3-fold cross-validation highlighted in a red box is used. The model with different hyperparameters is
trained using the training set, and the optimal hyperparameters are chosen based on the average performance of
the trained models on the validation sets. In the outer loop, generalization error is estimated by averaging test
error over the 4 test sets.
DATA USED FOR MODEL EVALUATION

Different approaches for model development and evaluation and their impact on algorithm performance were discussed earlier. Here, this article discusses the important attributes of the datasets used for developing reliable ML algorithms.

Data: Size Matters

In some disciplines, developing large datasets for building and evaluating ML models might not be practical. For example, in the medical domain, developing large-scale datasets is often not an option because of the rarity of the phenotype under study, limited financial resources, limited expertise required for data preparation or annotation, patients' privacy concerns, or other ethical or legal concerns and barriers. For example, in the medical imaging domain, experienced physicians are required to manually annotate medical images reliably. Furthermore, because of patients' privacy concerns and specific legal and regulatory requirements in different jurisdictions, developing a large-scale multi-institutional dataset can be challenging. Therefore, most research is conducted using small datasets. Splitting such small datasets into train, validation, and test sets further shrinks the dataset used for model evaluation, which leads to unreliable estimates of performance measures. Consequently, the resulting models suffer from a lack of ability to generalize and lack of reproducibility.1 When the dataset used for model building and evaluation is small, the LOOCV or LOGOCV approach is recommended for model evaluation.

If possible, public datasets can be added to the local dataset; however, depending on the structure of a public dataset, there could be an inherent selection bias. Clinicians must be aware of this issue in order to evaluate its impact on the ability to generalize. One example is a mucosal head and neck cancer set consisting mostly of a subset of the disease; for example, human papilloma virus (HPV)–positive oropharyngeal head and neck squamous cell carcinomas (HNSCCs) treated with radiation and chemotherapy. Models trained on such a dataset (eg, for predicting treatment response and outcome) may not be generalizable to HNSCCs of the oral cavity, which are typically HPV negative and treated surgically, even though they are still pathologically mucosal HNSCC. The quality of labeling and annotations can also affect model performance. The variability between the public dataset annotations and the annotations in the training data may also lead to models with low ability to generalize.26 Therefore, data aggregation does not always lead to improving model performance and generalization.27

Crowdsourced annotations have also been used to address the challenge of annotating medical datasets.28–30 Several publications have explored the differences between expert contours and crowdsourced nonexpert contours, which are generally considered as noisy annotations.28–30 This research suggests that crowdsourced annotations can translate to improving model performance only with carefully crafted strategies.28–30

Another approach commonly used in medical imaging is increasing the number of samples using patch-based approaches.31 In these approaches, two-dimensional or three-dimensional (3D) patches are extracted from medical images. These patches are then used for model training and evaluation. These approaches often extract several patches from a single image, which leads to increasing the number of available data points for developing and evaluating ML models. For example, instead of treating the GBM example in Fig. 1 as a single training image, it can be split into several small patches of GBM samples. In such scenarios when the dataset is small, using an LOGOCV approach for model evaluation is recommended to achieve a reliable evaluation of the model performance. Otherwise, the performance measures resulting from this approach might be unreliable and overoptimistic.
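The following minimal sketch (not from the article) illustrates the basic mechanics of grid-based 3D patch extraction with NumPy; the volume dimensions, patch size, and the closing flip operation are illustrative assumptions, and a real pipeline would operate on actual MRI or CT volumes rather than random numbers.

```python
# Minimal sketch: grid-based 3D patch extraction from a single volume.
# Volume size, patch size, and the flip "augmentation" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
volume = rng.normal(size=(96, 96, 64))  # stand-in for a single 3D scan (eg, one GBM case)

def extract_patches(vol, patch_size=(32, 32, 32), step=(32, 32, 32)):
    """Slide a 3D window over the volume and return the resulting patches."""
    patches = []
    for i in range(0, vol.shape[0] - patch_size[0] + 1, step[0]):
        for j in range(0, vol.shape[1] - patch_size[1] + 1, step[1]):
            for k in range(0, vol.shape[2] - patch_size[2] + 1, step[2]):
                patches.append(vol[i:i + patch_size[0],
                                   j:j + patch_size[1],
                                   k:k + patch_size[2]])
    return np.stack(patches)

patches = extract_patches(volume)
print("patches extracted from one scan:", patches.shape)  # (18, 32, 32, 32) for this geometry

# Simple flip-based augmentation of the kind discussed in the next paragraph:
augmented = np.flip(patches, axis=1)  # left-right flip of every patch
```

Here a single scan yields 18 patches, each usable as a training or evaluation sample, provided that all patches from the same scan share a group identifier as discussed earlier.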
Using data augmentation is another alternative for increasing the number of samples used for training and evaluation of ML models. Among examples of commonly used data augmentation techniques are geometric affine transformations such as translation, scaling, and rotation. Data augmentation has been widely used in building ML models for image data, and sophisticated software packages are available for this task.32,33 However, most of these tools have been designed for regular RGB data (images with three color channels: red, green, and blue) and do not support 3D medical images. Therefore, for 3D images, simple augmentations such as flipping and rotation have been commonly used. Synthetic image generation using generative adversarial networks (GANs) has also been used for data augmentation.34–38 Although data augmentation is an important tool in developing ML models, its proper and impactful application has to be evaluated on a case-by-case basis. When using data augmentation, clinicians must ensure that it is not simply being used to amplify or overrepresent information within the training data, which could result in overfitting.

Data: Variety Matters

Alongside the size of the datasets used for developing and evaluating ML models, the variety in the dataset is a crucial element to consider. Datasets are typically gathered under a variety of circumstances. For example, in medical imaging, scans from different institutions may substantially vary because of factors such as different scanner settings, disease prevalence at a specific institution because of population demographics, and the use of different protocols. Even within an institution, there are frequently different scanner types, resulting in technical variations, among a list of other potential sources of technical variations and noise. With these factors affecting the data variability, the training, validation, and test sets should represent these variations to be able to create a generalizable model.

Furthermore, because models are selected based on the performance on the validation set, it is crucial that the distribution of the validation set follows the distribution of the test set.8 If the data distribution of the validation set is different from that of the test set (eg, validation and test sets are from different institutions or different scanners), the model performance based on the validation set may not translate to a clear picture of model performance for the test set. This situation often manifests as a decline in the model performance measures from the validation set to the test set.

In some medical imaging research, data collected from selected institutions have been used for model building, and the performance of the models has been evaluated using data from a different institution.14–16 For such models, the performance measures on the test set may not achieve their maximum performance if there is a considerable difference between the distribution of data used for training and data used for evaluating the models.39 If factors that might substantially affect the data distribution can be controlled, using data from 2 institutions can lead to improved performance measures. In such scenarios, data from the first institution can be used for model development and data from the second institution can be used for model evaluation; because the test set does not need to be held out from the data from the first institution, a larger dataset can be used for model building. In addition, this approach does not require sharing the original dataset between institutions, because the trained model in the first institution can be shared with the second institution to be evaluated.

Another important factor that needs to be considered when using data from 1 institution for model development and data from another institution for evaluation is data representation. Although this approach is considered by many as the gold standard approach for developing a model and evaluating its performance, clinicians need to be aware of the inherent potential pitfalls of this approach. This approach can only be successful if the unique characteristics of the evaluation data (eg, from the second institution) are reflected or represented in the training data (eg, from the first institution). If this is not the case, the performance may not be optimal and the generalization error could be overestimated. To make a practical comparison, if an institution is deploying new image analysis software developed based on data from other institutions, the out-of-the-box algorithm will not perform optimally unless the major characteristics of the scans at the deploying institution are compatible with those used for generating the data used for developing the image analysis software. Using this logic, it also follows that, when deploying an algorithm in a new
environment, it may be worthwhile to either evaluate for representation through analyses such as outlier analysis (discussed later) or first perform additional training and validation in the new environment for optimization before deploying the algorithm for use in the new environment.

In addition, certain data samples may be poorly represented in a given dataset. It is important to identify these samples and determine whether they should be excluded or whether more of such samples are necessary in the dataset. Consider a dataset where a small subset of the images are degraded with severe artifacts. Artifacts are common in clinical practice and can be caused by noise, beam hardening from normal anatomic structures, or metal implants. To deploy a generalized model in practice, a model should be able to predict and properly process the image if significant artifact is present; therefore, it is essential that the model is exposed to artifacts in the training and evaluation phases. If images with severe artifact are not well represented, the model may lack the ability to process these cases correctly. To address this issue, a possible approach is the use of preprocessing techniques for artifact reduction as a first step before feeding images to the trained model.40 In this way, the trained model is treated as a specialized model that is only able to make a prediction or classification in the absence of severe artifact.

To determine whether a sample is poorly represented, techniques that measure the similarity across images in a dataset can be used to identify outliers.41–43 These techniques use pretrained models to compute feature vectors for each sample in the dataset. Then, similarity scores between all samples are calculated to detect outliers (ie, samples that are different from the rest of the samples in the dataset). Such samples tend to have characteristics that are poorly represented in the dataset. These techniques are also useful to consider when using data from different institutions.39,44 By identifying outliers or underrepresented samples, samples can be removed from the validation sets to fine-tune performance for a more specific application, or samples similar to the outlier samples can be introduced to the training data to increase confidence in these samples. If a model is designed for working in a specific scenario (eg, only for data with no severe artifact), the limitations of the resulting model must be clearly communicated to avoid using the model in the wrong context. In a deployment setting, such approaches may even be used to flag an image set or scan that may not be well represented and consequently has a high likelihood of not being reliably evaluated by an algorithm. In such cases, the expert radiologist making the final interpretation would be made aware of the potential pitfall, taking this into account for the final interpretation.
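The following minimal sketch (not from the article) illustrates this similarity-based screening; the random stand-in images and the hand-crafted intensity statistics are illustrative assumptions, and in a real pipeline the feature vectors would come from a pretrained network as described above.

```python
# Minimal sketch: similarity-based screening for poorly represented samples.
# The stand-in images and feature extractor are illustrative assumptions.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
images = rng.normal(loc=100.0, scale=10.0, size=(50, 64, 64))  # 50 stand-in images
images[0] += 400.0  # one image with very different characteristics (eg, severe artifact)

def extract_features(img):
    """Stand-in feature vector; replace with embeddings from a pretrained model."""
    return np.array([img.mean(), img.std(), np.percentile(img, 5), np.percentile(img, 95)])

features = np.stack([extract_features(img) for img in images])

# Pairwise similarity between all samples; poorly represented samples tend to have
# low average similarity to the rest of the dataset.
sim = cosine_similarity(features)
np.fill_diagonal(sim, np.nan)
avg_sim = np.nanmean(sim, axis=1)
candidates = np.argsort(avg_sim)[:3]  # the least similar samples, reviewed as candidate outliers
print("candidate outliers:", candidates)
```

Samples flagged in this way can then be reviewed and either excluded, supplemented with similar cases, or routed to a dedicated preprocessing step.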
IMPLICATION OF ROBUST EXPERIMENTAL DESIGN FOR REGULATORY APPROVAL AND CERTIFICATION

Although a comprehensive discussion of regulatory approval and certification is beyond the scope of this article, optimal algorithm development and evaluation is paramount to successful certification and deployment for patient care in the clinical setting. This article therefore concludes with a brief discussion of this topic. Learning from other industries, such as the pharmaceutical industry, by establishing well-conceived guidelines and ensuring rigorous experimental design from the outset, there is an opportunity to accelerate the future translation of ML algorithms for clinical deployment and use for patient care. The adoption of robust industry-grade platforms that are auditable is also likely to facilitate future certification and clinical deployment.

As interest in ML continues to grow, the widespread deployment of ML models in clinical settings is highly anticipated. The performance of ML models in medical imaging has been shown to be superior to or comparable with that of human experts in very selective and controlled settings.45 In the future, ML models are expected to provide predictions for clinical outcomes of interest and assist clinicians in providing a diagnosis and treatment plan in a timely and accurate manner, enabling more precise and personalized patient therapy and management.46

Before deploying a model in a clinical setting, its performance needs to be thoroughly validated and the generalization error needs to be understood. Moreover, if a model is expected to be generalizable, it must have exposure to a variety of samples. Intuitively, data from different institutions may be considered as different distributions based on many factors, including different scanner settings, disease prevalence at a specific institution caused by population demographics, and the use of different protocols. Generalizability is achieved by carefully considering these factors and their implications, as discussed in this article, and incorporating them into the experimental design. To properly train generalizable models, large-scale multi-institutional datasets would also be beneficial. Having more data for ML models increases the confidence in predictions and allows robust validation and testing, providing better estimations of the generalization error.45

Acquiring large-scale datasets is susceptible to its own challenges. Current regulations and infrastructure limitations make data sharing between institutions a tedious and time-consuming process. Furthermore, medical image sets can be large in volume, ranging from several hundred megabytes to several gigabytes, which highlights the need for specialized infrastructures for developing large-scale multi-institutional datasets. Secure cloud platforms that facilitate distributed data access would be ideal for collaboration and building such large-scale multi-institutional datasets. The implementation of scalable and streamlined platforms for data preparation and curation will be a key factor in facilitating the development of reliable ML algorithms in the future.

Before deployment of a ML model in a specific institution, the model needs to be validated within that institution to verify that the model can meet the performance requirements using the local data. Models can also be fine-tuned to the local data to achieve better localized performance, as discussed earlier. However, any changes to a deployed algorithm or its performance need to be extensively examined. Alongside such local validation, ML models generally need to be validated over time as well. For example, as the prevalence of diseases changes, the deployed models might need to adapt as well. Although ML algorithms can learn from exposure or "experience," the implementation of an actively changing or mutating algorithm for patient care in the clinical setting would be a very complex process that would require robust feedback loops and quality monitoring, ensuring reliable performance and stability, and is unlikely to be the model for implementation in the foreseeable future. Instead, the current model is to develop and evaluate an algorithm using large and varied datasets. The algorithm that is deployed will not be actively changing based on its use after deployment. As such, alternative mechanisms, such as quality monitoring and periodic updates by the vendors based on additional training and evaluation, could be a potential model for optimizing performance in the clinical setting.

Because of the reasons mentioned earlier, algorithms are expected to adapt and will be required to be accessible, transparent, and auditable. Transparent platforms that allow for continuous evaluation of the performance of an ML model are required to approve the implementation of new ML models or software updates in clinical settings.47 These platforms should be transparent and auditable such that clinicians can investigate any underlying biases in the datasets or models.47 Explainability, to the extent feasible, will also facilitate deployment and adoption. Software pilot programs such as the Precertification Program outlined in the US Food and Drug Administration's Digital Health Innovation Action Plan (https://www.fda.gov/medical-devices/digital-health/digital-health-software-precertification-pre-cert-program) are models that will help the future development of a regulatory framework for streamlined and efficient regulatory oversight of applications developed by manufacturers with a demonstrated culture of quality and organizational excellence. This framework could represent a mechanism through which would-be trusted vendors could deploy artificial intelligence (AI)–based software in an efficient and streamlined manner, including deployment of software iterations and changes, under appropriate controls and oversight. In addition, current regulatory frameworks consider AI algorithms as software as a medical device, which are expected to be locked and not evolving.48 As experience and comfort with ML applications increases, new regulatory frameworks will have to be developed to allow model adaptations that enable optimal performance while ensuring reliability and patient safety.

SUMMARY

With the surge in popularity of ML and deep learning solutions and increasing investments in such approaches, ML solutions that fail to generalize when applied to external data may gain public attention that could hinder the slow but steady adoption of ML in the health care domain. Following the best practices for the development and evaluation of ML models is a necessity for developing generalizable solutions that can be deployed in clinical settings. This requirement is even more important for deep learning models, which have high capacities and can easily overfit to the available data if a proper methodology for model evaluation is not followed.

Evaluating ML models in the health care domain is often a challenging task because of the difficulty of developing large-scale datasets resulting from the lack of required resources or ethical issues. Because of the small datasets used for development and evaluation of ML models, applications that do not follow a rigorous and sound evaluation procedure are prone to overfitting to the available data. Lack of familiarity with the best practices for model evaluation leads to a lack of generalization of published research. Also, the unavailability of code and data makes evaluating and reproducing such models difficult.
A rigorous experimental design, the use of transparent platforms for building and sharing multi-institutional datasets, and following best practices for model evaluation will be crucial steps in developing generalizable solutions in the health care domain. Such platforms could also serve as a medium for reproducible research, which would then increase the likelihood of successful deployment of ML models in the health care domain, with the potential to streamline health care processes, increase efficiency and quality, and improve patient care through precision medicine.

REFERENCES

1. Beam AL, Manrai AK, Ghassemi M. Challenges to the reproducibility of machine learning models in health care. JAMA 2020;323(4):305–6.
2. McDermott MB, Wang S, Marinsek N, et al. Reproducibility in machine learning for health. Paper presented at: 2019 Reproducibility in Machine Learning, RML@ICLR 2019 Workshop. New Orleans, May 6, 2019.
3. Forghani R, Savadjiev P, Chatterjee A, et al. Radiomics and artificial intelligence for biomarker and prediction model development in oncology. Comput Struct Biotechnol J 2019;17:995.
4. Domingos PM. A few useful things to know about machine learning. Commun ACM 2012;55(10):78–87.
5. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning, vol. 1. New York: Springer Series in Statistics; 2001.
6. Bertels J, Eelbode T, Berman M, et al. Optimizing the Dice score and Jaccard index for medical image segmentation: theory and practice. Paper presented at: International Conference on Medical Image Computing and Computer-Assisted Intervention. Shenzhen (China), October 13-17, 2019.
7. Tharwat A. Classification assessment methods. New England Journal of Entrepreneurship 2020. https://doi.org/10.1016/j.aci.2018.08.003.
8. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge (MA): MIT Press; 2016.
9. Quinlan JR. Bagging, boosting, and C4.5. Paper presented at: AAAI/IAAI, Vol. 1. Portland (Oregon), August 4–8, 1996.
10. Obermeyer Z, Emanuel EJ. Predicting the future — big data, machine learning, and clinical medicine. N Engl J Med 2016;375(13):1216.
11. Erickson BJ, Korfiatis P, Akkus Z, et al. Machine learning for medical imaging. Radiographics 2017;37(2):505–15.
12. Steyerberg EW, Bleeker SE, Moll HA, et al. Internal and external validation of predictive models: a simulation study of bias and precision in small samples. J Clin Epidemiol 2003;56(5):441–7.
13. Steyerberg EW, Harrell FE. Prediction models need appropriate internal, internal–external, and external validation. J Clin Epidemiol 2016;69:245–7.
14. Kann BH, Hicks DF, Payabvash S, et al. Multi-institutional validation of deep learning for pretreatment identification of extranodal extension in head and neck squamous cell carcinoma. J Clin Oncol 2020;38(12):1304–11.
15. Welch ML, McIntosh C, Traverso A, et al. External validation and transfer learning of convolutional neural networks for computed tomography dental artifact classification. Phys Med Biol 2020;65(3):035017.
16. Datema FR, Ferrier MB, Vergouwe Y, et al. Update and external validation of a head and neck cancer prognostic model. Head Neck 2013;35(9):1232–7.
17. König IR, Malley J, Weimar C, et al. Practical experiences on the necessity of external validation. Stat Med 2007;26(30):5499–511.
18. Kocak B, Yardimci AH, Bektas CT, et al. Textural differences between renal cell carcinoma subtypes: machine learning-based quantitative computed tomography texture analysis with independent external validation. Eur J Radiol 2018;107:149–57.
19. Guyon I. A scaling law for the validation-set training-set size ratio. Berkeley (CA): AT&T Bell Laboratories; 1997. p. 1–11.
20. Forghani R, Chatterjee A, Reinhold C, et al. Head and neck squamous cell carcinoma: prediction of cervical lymph node metastasis by dual-energy CT texture analysis with machine learning. Eur Radiol 2019;29(11):6172–81.
21. Guyon I, Makhoul J, Schwartz R, et al. What size test set gives good error rate estimates? IEEE Trans Pattern Anal Mach Intell 1998;20(1):52–64.
22. Hutter F, Kotthoff L, Vanschoren J. Automated machine learning: methods, systems, challenges. Berkeley (CA): Springer Nature; 2019.
23. Russell S, Norvig P. Artificial intelligence: a modern approach. 3rd edition. Upper Saddle River (NJ): Pearson; 2009.
24. Wong T-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition 2015;48(9):2839–46.
25. Airola A, Pahikkala T, Waegeman W, et al. An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Comput Stat Data Anal 2011;55(4):1828–44.
26. Cohen JP, Hashir M, Brooks R, et al. On the limits of cross-domain generalization in automated X-ray prediction. arXiv preprint arXiv:2002.02497. 2020.
27. Saha A, Harowicz MR, Mazurowski MA. Breast cancer MRI radiomics: an overview of algorithmic features and impact of inter-reader variability in annotating tumors. Med Phys 2018;45(7):3076–85.
28. Albarqouni S, Baur C, Achilles F, et al. Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans Med Imaging 2016;35(5):1313–21.
29. McKenna MT, Wang S, Nguyen TB, et al. Strategies for improved interpretation of computer-aided detections for CT colonography utilizing distributed human intelligence. Med Image Anal 2012;16(6):1280–92.
30. Nguyen TB, Wang S, Anugu V, et al. Distributed human intelligence for colonic polyp classification in computer-aided detection for CT colonography. Radiology 2012;262(3):824–33.
31. Greenspan H, Van Ginneken B, Summers RM. Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE Trans Med Imaging 2016;35(5):1153–9.
32. Buslaev A, Iglovikov VI, Khvedchenya E, et al. Albumentations: fast and flexible image augmentations. Information 2020;11(2):125.
33. Cubuk ED, Zoph B, Mane D, et al. Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. 2018.
34. Zhao H, Li H, Maurer-Stroh S, et al. Synthesizing retinal and neuronal images with generative adversarial nets. Med Image Anal 2018;49:14–26.
35. Salehinejad H, Colak E, Dowdell T, et al. Synthesizing chest x-ray pathology for training deep convolutional neural networks. IEEE Trans Med Imaging 2018;38(5):1197–206.
36. Han C, Kitamura Y, Kudo A, et al. Synthesizing diverse lung nodules wherever massively: 3D multi-conditional GAN-based CT image augmentation for object detection. Paper presented at: 2019 International Conference on 3D Vision (3DV). Québec (Canada), September 16-19, 2019.
37. Frid-Adar M, Diamant I, Klang E, et al. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 2018;321:321–31.
38. Beers A, Brown J, Chang K, et al. High-resolution medical image synthesis using progressively grown generative adversarial networks. arXiv preprint arXiv:1805.03144. 2018.
39. Storkey A. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning; 2009. p. 3–28.
40. Philipsen RHHM, Maduskar P, Hogeweg L, et al. Localized energy-based normalization of medical images: application to chest radiography. IEEE Trans Med Imaging 2015;34(9):1965–75.
41. Zhang M, Leung KH, Ma Z, et al. A generalized approach to determine confident samples for deep neural networks on unseen data. In: Greenspan H, Tanno R, Erdt M, et al, editors. Uncertainty for safe utilization of machine learning in medical imaging and clinical image-based procedures. Springer; 2019. p. 65–74.
42. Salimans T, Goodfellow I, Zaremba W, et al. Improved techniques for training GANs. Paper presented at: Advances in Neural Information Processing Systems. Barcelona (Spain), December 5-10, 2016.
43. Heusel M, Ramsauer H, Unterthiner T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Paper presented at: Advances in Neural Information Processing Systems. Long Beach (CA), December 4-9, 2017.
44. Glocker B, Robinson R, Castro DC, et al. Machine learning with multi-site imaging data: an empirical study on the impact of scanner effects. arXiv preprint arXiv:1910.04597. 2019.
45. Miller DD, Brown EW. Artificial intelligence in medical practice: the question to the answer? Am J Med 2018;131(2):129–33.
46. Jiang F, Jiang Y, Zhi H, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol 2017;2(4):230–43.
47. He J, Baxter SL, Xu J, et al. The practical implementation of artificial intelligence technologies in medicine. Nat Med 2019;25(1):30.
48. Shah P, Kendall F, Khozin S, et al. Artificial intelligence and machine learning in clinical development: a translational perspective. NPJ Digit Med 2019;2(1):1–5.