Calibration: The Achilles Heel of Predictive Analytics
Abstract
Background: The assessment of calibration performance of risk prediction models based on regression or more
flexible machine learning algorithms receives little attention.
Main text: Herein, we argue that this needs to change immediately because poorly calibrated algorithms can be
misleading and potentially harmful for clinical decision-making. We summarize how to avoid poor calibration at
algorithm development and how to assess calibration at algorithm validation, emphasizing balance between model
complexity and the available sample size. At external validation, calibration curves require sufficiently large samples.
Algorithm updating should be considered for appropriate support of clinical practice.
Conclusion: Efforts are required to avoid poor calibration when developing prediction models, to evaluate
calibration when validating models, and to update models when indicated. The ultimate aim is to optimize the
utility of predictive analytics for shared decision-making and patient counseling.
Keywords: Calibration, Risk prediction models, Predictive analytics, Overfitting, Heterogeneity, Model performance
lead to models that suffer greatly from poor calibration [9, 10]. Calibration has therefore been labeled the 'Achilles heel' of predictive analytics [11].

Reporting on calibration performance is recommended by the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines for prediction modeling studies [12]. Calibration is especially important when the aim is to support decision-making, even when discrimination is moderate, as for breast cancer prediction models [13]. We explain the relevance of calibration in this paper and suggest solutions to prevent or correct poor calibration and thus make predictive algorithms more clinically relevant.

How can inaccurate risk predictions be misleading?
If the algorithm is used to inform patients, poorly calibrated risk estimates lead to false expectations with patients and healthcare professionals. Patients may make personal decisions in anticipation of an event, or the absence thereof, that were in fact misguided. Take, for example, a prediction model that predicts the chance that in vitro fertilization (IVF) treatment leads to a live birth [14]. Irrespective of how well the models can discriminate between treatments that end in live birth versus those that do not, it is clear that strong over- or underestimation of the chance of a live birth makes the algorithms clinically unacceptable. For instance, a strong overestimation of the chance of live birth after IVF would give false hope to couples going through an already stressful and emotional experience. Treating a couple who, in reality, has an unfavorable prognosis exposes the woman unnecessarily to possible harmful side effects, e.g., ovarian hyperstimulation syndrome.

In fact, poor calibration may make an algorithm less clinically useful than a competitor algorithm that has a lower AUC but is well calibrated [8]. As an example, consider the QRISK2-2011 and NICE Framingham models to predict the 10-year risk of cardiovascular disease. An external validation study of these models in 2 million patients from the United Kingdom indicated that QRISK2-2011 was well calibrated and had an AUC of 0.771, whereas NICE Framingham was overestimating risk, with an AUC of 0.776 [15]. When using the traditional risk threshold of 20% to identify high-risk patients for intervention, QRISK2-2011 would select 110 per 1000 men aged between 35 and 74 years. On the other hand, NICE Framingham would select almost twice as many (206 per 1000 men) because a predicted risk of 20% based on this model actually corresponded to a lower event rate. This example illustrates that overestimation of risk leads to overtreatment. Conversely, underestimation leads to undertreatment.

Why may an algorithm give poorly calibrated risk predictions?
Many possible sources may distort the calibration of risk predictions. A first set of causes relates to variables and characteristics unrelated to algorithm development. Often, patient characteristics and disease incidence or prevalence rates vary greatly between health centers, regions, and countries [16]. When an algorithm is developed in a setting with a high disease incidence, it may systematically give overestimated risk estimates when used in a setting where the incidence is lower [17]. For example, university hospitals may treat more patients with the event of interest than regional hospitals; such heterogeneity between settings can affect risk estimates and their calibration [18]. The predictors in the algorithm may explain a part of the heterogeneity, but often differences between predictors will not explain all differences between settings [19]. Patient populations also tend to change over time, e.g., due to changes in referral patterns, healthcare policy, or treatment policies [20, 21]. For example, in the last 10 years, there has been a drive in Europe to lower the number of embryos transferred in IVF, and improvements in IVF cryopreservation technology have led to an increase in embryo freezing and storage for subsequent transfer [22]; such evolutions may change the calibration of algorithms that predict IVF success [23].

A second set of causes relates to methodological problems regarding the algorithm itself. Statistical overfitting is common. It is caused by a modeling strategy that is too complex for the amount of data at hand (e.g., too many candidate predictors, predictor selection based on statistical significance, or use of a very flexible algorithm such as a neural network) [24]. Overfitted predictions capture too much random noise in the development data. Thus, when validated on new data, an overfitted algorithm is expected to show lower discrimination performance and predicted risks that are too extreme: patients at high risk of the event tend to get overestimated risk predictions, whereas patients at low risk of the event tend to get underestimated risk predictions. Apart from statistical overfitting, medical data usually contain measurement error; for example, biomarker expressions vary with assay kits, and ultrasound measurement of tumor vascularity has inter- and intra-observer variability [25, 26]. If measurement error systematically differs between settings (e.g., measurements of a predictor are systematically more biased upward in a different setting), this affects the predicted risks and thus the calibration of an algorithm [27].
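To make the overfitting mechanism concrete, the following minimal simulation sketch (assuming Python with NumPy, scikit-learn, and statsmodels; the data and settings are hypothetical) fits an essentially unpenalized logistic regression with many noise predictors on a small development set and then estimates the calibration slope on a large validation set; a slope well below 1 signals overly extreme predictions.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(n, n_noise=20):
    # Two informative predictors plus pure-noise predictors; roughly 25% event rate.
    x = rng.normal(size=(n, 2 + n_noise))
    true_logit = -1.3 + 0.8 * x[:, 0] + 0.6 * x[:, 1]
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))
    return x, y

x_dev, y_dev = simulate(150)      # small development set
x_val, y_val = simulate(50_000)   # large validation set

# A very large C makes the fit essentially unpenalized (maximum likelihood).
overfit_model = LogisticRegression(C=1e6, max_iter=10_000).fit(x_dev, y_dev)
p_val = np.clip(overfit_model.predict_proba(x_val)[:, 1], 1e-8, 1 - 1e-8)

# Calibration slope: coefficient of logit(predicted risk) in a logistic regression
# of the observed outcome on that logit; values below 1 mean too-extreme predictions.
logit_p = np.log(p_val / (1 - p_val))
slope_fit = sm.GLM(y_val, sm.add_constant(logit_p), family=sm.families.Binomial()).fit()
print(f"Calibration slope at validation: {slope_fit.params[1]:.2f}")  # typically well below 1
```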
Fig. 1 Illustrations of different types of miscalibration. Illustrations are based on an outcome with a 25% event rate and a model with an area under the ROC curve (AUC or c-statistic) of 0.71. Calibration intercept and slope are indicated for each illustrative curve. a General over- or underestimation of predicted risks. b Predicted risks that are too extreme or not extreme enough
How to assess calibration?
The concepts explained in this section are illustrated in Additional file 1, with the validation of the Risk of Ovarian Malignancy Algorithm (ROMA) for the diagnosis of ovarian malignancy in women with an ovarian tumor selected for surgical removal [28]; further details can be found elsewhere [1, 4, 29].
According to four increasingly stringent levels of calibration, models can be calibrated in the mean, weak, moderate, or strong sense [4]. First, to assess 'mean calibration' (or 'calibration-in-the-large'), the average predicted risk is compared with the overall event rate. When the average predicted risk is higher than the overall event rate, the algorithm overestimates risk in general. Conversely, underestimation occurs when the observed event rate is higher than the average predicted risk.

Second, 'weak calibration' means that, on average, the model does not over- or underestimate risk and does not give overly extreme (too close to 0 and 1) or modest (too close to disease prevalence or incidence) risk estimates. Weak calibration can be assessed by the calibration intercept and calibration slope. The calibration slope evaluates the spread of the estimated risks and has a target value of 1. A slope < 1 suggests that estimated risks are too extreme, i.e., too high for patients who are at high risk and too low for patients who are at low risk. A slope > 1 suggests the opposite, i.e., that risk estimates are too moderate. The calibration intercept, which is an assessment of calibration-in-the-large, has a target value of 0; negative values suggest overestimation, whereas positive values suggest underestimation.
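As a concrete illustration, the calibration intercept and slope can be estimated with two simple logistic regression fits on the logit of the predicted risks. The sketch below is a minimal implementation (assuming Python with NumPy and statsmodels; `y` and `p_hat` are hypothetical arrays of observed 0/1 outcomes and predicted risks): the slope is the coefficient of the logit of the predicted risk, and the intercept is estimated with the slope fixed at 1 by entering that logit as an offset.

```python
import numpy as np
import statsmodels.api as sm

def calibration_intercept_slope(y, p_hat, eps=1e-8):
    """Estimate the calibration intercept (target 0) and slope (target 1)."""
    y = np.asarray(y)
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    logit_p = np.log(p / (1 - p))

    # Slope: coefficient of logit(p_hat) in a logistic regression of y on logit(p_hat).
    slope = sm.GLM(y, sm.add_constant(logit_p),
                   family=sm.families.Binomial()).fit().params[1]

    # Intercept: logistic regression with logit(p_hat) as an offset, i.e., slope fixed at 1.
    intercept = sm.GLM(y, np.ones_like(logit_p), family=sm.families.Binomial(),
                       offset=logit_p).fit().params[0]
    return intercept, slope

# Hypothetical usage with predicted risks p_hat and observed 0/1 outcomes y:
# intercept, slope = calibration_intercept_slope(y, p_hat)
# A negative intercept suggests overestimation; a slope < 1 suggests overly extreme risks.
```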
Third, moderate calibration implies that estimated risks correspond to observed proportions, e.g., among patients with an estimated risk of 10%, 10 in 100 have or develop the event. This is assessed with a flexible calibration curve to show the relation between the estimated risk (on the x-axis) and the observed proportion of events (y-axis), for example, using loess or spline functions. A curve close to the diagonal indicates that predicted risks correspond well to observed proportions. We show a few theoretical curves in Fig. 1a, b, each of which corresponds to different calibration intercepts and slopes. Note that a calibration intercept close to 0 and a calibration slope close to 1 do not guarantee that the flexible calibration curve is close to the diagonal (see Additional file 1 for an example). To obtain a precise calibration curve, a sufficiently large sample size is required; a minimum of 200 patients with and 200 patients without the event has been suggested [4], although further research is needed to investigate how factors such as disease prevalence or incidence affect the required sample size [12]. In small datasets, it is defendable to evaluate only weak calibration by calculating the calibration intercept and slope.
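Such a flexible calibration curve can be drawn by smoothing the observed outcomes against the predicted risks. The sketch below uses the lowess smoother (assuming Python with Matplotlib and statsmodels; `y` and `p_hat` are again hypothetical arrays); a spline-based smoother could be substituted in the same way.

```python
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def plot_calibration_curve(y, p_hat, frac=0.3):
    """Plot a lowess-smoothed calibration curve (estimated risk vs observed proportion)."""
    # Column 0: sorted estimated risks; column 1: smoothed observed outcomes.
    smoothed = lowess(y, p_hat, frac=frac)
    plt.plot(smoothed[:, 0], smoothed[:, 1], label="Flexible calibration curve")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Ideal (diagonal)")
    plt.xlabel("Estimated risk")
    plt.ylabel("Observed proportion of events")
    plt.legend()
    plt.show()

# Hypothetical usage: plot_calibration_curve(y, p_hat)
```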
Fourth, strong calibration means that the predicted risk corresponds to the observed proportion for every possible combination of predictor values; this implies that calibration is perfect and is a utopic goal [4].

The commonly used Hosmer–Lemeshow test is often presented as a calibration test, though it has many drawbacks: it is based on artificially grouping patients into risk strata, gives a P value that is uninformative with respect to the type and extent of miscalibration, and suffers from low statistical power [1, 4]. Therefore, we recommend against using the Hosmer–Lemeshow test to assess calibration.

How to prevent or correct poor calibration?
When developing a predictive algorithm, the first step involves the control of statistical overfitting. It is important to prespecify the modeling strategy and to ensure that the sample size is sufficient for the number of considered predictors [30, 31]. In smaller datasets, procedures that aim to prevent overfitting should be considered, e.g., using penalized regression techniques such as Ridge or Lasso regression [32] or using simpler models. Simpler models can refer to fewer predictors, omitting nonlinear or interaction terms, or using a less flexible algorithm (e.g., logistic regression instead of random forests, or a priori limiting the number of hidden neurons in a neural network). However, using models that are too simple can backfire (Additional file 1), and penalization does not offer a miracle solution for uncertainty in small datasets [33]. Therefore, in small datasets, it may be reasonable not to develop a model at all. Additionally, internal validation procedures can quantify the calibration slope. At internal validation, calibration-in-the-large is irrelevant since the average of the predicted risks will match the event rate. In contrast, calibration-in-the-large is highly relevant at external validation, where we often note a mismatch between the predicted and observed risks.
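For illustration, the penalized regression techniques mentioned above are available in standard software. The sketch below is a non-authoritative example (assuming Python with scikit-learn, and hypothetical arrays `x_dev`, `y_dev`, `x_new`) that tunes a Ridge (L2) penalty by cross-validation; the penalty shrinks coefficients and thereby counteracts overly extreme predictions.

```python
from sklearn.linear_model import LogisticRegressionCV

# Ridge (L2) penalized logistic regression with the penalty weight tuned by
# 10-fold cross-validation on the log-loss; for a Lasso (L1) fit, use
# penalty="l1" together with solver="saga".
penalized_model = LogisticRegressionCV(
    Cs=20, cv=10, penalty="l2", scoring="neg_log_loss", max_iter=10_000
)

# Hypothetical usage on development data (x_dev, y_dev) and new patients x_new:
# penalized_model.fit(x_dev, y_dev)
# p_new = penalized_model.predict_proba(x_new)[:, 1]  # shrunken risk predictions
```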
When we find poorly calibrated predictions at validation, algorithm updating should be considered to provide more accurate predictions for new patients from the validation setting [1, 20]. Updating of regression-based algorithms may start with changing the intercept to correct calibration-in-the-large [34]. Full refitting of the algorithm, as in the case study below, will improve calibration if the validation sample is relatively large [35]. We present a detailed illustration of updating of the ROMA model in Additional file 1. Continuous updating strategies are also gaining in popularity; such strategies dynamically address shifts in the target population over time [36].
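A minimal sketch of such updating steps, assuming Python with NumPy and statsmodels and hypothetical arrays `y_val` (observed outcomes) and `lin_pred_val` (the original model's linear predictor, i.e., log-odds, for the validation patients): the intercept update keeps all original coefficients fixed by entering the linear predictor as an offset, while logistic recalibration (a commonly used intermediate step) re-estimates an intercept and a slope for the linear predictor; full refitting would re-estimate every coefficient, as in the case study below.

```python
import numpy as np
import statsmodels.api as sm

def update_intercept(y_val, lin_pred_val):
    """Intercept update: keep the original coefficients (as an offset) and
    re-estimate only an additive intercept correction in the validation setting."""
    lin_pred_val = np.asarray(lin_pred_val, dtype=float)
    fit = sm.GLM(y_val, np.ones_like(lin_pred_val),
                 family=sm.families.Binomial(), offset=lin_pred_val).fit()
    return fit.params[0]  # add this value to the original model intercept

def logistic_recalibration(y_val, lin_pred_val):
    """Re-estimate an intercept and a slope for the original linear predictor."""
    lin_pred_val = np.asarray(lin_pred_val, dtype=float)
    fit = sm.GLM(y_val, sm.add_constant(lin_pred_val),
                 family=sm.families.Binomial()).fit()
    return fit.params  # (new intercept, multiplier for the linear predictor)

# Hypothetical usage:
# delta = update_intercept(y_val, lin_pred_val)
# new_intercept, slope = logistic_recalibration(y_val, lin_pred_val)
# Full refitting would instead re-estimate all predictor coefficients on the
# validation data, which requires a relatively large validation sample [35].
```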
Published case study on the diagnosis of obstructive coronary artery disease
Consider a logistic regression model to predict obstructive coronary artery disease (oCAD) in patients with stable chest pain and without a medical history of oCAD [37]. The model was developed on data from 5677 patients recruited at 18 European and American centers, of whom 31% had oCAD. The algorithm was externally validated on data from 4888 patients in Innsbruck, Austria, of whom 44% had oCAD [38]. The algorithm had an AUC of 0.69. Calibration suggested a combination of overestimated (intercept −1.04) and overly extreme risk predictions (slope 0.63) (Fig. 2a). Calibration was improved by refitting the model, i.e., by re-estimating the predictor coefficients (Fig. 2b).

Conclusions
The key arguments of this paper are summarized in Table 1. Poorly calibrated predictive algorithms can be misleading, which may result in incorrect and potentially harmful clinical decisions. Therefore, we need prespecified modeling strategies that are reasonable with respect to the available sample size. When validating algorithms, it is imperative to evaluate calibration using appropriate measures and visualizations; this helps us to understand how the algorithm performs in a particular setting, where predictions may go wrong, and whether the algorithm can benefit from updating. Due to local healthcare systems and referral patterns, population differences between centers and regions are expected; it is likely that prediction models do not include all the predictors needed to accommodate these differences. Together with the phenomenon of population drift, models ideally require continued monitoring in local settings in order to maximize their benefit over time. This argument will become even more vital with the growing popularity of highly flexible algorithms. The ultimate aim is to optimize the utility of predictive analytics for shared decision-making and patient counseling.

Table 1 Summary points on calibration

Why calibration matters
- Decisions are often based on risk, so predicted risks should be reliable
- Poor calibration may make a prediction model clinically useless or even harmful

Causes of poor calibration
- Statistical overfitting and measurement error
- Heterogeneity in populations in terms of patient characteristics, disease incidence or prevalence, patient management, and treatment policies

Assessment of calibration in practice
- Perfect calibration, where predicted risks are correct for every covariate pattern, is utopic; we should not aim for that
Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s12916-019-1466-7.

Additional file 1. Detailed illustration of the assessment of calibration and model updating: the ROMA logistic regression model.

Acknowledgements
This work was developed as part of the international STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative. The objective of STRATOS is to provide accessible and accurate guidance in the design and analysis of observational studies (http://stratos-initiative.org/). Members of the STRATOS Topic Group 'Evaluating diagnostic tests and prediction models' are (alphabetically) Patrick Bossuyt, Gary S. Collins, Petra Macaskill, David J. McLernon, Karel G.M. Moons, Ewout W. Steyerberg, Ben Van Calster, Maarten van Smeden, and Andrew Vickers.

Authors' contributions
All authors conceived of the study. BVC drafted the manuscript. All authors reviewed and edited the manuscript and approved the final version.
Competing interests
The authors declare that they have no competing interests.

Author details
1 Department of Development and Regeneration, KU Leuven, Herestraat 49 box 805, 3000 Leuven, Belgium. 2 Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands. 3 Medical Statistics Team, Institute of Applied Health Sciences, School of Medicine, Medical Sciences and Nutrition, University of Aberdeen, Aberdeen, UK. 4 Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands. 5 Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Maastricht, Netherlands. 6 http://www.stratos-initiative.org.

Received: 24 July 2019 Accepted: 10 November 2019

References
1. Steyerberg EW. Clinical prediction models. New York: Springer; 2009.
2. Wessler BS, Paulus J, Lundquist CM, et al. Tufts PACE clinical predictive model registry: update 1990 through 2015. Diagn Progn Res. 2017;1:10.
3. Kleinrouweler CE, Cheong-See FM, Collins GS, et al. Prognostic models in obstetrics: available, but far from applicable. Am J Obstet Gynecol. 2016;214:79–90.
4. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167–76.
5. Collins GS, de Groot JA, Dutton S, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014;14:40.
6. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.
7. Bouwmeester W, Zuithoff NPA, Mallett S, et al. Reporting and methods in clinical prediction research: a systematic review. PLoS Med. 2012;9:1–12.
8. Van Calster B, Vickers AJ. Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Mak. 2015;35:162–9.
9. Van Hoorde K, Van Huffel S, Timmerman D, Bourne T, Van Calster B. A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform. 2015;54:283–93.
10. Van der Ploeg T, Nieboer D, Steyerberg EW. Modern modeling techniques had limited external validity in predicting mortality from traumatic brain injury. J Clin Epidemiol. 2016;78:83–9.
11. Shah ND, Steyerberg EW, Kent DM. Big data and predictive analytics: recalibrating expectations. JAMA. 2018;320:27–8.
12. Moons KG, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162:W1–W73.
13. Yala A, Lehman C, Schuster T, Portnoi T, Barzilay R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology. 2019;292:60–6.
14. Dhillon RK, McLernon DJ, Smith PP, et al. Predicting the chance of live birth for women undergoing IVF: a novel pretreatment counselling tool. Hum Reprod. 2016;31:84–92.
15. Collins GS, Altman DG. Predicting the 10 year risk of cardiovascular disease in the United Kingdom: independent and external validation of an updated version of QRISK2. BMJ. 2012;344:e4181.
16. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68:394–424.
17. Testa A, Kaijser J, Wynants L, et al. Strategies to diagnose ovarian cancer: new evidence from phase 3 of the multicentre international IOTA study. Br J Cancer. 2014;111:680–8.
18. Riley RD, Ensor J, Snell KI, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. 2016;353:i3140.
19. Steyerberg EW, Roobol MJ, Kattan MW, van der Kwast TH, de Koning HJ, Schröder FH. Prediction of indolent prostate cancer: validation and updating of a prognostic nomogram. J Urol. 2007;177:107–12.
20. Davis SE, Lasko TA, Chen G, Siew ED, Matheny ME. Calibration drift in regression and machine learning models for acute kidney injury. J Am Med Inform Assoc. 2017;24:1052–61.
21. Thai TN, Ebell MH. Prospective validation of the good outcome following attempted resuscitation (GO-FAR) score for in-hospital cardiac arrest prognosis. Resuscitation. 2019;140:2–8.
22. Leijdekkers JA, Eijkemans MJC, van Tilborg TC, et al. Predicting the cumulative chance of live birth over multiple complete cycles of in vitro fertilization: an external validation study. Hum Reprod. 2018;33:1684–95.
23. te Velde ER, Nieboer D, Lintsen AM, et al. Comparison of two models predicting IVF success; the effect of time trends on model performance. Hum Reprod. 2014;29:57–64.
24. Steyerberg EW, Uno H, Ioannidis JPA, Van Calster B. Poor performance of clinical prediction models: the harm of commonly applied methods. J Clin Epidemiol. 2018;98:133–43.
25. Murthy V, Rishi A, Gupta S, et al. Clinical impact of prostate specific antigen (PSA) inter-assay variability on management of prostate cancer. Clin Biochem. 2016;49:79–84.
26. Wynants L, Timmerman D, Bourne T, Van Huffel S, Van Calster B. Screening for data clustering in multicenter studies: the residual intraclass correlation. BMC Med Res Methodol. 2013;13:128.
27. Luijken K, Groenwold RHH, Van Calster B, Steyerberg EW, van Smeden M. Impact of predictor measurement heterogeneity across settings on performance of prediction models: a measurement error perspective. Stat Med. 2019;38:3444–59.
28. Moore RG, McMeekin DS, Brown AK, et al. A novel multiple marker bioassay utilizing HE4 and CA125 for the prediction of ovarian cancer in patients with a pelvic mass. Gynecol Oncol. 2009;112:40–6.
29. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014;33:517–35.
30. van Smeden M, Moons KGM, de Groot JA, et al. Sample size for binary logistic prediction models: beyond events per variable criteria. Stat Methods Med Res. 2019;28:2455–74.
31. Riley RD, Snell KIE, Ensor J, et al. Minimum sample size for developing a multivariable prediction model: part II - binary and time-to-event outcomes. Stat Med. 2019;38:1276–96.
32. Moons KGM, Donders AR, Steyerberg EW, Harrell FE. Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example. J Clin Epidemiol. 2004;57:1262–70.
33. Van Calster B, van Smeden M, Steyerberg EW. On the variability of regression shrinkage methods for clinical prediction models: simulation study on predictive performance. arXiv. 2019; https://arxiv.org/abs/1907.11493. Accessed 10 Oct 2019.
34. Steyerberg EW, Borsboom GJJM, van Houwelingen HC, Eijkemans MJC, Habbema JDF. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004;23:2567–86.
35. Su TL, Jaki T, Hickey GL, Buchan I, Sperrin M. A review of statistical updating methods for clinical prediction models. Stat Methods Med Res. 2018;27:185–97.
36. Hickey GL, Grant SW, Caiado C, et al. Dynamic prediction modeling approaches for cardiac surgery. Circ Cardiovasc Qual Outcomes. 2013;6:649–58.
37. Genders TSS, Steyerberg EW, Hunink MG, et al. Prediction model to estimate presence of coronary artery disease: retrospective pooled analysis of existing cohorts. BMJ. 2012;344:e3485.
38. Edlinger M, Wanitschek M, Dörler J, Ulmer H, Alber HF, Steyerberg EW. External validation and extension of a diagnostic model for obstructive coronary artery disease: a cross-sectional predictive evaluation in 4888 patients of the Austrian Coronary Artery disease Risk Determination In Innsbruck by diaGnostic ANgiography (CARDIIGAN) cohort. BMJ Open. 2017;7:e014467.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.