Machine Learning For Real-Time Heart Disease Prediction
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 25, NO. 9, SEPTEMBER 2021

Abstract—Heart-related anomalies are among the most common causes of death worldwide. Patients are often asymptomatic until a fatal event happens, and even when they are under observation, trained personnel are needed to identify a heart anomaly. In recent decades, there has been increasing evidence that Machine Learning can be leveraged to detect such anomalies, thanks to the availability of Electrocardiograms (ECG) in digital format. New developments in technology have made it possible to exploit such data to build models able to analyze the patterns in the occurrence of heart beats, and spot anomalies from them. In this work, we propose a novel methodology to extract ECG-related features and predict the type of ECG recorded in real time (less than 30 milliseconds). Our models leverage a collection of almost 40 thousand ECGs labeled by expert cardiologists across different hospitals and countries, and are able to detect 7 types of signals: Normal, AF, Tachycardia, Bradycardia, Arrhythmia, Other or Noisy. We exploit the XGBoost algorithm, a leading machine learning method, to train models achieving out of sample F1 Scores in the range 0.93 – 0.99. To our knowledge, this is the first work reporting high performance across hospitals, countries and recording standards.

Index Terms—Arrhythmia, boosting, ECG, machine learning.

Manuscript received November 17, 2020; revised February 11, 2021; accepted March 8, 2021. Date of publication March 17, 2021; date of current version September 3, 2021. (Corresponding author: Dimitris Bertsimas.)
Dimitris Bertsimas is with the Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139-4301 USA (e-mail: [email protected]).
Luca Mingardi is with the Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139-4301 USA (e-mail: [email protected]).
Bartolomeo Stellato is with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08540 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/JBHI.2021.3066347
2168-2194 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Vardhaman College of Engineering. Downloaded on October 29,2024 at 07:52:36 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

DESPITE the continuous development of medical practices, heart-related diseases are still the leading cause of death in the United States [13]. Atrial Fibrillation (AF) is among the most common: it affects 1-2% of the general population and causes hundreds of thousands of deaths every year, as it can lead to a stroke, heart failure or coronary artery disease [14].

Machine Learning (ML) techniques are becoming more and more accepted in the world of healthcare as a support to traditional ways of disease detection. In fact, algorithms can be leveraged to process a sizeable amount of data in a fast and accurate way, making it possible to get non-obvious insights directly from the observations.

One of the problems of AF detection is that it is often asymptomatic (it is incidentally identified in 30–45% of patients who had an electrocardiogram for unrelated reasons [19]) and trained personnel are required to spot the disease from electrocardiograms (ECG). Unfortunately, if AF is not promptly recognized and treated, it can lead to a fatal event, such as a stroke. Similarly, Tachycardia (excessively fast heart rate) and Bradycardia (excessively slow heart rate) are common heart diseases. Despite being less dangerous than AF, they can lead to serious complications, such as heart failure, if left untreated.

Given the severity and occurrence of heart diseases, procedures to analyze ECGs have been in place for a long time. A breakthrough algorithm for QRS complex detection was published in 1985 [27], starting the era of machine learning analysis to detect arrhythmia from ECGs, also thanks to the MIT-BIH database [24], made available since 1980 by Physionet. The availability of such data has inspired many works in the literature [26], [30], but some limitations of these works come from the unrealistic nature of their datasets: there are only Normal or AF ECGs (there can be other types of signal); only clean signals are considered; the sample size is small. More recent projects leverage Deep Neural Networks to extract non-linear features from the ECGs [6], [16], [29]. In the first project, sponsored by Apple for the AF detector in its Watch products, the authors work on a dataset of 400 thousand patients and achieve a Positive Predictive Value of 0.84, but they do not disclose the details of the model used. In the second project, the authors create an AF detector by analyzing 1 million 12-lead signals, each 10 seconds long, achieving an F1 Score of 0.45. In the third one, the authors build a Convolutional Neural Network to detect 12 heart abnormalities from 91 thousand 12-lead signals of various lengths, achieving an AUC of 0.97 and an F1 Score of 0.84. The data for these projects is not available to the public, thus complicating an objective comparison. In [33], the authors propose their analysis on a proprietary dataset, containing more than 40 thousand patients, and leverage Gradient Boosting Trees to achieve an F1 Score of 0.97. Finally, in [8] the authors develop a neural network framework to analyze a proprietary dataset containing 55 types of arrhythmias and more than 32 thousand patients, achieving an F1 Score of 0.86. The training datasets from the last two projects are available online.

In this work, we present a novel procedure to accurately detect heart diseases in real time from the analysis of short single-lead ECGs (9-61 seconds). We observe the characteristics of ventricular response and analyze the predictability of the inter-beat timing of the QRS complexes [5] (see Fig. 1) in the ECG to detect irregular patterns in the data. Specifically, we extract four groups of features that we use for the predictions: time and non-linear domain, distance-based and time series characteristics. The model is meant to be used as a fast detector for heart diseases, able to recognize various types of outcomes: Normal, AF, Tachycardia, Bradycardia, Arrhythmia, Other (label to indicate other types of heart anomaly) and Noisy (cannot be classified, and needs to be recorded again). If an anomaly is detected, the patient should be visited by a specialist to assess the stage of the disease and possible treatments.

We leverage the XGBoost algorithm [10], a leading machine learning method, to train and evaluate our models on three different datasets, achieving strong out of sample performance (F1 Score ≥ 0.94). Then, we test the performance of our models when used as predictors across datasets and we achieve similar results (F1 Score ≥ 0.93).

TABLE I
SUMMARY OF THE EXTRACTED FEATURES

Fig. 1. Visualization of a QRS complex in an ECG.

the other two are 10 seconds long and are recorded in hospital with professional machines at 500 Hz. For this reason, the first one comes with a single lead recording, while the other two have the usual 12 ECG leads. However, we believe that our model can be most useful when used to detect heart problems in real time, through the use of portable devices. Thus, in order to simulate a real time recording, we kept only lead II (the one containing the best QRS recordings) in the signals from the two Chinese datasets, and proceed with a unified pipeline for our analysis. These two datasets also provide demographic information (such as Age, Gender) of the patients, which is very valuable because heart diseases are often correlated with such characteristics.

III. PIPELINE

A. Signal Processing

While an ECG is recorded, there are a number of different factors that can impact the quality of the signal, such as the movement of the patient or the powerline noise coming from the electric component of the machinery. Thus, preprocessing the original recording is a necessary step to eliminate the noise in the data. We apply two filters: Butterworth highpass [7] (lowcut = 0.5 Hz) and band-pass [9] (cutoff = 0.05 Hz), and then we scale the signals to have zero mean and unit variance. These operations are performed through the Scikit-Learn [28] and SciPy [32] implementations in Python.
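As an illustration of the preprocessing described in Section III-A, the following is a minimal sketch of the high-pass filtering and standardization steps using SciPy and NumPy. The function name `preprocess_ecg`, the filter order and the use of zero-phase filtering are assumptions made for this example and are not stated in the text. The band-pass step is left out because the text gives only a single cutoff value for it, and the standardization is written directly in NumPy rather than through Scikit-Learn.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def preprocess_ecg(signal, fs, highpass_hz=0.5, order=4):
    """High-pass filter an ECG and scale it to zero mean and unit variance.

    fs is the sampling rate in Hz; order=4 is an assumed filter order.
    """
    x = np.asarray(signal, dtype=float)
    # Butterworth high-pass to attenuate slow baseline drift below highpass_hz.
    sos = butter(order, highpass_hz, btype="highpass", fs=fs, output="sos")
    # Forward-backward filtering (assumed here) avoids shifting peaks in time.
    filtered = sosfiltfilt(sos, x)
    # Standardize to zero mean and unit variance.
    return (filtered - filtered.mean()) / filtered.std()


# Synthetic example: a 5 Hz oscillation on top of a slow drift, sampled at 300 Hz.
fs = 300
t = np.arange(0, 10, 1 / fs)
raw = np.sin(2 * np.pi * 5 * t) + 2.0 + 0.5 * t
clean = preprocess_ecg(raw, fs)
```

The synthetic signal only stands in for a real recording; with actual data, `fs` would be 300 Hz or 500 Hz depending on the dataset.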
BERTSIMAS et al.: MACHINE LEARNING FOR REAL-TIME HEART DISEASE PREDICTION 3629
TABLE II
MODEL AND DATA SUMMARY
The area of inspection around the R peaks has been determined through extensive experiments to find the most accurate results. Specifically, let A, B and C be consecutive R peaks, α be the distance between A and B, and β be that between B and C: point P corresponds to the maximum value in segment B − 0.35α, point Q corresponds to the minimum value in segment B − 0.1α, point S corresponds to the minimum value in segment B + 0.1β and point T corresponds to the maximum value in segment B + 0.35β. Other methods such as moving averages and derivative analysis have been explored throughout the work, proving to be less accurate, much slower and thus less viable. Once the 5 points of the PQRST complex are found, we calculate the pair-wise vertical and horizontal distances between each combination. Then, we calculate the average, median, minimum, maximum and standard deviation for each of them, for a total of 109 features. For the fourth group, we leverage the TSFRESH Python package [11], and we extract 742 features related to the characteristics of an ECG as a time series. Despite increasing the computational complexity, the last group accounts for a significant improvement in performance (up to 7%), thanks to its focus on time series oriented features. As a final result, we obtain a dataset composed of 880 features for each signal.

C. Modeling Approach

The problem under consideration is multi-class classification, so that it is not possible to directly leverage the usual methods for binary classification. However, the XGBoost algorithm [10] allows the softmax function to be set as its objective:

$$ q_i(z) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}, \tag{1} $$

which outputs the probability that a given observation belongs to each of the labels z_i in the dataset in a vector format. Then, the classifier minimizes the cross-entropy loss between the real distribution of labels p and the estimated probabilities q:

$$ L(p, q) = -\sum_x p(x) \log q(x), \tag{2} $$

where x is the set of all the observations in the data, and p is the one-hot encoding of the true label associated to each observation.

We train 5 different models: one for each of the 3 available datasets, and 2 further models that exploit overlapping labels among datasets to assess the predictive performance across data sources. Specifically, we have overlapping labels between Tianchi and Chapman data regarding Normal, Tachycardia and Bradycardia, and between Physionet and Chapman data regarding Normal and AF (Tianchi data has too few AF observations). In the case of Tianchi and Chapman, we leverage the features available in the data, in addition to the 110 we extract with our procedure (e.g. Age, Gender, T-Offset, P-Offset etc.). Due to the different set of labels present across datasets, the two cross-dataset models are trained on the subset of overlapping labels in the training data and evaluated only on the observations having these labels in the testing data. Table II describes in detail the 5 models, listing the dataset specific features that are used in addition to the 110 we extract, and summarizing the patients, labels, training, testing, signal and sampling characteristics of each dataset.

D. Feature Selection

Inevitably, the features that we extract in Section III-B are highly collinear, thus it is extremely important to select a subset of them in order to improve the performance of each classifier and speed up the extraction process. We leverage the built-in feature importance method of the XGBoost algorithm, which ranks the features that have the highest explanatory power with respect to the outcome. Specifically, for each of the 5 tasks presented in Section III-C, we train an XGBoost model with default parameters. Then, for each of these models, we select the top 50 features identified by the algorithm. Among these 250 features, 140 are duplicates, leaving 110 features for the final model. Finally, we notice that the performance of the models does not improve when other features are added, and that 110 is the minimum number of features that avoids any drop in performance, a significant reduction in dimensionality from the initial set of 880 extracted features. After the final set of features is decided, we find the optimal set of parameters for each model according to the methods explained in Section III-E.

E. Methods

In our work, we leverage the XGBoost algorithm [10] to train our models and the Optuna optimization framework [4] to tune its parameters. Finally, we use SHAP [20]–[22] to explain which features have the most explanatory power for each label present in the training data.

1) XGBoost: XGBoost [10] is one of the most popular and best-performing tree-ensemble methods for binary classification. It is based on an iterative procedure that leverages a large number of trees. Its strength lies in the iterative correction procedure on
which it is based, so that new trees are added to correct the errors made by earlier trees, allowing the model to handle the harder cases better. There are many hyperparameters determining its performance, the most important of which are: depth of trees, number of trees and learning rate. The tuning of the parameters of the algorithm has a significant impact on its performance, and a proper procedure can be followed to avoid overfitting on the training data: there is usually a trade-off between the performance on the training set and that on an external testing set. Proper tuning makes it possible to achieve a strong performance on both datasets, if there is enough explanatory power in the variables of the dataset.

In this work, we tune seven parameters. The maximum depth of a tree determines the maximum number of nodes that can exist between the root node and the farthest leaf in the tree: it takes positive integer values, with large ones usually leading to overfitting. The number of estimators controls the number of trees to fit in the training: it takes positive integer values, with large ones usually leading to overfitting. The learning rate η controls the weighting factor for corrections by new trees: it takes values between 0 and 1, with values closer to 0 determining fewer corrections for each tree. The parameter γ determines the minimum loss reduction required to make a further partition on a leaf node of a tree: it takes positive values, with larger ones defining a more conservative model. The parameter λ is the L2 regularization on the weights of the features: it takes positive values, with the larger ones shrinking the weights, thus making the model more conservative. The parameter α is the L1 regularization on the weights of the features: it takes positive values, with the larger ones driving the weights to 0, defining a more conservative model. Minimum child weight is the minimum Hessian weight required to create a new node: it takes positive values, with higher ones making the model more conservative. All remaining parameters are set to their default values.

2) Optuna: Optuna [4] is a leading optimization framework leveraging the Tree-structured Parzen Estimator (TPE) to optimize an objective function over a defined parameter space. Each of the 5 models is trained independently following three steps. We choose the range (the same for all the models) of possible values for the seven parameters explained above, we define the objective function to maximize as the average 70-fold cross-validation F1 Score, and finally we leverage multiple cores to maximize the objective function over 500 iterations.

3) SHAP: SHAP [20]–[22] is a method to interpret Machine Learning models through a game theory approach. This method is helpful to dig further into how the final output is predicted, so that it highlights the most important features and explains how they drive the results in an understandable way.

The analysis has been performed using Python 3.7.5.

F. Performance Evaluation

In order to assess the performance of our procedure, for the Physionet data we exploit the external 300 ECGs provided as validation set, while for the other models we divide the data into a 70% training and a 30% testing set. Then, we report the Accuracy, Precision, Recall and F1 Score calculated on each test dataset. For example, in the case of Atrial Fibrillation, we calculate the metrics as:

$$ \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Number of Predictions}} \tag{3} $$

$$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{4} $$

$$ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{5} $$

$$ F1_a = \frac{2 A_a}{A + a} = \frac{2(\text{precision} \cdot \text{recall})}{\text{precision} + \text{recall}}, \tag{6} $$

where A_a is the number of correct AF predictions, A is the number of predicted AF and a is the number of true AF. The same calculation is performed for all the other classes, and we report both the arithmetic and weighted mean of the F1 Scores, for each model that we train. The weighted F1 Score is calculated as:

$$ \text{Weighted F1 Score} = \frac{\sum_i F1_i \cdot N_i}{\sum_i N_i} \tag{7} $$

with i = 1, ..., n, with n being the number of classes and N_i being the number of observations for class i. The accuracy is calculated as the number of correct predictions over the total number of predictions. In all three datasets there is only one recording per patient, thus it is not possible to have an unfair evaluation during the training-testing step.

G. Output Calibration

Confidence calibration is a particularly relevant problem in a healthcare setting like the one addressed in this manuscript: when a Machine Learning model makes a prediction, it is important that this output can be trusted. For example, if the model makes 100 label predictions with confidence of 0.9, 90 of them should be correct. While perfect calibration is not usually achievable in a real setting, there are metrics to quantify how reliable a model is. In our work, we follow the procedure highlighted in [15] to calibrate each model's output using Temperature Scaling (the details of which can be found in the paper), inspect the relationship between accuracy and confidence, and calculate the corresponding Expected Calibration Error (ECE). To estimate the expected accuracy, the predictions are grouped in M interval bins (10 in this case) and we calculate the accuracy of each bin B_m as:

$$ \text{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i) \tag{8} $$

where ŷ_i and y_i are the predicted and true class labels for sample i. The confidence of bin B_m is calculated as:

$$ \text{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i \tag{9} $$

where p̂_i is the confidence for sample i. A perfectly calibrated model would have acc(B_m) = conf(B_m) for all m ∈ {1, ..., M}. The Expected Calibration Error can be approximated by taking a weighted average of the difference between the accuracy and the confidence of each bin, with each bin weighted by the fraction of samples it contains [15].
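To make the calibration procedure of Section III-G concrete, the sketch below fits a temperature on held-out scores and computes the ECE from the bin accuracies and confidences of Eqs. (8) and (9). It is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the `logits` array stands in for pre-softmax model scores, and the temperature is chosen by a simple grid search over the negative log-likelihood instead of the optimizer used in [15].

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature minimizing the negative log-likelihood (grid search)."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 200)  # assumed search range
    labels = np.asarray(labels)
    rows = np.arange(len(labels))
    losses = [-np.mean(np.log(softmax(logits, t)[rows, labels] + 1e-12)) for t in grid]
    return float(grid[int(np.argmin(losses))])


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |acc(B_m) - conf(B_m)| over equal-width bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(confidences, inner_edges, right=True)
    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        if in_bin.any():
            acc = correct[in_bin].mean()        # Eq. (8)
            conf = confidences[in_bin].mean()   # Eq. (9)
            ece += in_bin.mean() * abs(acc - conf)
    return float(ece)


# Hypothetical over-confident scores: four samples, one confidently wrong.
logits = np.array([[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]])
labels = np.array([0, 1, 1, 2])
temperature = fit_temperature(logits, labels)
ece_before = expected_calibration_error(softmax(logits), labels)
ece_after = expected_calibration_error(softmax(logits, temperature), labels)
```

In this toy case the fitted temperature is larger than 1, which softens the probabilities and brings the average confidence close to the 75% accuracy, so the ECE drops after scaling.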
TABLE III
DELTA ECE CALIBRATION
TABLE IV
METRICS TABLE PHYSIONET
TABLE VI
METRICS TABLE CHAPMAN
TABLE V
CONFUSION MATRIX FOR THE MODEL TRAINED ON PHYSIONET DATA
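A confusion matrix such as the one in Table V is sufficient to recover the metrics defined in Eqs. (3) to (7). The following sketch shows that computation in NumPy. The function name `per_class_metrics` and the 3-class matrix are hypothetical and chosen only for illustration; they are not results from this work.

```python
import numpy as np


def per_class_metrics(confusion):
    """Accuracy, per-class precision/recall/F1 and averaged F1 from a confusion matrix.

    confusion[i, j] counts observations of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    predicted = confusion.sum(axis=0)  # A in Eq. (6): predictions per class
    actual = confusion.sum(axis=1)     # a in Eq. (6), also N_i in Eq. (7)
    precision = np.divide(true_pos, predicted,
                          out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual,
                       out=np.zeros_like(true_pos), where=actual > 0)
    denom = predicted + actual
    f1 = np.divide(2 * true_pos, denom,
                   out=np.zeros_like(true_pos), where=denom > 0)
    return {
        "accuracy": true_pos.sum() / confusion.sum(),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "macro_f1": f1.mean(),
        "weighted_f1": (f1 * actual).sum() / actual.sum(),
    }


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
toy = [[8, 2, 0],
       [1, 6, 1],
       [0, 0, 2]]
metrics = per_class_metrics(toy)
```

The weighted F1 differs from the arithmetic mean whenever the classes are imbalanced, which is why both are reported.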
TABLE XI
CONFUSION MATRIX FOR THE CHAPMAN-TIANCHI MODEL
Normal signal, and indeed this is an important factor to consider when analyzing an ECG. In the same fashion, the Ventricular Rate is the key feature to identify Bradycardia and Tachycardia. These two features are available directly in the Chapman data, but are highly correlated with some of the features extracted with our pipeline (e.g. Beat per Minute). Finally, MCVNN is the best feature to identify Atrial Fibrillation. It is calculated as the median absolute deviation of the horizontal R-R distances divided by the median of the absolute differences of successive horizontal R-R distances. The predictive power of these features comes from their ability to capture anomalies in successive heart beats.

Table VIII shows that the model trained on Tianchi has very accurate performance (Weighted Average F1 Score 0.99). Table IX shows that the model indeed makes almost no mistakes in its predictions. The errors associated with the Arrhythmia and AF classes come from the scarcity of such labels in the data. Moreover, given that AF is a form of Arrhythmia, the difference between the two is not clear in the data, so that it is reasonable that the model is missing some of them. Fig. 4 displays the most important features for this model. Beat per Minute is the feature that has the highest predictive power for the Normal and Tachycardia classes, which is related to the fact that Tachycardia is associated with an increased heart rate. Similarly, the median horizontal distance between successive R-R peaks is the best feature to identify Bradycardia, which is associated with a slow heart rate. Arrhythmia mostly relies on TINN, which is an approximation of the distribution of successive R-R intervals, for its identification. Finally, AF is identified through CVSD: the root mean square of successive differences in R-R intervals, divided by the mean of their lengths.

Table X shows that our model is able to achieve a very accurate performance also when evaluated on a dataset coming from a different source (Weighted Average F1 Score 0.99). In this case, we train on the Chapman data and evaluate on the Tianchi data. Table XI clearly shows that the model is reliably predicting across labels, with very few errors in the whole dataset. Fig. 5 displays the most important features for this model.
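The R-R based features named in this discussion can be computed from R-peak timestamps alone. The sketch below follows the verbal definitions given in the text and is not tied to any particular library. The function name `rr_features`, the unscaled median absolute deviation and the example timestamps are assumptions made for illustration.

```python
import numpy as np


def rr_features(r_peak_times_s):
    """Heart-rate features from R-peak timestamps given in seconds."""
    rr = np.diff(np.asarray(r_peak_times_s, dtype=float)) * 1000.0  # RR intervals (ms)
    diffs = np.diff(rr)                                             # successive differences
    rmssd = np.sqrt(np.mean(diffs ** 2))
    mad = np.median(np.abs(rr - np.median(rr)))  # unscaled median absolute deviation
    return {
        "BPM": 60000.0 / rr.mean(),              # beats per minute
        "MedianRR": float(np.median(rr)),        # median R-R distance (ms)
        "CVSD": rmssd / rr.mean(),               # RMSSD over mean RR
        "MCVNN": mad / np.median(np.abs(diffs)), # as described in the text
    }


# Hypothetical, slightly irregular rhythm (timestamps in seconds).
peaks = [0.0, 0.8, 1.7, 2.4, 3.4, 4.1]
features = rr_features(peaks)
```

A perfectly regular rhythm would make the successive differences zero, so MCVNN is only defined when there is some beat-to-beat variation.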
As one would expect, Beat per Minute and the median horizontal distance between successive R-R peaks are the most important features to identify the three labels under observation. In fact, the symptoms of Tachycardia and Bradycardia are increased and decreased heart rate, and these two features are a great proxy for it.

Finally, Table XII shows that the features that we extract are general enough to train a model that achieves high accuracy even when the training and testing datasets have very different characteristics (Weighted Average F1 Score 0.93). In fact, the Physionet data is recorded from a wearable device at 300 Hz and describes the characteristics of American people, while the Chapman data is recorded in a professional setting at 500 Hz (twelve leads are recorded but we only analyze lead II, as explained in Section II) and comes from the Chinese population. In this case, we train on the Physionet data and evaluate on the Chapman data. Table XIII shows balance in the errors between the two classes. Fig. 6 displays the most important features for this model. Similarly to the model trained only on the Physionet data, the proportion of successive R-R interval differences larger than 50 milliseconds (pNN50) is the feature that has the highest explanatory power for the AF label.

The variance in performance across datasets (0.93-0.99) could be interpreted as confusing and potentially harmful in a real

V. REAL-TIME ANALYSIS

The ultimate goal of our work is not to substitute the precious role of specialized professionals, but to provide an aid to them, accurately screening people with possible heart conditions who can be directed to such experts for deeper analysis. Thus, a key role in the viability and usefulness of our tool is its time complexity, meaning how long it takes to make a prediction when a new ECG is recorded. Table XIV summarizes the time required (in milliseconds) by our models to complete the four main steps of the real-time evaluation: pre-processing (Step 1), extraction of features from groups 1, 2 and 3 (Step 2), extraction of TSFRESH features (Step 3), model prediction (Step 4). We also provide the 95% Confidence Intervals for each measurement. The Physionet dataset is composed of signals of different length, thus we divide these signals in three groups (less than 20 s, between 20 and 40 s and more than 40 s) in order to have a deeper understanding of how fast the models are with longer recordings. The time complexity of the cross-dataset models is not present in the table because their features come directly from the three datasets at the core of our work. Table XIV shows that once a new ECG is available, our model is able to clean it, extract the required features and make a prediction in less than 30 milliseconds for any of the signals present in the data that we analyze (the longest is 61 seconds). This experiment produces sound evidence that our method can be used in a real-time setting. For example, we can comfortably deploy it on a wearable device displaying ECG-based predictions every 50 milliseconds, i.e., 20 times per second.
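A per-step latency measurement with a 95% confidence interval, of the kind summarized in Table XIV, could be obtained along the following lines. This is only a sketch: the helper `time_step`, the number of repetitions and the normal-approximation interval are assumptions, since the text does not state how the intervals were computed, and `dummy_step` is a hypothetical stand-in for one of the four pipeline steps.

```python
import statistics
import time


def time_step(step, payload, repeats=30):
    """Return the mean latency in ms and a 95% CI half-width (normal approximation)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        step(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half_width


def dummy_step(x):
    """Hypothetical stand-in for a pipeline step such as feature extraction."""
    return sum(v * v for v in x)


mean_ms, ci_ms = time_step(dummy_step, list(range(1000)))
```

Timing each of the four steps separately and summing the means gives the end-to-end latency that is compared against the 30 millisecond budget.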
TABLE XIV
TIME COMPLEXITY ANALYSIS
implementation [11]. Below we report the complete list of 110 features that are used by the 5 models for their predictions.

A. Time Domain

We select 12 of the initial 13 features for our final model:
• CVNN, The standard deviation of the RR intervals divided by the mean of the RR intervals
• CVSD, The root mean square of successive differences (RMSSD) divided by the mean of the RR intervals
• MCVNN, The median absolute deviation of the RR intervals divided by the median of the absolute differences of their successive differences
• MadNN, The median absolute deviation of the RR intervals
• MeanNN, The mean of the RR intervals
• MedianNN, The median of the absolute values of the successive differences between RR intervals
• RMSSD, The square root of the mean of the squared successive differences between adjacent RR intervals
• SDNN, The standard deviation of the RR intervals
• SDSD, The standard deviation of the successive differences between RR intervals
• TINN, An approximation of the RR interval distribution
• pNN20, The proportion of successive RR interval differences greater than 20 ms, out of the total number of RR intervals
• pNN50, The proportion of successive RR interval differences greater than 50 ms, out of the total number of RR intervals

B. Nonlinear Domain

We select all the initial 7 features for our final model:
• CSI, The Cardiac Sympathetic Index [31]
• CSI Modified, The modified CSI [17]
• CVI, The Cardiac Vagal Index [31]
• SD1, An index of short-term RR interval fluctuations
• SD2, An index of long-term RR interval fluctuations
• SD2SD1, Ratio between short and long term fluctuations of the RR intervals
• SampEn, The sample entropy measure of Heart Rate Variability

C. Distance Based

We select 40 of the initial 109 features for our final model:
• BPM, Beats Per Minute
• IBI, Inter Beat Interval
• Average difference between subsequent R peaks
• Average squared difference between subsequent R peaks
• Average height of R peak
• Median difference between subsequent R peaks
• Median squared difference between subsequent R peaks
• Median height of R peak
• Average, Median, Standard Deviation and Minimum of the horizontal distance between P and Q
• Average, Median and Standard Deviation of the horizontal distance between P and R
• Average, Median and Standard Deviation of the horizontal distance between P and S
• Average, Median and Minimum of the horizontal distance between P and T
• Standard Deviation and Minimum of the horizontal distance between Q and R
• Average of the horizontal distance between Q and T
• Standard Deviation of the horizontal distance between R and S
• Average of the horizontal distance between R and T
• Average, Median and Standard Deviation of the vertical distance between P and R
• Median of the vertical distance between P and T
• Average and Median of the vertical distance between Q and R
• Average of the vertical distance between Q and S
• Average, Median and Minimum of the vertical distance between R and S
• Average, Median, Standard Deviation and Minimum of the vertical distance between R and T

D. Time Series Characteristics

We select 51 of the initial 742 features for our model:
• Abs energy, The absolute energy of the time series
• Agg autocorrelation, Calculates the aggregated variance of the signal
• Agg linear trend (4 sets of parameters), Calculates a linear least-squares regression over different aggregated values of the time series
• Augmented dickey fuller, A hypothesis test which checks whether a unit root is present
• Autocorrelation (2 sets of parameters), Calculates the autocorrelation of the signal
• Change quantiles (13 sets of parameters), Fixes a corridor given by the quantiles ql and qh of the distribution of x
• Cid ce (2 sets of parameters), An estimate for a time series complexity
• Energy ratio by chunks (2 sets of parameters), Sum of squares of chunks over the whole series
• Fft aggregated, The spectral centroid (mean) of the absolute Fourier transform spectrum
• Fft coefficient (3 sets of parameters), Calculates the Fourier coefficients of the one-dimensional discrete Fourier Transform for real inputs
• Friedrich coefficients, Coefficients of a polynomial which has been fitted to the deterministic dynamics of a Langevin model
• Index mass quantile, Calculates the relative index i where q% of the mass of the time series x lies to the left of i
• Kurtosis, The kurtosis of the series, calculated with the adjusted Fisher-Pearson standardized moment coefficient G2
• Large standard deviation, Boolean variable denoting if the standard deviation of x is higher than a threshold
• Linear trend attr, Calculates a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one
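To give a sense of what a few of the listed time series characteristics measure, the following are plain NumPy versions written from the one-line descriptions above. They are illustrative re-implementations under stated assumptions: TSFRESH [11] remains the reference, its exact parameterization may differ, and the function signatures here are chosen for this sketch only.

```python
import numpy as np


def abs_energy(x):
    """Absolute energy: the sum of squared values of the series."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))


def cid_ce(x, normalize=False):
    """Complexity estimate: root of the summed squared successive differences."""
    x = np.asarray(x, dtype=float)
    if normalize:
        std = x.std()
        if std == 0:
            return 0.0
        x = (x - x.mean()) / std
    return float(np.sqrt(np.sum(np.diff(x) ** 2)))


def energy_ratio_by_chunks(x, num_segments, segment_focus):
    """Sum of squares of one chunk divided by the sum of squares of the whole series."""
    x = np.asarray(x, dtype=float)
    chunks = np.array_split(x, num_segments)
    return float(np.sum(chunks[segment_focus] ** 2) / np.sum(x ** 2))


# Small hypothetical series, used only to exercise the functions.
series = [1.0, 2.0, 3.0, 4.0]
energy = abs_energy(series)
complexity = cid_ce(series)
first_half_ratio = energy_ratio_by_chunks(series, num_segments=2, segment_focus=0)
```

The "sets of parameters" mentioned in the list correspond to calling such functions with different arguments, for example different chunk indices in `energy_ratio_by_chunks`.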
Reliability Diagrams
Fig. 9. Reliability Diagram Tianchi.
In this section we present the reliability plots for the calibrated output of each of the five models we train. The reliability plots are computed with the Python implementation [18] of the methodology presented in [15].
A perfectly calibrated model would have a reliability plot that coincides exactly with the 45-degree line, where accuracy and confidence are equal. In the plots, a dark red gap indicates that the confidence of the model is lower than its accuracy, meaning that the model is under-confident in that bin, while a light red gap indicates the opposite, meaning that it is over-confident. For each model we display a histogram summarizing how many predictions fall into each confidence level, across 10 bins in the 0-1 probability interval, together with the corresponding reliability diagram.
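The binning behind a reliability diagram, and the Expected Calibration Error (ECE) reported below, can be sketched as follows. This is a plain NumPy illustration of the standard definition from [15], where ECE is the sample-weighted average of |accuracy - confidence| over equal-width bins; it is not the implementation in [18]. It assumes that the confidence of a prediction is the probability assigned to the predicted class, and the function and variable names are hypothetical.

```python
import numpy as np


def reliability_bins(confidences, correct, n_bins=10):
    """Per-bin accuracy/confidence and the resulting ECE.

    confidences : probability assigned to the predicted class, in [0, 1]
    correct     : 1 if the prediction was right, 0 otherwise
    Returns (rows, ece); each row is (low, high, count, accuracy, confidence).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only: values below the first edge map to bin 0 and
    # values at or above the last interior edge map to bin n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1])

    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        count = int(mask.sum())
        if count == 0:
            rows.append((edges[b], edges[b + 1], 0, np.nan, np.nan))
            continue
        acc = correct[mask].mean()
        conf = confidences[mask].mean()
        # acc > conf: under-confident bin; acc < conf: over-confident bin.
        ece += count / n * abs(acc - conf)
        rows.append((edges[b], edges[b + 1], count, acc, conf))
    return rows, ece
```

Plotting the per-bin accuracy against the bin confidence, next to the diagonal, gives the reliability diagram; the per-bin counts give the accompanying histogram.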
Fig. 7 presents the reliability plot of the Physionet model, having ECE = 0.035. The model is slightly under-confident in the 0.6-0.9 interval and is over-confident in the 0.5-0.6 one. Again, this is due to the small size of the test set of this model (300 samples).
Fig. 10. Reliability Diagram Chapman-Tianchi.
Fig. 8 presents the reliability plot of the Chapman model,
having ECE = 0.006. This model is almost perfectly calibrated,
and is only slightly over-confident in the 0.4-0.6 and 0.7-0.9
intervals.
Fig. 9 presents the reliability plot of the Tianchi model, having ECE = 0.001. This model is also almost perfectly calibrated and is slightly under-confident in the 0.3-0.4 interval.
Fig. 10 presents the reliability plot of the model trained on Chapman and evaluated on Tianchi, having ECE = 0.0008. This model has the lowest calibration error of the five, and its predicted confidence matches its accuracy almost exactly.
Fig. 11. Reliability Diagram Physionet-Chapman.
Fig. 11 presents the reliability plot of the model trained on Physionet and evaluated on Chapman, having ECE = 0.02.
Despite being the most challenging model to train, as the training and testing datasets come from different countries and hospitals and have different recording standards, this model is still remarkably well calibrated.

REFERENCES
[1] Zheng, “ChapmanECG,” Figshare, Chapman University and Shaoxing People’s Hospital, 2019, Accessed: Mar. 23, 2021. [Online]. Available: https://figshare.com/collections/ChapmanECG/4560497/1
[2] “TianChi-ECG-abnormal-event-prediction,” GitHub, 2019. [Online]. Available: https://github.com/NingAnMe/TianChi-ECG-abnormal-event-prediction. Tianchi Hefei High-Tech Cup Ecg Human-Machine Intelligence Competition, 2019. Accessed: Mar. 23, 2021. [Online]. Available: http://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/231754/round2/hf_round2_train.zip
[3] Alivecor, Inc., Alivecor.com, 2020, Accessed: Mar. 23, 2021. [Online]. Available: https://www.alivecor.com/#
[4] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 2623–2631.
[5] Z. D. G. Ary, L. Goldberger, and A. Shvilkin, “QRS Complex - an overview | ScienceDirect Topics,” Sciencedirect.com, Goldberger’s clinical electrocardiography, 2017, Accessed: Mar. 23, 2021. [Online]. Available: https://www.sciencedirect.com/topics/medicine-and-dentistry/qrs-complex
[6] Z. I. Attia et al., “An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: A retrospective analysis of outcome prediction,” Lancet, vol. 394, no. 10201, pp. 861–867, 2019.
[7] S. Butterworth et al., “On the theory of filter amplifiers,” Wireless Eng., vol. 7, no. 6, pp. 536–541, 1930.
[8] J. Cai, W. Sun, J. Guan, and I. You, “Multi-ECGNET for ECG arrythmia multi-label classification,” IEEE Access, vol. 8, pp. 110848–110858, 2020.
[9] G. A. Campbell, “Physical theory of the electric wave-filter, Electric wave-filter,” The Bell System Technical Journal, nov 1922, U.S. Patent, vol. 1, no. 2, pp. 1–32, Nov. 1922. doi: 10.1002/j.1538-7305.1922.tb00386.x.
[10] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2016, pp. 785–794.
[11] M. Christ, N. Braun, J. Neuffer, and A. W. Kempa-Liehr, “Time series feature extraction on basis of scalable hypothesis tests (TSFRESH-a python package),” Neurocomputing, vol. 307, pp. 72–77, 2018.
[12] G. D. Clifford et al., “AF classification from a short single lead ECG recording: The physionet/computing in cardiology challenge 2017,” in Proc. Comput. Cardiol., 2017, pp. 1–4.
[13] Centers for Disease Control and Prevention, “FastStats,” Deaths and mortality, Cdc.gov, May 2017, Accessed: Mar. 23, 2021. [Online]. Available: https://www.cdc.gov/nchs/fastats/deaths.htm
[14] Centers for Disease Control and Prevention, “Atrial fibrillation | cdc.gov,” Centers for Disease Control and Prevention, May 2020, Accessed: Mar. 23, 2021. [Online]. Available: https://www.cdc.gov/heartdisease/atrial_fibrillation.htm
[15] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 1321–1330.
[16] A. Y. Hannun et al., “Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network,” Nature Med., vol. 25, no. 1, pp. 65–69, 2019.
[17] J. Jeppesen, S. Beniczky, P. Johansen, P. Sidenius, and A. Fuglsang-Frederiksen, “Using lorenz plot and cardiac sympathetic index of heart rate variability for detecting seizures for patients with epilepsy,” in Proc. 36th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 2014, pp. 4563–4566.
[18] F. Küppers, J. Kronenberger, A. Shantia, and A. Haselhoff, “Multivariate confidence calibration for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2020, pp. 326–327.
[19] G. Y. Lip, P. Kakar, and T. Watson, “Atrial fibrillation-the growing epidemic,” Heart, vol. 93, no. 5, pp. 606–612, 2007.
[20] S. M. Lundberg et al., “From local explanations to global understanding with explainable AI for trees,” Nature Mach. Intell., vol. 2, no. 1, pp. 2522–5839, 2020.
[21] S. M. Lundberg et al., Eds., Advances in Neural Information Processing Systems vol. 30. Curran Assoc., Inc., 2017, pp. 4765–4774.
[22] S. M. Lundberg et al., “Explainable machine-learning predictions for the prevention of hypoxaemia during surgery,” Nature Biomed. Eng., vol. 2, no. 10, pp. 749, 2018.
[23] D. Makowski, “Neurokit: A python toolbox for statistics and neurophysiological signal processing (EEG EDA ECG EMG...),” Memory Cogn. Lab’Day, vol. 1, 2016.
[24] G. B. Moody and R. G. Mark, “The impact of the MIT-BIH arrhythmia database,” IEEE Eng. Med. Biol. Mag., vol. 20, no. 3, pp. 45–50, May/Jun. 2001.
[25] M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using Bayesian binning,” in Proc. AAAI Conf. Artif. Intell., vol. 29, 2015, pp. 2901–2907.
[26] S. L. Oh, E. Y. Ng, R. S. Tan, and U. R. Acharya, “Automated diagnosis of arrhythmia using combination of CNN and LSTM techniques with variable length heart beats,” Comput. Biol. Med., vol. 102, pp. 278–287, 2018.
[27] J. Pan and W. J. Tompkins, “A real-time QRS detection algorithm,” IEEE Trans. Biomed. Eng., vol. BME-32, no. 3, pp. 230–236, Mar. 1985.
[28] F. Pedregosa et al., “Scikit-learn: Machine learning in python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[29] M. V. Perez et al., “Large-scale assessment of a smartwatch to identify atrial fibrillation,” New Engl. J. Med., vol. 381, no. 20, pp. 1909–1917, 2019.
[30] L. Sathyapriya, L. Murali, and T. Manigandan, “Analysis and detection r-peak detection using modified pan-tompkins algorithm,” in Proc. IEEE Int. Conf. Adv. Commun., Control Comput. Technol., 2014, pp. 483–487.
[31] M. Toichi, T. Sugiura, T. Murai, and A. Sengoku, “A new method of assessing cardiac autonomic function and its comparison with spectral analysis and coefficient of variation of R-R interval,” J. Autonomic Nervous Syst., vol. 62, no. 1–2, pp. 79–84, 1997.
[32] P. Virtanen et al., “SciPy 1.0: Fundamental algorithms for scientific computing in python,” Nature Methods, vol. 17, pp. 261–272, 2020.
[33] J. Zheng et al., “Optimal multi-stage arrhythmia classification approach,” Sci. Rep., vol. 10, no. 1, pp. 1–17, 2020.