ref 27

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

Biomedical Signal Processing and Control 93 (2024) 106211

Contents lists available at ScienceDirect

Biomedical Signal Processing and Control


journal homepage: www.elsevier.com/locate/bspc

CAT-Net: Convolution, attention, and transformer based network for


single-lead ECG arrhythmia classification
Md Rabiul Islam a ,∗, Marwa Qaraqe b , Khalid Qaraqe c , Erchin Serpedin a
a
Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
b
Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Qatar
c
Electrical and Computer Engineering, Texas A&M University at Qatar, Doha, Qatar

ARTICLE INFO ABSTRACT

Keywords: Machine learning technologies have been applied extensively in the last decade to automatically detect
ECG and analyze various forms of arrhythmia from electrocardiogram (ECG) signals. Existing deep learning-
CNN based models focus on enhancing classification performance by exploring spatial–temporal ECG features or
Transformer
by implementing multi-modal and ensemble classifiers. Such approaches perform well but do not provide
Arrhythmia
comprehensive accessibility for real-life applications due to the multi-lead ECG requirement. To address the
Attention
Deep learning
issue, a single-lead ECG based network is considered. The proposed convolution, attention, and transformer-
based network (CAT-Net) exhibits promising performance on arrhythmia classification by adeptly capturing
local and global heartbeats’ morphological characteristics. Along with the local information captured by the
convolution layer, the contextual ECG information is extracted by the multi-head attention layer present in
the transformer encoder. In addition, most of the existing models suffer from lower predictive performance in
minority-class arrhythmias due to the highly imbalanced ECG data. To enhance the predictive performance in
minority classes, three balancing techniques – SMOTE-ENN, SMOTE-Tomek, and ADASYN – are systematically
evaluated, and SMOTE-Tomek is ultimately integrated. To mitigate potential dataset bias, CAT-Net was assessed
across two distinct datasets: MIT-BIH and INCART, respectively, and was shown to achieve state-of-the-art
performance. CAT-Net establishes new benchmarks, achieving 99.14% overall accuracy and 94.69% macro F1
score on 5-class arrhythmia classification in the MIT-BIH dataset and 99.58% accuracy with 96.15% macro F1
score for 3-class classification in the INCART dataset.

1. Introduction arrhythmia detection is burdensome and challenging. A case study


involving 457 general practitioners (GPs) to assess the presence of
Arrhythmia is a common cardiovascular disorder characterized by atrial fibrillation (AF) reached up to 92.5% sensitivity and 89.8%
irregular heartbeats and manifesting in particular as slow or fast heart specificity [2]. A well-designed machine learning (ML)-based approach
rates relative to a normal heart rhythm. Cardiovascular diseases (CVDs)
can alleviate the manual task in arrhythmia detection and increase
are responsible for approximately 32% of global mortality rate leading
detection accuracy. This fact prompted renewed interest in exploring
to an annual death toll of 17.9 million individuals worldwide [1]. As a
ML solutions.
result, the early detection and classification of arrhythmias are crucial
in providing early treatment and preventing adverse complications. Early ML-based approaches such as decision tree (DT) [3], support
Although arrhythmia can be detected through many means such as vector machine (SVM) [4], random forest (RF) [5], AdaBoost [6], and
ECG, echocardiogram, cardiac catheterization, electrophysiology study extreme gradient boosting [7] have been utilized for ECG arrhythmia
(EPS), and stress test, ECG is the most conventional method used to classification. In addition, a large variety of approaches including but
detect and diagnose arrhythmia due to its simplicity and effectiveness. not limited to artificial neural networks (ANNs) [8], probabilistic neu-
Typically, a cardiologist needs to review the ECG report and the ral networks (PNNs) [9], frequency analysis [10], path forests [11],
related information to diagnose the type of arrhythmia. However, the classification and regression trees (CARTs) [12], hidden Markov models
huge volume of work calls for machines to assist and even substitute
(HMMs) [13], and mixture of the expert methods [14] were proposed.
cardiologists. Besides higher prediction accuracy, machines may help to
In [15], ECG entropy-based features were ranked using analysis of
reduce differences in cardiologists’ interpretation. By visual assessment,

∗ Corresponding author.
E-mail address: [email protected] (M.R. Islam).

https://doi.org/10.1016/j.bspc.2024.106211
Received 30 September 2023; Received in revised form 28 November 2023; Accepted 9 March 2024
Available online 13 March 2024
1746-8094/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

variance (ANOVA) and the selected features were fed to the k-nearest and Seq2Seq-based classifier equipped with an attention mechanism
neighbors (k-NN) and DT classifiers to detect four types of ECG beats: and exhibits good performance with an overall accuracy of 99.28% and
atrial fibrillation (A-Fib), normal sinus rhythm (NSR), ventricular fib- macro F1 score of 95.70% for 5-class arrhythmia classification [32].
rillation (V-Fib), and atrial flutter (AFL), respectively. The extremely However, the DSCSSA model exploits a 2-lead MIT-BIH ECG dataset.
unbalanced data set lowered the model’s performance and a maximum Recently, transformer-based models have become prominent in
accuracy of 96.3% for DT was achieved. physiological signal processing. While CNN-LSTM models effectively
With the development of deep learning (DL) methods, researchers capture both local and global features within ECG data, it is impor-
have automated the modeling process and eliminated the manual fea- tant to acknowledge that RNNs are confined to sequential computa-
ture extraction step [16–19]. Deep learning models can be categorized tions. Motivated by the promising performance of transformers in NLP,
into two distinct groups: (i) pure models, and (ii) hybrid models. researchers are currently aiming to develop pure transformer-based
Pure models include pure convolutional neural network (CNN)-based models as well as hybrid variants to exploit the parallel computing
or pure transformer-based models while the hybrid models comprise features of transformers. Hu et al. [33] proposed an ECG detection
CNN-LSTM-based models and CNN-transformer-based models. Often, transformer (DETR) for continuous ECG segments to classify 8, 4, and
the aforementioned models are updated with various types of attention 2 distinct types of arrhythmia. Yan et al. [34] integrated manually
mechanisms. constructed features with transformer encoder-generated features to
CNN is the simplest but most prominent DL-model [20] used for ar- enhance the predictive performance for four distinct categories of
rhythmia classification from ECG signals. Kiranyaz et al. [16] presented ECG heartbeats. To reduce the number of parameters in the trans-
a one-dimensional (1D) CNN model using patient-specific long-term former model, Meng et al. [35] replaced self-attention of the Fussing
ECG data to classify ventricular ectopic beat (VEB) and supraventricular Transformer with the LightConv Attention (LCA) and obtained 99.32%
beat (SVEB) with 99% accuracy. In [21], a CNN model with 34 layers accuracy. Le utilized 1D ECG, 2D ECG spectrogram, and age and gender
was employed to classify 14 types of arrhythmia with an 80% accuracy. metadata to capture time series, visual-temporal, and metadata-based
However, this model’s performance was lower than that of visual information, respectively. They presented a multi-mode RCNN with
inspection by a cardiologist and it did not address ventricular flutter a transformer encoder to categorize arrhythmias by fusing features.
and fibrillation that may induce cardiac arrest. The model is overly complex, with an accuracy of only 98.29%. Che
Among the hybrid models, networks that combine CNN and recur- et al. [36] evaluated CNN-Transformer-based hybrid models against
rent neural network (RNN) architectures, such as LSTM, BiLSTM, and CNN and CNN–BiLSTM models with and without attention, and found
gated recurrent units (GRUs), have shown improved performance. For that the CNN–Transformer model performed best. They improved effi-
instance, Tan et al. [22] proposed a CNN and LSTM-based prototype ciency by adding a link constraint to the loss function; however, the
to predict coronary artery disease (CAD) using ECG signals with an model required 12 leads ECG data, making it difficult for practical
accuracy of 99.85%. Two convolutional and three LSTM layers were implementation.
utilized to differentiate between normal and CAD classes. The model With the advancement of technology, the development of wearable
performed well but could not distinguish between different arrhythmia single-lead ECG devices is transforming continuous ECG recordings,
types, making it less applicable in clinics. To investigate the effect with the ability to record and transmit ECG signals to smartphones,
of segmentation length on performance, Oh et al. [23] presented a computers, and IoT systems for analysis. AliveCor KardiaMobile EKG,
KardiaMobile® Card, EMAY Portable ECG Monitor, and Eko DUO ECG
CNN-LSTM model that can identify NSR, paced beat (PB), ventricular
are a few examples. The analyzing models related to these devices can
premature contraction (VPC), right bundle branch block (RBBB), and
detect an abnormal cardiac rhythm but face difficulties in classifying
left bundle branch block (LBBB). This model showed 98.1% accuracy,
various arrhythmias. Consequently, a model with higher accuracy and
97.5% sensitivity, and 98.7% specificity.
compatibility with single-lead ECG devices is a fundamental prereq-
The success of attention modules in computer vision inspired re-
uisite for real-life applications. Limited work has been reported on
searchers to implement attention-based mechanisms for ECG signal
single-lead ECG based arrhythmia detection. The model of [37] showed
analysis [24–26]. Yao et al. proposed first a CNN-LSTM-based
only 87.8% accuracy, while the model proposed in [38] achieved
model [24] and then upgraded it with an attention mechanism [27]
accuracy levels of 93.63% and 95.87% for SVEB and VEB classes,
to achieve better performance. Zhang et al. [28] proposed a spatial–
respectively.
temporal attention-based model to classify 9 types of arrhythmia using
To enable adoption in practice of heart arrhythmia detection and
CNN and a bidirectional GRU-based model to reach an average F1
classification applications via single-lead ECG devices and smart de-
score of 83.5%. Since 12 channels of ECG data were employed, this
vices, we propose a CNN-Transformer-based hybrid network (CAT-Net)
approach is not fit for mobile single-sensor based applications. Ullah
for the detection and classification of heart arrhythmia. CNN computes
et al. [29] proposed CNN, CNN-LSTM, and CNN-LSTM with atten-
ECG local features, channel attention extracts the most relevant portion,
tion modules and achieved 99.12%, 99.3% and 99.29% accuracy,
and a transformer contextualizes the ECG feature map in latent space.
respectively. However, this study ignored class balancing, and the
The incorporation of channel attention after CNN and transformer on
pre-processing approach of generating model input data from the
the ECG feature map enabled to focus on local, significant, and global
Massachusetts Institute of Technology-Beth Israel Hospital (MIT-BIH)
contextualization, making our model superior to the existing state-of-
dataset was not specified. In [30], an interpretable dual-level atten-
the-art models. The proposed model is shown to classify five categories
tional convolutional long short-term memory neural network (DLA-
of arrhythmia with 99.14% accuracy and 94.69% macro F1 score. In
CLSTM) was proposed for multilabel ECG signal classification. The addition, the proposed model outperforms existing single-lead ECG-
model showed 88.76% and 60.14% accuracy for the MIT-BIH ar- based models in the literature. The model presents significant impact on
rhythmia database (MITDB) and the 1st China Physiological Signal analyzing the presence of five types of arrhythmia in ECG recordings.
Challenge (CSCP) dataset, respectively. However, it ignored class bal- Fig. 1 illustrates the high-level system architecture of the proposed
ancing, which resulted in a reduced F1 score of 25.81% for atrial single-lead arrhythmia detection model. The main novel aspects of this
premature (AP) class. Prabhakararao and Dandapat [31] proposed work are next summarized.
inter-lead and intra-lead attention mechanisms and achieved 84.5%
and 88.3% accuracy for the Physikalisch-Technische Bundesanstalt’s (1) Novel Model: We present a novel CAT-Net model that employs
(PTB) 12 lead PTB-XL dataset and CinC-training2017 single-lead ECG convolution and attention for capturing local insights and a
dataset, respectively. The model was implemented using a multi-scale transformer encoder for modeling global heartbeat dependen-
CNN equipped with attention. However, the model neglected the global cies. The transformer encoder contributes to contextualizing ECG
ECG characteristics. Recently proposed DSCSSA framework is a CNN feature maps, enabling parallel processing unlike LSTM.

2
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Fig. 1. Overview of the proposed processing pipeline.

(2) Dataset Invariance: The proposed CAT-Net model is evaluated three distinct components: convolution, attention, and transformer.
with two different datasets and it generated state-of-the-art re- This section also presents Python implementation details.
sults for both datasets. Therefore, the model is generalized and
dataset invariant.
2.1. Problem formulation
(3) Focusing Minority Class: To emphasize the minority class’s pre-
dictive performance, 3 class balancing approaches: Synthetic Mi-
nority Oversampling Technique Edited Nearest Neighbor The objective of this study is to maximize the predictive perfor-
(SMOTE-ENN), SMOTE-Tomek, and Adaptive Synthetic mance of 5 classes of arrhythmia classification with minimum lead ECG
(ADASYN) were assessed. By removing Tomek links, SMOTE- data. The input dataset representing the set of heartbeats is given by
{ }
Tomek outperforms and is finally integrated into the proposed 𝑋 = 𝑥1 , 𝑥2 , 𝑥3 , … , 𝑥𝑁 , where N denotes the number of heartbeats.
model. The variable 𝑥𝑖 = (𝑠1 , 𝑠2 , 𝑠3 , … , 𝑠𝑛 ) collects the ECG signal values and n
(4) Usability: The proposed model requires only 1-lead ECG data, stands for the length of the segmentation window. The Association for
making it usable in automatic, wearable, and IoT-based real- the Advancement of Medical Instrumentation (AAMI) [39] recommends
world applications. classifying MIT-BIH dataset arrhythmia as normal beat (N), supraven-
(5) Performance: A comprehensive empirical assessment is per- tricular ectopic beat (S), ventricular ectopic beat (V), fusion beat (F),
formed by systematically manipulating the model configurations and unknown beat (Q). Accordingly, the output is expressed as 𝑌 =
and hyperparameters of CNN, channel attention, and trans- { }
𝑦1 , 𝑦2 , 𝑦3 , … , 𝑦𝑁 , where
former encoder layers. Finally, an effective model with higher
accuracy (of 99.14% and 99.58% for two distinct datasets) is ⎧ 0, 𝑁 𝑐𝑙𝑎𝑠𝑠
developed. ⎪ 1, 𝑆 𝑐𝑙𝑎𝑠𝑠

𝑦𝑖 = ⎨ 2, 𝑉 𝑐𝑙𝑎𝑠𝑠
The developed model is anticipated to support the following ap- ⎪ 3, 𝐹 𝑐𝑙𝑎𝑠𝑠
plications: (i) early arrhythmia detection, (ii) real-time arrhythmia ⎪
⎩ 4, 𝑄 𝑐𝑙𝑎𝑠𝑠 .
classification, and (iii) patient and physician notification via single-
lead ECG wearable devices including smart shirts and other recording This study aims to find an effective model f such that 𝑓 ∶ 𝑋 →
devices. 𝑌 predicts the arrhythmia type for a given ECG heartbeat with high
The rest of paper is organized as follows. Section 2 formulates the accuracy.
problem, describes the data set, and outlines the relevant preprocessing
steps and proposed modeling framework. The model’s performance is
2.2. Datasets
analyzed in Section 3. Section 4 concludes the paper.

2. Methodology This study employed two datasets, (i) MIT-BIH Arrhythmia


Database [40] and (ii) St. Petersburg INCART 12-lead Arrhythmia
This section presents an overview of the used datasets, data pre- Database [41]. Table 1 summarizes information about the datasets
processing techniques, and class balancing methodologies. Afterward, while the number of heartbeat counts for each of the relevant categories
the proposed CAT-Net model architecture is introduced, comprising is presented in Table 2.

3
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Table 1
Utilized ECG dataset information.
Topic MIT-BIH (Dataset 1) INCART (Dataset 2)
Number of Recordings 48 75
Laboratory Beth Israel Hospital Arrhythmia Laboratory St. Petersburg Institute of Cardiological Technics (INCART)
Acquisition tool Holter recordings Holter recordings
Signals 2 Lead ECG, (i) MLII and (ii) V1; (occasionally V2, V4 or V5) 12 leads (Only lead II signals are used)
Signal length, sampling frequency 30 min (or slightly over), 360 Hz 30 min, 257 Hz
Subjects 47 (25 Men, 22 Women). Only 1 patient has 2 recordings. 32 (17 Men, 15 Women)
Age Range Men: (32∼89) years, Women: (23∼89) years 18∼80 years
Number of Heartbeat Types 20 (Table 2 lists 15 types of heartbeats are used) 11 (6 types of heartbeats are used)
Recording Period 1975∼1979 (Not Available)

Fig. 2. Visualization of 15 different types of ECG heartbeats of MIT-BIH dataset. Each type consists of 10 randomly selected heartbeat samples except supraventricular premature
beat (S) which assumes only 2 beats in the data set. Different colors are used to differentiate heartbeats.

2.2.1. MIT-BIH arrhythmia database Using the MIT-BIH dataset, we first develop our model for 5-class
The MIT-BIH dataset [40] comprises 48 samples of 2-lead ECG classification and then evaluate it for hyperparameter adjustments,
recordings, each lasting around 30 min. These recordings were col- class balancing, and architecture changing. The INCART dataset is then
lected between 1975 and 1979 through a collaboration between MIT used to validate our model’s performance and check its robustness.
and Boston’s Beth Israel Hospital. It contains 47 ECG recordings from
patients of both genders (25 men and 22 women), covering 32–89 years 2.3. Data preprocessing
of age (men) and 23–89 years of age (women). From a set of 4000
24 hour Holter recordings of a mixed population (inpatients 60%, Data preprocessing assumes two consecutive steps: (i) signal denois-
outpatients 40%) of BIH, 23 samples were chosen randomly, while ing and (ii) heartbeat segmentation. The raw ECG signals downloaded
the remaining 25 samples were selected that had less common but
from the MIT-BIH and INCART website are perturbed by noise induced
clinically significant arrhythmias. In this study, only modified limb lead
by the power line interface, myoelectric interface, etc. The popular dis-
II (MLII) signals are used to build a single-lead ECG model. Therefore,
crete wavelet transform (DWT)-based method is used to denoise the 1D
46 patients’ signals were used because the remaining two patients’ ECG
ECG signal. The wavelet denoising procedure undertakes three distinct
records only have lead V2, V5, or V5 signals, not lead MLII. Although
operations to remove noise from the ECG signal: (i) wavelet decom-
the original MIT-BIH data consists of 23 distinct categories of heartbeat,
position, (ii) coefficient processing, and (iii) wavelet reconstruction,
only 15 types stated in Table 2 are classified into five major classes: N,
respectively.
S, V, F, and Q. Fig. 2 depicts the morphological aspects of 15 types of
The raw ECG signal is first decomposed into nine levels by employ-
heartbeats.
ing Daubechies wavelet 5 (db5), and 9 Details coefficients (cD1, cD2,
2.2.2. St. Petersburg INCART 12-lead arrhythmia database cD3, cD4, cD5, cD6, cD7, cD8 and cD9) and 1 Approximate coefficient
The INCART dataset [41] consists of 75 annotated recordings ex- (cA9). Then, two detail coefficients, cD1 and cD2, are turned into zero
tracted from 32 Holter records. The signals were collected from the and the remaining 7 detail coefficients (cD3 to cD9) are processed using
patients (17 men and 15 women, aged 18–80; mean age: 58) under- the ‘soft thresholding ’ method. Finally, the processed approximate and
going tests for coronary artery disease. There were no patients with detail coefficients are used to reconstruct the denoised ECG signal using
pacemakers; the majority experienced ventricular ectopic beats. In the wavelet reconstruction technique. Fig. 4 shows the ECG signals
total, the reference annotation files comprise more than 175,000 beat before and after denoising.
annotations. Among the 12 leads, only lead II data was extracted for After denoising, the R-peak position is used for signal segmentation.
our model. The dataset is utilized to categorize 3 types of arrhyth- There are many R-peak detection techniques and segmentation algo-
mias because fusion beat (F) and unknown beat (Q) have only 219 rithms for ECG signals proposed in the literature, including but not
and 6 heartbeats, respectively, as mentioned in Table 2. The sample limited to ANN, genetic algorithm, wavelet transform, and modified
heartbeats of 3 categories and 6 sub-categories are visualized in Fig. 3. Pan–Tompkins algorithm [42]. The location of the R peak is already

4
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Table 2
Heartbeats count of MIT-BIH and INCART datasets.
Classes SN Type Heartbeat name # Heartbeats # Sub-total
MIT-BIH heartbeats MIT-BIH
(INCART) (INCART)
1 N Normal beat 74658
(150410)
90210
Normal (N) 2 L Left bundle branch block beat 8063
(153676)
3 R Right bundle branch block beat 7244
(3174)
4 e Atrial escape beat 16
5 j Nodal (junctional) escape beat 229
(92)
6 A Atrial premature beat 2540
(1944) 2775
Supraventricular ectopic beat (S)
7 a Aberrated atrial premature beat 150 (1960)
8 J Nodal (junctional) premature beat 83
9 S Supraventricular premature beat 2 (16)
10 V Premature ventricular contraction 7117 7223
Ventricular ectopic beat (V)
(20013) (20013)
11 E Ventricular escape beat 106
Fusion Beat (F) 12 F Fusion of ventricular and normal beat 802 802
(219) (219)
13 / Paced beat 3612
3887
Unknown beat (Q) 14 f Fusion of paced and normal beat 260
(6)
15 Q Unclassifiable beat 15
(6)
# Total 104897
heartbeats (175874)

Fig. 3. Visualization of 6 different types of ECG heartbeats of INCART dataset. Each type consists of 3 randomly selected heartbeat samples. Different colors are used to differentiate
heartbeats.

annotated in the MIT-BIH dataset. Therefore, the annotated R-peak the majority class is said to be Tomek Links if there is no sample 𝑥𝑘
locations and the fixed window method for segmentation are used. A that satisfies the condition: 𝑑(𝑥𝑖 , 𝑥𝑘 ) < 𝑑(𝑥𝑖 , 𝑥𝑗 ) or 𝑑(𝑥𝑗 , 𝑥𝑘 ) < 𝑑(𝑥𝑖 , 𝑥𝑗 ),
window of size 300 samples is used where 99 samples are taken from where 𝑑(𝑥𝑖 , 𝑥𝑗 ) represents the Euclidian distances between 𝑥𝑖 and 𝑥𝑗 .
the left and 201 samples are taken from the right of the R-peak position. ADASYN [45] mainly used density distribution to decide the number of
synthetic samples 𝑠𝑖 to be generated, following the equation as follows
𝑠𝑖 = 𝑥𝑖 + (𝑥𝑧𝑖 − 𝑥𝑖 ) × 𝜆
2.4. Class balancing
where (𝑥𝑧𝑖 − 𝑥𝑖 ) is the difference vector in n-dimensional spaces, and 𝜆
The MIT-BIH and INCART datasets are subject to an imbalanced is a random number: 𝜆 ∈ [0, 1].
number of samples (as shown in Table 2) which biases machine learn- SMOTE-Tomek offers a dual advantage. First, while SMOTE may in-
ing (ML) algorithms to learn attributes better for the larger class dataset troduce synthetic noisy instances, Tomek link removal effectively elim-
and to perform poorly in predicting the minority class data. To balance inates such instances, ensuring a cleaner and more reliable dataset for
the datasets we applied three class balancing approaches: SMOTE-ENN, training. Second, SMOTE-Tomek simultaneously addresses both over-
SMOTE-Tomek, and ADASYN. SMOTE-Tomek is employed in our model sampling and undersampling challenges, resulting in a more balanced
for its superior performance. SMOTE-ENN and SMOTE-Tomek were dataset. This balanced distribution enhances the classifier’s ability to
generalize effectively across all classes. In contrast, SMOTE-ENN and
introduced by Batista et al. [43] which offer improvements to the
ADASYN did not fully exploit the dual benefits of oversampling and un-
SMOTE algorithm [44].
dersampling simultaneously, leading to slightly reduced classification
In SMOTE, new synthetic data are generated rather than over-
performance in arrhythmia classification from ECG.
sampling with replacement. First, the difference between a minority
The number of training and testing samples and corresponding class
class sample 𝑥𝑖 and one of its 𝑘 nearest neighbors 𝑥̂𝑖 is multiplied by
balancing approaches for MIT-BIH and INCART datasets are provided
a factor 𝛿. This product is added to the selected samples 𝑥𝑖 to create a
in Tables 3 and 4, respectively. For both datasets, as the ‘N’ class avails
new sample 𝑥𝑛𝑒𝑤 as follows
a larger number of samples, random under-sampling (RUS) is applied
𝑥𝑛𝑒𝑤 = 𝑥𝑖 + (𝑥̂𝑖 − 𝑥𝑖 ) × 𝛿 before employing the SMOTE-Tomek approach. Finally, two balanced
datasets with 55k and 44k samples per class are generated.
where 𝛿 ranges from 0 to 1. SMOTE-Tomek is a combination of over-
sampling by SMOTE and undersampling by removing Tomek links. 2.5. Model architecture
Tomek Links is a method for detecting pairs of nearest neighbors with
various classes in a dataset. A pair of minimally Euclidian distance The classification model architecture consists of three main compo-
neighbors (𝑥𝑖 , 𝑥𝑗 ) with 𝑥𝑖 belonging to the minority class and 𝑥𝑗 to nents: (i) CNN, (ii) attention mechanism, and (iii) transformer encoder.

5
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Fig. 4. Raw, denoised and R peak detected ECG signal from MIT-BIH dataset (recording number 100, channel MLII). The third graph shows segmentation of window size 300,
with the third R-peak (location 370) as a reference point.

Table 3
Number of samples in training, testing, and class balancing approach: MIT-BIH dataset.
Class # Heartbeats # Testing (20%) # Training Balancing technique(s) # Training (Balanced)
N 90210 17982 72228 RUS + SMOTE-Tomek 55000
S 2775 568 2207 SMOTE-Tomek 55000
V 7223 1463 5760 SMOTE-Tomek 55000
F 802 167 635 SMOTE-Tomek 55000
Q 3887 800 3087 SMOTE-Tomek 55000
Total 104897 20980 83917 – 275000

Table 4
Number of samples in training, testing, and class balancing approach: INCART dataset.
Class # Heartbeats Used # Testing (20%) # Training Balancing technique(s) # Training (Balanced)
N 153343 30624 122719 RUS + SMOTE-Tomek 40000
S 1957 413 1544 SMOTE-Tomek 40000
V 19980 4019 15961 SMOTE-Tomek 40000
Total 175280 35056 140224 – 120000

The transformer encoder is described in a separate sub-section with its 2 is applied. Later, for the 2𝑛𝑑 , 3𝑟𝑑 and 4𝑡ℎ CLs, 32 filters of size
components. At the end of the complete framework, the feature map 23 × 1, 64 filters of size 25 × 1, and 128 filters of size 27 × 1 are
is sent to a flattened layer and to the dense layers for prediction. The employed, respectively. Each layer maintains the ‘same’ padding with
proposed CAT-Net model is illustrated in Fig. 5. a stride value of 1. The rectified linear unit (ReLU) activation function
is followed by each CL.
2.5.1. CNN architecture The number of filters and the kernel size are gradually increased
The successful use of CNNs in computer vision and image processing with the depth of network since the deeper layers consider more
applications was followed up by outstanding results in the analysis of information for prediction. To reduce the network computational com-
time series data such as ECG and EEG. Many challenging problems have plexity, the pooling layer is employed after the first 3 CLs. Following
the first two CLs, a 1D maximum pooling of size 3 × 1, stride 2, and
already been addressed by employing 1D CNN models. The convolution
‘same’ padding is applied. After the 3rd CL, a 1D average pooling of size
operation allows the consideration of local information from a signal to
3 × 1, stride 2, and ‘same’ padding is employed. As the deeper portion
create a feature map for the next layer.
contains sensitive information, average pooling is applied instead of
The considered model employs four CNN layers. A summary of the
maximum pooling like the previous two layers.
model framework is presented in Table 5. The first CNN layer processes
For the convolutional layers CL1, CL2, CL3 and CL4, the conducted
the input ECG and the next layer considers the previous layer’s feature operations are
map as input. For the first convolutional layer (CL), 16 filters of size ( )
21 × 1 with stride 1 are applied with the same padding. Next, after ∑
𝑥𝑙𝑘 = 𝜎 𝑥𝑙−1
𝑖 ∗ 𝜔𝑖𝑘 + 𝑏 𝑘 , (1)
the attention layer, a 3 × 1 sized 1D max pooling layer with stride 𝑖∈𝑀𝑘

6
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Fig. 5. Proposed CAT-Net model for arrhythmia classification from ECG signal. The model predicts five types of arrhythmia from ECG heartbeats: normal (N), supraventricular
ectopic (S), ventricular ectopic (V), fusion (F), and unknown (Q).

Table 5 2.5.2. Attention mechanism


Summary of the proposed CAT-Net model. In the proposed hybrid model, each CL is followed by a channel
Layer # Filters Kernel size Output shape # Parameters attention mechanism. The core idea of applying attention mechanism
Conv1D 16 21 × 1 300 × 16 352 is to enhance the training performance by emphasizing the more in-
BatchNorm – – 300 × 16 64 formative temporal segment of ECG signal. As only single-lead ECG
Attention1D 16 – 300 × 16 0 data is used, the feature map becomes a 2D matrix and therefore, only
MaxPooling1D – 3 × 1 150 × 16 0
Conv1D 32 23 × 1 150 × 32 11808
channel attention is implemented following the CBAM strategy [46].
BatchNorm – – 150 × 32 128 The proposed attention mechanism schematic is presented in Fig. 6.
Attention1D 32 – 150 × 32 0 The 2D feature map after each convolution is represented by matrix
MaxPooling1D – 3 × 1 75 × 32 0 𝐹 ∈ R𝐶×𝐿 , where 𝐶 indicates the number of channels and 𝐿 denotes the
Conv1D 64 25 × 1 75 × 64 51264 length. In the channel attention mechanism, the 2D feature map is first
BatchNorm – – 75 × 64 256
converted into the 1D attention map 𝐴𝑐 ∈ R𝐶×1 . Then, the 1D attention
Attention1D 64 – 75 × 64 0
AveragePooling1D – 3 × 1 38 × 64 0
map 𝐴𝑐 is multiplied with the feature map 𝐹 𝑙 to generate the attention-
Conv1D 128 27 × 1 38 × 128 221312 based refined feature map 𝐹 𝑙+1 for the next layer as illustrated in Fig. 6.
BatchNorm – – 38 × 128 512 The overall process is formalized at layer l through the element-wise
Attention1D 128 – 38 × 128 0 matrix multiplication operator ⊗ as follows
TransformerEncoder 128 dim 4 heads 38 × 128 280576
Flatten – – 4864 0 𝐹 𝑙+1 = 𝐴𝑐 (𝐹 𝑙 ) ⊗ 𝐹 𝑙 . (2)
Dense – – 128 622720
Dropout – – 128 0 Feature map of lth layer is represented by 𝐹 𝑙.
Dense – – 5 645 To compute the channel attention, the spatial dimension of the input
feature is squeezed. The spatial information is aggregated by using
max-pooling and average pooling in the spatial domain. Thus, two
𝑐
different spatial presenters are generated 𝐹𝑎𝑣𝑔 𝑐 , respectively.
and 𝐹𝑚𝑎𝑥
where 𝑥𝑙𝑘 and 𝑥𝑙−1
𝑘
represent the kth neuron output at layers l and l- Both presenters are then inputted in a shred multi-layer perceptron
1, respectively. The weight kernel between the ith and kth neuron is (MLP) with only one hidden layer. Next, element-wise summation is
denoted by 𝜔𝑖𝑘 and the bias of kth neuron is indicated by 𝑏𝑘 . Sigma applied to the generated output feature vectors. The attention map is
(𝜎) denotes the ReLU activation function. Notation 𝑀𝑘 stands for the formally generated as follows
active range of the convolutional kernel.
Ac (F) = 𝜎(MLP(AvgPool(F)) + MLP(MaxPool(F)))
After each convolution layer, a batch normalization (BN) layer is ( ( ( )) ( ( ))) (3)
applied to preserve the mean output close to 0 and the output standard = 𝜎 W1 W0 Fcavg + W1 W0 Fcmax .
deviation close to 1. The BN layer contributes to stabilizing the learning 𝐶
process and reducing the number of training epochs required to train The weights of MLP are incorporated in matrices 𝑊0 ∈ R 𝑟 ×𝐶 and
𝐶× 𝐶𝑟
deep networks. 𝑊1 ∈ R . Variable 𝑟 is introduced to reduce the parameter overhead.

7
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Fig. 6. Channel attention mechanism along the process of refining the feature map.

For our network, 𝑟 = 8. Thus, the hidden activation size is reshaped to getting balanced gradients after the softmax function. The dot-product
R𝐶×1×1 . attention is calculated as follows
The 4th attention layer’s output feeds into the subsequent trans- 𝑄𝐾 𝑇
former encoder layer described next. Later, the transformer encoder’s 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄, 𝐾, 𝑉 ) = 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥( √ )𝑉 .
𝑑𝑘
output of size 38 × 128 is flattened into a 4864-sized vector. Finally, this
vector is processed through a dense layer of 128 neurons, followed by For MHA, the query, key, and value matrices are linearly projected
another dense layer of 5 neurons using the ‘softmax’ activation function as 𝑊𝑖𝑄 ∈ R𝑑𝑚𝑜𝑑𝑒𝑙 ×𝑑𝑞 , 𝑊𝑖𝐾 ∈ R𝑑𝑚𝑜𝑑𝑒𝑙 ×𝑑𝑘 , and 𝑊𝑖𝑉 ∈ R𝑑𝑚𝑜𝑑𝑒𝑙 ×𝑑𝑣 , respec-
to classify 5 types of arrhythmia. tively. Next, the outputs of all heads are concatenated and fed to a
linear layer to produce the output:
2.6. Transformer encoder
𝑀𝑢𝑙𝑡𝑖𝐻𝑒𝑎𝑑(𝑄, 𝐾, 𝑉 ) = 𝐶𝑜𝑛𝑐𝑎𝑡(ℎ𝑒𝑎𝑑1 , … , ℎ𝑒𝑎𝑑ℎ ),
The CNN and attention layers model the local morphological in- where ℎ𝑒𝑎𝑑𝑖 = 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄𝑊𝑖𝑄 , 𝐾𝑊𝑖𝐾 , 𝑉 𝑊𝑖𝑉 ).
formation of ECG. To capture long-term ECG signal dependencies, a The MHA output (𝑍𝑚ℎ𝑎 ) is added and normalized to generate the
transformer encoder is utilized. The transformer is a groundbreaking final output (𝑍) and fed to the next feed-forward layer.
Seq2Seq architecture applied in NLP [47]. To adapt the transformer in
arrhythmia classification from ECG, the decoder part is eliminated. A 𝑍 = 𝐿𝑎𝑦𝑒𝑟𝑁𝑜𝑟𝑚(𝑍𝑚ℎ𝑎 + 𝐹𝑒𝑐𝑔 ).
transformer encoder includes positional encoding, a multi-head atten-
tion layer, and a feed-forward layer along with skip connections and 2.6.3. Feed forward network
normalization layers. The output 𝑍 is fed to the next sublayer named feed-forward
network (FFN). It consists of two linear transformations with a ReLU
2.6.1. Positional encoding (PE) activation function in between. The output of FFN layer (𝑍𝑓 𝑓 𝑛 ) is given
As the transformer does not include recurrence or convolution, by
understanding the order of sequence positional encoding is important. 𝑍𝑓 𝑓 𝑛 = 𝐹 𝐹 𝑁(𝑍) = max(0, 𝑍𝑊1 + 𝑏1 )𝑊2 + 𝑏2 ,
Therefore, before entering the transformer encoder the input embed-
ding (i.e. ECG feature map) is added with PE. The PE must have the where 𝑊1 , 𝑊2 denote the weights and 𝑏1 , 𝑏2 represent the biases.
same dimensions as the input embedding. Among different approaches, Finally, the 𝑍𝑓 𝑓 𝑛 is transferred to the ‘Add and Norm’ layer to generate
sine and cosine functions of different frequencies are used for PE as they the transformer encoder’s final output (𝑍𝑇 𝐸 )
work efficiently compared to other approaches. The PE is computed as 𝑍𝑇 𝐸 = 𝐿𝑎𝑦𝑒𝑟𝑁𝑜𝑟𝑚(𝑍𝑓 𝑓 𝑛 + 𝑍).
follows
The skip connections always keep the previous layer’s impact, which
𝑃 𝐸(𝑝𝑜𝑠,2𝑖) = sin(𝑝𝑜𝑠∕100002𝑖∕𝑑𝑚𝑜𝑑𝑒𝑙 ) helps predicting with higher performance.
𝑃 𝐸(𝑝𝑜𝑠,2𝑖+1) = cos(𝑝𝑜𝑠∕100002𝑖∕𝑑𝑚𝑜𝑑𝑒𝑙 ),
2.7. Computational environment
where pos identifies the position, i denotes the dimension ranging from
0 to 𝑑𝑚𝑜𝑑𝑒𝑙 ∕2, and 𝑑𝑚𝑜𝑑𝑒𝑙 is the model dimension and refers to the The model is implemented in ‘Google Colab’ using Python (version
dimension of the input token (i.e., segmented ECG feature map). The 3.8.16). The ‘wfdb’ (waveform database) and ‘pywt ’ (PyWavelets) li-
main function of PE is to help the model in learning relative positions brary packages were used for ECG signal processing and denoising,
easily. The PE is computed once during training and can be used respectively. TensorFlow and Scikit-learn libraries were utilized to
subsequently. develop and evaluate the model. The CAT-Net model was developed
using Keras with TensorFlow backend. Regarding the computational
2.6.2. Multi-head attention infrastructure, the study utilized Google Colab’s GPU, featuring an
To empower the ECG feature map by contextualizing the long-range Intel® Xeon® CPU @2.00 GHz, 64-bit, GPU Memory: 12 GB/16 GB and
dependencies, a multi-head attention (MHA) layer is implemented as GPU Memory Clock: 0.82 GHz/1.59 GHz.
the first layer of transformer encoder. The internal computation of MHA
internal is similar to scaled dot-product attention. The difference is 2.8. Training phase
that in MHA, scaled dot-product attention is performed in parallel for
different heads. In our model, we implemented 4 heads. The compu- The overall preprocessed data is split into 80% training and 20%
tation process of scaled dot-product and multi-head attention layers is testing. The CAT-Net model is trained with mini-batch gradient descent
illustrated in Fig. 7. with a batch size of 128. The model is trained with ‘adam’ optimizer for
First, the positional encoded ECG feature map 𝐹𝑒𝑐𝑔 ∈ R𝐿𝑒𝑐𝑔 ×𝑑𝑚𝑜𝑑𝑒𝑙 is 60 epochs. Only 15% of training data is considered as validation data.
mapped into three matrices: query 𝑄 ∈ R𝐿𝑒𝑐𝑔 ×𝑑𝑞 , key 𝐾 ∈ R𝐿𝑒𝑐𝑔 ×𝑑𝑘 , and For every training step, the model weights are recorded, and the
values 𝑉 ∈ R𝐿𝑒𝑐𝑔 ×𝑑𝑣 . Here, 𝐿𝑒𝑐𝑔 denotes the length of ECG feature map. best weights are saved using ‘modelcheckpoint’ callback function. The

Next, the dot-product of query and key is scaled with factor 1∕ 𝑑𝑘 for validation accuracy is set as the indicator for searching the best model.

8
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Fig. 7. The computation flow in scaled dot-product attention-(left) and multi-head attention-(right).

Table 6
The values of model hyperparameters.
Hyperparameter Value Notation
Batch size 128 –
Epochs 60 –
Model dimensions 128 𝑑𝑚𝑜𝑑𝑒𝑙
Attention heads 4 ℎ
FNN dimensions 128 𝑑𝐹 𝐹 𝑁
Dropout rate 0.02 –

model, the LSTM, attention, and transformer encoder layers are in-
corporated gradually. The performance of the model comprising four
consecutive CNN layers having 4, 16, 32, 64, and 128 filters is evalu-
ated. Random search methods are employed to obtain a better model.
Next, the hyperparameters of transformer encoder are set carefully.
Fig. 8. Accuracy curves during training the model. Curves are plotted in ‘Tensorboard’. Finally, based on accuracy and macro F1 score the CAT-Net model
The original shaded curves are smoothed by a factor of 0.5.
architecture is confirmed. The values of hyperparameters used for the
whole model and transformer encoder are presented in Table 6.

3. Results and discussion

Next, the model performance based on different evaluation matrices


is assessed and analyzed. The effect of changing hyperparameters, class
balancing, and small data in ‘F’ class and the model’s robustness are
discussed. The proposed work is benchmarked with state-of-the-art
approaches and future research recommendations are put forward.

3.1. Evaluation matrices

The classification model performance is assessed via accuracy, sensi-


tivity, specificity, and F1 score, which is an important factor to evaluate
the performance of a machine learning model. The per-class accuracy,
sensitivity, and specificity are computed according to (4) to (7).
𝑇𝑃
𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦𝑃 𝐶 = 𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦𝑂𝐴 = (4)
Fig. 9. Loss curves during training the model. Curves are plotted in ‘Tensorboard’. The
𝑇𝑃 + 𝐹𝑁
𝑇𝑁
original shaded curves are smoothed by a factor of 0.5. 𝑆𝑝𝑒𝑐𝑖𝑓 𝑖𝑐𝑖𝑡𝑦𝑃 𝐶 = 𝑆𝑝𝑒𝑐𝑖𝑓 𝑖𝑐𝑖𝑡𝑦𝑂𝐴 = (5)
𝑇𝑁 + 𝐹𝑃
𝑇𝑃 + 𝑇𝑁
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦𝑃 𝐶 = (6)
𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁
The ‘tensorbroad’ callback is also used to check the model progress 𝑁
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦𝑂𝐴 = 𝑐𝑜𝑟𝑟𝑒𝑐𝑡 , (7)
during training. The training and testing accuracy and loss curves 𝑁
during training are presented in Figs. 8 and 9, respectively. where TP, TN, FP, and FN refer to true positive, true negative, false
The model architecture and hyperparameters are varied following positive, and false negative, respectively; OA indicates overall perfor-
all the possible paths in Fig. 10. After designing a simple CNN-based mance, PC denotes per-class performance, 𝑁 stands for the number of

9
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Fig. 10. Variation of model and related hyperparameters.

Table 7
Confusion matrix and per-class performance score of CAT-Net model on MIT-BIH testing data (Imbalanced)
Class Predicted class Per-class score Total data
type per-class
N S V F Q Accuracy (%) Sensitivity (%) Specificity (%) F1 Score (%)
N 17910 44 15 11 2 99.27 99.60 97.26 99.57 17982
S 37 531 0 0 0 99.60 93.49 99.77 92.75 568
True class v 23 2 1422 16 0 99.69 97.19 99.88 97.80 1463
F 21 0 7 139 0 99.74 83.23 99.87 83.48 167
Q 1 0 1 0 798 99.98 99.75 99.99 99.75 800
Overall Performance 99.14 99.14 99.14 99.14 Micro 20980
94.69 Macro
99.14 Weighted

total samples and 𝑁𝑐𝑜𝑟𝑟𝑒𝑐𝑡 identifies the number of samples predicted It is observed that the per-class accuracy is always greater than
correctly. 99.27% as stated in Table 7. This demonstrates the model’s outstanding
The per-class F1 score is calculated as follows performance across all the classes. When the model was run without
class balancing for the training data, it showed good performance for
𝐹1,𝑃 𝐶 = 𝑀𝑖𝑐𝑟𝑜 𝐹 1,𝑂𝐴
the N class but very poor performance in the F class due to the shortage
2 × 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑠𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦
= (8) of F class data sets. The model presented 99.60%, 93.49%, 97.19%,
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑠𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦
83.23% and 99.75% sensitivity for N, S, V, F and Q, respectively. The
2𝑇 𝑃
= , sensitivity of ‘F’ class is yet lower compared to the other classes. Thus,
2𝑇 𝑃 + 𝐹 𝑃 + 𝐹 𝑁
improving the F class sensitivity represents an open research problem.
where 𝐹1,𝑂𝐴 indicates overall F1 score.
The proposed CAT-Net model achieves 94.69% macro and 99.14%
The overall macro and weighted F1 score are computed as follows
weighted F1 scores. It is noticed that for the S and F classes, sensitivity
1 ∑
𝐶 and F1 scores are lower compared to the other three classes. This occurs
𝑀𝑎𝑐𝑟𝑜 𝐹1,𝑂𝐴 = 𝐹 1𝑖,𝑃 𝐶 (9) due to less number of samples in the original data for these two classes.
𝐶 𝑖=1

and
3.3. Adjustment of model hyperparameters
1 ∑
𝐶
𝑊 𝑒𝑖𝑔ℎ𝑡𝑒𝑑 𝐹1,𝑂𝐴 = 𝐹 1𝑖,𝑃 𝐶 × 𝑁𝑖 , (10)
𝑁 𝑖=1 Using a random search approach, the model architecture and its
hyperparameters, such as the number of filters in CNN, the number of
where C denotes the number of classes, 𝐹 1𝑖,𝑃 𝐶 indicates the F1 score of
transformer encoders and other parameters are adjusted. In every step,
ith class, and N𝑖 represents the number of samples in ith class. For multi-
the accuracy, number of total parameters, training time, validation
class classification, the overall model’s sensitivity, specificity, accuracy,
accuracy and loss, etc., are recorded. Table 8 provides a listing of
and Micro F1 are equal as
notable models and their corresponding performance. From Table 8, it
𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦𝑂𝐴 = 𝑆𝑝𝑒𝑐𝑖𝑓 𝑖𝑐𝑖𝑡𝑦𝑂𝐴 is clear that the CNN-Net model offers the highest validation accuracy
= 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦𝑂𝐴 (11) (99.89%) and the lowest validation loss (0.52%). Even though the
number of parameters and training time per epoch is slightly higher
= 𝑀𝑖𝑐𝑟𝑜 𝐹1,𝑂𝐴 .
than in other models, it could be ignored for better results.
3.2. Classification performance The developed model is trained and tested with different trans-
former encoder hyperparameters to explore their effects. Fig. 11 depicts
The confusion matrix and the per-class classification report MIT- the corresponding accuracy and macro F1 score. After investigation, we
BIH data are presented in Table 7. The confusion matrix shows that found that only 1 encoder with 4 attention heads for MHA layer works
44 ‘N’ beats are mispredicted as ‘S’ and 37 of S-type beats are wrongly effectively for ECG arrhythmia classification. We set the model and FNN
predicted as ‘N’. The model falls in difficulty in discriminating between dimension to 128 for the highest performance.
normal and supraventricular beats. From the heartbeat plotting in
Fig. 2, the reason is obvious as these two types of heartbeats contain 3.4. Computational complexity
almost the same morphological pattern. Due to distinguishable varia-
tions in unclassifiable heartbeats (Q), the accuracy and F1 scores are The computational complexity (CC) of convolution, recurrent, and
excellent in the class. Among 800 beats, only 2 beats are misclassified. self-attention layers is provided in Table 9. While a recurrent layer
The overall accuracy of the proposed system is 99.14%. needs 𝑂(𝑛) operations for path length 𝑂(𝑛), the attention layer needs

10
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Table 8
Number of total parameters, training time, validation accuracy and loss among some of trained models.
Model details # Total param. Training time (s) / Epoch Max. Val. Acc. (%) Min. Val. Loss (%)
CNN 381,837 11.5∼27.8 99.87 0.76
CNN+LSTM 777,797 21.14∼39.16 99.78 0.84
CNN+Att.+LSTM 777,797 30.19∼42.23 99.84 0.56
CAT-Net 1,189,637 34.19∼47.18 99.89 0.52

Fig. 11. Model performance (i.e., accuracy, macro F1 score) variations with transformer encoder hyperparameters.

Table 9 3.5. Local and global features


Computational complexity of different layers.
Layer Complexity per Layer Sequential operations Max path length The proposed CAT-Net model efficiently captures local and global
Convolution 𝑂(𝑘 ⋅ 𝑛 ⋅ 𝑑 2 ) 𝑂(1) 𝑂(log𝑘 (𝑛)) ECG characteristics. The first 4 CNN layers focus on ECG signal local
Recurrent 𝑂(𝑛 ⋅ 𝑑 2 ) 𝑂(𝑛) 𝑂(𝑛)
insights. Channel attention is employed after each CNN layer to capture
Self-attention 𝑂(𝑛2 ⋅ 𝑑) 𝑂(1) 𝑂(1)
the highly relevant parts for arrhythmia classification. The input ECG
Here, 𝑛 = input sequences length of transformer encoder, 𝑑 = model dimension, and 𝑘 = signal is converted to ECG feature maps in the latent space after passing
kernel width of convolution layer. In our model, 𝑛 = 128, 𝑑 = 128, and 𝑘 = 21, 23, 25, 27.
through four CNN and channel attention layers.
To focus on global dependencies, the ECG feature maps are trans-
ferred to the transformer encoder. Following the approach of contex-
tualizing words in NLP, the ECG feature maps are contextualized using
only 𝑂(1) operations for path length 𝑂(1) due to parallelization. Consid- the transformer encoder. The multi-head attention layer finds query-
ering time and space complexity, the self-attention of the transformer key similarities and multiplies the generated attention weights by the
encoder presents lower complexity than convolution and recurrent lay- value matrix to empower the feature maps for prediction. Overall,
the integration of CNN, channel attention, and transformer encoder
ers, i.e., 𝐶𝐶𝑠𝑒𝑙𝑓 −𝑎𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛 < 𝐶𝐶𝑐𝑜𝑛𝑣𝑜𝑙𝑢𝑡𝑖𝑜𝑛 < 𝐶𝐶𝑟𝑒𝑐𝑢𝑟𝑟𝑒𝑛𝑡 . Convolution layers
enables the model to successfully incorporate local as well as global
make the CAT-Net model computationally more complex than pure
ECG information.
transformer-based models, which can only focus on global features.
However, the transformer encoder makes the proposed CAT-Net model 3.6. Effect of small data in ‘F’ class
computationally efficient, with less computational complexity (in both
time and space) than pure CNN or CNN-LSTM hybrid models while The majority of research studies still struggle with the issue of low
collecting local and global insight. predictive rate in the ‘F’ class. In the original MIT-BIH dataset, the F

11
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Table 10
Low predictive rate in ‘F’ class.
Author Per-class accuracy (%) Overall accuracy (%) Overall macro F1 (%)
N S V F Q
Rafi et al. [48] 99.32 78.46 95.49 71.76 97.79 99.67 98.93
Proposed 99.27 99.60 99.69 99.74 99.98 99.14 94.69

Author Per-class F1-score (%) Overall accuracy (%) Overall macro F1 (%)
N S V F Q
Peng et al. [32] 99.62 91.53 99.26 88.57 99.50 99.28 95.70
Proposed 99.57 92.75 97.80 83.48 99.75 99.14 94.69

Table 11 3.9. Benchmarking with state-of-the-art


Comparison of class balancing approaches.
Approach Required time (Min) Accuracy (%) Macro F1 (%) To conduct a comprehensive and effective performance evaluation,
SMOTE-ENN 42.3 98.60% 94.27% the proposed model is compared with the existing state-of-the-art meth-
SMOTE-Tomek 40.8 99.14% 94.69% ods. Table 13 presents performance metrics for the most representative
ADASYN 37.1 98.02% 93.58%
recent models that employ the same data set, number of classes and
almost similar methods of analysis. Recent deep learning based studies
focusing specifically on CNN [50], CNN-RNN [51–53], CNN-RNN with
class contains only 0.7646% data (802 samples out of a total of 104,897 attention [30,37], transformer [34,35] and CNN-transformer [36,49,
samples). Different class balancing procedures are used to balance the 54] were considered. All models used the MIT-BIH dataset except 2
small fraction of data, resulting in a rise in data volume while the models [35,36], where another dataset [55] was employed. Along with
information is still the same. Consequently, the prediction in this class the MIT-BIH dataset, the model [52] used 1-lead Cin-C training 2017
became more and more challenging. For example, the model proposed dataset and model [30] used 12 lead CPSC-2018 dataset. The accuracy
in [48] showed only 71.76% accuracy in the ‘F’ class and 78.45% levels and F1 scores are visualized in ascending order in Figs. 12 and
accuracy in the ‘S’ class as illustrated in Table 10. The model [32] 13, respectively.
showed only 88.57% F1 score in the ‘F’ class whereas for other classes The traditional ML models require less data and are computationally
the score was above 91%. Furthermore, Xia et al. [49] developed a cheap. However, the models involve hand-crafted feature extraction,
which demands in-depth domain-specific knowledge, resulting in lim-
transformer-based model blended with CNN that achieved only 13.04%
ited performance. Manual feature extraction includes but is not lim-
sensitivity on ‘F’ class. Similarly, our proposed model CAT-Net predicts
ited to, signal-based (i.e., auto-correlation coefficient power, frequency
this class poorly with an accuracy of 99.74% and F1 score of 83.48%.
band power, Shannon’s entropy, etc.), statistical (i.e., mean, maximum,
To enhance the predictive performance in ‘F’ class, more ECG data of
variance, skewness, kurtosis, etc.), and morphological features (i.e., R-R
the class is required.
interval, QRS duration, etc.). For instance, Venkatesan et al. [56] uti-
lized heart rate variability (HRV) time–frequency-based features with
3.7. Effect of class balancing a kNN classifier and achieved 97.5% accuracy for normal vs. abnormal
ECG arrhythmia classification. Martis et al. [57] conducted principle
In order to demonstrate the effect of class imbalance on the out- component analysis on segmented ECG and obtained 98.11% accuracy
come of ML algorithms, the proposed model under both balanced and for 5 types of arrhythmia classification on the MIT-BIH dataset. Park
unbalanced training data is analyzed. With imbalanced training data, et al. [58] extracted morphological features such as P wave and QRS
the model showed only 73.65% F1 score, dropping from 83.48% for the complex and used the k-NN classifier and attained an accuracy of 97%
minority class ‘F’. Furthermore, the F1 scores of other smaller classes in classification.
declined by 1% to 6% relative to the balanced model. Nowadays, researchers are focusing on deep learning-based ad-
The strategy of class balancing also changes the model’s perfor- vanced algorithms to ignore the laborious task of manual feature
mance. The maximum accuracy, macro F1 score and required time for extraction. Gopika et al. proposed the residual convolutional neu-
balancing imbalanced ECG training data are listed in Table 11. The ral network (RCNN) and achieved 4.89% lower accuracy and 0.69%
mode trained by the data balanced by SMOTE-Tomek achieved the lower F1 score. Xu et al. [52] employed pre-training using the 2017
highest accuracy and macro F1 score compared to other approaches. PhysioNet/CinC Challenge dataset and transfer of parameters during
Considering the highest performance, SMOTE-Tomek is finally imple- training with the MIT-BIH set. This model considered 4 CLs, 2 residual
mented in our CAT-Net model. blocks and 2 BiLSTM layers. The CNN–BiLSTM model yielded the F1
score of 95.92%, but suffered from lower accuracy, 95.9%, which was
3.24% lower than the proposed model.
3.8. Validation with INCART dataset
Wang et al. [50] proposed a model that consists of 33 CNN layers
followed by a non-local convolutional block attention module (NCBAM)
In the validation phase of this study, an additional dataset INCART and that achieved 98.64% accuracy, 96.64% F1 score on the MIT-BIH
was employed to rigorously assess the robustness and generalization ca- dataset and 85.07% F1 for the PTB-XL dataset. This model F1 score was
pabilities of the proposed CAT-Net model. Our model achieved 99.58% 1.95% higher than ours, but its accuracy was lower. Jin et al. [30] pro-
accuracy and 96.15% macro F1 score for 3-class arrhythmia classi- posed the dual-level attentional convolutional long short-term memory
fication. The confusion matrix and per-class performance score are (DLA-CLSTM) neural network but its performance was very low 10.38%
shown in Table 12. The obtained result is higher than the MIT-BIH and 14.15% lower than our method. Essa et al. [53] conducted bagging
data-based model in terms of accuracy and F1 score. This proves by ensembling CNN-LSTN and RR intervals and higher-order statistics
the consistent performance of our model. As the local and global with LSTM (RRHOS-LSTM) and classified 4 types of arrhythmia. The
contextualized ECG information was captured, the proposed CAT-Net model accuracy was 95.81% but the highly imbalanced data led to the
model learns effectively ignoring dataset bias. The robust and dataset- low F1 score of 71.06%.
invariant model exhibits promising performance and is applicable to Zhao et al. [37] presented an attention-based temporal convo-
real-world applications. lutional network (TCN) and fused it with a mechanism to encode

12
M.R. Islam et al. Biomedical Signal Processing and Control 93 (2024) 106211

Table 12
Confusion matrix and per-class performance score of CAT-Net model on INCART testing data (Imbalanced)
Class Predicted class Per-class score Total data
type per-class
N S V Accuracy (%) Sensitivity (%) Specificity (%) F1 Score (%)
N 30538 52 34 99.59 99.71 98.74 99.77 30624
True class S 32 378 3 99.75 91.53 99.84 89.47 4019
V 24 2 3993 99.82 99.35 99.88 99.21 413
Overall Performance 99.58 99.58 99.58 99.58 Micro 35056
96.15 Macro
99.58 Weighted

Table 13
Classification performance comparison of state-of-the-art methods on the MIT-BIH dataset.

Authors | Year | # Classes | # Lead(s) | Method(s) | Accuracy (%) | Increment in Accuracy (%) | Macro F1 (%) | Increment in Macro F1 (%)
Yan et al. [34] | 2019 | 4 (N, S, V, F) | 1 | Transformer + Attention | 98.97 | 0.17 | 88.73 | 5.96
Gopika et al. [51] | 2020 | 5 | 2 | RCNN | 94.25 | 4.89 | 94.00 | 0.69
Xu et al. [52] | 2020 | 5 | 1, 2 | CNN + BiLSTM | 95.9 | 3.24 | 95.92 | −1.23
Wang et al. [50] | 2021 | 5 | 12 | CNN + Attention | 98.64 | 0.50 | 96.64 | −1.95
Che et al. [36] | 2021 | 9 a | 12 | CNN + Transformer | 83.66 | 15.48 | 78.60 | 16.09
Jin et al. [30] | 2021 | 5 (N, A, V, AF, O) | 2, 12 | DLA-CLSTM | 88.76 | 10.38 | 80.54 | 14.15
Essa et al. [53] | 2022 | 4 (N, SVEB, VEB, F) | 1 | Ensemble of CNN-LSTM & RRHOS-LSTM | 95.81 | 3.33 | 71.06 b | 23.63
Meng et al. [35] | 2022 | 3 a (N, S, V) | 1 | Lightweight Transformer | 99.32 | −0.18 | – | –
Yixuan et al. [54] | 2022 | 5 | 2 | STCT: CNN + Transformer | 98.96 | 0.18 | 99.31 b | −4.62
Zhao et al. [37] | 2023 | 5 | 1 | Attention-based TCN | 87.81 | 11.33 | 89.46 b | 5.23
Xia et al. [49] | 2023 | 5 | 1 | (CNN, DAE) + Transformer | 97.66 | 1.48 | – | –
Proposed model (MIT-BIH dataset) | 2023 | 5 | 1 | CAT-Net: CNN + Attention + Transformer | 99.14 | 4.62 (avg.) | 94.69 | 6.44 (avg.)
Proposed model (INCART dataset) | 2023 | 3 (N, S, V) | 1 | CAT-Net: CNN + Attention + Transformer | 99.58 | – | 96.15 | –

a The dataset [55] is used instead of the MIT-BIH dataset.
b Not specified whether the F1 score is macro or micro.
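The increment columns of Table 13 are the differences between the proposed model's MIT-BIH scores (99.14% accuracy, 94.69% macro F1) and each baseline, and the 4.62% and 6.44% averages quoted in the discussion are the means of those columns with the '–' entries skipped. A quick arithmetic check:

```python
# Average the "Increment" columns of Table 13 (entries marked '-' are skipped).
acc_increment = [0.17, 4.89, 3.24, 0.50, 15.48, 10.38, 3.33, -0.18, 0.18, 11.33, 1.48]
f1_increment  = [5.96, 0.69, -1.23, -1.95, 16.09, 14.15, 23.63, -4.62, 5.23]

print(f"avg accuracy gain: {sum(acc_increment) / len(acc_increment):.2f}%")  # 4.62%
print(f"avg macro-F1 gain: {sum(f1_increment) / len(f1_increment):.2f}%")    # 6.44%
```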

Fig. 12. Comparison in terms of accuracy.

Zhao et al. [37] presented an attention-based temporal convolutional network (TCN) and fused it with a mechanism to encode the ECG. This model showed good results in an intra-patient scheme, with 99.84% accuracy, whereas accuracy degraded to only 87.81% in the inter-patient scenario; it predicts well for specific patients but loses performance on others. Overall, its performance was 11.33% lower in terms of accuracy and 5.23% lower in terms of F1 score.

Yan et al. [34] used a transformer encoder and combined handcrafted features with transformer-based features for better performance. The model categorized 4 classes of arrhythmias; however, per-class sensitivity and precision were low, causing a low F1 score of 88.73%. To reduce the number of model parameters, Meng et al. [35] proposed a lightweight transformer. Its accuracy of 99.32% was good; however, only 3 types of arrhythmia were classified.

Che et al. [36] combined a CNN and a transformer and obtained 83.66% accuracy and a 78.60% F1 score. They used 7 convolution layers followed by a transformer encoder, along with a link constraint integrated into the loss function. The idea of combining a CNN with a transformer encoder was sound; however, the model performance was poor. Yixuan et al. [54] proposed the Spatial–Temporal Conv-Transformer (STCT) to capture spatial and temporal information with a CNN and a transformer encoder, respectively. The CNN part is inspired by the VGGNet architecture, and multiple convolution layers were integrated in place of the FFN to update the transformer encoder. Xia et al. [49] combined a CNN and a denoising autoencoder (DAE) as well as R-R features and fed these into a transformer encoder for 5-class arrhythmia classification. The model achieved 97.66% accuracy.
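Several of the surveyed models, like CAT-Net itself, follow the same backbone pattern: a 1-D convolutional front end for local beat morphology followed by a transformer encoder whose multi-head self-attention captures global context. The PyTorch sketch below illustrates only this generic pattern; the layer counts, kernel sizes, beat length, and attention configuration are illustrative assumptions and do not reproduce CAT-Net or any of the cited architectures (the authors' actual implementation is available in their repository).

```python
# Generic CNN + transformer-encoder classifier for single-lead ECG beats.
# Hypothetical configuration (260-sample beats, 5 classes); not the CAT-Net setup.
import torch
import torch.nn as nn

class ConvTransformerECG(nn.Module):
    def __init__(self, n_classes=5, d_model=64, n_heads=4):
        super().__init__()
        # Local morphology: two 1-D convolution blocks with downsampling.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, d_model, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        # Global context: multi-head self-attention over the convolutional feature sequence.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                        # x: (batch, 1, beat_length)
        feats = self.cnn(x).transpose(1, 2)      # (batch, seq_len, d_model)
        ctx = self.encoder(feats).mean(dim=1)    # pool the contextualized sequence
        return self.head(ctx)                    # class logits

model = ConvTransformerECG()
logits = model(torch.randn(8, 1, 260))           # a batch of 8 dummy beats
print(logits.shape)                              # torch.Size([8, 5])
```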


Fig. 13. Comparison in terms of F1 score.

In comparison to the aforementioned models, the proposed CAT-Net achieves the highest accuracy: 99.14% for 5-class classification and 99.58% for 3-class classification. Only a few models [50,52,54] exhibit a higher F1 score than ours; however, those models have lower accuracy and require either 2 or 12 ECG leads. Our model shows a 4.62% average increase in overall accuracy and a 6.44% average increase in macro F1 score, as shown in columns 7 and 9 of Table 13. Overall, the high accuracy and balanced per-class F1 scores establish the state-of-the-art standing of our model relative to existing methods. An additional important finding of this study is that single-lead ECG data is sufficient for the arrhythmia classification task, which opens the path to adopting machine learning algorithms in mobile and IoT applications that rely only on single-lead ECG recordings.

3.10. Limitations and recommendations

A limitation of the proposed model is that it requires R-peak-annotated ECG signals as input. Thus, for real-life applications, an end-to-end machine learning model that includes an R-peak detection step may be necessary. Although this study focused on the challenge of developing a robust and effective model with only single-lead ECG data, additional research challenges of high interest remain in the arrhythmia classification field. First, optimal features from various neural networks equipped with proper attention mechanisms may be fused. Second, IoT- and wearable-technology-based, patient-specific, end-to-end models are needed. Third, the major obstacle posed by the lack of annotated and balanced ECG arrhythmia datasets needs to be addressed; in particular, a larger volume of 'F'-class ECG signals should be recorded to improve the predictive performance for this class.

4. Conclusions

The study proposed a novel CAT-Net model to predict five distinct arrhythmia classes using single-lead ECG signals. By leveraging the capabilities of CNN, channel attention, and transformer encoder architectures, both local and global ECG features were incorporated into the model. It was found that SMOTE-Tomek represents the optimal class balancing approach for ECG data. The model's robustness was validated across two distinct datasets, on which it exhibits consistent performance. Using only 1-lead ECG data, the proposed model achieved 99.14% accuracy and a 94.69% F1 score, which represents a remarkable performance. This study corroborates that 1-lead ECG data is adequate to predict arrhythmias provided that the model is properly designed. The proposed model can therefore be used in IoT-based and mobile arrhythmia diagnosis systems, since it operates in real time on patients' ECG data captured by a single sensing device. To facilitate its further implementation and utilization, the comprehensive source code of the CAT-Net model is made accessible through the link https://github.com/rabiul-ai/Arrhythmia_Classification.git.

CRediT authorship contribution statement

Md Rabiul Islam: Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing. Marwa Qaraqe: Investigation, Validation, Writing – review & editing. Khalid Qaraqe: Formal analysis, Investigation. Erchin Serpedin: Formal analysis, Investigation, Supervision, Validation, Writing – review & editing.

Declaration of competing interest

We do not have any conflict of interest.

Data availability

Both the MIT-BIH and INCART datasets are publicly available on the PhysioNet website.

Acknowledgment

This work was supported by the Texas A&M University at Qatar (TAMUQ) Research Initiative. Open Access funding was provided by the Qatar National Library.

References

[1] World Health Organization. URL https://www.who.int/health-topics/cardiovascular-diseases. (Accessed 8 August 2023).
[2] E.P.M. Karregat, J.C.L. Himmelreich, W.A.M. Lucassen, W.B. Busschers, H.C.P.M. van Weert, R.E. Harskamp, Evaluation of general practitioners' single-lead electrocardiogram interpretation skills: A case-vignette study, Fam. Pract. 38 (2) (2021) 70–75.
[3] R.G. Afkhami, G. Azarnia, M.A. Tinati, Cardiac arrhythmia classification using statistical and mixture modeling features of ECG signals, Pattern Recognit. Lett. 70 (2016) 45–51.
[4] A.F. Khalaf, M.I. Owis, I.A. Yassine, A novel technique for cardiac arrhythmia classification using spectral correlation and support vector machines, Expert Syst. Appl. 42 (21) (2015) 8361–8368.
[5] M. Kropf, D. Hayn, G. Schreier, ECG classification based on time and frequency domain features using random forests, in: 2017 Computing in Cardiology, CinC, IEEE, 2017, pp. 1–4.
[6] K.N.V.P.S. Rajesh, R. Dhuli, Classification of imbalanced ECG beats using resampling techniques and AdaBoost ensemble classifier, Biomed. Signal Process. Control 41 (2018) 242–254.
[7] H. Shi, H. Wang, Y. Huang, L. Zhao, C. Qin, C. Liu, A hierarchical method based on weighted extreme gradient boosting in ECG heartbeat classification, Comput. Methods Programs Biomed. 171 (2019) 1–10.


[8] N.K. Dewangan, S.P. Shukla, ECG arrhythmia classification using discrete wavelet transform and artificial neural network, in: 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology, RTEICT, IEEE, 2016, pp. 1892–1896.
[9] S.-N. Yu, Y.-H. Chen, Electrocardiogram beat classification based on wavelet transformation and probabilistic neural network, Pattern Recognit. Lett. 28 (10) (2007) 1142–1150.
[10] Z. Zhang, J. Dong, X. Luo, K.-S. Choi, X. Wu, Heartbeat classification using disease-specific feature selection, Comput. Biol. Med. 46 (2014) 79–89.
[11] E.B. Mazomenos, D. Biswas, A. Acharyya, T. Chen, K. Maharatna, J. Rosengarten, J. Morgan, N. Curzen, A low-complexity ECG feature extraction algorithm for mobile healthcare applications, IEEE J. Biomed. Health Inf. 17 (2) (2013) 459–469.
[12] M. Lagerholm, C. Peterson, G. Braccini, L. Edenbrandt, L. Sornmo, Clustering ECG complexes using Hermite functions and self-organizing maps, IEEE Trans. Biomed. Eng. 47 (7) (2000) 838–848.
[13] P. Melin, J. Amezcua, F. Valdez, O. Castillo, A new neural network model based on the LVQ algorithm for multi-class classification of arrhythmias, Inform. Sci. 279 (2014) 483–497.
[14] I. Christov, G. Gómez-Herrero, V. Krasteva, I. Jekova, A. Gotchev, K. Egiazarian, Comparative study of morphological and time-frequency ECG descriptors for heartbeat classification, Med. Eng. Phys. 28 (9) (2006) 876–887.
[15] U.R. Acharya, H. Fujita, M. Adam, O.S. Lih, T.J. Hong, V.K. Sudarshan, J.E.W. Koh, Automated characterization of arrhythmias using nonlinear features from tachycardia ECG beats, in: 2016 IEEE International Conference on Systems, Man, and Cybernetics, SMC, IEEE, 2016, pp. 000533–000538.
[16] S. Kiranyaz, T. Ince, M. Gabbouj, Real-time patient-specific ECG classification by 1-D convolutional neural networks, IEEE Trans. Biomed. Eng. 63 (3) (2015) 664–675.
[17] U.R. Acharya, et al., Application of higher-order spectra for the characterization of coronary artery disease using electrocardiogram signals, Biomed. Signal Process. Control 31 (2017) 31–43.
[18] P.M. Tripathi, A. Kumar, M. Kumar, R. Komaragiri, Multilevel classification and detection of cardiac arrhythmias with high-resolution superlet transform and deep convolution neural network, IEEE Trans. Instrum. Meas. 71 (2022) 1–13.
[19] M. Hammad, A.M. Iliyasu, A. Subasi, E.S.L. Ho, A.A. Abd El-Latif, A multitier deep learning model for arrhythmia detection, IEEE Trans. Instrum. Meas. 70 (2020) 1–9.
[20] O. Faust, Y. Hagiwara, T.J. Hong, O.S. Lih, U.R. Acharya, Deep learning for healthcare applications based on physiological signals: A review, Comput. Methods Programs Biomed. 161 (2018) 1–13.
[21] P. Rajpurkar, A.Y. Hannun, M. Haghpanahi, C. Bourn, A.Y. Ng, Cardiologist-level arrhythmia detection with convolutional neural networks, 2017, arXiv preprint arXiv:1707.01836.
[22] J.H. Tan, Y. Hagiwara, W. Pang, I. Lim, S.L. Oh, M. Adam, R. San Tan, M. Chen, U.R. Acharya, Application of stacked convolutional and long short-term memory network for accurate identification of CAD ECG signals, Comput. Biol. Med. 94 (2018) 19–26.
[23] S.L. Oh, E.Y.K. Ng, R. San Tan, U.R. Acharya, Automated diagnosis of arrhythmia using combination of CNN and LSTM techniques with variable length heart beats, Comput. Biol. Med. 102 (2018) 278–287.
[24] Q. Yao, X. Fan, Y. Cai, R. Wang, L. Yin, Y. Li, Time-incremental convolutional neural network for arrhythmia detection in varied-length electrocardiogram, in: IEEE 16th Intl. Conf. on Dependable, Autonomic and Secure Computing, 16th Intl. Conf. on Pervasive Intelligence and Computing, 4th Intl. Conf. on Big Data Intelligence and Computing and Cyber Science and Technology Congress, DASC/PiCom/DataCom/CyberSciTech, IEEE, 2018, pp. 754–761.
[25] E. Prabhakararao, S. Dandapat, Myocardial infarction severity stages classification from ECG signals using attentional recurrent neural network, IEEE Sens. J. 20 (15) (2020) 8711–8720.
[26] P. Singh, A. Sharma, Attention-based convolutional denoising autoencoder for two-lead ECG denoising and arrhythmia classification, IEEE Trans. Instrum. Meas. 71 (2022) 1–10, Art. no. 4007710.
[27] Q. Yao, R. Wang, X. Fan, J. Liu, Y. Li, Multi-class arrhythmia detection from 12-lead varied-length ECG using attention-based time-incremental convolutional neural network, Inf. Fusion 53 (2020) 174–182.
[28] J. Zhang, A. Liu, M. Gao, X. Chen, X. Zhang, X. Chen, ECG-based multi-class arrhythmia detection using spatio-temporal attention-based convolutional recurrent neural network, Artif. Intell. Med. 106 (2020) 101856.
[29] W. Ullah, I. Siddique, R.M. Zulqarnain, M.M. Alam, I. Ahmad, U.A. Raza, Classification of arrhythmia in heartbeat detection using deep learning, Comput. Intell. Neurosci. 2021 (2021) 1–13, Art. ID 2195922.
[30] Y. Jin, J. Liu, Y. Liu, C. Qin, Z. Li, D. Xiao, L. Zhao, C. Liu, A novel interpretable method based on dual-level attentional deep neural network for actual multilabel arrhythmia detection, IEEE Trans. Instrum. Meas. 71 (2021) 1–11, Art. no. 2500311.
[31] E. Prabhakararao, S. Dandapat, Multi-scale convolutional neural network ensemble for multi-class arrhythmia classification, IEEE J. Biomed. Health Inf. 26 (8) (2021) 3802–3812.
[32] X. Peng, W. Shu, C. Pan, Z. Ke, H. Zhu, X. Zhou, W.W. Song, DSCSSA: A classification framework for spatiotemporal features extraction of arrhythmia based on the Seq2Seq model with attention mechanism, IEEE Trans. Instrum. Meas. 71 (2022) 1–12, Art. no. 2515112.
[33] R. Hu, J. Chen, L. Zhou, A transformer-based deep neural network for arrhythmia detection using continuous ECG signals, Comput. Biol. Med. 144 (May 2022) 105325.
[34] G. Yan, S. Liang, Y. Zhang, F. Liu, Fusing transformer model with temporal features for ECG heartbeat classification, in: 2019 IEEE International Conference on Bioinformatics and Biomedicine, BIBM, IEEE, 2019, pp. 898–905.
[35] L. Meng, W. Tan, J. Ma, R. Wang, X. Yin, Y. Zhang, Enhancing dynamic ECG heartbeat classification with lightweight transformer model, Artif. Intell. Med. 124 (2022) 102236.
[36] C. Che, P. Zhang, M. Zhu, Y. Qu, B. Jin, Constrained transformer network for ECG signal processing and arrhythmia classification, BMC Med. Inform. Decis. Mak. 21 (184) (2021) 1–13.
[37] Y. Zhao, J. Ren, B. Zhang, J. Wu, Y. Lyu, An explainable attention-based TCN heartbeats classification model for arrhythmia detection, Biomed. Signal Process. Control 80, Part 1 (2023) 104337.
[38] S.M. Mathews, C. Kambhamettu, K.E. Barner, A novel application of deep learning for single-lead ECG classification, Comput. Biol. Med. 99 (2018) 53–62.
[39] ANSI/AAMI, Testing and reporting performance results of cardiac rhythm and ST segment measurement algorithms, Assoc. Adv. Med. Instrum. (1998) document ANSI/AAMI/ISO EC57, 1998-(R), American National Standard Institute, Inc. (ANSI), 2008.
[40] G.B. Moody, R.G. Mark, The impact of the MIT-BIH arrhythmia database, IEEE Eng. Med. Biol. Mag. 20 (3) (2001) 45–50.
[41] St Petersburg INCART 12-lead Arrhythmia Database. URL https://physionet.org/content/incartdb/1.0.0/. (Accessed 3 July 2023).
[42] L. Sathyapriya, L. Murali, T. Manigandan, Analysis and detection R-peak detection using modified Pan-Tompkins algorithm, in: 2014 IEEE International Conference on Advanced Communications, Control and Computing Technologies, IEEE, 2014, pp. 483–487.
[43] G.E. Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. 6 (1) (2004) 20–29.
[44] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, J. Artificial Intelligence Res. 16 (2002) 321–357.
[45] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks, IEEE World Congress on Computational Intelligence, IEEE, 2008, pp. 1322–1328.
[46] S. Woo, J. Park, J.-Y. Lee, I.S. Kweon, CBAM: Convolutional block attention module, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 3–19.
[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017).
[48] T.H. Rafi, Y.W. Ko, HeartNet: Self multihead attention mechanism via convolutional network with adversarial data synthesis for ECG-based arrhythmia classification, IEEE Access 10 (2022) 100501–100512.
[49] Y. Xia, Y. Xiong, K. Wang, A transformer model blended with CNN and denoising autoencoder for inter-patient ECG arrhythmia classification, Biomed. Signal Process. Control 86 (2023) 105271.
[50] J. Wang, X. Qiao, C. Liu, X. Wang, Y. Liu, L. Yao, H. Zhang, Automated ECG classification using a non-local convolutional block attention module, Comput. Methods Programs Biomed. 203 (May 2021) 106006.
[51] P. Gopika, V. Sowmya, E. Gopalakrishnan, K. Soman, Transferable approach for cardiac disease classification using deep learning, in: Deep Learning Techniques for Biomedical and Health Informatics, Elsevier, 2020, pp. 285–303.
[52] X. Xu, S. Jeong, J. Li, Interpretation of electrocardiogram (ECG) rhythm by combined CNN and BiLSTM, IEEE Access 8 (2020) 125380–125388.
[53] E. Essa, X. Xie, An ensemble of deep learning-based multi-model for ECG heartbeats arrhythmia classification, IEEE Access 9 (2021) 103452–103464.
[54] Y. Qiu, W. Chen, L. Yue, M. Xu, B. Zhu, STCT: Spatial–temporal conv-transformer network for cardiac arrhythmias recognition, in: International Conference on Advanced Data Mining and Applications, Springer, 2022, pp. 86–100.
[55] Z. Cai, C. Liu, H. Gao, X. Wang, L. Zhao, Q. Shen, E.Y.K. Ng, J. Li, An open-access long-term wearable ECG database for premature ventricular contractions and supraventricular premature beat detection, J. Med. Imag. Health Inform. 10 (11) (2020) 2663–2667.
[56] C. Venkatesan, P. Karthigaikumar, R. Varatharajan, A novel LMS algorithm for ECG signal preprocessing and KNN classifier based abnormality detection, Multimedia Tools Appl. 77 (2018) 10365–10374.
[57] R.J. Martis, U.R. Acharya, K.M. Mandana, A.K. Ray, C. Chakraborty, Application of principal component analysis to ECG signals for automated diagnosis of cardiac health, Expert Syst. Appl. 39 (14) (2012) 11792–11800.
[58] J. Park, K. Lee, K. Kang, Arrhythmia detection from heartbeat using k-nearest neighbor classifier, in: 2013 IEEE International Conference on Bioinformatics and Biomedicine, IEEE, 2013, pp. 15–22.
