
Multimodal Gated Fusion Framework for Cardiovascular Disease Prediction Using ECG and EHR Data

Jothi Prakash V., Thiruselvan S., Chandranath S., Vikeshkumar M., and Ajay Meenatchi Sundaram M.

Department of Information Technology, Karpagam College of Engineering, Myleripalayam Village, Coimbatore 641032, Tamil Nadu, India.
[email protected]

Abstract. Cardiovascular diseases (CVD) are among the leading causes of mortality worldwide, necessitating accurate and reliable predictive
models to improve clinical decision-making. Existing approaches often
focus on single-modality data, such as textual electronic health records
(EHR) or visual electrocardiogram (ECG) signals, limiting their abil-
ity to capture complementary information across modalities. To address
this, we propose the Multimodal Gated Fusion Framework (MGFF), a
novel method that integrates textual EHR data and visual ECG signals
through a gated fusion mechanism, leveraging the ViLT transformer for
robust multimodal feature alignment and classification. The framework
was evaluated on multimodal datasets, achieving an accuracy of 91.2%,
an F1-score of 0.91, and a ROC-AUC of 0.94, significantly outperforming
advanced baselines such as MedFuse and VisualBERT. Extensive experi-
ments, including ablation studies and calibration analysis, demonstrated
the importance of the gated fusion mechanism and the reliability of the
predicted probabilities. While the results are promising, limitations such
as robustness to noisy data and computational efficiency highlight areas
for future improvement. The proposed MGFF provides a reliable, accu-
rate, and scalable solution for CVD prediction, emphasizing the potential
of multimodal approaches in advancing healthcare analytics. Future work
will focus on enhancing robustness, generalizability, and interpretability
for broader clinical adoption.

Keywords: Multimodal Fusion, Cardiovascular Disease Prediction, Deep Learning, Gated Fusion Mechanism, Transformer Models

1 Introduction
Cardiovascular diseases (CVD) [7, 8] remain one of the leading causes of mor-
tality globally, accounting for millions of deaths annually. Early detection and
accurate prediction of CVD are critical for effective clinical intervention and im-
proving patient outcomes. Traditional prediction methods often rely on single-
modality data, such as textual electronic health records (EHR) or visual elec-
trocardiogram (ECG) signals [8, 11]. While these approaches provide valuable
insights, they fail to exploit the complementary nature of multimodal data, which
can offer a more comprehensive understanding of a patient’s condition. This lim-
itation highlights the need for advanced frameworks that integrate diverse data
modalities to enhance predictive accuracy and reliability [11]. Recent advance-
ments in multimodal learning have demonstrated the potential of combining het-
erogeneous data sources for improved performance in various domains, including
healthcare. However, many existing multimodal frameworks face challenges such
as ineffective feature alignment, limited generalizability, and computational in-
efficiency [4, 6]. These limitations motivate the development of a robust and
efficient framework tailored to the specific needs of CVD prediction. In this re-
search, we propose the Multimodal Gated Fusion Framework (MGFF), a novel
approach that integrates textual EHR data and visual ECG signals using a
gated fusion mechanism and a ViLT-based transformer for feature refinement
and classification. The key contributions of this work are as follows:
– We introduce MGFF, a unified framework that leverages gated fusion for
effective multimodal feature integration and ViLT for robust feature align-
ment.
– We demonstrate the superior performance of MGFF on multimodal datasets,
achieving state-of-the-art results with an accuracy of 91.2%, F1-score of 0.91,
and ROC-AUC of 0.94.
– We conduct extensive experiments, including ablation studies, calibration
analysis, and multimodal fusion efficiency evaluation, to validate the effec-
tiveness of the proposed framework.

2 Related Works
The task of cardiovascular disease (CVD) prediction has garnered significant
attention in recent years, with advancements in machine learning enabling the
development of predictive models across various data modalities. Existing works
can be broadly categorized into single-modality approaches and multimodal
frameworks.

2.1 Single-Modality Approaches


Single-modality models focus on leveraging either visual ECG signals or textual
EHR data. Convolutional Neural Networks (CNNs) [11, 10] have been widely
used for ECG analysis due to their ability to capture temporal and spectral
features from raw signals. For instance, ECG-only models have demonstrated
promising results in arrhythmia detection and CVD classification, but their re-
liance on visual features alone limits their ability to incorporate broader clinical
context. Similarly, Deep Neural Networks (DNNs)[8, 1] have been applied to
EHR data, utilizing structured clinical features such as demographics, lab re-
sults, and diagnoses. However, EHR-only models often suffer from insufficient
representation of physiological patterns, which are crucial for comprehensive
CVD prediction [8, 11, 10].

2.2 Multimodal Frameworks

Multimodal frameworks aim to integrate multiple data modalities to address the limitations of single-modality approaches [7, 8]. Simple fusion techniques,
such as feature concatenation [7, 5], have been explored to combine ECG and
EHR data, but these methods often fail to exploit the complementary nature
of the modalities effectively. Advanced attention-based models like MedFuse [9]
have introduced modality-specific encoders and attention mechanisms for fea-
ture alignment, demonstrating improved performance in healthcare applications.
Similarly, VisualBERT [12], originally developed for vision-and-language tasks,
has been adapted for integrating textual and visual healthcare data. While these
models achieve competitive results, challenges such as ineffective feature align-
ment and computational overhead persist. Despite the advancements in multi-
modal learning, several limitations remain. Many frameworks rely on simplis-
tic fusion mechanisms that do not effectively capture the complex interdepen-
dencies between modalities. Additionally, attention-based models like MedFuse
and transformer-based architectures like VisualBERT, while powerful, often re-
quire extensive computational resources, limiting their applicability in resource-
constrained environments [8, 6, 1]. Furthermore, existing methods lack robust
calibration, which is critical for clinical applications requiring reliable probability
estimates.

2.3 Research Gaps

Despite significant advancements in machine learning for cardiovascular disease (CVD) prediction, several critical gaps remain in existing methodologies. Single-
modality models, such as those relying solely on ECG or EHR data, fail to
leverage the complementary nature of multimodal information, limiting their
predictive accuracy. While multimodal frameworks like MedFuse and Visual-
BERT have shown promise, they often employ simplistic fusion mechanisms or
attention-based architectures that lack efficient feature alignment and integra-
tion. Additionally, these models are computationally intensive, posing challenges
for deployment in resource-constrained environments. Furthermore, existing ap-
proaches often overlook the importance of model calibration, which is essential
for producing reliable probability estimates in clinical decision-making. Lastly,
the interpretability of current models remains inadequate, making it difficult for
clinicians to trust or act on predictions. Addressing these gaps necessitates the
development of robust, efficient, and interpretable multimodal frameworks that
can effectively integrate diverse data modalities while maintaining reliability and
scalability.

3 Methodology

The proposed Multimodal Gated Fusion Framework (MGFF) aims to predict cardiovascular diseases (CVD) by integrating textual (EHR) and visual (ECG) data. The framework leverages advanced feature extraction techniques, a gated fusion mechanism for effective multimodal alignment, and a transformer-based classification module to ensure robust and accurate predictions. The key components of the framework are illustrated in Figure 1 and detailed below.

Fig. 1: Overview of the MGFF Framework.

3.1 Datasets
The ECG [3] and MIT-BIH [2] datasets, designed for multimodal analysis, com-
prise a total of 22,225 samples with distinct features and classes. The ECG
dataset includes 333 samples with 21 unique features, while the MIT-BIH dataset
contains 21,892 samples with 188 unique features. Both datasets are categorized
into two classes, representing normal and disease states, with 215 normal sam-
ples and 118 disease samples in each dataset. The multimodal nature of the
data incorporates both textual clinical records and visual ECG signals, ensur-
ing comprehensive representation. Preprocessing techniques, including normal-
ization, tokenization, and augmentation, were applied to maintain data quality
and enhance model compatibility. Normalization was applied to ensure consis-
tent data ranges for all features, while tokenization converted textual EHR data
into embeddings compatible with the model. Augmentation techniques, such as flipping and scaling, were applied to ECG data to improve model generalization.
Table 1 summarizes the dataset statistics.

Table 1: CVD Dataset Statistics


Statistic ECG Dataset MIT-BIH Dataset
Total Samples 333 21,892
Unique Features 21 188
Normal Samples 215 215
Disease Samples 118 118
Data Types Visual Textual
Multi-modal Representation Yes Yes
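
The preprocessing described above (normalization of feature ranges and flip/scale augmentation of ECG signals) might be implemented along these lines. This is a minimal sketch: the function names and the min-max formulation are illustrative assumptions, and tokenization of the textual EHR fields is omitted since the tokenizer is not specified in the paper.

```python
import numpy as np

def normalize_features(x: np.ndarray) -> np.ndarray:
    """Min-max normalization per feature column (one reading of the
    'consistent data ranges' requirement)."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min + 1e-8)

def augment_ecg(signal: np.ndarray, scale_range=(0.9, 1.1)) -> np.ndarray:
    """Time-reversal flip plus random amplitude scaling, one interpretation
    of the 'flipping and scaling' augmentation mentioned above."""
    flipped = signal[::-1].copy()
    scale = np.random.uniform(*scale_range)
    return flipped * scale
```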

3.2 Feature Extraction

Feature extraction is a crucial step in the proposed Multimodal Fusion Framework. Both ECG and EHR data are independently processed to represent their
respective modalities. The extracted features are subsequently aligned and fused
using a gated mechanism for the classification task.

ECG Feature Extraction The visual features of the ECG dataset are ex-
tracted using a Convolutional Neural Network (CNN)-based encoder. Given an
input ECG signal E, the CNN processes the signal through a series of convolu-
tional layers, activation functions, and pooling operations to generate a feature
map Fe ∈ Rde , where de is the dimensionality of the ECG feature vector. The
feature extraction process is mathematically expressed as:

Fe = CNN(E), (1)

where Fe captures the temporal and spectral information inherent in the ECG
signal.
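
The paper does not detail the CNN encoder's architecture; the following PyTorch sketch shows one plausible 1-D convolutional encoder for Eq. (1), with channel counts, kernel sizes, and the output dimension de chosen purely for illustration.

```python
import torch
import torch.nn as nn

class ECGEncoder(nn.Module):
    """Minimal CNN encoder mapping a single-lead ECG signal E to F_e (Eq. 1).
    Channel counts, kernel sizes, and d_e are illustrative assumptions."""
    def __init__(self, d_e: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1),  # collapse the time dimension
        )
        self.proj = nn.Linear(64, d_e)

    def forward(self, ecg: torch.Tensor) -> torch.Tensor:
        # ecg: (batch, 1, length) -> F_e: (batch, d_e)
        return self.proj(self.conv(ecg).squeeze(-1))
```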

EHR Feature Extraction Structured features from the EHR dataset are
derived using a Deep Neural Network (DNN)-based encoder. Given an input
EHR data H, the DNN processes the data through multiple fully connected
layers to extract meaningful features, represented as Fh ∈ Rdh , where dh is
the dimensionality of the EHR feature vector. The process is mathematically
represented as:
Fh = DNN(H), (2)
where Fh encapsulates the patient’s health history, including diagnostic and
demographic information.
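
Similarly, a minimal fully connected encoder for Eq. (2) could look as follows; the hidden-layer sizes and dh are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class EHREncoder(nn.Module):
    """Minimal DNN encoder mapping structured EHR features H to F_h (Eq. 2).
    Hidden sizes and d_h are illustrative assumptions."""
    def __init__(self, n_features: int, d_h: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, d_h), nn.ReLU(),
        )

    def forward(self, ehr: torch.Tensor) -> torch.Tensor:
        # ehr: (batch, n_features) -> F_h: (batch, d_h)
        return self.mlp(ehr)
```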

Multimodal Feature Fusion The visual features Fe and structured features Fh are aligned and integrated through a gated fusion mechanism, which applies learnable gates to control the contribution of each modality to the fused feature vector, so that the most relevant information from both modalities is emphasized during classification. The fused multimodal feature vector Fm ∈ Rdm is
computed as:
Fm = GatedFusion(Fe , Fh ), (3)

where Fm represents the integrated features from both ECG and EHR data,
which are subsequently utilized for classification.
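
The exact form of GatedFusion(·) in Eq. (3) is not spelled out in the paper; the sketch below uses a common sigmoid-gate formulation, g = σ(Wg[Fe; Fh]) and Fm = g ⊙ WeFe + (1 − g) ⊙ WhFh, which is one reasonable reading of "learnable gates controlling the contribution of each modality".

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sigmoid-gated fusion of F_e and F_h into F_m (one reading of Eq. 3):
    g = sigmoid(W_g [F_e; F_h]); F_m = g * W_e F_e + (1 - g) * W_h F_h."""
    def __init__(self, d_e: int, d_h: int, d_m: int = 128):
        super().__init__()
        self.proj_e = nn.Linear(d_e, d_m)
        self.proj_h = nn.Linear(d_h, d_m)
        self.gate = nn.Sequential(nn.Linear(d_e + d_h, d_m), nn.Sigmoid())

    def forward(self, f_e: torch.Tensor, f_h: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([f_e, f_h], dim=-1))  # per-dimension learnable gate
        return g * self.proj_e(f_e) + (1.0 - g) * self.proj_h(f_h)  # F_m
```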

3.3 Classification

The fused multimodal feature vector Fm , obtained through the gated fusion
mechanism, is fed into the classification layer to predict the likelihood of car-
diovascular disease (CVD). This layer is designed to refine and classify the mul-
timodal representation effectively, leveraging the ViLT (Vision-and-Language
Transformer) model. ViLT was chosen for its efficiency in processing multimodal
data by directly operating on aligned textual and visual embeddings without re-
quiring heavy pretraining on paired datasets.

Linear Projection The fused feature vector Fm ∈ Rdm is first transformed into a lower-dimensional space compatible with the ViLT architecture. This is
achieved through a linear projection layer:

Fp = Wp Fm + bp , (4)

where Wp ∈ Rdt ×dm is the projection matrix, bp ∈ Rdt is the bias vector, and
dt represents the dimensionality of the projected feature vector Fp .

Transformer-based Feature Refinement The projected features Fp are further refined using the Vision-and-Language Transformer (ViLT). ViLT utilizes a
self-attention mechanism to capture interdependencies and higher-order relation-
ships between the ECG (visual) and EHR (textual) modalities. This refinement
process is represented as:
Fr = ViLT(Fp ), (5)

where Fr ∈ Rdt denotes the refined feature vector. ViLT ensures that the fused
representation incorporates both spatial and semantic alignment across modali-
ties.
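
As an illustration of Eqs. (4) and (5), the sketch below applies a linear projection followed by self-attention refinement. A generic PyTorch TransformerEncoder is used here as a stand-in for the actual ViLT backbone, since the paper does not describe how the fused vector is passed to ViLT; the dimensions and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class FusionRefiner(nn.Module):
    """Stand-in for Eqs. (4)-(5): linear projection of F_m to F_p, then
    self-attention refinement to F_r. A generic TransformerEncoder replaces
    the actual ViLT backbone here, purely for illustration."""
    def __init__(self, d_m: int = 128, d_t: int = 64, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.project = nn.Linear(d_m, d_t)  # Eq. (4)
        layer = nn.TransformerEncoderLayer(d_model=d_t, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f_m: torch.Tensor) -> torch.Tensor:
        f_p = self.project(f_m)                           # (batch, d_t)
        return self.encoder(f_p.unsqueeze(1)).squeeze(1)  # F_r, Eq. (5)
```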

MLP Head and Prediction The refined feature vector Fr is passed through a
Multi-Layer Perceptron (MLP) head for classification. The MLP consists of one
or more fully connected layers and a softmax function, which outputs the proba-
bility distribution over the classes. The classification probabilities are computed
as:
ŷ = Softmax(Wo Fr + bo ), (6)

where Wo ∈ RC×dt is the weight matrix, bo ∈ RC is the bias vector, and C is the number of classes (e.g., normal and disease states).

Loss Function The model is trained using a cross-entropy loss function, which
quantifies the difference between the predicted probability distribution ŷ and the
true labels y. The cross-entropy loss function was chosen as it is well-suited for
binary classification tasks, effectively penalizing incorrect predictions based on
the predicted probability distribution. The loss function is defined as:

L = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(ŷ_{i,c}),     (7)

where N is the number of samples, yi,c is the ground-truth label for class c of
sample i, and ŷi,c is the predicted probability for class c. The final output of the
classification layer is the predicted class label:

Class = arg max_c (ŷ_c),     (8)

where ŷc is the predicted probability for class c. The class with the highest
probability is selected as the final prediction, indicating whether the sample
belongs to the normal or disease category.
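
A compact sketch of the classification stage follows, combining the MLP head of Eq. (6), the cross-entropy loss of Eq. (7) (via PyTorch's CrossEntropyLoss applied to logits), and the arg-max prediction of Eq. (8); the layer widths are assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """MLP head mapping refined features F_r to class logits (Eq. 6);
    the hidden width is an assumption."""
    def __init__(self, d_t: int = 64, n_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_t, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, f_r: torch.Tensor) -> torch.Tensor:
        return self.mlp(f_r)  # logits W_o F_r + b_o

head = ClassificationHead()
criterion = nn.CrossEntropyLoss()          # cross-entropy loss of Eq. (7), on logits
logits = head(torch.randn(8, 64))          # dummy batch of refined features
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
probs = torch.softmax(logits, dim=-1)      # ŷ, Eq. (6)
pred_class = probs.argmax(dim=-1)          # Eq. (8)
```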

4 Experimental Evaluation

The experiments were conducted using the ECG and MIT-BIH datasets, with
an 80%-10%-10% split for training, validation, and testing, respec-
tively. All data preprocessing steps, including normalization for ECG signals and
tokenization for EHR records, were applied to ensure consistency and compati-
bility with the proposed framework. The model was implemented using Python
3.8 with PyTorch 1.12.0 and trained on an NVIDIA V100 GPU with 32GB of
memory. The Adam optimizer was used with an initial learning rate of 0.001,
a batch size of 64, and training was performed for 50 epochs. Hyperparameter
tuning was carried out using grid search to optimize model performance. Ad-
ditionally, 5-fold cross-validation was employed to evaluate the robustness and
generalizability of the model.
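
For reference, a minimal training loop matching the reported settings (Adam, learning rate 0.001, batch size 64, 50 epochs) might look as follows; the dataset tensors and the stand-in model are placeholders, not the actual MGFF pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder fused features and labels standing in for the real pipeline.
features, labels = torch.randn(1000, 128), torch.randint(0, 2, (1000,))
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # settings reported in Section 4
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):  # 50 epochs, as reported
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```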

4.1 Evaluation Metrics

The performance of the proposed multimodal fusion framework is evaluated using standard metrics, including accuracy, precision, recall (sensitivity), F1-
score, and ROC-AUC. These metrics provide a comprehensive understanding of
the model’s classification performance, balancing overall correctness, the ability
to detect positive cases, and the trade-off between precision and recall.
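
These metrics can be computed directly with scikit-learn; the snippet below is a minimal example assuming binary labels and predicted probabilities for the disease class (the values shown are toy data).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# y_true: ground-truth labels, y_prob: predicted probability of the disease class
y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.8, 0.6, 0.3, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),  # AUC uses probabilities, not hard labels
}
print(metrics)
```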

4.2 Baseline Models

To evaluate the performance of the proposed multimodal fusion framework, we compare it against five recent and relevant baseline models. The first baseline is a
CNN-based ECG-only model [10] that processes the visual ECG signals indepen-
dently, assessing the contribution of the visual modality in isolation. Similarly,
the second baseline is a DNN-based EHR-only model [1] that utilizes structured
EHR data alone to evaluate the effectiveness of the textual modality. For multi-
modal fusion, we include a simple fusion baseline that combines ECG and EHR
features through feature concatenation without the gated fusion mechanism [5],
providing insights into the importance of the proposed alignment strategy. Addi-
tionally, we compare against MedFuse [9], a recent multimodal fusion framework
specifically designed for integrating clinical text and medical imaging data using
attention-based mechanisms for feature alignment. Finally, VisualBERT [12], a
transformer-based model originally developed for vision-and-language tasks, is
adapted to process ECG signals and EHR data by treating ECG signals as visual
embeddings and EHR data as textual inputs. These baselines allow for a com-
prehensive comparison, showcasing the advantages of the proposed framework
in leveraging both modalities for robust cardiovascular disease prediction.

4.3 Evaluation with Baselines

The effectiveness of the proposed Multimodal Gated Fusion Framework is evaluated against several baseline models using key metrics: accuracy, precision, recall,
F1-score, and ROC-AUC. As shown in Table 2, the proposed framework outper-
forms all baselines, achieving the highest accuracy (91.2%), precision (0.92),
recall (0.90), F1-score (0.91), and ROC-AUC (0.94). Single-modality baselines,
including ECG-only (CNN) and EHR-only (DNN) models, demonstrate lim-
ited performance, highlighting the necessity of multimodal integration. While
the simple fusion model, using feature concatenation, shows moderate improve-
ments, it fails to capture the complementary nature of ECG and EHR data effec-
tively. Advanced baselines like MedFuse and VisualBERT, leveraging attention
and transformer-based architectures, achieve competitive results; however, the
proposed framework surpasses them due to its gated fusion mechanism and the
use of ViLT for precise alignment and feature refinement. These results under-
score the framework’s ability to robustly integrate multimodal data for accurate
cardiovascular disease prediction.

Table 2: Performance Comparison with Baseline Models


Model Accuracy (%) Precision Recall F1-Score ROC-AUC
ECG-only (CNN) 83.5 0.82 0.79 0.81 0.85
EHR-only (DNN) 79.2 0.78 0.76 0.78 0.82
Simple Fusion (Concatenation) 85.3 0.84 0.82 0.83 0.87
MedFuse 87.4 0.87 0.85 0.86 0.89
VisualBERT 88.1 0.88 0.86 0.87 0.90
MGFF (Proposed) 91.2 0.92 0.90 0.91 0.94

4.4 Ablation Study


To evaluate the contributions of individual components in the Multimodal Gated
Fusion Framework (MGFF), an ablation study was conducted. Table 3 presents
the results of removing or modifying key components of the framework. The base-
line MGFF (Proposed) achieves the highest accuracy (91.2%), precision (0.92),
recall (0.90), F1-score (0.91), and ROC-AUC (0.94). Removing the gated fusion
mechanism and replacing it with simple concatenation results in a noticeable
performance drop, with accuracy decreasing to 87.6%, underscoring the gated
mechanism’s importance for effective feature integration. Similarly, replacing the
ViLT transformer with a simpler MLP leads to reduced accuracy (88.4%), pre-
cision (0.88), and F1-score (0.87), demonstrating the transformer’s critical role
in refining multimodal features. Single-modality experiments with ECG or EHR
data alone show significantly lower performance across all metrics, emphasizing
the necessity of multimodal integration. These results validate the design choices
in MGFF and highlight the synergistic effect of gated fusion and transformer-
based refinement in achieving robust cardiovascular disease prediction.

Table 3: Results of Ablation Study


Variant Accuracy (%) Precision Recall F1-Score ROC-AUC
MGFF (Proposed) 91.2 0.92 0.90 0.91 0.94
Without Gated Fusion (Concatenation) 87.6 0.88 0.85 0.86 0.88
Without ViLT (Using MLP) 88.4 0.88 0.86 0.87 0.89
ECG Only (CNN) 83.5 0.82 0.79 0.81 0.85
EHR Only (DNN) 79.2 0.78 0.76 0.78 0.82

4.5 Statistical Analysis


To ensure the reliability and significance of the observed performance improve-
ments, we conducted a statistical analysis comparing the proposed Multimodal
Gated Fusion Framework (MGFF) with baseline models. The statistical signif-
icance of the differences in performance metrics was evaluated using a paired
t-test at a 95% confidence level. Table 4 summarizes the mean and standard
deviation of accuracy, precision, recall, F1-score, and ROC-AUC across 5 runs
for each model, along with the p-values comparing MGFF with each baseline.
The results in Table 4 indicate that the proposed MGFF significantly outper-
forms all baseline models across all metrics, with p-values less than 0.05 for most
comparisons. The low standard deviation of MGFF’s performance metrics across


5 runs demonstrates its stability and robustness. Baseline models like MedFuse
and VisualBERT, while achieving competitive results, show statistically signif-
icant differences when compared to MGFF. The paired t-test results confirm
that the improvements brought by the gated fusion mechanism and ViLT-based
refinement are statistically significant, further validating the effectiveness of the
proposed framework for multimodal cardiovascular disease prediction.

Table 4: Statistical Analysis of Performance Metrics


Model Accuracy (%) Precision Recall F1-Score ROC-AUC p-value
ECG-only (CNN) 83.5 ± 1.2 0.82 ± 0.01 0.79 ± 0.02 0.81 ± 0.01 0.85 ± 0.02 < 0.001
EHR-only (DNN) 79.2 ± 1.5 0.78 ± 0.02 0.76 ± 0.02 0.78 ± 0.02 0.82 ± 0.01 < 0.001
Simple Fusion 85.3 ± 0.9 0.84 ± 0.01 0.82 ± 0.01 0.83 ± 0.01 0.87 ± 0.01 < 0.01
MedFuse 87.4 ± 0.7 0.87 ± 0.01 0.85 ± 0.01 0.86 ± 0.01 0.89 ± 0.01 < 0.05
VisualBERT 88.1 ± 0.8 0.88 ± 0.01 0.86 ± 0.01 0.87 ± 0.01 0.90 ± 0.01 < 0.05
MGFF (Proposed) 91.2 ± 0.6 0.92 ± 0.01 0.90 ± 0.01 0.91 ± 0.01 0.94 ± 0.01 –
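
The paired t-test described above can be run with SciPy; the per-run accuracies below are placeholders for illustration, not the paper's raw results.

```python
from scipy.stats import ttest_rel

# Illustrative per-run accuracies over 5 runs for MGFF and one baseline
# (placeholder values, not the reported measurements).
mgff_acc    = [0.915, 0.908, 0.912, 0.918, 0.907]
medfuse_acc = [0.871, 0.876, 0.869, 0.880, 0.874]

t_stat, p_value = ttest_rel(mgff_acc, medfuse_acc)  # paired t-test across matched runs
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```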

4.6 Multimodal Fusion Efficiency Analysis

To evaluate the efficiency of the proposed gated fusion mechanism in integrating multimodal data, we compared it with alternative fusion strategies, including
early fusion, late fusion, and simple concatenation. Table 5 presents the results
for accuracy, precision, recall, F1-score, and ROC-AUC across these fusion meth-
ods. The proposed gated fusion mechanism achieves the highest performance,
with an accuracy of 91.2%, precision of 0.92, recall of 0.90, F1-score of 0.91,
and ROC-AUC of 0.94. Early fusion, which combines raw data before feature
extraction, performs poorly due to insufficient alignment of modalities, resulting
in an accuracy of 83.7%. Late fusion, which merges predictions from individual
modality-specific models, shows improved results with an accuracy of 86.1%, but
it lacks the synergistic benefits of joint feature representation. Simple concatena-
tion achieves moderate performance with an accuracy of 85.3%, highlighting the
limitations of naive feature integration. These results validate the effectiveness
of the gated fusion mechanism in aligning and integrating multimodal features,
enabling robust cardiovascular disease prediction.

Table 5: Comparison of Fusion Strategies


Fusion Method Accuracy (%) Precision Recall F1-Score ROC-AUC
Early Fusion 83.7 0.81 0.78 0.80 0.84
Late Fusion 86.1 0.85 0.83 0.84 0.87
Simple Concatenation 85.3 0.84 0.82 0.83 0.87
Gated Fusion (Proposed) 91.2 0.92 0.90 0.91 0.94

4.7 Calibration Analysis


To evaluate the reliability of the predicted probabilities, we performed a cali-
bration analysis of the proposed Multimodal Gated Fusion Framework (MGFF).
Calibration measures how closely the predicted probabilities match the actual
outcomes. A well-calibrated model produces probabilities that accurately reflect
the likelihood of a positive prediction. Calibration curves were used for visual
interpretation, as shown in Figure 2, with the diagonal line representing perfect
calibration.
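
A calibration (reliability) curve of the kind shown in Figure 2 can be produced with scikit-learn; the sketch below uses toy labels and probabilities purely for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# y_true: binary labels, y_prob: predicted probability of the disease class (toy data)
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.7, 0.8, 0.6, 0.3, 0.9, 0.4, 0.75, 0.55]

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```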

Fig. 2: Calibration curves for MGFF (Proposed) and baseline models. The
diagonal line represents perfect calibration.

The calibration curves in Figure 2 demonstrate that the proposed MGFF achieves the closest alignment with the diagonal line, indicating superior calibra-
tion compared to the baseline models. Models such as ECG-only and EHR-only
exhibit significant deviations, highlighting lower reliability in predicted probabil-
ities. Advanced baselines like MedFuse and VisualBERT show better alignment
but still fall short of MGFF’s performance. These results validate the ability
of MGFF to produce well-calibrated predictions, enhancing its applicability in
real-world healthcare scenarios where reliable probability estimates are crucial.

5 Limitations
While the proposed Multimodal Gated Fusion Framework (MGFF) demon-
strates significant improvements in cardiovascular disease prediction, it has cer-
tain limitations. First, the model relies on high-quality and well-structured multi-
modal datasets, which may not always be available in real-world clinical settings.
The performance of MGFF may degrade when dealing with noisy or incomplete
data, such as missing ECG signals or incomplete EHR records. Second, the com-
putational complexity of the ViLT-based transformer and gated fusion mecha-
nism requires significant hardware resources, making deployment on low-resource
devices challenging. Third, the framework is tailored to the specific modalities of
ECG and EHR data, limiting its generalizability to other healthcare domains or
multimodal datasets without additional customization. Lastly, while the model
achieves high accuracy and calibration, its interpretability could be further en-
hanced to provide actionable insights for clinicians, such as highlighting specific
contributing features from each modality. Addressing these limitations could
further improve the practical applicability and scalability of MGFF in diverse
clinical scenarios.

6 Conclusion
In this research, we proposed the Multimodal Gated Fusion Framework (MGFF)
for cardiovascular disease prediction, integrating textual EHR data and visual
ECG signals using a gated fusion mechanism and ViLT-based transformer for
feature refinement. The framework demonstrated superior performance across
multiple metrics, achieving an accuracy of 91.2%, an F1-score of 0.91, and
a ROC-AUC of 0.94, outperforming advanced baselines such as MedFuse and
VisualBERT. Extensive analysis, including ablation studies, calibration assess-
ment, and multimodal fusion efficiency evaluation, validated the significance of
the gated fusion mechanism and the transformer’s ability to align multimodal
features effectively. Furthermore, the calibration analysis showed that MGFF
produces well-calibrated predictions, enhancing its reliability for clinical appli-
cations. Despite these promising results, future work will focus on addressing the
limitations of the framework, including improving robustness to noisy and incom-
plete data, reducing computational complexity for deployment on low-resource
devices, and generalizing the approach to other multimodal healthcare datasets.
Additionally, enhancing the interpretability of the framework to provide action-
able insights for clinicians will be prioritized to increase its practical utility in
real-world scenarios.
References

[1] Muhammad Shakeel Akram, Bogaraju Sharatchandra Varma, and Dewar Finlay.
Embedded DNN classifier for five different cardiac diseases. In 2024 35th Irish
Signals and Systems Conference (ISSC), pages 01–06. IEEE, 6 2024.
[2] Akshita Gour, Muktesh Gupta, Rajesh Wadhvani, and Sanyam Shukla. ECG based
heart disease classification: Advancement and review of techniques. Procedia Com-
puter Science, 235:1634–1648, 2024.
[3] Muhammad Salman Haleem, Rossana Castaldo, Silvio Marcello Pagliara, Mario
Petretta, Marco Salvatore, Monica Franzese, and Leandro Pecchia. Time adap-
tive ECG driven cardiovascular disease detector. Biomedical Signal Processing and
Control, 70:102968, 9 2021.
[4] Biyanka Jaltotage, Juan Lu, and Girish Dwivedi. Use of artificial intelligence in-
cluding multimodal systems to improve the management of cardiovascular disease.
Canadian Journal of Cardiology, 40:1804–1812, 10 2024.
[5] Muhammad Umar Khan, Sumair Aziz, Khushbakht Iqtidar, Galila Faisal Zaher,
Shareefa Alghamdi, and Munazza Gull. A two-stage classification model integrat-
ing feature fusion for coronary artery disease detection and classification. Multi-
media Tools and Applications, 81:13661–13690, 4 2022.
[6] Mohammad Moshawrab, Mehdi Adda, Abdenour Bouzouane, Hussein Ibrahim,
and Ali Raad. Reviewing multimodal machine learning and its use in cardiovas-
cular diseases detection. Electronics, 12:1558, 3 2023.
[7] V. Jothi Prakash and N. K. Karthikeyan. Enhanced evolutionary feature selec-
tion and ensemble method for cardiovascular disease prediction. Interdisciplinary
Sciences: Computational Life Sciences, 13:389–412, 9 2021.
[8] V. Jothi Prakash and N. K. Karthikeyan. Dual-layer deep ensemble techniques
for classifying heart disease. Information Technology and Control, 51:158–179, 3
2022.
[9] Ali Rasekh, Reza Heidari, Amir Hosein Haji Mohammad Rezaie, Parsa Sharifi
Sedeh, Zahra Ahmadi, Prasenjit Mitra, and Wolfgang Nejdl. Robust fusion of
time series and image data for improved multimodal clinical prediction. IEEE
Access, 12:174107–174121, 2024.
[10] Arul Antran Vijay Subramanian and Jothi Prakash Venugopal. A deep ensem-
ble network model for classifying and predicting breast cancer. Computational
Intelligence, 39:258–282, 4 2023.
[11] Jothi Prakash V., Arul Antran Vijay S., Ganesh Kumar P., and Karthikeyan N.K.
A novel attention-based cross-modal transfer learning framework for predicting
cardiovascular disease. Computers in Biology and Medicine, 170:107977, 3 2024.
[12] Junxin Wang, Juanen Li, Rui Wang, and Xinqi Zhou. VAE-driven multimodal
fusion for early cardiac disease detection. IEEE Access, 12:90535–90551, 2024.
