MGFF Springer
Jothi Prakash V.¹, Thiruselvan S.¹, Chandranath S., Vikeshkumar M., and Ajay Meenatchi Sundaram M.
1 Introduction
Cardiovascular diseases (CVD) [7, 8] remain one of the leading causes of mor-
tality globally, accounting for millions of deaths annually. Early detection and
accurate prediction of CVD are critical for effective clinical intervention and im-
proving patient outcomes. Traditional prediction methods often rely on single-
modality data, such as textual electronic health records (EHR) or visual elec-
trocardiogram (ECG) signals [8, 11]. While these approaches provide valuable
insights, they fail to exploit the complementary nature of multimodal data, which
can offer a more comprehensive understanding of a patient’s condition. This lim-
itation highlights the need for advanced frameworks that integrate diverse data
modalities to enhance predictive accuracy and reliability [11]. Recent advance-
ments in multimodal learning have demonstrated the potential of combining het-
erogeneous data sources for improved performance in various domains, including
healthcare. However, many existing multimodal frameworks face challenges such
as ineffective feature alignment, limited generalizability, and computational in-
efficiency [4, 6]. These limitations motivate the development of a robust and
efficient framework tailored to the specific needs of CVD prediction. In this re-
search, we propose the Multimodal Gated Fusion Framework (MGFF), a novel
approach that integrates textual EHR data and visual ECG signals using a
gated fusion mechanism and a ViLT-based transformer for feature refinement
and classification. The key contributions of this work are as follows:
– We introduce MGFF, a unified framework that leverages gated fusion for
effective multimodal feature integration and ViLT for robust feature align-
ment.
– We demonstrate the superior performance of MGFF on multimodal datasets,
achieving state-of-the-art results with an accuracy of 91.2%, F1-score of 0.91,
and ROC-AUC of 0.94.
– We conduct extensive experiments, including ablation studies, calibration
analysis, and multimodal fusion efficiency evaluation, to validate the effec-
tiveness of the proposed framework.
2 Related Works
The task of cardiovascular disease (CVD) prediction has garnered significant
attention in recent years, with advancements in machine learning enabling the
development of predictive models across various data modalities. Existing works
can be broadly categorized into single-modality approaches and multimodal
frameworks.
3 Methodology
3.1 Datasets
The ECG [3] and MIT-BIH [2] datasets, designed for multimodal analysis, comprise a total of 22,225 samples with distinct features and classes. The ECG dataset includes 333 samples with 21 unique features, while the MIT-BIH dataset contains 21,892 samples with 188 unique features. Both datasets are categorized into two classes, representing normal and disease states; the ECG dataset, for example, comprises 215 normal and 118 disease samples. The multimodal nature of the
data incorporates both textual clinical records and visual ECG signals, ensur-
ing comprehensive representation. Preprocessing techniques, including normal-
ization, tokenization, and augmentation, were applied to maintain data quality
and enhance model compatibility. Normalization was applied to ensure consistent data ranges for all features, while tokenization converted textual EHR data into token sequences compatible with the model, as sketched below.
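For illustration, the following minimal Python sketch shows the two preprocessing steps; the choice of StandardScaler, the vocabulary handling, and the sequence length are assumptions rather than the exact pipeline used in this work.

```python
# Illustrative preprocessing sketch; StandardScaler, the vocabulary
# handling, and max_len are assumptions, not the authors' exact pipeline.
import numpy as np
from sklearn.preprocessing import StandardScaler

def normalize_features(x):
    """Scale each column of a (samples, features) array to zero mean and unit variance."""
    return StandardScaler().fit_transform(np.asarray(x, dtype=np.float64))

def tokenize_ehr(record, vocab, max_len=64):
    """Map a textual EHR record to a fixed-length list of token ids."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in record.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```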
ECG Feature Extraction The visual features of the ECG dataset are ex-
tracted using a Convolutional Neural Network (CNN)-based encoder. Given an
input ECG signal $E$, the CNN processes the signal through a series of convolutional layers, activation functions, and pooling operations to generate a feature map $F_e \in \mathbb{R}^{d_e}$, where $d_e$ is the dimensionality of the ECG feature vector. The feature extraction process is mathematically expressed as:

$$F_e = \mathrm{CNN}(E), \tag{1}$$

where $F_e$ captures the temporal and spectral information inherent in the ECG signal.
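A minimal PyTorch sketch of such a CNN encoder is given below; the layer widths, kernel sizes, and output dimensionality $d_e$ are illustrative assumptions, since the exact architecture is not specified here.

```python
import torch
import torch.nn as nn

class ECGEncoder(nn.Module):
    """1D-CNN encoder producing F_e = CNN(E); layer sizes are illustrative."""
    def __init__(self, in_channels=1, d_e=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis to one step
        )
        self.proj = nn.Linear(64, d_e)

    def forward(self, e):              # e: (batch, channels, time)
        f = self.conv(e).squeeze(-1)   # (batch, 64)
        return self.proj(f)            # F_e: (batch, d_e)
```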
EHR Feature Extraction Structured features from the EHR dataset are
derived using a Deep Neural Network (DNN)-based encoder. Given input EHR data $H$, the DNN processes the data through multiple fully connected layers to extract meaningful features, represented as $F_h \in \mathbb{R}^{d_h}$, where $d_h$ is the dimensionality of the EHR feature vector. The process is mathematically represented as:

$$F_h = \mathrm{DNN}(H), \tag{2}$$

where $F_h$ encapsulates the patient's health history, including diagnostic and demographic information.
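Analogously, a fully connected encoder can realize Eq. (2); the hidden sizes in the sketch below are assumptions.

```python
import torch.nn as nn

class EHREncoder(nn.Module):
    """DNN encoder producing F_h = DNN(H); hidden sizes are illustrative."""
    def __init__(self, in_features, d_h=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 256), nn.ReLU(),
            nn.Linear(256, d_h), nn.ReLU(),
        )

    def forward(self, h):    # h: (batch, in_features)
        return self.net(h)   # F_h: (batch, d_h)
```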
The gated fusion mechanism combines the two modality embeddings into a fused vector $F_m$, which represents the integrated features from both ECG and EHR data and is subsequently utilized for classification. A sketch of one possible gating scheme is given below.
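Since Eq. (3) is not reproduced in this excerpt, the following sketch shows one common gated-fusion formulation, in which a sigmoid gate weighs the two modality embeddings element-wise; the exact gating used in MGFF may differ.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """One common gated-fusion scheme (an assumption, not necessarily Eq. 3):
    a sigmoid gate mixes the projected ECG and EHR embeddings element-wise."""
    def __init__(self, d_e=128, d_h=128, d_m=128):
        super().__init__()
        self.proj_e = nn.Linear(d_e, d_m)
        self.proj_h = nn.Linear(d_h, d_m)
        self.gate = nn.Linear(d_e + d_h, d_m)

    def forward(self, f_e, f_h):
        g = torch.sigmoid(self.gate(torch.cat([f_e, f_h], dim=-1)))
        # Convex combination of the two projected modality features.
        return g * self.proj_e(f_e) + (1 - g) * self.proj_h(f_h)  # F_m
```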
3.3 Classification
The fused multimodal feature vector $F_m$, obtained through the gated fusion mechanism, is fed into the classification layer to predict the likelihood of cardiovascular disease (CVD). This layer is designed to refine and classify the multimodal representation effectively, leveraging the ViLT (Vision-and-Language Transformer) model. ViLT was chosen for its efficiency in processing multimodal data by directly operating on aligned textual and visual embeddings without requiring heavy pretraining on paired datasets. The fused vector is first projected to match the transformer's input dimensionality:

$$F_p = W_p F_m + b_p, \tag{4}$$

where $W_p \in \mathbb{R}^{d_t \times d_m}$ is the projection matrix, $b_p \in \mathbb{R}^{d_t}$ is the bias vector, and $d_t$ represents the dimensionality of the projected feature vector $F_p$.
The projected features are then refined by the ViLT encoder,

$$F_r = \mathrm{ViLT}(F_p), \tag{5}$$

where $F_r \in \mathbb{R}^{d_t}$ denotes the refined feature vector. ViLT ensures that the fused representation incorporates both spatial and semantic alignment across modalities.
MLP Head and Prediction The refined feature vector $F_r$ is passed through a Multi-Layer Perceptron (MLP) head for classification. The MLP consists of one or more fully connected layers and a softmax function, which outputs the probability distribution over the classes. The classification probabilities are computed as:

$$\hat{y} = \mathrm{Softmax}(W_o F_r + b_o), \tag{6}$$

where $W_o$ and $b_o$ denote the weight matrix and bias vector of the output layer.
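A compact sketch of the classification pipeline of Eqs. (4)-(6) follows; a generic nn.TransformerEncoderLayer stands in for the pretrained ViLT encoder, which is an assumption for illustration only.

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Projection (Eq. 4), transformer refinement standing in for ViLT (Eq. 5),
    and the MLP head with softmax (Eq. 6)."""
    def __init__(self, d_m=128, d_t=128, num_classes=2):
        super().__init__()
        self.proj = nn.Linear(d_m, d_t)  # F_p = W_p F_m + b_p
        self.refine = nn.TransformerEncoderLayer(
            d_model=d_t, nhead=4, batch_first=True)  # placeholder for ViLT
        self.mlp = nn.Sequential(
            nn.Linear(d_t, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, f_m):                    # f_m: (batch, d_m)
        f_p = self.proj(f_m).unsqueeze(1)      # add a length-1 sequence axis
        f_r = self.refine(f_p).squeeze(1)      # F_r: (batch, d_t)
        return self.mlp(f_r).softmax(dim=-1)   # class probabilities, Eq. (6)
```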
Loss Function The model is trained using a cross-entropy loss function, which quantifies the difference between the predicted probability distribution $\hat{y}$ and the true labels $y$. Cross-entropy was chosen because it is well suited to binary classification, penalizing incorrect predictions in proportion to their predicted confidence. The loss function is defined as:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}), \tag{7}$$
where $N$ is the number of samples, $y_{i,c}$ is the ground-truth label for class $c$ of sample $i$, and $\hat{y}_{i,c}$ is the predicted probability for class $c$. The final output of the classification layer is the predicted class label,

$$\hat{c} = \arg\max_{c} \hat{y}_c, \tag{8}$$

where $\hat{y}_c$ is the predicted probability for class $c$. The class with the highest probability is selected as the final prediction, indicating whether the sample belongs to the normal or disease category.
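Eqs. (7) and (8) translate directly into code, as in the sketch below; in practice one would feed raw logits to torch.nn.CrossEntropyLoss, which fuses the softmax and the logarithm for numerical stability, but the direct form is shown for clarity.

```python
import torch

def cross_entropy(y_hat, y, eps=1e-9):
    """Eq. (7): y_hat holds predicted probabilities (N, C), y one-hot labels (N, C)."""
    return -(y * torch.log(y_hat + eps)).sum(dim=1).mean()

def predict(y_hat):
    """Eq. (8): pick the class with the highest predicted probability."""
    return y_hat.argmax(dim=1)
```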
4 Experimental Evaluation
The experiments were conducted using the ECG and MIT-BIH datasets, with an 80%/10%/10% split for training, validation, and testing, respectively.
tively. All data preprocessing steps, including normalization for ECG signals and
tokenization for EHR records, were applied to ensure consistency and compati-
bility with the proposed framework. The model was implemented using Python
3.8 with PyTorch 1.12.0 and trained on an NVIDIA V100 GPU with 32GB of
memory. The Adam optimizer was used with an initial learning rate of 0.001,
a batch size of 64, and training was performed for 50 epochs. Hyperparameter
tuning was carried out using grid search to optimize model performance. Ad-
ditionally, 5-fold cross-validation was employed to evaluate the robustness and
generalizability of the model.
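Under the reported settings, a minimal training loop might look as follows; the model assembly from the earlier sketches and the `train_loader` (a DataLoader yielding ECG tensors, EHR tensors, and one-hot labels in batches of 64) are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MGFFSketch(nn.Module):
    """Assembly of the earlier sketches; this composition is an assumption."""
    def __init__(self):
        super().__init__()
        self.ecg = ECGEncoder()
        self.ehr = EHREncoder(in_features=188)  # feature count is illustrative
        self.fusion = GatedFusion()
        self.head = ClassificationHead()

    def forward(self, e, h):
        return self.head(self.fusion(self.ecg(e), self.ehr(h)))

model = MGFFSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr = 0.001

for epoch in range(50):                     # 50 epochs, as reported
    for ecg, ehr, labels in train_loader:   # assumed DataLoader, batch_size=64
        optimizer.zero_grad()
        loss = cross_entropy(model(ecg, ehr), labels)
        loss.backward()
        optimizer.step()
```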
Fig. 2: Calibration curves for MGFF (Proposed) and baseline models. The
diagonal line represents perfect calibration.
5 Limitations
While the proposed Multimodal Gated Fusion Framework (MGFF) demon-
strates significant improvements in cardiovascular disease prediction, it has cer-
tain limitations. First, the model relies on high-quality and well-structured multimodal datasets, which may not always be available in real-world clinical settings.
The performance of MGFF may degrade when dealing with noisy or incomplete
data, such as missing ECG signals or incomplete EHR records. Second, the com-
putational complexity of the ViLT-based transformer and gated fusion mecha-
nism requires significant hardware resources, making deployment on low-resource
devices challenging. Third, the framework is tailored to the specific modalities of
ECG and EHR data, limiting its generalizability to other healthcare domains or
multimodal datasets without additional customization. Lastly, while the model
achieves high accuracy and calibration, its interpretability could be further en-
hanced to provide actionable insights for clinicians, such as highlighting specific
contributing features from each modality. Addressing these limitations could
further improve the practical applicability and scalability of MGFF in diverse
clinical scenarios.
6 Conclusion
In this research, we proposed the Multimodal Gated Fusion Framework (MGFF)
for cardiovascular disease prediction, integrating textual EHR data and visual ECG signals using a gated fusion mechanism and a ViLT-based transformer for feature refinement. The framework demonstrated superior performance across multiple metrics, achieving an accuracy of 91.2%, an F1-score of 0.91, and an ROC-AUC of 0.94, outperforming advanced baselines such as MedFuse and VisualBERT.
VisualBERT. Extensive analysis, including ablation studies, calibration assess-
ment, and multimodal fusion efficiency evaluation, validated the significance of
the gated fusion mechanism and the transformer’s ability to align multimodal
features effectively. Furthermore, the calibration analysis showed that MGFF
produces well-calibrated predictions, enhancing its reliability for clinical appli-
cations. Despite these promising results, future work will focus on addressing the
limitations of the framework, including improving robustness to noisy and incom-
plete data, reducing computational complexity for deployment on low-resource
devices, and generalizing the approach to other multimodal healthcare datasets.
Additionally, enhancing the interpretability of the framework to provide action-
able insights for clinicians will be prioritized to increase its practical utility in
real-world scenarios.
References
[1] Muhammad Shakeel Akram, Bogaraju Sharatchandra Varma, and Dewar Finlay. Embedded DNN classifier for five different cardiac diseases. In 2024 35th Irish Signals and Systems Conference (ISSC), pages 01–06. IEEE, 6 2024.
[2] Akshita Gour, Muktesh Gupta, Rajesh Wadhvani, and Sanyam Shukla. ECG based heart disease classification: Advancement and review of techniques. Procedia Computer Science, 235:1634–1648, 2024.
[3] Muhammad Salman Haleem, Rossana Castaldo, Silvio Marcello Pagliara, Mario Petretta, Marco Salvatore, Monica Franzese, and Leandro Pecchia. Time adaptive ECG driven cardiovascular disease detector. Biomedical Signal Processing and Control, 70:102968, 9 2021.
[4] Biyanka Jaltotage, Juan Lu, and Girish Dwivedi. Use of artificial intelligence in-
cluding multimodal systems to improve the management of cardiovascular disease.
Canadian Journal of Cardiology, 40:1804–1812, 10 2024.
[5] Muhammad Umar Khan, Sumair Aziz, Khushbakht Iqtidar, Galila Faisal Zaher,
Shareefa Alghamdi, and Munazza Gull. A two-stage classification model integrat-
ing feature fusion for coronary artery disease detection and classification. Multi-
media Tools and Applications, 81:13661–13690, 4 2022.
[6] Mohammad Moshawrab, Mehdi Adda, Abdenour Bouzouane, Hussein Ibrahim,
and Ali Raad. Reviewing multimodal machine learning and its use in cardiovas-
cular diseases detection. Electronics, 12:1558, 3 2023.
[7] V. Jothi Prakash and N. K. Karthikeyan. Enhanced evolutionary feature selec-
tion and ensemble method for cardiovascular disease prediction. Interdisciplinary
Sciences: Computational Life Sciences, 13:389–412, 9 2021.
[8] V. Jothi Prakash and N. K. Karthikeyan. Dual-layer deep ensemble techniques
for classifying heart disease. Information Technology and Control, 51:158–179, 3
2022.
[9] Ali Rasekh, Reza Heidari, Amir Hosein Haji Mohammad Rezaie, Parsa Sharifi
Sedeh, Zahra Ahmadi, Prasenjit Mitra, and Wolfgang Nejdl. Robust fusion of
time series and image data for improved multimodal clinical prediction. IEEE
Access, 12:174107–174121, 2024.
[10] Arul Antran Vijay Subramanian and Jothi Prakash Venugopal. A deep ensem-
ble network model for classifying and predicting breast cancer. Computational
Intelligence, 39:258–282, 4 2023.
[11] Jothi Prakash V., Arul Antran Vijay S., Ganesh Kumar P., and Karthikeyan N.K.
A novel attention-based cross-modal transfer learning framework for predicting
cardiovascular disease. Computers in Biology and Medicine, 170:107977, 3 2024.
[12] Junxin Wang, Juanen Li, Rui Wang, and Xinqi Zhou. VAE-driven multimodal fusion for early cardiac disease detection. IEEE Access, 12:90535–90551, 2024.