Highlights in Science, Engineering and Technology CSIC 2023
Volume 85 (2024)
1. Introduction
Heart disease is one of the leading causes of death and ill health worldwide. According to
statistics, about 17.9 million people die from cardiovascular disease every year, accounting for
31% of all deaths worldwide [1]. Cardiovascular disease affects people of all ages and has become
a major challenge in public health. If the disease is detected early and active preventive
measures are taken, mortality can be greatly reduced [2]. Therefore, early and accurate prediction
of heart disease is of great significance for prevention, intervention and treatment.
Artificial neural networks (ANNs) have recently been applied extensively in the medical field,
showing great promise in areas such as disease prediction and medical image interpretation.
Naturally, applying ANNs to evaluate the likelihood and risk of developing heart disease has grown
into a specialized research area. Nevertheless, existing research relies mainly on professional
medical equipment to measure the required data, which consumes considerable resources and labor
and causes unnecessary inconvenience for patients.
At present, the mainstream methods for intelligent diagnosis of heart disease include random
forest, ANN and so on. For example, Garg et al. use two supervised machine learning algorithms,
k-NN and random forest. By considering attributes such as chest pain, cholesterol level and age,
people with and without heart disease are classified. The prediction accuracy of the k-NN method
is 86.885% [3], and that of the random forest method is 81.967%. As another example, Mienye et al.
proposed an improved sparse autoencoder based artificial neural network to help predict heart
disease. In the first stage, a sparse autoencoder (SAE) is trained to learn the best
representation of the data; in the second stage, an artificial neural network (ANN) predicts the
health state from the learned representation. The SAE is optimized with the Adam algorithm, and
batch normalization is applied. The two-stage method effectively improves the classification
performance of the neural network and is more robust than other methods. The accuracy of the
model on the test data is 90% [4]. Most of the data
they use come from the Kaggle website, such as the Kaggle Framingham Heart dataset [5], because
these databases contain more samples.
Research has shown that basic health behaviors such as smoking, physical activity, diet and
weight have a certain impact on heart disease [6]. Other factors, including hypertension, high
blood cholesterol, physical illness, stress levels, alcohol consumption and irregular diet, are
also potential causes of heart disease [7].
2. Method
2.1. Pipeline
This study uses the "SMOTE" and "ENN" algorithms to balance the dataset. Heart disease is then
predicted on the balanced dataset with a range of machine learning models: LR (logistic
regression), DT (decision tree), RF (random forest), GBDT (gradient boosting decision tree),
XGBoost (extreme gradient boosting), SVM (support vector machine), and ANN (artificial neural
network). In the ANN model, the hidden layer uses the ReLU activation function and the loss
function is binary_crossentropy, with batch_size=50 and epochs=100; there are 50 neural units in
the input layer, 30 neural units in the hidden layer, and 1 neural unit in the output layer. In
addition, training uses AUC monitoring, a callback that keeps the best model, and dynamic
adjustment of the learning rate. The overall flow chart is shown in Fig 1.
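The training controls named above (AUC monitoring, keeping the best model, and adjusting the learning rate) can be illustrated with a small sketch. It is not the authors' code: the class name AucMonitor, the patience, factor and min_lr parameters, and the validation AUC values are all hypothetical and chosen only for illustration.

```python
class AucMonitor:
    """Track validation AUC, remember the best epoch, and decay the learning rate on plateaus."""

    def __init__(self, lr, patience=3, factor=0.5, min_lr=1e-5):
        self.lr = lr
        self.patience = patience  # hypothetical: epochs without improvement before decaying
        self.factor = factor      # hypothetical: multiplicative learning-rate decay
        self.min_lr = min_lr      # hypothetical: lower bound for the learning rate
        self.best_auc = float("-inf")
        self.best_epoch = None
        self.wait = 0

    def update(self, epoch, val_auc):
        """Return True when this epoch is the best so far, i.e. the weights should be saved."""
        if val_auc > self.best_auc:
            self.best_auc, self.best_epoch, self.wait = val_auc, epoch, 0
            return True
        self.wait += 1
        if self.wait >= self.patience:
            # No improvement for `patience` epochs: shrink the learning rate.
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.wait = 0
        return False


monitor = AucMonitor(lr=0.001, patience=2)
val_aucs = [0.70, 0.75, 0.74, 0.745, 0.78, 0.77]  # made-up values, not results from this study
saved = [monitor.update(epoch, auc) for epoch, auc in enumerate(val_aucs)]
print(saved, monitor.best_epoch, monitor.lr)
```

In Keras, these roles are usually played by the ModelCheckpoint and ReduceLROnPlateau callbacks; the paper does not state which implementation was used.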
synthesize new samples based on minority class samples to add to the dataset, as shown in the figure.
The algorithm works as follows.
Step 1: For each sample x in the minority class sample set S_min, determine its k nearest
neighbors by calculating the Euclidean distance from x to every other sample in S_min.
Step 2: Set the sampling rate N according to the sample imbalance ratio. For each minority sample
x, randomly pick several samples from its k nearest neighbors; denote a chosen neighbor by x_n.
Step 3: For the original sample x and each randomly selected neighbor x_n, create a new sample
x_new using the formula below.
$x_{new} = x + \mathrm{rand}(0,1) \times |x - x_n|$ (1)
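The three steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in the study: the function names, the toy minority samples, and the values of k and n_per_sample are assumptions. Note also that the sketch uses the signed difference (x_n - x) from the original SMOTE formulation, so each synthetic point lies on the segment between a sample and its chosen neighbor, whereas Eq. (1) writes the difference with an absolute value.

```python
import math
import random


def k_nearest(sample, minority, k):
    """Step 1: the k minority samples closest to `sample` (Euclidean distance)."""
    others = [m for m in minority if m is not sample]
    others.sort(key=lambda m: math.dist(sample, m))
    return others[:k]


def smote(minority, n_per_sample, k, rng):
    """Steps 2 and 3: create n_per_sample synthetic points for every minority sample."""
    synthetic = []
    for x in minority:
        neighbors = k_nearest(x, minority, k)
        for _ in range(n_per_sample):
            x_n = rng.choice(neighbors)  # randomly chosen neighbor
            gap = rng.random()           # rand(0, 1)
            # Assumption: signed difference, as in the original SMOTE formulation.
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic


# Toy minority-class samples, invented for illustration.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
new_points = smote(minority, n_per_sample=2, k=2, rng=random.Random(0))
print(len(new_points))
```

Here n_per_sample plays the role of the sampling rate N: with an imbalance of roughly 10 to 1, about ten synthetic points per minority sample would be needed to reach parity.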
Another very popular undersampling method is the "Edited Nearest Neighbor" or "ENN" rule [9].
ENN is a data-cleaning method that improves the performance of classifiers by removing redundant
and noisy samples from the data. It decides which samples to delete by comparing each sample with
its nearest neighbors. Specifically, the "ENN" algorithm first uses a k-nearest-neighbor (KNN)
search to find the K nearest neighbors of each sample, and then checks whether the class label of
the sample agrees with those of its neighbors. If the class labels of most of the nearest
neighbors differ from that of the sample, the sample is removed from the dataset.
The "ENN" algorithm works on the following principle. For each sample in the category to be
undersampled, find its nearest neighbors. If the neighbors do not match the category of the
current sample, the current sample is deleted from the dataset.
Through this process, the "ENN" algorithm edits the dataset and deletes samples that are not
sufficiently consistent with their neighbors. In terms of selection criteria, "ENN" provides two
optional strategies. Majority class selection (kind_sel='mode'): the current sample is retained
if the majority of its nearest neighbors belong to the same category as the current sample.
Select all (kind_sel='all'): the current sample is retained only if all of its nearest neighbors
belong to the same category; as soon as one neighbor does not match, the current sample is deleted.
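The difference between the two selection strategies is easiest to see on a toy example. The sketch below is a plain-Python illustration under stated assumptions: only samples of target_class may be removed, k is 3, and the one-dimensional points and labels are invented. The parameter name kind_sel follows the text (it matches the EditedNearestNeighbours class in imbalanced-learn), but the sketch does not call that library.

```python
import math
from collections import Counter


def enn(points, labels, target_class, k=3, kind_sel="all"):
    """Return the indices kept after editing samples of `target_class`."""
    kept = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != target_class:
            kept.append(i)  # assumption: other classes are never removed
            continue
        # Indices of all other samples, ordered by distance to the current one.
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        neighbor_labels = [labels[j] for j in order[:k]]
        if kind_sel == "all":
            agrees = all(lbl == y for lbl in neighbor_labels)
        else:  # "mode": majority vote of the neighbors
            agrees = Counter(neighbor_labels).most_common(1)[0][0] == y
        if agrees:
            kept.append(i)
    return kept


# Toy data: label 0 is the majority class, label 1 the minority class.
points = [[0.0], [1.0], [2.0], [3.0], [3.9], [6.0], [6.5], [6.2]]
labels = [0, 0, 0, 0, 1, 1, 1, 0]

kept_all = enn(points, labels, target_class=0, kind_sel="all")
kept_mode = enn(points, labels, target_class=0, kind_sel="mode")
print(kept_all, kept_mode)
```

In this example the sample at 6.2 sits inside the minority cluster and is removed under both strategies, while the samples at 2.0 and 3.0 each have a single minority neighbor and are removed only under 'all', which is the stricter rule.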
2.3. MLP Classifier framework
MLPClassifier [10] is a classifier in the scikit-learn library that trains multi-layer neural
networks with the backpropagation algorithm for classification tasks.
Fig 2. Framework
The multilayer perceptron (MLP), also called a feedforward artificial neural network (FFANN), is
frequently applied in areas including audio processing, natural language processing, autonomous
vehicles and computer vision.
The basic working principle of MLPClassifier is to pass the input data through layers of
connected neurons to the output layer, which ultimately determines the prediction result. In each
neuron, an activation function determines whether, and how strongly, the input signal activates
the neuron. Fig 2 shows the whole framework.
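To make the forward pass concrete, the following sketch pushes one input vector through a 50-30-1 network with ReLU activations and a binary cross-entropy loss, matching the layer sizes and loss named in Section 2.1. Several details are assumptions rather than facts from the paper: the 50-unit layer is treated as the first dense layer, the output neuron uses a sigmoid, the weights are random and untrained, and the number of input features (17) is hypothetical.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """weights[j] holds the incoming weights of output neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """ReLU on every layer except the last; sigmoid on the single output neuron."""
    *hidden, (w_out, b_out) = layers
    h = x
    for w, b in hidden:
        h = relu(dense(h, w, b))
    return sigmoid(dense(h, w_out, b_out)[0])


def binary_cross_entropy(y_true, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


rng = random.Random(0)
n_features = 17  # hypothetical number of input features
layers = [init_layer(n_features, 50, rng),
          init_layer(50, 30, rng),
          init_layer(30, 1, rng)]
x = [rng.random() for _ in range(n_features)]
p = forward(x, layers)                # predicted probability of heart disease
loss = binary_cross_entropy(1, p)     # loss if the true label is 1
print(p, loss)
```

Training would adjust the weights by backpropagating this loss; the sketch stops at the forward pass because that is the part described in the text.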
3. Experiment
3.1. Dataset Introduction
Source of the dataset: the data for this study come from the Kaggle data platform. The dataset
contains survey responses from 319795 people who are at least 18 years old. The ratio of samples
without heart disease to samples with heart disease is approximately 10.7 to 1. The distribution
is shown in Fig 3. The dataset originally came from a telephone survey conducted by the Centers
for Disease Control and Prevention in the United States in 2020, which collected data on the
health state of American residents. This study uses it for heart disease prediction, to assist
doctors in the preliminary diagnosis of heart disease.
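The imbalance ratio quoted above has a direct consequence for evaluation: a model that always predicts "no heart disease" already scores about 91% accuracy while detecting no patients at all. The short calculation below works this out from the reported ratio and sample count; the function name is illustrative, and the positive-sample count is only the rough figure implied by the rounded ratio.

```python
def majority_baseline_accuracy(negatives_per_positive):
    """Accuracy of a model that always predicts the majority (no heart disease) class."""
    return negatives_per_positive / (negatives_per_positive + 1.0)


ratio = 10.7          # samples without heart disease per sample with it, as reported
n_samples = 319795    # survey respondents, as reported

baseline = majority_baseline_accuracy(ratio)
approx_positives = round(n_samples / (ratio + 1.0))  # rough count implied by the ratio

print(f"always-negative accuracy: {baseline:.3f} (recall on heart disease: 0.0)")
print(f"approximate heart-disease samples: {approx_positives}")
```

This is why accuracy alone is a weak indicator on this dataset and why recall, precision, F1 score and AUC are reported alongside it.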
4. Experimental result
Accuracy is a commonly used indicator to assess the performance of classification models,
reflecting the ratio of the number of samples which the model accurately predicted to the total number
of samples.
$\text{Accuracy} = \dfrac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$ (2)
F1 score is a widely used metric for the performance of classification models; it combines the
precision and recall of the model to balance the two.
$\text{F1 Score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (3)
Precision measures the proportion of samples predicted as positive by the model that are actually
positive. The calculation formula for precision is given in (4).
$\text{Precision} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$ (4)
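Formulas (2) to (4) can be evaluated directly from the four cells of a confusion matrix. The sketch below does so for invented counts; the function name and the numbers are illustrative only. Recall, which the F1 score needs but the text does not spell out, is taken as its standard definition, true positives divided by the sum of true positives and false negatives.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total                          # formula (2)
    precision = tp / (tp + fp) if tp + fp else 0.0        # formula (4)
    recall = tp / (tp + fn) if tp + fn else 0.0           # standard definition
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # formula (3)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts: 100 true patients, 900 healthy people.
metrics = classification_metrics(tp=80, fp=320, fn=20, tn=580)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts recall is high but precision is low, the same pattern that appears for the minority class on imbalanced data.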
In the analysis of the dataset, it was found that the dataset is relatively complex and that the
absolute values of the Pearson correlation coefficients between the individual attributes and
heart disease (HeartDisease) are relatively low, which makes models based on linear relationships,
such as LR (logistic regression), unsuitable for prediction. For the raw data, the degree of
imbalance is high. When the machine learning models were trained directly on the raw data,
accuracy was abnormally high while the other indicators were generally low, so overall
performance was poor. After balancing the dataset with "SMOTE+ENN", all evaluation indicators of
the models improved, with the AUC and recall of the ANN model reaching 0.808 and 0.81,
respectively, as shown in Table 2. Moreover, ANN has high potential and good robustness.
Therefore, the ANN model is recommended for predicting heart disease.
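The Pearson analysis mentioned at the start of this discussion can be sketched as follows. The code computes the correlation coefficient between each feature and the target and keeps the features whose absolute coefficient reaches a threshold. The column names feature_a and feature_b and all values are invented, and the screening rule (drop features whose absolute coefficient is below the threshold) is an assumption about how the screening was applied.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def screen_features(columns, target, threshold=0.02):
    """Keep features whose |r| with the target is at least `threshold` (assumed rule)."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Invented toy data: `target` stands in for HeartDisease.
target = [0, 0, 0, 1, 0, 1, 0, 1]
columns = {
    "feature_a": [0, 0, 0, 1, 0, 0, 0, 1],   # tends to be 1 when target is 1
    "feature_b": [1, 2, 3, 2, 1, 2, 3, 2],   # constructed to be uncorrelated
}

kept, scores = screen_features(columns, target)
print(kept, scores)
```

A low coefficient only rules out a linear relationship with the target, so a feature dropped by this screen may still carry information that a nonlinear model can use.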
Table 2. Result
                 precision   recall   f1-score   support
health                0.97     0.67       0.79     87791
Heart disease         0.18     0.81       0.30      8148
accuracy                                  0.68     95939
macro avg             0.58     0.74       0.55     95939
weighted avg          0.91     0.68       0.75     95939
AUC: 0.808
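The rows of Table 2 are linked by simple arithmetic, which gives a quick consistency check. The sketch below recomputes the overall accuracy, the macro-averaged recall and the per-class F1 scores from the rounded per-class values in the table; because the inputs are rounded to two decimals, the recomputed values agree with the table only to within rounding. The variable names are illustrative.

```python
# Rounded per-class values copied from Table 2.
support = {"health": 87791, "heart disease": 8148}
precision = {"health": 0.97, "heart disease": 0.18}
recall = {"health": 0.67, "heart disease": 0.81}


def f1(p, r):
    return 2 * p * r / (p + r)


total = sum(support.values())
# recall * support approximates the number of correctly classified samples per class.
correct = sum(recall[c] * support[c] for c in support)
accuracy = correct / total
macro_recall = sum(recall.values()) / len(recall)
f1_by_class = {c: f1(precision[c], recall[c]) for c in support}

print(f"total support: {total}")
print(f"accuracy: {accuracy:.2f}, macro recall: {macro_recall:.2f}")
print({c: round(v, 2) for c, v in f1_by_class.items()})
```

The check also makes the trade-off visible: the heart disease class reaches a recall of 0.81 at a precision of 0.18, so most positive predictions are false alarms, which is acceptable for preliminary screening but not for diagnosis.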
5. Conclusion
In the dataset analysis, this study found that stroke, physical health, DiffWalking, AgeCategory
and diabetes had a greater impact on heart disease prediction. The analysis also showed that the
dataset is relatively complex, with low Pearson correlation coefficients between the individual
attributes and HeartDisease, which makes models based on linear relationships, such as LR
(logistic regression), unsuitable for prediction. This study uses an artificial neural network
(ANN) model to predict heart disease and compares it with models trained by machine learning
algorithms such as logistic regression, decision tree, XGBoost, RF, GBDT, SVM and KNN. Comparing
these models on five indicators (accuracy, F1 score, recall, precision and AUC) shows that ANN
has certain advantages over traditional learning models. When features were screened with a
Pearson correlation coefficient threshold of 0.02, every indicator except the F1 score (accuracy,
precision, recall and AUC) declined to varying degrees for all models, indicating that self-rated
health and average sleep time also play important roles in predicting heart disease and cannot be
discarded. The Kaggle dataset used in this study came from the Centers for Disease Control and
Prevention in the United States in 2020, so it has certain regional and temporal limitations.
Therefore, we hope to collect up-to-date data from people of all ethnic groups around the world
in the future and achieve better performance by updating and expanding the dataset.
Authors Contribution
All the authors contributed equally, and their names were listed in alphabetical order.
References
[1] E. Bathrellou, M. D. Kontogianni, E. Chrysanthopoulou et al., "Adherence to a DASH-style diet and
cardiovascular disease risk: the 10-year follow-up of the ATTICA study," Nutrition and Health, vol. 25,
no. 3, pp. 225-230, 2019.
[2] S. Mohan, C. Thirumalai and G. Srivastava, "Effective heart disease prediction using hybrid machine
learning techniques," IEEE Access, vol. 7, pp. 81542-81554, 2019, doi: 10.1109/ACCESS.2019.2923707.
[3] A. Garg, B. Sharma and R. Khan, "Heart disease prediction using machine learning techniques," IOP
Conference Series: Materials Science and Engineering, 2021, doi: 10.1088/1757-899X/1022/1/012046.
[4] D. Mienye, Y. X. Sun and Z. H. Wang, "Improved sparse autoencoder based artificial neural network
approach for prediction of heart disease," Informatics in Medicine Unlocked, vol. 18, 100307, 2020.