Enhancing Stroke Prediction Using The Waikato Environment For Knowledge Analysis
Corresponding Author:
Muneera Altayeb
Department of Communications and Computer Engineering, Faculty of Engineering
Al-Ahliyya Amman University
Al-Saro, Al-Salt, Amman, Jordan
Email: [email protected]
1. INTRODUCTION
Stroke, a potentially fatal consequence of atrial fibrillation, is difficult for doctors to predict because
the prediction process is time-consuming and tedious. It primarily affects individuals over the age of 65 and is
comparable to a "heart attack" in its damaging effect on the brain. In the United States and agricultural nations,
stroke is the third leading cause of death. It occurs when the brain's blood supply is obstructed or reduced.
There are two main types of stroke: ischemic stroke, caused by insufficient blood flow, and hemorrhagic stroke,
caused by bleeding. Hemorrhagic stroke can be further classified into subarachnoid hemorrhage and
intracerebral hemorrhage [1]. Stroke ranks among the world's main causes of mortality and disability. Stroke
ranks second in Korea in terms of causes of death. The population of Korea is expected to age quickly; by
2050, the proportion of people over 60 is expected to rise from 13.7% in 2015 to 28.6% [2]–[4].
Islam et al. [5] introduced adaptive gradient boosting machine learning (ML) models to classify and
predict acute stroke in active states. The study was conducted on electroencephalogram (EEG) recordings of
75 healthy adults without a history of any neurological disease and 48 patients who had been diagnosed with
an acute stroke. Results showed that the proposed model was approximately 80% accurate in classifying the stroke
group. In a study on stroke prediction, researchers explored the use of three ML models: deep neural network
(DNN), random forest (RF), and logistic regression (LR). They evaluated the models' performance with
specific parameters and found that DNN, commonly used for predicting ischemic or acute stroke, showed
promise for long-term prediction as well. The DNN model achieved an impressive 88% accuracy when
considering input variables, outperforming the other models. The researchers highlighted the need to enhance
the model with automated and precise calculations, reducing the dependence on simpler models [6].
Hadianfard et al. [7] presented a study that aimed to predict stroke patients' survival rates by extracting
decision rules through the use of data mining techniques. The researchers used the multiple imputation method
to handle missing data when analyzing data from 4149 stroke patients that they had obtained from paper
medical records. To balance the target variable, they used methods like under- and oversampling in addition to
synthetic minority oversampling (SMOTE). Stroke patients' survival rate was predicted using the LR, decision
tree, and SVM algorithms. The repeated incremental pruning to produce error reduction (RIPPER) algorithm
was also used to extract decision rules. In terms of kappa (33.34), sensitivity (79.06%), and accuracy (76.96%),
LR outperformed the other algorithms. Nonetheless, the specificity (65.35%) and area under the ROC curve
(AUC) (0.77) were lower than other algorithms. Using an independent dataset of 234 records, the LR algorithm
that performed the best on the primary dataset was tested. When this method was used with the external
validation dataset, its accuracy (79.91%), sensitivity (83.94%), kappa (39.26), and AUC (0.8) all improved; its
specificity (60.98%), however, did not.
Choi et al. [8] created a new methodology for applying deep learning models to raw EEG data that
does not take frequency features into account. Using real-time EEG sensor data, the proposed stroke prediction
model was developed and trained. Several deep learning models specializing in time series data classification
and prediction, namely long short-term memory (LSTM), bidirectional LSTM, convolutional neural network
(CNN)-LSTM, and CNN-bidirectional LSTM, were created and compared. When using raw EEG data, the
CNN-bidirectional LSTM model predicted stroke with 94.0% accuracy, a low false positive rate (6.0%), and a
low false negative rate (5.7%), demonstrating high confidence in their method.
The Waikato environment for knowledge analysis (WEKA) workbench is an organized collection of modern
ML algorithms and data preprocessing tools. The primary way of interacting with these methods is the
command line. However, easy-to-use interactive graphical user interfaces are
available for data exploration, large-scale experiment setups on distributed computing platforms, and stream
data processing configuration design. These interfaces make up a sophisticated setting for data mining
experiments. The system is written in Java and distributed under the GNU General Public License [9].
The novelty of this work lies in the use of a large dataset to train several ML classifiers supported by
the WEKA data mining tool for stroke prediction. The prediction process in the proposed model is divided into
four stages: i) choosing the dataset, ii) dataset cleaning and preprocessing, iii) classification using four
algorithms: naive Bayes (NB), RF, support vector machine (SVM), and multi-layer perceptron (MLP), and
iv) results and performance evaluation. The performance of each classifier is evaluated using the following
metrics: accuracy, sensitivity, precision, and F-measure.
2. PROPOSED METHOD
The proposed model aims to detect stroke using ML and deep learning classifiers embedded in the
data mining tool WEKA, which allows users to assess classification accuracy across various algorithms, based
on a set of features [10]–[12]. Before classification, the dataset is filtered and pre-processed so that its
features can be fed to the classifiers; this is the first and most important step in developing an ML
classifier. Next, the dataset is divided into training and test sets. To gauge and analyze performance,
10-fold cross-validation is used: each fold in turn is held out for testing, while the other nine
folds are used for training. The held-out tenth of the dataset is used to compute the error rate [13], [14].
The classification process is then carried out using four algorithms: NB, RF, SVM, and MLP.
Figure 1 describes the model proposed in this work.
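The 10-fold procedure described above can be illustrated with a minimal Python sketch. It is not WEKA's implementation and does not attempt to reproduce how WEKA builds its folds; the function names `k_fold_indices` and `cross_validate` and the `evaluate` callback are hypothetical, introduced only for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Hold out each fold once for testing and average the error rates.

    `evaluate(train_idx, test_idx)` is a user-supplied (hypothetical)
    callback that trains on train_idx and returns the error rate on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # The other k-1 folds together form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(evaluate(train_idx, test_idx))
    return sum(errors) / k
```

With k=10, every case is used for testing exactly once and for training nine times, which is the scheme the paragraph above describes.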
2.1. Dataset
The dataset used in this study was obtained from Kaggle [15] and consists of 3254 cases, each
containing eight attributes: age, gender, heart disease, hypertension, marital status, average blood sugar level,
body mass index, and smoking. These attributes are prepared in a preprocessing (cleaning)
step that deletes rows containing redundant, corrupted, incomplete, inaccurate, or incorrectly
structured data. The dataset is then converted to the comma separated value (CSV) file format,
which is compatible with WEKA. Table 1 shows the attributes and description of the dataset used in
the classification process.
‒ Gender: a person's gender is indicated by this characteristic. The sample comprises 2,117 men (41.4%)
and 2,994 women (58.6%). Strokes disproportionately afflict women, with sociocultural gender
Enhancing stroke prediction using the waikato environment for knowledge analysis (Muneera Altayeb)
3012 ISSN: 2252-8938
playing a role in variations in risk factors, evaluation, treatment, and results. The study focuses on the gaps
in existing knowledge and research [16].
‒ Age: this feature describes an individual's age, as the occurrence of strokes in young individuals rises as
they age beyond 35 years, and there has been a 23% increase in such cases over ten years, primarily due
to a rise in ischemic stroke [17].
‒ Hypertension: this feature determines if the individual has hypertension, a condition that impacts 9.8% of
the participants and raises the risk [18].
‒ Heart disease: this feature signifies the presence or absence of heart disease in the individual. The
percentage of patients diagnosed with heart disease stands at 5.4% [19].
‒ Ever married: this feature displays the participants' marital status, with married individuals making up
65.6% of the sample [20].
‒ Average glucose level: this feature captures the participant’s average glucose level [21].
‒ BMI: this feature records the participants' body mass index [22].
‒ Smoking: three categories are included in this feature, which tracks the participant's smoking status:
formerly smoking (21.2%), never smoking (40.9%), and smoking (37.8%) [23].
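The cleaning and CSV conversion step described for this dataset could look roughly like the following Python sketch. It only covers two of the cleaning criteria (incomplete rows and exact duplicates). The column names, the treatment of "N/A" as a missing value, and the helper names `clean_rows` and `to_csv` are assumptions made for illustration, not details taken from the study.

```python
import csv
import io

# Hypothetical column names for the eight attributes listed above.
COLUMNS = ["age", "gender", "heart_disease", "hypertension",
           "ever_married", "avg_glucose_level", "bmi", "smoking_status"]


def clean_rows(rows, required):
    """Drop rows with missing required values and exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        values = tuple((row.get(col) or "").strip() for col in required)
        # Assumption: empty strings and "N/A" both mark a missing value.
        if any(v == "" or v.upper() == "N/A" for v in values):
            continue
        if values in seen:  # redundant (duplicate) record
            continue
        seen.add(values)
        cleaned.append(dict(zip(required, values)))
    return cleaned


def to_csv(rows, columns):
    """Serialize the cleaned rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling `to_csv(clean_rows(rows, COLUMNS), COLUMNS)` on rows read with `csv.DictReader` would yield CSV text with one header line and one line per retained case.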
automatically, this is the main advantage [24], [25]. In this research, several classifiers were tested and
compared for stroke detection and will be discussed in the following subsection.
\[ y = F(x) = W^{T} x + b = \sum_{i=1}^{N} W_i x_i + b \tag{2} \]
The vector W and scalar b determine the best-separating hyperplane, which maximizes the distance between
the plane and the closest data. Using the kernel function, SVM may be applied to non-linear classification tasks
when the features in high-dimensional feature spaces are non-linearly separable [32].
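As a small illustration of the linear decision function in (2), the sketch below evaluates W^T x + b for a single sample. The weights, bias, and the convention of assigning the class by the sign of y are assumptions chosen for the example; they are not values or settings from this study, and the kernel case is not covered.

```python
def decision_value(w, x, b):
    """Compute y = W^T x + b as the sum of W_i * x_i plus the bias b."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """Assign class +1 or -1 by the sign of y (a common convention, assumed here)."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical hyperplane parameters and samples, for illustration only.
w = [2.0, -1.0]
b = -0.5
print(decision_value(w, [1.0, 0.5], b))  # 2.0*1.0 - 1.0*0.5 - 0.5 = 1.0
print(predict(w, [0.0, 1.0], b))         # y = -1.5, so class -1
```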
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{4} \]
To measure the proportion of all actual positive samples (TP+FN) that were assigned to the
positive category (TP), the sensitivity index was used. In other words, sensitivity is the ratio of true
positives to all actual positives, as shown in (5) [37]. The F-measure was also used in this work; it is
the harmonic mean of precision and sensitivity, with equal weight assigned to each, as shown in (6) [38].
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{5} \]
\[ \text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{6} \]
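The definitions in (4) to (6) translate directly into a few lines of Python. The sketch below is illustrative only; the function names are hypothetical, and returning 0.0 for an empty denominator is an assumed convention rather than something the paper specifies.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP), as in (4)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN), as in (5)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, s):
    """Harmonic mean of precision and sensitivity, as in (6)."""
    return 2 * p * s / (p + s) if (p + s) else 0.0


# Plugging in the precision (94.4%) and sensitivity (100%) reported
# for SVM in this paper gives the reported F-measure of 97.1%.
print(round(f_measure(0.944, 1.0), 3))  # 0.971
```

The check at the end shows that the F-measure reported for SVM is consistent with its reported precision and sensitivity.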
Comparing these classification results, as in Table 7 and Figure 3, shows that NB has the highest
precision score, 95.7%, while SVM has the highest sensitivity and F-measure, 100% and 97.1%,
respectively. These results indicate the superiority of SVM over the other classifiers: it achieved an
accuracy of 94.4% and a sensitivity of 100%, and it also performed well on precision (94.4%) and
F-measure (97.1%). The accuracy obtained in this paper also exceeds that reported in previous research:
80% in Islam et al. [5], 88% in Heo et al. [6], and 76.96% in Hadianfard et al. [7], compared with
around 94.4% in this study.
[Figure: chart comparing classifier results; vertical axis: Percentage (84 to 102); horizontal axis: Naive Bayes, Random Forest, SVM, MLP]
4. CONCLUSION
In this study, we evaluated several classification algorithms for detecting stroke based on
a set of features such as age, hypertension, heart disease, blood sugar, BMI, marital status, and smoking status.
WEKA data mining software was used to evaluate and analyze the NB, RF, SVM, and MLP algorithms.
Classification performance was measured on the stroke dataset with accuracy, precision, sensitivity, and
F-measure, using 10-fold cross-validation. SVM demonstrated strong generalization ability, achieving
reliable results on both training and testing datasets, with values of 94.4%, 100%, and 97.1% for
accuracy, sensitivity, and F-measure,
respectively. In future work, a combination of other classification methods may be used to enhance the results.
REFERENCES
[1] H. K. V, H. P, G. Gupta, V. P, and P. K B, “Stroke prediction using machine learning algorithms,” International Journal of
Innovative Research in Engineering & Management, vol. 8, no. 4, Jul. 2021, doi: 10.21276/ijirem.2021.8.4.2.
[2] D. Pastore et al., “Sex-genetic interaction in the risk for cerebrovascular disease,” Current Medicinal Chemistry, vol. 24, no. 24,
[38] H. Yun, “Prediction model of algal blooms using logistic regression and confusion matrix,” International Journal of Electrical and
Computer Engineering, vol. 11, no. 3, pp. 2407–2413, 2021, doi: 10.11591/ijece.v11i3.pp2407-2413.
[39] D. Li, F. Huang, L. Yan, Z. Cao, J. Chen, and Z. Ye, “Landslide susceptibility prediction using particle-swarm-optimized multilayer
perceptron: Comparisons with multilayer-perceptron-only, BP neural network, and information value models,” Applied Sciences,
vol. 9, no. 18, Sep. 2019, doi: 10.3390/app9183664.
[40] G. Zeng, “On the confusion matrix in credit scoring and its analytical properties,” Communications in Statistics - Theory and
Methods, vol. 49, no. 9, pp. 2080–2093, 2020, doi: 10.1080/03610926.2019.1568485.
[41] R. AlShboul, F. Thabtah, A. J. W. Scott, and Y. Wang, “The application of intelligent data models for dementia classification,”
Applied Sciences, vol. 13, no. 6, 2023, doi: 10.3390/app13063612.
[42] D. Fuqua and T. Razzaghi, “A cost-sensitive convolution neural network learning for control chart pattern recognition,” Expert
Systems with Applications, vol. 150, 2020, doi: 10.1016/j.eswa.2020.113275.
BIOGRAPHIES OF AUTHORS
Areen Arabiat earned her B.Sc. in Computer Engineering in 2005 from al Balqaa
Applied University, and her M.Sc. in Intelligent Transportation Systems (ITS) from Al Ahliyya
Amman University in 2022. She has been a computer lab supervisor at the Faculty of
Engineering, Al-Ahliyya Amman University, since 2013. Her research interests focus on
machine learning, data mining, artificial intelligence, and image processing. She can
be contacted at email: [email protected].