Performance Analysis of Machine Learning Approaches in Stroke Prediction
Email:‡ tamara.meghla@tuni.fi
¶ Institute of Information Technology, Jahangirnagar University, Dhaka, Bangladesh
Email:¶ [email protected], [email protected]
Abstract—Most strokes occur due to an unexpected obstruction of the pathways that supply both the brain and the heart. Early awareness of the different warning signs of stroke can minimize the effects of a stroke. This research work proposes early prediction of stroke using different machine learning approaches, based on the occurrence of hypertension, body mass index level, heart disease, average glucose level, smoking status, previous stroke, and age. Using these feature attributes, ten different classifiers have been trained: Logistic Regression, Stochastic Gradient Descent, Decision Tree Classifier, AdaBoost Classifier, Gaussian Classifier, Quadratic Discriminant Analysis, Multi-Layer Perceptron Classifier, KNeighbors Classifier, Gradient Boosting Classifier, and XGBoost Classifier. Afterwards, the results of the base classifiers are aggregated using the weighted voting approach to reach the highest accuracy. The proposed study has achieved an accuracy of 97%, with the weighted voting classifier performing better than the base classifiers. The area under the curve value of the weighted voting classifier is also high, and its false positive rate and false negative rate are the lowest among the classifiers compared. As a result, weighted voting is close to an ideal classifier for predicting stroke, and it can be used by physicians and patients to detect a potential stroke early and prescribe accordingly.
Keywords—Stroke, Machine Learning, Confusion Matrices, Area Under Curve (AUC), Weighted Voting, Correlation Matrix

I. INTRODUCTION
A stroke occurs when the blood flow to various areas of the brain is disrupted or diminished; the cells in those regions do not get nutrients and oxygen and start to die. A stroke is a medical emergency that requires immediate care. Early detection and proper management are required to minimize further damage to the affected area of the brain and other complications elsewhere in the body. According to the World Health Organization (WHO), fifteen million people worldwide suffer a stroke every year, and an affected individual dies every 4-5 minutes.
The two forms of stroke are ischemic and hemorrhagic. In an ischemic stroke, blood flow is blocked by clots, and in a hemorrhagic stroke, a weak blood vessel bursts and bleeds into the brain. Stroke can be prevented by a healthy, balanced lifestyle: giving up harmful habits such as smoking and drinking, controlling body mass index (BMI) and average glucose level, and maintaining good heart and kidney health. Predicting stroke is necessary so that it can be treated in time to prevent permanent damage or death. This paper considers hypertension, BMI level, heart disease, and average glucose level as parameters for predicting stroke. In addition, machine learning can play a vital role in the decision-making processes of the proposed prediction system [1]–[3].
In the literature, very few recorded research works have used machine learning models to predict stroke [4]–[9]. The machine learning algorithms used include artificial neural network (ANN), stochastic gradient descent, the C4.5 decision tree algorithm, k-nearest neighbor (kNN), principal component analysis (PCA), convolutional neural network (CNN), and naive Bayes. Attributes such as hypertension, BMI level, average glucose level, and heart disease are correlated with stroke [10].
Our contributions in this paper are as follows:
• A weighted voting classifier is proposed for predicting stroke using attributes such as hypertension, body mass index level, heart disease, average glucose level, smoking status, previous stroke, and age.
• The performance of the proposed weighted voting classifier is compared with state-of-the-art classifiers such as Logistic Regression (LR), Stochastic Gradient Descent (SGD), Decision Tree Classifier (DTC), AdaBoost, Gaussian, Quadratic Discriminant Analysis (QDA), Multi-Layer Perceptron (MLP), KNeighbors, Gradient Boosting Classifier (GBC), and XGBoost (XGB).
The rest of the paper is organized as follows. Section 2 reviews the existing research. The research methodology is presented in Section 3 in three parts: data description, machine learning classifiers and evaluation matrices, and implementation procedures. Section 4 presents the results and discussion, covering the correlation results and the performance analysis. Finally, the conclusion is discussed in
Authorized licensed use limited to: STAATS U UNIBIBL BREMEN. Downloaded on May 22,2025 at 09:18:45 UTC from IEEE Xplore. Restrictions apply.
Fourth International Conference on Electronics, Communication and Aerospace Technology (ICECA-2020)
IEEE Xplore Part Number: CFP20J88-ART; ISBN: 978-1-7281-6387-1
Section 5.

II. LITERATURE REVIEW
Many researchers have already used machine learning based approaches to predict strokes. Govindarajan et al. [11] conducted a study to categorize stroke disorders using a combination of text mining and a machine learning classifier, with data collected for 507 patients. For their analysis, they used various machine learning approaches for training, including ANN, and the SGD algorithm gave them the best value, which was 95%.
Amini et al. [4], [12] conducted research to predict stroke incidence. Their study collected 807 healthy and unhealthy subjects and categorized 50 risk factors for stroke, including diabetes, cardiovascular disease, smoking, hyperlipidemia, and alcohol use. Of the two techniques they used, the C4.5 decision tree algorithm gave the best accuracy, 95%, while the accuracy of k-nearest neighbor was 94%.
Cheng et al. [13] published a report on estimating the prognosis of ischemic stroke. Their analysis used data from 82 ischemic stroke patients and two ANN models, which reached precision values of 79% and 95%.
Cheon et al. [14]–[16] performed a study to predict stroke patient mortality. They used data from 15099 patients to identify stroke occurrence, with a deep neural network approach to detect strokes. The authors used PCA to extract information from medical record history and predict stroke. They obtained an area under the curve (AUC) value of 83%.
Singh et al. [17] performed a study on stroke prediction using artificial intelligence. In their research, they used a different method for predicting stroke on the Cardiovascular Health Study (CHS) dataset. They applied the decision tree algorithm for feature extraction together with principal component analysis, and used a neural network classification algorithm to construct the model, obtaining 97% accuracy.
Chin et al. [18] performed a study on automated detection of early ischemic stroke. The main purpose was to develop a system that uses a CNN to automate the detection of early ischemic stroke. They collected 256 images to train and test the CNN model. In their system, image preprocessing removes the areas where a stroke cannot occur, and a data augmentation method is used to increase the number of collected images. Their CNN method gave 90% accuracy.
Sung et al. [5] performed a study to develop a stroke severity index. They collected data from 3577 patients with acute ischemic stroke. For their prediction models, they used various data mining techniques and linear regression. Their best result came from the k-nearest neighbor model (95% CI).
Monteiro et al. [19] performed a study on predicting the functional outcome of ischemic stroke using machine learning. In their research, they applied this technique to patients at three months after admission. They obtained an AUC value above 90%.
Kansadub et al. [20] performed a study to predict stroke risk. The authors employed Naive Bayes, Decision Tree, and Neural Network to analyze data and predict stroke, using accuracy and AUC as their assessment indicators. Among these algorithms, decision tree and naive Bayes were the most accurate.
Adam et al. [21] performed a study to classify ischemic stroke. They used two models, k-nearest neighbor and a decision tree algorithm. In their research, the decision tree algorithm was more usable for the medical specialists who used it to classify stroke.

III. RESEARCH METHODOLOGY
This section is divided into three parts: data description, machine learning classifiers & evaluation matrices, and implementation procedures. These are described below.

A. Data Description
The dataset used in this paper was acquired from a medical clinic in Bangladesh. It contains the records of 5110 people. The attributes are described below:
age: This attribute gives a person's age. It is numerical data.
gender: This attribute gives a person's gender. It is categorical data.
hypertension: This attribute indicates whether the person is hypertensive or not. It is numerical data.
work type: This attribute represents the person's work scenario. It is categorical data.
residence type: This attribute represents the person's living scenario. It is categorical data.
heart disease: This attribute indicates whether the person has a heart disease or not. It is numerical data.
avg glucose level: This attribute gives the person's average glucose level. It is numerical data.
bmi: This attribute gives the person's body mass index. It is numerical data.
ever married: This attribute represents the person's marital status. It is categorical data.
smoking status: This attribute gives the person's smoking condition. It is categorical data.
stroke: This attribute indicates whether the person has previously had a stroke or not. It is numerical data.
Among all attributes, stroke is the decision class and the rest of the attributes form the response class.

B. Machine Learning Classifiers & Evaluation Matrices
This section discusses the ten machine learning classifiers used here to build stroke predictors: (1) LR, (2) SGD, (3) DTC, (4) AdaBoost, (5) Gaussian, (6) QDA, (7) MLP, (8) KNeighbors, (9) GBC, and (10) XGB. These classifiers were chosen because they are well known for building vulnerability predictors and have been used in several research works similar to ours [22], [23]. Moreover, these models are evaluated by measuring the confusion matrices.
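As an illustration of the confusion-matrix evaluation mentioned above, the following minimal sketch derives accuracy, false positive rate, and false negative rate from binary predictions. The function names and the toy labels are hypothetical and are not taken from the paper; the encoding 1 = stroke, 0 = no stroke is an assumption.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FP, TN, FN for binary labels (assumed: 1 = stroke, 0 = no stroke)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def summary_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive accuracy, FP rate, and FN rate from confusion-matrix counts."""
    tp, fp, tn, fn = (counts[k] for k in ("tp", "fp", "tn", "fn"))
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # FP rate: share of actual negatives predicted as positive.
        "fp_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # FN rate: share of actual positives predicted as negative.
        "fn_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Hypothetical labels, for illustration only (not data from the paper).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
rates = summary_rates(counts)
print(counts)
print(rates)
```

The same two functions could be applied to the predictions of each base classifier and of the aggregated classifier to compare their FP and FN rates on a common test set.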
Feature                    (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)    (11)    (12)
(1) Gender               1.000  -0.028   0.021   0.085   0.031   0.033   0.007   0.055   0.027   0.068   0.009  -0.027
(2) Age                 -0.028   1.000   0.280   0.260  -0.680  -0.180  -0.014   0.240   0.320   0.079   0.250   0.990
(3) Hypertension         0.021   0.280   1.000   0.110  -0.160  -0.031   0.007   0.170   0.160   0.013   0.130   0.280
(4) Heart Disease        0.085   0.260   0.110   1.000  -0.110  -0.030   0.003   0.160   0.036   0.063   0.130   0.270
(5) Married              0.031  -0.680  -0.160  -0.110   1.000   0.170   0.006  -0.160  -0.330  -0.085  -0.110  -0.680
(6) Work Type            0.033  -0.180  -0.031  -0.030   0.170   1.000   0.019  -0.033  -0.180  -0.034  -0.032  -0.170
(7) Residence type       0.007  -0.014   0.007   0.003   0.006   0.019   1.000   0.005   0.000  -0.032  -0.015  -0.011
(8) Avg glucose level    0.055   0.240   0.170   0.160  -0.160  -0.033   0.005   1.000   0.170   0.025   0.130   0.240
(9) BMI                  0.027   0.320   0.160   0.036  -0.330  -0.180   0.000   0.170   1.000   0.044   0.035   0.320
(10) Smoking status      0.068   0.079   0.013   0.063  -0.085  -0.034  -0.032   0.025   0.044   1.000   0.031   0.080
(11) Stroke              0.009   0.250   0.130   0.130  -0.110  -0.032  -0.015   0.130   0.035   0.031   1.000   0.240
(12) Age cat            -0.027   0.990   0.280   0.270  -0.680  -0.170  -0.011   0.240   0.320   0.080   0.240   1.000
Heatmap axes: Features (horizontal), Target (vertical); color scale from -0.5 to 1.0. Column numbers refer to the numbered rows.
Fig. 2. Correlation matrix among the socio-demographic, lifestyle, and disease attributes.
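A correlation matrix of this kind can be computed pairwise over the attribute columns. The sketch below assumes the Pearson coefficient, since the text does not state which correlation measure was used; the column values are a small hypothetical sample, not the paper's data.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long columns."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("columns must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations between all named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy sample, for illustration only (not data from the paper).
toy = {
    "age": [30, 45, 60, 75],
    "hypertension": [0, 0, 1, 1],
    "stroke": [0, 0, 1, 1],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, [round(v, 3) for v in row.values()])
```

Categorical attributes such as gender or work type would first need a numeric encoding before they can enter such a matrix; the encoding used for Fig. 2 is not described in the text.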
TABLE II
MEASUREMENT RESULTS OF ML CLASSIFIERS FOR PREDICTING STROKE
TABLE III
PERFORMANCE COMPARISON OF STROKE PREDICTION MODELS
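The weighted voting step that aggregates the base classifiers can be sketched as follows. This is a minimal illustration under stated assumptions: it uses soft voting (a weighted average of each classifier's predicted stroke probability) with a 0.5 decision threshold, and the weights and probabilities are hypothetical. The text does not specify how the weights were chosen or whether hard or soft voting was used.

```python
from typing import Dict, Tuple


def weighted_vote(
    probabilities: Dict[str, float],
    weights: Dict[str, float],
    threshold: float = 0.5,
) -> Tuple[float, int]:
    """Combine per-classifier stroke probabilities into one weighted decision.

    Returns the weighted average probability and the resulting class label
    (1 if the average reaches the threshold, otherwise 0).
    """
    total_weight = sum(weights[name] for name in probabilities)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(weights[name] * p for name, p in probabilities.items()) / total_weight
    return score, int(score >= threshold)


# Hypothetical weights and base-classifier outputs (not taken from the paper).
weights = {"LR": 1.0, "GBC": 2.0, "XGB": 3.0}
probabilities = {"LR": 0.30, "GBC": 0.70, "XGB": 0.80}
score, label = weighted_vote(probabilities, weights)
print(round(score, 3), label)
```

One common design choice, assumed here rather than confirmed by the source, is to give larger weights to base classifiers that score better on a validation set, so that stronger models dominate the aggregated decision.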
Fig. 3. FP & FN rates of the different classifiers.

is because more True Positives and True Negatives are found than False Negatives or False Positives. The classifier is unable to differentiate between positive and negative class points when AUC = 0.5; in that case it either predicts a random class or a constant class for all data points. The AUC values for LR, SGD, DTC, AdaBoost, Gaussian, QDA, MLP, KNN, GBC, XGB are 0.76, 0.73, 0.80, 0.79, 0.77, 0.75, 0.81, 0.81, 0.85, 0.90, 0.93, respectively, and the Weighted Voting AUC value is 0.93.
Table III shows that many existing approaches use ML classifiers, and also deep learning, to predict stroke. Some state-of-the-art methods and their accuracies are compared with our proposed model, and the proposed study has achieved an accuracy of 97%.

V. CONCLUSION
The proposed research work has employed ten classifiers to evaluate the performance of predicting stroke occurrence in a person. The proposed weighted voting classifier has considered gender, age, hypertension, heart disease, average glucose level, BMI, and smoking status as feature attributes to predict stroke. The performance evaluation reveals that weighted voting provided the highest accuracy, about 97%, compared to the other commonly used machine learning algorithms. As a result, weighted voting can be considered for the prediction of stroke. The relationship between these diseases and the possibility of a stroke occurring in an individual has been evaluated. If these diseases are diagnosed and managed correctly from an early stage, this will help to reduce the occurrence of stroke. In the future, deep learning based imaging, such as brain CT scan and MRI, can be combined with the existing model to boost the performance indices.

REFERENCES
[1] M. Mahmud et al., “A brain-inspired trust management model to assure security in a cloud based IoT framework for neuroscience applications,” Cognitive Computation, vol. 10, no. 5, pp. 864–873, 2018.
[2] M. B. T. Noor, N. Z. Zenia, M. S. Kaiser, S. Al Mamun, and M. Mahmud, “Application of deep learning in detecting neurological disorders from magnetic resonance images: a survey on the detection
of Alzheimer’s disease, Parkinson’s disease and schizophrenia,” Brain Informatics, vol. 7, no. 1, pp. 1–21, 2020.
[3] M. Mahmud, M. S. Kaiser, and A. Hussain, “Deep learning in mining biological data,” arXiv preprint arXiv:2003.00108, 2020.
[4] L. Amini, R. Azarpazhouh, M. T. Farzadfar, S. A. Mousavi, F. Jazaieri, F. Khorvash, R. Norouzi, and N. Toghianfar, “Prediction and control of stroke by data mining,” International Journal of Preventive Medicine, vol. 4, no. Suppl 2, pp. S245–249, May 2013.
[5] S.-F. Sung, C.-Y. Hsieh, Y.-H. Kao Yang, H.-J. Lin, C.-H. Chen, Y.-W. Chen, and Y.-H. Hu, “Developing a stroke severity index based on administrative data was feasible using data mining techniques,” Journal of Clinical Epidemiology, vol. 68, no. 11, pp. 1292–1300, Nov. 2015.
[6] M. C. Paul, S. Sarkar, M. M. Rahman, S. M. Reza, and M. S. Kaiser, “Low cost and portable patient monitoring system for e-health services in Bangladesh,” in 2016 International Conference on Computer Communication and Informatics (ICCCI), 2016, pp. 1–4.
[7] S. M. Reza, M. M. Rahman, M. H. Parvez, M. S. Kaiser, and S. Al Mamun, “Innovative approach in web application effort & cost estimation using functional measurement type,” in 2015 International Conference on Electrical Engineering and Information Communication Technology (ICEEICT). IEEE, 2015, pp. 1–7.
[8] M. Asif-Ur-Rahman, F. Afsana, M. Mahmud, M. S. Kaiser, M. R. Ahmed, O. Kaiwartya, and A. James-Taylor, “Toward a heterogeneous mist, fog, and cloud-based framework for the internet of healthcare things,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4049–4062, 2018.
[9] H. M. Ali, M. S. Kaiser, and M. Mahmud, “Application of convolutional neural network in segmenting brain regions from MRI data,” in International Conference on Brain Informatics. Springer, 2019, pp. 136–146.
[10] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications of deep learning and reinforcement learning to biological data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2063–2079, 2018.
[11] P. Govindarajan, R. K. Soundarapandian, A. H. Gandomi, R. Patan, P. Jayaraman, and R. Manikandan, “Classification of stroke disease using machine learning algorithms,” Neural Computing and Applications, vol. 32, no. 3, pp. 817–828, Feb. 2020.
[12] S. M. Reza, M. M. Rahman, and S. Al Mamun, “A new approach for road networks-a vehicle XML device collaboration with big data,” in 2014 International Conference on Electrical Engineering and Information & Communication Technology. IEEE, 2014, pp. 1–5.
[13] C.-A. Cheng, Y.-C. Lin, and H.-W. Chiu, “Prediction of the prognosis of ischemic stroke patients after intravenous thrombolysis using artificial neural networks,” Studies in Health Technology and Informatics, vol. 202, pp. 115–118, 2014.
[14] S. Cheon, J. Kim, and J. Lim, “The use of deep learning to predict stroke patient mortality,” International Journal of Environmental Research and Public Health, vol. 16, no. 11, 2019.
[15] M. S. Zulfiker, N. Kabir, A. A. Biswas, P. Chakraborty, and M. M. Rahman, “Predicting students’ performance of the private universities of Bangladesh using machine learning approaches,” International Journal of Advanced Computer Science and Applications, vol. 11, no. 3, 2020.
[16] S. Rahman, T. Sharma, S. Reza, M. Rahman, M. Kaiser et al., “PSO-NF based vertical handoff decision for ubiquitous heterogeneous wireless network (UHWN),” in 2016 International Workshop on Computational Intelligence (IWCI). IEEE, 2016, pp. 153–158.
[17] M. S. Singh and P. Choudhary, “Stroke prediction using artificial intelligence,” in 2017 8th Annual Industrial Automation and Electromechanical Engineering Conference (IEMECON), Aug. 2017, pp. 158–161.
[18] C. Chin, B. Lin, G. Wu, T. Weng, C. Yang, R. Su, and Y. Pan, “An automated early ischemic stroke detection system using CNN deep learning algorithm,” in 2017 IEEE 8th International Conference on Awareness Science and Technology (iCAST), Nov. 2017, ISSN: 2325-5994.
[19] M. Monteiro, A. C. Fonseca, A. T. Freitas, T. Pinho e Melo, A. P. Francisco, J. M. Ferro, and A. L. Oliveira, “Using machine learning to improve the prediction of functional outcome in ischemic stroke patients,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, pp. 1953–1959, Nov. 2018.
[20] T. Kansadub, S. Thammaboosadee, S. Kiattisin, and C. Jalayondeja, “Stroke risk prediction model based on demographic data,” in 2015 8th Biomedical Engineering International Conference (BMEiCON), Nov. 2015, pp. 1–3.
[21] S. Y. Adam, A. Yousif, and M. B. Bashir, “Classification of ischemic stroke using machine learning algorithms,” International Journal of Computer Applications, vol. 149, no. 10, pp. 26–31, Sep. 2016.
[22] H. Lee, E.-J. Lee, S. Ham, H.-B. Lee, J. S. Lee, S. U. Kwon, J. S. Kim, N. Kim, and D.-W. Kang, “Machine learning approach to identify stroke within 4.5 hours,” Stroke, vol. 51, no. 3, pp. 860–866, 2020.
[23] T. Kansadub, S. Thammaboosadee, S. Kiattisin, and C. Jalayondeja, “Stroke risk prediction model based on demographic data,” in 2015 8th Biomedical Engineering International Conference (BMEiCON). IEEE, 2015, pp. 1–3.