Research Article: Stroke Disease Detection and Prediction Using Robust Learning Approaches
Tahia Tazin,1 Md Nur Alam,1 Nahian Nakiba Dola,1 Mohammad Sajibul Bari,1 Sami Bourouis,2 and Mohammad Monirujjaman Khan1

1Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka 1229, Bangladesh
2Department of Information Technology, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
Received 7 October 2021; Revised 4 November 2021; Accepted 9 November 2021; Published 26 November 2021
Copyright © 2021 Tahia Tazin et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Stroke is a medical disorder in which blood vessels in the brain rupture, causing damage to the brain. Symptoms may develop when the supply of blood and other nutrients to the brain is interrupted. According to the World Health Organization (WHO), stroke is the leading cause of death and disability globally. Early recognition of the various warning signs of a stroke can help reduce its severity. Different machine learning (ML) models have been developed to predict the likelihood of a stroke occurring in the brain. This research uses a range of physiological parameters and machine learning algorithms, namely Logistic Regression (LR), Decision Tree (DT) classification, Random Forest (RF) classification, and a Voting Classifier, to train four different models for reliable prediction. Random Forest was the best-performing algorithm for this task, with an accuracy of approximately 96 percent. The dataset used in the development of the method was the open-access Stroke Prediction dataset. The accuracy of the models used in this investigation is significantly higher than that reported in previous studies, indicating that the models are more reliable, and numerous model comparisons have established their robustness.
previous stroke-related research has focused on, among other things, the prediction of heart attacks. Brain stroke has been the subject of very few studies. The main motivation of this paper is to demonstrate how ML may be used to forecast the onset of a brain stroke. The most important aspect of the methods employed and the findings achieved is that, among the four classification algorithms tested, Random Forest fared the best, achieving higher accuracy than the others. One downside of the model is that it is trained on tabular record data rather than real-time brain images. The implementation of four ML classification methods is shown in this paper.

Numerous academics have previously utilized machine learning to forecast strokes. Govindarajan et al. [3] used text mining and a machine learning classifier to classify stroke disorders in 507 individuals. They tested a variety of machine learning methods for training, including an Artificial Neural Network (ANN), and found that the SGD algorithm provided the greatest value, 95 percent. Amini et al. [4, 5] performed research to predict stroke occurrence. They classified 50 risk variables for stroke, diabetes, cardiovascular disease, smoking, hyperlipidemia, and alcohol consumption in 807 healthy and unhealthy individuals. They used the two most accurate methods: the C4.5 decision tree algorithm (95 percent accuracy) and the k-nearest neighbor algorithm (94 percent accuracy). Cheng et al. [6] presented a study on estimating the prognosis of ischemic stroke. In their study, they used 82 ischemic stroke patient data sets and two ANN models, which reached accuracy values of 79 and 95 percent. Cheon et al. [7–9] conducted research to determine the predictability of stroke patient death. They identified stroke incidence using 15,099 individuals and detected strokes using a deep neural network method. The authors utilized PCA to extract information from the medical records and predict strokes. They achieved an 83 percent area under the curve (AUC). Singh et al. [10] conducted research using artificial intelligence to predict strokes, employing a new technique based on the Cardiovascular Health Study (CHS) dataset. Additionally, they used the decision tree method for feature extraction, followed by principal component analysis. In this case, the model was built using a neural network classification method, and it achieved 97 percent accuracy.

Chin et al. [11] conducted research to determine the accuracy of automated early ischemic stroke detection. The major objective of their research was to create a method for automating primary ischemic stroke detection using a Convolutional Neural Network (CNN). They amassed 256 images for training and testing the CNN model, and they used data augmentation to enlarge the gathered image set during image preparation. Their CNN technique achieved a 90 percent accuracy rate. Sung et al. [12] conducted research to establish a stroke severity index. They gathered data on 3,577 patients who had an acute ischemic stroke and utilized a variety of data mining methods, including linear regression, to create their predictive models. Their predictions outperformed the k-nearest neighbor method (95% confidence interval). Monteiro et al. [13] used machine learning to predict the functional prognosis of ischemic stroke. They tested this method on patient data three months after admission. They obtained an AUC value of greater than 90. Kansadub et al. [14] conducted research to determine the risk of stroke. The authors analyzed the data to predict strokes using Naive Bayes, decision trees, and neural networks, assessing each predictor's accuracy and AUC. They compared all of these algorithms, with Naive Bayes providing the most accurate results. Adam et al. [15] conducted research on the classification of ischemic stroke. They categorized ischemic strokes using two models: the k-nearest neighbor method and the decision tree technique. In their study, the decision tree method was found to be more useful by medical experts when used to categorize strokes.

The majority of studies had an accuracy rate of around 90%, which was considered quite good. However, the novelty of our research is that we used several well-known machine learning methods to obtain the best result. Random forest (RF), decision tree (DT), voting classifier (VC), and logistic regression (LR) were the most successful algorithms, with F1-scores of 96, 94, 91, and 87 percent, respectively. The accuracy of the models used in this research is much greater than that of the models used in previous investigations, suggesting that our models are more trustworthy; they have proven resilient across many model comparisons.

As mentioned earlier, the major contribution of this research is that we have applied several machine learning models to a publicly available dataset. In previous work, most researchers used a single model to predict stroke disease. We instead used four different models and compared the results with previous work. All the results and comparisons are discussed in the following sections. The rest of this article is set out as follows: the experimental methodology and procedures are described in Section 2; the result analysis is provided in Section 3; and conclusions are discussed in Section 4.

2. Procedure and Experimental Methodology

This section includes a description of the dataset, a block diagram, a flow diagram, and the evaluation matrices, as well as the process and methodology used in the study.

2.1. Proposed System. The data becomes available for model construction once it has been processed. A preprocessed dataset and machine learning techniques are needed for model construction. LR, DT classification, RF classification, and a voting classifier are the methods used. After creating the four alternative models, accuracy measures, namely the accuracy score, precision score, recall score, and F1-score, are used to compare them. The designed system's block diagram is shown in Figure 1. All the components of the block diagram are discussed in the following subsections.
Journal of Healthcare Engineering 3
Figure 1: Block diagram of the designed system (dataset → data preprocessing: missing data analysis, handling imbalanced data, label encoding → machine learning algorithms: random forest, logistic regression, decision tree, voting classifier → model building and comparing).
2.2. Dataset. The stroke prediction dataset [16] was used to perform the study. There were 5110 rows and 12 columns in this dataset. The value of the output column stroke is either 1 or 0: the value 0 indicates that no stroke risk was identified, while the value 1 indicates that a stroke risk was detected. The probability of 0 in the output column greatly exceeds the probability of 1 in this dataset: only 249 rows in the stroke column have the value 1, whereas 4861 rows have the value 0. To improve accuracy, data preprocessing is used to balance the data. Figure 2 shows the total number of stroke and nonstroke records in the output column before preprocessing. From Figure 2, it is clear that this dataset is imbalanced. The SMOTE technique has been used to balance it.

2.3. Preprocessing. Before building a model, data preprocessing is required to remove unwanted noise and outliers from the dataset that could lead the model to depart from its intended training. This stage addresses everything that prevents the model from functioning more efficiently. Following the collection of the relevant dataset, the data must be cleaned and prepared for model development. As stated before, the dataset used has twelve characteristics. To begin with, the column id is omitted since its presence has no bearing on model construction. The dataset is then inspected for null values, which are filled if any are detected. In this case, the null values in the column BMI are filled using that column's mean.

Label encoding converts the dataset's string literals to integer values that the computer can comprehend. As models are trained on numbers, the strings must be converted to integers. The gathered dataset has five columns of data type string. All strings are encoded during label encoding, and the whole dataset is transformed into a collection of numbers. The dataset used for stroke prediction is very imbalanced: of its 5110 rows, 249 indicate the possibility of a stroke and 4861 confirm the lack of one. While training a model on such data may yield high accuracy, other measures such as precision and recall remain inadequate. If such imbalanced data is not dealt with properly, the findings will be inaccurate and the predictions ineffective. As a result, to obtain an efficient model, this imbalance must be dealt with first. The SMOTE technique was employed for this purpose. Figure 3 depicts the dataset's balanced output column.

The next stage is to construct the model after finishing data preparation and managing the imbalanced dataset. To improve the accuracy and efficiency of this job, the data is divided into training and testing data with a ratio of 80:20.
Figure 2: Total number of stroke and normal data.

The random forest model is shown in Figure 4. The flexibility of the random forest is one of its most alluring features. It may be utilized for regression and classification tasks, and the overall weighting given to input characteristics is readily apparent. Additionally, it is a beneficial approach because the default hyperparameters it employs often give clear predictions, and understanding the hyperparameters is straightforward since there are relatively few of them to begin with. Overfitting is a well-known problem in machine learning, although it seldom occurs with the random forest classifier: if there are sufficient trees in the forest, the classifier will not overfit the model.
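A minimal sketch of training a random forest with scikit-learn [17] illustrates the point about default hyperparameters; the synthetic data and the tree count here are illustrative assumptions, not the paper's exact configuration.

```python
# Random forest with (mostly) default hyperparameters on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators controls the number of trees; with enough trees the
# ensemble's variance drops and overfitting is rarely a problem.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```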
4000
2.4.2. Decision Tree. Both regression and classification
concerns are addressed using classification with DT [18].
3000
Furthermore, as the input variables already have a related
count
Voting
Prediction
Level-1
Root
Node
Internal Level-2
Leaf Node
Node
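A minimal decision tree sketch with scikit-learn [18] on synthetic data; the split criterion and depth limit are illustrative assumptions, not the paper's settings.

```python
# Decision tree classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# max_depth limits tree growth; unconstrained trees tend to overfit.
tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=1)
tree.fit(X_tr, y_tr)
print(f"test accuracy: {tree.score(X_te, y_te):.2f}")
```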
final output value as the mode value of the resultant output. Because the particular probability values associated with each model are disregarded, this approach is analogous to computing the arithmetic mean of a collection of numbers: only the output of each model is considered.

2.4.4. Logistic Regression. The flowchart for the logistic regression model is shown in Figure 7. LR is one of the most commonly used supervised ML algorithms [20]. It is a forecasting method that uses a collection of independent factors to predict a categorical dependent variable. As a result, the output must be discrete or categorical in nature: yes or no, 0 or 1, true or false, etc., although probability values between 0 and 1 are produced internally. Logistic regression and linear regression are used in very similar ways: classification problems are addressed with LR, while regression problems are addressed with linear regression. Instead of a regression line, LR fits an S-shaped logistic function that predicts the two extreme values (0 or 1).

2.5. Evaluation Matrix. Figure 8 depicts the confusion matrix, or evaluation matrix. The confusion matrix is a tool for evaluating the performance of machine learning classification algorithms, and it has been used to test the efficiency of all the models created. It illustrates how often our models predict correctly and how often they predict incorrectly: badly predicted values are allocated to false positives and false negatives, whereas properly predicted values are assigned to true positives and true negatives. The model's accuracy, precision-recall trade-off, and AUC were utilized to assess its performance after grouping all predicted values in the matrix.

3. Result Analysis

The models' capacities, model forecasts, investigation, and eventual outcomes are examined in this part.

3.1. Data Visualization. A histogram depicts a frequency distribution with continuous classes. It is an area chart made of rectangles with bases on the class boundary intervals and areas proportional to the corresponding class frequencies. As the bases fill the spaces between the class borders, the square
Figure 6: Voting classifier architecture (learner 1 … learner n feed a meta learner, which produces the prediction result).

Figure 7: Logistic regression flowchart (feature 1 … feature n are fed to the logistic regression model, which outputs the prediction).
shapes are all linked, and their heights are proportional to the corresponding class frequencies and frequency densities. Figure 9 illustrates some important features through histograms; a histogram depicts the dataset's proportions.

Figure 9 depicts the dataset's gender, age, hypertension, heart disease, ever-married, average glucose level, and body mass index distributions. For the gender attribute, 0 means male and 1 means female; there are more female samples than male samples in this collection. Based on the age distribution, it is obvious that the sample's average age is in the 40s, and the upper limit is approximately 60. For hypertension, 0 means the individual does not have it, while 1 means the person has it. Most individuals in this dataset are healthy and have no history of heart disease. With regard to BMI, average glucose levels, and the other attributes, Figure 10 shows the relationship between each feature and the target feature: gender and stroke, age and stroke, hypertension and stroke, heart disease and stroke, ever_married and stroke, avg_glucose_level and stroke, and BMI and stroke.

3.2. Visualization of Feature Selection. The process of feature selection is shown in Figure 11. Feature selection aids in comprehending how features are linked to one another. Figure 11 shows that age, hypertension, avg_glucose_level, heart_disease, ever_married, and BMI are positively correlated with the target feature, whereas gender is negatively correlated with stroke.

3.3. Evaluation of the Model

3.3.1. Random Forest (RF). Figure 12 depicts the classification report for the RF model. In this case, the total F1-score obtained is 96 percent; the individual F1-scores are 96 percent for healthy people and 96 percent for those who have had a brain stroke. This model achieved the highest accuracy after fine-tuning; prior to fine-tuning, it had an accuracy of 92 percent. Figure 13 depicts the random forest model's predictions. The predicted outcomes and the model's calculated performance are shown in the confusion matrix: there are 2707 accurate predictions and 113 erroneous predictions.
                Predicted: NO    Predicted: YES
Actual: NO           TN               FP
Actual: YES          FN               TP

Figure 8: Confusion (evaluation) matrix.
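The reported scores follow directly from the four cells of such a matrix. As a check, the sketch below recomputes the random forest's metrics from its confusion matrix counts reported in Figure 13 (TN = 1366, FP = 41, FN = 72, TP = 1341).

```python
# Accuracy, precision, recall, and F1 from the confusion matrix cells.
tn, fp, fn, tp = 1366, 41, 72, 1341   # random forest counts (Figure 13)

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 2707 correct of 2820 total
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")      # accuracy and F1 round to 0.96
```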
3.3.2. Decision Tree. The classification report for the decision tree classification is shown in Figure 14. The final F1-score in this case is 94 percent; the individual F1-scores are 94 percent for healthy individuals and 95 percent for those who have had a brain stroke. The precision and recall are also shown in Figure 14. A fine-tuned decision tree model has also been implemented; however, fine-tuning did not improve the accuracy. Figure 15 depicts the DT model's predictions: there were 2664 accurate predictions and 156 erroneous predictions.

3.3.3. Voting Classifier. The classification report for the voting classifier is shown in Figure 16. The total F1-score obtained in this case is 91 percent; the individual F1-scores are 91 percent for healthy people and 91 percent for those who have had a stroke. The precision and recall are also shown in Figure 16. Without any fine-tuning, this model achieved 91 percent accuracy. The predictions made by the voting classifier are shown in Figure 17: the total number of accurate predictions is 2565, while the total number of erroneous predictions is 255.
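As a sketch of how such a hard-voting ensemble can be assembled (taking the mode of the base models' predicted labels, as described in Section 2.4.3), the following uses scikit-learn on synthetic data; the base models and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Hard-voting ensemble over LR, DT, and RF base models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

vc = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=2)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=2)),
    ],
    voting="hard",  # each model casts one vote; probabilities are ignored
)
vc.fit(X_tr, y_tr)
print(f"voting accuracy: {vc.score(X_te, y_te):.2f}")
```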
Figure 10: Relationship between some important features and the target feature (gender, age, hypertension, heart_disease, ever_married, avg_glucose_level, and bmi versus stroke).
Figure 11: Feature correlation with stroke (stroke 1.00, age 0.23, hypertension 0.14, avg_glucose_level 0.14, heart_disease 0.14, ever_married 0.11, bmi 0.042, gender −0.0069).
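The ranking in Figure 11 can be reproduced with pandas once the dataset has been label-encoded; the sketch below uses a small synthetic frame (column names follow the stroke dataset, the values are made up for illustration).

```python
# Correlation of each feature with the stroke target, sorted descending.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 500
stroke = rng.choice([0, 1], n)
df = pd.DataFrame({
    "age": stroke * 20 + rng.normal(45, 10, n),  # older samples stroke more
    "hypertension": rng.choice([0, 1], n),
    "bmi": rng.normal(28, 5, n),
    "stroke": stroke,
})
corr = df.corr()["stroke"].sort_values(ascending=False)
print(corr)  # stroke itself is 1.0 at the top, as in Figure 11
```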
                 Predicted: 0    Predicted: 1
True label: 0        1366              41
True label: 1          72            1341

Figure 13: Confusion matrix of random forest.
                 Predicted: 0    Predicted: 1
True label: 0        1322              85
True label: 1          71            1342

Figure 15: Confusion matrix of the decision tree.
                 Predicted: 0    Predicted: 1
True label: 0        1280             127
True label: 1         128            1285

Figure 17: Confusion matrix of the voting classifier.
[12] S.-F. Sung, C.-Y. Hsieh, Y.-H. Kao Yang et al., "Developing a stroke severity index based on administrative data was feasible using data mining techniques," Journal of Clinical Epidemiology, vol. 68, no. 11, pp. 1292–1300, 2015.
[13] M. Monteiro, A. C. Fonseca, A. T. Freitas et al., "Using machine learning to improve the prediction of functional outcome in ischemic stroke patients," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 6, pp. 1953–1959, 2018.
[14] T. Kansadub, S. Thammaboosadee, S. Kiattisin, and C. Jalayondeja, "Stroke risk prediction model based on demographic data," in Proceedings of the 2015 8th Biomedical Engineering International Conference (BMEiCON), pp. 1–3, Pattaya, Thailand, November 2015.
[15] S. Y. Adam, A. Yousif, and M. B. Bashir, "Classification of ischemic stroke using machine learning algorithms," International Journal of Computer Applications, vol. 149, no. 10, pp. 26–31, 2016.
[16] "Stroke prediction dataset," [Online]. Available: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset.
[17] "Documentation for random forest classification from scikit-learn," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
[18] "Documentation for decision tree classification from scikit-learn," [Online]. Available: https://scikit-learn.org/stable/modules/tree.html.
[19] "Voting classifier," [Online]. Available: https://towardsdatascience.com/custom-implementation-of-feature-importance-for-your-voting-classifier-model-859b573ce0e0.
[20] "Logistic regression in machine learning," [Online].
[21] G. Sailasya and G. L. A. Kumari, "Analyzing the performance of stroke prediction using ML classification algorithms," International Journal of Advanced Computer Science and Applications, vol. 12, no. 6, pp. 539–545, 2021.