BANASD603
Introduction
Business Problem
Data Used
Dataset Description
Data Preprocessing
Processing and Splitting of the Dataset
Model Building
Selection of Machine Learning Models
Training the Models
Model Training and Validation
Results and Model Comparisons
Conclusion
References
Appendix
Appendix A: Dataset Link
Appendix B: ROC Curve
Appendix C: Precision-Recall Curve
Introduction
Stroke is reported to account for more than 11% of deaths and is one of
the leading causes of long-term disability worldwide. A patient's risk of
stroke should therefore be assessed as early as possible so that death or
other severe outcomes can be prevented (Feigin et al., 2017). Predicting
stroke is not easy: it depends on many prognostic factors, such as age,
medical history, and lifestyle habits, which are best handled with
advanced analytical methods (Hassan et al., 2024).
This report focuses on building predictive models that estimate the
likelihood of stroke from several health attributes, such as age,
hypertension, heart disease, and smoking status. The collected patient
health data is fed into several machine learning models, and their
performance is compared to determine which is most suitable for practical
use (Zhi et al., 2024). Such models could identify high-risk patients much
earlier, so that they receive adequate interventions in good time and
resources can be pooled where they are most needed.
The project aligns with the promise of data-driven intervention for better
patient outcomes as the emphasis on predictive analytics grows in
healthcare. With models like these, health providers could shift from
reactive to proactive care and, going forward, reduce the incidence of
stroke while increasing care efficiency.
Business Problem
Stroke is one of the most critical healthcare issues of the modern age
because of its high mortality and disability rates. Early detection of
patients at high risk of stroke is essential, as it enables healthcare
personnel to take preventive measures and provide early interventions so
that patients' lives can be saved (Chadaga et al., 2023).
Despite the availability of various clinical tools, accurately predicting
stroke risk remains a major challenge because of the complexity of the
factors involved, such as age, hypertension, heart disease, and smoking
habits. Healthcare providers need a reliable, data-driven method to
identify the people most likely to suffer a stroke, allowing for efficient
resource allocation and treatment prioritization.
The business benefits of a model that can accurately predict stroke are
substantial. Hospitals and clinics can improve their decision-making by
focusing most of their resources on individuals with a higher likelihood
of stroke, thereby reducing healthcare costs and improving patient
outcomes.
Data Used
Dataset Description
The data used for this project comes from Kaggle's Stroke Prediction
Dataset. It holds over 5,100 records of patient health data, one row per
patient. The dataset contains important health attributes such as age,
gender, hypertension, heart disease, average glucose level, body mass
index (BMI), and smoking status. The target variable is stroke, whose
value is 1 if the patient had a stroke and 0 otherwise. These attributes
are important for building a predictive machine learning model because
they are already recognized as risk factors for stroke by global health
standards.
Data Preprocessing
Before building the predictive models, several steps were taken to clean
and preprocess the data. The BMI column contained missing values, which
were filled with the column's median so that no data points were lost.
Additionally, the categorical variables gender, ever_married, work_type,
Residence_type, and smoking_status were converted into numerical
representations using One-Hot Encoding.
Figure 3: Handling Missing Values
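The two preprocessing steps described above can be sketched as follows.
This is a minimal illustration, not the report's original code: it assumes
pandas, and the small data frame is made up, borrowing only the column
names from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse the dataset's column names; values are illustrative only.
df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "gender": ["Male", "Female", "Male", "Female"],
    "ever_married": ["Yes", "Yes", "Yes", "No"],
    "work_type": ["Private", "Self-employed", "Private", "Private"],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban"],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked", "smokes"],
    "stroke": [1, 1, 0, 0],
})

# Fill missing BMI values with the column median so that no rows are dropped.
bmi_median = df["bmi"].median()
df["bmi"] = df["bmi"].fillna(bmi_median)

# One-hot encode the categorical columns named in the report.
categorical_cols = ["gender", "ever_married", "work_type",
                    "Residence_type", "smoking_status"]
encoded = pd.get_dummies(df, columns=categorical_cols)

print(encoded.columns.tolist())
```

Each categorical column is replaced by one indicator column per category
(for example gender_Male and gender_Female), while numeric columns pass
through unchanged.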
the minority class, helping to balance the dataset and improve the
model's ability to predict stroke cases.
The data was divided into training and testing sets in an 80:20 ratio to
check model performance. Features were scaled with scikit-learn's
StandardScaler so that all attributes are on the same scale, which is
particularly important for models such as Logistic Regression and SVM, as
standardized features help these models perform better.
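A minimal sketch of the split and scaling steps is given below. It assumes
scikit-learn and uses synthetic data in place of the stroke dataset; the
stratified split and the choice to fit the scaler on the training part
only are assumptions of this sketch, not details stated in the report.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the stroke label (5% positives).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y = np.array([1] * 10 + [0] * 190)

# 80:20 split; stratify keeps the positive rate similar in both parts (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training part only, then reuse its mean and
# standard deviation on the test part to avoid leaking test information.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

After scaling, every training feature has mean 0 and standard deviation 1,
so distance-based and gradient-based models are not dominated by features
with large numeric ranges such as glucose level.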
Model Building
Selection of Machine Learning Models
To tackle the stroke prediction problem, several machine learning models
were selected and their performance compared. The models used include:
Logistic Regression
Random Forest Classifier
K-Nearest Neighbours (KNN)
Support Vector Machine (SVM)
Gradient Boosting Classifier
These models were chosen due to their effectiveness in classification
tasks and their ability to handle different types of data.
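The five models listed above could be set up and compared along the
following lines. This is a sketch under stated assumptions: it uses
scikit-learn with mostly default hyperparameters and a synthetic,
imbalanced dataset from make_classification, so the scores it prints say
nothing about the stroke data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (about 10% positives); not the stroke dataset.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are illustrative defaults, not the report's settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Train each model on the same split and record its test accuracy.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)

for name, value in accuracy.items():
    print(f"{name}: {value:.2f}")
```

Keeping the models in one dictionary makes it easy to train them on the
same split and to add precision, recall, or AUC alongside accuracy.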
Results and Model Comparisons
Among the machine learning models tested, Gradient Boosting emerged as the
best-performing model for predicting stroke cases. Although its accuracy
(0.94) was similar to that of many other models, Gradient Boosting
achieved a higher precision (0.44) and F1-score (0.11) for stroke
predictions than any other model. Its AUC-ROC score (0.81) indicated a
good balance between true positive and false positive rates, making it the
most effective model for distinguishing between stroke and non-stroke
cases.
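Because the F1-score is the harmonic mean of precision and recall, the two
reported figures imply a value for recall on the stroke class. The short
calculation below is a sketch of that arithmetic; the helper names are
hypothetical, and since the reported precision and F1-score are rounded,
the result is only approximate.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    return f1 * precision / (2 * precision - f1)


# Figures reported for Gradient Boosting on the stroke class.
recall = implied_recall(precision=0.44, f1=0.11)
print(round(recall, 3))
```

With a precision of 0.44 and an F1-score of 0.11, the implied recall is
roughly 0.06, meaning that only a small share of actual stroke cases would
be flagged despite the high overall accuracy.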
Conclusion
In this project, we tested the ability of several machine learning models
to predict strokes in patients from health attributes such as age,
hypertension, and heart disease. All of the models showed promise in
balancing precision and recall for stroke prediction, but they suffered
from class imbalance, with stroke cases forming the minority class. Even
though the models were very accurate, recall needs to improve so that more
high-risk patients are detected. The results suggest that predictive
analytics has promise in healthcare and could support providers in acting
proactively. Future efforts should aim at improving recall and addressing
class imbalance so that stroke cases are detected as early as possible.
Based on the current dataset and analysis, Gradient Boosting is the model
we would choose for making predictions.
References
Feigin, V. L., Norrving, B., & Mensah, G. A. (2017). Global burden of stroke.
Circulation Research, 120(3), 439-448.
https://doi.org/10.1161/CIRCRESAHA.116.308413
Hassan, A., Gulzar Ahmad, S., Ullah Munir, E., Ali Khan, I., & Ramzan, N.
(2024). Predictive modelling and identification of key risk factors for stroke
using machine learning. Scientific Reports, 14(1), 11498.
https://doi.org/10.1038/s41598-024-61665-4
Zhi, S., Hu, X., Ding, Y., Chen, H., Li, X., Tao, Y., & Li, W. (2024). An
exploration on the machine-learning-based stroke prediction model.
Frontiers in Neurology, 15, 1372431.
https://doi.org/10.3389/fneur.2024.1372431
Chadaga, K., Sampathila, N., Prabhu, S., & Chadaga, R. (2023). Multiple
explainable approaches to predict the risk of stroke using artificial
intelligence. Information, 14(8), 435. https://doi.org/10.3390/info14080435
Appendix
Appendix A: Dataset Link
https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
The link above points to the dataset used in this project and report.
Appendix C: Precision-Recall Curve