Stroke Prediction
PROJECT REPORT
Submitted by
DEEPTHI J (7376222CT108)
GOWTHAM K (7376222CT116)
KANISHYA G (7376222CT123)
in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER TECHNOLOGY
DECEMBER 2024
BONAFIDE CERTIFICATE
Certified that this project report “STROKE PREDICTION” is the bonafide work
of “BIJIN GOPAL S(7376222CT106), DEEPTHI J(7376222CT108),
GOWTHAM K(7376222CT116), KANISHYA G(7376222CT123)” who
carried out the project work under my supervision.
BIJIN GOPAL S (7376222CT106)    DEEPTHI J (7376222CT108)
GOWTHAM K (7376222CT116)    KANISHYA G (7376222CT123)
Mrs. RATHNA S
Data Science
ACKNOWLEDGMENT
We would like to thank our friends, faculty, and non-teaching staff who
have directly or indirectly contributed to the success of this project.
BIJIN GOPAL S
(7376222CT106)
DEEPTHI J (7376222CT108)
GOWTHAM K (7376222CT116)
KANISHYA G (7376222CT123)
ABSTRACT
Stroke occurs when blood clots or bleeds in the brain, causing lasting
damage to mobility, cognition, vision, and communication. It is one of the most
common causes of death and long-term disability worldwide. Early detection and
intervention are critical for lowering the morbidity and death associated with
stroke. Machine Learning (ML) provides accurate and timely prediction results and
has emerged as a significant tool in healthcare settings, providing personalised
therapeutic care for stroke patients. The healthcare industry uses a range of data
mining tools to aid in disease diagnosis and early detection. The current analysis
takes into account a variety of factors that contribute to stroke. First, we examine
the characteristics of individuals who are more likely than others to suffer a stroke.
The dataset was gathered from a publicly available source, and different
classification methods were employed to predict the incidence of a stroke within a
short time. Using a dataset containing patient characteristics such as age, gender,
hypertension, heart disease, smoking status, body mass index (BMI), and other
pertinent health variables, we will test multiple machine learning algorithms to
find the most accurate predictive model. The random forest approach achieved the
best performance, with an AUC-ROC score of 0.90. Finally, several preventative measures such as
stopping smoking, abstaining from alcohol, and other factors are recommended to
lower the risk of having a stroke. This project intends to create a strong machine
learning model that can predict the likelihood of stroke in individuals based on a
range of clinical and demographic characteristics.
TABLE OF CONTENTS
Acknowledgement iv
Abstract v
Table of Contents vi
1 Introduction 1
2 Literature Survey 4
3 Objective and Methodology 8
3.1 Key Objective in Stroke Prediction 8
3.2 Methodology 10
3.3 Flowchart 14
4 Result and Discussion 15
4.1 Result 15
4.2 Discussion 17
4.3 Cost Benefit Analysis 19
5 Conclusion 22
6 References 23
7 Appendices 25
LIST OF FIGURES
CHAPTER - 1
INTRODUCTION
Strokes can be classified into two primary types: ischemic and hemorrhagic
strokes. Ischemic strokes, which are responsible for approximately 85% of all
stroke cases, occur when blood clots or narrowed arteries restrict blood flow to the
brain. This type of stroke highlights the importance of maintaining clear and
healthy blood vessels, as blockages can lead to rapid deterioration of brain
function. On the other hand, hemorrhagic strokes happen when a blood vessel
ruptures, resulting in bleeding within the brain. Both types of strokes necessitate
urgent medical attention, as timely interventions can significantly reduce the
long-term effects and improve recovery prospects. The need for swift diagnosis and
treatment underscores the importance of developing accurate stroke prediction
models that can identify at-risk individuals before a stroke occurs.
CHAPTER – 2
LITERATURE SURVEY
"Predictive Modeling of Stroke Risk Using Machine Learning: A Systematic
Review", Ryu and Kim (2024):
This review provides an in-depth analysis of machine learning models used for
stroke risk prediction. It systematically examines various machine learning
algorithms, including logistic regression, decision trees, random forests, and deep
learning approaches, evaluating their effectiveness in predicting stroke risk. It
highlights the importance of data preprocessing, feature selection, and model
validation in achieving accurate predictions. They also discuss challenges such as
data imbalance and the integration of diverse clinical, demographic, and lifestyle
data. The review concludes with future trends, emphasizing real-time monitoring
and personalized predictions.
Also addressed is the issue of imbalanced datasets, common in medical contexts,
along with strategies to mitigate their impact, ensuring more reliable predictions. It also
emphasizes the importance of combining patient demographic data, lifestyle
factors, and medical imaging features in building accurate stroke prediction
models. Case studies highlight how ensemble methods can further enhance model
performance.
Another reviewed study explores the role of machine learning in predicting stroke risk and assessing
outcomes. It reviews various machine learning techniques, including decision trees,
support vector machines, and deep learning, and their applications in stroke
prediction based on clinical, demographic, and imaging data. They discuss the
challenges of feature selection, model accuracy, and handling imbalanced datasets
in medical contexts. Additionally, it highlights how machine learning models can
aid in predicting post-stroke outcomes, helping to inform treatment decisions and
improve patient care. Future trends and research directions are also covered.
CHAPTER – 3
OBJECTIVE AND METHODOLOGY
Stroke remains one of the leading causes of mortality and long-term disability
worldwide. Its sudden onset and severe consequences make stroke prediction a critical
area in medical research. By accurately predicting stroke risk, healthcare providers
can implement preventive measures, thereby reducing the incidence of stroke,
enhancing patient outcomes, and alleviating the burden on healthcare systems. The
primary objective of stroke prediction research is to develop models and algorithms
capable of reliably identifying individuals at high risk of stroke, enabling early
intervention and personalized care.
A significant goal is to create models that can predict stroke with high
accuracy. This involves using machine learning (ML) and artificial intelligence
(AI) algorithms that can process vast amounts of data to uncover complex patterns
and associations. By leveraging data from various sources, such as electronic
health records (EHRs), wearable devices, and even social determinants of health,
these models aim to increase the accuracy of stroke prediction, which can help in
real-time risk assessment and continuous monitoring.
3.1.3 Enabling Early Intervention and Prevention
Stroke risk factors and potential causes vary widely across populations and
individual patients. Predictive models aim to support personalized treatment plans
by identifying unique risk contributors for each patient. By tailoring treatment and
monitoring approaches based on each individual’s risk profile, healthcare
providers can improve the effectiveness of preventive strategies and mitigate the
risk of stroke more precisely.
3.2 Methodology
3.2.1 Dataset
The dataset used was obtained from the publicly available Stroke Prediction
Dataset provided by Kaggle and includes 5,110 patient records. The dataset
contains the following features: gender, age, hypertension, heart disease,
ever-married status, work type, residence type, average glucose level, body mass
index (BMI), smoking status, and the stroke outcome label.
Data preprocessing is a critical step in preparing the dataset for machine learning
algorithms. The steps involved are as follows:
Handling Missing Data: Missing values in fields such as BMI and smoking
status were imputed using the mean for numerical variables and the mode for
categorical variables.
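The imputation step above can be sketched in a few lines of pandas; the toy frame and its values here are illustrative stand-ins for the actual dataset records.

```python
import pandas as pd

# Toy stand-in for the stroke dataset; values are illustrative only.
df = pd.DataFrame({
    "bmi": [22.5, None, 31.0, 27.4],
    "smoking_status": ["never smoked", None, "smokes", "never smoked"],
})

# Numerical field: impute with the mean; categorical field: impute with the mode.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
df["smoking_status"] = df["smoking_status"].fillna(df["smoking_status"].mode()[0])
```

scikit-learn's SimpleImputer offers the same mean and most-frequent strategies when the preprocessing needs to live inside a pipeline.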
The dataset was split into training and testing sets using an 80-20 ratio. The
training set was used to train the machine learning models, and the testing set was
used to evaluate each model's performance. Cross-validation was performed to
ensure the robustness of the results, particularly for models such as Neural Networks
and SVM, which are sensitive to overfitting. Hyperparameter tuning was carried
out for each model to optimize performance. For instance, in the Random Forest
model, the number of trees and maximum depth were fine-tuned, while in Neural
Networks, parameters such as learning rate, batch size, and number of epochs were
adjusted.
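The split-and-tune procedure described above can be sketched as follows; the data here are synthetic stand-ins for the stroke records, and the two-value grids are illustrative rather than the actual search spaces used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the stroke dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# 80-20 train/test split, stratified to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Cross-validated tuning of the number of trees and maximum depth.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8]},
    cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
```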
To assess the performance of each model, the following metrics were calculated:
accuracy, sensitivity, specificity, and AUC-ROC (area under the receiver
operating characteristic curve).
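These metrics can be computed from a confusion matrix and predicted probabilities; the labels and scores below are hypothetical values used purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical true labels and predicted stroke probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.8, 0.7, 0.3, 0.6, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
sensitivity = tp / (tp + fn)  # true-positive rate: strokes correctly flagged
specificity = tn / (tn + fp)  # true-negative rate: non-strokes correctly cleared
auc = roc_auc_score(y_true, y_prob)
```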
3.3 Flowchart:
CHAPTER - 4
RESULT AND DISCUSSION
4.1 Results
The Random Forest model outperformed the Neural Network, achieving an accuracy
of 85%, sensitivity of 83%, specificity of 80%, and an AUC-ROC score of 0.90. Its
robustness in handling complex medical datasets and ability to distinguish high-
and low-risk patients is evident. Random Forest models are often more
interpretable, providing valuable insights into variables contributing to stroke risk.
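The interpretability noted above comes largely from the model's feature importances. A minimal sketch on synthetic data, with hypothetical column names echoing the report's features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the column names are hypothetical placeholders.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
cols = ["age", "hypertension", "heart_disease", "avg_glucose_level",
        "bmi", "smoking_status"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; ranking them shows which variables drive predictions.
importances = pd.Series(model.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
```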
Fig. 4.1: Random Forest Model
3. Logistic Regression
Logistic Regression, despite its interpretability and simplicity, had the lowest
performance, with an accuracy of 76%, sensitivity of 71%, specificity of 80%, and
an AUC-ROC score of 0.78. Even so, Logistic Regression remains useful for
preliminary screenings or when interpretability is crucial.
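For comparison, an interpretable Logistic Regression baseline takes only a few lines. The data here are synthetic, so the resulting AUC is illustrative rather than the 0.78 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same 80-20 split as the main experiments.
X, y = make_classification(n_samples=400, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

# One coefficient per feature, directly readable as a log-odds contribution.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```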
4.2 Discussion
The results from this study illustrate the potential of machine learning
models to revolutionize stroke prediction by significantly enhancing accuracy and
overall predictive performance. Each algorithm exhibited strengths and
weaknesses, which offer insights into their practical applicability in healthcare
settings.
The Random Forest algorithm is a powerful tool for stroke prediction due to
its ability to handle large datasets, model complex interactions, and provide
valuable insights into stroke risk factors. It is also robust to overfitting, reducing
variance and improving generalization. Despite its computational complexity,
Random Forest's balance between accuracy and interpretability makes it a viable
candidate for real-world stroke prediction models.
Logistic Regression: Simplicity at the Cost of Accuracy
Logistic Regression, despite its simplicity and interpretability, struggled to
capture the complex relationships between stroke risk factors. With an accuracy of
76%, it may be overly simplistic for stroke prediction. Despite its limitations, it
remains useful for preliminary analyses and for settings with limited
computational resources; more advanced algorithms are needed for complex
medical conditions such as stroke.
Other features, such as gender, work type, and residence type, had less
predictive power. These findings suggest that while demographic and lifestyle
factors contribute to stroke risk, their impact is overshadowed by more direct
clinical indicators. Therefore, future studies should focus on incorporating more
granular clinical data, such as cholesterol levels, genetic markers, and imaging
data, to further enhance model accuracy.
4.3 Cost Benefit Analysis
Such models can assist healthcare professionals in identifying patients who may
benefit from preventive measures, such as lifestyle modifications or medications.
4.3.1 Costs
4.3.1.1 Development and Implementation Costs
Technology Development: Costs of designing, developing, and
deploying machine learning models or algorithms for stroke prediction.
Hardware and Software: Costs of infrastructure such as servers, cloud
platforms, and analytical tools.
4.3.2 Benefits
Early identification of high-risk patients enables timely intervention,
improving quality of life and survival rates.
CHAPTER - 5
CONCLUSION
REFERENCES
APPENDICES
7.1 Publication Certificates
7.2 Work contribution
Member 1: BIJIN GOPAL S
Set up the backend environment using frameworks like Django, Flask, or FastAPI.
Design a database schema to efficiently store patient data and prediction results.
Design RESTful APIs to handle user requests and interact with the database.
Load the trained predictive model into the backend services using libraries like
TensorFlow or PyTorch.
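A framework-agnostic sketch of the model-serving step described above: a trained model persisted with joblib and a predict function that a Django/Flask/FastAPI route would wrap. The names (`stroke_model.joblib`, `predict_stroke_risk`) are hypothetical, and a scikit-learn Random Forest stands in for the project's model.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train and persist a stand-in model (the real project would save the
# tuned Random Forest instead).
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
path = os.path.join(tempfile.gettempdir(), "stroke_model.joblib")
joblib.dump(RandomForestClassifier(random_state=1).fit(X, y), path)

# Loaded once at service start-up; the REST route only JSON-decodes the
# request body and calls predict_stroke_risk.
_model = joblib.load(path)

def predict_stroke_risk(features):
    """Return the predicted probability of stroke for one patient record."""
    return float(_model.predict_proba([features])[0][1])
```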
Member 2: DEEPTHI J
Conduct an initial project kickoff meeting to align team goals and expectations.
Week 2: Progress Monitoring and Roadblock Resolution
Track the progress of all project tasks against the timeline.
Design test cases for the app, focusing on key features like user input forms and
prediction output.
Member 3: GOWTHAM K
Develop mockups with tools like Figma or Adobe XD, focusing on intuitive
navigation and layout.
Week 2: Responsive Design and Prototyping
Design layouts optimized for multiple screen sizes (desktop, tablet, mobile).
Add validation rules for required fields, formats, and character limits. Display
error messages and guidance for invalid or incomplete entries.
Test form functionality and edge cases to ensure robust user interaction.
Week 5: Cross-Browser Compatibility Testing
Test the web application on major browsers (Chrome, Firefox, Safari, Edge).
Identify and resolve compatibility issues with layout, styles, and interactivity.
Ensure proper rendering and functionality across devices and screen sizes.
Week 6: Debugging and Performance Optimization
Debug JavaScript and CSS to resolve runtime errors.
Prepare the frontend for deployment by creating build files and documentation.
Member 4: KANISHYA G
Week 2: Exploratory Data Analysis (EDA)
Analyze the dataset for patterns, correlations, and distributions of key features.