
Final

The document outlines a major project focused on predicting diabetes using machine learning algorithms. It discusses the chronic nature of diabetes, the importance of early detection, and the implementation of various models such as Random Forest, Naive Bayes, and K-Nearest Neighbor. The project aims to improve prediction accuracy and contribute to diabetes research by analyzing influential risk factors and enhancing model performance through advanced techniques.

Uploaded by

Aryan Singh

SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTING TECHNOLOGIES
18CSP109L / 18CSP110L - MAJOR PROJECT / INTERNSHIP

DIABETES PREDICTION
Using Machine Learning Algorithms

Batch ID: CT10B352


Student 1 Reg. No: RA2011003010522
Student 1 Name: ADARSH
Student 2 Reg. No: RA2011003010535
Student 2 Name: ARYAN KUMAR

Guide name: Dr. S. Gnanavel
Designation: Associate Professor
Department: CTech
Title – Diabetes Prediction

Introduction
Diabetes mellitus is a chronic disease. It refers to a group of metabolic
conditions characterised by elevated blood sugar levels, caused either by
insufficient insulin production or by the body's cells responding poorly to insulin.
Insulin is the hormone that regulates the blood glucose level. As a result of
this chronic condition, too much sugar circulates in the blood.

Symptoms may include increased hunger, thirst, vision problems, and tiredness.

Diabetes mellitus should be treated in time; if it is not, it may lead to
serious health issues in organs such as the kidneys, heart, brain, and eyes.
Physical inactivity, hereditary factors, being overweight, and insulin
resistance are all factors that contribute to diabetes mellitus.

26-8-2023 2

Abstract
● Diabetes mellitus (DM) is characterized by an elevated blood glucose level.
● Diabetes is one of the non-communicable illnesses that pose a health risk to humans.
Diabetes is a chronic condition in which either the pancreas does not create enough
insulin, or the body is unable to utilize the insulin it does produce.
● Diabetes should not be neglected since, if left untreated, it can lead to a range of
serious health issues, including heart disease, kidney disease, high blood pressure, eye
damage, and organ failure.
● Diabetes can be managed if diagnosed sooner. To achieve this objective, we will use a
range of techniques to predict the onset of diabetes in patients more precisely.
Here, we will explore using an ensemble of models such as Random Forest
classification, Naive Bayes classifier, and AdaBoost classifier, with Logistic Regression as
the meta-model.
● Performance metrics of the individual models will be compared with those of the proposed
stacking model.
Literature Review

Challenges to address

Data Quality and Availability:


Insufficient Data: Limited availability of comprehensive health data, especially
from diverse populations, hampers accurate prediction models.

Diagnostic Markers:
Identifying Suitable Biomarkers: Discovering reliable and easily measurable
biomarkers that can indicate early stages of diabetes accurately.
Genetic Factors: Integrating genetic information into prediction models to
understand the hereditary aspects of diabetes risk.

Patient Engagement:
Awareness: Raising awareness among individuals about the importance of regular
check-ups and adopting a healthy lifestyle.
Behavioral Factors: Incorporating lifestyle and behavioral data (diet, exercise,
stress) into prediction models to enhance accuracy.

Ethical and Privacy Concerns:

Data Privacy: Addressing concerns about patient data privacy and
implementing robust measures to protect sensitive health information.
Informed Consent: Developing clear protocols for obtaining informed consent
from patients for using their data in predictive analytics.

We will use the training and testing datasets to train and evaluate different
models. The dataset is split into training (70%) and testing (30%) sets. We
will also cross-validate the models before predicting the testing data.
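
The 70/30 split and the cross-validation step described here can be sketched in plain Python. This is a minimal illustration, not the project's original code: the function names `train_test_split` and `k_fold_indices`, the seed, the choice of five folds, and the placeholder records are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) index lists for k roughly equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Hypothetical stand-in for 100 patient records.
rows = list(range(100))
train, test = train_test_split(rows)
folds = list(k_fold_indices(len(train), k=5))
print(len(train), len(test))            # 70 30
print([len(val) for _, val in folds])   # [14, 14, 14, 14, 14]
```

Each fold is used once for validation while the remaining folds are used for training; the held-out 30% is only touched for the final evaluation.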

Timely treatment and early detection of diabetes are most important for
protecting people and keeping them healthy. They reduce the risk of
serious heart disease and stroke, blindness, kidney failure, and limb
amputations.
Problem statement
• To create an application containing a prediction model that aims to predict
diabetes in a patient as early as possible.
• The model is trained on a dataset containing details of both diabetes-
unaffected and diabetes-affected individuals. With the tools of machine learning,
doctors can predict the first stages of diabetes.
• Medical records of diabetic patients form the dataset for experimental
research, and various algorithms are applied to it. On the basis of diagnostic
measurements, we implement random forest, a naive Bayes classifier, AdaBoost,
and gradient boosting to predict whether a patient has diabetes.
• The effectiveness and precision of the employed algorithms are analyzed and
compared.

Objectives
The primary purpose of the project is to develop and evaluate machine learning
models that can accurately predict the early onset of Diabetes Mellitus. Early
detection is critical for timely interventions and improved management of the
disease.
The project seeks to identify the most influential predictors and risk factors
associated with early-stage diabetes. This involves analyzing the feature
importance scores of different models to uncover the variables that play a
significant role in predicting diabetes.

The project contributes to the broader field of diabetes research by exploring the
potential of machine learning in predicting and understanding the disease. The
findings can provide insights into the interplay of various risk factors and their
impact on diabetes development.

Architecture Diagram

Use Case diagram

Proposed model
Support Vector Machine (SVM)
This is a supervised learning technique, which means that the model is trained on
labelled data to produce the expected output. It represents the data set as points in
space and separates the classes with a maximum-margin boundary (hyperplane).

Advantages of SVM
1. Works well with unstructured and semistructured datasets such as images and text.
2. Can attain accurate and robust results.
3. Is successfully used in medical applications.

Disadvantages of SVM
1. It requires long training time when it is used with large datasets.

K-Nearest Neighbor Algorithm (KNN)
KNN is a method for classifying objects based on the closest
training examples in the feature space. KNN is the most basic type of
instance-based learning, or lazy learning.

Advantages of KNN
1. It is a very simple algorithm to understand and interpret.
2. It is very useful for nonlinear data because the algorithm makes no
assumptions about the data.

Disadvantages of KNN
1. It is computationally somewhat expensive, because every prediction
compares the query with all the stored training data.
2. High memory storage is required compared to other supervised learning
algorithms.
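
As a rough illustration of how KNN classifies by the closest training examples, here is a minimal sketch in plain Python. The function name `knn_predict`, the choice of k = 3, Euclidean distance, and the toy (glucose, BMI) values are assumptions for the example, not the project's data or code. Features are left unscaled here; in practice they would usually be scaled first so that no single feature dominates the distance.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points nearest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (glucose, BMI) pairs; label 1 = diabetic, 0 = non-diabetic.
train_X = [(85, 22.0), (90, 24.5), (95, 26.0), (150, 33.0), (160, 35.5), (170, 31.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (155, 34.0)))  # 1
print(knn_predict(train_X, train_y, (88, 23.0)))   # 0
```

Note that nothing is learned ahead of time: all the work happens at prediction time, which is why the method is called lazy learning and why its cost grows with the size of the stored training set.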
Decision tree
A Decision Tree is a supervised method used to solve classification problems.
Its key purpose is to predict the target class using decision rules learned
from previous data. It consists of a root node, internal nodes, and leaf nodes.
The root and internal nodes split instances according to their characteristics;
a node may have two or more branches, and the leaf nodes carry the class labels.

Advantages of Decision tree

1. Ability to handle attributes with different costs.
2. Ability to handle missing values in attributes.

Disadvantages of Decision tree

1. Decision trees are also prone to errors in classification, owing to
differences in perceptions and the limitations of applying statistical tools.
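
To make the idea of a splitting node concrete, the sketch below picks the threshold on one numeric feature that best separates two classes. It assumes Gini impurity as the split criterion (as in CART-style trees); the slides do not say which criterion was used, and the names `gini` and `best_threshold` and the toy glucose values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= threshold]
        right = [y for v, y in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical glucose readings with 0/1 diabetes labels.
glucose = [85, 90, 95, 150, 160, 170]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(glucose, labels))  # (122.5, 0.0)
```

A full tree repeats this search over every feature at every node and recurses into the two resulting subsets until the leaves are pure or a depth limit is reached.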

XGBoost
XGBoost (Extreme Gradient Boosting) is a powerful and efficient machine learning algorithm that
belongs to the gradient boosting family of models. It is widely used for both regression and
classification tasks and is known for its accuracy and computational efficiency.

• Gradient Boosting: XGBoost is an ensemble learning method that builds an ensemble of
decision trees to make predictions. It sequentially adds trees to correct the errors of the
previous trees.
• Scalability: XGBoost is designed for efficiency and can handle large datasets thanks
to its parallel processing capabilities. It is implemented in C++ and offers Python and other
language interfaces.
• Regularization: XGBoost provides L1 (Lasso) and L2 (Ridge) regularization terms to
prevent overfitting and improve model generalization.
• Sparsity Handling: It can handle sparse data efficiently, making it suitable for a wide range
of applications.
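
The "sequentially adds trees to correct the errors of the previous trees" idea can be sketched with one-split trees (stumps) fitted to residuals. This is plain gradient boosting for squared error on a single feature, written from scratch for illustration; it is not XGBoost itself and omits its regularization, second-order terms, and sparsity handling. The names `fit_stump` and `boost`, the learning rate, and the toy data are assumptions, and the input is assumed to contain at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Fit the best single-threshold regression stump (squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds it to the model."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Hypothetical glucose readings with 0/1 diabetes labels.
xs = [85, 90, 95, 150, 160, 170]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(model(88), 3), round(model(155), 3))
```

The learning rate shrinks each stump's contribution, so the prediction moves toward the labels gradually over the rounds rather than in one step.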

IMPLEMENTATION PROCESS

- Importing libraries
- Preprocessing the data
- Preview Data
- Features data-type [eg: Pregnancies, Glucose,BP, BMI, Insulin,
Age etc.]
- Count of null values
- Data Modelling
- Modelling Evaluation
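
The "Count of null values" step in the list above might look like the following in plain Python. It is only a sketch: the helper name `count_missing`, the record layout (one dict per patient), and the sample values are assumptions, with column names borrowed from the feature list above.

```python
def count_missing(rows, columns, missing_values=(None, "")):
    """Count, per column, how many rows hold a missing value."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) in missing_values:
                counts[c] += 1
    return counts

# Hypothetical patient records with some missing entries.
rows = [
    {"Glucose": 148, "BMI": 33.6, "Insulin": None},
    {"Glucose": 85, "BMI": None, "Insulin": 94},
    {"Glucose": None, "BMI": 23.3, "Insulin": None},
]
print(count_missing(rows, ["Glucose", "BMI", "Insulin"]))
# {'Glucose': 1, 'BMI': 1, 'Insulin': 2}
```

Columns with many missing entries are candidates for imputation or removal before modelling.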

Implementation

Accuracy
Accuracy is one metric for evaluating classification models. Accuracy is the
fraction of predictions that the model got right. Formally, accuracy has the
following definition:

$$\text{Accuracy} = \frac{TN + TP}{TN + FP + TP + FN}$$

where TN = True Negatives; TP = True Positives; FP = False Positives; FN = False
Negatives.

ALGORITHM/METHOD                      ACCURACY (%)
Support Vector Machine (SVM)          72.08
K-Nearest Neighbor Algorithm (KNN)    78.57
Decision Tree                         68.18
XGBoost                               71.43
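
The accuracy definition above can be checked on a small example. The sketch below counts TP, TN, FP, and FN from two label lists and applies the formula; the function names and the toy labels are made up for illustration and are not the project's test data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```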

Precision
Precision is the fraction of predicted positives that are actually positive. A good
classifier should preferably have a precision close to 1. Precision equals 1 only when
the numerator is equal to the denominator, i.e. TP = TP + FP, which means that FP is
zero. As FP increases, the denominator grows larger than the numerator and the
precision decreases.

$$\text{Precision} = \frac{TP}{TP + FP}$$

where TP = True Positives; FP = False Positives.

ALGORITHM/METHOD                      PRECISION
Support Vector Machine (SVM)          0.72
K-Nearest Neighbor Algorithm (KNN)    0.81
Decision Tree                         0.77
XGBoost                               0.78

Recall
Recall is the ratio of the number of correctly classified positive samples to
the total number of positive samples. It measures the ability of a model
to detect positive samples. Higher recall indicates that more positive samples are
being detected.

$$\text{Recall} = \frac{TP}{TP + FN}$$

where TP = True Positives; FN = False Negatives.

ALGORITHM/METHOD                      RECALL
Support Vector Machine (SVM)          0.63
K-Nearest Neighbor Algorithm (KNN)    0.87
Decision Tree                         0.75
XGBoost                               0.79

F1 Score
The F1 score is the harmonic mean of Precision and Recall. It is mainly
needed when a balance between Precision and Recall is required and when the
data is unevenly distributed. The best score is one and the worst
is zero. The formula for the F1 score is

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

ALGORITHM/METHOD                      F1 SCORE
Support Vector Machine (SVM)          0.67
K-Nearest Neighbor Algorithm (KNN)    0.84
Decision Tree                         0.76
XGBoost                               0.78
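
Precision, recall, and the F1 score (their harmonic mean) can be computed from the same confusion counts. The sketch below uses hypothetical helper names and toy counts. As a consistency check, it also recomputes F1 from the precision and recall values reported in the tables for SVM and KNN, which reproduces the tabulated 0.67 and 0.84.

```python
def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no positive samples."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical confusion counts.
tp, fp, fn = 3, 1, 1
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1_score(p, r))  # 0.75 0.75 0.75

# Recompute F1 from the precision/recall values in the tables above.
print(round(f1_score(0.72, 0.63), 2))  # 0.67 (SVM)
print(round(f1_score(0.81, 0.87), 2))  # 0.84 (KNN)
```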

The results are given in the preceding slides. The performance of the Support
Vector Machine (SVM), K-Nearest Neighbor Algorithm, Decision Tree, and XGBoost
is compared with the performance of the proposed stacking model.
All of the models have been implemented in Google Colab.

For each of the feature pairs, correlation coefficient values are given in
Table 1. Precision metrics of the different models are given in Table 2.

Performance metrics of the different models are given in Tables 2, 3, and 4. From
these, it can be inferred that the maximum accuracy is observed with the
K-Nearest Neighbor Algorithm.
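
The correlation coefficient for each feature pair can be computed as follows. This sketch assumes the Pearson coefficient, which the slides do not name explicitly; the function names and the toy columns are hypothetical, and each column is assumed to be non-constant (a constant column would cause a division by zero).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Map every pair of column names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical feature columns.
columns = {
    "Glucose": [85, 90, 95, 150, 160, 170],
    "BMI": [22.0, 24.5, 26.0, 33.0, 35.5, 31.0],
    "Age": [50, 31, 45, 29, 60, 38],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values close to 1 or -1 indicate strongly related features, and the diagonal is always 1 because every feature correlates perfectly with itself.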

Correlation Matrix

K-Nearest Neighbor Algorithm (KNN)

Results and Discussion

1. Data Preprocessing:
Describe the dataset used for the analysis, including the number of samples, features, and any
preprocessing steps applied (e.g., handling missing values, feature scaling, etc.).

2. Model Evaluation:
Present the evaluation metrics used to assess the performance of the predictive model(s).
Common metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-
ROC).
Provide a confusion matrix or ROC curve to visually represent the model's performance.

3. Feature Importance:
Discuss the features that were found to be most important in predicting early diabetes. This
information is valuable for understanding the underlying factors contributing to diabetes risk.

4. Model Performance:
Present the accuracy or performance metric achieved by the model on the test dataset.
Compare the performance of the machine learning model with baseline models or traditional
methods, if applicable.
a. Interpretation of Results:
Interpret the findings in the context of diabetes research. Explain the significance of the
identified features and how they relate to established risk factors for diabetes.
Discuss any surprising or unexpected results and propose possible explanations.

b. Clinical Implications:
Discuss how the predictive model can be utilized in clinical settings for early diabetes risk
assessment. Highlight the potential benefits of early detection, such as preventive
interventions and lifestyle modifications.

c. Limitations:
Address the limitations of the study, such as dataset limitations, potential biases, or
constraints of the machine learning techniques used.
Discuss any challenges encountered during the analysis and how they might have
influenced the results.
Future Enhancements
There are several possible future enhancements in the field of early diabetes prediction
using machine learning techniques. As technology advances and more data becomes available,
there are numerous opportunities to improve the accuracy, efficiency, and applicability of
diabetes prediction models. Here are some future enhancements that researchers and
practitioners could consider:

1. Incorporating Advanced Data Sources:


Genomic Data: Integrating genomic data to explore genetic predispositions to diabetes.
Lifestyle Data: Incorporating data from wearable devices and smartphones to capture real-time
lifestyle information, such as physical activity, sleep patterns, and dietary habits.
Environmental Data: Considering environmental factors such as pollution levels and access to
green spaces, which might influence diabetes risk.
2. Utilizing Advanced Machine Learning Techniques:
Deep Learning: Exploring deep learning algorithms, such as convolutional neural networks
(CNNs) or recurrent neural networks (RNNs), for more complex pattern recognition in high-
dimensional data.
Ensemble Models: Building ensemble models that combine predictions from multiple
algorithms or models to enhance overall accuracy and robustness.
Explainable AI: Developing models that provide interpretable results, allowing clinicians and
patients to understand the reasoning behind predictions, which is crucial for gaining trust and
acceptance in healthcare settings.
3. Personalized Predictive Models:
Personalized Risk Assessment: Creating individualized risk profiles by considering diverse
data sources, enabling personalized interventions and healthcare plans.
Dynamic Models: Developing models that can adapt and evolve with changing patient data,
allowing for dynamic and personalized risk predictions over time.

4. Integration with Electronic Health Records (EHR):


EHR Integration: Integrating predictive models directly into electronic health record systems
to provide real-time risk assessments during patient visits.
Longitudinal Analysis: Conducting longitudinal studies using EHR data to track patients over
time, enabling the identification of early indicators and trends associated with diabetes risk.
especially concerning underrepresented demographic groups.

5. Collaborative Research and Data Sharing:


Collaborative Initiatives: Encouraging collaboration between researchers, healthcare
providers, and tech companies to share data, expertise, and resources.
Open Data: Promoting the sharing of anonymized healthcare datasets for research purposes,
fostering innovation and accelerating progress in the field.
6. Clinical Validation and Real-World Testing:

Clinical Trials: Conducting rigorous clinical trials to validate the effectiveness of predictive
models in real-world healthcare settings, ensuring their reliability and accuracy in diverse
patient populations.
Feedback Loops: Establishing feedback loops between clinicians, data scientists, and patients to
continuously improve and refine predictive models based on real-world outcomes and patient
experiences.

By focusing on these areas, researchers and developers can significantly enhance the accuracy,
reliability, and usability of early diabetes prediction models, ultimately improving the quality of
care and outcomes for individuals at risk of developing diabetes.
References

● https://www.frontiersin.org/articles/10.3389/fgene.2018.00515/full
● https://www.hindawi.com/journals/jhe/2021/9930985/
● https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8702133/
● https://www.healthit.gov/sites/default/files/jsr-17-task-002_aiforhealthandhealthcare12122017.pdf
● https://www.kaggle.com/datasets/mathchi/diabetes-data-set