Final
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTING TECHNOLOGIES
18CSP109L / 18CSP110L - MAJOR PROJECT /
INTERNSHIP
DIABETES PREDICTION
Using Machine Learning Algorithms
Introduction
Diabetes mellitus is a chronic disease. It refers to a group of metabolic
conditions characterised by elevated blood sugar levels, caused either by
insufficient insulin production or by the body's cells responding poorly to insulin.
Insulin is the hormone that regulates the blood glucose level. In this chronic
condition, too much sugar circulates in the blood.
26-8-2023 2
Title – Diabetes Prediction
Abstract
● Diabetes mellitus (DM) is characterized by an elevated blood glucose level.
● Diabetes is one of the non-communicable illnesses that pose a health risk to humans.
It is a chronic condition in which either the pancreas does not produce enough
insulin, or the body is unable to use the insulin it does produce.
● Diabetes should not be neglected: if left untreated, it can lead to a range of
serious health issues, including heart disease, kidney disease, high blood pressure, eye
damage, and organ failure.
● Diabetes can be managed better if it is diagnosed early. To achieve this, we will apply a range
of techniques to predict the onset of diabetes in patients more precisely.
We will explore an ensemble of models (Random Forest
classifier, Naive Bayes classifier, and AdaBoost classifier) with Logistic Regression as
the meta-model.
● Performance metrics of the individual models will be compared with the proposed
stacking model.
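The stacking idea in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: to stay self-contained, the base learners here are single-feature threshold rules (stumps) swapped in for the Random Forest, Naive Bayes, and AdaBoost classifiers, and the feature values are hypothetical. Only the structure is the point: base models produce out-of-fold predictions, and a logistic regression meta-model is trained on those predictions.

```python
import math

def fit_stump(X, y, feature):
    """Base learner (assumed stand-in): best rule 'x[feature] > t -> class 1'."""
    best_t, best_correct = None, -1
    for t in sorted({row[feature] for row in X}):
        correct = sum((row[feature] > t) == bool(label) for row, label in zip(X, y))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return lambda row: 1.0 if row[feature] > best_t else 0.0

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-model: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, label in zip(Z, y):
            p = 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
            for j, zj in enumerate(z):
                grad_w[j] += (p - label) * zj
            grad_b += p - label
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(Z)
    return lambda z: 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))

def fit_stacking(X, y, k=3):
    """Train base learners per fold, then the meta-model on out-of-fold outputs."""
    n_features = len(X[0])
    meta_features = [None] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        bases = [fit_stump([X[i] for i in train], [y[i] for i in train], f)
                 for f in range(n_features)]
        for i in held_out:
            meta_features[i] = [base(X[i]) for base in bases]
    meta = fit_logistic(meta_features, y)
    bases = [fit_stump(X, y, f) for f in range(n_features)]  # refit on all rows
    return lambda row: 1 if meta([base(row) for base in bases]) >= 0.5 else 0

# Hypothetical toy rows: [glucose, BMI]; 1 = diabetic, 0 = non-diabetic.
X = [[85, 22.0], [160, 34.0], [90, 25.0], [150, 33.0], [100, 26.0],
     [170, 36.0], [95, 24.0], [165, 35.0], [180, 38.0]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 1]

stacked = fit_stacking(X, y)
train_predictions = [stacked(row) for row in X]
```

Training the meta-model on out-of-fold predictions, rather than on predictions for rows the base models have already seen, is what keeps the meta-model from simply trusting overfitted base outputs.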
Literature Review
Challenges to address
Diagnostic Markers:
Identifying Suitable Biomarkers: Discovering reliable and easily measurable
biomarkers that can indicate early stages of diabetes accurately.
Genetic Factors: Integrating genetic information into prediction models to
understand the hereditary aspects of diabetes risk.
Patient Engagement:
Awareness: Raising awareness among individuals about the importance of regular
check-ups and adopting a healthy lifestyle.
Behavioral Factors: Incorporating lifestyle and behavioral data (diet, exercise,
stress) into prediction models to enhance accuracy.
Ethical and Privacy Concerns:
The dataset is split into training (70%) and test (30%) sets. We will use these
sets to train and evaluate different models. We will also cross-validate each
model on the training data before predicting on the test data.
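A 70/30 split followed by cross-validation can be sketched as below. This is a minimal illustration using only the standard library, not the project's code: the function names, the seed, the choice of five folds, and the dataset size of 768 rows are all assumptions.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.3, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 768 rows is an assumed dataset size, used only for illustration.
train_idx, test_idx = train_test_split_indices(768)
folds = list(k_fold_indices(train_idx, k=5))
```

Cross-validation runs only on the training indices, so the test rows stay unseen until the final evaluation.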
Objectives
The primary purpose of the project is to develop and evaluate machine learning
models that can accurately predict the early onset of Diabetes Mellitus. Early
detection is critical for timely interventions and improved management of the
disease.
The project seeks to identify the most influential predictors and risk factors
associated with early-stage diabetes. This involves analyzing the feature
importance scores of different models to uncover the variables that play a
significant role in predicting diabetes.
The project contributes to the broader field of diabetes research by exploring the
potential of machine learning in predicting and understanding the disease. The
findings can provide insights into the interplay of various risk factors and their
impact on diabetes development.
Architecture Diagram
Use Case diagram
Proposed model
Support Vector Machine (SVM)
This is a supervised learning technique, which means that the model is trained on
labelled data to produce a known output. It represents the data as points in a
feature space and finds a boundary (hyperplane) that separates the classes.
Advantages of SVM
1. Works well with unstructured and semistructured datasets such as images and text.
2. Can attain accurate and robust results.
3. Is successfully used in medical applications.
Disadvantages of SVM
1. It requires a long training time on large datasets.
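How an SVM finds a separating boundary can be sketched with a linear model trained by subgradient descent on the regularised hinge loss. This is an illustrative simplification (no kernels, hypothetical standardised feature values, assumed hyperparameters), not the solver a library SVM would use.

```python
def train_linear_svm(X, y, lr=0.01, reg=0.01, epochs=200):
    """Fit weights w and bias b on the L2-regularised hinge loss.
    Labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified: move toward it.
                w = [wj - lr * (reg * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely classified: apply only the regulariser.
                w = [wj - lr * reg * wj for wj in w]
    return w, b

def predict_svm(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical standardised features; -1 = non-diabetic, +1 = diabetic.
X = [[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X, y)
predictions = [predict_svm(w, b, x) for x in X]
```

The inner loop visits every sample in every epoch, which hints at why training time grows with dataset size.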
K-Nearest Neighbor Algorithm (KNN)
KNN is a method for classifying objects based on the closest
training examples in the feature space. KNN is the most basic type of
instance-based learning, or lazy learning.
Advantages of KNN
1. It is a very simple algorithm to understand and interpret.
2. It is very useful for nonlinear data because there is no assumption about
data in this algorithm.
Disadvantages of KNN
1. It is computationally expensive at prediction time because it stores all the
training data and compares each new sample against it.
2. High memory storage required as compared to other supervised learning
algorithms.
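The classification rule described above can be sketched in a few lines. The feature values below are hypothetical, and k = 3 with Euclidean distance is an assumed configuration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy rows: [glucose, BMI]; 1 = diabetic, 0 = non-diabetic.
train_X = [[85, 22.0], [90, 24.5], [100, 26.0], [150, 33.0], [165, 35.5], [180, 38.0]]
train_y = [0, 0, 0, 1, 1, 1]

low_risk = knn_predict(train_X, train_y, [95, 25.0])
high_risk = knn_predict(train_X, train_y, [170, 36.0])
```

There is no training step: every prediction scans the whole stored training set, which is the source of both the memory and the computation cost listed above. In practice the features would usually be scaled first so that no single feature dominates the distance.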
Decision tree
Decision Tree is a supervised method used to solve classification problems.
Its main purpose is to predict the target class by applying decision rules
learned from the training data. A tree consists of a root node, internal nodes,
and leaf nodes. The root and internal nodes split instances on feature values,
each may have two or more branches, and each leaf node is assigned a class label.
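How a node chooses its split can be sketched as a search over thresholds on one feature. Gini impurity is assumed here as the split criterion (the slides do not name one), and the glucose values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical glucose readings and outcomes (1 = diabetic).
glucose = [85, 90, 100, 150, 165, 180]
outcome = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(glucose, outcome)
```

A full tree repeats this search on every feature at every node and recurses into the two resulting subsets until the leaves are pure or a depth limit is reached.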
XGBoost
XGBoost (Extreme Gradient Boosting) is a powerful and efficient machine learning algorithm that
belongs to the gradient boosting family of models. It is widely used for both regression and
classification tasks and is known for its accuracy and computational efficiency.
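The gradient boosting idea behind XGBoost can be sketched as follows: each round fits a small tree to the residuals of the current ensemble and adds a scaled copy of it. This is plain squared-error boosting with one-split trees on a single hypothetical feature; it omits the regularisation, second-order gradients, and engineering that distinguish XGBoost itself, and the round count and learning rate are assumed values.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of one feature under squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # the largest value would leave the right side empty
        left = [r for v, r in zip(x, residuals) if v <= t]
        right = [r for v, r in zip(x, residuals) if v > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    return best[1:]

def fit_boosting(x, y, n_rounds=20, learning_rate=0.3):
    """Start from the mean label, then add stumps fitted to the residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        t, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((t, left_mean, right_mean))
        predictions = [p + learning_rate * (left_mean if v <= t else right_mean)
                       for p, v in zip(predictions, x)]
    return base, learning_rate, stumps

def predict_boosting(model, v):
    """Sum the base value and the scaled output of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if v <= t else r for t, l, r in stumps)

# Hypothetical glucose readings and outcomes (1 = diabetic).
glucose = [85, 90, 100, 150, 165, 180]
outcome = [0, 0, 0, 1, 1, 1]

model = fit_boosting(glucose, outcome)
score_low = predict_boosting(model, 95)
score_high = predict_boosting(model, 170)
```

Scores above 0.5 would be read as the positive class in this sketch.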
IMPLEMENTATION PROCESS
- Importing libraries
- Preprocessing the data
- Preview Data
- Feature data types [e.g., Pregnancies, Glucose, BP, BMI, Insulin,
Age, etc.]
- Count of null values
- Data Modelling
- Model Evaluation
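The preview, data-type, and null-count steps listed above can be sketched with the standard library alone. The three CSV rows, the `Outcome` column, and the helper name `infer_type` are hypothetical and serve only as illustration; the project's notebook may well use a dataframe library instead.

```python
import csv
import io

# Hypothetical three-row sample; empty fields stand for missing values.
SAMPLE_CSV = """Pregnancies,Glucose,BP,BMI,Insulin,Age,Outcome
6,148,72,33.6,,50,1
1,85,66,26.6,94,31,0
8,183,,23.3,,32,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Preview data: show the first two rows.
for row in rows[:2]:
    print(row)

def infer_type(values):
    """Return 'int', 'float', or 'str' for a column, ignoring empty fields."""
    non_empty = [v for v in values if v != ""]
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"

# Feature data types and count of null values per column.
column_types = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
null_counts = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}
```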
Implementation
Accuracy
Accuracy is one metric for evaluating classification models. It is the
fraction of predictions that the model got right. Formally, accuracy has the
following definition:
Accuracy = (TN + TP) / (TN + FP + TP + FN)
where TN = True Negatives; TP = True Positives; FP = False Positives; FN = False
Negatives.
ALGORITHM/METHOD    ACCURACY (%)
XGBoost             71.43
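The accuracy definition can be checked on a small example. The label lists below are made up for illustration and are unrelated to the reported results.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    return (tn + tp) / (tn + fp + tp + fn)

# Hypothetical labels: eight patients, six classified correctly.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
acc = accuracy(tn, fp, fn, tp)  # 6 correct out of 8
```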
Precision
Precision is the fraction of samples predicted positive that are actually
positive. A good classifier should ideally have a precision of 1. Precision
equals 1 only when the numerator equals the denominator, i.e. TP = TP + FP,
which means that FP is zero. As FP increases, the denominator grows relative
to the numerator and precision decreases.
Precision = TP / (TP + FP)
where TP = True Positives; FP = False Positives.
ALGORITHM/METHOD Precision
XGBoost 0.78
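The effect of false positives on precision can be seen by holding TP fixed and increasing FP. The counts below are illustrative only.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# With 30 true positives fixed, precision falls as false positives grow.
values = [precision(tp=30, fp=fp) for fp in (0, 10, 30)]
```

Here `values` is `[1.0, 0.75, 0.5]`: precision is 1 only when there are no false positives.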
Recall
Recall is the ratio of the number of correctly classified positive samples to
the total number of positive samples. It measures the ability of a model
to detect positive samples. Higher recall indicates that more positive samples are
being detected.
Recall = TP / (TP + FN)
where TP = True Positives; FN = False Negatives.
ALGORITHM/METHOD Recall
XGBoost 0.79
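Recall can be computed directly from true and predicted labels, as in this small sketch. The label lists are hypothetical.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical labels: five patients are truly positive, three are detected.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
detected_fraction = recall(tp, fn)  # 3 of 5 positives found
```

In a screening setting, each false negative is a diabetic patient the model missed, which is why recall is reported alongside precision.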
F1 Score
The F1 score is the harmonic mean of Precision and Recall. It is mainly
needed when a balance between Precision and Recall is required and when the
classes are unevenly distributed. The best score is 1 and the worst
is 0. The formula for the F1 score is
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
ALGORITHM/METHOD F1 Score
Support Vector Machine (SVM) 0.67
XGBoost 0.78
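As a quick consistency check, the F1 score, defined as the harmonic mean 2 × Precision × Recall / (Precision + Recall), can be recomputed from the precision (0.78) and recall (0.79) reported for XGBoost on the earlier slides. The second example uses made-up values to show how the harmonic mean penalises an imbalance between the two.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for XGBoost in the slides.
xgboost_f1 = round(f1_score(0.78, 0.79), 2)

# Illustrative imbalance: perfect precision but very low recall.
lopsided_f1 = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2
```

The recomputed `xgboost_f1` is 0.78, matching the table. For the lopsided case the F1 score is about 0.18, far below the arithmetic mean of 0.55.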
Results of the proposed model are given in the above slides. The performance of
the Support Vector Machine (SVM), K-Nearest Neighbor, Decision Tree, and XGBoost
models is compared with the performance of the proposed stacking model.
All of the models have been implemented in Google Colab.
The correlation coefficient for each feature pair is given in
Table 1. The precision of each model is given in Table 2.
Correlation Matrix
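A correlation matrix of feature pairs can be computed as sketched below. Pearson correlation is assumed (the slides do not state which coefficient was used), and the three columns hold hypothetical values.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}

# Hypothetical feature columns.
columns = {
    "Glucose": [85, 90, 100, 150, 165, 180],
    "BMI": [22.0, 24.5, 26.0, 33.0, 35.5, 38.0],
    "Age": [50, 31, 32, 21, 33, 30],
}

matrix = correlation_matrix(columns)
```

Each diagonal entry is 1 (a feature against itself) and the matrix is symmetric, so only one triangle needs to be inspected.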
K-Nearest Neighbor Algorithm (KNN)
Results and Discussion
1. Data Preprocessing:
Describe the dataset used for the analysis, including the number of samples, features, and any
preprocessing steps applied (e.g., handling missing values, feature scaling, etc.).
2. Model Evaluation:
Present the evaluation metrics used to assess the performance of the predictive model(s).
Common metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-
ROC).
Provide a confusion matrix or ROC curve to visually represent the model's performance.
3. Feature Importance:
Discuss the features that were found to be most important in predicting early diabetes. This
information is valuable for understanding the underlying factors contributing to diabetes risk.
4. Model Performance:
Present the accuracy or performance metric achieved by the model on the test dataset.
Compare the performance of the machine learning model with baseline models or traditional
methods, if applicable.
a. Interpretation of Results:
Interpret the findings in the context of diabetes research. Explain the significance of the
identified features and how they relate to established risk factors for diabetes.
Discuss any surprising or unexpected results and propose possible explanations.
b. Clinical Implications:
Discuss how the predictive model can be utilized in clinical settings for early diabetes risk
assessment. Highlight the potential benefits of early detection, such as preventive
interventions and lifestyle modifications.
c. Limitations:
Address the limitations of the study, such as dataset limitations, potential biases, or
constraints of the machine learning techniques used.
Discuss any challenges encountered during the analysis and how they might have
influenced the results.
Future Enhancements
As technology advances and more data becomes available, there are numerous
opportunities to improve the accuracy, efficiency, and applicability of early
diabetes prediction models built with machine learning techniques. Some future
enhancements that researchers and practitioners could consider:
Clinical Trials: Conducting rigorous clinical trials to validate the effectiveness of predictive
models in real-world healthcare settings, ensuring their reliability and accuracy in diverse
patient populations.
Feedback Loops: Establishing feedback loops between clinicians, data scientists, and patients to
continuously improve and refine predictive models based on real-world outcomes and patient
experiences.
By focusing on these areas, researchers and developers can significantly enhance the accuracy,
reliability, and usability of early diabetes prediction models, ultimately improving the quality of
care and outcomes for individuals at risk of developing diabetes.
References
● https://www.frontiersin.org/articles/10.3389/fgene.2018.00515/full
● https://www.hindawi.com/journals/jhe/2021/9930985/
● https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8702133/
● https://www.healthit.gov/sites/default/files/jsr-17-task-002_aiforhealthandhealthcare12122017.pdf
● https://www.kaggle.com/datasets/mathchi/diabetes-data-set