Major Project Final TABLE DIAGRAM
Diabetes Prediction Using Machine Learning Models
A PROJECT REPORT
Submitted by
Satya Narayan Gope 20010433
Avinash Kumar 20010426
Pravir Kumar Pradhan 20010451
Vikash Kumar 20010424
in
NOVEMBER 2023
C.V. RAMAN GLOBAL UNIVERSITY
BHUBANESWAR-ODISHA-752054
CERTIFICATE OF APPROVAL
This is to certify that we have examined the project entitled "Diabetes Prediction
Using Machine Learning Models" submitted by Satya Narayan Gope, Registration
No. 20010433, Avinash Kumar, Registration No. 20010426, Pravir Kumar
Pradhan, Registration No. 20010451, and Vikash Kumar, Registration No. 20010424,
CGU-Odisha, Bhubaneswar. We hereby accord our approval of it as a major project
work carried out and presented in a manner required for its acceptance towards
completion of major project stage-I (7th Semester) of the Bachelor's Degree in Computer
Science & Engineering for which it has been submitted. This approval does not
necessarily endorse or accept every statement made, opinion expressed or
conclusion drawn as recorded in this major project; it only signifies the acceptance
of the major project for the purpose for which it has been submitted.
We would like to express our deep gratitude to our project guide Dr. Debendra Muduli,
Professor, Department of Computer Science & Engineering, who has always been a
source of motivation and firm support in carrying out the project.
We would also like to convey our sincerest gratitude and indebtedness to all other faculty
members and staff of the Department of Computer Science & Engineering, who bestowed
their great effort and guidance at appropriate times, without which it would have been very
difficult to complete our project work.
An assemblage of this nature could never have been attempted without reference to and
inspiration from the works of others, whose details are mentioned in the references section.
We acknowledge our indebtedness to all of them. Further, we would like to express our
gratitude to our parents and God, who directly or indirectly encouraged and motivated
us during this project.
6. Proposed Solution
7. Workflow
8. Result & Analysis
9. Conclusion
10. Future Work
11. References
1. ABSTRACT
Diabetes mellitus is a metabolic disorder characterized by hyperglycemia, which results from the
body's inability to secrete or respond to insulin adequately. Many years of research in the computational
diagnosis of diabetes have pointed to machine learning as a viable solution for the prediction of
diabetes. However, the accuracy achieved to date suggests that there is still much room for improvement. The
goal of this project is to develop a model that can detect diabetes accurately and efficiently
so that treatment can begin as early as possible. In this paper, we propose a robust framework for
building a diabetes prediction model to aid in the clinical diagnosis of diabetes. The datasets
used are the PIMA Indian dataset and the Laboratory of the Medical City Hospital (LMCH) diabetes dataset.
Our proposed framework comprises two stages: Data Preprocessing and Classification. It is designed
to address the factors that we presume affect accuracy in the early diagnosis of diabetes mellitus. First, it
preprocesses the data using Spearman correlation, feature selection and missing value imputation.
Second, for classification it uses machine learning models with ensemble techniques such as
Bagging and Boosting applied to Naive Bayes (NB), Random Forest (RF), Support Vector Machines (SVM) and
k-nearest neighbor (k-NN). The K-Fold Cross Validation technique is then used to further improve accuracy.
After optimizing the data, different combinations of machine learning models were tried,
and the combined Logistic Regression + Decision Tree (LR + DT) model achieved the highest accuracy.
2. INTRODUCTION
Diabetes mellitus is a chronic metabolic disease characterized by high blood sugar levels
(hyperglycaemia). It is caused by either the pancreas not producing enough insulin, or the
body’s cells not responding properly to the insulin that is produced. Insulin is a hormone that
helps the body’s cells use glucose for energy.
Diabetes is a major global health problem, with over 463 million people worldwide living with
the disease. Diabetes is also a leading cause of death, with over 4.2 million deaths attributed
to the disease in 2019.
Early detection and treatment of diabetes are essential to prevent complications such as heart
disease, stroke, kidney failure, and blindness. Machine learning (ML) has the potential to
improve the early detection of diabetes by developing models that can identify people who are
at high risk of developing the disease.
This project proposes a robust framework for building a diabetes prediction model using ML.
The proposed framework comprises two stages: Data Preprocessing and Classification. The Data
Preprocessing stage is designed to address factors that can affect the accuracy of diabetes
prediction models, such as missing values and correlation between features. The Classification
stage involves training and evaluating a variety of machine learning models, with ensemble
techniques such as Bagging and Boosting applied to Naive Bayes (NB), Random Forest (RF), Support
Vector Machines (SVM) and k-nearest neighbour (k-NN), to predict whether or not a patient has
diabetes. The K-Fold Cross Validation technique is then used to improve accuracy,
followed by the Random Forest and AdaBoost techniques. Finally, the combination of Logistic
Regression and Decision Tree gave the highest accuracy.
The proposed framework is evaluated on a public dataset, the PIMA Indian Diabetes Dataset.
The results show that the proposed framework achieves high accuracy in predicting diabetes.
3. LITERATURE REVIEW
Author [Ref] | Year | Preprocessing (FS = feature selection, MVI = missing value imputation) | Classifier | Strengths | Limitations
Roy et al. [3] | 2021 | Median value, K-NN, and iterative imputer were used for missing value imputation | ANN | High 98% accuracy with ANN; effective imputation using K-NN and iterative imputer | Lack of detailed methodology on feature selection, model architecture, or dataset
Khanam et al. [4] | 2021 | FS: Pearson correlation; MVI: median value for missing values imputation | DNN run with different hidden layers | 86.26% accuracy with DNN; adaptable architecture; effective missing value imputation | Limited detail on feature selection method; lacks insight into dataset characteristics
Naz and Ahuja [5] | 2020 | Method not stated | MLP and DL with 2 hidden layers | High 98.07% accuracy using DL with 2 hidden layers; significant result | Lack of method description; insufficient context on dataset and experimentation
Alam et al. [6] | 2019 | FS: PCA; MVI: median value | MLP | Effective PCA feature selection; reasonable accuracy with MLP neural network | Moderate accuracy; lacks details on dataset; limited methodology description
Zou et al. [7] | 2018 | FS: PCA; MVI: redundancy and minimum relevance | MLP | PCA feature selection, considering redundancy and relevance; reasonable MLP accuracy | Moderate accuracy; lacks detailed dataset and methodology explanation
In this paper, we address the challenge of predicting diabetes in individuals by
employing various machine learning techniques on two distinct datasets: the
PIMA Indian Diabetes dataset and the LMCH Hospital Diabetes dataset. Our goal
is to develop accurate predictive models that can assist in early diabetes detection,
thereby facilitating timely intervention and healthcare support.
6. Proposed Solution with Block Diagram
7. Workflow
Data Preprocessing and Feature selection
In the realm of diabetes prediction, the quality of data and the choice of relevant
features play a pivotal role in the overall success of machine learning models. Data
preprocessing and feature selection are crucial steps in this process, as they directly
impact the accuracy and effectiveness of the predictive models.
Data Preprocessing:
Data preprocessing is the initial and essential step in preparing the raw data for
analysis and modelling. In the context of diabetes prediction, this process involves
several key tasks:
Handling Missing Values: Diabetes datasets often contain missing values, which
can disrupt model training. Imputation techniques such as mean, median, or mode
substitution are employed to fill in missing values without compromising data
integrity.
Outlier Detection and Handling: Outliers can skew the model's predictions.
Robust statistical methods are employed to detect and manage outliers
appropriately.
Normalization: Normalizing data to a specific range, such as [0, 1], can be crucial
for algorithms sensitive to feature scaling, like K-Nearest Neighbors (KNN) or
Support Vector Machines (SVM).
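The three preprocessing tasks above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the report does not say which imputation statistic or outlier rule was used, so median imputation and interquartile-range (IQR) clipping are chosen here as examples, and the function names and sample values are made up.

```python
from statistics import median, quantiles

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def clip_outliers_iqr(values, k=1.5):
    """Clip values lying outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed outlier rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up readings for a single feature; None marks a missing value.
raw = [10, 12, None, 14, 16, 100]
imputed = impute_median(raw)          # the gap is filled with the median, 14
clipped = clip_outliers_iqr(imputed)  # the extreme value 100 is pulled in to 20.0
scaled = min_max_scale(clipped)       # all values now lie in [0, 1]
print(imputed, clipped, scaled)
```

Clipping before scaling matters in this sketch: if the value 100 were left in place, min-max scaling would squeeze the remaining readings into a narrow band near 0.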
Feature Selection:
Feature selection is a critical process that involves choosing the most relevant
features from the dataset while discarding irrelevant or redundant ones. It is
essential for diabetes prediction for several reasons:
In summary, data preprocessing and feature selection are critical steps in diabetes
prediction. Properly cleaned and curated data, combined with a well-chosen set of
relevant features, not only improve the accuracy of predictive models but also
enhance their interpretability and efficiency. These processes are integral to the
success of machine learning-based diabetes prediction systems, ultimately
contributing to early detection and improved healthcare outcomes.
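As an illustration of correlation-based feature selection, the sketch below computes the Spearman rank correlation between each feature and the outcome and orders the features by its absolute value. The report names Spearman correlation but does not describe how it was applied, so the ranking rule, the helper names and the toy readings are all assumptions.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy readings (not taken from the PIMA dataset); outcome 1 = diabetic.
features = {
    "Glucose": [85, 89, 116, 137, 148, 183],
    "BloodPressure": [72, 66, 74, 64, 70, 68],
}
outcome = [0, 0, 0, 1, 1, 1]

# Order the features from most to least strongly correlated with the outcome.
ranked = sorted(features, key=lambda name: abs(spearman(features[name], outcome)),
                reverse=True)
print(ranked)
```

A threshold on the absolute correlation, or a fixed number of top-ranked features, could then decide which features are kept. Both choices would need to be validated on the real data.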
Now that the data has been optimized, we apply several machine learning models to obtain
the predictions. The machine learning models used in this paper are
K-Nearest Neighbors, Support Vector Machine, Random Forest, Naive Bayes and
Logistic Regression.
Applying these models gives the following accuracy on the PIMA Indian Diabetes
dataset.
Table 1: Normal ML model accuracy chart
Fig 2 : Accuracy graph for the different ML models (KNN, SVM, RF, LR, NB); the accuracy axis runs from 0.6 to 0.78
Ensemble Techniques for Enhanced Accuracy:
Bagging:
Bagging is a technique that focuses on reducing the variance of a model by
training multiple instances of the same model on different subsets of the dataset,
typically selected randomly with replacement. The predictions from each model
are then combined through voting or averaging. In the case of diabetes prediction,
when Bagging is applied to the K-Nearest Neighbors (KNN) and Support Vector Machine
(SVM) models, it leverages the diversity among the resampled models to yield a more
accurate and stable prediction. Bagging mitigates overfitting, smooths out
irregularities in individual models, and ultimately results in a more robust and
reliable diabetes prediction.
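The bagging procedure described above (bootstrap sampling followed by voting) can be sketched as follows. This is a minimal illustration, not the configuration used in the project: the base learner is a 1-nearest-neighbour rule chosen for brevity, and the function names, the number of models and the sample readings are assumptions.

```python
import random
from collections import Counter

def nearest_neighbor_predict(train, x):
    """1-NN base learner: return the label of the closest training point."""
    closest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return closest[1]

def bagging_predict(data, x, n_models=11, seed=0):
    """Fit n_models base learners on bootstrap samples and take a majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        votes.append(nearest_neighbor_predict(sample, x))
    return Counter(votes).most_common(1)[0][0]

# Made-up (glucose, BMI) readings; label 0 = non-diabetic, 1 = diabetic.
data = [((85, 22), 0), ((90, 24), 0), ((95, 27), 0), ((100, 25), 0),
        ((150, 31), 1), ((165, 35), 1), ((172, 33), 1), ((188, 38), 1)]

print(bagging_predict(data, (92, 23)))
print(bagging_predict(data, (170, 34)))
```

Because every base learner sees a slightly different sample, an unusual point that misleads one learner is usually outvoted by the others, which is the variance reduction the text refers to.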
Boosting:
Boosting, on the other hand, is a technique that assigns varying weights to data
points during model training, allowing the model to focus on those instances that
were previously misclassified. It trains models iteratively, giving more emphasis
to samples that were difficult to classify. By applying Boosting to Random Forest (RF)
and Naive Bayes (NB), the ensemble benefits from RF's strong
predictive capability and NB's probabilistic approach, ultimately achieving a
significant accuracy boost. Boosting enhances model precision by continually
refining its focus on problematic instances, resulting in improved diabetes
prediction accuracy.
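The reweighting idea behind boosting can be illustrated with a small AdaBoost sketch. It uses one-feature decision stumps as weak learners, which is an assumption made for brevity (the report applies boosting to RF and NB). Labels are coded as +1/-1 as in the usual AdaBoost formulation, and the data are synthetic.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` at or above the threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, rounds=3):
    """Train `rounds` stumps, re-weighting the samples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # the stump's vote strength
        model.append((alpha, threshold, polarity))
        # Misclassified samples gain weight; correctly classified ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def adaboost_predict(model, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in model)
    return 1 if score >= 0 else -1

# Synthetic one-feature data that no single stump can classify perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = adaboost_fit(xs, ys, rounds=1)
boosted = adaboost_fit(xs, ys, rounds=3)
print(sum(adaboost_predict(single, x) != y for x, y in zip(xs, ys)))   # errors of one stump
print(sum(adaboost_predict(boosted, x) != y for x, y in zip(xs, ys)))  # errors after boosting
```

On this toy data a single stump misclassifies three points, while three boosted stumps classify all ten correctly, because each later stump concentrates on the points the earlier ones got wrong.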
Applying the Bagging and Boosting techniques to the different models gives the following accuracy on the PIMA
dataset:
Next, the K-Fold technique is used to obtain better accuracy.
K-Fold Cross-Validation (K-Fold Technique):
Effective Hyperparameter Tuning: When tuning hyperparameters (e.g., the
number of neighbors in KNN or the depth of a decision tree in RF), K-Fold Cross-
Validation aids in selecting the best hyperparameter values. It ensures that
hyperparameters generalize well across different data partitions.
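A minimal sketch of K-Fold cross-validation is given below, assuming unshuffled, contiguous folds. The `fit` and `predict` arguments are placeholders for any model, and the majority-class "model" in the example is purely hypothetical. For hyperparameter tuning, the same function would be called once per candidate value and the value with the highest mean score kept.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is used for testing exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(fit, predict, xs, ys, k=5):
    """Mean accuracy over k folds for a model given by fit/predict callables."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

# Hypothetical baseline "model": always predict the most common training label.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def predict_majority(model, x):
    return model

xs = list(range(10))
ys = [0] * 7 + [1] * 3
print(cross_val_accuracy(fit_majority, predict_majority, xs, ys, k=5))
```

In practice the data would be shuffled (and ideally stratified by class) before being split, since the ordering of the records can otherwise bias individual folds.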
The accuracies of the models without and with the K-Fold technique are:
Table 3: Accuracy table using K-fold technique
[Chart: Accuracy (axis from about 0.76 to 0.8) of the KN, SVC, LR, GN and RF models, without K-Fold vs. with K-Fold]
Random Forest Algorithm:
Once the decision trees are trained, they are used to make predictions on
new data. To do this, each tree makes a prediction, and the final prediction
of the Random Forest is the majority vote of the individual tree predictions.
Random Forest has been shown to be an effective algorithm for
predicting diabetes. It can handle complex relationships in the data and
identify important features that are associated with diabetes risk.
One advantage often cited for Random Forest in diabetes prediction is its
tolerance of missing data. This is important because diabetes data often
contains missing values, such as blood glucose levels or BMI. Some Random
Forest implementations can handle or impute missing values directly; with
others, the missing values must be imputed during preprocessing before
training.
AdaBoost Technique:
AdaBoost can achieve high accuracy even when the training data is
relatively small, although it is sensitive to noisy data and outliers.
AdaBoost is able to learn complex relationships between different
features.
AdaBoost is relatively easy to implement and tune.
AdaBoost has been used successfully to predict diabetes in a number of
studies. For example, one study found that AdaBoost was able to achieve
an accuracy of 90% in predicting diabetes in a cohort of patients.
Random Forest Algorithm + AdaBoost Technique:
One way to combine AdaBoost and Random Forest for diabetes prediction
is to use AdaBoost to train a Random Forest model. This can be done by
training each individual decision tree in the Random Forest model using
AdaBoost. Another way to combine AdaBoost and Random Forest is to
stack the two models. In this approach, the predictions of the AdaBoost
model are used as inputs to the Random Forest model.
Logistic Regression + Decision Tree:
Logistic Regression:
Logistic Regression is a statistical model used for binary classification
problems, where the dependent variable is categorical with two levels. It's
widely used for predicting the probability of an instance belonging to a
particular class. The logistic function (sigmoid) is used to transform a
linear combination of features into a value between 0 and 1, representing
the probability of the positive class. Logistic Regression is interpretable,
computationally efficient, and often serves as a baseline model for binary
classification tasks.
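The sigmoid transformation described above can be written out directly. In this sketch the weights, the bias and the two features (glucose and BMI) are invented for illustration and are not the fitted coefficients of the project's model.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class from a linear combination of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for (glucose, BMI); real values come from training.
weights = [0.03, 0.05]
bias = -6.0

for patient in ([90, 22], [150, 30], [190, 38]):
    p = predict_proba(weights, bias, patient)
    label = 1 if p >= 0.5 else 0
    print(patient, round(p, 3), label)
```

With these made-up coefficients, the linear combination for the middle patient is approximately zero, so the predicted probability sits at about 0.5, the decision boundary. Lower readings fall below it and higher readings above it.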
Decision Tree:
A Decision Tree is a non-linear model that recursively splits the dataset
into subsets based on the most significant feature at each node. It makes
decisions by traversing the tree from the root to a leaf node, where the leaf
node corresponds to the predicted class. Decision Trees are capable of
capturing complex relationships in the data and can model non-linear
decision boundaries. They are interpretable and can handle both numerical
and categorical data.
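How a tree chooses the split at a single node can be sketched as below. The report does not state the splitting criterion, so Gini impurity (as used in CART-style trees) is assumed here, and the two-feature sample data are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] < threshold]
            right = [y for row, y in zip(rows, labels) if row[f] >= threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Made-up rows of (glucose, age); label 1 = diabetic.
rows = [[85, 50], [90, 25], [100, 40], [150, 30], [160, 45], [170, 22]]
labels = [0, 0, 0, 1, 1, 1]

print(best_split(rows, labels))
```

A full tree repeats this search recursively on the left and right subsets until a stopping rule, such as a maximum depth or a pure node, is reached. Here the glucose feature separates the two classes completely, so it is chosen over age.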
and reliable model.
4. Improved Accuracy: The combination of models in an ensemble can lead
to better overall accuracy than using either model individually. This is
especially true when the models make different errors on the data.
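The point about models making different errors can be illustrated with a small soft-voting sketch. The report does not state how Logistic Regression and Decision Tree were combined, so averaging their predicted probabilities is an assumption, and the probabilities below are invented for illustration.

```python
def soft_vote(prob_a, prob_b, threshold=0.5):
    """Average two models' positive-class probabilities and threshold the result."""
    return [1 if (a + b) / 2 >= threshold else 0 for a, b in zip(prob_a, prob_b)]

def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

truth = [1, 0, 1, 0]
p_lr = [0.9, 0.6, 0.8, 0.1]  # hypothetical LR outputs: wrong on the 2nd patient
p_dt = [0.7, 0.2, 0.4, 0.2]  # hypothetical DT outputs: wrong on the 3rd patient

lr_only = [1 if p >= 0.5 else 0 for p in p_lr]
dt_only = [1 if p >= 0.5 else 0 for p in p_dt]
combined = soft_vote(p_lr, p_dt)

print(accuracy(lr_only, truth), accuracy(dt_only, truth), accuracy(combined, truth))
```

Each model alone gets three of the four cases right, but because their mistakes fall on different patients, the averaged probabilities classify all four correctly. When both models err on the same cases, averaging cannot help.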
[Chart: Accuracy (axis from 0.7 to 0.82) of Random Forest, AdaBoost, Random Forest + AdaBoost, and Logistic Regression + Decision Tree]
8. RESULT AND ANALYSIS
From the above tests it is clear that the accuracy increases. Applying
ensemble techniques such as bagging and boosting increases the accuracy
by 2-3%, and the K-Fold technique increases it by
1-2%. Using these methods therefore improves the prediction of
diabetes and supports detection at the earliest possible stage.
In this study, we set out to improve the accuracy of diabetes prediction using a
combination of feature selection, ensemble techniques, and K-Fold cross-
validation. Our findings reveal a substantial enhancement in accuracy, which
has significant implications for early diagnosis and improved healthcare.
The Power of Ensemble Techniques
consequences. Detecting diabetes at an earlier stage enables timely
interventions, improved patient outcomes, and efficient allocation of healthcare
resources.
In summary, our study demonstrates that a combination of feature selection,
ensemble techniques, and K-Fold cross-validation significantly improves the
accuracy of diabetes prediction. This advancement holds great promise for the
healthcare industry, where precision and timeliness are critical in managing
chronic diseases such as diabetes.
9. CONCLUSION
10. FUTURE WORK
11. REFERENCES
[5]. H. Naz, S. Ahuja, Deep learning approach for diabetes prediction using
PIMA Indian dataset, J. Diabetes Metab. Disord. 19 (1) (2020) 391–403.
[6]. T.M. Alam, M.A. Iqbal, Y. Ali, A. Wahab, S. Ijaz, T.I. Baig, A. Hussain,
M.A. Malik, M.M. Raza, S. Ibrar, Z. Abbas, A model for early prediction of
diabetes, Inform. Med. Unlocked 16 (2019) 100204.
[7]. Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, H. Tang, Predicting diabetes mellitus
with machine learning techniques, Front. Genet. 9 (2018) 515.