
Major Project Final TABLE DIAGRAM

The document is a project report submitted by four students towards completion of their major project for a Bachelor's degree in Computer Science and Engineering. It aims to develop a machine learning model for diabetes prediction using two datasets. The proposed framework includes two stages: data preprocessing involving feature selection, missing value imputation, and correlation analysis; and classification using machine learning algorithms like Naive Bayes, Random Forest, SVM, and k-NN, with ensemble techniques and k-fold cross validation for accuracy improvement. The logistic regression and decision tree combined model achieved the highest accuracy.


Diabetes Prediction Using Machine Learning Models
A PROJECT REPORT

Submitted by
Satya Narayan Gope 20010433
Avinash Kumar 20010426
Pravir Kumar Pradhan 20010451
Vikash Kumar 20010424

Towards completion of major project Stage-I


(7th semester)
of
BACHELOR OF TECHNOLOGY

in

COMPUTER SCIENCE & ENGINEERING

Department of Computer Science & Engineering


C.V. RAMAN GLOBAL UNIVERSITY
BHUBANESWAR-ODISHA-752054

NOVEMBER 2023
C.V. RAMAN GLOBAL UNIVERSITY
BHUBANESWAR-ODISHA-752054

CERTIFICATE OF APPROVAL

This is to certify that we have examined the project entitled "Diabetes Prediction
Using Machine Learning Models" submitted by Satya Narayan Gope, Registration
No. 20010433, Avinash Kumar, Registration No. 20010426, Pravir Kumar
Pradhan, Registration No. 20010451, and Vikash Kumar, Registration No. 20010424,
CGU-Odisha, Bhubaneswar. We hereby accord our approval of it as a major project
work carried out and presented in a manner required for its acceptance towards
completion of major project Stage-I (7th semester) of the Bachelor's Degree in Computer
Science & Engineering for which it has been submitted. This approval does not
necessarily endorse or accept every statement made, opinion expressed, or
conclusion drawn as recorded in this major project; it only signifies the acceptance
of the major project for the purpose for which it has been submitted.

SUPERVISOR HEAD OF THE DEPARTMENT


ACKNOWLEDGEMENT:

We would like to express our deep gratitude to our project guide, Dr. Debendra Muduli,
Professor, Department of Computer Science & Engineering, who has always been a
source of motivation and firm support in carrying out the project.
We would also like to convey our sincerest gratitude and indebtedness to all other faculty
members and staff of the Department of Computer Science & Engineering, who offered
their effort and guidance at appropriate times, without which our project work would
have been very difficult.
A work of this nature could never have been attempted without reference to and
inspiration from the works of others, whose details are mentioned in the references section.
We acknowledge our indebtedness to all of them. Further, we would like to express our
gratitude towards our parents and God, who directly or indirectly encouraged and motivated
us during this project.

Satya Narayan Gope 20010433


Avinash Kumar 20010426
Pravir Kumar Pradhan 20010451
Vikash Kumar 20010424
CONTENTS:
1. Abstract
2. Introduction
3. Literature Review
4. List of Tables
i. Table 1
ii. Table 2
iii. Table 3
5. List of Figures
i. Fig 1
ii. Fig 2
iii. Fig 3
iv. Fig 4
v. Fig 5
vi. Fig 6
vii. Fig 7
viii. Fig 8
6. Proposed Solution
7. Workflow
8. Result & Analysis
9. Conclusion
10. Future Work
11. References
1. ABSTRACT

Diabetes mellitus is a metabolic disorder characterized by hyperglycemia, which results from the
body's inability to secrete or respond to insulin adequately. Many years of research in computational
diagnosis of diabetes have pointed to machine learning as a viable solution for the prediction of
diabetes. However, the accuracy rates reported to date suggest that there is still much room for
improvement. The goal of this project is to develop a model that can accurately and efficiently detect
diabetes so that it can be treated as early as possible. In this paper, we propose a robust framework for
building a diabetes prediction model to aid in the clinical diagnosis of diabetes. The datasets used are
the PIMA Indian diabetes dataset and the Laboratory of the Medical City Hospital (LMCH) diabetes dataset.
Our proposed framework comprises two stages: Data Preprocessing and Classification. It is designed
to address the factors that we presume affect accuracy in the early diagnosis of diabetes mellitus. First, it
preprocesses the data using Spearman correlation, feature selection, and missing value imputation.
Second, for classification, it applies ensemble techniques such as Bagging and Boosting to machine
learning models: Naive Bayes (NB), Random Forest (RF), Support Vector Machines (SVM), and
k-nearest neighbor (k-NN). It then uses the K-Fold Cross-Validation technique to further improve
accuracy. After the data was optimized, different combinations of machine learning models were tried,
and the combined LR + DT model achieved the highest accuracy.

2. INTRODUCTION

Diabetes mellitus is a chronic metabolic disease characterized by high blood sugar levels
(hyperglycaemia). It is caused by either the pancreas not producing enough insulin, or the
body’s cells not responding properly to the insulin that is produced. Insulin is a hormone that
helps the body’s cells use glucose for energy.
Diabetes is a major global health problem, with over 463 million people worldwide living with
the disease. Diabetes is also a leading cause of death, with over 4.2 million deaths attributed
to the disease in 2019.
Early detection and treatment of diabetes are essential to prevent complications such as heart
disease, stroke, kidney failure, and blindness. Machine learning (ML) has the potential to
improve the early detection of diabetes by developing models that can identify people who are
at high risk of developing the disease.
This project proposes a robust framework for building a diabetes prediction model using ML.
The proposed framework comprises two stages: Data Preprocessing and Classification. The Data
Preprocessing stage is designed to address factors that can affect the accuracy of diabetes
prediction models, such as missing values and correlation between features. The Classification
stage involves training and evaluating a variety of machine learning models, namely Naive
Bayes (NB), Random Forest (RF), Support Vector Machines (SVM), and k-nearest neighbour
(k-NN), together with ensemble techniques such as Bagging and Boosting, to predict whether
or not a patient has diabetes. K-Fold Cross-Validation is then used to further improve
accuracy, followed by a combination of Random Forest and AdaBoost. Finally, the
combination of Logistic Regression and Decision Tree gave the highest accuracy.
The proposed framework is evaluated on a public dataset, the PIMA Indian Diabetes Dataset.
The results show that the proposed framework achieves high accuracy in predicting diabetes.

3. LITERATURE REVIEW

| Authors | Year | Feature selection (FS) and missing value imputation (MVI) | Classification | Advantages | Disadvantages |
| --- | --- | --- | --- | --- | --- |
| Chollette C. Olisah, Lyndon Smith, Melvyn Smith [1] | 2022 | FS: none specified; MVI: removed missing values | MLP | High 92.31% accuracy with MLP | Lack of feature selection details, limited context on dataset characteristics. |
| Bukhari et al [2] | 2021 | FS: none specified; MVI: removed missing values | ANN trained with ABS conjugate gradient neural network (ABP-CGNN) | High 93% accuracy using ABS conjugate gradient neural network (ABP-CGNN) | Lack of feature selection/methodology details, no missing value handling mentioned |
| Roy et al [3] | 2021 | Median value, K-NN, and iterative imputer were used for missing value imputation | ANN | High 98% accuracy with ANN, effective imputation using K-NN and iterative imputer | Lack of detailed methodology on feature selection, model architecture, or dataset. |
| Khanam et al [4] | 2021 | FS: Pearson correlation; MVI: median value for missing values imputation | DNN run with different hidden layers | 86.26% accuracy with DNN, adaptable architecture, effective missing value imputation. | Limited detail on feature selection method, lacks insight into dataset characteristics. |
| Naz and Ahuja [5] | 2020 | Method not stated | MLP and DL with 2 hidden layers | High 98.07% accuracy using DL with 2 hidden layers, significant result. | Lack of method description, insufficient context on dataset and experimentation. |
| Alam et al [6] | 2019 | FS: PCA; MVI: median value | MLP | Effective PCA feature selection, reasonable accuracy with MLP neural network | Moderate accuracy, lacks details on dataset, and limited methodology description. |
| Zou et al [7] | 2018 | FS: PCA; MVI: redundancy and minimum relevance | MLP | PCA feature selection, considering redundancy and relevance, reasonable MLP accuracy | Moderate accuracy, lacks detailed dataset and methodology explanation. |
| Roy et al [3] | 2021 | Median value, K-NN, and iterative imputer were used for missing value imputation. | LR, SVM, RF, LGBM | Effective imputation methods, LGBM achieved a competitive 86% accuracy. | Lack of detailed feature selection and model parameter information. |

In this paper, we address the challenge of predicting diabetes in individuals by
employing various machine learning techniques on two distinct datasets: the
PIMA Indian Diabetes dataset and the LMCH Hospital Diabetes dataset. Our goal
is to develop accurate predictive models that can assist in early diabetes detection,
thereby facilitating timely intervention and healthcare support.
6. Proposed Solution with Block Diagram

 Fig 1: Block Diagram of Proposed Model

7. Workflow
Data Preprocessing and Feature selection

In the realm of diabetes prediction, the quality of data and the choice of relevant
features play a pivotal role in the overall success of machine learning models. Data
preprocessing and feature selection are crucial steps in this process, as they directly
impact the accuracy and effectiveness of the predictive models.
Data Preprocessing:

Data preprocessing is the initial and essential step in preparing the raw data for
analysis and modelling. In the context of diabetes prediction, this process involves
several key tasks:

Handling Missing Values: Diabetes datasets often contain missing values, which
can disrupt model training. Imputation techniques such as mean, median, or mode
substitution are employed to fill in missing values without compromising data
integrity.

Data Standardization: Standardizing numerical features ensures that they have a
common scale, preventing certain features from dominating others. This process
often involves mean centering and scaling to unit variance.

Categorical Data Encoding: If the dataset includes categorical variables (e.g.,
gender, ethnicity), they must be converted into numerical form for machine
learning algorithms to process. Techniques such as one-hot encoding or label
encoding are utilized for this purpose.

Outlier Detection and Handling: Outliers can skew the model's predictions.
Robust statistical methods are employed to detect and manage outliers
appropriately.

Normalization: Normalizing data to a specific range, such as [0, 1], can be crucial
for algorithms sensitive to feature scaling, like K-Nearest Neighbors (KNN) or
Support Vector Machines (SVM).
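
The preprocessing steps above can be illustrated with a minimal sketch. It is not the code used in this project: the glucose readings are hypothetical, missing entries are assumed to be marked as None, and the helper names (impute_median, standardize, min_max) are illustrative only. Each column is assumed to contain at least two distinct values.

```python
from statistics import mean, median, pstdev

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

def standardize(column):
    """Center to zero mean and scale to unit (population) variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def min_max(column):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical glucose readings; None marks a missing value.
glucose = [148, 85, None, 89, 137]

filled = impute_median(glucose)    # the missing entry becomes the median, 113.0
z_scores = standardize(filled)     # zero mean, unit variance
scaled = min_max(filled)           # smallest value maps to 0, largest to 1
```

Standardization and normalization are alternatives rather than consecutive steps; which one suits a given model (for example KNN or SVM) is a design choice.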

Feature Selection:

Feature selection is a critical process that involves choosing the most relevant
features from the dataset while discarding irrelevant or redundant ones. It is
essential for diabetes prediction for several reasons:

Reducing Dimensionality: Diabetes datasets can be high-dimensional, containing
numerous features. High dimensionality can lead to overfitting, increased
computational complexity, and reduced model generalization. Feature selection
helps reduce dimensionality by focusing on the most informative attributes.

Enhancing Model Interpretability: A concise set of relevant features makes it
easier to interpret and understand the relationships between variables, aiding
medical professionals in making informed decisions.

Improving Model Accuracy: Including irrelevant or redundant features can
introduce noise into the model, reducing its accuracy. Feature selection ensures
that only the most informative attributes contribute to the prediction.

Reducing Computational Resources: Simplifying the dataset by selecting
essential features can lead to faster model training and prediction, making it more
efficient for real-world applications.
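
As a rough illustration of correlation-based feature selection, the sketch below ranks features by their Spearman correlation with the outcome and keeps those above a cutoff. The abstract names Spearman correlation, but the data, the 0.5 threshold, and the function names here are assumptions made for the example, not the project's actual settings.

```python
def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def select_features(columns, target, threshold):
    """Keep feature names whose |Spearman rho| with the target meets the threshold."""
    return [name for name, col in columns.items()
            if abs(spearman(col, target)) >= threshold]

# Hypothetical feature columns and outcome (1 = diabetic, 0 = non-diabetic).
columns = {
    "glucose": [85, 89, 137, 148, 183, 116],
    "skin_thickness": [29, 23, 35, 35, 0, 0],
}
outcome = [0, 0, 1, 1, 1, 0]

selected = select_features(columns, outcome, threshold=0.5)
```

On this toy data only the glucose column passes the cutoff; on a real dataset the threshold would need to be chosen with care.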

In summary, data preprocessing and feature selection are critical steps in diabetes
prediction. Properly cleaned and curated data, combined with a well-chosen set of
relevant features, not only improve the accuracy of predictive models but also
enhance their interpretability and efficiency. These processes are integral to the
success of machine learning-based diabetes prediction systems, ultimately
contributing to early detection and improved healthcare outcomes.

Now that the data has been optimized, we apply several machine learning models to obtain
predictions. The machine learning models used in this paper are K-Nearest Neighbors,
Support Vector Machine, Random Forest, Naive Bayes, and Logistic Regression.

Machine Learning models:

1. KNN (K-Nearest Neighbors): KNN predicts the diabetes status of a new
individual from the diabetes status of its nearest neighbors in the dataset.

2. SVM (Support Vector Machine): SVM predicts diabetes by finding an optimal
hyperplane that best separates diabetic from non-diabetic individuals, maximizing
the margin between the two classes.

3. NB (Naive Bayes): NB predicts diabetes using Bayesian probability, estimating
the likelihood of an individual having diabetes based on the conditional
probabilities of their features given their diabetes status.

4. LR (Logistic Regression): LR predicts diabetes by modeling the relationship
between the individual's features and the likelihood of having diabetes, providing
a probability score and making a binary prediction based on a predefined
threshold.
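
To make the first model in the list concrete, here is a minimal KNN sketch. The training points are hypothetical, already-scaled (glucose, BMI) pairs, and k=3 is an arbitrary choice for illustration; neither is taken from the project's experiments.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical scaled (glucose, BMI) pairs; 1 = diabetic, 0 = non-diabetic.
train_x = [(0.2, 0.3), (0.3, 0.2), (0.25, 0.35),
           (0.8, 0.7), (0.9, 0.8), (0.85, 0.6)]
train_y = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_x, train_y, (0.82, 0.75), k=3)
```

Because KNN relies on distances, it is one of the models for which the scaling steps described earlier matter most.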

Applying these models to the PIMA Indian Diabetes dataset gives the following accuracies.
 Table 1: Normal ML model accuracy chart

 Fig 2: Accuracy graph for the ML models KNN, SVM, RF, LR, and NB (bar chart; y-axis: accuracy, ticks from 0.6 to 0.78)

Ensemble Techniques for Enhanced Accuracy:

Ensemble techniques are powerful strategies in machine learning that aim to
amalgamate the predictive capabilities of multiple individual models, thus
achieving superior accuracy and robustness compared to individual models. Two
prominent ensemble techniques are Bagging and Boosting, each with its unique
approach to enhancing predictive performance.

Bagging (Bootstrap Aggregating):

 Fig 3: visual presentation of Bagging and Boosting

Bagging is a technique that focuses on reducing the variance of a model by
training multiple instances of the same model on different subsets of the dataset,
typically selected randomly with replacement. The predictions from each model
are then combined through voting or averaging. In the case of diabetes prediction,
when we combine K-Nearest Neighbors (KNN) and Support Vector Machine
(SVM) using Bagging, it leverages the diversity of these models to yield a more
accurate and stable prediction. Bagging mitigates overfitting, smooths out
irregularities in individual models, and ultimately results in a more robust and
reliable diabetes prediction.
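
The bagging procedure described above (bootstrap samples drawn with replacement, then a majority vote) can be sketched as follows. For brevity the base learner is a simple nearest-centroid rule on one feature, which is a stand-in chosen for this example and not the KNN or SVM models used in the project; the data, the seed, and the number of models are likewise assumptions.

```python
import random
from collections import Counter

def fit_centroids(xs, ys):
    """Base learner: store the mean feature value of each class present."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def predict_centroids(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train each base learner on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_centroids([xs[i] for i in idx],
                                    [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_centroids(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical one-feature dataset; 1 = diabetic, 0 = non-diabetic.
xs = [1.0, 1.2, 1.5, 2.0, 8.0, 8.4, 8.8, 9.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

ensemble = bagging_fit(xs, ys)
```

Calling bagging_predict(ensemble, 1.4) returns the vote of all 25 base learners for a new value.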

Boosting:

Boosting, on the other hand, is a technique that assigns varying weights to data
points during model training, allowing the model to focus on those instances that
were previously misclassified. It iteratively trains models, giving more emphasis
to samples that were challenging to classify. By combining Random Forest (RF)
and Naive Bayes (NB) using Boosting, the ensemble benefits from RF's strong
predictive capability and NB's probabilistic approach, ultimately achieving a
significant accuracy boost. Boosting enhances model precision by continually
refining its focus on problematic instances, resulting in improved diabetes
prediction accuracy.

After applying the Bagging and Boosting techniques to the different models, the accuracies for the PIMA
dataset are:

 Table 2: Bagging and Boosting Accuracy

 Fig 4: Accuracy chart of Bagging and Boosting

To improve accuracy further, the K-Fold technique is used.
K-Fold Cross-Validation (K-Fold Technique):

K-Fold Cross-Validation is a robust and widely used technique in machine
learning for assessing the performance and reliability of predictive models. It
provides a comprehensive evaluation of a model's effectiveness by partitioning the
dataset into 'K' equally sized subsets or "folds." The model is then trained and
tested 'K' times, each time using a different fold as the test set while the remaining
folds serve as the training set. This process rotates until each fold has been used
as the test data exactly once.
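
The fold rotation described above can be sketched in a few lines. This is an illustrative sketch, not the project's code: it splits the data in order without shuffling or stratification (both are common in practice), and the fit and predict arguments are hypothetical callables supplied by the caller.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(fit, predict, xs, ys, k=5):
    """Average test accuracy over k folds for any fit/predict pair."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        hits = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

# Ten samples split into three folds of sizes 4, 3 and 3.
folds = list(k_fold_indices(10, 3))
```

The averaged score is an estimate of how the model performs on unseen data; the spread of the per-fold scores indicates how sensitive that estimate is to the split.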

How K-Fold Cross-Validation Benefits Diabetes Prediction:

Reduces Overfitting: K-Fold Cross-Validation helps to detect and mitigate
overfitting, a common issue in machine learning where a model performs
exceptionally well on the training data but poorly on new, unseen data. By
repeatedly evaluating the model on different subsets of the dataset, K-Fold
Cross-Validation provides a more comprehensive assessment of its generalization
capabilities.

Robustness Assessment: It allows for a more robust assessment of a model's
performance. The variability in results across the 'K' runs provides insights into
the model's stability and helps identify potential issues, such as sensitivity to data
splits or random variations.

Effective Hyperparameter Tuning: When tuning hyperparameters (e.g., the
number of neighbors in KNN or the depth of the decision trees in RF), K-Fold
Cross-Validation aids in selecting the best hyperparameter values. It ensures that
hyperparameters generalize well across different data partitions.

Improved Confidence in Accuracy Estimates: By aggregating the results from
'K' different test sets, K-Fold Cross-Validation provides a more accurate estimate
of a model's predictive accuracy. This estimate is typically more reliable than a
single train-test split, making it valuable for assessing diabetes prediction accuracy.

In the context of diabetes prediction, K-Fold Cross-Validation offers a robust
and systematic approach to determine the true predictive power of a model. It
ensures that the model's performance is not influenced by the
specific data split, resulting in more trustworthy accuracy assessments. This
technique is an essential tool for building and fine-tuning models that can
effectively identify diabetes.

 Fig 5: Accuracy chart from K-Fold techniques

Without the K-Fold technique, the accuracies of the models are:
 Table 3: Accuracy table using K-fold technique

 Fig 6: Accuracy with K-Fold and without K-Fold (bar chart; y-axis: accuracy; x-axis labels as printed: KN, SVC, LR, GN, RF; series: Without K-Fold, With K-Fold)

Random Forest Algorithm:

Random Forest is a supervised machine learning algorithm that can be used
for both classification and regression tasks. It is a type of ensemble learning
method, which means that it combines the predictions of multiple
individual decision trees to produce a more accurate and robust prediction.

Random Forest works by constructing a multitude of decision trees during
training. Each decision tree is trained on a different random subset of the
training data, and a random subset of features is considered at each split in
the tree. This process helps to reduce overfitting and improve the
generalization performance of the model.

Once the decision trees are trained, they are used to make predictions on
new data. To do this, each tree makes a prediction, and the final prediction
of the Random Forest is the majority vote of the individual tree predictions.

Random Forest has been shown to be a very effective algorithm for
predicting diabetes. It is able to handle complex data relationships and
identify important features that are associated with diabetes risk.

One of the main advantages of Random Forest for diabetes prediction is that it
copes well with missing data. This is important because diabetes data often
contains missing values, such as blood glucose levels or BMI. Once the missing
values have been imputed (some Random Forest implementations can do this
themselves), the model can still produce accurate predictions.

Another advantage of Random Forest is its interpretability. Unlike some
other machine learning algorithms, Random Forest provides feature importance
scores that allow you to identify which features are most important for
predicting diabetes. This information can be used to develop targeted
interventions to prevent or manage diabetes.

Adaboost Technique:

AdaBoost (Adaptive Boosting) is an ensemble machine learning
algorithm that combines multiple weak learners to create a strong learner.
It is a sequential learning algorithm: each new weak learner is trained to
correct the mistakes of the previous ones. AdaBoost works by
giving more weight to the training examples that are misclassified by the
previous weak learners. This forces the subsequent weak learners to focus
on the difficult examples, which leads to an improvement in the overall
accuracy of the ensemble.

The final output of AdaBoost is a weighted vote over the predictions of
the weak learners, with more accurate learners receiving larger weights.
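
The reweighting loop described above can be sketched with one-feature threshold stumps as the weak learners. This is an illustrative sketch under stated assumptions, not the project's implementation: labels are encoded as +1 and -1, the ten-point dataset is a toy example that no single stump can separate, and the function names are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict polarity if x >= threshold, otherwise -polarity."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, rounds=3):
    """Train `rounds` stumps, reweighting the examples after each one."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)        # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # this learner's vote weight
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples (y * prediction = -1) gain weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted sum of the weak learners' votes."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Toy data: labels in {+1, -1}; no single threshold separates the classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, rounds=3)
```

On this toy data a single stump misclassifies three of the ten points, while the weighted vote of three stumps classifies all of them correctly.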

AdaBoost can be very helpful for predicting diabetes because it is able to
learn from complex relationships between different features. For example,
AdaBoost can learn that patients with a high body mass index, a high
blood sugar level, and a family history of diabetes are more likely to have
diabetes.

Here are some of the benefits of using AdaBoost to predict diabetes:

AdaBoost is able to achieve high accuracy with simple weak learners, although
it can be sensitive to noisy data and outliers.
AdaBoost is able to learn from complex relationships between different
features.
AdaBoost is relatively easy to implement and tune.

AdaBoost has been used successfully to predict diabetes in a number of
studies. For example, one study found that AdaBoost was able to achieve
an accuracy of 90% in predicting diabetes in a cohort of patients.

The accuracy obtained with the AdaBoost technique is approximately 73.96%.

Random Forest Algorithm+ Adaboost Technique:

AdaBoost and Random Forest can be used to improve the accuracy of
diabetes prediction by combining their strengths. AdaBoost can help to
improve the performance of a weak learner, such as a decision tree, by
giving more weight to examples that are difficult to classify. Random
Forest can help to reduce overfitting and improve the generalization
performance of the model by averaging the predictions of many different
decision trees.

One way to combine AdaBoost and Random Forest for diabetes prediction
is to use AdaBoost to train a Random Forest model. This can be done by
training each individual decision tree in the Random Forest model using
AdaBoost. Another way to combine AdaBoost and Random Forest is to
stack the two models. In this approach, the predictions of the AdaBoost
model are used as inputs to the Random Forest model.

The combined model reaches an accuracy of approximately 78.03%.


 Fig 7: Accuracy chart for Random Forest, AdaBoost, and Random Forest + AdaBoost (bar chart; y-axis: accuracy, ticks from 0.71 to 0.79)

Logistic Regression+ Decision Tree:

Logistic Regression:
Logistic Regression is a statistical model used for binary classification
problems, where the dependent variable is categorical with two levels. It's
widely used for predicting the probability of an instance belonging to a
particular class. The logistic function (sigmoid) is used to transform a
linear combination of features into a value between 0 and 1, representing
the probability of the positive class. Logistic Regression is interpretable,
computationally efficient, and often serves as a baseline model for binary
classification tasks.
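
A minimal sketch of the prediction step described above: a linear combination of the features is passed through the sigmoid and compared with a threshold. The weights and bias below are hypothetical values chosen for illustration; they are not fitted to any dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(weights, bias, features, threshold=0.5):
    """Binary prediction: 1 if the probability reaches the threshold, else 0."""
    return int(predict_proba(weights, bias, features) >= threshold)

# Hypothetical coefficients for two scaled features (not fitted to real data).
weights, bias = [2.0, 1.0], -1.5

p = predict_proba(weights, bias, [0.9, 0.4])   # linear score z = 0.7
```

Lowering the threshold below 0.5 trades more false positives for fewer missed cases, which can be a reasonable choice in a screening setting.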
Decision Tree:
A Decision Tree is a non-linear model that recursively splits the dataset
into subsets based on the most significant feature at each node. It makes
decisions by traversing the tree from the root to a leaf node, where the leaf
node corresponds to the predicted class. Decision Trees are capable of
capturing complex relationships in the data and can model non-linear
decision boundaries. They are interpretable and can handle both numerical
and categorical data.
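
How a tree picks the split at a node can be sketched as below. The report does not name the splitting criterion, so Gini impurity is used here as an assumption; the rows are hypothetical (glucose, BMI) pairs and the function names are illustrative.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the purest split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] < threshold]
            right = [y for row, y in zip(rows, labels) if row[f] >= threshold]
            if not left or not right:
                continue                      # skip splits that separate nothing
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical (glucose, BMI) rows; 1 = diabetic, 0 = non-diabetic.
rows = [(85, 26.6), (89, 28.1), (116, 25.6),
        (137, 43.1), (148, 33.6), (183, 23.3)]
labels = [0, 0, 0, 1, 1, 1]

split = best_split(rows, labels)   # glucose (feature 0) at threshold 137
```

A full tree applies the same search recursively to the left and right subsets until a stopping rule, such as a maximum depth, is met.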

Using Logistic Regression and Decision Trees Together:


Ensembling methods, such as combining Logistic Regression and
Decision Trees, can be beneficial. Each model has its strengths and
weaknesses, and combining them can lead to better predictive
performance. Here's why:
1. Complementary Strengths: Logistic Regression is good at capturing
linear relationships, while Decision Trees can model complex non-linear
patterns. Combining them allows the ensemble to handle a wider range of
data patterns.
2. Reducing Overfitting: Decision Trees can be prone to overfitting,
especially when the tree is deep. Combining it with Logistic Regression,
which tends to be more robust to overfitting, can help improve
generalization performance.
3. Robustness: If one model fails to capture certain patterns in the data, the
other might compensate for it. Ensembling helps in creating a more robust
and reliable model.
4. Improved Accuracy: The combination of models in an ensemble can lead
to better overall accuracy than using either model individually. This is
especially true when the models make different errors on the data.
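
The report does not state how the two models are combined, so the sketch below assumes one common option, soft voting, in which the predicted probabilities are averaged and then thresholded. Both models are hypothetical hand-written stand-ins, not the fitted models from this project.

```python
import math

def lr_proba(features):
    """Hypothetical logistic regression: sigmoid of a linear score."""
    z = -1.5 + 2.0 * features["glucose"] + 1.0 * features["bmi"]
    return 1.0 / (1.0 + math.exp(-z))

def tree_proba(features):
    """Hypothetical depth-2 decision tree returning leaf class-1 frequencies."""
    if features["glucose"] < 0.6:
        return 0.1 if features["bmi"] < 0.5 else 0.4
    return 0.7 if features["bmi"] < 0.5 else 0.9

def soft_vote(models, features, weights=None, threshold=0.5):
    """Average the models' probabilities (optionally weighted), then threshold."""
    weights = weights or [1.0] * len(models)
    avg = sum(w * m(features) for w, m in zip(weights, models)) / sum(weights)
    return avg, int(avg >= threshold)

# Hypothetical scaled features for one patient.
patient = {"glucose": 0.55, "bmi": 0.7}

avg, label = soft_vote([lr_proba, tree_proba], patient)
```

For this patient the logistic model alone leans towards the positive class while the tree leans negative; the averaged probability falls just below 0.5, which shows how the two models can moderate each other.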

The combined model reaches an accuracy of approximately 80.52%.

 Fig 8: Accuracy chart for Random Forest, AdaBoost, Random Forest + AdaBoost, and Logistic Regression + Decision Tree (bar chart; y-axis: accuracy, ticks from 0.7 to 0.82)

8. RESULT AND ANALYSIS

The tests above clearly show that the accuracy increases. Applying ensemble
techniques such as bagging and boosting increased the accuracy by 2-3%, and
the K-Fold technique increased it by 1-2%. Using these methods therefore
improves the prediction of diabetes and helps detect it as early as possible.
In this study, we set out to improve the accuracy of diabetes prediction using a
combination of feature selection, ensemble techniques, and K-Fold
cross-validation. Our findings reveal a substantial enhancement in accuracy, which
has significant implications for early diagnosis and improved healthcare.
The Power of Ensemble Techniques

To further elevate our predictive accuracy, we integrated various ensemble
techniques into our modeling approach. Bagging and boosting methods
emerged as standout performers. Bagging, through its resampling and
aggregation of predictions, smoothed out inconsistencies in the data, resulting in
a 2-3% increase in accuracy. Boosting, on the other hand, iteratively
improved our model's ability to handle complex patterns in the dataset,
contributing to the accuracy gain.
K-Fold Cross-Validation's Contribution
In addition to feature selection and ensemble techniques, we incorporated K-Fold
cross-validation into our methodology. This approach provided a more
robust evaluation of our models by repeatedly splitting the data into training
and testing subsets. The results showed a 1-2% increase in accuracy.
K-Fold cross-validation not only validated the reliability of our models but also
helped identify and mitigate overfitting, making our predictions more
generalizable and dependable.
Combining Models' Contribution
In addition, combining the decision tree and logistic regression models
increased the accuracy by up to 5%. This approach provided a more
robust evaluation of our models.

Implications for Diabetes Diagnosis


The implications of our findings are substantial. The use of feature selection,
ensemble techniques, and K-Fold cross-validation collectively enhances the
predictability of diabetes symptoms. This translates into the potential for earlier
and more accurate diagnosis of diabetes, a condition with far-reaching health
consequences. Detecting diabetes at an earlier stage enables timely
interventions, improved patient outcomes, and efficient allocation of healthcare
resources.
In summary, our study demonstrates that a combination of feature selection,
ensemble techniques, and K-Fold cross-validation significantly improves the
accuracy of diabetes prediction. This advancement holds great promise for the
healthcare industry, where precision and timeliness are critical in managing
chronic diseases such as diabetes.

9. CONCLUSION

In conclusion, the diabetes prediction project showcased a systematic exploration
of various machine learning models, highlighting the iterative process of model
refinement. Initial experiments with standard machine learning models, yielding
an average accuracy of 75%, served as a baseline for further enhancement. The
introduction of ensemble learning techniques, specifically bagging and boosting,
demonstrated a marginal improvement, reflecting the power of aggregating
predictions from multiple models.

The pivotal moment in our project occurred with the implementation of a
sophisticated ensemble strategy: combining Logistic Regression and Decision
Trees. This innovative approach significantly elevated the accuracy to 80%,
surpassing individual model performances. Logistic Regression, known for its
interpretability, and Decision Trees, adept at capturing complex relationships,
complemented each other in a synergistic manner, resulting in a more robust and
accurate prediction model for diabetes.

The journey from conventional machine learning models to ensemble techniques
and, ultimately, a synergistic fusion of Logistic Regression and Decision Trees
underscored the importance of model selection and combination. The 80%
accuracy achieved in the final phase not only surpasses the initial benchmark but
also emphasizes the potential for continuous improvement in predictive modeling
through thoughtful algorithmic selection and ensemble strategies. This
comprehensive exploration enhances our understanding of diabetes prediction
and sets the stage for future advancements in model development and healthcare
applications.

10. FUTURE WORK

As we chart the course for future work, we envisage exploring additional
algorithmic combinations to enhance our research outcomes. This may involve
testing Deep Neural Networks (DNN) to further improve the
accuracy measures. Furthermore, we plan to expand our study to multiple datasets
and to combinations of multiple models, which may provide new insights and
improve accuracy. By continuously pushing the boundaries of exploration, we aim
to uncover even more effective approaches that can contribute to achieving the
highest prediction accuracy.

11. REFERENCES

[1]. M.T. García-Ordás, C. Benavides, J.A. Benítez-Andrades, H. Alaiz-Moretón,
I. García-Rodríguez, Diabetes detection using deep learning
techniques with oversampling and feature augmentation, Comput. Methods
Programs Biomed. 202 (2021) 105968.

[2]. M.M. Bukhari, B.F. Alkhamees, S. Hussain, A. Gumaei, A. Assiri, S.S.
Ullah, An improved artificial neural network model for effective diabetes
prediction, Complexity 2021 (2021).

[3]. K. Roy, M. Ahmad, K. Waqar, K. Priyaah, J. Nebhen, S.S. Alshamrani,
M.A. Raza, I. Ali, An enhanced machine learning framework for Type 2
diabetes classification using imbalanced data with missing values,
Complexity 2021 (2021).

[4]. J.J. Khanam, S.Y. Foo, A comparison of machine learning algorithms
for diabetes prediction, ICT Express (2021).

[5]. H. Naz, S. Ahuja, Deep learning approach for diabetes prediction using
PIMA Indian dataset, J. Diabetes Metab. Disord. 19 (1) (2020) 391–403.

[6]. T.M. Alam, M.A. Iqbal, Y. Ali, A. Wahab, S. Ijaz, T.I. Baig, A. Hussain,
M.A. Malik, M.M. Raza, S. Ibrar, Z. Abbas, A model for early prediction of
diabetes, Inform. Med. Unlocked 16 (2019) 100204.

[7]. Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, H. Tang, Predicting diabetes mellitus
with machine learning techniques, Front. Genet. 9 (2018) 515.
