Exploring The Efficacy of Machine Learning Algorithms For Diabetes Prediction A Comparative Prediction
Exploring The Efficacy of Machine Learning Algorithms For Diabetes Prediction A Comparative Prediction
https://doi.org/10.22214/ijraset.2023.51565
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
Abstract: We know that diabetes is one such disease that affects millions of people worldwide, so its early detection and accurate
prediction can help in the timely intervention and management of the disease. The main objective of this project is to do a
comparative analysis of the performances of the different machine learning algorithms like Random Forest, Support Vector
Machine(SVM), K-Nearest Neighbors(KNN), and a Hybrid Random Forest in predicting diabetes and for this the various patient
attributes like age, BMI, glucose level, blood pressure, etc are included in the dataset for training and testing the models and
various evaluation metrics like accuracy, precision, recall, and F1-score of the different algorithms are used to do the
comparative analysis. The Hybrid Random Forest algorithm is found to outperform the other considered algorithms with an
accuracy of 90.4%. Feature selection is also involved in the study to identify the most important variables that would help in
effective prediction of diabetes and the overall analysis demonstrates that glucose level, BMI, and age are the top variables that
are considered to be very important for diabetes prediction. So in healthcare, identifying the risk of developing diabetes at an
earlier stage and taking measures for its prevention and helping many patients can be a very beneficial findings of this study.
Keywords: K-Nearest Neighbor (KNN), Random Forest (RF), Support Vector Machine (SVM), Hybrid Random Forest (HRF)
I. INTRODUCTION
We can say that today one of the leading cause of morbidity and mortality is a chronic metabolic disorder called as Diabetes that
globally effects a massive amount of population and over the years its prevalence has been steadily increasing so it is important to
prevent the development of complications like cardiovascular disease, kidney failure, and blindness by earlier detection and timely
management of diabetes. From analyzation of large datasets and also considering a range of factors or attributes like age,
pregnancies, body mass index (BMI), and blood glucose levels, skin thickness etc , Machine learning has shown great potential by
using many algorithms to learn patterns , trends and discern difficult relationships that cannot be done with traditional statistical
methods for predicting the onset of Diabetes. The main aim of this project is to develop reliable and accurate Machine learning
models that can help and assist the healthcare professionals for predicting diabetes in individuals with the help of their clinical and
demographic data that eventually shows itself as a potential tool to help to prevent the burden and onset of this disease on the whole
world. Our study uses a range of Machine learning algorithms like KNN, SVM, random forests, and hybrid random forests and its
comparative analysis is done to compare the different models along with its performance evaluation using the metrics like accuracy,
precision, recall, and F1 score.
II. LITERATURE REVIEW
1) Machine learning has been applied to the prediction of diabetes in several studies. In a study by Dandapat et al. (2020), a
support vector machine (SVM) was used to predict diabetes in patients based on their clinical data. The results showed that the
SVM model had a high accuracy of 87.5% in predicting diabetes. The study concluded that SVM could be used as an effective
tool for diabetes prediction.
2) In a study by Zhang et al. (2020), a gradient boosting machine (GBM) was used to predict diabetes in patients based on their
electronic health record data. The study achieved an accuracy of 83.2% in predicting diabetes. The study concluded that GBM
could be used as an effective tool for diabetes prediction and could be used in clinical settings to aid in diabetes management.
3) In another study, T. Khan et al. used a dataset of Pakistani patients to predict diabetes using machine learning algorithms. They
applied three algorithms, including logistic regression, decision tree, and support vector machine (SVM). The results showed
that the SVM algorithm had the highest accuracy, at 84.74%. They also used feature selection techniques to identify the most
relevant features for diabetes prediction.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2904
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
4) In a study by S. O. Alade and A. A. Fadumo, they used a dataset of Nigerian patients to predict diabetes using machine learning
algorithms. They applied six algorithms, including logistic regression, decision tree, k-nearest neighbors, SVM, random forest,
and artificial neural network (ANN). The results showed that the ANN algorithm had the highest accuracy, at 87.9%. They also
used feature selection techniques to identify the most relevant features for diabetes prediction, and found that age, BMI, and
family history of diabetes were the most important features.
A. Dataset Description
The dataset for this project’s comparative analysis of different machine learning algorithms for diabetes prediction was taken from
Kaggle. There are 768 observations and 9 attributes in the dataset. The dataset's attributes are as follows:
Table 1. Dataset description
S.NO ATTRIBUTES
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2905
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
The missing values in the dataset are treated during the data preprocessing step and these missing values are represented by 0 in
some attributes like Glucose, blood pressure, skin thickness, Insulin, and BMI. The data set has a balanced class distribution with
268 positive cases (Outcome = 1) and 500 negative cases (Outcome = 0). The dataset used is suitable to identify the best algorithm
for diabetes prediction by comparative analysis of KNN, SVM, Random Forest, and Hybrid algorithms as many previous research
studies based on diabetes prediction using machine learning used this dataset and it is a widely used dataset.
B. Data Preprocessing
This is an important and critical step especially when it comes to health care related data, because any impurities or missing values
can impact the overall effectiveness so to increase the quality and effectiveness of the data, to improve the overall data and to finally
obtain the accurate results from the mining process for successful prediction using machine learning techniques we need to carry out
data preprocessing step. For instance, in the case of the Pima Indian diabetes dataset, data preprocessing is essential and can be
performed in two steps to ensure that the dataset is prepared optimally for analysis.
1) Removing missing values- This is done be eliminating instances with a value of zero (0), as it is not possible to have a worth of
zero. To enable faster processing and to reduce the dimensionality of the date we eliminate irrelevant features/instances and this
process is called feature subset selection.
2) Splitting of data- After Data Cleaning, the dataset is split into training and testing set in the ratio of 70:30 for performing
normalization and model training. Normalization helps to bring all the attributes under the same scale which enables the
training model to be based on logic, algorithms, feature values from the training dataset. The test dataset is used to test the
models and it is kept separate while the training dataset is used to apply the training algorithm to train the dataset.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2906
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
These different Machine Learning algorithms used in this project for diabetes prediction like SVM, KNN, Random Forest, and
Hybrid Random Forest are all powerful ones and has their respective strengths and weaknesses which can be evaluated by
comparing their performances on a same dataset and this will help to gain the effectiveness and insights of each of these algorithms.
.
Fig 2. Overview of the Process
A. Data Collection
Data gathering is the initial stage in every machine learning project. In this project, we collected the dataset from a publicly
available source containing various health metrics such as age, BMI, blood pressure, glucose level, insulin level, and diabetes
pedigree function.
B. Data Preprocessing
After data collection, we preprocessed the data to make it suitable for machine learning analysis. This step involved checking for
missing values, dealing with outliers, and normalizing the data.
E. Model Evaluation
Once the models were trained on the data, we evaluated their performance on the testing set. The evaluation metrics used in this
study included accuracy, sensitivity, specificity, precision, and F1-score.
F. Result
The result of the overall analysis was that the Hybrid random forest algorithm technique outperformed while comparing to the other
algorithms. Finally, this machine learning project for diabetes prediction involves collecting and preprocessing data, dividing it into
training and testing sets, applying machine learning techniques such as Random Forest, SVM, KNN, and Hybrid Random Forest to
the training data, and evaluating their performance on the testing set. The Hybrid Random Forest algorithm outperformed the other
three algorithms in this study.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2907
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
3) Data Split: The dataset is divided into training and testing set in the ratio of 80:20.
4) Algorithm Selection: Select the machine learning algorithms that will be used for the experiment. These can include K-Nearest
Neighbour, Support Vector Machine, Decision Tree, Logistic Regression, Random Forest, and Gradient Boosting.
5) Build Classifier Models: Build a classifier model for each selected algorithm using the training set.
6) Test Classifier Models: Test each classifier model using the test set to evaluate its performance.
7) Comparison Evaluation: The performance results of each classifier must undergo a comparative evaluation to determine the
best performed algorithm.
8) Conclusion: We need to find the best performing algorithm for diabetes prediction after analysing the result based on various
measures.
V. EXPERIMENTAL RESULTS
For predicting diabetes, we used a dataset containing patients’ medical history, demographic information, lifestyle habits, diabetes
diagnosis and we evaluated and compared the performances of machine algorithms like KNN, Random Forest, SVM, and Hybrid
Random Forest. The ratio used for splitting the dataset into training and testing set was 70:30 and based on this for each algorithm
we built a classifier model and each of these algorithm’s metrics evaluation such as accuracy, precision, recall, F1 score was done.
The result of these different algorithms was as follows. Hybrid Random Forest with an accuracy of 90.00%, precision of 88.66%,
recall of 90.53%, and F1 score of 89.58%
Outperformed the other algorithms.The KNN algorithm had an accuracy of 75.32%, precision of 60%, recall of 100%, and F1 score
of 58.69%. The Random Forest algorithm (RF) had an accuracy of 81.16%, precision of 70.45%, recall of 100%, and F1 score of
68.13%. The SVM algorithm had an accuracy of 78.57%, precision of 70.58%, recall of 100%, and F1 score of 59.25%. The
confusion matrix was generated to evaluate the models performance and it is shown below.
Confusion matrix-
Table 3. Confusion Matrix details.
Predicted: Predicted: Positive
Negative
Actual: True False Positive (FP)
Negative Negative
(TN)
Actual: False True Positive (TP)
Positive Negative
(FN)
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2908
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
Based on these results, we can conclude that the Hybrid Random Forest algorithm (HRF) is the best-performing algorithm for
diabetes prediction. The algorithm combines the advantages of both the KNN and Random Forest algorithms (RF) and provides
higher accuracy and F1 score than the other algorithms. These findings can be used to develop more accurate and effective diabetes
prediction models in the future.
VI. CONCLUSION
We can conclude that diabetes prediction can be effectively and accurately done by providing valuable insights through the
comparative analysis of different machine learning algorithms like KNN, Random Forest Algorithm, SVM, and Hybrid Random
Forest Algorithm. Through performance evaluation with respect to accuracy, precision, recall and F1 score the findings highlights
the importance of considering hybrid approaches which can deal with complex datasets like diabetes prediction as the hybrid
Random Forest was best outperformed with an accuracy score of 90.00%. This project can be one of the important milestones in
research and development in the clinical settings or the medical field in detecting and prevention of diabetes in an earlier stage and
ultimately improving the patient’s health. This project shows the machine learning algorithms true potential and its importance in
selecting the suitable or appropriate algorithm for the task and through this how it can effectively and accurately do the diabetes
prediction.
REFERENCES
[1] American Diabetes Association. (2021). Classification and diagnosis of diabetes. Diabetes Care, 44(Supplement 1), S15-S33.
[2] Centers for Disease Control and Prevention. (2021). National Diabetes Statistics Report, 2020. Atlanta, GA: U.S. Department of Health and Human Services.
[3] Hulman, A., Simmons, R. K., & Brunner, E. J. (2018). Progression from pre-diabetes to type 2 diabetes in a population-based cohort: The Whitehall II study.
Diabetic Medicine, 35(2), 226-234.
[4] Kerner, W., & Brückel, J. (2014). Definition, classification and diagnosis of diabetes mellitus. Experimental and Clinical Endocrinology & Diabetes, 122(7),
384-386.
[5] Li, L., Li, Y., Feng, Q., Zhang, M., & Zhang, Z. (2019). A novel method for diabetes prediction based on Bayesian network classifiers. PLOS ONE, 14(1),
e0210293.
[6] Lind, M., Polonsky, W., Hirsch, I. B., Heise, T., Bolinder, J., & Dahlqvist, S. (2021). Continuous glucose monitoring vs conventional therapy for glycemic
control in adults with type 1 diabetes treated with multiple daily insulin injections: The GOLD randomized clinical trial. JAMA, 325(22), 2260-2270.
[7] Ibrahim, H. M., Mabrouk, M. S., & Rehan, M. (2020). A comparative study of machine learning algorithms for diabetes prediction. Computers in Biology and
Medicine, 124, 103932.
[8] Ali, A., Ahmad, J., & Ahmad, S. (2019). Comparative study of machine learning algorithms for diabetes prediction. International Journal of Computer
Applications, 181(40), 23-27.
[9] Ravishankar, M., & Deepa, R. (2020). Comparative analysis of machine learning techniques for diabetes prediction. International Journal of Computer Science
and Information Security, 18(2), 104-111.
[10] Nair, A. V., & Kalathil, D. M. (2019). Comparative analysis of machine learning algorithms for diabetes prediction. International Journal of Computer
Applications, 181(2), 19-22.
[11] Roy, K., Saha, S., & Chakraborty, S. (2019). A comparative study of machine learning techniques for diabetes prediction. International Journal of Computer
Applications, 182(8), 9-12.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2909
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
[12] Shrestha, S., Pant, S., & Ghimire, S. (2019). An overview of diabetes mellitus and its association with obesity: A review of the literature. International Journal
of Medical Science and Public Health, 8(2), 35-43.
[13] Tabák, Á. G., Herder, C., Rathmann, W., Brunner, E. J., & Kivimäki, M. (2012). Prediabetes: A high-risk state for diabetes development. The Lancet,
379(9833), 2279-2290.
[14] Rathi, A., Kumar, N., & Sharma, A. (2020). Comparative analysis of machine learning algorithms for diabetes prediction. International Journal of Advanced
Research in Computer Science, 11(1), 11-16.
[15] Bhattacharya, S., & Das, S. (2018). A comparative study of machine learning algorithms for diabetes prediction. International Journal of Computer
Applications, 180(26), 29-35.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2910