189 Submission
Abstract— Diabetes mellitus, generally called diabetes, is a chronic metabolic illness marked by elevated blood glucose levels caused by either impaired insulin action or inadequate insulin production. The International Diabetes Federation (IDF) estimates that over 537 million people worldwide have diabetes and forecasts a 46% increase, predicting that cases will rise to 783 million by 2045. There are two forms of diabetes unrelated to pregnancy: types 1 and 2. Type 1 diabetes is an autoimmune condition that harms or interferes with the pancreas' insulin-producing cells, necessitating lifelong insulin therapy. The more common type, type 2 diabetes, arises when cells become resistant to insulin, often influenced by diet, genetic factors, and obesity. Furthermore, gestational diabetes develops during pregnancy, and it can raise the risk of type 2 diabetes later on even though it usually goes away after delivery. If left untreated, diabetes can cause serious complications, such as blindness, kidney failure, heart attacks, strokes, and additional medical conditions. In this work, a diabetes prediction model is developed and evaluated. K-Nearest Neighbors, Logistic Regression, Random Forest, XGBoost, LightGBM, Support Vector Machine, and Decision Tree are among the machine learning techniques used.
Keywords— XGBoost, LightGBM, Random Forest, Decision Trees, K-Nearest Neighbors, Machine Learning, Diabetes, Logistic Regression, Accuracy, Variables, Dataset, Feature Engineering, Outliers, Data preprocessing, Exploratory data analysis, Precision, AUC, F1 score, Cross-validation.

I. INTRODUCTION
Diabetes mellitus, a chronic metabolic disease, affects millions of individuals globally, and the number continues to grow. High blood sugar levels result when the body either produces insufficient insulin or uses it inefficiently [2]. If diabetes mellitus is not detected or treated appropriately, serious health problems such as nerve damage, kidney failure, eye impairment, and cardiovascular disorders, as well as increased urination [3], can result. The metabolic condition worsens with time and impacts a patient's physical and mental health. No treatment method can prevent the disease from progressing or result in remarkable improvements [4]. Diabetes can result from several factors, such as obesity, sedentary lifestyles, high blood pressure, or abnormal cholesterol levels [5]. India has a high occurrence of diabetes, which is associated with low BMI together with elevated insulin resistance, body fat percentage, or upper-body obesity. The main causes of this condition are rapid urbanization and economic development [6][7]. Patients affected with diabetes have “sweet urine” [8], unlike regular urine, which is sugar-free. Excess glucose that the body is unable to metabolize adequately accumulates in the bloodstream, and sugar (in the form of glucose) then appears in the urine.
One of the biggest challenges facing medical professionals is the early detection and accurate diagnosis of diabetes. This study evaluates several ML algorithms for the early diagnosis of diabetes. Much research has been done to gain preliminary knowledge about this disease and to predict whether an individual is at risk of developing it during their lifespan. Most research works use the open-source Pima Indians dataset (PID).
The remainder of the paper is organized as follows: Section II describes diabetes and its types, Section III reviews the literature, Section IV introduces machine learning, Section V presents the methodology, and Section VI gives the conclusion and future work.

II. DIABETES AND ITS TYPES
Diabetes mellitus, a metabolic disorder, affects the body's ability to process blood sugar (glucose). It is classified into different types based on its cause and its effect on insulin production or usage in the body. Diabetes may harm blood vessels or nerves in the heart, kidneys, eyes, and lower limbs. Mouth issues like gum disease or tooth decay may also occur. There are three types of the disease: type 1 diabetes, type 2 diabetes, and gestational diabetes [10]. Each of them has distinct characteristics, causes, and symptoms.

A. Type 1 Diabetes (T1D)
T1D is an autoimmune illness that occurs when the immune system mistakenly attacks the insulin-producing cells in the pancreas. As a result, little to no insulin is produced, and insulin therapy is required for the remainder of one's life. People of any age can be affected by this disorder, but it is most often found in young children and teenagers.
Symptoms of this type are:
• Excessive thirst
• Frequent urination
• Slow healing of wounds
• Fatigue and weakness
• Blurred vision
• Extreme hunger
• Unexplained weight loss
• Increased susceptibility to infections

B. Type 2 Diabetes (T2D)
T2D is mostly caused by insulin resistance or inadequate insulin production, which prevents blood glucose levels from staying within normal ranges. This type is the most common. Numerous aspects of lifestyle, including being overweight, eating badly, and not exercising, contribute to its occurrence. Over 95% of people with diabetes worldwide have T2D. Most women are not aware of any symptoms or indicators of this type. It typically occurs in adults but is increasingly seen in younger individuals.
Symptoms of this type are:
• Frequent urination and increased thirst
• Blurred vision
• Fatigue and low energy levels
• Slow healing of wounds and cuts
• Tingling or numbness in hands and feet
• Dark patches of skin, particularly around the neck and armpits (a sign of insulin resistance)

C. Gestational Diabetes
Hormonal changes during pregnancy can cause insulin resistance, which brings on gestational diabetes. It increases a woman's chance of getting T2D in later life, even though it usually disappears after giving birth.
Symptoms of this type are:
• Blurred vision
• Fatigue
• Frequent urination
• Increased thirst

Early detection of diabetes symptoms is crucial so that the disease can be diagnosed and treated promptly. Regular checkups and medications can improve an individual's life.

III. LITERATURE REVIEW
Most machine learning studies have been done on the earliest datasets available; Smith et al. [9] created the Pima Indians Diabetes Dataset (PIDD) in 1988.
Since then, researchers have employed several supervised learning strategies, such as SVM, RF, ANNs, and Decision Trees, to achieve higher prediction accuracy on this dataset [13].
Kavakiotis et al. [11] reviewed ML applications in diabetes research and found that feature selection techniques improved model accuracy.
Hasan et al. [12] also confirmed that combining feature selection with machine learning algorithms produced better results.
Chawla et al. [14] addressed data imbalance, which leads to biased results. This study used the Synthetic Minority Over-sampling Technique (SMOTE).
Xue [15] experimented on 520 patients between the ages of 16 and 90 using data from the UCI Machine Learning Repository. SVM, Naïve Bayes, and LightGBM were employed to make predictions. With an accuracy of 96.54%, SVM outperformed the other models.
Le [16] explored the Classification & Regression Tree (CART) algorithm for prediction. The study examined class imbalance in datasets with binary outcomes and suggested removing it during data preprocessing.
Birjais [17] worked with the Pima Indians Diabetes (PID) dataset from the UCI repository, which includes 768 samples and 8 features. The study used the dataset to test naive Bayes, logistic regression, and gradient boosting classifiers; naive Bayes obtained 77% accuracy, logistic regression 79%, and gradient boosting 86%.
Sadhu and Jadli [18] used 520 instances and 16 features from the UCI repository. Among the classifiers tested, naive Bayes (91%), logistic regression (93%), support vector machines (94%), and decision trees (94%) followed the best-performing model.
Shafi [19] applied decision tree, SVM, and naive Bayes classifiers to the PID dataset. The maximum accuracy was 74% for naive Bayes, 72% for decision trees, and 63% for SVM.
Sisodia [20] used the PID dataset to test naive Bayes, SVM, and decision tree classifiers; naive Bayes produced the best accuracy, 76.30%.
Agrawal [21] analysed classifier performance on the PID dataset containing 738 patient records. The study tested the naive Bayes, SVM, k-NN, ID3, C4.5, and CART models. SVM and linear discriminant analysis (LDA) achieved a maximum accuracy of 88%.
Rathore's [22] study focused on women's health. SVM and decision tree models were used to make predictions on the PID dataset. The SVM model had an 82% accuracy rate.
Kumari and Chitra [23] examined MLP, logistic regression, decision tree, RF, and SVM classifiers using k-fold cross-validation. Their results showed that MLP with four-fold cross-validation performed best and achieved the highest accuracy at 78.7%.
Rawat [25] tested AdaBoost, bagging, naive Bayes, LogitBoost, and RobustBoost. Bagging achieved a maximum accuracy of 81.77%, followed by AdaBoost, which achieved an accuracy of 79.69%.
Perveen [26] applied AdaBoost, bagging, and J48 classifiers to the “Canadian Primary Care Sentinel Surveillance Network dataset.” AdaBoost performed the best.
Saravananathan and Velmurugan [1] compared J48, CART, SVM, and k-NN classifiers according to their error rate, sensitivity, accuracy, specificity, and precision. According to their findings, J48 made the most accurate predictions (67.15%), followed by SVM (65.04%), CART (62.28%), and k-NN (53.39%).
Mujumdar and Vaidehi [27] created a model incorporating more diabetes risk factors. They compared different machine
learning models. Logistic Regression achieved 96% accuracy, while AdaBoost had the highest performance at 98.8%.
Mercaldo [28] built a classification model based on WHO-defined criteria for diabetes prediction, testing six classification techniques. Using the Pima Indians dataset from Phoenix, Arizona, the Hoeffding Tree approach produced a precision of 0.770 and a recall of 0.775.
Nai-Arun and Moungmai [29] created a web application employing disease classification models based on real-world data from 30,122 patients. This study assessed 13 classification methods, including NN, NB, LR, DT, RF, and ensemble methods. The random forest classifier had the highest ROC score and accuracy.

IV. MACHINE LEARNING
Machine learning (ML), a subfield of AI, enables computer systems to predict or judge based on patterns discovered in data without explicit programming. ML's primary goal is to create a model that can generalize from past experience to forecast or decide accurately based on fresh information. This technology is essential to various sectors, as it optimizes and improves decision-making.
Machine learning is of the following types.
A. Supervised Learning
Labeled data acts as the model's training resource. This dataset contains values for both input and output. The algorithm learns the connection between the input and output values in the dataset and uses it to provide predictions.
Examples of this type are:
• Support Vector Machines
• Logistic Regression
• Random Forest
• Linear Regression
• Decision Trees
• K-Nearest Neighbors
B. Unsupervised Learning
An unlabeled dataset is used to train this kind of algorithm. This method looks for hidden underlying patterns and structures. It is commonly used in anomaly detection and similar tasks.
Examples of this type are:
• Autoencoders
• K-Means Clustering
• Principal Component Analysis
• Hierarchical Clustering
C. Semi-Supervised Learning
This method combines supervised and unsupervised learning: it learns from a small amount of labeled data and uses what it has learned to find patterns in fresh unlabeled data. This learning type is used when labeling data is costly or time-consuming.
Examples of this type are:
• Graph-based learning algorithms
• Self-training models

D. Reinforcement Learning
This involves decision-making, interaction with the surroundings, and learning through feedback. The agent is rewarded or penalized based on the decision made, and it improves its decision-making skills over time.
Some components of this learning are:
• Agent - The system that makes decisions.
• Environment - The working area.
• Rewards - The feedback from actions.
• Actions - The choices of the agent.

Examples of reinforcement learning algorithms are:
• Deep Q-Networks (DQN)
• Double DQN
• Q-Learning

V. METHODOLOGY
Figure 1 shows the research process [24]. First, the dataset was gathered and preprocessed to eliminate any inconsistencies. This included fixing class imbalance problems and resolving missing values by substituting the mean. Holdout validation was then used to split the dataset into training and testing sets in an 80%:20% ratio. The optimal model for this dataset was then determined by applying several classification techniques. The proposed mobile and web application framework was updated to incorporate the top-performing prediction model.
Figure 1. Working procedure for the model development.
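The two preprocessing steps named in the methodology, mean substitution for missing values and an 80%:20% holdout split, can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code: the function names `impute_with_mean` and `holdout_split`, the toy two-column rows, and the random seed are all assumptions made for the example.

```python
import random
from statistics import mean


def impute_with_mean(rows):
    """Replace None entries in each column with the mean of that column's observed values."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed))
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def holdout_split(rows, labels, test_fraction=0.2, seed=42):
    """Shuffle the sample indices once and hold out test_fraction of them for testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Hypothetical toy rows (two made-up numeric features), with None marking a missing value.
rows = [[148, 72.0], [85, None], [183, 64.0], [None, 66.0], [137, 40.0],
        [116, 74.0], [78, 50.0], [115, None], [197, 70.0], [125, 96.0]]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

filled = impute_with_mean(rows)
X_train, X_test, y_train, y_test = holdout_split(filled, labels)
print(len(X_train), len(X_test))
```

The sketch imputes before splitting, in the order the text describes. A common variation is to compute the column means on the training portion only, so that no information from the test set reaches the model.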
A. Dataset
The PIDD served as the source of the dataset [9]. Of the 768 cases in the dataset, 500 do not have diabetes, and 268 do.
• Data Visualization: Plotting each feature to understand its distribution and the relationships between the variables in the dataset.
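The class counts reported for the dataset (500 non-diabetic and 268 diabetic cases out of 768) imply a moderate class imbalance. The short calculation below, a sketch using only those three numbers, shows the positive-class share and the accuracy that a trivial classifier would reach by always predicting the majority class, which is a useful reference point when reading accuracy figures.

```python
# Class counts as reported for the dataset.
total, negative, positive = 768, 500, 268
assert negative + positive == total

positive_share = positive / total                         # fraction of diabetic cases
majority_baseline_accuracy = max(negative, positive) / total  # always predict "no diabetes"
imbalance_ratio = negative / positive                     # majority-to-minority ratio

print(f"positive share: {positive_share:.1%}")                    # 34.9%
print(f"majority-class baseline: {majority_baseline_accuracy:.1%}")  # 65.1%
print(f"imbalance ratio: {imbalance_ratio:.2f}")                  # 1.87
```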
E. Hyperparameter Tuning
GridSearchCV, a class in the scikit-learn library, performs an
exhaustive, cross-validated search over a specified grid of
hyperparameter values. It was used to optimize hyperparameters
for models like Random Forest, LightGBM, and XGBoost,
enhancing their performance.
Some examples of parameters tuned for the algorithms
are the number of boosting stages, the minimum number of
samples required to split an internal node, and the maximum
tree depth. Step size shrinkage (the learning rate), used to
prevent overfitting, was also tuned.
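A grid search over the kinds of parameters listed above might look like the following sketch. It makes several assumptions that are not taken from the paper: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost and LightGBM, the data is synthetic (generated with `make_classification`, with 8 features to mirror the PID dataset), and the grid values, the 3-fold setting, and the accuracy scoring are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real study used the PID dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grid: the values are assumptions, not the paper's settings.
param_grid = {
    "n_estimators": [25, 50],       # number of boosting stages
    "max_depth": [2, 3],            # maximum tree depth
    "min_samples_split": [2, 10],   # minimum samples to split an internal node
    "learning_rate": [0.05, 0.1],   # step size shrinkage
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation on the training set only
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # refit best model scored on held-out data
print(best_params, round(test_accuracy, 3))
```

The search only ever sees the training split, so the held-out 20% remains an unbiased check of the selected configuration.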
TABLE II. HYPERPARAMETER TUNING PROCESS.
F. Model Evaluation
After tuning, the models were evaluated again to
determine their performance. Tuning improved model
performance, with XGBoost achieving the
highest scores.
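For reference, the evaluation quantities named in the keywords (accuracy, precision, F1 score, cross-validation) can be computed as in the plain-Python sketch below. The function names and the toy label vectors are assumptions made for illustration; AUC is not covered here because it requires predicted scores and not hard labels.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = diabetic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1, guarding against division by zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
folds = list(k_fold_indices(len(y_true), 5))
print(metrics)
```

In a cross-validated evaluation, a model would be trained on each `train_idx` and scored on the matching `test_idx`, and the per-fold metrics would then be averaged.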
TABLE III. CROSS-VALIDATION SCORES OF THE ALGORITHMS AFTER
TUNING.
The dataset is split into training and testing sets in an 80:20 ratio. The
models are trained and evaluated using ML methods like LightGBM,
Logistic Regression, Classification and Regression Tree (CART), Random Forest,
VI. CONCLUSION AND FUTURE WORK
Diabetes is among the most serious health problems worldwide at present. This illness affects people of any age. Early prediction and diagnosis are critical since they can lower long-term risk and complications from other diseases.
This research shows that XGBoost made predictions with an accuracy of 90%.
Furthermore, the study can be expanded to develop a web application. The model can be trained using the XGBoost algorithm and embedded in the web application, which can display whether a person is prone to diabetes or not.
Another scope of expansion is to apply the same approach to making accurate predictions for other diseases and medical issues.

ACKNOWLEDGMENT
The authors greatly appreciate Dr. Ashok K. Chauhan, the founder and president of Amity Universe. He is renowned for his intense passion for advancing Amity Universe research and has always inspired us to reach new heights. I want to express my sincere gratitude to Dr. Laxmi Ahuja for her kind support, valuable information, and guidance.

REFERENCES
[1] Saravananathan K, Velmurugan T. Analyzing diabetic data using classification algorithms in data mining. Indian J Sci Technol. 2016;9:1–6.
[2] Kharroubi, A.T., Darwish, H.M.: Diabetes mellitus: The century's epidemic. World J. Diabetes 6, 850–867 (2015)
[3] Papatheodorou, K., Banach, M., Edmonds, M., Papanas, N., Papazoglou, D.: Complications of diabetes. J. Diabetes Res. 2015, 1–6 (2015)
[4] Report of the expert committee on the diagnosis and classification of diabetes mellitus. Diabetes Care. 1997;20:1183–97. doi: 10.2337/diacare.20.7.1183.
[5] Wu, Y., Ding, Y., Tanaka, Y., Zhang, W.: Risk factors contributing to type 2 diabetes and recent advances in the treatment and prevention. Int. J. Med. Sci. 11, 1185–1200 (2014)
[6] Shaw JE, Sicree RA, Zimmet PZ. Global estimates of the prevalence of diabetes for 2010 and 2030. Diabetes Res Clin Pract. 2010;87:4–14. doi: 10.1016/j.diabres.2009.10.007.
[7] Anjana RM, Pradeepa R, Deepa M, Datta M, Sudha V, Unnikrishnan R, et al. Prevalence of diabetes and prediabetes (impaired fasting glucose and/or impaired glucose tolerance) in urban and rural India: Phase I results of the Indian Council of Medical Research India Diabetes (ICMR-INDIAB) study. Diabetologia. 2011;54:3022–7. doi: 10.1007/s00125-011-2291-5.
[8] Wagai GA, Romshoo GJ. Adiposity contributes to poor glycemic control in people with diabetes mellitus, a randomized case study, in South Kashmir, India. J Family Med Prim Care. 2020:4623–6. doi: 10.4103/jfmpc.jfmpc_1148_19.
[9] Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S.: Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Annual Symposium on Computer Applications in Medical Care, pp. 261–265.
[10] AACE/ACE Position Statement on the Prevention, Diagnosis, and Treatment of Obesity (1998 Revision). Endocr Pract. 1998;4:297–330.
[11] Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., & Chouvarda, I. (2017). Machine learning and data mining methods in diabetes research. Computational and Structural Biotechnology Journal, 15, 104-116.
[12] Hasan, M. K., Alam, M. A., Das, D., Hossain, E., & Hasan, M. (2020). Diabetes prediction using feature selection and ensemble learning. International Journal of Intelligent Systems, 35(2), 239-265.
[13] Dua, D., & Graff, C. (2019). UCI Machine Learning Repository: Pima Indians Diabetes Dataset. University of California, Irvine.
[14] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
[15] Xue J, Min F, Ma F. Research on diabetes prediction method based on machine learning. J Phys Conf Ser. 2020;1684:1–6.
[16] Le TM, Vo TM, Pham TN, Dao SV. A novel wrapper–based feature selection for early diabetes prediction is enhanced with a metaheuristic. IEEE Access. 2020;9:7869–84.
[17] Birjais R, Mourya AK, Chauhan R, Kaur H. Prediction and diagnosis of future diabetes risk: A machine learning approach. SN Appl Sci. 2019;1:1–8.
[18] Sadhu A, Jadli A. Early-stage diabetes risk prediction: A comparative analysis of classification algorithms. Int Adv Res J Sci Eng Technol (IARJSET). 2021;8:193–201.
[19] Shafi S, Ansari GA. Early prediction of diabetes disease & classification of algorithms using machine learning approach. In Proceedings of the International Conference on Smart Data Intelligence (ICSMDI 2021). Available from: SSRN 3852590 (2021)
[20] Sisodia D, Sisodia DS. Prediction of diabetes using classification algorithms. Procedia Comput Sci. 2018;132:1578–85.
[21] Agrawal P, Dewangan AK. A brief survey on the techniques used for the diagnosis of diabetes-mellitus. Int Res J Eng Tech IRJET. 2015;2:1039–43.
[22] Rathore A, Chauhan S, Gujral S. Detecting and predicting diabetes using supervised learning: An approach towards better healthcare for women. Int J Adv Res Comput Sci. 2017;8:1192–4.
[23] Kumari VA, Chitra R. Classification of diabetes disease using support vector machine. Int J Eng Res Appl. 2013;3:1797–801.
[24] Tasin, I., Nabil, T.U., Islam, S., Khan, R.: Diabetes prediction using machine learning and explainable AI techniques. Healthc. Technol. Lett. 10, 1–10 (2023). 10.1049/htl2.12039
[25] Rawat V, Suryakant S. A classification system for diabetic patients with machine learning techniques. Int J Math Eng Manag Sci. 2019;4:729–44.
[26] Perveen S, Shahbaz M, Guergachi A, Keshavjee K. Performance analysis of data mining classification techniques to predict diabetes. Procedia Comput Sci. 2016;82:115–21.
[27] Mujumdar A, Vaidehi V. Diabetes prediction using machine learning algorithms. Procedia Comput Sci. 2019;165:292–9.
[28] Diabetes mellitus affected patients' classification and diagnosis through machine learning techniques. Procedia Comput Sci. 2017;112:2519–28.
[29] Nai-Arun N, Moungmai R. Comparison of classifiers for the risk of diabetes prediction. Procedia Comput Sci. 2015;69:132–42.