Thesis
Clement Okolo
University of Louisiana at Lafayette
All content following this page was uploaded by Clement Okolo on 08 September 2023.
APPROVED:
Ashok Kumar
School of Computing and Informatics
Li Chen
School of Computing and Informatics
Mary Farmer-Kaiser
Dean of the Graduate School
© Clement Tochukwu Okolo
2022
Type 2 diabetes mellitus accounts for most of the diabetes cases in the world. This type of
diabetes occurs when the body does not produce enough insulin or does not respond normally to
it, causing the blood glucose level to rise and often leading to other health conditions such as
heart disease and kidney disease. The aim of this research is to assist medical professionals in
the detection and efficient diagnosis of type 2 diabetes. We applied several supervised machine
learning techniques to develop a machine learning model that predicts diabetes with a low error
rate based on eight predictors from the Pima Indian diabetes dataset. We outline
the methodology, implementation steps, and related work in the field. The four popular machine
learning algorithms used in this study are logistic regression, support vector machine, decision
tree, and random forest. SVM performed best with 77.27% accuracy, 75.61% precision, 51.38%
recall, and 82.47% roc_auc. Our model showed an increase in accuracy when compared with the
ANN model developed from the same dataset in a previous study. With this work, we intend to
improve the process of diagnosing Type 2 diabetes with machine learning and encourage further
To almighty God and my lovely parents who sacrificed and supported me immensely so I could
get an education.
Acknowledgement
First, I would like to thank God almighty for guiding me through all challenges and
giving me the privilege of completing my degree. You will continue to take the wheel in life and
lead me to greater heights. In addition, I would love to express my gratitude to my parents and
siblings for all their prayers, sacrifices, and motivation, which have continued to sustain me
immensely.
I would also like to give thanks to my thesis supervisor, Dr. Michael Totaro, for his
guidance and support throughout the entire process of completing this thesis. Additionally, I
would like to thank my thesis committee members, Dr. Henry Chu, Dr. Ashok Kumar, and Dr. Li
Finally, I would like to thank my classmates and colleagues in the School of Computing
and Informatics at the University of Louisiana at Lafayette for their academic interactions and
Table of Contents
Abstract
Acknowledgement
List of Figures
List of Abbreviations
Chapter 1. Introduction
  1.1 Classification of Diabetes Mellitus
    1.1.1 Type 1 Diabetes Mellitus
    1.1.2 Type 2 Diabetes Mellitus
    1.1.3 Gestational Diabetes Mellitus
  1.2 Research Questions
  1.3 Motivation
  1.4 Scope
  1.5 Contributions
Chapter 4. Results
  4.1 Feature Selection
  4.2 Performance of Machine Learning Algorithms
Chapter 5. Conclusion
Chapter 6. Limitations
Biographical Sketch
List of Tables
List of Figures
Figure 1: Number of subjects with type 1 diabetes in children (0-14 years), with diabetes in
adults (20-79 years) and with hyperglycemia (type 2 or gestational diabetes) in
pregnancy (20-49 years)
List of Abbreviations
BPA Bisphenol-A
DM Diabetes Mellitus
LR Logistic Regression
PCA Principal Component Analysis
RF Random Forest
Chapter 1. Introduction
known as high blood glucose, resulting from defects in insulin secretion, insulin action, or both.
The long-term effects of chronic hyperglycemia from diabetes include damage, dysfunction, and
failure of different organs, particularly the eyes, kidneys, nerves, heart, and blood vessels (ADA
2010). According to the Centers for Disease Control and Prevention, there are three types of
diabetes: type 1, type 2, and gestational diabetes (CDC 2022). Diabetes mellitus (DM),
commonly known as diabetes, is a group of diseases that are defined by chronic high blood
glucose levels due to abnormalities in insulin secretion, insulin action, or both (ADA 2010).
Insulin is a peptide hormone that helps in glucose homeostasis and is produced in large
concentration by the β cells of the pancreatic islets of Langerhans and in low concentration by
some neurons of the central nervous system (Rahman 2021). The amount of glucose in the
bloodstream controls the biosynthesis and secretion of the hormone insulin. Insulin is synthesized
when glucose levels are between 2 mM and 4 mM, and it is secreted when glucose levels rise
above 5 mM (Alarcon 1993). When insulin is not secreted, blood glucose remains high. A high
glucose concentration in the blood is a condition called hyperglycemia (Vasiljevic 2020).
Once secreted, insulin circulates in the body and reaches hepatocytes (liver cells), skeletal
muscle cells, and adipocytes, where it promotes glucose uptake and thereby reduces the blood
glucose concentration. If insulin is secreted but the target cells do not take up excess glucose
from the bloodstream, the glucose level remains high, which also leads to hyperglycemia.
Diabetes Mellitus occurs when there is hyperglycemia for a long period of time (Accili 2018).
According to Accili, DM can cause major health complications such as damage to the
nervous system and dysfunction of the eyes and kidneys. The type of diabetes and how long a
patient has been diabetic determine how severe the symptoms will be and the long-term
Diabetes mellitus can be classified into three categories namely Type 1 diabetes mellitus,
because the pancreatic β cells produce little or no insulin. The body attacks itself unintentionally,
thereby destroying the insulin-producing cells of the pancreas. The autoantibodies responsible
for the destruction include islet cell autoantibodies, autoantibodies to insulin (IAA), glutamic
acid decarboxylase (GAD, GAD65), protein tyrosine phosphatase (IA2 and IA2β) and zinc
transporter protein (ZnT8A) (Vermeulen 2011). It has been estimated that 5-10% of people with
diabetes worldwide have T1DM. In the United States, approximately 1.24 million diabetic
patients have type 1 diabetes, and this number is projected to reach 5 million by 2050. T1DM
is one of the most common chronic diseases in children. It can develop at any age but is most
often seen in children, teens, and young adults (ADA 2010, CDC 2022). The destruction of the
pancreatic β cells can happen over months
or years before symptoms are noticed. According to the International Diabetes Federation, some
of the symptoms of T1DM are polydipsia, polyuria, enuresis, lack of energy, extreme tiredness,
polyphagia, sudden weight loss, slow-healing wounds, recurrent infections and blurred vision
with severe dehydration and diabetic ketoacidosis (IDF 2013, Kharroubi 2015). Patients who are
diagnosed with T1DM require insulin replacement for the rest of their lives (Lucier 2022).
Figure 1. Number of subjects with type 1 diabetes in children (0-14 years), with diabetes in
adults (20-79 years) and with hyperglycemia (type 2 or gestational diabetes) in
pregnancy (20-49 years). Data extracted from International Diabetes Federation
Diabetes Atlas, 6th ed, 2013.
1.1.2 Type 2 Diabetes Mellitus
(Maitra 2005). In T2DM, the pancreatic β cells produce insulin but the body's cells cannot
use it adequately for glucose homeostasis. The pancreatic cells therefore try to get the body to
respond normally by secreting more and more insulin. When they can no longer keep up, the
concentration of blood glucose rises, causing hyperglycemia and type 2 diabetes (CDC 2021). T2DM is the
most prevalent form of DM. Over 37 million people in the United States have DM, and
approximately 90-95% of them have T2DM. Although more children, teens, and young adults are
developing type 2 diabetes, it most commonly develops in people over the age of 45
(CDC 2021, Olokoba 2012). Both lifestyle and genetic risk factors are associated with T2DM
(Olokoba 2012). Lifestyle risk factors include physical inactivity, smoking, and alcohol
consumption (HU 2021). According to the Centers for Disease Control and Prevention, obesity is
also a risk factor in an estimated 55% of T2DM cases. Environmental toxins such as
bisphenol-A (BPA) may contribute to the recent increase in T2DM cases, as research suggests a
weak positive correlation between the urinary BPA concentration and T2DM. BPA is mainly used
in the production of plastics and epoxy resins, which are found in polycarbonate baby feeding
bottles and epoxy food-can linings (Lang 2008, Dekant 2008). Genes that have been found to be
associated with T2DM include TCF7L2, PPARG, FTO, KCNJ11, NOTCH2, WFS1, CDKAL1, IGF2BP2,
SLC30A8, JAZF1, and HHEX (McCarthy 2010). Some medical conditions have also been found to be
risk factors associated with
Aging, high-fat diets, and inactivity are also among the risk factors for diabetes (Alberti
About one-third of adults who have high HbA1c values are not clinically diagnosed
with T2DM within 1 year (Gopalan 2018), and most people with T2DM are not diagnosed for 4–
7 years after hyperglycemia first appears (Porta 2014, Harris 1992). According to researchers,
one-fourth of patients who are diagnosed with T2DM already have diabetes-related
2003, Harris 1992). A 2010 review of patient charts from the Veterans Affairs Medical Center
showed an average delay of 3.7 years between initial Electronic Health Record (EHR) evidence of
hyperglycemia and clinical diagnosis (Fraser 2010). In a 2002 cross-sectional analysis of 1426
adults with evidence of hyperglycemia in their EHR, only 79% had been clinically diagnosed
with diabetes mellitus (Edelman 2002). Although no symptoms may be seen during the undiagnosed
period, the patient misses the chance to get early intervention (Gopalan 2018). The United
Kingdom Prospective Diabetes Study (UKPDS) demonstrated that early intervention to control
blood glucose levels significantly lowered the risk of developing microvascular complications
and myocardial infarction, a risk reduction that lasted for decades after diagnosis compared
with cases without initial glycemic control (Holman 2008). Some of the factors that contribute
to the delayed diagnosis of diabetes include
Strategies that leverage EHR to support earlier diagnosis of diabetes mellitus could help reduce
1.1.3 Gestational Diabetes Mellitus
during pregnancy (ADA 2018). GDM is seen in pregnant women who have never had diabetes.
While the baby is at a higher risk of having hypoglycemia and other health problems at birth,
gestational diabetes in the mother usually goes away after the child is born (ADA 2010, CDC
2022). According to the American Diabetes Association (ADA), 7% of all pregnancies are
complicated by GDM, and according to the Centers for Disease Control and Prevention (CDC),
approximately 50% of women with gestational diabetes go on to develop type 2 diabetes
mellitus. During pregnancy, the body makes more hormones and undergoes changes such as weight
gain. These changes cause insulin resistance, a condition in which the body uses insulin less
effectively, and gestational diabetes occurs when the body cannot make enough insulin to
compensate. While all pregnant women develop some insulin resistance during the later stages
of pregnancy, some have insulin resistance before pregnancy, and these women are more
susceptible to developing gestational diabetes (CDC 2021, Goyal 2022). Risk factors for
gestational diabetes mellitus include obesity, poor diet and micronutrient deficiencies,
advanced maternal age, and a family history of insulin resistance and/or diabetes mellitus.
While GDM usually goes away after childbirth, its health consequences include an increased
risk of type 2 diabetes
and cardiovascular disease in the mother, and future obesity, type 2 diabetes, cardiovascular
complications include damage to the filtering system in the kidney resulting in kidney failure,
damage to the blood vessels of the eye leading to blindness, damage to the blood supply of
nerves resulting in foot damage, erectile dysfunction, nausea, vomiting, diarrhea, or constipation.
Early prediction of diabetes can help to reduce the risk of life-threatening complications
and improve the treatment of the disease. In recent studies, machine learning techniques have
been applied in health care for the early prediction of diseases (G. Tripathi et al., 2020).
1. How might we apply supervised machine learning techniques for the detection of type 2
diabetes mellitus?
2. By what means might we compare the proposed machine learning model for detection
3. How might we identify the most appropriate machine learning model for the detection of
1.3 Motivation
A significant number of people at risk of developing type 2 diabetes mellitus
are not diagnosed on time (Gopalan 2018). These delays in clinical diagnoses fail to optimize the
addressing cardiovascular risk factors. The ability to utilize patient data generated from EHRs for
the earlier prediction of diabetes mellitus could help reduce diagnostic delays and allow for early
1.4 Scope
The scope of this work is to build a machine learning model for the prediction of diabetes
mellitus by using a supervised learning approach with risk factors associated with DM.
1.5 Contributions
In light of the preceding, our research contributions, as per addressing the aforementioned
(1) analyzing the various risk factors of diabetes mellitus in the Pima Indian diabetes dataset
and highlighting the risk factors that are statistically significant for its early prediction.
(2) training and evaluating common machine learning algorithms to identify the best
algorithm for the early prediction of diabetes mellitus. Thus, improving on the prediction
Chapter 2. Literature Review
Scientific literature on the application of machine learning techniques for the prediction
of diabetes mellitus was reviewed. The machine learning algorithms employed in these papers
include supervised learning, such as classification and regression, and association rules.
Association rules were used to study the associations between features/biomarkers. We cited a
few of these articles and research papers from databases such as PubMed, IEEE, and ACM, and
The most common classification algorithms used for the prediction of diabetes mellitus
are support vector machine (SVM), artificial neural network (ANN), and decision tree (DT)
(Ioannis et al., 2017). The machine learning algorithm with the best performance on biological
and clinical datasets for diabetes research is SVM (Ioannis et al., 2017). The accuracy of an
algorithm depends on the characteristics of the data used for the machine learning task,
including the type of dataset (clinical or genetic), its dimensionality, and a low number of
instances compared to the number of features (Ioannis et al., 2017); hence the importance of
data preprocessing techniques such as feature selection. The processed data is then used to
train various machine learning algorithms, and the best model for that dataset is identified
(Ioannis et al., 2017).
was used on genomic data obtained from metagenome sequencing. Logistic regression (LR),
linear discriminant analysis (LDA), support vector machine (SVM), and artificial neural network
(ANN) were used to predict T2DM with 10-fold cross-validation. SVM performed best,
with an AUC (area under the curve) of 0.97/0.99 (Cai et al., 2015).
Logistic regression (LR), support vector machine (SVM) and artificial neural network
(ANN) were used to detect fasting blood glucose levels (FBGL) in an Indian population made up
of healthy and unhealthy people with 3-fold cross-validation. While 70% of the data was used as
the training set, the remaining 30% was used as the test set. Among the models used for this
study, SVM using RBF kernel performed best for classifying high FBGLs with approximately 85
(MDR), and support vector machines were used to build a classification model for T2DM on a
Kuwait population. A 5-fold cross-validation method was used to validate each of the
algorithms. The model with the best accuracy was the SVM with 81.3 accuracy score (Bassam et
al., 2013).
Gaussian Naïve Bayes (NB), logistic regression, k-nearest neighbor (k-NN), CART,
random forest (RF), and support vector machine algorithms were used to forecast the risk of
T2DM from electronic medical records (EMR) with 5-fold cross-validation. The random forest
algorithm performed best, with an AUC score above 0.80 (Mani et al., 2012).
Logistic regression, linear discriminant analysis, artificial neural networks, support vector
machines, fuzzy c-means, and random forests (RF) were used to classify diabetic and
non-diabetic persons in an Iranian population. 10-fold cross-validation was used to validate
the algorithms, and SVM showed the best results, with 0.986 accuracy and 0.979 AUC (Tapak et al.,
2013).
Artificial neural network, random forest, and k-means clustering were used for the early
prediction of diabetes mellitus on the Pima Indians Diabetes dataset. Feature selection was
done using the
principal component analysis (PCA) method. The association rule algorithm, Apriori, was used
to discover a strong association of diabetes with BMI and glucose level. Artificial neural
Chapter 3. Research Methodology
We searched through various online repositories to find a dataset that has been used for a
similar study. We downloaded the Pima Indians Diabetes Dataset from Kaggle
owned by Google LLC, which allows researchers to find published datasets, build machine
learning models in a web-based environment, collaborate with other professionals, and compete
in data science challenges. According to Chang et al, the Pima Indian Diabetes dataset is the
benchmark for diabetes classification research and is available through a CC0: Public Domain
The dataset is made up of Pima Indian female patients of at least 21 years of age. It
consists of 768 instances and 9 features: one target variable (outcome) and 8 predictors,
namely pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree
function, and age. There are 268 diabetic patients (34.90%) and 500 non-diabetic patients
(65.10%) in the dataset. This imbalance between the classes is large and could lead to lower
accuracy for the diabetic, higher-risk class. Figure 2 below is a
Figure 2. Division of the Pima Indian Dataset
Feature Description
While the data type of the target variable in the dataset is a factor, all predictors are of
numeric data type. Figure 3 describes the statistical summary of the dataset. In the figure, the
minimum value of features such as glucose, blood pressure, skin thickness, insulin, and BMI is
zero (0), which is inaccurate based on domain knowledge (Zia 2017, Chang 2022).
We analyzed the relationship between the risk factors of diabetes in the dataset using a
correlation matrix and a heatmap. Age and pregnancies have the highest correlation among the
features in the dataset, while skin thickness and age have the lowest. Figure 4
Figure 4. Correlation Matrix Heatmap of the Dataset
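The correlation analysis described above can be sketched with the Pearson correlation coefficient. This is a minimal illustration only: the text does not state which correlation measure was used, so Pearson is an assumption, and the column values below are toy numbers, not rows from the Pima Indians dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict that maps feature name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy values (hypothetical, for illustration only).
toy = {
    "age": [21, 25, 33, 41, 50],
    "pregnancies": [0, 1, 3, 5, 8],
    "skin_thickness": [30, 20, 35, 22, 28],
}
matrix = correlation_matrix(toy)
```

Each entry of `matrix` lies between -1 and 1, and a heatmap such as Figure 4 is a colored rendering of exactly this kind of table.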
The inaccurate zero values in the glucose, blood pressure, skin thickness, insulin, and
BMI attributes were replaced with the median value of each feature. Figure 5 shows the
statistical description of the Pima Indians diabetes dataset after the zero values in the
glucose, blood pressure, skin thickness, insulin, and BMI attributes have been replaced with their
median values.
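The zero-replacement step can be sketched as follows. The text does not say whether the median was computed over all values or only over the valid (non-zero) ones; this sketch assumes the latter. The function name, column names, and rows are hypothetical.

```python
from statistics import median

def replace_zeros_with_median(rows, columns):
    """Return a copy of rows in which zeros in the given columns are
    replaced by the median of that column's non-zero values (assumption)."""
    cleaned = [dict(row) for row in rows]
    for col in columns:
        valid = [row[col] for row in rows if row[col] != 0]
        if not valid:
            continue  # nothing to impute from
        fill = median(valid)
        for row in cleaned:
            if row[col] == 0:
                row[col] = fill
    return cleaned

# Toy rows (hypothetical values, not taken from the dataset).
rows = [
    {"glucose": 148, "bmi": 33.6},
    {"glucose": 0, "bmi": 26.6},
    {"glucose": 183, "bmi": 0},
    {"glucose": 89, "bmi": 28.1},
]
cleaned = replace_zeros_with_median(rows, ["glucose", "bmi"])
```

The original rows are left untouched, so the statistics before and after imputation (as in Figure 3 and Figure 5) can be compared side by side.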
The dataset used in this research contained no missing or null values, but outliers
were detected in the predictors. These predictors include insulin, pregnancies, glucose, blood
pressure, skin thickness, body mass index (BMI), diabetes pedigree function, and age. Figure 6
Figure 6. Boxplots showing Outliers in the Dataset
The dataset is not very large, so rather than removing the outliers we defined a
function that replaced them with the median value of the corresponding feature. Additionally,
the data was standardized, rescaling the values so that the mean is 0 and the standard
deviation is 1. Standardization is also less affected by outliers than min-max scaling
(Géron 2019). Figure 7 visualizes the
Figure 7. Boxplots showing the absence of outliers in the dataset
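The outlier replacement and standardization steps can be sketched as below. The text does not define how outliers were detected; because boxplots are used, this sketch assumes the usual 1.5 × IQR rule. The function names and the values are hypothetical.

```python
from statistics import mean, median, pstdev, quantiles

def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median.
    The 1.5*IQR boxplot rule is an assumption, not stated in the text."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    fill = median(values)
    return [v if low <= v <= high else fill for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy feature column with one extreme value (hypothetical numbers).
values = [10, 12, 11, 13, 12, 11, 95]
no_outliers = replace_outliers_with_median(values)  # -> [10, 12, 11, 13, 12, 11, 12]
z_scores = standardize(no_outliers)
```

Replacing rather than dropping the extreme value keeps all 7 toy rows, which mirrors the motivation given above for a dataset of limited size.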
The number of features in the dataset is not large. So, in the first round of this study, we
performed our machine learning experiment using all eight predictors: pregnancies, glucose,
blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age.
In the second round of this study, we used the SelectKBest feature selection method,
provided by scikit-learn, to extract the best features in the dataset. SelectKBest removes all
Documentation, 2011). The classification scoring function used is the “f_classif” function, which
returns the ANOVA F-value between each feature and the label for classification tasks
p-value of each predictor was used to identify the best risk factors for the prediction of
diabetes mellitus.
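To illustrate what an ANOVA F-value measures, the sketch below computes the one-way ANOVA F statistic of a single feature against the class label and ranks features by it. It is a plain-Python illustration of the statistic, not the scikit-learn implementation; the helper names and the toy values are hypothetical, and the p-value (which requires the F distribution) is omitted.

```python
def anova_f_value(values, labels):
    """One-way ANOVA F statistic: between-class variance of the feature
    divided by its within-class variance."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    ss_within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_k_best(features, labels, k):
    """Keep the k feature names with the highest F-values."""
    scores = {name: anova_f_value(col, labels) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data (hypothetical): glucose separates the classes, insulin does not.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "glucose": [85, 90, 95, 150, 160, 170],
    "insulin": [80, 120, 100, 90, 110, 100],
}
best, scores = select_k_best(features, labels, k=1)
```

A feature whose class means are far apart relative to its spread within each class receives a large F-value, which is why it ranks first.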
The Pima Indians Diabetes dataset contains 768 instances: 268 diabetic patients and 500
non-diabetic patients. This unequal distribution of the classes could bias classification
toward the non-diabetic class. We addressed the class imbalance in this dataset by
setting the class_weight parameter of the scikit-learn classifiers (logistic regression, support
— Scikit-Learn 1.1.3 Documentation, 2011). Class weights help an estimator adjust how it
supervised machine learning algorithms. We trained the basic logistic regression, then we trained
more complex models such as support vector machine (SVM), random forest, and decision trees.
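To make the class weighting mentioned above concrete, the sketch below computes weights with the formula n_samples / (n_classes × class count), which is the heuristic scikit-learn applies when class_weight is set to "balanced". The text does not state which value was passed, so the "balanced" mode is an assumption here; the class counts are those reported for the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 500 non-diabetic (0) and 268 diabetic (1) instances, as in the dataset.
labels = [0] * 500 + [1] * 268
weights = balanced_class_weights(labels)
```

With these counts the minority (diabetic) class gets the larger weight, so each class contributes the same total weight to the training loss.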
Logistic regression models the probability of the positive class with a sigmoid curve. It is
suitable for binary classification but can be used for multiclass classification by using the
one-vs-rest scheme (Kusumaningrum 2020).
$$f(x) = \frac{1}{1 + e^{-x}}$$
where,
f(x): sigmoid function of x
e: Euler's number (approximately 2.7183)
x: input value
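The sigmoid function defined above can be sketched as follows, together with how a logistic regression model would turn a weighted sum of features into a probability. The weights, bias, and feature values are hypothetical and are not the fitted parameters from this study.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid: 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow for large negative x
    return z / (1.0 + z)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one instance."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(score)

# Hypothetical standardized features and coefficients.
p = predict_proba([1.0, 2.0], [0.5, -0.25], 0.0)
```

A probability above a chosen threshold (commonly 0.5) would then be mapped to the diabetic class.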
SVM is one of the most popular machine learning techniques proposed by J. Platt et al.
hyperplane. SVM separates instances into specified classes. It can also classify instances
that were not seen in the training data. SVM makes no assumption about the distribution of the
data in each class. One extension of this algorithm performs regression analysis to produce a
linear function, and another learns to rank elements to produce a classification for
individual elements.
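As a rough illustration of the separating hyperplane, the sketch below shows the decision rule of a linear SVM and the hinge loss that penalizes points on the wrong side of the margin. The weight vector and bias are hypothetical; in practice they are learned from the training data, and a kernel SVM replaces the dot product with a kernel function.

```python
def decision_function(x, w, b):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_function(x, w, b) >= 0 else -1

def hinge_loss(x, y, w, b):
    """Zero when x is classified correctly with a margin of at least 1."""
    return max(0.0, 1.0 - y * decision_function(x, w, b))

# Hypothetical hyperplane parameters and two toy points.
w, b = [1.0, -1.0], 0.0
inside_margin = hinge_loss([1.0, 1.5], -1, w, b)
outside_margin = hinge_loss([3.0, 1.0], 1, w, b)
```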
Most data take values in a limited set of discrete categories, and assigning an instance to
one of them is called classification. Each discrete category of the domain is called a class.
In a decision tree, each internal node is labeled with an input feature, and each leaf node is
labeled with a class, that is, a target value. The highest information gain for
There are some popular decision tree algorithms that are available to classify diabetic
data in machine learning techniques, including ID3, J48, C4.5, C5, CHAID and CART. In our
research, the C4.5 decision tree algorithm was chosen to analyze performance on the
diabetic data. C4.5 extends the ID3 decision tree algorithm proposed by Ross Quinlan. C4.5
builds a decision tree from training data in the same way as ID3, and the learned tree can be
used on medical data to predict the value of the decision attribute. At each branch node of
the tree, C4.5 selects the attribute that most effectively splits the data into subsets
enriched in one class. The splitting criterion is the normalized information gain: the
attribute with the highest normalized information gain is chosen to make the decision.
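The normalized information gain (gain ratio) used by C4.5 can be sketched as below: the information gain of a split divided by the entropy of the split itself. The function names and toy labels are hypothetical, and the sketch covers only the scoring of one candidate split, not tree construction.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, partition):
    """Information gain of a split divided by its split information.
    partition: one list of labels per branch of the split."""
    n = len(labels)
    parts = [part for part in partition if part]
    remainder = sum(len(part) / n * entropy(part) for part in parts)
    info_gain = entropy(labels) - remainder
    split_info = -sum(len(part) / n * math.log2(len(part) / n) for part in parts)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy node with two diabetic (1) and two non-diabetic (0) instances.
node = [1, 1, 0, 0]
pure_split = gain_ratio(node, [[1, 1], [0, 0]])   # separates the classes
mixed_split = gain_ratio(node, [[1, 0], [1, 0]])  # separates nothing
```

The attribute whose split yields the highest ratio would be chosen at that node.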
Random forests are made up of tree predictors such that each tree depends on the values
of a random vector sampled independently and with the same distribution for all trees in the
forest. The generalization error for forests converges almost surely to a limit as the number
of trees in the forest becomes large. The generalization error of a forest of tree classifiers
depends on the strength of the individual trees in the forest and the correlation between them
(Breiman 2001). RF follows specific rules for tree growing, tree combination, self-testing,
and post-processing. It is robust to overfitting and is considered more stable in the presence
of outliers and in very high-dimensional parameter spaces than other machine learning
algorithms (Caruana and Niculescu-Mizil, 2006; Menze et al., 2009). The concept of variable
importance is an implicit
the Gini impurity criterion index (Ceriani and Verme, 2012). The Gini index is a measure of
reduction (Strobl et al., 2007); it is non-parametric and therefore does not rely on data belonging
to a particular type of distribution. For a binary split (white circles in Figure 1), the Gini index of
For splitting a binary node in the best way, the improvement in the Gini index should be
maximized. In other words, a low Gini (i.e., a greater decrease in Gini) means that a particular
predictor feature plays a greater role in partitioning the data into the two classes. Thus, the Gini
index can be used to rank the importance of features for a classification problem (Sarica 2017).
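The Gini-based ranking described above can be sketched as follows. The formula for the Gini index is not reproduced in the text, so this sketch assumes the standard Gini impurity, 1 minus the sum of squared class proportions in a node. The function names and toy labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Improvement in Gini achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Toy node with two diabetic (1) and two non-diabetic (0) instances.
parent = [1, 1, 0, 0]
good = gini_decrease(parent, [1, 1], [0, 0])  # children are pure
poor = gini_decrease(parent, [1, 0], [1, 0])  # children are as mixed as the parent
```

Summing such decreases over all nodes where a feature is used, across all trees, gives the kind of importance ranking referred to above.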
We used a built-in class in the scikit-learn library, ShuffleSplit, to shuffle the dataset and
split it into k folds for cross-validation, where k is the number of parts the data is
divided into. k = 10 is the most popular value used to evaluate machine learning models.
configure-k-fold-cross-validation/). Each machine learning model is trained on k-1 folds of the
dataset and evaluated on the remaining fold, and this is repeated k times so that each fold is
used once for evaluation. The best performing model is selected through
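The k-fold procedure described above can be sketched in plain Python as below. This follows the textual description (shuffle once, then use each of the k folds once for evaluation); scikit-learn's ShuffleSplit itself draws independent random train/test splits, so the sketch is an illustration of k-fold splitting rather than a reproduction of that class. The function name and seed are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps splits reproducible
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# 768 instances split into 10 folds, as in this study.
splits = list(k_fold_indices(768, k=10, seed=42))
```

Using the same seed for every algorithm means all models are compared on identical folds.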
3.6.1 Evaluation Metrics
The evaluation methods used to measure the model performance include accuracy,
precision, recall, and roc_auc, as well as comparing performance on all predictors and the best 5
Accuracy refers to the percentage of all samples that have been predicted correctly. It is
the ratio of the sum of true positives and true negatives to the total number of predictions made.
Precision refers to the percentage of samples correctly predicted as positive among all
samples predicted as positive, including those that are actually negative.
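These metrics can be computed from the four confusion-matrix counts, as sketched below. Recall is included using its standard definition, true positives divided by all actual positives. The labels and predictions are toy values, not results from this study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Toy labels (hypothetical): 4 diabetic and 6 non-diabetic instances.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
```

In this toy case precision is higher than recall: most predicted positives are correct, but half of the actual positives are missed.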
Chapter 4. Results
In order to identify the important features in the Pima Indians diabetes dataset for model
training in the second round of this study, we calculated the ANOVA F-values and the p-values
of the predictors. Table 2 below shows glucose as the best predictor of diabetes mellitus,
followed by age, BMI, number of pregnancies, blood pressure, skin thickness, diabetes pedigree
function, and lastly insulin. Using a significance threshold of 0.05, two predictors, glucose
and BMI, are statistically significant for the prediction of diabetes mellitus. This finding is
in line with research by Alam et al. (2019), where the association rule algorithm Apriori was
used to discover a
4.2 Performance of Machine Learning Algorithms
In this study, four machine learning algorithms were used to analyze the Pima Indian
Diabetes dataset: logistic regression, support vector machine, decision tree, and
random forest. The dataset was partitioned using the k-fold cross-validation method, where k
is 10, and the random state was held constant for all four algorithms.
Four metrics were used to measure the performance of the logistic regression, support
vector machine, decision tree, and random forest algorithms. The support vector machine proved
to be the best algorithm on the dataset used for this study. Table 3 shows the performance of each
We performed another comparative analysis between the best model developed in this
experiment, the support vector machine, and the best model from a previous study published by
Alam et al. (2019). Table 4 shows that ANN was the best performing model developed by those
researchers on the same dataset. However, our proposed model outperformed the ANN model
Publication      Dataset                        Compared Algorithms   Best   Accuracy
Proposed model   Pima Indian Diabetes dataset   LR, SVM, DT, RF       SVM    ACC = 77.3%
Chapter 5. Conclusion
In order to apply supervised machine learning techniques for the detection of type 2
diabetes mellitus, the Pima Indians Diabetes dataset was identified as the benchmark dataset used
for diabetes research. Based on domain knowledge, inaccurate values in some of the features
were replaced with the median value of each feature. The outliers in the dataset were also
identified and replaced with the median value of each feature. The ANOVA F-value and p-value
of the individual predictors were calculated. The glucose and BMI features in the dataset both
had large F-values and p-values less than 0.05, indicating that a patient's glucose level and
body mass index are strong predictors of diabetes mellitus. This is in
Due to the limited size of the dataset, all eight of its predictors were used to develop a
machine learning model in the first part of the project. The most common machine learning
algorithms used in the prediction of DM, such as SVM and DT, were used to develop a model to
classify diabetic and non-diabetic patients. In addition, logistic regression and random forest
To compare the performance of the machine learning models, we used evaluation metrics
such as accuracy, precision, recall, and roc_auc. The support vector machine was observed to be
the best model on the dataset. Furthermore, we compared the performance of the best model
developed in this study to the model developed in a previous study on the same dataset (Alam et
The results in our study show that the support vector machine is the most appropriate model
Chapter 6. Limitations
The limitations of this study include the size of the dataset and the nature of its instances.
The Pima Indians Diabetes dataset consists of 768 instances, 8 predictors, and 1 target feature.
This sample is small and may have produced an unreliable estimate of model performance. Also,
all instances in the dataset are female patients at least 21 years of age, so they are not
representative of the real-world population of diabetic patients. The comparative analysis of the
model developed in this study against models from previous research was limited to one
study. Furthermore, because outliers in the dataset were replaced with median values, they were not
used to train the machine learning algorithms in this study.
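The median replacement that removed the outliers from training can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: the 1.5 × IQR rule for identifying outliers, the treatment of zero as an invalid reading, the function names, and the toy BMI values are all hypothetical.

```python
from statistics import median, quantiles


def replace_invalid_with_median(values, invalid=0.0):
    """Replace placeholder readings with the median of the valid readings."""
    valid = [v for v in values if v != invalid]
    med = median(valid)
    return [med if v == invalid else v for v in values]


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median.

    The IQR rule is an assumption made for this sketch.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    med = median(values)
    return [med if v < low or v > high else v for v in values]


# Hypothetical BMI readings: 0.0 is an impossible value, 67.0 is extreme.
bmi = [0.0, 22.0, 25.0, 27.0, 30.0, 33.0, 0.0, 67.0]
bmi_valid = replace_invalid_with_median(bmi)
bmi_clean = replace_outliers_with_median(bmi_valid)
```

Keeping `bmi_valid` and `bmi_clean` as separate lists makes it straightforward to train one model on each and compare performance with and without the outliers.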
Chapter 7. Future Work
Future work can extend this study by predicting diabetes with more advanced machine learning
models, such as ensemble learning and deep learning, on a larger and more diverse dataset.
A prediabetic dataset could be used for the early prediction of diabetes mellitus. The performance of
the machine learning models can be compared with more models developed in previous research.
Also, a comparative analysis of model performance can be done with and without the outliers
in the dataset. The aim would be to increase the accuracy of diabetes mellitus
prediction and to discover whether there is a significant difference in the performance of the model
Bibliography
Accili, D. (2018). Insulin Action Research and the Future of Diabetes Treatment: The 2017
Banting Medal for Scientific Achievement Lecture. Diabetes, 67, 1701–1709. doi:
10.2337/dbi18-0025
Alarcon C., Lincoln B., Rhodes C.J. (1993). The biosynthesis of the subtilisin-related proprotein
convertase PC3, but not that of the PC2 convertase, is regulated by glucose in parallel to
proinsulin biosynthesis in rat pancreatic islets. J. Biol. Chem. 268:4276–4280. doi:
10.1016/S0021-9258(18)53606-1.
Alberti KG, Zimmet P, Shaw J, IDF Epidemiology Task Force Consensus Group (2005). The
metabolic syndrome–a new worldwide definition. Lancet. Sep;366(9491):1059-1062. doi:
10.1016/S0140-6736(05)67402-8
American Diabetes Association. (2010). Diagnosis and classification of diabetes mellitus.
Diabetes Care. Jan;33 Suppl 1(Suppl 1): S62-9. doi: 10.2337/dc10-S062. Erratum in:
Diabetes Care. 2010 Apr;33(4):e57. PMID: 20042775; PMCID: PMC2797383.
American Diabetes Association. (2018). Classification and Diagnosis of Diabetes: Standards of
Medical Care in Diabetes. Diabetes Care. 2018;41:S13–S27. doi: 10.2337/dc18-S002.
Bassam, F., Channanath, A. M., Kazem, B., & Thangavel, A. (2013). Predictive models to assess
risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and
validation using national health data from Kuwait—a cohort study. BMJ Open, 3.
10.1136/bmjopen-2012-002457
Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32.
https://doi.org/10.1023/A:1010933404324
Cai L, Wu H, Li D, Zhou K, Zou F. (2015). Type 2 Diabetes Biomarkers of Human Gut
Microbiota Selected via Iterative Sure Independent Screening Method. PLOS ONE
10(10): e0140827. https://doi.org/10.1371/journal.pone.0140827
Casagrande SS, Cowie CC, Genuth SM. (2014). Self-reported prevalence of diabetes screening
in the U.S., 2005–2010. Am J Prev Med ; 47: 780–787.
Centers for Disease Control and Prevention. (2007). National Diabetes Fact Sheet: General
Information and National Estimates on Diabetes in the United States. Atlanta, GA. US
Department of Health and Human Services, Centers for Disease Control and Prevention.
Centers for Disease Control and Prevention. What is Type 1 Diabetes? March 2022.
https://www.cdc.gov/diabetes/basics/what-is-type-1-diabetes.html
Chang V, Bailey J, Xu QA, Sun Z. (2022). Pima Indians diabetes mellitus classification based on
machine learning (ML) algorithms. Neural Comput Appl. Mar 24:1-17. doi:
10.1007/s00521-022-07049-z. Epub ahead of print. PMID: 35345556; PMCID:
PMC8943493.
Dekant, Wolfgang, and Wolfgang Völkel. (2008). "Human exposure to bisphenol A by
biomonitoring: methods, results and assessment of environmental exposures." Toxicology
and applied pharmacology 228.1 114-134.
Edelman D. (2002) Outpatient diagnostic errors: unrecognized hyperglycemia. Eff Clin Pract 5:
11–16
Faizan Zafar, Saad Raza, Muhammad Umair Khalid, and Muhammad Ali Tahir. (2019).
Predictive Analytics in Healthcare for Diabetes Prediction. In Proceedings of the 2019
9th International Conference on Biomedical Engineering and Technology (ICBET' 19).
Association for Computing Machinery, New York, NY, USA, 253–259.
DOI: https://doi.org/10.1145/3326172.3326213
Fraser LA, Twombly J, Zhu M, Long Q, Hanfelt JJ, Narayan KM et al. (2010). Delay in
diagnosis of diabetes is not the patient’s fault. Diabetes Care 33: e10.
Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly.
G. Tripathi and R. Kumar. (2020) "Early Prediction of Diabetes Mellitus Using Machine
Learning," 2020 8th International Conference on Reliability, Infocom Technologies and
Optimization (Trends and Future Directions) (ICRITO), pp. 1009-1014,
doi:10.1109/ICRITO48877.2020.9197832.
Gopalan A, Mishra P, Alexeeff SE, Blatchins MA, Kim E, Man AH, Grant RW. (2018).
Prevalence and predictors of delayed clinical diagnosis of Type 2 diabetes: a longitudinal
cohort study. Diabet Med. Dec;35(12):1655-1662. doi: 10.1111/dme.13808. Epub 2018
Sep 21. PMID: 30175870; PMCID: PMC6481650.
Goyal R, Jialal I. Diabetes Mellitus Type 2. (Updated 2022 Jun 19). In: StatPearls [Internet].
Treasure Island (FL): StatPearls Publishing; 2022 Jan-. Available from:
https://www.ncbi.nlm.nih.gov/books/NBK513253/
Han Wu, Shengqi Yang, Zhangqin Huang, Jian He, Xiaoyi Wang. (2018). Type 2 diabetes
mellitus prediction model based on data mining, Informatics in Medicine Unlocked,
Volume 10, Pages 100-107, ISSN 2352-9148, https://doi.org/10.1016/j.imu.2017.12.006
Harris MI, Klein R, Welborn TA, Knuiman MW. (1992). Onset of NIDDM occurs at least 4–7 yr
before clinical diagnosis. Diabetes Care 15: 815–819.
Holman RR, Paul SK, Bethel MA, Matthews DR, Neil HA. (2008). 10-year follow-up of
intensive glucose control in type 2 diabetes. N Engl J Med 359: 1577–1589.
Hu FB, Manson JE, Stampfer MJ, Colditz G, Liu S, Solomon CG, et al. (2001). Diet, lifestyle,
and the risk of type 2 diabetes mellitus in women. N Engl J Med. Sep;345(11):790-797. doi:
10.1056/NEJMoa010492
International Diabetes Federation. (2013). IDF Diabetes Atlas. 6th ed. Brussels, Belgium:
International Diabetes Federation.
Ioannis, K., Olga, T., Athanasios, S., Nicos, M., Ioannis, V., & Ioanna, C. (2017). Machine
Learning and Data Mining Methods in Diabetes Research. Computational and Structural
Biotechnology Journal, 15, 104-116. https://doi.org/10.1016/j.csbj.2016.12.005
Jack L, Jr, Boseman L, Vinicor F. (2004). Aging Americans and diabetes. A public health and
clinical response. Geriatrics Apr;59(4):14-17
John C. Platt. (1999). "12 fast training of support vector machines using sequential minimal
optimization." in Advances in kernel methods, pp. 185-208.
Jung Y, Hu J. (2015). A K-fold Averaging Cross-validation Procedure. J Nonparametr Stat.
27(2):167-179. doi: 10.1080/10485252.2015.1010532. Epub 2015 Feb 26. PMID:
27630515; PMCID: PMC5019184.
K. A. Hasan and M. Al Mehedi Hasan. (2020). "Classification of parkinson’s disease by
analyzing multiple vocal features sets", IEEE Region 10 Symposium (TENSYMP), pp.
758-761, 2020
Kharroubi AT, Darwish HM. (2015). Diabetes mellitus: The epidemic of the century. World J
Diabetes. Jun 25;6(6):850-67. doi: 10.4239/wjd.v6.i6.850. PMID: 26131326; PMCID:
PMC4478580.
Kiefer MM, Silverman JB, Young BA, Nelson KM. (2020). National patterns in diabetes
screening: data from the National Health and Nutrition Examination Survey (NHANES)
2005–2012. J Gen Intern Med 30: 612–618
Kusumaningrum R, Indihatmoko TA, Juwita SR, Hanifah AF, Khadijah K, Surarso B. (2020).
Benchmarking of Multi-Class Algorithms for Classifying Documents Related to Stunting.
Applied Sciences. 10(23):8621. https://doi.org/10.3390/app10238621
L. Tapak, H. Mahjub, O. Hamidi, J. Poorolajal. (Sep 2013), Real-data comparison of data mining
methods in prediction of diabetes in Iran Healthc Inform Res, 19 (3) pp. 177-185,
10.4258/hir.2013.19.3.177. [Epub 2013 Sep 30]
Lang IA, Galloway TS, Scarlett A, Henley WE, Depledge M, Wallace RB, et al. Association of
urinary bisphenol A concentration with medical disorders and laboratory abnormalities in
adults. JAMA 2008. Sep;300(11):1303-1310. doi: 10.1001/jama.300.11.1303
Larabi-Marie-Sainte S, Aburahmah L, Almohaini R, Saba T. Current techniques for diabetes
prediction: review and case study. Appl Sci. 2019;9(21):4604. doi: 10.3390/app9214604.
Lovejoy JC. The influence of dietary fat on insulin resistance. Curr Diab Rep 2002.
Oct;2(5):435-440. doi: 10.1007/s11892-002-0098-y
Lucier J, Weinstock RS. Diabetes Mellitus Type 1. (Updated 2022 May 11). In: StatPearls
[Internet]. Treasure Island (FL): StatPearls Publishing; 2022 Jan-. Available from:
https://www.ncbi.nlm.nih.gov/books/NBK507713/
Maitra A, Abbas AK. (2005). Endocrine system. In: Kumar V, Fausto N, Abbas AK (eds).
Robbins and Cotran Pathologic basis of disease (7th ed). Philadelphia, Saunders; 1156-
1226.
Malik, S., Khadgawat, R., Anand, S. et al. (2016). Non-invasive detection of fasting blood
glucose level via electrochemical measurement of saliva. SpringerPlus 5, 701.
https://doi.org/10.1186/s40064-016-2339-6
McCarthy MI. (2010). Genomics, type 2 diabetes, and obesity. N Engl J Med. Dec;363(24):2339-2350.
doi: 10.1056/NEJMra0906948
Mozhvilo, E. (2021, January 28). Why Weight? The Importance of Training on Balanced
Datasets. Towards Data Science. Retrieved November 11, 2022, from
https://towardsdatascience.com/why-weight-the-importance-of-training-on-balanced-datasets-f1e54688e7df
Olokoba AB, Obateru OA, Olokoba LB. (2012). Type 2 diabetes mellitus: a review of current
trends. Oman Med J. Jul;27(4):269-73. doi: 10.5001/omj.2012.68. PMID: 23071876;
PMCID: PMC3464757.
Pima Indians Diabetes Database. (updated 2016). Kaggle. Retrieved November 11, 2022, from
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download
Plows JF, Stanley JL, Baker PN, Reynolds CM, Vickers MH. (2018). The Pathophysiology of
Gestational Diabetes Mellitus. Int J Mol Sci. Oct 26;19(11):3342.
doi:10.3390/ijms19113342. PMID: 30373146; PMCID: PMC6274679.
Porta M, Curletto G, Cipullo D, Rigault de la Longrais R, Trento M, Passera P et al. (2014).
Estimating the delay between onset and diagnosis of type 2 diabetes from the time course
of retinopathy prevalence. Diabetes Care 37: 1668–1674.
Powers AC. Diabetes mellitus. In: Fauci AS, Braunwald E, Kasper DL, Hauser SL, Longo DL,
Jameson JL, Loscalzo J (eds). (2008). Harrison’s Principles of Internal Medicine.17th ed,
New York, McGraw-Hill 2275-2304
Prevalence of overweight and obesity among adults with diagnosed Diabetes United States,
1988-1994 and 1999-2000. Centers for Disease Control and Prevention (CDC) (2004).
MMWR. Morbidity and Mortality Weekly Report; 53(45): 1066-1068
Rahman MS, Hossain KS, Das S, Kundu S, Adegoke EO, Rahman MA, Hannan MA, Uddin MJ,
Pang MG. (2021). Role of Insulin in Health and Disease: An Update. Int J Mol Sci. Jun
15;22(12):6403. doi: 10.3390/ijms22126403. PMID: 34203830; PMCID: PMC8232639.
Ross Quinlan, (1993). C4. 5: Programs for Machine Learning, San Mateo, CA:Morgan
Kaufmann Publishers.
S. Mani, Y. Chen, T. Elasy, W. Clayton, J. Denny. (2012). Type 2 diabetes risk forecasting from
EMR data using machine learning. AMIA Annu Symp Proc, 2012, pp. 606-615
Sarica Alessia, Cerasa Antonio, Quattrone Aldo. (2017). Random Forest Algorithm for the
Classification of Neuroimaging Data in Alzheimer's Disease: A Systematic Review.
Frontiers in Aging Neuroscience (9). https://www.frontiersin.org/articles/10.3389/ ISSN
1663-4365.
sklearn.feature_selection.f_classif — scikit-learn 1.1.3 documentation. (2011). Scikit-learn.
Retrieved November 10, 2022, from
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif
sklearn.feature_selection.SelectKBest — scikit-learn 1.1.3 documentation. (2011). Scikit-learn.
Retrieved November 11, 2022, from
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
sklearn.utils.class_weight.compute_class_weight — scikit-learn 1.1.3 documentation. (2011).
Scikit-learn. Retrieved November 11, 2022, from
https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the
ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the
Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer
Society Press.
Spijkerman AM, Dekker JM, Nijpels G, Adriaanse MC, Kostense PJ, Ruwaard D et al. (2003).
Microvascular complications at time of diagnosis of type 2 diabetes are similar among
diabetic patients detected by targeted screening and patients newly diagnosed in general
practice: the hoorn screening study. Diabetes Care 26: 2604–2608.
Talha M Alam, M Atif I, Yasir A, Abdul W, Safdar I, Talha B, Ayaz H, Muhammad M,
Muhammad Mehdi R, Salman I, Zunish A. (2019). A model for early prediction of
diabetes. Informatics in Medicine Unlocked, Volume 16, 100204, ISSN 2352-9148,
https://doi.org/10.1016/j.imu.2019.100204.
Vermeulen I, Weets I, Asanghanwa M, Ruige J, Van Gaal L, Mathieu C, Keymeulen B,
Lampasona V, Wenzlau JM, Hutton JC, et al. (2011). Contribution of antibodies against
IA-2β and zinc transporter 8 to classification of diabetes diagnosed under 40 years of age.
Diabetes Care.34:1760–1765.
What Causes Gestational Diabetes? (2021). CDC. Retrieved November 10, 2022, from
http://cdc.gov/diabetes/basics/gestational.html
What Causes Type 2 Diabetes? (2021). CDC. Retrieved November 10, 2022, from
https://www.cdc.gov/diabetes/basics/type2.html
What is diabetes? (2022). CDC. Retrieved November 10, 2022, from
https://www.cdc.gov/diabetes/basics/diabetes.html
Zhang X, Geiss LS, Cheng YJ, Beckles GL, Gregg EW, Kahn HS. (2008). The missed patient
with diabetes: how access to health care affects the detection of diabetes. Diabetes Care
31: 1748–1753.
Zia UA, Khan N (2017). Predicting diabetes in medical datasets using machine learning
techniques. Int J Sci Eng Res 5(2):257–267.
Biographical Sketch
Clement Tochukwu Okolo was born in Lagos, Nigeria. He began his academic career at
Olabisi Onabanjo University, Nigeria, majoring in anatomy. After earning his bachelor's degree
in anatomy in the Summer of 2017, he joined the University of Louisiana at Lafayette in the
using supervised machine learning algorithm under the tutelage of Dr. Michael W. Totaro. He
also served as the President of the Graduate Student Organization (GSO) and the GSO
representative for the School of Computing and Informatics. His research culminated in earning
a master’s degree in informatics at the University of Louisiana at Lafayette in the Fall of 2022.
ProQuest Number: 30240963