Thesis

The thesis by Clement Okolo focuses on predicting Type 2 diabetes using various machine learning algorithms, specifically targeting the detection and diagnosis of the condition. The study utilized the Pima Indian diabetes dataset and found that the Support Vector Machine algorithm yielded the highest accuracy of 77.27%. The research aims to enhance diabetes diagnosis processes and encourages further exploration into machine learning applications in healthcare.

Uploaded by

Manav Purswani

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/372314912

Diabetes Prediction Using Machine Learning Algorithm

Thesis · December 2022


DOI: 10.13140/RG.2.2.25215.18084/2

1 author:

Clement Okolo
University of Louisiana at Lafayette
All content following this page was uploaded by Clement Okolo on 08 September 2023.


Diabetes Prediction Using Machine Learning Algorithm

Clement Tochukwu Okolo

A Thesis Presented to the Graduate Faculty


in Partial Fulfillment of the Requirements for the Degree
Master of Science

University of Louisiana at Lafayette


Fall 2022

APPROVED:

Michael W. Totaro, Chair


School of Computing and Informatics

Chee-Hung Henry Chu


School of Computing and Informatics

Ashok Kumar
School of Computing and Informatics

Li Chen
School of Computing and Informatics

Mary Farmer-Kaiser
Dean of the Graduate School
© Clement Tochukwu Okolo

2022

All Rights Reserved


Abstract

Type 2 diabetes mellitus accounts for most of the diabetes cases

in the world. This type of diabetes occurs when the body does not produce enough or respond

normally to insulin causing the blood glucose level to get high, leading inevitably to other health

conditions such as heart disease, kidney disease, etc. The aim of this research paper is to assist

medical professionals in the detection and efficient diagnosis of Type 2 diabetes. We applied

several supervised machine learning techniques to develop a machine learning model to predict diabetes

with a low error rate based on eight predictors from the Pima Indian diabetes dataset. We outline

the methodology, implementation steps, and related work in the field. The four popular machine

learning algorithms used in this study are logistic regression, support vector machine, decision

tree, and random forest. SVM performed best with 77.27% accuracy, 75.61% precision, 51.38%

recall, and 82.47% ROC AUC. Our model showed an increase in accuracy when compared with the

ANN model developed from the same dataset in a previous study. With this work, we intend to

improve the process of diagnosing Type 2 diabetes with machine learning and encourage further

research into diabetes prediction using machine learning.

To almighty God and my lovely parents who sacrificed and supported me immensely so I could
get an education.

Acknowledgement

First, I would like to thank God almighty for guiding me through all challenges and

giving me the privilege of completing my degree. You will continue to take the wheel in life and

lead me to greater heights. In addition, I would love to express my gratitude to my parents and

siblings for all their prayers, sacrifices, and motivation, which have continued to sustain me

immensely.

I would also like to give thanks to my thesis supervisor, Dr. Michael Totaro, for his

guidance and support throughout the entire process of completing this thesis. Additionally, I

would like to thank my thesis committee members, Dr. Henry Chu, Dr. Ashok Kumar, and Dr. Li

Chen, for their outstanding comments and suggestions.

Finally, I would like to thank my classmates and colleagues in the School of Computing

and Informatics at the University of Louisiana at Lafayette for their academic interactions and

support. I am indeed very grateful to you all.

Table of Contents

Abstract

Acknowledgement

List of Figures

List of Abbreviations

Chapter 1. Introduction
1.1 Classification of Diabetes Mellitus
1.1.1 Type 1 Diabetes Mellitus
1.1.2 Type 2 Diabetes Mellitus
1.1.3 Gestational Diabetes Mellitus
1.2 Research Questions
1.3 Motivation
1.4 Scope
1.5 Contributions

Chapter 2. Literature Review

Chapter 3. Research Methodology
3.1 Data Collection
3.2 Data Preprocessing
3.3 Feature Selection
3.4 Class Imbalance
3.5 Machine Learning Algorithms
3.5.1 Logistic Regression
3.5.2 Support Vector Machine
3.5.3 Decision Tree
3.5.3 Random Forest
3.6 Performance Evaluation
3.6.1 Evaluation Metrics

Chapter 4. Results
4.1 Feature Selection
4.2 Performance of Machine Learning Algorithms

Chapter 5. Conclusion

Chapter 6. Limitations

Chapter 7. Future Work

Biographical Sketch

List of Tables

Table 1: Description of Pima Indian Diabetes Dataset

Table 2: ANOVA F-values and the p-values of Individual Predictors

Table 3: Performance of Machine Learning Algorithms

Table 4: Comparison with Different Machine Learning Research

List of Figures

Figure 1: Number of subjects with type 1 diabetes in children (0-14 years), with diabetes in
adults (20-79 years) and with hyperglycemia (type 2 or gestational diabetes) in
pregnancy (20-49 years)

Figure 2: Division of the Pima Indian Dataset

Figure 3: Statistical Description of the Pima Indians Diabetes Dataset

Figure 4: Correlation Matrix Heatmap of the Dataset

Figure 5: Statistical Description of the Dataset without Inaccurate Values

Figure 6: Boxplots showing Outliers in the Dataset

Figure 7: Boxplots showing the absence of outliers in the dataset

List of Abbreviations

ADA American Diabetes Association

ANN Artificial Neural Network

ANOVA Analysis of Variance

AUC Area Under the Curve

BMI Body Mass Index

BPA Bisphenol-A

CDC Centers for Disease Control and Prevention

CART Classification and Regression Trees

DM Diabetes Mellitus

GDM Gestational Diabetes Mellitus

EHR Electronic Health Record

FBGL Fasting Blood Glucose Levels

k-NN K-nearest neighbor

LDA Linear Discriminant Analysis

LR Logistic Regression

MDR Multifactor Dimensionality Reduction

NB Gaussian Naïve Bayes

PCA Principal Component Analysis

RF Random Forest

SVM Support Vector Machine

T1DM Type 1 Diabetes Mellitus

T2DM Type 2 Diabetes Mellitus

TDS Towards Data Science

UKPDS United Kingdom Prospective Diabetes Study

ZnT8A Zinc Transporter Protein

Chapter 1. Introduction

Diabetes refers to a group of metabolic diseases characterized by hyperglycemia, also

known as high blood glucose, resulting from defects in insulin secretion, insulin action, or both.

The long-term effects of chronic hyperglycemia from diabetes include damage, dysfunction, and

failure of different organs, particularly the eyes, kidneys, nerves, heart, and blood vessels (ADA

2010). According to the Centers for Disease Control and Prevention, there are three types of

diabetes: type 1, type 2, and gestational diabetes (CDC 2022). Diabetes mellitus (DM),

commonly known as diabetes, is a group of diseases that are defined by chronic high blood

glucose levels due to abnormalities in insulin secretion, insulin action, or both (ADA 2010).

Insulin is a peptide hormone that helps in glucose homeostasis and is produced in large

concentration by the β cells of the pancreatic islets of Langerhans and in low concentration by

some neurons of the central nervous system (Rahman 2021). The amount of glucose in the

bloodstream controls the biosynthesis and secretion of the hormone insulin. Insulin is synthesized

when glucose levels are between 2 mM and 4 mM, and it is secreted when glucose levels rise

above 5 mM (Alarcon 1993). When insulin is not secreted, blood glucose will remain high. High

glucose concentration in the body leads to a condition called hyperglycemia (Vasiljevic 2020).

After the secretion of insulin, it circulates in the body and is distributed to hepatocytes, also

known as liver cells, skeletal muscle cells and adipocytes for glucose uptake, thereby reducing

glucose concentration. If insulin is secreted but the target cells do not take up excess glucose

from the bloodstream, glucose level will remain high, thereby leading to hyperglycemia as well.

Diabetes Mellitus occurs when there is hyperglycemia for a long period of time (Accili 2018).

According to Accili, DM can cause major health complications like damage to the

nervous system and dysfunction of the eyes and kidneys. The type of diabetes and how long a

patient has been diabetic determines how bad the symptoms will be and the long-term

complications of diabetes may be life-threatening (Kharroubi 2015).

1.1 Classification of Diabetes Mellitus

Diabetes mellitus can be classified into three categories namely Type 1 diabetes mellitus,

Type 2 diabetes mellitus, and gestational diabetes (CDC2022).

1.1.1 Type 1 Diabetes Mellitus

Type 1 Diabetes Mellitus (T1DM), once called insulin-dependent or juvenile diabetes, is

an autoimmune disease that is characterized by hyperglycemia and the absence of insulin

because the pancreatic β cells produce little or no insulin. The body attacks itself unintentionally,

thereby destroying the insulin-producing cells of the pancreas. The autoantibodies responsible

for the destruction include islet cell autoantibodies, autoantibodies to insulin (IAA), glutamic

acid decarboxylase (GAD, GAD65), protein tyrosine phosphatase (IA2 and IA2β) and zinc

transporter protein (ZnT8A) (Vermeulen 2011). It has been estimated that 5-10% of the

world's population who are diabetic have T1DM. In the United States, approximately 1.24

million diabetic patients are type 1 patients, and this number is growing and is projected to reach

5 million by 2050. T1DM is one of the most common chronic diseases in children; however, it

can develop in people of all ages, but it is commonly seen in children, teens, and young adults

(ADA2010, CDC2022). The destruction of the beta cell of the pancreas can happen over months

or years before symptoms are noticed. According to the International Diabetes Federation, some

of the symptoms of T1DM are polydipsia, polyuria, enuresis, lack of energy, extreme tiredness,

polyphagia, sudden weight loss, slow-healing wounds, recurrent infections and blurred vision

with severe dehydration and diabetic ketoacidosis (IDF2013, Kharroubi2015). Patients who are

diagnosed with T1DM require insulin replacement for the rest of their lives (Lucier 2022).

Figure 1. Number of subjects with type 1 diabetes in children (0-14 years), with diabetes in
adults (20-79 years) and with hyperglycemia (type 2 or gestational diabetes) in
pregnancy (20-49 years). Data extracted from International Diabetes Federation
Diabetes Atlas, 6th ed, 2013.

1.1.2 Type 2 Diabetes Mellitus

Type 2 Diabetes Mellitus (T2DM), formerly called non-insulin-dependent diabetes

mellitus, is characterized by hyperglycemia, insulin resistance, and relative insulin deficiency

(Maitra2005). In T2DM, the pancreatic β cells produce enough insulin but the body cells cannot

use it adequately for glucose homeostasis. Therefore, the pancreatic cells try to get the body to

respond normally to insulin by secreting more and more insulin. As a result, the concentration of

blood glucose increases causing hyperglycemia and type 2 diabetes (CDC2021). T2DM is the

most predominant form of DM. Over 37 million people in the United States of America have

DM, and approximately 90-95% of them have T2DM. While more children, teens, and young

adults are developing type 2 diabetes, it most commonly develops in people over the age of 45

(CDC2021, Olokoba2012). There are lifestyle and genetic risk factors associated with T2DM

(Olokoba2012). Some of the lifestyle risk factors include physical inactivity, smoking, and

alcohol consumption (HU2021). According to the Centers for Disease Control and Prevention,

obesity is also a risk factor for an estimated 55% of T2DM cases. Environmental toxins such as

bisphenol A may contribute to the recent increase in the cases of T2DM as research suggests that

there is a weak positive correlation between the concentration of urine bisphenol-A (BPA) and

T2DM. The main use of BPA is in the production of plastic and epoxy resins, which are found in

polycarbonate baby feeding bottles and epoxy food-can linings (Lang IA 2008, Dekant 2008).

Genes that have been discovered to be associated with T2DM include TCF7L2, PPARG, FTO,

KCNJ11, NOTCH2, WFS1, CDKAL1, IGF2BP2, SLC30A8, JAZF1, and HHEX (McCarthy

2010). Some of the medical conditions that have been discovered to be risk factors associated with

T2DM include obesity, hypertension, high cholesterol, Reaven's syndrome, acromegaly,

Cushing's syndrome, thyrotoxicosis, pheochromocytoma, chronic pancreatitis, and cancer.

Aging, high-fat diets, and inactivity are also among the risk factors of diabetes (Alberti

2005, Powers 2008, Jack 2004, Lovejoy 2002).

1.1.2.1 Delayed Type 2 Diabetes Diagnosis

About one-third of adults that have high HbA1c values are not clinically diagnosed

with T2DM within 1 year (Gopalan 2018) and most people with T2DM are not diagnosed for 4–

7 years after hyperglycemia first appears (Porta 2014, Harris 1992). According to researchers,

one-fourth of patients who are diagnosed with T2DM already have diabetes-related

microvascular complications at the time of diabetes diagnosis (Koopman 2006, Spijkerman

2003, Harris 1992). A 2010 chart of patients from the Veterans Affairs Medical Center showed

an average delay of 3.7 years between initial Electronic Health Record (EHR) evidence of

hyperglycemia and clinical diagnosis (Fraser 2010). In a 2002 cross-sectional analysis

of 1426 adults with evidence of hyperglycemia in their EHR, only 79% of these people had been

clinically diagnosed with diabetes mellitus (Edelman2002). While no symptoms are seen during

the undiagnosed period, the patient misses the chance to get early intervention (Gopalan2018).

The United Kingdom Prospective Diabetes Study (UKPDS) demonstrated that early intervention

to control blood glucose levels significantly lowered the risk of developing

microvascular complications and myocardial infarctions, a risk reduction

that lasted for decades after diagnosis compared with cases without initial glycemic control

(Holman2008). Some of the factors that contribute to the delayed diagnosis of diabetes include

inadequate access to healthcare and under-screening (Kiefer2015, Casagrande2014, Zhang2008).

Strategies that leverage EHR to support earlier diagnosis of diabetes mellitus could help reduce

diagnostic delays and allow for early treatment (Gopalan2018).

1.1.3 Gestational Diabetes Mellitus

Gestational diabetes mellitus (GDM) is characterized by spontaneous hyperglycemia

during pregnancy (ADA 2018). GDM is seen in pregnant women who have never had diabetes.

While the baby is at a higher risk of having hypoglycemia and other health problems at birth,

gestational diabetes in the mother usually goes away after the child is born (ADA 2010, CDC

2022). Based on research from the American Diabetes Association (ADA), 7% of all pregnancies

are complicated by GDM and approximately 50% of gestational diabetic women proceed to

develop type 2 diabetes mellitus, according to the Centers for Disease Control and Prevention

(CDC). During pregnancy, the body does not make enough insulin, but it makes more hormones

and undergoes some changes such as weight gain. These body changes cause insulin resistance,

which is a condition where the body uses insulin less effectively. While all pregnant women

develop some insulin resistance during the later times of their pregnancies, some could have

insulin resistance before pregnancy and these sets of women become more susceptible to

developing gestational diabetes (CDC 2021, Goyal 2022). Some of the risk factors of gestational

diabetes mellitus include obesity, poor diet and micronutrient deficiencies, advanced maternal

age, and a genetic history of insulin resistance and/or diabetes mellitus. While GDM usually goes

away after childbirth, some of its health consequences include increased risk for type 2 diabetes

and cardiovascular disease in the mother, and future obesity, type 2 diabetes, cardiovascular

disease, and/or gestational diabetes mellitus in the child (Plows 2018).

The long-term complications of diabetes may be life-threatening. Some of these

complications include damage to the filtering system in the kidney resulting in kidney failure,

damage to the blood vessels of the eye leading to blindness, damage to the blood supply of

nerves resulting in foot damage, erectile dysfunction, nausea, vomiting, diarrhea, or constipation.

Early prediction of diabetes can help to reduce the risk of life-threatening complications

and improve the treatment of the disease. In recent studies, machine learning techniques have

been applied in health care for the early prediction of diseases (G. Tripathi et al., 2020).

1.2 Research Questions

1. How might we apply supervised machine learning techniques for the detection of type 2

diabetes mellitus?

2. By what means might we compare the proposed machine learning models for the detection

and diagnosis of type 2 diabetes mellitus?

3. How might we identify the most appropriate machine learning model for the detection of

type 2 diabetes mellitus?

1.3 Motivation

A significant number of people with the potential of developing type 2 diabetes mellitus

are not diagnosed on time (Gopalan 2018). These delays in clinical diagnoses fail to optimize the

benefits of early interventions such as hyperglycemia treatment, lifestyle changes, and

addressing cardiovascular risk factors. The ability to utilize patient data generated from EHRs for

the earlier prediction of diabetes mellitus could help reduce diagnostic delays and allow for early

treatment, thereby improving patient health outcomes.

1.4 Scope

The scope of this work is to build a machine learning model for the prediction of diabetes

mellitus by using a supervised learning approach with risk factors associated with DM.

1.5 Contributions

In light of the preceding, our research contributions, as per addressing the aforementioned

research questions are as follows:

(1) analyzing the various risk factors of diabetes mellitus in the Pima Indian diabetes dataset

and highlighting the risk factors that are statistically significant for its early prediction.

(2) training and evaluating common machine learning algorithms to identify the best

algorithm for the early prediction of diabetes mellitus, thereby improving on the prediction

accuracy of previous research (Alam et al., 2019).

Chapter 2. Literature Review

Scientific literature on the application of machine learning techniques for the prediction

of diabetes mellitus was reviewed. The machine learning algorithms employed in these papers

include supervised learning, such as classification and regression, and association rules.

Association rules were used to study the associations between features/biomarkers. We cited a

few of these articles and research papers from databases such as PubMed, IEEE, and ACM, and

these studies were conducted prior to the COVID-19 pandemic.

The most common classification algorithms used for the prediction of diabetes mellitus

are support vector machine (SVM), artificial neural network (ANN), and decision tree (DT)

(Ioannis et al., 2017). The machine learning algorithm with the best performance in biological and clinical

datasets for diabetes research is SVM (Ioannis et al., 2017). The accuracy of an algorithm is

dependent on the characteristics of the data (this includes type of dataset– clinical or genetic,

dimensionality, and low number of instances compared to the number of features) used for

machine learning tasks (Ioannis et al., 2017); hence the importance of data preprocessing

techniques such as feature selection. Then, the processed data is used to train various machine

learning algorithms and the best model for that dataset is identified (Ioannis et al., 2017).

Iterative sure independence screening, a statistical variable selection method,

was applied to genomic data obtained from metagenome sequencing. Logistic regression (LR),

linear discriminant analysis (LDA), support vector machine (SVM), and artificial neural network

(ANN) were used to predict T2DM with 10-fold cross-validation method. SVM performed better

with a 0.97/0.99 accuracy in AUC (area under the curve) (Cai et al., 2015).

Logistic regression (LR), support vector machine (SVM) and artificial neural network

(ANN) were used to detect fasting blood glucose levels (FBGL) in an Indian population made up

of healthy and unhealthy people with 3-fold cross-validation. While 70% of the data was used as

the training set, the remaining 30% was used as the test set. Among the models used for this

study, SVM using an RBF kernel performed best for classifying high FBGLs, with approximately 85%

accuracy, 84% precision, 85% sensitivity, and 85% F1 score (Malik et al., 2016).

Logistic regression, k-nearest neighbors (k-NN), multifactor dimensionality reduction

(MDR), and support vector machines were used to build a classification model for T2DM on a

Kuwait population. A 5-fold cross-validation method was used to validate each of the

algorithms. The model with the best accuracy was the SVM, with an accuracy score of 81.3 (Bassam et

al., 2013).

Gaussian Naïve Bayes (NB), logistic Regression, K-nearest neighbor (k-NN), CART,

random forests (RF), and support vector machine algorithms were used to forecast the risk of

T2DM from electronic medical records (EMR) with 5-fold cross-validation. The random forest

algorithm performed best with an AUC score above 0.80 (Mani et al., 2012).

Logistic regression, linear discriminant analysis, artificial neural networks, support vector

machines, fuzzy c-means, and random forests (RF) were used to classify diabetic and non-diabetic

persons in an Iranian population. 10-fold cross-validation was used to validate the

algorithms and SVM showed the best results with 0.986 accuracy and 0.979 AUC (Tapak et al.,

2013).

Artificial neural network, random forest, and k-means clustering were used for the early prediction of

diabetes mellitus on the Pima Indians Diabetes dataset. Feature selection was done using the

principal component analysis (PCA) method. The association rule algorithm, Apriori, was used

to discover a strong association of diabetes with BMI and glucose level. Artificial neural

networks performed best with 75.7% accuracy (Alam et al., 2019).

Chapter 3. Research Methodology

3.1 Data Collection

We searched through various online repositories to find a dataset that has been used for a

similar study. We downloaded the Pima Indians Diabetes Dataset from Kaggle

(https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database), a public data platform

owned by Google LLC, which allows researchers to find published datasets, build machine

learning models in a web-based environment, collaborate with other professionals, and compete

in data science challenges. According to Chang et al., the Pima Indian Diabetes dataset is the

benchmark for diabetes classification research and is available through a CC0: Public Domain

License and the patient subjects are anonymous.

The dataset is made up of Pima Indian female patients of at least 21 years of age. It

consists of 768 instances and 9 variables: one target variable (outcome) and 8 predictors, which

include pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree

function, and age. There are 34.90% (268 instances) diabetic patients and 65.10% (500

instances) non-diabetic patients in the dataset. The imbalance between these classes is substantial and

could lead to lower accuracy for the minority (diabetic) class. Figure 2 below is a

pie chart showing the class imbalance in the dataset.
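
As a quick check of the class proportions quoted above, the following minimal sketch recomputes them from the reported counts. The variable names are illustrative and are not taken from the thesis code.

```python
# Class counts as reported in the text: 268 diabetic and 500 non-diabetic patients.
counts = {"diabetic": 268, "non_diabetic": 500}

total = sum(counts.values())

# Percentage share of each class, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Number of non-diabetic instances per diabetic instance.
imbalance_ratio = counts["non_diabetic"] / counts["diabetic"]

print(total, shares, round(imbalance_ratio, 2))
```

The ratio of roughly two majority instances per minority instance is what motivates the class weighting discussed in Section 3.4.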

Figure 2. Division of the Pima Indian Dataset

Feature                      Description

Pregnancies                  Number of times pregnant

Glucose                      Plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT)

Blood Pressure               Diastolic blood pressure (mm Hg)

Skin Thickness               Triceps skin fold thickness (mm)

Insulin                      2-hour serum insulin (µU/ml)

BMI                          Body mass index (weight in kg / (height in m)^2)

Diabetes Pedigree Function   The likelihood of diabetes based on family history

Age                          Age of the patient (years)

Outcome                      Binary value that indicates whether a patient is diabetic (1) or non-diabetic (0)

Table 1. Description of Pima Indian Diabetes Dataset

While the data type of the target variable in the dataset is a factor, all predictors are of

numeric data type. Figure 3 describes the statistical summary of the dataset. In the figure, the

minimum value of features such as glucose, blood pressure, skin thickness, insulin, and BMI is

zero (0), which is inaccurate based on domain knowledge (Zia 2017, Chang 2022).

Figure 3. Statistical Description of the Pima Indians Diabetes Dataset

We analyzed the relationship between the risk factors of diabetes in the dataset using a

correlation matrix and a heatmap. Age and pregnancies have the highest correlation among the

features in that dataset, while skin thickness and age have the least correlation. Figure 4

visualizes the correlation of the features using a seaborn heatmap.
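
The correlation analysis described above can be sketched with pandas as follows. This is an illustration on synthetic data, not the thesis code: the toy columns and the way they are generated are assumptions chosen so that age and pregnancies are related while skin thickness is independent.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three of the dataset's columns (assumed names).
rng = np.random.default_rng(1)
age = rng.integers(21, 70, size=300)
toy = pd.DataFrame({
    "Age": age,
    # Pregnancies loosely increase with age in this toy example.
    "Pregnancies": (age - 21) // 6 + rng.integers(0, 3, size=300),
    # Skin thickness is drawn independently of age.
    "SkinThickness": rng.normal(29, 8, size=300),
})

# Pearson correlation matrix between all pairs of columns.
corr = toy.corr(method="pearson")

# Mask the diagonal (self-correlation) and find the strongest pair.
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
strongest = off_diag.abs().stack().idxmax()

print(corr.round(2))
print("Most correlated pair:", strongest)
```

The resulting matrix is what a plotting call such as a seaborn heatmap would take as input.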

Figure 4. Correlation Matrix Heatmap of the Dataset

3.2 Data Preprocessing

The inaccurate zero values in the glucose, blood pressure, skin thickness, insulin, and

BMI attributes were replaced with the median value of each feature. Figure 5 shows the

statistical description of the Pima Indians diabetes dataset after the zero values in the

glucose, blood pressure, skin thickness, insulin, and BMI attributes were replaced with their

median values.
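
A minimal sketch of this replacement step is shown below. It is not the thesis code. The column names follow the public Kaggle CSV, and computing the median over the non-zero entries only is an assumption, since the text does not say whether the zeros were included in the median.

```python
import pandas as pd

# Columns in which a zero is physiologically implausible (assumed names).
ZERO_INVALID = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def replace_zeros_with_median(df, columns=ZERO_INVALID):
    """Return a copy of df where zeros in the given columns become the column median."""
    out = df.copy()
    for col in columns:
        values = out[col].astype(float)
        # Assumption: the median is taken over the non-zero entries only.
        median = values[values != 0].median()
        out[col] = values.mask(values == 0, median)
    return out

# Tiny illustrative frame; the values are made up for the example.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})
cleaned = replace_zeros_with_median(toy)
print(cleaned)
```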

Figure 5. Statistical Description of the Dataset without Inaccurate Values

The dataset used in this research contained no missing or null values, but outliers

were detected in the predictors. These predictors include insulin, pregnancies, glucose, blood

pressure, skin thickness, body mass index (BMI), diabetes pedigree function, and age. Figure 6

shows the presence of outliers in the predictors using boxplots.

Figure 6. Boxplots showing Outliers in the Dataset

The dataset is not very large, so rather than removing the outliers, we defined a

function that replaced the outliers with the median value of that feature. Additionally, the data

was standardized to rescale the values of the distribution so that mean is 0 and standard deviation

is 1. Standardization is also less affected by outliers than min-max scaling (Géron 2019). Figure 7 visualizes the

absence of outliers using boxplots.
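
The outlier replacement and standardization steps can be sketched as follows. The text does not state how outliers were defined, so the 1.5 × IQR boxplot rule used here is an assumption, as are the function name and the synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def replace_outliers_with_median(series, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the series median.

    The k = 1.5 boxplot rule is an assumed outlier definition.
    """
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series.mask(is_outlier, series.median())

# Synthetic columns with two injected extreme values each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Insulin": np.append(rng.normal(120, 15, 50), [600.0, 846.0]),
    "Age": np.append(rng.normal(33, 5, 50), [81.0, 95.0]),
})

# Apply the replacement column by column, then standardize.
capped = toy.apply(replace_outliers_with_median)
scaled = StandardScaler().fit_transform(capped)

print(capped.describe().round(2))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Replacing rather than dropping keeps all rows, which matters for a dataset of only 768 instances.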

Figure 7. Boxplots showing the absence of outliers in the dataset

3.3 Feature Selection

The number of features in the dataset is not large. So, in the first round of this study, we

performed our machine learning experiment using all eight predictors: pregnancies, glucose,

blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age.

In the second round of this study, we used the SelectKBest feature selection method,

provided by Scikit-learn, to extract the best features in the dataset. SelectKBest removes all

but the k highest-scoring features (Sklearn.feature_selection.SelectKBest — Scikit-Learn 1.1.3

Documentation, 2011). The classification scoring function used is the “f_classif” function, which

returns the ANOVA F-value between each feature and the label for classification tasks

(Sklearn.feature_selection.f_classif — Scikit-Learn 1.1.3 Documentation, 2011). The F-value and

p-value of each of the predictors were used to identify the best risk factors for the prediction of

diabetes mellitus.
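
The feature scoring described above can be illustrated with a short sketch using `SelectKBest` and `f_classif` from Scikit-learn. The synthetic features and the choice of `k=1` are assumptions made for the example; they are not the values used in the thesis.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature depends on the label, two are pure noise.
rng = np.random.default_rng(42)
n = 200
y = rng.integers(0, 2, size=n)
informative = 2.0 * y + rng.normal(0, 1, n)
noise_a = rng.normal(0, 1, n)
noise_b = rng.normal(0, 1, n)
X = np.column_stack([noise_a, informative, noise_b])
feature_names = ["noise_a", "informative", "noise_b"]

# Score every feature with the ANOVA F-test and keep the top k.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

for name, f_val, p_val in zip(feature_names, selector.scores_, selector.pvalues_):
    print(f"{name}: F={f_val:.2f}, p={p_val:.4f}")

selected = [feature_names[i] for i in selector.get_support(indices=True)]
print("Selected:", selected)
```

A large F-value with a small p-value indicates that the class means of that feature differ more than chance would explain.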

3.4 Class Imbalance

The Pima Indians Diabetes dataset contains 768 instances: 268 diabetic patients and 500
non-diabetic patients. This unequal distribution of the classes could bias classification
toward the non-diabetic class. We addressed the class imbalance in this dataset by setting the
class_weight parameter of the scikit-learn classifiers (logistic regression, support vector
machine, random forest) to 'balanced' (Sklearn.utils.class_weight.compute_class_weight
— Scikit-Learn 1.1.3 Documentation, 2011). Class weights adjust how an estimator calculates
its loss according to the importance (weight) given to each class (Mozhvilo 2021).
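
To make the effect of balanced weighting concrete, the sketch below reproduces the heuristic documented by scikit-learn for this setting, n_samples / (n_classes * class_count), using the class counts reported above. The function name and the class labels are hypothetical and serve only the illustration.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class counts of the Pima Indians Diabetes dataset as reported in the text.
weights = balanced_class_weights({"non_diabetic": 500, "diabetic": 268})
```

With these weights the minority (diabetic) class receives a weight of about 1.43 and the majority class 0.768, so both classes contribute the same total weight to the loss.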

3.5 Machine Learning Algorithms

Because our task is a classification problem, we selected four supervised machine learning
algorithms. We first trained a basic logistic regression model and then trained more complex
models: support vector machine (SVM), random forest, and decision tree.
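
The classifiers named above could be instantiated in scikit-learn roughly as follows. This is a sketch under stated assumptions: the hyperparameters (max_iter, random_state, criterion) and the helper name build_models are not taken from the study, and scikit-learn's DecisionTreeClassifier implements an optimized version of CART, so the entropy criterion is only an approximation of an information-gain-based tree.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(random_state=42):
    """Return the four classifiers keyed by the abbreviations used in the results."""
    return {
        # Balanced class weights are applied to LR, SVM, and RF, as in Section 3.4.
        "LR": LogisticRegression(class_weight="balanced", max_iter=1000),
        "SVM": SVC(class_weight="balanced", random_state=random_state),
        # Assumption: entropy criterion as a stand-in for information gain.
        "DT": DecisionTreeClassifier(criterion="entropy", random_state=random_state),
        "RF": RandomForestClassifier(
            class_weight="balanced", random_state=random_state
        ),
    }


models = build_models()
```

Keeping the models in one dictionary makes it straightforward to evaluate them all with the same cross-validation splits.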

3.5.1 Logistic Regression

Logistic regression models the class probability with a sigmoid curve. It is suitable for
binary classification but can be used for multiclass classification by using the one-vs-rest
scheme (Kusumaningrum 2020). The sigmoid function used in logistic regression is:

$$f(x) = \frac{1}{1 + e^{-x}}$$

where,

f(x): sigmoid function of x

e: Euler's number (approximately 2.7183)

x: input value
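
As a brief illustration, the sigmoid function can be written in a few lines. The two-branch form and the 0.5 decision threshold are assumptions made for this sketch; the branch only avoids numerical overflow for large negative inputs and does not change the function.

```python
import math


def sigmoid(x):
    """Logistic function f(x) = 1 / (1 + e^(-x)), computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) is at most 1
    return z / (1.0 + z)


def predict_label(x, threshold=0.5):
    """Map a linear score to a binary label (1 = positive class)."""
    return int(sigmoid(x) >= threshold)
```

The output lies between 0 and 1, which is what allows it to be read as the probability of the positive class.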

3.5.2 Support Vector Machine

SVM is one of the most popular machine learning techniques; a fast training procedure for
it, sequential minimal optimization, was proposed by Platt (1999). A support vector machine is
a discriminative classifier formally defined by a separating hyperplane. SVM separates
instances into the specified classes, and it can also classify new instances that were not
part of the training data. SVM makes no assumption about the distribution of the data in each
class. One extension of the algorithm performs regression analysis to produce a linear
function, and another learns to rank elements to produce a classification for individual
elements.

3.5.3 Decision Tree

A decision tree is a powerful classification technique for predicting diabetes mellitus.
The target variable takes a limited set of discrete values, and each of these values is called
a class. Each internal node of a decision tree is labeled with an input feature, and each leaf
node is labeled with a class, that is, a target value. At each node of the tree, the
information gain of every attribute is calculated, and the attribute with the highest
information gain is selected.

Several popular decision tree algorithms are available for classifying diabetic data,
including ID3, J48, C4.5, C5, CHAID, and CART. In our research, the C4.5 decision tree
algorithm was chosen to analyze performance on the diabetic data. C4.5, proposed by Ross
Quinlan (1993), extends the ID3 decision tree algorithm. C4.5 learns a tree from training data
in the same way as ID3, and the learned tree can be used on medical data to predict the value
of the decision attribute. At each branch node of the tree, C4.5 selects the attribute that
most effectively splits the data into subsets enriched in one class. The splitting criterion
is the normalized information gain: the attribute with the highest normalized information gain
is chosen to make the decision.
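
The normalized information gain (gain ratio) can be sketched as below. This is an illustrative implementation for discrete feature values only; handling of continuous predictors through thresholds, as C4.5 does, is omitted, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(feature_values, labels):
    """Information gain of a split on the feature, divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    information_gain = entropy(labels) - remainder
    split_information = entropy(feature_values)
    return information_gain / split_information if split_information > 0 else 0.0


labels = [1, 1, 0, 0]
perfect = gain_ratio(["high", "high", "low", "low"], labels)  # separates the classes
useless = gain_ratio(["a", "b", "a", "b"], labels)            # carries no information
```

Dividing by the split information penalizes attributes with many distinct values, which is the main difference from the plain information gain used by ID3.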

3.5.4 Random Forest

Random forests are made up of tree predictors such that each tree depends on the values
of a random vector sampled independently and with the same distribution for all trees in the
forest. The generalization error for forests converges to a limit as the number of trees in
the forest becomes large. The generalization error of a forest of tree classifiers depends on
the strength of the individual trees in the forest and the correlation between them
(Breiman 2001). RF follows specific rules for tree growing, tree combination, self-testing,
and post-processing. It is robust to overfitting and is considered more stable in the presence
of outliers and in very high-dimensional parameter spaces than other machine learning
algorithms (Caruana and Niculescu-Mizil, 2006; Menze et al., 2009). The concept of variable
importance is an implicit feature selection performed by RF with a random subspace
methodology, and it is assessed by the Gini impurity criterion index (Ceriani and Verme,
2012). The Gini index is a measure of the prediction power of variables in regression or
classification, based on the principle of impurity reduction (Strobl et al., 2007); it is
non-parametric and therefore does not rely on data belonging to a particular type of
distribution. For a binary split (white circles in Figure 1), the Gini index of a node n is
calculated as follows:

$$G(n) = 1 - \sum_{j=1}^{2} p_j^2$$

where p_j is the relative frequency of class j in the node n.

For splitting a binary node in the best way, the improvement in the Gini index should be

maximized. In other words, a low Gini (i.e., a greater decrease in Gini) means that a particular

predictor feature plays a greater role in partitioning the data into the two classes. Thus, the Gini

index can be used to rank the importance of features for a classification problem (Sarica 2017).
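
A small sketch of the Gini index and of the decrease in Gini produced by a split is given below. The function names and the example counts are hypothetical; the weighting of the child nodes by their share of the samples is the usual convention and is stated here as an assumption.

```python
def gini_index(class_counts):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def gini_decrease(parent, left, right):
    """Decrease in Gini when a parent node is split into two child nodes.

    Each argument is a list of per-class sample counts. The children are
    weighted by the fraction of the parent's samples they receive.
    """
    n = sum(parent)
    weighted_children = (
        sum(left) / n * gini_index(left) + sum(right) / n * gini_index(right)
    )
    return gini_index(parent) - weighted_children


mixed_node = gini_index([5, 5])    # evenly mixed binary node
pure_node = gini_index([10, 0])    # node containing a single class
best_split = gini_decrease([5, 5], [5, 0], [0, 5])  # split that separates the classes
```

A split that sends each class to its own child removes all impurity, so its decrease equals the Gini index of the parent node.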

3.6 Performance Evaluation

We used a built-in class in the scikit-learn library, ShuffleSplit, to shuffle the dataset
and split it into k parts for cross-validation, where k is the number of parts into which the
data are divided. k = 10 is the most popular value used to evaluate machine learning models;
other common values include k = 2 and k = 5
(https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/). Each machine
learning model is trained on k − 1 parts of the dataset and evaluated on the held-out part,
and this procedure is repeated k times. The best performing model is selected through
$K(2^p - 1)$ model evaluations, where p is the number of independent variables and $2^p - 1$
is the total number of possible models (Jung and Hu 2015).
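
The evaluation procedure can be sketched with scikit-learn as shown below. The synthetic data, the test_size of 0.2, and the choice of logistic regression are assumptions made for the sketch and are not taken from the study. Note that ShuffleSplit draws k independent random train/test splits, whereas KFold with shuffle=True produces k disjoint folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in with eight predictors, like the Pima dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Ten shuffled splits with a fixed random state so every model sees the same splits.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scoring = ["accuracy", "precision", "recall", "roc_auc"]

results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring
)
mean_scores = {metric: results[f"test_{metric}"].mean() for metric in scoring}

# Number of model evaluations for an exhaustive search over feature subsets:
# K * (2**p - 1) with K = 10 splits and p = 8 predictors.
n_evaluations = 10 * (2**8 - 1)
```

For K = 10 and p = 8 the exhaustive search would require 2550 model evaluations, which is still feasible for a dataset of this size.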

3.6.1 Evaluation Metrics

The metrics used to measure model performance are accuracy, precision, recall, and ROC AUC
(roc_auc). We also compared performance using all predictors with performance using the best
five predictors (Larabi-Marie-Sainte et al. 2019).

Accuracy is the proportion of all samples that are predicted correctly. It is the ratio of
the sum of true positives and true negatives to the total number of predictions made.

Precision is the proportion of samples predicted as positive that are actually positive. It
is the ratio of true positives to the sum of true positives and false positives.
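
These definitions can be checked with a short calculation from confusion-matrix counts. The counts below are hypothetical, and recall is included using its standard definition (true positives divided by the sum of true positives and false negatives).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: 30 true positives, 10 false positives,
# 90 true negatives, 20 false negatives.
metrics = classification_metrics(tp=30, fp=10, tn=90, fn=20)
```

In this example the accuracy is 0.8 while the recall is only 0.6, which shows why accuracy alone can hide missed positive cases on an imbalanced dataset.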

Chapter 4. Results

4.1 Feature Selection

To identify the important features in the Pima Indians Diabetes dataset for model training
in the second round of this study, we calculated the ANOVA F-values and p-values of the
predictors. Table 2 shows glucose as the best predictor of diabetes mellitus, followed by age,
BMI, number of pregnancies, blood pressure, skin thickness, diabetes pedigree function, and
lastly insulin. Using a significance threshold of 0.05, two predictors, glucose and BMI, are
statistically significant for the prediction of diabetes mellitus. This finding is in line
with research by Alam et al. (2019), in which the Apriori association rule algorithm was used
to discover a strong association of diabetes with BMI and glucose level.

Features                     F-Score   P-Value
Pregnancies                   32.751   3.071e-01
Glucose                      245.668   3.398e-16
Blood Pressure                21.820   9.817e-01
Skin Thickness                18.872   5.880e-01
Insulin                        2.000   1.762e-01
BMI                           68.570   4.977e-04
Diabetes Pedigree Function    14.502   1.109e-01
Age                           79.385   5.086e-02

Table 2. ANOVA F-values and the p-values of Individual Predictors

4.2 Performance of Machine Learning Algorithms

In this study, four machine learning algorithms were used to analyze the Pima Indians
Diabetes dataset: logistic regression, support vector machine, decision tree, and random
forest. The dataset was partitioned using the k-fold cross-validation method with k = 10, and
the random state was held constant for all four algorithms.

Four metrics were used to measure the performance of the logistic regression, support
vector machine, decision tree, and random forest algorithms. The support vector machine
achieved the highest accuracy (0.7727) and precision (0.7561) and was taken as the best
algorithm on the dataset used for this study; logistic regression had the highest ROC AUC
(0.8400), and the decision tree had the highest recall (0.5774). Table 3 shows the scores of
each machine learning algorithm.

Algorithms   Accuracy   Precision   Recall   Roc_auc
LR           0.7708     0.7258      0.5467   0.8400
SVM          0.7727     0.7561      0.5138   0.8247
DT           0.7143     0.5790      0.5774   0.6711
RF           0.7584     0.6577      0.5745   0.8275

Table 3. Performance of Machine Learning Algorithms

We also compared the best model developed in this experiment, the support vector machine,
with the best model from a previous study by Alam et al. (2019). Table 4 shows that an
artificial neural network (ANN) was the best performing model developed by those researchers
on the same dataset. Our proposed model achieved a higher accuracy score than the ANN model
(77.3% versus 75.7%).

Publication      Dataset                        Compared Algorithms           Best   Accuracy
Proposed model   Pima Indian Diabetes dataset   LR, SVM, DT, RF               SVM    ACC = 77.3%
Alam et al.      Pima Indian Diabetes dataset   ANN, RF, k-means clustering   ANN    ACC = 75.7%

Table 4. Comparison with Different Machine Learning Research

Chapter 5. Conclusion

To apply supervised machine learning techniques to the detection of type 2 diabetes
mellitus, the Pima Indians Diabetes dataset was identified as the benchmark dataset used for
diabetes research. Based on domain knowledge, inaccurate data in some of the features were
replaced with the median value of the respective feature. The outliers in the dataset were
likewise identified and replaced with the median value of the respective feature. The ANOVA
F-value and p-value of the individual predictors were calculated. The glucose and BMI features
both had large F-values and p-values less than 0.05, indicating that a patient's glucose level
and body mass index are strong predictors of diabetes mellitus. This is in line with research
by Alam et al. (2019).

Due to the limited size of the dataset, all eight of its predictors were used to develop a
machine learning model in the first part of the project. The machine learning algorithms most
commonly used in the prediction of DM, such as SVM and DT, were used to develop a model to
classify diabetic and non-diabetic patients. In addition, logistic regression and random
forest algorithms were also used.

To compare the performance of the machine learning models, we used evaluation metrics
such as accuracy, precision, recall, and roc_auc. The support vector machine was observed to
be the best model on the dataset. Furthermore, we compared the performance of the best model
developed in this study with the model developed in a previous study on the same dataset
(Alam et al., 2019) and observed that our model performed better.

The results of our study indicate that the support vector machine is the most appropriate
model for the detection of type 2 diabetes mellitus.

Chapter 6. Limitations

The limitations of this study include the size of the dataset and nature of the instances.

The Pima Indians diabetes dataset consists of 768 instances, 8 predictors and 1 target feature.

This is not large and may have resulted in poor approximation of the model performance. Also,

the instances in the dataset consist of female patients of at least 21 years of age and are not

representative of the real-world population of diabetic patients. The comparative analysis of the

model developed in this study and models developed in previous research was limited to one

study. Furthermore, outliers in this dataset were not used to train the machine learning algorithms

in this study.

Chapter 7. Future Work

Future work can extend this study to predict diabetes using advanced machine learning
models, such as ensemble learning and deep learning, on a larger and more diverse dataset. A
prediabetes dataset could be used for the early prediction of diabetes mellitus. The
performance of the machine learning models can be compared with more models developed in
previous research. Also, a comparative analysis of model performance can be done with and
without the outliers in the dataset. The aim would be to increase the accuracy of diabetes
mellitus prediction and to discover whether there is a significant difference in the
performance of models trained with and without outliers in the dataset.

Bibliography

Accili, D. (2018). Insulin Action Research and the Future of Diabetes Treatment: The 2017
Banting Medal for Scientific Achievement Lecture. Diabetes, 67, 1701–1709. doi:
10.2337/dbi18-0025
Alarcon C., Lincoln B., Rhodes C.J. (1993). The biosynthesis of the subtilisin-related proprotein
convertase PC3, but not that of the PC2 convertase, is regulated by glucose in parallel to
proinsulin biosynthesis in rat pancreatic islets. J. Biol. Chem. 268:4276–4280. doi:
10.1016/S0021-9258(18)53606-1.
Alberti KG, Zimmet P, Shaw J, IDF Epidemiology Task Force Consensus Group (2005). The
metabolic syndrome–a new worldwide definition. Lancet. Sep;366(9491):1059-1062
10.1016/S0140-6736(05)67402-8
American Diabetes Association. (2010). Diagnosis and classification of diabetes mellitus.
Diabetes Care. Jan;33 Suppl 1(Suppl 1): S62-9. doi: 10.2337/dc10-S062. Erratum in:
Diabetes Care. 2010 Apr;33(4):e57. PMID: 20042775; PMCID: PMC2797383.
American Diabetes Association. (2018). Classification and Diagnosis of Diabetes: Standards of
Medical Care in Diabetes. Diabetes Care. 2018;41:S13–S27. doi: 10.2337/dc18-S002.
Bassam, F., Channanath, A. M., Kazem, B., & Thangavel, A. (2013). Predictive models to assess
risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and
validation using national health data from Kuwait—a cohort study. BMJ Open, 3.
10.1136/bmjopen-2012-002457
Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32.
https://doi.org/10.1023/A:1010933404324
Cai L, Wu H, Li D, Zhou K, Zou F. (2015). Type 2 Diabetes Biomarkers of Human Gut
Microbiota Selected via Iterative Sure Independent Screening Method. PLOS ONE
10(10): e0140827. https://doi.org/10.1371/journal.pone.0140827
Casagrande SS, Cowie CC, Genuth SM. (2014). Self-reported prevalence of diabetes screening
in the U.S., 2005–2010. Am J Prev Med ; 47: 780–787.
Centers for Disease Control and Prevention. (2007). National Diabetes Fact Sheet: General
Information and National Estimates on Diabetes in the United States. Atlanta, GA. US
Department of Health and Human Services, Centers for Disease Control and Prevention.
Centers for Disease Control and Prevention. What is Type 1 Diabetes? March 2022.
https://www.cdc.gov/diabetes/basics/what-is-type-1-diabetes.html
Chang V, Bailey J, Xu QA, Sun Z. (2022). Pima Indians diabetes mellitus classification based on
machine learning (ML) algorithms. Neural Comput Appl. Mar 24:1-17. doi:
10.1007/s00521-022-07049-z. Epub ahead of print. PMID: 35345556; PMCID:
PMC8943493.

Dekant, Wolfgang, and Wolfgang Völkel. (2008). "Human exposure to bisphenol A by
biomonitoring: methods, results and assessment of environmental exposures." Toxicology
and applied pharmacology 228.1 114-134.
Edelman D. (2002) Outpatient diagnostic errors: unrecognized hyperglycemia. Eff Clin Pract 5:
11–16
Faizan Zafar, Saad Raza, Muhammad Umair Khalid, and Muhammad Ali Tahir. (2019).
Predictive Analytics in Healthcare for Diabetes Prediction. In Proceedings of the 2019
9th International Conference on Biomedical Engineering and Technology (ICBET' 19).
Association for Computing Machinery, New York, NY, USA, 253–259.
DOI: https://doi.org/10.1145/3326172.3326213
Fraser LA, Twombly J, Zhu M, Long Q, Hanfelt JJ, Narayan KM et al. (2010). Delay in
diagnosis of diabetes is not the patient’s fault. Diabetes Care 33: e10.
Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly.
G. Tripathi and R. Kumar. (2020) "Early Prediction of Diabetes Mellitus Using Machine
Learning," 2020 8th International Conference on Reliability, Infocom Technologies and
Optimization (Trends and Future Directions) (ICRITO), pp. 1009-1014,
doi:10.1109/ICRITO48877.2020.9197832.
Gopalan A, Mishra P, Alexeeff SE, Blatchins MA, Kim E, Man AH, Grant RW. (2018)
Prevalence and predictors of delayed clinical diagnosis of Type 2 diabetes: a longitudinal
cohort study. Diabet Med. Dec;35(12):1655-1662. doi: 10.1111/dme.13808. Epub 2018
Sep 21. PMID: 30175870; PMCID: PMC6481650.
Goyal R, Jialal I. Diabetes Mellitus Type 2. (Updated 2022 Jun 19). In: StatPearls [Internet].
Treasure Island (FL): StatPearls Publishing; 2022 Jan-. Available from:
https://www.ncbi.nlm.nih.gov/books/NBK513253/
Han Wu, Shengqi Yang, Zhangqin Huang, Jian He, Xiaoyi Wang. (2018). Type 2 diabetes
mellitus prediction model based on data mining, Informatics in Medicine Unlocked,
Volume 10, Pages 100-107, ISSN 2352-9148, https://doi.org/10.1016/j.imu.2017.12.006
Harris MI, Klein R, Welborn TA, Knuiman MW. (1992). Onset of NIDDM occurs at least 4–7 yr
before clinical diagnosis. Diabetes Care 15: 815–819.
Holman RR, Paul SK, Bethel MA, Matthews DR, Neil HA. (2008). 10-year follow-up of
intensive glucose control in type 2 diabetes. N Engl J Med 359: 1577–1589.
Hu FB, Manson JE, Stampfer MJ, Colditz G, Liu S, Solomon CG, et al. (2001). Diet, lifestyle,
and the risk of type 2 diabetes mellitus in women. N Engl J Med. Sep;345(11):790-797.
doi: 10.1056/NEJMoa010492
International Diabetes Federation. (2013). IDF Diabetes Atlas. 6th ed. Brussels, Belgium:
International Diabetes Federation.

Ioannis, K., Olga, T., Athanasios, S., Nicos, M., Ioannis, V., & Ioanna, C. (2017). Machine
Learning and Data Mining Methods in Diabetes Research. Computational and Structural
Biotechnology Journal, 15, 104-116. https://doi.org/10.1016/j.csbj.2016.12.005
Jack L, Jr, Boseman L, Vinicor F. (2004). Aging Americans and diabetes. A public health and
clinical response. Geriatrics Apr;59(4):14-17
John C. Platt. (1999). "12 fast training of support vector machines using sequential minimal
optimization." in Advances in kernel methods, pp. 185-208.
Jung Y, Hu J. (2015). A K-fold Averaging Cross-validation Procedure. J Nonparametr Stat.
27(2):167-179. doi: 10.1080/10485252.2015.1010532. Epub 2015 Feb 26. PMID:
27630515; PMCID: PMC5019184.
K. A. Hasan and M. Al Mehedi Hasan. (2020). "Classification of parkinson’s disease by
analyzing multiple vocal features sets", IEEE Region 10 Symposium (TENSYMP), pp.
758-761, 2020
Kharroubi AT, Darwish HM. (2015). Diabetes mellitus: The epidemic of the century. World J
Diabetes. Jun 25;6(6):850-67. doi: 10.4239/wjd.v6.i6.850. PMID: 26131326; PMCID:
PMC4478580.
Kiefer MM, Silverman JB, Young BA, Nelson KM. (2020). National patterns in diabetes
screening: data from the National Health and Nutrition Examination Survey (NHANES)
2005–2012. J Gen Intern Med 30: 612–618
Kusumaningrum R, Indihatmoko TA, Juwita SR, Hanifah AF, Khadijah K, Surarso B. (2020).
Benchmarking of Multi-Class Algorithms for Classifying Documents Related to Stunting.
Applied Sciences. 10(23):8621. https://doi.org/10.3390/app10238621
L. Tapak, H. Mahjub, O. Hamidi, J. Poorolajal. (Sep 2013), Real-data comparison of data mining
methods in prediction of diabetes in Iran Healthc Inform Res, 19 (3) pp. 177-185,
10.4258/hir.2013.19.3.177. [Epub 2013 Sep 30]
Lang IA, Galloway TS, Scarlett A, Henley WE, Depledge M, Wallace RB, et al. Association of
urinary bisphenol A concentration with medical disorders and laboratory abnormalities in
adults. JAMA 2008. Sep;300(11):1303-1310 10.1001/jama.300.11.1303
Larabi-Marie-Sainte S, Aburahmah L, Almohaini R, Saba T. Current techniques for diabetes
prediction: review and case study. Appl Sci. 2019;9(21):4604. doi: 10.3390/app9214604.
Lovejoy JC. The influence of dietary fat on insulin resistance. Curr Diab Rep 2002.
Oct;2(5):435-440 10.1007/s11892-002-0098-y
Lucier J, Weinstock RS. Diabetes Mellitus Type 1. (Updated 2022 May 11). In: StatPearls
[Internet]. Treasure Island (FL): StatPearls Publishing; 2022 Jan-. Available from:
https://www.ncbi.nlm.nih.gov/books/NBK507713/
Maitra A, Abbas AK. (2005). Endocrine system. In: Kumar V, Fausto N, Abbas AK (eds).
Robbins and Cotran Pathologic basis of disease (7th ed). Philadelphia, Saunders; 1156-
1226.

Malik, S., Khadgawat, R., Anand, S. et al. (2016). Non-invasive detection of fasting blood
glucose level via electrochemical measurement of saliva. SpringerPlus 5, 701.
https://doi.org/10.1186/s40064-016-2339-6
McCarthy MI (2010). Genomics, type 2 diabetes, and obesity. N Engl J Med. Dec;363(24):2339-2350.
doi: 10.1056/NEJMra0906948
Mozhvilo, E. (2021, January 28). Why Weight? The Importance of Training on Balanced
Datasets. Towards Data Science. Retrieved November 11, 2022, from
https://towardsdatascience.com/why-weight-the-importance-of-training-on-balanced-datasets-f1e54688e7df
Olokoba AB, Obateru OA, Olokoba LB. (2012). Type 2 diabetes mellitus: a review of current
trends. Oman Med J. Jul;27(4):269-73. doi: 10.5001/omj.2012.68. PMID: 23071876;
PMCID: PMC3464757.
Pima Indians Diabetes Database. (updated 2016). Kaggle. Retrieved November 11, 2022, from
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download
Plows JF, Stanley JL, Baker PN, Reynolds CM, Vickers MH. (2018). The Pathophysiology of
Gestational Diabetes Mellitus. Int J Mol Sci. Oct 26;19(11):3342.
doi:10.3390/ijms19113342. PMID: 30373146; PMCID: PMC6274679.
Porta M, Curletto G, Cipullo D, Rigault de la Longrais R, Trento M, Passera P et al. (2014).
Estimating the delay between onset and diagnosis of type 2 diabetes from the time course
of retinopathy prevalence. Diabetes Care 37: 1668–1674.
Powers AC. Diabetes mellitus. In: Fauci AS, Braunwald E, Kasper DL, Hauser SL, Longo DL,
Jameson JL, Loscalzo J (eds). (2008). Harrison’s Principles of Internal Medicine.17th ed,
New York, McGraw-Hill 2275-2304
Prevalence of overweight and obesity among adults with diagnosed Diabetes United States,
1988-1994 and 1999-2000. Centers for Disease Control and Prevention (CDC) (2004).
MMWR. Morbidity and Mortality Weekly Report; 53(45): 1066-1068
Rahman MS, Hossain KS, Das S, Kundu S, Adegoke EO, Rahman MA, Hannan MA, Uddin MJ,
Pang MG. (2021). Role of Insulin in Health and Disease: An Update. Int J Mol Sci. Jun
15;22(12):6403. doi: 10.3390/ijms22126403. PMID: 34203830; PMCID: PMC8232639.
Ross Quinlan, (1993). C4. 5: Programs for Machine Learning, San Mateo, CA:Morgan
Kaufmann Publishers.
S. Mani, Y. Chen, T. Elasy, W. Clayton, J. Denny. (2012). Type 2 diabetes risk forecasting from
EMR data using machine learning. AMIA Annu Symp Proc, 2012, pp. 606-615
Sarica Alessia, Cerasa Antonio, Quattrone Aldo. (2017). Random Forest Algorithm for the
Classification of Neuroimaging Data in Alzheimer's Disease: A Systematic Review.
Frontiers in Aging Neuroscience (9). https://www.frontiersin.org/articles/10.3389/ ISSN
1663-4365.

sklearn.feature_selection.f_classif — scikit-learn 1.1.3 documentation. (2011). Scikit-learn.
Retrieved November 10, 2022, from
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif
sklearn.feature_selection.SelectKBest — scikit-learn 1.1.3 documentation. (2011). Scikit-learn.
Retrieved November 11, 2022, from
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
sklearn.utils.class_weight.compute_class_weight — scikit-learn 1.1.3 documentation. (2011).
Scikit-learn. Retrieved November 11, 2022, from
https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the
ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the
Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer
Society Press.
Spijkerman AM, Dekker JM, Nijpels G, Adriaanse MC, Kostense PJ, Ruwaard D et al. (2003).
Microvascular complications at time of diagnosis of type 2 diabetes are similar among
diabetic patients detected by targeted screening and patients newly diagnosed in general
practice: the hoorn screening study. Diabetes Care 26: 2604–2608.
Talha M Alam, M Atif I, Yasir A, Abdul W, Safdar I, Talha B, Ayaz H, Muhammad M,
Muhammad Mehdi R, Salman I, Zunish A. (2019). A model for early prediction of
diabetes,Informatics in Medicine Unlocked, Volume 16, 100204, ISSN 2352-9148,
https://doi.org/10.1016/j.imu.2019.100204.
Vermeulen I, Weets I, Asanghanwa M, Ruige J, Van Gaal L, Mathieu C, Keymeulen B,
Lampasona V, Wenzlau JM, Hutton JC, et al. (2011). Contribution of antibodies against
IA-2β and zinc transporter 8 to classification of diabetes diagnosed under 40 years of age.
Diabetes Care.34:1760–1765.
What Causes Gestational Diabetes? (2021). CDC. Retrieved November 10, 2022, from
http://cdc.gov/diabetes/basics/gestational.html
What Causes Type 2 Diabetes? (2021). CDC. Retrieved November 10, 2022, from
https://www.cdc.gov/diabetes/basics/type2.html
What is diabetes? (2022). CDC. Retrieved November 10, 2022, from
https://www.cdc.gov/diabetes/basics/diabetes.html
Zhang X, Geiss LS, Cheng YJ, Beckles GL, Gregg EW, Kahn HS. (2008). The missed patient
with diabetes: how access to health care affects the detection of diabetes. Diabetes Care
31: 1748–1753.

Zia UA, Khan N (2017). Predicting diabetes in medical datasets using machine learning
techniques. Int J Sci Eng Res 5(2):257–267.

Biographical Sketch

Clement Tochukwu Okolo was born in Lagos, Nigeria. He began his academic career at

Olabisi Onabanjo University, Nigeria majoring in anatomy. After earning his bachelor's degree

in anatomy in the Summer of 2017, he joined the University of Louisiana at Lafayette in the

Spring of 2021 as a master's student in informatics conducting research in diabetes prediction

using supervised machine learning algorithm under the tutelage of Dr. Michael W. Totaro. He

also served as the President of the Graduate Student Organization (GSO) and the GSO

representative for the School of Computing and Informatics. His research culminated in earning

a master’s degree in informatics at the University of Louisiana at Lafayette in the Fall of 2022.

ProQuest Number: 30240963


Distributed by ProQuest LLC ( 2023 ).


Copyright of the Dissertation is held by the Author unless otherwise noted.

