Diabetes Prediction Using Machine Learning Algorithms and Ontology
Abstract
Diabetes is a chronic disease whose prevalence is increasing from year to
year. Problems arise when diabetes is not detected at an early stage and
not diagnosed properly in time. Machine learning (ML) techniques, including
ontology-based ML techniques, have recently played an important role in
medical science by enabling automated systems that can detect diabetes
patients. This paper provides a comparative study and review of the most
popular machine learning techniques and ontology-based machine learning
classification. Six classification algorithms were considered, namely SVM,
KNN, ANN, Naive Bayes, Logistic Regression, and Decision Tree. The results
are evaluated using performance metrics derived from the confusion matrix:
Recall, Accuracy, Precision, and F-Measure. The experimental results showed
that the ontology classifier and SVM achieve the best accuracy.
1 Introduction
Diabetes is a group of deadly metabolic diseases in which the level of
blood sugar in the human body is abnormally high. It impacts the body’s
capacity to produce the hormone insulin. If diabetes goes untreated and
undiagnosed at an early stage, high blood sugar commonly causes many
complications, with symptoms such as intensified thirst, increased hunger,
and frequent urination. Therefore, early detection using current
ontology-based and machine learning technologies is the only way to avoid
complications.
Machine learning (ML) is one of the most rapidly developing fields
of computer science, with several applications. It refers to the process of
extracting useful information from a large set of data. ML techniques are
used in different areas such as medical diagnosis, marketing, industry, and
other scientific fields. ML algorithms have been widely applied to medical
datasets and are well suited to medical data analysis. There are various
forms of ML, including classification, regression, and clustering. Depending
on the problem that we are trying to solve, each form has a distinct result
and impact. In our work, we focus on classification methods, which assign
the instances of a given dataset to predefined groups and predict the class
of new data, because of their good accuracy and performance. Classification
algorithms are commonly employed in the medical domain, especially for
diagnosing diseases such as diabetes. Therefore, commonly used machine
learning classifiers [1], namely SVM, KNN, ANN, Naive Bayes, Logistic
Regression, and Decision Tree, are applied to identify diabetes patients at
an early stage.
On the other hand, ontology has been one of the most widely adopted
approaches for managing, organizing, and extracting data in recent decades.
It is a data representation method that has been successfully implemented
in a variety of fields, especially the medical domain. It is important in
computer science because of its capacity to represent diverse concepts and
their relationships in different disciplines. In practice, no single
ontology is sufficient to meet the growing demands of today’s healthcare,
and ontologies must be integrated with machine learning algorithms to
support data integration and analysis [2]. In previous work, we created and
explored an ontology-based model capable of predicting diabetes patients by
using an ontology classifier based on a decision tree algorithm.
In this study, we aim to compare the six popular classification techniques
with ontology-based machine learning classification, using carefully chosen
metrics, namely Precision, Accuracy, F-Measure, and Recall, which are
derived from the confusion matrix.
The remainder of the paper is organized as follows. Section 2 reviews the
literature on classification algorithms in the field of diabetes
prediction. Section 3 presents the technologies and methods used in this
comparative analysis. Section 4 describes the performance metrics used to
evaluate the models. Section 5 presents the results and discussion.
Finally, Section 6 presents future work and conclusions.
2 Related Works
Recently, researchers have published a considerable amount of work on
identifying diabetic patients from their symptoms by applying machine
learning techniques. In [3], the authors propose a model that can predict
whether a patient has diabetes or not. The model is assessed on the
predictive performance of several machine learning algorithms, using
measures such as precision, recall, and F1-measure. The authors use the
Pima Indian Diabetes (PIDD) dataset to predict diabetes onset from
diagnostic measurements. The results obtained using the Logistic Regression
(LR), Naïve Bayes (NB), and K-nearest Neighbor (KNN) algorithms were 94%,
79%, and 69% respectively. In [4], the authors apply seven ML algorithms to
the dataset to predict diabetes. They found that the Logistic Regression
and SVM models performed better on diabetes prediction. They also built
neural network (NN) models with different numbers of hidden layers and
observed that the NN with two hidden layers provided 88.6% accuracy.
The study in [5] applies several machine learning classification
algorithms (Gaussian Naive Bayes, K-Nearest Neighbors, Artificial
Neural Network, Logistic Regression, Decision Tree, Random Forest, and
Support Vector Machine) to the PIDD dataset. Logistic Regression obtained
the best accuracy.
Sarwar et al. [6] discuss predictive analytics in healthcare and apply a
number of machine learning algorithms. For experimental purposes, a dataset
of patients’ medical records is obtained. The performance and accuracy of
the applied algorithms are discussed and compared.
In [7], the authors propose a diabetes prediction model that takes into
account external factors responsible for diabetes in addition to regular
factors such as Glucose, BMI, Age, and Insulin. Classification accuracy
improves with the novel dataset compared with the existing dataset.
On a dataset of 521 instances (80% for training and 20% for testing), the
authors of [8] applied eight ML algorithms such as logistic regression,
322 H. El Massari et al.
results, Naïve Bayes and Random Forest classifiers achieved 80% accuracy
compared to the other algorithms.
3.1 Dataset
The dataset, called the Pima Indians Diabetes Database (PIDD), is
originally from the National Institute of Diabetes and Digestive and Kidney
Diseases. Its purpose is to predict, based on diagnostic measurements,
whether a patient has diabetes. It has 768 instances and 8 numerical
attributes plus a class attribute (preg, plas, pres, skin, insu, mass,
pedi, age, class).
After the dataset pre-processing step using UCI Machine Learning, the
output file in CSV format is transformed into ARFF format.
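The CSV-to-ARFF step can be illustrated with a minimal sketch. It is not the conversion tool used in this study: the function name `csv_to_arff`, the relation name, and the class labels written with underscores are assumptions made for illustration, and the two sample rows are illustrative values only. The attribute names are the ones listed above.

```python
import csv
import io

# Attribute names as listed in the dataset description.
ATTRIBUTES = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"]
# Assumed spelling of the two class labels (hypothetical, for illustration).
CLASS_VALUES = ["tested_negative", "tested_positive"]


def csv_to_arff(csv_text, relation="pima_diabetes"):
    """Convert CSV text (one header row, then data rows) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    if header != ATTRIBUTES + ["class"]:
        raise ValueError(f"unexpected columns: {header}")

    lines = [f"@relation {relation}", ""]
    # The eight diagnostic measurements are numeric attributes.
    lines += [f"@attribute {name} numeric" for name in ATTRIBUTES]
    # The class is a nominal attribute with two possible values.
    lines.append("@attribute class {" + ",".join(CLASS_VALUES) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Two illustrative rows; a real run would read the full 768-row file.
sample_csv = (
    "preg,plas,pres,skin,insu,mass,pedi,age,class\n"
    "6,148,72,35,0,33.6,0.627,50,tested_positive\n"
    "1,85,66,29,0,26.6,0.351,31,tested_negative\n"
)
arff_text = csv_to_arff(sample_csv)
print(arff_text)
```

The ARFF header declares the type of every attribute, which is why the class is written as a nominal attribute with its two values rather than as a number.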
tested negative or positive. The results of the ontology classifier are presented
in Section 5.
4 Evaluation
In machine learning, performance measurement is an essential task, and it
is critical to choose the right metrics to evaluate a model. Metrics
determine how the performance of machine learning algorithms is measured
and compared.
Different performance metrics are used to evaluate machine learning
algorithms, such as Accuracy, Precision, Recall, F-Measure, ROC Area,
Kappa statistic, Root mean squared error, and Root relative squared error.
Almost all of these performance metrics are derived from the Confusion
Matrix and the numbers inside it. The Confusion Matrix is one of the most
intuitive and easiest tools for determining a model’s correctness and
accuracy. It is used for classification problems with two or more output
classes.
The confusion matrix is a two-dimensional table (“Actual” and “Predicted”)
with the same set of classes along both dimensions. In our case, the actual
classes are the columns and the predicted classes are the rows. To make
this concrete, consider an example from our study, in which we predict
whether a patient has diabetes or not (1: tested positive, 0: tested
negative). Figure 3 illustrates the details of the confusion matrix, and
Table 1 describes the terms associated with it.
An ideal classifier would have no FN or FP entries (i.e., the number of FN
and the number of FP are both zero).
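As a minimal sketch of how the four cells of a binary confusion matrix are obtained, the following counts TP, FP, FN and TN from paired actual and predicted labels. The function name `confusion_counts` and the eight example labels are hypothetical; they are not data from this study.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct (TP) or a false alarm (FP).
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a missed case (FN) or correct (TN).
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical labels for eight patients
# (1: tested positive, 0: tested negative).
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```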
Diverse measures can be derived from a confusion matrix, such as Accuracy,
Precision, Recall and F-Measure. The best value of accuracy, precision,
and recall is 1.0, whereas the worst is 0.0. Figure 3 illustrates how to
compute them from the confusion matrix.
Accuracy (ACC):
\[ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \]
Accuracy is computed as the number of correct predictions divided by the
total number of instances in the dataset; in our case, it is the proportion
of patients who are classified correctly.
Precision (PREC):
\[ \mathrm{PREC} = \frac{TP}{TP + FP} \]
PREC is computed as the number of correct positive predictions divided
by the total number of positive predictions.
Recall (REC):
\[ \mathrm{REC} = \frac{TP}{TP + FN} \]
REC is computed as the number of correct positive predictions divided by
the total number of positives. It represents the relevant patients that
have been correctly detected, and it is also called Sensitivity or true
positive rate (TPR).
F-Measure:
\[ F\text{-Measure} = 2 \cdot \frac{\mathrm{PREC} \cdot \mathrm{REC}}{\mathrm{PREC} + \mathrm{REC}} \]
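As a worked example of the four formulas, the sketch below applies them to the counts reported in Table 3 (TP = 160, FP = 37, FN = 16, TN = 48). The function name `classification_metrics` is an assumption made for illustration, and the code assumes at least one positive prediction and one actual positive so that no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute ACC, PREC, REC and F-Measure from confusion matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero denominators).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Counts from Table 3 (ontology classifier, 66% split mode validation).
metrics = classification_metrics(tp=160, fp=37, fn=16, tn=48)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# accuracy: 79.7%
# precision: 81.2%
# recall: 90.9%
# f_measure: 85.8%
```

These values agree with the accuracy, precision and F-Measure reported for the ontology classifier in split test mode.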
Figure 5 Statistics of inferred concepts. (a) based on 10-fold cross-validation. (b) based on
66% split mode validation.
Table 3 Confusion matrix of ontology classifier based on 66% split mode validation
Tested Positive and Negative              Actual class
Classification                            Positive    Negative
Predicted class           Positive        TP: 160     FP: 37
                          Negative        FN: 16      TN: 48
remainder test) in order to enrich the study and give more visibility to these
two modes.
Based on the performance metrics explained in the previous section, the
results of the ontology classifier are shown in Tables 2 and 3 and in
Figure 5. Furthermore, Figures 6–10 plot the results for each metric:
Accuracy, Precision, Recall, F-Measure, and ROC Area.
Table 4 summarizes the experimental results for the ML and ontology
classifiers used in this study.
– Accuracy
As shown in Figure 6 and Table 4, the highest accuracy in 10-fold
cross-validation mode is obtained by Ontology, SVM and Logistic Regression,
with 77.5%, 77.3% and 77.2% respectively. In split test mode, Logistic
Regression, Ontology and SVM obtain 80.1%, 79.7% and 79.3% respectively.
– Precision
The ontology classifier has the highest Precision, 81.2% in both test
modes, followed by Naïve Bayes and ANN. More details are shown in
Table 4 and Figure 7.
– Recall
From Figure 8 and Table 4, we notice that SVM has the highest Recall in
both test modes, followed by Ontology and then Logistic Regression.
– F-Measure
SVM and Ontology achieve the same F-Measure: 83.3% for 10-fold
cross-validation and approximately 85.8% for split test mode (see Figure 9
and Table 4).
– ROC area
Table 4 and Figure 10 show that Logistic Regression, Naïve Bayes, and
Ontology achieve the best ROC Area values.
Discussion
In our measurements, we used two test mode options. We noticed that the
percentage split gave higher scores than the cross-validation test mode,
which we attribute to the small size of the dataset. For this reason, the
rest of the discussion is based on 10-fold cross-validation.
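To make the difference between the two test modes concrete, the following sketch partitions the 768 instance indices both ways. It is a simplified illustration and not the evaluation tool's own procedure: the function names are hypothetical, and no shuffling or stratification is applied, whereas practical tools usually randomize the order first.

```python
def percentage_split(n_instances, train_percent=66):
    """Split indices once: the first train_percent% train, the rest test."""
    n_train = round(n_instances * train_percent / 100)
    indices = list(range(n_instances))
    return indices[:n_train], indices[n_train:]


def k_fold_splits(n_instances, k=10):
    """Yield (train, test) index lists so each instance is tested once."""
    indices = list(range(n_instances))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 66% split: a single evaluation on 261 held-out instances.
train, test = percentage_split(768)
print(len(train), len(test))  # 507 261

# 10-fold cross-validation: every one of the 768 instances is tested once.
folds = list(k_fold_splits(768))
print([len(fold_test) for _, fold_test in folds])
# [77, 77, 77, 77, 77, 77, 77, 77, 76, 76]
```

The 261 test instances of the split match the total of the four cells in Table 3. With a dataset this small, a single held-out third is a limited sample, whereas cross-validation evaluates every instance.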
References
[1] Z. Sabouri, Y. Maleh, and N. Gherabi, “Benchmarking Classification
Algorithms for Measuring the Performance on Maintainable Applications,”
in Advances in Information, Communication and Cybersecurity,
Cham, 2022, pp. 173–179. doi: 10.1007/978-3-030-91738-8_17.
[2] H. EL Massari, S. Mhammedi, Z. Sabouri, and N. Gherabi,
“Ontology-Based Machine Learning to Predict Diabetes Patients,” in
Advances in Information, Communication and Cybersecurity, Cham, 2022,
pp. 437–445. doi: 10.1007/978-3-030-91738-8_40.
[3] F. Alaa Khaleel and A. M. Al-Bakry, “Diagnosis of diabetes using
machine learning algorithms,” Mater. Today Proc., Jul. 2021,
doi: 10.1016/j.matpr.2021.07.196.
[4] J. J. Khanam and S. Y. Foo, “A comparison of machine learning
algorithms for diabetes prediction,” ICT Express, vol. 7, no. 4,
pp. 432–439, Dec. 2021, doi: 10.1016/j.icte.2021.02.004.
[5] P. Cıhan and H. Coşkun, “Performance Comparison of Machine Learning
Models for Diabetes Prediction,” in 2021 29th Signal Processing and
Communications Applications Conference (SIU), Jun. 2021, pp. 1–4.
doi: 10.1109/SIU53274.2021.9477824.
[6] M. A. Sarwar, N. Kamal, W. Hamid, and M. A. Shah, “Prediction of
Diabetes Using Machine Learning Algorithms in Healthcare,” in 2018
24th International Conference on Automation and Computing (ICAC),
Sep. 2018, pp. 1–6. doi: 10.23919/IConAC.2018.8748992.
[7] A. Mujumdar and V. Vaidehi, “Diabetes Prediction using Machine
Learning Algorithms,” Procedia Comput. Sci., vol. 165, pp. 292–299,
Jan. 2019, doi: 10.1016/j.procs.2020.01.047.
[8] M. Rady, K. Moussa, M. Mostafa, A. Elbasry, Z. Ezzat, and W. Medhat,
“Diabetes Prediction Using Machine Learning: A Comparative Study,”
Biographies