Neural Computing and Applications (2023) 35:16157–16173

https://doi.org/10.1007/s00521-022-07049-z

S.I.: AI-BASED E-DIAGNOSIS

Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms

Victor Chang¹ • Jozeene Bailey² • Qianwen Ariel Xu² • Zhili Sun³

Received: 16 August 2021 / Accepted: 31 January 2022 / Published online: 24 March 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022

Abstract
This paper proposes an e-diagnosis system based on machine learning (ML) algorithms to be implemented in the Internet of Medical Things (IoMT) environment, particularly for diagnosing diabetes mellitus (type 2 diabetes). However, ML applications tend to be mistrusted because of their inability to show the internal decision-making process, resulting in slow uptake by end-users within certain healthcare sectors. This research delineates the use of three interpretable supervised ML models: a Naïve Bayes classifier, a random forest classifier, and a J48 decision tree, trained and tested on the Pima Indians diabetes dataset in the R programming language. The performance of each algorithm is analyzed to determine the one with the best accuracy, precision, sensitivity, and specificity. An assessment of the decision process is also made to improve the model. It can be concluded that a Naïve Bayes model works well with a more fine-tuned selection of features for binary classification, while random forest works better with more features.

Keywords Diabetes mellitus · The Internet of Medical Things (IoMT) · Machine learning · Interpretable artificial intelligence

¹ Department of Operations and Information Management, Aston Business School, Aston University, Birmingham, UK
² Cybersecurity, Information Systems and AI Research Group, School of Computing and Digital Technologies, Teesside University, Middlesbrough, UK
³ Institute for Communication Systems (ICS), 5G and 6G Innovation Centre (5G&6GIC), University of Surrey, Guildford, Surrey, UK

1 Introduction

Diabetes mellitus, or simply diabetes, is a leading non-communicable disease (NCD) globally, almost doubling in cases since 1980 [1]. It is a chronic illness that develops either when the pancreas is not able to generate sufficient insulin or when the body does not utilize the insulin produced effectively [1]. There is no cure for this disease. Diabetes is thought to result from a combination of genetic and environmental factors. Several risk factors that are attributed to diabetes include ethnicity, family history of diabetes, age, excess weight, unhealthy diet, physical inactivity, and smoking. In addition to this, the absence of early detection of diabetes has been known to contribute to the development of other chronic diseases such as kidney disease. Furthermore, additional pre-existing non-communicable diseases present a high risk for the patient, because such patients are more susceptible to infectious diseases such as COVID-19 [2].

Predicting the probability of an individual's risk and susceptibility to a chronic illness like diabetes is an important task. Diagnosing chronic illness at an early stage saves on medical costs and reduces the risk of more complicated health problems. Even in emergencies where a patient may be unconscious or unintelligible, it is pertinent that deductions can be made accurately from immediately measurable medical indicators to help clinicians make better decisions for patient treatment in high-risk situations.


The majority of existing NCD cases remain undiagnosed, with patients suffering few symptoms during the initial phases of the disease, which makes early detection and diagnosis a huge challenge. One advantage of treating patients in the early stage of a non-communicable disease is that they can avoid expensive treatments later in life as the disease gets worse. Early detection is made more difficult by a lack of medical practitioners in underserved regions such as rural and remote villages. In such cases, the combination of the Internet of Medical Things (IoMT) and machine learning models can be made available to assist healthcare professionals in the early detection and diagnosis of NCDs by providing predictive tools for more efficient and timely decision-making.

However, it should be noted that machine learning solutions tend to be mistrusted by some people because of what may be referred to as a 'black-box' effect: an inability to show their internal decision-making process. This lack of explainability in machine learning models causes skepticism among consumers and results in slow uptake by end-users within the healthcare sector. The ability to explain both the reasoning behind a machine learning prediction and the process that produces it is crucial to building trust, particularly in the healthcare field, where mistakes could be fatal.

This paper seeks to develop an e-diagnosis system for detecting and classifying diabetes as an IoMT application. Through the use of machine learning algorithms [Naïve Bayes, random forest, and decision tree (J48)], the system will be able to predict whether a person is at risk for diabetes based on several risk factors, provide doctors with a preliminary diagnosis, and feed back the doctor's guidance on diet, exercise, and blood glucose testing to patients. These classification models were evaluated using various measures, including accuracy, precision, sensitivity, F-measure, and area under the receiver operating characteristics (AUROC) curve, to identify the best performing classifier. Several significant features that can be used to predict the severity of diabetes were extracted from the top classification model.

The Pima Indian Diabetes dataset is employed for this experiment. Pima Indians are a Native American group that lives in Mexico and Arizona, USA [3]. This group was deemed to have a high incidence rate of diabetes mellitus. Thus, research around them was thought to be significant to and representative of global health [4]. The Pima Indian Diabetes dataset, consisting of Pima Indian females 21 years and older, is a popular benchmark dataset [5]. This group is also significant to members of underrepresented minority or indigenous groups.

The features of the dataset comprise measures that do not require extensive testing. In emergency situations and patient self-care, which have become more popular, this function is essential.

The methodology is as follows: prepare the dataset, followed by data pre-processing such as dealing with missing values and categorical values, imputation, and standardization. Feature selection will be performed by using a variety of tools. Lastly, the classifiers' performance before and after feature selection will be evaluated further.

The organization of this paper is outlined as follows: Sect. 2 presents literature review, Sect. 3 provides details on data cleaning, exploration and feature selections, and Sect. 4 presents the methodology for analysis and evaluations of the dataset. Finally, Sect. 5 concludes the paper with discussions of future research.

2 Literature review

2.1 Internet of Medical Things (IoMT) and artificial intelligence algorithms

Internet of Medical Things (IoMT) is the application of the Internet of Things (IoT) in the medical field. Utilizing networking technologies, the IoMT aims to connect medical equipment and its applications with healthcare IT systems [6]. This innovative development has changed the medical field with its novel-designed remote healthcare system in terms of social benefits, perception, and reliable detection of illness. Benefiting from the constant computing of the IoT, it becomes easier to accomplish clinical goals such as patient data, medical orders, medical instruments, and remedies [7]. The development of the IoMT has brought about tremendous changes in promoting disease management, enhancing disease diagnostic and treatment techniques, as well as lowering healthcare costs and mistakes. This transformation has had a significant influence on the healthcare quality for both frontline healthcare professionals and patients. The IoMT is a thriving force for the researcher, the medical professional, the patient, and the insurer, enabling numerous use cases, for example, telemedical support, data insights, drug management, operation enhancement, patient tracking, etc. [8]. In particular, the IoMT offers various services to medical professionals, including delivering feedback to medical staff, equipment data and settings based on the needs of the patient and the specialist. IoMT gives rapid and easy access to various reports that help surgeons in operating rooms during surgeries [9].

The value of the IoMT is growing as a result of the symbiotic rise of artificial intelligence (AI). However, data production is one of the most significant challenges resulting from the development that a number of academics have confronted [10]. Because the amount of data acquired is quite massive, it is necessary to use machine learning technology, which is good at processing and analyzing the data, extracting valuable information from it, and then visualizing it [11].

For chronic diseases like diabetes, AI, including machine learning and deep learning, plays an extremely important and effective role in supporting doctors' decision-making and in monitoring and managing patients [12, 13]. Specifically, the combination of IoMT and AI can bring two benefits to the diagnosis and treatment of chronic diseases. On the one hand, an e-diagnosis system based on AI can efficiently analyze and classify the data obtained in IoMT to make a preliminary diagnosis of patients and provide support for the doctor to make the final diagnosis and specify the treatment plan. On the other hand, this e-diagnosis system makes it possible to realize remote supervision and management of patients with chronic illness. For example, the root of the diabetes management problem lies in the self-management of patients. The key to solving this problem is to tell patients how to monitor blood sugar, arrange diet and exercise, and use drugs rationally. The diabetes management system based on IoT technology provides the possibility to solve this problem. In remote areas lacking medical experts and professional medical equipment, mobile devices can provide data to the e-diagnosis system to use the services provided by IoMT to detect and classify diseases [9].

In this paper, an e-diagnosis system for detecting and classifying diabetes as an IoMT application is proposed, as shown in Fig. 1. Employing ML algorithms, this system aims to predict the diagnosis of diabetes based on patient data, provide doctors with a preliminary diagnosis, and return feedback on the doctor's guidance on diet, exercise, and blood glucose testing to patients. In addition, as shown on the left side of the figure, the IoMT enables the medical systems, applications and devices to connect with each other. Therefore, a patient's profile can be assessed by a doctor remotely through the Internet and shared by doctors from different medical institutions, whether community hospitals or large hospitals. In this way, the amount of paper medical records can be reduced to a large extent, and the patient does not need to go to the same hospital or even go to the hospital for follow-up visits in person.

2.2 Intelligent methods of diabetes prediction

By clarifying common problems, the emerging techniques in data science can bring benefits to other fields of science, including medicine. Numerous studies have employed various machine learning or AI methods for diabetes prediction, such as artificial neural network (ANN), support vector machine, gradient boosting decision tree, and Naïve Bayes.

Fig. 1 An E-diagnosis system enabled in IoMT


In their study, Komi et al. [14] use five different data mining techniques [ANN, extreme learning machine (ELM), Gaussian mixture model (GMM), support vector machine (SVM), and logistic regression] to explore the early prediction of diabetes. Their research results show that ANN performs best among the five techniques. Similar to Komi et al., Ramanujam et al. [15] and Kumar et al. [16] also contribute to the early prediction of diabetes, but with different approaches. The early diagnosis of diabetes and proper treatment will affect costs and mortality in the later stage. Early diagnosis and testing expenditures are therefore crucial. However, people in rural areas are unlikely to be able to afford early diagnosis and so miss timely treatment, resulting in higher mortality [17]. In order to help the rural Indian people, Ramanujam et al. [15] develop a multilingual decision support system that integrates the predictive models and a clinical decision support system. The design feature of the system is that users can not only evaluate diabetes with the help of nursing assistants but also evaluate diabetes by themselves. Kumar et al. [16] compare the performance of CatBoost with other ML techniques, including K-nearest neighbor, logistic regression, stochastic gradient descent, Gaussian Naïve Bayes, and multilayer perceptron, in the early prediction of diabetes. In their research, CatBoost has the highest accuracy.

In addition, AI algorithms are also employed to analyze and classify iris images to diagnose diabetes. Samant and Agarwal [18, 19] study the diagnosis of diabetes through the changes in pigmentation in certain areas of the iris by using several ML algorithms. They use image pre-processing methods to obtain the iris and crop out certain areas. Then, they use textural, statistical and wavelet features to observe the variances in the tissue pigmentation. Finally, five classifiers are employed to classify whether the patient has diabetes. Their results show that random forest outperforms other classifiers.

Although AI and machine learning pervade the fields of healthcare and non-communicable chronic diseases, their actual medical application rate is very low because these complex algorithms and models lack explanation. Based on the existing literature, this paper chooses three classifier models, Naïve Bayes, random forest classifier, and J48 decision tree, to classify the Pima Indians Diabetes dataset in the R programming language. However, unlike its predecessors, this study aims to employ interpretable ML models to make our model clear and understandable to end-users regarding how we judge which features are important and how the choice of features affects the model's prediction results.

2.3 The selected machine learning algorithms

2.3.1 J48 decision tree

A decision tree (DT) is a supervised ML algorithm widely utilized in dealing with classification and regression issues. A leaf node in a decision tree represents a classification outcome, and an internal node represents a test on an attribute. Quinlan [20] calls the algorithm employed to establish the decision tree ID3, which uses a top-down learning method. The following steps describe the process of the DT: the first step is selecting the most appropriate attribute for the root node; secondly, the instances are divided into a number of subsets, such that the instances in each subset have identical values for that attribute; finally, the procedure is repeated recursively on every subset until all instances in a subset have identical classes [21]. Figure 2 shows a part of a diagnosis decision tree, which can be interpreted easily. For instance, according to the tree, if a patient does not have inter-systolic noise, but has pre-cordial pain, then he or she has a prolapse.

The decision tree algorithm has been employed in many scientific fields, including the medical area. For example, Rochmawati et al. [22] use the DT algorithm to classify COVID-19 symptoms. They conclude that compared with the Hoeffding tree, DT has a better performance but is more complicated. Other diseases can also be intelligently diagnosed by DT, for instance, Lupus disease [23] and coronary artery disease [24].

2.3.2 Random forest

Random forest (RF) is an extension of a decision tree and is composed of numerous single decision trees, each of which produces a category of prediction results. The category with the most votes in the forest becomes the random forest classifier's final prediction result. For example, as shown in Fig. 3, among nine single decision trees in the forest, the prediction results of six trees are 1, and those of the remaining three trees are 0. Therefore, the prediction result of the RF is 1. The key to the good performance of this classifier is that the trees in the forest are relatively unrelated to each other, ensuring that the decision they make as a whole is better than the decisions made by each of them individually [25].

Random forest uses a simple and powerful basic concept, called the wisdom of the crowd. The low correlation between trees is crucial to the success of the model. Under this premise, even if the prediction results of several trees are not correct, as long as the prediction results of most other trees are correct, then as a group, these trees can finally get the correct prediction results. In other words, the


Fig. 2 Decision tree example

Fig. 3 Visualization of a random forest model making a prediction
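The majority vote illustrated in Fig. 3 can be sketched in a few lines. This is only an illustration in Python (the paper itself works in R and shows no code); the function name `majority_vote` and the list of nine votes are hypothetical, chosen to mirror the six-to-three example given for the random forest.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a list holding the (label, count) pair with the highest count
    return counts.most_common(1)[0][0]

# Nine hypothetical tree outputs: six trees predict 1, three predict 0
votes = [1, 0, 1, 1, 0, 1, 1, 0, 1]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

A real random forest also trains each tree on a bootstrap sample with a random subset of features, which is what keeps the trees weakly correlated; the sketch covers only the final voting step.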

random forest model performs well because a large number of relatively uncorrelated models operating as a whole perform better than any single constituent model.

Surface-enhanced Raman scattering (SERS) technology is very useful for analyzing biological samples. Nevertheless, it is difficult to obtain the required information from the collected data in the absence of labeled molecules. Therefore, Seifert [26] combines the random forest method with SERS data to solve this problem. The outcomes indicate that this approach is able to enhance the performance of SERS technology. Apart from biology, RF can also be used in the areas of agriculture [27] and medical science [18, 19].

2.3.3 Naïve Bayes

The Bayesian classifier is a statistical classifier that operates according to Bayes' theorem, classifying data into predetermined categories using conditional probability. Conditional probability can be understood as the probability that an event will take place given that other events have already taken place. The Bayesian rule is used to estimate the probability of a category given the attribute values of an input. The term "naïve" in the algorithm's name refers to its assumption that the attribute values are independent of each other.

Naïve Bayes (NB) is regarded as a descriptive as well as a predictive algorithm. The probabilities are descriptive and are then employed to predict the categories of unseen data. This method has several merits, as follows. First of all, it is easy to use. Secondly, the amount of training data NB needs for classification is not necessarily large. In addition, although the NB classifier is naïvely designed and its assumption seems to be too simple, it performs well in a number of complicated real-world situations [28].

Pandiangan et al. [29] consider in their applied AI research that a student's study time and duration is an essential index to evaluate the quality of the university. They then employ the NB classification algorithm and DT algorithm to predict the student's study period, evaluate academic performance and identify correlations for improving the quality of the university. In the field of education, Daniati [30] develops a decision support system for students to select suitable programs using DBSCAN and Naïve Bayes. Different from them, Akbar et al. [31] integrate the Internet of Things with the NB algorithm to develop an intelligent laundry mobile application.

3 Data and methodology

3.1 Dataset exploration and pre-processing

Although there are now larger, more complex diabetes datasets, the Pima Indian Diabetes dataset has remained a benchmark for diabetes classification research. Given the presence of a binary outcome variable, the dataset naturally lends itself to supervised learning and, in particular, logistic regression. However, various ML algorithms have been employed to produce classification models based on this dataset, so that research is not limited to a single type of model.

In this research, our focus is to analyze the Pima Indian Dataset with advanced algorithms to work with IoMT effectively. The dataset was downloaded from Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database) and is available via a CC0: Public Domain License. It is properly anonymized and does not contain any identifiable features of the patient subjects. As seen in Table 1, it records eight causal characteristics and the corresponding classification. The dataset has 9 columns and 768 rows (500 non-diabetics and 268 diabetics). The binary classification outcome variable takes the value 0 or 1, where 0 indicates a negative test for diabetes, and 1 implies a positive test. Table 1 shows the dataset features (columns) and descriptions.

The dataset has no null values and no missing values. However, according to domain knowledge [32], there are inconsistent values for the attributes glucose concentration (Gluc), blood pressure (BP), skin fold thickness (Skin), insulin and BMI, whereby zero values are not within the normal range and are therefore inaccurate (Table 2).

The scatterplot matrix is helpful for a preliminary look at the pair-wise relationships of the features. If the points are scattered, there is no obvious relationship, while if the points are roughly arranged in a straight line, the features are linearly related. Referring to the scatterplot matrix in Fig. 4, the most closely correlated/proportional features include [pregnancy and age], [skin thickness and BMI], and [glucose and insulin] because their scatterplot figures all show a positive correlation.

As shown in Fig. 5, there are outliers in DPF, age, insulin, glucose, BMI, and blood pressure features, which might be due to other underlying factors. It would be best to standardize the data to avoid the ill effects of the outliers. The dataset is not a very large one, so it would be better to avoid removing rows unnecessarily.

There seems to be a demonstrable difference in the performance and efficiency of prediction classification models depending on the pre-processing methodology. Therefore, in the first round of experiments, minimal pre-processing was done. The second time around, however, feature selection algorithms were applied.

Since there were no missing or null values, only one data pre-processing technique was applied in the first round. This was to impute the median value on the features that had invalid zero values.

Decision trees, random forests, and Naïve Bayes models are not highly sensitive to non-normalized data, so no scaling was done as a way to keep the testing similar for the three machine learning models.

With only eight features, it may seem counter-intuitive to reduce the features further, but it can reduce some of the noise in classification and pinpoint subtle groupings synthesized by combining existing classes. In rounds two and


Table 1 Overview of Pima Indian diabetes dataset

Feature   Description                                                                        Data type   Range
Preg      Number of times pregnant                                                           Numeric     [0, 17]
Gluc      Plasma glucose concentration at 2 hours in an oral glucose tolerance test (OGTT)   Numeric     [0, 199]
BP        Diastolic blood pressure (mm Hg)                                                   Numeric     [0, 122]
Skin      Triceps skin fold thickness (mm)                                                   Numeric     [0, 99]
Insulin   2-hour serum insulin (μU/ml)                                                       Numeric     [0, 846]
BMI       Body mass index [weight in kg/(height in m)^2]                                     Numeric     [0, 67.1]
DPF       Diabetes pedigree function                                                         Numeric     [0.078, 2.42]
Age       Age (years)                                                                        Numeric     [21, 81]
Outcome   Binary value indicating non-diabetic/diabetic                                      Factor      [0, 1]

Table 2 Statistical summary of Pima Indians diabetes dataset

Features   Preg     Gluc    BP       Skin    Insulin   BMI     DPF      Age
Min.       0.000    0.0     0.00     0.00    0.0       0.00    0.0780   21.00
1st Qu.    1.000    99.0    62.00    0.00    0.0       27.30   0.2437   24.00
Median     3.000    117.0   72.00    23.00   30.5      32.00   0.3725   29.00
Mean       3.845    120.9   69.11    20.54   79.8      31.99   0.4719   33.24
3rd Qu.    6.000    140.2   80.00    32.00   127.2     36.60   0.6262   41.00
Max.       17.000   199.0   122.00   99.00   846.0     67.10   2.4200   81.00
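The zero minima that Table 2 reports for Gluc, BP, Skin, Insulin and BMI are the invalid values that Sect. 3.1 replaces with the median. A minimal sketch of that step follows, written in Python for illustration (the paper uses R and does not show its code). It assumes the median is taken over the non-zero values of each column, which the text does not state explicitly; the function name and the three toy rows are made up and are not taken from the dataset.

```python
from statistics import median

# Columns in which a zero is treated as invalid (Sect. 3.1); Preg may legitimately be 0
ZERO_INVALID = ["Gluc", "BP", "Skin", "Insulin", "BMI"]

def impute_zeros_with_median(rows, columns=ZERO_INVALID):
    """Return a copy of rows with zeros in `columns` replaced by the column median.

    Assumption: the median is computed from the non-zero values only.
    """
    imputed = [dict(row) for row in rows]  # copy so the input is left untouched
    for col in columns:
        valid = [row[col] for row in rows if row[col] != 0]
        if not valid:
            continue  # nothing to estimate the median from
        col_median = median(valid)
        for row in imputed:
            if row[col] == 0:
                row[col] = col_median
    return imputed

# Made-up toy records, for illustration only
rows = [
    {"Preg": 0, "Gluc": 100, "BP": 70, "Skin": 0, "Insulin": 0, "BMI": 30.0},
    {"Preg": 2, "Gluc": 0, "BP": 80, "Skin": 20, "Insulin": 90, "BMI": 0},
    {"Preg": 5, "Gluc": 140, "BP": 0, "Skin": 30, "Insulin": 110, "BMI": 34.0},
]
clean = impute_zeros_with_median(rows)
print(clean[1]["Gluc"], clean[2]["BP"], clean[0]["Preg"])
```

Taking the median over all values, zeros included, would be the other reading of the text; for Insulin, where the first quartile in Table 2 is already 0, the two readings give noticeably different fill values.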

three of experiments, the feature selection methodologies split of 70:30 for the training and testing sets. Other than
that were employed included: PCA, k-means clustering, the 70:30 percentage split, none of these techniques were
and importance ranking. used for simplicity for this project. Instead, the feature
selection methodologies mentioned in the previous section
3.2 Methods were used.
Common evaluation methods for model performance
As it relates to the classification and prediction of diabetes included accuracy, precision, sensitivity, specificity, F-
and other non-communicable diseases, ML and DL models measure (F-score), and mean square error (MSE), as well
have become an important research area for many years. as comparing performance on pre-processed versus non-
Numerous tools and models have been put forward to help pre-processed data [5].
solve the diagnosis prediction problem, including convo- Explainable AI (XAI) is the concept within artificial
lutional neural networks (CNN), artificial neural networks intelligence whereby decisions made by a machine learning
(ANN), and combined or hybrid machine learning models. model can be understood by its users [35]. The comple-
Based on recent research, the top algorithms (exclusive mentary concept of interpretability refers to the ability to
of combined models and neural networks) used to train and observe cause and effect within such a model. Model
forecast the Pima Indian Diabetes dataset included: the J48 interpretability can be intrinsic, and such as with decision
decision tree with 94.44% accuracy [32], as well as random trees. However, interpretability can also be introduced to a
forest (94% accuracy) and Naı̈ve Bayes (91% accuracy) model (post hoc) by applying functions on pre-trained
models [5]. All the proposed ML algorithms chosen to be models to generate explanations. Concerning non-com-
used in this research paper are classification models. municable diseases, not much work has been done in
All models that have previously been applied to the examining explainable machine learning models [35].
dataset have used additional machine learning techniques It is important to note that interpretable models may not
to pre-process or engineer the dataset, including boot- always be explainable to an extent where the human mind
strapping, resampling and k-folds, as well as the informa- fully comprehends the steps taking place to arrive at a
tion gain method [5]. Mercaldo et al. [33] made use of decision made by a machine learning model.
feature selection algorithms, such as GreedyStepwise and An example of a post hoc interpreting method is Shapley
BestFirst, to determine the discriminatory clauses. Iyer Additive exPlanations (SHAP), which is a framework
et al. [34] and Zia and Khan [32] also used a percentage (based on the idea in game theory of ‘Shapley values’, a

123
16164 Neural Computing and Applications (2023) 35:16157–16173

Fig. 4 Scatterplot matrix of features

description of a player’s contribution to the result of a dataset. In addition, it can be used to highlight biases and
coalitional game) that builds on an additive feature attri- errors in the machine learning models [35]. In this paper,
bution approach, to generate exclusive interpretation the goal is to use the interpretable AI method to make our
models that can offer interpretations on a classification model clear and understandable to end-users from two
model’s decisions in the form of specific feature contri- aspects. The first one is to judge which features are
butions [36]. Benefits of the SHAP framework include a important, and details are presented in Sect. 4.1. In
capability to meet local accuracy as well as consistency. Sect. 4.2, we will discuss the other aspect of how the
choice of features affects the models’ prediction results.
The SHAP summary plot shows feature importance ranked in descending order along the y-axis, together with how each feature value is associated with the prediction along the x-axis, which in turn can be used to interpret the correlation between a feature and the outcome [35].
We can use post hoc interpretation to identify whether the model we have trained has accurately captured the details of the real-world decision-making process from the

Fig. 5 Box and whisker plots showing feature distribution for each outcome class

4 Experiment and results

4.1 Feature selection

In order to make our model clear and explainable, this part shows the end-users how we judge which features are important. For feature selection within the experiments with the machine learning models, this research performs k-means clustering, principal component analysis (PCA), and importance ranking on the dataset.
Starting with PCA, the feature groupings were observed in the plot in Fig. 6, where arrows that are close together represent closely related features. It can be seen that the following are closely related:

• Pregnancy and age
• Glucose and blood pressure (BP)
• BMI, DPF, insulin level, and skin thickness

Following this, k-means clustering was applied with various values of k. According to Fig. 7, k = 2 gives the best separation of boundaries or groupings, but k = 3 also clusters reasonably well and could be useful.
Lastly, importance ranking was used to identify the features that had the highest mathematical importance to the outcome.
Using the above-mentioned methodologies, as shown in Fig. 8, we can see that glucose, BMI, age, insulin, and skin thickness rank very highly in helping to classify the data. In contrast, the DPF (Diabetes Pedigree Function), blood pressure, and number of pregnancies rank very low. Based on these methodologies, the dataset was reduced to two smaller versions. In the second round of experiments, three (3) factors (glucose, BMI, and age) were used as the features for classification by the ML algorithms. In the final round, five (5) factors were used (glucose, BMI, age, insulin, and skin thickness).
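The comparison of cluster counts described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the paper: the points are small synthetic 2-D data rather than the Pima dataset, the helper names (`kmeans`, `silhouette`, `init_centroids`) are hypothetical, and the mean silhouette score is swapped in as a numeric stand-in for the visual judgement of "separation" made from Fig. 7. The paper itself reports using R.

```python
import math

# Synthetic 2-D points forming two well-separated groups (illustrative only).
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3), (0.9, 0.9),
    (5.0, 5.0), (5.3, 4.9), (4.8, 5.2), (5.1, 5.4), (4.9, 4.6),
]

def init_centroids(pts, k):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    centroids = [pts[0]]
    while len(centroids) < k:
        nxt = max(pts, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

def kmeans(pts, k, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = init_centroids(pts, k)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in pts
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(pts, labels):
    """Mean silhouette score; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(pts, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(pts, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items() if m != lab
        )
        total += (b - a) / max(a, b)
    return total / len(pts)

scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
for k, s in scores.items():
    print(f"k = {k}: mean silhouette = {s:.3f}")
```

On data like this, where two groups are clearly separated, k = 2 scores highest. On real clinical data the scores are usually much closer, so plots such as those in Fig. 7 remain a useful complement.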

Fig. 6 Principal component analysis

4.2 Results of machine learning algorithms

In this paper, the three ML algorithms used to analyze the Pima Indian Diabetes dataset are the J48 decision tree, random forest, and Naïve Bayes. The same training and testing sets were used for all three to provide a controlled comparison. The data were manually split into 538 training and 230 testing samples (a 70/30 split).
Six metrics were used to evaluate the results: accuracy, precision, sensitivity, specificity, F-score, and area under the curve (AUC). The first five are computed from the confusion matrix, which shows the actual outcome classes against the predicted outcome classes on the testing set (see Table 3 and the formulas below).
Accuracy refers to the percentage of all samples that have been predicted correctly. It is the ratio of the sum of true positives and true negatives to the total number of predictions made.
Precision refers to the percentage of samples correctly predicted as positive among all those predicted as positive, including those that were actually negative.
Sensitivity refers to the percentage of samples correctly predicted as positive among all those that were actually positive, including those incorrectly predicted as negative.
Specificity refers to the percentage of samples correctly predicted as negative among all those that were actually negative, including those incorrectly predicted as positive.
The standard F-score (F1-score) is an indicator of a binary classification model’s accuracy, calculated as the harmonic mean of precision and sensitivity. Specifically, it is the product of precision and sensitivity divided by their sum, multiplied by two.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{F-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here, recall is another name for sensitivity.

Fig. 7 k-Means clustering with k = 2, 3, 4

Fig. 8 Results of rpart_importance function in R

Feature    Importance
Gluc       100.00
BMI         59.98
Age         56.04
Insulin     42.80
Skin        29.91
DPF          0.00
BP           0.00
Preg         0.00


In the formulas above, TP = true positive, TN = true negative, FP = false positive, and FN = false negative.

Table 3 Confusion matrix template

                      Actual positive    Actual negative
Predicted positive    True positive      False positive
Predicted negative    False negative     True negative

4.2.1 J48 decision tree

The J48 decision tree is an open-source Java implementation of the C4.5 algorithm, the successor to ID3 (Iterative Dichotomiser 3), developed by the WEKA (Java-based ML software) team and available in R through the RWeka package. Attributes of the dataset are used to split the data into progressively smaller subsets, which are then used to classify or make a prediction. Tables 4, 5, and 6 show the confusion matrices of the J48 decision tree on the full dataset, 3-factor subset, and 5-factor subset, respectively; further analysis of the evaluation metrics based on these matrices is provided in Sect. 5.

Table 4 J48 decision tree confusion matrix

                      Actual positive    Actual negative
Predicted positive    107                44
Predicted negative    14                 65

Table 5 J48 decision tree confusion matrix with feature selection (3-factor)

                      Actual positive    Actual negative
Predicted positive    106                45
Predicted negative    12                 67

Table 6 J48 decision tree confusion matrix with feature selection (5-factor)

                      Actual positive    Actual negative
Predicted positive    107                44
Predicted negative    12                 67

4.2.2 Random forest

A random forest is an ensemble of decision trees. Instead of relying on a single tree that splits on one attribute at a time, it builds many trees from random groups of attributes and combines their classifications. This requires more processing but typically improves accuracy over a single tree model. Tables 7, 8, and 9 show the confusion matrices of random forest on the full dataset, 3-factor subset, and 5-factor subset, respectively.

Table 7 Random forest confusion matrix

                      Actual positive    Actual negative
Predicted positive    136                15
Predicted negative    28                 51

Table 8 Random forest confusion matrix with feature selection (3-factor)

                      Actual positive    Actual negative
Predicted positive    123                28
Predicted negative    31                 48

Table 9 Random forest confusion matrix with feature selection (5-factor)

                      Actual positive    Actual negative
Predicted positive    121                30
Predicted negative    30                 49

4.2.3 Naïve Bayes

Naïve Bayes is regarded as a simple model based on the ‘‘naive’’ assumption that all predictor variables are independent of each other, so that no variable influences another. It applies Bayes’ theorem to all attributes to determine the likelihood of each class outcome and make a prediction. Tables 10, 11, and 12 display the confusion matrices of Naïve Bayes on the full dataset, 3-factor subset, and 5-factor subset, respectively; further analysis of the evaluation metrics based on these matrices is provided in Sect. 5.

Table 10 Naïve Bayes confusion matrix

                      Actual positive    Actual negative
Predicted positive    131                29
Predicted negative    20                 50

Table 11 Naïve Bayes confusion matrix with feature selection (3-factor)

                      Actual positive    Actual negative
Predicted positive    133                30
Predicted negative    18                 49

Table 12 Naïve Bayes confusion matrix with feature selection (5-factor)

                      Actual positive    Actual negative
Predicted positive    130                30
Predicted negative    21                 49
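As a quick consistency check, the formulas of Sect. 4.2 can be applied to the J48 full-dataset counts reported in Table 4. The following is a minimal Python sketch, not the authors' implementation (the paper reports using R), and the function names are illustrative. Because the paper does not report per-sample prediction scores, the AUC helper is demonstrated on hypothetical toy scores only. It uses the pairwise-ranking interpretation of AUC given in Sect. 4.2.4.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, sensitivity, specificity and F-score as fractions."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }

def auc_by_ranking(scores, labels):
    """Share of (positive, negative) pairs in which the positive sample
    receives the higher score; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Counts taken from Table 4 (J48 decision tree, full dataset).
j48_full = confusion_metrics(tp=107, fp=44, fn=14, tn=65)
for name, value in j48_full.items():
    print(f"{name}: {100 * value:.2f}%")

# Hypothetical scores and labels, for illustration only (not from the paper).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(f"toy AUC: {auc_by_ranking(toy_scores, toy_labels):.3f}")
```

With the Table 4 counts, the five printed percentages agree with the J48 row of Table 13 (74.78, 70.86, 88.43, 59.63, and 78.68).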
4.2.4 AUC-ROC curves

The receiver operating characteristic (ROC) curve and the resulting area under the curve (AUC) provide a vital performance measurement for classification models and represent the degree of separability of classes.
AUC can be seen as the likelihood that the model ranks a random positive example higher than a random negative example. AUC measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1), with a maximum value of 1. The ROC curves for all models using the full dataset, 3-factor subset, and 5-factor subset are shown in Figs. 9, 10, and 11, respectively.

Fig. 9 ROC curves for all models on imputed data

Fig. 10 ROC curves for all models using PIMA dataset with 3 features

Fig. 11 ROC curves for all models using PIMA dataset with 5 features

4.3 Final results

Tables 13, 14, and 15 combine the data of the above tables and compare the performance of the classifiers on each dataset, while Figs. 12, 13, 14, 15, 16, and 17 compare the performance of the algorithms across datasets from the perspective of each evaluation metric.

Table 13 Results of all models using imputation only

Model                Accuracy (%)   Precision (%)   Sensitivity (%)   Specificity (%)   F-score (%)   AUC (%)
J48 decision tree    74.78          70.86           88.43*            59.63             78.68         78.55
Random forest        79.57*         89.40*          81.33             75.00*            85.17*        86.24*
Naïve Bayes          78.67          81.88           86.75             63.29             84.24         84.63
* Highest value in each column among the three methods

Table 14 Results of all models using feature selection (3-factor)

Model                Accuracy (%)   Precision (%)   Sensitivity (%)   Specificity (%)   F-score (%)   AUC (%)
J48 decision tree    75.22          70.20           89.83*            59.82             78.81         81.28
Random forest        75.22          82.12*          80.52             64.47*            81.31         82.27
Naïve Bayes          79.13*         81.60           88.08             62.03             84.71*        86.15*
* Highest value in each column among the three methods

Table 15 Results of all models using feature selection (5-factor)

Model                Accuracy (%)   Precision (%)   Sensitivity (%)   Specificity (%)   F-score (%)   AUC (%)
J48 decision tree    75.65          70.86           89.92*            60.36             79.26         80.84
Random forest        73.91          80.79           79.74             62.34*            80.26         81.77
Naïve Bayes          77.83*         81.25*          86.09             62.03             83.60*        84.10*
* Highest value in each column among the three methods

Fig. 12 Graph comparing accuracy across models and datasets

Fig. 13 Graph comparing precision across models and datasets

5 Discussion and conclusion

This paper presented classification models suitable for electronic diagnostic systems to be implemented in the IoMT environment. The models were trained using three machine learning algorithms and evaluated to predict


whether a subject’s diabetes mellitus diagnosis is positive according to eight given attributes.

Fig. 14 Graph comparing sensitivity across models and datasets

Fig. 15 Graph comparing specificity across models and datasets

Fig. 16 Graph comparing F-score across models and datasets

Fig. 17 Graph comparing AUC across models and datasets

The experimental results in Sect. 4.3 show that, on the full Pima Indian Diabetes dataset, the random forest classifier outperformed both the Naïve Bayes and J48 decision tree in accuracy (79.57%), precision (89.40%), specificity (75.00%), F-score (85.17%), and AUC (86.24%), while the J48 had the best sensitivity (88.43%) of the three. The large gap between sensitivity and specificity is likely due to the imbalance between samples of class 0 and class 1.
However, the results for the 3-factor and 5-factor data subsets that used feature selection show that the Naïve Bayes classification model outperformed both the random forest and the J48 decision tree models for accuracy. The Naïve Bayes model on the 3-factor data subset performed nearly as well as the random forest model on the full dataset, with an accuracy of 79.13% compared to 79.57%, which was the highest accuracy in this experiment. These results suggest that a Naïve Bayes model works well with a more fine-tuned selection of features for binary classification but falls short with numerous correlated features, while random forest works better with more features.
The J48 decision tree model consistently achieved a sensitivity between 88.43% (full dataset) and 89.92% (5-factor data subset), showing that it is good at predicting the presence of diabetes no matter how many features it has to work with.
Although the models in this experiment only approach 80% accuracy and can be improved, our results are in line with similar work by Iyer et al. [34]. A positive sign was that we observed no overfitting in our approach; in other words, the results are more likely to reflect real-world performance. The top five most important features as indicated by the feature selection methodologies (glucose, BMI, age, insulin, and skin fold thickness) are in line with existing guidelines, which indicate that age and weight (indicated through BMI and skin fold thickness) play a major role in the diagnosis and occurrence of diabetes mellitus.
Based on this experiment, an e-diagnosis system for detecting and classifying diabetes as an IoMT application is proposed. The data from IoMT and the advanced ML algorithms enable the e-diagnosis system to predict the diagnosis of diabetes based on patient data, provide doctors with a preliminary diagnosis, and return feedback on the doctor’s guidance on diet, exercise, and blood glucose


testing to patients. This system also contributes to the remote monitoring and management of patients with chronic diseases. Employing the IoMT can make data collection and analysis easier. In the IoMT, different medical systems, applications, devices, patients, and doctors are able to connect with each other, making massive amounts of data accessible to the e-diagnosis system. With these medical data, the system can be improved constantly. Moreover, where we cannot control the quality of the datasets provided by medical researchers, we can develop further algorithms to improve accuracy.
Our future work will include developing innovative methods and applying them to other types of medical analysis. For example, the accuracy may be enhanced by using suitable pre-processing techniques for data management and analysis. New automation and automated processes with IoMT can be developed to improve the prediction of diabetes mellitus and other non-communicable diseases.

Acknowledgements This research is partly supported by VC Research (VCR 0000159) for Prof Chang.

Declarations

Conflict of interest The authors declare no conflicts of interest.

References

1. World Health Organisation (2016) Global Report on Diabetes. https://www.who.int/publications-detail/global-report-on-diabetes. Accessed 24 Apr 2020
2. World Health Organization (2013) Global action plan for the prevention and control of NCDs 2013–2020. https://apps.who.int/iris/bitstream/handle/10665/94384/9789241506236_eng.pdf;jsessionid=5A344A4B152EABE6C021C9A7EEE444C8?sequence=1. Accessed 24 Feb 2021
3. Schulz LO, Bennett PH, Ravussin E, Kidd JR, Kidd KK, Esparza J, Valencia ME (2006) Effects of traditional and western environments on prevalence of type 2 diabetes in Pima Indians in Mexico and the US. Diabetes Care 29(8):1866–1871
4. Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the annual symposium on computer application in medical care, pp 261–265
5. Larabi-Marie-Sainte S, Aburahmah L, Almohaini R, Saba T (2019) Current techniques for diabetes prediction: review and case study. Appl Sci 9(21):4604
6. Pratap Singh R, Javaid M, Haleem A, Vaishya R, Ali S (2020) Internet of Medical Things (IoMT) for orthopaedic in COVID-19 pandemic: roles, challenges, and applications. J Clin Orthop Trauma 11(4):713–717. https://doi.org/10.1016/j.jcot.2020.05.011
7. Pustokhina IV, Pustokhin DA, Gupta D, Khanna A, Shankar K, Nguyen GN (2020) An effective training scheme for deep neural network in edge computing enabled Internet of Medical Things (IoMT) systems. IEEE Access 8:107112–107123. https://doi.org/10.1109/ACCESS.2020.3000322
8. Alsubaei F, Abuhussein A, Shandilya V, Shiva S (2019) IoMT-SAF: Internet of Medical Things security assessment framework. Internet Things 8:100123. https://doi.org/10.1016/j.iot.2019.100123
9. Khan SU, Islam N, Jan Z, Din IU, Khan A, Faheem Y (2019) An e-Health care services framework for the detection and classification of breast cancer in breast cytology images as an IoMT application. Futur Gener Comput Syst 98:286–296. https://doi.org/10.1016/j.future.2019.01.033
10. Divya K, Sirohi A, Pande S, Malik R (2021) An IoMT assisted heart disease diagnostic system using machine learning techniques. In: Hassanien AE, Khamparia A, Gupta D, Shankar K, Slowik A (eds) Cognitive Internet of Medical Things for smart healthcare, vol 311. Springer, New York, pp 145–161. https://doi.org/10.1007/978-3-030-55833-8_9
11. Kumar PM, Devi Gandhi U (2018) A novel three-tier Internet of Things architecture with machine learning algorithm for early detection of heart diseases. Comput Electr Eng 65:222–235. https://doi.org/10.1016/j.compeleceng.2017.09.001
12. Kaur H, Kumari V (2020) Predictive modelling and analytics for diabetes using a machine learning approach. In: Applied computing and informatics, ahead-of-print (ahead-of-print). https://doi.org/10.1016/j.aci.2018.12.004
13. Adeniyi EA, Ogundokun RO, Awotunde JB (2021) IoMT-based wearable body sensors network healthcare monitoring system. In: Marques G, Bhoi AK, de Albuquerque VHC, Hareesha KS (eds) IoT in healthcare and ambient assisted living, vol 933. Springer, Singapore, pp 103–121. https://doi.org/10.1007/978-981-15-9897-5_6
14. Komi M, Li J, Zhai Y, Zhang X (2017) Application of data mining methods in diabetes prediction. In: 2017 2nd international conference on image, vision and computing (ICIVC), Chengdu, China, pp 1006–1010. https://doi.org/10.1109/ICIVC.2017.7984706
15. Ramanujam E, Chandrakumar T, Thivyadharsine KT, Varsha D (2020) A multilingual decision support system for early detection of diabetes using machine learning approach: case study for rural Indian people. In: 2020 fifth international conference on research in computational intelligence and communication networks (ICRCICN). Bangalore, India, pp 17–21. https://doi.org/10.1109/ICRCICN50933.2020.9296187
16. Kumar PS, Kumari AK, Mohapatra S, Naik B, Nayak J, Mishra M (2021) CatBoost ensemble approach for diabetes risk prediction at early stages. In: 2021 1st Odisha international conference on electrical power engineering, communication and computing technology (ODICON). Bhubaneswar, India, pp 1–6. https://doi.org/10.1109/ODICON50556.2021.9428943
17. Maan V, Vijaywargiya J, Srivastava M (2020) Diabetes prognostication: an aptness of machine learning. In: 2020 international conference on emerging trends in communication, control and computing (ICONC3). Lakshmangarh, Sikar, India, pp 1–5. https://doi.org/10.1109/ICONC345789.2020.9117465
18. Samant P, Agarwal R (2018) Machine learning techniques for medical diagnosis of diabetes using iris images. Comput Methods Programs Biomed 157:121–128. https://doi.org/10.1016/j.cmpb.2018.01.004
19. Samant P, Agarwal R (2018) Comparative analysis of classification based algorithms for diabetes diagnosis using iris images. J Med Eng Technol 42:35–42. https://doi.org/10.1080/03091902.2017.1412521
20. Quinlan JR (1996) Learning decision tree classifiers. ACM Comput Surv (CSUR) 28(1):71–72
21. Saxena R (2017) How decision tree algorithm works. https://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/. Accessed Apr 40.


22. Rochmawati N, Hidayati HB, Yamasari Y, Yustanti W, Rakhmawati L, Tjahyaningtijas HPA, Anistyasari Y (2020) Covid symptom severity using decision tree. In: 2020 third international conference on vocational education and electrical engineering (ICVEE). Surabaya, Indonesia, pp 1–5. https://doi.org/10.1109/ICVEE50212.2020.9243246
23. Gomathi S, Narayani V (2015) Monitoring of Lupus disease using decision tree induction classification algorithm. In: 2015 international conference on advanced computing and communication systems. Coimbatore, India, pp 1–6. https://doi.org/10.1109/ICACCS.2015.7324054
24. Abdar M, Nasarian E, Zhou X, Bargshady G, Wijayaningrum VN, Hussain S (2019) Performance improvement of decision trees for diagnosis of coronary artery disease using multi filtering approach. In: 2019 IEEE 4th international conference on computer and communication systems (ICCCS). Singapore, pp 26–30. https://doi.org/10.1109/CCOMS.2019.8821633
25. Yiu T (2019) Understanding random forest: how the algorithm works and why it is so effective. https://towardsdatascience.com/understanding-random-forest-58381e0602d2. Accessed 22 May
26. Seifert S (2020) Application of random forest based approaches to surface-enhanced Raman scattering data. Sci Rep 10:5436. https://doi.org/10.1038/s41598-020-62338-8
27. You J, van der Klein SAS, Lou E, Zuidhof MJ (2020) Application of random forest classification to predict daily oviposition events in broiler breeders fed by precision feeding system. Comput Electron Agric 175:105526. https://doi.org/10.1016/j.compag.2020.105526
28. Burdi F, Setianingrum AH, Hakiem N (2016) Application of the Naive Bayes method to a decision support system to provide discounts (case study: PT. Bina Usaha Teknik). In: 2016 6th international conference on information and communication technology for The Muslim World (ICT4M). Jakarta, pp 281–285. https://doi.org/10.1109/ICT4M.2016.064
29. Pandiangan N, Buono MLC, Loppies SHD (2020) Implementation of decision tree and Naïve Bayes classification method for predicting study period. J Phys Conf Ser 1569:022022. https://doi.org/10.1088/1742-6596/1569/2/022022
30. Daniati E (2019) Decision support systems to determining programme for students using DBSCAN and Naive Bayes: case study: engineering faculty of Universitas Nusantara PGRI Kediri. In: 2019 international conference of artificial intelligence and information technology (ICAIIT). Yogyakarta, Indonesia, pp 238–243. https://doi.org/10.1109/ICAIIT.2019.8834474
31. Akbar R, Nasution SM, Prasasti AL (2020) Implementation of Naive Bayes algorithm on IoT-based smart laundry mobile application system. In: 2020 international conference on information technology systems and innovation (ICITSI). Bandung - Padang, Indonesia, pp 8–13. https://doi.org/10.1109/ICITSI50517.2020.9264938
32. Zia UA, Khan N (2017) Predicting diabetes in medical datasets using machine learning techniques. Int J Sci Eng Res 5(2):257–267
33. Mercaldo F, Nardone V, Santone A (2017) Diabetes mellitus affected patients classification and diagnosis through machine learning techniques. Procedia Comput Sci 112:2519–2528
34. Iyer A, Jeyalatha S, Sumbaly R (2015) Diagnosis of diabetes using classification mining techniques. Int J Data Min Knowl Managt Process (IJDKP) 5(1):1–14
35. Cheng D, Ting C, Ho C, Ho C (2020) Performance evaluation of explainable machine learning on non-communicable diseases. Solid State Technol 63:2780–2793
36. Athanasiou M, Sfrintzeri K, Zarkogianni K, Thanopoulou A, Nikita K (2020) An explainable XGBoost-based approach towards assessing the risk of cardiovascular disease in patients with type 2 diabetes mellitus. In: 2020 IEEE 20th international conference on bioinformatics and bioengineering (BIBE), Cincinnati, OH, USA, 2020, pp 859–864. https://doi.org/10.1109/BIBE50027.2020.00146

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

123

Major Final Report Kartik
No ratings yet
Major Final Report Kartik
53 pages
Psycho-Spirituality: Definition, Global Applications, Current Issues
No ratings yet
Psycho-Spirituality: Definition, Global Applications, Current Issues
39 pages
" Craniosacral Therapy: and Somato-Emotional Release The Self-Healing Body
100% (1)
" Craniosacral Therapy: and Somato-Emotional Release The Self-Healing Body
260 pages
Towards Real-Time Monitoring and Risk Assessment of Diabetes Complications Using Optimized Machine Learning Models
No ratings yet
Towards Real-Time Monitoring and Risk Assessment of Diabetes Complications Using Optimized Machine Learning Models
5 pages
An Effective Pre-Processing Techniques For Diabetes Mellitus Prediction in Healthcare Systems
No ratings yet
An Effective Pre-Processing Techniques For Diabetes Mellitus Prediction in Healthcare Systems
15 pages
Dinesh Paper On Diabetes Mellitus (9%)
No ratings yet
Dinesh Paper On Diabetes Mellitus (9%)
8 pages
Critical Ill PDF
No ratings yet
Critical Ill PDF
5 pages
Sat - 17.Pdf - Machine Learning Models For Diagnosis of The Diabetic Patient and Predicting Insulin Dosage
No ratings yet
Sat - 17.Pdf - Machine Learning Models For Diagnosis of The Diabetic Patient and Predicting Insulin Dosage
11 pages
Diabetes Prediction Using Machine Learning
No ratings yet
Diabetes Prediction Using Machine Learning
6 pages
Bio-Inspired PSO For Improving Neural Based Diabetes Prediction System
No ratings yet
Bio-Inspired PSO For Improving Neural Based Diabetes Prediction System
21 pages
Seminar Paper
No ratings yet
Seminar Paper
9 pages
PPR 5
No ratings yet
PPR 5
24 pages
Prediction of Diabetes Using Machine Learning: A Modern User-Friendly Model
No ratings yet
Prediction of Diabetes Using Machine Learning: A Modern User-Friendly Model
7 pages
Analyze The Use of Machine Learning Models in The Pima Diabetes Data Set For Early Stage Detection
No ratings yet
Analyze The Use of Machine Learning Models in The Pima Diabetes Data Set For Early Stage Detection
5 pages
(IJCST-V12I2P11) :K. Pranith, V. Aravind, R. Pavan, Mr. K. Anil Kumar
No ratings yet
(IJCST-V12I2P11) :K. Pranith, V. Aravind, R. Pavan, Mr. K. Anil Kumar
4 pages
Two Machine Learning Hybrid Models For Predicting.2
No ratings yet
Two Machine Learning Hybrid Models For Predicting.2
22 pages
Predicting Diabetes Mellitus in Healthcare: A Comparative Analysis of Machine Learning Algorithms On Big Dataset
No ratings yet
Predicting Diabetes Mellitus in Healthcare: A Comparative Analysis of Machine Learning Algorithms On Big Dataset
12 pages
AICTE Internship 2024 Project Report Template 2
No ratings yet
AICTE Internship 2024 Project Report Template 2
27 pages
Diabetes Prediction Using Machine Learning Algorithms and Ontology
No ratings yet
Diabetes Prediction Using Machine Learning Algorithms and Ontology
19 pages
Performance Analysis of Diabetes Detection Using Machine Learning Classifiers
No ratings yet
Performance Analysis of Diabetes Detection Using Machine Learning Classifiers
12 pages
Bright Futures Tool & Resource Kit
No ratings yet
Bright Futures Tool & Resource Kit
24 pages
1 s2.0 S1877050923005781 Main
No ratings yet
1 s2.0 S1877050923005781 Main
8 pages
1 RV
No ratings yet
1 RV
6 pages
Infusion Therapy Study Guide Questions
No ratings yet
Infusion Therapy Study Guide Questions
94 pages
ICD-10 Common Codes Related To Autoimmune: Diagnostic Services
No ratings yet
ICD-10 Common Codes Related To Autoimmune: Diagnostic Services
2 pages
1 s2.0 S2772671124002419 Main (Asp)
No ratings yet
1 s2.0 S2772671124002419 Main (Asp)
18 pages
Prediction of Diabetes Disease Using An Ensemble of Machine Learning Multi-Classifier Models
No ratings yet
Prediction of Diabetes Disease Using An Ensemble of Machine Learning Multi-Classifier Models
24 pages
PM For Diabetes
No ratings yet
PM For Diabetes
11 pages
Predicting Diabetes Using Deep Learning Techniques: A Study On The Pima Dataset
No ratings yet
Predicting Diabetes Using Deep Learning Techniques: A Study On The Pima Dataset
15 pages
Final Seminar Report Soumya
No ratings yet
Final Seminar Report Soumya
20 pages
TDP Sem 3
No ratings yet
TDP Sem 3
9 pages
An Analytical Paradigm For Exploration of Diabetes Using Machine Learning
No ratings yet
An Analytical Paradigm For Exploration of Diabetes Using Machine Learning
8 pages
Sustainability 15 13484 v2
No ratings yet
Sustainability 15 13484 v2
24 pages
Integrating Machine Learning For Accurate Prediction of Early Diabetes - A Novel Approach
No ratings yet
Integrating Machine Learning For Accurate Prediction of Early Diabetes - A Novel Approach
24 pages
Efficient Binary Classifier For Prediction of Diabetes Using Data Preprocessing and Support Vector Machine
No ratings yet
Efficient Binary Classifier For Prediction of Diabetes Using Data Preprocessing and Support Vector Machine
2 pages
An Effective Approach For Detecting Diabetes Using Deep Learning Techniques Based On Convolutional LSTM Networks
No ratings yet
An Effective Approach For Detecting Diabetes Using Deep Learning Techniques Based On Convolutional LSTM Networks
7 pages
Performance Analysis of Deep Neural Network and Machine Learning Algorithms For Diabetes Prediction
No ratings yet
Performance Analysis of Deep Neural Network and Machine Learning Algorithms For Diabetes Prediction
6 pages
Diabetes Detection
No ratings yet
Diabetes Detection
19 pages
Chapt MP Report Format 23-24
No ratings yet
Chapt MP Report Format 23-24
16 pages
Diagnosis of Diabetes Using Machine Learning
No ratings yet
Diagnosis of Diabetes Using Machine Learning
12 pages
3 Journal
No ratings yet
3 Journal
9 pages
Ijarcce 2020 9712
No ratings yet
Ijarcce 2020 9712
7 pages
Predictive Machine Learning Applying Cross Industry Standard Process For Data Mining For The Diagnosis of Diabetes Mellitus Type 2
No ratings yet
Predictive Machine Learning Applying Cross Industry Standard Process For Data Mining For The Diagnosis of Diabetes Mellitus Type 2
14 pages
Machine Learning and Applications CS522I1C
No ratings yet
Machine Learning and Applications CS522I1C
15 pages
Machine Learning Meets Healthcare: Predicting Diabetes Onset With EHR
No ratings yet
Machine Learning Meets Healthcare: Predicting Diabetes Onset With EHR
8 pages
Diabetes Prediction Using Machine Learning Techniques
No ratings yet
Diabetes Prediction Using Machine Learning Techniques
18 pages
Peerj Cs 1914
No ratings yet
Peerj Cs 1914
30 pages
Kush Don FINAL Jatu
No ratings yet
Kush Don FINAL Jatu
11 pages
DiabDeep Pervasive Diabetes Diagnosis Based On Wearable Medical Sensors and Efficient Neural Networks
No ratings yet
DiabDeep Pervasive Diabetes Diagnosis Based On Wearable Medical Sensors and Efficient Neural Networks
12 pages
10.3934 Publichealth.2023030
No ratings yet
10.3934 Publichealth.2023030
21 pages
Projectreport Diabetes Prediction
No ratings yet
Projectreport Diabetes Prediction
22 pages
2023 Article 5467
No ratings yet
2023 Article 5467
20 pages
Diabetes Decoded: Transitioning From Traditional Models To Hybrid Deep Learning Approaches
No ratings yet
Diabetes Decoded: Transitioning From Traditional Models To Hybrid Deep Learning Approaches
5 pages
Machine Learning and Deep Learning Techniques
No ratings yet
Machine Learning and Deep Learning Techniques
13 pages
Diabe PDF
No ratings yet
Diabe PDF
11 pages
2.1.psychomotor Skill Development Training For Clinical Preceptorship
No ratings yet
2.1.psychomotor Skill Development Training For Clinical Preceptorship
30 pages
11-A Risk Assessment and Prediction Framework For Diabetes Mellitus Using Machine Learning Algorithms
No ratings yet


Fig. 1 An E-diagnosis system enabled in IoMT

Fig. 2 Decision tree example

Table 1 Overview of Pima Indian diabetes dataset

Preg Number of times pregnant Numeric [0, 17]

Table 2 Statistical summary of

Fig. 4 Scatterplot matrix of features

Fig. 6 Principal component

Fig. 7 k-Means clustering with

Table 3 Confusion matrix template

Predicted positive True positive False positive

Table 7 Random forest confusion matrix

Predicted positive 136 15
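As a minimal sketch of how the confusion matrix template relates to a reported metric, the snippet below computes precision from the one random forest row that is visible here. It assumes the row follows the template layout, so that 136 is the true positive count and 15 is the false positive count. The function name `precision` is illustrative and does not come from the paper, and the other metrics cannot be reproduced because the predicted-negative row is not shown.

```python
def precision(true_positive: int, false_positive: int) -> float:
    """Share of predicted positives that are actually positive."""
    predicted_positive = true_positive + false_positive
    if predicted_positive == 0:
        raise ValueError("no positive predictions, precision is undefined")
    return true_positive / predicted_positive


# Values from the visible "Predicted positive" row of Table 7 (random forest),
# read as (true positives, false positives) per the Table 3 template.
# This reading is an assumption about the table layout.
rf_precision = precision(136, 15)
print(f"Random forest precision: {rf_precision:.4f}")
```

Under that reading, precision is 136 / (136 + 15), or roughly 0.90. Sensitivity and specificity would additionally need the false negative and true negative counts.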

Table 11 Naive Bayes confusion matrix with feature selection (3-

Predicted positive 133 30

Table 12 Naive Bayes confusion matrix with feature selection (5-

Predicted positive 130 30

Naive Bayes on the 3-factor subset, 5-factor subset, and

The receiver operating characteristics (ROC) curve and the

Fig. 9 ROC curves for all models on imputed data

Table 13 Results of all models using the only imputation

J48 decision tree 74.78 70.86 88.43 59.63 78.68 78.55

Table 14 Results of all models using feature selection (3-factor)

J48 decision tree 75.22 70.20 89.83 59.82 78.81 81.28

Table 15 Results of all models using feature selection (5-factor)

J48 decision tree 75.65 70.86 89.92 60.36 79.26 80.84

Fig. 17 Graph comparing AUC across models and datasets

The vast difference between sensitivity and specificity is
