Diabetes Project Using Machine Learning
PROJECT REPORT
On
BACHELOR OF TECHNOLOGY
In
SOFTWARE ENGINEERING
Submitted By
B.AKSHAYA (20TQ1A5601)
K.MANASWINI (20TQ1A5620)
P.VIGNESHWAR (20TQ1A5640)
U.VIVEK (20TQ1A5636)
Mrs. Vasantha
Assistant Professor
SIDDHARTHA INSTITUTE OF TECHNOLOGY AND SCIENCES
CERTIFICATE
This is to certify that the project report titled “AN ENSEMBLE APPROACH FOR PREDICTION
OF DIABETES MELLITUS”, submitted by B.AKSHAYA (20TQ1A5601),
K.MANASWINI (20TQ1A5620), P.VIGNESHWAR (20TQ1A5640), K.PAVAN
KUMAR (20TQ1A5606), and U.VIVEK (20TQ1A5636) in the B.Tech IV-I semester, Software Engineering,
is a record of bonafide work carried out by them. The results embodied in this report have not been
submitted to any other University for the award of any degree.
External Examiner
DECLARATION
We hereby declare that the work embodied in this dissertation, entitled “AN
ENSEMBLE APPROACH FOR PREDICTION OF DIABETES MELLITUS”, was carried
out by us during the year 2023-2024 in partial fulfillment of the requirements for the award
of the degree of Bachelor of Technology in Software Engineering from Siddhartha Institute of
Technology and Sciences, Narapally, and is an authentic record of our own work carried out under the
guidance of Mrs. Vasantha. We have not submitted the same to any other university or
organization for the award of any other degree.
B.Akshaya (20TQ1A5601)
U.Vivek (20TQ1A5636)
ACKNOWLEDGEMENT
We are thankful to our principal, Dr. M.JANARDHAN, and to our Head of the
Department, Mrs. Vasantha, Software Engineering, for their constant support
towards our project.
We would like to express our immense gratitude and sincere thanks to all the
faculty members of Software Engineering Department and our friends for their
valuable suggestions and support which directly or indirectly helped us in
successful completion of this work.
B.Akshaya (20TQ1A5601)
K.Manaswini (20TQ1A5620)
P.Vigneshwar (20TQ1A5640)
U.Vivek (20TQ1A5636)
SIDDHARTHA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Approved by AICTE, Affiliated to JNTU Hyderabad, Accredited by NAAC(A+))
MISSION STATEMENT
INDEX
1 INTRODUCTION
1.1 Motivation
1.2 Problem definition
1.3 Limitations of existing system
1.4 Proposed system
3.4 Modules
3.4.1 Pandas
3.4.2 sklearn.model_selection
3.4.3 sklearn.naive_bayes
3.4.4 sklearn.ensemble
3.4.5 sklearn.metrics
4 DESIGN
4.1 DFDs and UML diagrams
5 CODING
6 IMPLEMENTATION AND RESULTS
6.1 Explanation of Model Architecture
REFERENCES
ABSTRACT
Diabetes is a chronic disease characterised by elevated levels of glucose in the blood. Machine
learning algorithms can help identify and predict diabetes at an early stage. The main
objective of this study is to predict diabetes mellitus with better accuracy using an ensemble of
machine learning algorithms. The Pima Indians Diabetes dataset, which contains records of
patients with and without diabetes, is used for experimentation. The
proposed ensemble soft voting classifier performs binary classification using an ensemble of
three machine learning algorithms: random forest, logistic regression, and Naive Bayes.
The proposed methodology is evaluated empirically against
state-of-the-art methodologies and base classifiers (Logistic Regression, Support Vector
Machine, Random Forest, and Naive Bayes), with accuracy, precision, recall, and F1-score as the
evaluation criteria. On the PIMA diabetes dataset, the proposed ensemble approach gives the
highest accuracy, precision, recall, and F1-score among the compared models. A
comparative graph and an ROC curve are constructed from these values.
LIST OF FIGURES
1. Architecture
3. Sequence Diagram
4. Activity Diagram
8. Comparison Graph
9. ROC Curve
LIST OF TABLES
1. Dataset
2. Evaluation metrics
CHAPTER 1
INTRODUCTION
1.1 Motivation
As Software Engineering students, we chose the title "An Ensemble Approach for Prediction of Diabetes
Mellitus" for several reasons:
Relevant and timely topic: Diabetes mellitus is a prevalent disease affecting millions of
people worldwide, and its accurate classification and prediction can have a significant
impact on patients' quality of life. Exploring and improving the existing
methods for diabetes classification and prediction is therefore a relevant and timely research
topic.
Scope for innovation: There is still considerable room for improvement in the
existing methods for diabetes classification and prediction. Software engineering skills
can be applied to develop approaches that
enhance the accuracy and robustness of the prediction.
Potential impact: Accurate classification and prediction of diabetes can facilitate timely
diagnosis, treatment, and management of the disease. Work on this topic can contribute to the
development of tools and techniques that improve healthcare outcomes for millions of
people worldwide.
Overall, "An Ensemble Approach for Prediction of Diabetes Mellitus" is
a well-suited research topic for Software Engineering students.
1.2 Problem Definition
The problem addressed in this project is to develop a model that can accurately classify and
predict diabetes mellitus using an ensemble approach with soft voting classifier. The goal is to
create a model that can handle the complex and non-linear relationships between the features
and the target variable in diabetes datasets, as well as handle missing data, outliers, and
imbalanced datasets.
To address this problem, the project proposes an ensemble approach with a soft voting classifier,
which combines multiple machine learning algorithms: random forest,
logistic regression, and naive Bayes. Because it averages the predicted probabilities of
several models, the soft voting classifier tends to be less sensitive to the weaknesses of any
single model, such as sensitivity to outliers or to class imbalance. It is also better suited to
non-linear relationships between the features and the target variable, since a non-linear model
such as random forest can compensate for the linear decision boundary of logistic regression.
The ultimate goal of this project is to develop a model that can accurately predict diabetes
mellitus, which can help improve the diagnosis and treatment of this condition. By accurately
predicting diabetes, doctors can provide more personalized and effective treatment plans for
patients, leading to better health outcomes.
1.3 Limitations of Existing System
Before the soft voting classifier was used to predict diabetes mellitus, various machine learning
algorithms such as logistic regression, decision trees, and support vector machines were used.
However, these algorithms have their limitations when it comes to predicting complex medical
conditions such as diabetes mellitus.
One limitation is that they may not be able to handle missing data or outliers in the dataset,
which can result in inaccurate predictions.
Furthermore, these algorithms may have difficulty handling non-linear relationships between
the features and the target variable, which is common in medical datasets. This can result in
underfitting or overfitting of the model, leading to poor predictions.
Soft voting classifier is an improvement over these existing technologies because it is an
ensemble learning method that combines multiple machine learning algorithms to make
predictions. By combining the predictions of multiple models, it is more robust to outliers and
missing data, and can handle imbalanced datasets better.
The soft voting classifier is also better suited to non-linear relationships between the
features and target variable, as it can combine the strengths of multiple algorithms to make
accurate predictions. Additionally, it is flexible: it can
be used with a variety of machine learning algorithms, such as logistic regression, random
forest, and naive Bayes, as in the code listed in Chapter 5.
Overall, the soft voting classifier is a powerful machine learning algorithm that has several
advantages over existing technologies for predicting complex medical conditions such as
diabetes mellitus. By combining the strengths of multiple machine learning algorithms, it can
make accurate predictions even in the presence of missing data, outliers, and imbalanced
datasets.
1.4 Proposed System
The proposed system in this project is an ensemble approach for the classification and
prediction of diabetes mellitus using a soft voting classifier. The system aims to enhance the
accuracy of predictions for this complex problem by utilizing the strengths of multiple
machine learning algorithms.
The system utilizes three machine learning algorithms, namely random forest, logistic
regression, and naive Bayes. These algorithms are trained on the diabetes dataset to make
independent predictions. The soft voting classifier then averages the class probabilities
predicted by each model (optionally with per-model weights) and selects the class with the
highest average probability, so a confident model influences the result more than an uncertain one.
The soft voting classifier is a robust method for the classification and prediction of diabetes
mellitus. It can handle missing data and outliers, and it is effective for imbalanced datasets.
The system takes advantage of the nonlinear relationships between the features and target
variable in the dataset by combining the strengths of multiple algorithms.
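The averaging step of soft voting can be illustrated with a small numeric sketch. The probabilities and weights below are made-up values chosen for illustration, not outputs of the models in this report, and `soft_vote` is a hypothetical helper name. Equal weights correspond to a plain average.

```python
import numpy as np

# Hypothetical class probabilities [P(no diabetes), P(diabetes)] that three
# base models might report for one patient (illustrative values only).
probs = np.array([
    [0.30, 0.70],   # random forest
    [0.55, 0.45],   # logistic regression
    [0.40, 0.60],   # naive Bayes
])

def soft_vote(probs, weights=None):
    """Average per-model class probabilities; return (average, predicted class)."""
    avg = np.average(probs, axis=0, weights=weights)
    return avg, int(np.argmax(avg))

avg_equal, label_equal = soft_vote(probs)                          # plain average
avg_weighted, label_weighted = soft_vote(probs, weights=[2, 1, 1])  # favour model 1
print(avg_equal, label_equal)
print(avg_weighted, label_weighted)
```

Logistic regression alone would predict class 0 here, but the averaged probability still favours class 1 because the other two models are more confident.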
CHAPTER 2
LITERATURE SURVEY
1. Bashir et al. presented the Hierarchical Multi-level Classifiers Bagging with Multi-
Objective Upgraded Voting (HM-Bag Moov) procedure to classify diabetes and compared
it to other approaches like Naive Bayes (NB), Support Vector Machine (SVM), Logistic
Regression (LR), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbours (k-NN),
Random Forest (RF), and Artificial Neural Network (ANN). However, the work employed
a small number of ML algorithms to ensemble and did not take hyper-tuning and cross-
validation approaches into account. Finally, the HM-Bag Moov Voting Classifier showed
77.21% accuracy.
2. In 2012, Choubey et al. used a genetic algorithm (GA) with naive Bayes approaches to
predict diabetes in females aged 21–78 years for the purpose of dataset categorization. 768
occurrences were recorded in total. PIDD was first classified using naive Bayes, and
characteristics were added to and removed from the dataset using a genetic algorithm. It
improved the ROC and classification accuracy while also significantly reducing the
computational cost and time. The most accurate result and best ROC when compared to
other approaches are highlighted in the accuracy comparison of PIDD with ROC, GA, and
NB.
3. According to Hussein Asmaa S, Wail M et al. [13], there are currently 246 million
Americans living with diabetes or an associated condition. By 2025, that number is projected to
double to 500 million, and to reach 1 billion soon after. The authors use a decision tree
algorithm to determine how features are extracted from the dataset. Classes for new
instances are then predicted from the training data [20], producing predictions for
test data inputs.
4. Georga et al. coupled the regression model with the random forest (RF) method to predict
blood glucose concentration after first using the algorithm to rank the candidate feature
sets (blood glucose level, insulin dose, nutritional consumption, physical activity, etc.). For
blood glucose prediction and hypoglycemia alert, Mo et al employed the extreme learning
machine and the conventional limit learning machine. In most instances, the two machines'
combined performance is fairly good, with adequate sensitivity and good, steady
specificity.
5. According to S. Sivaranjani et al., diabetes has become one of the most life-threatening and, at the
same time, most common diseases, not only in India but around the world. Various
methods are currently used to predict diabetes and diabetes-related diseases. The authors used
the machine learning algorithms Support Vector Machine (SVM) and Random Forest (RF)
to identify the potential chances of being affected by diabetes-related
diseases.
CHAPTER 3
REQUIREMENTS ANALYSIS
3.1 Functional Requirements
Functional requirements are a set of specifications that define what a software system or
product should do, its features, functions, and capabilities. These requirements outline the
intended behaviour of the system or product and describe how it should interact with users
and other systems.
Input data: The system should be able to accept input data in the form of patient records,
which include features such as age, BMI, blood pressure, and glucose levels.
Pre-processing: The system should be able to pre-process the input data by performing tasks
such as data cleaning, feature selection, and feature scaling.
Ensemble model creation: The system should be able to create an ensemble of multiple
classifiers, such as random forest, logistic regression, and naive Bayes, trained on
the pre-processed data.
Soft voting classifier: The system should be able to combine the predictions of the individual
classifiers using a soft voting classifier approach.
Model persistence: The system should be able to save the trained ensemble model for future
use and retraining.
Performance evaluation: The system should be able to evaluate the performance of the
ensemble model using metrics such as accuracy, precision, recall, and F1-score.
Prediction: The system should be able to use the trained ensemble model to predict the
likelihood of diabetes mellitus for new patients, based on their input data.
Visualization: The system should be able to generate visualizations such as confusion
matrices, ROC curves, and feature importance plots to help users understand the performance
of the model and the importance of various features in the classification task.
3.2 Non-Functional Requirements
Non-functional requirements are a set of specifications that define how a software system or
product should behave, perform, or operate. Unlike functional requirements, non-functional
requirements do not describe the specific functions or features of the system, but rather its
qualities and characteristics.
Performance: The system should be able to handle large amounts of data efficiently, and
provide accurate predictions within a reasonable timeframe.
Usability: The system should have a user-friendly interface that is easy to navigate, with clear
instructions for inputting data and interpreting results.
Reliability: The system should be able to handle errors and unexpected inputs gracefully, and
provide helpful error messages or alerts to the user.
Scalability: The system should be able to scale up or down based on changing demand or data
volume, without sacrificing performance or reliability.
Maintainability: The system should be modular and well-documented, with clear code
structure and comments to facilitate future maintenance and updates.
Interoperability: The system should be able to work with different data formats and integrate
with other systems or applications.
Compatibility: The system should be compatible with different operating systems, browsers,
and devices to ensure a broad user base.
Accessibility: The system should be designed to be accessible to users with disabilities, such
as those who are visually impaired or have limited mobility.
3.3 SYSTEM CONFIGURATION:
These are the software specifications needed to make this project work.
3.4 MODULES
3.4.1 Pandas
The pandas module provides high-performance, user-friendly data structures and data analysis
tools for Python. In this code, it is used to read the diabetes.csv file into a pandas DataFrame
and to manipulate the data.
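A minimal sketch of this loading step is given below. So that it runs without the data file, it parses a three-row excerpt from an in-memory string; the column names follow the commonly distributed version of the Pima Indians Diabetes CSV and are an assumption about the file used in this project.

```python
import io
import pandas as pd

# Three-row excerpt in the assumed column layout of diabetes.csv; in the
# project the equivalent call would be pd.read_csv("diabetes.csv").
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""
df = pd.read_csv(io.StringIO(csv_text))

X = df.drop(columns="Outcome")   # the eight feature columns
y = df["Outcome"]                # binary target (1 = diabetic, 0 = not)
print(df.shape)
print(y.value_counts().to_dict())
```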
3.4.2 sklearn.naive_bayes
This module provides functions for creating Naïve Bayes classifiers, which are probabilistic
machine learning models that use Bayes' theorem to predict the probability of an outcome
based on input variables. In this code, it is used to create a Gaussian Naïve Bayes classifier.
3.4.3 sklearn.model_selection
Several functions are available in this module to divide data into training and testing sets. The
diabetes dataset is divided into training and testing sets in this code.
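The split can be sketched as follows, using random stand-in data in place of the diabetes dataset. The 80/20 ratio matches the test scenarios in Chapter 7; the `stratify` argument, which keeps the class ratio the same in both parts, is an optional extra that the report does not confirm was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # stand-in for the eight features
y = np.array([0] * 65 + [1] * 35)    # stand-in labels with a 65/35 class ratio

# Hold out 20% of the rows for testing; stratify=y preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```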
3.4.4 sklearn.ensemble
This module provides functions for creating ensemble models, which combine the predictions
of multiple classifiers. In this code, it is used to create a Random Forest classifier and a soft
voting ensemble model.
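A minimal sketch of building such a soft voting ensemble with scikit-learn's `VotingClassifier` follows. It uses synthetic data from `make_classification` as a stand-in for the diabetes dataset, and the base-model settings mirror those listed in Chapter 5; the estimator names ("rf", "lr", "nb") are arbitrary labels chosen here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (8 features, binary target); not the diabetes data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(random_state=42, max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="soft",   # average predict_proba outputs instead of counting votes
)
ensemble.fit(X_train, y_train)

y_prob_ensemble = ensemble.predict_proba(X_test)[:, 1]   # P(class 1)
y_pred_ensemble = ensemble.predict(X_test)
print(y_pred_ensemble[:10])
```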
3.4.5 sklearn.linear_model
This module provides functions for creating linear models, including the Logistic Regression
classifier. In this code, it is used to create a Logistic Regression classifier.
3.4.6 sklearn.metrics
This module provides a variety of functions for evaluating the performance of machine
learning models. In this code, it is used to compute the accuracy, precision, recall, F1-score,
and ROC AUC of each model's predictions, using functions such as accuracy_score().
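The metric functions can be exercised on a small hand-made example. The labels, predictions, and probabilities below are hypothetical values, not results from this project.

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)

# Hypothetical test labels, hard predictions, and predicted P(class 1).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),   # takes scores, not hard labels
}
for name, value in metrics.items():
    print("{}: {:.2f}%".format(name, value * 100))
```

Note that ROC AUC is computed from the predicted probabilities, while the other four metrics use the hard class predictions.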
CHAPTER 4
DESIGN
4.1 DFDs and UML diagrams
The Unified Modelling Language (UML) is a standardised, general-purpose modelling language.
It is managed, and was created, by the Object Management Group (OMG). The goal is for UML to
become a common language for creating models of object-oriented software.
UML consists of two major components in its current form: a meta-model and a notation.
In the future, some form of method or process may also be added to, or associated with, UML. The
Unified Modelling Language is a standard for specifying, constructing, and
documenting software artefacts, and it can also be used for business modelling and other non-software systems.
• Provide users with a ready-to-use, expressive visual modelling language so they can create
and exchange meaningful models.
• Provide extensibility and specialisation mechanisms to extend the core concepts.
Fig. 4.1.1 Architecture
Fig. 4.1.2 Use-Case Diagram
Fig. 4.1.3 Sequence Diagram
Fig. 4.1.4 Activity Diagram
Fig. 4.1.5 Dataflow Diagram
4.2 Algorithm
Data pre-processing: The input data is cleaned, normalized, and transformed to ensure that it
is suitable for analysis. This could involve tasks such as handling missing values, feature
scaling, and feature selection.
Model selection: A range of machine learning models are trained and evaluated on the
preprocessed data, including Naive Bayes, random forest, logistic regression, and
support vector machines. The models are compared on performance metrics such as accuracy,
precision, recall, and F1 score.
Ensemble model construction: Multiple models are combined using a soft voting strategy,
where the class probabilities output by each model are averaged (optionally with weights) to
produce the final prediction. Unlike bagging, boosting, or stacking, soft voting requires no
resampling, sequential reweighting, or meta-learner.
Model evaluation: The ensemble model is evaluated using a range of performance metrics
such as confusion matrix, ROC curve, and AUC score. The model may be further optimized
by adjusting the weights or hyperparameters of the individual models.
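The pre-processing step can be sketched as below. The report does not state which cleaning operations were applied, so this is one plausible choice rather than the project's actual procedure: zeros in physiological columns are treated as missing, filled with the column median, and the features are then standardised. The data values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame; the values are made up.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# Assumption: a zero in these columns means "not recorded".
df = df.replace(0, np.nan)

preprocess = make_pipeline(
    SimpleImputer(strategy="median"),   # fill gaps with the column median
    StandardScaler(),                   # zero mean, unit variance per column
)
X_clean = preprocess.fit_transform(df)
print(X_clean.round(2))
```

In a full workflow the pipeline would be fitted on the training split only and then applied to the test split, so that no test-set statistics leak into training.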
4.3 Classifiers
4.3.1 Logistic Regression
Logistic regression is a statistical technique used to analyse data in which one or more
independent variables predict a binary outcome. It is frequently used to estimate, on the basis
of a number of independent variables, the probability that a specific event will
occur.
4.3.2 Random Forest
Random forest is a machine learning technique that can be used for both classification and
regression tasks. It is an ensemble learning approach that combines multiple
decision trees to generate predictions.
The fundamental principle of random forest is to construct many decision trees, each of
which makes a prediction, and then aggregate those predictions to arrive at a final prediction.
Each tree is built from a random subset of the input features and of the training data. This
randomness reduces overfitting and improves the model's capacity to
generalise.
4.3.3 Naive Bayes
Naive Bayes is a probabilistic machine learning approach used for classification
problems. It is founded on Bayes' theorem, which gives the probability of an event given
prior knowledge of related conditions. Naive Bayes assumes
that, given the class label, the features used to produce a prediction are conditionally
independent of one another. This assumption is called "naive" because it frequently does not
hold in the real world. Despite this simplifying assumption, Naive Bayes is often surprisingly
effective in practice.
4.3.4 Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning method
used for both classification and regression problems. SVM is based on finding the optimal
boundary, or hyperplane, that separates the data into distinct classes. By using kernel
functions to map the input into a higher-dimensional space where a hyperplane can
separate the data, SVM can handle both linearly separable and non-linearly separable data.
The "diabetes.csv" dataset contains health-related information on female Pima Indian patients
who were at least 21 years old at the time the data were collected. The dataset has nine
attributes: eight features and the target variable, which indicates whether the patient has
diabetes or not.
The attributes in the dataset are listed below, along with a brief description of each one:
• Pregnancies: The number of times the patient has been pregnant. (Range: 0-17)
• Glucose: The patient's plasma glucose concentration at 2 hours in an oral glucose
tolerance test. (Range: 0-199)
• Blood Pressure: The patient's diastolic blood pressure, in mm Hg. (Range: 0-122)
• Skin Thickness: The patient's triceps skinfold thickness, in millimetres. (Range: 0-99)
• Insulin: The patient's 2-hour serum insulin level, in mu U/ml. (Range: 0-846)
• BMI: Body mass index, calculated as weight in kg/(height in m)^2. (Range: 0-67)
• Diabetes Pedigree Function: A score that estimates the hereditary risk
of diabetes mellitus based on the patient's family history. (Range: 0.078-2.42)
• Age: The patient's age in years. (Range: 21-81)
• Outcome: Whether the patient has diabetes (1) or not (0).
For each characteristic found in the dataset, the range values show the minimum and maximum
values. The information is used to create a prediction model that can determine from a patient's
health-related characteristics whether or not they have diabetes.
Fig. 4.4 Sample Data
CHAPTER 5
CODING
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import os
import sys
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, roc_curve, auc
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import confusion_matrix
def loading_screen():
# Your code goes here...
nb = GaussianNB()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(random_state=42, max_iter=1000)
svm = SVC(kernel='linear', probability=True)
# Make predictions on the test data
y_pred_nb = nb.predict(X_test)
y_pred_rf = rf.predict(X_test)
y_pred_lr = lr.predict(X_test)
y_pred_svm = svm.predict(X_test)
y_pred_ensemble = ensemble.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_roc_auc = roc_auc_score(y_test, y_prob_rf)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_roc_auc = roc_auc_score(y_test, y_prob_lr)
print("\nSVM model's accuracy: {:.2f}%".format(svm_accuracy * 100))
print("SVM model's F1 score: {:.2f}%".format(svm_f1 * 100))
print("SVM model's precision: {:.2f}%".format(svm_precision * 100))
print("SVM model's recall: {:.2f}%".format(svm_recall * 100))
print("SVM model's ROC AUC score: {:.2f}%".format(svm_roc_auc * 100))
plt.ylabel("True Labels")
plt.show()
# Generate the confusion matrix for SVM model
svm_cm = confusion_matrix(y_test, y_pred_svm)
roc_curve(y_test, y_prob_svm)
fpr_ensemble, tpr_ensemble, _ = roc_curve(y_test, y_prob_ensemble)
# Create a bar chart comparing the models' evaluation metrics
labels = ['Naive Bayes', 'Random Forest', 'Logistic Regression', 'Support Vector Machine', 'Ensemble']
accuracy_scores = [nb_accuracy, rf_accuracy, lr_accuracy, svm_accuracy, ensemble_accuracy]
precision_scores = [nb_precision, rf_precision, lr_precision, svm_precision, ensemble_precision]
recall_scores = [nb_recall, rf_recall, lr_recall, svm_recall, ensemble_recall]
f1_scores = [nb_f1, rf_f1, lr_f1, svm_f1, ensemble_f1]
x = np.arange(len(labels))
width = 0.2
fig.tight_layout()
plt.show()
print("THANK YOU")
CHAPTER 6
IMPLEMENTATION AND RESULTS
6.1 Explanation of Model Architecture
Logistic regression: Logistic regression is a statistical method for predicting binary outcomes
(y = 0 or 1). It is a linear learning algorithm whose predictions are based on the
probability of an event occurring. The LR algorithm maps each data point to a probability
between 0 and 1 using the sigmoid function, whose graph is the typical S-shaped logistic curve.
The sigmoid function is shown in equation (1).
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
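Equation (1) translates directly into a few lines of code. The function name `sigmoid` is chosen here for illustration; the sketch evaluates the function at a few points to show the S-shaped mapping into the interval (0, 1).

```python
import math

def sigmoid(x):
    """Logistic function from equation (1): maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1,
# and x = 0 sits exactly at the midpoint 0.5.
for x in (-4, 0, 4):
    print(x, round(sigmoid(x), 4))
```

In logistic regression the input x is a weighted sum of the features, and the output is read as the probability of the positive class.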
Naive Bayes: This classifier uses a probabilistic approach as its foundation. Bayes' theorem is
applied under the assumption that the presence of one feature in a class is unrelated to the
presence of any other feature in the same class. The probability of each class is
estimated from the joint probabilities of classes and feature values. This independence
assumption speeds up computation, since the parameters for each feature can be estimated
separately. A Bayesian network consists of a structural model and a set of conditional probabilities.
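Written out, the independence assumption leads to the usual naive Bayes decision rule. The symbols are notation introduced for this sketch: $y$ is the class label and $x_1, \dots, x_n$ are the feature values of one patient. In the Gaussian variant used in the code, each $P(x_i \mid y)$ is modelled as a normal density.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```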
6.2 Method of Implementation
Proposed ensemble soft voting classifier: This classifier, a meta-classifier, combines machine
learning models with similar or different conceptual underpinnings to make predictions by
voting. A voting classifier supports both hard and soft voting. In hard voting,
the class predicted by the majority of the base models is chosen as the final
prediction. In soft voting, the predicted class probabilities of the base models are averaged
and the class with the highest average probability is chosen, so each base model must provide
the predict_proba method. Because it integrates the predictions of several models, the voting
classifier often gives better overall results than the individual base models. Logistic
Regression, Naive Bayes, and Random Forest
classifiers have been combined in the suggested model. The output of predict_proba,
which provides the probability of each target class, is used by the soft voting classifier.
After the training data are shuffled, the logistic regression, Naive Bayes, and
Random Forest models each receive the data points and compute a separate prediction. A
voting aggregator then combines these predictions using the soft voting approach to produce
the final prediction. The suggested methodology's algorithm has been
shown.
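The difference between hard and soft voting is easiest to see on a case where they disagree. The sketch below uses hypothetical probabilities for a single patient, chosen so that two models lean weakly towards "no diabetes" while the third is strongly confident of "diabetes".

```python
import numpy as np

# Hypothetical P(diabetes) from three base models for one patient.
p_positive = np.array([0.45, 0.45, 0.95])

# Hard voting: each model casts one vote for its own predicted class.
votes = (p_positive >= 0.5).astype(int)
hard_label = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities, then threshold the mean.
soft_score = p_positive.mean()
soft_label = int(soft_score >= 0.5)

print("hard:", hard_label, "soft:", soft_label, "mean P:", round(soft_score, 3))
```

Hard voting discards how confident each model is, so the two weak votes outnumber the strong one; soft voting keeps that information, which is why it requires `predict_proba` from every base model.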
The output screen shows metrics such as F1-score, accuracy, AUC, recall, and
precision, which are calculated for the individual classifiers:
• Logistic Regression
• Naïve Bayes
• Random Forest
6.2.2 Result Analysis
A table called a confusion matrix is frequently used to assess how well a machine learning
model is working. It is a tool that enables us to compare predicted labels with actual labels in
a dataset to visualise the performance of a model.
True positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are the
four separate metrics that make up the confusion matrix. These metrics are determined by
contrasting the model's projected labels with the dataset's actual labels. Typically, the
confusion matrix is set up as follows:
                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)
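The four counts, and the metrics derived from them, can be reproduced with a short sketch. The labels below are hypothetical. One detail worth noting is that scikit-learn orders the rows and columns of the matrix by label value, so class 0 comes first.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = diabetic, 0 = non-diabetic.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# scikit-learn puts label 0 first, so the matrix reads [[TN, FP], [FN, TP]],
# which is the reverse of the positive-first layout in the table above.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of predicted positives, how many are real
recall = tp / (tp + fn)             # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(cm)
print(accuracy, precision, recall, f1)
```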
Fig. 6.2.2
Comparative graph and ROC (Receiver Operating Characteristic) graph are two types of
visualizations commonly used in data analysis and machine learning. A comparative graph is
a type of chart that allows for the comparison of two or more variables or sets of data. It is
useful for identifying trends and patterns, as well as outliers or anomalies. Examples of
comparative graphs include bar charts, line charts, and scatter plots.
On the other hand, a ROC graph is a plot that illustrates the performance of a binary classifier
as its discrimination threshold is varied. It is often used in binary classification problems to
evaluate the performance of a model by plotting the true positive rate (TPR) against the false
positive rate (FPR) at different threshold settings. The area under the ROC curve (AUC-ROC)
is a commonly used metric to assess the overall performance of a classifier.
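A minimal sketch of how the ROC points and the AUC are obtained with scikit-learn is shown below, using hypothetical labels and predicted probabilities rather than the project's results.

```python
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and predicted P(class 1).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

# Each threshold yields one (FPR, TPR) point on the curve.
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)   # trapezoidal area under the (FPR, TPR) points

for f, t, thr in zip(fpr, tpr, thresholds):
    print("threshold {:.2f}: FPR={:.2f}, TPR={:.2f}".format(thr, f, t))
print("AUC = {:.4f}".format(roc_auc))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation; plotting `fpr` against `tpr` with matplotlib gives the ROC graph itself.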
CHAPTER 7
Scenario 1
We give a dataset consisting of (8 + 1) attributes (eight features plus the outcome) and 455
rows as input to the classifier; 80 percent of the data is used as training data and the rest as test data.
Scenario 2
We give a dataset consisting of (8 + 1) attributes (eight features plus the outcome) and 580
rows as input to the classifier; 80 percent of the data is used as training data and the rest as test data.
Scenario 3
We give a dataset consisting of (8 + 1) attributes (eight features plus the outcome) and 769
rows as input to the classifier; 80 percent of the data is used as training data and the rest as test data.
7.2 Validation
During the validation phase of the project, the focus is on determining how well the ensemble
approach can predict diabetes mellitus using a soft voting classifier. The following steps are
typically taken during the validation phase:
• First, the data is prepared for model training and testing by undergoing preprocessing,
cleaning, and transformation into the appropriate format.
• Next, the data is split into training and testing sets. The model is trained using the
training set, and the testing set is used to evaluate its performance.
• The soft voting classifier is trained using the training set. This classifier combines the
predictions from several base models to arrive at a final prediction.
• The performance of the soft voting classifier is then evaluated using the testing set.
Metrics such as accuracy, precision, recall, and F1-score are used to evaluate
performance. A confusion matrix and ROC curve can also be used for this purpose.
• The ensemble approach may need to be fine-tuned by adjusting the hyperparameters
of the base models or the weights assigned to the base models in the soft voting
classifier.
• Finally, the ensemble approach is evaluated on a holdout dataset or using
cross-validation to ensure that the model performs well on new, unseen data and is not
overfitting to the training data.
The validation phase is crucial to ensure that the diabetes mellitus prediction system is accurate
and reliable when predicting outcomes for new data.
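The cross-validation step mentioned above can be sketched as follows. The data are synthetic, and a single Gaussian Naive Bayes model is used to keep the example fast; a soft voting ensemble could be passed to `cross_val_score` in the same way. The choice of five stratified folds is an assumption, not a setting taken from this report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (8 features, binary target); not the diabetes data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Five stratified folds: each fold serves once as the held-out test part.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")

print(scores.round(3))
print("mean accuracy: {:.3f}".format(scores.mean()))
```

A large spread between the fold scores would suggest that a single train/test split gives an unreliable estimate of performance.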
CHAPTER 8
CONCLUSION
The proposed soft voting classifier model, which combines three machine learning algorithms,
namely random forest, logistic regression, and Naive Bayes, performed better in our experiments
than the other machine learning models considered for predicting diabetes. The
accuracy obtained using the proposed model is higher than that obtained by models such as
Support Vector Machine (SVM) and Artificial Neural Networks (ANN).
Furthermore, the robustness and reliability of the proposed model were demonstrated using
two datasets, namely the Pima Indians diabetes dataset and the breast cancer dataset. The soft
voting classifier model exhibited an accuracy of 83.41% on the Pima Indians diabetes dataset,
indicating its effectiveness in addressing the challenge of early detection of diabetes. The
model's high accuracy suggests its potential for widespread implementation in clinical settings
and improving patient outcomes.
As the field of machine learning continues to evolve, future studies may leverage different
deep learning models to further enhance the accuracy of the proposed approach. Nonetheless,
the proposed soft voting classifier model provides a promising solution for the early
recognition of diabetes, which is a critical health concern in modern times.
In conclusion, this study highlights the potential of machine learning in improving medical
diagnosis and management, underscoring the importance of continued research in this area to
advance the development of effective and efficient diagnostic tools.
CHAPTER 9
FUTURE SCOPE
REFERENCES
[1] T. M. Alam, M. A. Iqbal, Y. Ali et al., “A Model for Early Prediction of Diabetes,”
Informatics in Medicine Unlocked, vol. 16, Article ID 100204, 2019.
[2] A. Mahabub, “A Robust Voting Approach for Diabetes Prediction Using Traditional
Machine Learning Techniques,” SN Applied Sciences, Springer, 2019.
[4] K. Dwivedi, “Analysis of decision tree for diabetes prediction,” International Journal of
Engineering and Technical Research, vol. 9, 2019.
[5] Q. Zou, K. Qu, Y. Luo, and D. Yin, “Predicting diabetes mellitus with machine learning
techniques,” Frontiers in Genetics, vol. 9, 2018.
[6] W. Wang, T. Meng, and M. Yu, “Blood glucose prediction with VMD and LSTM
optimized by improved particle swarm optimization,” IEEE Access, vol. 8, pp. 217908–
217916, 2020.