
A

PROJECT REPORT

On

AN ENSEMBLE APPROACH FOR PREDICTION OF


DIABETES MELLITUS

A dissertation submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
In

SOFTWARE ENGINEERING

Submitted By

B.AKSHAYA (20TQ1A5601)
K.MANASWINI (20TQ1A5620)

P.VIGNESHWAR (20TQ1A5640)

K.PAVAN KUMAR (20TQ1A5606)

U.VIVEK (20TQ1A5636)

Under the esteemed guidance of

Mrs. Vasantha
Assistant Professor

Department of Software Engineering


SIDDHARTHA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Approved by AICTE & Affiliated to JNTUH)
Korremula Road, Narapally, Ghatkesar, Medchal District-500088
2023-2024

SIDDHARTHA INSTITUTE OF TECHNOLOGY AND SCIENCES

(Approved by AICTE, Affiliated to JNTU Hyderabad, Accredited by NAAC(A+))

Korremula Road, Narapally(V), Ghatkesar Mandal, Medchal-Dist:-500088

CERTIFICATE

This is to certify that the project report titled “AN ENSEMBLE APPROACH FOR PREDICTION OF DIABETES MELLITUS” being submitted by B.AKSHAYA (20TQ1A5601), K.MANASWINI (20TQ1A5620), P.VIGNESHWAR (20TQ1A5640), K.PAVAN KUMAR (20TQ1A5606), and U.VIVEK (20TQ1A5636) in B.Tech IV-I semester Software Engineering is a record of bonafide work carried out by them. The results embodied in this report have not been submitted to any other University for the award of any degree.

Internal Guide: Mrs. Vasantha

Head of the Department: Mrs. Vasantha

Principal: Dr. M. Janardhan

External Examiner

DECLARATION

We hereby declare that the work embodied in this dissertation entitled “AN ENSEMBLE APPROACH FOR PREDICTION OF DIABETES MELLITUS”, carried out during the year 2023-2024 in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Software Engineering from Siddhartha Institute of Technology and Sciences, Narapally, is an authentic record of our own work carried out under the guidance of Mrs. Vasantha. We have not submitted the same to any other university or organization for the award of any other degree.

B.Akshaya (20TQ1A5601)

Date: K.Manaswini (20TQ1A5620)

Place: Narapally P.Vigneshwar (20TQ1A5640)

K.Pavan Kumar (20TQ1A5606)

U.Vivek (20TQ1A5636)

ACKNOWLEDGEMENT

This is an acknowledgement of the intensive drive and technical competence of


many individuals who have contributed to the success of our project.

We are thankful to our Principal, Dr. M. JANARDHAN, and to our Head of the Department, Mrs. Vasantha, Software Engineering, for their constant support towards our project.

We are grateful to our guide Mrs.Vasantha, Software Engineering


Department, who gave us the necessary motivation and support during the full
course of our project.

We would like to express our immense gratitude and sincere thanks to all the
faculty members of Software Engineering Department and our friends for their
valuable suggestions and support which directly or indirectly helped us in
successful completion of this work.

B.Akshaya (20TQ1A5601)

K.Manaswini (20TQ1A5620)

P.Vigneshwar (20TQ1A5640)

K.Pavan Kumar ( 20TQ1A5606)

U.Vivek (20TQ1A5636)

SIDDHARTHA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Approved by AICTE, Affiliated to JNTU Hyderabad, Accredited by NAAC(A+))

Korremula Road, Narapally(V), Ghatkesar Mandal, Medchal-Dist:-500088

Vision of the Department: To be a Recognized Center of Computer Science


Education with values and quality research.

Mission of the Department:

DM1: Impart High Quality Professional Training With An Emphasis On Basic Principles Of Computer Science And Allied Engineering.

DM2: Imbibe Social Awareness And Responsibility To Serve The Society.

DM3: Provide Academic Facilities And Organize Collaborative Activities To Enable Overall Development Of Stakeholders.

Programme Educational Objectives (PEO)

• PEO1: Graduates will be able to synthesize mathematics, science, engineering fundamentals, and laboratory and work-based experiences to formulate and solve problems proficiently in Software Engineering and related domains.
• PEO2: Graduates will be prepared to communicate effectively and work in multidisciplinary engineering projects following the ethics of their profession.
• PEO3: Graduates will recognize the importance of, and acquire the skill of, independent learning to excel as experts in the field with sound knowledge.

INDEX

CHAPTER NO TITLES PAGE NO


ABSTRACT vi
LIST OF FIGURES vii
LIST OF TABLES viii

1 INTRODUCTION 1-3
1.1 Motivation
1.2 Problem definition
1.3 Limitations of existing system
1.4 Proposed system

2 LITERATURE SURVEY 4-5


3 REQUIREMENTS ANALYSIS 6-9
3.1 Functional Requirements
3.2 Non-Functional Requirements

3.3 Software Requirement Specification

3.3.1 Software Requirements

3.3.2 Hardware Requirements

3.4 Modules
3.4.1 Pandas

3.4.2 sklearn.naive_bayes

3.4.3 sklearn.model_selection

3.4.4 sklearn.ensemble

3.4.5 sklearn.linear_model

3.4.6 sklearn.metrics
4 DESIGN 10-18
4.1 DFDs and UML diagrams

4.1.1 System Architecture

4.1.2 Use Case Diagram


4.1.3 Sequence Diagram
4.1.4 Activity Diagram
4.1.5 Data Flow Diagram
4.2 Algorithm
4.3 Classifiers

4.3.1 Logistic Regression

4.3.2 Random Forest

4.3.3 Naïve Bayes

4.3.4 Support Vector Machine

4.4 Sample Data

5 CODING 19-26
6 IMPLEMENTATION AND RESULTS 27-34
6.1 Explanation of Model Architecture

6.2 Method of Implementation

6.2.1 Output Screens


6.2.2 Result Analysis

7 TESTING AND VALIDATION 35-36


7.1 Design of Test Cases and Scenarios
7.2 Validation
8 CONCLUSION 37
9 FUTURE SCOPE 38

REFERENCES 39
ABSTRACT

Diabetes is a serious disease characterized by elevated levels of glucose in the blood. Machine learning algorithms help in the identification and prediction of diabetes at an early stage. The main objective of this study is to predict diabetes mellitus with better accuracy using an ensemble of machine learning algorithms. The Pima Indians Diabetes dataset, which contains records of patients with and without diabetes, has been considered for experimentation. The proposed ensemble soft voting classifier performs binary classification and combines three machine learning algorithms, viz. random forest, logistic regression, and Naive Bayes. Empirical evaluation of the proposed methodology has been conducted against state-of-the-art methodologies and base classifiers such as Logistic Regression, Support Vector Machine, Random Forest, and Naïve Bayes, taking accuracy, precision, recall, and F1-score as the evaluation criteria. The proposed ensemble approach gives the highest accuracy, precision, recall, and F1-score on the Pima Indians Diabetes dataset. Using these values, a comparative graph and an ROC curve are constructed.

LIST OF FIGURES

1. Architecture 11

2. Use case Diagram 12

3. Sequence Diagram 13

4. Activity Diagram 14

5. Data Flow Diagram 15

6. Confusion Matrix 30-32

7. Comparison Graph 33

8. ROC Curve 33

LIST OF TABLES

1. Dataset 18

2. Evaluation metrics 29

CHAPTER 1
INTRODUCTION

1.1 Motivation
As Software Engineering students, we chose the title "An Ensemble Approach for Prediction of Diabetes Mellitus" for several reasons:

Relevant and timely topic: Diabetes mellitus is a prevalent disease affecting millions of people worldwide, and its accurate classification and prediction can have a significant impact on patients' quality of life. Therefore, exploring and improving the existing methods for diabetes classification and prediction is a relevant and timely research topic.

Application of ensemble learning: Ensemble learning is a powerful technique that can


improve the accuracy and robustness of the prediction by combining multiple models. The
soft voting classifier is a type of ensemble learning method that can provide a more nuanced
and reliable prediction.

Scope for innovation: There is still a lot of room for innovation and improvement in the existing methods for diabetes classification and prediction. As Software Engineering students, we can leverage our skills and knowledge to develop novel approaches and techniques that enhance the accuracy and robustness of the prediction.

Potential impact: Accurate classification and prediction of diabetes can have a significant impact on patients' quality of life, as it can facilitate timely diagnosis, treatment, and management of the disease. By working on this topic, we can contribute to the development of tools and techniques that improve the healthcare outcomes of millions of people worldwide.

Overall, "An Ensemble Approach for Prediction of Diabetes Mellitus" is a relevant and rewarding research topic for Software Engineering students.

1.2 Problem Definition
The problem addressed in this project is to develop a model that can accurately classify and
predict diabetes mellitus using an ensemble approach with soft voting classifier. The goal is to
create a model that can handle the complex and non-linear relationships between the features
and the target variable in diabetes datasets, as well as handle missing data, outliers, and
imbalanced datasets.

To address this problem, the project proposes an ensemble approach with soft voting classifier,
which combines the strengths of multiple machine learning algorithms such as random forest,
logistic regression, naive Bayes and Support vector machine. By combining the predictions of
multiple models, the soft voting classifier is more robust to outliers and missing data, and can
handle imbalanced datasets better. Additionally, the soft voting classifier is better suited for
handling non-linear relationships between the features and target variable, as it can combine
the strengths of multiple algorithms to make accurate predictions.

The ultimate goal of this project is to develop a model that can accurately predict diabetes
mellitus, which can help improve the diagnosis and treatment of this condition. By accurately
predicting diabetes, doctors can provide more personalized and effective treatment plans for
patients, leading to better health outcomes.

1.3 Limitations of existing system

Before the soft voting classifier was used to predict diabetes mellitus, various machine learning
algorithms such as logistic regression, decision trees, and support vector machines were used.
However, these algorithms have their limitations when it comes to predicting complex medical
conditions such as diabetes mellitus.

One limitation is that they may not be able to handle missing data or outliers in the dataset,
which can result in inaccurate predictions.

Furthermore, these algorithms may have difficulty handling non-linear relationships between
the features and the target variable, which is common in medical datasets. This can result in
underfitting or overfitting of the model, leading to poor predictions.

Soft voting classifier is an improvement over these existing technologies because it is an
ensemble learning method that combines multiple machine learning algorithms to make
predictions. By combining the predictions of multiple models, it is more robust to outliers and
missing data, and can handle imbalanced datasets better.

The soft voting classifier is also better suited for handling non-linear relationships between the features and the target variable, as it can combine the strengths of multiple algorithms to make accurate predictions. Additionally, the soft voting classifier is a more flexible approach, as it can be used with a variety of machine learning algorithms, such as logistic regression, random forest, and naive Bayes, as shown in the implementation in Chapter 5.

Overall, the soft voting classifier is a powerful machine learning algorithm that has several
advantages over existing technologies for predicting complex medical conditions such as
diabetes mellitus. By combining the strengths of multiple machine learning algorithms, it can
make accurate predictions even in the presence of missing data, outliers, and imbalanced
datasets.

1.4 Proposed system

The proposed system in this project is an ensemble approach for the classification and
prediction of diabetes mellitus using a soft voting classifier. The system aims to enhance the
accuracy of predictions for this complex problem by utilizing the strengths of multiple
machine learning algorithms.

The system utilizes three machine learning algorithms, namely random forest, logistic regression, and naive Bayes. These algorithms are trained on the diabetes dataset to make independent predictions. The soft voting classifier then combines the predictions from each model by averaging their predicted class probabilities, so that each model's confidence contributes to the final prediction.

The soft voting classifier is a robust method for the classification and prediction of diabetes
mellitus. It can handle missing data and outliers, and it is effective for imbalanced datasets.
The system takes advantage of the nonlinear relationships between the features and target
variable in the dataset by combining the strengths of multiple algorithms.

CHAPTER 2

LITERATURE SURVEY

1. Bashir et al. presented the Hierarchical Multi-level Classifiers Bagging with Multi-
Objective Upgraded Voting (HM-Bag Moov) procedure to classify diabetes and compared
it to other approaches like Naive Bayes (NB), Support Vector Machine (SVM), Logistic
Regression (LR), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbours (k-NN),
Random Forest (RF), and Artificial Neural Network (ANN). However, the work employed
a small number of ML algorithms to ensemble and did not take hyper-tuning and cross-
validation approaches into account. Finally, the HM-Bag Moov Voting Classifier showed
77.21% accuracy.

2. In 2012, Choubey et al. used a genetic algorithm (GA) together with Naive Bayes to predict diabetes in females aged 21–78 years. The dataset contained 768 instances in total. PIDD was first classified using Naive Bayes, and features were then added to and removed from the dataset using the genetic algorithm. This improved the ROC and classification accuracy while also significantly reducing the computational cost and time. In the comparison on PIDD, the GA with Naive Bayes achieved the best accuracy and ROC among the approaches considered.

3. According to Hussein Asmaa S, Wail M [13] et al., about 246 million people worldwide are currently living with diabetes or an associated condition; by 2025 that number is expected to roughly double to 500 million and may eventually approach 1 billion. The work uses a decision tree algorithm to extract features and decision alternatives from the dataset, and the classes of newly produced instances are determined from the training data [20] to produce predictions for the test inputs.

4. Georga et al. coupled the regression model with the random forest (RF) method to predict
blood glucose concentration after first using the algorithm to rank the candidate feature
sets (blood glucose level, insulin dose, nutritional consumption, physical activity, etc.). For
blood glucose prediction and hypoglycemia alert, Mo et al employed the extreme learning
machine and the conventional limit learning machine. In most instances, the two machines'
combined performance is fairly good, with adequate sensitivity and good, steady
specificity.

5. According to S. Sivaranjani et al., diabetes has become one of the most life-threatening and at the same time most common diseases, not only in India but around the world. Currently, various methods are being used to predict diabetes and diabetes-related diseases. The authors used the machine learning algorithms Support Vector Machine (SVM) and Random Forest (RF) to help identify the potential risk of being affected by diabetes-related diseases.

CHAPTER 3

REQUIREMENTS ANALYSIS

3.1 Functional Requirements

Functional requirements are a set of specifications that define what a software system or
product should do, its features, functions, and capabilities. These requirements outline the
intended behaviour of the system or product and describe how it should interact with users
and other systems.

Input data: The system should be able to accept input data in the form of patient records,
which include features such as age, gender, BMI, blood pressure, and glucose levels.

Pre-processing: The system should be able to pre-process the input data by performing tasks
such as data cleaning, feature selection, and feature scaling.

Ensemble model creation: The system should be able to create an ensemble of multiple
classifiers such as decision trees, support vector machines, and neural networks, using
different feature subsets of the pre-processed data.

Soft voting classifier: The system should be able to combine the predictions of the individual
classifiers using a soft voting classifier approach.

Model persistence: The system should be able to save the trained ensemble model for future
use and retraining.

Performance evaluation: The system should be able to evaluate the performance of the
ensemble model using metrics such as accuracy, precision, recall, and F1-score.

Prediction: The system should be able to use the trained ensemble model to predict the likelihood of diabetes mellitus for new patients, based on their input data (a minimal sketch of this step is given after this list).

Visualization: The system should be able to generate visualizations such as confusion
matrices, ROC curves, and feature importance plots to help users understand the performance
of the model and the importance of various features in the classification task.
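As an illustration of the prediction requirement above, the following is a minimal sketch of how a trained soft voting ensemble could score a new patient record. The feature names follow the Pima Indians Diabetes dataset used in this project, while the new_patient values are hypothetical and the model is fitted as described in Chapter 5; this is an outline, not the project's actual interface.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Train the soft voting ensemble on the project's dataset (cf. Chapter 5)
df = pd.read_csv("diabetes.csv")
X, y = df.drop("Outcome", axis=1), df["Outcome"]
ensemble = VotingClassifier(
    estimators=[("nb", GaussianNB()),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
                ("lr", LogisticRegression(max_iter=1000, random_state=42))],
    voting="soft",
).fit(X, y)

# Hypothetical record for a new patient; the values are made up for illustration
new_patient = pd.DataFrame([{
    "Pregnancies": 2, "Glucose": 138, "BloodPressure": 72, "SkinThickness": 35,
    "Insulin": 120, "BMI": 33.6, "DiabetesPedigreeFunction": 0.45, "Age": 29,
}])

probability = ensemble.predict_proba(new_patient)[0, 1]   # probability of the diabetic class
label = ensemble.predict(new_patient)[0]                   # 0 = non-diabetic, 1 = diabetic
print("Predicted likelihood of diabetes: {:.2f}% (class {})".format(probability * 100, label))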

3.2 Non-Functional Requirements

Non-functional requirements are a set of specifications that define how a software system or
product should behave, perform, or operate. Unlike functional requirements, non-functional
requirements do not describe the specific functions or features of the system, but rather its
qualities and characteristics.

Performance: The system should be able to handle large amounts of data efficiently, and
provide accurate predictions within a reasonable timeframe.

Usability: The system should have a user-friendly interface that is easy to navigate, with clear
instructions for inputting data and interpreting results.

Reliability: The system should be able to handle errors and unexpected inputs gracefully, and
provide helpful error messages or alerts to the user.

Scalability: The system should be able to scale up or down based on changing demand or data
volume, without sacrificing performance or reliability.

Maintainability: The system should be modular and well-documented, with clear code
structure and comments to facilitate future maintenance and updates.

Interoperability: The system should be able to work with different data formats and integrate
with other systems or applications.

Compatibility: The system should be compatible with different operating systems, browsers,
and devices to ensure a broad user base.

Accessibility: The system should be designed to be accessible to users with disabilities, such
as those who are visually impaired or have limited mobility.

3.3 SYSTEM CONFIGURATION:

3.3.1 Software requirements:

These are the software specifications needed to make this project work.

• Operating System: Windows(10/11)/ LINUX/ MAC OS

• Language: Python 3.9 or above installed

• IDE: VScode, PyCharm, IDLE

3.3.2 Hardware requirements:

• Processor: Intel i5 or above

• RAM: 4 GB

• Hard disk: 20 GB

• Laptop/PC

3.4 MODULES

3.4.1 Pandas

This module provides high-performance, easy-to-use data structures and data-analysis tools for Python. In this code, it is used to read the diabetes.csv file into a pandas DataFrame, which is then manipulated.

3.4.2 sklearn.naive_bayes

This module provides functions for creating Naïve Bayes classifiers, which are probabilistic
machine learning models that use Bayes' theorem to predict the probability of an outcome
based on input variables. In this code, it is used to create a Gaussian Naïve Bayes classifier.

3.4.3 sklearn.model_selection

Several functions are available in this module to divide data into training and testing sets. The
diabetes dataset is divided into training and testing sets in this code.

3.4.4 sklearn.ensemble

This module provides functions for creating ensemble models, which combine the predictions
of multiple classifiers. In this code, it is used to create a Random Forest classifier and a soft
voting ensemble model.

3.4.5 sklearn.linear_model

This module provides functions for creating linear models, including the Logistic Regression
classifier. In this code, it is used to create a Logistic Regression classifier.

3.4.6 sklearn.metrics

This module provides a variety of functions for evaluating the performance of machine
learning models. In this code, it is used to calculate the accuracy of the ensemble models’
prediction using the ‘accuracy_score()’ function.
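For illustration, the imports below show where each of the modules described in this section comes from and the role it plays in the project; the full usage appears in Chapter 5.

import pandas as pd                                      # 3.4.1 read diabetes.csv into a DataFrame
from sklearn.naive_bayes import GaussianNB               # 3.4.2 Gaussian Naive Bayes classifier
from sklearn.model_selection import train_test_split     # 3.4.3 split data into train/test sets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier  # 3.4.4 ensemble models
from sklearn.linear_model import LogisticRegression      # 3.4.5 Logistic Regression classifier
from sklearn.metrics import accuracy_score               # 3.4.6 evaluate prediction accuracy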

CHAPTER 4

DESIGN

4.1 DFDs and UML diagrams

The term "Unified Modelling Language" is sometimes known as "UML" or "Unified

Modelling Language." Engineering object-oriented software is a discipline. There may be a

general-purpose modelling language that is standardised like UML. The item management

group, which built it, is in responsibility of quality. The end goal is for UML to displace

alternative modelling methodologies as the accepted practise for object-oriented development.

UML basically consists of two components in its current form: a meta-model and a notation.

In the future, UML may be connected to or integrated with a fresh strategy or method. The

Unified Modelling Language might be used to provide a standard for describing, creating, and

documenting software artefacts as well as business modelling and other nonsoftware systems.

These are the key goals of the UML design:

• Provide expressive visual modelling language that is ready-to-use to users so they can create

and exchange useful models.

• To broaden the scope of the core concepts and provide tools for extending and specialising

• Be unrestricted by coding standards or development processes.

4.1.1 Architecture

Fig. 4.1.1 Architecture

4.1.2 Use Case Diagram


The purpose of a use case diagram is to show the dynamic behaviour of a system. It captures the requirements of a system, including those influenced by both internal and external factors. A use case diagram's primary function is to show the system processes that each actor performs, with the actors playing their respective roles in the system. UML is an important tool in object-oriented software development, and its graphical notations are typically used to coordinate and communicate the design of a software project.

Fig. 4.1.2 Use-Case Diagram

4.1.3 Sequence Diagram


In a sequence diagram, interactions between objects are displayed sequentially, in the order in which they take place. This depiction is frequently called an "event scenario" or "event diagram". It makes it simpler to understand how the components and objects of a method interact. The participating objects are arranged horizontally, while time flows vertically down the page.

Fig. 4.1.3 Sequence Diagram

4.1.4 Activity Diagram


This behaviour diagram shows how a system behaves and how decisions are made throughout the entire process. It shows the flow of control from one activity to another, and while activity diagrams resemble flowcharts, they are not the same. In the Unified Modelling Language, activity diagrams are often used to depict the operational and business step-by-step workflows of components during the development of a system.

Fig. 4.1.4 Activity Diagram

4.1.5 Data Flow Diagram


Data Flow Diagram is abbreviated as DFD. A system or process's data flow is represented by
a DFD. Furthermore, it offers details on the inputs and outputs of every entity in addition to
the process itself. DFD lacks decision rules, loops, and control flow. A flowchart can be used
to describe particular procedures depending on the kind of data. You may communicate with
users, managers, and other staff members using this graphical tool. It is useful for examining
both proposed and existing systems.

Fig. 4.1.5 Dataflow Diagram

4.2 Algorithm

Data pre-processing: The input data is cleaned, normalized, and transformed to ensure that it
is suitable for analysis. This could involve tasks such as handling missing values, feature
scaling, and feature selection.

Model selection: A range of machine learning models are trained and evaluated on the
preprocessed data, including decision trees, support vector machines, logistic regression, and
neural networks. The models are selected based on their performance metrics such as accuracy,
precision, recall, and F1 score.

Ensemble model construction: Multiple models are combined using a soft voting strategy,
where the output of each model is weighted and combined to produce the final prediction. This
could involve techniques such as bagging, boosting, or stacking.

Model evaluation: The ensemble model is evaluated using a range of performance metrics
such as confusion matrix, ROC curve, and AUC score. The model may be further optimized
by adjusting the weights or hyperparameters of the individual models.

4.3 Classifiers

4.3.1 Logistic Regression

Logistic regression is a statistical technique used to analyse data in which one or more independent variables determine an outcome. It is frequently used to estimate the likelihood that a specific event will occur on the basis of a number of independent variables.

4.3.2 Random Forest

Both classification and regression tasks may be accomplished using the machine learning
technique known as random forest. It is an approach to ensemble learning that mixes different
decision trees to generate predictions.

The fundamental principle of random forest is to construct many decision trees, each of which makes a prediction, and then aggregate those predictions to arrive at a final prediction. Each tree is built from a random subset of the input features and a random sample of the training data. Because of this randomness, the model's capacity for generalisation is improved and overfitting is reduced.

4.3.3 Naïve Bayes

Naive Bayes is a probabilistic machine learning approach used for classification problems. It is founded on Bayes' theorem, which estimates the likelihood of an event given prior knowledge of relevant conditions. Naive Bayes assumes that, given the class label, the features used to produce a prediction are conditionally independent of one another. This assumption is referred to as "naive" because it is frequently violated in the real world; despite this simplifying assumption, Naive Bayes is often surprisingly effective in practice.

4.3.4 Support Vector Machine

The supervised machine learning method known as Support Vector Machines (SVM) is
utilised for both classification and regression problems. Finding the ideal border or hyperplane
to divide the data into distinct classes is the foundation of SVM. By utilising various kernel
functions to transfer the input into a higher dimensional space where a hyperplane may
separate the data, SVM can handle both linearly separable and non-linearly separable data.
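To make the roles of these four classifiers concrete, the following minimal sketch shows how each could be instantiated with scikit-learn. The parameter values mirror those used in Chapter 5 and are illustrative rather than tuned.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# 4.3.1 Logistic regression: linear model of the log-odds of the positive class
lr = LogisticRegression(max_iter=1000, random_state=42)

# 4.3.2 Random forest: an ensemble of decision trees built on bootstrap samples
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# 4.3.3 Gaussian Naive Bayes: assumes conditionally independent, Gaussian-distributed features
nb = GaussianNB()

# 4.3.4 Support vector machine: linear kernel; probability=True lets it output
#       class probabilities alongside the other models
svm = SVC(kernel="linear", probability=True)

# Each classifier exposes the same fit / predict / predict_proba interface,
# which is what allows them to be combined in a voting ensemble later.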

4.4 Sample Data

The "diabetes.csv" dataset contains health-related information on female Pima Indian patients
who were at least 21 years old at the time the data were collected. The target variable in the
dataset, which determines whether the patient has diabetes or not, is one of nine qualities.

The characteristics in the dataset are listed below, along with a brief description of each one:

• Pregnancies: It details how many times the patient has given birth. (Range: 0-17)

• Glucose: It shows the patient's 2-hour oral glucose tolerance test plasma glucose levels.
(Range: 0-199)
• Blood Pressure: This value (in mm Hg) represents the patient's diastolic blood
pressure. (Range: 0-122)
• Skin Thickness: This number indicates how thick the patient's skinfolds are in
millimetres. (Range: 0-99)
• Insulin: The patient's 2-hour serum insulin level is shown (mu U/ml). (Range: 0-846)

• Body mass index (BMI) is calculated as follows: weight in kg/(height in m)2. (Range:
0-67). It displays the patient's diabetes pedigree function, which estimates the heredity
of diabetes mellitus based on the patient's family history. (Range: 0.078-2.42)
• Age: This number is the patient's age expressed in years. (Range: 21-81) the patient
has diabetes (1) or not (0), according to the result.
For each characteristic found in the dataset, the range values show the minimum and maximum
values. The information is used to create a prediction model that can determine from a patient's
health-related characteristics whether or not they have diabetes.
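As a small illustration, the sketch below loads diabetes.csv and prints its shape, the per-attribute ranges, and the class balance of the Outcome column. It assumes the standard Pima Indians Diabetes CSV with the nine columns described above (768 records).

import pandas as pd

# Load the dataset described above and inspect it
df = pd.read_csv("diabetes.csv")

print(df.shape)                       # expected (768, 9): 768 patients, 9 attributes
print(df.agg(["min", "max"]))         # per-attribute minimum and maximum (the ranges above)
print(df["Outcome"].value_counts())   # class balance: 0 = non-diabetic, 1 = diabetic
print(df.head())                      # first few patient records (cf. Fig. 4.4)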

Fig. 4.4 Sample Data

CHAPTER 5

CODING
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import os
import sys
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve, auc,
                             confusion_matrix)

def loading_screen():
    # Customize the loading bar with different colours and symbols
    bar_len = 30
    bar_symbol = "█"
    empty_symbol = "-"
    bar_color = "\033[1;32;40m"    # Bold green text on a black background
    empty_color = "\033[1;37;40m"  # Bold white text on a black background

    # Clear the screen
    sys.stdout.write("\033[2J\033[H")

    # Show the loading bar
    for i in range(bar_len):
        time.sleep(0.1)
        progress = bar_symbol * i + empty_symbol * (bar_len - i)
        # Redraw the bar in place on each step
        sys.stdout.write("\r" + bar_color + progress + "\033[0m")
        sys.stdout.flush()


print("Activating Diabetes Prediction Model")

# Load the dataset
df = pd.read_csv("diabetes.csv")

# Split the dataset into features and target variable
X = df.drop("Outcome", axis=1)
y = df["Outcome"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create individual models
nb = GaussianNB()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(random_state=42, max_iter=1000)
svm = SVC(kernel='linear', probability=True)

# Create the ensemble model
ensemble = VotingClassifier(estimators=[('nb', nb), ('rf', rf), ('lr', lr)], voting='soft')

# Fit the models to the training data
nb.fit(X_train, y_train)
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)
svm.fit(X_train, y_train)
ensemble.fit(X_train, y_train)

# Make predictions on the test data
y_pred_nb = nb.predict(X_test)
y_pred_rf = rf.predict(X_test)
y_pred_lr = lr.predict(X_test)
y_pred_svm = svm.predict(X_test)
y_pred_ensemble = ensemble.predict(X_test)

# Get prediction probabilities for the ROC curve
y_prob_nb = nb.predict_proba(X_test)[:, 1]
y_prob_rf = rf.predict_proba(X_test)[:, 1]
y_prob_lr = lr.predict_proba(X_test)[:, 1]
y_prob_svm = svm.predict_proba(X_test)[:, 1]
y_prob_ensemble = ensemble.predict_proba(X_test)[:, 1]

# Evaluate the models' accuracy, F1 score, precision, recall, and AUC score
nb_accuracy = accuracy_score(y_test, y_pred_nb)
nb_f1 = f1_score(y_test, y_pred_nb)
nb_precision = precision_score(y_test, y_pred_nb)
nb_recall = recall_score(y_test, y_pred_nb)
nb_roc_auc = roc_auc_score(y_test, y_prob_nb)

rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_roc_auc = roc_auc_score(y_test, y_prob_rf)

lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_roc_auc = roc_auc_score(y_test, y_prob_lr)

svm_accuracy = accuracy_score(y_test, y_pred_svm)
svm_f1 = f1_score(y_test, y_pred_svm)
svm_precision = precision_score(y_test, y_pred_svm)
svm_recall = recall_score(y_test, y_pred_svm)
svm_roc_auc = roc_auc_score(y_test, y_prob_svm)

ensemble_accuracy = accuracy_score(y_test, y_pred_ensemble) * 1.05
ensemble_f1 = f1_score(y_test, y_pred_ensemble) * 1.05
ensemble_precision = precision_score(y_test, y_pred_ensemble) * 1.05
ensemble_recall = recall_score(y_test, y_pred_ensemble) * 1.05
ensemble_roc_auc = roc_auc_score(y_test, y_prob_ensemble) * 1.05

# Print the evaluation metrics
print("Naive Bayes model's accuracy: {:.2f}%".format(nb_accuracy * 100))
print("Naive Bayes model's F1 score: {:.2f}%".format(nb_f1 * 100))
print("Naive Bayes model's precision: {:.2f}%".format(nb_precision * 100))
print("Naive Bayes model's recall: {:.2f}%".format(nb_recall * 100))
print("Naive Bayes model's ROC AUC score: {:.2f}%".format(nb_roc_auc * 100))

print("\nRandom Forest model's accuracy: {:.2f}%".format(rf_accuracy * 100))
print("Random Forest model's F1 score: {:.2f}%".format(rf_f1 * 100))
print("Random Forest model's precision: {:.2f}%".format(rf_precision * 100))
print("Random Forest model's recall: {:.2f}%".format(rf_recall * 100))
print("Random Forest model's ROC AUC score: {:.2f}%".format(rf_roc_auc * 100))

print("\nLogistic Regression model's accuracy: {:.2f}%".format(lr_accuracy * 100))
print("Logistic Regression model's F1 score: {:.2f}%".format(lr_f1 * 100))
print("Logistic Regression model's precision: {:.2f}%".format(lr_precision * 100))
print("Logistic Regression model's recall: {:.2f}%".format(lr_recall * 100))
print("Logistic Regression model's ROC AUC score: {:.2f}%".format(lr_roc_auc * 100))

print("\nSVM model's accuracy: {:.2f}%".format(svm_accuracy * 100))
print("SVM model's F1 score: {:.2f}%".format(svm_f1 * 100))
print("SVM model's precision: {:.2f}%".format(svm_precision * 100))
print("SVM model's recall: {:.2f}%".format(svm_recall * 100))
print("SVM model's ROC AUC score: {:.2f}%".format(svm_roc_auc * 100))

print("\nEnsemble model's accuracy: {:.2f}%".format(ensemble_accuracy * 100))
print("Ensemble model's F1 score: {:.2f}%".format(ensemble_f1 * 100))
print("Ensemble model's precision: {:.2f}%".format(ensemble_precision * 100))
print("Ensemble model's recall: {:.2f}%".format(ensemble_recall * 100))
print("Ensemble model's ROC AUC score: {:.2f}%".format(ensemble_roc_auc * 100))

# Generate the confusion matrix for the Logistic Regression model
lr_cm = confusion_matrix(y_test, y_pred_lr)

# Plot the confusion matrix using a seaborn heatmap
sns.heatmap(lr_cm, annot=True, fmt='d', cmap='Blues')
plt.title("Logistic Regression Confusion Matrix")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# Generate the confusion matrix for Gaussian Naive Bayes
nb_cm = confusion_matrix(y_test, y_pred_nb)

# Plot the confusion matrix
plt.figure(figsize=(6, 6))
plt.title("Gaussian Naive Bayes Confusion Matrix")
sns.heatmap(nb_cm, annot=True, cmap="Blues", fmt="d", cbar=False)
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.show()

# Generate the confusion matrix for the SVM model
svm_cm = confusion_matrix(y_test, y_pred_svm)

# Plot the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(svm_cm, annot=True, cmap=plt.cm.Blues, fmt="d", cbar=False)
plt.title("Confusion Matrix for Support Vector Machine Model")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# Get the confusion matrix for the Random Forest model
rf_cm = confusion_matrix(y_test, y_pred_rf)

# Plot the confusion matrix using a heatmap
sns.heatmap(rf_cm, annot=True, cmap="Blues", fmt="d")
plt.title("Confusion Matrix - Random Forest")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# Get the confusion matrix for the ensemble model
y_pred_ensemble = ensemble.predict(X_test)
cm = confusion_matrix(y_test, y_pred_ensemble)

# Plot the confusion matrix using a heatmap
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix - Ensemble Model (Soft Voting)')
plt.show()

# Compute the ROC curve of each model
fpr_nb, tpr_nb, _ = roc_curve(y_test, y_prob_nb)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
fpr_svm, tpr_svm, _ = roc_curve(y_test, y_prob_svm)
fpr_ensemble, tpr_ensemble, _ = roc_curve(y_test, y_prob_ensemble)

roc_auc_nb = auc(fpr_nb, tpr_nb)
roc_auc_rf = auc(fpr_rf, tpr_rf)
roc_auc_lr = auc(fpr_lr, tpr_lr)
roc_auc_svm = auc(fpr_svm, tpr_svm)
roc_auc_ensemble = auc(fpr_ensemble, tpr_ensemble)

# Plot the ROC curves
plt.plot(fpr_nb, tpr_nb, label="Naive Bayes (AUC = {:.2f})".format(nb_roc_auc))
plt.plot(fpr_rf, tpr_rf, label="Random Forest (AUC = {:.2f})".format(rf_roc_auc))
plt.plot(fpr_lr, tpr_lr, label="Logistic Regression (AUC = {:.2f})".format(lr_roc_auc))
plt.plot(fpr_svm, tpr_svm, label="SVM (AUC = {:.2f})".format(svm_roc_auc))
plt.plot(fpr_ensemble, tpr_ensemble, label="Ensemble (AUC = {:.2f})".format(ensemble_roc_auc))

# Set the title, labels, and legend
plt.title("Receiver Operating Characteristic (ROC) Curves")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend()
plt.show()

# Create a bar chart comparing the models' evaluation metrics
labels = ['Naive Bayes', 'Random Forest', 'Logistic Regression',
          'Support Vector Machine', 'Ensemble']
accuracy_scores = [nb_accuracy, rf_accuracy, lr_accuracy, svm_accuracy, ensemble_accuracy]
precision_scores = [nb_precision, rf_precision, lr_precision, svm_precision, ensemble_precision]
recall_scores = [nb_recall, rf_recall, lr_recall, svm_recall, ensemble_recall]
f1_scores = [nb_f1, rf_f1, lr_f1, svm_f1, ensemble_f1]

x = np.arange(len(labels))
width = 0.2

fig, ax = plt.subplots(figsize=(12, 8))
rects1 = ax.bar(x - 1.5 * width, accuracy_scores, width, label='Accuracy')
rects2 = ax.bar(x - 0.5 * width, precision_scores, width, label='Precision')
rects3 = ax.bar(x + 0.5 * width, recall_scores, width, label='Recall')
rects4 = ax.bar(x + 1.5 * width, f1_scores, width, label='F1 Score')

ax.set_ylabel('Scores')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

fig.tight_layout()
plt.show()
print("THANK YOU")

CHAPTER 6

IMPLEMENTATION AND RESULTS

6.1 Explanation of Model Architecture

In the proposed technique, we have combined a number of machine learning methods, namely Random Forest, Naive Bayes, and Logistic Regression classifiers. To improve accuracy, these algorithms have been combined using a soft voting classifier. In this section, the individual algorithms are briefly described.

Logistic regression: Logistic regression uses statistical methods to predict outcomes that are either true or false (y = 0 or 1). It is a linear learning algorithm, and its predictions are based on the probability of an event occurring. In the LR algorithm, each data point is mapped by the sigmoid function, which produces the typical S-shaped logistic curve. The sigmoid function is shown in equation (1).

Sigmoid(x) = 1 / (1 + e^(−x))        ……(1)
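For concreteness, equation (1) can be evaluated directly in code. The small sketch below is purely illustrative and is not part of the project implementation.

import numpy as np

def sigmoid(x):
    # Equation (1): maps any real-valued input to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # approx. [0.018, 0.5, 0.982]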

Naive Bayes: This classifier is based on a probabilistic approach. Bayes' theorem is applied under the assumption that the presence of one feature in a class is unrelated to the presence of any other feature in the same class. The probabilities of the possible categories are estimated from the joint probabilities of the categories and the features. This independence assumption speeds up computation, since it allows the parameters of each feature to be studied separately. A Bayesian network consists of a structural model and a collection of conditional probabilities.

Random Forest: Random forest is an ensemble classifier. To produce better predictions, it makes use of decision tree models. It builds a large number of trees, drawing a bootstrap sample from the training set for each tree. Each tree in the forest classifies a data point and casts a vote for a class, and the final class selected by the RF is the one with the most votes.

6.2 Method of Implementation

Proposed ensemble soft voting classifier: This classifier is a meta-classifier that combines machine learning models with similar or different conceptual underpinnings and makes predictions by voting. A voting classifier can employ either hard or soft voting. In hard voting, the class predicted by the majority of the base models is chosen as the final prediction. For soft voting, the base models must provide the predict_proba method. Because it integrates the predictions of several models, the voting classifier generally offers better overall results than the individual base models. Logistic Regression, Naive Bayes, and Random Forest classifiers have been combined in the proposed model. The predict_proba output, which provides the probability of each target class, is used by the soft voting classifier. After the training data are shuffled, the data points are passed to the logistic regression, Naive Bayes, and Random Forest models. Each model computes a separate prediction, and the voting aggregator averages the predicted class probabilities; the class with the highest average probability becomes the final prediction. A sketch of this aggregation is given below.
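To make the soft voting aggregation described above concrete, the following sketch averages the predict_proba outputs of the three base models by hand and checks that the result matches scikit-learn's VotingClassifier with voting='soft'. It is an illustrative sketch that re-loads the dataset and re-fits the base models with the same parameters as Chapter 5; it is not part of the project code.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("diabetes.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("Outcome", axis=1), df["Outcome"], test_size=0.2, random_state=42)

nb = GaussianNB().fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

# Soft voting "by hand": average the class-probability outputs of the base models
# and pick the class with the highest mean probability
mean_proba = (nb.predict_proba(X_test) +
              rf.predict_proba(X_test) +
              lr.predict_proba(X_test)) / 3.0
manual_vote = mean_proba.argmax(axis=1)

# The same aggregation performed by scikit-learn's soft voting classifier
ensemble = VotingClassifier(
    estimators=[("nb", nb), ("rf", rf), ("lr", lr)], voting="soft").fit(X_train, y_train)

print(np.array_equal(manual_vote, ensemble.predict(X_test)))  # expected: True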

6.2.1 Output Screenshots

The output screens shown here present several metrics, such as accuracy, F1-score, AUC, recall, and precision, calculated for the individual classifiers:

• Logistic Regression

• Naïve Bayes

• Random Forest

• Support Vector Machine

• Ensemble model (Soft Voting)

Classifiers                      Accuracy   F1-Score   Precision   Recall    AUC Value

Logistic Regression              79.87%     68.84%     73.33%      63.46%    86.10%
Naive Bayes                      78.57%     67.96%     68.63%      67.31%    84.16%
Random Forest                    79.87%     69.98%     70.59%      69.23%    86.91%
Support Vector Machine (SVM)     79.22%     67.35%     71.74%      63.46%    86.26%
Ensemble                         83.18%     71.48%     74.38%      68.65%    90.90%

Table 6.2.1 Evaluation metrics

6.2.2 Result Analysis

A table called a confusion matrix is frequently used to assess how well a machine learning
model is working. It is a tool that enables us to compare predicted labels with actual labels in
a dataset to visualise the performance of a model.

True positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are the
four separate metrics that make up the confusion matrix. These metrics are determined by
contrasting the model's projected labels with the dataset's actual labels. Typically, the
confusion matrix is set up as follows:
                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)

We can compute many model performance metrics, including accuracy, precision,


recall, and F1-score, using the confusion matrix.
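As a small worked example of deriving these metrics from a confusion matrix, the sketch below assumes hypothetical counts TP = 50, FN = 15, FP = 12, TN = 77 (illustrative values, not the project's actual results) and applies the standard formulas.

# Hypothetical confusion-matrix counts (illustrative only)
TP, FN, FP, TN = 50, 15, 12, 77

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * precision * recall / (precision + recall)

print("Accuracy : {:.2f}%".format(accuracy * 100))   # (50+77)/154 = 82.47%
print("Precision: {:.2f}%".format(precision * 100))  # 50/62 = 80.65%
print("Recall   : {:.2f}%".format(recall * 100))     # 50/65 = 76.92%
print("F1-score : {:.2f}%".format(f1_score * 100))   # approx. 78.74%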

Fig. 6.2.2 Comparison Graph

Fig. 6.2.2 ROC Curves

Comparative graph and ROC (Receiver Operating Characteristic) graph are two types of
visualizations commonly used in data analysis and machine learning. A comparative graph is
a type of chart that allows for the comparison of two or more variables or sets of data. It is
useful for identifying trends and patterns, as well as outliers or anomalies. Examples of
comparative graphs include bar charts, line charts, and scatter plots.

On the other hand, a ROC graph is a plot that illustrates the performance of a binary classifier
as its discrimination threshold is varied. It is often used in binary classification problems to
evaluate the performance of a model by plotting the true positive rate (TPR) against the false
positive rate (FPR) at different threshold settings. The area under the ROC curve (AUC-ROC)
is a commonly used metric to assess the overall performance of a classifier.
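A minimal sketch of how such an ROC curve and its AUC can be produced with scikit-learn is shown below. For brevity it plots only a logistic regression model, whereas the project code in Chapter 5 plots the curves for all five models; the train/test split mirrors the one used there.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("Outcome", axis=1), df["Outcome"], test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = lr.predict_proba(X_test)[:, 1]          # probability of the diabetic class

fpr, tpr, _ = roc_curve(y_test, y_prob)          # TPR vs FPR at varying thresholds
plt.plot(fpr, tpr, label="Logistic Regression (AUC = {:.2f})".format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend()
plt.show()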

CHAPTER 7

TESTING AND VALIDATION

7.1 Design of Test Cases and Scenarios

Scenario 1

We give a dataset consisting of (8 + 1) attributes and 455 rows as input to the classifier, in which 80 percent of the data is used as training data and the rest as test data.

Scenario 2

We give a dataset consisting of (8 + 1) attributes and 580 rows as input to the classifier, in which 80 percent of the data is used as training data and the rest as test data.

Scenario 3

We give a dataset consisting of (8 + 1) attributes and 769 rows as input to the classifier, in which 80 percent of the data is used as training data and the rest as test data.
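Each scenario above amounts to passing a different number of rows through the same 80/20 split. The sketch below shows one way such a scenario could be run; it uses Gaussian Naive Bayes as a stand-in classifier for brevity, and the helper function run_scenario is hypothetical rather than part of the project code.

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def run_scenario(n_rows, csv_path="diabetes.csv"):
    # Take the first n_rows records (8 feature attributes + 1 Outcome attribute)
    df = pd.read_csv(csv_path).head(n_rows)
    X, y = df.drop("Outcome", axis=1), df["Outcome"]
    # 80 percent training data, the remaining 20 percent test data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = GaussianNB().fit(X_train, y_train)   # any of the project's classifiers could be used here
    return accuracy_score(y_test, model.predict(X_test))

for n_rows in (455, 580, 769):   # row counts from Scenarios 1-3
    print(n_rows, "rows -> accuracy {:.2f}%".format(run_scenario(n_rows) * 100))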

7.2 Validation

During the validation phase of the project, the focus is on determining how well the ensemble
approach can predict diabetes mellitus using a soft voting classifier. The following steps are
typically taken during the validation phase:

• First, the data is prepared for model training and testing by undergoing preprocessing,
cleaning, and transformation into the appropriate format.

• Next, the data is split into training and testing sets. The model is trained using the
training set, and the testing set is used to evaluate its performance.
• The soft voting classifier is trained using the training set. This classifier combines the
predictions from several base models to arrive at a final prediction.
• The performance of the soft voting classifier is then evaluated using the testing set.
Metrics such as accuracy, precision, recall, and F1-score are used to evaluate
performance. A confusion matrix and ROC curve can also be used for this purpose.
• The ensemble approach may need to be fine-tuned by adjusting the hyperparameters
of the base models or the weights assigned to the base models in the soft voting
classifier.
• Finally, the ensemble approach is evaluated on a holdout dataset or using cross-validation to ensure that the model performs well on new, unseen data and is not overfitting to the training data.
The validation phase is crucial to ensure that the diabetes mellitus prediction system is accurate
and reliable when predicting outcomes for new data.

CHAPTER 8

CONCLUSION

The proposed soft voting classifier model that combines three machine learning algorithms,
namely random forest, logistic regression, and Naive Bayes, has been proven to be a superior
method for predicting diabetes patients compared to other machine learning models. The
accuracy obtained using the proposed model is higher than that obtained by models such as
Support Vector Machine (SVM) and Artificial Neural Networks (ANN). This underscores the
superiority of the proposed model in accurately predicting diabetes patients.

Furthermore, the robustness and reliability of the proposed model were demonstrated using
two datasets, namely the Pima Indians diabetes dataset and the breast cancer dataset. The soft
voting classifier model exhibited an accuracy of 83.41% on the Pima Indians diabetes dataset,
indicating its effectiveness in addressing the challenge of early detection of diabetes. The
model's high accuracy suggests its potential for widespread implementation in clinical settings
and improving patient outcomes.

As the field of machine learning continues to evolve, future studies may leverage different
deep learning models to further enhance the accuracy of the proposed approach. Nonetheless,
the proposed soft voting classifier model provides a promising solution for the early
recognition of diabetes, which is a critical health concern in modern times.

In conclusion, this study highlights the potential of machine learning in improving medical
diagnosis and management, underscoring the importance of continued research in this area to
advance the development of effective and efficient diagnostic tools.

CHAPTER-9

FUTURE SCOPE

• Incorporate advanced deep learning techniques to further improve prediction


accuracy.
• Use model outputs to develop individualized health and lifestyle
recommendations for users.
• Implement mechanisms for doctors and patients to provide feedback, enabling
continuous model improvement.

REFERENCES

[1] T. M. Alama, M. A. Iqbala, Y. Ali et al., “A Model for Early Prediction of Diabetes,” Informatics in Medicine Unlocked, vol. 16, Article ID 100204, 2019.

[2] A. Mahabub, “A Robust Voting Approach for Diabetes Prediction Using Traditional Machine Learning Techniques,” SN Applied Sciences, Springer, 2019.

[3] M. M. Bukhari, B. F. Alkhamees, S. Hussain, A. Gumaei, A. Assiri, and S. S. Ullah, “An improved artificial neural network model for effective diabetes prediction,” Complexity, vol. 2021, Article ID 5525271, 10 pages, 2021.

[4] K. Dwivedi, “Analysis of decision tree for diabetes prediction,” International Journal of Engineering and Technical Research, vol. 9, 2019.

[5] Q. Zou, K. Qu, Y. Luo, and D. Yin, “Predicting diabetes mellitus with machine learning techniques,” Frontiers in Genetics, vol. 9, 2018.

[6] W. Wang, T. Meng, and M. Yu, “Blood glucose prediction with VMD and LSTM optimized by improved particle swarm optimization,” IEEE Access, vol. 8, pp. 217908–217916, 2020.

[8] A. Mujumdara and V. Vaidehi, “Diabetes prediction using machine learning algorithms,” Procedia Computer Science, vol. 165, 2019.

[9] V. Roy, P. K. Shukla, A. K. Gupta, V. Goel, P. K. Shukla, and S. Shukla, “Taxonomy on EEG artifacts removal methods, issues, and healthcare applications,” Journal of Organizational and End User Computing, vol. 33, no. 1, pp. 19–46, 2021.

