
DATA MINING METHODS FOR DIABETES PREDICTION

A PROJECT REPORT

Submitted by

Anshul (22BCS16477)

Khushi Gupta (22BCS16186)

Era Trivedi (22BCS14924)

Vidushi Gupta (22BCS16291)

Shushant Singh (22BCS16192)


1.1 Identification of Client:

Many chronic diseases are widespread in both developed and developing countries, and diabetes is one of them. Diabetes is a metabolic disorder in which blood sugar rises because the body produces too little insulin or cannot use the insulin it produces effectively. It is among the deadliest diseases in the world: it is not only an illness in its own right but also a trigger for other conditions such as heart attack, blindness, kidney disease, and nerve damage.

Detecting such a chronic metabolic disease at an early stage could therefore help doctors around the world prevent loss of life. With the rise of machine learning, AI, and neural networks, and their application in many domains [1, 2], we may now be able to find a solution to this problem. ML techniques and neural networks help researchers discover new facts in existing health-related datasets, which can support disease monitoring and detection. The present work uses the Pima Indians Diabetes Database. The aim of this system is to build an ML model that can predict, with good accuracy, the probability that a patient is diabetic. The conventional route to a diabetes diagnosis requires the patient to visit a diagnostic centre, and one of the key difficulties of bioinformatics analysis is obtaining accurate results from the data: human error and repeated laboratory tests can complicate identification of the disease. A model that predicts whether a patient has diabetes helps doctors ensure that patients in need of clinical care receive it in time, and so helps prevent the loss of human lives.

The nature of the problem also makes neural networks an apparent choice. Neural networks use neurons to transmit data across several layers, with each node applying a different weighted parameter to help predict diabetes.

Causes of Diabetes
Genetic factors are the main cause of diabetes. The disease has been linked to at least two mutant genes on chromosome 6, the chromosome that affects the body's response to various antigens. Viral infection may also influence the occurrence of type 1 and type 2 diabetes: studies have shown that infection with viruses such as rubella, Coxsackievirus, mumps, hepatitis B virus, and cytomegalovirus increases the risk of developing diabetes.

Types of Diabetes
Type 1
In type 1 diabetes the immune system attacks the insulin-producing cells, so the body fails to produce insulin in sufficient amounts. No conclusive studies establish the causes of type 1 diabetes, and there are currently no known methods of prevention.
Type 2
In type 2 diabetes the cells produce too little insulin or the body cannot use the insulin correctly. This is the most common type, affecting about 90% of people diagnosed with diabetes, and it is caused by both genetic factors and lifestyle.

1.2 Identification of Problem:

Data mining and machine learning have become reliable supporting tools in the medical domain in recent years. Data mining methods are used to pre-process the healthcare data and select the relevant features, while machine learning methods help automate diabetes prediction. Together they can uncover hidden patterns in the data, so a reliably accurate decision becomes possible. Data mining is a process that combines several techniques, including machine learning, statistics, and database systems, to discover patterns in very large datasets. According to NVIDIA, machine learning uses various algorithms to learn from the parsed data and make predictions.

Diabetes prediction is a classification task with two mutually exclusive outcomes: the person is either diabetic or not diabetic. As described in Section 1.1, the aim of this work is an ML model that predicts, with good accuracy, the likelihood that a patient is diabetic, replacing the conventional process in which the patient must visit a diagnostic centre. Human error or repeated laboratory tests can complicate the identification of the disease, so a model that flags patients in need of clinical care in time can help prevent the loss of human lives.
1.3 Identification of Tasks:

The dataset used is the Pima Indians Diabetes Database, available on Kaggle. It consists of several medical predictor variables and one target variable. The objective is to predict whether the patient has diabetes. The dataset contains several independent variables and one dependent variable, the Outcome. The independent variables include the number of pregnancies the patient has had, BMI, insulin level, age, and so on, as shown in Table 1:

Table 1: Attributes of the Pima Indians Diabetes dataset

Serial No.  Attribute Name               Description
1           Pregnancies                  Number of times pregnant
2           Glucose                      Plasma glucose concentration
3           Blood Pressure               Diastolic blood pressure (mm Hg)
4           Skin Thickness               Triceps skin fold thickness (mm)
5           Insulin                      2-hour serum insulin (mu U/ml)
6           BMI                          Body mass index (kg/m^2)
7           Diabetes Pedigree Function   Diabetes pedigree function
8           Age                          Age of patient (years)
9           Outcome                      Class variable (0 = no diabetes, 1 = diabetes)

➔ The diabetes dataset consists of 2000 data points, with 9 features each.
➔ "Outcome" is the feature we are going to predict: 0 means no diabetes, 1 means diabetes.
I] Dataset collection:
This step covers collecting the data and understanding it, so that hidden patterns and trends can be studied, predictions made, and results evaluated. The dataset carries a fixed number of records and features; the features include Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.
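A minimal sketch of this step in Python, assuming the Kaggle file has been saved locally as "diabetes.csv" (the file name is an assumption):

    # Load the Pima Indians Diabetes data and take a first look at it.
    import pandas as pd

    df = pd.read_csv("diabetes.csv")
    print(df.shape)                       # number of records and features
    print(df.columns.tolist())            # Pregnancies, Glucose, ..., Outcome
    print(df["Outcome"].value_counts())   # class balance: 0 = no diabetes, 1 = diabetes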
II] Data pre-processing:
This phase handles inconsistent data in order to get more accurate and precise results; for example, the Id column in this dataset is inconsistent, so that feature was dropped.
III] Missing value identification:
Using the pandas library and scikit-learn, we identify the missing values in the dataset and replace each missing value with the corresponding mean value, as sketched below.
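A hedged sketch of this step, continuing from the loading sketch above; treating zeros in the listed columns as missing, imputing with the column mean, and the no-space Kaggle column names are assumptions:

    # Treat physiologically impossible zeros as missing, then impute with the column mean.
    import numpy as np

    cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
    df[cols] = df[cols].replace(0, np.nan)
    print(df.isnull().sum())              # how many missing values each column has
    df[cols] = df[cols].fillna(df[cols].mean())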
IV] Feature selection:
Pearson's correlation is a popular method for finding the most relevant attributes/features. The correlation coefficient between each input attribute and the output is calculated; the coefficient ranges between -1 and 1. A value above 0.5 or below -0.5 indicates a notable correlation, and a value of zero means no correlation.
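A small sketch of this selection step on the same DataFrame; the 0.2 cut-off used here is purely illustrative (the text's own guideline is 0.5):

    # Pearson correlation of every input attribute with the Outcome column.
    corr = df.corr(method="pearson")["Outcome"].drop("Outcome")
    print(corr.sort_values(ascending=False))
    selected = corr[corr.abs() > 0.2].index.tolist()   # keep the more strongly correlated features
    print("Selected features:", selected)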
V] Scaling and normalization:
Scaling transforms the data so that it fits within a specific range, such as 0-100 or 0-1. Scaling matters for methods based on distances between data points, such as support vector machines (SVM) or k-nearest neighbors (KNN), because with these algorithms a change of "1" in any numeric feature is given the same importance.
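A sketch of the scaling step; MinMaxScaler is used here because it maps each feature onto the 0-1 range mentioned above (StandardScaler would be a reasonable alternative):

    # Scale every independent variable to the 0-1 range before distance-based models.
    from sklearn.preprocessing import MinMaxScaler

    X = df.drop(columns=["Outcome"])
    y = df["Outcome"]
    X_scaled = MinMaxScaler().fit_transform(X)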
VI] Splitting of data:
After cleaning and pre-processing, the dataset is ready for training and testing. In the train/test split method, the dataset is split randomly into a training set and a testing set.
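A sketch of the split; the 80/20 ratio, stratification, and random_state are assumptions, since the report does not state them here:

    # Hold out 20% of the records as an unseen test set.
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=42, stratify=y)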
VII] Design and implementation of the classification model:
In this work, comprehensive studies are carried out by applying different ML classification techniques: DT, KNN, RF, NB, LR, and SVM.
VIII] Machine learning classifier:
The performance of each classifier is analysed by measuring its accuracy. All classifiers are implemented using the scikit-learn library in Python, as sketched below.
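A sketch of the comparison, reusing the train/test split from the previous step; all hyperparameters are scikit-learn defaults except the SVM settings quoted later in the report:

    # Train each classifier and report its accuracy on the held-out test set.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    models = {
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(),
        "RF": RandomForestClassifier(),
        "NB": GaussianNB(),
        "LR": LogisticRegression(max_iter=1000),
        "SVM": SVC(kernel="rbf", gamma=0.0001),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, accuracy_score(y_test, model.predict(X_test)))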
MODELING AND ANALYSIS:
A] Logistic Regression:
Logistic regression is a machine learning technique used when the dependent variable is categorical. The output is computed from the available features, and a sigmoid function is used to map that output to a category.
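A minimal illustration of the sigmoid function referred to above; the example scores are arbitrary:

    # The sigmoid squashes the model's linear score into a 0-1 probability;
    # probabilities above 0.5 are classified as diabetic.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0), sigmoid(2.0), sigmoid(-2.0))   # 0.5, ~0.88, ~0.12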
B] K-Nearest Neighbors:
The k-nearest neighbors (KNN) algorithm uses feature similarity to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set. A prediction for a new instance x is made by searching the entire training set for the K most similar instances (the neighbors) and summarizing the output variable of those K instances.
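A short sketch of this neighbour search with scikit-learn, reusing the earlier split; k = 5 is the library default and an assumption here:

    # Fit KNN and inspect the 5 nearest training neighbours of the first test instance.
    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    distances, indices = knn.kneighbors(X_test[:1])
    print(indices)                    # positions of the K most similar training instances
    print(knn.predict(X_test[:1]))    # majority vote of those neighbours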
C] SVM:
SVM is a supervised learning algorithm used for classification. In SVM we must identify the right hyperplane to classify the data correctly, which requires setting appropriate parameter values. To find a hyperplane with the right margin we chose a gamma value of 0.0001 and the RBF kernel; selecting a hyperplane with a low margin leads to misclassification.
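A sketch of exactly this configuration, reusing the earlier split:

    # RBF-kernel SVM with gamma = 0.0001, as described above.
    from sklearn.svm import SVC

    svm = SVC(kernel="rbf", gamma=0.0001)
    svm.fit(X_train, y_train)
    print(svm.score(X_test, y_test))   # accuracy on the test split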
D] Naive Bayes:
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other.
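A sketch using the Gaussian variant, which suits continuous features such as glucose and BMI (the choice of GaussianNB is an assumption):

    # Naive Bayes: per-class probabilities follow from Bayes' theorem under
    # the feature-independence assumption.
    from sklearn.naive_bayes import GaussianNB

    nb = GaussianNB()
    nb.fit(X_train, y_train)
    print(nb.predict_proba(X_test[:1]))   # P(not diabetic), P(diabetic) for one patient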
E] Decision Tree:
A decision tree is a non-parametric classifier used in supervised learning. The model is represented as a tree in which the leaves correspond to the class labels and the internal nodes correspond to attributes.
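A sketch that also prints the learned tree so the leaf/internal-node structure described above is visible; the depth limit is an assumption:

    # Fit a shallow decision tree and dump its structure as text.
    from sklearn.tree import DecisionTreeClassifier, export_text

    dt = DecisionTreeClassifier(max_depth=3, random_state=42)
    dt.fit(X_train, y_train)
    print(export_text(dt, feature_names=list(X.columns)))   # attributes at internal nodes, classes at leaves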
F] Random Forest:
Random forest is an ensemble learning method for classification. The algorithm builds a collection of decision trees and combines their predictions, and the number of trees influences the accuracy. As in a single tree, the leaves correspond to class labels and the internal nodes correspond to attributes.
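A sketch of the forest, reusing the split; 100 trees is the scikit-learn default and an assumption here:

    # Random forest: many trees vote, and feature_importances_ shows which
    # attributes the ensemble relies on most.
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))
    print(dict(zip(X.columns, rf.feature_importances_.round(3))))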
G] AdaBoost Classifier:
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. Models are built in series: a first model is trained on the training data, then a second model is built that tries to correct the errors of the first, and so on, adding models until either the complete training set is predicted correctly or a maximum number of models is reached.
AdaBoost was the first successful boosting algorithm developed for binary classification. AdaBoost, short for Adaptive Boosting, is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier". It was formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work.
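A sketch of the AdaBoost step; the number of weak learners is an assumption, and scikit-learn's default weak learner is a depth-1 decision tree (a stump):

    # AdaBoost: weak classifiers are added in series, each focusing on the
    # examples the previous ones got wrong.
    from sklearn.ensemble import AdaBoostClassifier

    ada = AdaBoostClassifier(n_estimators=100, random_state=42)
    ada.fit(X_train, y_train)
    print(ada.score(X_test, y_test))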
Creating a User Interface for Accessibility:
The last part of the project is the creation of a user interface for the model. This interface is used to enter unseen data for the model to read and then make a prediction. The user interface is built with a Flask web app, Hyper Text Markup Language (HTML), and Cascading Style Sheets (CSS).
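A minimal sketch of such a Flask app; the template name, model file, and form field names are assumptions, not the project's actual code:

    # Serve a form, read the eight inputs, and return the model's prediction.
    import pickle
    from flask import Flask, render_template, request

    app = Flask(__name__)
    model = pickle.load(open("model.pkl", "rb"))   # previously trained classifier

    @app.route("/")
    def home():
        return render_template("index.html")

    @app.route("/predict", methods=["POST"])
    def predict():
        fields = ("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                  "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
        values = [float(request.form[f]) for f in fields]
        label = model.predict([values])[0]
        return render_template("index.html",
                               prediction="Diabetic" if label == 1 else "Not diabetic")

    if __name__ == "__main__":
        app.run(debug=True)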
1.4 Organization of the Report:
Chapter 1, Problem Identification: This chapter introduces the project and describes the problem statement discussed earlier in the report.
Chapter 2, Literature Review: This chapter presents a review of various research papers that help us understand the problem better. It also outlines what has already been done to solve the problem and what can still be done.
Chapter 3, Design Flow/Process: This chapter presents the need for and significance of the proposed work based on the literature review. The proposed objectives and methodology are explained, the relevance of the problem is established, and a logical and schematic plan for resolving the research problem is laid out.
Chapter 4, Result Analysis and Validation: This chapter explains the performance parameters used in the implementation and presents the experimental results, what they mean, and why they matter.
Chapter 5, Conclusion and Future Scope: This chapter concludes the results, identifies the method that performed best, and defines the future scope of the study, i.e., the extent to which the research area will be explored further.
Team Roles
Anshul (22BCS16477)
• Collection and making of the dataset
• Clustering and distribution of the dataset
• Visualisation of the dataset
Khushi Gupta (22BCS16186)
• Collection of the dataset
• Visualisation of the dataset
• Testing and training of the dataset
Era Trivedi (22BCS14924)
• Analysing the dataset
• Applying algorithms to the dataset
• Scraping the dataset
Vidushi Gupta (22BCS16291)
• Collection and making of the dataset
• Applying algorithms to the dataset
• Visualisation of the dataset
Shushant Singh (22BCS16192)
• Analysing the dataset
• Clustering and distribution of the dataset
• Collection of the dataset
Timeline:
