
DATA MINING METHODS FOR DIABETES PREDICTION

A PROJECT REPORT

Submitted by

Anshul (22BCS16477)

Khushi Gupta (22BCS16186)

Era Trivedi (22BCS14924)

Vidushi Gupta (22BCS16291)

Shushant Singh (22BCS16192)


1.1 Identification of Client:

Many chronic diseases are widespread in both developed and developing countries, and diabetes is one of them. Diabetes is a metabolic disorder in which blood sugar rises because the body produces too little insulin or cannot use the insulin it produces effectively. It is among the deadliest diseases in the world: it is not only an illness in its own right but also a trigger for other conditions such as heart attack, blindness, kidney disease, and nerve damage.

Detecting such a chronic metabolic disease at an early stage could therefore help doctors around the world prevent loss of life. With the rise of machine learning, AI, and neural networks, and their application in many domains [1, 2], we may now be able to find a solution to this problem. ML techniques and neural networks help researchers discover new facts in existing health-related datasets, which can support disease monitoring and detection. The present work uses the Pima Indians Diabetes Database. The aim of this system is to build an ML model that can predict, with good accuracy, the probability that a patient is diabetic. The conventional route to a diabetes diagnosis requires the patient to visit a diagnostic centre, and one of the key difficulties of bioinformatics analysis is obtaining accurate results from the data: human error and repeated laboratory tests can complicate identification of the disease. A model that predicts whether a patient has diabetes helps doctors ensure that patients in need of clinical care receive it in time, and so helps prevent the loss of human lives.

The nature of the problem also makes neural networks an apparent choice. Neural networks use neurons to transmit data across several layers, with each node applying a different weighted parameter to help predict diabetes.

Causes of Diabetes
Genetic factors are the main cause of diabetes. The disease has been linked to at least two mutant genes on chromosome 6, the chromosome that affects the body's response to various antigens. Viral infection may also influence the occurrence of type 1 and type 2 diabetes: studies have shown that infection with viruses such as rubella, Coxsackievirus, mumps, hepatitis B virus, and cytomegalovirus increases the risk of developing diabetes.

Types of Diabetes
Type 1
In type 1 diabetes the immune system attacks the insulin-producing cells, so the body fails to produce insulin in sufficient amounts. No conclusive studies establish the causes of type 1 diabetes, and there are currently no known methods of prevention.
Type 2
In type 2 diabetes the cells produce too little insulin or the body cannot use the insulin correctly. This is the most common type, affecting about 90% of people diagnosed with diabetes, and it is caused by both genetic factors and lifestyle.

1.2 Identification of Problem:

Data mining and machine learning have become reliable supporting tools in the medical domain in recent years. Data mining methods are used to pre-process the healthcare data and select the relevant features, while machine learning methods help automate diabetes prediction. Together they can uncover hidden patterns in the data, so a reliably accurate decision becomes possible. Data mining is a process that combines several techniques, including machine learning, statistics, and database systems, to discover patterns in very large datasets. According to NVIDIA, machine learning uses various algorithms to learn from the parsed data and make predictions.

Diabetes prediction is a classification task with two mutually exclusive outcomes: the person is either diabetic or not diabetic. As described in Section 1.1, the aim of this work is an ML model that predicts, with good accuracy, the likelihood that a patient is diabetic, replacing the conventional process in which the patient must visit a diagnostic centre. Human error or repeated laboratory tests can complicate the identification of the disease, so a model that flags patients in need of clinical care in time can help prevent the loss of human lives.
1.3 Identification of Tasks:

The dataset used is the Pima Indians Diabetes Database, available on Kaggle. It consists of several medical predictor variables and one target variable. The objective is to predict whether the patient has diabetes. The dataset contains several independent variables and one dependent variable, the Outcome. The independent variables include the number of pregnancies the patient has had, BMI, insulin level, age, and so on, as shown in Table 1:

Table 1: Attributes of the Pima Indians Diabetes dataset

Serial No.  Attribute Name               Description
1           Pregnancies                  Number of times pregnant
2           Glucose                      Plasma glucose concentration
3           Blood Pressure               Diastolic blood pressure (mm Hg)
4           Skin Thickness               Triceps skin fold thickness (mm)
5           Insulin                      2-hour serum insulin (mu U/ml)
6           BMI                          Body mass index (kg/m^2)
7           Diabetes Pedigree Function   Diabetes pedigree function
8           Age                          Age of patient (years)
9           Outcome                      Class variable (0 = no diabetes, 1 = diabetes)

➔ The diabetes dataset consists of 2000 data points, with 9 features each.
➔ "Outcome" is the feature we are going to predict: 0 means no diabetes, 1 means diabetes.
I] Dataset collection:
This step covers collecting the data and understanding it, so that hidden patterns and trends can be studied, predictions made, and results evaluated. The dataset carries a fixed number of records and features; the features include Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.
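A minimal sketch of this step in Python, assuming the Kaggle file has been saved locally as "diabetes.csv" (the file name is an assumption):

    # Load the Pima Indians Diabetes data and take a first look at it.
    import pandas as pd

    df = pd.read_csv("diabetes.csv")
    print(df.shape)                       # number of records and features
    print(df.columns.tolist())            # Pregnancies, Glucose, ..., Outcome
    print(df["Outcome"].value_counts())   # class balance: 0 = no diabetes, 1 = diabetes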
II] Data pre-processing:
This phase handles inconsistent data in order to get more accurate and precise results; for example, the Id column in this dataset is inconsistent, so that feature was dropped.
III] Missing value identification:
Using the pandas library and scikit-learn, we identify the missing values in the dataset and replace each missing value with the corresponding mean value, as sketched below.
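A hedged sketch of this step, continuing from the loading sketch above; treating zeros in the listed columns as missing, imputing with the column mean, and the no-space Kaggle column names are assumptions:

    # Treat physiologically impossible zeros as missing, then impute with the column mean.
    import numpy as np

    cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
    df[cols] = df[cols].replace(0, np.nan)
    print(df.isnull().sum())              # how many missing values each column has
    df[cols] = df[cols].fillna(df[cols].mean())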
IV] Feature selection:
Pearson's correlation is a popular method for finding the most relevant attributes/features. The correlation coefficient between each input attribute and the output is calculated; the coefficient ranges between -1 and 1. A value above 0.5 or below -0.5 indicates a notable correlation, and a value of zero means no correlation.
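A small sketch of this selection step on the same DataFrame; the 0.2 cut-off used here is purely illustrative (the text's own guideline is 0.5):

    # Pearson correlation of every input attribute with the Outcome column.
    corr = df.corr(method="pearson")["Outcome"].drop("Outcome")
    print(corr.sort_values(ascending=False))
    selected = corr[corr.abs() > 0.2].index.tolist()   # keep the more strongly correlated features
    print("Selected features:", selected)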
V] Scaling and normalization:
Scaling transforms the data so that it fits within a specific range, such as 0-100 or 0-1. Scaling matters for methods based on distances between data points, such as support vector machines (SVM) or k-nearest neighbors (KNN), because with these algorithms a change of "1" in any numeric feature is given the same importance.
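A sketch of the scaling step; MinMaxScaler is used here because it maps each feature onto the 0-1 range mentioned above (StandardScaler would be a reasonable alternative):

    # Scale every independent variable to the 0-1 range before distance-based models.
    from sklearn.preprocessing import MinMaxScaler

    X = df.drop(columns=["Outcome"])
    y = df["Outcome"]
    X_scaled = MinMaxScaler().fit_transform(X)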
VI] Splitting of data:
After cleaning and pre-processing, the dataset is ready for training and testing. In the train/test split method, the dataset is split randomly into a training set and a testing set.
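A sketch of the split; the 80/20 ratio, stratification, and random_state are assumptions, since the report does not state them here:

    # Hold out 20% of the records as an unseen test set.
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=42, stratify=y)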
VII] Design and implementation of the classification model:
In this work, comprehensive studies are carried out by applying different ML classification techniques: DT, KNN, RF, NB, LR, and SVM.
VIII] Machine learning classifier:
The performance of each classifier is analysed by measuring its accuracy. All classifiers are implemented using the scikit-learn library in Python, as sketched below.
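A sketch of the comparison, reusing the train/test split from the previous step; all hyperparameters are scikit-learn defaults except the SVM settings quoted later in the report:

    # Train each classifier and report its accuracy on the held-out test set.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    models = {
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(),
        "RF": RandomForestClassifier(),
        "NB": GaussianNB(),
        "LR": LogisticRegression(max_iter=1000),
        "SVM": SVC(kernel="rbf", gamma=0.0001),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, accuracy_score(y_test, model.predict(X_test)))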
MODELING AND ANALYSIS:
A] Logistic Regression:
Logistic regression is a machine learning technique used when the dependent variable is categorical. The output is computed from the available features, and a sigmoid function is used to map that output to a category.
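A minimal illustration of the sigmoid function referred to above; the example scores are arbitrary:

    # The sigmoid squashes the model's linear score into a 0-1 probability;
    # probabilities above 0.5 are classified as diabetic.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0), sigmoid(2.0), sigmoid(-2.0))   # 0.5, ~0.88, ~0.12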
B] K-Nearest Neighbors:
The k-nearest neighbors (KNN) algorithm uses feature similarity to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set. A prediction for a new instance x is made by searching the entire training set for the K most similar instances (the neighbors) and summarizing the output variable of those K instances.
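A short sketch of this neighbour search with scikit-learn, reusing the earlier split; k = 5 is the library default and an assumption here:

    # Fit KNN and inspect the 5 nearest training neighbours of the first test instance.
    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    distances, indices = knn.kneighbors(X_test[:1])
    print(indices)                    # positions of the K most similar training instances
    print(knn.predict(X_test[:1]))    # majority vote of those neighbours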
C] SVM:
SVM is a supervised learning algorithm used for classification. In SVM we must identify the right hyperplane to classify the data correctly, which requires setting appropriate parameter values. To find a hyperplane with the right margin we chose a gamma value of 0.0001 and the RBF kernel; selecting a hyperplane with a low margin leads to misclassification.
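A sketch of exactly this configuration, reusing the earlier split:

    # RBF-kernel SVM with gamma = 0.0001, as described above.
    from sklearn.svm import SVC

    svm = SVC(kernel="rbf", gamma=0.0001)
    svm.fit(X_train, y_train)
    print(svm.score(X_test, y_test))   # accuracy on the test split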
D] Naive Bayes:
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other.
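A sketch using the Gaussian variant, which suits continuous features such as glucose and BMI (the choice of GaussianNB is an assumption):

    # Naive Bayes: per-class probabilities follow from Bayes' theorem under
    # the feature-independence assumption.
    from sklearn.naive_bayes import GaussianNB

    nb = GaussianNB()
    nb.fit(X_train, y_train)
    print(nb.predict_proba(X_test[:1]))   # P(not diabetic), P(diabetic) for one patient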
E] Decision Tree:
A decision tree is a non-parametric classifier used in supervised learning. The model is represented as a tree in which the leaves correspond to the class labels and the internal nodes correspond to attributes.
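A sketch that also prints the learned tree so the leaf/internal-node structure described above is visible; the depth limit is an assumption:

    # Fit a shallow decision tree and dump its structure as text.
    from sklearn.tree import DecisionTreeClassifier, export_text

    dt = DecisionTreeClassifier(max_depth=3, random_state=42)
    dt.fit(X_train, y_train)
    print(export_text(dt, feature_names=list(X.columns)))   # attributes at internal nodes, classes at leaves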
F] Random Forest:
Random forest is an ensemble learning method for classification. The algorithm builds a collection of decision trees and combines their predictions, and the number of trees influences the accuracy. As in a single tree, the leaves correspond to class labels and the internal nodes correspond to attributes.
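A sketch of the forest, reusing the split; 100 trees is the scikit-learn default and an assumption here:

    # Random forest: many trees vote, and feature_importances_ shows which
    # attributes the ensemble relies on most.
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))
    print(dict(zip(X.columns, rf.feature_importances_.round(3))))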
G] AdaBoost Classifier:
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. Models are built in series: a first model is trained on the training data, then a second model is built that tries to correct the errors of the first, and so on, adding models until either the complete training set is predicted correctly or a maximum number of models is reached.
AdaBoost was the first successful boosting algorithm developed for binary classification. AdaBoost, short for Adaptive Boosting, is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier". It was formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work.
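A sketch of the AdaBoost step; the number of weak learners is an assumption, and scikit-learn's default weak learner is a depth-1 decision tree (a stump):

    # AdaBoost: weak classifiers are added in series, each focusing on the
    # examples the previous ones got wrong.
    from sklearn.ensemble import AdaBoostClassifier

    ada = AdaBoostClassifier(n_estimators=100, random_state=42)
    ada.fit(X_train, y_train)
    print(ada.score(X_test, y_test))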
Creating a User Interface for Accessibility:
The last part of the project is the creation of a user interface for the model. This interface is used to enter unseen data for the model to read and then make a prediction. The user interface is built with a Flask web app, Hyper Text Markup Language (HTML), and Cascading Style Sheets (CSS).
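A minimal sketch of such a Flask app; the template name, model file, and form field names are assumptions, not the project's actual code:

    # Serve a form, read the eight inputs, and return the model's prediction.
    import pickle
    from flask import Flask, render_template, request

    app = Flask(__name__)
    model = pickle.load(open("model.pkl", "rb"))   # previously trained classifier

    @app.route("/")
    def home():
        return render_template("index.html")

    @app.route("/predict", methods=["POST"])
    def predict():
        fields = ("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                  "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
        values = [float(request.form[f]) for f in fields]
        label = model.predict([values])[0]
        return render_template("index.html",
                               prediction="Diabetic" if label == 1 else "Not diabetic")

    if __name__ == "__main__":
        app.run(debug=True)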
1.4 Organization of the Report:
Chapter 1, Problem Identification: This chapter introduces the project and describes the problem statement discussed earlier in the report.
Chapter 2, Literature Review: This chapter presents a review of various research papers that help us understand the problem better. It also outlines what has already been done to solve the problem and what can still be done.
Chapter 3, Design Flow/Process: This chapter presents the need for and significance of the proposed work based on the literature review. The proposed objectives and methodology are explained, the relevance of the problem is established, and a logical and schematic plan for resolving the research problem is laid out.
Chapter 4, Result Analysis and Validation: This chapter explains the performance parameters used in the implementation and presents the experimental results, what they mean, and why they matter.
Chapter 5, Conclusion and Future Scope: This chapter concludes the results, identifies the method that performed best, and defines the future scope of the study, i.e., the extent to which the research area will be explored further.
Team Roles
Anshul (22BCS16477)
• Collection and making of the dataset
• Clustering and distribution of the dataset
• Visualisation of the dataset
Khushi Gupta (22BCS16186)
• Collection of the dataset
• Visualisation of the dataset
• Testing and training of the dataset
Era Trivedi (22BCS14924)
• Analysing the dataset
• Applying algorithms to the dataset
• Scraping the dataset
Vidushi Gupta (22BCS16291)
• Collection and making of the dataset
• Applying algorithms to the dataset
• Visualisation of the dataset
Shushant Singh (22BCS16192)
• Analysing the dataset
• Clustering and distribution of the dataset
• Collection of the dataset
Timeline:
