DIABETES DISEASE PREDICTION USING MACHINE LEARNING ALGORITHMS
A Minor-Project Report
Submitted in partial fulfillment of the requirements for the award of the
degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
K. VISHNU VARDHAN REDDY (21BT5A0508)
D. SUDHEER PATNAIK (20BT1A0507)
CH. NAVEEN (21BT5A0502)
S. ANVESH NAIDU (20BT1A0525)
M. AKHIL (21BT5A0509)
CERTIFICATE
We hereby declare that the Mini-Project report entitled “DIABETES DISEASE PREDICTION USING MACHINE LEARNING ALGORITHMS”, submitted by K. VISHNU VARDHAN REDDY (21BT5A0508), D. SUDHEER PATNAIK (20BT1A0507), CH. NAVEEN (21BT5A0502), S. ANVESH NAIDU (20BT1A0525), and M. AKHIL (21BT5A0509) to VCET, JNTUH in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering, is a record of bonafide project work carried out by us under the guidance of Mr. JAIPAL. We further declare that the work reported has not been submitted, and will not be submitted, either in part or in full, for the award of any degree or diploma in this institute or any other institute or university.
Place:
Date:
Place:
Date:
It gives us a great sense of pleasure to present the report of the project undertaken during B.Tech. We would like to express our special thanks to our Principal, Dr. D. RAMESH, for his moral support, and to the College Management of Visvesvaraya College of Engineering & Technology, Hyderabad, for providing us the infrastructure to complete the project.
Submitted by
- D. Sudheer Patnaik
- CH. Naveen
- M. Akhil
ABSTRACT
CHAPTER 1 INTRODUCTION
CHAPTER 2 LITERATURE SURVEY
CHAPTER 3 SYSTEM ANALYSIS
3.3 MODULES
CHAPTER 4 ALGORITHMS
CHAPTER 5 DESIGN
CHAPTER 6 IMPLEMENTATION
CHAPTER 7 TESTING
7.3 MAINTENANCE
CHAPTER 8 EXECUTION SLIDES
CHAPTER 9 CONCLUSION
CHAPTER 10 FUTURE SCOPE
CHAPTER 11 BIBLIOGRAPHY
CHAPTER 1
INTRODUCTION
In this day and age, one of the most notorious diseases to have taken the world by storm is diabetes, a disease that causes an increase in blood glucose levels as a result of absent or low levels of insulin. Because of the many criteria to be taken into consideration for an individual to harbor this disease, its detection and prediction can be tedious or sometimes inconclusive. Nevertheless, it is not impossible to detect it, even at an early stage. According to the International Diabetes Federation (IDF), 79% of adults with diabetes were living in low- and middle-income countries, and it is estimated that by the year 2045 approximately 700 million people will have diabetes (IDF).
At present, the number of cases is rising every year, with no slowdown in active cases. This is a crucial concern, as diabetes has become one of the most dangerous and fastest-growing diseases, taking the lives of many individuals around the globe.
• Supervised Learning: In supervised learning, we train the model on data with labels attached to the information, and based on that training we classify or test new data against those labels, as in the sketch below.
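As a minimal sketch of this idea, assuming the Pima Indians Diabetes Database is saved locally as diabetes.csv with a labeled Outcome column (the file name and column name are assumptions):

# Minimal supervised-learning sketch: fit a classifier on labeled rows,
# then classify unseen rows. 'diabetes.csv' and its 'Outcome' column are
# assumptions matching the Pima Indians Diabetes Database used later.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('diabetes.csv')
X = df.drop(columns=['Outcome'])   # patient measurements (features)
y = df['Outcome']                  # labels attached to the information

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
clf = DecisionTreeClassifier().fit(X_train, y_train)  # learn from labeled data
print(clf.predict(X_test[:5]))     # classify new, previously unseen records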
With the rise of machine learning and its algorithms, it has come to light that the significant problems and hindrances faced earlier in detection can now be eased with much simplicity, while still giving a detailed and accurate outcome. In the modern day, it is understood that machine learning has become even more effective and helpful in collaboration with the domain of medicine. Early determination of a disease can be made possible through machine learning by studying the characteristics of an individual. Such early efforts can lead to the inhibition of the disease, as well as prevent it from reaching a critical degree. The work described in this report is the prediction of diabetes using machine learning algorithms, for the early care of an individual.
CHAPTER 2
LITERATURE SURVEY
2.1 EXISTING SYSTEM
In previous work, the WEKA tool was used for data analytics in diabetes disease prediction on healthcare Big Data. The authors used a publicly available dataset from UCI and applied different machine learning classifiers to it. The classifiers they incorporated are Naive Bayes, Support Vector Machine, Random Forest, and Simple CART.
Their approach starts with accessing the dataset and preprocessing it in the WEKA tool, followed by a 70:30 train/test split for applying the different machine learning algorithms. They did not perform cross-validation, even though it is imperative for obtaining optimal and accurate results.
The authors also used the publicly available dataset named Pima Indians Diabetes Database for performing their experiment. Their prediction framework starts with dataset selection and then data pre-processing. Once the data was preprocessed, they applied three classification algorithms, i.e., Naïve Bayes, SVM, and Decision Tree. As they incorporated different evaluation metrics, they compared the different performance measures and comparatively analyzed the accuracy. The highest accuracy achieved in their experiment was 76.30%. Like the first study, they did not practice cross-validation; a sketch of that skipped step follows.
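For illustration, the k-fold cross-validation step that both studies skipped can be sketched with scikit-learn as follows, assuming the Pima dataset is saved locally as diabetes.csv:

# Hedged sketch: 5-fold cross-validation, the step the surveyed studies
# skipped. The file name and column names are assumptions.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv('diabetes.csv')
X, y = df.drop(columns=['Outcome']), df['Outcome']

# Each fold holds out a different 20% of the data, giving a more reliable
# accuracy estimate than a single 70:30 split.
scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())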
2.2 DISADVANTAGES OF EXISTING SYSTEM
1) There are no techniques or models for analyzing large-scale datasets in the existing system.
2) There is no facility to obtain a diabetes dataset in collaboration with a hospital or a medical institute, which could help achieve better results.
2.3 PROPOSED SYSTEM
To perform our experiment, we have used a publicly available dataset named as Pima
Indians Diabetes Database [4]. This dataset includes a various diagnostic measure of
diabetes disease. The dataset was originally from the National Institute of Diabetes and
Digestive and Kidney Diseases. All the recorded instances are of the patients whose age
are above 21 years old. Our proposed model exists of 5 phases which are shown in the
proposed system by following Figure.
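A minimal sketch of the five phases (dataset selection, pre-processing, train/test split, training, evaluation), assuming the dataset is saved locally as diabetes.csv and that implausible zeros in clinical columns are treated as missing values:

# Sketch of the five phases of the proposed model; file name, zero-imputation
# pre-processing, and the SVC choice here are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

df = pd.read_csv('diabetes.csv')                      # 1. dataset selection

cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols] = df[cols].replace(0, np.nan)                # 2. pre-processing:
df[cols] = df[cols].fillna(df[cols].median())         #    impute zeros

X, y = df.drop(columns=['Outcome']), df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(  # 3. train/test split
    X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = SVC().fit(scaler.transform(X_train), y_train)   # 4. model training

pred = clf.predict(scaler.transform(X_test))          # 5. evaluation
print('Accuracy:', accuracy_score(y_test, pred))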
2.4 ADVANTAGES OF PROPOSED SYSTEM
➢ The system is more effective due to fitting the datasets to different ML models by applying machine learning algorithms.
➢ Early determination of a disease is made possible in the proposed system through machine learning, by studying the characteristics of an individual.
2.5 FEASIBILITY STUDY
2.5.1 Operational Feasibility-
Operational feasibility deals with the study of the prospects of the system to be developed. This system operationally eliminates the tensions of the admin and helps in effectively tracking the project progress. This kind of automation will surely reduce the time and energy previously consumed by manual work. Based on this study, the system proves to be operationally feasible.
2.5.2 Economic Feasibility-
Users can use this tool at any time. The system is to be developed using the existing resources of the organization, so the project is economically feasible.
CHAPTER 3
SYSTEM ANALYSIS
❖ Front-End : Python.
❖ Back-End : Django-ORM
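As a hedged sketch, the Django-ORM back-end implied by the implementation chapter might define models along these lines; the field names come from the code excerpts in Chapter 6, while the exact field types are assumptions:

# models.py -- a sketch of the Django ORM models implied by the
# implementation chapter; field types are assumptions.
from django.db import models

class ClientRegister_Model(models.Model):
    # registration details stored for each remote user
    username = models.CharField(max_length=30)
    email = models.EmailField(max_length=50)
    password = models.CharField(max_length=30)
    phoneno = models.CharField(max_length=15)
    country = models.CharField(max_length=30)
    state = models.CharField(max_length=30)
    city = models.CharField(max_length=30)

class diabetes_disease_model(models.Model):
    # one row per record uploaded from the Excel data set
    Pregnancies = models.CharField(max_length=300)
    Glucose = models.CharField(max_length=300)
    BloodPressure = models.CharField(max_length=300)
    SkinThickness = models.CharField(max_length=300)
    Insulin = models.CharField(max_length=300)
    BMI = models.CharField(max_length=300)
    DiabetesPedigreeFunction = models.CharField(max_length=300)
    Age = models.CharField(max_length=300)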
3.3 MODULES
(i) Service Provider-
In this module, the Service Provider has to log in using a valid username and password. After a successful login, he can perform operations such as: Train and Test Data Sets, View Trained and Tested Accuracy in Bar Chart, View Trained and Tested Accuracy Results, Find Diabetic Status from Data Set Details, Find Diabetic Ratio on Data Sets, View All Emergency for Diabetic Treatment, Download Trained Data Sets, View Diabetic Ratio Results, and View All Remote Users.
View and Authorize Users
In this module, the admin can view the list of all registered users. The admin can view user details such as username, email, and address, and authorizes the users.
(ii) Remote User-
In this module, any number of users may be present. A user must register before performing any operations. Once a user registers, their details are stored in the database. After successful registration, the user has to log in using an authorized username and password. Once login is successful, the user can perform operations such as POST DIABETIC DATA SETS, SEARCH AND PREDICT DIABETIC STATUS, and VIEW YOUR PROFILE.
CHAPTER 4
ALGORITHMS
The five supervised machine learning algorithms used in our project are:
1) K-NEAREST NEIGHBOURS (KNN)
2) NAIVE BAYES
3) RANDOM FOREST
4) DECISION TREE CLASSIFIER
5) SUPPORT VECTOR MACHINE (SVM)
For classification, KNN assigns a new data point the class that is most common among its k nearest neighbors. For regression, it calculates the average of the target values of the k nearest neighbors and assigns this average as the predicted value for the new data point.
Choice of 'k':
The value of 'k' is a crucial parameter that needs to be specified. It represents the
number of nearest neighbors to consider. A small 'k' may lead to noisy predictions,
while a large 'k' might lead to oversmoothed predictions.
Key Characteristics:
• KNN is a non-parametric algorithm, meaning it doesn't make any assumptions
about the underlying data distribution.
• It's sensitive to outliers in the data.
• The computational cost of making predictions can be high, especially for large
datasets.
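A short sketch of KNN on the same assumed diabetes.csv, showing how the choice of 'k' shifts predictions from noisy to oversmoothed; features are scaled first because KNN is distance-based:

# Sketch: effect of 'k' on KNN test accuracy (dataset assumptions as above).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('diabetes.csv')
X, y = df.drop(columns=['Outcome']), df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# KNN measures distances, so features are brought to a common scale.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for k in (1, 5, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))  # small k: noisy; large k: oversmoothed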
While the Naive Bayes classifier is widely used in the research world, it is not widespread among practitioners who want to obtain usable results. On the one hand, researchers find that it is very easy to program and implement, its parameters are easy to estimate, learning is very fast even on very large databases, and its accuracy is reasonably good in comparison with other approaches. On the other hand, the final users do not obtain a model that is easy to interpret and deploy, and they do not understand the interest of such a technique.
Thus, we introduce a new presentation of the results of the learning process. The classifier is easier to understand, and its deployment is also made easier. In the first part of this tutorial, we present some theoretical aspects of the Naive Bayes classifier. Then, we implement the approach on a dataset with Tanagra. We compare the obtained results (the parameters of the model) to those obtained with other linear approaches, such as logistic regression, linear discriminant analysis, and the linear SVM. We note that the results are highly consistent, which largely explains the good performance of the method in comparison with others. In the second part, we use various tools on the same dataset (Weka 3.6.0, R 2.9.2, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0), and try above all to understand the obtained results.
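In the same spirit, a hedged scikit-learn sketch (standing in for the Tanagra/Weka/R tooling named above) comparing Gaussian Naive Bayes with the linear approaches mentioned, under the same diabetes.csv assumption:

# Sketch comparing Naive Bayes with logistic regression, LDA, and a linear
# SVM via 5-fold cross-validation; file and column names are assumptions.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('diabetes.csv')
X, y = df.drop(columns=['Outcome']), df['Outcome']

models = {
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'LDA': LinearDiscriminantAnalysis(),
    'Linear SVM': make_pipeline(StandardScaler(), LinearSVC(max_iter=5000)),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())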
Random forests or random decision forests are an ensemble learning method for
classification, regression and other tasks that operates by constructing a multitude of
decision trees at training time. For classification tasks, the output of the random forest is
the class selected by most trees. For regression tasks, the mean or average prediction of
the individual trees is returned. Random decision forests correct for decision trees’ habit
of overfitting to their training set. Random forests generally outperform decision trees,
but their accuracy is lower than gradient boosted trees. However, data characteristics can
affect their performance.
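A brief sketch contrasting a single decision tree with a random forest on the same assumed diabetes.csv, illustrating the ensemble's correction of the tree's overfitting:

# Sketch: single tree vs. random forest, compared by cross-validated accuracy.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('diabetes.csv')
X, y = df.drop(columns=['Outcome']), df['Outcome']

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# The forest averages many de-correlated trees, which typically generalizes
# better than one fully grown tree.
print('Single tree  :', cross_val_score(tree, X, y, cv=5).mean())
print('Random forest:', cross_val_score(forest, X, y, cv=5).mean())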
Decision tree classifiers are used successfully in many diverse areas. Their most important feature is the capability of capturing descriptive decision-making knowledge from the supplied data. A decision tree can be generated from a training set. The procedure for such generation, based on a set of objects S, each belonging to one of the classes C1, C2, ..., Ck, is as follows:
Step 1. If all the objects in S belong to the same class, for example Ci, the decision tree for S consists of a leaf labeled with this class.
Step 2. Otherwise, let T be some test with possible outcomes O1, O2, ..., On. Each object in S has one outcome for T, so the test partitions S into subsets S1, S2, ..., Sn, where each object in Si has outcome Oi for T. T becomes the root of the decision tree, and for each outcome Oi we build a subsidiary decision tree by invoking the same procedure recursively on the set Si.
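A direct, minimal sketch of this two-step procedure in Python; the objects, tests, and class labels below are hypothetical, and the test chooser is naive (first available) rather than an information-gain criterion:

# S is a list of (object, class) pairs; tests are boolean functions standing
# in for T. Step 1: single class -> leaf. Step 2: partition by T and recurse.
def build_tree(S, tests):
    classes = {cls for _, cls in S}
    if len(classes) == 1:                       # Step 1: one class -> leaf
        return ('leaf', classes.pop())
    if not tests:                               # no test left: majority leaf
        return ('leaf', max(classes, key=lambda c: sum(1 for _, k in S if k == c)))
    T, rest = tests[0], tests[1:]               # Step 2: pick some test T
    subsets = {}
    for obj, cls in S:                          # partition S by outcome of T
        subsets.setdefault(T(obj), []).append((obj, cls))
    return ('node', T.__name__,                 # T becomes the root
            {o: build_tree(Si, rest) for o, Si in subsets.items()})

# Hypothetical objects: dicts of patient measurements.
def high_glucose(o): return o['Glucose'] > 120
def older_than_40(o): return o['Age'] > 40

S = [({'Glucose': 150, 'Age': 50}, 'diabetic'),
     ({'Glucose': 90,  'Age': 30}, 'healthy'),
     ({'Glucose': 140, 'Age': 35}, 'diabetic')]
print(build_tree(S, [high_glucose, older_than_40]))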
CHAPTER 5
DESIGN
5.2 DATA FLOW DIAGRAM-
5.3 CLASS DIAGRAMS-
5.4 FLOW CHART DIAGRAMS-
(i) Remote Users-
(ii) Service Provider-
5.5 USE CASE DIAGRAM-
5.6 SEQUENCE DIAGRAM-
CHAPTER 6
IMPLEMENTATION
The following excerpts are from the Django views for the remote user and the service provider (the template paths RUser/ and SProvider/ identify the two roles).

        return redirect('Add_DataSet_Details')
    except Exception:
        return render(request, 'RUser/login.html')

def Add_DataSet_Details(request):
    if "GET" == request.method:
        return render(request, 'RUser/Add_DataSet_Details.html', {})
    else:
        excel_file = request.FILES["excel_file"]
        # you may put validations here to check extension or file size
        wb = openpyxl.load_workbook(excel_file)
        active_sheet = wb.active
        excel_data = []
        # store one ORM record per spreadsheet row (header row skipped)
        for r in range(2, active_sheet.max_row + 1):
            diabetes_disease_model.objects.create(
                Pregnancies=active_sheet.cell(r, 1).value,
                Glucose=active_sheet.cell(r, 2).value,
                BloodPressure=active_sheet.cell(r, 3).value,
                SkinThickness=active_sheet.cell(r, 4).value,
                Insulin=active_sheet.cell(r, 5).value,
                BMI=active_sheet.cell(r, 6).value,
                DiabetesPedigreeFunction=active_sheet.cell(r, 7).value,
                Age=active_sheet.cell(r, 8).value,
            )
        return render(request, 'RUser/Add_DataSet_Details.html', {"excel_data": excel_data})
def Register1(request):
    if request.method == "POST":
        username = request.POST.get('username')
        email = request.POST.get('email')
        password = request.POST.get('password')
        phoneno = request.POST.get('phoneno')
        country = request.POST.get('country')
        state = request.POST.get('state')
        city = request.POST.get('city')
        # persist the registration details to the database
        ClientRegister_Model.objects.create(
            username=username, email=email, password=password,
            phoneno=phoneno, country=country, state=state, city=city)
        return render(request, 'RUser/Register1.html')
    else:
        return render(request, 'RUser/Register1.html')
def ViewYourProfile(request):
    userid = request.session['userid']
    obj = ClientRegister_Model.objects.get(id=userid)
    return render(request, 'RUser/ViewYourProfile.html', {'object': obj})
def Search_Predict_Diabetic_DataSets(request):
    if request.method == "POST":
        kword = request.POST.get('keyword')
        print(kword)
        # return the stored predictions whose label contains the keyword
        obj = diabetes_disease_prediction.objects.all().filter(
            Prediction__contains=kword)
        return render(request, 'RUser/Search_Predict_Diabetic_DataSets.html',
                      {'objs': obj})
    return render(request, 'RUser/Search_Predict_Diabetic_DataSets.html')
import datetime
import warnings

import xlwt
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
def train_model(request):
    # ... (the split, fitting, and per-row classification logic is elided
    # in this listing; each classified row is stored with its label)
    diabetes_disease_prediction.objects.create(
        Pregnancies=Pregnancies,
        Glucose=Glucose,
        BloodPressure=BloodPressure,
        SkinThickness=SkinThickness,
        Insulin=Insulin,
        BMI=BMI,
        DiabetesPedigreeFunction=DiabetesPedigreeFunction,
        Age=Age,
        Prediction=type,
        Status=status,
    )
    obj = diabetes_disease_prediction.objects.all()
    return render(request, 'SProvider/Find_Diabetic_Status_Details.html',
                  {'list_objects': obj})
def likeschart(request, like_chart):
    # average detection ratio per algorithm name, for the chart template
    charts = detection_results_model.objects.values('names').annotate(dcount=Avg('ratio'))
    return render(request, "SProvider/likeschart.html",
                  {'form': charts, 'like_chart': like_chart})
def Download_Trained_DataSets(request):
    response = HttpResponse(content_type='application/ms-excel')
    # decide file name (the exact name is assumed in this listing)
    response['Content-Disposition'] = 'attachment; filename="TrainedDataSets.xls"'
    # creating workbook
    wb = xlwt.Workbook(encoding='utf-8')
    # adding sheet
    ws = wb.add_sheet("sheet1")
    row_num = 0
    font_style = xlwt.XFStyle()
    font_style.font.bold = True
    # writer = csv.writer(response)
    obj = diabetes_disease_prediction.objects.all()
    # one spreadsheet row per record (writes for columns 0-7 elided here)
    for my_row in obj:
        row_num = row_num + 1
        ws.write(row_num, 8, my_row.Prediction, font_style)
    wb.save(response)
    return response
models = {}
# gradient boosting stands in under the key 'XGB' in this project's code
models['XGB'] = GradientBoostingClassifier(random_state=12345)
train_model1(key, values)
obj = detection_results_model.objects.all()
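To make the training flow above concrete, the following is a hedged sketch of how the five classifiers from Chapter 4 (plus the gradient-boosting model keyed as 'XGB' above) could be fitted and compared; the file name and the 70:30 split are assumptions:

# Fit each classifier on a 70:30 split and record its test accuracy.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

df = pd.read_csv('diabetes.csv')
X, y = df.drop(columns=['Outcome']), df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=12345)

models = {
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(random_state=12345),
    'Decision Tree': DecisionTreeClassifier(random_state=12345),
    'SVM': SVC(),
    'XGB': GradientBoostingClassifier(random_state=12345),  # as keyed above
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))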
CHAPTER 7
TESTING
o Unit Testing.
o Integration Testing.
o User Acceptance Testing.
o Output Testing.
o Validation Testing.
Unit testing focuses verification effort on the smallest unit of software design: the module. Unit testing exercises specific paths in a module's control structure to ensure complete coverage and maximum error detection. This test focuses on each module individually, ensuring that it functions properly as a unit; hence the name unit testing.
During this testing, each module is tested individually, and the module interfaces are verified for consistency with the design specification. All important processing paths are tested for the expected results, and all error-handling paths are also tested. A minimal sketch of such a test is given below.
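A minimal unit-test sketch using Python's unittest module; the validation helper under test is hypothetical, standing in for one module of this system:

# Minimal unit-test sketch; is_valid_glucose is a hypothetical helper.
import unittest

def is_valid_glucose(value):
    """Hypothetical validation helper: plasma glucose must be a positive number."""
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        return False

class TestGlucoseValidation(unittest.TestCase):
    def test_accepts_positive_numbers(self):
        self.assertTrue(is_valid_glucose('120'))

    def test_rejects_zero_and_text(self):
        self.assertFalse(is_valid_glucose(0))
        self.assertFalse(is_valid_glucose('abc'))

if __name__ == '__main__':
    unittest.main()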
Integration testing addresses the issues associated with the dual problems of verification and program construction. After the software has been integrated, a set of high-order tests is conducted. The main objective of this testing process is to take unit-tested modules and build a program structure that has been dictated by the design.
1. Top-Down Integration
2. Bottom-Up Integration
This method begins the construction and testing with the modules at the lowest level in the program structure. Since the modules are integrated from the bottom up, the processing required for modules subordinate to a given level is always available, and the need for stubs is eliminated. The bottom-up integration strategy may be implemented with the following steps:
▪ The low-level modules are combined into clusters that perform a specific software sub-function.
▪ A driver (i.e., a control program for testing) is written to coordinate test case input and output.
▪ The cluster is tested.
▪ Drivers are removed and clusters are combined moving upward in the program structure.
The bottom-up approach tests each module individually; each module is then integrated with a main module and tested for functionality.
User acceptance of a system is the key factor for the success of any system. The system under consideration was tested for user acceptance by constantly keeping in touch with prospective system users during development and by making changes wherever required. The system developed provides a friendly user interface that can be easily understood even by a person who is new to the system.
After performing validation testing, the next step is output testing of the proposed system, since no system can be useful if it does not produce the required output in the specified format. The outputs generated or displayed by the system under consideration are tested by asking the users about the format they require. The output format is hence considered in two ways: on screen and in printed format.
Text Field:
The text field can contain only a number of characters less than or equal to its size. The text fields are alphanumeric in some tables and alphabetic in others. An incorrect entry always flashes an error message.
Numeric Field:
The numeric field can contain only the numbers 0 to 9. An entry of any other character flashes an error message. The individual modules are checked for accuracy and for what they have to perform. Each module is subjected to a test run along with sample data. The individually tested modules are then integrated into a single system. Testing involves executing the program with real data; the existence of any program defect is inferred from the output. The testing should be planned so that all the requirements are individually tested.
A successful test is one that brings out the defects for inappropriate data and produces output revealing the errors in the system.
The above testing is done by taking various kinds of test data. Preparation of test data plays a vital role in system testing. After preparing the test data, the system under study is tested using that test data. While testing the system with the test data, errors are again uncovered and corrected using the above testing steps, and the corrections are noted for future use.
Live test data are those that are actually extracted from organization files. After a
system is partially constructed, programmers or analysts often ask users to key in a set of
data from their normal activities. Then, the systems person uses this data as a way to
partially test the system. In other instances, programmers or analysts extract a set of live
data from the files and have them entered themselves.
Although it is assumed that the live data entered are in fact typical of the processing requirement, such data generally will not test all the combinations or formats that can enter the system. This bias toward typical values therefore does not provide a true systems test and in fact ignores the cases most likely to cause system failure.
Artificial test data are created solely for test purposes, since they can be generated to test all combinations of formats and values. In other words, artificial data, which can quickly be prepared by a data-generating utility program in the information systems department, make possible the testing of all logic and control paths through the program.
The most effective test programs use artificial test data generated by persons other than those who wrote the programs. Often, an independent team of testers formulates a testing plan, using the system specifications. A sketch of such a data-generating utility is given below.
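A minimal sketch of a data-generating utility of this kind; the column names follow the Pima dataset, while the value ranges and file name are assumptions chosen for illustration:

# Generate artificial rows covering valid ranges plus deliberate edge cases.
import csv
import random

FIELDS = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
          'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

def random_row():
    # value ranges are assumptions roughly matching the Pima dataset
    return [random.randint(0, 15), random.randint(44, 199),
            random.randint(24, 122), random.randint(7, 99),
            random.randint(14, 846), round(random.uniform(18.0, 67.0), 1),
            round(random.uniform(0.08, 2.42), 3), random.randint(21, 81)]

with open('artificial_test_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(FIELDS)
    for _ in range(100):
        writer.writerow(random_row())
    writer.writerow([0] * len(FIELDS))      # edge case: all zeros
    writer.writerow(['abc'] * len(FIELDS))  # edge case: wrong type, must be rejected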
The package “Diabetes Disease Prediction Using Machine Learning Algorithms” has satisfied all the requirements specified in the software requirement specification and was accepted.
7.3 MAINTENANCE
Maintenance covers a wide range of activities, including correcting code and design errors. To reduce the need for maintenance in the long run, we have more accurately defined the user's requirements during the process of system development. Depending on the requirements, this system has been developed to satisfy the needs to the largest possible extent. With developments in technology, it may be possible to add many more features based on future requirements. The coding and design are simple and easy to understand, which will make maintenance easier.
7.4 BLACK BOX TESTING –
Black box testing is testing the software without any knowledge of the inner workings, structure, or language of the module being tested. Black box tests, like most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is testing in which the software under test is treated as a black box: you cannot “see” into it. The test provides inputs and responds to outputs without considering how the software works.
7.5 WHITE BOX TESTING –
White box testing is testing in which the software tester has knowledge of the inner workings, structure, and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black box level.
7.6 TESTING STRATEGY –
A strategy for system testing integrates system test cases and design techniques into a well-planned series of steps that results in the successful construction of software. The testing strategy must incorporate test planning, test case design, test execution, and the resultant data collection and evaluation. A strategy for software testing must accommodate low-level tests, which are necessary to verify that a small source code segment has been correctly implemented, as well as high-level tests that validate major system functions against user requirements.
Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design, and coding. Testing presents an interesting anomaly for the software engineer. Thus, a series of tests is performed on the proposed system before it is ready for user acceptance testing.
CHAPTER 8
EXECUTION SLIDES
8.3 USER REGISTRATION PAGE-
8.5 SERVICE PROVIDER LOGIN PAGE-
8.7 VIEW TRAINED AND TESTED DATA IN PIE CHART-
8.9 ALGORITHMS ACCURACY-
CHAPTER 9
CONCLUSION
Even with the progression of technology and medicine, one of the significant impediments remains the early detection of a disease, in this case diabetes. In this study, systematic efforts were made to design a model accurate enough to determine the onset of the disease. With the experiments conducted on the Pima Indians Diabetes Database, we have readily predicted this disease, and the results achieved proved the adequacy of the system, with an accuracy of 76% using the K-Nearest Neighbours classifier. With this said, it is hoped that this model can be implemented in a system to predict other deadly diseases as well. There remains room for further improvement in automating the analysis of diabetes, or any other disease, in the future.
CHAPTER 10
FUTURE SCOPE
CHAPTER 11
BIBLIOGRAPHY
[7] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant, “Array programming with NumPy,” Nature, vol. 585, no. 7825, pp. 357–362, Sep. 2020. [Online]. Available: https://doi.org/10.1038/s41586-020-2649-2
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011.