DL PPR3
INTRODUCTION
Nowadays, people face various diseases because of environmental conditions and their living
habits, so predicting disease at an early stage has become an important task. However, accurate
prediction on the basis of symptoms alone is very difficult for a doctor, and the correct prediction of
disease is one of the most challenging tasks. To overcome this problem, data mining plays an important
role in disease prediction. Medical science sees a large growth of data every year, and because of this
growing amount of data in the medical and healthcare field, accurate analysis of medical data
benefits early patient care. With the help of disease data, data mining
finds hidden pattern information in the huge amount of medical data. We propose a general
disease prediction system based on the symptoms of the patient. For disease prediction, we use the K-
Nearest Neighbour (KNN) and Support Vector Machine (SVM) machine learning algorithms for
accurate prediction of disease; disease prediction requires a dataset of disease symptoms.
1.1 OBJECTIVE
In this general disease prediction system, the living habits of the person and check-up information
are considered for accurate prediction. The accuracy of general disease prediction using CNN is 83.5%,
which is higher than that of the KNN algorithm, and the time and memory requirements of KNN
are also greater than those of CNN. After predicting the general disease, the system is able to report
the risk associated with it, i.e., whether the risk of the general disease is low or high.
Machine learning makes computers more intelligent and enables them to learn.
Many analysts feel that intelligence cannot be created without learning. There are numerous kinds
of machine learning techniques, such as unsupervised, semi-supervised, supervised,
reinforcement, evolutionary, and deep learning. These techniques are used to classify
huge amounts of data very quickly.
1.2 MOTIVATION
We therefore use K-Nearest Neighbour (KNN) and Support Vector Machine (SVM). Medical
data is increasing day by day, so using it to predict the correct disease is a crucial task, and
processing big data is demanding in general; data mining therefore plays a very important role, and
classification of a large dataset using machine learning becomes much easier. It is critical to
comprehend the accurate diagnosis of patients by clinical examination and evaluation. The quality
of data associations has been affected by improper management of the information.
The growth in the amount of data needs a legitimate way to extract and process
information effectively and efficiently. One of the many machine learning applications is
to construct a classifier that can separate data based on its characteristics; the dataset is
partitioned into two or more classes. Such classifiers are utilized for medical data
investigation and disease prediction. Today machine learning is present everywhere, so
one may use it many times a day without even knowing it.
1.4 METHODOLOGY
This refers to the machine learning algorithms that are being utilized during implementation.
1.4.1 KNN ALGORITHM
o K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification, but it is mostly
used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs an
action on it.
o KNN algorithm at the training phase just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the
KNN algorithm, as it works on a similarity measure. Our KNN model will find the features
of the new image that are most similar to those of the cat and dog images and, based on the
most similar features, will put it in either the cat or the dog category. A minimal code
sketch of this idea is given below.
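The way KNN is typically applied in this setting can be illustrated with a small sketch using scikit-learn's KNeighborsClassifier. The symptom columns, the value of k, and the toy data below are illustrative assumptions, not the project's actual dataset.

```python
# Minimal KNN sketch (illustrative data, not the project's real dataset).
from sklearn.neighbors import KNeighborsClassifier

# Each row is a patient encoded as binary symptom flags:
# [fever, cough, headache, joint_pain]
X_train = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
y_train = ["flu", "flu", "arthritis", "arthritis"]

# k = 3 neighbours; the distance between symptom vectors drives the vote.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)          # "lazy" learning: mostly just stores the data

new_patient = [[1, 0, 1, 0]]       # fever + headache
print(knn.predict(new_patient))    # label of the most similar class among the neighbours
```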
1.4.2 SUPPORT VECTOR MACHINE
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning. The goal of the SVM algorithm is to create
the best line or decision boundary that can segregate n-dimensional space into classes so that
we can easily put the new data point in the correct category in the future. This best decision
boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating
the hyperplane. These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine. Consider a diagram in which two different categories are
classified using a decision boundary or hyperplane.
Example: SVM can be understood with the example we used for the KNN classifier.
Suppose we see a strange cat that also has some features of a dog; if we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We first train our model with many images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature.
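As a complement to the description above, here is a minimal sketch of training an SVM classifier with scikit-learn's SVC. The two-feature toy data and labels are invented for illustration; in this project the real features would be symptom encodings.

```python
# Minimal SVM sketch: a linear hyperplane separating two toy classes.
from sklearn.svm import SVC

# Two illustrative numeric features per sample.
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class 0
     [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]]   # class 1
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin hyperplane (a line in 2-D).
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)        # the extreme points that define the margin
print(clf.predict([[4.0, 4.0]]))   # classify a new point
```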
1.4.3 RANDOM FOREST
Random forest is a machine learning algorithm used for both classification and regression tasks.
It is a type of ensemble learning method that combines multiple decision trees to create a more
accurate and robust model. The algorithm works by building multiple decision trees using a
subset of the data and a subset of the features at each node of the tree. Each decision tree is
trained on a random subset of the data, and each node of the tree considers only a subset of the
available features. This process creates a diverse set of decision trees that can collectively
provide a more accurate and stable prediction. Once the trees are built, the algorithm combines
their predictions through a process called ensemble learning. In classification tasks, the most
common prediction among the decision trees is selected as the final output, while in regression
tasks, the average prediction of the decision trees is taken as the final output.
Random forest has several advantages over other machine learning algorithms. It can handle
large datasets with high dimensionality and can also handle missing values and noisy data. It is
also less prone to overfitting than other algorithms because of its built-in randomness.
Example: An example of using the random forest algorithm is in predicting whether a customer
will buy a product or not based on their demographic and purchase history. The algorithm would
use a dataset that includes information such as age, gender, income, past purchases, and other
relevant factors. The random forest algorithm would build multiple decision trees using subsets
of this data and combine the predictions of each tree to make a final prediction about whether
the customer is likely to buy the product or not. This prediction can then be used to inform
marketing strategies or other business decisions.
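The customer-purchase example above can be sketched in code with scikit-learn's RandomForestClassifier. The column meanings and the tiny table below are assumptions made only for illustration.

```python
# Random forest sketch on an invented customer table.
from sklearn.ensemble import RandomForestClassifier

# Features per customer: [age, income_in_thousands, past_purchases]
X = [
    [25, 30, 1],
    [40, 80, 5],
    [35, 60, 3],
    [50, 120, 8],
    [23, 25, 0],
    [45, 90, 6],
]
y = [0, 1, 1, 1, 0, 1]   # 1 = bought the product, 0 = did not

# 100 trees, each trained on a bootstrap sample with random feature subsets at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[30, 50, 2]]))        # majority vote of the trees
print(forest.predict_proba([[30, 50, 2]]))  # class probabilities
```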
All over the world, chronic diseases are a critical issue in the healthcare domain. According to
medical reports, the death rate of humans increases due to chronic diseases, and the
treatments given for these diseases consume over 70% of the patient’s income. Hence, it is highly
essential to minimize the patient’s risk factors that lead to death. The advancement in medical
research makes health-related data collection easier. The healthcare data includes the
demographics, medical analysis reports, and the disease history of the patient. The diseases
caused can vary based on the region and the living habits in that region. Hence, along
with the disease data, the environmental conditions and the living habits of the patient should
also be recorded in the data set.
In recent years, the healthcare domain has been evolving due to the integration of information
technology (IT) into it. The intention of integrating IT into healthcare is to make the life of an
individual more affordable and comfortable, just as smartphones have made one’s life easier. This can
be achieved by making healthcare intelligent, for instance through the invention of the smart
ambulance, smart hospital facilities, and so on, which help patients and doctors in several
ways. A yearly study of patients affected by chronic diseases in a specified region found that the
difference between patients by gender is very small and that the largest number of patients were
admitted in the year 2014 for treatment of chronic diseases. The use of both structured and
unstructured data provides more accurate results than using only structured data, since the
unstructured data includes the doctor’s records on the patients, along with the patients’ symptoms
and grievances explained by the patients themselves; this is an added advantage when used along
with the structured data that consists of the patient demographics, disease details, living habits, and
laboratory test results. It is difficult to diagnose rare diseases; hence, the use of self-reported
behavioural data helps differentiate individuals with rare diseases from those with common chronic
diseases. By using machine learning approaches along with questionnaires, it is believed that
the identification of rare diseases is highly possible.
1.5.1 DIFFERENT APPROACHES
Table 1.1 Table of Research Work
M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, and C. Youn proposed the Wearable 2.0 system, in which
smart washable clothing is designed to improve the QoE and QoS of the next-generation healthcare
system. Chen designed a new IoT-based data collection system built around sensor-based smart
washable clothing. Using this clothing, doctors can capture the patient’s physiological condition, and
that physiological data is used for further analysis. The washable smart clothing mainly consists of
multiple sensors, wires, and electrodes; with the help of these components and a cloud-based system,
the user can collect the patient’s physiological condition as well as emotional health status
information. The authors also discuss the issues faced while designing the Wearable 2.0
architecture. The issues in the existing system include physiological data collection, negative
psychological effects, anti-wireless behaviour of the body area network, and sustainable big
physiological data collection. Multiple operations are performed on the collected data, such as
analysis, monitoring, and prediction. The authors further classify the functional components of the
smart clothing representing Wearable 2.0 into categories such as sensor integration,
electrical-cable-based networking, and digital modules. Many applications are discussed, such as
chronic disease monitoring, care of elderly people, and emotion care.
B. Qian, X. Wang, N. Cao, H. Li, and Y.-G. Jiang [2] designed an Alzheimer’s disease risk
prediction system using the EHR data of the patient. They utilized an active learning
context to solve a real problem suffered by patients: an active patient risk model was
built, and an active risk prediction algorithm was used to estimate the risk of Alzheimer’s disease.
Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri [4] designed a cloud-based
Health-CPS system that manages huge amounts of biomedical data. Y. Zhang discussed the large
growth of data in the medical field: the data is created within a short amount of time and is stored
in different formats, which is the essence of the big data problem. The Health-CPS system relies on
two technologies, cloud computing and big data. The system performs numerous operations in the
cloud, such as data analysis, monitoring, and prediction. With the help of this system, a person gains
more insight into how to handle and manage huge amounts of biomedical data in the cloud. The
system considers three layers: a data collection layer, a data management layer, and a data-oriented
service layer. The data collection layer stores data in a particular standard format, and the data
management layer is used for distributed storage and parallel computing. Multiple operations are
thus performed with the help of the Health-CPS system, and many healthcare-related services can
be offered through it.
L. Qiu, K. Gai, and M. Qiu proposed a telehealth system and discussed how to handle a large
amount of hospital data in the cloud. In this paper the authors propose advances in the telehealth
system, which is mainly based on sharing data among all the telehealth services over the
cloud. Data sharing on the cloud, however, faces issues such as network capacity and
virtual-machine switching. The paper proposes a cloud data-sharing approach for better sharing of
data and designs an optimal telehealth sharing model. With this model the authors focus on
transmission probability, network capabilities, and timing constraints, and they introduce a new
optimal big data sharing algorithm through which users obtain an optimal solution for handling
biomedical data.
Ajinkya Kunjir, Harshal Sawant, and Nuzhat F. Shaikh [6] proposed a clinical decision-making
system which predicts disease on the basis of the historical data of patients. The system predicts
multiple diseases and unseen patterns in the patient’s condition, and it is designed for accurate
disease prediction from historical data, also addressing the multiple-disease concept and unseen
patterns. 2D/3D graphs and pie charts are used for visualization.
• Fitbit: this sensor is used to keep track of health and has features for measuring
pulse rate, blood pressure, and calories burned. After this study, we concluded that we would use a
Fitbit to collect the data, as it is easily available and less expensive, and Health Gear for all the
other parameters.
• S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid
machine learning techniques,” IEEE Access, vol. 7, 2019.
• “Heart disease prediction and classification using machine learning algorithms optimized
by particle swarm optimization and ant colony optimization,” Int. J. Intell. Eng. Syst.,
vol. 12, no. 1, 2019.
• A. Mir and S. N. Dhage, 2018 Fourth International Conference on Computing
Communication Control and Automation (ICCUBEA), pp. 1–6, 2018.
• M. Maniruzzaman, M. J. Rahman, B. Ahammed, and M. M. Abedin, “Classification and
prediction of diabetes disease using machine learning paradigm,” Health Information
Science and Systems, vol. 8, no. 1, 2020.
CHAPTER- 2
HARDWARE & SOFTWARE REQUIREMENT
2.1 Software
Software here refers to the application software that will be utilized during the development
of the project and its implementation.
2.1.1 Android Studio
Android Studio is the official integrated development environment for Google's Android
operating system, built on JetBrains' IntelliJ IDEA software and designed specifically for
Android development. It is available for download on Windows, macOS and Linux based
operating systems.
Android Studio provides a unified environment where you can build apps for Android phones,
tablets, Android Wear, Android TV, and Android Auto. Structured code modules allow you to
divide your project into units of functionality that you can independently build, test, and debug.
2.1.2 VS Code
Visual Studio Code, also commonly referred to as VS Code, is a source-code editor made by
Microsoft with the Electron Framework, for Windows, Linux and macOS. Features include
support for debugging, syntax highlighting, intelligent code completion, snippets, code
refactoring, and embedded Git.
Visual Studio Code is a streamlined code editor with support for development operations like
debugging, task running, and version control. It aims to provide just the tools a developer needs
for a quick code-build-debug cycle and leaves more complex workflows to fuller featured IDEs,
such as Visual Studio IDE.
2.1.4 Octave
Octave is highly extensible, as it uses dynamically loadable modules. It uses an interpreter
to execute the Octave scripting language, and the interpreter has OpenGL-based graphics for
creating plots, graphs, and charts and for saving and printing them. It also includes a graphical
user interface in addition to the traditional command-line interface. It is a high-level
programming language mainly used for numerical computing. It was developed by John W.
Eaton and first released in the early 1990s. It was written in C, C++, and Fortran.
Octave code mainly consists of function calls or scripts. Its syntax is mainly matrix-based and provides
various functions for matrix operations. Octave supports various data structures and object-
oriented programming. It has great features and is largely compatible with MATLAB in terms of
syntax and functionality. It shares other features such as built-in support for complex numbers,
powerful built-in math functions, extensive function libraries, and support for user-defined
functions.
2.2 Hardware
• CPU: Intel Core 2 Quad CPU Q6600 @ 2.40GHz (4 CPUs) / AMD Phenom 9850 Quad-
Core Processor (4 CPUs) @ 2.5GHz
• RAM: 4 GB
• OS: Windows 10 64 Bit, Windows 8.1 64 Bit, Windows 8 64 Bit, Windows 7 64 Bit Service
Pack 1, Windows Vista 64 Bit Service Pack 2*
• VIDEO CARD: 32 MB Direct3D Video Card
• FREE DISK SPACE: 10GB
• DEDICATED VIDEO RAM: 1 GB
• CPU SPEED: 700 MHz
CHAPTER- 3
SOFTWARE DEVELOPMENT LIFE CYCLE
SDLC or the Software Development Life Cycle is a process that produces software with the
highest quality and lowest cost in the shortest time possible. SDLC provides a well-structured
flow of phases that help an organization quickly produce high-quality software that is well-tested
and ready for production use.
3.1.1 Waterfall Model
The waterfall is a universally accepted SDLC model. In this method, the whole process of
software development is divided into various phases. The waterfall model is a continuous
software development model in which development is seen as flowing steadily downwards (like
a waterfall) through the steps of requirements analysis, design, implementation, testing
(validation), integration, and maintenance.
3.1.2 RAD Model
RAD, or Rapid Application Development, is an adaptation of the waterfall model; it targets
developing software in a short period. The RAD model is based on the concept that a better
system can be developed in less time by using focus groups to gather system requirements. The
model has the following phases:
o Business Modeling
o Data Modeling
o Process Modeling
o Application Generation
o Testing and Turnover
Figure 3.3 RAD Model
3.1.3 Spiral Model
The spiral model is a risk-driven process model. This SDLC model helps the team adopt
elements of one or more process models, such as waterfall, incremental, etc. The spiral
technique is a combination of rapid prototyping and concurrency in design and development
activities. Each cycle in the spiral begins with the identification of objectives for that cycle, the
different alternatives that are possible for achieving the goals, and the constraints that exist.
This is the first quadrant of the cycle (upper-left quadrant).
The next step in the cycle is to evaluate these different alternatives based on the objectives and
constraints. The focus of evaluation in this step is based on the risk perception for the project.
The next step is to develop strategies that solve uncertainties and risks. This step may involve
activities such as benchmarking, simulation, and prototyping.
Figure 3.4 Spiral Model
3.1.4 V-Model
In this type of SDLC model, the testing and development steps are planned in parallel. So,
there are verification phases on one side and validation phases on the other side. The V-Model
is joined by the coding phase.
3.1.5 Incremental Model
The incremental model is not a separate model. It is necessarily a series of waterfall cycles. The
requirements are divided into groups at the start of the project. For each group, the SDLC model
is followed to develop software. The SDLC process is repeated, with each release adding more
functionality until all requirements are met. In this method, each cycle act as the maintenance
phase for the previous software release. Modification to the incremental model allows
development cycles to overlap. After that subsequent cycle may begin before the previous cycle
is complete.
Figure 3.7 Agile Model
3.1.8 Big Bang model
Big bang model is focusing on all types of resources in software development and coding, with
no or very little planning. The requirements are understood and implemented when they come.
This model works best for small projects with smaller size development team which are
working together. It is also useful for academic software development projects. It is an ideal
model where requirements are either unknown or final release date is not given.
3.1.9 Prototype Model
The prototyping model starts with requirements gathering. The developer and the user meet
and define the purpose of the software, identify the needs, and so on.
A 'quick design' is then created. This design focuses on those aspects of the software that will
be visible to the user, and it leads to the development of a prototype. The customer then checks
the prototype, and any modifications or changes that are needed are made to it.
Looping takes place in this step, and better versions of the prototype are created. These are
continuously shown to the user so that any new changes can be incorporated into the prototype. This
process continues until the customer is satisfied with the system. Once the user is satisfied, the
prototype is converted into the actual system with all considerations for quality and security.
3.2 How it is effective in our model
Initially, we take the disease dataset from the UCI machine learning repository, which is in the
form of a disease list with its symptoms. Preprocessing is then performed on the dataset for
cleaning, that is, removing commas, punctuation, and white spaces, and the result is used as the
training dataset. After that, features are extracted and selected. Then we classify the data using
classification techniques such as KNN and SVM. Based on machine learning, we can predict the
disease accurately.
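A hedged sketch of that workflow is shown below. It assumes a hypothetical CSV file (disease_symptoms.csv) with binary symptom columns and a disease label column; the file name and column layout are assumptions that mirror the UCI-style dataset described above, not the project's exact file.

```python
# Sketch of the train/evaluate workflow described above.
# Assumes a hypothetical CSV with binary symptom columns and a 'disease' label.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

data = pd.read_csv("disease_symptoms.csv")       # hypothetical file name
data.columns = data.columns.str.strip()          # basic clean-up of header whitespace

X = data.drop(columns=["disease"])               # symptom features
y = data["disease"]                              # disease label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

for name, model in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                    ("SVM", SVC(kernel="linear"))]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, preds))
```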
An SVM takes these data points and outputs the hyperplane, which in two dimensions is simply
a line, that best separates the tags. This line is the decision boundary: anything falling on one
side of it will be classified as yellow, and anything on the other side as blue.
The above was easy, since the data was linearly separable and a straight line can be drawn to
separate yellow and blue. However, in real scenarios, cases are usually not this simple. Consider
the following case: there is no linear decision boundary. The vectors are, however, very clearly
segregated, and it seems as if it should be easy to separate them.
In this case, we will add a third dimension. Up until now we have worked with two dimensions,
x and y; a new z dimension is introduced and is calculated in a way that is convenient,
z = x² + y² (the equation of a circle). Taking a slice of this three-dimensional space, and noting
that we are now in three dimensions, the hyperplane is a plane parallel to the x-axis at a
particular value of z, say z = 1. Mapping this back to two dimensions, the decision boundary is a
circumference of radius 1 that separates both tags using the SVM. Calculating the transformation
can get computationally expensive: one may deal with many new dimensions, each possibly
involving a complicated calculation, so doing this for every vector in the dataset would be a lot
of work.
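The z = x² + y² construction can be made concrete in code. The sketch below generates a synthetic ring-shaped dataset, adds the z feature by hand, and fits a linear SVM in the lifted space; in practice the same effect is obtained by passing kernel="rbf" (or a polynomial kernel) to SVC, which avoids computing the transformation explicitly. The data here is invented purely for illustration.

```python
# Kernel-trick sketch: lift 2-D circular data into 3-D with z = x^2 + y^2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Inner cluster (class 0) and surrounding ring (class 1): not linearly separable in 2-D.
inner = rng.normal(scale=0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
ring = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(scale=0.2, size=(50, 2))

X = np.vstack([inner, ring])
y = np.array([0] * 50 + [1] * 50)

# Explicit lift: add z = x^2 + y^2 as a third feature; a plane now separates the classes.
z = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, z])
print(SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y))

# Same idea without the manual lift: let a kernel do the work implicitly.
print(SVC(kernel="rbf").fit(X, y).score(X, y))
```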
CHAPTER-4
SOFTWARE REQUIREMENT SPECIFICATION & ANALYSIS
This project gives a procedural approach to how a patient gets treatment, details about the date of
treatment, and finally how billing is calculated depending on different criteria such as the room
allocated, lab reports, and the treatment and medicines taken. The healthcare facility is also
considered during billing. At the same time, this project deals with the different mediums that can
be built between different stakeholders; for example, a website between customers and the hospital
is crucial to give the customers a better outlook. For the development of the website, we also need
web developers. Hence, how all the stakeholders are related to each other is enumerated in this
document.
4.2.1. Output:
An accurate prediction of the disease from the symptoms, and fixing an appointment.
The user must be aware of the symptoms the person is having and must report them accurately.
4.2.2. Steps:
• Step 1: The user inputs a series of answers to some basic questions along with symptoms.
• Step 2: The data is processed and matched with a disease profile.
• Step 3: The disease is given a ranking according to its severity, and the user is prompted to make a
doctor’s appointment.
• Step 4: The doctor meets with the patient and recommends medicine and a pharmacy.
• Step 5: The user gets a medical certificate and medicines and is treated for the disease.
• Post-condition: The user must attend a re-appointment.
• Exceptional Scenario 1: If the severity of the disease is not high, then an appointment is not required.
Figure 4.1 system development phase
4.4.4 Security Requirements
For security reasons, it is recommended that the patient's name and identification be given
correctly, as a proper certificate would be issued if a dire diagnosis occurs. The certificate would be
issued digitally, and the credentials would be taken from what the user has input. Incorrect
authentication can lead to problems with the legal system; the software is not responsible for such
an event.
CHAPTER – 5
RISK ASSESSMENT
Risk prediction models are used to estimate the likelihood of developing a certain outcome. In
the clinical domain, the outcomes are often adverse clinical events, including the development of
disease, complications, severe outcomes, hospital readmission, and so on. A model can predict
risk for one or multiple outcomes. A previous report suggests that a useful outcome should satisfy
the following criteria: (1) it is undesired by the person experiencing it; (2) it is significant to the
health service (cost-effective); (3) it is preventable; and (4) it is captured in routine administrative
data.
Prediction timepoint. The prediction timepoint defines when the model is intended to be used.
While some models can be applied widely throughout adulthood, some apply only to people at a
certain time point. For example, Rao et al. built a model to predict bleeding complications after
patients undergo percutaneous coronary intervention.
Observation window. The observation window is the period in which the predictors are defined
and collected.
Prediction window. The prediction window is the period in which the outcomes are observed. It
should be defined based on the outcome: while some outcomes happen shortly after the prediction
timepoint, others take years to develop. For example, in predicting hospital readmission, the
prediction window ranges from 30 days if the reason for the current admission is a severe disease
to 12 months if the reason is a less severe or general condition.
Algorithm. While the concepts above define the prediction problem, the algorithm solves the
defined problem. Two necessary components are algorithms for feature engineering and
algorithms to build the risk equation, as elaborated below.
5.1 Evaluation metrics
Many evaluation metrics have been proposed to reflect different aspects of model performance.
As no model can achieve the best performance in every aspect, how a model is evaluated should
be determined based on the aspect of care.
5.1.2 Pre-processing.
Common pre-processing steps include outlier removal, missing-data imputation,
standardization, normalization, transformation, encoding, discretization, and so on.
Feature construction and feature selection. Both serve the purpose of creating meaningful feature
sets, but they act in different directions: feature construction aims at generating new features from
existing ones, to inject domain knowledge or to increase feature complexity, while feature
selection aims at reducing the number of features, to reduce model complexity and increase model
performance.
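A minimal sketch of these pre-processing and feature-selection steps, wrapped in a scikit-learn Pipeline, is given below. The imputation strategy, scaler, number of selected features, and the synthetic data are all illustrative assumptions rather than a prescribed recipe.

```python
# Sketch: pre-processing (imputation, standardization) plus feature selection
# chained in a single scikit-learn pipeline with a downstream classifier.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Tiny synthetic matrix with a missing value (np.nan) to show imputation.
X = np.array([[1.0, 200.0, 3.0],
              [2.0, np.nan, 1.0],
              [1.5, 220.0, 2.5],
              [8.0, 900.0, 9.0],
              [7.5, 870.0, 8.0],
              [9.0, 950.0, 9.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing values
    ("scale", StandardScaler()),                    # standardize features
    ("select", SelectKBest(f_classif, k=2)),        # keep the 2 most informative features
    ("clf", LogisticRegression()),
])
model.fit(X, y)
print(model.predict([[2.0, 210.0, 2.0]]))
```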
5.1.3 Risk equation.
A risk equation takes engineered features and computes a risk score, which is always achieved
through model fitting or a learning process.
Single task. Most risk equations estimate the risk of a single outcome (a single task).
5.2 Classification
The most common class of risk prediction problems is the binary classification problem.
Model prediction performance can be evaluated using various metrics, including those from
traditional aspects like discrimination and calibration, and from novel aspects like
reclassification and clinical usefulness. Here we introduce a few metrics; more information
can be found in previous reports.
Discrimination measures how well a model can discriminate between cases that experienced the
outcome and those that did not. AUROC (also known as ROC, C-statistic, or C-index) and
AUPRC measure the overall performance of a model. Accuracy, sensitivity (recall),
specificity, and precision measure model performance at a certain threshold (based on which each
sample is assigned as predicted to ‘will have the outcome’ or ‘will not have the outcome’).
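For illustration, the snippet below computes several of these discrimination metrics with scikit-learn on invented labels and risk scores; sensitivity and specificity are derived from the confusion matrix, since scikit-learn does not expose a direct specificity helper. The threshold of 0.5 is an arbitrary choice for the example.

```python
# Sketch: discrimination metrics on invented labels and predicted risk scores.
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             accuracy_score, precision_score,
                             recall_score, confusion_matrix)

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                       # 1 = experienced the outcome
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.6]     # predicted risk scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]        # threshold at 0.5

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Sensitivity (recall):", recall_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Specificity:", tn / (tn + fp))
```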
Calibration measures how close the predicted risk score is to the actual probability of
developing the outcome. Risk equations from classification methods do not always have good
calibration, especially in external validation cohorts, while those from survival analysis tend to
have better calibration. Common metrics include calibration curves and the Hosmer–Lemeshow test.
Model interpretability is a measure of how well a model can be understood by human beings.
Model interpretability reflects how well the mechanism of the model can be interpreted: in a
logistic regression model, the odds ratio shows both the direction and the scale of a feature’s
impact on the outcome, and a decision tree model can be easily interpreted as a set of decision
rules, whereas artificial neural networks are notorious for their poor model interpretability due to
their black-box nature. Result interpretability reflects how well the risk score from a risk model
can be explained to a human-understandable degree. Model interpretability leads to, but is not
a prerequisite for, result interpretability.
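As a small illustration of the logistic-regression point above, the sketch below fits a logistic regression on synthetic data and converts its coefficients to odds ratios; the feature names are invented placeholders, not variables from the project.

```python
# Sketch: interpreting a logistic regression through odds ratios.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two invented predictors, e.g. "age_decades" and "smoker".
X = np.array([[3, 0], [4, 0], [5, 1], [6, 1], [7, 1], [4, 1], [5, 0], [6, 0]])
y = np.array([0, 0, 1, 1, 1, 1, 0, 1])      # 1 = developed the outcome

model = LogisticRegression().fit(X, y)

# exp(coefficient) gives the odds ratio: >1 increases the odds, <1 decreases them.
for name, coef in zip(["age_decades", "smoker"], model.coef_[0]):
    print(f"{name}: odds ratio = {np.exp(coef):.2f}")
```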
The prediction workflow runs from the input data, through the algorithm, and finally to the
output, a risk report. A report typically consists of four components: risk score, risk factors,
risk level, and intervention.
CHAPTER 6
DFD AND APPLICATION ARCHITECTURE
DFD levels are numbered 0, 1, or 2, and occasionally go to level 3 or beyond. The necessary
level of detail depends on the scope of what you are trying to accomplish.
6.1 Zero Level DFD:
* Managing all the Symptoms
* Managing all the Diseases
* Managing all the Doctors
6.2 First Level DFD:
* Processing Report records and generating a report of all records.
The level 2 DFD then goes one step deeper into parts of level 1 of the Disease Prediction System;
it may be required for more detailed functionality of the system. The level 2 DFD contains more
details of Report, Doctors, Test, Medicines, and Patient.
Low-level functionalities of the Disease Prediction System:
* Admin logs in to the system and manages all the functionalities of the system.
* Admin can add, edit, delete, and view the records of patients, diseases, reports, and doctors.
* Admin can search and access the details of all the entities.
* Admin can track all the information of doctors and patients.
CHAPTER-7
PROJECT MODULES DESIGN
The whole project is divided into two parts i.e. the Machine learning model and User Interface
and they can be elaborated as:
7.1 Machine learning Model: -
Data mining techniques are used in the project to see whether the dataset is good for prediction
or not. The various data mining libraries used in the project are:
• 1. Scipy: This is used for scientific computing in the Python programming language. It is a
collection of mathematical algorithms and convenience functions built on Numpy. Some of the
functionalities it provides are special functions (special), integration (integrate), optimization
(optimize), interpolation (interpolate), Fourier transforms (fft), signal processing (signal),
linear algebra (linalg), statistics (stats), file IO (io), etc. In this project, the stats (statistics)
module of this package is primarily used.
• 2. Sklearn: This stands for scikit-learn and is built on the Scipy package. It is the primary
package used in this project. It provides an interface for supervised and unsupervised learning
algorithms. The groups of models provided by sklearn include clustering, cross-validation,
datasets, dimensionality reduction, ensemble methods, feature extraction, feature selection,
parameter tuning, manifold learning, and supervised models.
• 3. Numpy : It is a library for the Python programming language, adding support for
multidimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. It provides functions for Array
Objects, Routines, Constants, Universal Functions, Packaging etc. In this project it is
used for performing multi-dimensional array operations.
• 4. Pandas: This library provides high-performance, easy-to-use data structures and data
analysis tools for the Python programming language. It provides functionalities such as table
manipulation, creating plots, calculating summary statistics, reshaping tables, combining data
from tables, handling time series data, manipulating textual data, etc. In this project it is used
for reading CSV files, comparing the null and alternative hypotheses, etc.
• 5. Matplotlib: It is a library for creating static, animated, and interactive visualizations in the
Python programming language. In this project it is used for creating simple plots and subplots,
and its objects are used alongside the seaborn objects to employ functions such as show, grid,
etc. The %matplotlib inline magic command is also used so that plots are rendered right below
the cells that create them.
• 6. Seaborn: It provides an interface for making graphs that are more attractive and interactive
in nature. It is based on the matplotlib module. These graphs can be dynamic and are much
more informative and easier to interpret. It provides different presentation formats for data,
such as relational, categorical, distribution, regression, and multiples, along with styling and
colours for all these types. In this project it is used for creating complex plots that use various
attributes.
• 7. Warnings: This module is used for handling any warnings that may arise when the program
is running. The Warning class is a subclass of Exception.
• 8. Stats: This library is used to incorporate statistics functionality in the Python programming
language. It is included in the scipy package. The library is not used directly; rather, the
required functions are imported as and when required, e.g. for measures of central tendency
and measures of variability. The functions used cover simple concepts like mean, median,
mode, variance, percentiles, skew, kurtosis, range, and cumulative statistics.
• 9. Model selection: This library helps in choosing the best model. It is present in the sklearn
package. It provides functions such as train_test_split, which splits the data into training and
test datasets and thus helps in assessing the accuracy of the model, and cross_val_score, which
computes the accuracy of a model. It also includes techniques for improving the accuracy of a
model, such as the KFold cross-validation used in this project, and it works together with
linear_model, SVM, etc.
• 10. Naive Bayes: It is a library built for implementing the Naive Bayes algorithm. It is also
defined in the sklearn package. In this project the multinomial variant is used, and it is one of
the most crucial algorithms in the project.
• 11. Tree: It is the library that comprises all the functionality and concepts associated with
trees. It is included in the sklearn package. The most important algorithm included in this
library is the Decision Tree Classifier, which gives very high accuracy and is one of the most
used algorithms for projects like this.
• 12. Linear model: It is the library that implements the regression algorithms. It is also included
in the sklearn package.
• 13. Ensemble: It is the library that includes the ensemble methods. It is defined in the sklearn
package. As ensemble techniques are used to improve the accuracy of models, the Gradient
Boosting Classifier and the Random Forest Classifier (a very important algorithm that provides
very high accuracy) are used in this project.
• 14. Metrics: It is the library used for reporting the accuracy of the model; the accuracy_score
function is the most basic of them all. It is included in the sklearn package.
• 15. Joblib: It is a package that provides lightweight pipelining in the Python programming
language. It was historically included in the externals package of the sklearn main package. It
provides transparent disk caching of output values, easy simple parallel computing, and logging
and tracing of execution. In this project it is used to persist the model so that the application
can interact with the user and perform operations accordingly.
• All these libraries are used to create a model with the help of the dataset. The model created
by applying all these data mining techniques is stored as a binary file so that the model is secure
from modifications and other security threats. The binary file of the model is not opened
directly, as it does not have a conventional extension. This binary file is created with the help
of the joblib library and is used when serving predictions in the application; a short sketch of
this workflow is given after this list.
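A hedged sketch of how such a model file might be created and reloaded is shown below. The file name, the chosen classifier, and the toy data are illustrative assumptions, not the project's exact code; note that newer scikit-learn versions expect the standalone joblib package rather than sklearn.externals.joblib.

```python
# Sketch: train a classifier, persist it as a binary file with joblib, and reload it.
import joblib
from sklearn.ensemble import RandomForestClassifier

X = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]   # toy symptom flags
y = ["flu", "cold", "flu", "cold"]                  # toy labels

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Serialize the fitted model to a binary file (hypothetical name).
joblib.dump(model, "disease_model.bin")

# Later (e.g. behind the user interface), load it back and predict.
loaded = joblib.load("disease_model.bin")
print(loaded.predict([[1, 0, 1]]))
```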
7.2 User Interface
The UI is developed with the help of React and React Native. The ML model developed with
the use of data mining techniques and the ML algorithms is used in the UI.
The various modules which are included in the project are:
1: Creating a Linux Elastic Compute Cloud (EC2) instance on AWS.
2: Creating a PostgreSQL RDS instance for the database on AWS.
3: Downloading the project onto the EC2 instance and creating a suitable Python 3 environment with
Django installed in it.
4: Downloading and installing pgAdmin and accessing the remote RDS database on the local
machine.
5: Creating the migrations and launching the project.
6: Accessing the project in a web browser via the public IP of the EC2 instance.
7: Creating a user account and predicting the disease.
All these modules are required to make the project work properly and fluently; if any of these
modules is missing, the project will not work properly. These module steps include the tasks to
be done for accessing the UI of the project, and the ML model is implemented in this UI for the
prediction of disease.
CHAPTER – 8
TESTING, TRAINING AND EVALUATION
8.1 Testing Methodologies
The program comprises several algorithms, which are tested individually for accuracy. We then
check the correctness of the program as a whole and how it performs.
• Migration Testing - Migration testing is done to ensure that the software can be moved
from older system infrastructures to current system infrastructures without any issues.
Figure 8.4 Joint pain training results
Figure 8.6 Shivering training results
Figure 8.8 Overall system training results
CHAPTER-9
PROJECT SNAPSHOTS
Figure 9.4 Initial detection Page
Figure 9.6 Disease based on set of Symptoms 1
Figure 9.8 Disease based on set of Symptoms 2
9.2 DATASETS
CHAPTER – 10
LIMITATIONS OF THE PROJECT
Disease prediction systems using machine learning are becoming increasingly popular due to
their ability to analyse vast amounts of data and provide accurate predictions. However, like
any technology, these systems have limitations that should be taken into account. In this chapter,
we discuss some of the most significant limitations of disease prediction systems using
machine learning.
10.3 Overfitting:
Overfitting occurs when a machine learning model becomes too complex and starts to fit the
training data too closely. This can lead to the model being unable to generalize to new data,
which can result in inaccurate predictions. Overfitting can be particularly problematic in disease
prediction systems because the model may be trained on a particular population or set of data,
making it less accurate when applied to other populations.
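One common, simple way to spot the overfitting described above is to compare training accuracy with cross-validated accuracy; a large gap suggests the model is memorizing the training data. The sketch below illustrates this on synthetic data generated with scikit-learn and is only an assumption-laden example, not a procedure taken from the project.

```python
# Sketch: detect overfitting by comparing training accuracy with cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# An unconstrained tree tends to fit the training set almost perfectly.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Training accuracy:", tree.score(X, y))                  # typically ~1.0

# Cross-validation estimates performance on unseen data.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Cross-validated accuracy:", cv_scores.mean())           # usually noticeably lower
```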
10.4 Lack of Interpretability:
Many machine learning models act as black boxes, which is particularly problematic in
healthcare, where it is essential to understand how a prediction was made to ensure that the
correct diagnosis and treatment are given. Lack of interpretability can also make it difficult to
identify any errors or biases in the model.
CHAPTER – 11
FUTURE SCOPE OF THE PROJECT
The future of disease prediction systems using machine learning is promising, with many
opportunities for further development and expansion. With the increasing availability of
healthcare data, improved algorithms, and better computing power, disease prediction systems
using machine learning are poised to become an increasingly valuable tool for healthcare
professionals.
One of the most significant potential benefits of disease prediction systems is their ability to
identify diseases at an early stage. Early detection can be critical in the successful treatment of
many diseases, and machine learning algorithms can help identify patterns and risk factors that
may not be immediately apparent to human clinicians. With the integration of wearable
technology and other IoT devices, these systems can also continuously monitor patients,
enabling early detection of potential health issues and prompt interventions.
Machine learning algorithms can also help identify risk factors and predict disease outcomes
for individual patients. By analysing vast amounts of data from electronic health records,
genomics, and other sources, these systems can provide personalized recommendations for
prevention and treatment, leading to better patient outcomes and reduced healthcare costs.
Another potential area of growth for disease prediction systems using machine learning is in
the field of precision medicine. By analysing genomic data, machine learning algorithms can
help identify specific genetic markers that may indicate an increased risk of certain diseases.
This can lead to more targeted and personalized treatments, reducing the risk of adverse effects
and improving outcomes.
The use of machine learning algorithms in disease prediction can also help improve public
health efforts. By analysing data from a large population, these systems can help identify
patterns and trends in disease prevalence, aiding in the development of public health
interventions and policies.
Furthermore, machine learning algorithms can help optimize clinical trials by identifying
patients who are most likely to benefit from a particular treatment, reducing the cost and time
required for drug development.
In the future, disease prediction systems using machine learning may also be integrated with
other emerging technologies such as blockchain and decentralized databases. These
technologies can help improve the security and privacy of patient data, while also enabling
seamless data sharing between healthcare providers and researchers.
Overall, the future of disease prediction systems using machine learning is bright, with many
opportunities for further development and expansion. As the technology continues to evolve
and improve, we can expect to see more accurate predictions, better personalized treatments,
and improved public health outcomes. However, it is essential to address the ethical and privacy
concerns surrounding the use of these systems to ensure that they are implemented in a way
that maximizes their benefits while minimizing potential risks.
CONCLUSION
In conclusion, disease prediction systems using machine learning have the potential to
revolutionize the field of healthcare. These systems can analyse vast amounts of data, identify
risk factors, and provide personalized recommendations for prevention and treatment. They can
help identify diseases at an early stage, leading to more successful treatment and better patient
outcomes. Additionally, they can aid in the development of public health interventions and
policies, leading to improved health outcomes for populations.
However, the implementation of these systems must be approached with caution. It is important
to address ethical and privacy concerns and ensure that the use of these systems is transparent
and secure. Moreover, the reliability and accuracy of these systems need to be validated through
rigorous testing and evaluation to prevent false positives or false negatives.
The future of disease prediction systems using machine learning is promising, and we can
expect to see more accurate predictions and improved personalized treatments as the technology
continues to evolve. The integration of other emerging technologies, such as blockchain and
decentralized databases, can help improve data security and privacy, leading to more seamless
data sharing between healthcare providers and researchers.
Ultimately, disease prediction systems using machine learning have the potential to transform
healthcare and improve patient outcomes. With careful consideration and continued
development, we can harness the power of this technology to create a healthier and more
equitable future.
REFERENCES
[2] B. Qian, X. Wang, N. Cao, H. Li, and Y.-G. Jiang, “A relative similarity-based method for
interactive patient risk prediction,” Springer Data Mining and Knowledge Discovery, vol. 29,
no. 4, pp. 1070–1093, 2015.
[3] M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, and C. Youn, “Wearable 2.0: Enabling human-cloud
integration in next generation healthcare systems,” IEEE Communications Magazine, vol. 55,
no. 1, pp. 54–61, Jan. 2017.
[5] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid
machine learning techniques,” IEEE Access, vol. 7, 2019.
[6] “Heart disease prediction and classification using machine learning algorithms optimized by
particle swarm optimization and ant colony optimization,” Int. J. Intell. Eng. Syst., vol. 12,
no. 1, 2019.
[9] M. Chen, Y. Hao, K. Hwang, L. Wang, and L. Wang, “Disease prediction by machine learning
over big data from healthcare communities,” 2017.
[10] C. Beyene and P. Kamat, “Survey on prediction and analysis of the occurrence of heart
diseases using data mining techniques,” International Journal of Pure and Applied
Mathematics, 2018.
[11] P. Groves, B. Kayyali, D. Knott, and S. V. Kuiken, “The big data revolution in healthcare:
Accelerating value and innovation,” 2016.
[14] D. A. Davis, N. V. Chawla, N. Blumm, N. Christakis, and A.-L. Barabási, “Predicting
individual disease risk based on medical history,” 2008.
[15] S. Adam and A. Parveen, “Prediction system for heart disease using Naive Bayes,” 2012.
[16] K. Al-Aidaroos, A. Bakar, and Z. Othman, “Medical data classification with Naive Bayes
approach,” Information Technology Journal, 2012.
[17] D. A. Davis et al., “Predicting individual disease risk based on medical history,” 2008.
[18] J. Soni, U. Ansari, D. Sharma, and S. Soni, “Predictive data mining for medical diagnosis:
An overview of heart disease prediction,” 2011.
[20] M. A. Nisha Banu and B. Gomathy, “Disease predicting system using data mining
techniques,” 2013.