
CHAPTER - 1

INTRODUCTION

Nowadays, people face various diseases because of environmental conditions and their living
habits, so predicting disease at an early stage has become an important task. Accurate prediction
on the basis of symptoms, however, is very difficult for a doctor, and the correct prediction of
disease is the most challenging task. To overcome this problem, data mining plays an important
role in disease prediction. Medical science generates a large amount of data every year, and
because of this growing volume of medical and healthcare data, accurate analysis of medical
data benefits early patient care. With the help of disease data, data mining finds hidden pattern
information in the huge amount of medical data. We propose general disease prediction based
on the symptoms of the patient. For disease prediction we use the K-Nearest Neighbor (KNN)
and Support Vector Machine (SVM) machine learning algorithms, and a dataset of disease
symptoms is required.

1.1 OBJECTIVE
In this general disease prediction, the living habits of a person and check-up information are
considered for accurate prediction. The accuracy of general disease prediction using CNN is
83.5%, which is higher than that of the KNN algorithm, and the time and memory requirements
of KNN are also greater than those of CNN. After general disease prediction, the system is able
to give the risk associated with the general disease, indicating whether that risk is lower or higher.
Machine learning makes computers more intelligent and enables them to think. Many analysts
feel that without learning, intelligence cannot be developed. There are numerous kinds of
machine learning techniques, such as unsupervised, semi-supervised, supervised, reinforcement,
evolutionary and deep learning. These techniques are used to classify huge amounts of data
very quickly.

1.2 MOTIVATION
So we use K-Nearest Neighbor (KNN) and Support Vector Machine (SVM). Medical data is
increasing day by day, so using it to predict the correct disease is a crucial task; processing such
big data is difficult in general, so data mining plays a very important role, and classification of
large datasets using machine learning becomes much easier. It is critical to
comprehend the accurate diagnosis of patients through clinical examination and evaluation. The
quality of data associations has suffered because of improper management of the information.
The growing volume of data needs a legitimate way to extract and process information
effectively and efficiently. One of the many machine learning applications is to construct a
classifier that can separate the data based on its characteristics: the dataset is
partitioned into two or more classes. Such classifiers are used for medical data
investigation and disease prediction. Today machine learning is present everywhere, so one may
use it many times a day without knowing it.

1.3 PROBLEM DEFINITION


The problem definition of this project is to design a predictive system for disease prediction.
To attain accuracy, we use Python, machine learning algorithms, open resources and available
datasets.

1.4 METHODOLOGY
This section describes the machine learning algorithms utilized during implementation.
1.4.1 KNN algorithm
o K-Nearest Neighbor is one of the simplest machine learning algorithms, based on the
supervised learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as classification, but it is mostly
used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumptions
about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, performs an
action on the dataset.

o At the training phase the KNN algorithm just stores the dataset, and when it gets new data
it classifies that data into the category most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification we can use the
KNN algorithm, since it works on a similarity measure. Our KNN model will find the
features of the new data most similar to the cat and dog images, and based on the most
similar features it will put the image in either the cat or the dog category.
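To illustrate how K-NN could be applied to symptom data, the following minimal sketch (assuming scikit-learn is available; the symptom features and labels are invented for illustration, not taken from the project dataset) trains a KNeighborsClassifier and classifies a new patient:

    # Hypothetical sketch: K-NN on a tiny binary symptom matrix (not the project dataset).
    from sklearn.neighbors import KNeighborsClassifier

    # Each row: [fever, cough, headache, rash]; label: diagnosed disease.
    X = [[1, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 0, 0, 1],
         [1, 1, 1, 0],
         [0, 0, 1, 1]]
    y = ["flu", "migraine", "allergy", "flu", "allergy"]

    knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest neighbours
    knn.fit(X, y)                               # lazy learner: it only stores the data

    # Classify a new patient who reports fever and cough only.
    print(knn.predict([[1, 1, 0, 0]]))          # e.g. ['flu']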

Figure 1.1 KNN Classification

Figure 1.2 KNN graph representation

1.4.2 SUPPORT VECTOR MACHINE

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms and
is used for classification as well as regression problems. However, it is primarily used
for classification problems in machine learning. The goal of the SVM algorithm is to create
the best line or decision boundary that can segregate n-dimensional space into classes, so that
we can easily place a new data point in the correct category in the future. This best decision
boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating
the hyperplane. These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine. Consider the diagram below, in which there are two
different categories that are classified using a decision boundary or hyperplane:

Figure 1.3 SVM graph representation

Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We first train our model with many images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature.
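A minimal sketch of a linear SVM, again with invented toy data rather than the project dataset, could look as follows; the fitted support_vectors_ are the extreme points that define the hyperplane described above:

    # Hypothetical sketch: a linear SVM separating two classes in 2-D (toy data).
    from sklearn import svm

    # Two clusters of points, labelled 0 and 1.
    X = [[0.0, 0.0], [0.5, 0.5], [1.0, 0.5],
         [3.0, 3.0], [3.5, 2.5], [4.0, 3.5]]
    y = [0, 0, 0, 1, 1, 1]

    clf = svm.SVC(kernel="linear")   # fit the maximum-margin hyperplane
    clf.fit(X, y)

    print(clf.support_vectors_)      # the extreme points defining the hyperplane
    print(clf.predict([[2.0, 2.0]])) # classify a new point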

1.4.3 RANDOM FOREST

Random forest is a machine learning algorithm used for both classification and regression tasks.
It is a type of ensemble learning method that combines multiple decision trees to create a more
accurate and robust model. The algorithm works by building multiple decision trees using a
subset of the data and a subset of the features at each node of the tree. Each decision tree is
trained on a random subset of the data, and each node of the tree considers only a subset of the
available features. This process creates a diverse set of decision trees that can collectively
provide a more accurate and stable prediction. Once the trees are built, the algorithm combines
their predictions through a process called ensemble learning. In classification tasks, the most
common prediction among the decision trees is selected as the final output, while in regression
tasks, the average prediction of the decision trees is taken as the final output.

Random forest has several advantages over other machine learning algorithms. It can handle
large datasets with high dimensionality and can also handle missing values and noisy data. It is
also less prone to overfitting than other algorithms because of its built-in randomness.

Figure 1.4 Random Forest representation

Example: An example of using the random forest algorithm is in predicting whether a customer
will buy a product or not based on their demographic and purchase history. The algorithm would
use a dataset that includes information such as age, gender, income, past purchases, and other
relevant factors. The random forest algorithm would build multiple decision trees using subsets
of this data and combine the predictions of each tree to make a final prediction about whether
the customer is likely to buy the product or not. This prediction can then be used to inform
marketing strategies or other business decisions.
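A hedged sketch of this customer example with scikit-learn's RandomForestClassifier (the feature values below are invented placeholders, not real customer data) might look like this:

    # Hypothetical sketch: random forest on made-up customer data (age, income, past purchases).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X = [[25, 30000, 1], [40, 80000, 5], [35, 50000, 2],
         [50, 90000, 7], [23, 25000, 0], [45, 60000, 4]]
    y = [0, 1, 0, 1, 0, 1]           # 1 = bought the product, 0 = did not

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # 100 trees, each grown on a bootstrap sample with a random feature subset per split.
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))  # majority vote of the trees on held-out data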

1.5 BACKGROUND AND RELATED WORK

All over the world, chronic diseases are a critical issue in the healthcare domain. According to
medical statistics, the human death rate increases because of chronic diseases, and the treatments
given for such diseases consume over 70% of a patient's income. Hence, it is highly essential to
minimize the patient's risk factors that lead to death. Advancement in medical research makes
health-related data collection easier. Healthcare data includes demographics, medical analysis
reports, and the patient's disease history. The diseases caused can vary with the region and the
living habits in that region; hence, along with the disease data, the environmental conditions and
the living habits of the patient should also be recorded in the dataset.

In recent years, the healthcare domain has been evolving through the integration of information
technology (IT). The intention of integrating IT in healthcare is to make an individual's life more
affordable and comfortable, much as smartphones have made life easier. This is possible by
making healthcare intelligent, for instance through the invention of smart ambulances, smart
hospital facilities, and so on, which help patients and doctors in several ways. Research on
patients affected by chronic diseases in a specified region each year found that the difference
between patients in terms of gender is very small, and that the largest number of patients was
admitted in 2014 for the treatment of chronic diseases. Using structured and unstructured data
together provides more accurate results than using structured data alone, since the unstructured
data includes the doctor's records on the patients and the symptoms and grievances explained by
the patients themselves, which is an added advantage when used along with structured data
consisting of patient demographics, disease details, living habits, and laboratory test results. It is
difficult to diagnose rare diseases; hence, the use of self-reported behavioural data helps
differentiate individuals with rare diseases from those with common chronic diseases. By using
machine learning approaches along with questionnaires, it is believed that the identification of
rare diseases is highly possible.

1.5.1 DIFFERENT APPROACHES
Table 1.1 Table of Research Work

S. No. | Paper Name | Author | Publication Year | Methodology
1 | Disease prediction by machine learning over big data from health care communities | M. Chen, Y. Hao, K. Hwang, L. Wang and L. Wang | 2017 | Machine learning, big data
2 | A relative similarity-based method for indirective patient risk prediction | B. Qian, X. Wang, N. Cao, H. Li, Y.-G. Jiang | 2015 | Machine learning, deep learning
3 | Wearable 2.0: enable human-cloud integration in the next generation healthcare system | M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, C. Youn | 2017 | Machine learning, cloud computing
4 | Health-CPS: healthcare cyber-physical system assisted by cloud and big data | Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, A. Alamri | 2017 | Machine learning, big data, cloud
5 | Effective heart disease prediction using hybrid machine learning techniques | S. Mohan, C. Thirumalai, G. Srivastava | 2019 | Hybrid machine learning
6 | Heart disease prediction and classification using machine learning algorithms optimized (Int. J. Intell. Eng. Syst.) | - | 2019 | Prediction and classification using machine learning algorithms
7 | Fourth International Conference on Computing Communication Control and Automation | A. Mir, S. N. Dhage | 2018 | Computing Communication Control and Automation
8 | Health Information Science and Systems, volume 8 | M. Maniruzzaman, M. J. Rahman, B. Ahammed, M. M. Abedin | 2020 | Machine learning paradigm
M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, and C. Youn proposed the Wearable 2.0 system, in
which smart washable clothing is designed to improve the QoE and QoS of the next-generation
healthcare system. Chen designed a new IoT-based data collection system in which a new
sensor-based smart washable cloth was invented. Using this cloth, the doctor captures the
patient's physiological condition, and further analysis takes place with the help of the
physiological data. The washable smart cloth mainly consists of multiple sensors, wires and
electrodes; with the help of these components the user can collect the patient's physiological
condition as well as emotional health status information through a cloud-based system. The
authors also discussed the issues faced while designing the Wearable 2.0 architecture; the issues
in the existing system include physiological data collection, negative psychological effects,
anti-wireless body area networking, and sustainable big physiological data collection. Multiple
operations are performed on the files, such as data analysis, monitoring and prediction. The
authors classify the functional components of the smart clothing representing Wearable 2.0 into
categories such as sensor integration, electrical-cable-based networking and digital modules.
Many applications are discussed, such as chronic disease monitoring, care of elderly people and
emotion care.

B. Qian, X. Wang, N. Cao, H. Li, and Y.-G. Jiang [2] designed an Alzheimer's disease risk
prediction system using the EHR data of the patient. They utilized an active learning context to
solve a real problem suffered by patients: an active patient risk model was built, and an active
risk prediction algorithm was utilized to assess the risk of Alzheimer's disease.

Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri [4] designed the cloud-based
Health-CPS system, which manages a huge amount of biomedical data. Y. Zhang discussed the
large growth of data in the medical field: the data is created within a short amount of time and is
stored in different formats, which is exactly the problem with big data. The Health-CPS system
was designed around two technologies, cloud and big data, and it performs numerous operations
in the cloud, such as data analysis, monitoring and prediction. With the help of this system, a
person gets more information about how to handle and manage a huge amount of biomedical
data in the cloud. The system considers three layers: a data collection layer, a data management
layer and a data-oriented layer. The data collection layer stores data in a particular standard
format, and the data management layer is used for distributed storage and parallel computing.
Multiple operations are performed through the Health-CPS system, and many healthcare-related
services become available through it.

L. Qiu, K. Gai, and M. Qiu proposed a telehealth system and discussed how to handle a large
amount of hospital data in the cloud. In this paper the authors proposed advances in the
telehealth system, which are mainly based on sharing data among all the telehealth services over
the cloud. Data sharing on the cloud, however, faces many issues, such as network capacity and
virtual-machine switching. The authors therefore proposed a cloud data-sharing approach for
better sharing of data and designed an optimal telehealth sharing model. Through this model
they focus on transmission probability, network capabilities and timing constraints, and for this
they invented a new optimal big-data sharing algorithm, through which users get an optimal
solution for handling biomedical data.
Ajinkya Kunjir, Harshal Sawant and Nuzhat F. Shaikh [6] proposed a clinical decision-making
system which predicts disease on the basis of the historical data of patients. The system predicts
multiple diseases and unseen patterns in patient conditions, and it was designed for accurate
disease prediction on historical data. For visualization purposes, 2D/3D graphs and pie charts
are used.

1.6 RELATED WORK


• AliveKor: It comes as a touchpad connected to the cell phone (through a wireless
network) or as a wristband. The touchpad displays an ECG of the patient on the cell
phone via Bluetooth, so all the necessary parameters such as pulse rate and blood pressure
are easily available. The wristband, on the other hand, uses a finger touch to display the
pulse function on the dial, and it can also notify of atrial fibrillation.
• My Heart: In this system, a number of on-body sensors are used to collect physiological
data that are sent wirelessly to a PDA. The information is analysed and health
recommendations are given to the user based on the analysis.
• Health Gear: Health Gear is an application for tracking the most common indexes such as
lab tests and physical parameters. Fields include physical indexes such as height, weight
and BMI; blood pressure, haemoglobin, WBC, RBC and platelets; lipids: cholesterol, HDL,
LDL, VLDL and triglycerides; and sugar: fasting glucose, glucose after meals, and HbA1c.
• Fitbit: This sensor is used to keep track of health and has features for measuring pulse
rate, blood pressure and calories burned. After this study, we concluded that Fitbit would
be used to collect the data, since it is easily available and less expensive, and that Health
Gear would be used for all the other parameters.
• S. Mohan, C. Thirumalai, G. Srivastava. Effective heart disease prediction using hybrid
machine learning techniques. IEEE Access, volume 7, 2019.
• Heart disease prediction and classification using machine learning algorithms optimized
by particle swarm optimization and ant colony optimization. Int. J. Intell. Eng. Syst.,
volume 12, issue 1, 2019.
• A. Mir, S. N. Dhage. 2018 Fourth International Conference on Computing Communication
Control and Automation (ICCUBEA), pp. 1-6, 2018.
• M. Maniruzzaman, M. J. Rahman, B. Ahammed, M. M. Abedin. Classification and
prediction of diabetes disease using machine learning paradigm. Health Information
Science and Systems, volume 8, issue 1, 2020.

CHAPTER- 2
HARDWARE & SOFTWARE REQUIREMENT

2.1 Software
Software here refers to the application software that will be utilized during the development
and implementation of the project.
2.1.1 Android Studio
Android Studio is the official integrated development environment for Google's Android
operating system, built on JetBrains' IntelliJ IDEA software and designed specifically for
Android development. It is available for download on Windows, macOS and Linux based
operating systems.
Android Studio provides a unified environment where you can build apps for Android phones,
tablets, Android Wear, Android TV, and Android Auto. Structured code modules allow you to
divide your project into units of functionality that you can independently build, test, and debug.

Figure 2.1 Android Studio

2.1.2 VS Code
Visual Studio Code, also commonly referred to as VS Code, is a source-code editor made by
Microsoft with the Electron Framework, for Windows, Linux and macOS. Features include
support for debugging, syntax highlighting, intelligent code completion, snippets, code
refactoring, and embedded Git.

Visual Studio Code is a streamlined code editor with support for development operations like
debugging, task running, and version control. It aims to provide just the tools a developer needs
for a quick code-build-debug cycle and leaves more complex workflows to fuller featured IDEs,
such as Visual Studio IDE.

Figure 2.2 VS CODE

2.1.3 Sublime Text


Sublime Text is a shareware cross-platform source code editor originally created by Jon
Skinner. It natively supports many programming languages and markup languages. Users can
expand its functionality with plugins, typically community-built and maintained under free-
software licenses. To facilitate plugins, Sublime Text features a Python API.

Figure 2.3 Sublime Text

2.1.4 Octave
Octave is mainly extensible as it uses dynamically loadable modules. It uses an interpreter to
execute the Octave scripting language. Its interpreter has OpenGL-based graphics for creating
plots, graphs and charts, and for saving and printing them. It also includes a graphical user
interface in addition to the traditional command-line interface. It is a high-level programming
language mainly used for numerical computing. It was developed by John W. Eaton and was
initially released in 1993. It was written in C, C++, and Fortran.
It mainly consists of function calls or scripts. Its syntax is mainly matrix-based and it provides
various functions for matrix operations. Octave supports various data structures and object-
oriented programming. It has great features and is compatible with other languages, offering
syntax and functional compatibility with MATLAB. It shares other features like built-in support
for complex numbers, powerful built-in math functions, extensive function libraries and support
for user-defined functions.

Figure 2.4 Octave GUI

2.2 Hardware

• CPU: Intel Core 2 Quad CPU Q6600 @ 2.40GHz (4 CPUs) / AMD Phenom 9850 Quad-
Core Processor (4 CPUs) @ 2.5GHz
• RAM: 4 GB
• OS: Windows 10 64 Bit, Windows 8.1 64 Bit, Windows 8 64 Bit, Windows 7 64 Bit Service
Pack 1, Windows Vista 64 Bit Service Pack 2*
• VIDEO CARD: 32 MB Direct3D Video Card
• FREE DISK SPACE: 10GB
• DEDICATED VIDEO RAM: 1 GB
• CPU SPEED: 700 MHz

CHAPTER- 3
SOFTWARE DEVELOPMENT LIFE CYCLE
SDLC, or the Software Development Life Cycle, is a process that produces software with the
highest quality and lowest cost in the shortest possible time. SDLC provides a well-structured
flow of phases that helps an organization quickly produce high-quality software which is
well-tested and ready for production use.

3.1 SDLC Models


The SDLC involves six phases as explained in the introduction. Popular SDLC models
include the waterfall model, spiral model, and Agile model.

Figure 3.1 SDLC Cycle

3.1.1 Waterfall Model

The waterfall is a universally accepted SDLC model. In this method, the whole process of
software development is divided into various phases. The waterfall model is a sequential
software development model in which development is seen as flowing steadily downwards (like
a waterfall) through the steps of requirements analysis, design, implementation, testing
(validation), integration, and maintenance.

Figure 3.2 Waterfall model

3.1.2 RAD Model

RAD, or the Rapid Application Development process, is an adaptation of the waterfall model; it
targets developing software in a short period. The RAD model is based on the concept that a
better system can be developed in less time by using focus groups to gather system requirements.
Its phases are:

o Business Modeling
o Data Modeling
o Process Modeling
o Application Generation
o Testing and Turnover

Figure 3.3 RAD Model

3.1.3Spiral Model

The spiral model is a risk-driven process model. This SDLC model helps the team adopt
elements of one or more process models, such as waterfall, incremental, etc. The spiral
technique is a combination of rapid prototyping and concurrency in design and development
activities. Each cycle in the spiral begins with the identification of objectives for that cycle, the
different alternatives that are possible for achieving the goals, and the constraints that exist.
This is the first quadrant of the cycle (upper-left quadrant).

The next step in the cycle is to evaluate these different alternatives based on the objectives and
constraints. The focus of evaluation in this step is based on the risk perception for the project.

The next step is to develop strategies that solve uncertainties and risks. This step may involve
activities such as benchmarking, simulation, and prototyping.

Figure 3.4 Spiral Model

3.1.4 V-Model

In this type of SDLC model, testing and development are planned in parallel. So, there are
verification phases on one side and validation phases on the other side; the V-Model is joined by
the coding phase.

Figure 3.5 V-Model

3.1.5 Incremental Model

The incremental model is not a separate model; it is essentially a series of waterfall cycles. The
requirements are divided into groups at the start of the project, and for each group the SDLC
model is followed to develop software. The SDLC process is repeated, with each release adding
more functionality until all requirements are met. In this method, each cycle acts as the
maintenance phase for the previous software release. A modification of the incremental model
allows development cycles to overlap, so that a subsequent cycle may begin before the previous
cycle is complete.

Figure 3.6 Incremental Model

3.1.6 Agile Model

Agile methodology is a practice which promotes continuous interaction of development and
testing during the SDLC process of any project. In the Agile method, the entire project is
divided into small incremental builds. All of these builds are provided in iterations, and each
iteration lasts from one to three weeks. Any agile software phase is characterized in a manner
that addresses several key assumptions about the bulk of software projects.

Figure 3.7 Agile Model

3.1.7 Iterative Model

It is a particular implementation of a software development life cycle that focuses on an initial,


simplified implementation, which then progressively gains more complexity and a broader
feature set until the final system is complete. In short, iterative development is a way of breaking
down the software development of a large application into smaller pieces.

Figure 3.8 Iterative Model

3.1.8 Big Bang model

The Big Bang model focuses all resources on software development and coding, with little or no
planning. The requirements are understood and implemented as they come. This model works
best for small projects with a small development team working together. It is also useful for
academic software development projects, and it is an ideal model where requirements are either
unknown or the final release date is not fixed.

Figure 3.9 Big Bang Model

The SDLC model which we are using is the prototype model.

3.1.9 Prototype Model

The prototyping model starts with the requirements gathering. The developer and the user meet
and define the purpose of the software, identify the needs, etc.

A 'quick design' is then created. This design focuses on those aspects of the software that will
be visible to the user. It then leads to the development of a prototype. The customer then checks
the prototype, and any modifications or changes that are needed are made to the prototype.

Looping takes place in this step, and better versions of the prototype are created. These are
continuously shown to the user so that any new changes can be updated in the prototype. This
process continues until the customer is satisfied with the system. Once the user is satisfied, the
prototype is converted into the actual system with all considerations for quality and security.

3.2 How it is effective in our model: -
Initially, we take the disease dataset from the UCI machine learning repository; it is in the form
of a disease list with its symptoms. Pre-processing is then performed on that dataset for cleaning,
that is, removing commas, punctuation and white spaces, and the result is used as the training
dataset. After that, features are extracted and selected. Then we classify the data using
classification techniques such as KNN and SVM; based on machine learning we can predict the
disease accurately.

An SVM takes these data points and outputs the hyperplane (which in two dimensions is simply
a line) that best separates the tags. This line is the decision boundary: anything falling on one
side of it will be classified as yellow, and anything on the other side will be classified as blue.

Figure 3.10 Prototype model

The above was easy since the data was linearly separable—a straight line can be drawn to
separate yellow and blue. However, in real scenarios, cases are usually not this simple. Consider
the following case:

There is no linear decision boundary. The vectors are, however, very clearly segregated, and it
seems as if it should be easy to separate them.

In this case, we add a third dimension. Up until now we have worked with two dimensions, x
and y. A new z dimension is introduced, set to be calculated in a convenient way: z = x² + y²
(the equation of a circle). Since we are now in three dimensions, the hyperplane is a plane
parallel to the x-y plane at a particular value of z, say z = 1. Mapping this back to two
dimensions, the decision boundary is a circle of radius 1, and it separates both tags using the
SVM. Calculating such a transformation can get computationally expensive: one may deal with
many new dimensions, each possibly involving a complicated calculation, so doing this for
every vector in the dataset would be a lot of work.
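The transformation described above can be sketched in a few lines; the toy data below (an inner circle and an outer ring) is an assumption for illustration, not the project dataset:

    # Hypothetical sketch of the z = x^2 + y^2 trick described above (toy data).
    import numpy as np
    from sklearn import svm

    rng = np.random.default_rng(0)

    # Inner circle (class 0) and outer ring (class 1): not linearly separable in (x, y).
    angles = rng.uniform(0, 2 * np.pi, 100)
    radii = np.concatenate([rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)])
    X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    y = np.array([0] * 50 + [1] * 50)

    # Add the third dimension z = x^2 + y^2; the classes are now separated by a plane in z.
    z = (X ** 2).sum(axis=1).reshape(-1, 1)
    X3 = np.hstack([X, z])

    clf = svm.SVC(kernel="linear").fit(X3, y)
    print(clf.score(X3, y))   # typically 1.0, since a plane at some value of z separates the circles

In practice, an SVM with an RBF kernel (kernel="rbf") performs an equivalent mapping implicitly, without computing the new dimension for every vector by hand.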

CHAPTER-4
SOFTWARE REQUIREMENT SPECIFICATION & ANALYSIS

This project gives a procedural approach to how a patient gets treatment, details about the date
of treatment and, finally, how billing is calculated depending on different criteria such as the
room allocated, lab reports, and treatment and medicine taken. During billing, the healthcare
facility is also considered. At the same time, this project also deals with the different mediums
that can be built between different stakeholders; for example, a website between customers and
the hospital is crucial to give the customers a better outlook. For the development of the website
we also need web developers. Hence, how all the stakeholders are related to each other is
enumerated in this document.

4.1 System Features


A use case is a methodology used in system analysis to identify, clarify, and organize system
requirements. The use case is made up of a set of possible sequences of interactions between
systems and users in a particular environment, related to a particular goal. It consists of a group
of elements (for example, classes and interfaces) that can be used together in a way that has an
effect larger than the sum of the separate elements combined. The use case should contain all
system activities that have significance to the users. A use case can be thought of as a collection
of possible scenarios related to a particular goal; indeed, the use case and the goal are sometimes
considered to be synonymous.

4.2 Requirements Specification for Healthcare Maintenance System


These are the procedural steps and execution procedure for the development of the
Healthcare Maintenance System.

4.2.1. Output:
Accurate prediction of disease from symptoms and fixing of an appointment.
The user must be aware of the symptoms the person is having and report them accurately.
4.2.2. Steps:
• Step 1: The user inputs a series of answers to some basic questions along with symptoms.
• Step 2: The data is processed and matched with a disease profile.
• Step 3: The disease is given a ranking according to its severity and the user is prompted to
make a doctor's appointment.
• Step 4: The doctor meets the patient and recommends medicine and a pharmacy.
• Step 5: The user gets a medical certificate and medicines and is treated for the disease.
• Post-condition: The user must attend a re-appointment.
• Exceptional Scenario 1: If the severity of the disease is not high, then an appointment is not
required.

4.3. Other Non-functional Requirements


These are the procedural steps and execution procedure covering the non-functional
requirements.

4.3.1 Performance Requirements


The system might require heavy databases as it has to handle millions of records and hence
would require hardware with better processors. The website has to avoid lag so as to satisfy the
users as soon as possible, considering the urgency of any medical treatment.

4.3.2 Safety Requirements


There is a possibility that the system might describe a wrong disease, considering the number of
similar symptoms shared by innumerable diseases. There is no reason to worry, as there is an
option of cross-checking with the appointed doctors during their diagnosis.

4.3.3 Security Requirements


Due to security reasons, it is recommended to provide the patient's name and identification, as a
proper certificate would be issued if a dire diagnosis occurs. The certificate would be issued
digitally, and credentials would be taken from what the user has input. Incorrect authentication
can lead to problems with the legal system; the software is not responsible for such an event.

Figure 4.1 system development phase

4.4 Software Analysis


The software is a website which would provide accurate disease predictions for concerned users
and potential patients. Doing such a herculean task requires a robust network and a great user
interface, which we provide.
4.4.1 Requirements
There are other requirements we would cover after the SRS; some of them are described below.

4.4.2 Performance Requirements


The system might require heavy databases as it has to handle millions of records and hence
would require hardware with better processors. The website has to avoid lag so as to satisfy the
users as soon as possible, considering the urgency of any medical treatment.

4.4.3 Safety Requirements


There is a possibility that the system might describe a wrong disease, considering the number of
similar symptoms shared by innumerable diseases. There is no reason to worry, as there is an
option of cross-checking with the appointed doctors during their diagnosis.

4.4.4 Security Requirements
Due to security reasons, it is recommended to provide the patient's name and identification, as a
proper certificate would be issued if a dire diagnosis occurs. The certificate would be issued
digitally, and credentials would be taken from what the user has input. Incorrect authentication
can lead to problems with the legal system; the software is not responsible for such an event.

CHAPTER – 5
RISK ASSESSMENT

A risk prediction model is an equation or a set of equations estimating the likelihood of
developing a clinical outcome based on personal profiles, whose key elements are illustrated in
Figure 5.1 and elaborated below.

Cohort: In risk prediction modelling, the cohort can be as general as all people [1-3] and as
specific as patients hospitalized due to a certain condition [25], and is usually a group of people
likely to develop the clinical outcome. A model should be applied to a cohort like the one it was
built on in order to achieve the desired performance.

Predictor: Predictors are personal characteristics used for risk prediction, including demographic
information, vital signs, laboratory test results, family disease history, and so on. For example,
commonly used predictors for type 2 diabetes include age, family history of diabetes, body mass
index, and waist circumference [1]. Commonly used predictors for cardiovascular disease
include smoking, diabetes, hyperlipidemia, hypertension, C-reactive protein, lipoprotein(a),
fibrinogen, and homocysteine [26, 27]. While including more predictors may increase the
prediction performance, it would also increase the burden of data collection. Thus, using less
expensive, less invasive, and easily collected predictors, and using fewer predictors, may
encourage wider application of a model.

Figure 5.1 Risk factors

Risk prediction models are used to estimate the likelihood of developing a certain outcome. In
the clinical domain, the outcomes are often adverse clinical events, including development of
disease [1-3], complications [4-6], severe outcomes [7-9], hospital readmission [10], and so on.
A model can predict the risk of one or multiple outcomes [28]. A previous report suggests that a
useful outcome should satisfy the criteria of being: (1) undesired by the person experiencing it;
(2) significant to the health service (cost-effective); (3) preventable; and (4) captured in routine
administrative data [23].

Prediction timepoint: The prediction timepoint defines when the model is intended to be used.
While some models can be applied widely throughout adulthood, some apply only to people at a
certain time point. For example, Rao et al. built a model to predict bleeding complications after
patients undergo percutaneous coronary intervention [6].

Observation window: The observation window is the period in which the predictors are defined
and collected.

Prediction window: The prediction window is the period in which the outcomes are observed. It
should be defined based on the outcome: while some outcomes happen shortly after the
prediction timepoint, some take years to develop. For example, in predicting hospital
readmission, the prediction window ranges from 30 days if the reason for the current admission
is a severe disease to 12 months if the reason is a less severe or general condition [10].

Algorithm: While the concepts above define the prediction problem, the algorithm solves the
defined problem. Two necessary components are algorithms for feature engineering and
algorithms to build the risk equation, as elaborated below.
5.1 Evaluation metrics
Many evaluation metrics have been proposed to reflect different aspects of model performance.
As no model can achieve the best performance in every aspect, how to evaluate a model should
be determined based on the aspect of care [29].

5.1.1 Feature engineering.


In risk prediction modelling, feature engineering refers to the process of creating numerical
values from predictors to be used in the risk equation.

5.1.2 Pre-processing.
Common pre-processing steps include outlier removal, missing data imputation,
standardization, normalization, transformation, encoding, discretization, and so on.

Feature construction and feature selection: Both serve the purpose of creating meaningful
feature sets, but they act in different directions. Feature construction aims at generating new
features from existing ones to inject domain knowledge or to increase feature complexity, while
feature selection aims at reducing the number of features to reduce model complexity and
increase model performance.
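As one possible illustration, the pre-processing and feature selection steps above can be chained into a single scikit-learn pipeline; the step choices, parameters and the variables X_train and y_train are assumptions for this sketch, not a configuration taken from the text:

    # Hypothetical sketch: pre-processing, feature selection and a risk equation in one pipeline.
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    risk_model = Pipeline([
        ("impute", SimpleImputer(strategy="median")),     # missing data imputation
        ("scale", StandardScaler()),                      # standardization
        ("select", SelectKBest(f_classif, k=10)),         # feature selection
        ("classify", LogisticRegression(max_iter=1000)),  # risk equation (classification)
    ])
    # risk_model.fit(X_train, y_train) would then learn the full chain end to end
    # (X_train, y_train are assumed to be the engineered predictor matrix and outcome labels).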
5.1.3 Risk equation.

A risk equation takes engineered features and computes a risk score, which is always achieved
through model fitting or a learning process.

Single task: Most risk equations estimate the risk of a single outcome (a single task).

5.2 Classification
The most common class of risk prediction problems is the binary classification problem. Model
prediction performance can be evaluated using various metrics, including those from traditional
aspects like discrimination and calibration, and from novel aspects like reclassification and
clinical usefulness. Here we introduce a few metrics; more information can be found in previous
reports.
Discrimination measures how well a model can discriminate between cases that experienced
and did not experience the outcome. AUROC (also known as ROC, C-statistic, or C-index) and
AUPRC measure the overall performance of a model. Accuracy, sensitivity (recall), specificity,
and precision measure model performance at a certain threshold (based on which each sample is
assigned a prediction of 'will have the outcome' or 'will not have the outcome').
Calibration measures how close the predicted risk score is to the actual probability of
developing the outcome. Risk equations from classification methods do not always have good
calibration, especially in external validation cohorts, while those from survival analysis have
better calibration. Common metrics include calibration curves and the Hosmer-Lemeshow test.
Model interpretability is a measure of how well a model can be understood by human beings; it
reflects how well the mechanism of the model can be interpreted. In a logistic regression model,
the odds ratio shows both the directionality and the scale of a feature's impact on the outcome.
A decision tree model can be easily interpreted as a set of decision rules. Artificial neural
networks are notorious for their poor model interpretability due to their black-box nature. Result
interpretability reflects how well the risk score from a risk model can be explained to a
human-understandable degree. Model interpretability leads to, but is not a prerequisite for,
result interpretability.

5.3 Risk assessment service


While a risk assessment service is based on a model, it has a broader scope. Key elements of a
risk assessment service are illustrated in Figure 2. Typically, it starts from collecting personal
information as input, proceeds to the backend process of feature processing and risk prediction
using the algorithm, and finally produces the output, a risk report. A report typically consists of
four components: risk score, risk factors, risk level, and intervention.

5.3.1 Risk score.


A risk score is a quantitative evaluation of a person’s likelihood of developing the outcome in
the prediction window and is often a direct or normalized output from the risk prediction model.
It can be absolute risk which can be directly interpreted as the probability of developing the
outcome, or relative risk which is positively correlated with the absolute risk.

5.3.2 Risk factor.


A risk factor is an attribute of the assessed individual which has increased his or her likelihood
of developing the outcome [45]. Risk factors overlap with predictors but are different. First, a
predictor can be an indicator of increased as well as decreased risk, and thus can contribute to as
well as protect against the development of the outcome. Second, even when a predictor
contributes to development of the outcome, it is a risk factor for a person only if the person has
a non-optimal value for that predictor.
For example, diabetes is a predictor of cardiovascular disease, and for a diabetic person, diabetes
is his risk factor for cardiovascular disease. Displaying the risk factors enables a clear
interpretation of the risk score and informs intervention to reduce the risk.
An example risk assessment service with a report is shown in Figure 5.1. The JBS3 risk
calculator [47] is a tool to help communicate the risk of cardiovascular disease and the benefits
of interventions, and is designed for doctors and practitioners working with their patients.
A data flow diagram (DFD) maps out the flow of information for any process or system. It uses
defined symbols like rectangles, circles and arrows, plus short text labels, to show data inputs,
outputs, storage points and the routes between each destination. Data flowcharts can range from
simple, even hand-drawn process overviews, to in-depth, multi-level DFDs that dig
progressively deeper into how the data is handled. They can be used to analyze an existing
system or model a new one. Like all the best diagrams and charts, a DFD can often visually
“say” things that would be hard to explain in words, and they work for both technical and
nontechnical audiences, from developer to CEO. That’s why DFDs remain so popular after all
these years. While they work well for data flow software and systems, they are less applicable
nowadays to visualizing interactive, real-time or database-oriented software or systems.

CHAPTER 6
DFD AND APPLICATION ARCHITECTURE

DFD levels are numbered 0, 1 or 2, and occasionally go to level 3 or beyond. The necessary
level of detail depends on the scope of what you are trying to accomplish.
6.1 Zero Level DFD:

Figure 6.1 0 level DFD


This is the zero level DFD of the disease prediction system, where we elaborate the high-level
process of disease prediction. It is a basic overview of the whole system, designed to be an
at-a-glance view of diseases, doctors and reports, showing the system as a single high-level
process with its relationships to the external entities of patients and doctors.
High-level entities and process flow of the Patient Management System:
* Managing all the Patients
* Managing all the Symptoms
* Managing all the Diseases
* Managing all the Doctors
6.2 First Level DFD:

Figure 6.2 1st Level DFD


The first level DFD (level 1) of the disease prediction system shows how the system is divided
into sub-systems (processes), each of which deals with one or more data flows to or from an
external agent, and which together provide all the functionality. It also identifies internal data
stores of patient, doctor and diseases.
Main entities and outputs of the first level:
* Processing Patient records and generating a report of all Patients.
* Processing Medicine records and generating a report.
* Processing Doctor schedule records and generating a report of all Schedules.
* Processing Report records and generating a report of all reports.

6.3 Second Level DFD:

Figure 6.3 2nd level DFD

DFD level 2 then goes one step deeper into parts of level 1 of the disease prediction system and
may be required to detail the functionality of the system. The 2nd level DFD contains more
details of Report, Doctors, Test, Medicines and Patient.
Low-level functionalities of the Disease Prediction System:
* Admin logs in to the system and manages all the functionalities of the system.
* Admin can add, edit, delete and view the records of patient, diseases, report and doctor.
* Admin can search and reach the details of all the things.
* Admin can track all the information of doctors and patients.

CHAPTER-7
PROJECT MODULES DESIGN

The whole project is divided into two parts, i.e. the machine learning model and the user
interface, and they can be elaborated as follows:
7.1 Machine learning Model: -

Figure 7.1 Detailed Design Model

Data mining techniques are used in the project to see whether the dataset is good for prediction
or not. The various data mining libraries used in the project are:
• 1. Scipy: This is used for implementing scientific computing in the Python programming
language. It is a collection of mathematical algorithms and convenience functions built
on Numpy. Some of the functionalities it provides are Special Functions (special),
Integration (integrate), Optimization (optimize), Fourier Transforms (fft), Interpolation
(interpolate), Signal Processing (signal), Linear Algebra (linalg), Statistics (stats),
File IO (io), etc. In this project the stats (Statistics) library of this package is primarily
used.
• 2. Sklearn: This stands for scikit-learn and is built on the Scipy package. It is the
primary package being used in this project and provides an interface for
supervised and unsupervised learning algorithms. The following groups of models are
provided by sklearn: Clustering, Cross Validation, Datasets, Dimensionality Reduction,
Ensemble methods, Feature extraction, Feature selection, Parameter Tuning, Manifold
Learning, and Supervised Models.
• 3. Numpy : It is a library for the Python programming language, adding support for
multidimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. It provides functions for Array
Objects, Routines, Constants, Universal Functions, Packaging etc. In this project it is
used for performing multi-dimensional array operations.
• 4. Pandas : This library is used to provide high-performance, easy-to-use data structures
and data analysis tools for the Python programming language. It provides functionalities
like table manipulations, creating plots, calculate summary statistics, reshape tables,
combine data from tables, handle times series data, manipulate textual data etc. In this
project it is used for reading csv files, comparing null and alternate hypothesis etc.
• 5. Matplotlib: It is a library for creating static, animated, and interactive visualizations
in the Python programming language. In this project it is used for creating simple plots
and subplots, and its object is used alongside the seaborn object to employ certain
functions such as show, grid, etc. The %matplotlib inline magic command is also used so
that plots are rendered right below the cells that create them.
• 6. Seaborn : It provides an interface for making graphs that are more attractive and
interactive in nature. It is based on the matplotlib module. These graphs can be dynamic
and are much more informative and easier to interpret. It provides different presentation
formats for data such as Relational, Categorical, Distribution, Regression, Multiples and
style and color of all these types. In this project they are used for creating complex plots
that use various attributes.
• 7. Warning : The warnings module is used for handling any warnings that may arise when
the program is running; the Warning class is a subclass of Exception.
• 8. Stats : This library is used to incorporate statistics functionality in the Python
programming language and is included in the scipy package. The library is not imported
as a whole; rather, the required functions are imported as and when required, for example
for measures of central tendency and measures of variability. The functions used cover
simple concepts like mean, median, mode, variance, percentiles, skew, kurtosis, range and
cumulative frequency.
• 9. Model selection : This library helps in choosing the best model and is present in the
sklearn package. It provides functions like train_test_split, which splits the data into
training and test datasets and helps in improving the accuracy of the model, and
cross_val_score, which is used for computing the accuracy of a model. It also includes
techniques for improving the accuracy of a model, like the KFold algorithm used in this
project, and it works alongside linear_model, SVM, etc.
• 10. Naive Bayes : It is a library built for implementing the Naive Bayes algorithm and is
also defined in the sklearn package. In this project the multinomial variant is used, and it
is one of the most crucial algorithms in the project.
• 11. Tree : It is the library that comprises all the functionality and concepts associated
with trees and is included in the sklearn package. The most important algorithm in this
library is the Decision Tree Classifier, which gives very high accuracy and is one of the
most used algorithms for projects like this.
• 12. Linear model : It is the library that implements the regression algorithms. It is also
included in the sklearn package.
• 13. Ensemble : It is the library that includes the ensemble methods and is defined in the
sklearn package. As ensemble techniques are used to improve the accuracy of models,
the Gradient Boosting Classifier and the Random Forest Classifier (a very important
algorithm that provides very high accuracy) are used in this project.
• 14. Metrics : It is the library used for presenting the accuracy of the model; the
accuracy_score function is the most basic of them all. It is included in the sklearn package.
• 15. Joblib : It is a package that provides lightweight pipelining in the Python
programming language. It is included in the externals package within the main sklearn
package. It is used for providing transparent disk-caching of output values, simple
parallel computing, and logging and tracing of execution. In this project it is used to
provide the interaction with the user and perform operations accordingly.
• All these libraries are used to create a model with the help of the dataset. The model
created by applying all these data mining techniques is saved as a binary file so that the
model is secure from any kind of modification and other security threats related to the
system. The binary file of the model cannot be opened directly, as it does not have any
extension. This binary file is basically created with the help of the joblib library; a
minimal sketch of this workflow is given below.
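As an illustration of how these libraries fit together, the sketch below loads a dataset with pandas, estimates accuracy with model_selection and metrics, and serializes the trained model with joblib. The file name training_data.csv and the label column prognosis are assumptions for illustration, not the exact project files:

    # Hypothetical end-to-end sketch tying the libraries above together.
    import pandas as pd
    import joblib
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("training_data.csv")              # pandas: load the symptom dataset
    X, y = df.drop(columns=["prognosis"]), df["prognosis"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    print(cross_val_score(model, X_train, y_train, cv=5).mean())  # k-fold accuracy estimate

    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))          # held-out accuracy

    joblib.dump(model, "disease_model.bin")            # joblib: save the trained model as a binary file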

7.2 User Interface
The UI is developed with the help of React and React Native. The ML model developed using
data mining techniques and the ML algorithms is used in the UI.
The various modules included in the project are:
1: Creating a Linux Elastic Compute Cloud (EC2) instance on AWS.
2: Creating a PostgreSQL RDS instance for the database on AWS.
3: Downloading the project on the EC2 instance and creating a suitable Python 3 environment
with Django installed in it.
4: Downloading and installing pgAdmin and accessing the remote RDS database on the local
machine.
5: Creating the migrations and launching the project.
6: Accessing the project in a web browser via the public IP of the EC2 instance.
7: Creating a user account and predicting the disease.
All these modules are needed to make the project work properly; if any of them is missed, the
project will not work correctly. These module steps cover the tasks required to access the UI of
the project, in which the ML model is implemented for the prediction of disease.
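For steps 2 to 5, the database connection would typically be declared in the Django settings file; the sketch below uses placeholder credentials and a placeholder RDS endpoint, not the project's actual values:

    # Hypothetical Django settings snippet pointing at the PostgreSQL RDS instance.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "disease_prediction",
            "USER": "admin",
            "PASSWORD": "change-me",
            "HOST": "example-db.abc123.us-east-1.rds.amazonaws.com",  # placeholder RDS endpoint
            "PORT": "5432",
        }
    }

The migrations in step 5 would then be applied with python manage.py migrate, and the development server started with python manage.py runserver 0.0.0.0:8000 so that the site is reachable through the EC2 instance's public IP.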

CHAPTER – 8
TESTING, TRAINING AND EVALUATION
8.1 Testing Methodologies
The program comprises several algorithms, which are tested individually for accuracy.
We also check the correctness of the program as a whole and how it performs.

8.2 Unit Testing


Unit tests focus on ensuring that the correct changes to the world state take place when a
transaction is processed. The business logic in transaction processor functions should have unit
tests, ideally with 100 percent code coverage; this ensures that there are no typos or logic errors
in the business logic. The various modules can be individually run from a command line and
tested for correctness: the tester can pass various values, check the answers returned, and verify
them against the values given to him or her. Another approach is to write a script, run all the
tests using it, write the output to a log file, and use that to verify the results. We tested each of
the algorithms individually and made changes in pre-processing accordingly to increase the
accuracy.
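As an illustration of such a unit test, the sketch below assumes a hypothetical predict_disease() helper in a module named predictor; the names are placeholders, not the project's actual code:

    # Hypothetical unit-test sketch for a prediction helper.
    import unittest
    from predictor import predict_disease   # assumed helper, not the project's actual module

    class TestPrediction(unittest.TestCase):
        def test_known_symptom_set_returns_a_label(self):
            # The tester passes known symptom values and verifies the returned label.
            result = predict_disease(["itching", "skin_rash"])
            self.assertIsInstance(result, str)
            self.assertNotEqual(result, "")

    if __name__ == "__main__":
        unittest.main()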

8.3 System Testing


System Testing is a level of software testing where complete and integrated software is tested.
The purpose of this test is to evaluate the system's compliance with the specified requirements.
System testing is the testing of a complete and fully integrated software product, and it falls
under the black-box (rather than white-box) testing category of software testing. Different
types of system testing include:
• Usability Testing - Usability testing mainly focuses on the user's ease of using the
application, flexibility in handling controls and the ability of the system to meet its
objectives.
• Load Testing - Load testing is necessary to know that a software solution will perform
under real-life loads.
• Regression Testing - Regression testing involves testing done to make sure none of the
changes made over the course of the development process have caused new bugs.
• Recovery Testing - Recovery testing is done to demonstrate that a software solution is
reliable, trustworthy and can successfully recover from possible crashes.
• Migration Testing - Migration testing is done to ensure that the software can be moved
from older system infrastructures to current system infrastructures without any issues.

8.4 Quality Assurance


Quality Assurance, popularly known as QA testing, is defined as an activity to ensure that an
organization is providing the best possible product or service to customers. QA focuses on
improving the processes used to deliver quality products to the customer. An organization has to
ensure that its processes are efficient and effective as per the quality standards defined for
software products.

8.5 Functional Test


Functional Testing is also known as functional completeness testing; it involves trying to think
of any possible missing functions. As a system such as a chat-bot evolves into new application
areas, functional testing of its essential components becomes necessary. Functional testing
evaluates use-case scenarios and related business processes, such as the behaviour of smart
contracts.

8.6 Outputs of Machine training and testing

Figure 8.1 Skin rash training results


Figure 8.2 Acidity training results

Figure 8.3 Itching training results

Figure 8.4 Joint pain training results

Figure 8.5 Chills training results

Figure 8.6 Shivering training results

Figure 8.7 Sneezing training results

Figure 8.8 Overall system training results

CHAPTER – 9
PROJECT SNAPSHOTS

9.1 Output screens of web application

Figure 9.1 Home Page

Figure 9.2 About Us Page

Figure 9.3 Contact Us Form

Figure 9.4 Initial detection Page

Figure 9.5 Disease based on set of Symptoms 3

Figure 9.6 Disease based on set of Symptoms 1

Figure 9.7 Disease based on set of Symptoms 4

Figure 9.8 Disease based on set of Symptoms 2

Figure 9.9 Disease based on set of Symptoms 5


Figure 9.10 Disease based on set of Symptoms 6

Figure 9.11 Disease based on set of Symptoms 7

9.2 DATASETS

Figure 9.12 [12] Training dataset

Figure 9.13 [13] Testing dataset

CHAPTER – 10
LIMITATIONS OF THE PROJECT
Disease prediction systems using machine learning are becoming increasingly popular due to
their ability to analyse vast amounts of data and provide accurate predictions. However, like
any technology, these systems have limitations that should be taken into account. In this chapter,
we discuss some of the most significant limitations of disease prediction systems using
machine learning.

10.1 Limited data availability:


Machine learning algorithms require large amounts of high-quality data to make accurate
predictions. For some rare diseases or new outbreaks, there may be a lack of data available.
This can make it challenging to build an accurate model, which may result in unreliable
predictions. Furthermore, some data may be biased or incomplete, which can lead to inaccurate
results.

10.2 Lack of transparency:


Machine learning algorithms are often considered a "black box" because it can be challenging
to understand how they make predictions. This lack of transparency can make it difficult to
interpret results and identify any errors or biases in the model. This can be particularly
problematic for medical professionals who need to make critical decisions based on the
predictions made by these systems.

10.3 Overfitting:
Overfitting occurs when a machine learning model becomes too complex and starts to fit the
training data too closely. This can lead to the model being unable to generalize to new data,
which can result in inaccurate predictions. Overfitting can be particularly problematic in disease
prediction systems because the model may be trained on a particular population or set of data,
making it less accurate when applied to other populations.
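One common way to spot overfitting is to compare training accuracy with cross-validated accuracy, as in the sketch below; the dataset path and label column are assumptions, and a large gap between the two numbers suggests the model memorizes the training data rather than generalizing.

# overfitting_check.py - minimal sketch; dataset path and column names are assumptions.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("Training.csv")                       # hypothetical path
X, y = df.drop(columns=["prognosis"]), df["prognosis"]

model = KNeighborsClassifier(n_neighbors=1)            # deliberately flexible model
model.fit(X, y)
train_acc = accuracy_score(y, model.predict(X))
cv_acc = cross_val_score(model, X, y, cv=5).mean()

# A large gap between training accuracy and cross-validated accuracy is a warning
# sign that the model will not transfer well to new patients or populations.
print(f"train accuracy {train_acc:.3f} vs cross-validated accuracy {cv_acc:.3f}")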

10.4 Lack of interpretability:


Some machine learning algorithms are inherently difficult to interpret, and it may be
challenging to understand how the model arrived at a particular prediction. This can be
particularly problematic in healthcare, where it is essential to understand how a prediction was
made to ensure that the correct diagnosis and treatment are given. Lack of interpretability can
also make it difficult to identify any errors or biases in the model.
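One partial remedy, sketched below, is to compute permutation importance so that the symptoms driving a model's predictions can at least be ranked and inspected. The dataset path and column names are assumptions about the project's data.

# interpretability_check.py - minimal sketch using permutation importance; paths are assumptions.
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.read_csv("Training.csv")
X, y = df.drop(columns=["prognosis"]), df["prognosis"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = SVC(kernel="linear").fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)

# Rank symptoms by how much shuffling each one degrades accuracy on held-out data.
ranking = sorted(zip(X.columns, result.importances_mean),
                 key=lambda item: item[1], reverse=True)
for symptom, importance in ranking[:10]:
    print(f"{symptom}: {importance:.4f}")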

10.5 Data privacy and security:


Disease prediction systems require access to sensitive patient data, including medical history,
genetics, and lifestyle information. Ensuring the privacy and security of this data is crucial, but
it can be challenging, particularly when data is being shared between different organizations or
researchers. The risk of data breaches or misuse can be a significant concern, and it is essential
to implement robust security measures to protect patient data.
10.6 Ethical considerations:
Machine learning algorithms can sometimes produce results that are biased or discriminatory,
which can be problematic, particularly in healthcare. This can lead to unfair treatment or
diagnosis, which can have serious consequences for patients. Ensuring that algorithms are
trained on diverse and representative data sets is crucial to prevent bias and discrimination.

10.7 Lack of expert input:


While machine learning algorithms are excellent at analysing vast amounts of data, they lack
the expertise and context that medical professionals bring. This can make it challenging to
identify patterns or factors that may be important in predicting disease. As such, it is essential
to involve medical professionals in the development and implementation of these systems to
ensure that the predictions made are accurate and clinically relevant.
In conclusion, disease prediction systems using machine learning are a valuable tool for
healthcare professionals, but they are not without their limitations. Understanding and
addressing these limitations is crucial to ensure that these systems are effective and accurate in
predicting disease. As such, it is important to involve experts from a variety of fields, including
medical professionals, data scientists, and ethicists, in the development and implementation of
these systems.

CHAPTER – 11
FUTURE SCOPE OF THE PROJECT
The future of disease prediction systems using machine learning is promising, with many
opportunities for further development and expansion. With the increasing availability of
healthcare data, improved algorithms, and better computing power, disease prediction systems
using machine learning are poised to become an increasingly valuable tool for healthcare
professionals.
One of the most significant potential benefits of disease prediction systems is their ability to
identify diseases at an early stage. Early detection can be critical in the successful treatment of
many diseases, and machine learning algorithms can help identify patterns and risk factors that
may not be immediately apparent to human clinicians. With the integration of wearable
technology and other IoT devices, these systems can also continuously monitor patients,
enabling early detection of potential health issues and prompt interventions.
Machine learning algorithms can also help identify risk factors and predict disease outcomes
for individual patients. By analysing vast amounts of data from electronic health records,
genomics, and other sources, these systems can provide personalized recommendations for
prevention and treatment, leading to better patient outcomes and reduced healthcare costs.
Another potential area of growth for disease prediction systems using machine learning is in
the field of precision medicine. By analysing genomic data, machine learning algorithms can
help identify specific genetic markers that may indicate an increased risk of certain diseases.
This can lead to more targeted and personalized treatments, reducing the risk of adverse effects
and improving outcomes.
The use of machine learning algorithms in disease prediction can also help improve public
health efforts. By analysing data from a large population, these systems can help identify
patterns and trends in disease prevalence, aiding in the development of public health
interventions and policies.
Furthermore, machine learning algorithms can help optimize clinical trials by identifying
patients who are most likely to benefit from a particular treatment, reducing the cost and time
required for drug development.
In the future, disease prediction systems using machine learning may also be integrated with
other emerging technologies such as blockchain and decentralized databases. These
technologies can help improve the security and privacy of patient data, while also enabling
seamless data sharing between healthcare providers and researchers.

Overall, the future of disease prediction systems using machine learning is bright, with many
opportunities for further development and expansion. As the technology continues to evolve
and improve, we can expect to see more accurate predictions, better personalized treatments,
and improved public health outcomes. However, it is essential to address the ethical and privacy
concerns surrounding the use of these systems to ensure that they are implemented in a way
that maximizes their benefits while minimizing potential risks.

CONCLUSION
In conclusion, disease prediction systems using machine learning have the potential to
revolutionize the field of healthcare. These systems can analyse vast amounts of data, identify
risk factors, and provide personalized recommendations for prevention and treatment. They can
help identify diseases at an early stage, leading to more successful treatment and better patient
outcomes. Additionally, they can aid in the development of public health interventions and
policies, leading to improved health outcomes for populations.
However, the implementation of these systems must be approached with caution. It is important
to address ethical and privacy concerns and ensure that the use of these systems is transparent
and secure. Moreover, the reliability and accuracy of these systems need to be validated through
rigorous testing and evaluation to prevent false positives or false negatives.
The future of disease prediction systems using machine learning is promising, and we can
expect to see more accurate predictions and improved personalized treatments as the technology
continues to evolve. The integration of other emerging technologies, such as blockchain and
decentralized databases, can help improve data security and privacy, leading to more seamless
data sharing between healthcare providers and researchers.
Ultimately, disease prediction systems using machine learning have the potential to transform
healthcare and improve patient outcomes. With careful consideration and continued
development, we can harness the power of this technology to create a healthier and more
equitable future.

REFERENCES

[1] M. Chen, Y. Hao, K. Hwang, L. Wang, and L. Wang, “Disease prediction by machine
learning over big data from healthcare communities,” IEEE Access, vol. 5, no. 1,
pp. 8869–8879, 2017.

[2] B. Qian, X. Wang, N. Cao, H. Li, and Y.-G. Jiang, “A relative similarity-based
method for interactive patient risk prediction,” Springer Data Mining Knowl.
Discovery, vol. 29, no. 4, pp. 1070–1093, 2015.

[3] M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, and C. Youn, “Wearable 2.0: Enable
human-cloud integration in next generation healthcare system,” IEEE Commun., vol.
55, no. 1, pp. 54–61, Jan. 2017.

[4] Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri, “HealthCPS: Healthcare
cyber physical system assisted by cloud and big data,” IEEE Syst. J., vol. 11, no. 1,
pp. 88–95, Mar. 2017.

[5] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using
hybrid machine learning techniques,” IEEE Access, vol. 7, 2019.

[6] “Heart disease prediction and classification using machine learning algorithms optimized
by particle swarm optimization and ant colony optimization,” Int. J. Intell. Eng. Syst.,
vol. 12, no. 1, 2019.

[7] A. Mir and S. N. Dhage, in 2018 Fourth International Conference on Computing
Communication Control and Automation (ICCUBEA), pp. 1–6, 2018.

[8] M. Maniruzzaman, M. J. Rahman, B. Ahammed, and M. M. Abedin, “Classification and
prediction of diabetes disease using machine learning paradigm,” Health Information
Science and Systems, vol. 8, no. 1, 2020.

[9] Min Chen, Yixue Hao, Kai Hwang, Lu Wang, and Lin Wang, “Disease prediction by
machine learning over big data from healthcare communities,” 2017.

[10] C. Beyene and P. Kamat, “Survey on prediction and analysis of the occurrence of heart
disease using data mining techniques,” International Journal of Pure and Applied
Mathematics, 2018.

[11] P. Groves, B. Kayyali, D. Knott, and S. V. Kuiken, “The big data revolution in
healthcare: Accelerating value and innovation,” 2016.

[12] Training dataset: https://www.kaggle.com/datasets/pjmathematician/diseases-and-symptoms?select=main.csv

[13] Testing dataset: https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset?select=dataset.csv

[14] Davis, D. A., Chawla, N. V., Blumm, N., Christakis, N., & Barabási, A.-L. (2008).
Predicting Individual Disease Risk Based On Medical History.
[15] Adam, S., & Parveen, A. (2012). Prediction System For Heart Disease Using Naive
Bayes.

[16] Al-Aidaroos, K., Bakar, A., & Othman, Z. (2012). Medical Data Classification
With Naive Bayes Approach. Information Technology Journal.

[17] Darcy A. Davis et al. (2008). Predicting Individual Disease Risk Based On
Medical History.

[18] Soni, J., Ansari, U., Sharma, D., & Soni, S. (2011). Predictive Data Mining for
Medical Diagnosis: An Overview of Heart Disease Prediction.

[19] Al-Aidaroos, K. M., & Bakar, A. (2012). Medical Data Classification With Naive
Bayes Approach.

[20] Nisha Banu, M. A., & Gomathy, B. (2013). Disease Predicting System Using Data
Mining Techniques.

