Multiple Disease Detection
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
A SAINADH 20KA5A0503
K VENKATESWARLU 19KA1A0556
S CHARAN REDDY 19KA1A0528
SK ARSHAD BASHA 19KA1A0554
2019-2023
CERTIFICATE
This is to certify that the project report entitled “Multiple Disease Prediction by using
Machine Learning” submitted by
A SAINADH 20KA5A0503
K VENKATESWARLU 19KA1A0556
S CHARAN REDDY 19KA1A0528
SK ARSHAD BASHA 19KA1A0554
in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology (B.Tech) in Computer Science & Engineering (CSE) from Jawaharlal
Nehru Technological University Anantapur College of Engineering, Kalikiri, during
the academic year 2022-2023.
DECLARATION
An endeavour over a long period can be successful only with the advice and support of
many well-wishers. The task would be incomplete without mentioning the people who have
made it possible, because success is the epitome of hard work. So, with gratitude, we
acknowledge all those whose guidance and encouragement crowned our efforts with success.
We are very much obliged to our beloved Dr. B. LALITHA, Associate Professor and
Head, Department of Computer Science & Engineering, JNTUACE, Kalikiri, for the moral
support and valuable advice provided for the success of the project.
Finally, we would like to extend our deep sense of gratitude to all the staff members and
friends, and last but not least we are greatly indebted to our parents, who inspired us under
all circumstances.
PROJECT ASSOCIATES
A SAINADH 20KA5A0503
K VENKATESWARLU 19KA1A0556
S CHARAN REDDY 19KA1A0528
SK ARSHAD BASHA 19KA1A0554
ABSTRACT
There are multiple techniques in machine learning that can, across a variety of industries,
perform predictive analytics on large amounts of data. Predictive analytics in healthcare is a
difficult endeavour, but it can eventually assist practitioners in making timely decisions
about patients' health and treatment based on massive data. Diseases like breast cancer,
diabetes, and heart-related diseases cause many deaths globally, but most of these deaths
are due to the lack of timely check-ups. This problem arises from a lack of medical
infrastructure and a low ratio of doctors to the population. The statistics clearly show this:
the WHO-recommended ratio of doctors to patients is 1:1000, whereas India's
doctor-to-population ratio is 1:1456, which indicates the shortage of doctors.
Diseases related to the heart, cancer, and diabetes can pose a potential threat to
mankind if not found early. Therefore, early recognition and diagnosis of these diseases can
save a lot of lives. This work is about predicting harmful diseases using machine learning
classification algorithms; breast cancer, heart disease, and diabetes are included.
To make this work seamless and usable by the general public, our team built a medical test
web application that makes predictions about various diseases using machine learning.
Our aim in this work is to develop a disease-predicting web app that uses
machine-learning-based predictions for diseases such as breast cancer, diabetes, and
heart disease.
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
CHAPTER 2 LITERATURE SURVEY
CHAPTER 3 PROBLEM IDENTIFICATION
3.1 EXISTING SYSTEM
3.1.1 DISADVANTAGES OF EXISTING SYSTEM
3.2 PROPOSED SYSTEM
5.2.3 JUPYTER NOTEBOOK
5.3 ALGORITHM
CHAPTER 6 TESTING
CHAPTER 8 CONCLUSION
CHAPTER 10 REFERENCES
CHAPTER 1
INTRODUCTION
CHAPTER 2
LITERATURE SURVEY
Afzal Hussain Shahid and Maheshwari Prasad Singh proposed the paper titled “A
deep learning approach for prediction of Parkinson’s disease progression” [19]. This paper
proposed a deep neural network (DNN) model using the reduced input feature space of the
Parkinson’s telemonitoring dataset to predict Parkinson’s disease (PD) progression, and also
proposed a PCA-based DNN model for the prediction of Motor-UPDRS and Total-UPDRS in
Parkinson's disease progression. The DNN model was evaluated on a real-world PD dataset
taken from UCI. Being a DNN model, the performance of the proposed model may improve
with the addition of more data points to the dataset.
The performance and accuracy of the applied algorithms are discussed and compared.
In the paper [7], the authors propose a diabetes prediction model for the classification
of diabetes that includes external factors responsible for diabetes along with regular factors
like glucose, BMI, age, and insulin. Classification accuracy is improved with the novel
dataset compared with the existing dataset.
In [8], the authors applied eight ML algorithms to a dataset of 521 instances (80% for
training and 20% for testing): logistic regression, support vector machines with linear and
nonlinear kernels, random forest, decision tree, adaptive boosting, K-nearest neighbour, and
naïve Bayes. According to the results, the random forest classifier achieved 98% accuracy,
outperforming the others.
Aditi Gavhane, Gouthami Kokkula, Isha Panday, and Prof. Kailash Devadkar, “Prediction
of Heart Disease using Machine Learning”: Gavhane et al. [2] worked on a multi-layer
perceptron model for the prediction of heart disease in human beings and evaluated the
accuracy of the algorithm using CAD technology. If the number of people using the
prediction system for their disease prediction increases, then awareness about these diseases
will also increase, which should reduce the death rate of heart patients.
Pahulpreet Singh Kohli and Shriya Arora, “Application of Machine Learning
in Disease Prediction”: machine learning algorithms are used for various types of disease
prediction, and many researchers have worked on this. Kohli et al. [7] worked on heart
disease prediction using logistic regression, diabetes prediction using a support vector
machine, and breast cancer prediction using an AdaBoost classifier, and concluded that
logistic regression gives an accuracy of 87.1%, the support vector machine gives an accuracy
of 85.71%, and the AdaBoost classifier gives an accuracy of up to 98.57%, which is good
from a prediction point of view.
[7] Data mining techniques are popular in many fields such as medicine, business,
railways, and education. They are most commonly used for medical diagnosis and disease
prediction at an early stage, and data mining is utilized in the healthcare sector in industrial
societies. This paper provides a survey of data mining techniques used for Parkinson’s
disease.
[8] Parkinson's disease is a global public health issue. Machine learning techniques
would be a good solution to classify healthy individuals and individuals with Parkinson's
disease (PD). This paper gives a complete review of the forecasting of Parkinson's disease
using machine-learning-based methodologies. A concise presentation of the different
computational methodologies utilized for the forecasting of Parkinson's disease is given,
and the paper also presents an outline of the results obtained by different researchers from
the available data to predict Parkinson's disease.
In the experimental analysis of [12], four machine learning algorithms, Random Forest,
K-nearest neighbour, Support Vector Machine, and Linear Discriminant Analysis, are used in
the predictive analysis of early-stage diabetes. The highest accuracy of 87.66% goes to the
Random Forest classifier.
In another way, the authors of the paper [13] have built models to predict and classify
diabetes complications. In this work, several supervised classification algorithms were applied
to predict and classify 8 diabetes complications. The complications include some parameters
such as metabolic syndrome, dyslipidemia, nephropathy, diabetic foot, obesity, and
retinopathy. In [14], the authors present two machine learning approaches to predict
diabetes patients: a random forest algorithm for the classification approach and an XGBoost
algorithm for a hybrid approach. The results show that XGBoost performs best, with an
accuracy rate of 74.10%.
The authors of article [15] tested machine learning algorithms such as support vector
machine, logistic regression, decision tree, random forest, gradient boost, K-nearest
neighbour, and the naïve Bayes algorithm. According to the results, the naïve Bayes and
random forest classifiers achieved 80% accuracy, outperforming the other algorithms.
CHAPTER 3
PROBLEM IDENTIFICATION
Many of the existing machine learning models for health care analysis concentrate
on one disease per analysis: for example, one for liver analysis, one for cancer analysis, one
for lung diseases, and so on. If a user wants to predict more than one disease, he or she has to
go through different sites. There is no common system where one analysis can perform more
than one disease prediction. Some of the models have lower accuracy, which can seriously
affect patients’ health. When an organization wants to analyse its patients’ health reports, it
has to deploy many models, which in turn increases cost as well as time. Some of the
existing systems consider very few parameters, which can yield false results.
Based on these risk factors, a risk score can be calculated to predict an individual's
likelihood of developing cardiovascular disease.
Traditional statistical methods are used to identify risk factors and calculate a risk
score, which can be used for disease prevention and management.
Overfitting: Overfitting occurs when a machine learning model is trained too closely
to a particular dataset and becomes overly specialized in predicting it. This can result
in poor generalization to new data and lower accuracy.
Limited data availability: Some diseases are rare, which means that there may not be
enough data available to train a machine learning model accurately. This can limit the
effectiveness of the system for predicting such diseases.
Cost and implementation: Implementing machine learning systems for healthcare can
be expensive and time-consuming. Hospitals and clinics may need to invest in new
hardware, software, and staff training to implement these systems effectively.
This project involved analyzing a multiple disease patient dataset with proper data
processing.
Different algorithms were used to train and predict, including Decision Trees, Random
Forest, SVM, Logistic Regression, and AdaBoost.
3.3 FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is
put forth with a very general plan for the project and some cost estimates. During
system analysis, the feasibility study of the proposed system is carried out. This is
to ensure that the proposed system is not a burden to the company. For feasibility
analysis, some understanding of the major requirements for the system is essential.
3.3.1 ECONOMIC FEASIBILITY
This study is carried out to check the economic impact that the system will have
on the organization. The amount of funds that the company can pour into the research
and development of the system is limited. The expenditures must be justified. The
developed system is well within the budget, and this was achieved because most of the
technologies used are freely available. Only the customized products had to be
purchased.
3.3.2 TECHNICAL FEASIBILITY
During this study, the analyst identifies the existing computer systems of the
concerned department and determines whether these technical resources are sufficient for the
proposed system or not. If they are not sufficient, the analyst suggests the configuration of the
computer systems that are required. The analyst generally pursues two or three different
configurations which satisfy the key technical requirements but which represent different
costs. During the technical feasibility study, financial resources and budget are also considered. The
main objective of technical feasibility is to determine whether the project is technically
feasible or not, provided it is economically feasible.
3.3.3 SOCIAL FEASIBILITY
This aspect of the study is to check the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but instead must accept it as a necessity. The level of acceptance by
the users solely depends on the methods that are employed to educate the user about the
system and to make him familiar with it. His level of confidence must be raised so that he is
also able to make some constructive criticism, which is welcomed, as he is the final user of
the system.
3.4 REQUIREMENTS
A software requirements specification (SRS) is a description of a software
system to be developed. It is defined after the business requirements specification
(CONOPS), also called the stakeholder requirements specification (StRS); another
related document is the system requirements specification (SyRS).
HARDWARE REQUIREMENTS
System processor : Intel Core i7.
Hard Disk : 512 SSD.
Monitor : 15" LED.
Mouse : Optical Mouse.
RAM : 8.0 GB.
Key Board : Standard Windows Keyboard.
SOFTWARE REQUIREMENTS
Operating system : Windows 10.
Coding Language : Python 3.9.
Front-End : Streamlit 3.7, Python
Back-End : Python 3.9.
Python Modules : Pickle 1.2.3
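Pickle is listed among the Python modules, presumably to persist the trained models so that the Streamlit front end can load them without retraining. A minimal sketch of that save/load cycle, assuming a scikit-learn model and a hypothetical file name diabetes_model.sav:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for a real disease dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the trained model to disk (file name is illustrative).
with open("diabetes_model.sav", "wb") as f:
    pickle.dump(model, f)

# Later, e.g. inside the Streamlit app, load it back and predict.
with open("diabetes_model.sav", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:1]))
```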
CHAPTER 4
SYSTEM DESIGN
This chapter provides information on the software development life cycle and the design
model, i.e. various UML diagrams and process specifications.
4.1 DESCRIPTION
Systems design is the process or art of defining the architecture, components,
modules, interfaces, and data for a system to satisfy specified requirements. One could see it
as the application of systems theory to product development. There is some overlap and
synergy with the disciplines of systems analysis, systems architecture and systems
engineering.
This design activity describes the system in narrative form using non-technical
terms. It should provide a high-level system architecture diagram showing a subsystem
breakout of the system, if applicable. The high-level system architecture or subsystem
diagrams should, if applicable, show interfaces to external systems. Supply a high-level
context diagram for the system and subsystems, if applicable. Refer to the requirements
traceability matrix (RTM) in the Functional Requirements Document (FRD) to identify the
allocation of the functional requirements into this design document.
This section describes any constraints in the system design (reference any trade-off
analyses conducted, such as resource use versus productivity, or conflicts with other
systems) and includes any assumptions made by the project team in developing the system
design.
This section describes any contingencies that might arise in the design of the system
that may change the development direction. Possibilities include lack of interface
agreements with outside agencies or unstable architectures at the time this document is
produced. Address any possible workarounds or alternative plans.
To design a system for multiple disease prediction based on lab reports using
machine learning, we can follow these steps:
1. Data Collection: The first component of the system involves collecting a large
dataset of medical records containing patient information and various medical
features related to multiple diseases. This dataset will be used to train the
machine learning models (a loading sketch is given below).
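A minimal sketch of this data-collection step, assuming the records are available locally as CSV files with hypothetical names such as diabetes.csv, heart.csv, and breast_cancer.csv:

```python
import pandas as pd

# Hypothetical file names; replace with the actual dataset paths.
datasets = {
    "diabetes": "diabetes.csv",
    "heart": "heart.csv",
    "breast_cancer": "breast_cancer.csv",
}

frames = {}
for name, path in datasets.items():
    df = pd.read_csv(path)
    frames[name] = df
    # Basic sanity checks before training: shape and missing values.
    print(name, df.shape, "missing:", int(df.isnull().sum().sum()))
```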
Machine learning has given computer systems the ability to automatically learn without
being explicitly programmed. In this, the author has used three machine learning algorithms
(Logistic Regression, KNN, and Naïve Bayes). The architecture diagram describes the high-
level overview of major system components and important working relationships.
Use case diagrams model behaviour within a system and help developers
understand what the user requires.
A use case diagram can be useful for getting an overall view of the system and
clarifying what users can do and, more importantly, what they cannot do.
A use case diagram consists of use cases and actors and shows the interactions
between the use cases and actors.
One of the primary uses of sequence diagrams is in the transition from requirements
expressed as use cases to the next and more formal level of refinement. Use cases are often
refined into one or more sequence diagrams.
From the sequence diagram in Fig. 4.2.3, the prediction system collects the data from the
actor and stores the data in the dataset. The prediction system processes the training data,
accesses the data from the dataset, applies the ML algorithms to the training and test data,
checks the user status and grand status values, and then produces the output.
A component diagram is used to break down a large object-oriented system into the
smaller components, so as to make them more manageable. It models the physical view of a
system such as executables, files, libraries, etc. that resides within the node.
It visualizes the relationships as well as the organization between the components
present in the system. It helps in forming an executable system. A component is a single unit
of the system, which is replaceable and executable. The implementation details of a
component are hidden, and it necessitates an interface to execute a function. It is like a black
box whose behavior is explained by the provided and required interfaces.
This diagram is also used as a communication tool between the developer and
stakeholders of the system. Programmers and developers use the diagrams to formalize a
roadmap for the implementation, allowing for better decision-making about task assignment
or needed skill improvements. System administrators can use component diagrams to plan
ahead, using the view of the logical software components and their relationship on system.
From Fig. 4.2.4, the component diagram has components such as user, system, dataset,
pre-processing, results, security, persistence, and database; these are the components of the
multiple disease prediction system.
The deployment diagram visualizes the physical hardware on which the software will
be deployed. It portrays the static deployment view of a system. It involves the nodes and
their relationships. It ascertains how software is deployed on the hardware. It maps the
software architecture created in design to the physical system architecture, where the software
will be executed as a node. Since it involves many nodes, the relationship is shown by
utilizing communication paths.
CHAPTER 5
IMPLEMENTATION
Balancing of Data
Imbalanced datasets can be balanced in two ways. They are Under Sampling and Over
Sampling.
Under Sampling
Dataset balance is done by the reduction of the size of the data set. This process is
considered when the amount of data is adequate.
Over Sampling
In Over Sampling, dataset balance is done by increasing the size of the dataset. This
process is considered when the amount of data is inadequate.
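A minimal sketch of both strategies using sklearn.utils.resample, assuming a pandas DataFrame df with a binary Outcome column (the column name and the toy data are assumptions):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data; in practice df would be the loaded disease dataset.
df = pd.DataFrame({"feature": range(100),
                   "Outcome": [0] * 80 + [1] * 20})

majority = df[df["Outcome"] == 0]
minority = df[df["Outcome"] == 1]

# Over-sampling: grow the minority class up to the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
oversampled = pd.concat([majority, minority_up])

# Under-sampling: shrink the majority class down to the minority size.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
undersampled = pd.concat([majority_down, minority])

print(oversampled["Outcome"].value_counts().to_dict())
print(undersampled["Outcome"].value_counts().to_dict())
```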
5.1 MODULES
Comparison of Models
We can say that the kNN model is good for our dataset, but SVM gives a higher AUC.
The higher the AUC, the better the performance of the model at distinguishing
between the positive and negative classes.
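A sketch of how such an AUC comparison between kNN and SVM could be computed with scikit-learn; the built-in breast cancer dataset stands in for the project's data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset; replace with the project's own features and labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    print(name, "AUC:", round(roc_auc_score(y_test, scores), 3))
```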
Classification Report
Accuracy Results
5.2.1 PYTHON
Python is a high-level, general-purpose and a very popular programming language.
Python programming language (latest Python 3) is being used in web development, Machine
Learning applications, along with all cutting edge technology in Software Industry. Python
Programming Language is very well suited for Beginners, also for experienced programmers
with other programming languages like C++ and Java.
Python is an interpreted, high-level, general-purpose programming language. Created
by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code
readability with its notable use of significant whitespace. Its language constructs and object-
oriented approach aim to help programmers write clear, logical code for small and large-scale
projects.
Python is dynamically typed and garbage-collected. It supports multiple programming
paradigms, including structured (particularly procedural), object-oriented, and functional
programming.
ADVANTAGES OF PYTHON
1. Easy to read, learn and code
Python is a high-level language and its syntax is very simple. It does not need any
semicolons or braces and looks like English. Thus, it is beginner-friendly. Due to its
simplicity, its maintenance cost is less.
2. Dynamic Typing
In Python, there is no need for the declaration of variables. The data type of the
variable gets assigned automatically during runtime, facilitating dynamic coding.
3. Free, Open Source
It is free and also has an open-source licence. This means the source code is available
to the public for free and one can do modifications to the original code. This modified code
can be distributed with no restrictions.
This is a very useful feature that helps companies or people to modify according to their needs
and use their version.
4. Portable
Python is also platform-independent. That is, if you write the code on one of the
Windows, Mac, or Linux operating systems, then you can run the same code on the other OS
with no need for any changes.
This is called Write Once Run Anywhere (WORA). However, you should be careful while
you add system dependent features.
5. Extensive Libraries
Python ships with a large collection of libraries. These libraries have different
modules/packages, and these modules contain different inbuilt functions and algorithms.
Using these makes the coding process easier and simpler.
5.2.2 STREAMLIT
Streamlit is an open-source Python framework for building web apps for Machine
Learning and Data Science. We can instantly develop web apps and deploy them easily using
Streamlit. Streamlit allows you to write an app the same way you write Python code.
Streamlit makes it seamless to work in the interactive loop of coding and viewing results in
the web app.
The best thing about Streamlit is that you don't even need to know the basics of web
development to get started or to create your first web application. So if you're somebody
who's into data science and you want to deploy your models easily, quickly, and with only a
few lines of code, Streamlit is a good fit.
You don't need to spend days or months to create a web app, you can create a really
beautiful machine learning or data science app in only a few hours or even minutes.
It is compatible with the majority of Python libraries (e.g. pandas, matplotlib, seaborn,
plotly, Keras, PyTorch, SymPy(latex)).
Streamlit is a popular open-source Python library that allows developers to build interactive
web applications for data science and machine learning projects with ease. Here are some of
the key features of Streamlit:
1. Ease of Use: Streamlit is easy to use for both beginners and advanced developers. Its
simple syntax allows developers to build interactive web applications quickly without
having to worry about the details of web development.
2. Data Visualization: Streamlit allows developers to create data visualizations such as
charts, plots, and graphs with just a few lines of code. It supports popular data
visualization libraries like Matplotlib, Plotly, and Altair.
3. Customizable UI Components: Streamlit provides various UI components that can be
customized to fit the needs of the application. These components include sliders,
dropdowns, buttons, and text inputs.
4. Real-time Updating: Streamlit automatically updates the web application in real-time
as the user interacts with it. This makes it easy to create dynamic applications that
respond to user input in real-time.
5. Integration with Machine Learning Libraries: Streamlit integrates seamlessly with
popular machine learning libraries like TensorFlow, PyTorch, and Scikit-learn. This
allows developers to build and deploy machine learning models with ease.
6. Sharing and Deployment: Streamlit makes it easy to share and deploy applications.
Developers can share their applications with others by simply sharing a URL.
Streamlit also provides tools for deploying applications to cloud services like Heroku
and AWS
ADVANTAGES OF STREAMLIT
Fast and Easy Development: Streamlit provides a simple and intuitive syntax that
makes it easy to build interactive web applications for data science and machine learning
projects. With Streamlit, developers can build applications faster and with less code.
Sharing and Deployment: Streamlit makes it easy to share and deploy applications.
Developers can share their applications with others by simply sharing a URL. Streamlit also
provides tools for deploying applications to cloud services like Heroku and AWS, making it
easy to scale applications as needed.
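A minimal sketch of a Streamlit page for one of the predictors, assuming a pickled model file named diabetes_model.sav trained on the four illustrative input fields shown below; it would be run with `streamlit run app.py`:

```python
# app.py -- minimal Streamlit front end (file name and field names are illustrative).
import pickle

import numpy as np
import streamlit as st

st.title("Multiple Disease Prediction")
st.header("Diabetes Prediction")

# Load the previously pickled model (assumed to expect these four features).
with open("diabetes_model.sav", "rb") as f:
    model = pickle.load(f)

glucose = st.number_input("Glucose", min_value=0.0)
bmi = st.number_input("BMI", min_value=0.0)
age = st.number_input("Age", min_value=0, step=1)
insulin = st.number_input("Insulin", min_value=0.0)

if st.button("Diabetes Test Result"):
    features = np.array([[glucose, bmi, age, insulin]])
    prediction = model.predict(features)[0]
    if prediction == 1:
        st.error("The person is likely diabetic")
    else:
        st.success("The person is not likely diabetic")
```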
5.2.3 JUPYTER NOTEBOOK
The Jupyter Notebook is an open-source web application that you can use to create and
share documents that contain live code, equations, visualizations, and text. Jupyter Notebook
is maintained by the people at Project Jupyter.
Jupyter Notebooks are a spin-off project from the IPython project, which used to have
an IPython Notebook project itself. The name, Jupyter, comes from the core programming
languages that it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which
allows you to write your programs in Python, but there are currently over 100 other kernels
that you can also use.
• The notebook web application: An interactive web application for writing and running
code interactively and authoring notebook documents.
• Kernels: Separate processes started by the notebook web application that runs users’
code in a given language and returns output back to the notebook web application. The kernel
also handles things like computations for interactive widgets, tab completion and
introspection.
• Notebook documents: Self-contained documents that contain a representation of all
content visible in the note-book web application, including inputs and outputs of the
computations, narrative text, equations, images, and rich media representations of objects.
Each notebook document has its own kernel.
• In-browser editing for code, with automatic syntax highlighting, indentation, and tab
completion/introspection.
• The ability to execute code from the browser, with the results of computations
attached to the code which generated them.
• Displaying the results of computation using rich media representations, such as HTML,
LaTeX, PNG, SVG, etc. For example, publication-quality figures rendered by the matplotlib
library can be included inline.
• In-browser editing for rich text using the Markdown markup language, which can
provide commentary for the code and is not limited to plain text.
• The ability to easily include mathematical notation within Markdown cells using
LaTeX, rendered natively by MathJax.
• Easy to convert: Jupyter Notebook allows users to convert notebooks into other formats
such as HTML and PDF. It also uses online tools and nbviewer, which allows you to render a
publicly available notebook directly in the browser.
5.3 ALGORITHMS
Decision Tree
Decision tree classifiers are used successfully in many diverse areas. Their most
important feature is the capability of capturing descriptive decision-making knowledge from
the supplied data. A decision tree can be generated from training sets. The procedure for such
generation, based on a set of objects S, each belonging to one of the classes C1, C2, …, Ck,
is as follows:
Step 1. If all the objects in S belong to the same class, for example Ci, the decision tree
for S consists of a leaf labeled with this class
Step 2. Otherwise, let T be some test with possible outcomes O1, O2,…, On. Each object
in S has one outcome for T so the test partitions S into subsets S1, S2,… Sn where each
object in Si has outcome Oi for T. T becomes the root of the decision tree and for each
outcome Oi we build a subsidiary decision tree by invoking the same procedure
recursively on the set Si.
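A minimal sketch of this procedure using scikit-learn's DecisionTreeClassifier; the built-in breast cancer dataset is used as a stand-in for the project's medical data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# max_depth limits the recursion of the splitting procedure described above.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

print("Accuracy:", round(accuracy_score(y_test, tree.predict(X_test)), 3))
```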
Gradient Boosting
Gradient boosting is a machine learning technique used in regression and classification
tasks, among others. It gives a prediction model in the form of an ensemble of weak
prediction models, which are typically decision trees [1][2]. When a decision tree is the weak
learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random
forest. A gradient-boosted trees model is built in a stage-wise fashion as in other boosting
methods, but it generalizes the other methods by allowing optimization of an arbitrary
differentiable loss function.
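A minimal gradient-boosted trees sketch with scikit-learn's GradientBoostingClassifier, under the same stand-in dataset assumption as the decision tree example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Each stage fits a shallow tree to the gradient of the loss function.
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=42)
gbt.fit(X_train, y_train)

print("Accuracy:", round(accuracy_score(y_test, gbt.predict(X_test)), 3))
```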
K-Nearest Neighbour (KNN)
KNN is a slow supervised learning algorithm; it takes more time to get trained.
Classification, as with other algorithms, is divided into two steps: training from data and
testing on new instances. The K-Nearest Neighbour working principle is based on the
assignment of a weight to each data point, which is called a neighbour. In K-Nearest
Neighbour, the distance to each point in the training dataset is calculated, and classification
is then done on the basis of a majority vote among the K nearest data points. Three types of
distances can be used in KNN: Euclidean, Manhattan, and Minkowski, of which Euclidean is
considered the most common. The following formula is used to calculate the distance.
In N dimensions, the Euclidean distance between two points p and q is
d(p, q) = √( Σ_{i=1}^{N} (p_i − q_i)² ),
where p_i (or q_i) is the coordinate of p (or q) in dimension i.
The algorithm for KNN is defined in the steps given below:
1. D represents the samples used in training and k denotes the number of nearest neighbours.
2. Create a super class for each sample class.
3. Compute the Euclidean distance for every training sample.
4. Classify the sample based on the majority class among its neighbours.
Algorithm Implementation:
Step 1 − For implementing any algorithm, we need a dataset. So, during the first step of KNN,
we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points. K can
be any odd integer.
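A minimal sketch covering both the Euclidean distance formula above and a scikit-learn KNeighborsClassifier, with feature scaling because KNN is distance based (the dataset and the value K = 5 are assumptions):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Euclidean distance between two points, as in the formula above.
def euclidean(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean([0, 0], [3, 4]))  # 5.0

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale features, then classify each test point by majority vote of its K neighbours.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(scaler.transform(X_train), y_train)

pred = knn.predict(scaler.transform(X_test))
print("Accuracy:", round(accuracy_score(y_test, pred), 3))
```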
Logistic Regression
This program computes binary logistic regression and multinomial logistic regression
on both numeric and categorical independent variables. It reports on the regression equation
as well as the goodness of fit, odds ratios, confidence limits, likelihood, and deviance. It
performs a comprehensive residual analysis including diagnostic residual reports and plots. It
can perform an independent variable subset selection search, looking for the best regression
model with the fewest independent variables. It provides confidence intervals on predicted
values and provides ROC curves to help determine the best cutoff point for classification. It
allows you to validate your results by automatically classifying rows that are not used during
the analysis.
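The paragraph above describes a statistical package; a roughly equivalent sketch in scikit-learn, fitting a logistic regression, reporting AUC, and using the ROC curve to suggest a cutoff point (the stand-in dataset and the cutoff criterion are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

# Predicted probabilities and the ROC curve used to pick a cutoff point.
probs = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
best = (tpr - fpr).argmax()  # Youden's J as one simple cutoff criterion

print("AUC:", round(roc_auc_score(y_test, probs), 3))
print("Suggested cutoff:", round(float(thresholds[best]), 3))
```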
Naïve Bayes
The naive bayes approach is a supervised learning method which is based on a
simplistic hypothesis: it assumes that the presence (or absence) of a particular feature of a
class is unrelated to the presence (or absence) of any other feature.
Yet, despite this, it appears robust and efficient. Its performance is comparable to other
supervised learning techniques. Various reasons have been advanced in the literature. In this
tutorial, we highlight an explanation based on the representation bias. The naive Bayes
classifier is a linear classifier, like linear discriminant analysis, logistic regression, and the
linear SVM (support vector machine). The difference lies in the method of estimating the
parameters of the classifier (the learning bias).
While the naive Bayes classifier is widely used in the research world, it is not
widespread among practitioners who want to obtain usable results. On the one hand,
researchers find that it is very easy to program and implement, its parameters are easy to
estimate, learning is very fast even on very large databases, and its accuracy is reasonably
good in comparison to the other approaches. On the other hand, the final users do not obtain
a model that is easy to interpret and deploy, and they do not understand the benefit of such a
technique.
Thus, we introduce a new presentation of the results of the learning process. The
classifier is easier to understand, and its deployment is also made easier. In the first part of
this tutorial, we present some theoretical aspects of the naive bayes classifier. Then, we
implement the approach on a dataset with Tanagra. We compare the obtained results (the
parameters of the model) to those obtained with other linear approaches such as the logistic
regression, the linear discriminant analysis and the linear SVM. We note that the results are
highly consistent. This largely explains the good performance of the method in comparison to
others. In the second part, we use various tools on the same dataset (Weka 3.6.0, R 2.9.2,
Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0). We try above all to understand the
obtained results.
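A minimal Gaussian naive Bayes sketch with scikit-learn, printing the classification report used elsewhere in this chapter (the stand-in dataset is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# GaussianNB assumes features are conditionally independent given the class.
nb = GaussianNB()
nb.fit(X_train, y_train)

print(classification_report(y_test, nb.predict(X_test)))
```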
Random Forest
Random forests or random decision forests are an ensemble learning method for
classification, regression and other tasks that operates by constructing a multitude of decision
trees at training time. For classification tasks, the output of the random forest is the class
selected by most trees. For regression tasks, the mean or average prediction of the individual
trees is returned. Random decision forests correct for decision trees' habit of overfitting to
their training set. Random forests generally outperform decision trees, but their accuracy is
lower than gradient boosted trees. However, data characteristics can affect their performance.
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho[1]
using the random subspace method, which, in Ho's formulation, is a way to implement the
"stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who
registered "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).
The extension combines Breiman's "bagging" idea and random selection of features,
introduced first by Ho [1] and later independently by Amit and Geman [13], in order to
construct a collection of decision trees with controlled variance.
Random forests are frequently used as "blackbox" models in businesses, as they generate
reasonable predictions across a wide range of data while requiring little configuration.
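A minimal random forest sketch with scikit-learn, again with the built-in breast cancer dataset standing in for the project's data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Each tree sees a bootstrap sample and a random subset of features per split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)

print("Accuracy:", round(accuracy_score(y_test, rf.predict(X_test)), 3))
```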
SVM
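SVM is listed among the classifiers used in this work but is not elaborated here; a minimal support vector classifier sketch with feature scaling, under the same stand-in dataset assumption as the other examples:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# RBF-kernel SVM; scaling matters because SVMs are sensitive to feature ranges.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)

print("Accuracy:", round(accuracy_score(y_test, svm.predict(X_test)), 3))
```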
ADA BOOST
AdaBoost, also called Adaptive Boosting, is a technique in Machine Learning used as
an Ensemble Method. The most common estimator used with AdaBoost is decision trees with
one level which means Decision trees with only 1 split. These trees are also called Decision
Stumps.
Decision stumps are among the simplest models we could construct in terms of complexity.
An even simpler baseline would just guess the same label for every new example, no matter
what it looked like. The accuracy of such a model would be best if we guess whichever
answer, 1 or 0, is most common in the data. If, say, 60 percent of the examples are 1s, then
we’ll get 60 percent accuracy just by guessing 1 every time.
Decision stumps improve upon this by splitting the examples into two subsets based on the
value of one feature. Each stump chooses a feature, say X2, and a threshold, T, and then splits
the examples into the two groups on either side of the threshold.
To find the decision stump that best fits the examples, we can try every feature of the input
along with every possible threshold and see which one gives the best accuracy. While it
naively seems like there are an infinite number of choices for the threshold, two different
thresholds are only meaningfully different if they put some examples on different sides of the
split. To try every possibility, then, we can sort the examples by the feature in question and try
one threshold falling between each adjacent pair of examples.
The algorithm just described can be improved further, but even this simple version is
extremely fast in comparison to other ML algorithms (e.g. training neural networks).
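A minimal AdaBoost sketch with depth-1 decision stumps as the weak learners, assuming a recent scikit-learn where the weak learner is passed via the estimator parameter:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Decision stumps (depth-1 trees) are the weak learners, as described above.
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=42)
ada.fit(X_train, y_train)

print("Accuracy:", round(accuracy_score(y_test, ada.predict(X_test)), 3))
```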
CHAPTER 6
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of test; each test type addresses a specific testing requirement.
6.1 Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
Test strategy and approach
Field testing will be performed manually and functional tests will be written in detail.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
6.2 Integration Testing
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects. The task of the integration test is to check that components or software applications,
e.g. components in a software system or – one step up – software applications at the company
level – interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
6.3 Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires
significant participation by the end user. It also ensures that the system meets the
functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
SYSTEM TESTING
TESTING METHODOLOGIES
Unit Testing
Unit testing focuses verification effort on the smallest unit of Software design that is
the module. Unit testing exercises specific paths in a module’s control structure to ensure
complete coverage and maximum error detection. This test focuses on each module
individually, ensuring that it functions properly as a unit. Hence, the naming is Unit Testing.
During this testing, each module is tested individually and the module interfaces are
verified for consistency with the design specification. All important processing paths are
tested for the expected results. All error handling paths are also tested.
Integration Testing
Integration testing addresses the issues associated with the dual problems of
verification and program construction. After the software has been integrated a set of high
order tests are conducted. The main objective of this testing process is to take unit-tested
modules and build a program structure that has been dictated by design.
The following are the types of Integration Testing:
1. Top-Down Integration
This method is an incremental approach to the construction of program structure.
Modules are integrated by moving downward through the control hierarchy, beginning with
the main program module. The module subordinates to the main program module are
incorporated into the structure in either a depth first or breadth first manner.
In this method, the software is tested from main module and individual stubs are
replaced when the test proceeds downwards.
2. Bottom-up Integration
This method begins the construction and testing with the modules at the lowest level in
the program structure. Since the modules are integrated from the bottom up, processing
required for modules subordinate to a given level is always available and the need for stubs is
eliminated. The bottom up integration strategy may be implemented with the following steps:
The low-level modules are combined into clusters that perform a specific
software sub-function.
A driver (i.e. the control program for testing) is written to coordinate test
case input and output.
The cluster is tested.
Drivers are removed and clusters are combined moving upward in the program
structure. The bottom-up approach tests each module individually, and then each
module is integrated with a main module and tested for functionality.
OTHER TESTING METHODOLOGIES
User Acceptance Testing
User Acceptance of a system is the key factor for the success of any system. The system under
consideration is tested for user acceptance by constantly keeping in touch with the prospective
system users at the time of developing and making changes wherever required. The system
developed provides a friendly user interface that can easily be understood even by a person
who is new to the system.
Output Testing
After performing the validation testing, the next step is output testing of the proposed
system, since no system could be useful if it does not produce the required output in the
specified format. Asking the users about the format required by them tests the outputs
generated or displayed by the system under consideration. Hence the output format is
considered in 2 ways – one is on screen and another in printed format.
Validation Checking
Validation checks are performed on the following fields.
Text Field
The text field can contain only a number of characters less than or equal to its size.
The text fields are alphanumeric in some tables and alphabetic in other tables. An incorrect
entry always flashes an error message.
Numeric Field:
The numeric field can contain only numbers from 0 to 9. An entry of any other character
flashes an error message. The individual modules are checked for accuracy and for what
they have to perform. Each module is subjected to a test run along with sample data. The
individually tested modules are integrated into a single system. Testing involves executing
the program with real data; the existence of any program defect is inferred from the output.
The testing should be planned so that all the requirements are individually tested.
A successful test is one that brings out the defects for inappropriate data and
produces an output revealing the errors in the system.
Preparation of Test Data
Taking various kinds of test data does the above testing. Preparation of test data plays a
vital role in the system testing. After preparing the test data the system under study is tested
using that test data. While testing the system by using test data errors are again uncovered and
corrected by using above testing steps and corrections are also noted for future use.
The most effective test programs use artificial test data generated by persons other
than those who wrote the programs. Often, an independent team of testers formulates a testing
plan, using the systems specifications.
The developed package has satisfied all the requirements specified in the software
requirement specification and was accepted.
USER TRAINING
Whenever a new system is developed, user training is required to educate them about the
working of the system so that it can be put to efficient use by those for whom the system has
been primarily designed. For this purpose the normal working of the project was demonstrated
to the prospective users. Its working is easily understandable and since the expected users are
people who have good knowledge of computers, the use of this system is very easy.
MAINTENANCE
This covers a wide range of activities including correcting code and design errors. To
reduce the need for maintenance in the long run, we have more accurately defined the user’s
requirements during the process of system development. Depending on the requirements, this
system has been developed to satisfy the needs to the largest possible extent. With
development in technology, it may be possible to add many more features based on the
requirements in future. The coding and designing is simple and easy to understand which will
make maintenance easier.
TESTING STRATEGY
A strategy for system testing integrates system test cases and design techniques into a
well-planned series of steps that results in the successful construction of software. The testing
strategy must incorporate test planning, test case design, test execution, and the resultant data
collection and evaluation. A strategy for software testing must accommodate low-level tests
that are necessary to verify that a small source code segment has been correctly implemented
as well as high level tests that validate major system functions against user requirements.
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification design and coding. Testing represents an interesting anomaly
for the software. Thus, a series of testing are performed for the proposed system before the
system is ready for user acceptance testing.
SYSTEM TESTING
Software, once validated, must be combined with other system elements (e.g.
hardware, people, databases). System testing verifies that all the elements are proper and that
the overall system function and performance is achieved. It also tests to find discrepancies
between the system and its original objectives.
6.4 Manual Testing
Test Case for Brain Disease Prediction
Test Description: The user enters the symptoms. The user answers the sub-questions.
Actions: The user checks the symptoms and attributes and presses the submit button for diagnosis.
Table 6.1
Table 6.1 shows the test case for the values entered. Here the main importance is given to
checking that the entered values are compared with the dataset values. If the values match,
the test is passed.
Test Description: The user enters the symptoms. The user answers the sub-questions.
Actions: The user checks the symptoms and attributes and presses the submit button for diagnosis.
Table 6.2
Table 6.2 shows the test case for the values entered. Here the main importance is given to
checking that the entered values are compared with the dataset values. If the values match,
the test is passed.
Test Description: The user enters the symptoms. The user answers the sub-questions.
Actions: The user checks the symptoms and attributes and presses the submit button for diagnosis.
Table 6.3
Table 6.3 shows the test case for the values entered. Here the main importance is given to
checking that the entered values are compared with the dataset values. If the values match,
the test is passed.
CHAPTER 7
RESULTS
7.1 DIABETES PREDICTION
CHAPTER 8
CONCLUSION
While there are challenges and limitations to the use of machine learning in healthcare,
such as the risk of bias and the need for diverse and representative data, ongoing research and
development in this field is helping to address these challenges and unlock the full potential of
multiple disease prediction using machine learning.
As technology continues to evolve and more data becomes available, it is likely that
machine learning algorithms will become increasingly sophisticated and accurate, leading to
even better patient outcomes and more personalized medicine. Multiple disease prediction
using machine learning has the potential to transform healthcare, and it is an exciting area of
research that holds great promise for the future.
CHAPTER 9
FUTURE WORK
Addressing data bias: As with all machine learning algorithms, bias in the training data
can lead to inaccurate predictions and perpetuate health disparities. Future work should focus
on developing methods to address and mitigate data bias, such as using more diverse and
representative datasets, and incorporating fairness and equity considerations into the algorithm
development process.
CHAPTER 10
REFERENCES