Multiple Disease Prediction Using Machine Learning and Deep Learning With The Im
Abstract—Disease prediction is crucial in healthcare, enabling professionals to diagnose and treat diseases more effectively. In recent years, machine learning and web technology have emerged as powerful tools for predicting various diseases. Machine learning algorithms can analyze large and complex datasets to learn patterns and relationships in the data, enabling them to make accurate disease predictions. On the other hand, web technology can be used to deploy machine learning models on platforms such as websites or mobile apps, making them accessible to users. This research addresses the need for early detection and diagnosis of diseases like diabetes, heart disease, Parkinson’s disease, lung cancer and brain stroke. To achieve this goal, various machine learning classification algorithms, including Support Vector Machine, k-nearest neighbors, Decision Tree, Random Forest, AdaBoost and Gaussian Naive Bayes, among others, and one deep learning model, Long Short-Term Memory, are implemented. The main goal of this paper is to create a web technology for predicting multiple diseases and to outline the steps involved in building such a system.

Keywords—machine learning, deep learning, web technology, multiple disease prediction.

I. INTRODUCTION

As society progresses, people’s lifestyles and environmental conditions are gradually changing. This leads to an increase in hidden risks connected with various diseases. Major diseases like diabetes, heart disease and brain stroke have a serious impact on a global scale. In recent years, healthcare has witnessed noticeable advancements in applying machine learning and deep learning techniques for disease prediction. In previous work, researchers have explored various machine learning techniques for disease prediction. Principal component analysis was used by Dhomse et al. to study the prediction of particular diseases [1]. Heart diseases were predicted using multiple linear regression by Polaraju and Prasad [2]. Convolutional neural networks were the main focus of Ambekar and Phalnikar's study on illness risk prediction [3]. Naive Bayes and random forest were used by Jackins et al. to predict clinical illness [4]. A human-machine interface for illness pre-diagnosis was created by Gupta et al. [6]. Using machine learning techniques, Mohit et al. explored the identification of several diseases [7]. These previous works did not propose a usable system that could easily be used in daily life, because they did not develop any software or web application that users could use. In our proposed work, we introduce web-based software that is easy to use. This technology can potentially transform how we approach healthcare by enabling early detection and personalized treatment plans. Disease prediction using machine learning and web technology is a growing field that has the potential to revolutionize healthcare. Machine learning algorithms can analyze medical data to learn patterns and relationships in the data, enabling them to make accurate predictions about diseases. On the other hand, web technology can be used to deploy machine learning models on platforms such as websites or mobile apps, making them accessible to users. In this article, we discuss machine learning, deep learning algorithms and web technology for predicting multiple diseases and outline the steps involved in building such a system. We will also highlight some challenges and opportunities of using these technologies in healthcare.

The rest of this paper is arranged as follows. The related work in this field is discussed in Section II. Section III describes the methodology followed to carry out this research. After that, the results obtained are highlighted in Section IV and a comparison of these results is featured in Section V. Finally, the future work and conclusion are discussed in Section VI and Section VII at the end of the paper.

II. LITERATURE REVIEW

Numerous studies have already been done on predicting diseases using different machine learning techniques and algorithms that medical institutions can use. This part of the paper offers insights into some of those studies, their techniques and their results. Dhomse Kanchan B. et al. used Naive Bayes classification (NB), Decision Trees (DT), and Support Vector Machines (SVM) for disease prediction and obtained 34.89% accuracy for diabetes and 53% for heart disease. The study explains how Principal Component Analysis (PCA) was used to determine the minimum set of attributes needed to improve the accuracy of different supervised machine learning algorithms for heart disease prediction [1]. K. Polaraju et al. used multiple linear regression models to predict the likelihood of developing heart disease, with an accuracy rate of about 75%. They used only a statistical model for predicting heart disease. No machine learning model was applied here
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2024 at 03:35:32 UTC from IEEE Xplore. Restrictions apply.
[2]. In another research work by Sayali Ambekar et al., convolutional neural network-based disease prediction and other machine learning algorithms were used. By using Naive Bayes, it was possible to predict breast cancer with an accuracy of 82%. They predicted heart disease in three different categories of risk: high, low and medium. They used only two algorithms, KNN and Naive Bayes, but did not develop any software [3]. Naive Bayes and Random Forest algorithms were used by V. Jackins et al. to classify diseases. Their accuracy rates for diabetes, coronary heart disease, and cancer data were 74.46%, 82.35%, and 63.74% respectively. They used only two algorithms and also did not develop any software [4]. To predict diseases, Pahulpreet Singh Kohli et al. used Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), and Adaptive Boosting (AdaBoost). Accuracy levels of 95.71% for breast cancer, 84.42% for diabetes, and 87.12% for heart disease were achieved through this work. The limitation of their work was that they did not implement their machine learning model in any software [5]. Prajval Gupta et al.’s research developed two frameworks for disease pre-diagnosis using machine learning techniques including ANN, SVM, and Decision Tree Induction. They used ANN but did not develop any software. The overall accuracy of the system came out to be about 89% [6]. An online application for illness prediction was created by Indukuri Mohit et al. utilizing k-nearest neighbors, SVM, and Logistic Regression. They reported accuracy rates of 76.60% for diabetes, 94.55% for breast cancer, and 83.84% for heart disease [7]. Saumya Gupta and Supriya Raheja proposed a method to predict stroke by using various machine learning algorithms. Accuracies of 95%, 96%, and 97% were attained using AdaBoost, XGBoost, and Random Forest Classifier respectively [8]. The accuracy of k-nearest neighbors, Decision Trees, Linear Regression, and SVM for predicting heart disease was compared by Archana Singh et al., and 83% accuracy was attained with SVM [9]. A diabetes prediction system based on machine learning was developed by Priyanka Sonar and her colleagues. They obtained an accuracy of 85% for Decision Tree, 77% for Naive Bayes, and 77.3% for SVM. They could have used more algorithms and performance metrics to achieve higher accuracy and identify the best model [10]. Chinmayi Thallam et al. proposed a method that involved comparing various classifiers such as Support Vector Machine (SVM), k-nearest neighbors (KNN), Random Forest (RF), Artificial Neural Networks (ANN) and a hybrid model named Voting classifier. With 0.8 of the data used for training and 0.2 for testing, Support Vector Machine gave 95% accuracy, Random Forest gave 97.5%, k-nearest neighbors reached 97%, Neural Networks gave 95.99% and Voting Classifier gave 99.5% [11]. Overall, the previous works mentioned above did not develop any software and did not use any database system to collect user input data for retraining the machine learning model. Also, most of them used only a few machine learning algorithms.

A. Aim of this work

The main goal of this work is to create a user-friendly web application where people can predict multiple diseases accurately and simultaneously. By combining several disease detection techniques, the approach eliminates the need for additional websites or software.

B. Real-life Application

The proposed solution of this paper can improve the chances of detecting various diseases at early stages. With this solution, several separate websites and pieces of software become optional. Patients can submit their data on a website, and the data is subsequently stored on an online server. The computer model, which analyzes the data and forecasts the possibility of sickness, is hosted on a cloud server.

III. METHODOLOGY

A. Data Collection

The datasets for this paper have been collected from Kaggle (including the Pima Indians Diabetes Database) and the UCI Machine Learning Repository [12][13][14][15][16].

TABLE I. DATASET OVERVIEW
Dataset name    Number of Features    Number of Instances    Data Format    Number of Classes
Diabetes        9                     768                    CSV            2
Heart           14                    303                    CSV            2
Lung Cancer     16                    309                    CSV            2
Parkinson’s     24                    195                    CSV            2
Brain Stroke    12                    5110                   CSV            2

The five datasets used in this work are summarized in Table I. Each dataset includes details on a particular medical problem and is in comma-separated values (CSV) format. Across the diseases, the datasets have between 9 and 24 features and between 195 and 5110 instances. Each dataset poses a binary classification task with two classes. These datasets are useful tools for performing medical studies and creating forecasting models for identifying and understanding various health issues.

B. Mathematical Explanation and Evaluation of Algorithms

Different machine learning and deep learning algorithms have been used, such as Quadratic Discriminant Analysis (QDA), k-nearest neighbors (KNN), SVM, Linear Discriminant Analysis (LDA), Naive Bayes algorithm (NBA), Decision Tree (DT), Random Forest algorithm (RF), AdaBoost (AB), k-means clustering (KMC), XGBoost (XGB), Gradient Boosting (GB), Neural Network, and Long Short-Term Memory (LSTM). RMSE (Root Mean Square Error), MAE (Mean Absolute Error), Recall, Precision, F1, R2 (R-squared), and k-fold Accuracy are commonly used metrics in various fields, particularly in machine learning and statistics. These metrics are used in this paper too, and their mathematical definitions are given below.

$$\mathrm{RMSE}(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^{2}} \qquad (1)$$

Equation (1) represents the Root Mean Square Error (RMSE), which quantifies the overall accuracy of the model's predictions and provides a single value to compare different models. Lower RMSE values indicate better predictive performance. Here, $m$ is the total number of observations in the data set, $h(x^{(i)})$ is the predicted value for the ith observation, and $y^{(i)}$ is the actual value for the ith observation.

$$\mathrm{MAE}(X,h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)}) - y^{(i)}\right| \qquad (2)$$
Equation (2) shows the Mean Absolute Error (MAE), which measures the average absolute difference between the predicted and actual values. Here, $m$ is the total number of observations in the data set, $h(x^{(i)})$ is the predicted value for the ith observation, and $y^{(i)}$ is the actual value for the ith observation. It gives a general gauge of prediction accuracy and, like RMSE, is used in regression tasks.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

In binary classification tasks, recall is a statistic that is frequently used, particularly when the goal is to identify the positive class (as in the case of diagnosing diseases). It is the proportion of correctly identified positives to all actual positives, so it measures how well the model identifies all instances of positive data. In (3), True Positives (TP) are the number of examples the model correctly identified as positive, and False Negatives (FN) are the number of examples incorrectly identified as negative.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

were correctly predicted, the FPs are the instances that were incorrectly predicted as positive, the TNs are the instances that were correctly predicted as negative, and the FNs are the instances that were incorrectly predicted as negative.

C. Software Architecture

The software architecture in Fig. 1 is a three-tier architecture, with a web app, a local server, and a database. The web app is responsible for serving its pages to users through a local server. The local server is responsible for handling requests from the user's computer and communicating with the web server. The database stores all of the data that is used by the web app, including the user's input. The machine learning model is used to make predictions based on the user's data. When the user interacts with the web app, their data is sent to the local server. The local server then sends the data to the web app, which sends it to the machine learning model.
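The evaluation metrics in equations (1) to (4) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names and the sample label lists are hypothetical, and labels are assumed to be encoded as 1 (disease present) and 0 (disease absent).

```python
import math


def rmse(predicted, actual):
    """Root mean square error, following equation (1)."""
    m = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / m)


def mae(predicted, actual):
    """Mean absolute error, following equation (2)."""
    m = len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / m


def confusion_counts(predicted, actual):
    """Return (TP, FP, TN, FN) for binary labels encoded as 0 and 1."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    return tp, fp, tn, fn


def recall(tp, fn):
    """Recall, following equation (3); 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Precision, following equation (4); 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical labels for eight patients (illustration only).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(predicted, actual)
print("RMSE:", rmse(predicted, actual))
print("MAE:", mae(predicted, actual))
print("Recall:", recall(tp, fn))
print("Precision:", precision(tp, fp))
```

The guards against a zero denominator are a design choice for this sketch; a production system might prefer to report such cases as undefined.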
3) Heart Disease: Features include age (integer), sex (binary), cp (integer), trestbps (integer), chol (integer), fbs (binary), restecg (integer), thalach (integer), exang (binary), oldpeak (float), slope (integer), ca (integer), thal (integer). The target variable target (integer) denotes the presence (1) or absence (0) of heart disease.
4) Brain Stroke: Categorical features such as gender, work type, Residence type, smoking status, and binary features like hypertension, heart disease, ever married. Numerical features include age, average glucose level, and bmi. The target variable stroke (integer) represents the presence (1) or absence (0) of a stroke.
5) Lung Cancer: Categorical feature Gender, numerical feature age, and binary features smoking, yellow finger, anxiety, peer pressure, chronic disease, fatigue, allergy, wheezing, alcohol consuming, coughing, shortness of breath, swallowing difficulty, chest pain. The target variable lung cancer (categorical) indicates the presence or absence of lung cancer.

User Account: In our system, there will be three modules: Admin, User (Patient), and Doctor. The website's homepage will have two options for every user: login and sign up. Every new user has to be registered through the admin. After successful registration, the user will be able to log in and predict the diseases.

User information database: All user information will be recorded in Django's database. This data will be used to retrain the model. After retraining, the model is expected to predict diseases more accurately.

IV. RESULT

A. Performance Evaluations of Machine Learning

This section gives an overview of the accuracy attained by several machine learning algorithms applied to the data sets for the prediction of diabetes, Parkinson’s disease, heart disease, brain stroke and lung cancer.

TABLE II. ACCURACY OF MACHINE LEARNING ALGORITHMS
Algorithm    Diabetes    Parkinson’s    Heart    Brain Stroke    Lung Cancer
SVM          0.83        0.95           0.83     0.77            0.96
DT           0.97        0.88           0.79     0.97            0.95
KNN          0.74        0.70           0.74     0.97            0.95
QDA          0.77        0.95           0.84     0.55            0.91
LDA          0.82        0.82           0.84     0.77            0.96
RF           0.93        0.93           0.82     0.99            0.96
AB           0.83        0.83           0.79     0.82            0.98
XGB          0.90        0.90           0.80     0.97            0.98
GB           0.88        0.88           0.80     0.84            0.96
NBA          0.80        0.80           0.81     0.64            0.95
KMC          0.58        0.58           0.80     0.62            0.46

(NBA), and K-Means Clustering (KMC). The accuracy numbers in the table show how well each algorithm performed on each disease diagnosis task. A higher accuracy generally indicates that the algorithm is better at assigning instances to the correct disease category. It is important to remember that accuracy figures do not, by themselves, give a comprehensive evaluation of an algorithm's performance. To gain a deeper understanding of an algorithm's diagnostic capabilities, other assessment metrics, including precision, recall, and F1-score, should be considered.

B. Performance Evaluations of Deep Learning

We also implemented deep learning to observe the accuracy and other performance measures for disease prediction. The quantity and quality of the data, the model's level of complexity, and the particular condition being predicted can all affect how well these models work. Fig. 2, 3 and 4 show accuracy changes per epoch for the Long Short-Term Memory (LSTM) model used for the different disease predictions.

Fig. 2. The graph showing loss over epochs for diabetes and heart disease prediction.

Fig. 3. The graph showing loss over epochs for lung cancer and Parkinson’s prediction.
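As a small illustration of why accuracy figures alone can be misleading, the sketch below compares two hypothetical classifiers on an imbalanced test set of 1000 patients, 50 of whom have the disease. The counts are invented for illustration and are not results from this work; accuracy and F1 follow their standard definitions, and the helper name `summarize` is an assumption.

```python
def summarize(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical classifier that predicts "no disease" for every patient.
always_negative = summarize(tp=0, fp=0, tn=950, fn=50)

# Hypothetical classifier that finds 40 of the 50 patients,
# at the cost of 60 false alarms.
screening = summarize(tp=40, fp=60, tn=890, fn=10)

for name, scores in [("always negative", always_negative),
                     ("screening", screening)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this made-up case the first classifier has the higher accuracy even though it never detects a single patient, which is why recall and F1 are worth reporting alongside accuracy for disease prediction.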
Fig. 4. The graph showing loss over epochs for brain stroke
Fig. 5. Web app home page
Fig. 10. Confusion matrices for diabetics and heart disease
prediction.
E. Security Aspects on User Personal Data

A multiple disease prediction program must provide strong protection for users' personal data. Data must be encrypted while being transmitted and stored, and HTTPS should be used for secure communication. To secure sensitive information, robust authentication, role-based access control, and protection against SQL injection must be provided. The admin should manage user sessions securely, abide by privacy guidelines, and get informed consent before collecting any data. To reduce the risk of data loss, frequent backups should be made, and the developers should keep up with security patch updates. The developers should also follow secure coding practices, keep an eye out for breaches, and consider data anonymization to preserve users' privacy. To maintain user confidence and data integrity, compliance with data protection rules is crucial.

V. RESULT COMPARISON AND CONTRIBUTION

The analysis of this work is conducted on a real-time database using machine learning (ML) models that were trained on the same datasets and then deployed. The highest accuracies obtained in this research are 93% for diabetes using the Random Forest algorithm, 95% for Parkinson's using the Random Forest algorithm and QDA, 84% for heart disease using LDA and QDA, 98% for lung cancer using AdaBoost and XGBoost, and 99% for brain stroke using the Random Forest algorithm. Previously, Priyanka Sonar et al. suggested a machine learning based diabetes prediction system in 2019. They obtained accuracies of 85%, 77% and 77.3% using Decision Tree, Naive Bayes, and SVM respectively [10]. Moreover, Dhomse Kanchan B. et al. researched special disease prediction using principal component analysis with machine learning algorithms such as Naive Bayes classification, Decision Tree, and Support Vector Machine in 2017. This approach obtained a diabetes accuracy of 34.89% and a heart disease accuracy of 53% [1]. Overall, our work's accuracy is higher than that of most of the papers we have reviewed [2][3][4][6][7].

VI. FUTURE WORK

Further work will mainly focus on providing medical assistance and proper medication to patients as soon as possible, so as to build the best infrastructure and the quickest route to care in the medical sector. Many possible improvements could be explored to diversify the research by discovering and considering extra features. Due to time limitations, the following work remains to be performed in the future. We plan to add more diseases, use more classification techniques, and apply different discretization techniques. We would like to use different rules, such as association rules, and various algorithms, such as clustering algorithms. In the future, we also intend to use filter-based feature selection methods in order to achieve more appropriate and functional results.

VII. CONCLUSION

This study aims to build a multi-disease prediction model that uses machine learning algorithms to accurately identify illnesses based on patient symptoms. With this method, users can predict many diseases simultaneously without extra software or website browsing. Early illness identification can lead to increased life expectancy and a lower financial burden. A model for multiple illness prediction can estimate the likelihood of many diseases and reduce death rates. This paper uses different machine learning algorithms to measure performance, and future work may involve adding more diseases trained with machine learning and deep learning models.

REFERENCES

[1] B. Dhomse Kanchan and M. Mahale Kishor, "Study of machine learning algorithms for special disease prediction using principal component analysis," Proceedings - International Conference on Global Trends in Signal Processing, Information Computing and Communication, ICGTSPICC 2016, pp. 5–10, Jun. 2017, doi: 10.1109/ICGTSPICC.2016.7955260.
[2] K. Polaraju and D. Prasad, "Prediction of Heart Disease using Multiple Linear Regression Model," 2017.
[3] S. Ambekar and R. Phalnikar, "Disease Risk Prediction by Using Convolutional Neural Network," Proceedings - 2018 4th International Conference on Computing, Communication Control and Automation, ICCUBEA 2018, Jul. 2018, doi: 10.1109/ICCUBEA.2018.8697423.
[4] V. Jackins, S. Vimal, M. Kaliappan, and M. Y. Lee, "AI-based smart prediction of clinical disease using random forest classifier and Naive Bayes," Journal of Supercomputing, vol. 77, no. 5, pp. 5198–5219, May 2021, doi: 10.1007/s11227-020-03481-x.
[5] P. S. Kohli and S. Arora, "Application of machine learning in disease prediction," 2018 4th International Conference on Computing Communication and Automation, ICCCA 2018, Dec. 2018, doi: 10.1109/CCAA.2018.8777449.
[6] P. Gupta, A. Suryavanshi, S. Maheshwari, A. Shukla, and R. Tiwari, "Human-machine interface system for pre-diagnosis of diseases using machine learning," ACM International Conference Proceeding Series, vol. Part F137705, pp. 71–75, Apr. 2018, doi: 10.1145/3220511.3220525.
[7] I. Mohit, K. S. Kumar, A. U. K. Reddy, and B. S. Kumar, "An Approach to detect multiple diseases using machine learning algorithm," J Phys Conf Ser, vol. 2089, no. 1, p. 012009, Nov. 2021, doi: 10.1088/1742-6596/2089/1/012009.
[8] S. Gupta and S. Raheja, "Stroke Prediction using Machine Learning Methods," Proceedings of the Confluence 2022 - 12th International Conference on Cloud Computing, Data Science and Engineering, pp. 553–558, 2022, doi: 10.1109/CONFLUENCE52989.2022.9734197.
[9] A. Singh and R. Kumar, "Heart Disease Prediction Using Machine Learning Algorithms," International Conference on Electrical and Electronics Engineering, ICE3 2020, pp. 452–457, Feb. 2020, doi: 10.1109/ICE348803.2020.9122958.
[10] P. Sonar and K. Jaya Malini, "Diabetes prediction using different machine learning approaches," Proceedings of the 3rd International Conference on Computing Methodologies and Communication, ICCMC 2019, pp. 367–371, Mar. 2019, doi: 10.1109/ICCMC.2019.8819841.
[11] C. Thallam, A. Peribonca, S. S. T. Raju, and N. Sampath, "Early Stage Lung Cancer Prediction Using Various Machine Learning Techniques," Proceedings of the 4th International Conference on Electronics, Communication and Aerospace Technology, ICECA 2020, pp. 1285–1292, Nov. 2020, doi: 10.1109/ICECA49313.2020.9297576.
[12] Heart disease (no date) UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/dataset/45/heart+disease
[13] Akbasli, I.T. (2022) Brain stroke prediction dataset, Kaggle. Available at: https://www.kaggle.com/datasets/zzettrkalpakbal/full-filled-brain-stroke-dataset
[14] Bhat, M.A. (2021) Lung cancer, Kaggle. Available at: https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer
[15] Learning, U.M. (2016) Pima Indians Diabetes Database, Kaggle. Available at: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
[16] Ukani, V. (2020) Parkinson’s disease data set, Kaggle. Available at: https://www.kaggle.com/datasets/vikasukani/parkinsons-disease-data-set