2024 International Conference on Emerging Smart Computing and Informatics (ESCI)

AISSMS Institute of Information Technology, Pune, India. Mar 5-7, 2024

Advancements in Healthcare: Predictive Modeling for Early Disease Diagnosis and Patient Record Management

Renu Kachhoria, Dept. of AI & DS, BRACT’s Vishwakarma Institute of Information Technology, Pune, India, [email protected]
Prithviraj Raut, Dept. of Information Technology, BRACT’s Vishwakarma Institute of Information Technology, Pune, India, [email protected]
Amar Buchade, Dept. of AI & DS, BRACT’s Vishwakarma Institute of Information Technology, Pune, India, [email protected]

Abstract: This study introduces Medical Unique Identification, a secure patient verification process designed for accessing medical records. Using data mining techniques and machine learning algorithms, we focus on early disease prediction based on user-provided symptoms. The developed classifier system aids physicians in early disease diagnosis, thus enhancing patient care. The machine learning model is trained on a dataset comprising 4920 patient records, each featuring 132 distinct symptoms. The study adopts the ID3 root-to-leaf approach for constructing the decision tree. Additionally, the Naïve Bayes classifier, based on Bayes' theorem, is applied to determine the probability of specific events. Patients can select symptoms from a list, and the disease with the highest confidence score is predicted and displayed. The system also integrates systematic storage of patient consultations, contributing to a more personalized and comprehensive patient history for improved healthcare and community services.

Keywords: Medical Unique Identification, Decision Tree, Iterative Dichotomiser 3 (ID3), Random Forest, Gaussian, Confusion Matrix

979-8-3503-0661-3/24/$31.00 ©2024 IEEE

I. INTRODUCTION

In healthcare, it is important for doctors to know an individual's medical data. Medical records inform clinicians about a patient's current and past medical history, enabling them to make informed decisions in the future. In this technology-driven world, it is difficult to maintain hard copies of medical reports and keep them safe for several years. Paper records can be damaged, are hard to keep in good condition, and are easily misplaced.

With these problems in mind, we have built a web application that includes Medical Unique Identification of the user and provides facilities such as storing the user's medical records (demographics, past medical history, medication, medical progress notes, etc.), which the doctor can retrieve at the time of check-ups. This keeps the patient's data secure in one place, accessible simply by entering credentials, with no need to maintain or handle any hard copy. In addition, the user can book an online appointment with a doctor through the application. The clinician may accept or reject the appointment according to his or her availability. The status of the appointment (accepted or rejected), along with details including the patient's name, date, and time, is stored in the database and shown to the user on his or her interface.

The paper demonstrates how the disease prediction model is built using machine learning classification techniques, namely Random Forest, Naïve Bayes, and Decision Tree. These contribute to proactive decision-making and informed healthcare practice, and they make the system capable of predicting multiple diseases that the user might have on the basis of current symptoms. The user selects the symptoms he or she suffers from, and the machine learning model predicts the disease the user is likely to have. Once the prediction is made, the user can consult a medical specialist for that disease and book an appointment according to the doctor's availability. This helps professionals take informed decisions and reduces treatment delay. The machine learning model is able to detect 41 types of diseases.

II. LITERATURE SURVEY

K. Pingale et al. [1] aim to forecast illness using symptoms provided by the user. The system takes symptoms as input and produces predictions of potential diseases. The precision of disease risk prediction in this model depends on the diverse features contained in the hospital data. The dataset, consisting of 768 instances, may not adequately represent the broader population.

Malpani et al. [2] report that the model created with their proposed method demonstrates superior accuracy compared to existing ones. Specifically, the prediction accuracy for disease detection reaches 96% with Support Vector Machine, 95% with the Naïve Bayes Classifier, and 95.7% with the Random Forest Classifier. The fundamental objective in developing this system is to provide health assistance to individuals.

Kohli et al. [3] contribute to enhancing the classification and recognition systems employed in disease diagnosis. These improvements offer information that supports medical experts in the early detection of life-threatening diseases, thereby substantially increasing patient survival rates. The authors employed various
classification algorithms, each with its unique advantages, and performed feature selection for each dataset through backward modelling using the p-value test. Insufficient automation in processes like feature selection, model fitting, and data munging hinders the attainment of optimal prediction accuracy.

P. Suryachandra et al. [4] explore the use of diverse machine learning algorithms for disease classification, focusing on evaluation criteria such as Root Mean Squared Error (RMSE) and Relative Absolute Error (RAE). These classifiers, particularly the RF Classifier along with the Naïve Bayes algorithm, prove effective in accurately predicting breast cancer, even with limited information. The study lacks a thorough discussion of how an imbalance between the number of features and the number of cases may affect the performance of machine learning algorithms.

Jianliang Gao et al. [5] aim to enhance the accuracy of predicting similar diseases by drawing on multiple data sources. This is accomplished by building a similarity scoring system that incorporates data from the structural and semantic dimensions of various illness and biological entity interaction networks. The study presents a novel method of calculating illness similarity using multiple distinct illness information networks. Before introducing their method, the authors thoroughly examine existing approaches such as Resnik and SemFunSim. The lack of ontologies for every disease restricts the ability to predict similar diseases using specific methods.

Repaka et al. [6] focus on using Naïve Bayes to implement heart disease prediction. Eighty percent of the dataset is set aside for training, and the remaining twenty percent for testing. The study does not thoroughly examine the ethical and privacy issues of using sensitive medical data for predictions.

Kaan Uyar et al. [7] propose computational techniques for heart disease analysis using genetic algorithms and recurrent fuzzy neural networks (RFNNs). The study includes 297 patient records, of which 252 are used for training and 45 for testing. An accuracy of 96.78% is attained in the testing phase. Experiments on the heart disease testing dataset include computations of accuracy, RMSE (Root Mean Square Error), and the chance of misclassification. The study uses the UCI Cleveland heart disease dataset, which may not fully represent the diversity of patient populations globally.

Animesh Hazra et al. [8] highlight the substantial volume of information generated daily by the healthcare sector, much of which goes untapped and underutilized. Insufficient tools are available to derive significant findings from this vast data repository for tasks such as clinical disease detection. The paper explores hybrid approaches involving various mining algorithms and seeks to draw conclusions regarding the most effective technique(s) for this purpose. Choosing appropriate methods for data cleaning and appropriate classification algorithms will enhance accuracy.

Ashish Chhabbi et al. [9] conducted a comprehensive study of various data mining techniques aimed at extracting and uncovering latent patterns from databases, particularly to address intricate questions related to heart disease prediction. The dataset was acquired from the UCI repository. The approach used an enhanced k-means algorithm and the NB algorithm. The study does not discuss validation of the proposed models on a different dataset.

Monika Gandhi et al. [10] examine how data mining methods, in conjunction with other approaches, help to reveal hidden patterns in large databases, supporting healthcare institutions in making well-informed decisions. Using data mining classification techniques including DT, neural networks, and NB, the research aims to facilitate data discovery, extraction, and classification within the healthcare domain. Exploring the combination of deep learning methods and filter-based feature engineering could provide new insights to boost the accuracy and efficiency of prediction models.

III. SYSTEM IMPLEMENTATION AND FUNCTIONALITIES

A. System Design and Architecture

Fig. 1. System Architecture

B. Feature Specifications
• User Authentication and Authorization: The purpose of user authentication is to ensure secure access to the system. We have implemented a robust user authentication mechanism to verify the identity of patients and doctors.
• Patient Profile Management: This enables patients to manage the personal information collected at the time of system registration. It includes storing relevant health information and preferences; patients are able to create, view, and update their profiles.
• Disease Check-up Module: Facilitate disease prediction based on user-input symptoms. Implement an intuitive interface for patients to select symptoms. Integration with external resources, such as Google for additional disease information, enhances user knowledge.
• Consultancy History and Feedback: Maintain a record of past consultancy sessions, allowing patients to
review their medical history. Include a feedback mechanism for patients to share their experiences and contribute to the system's continuous improvement.
• Specialist Consultation and Chat Feature: Facilitate real-time communication between patients and doctors. A chat feature provides instant messaging between patients and doctors. The system should seamlessly redirect patients to a chat box for specialist consultations, enhancing the interactive and personalized nature of healthcare guidance.

C. Operational Constraints
• Performance: Ensure the system operates efficiently under varying workloads. The system should respond promptly to user interactions, maintain low latency during disease prediction, and handle concurrent user sessions.
• Security: Safeguard user data and system functionality. Employ encryption for data transmission, implement secure storage practices, and conduct regular security audits to identify and address vulnerabilities.
• Usability: Enhance user experience through an intuitive and user-friendly interface. Implement clear navigation, provide informative feedback, and optimize the user interface for accessibility.
• Reliability: Ensure system stability and minimize downtime. Employ redundancy mechanisms, conduct regular system maintenance, and implement failover strategies to mitigate the impact of potential failures.
• Privacy Compliance: Adhere to privacy regulations and protect user confidentiality. Ensure compliance with relevant data protection laws, implement privacy settings, and obtain explicit user consent for data processing activities.

IV. PROPOSED METHODOLOGY

A. Model Architecture:

B. Data Collection
Symptoms are used as the parameters for disease prediction. We gathered the data for the illness prediction model from Kaggle. A binary vector is generated in which a value of 1 is assigned to each symptom present in the user's selected list and a value of 0 to each symptom absent from it. A machine learning model is trained on the dataset containing 4920 patient records. The dataset contains 132 symptom labels as column headers, and the 133rd column is the prognosis column, which gives the disease name. There are 41 types of diseases in the dataset. The model accepts the symptom vector as input and outputs the disease with the highest confidence score.

C. Data Pre-processing

1) Data Cleaning:
Data cleaning involves the removal of empty or blank values. Because analyzing extensive databases is challenging, independent variables (symptoms) that exhibit minimal or negligible impact on the target variable (disease) are excluded. Duplicate entries in the dataset are also identified and eliminated to ensure data integrity and accuracy.

2) Data Reduction:
Data reduction serves as a preventive measure against overfitting. Symptoms deemed to have negligible impact on disease prediction were systematically excluded from the dataset. This selection process retained 92 of the 132 symptoms, focusing on those closely associated with relevant medical conditions, while the remaining symptoms were intentionally omitted from the analysis.

D. Data Processing
Our study integrates several classification techniques to leverage their distinctive strengths in addressing the complexities of disease prediction. The adoption of the following classifiers is driven by the need for an adaptable and multifaceted strategy. Integrating these methods into our model offers a comprehensive approach, delivering precise predictions and insights into the decision-making process. This aligns with our study's goal of providing healthcare professionals with an effective, transparent, and adaptable tool for evolving medical datasets and emerging diseases.
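The binary symptom vector described under Data Collection can be illustrated with a minimal sketch. The five-symptom vocabulary and the function name `encode_symptoms` are hypothetical and chosen only for illustration; the dataset described above has 132 symptom columns, and the paper does not show its own encoding code.

```python
from typing import Iterable, List

# Hypothetical, shortened symptom vocabulary (the real dataset has 132 columns).
SYMPTOMS: List[str] = ["fatigue", "headache", "nausea", "chills", "high_fever"]


def encode_symptoms(selected: Iterable[str], vocabulary: List[str]) -> List[int]:
    """Return a 0/1 vector aligned with `vocabulary`: 1 if selected, else 0."""
    chosen = {name.strip().lower() for name in selected}
    unknown = chosen - set(vocabulary)
    if unknown:
        # Reject symptoms that have no column in the vocabulary.
        raise ValueError(f"Unknown symptoms: {sorted(unknown)}")
    return [1 if name in chosen else 0 for name in vocabulary]


vector = encode_symptoms(["Chills", "fatigue"], SYMPTOMS)
print(vector)  # [1, 0, 0, 1, 0]
```

A vector of this shape is what a trained classifier would receive as input; keeping the column order fixed is what makes the 0/1 positions meaningful.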
Fig. 2. Flow of Machine Learning Model.

1) Decision Tree Classifier:
The Decision Tree (DT) algorithm, an effective and comprehensible machine learning method, was used in this work to forecast diseases based on medical information. As a supervised learning algorithm, it repeatedly separates the dataset into subgroups according to the most significant characteristics. This recursive partitioning enables the systematic breakdown of the dataset: the algorithm successively evaluates and selects features, creating decision nodes that guide it through the complexities of medical records and ultimately constructing a tree-like structure where each node represents a decision based on a feature. This
algorithm is particularly well-suited for medical diagnosis tasks, offering transparency in decision-making processes.

Decision Node: A decision node has two or more branches. Decision nodes serve as points of divergence in the tree, guiding the algorithm to subsequent nodes based on the conditions or rules associated with the chosen feature. In this paper, the symptoms serve as the decision nodes of the Decision Tree algorithm.

Leaf Node: A leaf node is the last node on a path, where the final classification is made. Leaf nodes have no further branches and represent the end of a decision path in the tree. In this paper, the predicted disease is the leaf node.

ID3 Algorithm: In the ID3 algorithm, entropy is a measure of the uncertainty or randomness associated with a particular set of data. It reflects the unpredictability of the outcomes within that set. Specifically, entropy is used to assess the disorder or impurity in the distribution of target labels (in this case, diseases) within a subset of data.

Uncertainty and Predictability: Entropy quantifies the degree of uncertainty or unpredictability in a given dataset. A dataset with high entropy implies more disorder, more randomness, and less predictability in the outcomes of interest (e.g., diseases).

Entropy Calculation: In decision tree algorithms like ID3, the entropy of a dataset is calculated from the distribution of target labels (classes) within a subset. For a binary classification task (diseased or not diseased),

\[ H(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-} \]

where p_{+} and p_{-} are the proportions of diseased and not diseased records in the subset S.

Interpretation: Entropy is at its maximum when the dataset is perfectly balanced (50% for each class), indicating the highest level of disorder and uncertainty. Entropy is at its minimum (0) when the dataset contains only one class, signifying perfect predictability.

Decision Making in ID3: In the ID3 algorithm, entropy is used to determine the best attribute (symptom) at each node of the decision tree. The attribute that yields the greatest reduction in entropy or, equivalently, the highest information gain is selected as the decision point. Information gain measures the effectiveness of an attribute in reducing uncertainty.

The workflow of the decision tree algorithm is shown in Fig. 3.

2) Random Forest Classifier:
The Random Forest algorithm is a versatile and user-friendly machine learning approach that often delivers good results even without extensive hyperparameter tuning. To address a key limitation of individual decision trees, namely their propensity for overfitting, Random Forest employs an ensemble learning strategy. In essence, it functions as a collection of decision trees, on the principle that a multitude of trees generalizes better than a single one. The process is as follows:
• A subset of k symptoms is randomly selected from the dataset (representing medical records), where k is significantly smaller than the total number of symptoms, denoted as m. A decision tree is then constructed from this subset of symptoms.
• This procedure is iterated n times, resulting in n decision trees. Every tree is constructed using a unique random combination of k symptoms or, alternatively, a bootstrap sample, which is a random sampling of the data.
• Each of the n decision trees is then used to predict a disease for the given input. The predictions are stored, yielding a set of n disease predictions from the n decision trees.
• The algorithm then counts the votes for each predicted disease and determines the mode, i.e., the most frequently predicted disease. This mode is taken as the final prediction of the RF Classifier.
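The sampling and voting steps of the Random Forest procedure can be sketched as follows. This is a simplified illustration, not the paper's implementation: tree construction is omitted, the "trees" are hand-written stand-in functions, and all names (`bootstrap_sample`, `random_feature_subset`, `majority_vote`) and data are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Record = Tuple[List[int], str]     # (binary symptom vector, disease label)
Tree = Callable[[List[int]], str]  # anything that maps a vector to a label


def bootstrap_sample(records: Sequence[Record], rng: random.Random) -> List[Record]:
    """Draw len(records) records at random, with replacement."""
    return [rng.choice(records) for _ in records]


def random_feature_subset(m: int, k: int, rng: random.Random) -> List[int]:
    """Pick k distinct symptom indices out of m (k <= m)."""
    return sorted(rng.sample(range(m), k))


def majority_vote(trees: Sequence[Tree], vector: List[int]) -> str:
    """Return the mode of the per-tree predictions (the forest's answer)."""
    votes = Counter(tree(vector) for tree in trees)
    return votes.most_common(1)[0][0]


# Stand-in trees for illustration only; real trees would be learned from data.
demo_trees: List[Tree] = [
    lambda v: "Malaria",
    lambda v: "Typhoid" if v[1] else "Malaria",
    lambda v: "Typhoid",
]
print(majority_vote(demo_trees, [1, 0, 1]))  # Malaria (2 votes to 1)
```

In a full implementation, each tree would be trained on its own bootstrap sample restricted to its own feature subset before voting.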

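The entropy and information-gain computation that ID3 uses to choose a symptom at each node can be sketched in a few lines. The toy labels and symptom columns below are hypothetical and are not taken from the paper's dataset.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature: Sequence[int], labels: Sequence[str]) -> float:
    """Entropy reduction obtained by splitting `labels` on a 0/1 symptom column."""
    total = len(labels)
    if total == 0:
        return 0.0
    remainder = 0.0
    for value in (0, 1):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: four records, two diseases, two symptom columns.
labels = ["Malaria", "Malaria", "Flu", "Flu"]
chills = [1, 1, 0, 0]    # separates the two diseases perfectly
headache = [1, 0, 1, 0]  # carries no information about the disease

print(entropy(labels))                     # 1.0 (perfectly balanced classes)
print(information_gain(chills, labels))    # 1.0
print(information_gain(headache, labels))  # 0.0
```

ID3 would pick `chills` as the decision node here because it has the higher information gain.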
Fig. 3. DT Algorithm Flow Chart.

Fig. 4. Flow of RF Algorithm.

3) Bayes Classifier:
The Gaussian Naïve Bayes algorithm operates under the fundamental assumption that each feature contributes independently and equally to the outcome. One notable advantage is its efficiency, particularly on large datasets, as it demands less computational power. The algorithm is grounded in Bayes' theorem, expressed as:

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{1} \]

where A is the class (disease) and B is the observed evidence (symptoms).
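Bayes' theorem combined with the independence assumption can be sketched as a small count-based classifier. This is an illustrative assumption rather than the paper's code: it uses plain categorical frequency counts instead of a Gaussian likelihood, applies no smoothing, and the toy records and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Record = Tuple[Dict[str, str], str]  # (symptom -> "Present"/"Absent", disease)


def train(records: List[Record]):
    """Count class frequencies and per-class (symptom, value) frequencies."""
    class_counts = Counter(disease for _, disease in records)
    value_counts = defaultdict(Counter)  # disease -> Counter of (symptom, value)
    for symptoms, disease in records:
        for item in symptoms.items():
            value_counts[disease][item] += 1
    return class_counts, value_counts


def score(symptoms: Dict[str, str], disease: str, class_counts, value_counts) -> float:
    """Unnormalised posterior: P(disease) * product of P(symptom=value | disease)."""
    n_class = class_counts[disease]
    result = n_class / sum(class_counts.values())  # prior P(disease)
    for item in symptoms.items():
        result *= value_counts[disease][item] / n_class  # likelihood term
    return result


def predict(symptoms: Dict[str, str], class_counts, value_counts) -> str:
    """Return the disease with the highest unnormalised posterior."""
    return max(class_counts, key=lambda d: score(symptoms, d, class_counts, value_counts))


# Hypothetical toy training data, not from the paper.
records: List[Record] = [
    ({"Fatigue": "Present", "Chills": "Present"}, "Malaria"),
    ({"Fatigue": "Present", "Chills": "Absent"}, "Malaria"),
    ({"Fatigue": "Absent", "Chills": "Present"}, "Malaria"),
    ({"Fatigue": "Absent", "Chills": "Absent"}, "Flu"),
]
class_counts, value_counts = train(records)
query = {"Fatigue": "Present", "Chills": "Present"}
print(predict(query, class_counts, value_counts))  # Malaria
```

The denominator P(B) is omitted because it is the same for every disease and does not change which one scores highest. A practical version would add smoothing so that an unseen symptom value does not force a probability of zero.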

Illustrating this concept through a hypothetical scenario: consider the disease prediction for a patient with symptoms categorized as "Fatigue=Present," "Headache=Present," "Nausea=Absent," and "Chills=Present," where the target class is "Malaria."

Consider the following hypothetical scenario to understand the concept. Initially, we estimate the probability for Malaria by defining the class as Malaria and inputting symptoms such as "Fatigue=Present," "Headache=Absent," "Nausea=Present," and "Chills=Present."

The modified formula becomes:

\[ P(\text{Malaria} \mid \text{symptoms}) \propto P(\text{Fatigue} \mid \text{Malaria})\,P(\text{Headache} \mid \text{Malaria})\,P(\text{Nausea} \mid \text{Malaria})\,P(\text{Chills} \mid \text{Malaria})\,P(\text{Malaria}) \tag{2} \]

= 3/8 × 5/8 × 7/8 × 6/8 × 8/12

≈ 0.103

In this context, the algorithm computes the probability of each disease given the observed symptoms and outputs the one with the highest probability, thereby facilitating disease prediction.

Fig. 5. Flow of Naïve Bayes Classifier.

V. RESULTS AND DISCUSSION

A. System Results
Fig. 6 shows the predicted disease and its confidence score for the selected symptoms. Patients are redirected to a Google page if they wish to know more about the predicted disease.

Fig. 6. Disease Predicted with the maximum Confidence score.

Fig. 7 shows the storage of patients' consultation records in the database, which doctors may need in the future to check a patient's history.

Fig. 7. Patient Consultation History.

B. Model Results
Figure 8 shows the confusion matrix for the Decision Tree Classifier, a visual representation of the model's performance in categorizing instances. The Decision Tree Classifier achieves an accuracy score of 97.47%, highlighting its effectiveness in making accurate predictions.

Fig. 8. Confusion Matrix for Decision Tree.

C. Comparative Analysis
Table I compares the accuracies of the different classifiers, with the Decision Tree Classifier identified as the best choice.
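The two evaluation quantities reported in this section, the confusion matrix and the accuracy score, can be computed as in the following sketch. The label lists are hypothetical toy values and do not reproduce the reported figures; the function names are chosen for illustration.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def confusion_matrix(actual: Sequence[str], predicted: Sequence[str]) -> Dict[Tuple[str, str], int]:
    """Map each (actual, predicted) label pair to how often it occurred."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return dict(Counter(zip(actual, predicted)))


def accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sample")
    matrix = confusion_matrix(actual, predicted)
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / len(actual)


# Hypothetical toy labels, for illustration only.
actual = ["Malaria", "Malaria", "Flu", "Flu", "Typhoid"]
predicted = ["Malaria", "Flu", "Flu", "Flu", "Typhoid"]

print(accuracy(actual, predicted))                             # 0.8
print(confusion_matrix(actual, predicted)[("Malaria", "Flu")])  # 1
```

Running the same two functions on the held-out predictions of each classifier would yield one accuracy value per model, which is the form of comparison summarized in Table I.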

TABLE I. COMPARISON OF DIFFERENT CLASSIFIERS

Sr. no.   Model           Accuracy (%)
1.        Decision Tree   97.47
2.        Random Forest   97.3
3.        Naïve Bayes     96.63

VI. CONCLUSION AND FUTURE SCOPE

The study employed diverse classification techniques, keeping 92 impactful symptoms from an initial pool of 132 and excluding those with minimal impact on disease prediction. The ID3 algorithm, a Decision Tree method, identified key attributes at each level. Bayes' theorem played a pivotal role in calculating disease probabilities from known symptoms. Additionally, the system systematically stored patient consultation records for future reference. The model, developed per the proposed method, demonstrated high accuracy, with the Decision Tree Classifier leading at 97.47%. To make the system more effective, future work could include educating users (both patients and doctors) about the system's abilities and limitations. It is essential to continuously train the model on updated medical data so that it stays relevant and adapts to new diseases and symptoms. Improvements may target a better user interface, making it easier for users to input symptoms using natural language. Multimedia elements can also enhance user-friendliness. Additionally, integrating real-time patient data from wearables or electronic health records can give a complete picture of a patient's health, leading to more personalized and proactive healthcare approaches.

REFERENCES
[1] Pingale, Kedar, Sushant Surwase, Vaibhav Kulkarni, Saurabh Sarage, and Abhijeet Karve. "Disease prediction using machine learning." International Research Journal of Engineering and Technology (IRJET) 6, no. 12 (2019): 831-833.
[2] Nawab, Mohammad Ali, Tavisi Malpani, Tushar Kaundal, and Ms Divya Soni. "Disease prediction web app using machine learning." Int. Res. J. Mod. Eng. Technol. Sci (2022).
[3] Kohli, P. S., and S. Arora. "Application of machine learning in disease prediction." In 2018 4th International Conference on Computing Communication and Automation (ICCCA), pp. 1-4. IEEE, 2018.
[4] Suryachandra, Palli, and P. Venkata Subba Reddy. "Comparison of machine learning algorithms for breast cancer." In 2016 International Conference on Inventive Computation Technologies (ICICT), vol. 3, pp. 1-6. IEEE, 2016.
[5] Gao, Jianliang, Ling Tian, Jianxin Wang, Yibo Chen, Bo Song, and Xiaohua Hu. "Similar disease prediction with heterogeneous disease information networks." IEEE Transactions on NanoBioscience 19, no. 3 (2020): 571-578.
[6] Repaka, Anjan Nikhil, Sai Deepak Ravikanti, and Ramya G. Franklin. "Design and implementing heart disease prediction using naives Bayesian." In 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), pp. 292-297. IEEE, 2019.
[7] Uyar, Kaan, and Ahmet İlhan. "Diagnosis of heart disease using genetic algorithm based trained recurrent fuzzy neural networks." Procedia Computer Science 120 (2017): 588-593.
[8] Hazra, Animesh, Subrata Kumar Mandal, Amit Gupta, Arkomita Mukherjee, and Asmita Mukherjee. "Heart disease diagnosis and prediction using machine learning and data mining techniques: A review." Advances in Computational Sciences and Technology 10, no. 7 (2017): 2137-2159.
[9] Chhabbi, Ashish, Lakhan Ahuja, Sahil Ahir, and Y. K. Sharma. "Heart disease prediction using data mining techniques." IJRAT, Special Issue National Conference "NCPC-2016," 19 March 2016, pp. 104-106.
[10] Gandhi, Monika, and Shailendra Narayan Singh. "Predictions in heart disease using techniques of data mining." In 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), pp. 520-525. IEEE, 2015.
