ESCI2024 Paper 0912
Abstract: This study introduces the Medical Unique Identification, a secure patient verification process designed for accessing medical records. Utilizing data mining techniques and machine learning algorithms, our focus is on early disease prediction based on user-provided symptoms. The developed classifier system aids physicians in early disease diagnosis, thus enhancing patient care. The machine learning model is trained on a dataset comprising 4920 patient records, each featuring 132 distinct symptoms. The study adopts the ID3 root-to-leaf approach for constructing the decision tree. Additionally, the Naïve Bayes classifier, based on Bayes' theorem, is applied to determine the probability of specific events. Patients can select symptoms from a list, and the disease with the highest confidence score is predicted and displayed. The system also integrates systematic storage of patient consultations, contributing to a more personalized and comprehensive patient history for improved healthcare and community services.

Keywords: Medical Unique Identification, Decision Tree, Iterative Dichotomiser 3 (ID3), Random Forest, Gaussian, Confusion Matrix

I. INTRODUCTION
In the field of medical healthcare, it is important for doctors to know the medical data of an individual. Medical records inform clinicians about a patient's current as well as past medical history, enabling them to make future decisions. In this technology-driven world it is difficult to maintain a hard copy of medical reports and keep it safe for several years. The reports may get damaged, they are hard to keep in good condition, and there is often a fear of misplacing them.

Keeping these problems in consideration, we have built a web application system which includes Medical Unique Identification of the user and provides facilities like storing the medical records (demographics, past medical history, medication, medical progress notes, etc.) of the user, which can be retrieved by the doctor at the time of check-ups. This keeps the patients' data secure in one place, where it can be accessed by just entering the credentials. There is no burden of maintaining or handling any hard copy. Among the other features of the system, the user is able to book an online appointment with a doctor through this application. The clinician may accept or reject the appointment according to his/her availability. The status of the appointment, whether accepted or rejected, along with details including the patient's name, date and time, etc., is stored in the database and can be seen by the user on his/her interface.

This paper demonstrates how the disease prediction model is built using machine learning classification techniques like Random Forest, Naïve Bayes and Decision Tree, which contribute to proactive decision-making and expedite informed healthcare practices. This makes the system capable of predicting multiple diseases which the user might have on the basis of current symptoms. The user selects the symptoms he/she suffers from, on the basis of which the machine learning model predicts the disease the user is likely to have. Once the prediction is done by the model, the user can consult the medical specialist for that particular disease and book an appointment according to the availability of the doctor. This assists professionals in making informed decisions, which reduces treatment delay. The machine learning model is able to detect 41 types of diseases.

II. LITERATURE SURVEY
K. Pingale et al. [1] aim to forecast illness using symptoms provided by the user. The system takes input from users in the form of symptoms and produces predictions for potential disease outcomes. In summary, the precision of disease risk prediction in this model is contingent on the diverse features encompassed in the hospital data. The dataset, consisting of 768 instances, may not adequately represent the broader population.

Malpani et al. [2] report that the model created with their proposed method demonstrates superior accuracy compared to existing ones. Specifically, the prediction accuracy for disease detection using Support Vector Machine achieves 96%, Naïve Bayes Classifier attains 95%, and Random Forest Classifier reaches 95.7%. The fundamental objective in developing this system is to provide health assistance to individuals.

Kohli et al. [3] make a significant contribution to enhancing classification and recognition systems employed in disease diagnosis. These improvements offer confidential information that supports medical experts in the early detection of life-threatening diseases, thereby substantially increasing patient survival rates. The authors employed various
review their medical history. Include a feedback mechanism for patients to share their experiences and contribute to the system's continuous improvement.

Specialist Consultation and Chat Feature: Facilitate real-time communication between patients and doctors. Introduce a chat feature for instant messaging between patients and doctors. The system should seamlessly redirect patients to a chat box for specialist consultations, enhancing the interactive and personalized nature of healthcare guidance.

C. Operational Constraints
Performance: Ensure the system operates efficiently under varying workloads. The system should respond promptly to user interactions, maintain low latency during disease prediction, and handle concurrent user sessions.

Security: Safeguard user data and system functionality. Employ encryption for data transmission, implement secure storage practices, and conduct regular security audits to identify and address vulnerabilities.

Usability: Enhance user experience through an intuitive and user-friendly interface. Implement clear navigation, provide informative feedback, and optimize the user interface for accessibility.

Reliability: Ensure system stability and minimize downtime. Employ redundancy mechanisms, conduct regular system maintenance, and implement failover strategies to mitigate the impact of potential failures.

Privacy Compliance: Adhere to privacy regulations and protect user confidentiality. Ensure compliance with relevant data protection laws, implement privacy settings, and obtain explicit user consent for data processing activities.

IV. PROPOSED METHODOLOGY

A. Model Architecture:

B. Data Collection
For the process of disease prediction, symptoms are used as the input parameters. We gathered the data for the illness prediction model from Kaggle. A binary vector is generated in which a value of 1 is assigned to each symptom present in the user's selected list and a value of 0 to each symptom absent from it. A machine learning model is trained on the dataset containing 4920 patient records. The dataset contains 132 symptom labels as column headers, and the 133rd column is the prognosis column, which gives the disease name. There are 41 types of diseases in the dataset. The model accepts the symptom vector as input and outputs the disease with the highest confidence score.

C. Data Pre-processing
1) Data Cleaning:
In the process of dataset refinement, data cleaning involves the removal of empty or blank values. Analyzing extensive databases poses challenges, prompting the exclusion of independent variables (symptoms) that exhibit minimal or negligible impact on the target variable (disease). During the data cleaning process, duplicate entries in the dataset are identified and eliminated to ensure data integrity and accuracy.

2) Data Reduction:
Data reduction serves as a preventive measure against overfitting. Symptoms deemed to have negligible impact on disease prediction were systematically excluded from the dataset. A refined selection process led to the inclusion of 92 of the 132 symptoms, focusing specifically on those closely associated with relevant medical conditions, while the remaining 40 symptoms were intentionally omitted from the analysis.

D. Data Processing
Our study integrates various classification techniques to leverage their distinctive strengths in addressing the complexities of disease prediction. The adoption of the following classifiers is driven by the need for an adaptable and multifaceted strategy. Integrating these methods into our model offers a comprehensive approach, delivering precise predictions and insights into the decision-making process. This aligns with our study's goals, aiming to provide healthcare professionals with an effective, transparent, and adaptable tool for evolving medical datasets and emerging diseases.
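The encoding and cleaning steps described under Data Collection and Data Pre-processing can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the symptom and disease names, the helper names (`encode_symptoms`, `drop_incomplete_and_duplicates`, `drop_constant_columns`) and the use of "constant column" as the criterion for a negligible symptom are assumptions made for illustration. The paper does not state how the 92 retained symptoms were chosen.

```python
# Toy vocabulary (hypothetical names); the dataset described above has 132 symptom columns.
SYMPTOMS = ["itching", "skin_rash", "cough", "high_fever", "headache"]


def encode_symptoms(selected, vocabulary=SYMPTOMS):
    """Return a binary vector: 1 where a symptom was selected, 0 otherwise."""
    chosen = set(selected)
    unknown = chosen - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown symptoms: {sorted(unknown)}")
    return [1 if name in chosen else 0 for name in vocabulary]


def drop_incomplete_and_duplicates(records):
    """Data cleaning: skip rows with blank values and exact duplicate rows."""
    seen, cleaned = set(), []
    for row in records:
        if any(value is None or value == "" for value in row):
            continue  # row has an empty or blank value
        key = tuple(row)
        if key not in seen:  # keep only the first occurrence
            seen.add(key)
            cleaned.append(list(row))
    return cleaned


def drop_constant_columns(rows, n_features):
    """Data reduction (assumed criterion): drop symptom columns that never vary.

    Each row holds n_features symptom values followed by the prognosis label.
    Returns the reduced rows and the indices of the symptom columns kept.
    """
    keep = [j for j in range(n_features) if len({row[j] for row in rows}) > 1]
    reduced = [[row[j] for j in keep] + row[n_features:] for row in rows]
    return reduced, keep


# A user who selects "cough" and "headache" from the list:
print(encode_symptoms(["cough", "headache"]))  # [0, 0, 1, 0, 1]
```

The resulting vector has one position per symptom in the vocabulary, so it can be passed directly to any classifier trained on columns in the same order.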
Fig. 2. Flow of Machine Learning Model.

1) Decision Tree Classifier:
An effective and comprehensible machine learning method, the Decision Tree (DT) algorithm was used in this work to forecast diseases based on medical information. Functioning as a supervised learning algorithm, it repeatedly separates the dataset into subgroups according to the most significant characteristics. This recursive partitioning mechanism enables the systematic breakdown of the dataset. It successively evaluates and selects features, creating decision nodes that guide the algorithm in navigating the complexities of medical records, ultimately constructing a tree-like structure where each node represents a decision based on a feature. This
algorithm is particularly well-suited for medical diagnosis tasks, offering transparency in decision-making processes.

Decision Node: It has two or more branches. Decision nodes serve as points of divergence in the tree, guiding the algorithm to subsequent nodes based on the conditions or rules associated with the chosen feature. In this paper, the symptoms serve as the decision nodes within the Decision Tree algorithm.

Leaf Node: Represents the last node, where the final classification is made. Leaf nodes do not have any further branches and represent the end of a decision path in the tree. In this paper, the predicted disease represents the leaf node.

ID3 Algorithm: In the context of the ID3 algorithm, entropy is a measure of uncertainty or randomness associated with a particular set of data. It reflects the unpredictability of the outcomes within that set. Specifically, entropy is utilized to assess the disorder or impurity in the distribution of target labels (in this case, diseases) within a subset of data.

Uncertainty and Predictability: Entropy is a way to quantify the degree of uncertainty or unpredictability in a given dataset. A dataset with high entropy implies more disorder, more randomness, and less predictability in terms of the outcomes we are interested in (e.g., diseases).

Entropy Calculation: In the context of decision tree algorithms like ID3, the entropy of a dataset is calculated using the distribution of target labels (classes) within a subset. For a binary classification task (diseased or not diseased), the entropy of a subset $S$ is

$$H(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$

where $p_{+}$ and $p_{-}$ are the proportions of diseased and not diseased records in $S$.

Interpretation: Entropy is at its maximum when the dataset is perfectly balanced (50% for each class), indicating the highest level of disorder and uncertainty. On the other hand, entropy is at its minimum (0) when the dataset contains only one class, signifying perfect predictability.

Decision Making in ID3: In the ID3 algorithm, entropy is used to determine the best attribute (symptom) at each node of the decision tree. The attribute that results in the greatest reduction in entropy or, equivalently, the highest information gain, is selected as the decision point. Information gain measures the effectiveness of an attribute in reducing uncertainty.

The workflow of the decision tree algorithm can be seen in Fig. 3.

2) Random Forest Classifier:
The Random Forest (RF) algorithm is a versatile and user-friendly machine learning approach that often delivers impressive results even without extensive hyperparameter tuning. It addresses a key limitation of individual decision trees, namely their propensity for overfitting, by employing an ensemble learning strategy. In essence, it functions as a collective of decision trees, with the overarching principle that a multitude of trees contributes to enhanced generalization. To elaborate on the process:

A subset of k symptoms is randomly selected from the dataset (representing medical records), where k is significantly smaller than the total number of symptoms, denoted as m. Subsequently, a decision tree is constructed based on this subset of symptoms.

This procedure is iterated n times, resulting in the creation of n decision trees. Every tree is constructed using a unique random combination of k symptoms or, alternatively, a bootstrap sample, which is a random sampling of the data.

Each of the n decision trees is then used to predict the disease for a given input. The predictions are stored, yielding a set of n disease predictions from the n decision trees.

The algorithm subsequently counts the votes for each predicted disease and determines the mode, i.e., the most frequently predicted disease. This mode is taken as the final prediction of the RF Classifier algorithm.
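To make the two preceding classifiers concrete, the sketch below builds ID3-style trees by choosing the symptom with the highest information gain at each node, then combines several such trees by majority vote. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy symptom vectors and the disease labels "flu" and "allergy" are hypothetical, and each tree here uses both a bootstrap sample and a random subset of k symptoms, whereas the text presents these as alternatives.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on one binary symptom."""
    remainder = 0.0
    for value in (0, 1):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def build_id3(rows, labels, features):
    """Return a leaf (disease name) or a decision node (dict) picked by highest gain."""
    if len(set(labels)) == 1 or not features:
        return majority(labels)
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 1e-12:
        return majority(labels)  # no remaining symptom reduces uncertainty
    node = {"feature": best, "default": majority(labels), "children": {}}
    rest = [f for f in features if f != best]
    for value in (0, 1):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        if idx:
            node["children"][value] = build_id3(
                [rows[i] for i in idx], [labels[i] for i in idx], rest)
    return node


def predict_tree(tree, row):
    """Follow the root-to-leaf path selected by the symptom vector."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


def build_forest(rows, labels, n_trees, k, seed=0):
    """Train n_trees ID3 trees, each on a bootstrap sample and k random symptoms."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
        features = rng.sample(range(len(rows[0])), k)       # k of the m symptoms
        forest.append(build_id3([rows[i] for i in sample],
                                [labels[i] for i in sample], features))
    return forest


def predict_forest(forest, row):
    """Final prediction is the mode of the individual tree predictions."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): three binary symptoms, two diseases.
rows = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 1]]
labels = ["flu", "flu", "flu", "allergy", "allergy", "allergy"]
tree = build_id3(rows, labels, [0, 1, 2])
forest = build_forest(rows, labels, n_trees=25, k=2)
```

In this toy data the first symptom separates the two diseases perfectly, so its information gain equals the full entropy of the label set (1 bit) and ID3 places it at the root.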
Fig. 3. DT Algorithm Flow Chart.

3) Naïve Bayes Classifier:
The Gaussian Naïve Bayes algorithm operates under the fundamental assumption that each feature contributes independently and equally to the outcome. One notable advantage is its efficiency, particularly on large datasets, as it demands less computational power. The algorithm is grounded in Bayes' theorem, expressed as:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{1}$$
(2)
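As a rough illustration of how a Gaussian Naïve Bayes classifier applies Bayes' theorem under the independence assumption, the sketch below estimates a per-disease prior and a per-symptom mean and variance, then scores a symptom vector by its log-posterior. It is a sketch under stated assumptions rather than the authors' code: the function names, the toy data and the `var_smoothing` constant (added so that a symptom with zero variance within a class does not cause division by zero) are all hypothetical choices.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels, var_smoothing=1e-3):
    """Estimate log prior, per-feature means and variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        variances = [sum((x - m) ** 2 for x in col) / n + var_smoothing
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model


def log_scores(model, row):
    """Unnormalised log-posterior: log prior plus summed Gaussian log-likelihoods."""
    scores = {}
    for label, (log_prior, means, variances) in model.items():
        log_like = sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
                       for x, m, v in zip(row, means, variances))
        scores[label] = log_prior + log_like
    return scores


def posteriors(model, row):
    """Normalise the scores into probabilities (confidence scores) per class."""
    scores = log_scores(model, row)
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def predict(model, row):
    """Return the class with the highest posterior probability."""
    scores = log_scores(model, row)
    return max(scores, key=scores.get)


# Toy data (hypothetical): three binary symptoms, two diseases.
rows = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 1]]
labels = ["flu", "flu", "flu", "allergy", "allergy", "allergy"]
model = fit_gaussian_nb(rows, labels)
```

Working in log space avoids numerical underflow when many per-symptom likelihoods are multiplied, which matters with a feature vector as long as the one described in this paper.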
TABLE I. COMPARISON OF DIFFERENT CLASSIFIERS

Sr. no.   Model           Accuracy (%)
1.        Decision Tree   97.47
2.        Random Forest   97.3
3.        Naïve Bayes     96.63

[9] Ashish Chhabbi, Lakhan Ahuja, Sahil Ahir, and Y. K. Sharma, "Heart Disease Prediction Using Data Mining Techniques," IJRAT, Special Issue National Conference "NCPC-2016," pp. 104-106, 19 March 2016.
[10] Monika Gandhi and Shailendra Narayan Singh, "Predictions in heart disease using techniques of data mining," in 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), pp. 520-525, IEEE, 2015.