Data Mining Final
Abstract—Dengue fever remains a significant global health challenge, where early and accurate diagnosis is critical for effective patient management and preventing severe outcomes. Traditional diagnostic approaches often face limitations due to cost, time, and the non-specific nature of early symptoms, which can overlap with other febrile illnesses. This project presents a data-driven approach for dengue classification using an Artificial Neural Network (ANN) trained on readily available clinical and demographic data. The model utilizes a dataset comprising numerical features such as patient age and serological markers (NS1, IgG, IgM), alongside categorical features like gender and geographical information. Data preprocessing involves standardizing numerical features and one-hot encoding categorical variables to prepare the data for the model. The proposed ANN is a sequential model composed of dense layers with ReLU activation functions, integrated with dropout layers for regularization to mitigate overfitting. The final output layer uses a sigmoid activation function for binary classification (dengue-positive/dengue-negative). The model is compiled using the Adam optimizer and the binary cross-entropy loss function. The performance of the trained network is evaluated using standard metrics, including accuracy, a confusion matrix, and a classification report, to demonstrate its potential as an efficient and accessible decision support tool for clinicians in resource-limited settings.

Keywords—Dengue Fever, Artificial Neural Network (ANN), Machine Learning, Clinical Decision Support, Disease Prediction, Tabular Data, Supervised Learning.

I. INTRODUCTION

A. The Global Burden of Dengue
Dengue fever, a mosquito-borne viral infection, is a rapidly spreading disease that poses a significant threat to public health worldwide. Transmitted primarily by Aedes mosquitoes, dengue is endemic in tropical and subtropical regions, placing nearly half of the world's population at risk. The World Health Organization (WHO) has identified dengue as one of the top ten global health threats, with an estimated 100 to 400 million infections occurring annually. The clinical spectrum of dengue ranges from asymptomatic or mild flu-like symptoms to severe, life-threatening conditions such as Dengue Hemorrhagic Fever (DHF) and Dengue Shock Syndrome (DSS). The escalating incidence and geographic expansion of dengue, driven by factors like urbanization, travel, and climate change, place immense strain on healthcare systems and economies, highlighting the urgent need for effective control and management strategies.

B. Challenges in Conventional Dengue Diagnosis
The effective management of dengue relies heavily on prompt and accurate diagnosis. Early identification allows for timely supportive care, monitoring for warning signs, and appropriate clinical interventions, which can significantly reduce mortality rates. However, the diagnostic process is often complicated by several factors. In the early stages, dengue symptoms such as fever, headache, and myalgia are non-specific and can be easily confused with other febrile illnesses like chikungunya, Zika, or influenza, leading to potential misdiagnosis. While laboratory tests such as Real-Time Reverse Transcriptase Polymerase Chain Reaction (RT-PCR) and serological assays for NS1 antigen or IgM/IgG antibodies are considered the gold standard, they have limitations. These tests can be costly, require specialized laboratory infrastructure and trained personnel, and may have a significant turnaround time, making them less accessible in resource-constrained settings where dengue is most prevalent. This gap between clinical need and diagnostic capacity underscores the demand for alternative tools that are rapid, affordable, and accurate.

C. The Emergence of Artificial Intelligence in Medical Diagnostics
In recent years, Artificial Intelligence (AI), particularly its subfield of machine learning (ML), has emerged as a transformative technology in healthcare. ML algorithms have demonstrated a remarkable ability to analyze large and complex datasets, uncovering intricate patterns that may not be apparent to human observers. In medical diagnostics, ML models are increasingly being used to predict disease risk, classify conditions, and support clinical decision-making based on patient data. Artificial Neural Networks (ANNs), inspired by the structure of the human brain, are particularly well-suited for handling the complex, non-linear relationships often found in clinical data. By training on datasets containing patient demographics, clinical signs, and laboratory results, ANNs can learn to identify the subtle combination of factors indicative of a specific disease, offering a powerful approach to augment traditional diagnostic methods.

D. Contribution and Paper Organization
This paper addresses the need for a rapid and accessible dengue screening tool by developing and evaluating a predictive model based on an Artificial Neural Network. The primary contribution of this work is the implementation of an end-to-end machine learning pipeline for dengue classification using structured, tabular data: a combination of demographic, geographic, and key serological features (Age, Gender, Area, NS1, IgG, IgM). The methodology detailed in this project involves standard data preprocessing techniques, including scaling of numerical features and one-hot encoding of categorical features. A sequential ANN model with two hidden layers and dropout regularization is constructed and trained to perform binary classification. The model's effectiveness is evaluated to assess its potential as a clinical decision support tool.
The remainder of this paper is organized as follows. Section II reviews related work and highlights its methods and limitations. Section III gives an overview of the proposed model, including data preprocessing, data splitting, and model training. Section IV describes the dataset. Section V presents the performance and accuracy analysis. Finally, Section VII concludes the paper with limitations and future possibilities.

II. RELATED WORKS
The use of Artificial Neural Networks (ANNs) for classifying dengue cases based on clinical data is a well-established area of research, demonstrating the potential of deep learning models to serve as effective diagnostic aids. Several studies have successfully implemented ANN architectures, similar to the one in this project, to distinguish between dengue-positive and dengue-negative cases. One notable study presented an advanced ANN-based approach for the prognosis and classification of dengue. Their model, a supervised feed-forward neural network (FFNN) with two hidden layers, was trained on clinical parameters from dengue outbreak sites in Pakistan. The network achieved remarkable performance, with nearly 100% accuracy, 100% sensitivity, and 98.7% specificity, highlighting the high potential of ANNs in this domain [1]. Another relevant study focused on developing a machine learning model for dengue case screening using only clinical data, without laboratory results. They tested several models, including a Multilayer Perceptron (MLP), which is a type of ANN. Using 10 key clinical variables such as fever, myalgia, headache, and rash, the MLP model achieved an accuracy of 98%, supporting its use as a rapid screening tool that healthcare professionals can use to guide decision-making [2]. Further research has explored the use of lightweight ANNs for deployment in resource-limited settings. One such study proposed a lightweight MLP with three hidden layers for a dual-level dengue diagnosis system based on patient symptoms. This model achieved an accuracy of 92% on a small dataset, demonstrating that even less computationally intensive neural networks can provide reliable diagnostic support [3]. Taken together, these studies indicate that ANNs are a powerful and viable tool for dengue classification. The approach taken in this project, utilizing a sequential ANN with dropout for regularization, aligns directly with these precedents in the scientific literature.

III. PROPOSED SYSTEM
The proposed system is an end-to-end machine learning pipeline designed to classify dengue fever based on clinical and demographic data. The framework leverages a sequential Artificial Neural Network (ANN) to perform binary classification, distinguishing between dengue-positive and dengue-negative cases. The entire process, from data handling to model training and evaluation, is built to be efficient and robust. The workflow of the proposed system is illustrated in Fig. 2.

Fig. 2. Proposed Model

A. Data Preprocessing
Effective data preprocessing is a critical first step to ensure the quality and suitability of data for the machine learning model. The raw dataset consists of tabular data containing both numerical and categorical features.
Feature and Target Separation: The dataset is first divided into features (X) and the target variable (y). The features include numerical data such as Age, NS1, IgG, and IgM, and categorical data like Gender, Area, AreaType, HouseType, and District. The target variable, Outcome, is a binary label indicating the presence or absence of dengue.
Data Transformation Pipeline: A ColumnTransformer is employed to apply specific transformations to different types of columns. This ensures that each feature type is handled appropriately.
Numerical Feature Scaling: The numerical features are processed using StandardScaler. This technique standardizes features by removing the mean and scaling to unit variance. Scaling is essential for neural networks as it helps prevent features with larger ranges from dominating the learning process and aids in faster convergence of the model's weights.
Categorical Feature Encoding: The categorical features are transformed using OneHotEncoder.
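As a rough illustration of the two transformations named in this subsection, the following dependency-free sketch standardizes one numeric column and one-hot encodes one categorical column. It is not the project's code: the helper names `standardize` and `one_hot` and the toy values are assumptions made for this example, and the actual pipeline relies on scikit-learn's StandardScaler and OneHotEncoder.

```python
import math

def standardize(values):
    """Z-score a numeric column: subtract the mean, divide by the standard deviation.

    Assumes the column is not constant (a zero standard deviation would divide by zero).
    """
    mean = sum(values) / len(values)
    # Population standard deviation (divide by n), which is what StandardScaler uses.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def one_hot(values, categories):
    """Encode each value as a 0/1 vector with one slot per known category.

    A value that is not in `categories` becomes an all-zero vector.
    """
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical toy columns (illustrative values, not taken from the dataset).
ages = [20, 30, 40]
genders = ["Female", "Male", "Female"]

scaled_ages = standardize(ages)                                     # mean 0, unit variance
encoded_genders = one_hot(genders, categories=["Female", "Male"])   # one column per category
```

After scaling, the age column has zero mean and unit variance, so it sits on a scale comparable to the 0/1 columns produced by the encoder.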
One-hot encoding converts each category value into a new binary column (0 or 1), a necessary step because the ANN can only consume numeric inputs. The handle_unknown="ignore" parameter lets the encoder process categories that were not seen during fitting, such as new categories in the test data, without raising an error.
Dataset Splitting: The preprocessed dataset is split into a training set and a testing set using an 80:20 ratio. 80% of the data is used to train the model, while the remaining 20% is held out as an unseen test set to provide an unbiased evaluation of the final model's performance. A random state is set to ensure the reproducibility of the split.

B. Proposed Artificial Neural Network (ANN) Architecture
The core of the classification system is a sequential ANN model built using the Keras library from TensorFlow. ANNs are powerful tools for classification tasks as they can learn complex, non-linear relationships between input features. The architecture of the proposed model is designed to be effective yet relatively simple, incorporating regularization to prevent overfitting. The model consists of the following layers:
• Input Layer: A Dense layer with 64 neurons. This layer receives the preprocessed input features. It uses the Rectified Linear Unit (ReLU) activation function, which is a standard and effective choice for introducing non-linearity into the model, allowing it to learn more complex patterns.
• First Dropout Layer: A Dropout layer with a rate of 0.3. Dropout is a regularization technique in which, on average, 30% of the previous layer's outputs are randomly set to zero during each training step. This helps prevent the model from becoming overly reliant on any single neuron, thus improving its ability to generalize to new, unseen data.
• Hidden Layer: A second Dense layer with 32 neurons, also using the ReLU activation function. This layer further processes the features learned by the input layer.
• Second Dropout Layer: Another Dropout layer with a rate of 0.2, applying 20% dropout for further regularization.
• Output Layer: A final Dense layer with a single neuron. This layer uses the sigmoid activation function. The sigmoid function outputs a value between 0 and 1, which is interpreted as the probability of the positive class (dengue-positive). This makes it well suited to binary classification tasks.
The complete architecture of the proposed ANN model is visualized in Fig. 1.

Fig. 1. Architecture of the Proposed Artificial Neural Network Model

C. Model Compilation and Training
Before training, the model is compiled with the necessary components to guide the learning process:
• Optimizer: The Adam optimizer is used. Adam is an efficient and widely used optimization algorithm that adapts the learning rate for each parameter, combining the advantages of other popular optimizers like AdaGrad and RMSProp.
• Loss Function: The Binary Cross-Entropy loss function is selected. This is the standard loss function for binary classification problems, as it measures the difference between the predicted probabilities and the actual class labels.
• Metrics: The model's performance is monitored using accuracy during training.
The model is trained for 50 epochs with a batch size of 16. 20% of the training data is held out as a validation set, so that the model's performance on data it is not trained on can be monitored at the end of each epoch. This helps in identifying potential overfitting and assessing how well the model generalizes.
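To make the loss and metric named above concrete, the following is a minimal pure-Python sketch of a sigmoid output, the binary cross-entropy loss, and accuracy at a 0.5 decision threshold. It illustrates the quantities only and is not the Keras implementation used in the project. The function names, the 0.5 threshold, the clipping constant `eps`, and the example logits and labels are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    """Map a raw output score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical output-layer scores for four patients (not from the dataset).
logits = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 0]

probs = [sigmoid(z) for z in logits]        # predicted probability of dengue-positive
loss = binary_cross_entropy(labels, probs)  # lower is better
acc = accuracy(labels, probs)               # third patient is misclassified
```

The loss penalizes confident wrong probabilities more heavily than hesitant ones, whereas accuracy only counts whether each thresholded prediction matches its label. This is why the two quantities can move differently from one epoch to the next.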
TABLE II
CONFUSION MATRIX AND CLASSIFICATION REPORT