Dengue Fever Prediction: A Data Mining Problem
Dar et al., J Data Mining Genomics Proteomics 2015, 6:3
http://dx.doi.org/10.4172/2153-0602.1000181
Journal of Data Mining in Genomics & Proteomics
ISSN: 2153-0602
Abstract
Dengue is a life-threatening disease transmitted by female mosquitoes. It is typically found in hot regions across the world. For a long time, experts have been trying to identify features of dengue so that patients can be categorized correctly, because different patients require different types of treatment. Pakistan has been affected by dengue for the last few years. In this paper, dengue fever data are used to evaluate classification techniques and compare their performance. The dataset was collected from District Headquarter Hospital (DHQ) Jhelum. To categorize the dataset, five classification techniques are used: Naïve Bayesian, REP Tree, Random Tree, J48 and SMO. WEKA was used as the data mining tool for classification. First, we evaluate the performance of each technique separately with the help of tables and graphs based on the dataset; second, we compare the performance of all the techniques.
Keywords: Dengue fever classification; Naïve Bayes; J48; SMO; REP Tree

*Corresponding author: Kamran Shaukat, IT Department, University of the Punjab, Jhelum Campus, Pakistan, Tel: 0544-448770; E-mail: dfbxff@gmail.com
Received June 04, 2015; Accepted October 19, 2015; Published October 25, 2015
Citation: Shaukat K, Masood N, Mehreen S, Azmeen U (2015) Dengue Fever Prediction: A Data Mining Problem. J Data Mining Genomics Proteomics 6: 181. doi:10.4172/2153-0602.1000181
Copyright: © 2015 Shaukat K, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction
Dengue infection is a serious disease caused by the dengue virus, which is spread to the human body by the female mosquito [1]. Its symptoms include headache, retro-orbital pain, joint pain, muscular pain and rash [2]. It is also known as break-bone fever [3].

Dengue infection endangers 2.5 billion people all around the world, and every year 50 million people suffer from it globally [1]. Pakistan has been a victim of this rapidly growing disease for the last few years. Pakistan's first case of dengue appeared in Karachi in 1994. Since 2007, a large number of cases have been recorded in Pakistan, especially in Lahore. The outbreak of 2011 was more life-threatening than those of preceding years, and 1400 people were affected [3].

According to the World Health Organization, dengue is divided into two types, type 1 and type 2 [3]. The first is classical dengue, called dengue fever, and the other is dengue hemorrhagic fever (DHF). DHF is further divided into four types: DHF1, DHF2, DHF3 and DHF4. DHF is marked by the onset of a fever that continues for 2 to 7 days, with signs such as plasma leakage, shock and weak pulse. In early cases it is hard to differentiate dengue fever from dengue hemorrhagic fever.

Different techniques can be used for dengue fever classification, such as the NB classifier, decision tree, KNN technique, multilayered technique and SVM [1,4,5]. These techniques are evaluated on five measures: accuracy, precision, sensitivity, specificity and negative rate.

Some researchers have worked on dengue fever classification, such as Tanner et al. and Tarig et al. Tanner's team used a decision tree approach; they classified 1200 patients, found 6 significant features and obtained 84% accuracy [6]. Tarig's team used a Self-Organizing Map (SOM) and multilayer feed-forward neural networks (MFNN); they clustered patients into two sets and obtained only 70% accuracy [7]. Fatimah Ibrahim et al. used a multilayer perceptron (MLP) and obtained 90% accuracy [8]. Daranee et al. proposed a decision tree method to classify dengue patients from two data sets [9]. They obtained 97.6% and 96.6% accuracy in the first and second experiments respectively. The accuracy of both experiments on an unseen test set was more than 90%, but in the day0 experiment the accuracy was very low and the tree was found to be overfitted. The experimental results therefore showed that the decision tree approach did not suit this task very well.

Wajeeha Farooqi et al. categorized dengue fever using the decision tree classification technique [3]. They used data mining techniques for efficient classification of the dengue fever type and performed two experiments using a decision tree. The first, general experiment achieved an accuracy of 99.44%. The second experiment classified dengue fever on the basis of expert-weighted attributes, selected according to minimum cost and resource availability. The accuracy of this model was still high at 98.62%. Performance was also compared in terms of Type II error, which was found to be very small in the second experiment.

M Naresh Kumar used an alternating decision tree approach for early diagnosis of dengue fever and compared its performance with the C4.5 algorithm [10]. The alternating decision tree technique was able to distinguish dengue fever using clinical and laboratory data, and it was compared with C4.5 on the number of correctly classified instances, F-measure and ROC. The alternating decision tree based approach with boosting was able to predict dengue fever with a greater degree of accuracy than the C4.5 based decision tree using simple clinical and laboratory features.

Noor Diana et al. presented a Malaysian dengue outbreak detection model using three classification methods [11]. A collection of different dengue data attributes was used for classification modeling, and performance was compared with previous related work. The experimental results show that the proposed classifiers improve on the performance of other methodologies, and that careful selection of attributes in the dengue dataset contributes to good results. Decision tree and nearest neighbor models were the commonly used methods for this problem, while RS was a rule-based method that provides significant knowledge to be considered further by professionals.
The WEKA data mining tool was used by Kashish Ara et al. for dengue disease prediction. Dengue data were first classified, and then the different data mining techniques in WEKA were compared through its different interfaces, as mentioned in (Figure 1) [12]. A dengue dataset with 107 instances was used to validate the approach, but WEKA used 99 rows and 18 attributes to find the best-performing technique and to predict the disease, reporting the accuracy of each classification technique. The core objective of their research was to categorize data, to help users extract useful information from data, and to easily identify a suitable technique for an accurate predictive model. Their conclusion is that NB and J48 are the most efficient techniques: according to the WEKA results, they took less time to build the model, attained the maximum accuracy of 100% with 99 correctly classified instances and the maximum ROC of 1, and had the minimum mean absolute error [12].
Objective
The general objective of this research is to use a few classification techniques to determine the population of dengue fever infected cases in Jhelum district and the geographically surrounding areas, so that we can compare the performance of different classification techniques. The objective of this study also includes comparing the classification algorithms with the help of graphs, based on our dataset. We implemented all the techniques using the WEKA tool, and the whole implementation procedure is carried out within it.

Methodology
We used WEKA as the DM tool for testing and execution. WEKA is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand [13]. We use some of the basic classification techniques from machine learning. Our main focus is dengue testing, i.e., deciding from a set of attributes whether or not a patient is affected by dengue. On the basis of the results, we show the accuracy of the classification techniques and then compare them. WEKA is a very good data mining tool for assessing classification accuracy with different techniques.

Classification
Classification is a type of data mining. Here it is used to recognize and detect features of infection among patients, and to determine which technique shows the best performance on the basis of WEKA's output.

Five techniques have been used in this paper: NB, REP Tree, RT, J48 and SMO. They are run through WEKA's Explorer interface.

All the techniques were applied to our dengue fever dataset. The classification and accuracy measures used are listed in (Table 1).

Dataset
A dataset is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data has almost 95 entries, but we use 25 random entries.

This dataset was taken from District Headquarter Hospital Jhelum. A chunk was selected from this dataset and treated as the training set, and it was tested in the WEKA data mining tool. Some of the data were used for classification and the rest were tested to check accuracy.

Figure 1: Chunk of dataset.

Attributes
CSV is the dataset file format accepted by the WEKA tool. The attributes that we have chosen for dengue testing are fever, bleeding, myalgia, flu, fatigue and other symptoms, with a class label, results, that is either positive or negative (Figure 2). The attribute descriptions are given in (Table 2).

Data mining techniques
Different DM techniques have been used for predicting the dengue virus. These predictions have been made for the purpose of classification and accuracy assessment using different techniques. The WEKA interface used for this purpose in this paper is the Explorer interface. Accuracy can be observed by selecting the following procedures: NB, REP Tree, RT, J48 and SMO.

The techniques we are using are the following:
•NB
•REP Tree
•RT
•J48
•SMO

Naïve Bayes technique: It performs statistical prediction, i.e., it predicts class membership probabilities. It is based on Bayes' theorem. A simple NB classifier has performance comparable with decision tree and selected neural network classifiers. We tested our training set in the WEKA data mining tool with the NB technique and obtained the results given in (Table 3).

REP tree: REP Tree uses regression tree logic and creates several trees in different iterations. It then picks the best one from all the generated trees, which is taken as the representative (Figure 3). When pruning the tree, the measure used is the mean squared error of the predictions made by the tree. We tested our training set in the WEKA data mining tool with the REP Tree technique and obtained the results given in (Table 4).

RT: Random Tree is a supervised classifier; it is an ensemble learning technique that generates many individual learners. It employs a bagging idea to create a random set of data for building a decision tree (Figure 4). In a standard tree, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node [14]. We tested our training set in the WEKA data mining tool with the Random Tree technique and obtained the results given in (Table 5).
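The node-splitting rule described for the random tree can be sketched in a few lines of Python. This is only an illustration under stated assumptions: information gain is assumed as the split criterion, the four yes/no records are made up and are not taken from the hospital dataset, and the function names (entropy, information_gain, best_split) are hypothetical helpers rather than WEKA calls.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_split(rows, labels, attributes, k, rng):
    """Pick the best attribute among k randomly chosen candidates."""
    candidates = rng.sample(attributes, k)
    return max(candidates, key=lambda a: information_gain(rows, labels, a))

# Hypothetical yes/no records (assumption: invented for illustration only).
rows = [
    {"fever": "yes", "bleeding": "yes", "flu": "no"},
    {"fever": "yes", "bleeding": "no", "flu": "yes"},
    {"fever": "no", "bleeding": "no", "flu": "yes"},
    {"fever": "no", "bleeding": "yes", "flu": "no"},
]
labels = ["positive", "positive", "negative", "negative"]
attributes = ["fever", "bleeding", "flu"]

rng = random.Random(0)
# Standard tree: every attribute is a candidate at the node.
standard_choice = best_split(rows, labels, attributes, k=3, rng=rng)
# Random tree: only a random subset of two attributes is considered.
random_choice = best_split(rows, labels, attributes, k=2, rng=rng)
print(standard_choice, random_choice)
```

With all three attributes as candidates the split falls on fever, the only attribute that separates the two classes in this toy data. With a random subset of two, the chosen attribute depends on which candidates are drawn, which is the source of diversity among the individual trees.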
J48: C4.5 is a technique used to create a decision tree, developed by Ross Quinlan; J48 is WEKA's implementation of it. C4.5 is an extension of Quinlan's earlier ID3 technique. The decision trees created by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy (Figure 5). We tested our training set in the WEKA data mining tool with the J48 technique and obtained the results given in (Table 6).

SMO: SMO stands for sequential minimal optimization, a technique for solving the quadratic programming (QP) problem that arises during the training of support vector machines (SVM). SMO is widely used for training SVMs [15]. We use this technique on our dataset to separate the data (Figure 6). After running this technique, we assessed the output of the classifier with different measures to produce a prediction for every instance of the dengue dataset. We tested our training set in the WEKA data mining tool with the SMO technique and obtained the results given in (Table 7).

Comparison
We have completed classification on our dataset with 5 data mining techniques. After analyzing our dataset with each technique, we compare them here. From the comparison we concluded that the Naïve Bayes technique is the best of all, as its accuracy of 92% was the highest. Naïve Bayes is also the best because it gives probabilities and is efficient, while Random Tree and REP Tree do not give us probabilities. The comparison of all the techniques is given in (Table 8), and the graph comparison is given in (Figure 7).

Conclusion
The main objective of this paper is the prediction of dengue infection using the WEKA data mining tool. WEKA has four interfaces, of which we use only one, the Explorer. We use five classification techniques, i.e., NB, SMO, J48, RT and REP Tree. These techniques were applied using the WEKA data mining tool to evaluate the accuracy obtained by each of them. After testing, the outcomes were compared on the basis of accuracy. The classifiers were compared with each other on correctly classified instances, precision, error rate, TP rate, FP rate and ROC area.

Using the Explorer interface, we concluded that NB and J48 are the top-performing classifiers: they achieved accuracies of 92% and 88% respectively, took less time to run and had the smallest error rates, with NB showing an ROC area of 0.815.

[Figure 2: Bayesian graph. Series: Correctly Classified, Incorrectly Classified, Relative Absolute Error, TP rate, FP rate, Precision, Recall; axis: Measure, 0% to 100%.]

Attribute Name           Definition
Correctly Classified     Percentage of test instances that are categorized correctly.
Incorrectly Classified   Percentage of test instances that are categorized incorrectly.
TP Rate                  Instances that were true and classified as true.
FP Rate                  Instances that were false but classified as true.
ROC Rate                 A ROC graph is a technique for visualizing, organizing and selecting classifiers based on their performance. ROC graphs have long been used in signal detection theory.
Precision                Precision and recall are calculated by summing up, from the actual results, how many times the predictions were right or wrong.
Types of Precision       There are four ways of being right or wrong: TN (case was negative and predicted negative), TP (case was positive and predicted positive), FN (case was positive but predicted negative) and FP (case was negative but predicted positive).
Accuracy                 A measure of a predictive model that reflects the proportionate number of times that the model is correct when applied to data.
Error Rate               A number that reflects the rate of errors made by a predictive model. It is one minus the accuracy.
Table 1: Attributes definition.

Attributes   Description
Epid         id of Patient
Fever        Yes or no
Bleeding     Yes or no
Flu          Yes or no
Myalgia      Yes or no
Others       Other symptoms
Results      Positive or negative
Table 2: Attribute description.

Attributes name           Measure
Correctly Classified      92%
Incorrectly Classified    8%
Relative Absolute Error   26%
TP rate                   0.92
FP rate                   0.253
Precision                 0.848
Recall                    0.92
F-measure                 0.882
ROC Area                  0.815
Table 3: Bayesian technique.
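The measures defined in Table 1 and reported for the Naïve Bayes technique in Table 3 can be made concrete with a small sketch. The code below is not the WEKA implementation: it is a minimal count-based Naïve Bayes with Laplace smoothing (an assumption), trained and tested on made-up yes/no symptom records that are not taken from the hospital dataset. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[(label, attribute)][value] += 1
    return class_counts, value_counts

def predict(model, row, values=("yes", "no")):
    """Return the class with the highest posterior score for one record."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        log_score = math.log(count / total)  # log prior
        for attribute, value in row.items():
            seen = value_counts[(label, attribute)][value]
            # Laplace smoothing (assumption) avoids zero probabilities.
            log_score += math.log((seen + 1) / (count + len(values)))
        scores[label] = log_score
    return max(scores, key=scores.get)

def evaluate(actual, predicted, positive="positive"):
    """Compute the Table 1 measures from actual and predicted labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # recall is also the TP rate
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "error_rate": 1 - accuracy,
            "precision": precision, "recall": recall,
            "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
            "f_measure": f_measure}

def record(fever, bleeding, myalgia, flu):
    """Build one hypothetical patient record."""
    return {"fever": fever, "bleeding": bleeding, "myalgia": myalgia, "flu": flu}

# Hypothetical training and test records (invented for illustration only).
train_rows = [record("yes", "yes", "yes", "no"), record("yes", "no", "yes", "no"),
              record("yes", "yes", "no", "no"), record("no", "no", "no", "yes"),
              record("yes", "no", "no", "yes"), record("no", "no", "yes", "yes")]
train_labels = ["positive"] * 3 + ["negative"] * 3
test_rows = [record("yes", "yes", "yes", "no"), record("no", "no", "no", "yes"),
             record("yes", "no", "no", "yes"), record("no", "no", "no", "no")]
test_labels = ["positive", "negative", "positive", "negative"]

model = train_naive_bayes(train_rows, train_labels)
predicted = [predict(model, row) for row in test_rows]
metrics = evaluate(test_labels, predicted)
print(predicted)
print(metrics)

# Harmonic mean of the precision and recall reported in Table 3.
table3_f = 2 * 0.848 * 0.92 / (0.848 + 0.92)
print(table3_f)
```

On this toy data the third test record is a positive case predicted as negative, so it counts as an FN and lowers recall while precision stays at 1. The last line recomputes the F-measure from the precision and recall in Table 3 and gives about 0.8825, close to the reported 0.882. If the Table 3 values are class-weighted averages, which is an assumption here, this harmonic-mean check is only approximate.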
[Figure 7: Comparison graph. Series: Correctly Classified, Incorrectly Classified, Relative Absolute Error, TP rate, FP rate, Precision; axis: Measure, 0% to 100%.]
Techniques     TP rate   ROC Rate   Error Rate   Accuracy
Bayesian       0.92      0.815      0.08         0.92
REP Tree       0.76      0.099      0.24         0.76
Random Tree    0.76      0.099      0.24         0.76
J48            0.88      0.596      0.24         0.76
SMO            0.76      0.494      0.24         0.76
Table 8: Comparison table.

References
1. Farooqi W, Ali S (2013) A Critical Study of Selected Classification Algorithms for Dengue Fever and Dengue Hemorrhagic Fever. Frontiers of Information Technology (FIT), 11th International Conference on IEEE.
2. Farooqi W, Ali S, Abdul W (2014) Classification of Dengue Fever Using Decision Tree. VAWKUM Transaction on Computer Sciences 3: 15-22.
3. Rigau-Pérez JG, et al. (1998) Dengue and dengue haemorrhagic fever. The Lancet 19: 971-977.
4. Tanner L, Schreiber M, Low JG, Ong A, Tolfvenstam T, et al. (2008) Decision Tree Algorithms Predict the Diagnosis and Outcome of Dengue Fever in the Early Phase of Illness. PLoS Neglected Tropical Disease 12: e196.
5. Phyu TN (2009) Survey of classification techniques in data mining. Proceedings of the International MultiConference of Engineers and Computer Scientists Vol 1.
6. Vong S, et al. (2010) Dengue incidence in urban and rural Cambodia: results from population-based active fever surveillance, 2006–2008. PLoS neglected tropical diseases 4: e903.
7. Faisal T, Ibrahim F, Taib MN (2010) A noninvasive intelligent approach for predicting the risk in dengue patients. Expert Systems with Application 37: 2175-2181.
8. Ibrahim F, Taib MN, Abas WA, Guan CC, Sulaiman S (2005) A novel dengue fever (DF) and dengue haemorrhagic fever (DHF) analysis using artificial neural network (ANN). Computer Methods and Programs in Biomedicine 79: 273-281.
9. Daranee T, Prapat S, Nuanwan S (2012) Data mining of dengue infection using decision tree. Entropy 2: 2.
10. Kumar MN (2013) Alternating Decision trees for early diagnosis of dengue fever. arXiv preprint arXiv:1305.7331.
11. Tarmizi NDA, et al. (2013) Classification of Dengue Outbreak Using Data Mining Models. Research Notes in Information Science 12: 71-75.
12. Shakil KA, Anis S, Alam M (2015) Dengue disease prediction using weka data mining tool. arXiv preprint arXiv:1502.05167.
13. Pérez MS, et al. (2005) Adapting the weka data mining toolkit to a grid based environment. Advances in Web Intelligence Springer, Berlin, Heidelberg, 492-497.
14. Gislason PO, Benediktsson JA, Sveinsson JR (2004) Random forest classification of multisource remote sensing and geographic data. Geoscience and Remote Sensing Symposium 2004 IGARSS'04 Proceedings 2004 IEEE International Vol 2.
15. Keerthi SS, et al. (2001) Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13: 637-649.