Prediction of HIV Status in Addis Ababa Using Data Mining Technology
Abstract
HIV/AIDS continues to be a major global health priority. Knowledge of HIV status helps both the
individual and the community. In spite of the widely and freely available VCT centers in Addis Ababa, most
people do not know their HIV status. One solution to this problem is to predict the HIV status of
the population using data mining techniques, identifying the most affected part of the population in order to
support prevention programs. The purpose of this paper is to construct and implement an HIV status predictive
model to scale up the knowledge of HIV status in Addis Ababa.
The methodology follows CRISP-DM, which comprises six steps: business understanding, data understanding,
data preparation, modeling, evaluation and deployment. Client data from Voluntary Counseling and Testing (VCT)
centers in Addis Ababa were used. Microsoft Excel was used for further preparation of the data, and WEKA 3.6
was used as the data mining tool to run experiments with the J48, PART and Naïve Bayes algorithms. WEKA 3.6
was also used to balance the imbalanced data, as HIV positive clients’ records are outnumbered by HIV negative
clients’ records.
Conventional algorithms such as PART, J48 and Naïve Bayes performed poorly in correctly predicting
the minority class (HIV positive clients). This problem was addressed by balancing the data. As a result,
a pruned J48 classifier that predicts HIV status with 93.95% accuracy was constructed. The paper concludes by
identifying the determinant factors of HIV status and developing an HIV status predictive system (HSPS), showing
that socio-demographic, behavioral and clinical result attributes are sufficient to predict the HIV status of an
individual.
Keywords: Prediction; HIV Status; HIV/AIDS; Classification; Classifiers; Prototype
... a crucial intervention component of the HIV/AIDS prevention, care and support program.
Knowledge of the HIV status of the population helps domain experts, policy makers and service providers by guiding them on how to design and where to implement their programs.
The World Health Organization [2] asserted that greater knowledge of HIV status within a community is critical to expanding access to HIV treatment, care and support in a timely manner, as it offers people with HIV an opportunity to receive information and tools to prevent HIV transmission to others.
One of the underlying concerns of HCT policy makers and/or service providers is the scaling up of the knowledge of HIV status. HIV status can be known only through HIV testing. In spite of the widely and freely available VCT centers in Addis Ababa, most people ignore the benefit of HIV testing and hence many people do not know their HIV status.
One solution to this problem is to predict the HIV status of the population from the data available in these VCT centers.
Domain experts use statistical reports to present the number of HIV positive and negative clients in terms of socio-demographic and behavioral variables. These traditional methods of data analysis have limited capacity to discover new and unanticipated relationships that are hidden in conventional databases [3].
The information stored in HCT centers can be used beyond the purpose of monthly, quarterly or annual reports. These input data can also be used as determining factors in predicting the HIV status of the population. The identification of determining factors provides a foundation upon which special intervention programs can be designed and/or existing programs can be improved to increase the response of clients.
Data mining provides the methodology and technology to transform these massive data into useful information for decision making and problem solving [3]. Data mining is a dynamic research area capable of extracting hidden relationships from these input variables to identify factors that determine the HIV status of clients. This research attempts to identify determinants of HIV status and to predict the HIV status of the population by analyzing patterns in VCT data, using recent data and more variables, so as to support the scaling up of knowledge of HIV status. The prediction is made at the level of the individual. Predictive data mining methods are used as pattern recognition tools to classify the HIV status of individuals based on demographic and socio-economic characteristics.
Finally, the study attempts to answer the following main research questions:
- What are the main determinant risk factors that cause HIV infection in Addis Ababa?
- Which data mining technique is more appropriate to identify these determining factors?
- How can the prototype of an HIV status prediction system be developed?
2. Literature Review
The paper by Rosma et al. [4] described the feasibility of applying data mining techniques to predict the survival of AIDS patients. An adaptive fuzzy regression classification technique, FuReA, was used to predict the length of survival of AIDS patients based on their CD4, CD8 and viral load counts. The authors reported that a neural network model was able to predict the survival of AIDS patients with an accuracy of 60% to 100% based on selected input variables such as CD4, CD8 and viral load counts.
In particular, a work similar to this study was conducted by Taryn [3]. According to the author, HIV status can be predicted using neural networks and demographic factors. The primary objective of that paper was to use artificial intelligence methods, namely neural networks, to perform knowledge discovery and data mining on HIV clinical and demographic data, resulting in a classifier of the HIV status of a patient based on demographic inputs. In
HiLCoE Journal of Computer Science and Technology, Vol. 2, No. 2 67
this study, supervised learning was used to train multilayer perceptrons and radial basis function networks to classify the HIV status of an individual, given certain demographic factors. The target population was pregnant women. The variables obtained in the study are: race, region, age of the mother, age of the father, education level of the mother, gravidity, parity, province of origin, and HIV status.
According to this study, all neural network architectures produced similar results, with an average accuracy between 61 and 62%. The author concluded that demographic data is not sufficient to accurately predict HIV status and that this accuracy is inadequate for medical classification.
However, the present study has disproved that result and shown that HIV status can be predicted using demographic, behavioral and clinical data.
3. Data Preprocessing
The data for this research were collected from the HCT centers’ database in Addis Ababa. The database in VCT centers is managed using Epi-info software, in REC format. It was exported from Epi-info to Microsoft Excel. Initially, there were 125,378 records with 80 attributes. The summary of clients by region is shown in Table 1.
Table 1: Summary of Clients by Region
Region          Frequency
Amhara          900
Afar            528
Dire Dawa       108
SNNP            2,600
Tigray          483
Addis Ababa     120,756
Missing value   4
Total           125,378
Only the attributes relevant to this paper were selected: Age, Sex, Education Level, Employment, Marital status, Condom use, Had ever sex, Previously tested, Casual partner, Steady partner, and HIV status (target variable). A simple statistical summary was performed to verify the quality of the data set, such as missing values and outliers, and to obtain high-level information about the data; these defects were then handled.
After preprocessing and cleaning, the data was found to be imbalanced:
- 65,422 (84%) HIV negative clients
- 12,902 (16%) HIV positive clients
- Total of 78,324 records with 11 attributes
Such a class distribution is considered so imbalanced that it can mislead classification performance and bias the model toward the majority class.
Lokanayaki and Malathi [5] remarked that a well balanced dataset is very important for creating a good prediction model. Medical datasets are often not balanced in their class labels. Most existing classification methods tend to perform poorly on minority class examples when the dataset is extremely imbalanced.
Zhai [6] also explained that, with imbalanced datasets, the conventional way of maximizing overall performance will often fail to learn anything useful about the minority class.
According to Laza [7], sampling strategies have been used to overcome the class imbalance problem by either eliminating some data from the majority class (under-sampling) or adding some artificially generated or duplicated data to the minority class (over-sampling).
However, according to Chawla [8], random over-sampling leads to over-fitting, as there will be multiple copies of minority examples, and random under-sampling may cause the classifier to miss important concepts. They proposed an over-sampling approach to overcome the over-fitting and broaden the decision region of minority class examples. This approach is the Synthetic Minority Over-sampling Technique (SMOTE).
In this technique, the minority class is over-sampled by creating “synthetic” examples rather than by over-sampling with duplicated real data entries. Therefore, in this paper the SMOTE technique in the WEKA
tool has been used to correct the imbalanced nature of the data.
4. Experimentation and Modeling
Classification algorithms such as J48, PART and Naïve Bayes were trained using stratified 10-fold cross-validation on the given dataset. While training these classifiers, the values of all variables, including the target variable, were given. During prediction, the value of the target attribute (HIV status) was predicted from the values of the other 10 variables (attributes).
Experiments on the two classification algorithms J48 and PART were performed by modifying parameters such as the confidence factor, the minimum number of instances per leaf, and pruning versus no pruning, using the WEKA tool. These parameters can increase or decrease the complexity and accuracy of the generated rules. Pruning may decrease complexity but at the cost of accuracy. Reducing the confidence factor helps to identify relevant attributes and reduces complexity. Moreover, increasing the number of instances per leaf reduces the complexity of the generated rules but at the cost of prediction accuracy. The Naïve Bayes classifier was trained with default parameters as it does not have the above options.
The optimized result is a model with excellent prediction performance and with less complex, sensible rules. According to these options, 11 experiments were performed and evaluated. Overall accuracy is not a good measure for imbalanced data, as it is dominated by the majority class. Therefore, the confusion matrix and the ROC curve were also taken as evaluation measures of predictive performance. The confusion matrix shows the predictive performance of the model on both the minority and majority classes. The ROC curve shows the trade-off between the true positive rate and the false positive rate. Moreover, the complexity and acceptability of the rules should be considered when comparing models. The result of each experiment is shown in Table 2.
The overall accuracies of the models are comparable. However, only two models generated less complex rules, and only one model, pruned J48 with a minimum of 100 instances per leaf and a confidence factor of 0.25, yielded more sensible rules. The overall prediction performance of this model is 93.95%.
Moreover, it registered a weighted average ROC area of 98.1%, which is close to perfect (100%). This shows that the model predicts HIV positive correctly as positive, but not at the cost of predicting HIV negative as positive. Similarly, it predicts HIV negative correctly as negative, but not at the cost of predicting HIV positive as negative.
Table 2: Experiment Results
Exp  Scheme            NL    NR     Acc.%   WROC
1    J48-C0.25-M2      456   -      95.11   98.8
2    J48-C0.25-M100    126   -      93.95   98.1
3    J48-C0.1-M2       379   -      95.02   98.7
4    J48-U-M2          875   -      95.20   98.9
5    J48-U-M100        180   -      93.98   98.4
6    PART-M2-C0.25     -     285    95.16   99.1
7    PART-M100-C0.25   -     72     94.01   98.3
8    PART-M2-C0.1      -     262    95.11   99.0
9    PART-U-M2         -     1002   95.20   99.0
10   PART-U-M100       -     289    94.07   98.8
11   Naïve Bayes       -     -      80.77   89.6
Key: Exp = Experiment Number, M = minimum number of instances per leaf, C = Confidence factor, U = Unpruned, NL = Number of Leaves, NR = Number of Rules, Acc. = Accuracy, WROC = Weighted Average ROC Area, “-” = no information.
The confusion matrix shows that out of the total of 65,417 actual HIV negative clients, 61,873 (94.58%) are classified as HIV negative and the rest are misclassified as HIV positive. Out of the total of 64,535 actual HIV positive clients, 60,223 (93.31%) are classified as HIV positive and the rest are misclassified as HIV
negative. These figures show that the model has almost equal prediction performance in terms of correctly classifying HIV negatives and HIV positives.
... negative, and out of the total of 65,417 actual HIV negative clients, 63,825 (97.56%) are classified as HIV negative and the rest are misclassified as HIV positive.
Generally, the evaluation measures show that the classifier performs poorly on the minority class of imbalanced data.
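The per-class percentages and the overall accuracy quoted for the pruned J48 model in Section 4 can be checked directly from the reported confusion-matrix counts. The following is a minimal sketch in Python, not part of the original study: the `ConfusionMatrix` class and the `from_class_totals` helper are hypothetical names introduced here for illustration, and the only inputs taken from the paper are the four reported counts (65,417 actual negatives with 61,873 correct; 64,535 actual positives with 60,223 correct).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix; 'positive' means HIV positive (hypothetical helper)."""
    tp: int  # actual positives classified as positive
    fn: int  # actual positives classified as negative
    tn: int  # actual negatives classified as negative
    fp: int  # actual negatives classified as positive

    def accuracy(self) -> float:
        """Share of all records classified correctly."""
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total

    def sensitivity(self) -> float:
        """True positive rate: share of actual positives found."""
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        """True negative rate: share of actual negatives found."""
        return self.tn / (self.tn + self.fp)

    def balanced_accuracy(self) -> float:
        """Mean of the two per-class rates; not dominated by the larger class."""
        return (self.sensitivity() + self.specificity()) / 2


def from_class_totals(pos_total: int, pos_correct: int,
                      neg_total: int, neg_correct: int) -> ConfusionMatrix:
    """Build the matrix from per-class totals and correct counts, as the text reports them."""
    return ConfusionMatrix(
        tp=pos_correct,
        fn=pos_total - pos_correct,
        tn=neg_correct,
        fp=neg_total - neg_correct,
    )


if __name__ == "__main__":
    # Counts reported in Section 4 for the pruned J48 model on the balanced data.
    cm = from_class_totals(pos_total=64535, pos_correct=60223,
                           neg_total=65417, neg_correct=61873)
    print(f"accuracy          = {cm.accuracy():.4f}")
    print(f"sensitivity (TPR) = {cm.sensitivity():.4f}")
    print(f"specificity (TNR) = {cm.specificity():.4f}")
    print(f"balanced accuracy = {cm.balanced_accuracy():.4f}")
```

With these counts the overall accuracy rounds to 93.95%, which agrees with the figure reported for experiment 2 in Table 2. Reporting sensitivity and specificity separately is what exposes weak minority-class performance that a single accuracy figure can hide.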
... imbalanced dataset, either the algorithms or the dataset itself should be modified using some techniques. The sampling technique used in the above paper was random over-sampling, which works by duplicating the existing data. However, this sampling technique results in over-fitting. When over-fitting occurs, the accuracy on the test data is reduced, which in turn reduces the prediction accuracy of the model. Finally, the paper stated that the method used to handle missing values did not work well when there are multiple missing values, which also reduces the prediction accuracy of the model. Because of these weaknesses, the prediction accuracy of that paper was reduced.
Our work disproved the result of the above paper by showing that it is possible to predict the HIV status of an individual using data mining techniques with 93.95% prediction accuracy, and that socio-demographic, behavioral and clinical data are sufficient to accurately predict the HIV status of an individual. This result was achieved by focusing on the weaknesses of the above study and taking the approaches discussed below.
Decision tree and rule based methods are used in this paper. These methods are simple and human interpretable. Unlike neural network algorithms, these algorithms output rules that can be displayed and interpreted. As a result, the rules can be modified according to comments from domain experts and personal judgment. The dataset used in this paper was also large enough (78,324 records) to increase the prediction accuracy of the models. Moreover, as these conventional algorithms are not good enough to construct a good model from an imbalanced dataset, an appropriate sampling technique (SMOTE) was used to balance the dataset. Unlike the sampling technique used in the study above, this technique works by creating synthetic samples rather than duplicating the existing ones. This helps avoid over-fitting and increases the prediction accuracy of the model. In addition to these important amendments to the above study, inconsistent, missing and outlier values in this paper were handled by simply deleting them, as the size of the data was large enough. This resulted in higher prediction accuracy of the constructed model. Finally, unlike the above paper, in this study the parameter settings that give the optimum result are identified and a prototype of an HIV status predictive system has been developed.
6. Prototype Development and Implementation
In this section, an HIV status prediction system (HSPS) is developed that can assist HIV/AIDS intervention program domain experts, policy makers or service providers in predicting the HIV status of the population based on socio-demographic, behavioral and clinical result attributes. The system is developed in a Java environment. The HSPS converts codes written in Java and integrates the results into a Java interface.
Figure 3: The HSPS Interface
The main functions of the HSPS interface shown in Figure 3 include a client data input section, where users enter the 10 attribute values, and the following elements:
a. Predict button: users click the button to get the result.
b. Clear button: users click the button to clear the previous input.
c. Exit button: users click the button to leave the interface.
d. Prediction result: this text box shows the prediction result for the provided data.
e. Text area: displays the determinant factors.
The prediction result displays “Vulnerable” if the client is vulnerable to the virus and “Not Vulnerable” if the client is not vulnerable to the virus. Moreover,
it displays “No such rule” if no rule was generated for such a pattern.
7. Conclusion and Recommendation
In this paper, supervised learning was used to train the three classification techniques to classify the HIV status of an individual, given certain socio-demographic, behavioral and clinical result factors.
The paper concludes with the following important points.
First, although the two classes (HIV negative and HIV positive) are imbalanced in size, equal importance should be given to their predictions. Even though the cost of predicting HIV positive as HIV negative seems more serious than the cost of predicting HIV negative as HIV positive, equal importance has been given to each class prediction. As a result, the poor prediction performance of models on the minority data should not be ignored. By using a proper sampling technique, an optimum result has been achieved.
Second, the paper disproved the work of Taryn [3], who concluded that demographic data is not sufficient to predict HIV status. This research has shown that HIV status can be predicted using demographic, behavioral and clinical data with a prediction accuracy of 94%.
Third, in addition to the HIV status predictive model, the main findings of this paper are the determinant factors, the parameters that result in optimum output, and the prototype of the HIV status predictive system.
Fourth, the model can be used indefinitely, but it can be updated by rerunning the same modeling procedure on new or modified data, and the prototype can be enhanced in the future to make it complete.
Fifth, this study makes several contributions. It provides useful insights into HIV/AIDS prevention programs for policy makers, domain experts and service providers. The clients (who need the service) of VCT centers or other prevention programs can benefit if VCT centers are established or modified and operated according to the output of this paper. The paper can also be used as an additional reporting mechanism alongside the statistical report. Moreover, it can invite other interested researchers to explore more in related and similar areas.
Finally, the paper recommends that other researchers conduct research on the same topic using relevant and different attributes. Besides, the dataset in this domain is usually too imbalanced for conventional algorithms to perform equally and correctly on the two classes. Therefore, further research is recommended on modifying these conventional algorithms so that they perform well and equally on both classes when the data is imbalanced.
References
[1] FHAPCO, “Country Progress Report on HIV/AIDS Response,” GAP Report, Addis Ababa, Ethiopia, March 31, 2012.
[2] WHO, “Scaling up Priority HIV/AIDS Interventions in the Health Sector,” Progress Report, Sept. 2010.
[3] Taryn Nicole Ho Tim, “Predicting HIV Status Using Neural Networks and Demographic Factors,” Johannesburg, April 2006.
[4] Rosma, Sameem, Kareem, Basir, and Adeeba Annapurni, “The Prediction of AIDS Survival: A Data Mining Approach,” in Proceedings of the 2nd WSEAS International Conference on Multivariate Analysis and its Applications in Science and Engineering, Vol. 2, 2007.
[5] Lokanayaki and Malathi, “Data Preprocessing for Liver Dataset Using SMOTE,” International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, No. 11, Nov. 2013.
[6] Zhai, “An Effective Over-sampling Method for Imbalanced Data Sets Classification,” Chinese Journal of Electronics, Vol. 20, No. 3, Jul. 2011, pp. 489-494.
[7] Laza, “Evaluating the Effect of Unbalanced Data in Biomedical Document Classification,” Journal of Integrative Bioinformatics, Vol. 8, No. 3, Sep. 2011, pp. 177.
[8] Chawla, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, Vol. 16, 2002, pp. 321–357.