Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks
D. Md. Farid et al., Expert Systems with Applications 41 (2014) 1937–1946
Keywords: Data mining; Classification; Hybrid; Decision tree; Naïve Bayes classifier

Abstract

In this paper, we introduce two independent hybrid mining algorithms to improve the classification accuracy rates of decision tree (DT) and naïve Bayes (NB) classifiers for the classification of multi-class problems. Both DT and NB classifiers are useful, efficient and commonly used for solving classification problems in data mining. Since the presence of noisy contradictory instances in the training set may cause the generated decision tree to suffer from overfitting and lose accuracy, our first proposed hybrid DT algorithm employs an NB classifier to remove the noisy, troublesome instances from the training set before the DT induction. Moreover, it is extremely computationally expensive for an NB classifier to compute class conditional independence for a dataset with high-dimensional attributes. Thus, in the second proposed hybrid NB classifier, we employ DT induction to select a comparatively more important subset of attributes for the naïve assumption of class conditional independence. We tested the performance of the two proposed hybrid algorithms against that of the existing DT and NB classifiers using classification accuracy, precision, sensitivity–specificity analysis, and 10-fold cross validation on 10 real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed methods produce impressive results in the classification of real-life, challenging multi-class problems. They are also able to automatically extract the most valuable training datasets and identify the most effective attributes for the description of instances from noisy, complex training databases with large numbers of attributes.
… knowledge, (d) able to handle both numerical and categorical data, (e) robust, and (f) able to deal with large and noisy datasets. A naïve Bayes (NB) classifier is a simple probabilistic classifier based on: (a) Bayes theorem, (b) strong (naïve) independence assumptions, and (c) independent feature models (Farid, Rahman, & Rahman, 2011, 2010; Lee & Isa, 2010). It is also an important classifier for data mining and is applied in many real-world classification problems because of its high classification performance. Similar to the DT, the NB classifier also has several advantages: (a) it is easy to use, (b) only one scan of the training data is required, (c) it handles missing attribute values, and (d) it handles continuous data.

In this paper, we propose two hybrid algorithms, respectively for a DT classifier and an NB classifier, for multi-class classification tasks. The first proposed hybrid DT algorithm finds the troublesome instances in the training data using an NB classifier and removes these instances from the training set before constructing the learning tree for decision making. Otherwise, the DT may suffer from overfitting due to the presence of such noisy instances and its accuracy may decrease. Moreover, computing class conditional independence using an NB classifier is extremely computationally expensive for a dataset with many attributes. Our second proposed hybrid NB algorithm finds the most crucial subset of attributes using DT induction, and the weights of the attributes selected by the DT are also calculated. Then only these most important attributes selected by the DT, with their corresponding weights, are employed for the calculation of the naïve assumption of class conditional independence. We evaluate the performance of the proposed hybrid algorithms against that of existing DT and NB classifiers using classification accuracy, precision, sensitivity–specificity analysis, and 10-fold cross validation on 10 real benchmark datasets from the UCI (University of California, Irvine) machine learning repository (Frank & Asuncion, 2010). The experimental results show that the proposed methods produce very promising results in the classification of real-world, challenging multi-class problems. These methods also allow us to automatically extract the most representative, high-quality training datasets and identify the most important attributes for the characterization of instances from a large amount of noisy training data with high-dimensional attributes.
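To make the first idea concrete, the short Python sketch below is a minimal illustration, not the authors' exact procedure: it assumes scikit-learn estimators, numeric feature arrays, and one plausible noise criterion, namely discarding the training instances that the NB classifier itself misclassifies, before inducing the decision tree.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def hybrid_dt_fit(X, y):
    """Fit a DT after discarding instances flagged as noisy by an NB classifier."""
    y = np.asarray(y)
    nb = GaussianNB().fit(X, y)           # step 1: learn an NB model on the full training set
    keep = nb.predict(X) == y             # step 2: treat NB/label disagreements as noisy
    return DecisionTreeClassifier(criterion="entropy").fit(X[keep], y[keep])

# Usage (hypothetical arrays): tree = hybrid_dt_fit(X_train, y_train); y_pred = tree.predict(X_test)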
The rest of the paper is organized as follows. Section 2 gives an overview of the work related to DT and NB classifiers. Section 3 introduces the basic DT and NB classification techniques. Section 4 presents our two proposed hybrid algorithms for multi-class classification problems, respectively based on DT and NB classifiers. Section 5 provides experimental results and a comparison against existing DT and NB algorithms using 10 real benchmark datasets from the UCI machine learning repository. Finally, Section 6 concludes the findings and proposes directions for future work.
2. Related work

In this section, we review recent research on decision trees and naïve Bayes classifiers for various real-world multi-class classification problems.

2.1. Decision trees

Decision tree classification provides a rapid and useful solution for classifying instances in large datasets with a large number of variables. There are two common issues in the construction of decision trees: (a) the growth of the tree to enable it to accurately categorize the training dataset, and (b) the pruning stage, whereby superfluous nodes and branches are removed in order to improve classification accuracy. Franco-Arcega, Carrasco-Ochoa, Sanchez-Diaz, and Martinez-Trinidad (2011) presented decision trees using fast splitting attribute selection (DTFS), an algorithm for building DTs for large datasets. DTFS used this attribute selection technique to expand nodes and process all the training instances in an incremental way. In order to avoid storing all the instances in the DT, DTFS stored at most N instances in a leaf node. When the number of instances stored in a leaf node reached its limit, DTFS expanded or updated the leaf node according to the class labels of the instances stored in it. If the leaf node contained training instances from only one class, then DTFS updated the value of the input tree branch of this leaf node. Otherwise, DTFS expanded the leaf node by choosing a splitting attribute and created one branch for each class of the stored instances. In this approach, DTFS always considered only a small number of instances in main memory for the building of a DT. Aviad and Roy (2011) also introduced a decision tree construction method based on adjusted cluster analysis classification, called classification by clustering (CbC). It found similarities between instances using clustering algorithms and also selected target attributes; it then calculated the target attribute distribution for each cluster. When a threshold for the number of instances stored in a cluster was reached, all the instances in each cluster were classified with respect to the appropriate value of the target attribute.

Polat and Gunes (2009) proposed a hybrid classification system based on a C4.5 decision tree classifier and a one-against-all method to improve the classification accuracy for multi-class classification problems. Their one-against-all method constructed M binary C4.5 decision tree classifiers, each of which separated one class from all of the rest. The ith C4.5 decision tree classifier was trained with all the training instances of the ith class as positive labels and all the others as negative labels. The performance of this hybrid classifier was tested using classification accuracy, sensitivity–specificity analysis, and 10-fold cross validation on three datasets taken from the UCI machine learning repository (Frank & Asuncion, 2010). Balamurugan and Rajaram (2009) proposed a method to resolve one of the exceptions in basic decision tree induction algorithms, namely when the class prediction at a leaf node cannot be determined by majority voting. The influential factor of attributes was derived in their work, which gave the dependability of the attribute value on the class label, and the DT was pruned based on this factor. When classifying new instances using this pruned tree, the class labels can be assigned more accurately than by the basic assignment of traditional DT algorithms.

Chen and Hung (2009) presented an associative classification tree (ACT) that combined the advantages of both associative classification and decision trees. The ACT tree was built using a set of associative classification rules with high classification predictive accuracy. ACT followed a simple heuristic, which selected the attribute with the highest gain measure as the splitting attribute. Chandra and Varghese (2009) proposed a fuzzy decision tree Gini Index based (G-FDT) algorithm to fuzzify the decision boundary without converting the numeric attributes into fuzzy linguistic terms. The G-FDT tree used the Gini Index as the split measure to choose the most appropriate splitting attribute for each node in the decision tree. For the construction of the decision tree, the Gini Index was computed using the fuzzy-membership values of the attribute corresponding to a split value and the fuzzy-membership values of the instances. The split-points were chosen as the mid-points of attribute values where the class information changed. Aitkenhead (2008) presented a co-evolving decision tree method, where a large number of variables in datasets were considered. They proposed a novel combination of DTs and evolutionary methods, such as the bagging approach of a DT classifier and a back-propagation neural network method, to improve the classification accuracy. Such methods evolved the structure of a decision tree and also handled a comparatively wider range of values and data types.
2.2. Naïve Bayes classifiers

The naïve Bayes classifier is also widely used for classification problems in data mining and machine learning because of its simplicity and impressive classification accuracy. Koc, Mazzuchi, and Sarkani (2012) applied a hidden naïve Bayes (HNB) classifier to a network intrusion detection system (NIDS) to classify network attacks. It significantly improved the accuracy for the detection of denial-of-service (DoS) attacks in particular. The HNB classifier was an extended version of the basic NB classifier; it relaxed the conditional independence assumption imposed on the basic NB classifier. The HNB method was based on the idea of creating another layer that represents a hidden parent of each attribute. The influences from all of the other attributes can thus be easily combined through conditional probabilities estimated from the training dataset. The HNB multi-class classification model exhibited superior overall performance in terms of accuracy, error rate and misclassification cost compared with the traditional NB classifier on the KDD 99 dataset (McHugh, 2000; Tavallaee, Bagheri, Lu, & Ghorbani, 2009). Valle, Varas, and Ruz (2012) also presented an approach based on an NB classifier to predict the performance of sales agents of a call centre dedicated exclusively to sales and telemarketing. This model was tested using socio-demographic (age, gender, marital status, socioeconomic status and experience) and performance (logged hours, talked hours, effective contacts and finished records) information as attributes of individuals. The results showed that the socio-demographic attributes were not suitable for predicting sales performance, but the operational records proved to be useful for the prediction of the performance of sales agents.

Chandra and Gupta (2011) proposed a robust naïve Bayes classifier (R-NBC) to overcome two major limitations, i.e., underflow and over-fitting, for the classification of gene expression datasets. R-NBC used logarithms of probabilities rather than multiplying probabilities to handle the underflow problem, and employed an estimation approach to provide solutions to the over-fitting problem. It did not require any prior feature selection approaches in the field of microarray data analysis, where a large number of attributes are considered. Fan, Poh, and Zhou (2010) proposed a partition-conditional independent component analysis (PC-ICA) method for …

3. Basic DT and NB classification techniques

The following symbols are used throughout the paper:

Symbol   Term
x_i      A data point or instance
X        A set of instances
A_i      An attribute
A_ij     An attribute's value
W_i      The weight of attribute A_i
C        Total number of classes in a training dataset
C_i      A class label
D        A training dataset
D_i      A subset of a training dataset
T        A decision tree

3.1. Decision tree

A decision tree classifier is typically a top-down greedy approach, which provides a rapid and effective method for classifying data instances (Chandra & Paul Varghese, 2009; Jamain & Hand, 2008). Generally, each DT is a rule set. A DT recursively partitions the training dataset into smaller subsets until all the subsets belong to a single class. The most common DT algorithm is ID3 (Iterative Dichotomiser), which uses information theory as its attribute selection measure (Quinlan, 1986). The root node of the DT is chosen based on the highest information gain of the attributes. Given a training dataset, D, the expected information needed to correctly classify an instance, x_i ∈ D, is given in Eq. (1), where p_i is the probability that x_i ∈ D belongs to a class, C_i, and is estimated by |C_{i,D}|/|D|.

Info(D) = - \sum_{i=1}^{m} p_i \log_2(p_i)    (1)

In Eq. (1), Info(D) is the average amount of information needed to identify C_i of an instance, x_i ∈ D. The goal of the DT is to iteratively partition D into subsets, {D_1, D_2, ..., D_n}, where all instances in each D_i belong to the same class, C_i. Info_A(D) is the expected information required to correctly classify an instance, x_i, from D based on the partitioning by attribute A. Eq. (2) shows the Info_A(D) calculation, where |D_j|/|D| acts as the weight of the jth partition.

Info_A(D) = \sum_{j=1}^{n} (|D_j| / |D|) \, Info(D_j)    (2)
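As a quick illustration of Eqs. (1) and (2), the short Python sketch below computes Info(D) and Info_A(D) for a single categorical attribute; the function names and data layout are illustrative choices, not from the paper, and the example values refer to the Outlook attribute of the playing tennis data shown later in Table 2.

from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum_i p_i log2(p_i) over the class distribution of D (Eq. (1))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_attr(pairs):
    """Info_A(D): size-weighted average of Info(D_j) over the partitions induced by A (Eq. (2))."""
    n = len(pairs)
    partitions = {}
    for value, label in pairs:
        partitions.setdefault(value, []).append(label)
    return sum(len(lab) / n * info(lab) for lab in partitions.values())

# The Outlook column of the playing tennis data, paired with the Play label.
outlook = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
           ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
           ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
           ("Overcast", "Yes"), ("Rain", "No")]
labels = [c for _, c in outlook]
print(round(info(labels), 3))                         # Info(D)        -> 0.940
print(round(info(labels) - info_attr(outlook), 3))    # gain(Outlook)  -> 0.247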
When the Gini index is used as the split measure instead, the goodness of a split of D into two subsets, D_1 and D_2, is defined by Eq. (7), where n is the number of instances in D and n_1 and n_2 are the numbers of instances in D_1 and D_2.

gini_split(D) = (n_1/n) \, gini(D_1) + (n_2/n) \, gini(D_2)    (7)

In this way, the split with the best gini value is chosen.
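A minimal sketch of Eq. (7), assuming the usual Gini impurity gini(D) = 1 - Σ_i p_i² and a candidate binary split given as two lists of class labels; the function names are illustrative.

from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_i p_i^2 over the class distribution of D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1_labels, d2_labels):
    """Eq. (7): size-weighted Gini impurity of a binary split of D into D1 and D2."""
    n = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / n) * gini(d1_labels) + (len(d2_labels) / n) * gini(d2_labels)

# The candidate split with the smallest gini_split value is preferred.
print(round(gini_split(["Yes", "Yes", "Yes", "No"], ["No", "No", "Yes"]), 3))   # -> 0.405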
To illustrate the operation of a DT, we consider the small dataset in Table 2, described by four attributes, namely Outlook, Temperature, Humidity, and Wind, which represent the weather conditions of a particular day. Each attribute has several unique attribute values. The Play column in Table 2 represents the class category of each instance; it indicates whether or not a particular weather condition is suitable for playing tennis. Fig. 1 shows the decision tree model constructed from the playing tennis dataset shown in Table 2.
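As a worked check of Eqs. (1) and (2) on Table 2 (9 Yes and 5 No instances): Info(D) = -(9/14)log2(9/14) - (5/14)log2(5/14) ≈ 0.940 bits, and the information gains are approximately 0.247 for Outlook, 0.152 for Humidity, 0.048 for Wind and 0.029 for Temperature, so Outlook yields the largest gain and is selected as the root node of the tree.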
Fig. 1. A decision tree generated using the playing tennis dataset.
3.2. Naïve Bayes classifier

The NB classifier can handle missing attribute values by simply omitting the corresponding probabilities for those attributes when calculating the likelihood of membership for each class. The NB classifier also requires class conditional independence, i.e., the effect of an attribute on a given class is assumed to be independent of the effects of the other attributes. Given a training dataset, D = {X_1, X_2, ..., X_n}, each data record is represented as X_i = {x_1, x_2, ..., x_n}. D contains the attributes {A_1, A_2, ..., A_n}, and each attribute A_i contains the attribute values {A_{i1}, A_{i2}, ..., A_{ih}}. The attribute values can be discrete or continuous. D also contains a set of classes C = {C_1, C_2, ..., C_m}, and each training instance, X ∈ D, has a particular class label C_i. For a test instance, X, the classifier will predict that X belongs to the class with the highest posterior probability conditioned on X. That is, the NB classifier predicts that the instance X belongs to the class C_i if and only if P(C_i|X) > P(C_j|X) for 1 ≤ j ≤ m, j ≠ i. The class C_i for which P(C_i|X) is maximized is called the Maximum Posteriori Hypothesis.

P(C_i | X) = P(X | C_i) P(C_i) / P(X)    (8)

In Bayes' theorem, shown in Eq. (8), since P(X) is constant for all classes, only P(X|C_i)P(C_i) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C_1) = P(C_2) = ... = P(C_m), and we therefore maximize P(X|C_i); otherwise, we maximize P(X|C_i)P(C_i). The class prior probabilities are calculated by P(C_i) = |C_{i,D}|/|D|, where |C_{i,D}| is the number of training instances belonging to the class C_i in D. Computing P(X|C_i) in a dataset with many attributes is extremely computationally expensive. Thus, the naïve assumption of class conditional independence is made, as expressed in Eqs. (9) and (10).

P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)    (9)

P(X | C_i) = P(x_1 | C_i) \times P(x_2 | C_i) \times \cdots \times P(x_n | C_i)    (10)

In Eq. (9), x_k refers to the value of attribute A_k for instance X. The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can therefore be easily estimated from the training instances. Moreover, the attributes in training datasets can be categorical or continuous-valued. If the attribute value, A_k, is categorical, then P(x_k|C_i) is the number of instances in the class C_i ∈ D with the value x_k for A_k, divided by |C_{i,D}|, i.e., the number of instances belonging to the class C_i ∈ D.

If A_k is a continuous-valued attribute, then A_k is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ, defined respectively by the following two equations:

P(x_k | C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})    (11)

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2 / (2\sigma^2)}    (12)

In Eq. (11), μ_{C_i} is the mean and σ_{C_i} is the standard deviation of the values of the attribute A_k for all training instances in the class C_i. These two quantities are then brought into Eq. (12), together with x_k, in order to estimate P(x_k|C_i). To predict the class label of an instance X, P(X|C_i)P(C_i) is evaluated for each class C_i ∈ D. The NB classifier predicts that the class label of instance X is the class C_i if and only if

P(X | C_i) P(C_i) > P(X | C_j) P(C_j)    (13)

In Eq. (13), 1 ≤ j ≤ m and j ≠ i. That is, the predicted class label is the class C_i for which P(X|C_i)P(C_i) is the maximum. Tables 3 and 4 respectively tabulate the prior probabilities for each class and the conditional probabilities for each attribute value generated from the playing tennis dataset shown in Table 2.

Table 2
The playing tennis dataset.

Outlook    Temperature   Humidity   Wind     Play
Sunny      Hot           High       Weak     No
Sunny      Hot           High       Strong   No
Overcast   Hot           High       Weak     Yes
Rain       Mild          High       Weak     Yes
Rain       Cool          Normal     Weak     Yes
Rain       Cool          Normal     Strong   No
Overcast   Cool          Normal     Strong   Yes
Sunny      Mild          High       Weak     No
Sunny      Cool          Normal     Weak     Yes
Rain       Mild          Normal     Weak     Yes
Sunny      Mild          Normal     Strong   Yes
Overcast   Mild          High       Strong   Yes
Overcast   Hot           Normal     Weak     Yes
Rain       Mild          High       Strong   No

Table 3
Prior probabilities for each class generated using the playing tennis dataset.

Probability      Value
P(Play = Yes)    9/14 = 0.642
P(Play = No)     5/14 = 0.357
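To make Eqs. (8)–(10) and (13) concrete, the short script below reproduces the hand calculation for the test instance X = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong); the priors come from Table 3 and the conditional probabilities are counted directly from Table 2.

p_yes = 9 / 14                     # P(Play = Yes), Table 3
p_no = 5 / 14                      # P(Play = No),  Table 3

# P(x_k | Yes) and P(x_k | No) for Sunny, Cool, High, Strong, counted from Table 2
yes_likelihood = (2 / 9) * (3 / 9) * (3 / 9) * (3 / 9)
no_likelihood = (3 / 5) * (1 / 5) * (4 / 5) * (3 / 5)

score_yes = yes_likelihood * p_yes     # ~ 0.0053
score_no = no_likelihood * p_no        # ~ 0.0206

print("Predicted class:", "Yes" if score_yes > score_no else "No")   # -> No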
4. The proposed hybrid learning algorithms

In this paper, we have proposed two independent hybrid algorithms, respectively for decision tree and naïve Bayes classifiers, to improve the classification accuracy in multi-class classification tasks. These proposed algorithms are described in the following subsections.
… class C_i = {C_{i1}, C_{i2}, ..., C_{ik}} also has some values. First of all, in this proposed hybrid NB algorithm, we generate a basic decision tree, T, from the training dataset, D, and collect the attributes appearing in the tree. In this approach, the DT is applied as an attribute selection and attribute weighting method. That is, we use the DT classifier to find the subset of the attributes in the training dataset that play crucial roles in the final classification. After the tree construction, we initialize the weight, W_i, for each attribute, A_i ∈ D. If the attribute, A_i ∈ D, is not tested in the DT, then the weight, W_i, of the attribute, A_i, is initialized to zero. Otherwise, we calculate the minimum depth, d, at which the attribute, A_i ∈ T, is tested in the DT and initialize the weight, W_i, of the attribute, A_i, with the value 1/√d. In this way, the importance of the attributes is measured by their corresponding weights. For example, the root node of the tree has a higher weight value in comparison with those of its child nodes. Subsequently, we calculate the class conditional probabilities using only those attributes selected by the DT (i.e., W_i ≠ 0) and classify each instance x_i ∈ D using these probabilities. The weights of the attributes selected by the DT are also used as exponential parameters (see Eq. (14)) in the class conditional probability calculation. The other attributes, which are not selected by the DT (i.e., W_i = 0), are not considered in the final probability calculation; that is, the class conditional probabilities of the attributes not selected by the DT are neither generated nor employed in producing the classification result.

Moreover, we calculate the prior probability, P(C_i), for each class, C_i, by counting how often C_i occurs in D. For each attribute, A_i, the number of occurrences of each attribute value, A_{ij}, can be counted to determine P(A_i). Similarly, the probability P(A_{ij}|C_i) can also be estimated by counting how often each A_{ij} occurs in C_i ∈ D. This is done only for those attributes that appear in the DT, T. To classify an instance, x_i ∈ D, P(C_i) and P(A_{ij}|C_i) from D are used to make the prediction. This is done by combining the effects of the different attribute values, A_{ij} ∈ x_i. As mentioned earlier, the weights of the selected attributes also influence the class conditional probability calculation as exponential parameters. We estimate P(x_i|C_i) by Eq. (14).

P(x_i | C_i) = \prod_{j=1}^{n} P(A_{ij} | C_i)^{W_i}    (14)

Steps 10–31 of the pseudocode for the proposed hybrid NB algorithm are listed below:

10: T0 = DTBuild(D);
11: end if
12: T = Add T0 to arc;
13: end for
14: for each attribute, A_i ∈ D, do
15:   if A_i is not tested in T, then
16:     W_i = 0;
17:   else
18:     set d as the minimum depth of A_i ∈ T, and W_i = 1/√d;
19:   end if
20: end for
21: for each class, C_i ∈ D, do
22:   find the prior probabilities, P(C_i);
23: end for
24: for each attribute, A_i ∈ D with W_i ≠ 0, do
25:   for each attribute value, A_{ij} ∈ A_i, do
26:     find the class conditional probabilities, P(A_{ij}|C_i)^{W_i};
27:   end for
28: end for
29: for each instance, x_i ∈ D, do
30:   find the posterior probability, P(C_i|x_i);
31: end for
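The following Python code is a compact sketch of the weighting and prediction steps described above, not the authors' reference implementation: it uses a scikit-learn DecisionTreeClassifier to obtain the minimum depth at which each attribute is tested (counting the root as depth 1 so that W_i = 1/√d is well defined), sets W_i = 0 for unused attributes, and applies Eq. (14) with Laplace smoothing (an added assumption) over ordinally encoded categorical features. Note that scikit-learn builds binary trees over the encoded values, so the depths only approximate those of the paper's multiway DT.

from collections import Counter, defaultdict
from math import sqrt

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def attribute_weights(dt, n_features):
    """W_i = 1/sqrt(minimum depth at which attribute i is tested), or 0 if unused."""
    t = dt.tree_
    min_depth = {}

    def walk(node, depth):
        if t.children_left[node] == -1:      # leaf: no attribute is tested here
            return
        f = t.feature[node]
        min_depth[f] = min(min_depth.get(f, depth), depth)
        walk(t.children_left[node], depth + 1)
        walk(t.children_right[node], depth + 1)

    walk(0, 1)                               # the root is counted as depth 1
    return np.array([1.0 / sqrt(min_depth[i]) if i in min_depth else 0.0
                     for i in range(n_features)])

def fit_weighted_nb(X, y, alpha=1.0):
    """Priors P(C_i) and Laplace-smoothed conditionals P(A_ij | C_i) for categorical data."""
    classes = sorted(set(y))
    priors = {c: float(np.mean(y == c)) for c in classes}
    cond = defaultdict(dict)                 # cond[(attribute, value)][class] = probability
    for i in range(X.shape[1]):
        values = set(X[:, i].tolist())
        for c in classes:
            column = X[y == c, i].tolist()
            counts = Counter(column)
            for v in values:
                cond[(i, v)][c] = (counts[v] + alpha) / (len(column) + alpha * len(values))
    return classes, priors, cond

def predict_weighted_nb(x, classes, priors, cond, weights):
    """Eq. (14): score(C_i) = P(C_i) * prod_j P(A_ij | C_i)^W_i over DT-selected attributes."""
    scores = {}
    for c in classes:
        score = priors[c]
        for i, v in enumerate(x):
            if weights[i] > 0:               # only attributes tested in the tree
                score *= cond[(i, v)].get(c, 1e-9) ** weights[i]
        scores[c] = score
    return max(scores, key=scores.get)

# Example on the ordinally encoded playing tennis data (Outlook, Temperature, Humidity, Wind).
X = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [2, 1, 0, 0], [2, 2, 1, 0],
              [2, 2, 1, 1], [1, 2, 1, 1], [0, 1, 0, 0], [0, 2, 1, 0], [2, 1, 1, 0],
              [0, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0], [2, 1, 0, 1]])
y = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
              "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"])
dt = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
W = attribute_weights(dt, X.shape[1])
classes, priors, cond = fit_weighted_nb(X, y)
print(W)
print(predict_weighted_nb((0, 2, 0, 1), classes, priors, cond, W))   # Sunny, Cool, High, Strong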
5. Experiments

In this section, we describe the test datasets and experimental environments, and present the evaluation results for both of the proposed hybrid decision tree and naïve Bayes classifiers.

5.1. Datasets

The performance of both of the proposed hybrid decision tree and naïve Bayes algorithms is tested on 10 real benchmark datasets from the UCI machine learning repository (Frank & Asuncion, 2010). Table 5 describes the datasets used in the experimental analysis. Each dataset is roughly equivalent to a two-dimensional spreadsheet or a database table. The 10 datasets are: …
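A hedged sketch of the evaluation protocol used in this section, assuming the datasets have already been loaded as numeric arrays X and y: scikit-learn's cross_validate performs the 10-fold cross validation and reports accuracy together with weighted-average precision and recall (sensitivity); specificity is not a built-in scorer and would require a custom one.

from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def evaluate_10fold(model, X, y):
    """10-fold CV with accuracy and weighted-average precision/recall (sensitivity)."""
    scores = cross_validate(model, X, y, cv=10,
                            scoring=("accuracy", "precision_weighted", "recall_weighted"))
    return {name: scores["test_" + name].mean()
            for name in ("accuracy", "precision_weighted", "recall_weighted")}

# Example (assumes X_iris, y_iris already loaded, e.g. via sklearn.datasets.load_iris):
# print(evaluate_10fold(DecisionTreeClassifier(criterion="entropy"), X_iris, y_iris))
# print(evaluate_10fold(GaussianNB(), X_iris, y_iris))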
Table 8
The classification accuracies and precision, sensitivity and specificity values for a C4.5 classifier with 10-fold cross validation.

Datasets         Classification accuracy (%)   Precision (weighted avg.)   Sensitivity (weighted avg.)   Specificity (weighted avg.)
Breast cancer    75.52                         0.752                       0.755                         0.245
Contact lenses   83.33                         0.851                       0.833                         0.166
Diabetes         73.82                         0.735                       0.738                         0.261
Glass            66.82                         0.67                        0.668                         0.331
Iris plants      96                            0.96                        0.96                          0.04
Soybean          91.50                         0.842                       0.849                         0.065
Vote             96.32                         0.963                       0.963                         0.036
Image seg.       95.73                         0.819                       0.957                         0.042
Tic-Tac-Toe      85.07                         0.849                       0.851                         0.149
NSL-KDD          71.11                         0.711                       0.711                         0.288

Table 11
The classification accuracies and precision, sensitivity and specificity values for the proposed hybrid NB classifier with 10-fold cross validation.

Datasets         Classification accuracy (%)   Precision (weighted avg.)   Sensitivity (weighted avg.)   Specificity (weighted avg.)
Breast cancer    75.87                         0.75                        0.758                         0.241
Contact lenses   87.50                         0.909                       0.875                         0.125
Diabetes         85.41                         0.853                       0.854                         0.145
Glass            52.33                         0.533                       0.523                         0.476
Iris plants      98                            0.98                        0.98                          0.02
Soybean          94.15                         0.891                       0.893                         0.047
Vote             94.48                         0.945                       0.944                         0.055
Image seg.       85.19                         0.742                       0.852                         0.148
Tic-Tac-Toe      78.91                         0.786                       0.789                         0.21
NSL-KDD          82.39                         0.836                       0.823                         0.176
References

Utgoff, P. E. (1988). ID5: An incremental ID3. In Proceedings of the fifth national conference on machine learning, Ann Arbor, Michigan, USA (pp. 107–120).
Utgoff, P. E. (1989). Incremental induction of decision trees. Machine Learning, 4, 161–186.
Valle, M. A., Varas, S., & Ruz, G. A. (2012). Job performance prediction in a call center using a naïve Bayes classifier. Expert Systems with Applications, 39, 9939–9945.