
Expert Systems with Applications 41 (2014) 1937–1946
Contents lists available at ScienceDirect: Expert Systems with Applications
Journal homepage: www.elsevier.com/locate/eswa
doi: 10.1016/j.eswa.2013.08.089

Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks

Dewan Md. Farid (a), Li Zhang (a,*), Chowdhury Mofizur Rahman (b), M.A. Hossain (a), Rebecca Strachan (a)

(a) Computational Intelligence Group, Department of Computer Science and Digital Technology, Northumbria University, Newcastle upon Tyne, UK
(b) Department of Computer Science & Engineering, United International University, Bangladesh
* Corresponding author: L. Zhang. Tel.: +44 191 243 7089.
Article info
Keywords: Data mining; Classification; Hybrid; Decision tree; Naïve Bayes classifier

Abstract
In this paper, we introduce two independent hybrid mining algorithms to improve the classification accuracy rates of decision tree (DT) and naïve Bayes (NB) classifiers for the classification of multi-class problems. Both DT and NB classifiers are useful, efficient and commonly used for solving classification problems in data mining. Since the presence of noisy contradictory instances in the training set may cause the generated decision tree to suffer from overfitting and decrease its accuracy, in our first proposed hybrid DT algorithm we employ a naïve Bayes (NB) classifier to remove the noisy, troublesome instances from the training set before the DT induction. Moreover, it is extremely computationally expensive for a NB classifier to compute class conditional independence for a dataset with high dimensional attributes. Thus, in the second proposed hybrid NB classifier, we employ a DT induction to select a comparatively more important subset of attributes for the production of the naïve assumption of class conditional independence. We tested the performances of the two proposed hybrid algorithms against those of the existing DT and NB classifiers respectively, using the classification accuracy, precision, sensitivity-specificity analysis, and 10-fold cross validation on 10 real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed methods have produced impressive results in the classification of real life challenging multi-class problems. They are also able to automatically extract the most valuable training datasets and identify the most effective attributes for the description of instances from noisy complex training databases with large dimensions of attributes.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

During the past decade, a sufficient number of data mining algorithms have been proposed by computational intelligence researchers for solving real world classification and clustering problems (Farid et al., 2013; Liao, Chu, & Hsiao, 2012; Ngai, Xiu, & Chau, 2009). Generally, classification is a data mining function that describes and distinguishes data classes or concepts. The goal of classification is to accurately predict the class labels of instances whose attribute values are known, but whose class values are unknown. Clustering is the task of grouping a set of instances in such a way that instances within a cluster have high similarities to one another, but are very dissimilar to instances in other clusters. It analyzes instances without consulting a known class label. The instances are clustered based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. The performance of data mining algorithms in most cases depends on dataset quality, since low-quality training data may lead to the construction of overfitting or fragile classifiers. Thus, data preprocessing techniques are needed, where the data are prepared for mining. Preprocessing can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the mining process. A number of data preprocessing techniques are available, such as (a) data cleaning: removal of noisy data, (b) data integration: merging data from multiple sources, (c) data transformation: normalization of data, and (d) data reduction: reducing the data size by aggregating and eliminating redundant features.

This paper presents two independent hybrid algorithms for scaling up the classification accuracy of decision tree (DT) and naïve Bayes (NB) classifiers in multi-class classification problems. The DT is a classification tool commonly used in data mining tasks, with well-known induction algorithms such as ID3 (Quinlan, 1986), ID4 (Utgoff, 1989), ID5 (Utgoff, 1988), C4.5 (Quinlan, 1993), C5.0 (Bujlow, Riaz, & Pedersen, 2012), and CART (Breiman, Friedman, Stone, & Olshen, 1984). The goal of a DT is to create a model that predicts the value of a target class for an unseen test instance based on several input features (Loh & Shih, 1997; Safavian & Landgrebe, 1991; Turney, 1995). Amongst other data mining methods, DTs have various advantages: (a) simple to understand, (b) easy to implement, (c) requiring little prior knowledge, (d) able to handle both numerical and categorical data, (e) robust, and (f) dealing with large and noisy datasets.
A naïve Bayes (NB) classifier is a simple probabilistic classifier based on: (a) Bayes theorem, (b) strong (naïve) independence assumptions, and (c) independent feature models (Farid, Rahman, & Rahman, 2011, 2010; Lee & Isa, 2010). It is also an important mining classifier for data mining and is applied in many real world classification problems because of its high classification performance. Similar to the DT, the NB classifier also has several advantages, such as (a) easy to use, (b) only one scan of the training data required, (c) handling missing attribute values, and (d) handling continuous data.

In this paper, we propose two hybrid algorithms, respectively for a DT classifier and a NB classifier, for multi-class classification tasks. The first proposed hybrid DT algorithm finds the troublesome instances in the training data using a NB classifier and removes these instances from the training set before constructing the learning tree for decision making. Otherwise, the DT may suffer from overfitting due to the presence of such noisy instances and its accuracy may decrease. Moreover, it is also noted that computing class conditional independence using a NB classifier is extremely computationally expensive for a dataset with many attributes. Our second proposed hybrid NB algorithm finds the most crucial subset of attributes using a DT induction. The weights of the attributes selected by the DT are also calculated. Then only these most important attributes selected by the DT, with their corresponding weights, are employed for the calculation of the naïve assumption of class conditional independence. We evaluate the performances of the proposed hybrid algorithms against those of existing DT and NB classifiers using the classification accuracy, precision, sensitivity-specificity analysis, and 10-fold cross validation on 10 real benchmark datasets from the UCI (University of California, Irvine) machine learning repository (Frank & Asuncion, 2010). The experimental results show that the proposed methods have produced very promising results in the classification of real world challenging multi-class problems. These methods also allow us to automatically extract the most representative high quality training datasets and identify the most important attributes for the characterization of instances from a large amount of noisy training data with high dimensional attributes.

The rest of the paper is organized as follows. Section 2 gives an overview of the work related to DT and NB classifiers. Section 3 introduces the basic DT and NB classification techniques. Section 4 presents our two proposed hybrid algorithms for multi-class classification problems, respectively based on DT and NB classifiers. Section 5 provides experimental results and a comparison against existing DT and NB algorithms using 10 real benchmark datasets from the UCI machine learning repository. Finally, Section 6 concludes the findings and proposes directions for future work.

2. Related work

In this section, we review recent research on decision trees and naïve Bayes classifiers for various real world multi-class classification problems.

2.1. Decision trees

Decision tree classification provides a rapid and useful solution for classifying instances in large datasets with a large number of variables. There are two common issues for the construction of decision trees: (a) the growth of the tree to enable it to accurately categorize the training dataset, and (b) the pruning stage, whereby superfluous nodes and branches are removed in order to improve classification accuracy. Franco-Arcega, Carrasco-Ochoa, Sanchez-Diaz, and Martinez-Trinidad (2011) presented decision tree induction using fast splitting attribute selection (DTFS), an algorithm for building DTs for large datasets. DTFS used this attribute selection technique to expand nodes and process all the training instances in an incremental way. In order to avoid storing all the instances in the DT, DTFS stored at most N instances in a leaf node. When the number of instances stored in a leaf node reached its limit, DTFS expanded or updated the leaf node according to the class labels of the instances stored in it. If the leaf node contained training instances from only one class, then DTFS updated the value of the input tree branch of this leaf node. Otherwise, DTFS expanded the leaf node by choosing a splitting attribute and created one branch for each class of the stored instances. In this approach, DTFS always considered a small number of instances in main memory for the building of a DT. Aviad and Roy (2011) also introduced a decision tree construction method based on adjusted cluster analysis classification called classification by clustering (CbC). It found similarities between instances using clustering algorithms and also selected target attributes. Then it calculated the target attribute distribution for each cluster. When a threshold for the number of instances stored in a cluster was reached, all the instances in each cluster were classified with respect to the appropriate value of the target attribute.

Polat and Gunes (2009) proposed a hybrid classification system based on a C4.5 decision tree classifier and a one-against-all method to improve the classification accuracy for multi-class classification problems. Their one-against-all method constructed M binary C4.5 decision tree classifiers, each of which separated one class from all of the rest. The ith C4.5 decision tree classifier was trained with all the training instances of the ith class as positive labels and all the others as negative labels. The performance of this hybrid classifier was tested using the classification accuracy, sensitivity-specificity analysis, and 10-fold cross validation on three datasets taken from the UCI machine learning repository (Frank & Asuncion, 2010). Balamurugan and Rajaram (2009) proposed a method to resolve one of the exceptions in basic decision tree induction algorithms, namely when the class prediction at a leaf node cannot be determined by majority voting. Their work found an influential factor of attributes, which gave the dependability of the attribute value on the class label. The DT was pruned based on this factor. When classifying new instances using this pruned tree, the class labels can be assigned more accurately than by the basic assignment of traditional DT algorithms.

Chen and Hung (2009) presented an associative classification tree (ACT) that combined the advantages of both associative classification and decision trees. The ACT tree was built using a set of associative classification rules with high classification predictive accuracy. ACT followed a simple heuristic which selected the attribute with the highest gain measure as the splitting attribute. Chandra and Varghese (2009) proposed a fuzzy decision tree Gini Index based (G-FDT) algorithm to fuzzify the decision boundary without converting the numeric attributes into fuzzy linguistic terms. The G-FDT tree used the Gini Index as the split measure to choose the most appropriate splitting attribute for each node in the decision tree. For the construction of the decision tree, the Gini Index was computed using the fuzzy-membership values of the attribute corresponding to a split value and the fuzzy-membership values of the instances. The split-points were chosen as the midpoints of attribute values where the class information changed. Aitkenhead (2008) presented a co-evolving decision tree method, where a large number of variables in datasets were considered. They proposed a novel combination of DTs and evolutionary methods, such as the bagging approach of a DT classifier and a back-propagation neural network method, to improve the classification accuracy. Such methods evolved the structure of a decision tree and also handled a comparatively wider range of values and data types.
2.2. Naïve Bayes classifiers

The naïve Bayes classifier is also widely used for classification problems in data mining and machine learning fields because of its simplicity and impressive classification accuracy. Koc, Mazzuchi, and Sarkani (2012) applied a hidden naïve Bayes (HNB) classifier to a network intrusion detection system (NIDS) to classify network attacks. In particular, it significantly improved the accuracy for the detection of denial-of-service (DoS) attacks. The HNB classifier was an extended version of a basic NB classifier. It relaxed the conditional independence assumption imposed on the basic NB classifier. The HNB method was based on the idea of creating another layer that represented a hidden parent of each attribute. The influences from all of the other attributes can thus be easily combined through conditional probabilities by estimating the attributes from the training dataset. The HNB multi-class classification model exhibited a superior overall performance in terms of accuracy, error rate and misclassification cost compared with the traditional NB classifier on the KDD 99 dataset (McHugh, 2000; Tavallaee, Bagheri, Lu, & Ghorbani, 2009). Valle, Varas, and Ruz (2012) also presented an approach based on a NB classifier to predict the performance of sales agents of a call centre dedicated exclusively to sales and telemarketing. This model was tested using socio-demographic (age, gender, marital status, socioeconomic status and experience) and performance (logged hours, talked hours, effective contacts and finished records) information as attributes of individuals. The results showed that the socio-demographic attributes were not suitable for predicting sales performances, but the operational records proved to be useful for the prediction of the performances of sales agents.

Chandra and Gupta (2011) proposed a robust naïve Bayes classifier (R-NBC) to overcome two major limitations, i.e., underflow and over-fitting, for the classification of gene expression datasets. R-NBC used logarithms of probabilities rather than multiplying probabilities to handle the underflow problem and employed an estimation approach to provide solutions to over-fitting problems. It did not require any prior feature selection approaches in the field of microarray data analysis, where a large number of attributes are considered. Fan, Poh, and Zhou (2010) proposed a partition-conditional independent component analysis (PC-ICA) method for naïve Bayes classification in microarray data analysis. It further extended the class-conditional independent component analysis (CC-ICA) method. PC-ICA split the small-size data samples into different partitions so that independent component analysis (ICA) can be done within each partition. PC-ICA also attempted to do ICA-based feature extraction within each partition that may consist of several classes. Hsu, Huang, and Chang (2008) presented a classification method called extended naïve Bayes (ENB) for the classification of mixed types of data. The mixed types of data included categorical and numeric data. ENB used a normal NB algorithm to calculate the probabilities of categorical attributes. When handling numeric attributes, it adopted statistical theory to discretize the numeric attributes into symbols by considering the average and variance of the numeric values.

3. Supervised classification

Classification is one of the most popular data mining techniques that can be used for intelligent decision making. In this section, we discuss some basic techniques for data classification using decision tree and naïve Bayes classifiers. Table 1 summarizes the most commonly used symbols and terms throughout the paper.

Table 1
Commonly used symbols and terms.

Symbol | Term
x_i    | A data point or instance
X      | A set of instances
A_i    | An attribute
A_ij   | An attribute's value
W_i    | The weight of attribute A_i
C      | Total number of classes in a training dataset
C_i    | A class label
D      | A training dataset
D_i    | A subset of a training dataset
T      | A decision tree

3.1. Decision tree induction

A decision tree classifier is typically a top-down greedy approach, which provides a rapid and effective method for classifying data instances (Chandra & Paul Varghese, 2009; Jamain & Hand, 2008). Generally, each DT is a rule set. A DT recursively partitions the training dataset into smaller subsets until all the subsets belong to a single class. The common DT algorithm is ID3 (Iterative Dichotomiser), which uses information theory as its attribute selection measure (Quinlan, 1986). The root node of the DT is chosen based on the highest information gain of an attribute. Given a training dataset, D, the expected information needed to correctly classify an instance, x_i ∈ D, is given in Eq. (1), where p_i is the probability that x_i ∈ D belongs to a class, C_i, and is estimated by |C_i,D|/|D|.

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)   (1)

In Eq. (1), Info(D) is the average amount of information needed to identify C_i of an instance, x_i ∈ D. The goal of the DT is to iteratively partition D into subsets, {D_1, D_2, ..., D_n}, where all instances in each D_i belong to the same class, C_i. Info_A(D) is the expected information required to correctly classify an instance, x_i, from D based on the partitioning by attribute A. Eq. (2) shows the Info_A(D) calculation, where |D_j|/|D| acts as the weight of the jth partition.

Info_A(D) = \sum_{j=1}^{n} (|D_j|/|D|) \times Info(D_j)   (2)

Information gain is defined as the difference between the original information requirement and the new requirement, as shown in Eq. (3).

Gain(A) = Info(D) - Info_A(D)   (3)

The Gain Ratio is an extension to the information gain approach, also used in DTs such as C4.5. The C4.5 classifier is a successor of the ID3 classifier (Quinlan, 1993). It applies a kind of normalization to information gain using a "split information" value, defined analogously with Info(D) as shown in Eq. (4).

SplitInfo_A(D) = -\sum_{j=1}^{n} (|D_j|/|D|) \log_2(|D_j|/|D|)   (4)

The attribute with the maximum Gain Ratio is selected as the splitting attribute, which is defined in Eq. (5).

GainRatio(A) = Gain(A) / SplitInfo_A(D)   (5)

Eq. (6) defines the Gini index for a dataset, D, where p_j is the frequency of class C_j ∈ D.

Gini(D) = 1 - \sum_{j=1}^{m} p_j^2   (6)

The goodness of a split of D into subsets D_1 and D_2 is defined by Eq. (7).
Gini_split(D) = (n_1/n) Gini(D_1) + (n_2/n) Gini(D_2)   (7)

In this way, the split with the best Gini value is chosen. To illustrate the operation of a DT, we consider the small dataset in Table 2, described by four attributes, namely Outlook, Temperature, Humidity, and Wind, which represent the weather condition of a particular day. Each attribute has several unique attribute values. The Play column in Table 2 represents the class category of each instance. It indicates whether a particular weather condition is suitable or not for playing tennis. Fig. 1 shows the decision tree model constructed using the playing tennis dataset shown in Table 2.

Table 2
The playing tennis dataset.

Outlook  | Temperature | Humidity | Wind   | Play
Sunny    | Hot         | High     | Weak   | No
Sunny    | Hot         | High     | Strong | No
Overcast | Hot         | High     | Weak   | Yes
Rain     | Mild        | High     | Weak   | Yes
Rain     | Cool        | Normal   | Weak   | Yes
Rain     | Cool        | Normal   | Strong | No
Overcast | Cool        | Normal   | Strong | Yes
Sunny    | Mild        | High     | Weak   | No
Sunny    | Cool        | Normal   | Weak   | Yes
Rain     | Mild        | Normal   | Weak   | Yes
Sunny    | Mild        | Normal   | Strong | Yes
Overcast | Mild        | High     | Strong | Yes
Overcast | Hot         | Normal   | Weak   | Yes
Rain     | Mild        | High     | Strong | No

Fig. 1. A decision tree generated using the playing tennis dataset.
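To make Eqs. (1)-(3) concrete, the following minimal Java sketch (ours, not the authors' implementation) computes Info(D), Info_Outlook(D) and Gain(Outlook) for the playing tennis dataset of Table 2, with the class counts read directly from the table.

// Minimal sketch: information gain of the Outlook attribute for the
// playing tennis dataset (Table 2), following Eqs. (1)-(3).
public class InfoGainExample {

    // Entropy of a class distribution given the class counts, Eq. (1).
    static double info(int... classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;            // 0 * log2(0) is taken as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // Table 2: 9 "Yes" and 5 "No" instances overall.
        double infoD = info(9, 5);                       // ~0.940 bits

        // Partition by Outlook: Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No),
        // Rain (3 Yes, 2 No). Eq. (2): weighted average of subset entropies.
        double infoOutlook = (5.0 / 14) * info(2, 3)
                           + (4.0 / 14) * info(4, 0)
                           + (5.0 / 14) * info(3, 2);    // ~0.694 bits

        // Eq. (3): information gain of Outlook; the attribute with the
        // largest gain is selected as the splitting attribute.
        double gain = infoD - infoOutlook;               // ~0.247 bits

        System.out.printf("Info(D) = %.3f%n", infoD);
        System.out.printf("Info_Outlook(D) = %.3f%n", infoOutlook);
        System.out.printf("Gain(Outlook) = %.3f%n", gain);
    }
}

Gain(Outlook) is roughly 0.247 bits, the largest of the four attribute gains on this dataset, so Outlook would be chosen as the splitting attribute at the root node.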

3.2. Naïve Bayes classification

A naïve Bayes classifier is a simple probabilistic method, which can predict the class membership probabilities (Chen, Huang, Tian, & Qu, 2009; Farid & Rahman, 2010). It has several advantages: (a) easy to use, and (b) only one scan of the training data required for probability generation. A NB classifier can easily handle missing attribute values by simply omitting the corresponding probabilities for those attributes when calculating the likelihood of membership for each class. The NB classifier also requires class conditional independence, i.e., the effect of an attribute on a given class is independent of those of the other attributes.

Given a training dataset, D = {X_1, X_2, ..., X_n}, each data record is represented as X_i = {x_1, x_2, ..., x_n}. D contains the attributes {A_1, A_2, ..., A_n} and each attribute A_i contains the attribute values {A_i1, A_i2, ..., A_ih}. The attribute values can be discrete or continuous. D also contains a set of classes C = {C_1, C_2, ..., C_m}. Each training instance, X ∈ D, has a particular class label C_i. For a test instance, X, the classifier will predict that X belongs to the class with the highest posterior probability, conditioned on X. That is, the NB classifier predicts that the instance X belongs to the class C_i if and only if P(C_i|X) > P(C_j|X) for 1 ≤ j ≤ m, j ≠ i. The class C_i for which P(C_i|X) is maximized is called the Maximum Posteriori Hypothesis.

P(C_i|X) = P(X|C_i) P(C_i) / P(X)   (8)

In Bayes theorem, shown in Eq. (8), as P(X) is a constant for all classes, only P(X|C_i) P(C_i) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C_1) = P(C_2) = ... = P(C_m), and we therefore maximize P(X|C_i); otherwise, we maximize P(X|C_i) P(C_i). The class prior probabilities are calculated by P(C_i) = |C_i,D|/|D|, where |C_i,D| is the number of training instances belonging to the class C_i in D. Computing P(X|C_i) in a dataset with many attributes is extremely computationally expensive. Thus, the naïve assumption of class conditional independence is made in order to reduce the computation in evaluating P(X|C_i): the attributes are assumed to be conditionally independent of one another, given the class label of the instance. Thus, Eqs. (9) and (10) are used to produce P(X|C_i).

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)   (9)

P(X|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)   (10)

In Eq. (9), x_k refers to the value of attribute A_k for instance X. Therefore, the probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can easily be estimated from the training instances. Moreover, the attributes in training datasets can be categorical or continuous-valued. If the attribute value, A_k, is categorical, then P(x_k|C_i) is the number of instances in the class C_i ∈ D with the value x_k for A_k, divided by |C_i,D|, i.e., the number of instances belonging to the class C_i ∈ D.

If A_k is a continuous-valued attribute, then A_k is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined respectively by the following two equations:

P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})   (11)

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}   (12)

In Eq. (11), \mu_{C_i} is the mean and \sigma_{C_i} is the standard deviation of the values of the attribute A_k for all training instances in the class C_i. We can then bring these two quantities into Eq. (12), together with x_k, in order to estimate P(x_k|C_i). To predict the class label of instance X, P(X|C_i) P(C_i) is evaluated for each class C_i ∈ D. The NB classifier predicts that the class label of instance X is the class C_i if and only if

P(X|C_i) P(C_i) > P(X|C_j) P(C_j)   (13)

In Eq. (13), 1 ≤ j ≤ m and j ≠ i. That is, the predicted class label is the class C_i for which P(X|C_i) P(C_i) is the maximum. Tables 3 and 4 respectively tabulate the prior probabilities for each class and the conditional probabilities for each attribute value generated using the playing tennis dataset shown in Table 2.

Table 3
Prior probabilities for each class generated using the playing tennis dataset.

Probability   | Value
P(Play = Yes) | 9/14 = 0.642
P(Play = No)  | 5/14 = 0.357

Table 4
Conditional probabilities for each attribute value calculated using the playing tennis dataset.

Probability                        | Value
P(Outlook = Sunny | Play = Yes)    | 2/9 = 0.222
P(Outlook = Sunny | Play = No)     | 3/5 = 0.6
P(Outlook = Overcast | Play = Yes) | 4/9 = 0.444
P(Outlook = Overcast | Play = No)  | 0/5 = 0.0
P(Outlook = Rain | Play = Yes)     | 3/9 = 0.333
P(Outlook = Rain | Play = No)      | 2/5 = 0.4
P(Temperature = Hot | Play = Yes)  | 2/9 = 0.222
P(Temperature = Hot | Play = No)   | 2/5 = 0.4
P(Temperature = Mild | Play = Yes) | 4/9 = 0.444
P(Temperature = Mild | Play = No)  | 2/5 = 0.4
P(Temperature = Cool | Play = Yes) | 3/9 = 0.333
P(Temperature = Cool | Play = No)  | 1/5 = 0.2
P(Humidity = High | Play = Yes)    | 3/9 = 0.333
P(Humidity = High | Play = No)     | 4/5 = 0.8
P(Humidity = Normal | Play = Yes)  | 6/9 = 0.666
P(Humidity = Normal | Play = No)   | 1/5 = 0.2
P(Wind = Weak | Play = Yes)        | 6/9 = 0.666
P(Wind = Weak | Play = No)         | 2/5 = 0.4
P(Wind = Strong | Play = Yes)      | 3/9 = 0.333
P(Wind = Strong | Play = No)       | 3/5 = 0.6
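As a worked illustration of Eqs. (8)-(10) and (13) (our own example, not taken from the paper), the sketch below classifies a hypothetical unseen instance X = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong) using the prior and conditional probabilities of Tables 3 and 4.

// Minimal sketch: naive Bayes prediction for one test instance of the
// playing tennis data, using the probabilities of Tables 3 and 4.
public class NaiveBayesTennisExample {
    public static void main(String[] args) {
        // Priors from Table 3: P(Yes) = 9/14, P(No) = 5/14.
        double pYes = 9.0 / 14, pNo = 5.0 / 14;

        // Conditional probabilities from Table 4 for the test instance
        // X = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong).
        double pXgivenYes = (2.0 / 9) * (3.0 / 9) * (3.0 / 9) * (3.0 / 9); // Eq. (9)
        double pXgivenNo  = (3.0 / 5) * (1.0 / 5) * (4.0 / 5) * (3.0 / 5);

        // Eq. (13): compare P(X|Ci)P(Ci) for the two classes; P(X) cancels out.
        double scoreYes = pXgivenYes * pYes;   // ~0.0053
        double scoreNo  = pXgivenNo  * pNo;    // ~0.0206

        String predicted = scoreYes > scoreNo ? "Play = Yes" : "Play = No";
        System.out.printf("P(X|Yes)P(Yes) = %.4f, P(X|No)P(No) = %.4f -> %s%n",
                scoreYes, scoreNo, predicted);
    }
}

Since P(X|No)P(No) is about 0.021 and exceeds P(X|Yes)P(Yes) of about 0.005, the classifier labels this instance Play = No.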
4. The proposed hybrid learning algorithms

In this paper, we have proposed two independent hybrid algorithms, respectively for decision tree and naïve Bayes classifiers, to improve the classification accuracy in multi-class classification tasks. These proposed algorithms are described in the following Sections 4.1 and 4.2. Algorithm 1 is used to describe the proposed hybrid DT induction, which employs a NB classifier to remove any noisy training data at an initial stage to avoid overfitting. Algorithm 2 is used for the construction of a hybrid NB classifier. It embeds a DT classifier to identify a subset of the most important attributes to improve efficiency.

4.1. The proposed hybrid decision tree algorithm

In this section, we discuss the proposed Algorithm 1 of the hybrid DT induction. It is developed based on the basic C4.5 algorithm. Given a training dataset, D = {x_1, x_2, ..., x_n}, each training instance is represented as x_i = {x_i1, x_i2, ..., x_ih} and D contains the attributes {A_1, A_2, ..., A_n}. Each attribute, A_i, contains the attribute values {A_i1, A_i2, ..., A_ih}. The training data also belong to a set of classes C = {C_1, C_2, ..., C_m}. A decision tree is a classification tree associated with D that has the following properties: (a) each internal node is labeled with an attribute, A_i, (b) each arc is labeled with a predicate that can be applied to the attribute associated with the parent, and (c) each leaf node is labeled with a class, C_i. Once the tree is built, it is used to classify each test instance, x_i ∈ D. The result is a classification for that instance, x_i. There are two basic steps for the development of a DT based application: (a) building the DT from a training dataset, and (b) applying the DT to a test dataset, D.

For the training dataset, D, we first apply a basic NB classifier to classify each training instance, x_i ∈ D. We calculate the prior probability, P(C_i), for each class, C_i ∈ D, and the class conditional probability, P(A_ij|C_i), for each attribute value (even if it is numeric) in D. Then we classify each training instance, x_i ∈ D, using these probabilities. The class, C_i, with the highest posterior probability, P(C_i|x_i), is selected as the final classification for the instance, x_i. Then we remove all the misclassified training instances from the dataset D. In our experiments, these misclassified instances tend to be the troublesome training examples. For example, some of these examples either contain contradictory characteristics, or carry exceptional features. Suppose there is a training dataset with two classes. We calculate the prior and class conditional probabilities using this example training dataset. Then we calculate the posterior probability, P(Class|x_i), for each instance based on these probabilities. We have found some instances where the probabilities calculated using the NB classifier indicate that they belong to "Class = yes", although in the training dataset they are labeled as "Class = no". It seems that there is some noise within these data, which leads to contradictory results in comparison with the original labels. Thus these misclassified instances are regarded as troublesome examples. The presence of such noisy training instances is more likely to cause a DT classifier to become overfitted, and thus decrease its accuracy.

After removing those misclassified/troublesome instances from the training dataset, D, we subsequently build a DT for decision making using the updated, purely noise free training dataset D. For the decision tree generation, we select the best splitting attribute with the maximum information gain value as the root node of the tree. Once the root node of the DT has been determined, the child nodes and their arcs are created and added to the DT. The algorithm continues recursively by adding new subtrees to each branching arc. The algorithm terminates when the instances in the reduced training set all belong to the same class. This class is then used to label the corresponding leaf node. Algorithm 1 outlines the proposed DT algorithm. The time and space complexity of a DT algorithm depends on the size of the training dataset, the number of attributes, and the size of the generated tree.

Algorithm 1. Decision tree induction
Input: D = {x_1, x_2, ..., x_n} // Training dataset, D, which contains a set of training instances and their associated class labels.
Output: T, decision tree.
Method:
1: for each class, C_i ∈ D, do
2:   Find the prior probabilities, P(C_i).
3: end for
4: for each attribute value, A_ij ∈ D, do
5:   Find the class conditional probabilities, P(A_ij|C_i).
6: end for
7: for each training instance, x_i ∈ D, do
8:   Find the posterior probability, P(C_i|x_i);
9:   if x_i is misclassified, then
10:    Remove x_i from D;
11:  end if
12: end for
13: T = ∅;
14: Determine the best splitting attribute;
15: T = Create the root node and label it with the splitting attribute;
16: T = Add an arc to the root node for each split predicate and label;
17: for each arc do
18:   D = Dataset created by applying the splitting predicate to D;
19:   if stopping point reached for this path, then
20:     T' = Create a leaf node and label it with an appropriate class;
21:   else
22:     T' = DTBuild(D);
23:   end if
24:   T = Add T' to arc;
25: end for
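The listing below is a minimal Java sketch of the noise-removal stage of Algorithm 1 (steps 1-12), assuming nominal attributes and frequency-based probability estimates; the class name, data layout and method signature are our own and do not reproduce the authors' Weka-based implementation. The reduced training set it identifies would then be passed to a standard C4.5-style tree builder (steps 13-25).

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the noise-removal step of Algorithm 1: classify every training
// instance with a basic NB model built on the same data and drop the
// instances it misclassifies before the tree is induced.
public class NoiseFilter {

    /** Returns the indices of instances whose NB prediction agrees with their label. */
    static List<Integer> keptIndices(List<String[]> data, List<String> labels) {
        int n = data.size(), numAtt = data.get(0).length;

        // Steps 1-6: prior counts for P(Ci) and conditional counts for P(Aij|Ci).
        Map<String, Integer> classCount = new HashMap<>();
        Map<String, Integer> condCount = new HashMap<>();  // key: class|attIndex|value
        for (int i = 0; i < n; i++) {
            String c = labels.get(i);
            classCount.merge(c, 1, Integer::sum);
            for (int a = 0; a < numAtt; a++)
                condCount.merge(c + "|" + a + "|" + data.get(i)[a], 1, Integer::sum);
        }

        // Steps 7-12: posterior score for each instance; keep it only if the
        // most probable class matches the original label.
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String best = null;
            double bestScore = -1.0;
            for (Map.Entry<String, Integer> e : classCount.entrySet()) {
                double score = e.getValue() / (double) n;               // P(Ci)
                for (int a = 0; a < numAtt; a++) {                      // product of P(Aij|Ci)
                    int count = condCount.getOrDefault(
                            e.getKey() + "|" + a + "|" + data.get(i)[a], 0);
                    score *= count / (double) e.getValue();
                }
                if (score > bestScore) { bestScore = score; best = e.getKey(); }
            }
            if (best.equals(labels.get(i))) kept.add(i);
        }
        return kept;  // build the DT on this reduced, "noise free" training set
    }
}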
4.2. The proposed hybrid algorithm for a naïve Bayes classifier

In this section, we present a hybrid naïve Bayes classifier with the integration of a decision tree in order to find a subset of attributes, with attribute weighting, which play more important roles in class determination. In a given training dataset, each instance, x_i, contains the values {x_i1, x_i2, ..., x_ih}. There is a set of attributes used to describe the training data, D = {A_1, A_2, ..., A_n}. Each attribute contains the attribute values A_i = {A_i1, A_i2, ..., A_ik}. A set of classes C = {C_1, C_2, ..., C_n} is also used to label the training instances, where each class C_i = {C_i1, C_i2, ..., C_ik} also has some values.
First of all, in this proposed hybrid NB algorithm, we generate a basic decision tree, T, from the training dataset, D, and collect the attributes appearing in the tree. In this approach, the DT is applied as an attribute selection and attribute weighting method. That is, we use the DT classifier to find the subset of the attributes in the training dataset which play crucial roles in the final classification. After the tree construction, we initialize the weight, W_i, for each attribute, A_i ∈ D. If the attribute, A_i ∈ D, is not tested in the DT, then the weight, W_i, of the attribute, A_i, is initialized to zero. Otherwise, we calculate the minimum depth, d, at which the attribute, A_i ∈ T, is tested in the DT and initialize the weight, W_i, of the attribute, A_i, with the value 1/√d. In this way, the importance of the attributes is measured by their corresponding weights. For example, the root node of the tree has a higher weight value in comparison with those of its child nodes. Subsequently, we calculate the class conditional probabilities using only those attributes selected by the DT (i.e., W_i ≠ 0) and classify each instance x_i ∈ D using these probabilities. The weights of the attributes selected by the DT are also used as exponential parameters (see Eq. (14)) for the class conditional probability calculation. Other attributes which are not selected by the DT (i.e., W_i = 0) will not be considered in the final probability calculation, i.e., the class conditional probabilities of those unselected attributes will not be generated and employed in the production of the classification result.

Moreover, we calculate the prior probability, P(C_i), for each class, C_i, by counting how often C_i occurs in D. For each attribute, A_i, the number of occurrences of each attribute value, A_ij, can be counted to determine P(A_i). Similarly, the probability P(A_ij|C_i) can also be estimated by counting how often each A_ij occurs in C_i ∈ D. This is done only for those attributes that appear in the DT, T. To classify an instance, x_i ∈ D, P(C_i) and P(A_ij|C_i) from D are used to make the prediction. This is conducted by combining the effects of the different attribute values, A_ij ∈ x_i. As mentioned earlier, the weights of the selected attributes also influence their class conditional probability calculation as exponential parameters. We estimate P(x_i|C_i) by Eq. (14).

P(x_i|C_i) = \prod_{j=1}^{n} P(A_ij|C_i)^{W_i}   (14)

In Eq. (14), W_i refers to the weight of the attribute, A_i, which affects the class conditional probability calculation as an exponential parameter. To calculate P(C_i|x_i), we need P(C_i) for each C_i and P(x_i|C_i), and estimate the likelihood that x_i is in each C_i. The posterior probability, P(C_i|x_i), is then found for each C_i. The class, C_i, with the highest probability is used to label the instance, x_i. Algorithm 2 outlines the proposed hybrid algorithm for the naïve Bayes classifier.

Algorithm 2. Naïve Bayes classifier
Input: D = {x_1, x_2, ..., x_n} // Training data.
Output: A classification model.
Method:
1: T = ∅;
2: Determine the best splitting attribute;
3: T = Create the root node and label it with the splitting attribute;
4: T = Add an arc to the root node for each split predicate and label;
5: for each arc do
6:   D = Dataset created by applying the splitting predicate to D;
7:   if stopping point reached for this path, then
8:     T' = Create a leaf node and label it with an appropriate class;
9:   else
10:    T' = DTBuild(D);
11:  end if
12:  T = Add T' to arc;
13: end for
14: for each attribute, A_i ∈ D, do
15:   if A_i is not tested in T, then
16:     W_i = 0;
17:   else
18:     Set d as the minimum depth of A_i ∈ T, and W_i = 1/√d;
19:   end if
20: end for
21: for each class, C_i ∈ D, do
22:   Find the prior probabilities, P(C_i).
23: end for
24: for each attribute, A_i ∈ D with W_i ≠ 0, do
25:   for each attribute value, A_ij ∈ A_i, do
26:     Find the class conditional probabilities, P(A_ij|C_i)^{W_i}.
27:   end for
28: end for
29: for each instance, x_i ∈ D, do
30:   Find the posterior probability, P(C_i|x_i);
31: end for
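To illustrate steps 14-31 of Algorithm 2 and Eq. (14), the Java sketch below shows how the attribute weights W_i = 1/√d and the exponentially weighted conditional probabilities could be combined at prediction time. It assumes the minimum depth of each attribute in the built tree and the NB probability estimates are already available, that depth is counted from 1 at the root (the paper does not state the convention), and that all names are ours rather than the authors'.

import java.util.Map;

// Sketch of the scoring step of Algorithm 2: attributes absent from the
// decision tree get weight 0 and are skipped; the rest are weighted by
// 1/sqrt(d), where d is the minimum depth at which they are tested, and the
// weight enters Eq. (14) as an exponent.
public class WeightedNaiveBayes {

    /** Wi = 1/sqrt(d) if the attribute is tested in the tree at minimum depth d, else 0.
     *  Depth is assumed to start at 1 for the root node. */
    static double weight(Integer minDepthInTree) {
        return (minDepthInTree == null) ? 0.0 : 1.0 / Math.sqrt(minDepthInTree);
    }

    /**
     * Score of one class Ci for instance x, i.e. P(Ci) * product of P(Aij|Ci)^Wi (Eq. (14)).
     *
     * prior    - P(Ci)
     * condProb - P(Aij|Ci) keyed by "attIndex|value" for this class
     * minDepth - minimum depth of each attribute in the tree; no entry if unused
     * x        - attribute values of the instance
     */
    static double score(double prior, Map<String, Double> condProb,
                        Map<Integer, Integer> minDepth, String[] x) {
        double s = prior;
        for (int a = 0; a < x.length; a++) {
            double w = weight(minDepth.get(a));
            if (w == 0.0) continue;                        // attribute not selected by the DT
            double p = condProb.getOrDefault(a + "|" + x[a], 0.0);
            s *= Math.pow(p, w);                           // P(Aij|Ci)^Wi
        }
        return s;   // the class with the largest score labels the instance
    }
}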


5. Experiments

In this section, we describe the test datasets and experimental environments, and present the evaluation results for both of the proposed hybrid decision tree and naïve Bayes classifiers.

5.1. Datasets

The performances of both of the proposed hybrid decision tree and naïve Bayes algorithms are tested on 10 real benchmark datasets from the UCI machine learning repository (Frank & Asuncion, 2010). Table 5 describes the datasets used in the experimental analysis. Each dataset is roughly equivalent to a two-dimensional spreadsheet or a database table. The 10 datasets are:

1. Breast Cancer Data (Breast cancer).
2. Fitting Contact Lenses Database (Contact lenses).
3. Pima Indians Diabetes Database (Diabetes).
4. Glass Identification Database (Glass).
5. Iris Plants Database (Iris plants).
6. Large Soybean Database (Soybean).
7. 1984 United States Congressional Voting Records Database (Vote).
8. Image Segmentation Data (Image seg.).
9. Tic-Tac-Toe Endgame Data (Tic-Tac-Toe).
10. NSL-KDD Dataset (NSL-KDD).

Table 5
Dataset descriptions.

Datasets       | No. of Att. | Att. types     | Instances | Classes
Breast cancer  | 9           | Nominal        | 286       | 2
Contact lenses | 4           | Nominal        | 24        | 3
Diabetes       | 8           | Real           | 768       | 2
Glass          | 9           | Real           | 214       | 7
Iris plants    | 4           | Real           | 150       | 3
Soybean        | 35          | Nominal        | 683       | 19
Vote           | 16          | Nominal        | 435       | 2
Image seg.     | 19          | Real           | 1500      | 7
Tic-Tac-Toe    | 9           | Nominal        | 958       | 2
NSL-KDD        | 41          | Real & Nominal | 25192     | 23
5.2. Experimental setup

The experiments were conducted using a machine with an Intel Core 2 Duo 2.0 GHz processor (2 MB cache, 800 MHz FSB) and 1 GB of RAM. We implemented both of the proposed algorithms (Algorithms 1 and 2) in Java, using NetBeans IDE 7.1 on Red Hat Enterprise Linux 5. NetBeans IDE is the first IDE providing support for JDK 7 and Java EE 6 (http://netbeans.org/index.html). The code for the basic versions of the DT and NB classifiers is adopted from Weka3, which is open source data mining software (Hall et al., 2009). It is a collection of machine learning algorithms for data mining tasks. Weka3 contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. The mining algorithms in Weka3 can be either applied directly to a dataset or called from our own code.

To test the proposed hybrid methods, we have used the classification accuracy, precision, sensitivity-specificity analysis, and 10-fold cross validation. The classification accuracy is measured either by Eq. (15) or by Eq. (16).

accuracy = \frac{\sum_{i=1}^{|X|} assess(x_i)}{|X|}, \quad x_i \in X   (15)

If classify(x) = x.c then assess(x) = 1, else assess(x) = 0, where X is the set of instances to be classified, x ∈ X, and x.c is the class of instance x. Also, classify(x) returns the classification of x. Eqs. (17)-(19) are used for the calculations of precision, sensitivity (also called the true positive rate, or the recall rate), and specificity (also called the true negative rate). A perfect classifier would be described as 100% sensitivity and 100% specificity. Table 6 summarizes the symbols and terms used throughout Eqs. (16)-(19).

accuracy = (TP + TN) / (TP + TN + FP + FN)   (16)

precision = TP / (TP + FP)   (17)

sensitivity = TP / (TP + FN)   (18)

specificity = TN / (TN + FP)   (19)

where TP, TN, FP and FN denote true positives, true negatives, false positives, and false negatives, respectively.

Table 6
Symbols used in Eqs. (16)-(19) and their meanings.

Symbol      | Intuitive meaning
TP          | x_i predicted to be in C_i and is actually in it
TN          | x_i not predicted to be in C_i and is not actually in it
FP          | x_i predicted to be in C_i but is not actually in it
FN          | x_i not predicted to be in C_i but is actually in it
accuracy    | % of predictions that are correct
precision   | % of positive predictions that are correct
sensitivity | % of positive instances that are predicted as positive
specificity | % of negative instances that are predicted as negative
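A direct transcription of Eqs. (16)-(19) into Java is shown below (a sketch with hypothetical counts, not results from the paper); the same four ratios underlie the weighted-average values reported later in Tables 8-11.

// Sketch of Eqs. (16)-(19): the four measures derived from the confusion
// counts of one class. Values are returned as fractions in [0, 1].
public class ConfusionMetrics {
    final int tp, tn, fp, fn;

    ConfusionMetrics(int tp, int tn, int fp, int fn) {
        this.tp = tp; this.tn = tn; this.fp = fp; this.fn = fn;
    }

    double accuracy()    { return (tp + tn) / (double) (tp + tn + fp + fn); } // Eq. (16)
    double precision()   { return tp / (double) (tp + fp); }                  // Eq. (17)
    double sensitivity() { return tp / (double) (tp + fn); }                  // Eq. (18), recall
    double specificity() { return tn / (double) (tn + fp); }                  // Eq. (19)

    public static void main(String[] args) {
        // Hypothetical counts for one class of a two-class problem.
        ConfusionMetrics m = new ConfusionMetrics(40, 45, 5, 10);
        System.out.printf("acc=%.3f prec=%.3f sens=%.3f spec=%.3f%n",
                m.accuracy(), m.precision(), m.sensitivity(), m.specificity());
    }
}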
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds", D_1, D_2, ..., D_k, each of which has an approximately equal size. Training and testing are performed k times. In iteration i, the partition D_i is reserved as the test set, and the remaining partitions are collectively used to train the classifier. 10-fold cross validation therefore breaks the data into 10 sets of size N/10, trains the classifier on 9 of the sets and tests it using the remaining one; this is repeated 10 times and the mean accuracy rate is taken. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of instances in the initial dataset.
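The following Java sketch outlines the k-fold procedure just described. The classifier training and evaluation step is abstracted behind a function argument, since the paper's Java/Weka code is not reproduced here; assigning instances to folds by index position after a single shuffle is one simple choice among several.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BiFunction;

// Sketch of k-fold cross validation: shuffle the instance indices once, split
// them into k roughly equal folds, train on k-1 folds, test on the remaining
// one, and average the k accuracy estimates.
public class CrossValidation {

    /** Returns the mean accuracy over k folds; evaluate(trainIdx, testIdx) -> accuracy. */
    static double kFold(int numInstances, int k, long seed,
                        BiFunction<List<Integer>, List<Integer>, Double> evaluate) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) indices.add(i);
        Collections.shuffle(indices, new java.util.Random(seed));  // random partition

        double sum = 0.0;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> test = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < numInstances; i++) {
                // Each instance lands in the test set in exactly one of the k iterations.
                if (i % k == fold) test.add(indices.get(i));
                else train.add(indices.get(i));
            }
            sum += evaluate.apply(train, test);  // train on k-1 folds, test on the held-out fold
        }
        return sum / k;   // mean accuracy rate, as reported in Tables 8-11
    }
}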

5.3. Results and discussion

Firstly, we evaluated the performances of the proposed algorithms (Algorithms 1 and 2) against the existing DT and NB classifiers using the classification accuracy on the training sets of the 10 benchmark datasets shown in Table 5. Table 7 summarizes the classification accuracy rates of the basic C4.5 and NB classifiers, and the proposed hybrid algorithms, for each of the 10 training datasets.

Table 7
The classification accuracies of classifiers using training datasets.

Training datasets | C4.5 DT classifier (%) | NB classifier (%) | Proposed DT classifier (%) (Algorithm 1) | Proposed NB classifier (%) (Algorithm 2)
Breast cancer     | 96.33 | 93.7  | 98.82 | 95.9
Contact lenses    | 91.66 | 95.83 | 95.83 | 100
Diabetes          | 84.11 | 76.30 | 91.66 | 79.55
Glass             | 96.26 | 55.60 | 97.66 | 73.83
Iris plants       | 98    | 96    | 100   | 98.66
Soybean           | 96.33 | 93.7  | 99.85 | 96.63
Vote              | 97.24 | 90.34 | 98.85 | 93.33
Image seg.        | 99.0  | 81.66 | 99.6  | 86.53
Tic-Tac-Toe       | 93.73 | 69.83 | 98.43 | 78.81
NSL-KDD           | 73.37 | 79.94 | 77.26 | 83.44

The results in Table 7 indicate that the proposed DT algorithm (Algorithm 1) outperforms the C4.5 DT classifier in the classification of the diabetes dataset by 7.55%, the tic-tac-toe dataset by 4.7%, the contact lenses dataset by 4.17% and the NSL-KDD dataset by 3.89%. Algorithm 1 is capable of identifying the noisy instances in each dataset before the DT induction. The DT classifier generated from the updated, noise free, high quality representative training dataset is less likely to overfit and is thus able to carry more generalization capability compared to the DT generated directly from the original training instances using the C4.5 algorithm.

Moreover, Table 7 shows that the proposed NB algorithm (Algorithm 2) also outperforms the traditional NB classifier in the classification of all 10 training datasets, since it is able to identify the most important and discriminative subset of attributes for the production of the naïve assumption of class conditional independence, in contrast to the basic NB classifier. Among all the 10 datasets, the proposed NB algorithm respectively improves the classification rates of the glass dataset by 18.23%, the tic-tac-toe dataset by 8.98%, the image seg. dataset by 4.87% and the contact lenses dataset by 4.17%.

Secondly, we have used classification accuracy, precision, sensitivity, specificity, and 10-fold cross validation to measure the performances of the proposed algorithms (Algorithms 1 and 2) using all 10 datasets. We consider the weighted average for the precision, sensitivity and specificity analysis of each dataset. The weighted average is similar to an arithmetic mean, except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. Specifically, the weighting used for the evaluation in this research is calculated as the number of instances belonging to one class divided by the total number of instances in the dataset. The detailed results for the weighted average of the precision, sensitivity and specificity analysis for the experiments conducted are presented in Tables 8-11.
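A small Java sketch of the class-frequency weighting described above is given below; the counts and metric values used in the example are hypothetical.

// Sketch of the weighted average used in Tables 8-11: each per-class value
// (e.g. precision) is weighted by the fraction of instances in that class.
public class WeightedAverage {

    /** classCounts[i] = instances of class i; values[i] = per-class metric. */
    static double weightedAverage(int[] classCounts, double[] values) {
        int total = 0;
        for (int c : classCounts) total += c;
        double sum = 0.0;
        for (int i = 0; i < classCounts.length; i++)
            sum += (classCounts[i] / (double) total) * values[i];  // weight = |Ci,D| / |D|
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical three-class example: 100, 30 and 20 instances.
        double avg = weightedAverage(new int[] {100, 30, 20},
                                     new double[] {0.95, 0.80, 0.60});
        System.out.printf("weighted avg = %.3f%n", avg);  // 0.873
    }
}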
Evaluated using all the instances in the 10 datasets, the traditional C4.5 DT achieved an average accuracy rate of 83.5% using 10-fold cross validation, while the proposed DT algorithm (Algorithm 1) obtained an average accuracy rate of 88.3% for the classification of the 10 datasets. Tables 8 and 9 respectively tabulate the performances of the classic C4.5 classifier and the proposed DT algorithm on the 10 datasets using 10-fold cross validation.

Table 8
The classification accuracies and precision, sensitivity and specificity values for a C4.5 classifier with 10-fold cross validation.

Datasets       | Classification accuracy (%) | Precision (weighted avg.) | Sensitivity (weighted avg.) | Specificity (weighted avg.)
Breast cancer  | 75.52 | 0.752 | 0.755 | 0.245
Contact lenses | 83.33 | 0.851 | 0.833 | 0.166
Diabetes       | 73.82 | 0.735 | 0.738 | 0.261
Glass          | 66.82 | 0.67  | 0.668 | 0.331
Iris plants    | 96    | 0.96  | 0.96  | 0.04
Soybean        | 91.50 | 0.842 | 0.849 | 0.065
Vote           | 96.32 | 0.963 | 0.963 | 0.036
Image seg.     | 95.73 | 0.819 | 0.957 | 0.042
Tic-Tac-Toe    | 85.07 | 0.849 | 0.851 | 0.149
NSL-KDD        | 71.11 | 0.711 | 0.711 | 0.288

Table 9
The classification accuracies and precision, sensitivity and specificity values for the proposed hybrid DT classifier with 10-fold cross validation.

Datasets       | Classification accuracy (%) | Precision (weighted avg.) | Sensitivity (weighted avg.) | Specificity (weighted avg.)
Breast cancer  | 81.46 | 0.834 | 0.814 | 0.185
Contact lenses | 91.66 | 0.931 | 0.916 | 0.083
Diabetes       | 79.03 | 0.788 | 0.79  | 0.209
Glass          | 76.27 | 0.767 | 0.762 | 0.237
Iris plants    | 98.66 | 0.986 | 0.986 | 0.013
Soybean        | 92.97 | 0.866 | 0.871 | 0.057
Vote           | 97.70 | 0.977 | 0.977 | 0.022
Image seg.     | 96.53 | 0.826 | 0.965 | 0.034
Tic-Tac-Toe    | 88.1  | 0.88  | 0.881 | 0.118
NSL-KDD        | 81.92 | 0.826 | 0.819 | 0.18

Moreover, the results shown in Tables 8 and 9 indicate that the proposed DT algorithm outperforms the traditional C4.5 DT classifier in all the test cases. In comparison with the traditional DT, the proposed DT classifier has respectively improved the classification accuracy rates of the NSL-KDD dataset by 10.81%, the glass dataset by 9.45%, the contact lenses dataset by 8.33%, and the breast cancer dataset by 5.94%. Overall, it has improved the classification accuracy rates for the above 10 datasets by 4.8% on average compared to the traditional DT classifier.

Furthermore, we have also evaluated the traditional NB classifier using all the 10 datasets and achieved an average accuracy rate of 77.3% using 10-fold cross validation, while the proposed NB classifier (Algorithm 2) obtained an average accuracy rate of 86.7%. Tables 10 and 11 respectively tabulate the performances of the traditional NB classifier and the proposed NB algorithm on the 10 datasets using 10-fold cross validation.

Table 10
The classification accuracies and precision, sensitivity and specificity values for a NB classifier with 10-fold cross validation.

Datasets       | Classification accuracy (%) | Precision (weighted avg.) | Sensitivity (weighted avg.) | Specificity (weighted avg.)
Breast cancer  | 71.67 | 0.703 | 0.716 | 0.283
Contact lenses | 70.83 | 0.691 | 0.708 | 0.291
Diabetes       | 76.30 | 0.758 | 0.763 | 0.236
Glass          | 48.59 | 0.496 | 0.485 | 0.514
Iris plants    | 96    | 0.96  | 0.96  | 0.04
Soybean        | 92.83 | 0.87  | 0.872 | 0.055
Vote           | 90.11 | 0.905 | 0.901 | 0.098
Image seg.     | 81.06 | 0.708 | 0.81  | 0.189
Tic-Tac-Toe    | 69.62 | 0.682 | 0.696 | 0.3
NSL-KDD        | 76.27 | 0.764 | 0.762 | 0.237

Table 11
The classification accuracies and precision, sensitivity and specificity values for the proposed hybrid NB classifier with 10-fold cross validation.

Datasets       | Classification accuracy (%) | Precision (weighted avg.) | Sensitivity (weighted avg.) | Specificity (weighted avg.)
Breast cancer  | 75.87 | 0.75  | 0.758 | 0.241
Contact lenses | 87.50 | 0.909 | 0.875 | 0.125
Diabetes       | 85.41 | 0.853 | 0.854 | 0.145
Glass          | 52.33 | 0.533 | 0.523 | 0.476
Iris plants    | 98    | 0.98  | 0.98  | 0.02
Soybean        | 94.15 | 0.891 | 0.893 | 0.047
Vote           | 94.48 | 0.945 | 0.944 | 0.055
Image seg.     | 85.19 | 0.742 | 0.852 | 0.148
Tic-Tac-Toe    | 78.91 | 0.786 | 0.789 | 0.21
NSL-KDD        | 82.39 | 0.836 | 0.823 | 0.176

The results shown in Tables 10 and 11 indicate that the proposed NB algorithm (Algorithm 2) also outperforms the basic NB algorithm in all the test cases. Compared to the traditional NB classifier, the proposed NB algorithm has respectively improved the classification accuracy rates of the contact lenses dataset by 16.67%, the diabetes dataset by 9.11%, the tic-tac-toe dataset by 9.29%, and the NSL-KDD dataset by 6.12%. Algorithm 2 has improved the classification accuracy rates for all the 10 datasets by 9.4% on average in comparison to the classic NB classifier.

Figs. 2 and 3 respectively show the comparison of classification accuracy rates between the C4.5 DT and the proposed DT classifiers, and between the basic NB and the proposed NB classifiers, for each dataset with 10-fold cross validation. Fig. 4 also shows the comparison of classification accuracy rates of all classifiers on the 10 datasets with 10-fold cross validation.

Fig. 2. The comparison of classification accuracy rates between the C4.5 DT and the proposed DT classifiers on 10 datasets with 10-fold cross validation.

Fig. 3. The comparison of classification accuracy rates between the basic NB and the proposed hybrid NB classifiers on 10 datasets with 10-fold cross validation.

Fig. 4. The comparison of classification accuracy rates of all classifiers on each dataset with 10-fold cross validation.

Overall, Algorithm 1 is able to automatically remove noisy instances from training datasets for DT generation to avoid overfitting. It thus possesses more robustness and generalization capability. Algorithm 2 is capable of identifying the most discriminative subset of attributes for classification. The evaluation results prove the efficiency of the proposed DT and NB algorithms (Algorithms 1 and 2) for the classification of challenging real benchmark datasets. They respectively outperform the traditional C4.5 DT and NB classifiers in all the test cases (see Figs. 2 and 3).
6. Conclusions

In this paper, we have proposed two independent hybrid algorithms for DT and NB classifiers. The proposed methods improved the classification accuracy rates of both DT and NB classifiers in multi-class classification tasks. The first proposed hybrid DT algorithm used a NB classifier to remove the noisy, troublesome instances from the training set before the DT induction, while the second proposed hybrid NB classifier used a DT induction to select a subset of attributes for the production of the naïve assumption of class conditional independence. The performances of the proposed algorithms were tested against those of the traditional DT and NB classifiers using the classification accuracy, precision, sensitivity-specificity analysis, and 10-fold cross validation on 10 real benchmark datasets from the UCI machine learning repository. The experimental results showed that the proposed methods have produced impressive results for the classification of real life challenging multi-class problems. In future work, other classification algorithms, such as the naïve Bayes tree (NBTree), genetic algorithms, rough set approaches and fuzzy logic, will be used to deal with real-time multi-class classification tasks under dynamic feature sets.

Acknowledgment

We appreciate the support for this research received from the European Union (EU) sponsored (Erasmus Mundus) cLINK (Centre of Excellence for Learning, Innovation, Networking and Knowledge) project (Grant No. 2645).

References

Aitkenhead, M. J. (2008). A co-evolving decision tree classification method. Expert Systems with Applications, 34, 18-25.
Aviad, B., & Roy, G. (2011). Classification by clustering decision tree-like classifier based on adjusted clusters. Expert Systems with Applications, 38, 8220-8228.
Balamurugan, S. A. A., & Rajaram, R. (2009). Effective solution for unhandled exception in decision tree induction algorithms. Expert Systems with Applications, 36, 12113-12119.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. Chapman and Hall/CRC.
Bujlow, T., Riaz, M. T., & Pedersen, J. M. (2012). A method for classification of network traffic based on C5.0 machine learning algorithm. In Proc. of the International Conference on Computing, Networking and Communications (ICNC) (pp. 237-241).
Chandra, B., & Gupta, M. (2011). Robust approach for estimating probabilities in naïve Bayesian classifier for gene expression data. Expert Systems with Applications, 38, 1293-1298.
Chandra, B., & Paul Varghese, P. (2009). Moving towards efficient decision tree construction. Information Sciences, 179, 1059-1069.
Chandra, B., & Varghese, P. P. (2009). Fuzzifying gini index based decision trees. Expert Systems with Applications, 36, 8549-8559.
Chen, J., Huang, H., Tian, S., & Qu, Y. (2009). Feature selection for text classification with naïve Bayes. Expert Systems with Applications, 36, 5432-5435.
Chen, Y.-L., & Hung, L. T.-H. (2009). Using decision trees to summarize associative classification rules. Expert Systems with Applications, 36, 2338-2351.
Fan, L., Poh, K.-L., & Zhou, P. (2010). Partition-conditional ICA for Bayesian classification of microarray data. Expert Systems with Applications, 37, 8188-8192.
Farid, D. M., Harbi, N., & Rahman, M. Z. (2010). Combining naïve Bayes and decision tree for adaptive intrusion detection. International Journal of Network Security & Its Applications, 2, 12-15.
Farid, D. M., & Rahman, M. Z. (2010). Anomaly network intrusion detection based on improved self adaptive Bayesian algorithm. Journal of Computers, 5, 23-31.
Farid, D. M., Rahman, M. Z., & Rahman, C. M. (2011). Adaptive intrusion detection based on boosting and naïve Bayesian classifier. International Journal of Computer Applications, 24, 12-19.
Farid, D. M., Zhang, L., Hossain, A., Rahman, C. M., Strachan, R., Sexton, G., et al. (2013). An adaptive ensemble classifier for mining concept drifting data streams. Expert Systems with Applications, 40, 5895-5906.
Franco-Arcega, A., Carrasco-Ochoa, J. A., Sanchez-Diaz, G., & Martinez-Trinidad, J. F. (2011). Decision tree induction using a fast splitting attribute selection for large datasets. Expert Systems with Applications, 38, 14290-14300.
Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 29.07.2013.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11. http://www.cs.waikato.ac.nz/ml/weka/. Accessed 29.07.2013.
Hsu, C.-C., Huang, Y.-P., & Chang, K.-W. (2008). Extended naïve Bayes classifier for mixed data. Expert Systems with Applications, 35, 1080-1083.
Jamain, A., & Hand, D. J. (2008). Mining supervised classification performance studies: A meta-analytic investigation. Journal of Classification, 25, 87-112.
Koc, L., Mazzuchi, T. A., & Sarkani, S. (2012). A network intrusion detection system based on a hidden naïve Bayes multiclass classifier. Expert Systems with Applications, 39, 13492-13500.
Lee, L. H., & Isa, D. (2010). Automatically computed document dependent weighting factor facility for naïve Bayes classification. Expert Systems with Applications, 37, 8471-8478.
Liao, S.-H., Chu, P.-H., & Hsiao, P.-Y. (2012). Data mining techniques and applications - A decade review from 2000 to 2011. Expert Systems with Applications, 39, 11303-11311.
Loh, W. Y., & Shih, X. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815-840.
McHugh, J. (2000). Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Transactions on Information and System Security, 3, 262-294.
Ngai, E. W. T., Xiu, L., & Chau, D. C. K. (2009). Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36, 2592-2602.
Polat, K., & Gunes, S. (2009). A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems. Expert Systems with Applications, 36, 1587-1592.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann Publishers Inc.
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man and Cybernetics, 21, 660-674.
Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009). A detailed analysis of the KDD CUP 99 data set. In Proc. of the 2nd IEEE Int. Conf. on Computational Intelligence in Security and Defense Applications (pp. 53-58).
Turney, D. (1995). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 369-409.
Utgoff, P. E. (1988). ID5: An incremental ID3. In Proc. of the Fifth National Conference on Machine Learning, Ann Arbor, Michigan, USA (pp. 107-120).
Utgoff, P. E. (1989). Incremental induction of decision trees. Machine Learning, 4, 161-186.
Valle, M. A., Varas, S., & Ruz, G. A. (2012). Job performance prediction in a call center using a naïve Bayes classifier. Expert Systems with Applications, 39, 9939-9945.