International Journal of Advanced Research in Computer Science
Volume 5, No. 3, March-April 2014
ISSN No. 0976-5697
RESEARCH PAPER
Available Online at www.ijarcs.info
Abstract: The decision tree is one of the more recent developments among sophisticated techniques for exploring high-dimensional databases. In data mining, a decision tree is a predictive model that can be used for both classification and regression. The aim of this study is to classify kidney transplant patients' responses based on a set of predictor variables using ensemble methods. This paper also compares the performance of decision tree algorithms (ID3, C4.5 and CART) and of ensemble methods such as Random Forest, Boosting and Bagging with C4.5 and CART as base classifiers. The results show that CART with Boosting performs better than the other methods.
If the response variable takes on n different values, then the entropy of S is defined as

Entropy(S) = \sum_{j=1}^{n} -p_j \log_2 p_j    (1)

where p_j is the frequency of the value j in S. The information gain, Gain(S, A), of an attribute A relative to a collection of examples S is defined as

Gain(S, A) = Entropy(S) - \sum_{v=1}^{n} \frac{|S_v|}{|S|} Entropy(S_v)    (2)

where S_v is the subset of S for which attribute A has value v.
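As an illustration of equations (1) and (2), the following minimal Python sketch (not from the paper; the function names and the toy data are ours) computes the entropy of a set of class labels and the information gain of a categorical attribute.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, as in equation (1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Information gain of splitting `labels` by `attribute_values`, as in equation (2)."""
    n = len(labels)
    # Group the labels by attribute value (the subsets S_v).
    subsets = {}
    for v, y in zip(attribute_values, labels):
        subsets.setdefault(v, []).append(y)
    weighted_child_entropy = sum((len(s) / n) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted_child_entropy

# Toy example: an attribute that separates the two classes perfectly has gain 1.0.
print(information_gain(["a", "a", "b", "b"], ["fail", "fail", "ok", "ok"]))  # 1.0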
B. Classification and Regression Tree (CART):
The CART algorithm is based on statistical methodology developed for classification with categorical outcomes and regression with continuous outcomes [6]. It is a data-mining tool based on binary recursive partitioning. The construction of the CART algorithm is similar to that of ID3, with the exception of the information gain measure. In a classification tree, the impurity measure i(t) is computed using the Gini criterion, which is used to find the best split. The goodness of a split can be defined as the reduction in impurity

\Delta i(t) = i(t) - p(t_L) i(t_L) - p(t_R) i(t_R)    (3)

i(t) = 1 - \sum_{j} p_j^2    (4)

where i(t) denotes the impurity of node t, and p(t_L) and p(t_R) are the probabilities that an object falls into the left and right daughter nodes of node t. p_j is the proportion of cases in category j, and i(t_L) and i(t_R) are the impurities of the left and right nodes, respectively. The predictor variable and split point with the highest reduction in impurity are selected, and the parent node is split into two nodes at the selected split point. The process is repeated, using each node as a new parent node, until the tree reaches its maximum size. After generating the maximal tree, CART uses a pruning technique to select the optimal tree.

The pruning procedure develops a sequence of smaller trees and computes the cost-complexity for each tree. Based on the cost-complexity parameter, the pruning procedure determines the optimal tree with high accuracy. The cost-complexity is given by the following equation:

R_\alpha(T) = R(T) + \alpha |\tilde{T}|    (5)

where R(T) is the resubstitution estimate of the error, |\tilde{T}| is the number of terminal nodes of the tree, which determines the complexity of the tree, and \alpha is the cost-complexity parameter associated with the tree. R(T), the misclassification error, is computed by the following equation:

R(T) = \frac{1}{N} \sum_{n=1}^{N} X(d(x_n) \neq j_n)    (6)

where X is the indicator function, which is equal to 1 if the statement d(x_n) \neq j_n is true and 0 if it is false, and d(x) is the classifier. The value of the complexity parameter in the pruning usually lies between 0 and 1. The pruning procedure develops a group of trees using different values of the complexity parameter, giving trees of different sizes. According to Breiman et al. (1984), among a group of trees of different sizes, for a given value of \alpha, only one tree of smaller size has high accuracy.

The optimal tree is the one that has the smallest prediction error for new samples. Prediction error is measured using either an independent test set or cross validation (CV). When the data set is not large enough to split into training and testing data, V-fold cross validation is used. Cross validation is repeated V times, considering different subsets of training and test data each time, and thus developing V different trees. Among the V different trees, the simplest tree that has the lowest cross validation error rate (CV error) is selected as the optimal tree.
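The sketch below (illustrative only) computes the Gini impurity of equation (4) and the impurity reduction of equation (3), and then applies cost-complexity pruning in the spirit of equation (5) through scikit-learn's ccp_alpha parameter, which assumes a reasonably recent scikit-learn release; the bundled breast cancer data merely stands in for the paper's kidney transplant dataset.

from collections import Counter

def gini(labels):
    """Gini impurity i(t) = 1 - sum_j p_j^2, as in equation (4)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Goodness of a split, equation (3): i(t) - p(t_L) i(t_L) - p(t_R) i(t_R)."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# A split that separates a mixed parent node perfectly removes all impurity.
print(impurity_reduction(["fail", "fail", "ok", "ok"], ["fail", "fail"], ["ok", "ok"]))  # 0.5

# Cost-complexity pruning, equation (5): grow a maximal tree, then refit it
# for each candidate value of alpha returned by the pruning path.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the kidney transplant data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
pruned_trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
                for a in path.ccp_alphas]
print(len(pruned_trees), "trees in the pruning sequence")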
C. C4.5:
The construction of this algorithm is similar to that of the ID3 algorithm. Overfitting is the main issue in the ID3 decision tree algorithm. The C4.5 decision tree algorithm addresses this by using tree pruning techniques to prune the tree generated by ID3. At each point of the decision tree, the attribute showing the largest gain ratio is selected to divide the decision tree. The gain ratio for attribute A is defined as

Gain Ratio(S, A) = Gain(S, A) / Split Info(S, A)    (7)

Split Info(S, A) = -\sum_{v=1}^{n} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}    (8)

The C4.5 algorithm removes the bias of information gain that arises when an attribute has many outcome values. Moreover, it uses pessimistic pruning to remove unnecessary branches in the decision tree and improve the accuracy of classification.
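A short Python sketch of equations (7) and (8) follows (again illustrative; the helper names and toy data are ours).

from collections import Counter
from math import log2

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    """Gain ratio of equation (7): Gain(S, A) divided by Split Info(S, A)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(attribute_values, labels):
        subsets.setdefault(v, []).append(y)
    gain = _entropy(labels) - sum((len(s) / n) * _entropy(s) for s in subsets.values())
    # Split information, equation (8), penalises attributes with many values.
    split_info = -sum((len(s) / n) * log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

print(gain_ratio(["a", "a", "b", "b"], ["fail", "fail", "ok", "ok"]))  # 1.0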
D. Boosting:
Boosting has proved to be an effective way to improve the performance of base classifiers, both theoretically and empirically. It is used to adaptively change the distribution of training examples. Boosting assigns a weight to each training example and may adaptively change that weight at the end of each boosting round. A sample is drawn according to the sampling distribution of the training examples to obtain a new training set. Next, a classifier is induced from the training set and used to classify all the examples in the original data. The weights of the training examples are updated at the end of each boosting round: examples that are classified incorrectly have their weights increased, while those that are classified correctly have their weights decreased. This forces the classifier to focus on examples that are not easy to classify in subsequent iterations. The final ensemble is obtained by aggregating the base classifiers obtained from each boosting round.
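The paper does not state which boosting variant was used, so the sketch below uses AdaBoost, a standard boosting scheme, with a shallow CART-style tree as the base classifier in scikit-learn; the built-in breast cancer data again stands in for the kidney transplant dataset.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the kidney transplant data
# The base estimator is passed positionally to stay compatible across
# scikit-learn versions; each round reweights the training examples.
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                             n_estimators=100, random_state=0)
print(cross_val_score(booster, X, y, cv=10).mean())  # 10-fold CV accuracy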
E. Bootstrap Aggregation (Bagging):
The bootstrap aggregation technique repeatedly selects samples from a dataset according to a uniform probability distribution. Each bootstrap sample has the same size as the original data. Because the sampling is done with replacement, some instances may appear several times in the same training set, while others may be omitted from the training set. The basic procedure for bagging is summarized as follows:

Algorithm: Bagging
a. Let v be the number of bootstrap samples.
b. For i = 1 to v do
c.   Create a bootstrap sample D_i of size N.
d.   Train a base classifier C_i on the bootstrap sample D_i.
e. End for
f. C^*(x) = \arg\max_y \sum_i \delta(C_i(x) = y)
   {\delta(\cdot) = 1 if its argument is true and 0 otherwise}

After training the v classifiers, a test instance is assigned to the class that receives the highest number of votes. Bagging improves the generalization error by reducing the variance of the base classifiers. The performance of bagging depends on the stability of the base classifier.
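A minimal Python rendering of steps a-f of the bagging procedure above (illustrative, not the authors' code), using scikit-learn decision trees as base classifiers and a stand-in dataset:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the kidney transplant data
rng = np.random.default_rng(0)

v, N = 25, len(X)                              # a. number of bootstrap samples
classifiers = []
for i in range(v):                             # b. for i = 1 to v
    idx = rng.integers(0, N, N)                # c. bootstrap sample D_i of size N (with replacement)
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])  # d. train base classifier C_i on D_i
    classifiers.append(clf)                    # e. end for

# f. majority vote: C*(x) = argmax_y sum_i delta(C_i(x) = y)
votes = np.stack([clf.predict(X) for clf in classifiers])
ensemble_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("training-set vote accuracy:", (ensemble_pred == y).mean())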
F. Random Forest:
A random forest is a collection of unpruned decision trees [12]. It combines many tree predictors, where each tree depends on the values of a random vector sampled independently. Moreover, all trees in the forest have the same distribution. To construct a tree, assume that N is the number of training observations and a is the number of attributes in the training set. To determine the decision node of a tree, choose m << a as the number of variables to be selected. Select a bootstrap sample from the N observations in the training set and use the remaining observations to estimate the error of the tree in the testing phase. At a given node in the tree, randomly choose m variables on which to base the decision and calculate the best split from these m variables in the training set. In contrast to other tree algorithms, the trees are always fully grown and never pruned.
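A brief scikit-learn sketch of a random forest along these lines (parameter values are illustrative; max_features controls how many of the a attributes are sampled at each node):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the kidney transplant data
# max_features="sqrt" samples m ~ sqrt(a) candidate variables at each node;
# trees are fully grown (not pruned) by default.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=10).mean())  # 10-fold CV accuracy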
III. RESULTS AND DISCUSSION

The dataset used in this paper was obtained from a kidney transplant database [17]. The data set consists of 469 cases and ten attributes, including the response variable: age, sex, duration of hemodialysis prior to transplant (Dialy), diabetes (DBT), number of prior transplants (PTX), amount of blood transfusion (blood), mismatch score (MIS), use of ALG, an immune-suppression drug (ALG), duration of time starting from transplant (MONTH) and status of the new kidney (FAIL). Status of the new kidney was used as the response variable for fitting the CART, C4.5 and ID3 classifications to the multiple explanatory variables. The response variable was classified into two categories: new kidney failed (40.9%) and new kidney functioning (59%). The top six ranked attributes (age, Dialy, blood, MIS, ALG and MONTH) are considered for building the classification model.

In this study we used the Gini impurity measure for categorical target attributes, and 10-fold cross validation was carried out for each algorithm.
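As a hedged sketch of this evaluation protocol: scikit-learn does not ship ID3 or C4.5, so an entropy-based tree and a Gini-based (CART-style) tree stand in for them, and a built-in dataset replaces the kidney transplant data, which is not publicly bundled here.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the kidney transplant data
models = {
    "entropy tree (ID3/C4.5-like)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Gini tree (CART-like)": DecisionTreeClassifier(criterion="gini", random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")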
Table 1 shows the accuracy comparison of the different data mining algorithms. When all nine factors were considered to find the accuracy of the data mining algorithms, it was found that the CART model showed the highest specificity, at 77.3%.

Table 1: Accuracy comparison of different data mining algorithms

Algorithm          All variables (%)   Selected variables (%)   Sensitivity (%)   Specificity (%)
ID3                62.8                63.1                     55.9              69.8
C4.5 (pruned)      72.4                73.5                     71.7              74.4
C4.5 (unpruned)    69.5                72.9                     67.7              76.2
CART (pruned)      71.22               72.2                     65.5              77.3

The selected variables alone were used to find the sensitivity and specificity of the data mining algorithms. When the nine factors are used, the classification accuracy turns out to be 62.8%, 72.4% and 71.2% for ID3, C4.5 and CART, respectively. For C4.5, the pruned decision tree has a higher accuracy rate than the unpruned decision tree.

When the six factors are used, the classification accuracy turns out to be 63.1%, 73.5% and 72.2% for ID3, C4.5 and CART, respectively. Since the classification accuracy with all variables is lower than that with the six variables, we did not carry the all-variable models forward for further analysis. C4.5 had the highest sensitivity and CART had the highest specificity. ID3 had the worst accuracy, sensitivity and specificity compared to the other methods. C4.5's correct classification rate is 73.5%.

Table 2 shows the accuracy, sensitivity and specificity comparison of the decision tree and ensemble methods. For the dialysis dataset, we found that CART with boosting achieved a classification accuracy of 76.74%, with a sensitivity of 82.6% and a specificity of 70.0%. Next to CART with boosting, the C4.5 with bagging model achieved a classification accuracy of 75.2%, with a sensitivity of 75.6% and a specificity of 75.0%. CART with bagging achieved a classification accuracy of 74.42%, with a sensitivity of 82.6% and a specificity of 65.0%.

Table 2: Accuracy comparison of different ensemble methods

Ensemble method      All variables (%)   Selected variables (%)   Sensitivity (%)   Specificity (%)
Random Forest        73.01               74.04                    72.2              75.5
CART with Boosting   74.42               76.74                    82.6              70.0
CART with Bagging    81.3                74.42                    82.6              65.0
C4.5 with Boosting   71.04               72.0                     68.0              74.3
C4.5 with Bagging    74.02               75.2                     75.6              75.0

The random forest achieved a classification accuracy of 74%, with a sensitivity of 72.2% and a specificity of 75.5%. C4.5 with boosting performed worst compared with the other techniques on the selected variables. Bagging using CART as a base learner may decrease the misclassification rate in prediction with respect to using a single CART.

The decision tree produced by the C4.5 algorithm is given in Figure 1. Prepruning involves deciding when to stop developing subtrees during the tree-building process. The minimum number of observations in a leaf can determine the size of the tree. After a tree is constructed, the C4.5 rule induction program can be used to produce a set of equivalent rules. Pruning produces fewer, more easily interpreted results. More importantly, pruning can be used as a tool to correct for potential overfitting. In the C4.5 decision tree, the number of leaves is 6 and the size of the tree is 11.
[6]. Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). "Classification and Regression Trees", Wadsworth International Group, Belmont, CA.
[7]. Quinlan, J. R. (2003). "C5.0 Online Tutorial", http://www.rulequest.com.
[8]. Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z., Steinbach, M., Hand, D. J. and Steinberg, D. (2008). "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, 14(1): 1-37.
[9]. Wolpert, D. (1992). "Stacked Generalization", Neural Networks, 5: 241-259.
[10]. Schapire, R. (1990). "The Strength of Weak Learnability", Machine Learning, 5(2): 197-227.
[11]. Breiman, L. (1996a). "Bagging Predictors", Machine Learning, 24(2): 123-140.
[12]. Breiman, L. (2001). "Random Forests", Machine Learning, 45(1): 5-32.
[15]. Kotsiantis, S. and Pintelas, P. (2004). "Local Boosting of Weak Classifiers", Proceedings of Intelligent Systems Design and Applications (ISDA 2004), August 26-28, Budapest, Hungary.
[16]. Opitz, D. and Maclin, R. (1999). "Popular Ensemble Methods: An Empirical Study", Journal of Artificial Intelligence Research, 11: 169-198.
[17]. Le, C. T. (1997). "Applied Survival Analysis", Wiley, New York.
[18]. Quinlan, J. R. (1996). "Bagging, Boosting and C4.5", AAAI/IAAI, 1: 725-730.
[19]. Endo, A., Shibata, T. and Tanaka, H. (2008). "Comparison of Seven Algorithms to Predict Breast Cancer Survival", Biomedical Soft Computing and Human Sciences, 13(2): 11-16.
[20]. Banfield, R. E., Hall, L. O., Bowyer, K. W. and Kegelmeyer, W. P. (2007). "A Comparison of Decision Tree Ensemble Creation Techniques", IEEE Transactions on Pattern Analysis and Machine Intelligence, 29: 173-180.