AdaBoost Algorithm with Random Forests for Predicting Breast Cancer Survivability
Jaree Thongkam, Guandong Xu and Yanchun Zhang
Abstract—In this paper we propose a combination of the AdaBoost and random forests algorithms for constructing a breast cancer survivability prediction model. We use random forests as a weak learner of AdaBoost, selecting the high-weight instances during the boosting process, to improve accuracy and stability and to reduce overfitting problems. The capability of this hybrid method is evaluated using basic performance measurements (e.g., accuracy, sensitivity, and specificity), the Receiver Operating Characteristic (ROC) curve and the Area Under the receiver operating characteristic Curve (AUC). Experimental results indicate that the proposed method outperforms a single classifier and other combined classifiers for breast cancer survivability prediction.

Manuscript received December 15, 2007.
J. Thongkam is with the School of Computer Science and Mathematics, Victoria University, Melbourne, Australia (e-mail: [email protected]).
G. Xu is with the School of Computer Science and Mathematics, Victoria University, Melbourne, Australia (e-mail: [email protected]).
Y. Zhang is with the School of Computer Science and Mathematics, Victoria University, Melbourne, Australia (e-mail: [email protected]).

I. INTRODUCTION

BREAST CANCER is the second most common cause of cancer death among women in Thailand [1]. Its incidence has been increasing over the past several years, with more than 5,000 new cases reported every year. Several research studies have investigated contributing factors in such diseases, including lifestyle changes, dietary patterns, and genetic issues [2]. Also, much research has analyzed the course and outcome of disease, which can help patients form an idea of how to make decisions about their quality of life in accordance with their finances [3], [4]. For example, the Kaplan-Meier and Cox proportional hazards models, traditional statistical methods, are commonly used to estimate the survival rate of a particular patient suffering from a disease over a particular time period [5]. Currently, advanced techniques in the field of data mining, a new stream of methodologies, have come into existence. These techniques have proven to be more powerful than traditional statistical methods [6]. They provide processes for discovering useful patterns or models from large data sets [7]. One of the most widely used techniques in data mining is classification. It is used to extract models describing important data classes and to predict the outcome in unseen data at a single point of time [8]. Therefore, in order to help medical practitioners predict accurate outcomes, data mining and decision-support tools are needed that can process the huge amount of data available from previously solved cases and suggest probable treatments based on analyzing the abnormal values of several significant attributes.

In relation to current medical analysis, the decision tree is widely used in the medical domain. Several research studies have successfully employed decision trees to extract knowledge from medical data sets. For example, Delen, Walker and Kadam [3] employed classification and regression trees to predict breast cancer survivability in the SEER medical databases. Their results showed that the decision tree algorithm was superior for extracting knowledge from their data set. Many researchers have utilized a single classifier to extract knowledge from data sets. For example, Yi and Fuyong [9] applied Support Vector Machines (SVM) alone to discover breast cancer diagnosis patterns from the University of Wisconsin Hospital. Their results showed that SVM was suitable for diagnosing breast cancer patterns. Moreover, Ryu, Chandrasekaran and Jacob [4] employed an isotonic separation technique to predict breast cancer in the Wisconsin breast cancer diagnosis data set and the Ljubljana breast cancer recurrence data set. Their results showed that the isotonic separation technique outperformed C4.5, Robust LP-P, and an SVM with a Gaussian kernel.

In order to enhance the ability of standard algorithms, several attribute selection methods have been utilized in the medical domain for selecting the significant attributes. For example, Xiong et al. [10] combined Principal Components Analysis (PCA) and Partial Least Squares (PLS) linear regression for analyzing attributes, and then used decision trees and association rules to extract knowledge from a breast cancer diagnosis data set from the University of Wisconsin, Madison. This data set included 699 breast cancer patients, with 458 instances of the benign class and 241 instances of the malignant class. Their results showed a percentage of correctness of 96.57%. Moreover, Wang et al. [11] utilized Independent Component Analysis (ICA) to select the best attributes and applied Least Squares Support Vector Machines (LS-SVM) to detect breast cancer tumors. Experimental results showed that the accuracy of LS-SVM with ICA was significantly improved over using LS-SVM alone.

Recently, the AdaBoost technique has become an attractive ensemble method in machine learning since it has a low error rate and performs well on low-noise data sets [12], [13]. As a successor of the boosting algorithm, it is used to combine a set of weak classifiers to form a model with higher prediction outcomes [12]. As a result, several research studies have successfully applied the AdaBoost algorithm to solve classification problems in object detection, including face recognition, video sequences and signal processing systems. For example, Zhou and Wei [14] utilized the AdaBoost algorithm to extract the top 20 significant features from the XM2VT face database.
Their results showed that the AdaBoost algorithm reduces 54.23% of the computation time. Additionally, Sun, Wang and Wong [15] applied the AdaBoost algorithm to extract high-order pattern and weight-of-evidence rule-based classifiers from the UCI Machine Learning Repository. Their results showed that the composed classifiers can achieve better classification accuracy than the HPWR classifiers alone. However, few research studies have utilized AdaBoost and random forests to make predictions on medical databases.

We propose a combination of AdaBoost and random forests for predicting breast cancer survivability from a data set collected at Srinagarind Hospital in Thailand. We investigate the performance of the AdaBoost algorithm using random forests as the weak learner algorithm to generate better prediction models for breast cancer survivability investigation. The 10-fold cross-validation method, confusion matrix, ROC curve, AUC score, accuracy, sensitivity and specificity are used to evaluate the breast cancer survivability prediction models.
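As a concrete illustration of this evaluation protocol, the sketch below shows one way such a hybrid could be assembled and scored with 10-fold cross-validation using scikit-learn. It is a minimal sketch, not the authors' implementation: the AdaBoostClassifier/RandomForestClassifier pairing only approximates the combined method described in Section II, and the feature matrix X and label vector y are synthetic placeholders standing in for the (non-public) Srinagarind data set.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(570, 10))                  # placeholder patient attributes
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # placeholder survivability labels (0/1)

# random forests used as the weak learner inside AdaBoost
abrf = AdaBoostClassifier(RandomForestClassifier(n_estimators=10), n_estimators=50)

accs, sens, specs, aucs = [], [], [], []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for train_idx, test_idx in cv.split(X, y):
    abrf.fit(X[train_idx], y[train_idx])
    pred = abrf.predict(X[test_idx])
    score = abrf.predict_proba(X[test_idx])[:, 1]
    tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()   # confusion matrix entries
    accs.append((tp + tn) / (tp + tn + fp + fn))    # accuracy
    sens.append(tp / (tp + fn))                     # sensitivity
    specs.append(tn / (tn + fp))                    # specificity
    aucs.append(roc_auc_score(y[test_idx], score))  # area under the ROC curve

print("accuracy=%.3f sensitivity=%.3f specificity=%.3f AUC=%.3f"
      % (np.mean(accs), np.mean(sens), np.mean(specs), np.mean(aucs)))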
The remainder of this paper is organized as follows. Section II introduces the basic concepts of AdaBoost, random forests, and the hybrid of AdaBoost and random forests. Section III presents the methodologies and the experiment design used in this paper. Experiment results and discussions are presented in Section IV. The conclusion and an outline of future work are given in Section V.
II. BASIC CONCEPTS OF ALGORITHMS

This section briefly describes the theoretical background of AdaBoost, random forests and the proposed combination algorithm used in this paper.

A. AdaBoost

AdaBoost is one of the newest and most popular ensemble methods. It is used for prediction in classification tasks, and it is reported to produce self-rated confidence scores that estimate the reliability of its predictions [16]. It is a learning algorithm used to generate multiple classifiers, from which the best classifier is selected [16], [17]. It not only has high flexibility for combining with other methods, such as the decision stump and classification and regression trees (CART), but it also requires fewer input parameters and less computing background knowledge to improve the accuracy of the prediction models built from the data set.

In this paper we utilize AdaBoost.M1 [18], a gentle AdaBoost, which originates from setting weights over the training set. The training set is (x1,y1),…,(xn,yn), where each xi belongs to an instance space X, and each label yi is in the label set Y, which is equal to {-1,+1}. The weight assigned to training example i on round k is denoted Dk(i). The same weight is set for every example at the starting point (D1(i)=1/n, i=1,…,n). Then the weights of the examples misclassified by the base learning algorithm (called the weak hypothesis) are increased, so that the hard examples in the training set receive more attention in each round. The eight steps of the AdaBoost algorithm are given in Fig. 1.

Input: S: training set, S=(x1,y1),…,(xn,yn), with labels yi ∈ Y
       K: number of iterations
1) Assign the sample S=(x1,y1),…,(xn,yn); xi ∈ X, yi ∈ {-1,+1}
2) Initialize the weights D1(i)=1/n, i=1,…,n
3) for k=1,…,K
4)    Call WeakLearn, providing it with the distribution Dk
5)    Get a weak hypothesis hk: X → {-1,+1} with its error εk = Σ{i: hk(xi) ≠ yi} Dk(i)
6)    Update the distribution Dk: Dk+1(i) = Dk(i) exp(-αk yi hk(xi)) / Zk
7) next k
8) Output: H(x) = sign( Σ(k=1..K) αk hk(x) )

Fig. 1. AdaBoost algorithm

where Zk is the normalization constant (chosen so that Dk+1 will be a distribution). αk, presented in Equation (1), improves the generalization result and also alleviates the overfitting and noise-sensitivity problems [19]. W refers to the class probability estimate used to construct the real value of αk hk(x).

αk = (1/2) ln( W+1 / W-1 )    (1)

Therefore, the final hypothesis H(x) is a weighted majority vote of the K weak hypotheses, where αk is the weight assigned to hk. In addition, AdaBoost handles not only the binary class but also the numerical class for prediction purposes [17].
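A compact sketch of the boosting loop in Fig. 1 is given below, assuming a generic weak_learn factory (for example, one that returns a small random forest) whose classifiers accept sample weights, and labels in {-1,+1}. For simplicity it uses the common error-based weight αk = (1/2) ln((1-εk)/εk) rather than the class-probability form of Equation (1); it illustrates the loop structure only and is not the authors' implementation.

import numpy as np

def adaboost(X, y, weak_learn, K=50):
    n = len(y)
    D = np.full(n, 1.0 / n)            # step 2: uniform initial weights D1(i) = 1/n
    hypotheses, alphas = [], []
    for k in range(K):                 # step 3
        h = weak_learn()
        h.fit(X, y, sample_weight=D)   # step 4: call WeakLearn with the distribution Dk
        pred = h.predict(X)
        eps = D[pred != y].sum()       # step 5: weighted error of hk
        if eps >= 0.5:                 # stop when the weak learner is no better than chance
            break
        eps = max(eps, 1e-10)          # guard against a perfect fit (log of zero)
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)   # step 6: up-weight misclassified examples
        D = D / D.sum()                     # normalize by Zk so Dk+1 is a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    def H(Xnew):                       # step 8: weighted majority vote of the weak hypotheses
        votes = sum(a * h.predict(Xnew) for a, h in zip(alphas, hypotheses))
        return np.sign(votes)
    return H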
B. Random Forests

Random forests (RF) [20] is one of the most successful ensemble learning techniques, and it has proven to be a very popular and powerful technique in pattern recognition and machine learning for high-dimensional classification [21] and skewed problems [20]. These studies used RF to construct a collection of individual decision tree classifiers built with the classification and regression trees (CART) algorithm [22]. CART is a rule-based method that generates a binary tree through a binary recursive partitioning process that splits a node based on the yes and no answers of the predictors. The rule generated at each step maximizes the class purity within the two resulting subsets, and each subset is split further based on independent rules. CART uses the Gini index to measure the impurity of a data partition or set of training instances [7]. Although the aim of CART is to maximize the difference of heterogeneity, in real-world data sets the overfitting problem, which causes the classifier to have a high prediction error on unseen data, is often encountered. Therefore, the bagging mechanism in RF enables the algorithm to create classifiers for high-dimensional data very quickly [20], [21]. The accuracy of the classification decision is obtained by voting among the individual classifiers in the ensemble. The common element in all of these steps is that, for the b-th tree, a random vector Sb is generated from a bootstrap sample, independent of the past random vectors but with the same distribution, and a tree is grown using the training set and Sb. The random forests algorithm is shown in Fig. 2.
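The sketch below illustrates the bootstrap-and-vote mechanism just described: B CART-style trees, each grown on its own bootstrap sample Sb, combined by majority vote. It is only an illustration under these assumptions, with scikit-learn's Gini-based DecisionTreeClassifier standing in for CART, numpy arrays for X and y, and 0/1 class labels; it is not Breiman's exact algorithm nor the implementation used in the paper.

import numpy as np
from sklearn.tree import DecisionTreeClassifier   # CART-style tree using the Gini index

def random_forest_fit(X, y, B=100, seed=0):
    # grow B trees, each on its own bootstrap sample S_b of the training set
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for b in range(B):
        S_b = rng.integers(0, n, size=n)           # bootstrap indices, drawn with replacement
        tree = DecisionTreeClassifier(criterion="gini", max_features="sqrt")
        tree.fit(X[S_b], y[S_b])
        trees.append(tree)
    return trees

def random_forest_predict(trees, Xnew):
    # majority vote over the ensemble (class labels assumed to be 0/1)
    votes = np.stack([t.predict(Xnew) for t in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)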
The proposed algorithm is compared with 8 weak learners of AdaBoost, including ADTree, C4.5, conjunctive rule, decision stump, Naïve Bayes, NN-classifier, RIPPER and SVM. The default settings of each weak learner are used to generate the models. These models were evaluated using 10-fold cross-validation, measuring accuracy, sensitivity and specificity. The experiment involved increasing the number of AdaBoost iterations by 5 each time, up to 100 iterations, to illustrate the performance of the models. The experiment results are given in Figs. 5, 6 and 7, respectively.

Fig. 7. The specificity comparison (specificity, %, versus the number of iterations, 0-100).
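As an illustration of this sweep, the sketch below varies the number of boosting iterations from 5 to 100 in steps of 5 and estimates specificity by 10-fold cross-validation. It is a hedged sketch: a depth-1 CART (decision stump) stands in for one of the weak learners above, scikit-learn defaults replace the tools actually used in the experiments, and X and y are synthetic placeholders.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(570, 10))                  # placeholder data
y = (X[:, 0] - 0.3 * X[:, 2] > 0).astype(int)

for k in range(5, 101, 5):                      # 5, 10, ..., 100 boosting iterations
    # a depth-1 CART (decision stump) stands in for one of the weak learners above
    model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=k)
    pred = cross_val_predict(model, X, y, cv=10)     # 10-fold cross-validated predictions
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    print("K=%3d  specificity=%5.2f%%" % (k, 100.0 * tn / (tn + fp)))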
TABLE II
PERFORMANCE COMPARISON AMONG SINGLE CLASSIFIERS ON THE TRAINING AND TEST SETS

                      Training Set                                      Test Set
Classifiers           Accuracy(%)  Sensitivity(%)  Specificity(%)       Accuracy(%)  Sensitivity(%)  Specificity(%)
ABRF                  100.00       100.00          100.00               88.60        89.30           87.65
AdaBoost               80.88        78.55           85.28               80.35        77.93           85.05
ADTree                 85.09        85.59           84.39               82.28        83.59           80.50
Bagging                91.23        92.24           89.92               83.86        84.64           82.77
C4.5                   92.46        93.19           91.50               84.04        87.38           80.08
Conjunctive Rule       77.54        74.74           83.71               77.54        74.74           83.71
Naïve Bayes            84.04        85.54           82.04               83.51        84.97           81.56
NN-classifier         100.00       100.00          100.00               83.86        85.49           81.71
Random forests         99.65        99.69           99.60               85.79        86.63           84.65
RIPPER                 87.54        91.15           83.40               85.79        88.25           82.75
SVM                    99.82        99.69          100.00               85.96        86.45           85.29
TABLE III
PERFORMANCE COMPARISON AMONG MULTIPLE CLASSIFIERS ON TEST SETS

                      Accuracy (%)                  Sensitivity (%)               Specificity (%)
Base Classifiers      Min    Max    Avg    Var      Min    Max    Avg    Var      Min    Max    Avg    Var
ABRF                  88.42  89.30  88.79  0.05     88.36  90.37  89.79  0.18     86.99  88.94  87.48  0.22
ADTree                81.05  87.72  86.07  4.34     83.23  89.62  88.02  3.67     77.56  85.89  83.56  0.51
C4.5                  82.81  88.07  86.95  1.25     86.36  89.66  88.35  0.64     78.63  86.59  85.15  2.80
Conjunctive Rule      77.37  81.58  80.94  1.61     74.55  80.91  79.85  4.08     82.65  85.33  82.94  0.43
Decision Stump        77.54  81.75  80.40  0.92     74.74  81.87  79.74  3.38     80.18  85.05  81.69  2.15
Naïve Bayes           81.75  83.51  81.98  0.25     84.71  85.94  84.87  0.13     78.13  81.56  78.44  0.64
NN-classifier         81.58  83.68  82.15  0.23     84.01  85.23  84.47  0.08     78.49  81.63  79.19  0.60
RIPPER                84.21  86.49  85.74  0.15     86.19  87.46  86.38  0.20     80.38  85.29  84.89  1.33
SVM                   85.96  88.42  87.79  0.29     87.35  89.02  88.40  0.18     84.15  87.76  86.98  0.65
Note: Min refers to minimum; Max refers to maximum; Avg refers to average; Var refers to variance.