Two-Stage Classification Methods For Microarray Data: Tzu-Tsung Wong, Ching-Han Hsu
Expert Systems with Applications 34 (2008) 375–383
www.elsevier.com/locate/eswa
Abstract
Gene expression data are a key factor for the success of medical diagnosis, and two-stage classification methods have therefore been developed for processing microarray data. The first stage of such a method selects a pre-specified number of genes that are likely to be the most relevant to the occurrence of a disease and passes these genes to the second stage for classification. In this paper, we use four gene selection mechanisms and two classification tools to compose eight two-stage classification methods, and test these eight methods on eight microarray data sets to analyze their performance. The first interesting finding is that the genes chosen by different categories of gene selection mechanisms have less than half in common but yield classification accuracies that are not significantly different. A subset-gene-ranking mechanism can be beneficial in classification accuracy, but its computational effort is much heavier. Whether the classification tool employed at the second stage should be accompanied by a dimension reduction technique depends on the characteristics of a data set.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: Dimension reduction; Gene selection; Microarray data; Two-stage classification method
0957-4174/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2006.09.005
can be categorized in many ways. In this study, we need classification tools that are naturally designed for processing continuous features such as gene expression data. Since the expression data of thousands of genes are likely to contain noise, a dimension reduction technique is an appropriate tool to filter such noise. We therefore would like to investigate whether a classification tool should be preceded by a dimension reduction technique in analyzing microarray data.

This paper is organized as follows. The literature relevant to classifying microarray data is reviewed in Section 2. Section 3 presents the possible designs of two-stage classification methods and introduces the eight two-stage classification methods investigated in this study. In Section 4, we propose a procedure for evaluating the performance of the eight two-stage classification methods. This procedure is used in Section 5 to analyze the experimental results obtained from eight microarray data sets for cancer detection. The conclusions and the directions for future study are addressed in Section 6.

2. Related works

Traditional classification methods are generally inapplicable to microarray data, which possess the special characteristics pointed out in Section 1. New classification methods have therefore been developed for processing microarray data in recent years, as summarized in Table 1. Some of the methods listed in Table 1 are brand new and specifically developed for processing microarray data. However, most of them modify well-known techniques and assemble them to deal with microarray data. The latter approach can produce a method that is easier to learn and apply for analyzing microarray data.

A microarray instance usually contains thousands of genes or features, and most of them are irrelevant to the disease of interest. A feature selection mechanism can not only improve the classification efficiency, but also remove the interference caused by the irrelevant genes. Experimental results also showed that a feature selection mechanism could increase the accuracy of most classification methods (Dudoit, Fridlyand, & Speed, 2002; Golub et al., 1999).

Lu and Han (2003) divided the gene selection mechanisms into two categories: individual gene ranking and gene subset ranking. Individual-gene-ranking mechanisms calculate the correlation between each gene and the class value and select the genes whose correlations are larger than a pre-defined threshold. An example of this approach is the feature selection mechanism adopted by Antoniadis, Lambert-Lacroix, and Leblanc (2003). This approach is usually simpler and more efficient, but it may exclude genes that are important for disease detection only when they work together. It may also select discerning genes that carry redundant information for classification. Gene-subset-ranking mechanisms remove the gene with the smallest impact on disease detection one by one to find a group of genes that serve together to achieve the best classification result. The gene selection method proposed by Li, Weinberg, Darden, and Pedersen (2001) is an example of this approach. This approach can find a discerning gene subset, but its computational complexity is high.

There are many ways to categorize classification methods, such as whether the learning scheme is lazy or whether it is probability-based. If the noise contained in the available data is not considered, any classification tool can be applied to microarray data directly, and Li et al. (2001) is an example of this case. Some studies attempted to remove the noise before classifying, such as the method presented by Jörnsten and Yu (2003).

3. Structure of two-stage classification methods

A two-stage classification method includes a gene selection mechanism at the first stage and a classification tool that predicts the class of a new instance based on the genes
Table 1
Recent classification methods for microarray data
Source Tools
Friedman et al. (2000) Bayesian network
Li et al. (2001) Genetic algorithm and k-nearest neighbors
Khan et al. (2001) Artificial neural network
Zhang et al. (2001) Binary decision tree
Nguyen and Rocke (2002) t statistics, partial least squares, and logistic discrimination analysis
Albrecht et al. (2003) Perceptron and simulated annealing
Antoniadis et al. (2003) Wilcoxon score, minimum average variance estimation, and logistic discrimination analysis
Jörnsten and Yu (2003) Between-to-within-class sum of squares, Rissanen’s minimum description length, and linear discrimination analysis
Lee et al. (2003) Bayesian model
Desper et al. (2004) Phylogenetic method
Simek et al. (2004) Singular value decomposition and support vector machine
Asyali and Alci (2005) Fuzzy c-means and normal mixture model
Georgii et al. (2005) Quantitative association rule
Qiu et al. (2005) Ensemble dependence model
Martella (2006) Factor mixture models
Wu (2006) Penalized linear regression model
chosen at the first stage. Since the number of genes in a microarray instance is generally more than 1000, and most of the genes cannot provide useful information for classification, a gene selection mechanism is necessary for processing microarray data.

As pointed out in Section 2, the mechanism for gene selection can be either individual gene ranking or subset gene ranking. Since gene expression data are generally contaminated by noise, a technique that can filter the noise contained in the data may enhance the prediction accuracy of a classification tool. Thus, in this study we divide classification tools into two categories based on whether they include some form of dimension reduction. The goal of dimension reduction is to transform the information contained in the original data into another form that can be represented by a smaller number of new features. An obvious benefit of dimension reduction is the improvement in the speed of learning. It can sometimes also remove noise from the data to increase prediction accuracy. At the same time, we might worry that useful information for prediction may be lost in the data transformation. Thus, it should be of interest to know whether a classification tool preceded by a dimension reduction technique will improve the prediction accuracy in analyzing microarray data.

In summary, we have two options in designing the mechanism for gene selection at the first stage, and we can pick a classification method with or without a dimension reduction procedure for the second stage. So, there are four possible ways to compose a two-stage classification method for microarray data, as shown in Fig. 1. The following two subsections introduce four gene selection mechanisms and two classification tools for designing two-stage classification methods. Based on these tools, we can compose eight two-stage classification methods for analyzing microarray data.

Fig. 1. Possible designs of two-stage classification methods for processing microarray data. (Diagram labels: Microarray data; Learning results.)

3.1. Gene selection mechanisms

A microarray instance $i$ with $n$ genes and one class value can be represented by $(x_{i1}, x_{i2}, \ldots, x_{in}, y_i)$. Suppose that a microarray instance with class value 1 comes from an abnormal tissue, and that the class value is 0 if it comes from a normal tissue. Let $N_k$ be the number of training instances with class $k$ for $k = 0, 1$, and let $\bar{x}_{jk}$ and $s_{jk}^2$ be the mean and the variance of gene $j$ calculated from the training instances with class $k$, respectively. According to Nguyen and Rocke (2002), the t value of gene $j$ is

$$t_j = \frac{\bar{x}_{j0} - \bar{x}_{j1}}{\sqrt{\dfrac{s_{j0}^2}{N_0} + \dfrac{s_{j1}^2}{N_1}}}.$$

The genes are then sorted in descending order according to their t values. If the number of genes for classification is set to $r$, the t-statistics mechanism selects the first $r/2$ and the last $r/2$ genes in this order.

Define the BW ratio (the ratio of between-group sums of squares to within-group sums of squares) of gene $j$ as follows:

$$BW_j = \frac{\sum_i \sum_k I(y_i = k)\,(\bar{x}_{jk} - \bar{x}_j)^2}{\sum_i \sum_k I(y_i = k)\,(x_{ij} - \bar{x}_{jk})^2},$$

where $\bar{x}_j$ is the global mean of gene $j$ and $I(\cdot)$ is an indicator function with value 1 if the condition in $I$ holds, or 0 otherwise. Dudoit et al. (2002) proposed that a gene with a larger BW ratio can explain a larger proportion of the variance of the class. Thus, this mechanism sorts the genes in descending order of BW ratio and chooses the first $r$ genes for the next stage.

Park, Pagano, and Bonetti (2001) designed a non-parametric mechanism for gene selection. They defined the score of gene $j$ as follows:
$$S_j = \sum_{i \in D_0} \sum_{m \in D_1} I(x_{ij} - x_{mj}),$$

where $D_k$ is the set of training instances with class $k$ and $I(x_{ij} - x_{mj}) = 1$ if $x_{ij} > x_{mj}$, or 0 otherwise. If the value of $S_j$ is larger than $N_0 \times N_1 / 2$, it is replaced by $N_0 \times N_1 - S_j$. Consider that the expression values of gene $j$ are sorted in ascending order. In fact, $S_j$ is the minimal number of interchanges on the values in the sorted order such that the instances with the same class are grouped together. A gene with score 0 implies that the training instances with different class values can be divided into two groups without performing any interchange on the values in the sorted order, hence its value is decisive in determining the class value of a microarray instance. Thus, this mechanism chooses the $r$ genes with the smallest score values.

The above three gene selection mechanisms, which compute the t statistics, the BW ratio, and the score value, will be denoted by mechanisms T, BW, and S, respectively. They all apply some measure to determine which genes are the most relevant to the class value, hence they are individual-gene-ranking mechanisms. Li et al. (2001) applied the genetic algorithm combined with the k-nearest neighbors for gene selection, which is a subset-gene-ranking mechanism. It will be denoted by GK.

The genetic algorithm has a set of initial solutions called chromosomes, each constituted by a fixed number of randomly selected genes. A fitness function is then applied to evaluate the chromosomes. For each chromosome, every pair of instances has a Euclidean distance derived from the genes in the chromosome. If the k nearest neighbors of an instance all have the same class, this class is the predicted class of the instance. Otherwise, the class of this instance is undetermined.

Let $n_j$ be the number of correctly classified instances when we use the genes in chromosome $j$ to find the nearest neighbors. Then $n_j / (N_0 + N_1)$ is the fitness of chromosome $j$. A pre-specified number of chromosomes that have the highest fitness values are passed on to generate the chromosomes for the next iteration. Mutation can occur in this process. A chromosome with a fitness larger than the pre-specified threshold is stored in a list. The genetic algorithm stops when the number of chromosomes in the list reaches a pre-specified number. The frequency of a gene is the number of chromosomes in the list containing the gene. All genes are sorted in descending order according to their frequencies, and the GK mechanism chooses the first $r$ genes in this order.

3.2. Classification tools

Our interest in classification tools is to see whether a dimension reduction procedure involved in a classification tool can affect the prediction accuracy on microarray data. Most classification tools, such as naïve Bayesian classifiers and decision trees, favor discrete attributes. Since gene expression data are numeric, classification tools that favor continuous attributes are more appropriate for microarray data. We therefore pick the k-nearest neighbors and the logistic discrimination analysis combined with the partial least squares for dimension reduction as the classification tools in this study, and they will be denoted by K and PL, respectively.

The k-nearest neighbors uses the genes selected at the first stage to find the k training instances that are the closest to a new instance. The majority class of the k training instances is the predicted class of the new instance. It can be applied for classification regardless of the number of genes used for calculating the Euclidean distance between a pair of instances. However, when the number of genes is large, say more than 20, it is inappropriate to apply the logistic discrimination analysis.

Suppose that the class value can be either 0 or 1. Define $p = P\{y = 1 \mid d\}$ for an instance, where $d$ is a row vector derived from the expression values of the genes chosen at the first stage from the instance. The logistic discrimination analysis assumes that the values of $\ln\frac{p}{1-p}$ for all instances can be fitted well by a regression line $d\beta$, where $\beta^T = (\beta_0, \beta_1, \ldots, \beta_q)$. Parameter $\beta_i$ can be estimated by the maximum likelihood estimate $\hat{\beta}_i$ for $i = 0, 1, \ldots, q$. Then for any new instance, we first derive its $d$ and calculate

$$\hat{p} = \frac{\exp(d\hat{\beta})}{1 + \exp(d\hat{\beta})}.$$

If $\hat{p} > 1/2$, then the predicted class of the new instance is 1, and 0 otherwise. Note that in applying the logistic discrimination analysis, we need to calculate the values of $\hat{\beta}_i$ for $i = 0, 1, \ldots, q$. As in linear regression, the number of independent variables should be as small as possible. Thus, a dimension reduction technique, such as the partial least squares, can transform the original $r$-dimensional space composed of the $r$ genes chosen at the first stage into a smaller $q$-dimensional space to make the logistic discrimination analysis more applicable.

The partial least squares finds $q$ column vectors, called PLS components, such that the covariance between the class values and the linear combination of the gene expression values and a PLS component is maximized. These PLS components are used to transform the expression values of the $r$ genes in an instance into $q$ values that are plugged into the logistic regression model for classification. Nguyen and Rocke (2002) noted that the PLS components usually result in better performance than the components generated from the principal component analysis.

4. Performance evaluation

A two-stage classification method composed of gene selection mechanism A and classification tool B will be denoted by A/B. For example, the methods proposed by Li et al. (2001) and Nguyen and Rocke (2002) can be represented by GK/K and T/PL, respectively. In this study, we test the two-stage classification methods composed of the four gene selection mechanisms and the two classification tools introduced in the previous section.
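To make the A/B notation concrete, the following minimal sketch composes a T/K method in plain Python: the t value of Section 3.1 ranks the genes, the first r/2 and the last r/2 genes in descending order are kept, and a k-nearest-neighbors majority vote on Euclidean distance classifies a new instance. The function names, the toy data, the use of the sample variance (divisor N_k − 1), and the choice k = 3 are assumptions made for illustration; they are not taken from the original experiments.

```python
import math
from statistics import mean, variance


def t_values(X, y):
    """t value of each gene: (mean0 - mean1) / sqrt(var0/N0 + var1/N1)."""
    rows0 = [x for x, c in zip(X, y) if c == 0]
    rows1 = [x for x, c in zip(X, y) if c == 1]
    n0, n1 = len(rows0), len(rows1)
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row in rows0]
        g1 = [row[j] for row in rows1]
        # Assumption: sample variance with divisor N_k - 1.
        denom = math.sqrt(variance(g0) / n0 + variance(g1) / n1)
        scores.append((mean(g0) - mean(g1)) / denom)
    return scores


def select_t(X, y, r):
    """Mechanism T: first r/2 and last r/2 genes in descending t order (r >= 2)."""
    t = t_values(X, y)
    order = sorted(range(len(t)), key=lambda j: t[j], reverse=True)
    half = r // 2
    return order[:half] + order[-half:]


def knn_predict(X, y, genes, x_new, k=3):
    """Tool K: majority class (0 or 1) of the k nearest training instances."""
    def dist(a, b):
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in genes))

    nearest = sorted(range(len(X)), key=lambda i: dist(X[i], x_new))[:k]
    votes_for_1 = sum(y[i] for i in nearest)
    return 1 if 2 * votes_for_1 > k else 0


# Hypothetical toy data: 6 instances, 4 genes. Gene 0 is high in class 0,
# gene 3 is high in class 1, and genes 1 and 2 carry no class signal.
X = [
    [5.0, 1.0, 2.0, 0.1],
    [5.2, 2.0, 1.0, 0.3],
    [4.9, 3.0, 3.0, 0.2],
    [1.0, 1.1, 2.1, 3.0],
    [1.2, 2.1, 1.1, 3.2],
    [0.9, 3.1, 3.1, 2.9],
]
y = [0, 0, 0, 1, 1, 1]

genes = select_t(X, y, r=2)                        # stage 1: gene selection
prediction = knn_predict(X, y, genes, [1.1, 2.0, 2.0, 3.1], k=3)  # stage 2
```

Swapping `select_t` for another ranking function, or `knn_predict` for another classifier, gives the other A/B combinations in the same shape.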
The testing results will be able to provide some guidelines for designing a two-stage classification method for microarray data. In this section, we present the evaluation method for the eight methods that will be tested on eight microarray data sets.

As mentioned at the beginning of this paper, the special characteristics of microarray data are: available instances are few, the number of features is large, and most features are irrelevant or noisy. The gene selection mechanism in a two-stage classification method can be a proper tool to deal with the last two characteristics of microarray data. The first characteristic may cause some problems in performance evaluation. When the number of available instances is small, the classification result of every instance has a significant impact on the estimated prediction accuracy of a classification method. Since not all genes are relevant to the occurrence of a disease, the number of genes selected at the first stage can also affect the classification accuracy. It is of interest to evaluate the accuracies when either the tools or the numbers of genes for classification are different.

Most studies use either the 3-fold or the leave-one-out cross-validation to evaluate the accuracy on a microarray data set, as summarized in Table 2. The 3-fold cross-validation randomly divides the instances in a data set into three folds. Each fold is in turn used for testing, and the other two folds are used for training. Thus, the 3-fold cross-validation produces three prediction accuracies for comparison. The leave-one-out cross-validation holds one instance out to test the learning results derived from the remaining instances. Every instance is in turn used for testing, and the accuracy is estimated by the number of correct classifications over the total number of instances.

Table 2
Methods for evaluating classification performance on microarray data
Source Evaluation method
Li et al. (2001) Randomly select N instances for training
Nguyen and Rocke (2002) Randomly select N instances for training
Antoniadis et al. (2003) Leave-one-out cross-validation
Jörnsten and Yu (2003) 3-fold cross-validation
Lee et al. (2003) Leave-one-out cross-validation
Albrecht et al. (2003) Repeated 3-fold cross-validation with equal fold size
Desper et al. (2004) Repeated 3-fold cross-validation with equal fold size
Simek et al. (2004) Leave-one-out cross-validation

Traditionally, we apply either the 10-fold or the 5-fold cross-validation to a data set to obtain 10 or 5 estimated classification accuracies in a repetition. Under proper control, the accuracies resulting from different classification methods are matched samples, and the paired t test can be used to identify whether their prediction accuracies are significantly different. The leave-one-out cross-validation can produce only one estimate of the accuracy on a data set. Since a data set is just a sample of the entire population, the leave-one-out cross-validation is inappropriate for comparing the performance of two classification methods. When the number of instances in a data set is less than 100, the number of instances in a fold for the 5-fold cross-validation is less than 20 on average. This means that every instance can affect more than 5% of the accuracy of this fold. This is the reason why most researchers divided the available data into 3 folds instead of 5 folds in analyzing microarray data. Note that only one repetition of the 3-fold cross-validation will not generate enough classification accuracies for comparison. In this study, the 3-fold cross-validation for every data set is repeated five times.

Let $a_j$ and $b_j$ be the accuracies of methods A and B, respectively, for data fold $j$, $j = 1, 2, \ldots, 15$. Set $d_j = a_j - b_j$ for $j = 1, 2, \ldots, 15$ and $\bar{d} = \sum_{j=1}^{15} d_j / 15$. Then $\bar{d}$ is an estimate of the accuracy difference $\delta$ between methods A and B, and the standard deviation of $\bar{d}$ equals

$$s_{\bar{d}} = \sqrt{\frac{\sum_{j=1}^{15} (d_j - \bar{d})^2}{210}}.$$

By the paired t test, the accuracies of A and B are significantly different if the p-value corresponding to the test statistic $t = \bar{d} / s_{\bar{d}}$ is less than the significance level $\alpha$.

According to the sensitivity analysis performed by Li et al. (2001), the appropriate number of genes for classifying the lymphoma data should be between 50 and 200. In order to investigate the impact of the number of genes for classification, we pick either 50, 100, 150, or 200 genes at the first stage and pass them to the second stage. Let $z_{ij}$ be the testing accuracy of data fold $j$ when the number of genes for classification is $50 \times i$, and let $\mu_i = (z_{i1} + z_{i2} + \cdots + z_{i15}) / 15$ for $i = 1, 2, 3, 4$. Based on the single-factor analysis of variance (ANOVA), if the p-value corresponding to the test statistic $F$ is smaller than $\alpha$, the $\mu_i$ are not all the same. This implies that the number of genes for classification does affect the accuracy. The value of $\alpha$ is 0.10 for the significance tests performed in the remainder of this article.

5. Experimental study

In this study, the gene selection mechanism applied at the first stage is either T, BW, S, or GK, and the tool for classification employed at the second stage is either K or PL. So, the number of two-stage classification methods investigated in this paper is eight. In this section, we introduce the characteristics of eight microarray data sets, the procedure of data pre-processing, and the parameter settings for the methods. Then the eight methods are tested on the eight data sets for performance evaluation.

5.1. Data pre-processing and parameter settings

The logistic discrimination analysis is a classification tool for data sets with only two possible class values. We therefore downloaded eight two-class microarray data sets from the web site http://sdmc.lit.org.sg/GEDatasets/Datasets.html to test the eight two-stage classification methods. The char-
Table 3
The characteristics of microarray data sets
Data set No. of instances No. of genes Description
Breast 97 24,481 46 instances with class "relapse" and 51 instances with class "non-relapse"
CNS 60 7,129 21 instances with class "0" and 39 instances with class "1"
Colon 62 2,000 40 instances from tumor tissue and 22 instances from normal tissue
Leukemia 72 7,129 47 instances with class "ALL" and 25 instances with class "AML"
Lung 181 12,533 31 instances with class "MPM" and 150 instances with class "ADCA"
Lymphoma 47 4,026 24 instances with class "germinal" and 23 instances with class "activated"
Ovarian 253 15,154 91 instances from normal tissue and 162 instances from tumor tissue
Prostate 136 12,600 77 instances from tumor tissue and 59 instances from normal tissue
Table 5
The p-values of the ANOVA tests on the number of genes for classification
Data set T/K BW/K S/K GK/K T/PL BW/PL S/PL GK/PL
Breast 0.5041 0.8641 0.9880 0.6244 0.3986 0.4637 0.8918 0.8275
CNS 0.1755 0.1102 0.5363 0.8604 0.9878 0.9397 0.9566 0.7893
Colon 0.9617 0.8338 0.8286 0.9970 0.9330 0.1940 0.1414 0.6657
Leukemia 0.7611 0.9980 0.9846 0.9885 0.8693 0.6144 0.9529 0.7475
Lung 0.4670 0.9981 0.8174 0.9872 0.0001 0.9243 0.6322 0.5682
Lymphoma 0.1189 0.0698 0.3270 0.9897 0.8060 0.7101 0.5359 0.9632
Ovarian 0.2955 0.1277 0.0575 0.5384 0.7586 0.2733 0.0000 0.6552
Prostate 0.0057 0.5416 0.1229 0.8938 0.6564 0.8310 0.5864 0.7255
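Each p-value in Table 5 comes from a single-factor ANOVA across the four gene counts (50, 100, 150, and 200). As a rough illustration of the statistic behind those p-values, the sketch here computes the one-way ANOVA F value in plain Python. The accuracy figures are hypothetical (three per group instead of the 15 fold accuracies used in the study), and the function name is an assumption. Converting F to a p-value requires the F distribution with (groups − 1, N − groups) degrees of freedom, which a statistics library would normally supply; that step is omitted.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical accuracies (in percent) for 50, 100, 150, and 200 genes.
accuracies = [
    [80, 82, 84],
    [82, 84, 86],
    [84, 86, 88],
    [86, 88, 90],
]
f_stat = anova_f(accuracies)  # between: 60/3 = 20, within: 32/8 = 4
```

A large F relative to the F distribution's upper tail corresponds to a small p-value, that is, to a gene count that does affect accuracy at the chosen significance level.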
Table 6
The common gene percentages of the four gene selection mechanisms
Data set T vs. BW T vs. S T vs. GK BW vs. S BW vs. GK S vs. GK
Breast 0.94 0.55 0.16 0.56 0.16 0.49
CNS 0.70 0.58 0.14 0.57 0.13 0.14
Colon 0.88 0.84 0.34 0.83 0.34 0.33
Leukemia 0.54 0.61 0.39 0.65 0.36 0.41
Lung 0.07 0.52 0.32 0.46 0.37 0.49
Lymphoma 0.96 0.74 0.31 0.73 0.31 0.35
Ovarian 0.83 0.62 0.58 0.61 0.56 0.46
Prostate 0.76 0.30 0.11 0.25 0.08 0.23
Average 0.71 0.60 0.29 0.58 0.29 0.36
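The entries of Table 6 are fractions of genes shared by two mechanisms. This section does not spell out the exact computation, so the sketch here assumes the natural reading: the size of the intersection of the two selected gene sets divided by the number of selected genes r. The function name and the gene indices are hypothetical.

```python
def common_gene_fraction(selected_a, selected_b):
    """Fraction of genes shared by two selections of the same size r."""
    if len(selected_a) != len(selected_b):
        raise ValueError("both mechanisms must select the same number of genes")
    shared = set(selected_a) & set(selected_b)
    return len(shared) / len(selected_a)


# Hypothetical gene indices chosen by two mechanisms with r = 5.
by_t = [3, 8, 15, 21, 42]
by_gk = [8, 15, 30, 57, 99]
fraction = common_gene_fraction(by_t, by_gk)  # genes 8 and 15 are shared
```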
know whether the various gene selection mechanisms will filter similar genes for classification. Table 6 summarizes the percentages of common genes of the four gene selection mechanisms. We can see that the three individual-gene-ranking mechanisms generally find over 50% but lower than 80% common genes for classification. However, the common gene percentage between any one of the three individual-gene-ranking mechanisms and GK, the representative of subset-gene-ranking mechanisms, is usually lower than 40%. We also calculated the common gene percentages for the four gene selection mechanisms when the number of genes selected at the first stage is either 50, 150, or 200, and obtained similar results. Conservatively speaking, more than half of the genes chosen by different categories of gene selection mechanisms for classification will be different.

In comparing the prediction accuracies of the eight two-stage classification methods, we first fixed the classification tool applied at the second stage to perform the paired t test. The two classification tools applied at the second stage are the k-nearest neighbors and the logistic discrimination analysis combined with the partial least squares for dimension reduction, and Tables 7 and 8 show the results of the hypothesis testing on the resulting mean prediction accuracies, respectively, where a bold value indicates that the mean prediction accuracies are significantly different and its superscript shows the gene selection mechanism with the larger mean prediction accuracy. When the classification tool is the k-nearest neighbors, we can see from Table 7 that the GK mechanism outperforms the other three mechanisms in almost all data sets. However, when the classification tool is PL, the advantage of the GK mechanism appears only in classifying data set "prostate", as shown in Table 8. This suggests that the classification tool applied at the first stage for filtering genes is likely to enhance its prediction accuracy at the second stage. Both Tables 7 and 8 show that the T, BW, and S mechanisms result in similar mean prediction accuracies in most cases,
Table 7
The p-values of the paired t tests when the classification tool applied at the second stage is the k-nearest neighbors
Data set T vs. BW T vs. S T vs. GK BW vs. S BW vs. GK S vs. GK
Breast 0.1781 0.2631 0.0083^GK 0.8396 0.0086^GK 0.0050^GK
CNS 0.0960^BW 0.1232 0.0000^GK 0.7680 0.0085^GK 0.0036^GK
Colon 0.2531 0.8017 0.1475 0.6026 0.0751^GK 0.1220
Leukemia 0.5572 0.3576 0.0047^GK 0.4580 0.0006^GK 0.0067^GK
Lung 0.0237^BW 0.0019^S 0.0008^GK 0.6866 0.0996^GK 0.1627
Lymphoma 0.5501 0.4215 0.3593 0.4886 0.1890 0.1379
Ovarian 0.8363 0.0467^S 0.0031^GK 0.1039 0.0129^GK 0.1951
Prostate 0.3415 0.7654 0.0000^GK 0.4441 0.0017^GK 0.0000^GK
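Each entry of Table 7 is the result of a paired t test over the 15 fold accuracies of two methods (five repetitions of 3-fold cross-validation). A minimal sketch of the test statistic from Section 4 follows; the divisor n(n − 1) equals 210 for n = 15. The accuracy values and the function name are hypothetical. Instead of computing a p-value, the usage compares |t| with 1.761, the two-sided critical value at significance level 0.10 with 14 degrees of freedom from a standard t table.

```python
import math
from statistics import mean


def paired_t(acc_a, acc_b):
    """Return (mean difference, its standard deviation, t statistic)."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(d)
    d_bar = mean(d)
    # Divisor n * (n - 1) is 210 when n = 15, as in Section 4.
    s_d_bar = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n * (n - 1)))
    return d_bar, s_d_bar, d_bar / s_d_bar


# Hypothetical fold accuracies of two methods over 15 folds.
acc_a = [0.80, 0.85, 0.90] * 5
acc_b = [0.78, 0.80, 0.88] * 5

d_bar, s_d_bar, t = paired_t(acc_a, acc_b)
T_CRITICAL = 1.761  # two-sided, alpha = 0.10, 14 degrees of freedom
significant = abs(t) > T_CRITICAL
```

When the difference is significant, the sign of `d_bar` indicates which method has the larger mean accuracy, which is what the superscripts in the table record.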
Table 8
The p-values of the paired t tests when the classification tool applied at the second stage is the logistic discrimination analysis combined with the partial
least squares
Data set T vs. BW T vs. S T vs. GK BW vs. S BW vs. GK S vs. GK
Breast 0.2582 0.0317^S 0.1193 0.0511^S 0.1855 0.4390
CNS 0.7959 0.9870 0.4042 0.7904 0.5487 0.4420
Colon 0.7926 0.0639^T 0.8659 0.2524 0.7726 0.3428
Leukemia 0.5782 0.2025 0.3341 0.6532 0.5495 0.7370
Lung 0.7006 0.8675 0.5393 0.4072 0.9087 0.2765
Lymphoma 0.8887 0.3257 0.1289 0.2935 0.1016 0.0243^S
Ovarian 0.5518 0.0884^T 0.3275 0.0727^BW 0.5850 0.0746^GK
Prostate 0.2150 0.5283 0.0195^GK 0.7686 0.0258^GK 0.0462^GK
at the first stage for selecting genes, no matter what its inductive bias is, the inductive bias will confine the search space of the genes for classification. This means that some discernible gene subsets better than the best gene subsets found by the classification tool may be excluded from the search space by its inductive bias. Thus, it should be desirable to develop a subset-gene-ranking mechanism that does not involve any classification tool. Such gene selection mechanisms would also let us fairly judge the performance of a classification tool applied at the second stage.

Our testing results show that a dimension reduction technique can filter the noise for some microarray data sets to increase the classification accuracy, but not for all data sets. So, the conditions under which a dimension reduction technique should be applied to a microarray data set will be interesting to biologists. Note that those conditions are based on the expression data of the genes selected at the first stage, not on the original data.

References

Albrecht, A., Vinterbo, S. A., & Ohno-Machado, L. (2003). An Epicurean learning approach to gene-expression data classification. Artificial Intelligence in Medicine, 28, 75–87.
Antoniadis, A., Lambert-Lacroix, S., & Leblanc, F. (2003). Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics, 19, 563–570.
Asyali, M. H., & Alci, M. (2005). Reliability analysis of microarray data using fuzzy c-means and normal mixture modeling based classification methods. Bioinformatics, 21, 644–649.
Desper, R., Khan, J., & Schäffer, A. A. (2004). Tumor classification using phylogenetic methods on expression data. Journal of Theoretical Biology, 228, 477–496.
Dudoit, S., Fridlyand, J., & Speed, T. (2002). Comparison of discrimination methods for the classification of tumor using gene expression data. Journal of the American Statistical Association, 97, 77–87.
Friedman, N., Linial, M., Nachman, I., & Pe'er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 601–620.
Georgii, E., Richter, L., Rückert, U., & Kramer, S. (2005). Analyzing microarray data using quantitative association rules. Bioinformatics, 21, 123–129.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Jörnsten, R., & Yu, B. (2003). Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics, 19, 1100–1109.
Khan, J., Wei, J. S., Ringnér, M., Saal, L. H., Ladanyi, M., Westermann, F., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673–679.
Lee, K. E., Sha, N., Dougherty, E. R., Vannucci, M., & Mallick, B. K. (2003). Gene selection: a Bayesian variable selection approach. Bioinformatics, 19, 90–97.
Li, L., Weinberg, C. R., Darden, T. A., & Pedersen, L. G. (2001). Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17, 1131–1142.
Lu, Y., & Han, J. (2003). Cancer classification using gene expression data. Information Systems, 28, 243–268.
Martella, F. (2006). Classification of microarray data with factor mixture models. Bioinformatics, 22, 202–208.
Mitchell, T. M. (1997). Machine Learning. McGraw Hill.
Nguyen, D. V., & Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18, 39–50.
Park, P., Pagano, M., & Bonetti, M. (2001). A nonparametric scoring algorithm for identifying informative genes from microarray data. Proceedings of the Pacific Symposium on Biocomputing, 6, 52–63.
Qiu, P., Wang, Z. J., & Liu, K. J. R. (2005). Ensemble dependence model for classification and prediction of cancer and normal gene expression data. Bioinformatics, 21, 3114–3121.
Quackenbush, J. (2001). Computational analysis of microarray data. Nature Reviews Genetics, 2, 418–427.
Simek, K., Fujarewicz, K., Swierniak, A., Kimmel, M., Jarzab, B., Wiench, M., et al. (2004). Using SVD and SVM methods for selection, classification, clustering and modeling of DNA microarray data. Engineering Applications of Artificial Intelligence, 17, 417–427.
Wu, B. L. (2006). Differential gene expression detection and sample classification using penalized linear regression models. Bioinformatics, 22, 472–476.
Zhang, H., Yu, C., Singer, B., & Xiong, M. (2001). Recursive partitioning for tumor classification with gene expression microarray data. Proceedings of the National Academy of Sciences of the United States of America, 98, 6730–6735.