Multivariate Exploratory
555–567
Printed in Great Britain
SUMMARY
The ultimate success of microarray technology in basic and applied biological sciences depends
critically on the development of statistical methods for gene expression data analysis. The most widely
used tests for differential expression of genes are essentially univariate. Such tests disregard the
multidimensional structure of microarray data. Multivariate methods are needed to utilize the information
hidden in gene interactions and hence to provide more powerful and biologically meaningful methods
for finding subsets of differentially expressed genes. The objective of this paper is to develop methods
of multidimensional search for biologically significant genes, considering expression signals as mutually
dependent random variables. To attain these ends, we consider the utility of a pertinent distance between
random vectors and its empirical counterpart constructed from gene expression data. The distance
furnishes exploratory procedures aimed at finding a target subset of differentially expressed genes. To
determine the size of the target subset, we resort to successive elimination of smaller subsets resulting
from each step of a random search algorithm based on maximization of the proposed distance. Different
stopping rules associated with this procedure are evaluated. The usefulness of the proposed approach is
illustrated with an application to the analysis of two sets of gene expression data.
Keywords: Cross-validation; Differential expression; Permutation test; Probability distance; Random search; Sets of
genes.
1. INTRODUCTION
With its potential to quantitatively measure expression levels of a large number of genes in parallel,
microarray technology holds the promise of becoming an extremely valuable tool in basic biological
† To whom correspondence should be addressed
Biostatistics 4(4)
© Oxford University Press; all rights reserved.
556 A. S ZABO ET AL.
sciences and clinical diagnostics. However, the ultimate usefulness of the technology will depend critically
on whether or not the search for efficient statistical methods meets with success. One of the most popular
uses of microarray data is the identification of those genes that may be responsible for differences
between cell types (genotyping cell lines) or functional states of a cell. The genetic profile of a tissue
determines its properties, which is why different tissues are expected to have different gene expression
patterns. As demonstrated by various clustering methods of gene expression vectors (Khan et al., 1998),
tissues of the same histological origin tend to cluster together. Ross et al. (2000) came to a similar
conclusion when clustering 60 NCI cell lines. Alizadeh et al. (2000) used hierarchical clustering to
demonstrate the existence of two previously unknown genetically different subtypes of B-cell lymphoma
that carry significantly different prognosis in terms of patients’ survival. A closely related application is
the identification of subtypes of known diseases. This includes both improving the existing classification
into known classes and the discovery of new/unknown sub-classes that are clinically significant.
While clustering techniques are useful in providing insights into interactions between different genes
or finding genetically similar patterns, these techniques do not answer the question many researchers are
interested in: which genes are expressed differently in the tissues under comparison? One typical example
of such a setting is represented by the well-known leukemia data set (Golub et al., 1999) that includes
27 ALL (acute lymphoblastic leukemia) and 11 AML (acute myeloid leukemia) samples processed using
Affymetrix oligonucleotide microarrays. A test set of 34 samples is also available. The problem in question
is to find those genes that are responsible for the distinction between the two types of leukemia.
Multiple methods for selection of differentially expressed genes have been proposed, from using
a fixed cutoff for the ratio to using various parametric and non-parametric measures of differential
expression (Kerr and Churchill, 2001; Kerr et al., 2000; Newton et al., 2000; van der Laan and Bryan,
2001; Ben-Dor et al., 2000, 2001). A characteristic feature of these methods is the univariate nature of the
decision to include a particular gene in the target set. The well-known complexity of interactions between
gene functions in a cell strongly suggests that a search for methods that utilize multivariate information on
gene expression signals is warranted. In a recent paper (Szabo et al., 2002), we introduced a multivariate
method for finding a subset of differentially expressed genes of a given size. In this paper, we further
develop this methodology to approach the problem of determining the size of a target set of differentially
expressed genes.
The first step for developing multivariate methods is to define a measure of distance between two
tissue types that is based on a set of genes; in Section 3 we review a novel measure proposed by Szabo
et al. (2002) and contrast it with classical alternatives. Once a measure of differential expression of
gene sets is selected, sets of genes that exhibit an ‘unusually large’ distance will be considered to be
differentially expressed. Thus one needs a method to find the gene-sets for which the distance is large
and a method to determine whether this distance is larger than expected by chance. The random search
methodology designed for finding the set with the maximal distance is described in Section 4 and a
resampling based approach for finding a cutoff for ‘significant’ differential expression is developed in
Section 5. The performance of the method is demonstrated via computer simulations and in Section 6 we
re-analyze our motivating data that is described in detail in the next section. Finally, Section 7 presents an
example where large marginal differences dominate the data and thus univariate approaches are expected
to perform well.
In many applications the groups to be separated are heterogeneous, whether it is known to the
investigator or not. In such cases the goal of the investigator is twofold: find genes that explain the
separation of the data into the known categories and also discover the unknown sub-categories. While the
Multivariate exploratory tools for microarray data analysis 557
Fig. 1. An artificial example of heterogeneous groups (axes: Gene A, Gene B). Gray and white circles represent samples from the two groups, respectively.
second objective might appear unrelated to the first, a simple hypothetical example presented in Figure 1
shows how knowledge of subgroups can enhance classification. In this example the ‘white’ group has two
subgroups each of which is similar to the ‘gray’ group in some respect: one has the same expression of
Gene A, the other has the same expression of Gene B, thus neither gene separates the ‘white’ group from
the ‘gray’ one by itself. However, jointly they can not only separate the two groups, but also indicate the
two ‘white’ subgroups.
The case in point is the leukemia data set of Golub et al. (1999) described in the Introduction. The ALL
group is a mixture of two clinically recognizably different T-cell and B-cell subtypes. When analyzing
this data set, the authors selected those genes that are individually highly correlated with the known
classification and then used a voting procedure for classification of new samples. These techniques are
essentially univariate. The authors produced a list of candidate genes to which the difference between
the two types of leukemia can be attributed. Our recent analysis (Szabo et al., 2002) allowing for a more
general multivariate data structure has resulted in a dissimilar list of ten differentially expressed genes
for the same data set. Some of these genes were T-cell specific, thus they segregate only a part of the
inhomogeneous ALL group; they were found due to the multivariate approach. Another example discussed
in the present paper shows that the multivariate and univariate approaches may produce quite similar
results in some settings. The method used by Szabo et al. (2002) is described in Sections 3 and 4; briefly,
it attempts to find the gene-set of a pre-specified size that best separates two tissues. Two shortcomings of
that method are that the number of the selected genes has to be specified in advance and no assessment of
statistical significance was proposed. Thus while the selected genes ‘made sense’ and provided reasonable
test-set classification rates, it is unclear how many other genes are differentially expressed. The present
paper offers a solution to this problem.
of statistical models with an explicitly specified noise structure for data adjustment (or normalization).
Another method of noise reduction in microarray data consists in data categorization (see Tsodikov et al.,
2002; Szabo et al., 2002; Chilingaryan et al., 2002 for discussion). One way of doing this is to replace the
raw expression measurements with their fractional rank within each slide; this idea will be employed in
Sections 6 and 7. Briefly, let $X_{ij}$ represent measurements of fluorescent intensity for gene $i = 1, \ldots, p$ on
slide $j = 1, \ldots, n$. Then the fractional rank of gene $i$ is defined as $X^{(r)}_{ij} = (\mathrm{rank}_j X_{ij})/p$, where $\mathrm{rank}_j u_{ij}$
is the place (counted from the left) of $u_{ij}$ in the sequence $u_{kj}$, $k = 1, \ldots, p$, arranged in decreasing
order for each $j$. Notwithstanding inevitable loss of information (see Szabo et al., 2002, for more
order for each j. Notwithstanding inevitable loss of information (see Szabo et al., 2002, for more
discussion), categorical adjustments have proven to be very useful in the analysis of differential expression
of individual genes (Tsodikov et al., 2002) and sets of genes (Szabo et al., 2002; Chilingaryan et al., 2002).
For comparison, we also used the cube-root adjustment recommended for use with Affymetrix data with
SAM (Tusher et al., 2001).
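The fractional-rank adjustment described above can be illustrated with a minimal sketch. The function name `fractional_ranks` and the tie-breaking rule (ties keep their original order) are assumptions made for illustration; the text does not specify how ties are handled.

```python
def fractional_ranks(slide):
    """Return rank/p for each intensity on one slide (one column of X).

    Ranks are positions in the sequence sorted in decreasing order,
    so the largest intensity gets rank 1. Ties keep their original
    order; this tie rule is an assumption, not taken from the text.
    """
    p = len(slide)
    # Indices of genes ordered from the largest to the smallest intensity.
    order = sorted(range(p), key=lambda i: -slide[i])
    ranks = [0.0] * p
    for position, gene in enumerate(order, start=1):
        ranks[gene] = position / p
    return ranks


slide = [120.0, 850.0, 40.0, 300.0]
print(fractional_ranks(slide))  # [0.75, 0.25, 1.0, 0.5]
```

Applying the function to every slide separately gives values in (0, 1] that are comparable across slides regardless of overall intensity.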
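$$\hat N = N(\hat\mu_{n_1}, \hat\nu_{n_2}) = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} 2L(x_i, y_j) - \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1} L(x_i, x_j) - \frac{1}{n_2^2}\sum_{i=1}^{n_2}\sum_{j=1}^{n_2} L(y_i, y_j). \qquad (3.2)$$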
When using the distance $N(\mu, \nu)$ one needs to choose a pertinent (strictly negative definite) kernel
$L$. Szabo et al. (2002) discussed techniques for constructing such kernels in detail; in this paper we use
only the Euclidean kernel $L(x, y) = \sqrt{\sum_{g \in S} (x_g - y_g)^2}$, where $S$ denotes the set of the genes considered.
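As a rough illustration of the empirical estimate (3.2), the sketch below averages the kernel over all pairs of observations. It assumes the Euclidean norm as the kernel L and stores each sample as a tuple of expression values; the names `euclidean`, `mean_kernel` and `n_distance` are hypothetical and not taken from the paper.

```python
import math


def euclidean(x, y):
    """Euclidean norm of x - y, used here as the kernel L (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(xs, ys, kernel):
    """Average of kernel(x, y) over all pairs, including i == j when xs is ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def n_distance(xs, ys, kernel=euclidean):
    """Empirical N: twice the between-sample mean minus both within-sample means."""
    return (2.0 * mean_kernel(xs, ys, kernel)
            - mean_kernel(xs, xs, kernel)
            - mean_kernel(ys, ys, kernel))


group_a = [(0.0, 0.0), (1.0, 0.0)]
group_b = [(3.0, 4.0), (4.0, 4.0)]
print(n_distance(group_a, group_b))
print(n_distance(group_a, group_a))  # identical samples give 0.0
```

The double loops cost O(n^2) kernel evaluations per call, which is acceptable for the small numbers of replicates typical of microarray experiments.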
3.2 Comparison with other distance measures
Mahalanobis distance. The estimate of the distance $\sqrt{N}$ is nonparametric and does not involve
numerically unstable high-dimensional parts, thus it is expected to be numerically stable even for small
Fig. 2. The sampling distribution of the Mahalanobis distance and N between two multivariate normal distributions
( p = 4, n = 5, ρ = 0.3, see text). ‘N’ and ‘M’ denote the true value of the distances for distance N and the
Mahalanobis distance, respectively. [Horizontal axis: distance (0 to 10); vertical axis: density (0.0 to 1.5).]
sample sizes. A commonly used parametric measure of separation of two samples is the Mahalanobis
distance
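$$R^2_{\mathrm{Mah}} = (\bar x - \bar y)' \left(\frac{\Sigma_x + \Sigma_y}{2}\right)^{-1} (\bar x - \bar y), \qquad (3.3)$$
where $\bar x$ and $\bar y$ are the sample means and $\Sigma_x$, $\Sigma_y$ are the two sample variance–covariance matrices.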
We used simulation to compare the stability of the Mahalanobis distance and N when estimating the
distance between two p-variate normal distributions with means (0 . . . 0) and (1 . . . 1) and exchangeable
correlation ρ based on two samples of size n. The coefficient of variation of N is smaller than that of the
Mahalanobis distance for all tested values of p = 2, . . . , 10, n = p + 1, . . . , 100 and ρ = 0, 0.3, 0.8.
Typical sampling distributions are shown in Figure 2. We also found that both estimators are biased
upward; however, the bias of the Mahalanobis distance is much larger.
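To make the source of instability concrete, the following sketch computes the Mahalanobis distance (3.3) from two samples. It is written in plain Python, with a small Gaussian-elimination solver standing in for the matrix inverse. The helper names and the use of the n − 1 denominator in the sample covariance are assumptions made for illustration.

```python
def mean_vector(sample):
    n = len(sample)
    return [sum(col) / n for col in zip(*sample)]


def covariance(sample):
    """Sample variance-covariance matrix with the n - 1 denominator (assumed)."""
    n = len(sample)
    m = mean_vector(sample)
    p = len(m)
    return [[sum((row[a] - m[a]) * (row[b] - m[b]) for row in sample) / (n - 1)
             for b in range(p)] for a in range(p)]


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    p = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("averaged covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(aug[i][c] * z[c] for c in range(i + 1, p))
        z[i] = (aug[i][p] - tail) / aug[i][i]
    return z


def mahalanobis_sq(xs, ys):
    """Squared Mahalanobis distance using the average of the two covariances."""
    mx, my = mean_vector(xs), mean_vector(ys)
    sx, sy = covariance(xs), covariance(ys)
    p = len(mx)
    pooled = [[(sx[a][b] + sy[a][b]) / 2.0 for b in range(p)] for a in range(p)]
    diff = [a - b for a, b in zip(mx, my)]
    z = solve(pooled, diff)  # z = pooled^{-1} (x_bar - y_bar)
    return sum(d * zi for d, zi in zip(diff, z))


xs = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
ys = [(2.0, 0.0), (4.0, 0.0), (2.0, 2.0), (4.0, 2.0)]
print(mahalanobis_sq(xs, ys))  # close to 3.0
```

The solver raises an error when the averaged covariance matrix is singular, which is more likely when each sample has few observations relative to p. The kernel-based estimate involves no such matrix inversion.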
Distance to the nearest neighbor. The following example demonstrates the superior stability of the
distance $\sqrt{N}$ as compared to another non-parametric distance measure based on nearest neighbors.
We consider classifying samples into one of two bivariate normal distributions
$N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & 0.5\\ 0.5 & 1\end{pmatrix}\right)$ and
$N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & -0.5\\ -0.5 & 1\end{pmatrix}\right)$. Note that any distance measure based on the sample mean, such as the Mahalanobis
distance, will fail in this situation. Thus we compared N with the nearest-neighbor distance and with the
‘true state’ based on the density functions. We generated 300 data points from both distributions, denoting
these samples D1 and D2 respectively, and 2500 test points forming a uniform 50 × 50 grid. For each
test point we calculated the distance to D1 and D2 and classified the point into the ‘closest’ group; as a
measure of the certainty of this decision we considered the difference between the two distances. Plots of
these differences are shown in Figure 3. Comparing panels (a) and (b) to the difference of the true densities
shown in panel (c), we see that the distance N produces results quite close to the fully informed method,
without actually assuming normality. The nearest-neighbor approach is very unstable, as demonstrated by
the roughness of the surface, and provides highly certain classification far from the actual support of the
distributions (shown as two thick contours).
Fig. 3. (a) Difference between distances to D2 and D1 for the proposed distance N ; (b) difference between the
distances to nearest neighbor from D2 and D1 ; (c) difference of the ‘true’ density functions. Positive values (that is
points classified into D1 ) are shaded gray. The thick contour lines are the projections of the 95% ellipsoid quantiles
of D1 and D2 .
expression signals has been selected, it can be employed in a search for differentially expressed genes
with the target subset of genes being defined as a subset for which the distance between the two classes
attains its maximum. Ideally, all subsets of a given size should be evaluated in terms of the adopted
distance and the one that provides a maximum should be chosen. However, the number of possible subsets
increases exponentially with the total number of genes, and consequently the exhaustive search procedures
as well as the branch-and-bound method (Fukunaga, 1990) become computationally prohibitive. In such
a situation, stepwise procedures seem to be an indispensable aid to variable selection. For all practical
purposes, the issue of computational complexity can be resolved by applying random search methodology.
Random search can be designed in a number of various ways. A simple algorithm for finding a subset
of predetermined size k with the largest distance between two classes (tissues) (Szabo et al., 2002;
Chilingaryan et al., 2002) is presented in box (A1) and used for data analysis in Sections 6 and 7. In
our study, the number M was set at 100 000.
Random search (A1)
1. Randomly select k genes to form the initial approximation; calculate the distance between the two classes for this subset (cluster) of genes.
2. Replace at random one gene from the current cluster by a gene from outside the cluster; calculate the distance for this new cluster.
3. If the distance for the new cluster is larger than for the original cluster (improvement), keep the change, otherwise revert to the previous cluster.
4. Repeat the process until a predetermined number, M, of steps is reached.
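The four steps of the random search can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes the Euclidean norm as the kernel, stores each sample as a list indexed by gene, and uses the hypothetical names `subset_distance` and `random_search`. The toy data and the small number of steps are chosen only to keep the example fast.

```python
import math
import random


def subset_distance(xs, ys, genes):
    """Empirical distance N between two classes, restricted to the given genes."""
    def kernel(a, b):
        return math.sqrt(sum((a[g] - b[g]) ** 2 for g in genes))

    def mean_kernel(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return 2.0 * mean_kernel(xs, ys) - mean_kernel(xs, xs) - mean_kernel(ys, ys)


def random_search(xs, ys, k, n_steps, rng):
    p = len(xs[0])
    if not 0 < k < p:
        raise ValueError("k must be between 1 and p - 1")
    current = rng.sample(range(p), k)                # step 1: random initial cluster
    best = subset_distance(xs, ys, current)
    for _ in range(n_steps):                         # step 4: fixed number of steps
        outside = [g for g in range(p) if g not in current]
        candidate = list(current)
        candidate[rng.randrange(k)] = rng.choice(outside)   # step 2: swap one gene
        value = subset_distance(xs, ys, candidate)
        if value > best:                             # step 3: keep only improvements
            current, best = candidate, value
    return sorted(current), best


# Toy data: genes 0 and 1 differ between classes, genes 2-5 are identical.
xs = [[0.0, 0.0, 1, 2, 3, 4], [0.1, 0.1, 2, 1, 4, 3], [0.2, 0.0, 3, 4, 1, 2]]
ys = [[5.0, 5.0, 1, 2, 3, 4], [5.1, 5.1, 2, 1, 4, 3], [5.2, 5.0, 3, 4, 1, 2]]
genes, value = random_search(xs, ys, k=2, n_steps=500, rng=random.Random(1))
print(genes, round(value, 2))
```

Recomputing the full distance at every step is the simplest choice; caching pairwise kernel contributions per gene would be one way to reduce the cost when M is as large as the 100 000 steps used in the paper.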
When selecting a subset of genes to provide the best discrimination between two classes, it is easy
to come up with over-optimistic conclusions as a result of over-fitting, that is, finding overly specific
patterns that do not extend to new samples. Whenever a small number of variables is selected from a large
set, one should expect a selection bias associated with choosing the optimal of a large number of subsets,
regardless of the criterion used. Cross-validation techniques provide a powerful tool for reducing this
selection bias. Ganeshanandam and Krzanowski (1989) suggested that cross-validation should precede the
variable selection itself. Building on this idea and the well-known methodology of v-fold cross-validation
(see, for example, Breiman et al., 1984), we use a ‘cross-validated search’ procedure that checks for
reproducibility of its results. The basic structure of the algorithm is presented in box (A2).
The performance of this algorithm was evaluated by Szabo et al. (2002). An alternative idea
(Chilingaryan et al., 2002) is to search for multiple local maxima by stopping the random search (A1)
before it could find the global maximum; in practice, it can be done by selecting a relatively small
M = 500 or 1000. Then the genes are listed in the order of the frequency of their occurrence in these
suboptimal sets.
Resampling
1. For each gene i, i = 1, . . . , p shift the values from the tissues so they are centered at the
overall mean for this gene, that is
appropriate test set is often difficult to provide. We conducted a simulation study and demonstrated that
the between-tissue distance associated with gene sets is a good and stable proxy for the classification rate.
A description of the simulations is available as Supplemental Material on the journal’s website.
In box (A4) we describe a distance-based procedure for selecting a subset of genes that are
differentially expressed in two tissues; the successive selection procedures based on cross-validation or
test sample classification rates have been designed along similar lines. The procedure requires the size k
of the building-block clusters and significance level α as inputs.
Fig. 4. Properties of the subsequent optimal sets Gi of size 5 in the leukemia data set. On the left axis: Dist, associated
distance between the tissues. On the right axis: CV, cross-validation rate; Class, test set classification rate. The thin
dashed lines represent the estimated rates; the solid lines represent their isotonically smoothed counterparts; the thick
dashed lines show the cutoff based on the estimated 99th percentile of the corresponding null-distribution. The last
set above the cutoff is marked with a gray diamond. [Horizontal axis: sets of size 5 (1 to 25); left axis: distance (0 to 1.6); right axis: classification/CV rate (0% to 100%).]
rate using both leave-one-out cross-validation based on the selected gene set (CV) and the independent
test set (Class). The results are presented in Figure 4.
Both estimates of the classification rate (CV and Class, thin dashed lines in the figure) are highly
variable, while the distance (Dist) is decreasing monotonically. As the optimal sets were selected
according to the distance, the observed monotonicity confirms the ability of the basic algorithm to find
an optimal subset. To reduce the observed variability of the classification rate estimates we assume the
true rates to be non-increasing and apply isotonic regression (Robertson et al., 1988) to smooth the
corresponding curves (solid lines in the figure). The thick dashed lines represent the level of the 99th percentile
of the null-distribution of the corresponding measure; they were estimated by generating m = 300 random
permutation samples that mimic ‘no-difference’ data in accordance with algorithm (A3).
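The isotonic smoothing step can be illustrated with the pool-adjacent-violators algorithm, which gives the least-squares fit under a monotonicity constraint. The sketch below fits a non-increasing sequence with equal weights. The function name and the example rates are hypothetical and are not values from the leukemia analysis.

```python
def isotonic_nonincreasing(values):
    """Least-squares non-increasing fit by pooling adjacent violators."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # A violation occurs when a later block has a larger mean than the
        # block before it; merge the two and re-check against earlier blocks.
        while (len(blocks) > 1
               and blocks[-1][0] / blocks[-1][1] > blocks[-2][0] / blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


rates = [1.0, 0.9, 0.95, 0.8, 0.85, 0.6]
print(isotonic_nonincreasing(rates))
# approximately [1.0, 0.925, 0.925, 0.825, 0.825, 0.6]
```

Each violating pair is replaced by its average, so the smoothed curve is piecewise constant and never rises from one gene set to the next.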
The iterative search procedure (A4) with sets of size k = 5 at α = 0.01 selects 16 groups for a total of
80 genes. The first three sets of five genes are listed in Table 1; the entire list can be viewed at the journal’s
web site. If the stopping were based on cross-validation or test-set classification rate, the procedure would
have stopped earlier, after five or seven sets. However, the high variability of these measures and the
reported problems with the test set (it was obtained from different institutions and some of the samples are
pediatric cases) make stopping according to the classification rate less desirable.
The results obtained with the above-described multivariate procedures are worth comparing with a
univariate selection of differentially expressed genes. For the first comparison we have sorted the genes
according to the value of the corresponding (marginal) t-statistics and selected the top 16 × 5 = 80 of
them, so that the number of genes coincides with that selected by the multivariate distance-based cutoff
criterion. We found that the two lists have only 34 (42%) genes in common.
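A univariate comparison of this kind can be sketched as follows. The text does not say which form of the t-statistic was used, so the unequal-variance (Welch) form is an assumption here, as are the function names. Genes are ranked by the absolute value of the statistic and the overlap between two gene lists is counted.

```python
import math


def t_statistic(a, b):
    """Two-sample t-statistic with unequal variances (Welch form, assumed).

    Requires at least two values per group and a non-zero variance in at
    least one group.
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)


def top_genes_by_t(xs, ys, m):
    """Indices of the m genes with the largest absolute marginal t-statistic."""
    p = len(xs[0])
    scores = []
    for g in range(p):
        t = t_statistic([s[g] for s in xs], [s[g] for s in ys])
        scores.append((abs(t), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:m]]


def overlap(list_a, list_b):
    """Number of genes common to two lists."""
    return len(set(list_a) & set(list_b))


xs = [[1.0, 5.0, 2.0], [2.0, 5.5, 2.1], [3.0, 4.5, 1.9]]
ys = [[1.5, 9.0, 2.0], [2.5, 9.5, 2.2], [2.0, 8.5, 1.8]]
print(top_genes_by_t(xs, ys, 1))        # [1]
print(overlap([1, 2, 3], [3, 4, 1]))    # 2
```

Dividing the overlap by the common list length gives the proportion of shared genes, as in the 34 out of 80 reported above.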
The Significance Analysis of Microarrays (SAM) of Tusher et al. (2001) is another commonly used
univariate approach, so we have run SAM (Chu et al., 2002) on both the rank-adjusted leukemia data and
on the data adjusted according to the recommendation of Tusher et al. (2001). The latter approach uses
linear calibration on a cube-root scatter plot of each sample against the average of all samples, followed
by transformation to the original scale. As SAM focuses on estimating the false discovery rate (FDR)
for each cutoff of the score and not on controlling type I error, no well-defined procedure exists for the
selection of a final gene set. Our rule for selecting the cutoff was to minimize the estimated false discovery
rate: the minimum was at 0.47% with 143 genes selected and 0.65% with 105 genes for the rank and cube-root
adjusted data, respectively. The proportion of the genes found by the iterative search that were also
selected by SAM was 47% and 41% for the rank and cube-root adjusted data, respectively. Table 1 also
contains the ranking of the genes selected by our iterative search procedure according to SAM.
Table 1. List of the top 15 genes selected by the iterative search and their order
according to SAM. The ordering within the sets of the iterative search is random
The selection of some genes, e.g. M27783 and M20203, seems to be determined by the adjustment
procedure. These genes have almost no expression in ALL (average cube-root adjusted expression levels
of 522 and 110, respectively) and moderately high levels in AML (average cube-root adjusted expression
levels of 5210 and 4511). However, there are also genes that are ranked very low by SAM regardless
of the adjustment: e.g. M89957, M63438, X82240 and U89922. These genes are specific to one of the
ALL subtypes: individually they separate only T-cell ALL from non-T-cell ALL (e.g. X82240, T-cell leukemia
gene) or B-cell ALL from non-B-cell ALL (e.g. M63438, Glutamine synthase); jointly, however, they also
separate ALL from AML. M89957 and X82240 are ranked 2 and 3 by SAM if genes informative for the
three-way classification into AML, T-cell and B-cell are sought. The multidimensional approach was able
to find them without investing the additional information of the ALL subtypes.
comparison sets of samples were labeled in the reverse orientation, that is in the first three sets Cy-3 was
used to label HCT116 cells while Cy-5 was used for HT29 cells. In the next three sets Cy-5 was used with
HCT116 while HT29 was labeled using Cy-3. Each comparison set was hybridized against two microarray
facing slides containing 4608 minimally redundant cDNAs spotted in duplicate (unpublished data). A rank
adjustment (Tsodikov et al., 2002) was applied for both dyes within each sub-slide and the adjusted values
were averaged across the four dependent replicates obtained from the same original sample (the two sides
of a two-face slide). Thus six independent replicates were obtained for both HT29 and HCT116 cell lines.
Data from an earlier (lower quality) experiment in which each sample was hybridized only to two sides
of one slide (resulting in two dependent replicates instead of four) was similarly adjusted and used as an
independent test set. This set contained eight replicates of each of the two cell lines.
The number of permutation samples was m = 300 in this analysis, and the cluster size k = 5 was used
in algorithm (A4). Using the 99th percentile of the null-distribution as the cutoff, the cross-validation
rate and the distance-based criteria would stop close to each other (at the 57th and 56th sets of size 5,
respectively), while the smoothed test set classification rate drops below the cutoff much earlier, at the
12th set. However, when the 95th percentile was used, the stopping points were much closer to each other
(sets number 57, 63 and 67 for cross-validation, classification and distance, respectively). The extremely
high variability of the test set classification rate is probably responsible for this discrepancy. One possible
reason for that is the imperfect nature of our test data set: it was obtained a year earlier and the expertise
of the Microarray Core Facility that produced the data has greatly increased since that time. In addition,
the test set was formed using only two replicates for each sample compared to four in the training set.
The comparison with univariate procedures was performed similarly to the previous section. For the
first comparison we have sorted the genes according to the value of the corresponding (marginal) t-
statistics and selected the top 56 × 5 = 280 of them, so that the number of genes coincides with that
selected by the multivariate distance-based cutoff criterion. We found that the two lists have only 94
(33%) genes in common.
SAM with rank-adjusted data has the minimum FDR at 0.23% with 283 genes selected. This number is
very close to the 280 genes selected by the distance-based cutoff. These two sets of genes greatly overlap:
260 genes are common in the two lists; the two orderings of the genes also show high agreement with
Kendall’s τ = 0.78. Thus in a data set with a high number of marginally differentially expressed genes
our method gives results very close to SAM. The cube-root adjustment is not applicable to two-color
arrays, thus it was not applied to this data set.
8. DISCUSSION
In this paper, we propose a successive selection procedure designed to identify a set of differentially
expressed genes from microarray data. The three stopping rules explored in Sections 6 and 7 in
conjunction with this procedure appear to produce similar final subsets of genes. The only discrepancy
observed was between the test-set-based selection and the other two rules in the application to actual
data. This discrepancy can be attributed to the poor quality of the test sample available in this study. A
distinct advantage of the distance-based stopping rule is its stability to random fluctuations in the course
of eliminating groups of differentially expressed genes. However, it is always wise to check the distance-
based procedure against its test-set-based counterpart whenever possible.
The proposed procedure is computationally expensive and probably would have been infeasible not so
long ago. The most time-intensive resampling part (A4, step 1) is easily parallelizable, so multi-processor
systems can be utilized to speed up the computation. All the calculations for this paper were performed
on a computer with two 1000 MHz processors and a typical analysis lasted 1–3 hours depending on the
sample size and the exact choice of the parameters. The speed of the random search procedure effectively
limits what can be done by resampling. Considerable room remains for improvement in designing the
random search algorithm. In Algorithm (A4), all genes have equal probability of being selected at each
step of the random search. Other methods can be explored for improving the speed of the random search,
from simple ones, such as penalizing previously rejected genes, to more sophisticated methods, such as
simulated annealing or a genetic algorithm (Li et al., 2001).
In the last step of algorithm (A4), the subsets of size k obtained at each step of the successive selection
procedure are used as ‘building blocks’ for a finally selected set of differentially expressed genes. A
value of k is conventionally chosen to meet the important practical constraint imposed by the number of
available replicates (Section 5). It is natural to expect that the size and composition of the final set of genes
depend on the choice of k: larger values of k typically lead to larger final sets. Since the problem has no
formal solution, the choice of k is left to the investigator.
One of the most obvious deficiencies of the techniques proposed for finding differentially expressed
genes is that they are essentially univariate and are frequently based on the (sometimes implicit)
assumption that expression signals are stochastically independent. Our example in Section 7 shows that
a pertinent univariate method, properly adjusted for multiple testing, and a more general multivariate
method may yield similar results. As the example suggests, this is likely to be true whenever an
abundance of differentially expressed individual genes is an inherent feature of the biological systems
under comparison. In such settings, the virtues of multivariate approaches become less obvious. But
there may be numerous other situations where gene-to-gene interactions are especially important so that
multivariate methodology must be brought to the fore: our analysis of the leukemia data in Section 6
provides a good example.
It should be kept in mind that the ultimate goal of microarray data analysis is not just correct
classification of unknown samples but selection of biologically relevant (however vague the definition
of relevance) genes, although the two problems are closely related. For example, a very good separation
between classes can sometimes be provided by looking at a single gene so that the classification error
rate is difficult to reduce further by including other differentially expressed genes. However, one would
like to keep the chance of missing other interesting genes to a minimum. This is one of the reasons why
more robust methods based on distances between random vectors may perform better than error-based
multivariate methods of gene selection (Li et al., 2001).
REFERENCES
ALIZADEH, A. A., EISEN, M. B., DAVIS, R. E., MA, C., LOSSOS, I. S., ROSENWALD, A., BOLDRICK, J. C.,
SABET, H., HUDSON, J. J. AND LU, L., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified
by gene expression profiling. Nature 403, 503–511.
BEN-DOR, A., BRUHN, L., FRIEDMAN, N., NACHMAN, I., SCHUMMER, M. AND YAKHINI, Z. (2000). Tissue
classification with gene expression profiles. Journal of Computational Biology 7, 559–584.
BEN-DOR, A., FRIEDMAN, N. AND YAKHINI, Z. (2001). Scoring genes for relevance. Technical Report. Agilent
Laboratories.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. AND STONE, C. J. (1984). Classification and Regression Trees.
Monterey, CA: Wadsworth and Brooks/Cole Advanced Books and Software.
CHILINGARYAN, A., GEVORGYAN, N., VARDANYAN, A., JONES, D. AND SZABO, A. (2002). Multivariate approach
for selecting sets of differentially expressed genes. Mathematical Biosciences 176, 59–69.
CHU, G., NARASIMHAN, B., TIBSHIRANI, R. AND TUSHER, V. (2002). SAM ‘Significance Analysis of
Microarrays’, Users guide and technical document. Stanford University, http://www-stat.stanford.edu/~tibs/SAM/,
1.21 edition.
DUDOIT, S., FRIDLYAND, J. AND SPEED, T. P. (2000). Comparison of discrimination methods for the classification
of tumors using gene expression data. Technical Report, 576. Berkeley, CA: University of California.
FUKUNAGA, K. (1990). Introduction to Statistical Pattern Recognition, 2nd edition. London: Academic.
GANESHANANDAM, S. AND KRZANOWSKI, W. (1989). On selecting variables and assessing their performance in
linear discriminant analysis. Australian Journal of Statistics 32, 443–447.
GOLUB, T. R., SLONIM, D. K., TAMAYO, P., HUARD, C., GAASENBEEK, M., MESIROV, J., COLLER, H., LOH,
M. L., DOWNING, J. R., CALIGIURI, M. A., BLOOMFIELD, C. D. AND LANDER, E. S. (1999). Molecular
classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
HASTIE, T., TIBSHIRANI, R., EISEN, M. B., ALIZADEH, A., LEVY, R., STAUDT, L., CHAN, W. C., BOTSTEIN,
D. AND BROWN, P. (2000). ‘Gene shaving’ as a method for identifying distinct sets of genes with similar
expression patterns. Genome Biology 1, 0002.1–0003.21.
KERR, M. K. AND CHURCHILL, G. A. (2001). Experimental design for gene expression microarrays. Biostatistics
2, 183–201.
KERR, M. K., MARTIN, M. AND CHURCHILL, G. A. (2000). Analysis of variance for gene expression microarray
data. Journal of Computational Biology 7, 819–837.
KHAN, J., SIMON, R., BITTNER, M., CHEN, Y., LEIGHTON, S. B., POHIDA, T., SMITH, P. D., JIANG,
Y., GOODEN, G. C., TRENT, J. M. AND MELTZER, P. S. (1998). Gene expression profiling of alveolar
rhabdomyosarcoma with cDNA microarrays. Cancer Research 58, 5009–5013.
LI, L., DARDEN, T. A., WEINBERG, C. R., LEVINE, A. J. AND PEDERSEN, L. G. (2001). Gene assessment
and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method.
Combinatorial Chemistry & High Throughput Screening 4, 727–739.
NEWTON, M. A., KENDZIORSKI, C. M., RICHMOND, C. S., BLATTNER, F. R. AND TSUI, K. W. (2000). On
differential variability of expression ratios: improving statistical inference about gene expression changes from
microarray data. Journal of Computational Biology 8, 37–52.
ROBERTSON, T., WRIGHT, F. T. AND DYKSTRA, R. L. (1988). Order Restricted Statistical Inference. London:
Wiley.
ROSS, D. T., SCHERF, U., EISEN, M. B., PEROU, C. M., REES, C., SPELLMAN, P., IYER, V., JEFFREY, S. S.,
VAN DE RIJN, M., WALTHAM, M., PERGAMENSCHIKOV, A., LEE, J. C. F., LASHKARI, D., SHALON, D.,
MYERS, T. G., WEINSTEIN, J. N., BOTSTEIN, D. AND BROWN, P. O. (2000). Systematic variation in gene
expression patterns in human cancer cell lines. Nature Genetics 24, 227–235.
SZABO, A., BOUCHER, K., CARROLL, W., KLEBANOV, L., TSODIKOV, A. AND YAKOVLEV, A. (2002). Variable
selection and pattern recognition with gene expression data generated by the microarray technology. Mathematical
Biosciences 176, 71–98.
TSODIKOV, A., SZABO, A. AND JONES, D. (2002). Adjustments and measures of differential expression for
microarray data. Bioinformatics 18, 251–260.
TUSHER, V. G., TIBSHIRANI, R. AND CHU, G. (2001). Significance analysis of microarrays applied to the ionizing
radiation response. Proceedings of the National Academy of Sciences, USA 98, 5116–5121.
VAN DER LAAN, M. J. AND BRYAN, J. F. (2001). Gene expression analysis with the parametric bootstrap.
Biostatistics 2, 445–461.
ZINGER, A. A., KLEBANOV, L. B. AND KAKOSYAN, A. V. (1989). Stability Problems for Stochastic Models,
chapter Characterization of distributions by mean values of statistics in connection with some probability metrics.
Moscow, VNIISI: pp. 47–55.
[Received December 28, 2001; first revision August 6, 2002; second revision December 18, 2002;
accepted for publication February 5, 2003]