Biostatistics (2003), 4, 4, pp. 555–567
Printed in Great Britain

Multivariate exploratory tools for microarray data analysis
ANIKO SZABO† , KENNETH BOUCHER, DAVID JONES, ALEXANDER D. TSODIKOV
Huntsman Cancer Institute and Department of Oncological Sciences, University of Utah, 2000
Circle of Hope, Salt Lake City, UT 84112-5550, USA
[email protected]
LEV B. KLEBANOV
Department of Probability and Statistics, Charles University, Sokolovska 83, Praha-8, CZ-18675,
Czech Republic
ANDREI Y. YAKOVLEV
Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood
Avenue, Box 630 Rochester, NY 14642, USA and Huntsman Cancer Institute and Department of
Oncological Sciences, University of Utah, 2000 Circle of Hope, Salt Lake City, UT 84112-5550, USA

SUMMARY
The ultimate success of microarray technology in basic and applied biological sciences depends
critically on the development of statistical methods for gene expression data analysis. The most widely
used tests for differential expression of genes are essentially univariate. Such tests disregard the
multidimensional structure of microarray data. Multivariate methods are needed to utilize the information
hidden in gene interactions and hence to provide more powerful and biologically meaningful methods
for finding subsets of differentially expressed genes. The objective of this paper is to develop methods
of multidimensional search for biologically significant genes, considering expression signals as mutually
dependent random variables. To attain these ends, we consider the utility of a pertinent distance between
random vectors and its empirical counterpart constructed from gene expression data. The distance
furnishes exploratory procedures aimed at finding a target subset of differentially expressed genes. To
determine the size of the target subset, we resort to successive elimination of smaller subsets resulting
from each step of a random search algorithm based on maximization of the proposed distance. Different
stopping rules associated with this procedure are evaluated. The usefulness of the proposed approach is
illustrated with an application to the analysis of two sets of gene expression data.

Keywords: Cross-validation; Differential expression; Permutation test; Probability distance; Random search; Sets of
genes.

† To whom correspondence should be addressed.
Biostatistics 4(4) © Oxford University Press; all rights reserved.

1. INTRODUCTION
With its potential to quantitatively measure expression levels of a large number of genes in parallel,
microarray technology holds the promise of becoming an extremely valuable tool in basic biological
sciences and clinical diagnostics. However, the ultimate usefulness of the technology will depend critically
on whether or not the search for efficient statistical methods meets with success. One of the most popular
uses of microarray data is the identification of those genes that may be responsible for differences
between cell types (genotyping cell lines) or functional states of a cell. The genetic profile of a tissue
determines its properties, which is why different tissues are expected to have different gene expression
patterns. As demonstrated by various clustering methods of gene expression vectors (Khan et al., 1998),
tissues of the same histological origin tend to cluster together. Ross et al. (2000) came to a similar
conclusion when clustering 60 NCI cell lines. Alizadeh et al. (2000) used hierarchical clustering to
demonstrate the existence of two previously unknown genetically different subtypes of B-cell lymphoma
that carry significantly different prognosis in terms of patients’ survival. A closely related application is
the identification of subtypes of known diseases. This includes both improving the existing classification
into known classes and the discovery of new/unknown sub-classes that are clinically significant.
While clustering techniques are useful in providing insights into interactions between different genes
or finding genetically similar patterns, these techniques do not answer the question many researchers are
interested in: which genes are expressed differently in the tissues under comparison? One typical example
of such a setting is represented by the well-known leukemia data set (Golub et al., 1999) that includes
27 ALL (acute lymphoblastic leukemia) and 11 AML (acute myeloid leukemia) samples processed using
Affymetrix oligonucleotide microarrays. A test set of 34 samples is also available. The problem in question
is to find those genes that are responsible for the distinction between the two types of leukemia.
Multiple methods for selection of differentially expressed genes have been proposed, from using
a fixed cutoff for the ratio to using various parametric and non-parametric measures of differential
expression (Kerr and Churchill, 2001; Kerr et al., 2000; Newton et al., 2000; van der Laan and Bryan,
2001; Ben-Dor et al., 2000, 2001). A characteristic feature of these methods is the univariate nature of the
decision to include a particular gene in the target set. The well-known complexity of interactions between
gene functions in a cell strongly suggests that a search for methods that utilize multivariate information on
gene expression signals is warranted. In a recent paper (Szabo et al., 2002), we introduced a multivariate
method for finding a subset of differentially expressed genes of a given size. In this paper, we further
develop this methodology to approach the problem of determining the size of a target set of differentially
expressed genes.
The first step for developing multivariate methods is to define a measure of distance between two
tissue types that is based on a set of genes; in Section 3 we review a novel measure proposed by Szabo
et al. (2002) and contrast it with classical alternatives. Once a measure of differential expression of
gene sets is selected, sets of genes that exhibit an ‘unusually large’ distance will be considered to be
differentially expressed. Thus one needs a method to find the gene-sets for which the distance is large
and a method to determine whether this distance is larger than expected by chance. The random search
methodology designed for finding the set with the maximal distance is described in Section 4 and a
resampling based approach for finding a cutoff for ‘significant’ differential expression is developed in
Section 5. The performance of the method is demonstrated via computer simulations and in Section 6 we
re-analyze our motivating data that is described in detail in the next section. Finally, Section 7 presents an
example where large marginal differences dominate the data and thus univariate approaches are expected
to perform well.

2. MOTIVATING EXAMPLE: INHOMOGENEOUS DATA SET

In many applications the groups to be separated are heterogeneous, whether it is known to the
investigator or not. In such cases the goal of the investigator is twofold: find genes that explain the
separation of the data into the known categories and also discover the unknown sub-categories. While the
second objective might appear unrelated to the first, a simple hypothetical example presented in Figure 1
shows how knowledge of subgroups can enhance classification. In this example the ‘white’ group has two
subgroups each of which is similar to the ‘gray’ group in some respect: one has the same expression of
Gene A, the other has the same expression of Gene B, thus neither gene separates the ‘white’ group from
the ‘gray’ one by itself. However, jointly they can not only separate the two groups, but also indicate the
two ‘white’ subgroups.

Fig. 1. An artificial example of heterogeneous groups (axes: Gene A and Gene B). Gray and white circles represent samples from the two groups,
respectively.

The case in point is the leukemia data set of Golub et al. (1999) described in the Introduction. The ALL
group is a mixture of two clinically recognizably different T-cell and B-cell subtypes. When analyzing
this data set, the authors selected those genes that are individually highly correlated with the known
classification and then used a voting procedure for classification of new samples. These techniques are
essentially univariate. The authors produced a list of candidate genes to which the difference between
the two types of leukemia can be attributed. Our recent analysis (Szabo et al., 2002) allowing for a more
general multivariate data structure has resulted in a dissimilar list of ten differentially expressed genes
for the same data set. Some of these genes were T-cell specific, thus they segregate only a part of the
inhomogeneous ALL group; they were found due to the multivariate approach. Another example discussed
in the present paper shows that the multivariate and univariate approaches may produce quite similar
results in some settings. The method used by Szabo et al. (2002) is described in Sections 3 and 4; briefly,
it attempts to find the gene-set of a pre-specified size that best separates two tissues. Two shortcomings of
that method are that the number of the selected genes has to be specified in advance and no assessment of
statistical significance was proposed. Thus while the selected genes ‘made sense’ and provided reasonable
test-set classification rates, it is unclear how many other genes are differentially expressed. The present
paper offers a solution to this problem.

3. QUANTITATIVE MEASURES OF DIFFERENTIAL EXPRESSION

The set of microarray data on p distinct genes represents a random vector $X = (X_1, \ldots, X_p)$
with mutually dependent components. The dimension of X is extremely high relative to the number
of observations (replicates of experiments). This unusual feature of microarray data prevents variable
selection based on conventional discriminant analysis techniques using X as the feature vector. Therefore,
one needs to limit the analysis of gene expression to subvectors of much lower dimension than that of X.
In addition to this ‘curse of dimensionality’, the problem of microarray data analysis is complicated by
the presence of experimental noise in the data. There are many sources of the observed noise so that some
of the errors are likely to be additive (background), some others are multiplicative (dye incorporation,
fluorescence efficiency, spot size), while saturation effects have a non-linear form. This hampers the use
of statistical models with an explicitly specified noise structure for data adjustment (or normalization).
Another method of noise reduction in microarray data consists in data categorization (see Tsodikov et al.,
2002; Szabo et al., 2002; Chilingaryan et al., 2002 for discussion). One way of doing this is to replace the
raw expression measurements with their fractional rank within each slide; this idea will be employed in
Sections 6 and 7. Briefly, let $X_{ij}$ represent measurements of fluorescent intensity for gene $i = 1, \ldots, p$ on
slide $j = 1, \ldots, n$. Then the fractional rank of gene i is defined as $X^{(r)}_{ij} = (\mathrm{rank}_j X_{ij})/p$, where $\mathrm{rank}_j u_{kj}$
is the place (counted from the left) of $u_{kj}$ in the sequence $u_{ij}$, $i = 1, \ldots, p$, arranged in decreasing
order for each j. Notwithstanding inevitable loss of information (see Szabo et al., 2002, for more
discussion), categorical adjustments have proven to be very useful in the analysis of differential expression
of individual genes (Tsodikov et al., 2002) and sets of genes (Szabo et al., 2002; Chilingaryan et al., 2002).
For comparison, we also used the cube-root adjustment recommended for use with Affymetrix data with
SAM (Tusher et al., 2001).
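
The fractional-rank adjustment described above can be illustrated with a short Python sketch. The function name `fractional_ranks` and the genes-by-slides array layout are assumptions made for illustration; ties are broken by position, a detail the text does not specify.

```python
import numpy as np

def fractional_ranks(X):
    """Replace each value by its rank within its slide (column), divided by p.

    X is assumed to be a (p genes) x (n slides) array. Ranks are taken in
    decreasing order, so the largest value on a slide gets rank 1/p.
    """
    p, n = X.shape
    # Row indices of each column sorted from largest to smallest value.
    order = np.argsort(-X, axis=0, kind="stable")
    ranks = np.empty_like(order)
    # The gene sitting at sorted position r (0-based) receives rank r + 1.
    ranks[order, np.arange(n)] = np.arange(1, p + 1)[:, None]
    return ranks / p
```

Because only the within-slide ordering is used, any monotone distortion of the intensities on a slide leaves the adjusted values unchanged.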

3.1 A distance between vectors of expression signals

To compare expression signals in two different tissues (or states) we need a pertinent distance between
two random vectors. This distance is expected to satisfy the following requirements: (1) its empirical
counterpart should allow for combining information from different slides; (2) it should accommodate
ranks and categorical data (thus should not necessarily assume normality); (3) its estimate should be stable
to random fluctuations and numerical errors; (4) its computation should not be too time consuming. Szabo
et al. (2002) proposed a new distance and its nonparametric estimate to measure differential expression
for sets of genes. This distance meets the above requirements. Let $\mu$ and $\nu$ be two probability measures
defined on the Euclidean space $R^d$. Let $L(x, y)$ be a real-valued measurable function, and introduce the
following expression:

\[ N(\mu, \nu) = 2 \int_{R^d}\int_{R^d} L(x, y)\, d\mu(x)\, d\nu(y) - \int_{R^d}\int_{R^d} L(x, y)\, d\mu(x)\, d\mu(y) - \int_{R^d}\int_{R^d} L(x, y)\, d\nu(x)\, d\nu(y). \qquad (3.1) \]

It can be shown (Zinger et al., 1989) that $\sqrt{N(\mu, \nu)}$ is a metric in the space of all probability measures
on $R^d$ if and only if the kernel $L(x, y)$ is strictly negative definite, that is $\sum_{i,j=1}^{s} L(x_i, x_j) h_i h_j \le 0$ for
any $x_1, \ldots, x_s$ and $h_1, \ldots, h_s$ with $\sum_{i=1}^{s} h_i = 0$, with equality if and only if all $h_i = 0$. Therefore, to obtain
a universally applicable kernel (not dependent on the unknown measures $\mu$ and $\nu$), L has to be strictly
negative definite.
Consider two independent samples, consisting of $n_1$ and $n_2$ observations respectively, represented by
the d-dimensional vectors $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$, and introduce an empirical counterpart of $N(\mu, \nu)$
as follows:

\[ \hat{N} = N(\hat{\mu}_{n_1}, \hat{\nu}_{n_2}) = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} 2 L(x_i, y_j) - \frac{1}{n_1^2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_1} L(x_i, x_j) - \frac{1}{n_2^2} \sum_{i=1}^{n_2} \sum_{j=1}^{n_2} L(y_i, y_j). \qquad (3.2) \]

When using the distance $\sqrt{N(\mu, \nu)}$ one needs to choose a pertinent (strictly negative definite) kernel
L. Szabo et al. (2002) discussed techniques for constructing such kernels in detail; in this paper we use
only the Euclidean kernel $L(x, y) = \sqrt{\sum_{g \in S} (x_g - y_g)^2}$, where S denotes the set of the genes considered.
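
As a minimal sketch of how the empirical distance (3.2) might be computed with the Euclidean kernel, consider the Python function below. The names `mean_kernel` and `n_hat` and the samples-by-genes layout are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def mean_kernel(a, b):
    """Average Euclidean distance L(a_i, b_j) over all pairs (i, j)."""
    diff = a[:, None, :] - b[None, :, :]          # shape (len(a), len(b), d)
    return np.sqrt((diff ** 2).sum(axis=2)).mean()

def n_hat(x, y):
    """Empirical distance (3.2) between samples x (n1 x d) and y (n2 x d).

    The columns of x and y are assumed to be already restricted to the
    gene set S under consideration.
    """
    return 2 * mean_kernel(x, y) - mean_kernel(x, x) - mean_kernel(y, y)
```

Each double sum in (3.2), divided by its normalizing constant, is simply the mean of the kernel over all pairs, including the zero terms with i = j in the two within-sample sums.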

3.2 Comparison with other distance measures

Mahalanobis distance. The estimate of the distance $\sqrt{N}$ is nonparametric and does not involve
numerically unstable high-dimensional parts, thus it is expected to be numerically stable even for small
sample sizes. A commonly used parametric measure of separation of two samples is the Mahalanobis
distance

\[ R^2_{\mathrm{Mah}} = (\bar{x} - \bar{y})^{\top} \left( \frac{\Sigma_x + \Sigma_y}{2} \right)^{-1} (\bar{x} - \bar{y}), \qquad (3.3) \]

where $\bar{x}$ and $\bar{y}$ are the sample means and $\Sigma_x$, $\Sigma_y$ are the two sample variance–covariance matrices.
We used simulation to compare the stability of the Mahalanobis distance and N when estimating the
distance between two p-variate normal distributions with means (0 . . . 0) and (1 . . . 1) and exchangeable
correlation ρ based on two samples of size n. The coefficient of variation of N is smaller than that of the
Mahalanobis distance for all tested values of p = 2, . . . , 10, n = p + 1, . . . , 100 and ρ = 0, 0.3, 0.8.
Typical sampling distributions are shown in Figure 2. We also found that both estimators are biased
upwards, however the bias of the Mahalanobis distance is much larger.

Fig. 2. The sampling distribution (density) of the Mahalanobis distance and N between two multivariate normal distributions
( p = 4, n = 5, ρ = 0.3, see text). ‘N’ and ‘M’ denote the true value of the distances for distance N and the
Mahalanobis distance, respectively.
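
For readers who want to reproduce this kind of comparison, the Mahalanobis distance (3.3) could be computed along the following lines. This is a sketch only: the function name `mahalanobis_sq` is hypothetical, and the use of the unbiased sample covariance (`np.cov` default) is an assumption, since the text does not state which normalization was used.

```python
import numpy as np

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance (3.3) between samples x and y.

    x and y are (samples) x (genes) arrays; the covariance matrices are
    the unbiased sample estimates (an assumption of this sketch).
    """
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled = (np.cov(x, rowvar=False) + np.cov(y, rowvar=False)) / 2
    # Solve a linear system rather than forming the explicit inverse.
    return float(diff @ np.linalg.solve(pooled, diff))
```

The linear solve fails or becomes ill-conditioned when the number of samples is not comfortably larger than the dimension, which is one way to see the instability discussed above.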

Distance to the nearest neighbor. The following example demonstrates the superior stability of the
distance $\sqrt{N}$ as compared to another non-parametric distance measure based on nearest neighbors.
We consider classifying samples into one of two bivariate normal distributions
$N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} \right)$ and
$N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix} \right)$.
Note that any distance measure based on the sample mean, as the Mahalanobis
distance, will fail in this situation. Thus we compared N with the nearest-neighbor distance and with the
‘true state’ based on the density functions. We generated 300 data points from both distributions, denoting
these samples D1 and D2 respectively, and 2500 test points forming a uniform 50 × 50 grid. For each
test point we calculated the distance to D1 and D2 and classified the point into the ‘closest’ group; as a
measure of the certainty of this decision we considered the difference between the two distances. Plots of
these differences are shown in Figure 3. Comparing panels (a) and (b) to the difference of the true densities
shown in panel (c), we see that the distance N produces results quite close to the fully informed method,
without actually assuming normality. The nearest-neighbor approach is very unstable, as demonstrated by
the roughness of the surface, and provides highly certain classification far from the actual support of the
distributions (shown as two thick contours).
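
The point-to-sample comparison in this example can be sketched as follows. The sketch assumes that the distance from a test point to a sample is obtained by treating the point as a one-observation sample in (3.2); the function names are hypothetical.

```python
import numpy as np

def mean_dist(a, b):
    """Average Euclidean distance over all pairs of rows of a and b."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)).mean()

def n_distance(point, sample):
    """sqrt(N-hat) between a single point and a sample (assumed construction)."""
    t = np.atleast_2d(point)
    n = 2 * mean_dist(t, sample) - mean_dist(sample, sample)  # L(t, t) = 0
    return np.sqrt(max(n, 0.0))

def nn_distance(point, sample):
    """Euclidean distance from the point to its nearest neighbor in the sample."""
    return np.sqrt(((sample - point) ** 2).sum(axis=1)).min()

def certainty(point, d1, d2, distance):
    """Positive values classify the point into D1, negative into D2."""
    return distance(point, d2) - distance(point, d1)
```

Evaluating `certainty` over a grid of test points with each of the two distance functions gives surfaces of the kind compared in panels (a) and (b) of Figure 3.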

Fig. 3. (a) Difference between distances to D2 and D1 for the proposed distance N ; (b) difference between the
distances to nearest neighbor from D2 and D1 ; (c) difference of the ‘true’ density functions. Positive values (that is
points classified into D1 ) are shaded gray. The thick contour lines are the projections of the 95% ellipsoid quantiles
of D1 and D2 .

4. RANDOM SEARCH FOR THE TARGET SUBSET OF A GIVEN SIZE

Suppose two samples of microarray data are available, each sample having been drawn from a distinct
class of biological specimens such as tissues of two different types. Once a multivariate distance between
expression signals has been selected, it can be employed in a search for differentially expressed genes
with the target subset of genes being defined as a subset for which the distance between the two classes
attains its maximum. Ideally, all subsets of a given size should be evaluated in terms of the adopted
distance and the one that provides a maximum should be chosen. However, the number of possible subsets
exponentially increases with the total number of genes, and subsequently the exhaustive search procedures
as well as the branch-and-bound method (Fukunaga, 1990) become computationally prohibitive. In such
a situation, stepwise procedures seem to be an indispensable aid to variable selection. For all practical
purposes, the issue of computational complexity can be resolved by applying random search methodology.
Random search can be designed in a number of various ways. A simple algorithm for finding a subset
of predetermined size k with the largest distance between two classes (tissues) (Szabo et al., 2002;
Chilingaryan et al., 2002) is presented in box (A1) and used for data analysis in Sections 6 and 7. In
our study, the number M was set at 100 000.

Random search (A1)
1. Randomly select k genes to form the initial approximation; calculate the distance between
the two classes for this subset (cluster) of genes.
2. Replace at random one gene from the current cluster by a gene from outside the cluster;
calculate the distance for this new cluster.
3. If the distance for the new cluster is larger than for the original cluster (improvement), keep
the change, otherwise revert to the previous cluster.
4. Repeat the process until a predetermined number, M, of steps is reached.
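
The four steps of box (A1) can be sketched in Python as below. This is an illustrative reading of the box, not the authors' implementation: the names `n_hat` and `random_search`, the samples-by-genes layout and the use of NumPy's random generator are all assumptions.

```python
import numpy as np

def n_hat(x, y):
    """Empirical distance (3.2) with the Euclidean kernel."""
    def mean_dist(a, b):
        return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)).mean()
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def random_search(x, y, k, n_steps, rng):
    """Random search (A1) for a k-gene subset with a large distance.

    x, y: (samples) x (genes) arrays for the two classes; n_steps plays
    the role of M in the text.
    """
    p = x.shape[1]
    # Step 1: random initial cluster of k genes.
    current = rng.choice(p, size=k, replace=False)
    best = n_hat(x[:, current], y[:, current])
    for _ in range(n_steps):                        # Step 4: repeat M times.
        # Step 2: swap one randomly chosen member for a random outside gene.
        candidate = current.copy()
        outside = np.setdiff1d(np.arange(p), current)
        candidate[rng.integers(k)] = rng.choice(outside)
        dist = n_hat(x[:, candidate], y[:, candidate])
        # Step 3: keep the change only if the distance improved.
        if dist > best:
            current, best = candidate, dist
    return sorted(current.tolist()), best
```

A call such as `random_search(x, y, k=5, n_steps=100_000, rng=np.random.default_rng(0))` would mirror the setting M = 100 000 used in the study, at a correspondingly higher computational cost.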

When selecting a subset of genes to provide the best discrimination between two classes, it is easy
to come up with over-optimistic conclusions as a result of over-fitting, that is, finding overly specific
patterns that do not extend to new samples. Whenever a small number of variables is selected from a large
set, one should expect a selection bias associated with choosing the optimal of a large number of subsets,
regardless of the criterion used. Cross-validation techniques provide a powerful tool for reducing this
selection bias. Ganeshanandam and Krzanowski (1989) suggested that cross-validation should precede the
variable selection itself. Building on this idea and the well-known methodology of v-fold cross-validation
(see, for example, Breiman et al., 1984), we use a ‘cross-validated search’ procedure that checks for
reproducibility of its results. The basic structure of the algorithm is presented in box (A2).

Cross-validated search for differentially expressed genes (A2)

1. Randomly divide the data into v groups of nearly equal size.
2. Drop one of the parts and find the optimal (in accordance with the chosen criterion) subset
of genes using only the data from v − 1 groups. Algorithm (A1) can be used, for example.
3. Repeat step 2 in succession for each of the groups, obtaining v ‘optimal’ sets.
4. Combine these sets by selecting the genes with the highest frequencies of occurrence.
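
One possible reading of box (A2) is sketched below. Several details are assumptions of the sketch rather than statements from the text: each class is split into v folds separately, the k most frequent genes are retained, and the `search` argument is a placeholder for any subset-selection routine. The helper `top_mean_difference` is a deliberately simple stand-in used only to keep the example self-contained; in practice a routine such as (A1) would take its place.

```python
from collections import Counter

import numpy as np

def cross_validated_search(x, y, k, v, search, rng):
    """v-fold cross-validated search (A2).

    search(x_part, y_part, k) must return a list of k gene indices.
    """
    # Step 1: random division of each class into v groups of similar size.
    folds_x = np.array_split(rng.permutation(len(x)), v)
    folds_y = np.array_split(rng.permutation(len(y)), v)
    counts = Counter()
    # Steps 2-3: drop each group in turn and search on the remaining data.
    for fx, fy in zip(folds_x, folds_y):
        keep_x = np.setdiff1d(np.arange(len(x)), fx)
        keep_y = np.setdiff1d(np.arange(len(y)), fy)
        counts.update(search(x[keep_x], y[keep_y], k))
    # Step 4: keep the genes that occur most often among the v optimal sets.
    return [gene for gene, _ in counts.most_common(k)]

def top_mean_difference(x, y, k):
    """Hypothetical stand-in search: top-k genes by absolute mean difference."""
    score = np.abs(x.mean(axis=0) - y.mean(axis=0))
    return np.argsort(-score)[:k].tolist()
```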

The performance of this algorithm was evaluated by Szabo et al. (2002). An alternative idea
(Chilingaryan et al., 2002) is to search for multiple local maxima by stopping the random search (A1)
before it could find the global maximum; in practice, it can be done by selecting a relatively small
M = 500 or 1000. Then the genes are listed in the order of the frequency of their occurrence in these
suboptimal sets.

5. FORMING LARGER TARGET SUBSETS OF GENES

Technically, the estimate N̂ can be computed for vectors of any dimension. However, it is natural
to limit the dimension of the sought-for subset of genes by the number of available training samples
(microarray slides). The reason for this limitation is that the variance of N̂ increases with dimension
of the vectors under comparison. In many settings, the above-mentioned restriction is difficult to meet,
because the number of differentially expressed genes is expected to be quite large, as in the example
discussed in Section 7.
Consider a typical study of differences in genetic profiles of two tissues or cell types. Since algorithm
(A2) described in Section 4 is aimed at finding a subset of differentially expressed genes of a given size, it
would be desirable to find a way of extending the set of separating genes, while maintaining all the benefits
of the multivariate approach to the problem. We propose a successive selection procedure that eliminates
groups of genes from the data after each run of the search algorithm and keeps doing so until no more
subsets of differentially expressed genes can be found. Then the removed genes are the ones responsible
for the difference between the tissues, so we declare all of them to be differentially expressed. The idea
of discarding ‘interesting’ genes obtained at each step of a selection procedure and then repeating the
selection procedure to find additional genes is somewhat similar to the approach by Hastie et al. (2000).
To formulate a stopping rule, one needs to determine the properties of an optimal set of genes in
a ‘no-difference’ data set. Since our optimality criterion is based on the multivariate distance (3.1), the
covariance structure of the particular data is expected to influence the selection process. Thus the ‘no-
difference’ baseline data have to be generated in a way that preserves this structure as closely as possible.
In box (A3) we describe an algorithm that achieves this goal: the first step of the algorithm ensures that
the two hypothetical tissues have the same marginal means, and the second step mimics
the biological variability through permutation. We denote the adjusted fluorescence level for gene i,
$i = 1, \ldots, p$, in the two tissues by $X_{ij}$, $j = 1, \ldots, n_1$, and $Y_{ij}$, $j = 1, \ldots, n_2$, respectively. The average
adjusted expression levels for each gene are $\bar{X}_{i\cdot} = \sum_{j=1}^{n_1} X_{ij}/n_1$ and $\bar{Y}_{i\cdot} = \sum_{j=1}^{n_2} Y_{ij}/n_2$.

Resampling (A3)
1. For each gene i, i = 1, . . . , p, shift the values from the tissues so they are centered at the
overall mean for this gene, that is

\[ X^*_{ij} = X_{ij} - \bar{X}_{i\cdot} + \frac{n_1 \bar{X}_{i\cdot} + n_2 \bar{Y}_{i\cdot}}{n_1 + n_2}, \qquad Y^*_{ij} = Y_{ij} - \bar{Y}_{i\cdot} + \frac{n_1 \bar{X}_{i\cdot} + n_2 \bar{Y}_{i\cdot}}{n_1 + n_2}. \]

2. Randomly permute the resulting $n_1 + n_2$ vectors. The first $n_1$ and the last $n_2$ vectors provide
a random sample from the null-distribution.

Based on the permutation resampling scheme (A3), the null-distributions of various quantitative
characteristics of the optimal gene-set can be estimated. In particular, we will focus on the associated
distance, leave-one-out cross-validated classification rate and test set classification rate. In applications,
the test set classification rate is probably the gold standard. However, due to the scarcity of samples an
appropriate test set is often difficult to provide. We conducted a simulation study and demonstrated that
the between-tissue distance associated with gene sets is a good and stable proxy for the classification rate.
A description of the simulations can be found in the Supplemental Material on the journal’s website.
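
The two steps of the resampling scheme (A3) can be sketched as follows. The function name `null_sample` and the slides-by-genes layout are assumptions made for this illustration.

```python
import numpy as np

def null_sample(x, y, rng):
    """Draw one 'no-difference' sample following (A3).

    x: (n1 slides) x (p genes), y: (n2 slides) x (p genes).
    """
    n1, n2 = len(x), len(y)
    # Step 1: center both tissues at the overall mean of each gene.
    overall = (n1 * x.mean(axis=0) + n2 * y.mean(axis=0)) / (n1 + n2)
    pooled = np.vstack([x - x.mean(axis=0) + overall,
                        y - y.mean(axis=0) + overall])
    # Step 2: permute whole slides, so each slide keeps its gene-gene structure.
    perm = rng.permutation(n1 + n2)
    return pooled[perm[:n1]], pooled[perm[n1:]]
```

Permuting entire slides rather than individual values is what preserves the covariance structure between genes in the baseline data.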
In box (A4) we describe a distance-based procedure for selecting a subset of genes that are
differentially expressed in two tissues; the successive selection procedures based on cross-validation or
test sample classification rates have been designed along similar lines. The procedure requires the size k
of the building-block clusters and significance level α as inputs.

Main algorithm: Successive selection of subsets of genes (A4)

1. Form (using e.g. (A3)) m independent permutation samples of sizes $n_1$ and $n_2$, respectively,
from $n_1 + n_2$ observations (slides). For each of the m permutation samples find (using
e.g. (A2)) an optimal k-element set for which the associated distance attains its maximum.
Estimate from the permutation samples the top αth percentile $D_\alpha$ of the baseline distribution
of the optimal distance (referred to as the null-distribution).
2. Returning to the original two-sample setting, find (using e.g. (A1)) the k-element optimal
set of genes and denote it by $G_1$. If the associated distance $D(G_1) > D_\alpha$, then continue,
otherwise declare no differentially expressed genes.
3. In the $\ell$th iteration, discard sets $G_1, \ldots, G_{\ell-1}$ and find the k-element optimal set $G_\ell$ from
the remaining genes. If the associated distance $D(G_\ell) > D_\alpha$, then continue with this step
(next iteration), otherwise proceed to step 4.
4. The union $\cup_{j=1}^{\ell-1} G_j$ defines the set of those genes that are differentially expressed in the two
tissues.
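
A compact sketch of the successive selection procedure (A4) is given below. To keep the example small and deterministic, the optimal k-element set is found by exhaustive enumeration, which is a stand-in for (A1) or (A2) and is feasible only for a handful of genes; all function names, and the use of `np.quantile` for the percentile estimate, are assumptions of the sketch.

```python
import itertools

import numpy as np

def n_hat(x, y):
    """Empirical distance (3.2) with the Euclidean kernel."""
    def mean_dist(a, b):
        return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)).mean()
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def best_subset(x, y, genes, k):
    """Stand-in for (A1)/(A2): exhaustive search over k-subsets of `genes`."""
    scored = ((n_hat(x[:, list(s)], y[:, list(s)]), s)
              for s in itertools.combinations(genes, k))
    dist, subset = max(scored)
    return list(subset), dist

def null_sample(x, y, rng):
    """Resampling scheme (A3): center each gene, then permute slides."""
    n1, n2 = len(x), len(y)
    overall = (n1 * x.mean(axis=0) + n2 * y.mean(axis=0)) / (n1 + n2)
    pooled = np.vstack([x - x.mean(axis=0) + overall,
                        y - y.mean(axis=0) + overall])
    perm = rng.permutation(n1 + n2)
    return pooled[perm[:n1]], pooled[perm[n1:]]

def successive_selection(x, y, k, alpha, m, rng):
    """Successive selection (A4); returns the selected genes and D_alpha."""
    genes = list(range(x.shape[1]))
    # Step 1: null distribution of the optimal distance and its cutoff.
    null = [best_subset(*null_sample(x, y, rng), genes, k)[1] for _ in range(m)]
    d_alpha = np.quantile(null, 1 - alpha)
    selected = []
    # Steps 2-3: remove optimal sets while their distance exceeds the cutoff.
    while len(genes) >= k:
        subset, dist = best_subset(x, y, genes, k)
        if dist <= d_alpha:
            break
        selected.extend(subset)
        genes = [g for g in genes if g not in subset]
    # Step 4: the union of the removed sets.
    return selected, d_alpha
```

Note that the cutoff is computed once from the full gene set and then reused in every iteration, as in step 1 of the box.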

6. APPLICATION TO LEUKEMIA DATA

Finally, we return to our motivating example, the leukemia data set of Golub et al. (1999). While we
consider the ALL vs AML comparison, we know that the ALL group is not homogeneous—it is a mixture
of the clinically recognizably different T-cell and B-cell subtypes. In this data set this sub-classification is
known for each sample, however a similar situation with inhomogeneous disease with unknown subtypes
can easily be imagined. Thus in the ALL vs AML comparison the difference between the two groups has
a more complicated structure than marginal difference.
With the rank-adjusted leukemia data we carried out successive searches for five-member optimal gene
sets (k = 5) using the 10-fold (v = 10) cross-validated search algorithm (A2). The Euclidean distance
was chosen for the kernel L(x, y) in the distance measure (3.1). For each of the successive optimal sets Gi ,
we recorded the corresponding optimal distance (denoted by Dist) and estimated the tissue classification
rate using both leave-one-out cross-validation based on the selected gene set (CV) and the independent
test set (Class). The results are presented in Figure 4.

Fig. 4. Properties of the subsequent optimal sets Gi of size 5 in the leukemia data set (horizontal axis: sets of size 5). On the left axis: Dist, associated
distance between the tissues. On the right axis: CV, cross-validation rate; Class, test set classification rate. The thin
dashed lines represent the estimated rates; the solid lines represent their isotonically smoothed counterparts; the thick
dashed lines show the cutoff based on the estimated 99th percentile of the corresponding null-distribution. The last
set above the cutoff is marked with a gray diamond.

Both estimates of the classification rate (CV and Class, thin dashed lines in the figure) are highly
variable, while the distance (Dist) is decreasing monotonically. As the optimal sets were selected
according to the distance, the observed monotonicity confirms the ability of the basic algorithm to find
an optimal subset. To reduce the observed variability of the classification rate estimates we assume the
true rates to be non-increasing and apply isotonic regression (Robertson et al., 1988) to smooth the
corresponding curves (solid lines in the figure). The thick dashed lines represent the level of the 99th percentile
of the null-distribution of the corresponding measure; they were estimated by generating m = 300 random
permutation samples that mimic ‘no-difference’ data in accordance with algorithm (A3).
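
The isotonic smoothing step can be illustrated with the pool-adjacent-violators algorithm, a standard way of computing a least-squares monotone fit. The sketch below assumes equal weights for all sets; the function name is hypothetical and the text does not say which implementation was used.

```python
def isotonic_nonincreasing(values):
    """Least-squares non-increasing fit via pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge while the previous block mean is below the latest block mean,
        # since that would violate the non-increasing constraint.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted
```

Applied to a sequence of noisy classification rates, the fit replaces every local increase by the average of the values involved, which yields step-shaped curves like the solid lines described for Figure 4.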
The iterative search procedure (A4) with sets of size k = 5 at α = 0.01 selects 16 groups for a total of
80 genes. The first three sets of five genes are listed in Table 1; the entire list can be viewed at the journal’s
web site. If the stopping were based on cross-validation or test-set classification rate, the procedure would
have stopped earlier, after five or seven sets. However, the high variability of these measures and the
reported problems with the test set (it was obtained from different institutions and some of the samples are
pediatric cases) make stopping according to the classification rate less desirable.
The results obtained with the above-described multivariate procedures are worth comparing with a
univariate selection of differentially expressed genes. For the first comparison we have sorted the genes
according to the value of the corresponding (marginal) t-statistics and selected the top 16 × 5 = 80 of
them, so that the number of genes coincides with that selected by the multivariate distance-based cutoff
criterion. We found that the two lists have only 34 (42%) genes in common.
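
A univariate comparison of this kind could be carried out along the following lines. The text does not state which form of the t-statistic was used, so the unequal-variance (Welch) version below is an assumption, as are the function names.

```python
import numpy as np

def welch_t(x, y):
    """Marginal two-sample t-statistic for every gene (column)."""
    vx = x.var(axis=0, ddof=1) / len(x)
    vy = y.var(axis=0, ddof=1) / len(y)
    return (x.mean(axis=0) - y.mean(axis=0)) / np.sqrt(vx + vy)

def top_genes_by_t(x, y, size):
    """Indices of the `size` genes with the largest absolute t-statistic."""
    return set(np.argsort(-np.abs(welch_t(x, y)))[:size].tolist())

def overlap(a, b):
    """Number of genes shared by two gene lists."""
    return len(set(a) & set(b))
```

With `size` set to the number of genes chosen by the multivariate procedure, `overlap` gives the count of genes common to the two lists.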
The Significance Analysis of Microarrays (SAM) of Tusher et al. (2001) is another commonly used
univariate approach, so we have run SAM (Chu et al., 2002) on both the rank-adjusted leukemia data and
on the data adjusted according to the recommendation of Tusher et al. (2001). The latter approach uses
linear calibration on a cube-root scatter plot of each sample against the average of all samples, followed
by transformation to the original scale. As SAM focuses on estimating the false discovery rate (FDR)
for each cutoff of the score and not on controlling type I error, no well-defined procedure exists for the
selection of a final gene set. Our rule for selecting the cutoff was to minimize the estimated false discovery
rate: the minimum was at 0.47% with 143 genes selected and 0.65% with 105 genes for the rank and
cube-root adjusted data, respectively†. The proportion of the genes found by the iterative search that
were also selected by SAM was 47% and 41% for the rank and cube-root adjusted data, respectively.
Table 1 also contains the ranking of the genes selected by our iterative search procedure according
to SAM.

Table 1. List of the top 15 genes selected by the iterative search and their order
according to SAM. The ordering within the sets of the iterative search is random

                                                  Rank according to SAM
Iterative search (k = 5)                          Cube-root adjustment    Rank adjustment
M28130    Interleukin 8 gene                      13*                     56*
M89957    Immunoglobulin-associated beta (B29)    465                     283
M27891    Cystatin C                              7*                      24*
M27783    Elastase 2, neutrophil                  611                     8*
M84526    Adipsin                                 18*                     1*
U46499    Glutathion S-transferase                31*                     15*
M63438    Glutamine synthase                      633                     427
M83667    NF-IL6-beta protein                     30*                     19*
X82240    T cell leukemia gene                    566                     276
U89922    Lymphotoxin-beta                        937                     422
X52056    Oncogene spi1                           39*                     143
D88422    Cystatin A                              93*                     6*
M57731    GRO2 oncogene                           73*                     28*
D87076    KIAA0239 gene                           252                     50*
M20203    Neutrophil elastase                     529                     44*
* the gene is selected by SAM when the smallest FDR rate is chosen as cutoff

The selection of some genes, e.g. M27783 and M20203, seems to be determined by the adjustment
procedure. These genes have almost no expression in ALL (average cube-root adjusted expression levels
of 522 and 110, respectively) and moderately high levels in AML (average cube-root adjusted expression
levels of 5210 and 4511, respectively). However, there are also genes that are ranked very low by SAM
regardless of the adjustment, e.g. M89957, M63438, X82240 and U89922. These genes are specific to one
of the ALL subtypes: individually they separate only T-cell ALL from non-T-cell ALL (e.g. X82240, T-cell
leukemia gene) or B-cell ALL from non-B-cell ALL (e.g. M63438, Glutamine synthase); jointly, however,
they also separate ALL from AML. M89957 and X82240 are ranked 2 and 3 by SAM if genes informative for
the three-way classification into AML, T-cell ALL and B-cell ALL are sought. The multidimensional
approach was able to find them without using the additional information on the ALL subtypes.

7. APPLICATION TO COLON CANCER CELL LINE DATA

To evaluate the performance of our method in the presence of large marginal differences, we also
applied our methodology to two commonly studied colon cancer cell lines. HT29 cells represent advanced,
highly aggressive colon tumors. They contain mutations in both the APC gene and p53 gene, two tumor
suppressor genes that frequently mutate during colon tumorigenesis. As another cell type, we selected
HCT116 cells. This cell line models less aggressive colon tumors and harbors functional p53 and APC.
However, they show a deficiency of those genetic systems that are responsible for the repair of mismatched
regions of DNA. To generate the data, three samples of each mRNA (1 µg each) were labeled by
production of first-strand cDNA in the presence of Cy3-dCTP (green) or Cy5-dCTP (red). Six identical
comparison sets of samples were labeled in reverse orientations: in the first three sets Cy3 was
used to label HCT116 cells while Cy5 was used for HT29 cells; in the next three sets Cy5 was used with
HCT116 while HT29 was labeled using Cy3. Each comparison set was hybridized against two microarray
facing slides containing 4608 minimally redundant cDNAs spotted in duplicate (unpublished data). A rank
adjustment (Tsodikov et al., 2002) was applied for both dyes within each sub-slide and the adjusted values
were averaged across the four dependent replicates obtained from the same original sample (the two sides
of a two-face slide). Thus six independent replicates were obtained for both the HT29 and HCT116 cell lines.
Data from an earlier (lower quality) experiment in which each sample was hybridized only to two sides
of one slide (resulting in two dependent replicates instead of four) were similarly adjusted and used as an
independent test set. This set contained eight replicates of each of the two cell lines.

† The lists of genes selected by the univariate approaches are also available through the journal’s website.

The number of permutation samples was m = 300 in this analysis, and the cluster size k = 5 was used
in algorithm (A4). Using the 99th percentile of the null-distribution as the cutoff, the cross-validation
rate and the distance-based criteria would stop close to each other (at the 57th and 56th sets of size 5,
respectively), while the smoothed test set classification rate drops below the cutoff much earlier, at the
12th set. However, when the 95th percentile was used, the stopping points were much closer to each other
(sets number 57, 63 and 67 for cross-validation, classification and distance, respectively). The extremely
high variability of the test set classification rate is probably responsible for this discrepancy. One possible
reason for that is the imperfect nature of our test data set: it was obtained a year earlier and the expertise
of the Microarray Core Facility that produced the data has greatly increased since that time. In addition,
the test set was formed using only two replicates for each sample compared to four in the training set.
The comparison with univariate procedures was performed similarly to the previous section. For the
first comparison we have sorted the genes according to the value of the corresponding (marginal) t-
statistics and selected the top 56 × 5 = 280 of them, so that the number of genes coincides with that
selected by the multivariate distance-based cutoff criterion. We found that the two lists have only 94
(33%) genes in common.
SAM with rank-adjusted data has the minimum FDR at 0.23% with 283 genes selected. This number is
very close to the 280 genes selected by the distance-based cutoff. These two sets of genes greatly overlap:
260 genes are common in the two lists; the two orderings of the genes also show high agreement with
Kendall’s τ = 0.78. Thus in a data set with a high number of marginally differentially expressed genes
our method gives results very close to SAM. The cube-root adjustment is not applicable to two-color
arrays, thus it was not applied to this data set.
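The agreement between two gene orderings can be quantified with Kendall's τ as in the sketch below. The text does not state which variant of τ was used or over which genes it was computed, so this example assumes the tie-free form (τ-a) evaluated on the genes present in both rankings; the function name and the dictionary input are hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items, assuming no ties.

    rank_a, rank_b -- dicts mapping gene id -> rank position
    Only genes present in both rankings are compared.
    """
    items = [g for g in rank_a if g in rank_b]
    concordant = 0
    discordant = 0
    for g, h in combinations(items, 2):
        # Positive product: the pair is ordered the same way in both rankings.
        sign = (rank_a[g] - rank_a[h]) * (rank_b[g] - rank_b[h])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Identical orderings give τ = 1 and fully reversed orderings give τ = −1, so a value of 0.78 indicates that most gene pairs are ordered the same way by both methods.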

8. DISCUSSION

In this paper, we propose a successive selection procedure designed to identify a set of differentially
expressed genes from microarray data. The three stopping rules explored in Sections 6 and 7 in
conjunction with this procedure appear to produce similar final subsets of genes. The only discrepancy
observed was between the test-set-based selection and the other two rules in the application to actual
data. This discrepancy can be attributed to the poor quality of the test sample available in this study. A
distinct advantage of the distance-based stopping rule is its stability to random fluctuations in the course
of eliminating groups of differentially expressed genes. However, it is always wise to check the distance-
based procedure against its test-set-based counterpart whenever possible.
The proposed procedure is computationally expensive and probably would have been infeasible not so
long ago. The most time-intensive resampling part (A4, step 1) is easily parallelizable, so multi-processor
systems can be utilized to speed up the computation. All the calculations for this paper were performed
on a computer with two 1000 MHz processors and a typical analysis lasted 1–3 hours depending on the
sample size and the exact choice of the parameters. The speed of the random search procedure effectively
limits what can be done by resampling. Considerable room remains for improvement in designing the
random search algorithm. In Algorithm (A4), all genes have equal probability of being selected at each
step of the random search. Other methods can be explored for improving the speed of the random search,
from simple ones, such as penalizing previously rejected genes, to more sophisticated methods, such as
simulated annealing or a genetic algorithm (Li et al., 2001).
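The baseline being discussed, a random search in which every gene is equally likely to be proposed at each step, can be sketched as below. This is an illustrative hill-climbing version only: the acceptance rule, the single-gene swap move and the generic `score` callable are assumptions, not the search of Section 4 or the distance measure of the paper.

```python
import random


def random_search(genes, k, score, n_iter=1000, seed=0):
    """Search for a size-k subset with a large score by random single-gene swaps.

    genes  -- list of gene ids
    score  -- callable taking a list of k genes and returning a float;
              a hypothetical stand-in for the distance between conditions
    """
    rng = random.Random(seed)
    current = rng.sample(genes, k)
    best = score(current)
    for _ in range(n_iter):
        pos = rng.randrange(k)              # position in the subset to replace
        candidate_gene = rng.choice(genes)  # every gene is equally likely
        if candidate_gene in current:
            continue
        trial = current[:pos] + [candidate_gene] + current[pos + 1:]
        value = score(trial)
        if value > best:                    # keep the swap only if it improves
            current, best = trial, value
    return current, best
```

Penalizing previously rejected genes would amount to replacing the uniform `rng.choice` with a weighted draw, and simulated annealing would occasionally accept a swap that lowers the score.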
In the last step of algorithm (A4), the subsets of size k obtained at each step of the successive selection
procedure are used as ‘building blocks’ for a finally selected set of differentially expressed genes. A
value of k is conventionally chosen to meet the important practical constraint imposed by the number of
available replicates (Section 5). It is natural to expect that the size and composition of the final set of genes
depend on the choice of k: larger values of k typically lead to larger final sets. Since the problem has no
formal solution, the choice of k is left to the investigator.
One of the most obvious deficiencies of the techniques proposed for finding differentially expressed
genes is that they are essentially univariate and are frequently based on the (sometimes implicit)
assumption that expression signals are stochastically independent. Our example in Section 7 shows that
a pertinent univariate method, properly adjusted for multiple testing, and a more general multivariate
method may yield similar results. As the example suggests, this is likely to be true whenever an
abundance of differentially expressed individual genes is an inherent feature of the biological systems
under comparison. In such settings, the virtues of multivariate approaches become less obvious. But
there may be numerous other situations where gene-to-gene interactions are especially important so that
multivariate methodology must be brought to the fore: our analysis of the leukemia data in Section 6
provides a good example.
It should be kept in mind that the ultimate goal of microarray data analysis is not just correct
classification of unknown samples but selection of biologically relevant (however vague the definition
of relevance) genes, although the two problems are closely related. For example, a very good separation
between classes can sometimes be provided by looking at a single gene so that the classification error
rate is difficult to reduce further by including other differentially expressed genes. However, one would
like to keep the chance of missing other interesting genes to a minimum. This is one of the reasons why
more robust methods based on distances between random vectors may perform better than error-based
multivariate methods of gene selection (Li et al., 2001).

REFERENCES
ALIZADEH, A. A., EISEN, M. B., DAVIS, R. E., MA, C., LOSSOS, I. S., ROSENWALD, A., BOLDRICK, J. C.,
SABET, H., HUDSON, J. J. AND LU, L., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified
by gene expression profiling. Nature 403, 503–511.
BEN-DOR, A., BRUHN, L., FRIEDMAN, N., NACHMAN, I., SCHUMMER, M. AND YAKHINI, Z. (2000). Tissue
classification with gene expression profiles. Journal of Computational Biology 7, 559–584.
BEN-DOR, A., FRIEDMAN, N. AND YAKHINI, Z. (2001). Scoring genes for relevance. Technical Report. Agilent
Laboratories.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. AND STONE, C. J. (1984). Classification and Regression Trees.
Monterey, CA: Wadsworth and Brooks/Cole Advanced Books and Software.
CHILINGARYAN, A., GEVORGYAN, N., VARDANYAN, A., JONES, D. AND SZABO, A. (2002). Multivariate approach
for selecting sets of differentially expressed genes. Mathematical Biosciences 176, 59–69.
CHU, G., NARASIMHAN, B., TIBSHIRANI, R. AND TUSHER, V. (2002). SAM ‘Significance Analysis of
Microarrays’, Users guide and technical document. Stanford University, http://www-stat.stanford.edu/~tibs/SAM/,
1.21 edition.
DUDOIT, S., FRIDLYAND, J. AND SPEED, T. P. (2000). Comparison of discrimination methods for the classification
of tumors using gene expression data. Technical Report, 576. Berkeley, CA: University of California.
FUKUNAGA, K. (1990). Introduction to Statistical Pattern Recognition, 2nd edition. London: Academic.
GANESHANANDAM, S. AND KRZANOWSKI, W. (1989). On selecting variables and assessing their performance in
linear discriminant analysis. Australian Journal of Statistics 32, 443–447.
GOLUB, T. R., SLONIM, D. K., TAMAYO, P., HUARD, C., GAASENBEEK, M., MESIROV, J., COLLER, H., LOH,
M. L., DOWNING, J. R., CALIGIURI, M. A., BLOOMFIELD, C. D. AND LANDER, E. S. (1999). Molecular
classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286,
531–537.
HASTIE, T., TIBSHIRANI, R., EISEN, M. B., ALIZADEH, A., LEVY, R., STAUDT, L., CHAN, W. C., BOTSTEIN,
D. AND BROWN, P. (2000). ‘Gene shaving’ as a method for identifying distinct sets of genes with similar
expression patterns. Genome Biology 1, 0002.1–0003.21.
KERR, M. K. AND CHURCHILL, G. A. (2001). Experimental design for gene expression microarrays. Biostatistics
2, 183–201.
KERR, M. K., MARTIN, M. AND CHURCHILL, G. A. (2000). Analysis of variance for gene expression microarray
data. Journal of Computational Biology 7, 819–837.
KHAN, J., SIMON, R., BITTNER, M., CHEN, Y., LEIGHTON, S. B., POHIDA, T., SMITH, P. D., JIANG,
Y., GOODEN, G. C., TRENT, J. M. AND MELTZER, P. S. (1998). Gene expression profiling of alveolar
rhabdomyosarcoma with cDNA microarrays. Cancer Research 58, 5009–5013.
LI, L., DARDEN, T. A., WEINBERG, C. R., LEVINE, A. J. AND PEDERSEN, L. G. (2001). Gene assessment
and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method.
Combinatorial Chemistry & High Throughput Screening 4, 727–739.
NEWTON, M. A., KENDZIORSKI, C. M., RICHMOND, C. S., BLATTNER, F. R. AND TSUI, K. W. (2000). On
differential variability of expression ratios: improving statistical inference about gene expression changes from
microarray data. Journal of Computational Biology 8, 37–52.
ROBERTSON, T., WRIGHT, F. T. AND DYKSTRA, R. L. (1988). Order Restricted Statistical Inference. London:
Wiley.
ROSS, D. T., SCHERF, U., EISEN, M. B., PEROU, C. M., REES, C., SPELLMAN, P., IYER, V., JEFFREY, S. S.,
VAN DE RIJN, M., WALTHAM, M., PERGAMENSCHIKOV, A., LEE, J. C. F., LASHKARI, D., SHALON, D.,
MYERS, T. G., WEINSTEIN, J. N., BOTSTEIN, D. AND BROWN, P. O. (2000). Systematic variation in gene
expression patterns in human cancer cell lines. Nature Genetics 24, 227–235.
SZABO, A., BOUCHER, K., CARROLL, W., KLEBANOV, L., TSODIKOV, A. AND YAKOVLEV, A. (2002). Variable
selection and pattern recognition with gene expression data generated by the microarray technology. Mathematical
Biosciences 176, 71–98.
TSODIKOV, A., SZABO, A. AND JONES, D. (2002). Adjustments and measures of differential expression for
microarray data. Bioinformatics 18, 251–260.
TUSHER, V. G., TIBSHIRANI, R. AND CHU, G. (2001). Significance analysis of microarrays applied to the ionizing
radiation response. Proceedings of the National Academy of Sciences, USA 98, 5116–5121.
VAN DER LAAN, M. J. AND BRYAN, J. F. (2001). Gene expression analysis with the parametric bootstrap.
Biostatistics 2, 445–461.
ZINGER, A. A., KLEBANOV, L. B. AND KAKOSYAN, A. V. (1989). Stability Problems for Stochastic Models,
chapter Characterization of distributions by mean values of statistics in connection with some probability metrics.
Moscow, VNIISI: pp. 47–55.

[Received December 28, 2001; first revision August 6, 2002; second revision December 18, 2002;
accepted for publication February 5, 2003]
