Integrating Gene and Protein Expression Data

Methods 35 (2005) 303–314
www.elsevier.com/locate/ymeth
Integrating gene and protein expression data: pattern analysis and profile mining
Brian Cox a,b,*, Thomas Kislinger a,c, Andrew Emili a,c,*
a Department of Medical and Molecular Genetics, University of Toronto, Toronto, Ont., Canada
b Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ont., Canada M5G 1X5
c Program in Proteomics and Bioinformatics, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ont., Canada M5G 1L6
Accepted 25 August 2004

Available online 12 January 2005
Abstract
Proteomics and functional genomics are emerging new research fields devoted to the study of the entire collection of proteins and mRNA transcripts (collectively known as gene products) that define a biological system. DNA microarrays are now a popular platform for measuring changes in messenger RNA transcript levels on a genome-wide scale, while gel-free shotgun profiling methods based on tandem mass spectrometry are increasingly being used to determine the identity, modification states, and relative abundance of large numbers of proteins. By defining the behavior of entire biological pathways and networks under various physiological states, these studies aim to extend traditional reductionist molecular genetic approaches regarding the biological roles of the vast array of uncharacterized gene products. A key goal is to determine how the information encoded by the myriad of expressed gene products is integrated at the molecular, cellular, and even whole organism level to create the dynamic biochemical processes and complex physiological controls that sustain life. While comparison of the complementary information contained in proteomic and mRNA data sets poses considerable analytical challenges, these efforts should provide added insight into the fundamental mechanisms underlying physiology, development, and the emergence of disease. Here, we outline several analytical approaches, methods, and tools that have proven to be helpful in the face of this important challenge.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Expression profiling; Informatics; Microarrays; Protein mass spectrometry; Shotgun sequencing; Proteomics; Data analysis; Clustering; Data mining; Pattern recognition

1. Introduction

DNA microarrays are commonly used to examine global changes in messenger RNA abundance across different biological settings [1]. Likewise, advances in mass spectrometry-based proteomics technology now make it possible to characterize large numbers of proteins in complex biological samples [2]. Gel-free profiling procedures coupling high-performance liquid chromatographic fractionation of protein tryptic digests with automated tandem mass spectrometry (LC-MS) represent a particularly powerful technology for elucidating the identities, abundance, and post-translational states of hundreds to thousands of proteins. This technology can be applied to study proteins present at specific time-points within the life-cycle of a cell or organism [3,45,46]. Because cells generally respond to diverse physiological cues, developmental signals, and environmental perturbations, changes in mRNA and protein levels can serve as a particularly informative readout of phenotypic state [4].

* Corresponding author. Fax: +1 416 946 7281.
E-mail addresses: [email protected] (B. Cox), andrew.emili@utoronto.ca (A. Emili).
1046-2023/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.ymeth.2004.08.021

There currently exists a growing literature outlining methods for integrating and comparing functional proteomics data sets, such as the composition of protein complexes, networks of protein–protein interactions [5,6] or microarray-derived gene expression patterns [7]. However, a more fundamental and pressing question is the correspondence of transcriptional responses to cellular protein abundance. That is, to what extent does the pattern of gene expression, which reflects RNA transcription and degradation rates, correlate with the corresponding protein levels, which are also influenced by translational and post-translational mechanisms? The limited number of comparative studies carried out to date indicates that the correlation across large data sets is typically modest, presumably due to substantial variations in post-translational processing [8–11]. For instance, a recent genome-scale epitope-tagging study of protein abundance in yeast by Weissman and colleagues [12] indicates that many essential proteins and transcription factors are present at levels that are not readily predicted by mRNA levels. These studies are likely to become far more common, and informative, as the number of related proteomic and genomic data sets steadily grows.

Of course, the value of comparing proteomics and mRNA data sets can go far beyond simple correlation analysis of gene product quantities (i.e., relative levels of protein and mRNA detected for the same gene). For instance, pioneering studies by Gerstein and colleagues [13] have revealed considerable similarity between the transcriptome and the proteome in terms of enrichment for specific structural and functional properties. We believe that this form of comparative analysis will increasingly be used to bridge the burgeoning gap between the proteomics and functional genomics research communities by creating a common, interactive knowledge-base. Such comparisons should also allow for a better determination of the suitability of using gene transcript levels as a surrogate for protein activity, as well as provide insight into molecular pathways that determine and link gene and protein expression patterns. Lastly, we expect such comparative studies to improve our understanding of the biochemical mechanisms that control a range of cellular responses.

Here, we provide an overview of common proteomic and microarray expression profiling procedures, and outline basic methods and freely available tools that can be used to map and compare mRNA transcript and protein levels, with an emphasis on deriving broad biological inferences. We emphasize critical steps and analytical issues that need to be considered to meaningfully compare the results obtained from high-throughput microarray studies with those from shotgun mass spectrometry-based proteomic analyses, illustrating each of these steps with examples of real experimental data.

2. Description of the method

2.1. Generation of proteomic and genomic data sets

Proteomics is defined as the large-scale examination of protein expression, localization, modification, structure, function, and activity. 2D-gel electrophoresis has historically served as a preferred method for high-resolution separation of protein mixtures prior to MS analysis. However, alternative gel-free LC-MS procedures greatly improve proteome coverage, leading to the detection of many of the low-abundance and membrane proteins typically missed by 2D-PAGE. Greatly improved detection limits have been achieved using capillary-scale multi-dimensional chromatography [14]. Combined with sub-cellular fractionation, modern LC-MS-based profiling methods can be used for the unbiased detection and identification of literally thousands of proteins in a single overnight analysis [15], well within the range extractable from small amounts of mouse [15] or human tissue [16]. Moreover, relative protein abundance between samples can often be accurately determined using in vitro and in vivo protein labeling methods conceptually analogous to those used in microarray studies [17].

Microarray-based profiling experiments are typically designed to detect changes in transcript levels under different experimental conditions, such as various time-points during development, following treatment with a drug or as a result of gene mutation [1]. There are multiple microarray platforms, each of which is optimized for measuring changes in transcript ratios rather than absolute abundance. The original microarray platform involved spotting large numbers of cDNAs onto membranes or glass slides [18,19]. These probes range from ~250 bp to ~2 kb, and are usually generated by PCR from an arrayed plasmid library [18]. Like all high-throughput methods, this approach is subject to spurious experimental artifacts and systematic bias [19–21]. One obvious limitation stems from the variable GC content and sequence length of the probes, which can lead to different hybridization efficiencies. A second, alternative arraying method that partly overcomes this limitation is to use long synthetic oligos (~60 bases) unique to each transcript but with similar ΔG values of annealing [47]. A range of oligo-based microarrays is available commercially. The third is the short oligo array sold by Affymetrix (www.Affymetrix.com), which is discussed below [22].

To allow for more robust sample measurements, multiple probes are repeatedly spotted for each gene. Moreover, experiments are typically run in triplicate to validate the statistical significance of outlier values [19–21]. Hybridizations are usually carried out using cDNA pools generated by reverse transcription (RT) of total RNA or purified polyA mRNA using an oligo primer directed to the polyA sequence. cDNA is either directly
labeled with fluorescent nucleotide analogs, or with reactive side-group analogs for subsequent labeling, during the RT reaction [19]. Reference RNA is often then co-hybridized along with the experimental sample to normalize array intensities across different chips [19]. However, difficulties in generating consistent reference RNA and improved imaging/scanning technologies have reduced this practice. Indeed, experimental samples are hybridized alone using the popular Affymetrix gene chip platform [22,23], which uses an array of 11–20 perfect match probes consisting of 25-mer nucleotide sequences targeting unique regions on each transcript. A parallel set of mismatch probes with a single base substitution in the middle of the probe serves as a background control. Biotinylated cRNA is fragmented and annealed to the slide, and hybridization is detected with a fluorescently labeled antibody specific to biotin. The scanned probe set intensities are then subjected to statistical detection analysis, using proprietary algorithms which assign a binary absent/present call to each measured gene along with an estimate of background noise, allowing estimation of the significance of differences in gene expression ratios across samples.

2.2. Considerations for protein and RNA samples

Detection of meaningful differences in recorded protein and gene expression patterns requires the use of computational tools to allow for statistically sound analysis and mining of the data. Since integration of proteomic and genomic data sets relies on the careful comparison of large heterogeneous data sets, the different technical limitations associated with each profiling platform must be considered. For instance, microarrays can only detect those transcripts having a representative probe on the chip, a limitation rapidly being overcome with advancing technology, improved gene prediction algorithms, and the completion of genome sequencing projects. Cross-hybridization and spurious signal are also a frequent, if under-appreciated, concern.

MS identification of proteins is limited by the incompleteness and redundancy of protein sequence databases used for searching MS spectra. The choice of database, and even the search algorithm, can be critical determinants of protein identification success rates [24–26]. Even high-throughput protein identification by methods such as capillary-scale multi-dimensional chromatography [14] faces limitations due to imperfect chromatographic separation and under-sampling by the mass spectrometer system being used. The complexity of mammalian tissue represents a considerable experimental challenge, and pre-fractionation methods are generally required to increase proteome coverage. One should not underestimate the importance of proper sample selection for generating meaningful data. For instance, profiling whole brain may obscure changes in gene expression in the hypothalamus during treatment with a drug. Most tissues and organs are heterogeneous and made up of many cell types. While cell sorting, tissue culture, and sub-cellular fractionation can be used to simplify the mixtures [15,27,28], sample preparation can still be problematic for genes/proteins involved in specific settings, such as the critical transitions of the cell cycle. The last challenge is in extracting quantitative information for low intensity peptides as a reliable signature, since high-abundance proteins, such as housekeeping enzymes, are preferentially detected by LC-MS.

Another critical consideration is the adoption of suitable informatics criteria to evaluate the significance of putative protein matches. To this end, confidence filters based on probability distributions and statistical algorithms should be used to determine the likelihood of putative protein identifications; this is essential for eliminating false-positive matches and also provides for standardization in the reporting and comparison of different data sets. To obtain an accurate profile, quantitative data describing relative protein abundance under the various settings must also be obtained. One option is to use differential labeling of protein samples in a manner analogous to the use of two-label systems in many microarray studies. Several innovative chemical- or isotope-based labeling strategies have been shown to improve the reliability of quantitative inferences made by LC-MS [17,29]. However, the impact of these specialized methods has been restricted to date due to the significant expertise and cost associated with these analyses. We believe that peptide or spectral count offers a far simpler semi-quantitative first pass measure for tracking changes in protein abundance for the purpose of global data set comparisons and biomarker discovery. Protein levels can be readily estimated to a good first approximation based on the peptide count or cumulative sum of recorded peptide spectra that can be reliably matched to a given protein [48]. Experimental repetition is often needed for proper determination of the spectral count, however, to deal with statistical issues that arise due to MS sampling inefficiencies leading to spurious variations in large multivariate proteomic data sets.

2.3. Linking heterogeneous databases

Regardless of which platform one chooses to use, the first task is to match the genes represented on the microarray with the corresponding proteins identified in a proteomics experiment. By luck or by design, the lists of sequence identifiers may be from the same database, but more likely one has to perform some cross-referencing or indexing across platforms. Most commercial sources of microarrays provide downloadable support tables of gene accession numbers and common gene IDs relating to one or more public annotation databases.
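
The cross-referencing step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used by any particular supplier or tool: the function name `cross_reference`, the two-column (probe ID, accession) table layout, and all identifiers shown are hypothetical, and the accessions are assumed to come from the same database release on both sides.

```python
from collections import defaultdict


def cross_reference(probe_annotation, protein_ids):
    """Link microarray probes to identified proteins through a shared accession.

    probe_annotation: iterable of (probe_id, accession) rows, as might be read
                      from a supplier's annotation table (hypothetical layout).
    protein_ids:      iterable of accessions reported by the proteomics search.
    Returns (links, unmatched): links maps accession -> sorted list of probe IDs,
    unmatched lists the protein accessions with no probe on the array.
    """
    probes_by_accession = defaultdict(set)
    for probe_id, accession in probe_annotation:
        if accession:  # skip probes that carry no accession in this database
            probes_by_accession[accession.strip().upper()].add(probe_id)

    links, unmatched = {}, []
    for accession in protein_ids:
        key = accession.strip().upper()  # normalize case and stray whitespace
        if key in probes_by_accession:
            links[key] = sorted(probes_by_accession[key])
        else:
            unmatched.append(key)
    return links, unmatched


# Made-up identifiers, for illustration only.
annotation = [
    ("probe_001", "P00001"),
    ("probe_002", "P00001"),  # two probes for the same gene product
    ("probe_003", "Q00002"),
    ("probe_004", ""),        # probe lacking an accession
]
proteins = ["p00001", "Q00002", "Q99999"]

links, unmatched = cross_reference(annotation, proteins)
print(links)      # {'P00001': ['probe_001', 'probe_002'], 'Q00002': ['probe_003']}
print(unmatched)  # ['Q99999']
```

Keeping the unmatched accessions in a separate list makes it easy to examine them later, for example by retrying the lookup against a second identifier system.
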
At a minimum, suppliers must provide a list of sequences spotted on the array. Detailed cross-referenced annotations are available on the web for registered Affymetrix array users (www.affymetrix.com). The NIA also maintains extensive annotation tables for a suite of oligo and cDNA microarrays (http://lgsun.grc.nia.nih.gov/cDNA/cDNA.html). Typical identifiers include accession references to databases such as SwissProt/Trembl (SPTR), NCBI, ENSEMBL, and Unigene (described below). However, there are several pitfalls to consider when using these identifiers. For instance, while SwissProt (SP) is a popular choice for spectral database searches since it is a highly curated, stable protein sequence database, it generally does not house the complete set of proteins for many organisms. Its companion database TrEMBL (TR), a computer-annotated supplement containing all the translations of EMBL nucleotide sequence entries not yet integrated in SP, offers more extensive coverage. However, TR IDs are unannotated, frequently redundant, and are continuously retired and replaced with SP-based accessions (IDs) as proteins migrate to SP.

NCBI-maintained gene and protein databases suffer from these same problems, although the creation of the RefSeq NP (protein) and NM (mRNA) accession system [30] is an attempt to standardize and reduce redundancy. Moreover, the Uniprot database [31] is trying to create a single identifier for all proteins. This may only create further problems, as each research group picks its favorite unique identifier and we are still left with the task of linking disparate data sets. Other groupings such as GeneLynx and gene card databases [32–34] are trying to consolidate all of these different identifiers into a single resource, similar to what has been done with Drosophila (www.fruitfly.org/annot/) and Caenorhabditis elegans (www.wormbase.org/). A key identifier is the Unigene database (www.ncbi.nlm.nih.gov), which is generated from species-specific clustered nucleotide sequences that overlap with high-percent sequence identity. However, as new members are added to the collection, Unigene clusters are recalculated and new clusters are generated, with some members of a cluster being moved into a new cluster and occasionally a cluster ID being completely removed, which creates problems with legacy data sets. A caveat, then, to linking data sets using Unigene IDs is to ensure that the build dates are the same. Ensembl (www.ensembl.org) has one advantage in that IDs are only assigned to genes/proteins that can be associated with the assembled genome, thus providing a stable non-redundant set of identifiers. However, not all genome assemblies are finished (e.g., mouse) and gene annotation is not always complete.

If you need to retrieve tables of sequences or annotations, each database maintains its own service. Sequence and annotation information can readily be retrieved from the Ensembl database using the Ensmart system (www.ensembl.org/Multi/martview). Data and annotation from NCBI can be obtained by batch Entrez (www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?). SwissProt and Trembl data can be accessed from the Expasy web site using list retrieval (http://ca.expasy.org/sprot/sprot-retrieve-list.html) or SRS (http://ca.expasy.org/srs5/). For those working with mouse data, a particularly useful site is www.informatics.jax.org, where a myriad of data is maintained, including a phenotype browser for mutations and diseases. To facilitate routine gene-to-protein mappings, our group has developed a cross-referencing database lookup tool, Protein2Gene Lookup (http://emili11.med.utoronto.ca/~dbgroup/lookup.php), that provides a user-friendly web-accessible interface. Researchers simply enter accessions of interest to obtain the matching IDs/accessions from several major genomic and proteomic reference databases.

In some cases, a BLAST sequence alignment may be the best way to link two data sets. Using bulk sequence retrieval tools, a user can pull sequences using ID tags or entire species-specific sets. Searches can then be done using the stand-alone BLAST tool (NCBI downloads) or a utility like BioEdit (www.mbio.ncsu.edu/BioEdit/bioedit.html), using one sequence set as reference. Output should be generated in text file format so that it can be readily parsed into columns for query ID, subject ID, and scores (E value, percent identity, gaps, etc.). Caution should be used in interpreting the output, though: a BLAST E value of 0 does not mean that two sequences are exactly identical. Rather, percent identity and sequence coverage should also be checked to ensure acceptable alignments and incorporated into the selection of an acceptable threshold cut-off. Validated alignments can then be extracted as a two-column ID table, a relationship that can be used to bridge data sets. A further test of homology is to use reciprocal BLAST, in which the BLAST search is done twice, swapping the query and subject databases; this helps avoid linking paralogous proteins/genes.

2.4. Data analysis: methods of comparison

Once the microarray and proteomic data sets have been cross-matched, more informative comparisons can be made. Confirmation of changes in expression at both the gene and protein levels can help focus target selection. Of course, there will generally be obvious gaps in the data sets in instances where there are no gene expression data available for a protein or vice versa. Indeed, one needs to consider the reasons for missing values and have some rationale to assess the overall quality of the non-overlapping data. One can choose to focus only on the overlapping gene product data, excluding all data points that do not cross-map. Conversely, one can fully merge the data sets, placing blank values for missed matches. Since microarray studies are typically of far greater scope at present than most protein profiling efforts, this last choice may not be helpful due to the
large number of blanks, which impede data processing. By using a focused approach, gene products of interest can be assessed by comparison of the overlapping data within the cluster.

Lastly, since microarrays often contain multiple probes for a given gene, a method of handling this redundancy is required. One can simply replicate the protein information for each redundantly matching probe or one can take an average of redundant probes prior to data set comparison.

2.4.1. Correlating protein and microarray data

Most of the correlation studies comparing protein and gene expression in the literature have dealt with the relative abundance of mRNA and protein. In a pioneering study, Gygi et al. [9] tested gene–protein correspondence in yeast by using [35S]methionine labeling for protein quantitation and SAGE analysis for mRNA transcript quantitation, expressing both as total copies per cell. They collected complementary data for 156 genes and could show a modest positive correlation of mRNA and protein abundance. More recently, a group of researchers [8] correlated the expression patterns of mitochondrial proteins in mammalian tissue with public microarray data, using a simpler present/absent test for concordance [8]. A positive score was assigned when a similar preferential tissue pattern was detected for both the corresponding mRNA and protein. By this scheme, 426 of 569 detected gene products were found to be concordant. One criticism raised is an obvious bias in the data, wherein the average mRNA abundance of the detectable proteins was found to be nearly 5-fold higher than for all annotated mitochondrial genes, suggesting only high-abundance gene products strongly correlate. Further, the scoring schema does not offer a reliable assessment of relative abundance as it only considered extremes in the data.

Griffin et al. [10] asked the more complex question of whether changes in expression correlate at the protein and transcript levels between two yeast populations grown in different carbon sources. They determined the ratio of protein expression using ICAT labeling and the ratio of mRNA using spotted cDNA microarrays. Complementary protein and mRNA abundance data were obtained for 245 genes. Many gene products linked to carbon metabolism showed expected changes in abundance, but did not change with similar scalars or magnitudes at the individual protein and mRNA levels. These observations suggest that genes with similar expression might not translate into similar protein levels and that, rather, post-translational control mechanisms are a normal part of a physiological response.

2.4.2. Protein relative abundance by spectral counts

Although comparing protein lists can be informative, some estimate of relative quantity across samples is generally required for meaningful comparisons with microarray data. Protein levels can be readily estimated to a good first approximation based on the peptide count or cumulative sum of recorded peptide spectra that can be reliably matched to a given protein [48]. The redundant peptide count allows the estimation of the relative abundance of a protein across a series of experiments, provided the isolation techniques were similar, for example with respect to buffers, homogenization method, and fractionation. Spectral counts typically show a slight but statistically significant (p < 0.05) bias for molecular weight; in our hands, mouse proteins with low spectral counts have a mean molecular weight of 56 kDa (median 45 kDa) and those with 100 or more spectral counts a mean of 79 kDa (median 53 kDa). Spectral count values do not allow for intercomparison of expression levels of different proteins. Once a table of proteomic data (with spectral counts) is assembled, it can be treated much like a microarray spreadsheet with respect to data normalization and calculation of expression ratios (discussed below).

2.4.3. Example of concordance of mRNA and protein relative abundance

The rapid progress in high-throughput LC-MS-based profiling [3] suggests we are now reaching a point where we may begin to systematically address post-transcriptional regulatory mechanisms. As an illustration of this, we compared a recently published extensive data set of the proteomic patterns of adult mouse lung and liver [15] to a published Affymetrix MGU74 gene chip data set for lung and liver [35]. We linked the two data sets using SPTR IDs: the proteins had been directly identified from a database of SPTR sequences and the SPTR annotation for the MGU74 probe set was downloaded from Affymetrix. Approximately 1200 non-redundant protein-microarray pairs were found, of which 623 were used because they had a significant detection p value as calculated by Affymetrix MAS5.0. The detection p value provides a statistical evaluation of the observed intensity as a measure of the true binding of the expected transcript versus background noise or spurious cross-hybridization. A similar concordance-scoring scheme to that of Mootha et al. [8] was adopted, with the exception that the total spectral count was substituted for the binary absent/present call to provide a better measure of relative abundance. This results in a positive correlation if both the microarray and the corresponding protein show a similar ratio of expression in the lung and liver (e.g., both >1).

Considering all 623 pairs of microarray and protein data, a concordance of ~60% is observed (372/623). However, taking into account the fold-difference between the two tissues (for either protein or mRNA), the concordance improves with an increasing fold-difference (the ratio of spectral count, or microarray intensity).
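
The concordance score and its dependence on a fold-difference cut-off can be illustrated with a short Python sketch. It assumes that each gene product is represented by one protein ratio and one mRNA ratio (e.g., lung/liver), that all ratios are positive, and that a pair counts as concordant when both ratios lie on the same side of 1. The function names, the `by` parameter, and the example values are hypothetical and are not taken from the data sets discussed here.

```python
def fold(ratio):
    """Fold-difference of a positive ratio regardless of direction (0.25 -> 4.0)."""
    return ratio if ratio >= 1 else 1 / ratio


def concordance(pairs, cutoff=1.0, by="protein"):
    """Count pairs whose protein and mRNA ratios change in the same direction.

    pairs:  list of (protein_ratio, mrna_ratio) tuples, all values > 0.
    cutoff: keep only pairs whose fold-difference in the `by` data type
            ("protein" or "mrna") is at least this large.
    Returns (concordant, considered) so the shrinking denominator stays visible.
    """
    index = 0 if by == "protein" else 1
    kept = [pair for pair in pairs if fold(pair[index]) >= cutoff]
    # Concordant when both ratios are > 1 or both are <= 1 (assumed convention).
    agree = sum(1 for prot, mrna in kept if (prot > 1) == (mrna > 1))
    return agree, len(kept)


# Made-up (protein ratio, mRNA ratio) values, for illustration only.
pairs = [(8.0, 3.5), (0.1, 0.4), (2.0, 0.8), (1.5, 1.2), (0.5, 2.0), (10.0, 0.9)]

print(concordance(pairs))                       # (3, 6)
print(concordance(pairs, cutoff=5))             # (2, 3)
print(concordance(pairs, cutoff=2, by="mrna"))  # (2, 3)
```

Returning the counts rather than a percentage makes it obvious how many pairs survive each cut-off, which matters when comparing concordance values computed at different thresholds.
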
A maximal concordance of 68% (257/376) occurs at a protein cut-off of 7-fold, while for mRNA a maximum concordance of 69% (145/210) is observed at a cut-off of 3-fold. Note that the total number of protein–mRNA pairs considered for the concordance is reduced as we are only considering the pairs for which either the protein or the microarray ratios are above the fold cut-off. This discrepancy suggests experimental noise at the lower ratios. From our example, a positive concordance for a protein ratio of twofold would not be considered as strongly as a 7-fold ratio with a positive concordance. It may also be taken from this example that the signal from the microarray is less noisy than the protein spectral count as a maximum concordance is observed at a lower ratio. Of course, many of these outliers are of special interest as they may reflect divergent, yet physiologically relevant, regulation at the transcriptional and post-transcriptional levels. Samples for which there is no concordance (the ratio of microarray intensity to protein intensity does not agree) may be taken as evidence of normal stochastic fluctuation in protein and mRNA levels, with higher ratios better reflecting meaningful biological variation.

2.4.4. Clustering data

The increasing size and complexity of proteomic and microarray data sets provide opportunities and challenges for researchers to extract biologically relevant information. Two key goals, finding meaningful patterns in the data sets and classifying samples, can both be accomplished by applying different data-mining, pattern recognition, clustering, classification, and other association techniques to the data sets. Clustering is a common approach for sorting related sets of proteomic and microarray data and samples (tissues or experiments), and is generally the preferred and simplest routine for visually assessing intrinsic patterns within the data sets [21]. Most data sets, even from samples that cannot be classified in any obvious way, may contain hidden information about regulatory patterns (such as co-expression) that can be revealed by cluster analysis. The discovery of hidden patterns in expression profiles, referred to as profile mining, is possible if a large number of signature profiles derived for a set of unclassified or unclassifiable samples is available, or if there is a human expert who can provide correct information to guide the discovery of the hidden patterns. It should be noted that clustering only two experimental data sets is not needed for comparison as simple sorting by the ratio in a spreadsheet application is generally sufficient. Clustering is best used for tackling larger data sets where the criteria for sorting expression profiles across multiple experiments are not obvious.

Many commercial informatics packages are available for data clustering, but powerful public software packages that will meet most users' needs are freely available for download. A popular tool is Cluster 3.0 developed by de Hoon and colleagues (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/index.html), a multi-platform program based on the original clustering program developed by the Eisen group [36] (http://rana.lbl.gov/EisenSoftware.htm) but improved to handle more genes and experiments. This package enables users to perform most tasks, such as data filtering, adjusting (normalizing), and clustering using common algorithms and profile similarity (a.k.a. distance) metrics. A second package, called TreeView (http://jtreeview.sourceforge.net), developed and adapted by Alok Saldanha from the original package developed by the Eisen group, allows users to graphically display the output of the clustering program, with easy selection and extraction of interesting clusters.

2.4.5. Calculating a relative ratio of protein expression

Some estimate of relative protein quantity across samples is generally required for meaningful comparisons with microarray data and for clustering to be adequately performed. As discussed above, the spectral count can be used as a measure of relative abundance of a protein across experiments. Again, once a table of proteomic data expressed as spectral counts is assembled, it can be treated much like a microarray spreadsheet with respect to data normalization and calculation of expression ratios.

For normalization of experimental data sets, global average (or median) scaling based on the average number of peptides detected per protein per experiment can be calculated. The spectral count in each experiment can then be scaled by a constant such that the global average (or median) peptide or spectral count is the same across all experiments. Typically the lowest and highest fifth percentile data are trimmed first to prevent skewing of the average. (Of course, these outliers may have biological relevance and must be reconsidered in later analyses.)

There are two ways to generate a suitable expression ratio. In the first approach, a common baseline reference sample data set is used as the denominator, while the experimental sample values are used as the numerator. For example, untreated cell cultures could be the baseline, and cell cultures transformed with increasing amounts of an expression vector the experimental test samples. By this method, the changes in gene expression detected in the experimental samples are all reported as the ratio relative to that observed in the untreated control cells. Another method is to generate a pseudo experiment. In our case, the pseudo experiment data set would contain a list of all proteins detected in each of the samples as well as the corresponding average calculated spectral count value detected for each protein across all experiments. This method can be particularly beneficial in time-course experiments as such data reflect changes in expression over time, generally resulting in a more even
data point distribution. Expression ratios are often transformed with either log base 2 or base 10 to distribute the data evenly around 0, which signifies no change in abundance. Down- and up-regulated expression can be easily visualized using a heat map graphical display format.

2.4.6. Considerations for clustering protein data

As an illustrative example, we will evaluate parallel mRNA and protein expression profiling data sets of developing mouse lungs recorded over three developmental time-points (B.C., T.K., and A.E., manuscript in preparation) consisting of early (a composite of embryonic day 14 (E14) and day 16 (E16)), mid (a composite of day 18 (E18) and postnatal day 2 (P2)), and late (a composite of postnatal day 14 (P14) and adult) stages. The lung development gene expression profiles were based on a published, publicly accessible microarray data set reported by Mariani et al. [41], which was generated using Affymetrix Mu11K A and B chipsets, while the proteomic data sets were generated in-house using the LC-MS-based PRISM profiling methodology [15]. Briefly, lung tissue was separated into nuclear, cytosolic, and mitochondrial protein fractions using differential centrifugation. These fractions were digested with trypsin and the peptide mixtures were separated by two-dimensional chromatography. The eluting peptides were electrosprayed directly (in real time) into an ion trap tandem mass spectrometer. The spectra were searched against a database of mouse protein sequences obtained from the European Bioinformatics Institute (EBI; SP/TR) using the SEQUEST database search software [37]. High-confidence (error p value <0.05) matches were statistically validated using the STATQUEST probability filtering algorithm [15]. The data were then assembled into a table of log2 ratios of spectral counts against a pseudo experiment as described above.

Clustering proteomic data can cause certain problems as compared to microarray analyses due to differences in the types of biological information represented and the disparate methods of data collection. Microarrays typically generate a detectable signal for nearly all gene spots, even when a transcript is absent (i.e., background). In a protein profiling experiment, any protein absent from a sample (or sufficiently low in relative abundance) will go undetected, and hence its corresponding level will necessarily be reported with a blank or missing value in the data matrix. Null data points are generally ignored when calculating distances for generating clusters. This is typically not problematic if only a few points are missing in a data set. Indeed, there are suitable methods available to calculate 'missing' data ranging from simply filling in an average value to imputing the data, although generally these methods are only valid if less than 10% of the data is missing [38–40]. However, it is common for proteomic samples to be quite distinct, for example when comparing different sub-cellular fractions or different tissues [15]. To circumvent this problem, we typically substitute all blank (missing or null) values with a nominal, low log2 ratio value, based on the lowest observed log2 ratio value. Fig. 1 presents an example of cluster analysis of the proteomic patterns detected in cytosolic and nuclear fractions from three different time points of the lung protein data set. It can be seen that filling in a low value prior to clustering the data (Fig. 1A) and then removing these values for visualization (Fig. 1B, gray represents null data points) generates a superior (more consistent) cluster than obtained by leaving the values blank (Fig. 1C, detail Fig. 1D). Note how proteins detected exclusively in a single tissue fraction properly cluster together (Fig. 1B), whereas these same proteins become distributed and the clusters broken up when the data are clustered using blanks (Figs. 1C and D).

A final consideration is calculation of the similarity of expression using a suitable distance metric, which is a method for calculating the resemblance between the expression patterns or profiles of two different gene products across all experiments. Some methods utilize average data values to minimize the effects of spurious outlier data points, while others use absolute values. This is not to say that one metric is necessarily better than another, as this is a subjective criterion. The three average linkage clusters shown in Figs. 1E–G were generated by using the more common Pearson, Spearman, and Euclidean distance metrics, respectively. Initially, all metrics appear to generate similar clusters (Figs. 1E–G) but upon closer inspection, it can be seen that changes in the signal log ratios are treated differently. This is especially evident in a group of proteins at the bottom of each cluster that have more similar expression patterns in the cytosolic fractions as compared to the nuclear fractions.

The Pearson metric defines a correlation coefficient between two lists of values. The version of Pearson used here is centered, which means the correlation is not affected by linear transformation of the data, such as adding or multiplying all values of one set by a constant. Again, this effect is most evident in the cluster of proteins at the bottom which are all preferentially detected (elevated abundance) in the cytosol rather than in the nuclei. The Spearman metric is a non-parametric distance measure and is less affected by outliers, such as the presence of weaker signal detected in the nuclear fractions. The Euclidean metric calculates expression differences directly and is therefore more sensitive to the magnitude of expression, resulting in better separation of proteins or mRNA species that exhibit similar fold changes in expression but different overall signal intensities.

2.4.7. Example of combined microarray and protein data sets

To examine if the changes in protein abundance we observed correlated with changes in mRNA abundance
Fig. 1. Clustering of protein and gene expression data. Color schema: green, low expression; red, high expression; black, no difference; and gray, no data. Ratios were calculated based on the observed protein spectral count in an experiment (sample) over the average spectral count for all experiments. (A) Three different lung cytosolic (left side) and nuclear (right side) protein fraction profiling datasets clustered using a low value substitute (bright green) for missing (blank) values. (B) The low filled values have been removed and replaced with the original blank values (gray). (C) Same dataset clustered using original blank values. (D) Expanded view of clusters. (E) Same dataset as in (B) clustered using the Pearson distance metric. (F) Same dataset clustered with the Spearman distance metric. (G) Same dataset clustered with the Euclidean metric. (H) Cluster of ~1,800 protein gene-product pairs showing early, mid and late protein ratios (left) and microarray gene expression ratios (right) arranged in a similar orientation from early-to-late time points. (I,J) Cluster detail of a concordant (I) and a discordant (J) cluster of protein-microarray profiles.
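The low-value substitution used for panel (A) and the three metrics compared in panels (E–G) can be illustrated with a minimal sketch. It is not the Cluster 3.0 implementation: the function names and the optional `offset` parameter are hypothetical, and only the Python standard library is used. Each profile is assumed to be a list of log2 ratios, with `None` marking an undetected protein.

```python
import math

def fill_missing(rows, offset=0.0):
    """Replace None with a nominal low value derived from the lowest observed ratio.

    `offset` is a hypothetical knob for pushing the fill value below the minimum.
    """
    observed = [v for row in rows for v in row if v is not None]
    low = min(observed) - offset
    return [[low if v is None else v for v in row] for row in rows]

def pearson(x, y):
    """Centered Pearson correlation coefficient between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    """Euclidean distance: sensitive to the magnitude of the values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For two profiles with the same shape but shifted by a constant, such as `[1.0, 2.0, 3.0]` and `[11.0, 12.0, 13.0]`, the centered Pearson coefficient is 1 while the Euclidean distance is large, which is the difference in behavior the panels illustrate. A correlation is commonly turned into a distance as 1 minus the coefficient; that convention is assumed here rather than taken from the text.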
(as reported in the published probe sets) throughout development, we first summed the spectral count across all organellar protein fractions obtained for each time-point. To simplify the analysis, the data sets were binned into early (E13 and E16), mid (E18 and P2), and late (P14 and adult) developmental stages. We then generated the pseudo experiment for the summed data and calculated the log2 ratio for each time bin relative to this reference. To further facilitate the comparison, the microarray data were also expressed as signal log2 ratios relative to their own pseudo experiment. Next, the proteins and corresponding mRNA probe sets were cross-mapped based on a common annotation cross-reference. As is quite commonly seen, the microarray data set was found to contain many redundant probe sets (i.e., mapping to the same protein), which were averaged prior to combining the two data sets. Over 1800 protein/probe set pairs (referred to as data pairs) were matched in this fashion. The final combined table contained a single column of unique (non-redundant) gene–protein identifiers, the corresponding columns of protein expression ratios across all three developmental time-points, in chronological order from earliest to latest, followed by the mRNA transcript expression ratios, also in the same chronological order.

Using the publicly available Cluster 3.0 program, the merged data sets were clustered using the Pearson metric and average linkage by gene. Broadly viewed (Fig. 1H), the microarray and proteomic data appeared to correlate quite well. For instance, higher transcript levels at specific time-points were generally likewise reflected with elevated protein detection levels. Moreover, clustering generated many sub-groups where the mRNA and protein profiles are largely in agreement (Fig. 1I). However, several clusters of gene products did not appear to be similar at the protein or gene levels (Fig. 1J), suggesting either a high degree of post-transcriptional regulation or significant error in the measurement of protein abundance and/or mRNA transcript levels.

To thoroughly assess the relationship between the observed transcriptome and proteome, a more rigorous analysis of the correlation of co-expression is required. To this end, we decided to develop a simplified linear fit model to better examine the relationship between gene and protein levels. We focused our analysis on cases where both the protein and corresponding mRNA message were observed at all time-points (807 data pairs) or where the protein was detected exclusively at only a single time-point (382 data pairs), excluding all proteomic data points having incomplete microarray data (640 proteins). To generate the linear model, we independently plotted the log2 ratios of protein and gene expression as
Fig. 2. Viewing data trends. (A) Plot of the slope (change in relative ratio over time) of measured protein levels versus the slope of the corresponding microarray gene transcript values. The data fall into three groups: regulated, where both protein and mRNA have non-zero slopes; neutral-regulated, where only one has a non-zero slope; and neutral, where both gene products have slopes not significantly different from zero. (B) Detail of the protein spectral count log ratios from lower left quadrant of panel (A), where both protein and mRNA have negative slopes. (C) Detail of the microarray signal log ratios from the same region of panel (A).
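The three-way grouping named in the Fig. 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three time bins are coded with the hypothetical values 1, 2 and 4 (log2 transformed before fitting), each profile gets a least-squares slope, and the slope is tested with a two-tailed t test at 80% confidence for the protein and 90% for the mRNA, the levels quoted in the text. With three points there is one degree of freedom, for which the t distribution is the standard Cauchy distribution, so the critical value has a closed form. All function names are hypothetical.

```python
import math

def slope_and_t(x, y):
    """Least-squares slope of y on x and its t statistic (n - 2 degrees of freedom)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)
    if se == 0:
        # Perfect fit: a flat line is not a trend, any other slope is maximally significant.
        t = 0.0 if slope == 0 else math.copysign(math.inf, slope)
    else:
        t = slope / se
    return slope, t

def t_critical_df1(confidence):
    """Two-tailed critical t value for 1 degree of freedom (Cauchy quantile)."""
    return math.tan(math.pi * confidence / 2)

def classify(protein, mrna, times=(1.0, 2.0, 4.0)):
    """Label a gene-protein data pair from its two three-point log2 ratio profiles."""
    x = [math.log2(t) for t in times]  # time is log2 transformed, as in the text
    _, t_protein = slope_and_t(x, protein)
    _, t_mrna = slope_and_t(x, mrna)
    protein_sig = abs(t_protein) > t_critical_df1(0.80)
    mrna_sig = abs(t_mrna) > t_critical_df1(0.90)
    if protein_sig and mrna_sig:
        return "regulated"
    if protein_sig or mrna_sig:
        return "neutral-regulated"
    return "neutral"
```

For example, `classify([1.0, 0.1, -1.0], [0.9, 0.0, -1.0])` labels a pair in which both profiles fall steadily as regulated. With only one degree of freedom the thresholds are stringent, so profiles with any appreciable scatter around the fitted line are scored as having a zero slope.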
a function of time, with time also log2 transformed to make it more similar in magnitude to the expression data. Regression analysis was then performed across all protein–gene probe set pairs, and the slope was determined across all time-points. Genes and/or proteins exhibiting non-zero slopes using a suitable two-tail t test cut-off were selected for further analysis (critical values of t defined at 90% confidence for the microarray data and 80% confidence for the proteomic data). Based on the correspondence between these patterns, the gene–protein data pairs were then further classified as either: (i) regulated, with both the mRNA and proteins exhibiting non-zero slopes (228 data pairs); (ii) neutral-regulated, with only one of each pair showing a non-zero slope (408 data pairs); and (iii) neutral, with two zero slopes (no significant change in expression) observed (176 data pairs).

A plot of these three data groupings (Fig. 2A) indicates the tight clustering of the neutral set around a slope of zero, which implies that the scoring of a zero slope is due to constitutive expression of both the mRNA message and the protein end product. In contrast, 83% (189/228) of the regulated gene products showed clear evidence of co-regulation (either two positive or two negative slopes). However, the correlation score (r² value) for the regulated group was determined to be 0.39, which indicates a true (albeit modest) correlation overall, whereas the correlation score is only 0.18 if all the data pairs are considered. A closer examination of the regulated data found in the lower left quadrant of the plot (co-regulated negative slopes) is provided in Figs. 2B and C. The observed trends in spectral count ratios (Fig. 2B) and microarray signal ratios (Fig. 2C) clearly show the parallel patterns of downregulation (reduced expression) seen with this subset of gene products.

Although no exact slope could be precisely calculated for the many proteins detected at only a single time-point, we assumed a strong negative slope in the case of early specific protein expression and a strong positive slope for late-stage expression. Of the 430 proteins detected at single time-points, 382 had complete corresponding microarray data. Of these, 96, 115, and 171 were uniquely detected at early, mid and late stages of development, respectively. Importantly, greater than half of the early and late unique proteins (those detected only in the early or late time-points) showed evidence of co-regulation (that is, the slope of the microarray data tested as being non-zero at a 90% confidence level), with less than 15% being disregulated (the mRNA tested as being non-zero, but with an opposite slope to that assumed for the protein counterpart) and ~30% judged to be neutral (that is, the slope could not be determined above a 90% confidence, and was therefore assumed to be zero). In contrast, the mid unique proteins (those detected exclusively at the mid developmental time point) had only 25% co-regulated and 25% disregulated, with the remaining 50% neutral. Hence, this relatively simple comparative modeling of the microarray and proteomic data suggests that many of the singleton proteomic data points reflect genuine developmental regulation of protein abundance at the early and late time-points.

Alternative models of the expression patterns may be useful. As evidence of this, the mid unique protein data (that is, proteins detected exclusively at the mid developmental time-point) did not show a good fit to a linear regression model when the protein expression patterns were assumed to have either a positive or negative slope. These data may fit better to a second order polynomial, with expression being low at early, high at mid, and low at late time-points.

2.4.8. Biological inference and validation

Clustering is only useful if it reveals relationships in data that are biologically meaningful. A rapid means of assessing functional clustering of data is by statistically testing clusters for enrichment or depletion of select functional categories. Several software packages are freely available to analyze clustered data, usually on the basis of annotations obtained from the gene ontology (GO) database. GenMapp [42], for example, allows the user to enter comma-delimited tables of protein or microarray data to calculate significant changes in GO terms by applying user-specified filters to the data

Table 1
Flow chart overview of method for preparing protein and microarray data for merger and analysis

Concordance scheme
1. Normalize the protein spectral counts by global scaling to the average spectral count detected per protein per sample.
2. Normalize the microarray data set by similar or other methods.
3. Cross reference the two data sets through a common ID (SwissProt/Trembl accession number, Ensembl, etc.).
4. Merge the two data sets:
   4.1. Average the intensities of the redundant matches.
   4.2. Remove all non-paired data.
5. Use a concordance score to compare the intensities of protein and mRNA in different experiments to evaluate the correlation of co-expression.

Analysis of change
1. Normalize as in the concordance scheme.
2. Generate ratios by a pseudo-experiment.
   2.1. Generate a pseudo-experimental data set, where the intensity value for a protein is the average spectral count of that protein across all experiments considered.
   2.2. The ratio is the log2 of the spectral count over the average count.
   2.3. This file should be in a tab-delimited format, which can be imported and used by various clustering software tools.
3. The protein data may be optionally merged with the microarray data before clustering.
   3.1. Select a clustering method and similarity metric (e.g., Pearson, Spearman or Euclidean distances).
4. Analyze the cluster for statistical enrichment of select markers or annotation features (e.g., GO terms) of interest.

Methods of data comparison are outlined.
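Steps 1 and 2 of the "Analysis of change" scheme in Table 1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the dictionary layout (protein identifier mapped to a list of spectral counts, one per experiment) are hypothetical, a count of 0 is taken to mean "not detected", the pseudo-experiment average is assumed to include those zeros, and the trimming of the lowest and highest fifth percentiles is omitted for brevity.

```python
import math

def global_scale(counts):
    """Scale each experiment so its mean detected spectral count equals the grand mean."""
    n_exp = len(next(iter(counts.values())))
    col_means = []
    for j in range(n_exp):
        detected = [row[j] for row in counts.values() if row[j] > 0]
        col_means.append(sum(detected) / len(detected))
    target = sum(col_means) / n_exp
    return {protein: [v * target / col_means[j] for j, v in enumerate(row)]
            for protein, row in counts.items()}

def log2_ratios(counts):
    """log2(count / pseudo-experiment average); undetected proteins are left as None."""
    ratios = {}
    for protein, row in counts.items():
        pseudo = sum(row) / len(row)  # the pseudo-experiment value for this protein
        ratios[protein] = [math.log2(v / pseudo) if v > 0 else None for v in row]
    return ratios

def to_tab_delimited(ratios):
    """Serialize the ratio table as tab-delimited text, blanks for missing values."""
    lines = []
    for protein, row in ratios.items():
        cells = ["" if v is None else f"{v:.3f}" for v in row]
        lines.append("\t".join([protein] + cells))
    return "\n".join(lines)
```

A protein with counts of 2, 4 and 6 has a pseudo-experiment value of 4, so its first two ratios are -1 and 0. The `None` entries are the blank values that the low-value substitution described in Section 2.4.6 is meant to address before clustering.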
Table 2
Functional categories (GO terms) enriched in the three main gene–protein co-expression subgroups

Category (GO term)                                        Positive slope                  Negative slope                  Neutral
                                                          p value   Gene product number   p value   Gene product number   p value   Gene product number
Mitochondrion [GO:0005739]                                0.00001   9                     NS                              0.00001   20
Cytoskeleton [GO:0005856]                                 0.00001   8                     NS                              NS
Cell adhesion [GO:0007155]                                0.0003    6                     NS                              NS
Regulation of transcription, DNA-dependent [GO:0006355]   NS                              0.0005    23                    NS
RNA binding [GO:0003723]                                  NS                              0.00001   15                    0.0000    14
Electron transport [GO:0006118]                           NS                              NS                              0.0012    12

Table of GO term enrichment as p value for different groupings of gene product pairs. Positive, both protein and mRNA time courses have a positive slope; Negative, both datasets exhibit negative slope; Neutral, both profiles have a slope of 0. NS, not statistically significant.
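Enrichment p values of the kind listed in Table 2 can be computed from the hypergeometric distribution, with a Bonferroni correction for the number of categories tested. The sketch below is a generic illustration, not MouseSpec's code: the function names are hypothetical, and the use of the upper tail (at least k annotated gene products in the list) is an assumption about how "occurs by chance" is scored.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated items in a list of n.

    N is the size of the whole data set, K the number of items in it that
    carry the GO term, n the size of the selected list, and k the overlap.
    """
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def bonferroni(p, n_tests):
    """Bonferroni-corrected p value, capped at 1."""
    return min(1.0, p * n_tests)

def enriched(categories, n, N, threshold=0.05):
    """Keep categories whose corrected p value falls below the threshold.

    `categories` maps a GO term to (k, K); the 0.05 threshold is illustrative.
    """
    kept = {}
    for term, (k, K) in categories.items():
        p = bonferroni(hypergeom_pvalue(k, n, K, N), len(categories))
        if p < threshold:
            kept[term] = p
    return kept
```

For a data set of 10 gene products of which 3 carry a term, drawing a list of 3 that contains all 3 gives an uncorrected p value of 1/120. Bonferroni is the most conservative of the common corrections, which suits a display that shows only categories passing a pre-defined threshold.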
(www.genmapp.org/download.asp). Due to the multi-step nature of the data processing for the method described, a schematic summary is displayed in Table 1. A simpler package, called GoMiner (http://discover.nci.nih.gov/gominer/index.jsp), uses two lists of gene identifiers: (1) the whole data set and (2) a sub-list that the user has selected as being different (up- or down-regulated) [43].

We and others [44] have also developed publicly accessible web tools to perform this sort of analysis. FatiGO (http://fatigo.bioinfo.cnio.es/), for instance, carries out simple data-mining by assigning the most characteristic GO terms to each cluster using Fisher's exact test for statistical significance testing of groups of SPTR identifiers. The results are displayed in HTML and text format, together with a tree view of associated GO terms and the number of linked gene products. On the other hand, MouseSpec (http://tap.med.utoronto.ca/~posman/mousespec_two/) inputs a list of protein or gene IDs and outputs a summary of GO-based functional classes, biological roles, and cellular localization that are enriched in the list. p values are calculated using the hypergeometric distribution and represent the probability that the intersection of a given list with any given functional category occurs by chance. To correct for spurious false discovery due to multiple repeat testing, a Bonferroni correction factor can be applied to normalize for the number of tests conducted. Only those categories for which the chance probability of enrichment is lower than a pre-defined p value threshold are displayed.

Here, we used MouseSpec to examine the properties of the co-regulated gene clusters (co-positive slopes, co-negative slopes, and neutral) from the lung mRNA and protein data comparison. Table 2 summarizes some of the statistically enriched GO categories associated with each of these groupings, many of which are biologically interesting. For instance, the positive co-regulated group showed a clear enrichment for structural molecules, suggestive of the acquisition of a terminally differentiated cell state, whereas the negatively co-regulated group was enriched for gene products involved in gene expression (e.g., transcription factors), perhaps specifying or determining cell fates. The neutral group was enriched for certain mitochondrial gene products, in particular those involved in electron transport chain activity, as well as for gene products involved in RNA binding-related functions, such as RNA processing and splicing, processes found in virtually all cell types.

3. Conclusion

One key aim of profiling studies is to accurately catalogue quantitative differences in the abundance of one or more gene products present in various biological samples. Pattern recognition algorithms can then be applied to sort and classify the samples and gene products based on their characteristic expression profiles. At present, the massively parallel nature of microarray technology allows for a far more comprehensive molecular analysis of the transcriptome as compared to direct measurements of the proteome using proteomic methods such as LC-MS, which generally exhibit more limited sensitivity and dynamic range. However, since proteomic approaches can provide additional insight into key determinants of biological activity (such as protein subcellular localization, protein–protein interactions, and post-translational modifications), protein profiling studies will undoubtedly continue to grow in scale and in popularity. It is therefore hoped that researchers will increasingly benefit from the unique insights into biology afforded by comparisons of the proteome and transcriptome using one or more of the approaches, methods, and/or tools outlined in this review.

References

[1] L. Smith, A. Greenfield, Hum. Mol. Genet. 12 (Spec No 1) (2003) R1–8.
[2] R. Aebersold, M. Mann, Nature 422 (2003) 198–207.
[3] T. Kislinger, A. Emili, Curr. Opin. Mol. Ther. 5 (2003) 285–293.
[4] M. Tyers, M. Mann, Nature 422 (2003) 193–197.
[5] A.C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J.M. Rick, A.M. Michon, C.M. Cruciat, M.
Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M.A. Heurtier, R.R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, G. Superti-Furga, Nature 415 (2002) 141–147.
[6] Y. Ho, A. Gruhler, A. Heilbut, G.D. Bader, L. Moore, S.L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A.R. Willems, H. Sassi, P.A. Nielsen, K.J. Rasmussen, J.R. Andersen, L.E. Johansen, L.H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B.D. Sorensen, J. Matthiesen, R.C. Hendrickson, F. Gleeson, T. Pawson, M.F. Moran, D. Durocher, M. Mann, C.W. Hogue, D. Figeys, M. Tyers, Nature 415 (2002) 180–183.
[7] H. Ge, Z. Liu, G.M. Church, M. Vidal, Nat. Genet. 29 (2001) 482–486.
[8] V.K. Mootha, J. Bunkenborg, J.V. Olsen, M. Hjerrild, J.R. Wisniewski, E. Stahl, M.S. Bolouri, H.N. Ray, S. Sihag, M. Kamal, N. Patterson, E.S. Lander, M. Mann, Cell 115 (2003) 629–640.
[9] S.P. Gygi, Y. Rochon, B.R. Franza, R. Aebersold, Mol. Cell. Biol. 19 (1999) 1720–1730.
[10] T.J. Griffin, S.P. Gygi, T. Ideker, B. Rist, J. Eng, L. Hood, R. Aebersold, Mol. Cell. Proteomics 1 (2002) 323–333.
[11] G. Chen, T.G. Gharib, C.C. Huang, J.M. Taylor, D.E. Misek, S.L. Kardia, T.J. Giordano, M.D. Iannettoni, M.B. Orringer, S.M. Hanash, D.G. Beer, Mol. Cell. Proteomics 1 (2002) 304–313.
[12] S. Ghaemmaghami, W.K. Huh, K. Bower, R.W. Howson, A. Belle, N. Dephoure, E.K. O'Shea, J.S. Weissman, Nature 425 (2003) 737–741.
[13] D. Greenbaum, R. Jansen, M. Gerstein, Bioinformatics 18 (2002) 585–596.
[14] M.P. Washburn, D. Wolters, J.R. Yates III, Nat. Biotechnol. 19 (2001) 242–247.
[15] T. Kislinger, K. Rahman, D. Radulovic, B. Cox, J. Rossant, A. Emili, Mol. Cell. Proteomics 2 (2003) 96–106.
[16] Y. Pan, T. Kislinger, A.O. Gramolini, E. Zvaritch, E.G. Kranias, D.H. MacLennan, A. Emili, Proc. Natl. Acad. Sci. USA 101 (2004) 2241–2246.
[17] W.A. Tao, R. Aebersold, Curr. Opin. Biotechnol. 14 (2003) 110–118.
[18] M. Schena, D. Shalon, R.W. Davis, P.O. Brown, Science 270 (1995) 467–470.
[19] P. Hegde, R. Qi, K. Abernathy, C. Gay, S. Dharap, R. Gaspard, J.E. Hughes, E. Snesrud, N. Lee, J. Quackenbush, Biotechniques 29 (2000) 548–550 52–4, 56 passim.
[20] J. Quackenbush, Nat. Rev. Genet. 2 (2001) 418–427.
[21] J. Quackenbush, Nat. Genet. 32 (Suppl) (2002) 496–501.
[22] R.J. Lipshutz, S.P. Fodor, T.R. Gingeras, D.J. Lockhart, Nat. Genet. 21 (1999) 20–24.
[23] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, E.L. Brown, Nat. Biotechnol. 14 (1996) 1675–1680.
[24] D.L. Tabb, A. Saraf, J.R. Yates III, Anal. Chem. 75 (2003) 6415–6421.
[25] R.G. Sadygov, J.R. Yates III, Anal. Chem. 75 (2003) 3792–3798.
[26] M.J. MacCoss, C.C. Wu, J.R. Yates III, Anal. Chem. 74 (2002) 5593–5599.
[27] S.W. Taylor, E. Fahy, S.S. Ghosh, Trends Biotechnol. 21 (2003) 82–88.
[28] S. Brunet, P. Thibault, E. Gagnon, P. Kearney, J.J. Bergeron, M. Desjardins, Trends Cell. Biol. 13 (2003) 629–638.
[29] G. Cagney, A. Emili, Nat. Biotechnol. 20 (2002) 163–170.
[30] K.D. Pruitt, K.S. Katz, H. Sicotte, D.R. Maglott, Trends Genet. 16 (2000) 44–47.
[31] R. Apweiler, A. Bairoch, C.H. Wu, W.C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M.J. Martin, D.A. Natale, C. O'Donovan, N. Redaschi, L.S. Yeh, Nucleic Acids Res. 32 (Database issue) (2004) D115–D119.
[32] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, D. Lancet, Trends Genet. 13 (1997) 163.
[33] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, D. Lancet, Bioinformatics 14 (1998) 656–664.
[34] M. Safran, I. Solomon, O. Shmueli, M. Lapidot, S. Shen-Orr, A. Adato, U. Ben-Dor, N. Esterman, N. Rosen, I. Peter, T. Olender, V. Chalifa-Caspi, D. Lancet, Bioinformatics 18 (2002) 1542–1543.
[35] A.I. Su, M.P. Cooke, K.A. Ching, Y. Hakak, J.R. Walker, T. Wiltshire, A.P. Orth, R.G. Vega, L.M. Sapinoso, A. Moqrich, A. Patapoutian, G.M. Hampton, P.G. Schultz, J.B. Hogenesch, Proc. Natl. Acad. Sci. USA 99 (2002) 4465–4470.
[36] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Proc. Natl. Acad. Sci. USA 95 (1998) 14863–14868.
[37] J.K. Eng, A.L. McCormack, J.R.I. Yates, J. Am. Soc. Mass Spectrom. 11 (1994) 976–989.
[38] T.H. Bo, B. Dysvik, I. Jonassen, Nucleic Acids Res. 32 (2004) e34.
[39] S. Oba, M.A. Sato, I. Takemasa, M. Monden, K. Matsubara, S. Ishii, Bioinformatics 19 (2003) 2088–2096.
[40] X. Zhou, X. Wang, E.R. Dougherty, Bioinformatics 19 (2003) 2302–2307.
[41] T.J. Mariani, J.J. Reed, S.D. Shapiro, Am. J. Respir. Cell Mol. Biol. 26 (2002) 541–548.
[42] K.D. Dahlquist, N. Salomonis, K. Vranizan, S.C. Lawlor, B.R. Conklin, Nat. Genet. 31 (2002) 19–20.
[43] B.R. Zeeberg, W. Feng, G. Wang, M.D. Wang, A.T. Fojo, M. Sunshine, S. Narasimhan, D.W. Kane, W.C. Reinhold, S. Lababidi, K.J. Bussey, J. Riss, J.C. Barrett, J.N. Weinstein, Genome Biol. 4 (2003) R28.
[44] F. Al-Shahrour, R. Diaz-Uriarte, J. Dopazo, Bioinformatics 20 (2004) 578–580.
[45] K.G. LeRoch, J.R. Johnson, L. Florens, Y. Zhou, A. Santrosyan, M. Grainger, S.F. Yan, K.C. Williamson, A.A. Holder, D.J. Carucci, J.R. Yates III, E.A. Winzeler, Genome Res. 14 (11) (2004) 2308–2318.
[46] J.R. Johnson, L. Florens, D.J. Carucci, J.R. Yates III, J. Proteome Res. 3 (2004) 296–306.
[47] T.R. Hughes, M. Mao, A.R. Jones, J. Burchard, M.J. Marton, K.W. Shannon, S.M. Lefkowitz, M. Ziman, J.M. Schelter, M.R. Meyer, S. Kobayashi, C. Davis, H. Dai, Y.D. He, S.B. Stephaniants, G. Cavet, W.L. Walker, A. West, E. Coffey, D.D. Shoemaker, R. Stoughton, A.P. Blanchard, S.H. Friend, P.S. Linsley, Nat. Biotechnol. 19 (2001) 324–327.
[48] H. Liu, R.G. Sadygov, J.R. Yates III, Anal. Chem. 76 (2004) 4193–4201.
