Bioinformatics strategies for proteomic profiling
C. Nicole White,* Daniel W. Chan, and Zhen Zhang
Department of Pathology, Johns Hopkins Medical Institutions, Baltimore, MD, 21231, USA
Accepted 4 May 2004
Clinical proteomics is an emerging field that involves the analysis of protein expression profiles of clinical samples for de novo discovery
of disease-associated biomarkers and for gaining insight into the biology of disease processes. Mass spectrometry represents an important set
of technologies for protein expression measurement. Among them, surface-enhanced laser desorption/ionization time-of-flight mass
spectrometry (SELDI TOF-MS), because of its high throughput and on-chip sample processing capability, has become a popular tool for
clinical proteomics. Bioinformatics plays a critical role in the analysis of SELDI data, and therefore, it is important to understand the issues
associated with the analysis of clinical proteomic data. In this review, we discuss such issues and the bioinformatics strategies used for
proteomic profiling.

ibility is another limiting factor. While the 2D gel technology has been available for over two decades and has been widely used, it is being replaced by new technologies optimized for proteomic profiling. Protein/antibody arrays hold considerable promise for functional proteomics and expression profiling, but problems limit their utility as well. Limitations include the lack of high-throughput technologies to express and purify proteins and to generate large sets of well-characterized antibodies [14]. Further research and development are required before the promise of this technique can be fully realized.

Mass spectrometry (MS) successfully addresses the throughput limitation of 2D gels and eliminates the need to purify, identify, and develop antibodies to proteins before proteomic profiling experimentation. SELDI is an affinity-based MS method in which proteins are selectively adsorbed to a chemically modified surface and impurities are removed by washing with buffer [15]. By combining different ProteinChip® array surfaces and wash conditions, SELDI allows on-chip protein capture and micropurification, thereby permitting high-throughput protein expression analysis of a large number of clinical samples [15]. After preprocessing steps involving mass calibration, baseline subtraction, and peak detection, the mass spectra from n individual samples are converted into peak intensity data typically organized as an m (peaks) × n (samples) matrix, where the intensity of a peak corresponds to the relative abundance of proteins at a molecular mass expressed as mass-to-charge ratio, or m/z. A proteomic profile constructed using the SELDI platform is typically a subset of these m peaks and is constructed using the intensity data of the peaks. SELDI has been used to evaluate specimens from nipple aspirates [16], serum [1–5], urine [7,9], and lysed tissue extracts [17] with little requirement for sample-specific preparation.

While SELDI has several strengths, including the types of biological fluids that may be evaluated, ease of sample preparation, and high-throughput capabilities, the technique also has weaknesses. Issues include the reproducibility of mass spectrometry in general, and of SELDI in particular [3,18,19], and the fact that peak amplitudes are measurements of relative protein abundance rather than absolute quantitative measures. Quality control (QC) measures that monitor peak amplitude can identify unacceptable deviations and are discussed later. Another limitation of the SELDI PBS-II mass reader is that protein identities are not returned as the protein profile is collected [18]. The proteins included in a biomarker pattern should be identified to further the understanding of the biology of disease. However, the SELDI profile per se may be useful in the clinical setting when the pattern itself has diagnostic or prognostic significance.

Data variability

Among the issues associated with expression profiling using clinical samples, systematic biases arising from preanalytical variables can be among the most damaging. While careful statistical examination of results and their correlation with possible non-disease-related variables may reveal the existence of biases, no amount of statistical or computational processing can correct such problems within a single set of samples collected under the same conditions. Therefore, errors from systematic differences among samples should be minimized or eliminated whenever possible through good experimental design, careful analysis procedures, and quality control protocols.

Examples of non-disease-associated factors include (1) within-class biological variability, which may include unknown subphenotypes among study populations; (2) preanalytical variables such as systematic differences in study populations and/or in sample collection, handling, and preprocessing procedures; (3) analytical variables such as inconsistent instrument conditions that result in poor reproducibility; and (4) measurement imprecision.

Biological variability

When a protein profiling experiment is used for de novo discovery, an adequate sample size is of utmost importance. Defining “adequate” can be tenuous, though, when no prior knowledge exists about the proteins the researcher is studying. While several different methods are available for defining sample size [20,21], none works particularly well in the context of proteomic profiling because so little is known about the complexity of the final model before an experiment is completed, or about the protein peaks included in that model. Because SELDI is a high-throughput platform, sample size is often limited by the availability of samples rather than the resources available to examine those samples.

Regardless of the number of tested samples, the consistency of protein peak amplitudes should be evaluated within the disease groups. One method to evaluate the consistency of a peak or peaks within a single data set is through bootstrap analysis. Bootstrap analysis involves resampling, with replacement, from the experimental data [22]. This can reveal, for instance, whether a small percentage of the cancer samples makes the full cancer group appear statistically different from the noncancer group. This technique can be used in conjunction with almost any analysis/modeling technique.

Site or center variability

Collection practices, sample handling, or storage conditions may be different from institution to institution, and such differences may influence the proteins present in biological fluids [9,18]. Since these biases are often specific to institutions (sites), the use of specimens from multiple institutions combined with sound study design is the preferred approach to discover biomarkers that are truly associated with the disease process. Generally, samples from multiple sites are randomly divided into a discovery, or training, data set and a
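
As a minimal sketch of the bootstrap analysis described under Biological variability, the following Python code resamples the intensities of a single peak, with replacement, within each group and records the difference in group means for every resample. The function names, the choice of the mean difference as the statistic, the percentile interval, and the intensity values are illustrative assumptions and are not taken from the cited studies.

```python
import random
import statistics


def bootstrap_mean_difference(cases, controls, n_resamples=2000, seed=0):
    """Resample each group with replacement; return the list of mean differences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        case_draw = rng.choices(cases, k=len(cases))
        control_draw = rng.choices(controls, k=len(controls))
        diffs.append(statistics.fmean(case_draw) - statistics.fmean(control_draw))
    return diffs


def percentile_interval(values, lower=2.5, upper=97.5):
    """Simple percentile interval taken from the sorted bootstrap values."""
    ordered = sorted(values)

    def pick(percent):
        return ordered[round(percent / 100 * (len(ordered) - 1))]

    return pick(lower), pick(upper)


# Hypothetical intensities for one peak: two cancer samples are far higher
# than the rest, so the apparent group difference rests on those two samples.
cancer = [1.0, 1.1, 0.9, 1.0, 1.2, 0.95, 1.05, 1.0, 9.0, 10.0]
noncancer = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0, 1.05]

diffs = bootstrap_mean_difference(cancer, noncancer)
print(percentile_interval(diffs))
```

A wide interval, or one whose lower bound approaches zero, would suggest that the difference between groups depends on a small number of samples rather than on the group as a whole.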

performance, however, must be compared not only during one experiment, but also over the course of time. One large specimen pool with volume to last at least several months should be aliquoted and frozen. An aliquot should be tested in each experimental run. By sampling from one uniform specimen and assuring that all samples go through the same number of freeze/thaw cycles, the (theoretically) same spectra may be compared for changes over time. QC procedures and rules well-established in clinical laboratory testing can also be applied in the research setting [24].

Adverse effects caused by excessive variation and poor experimental design

(1) Bias can be introduced at any experimental step. It could include changes to medication or lifestyle in response to cancer diagnosis [23] or differing sample collection procedures for samples and controls. Systematic errors can artificially worsen or improve a profile’s discriminatory accuracy. It is impossible to evaluate if the profile’s accuracy is based on disease pathology or bias if systematic differences between samples are introduced before sample processing. However, QC procedures may differentiate between the two if they are introduced at or following SELDI chip preparation.
(2) Excessive variability in spectral peak amplitude will substantially degrade the utility of experimental results. A case in point is the 50–60% imprecision reported by Yasui et al. [8]. Such poor CVs forced the authors to forgo quantitative analysis and revert to the much less discriminating presence or absence of the peak as the outcome measure.
(3) Mass shifts caused by machine variability and processing error lead to poor mass accuracy. Several groups report mass accuracy around 0.1% [2,8,17], but mass shifts can be greater, particularly in large experiments where chips are read over the course of a week rather than a day. With increased mass shifting, the researcher cannot rely on peak detection via available software packages. Ciphergen’s automated peak detection software assumes each peak m/z varies less than 0.3%. When poor mass accuracy is observed, the researcher must select peaks manually. Manual peak selection can be more accurate than the software but is tedious and time-consuming.

Strategies for analysis

One of the common characteristics of expression profile data is high dimensionality in comparison to a relatively small sample size. This characteristic was uncommon before the development of microarrays and necessitated the recent development of novel methods to analyze profiling data. Several of these methods are described below, including those used to evaluate the stability of identified candidate biomarkers through bootstrap analysis and/or validation data sets. While other analysis techniques exist, the methods selected here are some of the most popular and well-published methods available.

The most rudimentary statistical analysis involves univariate tests. Besides traditional statistics like the t test and its nonparametric equivalent, the Wilcoxon test, univariate methods have been developed specifically for the analysis of expression profiles. These determine the significance of observed changes while accounting for the large number of variables [6,25]. For example, the univariate methods developed for microarray analysis but applicable to proteomics assess the significance of discriminatory profiles by evaluating permutations of repeated measurements to estimate the percent of changes that would be identified by chance, known as the false discovery rate [25]. While this approach improves a researcher’s ability to identify statistically significant changes in expression, it cannot account for the interdependence of the variables.

It is plausible to assume on biological grounds that the proteins present in the proteomic profile are not fully independent of each other in vivo. For this reason, a multivariate approach to analysis is preferred because it can address the correlations among variables. Unfortunately, one of the strengths of proteomic analysis, namely, the large number of variables that can be measured simultaneously, becomes a limitation for this type of analysis. The large number of variables compared with the (usually) small number of observations results in an unstable estimate of the covariance matrix. Simultaneous multivariate analysis requires a stable estimate of the covariance matrix.

To use a multivariate approach and circumvent the issue of covariance matrix estimation, a dimension reduction step is employed. Dimension reduction methods project a large number of genes or proteins onto a smaller and more manageable number of clusters [26,27], or some type of supervariable [28]. The conditional density function can be used to construct a decision rule. This decision rule combines peak intensities to cluster the samples into diseased or nondiseased clusters. In real-world experiments, it is rare that all samples can be classified correctly, so the probability of incorrectly classified samples is calculated as the probability of error [29].

Some of the most commonly used dimension-reduction techniques employ clustering methods such as principal component analysis (PCA) [2,6,16,30]. PCA is an unsupervised analysis tool: samples are classified without including disease status in the training algorithm. In PCA, the training samples, regardless of their relative location to the underlying class boundaries in the variable space, contribute equally to the estimation of the data distributions and the classification function. On the other hand, in some supervised approaches to dimension reduction, the samples that are close to the boundaries are weighted much more heavily than the interior samples. As an extreme example, the support vector machine (SVM) [31,32] model solutions
are solely determined by the support vectors that consist only of the boundary data points. The removal of interior samples does not affect the solution at all. Because each clinical sample represents a considerable amount of effort and cost, limiting analysis to the support vectors is not the most efficient use of information. In addition, analysis that relies solely on the support vectors could be very sensitive to labeling errors in the training samples of small-sample studies. However, for the purpose of data classification rather than representation, inaccuracies may be introduced by treating all samples equally.

With the above shortcomings in mind, the unified maximum separability analysis (UMSA) algorithm was developed for genomic and proteomic expression data [1,5,33]. The conceptual framework of UMSA is very straightforward. In the original SVM learning algorithm [31], a constant, C, limits the maximum influence of any sample point on the final SVM model solution. In UMSA, this constant becomes an individualized parameter for each data point to incorporate additional statistical information about the data point’s position relative to the distribution of all classes of samples. The rationale behind UMSA is that information about the overall data distribution (although the estimation itself might not be perfect) can be used to prequalify the trustworthiness of any training sample to be a support vector. The final solution, therefore, will rely on the weighted contributions of the support vectors and be less sensitive to labeling errors of a small percentage of samples.

The construction of a linear UMSA classifier provides a supervised multivariate method to rank a large number of variables. Similar to unsupervised component analysis methods such as PCA, the UMSA-based procedure is also a linear projection of data. However, in PCA, the axes in the new space represent directions along which the data demonstrate maximum variations. In UMSA component analysis, the new axes represent directions along which the two classes of data are best separated by linear classification. When it is used for dimension reduction, the smaller number of axes may be viewed as composite features that retain most of the information relevant to the separation of data classes. The UMSA-based software system has been used for genomic expression data analysis [33] and more recently for biomarker discovery using clinical proteomic profiling data [1,5,10,11] generated by SELDI.

After selecting a limited number of protein peaks as candidate biomarkers, more traditional linear modeling techniques may be used. Methods such as logistic regression [34,35] allow the user to define an equation describing the relationship between protein peaks as well as explicitly evaluating the significance of each peak’s contribution toward the multivariate relationship.

Regardless of the modeling and/or analysis technique used, it is advisable to complete an additional step and evaluate the robustness of the final model. Procedures available to complete this step are determined by the sample size of the study. Many groups split their full data set using a stratified random sampling procedure [2,4,6,11]. Two data sets are constructed from the original data set, and the decision rule derived using one data set is tested in the second data set. Testing the decision rule is necessary because of sample or biological variability.

If the number of samples collected for the study is too small for a stratified random sampling procedure, bootstrap analysis [5,23] may be performed. These procedures provide the advantage of using the entire data set during discovery and validation. However, because the two data sets are not independent, the results may be overly optimistic and difficult to verify in a second, independent study. Overfitting can be a serious problem for complex multivariate models and may amplify the effect of non-disease-associated data variability on the analysis results. Additionally, bootstrap analysis may be used during the discovery phase alone to evaluate peak consistency before validation is completed [11] or may be used to identify several different, but equally performing, sets of candidate biomarkers [6]. Regardless of what technique is employed, verification of analysis results should be completed before proteomic profiling analysis is published.

Summary

Advances in high-throughput technologies, such as SELDI, have made it possible to obtain expression profiles of a large number of proteins using clinical samples. Recent reports have raised the expectation for the application of proteomic profiling to clinical diagnostics. In this paper, we have reviewed and discussed several critical and often overlooked issues in translating results from proteomic profiling to biomarker discovery and to eventual clinical applications.

The clinical evaluation of a diagnostic test relies mostly on its efficacy in terms of positive and negative predictive values in a targeted population. It has been proposed that if a particular protein expression pattern can be associated with a disease condition, the pattern itself could be used as a diagnostic test. However, the identification of the constituent molecular entities, as part of the process of biomarker discovery, will not only facilitate assay development for large-scale validation and clinical use, but also help the understanding of the biology of the disease itself and lead to additional discoveries.

Improvement in profiling technologies, better quality control procedures, and the proper use of sophisticated bioinformatics and statistical tools are all important factors to ensure true discoveries in clinical proteomics. However, the single most significant factor that affects the discovery and verification of candidate protein expression patterns or biomarkers is the selection of clinical samples. False results can be obtained from studies using a sample set from a single institution that exhibits significant and systematic biases due to sample inclusion/exclusion criteria and/or specimen han-
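
The stratified random sampling procedure mentioned under Strategies for analysis could be sketched as follows. This is only an illustration under stated assumptions: the function name, the `test_fraction` and `seed` parameters, and the class labels are hypothetical, and the cited studies may have split their data differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.5, seed=0):
    """Split sample indices into training and test sets, class by class.

    Each class contributes the same fraction of its samples to the test set,
    so both data sets keep roughly the original class proportions.
    """
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical study with 12 cancer and 8 control samples.
labels = ["cancer"] * 12 + ["control"] * 8
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=42)
print(len(train_idx), len(test_idx))
```

The decision rule would then be derived from the samples indexed by `train_idx` and tested only on those indexed by `test_idx`, keeping the two data sets separate.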
