

Pathology – Research and Practice 200 (2004) 173–178, doi:10.1016/j.prp.2004.01.012



REVIEW
Bioinformatics in proteomics: application, terminology, and pitfalls
Jan C. Wiemer*, Alexander Prokudin
Europroteome AG, Neuendorfstrasse 24b, 16761 Hennigsdorf, Germany

*Corresponding author. Tel.: +49-3302-202-3273; fax: +49-3302-202-3259. E-mail address: [email protected] (J.C. Wiemer).

Received 19 December 2003; accepted 19 January 2004

Abstract

Bioinformatics applies data mining, i.e., modern computer-based statistics, to biomedical data. It leverages machine learning approaches, such as artificial neural networks, decision trees, and clustering algorithms, and is ideally suited to handling huge amounts of data. In this article, we review the analysis of mass spectrometry data in proteomics, starting with common pre-processing steps and using single decision trees and decision tree ensembles for classification. Special emphasis is put on the pitfall of overfitting, i.e., of generating overly complex single decision trees. Finally, we discuss the pros and cons of the two different decision tree usages.
© 2004 Elsevier GmbH. All rights reserved.

Keywords: Decision trees; Bagging; Mass spectrometry

Introduction

We are currently witnessing impressive progress in biotechnology. New techniques allow us to simultaneously measure many parameters of biological systems, e.g., relative protein concentrations or gene expression levels. These techniques are used in laboratories all over the world, yielding an overwhelming amount of new data. In the long run, this will considerably improve our scientific understanding of living organisms, as it offers many perspectives unseen so far. In particular, we hope to gain insight into the development of human diseases such as cancer.

The analysis of newly available biotechnological data sets constitutes a great challenge, because they are quite different from other types of data. Typically, biotechnological data sets are large and small at the same time: large concerning the great number of system parameters (features) measured simultaneously, and small concerning the number of analyzed cases, which is usually smaller than the number of measured system parameters. On the one hand, this statistically unsatisfactory situation requires special care to prevent premature conclusions pointing to seemingly promising, but wrong, scientific directions. On the other hand, it may raise unrealistic expectations about the goals that can be achieved over the next years.

Mass spectrometry is a promising upcoming technique that determines relative concentrations of proteins in serum, e.g., in the form of surface-enhanced laser desorption ionization time-of-flight mass spectrometry (SELDI TOF-MS) [15]. It seems reasonable to assume that protein concentrations in blood may reflect a person's disease or state of health. The task of disease diagnosis is then to find measurable disease-specific patterns of relative protein concentrations.

What is the adequate approach to find such patterns? In principle, we can follow two opposing strategies: either determine characteristic patterns consisting of only a few most-informative biomarkers, or determine characteristic patterns using all measured intensity signals. Both strategies are somewhat problematic. The first one tries to reduce a complex system to a possibly too simple level of a few system parameters; it is not clear whether such a simple level really exists, and even if it does, there is no general way to find it. The second strategy is problematic because there are usually many unspecific background signals involved, and these easily dominate deduced patterns. This is particularly dangerous if the number of measured cases is not considerably larger than the number of measured features.
The following sections give an overview of the analysis of mass spectrometry data. First, the sequence of necessary data-processing steps is reviewed. Second, two different ways of generating classifiers with decision trees are described. Third, the pros and cons of the two approaches are discussed for the special situation of mass spectrometry.

Processing of mass spectrometry data

The first data-processing steps are usually denoted as "data pre-processing". This summarizes baseline estimation and subtraction, alignment of masses (calibration), and normalization of intensities to achieve comparable mass spectra, as well as peak detection and clustering, yielding a specific coding level for classifier generation. The term "pre-processing" gives the impression of dealing with some minor pre-stages of a consecutive main analysis. However, pre-processing is of special importance, because it comprises the selection of a specific coding format for later stages of data processing. Choosing the right coding format can considerably facilitate the solution of a given task, and having found an optimal coding for a given classification task is close to having solved this task.

The question of how spectra are best read out is still unanswered and the subject of intensive research [6]. After baseline subtraction, calibration, and normalization, peak detection is often conducted: peaks, i.e., pronounced intensity maxima with a sufficient signal-to-noise ratio, are detected and characterized by average mass and maximum intensity values (Fig. 1A). (Mass spectrometry does not yield masses but ratios of mass to charge. For brevity, we simplify and speak of masses.) Alternatively, Fourier or wavelet transformations can be applied to mass spectrometry data, yielding more abstract variables [9].

Fig. 1. Pre-processing of mass spectrum data. (A) Peak detection on the single-spectrum level. Only four local intensity maxima are automatically detected as pronounced peaks with an above-threshold signal-to-noise ratio (marked by arrows). At least two other local intensity maxima are not detected as peaks (marked by *). (B) Peak clustering for five mass spectra: peaks of five normalized and calibrated mass spectra are grouped to form peak clusters. Four peak clusters correspond to the peaks already detected in (A). One additional peak cluster corresponds to an intensity maximum not detected in (A) (first from right).
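For illustration, the peak detection step might be sketched as follows in present-day Python (a minimal sketch, not the procedure of any particular SELDI software; the median-absolute-deviation noise estimate and the signal-to-noise threshold of 5 are arbitrary choices):

```python
import numpy as np
from scipy.signal import find_peaks

def detect_peaks(mz, intensity, snr_threshold=5.0):
    """Detect pronounced local intensity maxima in one mass spectrum.

    A local maximum is kept as a peak only if its height and prominence
    exceed `snr_threshold` times a simple noise estimate (the median
    absolute deviation of the intensities). Returns the peak positions
    (here simply called masses) and their maximum intensities.
    """
    noise = np.median(np.abs(intensity - np.median(intensity)))
    idx, _ = find_peaks(intensity,
                        height=snr_threshold * noise,
                        prominence=snr_threshold * noise)
    return mz[idx], intensity[idx]

# Synthetic spectrum: flat baseline noise plus two clear protein peaks.
rng = np.random.default_rng(0)
mz = np.linspace(2000, 20000, 5000)
intensity = (rng.normal(0.0, 1.0, mz.size)
             + 30 * np.exp(-((mz - 5000) / 30) ** 2)
             + 20 * np.exp(-((mz - 11000) / 40) ** 2))
peak_mz, peak_int = detect_peaks(mz, intensity)
```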
Peaks of different spectra are then assigned to each other and collected as groups (clusters) if they are likely to correspond to the same underlying molecules. This is called peak clustering (Fig. 1B). Peaks are grouped entirely on the basis of spectrum structure. Each cluster corresponds to an interval of masses and is described by a characteristic mass, e.g., corresponding to the mean position of its peaks, and by the set of maximum intensities read out for all spectra.

Peak detection and clustering often go hand in hand. In a first peak detection step, peaks are determined separately for each single mass spectrum. After peak clustering, single mass spectra are reanalyzed, focusing with less strict criteria, e.g., a lower signal-to-noise ratio, on the peak cluster regions. Thereby, initially missed peaks are found, on the assumption that a peak is likely to exist in a spectrum if it has already been found in many other spectra. In Fig. 1A, the right local maximum marked by "*" is confirmed by other spectra, while the left one is not.

Entirely satisfying programs for fully automatic peak detection and clustering are not available. Accordingly, the analysis is usually conducted in two steps: first, peaks are automatically detected and clustered; second, the results are checked visually and adapted manually by experienced staff. The set of all peak clusters serves as the basis for classifier generation.
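The clustering step can be sketched under the simple assumption that peaks of different spectra correspond to the same molecule when their masses agree within a relative tolerance (the 0.3% default is purely illustrative):

```python
def cluster_peaks(peak_lists, tol=0.003):
    """Group peak masses from several spectra into peak clusters.

    `peak_lists` holds, per spectrum, an iterable of detected peak
    masses. Consecutive peaks (over all spectra, sorted by mass) are
    merged into one cluster while their relative mass gap is <= `tol`.
    Returns a list of clusters, each a list of (spectrum_index, mass).
    """
    peaks = sorted((m, i) for i, masses in enumerate(peak_lists)
                   for m in masses)
    clusters, current = [], []
    for m, i in peaks:
        if current and (m - current[-1][1]) / m > tol:
            clusters.append(current)   # mass gap too large: new cluster
            current = []
        current.append((i, m))
    if current:
        clusters.append(current)
    return clusters

# Characteristic mass of each cluster, e.g., the mean peak position;
# the cluster regions could then be re-examined in every spectrum with
# a relaxed signal-to-noise threshold to recover initially missed peaks.
def characteristic_masses(clusters):
    return [sum(m for _, m in c) / len(c) for c in clusters]
```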

In general, a classifier is a system that distinguishes between cases by assigning them to different classes. For example, a classifier distinguishes between sera obtained from healthy and diseased patients. The decision is based on a set of measured features such as peak cluster intensities.
There exist different approaches to classifier generation, e.g., decision trees, artificial neural networks, and support vector machines [2,5,14]. Each approach has its own strengths and weaknesses, and there is no superior approach that can be used for all purposes. An extensive comparison of many approaches based on several data sets was made by Meyer et al. [11]. For a given task, one cannot predict which approach fits best. More important than the choice of the classification approach may be familiarity with the selected approach, to guarantee its proper and reasonable usage. Accordingly, scientists usually base their choice on subjective criteria such as their experience and scientific background. In the following section, we focus on decision trees. Still, many of the following arguments and insights are also applicable to other approaches, e.g., the problem of overfitting and its circumvention by classifier ensembles.

Decision trees

The decision tree approach is a procedure of stepwise data set partitioning (Fig. 2). Starting with a given data set containing approximately the same number of cases belonging to two different classes (Fig. 2A, top), e.g., peak cluster intensities from mass spectra of healthy and diseased patients, we examine the usefulness of all possible cut-off values of all features, i.e., peak clusters, for separating cases from different classes. Each feature-specific cut-off between two data points corresponds to two classifiers: one classifier that assigns class "positive" ("negative") to cases with values smaller (larger) than the cut-off, and the opposite classifier that assigns class "negative" ("positive") to cases with values below (above) the cut-off (Fig. 2A, middle). The number of thereby correctly classified cases gives us some information regarding the usefulness of this cut-off. (Actually, not only the number of correctly classified cases is used to measure the usefulness of a feature-specific cut-off; additionally, strategic tree-building aspects are considered that also evaluate the effect of a cut-off on possible consecutive further partitioning steps [5].) Having examined all different cut-offs along all measured features, we select the most useful cut-off/feature pair. In Fig. 2A (bottom), the most useful cut-off is marked by "*", yielding a classifier with only three misclassifications. Thereby, sub-data sets of improved class homogeneity are obtained, e.g., the sub-data sets I and II in Fig. 2B. The process of data partitioning is repeated until sufficiently class-homogeneous sub-data sets of still acceptable size are obtained. Fig. 2B shows an example of two consecutively applied partitionings, yielding three final sub-data sets denoted as "terminal nodes" I–III.

Fig. 2. Decision tree generation. (A) A finite number of different classifiers can be generated for a given data set (filled black circles: class 1; filled gray circles: class 2). Each cut-off corresponds to two classifiers: one classifier assigning class "positive" ("negative") to values below (above) the cut-off and a second classifier with the opposite class assignment (middle). For decision tree generation, all different cut-offs are evaluated for all single features, and the most useful cut-off/feature pair is selected (bottom, most useful cut-off marked by *). (B) Exemplary decision tree consisting of two applied cut-off/mass pairs, leading to three terminal nodes hosting sub-data sets II, III, and IV.
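The exhaustive cut-off search for a single partitioning step might look as follows (a minimal sketch that scores cut-offs only by the number of correctly classified cases, deliberately ignoring the strategic tree-building criteria of CART [5]):

```python
import numpy as np

def best_stump(X, y):
    """Exhaustively search all cut-off/feature pairs.

    X: (n_cases, n_features) peak cluster intensities; y: 0/1 labels.
    Each cut-off is scored by the number of correctly classified cases,
    taking the better of the two opposite class assignments.
    Returns (feature index, cut-off value, number of correct cases).
    """
    best = (None, None, -1)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cut-offs lie between neighboring data points.
        for cut in (values[:-1] + values[1:]) / 2:
            below = X[:, j] < cut
            correct = max(np.sum(y[below] == 0) + np.sum(y[~below] == 1),
                          np.sum(y[below] == 1) + np.sum(y[~below] == 0))
            if correct > best[2]:
                best = (j, cut, correct)
    return best
```

Applying such a search recursively to the resulting sub-data sets yields the full decision tree.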
Decision trees should be distinguished from other, more basic multi-feature approaches that first evaluate single features on their own and then construct classifiers on the basis of a set of high-performing masses [16]. The advantage of decision trees is that they can detect sets of features that complement each other in classification. Their disadvantage lies in the risk of generating overfitted trees, which is discussed below.

The central question in decision tree generation is to what extent data sets should be partitioned, i.e., how complex the resulting decision trees should be.

In principle, one can grow decision trees until all sub-sets obtained are class-homogeneous, i.e., until all terminal nodes consist of cases of only one class. This corresponds to 100% sensitivity and 100% specificity on the actual data set used for decision tree generation. However, statistically, many splitting criteria determined in this way are undependable and represent peculiarities of the data set analyzed; they do not generalize to other data sets consisting of different measurements, e.g., mass spectra of serum from new patients analyzed under similar conditions. This phenomenon is called "overfitting": the tree is too strongly fitted to the actual data set and is unlikely to be adequate for data unseen to date [8].

The presentation of overfitted classification models is quite common in the literature. However, this is often difficult to recognize, and only closer data analysis can reveal the difference. In mass spectrometry, typically, the number of patients is approximately the same as the number of peak clusters, or even less. In such constellations, it is of special importance to prevent overfitting. For example, Adam et al. [1] and Rai et al. [13] probably applied overfitted decision trees to mass spectrometry data. (The authors seemed to be aware of this shortcoming and also published their results using bootstrapping and boosting for sound statistical analysis [12,10].)

We are interested only in decision trees that are not overfitted, i.e., decision trees whose splitting criteria do not represent peculiarities of our actual data set, but typical aspects of the underlying patient population. Among other aspects, such as the distribution of values in the data set, the number of extractable splitting criteria depends on the number of patients in the data set: the more patients with mass spectra are available, the more splitting criteria can be deduced. The situation is sketched in Fig. 3A. Using many splitting criteria in a single decision tree, the number of misclassifications in the data set used for tree generation can be reduced to zero. However, only the first few criteria generalize to unseen data; in the example provided in Fig. 3, two criteria are optimal.

Fig. 3. Overfitting. (A) Using the single decision tree approach, the number of misclassifications on the training data set, i.e., on the data set used for decision tree generation, decreases monotonically with the number of splits per single decision tree. However, due to the finite size of the training data set, there are usually data peculiarities that are not typical of the underlying data population and, accordingly, do not generalize to new test data. In the example, the best performance on test data is obtained using only two splits per decision tree. Decision trees with more than two splits are overfitted. (B) One of five data set partitionings for cross validation (n = 5).
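The behavior sketched in Fig. 3A is easy to reproduce on synthetic data (a present-day scikit-learn sketch; the data set size mirrors the constellation of roughly as many peak clusters as patients, and only two features are actually informative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 100 "spectra" with 100 "peak clusters"; the class label depends on
# two features only, so two splitting criteria should be optimal.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 100))
y = ((X[:, 0] > 0.2) & (X[:, 1] > -0.5)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for n_splits in (1, 2, 5, 20):
    # A binary tree built by k successive splits has k + 1 terminal nodes.
    tree = DecisionTreeClassifier(max_leaf_nodes=n_splits + 1,
                                  random_state=0).fit(X_train, y_train)
    print(n_splits,
          round(tree.score(X_train, y_train), 2),  # rises with more splits
          round(tree.score(X_test, y_test), 2))    # typically peaks early
```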
Overfitting should be prevented by stopping criteria: conditions that specify when tree growing should be stopped, e.g., when all terminal nodes contain fewer than five patients. A better way to estimate an optimal number of splitting criteria is cross validation: first, the data set is partitioned into n sub-sets of approximately equal size. Then, decision trees of different complexity, i.e., with different numbers of splits, are generated using only n − 1 sub-sets and evaluated using the remaining sub-set (Fig. 3B). This process of tree generation is conducted n times, so that each mass spectrum is used n − 1 times for tree generation and once for tree evaluation; typically, n = 10. The performance of trees with the same number of splits is averaged over all n runs, and the optimal number of splits per tree is obtained by choosing the tree structure with the smallest overall classification error on the respective test sets.
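This model selection scheme translates directly into code (again a scikit-learn sketch with synthetic stand-ins for the intensity matrix and diagnosis labels; accuracy stands in for the classification error):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 100))
y = ((X[:, 0] > 0.2) & (X[:, 1] > -0.5)).astype(int)

# For each tree complexity, average the performance over n = 10 folds:
# each spectrum is used 9 times for tree generation, once for evaluation.
cv_accuracy = {}
for n_splits in range(1, 11):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_splits + 1,
                                  random_state=0)
    cv_accuracy[n_splits] = cross_val_score(tree, X, y, cv=10).mean()

best_n_splits = max(cv_accuracy, key=cv_accuracy.get)
```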
The decision tree approach is a forward variable selection procedure yielding strictly hierarchical sets of peak clusters and splitting criteria. The first partitioning is conducted on the entire data set: all single peak clusters compete against each other, and the best peak cluster is selected. All successive partitionings are less well-founded statistically, because they are based on smaller sub-sets. Moreover, these splits depend on at least one earlier partitioning. Therefore, forward variable selection procedures lack robustness: small changes in the data set used for tree generation can lead to great changes in the resulting tree.

The lack of decision tree robustness can be compensated by building ensemble classifiers, also referred to as "committees of experts". They consist of many decision trees generated by using slightly varying data sets.

Bagging, boosting, and Random Forest are ensemble classifier approaches that introduce different data set variations [3,4,8]. Bagging applies so-called bootstrapping: from the overall data set consisting of N mass spectra, we generate new data sets by randomly selecting, with replacement, N mass spectra for each new data set (Fig. 4A). Accordingly, we obtain various data sets in which some mass spectra are included multiple times, while others are missing: on average, 63% of the mass spectra are included at least once in a resulting data set, while 37% are left out [8]. Each data set is used to generate one decision tree; for example, three different decision tree topologies are presented in Fig. 4B. The large number of different decision trees constitutes the overall ensemble classifier, whose classification result is determined by majority vote: mass spectrum data are propagated along all single decision trees and assigned to the most frequent single-tree classification result (Fig. 4C). The introduction of data set variations prevents overfitting without the need to determine an optimal number of splits per decision tree. High decision tree complexity does not lead to overfitting in ensemble classifiers, because random splits tend to average out. As a general rule of tree generation for ensemble classifiers, splits of questionable value should be included, because additional splits do not harm overall performance, whereas missing splits may reduce it.

Fig. 4. Generation of decision tree ensembles. (A) Data set variations for bagging. (B) Decision trees with different splitting criteria and tree topologies are obtained for varied data sets. (C) Collecting many different single decision trees to form one ensemble classifier. In the simplest and most common case, all decision trees are weighted equally. Then, the overall decision of the entire ensemble classifier corresponds to the most frequent class assignment by the single decision trees.
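A bagging ensemble with majority voting can be sketched in a few lines (the helper names `bagged_trees` and `predict_majority` are our own; scikit-learn's `BaggingClassifier` packages the same idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=50, seed=0):
    """Grow one fully developed decision tree per bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # N spectra drawn with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Ensemble decision: the most frequent class over all single trees."""
    votes = np.stack([t.predict(X) for t in trees])  # (n_trees, n_cases)
    return (votes.mean(axis=0) > 0.5).astype(int)

# The 63%/37% figure follows from sampling with replacement: a given
# spectrum is missed by one draw with probability 1 - 1/N, hence by all
# N draws with probability (1 - 1/N)**N, which tends to 1/e ~ 0.37.
```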
Discussion

Modern science is heading for multidisciplinarity: the most challenging scientific questions require the collaboration of experts from many different disciplines, e.g., the development of diagnostic tools on the basis of
biotechnology. There is no single data mining approach
that can be recommended and that should be applied to
all projects. Instead, complex scientific questions require
specifically adapted statistical analyses. For this, discus-
sions between researchers who are experts in different
fields, e.g., biologists, physicians, physicists, chemists,
mathematicians, and computer scientists, as well as an
extensive exchange of complementing perspectives, are
adjuvant [17]. In the following, we discuss the contexts
in which single decision trees should be applied and
when decision tree ensembles should be preferred.
At present, the additional numerical effort required
for the generation of decision tree ensembles is
unproblematic: the time for bagging is acceptable even
on personal computers with average performance. For
example, the generation of an ensemble of 50 decision
trees on a data set consisting of 100 mass spectra with
100 peak clusters takes less than a minute on a 1 GHz
Pentium processor using the CART software provided
by Salford Systems.
Whether a single decision tree or an ensemble of decision trees should be trained for the classification task at hand depends on the aims of the project. If the aim is to generate a classifier that achieves the highest possible performance, an ensemble of decision trees should be generated [7]. For example, diagnostic tests could be developed on the basis of mass spectrometry: for each new sample, the entire mass spectrum could be determined. In this setting, the number of peak clusters used by the classifier is not important.
If additional requirements have to be met, e.g., when the resulting classifier must be structured as simply and transparently as possible, single decision trees pruned with cross validation are the first choice. Transparency facilitates system acceptance, particularly in high-responsibility tasks such as classifying patients into therapeutic groups. Furthermore, diagnostic tests could be developed by first applying mass spectrometry

as a screening method to search for useful biomarkers, and then by transferring the results to standard analysis platforms such as the enzyme-linked immunosorbent assay. The latter step requires protein identification for the peak clusters used in classifiers, a time-consuming and expensive procedure that can be conducted for only a few peak clusters.

Acknowledgements

We thank Jörn Meuer for fruitful discussions on mass spectrometry data, for his comments on the manuscript, and for providing Fig. 1. We are grateful to Volker Seibert for proof-reading the manuscript.

References

[1] B.L. Adam, et al., Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men, Cancer Res. 62 (2002) 3609–3614.
[2] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, New York, 1995.
[3] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[4] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[5] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth International, Belmont, California, 1984.
[6] M.J. Campa, M.J. Fitzgerald, E.F. Patz (Eds.), Mining MALDI-TOF Data, Proteomics 3 (2003) 1659–1832.
[7] M.P.A. Ebert, J. Meuer, J.C. Wiemer, U.G. Traugott, P. Malfertheiner, H.U. Schulz, M.A. Reymond, C. Röcken, Identification of gastric cancer patients by serum protein profiling, 2004, submitted for publication.
[8] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, New York, 2001.
[9] K.R. Lee, X. Lin, D.C. Park, S. Eslava, Megavariate data analysis of mass spectrometric proteomics data using latent variable projection method, Proteomics 3 (2003) 1680–1686.
[10] J. Li, Z. Zhang, J. Rosenzweig, Y.Y. Wang, D.W. Chan, Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer, Clin. Chem. 48 (2002) 1296–1304.
[11] D. Meyer, F. Leisch, K. Hornik, Benchmarking support vector machines, Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science", No. 78, Wirtschaftsuniversität Wien, 2002.
[12] Y. Qu, et al., Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients, Clin. Chem. 48 (2002) 1835–1843.
[13] A.J. Rai, et al., Proteomic approaches to tumor marker discovery, Arch. Pathol. Lab. Med. 126 (2002) 1518–1526.
[14] B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, Massachusetts, 2002.
[15] V. Seibert, A. Wiesner, T. Buschmann, J. Meuer, Surface-enhanced laser desorption ionization time-of-flight mass spectrometry (SELDI TOF-MS) and ProteinChip technology in proteomic research, submitted for publication.
[16] A. Vlahou, et al., Development of a novel proteomic approach for the detection of transitional cell carcinoma of the bladder in urine, Am. J. Pathol. 158 (2001) 1491–1502.
[17] J. Wiemer, F. Schubert, M. Granzow, T. Ragg, J. Fieres, J. Mattes, R. Eils, Informatics united: exemplary studies combining medical informatics, neuroinformatics and bioinformatics, Meth. Inf. Med. 42 (2003) 126–133.
