REVIEW
Bioinformatics in proteomics: application, terminology, and pitfalls
Jan C. Wiemer*, Alexander Prokudin
Europroteome AG, Neuendorfstrasse 24b, Hennigsdorf 16761, Germany
Received 19 December 2003; accepted 19 January 2004
Abstract
Bioinformatics applies data mining, i.e., modern computer-based statistics, to biomedical data. It leverages machine learning approaches, such as artificial neural networks, decision trees, and clustering algorithms, and is ideally suited for handling huge amounts of data. In this article, we review the analysis of mass spectrometry data in proteomics, starting with common pre-processing steps and using single decision trees and decision tree ensembles for classification. Special emphasis is placed on the pitfall of overfitting, i.e., on generating overly complex single decision trees. Finally, we discuss the pros and cons of the two different ways of using decision trees.
© 2004 Elsevier GmbH. All rights reserved.
doi:10.1016/j.prp.2004.01.012
Decision trees
Ensemble methods such as bagging and random forests are classifier approaches that introduce different data set variations [3,4,8]. Bagging applies so-called bootstrapping: from the overall data set consisting of N mass spectra, we generate new data sets by randomly selecting, with replacement, N mass spectra for each new data set (Fig. 4A). Accordingly, we obtain various data sets in which some mass spectra are included multiple times, while others are missing: on average, 63% of the mass spectra are included at least once in a resulting data set, while 37% are left out [8]. Each data set is used to generate one decision tree; for example, three different decision tree topologies are presented in Fig. 4B. The large number of different decision trees constitutes the overall ensemble classifier, whose classification result is determined by majority vote: a mass spectrum is propagated through all single decision trees and assigned to the most frequent single-tree classification result (Fig. 4C).

The introduction of data set variations prevents overfitting without the need to determine an optimal number of splits per decision tree. High decision tree complexity does not lead to overfitting in ensemble classifiers, because random splits tend to average out. As a general rule of tree generation for ensemble classifiers, splits of questionable value should be included, because additional splits do not harm overall performance, whereas missing splits may reduce it.
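The 63%/37% figures quoted above follow directly from the bootstrap: the probability that a given spectrum is never drawn in N draws with replacement is (1 - 1/N)^N, which approaches 1/e ≈ 0.37 for large N, so roughly 63% of the spectra appear at least once in each replicate. The following sketch illustrates the procedure of Fig. 4, assuming a feature matrix X of peak-cluster intensities (one row per mass spectrum) and integer-coded class labels y; it uses scikit-learn's CART-style decision trees as a stand-in for the Salford Systems CART software mentioned later in the Discussion, and the function names are ours, for illustration only.

```python
# Minimal bagging sketch (not the authors' code): one unpruned tree per
# bootstrap replicate, combined by majority vote as in Fig. 4.
# Assumes X is an array of peak-cluster intensities (spectra x clusters)
# and y holds integer-coded class labels (e.g. 0 = control, 1 = tumor).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=50, random_state=0):
    """Fig. 4A/B: draw N spectra with replacement, grow one tree per replicate."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)      # bootstrap sample of size N
        tree = DecisionTreeClassifier()       # deliberately left unpruned
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def majority_vote(trees, X_new):
    """Fig. 4C: every tree votes; the most frequent class label wins."""
    votes = np.stack([t.predict(X_new) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

A call such as majority_vote(bagged_trees(X_train, y_train), X_test) then returns the ensemble's class assignments for new spectra.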
Discussion

Modern science is heading toward multidisciplinarity: the most challenging scientific questions require the collaboration of experts from many different disciplines, for example in the development of diagnostic tools on the basis of biotechnology. There is no single data mining approach that can be recommended for, and applied to, all projects. Instead, complex scientific questions require specifically adapted statistical analyses. For this, discussions between researchers who are experts in different fields, e.g., biologists, physicians, physicists, chemists, mathematicians, and computer scientists, as well as an extensive exchange of complementary perspectives, are helpful [17]. In the following, we discuss the contexts in which single decision trees should be applied and when decision tree ensembles should be preferred.
At present, the additional computational effort required for generating decision tree ensembles is unproblematic: the time needed for bagging is acceptable even on personal computers of average performance. For example, generating an ensemble of 50 decision trees on a data set consisting of 100 mass spectra with 100 peak clusters takes less than a minute on a 1 GHz Pentium processor using the CART software provided by Salford Systems.
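To make the scale of this claim easy to check on current hardware, here is a rough timing sketch on synthetic data of the same size (100 spectra, 100 peak clusters, 50 trees). It uses scikit-learn's BaggingClassifier, whose default base learner is a CART-style decision tree, rather than the Salford Systems software used in the article, and the random data are purely illustrative.

```python
# Rough timing check for the example in the text: 50 bagged trees on
# 100 spectra with 100 peak clusters. Synthetic data and scikit-learn's
# BaggingClassifier stand in for real spectra and the CART software.
import time
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 100))       # 100 spectra x 100 peak clusters
y = rng.integers(0, 2, size=100)      # binary labels, e.g. tumor vs. control

start = time.perf_counter()
BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(f"Training took {time.perf_counter() - start:.2f} s")
```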
Whether a single decision tree or an ensemble of decision trees should be trained for the classification task at hand depends on the aims of the project. If the aim is a classifier that achieves the highest possible performance, an ensemble of decision trees should be generated [7]. For example, diagnostic tests could be developed on the basis of mass spectrometry: for each new sample, the entire mass spectrum could be determined, so the number of peak clusters used by the subsequent classifier is not important.
Fig. 4. Generation of decision tree ensembles. (A) Data set variations for bagging. (B) Decision trees with different splitting criteria and tree topologies are obtained for the varied data sets. (C) Many different single decision trees are collected to form one ensemble classifier. In the simplest and most common case, all decision trees are weighted equally; the overall decision of the entire ensemble classifier then corresponds to the most frequent class assignment by the single decision trees.

If additional requirements have to be met, e.g., when the resulting classifier must be as simple and transparent as possible, single decision trees pruned with cross-validation are the first choice. Transparency facilitates system acceptance, particularly in high-responsibility tasks such as classifying patients into therapeutic groups.
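As one concrete reading of "pruned with cross-validation", the sketch below selects the pruning strength of a single tree by cross-validated cost-complexity pruning, the pruning family introduced with classical CART [5]; the scikit-learn API, the alpha grid, and the 10-fold setting are our assumptions for illustration, not details taken from the article.

```python
# Sketch of a single, transparent decision tree pruned with cross-
# validation via cost-complexity pruning (the CART pruning family [5]).
# The scikit-learn calls and the 10-fold CV are illustrative choices.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def prune_single_tree(X, y, cv=10, random_state=0):
    # Candidate complexity penalties along the cost-complexity pruning path.
    path = DecisionTreeClassifier(random_state=random_state).cost_complexity_pruning_path(X, y)
    # Cross-validated accuracy for each candidate penalty.
    scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=random_state),
                              X, y, cv=cv).mean()
              for a in path.ccp_alphas]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]
    # Refit one pruned tree on all data; its few splits stay human-readable.
    return DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=random_state).fit(X, y)
```

Visualizing the returned tree (e.g. with sklearn.tree.plot_tree) then yields the kind of compact, inspectable decision rule whose transparency is argued for here.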
Furthermore, diagnostic tests could be developed by first applying mass spectrometry as a screening method to search for useful biomarkers, and then by transferring the results to standard analysis platforms such as enzyme-linked immunosorbent assays. The latter step requires protein identification for the peak clusters used in the classifiers, a time-consuming and expensive procedure that can only be conducted for a few peak clusters.

Acknowledgements

We thank Jörn Meuer for fruitful discussions on mass spectrometry data, for his comments on the manuscript, and for providing Fig. 1. We are grateful to Volker Seibert for proof-reading the manuscript.

References

[1] B.L. Adam, et al., Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men, Cancer Res. 62 (2002) 3609–3614.
[2] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, New York, 1995.
[3] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[4] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[5] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth International, Belmont, California, 1984.
[6] M.J. Campa, M.J. Fitzgerald, E.F. Patz (Eds.), Mining MALDI-TOF Data, Proteomics 3 (2003) 1659–1832.
[7] M.P.A. Ebert, J. Meuer, J.C. Wiemer, U.G. Traugott, P. Malfertheiner, H.U. Schulz, M.A. Reymond, C. Röcken, Identification of gastric cancer patients by serum protein profiling, 2004, submitted for publication.
[8] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, New York, 2001.
[9] K.R. Lee, X. Lin, D.C. Park, S. Eslava, Megavariate data analysis of mass spectrometric proteomics data using latent variable projection method, Proteomics 3 (2003) 1680–1686.
[10] J. Li, Z. Zhang, J. Rosenzweig, Y.Y. Wang, D.W. Chan, Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer, Clin. Chem. 48 (2002) 1296–1304.
[11] D. Meyer, F. Leisch, K. Hornik, Benchmarking support vector machines, Report Series SFB Adaptive Information Systems and Modelling in Economics and Management Science, No. 78, Wirtschaftsuniversität Wien, 2002.
[12] Y. Qu, et al., Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients, Clin. Chem. 48 (2002) 1835–1843.
[13] A.J. Rai, et al., Proteomic approaches to tumor marker discovery, Arch. Pathol. Lab. Med. 126 (2002) 1518–1526.
[14] B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, Massachusetts, 2002.
[15] V. Seibert, A. Wiesner, T. Buschmann, J. Meuer, Surface enhanced laser desorption ionization time-of-flight mass spectrometry (SELDI TOF-MS) and ProteinChip technology in proteomic research, submitted for publication.
[16] A. Vlahou, et al., Development of a novel proteomic approach for the detection of transitional cell carcinoma of the bladder in urine, Am. J. Pathol. 158 (2001) 1491–1502.
[17] J. Wiemer, F. Schubert, M. Granzow, T. Ragg, J. Fieres, J. Mattes, R. Eils, Informatics united: exemplary studies combining medical informatics, neuroinformatics and bioinformatics, Meth. Inf. Med. 42 (2003) 126–133.