Tampere University of Technology. Publication 561
Cristian Mircean
In memory of my father
ISBN 952-15-1475-2
ISSN 1459-2045
Abstract
To measure the state of alteration in a cell's molecular biology, we need
qualitative markers, better-performing analysis algorithms, and more reliable diagnosis
of the molecular underpinnings of disease, especially cancer.
New technologies have revealed much about the cells' inner processes, but this
information explosion represents a challenge for the scientific community.
Bioinformatics fills the gaps by creating algorithms, tools, and methods to process
thousands of signals.
Amid this exciting progress, this doctoral research was dedicated to the
development of good classifiers, insightful models, and methods to select the most
significant features and patterns in molecular biology measurements. These studies
utilized several approaches, based on genomics and proteomics techniques, to develop
methods and apply them to real data from several cancer cell types.
The classification capabilities of the developed algorithms were scrutinized using
multivariate statistical tools. These typically minimize the estimated error rates;
in addition, we investigated the biological interpretation of the results. The tools should
have direct application in medicine.
Genomics
Disease diagnosis is sometimes biased by features visually identified by pathologists on
tissue samples. However, molecular modifications in cells reflect the biological nature of
specific tumors more accurately than visual features. Current diagnosis schemes will be
enhanced if the classification uses molecular signatures. Also, careful design of methods
to discriminate between diseased and normal cells may replace the human effort in
diagnosis and may reduce error. We analyzed numerous discrimination methods using
cDNA microarray data. The successful techniques combined the k-Nearest Neighbor
algorithm with optimized (Lloyd) data quantization. The error rates obtained with
quantization methods are shown to be smaller than those reported in previously published
studies on the same data sets.
A classifier was trained using a glioma microarray data set, on typical cases. The
effectiveness of learning depended on the quality of data and the information contained
in the data. The selected genes provided the lowest cross-validation classification error.
This classifier was applied to the remaining cases of several mixed gliomas and atypical
meningioma. The algorithm correctly classified most of the gliomas and the detailed
voting results provided subtle information regarding the molecular similarities with
neighboring classes. We propose that the developed method can be used as a diagnosis
tool for observing the continuous character of glioma malignancy.
Information theory provides grounds for feature selection. We measured the
influence of genes on the information stored in the dataset. The relative improvement may be
estimated in terms of the description length gain when a particular gene is used, compared
to the description without the gene. Rissanen's normalized maximum likelihood is a
principled way to find the structure of the model for classification and the informative
candidates for the feature sets. Further, on another dataset, we used the "gene shaving"
clustering method to group similarly behaving genes in clusters, and we checked the
enriched functions based on gene ontology. The functions of the selected genes showed a
significant enrichment of genes involved in metabolism and signal transduction.
Proteomics
One proteomic technology able to measure the levels of protein expression in a large
number of biological samples simultaneously is the reverse-phase lysate microarray. The
technology provides a means of effectively observing cells at the level of proteins,
which carry out cellular functions and are important to understanding biological
systems. A challenge for accurate quantification of protein expression is the relatively
narrow dynamic range associated with the commonly used chromogenic signal detection
system. We developed a 1440-spot (and then a 1728-spot) lysate microarray that contains
80 (or 96) lysate (or serum) samples, printed in triplicate with six two-fold dilutions. We
then designed several algorithms that estimate the levels of protein expression.
The analysis showed that the method based on a robust least squares estimator
provided the most accurate quantitation of the protein lysate microarray data for purified
bovine serum albumin. We then applied the technology to real biological samples. As a first
application of the method, we analyzed HCT116 colon cancer cell lines after treatment
with each of two drugs or a combination of the two drugs. The array contained p53-/-
HCT116 cells with no treatment as well as p53+/+ HCT116 cells (parental cells) with no
treatment as a control. The protein levels estimated from the array data were compared
to those observed by western blotting.
Then, in a large-scale and high-throughput application, we surveyed 82 glioma
samples for the levels of protein expression and phosphorylation of 46 different proteins
involved in signaling of cell survival, apoptosis, angiogenesis, invasion, and cell cycle
pathways. We observed two groups based on survival curves, glioblastomas vs. other
gliomas of lower grade, where glioblastomas are associated with a dramatically and
distinctly worse outcome. Twelve proteins were identified as the most powerful
discriminators, and cluster analysis of phosphorylated sites suggested functional
relationships that warrant further investigation.
Content
Abstract
Content
List of publications
Preface
Acknowledgements
Antibody selection
Normalization
Differential expression of genes
Clustering
Distance Measures
k-Means algorithm
Principal Component Analysis (PCA)
Multidimensional Scaling (MDS)
Classification
k-Nearest Neighbor algorithm
Quantization and the Lloyd algorithm
Model selection and Minimum Description Length (MDL)
Publications
List of publications
P1.
P2.
P3.
P4.
Fuller GN*, Mircean C*, Tabus I, Taylor E, Sawaya R, Bruner JM, Shmulevich I,
Zhang W. Molecular Voting for Glioma Classification Reflecting Heterogeneity in
the Continuum of Cancer Progression. Oncol Rep. 2005 14: 651-656. *Co-first
author.
P5.
P6.
P7.
Fuller GN, Hess KR, Mircean C, Tabus I, Shmulevich I, Rhee CH, Aldape KD,
Bruner JM, Sawaya RA, Zhang W. Chapter 14: Human Glioma Diagnosis From Gene
Expression Data. In: Computational and Statistical Approaches to Genomics.
Kluwer Academic Publishers, 2002. ISBN 1-4020-7023-3.
P8.
Preface
Biology, genetics, and medicine have experienced revolutionary changes in the past
decade. A milestone was the success of the Human Genome Project¹, which identified
approximately 20,000-25,000 genes present in human DNA, determined the sequences
of the 3 billion chemical base pairs that make up human DNA, and stored this
information in databases. Unprecedented high-throughput experiments are now capable
of acquiring data on thousands of molecular events at once, generating an information
explosion.
There is a dramatic gap between the generation of data and the capabilities and
methods of processing the information. A direct challenge to bioinformatics researchers
is to provide and improve tools for data analysis. To help achieve these goals, my
research concentrated on developing algorithms that process biological genomic and
proteomic data.
Signal processing and computer science became engaged in medical areas much later than in
other observational sciences. The traditional role of engineers, designing devices and
instruments, has changed into direct engagement in the experiment that generates the
data flow. The high-throughput character of current experiments requires that design,
data acquisition, and processing all be effective.
Genomics
The "one gene, one disease" archetype has long been discarded for many diseases. It is now
established that the malfunctioning of a group of genes (not only one) often generates
disease. The subjectivity and recurrent uncertainty of classical histological diagnosis
based on morphological particularities are among the reasons for improving diagnostic
tools using molecular models. Further, the diagnosis should be grounded on
measurements with certain statistical premises. Correct estimations usually utilize only
the information obtained from representative features, filtering out the information from
the rest of the high-dimensional feature space of a typical experiment.
One goal of measuring gene expression in cancer research is to develop models for
the molecular classification of tumors and capabilities for an objective diagnosis based on
gene expression. Decisions for diagnosis or treatment based only on a single marker may
be in error if the selected marker is not representative. However, it is not feasible to
make diagnoses based on a large number of genes or proteins because of cost and errors
related to noisy measurements. Therefore research has focused on reducing the number
of analyzed genes to an informative set, typically containing 100-200 genes or even fewer.
The classification errors depend directly on the classifier, on the information
contained in the data set, and on how well the method matches the type of data set.
¹ The Human Genome Project was a 13-year project coordinated by the U.S. Department of Energy Office of Science
and the National Institutes of Health and completed in 2003. Major partners included the Wellcome Trust
(U.K.), and important contributions were made by researchers from Japan, France, Germany, China, and
others. (http://www.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml).
Quantization was observed to have a special importance in filtering the noise.
Mistakenly, quantization is often presented as a loss of data precision. This is true when
the data constitute a perfectly true representation of the generating mechanism and
when no fundamental uncertainty exists in the generating mechanism itself. In most
cases, neither of these assumptions is true. For example, in the case of cDNA microarray
data, it is widely recognized that the reproducibility of measurements and between-slide
variation are major issues. Furthermore, genetic regulation exhibits considerable
uncertainty on the biological level. We bring evidence in favor of quantization, suggesting
that this type of noise is in fact advantageous in some regulatory mechanisms. The goal
of quantization can be thought of as trying to find the right balance between properly
capturing the information content in the data and filtering out the non-informative and
detrimental noise.
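The balance described above can be illustrated with a one-dimensional Lloyd quantizer. The following is a minimal sketch, not the implementation used in the publications: the function name `lloyd_quantize`, the quantile-based initialization, the choice of three levels, and the toy log-ratio values are all assumptions made for illustration.

```python
import numpy as np

def lloyd_quantize(values, n_levels=3, n_iter=50):
    """One-dimensional Lloyd quantizer (illustrative sketch).

    Alternates between assigning each value to its nearest reproduction
    level and moving each level to the mean of the values assigned to it.
    """
    x = np.asarray(values, dtype=float)
    # Assumed initialization: place the levels at evenly spaced quantiles.
    levels = np.quantile(x, np.linspace(0.0, 1.0, n_levels + 2)[1:-1])
    for _ in range(n_iter):
        # Nearest-level assignment for every value.
        labels = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
        new_levels = levels.copy()
        for k in range(n_levels):
            members = x[labels == k]
            if members.size:  # an empty cell keeps its previous level
                new_levels[k] = members.mean()
        if np.allclose(new_levels, levels):
            break
        levels = new_levels
    labels = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
    return labels, levels

# Hypothetical log-ratios for one gene across nine samples.
log_ratios = np.array([-2.1, -1.9, -2.0, 0.1, -0.1, 0.0, 1.9, 2.2, 2.0])
labels, levels = lloyd_quantize(log_ratios, n_levels=3)
```

In this toy case the small fluctuations within each group are discarded, while the three-way distinction between low, unchanged, and high expression is kept. The number of levels controls how much detail survives quantization.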
Discovering the subset of genes that allows correct diagnosis will help in
understanding different cancers and can eventually lead to an understanding of the
regulatory network between the genes (or the proteins they produce) involved in diseases. The
proteins produced by the selected genes are also potential targets for the
pharmaceutical industry. The key proteins that regulate one or several genes or proteins
could also be regulated or inhibited by drugs.
Much attention has been concentrated on a small number of genes that appear to
have a high probability of involvement in the observed cancer. These so-called classifiers
can be obtained by "supervised" or "unsupervised" methods.
In filter methods the genes are ranked according to a common property relevant for
the prediction or classification (such as discriminative power, correlation, or mutual
information) that is able to explain the sample (disease) class, without making that
property explicit in the subsequent prediction model. After single features or various
groups of features have been ranked, a suitable subset is accepted and proposed as the
set to be used for subsequent analysis tasks. Wrapper methods, on the other hand, aim
to restrict the set of features so that the prediction ability of a given method is
improved. The prediction capability of the method is investigated for all possible
groups of genes; the set with the best performance is declared optimal. It maximizes
the prediction ability of the studied class of models, but may not be relevant for other
classes of models. In gene-expression analysis, the number of possible combinations of
thousands of genes makes such an exhaustive wrapper search computationally infeasible.
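A filter ranking of this kind can be sketched in a few lines. The example below is a minimal illustration, assuming the ratio of between-class to within-class sum of squares (BSS/WSS) as the ranking property; the function names `bss_wss_scores` and `filter_select`, the toy data, and the figures of 5000 genes and 10-gene subsets are hypothetical and not taken from the publications.

```python
import numpy as np
from math import comb

def bss_wss_scores(X, y):
    """Ratio of between-class to within-class sum of squares for each gene.

    X: array of shape (n_samples, n_genes); y: one class label per sample.
    Assumes every gene has non-zero within-class variation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        bss += Xc.shape[0] * (class_mean - overall_mean) ** 2
        wss += ((Xc - class_mean) ** 2).sum(axis=0)
    return bss / wss

def filter_select(X, y, n_keep):
    """Indices of the n_keep highest-scoring genes, best first."""
    scores = bss_wss_scores(X, y)
    return np.argsort(scores)[::-1][:n_keep]

# Hypothetical toy data: 6 samples, 3 genes; only gene 0 separates the classes.
X = np.array([[5.0, 1.0, 0.2],
              [5.2, 0.8, 0.9],
              [4.9, 1.2, 0.1],
              [1.0, 1.1, 0.8],
              [1.1, 0.9, 0.3],
              [0.9, 1.0, 0.7]])
y = np.array([0, 0, 0, 1, 1, 1])
top = filter_select(X, y, n_keep=1)

# Size of an exhaustive wrapper search over all 10-gene subsets of 5000 genes.
n_subsets = comb(5000, 10)
```

The filter scores each gene once, so its cost grows linearly with the number of genes. The exhaustive wrapper search in this hypothetical setting would have to evaluate on the order of 10^30 subsets.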
In contrast, we look into simple and effective methods that may be promising for
molecular diagnostic efforts, as a complement to the currently used morphology-based
diagnostics. In this thesis, "simple" refers to algorithms that are simple in terms of
computational load, programming effort, and information complexity.
Future technological evolution may lead to powerful diagnostic tools based on
classification algorithms we develop today as stand-alone tools (a system-software
solution), integrated into existing devices, or as major health-computing systems
established as large national databases. The current trend in diagnosis of cancer is to
merge efforts toward a complete decision-making system that integrates tools from gene
expression with protein expression, and analysis of protein interactions, loss-of-function
phenotyping using RNAi, tissue microarrays, metabolomics, cellomics, etc. This makes
the possible network of interactions complicated, but brings another level of
dimensionality to the solution space of the problem of classification.
Proteomics
Microarray cDNA techniques that analyze levels of mRNA implicitly assume that gene
transcription levels are correlated with the levels of the proteins that accomplish the
cellular functions. However, cDNA transcriptional data should be complemented with
information on protein translation and post-translational modification. The products of
genes act in the cellular environment depending on the protein level and on protein
post-translational modifications such as phosphorylation and acetylation. Although the
methods of data mining, classification, and multivariate analysis are similar for
gene expression and protein expression, the specifics of the technologies make
the two analyses distinct. In gene expression analysis, we deal with thousands of
features (genes), typically duplicated. In protein expression, the number of features
(proteins and antibodies) is typically in the range of hundreds, but samples are analyzed
in multiple replicates and dilutions. Earlier methods were characterized by
semi-quantitative or qualitative measurements. To evaluate the expression of proteins in a
high-throughput manner, together with the new developments in reverse-phase
protein array technology, we observed a specific need for robust algorithms that
quantitatively estimate the expression of proteins.
With the aid of protein array technology we can estimate the expression of proteins in
a high-throughput manner. In a simplified description, a robot produces the required
diluted spots by depositing the lysate from cells on a nitrocellulose membrane. After batch
printing, one or two layers of antibodies (the latter arrangement is called a sandwich) bind with
high specificity to the spotted proteins. Each antibody is processed on a different slide.
The new algorithms should be able to deal with the relatively narrow dynamic range of the
chromogenic signal detection system and with possible errors caused by incorrect loading of
lysate on the membrane, cracks in the membrane surface, and spot replication and
segmentation imperfections.
Because of the high-throughput character of reverse-phase protein arrays as
compared with technologies like western blotting, all samples are treated under
virtually identical conditions throughout the process. Also, there is the opportunity to
design multivariate analyses, similar to those performed in microarray technology.
In some cases, as in glioma research, the samples are rare. The amount of lysate
needed in protein array technology is about two orders of magnitude lower than for
western blotting. Each antibody used must have proven specificity (e.g., by recognizing a
single band in a western blot). In the last year, there has been an increase in the purity and
availability of antibodies. Last, but of great importance, the total cost of reverse-phase
protein array analysis is lower than for classical methods of analysis.
Introduction
The field of bioinformatics has evolved dramatically since my doctoral studies began:
there are new technologies, new applications, improved devices, and more systematic
approaches to molecular biology experiments. The replicability of measurements is
higher and the set of initial feature genes has been partly standardized by industry
researchers. Additionally, there is a tendency toward quantitative proteomics, while in
2000 most studies measured mRNA expression.
Chapter 1 will describe the "Central dogma of biology" and genetics from a historical
perspective.
Chapter 2 describes the importance of diagnosis and prognosis in clinical research.
Due to the development of new molecular biology techniques and enhanced by economic
reasons, there has been an attempt to integrate diagnosis of disease. Molecular targeting
is one of the steps in the drug discovery path. In this chapter, the characteristics of cancer
cells will be compared to those of normal cells.
Chapter 3 describes the specificities of gliomas as they fall into the wide spectrum of
cancer malignancy. Brain tumors are described according to the World Health
Organization (WHO) system based on histological features. The section captures the
known molecular changes in pathology and we refer to articles that might help the reader
interested in further medical/clinical/pathological details.
In genomic research, two major platforms are frequently used for measuring the expression
of genes in samples, differing in the method of nucleic acid deposition. Chapter 4 will
describe the fundamentals of preparing, processing, and analyzing data from cDNA
microarrays. Particularities of pre-processing and analysis of microarray data, with
emphasis on the technological process, are explained there.
Technological developments made the high-throughput observation of
transcriptional levels easier years before we were able to use high-throughput techniques to
evaluate translational levels. The scientific community assumed a direct relation between
the two phenomena without having the technological capabilities for high-throughput
analysis of protein expression. Quantitative measures and high-throughput technologies
in proteomic research have now made possible the observation of translational and
post-translational alterations in cancer cells. Chapter 5 describes technological developments
in protein arrays with emphasis on reverse-phase lysate arrays. Our successful
contribution to lysate array technology is placed in the framework of other algorithms
that estimate the expression of proteins. Several theoretical specifics and
technological descriptions that were not included in (P7) are presented in Chapter 5.
Chapter 6 presents the application of our algorithms to the study of glioblastomas.
Chapter 7 is dedicated to describing mathematical algorithms utilized in the papers.
This chapter concisely describes methods of analysis in normalization, differential
expression for genes, clustering methods, definition of distance measures used in
clustering, the k-Nearest Neighbor algorithm, k-means, principal component analysis (PCA),
multidimensional scaling (MDS) as a visualization method for high-dimensional spaces,
classification methods, quantization and the Lloyd algorithm, model selection, and
minimum description length (MDL).
Collection of publications
Genomics. The first of the attached publications (P1) compares several discrimination
methods for the classification of gliomas using gene expression data. The methods
considered, their combinations, and the choice of the distance function were evaluated.
Our error rates based on the new methods were shown to be smaller than those reported
in previously published studies on the same set.
The next study (P2) applied the novel classification techniques to cDNA microarray
data for discriminating subtypes of malignant lymphoma. The genes on which the
classification is based were selected by ranking them according to a separability
criterion computed by taking into account between-class and within-class scatter. The
observed errors, estimated using cross-validation, were significantly lower for the case of
the k-Nearest Neighbor (k-NN) algorithm with optimized Lloyd data quantization than
Acknowledgements
The research presented in this thesis reflects joint work between signal processing and two
areas: molecular biology and bioinformatics. The characteristics of data processing
required in genomics and in proteomics highlighted the need for close collaboration
between a biological group (located where patients need care and where the samples are
acquired and evaluated) and a computer-science-oriented group. This work was begun in
August 2000 at the Institute of Signal Processing, Tampere University of Technology,
Tampere, Finland, and continued from 2003 at the Cancer Genomics Core
Laboratory, The University of Texas M.D. Anderson Cancer Center, Houston, Texas.
My gratitude goes foremost to my supervisor Prof. Ioan Tabus from Tampere
University of Technology, who helped me in each and every aspect of
my work and life in Finland. I appreciate his support, and credit him with correctly
pointing to the research I needed to do. I thank him for his never-ending willingness to
help, and especially for his expertise in the field of signal processing. My work with Prof.
Ioan Tabus as mentor transformed me from a student into a researcher.
I am particularly thankful to Prof. Dr. Wei Zhang, Director of the Cancer Genomics Core
Laboratory and my supervisor on the biological side of this research, for his support and
devotion to bringing new ideas to life. M.D. Anderson is a first-rate institution in
cancer treatment, and the core laboratory led by Dr. Zhang is certainly one of the
research leaders in genomics. My work with the group of Dr. Zhang resulted in major
articles published during the last two years. I thank him for his hard work in guiding me and
polishing my scientific attitude, for sharing his scientific expertise, and for his managerial
qualities.
I owe special gratitude to Prof. Ilya Shmulevich, my supervisor in the mathematical,
statistical, and algorithm research performed at M.D. Anderson. I met him a couple of
years ago, when he was a lecturer on Nonlinear Filters at the Institute of Signal Processing.
His clear explanations guided me and changed my way of understanding algorithmic
processes, and even life. I have become simpler and more organized, in the sense of
Rissanen's modeling. The understanding of processes in molecular biology necessitates
visualization and an organized "logical trend." Ilya brought and catalyzed hundreds of ideas
in our research, and it is a pleasure to have him as a supervisor, as a teacher, and as a
friend. I also learned from him that "a sustained good work in research is always
rewarded sooner or later".
I respectfully thank Acad. Prof. Jaakko Astola for his support and advice regarding
my research and for sharing his ideas as co-author in publications included in this thesis.
Abbreviations
AKTpThr308
BadpSer136
BSS/WSS
c-Abl, Cabl
CCND1
CDK4 (CDK6)
CDKN2A
EGFR
EGFRpTyr845
IGFBP2
IGFBP5
IκB
LOH
MDM2 (MDM4)
MET
MMP9
MYC-C
NF-κB
NHGRI
p-BAD
PCR
PDGFR
p-EGFR
PI3K
Pi3k
p-RB
RB1
TP53
VEGF
WHO
Chapter 1
Historical perspective of the "Central Dogma" of biology
"Dans les champs de l'observation le hasard ne favorise que les esprits préparés."
Louis Pasteur¹
In a metaphoric sense, the present is a reflection of the values of the past, and the
present creates deterministic reasons for the future. Creative arrangements and ideas
are more effective when based on a currently accepted framework supported by past
experience.
As we approach the maturation of the field of genetics, we see an unprecedented
trend of knowledge accumulation. The next decades are expected to bring
systematization and easy use of the large amount of data generated, and therefore
extensive application of new knowledge. This chapter is a brief survey of the
historical laws in molecular biology.
As long as models reflect real events, they help our understanding by allowing us to
predict behaviors. We need the models to guide experimental design and anticipate
results. Rissanen says that only humans need models and modeling, since intrinsically
the real events happen without our defined model.
The set of statements called the "Central Dogma" was formulated more than 40 years ago, as a
way of explaining the world of information transfer in cells. This theory asserts that
information flows from chromosomal DNA through RNA to protein. More than a model,
the Central Dogma has become a paradigm; Adam Smith² affirmed that paradigms are
like "water to the fish," a generally accepted judgment. Of course, paradigms may be
detrimental to the generation and acceptance of new ideas: "When we are in the middle
of the paradigm, it is hard to imagine any other paradigm." Accepted by the scientific
community, the Central Dogma has inspired researchers in structuring the entirely new
fields of genetics, molecular biology, systems biology, and pharmacology.
Historically, knowledge is a succession of discoveries followed by assimilation. There
were a number of competing theories before the true crystalline structure of DNA was
determined. After extensive experiments in crystallography, on April 25, 1953, Nature
magazine published the article describing the double-helical structure of DNA by Watson
and Crick [10],[11]. The manuscript explained for the first time the X-ray diffraction images
obtained from DNA and was supported by an article by Wilkins et al. in the same issue [12].
Five years later, in 1958, Crick [5] described the hypothesis for molecular processes and
the informational flow between the three families of polymers: DNA, RNA, and protein.
From the beginning, it was clear that not all informational flows possessed the same
probability. Crick originally represented the informational flow as DNA → RNA →
PROTEIN. This suggests that life was traceable to DNA. Reviewing the hypothesis, Crick
[4] summarized in 1970 the work of Temin and Mizutani [9] and Baltimore [1] showing
that an RNA tumor virus can use viral RNA as a template for DNA synthesis, and
observed that no major arguments against the Central Dogma had appeared after
twelve years of genetic research.
¹ Lecture, University of Lille (December 7, 1854); in H. Eves, Return to Mathematical Circles, Boston:
Prindle, Weber and Schmidt, 1988, translated as "Chance favors only the prepared mind."
² Adam Smith (1723–1790) was a Scottish economist and moral philosopher. His "Inquiry into the Nature and
Causes of the Wealth of Nations" was one of the initial studies of industry and commerce development.
Figure 1.1. Schematic of the Central Dogma of molecular biology. From the total possible combinations
presented in panel a, probabilistically the main flow is represented as DNA → RNA → Protein in panel b.
If one considers the un-wound linear structure of the three families of polymers, all
possible informational inter-connections (see Figure 1.1 a.), including bi-directional
transfers, must be initially taken into consideration [4]. Given the information available
to researchers at the time, the likelihood of certain transfers was clearly dissimilar due to
known conformational three-dimensional crystalline structures. By inspecting the
experiments and considering his own laboratory experience, Crick [5] grouped the
informational transfers into three typological classes. The first group is formed by
transfers for which experimental evidence, direct or indirect, existed. These transfers
commonly occurred in all known cell types:
I (a)
I (b)
DNA → RNA
I (c)
RNA → Protein
I (d)
RNA → DNA (Temin [9] and Baltimore [1]), and RNA → DNA → RNA
DNA → Protein
Years after the publication, Temin's indirect evidence [9], corroborated by the
experiments of Baltimore [1], showed the presence of a specific enzyme in RNA tumor
virus particles that makes a DNA copy from RNA. The publication of this work was met
with a generally hostile reception mainly because of the accepted Central Dogma
paradigm. In 1975, Temin, Baltimore, and Dulbecco shared the Nobel Prize for
discovering the enzyme, reverse transcriptase, responsible for copying the information in
a strand of RNA into DNA.
The third group of transfers was not known to occur [5].
III (a)
Protein → Protein
III (b)
Protein → RNA
III (c)
Protein → DNA
There are several factors that make Crick's paper remarkable (see [4]): (1) There are no
assumptions regarding the machinery or how the transfer is made. The accuracy is
considered high and possible errors are not discussed. (2) Control mechanisms are not
considered, nor the rate at which the processes work. (3) The organisms under discussion
are present-day organisms and the paper [5] was not intended to apply to events in the
remote past. (4) The Central Dogma is essentially a negative statement, saying that
transfer of genetic information from protein to other polymers does not exist. These
statements from 1970 [4] try to clarify the disagreements published in several papers
during that year.
For many decades, it was thought that there were no mechanisms that allow
information to flow from protein back into RNA. Once protein is synthesized from RNA,
the information was considered trapped at the protein level. This hypothesis seemed
reasonable in light of the degeneracy of the genetic code; in most cases more than one
nucleotide triplet specifies an amino acid. Recent studies [2] have shown
emerging evidence for the role of conformational plasticity in protein-protein
interactions, and mechanisms also exist for a type of protein replication by means of
prions [8]. The mechanism involves a specific isomer of a normal protein, homologous to a
prion. When the organism is infected, the prions interact with this homologous isomer of the
protein, forcing it to become another prion. Although this belongs more to the side of
post-translational processes, by this mechanism certain proteins can replicate and an
informational transfer is made. These findings conflict with, and now complete, the
Central Dogma postulated in 1958.
The transfers from DNA to RNA use the same type of encoding (nucleic acids) and
therefore are called transcription. Also, when the information flows from RNA to DNA,
the flow catalyzed by an enzyme called reverse transcriptase, the process is called reverse
transcription. Protein synthesis, directed by RNA, is called translation [3]. It is called
translation because an essentially different encoding of the information is used by
proteins (amino acids instead of nucleic acids). For both translation and transcription,
the molecular apparatus is very complex and involves many different RNA and protein
molecules.
Ultimately, the Central Dogma is not sufficient to explain the complexity of a cell or
organism. The sequencing of the entire human genome opened the era of further
understanding of information flow in cells and organisms. The next step is the
elucidation of the relationship between genes, other regions of DNA, and the
products of genes: how and when transcription and translation occur, and how
post-translational modifications to proteins ultimately define a cell. Strong opposition to the
gene-centric paradigm also arises from symbiosis phenomena (the merger of two
organisms into one without inheritance of genes). By exploiting the power of cooperation
rather than competition, symbiosis can allow an evolutionary jump that might otherwise
take a million years of individual trial and error [6].
Although static as a concept, the "Central Dogma", which asserts that
information flows from the DNA nucleotide sequence to messenger RNA (mRNA) and
is then translated to the specific amino acid sequence of a protein, has had a large positive
impact over the last decades on the modeling of intracellular mechanisms.
References:
[1] Baltimore D. Viral RNA-dependent DNA polymerase. Nature 226: 1209-11, 1970.
[2] Buck E, Iyengar R. Organization and functions of interacting domains for signaling by protein-protein interactions. Sci STKE. 2003 Nov 18;2003(209):re14.
[3] Crick FHC, Barnett L, Brenner S, Watts-Tobin RJ. General nature of the genetic code for proteins.
Nature 192: 1227-1232, 1961.
[4] Crick FHC. "Central Dogma of Molecular Biology". Nature 227: 561-563, 1970.
[5] Crick FHC. In Symp. Soc. Exp. Biol. The Biological Replication of Macromolecules, XII, 138
(1958).
[6] Kelly K. Out of Control: The New Biology of Machines, Social Systems, and the Economic
World. Perseus Books, 1995.
[7] Kornberg A. Biologic synthesis of deoxyribonucleic acid. Science 131: 1503- 1508, 1960.
[8] Prusiner SB. The Prion Diseases One Protein, Two Shapes. Scientific American 272(1):48-57,
1995.
[9] Temin HM, Mizutani S. RNA-dependent DNA polymerase in virions of Rous sarcoma virus.
Nature 226: 1211- 1213, 1970.
[10] Watson JD, Crick FHC. Genetical implications of the structure of deoxyribonucleic acid. Nature
171: 964- 967, 1953.
[11] Watson JD, Crick FHC. Molecular structure of nucleic acids: A structure for deoxyribose nucleic
acid. Nature 171: 737-738, 1953.
[12] Wilkins MHF, Stokes AR, Wilson HR. Molecular structure of deoxypentose nucleic acids. Nature
171: 738-740, 1953.
[13] Yanofsky C, Carlton BC, Guest JR, Helinski DR, Henning U. On the colinearity of gene structure
and protein structure. Proc. Natl. Acad. Sci. USA 51: 266-272, 1964.
Chapter 2
Genomics and proteomics in diagnosis of disease and
prognosis
Molecular profiling completes disease diagnosis and prognosis. Clinical medicine,
classically described as disease diagnosis, medication, side effects, and management of
outcome, has evolved dramatically with the advent of genomics and proteomics.
Sequence analysis of genomes, discovery of structures and pathways, and gene and protein
expression profiling have aided cancer evaluation and therapy and have enhanced patient
prognosis. In addition, these applications support pharmacology by incorporating
pharmacokinetics and pharmacodynamics into drug development.
In the mid-90's the initial phase of bioinformatics concentrated only on a few subjects
of relevance to sequence analysis; now, bioinformatics is essential to the medical,
pharmaceutical, and biological fields. The development of technologies and the broader
understanding of the genetic pathways in the development of tumors opened the
possibility of correlating molecular data with clinical outcome, survival, and response to
treatment modalities. This chapter reviews and discusses the relations between
technologies and diagnosis. It is clear that the combined use of high-throughput
techniques in a comprehensive system will better facilitate our understanding of the
genetic complexities inherent in cancer and will revolutionize cancer therapy.
"Why search for differences between diseases at the molecular level?"
Molecular profiling comprises individual applications of mRNA expression, proteomic,
and metabolomic measurements, and combinations of techniques used to characterize
the state of a cell or a tissue [21]. We are looking for patterns that predict or identify sub-phenotypes of disease that should allow clinicians to make more informed decisions
about therapy and ultimately allow design of drugs suited to a particular disease
genotype.
Consider the general case of a data set and its compression. Good
compression rates are obtained when an appropriate model is used for the patterns
contained in the file. Therefore, in addition to minimizing the size of the file, a good
compressor makes use of the model and extracts information. This duality
(model-information) is essential to information theory.
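The link between patterns and compressibility can be illustrated with a minimal sketch. It assumes the general-purpose DEFLATE compressor from Python's standard `zlib` module as a stand-in for "a good compressor"; the repeated motif and the function name `compression_ratio` are illustrative choices, not part of the thesis.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


# A highly patterned "file": one short, made-up motif repeated many times.
patterned = b"ACGTACGGTC" * 1000

# A file of the same length with no exploitable pattern (random bytes).
unpatterned = os.urandom(len(patterned))

ratio_patterned = compression_ratio(patterned)
ratio_unpatterned = compression_ratio(unpatterned)

print(f"patterned:   {ratio_patterned:.3f}")
print(f"unpatterned: {ratio_unpatterned:.3f}")
```

The patterned input shrinks to a small fraction of its size because the compressor's internal model captures the repetition, whereas the random input is essentially incompressible.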
Molecular profiling analysis is based on the discovery of good predictors. With
predictors represented by mRNA transcriptional or protein translational expression, the
goal is to make use of the information content that describes the disease classification.
Each feature brings a certain amount of information about the disease and the state of
the cell. In other words,
while considering a given label-set of disease, invasiveness, or survival terms, if the
classification can be described in a short manner or by a small number of features, the
information content is of low complexity. Descriptions that require more space are of
higher complexity or information content.
We certainly know that only a small number of the features (genes or proteins) are
involved in cancer, because most other features regulate non-related mechanisms (e.g.,
they are involved in normal metabolism or normal function of the cell, or are proteins
that consolidate and define the internal structure of the cell). The retained small set in
our discussion represents the patterns linked to the cancer phenotype.
The information based on class labels and the information based on the expression
values assigned to the labels both characterize the information retrieval process. When
we talk about features, we mean the genes, proteins, phosphorylation states, etc., that
carry information and are identified during a process called feature selection. When we
discuss patterns, we refer to situations in which the algorithm first maps the space of
features, such that the selection takes place in a mapped space (e.g., principal component
analysis, PCA, or multidimensional scaling) and the interpretation of the selection is
not straightforward. It is worth looking ahead to some of the consequences of successful
selection of predictors.
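The distinction between features and patterns can be sketched as follows. This is a toy illustration under stated assumptions, not the method used in the thesis: the data matrix `X` and the labels are invented, the score in `t_like_score` is a simple mean-difference-over-spread heuristic, and the leading principal axis is obtained by plain power iteration on the covariance matrix.

```python
from statistics import mean, stdev


def t_like_score(values, labels):
    """Absolute difference of the class means divided by the summed class spreads."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return abs(mean(a) - mean(b)) / (stdev(a) + stdev(b) + 1e-12)


def select_features(X, labels, k):
    """Feature selection: indices of the k best-scoring ORIGINAL features."""
    d = len(X[0])
    scores = [t_like_score([row[j] for row in X], labels) for j in range(d)]
    return sorted(range(d), key=lambda j: scores[j], reverse=True)[:k]


def first_principal_axis(X, iterations=200):
    """Pattern extraction: leading PCA axis via power iteration on the covariance."""
    n, d = len(X), len(X[0])
    means = [mean(row[j] for row in X) for j in range(d)]
    centered = [[x - m for x, m in zip(row, means)] for row in X]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical expression matrix: 6 samples (rows) x 4 features (columns).
X = [
    [1.0, 5.1, 0.2, 3.0],
    [1.2, 4.9, 0.1, 3.1],
    [0.9, 5.0, 0.3, 2.9],
    [3.0, 5.2, 0.2, 3.0],
    [3.2, 4.8, 0.1, 3.2],
    [2.9, 5.0, 0.3, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]  # e.g. 0 = normal, 1 = tumor (made-up labels)

selected = select_features(X, labels, 1)  # an index into the original features
axis = first_principal_axis(X)            # weights over ALL original features

print("selected feature index:", selected)
print("first principal axis:", [round(a, 3) for a in axis])
```

The selected index points directly at one measured feature, while the principal axis is a weighted combination of all features, which is why a selection made in the mapped space is harder to interpret biologically.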
Molecular profiling may be conducted at several molecular levels by means of
numerous technologies that measure gene-expression levels, protein levels, or
metabolites. This thesis analyzes improvements in cDNA microarray and reverse-phase
lysate array technologies. The duality "information-model" recommends that features or
patterns from cells and tissues that discriminate disease progression, invasiveness,
survival terms, or disease classification must also satisfy the following requirements:
Identify markers that visualize the affected cells or tissues. Despite numerous
methods of tumor visualization, including computed tomography (CT), magnetic
resonance imaging (MRI), and positron emission tomography (PET), etc., these
techniques may give uncertain results. Tumors are cells that behave in a different
manner from normal tissue, yet it is often difficult to delineate between benign and
malignant. In the initial stages, the changes manifest at transcriptional,
translational, or post-translational levels. Only in late stages of disease do cells
display visible histological and/or structural changes. Consequently, markers are
still needed to discriminate between cancerous and non-cancerous cells.
The current markers used in PET and CT are based on an (increased) metabolic
activity of cancer tissues [5][12]; MRI detects density differences (based on water
content) [11]. In an ideal case clinicians generate prognosis and diagnosis based on
Generate models. The selected features or patterns compose the active nodes in a
graph-representation of a system. The model is defined by the arrangement of
elements and the interconnections that relate them. Arcs are proposed to hold the
modulations between the molecular components. If studied under impulses such as the
knock-out of a gene or malignancy, the response of the system might help elucidate
the system structure and therefore generate knowledge concerning the mechanisms
of cancer.
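The idea of a graph model probed by an impulse can be sketched in a few lines. Everything in this example is hypothetical: the node names `A`, `B`, `C`, the arc weights, the baseline activities, and the simple clipped linear update rule are illustrative assumptions, not a model taken from the thesis.

```python
def simulate(edges, baseline, knocked_out=frozenset(), steps=50):
    """Iterate x_i <- clip(baseline_i + sum_j w_ji * x_j, 0, 1).

    edges maps (source, target) to a signed weight (the "modulation" on an arc);
    nodes listed in knocked_out are held at zero activity throughout.
    """
    nodes = list(baseline)
    x = {n: 0.0 if n in knocked_out else baseline[n] for n in nodes}
    for _ in range(steps):
        new_x = {}
        for n in nodes:
            if n in knocked_out:
                new_x[n] = 0.0
                continue
            drive = baseline[n] + sum(
                w * x[src] for (src, dst), w in edges.items() if dst == n
            )
            new_x[n] = min(1.0, max(0.0, drive))
        x = new_x
    return x


# Hypothetical three-node network: A activates B and C, B represses C.
edges = {("A", "B"): 0.8, ("B", "C"): -0.5, ("A", "C"): 0.3}
baseline = {"A": 0.5, "B": 0.1, "C": 0.4}

wild_type = simulate(edges, baseline)
knock_out = simulate(edges, baseline, knocked_out={"A"})

# The response to the impulse: change in each node after knocking out A.
response = {n: knock_out[n] - wild_type[n] for n in baseline}
print(response)
```

In this toy network, knocking out `A` lowers `B` strongly and raises `C` slightly; comparing such responses across impulses is one way the arrangement of arcs could be inferred.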
To accomplish their unchecked growth and the transport of metabolites, tumors are
complemented by complex vascularization. The growth of new blood vessels is called
angiogenesis. In the healthy body, angiogenesis occurs in the process of healing wounds,
for restoring blood flow to tissues after injuries. In malignant growth, the process of
angiogenesis is not regulated and new blood vessels spread to supply the growing tumor's
nutrient demands. A relatively new class of drugs, called angiogenesis inhibitors,
prevents abnormal vascular proliferation and therefore acts on the malignant tissue by
cutting off the nutrient supply and slowing tumor growth.
[Figure 2.1: diagram. Box labels: Disease; Disease model; Target identification; Target validation; Targeted Drug Discovery; Molecular approach; Systemic approach; Clinical samples; Tissue expression; Genomics-proteomics associations; Modulation on molecular mechanisms; Modulation on systemic mechanisms; Patients; Clinical; In vivo models; In vitro models.]
Figure 2.1. Targeted drug discovery approaches are categorized based on molecular or systemic origin.
Three stages define the drug discovery process: the identification of disease, target identification, and target
validation. If the studied cells are components of a biological system, discovery is through a systemic
approach and in vivo studies are used. In the molecular approach, studies are initially performed in vitro and
the modulation is studied at single-cell level. Target validation is carried out using cell cultures or animal
models and is followed by careful clinical studies.
Another phenomenon is related to the capacity of tumor cells to detach from the
primary site and spread to other organs. Cancer cells tend to be more motile and possess
intrinsic differences in adhesion characteristics from normal cells. Cancerous cells
typically metastasize to distant locations by penetrating through basement membranes
into lymphatic and blood vessels, then circulate through the bloodstream and grow at
distant loci elsewhere in the body.
Personalized medicine
Genomics, proteomics, and other '-omics' technologies have revealed a complexity among
cancers that makes almost every tumor genetically unique. Effective targeted therapies
might be better suited for small subgroups of patients, who require a personalized
approach for discovery. Research groups with interest in multiple areas rather than a
unique production-oriented research may prove more effective in finding solutions on
3 The study of how a patient's genetic make-up affects the response to medicines.
4 The study of the effect of an individual's genotype on the response to medications.
5 The targets are typically proteins, but are not restricted to these.
targeted drugs. We can observe two tendencies manifested in cancer growth at the
molecular level. These trends situate the researchers at opposite positions. One tendency
is that cancers that arise in different locations share alterations of the same genes or
pathways. In this case, common therapeutic targets may regulate various cancer types. In
the second case, cancers arising from the same tissue of origin consist of a complex
combination of several different genetic alterations that uniquely define a genetic
subclass of that cancer [14]. The therapeutic targets that are active in each situation may
not overlap.
Initially, cancer therapies were dictated by the organ system of origin. For example,
malignancies arising in a specific organ were grouped together as one single disease, and
thus received the same therapy. The testing of new therapeutic agents and the
chemotherapy prescribed to the patients typically followed the same judgment.
Since the evolution of molecular biology, cancer is no longer classified based uniquely
upon location [2][15][13][17]. Other histological data and information about genetic
alterations contribute to the diagnosis.
demonstrated that the knowledge about targets and their relevance in human cancers is
incomplete. There is a clear need to turn attention to patient characteristics, such as
individual genetic composition, in addition to histological and tissue characteristics.
Therefore, a (small) representative subpopulation is needed for the medical classification
so that we can use the advantages of highly specific, targeted drugs [14][18].
These tendencies lean toward personalized therapies and treatment in cancer
[16][22][9]. It is not a trivial task to provide targeted drugs (or a targeted combination) to
those patients whose tumors carry the relevant genetic alterations. For this purpose, the
patients need to be catalogued by the number and the type of alterations, and the data
stored in standardized databases. Individualized diagnoses and treatments with the
specific medications can then be made.
The motor of all business-models is economic success and thus several economic
concerns of personalized medicine need to be discussed here. The vision of personalized
medicine raises the difficulties of a small and fragmented market. If the market is divided
into smaller, genetically stratified segments, block-buster drugs will become rarer and
the pharmaceutical sector might suffer and become less attractive to investors [22].
Production of personalized drugs, even specific combinations of typical compounds,
would need to take place after the diagnosis stage using a model of "drug on demand."
This requires a complex system able to analyze large amounts of data, together with
communication between diagnostic systems and clinicians and an interconnected
informational flow. It could be an "all-in-one" service provided by one single company or
agency, or separate systems with a standardized communication protocol. Currently,
diagnosis, target discovery, validation, and drug production are provided by separate
institutions (hospitals, research laboratories, drug industry, etc.).
[Figure: two-panel diagram, panels a) and b). Labels: Therapy: general; Genetic screen; Therapy: individualized; Personalized drug.]
The sharp increase in the number of drug entities, coupled with a smaller market for each,
shows that the envisioned system would need to be universal, producing a large number of
personalized drugs. Also, the model of "drug on demand" will change the infrastructure
of present companies. Additional issues, such as protecting the patient's genetic data and
other ontological (footnote 6) grounds, make this subject delicate. A positive example is
imatinib, developed by Novartis; this successful chronic myelogenous leukemia drug provides
confidence in targeted drugs as a business model. Although prescribed for less than 1% of US
cancer patients, the market for imatinib was ~US$ 1.2 billion in 2003 [16].
Currently, pharmaceutical companies are seeking targeted drugs to replace
traditional, less effective therapeutics. In previous years hormonal,
cytotoxic, and targeted drugs each accounted for about one third of profits (Figure 2.3). The
research in drug development has changed such that targeted drugs are expected to cover
more than two-thirds of the income on cancer products in the coming decade. Although the
number of potential targeted therapies in cancer is much higher than for other diseases,
the market currently lists only a limited number of targeted drugs. Table 2.1 lists the
targeted cancer therapy agents approved since the year 2000 and analyzed in a recent
publication [14].
Table 2.1 New agents for cancer therapy marketed since 2000. The discovery of novel drugs is considered
one of the most difficult scientific challenges of our times, and both pharma and academia have realized that
many diseases are unlikely to be cured exclusively by either one on their own [14].
Novel agents for cancer therapy

Drug name | Trade name | Web site | Indication | Originator | Year
Pemetrexed | Alimta (Eli Lilly) | http://www.alimta.com/ | Mesothelioma | Eli Lilly | 2004
Bevacizumab | Avastin (Roche) | http://www.drugdevelopmenttechnology.com/projects/avastin/ | Colorectal cancer | Genentech | 2004
Clofarabine | Clolar (Genzyme) | http://www.clolar.com/ | Acute lymphocytic leukemia | | 2004
Cetuximab | Erbitux (Bristol-Myers Squibb) | http://www.erbitux.com/ | Colorectal cancer | | 2004
Erlotinib | Tarceva (Genentech) | http://www.tarceva.com/ | | | 2004
Gefitinib | Iressa (AstraZeneca) | http://www.iressa.com/ | | | 2003
Bortezomib | Velcade (Millennium) | http://www.mlnm.com/products/velcade/ | Multiple myeloma | ProScript (Millennium) | 2003
Tositumomab | Bexxar (GlaxoSmithKline) | http://www.bexxar.com/ | Non-Hodgkin's lymphoma | | 2003
Oxaliplatin | Eloxatin (Sanofi-Aventis) | http://en.sanofi-aventis.com/ | Colorectal cancer | | 2002
Ibritumomab tiuxetan | Zevalin (Biogen Idec) | http://www.zevalin.com/ | Non-Hodgkin's lymphoma | | 2002
Imatinib mesylate | Gleevec (Novartis) | http://www.gleevec.com/ | Chronic myelogenous leukemia | Novartis | 2001
Alemtuzumab | Campath (Genzyme) | http://www.campath.com/ | B-cell chronic lymphocytic leukemia | Cambridge University | 2001
Gemtuzumab ozogamicin | Mylotarg (Wyeth) | http://www.wyeth.com/ | Acute myeloid leukemia | Celltech Group | 2000
Arsenic trioxide | Trisenox (Cell Therapeutics) | http://www.trisenox.com/ | Acute promyelocytic leukemia | PolaRx Biopharmaceuticals (Cell Therapeutics) | 2000

6 The information from the study of genetic alterations may be utilized against the patient in many ways if
made public.
[Figure 2.3: bar chart. Categories: oncology products on the market; oncology products in development. Series: hormonal, cytotoxic, targeted.]
Figure 2.3. Targeted therapy has increased in importance in recent years. The total worldwide cancer
revenue is about US$ 12 billion (2001). Targeted drugs account for about 18% of the total; data from [14].
The success of data mining in other fields, such as retail marketing and text mining (e.g., as
demonstrated by Google), shows the potential of integrative diagnosis in medical
decision management and specifically in cancer diagnosis and treatment. These methods
require standardized and compatible interrogation methods followed by specific
information- and knowledge-extraction tools. The efforts and rapid evolution in the
integration of such databases [9][4][8][6][19] are slowed by the huge amount of data that
must be processed. This data processing requires computational and human-validation
resources.
Gordon et al. recently presented a standardized neuro-imaging data-base that may
provide a normative and evidence-based framework for individually-based assessments
in "Personalized Medicine." The three primary goals of this database are to quantify
individual differences in brain function, to compare an individual's performance to their
database peers, and to provide a robust normative framework of multidimensional
measures for clinical assessment and treatment prediction [10].
Software in bioinformatics tends to follow the rules of "open software" (footnote 7). The
positive side of open source is the ability of multiple groups to collaborate and
improve the tools and algorithms. A larger number of specialized analysis tools could
positively impact biology by providing systematized and standardized information
[25]. Because this software is largely written by the highly computer-literate investigators
in the field, its use by non-specialists requires the additional development of facile
user interfaces.
Figure 2.4 describes the technologies and the integrated use of molecular profiling
and clinical data. Multivariate analysis, pattern recognition, and system modeling are all
integral parts of selecting features and patterns. Signal processing and computer science
manage the flow of information between technologies with the goal of extracting the
biological knowledge. Technologies have progressed from simple semi-quantitative,
single sample, singly-observed features to more complex, statistically quantitative, and
high-throughput measurements. These evolutions will provide a vast space for research,
and changes in technology will definitely require signal processing methods.
7 The meaning here is that the program source code is made freely available for modification and redistribution.
[Figure 2.4: diagram. Labels: Disease classification; Drug response; Molecular profile; Knowledge; Expression patterns; Multivariate analysis, pattern recognition, system modeling; transcriptomics, proteomics, metabolomics, phenomics; RNA expression; Protein expression; Protein interaction; Loss-of-function phenotyping; Microarray; SAGE; TaqMan; 2D gels; Mass spectrometry; Western Blot; Protein arrays; Yeast two-hybrid; Phage display; Knockouts; Antisense; RNA knockdown; Ribozyme; RNAi.]
Figure 2.4 Technologies and the integrated use of molecular profiling and clinical data. Correlating the phenotypes with the patterns found by each technology relies on
signal processing algorithms.
References:
[1] Ahlborn H, Henderson S, Davies N. No immediate pain relief for the pharmaceutical industry.
Curr Opin Drug Discov Devel. 2005 May;8(3):384-91.
[2] Bardelli A, Parsons DW, Silliman N, Ptak J, Szabo S, Saha S, Markowitz S, Willson JK,
Parmigiani G, Kinzler KW, Vogelstein B, Velculescu VE. Mutational analysis of the tyrosine kinome in
colorectal cancers. Science. 2003 May 9;300(5621):949.
[3] Basik M, Mousses S, Trent J. Integration of genomic technologies for accelerated cancer drug
development. Biotechniques. 2003 Sep;35(3):580-2, 584, 586 passim.
[4] Bertone P, Gerstein M. Integrative data mining: the new direction in bioinformatics. IEEE Eng
Med Biol Mag. 2001 Jul-Aug;20(4):33-40.
[5] Brock CS, Young H, Osman S, Luthra SK, Jones T, Price PM. Glucose metabolism in brain
tumours can be estimated using [18F]2-fluorodeoxyglucose positron emission tomography and a
population-derived input function scaled using a single arterialised venous blood sample. Int J Oncol.
2005 May;26(5):1377-83.
[6] Cornell M, Paton NW, Hedeler C, Kirby P, Delneri D, Hayes A, Oliver SG. GIMS: an integrated
data storage and analysis environment for genomic and functional data. Yeast. 2003 Nov;20(15):1291
[7] Ferrari M, Cremonesi L, Bonini P, Stenirri S, Foglieni B. Molecular diagnostics by
microelectronic microchips. Expert Rev Mol Diagn. 2005 Mar;5(2):183-92.
[8] Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka.
Bioinformatics. 2004 Oct 12;20(15):2479-81. Epub 2004 Apr 8.
[9] Gerstein M. Integrative database analysis in structural genomics. Nat Struct Biol. 2000 Nov;7
Suppl:960-3.
[10] Gordon E, Cooper N, Rennie C, Hermens D, Williams LM. Integrative neuroscience: the role of a
standardized database. Clin EEG Neurosci. 2005 Apr;36(2):64-75.
[11] Jacobs AH, Kracht LW, Gossmann A, Ruger MA, Thomas AV, Thiel A, Herholz K. Imaging in
neurooncology. NeuroRx. 2005 Apr;2(2):333-47.
[12] Jacobs AH, Li H, Winkeler A, Hilker R, Knoess C, Ruger A, Galldiks N, Schaller B, Sobesky J,
Kracht L, Monfared P, Klein M, Vollmar S, Bauer B, Wagner R, Graf R, Wienhard K, Herholz K, Heiss
WD. PET-based molecular imaging in neuroscience. Eur J Nucl Med Mol Imaging. 2003
Jul;30(7):1051-65. Epub 2003 May 23.
[13] Kwak EL, Sordella R, Bell DW, Godin-Heymann N, Okimoto RA, Brannigan BW, Harris PL,
Driscoll DR, Fidias P, Lynch TJ, Rabindran SK, McGinnis JP, Wissner A, Sharma SV, Isselbacher KJ,
Settleman J, Haber DA. Irreversible inhibitors of the EGF receptor may circumvent acquired resistance
to gefitinib. Proc Natl Acad Sci U S A. 2005 May 24;102(21):7665-70. Epub 2005 May 16.
[14] Lengauer C, Diaz LA Jr, Saha S. Cancer drug discovery through collaboration. Nat Rev Drug
Discov. 2005 May;4(5):375-80.
[15] Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL,
Haserlat SM, Supko JG, Haluska FG, Louis DN, Christiani DC, Settleman J, Haber DA. Activating
mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung
cancer to gefitinib. N Engl J Med. 2004 May 20;350(21):2129-39. Epub 2004 Apr 29.
[16] Mertens G. [Market Research Report] Beyond the blockbuster drug. Strategies for "nichebusterdrugs", targeted therapies and personalized medicine. Business Insights (February 2005).
[17] Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N,
Boggon TJ, Naoki K, Sasaki H, Fujii Y, Eck MJ, Sellers WR, Johnson BE, Meyerson M. EGFR
mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004 Jun
4;304(5676):1497-500. Epub 2004 Apr 29.
[18] Patel JD, Pasche B, Argiris A. Targeting non-small cell lung cancer with epidermal growth factor
tyrosine kinase inhibitors: where do we stand, where do we go. Crit Rev Oncol Hematol. 2004
Jun;50(3):175-86.
[19] Russell RB. Genomics, proteomics and bioinformatics: all in the same boat. Genome Biol. 2002
Sep 24;3(10):REPORTS4034. Epub 2002 Sep 24.
[20] Saghatelian A, Cravatt BF. Global strategies to integrate the proteome and metabolome. Curr Opin
Chem Biol. 2005 Feb;9(1):62-8.
[21] Stoughton RB, Friend SH. How molecular profiling could revolutionize drug discovery. Nat Rev
Drug Discov. 2005 Apr;4(4):345-50.
[22] Sullivan CG. How personalized medicine is changing the rules of drug life exclusivity.
Pharmacogenomics. 2004 Jun;5(4):429-32.
[23] Toyoda T, Wada A. Omic space: coordinate-based integration and analysis of genomic phenomic
interactions. Bioinformatics. 2004 Jul 22;20(11):1759-65. Epub 2004 Mar 22.
[24] Weinshilboum R, Wang L. Pharmacogenomics: bench to bedside. Nat Rev Drug Discov. 2004
Sep;3(9):739-48.
[25] Wiley HS, Michaels GS. Should software hold data hostage? Nat Biotech. 2004 22, 1037-38.
[26] Workman P. Drug discovery strategies: technologies to accelerate translation from target to drug. J
Chemother. 2004 Nov;16 Suppl 4:13-5.
Chapter 3
Glioma overview1
Although brain tumors account for less than 2% of the total incidence of primary tumors,
the childhood incidence of brain cancers is very high, accounting for approximately 20% of
all cancers in those aged 19 and under. Primary tumors (footnote 2) of the brain are the foremost
cause of cancer mortality among children (Figure 3.1), responsible for 7% of the years of
life lost from cancer before the age of 70 years. The incidence of brain tumors worldwide
is about 7 in 100,000 people per year [54], with no major differences across countries.
Gliomas are the most common primary tumors of the central nervous system [8].
The median survival of patients with glioblastoma, the most aggressive subclass of
gliomas, is less than 1 year even when surgical resection is combined with pre- or postoperative chemotherapy, immunotherapy, or radiotherapy. In patients with low-grade
gliomas3, such as oligodendroglioma, long-term survival can be achieved after surgical
resection and postoperative chemotherapy.
This variation in survival terms implies that there are important molecular
distinctions among the different glioma grades. Chapter 7 describes our analysis of
glioma data. Based on survival and Kaplan-Meier curves, the samples cluster into only
two classes (glioblastomas vs. the others). A significant relationship was observed in a
recent study [44] between the poor survival of patients with glioblastomas and the
morphology of the tumor cell nuclei. As defined by the physiopathology (see Table 3.1 at
the end of chapter), glioblastomas are characterized by nuclear atypia, and
multinucleated cells. The reasons for this correlation are not yet clear, but if data-mining
algorithmic rules are followed, future research should concentrate on the nuclei and
phenomena localized to the nucleus. Analysis of the same bio-morphometric data [44]
showed that the survival term is statistically independent of the amount of surgical
resection, of the patient's age, and of the classification of the glioblastoma (as
primary or secondary).
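For readers unfamiliar with Kaplan-Meier curves, the standard product-limit estimator can be sketched as below. The follow-up times and event flags are invented toy values, not glioma data from this thesis, and the function name `kaplan_meier` is an illustrative choice; the analysis actually performed is the one described in Chapter 7.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (t, S(t)) pairs, one per distinct observed event time.
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up follow-up times (months) and event indicators for six patients.
times = [3, 5, 5, 8, 12, 14]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t:>2}  S(t) = {s:.3f}")
```

Censored patients (event flag 0) still count as at risk up to their last follow-up, which is what distinguishes this estimate from a naive fraction of survivors. Curves computed separately for two groups can then be compared, for example glioblastomas against the other gliomas.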
1
This chapter describes the pathology of the most common tumors of the nervous system and recapitulates
in brief the molecular characteristics of gliomas. This short survey is warranted in the context of applications
described in the thesis and attached publications that study the classification of gliomas. My expertise is signal
processing; for a more comprehensive examination of the subject, for further details about brain tumor
pathology, I recommend the publications listed in references.
2
There are two categories of brain tumors: primary and secondary (or metastatic). In brain cancer, primary
(benign or malignant) tumors originate in the brain tissue. The classification depends on the tissue from which
the tumor originates. Metastatic, or secondary, tumors are cancers that start in other parts of the body and
metastasize to the brain.
3
The grading system addresses the speed of progression (and aggressiveness) of a cancer.
Figure 3.1. Mortality rates for common cancers. The values were calculated per 100,000 person-years
during 1950 to 1995 (left) and differentiated (right) as neoplastic types for the age group 0-19. Brain and
nervous system tumors are ranked second after leukemias.
Tumor type (1) | WHO grade
Diffuse astrocytoma | II
Fibrillary astrocytoma | II
Protoplasmic astrocytoma | II
Gemistocytic astrocytoma | II
Anaplastic astrocytoma | III
Glioblastoma | IV
Giant cell glioblastoma | IV
Gliosarcoma | IV
Pilocytic astrocytoma | I
Pleomorphic xanthoastrocytoma | II
Pleomorphic xanthoastrocytoma with anaplastic features | Not determined

(1) The classes are according to the World Health Organization classification of tumors of the nervous system
[32] [55].
An invasive type of cancer has spread from the point of origin to adjacent tissue.
Anaplasia is the phenomenon of replacement of specialized cells by unspecialized, undifferentiated, or stem
cells; in other words, dedifferentiation.
Table 3.2. The World Health Organization (WHO) classification of tumors affecting the Central Nervous System 6,7,8
Neuroepithelial tumors
1.
2.
3.
4.
5.
6.
Other neoplasms
10. Tumors of the Sellar Region
    Pituitary adenoma
    Pituitary carcinoma
    Craniopharyngioma
11. Hematopoietic tumors
    Primary malignant lymphomas
    Plasmacytoma
    Granulocytic sarcoma
    Others
12. Germ Cell Tumors
    Germinoma
    Embryonal carcinoma
    Yolk sac tumor (endodermal sinus tumor)
    Choriocarcinoma
    Teratoma
    Mixed germ cell tumors
13. Tumors of the Meninges
    Meningioma
        meningothelial, fibrous (fibroblastic), transitional (mixed),
        psammomatous, angiomatous, microcystic, secretory, clear cell,
        chordoid, lymphoplasmacyte-rich, metaplastic subtypes
    Atypical meningioma
    Anaplastic (malignant) meningioma
14. Non-meningothelial tumors of the meninges
    Benign Mesenchymal
        osteocartilaginous tumors, lipoma, fibrous histiocytoma, others
    Malignant Mesenchymal
        chondrosarcoma, hemangiopericytoma, rhabdomyosarcoma, meningeal
        sarcomatosis, others
    Primary Melanocytic Lesions
6 Since 1993 a new classification of neoplasms of the central nervous system has been used. The classification is based on the premise that each type of tumor results from the abnormal
growth of a specific cell type. To the extent that the behavior of a tumor correlates with basic cell type, tumor classification dictates the choice of therapy and predicts prognosis. In the
grading system for aggressiveness, each class has a single defined grade.
7 See http://neurosurgery.mgh.harvard.edu/newwhobt.htm by Stephen B. Tatter, M.D., Ph.D. [32].
8 Categories in italics are not recognized by the WHO classification system, but are in common use.
7.
8.
9.
15.
16.
17.
18.
19.
Pathology of oligodendrogliomas
A subclass of glioma, oligodendroglioma is the brain cancer subtype that primarily arises
from cells morphologically resembling oligodendroglia [32]. Like the classification of
astrocytomas, the classification of oligodendrogliomas is based on histopathological
features; the majority of studies describe oligodendrogliomas in the context of other
malignancies, making analysis of correlations between tumor characteristics and patient
prognosis difficult. Bailey proposed a link between what are now called
oligodendrogliomas [4] [3] and oligodendrocytes based on three observations: both cell
types demonstrate round and uniform nuclei, both show a swollen and clear cytoplasm
following standard histological tissue preparation, and the two cell types display similar
cell processes upon silver staining [27].
9 The gemistocytic astrocytoma is a histological variant of diffuse astrocytoma. It is characterized by the
presence of large, glial fibrillary acidic protein (GFAP)-expressing neoplastic astrocytes called gemistocytes
and a high tendency towards rapid progression to glioblastoma [63].
10 Having variation in the size and shape of their nuclei.
11 Located in the upper part of the brain, anatomically above the "tentorium cerebri".
12 Cells that look like fibers when viewed under a microscope.
Oligodendrogliomas (WHO grade II) are moderately cellular, monomorphic tumors with
round nuclei, often artifactually swollen cytoplasm, few or no mitoses, and no florid
microvascular proliferation or necrosis. The characteristic cytoplasm artifact relevant
for diagnosis is the clear cytoplasm, resembling a honeycomb, seen upon standard tissue
preparation. The tumor tissue contains numerous delicate, branching vessels with a
reticular appearance.
Anaplastic oligodendrogliomas are histologically represented by aggressiveness
(WHO grade III) through increase in nuclear pleomorphic
13
features, abnormal
Gene      Location    Function                                       Alteration common in
TP53      17p13                                                      diffuse astrocytomas, anaplastic astrocytomas, glioblastomas (secondary)
RB1       13q14                                                      glioblastomas, anaplastic astrocytomas
CDKN2A    9p21        Inhibitor of cyclin-dependent kinase 4 and 6   glioblastomas, anaplastic astrocytomas
p14ARF    9p21        Inhibitor of Mdm2                              glioblastomas, anaplastic astrocytomas
PTEN      10q23                                                      glioblastomas
EGFR      7p11                                                       glioblastomas
PDGFR     4q12                                                       glioblastomas
MET       7q31                                                       glioblastomas
CDK4      12q13                                                      glioblastomas
CDK6      7q21-q22                                                   glioblastomas
CCND1     11q13                                                      glioblastomas
CCND3     6p21                                                       glioblastomas
MDM2      12q15                                                      glioblastomas
MDM4      1q32                                                       glioblastomas
MYCC      8q24        Transcription factor                           glioblastomas
13 OLIG genes are expressed strongly in oligodendroglioma, and are absent or expressed at a low level in
astrocytoma.
Studies have indicated that tumor suppressor genes [17][29][53][64][46] and
proto-oncogenes [21][39][46] are genetically altered during glioma progression. Among the
changes are the amplification of EGFR, loss of heterozygosity (LOH) of chromosome 10,
mutation or deletion of PTEN, LOH of chromosome 9, deletion of p16 genes, and
mutation in p53 [11]. Recent genomic studies using cDNA microarrays have revealed a
large number of gene expression changes, including the discovery of overexpression of
IGFBP2 in 80% of glioblastomas [66][12][23][20].
We are far from understanding the gene and pathway alterations in gliomas. In fact,
the question of the cell of origin, illustrated by the animal models of gliomas and by
lineage markers such as the Olig1/2 marker13 [38], remains unsolved. A number of papers
[6][17][20] observe alterations in astrocytic cells in the expression of growth factors and
other proteins that control apoptosis, cell cycle, and (see Figure 3.2). In
oligodendroglioma expansion, LOH 1p/19q is suggested [50][43] to characterize
oligodendroglial lineage evolution (Figure 3.3). If alterations in the p53-dependent
pathways are present, the diagnosis is of mixed glioma (i.e., oligoastrocytoma). In
summary, most genetic alterations in gliomas result from the disruption of three main
cellular systems: the RB1, p53, and tyrosine kinase receptor pathways. Other gene
alterations have also been identified in glioma, including those in genes that promote
mitotic signal transduction, cell cycle regulation, apoptosis, angiogenesis, or invasion.
[Figure 3.2, astrocytic tumors (diffuse astrocytomas, anaplastic astrocytomas, glioblastomas). Labels
recovered from the figure: TP53 mutation; p14ARF hypermethylation / homozygous deletion; MDM2
amplification; MDM4 amplification; PDGFRA overexpression; PDGFRA amplification; EGFR
amplification / rearrangement; MET amplification.]
[Figure 3.3, oligodendroglial tumors. Labels recovered from the figure: temporal; extra-temporal;
TP53 mutation; unknown; oligoastrocytoma; (astrocytoma); oligodendroglioma.]
Histological class / Pathophysiology / Images15

Diffuse astrocytoma (WHO grade II)

Anaplastic astrocytoma (WHO grade III)

Glioblastoma multiforme (WHO grade IV)

Pilocytic astrocytoma (grade I)
    Characterized by fusiform "piloid" bipolar astrocytes, with alternating dense and loose
    areas. In loose areas, microcysts may coalesce to form macroscopic cysts. The presence of
    nuclear atypia (without mitotic activity) does not carry a worse prognosis. Vascular
    changes are limited to capillary proliferations that may include glomeruloid capillaries
    and endothelial proliferation. Eosinophilic "Rosenthal fibers" are characteristic;
    calcification is possible. Common locations are the cerebellum and diencephalon
    (especially the optic nerves and hypothalamus).

Pleomorphic xanthoastrocytoma (grade II)
    Pleomorphic xanthoastrocytomas are a mixture of unusually pleomorphic cells, ranging
    from fibrillary to bizarre giant multinucleated cells with intracellular lipid vacuoles
    ("xanthoma" cells). Usually a large hemispheric mass, closely related to the cerebral
    surface.

Oligodendroglioma (grade II)
    Oligodendrogliomas are a continuous spectrum of lesions ranging from well-differentiated
    neoplasms to malignant tumors; solid, relatively well-defined, soft, gray-pink tumors.
    The tumor is typically located in the cortex and white matter, and infiltration of the
    overlying leptomeninges17 may be seen. Calcification is frequent. Necrosis, cyst
    formation, and hemorrhage define grade III.

Anaplastic oligodendrogliomas (grade III)

Oligoastrocytomas
    Oligoastrocytomas exhibit histologic characteristics indicative of malignancy, including
    high cellularity, cellular pleomorphism, nuclear atypia, and increased mitotic activity,
    that include both astrocytic and oligodendrocytic components.

15 Some of the images are from public sources; others are from experiments performed in vivo at the Cancer Genomics Core Laboratory by Sarah Dunlap, M.D. Anderson Cancer Center.
References
[1] Adesina AM, Dunn ST, Moore WE, Nalbantoglu J. Expression of p27kip1 and p53 in
medulloblastoma: Relationship with cell proliferation and survival. Pathol Res Pract 2000;196:243-50
[2] Aldosari N, Bigner SH, Burger PC, et al. MYCC and MYCN oncogene amplification in
medulloblastoma. A fluorescence in situ hybridization study on paraffin sections from the Children's
Oncology Group. Arch Pathol Lab Med 2002;126:540-44
[3] Bailey P, Bucy P (1929) Oligodendrogliomas of the brain. J Pathol Bacteriol 32:735-754
[4] Bailey P, Cushing H. A classification of tumors of the glioma group on a histogenetic basis with a
correlated study of prognosis. Philadelphia: JB Lippincott, 1926.
[5] Batra SK, McLendon RE, Koo JS, et al. Prognostic implications of chromosome 17p deletions in
human medulloblastomas. J Neurooncol 1995;24:39-45
[6] Ben Arush MW, Linn S, Ben-Izhak O, et al. Prognostic significance of DNA ploidy in childhood
astrocytomas. Pediatr Hematol Oncol 1999;16:387-96
[7] Biegel JA, Janss AJ, Raffel C, et al. Prognostic significance of chromosome 17p deletions in
childhood primitive neuroectodermal tumors (medulloblastomas) of the central nervous system. Clin
Cancer Res 1997;3:473-78
[8] Boudreau CR, Liau LM, Molecular characterization of brain tumors. Clin. Neurosurg. 2004, 51,
81-90.
[9] Bredel M, Pollack IF, Hamilton RL, Birner P, Hainfellner JA, Zentner J. DNA topoisomerase
IIalpha predicts progression-free and overall survival in pediatric malignant non-brainstem gliomas. Int J
Cancer 2002;99:817-20
[10] Carter M, Nicholson J, Ross F, et al. Genetic abnormalities detected in ependymomas by
comparative genomic hybridisation. Br J Cancer 2002;86:929-39
[11] Caskey LS, Fuller GN, Bruner JM, Yung WK, Sawaya RE, Holland EC, Zhang W. Toward a
molecular classification of the gliomas: histopathology, molecular genetics, and gene expression
profiling. Histol Histopathol. 2000 Jul;15(3): 971-81.
[12] Chen G, Gharib TG, Huang CC, Taylor JM, Misek DE, Kardia SL, Giordano TJ, Iannettoni MD,
Orringer MB, Hanash SM, Beer DG. Discordant protein and mRNA expression in lung
adenocarcinomas. Mol. Cell Proteomics 2002, 1, 304-313.
[13] Cogen PH. Prognostic significance of molecular genetic markers in childhood brain tumors.
Pediatr Neurosurg 1991;17:245-50
[14] Coons SW, Johnson PC, Haskett D, Rider R. Flow cytometric analysis of deoxyribonucleic acid
ploidy and proliferation in choroid plexus tumors. Neurosurgery 1992;31:850-56
[15] Cushing H. Intracranial tumors: Notes upon a series of two thousand verified cases with surgical-mortality percentages pertaining thereto. Springfield, IL: Charles C. Thomas. 1932
[16] De Prada I, Cordobes F, Azorin D, Contra T, Colmenero I, Glez-Mediero I. Pediatric giant cell
glioblastoma: a case report and review of the literature. Childs Nerv Syst. 2005 Jul 6; [Epub ahead of
print]
[17] Drach LM, Kammermeier M, Neirich U, et al. Accumulation of nuclear p53 protein and prognosis
of astrocytomas in childhood and adolescence. Clin Neuropathol 1996;15:67-73
[18] Dyer S, Prebble E, Davison V, et al. Genomic imbalances in pediatric intracranial ependymomas
define clinically relevant groups. Am J Pathol 2002;161:2133-41
[19] Espina V, Mehta AI, Winters ME, Calvert V, Wulfkuhle J, Petricoin EF 3rd, Liotta LA. Protein
microarrays: molecular profiling technologies for clinical specimens. Proteomics. 2003 Nov;3(11):2091-100.
[20] Fuller GN, Hess KR, Rhee CH, et al. Molecular classification of human diffuse gliomas by
multidimensional scaling analysis of gene expression profiles parallels morphology-based classification,
correlates with survival, and reveals clinically-relevant novel glioma subsets. Brain Pathol 2002;12:108-16
[21] Gilbertson RJ, Clifford SC, MacMeekin W, et al. Expression of the ErbB-neuregulin signaling
network during human cerebellar development: Implications for the biology of medulloblastoma. Cancer
Res 1998;58:3932-41
[22] Grubb RL, Calvert VS, Wulfkuhle JD, Paweletz CP, Linehan WM, Phillips JL, Chuaqui R, Valasco
A, Gillespie J, Emmert-Buck M, Liotta LA, Petricoin EF. Signal pathway profiling of prostate cancer
using reverse phase protein arrays. Proteomics. 2003 Nov;3(11):2142-6.
[23] Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance
in yeast. Mol Cell Biol. 1999 Mar;19(3):1720-30.
[24] Hall PA, Going JJ. Predicting the future: A critical appraisal of cancer prognosis studies.
Histopathology 1999;35:489- 94
[25] Hamilton RL, Pollack IF. The molecular biology of ependymomas. Brain Pathol 1997;7:807-22
[26] Hart MN, Petito CK, Earle KM (1974) Mixed gliomas. Cancer 33:134-140
[27] Hartmann C, Mueller W, von Deimling A. Pathology and molecular genetics of oligodendroglial
tumors. J Mol Med. 2004 Oct;82(10):638-55.
[28] Ino Y, Betensky RA, Zlatescu MC, et al. Molecular subtypes of anaplastic oligodendroglioma:
Implications for patient management at diagnosis. Clin Cancer Res 2001;7:839-45
[29] Jaros E, Lunec J, Perry RH, Kelly PJ, Pearson AD. p53 protein overexpression identifies a group
of central primitive neuroectodermal tumours with poor prognosis. Br J Cancer 1993;68:801-7
[30] Kaatsch P, Rickert CH, Kühl J, Schüz J, Michaelis J. Population-based epidemiological data of brain
tumors in German children. Cancer 2001;92:3155-64
[31] Kim JY, Sutton ME, Lu DJ, et al. Activation of neurotrophin-3 receptor TrkC induces apoptosis
in medulloblastomas. Cancer Res 1999;59:711-19
[32] Kleihues P, Cavenee WK, eds. World Health Organization classification of tumours: Vol. 1.
Pathology and genetics of tumours of the nervous system. Lyon: IARC Press, 2000.
[33] Klein R, Molenkamp G, Sorensen N, Roggendorf W. Favorable outcome of giant cell
glioblastoma in a child. Report of an 11-year survival period. Childs Nerv Syst. 1998 Jun;14(6):288-91.
[34] Korshunov A, Golanov A. Timirgaz V. Immunohistochemical markers for intracranial
ependymoma recurrence. An analysis of 88 cases. J Neurol Sci 2000;177:72-82
[35] Korshunov A, Savostikova M, Ozerov S. Immunohistochemical markers for prognosis of average-risk pediatric medulloblastomas. The effect of apoptotic index, TrkC, and c-myc expression. J
Neurooncol 2002;58:271-79
[36] Korshunov A, Sycheva R, Timirgaz V, Golanov A. Prognostic value of immunoexpression of the
chemoresistance-related proteins in ependymomas: An analysis of 76 cases. J Neurooncol 1999;45:219-27
[37] Kotylo PK, Robertson PB, Fineberg NS, Azzarelli B, Jakacki R. Flow cytometric DNA analysis of
pediatric intracranial ependymomas. Arch Pathol Lab Med 1997;121:1255-58
[38] Lu QR, Park JK, Noll E, Chan JA, Alberta J, Yuk D, Alzamora MG, Louis DN, Stiles CD,
Rowitch DH, Black PM. Oligodendrocyte lineage genes (OLIG) as molecular markers for human glial
brain tumors. Proc Natl Acad Sci U S A. 2001 Sep 11;98(19):10851-6. Epub 2001 Aug 28.
[39] MacDonald TJ, Brown KM, LaFleur B, et al. Expression profiling of medulloblastoma: PDGFRA
and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat Genet 2001;29:143- 52
[40] Marshall T, Rutledge JC. Flow cytometry DNA applications in pediatric tumor pathology. Pediatr
Dev Pathol 2000;3:314-34
[41] Melton L. Protein arrays: Proteomics in multiplex. Nature 2004, 429, 101-107.
[42] Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I, Hamilton SR, Zhang W. Robust
estimation of protein expression ratios with lysate microarray technology. Bioinformatics. 2005 May
1;21(9):1935-42. Epub 2005 Jan 12.
[43] Mueller W, Hartmann C, Hoffmann A, Lanksch W, Kiwit J, Tonn J, Veelken J, Schramm J,
Weller M, Wiestler OD, Louis DN, von Deimling A (2002) Genetic signature of oligoastrocytomas
correlates with tumor location and denotes distinct molecular subsets. Am J Pathol 161:313-319
[44] Nafe R, Franz K, Schlote W, Schneider B. Morphology of tumor cell nuclei is significantly related
with survival time of patients with glioblastomas. Clin Cancer Res. 2005;11:2141-8.
[45] Nicholson JC, Ross FM, Kohler JA, Ellison DW. Comparative genomic hybridization and
histological variation in primitive neuroectodermal tumours. Br J Cancer 1999;80:1322-31
[46] Nutt CL, Mani DR, Betensky RA, et al. Gene expression-based classification of malignant gliomas
correlates better with survival than histological classification. Cancer Res 2003;63:1602-7
[47] Okami N, Kawamata T, Kubo O, Yamane F, Kawamura H, Hori T. Infantile gliosarcoma: a case
and a review of the literature. Childs Nerv Syst. 2002 Jul;18(6-7):351-5. Epub 2002 May 17.
[48] Paulus W. Lisle DK, Tonn JC, et al. Molecular genetic alterations in pleomorphic
xanthoastrocytoma. Acta Neuropathol 1996;91: 293-97
[49] Pollack IF, Campbell JW, Hamilton RL, Martinez AJ, Bozik ME. Proliferation index as a
predictor of prognosis in malignant gliomas of childhood. Cancer 1997;79:849-56
[50] Pollack IF, Finkelstein SD, Burnham J, et al. Association between chromosome 1p and 19q loss
and outcome in pediatric malignant gliomas: Results from the CCG-945 cohort. Pediatr Neurosurg
2003;39:114-21
[51] Pollack IF, Finkelstein SD, Woods J, et al. Expression of p53 and prognosis in children with
malignant gliomas. N Eng J Med 2002; 346:420-27
[52] Pomeroy SL, Tamayo P, Gaasenbeek, M, et al. Prediction of central nervous system embryonal
tumour outcome based on gene expression. Nature 2002;415:436-42
[53] Raffel C, Frederick L, O'Fallon JR, et al. Analysis of oncogene and tumor suppressor gene
alterations in pediatric malignant astrocytomas reveals reduced survival for patients with PTEN
mutations. Clin Cancer Res 1999;5:4085-90
[54] Reifenberger G, Collins VP. Pathology and molecular genetics of astrocytic gliomas. J Mol Med
(2004) 82:656-670
[55] Reifenberger G, Collins VP. Pathology and molecular genetics of astrocytic gliomas. J Mol Med.
2004 Oct;82(10):656-70.
[56] Reyes-Mugica M, Chou PM, Myint MM, Ridaura-Sanz C, Gonzales- Crussi F, Tomita T.
Ependymomas in children: Histologic and DNA- flow cytometric study. Pediatr Pathol 1994;14:453-66
[57] Rickert CH, Sträter R, Kaatsch P, et al. Pediatric high-grade astrocytomas show chromosomal
imbalances distinct from adult cases. Am J Pathol 2001;158:1525-32
[58] Rickert CH, Wiestler OD, Paulus W. Chromosomal imbalances in choroid plexus tumors. Am J
Pathol 2002;160:1105-13
[59] Salvati M, Caroli E, Raco A, Giangaspero F, Delfini R, Ferrante L. Gliosarcomas: analysis of 11
cases do two subtypes exist? J Neurooncol. 2005 Aug;74(1):59-63.
[60] Shibata N, Oda H, Hirano A, Kato Y, Kawaguchi M, Dal Canto MC, Uchida K, Sawada T,
Kobayashi M. Molecular biological approaches to neurological disorders including knockout and
transgenic mouse models. Neuropathology. 2002 Dec;22(4):337-49.
[61] Sreekumar A, Nyati MK, Varambally S, Barrette TR, Ghosh D, Lawrence TS, Chinnaiyan AM.
Profiling of cancer cells using protein microarrays: discovery of novel radiation-regulated proteins.
Cancer Res. 2001 Oct 15;61(20):7585-93.
[62] Ward S, Harding B, Wilkins P, et al. Gain of 1q and loss of 22q are the most common changes
detected by comparative genomic hybridisation in paediatric ependymoma. Genes Chromosomes Cancer
2001;32:59-66
[63] Watanabe K, Peraud A, Gratas C, Wakai S, Kleihues P, Ohgaki H. p53 and PTEN gene mutations
in gemistocytic astrocytomas. Acta Neuropathol (Berl). 1998 Jun;95(6):559-64.
[64] Woodburn RT, Azzarelli B, Montebello JF Goss IE. Intense p53 staining is a valuable prognostic
indicator for poor prognosis in medulloblastoma/central nervous system primitive neuroectodermal
tumors. J Neurooncol 2001;52:57-62
[65] Wulfkuhle JD, Aquino JA, Calvert VS, Fishman DA, Coukos G, Liotta LA, Petricoin EF 3rd.
Signal pathway profiling of ovarian cancer from human tissue specimens using reverse-phase protein
microarrays. Proteomics. 2003 Nov;3(11):2085-90.
[66] Zhou YH, Tan F, Hess KR, Yung WK. The expression of PAX6, PTEN, vascular endothelial
growth factor, and epidermal growth factor receptor in gliomas: relationship to tumor grade and survival.
Clin Cancer Res. 2003 Aug 15;9(9):3369-75.
Chapter 4
Fundaments of analysis of cDNA microarray data
Microarray technology can be described as a multi-step, data-intensive technology for
observing RNA expression in a high-throughput manner. One reason for its popularity is
the revolutionary perspective it has added to biological studies: several decades ago
researchers could observe variations only on a gene-by-gene basis, but they are now able
to evaluate hundreds or thousands of genes in a single experiment. Improvements in the
technology allow gene expression to be observed in a highly noisy environment, and good
results can be obtained with a wisely chosen experimental design. In addition, a great
deal of information can be extracted by means of simple methods.
Two major platforms are frequently used, differing in the method of nucleic acid
deposition: in the first, nucleic acid probes are robotically spotted, and in the second,
they are deposited by photolithography. Both platforms are continuously optimized,
changed, and further developed. The robotically spotted microarray (referred to as cDNA
because polymerase chain reaction (PCR) products were amplified from cDNA) was
described by Schena et al.1 [31] in 1995 and DeRisi et al.2 in 1996 [7]; other important
improvements to the algorithms are presented in [2], [5] [8], [16] [18], [24] [32], [34],
[39], [37], [38] [9]. The photolithographically synthesized array (also referred to as
high-density oligonucleotide chips or oligoarrays, because each gene is represented by
multiple ~25-mer oligonucleotides, or as Affymetrix chips, after the first company to
produce the technology) was initially described by Lockhart et al. [22], Lipshutz et al.
[21], and others [20] [28] [30][29] [33] [42]. The original differences based on the size of
the nucleic acids (number of nucleotides in each probe) have gradually diminished over
the last few years. Each platform has strengths and weaknesses.
The elements of each procedural step are described with emphasis on the
technical and statistical-algorithmic aspects; Figure 4.1 shows the flow of a typical
microarray experiment.
1 The system was developed to monitor the expression of 45 Arabidopsis genes. It was based on
simultaneous, two-color fluorescence hybridization, in volumes of 2 µL, to robotically printed high-density
arrays. This system enabled for the first time the detection of transcripts in probe mixtures derived from 2
micrograms of total cellular messenger RNA.
2 The paper was the first investigation of the genetic basis of the phenotype of tumors using cDNA
technology. The probes for hybridization were derived from two sources of cellular mRNA [UACC-903 and
UACC-903(+6)]. As in the previous citation, samples were labeled with two-color dyes to provide a direct and
internally controlled comparison of the mRNA levels.
[Flowchart, stages in order]
Experimental design: define the biological or pathological question; statistical relevance of the experiment; microarray scheme design; validation
Microarray hybridization: fabrication of microarray; labeled sample A; labeled sample B
Microarray scanning; spot identification and segmentation; intensity value extraction
Data pre-processing: poor-quality spots assessment; normalization
Data analysis: differentially expressed genes; classification; pathway analysis; gene ontology; knowledge extraction; etc.
Figure 4.1 Flow of a typical microarray experiment. A good experimental design ensures high-quality results.
Figure 4.2 Representation of the cDNA microarray schema presented in [9] by the NHGRI group (Duggan,
Bittner, Chen, Meltzer, and Trent) in 1999.
The success of the experiment depends on its initial design. Four
elements have been distinguished as crucial in the design phase (see Leung et al. [18]
[19]).
1) The hypothesis and the initial examination conditions of the biological aspects
should be stated clearly. This is not intended to restrict the chances of new, unexpected
findings, but to concentrate on a few biological questions. To illustrate, the next
examples focus on feature selection and discrimination of classes A, B, C, and D. If
the number of profiles (samples) is limited, feature selection in the discrimination
(A vs. B vs. C vs. D) is more prone to errors than in the case (A+B vs. C+D). Moreover,
the set of genes selected by the discrimination (A vs. B vs. C vs. D) differs from the
sets of genes selected by the discriminations (A vs. B vs. C+D) or (A+B vs. C vs. D)
or any other combination of these classes. Sometimes it is difficult to derive the
classification rules from the biologist's questions.
2) The systematic and experimental errors in microarray experiments can be
reduced during the design phase. Careful planning will allow the cells from the treatment
and normal groups to undergo similar conditions, so that the parameters and conditions of
the experiment affect the cells in the discriminated classes equally. Another example
Experimental design
It has to combine the knowledge of several fields. When the technology was first
developed, the design in many laboratories was decided without taking into account
mathematical or statistical aspects (for instance, the value of an experiment that contains
a small number of samples is limited). Recently the situation has changed, and more and
more genomic laboratories combine, in the experimental design phase, the efforts of
(molecular) biologists, (clinical) medical staff with pathology expertise, technical
microarray specialists, and computer scientists with mathematical and statistical
knowledge. One of the benefits of involving computer scientists in the design stage is to
minimize the time and cost of acquiring results from this high-throughput
technology.
Also, programmers can integrate the analyzed microarray data and can
enhance the adaptability of the experimental design. The circumstances of the experiments,
especially in basic research laboratories or core facilities, often lead to questions that
commercially available software cannot answer.
Statistics must contribute to the analysis of quantitative biological results if a
biological experiment is to be interpreted correctly. The performance of further
analysis, the choice of mathematical methods, and the accuracy and ease of
implementation of algorithms depend strongly on a statistically correct design of the
experiment.
The selected features and the (supervised) classification error depend not only
on an accurate classification algorithm. If samples are labeled erroneously, the features
selected by even an optimal classifier will reflect an erroneous biological learning procedure.
Because the labeling is determined by a pathologist, this initial labeling is highly
important. Typically, the number of samples in an array experiment is small. Therefore,
if a significant percentage is incorrectly labeled, the quality of the experiment decreases
dramatically. Special attention is required for cases with mixed cell types, such as the mixed
glioma types in Fuller et al.4 (2002) [11], where the pathologists designated the samples
as having features from two or more pathological classes. Mixed samples will not be
utilized to gain knowledge about the class discrimination, because in these cases the
disease-class attribution is unclear.
4 Glioma sub-types are clinically classified by visual observation of features of the tissues. In labeling the
cancer sub-types, pathologists make assignments based on the type of cell from which the tumor originates.
Neuronal cells very seldom degenerate and produce cancer; the cells that nourish, protect, and mechanically
support the neurons are at the origin of neuronal cancers. Two glioma sub-types are oligodendrogliomas,
which originate from oligodendrocytes, and astrocytomas, which originate from astrocytes. In several cases,
the pathologists observe tissues with features from both sub-types. The approach currently used is to label
such a cancer as "mixed glioma". When analyzing the molecular bases of glioma sub-type discrimination, the
samples labeled as "mixed glioma" should not be considered.
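The effect of erroneous labels on a small sample set can be illustrated with a minimal simulation sketch. The function names (`flip_labels`, `top_features`), the simple standardized mean-difference score, and all sample sizes below are assumptions made for illustration only; they are not the classifiers or feature-selection methods used in the papers of this thesis.

```python
import numpy as np

def flip_labels(labels, fraction, rng):
    """Return a copy of binary (0/1) labels with a given fraction flipped."""
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(fraction * labels.size))
    idx = rng.choice(labels.size, size=n_flip, replace=False)
    noisy[idx] = 1 - noisy[idx]
    return noisy

def top_features(X, y, k):
    """Indices of the k features with the largest standardized mean difference."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    # Small constant avoids division by zero for constant features.
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)
    return np.argsort(score)[::-1][:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_per_class, n_genes, n_informative = 15, 500, 10   # hypothetical sizes
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, n_genes))
    X[y == 1, :n_informative] += 2.0                     # hypothetical "true" marker genes
    for fraction in (0.0, 0.1, 0.3):
        selected = top_features(X, flip_labels(y, fraction, rng), n_informative)
        hits = np.intersect1d(selected, np.arange(n_informative)).size
        print(f"{fraction:.0%} mislabeled: {hits}/{n_informative} true markers recovered")
```

Running the script prints, for each mislabeling rate, how many of the simulated marker genes are still among the selected features; with so few samples, the count would be expected to drop as the fraction of flipped labels grows.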
An important prerequisite of classification procedures is a complete, up-to-date, and
correct clinical database. If the database is correctly designed and properly populated,
the design algorithms can select the features responsible for the aggressiveness of the
disease or for survival. Tracking all these details of the experimental design requires
complete attention and a clear and organized approach. The proper design of both
experiments and data handling will avoid undesired and costly repetition of experiments
later.
Image acquisition is also critical and can affect the quality of the entire process.
Acquisition can easily be repeated two or three times in the time-frame limited by the
stability of the fluorescent dyes used. Image acquisition and analysis is succinctly
described by (1) laser scanning, (2) spot recognition and segmentation, and (3) spot
intensity evaluation.
Details of the cDNA image scanning process are described here; the procedures
used for oligonucleotide arrays are similar. Immediately after sample spotting, in order
to avoid label degradation due to time and light instability,5 the slides are scanned and
the image files are stored in a lossless format (usually TIFF6). The files are
typically 20-80 MB for each of the two dye wavelengths, depending on the resolution
(typically adjustable in steps of 1, 5, 10, and 20 µm). For large numbers of samples,
scanning is a time-consuming process. Sufficiently informative scans can be
obtained for cDNA when the resolution is set to 10 µm [12]. A suggested rule is that
the pixel size be at most 10% of the spot diameter7. The scanners on the market
are generally designed for the dual red-green (Cy5-Cy3) wavelengths, and most have a
port for installing an additional laser bulb that allows blue and yellow detection.
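The arithmetic behind the 10% rule can be checked with a short sketch. The function name and the 100 µm spot diameter used in the example are assumptions chosen for illustration; the sketch simply counts the pixels whose centers fall inside a circular spot.

```python
def pixels_inside_spot(spot_diameter_um, pixel_size_um):
    """Count pixels whose centers fall inside a circular spot.

    The spot is centered in a square window whose side equals the spot
    diameter. Returns (pixels_in_window, pixels_in_spot).
    """
    n = int(round(spot_diameter_um / pixel_size_um))  # pixels across the spot
    radius = n / 2.0
    inside = 0
    for i in range(n):
        for j in range(n):
            # Pixel center measured from the spot center, in pixel units.
            dy = i + 0.5 - radius
            dx = j + 0.5 - radius
            if dx * dx + dy * dy <= radius * radius:
                inside += 1
    return n * n, inside


# Hypothetical 100 µm spot scanned at 10 µm per pixel (pixel size = 10% of diameter).
window, spot = pixels_inside_spot(100, 10)
print(window, spot)
```

For this hypothetical spot the window contains 100 pixels, of which 80 have their centers inside the circle, which is consistent with the "more than 60 pixels" stated for the rule. At 20 µm per pixel the same spot would be covered by only 21 of 25 pixels.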
Correct scanning requires appropriate adjustment of the laser beam intensity and the
photomultiplier gain. The number of spots with saturated pixels should be kept small
(e.g., 3-10 for a slide with 12 k features), so that the experiment benefits from the
maximum dynamic range of the device. Each scan degrades the quality of the slide to some
extent through irreversible modification of the fluorescent labels (this effect is called
photo-bleaching). If the laser beam intensity is too low and the gain is too high, the
photomultiplier will introduce too much noise or the spots will not be visible. It is
advisable to use the lowest laser intensity needed for a "preview" scan in order not to
damage the slide. Although the slides are kept in a dry and dark place, photo-bleaching
affects the microarrays and the slides will typically degrade after six months.
5 At M.D. Anderson Cancer Center in the Cancer Genomics Core Laboratory the scanning step is performed
on the same day as the washing procedure, when the membrane of the slides is still wet. Spurious expression
levels can be observed when the slide is very dry.
6 Tagged Image File Format.
7 This rule generates a square of 100 pixels that covers the spot, of which more than 60 pixels
characterize the spot. This is considered enough to obtain a good estimate of the spot value. See the "sVOL"
definition further on.
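The saturation criterion for the scan settings can be checked programmatically once the spot positions are known. The following is a minimal sketch, assuming a 16-bit image (saturation value 65535), circular spots of a fixed radius, and a hypothetical function name; it is not part of any scanner software.

```python
import numpy as np

def count_saturated_spots(image, centers, radius, saturation=65535):
    """Count spots that contain at least one saturated pixel.

    image      : 2D array of pixel intensities (one dye channel)
    centers    : iterable of (row, col) spot centers, in pixels
    radius     : spot radius in pixels (fixed-circle assumption)
    saturation : detector ceiling; 65535 assumes a 16-bit image
    """
    image = np.asarray(image)
    rows, cols = np.ogrid[:image.shape[0], :image.shape[1]]
    count = 0
    for cy, cx in centers:
        # Boolean mask of the pixels inside the circle around this spot center.
        mask = (rows - cy) ** 2 + (cols - cx) ** 2 <= radius ** 2
        if np.any(image[mask] >= saturation):
            count += 1
    return count
```

The returned count can then be compared with the target range quoted above (e.g., 3-10 spots for a slide with 12 k features) before committing to a full-resolution scan. For a whole slide, restricting each mask to a small window around the spot would be considerably faster than the full-image mask used here.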
To locate the spots and quantify the spot-level values, the papers presented in this
thesis used ArrayVision. This software is one of many choices on the market.
Its advantages include clear and detailed documentation of the
algorithms for measuring expression. Once a laboratory creates a structure that links the
position of each spot to the accession number (from the GenBank database) of the
respective gene, the configuration for robotic spotting of cDNA and the spot-estimation
method do not need major modifications for further experiments.
The grid8 for spot-finding is placed manually (using click-and-drag). If the distance
characteristics are correctly defined over the entire slide, the pre-defined circles will
overlap the spots. The precise initial position is not critical once the first circle
overlaps the first spot, since the grid position will auto-tune within a fine range. The
specifications for the grid, including the number of spots, the configuration in patches (the
number of vertical and horizontal spots and the distances between spots), the respective
distances, and the dimensions, are easily accessible and defined in a table.
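How such a table of grid specifications translates into nominal spot centers can be sketched as follows. The `GridSpec` fields and the `spot_centers` function are hypothetical names introduced here for illustration; they do not reproduce the ArrayVision configuration format.

```python
from dataclasses import dataclass

@dataclass
class GridSpec:
    patch_rows: int        # number of patches, vertically
    patch_cols: int        # number of patches, horizontally
    spot_rows: int         # spots per patch, vertically
    spot_cols: int         # spots per patch, horizontally
    spot_pitch: float      # center-to-center distance between spots
    patch_pitch_y: float   # vertical distance between patch origins
    patch_pitch_x: float   # horizontal distance between patch origins
    origin_y: float        # center of the first spot (set by click-and-drag)
    origin_x: float

def spot_centers(spec):
    """Return nominal (y, x) spot centers, patch by patch, row by row."""
    centers = []
    for pr in range(spec.patch_rows):
        for pc in range(spec.patch_cols):
            for sr in range(spec.spot_rows):
                for sc in range(spec.spot_cols):
                    y = spec.origin_y + pr * spec.patch_pitch_y + sr * spec.spot_pitch
                    x = spec.origin_x + pc * spec.patch_pitch_x + sc * spec.spot_pitch
                    centers.append((y, x))
    return centers
```

Only the origin depends on where the user drops the grid; all other values come from the table, which is why a laboratory can reuse the same specification across experiments printed with the same robot configuration.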
There are several algorithms that automatically reposition the grid and warp the grid locations to the centers of the spots. One of these methods, based on the two-dimensional fast Fourier transform (2D-FFT) of the image, obtains the relative distance between spots from the next-maximum frequency, since the spots represent local maxima in two-dimensional space. A requirement in all grid-finding solutions is to visually verify the grid and/or flag the spots that are of low quality.
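The frequency-based idea can be illustrated with a deliberately simplified sketch. It is not the 2D-FFT method itself: it assumes the image has already been summed along one axis into a one-dimensional intensity profile, and it uses a naive discrete Fourier transform. The function name `dominant_period` and the synthetic profile are hypothetical.

```python
import cmath
import math


def dominant_period(profile):
    """Return the period (in pixels) of the strongest non-zero DFT frequency."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]  # remove the constant (DC) component
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        # Naive DFT coefficient for frequency index k
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


# Hypothetical column-sum profile: one bright spot every 10 pixels over 120 pixels
profile = [1.0 + math.cos(2 * math.pi * x / 10) for x in range(120)]
spacing = dominant_period(profile)  # estimated distance between spot centers
```

On real scans the profile is noisy and the spacing need not divide the image width, so the estimate would only serve as a starting value for the visually verified grid.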
Once the grid for spot-finding is placed over the scanned microarray image, the values of the spots are acquired. Segmentation is the process used to separate the foreground pixels containing the true signal from the background pixels. Two approaches are typically considered, depending on the segmentation algorithm that defines the convex area of the spot. The simplest is fixed-circle segmentation9.
Figure 4.3 The image of a single spot. The circles estimate the intensity of the spot, while the diamonds estimate the intensity of the background for the spot area.
8
The grid is the imaginary net structure that defines the location of the centers of the spots.
9
The procedure assumes (ideally) for cDNA that the spots in the array are circular. Besides simplicity of
implementation, it can be advantageous in some oligoarrays (e.g., in Illumina's arrays www.ilumina.com)
A more accurate procedure with direct application to cDNA arrays, since the spots are
not perfectly circular, is the adaptive segmentation method (Figure 4.3). Two circles
centered at the center of mass of the spot, which usually coincides with a node of the
spot-finding grid, define the minimum and maximum initial area for segmentation. The
algorithm called SPOT [37], developed by Yang et al. (2001), enlarges the small inner circle and shrinks the large exterior circle to a convex area such that an optimal border
can be determined. A region-growing algorithm can then be used to define the area of
the spot enclosed in this segmented convex area.
In estimating the value of a spot, the algorithms may or may not use the background.
The value of the background may be estimated as the mean (or median) value over
the entire slide; in this case the method is called flattened background. Another method
is to estimate the background locally for each spot: the intensity of the background
pixels is measured in four diamonds around each spot, and the estimated intensity
(the median of the pixels in the diamonds) is then applied to the area of the spot.
In most publications on cDNA microarrays, the spot value is estimated as the "subtracted
volume" (i.e., the volume expression of the spot when the background is removed), denoted
by sVOL:
\[ \mathrm{sVOL} = (S_{spot} - S_{bkg})\, A_{spot}, \]
where $S_{spot}$ is the spot signal level calculated as the median of the pixel values in the
convex area $A_{spot}$ of the spot, and $S_{bkg}$ is the background level estimated over the background area $A_{bkg}$.
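A background-subtracted spot value of this kind can be sketched as follows. The sketch assumes that the subtracted volume is the difference between the median spot level and the median local background level, multiplied by the spot area in pixels; that reading, the function name `subtracted_volume`, and the pixel values are illustrative assumptions and are not taken from the ArrayVision documentation.

```python
from statistics import median


def subtracted_volume(spot_pixels, background_pixels):
    """Background-subtracted volume of one spot (assumed form, see lead-in)."""
    s_spot = median(spot_pixels)        # signal level inside the segmented spot area
    s_bkg = median(background_pixels)   # local background level from the diamonds
    a_spot = len(spot_pixels)           # spot area, counted in pixels
    return (s_spot - s_bkg) * a_spot


# Hypothetical pixel intensities for one spot and its four background diamonds
spot = [120, 130, 125, 140, 135]
diamonds = [20, 22, 19, 21]
svol = subtracted_volume(spot, diamonds)
```

For the flattened-background variant, the same function can be called with the background pixels of the entire slide instead of the local diamonds.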
where the spots are defined by holes in the silicon bed. The hybridization takes place in the holes and the fixed-circle method confers robustness.
medoid (PAM) of each cluster is chosen as the cluster representative. The results [25]
are superior to those of previous methods by Yang et al. [39] [37].
10
Chapter 8 describes a set of algorithms for selection of the differentially expressed genes after the preprocessing steps are performed. A variety of signal processing methods can be applied for
classification/discrimination of cancer types and for discovering differentially expressed genes in a specific
cancer type or sub-types using various methods from multivariate data analysis.
11
"Gene expression" refers to the amount of one or more gene products in a particular sample. In the cDNA
microarray case, the expression is often calculated as the mean of the two replicates. Several other modalities
are possible if the number of replicates is higher. In oligoarrays the value of "gene expression" is related to the
expression of the gene by means of the hybridization to a number of oligonucleotides (~30 per gene).
12
An infinite value comes from dividing by zero (e.g., tumor/normal; also logarithm of a zero value is not
defined).
on the two channels of scanning is known to introduce important errors that can partially
be corrected. A microarray experiment can be normalized based on principles that come
from different fields. From molecular biology comes a method in which normalization is
performed by introducing additional gene probes as controls into the microarray design,
those for so-called equally-expressed genes. From statistics comes a method that normalizes the
expression of the entire microarray by its own statistics: for sets with 20 k features/genes, it
is expected that most genes are expressed in the same way in the different samples.
The approach of equally-expressed genes introduces probes that hybridize to
housekeeping genes or exogenous genes as controls. Initially, housekeeping genes like
glyceraldehyde-3-phosphate dehydrogenase or ribosomal RNA were thought to be
expressed at a constant level in all tissues [10]. Many studies used a single housekeeping
gene for normalization. However, in later studies [6], no gene with equal expression in all
cells could be identified. Several papers showed that the expression level of housekeeping
genes varied among tissues or cells, and changed under certain circumstances. The
reference gene(s) must be selected for each study (namely, for the tissue of origin), and the
selection requires special attention [41], [5]. If care is not exercised, normalizing the expression
values to a housekeeping gene may introduce additional errors.
Another method of [...] of dyes Cy3 and Cy5. This method cannot be applied [...]
Nonlinear normalization methods use statistics that assume the genes on a slide to be
similarly expressed in their median and quantile values [13] [16] [23]. There
is a large variety of algorithms for normalizing the expression levels (described in Chapter
8) that fall into several categories: (1) subtraction of the global median value of each
sample, (2) subtraction of the median value and equalization of the variance using the
standard deviation (STD) or the median absolute deviation (MAD) estimate, and (3)
locally weighted scatterplot smoothing (lowess) normalization.
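The first two categories can be sketched in a few lines. This is a minimal illustration on log-ratios of a single sample; the function names `median_center` and `mad_scale` and the example values are hypothetical, and the MAD is used unscaled (without the usual consistency constant) to keep the arithmetic transparent.

```python
from statistics import median


def median_center(values):
    """Category (1): subtract the global median of the sample."""
    m = median(values)
    return [v - m for v in values]


def mad_scale(values):
    """Category (2): median-center, then divide by the median absolute deviation."""
    centered = median_center(values)
    mad = median([abs(v) for v in centered])
    if mad == 0:
        raise ValueError("MAD is zero; the sample cannot be scaled")
    return [v / mad for v in centered]


# Hypothetical log-ratios for one sample, including one outlier
log_ratios = [0.0, 1.0, 2.0, 3.0, 10.0]
centered = median_center(log_ratios)
scaled = mad_scale(log_ratios)
```

The median and the MAD are used here rather than the mean and the standard deviation because a few strongly expressed genes, such as the outlier above, barely shift them.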
Briefly, "MA-plots" were defined by Dudoit et al. [8] and help identify the artifacts of
the spotting procedure. On the scatter plot defined by M and A [...] constant factor, therefore on the logarithm scale [...] A-dependent normalization [39],
Berger et al. [3] presented an optimization approach for the parameters in lowess,
which are usually selected arbitrarily. The optimization, based on a local regression
procedure, determines the bandwidth parameter for the local regression by minimizing a
cost function defined as the mean-squared difference between the lowess estimates and the
normalization reference level.
Although initially designed to equalize the red and green channels of the microarray
experiment, the normalization algorithms described above may also apply to spot-replicate
values and to gene-expression values originating from two samples. In the case of spot replicates, even for the same color, the expressions of the replicates are expected to have the same
local characteristics of median and variance if the number of evaluated genes is high. In
the case of gene expressions, the two samples (e.g., tumor vs. normal, or a transfected
stable cell line vs. parental cells [15]) should have similar levels of expression in the
majority of genes.
Interestingly, in addition to the systematic effects of dyes, which can mostly be corrected
by a normalization method, a gene-specific dye bias was recently defined by Martin-Magniette et al. [24] as the Label Bias Index (LBI). This bias may take values larger than two
for the ratio (Cy3 / Cy5) and may alter the conclusions about differentially expressed
genes. The issue is new and more studies must be performed for the effect to be
understood.
[...] adopted by the Object Management Group (OMG) standards group [...] Affymetrix (www.affymetrix.com), and Iobion [...] capture and management are described on the MIAME webpage (http://www.mged.org/miame).
The minimal information is structured in six parts: (1) experimental design, covering the
set of hybridization experiments as a whole; (2) array design, covering each array used
and each element (spot) on the array; (3) samples used, extract preparation, and
labeling; (4) procedures and parameters of hybridizations; (5) images, quantitation, and
measurement specifications; and (6) normalization controls: types, values, and
specifications. As MIAME is commonly accepted by major users, it is advisable that
experiments comply with these standards. Also, most journals require data submission
in this format.
References
[1] [No Authors], Microarray standards at last. Nature 419, 323 (26 September 2002); doi:10.1038 /
419323a
[2] Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z. Tissue classification
with gene expression profiles. J Comput Biol. 2000;7(3-4):559-83.
[3] Berger JA, Hautaniemi S, Jarvinen AK, Edgren H, Mitra SK, Astola J. Optimized LOWESS
normalization parameter selection for DNA microarray data. BMC Bioinformatics. 2004 Dec
09;5(1):194.
[4] Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W,
Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC,
Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M.
Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.
Nat Genet. 2001 Dec;29(4):365-71.
[5] Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nat Genet. 2002 Dec;
32 Suppl:490-5.
[6] de Kok JB, Roelofs RW, Giesendorf BA, Pennings JL, Waas ET, Feuth T, Swinkels DW, Span
PN. Normalization of gene expression measurements in tumor tissues: comparison of 13 endogenous
control genes. Lab Invest. 2005 Jan;85(1):154-9.
[7] DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM. Use
of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet. 1996
Dec;14(4):457-60.
[8] Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially
expressed genes in replicated cDNA microarray experiments. Stat. Sinica 12 (2002), pp. 111-139.
[9] Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM. Expression profiling using cDNA
microarrays. Nat Genet. 1999;21(1, suppl):10-14.
[10] Eickhoff B, Korn B, Schick M, Poustka A, van der Bosch J. Normalization of array hybridization
experiments in differential gene expression analysis. Nucleic Acids Res. 1999 Nov 15;27(22):e33.
[11] Fuller GN, Hess KR, Mircean C, Tabus I, Shmulevich I, Rhee CH, Aldape KD, Bruner JM,
Sawaya RA, Zhang W. Chapter 16: Human Glioma Diagnosis From Gene Expression Data; in
Computational and Statistical Approaches to Genomics. Kluwer Academic Publishers, 2002. ISBN: 1-4020-7023-3.
[12] Jain AN, Tokuyasu TA, Snijders AM, Segraves R, Albertson DG, Pinkel D. Fully automatic
quantification of microarray image data. Genome Res. 2002 Feb;12(2):325-32.
[13] Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J
Comput Biol. 2000;7(6):819-37.
[14] Kooperberg C, Fazzio TG, Delrow JJ, Tsukiyama T. Improved background correction for spotted
DNA microarrays. J Comput Biol. 2002;9(1):55-66.
[15] Lee EJ, Mircean C, Shmulevich I, Wang H, Liu J, Niemisto A, Kavanagh JJ, Lee JH, Zhang W.
Insulin-like growth factor binding protein 2 promotes ovarian cancer cell invasion. Mol Cancer. 2005
Feb 02;4(1):7.
[16] Lee ML, Kuo FC, Whitmore GA, Sklar J. Importance of replication in microarray gene expression
studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci U S
A. 2000 Aug 29;97(18):9834-9.
[17] Lee PD. Control genes and variability: absence of ubiquitous reference transcripts in diverse
mammalian expression studies. Genome Res. 12 (2002), pp. 292-297.
[18] Leung YF, Cavalieri D. Fundamentals of cDNA microarray data analysis. Trends Genet. 2003
Nov;19(11):649-59.
[19] Leung YF. et al., Microarray software review. In: Berrar DP. et al., A practical approach to
microarray data analysis, Kluwer academic (2002).
[20] Li C. and Wong W.H. Model-based analysis of oligonucleotide arrays: expression index
computation and outlier detection. Proc. Natl. Acad. Sci. U. S. A. 98 (2001), pp. 31-36.
[21] Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays.
Nat Genet. 1999 Jan;21(1 Suppl):20-4.
[22] Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C,
Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density
oligonucleotide arrays. Nat Biotechnol. 1996 Dec;14(13):1675-1680.
[23] Lönnstedt I, Speed TP. Replicated Microarray Data. Stat. Sinica 12 (2002), pp. 31-46.
[24] Martin-Magniette ML, Aubert J, Cabannes E, Daudin JJ. Evaluation of the gene-specific dye bias
in cDNA microarray experiments. Bioinformatics. 2005 Feb 2; [Epub ahead of print]
[25] Nagarajan R. Intensity-based segmentation of microarray images. IEEE Trans Med Imaging. 2003
Jul; 22(7):882-9.
[26] Pritchard CC, Hsu L, Delrow J, Nelson PS. Project normal: defining normal variance in mouse
gene expression. Proc Natl Acad Sci U S A. 2001 Nov 6;98(23):13266-71.
[27] Quackenbush J. Microarray data normalization and transformation. Nat. Genet. 32 Suppl. (2002),
pp. 496-501.
[28] Sasik R. et al., Statistical analysis of high-density oligonucleotide arrays: a multiplicative noise
model. Bioinformatics 18 (2002), pp. 1633-1640.
[29] Schadt EE, Li C, Ellis B, Wong WH. Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J Cell Biochem Suppl. 2001;Suppl 37:120-5.
[30] Schadt EE, Li C, Su C, Wong WH. Analyzing high-density oligonucleotide gene expression array
data. J Cell Biochem. 2000 Oct 20;80(2):192-202.
[31] Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns
with a complementary DNA microarray. Science. 1995 Oct 20;270(5235):467-70
[32] Sherlock G. Analysis of large-scale gene expression data. Brief. Bioinform. 2 (2001), pp. 350-362.
[33] Shmulevich I, Astola J, Cogdell D, Hamilton SR, Zhang W. Data extraction from composite
oligonucleotide microarrays. Nucleic Acids Res. 2003 Apr 1; 31(7): e36-e36.
[34] Simon RM, Dobbin K. Experimental design of DNA microarray experiments. Biotechniques
(2003), pp. S16-S21.
[35] Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball
C, Lepage M, Swiatek M, Marks WL, Goncalves J, Markel S, Iordan D, Shojatalab M, Pizarro A, White
J, Hubley R, Deutsch E, Senger M, Aronow BJ, Robinson A, Bassett D, Stoeckert CJ Jr, Brazma A.
Design and implementation of microarray gene expression markup language (MAGE-ML). Genome
Biol. 2002 Aug 23;3(9):RESEARCH0046. Epub 2002 Aug 23.
[36] Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB.
Missing value estimation methods for DNA microarrays. Bioinformatics. 2001 Jun;17(6):520-5.
[37] Yang YH, Buckley MJ, Speed TP. Analysis of cDNA microarray images. Brief Bioinform. 2001
Dec;2(4):341-9.
[38] Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA
microarray data: a robust composite method addressing single and multiple slide systematic variation.
Nucleic Acids Res. 2002 Feb 15;30(4):e15.
[39] Yang YH, Speed T. Design issues for cDNA microarray experiments. Nat. Rev. Genet. 3 (2002),
pp. 579-588.
[40] Zakharkin SO, Kim K, Mehta T, Chen L, Barnes S, Scheirer KE, Parrish RS, Allison DB, Page
GP. Sources of variation in Affymetrix microarray experiments. BMC Bioinformatics. 2005 Aug
29;6(1):214 [Epub ahead of print]
[41] Zhang X, Ding L, Sandford AJ. Selection of reference genes for gene expression studies in human
neutrophils by real-time PCR. BMC Mol Biol. 2005 Feb 18;6(1):4.
[42] Zhou Y, Abagyan R. Algorithms for high-density oligonucleotide array. Curr. Opin. Drug Discov.
Devel. 6 (2003), pp. 339-345.
Chapter 5
Reverse-phase protein array technology
The emerging field of quantitative proteomics aims to identify differences between
groups of experimental samples by measuring the expression of proteins. The
development of high-throughput techniques to measure protein expression is required
for applications in targeted drug discovery and analysis of biomarkers, and may
revolutionize early cancer diagnosis [26][23][12][13]. The recently developed
sample. Each spot corresponds to a protein of interest. The expression is detected with a
second tagged molecule or by directly labeling the protein of interest.
Reverse-phase arrays are prepared as shown in Figure 5.1 by immobilizing the
sample proteins, usually from a cell lysate or serum, on a slide, typically by robotic
spotting, in a manner similar to cDNA microarray printing. Spots represent samples, and
usually each sample is spotted in a multiple dilution series. Thousands of spots are
printed on each slide and each slide comprises hundreds of samples. Only a small
quantity of lysate is used for the preparation of a slide. As a result, rare or valuable samples,
such as the glioma tissues we used, are conserved. The entire slide is then probed
with a single antibody, and thus a single protein is measured across multiple samples.
Because all samples are probed on the same slide, the multiple samples are assayed
under the same conditions.
[Figure 5.1 panel labels: Patients; Tumor biopsy; Laser capture microdissection; Tumor cell analysis; Lysate; Reverse-phase lysate array slide; Antibodies; Labels]
Figure 5.1. Main steps in the production of a reverse-phase lysate array. With the aid of laser capture
microdissection, protein lysates from histopathologically relevant cell populations of the sample tumor are
isolated, extracted, and immobilized on the reverse-phase protein array. On average, 10,000 cells were
microdissected and lysed, a quantity smaller than that used in the traditional immunoassay platforms.
In the case of forward-phase arrays, many antibodies are applied to a single glass
surface and each sample is analyzed on a different slide. In order to allow comparison of
samples, the following requirements should be met: (a) assay cross-reactivity must be
eliminated, (b) conditions must allow for different sensitivities of the multiple analytes,
(c) the dynamic range must be adjusted according to biological relevance, and (d) the
1
The hematoxylin and eosin staining protocol colors the nucleus blue and the cytoplasm pale pink. For
example protocols, see http://www.bcm.edu/rosenlab/protocols.html .
2
http://www.pall.com; 2200 Northern Boulevard, East Hills, NY 11548, USA, or
http://www.pall.com/laboratory_17821.asp for VIVID products.
3
http://www.schleicher-schuell.com; Hahnestrasse 3, D-37586 Dassel, Germany.
4
An atomically flat glass substrate coated with a 150 µm thick layer of hydrophobic polymer.
5
http://www.arrayit.com; 524 East Weddell Drive, Sunnyvale, CA 94089, USA.
6
http://www.nalgenunc.com; 75 Panorama Creek Drive, Rochester, NY 14625, USA.
PerkinElmer7 provide a hydrophilic environment and the capacity to hold more than a
single layer of probes [20].
Prior to our publication [22], several other articles described important advances in
lysate array technology. Nishizuka et al. studied the correlation between mRNA
expression and protein expression [24]. Because they are technically easier to implement,
techniques to measure transcript levels in high-throughput format have developed more
rapidly than protein profiling, and transcript expression levels were assumed to predict
protein expression. However, protein levels cannot be inferred from transcript data [24].
NCI cell lines (previously analyzed intensively at the transcription level) were printed on
FAST nitrocellulose-coated glass slides with a pin-in-ring format GMS 417 arrayer
(Affymetrix, Santa Clara, CA) with four 500 µm-diameter pins [24]. The cDNA/protein
and oligo/protein arrays showed significant correlations for the 19 species across 60 cell
lines. The mean correlation coefficients were +0.52 for cDNA/protein and +0.40 for
oligo/protein, the profiling structure showing a higher correlation of proteins with
mRNA levels across the 60 cell lines.
Nishizuka et al. [24] were the first to estimate protein expression based on a
monotonic linear spline fitted to the serial dilution intensity curve [...] sample [...]. The maximum, $I_{max}$, and minimum, $I_{min}$, of the intensity range $[I_{max} \ldots I_{min}]$ are heuristically selected. The third highest ranked value observed anywhere on the array
is selected as $I_{max}$ [...] over all cell [...]. If $I = f(\cdot)$ [...] is the true dilution factor, then [...] estimated by spline interpolation as the $\log_2$ dilution at $p = 25\%$ of the intensity range:
\[ f^{-1}\left[ I_{min} + p\,(I_{max} - I_{min}) \right]. \]
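The interpolation step can be sketched under simplifying assumptions. The sketch below is not the implementation of Nishizuka et al.: it replaces the fitted monotonic spline with piecewise-linear interpolation through the measured points, enforces monotonicity with a running minimum, and takes the intensity limits as given. The function name `dilution_at_fraction` and all numbers are hypothetical.

```python
def dilution_at_fraction(log2_dilutions, intensities, i_min, i_max, p=0.25):
    """Interpolate the log2 dilution where the curve reaches i_min + p * (i_max - i_min)."""
    target = i_min + p * (i_max - i_min)
    points = sorted(zip(log2_dilutions, intensities))
    # Force a non-increasing curve: intensity should fall as the sample is diluted
    monotone = []
    current = float("inf")
    for d, i in points:
        current = min(current, i)
        monotone.append((d, current))
    # Find the segment that brackets the target and interpolate linearly inside it
    for (d0, i0), (d1, i1) in zip(monotone, monotone[1:]):
        if i0 >= target >= i1 and i0 != i1:
            return d0 + (i0 - target) * (d1 - d0) / (i0 - i1)
    return None  # the target intensity is outside the measured curve


# Hypothetical dilution series: log2 dilution steps and measured spot intensities
steps = [0, 1, 2, 3, 4]
measured = [100.0, 80.0, 40.0, 20.0, 10.0]
estimate = dilution_at_fraction(steps, measured, i_min=0.0, i_max=100.0, p=0.25)
```

Samples whose curves never cross the target intensity return `None` here; how such samples are handled in practice is a design decision this sketch leaves open.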
As a further interesting step in robust quantitation of expression on lysate arrays, the paper by Tabus et al. (submitted) [28] estimates the expression of proteins by fitting a model to the [...] using linear and nonlinear models. The lysate array [...] single slide undergo the same processes, the dependence of the measurement on
concentration is expected to be the same (linear or nonlinear) function.
For the first-order polynomial model the paper derived the closed form, and for the
sigmoidal model and higher-order polynomial models an iterative procedure was used.
The paper computed the Cramer-Rao bound for each model and validated the estimation
accuracy by Monte Carlo simulation. When different criteria for model structure
selection were tested (Rissanen's stochastic complexity, Akaike information, and Minimum
Description Length), the selected models for particular proteins on the lysate arrays
were typically sigmoidal (see Figure 5.3 and Table 5.1).
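Criterion-based selection of a model structure can be illustrated with a small sketch. It is not the procedure of Tabus et al.: it compares polynomial models only (no sigmoidal model), fits them by ordinary least squares, and uses the Gaussian least-squares form of the Akaike criterion, AIC = n ln(RSS/n) + 2k, where k is the number of coefficients. The function names and the synthetic data are hypothetical.

```python
import math


def fit_polynomial(xs, ys, order):
    """Least-squares polynomial fit via the normal equations; returns c[0..order]."""
    m = order + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * m
    for i in range(m - 1, -1, -1):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def aic(xs, ys, coeffs):
    """Akaike criterion for a least-squares fit with Gaussian errors (up to a constant)."""
    n = len(xs)
    rss = sum((y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    return n * math.log(rss / n) + 2 * len(coeffs)


# Hypothetical data: a quadratic response with a small alternating disturbance
xs = [float(x) for x in range(10)]
ys = [1 + 2 * x + 3 * x * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
scores = {order: aic(xs, ys, fit_polynomial(xs, ys, order)) for order in (1, 2, 3)}
best_order = min(scores, key=scores.get)  # order with the lowest AIC
```

The penalty term 2k is what keeps the cubic model from being chosen merely because it fits the disturbance slightly better than the quadratic one.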
Table 5.1. Nonlinear model order selection in lysate array. Model structure selection gives preference to
saturated models.
Lysate array
pThr308AKT
Criterion
Rissanen's stochastic
complexity
(SC)
Sigmoidal
Polynomial k = 4
Minimum
Description Length
(MDL)
Polynomial k = 4
pSer473AKT
t-AKT
β-actin
Sigmoidal
Sigmoidal
Sigmoidal
Sigmoidal
Sigmoidal
Polynomial k = 5
Sigmoidal
Sigmoidal
Polynomial k = 5
Akaike information
(AIC)
Nonlinear models are able to capture the saturation that affects intensity (i.e., its
dependency on concentration). The accuracy of the estimated values can be assessed by
Cramer-Rao (CR) bounds or Monte Carlo simulations. Tabus et al. show that the models based on
saturation are preferred; these models create a basis for better estimation of protein levels on slides
with saturated samples, and offer criteria for better experimental design and more
precise inference.
The next chapter of this thesis presents a more extensive application of reverse-phase
lysate arrays and our algorithms. We slightly modified the design of the 1440-spot lysate
microarray slide from [22] to hold 96 samples, and we observed the expression levels of
46 proteins in a set of 82 glioma samples.
References
[1] Ahram M, Flaig MJ, Gillespie JW, Duray PH, Linehan WM, Ornstein DK, Niu S, Zhao Y,
Petricoin EF 3rd, Emmert-Buck MR. Evaluation of ethanol-fixed, paraffin-embedded tissues for
proteomic applications. Proteomics. 2003 Apr;3(4):413-21.
[2] Blume-Jensen P, Hunter T. Oncogenic kinase signalling. Nature. 2001 May 17;411(6835):355-65.
[3] Bowden ET, Barth M, Thomas D, Glazer RI, Mueller SC. An invasion-related complex of
cortactin, paxillin and PKCmu associates with invadopodia at sites of extracellular matrix degradation
Oncogene. 1999 Aug 5;18(31):4440-9.
[4] Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W,
Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC,
Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M.
Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.
Nat Genet. 2001 Dec;29(4):365-71.
[5] Carlisle AJ, Prabhu VV, Elkahloun A, Hudson J, Trent JM, Linehan WM, Williams ED, Emmert-Buck MR, Liotta LA, Munson PJ, Krizman DB. Development of a prostate cDNA microarray and
statistical gene expression analysis package. Mol Carcinog. 2000 May;28(1):12-22.
[6] Celis JE, Gromov P. Proteomics in translational cancer research: toward an integrated approach.
Cancer Cell. 2003 Jan;3(1):9-15.
[7] Chu WS, Liang Q, Liu J, Wei MQ, Winters M, Liotta L, Sandberg G, Gong M. A nondestructive
molecule extraction method allowing morphological and molecular analyses using a single tissue
section. Lab Invest. 2005 Aug 22; [Epub ahead of print]
[8] Cutler P. Protein arrays: the current state-of-the-art. Proteomics. 2003 Jan;3(1):3-18.
[9] Emmert-Buck MR, Strausberg RL, Krizman DB, Bonaldo MF, Bonner RF, Bostwick DG, Brown
MR, Buetow KH, Chuaqui RF, Cole KA, Duray PH, Englert CR, Gillespie JW, Greenhut S, Grouse L,
Hillier LW, Katz KS, Klausner RD, Kuznetzov V, Lash AE, Lennon G, Linehan WM, Liotta LA, Marra
MA, Munson PJ, Ornstein DK, Prabhu VV, Prang C, Schuler GD, Soares MB, Tolstoshev CM, Vocke
CD, Waterston RH. Molecular profiling of clinical tissues specimens: feasibility and applications. J Mol
Diagn. 2000 May;2(2):60-6.
[10] Espina V, Mehta AI, Winters ME, Calvert V, Wulfkuhle J, Petricoin EF 3rd, Liotta LA. Protein
microarrays: molecular profiling technologies for clinical specimens. Proteomics. 2003 Nov;3(11):2091-100.
[11] Gillespie JW, Best CJ, Bichsel VE, Cole KA, Greenhut SF, Hewitt SM, Ahram M, Gathright YB,
Merino MJ, Strausberg RL, Epstein JI, Hamilton SR, Gannot G, Baibakova GV, Calvert VS, Flaig MJ,
Chuaqui RF, Herring JC, Pfeifer J, Petricoin EF, Linehan WM, Duray PH, Bova GS, Emmert-Buck MR.
Evaluation of non-formalin tissue fixation for molecular profiling studies. Am J Pathol. 2002
Feb;160(2):449-57.
[12] Haab BB, Dunham MJ, Brown PO. Protein microarrays for highly parallel detection and
quantitation of specific proteins and antibodies in complex solutions. Genome Biol.
2001;2(2):RESEARCH0004. Epub 2001 Jan 22.
[13] Haab BB. Methods and applications of antibody microarrays in cancer research. Proteomics. 2003
Nov;3(11):2116-22.
[14] Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000 Jan 7;100(1):57-70.
[15] Hunter T. Signaling-2000 and beyond. Cell. 2000 Jan 7;100(1):113-27.
[16] Hunter T. The Croonian Lecture 1997. The phosphorylation of proteins on tyrosine: its role in cell
growth and disease. Philos Trans R Soc Lond B Biol Sci. 1998 Apr 29;353(1368):583-605.
[17] Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic
networks. Nature. 2000 Oct 5;407(6804):651-4.
[18] Knezevic V, Leethanakul C, Bichsel VE, Worth JM, Prabhu VV, Gutkind JS, Liotta LA, Munson
PJ, Petricoin EF 3rd, Krizman DB. Proteomic profiling of the cancer microenvironment by antibody
arrays. Proteomics. 2001 Oct;1(10):1271-8.
[19] Liotta LA, Kohn EC. The microenvironment of the tumour-host interface. Nature. 2001 May
17;411(6835):375-9.
[20] Melton L. Protein arrays: proteomics in multiplex. Nature. 2004 May 6;429(6987):101-7.
[21] Miller JC, Zhou H, Kwekel J, Cavallo R, Burke J, Butler EB, Teh BS, Haab BB. Antibody
microarray profiling of human prostate cancer sera: antibody screening and identification of potential
biomarkers. Proteomics. 2003 Jan;3(1):56-63.
[22] Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I, Hamilton SR, Zhang W. Robust
estimation of protein expression ratios with lysate microarray technology. Bioinformatics. 2005 May
1;21(9):1935-42. Epub 2005 Jan 12.
[23] Nielsen UB, Geierstanger BH. Multiplexed sandwich assays in microarray format. J Immunol
Methods. 2004 Jul;290(1-2):107-20.
[24] Nishizuka S, Charboneau L, Young L, Major S, Reinhold CW, Waltham M, Mehr KH, Bussey KJ,
Lee JK, Espina V, Munson PJ, Petricoin EF 3rd, Liotta LA, Weinstein JN. Proteomic profiling of the
NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays. Proc Natl Acad Sci U
S A. 2003 Nov 25;100(24):14229-34.
[25] Perlee L, Christiansen J, Dondero R, Grimwade B, Lejnine S, Mullenix M, Shao W, Sorette M,
Tchernev V, Patel D, Kingsmore S. Development and standardization of multiplexed antibody
microarrays for use in quantitative proteomics. Proteome Sci. 2004 Dec 15;2(1):9.
[26] Schweitzer B, Kingsmore SF. Measuring proteins on microarrays. Curr Opin Biotechnol. 2002
Feb;13(1):14-9.
[27] Sreekumar A, Nyati MK, Varambally S, Barrette TR, Ghosh D, Lawrence TS, Chinnaiyan AM.
Profiling of cancer cells using protein microarrays: discovery of novel radiation-regulated proteins.
Cancer Res. 2001 Oct 15;61(20):7585-93.
[28] Tabus I, Hategan A, Mircean C, Rissanen J, Shmulevich I, Zhang W, Astola J. Nonlinear
modeling of protein expressions in protein arrays. (submitted)
[29] Tonkinson JL, Stillman BA. Nitrocellulose: a tried and true polymer finds utility as a post-genomic substrate. Front Biosci. 2002 Jan 1;7:c1-12.
Chapter 6
Translational and post-translational examination of
pathway alterations during glioma progression1,2
The progression of gliomas has been extensively studied at the genomic level using cDNA
microarrays [7][9]. However, systematic examination of translational and
posttranslational levels has only recently been carried out in a high-throughput manner.
This chapter describes our applications of the reverse-phase lysate array
1
This chapter partially follows the paper by Jiang R2, Mircean C2, Shmulevich I, Cogdell D, Jia Y, Tabus I,
Aldape K, Sawaya R, Bruner J, Fuller GN, and Zhang W, titled "Pathway alterations during glioma
progression revealed by reverse phase protein lysate arrays" [46]. The paper has been submitted to
Proteomics.
2
The authors R. Jiang and C. Mircean contributed equally to the paper [46].
3
A total of 108 antibodies were initially tested. Some did not show a unique band in western blotting and
were, therefore, not used in any assessment.
lower grade tumors. To characterize the 82 tumor samples used in this study, we first
performed a Kaplan-Meier survival analysis according to the information from our
clinical database. As shown in Figure 6.1 for the 82 samples, there was a significant
difference in patient survival time between glioblastoma and all other gliomas combined.
However, there is an overlap in the survival curves of the other glioma subgroups.
Therefore, based on clinical phenotype, the 82 samples can be divided into two major
groups. Although the samples could be divided into 3 or 4 histopathological groups, a
grouping scheme based on clinical difference is likely to generate more biologically
meaningful information at the protein and pathway levels than groupings based purely
on morphological features.
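For readers unfamiliar with the estimator, a Kaplan-Meier survival curve can be computed as in the following minimal sketch. It uses the standard product-limit formula; the function name `kaplan_meier` and the follow-up data are hypothetical and are not taken from the clinical database used in this study.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time of each patient
    events -- True if the patient died at that time, False if censored
    Returns a list of (time, survival_probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect all patients whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve


# Hypothetical follow-up times (months) and event indicators for five patients
months = [5, 8, 8, 12, 20]
died = [True, True, False, True, False]
curve = kaplan_meier(months, died)
```

Comparing two groups, such as glioblastoma against all other gliomas, would additionally require a significance test (for example the log-rank test), which this sketch does not include.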
Figure 6.1. Kaplan-Meier survival analysis. The survival curves for patients show that there is a distinction
between glioblastoma and lower grade glioma. But the survival difference between the subgroups of lower
grade gliomas in this sample set is not significant.
Figure 6.2. Quality evaluation of the antibodies used on protein lysate arrays. All the antibodies were tested in a western blotting assay to confirm they preferentially detect a single
band. Some blots were sequentially probed with 2 or 3 different antibodies and the composite results are shown.
Company | Name of antibody
Cell Signaling Technology, Inc., Beverly, MA 01915 | AKT, AKTpSer473, AKTpThr308, PTEN, PTENpSer380, RSK1/RSK2/RSK3, p90RSKpThr573, GSK3beta, BADpSer136, Cleaved Caspase 8 Asp374, Cleaved Caspase 9 Asp315, Puma, PDGFR, MAPK, mTOR, mTORpSer2481, CD11b, EGFRpTyr845
BD Biosciences Immunocytometry Systems, San Jose, CA 95131 | PI3K, Integrin 5, p16, pRB, Src Pan, Src pTyr529, NF-κB, Cathepsin D
Santa Cruz Biotechnology, Inc., Santa Cruz, CA 95060 | p53 (DO-1), BCL-2, Bax, p-PDGFR, Cdk7, Cdk4, Cyclin D3, VEGF, Tie2, IGFBP2, IGFBP3, IGFBP5, [...], EGFR VIII, c-Abl
GeneTex Inc., San Antonio, TX 78245 | p14ARF, c-Myc
Invitrogen Corporation (Zymed Laboratories Inc.), Carlsbad, CA 92008 | EGFR
Abcam Inc, Cambridge, MA 02139 | MMP2, MMP9
Sigma Chem Co., St. Louis, MO 63178 | β-actin
Antibody selection
Accumulating evidence from conventional molecular biology studies has produced an
understanding that signal transduction pathways involved in cell growth, cell death, and
metabolism are disturbed in glioma progression. Therefore, in order to systematically
survey the subset of proteomic changes in gliomas using a parallel protein-lysate array
platform, we first generated a list of proteins (Table 6.1) that have been previously
implicated in oncogenetic pathways; some of these proteins have already been shown to
be activated in glioblastoma through other types of assays.
Protein lysate arrays provide a high-throughput platform that allows simultaneous
detection of a protein in a large number of samples with replicates and serial dilutions.
However, the dot blot nature of the assay demands that high quality antibodies be used
in order to avoid cross hybridization, which would produce an unacceptable level of noise
and render the data uninterpretable. Therefore, we first tested many of the antibodies
on a western blot to make certain that the antibodies applied to the array detect a single
dominant band. Representative blots are shown in Figure 6.2. Only 46 antibodies passed
this quality control step and were subsequently used in the hybridization experiments.
grade astrocytomas (World Health Organization [WHO] grade II),
reagents (biotinyltyramide/hydrogen peroxide, and streptavidin-peroxidase). Development of slides was completed using hydrogen peroxide. The slides
were then allowed to air dry. Primary and secondary antibodies used in these studies
were diluted 1:100 to 1:200 and 1:4000 to 1:10,000, respectively. In addition to β-actin, which
served as the positive control in each set of protein arrays, one negative control without
any primary antibody was included in each set of experiments. The hybridized slides
were scanned at an optical resolution of 1200 dpi and images were saved as uncompressed
TIFF files. After inverting the 16-bit grayscale image (to allow the same analysis approach as
used in cDNA microarray technology), the spots were segmented and quantified with
ArrayVision (Imaging Research Inc., St. Catharines, Ontario, Canada).
proteins showed that IκB clusters with EGFRpTyr845, PI3K with AKTpThr308, and
IGFBP5 with IGFBP2. MMP9, Bcl-2, VEGF and pRB formed another cluster.
Figure 6.3. Examples of protein lysate array images. The images show a layout of the protein lysate array.
Each sample is present in triplicate in a series of six two-fold dilutions.
NFκB/IκB and EGFR pathways and their relationship
IκB was the protein with expression that differed most between glioblastoma and other
glioma subtypes in our analysis. IκB is the key regulator of nuclear factor kappa B
(NFκB), a transcription factor involved in cell growth, apoptosis, and immune response
[39]. NFκB is retained in the cytoplasm through its interaction with IκB. When IκB is
phosphorylated by activation of IKK, NFκB is released from the cytoplasm and enters the
nucleus to act on its target genes, including IκB itself. Therefore, a regulatory feedback
loop exists between these two proteins and an increase in IκB is often considered a result
of activation of NFκB [11]. Interestingly, IκB is known to be a labile protein whose
expression level is sensitive to physiological conditions. In contrast, the steady state level
of NFκB is less sensitive to modulation and the key regulatory step is at the cellular
localization level. Thus, it is not surprising that the NFκB subunits p65 and p50 were not
found to be key feature proteins that distinguished glioblastomas. In a recent study, we
reported that NFκB was indeed activated in glioblastoma [45]. Thus, these protein-lysate
array data from a large set of patient samples confirm our previous observation.
Our clustering analysis showed that IκB clustered with phosphorylated EGFR. EGFR
is critical for cell growth, differentiation, survival, and migration. The activation of EGFR
is believed to be one of the most important molecular events in triggering glioblastoma.
Amplification of EGFR genes occurs in 40% of glioblastomas, and EGFR overexpression
occurs in 60% [47][40]. A mutation in the EGFR gene that results in a shortened form of
EGFR called EGFR vIII has been detected in approximately 20% of glioblastomas [3]. In
our lysate array study, we also detected EGFR vIII in 12% of glioblastomas (data not
shown), but because of its relatively low frequency of detection, EGFR vIII was not
selected by our statistical analysis as distinguishing between the two glioma groups.
Tyrosine kinase receptors are often activated by phosphorylation and the functional
activation of the protein is a major switch in the growth pathway in the cells. Thus, it is
not surprising that we found EGFRpTyr845 rather than EGFR as one of the top
significant discriminators between glioblastoma and other subtypes of gliomas. Tyr845 is
highly conserved within the active loop of the kinase domain on the EGFR and
phosphorylation of this residue is mediated by c-Src and dependent on EGF stimulation
[4][38]. Phosphorylation of EGFR on Tyr845 residues is necessary for the binding of
EGFR to cytochrome c-oxidase subunit II (Cox II), which is a mitochondrion-encoded
protein and a subunit of complex IV of the respiratory chain [6]. After EGF stimulation,
EGFR translocates to the mitochondrion, where it interacts with Cox II to regulate cell
survival. Integrin proteins induce phosphorylation of EGFR on tyrosines 845, 1068,
1086, and 1173 [5][33]; thus, EGFR activation is also implicated in increased cell motility
and invasion, additional critical features of glioblastoma.
together with EGFR and IGFBP2 expression, appears to be relevant in vivo, but not in
vitro, perhaps related to hypoxic conditions.
IGFBP2/IGFBP5 invasion pathway
As mentioned above, genomics studies coupled with tissue microarray experiments have
shown that IGFBP2 overexpression is a signature event in glioblastoma. IGFBP2 is a
promoter of glioma invasion, one of the most important phenotypes of glioblastoma [44].
There are six members in the IGFBP family and they have very different functions,
especially those that are IGF independent [49]. IGFBP5 has been implicated in breast
cancer metastasis [20][41]. In this study, we showed that IGFBP5 is also overexpressed
in glioblastoma and IGFBP2 and IGFBP5 are closely clustered. Thus, both proteins may
contribute to glioma invasion and/or other common functions. This hypothesis can be
readily tested in the future.
The other feature proteins
Several other proteins were identified as feature discriminators for glioblastoma.
Angiogenesis is a key phenotype in glioblastoma, and thus, the selection of VEGF as one
of the feature proteins was expected. Resistance to apoptosis is another important
phenotype. Therefore, the selection of Bcl-2 and BADpSer136 as two discriminators was
also consistent with the phenotype, although their over-expression was a new finding.
Bcl-2 is a survival protein and has been shown to be expressed in glioblastomas [29].
BAD is an apoptosis promoting protein, but when phosphorylated, BAD becomes inactive
and perhaps may even gain survival function [36]. Although BAD phosphorylation is
believed to be a downstream event of AKT phosphorylation, intriguingly, we found that
BADpSer136 did not cluster with AKTpThr308 and PI3K. This may mean that there are
other major upstream regulators of BAD phosphorylation. In support of this hypothesis,
Scheid et al. showed that activation of AKT alone was not sufficient to phosphorylate
BAD and complete inhibition of PI3K/AKT did not abrogate the phosphorylation of BAD
[35]. Furthermore, phosphorylation of BAD at Ser136 was correlated with AKTpSer473
phosphorylation but not with AKTpThr308 phosphorylation; these data suggest that the
activation of the AKT pathway could be independent of BADpSer136 activation [27][43].
An interesting finding from our study is that c-Abl is also highly expressed in
glioblastoma. In an earlier microarray experiment, we found that c-Abl mRNA
expression is associated with poor survival, although those results were not published
due to a small sample size of 25 patients. The present study, however, appears to support
the earlier finding. Additional experiments should be carried out to pursue this
observation because of the clinical implications.
Figure 6.4. Quantitative analysis of protein expression. The protein level was normalized against β-actin and then quantile normalized. The protein expression levels are represented
as heat-maps in (a) and (b), where the proteins are ranked in decreasing order of discrimination power for glioblastomas vs. the other, lower-grade gliomas. The most discriminating feature proteins are
shown in (b). We used the ratio of the between-class to the within-class sum of squares (BSS/WSS) and selected the set that returns the minimum false discovery rate (FDR).
Visually, the FDR is the ratio of the areas under the random assignment (yellow bars in (c)) and the correct assignment (red bar) when discrimination is high. The location of the minimal false
discovery rate is shown in (d).
Imatinib (Gleevec), which inhibits bcr-abl in chronic myeloid leukemia and c-Kit in
gastrointestinal stromal cancers, has been one of the most successful therapeutic agents
ever used for targeted therapy [16]. Because Gleevec is also an inhibitor of Abl, it may
also be beneficial for glioblastoma treatment. A clinical trial with Gleevec in glioblastoma
is ongoing [8], and it may be insightful to view the results of that trial through the prism
of our findings.
In summary, our survey of 46 proteins and post-translationally modified isoforms in
82 glioma tissue samples has yielded several biologically relevant discoveries that further
our understanding of glioma systems. Some of these findings provide confirmation, using
a different experimental approach, of concepts previously proposed in the literature.
Others are novel and provide focus for further, in-depth, functional studies. The present
glioma protein-lysate array study demonstrates the utility of this proteomics discovery
tool in advancing our understanding of glioma physiology.
References
[1] Benesch, M., Wagner, S., Berthold, F., Wolff, J.E. J. Neurooncol. 2005, 72, 179-183.
[2] Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the false discovery rate in
behavior genetics research. Behav Brain Res. 2001 Nov 1;125(1-2):279-84.
[3] Biernat W, Huang H, Yokoo H, Kleihues P, Ohgaki H. Predominant expression of mutant EGFR
(EGFRvIII) is rare in primary glioblastomas. Brain Pathol. 2004 Apr;14(2):131-6.
[4] Biscardi JS, Maa MC, Tice DA, Cox ME, Leu TH, Parsons SJ. c-Src-mediated phosphorylation of
the epidermal growth factor receptor on Tyr845 and Tyr1101 is associated with modulation of receptor
function. J Biol Chem. 1999 Mar 19;274(12):8335-43.
[5] Boerner JL, Demory ML, Silva C, Parsons SJ. Phosphorylation of Y845 on the epidermal growth
factor receptor mediates binding to the mitochondrial protein cytochrome c oxidase subunit II. Mol Cell
Biol. 2004 Aug;24(16):7059-71.
[6] Boerner JL, Demory ML, Silva C, Parsons SJ. Phosphorylation of Y845 on the epidermal growth
factor receptor mediates binding to the mitochondrial protein cytochrome c oxidase subunit II. Mol Cell
Biol. 2004 Aug;24(16):7059-71.
[7] Boudreau CR, Liau LM, Molecular characterization of brain tumors. Clin. Neurosurg. 2004, 51,
81-90.
[8] Breedveld P, Pluim D, Cipriani G, Wielinga P, van Tellingen O, Schinkel AH, Schellens JH. The
effect of Bcrp1 (Abcg2) on the in vivo pharmacokinetics and brain penetration of imatinib mesylate
(Gleevec): implications for the use of breast cancer resistance protein and P-glycoprotein inhibitors to
enable the brain penetration of imatinib in patients. Cancer Res. 2005 Apr 1;65(7):2577-82.
[9] Caskey LS, Fuller GN, Bruner JM, Yung WK, Sawaya RE, Holland EC, Zhang W. Toward a
molecular classification of the gliomas: histopathology, molecular genetics, and gene expression
profiling. Histol Histopathol. 2000 Jul;15(3): 971-81.
[10] Chen G, Gharib TG, Huang CC, Taylor JM, Misek DE, Kardia SL, Giordano TJ, Iannettoni MD,
Orringer MB, Hanash SM, Beer DG. Discordant protein and mRNA expression in lung
adenocarcinomas. Mol. Cell Proteomics 2002, 1, 304-313.
[11] Chiao PJ, Miyamoto S, Verma IM. Autoregulation of I kappa B alpha activity. Proc Natl Acad Sci
U S A. 1994 Jan 4;91(1):28-32.
[12] Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of
tumors using gene expression data. J. Amer. Statist. Assoc. 2002, 97, 77-87.
[13] Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially
expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002, 12, 111-139.
[14] Espina V, Mehta AI, Winters ME, Calvert V, Wulfkuhle J, Petricoin EF 3rd, Liotta LA. Protein
microarrays: molecular profiling technologies for clinical specimens. Proteomics. 2003 Nov;3(11):2091-100.
[15] Feng, J., Park, J., Cron, P., Hess, D., Hemmings, B.A. J. Biol. Chem. 2004, 279, 41189-41196.
[16] George D. Targeting PDGF receptors in cancer--rationales and proof of concept clinical trials. Adv
Exp Med Biol. 2003;532:141-51.
[17] Giannini C, Sarkaria JN, Saito A, Uhm JH, Galanis E, Carlson BL, Schroeder MA, James CD.
Patient tumor EGFR and PDGFRA gene amplifications retained in an invasive intracranial xenograft
model of glioblastoma multiforme. Neuro-oncol. 2005 Apr;7(2):164-76.
[18] Grubb RL, Calvert VS, Wulkuhle JD, Paweletz CP, Linehan WM, Phillips JL, Chuaqui R, Valasco
A, Gillespie J, Emmert-Buck M, Liotta LA, Petricoin EF. Signal pathway profiling of prostate cancer
using reverse phase protein arrays. Proteomics. 2003 Nov;3(11):2142-6.
[19] Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance
in yeast. Mol Cell Biol. 1999 Mar;19(3):1720-30.
[20] Hao X, Sun B, Hu L, Lahdesmaki H, Dunmire V, Feng Y, Zhang SW, Wang H, Wu C, Wang H,
Fuller GN, Symmans WF, Shmulevich I, Zhang W. Differential gene and protein expression in primary
breast malignancies and their lymph node metastases as revealed by combined cDNA microarray and
tissue microarray analysis. Cancer. 2004 Mar 15;100(6):1110-22
[21] Holland EC, Celestino J, Dai C, Schaefer L, Sawaya RE, Fuller GN. Combined activation of Ras
and Akt in neural progenitors induces glioblastoma formation in mice. Nat Genet. 2000 May;25(1):55-7.
[22] Jiang R, Mircean C, Shmulevich I, Cogdell D, Jia Y, Tabus I, Aldape K, Sawaya R, Bruner J,
Fuller GN, Zhang W. Pathway alterations during glioma progression revealed by reverse phase protein
lysate arrays. Proteomics (submitted)
[23] Jung JM, Li H, Kobayashi T, Kyritsis AP, Langford LA, Bruner JM, Levin VA, Zhang W.
Inhibition of human glioblastoma cell growth by WAF1/Cip1 can be attenuated by mutant p53. Cell
Growth Differ. 1995 Aug;6(8):909-13.
[24] Karlsson HK, Zierath JR, Kane S, Krook A, Lienhard GE, Wallberg-Henriksson H. Insulin-stimulated phosphorylation of the Akt substrate AS160 is impaired in skeletal muscle of type 2 diabetic
subjects. Diabetes. 2005 Jun;54(6):1692-7.
[25] Kawakami Y, Nishimoto H, Kitaura J, Maeda-Yamamoto M, Kato RM, Littman DR, Leitges M,
Rawlings DJ, Kawakami T. Protein kinase C betaII regulates Akt phosphorylation on Ser-473 in a cell
type- and stimulus-specific fashion. J Biol Chem. 2004 Nov 12;279(46):47720-5. Epub 2004 Sep 9.
[26] Keselman, H.J., Cribbie, R., Holland, B. Br. J. Math. Stat. Psychol. 2002, 55, 27-39.
[27] Khor TO, Gul YA, Ithnin H, Seow HF. Positive correlation between overexpression of phospho-BAD with phosphorylated Akt at serine 473 but not threonine 308 in colorectal carcinoma. Cancer Lett.
2004 Jul 16;210(2):139-50.
[28] Lu T, Costello CM, Croucher PJ, Hasler R, Deuschl G, Schreiber S. Can Zipf's law be adapted to
normalize microarrays? BMC Bioinformatics. 2005, 6, 37.
[29] Lytle RA, Jiang Z, Zheng X, Rich KM. BCNU down-regulates anti-apoptotic proteins bcl-xL and
Bcl-2 in association with cell death in oligodendroglioma-derived cells. J Neurooncol. 2004
Jul;68(3):233-41.
[30] Melton L. Protein arrays: Proteomics in multiplex. Nature 2004, 429, 101-107.
[31] Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I, Hamilton SR, Zhang W. Robust
estimation of protein expression ratios with lysate microarray technology. Bioinformatics. 2005 May
1;21(9):1935-42. Epub 2005 Jan 12.
[32] Monks NR, Biswas DK, Pardee AB. Blocking anti-apoptosis as a strategy for cancer
chemotherapy: NF-kappaB as a target. J Cell Biochem. 2004 Jul 1;92(4):646-50.
[33] Moro L, Dolce L, Cabodi S, Bergatto E, Erba EB, Smeriglio M, Turco E, Retta SF, Giuffrida MG,
Venturino M, Godovac-Zimmermann J, Conti A, Schaefer E, Beguinot L, Tacchetti C, Gaggini P,
Silengo L, Tarone G, Defilippi P. Integrin-induced epidermal growth factor (EGF) receptor activation
requires c-Src and p130Cas and leads to phosphorylation of specific EGF receptor tyrosines. J Biol
Chem. 2002 Mar 15;277(11):9405-14. Epub 2001 Dec 27.
[34] Park S, James CD. ECop (EGFR-Coamplified and overexpressed protein), a novel protein,
regulates NF-κB transcriptional activity and associated apoptotic response in an IκBα-dependent manner.
Oncogene. 2005, 24, 2495-2502.
[35] Scheid MP, Duronio V. Dissociation of cytokine-induced phosphorylation of Bad and activation
of PKB/akt: involvement of MEK upstream of Bad phosphorylation. Proc Natl Acad Sci U S A. 1998
Jun 23;95(13):7439-44.
[36] Schurmann A, Mooney AF, Sanders LC, Sells MA, Wang HG, Reed JC, Bokoch GM. p21-activated kinase 1 phosphorylates the death agonist bad and protects cells from apoptosis. Mol Cell Biol.
2000 Jan;20(2):453-61.
[37] Sreekumar A, Nyati MK, Varambally S, Barrette TR, Ghosh D, Lawrence TS, Chinnaiyan AM.
Profiling of cancer cells using protein microarrays: discovery of novel radiation-regulated proteins.
Cancer Res. 2001 Oct 15;61(20):7585-93.
[38] Tice DA, Biscardi JS, Nickles AL, Parsons SJ. Mechanism of biological synergy between cellular
Src and epidermal growth factor receptor. Proc Natl Acad Sci U S A. 1999 Feb 16;96(4):1415-20.
[39] Tran NL, McDonough WS, Savitch BA, Sawyer TF, Winkles JA, Berens ME. The tumor necrosis
factor-like weak inducer of apoptosis (TWEAK)-fibroblast growth factor-inducible 14 (Fn14) signaling
system regulates glioma cell survival via NFkappaB pathway activation and BCL-XL/BCL-W
expression. J Biol Chem. 2005 Feb 4;280(5):3483-92. Epub 2004 Dec 16.
[40] Tripp SR, Willmore-Payne C, Layfield LJ. Relationship between EGFR overexpression and gene
amplification status in central nervous system gliomas. Anal Quant Cytol Histol. 2005 Apr;27(2):71-8.
[41] van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K,
Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend
SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002 Jan
31;415(6871):530-6.
[42] von Heydebreck A, Huber W, Poustka A, Vingron M. Identifying splits with clear separation: a
new class discovery method for gene expression data. Bioinformatics. 2001;17 Suppl 1:S107-14.
[43] Wang SW, Denny TA, Steinbrecher UP, Duronio V. Phosphorylation of Bad is not essential for
PKB-mediated survival signaling in hemopoietic cells. Apoptosis. 2005 Mar;10(2):341-8.
[44] Wang, H., Wang, H., Shen, W., Huang, H., et al., Insulin-like Growth Factor Binding Protein 2
Enhances Glioblastoma Invasion by Activating Invasion-enhancing Genes. Cancer Res. 2003, 63, 4315-4321.
[45] Wang, H., Wang, H., Zhang, W., Huang, H.J., et al., Analysis of the activation status of Akt,
NFkappaB, and Stat3 in human diffuse gliomas. Lab Invest. 2004, 84, 941-951.
[46] Wulfkuhle JD, Aquino JA, Calvert VS, Fishman DA, Coukos G, Liotta LA, Petricoin EF 3rd.
Signal pathway profiling of ovarian cancer from human tissue specimens using reverse-phase protein
microarrays. Proteomics. 2003 Nov;3(11):2085-90.
[47] Xie D, Zeng YX, Wang HJ, Tai LS, Wen JM, Tao Y, Ma NF, Hu L, Sham JS, Guan XY.
Amplification and overexpression of epidermal growth factor receptor gene in glioblastomas of Chinese
patients correlates with patient's age but not with tumor's clinicopathological pathway. Acta Neuropathol
(Berl). 2005 Sep 7; [Epub ahead of print]
[48] Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA
microarray data: a robust composite method addressing single and multiple slide systematic variation.
Nucleic Acids Res. 2002 Feb 15;30(4):e15.
[49] Zhang W, Wang H, Song SW, Fuller GN. Insulin-like growth factor binding protein 2: gene
expression microarrays and the hypothesis-generation paradigm. Brain Pathol. 2002 Jan;12(1):87-94.
[50] Zhou YH, Tan F, Hess KR, Yung WK. The expression of PAX6, PTEN, vascular endothelial
growth factor, and epidermal growth factor receptor in gliomas: relationship to tumor grade and survival.
Clin Cancer Res. 2003 Aug 15;9(9):3369-75.
Chapter 7
An overview of analysis methods
Scientists make predictions based on hypotheses, laws, and theories. The value of
predictions is determined by whether they are supported by experiments. If results do
not confirm the initial hypothesis or experimental results are not confirmed by other
scientists, then the initial assumption that generated the prediction should be changed or
discarded.
The models meet a very "human" need to understand events; we create models and
have expectations. Models are not required by biological organisms. The scientific
hypothesis must be testable and falsifiable¹. The process of proving a hypothesis is not
based on accumulating evidence in its favor, but rather in showing that situations that
could establish falsity do not happen [29]. In order to support a hypothesis, observations
need to be specific and not simply observational; they must be quantitatively expressed
in numerical measurements. When a hypothesis is proven, it gains the status of a
scientific law or theory. Once widely confirmed by scientists over time, the laws become
accepted as facts.
With the aid of technology, molecular biology has evolved from traditional-observational to quantitative-estimated measurements. We expect that biology and
medicine will continue to move toward a more quantitative science based on scientific
laws. In addition to the algorithms, models, and pure numbers used in mathematics,
bioinformatics research deals with numbers that often represent measurements of fuzzy
biological events. Measurement, precision, and other concepts that are clearly applicable
in highly technical sciences need amendment and comprehensive adjustment before
application to biological events.
Systems biology attempts to define a biological environment of a cell or organism
including the structure of the complex integrated system composed of genes, proteins,
metabolites, etc. Two features, the structure and the dynamics of the biological system,
provide the foundation from which variations are measured to analyze the robustness,
which is the essential property of biological systems [15].
In an article in Science published in 2000, Kitano defined robustness this way:
"Robustness of biological systems manifests in various ways. Firstly, biological systems
constantly adapt to internal or external changes. Secondly, they show certain
insensitivity, which enables them to deal with the noise generated by the stochastic
signals to which they are exposed. Finally, they also exhibit what could be called a
graceful degradation, which is a slow and gradual end as opposed to the catastrophic
failure that occurs when functions are damaged" [18].
¹ This refers to the theory of assertion described by the philosophers Sir Karl Popper and Ernest Gellner.
Rather than the non-philosophical use of "falsification," meaning "counterfeiting," in science the idea of
"falsifiable" means "disprovable," the opposite of "verifiable."
In the array-based experiments described in this dissertation, we have attempted to
improve measurement methods used to characterize variations in cellular events. No
matter what the intrinsic character of the technology (cDNA microarrays, oligonucleotide
arrays, or protein arrays) and independent of what is measured (transcriptional levels,
translational levels, or post-translational levels), the data analysis aims to extract
biologically relevant conclusions.
distinguishes them from other classical biological analysis by the large number of
measurements and the number of undesirable sources of variation.
Normalization
Analysis is the last step of array-based experiments and poses a substantial number of
difficulties and challenges. Data pre-processing typically includes a step that removes
systematic variability, a process called normalization. Some sources of generalized
variations may be highly informative as they may be of biological origin. On the other
hand, variations due to the specific character of array-based technology should be
removed prior to analysis.
Normalization methods are usually based on certain hypotheses. For instance, the
expressions detected should contain a random sample of genes, most of which are not
differentially expressed. This means that the expression of only a small subset of genes
will be altered by a given condition. Another hypothesis is that for each sample under
evaluation, equal amounts of RNA or protein were initially present. This implies that the
total number of molecules in the samples will be roughly the same.
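In the case of a cDNA microarray, two samples are co-hybridized and their expression values
are measured with two different fluorescent dyes (Cy5 and Cy3), with the intensities of spot $i$ denoted
$R_i$ and $G_i$, respectively. The ratio of the total intensities,
\[ k_{mean} = \frac{\sum_i R_i}{\sum_i G_i} , \]
may be used for global normalization by scaling one of the channels (e.g.,
$G'_i = k_{mean} G_i$ and $R'_i = R_i$) such that the new ratio is equal to unity:
\[ \frac{\sum_i R'_i}{\sum_i G'_i} = \frac{\sum_i R_i}{k_{mean} \sum_i G_i} = 1 . \]
The normalized log-ratios are then $\log \frac{R'_i}{G'_i}$.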
The mean estimate assumes a Gaussian distribution of outliers that is not valid in cDNA
expressions. Hoyle et al., 2002 [14] note that cDNA microarrays have skewed
distribution behavior, with many genes expressed at a very low level. Therefore, a robust
estimate of tendency (such as the 75th percentile)² will give us a more realistic estimate of the
common intensity in the two fluorescent dyes, Cy3 and Cy5.
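The global scaling described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the thesis: the function name `global_normalize`, the `percentile` argument, and the toy intensities are assumptions made for the example.

```python
import numpy as np

def global_normalize(R, G, percentile=None):
    """Scale the green channel so that both channels share a common level.

    percentile=None uses the sum-based factor k_mean = sum(R) / sum(G);
    a number (e.g. 75) uses the ratio of that percentile in each channel
    as a more robust alternative (hypothetical option for this sketch).
    """
    R = np.asarray(R, dtype=float)
    G = np.asarray(G, dtype=float)
    if percentile is None:
        k = R.sum() / G.sum()
    else:
        k = np.percentile(R, percentile) / np.percentile(G, percentile)
    return R, k * G, k

# Toy intensities (invented for illustration): the red channel is twice as bright.
R = np.array([120.0, 80.0, 400.0, 60.0])
G = np.array([60.0, 40.0, 200.0, 30.0])
R_norm, G_norm, k_mean = global_normalize(R, G)
```

After scaling, the ratio of the channel sums is one by construction; with `percentile=75` the same holds for the 75th percentiles instead of the sums.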
A simple way to compare the two channels is with a log-log plot. Points that are
above or below the diagonal in this plot correspond to spots that have higher expression
levels in one of the channels. Another way to visualize the data is to create an "Intensity
vs. Ratio" plot for the normalized data, called an MA-plot. This plot of the log-ratios,
$M_i = \log \frac{R_i}{G_i}$, vs. the mean log-intensities, $A_i = \frac{\log (R_i G_i)}{2}$,
is shown in Figure 7.1 (c, d).
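As a small sketch of the MA transformation, the two coordinates can be computed directly from the channel intensities. The base-2 logarithm and the function name `ma_values` are assumptions of this example; the text only specifies a logarithm.

```python
import numpy as np

def ma_values(R, G):
    """Return the log-ratio M and the mean log-intensity A for each spot."""
    R = np.asarray(R, dtype=float)
    G = np.asarray(G, dtype=float)
    M = np.log2(R / G)          # M_i = log(R_i / G_i)
    A = 0.5 * np.log2(R * G)    # A_i = log(R_i * G_i) / 2
    return M, A

# Toy intensities (invented): the first spot is four times brighter in red.
M, A = ma_values([8.0, 2.0], [2.0, 2.0])
```

Plotting `M` against `A` gives the MA-plot; spots with equal intensity in both channels fall on the line M = 0.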
Figure 7.1. Intra-array normalization methods. Global normalization consists of scaling one of the channels
((a) and (b), log-log plots), while lowess locally estimates the intensity of the channels ((c) and (d), MA-plots).
² In the case of an array with many genes expressed at a low level, the median, mean, or other central-tendency estimates will be close to zero. An estimate of tendency higher than the 50th percentile will better
estimate the variance in distributions.
Most MA-plots of the data from cDNA arrays show skewed tendencies due to the
different binding potential of Cy3 and Cy5 in genes expressed at a low level compared to
highly expressed genes. Typically, microarray MA-scatter plots resemble a "fish"; they
are skewed toward high mean log-intensities in one of the channels. This bias cannot be
removed by global normalization methods, even though the averages of the intensities of
the two channels are equal.
One way to remove such bias is to apply a local estimate of the intensity. The methods
called lowess and loess are derived from locally weighted scatter plot smoothing and are
differentiated by the model used in the regression: lowess uses a linear polynomial,
whereas loess uses a quadratic polynomial. The smoothing process is based on the
moving average method. Each value is determined by neighboring data points within a
span. The regression weight function is defined for the data points contained within the
span. In addition, the smoothing version of the robust weight function may provide
outlier-resistant behavior.
In local regression smoothing [5][24], a three-step process is applied to each point
(Figure 7.2). (1) The regression weights are calculated for the data points in the span by the function:
\[ w_i = \left( 1 - \left| \frac{x - x_i}{d(x)} \right|^3 \right)^3 , \]
where $x$ is the predictor value at which the smoothed value is computed, $x_i$ are its
neighbors within the span, and $d(x)$ is the distance from $x$
to the most distant predictor value within the span. (2) A weighted
linear least squares regression is performed (for lowess, the regression uses a first degree
polynomial; for loess, the regression uses a second degree polynomial). In order to ensure
that the smoothed values do not become distorted by neighboring outliers, a robust
smoothing, which is not influenced by a small fraction of outliers, is used to calculate the
weights in the span:
\[ w_i = \begin{cases} \left( 1 - \left( \dfrac{r_i}{6\,\mathrm{MAD}(r)} \right)^2 \right)^2 , & |r_i| < 6\,\mathrm{MAD}(r) \\ 0 , & |r_i| \ge 6\,\mathrm{MAD}(r) \end{cases} \]
where $r_i$ is the residual of the $i$th data point produced by the regression smoothing
procedure, and MAD is the median absolute deviation of the residuals. (3) The smoothed
value is given by the weighted regression at the predictor value of interest, using the
local regression weight alone or, in the case of robust smoothing, both the
local regression weight and the robust weight.
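The three steps can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the thesis: the function names, the `span` and `robust_iters` parameters, taking MAD as the median of the absolute residuals, and the fallback used when the robust weights leave fewer than two usable points are all choices made for this example.

```python
import numpy as np

def tricube(u):
    """Local regression weight (1 - |u|^3)^3 for |u| < 1, else 0."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)

def bisquare(r):
    """Robust weight (1 - (r / 6 MAD)^2)^2 for |r| < 6 MAD, else 0."""
    mad = np.median(np.abs(r))      # assumption: MAD = median(|r|)
    if mad == 0.0:
        return np.ones_like(r)
    u = np.abs(r) / (6.0 * mad)
    return np.where(u < 1.0, (1.0 - u**2) ** 2, 0.0)

def lowess(x, y, span=0.5, robust_iters=0):
    """Locally weighted linear regression evaluated at every x[i]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(span * n))))   # points per span
    robust_w = np.ones(n)
    fitted = np.empty(n)
    for _ in range(robust_iters + 1):
        for i in range(n):
            dist = np.abs(x - x[i])
            idx = np.argsort(dist)[:k]           # nearest neighbours of x[i]
            d = dist[idx].max()                  # most distant point in the span
            base = tricube(dist[idx] / d) if d > 0 else np.ones(k)
            w = base * robust_w[idx]
            if np.count_nonzero(w) < 2:          # fallback: ignore robust weights
                w = base
            sw = np.sqrt(w)                      # weighted least squares via lstsq
            X = np.column_stack([np.ones(k), x[idx]])
            beta, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
            fitted[i] = beta[0] + beta[1] * x[i]
        robust_w = bisquare(y - fitted)          # used by the next iteration
    return fitted

x = np.linspace(0.0, 1.0, 21)
y = 2.0 * x + 1.0
smooth = lowess(x, y, span=0.5)
```

In an intra-array setting, `x` would be the mean log-intensity A and `y` the log-ratio M, and the normalized log-ratio would be `M - lowess(A, M)`. Replacing the design matrix with a quadratic one would give the loess variant described in the text.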
Robotic pins are used to print samples on the nitrocellulose membrane or glass for cDNA
and reverse-phase lysate arrays; therefore, another possible systematic source of variation
is the pin-loading capacity.
Any defect or variation in the mechanical parameters of a pin might affect, in a
common way, the expression of the genes or proteins in the group deposited by that pin.
For instance, for a microarray that uses 16 pins, a pin-dependent normalization
procedure applies quantile-to-quantile normalization to the expression of the genes
deposited by each pin separately. A good visualization is a box-plot of an ANOVA
procedure (Figure 7.3). The normalization procedure should be used with caution when
the number of spots processed by each pin is small, because in this case more bias may be
introduced into the existing expression values.
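One possible reading of pin-dependent quantile-to-quantile normalization is sketched below: the values printed by each pin are mapped, rank by rank, onto the pooled distribution of all spots. The choice of the pooled distribution as the reference, the function name `pin_quantile_normalize`, and the toy data are assumptions of this example; ties within a pin group are broken arbitrarily.

```python
import numpy as np

def pin_quantile_normalize(values, pin_ids):
    """Map each pin group onto the pooled distribution of all spots."""
    values = np.asarray(values, dtype=float)
    pin_ids = np.asarray(pin_ids)
    reference = np.sort(values)                        # pooled reference distribution
    ref_q = (np.arange(len(reference)) + 0.5) / len(reference)
    out = np.empty_like(values)
    for pin in np.unique(pin_ids):
        mask = pin_ids == pin
        group = values[mask]
        ranks = np.argsort(np.argsort(group))          # rank of each spot within its pin
        q = (ranks + 0.5) / len(group)                 # quantile of each spot
        out[mask] = np.interp(q, ref_q, reference)     # reference value at that quantile
    return out

# Toy example (invented): the second pin loads systematically more material.
values = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
pins = np.array([1, 1, 1, 1, 2, 2, 2, 2])
normalized = pin_quantile_normalize(values, pins)
```

Because every group is forced onto the same distribution, a real biological difference between the spots printed by different pins would also be removed, which is one reason for the caution about small groups noted above.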
Figure 7.3. Pin-dependent normalization. A quick survey of the ANOVA shows a variation in the groups
generated by pins 1-4, and that group 5 is out of scale. This may be due to a defect in the needle that
printed this group (pin no. 5, not loading). The quantile-to-quantile normalization may solve this problem.
Figure 7.4. Inter-slide pre- and post-normalization (quantile-to-quantile normalization) box-plots for
seven arrays.
After inter-slide normalization, the scales are comparable. Without this normalization,
one or more slides would have excessive weight in deciding the features when performing
a classification. In cases with small differences between scales, however, the error
introduced by an inter-slide normalization procedure may be detrimental to subsequent
analysis procedures⁵.
The majority of expressed genes exhibit a power-law distribution with an exponent
close to -1 (i.e., gene expression obeys Zipf's law [25][6][22]). Based on the
observation that single channel and two channel microarray data sets also follow a
power-law distribution, a recent paper developed a normalization method based on Zipf's
law. This normalization procedure is useful when the quantile-to-quantile method cannot
be applied (e.g., in the case of microarrays containing functionally specific gene sets,
where the initial hypothesis that most genes are not differentially expressed does not
hold).
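Whether a given array follows such a power law can be checked by fitting a line to the rank-intensity relation in log-log coordinates. The sketch below only estimates the exponent; it is not the normalization method of the cited paper, and the function name `zipf_exponent` and the synthetic data are assumptions of this example.

```python
import numpy as np

def zipf_exponent(intensities):
    """Slope of log10(intensity) vs. log10(rank), with rank 1 = brightest spot."""
    x = np.sort(np.asarray(intensities, dtype=float))[::-1]
    x = x[x > 0]                                   # the logarithm needs positive values
    rank = np.arange(1, len(x) + 1)
    slope, _intercept = np.polyfit(np.log10(rank), np.log10(x), 1)
    return slope

# Synthetic intensities that follow Zipf's law exactly (intensity proportional to 1 / rank).
intensities = 1000.0 / np.arange(1, 201)
slope = zipf_exponent(intensities)
```

On real data the fit is usually restricted to the range of ranks where the log-log relation is approximately linear.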
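For each gene $i$ with expression $M(i,j)$ in sample $j$, the mean expression over the samples of class
$l \in L$ is defined as
\[ M_{i,l} = \frac{1}{n_l} \sum_{j \mid l(j) = l} M(i,j) , \]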
⁵ For example, BSS/WSS is highly sensitive when WSS → 0, that is, when the elements in a class are close
to one another. Inter-slide normalization, based on the principle that most genes are not differentially
expressed, can disturb the WSS → 0 condition, and therefore any further feature selection that ranks the elements
by BSS/WSS (see next paragraph).
⁶ Similar logic may be applied to protein arrays.
[...] number of samples in the class [...]
$$W_i = \frac{\sum_{l \in L} \sum_{j \mid l(j) = l} \left( M(i, j) - M_{i,l} \right)^2}{m - |L|}$$
$$B_i = \frac{\sum_{l \in L} n_l \left( M_{i,l} - \frac{1}{m} \sum_{j} M(i, j) \right)^2}{|L| - 1}$$
Figure 7.5. Representation of the ratio of between-class and within-class variances in a 3-dimensional
space.
This is represented in Figure 7.5. For each gene, the ratio of between-class and
within-class variances (BSS/WSS 7 ) may be used to rank all genes according to their
ratios, and to select those genes having the largest ratios, up to a predefined number, as
feature genes.
A valid threshold for BSS/WSS ratios must be selected where the features are
informative and then, after determining the significance level for retaining a specific set
of best ranked genes, we can define a procedure for assessing the validity of the set. The
general idea is to compare the BSS/WSS used with the true classes against a random
assignment of labels. We can label this as the "Confidence level" threshold for selection of the lowest value of the ratio BSS/WSS (Figure 7.6).
$C_{level}$ is defined as the ratio of the number of cases in which the real value of BSS/WSS is larger than the BSS/WSS
for a random assignment of labels to the total number of cases:
$$C_{level} = \frac{N_{real > random}}{N_{total\ cases}} \cdot 100 \ [\%].$$
7
Between Sum of Squares (BSS), Within Sum of Squares (WSS), Total Sum of Squares (TSS).
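The confidence level described above can be sketched as a label-permutation procedure. This is an illustration under stated assumptions, not the thesis implementation: the helper names `bss_wss` and `confidence_level`, the number of permutations, and the toy expression values are all hypothetical, and the ratio is computed without the degrees-of-freedom factors because they are the same for every relabeling.

```python
import random


def bss_wss(values, labels):
    """Ratio of between-class to within-class sum of squares for one gene."""
    overall = sum(values) / len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    bss = wss = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        bss += len(members) * (mean - overall) ** 2
        wss += sum((v - mean) ** 2 for v in members)
    return bss / wss if wss > 0 else float("inf")


def confidence_level(values, labels, n_permutations=1000, seed=0):
    """Percentage of random relabelings whose BSS/WSS is below the real one."""
    rng = random.Random(seed)
    real = bss_wss(values, labels)
    shuffled = list(labels)
    wins = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random assignment of the same labels
        if real > bss_wss(values, shuffled):
            wins += 1
    return 100.0 * wins / n_permutations


# Toy gene (invented values) that separates two classes well.
expression = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
classes = ["A", "A", "A", "A", "B", "B", "B", "B"]
ratio = bss_wss(expression, classes)
c_level = confidence_level(expression, classes)
```

A gene whose real ratio rarely exceeds the ratios obtained from shuffled labels receives a low confidence level and is a weak candidate for the feature set.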
To discriminate the classes, two other methods have been used, one based on the
Receiver Operating Characteristic (ROC) [34] and one on the t-test. On two-class separation problems, the t-test and the Between
Sum of Squares vs. Within Sum of Squares (BSS/WSS) ratio
should lead to the same ranking.
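A short derivation shows why the two rankings agree. It assumes a pooled-variance two-sample t statistic and ranking by $|t|$; the symbols $n_1$, $n_2$, $m = n_1 + n_2$, $\bar{x}_1$, $\bar{x}_2$, $\bar{x}$ and the pooled variance $s_p^2$ are introduced here for illustration only.

```latex
\begin{align*}
BSS &= n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2
     = \frac{n_1 n_2}{m}\,(\bar{x}_1 - \bar{x}_2)^2, \\
WSS &= (m - 2)\, s_p^2, \\
t^2 &= \frac{(\bar{x}_1 - \bar{x}_2)^2}{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}
     = \frac{n_1 n_2}{m} \cdot \frac{(\bar{x}_1 - \bar{x}_2)^2}{s_p^2}
     = (m - 2)\, \frac{BSS}{WSS}.
\end{align*}
```

Since $m - 2$ is the same constant for every gene, sorting genes by BSS/WSS and sorting them by $t^2$ give the same order under these assumptions. A t-test with unequal variances (Welch) would not satisfy this identity exactly.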
The ROC method occupies a central unifying position [4][8][10] in the process of assessing the [...]
[Figure 7.6 annotation: Real BSS/WSS = 0.2155; Confidence level = 94.66%]
[...] number of elements. Robustness is due to its property [...] matches our [...]
For a set of $i = 1, \ldots, m$ features selected from the entire set, the significance value for the
set, $\alpha_{set}$, [...]
$$\prod_{i=1}^{m} (1 - \alpha_i) = 1 - \alpha_{set}$$
Although a number of other adjustment procedures are available [...]
$$FDR = \frac{p_0 \{1 - F_0(c)\}}{1 - F(c)}$$
where $c > 0$ [...] adjustment value $p_0$ [...] near $c = 0$:
$$p_0 \approx \frac{f(c)}{f_0(c)}$$
[28]. Once the [...]
over the entire set are used to draw the histogram of discrimination values. In an ideal
case, when discrimination carries (biological) meaning, the probabilistic density function
of a random assignment of classes will decrease earlier than in the case of correct
assignment of classes (Figure 7.7). Visually, FDR is the ratio between areas of the random
assignment and the correct assignment when discrimination is high.
NB. When every individual comparison has the same significance $\alpha$, this equation can easily be solved for
$\alpha = 1 - (1 - \alpha_{set})^{1/k}$, where $k$ is the number of comparisons, which for a small significance value reduces to $\alpha = \alpha_{set} / k$. This relation is called the
Bonferroni correction.
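The two per-comparison significance values in the note above can be compared numerically. The function names and the example values (a set-wise significance of 0.05 over 100 comparisons) are illustrative choices, not taken from the thesis.

```python
def per_test_alpha_exact(alpha_set, k):
    """Solve (1 - alpha)^k = 1 - alpha_set for the per-comparison alpha."""
    return 1.0 - (1.0 - alpha_set) ** (1.0 / k)


def per_test_alpha_bonferroni(alpha_set, k):
    """Small-alpha approximation: the Bonferroni correction."""
    return alpha_set / k


exact = per_test_alpha_exact(0.05, 100)
approx = per_test_alpha_bonferroni(0.05, 100)
# Plugging the exact value back into the product recovers the set-wise level.
recovered = 1.0 - (1.0 - exact) ** 100
```

The Bonferroni value is slightly smaller than the exact one, so it is the more conservative of the two thresholds.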
Clustering
Clustering is the classification of similar objects into different groups, or more precisely,
the partitioning of a data set into subsets (called clusters), so that the data in each subset
ideally will share common attributes, often proximity according to some defined distance
measure. There are two types of data clustering algorithms: hierarchical and partitional.
In hierarchical algorithms, successive clusters are found using previously established
clusters, while partitional algorithms determine all clusters at once. Hierarchical
algorithms are either agglomerative (also called bottom-up) or divisive (top-down)
methods. Agglomerative algorithms begin with each element as a separate cluster and
merge them into successively larger clusters. Divisive algorithms begin with the whole
set and divide it into successively smaller clusters.
Agglomerative hierarchical clustering builds the hierarchy from the individual
elements by progressively merging clusters; the first step is to determine which elements
merge together. To know the order in which elements merge, we need to define a
closeness measure between elements, the function $d(x, y)$ (which sometimes is a
distance9). Using this closeness function between elements and/or clusters, and depending
on the way of merging, we have the following linkages:
complete linkage clustering, when the maximum distance between elements of each cluster, $\max_{x \in C_1, y \in C_2} d(x, y)$, is used;
single linkage clustering, when the minimum distance between elements of each cluster, $\min_{x \in C_1, y \in C_2} d(x, y)$, is used;
average linkage clustering, when the mean distance between elements of each cluster, $\frac{1}{card(C_1) \, card(C_2)} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y)$, is used;
the increase in variance for the cluster being merged (called Ward's criterion).
9
[...] such that $d : \mathbb{R}^n$ [...]
Each agglomeration occurs at larger values of the function than the previous
agglomeration and one can decide to stop clustering either when the clusters are too far
apart to be merged (based on a distance criterion) or when there is a sufficiently small
number of clusters (a number criterion).
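The merging loop and the two stopping rules can be sketched as follows. This is a naive illustration (it recomputes all pairwise cluster distances at every step) and not an efficient or authoritative implementation; the function `agglomerate`, its parameters, and the toy points are assumptions made for this example. Ward's criterion is not included.

```python
def agglomerate(points, distance, linkage="average", max_distance=None, n_clusters=1):
    """Naive agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        pairwise = [distance(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average linkage

    while len(clusters) > n_clusters:  # number criterion
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if max_distance is not None and d > max_distance:
            break  # distance criterion: remaining clusters are too far apart
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5


data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
two = agglomerate(data, euclidean, linkage="single", n_clusters=2)
by_distance = agglomerate(data, euclidean, linkage="complete", max_distance=2.0)
```

The first call stops when two clusters remain; the second stops as soon as the closest pair of clusters is more than 2.0 apart, leaving the isolated point as its own cluster.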
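Distance measures
If $x, y \in \mathbb{R}^n$ are $n$-dimensional vectors, the following distances can be defined:
1-norm distance: $\sum_{i=1}^{n} |x_i - y_i|$
2-norm distance: $\left( \sum_{i=1}^{n} |x_i - y_i|^2 \right)^{1/2}$
p-norm distance: $\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
infinity-norm distance: $\lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max(|x_1 - y_1|, \ldots, |x_n - y_n|)$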
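The family of p-norm distances fits in one small function. The name `p_norm_distance` and the example vectors are illustrative; the infinity norm is handled as a special case rather than as a numerical limit.

```python
def p_norm_distance(x, y, p):
    """p-norm distance between two equally long vectors; p may be float('inf')."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(xi - yi) for xi, yi in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


a = (1.0, 2.0, 3.0)
b = (4.0, 6.0, 3.0)
d1 = p_norm_distance(a, b, 1)               # |3| + |4| + |0|
d2 = p_norm_distance(a, b, 2)               # sqrt(9 + 16)
dinf = p_norm_distance(a, b, float("inf"))  # largest coordinate difference
d50 = p_norm_distance(a, b, 50)             # already close to the infinity norm
```

As p grows, the largest coordinate difference dominates the sum, which is why the p-norm approaches the maximum.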
k-Means algorithm
The k-means algorithm assigns each point to the cluster whose centroid is closest. The
centroid of a cluster is the point whose coordinate on each dimension is a measure of central tendency (for example, the
mean) of that coordinate over all the points in the cluster. The algorithm
described in [27] (a) randomly generates k clusters and determines the cluster centroids,
or directly generates k seed points as cluster centers; (b) assigns each point to the nearest
cluster center; (c) re-computes the new cluster centers; and (d) repeats steps (b) and (c) until some
convergence criterion is met (usually that the assignment no longer changes).
The main advantages of this algorithm are its simple implementation and
computational speed (an important advantage for large datasets). The
algorithm does not necessarily return the same results on each run; the resulting clusters depend
on the initial assignments. The k-means algorithm maximizes inter-cluster
(or, equivalently, minimizes intra-cluster) variance, but it guarantees only a locally optimal solution, that is, a local
minimum of the intra-cluster variance.
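Steps (a) to (d) can be sketched in a few lines. This is a minimal illustration using seed points drawn from the data and squared Euclidean distance; the function `k_means`, the fixed random seed, and the toy points are assumptions for this example, and an empty cluster simply keeps its previous center.

```python
import random


def k_means(points, k, seed=0, max_iterations=100):
    """Basic k-means; returns (cluster index per point, list of centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # (a) seed points
    assignment = None
    for _ in range(max_iterations):
        # (b) assign each point to the nearest center
        new_assignment = [
            min(range(k),
                key=lambda c: sum((pi - ci) ** 2 for pi, ci in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # (d) converged: the assignment did not change
        assignment = new_assignment
        # (c) recompute each center as the coordinate-wise mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = k_means(data, k=2)
```

Changing `seed` changes the initial centers; on less clearly separated data this can lead to a different local minimum, which is the run-to-run variability mentioned above.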
[...] $m \times n$, where $m > n$ (the number of genes analyzed is substantially larger than the number of samples). The
elements of the matrix, $x_{ij}$, [...] The elements of the $i$-th row form the $n$-dimensional vector $g_i$, the
response of the $i$-th gene, and the elements of the $j$-th column form the $m$-dimensional vector $p_j$ [...] of the $j$-th patient.
[...] is $X = U S V^T$, where $U$ is an $m \times n$ matrix whose columns $u_k$
are called left singular vectors (gene coefficient vectors) and form an
orthonormal basis, $U^T U = I$ ($u_i \cdot u_j = 1$ when $i = j$, and $u_i \cdot u_j = 0$ otherwise). $V^T$ is an $n \times n$
matrix, $V^T V = I$; the rows $v_k$ of $V^T$
contain the elements of the right singular vectors (expression level vectors); and
$S$ is a diagonal matrix [...] with $s_k > 0$ for $1 \le k \le r$ and $s_k = 0$ for $(r + 1) \le k \le n$, where $r$ is the rank of $X$.
$X^{(l)}$ [...] is defined by [...] $\sum_{ij} \left( x_{ij} - x_{ij}^{(l)} \right)$ [...]
[...] $X^T X$. The eigenvectors of $X^T X$ form the columns of $V$, the eigenvectors of $X X^T$
form the columns of $U$, and the singular values in $S$ are the square roots of the eigenvalues from
$X X^T$ and $X^T X$ (see footnote 10) and are arranged in descending order. The singular values are always real
and [...]
The relationship between PCA and SVD is described in Wall et al., 2003 [35].
Performing PCA is equivalent to performing SVD on the covariance matrix11 of the
data. When the matrix $X$ is centered on columns, $X^T X = \sum_i g_i g_i^T$ is proportional to the covariance matrix [...]
10
Proof: $X = U S V^T$ and $X^T = V S U^T$; $X^T X = V S U^T U S V^T = V S^2 V^T$; then $X^T X V = V S^2$.
Similarly, $X X^T = U S V^T V S U^T = U S^2 U^T$; then $X X^T U = U S^2$.
11
$cov(x, y) = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$.
12
See the work on MDS by Young and Householder, 1938; Torgerson, 1952 for Euclidean MDS and further
Kruskal and Wish, 1978; de Leeuw and Heiser, 1982; Wish and Carroll, 1982; Young, 1985.
[...] $X_k$ [...] $X'_k$ [...] the distance between $x_k$ and $x_l$ in $X_k$ [...] and with $d'(k, l)$ the distance between $x'_k$ and $x'_l$ in the low-dimensional
space [...] $d(k, l)$ by $d'(k, l)$; if the square-error cost is used [...]
$$\sum_{k,l} [d(k, l) - d'(k, l)]^2$$
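The square-error cost of a projection can be evaluated directly from the two sets of coordinates. This sketch assumes Euclidean distances in both spaces and sums over unordered pairs k < l (summing over ordered pairs would only double the value); the function name and the toy coordinates are illustrative.

```python
def squared_error_cost(original, projected):
    """Sum over pairs k < l of (d(k, l) - d'(k, l))^2 with Euclidean distances."""
    def dist(points, k, l):
        return sum((a - b) ** 2 for a, b in zip(points[k], points[l])) ** 0.5

    n = len(original)
    return sum(
        (dist(original, k, l) - dist(projected, k, l)) ** 2
        for k in range(n)
        for l in range(k + 1, n)
    )


# Three points on a line in 2-D, projected to 1-D by dropping the second coordinate.
original = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
projected = [(0.0,), (3.0,), (6.0,)]
cost = squared_error_cost(original, projected)

# A projection that keeps every pairwise distance has zero cost.
isometric = [(0.0,), (5.0,), (10.0,)]
zero_cost = squared_error_cost(original, isometric)
```

An MDS procedure would search for the low-dimensional coordinates that make this cost as small as possible.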
If the components of the data vectors are of ordinal type, the rank order of the distances
between the vectors is meaningful. In this case, the projection will map the distances to
such values that best preserve the rank order.
In non-metric MDS, only the rank order of entries in the data matrix (not the actual
dissimilarities) is assumed to contain the significant information. Kruskal, 1964 [19],
Shepard, 1962 [32], and Kruskal and Wish, 1978 [20] use a function $f$ [...]
function:
$$\frac{\sum_{k,l} [f(d(k, l)) - d'(k, l)]^2}{\sum_{k,l} (d'(k, l))^2}$$
The distances in the final configuration should be in the same rank order as the original
data. Consequently, the purpose of the non-metric MDS algorithm is to find a
configuration of points whose distances reflect as closely as possible the rank order of the
data.
Using a monotone regression that involves the calculation of a new set of distances
in an iterative [...] $d'(k, l)$; [...] $d'(k, l)$ matrix
corresponding to $d(k, l)$ by
evaluating the cost function; (4) to adjust the coordinates of each point in the direction that
maximally reduces the stress while the fit is adequate.
Since non-metric MDS deals with ranking, a crucial problem is how to treat ties in the
data. There are two main approaches, the first, called the primary approach, considers
the ties as undetermined and continues with the algorithm preserving the equality or
replacing it by an inequality. In the secondary approach, ties are retained in the fitting
values. Consequently, if the actual distances do not preserve every inequality in the data,
the infringement would be counted as a deviation from monotonicity. The primary
approach is preferred if there are a large number of distinct dissimilarity values.
Classification
Classification refers to a group of methods that aim to characterize the objects, or
phenomena, in a data set. The classification methods make use of either a subset of features
relevant for the task or the entire space of features. The methods can be supervised or
unsupervised, according to whether knowledge of categorized objects within known classes
(the labels of elements) is used.
There are two phases in construction of a classifier. In the training phase, the
training set is used to decide how the parameters should be combined in order to
separate the various classes of objects. In the application phase, the weights determined
in the training set are applied to a set of objects that do not have known classes in order
to determine what the classes are likely to be.
A special case of classification is when we divide a known set, with known labels, into
a training set and a test set. The application phase then aims to report the differences between the
classification results and the real labels. This is a test model for the classifier, used to
observe its performance on a representative set of data.
If a classifier has few parameters to learn, the classification is usually an easy
problem. When there are many parameters to consider, as in the case of array-based
datasets, the classification problem becomes difficult because of the size of the space of parameter
combinations to search, and techniques based on exhaustive searches of the
parameter space are sometimes computationally infeasible.
Formally, the classification problem can be stated as follows: from a training set [...] $y \in Y$.
Three related mathematical problems can be defined. The first is the mapping of the
multi-dimensional vector space of features to a set of labels. In this category, we can
include the nearest neighbor algorithm that partitions the feature space into regions and
assigns labels to the respective regions. Also in this first category, we can include the
unsupervised clustering of feature space and labeling of each cluster or region.
In the second category of classification is the estimation problem, defined as a
function of the form $P(class \mid \vec{x}) = f(\vec{x}; \vec{\theta})$ [...]
the Bayesian classification, where $P(class \mid \vec{x}) = \int f(\vec{x}; \vec{\theta}) \, P(\vec{\theta} \mid D) \, d\vec{\theta}$ and the
result is integrated over all possible $\vec{\theta}$, weighted by how likely they are given the training
data $D$. The third category relates to the second; in this approach, the class-conditional
probability $P(\vec{x} \mid class)$ is estimated and the Bayesian approach is used.
13
Lloyd stated his algorithm for scalar quantization, the case in which k = 1, but it easily extends to k > 1.
14
Lloyd-Max necessary conditions for an optimal quantizer. These were first discovered by the Polish
mathematicians Lukaszewicz and Steinhaus [26].
15
Occam's Razor principle states that one should make no more assumptions than needed. In simple words,
"the simplest explanation is the best."
This has led researchers to view MDL as equivalent to Bayesian inference16. Code
length of the model and code length of model and data together in the MDL framework
correspond to prior probability and marginal likelihood respectively in the Bayesian
framework. While the Bayesian machinery is often useful in constructing efficient MDL
codes, the MDL framework sometimes uses other codes that do not fit into a Bayesian
framework17. Furthermore, the MDL Principle prefers some priors over others. While the
same priors tend to be favored in so-called objective Bayesian analysis, they are favored
for different reasons.
References
[1] Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the false discovery rate in
behavior genetics research. Behav Brain Res. 2001 Nov 1;125(1-2):279-84.
[2] Dudoit S, Yang YH, Callow MJ, Speed T. Statistical Methods for Identifying Differentially
Expressed Genes in Replicated cDNA Microarray Experiments, Statistica Sinica. 2002 12:111-139
[3] Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes Analysis of a Microarray
Experiment. faculty.washington.edu/ ~jstorey/papers/ETST_JASA_2001.pdf
[4] Egan JP. Signal Detection Theory and ROC Analysis, Academic Press,New York, 1975
[5] Fan J, Gijbels I. Local Polynomial Modelling and its Applications. Chapman and Hall, London.
1996.
[6] George K. Zipf, Human Behaviour and the Principle of Least-Effort, Addison-Wesley, Cambridge
MA, 1949
[7] Gersho A, Gray RM, Vector Quantization and Signal Compression, Kluwer Academic Publishers,
1991.
[8] Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev
Diagn Imaging. 1989; 29(3):307-35.
[9] Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P.
'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns.
Genome Biol. 2000;1(2):RESEARCH0003. Epub 2000 Aug 4.
[10] Henderson AR. Assessing test accuracy and its clinical consequences: a primer for receiver
operating characteristic curve analysis. Ann Clin Biochem. 1993 Nov; 30 ( Pt 6):521-39.
[11] Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med. 1990;
9:811-818.
[12] Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;
75:800-802.
[13] Holm S. A simple sequentially rejective multiple test procedure. Scand J. Stat. 1979, 6:65-70.
[14] Hoyle DC, Rattray M, Jupp R, Brass A. Making sense of microarray data distributions.
Bioinformatics. 2002 Apr;18(4):576-84.
[15] Jardon M. Systems Biology: An Overview, at http://www.bioteach.ubc.ca/Bioinformatics/
systemsbiology/
[16] Keselman HJ, Cribbie R, Holland B. Controlling the rate of Type I error over a large set of
statistical tests. Br J Math Stat Psychol. 2002 May;55(Pt 1):27-39.
[17] Kieffer JC, A survey of the theory of source coding with a fidelity criterion. Information Theory,
IEEE Transactions on, 1993
[18] Kitano H. Systems Biology: A Brief Overview. Science, 2002, 295: 1662-1664.
[19] Kruskal JB, Nonmetric Multidimensional Scaling: A Numerical Method, Psychometrika, 29:2
1964, pp. 115-129.
[20] Kruskal JB, Wish M. Multidimensional Scaling, Sage Publications, Beverly Hills, Calif., 1978.
[21] Li M, Vitanyi P. An introduction to Kolmogorov Complexity and its Applications: Preface to the
First Edition (1997)
[22] Li W, Random texts exhibit Zipf's-law-like word frequency distribution, IEEE Transactions on
Information Theory, 38(6), pp.1842-1845, 1992
16
For example in David MacKay's Information Theory, Inference, and Learning Algorithms.
17
An example is the Shtarkov 'normalized maximum likelihood code', which plays a central role in current
MDL theory, but has no equivalent in Bayesian inference.
[23] Lloyd SP, Least Squares Quantization in PCM, IEEE Transactions on Information Theory, 1982.
[24] Loader C. Local Regression and Likelihood. Springer, New York 1999.
[25] Lu T, Costello CM, Croucher JP, Häsler R, Deuschl G, Schreiber S. Can Zipf's law be adapted to
normalize microarrays? BMC Bioinformatics. 2005; 6: 37.
[26] Lukaszewicz J, Steinhaus H. On measuring by comparison, Zastos. Mat., vol. 2, pp. 225-232,
1955.
[27] MacQueen J. Some methods for classification and analysis of multivariate observations. Proc. Fifth
Berkeley Symp. University of California Press 1, 1966, pp. 281-297.
[28] Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False Discovery Rate, Sensitivity and
Sample Size for Microarray Studies. Bioinformatics. 2005 Apr 21
[29] Popper K, Conjectures and Refutations, London: Routledge and Kegan Paul, 1963, pp. 33-39
[30] Rissanen J. Generalized Kraft Inequality and Arithmetic Coding. IBM Journal of Research and
Development 20(3): 198-203 (1976)
[31] Rissanen J. Theory of Relations for Databases - A Tutorial Survey. MFCS 1978: 536-551
[32] Shepard RN. Psychometrika. 1962; 27: 125-140, 219-246.
[33] Solomonoff R "Perfect Training Sequences and the Costs of Corruption - A Progress Report on
Inductive Inference Research," Oxbridge Research, August 1982.
[34] Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988 Jun 3; 240(4857):1285-93.
[35] Wall ME, Rechtsteiner A, Rocha LM. A Practical Approach to Microarray Data Analysis, 2003
[36] Westfall PH, Young SS. Resampling-Based Multiple Testing, 1993 Wiley, New York.
Publications
Publication no. 1
Mircean C, Tabus I, Astola J.
Quantization and distance function selection for discrimination of
tumors using gene expression data.
SPIE 2002, BiOS 2002 Symposium, 19-25 January 2002, San
Jose, CA.
1. INTRODUCTION
This is a case study carried out on the MITLeukemia data set, containing expressions of about seven thousand genes
in 72 measurements (also referred to as cells in the following). A labeling of the 72 measurements
into two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), is available, as is a further refinement
of the ALL class into B-cells and T-cells.
We concentrate on the k Nearest Neighbor (k-NN) classification method and propose variations of it, the most important
one being the use of normalized mutual information to define the closeness function. In order to get
meaningful estimates of the mutual information, the data first has to be quantized, which leaves us with the task of
selecting an appropriate quantization method.
The criterion used to rank the performance of various methods is the estimated error rate. The error rates are estimated
by repeating the following 150 times (later 1000 times): split the 72 cells into 48 training cells (for which the labels
are known) and 24 test cells, for which the labels are estimated by the given method, and store the number of erroneous
classifications as err(i). To graphically represent the error rates we either box-plot the 150 values of err(i), or
we bar-plot them: the first bar represents the number of times err(i)=1, the second bar is (2 x the number of times err(i)=2),
and so on; the total height of all the bars is then the overall number of erroneous classifications in the 150 runs.
We also study how well certain cells can be classified, by checking how many times in the 150 runs a cell was
misclassified. If a cell is repeatedly misclassified, the original labeling becomes questionable, which suggests the use of
classification as a technique of spotting possible wrong diagnoses.
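The repeated-split protocol can be sketched as follows. This is an illustration only: the function names, the plain majority vote, the Euclidean distance, and the synthetic two-class data are assumptions made for the example and are not the MITLeukemia data or the exact procedure of the paper.

```python
import random
from collections import Counter


def knn_predict(train, train_labels, query, k, distance):
    """Majority vote among the k training profiles closest to the query."""
    nearest = sorted(range(len(train)), key=lambda i: distance(train[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def repeated_split_errors(samples, labels, k, distance, n_runs=150, n_train=48, seed=0):
    """Return err (errors per run) and a per-sample misclassification count."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    err = []
    misclassified = Counter()
    for _ in range(n_runs):
        rng.shuffle(indices)
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        train = [samples[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        errors = 0
        for i in test_idx:
            if knn_predict(train, train_labels, samples[i], k, distance) != labels[i]:
                errors += 1
                misclassified[i] += 1  # how often sample i is misclassified
        err.append(errors)
    return err, misclassified


def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


# Synthetic stand-in data: 72 profiles with 5 variables, two well separated classes.
data_rng = random.Random(1)
samples = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(36)]
samples += [[data_rng.gauss(10.0, 1.0) for _ in range(5)] for _ in range(36)]
labels = ["ALL"] * 36 + ["AML"] * 36
err, misclassified = repeated_split_errors(samples, labels, k=3, distance=euclidean, n_runs=20)
```

The list `err` is what would be box-plotted or bar-plotted, and `misclassified` supports the per-cell check described above: a sample with a high count is one whose original label may deserve a second look.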
Functional Monitoring and Drug-Tissue Interaction, Manfred D. Kessler, Gerhard J. Müller, Editors,
Proceedings of SPIE Vol. 4623 (2002) © 2002 SPIE · 1605-7422/02/$15.00
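$$WSS^{g} = \sum_{i} \left( x_i^{g} - \bar{x}^{g} \right)^2, \qquad BSS^{g} = n^{g} \left( \bar{x}^{g} - \bar{x} \right)^2$$
where $g$ designates a group of interest, $n^g$ denotes the number of data in the group $g$, $i$ is the index across samples
belonging to the group $g$, $\bar{x}^g$ represents the average expression level of elements in the group, and $\bar{x}$ is the average
across all samples.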
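For gene $j$ in the 2-class decomposition, the BSS / WSS ratio is:
$$\frac{BSS(j)}{WSS(j)} = \frac{BSS^{ALL} + BSS^{AML}}{WSS^{ALL} + WSS^{AML}} = \frac{n^{ALL} \left( \bar{x}_j^{ALL} - \bar{x}_{.j} \right)^2 + n^{AML} \left( \bar{x}_j^{AML} - \bar{x}_{.j} \right)^2}{\sum_i \left( x_{ij}^{ALL} - \bar{x}_j^{ALL} \right)^2 + \sum_i \left( x_{ij}^{AML} - \bar{x}_j^{AML} \right)^2}$$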
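For the 3-class decomposition (ALL & B-cell, ALL & T-cell, AML), for gene $j$ the BSS / WSS ratio is:
$$\frac{BSS(j)}{WSS(j)} = \frac{BSS^{ALL\,B} + BSS^{ALL\,T} + BSS^{AML}}{WSS^{ALL\,B} + WSS^{ALL\,T} + WSS^{AML}}$$
$$= \frac{n^{ALL\,B} \left( \bar{x}_j^{ALL\,B} - \bar{x}_{.j} \right)^2 + n^{ALL\,T} \left( \bar{x}_j^{ALL\,T} - \bar{x}_{.j} \right)^2 + n^{AML} \left( \bar{x}_j^{AML} - \bar{x}_{.j} \right)^2}{\sum_i \left( x_{ij}^{ALL\,B} - \bar{x}_j^{ALL\,B} \right)^2 + \sum_i \left( x_{ij}^{ALL\,T} - \bar{x}_j^{ALL\,T} \right)^2 + \sum_i \left( x_{ij}^{AML} - \bar{x}_j^{AML} \right)^2}$$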
When using the Entropy Correlation Coefficient (see Section 3.5) we first selected the most discriminative genes
according to their BSS / WSS ratio, and after that we quantized the values. The alternative would be to use the
decomposition proposed by Odaka6 for categorical data sets. In that case, the quantization is applied first
and then the genes with the largest BSS / WSS on the quantized values are selected.
In the MITLeukemia dataset we measure the similarity between two mRNA gene expression profiles, one belonging to one
of the two classes, acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), and the other one considered
unlabeled. The similarity between gene expression profiles $x = (x_1, ..., x_p)$ and $y = (y_1, ..., y_p)$ is evaluated by:
$$r_{x,y} = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \, \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$
In the k-NN algorithm we define the closeness function as one minus the absolute value of the correlation; therefore we
interpret profiles giving either strongly negative or strongly positive correlations as being close to each other.
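The closeness function based on the correlation coefficient can be written compactly. The function names and the toy profiles are illustrative; the sketch assumes that neither profile is constant, since the correlation is undefined in that case.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # assumes non-constant profiles


def correlation_distance(x, y):
    """One minus the absolute correlation: 0 for perfectly (anti)correlated profiles."""
    return 1.0 - abs(pearson(x, y))


profile = [1.0, 2.0, 3.0, 4.0]
inverted = [8.0, 6.0, 4.0, 2.0]     # perfectly anti-correlated with profile
unrelated = [1.0, -1.0, -1.0, 1.0]  # zero correlation with profile
d_inverted = correlation_distance(profile, inverted)
d_unrelated = correlation_distance(profile, unrelated)
```

Because of the absolute value, an inverted profile is treated as just as close as an identical one, which is the design choice stated in the text.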
3. Euclidean Distance
We can also use the Euclidean distance between the expression vectors as the distance measure.
If $x = (x_1, ..., x_p)$ and $y = (y_1, ..., y_p)$ are the expression profiles, the Euclidean distance is:
$$d_{x,y} = \left( \sum_{i=1}^{p} (x_i - y_i)^2 \right)^{1/2}$$
The two cells are considered similar if the shape of the gene expression profile is similar.
4. Mahalanobis Distance
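In this report we use the distances between an unclassified cell $x$
and [...]
If we consider the ALL class and the AML class as having normal density distributions with the means
$\mu_{ALL} = (\mu_{ALL,1}, ..., \mu_{ALL,p})$ and $\mu_{AML} = (\mu_{AML,1}, ..., \mu_{AML,p})$ ($p \times 1$ vectors) and $\Sigma$ ($p \times p$ positive definite matrix) the
common covariance matrix, the Mahalanobis distance is defined as:
$$d(x, \mu_{ALL}) = \left( (x - \mu_{ALL})' \, \Sigma^{-1} (x - \mu_{ALL}) \right)^{1/2}, \qquad d(x, \mu_{AML}) = \left( (x - \mu_{AML})' \, \Sigma^{-1} (x - \mu_{AML}) \right)^{1/2}$$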
The Mahalanobis distance takes the variability of the cells' gene expression profiles into account. Instead of treating
all values equally when calculating the distance from the mean point, it weights the differences by the range of variability in the
direction of the sample point. Another advantage of using the Mahalanobis measurement for discrimination is that the
distances are calculated in units of standard deviation from the group mean.
An inconvenience of the method is the increasing computational complexity, especially for large values of p, the number
of genes in the cell's expression profile.
5. Mutual information; Entropy Correlation Coefficient
Mutual Information measures the amount of information that one random variable contains about another random
variable, or equivalently, measures the reduction in the uncertainty of one random variable due to the knowledge of the
other. Each gene expression profile is a random vector. Mutual information can be defined also for continuous domain,
but in order to reduce the computational task we preferred to quantize the continuous values first.
Consider two random variables X and Y with a joint probability mass function p(x,y) and marginal probability mass
functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the
product distribution:
$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x) \, p(y)}$$
[...] information to be scaled by $\frac{1}{2} (H(X) + H(Y))$. The entropy correlation coefficient (Astola10) is defined for
qualitative variables as follows:
$$\rho(X,Y) = \frac{I(X;Y)}{\frac{1}{2} (H(X) + H(Y))} = 2 \left( 1 - \frac{H(X,Y)}{H(X) + H(Y)} \right)$$
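For quantized profiles, the mutual information and its scaled version can be estimated from symbol counts. This sketch assumes the scaling $2\,I(X;Y) / (H(X) + H(Y))$ with no square root, plug-in (frequency) probability estimates, and base-2 logarithms; the function names, the convention of returning 0 when both entropies are zero, and the toy sequences are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Plug-in entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned symbol sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def entropy_correlation_coefficient(x, y):
    """Mutual information scaled by the mean of the two marginal entropies."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # assumption: two constant profiles are given coefficient 0
    return 2.0 * mutual_information(x, y) / (hx + hy)


# Profiles quantized to three levels (0, 1, 2); values are invented.
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]  # a relabeling of a: fully informative
c = [0, 1, 0, 1, 0, 1]  # independent of a in this sample
ecc_ab = entropy_correlation_coefficient(a, b)
ecc_ac = entropy_correlation_coefficient(a, c)
```

A k-NN closeness function could then be taken as one minus this coefficient, so that profiles sharing more information are treated as closer. With few samples per profile the plug-in estimate is biased, which is one reason the number of quantization levels is kept small.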
4. QUANTIZATION
[...] parts, delimited by [...] between $x \in R$ and $y^*$:
$$y^* = centr(R) \iff E[d(x, y^*) \mid x \in R] \le E[d(x, y) \mid x \in R] \ \text{for all } y$$
$$centr(R) = \frac{1}{|R|} \sum_{i=1}^{|R|} x_i$$
where [...]
5. SIMULATION RESULTS
In this section we describe the variations on the classical algorithms and the selection of their parameters, which we use
in our experiments.
1. Review of a previous report on the Leukemia data set
Repeating the experiments of Dudoit4 et al., we obtained very similar results (Figure 2). In this
experiment we take the p = 40 genes having the largest values of BSS / WSS; increasing the number of variables to p = 200
did not significantly affect the performance. The number of runs is N = 150. We first used the box-plot diagram to
visualize the data, as in the Dudoit4 paper. The best results with the Correlation Coefficient similarity measure
are obtained for a number of neighbors k = 11 to 20. The box median in the box-plots, for all these k values, is not
at zero.
2. Changing the metric
We experimented with changing the metric used by the k-NN algorithm, while keeping the number of genes constant, p
= 40. Euclidean Distance is the basic metric used for reference. Euclidean Distance offers no great advantage (Figure 3):
although the box median position is sometimes zero, errors with large values appear as well.
The nearest neighbor classification rule with a reject option (k,l-NN) was introduced by Hellman9 and Ripley8. We use
the correlation coefficient for the outer interval; for the uncertainty domain (heuristically chosen as 20%) we use the
Mahalanobis distance. This distance can be defined between one gene expression profile from TS and each of the two
classes from LS. The Mahalanobis distance is not used directly in the k-NN algorithm; it is used only as an option. When
using it, we observe some improvements: the box-plots in Figure 4 have a lower position of the median for the values
k = 12 to 20.
A particular situation, occurring frequently, is the following: the number of votes for ALL class is zero. In this situation
the presumed class of the gene expression profile from TS is AML. In the same way, if the number of AML votes is zero,
then the presumed class is ALL. In all other cases, we allowed the Mahalanobis distance to decide. The results are
presented in Figure 5.
We conclude that in general the error rates increase significantly in the case of using Mahalanobis Distance.
3. Comparing the effect of various metrics for the quantized data
In the next experiments we use the Mutual Information as a metric in the k-NN algorithm. The continuous values are
quantized into three levels; the limits of partitioning are [...]
correlation coefficient metric. When using the Correlation Coefficient metric (with p = 40; k = 17; N = 1000) the
distribution of the estimated error over the gene expression profiles is approximately uniform, with a minimum
value of 1.5% and a maximum value of 7.8%.
The error is reduced substantially when p is in the range 20 to 40 (Figure 9). We studied experimentally the distribution
of the estimated error over the profiles for the Entropy Correlation Coefficient when p = 20, 40, and 200, where the genes are the first
20, 40, or 200 genes, or genes 41-80, ranked in the order of BSS / WSS.
In Figures 10, 11, 12 and 13 the lower-right panels show the ratio Miss / Hit. For the 66th
observation the ratio is sometimes infinite because the Hit value is zero, and it is not represented.
As we can see from Figure 12, when the number of gene variables is increased to 200, the error concentrates, with a
100% rate, at profile 66, and with 23.4% at profile 67. All other gene expression profiles give 100% correct prediction.
Observations 66 and 67 also had low prediction strengths (0.27 and 0.15, respectively) in the Golub1 et al. paper and in the
Dudoit4 report.
When the variables are changed completely (Figure 13), the 66th observation is again present with a 100% misclassification ratio.
Comparing the error dispersion in the above experiments, we can see that the Entropy Correlation Coefficient
improves the k-NN algorithm results. The overall error (for the k that gives the minimum error) is concentrated on only
a few profiles.
6. Comparative study with 3 classes
It is possible that the 66th observation is misclassified because of the small number of classes considered. The two
classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), might not be enough to characterize
the gene expression profiles. We therefore extended the experiments to three classes: ALL B cell, ALL T cell and AML.
When only the first p = 20 genes are selected in the classification algorithm, the error concentrates on the 17th observation
(originally classified as ALL B cell). This gene expression profile was perfectly classified in the two-class case.
When the number of variables p is increased from 20 to 40 and 200, the 66th and 67th observations again become misclassified.
The estimated error rates for 3 classes (ALL B cell, ALL T cell and AML) are shown in Figure 14; we used the Entropy
Correlation Coefficient metric; Lloyd quantization; p = 20, 40, 200 and genes 41-80; k = 17; N = 150. Figures 15, 16,
17, and 18 present the error dispersion over the gene expression profiles for the first 20, 40, 200 genes and genes 41-80,
ranked in the order of BSS / WSS.
7. Auto selected k value
All experiments reported so far were done for many values of k (the number of neighbors in the k-NN algorithm)
to check which value gives the lowest error rate. An interesting issue is the estimation of k using only the LS data.
In the auto selected k value experiment, for each run (N = 1150) we randomly divide the learning data set (LS) into two
parts with the same ratio 2/3 as in the main division. With a similar k-NN algorithm repeated N1 = 100 times using LS
only, we obtain the k value giving the minimum estimated error, which is then used in the main algorithm.
Varying the number of variables, p, from 20 to 200 (see Figure 19), we observe that the estimated error rate is the best
for p = 40 genes. The error distribution over the gene expression profiles (Figure 20) displays the same pattern, very
high at the 66th and 67th profiles.
6. CONCLUSIONS
Comparing the similarity measures used in the k-NN algorithm, we concluded that Entropy Correlation Coefficient is
the best performing.
The quantization is an important step in microarray data processing since it makes the decision robust to noise. Lloyd
quantization, which minimizes the quantization distortion, turns out to provide the minimum estimated classification
error.
The k-NN algorithm is simple and yet provides good classification.
The 66th and the 67th gene profiles accumulate the largest part of the estimated error rate in both cases of:
Two-class classification: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia)
Three-class classification: ALL B cell, ALL T cell, AML
7. FIGURES
[Plots not reproduced; only titles, legends, and axis labels are recoverable.]
Error rates using the conditions described in "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data" (Sandrine Dudoit, Jane Fridlyand, Terence Speed). Axes: error rate versus k value of the k-NN algorithm.
Error of the k-NN algorithm using the Euclidean metric. Axes: estimated error versus k value of the k-NN algorithm.
Error rates using the k-NN algorithm combined with the Mahalanobis distance for a +/- 20% uncertainty interval. Axes: error rate versus k value of the k-NN algorithm.
Detail in the interesting area of the global error rates. Legend: Correlation Distance; Euclidean Distance; Correlation & Mahalanobis Distance; ECC with (Mean & Standard Deviation) Quantization; ECC with Lloyd Quantization. Horizontal axis: k value of the k-NN algorithm.
Error rates using mean and standard deviation; error rates using Lloyd quantization; error rates using maximum entropy quantization. Legend: 1 error, 2 errors, 3 errors. Horizontal axis: k value of the k-NN algorithm.
Error rate for the first 20 BSS/WSS; error rate for the first 40 BSS/WSS; error rate for the ordered 41-80 BSS/WSS; error rate for the first 200 BSS/WSS. Axes: % error versus k value of the k-NN algorithm.
Figure 9. Comparing the estimated error for Entropy Correlation Coefficient metric; Lloyd quantization for different p = BSS /
WSS values; N = 150.
[Plots not reproduced; only titles and axis labels are recoverable.]
The number of misclassifications using the Entropy & Lloyd quantization for p = 41-80 genes. Panels: No. of misclassifications; No. of total tests; Mis / All tests (% error); Mis / Hit (ratio value). Horizontal axis: indices of experiment.
The number of misclassifications using the Entropy & Lloyd quantization for the first 40 BSS/WSS. Same panels and axes.
The number of misclassifications using the Entropy Correlation Coefficient and Lloyd quantization for the first 20 BSS/WSS. Same panels and axes.
Error rate for the first 20, the first 40, the ordered 41-80, and the first 200 BSS/WSS with 3 classes. Axes: % error versus k value of the k-NN algorithm.
Figure 14 Comparing the estimated error for 3 classes (ALL B cell, ALL T cell and AML) Entropy Correlation Coefficient
metric; Lloyd quantization different p = BSS / WSS values; N = 150.
[Plots not reproduced; only titles and axis labels are recoverable.]
The number of misclassifications using the Entropy & Lloyd quantization for 3 classes. Panels: No. of misclassifications; No. of total tests; Mis / Total tests (% error); Mis / Hit (ratio value). Horizontal axis: indices of experiment.
The number of misclassifications using the Lloyd quantization and Correlation Coefficient for 3 classes. Same panels and axes.
% error versus the number "p" of BSS/WSS. Legend: 1 error to 7 errors.
No. of misclassifications, Miss / All tests (% error), and Miss / Hit (ratio value) versus indices of experiment.
8. REFERENCES
1.
Publication no. 2
Mircean C, Tabus I, Astola J, Kobayashi T, Shiku H, Yamaguchi M,
Shmulevich I. and Zhang W.
Quantization and similarity measure selection for discrimination
of lymphoma subtypes under k-nearest neighbor classification.
SPIE 2004, BiOS 2004, Microarrays, Combinatorial Techniques
and High Throughput Screening, 24-29 January 2004, San Jose,
California, USA
Publication no. 3
Tabus I, Mircean C, Zhang W, Shmulevich I. and Astola J.
Chapter 14: Transcriptome-Based Glioma Classification using
Informative Gene Set
in Genomic and Molecular Neuro-Oncology, Jones and Bartlett
Publishers, 2003 ISBN: 0-7637-2261-8
Publication no. 4
Fuller GN*, Mircean C*, Tabus I, Taylor E, Sawaya R, Bruner MJ,
Shmulevich I, Zhang W.
Molecular Voting for Glioma Classification Reflecting
Heterogeneity in the Continuum of Cancer Progression.
Oncol Rep. 2005 14: 651-656. *Co-first author.
Gregory N. Fuller1, Cristian Mircean1, Ioan Tabus, Ellen Taylor, Raymond Sawaya, Janet M.
Bruner, Ilya Shmulevich, Wei Zhang
Department of Pathology (GNF, CM, JMB, ET, IS, WZ) and Neurosurgery (RS), The University of Texas
M. D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, Texas 77030 USA.
Institute of Signal Processing (CM, IT), Tampere University of Technology
P.O. Box 553, Tampere 33101, Finland.
Gregory N. Fuller and Cristian Mircean made equal contribution to this study
Correspondence: Wei Zhang, Ph.D., Cancer Genomics Core Laboratory, Department of Pathology, the
University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, Texas 77030. Tel:
713-745-1103; Fax: 713-792-5549
E-mail: [email protected].
Keywords: glioma, classification of mixed glioma, multidimensional scaling.
Abbreviations used: GBM, glioblastoma multiforme; AA, anaplastic astrocytoma; AO, anaplastic
oligodendroglioma; OL, oligodendroglioma; MDS, multidimensional scaling; k-NN, k-nearest neighbor;
WSS, within-group sum of squares; BSS, between-group sum of squares.
Abstract
Gliomas are the most common brain tumors and are generally categorized into two lineages
(astrocytic and oligodendrocytic) and into low-grade (astrocytoma and oligodendroglioma), mid-grade (anaplastic astrocytoma and anaplastic oligodendroglioma), and high-grade (glioblastoma
multiforme) tumors based on morphological features. A strict classification scheme has limitations
because a specific glioma can be at any stage of the continuum of cancer progression and may
contain mixed features. Thus, a more comprehensive classification based on molecular
signatures may reflect the biological nature of specific tumors more accurately. In this study, we
used microarray technology to profile the gene expressions of 49 human brain tumors and
applied the k-nearest neighbor algorithm for classification. We first trained the classification
gene set with the 19 most typical glioma cases and selected a set of genes that provides the lowest
cross-validation classification error with k = 5. We then applied the classifier to the
remaining cases, including several that do not belong to gliomas, such as atypical meningioma.
The results showed that not only does the algorithm correctly classify most of the gliomas, but the
detailed voting results also provide more subtle information regarding the molecular similarities to
the neighboring classes. For the atypical meningioma, the voting was equally split among the
four classes, indicating a difficulty in placement of meningioma into the four classes of gliomas.
Thus, the actual voting results, which are typically used only to decide the winning class label in
k-nearest neighbor algorithms, in and of themselves provide a useful method for gaining deeper
insight into the stage of a tumor in the continuum of cancer development.
Introduction
Gliomas are primary tumors of the central nervous system and account for 80% of adult primary
brain tumors (Kleihues and Cavenee, 2000). The prognosis for patients with an advanced
glioma, GBM, is very poor, with a median survival of 8 to 10 months. The diffuse gliomas are
traditionally separated into subtypes based on subjective interpretation and arbitrary weighting of
morphologic features (histologic criteria). The gliomas that share morphological features of
normal astrocytes are classified as astrocytomas and those with morphological features of normal
oligodendrocytes are classified as oligodendrogliomas. Depending on the cellularity and features
such as mitotic index and the presence of necrosis, gliomas are further classified as low-grade
(oligodendroglioma and astrocytoma), mid-grade (anaplastic oligodendroglioma and
anaplastic astrocytoma), or high-grade (GBM) (see Caskey et al., 2000 for a review). However,
heterogeneity is an intrinsic feature of all tumors and histologic classification does not capture
the continuum of the tumor spectrum. When the dominant pathological feature cannot be
identified for a specific tumor, pathologists designate it a mixed tumor. For gliomas, tumors of
mixed or uncertain phenotype represent a significant percentage (10-30%, Kleihues and
Cavenee, 2000). To alleviate the subjectivity in cancer classification, gene expression profiling
has been used to survey the molecular events in cancers and a number of computational
algorithms have been used to perform molecular classification based on groups of genes
(Alizadeh et al., 2000; Fuller et al., 2002; Hedenfalk et al., 2001; Kim et al., 2002). However,
most studies ostensibly reached the conclusions that gene expression-based classification
matched pathological classification and thus no further insight is provided in terms of
classification.
uncertain cases. A recent study (Nutt et al., 2003) investigated whether gene expression profiling
could be used to define subgroups of GBM and AO more objectively than standard pathology.
The feature genes were ranked based on the correlation to each of the two classes and tested
against a random permutation. The classifier, based on Euclidean distance, returned the best error
rate with 20 features. This study showed that prediction models can predict survival for
diagnostically challenging malignant gliomas better than standard pathology.
In this study, we established gene expression profiles of 49 brain tumors including diffuse
gliomas of both typical subtypes (OL, AO, AA, and GBM) and atypical types. By using the k-NN method and the Fisher discriminant, we identified a set of 50 genes that were used to construct a
voting classifier in a manner highly consistent with pathological evaluation. Further, the actual
votes for each class reflect heterogeneity in the continuum of glioma progression. The molecular
voting showed that mixed gliomas are indeed mixed in gene expression profiles with reference to
the typical subtypes. Finally, the voting for an atypical meningioma was split between all
classes, suggesting a difficulty in its placement, which is consistent with the non-glioma nature
of meningioma.
Materials and Methods:
Patient samples and microarray experiments. All the glioma tissues were obtained from
the Brain Tumor Tissue Bank of M. D. Anderson Cancer Center with the approval of the
Institutional Review Board. The tissues were first evaluated by a pathologist (GNF) to
confirm initial pathological evaluation and to ensure that only those tissues with more than 90%
tumor were used.
RNA isolation, microarray experiments, and image analysis were carried out following
procedures previously described (Fuller et al., 1999; Shmulevich et al., 2002; Kobayashi et al.,
2003). The cDNA microarray used for this study included 2,303 genes printed in duplicates and
Such a decision is quite similar in spirit to pre-smoothing of data, commonly performed in signal
and image processing prior to further analysis.
Concerning the number of quantization levels, the simplest approach is the binary model,
where each gene is either on or off. This approach allows the use of
existing methods for binary models and has been shown to be
useful in gene expression data analysis (Shmulevich and Zhang, 2002; Zhou et al., 2003). The
data may also be quantized to three or four levels, which allows each gene to discriminate
more than two classes. We considered and compared binary, ternary, and quaternary
quantization and obtained the optimal thresholds by using the Lloyd algorithm (also known as k-means) (Lloyd, 1982) separately for each patient sample.
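The per-sample Lloyd quantization described above can be sketched as a one-dimensional k-means iteration. This is an illustrative sketch only: the quantile-based initialisation, the iteration cap, and the function name `lloyd_quantize` are assumptions, not details taken from the study.

```python
import numpy as np

def lloyd_quantize(values, n_levels, n_iter=100):
    """Quantize one sample's expression values into n_levels codes (0 .. n_levels-1)."""
    values = np.asarray(values, dtype=float)
    # Assumption: start the reproduction levels at evenly spaced quantiles.
    levels = np.quantile(values, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        # Nearest-level rule: thresholds are midpoints between adjacent levels.
        thresholds = (levels[:-1] + levels[1:]) / 2.0
        codes = np.searchsorted(thresholds, values)
        # Centroid rule: each level moves to the mean of its cell (empty cells stay put).
        new_levels = levels.copy()
        for j in range(n_levels):
            if np.any(codes == j):
                new_levels[j] = values[codes == j].mean()
        if np.allclose(new_levels, levels):
            break
        levels = new_levels
    thresholds = (levels[:-1] + levels[1:]) / 2.0
    return np.searchsorted(thresholds, values), thresholds, levels
```

Calling `lloyd_quantize(sample, 2)`, `lloyd_quantize(sample, 3)`, or `lloyd_quantize(sample, 4)` on one sample's vector of gene expressions would give the binary, ternary, or quaternary codes, respectively.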
The most discriminative genes are selected according to the ratio of the between-group sum of
squares (BSS) to the within-group sum of squares (WSS) (see Mircean et al., 2002).
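A minimal sketch of gene ranking by the BSS/WSS ratio is given below. It assumes the usual per-gene definition, in which BSS sums the squared deviations of class means from the overall mean and WSS sums the squared deviations of samples from their class mean; the function names and the guard against a zero WSS are illustrative assumptions.

```python
import numpy as np

def bss_wss_ratio(x, y):
    """Per-gene BSS/WSS. x: (n_samples, n_genes) expressions; y: (n_samples,) labels."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    overall = x.mean(axis=0)
    bss = np.zeros(x.shape[1])
    wss = np.zeros(x.shape[1])
    for label in np.unique(y):
        xc = x[y == label]
        mean_c = xc.mean(axis=0)
        bss += len(xc) * (mean_c - overall) ** 2    # between-group term
        wss += ((xc - mean_c) ** 2).sum(axis=0)     # within-group term
    # Assumption: avoid division by zero for genes that are constant within every class.
    return bss / np.maximum(wss, np.finfo(float).tiny)

def select_top_genes(x, y, p):
    """Indices of the p genes with the largest BSS/WSS ratio."""
    return np.argsort(-bss_wss_ratio(x, y), kind="stable")[:p]
```

Genes with large ratios vary mostly between classes and little within them, so the top p indices give the candidate discriminative gene set.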
As a variant of nearest neighbor density estimation of the group-conditional densities (Fix and
Hodges, 1951; Stone, 1977), k-NN assigns to an unknown profile the most
frequently occurring label among its k nearest neighbors. In case of a tie, k is reduced until a winner exists.
We also tested the use of the correlation coefficient and the entropy correlation coefficient
(Mircean et al., 2002) as similarity measures between the gene-expression profiles of two patient
samples.
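The voting rule with tie-breaking can be sketched as follows, here with the Pearson correlation coefficient as the similarity measure. This is a hedged illustration: the function name `knn_vote` is hypothetical, and profiles with zero variance (for which the correlation is undefined) are not handled.

```python
import numpy as np
from collections import Counter

def knn_vote(train_x, train_y, profile, k):
    """Return (label, votes): the winning label and the vote counts among the k neighbors."""
    # Similarity of the unknown profile to every training profile.
    sims = np.array([np.corrcoef(row, profile)[0, 1] for row in train_x])
    order = np.argsort(-sims, kind="stable")         # most similar first
    votes = Counter(train_y[i] for i in order[:k])   # votes at the requested k
    kk = k
    while kk >= 1:
        counts = Counter(train_y[i] for i in order[:kk]).most_common()
        # A winner exists if only one label is present or the top count is strict.
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0], dict(votes)
        kk -= 1                                      # tie: drop the farthest neighbor
```

Returning the full vote counts alongside the label keeps the distribution of votes available for inspection, rather than only the winning class.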
Results and discussion
We began this study with the aim of providing a molecular classification to the mixed gliomas.
We profiled the gene expression of 49 brain tumor tissue samples using a 2,303-gene cDNA
microarray produced in our facility. Those 49 cases included 27 glioblastomas (GBM), 7
anaplastic astrocytomas (AA), 6 anaplastic oligodendrogliomas (AO), 3 oligodendrogliomas
(OL), 5 mixed gliomas, and one case of atypical meningioma. To gain a global estimate of the
similarities or differences among the cases, we performed an initial unsupervised analysis using
multidimensional scaling (MDS). Almost no separation among the four typical groups of
gliomas, as well as the mixed cases, was observed in the MDS representation (Figure 1),
implying the presence of many non-informative genes.
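A three-dimensional MDS embedding of this kind can be sketched with classical (Torgerson) scaling on Euclidean distances between profiles. The text does not state which MDS variant or dissimilarity was used, so both choices, and the function name `classical_mds`, are assumptions made for illustration.

```python
import numpy as np

def classical_mds(x, n_dims=3):
    """Embed the rows of x (samples by genes) into n_dims coordinates."""
    x = np.asarray(x, dtype=float)
    sq = np.sum(x ** 2, axis=1)
    # Squared Euclidean distances between all pairs of samples.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    n = len(x)
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering            # double-centred Gram matrix
    eigval, eigvec = np.linalg.eigh(b)
    idx = np.argsort(eigval)[::-1][:n_dims]          # largest eigenvalues first
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))
```

Applied to all retained genes, such an embedding is dominated by whatever accounts for most of the variance, which need not be the class structure; restricting the input columns to a selected gene subset changes the distances and hence the picture.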
The best-performing genes for molecular classification were identified by using 19
samples that exhibited characteristic morphologic features of the four basic subtypes of diffuse
glioma (5 anaplastic astrocytomas, 5 anaplastic oligodendrogliomas, 6 glioblastomas, and 3
oligodendrogliomas). We first selected candidate subsets of the most discriminative genes
according to the BSS/WSS ratio, the size of the gene set being p ∈ {20, 30, 40, 50, 60, 70}, and then
analyzed the data set using the k-NN algorithm and MDS representation for visualization. Using
cross-validation, we chose the parameters (p, k, and the number of quantization levels) by
performing a random 2:1 split of the set of 19 patients (12 in the training set and 7 in the test set).
The classification problem consists of discriminating AA, AO, OL, and GBM. We tested various
numbers of discriminative genes using both quantized as well as unquantized data. The smallest
cross-validation error was obtained for quaternary quantized data (Lloyd quantization) with the
correlation coefficient as the similarity measure and with 50 retained discriminative genes
(summarized in Table 2). Figure 2 shows an MDS representation of the 49 patients using the
identified 50 informative genes. This figure demonstrates the spatial separation among the four
glioma groups and visually presents the closeness of patients diagnosed as mixed gliomas and
atypical meningioma.
To test the robustness of the derived classifier gene set for molecular voting, we
performed the voting analysis using the same classifiers on the remaining independent set of 30
glioma samples (validation set; see Table 1), among which 4 belong to a group with mixed
(oligoastrocytic) features and 1 is a non-glioma tissue (atypical meningioma). In this algorithm,
classifier genes are used to vote for an assignment for each case into one of the four subtypes
based on its neighbors. If our objective is to produce one final nosologic assignment, then this
can be determined by the class receiving the highest number of votes. However, as we shall see,
the distribution of the votes among the four classes may itself be more informative, especially for
the case of mixed tumors.
For the classical cases in the independent validation set, only two were classified at
variance with the traditional histopathologic assignment (Table 1). One of the misassigned
GBM cases had three votes for AA and two votes for GBM. Another GBM case had four votes
for AO and one vote for AA. Because there is no absolute line of distinction between AA and GBM,
and high-grade AOs are often characterized as GBMs, this may not be a result of misassignment. Rather, the detailed voting information provides a more subtle snapshot of the
cancer progression process and of the characteristic features of a tumor. A careful scrutiny of the
two tables showed that some GBMs had 4-5 votes for GBM whereas others showed 3 votes for
GBM and 2 votes for AA. Thus, the detailed votes may provide important clinical information
about the differences between different cases in the continuum of the glioma spectrum.
Similarly, for the mixed gliomas, voting was conspicuously divided among the four glioma
subtypes. For the atypical meningioma, which is not part of glioma cases, the almost even voting
split among the four glioma subtypes clearly indicated that this case does not fit any of those four
subtypes well. Thus, this molecular voting algorithm with the k-NN method appears to provide
additional diagnosis information as compared to classical pathological diagnosis.
We attempted to determine whether the detailed voting has any bearing on clinical
outcome by correlating the number of votes for GBM with patient survival. Indeed, we observed
a correlation between the numbers of GBM votes and shorter survival. However, this
information cannot be overplayed at this time because our sample population is overrepresented
by GBMs as it is in the glioma population, and GBMs are known to have shorter survival.
Nevertheless, this study provides an insight for future studies with molecular diagnosis or
prognosis as the goal. We envision that a survival (or therapy response) gene set can be
determined from a training set of gliomas with long, median, and short survival times (or
response levels). Then this gene set may be used to predict whether a patient will survive longer
than median or respond at different levels.
Acknowledgements
The work was partially supported by an Advanced Technology grant from Texas Higher
Education Coordinating Board and the Bullock Fund for Brain Tumor Research, and a grant
from Academy of Finland. The Cancer Genomics Core Lab is supported by the Tobacco
Settlement Fund to M. D. Anderson Cancer Center (MDACC) as appropriated by the Texas
Legislature, a grant from Kadoorie Foundation to MDACC, a grant from the Goodwin Fund, and
the Cancer Center Supporting Grant from NIH/NCI.
References:
1. Alizadeh, A. A., Eisen, M. B., Davis, R.E., Ma C, Lossos, I. S., Rosenwald, A., Boldrick,
J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson,
J. Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C.,
Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R.,
Byrd, J. C., Botstein, D., Brown, P. O., Staudt, L. M. Distinct types of diffuse large B-cell
lymphoma identified by gene expression profiling. Nature, 403(6769):503-11, 2000.
2. Borg, I., and Groenen, P. Modern Multidimensional Scaling: Theory and Applications.
Springer, New York, 1997.
3. Caskey, L. S., Fuller, G. N., Bruner, J. M., Yung, W. K., Sawaya, R. E., Holland, E. C.,
and Zhang, W. Toward a molecular classification of the gliomas: histopathology,
molecular genetics, and gene expression profiling. Histol. Histopathol., 15: 971-981,
2000.
4. Fix, E., and Hodges, J. Discriminatory analysis, nonparametric discrimination:
consistency properties. Technical Report, Randolph Field, Texas: USAF School of
Aviation Medicine (1951)
5. Fuller, G. N., Rhee, C. H., Hess, K. R., Caskey, L. S., Wang, R., Bruner, J. M., Yung, W.
K., and Zhang, W. Reactivation of insulin-like growth factor binding protein 2
expression in glioblastoma multiforme: a revelation by parallel gene expression profiling.
Cancer Res 59: 4228-32, 1999.
6. Fuller, G. N., Hess, K. R., Rhee, C. H., Yung, W. K., Sawaya, R. A., Bruner, J. M., and
Zhang W. Molecular classification of human diffuse gliomas by multidimensional scaling
analysis of gene expression profiles parallels morphology-based classification, correlates
with survival, and reveals clinically-relevant novel glioma subsets. Brain Pathol.
12(1):108-16, 2002.
7. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander,
E. S. Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science. 286(5439):531-7, 1999.
8. Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P.,
Gusterson, B., Esteller, M., Kallioniemi, O. P., Wilfond, B., Borg, A., and Trent, J. Gene-expression profiles in hereditary breast cancer. N Engl J Med. 344(8):539-48, 2001.
9. Huber, P. J. Robust Statistics. John Wiley & Sons, p.107, 1981.
10. Kim, S., Dougherty, E. R., Shmulevich, I., Hess, K. R., Hamilton, S. R., Trent, J. M.,
Fuller, G. N., and Zhang, W. Identification of combination gene sets for glioma
classification. Mol Cancer Ther. (13):1229-36, 2002.
11. Kleihues, P., Cavenee, W. K. Pathology and genetics of tumours of the nervous system.
Lyon: IARC press; 2000.
12. Kobayashi, T., Yamaguchi, M., Kim, S., Morikawa, J., Ogawa, S., Ueno, S., Suh, E.,
Dougherty, E., Shmulevich, I., Shiku, H., and Zhang, W. Microarray reveals differences
in both tumors and vascular specific gene expression in de novo CD5+ and CD5- diffuse
large B-cell lymphomas. Cancer Res. 63(1):60-6, 2003.
13. Lloyd, S. P. Least Squares Quantization in PCM, IEEE Transactions on Information
Theory, vol. IT-28, March 1982, 129-137, 1982.
14. Mircean, C., Tabus, I., and Astola, J. Quantization and distance function selection for
discrimination of tumors using gene expression data. Proceedings of SPIE Photonics
West 2002, BiOS 2002 Symposium, San Jose, CA. 2002.
15. Mircean, C., Tabus, I., Astola, J., Kobayashi, T., Shiku, H., Yamaguchi, M., Shmulevich,
I., and Zhang, W. Quantization and similarity measure selection for discrimination of
lymphoma subtypes under k - nearest neighbor classification. SPIE Photonics West 2004,
BiOS 2004 Symposium, San Jose, CA. 2004.
16. Nutt, C. L., Mani, D. R., Betensky, R. A., Tamayo, P., Cairncross, J. G., Ladd C., Pohl,
U., Hartmann, C., McLaughlin, M. E., Batchelor, T. T., Black, P. M., Deimling, A.,
Pomeroy, S. L., Golub, T. R., and Louis, D. N. Gene Expression-based Classification of
Malignant Gliomas Correlates Better with Survival than Histological Classification.
Cancer Research 63, 1602-1607, 2003.
17. Shmulevich, I., Hunt, K., El-Naggar, A., Taylor, E., Ramdas, L., Laborde, P., Hess, K.
R., Pollock, R., and Zhang, W. Tumor specific gene expression profiles in human
leiomyosarcoma: an evaluation of intratumor heterogeneity. Cancer; 94: 2069-2075,
2002.
18. Shmulevich, I., and Zhang, W. Binary Analysis and Optimization-Based Normalization
of Gene Expression Data. Bioinformatics, 18(4): 555-565, 2002.
19. Stone, C. J. Consistent nonparametric regression (with discussion). Ann Statist. 5 pp.
595-645, 1977.
20. Taylor, E., Cogdell, D., Coombes, K., Hu, L., Ramdas, L., Tabor, A., Hamilton, S., and
Zhang, W. Sequence verification as quality control step for production of cDNA
microarray. BioTechniques 31: 62-65, 2001.
21. Zhou, X., Wang, X. and Dougherty, E. R. Binarization of microarray data on the basis of
a mixture model. Mol Cancer Ther. 2(7):679-84, 2003.
Table 1. Molecular voting procedure results and gene expression based diagnostic.
Nr Crt   Index / Dataset   Pathology
20       B 01              GBM
21       B 20              GBM
22       B 21              AA
23       B 22              GBM
24       B 23              GBM
25       B 24              AA
26       B 25              Anaplastic mixed AO
27       B 26              GBM
28       B 27              GBM
29       B 28              GBM
30       B 29              high grade Astrocytoma
31       B 30              GBM
32       B 31              low grade Astrocytoma
33       B 32              mixed oligo
34       B 33              GBM
35       B 34              AO w/necrosis
36       B 35              GBM
37       B 36              GBM
38       B 37              GBM
39       B 38              GBM
40       B 39              Atypical meningioma
41       B 40              Anaplastic mixed AO
42       B 41              GBM
43       B 42              GBM
44       B 43              GBM
45       B 44              GBM
46       B 45              GBM
47       B 46              GBM
48       B 47              GBM
49       B 48              GBM
0
2
3
2
2
3
0
0
0
0
0
0
0
0
0
0
0
1
4
2
1
2
2
0
Gene
expression
based
diagnostic
k=5
GBM
GBM
AA
GBM
AA
AA
GBM
GBM
GBM
AA/GBM*
GBM
0
2
1
1
0
0
0
0
0
3
2
3
GBM
GBM
GBM
0
2
1
1
0
0
0
0
0
4
3
4
3
Gene
expression
based
diagnostic
k=1
GBM
GBM
AA
AA
AA
AA
Gene
expression
based
diagnostic
k=4
GBM
GBM/AA*
AA
AA/GBM*
AA/GBM*
AA
GBM
GBM
AA
AA
Votes of algorithm
k=4
[AA AO OL GBM]
Votes of algorithm
k=5
[AA AO OL GBM]
0
2
4
2
3
3
0
0
0
0
0
0
0
0
0
0
0
1
5
3
1
3
2
1
AA
GBM
GBM
GBM
GBM/AA*
GBM
AA
GBM/AA*
GBM
AA
AA
AA
GBM
AO
GBM
GBM
AA
AA/GBM*
AO
GBM
AO
GBM
GBM
OL/GBM*/
/AA*/AO*
3
2
1
1
1
0
0
0
0
2
0
3
0
0
0
0
0
0
0
0
0
1
2
1
3
0
4
4
AA
GBM
AO
GBM
AO
GBM
GBM
3
2
1
1
1
0
0
0
0
3
0
4
0
0
0
0
0
0
0
0
0
2
3
1
4
0
5
5
GBM
AO
AO
AO
GBM
GBM
GBM
GBM
GBM
GBM
GBM
GBM
GBM
GBM
GBM/AA*
GBM
GBM
GBM
GBM/AA*
GBM
1
0
2
0
0
1
2
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
3
4
2
4
3
3
2
3
GBM
GBM
GBM
GBM
GBM
GBM
GBM
GBM
2
1
2
0
0
1
2
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
3
4
3
5
4
4
3
4
OL
Note: We split the data into training and test sets. Set A contains 19 profiles and set B contains 30 patient profiles. Sets A
and B were collected and processed during different time frames. As a training set, we use the profiles from A [1:19],
which are balanced across subtypes. The votes for the samples in B are based only on the known labels from the A dataset.
In the case of a tie (e.g., gene profile 40 with k = 4), we decrease the number of neighbors by one until a maximum vote exists; these
cases are marked (sequentially) with the '*' symbol. This corresponds to choosing the neighbor majority vote as the estimated label. The
diagnostic decisions for k = 3 are similar to those for k = 1.
Table 2. Feature genes that yielded the smallest cross-validation error. On the right side, for genes sorted in
decreasing order of Fisher discriminant values, we show the unquantized expression. The patients (columns
from left to right) follow the labels "AA", "AO", "OL", "GBM", and "mixed tumors".
Symbol     Accession   Gene name
MATN2      AA071473    Matrilin 2
IGFBP1     AA233079    Insulin-like growth factor binding protein 1
DDOST      NM_005216   dolichyl-diphosphooligosaccharide-protein glycosyltransferase
MEF2C      AA234897    MADS box transcription enhancer factor 2, polypeptide C
-          R31938      -
PAFAH1B3   AA464346    Platelet-activating factor acetylhydrolase, isoform Ib
CLCN6      H08188      Chloride channel 6
MGC5178    AA398458    Hypothetical protein MGC5178
HIVEP1     AA429769    Human immunodeficiency virus type I enhancer binding protein 1
EFNB2      AA461424    Ephrin-B2
VTN        N58107      Vitronectin (serum spreading factor, somatomedin B)
ATRX       AA410435    Alpha thalassemia/mental retardation syndrome X-linked (RAD54)
FMOD       AA486471    Fibromodulin
ICAM2      NM_000873   intercellular adhesion molecule 2
EGR1       NM_001964   early growth response 1
MPI        AA482198    Mannose phosphate isomerase
TRAP1      AA489602    Heat shock protein 75
-          AA258735    Hs. moderate similarity to protein pir:A45973
UNG2       HSUDGM      uracil-DNA glycosylase 2
AP3B2      H11692      Adaptor-related protein complex 3, beta 2 subunit
ROR2       W38923      Receptor tyrosine kinase-like orphan receptor 2
STHM       AA497051    Sialyltransferase
OXCT       NM_000436   3-oxoacid CoA transferase
-          H63175      -
NAPG       NM_003826   N-ethylmaleimide-sensitive factor attachment protein, gamma
SRPR       NM_003139   signal recognition particle receptor ('docking protein')
-          AC005510    -
EWSR1      R32756      Ewing sarcoma breakpoint region 1
TFF3       N74131      Trefoil factor 3 (intestinal)
-          AC007199    -
HD         T64094      Huntingtin (Huntington disease)
AP4M1      H12006      Adaptor-related protein complex 4, mu 1 subunit
DCLRE1A    AA010781    DNA cross-link repair 1A (PSO2 homolog, S. cerevisiae)
NFYB       HSNFYB      Nucl. trans. fact. Y, beta (alias HAP3, CBF-A, CBF-B, NF-YB)
MYO6       NM_004999   myosin VI (alias DFNA22, DFNB37, KIAA0389)
LDHC       AA453969    Lactate dehydrogenase C
PBEF       AA281932    Pre-B-cell colony-enhancing factor
MBD1       NM_015847   methyl-CpG binding domain protein 1
REG1A      AA625655    Reg. islet-deriv. 1 alpha (pancreatic stone protein)
PTP4A2     AA504327    Protein tyrosine phosphatase type IVA, member 2
EEA1       HSP162      Early endosome antigen 1, 162kD
-          HSJ735G18   -
-          AL096703    -
TRB@       T64192      T cell receptor beta locus
MYLK       AA487215    Myosin, light polypeptide kinase
ARL3       AA644191    ADP-ribosylation factor-like 3
-          T60191      -
-          AJ271079    -
NR1D1      AA453202    Nucl. rec. subfam 1, gr D, me 1 (V-erbA related protein EAR-1)
TNFAIP6    W93163      Tumor necrosis factor, alpha-induced protein 6
RNTRE      NM_014688   related to the N terminus of tre
[Expression heat map not reproduced; its columns are labeled AA, AO, OL, and GBM.]
Figures
[Figure 1 plot: three-dimensional scatter plot with axes dimension 1, dimension 2, and dimension 3.]
Figure 1. Multidimensional Scaling representation (MDS) of the 49 brain tumor gene-expression profiles
using the retained 1826 genes (genes retained after all preprocessing steps). Red circles, anaplastic astrocytoma
(AA); green diamond, anaplastic oligodendroglioma (AO); blue triangle, oligodendroglioma (OL); cyan star,
glioblastoma (GBM); and black cross, the others.
[Figure 2 plot: three-dimensional scatter plot with axes dimension 1, dimension 2, and dimension 3.]
Figure 2. Multidimensional Scaling representation (MDS) of the same glioma profiles as in Figure 1, using the
most discriminative 50 genes based on five-neighbor analysis. Red circles, anaplastic astrocytoma (AA); green
diamond, anaplastic oligodendroglioma (AO); blue triangle, oligodendroglioma (OL); cyan star, glioblastoma
(GBM); and black cross, the others.
Publication no. 5
Giurcaneanu CD, Mircean C, Fuller GN and Tabus I.
Chapter 2: Finding functional structures in glioma gene-expressions using Gene Shaving clustering and MDL principle
in Computational and Statistical Approaches to Genomics, second
edition Kluwer Academic Publisher (in press)
Chapter 7
FINDING FUNCTIONAL STRUCTURES IN
GLIOMA GENE-EXPRESSIONS USING GENE
SHAVING CLUSTERING AND MDL PRINCIPLE
Ciprian D. Giurcaneanu1 , Cristian Mircean1,2 , Gregory N. Fuller2 , and Ioan
Tabus1
1 Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
2 Cancer Genomics Core Laboratory, Department of Pathology, The University of Texas
1. Introduction
Recent technological advances in genomics have made it possible to measure simultaneously, under similar experimental conditions, the expression of thousands of genes, allowing unprecedentedly wide probing at the transcriptome level. The huge amount of data thus available calls for improved methods of data analysis that are well grounded in classical statistical methods and provide fast and reliable processing. Finding clusters can be a preliminary step for data analysis, and it is a valuable task in itself: the obtained clusters convey useful information about the similar expression patterns of a number of genes, thus allowing a dimensionality reduction of the data set. Furthermore, a cluster of similar genes induces hypotheses about genes that act in synergy and lie on the same pathway influencing the studied disease.
The traditional clustering methods are very appealing in gene data analysis, since they do not rely on any a priori knowledge or prior models for the data set. Clustering has traditionally been a method of choice for biological data; many of the existing clustering algorithms were developed for use in taxonomy, for the classification of plants and animals.
Therefore the first option was to employ classical clustering methods on
the increasingly large amount of gene expression data, e.g. partitioning
or hierarchical algorithms derived and well studied for some other type
of biological data. The importance of studying gene expression data for
COMPUTATIONAL GENOMICS
Gliomas are usually classified into two lineages (astrocytic and oligodendrocytic) and into low-grade (astrocytoma and oligodendroglioma), mid-grade (anaplastic astrocytoma and anaplastic oligodendroglioma), and high-grade (glioblastoma multiforme) tumors. The class label is based on morphological features assessed by pathologists, such as cellularity, mitotic index, and the presence of necrosis. This classification scheme has limitations because a specific glioma can be at any stage of the continuum of cancer progression and may contain mixed features.
In this work, we focus on the gene expression that makes the difference between morphological classifications of glioma subtypes. We aim to capture the feature genes that mark the transition from lower grades to higher grades (see Caskey et al., 2000 for a review), the increased aggressiveness of the glioblastoma multiforme (GM) subtype of glioma, and the lineage differentiation of gliomas.
The initial data contain 2303 different genes (see Taylor et al., 2001) that are observed over p = 49 different samples (experiments). The number of feature genes is reduced in several steps to N = 121, for which an optimal classifier was obtained for the four glioma subtypes using quantized values (Fuller et al., submitted). In the next section we clarify the procedures used. Grouping the genes of the obtained classifier into clusters that show similar expression patterns over patients is a way to gain insight into the molecular basis of gliomas and to reveal intrinsic interactions at the molecular level between components with a common function. If a cluster is enriched with a certain function, we expect with higher probability to find genes that act together in a possibly chained path. Further efforts can concentrate on these genes to find the molecular pathways and validate them with biological procedures.
2.
submitted) is saved (the figures show these votes in the second column).
We further use this set of N = 121 genes because we consider these genes very likely to play a special role in the discrimination of the four glioma subtypes, since they minimize the discrimination errors.
We continue the gene shaving experiments by using the N = 121 genes selected by the LOO-CV procedure from the set $M_{unq}$, observed in p = 49 different patients (samples). We denote by X the retained matrix of size N × p.
Our goal is to identify groups of genes that operate similarly in the considered experiments.
3.
are becoming more and more pure, having the genes better and better aligned to the principal component of the matrices associated with the candidate clusters.
Once proper candidates are generated by the consecutive shaving process, one needs to choose the optimal cluster from the nested sets $S_N, S_{n_1}, S_{n_2}, \ldots, S_1$. There are two ways to choose a suitable set: by using the Gap statistic (as proposed in the original GS) or by using an MDL selection. Both are reviewed in this survey.
After the most appropriate set has been selected as a cluster, a new cluster can be obtained by banning off the direction of the principal component of the previously extracted cluster from the following stages. This can be achieved either by orthogonalizing all the rows of X with respect to the direction of the principal component previously extracted (Hastie et al., 2000a), or by simply removing from the matrix X the genes found in all previous clusters (Giurcaneanu et al., 2004a). Depending on the banning-off procedure used, the iterative process of extracting clusters continues for p successive clusters when using orthogonalization, or until only p genes are left in the matrix X when the genes of the clusters found are iteratively removed.
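The two banning-off procedures described above can be sketched in a few lines of NumPy. This is only an illustration: the function names `orthogonalize_rows` and `remove_cluster_rows`, the toy matrix, and the direction `u` are assumptions made for the example and do not come from the chapter.

```python
import numpy as np

def orthogonalize_rows(X, direction):
    """Remove from every row of X its component along `direction`."""
    u = direction / np.linalg.norm(direction)   # unit vector of the banned direction
    return X - np.outer(X @ u, u)

def remove_cluster_rows(X, cluster_rows):
    """Drop the rows (genes) that were assigned to previous clusters."""
    keep = np.setdiff1d(np.arange(X.shape[0]), cluster_rows)
    return X[keep]

# Toy data (hypothetical): 6 genes observed in 4 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
u = np.array([1.0, 1.0, 0.0, 0.0])      # stand-in for an extracted principal direction

X_orth = orthogonalize_rows(X, u)       # same number of rows, direction u removed
X_small = remove_cluster_rows(X, [0, 3])  # fewer rows, columns unchanged
```

With orthogonalization the matrix keeps all N rows, so a gene can reappear in later clusters; with row removal the clusters are disjoint by construction.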
The Gap statistic selects the optimal cluster size by a simple process that assesses the structure (regularity) of a cluster against the statistics of randomly permuted data. The nested clusters have been selected such that the variance of the cluster mean is high, the cluster center being almost aligned to the first principal component, which by definition has the highest possible variance. Gap will now choose the cluster $S_n$ containing genes that are very similar to one another and simultaneously have a high degree of dissimilarity with the genes not included in $S_n$. Based on the analogy with the analysis of variance (Mardia et al., 1979), the criterion used in cluster selection relies on computing for $S_n$ the within variance
$$V_n^W = \frac{1}{p}\sum_{j=1}^{p} \frac{1}{n}\sum_{i \in S_n} \left( x_{ij} - \mathrm{ave}_j(S_n) \right)^2$$
and the between variance
$$V_n^B = \frac{1}{p}\sum_{j=1}^{p} \left( \mathrm{ave}_j(S_n) - \mathrm{ave}(S_n) \right)^2,$$
where $\mathrm{ave}_j(S_n) = \frac{1}{n}\sum_{i \in S_n} x_{ij}$ is calculated as the average of the measurements in cell line j that correspond to genes in $S_n$, and $\mathrm{ave}(S_n) = \frac{1}{np}\sum_{i \in S_n}\sum_{j=1}^{p} x_{ij}$. A cluster is well structured when the criterion $D_n$, computed from these variances, takes large values. To check that $D_n$ is larger than one could expect by chance, the value of the Gap statistic is evaluated. The procedure generates a number of matrices, each of them obtained by applying a different permutation to the columns of X. For example, assume that 20 such matrices are generated, and let us denote them $X_1, \ldots, X_{20}$. Nested clusters are identified for every matrix $X_i$, $1 \le i \le 20$, by using the same algorithm employed for the original matrix X, and let $D_n(X_i)$ be the criterion computed for the cluster size n of the matrix $X_i$. The Gap statistic is given by
$$\mathrm{Gap}(n) = D_n - \mathrm{ave}(D_n),$$
where $\mathrm{ave}(D_n) = \frac{1}{20}\sum_{i=1}^{20} D_n(X_i)$, and the optimal cluster size is
$$n^* = \arg\max_n \mathrm{Gap}(n).$$
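A minimal sketch of the quantities involved in the Gap selection is given below. The within and between variances follow the formulas above. The form of the criterion, $D_n = 100\,V_n^B/(V_n^B + V_n^W)$, is an assumption borrowed from the percent-variance measure of the original gene shaving proposal, since its definition is not reproduced in the text here. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def within_between_variance(X, rows):
    """Within and between variance of the cluster formed by the given rows of X."""
    C = X[rows]                      # n x p block holding the cluster genes
    col_ave = C.mean(axis=0)         # ave_j(S_n), one value per sample j
    v_within = ((C - col_ave) ** 2).mean(axis=0).mean()
    v_between = ((col_ave - C.mean()) ** 2).mean()
    return v_within, v_between

def percent_variance(X, rows):
    """ASSUMED criterion D_n = 100 * V_B / (V_B + V_W); not taken from the text."""
    vw, vb = within_between_variance(X, rows)
    return 100.0 * vb / (vb + vw)

def gap_curve(d_obs, d_perm):
    """Gap(n) = D_n minus the average of D_n over the permuted matrices."""
    d_obs = np.asarray(d_obs, dtype=float)
    d_perm = np.asarray(d_perm, dtype=float)   # shape: (permutations, cluster sizes)
    return d_obs - d_perm.mean(axis=0)

# Two identical genes form a perfectly coherent cluster.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0],
              [3.0, 1.0, 2.0]])
vw, vb = within_between_variance(X, [0, 1])
d = percent_variance(X, [0, 1])

# Hypothetical criterion values for three cluster sizes and two permuted matrices.
gap = gap_curve([50.0, 80.0, 60.0], [[40.0, 40.0, 40.0], [60.0, 50.0, 40.0]])
best = int(np.argmax(gap))          # position of the selected cluster size
```

The sketch leaves out the shaving itself: `gap_curve` expects the criterion values to have been computed already for the original and for the permuted matrices.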
4.
In this section we show how the MDL principle can lead to a method for choosing the optimal cluster from the set $S_N, S_{n_1}, S_{n_2}, \ldots, S_1$ without resorting to the evaluation of the Gap statistic. The idea is very simple: we just need a wand that points at $S_N$ and shows us how many clusters it contains. If $S_N$ contains one single cluster, then we decide that $S_N$ is the optimal choice; otherwise, we use the wand again until we find an index $n^*$ such that $S_{n^*+1}$ contains more than one cluster and $S_{n^*}$ contains exactly one cluster. We conclude that $S_{n^*}$ is the optimal choice.
We explain in the sequel how MDL can provide the desired wand. The roots of MDL are in information theory, and it allows selecting the best parametric model for a data set using as criterion the minimum description length or, equivalently, the minimum code length. We emphasize here that all the considered models belong to a finite family that will be defined later. The MDL principle does not search for the true model of the observed data, but selects the best model from a family that is defined a priori. The selection relies on a scenario of transmitting the whole data set from a hypothesized encoder to a decoder. The encoder is constrained to use only models from the given family. In the most celebrated form of the MDL, the code length is evaluated as the sum of two parts:
Choose a model from the family and tune its parameters so as to obtain the best fit to the given data set. In practice this step corresponds to finding the maximum likelihood estimators for the model parameters. The estimated values are then sent to the decoder by using $\frac{1}{2}\log N$ bits for each parameter, where N is the number of samples in the data set. The symbol $\log(\cdot)$ stands for the natural logarithm, thus the code length is expressed in nats. Equally well we can consider the logarithm base two in order to express the code length in bits. Therefore the first term of the code length is $\frac{1}{2} q \log N$, where q denotes the number of parameters of the model.
Once both the encoder and the decoder know the values of the parameters, it remains only to encode the samples in the data set according to the chosen model. This leads to a code length that equals minus the logarithm of the maximum likelihood.
Note that the code length is not constrained to be an integer. This is not a difficulty since we are not interested in realizable code lengths, but rather in using code lengths to compare various models. To compute the two-part code length criterion described above, we have to resort to probabilistic models, which appear frequently in clustering.
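As a small illustration of the two-part code length, the sketch below compares two Gaussian models for the same one-dimensional sample. The data, the two candidate models, and the function names are assumptions chosen for the example; only the formula (parameter cost $\frac{1}{2} q \log N$ plus minus the maximized log-likelihood, in nats) comes from the text.

```python
import numpy as np

def two_part_code_length(neg_log_likelihood, q, n_samples):
    """Parameter cost (q/2) log N plus the data cost, both in nats."""
    return 0.5 * q * np.log(n_samples) + neg_log_likelihood

def gaussian_neg_log_likelihood(x, mean, var):
    """Minus log-likelihood of i.i.d. Gaussian samples with the given mean and variance."""
    x = np.asarray(x, dtype=float)
    return 0.5 * x.size * np.log(2.0 * np.pi * var) + ((x - mean) ** 2).sum() / (2.0 * var)

# Hypothetical data set: 200 samples drawn from a standard Gaussian.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=200)

# Model A: zero-mean Gaussian, only the variance is estimated (q = 1).
var_a = np.mean(x ** 2)
nll_a = gaussian_neg_log_likelihood(x, 0.0, var_a)
len_a = two_part_code_length(nll_a, 1, x.size)

# Model B: mean and variance are both estimated (q = 2).
mean_b, var_b = x.mean(), x.var()
nll_b = gaussian_neg_log_likelihood(x, mean_b, var_b)
len_b = two_part_code_length(nll_b, 2, x.size)

# MDL keeps the model with the shorter total code length.
selected = "A" if len_a <= len_b else "B"
```

Model B always fits at least as well as model A because it contains it, so the extra parameter is justified only when the gain in likelihood exceeds $\frac{1}{2}\log N$ nats.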
4.1
The effort to develop mathematical models for clustering gene expression data obtained with various technologies has not been very extensive so far. Most of the recent papers (Dougherty et al., 2002; Giurcaneanu et al., 2004a; Hastie et al., 2000a; Yeung et al., 2001) that treat the issue of model-based clustering for gene expression arrays have shown that the finite mixture model can be successfully applied.
In the finite mixture model, a multivariate distribution is associated with each gene cluster. To fix the ideas, let us assume that the number of gene clusters does not exceed p, the total number of cell lines. Following the approach of Hastie et al. (2000a), we make the hypothesis that the number of signal clusters is K, and that there also exists a noise cluster. The last cluster groups all genes with a flat profile over all conditions, and consequently its mean is hypothesized to be a row vector of size 1 × p with all entries equal to zero. The mean vectors of the signal clusters are denoted by $b_1^T, b_2^T, \ldots, b_K^T$. In addition, we assume that the matrix whose rows are $b_1^T, b_2^T, \ldots, b_K^T$ is full-rank. Moreover, the covariance matrix is assumed to be the same for all clusters and equal to $\sigma^2 I$, where $\sigma^2$ is a parameter of the model and I denotes the p × p identity matrix.
Under the mixture sampling, the rows $x_1^T, x_2^T, \ldots, x_N^T$ of the matrix X are taken at random from the mixture density such that the number of observations from each cluster has a multinomial distribution with sample size N and probability parameters $p_1, p_2, \ldots, p_{K+1}$.
The results reported (Hastie et al., 2000a; Hastie et al., 2000b; Yeung et al., 2001) encourage us to use such simple parametric models. Finite mixture models have been extensively studied in statistics (Redner et al., 1984), and they are suitable for the application of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Various instances of the use of the EM algorithm for clustering based on finite mixture models, generically named Classification-Expectation-Maximization (CEM), are investigated in Celeux et al. (1995).
The EM algorithm is designed to be applied to incomplete data, which recommends it for use in clustering. The aim of gene clustering when the data set is recorded in matrix X is to assign a cluster label to each row $x_i^T$ of X. Following the approach of Celeux et al. (1995), we assign to $x_i^T$ a row vector $v_i^T$ whose length equals the number of clusters K + 1. If $x_i^T$ is assigned to cluster k, then the k-th entry of $v_i^T$ takes value one, and all other entries are zeros. One can easily observe that the complete data are given by the pairs $(x_i^T, v_i^T)$.
To illustrate how the EM algorithm can be employed in gene clustering, we consider the well-known case when the distribution for each cluster is Gaussian with covariance matrix $\sigma^2 I$, where I is the p × p identity matrix and the parameter $\sigma^2$ is unknown. Therefore the set of parameters is
$$\theta = \left\{ V, p_1, p_2, \ldots, p_{K+1}, b_1^T, b_2^T, \ldots, b_{K+1}^T \right\},$$
where $b_k^T$ are the mean vectors of the clusters, and $p_k$ are the mixing proportions, $0 < p_k < 1$, $\sum_{k=1}^{K+1} p_k = 1$. The CEM algorithm is an iterative procedure to search for the values of the parameters that maximize the log-likelihood function:
$$L\left(x_1^T, x_2^T, \ldots, x_N^T; \theta\right) = \sum_{k=1}^{K+1}\sum_{i=1}^{N} v_{ik} \log\left[ p_k\, g_k\left(x_i^T \mid b_k^T, \sigma^2 I\right) \right],$$
where the notation $g_k(\cdot \mid \cdot)$ is used for the Gaussian distribution. We briefly revisit in the sequel each step of the algorithm (Celeux et al., 1995):
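M step: Assume that every gene is assigned to a cluster, or equivalently that the entries of the matrix V are fixed. Denote by $N_k$ the total number of genes that have been assigned to cluster k. Elementary calculations prove that the log-likelihood function is maximized by selecting the values of the parameters such that
$$\hat{p}_k = \frac{N_k}{N}, \qquad \hat{b}_k^T = \frac{1}{N_k}\sum_{i=1}^{N} v_{ik}\, x_i^T, \qquad \hat{\sigma}^2 = \frac{\mathrm{tr}(W)}{Np},$$
where $W = \sum_{k=1}^{K+1}\sum_{i=1}^{N} v_{ik}\,(x_i - \hat{b}_k)(x_i - \hat{b}_k)^T$.
E step: For the current values of the parameters, compute for every gene and every cluster the posterior probability
$$t_k\left(x_i^T\right) = \frac{\hat{p}_k\, g_k\left(x_i \mid \hat{b}_k, \hat{\sigma}^2 I\right)}{\sum_{l=1}^{K+1} \hat{p}_l\, g_l\left(x_i \mid \hat{b}_l, \hat{\sigma}^2 I\right)}.$$
C step: Update the entries of V by assigning each gene to the cluster with the largest posterior probability,
$$v_{ik} = \begin{cases} 1, & k = k^* \\ 0, & \text{otherwise,} \end{cases}$$
where $k^* = \arg\max_k t_k\left(x_i^T\right)$.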
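The three steps can be put together in a short NumPy sketch. It is a simplified illustration under stated assumptions, not the implementation used by the authors: all cluster means are estimated (the zero mean of the noise cluster is not enforced), an empty cluster raises an error, and the function name `cem_spherical` and the toy data are hypothetical.

```python
import numpy as np

def cem_spherical(X, V, n_iter=100):
    """CEM sketch for a Gaussian mixture with common covariance sigma^2 * I.

    X : (N, p) data matrix, one gene per row.
    V : (N, K) initial 0/1 assignment matrix with exactly one 1 per row.
    """
    N, p = X.shape
    V = np.asarray(V, dtype=float)
    for _ in range(n_iter):
        # M step: proportions, means and common variance from the current labels.
        counts = V.sum(axis=0)
        if np.any(counts == 0):
            raise ValueError("a cluster became empty")
        props = counts / N
        means = (V.T @ X) / counts[:, None]
        sigma2 = ((X - V @ means) ** 2).sum() / (N * p)      # tr(W) / (N p)
        # E step: log of p_k * g_k(x_i), up to a constant shared by all clusters.
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        score = np.log(props)[None, :] - sq_dist / (2.0 * sigma2)
        # C step: hard assignment of each gene to its most probable cluster.
        V_new = np.zeros_like(V)
        V_new[np.arange(N), score.argmax(axis=1)] = 1.0
        if np.array_equal(V_new, V):     # labels unchanged: converged
            break
        V = V_new
    return V, props, means, sigma2

# Toy data (hypothetical): two well separated groups of 20 genes, 3 samples each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 3)), rng.normal(5.0, 0.1, (20, 3))])
V0 = np.zeros((40, 2))
V0[:15, 0] = 1.0          # deliberately imperfect starting labels
V0[15:, 1] = 1.0
V, props, means, sigma2 = cem_spherical(X, V0)
```

Because the normalizing constant of the Gaussian density is the same for all clusters under a common $\sigma^2 I$, the C step only needs the log mixing proportion and the squared distance to each mean.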
The algorithm is initialized with an arbitrary matrix V, and the iterations
stop when a convergence criterion is fulfilled. In practical applications
the CEM is started from many different random points. Each initialization can lead to a different convergence point, and the one that maximizes
4.2
Let us investigate how to apply the MDL principle for estimating the number of gene clusters under the hypothesis of the particular finite mixture discussed above, without the need to run the slowly convergent EM algorithm. For every possible number of clusters k, we have to compute the best code length, which implicitly means that we have to label each gene with a cluster name. Considering all possible partitions of N genes into k clusters and then evaluating the code length for every case is not a practical way to solve the problem. We emphasize here that we just need a fast method to estimate the number of clusters, and we do not need to decide which gene belongs to which cluster, since this is successfully solved by the GS algorithm.
The key observation that leads to the design of the GS-MDL method relates to the particular structure of the mixture covariance matrix $\Sigma$. It was shown in Giurcaneanu et al. (2004a) that the eigenvalues of $\Sigma$ can be sorted such that
$$\lambda_1(\Sigma) > \lambda_2(\Sigma) > \cdots > \lambda_K(\Sigma) > \lambda_{K+1}(\Sigma) = \lambda_{K+2}(\Sigma) = \cdots = \lambda_p(\Sigma) = \sigma^2.$$
$$\left( \frac{\prod_{i=K+1}^{p} \lambda_i(S)^{1/(p-K)}}{\frac{1}{p-K}\sum_{i=K+1}^{p} \lambda_i(S)} \right)^{\left(\frac{p-K}{2}\right) N} \tag{7.1}$$
all entries of $\Sigma$ one by one, we transmit the eigenvectors and the eigenvalues of $\Sigma$, relying on the following identity, which is implied by the spectral representation theorem:
$$\Sigma = \sigma^2 I + \sum_{k=1}^{K} \left( \lambda_k(\Sigma) - \sigma^2 \right) u_k u_k^T.$$
The mean vector $b^T$ has length p, therefore the number of parameters is p. Since p does not depend on K, we ignore the cost of transmitting the mean vector to the decoder;
The eigenvalues of $\Sigma$ are all real-valued. Recall that K of them are distinct, and the remaining $p - K$ are equal to $\sigma^2$, therefore the number of parameters is K + 1;
The eigenvectors of $\Sigma$ have all entries real numbers. Since the number of eigenvectors is K and each has length p, this gives Kp parameters. Each eigenvector is constrained to have norm 1, thus only $p - 1$ entries of each eigenvector are independent parameters. Similarly, the orthogonality constraint for the eigenvectors reduces the number of parameters by $\frac{K(K-1)}{2}$.
The total number of parameters is therefore $1 + \frac{K(2p - K + 1)}{2}$, which together with equation (7.1) leads to the following MDL criterion:
$$\hat{K} = \arg\min_k \left[ (p-k)\, N \log\frac{A_k}{G_k} + \frac{k(2p-k+1)}{2} \log N \right] \tag{7.2}$$
where $A_k = \frac{1}{p-k}\sum_{i=k+1}^{p}\lambda_i(S)$ and $G_k = \left(\prod_{i=k+1}^{p}\lambda_i(S)\right)^{1/(p-k)}$. Observe that $A_k$ is the arithmetic mean of the last $p - k$ eigenvalues of S, and $G_k$ is the geometric mean of the same eigenvalues. We know from an elementary inequality that $A_k \ge G_k$, with equality if and only if $\lambda_{k+1}(S) = \lambda_{k+2}(S) = \cdots = \lambda_p(S)$, which leads to the conclusion that the first term of the criterion is nonnegative. It is obvious that the second term is positive and penalizes the number of parameters.
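Criterion (7.2) is cheap to evaluate once the eigenvalues of S are available. The sketch below assumes the eigenvalues are strictly positive and scans k from 0 to p − 1 so that the tail of eigenvalues is never empty; this range, the function name `gs_mdl_num_clusters`, and the toy spectrum are assumptions made for the illustration.

```python
import numpy as np

def gs_mdl_num_clusters(eigenvalues, n_samples):
    """Evaluate criterion (7.2) for k = 0, ..., p - 1 and return the minimizer."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # decreasing order
    p = lam.size
    crit = np.empty(p)
    for k in range(p):
        tail = lam[k:]                          # lambda_{k+1}, ..., lambda_p
        a_k = tail.mean()                       # arithmetic mean A_k
        g_k = np.exp(np.log(tail).mean())       # geometric mean G_k
        fit = (p - k) * n_samples * np.log(a_k / g_k)
        penalty = 0.5 * k * (2 * p - k + 1) * np.log(n_samples)
        crit[k] = fit + penalty
    return int(np.argmin(crit)), crit

# Toy spectrum (hypothetical): two dominant eigenvalues, four equal noise eigenvalues.
k_hat, crit = gs_mdl_num_clusters([10.0, 5.0, 1.0, 1.0, 1.0, 1.0], n_samples=100)
```

In this toy spectrum the fit term vanishes as soon as the remaining eigenvalues are all equal, so the penalty decides among the larger values of k and the minimum is reached at the number of dominant eigenvalues.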
The expression of the criterion has an intuitive form and is easy to compute. The question remains whether such a criterion can lead to an accurate estimation of the number of clusters despite the various approximations we have made when deriving it. The answer is given by a result from Giurcaneanu et al. (2004a), where the consistency of the MDL criterion
5.
In the original version of Gap, the clusters found are non-exclusive in the sense that the same gene can potentially be assigned to several different clusters. The same non-exclusive property also occurs in the case of the SVDMAN algorithm. Our aim is to partition the gene set into non-overlapping clusters; consequently, once a cluster is identified by the GS-MDL algorithm, all the corresponding genes are eliminated and a new matrix X is generated. The rows of X are not orthogonalized with respect to the average gene in the last found cluster. Therefore, at each step the number of rows in matrix X decreases, while the number of columns remains constant. We decide to stop the procedure when the number of genes in X becomes smaller than the number of columns of X.
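The extraction of non-overlapping clusters amounts to a simple loop around the cluster finder. In the sketch below, `find_cluster` is a hypothetical callback standing in for one run of GS-MDL on the current matrix; the dummy callback used in the example simply returns the first ten rows and has no biological meaning.

```python
import numpy as np

def extract_exclusive_clusters(X, find_cluster):
    """Peel non-overlapping clusters off X until fewer rows than columns remain.

    `find_cluster` (hypothetical) receives the current matrix and returns the
    row positions, within that matrix, of the genes forming one cluster.
    """
    remaining = np.arange(X.shape[0])       # original indices of genes still in X
    clusters = []
    while remaining.size >= X.shape[1]:
        local = np.asarray(find_cluster(X[remaining]))
        if local.size == 0:                 # nothing found: avoid looping forever
            break
        clusters.append(remaining[local])   # store original gene indices
        remaining = np.delete(remaining, local)
    return clusters, remaining

# Matrix with the dimensions used in the chapter: 121 genes, 49 samples.
X = np.zeros((121, 49))
clusters, rest = extract_exclusive_clusters(X, lambda M: np.arange(10))
```

Since the genes of each cluster are deleted rather than orthogonalized, every gene ends up in exactly one cluster or in the remaining set.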
After running the GS-MDL algorithm, 72 genes are grouped in 16 different clusters, and the remaining 49 genes form the last big cluster. In our analysis we focus on the 16 clusters, which are labeled with numbers from 1 to 16 according to the order in which they have been found by the algorithm. We also run the Gap-GS algorithm, which groups the 121 genes in 11 clusters.
GS groups together highly correlated genes, whether positively or negatively correlated. Clusters are approximately aligned to the principal components of the data, e.g. the first cluster is aligned to the first principal component.
We are interested in the following three molecular-biological discrimination
problems:
The transition from low grades to the most aggressive grade (i.e., Oligodendroglioma Low Grade, Anaplastic Oligodendroglioma and
Anaplastic Astrocytoma vs. Glioblastoma Multiforme)
The key differences between the two different
lineages of glioma (Anaplastic Astrocytoma vs. Anaplastic Oligodendroglioma and Oligodendroglioma Low Grade)
The transition from the lowest glioma grade to high grades (Oligodendroglioma Low Grade vs. all others)
1 Average Gene is defined as the mean of the genes from one cluster. Negative correlated genes are represented
[Figure 7.1 plots: two bar charts titled "POWER OF DISCRIMINATION"; vertical axis: discrimination power (0 to 2); horizontal axis: clusters; series: OL vs. all, GM vs. all, AA vs. AO+OL.]
Figure 7.1. The power of discrimination between classes (AA, AO, OL and GM) for the three important discrimination problems. On the left side are
the clusters obtained from Gene Shaving with Minimum Description Length (GS-MDL), and on the right side the clusters from Gene Shaving with Gap
(GS-Gap). In each figure, from left to right for all clusters, we measured the power of discrimination as the ratio of Between Sum of Squares
to Within Sum of Squares (BSS/WSS) in the case of: 1) transition from lowest glioma grades to high grades (Oligodendroglioma Low Grade vs. all others);
2) transition from low grades to most aggressive grade (Oligodendroglioma Low Grade, Anaplastic Oligodendroglioma and Anaplastic Astrocytoma vs.
Glioblastoma Multiforme); 3) discriminating between the two different lineages of glioma (Anaplastic Astrocytoma vs. Anaplastic Oligodendroglioma
and Oligodendroglioma Low Grade).
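The BSS/WSS ratio mentioned in the caption can be computed as in the sketch below. The chapter does not spell out the formula, so the usual definition from the microarray literature is assumed here: the between sum of squares adds, for every sample, the squared distance of its class mean from the overall mean, and the within sum of squares adds the squared distance of every sample from its class mean. The function name and the toy values of the average gene are hypothetical.

```python
import numpy as np

def bss_wss(values, labels):
    """Ratio of between-class to within-class sum of squares for one profile."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    overall = values.mean()
    bss = 0.0
    wss = 0.0
    for cls in np.unique(labels):
        group = values[labels == cls]
        bss += group.size * (group.mean() - overall) ** 2
        wss += ((group - group.mean()) ** 2).sum()
    return bss / wss

# Toy average gene (hypothetical) for the problem "OL vs. all others".
average_gene = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["OL", "OL", "OL", "other", "other", "other"]
power = bss_wss(average_gene, labels)
```

A large ratio means that the class means of the average gene are far apart compared with the spread inside each class.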
[Figure 7.2 table: genes SRPR (signal recognition particle receptor), MYO6 (myosin VI), and CLCN6 (chloride channel 6), followed by the average gene; columns NL, Votes, +/−, Symbol, Name, Function, and the per-subtype mean and variance for AA, AO, OL, and GM.]
Figure 7.2. The transition from low-medium grades to the highest grade (all vs. GM) is influenced by genes from the Cell growth function family.
Gene Shaving with Minimum Description Length (GS-MDL) and Gene Shaving with Gap (GS-Gap) collect genes from the same functional groups in the
clusters (Cluster 3 and Cluster 4) that have the highest discriminatory power in this problem. SRPR (signal recognition particle receptor, a docking protein)
is present in both selections. Other relevant genes are FADD (Fas-TNFRSF6 associated via death domain), POU6F1 (POU domain, class 6, transcription
factor 1), and IGFBP1 (insulin-like growth factor binding protein 1). For the four-type classification problem, Cluster 3 (GS-MDL) ranks in first position
and Cluster 4 (GS-Gap) in second position. We consider the genes from this cluster to be related to the evolution of glioma in terms of aggressiveness. (See
the caption of Fig. 7.3 for a description of the color map and table content.)
[Figure residue: table of a cluster listing the genes FADD, POU6F1, H2BFQ, EWSR1, SRPR, DCLRE1A, MATN2, and IGFBP1 (columns NL, Votes, +/−, Symbol, Name, Function, and the per-subtype mean and variance for AA, AO, OL, and GM), interleaved with patient labels (such as A01-AA, A07-AO, A18-OL, B20-GM, B25-NaN) listed in original order and in the order of the sorted average gene.]
[Figure 7.3 table: genes TRAP1, STHM, NFYB, REG1A, MBD1, RNTRE, INPP5D, TBP, KIAA0268, TSN, HOXD1, HMGA1, and ATRX, followed by the average gene; columns NL, Votes, +/−, Symbol, Name, Function, and the per-subtype mean and variance for AA, AO, OL, and GM.]
Figure 7.3. Transcription and DNA binding functions are characteristic for the lineage differentiation separation. The left figure presents genes:
columns are gene expression profiles; the 49 patients are grouped by glioma labels: AA, AO, OL, and GM. For positively correlated genes the color green is
used for low expression values and the color red for high expression values. The gene shaving algorithm groups together positively and negatively correlated genes.
If a gene is negatively correlated, we represent the expression as a negative film. In the column +/−, we labeled positively correlated genes with + and
negatively correlated genes with −. On the right side of the table we show for each class the mean (upper row, left aligned) and the variance of the expression
values (lower row, right aligned).
Figure 7.4. Transcription and DNA binding functions are characteristic for the lineage differentiation separation. The lower part of the image
shows details of the patient labels and the average gene of Cluster 6 for GS-MDL, representing the averages (with sign) of the genes along the patients.
Cluster 6 (GS-MDL) overlaps Cluster 8 (GS-Gap). We then reorder the patients such that the values of the average gene are increasing, we list the
re-ordered patient labels horizontally, and below this list we draw a row representing by colors the values of the sorted average gene. The bottom row shows
the average gene as a plot.
[Figure 7.5 table: genes OPRK1, NAPG, MEF2C, DCLRE1A, EFNB2, and SULT1A3, followed by the average gene; columns NL, Votes, +/−, Symbol, Name, Function, and the per-subtype mean and variance for AA, AO, OL, and GM.]
Figure 7.5. Cluster 1 groups the genes aligned to the first principal component. In both GS-MDL and GS-Gap, the first clusters contain genes that
have the highest vote rate (43/43) from the LOO-CV algorithm, as shown in Fuller et al., submitted. Five genes are represented in red; they belong to the first
cluster for both GS-MDL and GS-Gap. The first clusters are composed of genes with a mixture of metabolic functions: DNA-binding, transduction, and
cell-cycle. These genes discriminate well in the problem of transition from the lowest grade to medium-high glioma grades (OL vs. all). We can observe a
low expression in the AO and OL subtypes versus over-expression in the AA glioma subtype.
[Figure 7.6 table: genes MEF2C, MPI, NAPG, SGCD, TERF1, TSPY, UROS, OPRK1, and SULT1A3, followed by the average gene; columns NL, Votes, +/−, Symbol, Name, Function, and the per-subtype mean and variance for AA, AO, OL, and GM.]
Figure 7.6. Cluster 1 groups the genes aligned to the first principal component.
[Figure 7.7 table: genes TRAP1 (heat shock protein 75), SIAT7B (sialyltransferase 7), and USP6NL, followed by the average gene; columns NL, Votes, +/−, Symbol, Name, Function, and the per-subtype mean and variance for AA, AO, OL, and GM.]
Figure 7.7. The winners in the discriminatory problem of transition from the lowest grade (OL) to the other grades are different in GS-MDL and
GS-Gap. Cluster 10 of GS-MDL creates a better structure than GS-Gap. The genes agree on metabolic function. TRAP1 (heat shock protein 75) is
intensively studied.
[Figure 7.8 table: genes MYLK and EDNRB, followed by the average gene; columns NL, Votes, +/−, Symbol, Name, Function, and the per-subtype mean and variance for AA, AO, OL, and GM.]
Figure 7.8. The winners in the discriminatory problem of transition from lowest grade (OL) to the other grades, are different in GS-MDL and
GS-Gap. Cluster 9 of GS-Gap is composed of MYLK (myosin, light polypeptide kinase); EDNRB (endothelin receptor type B) and clone B368A9 map
4q25.
average
sorted
average
NL
Glioma Subtypes
AA AO OL
GM
114
COMPUTATIONAL GENOMICS
References
Akaike, H. (1974) A new look at the statistical model identification. IEEE
Trans. Autom. Control, AC-19:716-723, Dec.
Alizadeh, A. A., Eisen, M. B., Davis, R.E., Ma C, Lossos, I. S.,
Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I.,
Yang, L., Marti, G. E., Moore, T., Hudson, J. Jr., Lu, L., Lewis, D. B.,
Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger,
D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M.
R., Byrd, J. C., Botstein, D., Brown, P. O., Staudt, L. M. (2000) Distinct
types of diffuse large B-cell lymphoma identified by gene expression
profiling. Nature, 403(6769):503-11
Andersen B, Schonemann D, Pearse II V, Jenne K, Sugarman J, Rosenfeld
G. (1993) Brn-5 is a divergent POU domain factor highly expressed in
layer IV of the neocortex. J Biol Chem 268: 23390-23398
Anderson, T.W. (1963) Asymptotic theory for principal component
analysis. Ann. Math. Stat., 34:122-148, Mar.
Aschenbrenner L, Naccache SN, Hasson T. (2004) Uncoated endocytic
vesicles require the unconventional myosin, Myo6, for rapid transport
through actin barriers. Mol Biol Cell. May;15(5):2253-63. Epub 2004
Mar 05.
Barron A, Rissanen J, Yu B. (1998) The minimum description length principle in coding and modeling, IEEE Trans. Info. Theory, IT-44:
2743-2760.
Borg, I., and Groenen, P. (1997) Modern Multidimensional Scaling:
Theory and Applications. Springer, New York.
Caskey, L. S., Fuller, G. N., Bruner, J. M., Yung, W. K., Sawaya, R. E.,
Holland, E. C., and Zhang, W. (2000) Toward a molecular classification
of the gliomas: histopathology, molecular genetics, and gene expression
profiling. Histol. Histopathol., 15: 971-981.
Cattell, R.B. (1966) The scree test for the number of factors. Multivariate
Behavioral Research, 1:245-276.
Celeux, G. and Govaert, G. (1995) Gaussian parsimonious clustering
models. Pattern Recognit., 28:781-793.
Chakarov S, Chakalova L, Tencheva Z, Ganev V, Angelova A. (2000)
Morphine treatment affects the regulation of high mobility group I-type
chromosomal phosphoproteins in C6 glioma cells. Life Sci. 24;66(18):
1725-31
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum likelihood
from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat.
Methodol., 39:1-38.
Dougherty, E.R., Barrera, J., Brun, M., Kim, S., Cesar, R.M., Chen, Y.,
Bittner, M. and Trent, J.M. (2002) Inference from clustering with
Publication no. 6
Mircean C, Tabus I, Kobayashi T, Yamaguchi M, Shiku H,
Shmulevich I, Zhang W.
Pathway analysis of informative genes from microarray data
reveals that metabolism and signal transduction genes distinguish
different subtypes of lymphomas.
Int J Oncol. 2004 Mar;24(3):497-504.
Correspondence to: Dr Wei Zhang, Cancer Genomics Core
Laboratory, Department of Pathology, The University of Texas
M.D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston,
TX 77030, USA
E-mail: [email protected]
Key words: gene shaving, cDNA microarray, diffuse large B-cell
lymphoma, mantle cell lymphoma, metabolism and transduction
MDS analysis using each of the gene groups. Four of the gene function groups (metabolism, signal transduction pathway,
transcriptional factors, cell adhesion and migration) separated
the three lymphoma subtypes well, whereas apoptosis genes
and cell cycle genes did not result in good separation.
Introduction
Diffuse large B-cell lymphomas (DLBCLs) are characterized
by proliferation of large transformed lymphoid cells. They
are heterogeneous in immunophenotypes, clinical features, and
treatment responses (1). Alizadeh et al identified two different
subtypes of DLBCLs, germinal center B-like DLBCL and
activated B-like DLBCL, based on the characteristic gene
expression patterns (2). In the activated B-like DLBCLs, a
poor prognostic subtype, the BCL-2 gene, which has antiapoptotic function, was overexpressed (2). Mantle cell
lymphoma (MCL) is characterized by the proliferation of
monomorphous small to medium-sized lymphoid cells with
irregular nuclei, the recurrent cytogenetic abnormality of
t(11;14)(q13;q32), and poor prognosis (1). Lymphoma cells of
MCL express cyclin D1, BCL-2 and CD5 (1). Immunoglobulin
heavy and light chain genes are rearranged in both DLBCLs
and MCLs. However, variable region genes are not mutated
in most MCLs, consistent with a pre-germinal center B-cell
origin, whereas the variable region genes are mutated in
DLBCLs (1). It has been reported that apoptosis-inducing
genes were down-regulated in MCLs when compared with
non-malignant hyperplastic lymph nodes (3), which may
account for their poor response to chemotherapy and poor
prognosis.
Immunophenotypically, naive B-cells express CD5, and
chronic lymphocytic leukemia (CLL) and MCL are considered
to correspond to CD5+ B-cells (1). Recently, however, 10%
of DLBCLs without a prior CLL phase have been found to express CD5, and this
de novo CD5+ DLBCL has been reported to be clinicopathologically different from CD5- DLBCL and MCL (4). Immunoglobulin variable region genes are mutated in de novo CD5+
DLBCLs (5-7). To further evaluate de novo CD5+ DLBCL at
the molecular level, we performed gene expression profiling
using cDNA microarray technology (8). A series of genes
Figure 1. Three clusters of genes identified by gene shaving. For each cluster, the average gene is displayed twice: in the original order of patients and in sorted
form, the latter showing that similar lymphoma subtypes correspond to close values of the average gene. The first cluster of genes (upper left) is dominantly
involved in signal transduction. Most CD5- DLBCL cases had high expression of the average gene, whereas most MCL cases had lower expression values.
The second cluster (lower left) has the highest discrimination power. It can be considered enriched for metabolism and signal transduction functions. Cluster 6
(right) is a mixture of genes with all functions, among which the signal transduction function is enriched. The gene expression pattern of clusters 1 and 6 (both
enriched in signal transduction) indicates that CD5- DLBCLs are still heterogeneous although these nine cases consist of nodal CD5- DLBCL, suggesting the
presence of subgroups. The gene names corresponding to each row are listed on the vertical axes (those with only an accession no. are available in the supplementary
tables and are presented here with empty spaces). The class labels of each patient are listed on the horizontal axis, below the first compact image. The last row
in the compact image, named average, shows the averages of the genes across the patients. We re-ordered the patients such that the values of the average
gene are increasing, then listed the re-ordered patient labels horizontally and drew a row below with colors representing the values of the sorted average
gene. That row has strong green at the left end, then shades off and changes to red at the right end; being sorted, it makes it easier to observe whether
a specific lymphoma class group has higher or lower expression values.
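The re-ordering described in the caption can be sketched in a few lines. This is a minimal illustration, assuming the average gene is the plain column-wise mean of the cluster's rows; the function name `sorted_average_gene` and the expression values are hypothetical and not taken from the paper.

```python
import numpy as np

def sorted_average_gene(expr, labels):
    """expr: genes x patients matrix for one cluster; labels: one class label per patient."""
    expr = np.asarray(expr, dtype=float)
    avg = expr.mean(axis=0)                 # average gene: mean over the cluster's genes
    order = np.argsort(avg, kind="stable")  # patients in increasing order of the average gene
    return avg, order, [labels[i] for i in order]

# Made-up expression values for a two-gene cluster and three patients.
expr = [[1.0, 3.0, 2.0],
        [2.0, 5.0, 2.0]]
labels = ["MCL", "CD5- DLBCL", "CD5+ DLBCL"]
avg, order, sorted_labels = sorted_average_gene(expr, labels)
print(sorted_labels)
```

Reading the sorted labels from left to right shows whether patients of one class sit together at the low or high end of the average gene.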
for the 5-9 patients (all CD5- DLBCL) are especially highly
expressed. Interestingly, the cluster gathered three
apoptosis genes (TNFAIP3, IL1B, HDAC3), two tumor
suppressors (TPD52 and DLG1, considered not to belong to
any of the six function classes), and five genes from cell cycle
and proliferation, classes that are seldom represented in
the whole set.
Cluster 8 has significance C(S8)=1.2829 and groups two
genes: ASAH1, which hydrolyzes the sphingolipid ceramide
into sphingosine and free fatty acid (belonging to metabolism),
and a gene coding for the KIAA0379 protein, whose function
is unknown. Small groupings like this cluster should be
statistically tested in larger experiments. The genes in cluster 8
are highly expressed for the patients in the MCL class.
We summarized all 22 clusters and on the right side, we
illustrated a zoomed view of the three clusters (1, 2, and 6)
(Fig. 2). We next characterized the genes in the three best
clusters and identified the dominant functional features. In
cluster 1, genes involved in signal transduction are enriched.
In cluster 2, genes involved in metabolism and signal transduction are enriched. In cluster 6, genes in signal transduction,
apoptosis, and cell cycle are enriched.
The three lymphoma classes can be grouped in four
combination scenarios (CD5- DLBCLs vs. others; de novo
CD5+ DLBCLs vs. others; MCLs vs. DLBCLs, and each
class separately). How each cluster discriminates the three
lymphoma subtypes on the two class problems and three
class problems is shown in Fig. 3.
We observe the following: cluster 2, which is enriched in
genes belonging to metabolism and signal transduction
function, is extremely good in discriminating between de novo
CD5+ DLBCLs and others; cluster 1, which is enriched in
genes belonging to signal transduction function, performs a
good discrimination of MCLs vs. DLBCLs; cluster 6, enriched
in genes belonging to signal transduction, apoptosis, and
cell cycle, discriminates CD5- DLBCLs from the others
well. Drawing a 3D representation in Fig. 4
with the average gene of clusters 1, 2 and 6 on the axes, we
observe that these genes are able to discriminate the disease
types quite well.
Multidimensional scaling (MDS) representation of cellular
functions of genes. The above analysis and genes in gene
clusters identified by gene shaving showed that genes of
different functional groups have different discriminative power
in separating the three types of lymphomas. To test this in a
more global manner, we used the whole set of 280 genes,
each of which individually exhibits high discrimination of the
three lymphoma classes. First, we picked those genes that are
involved in six cell functions: metabolism, signal transduction, transcription and DNA binding, cell cycle and
cell proliferation, adhesion and cell migration, apoptosis.
We examined, for each group of genes, how well MDS could
separate the lymphoma classes (Fig. 5). The grouped genes
with metabolism function are the best at separating the
three lymphoma subtypes. Genes from signal transduction
are the second best, and genes from the other functional groups
separate less well. Thus, the MDS-based analysis
corroborates the results we obtained from the gene shaving
method.
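The per-function MDS comparison can be sketched as follows. This is a minimal illustration under stated assumptions: it uses classical (Torgerson) MDS on Euclidean distances between patients, which may differ from the MDS variant used in the paper, and the expression matrix and the `groups` mapping from function name to gene rows are synthetic placeholders.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical MDS. X: samples x features. Returns samples x n_components coordinates."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared Euclidean distances
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    B = -0.5 * J @ D2 @ J                              # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:n_components]       # largest eigenvalues first
    lam = np.clip(evals[idx], 0.0, None)               # guard against tiny negative values
    return evecs[:, idx] * np.sqrt(lam)

# Synthetic data: 280 genes x 30 patients, with hypothetical functional groups.
rng = np.random.default_rng(0)
expr = rng.normal(size=(280, 30))
groups = {"metabolism": np.arange(0, 60), "signal transduction": np.arange(60, 110)}

# One 2-D embedding of the patients per functional group of genes.
embeddings = {name: classical_mds(expr[rows].T) for name, rows in groups.items()}
```

Each embedding can then be plotted with one marker per lymphoma class to judge visually how well that functional group separates the classes.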
Figure 2. Clustering structure resulting from the gene shaving algorithm. We sorted the clusters in descending order of their discriminatory power, with
clusters 2, 8, 1, 19 and 6 discriminating the three lymphoma classes best. These clusters are enriched in metabolic (cluster 2) and signal transduction
(clusters 2, 1 and 6) functions. In the right-side image, the names of the genes corresponding to the rows of clusters 2, 1, and 6 are listed on the vertical axes.
The columns represent the gene expression profiles of the patients, their labels coinciding with the unsorted profiles in Fig. 1.
Discussion
Lymphoma is a heterogeneous disease (1). Depending on
the similarities to the normal cells during differentiation,
lymphomas are classified into different subtypes (1). Genetic
studies at chromosomal levels revealed some recurrent chromosomal translocations further defining subtypes (1). Expression
of surface markers such as CD5 and CD10 in B-cell neoplasm
has been useful in identification of additional subclasses
Figure 5. Multidimensional scaling (MDS) representations for the separation of genes by function. From the 280 retained genes having a high discrimination value for the
three-class lymphoma problem, we separated six groups of genes, one for each of the six functional classes: metabolism, signal transduction, transcription
and DNA binding, cell cycle and cell proliferation, adhesion and cell migration, and apoptosis. The best separation of the three lymphoma classes
is obtained with the genes having metabolic functions; the next best is obtained with signal transduction genes. For the remaining four functional groups, the MDS
representations show a poorer separation of the lymphoma cases. Patients are labeled as follows: CD5- DLBCL with red circles, de novo CD5+ DLBCL with green
diamonds, and MCL with blue triangles.
2. Alizadeh AA, Eisen MB, Davis RE, et al: Distinct types of diffuse
large B-cell lymphoma identified by gene expression profiling.
Nature 403: 503-511, 2000.
3. Hofmann W-K, De Vos S, Tsukasaki K, Wachsman W,
Pinkus GS, Said JW and Koeffler HP: Altered apoptosis pathways in mantle cell lymphoma detected by oligonucleotide microarray. Blood 98: 787-794, 2001.
4. Yamaguchi M, Seto M, Okamoto M, et al: De novo CD5+ diffuse
large B-cell lymphoma: a clinicopathologic study of 109 patients.
Blood 99: 815-821, 2002.
5. Kume M, Suzuki R, Yatabe Y, et al: Somatic hypermutations in
the VH segment of immunoglobulin genes of CD5-positive diffuse
large B-cell lymphoma. Jpn J Cancer Res 88: 1087-1093, 1997.
6. Taniguchi M, Oka K, Hiasa A, Yamaguchi M, Ohno T, Kita K
and Shiku H: De novo CD5+ diffuse large B-cell lymphomas
express VH genes with somatic mutation. Blood 91: 1145-1151,
1998.
7. Nakamura N, Hashimoto Y, Kuze T, Tasaki K, Sasaki Y, Sato M
and Abe M: Analysis of the immunoglobulin heavy chain gene
variable region of CD5-positive diffuse large B-cell lymphoma.
Lab Invest 79: 925-933, 1999.
8. Kobayashi T, Yamaguchi M, Kim S, et al: Microarray reveals
differences in both tumors and vascular specific gene expression
in de novo CD5+ and CD5- diffuse large B-cell lymphomas.
Cancer Res 63: 60-66, 2003.
9. Taylor E, Cogdell D, Coombes K, et al: Sequence verification
as quality control step for production of cDNA microarray.
Biotechniques 31: 62-65, 2001.
10. Shmulevich I, Hunt K, El-Naggar A, et al: Tumor specific gene
expression profiles in human leiomyosarcoma: an evaluation of
intratumor heterogeneity. Cancer 94: 2069-2075, 2002.
11. Hu L, Wang J, Baggerly K, et al: Obtaining reliable information
from minute amounts of RNA using cDNA microarrays. BMC
Genomics 3: 16, 2002.
12. Hastie T, Tibshirani R, Eisen M, et al: Gene shaving: a new
class of clustering methods for expression arrays. Technical
report. Stanford University, Stanford, 2000.
13. Diehn M, Sherlock G, Binkley G, et al: SOURCE: a unified
genomic resource of functional annotations, ontologies, and
gene expression data. Nucleic Acids Res 31: 219-223, 2003.
14. Rosenwald A, Wright G, Chan WC, et al: The use of molecular
profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 346: 1937-1947, 2002.
15. Shipp MA, Ross KN, Tamayo P, et al: Diffuse large B-cell
lymphoma outcome prediction by gene-expression profiling and
supervised machine learning. Nat Med 8: 68-74, 2002.
Publication no. 7
Fuller GN, Hess KR, Mircean C, Tabus I, Shmulevich I, Rhee CH,
Aldape KD, Bruner JM, Sawaya RA, Zhang W.
Chapter 14: Human Glioma Diagnosis From Gene Expression Data
in Computational and Statistical Approaches to Genomics
Kluwer Academic Publisher 2002 ISBN: 1-4020-7023-3
Publication no. 8
Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I,
Hamilton SR, Zhang W.
Robust estimation of protein expression ratios with lysate
microarray technology.
Bioinformatics. 2005 May 1;21(9):1935-42. Epub 2005 Jan 12.
BIOINFORMATICS
ORIGINAL PAPER
Gene expression
of Pathology, University of Texas M.D. Anderson Cancer Center, Houston, TX, USA and
of Signal Processing, Tampere University of Technology, Tampere, Finland
Received on June 24, 2004; revised on December 21, 2004; accepted on December 29, 2004
Advance Access publication January 10, 2005
ABSTRACT
Motivation: The protein lysate microarray is a developing proteomic
technology for measuring protein expression levels in a large number of biological samples simultaneously. A challenge for accurate
quantification is the relatively narrow dynamic range associated with
the commonly used chromogenic signal detection system. To facilitate accurate measurement of the relative expression levels, each
sample is serially diluted and each diluted version is spotted on
a nitrocellulose-coated slide in triplicate. Thus, each sample yields
multiple measurements in different dynamic ranges of the detection
system. This study aims to develop suitable algorithms that yield
accurate representations of the relative expression levels in different
samples from multiple data points.
Results: We evaluated two algorithms for estimating relative protein
expression in different samples on the lysate microarray by means of a
cross-validation procedure. For this purpose as well as for quality control we designed a 1440-spot lysate microarray containing 80 identical
samples of purified bovine serum albumin, printed in triplicate with six
2-fold dilutions. Our analysis showed that the algorithm based on a
robust least squares estimator provided the most accurate quantification of the protein lysate microarray data. We also demonstrated our
methods by estimating relative expression levels of p53 and p21 in
either p53+/+ or p53-/- HCT116 colon cancer cells after two drug
treatments and their combinations on another lysate microarray.
Availability: http://www.cs.tut.fi/mirceanc/lysate_array_bioinformatics.htm
Contact: [email protected]
INTRODUCTION
Despite the enormous genomic complexity of most organisms, and in
particular humans, the complexity is further increased at the protein
level as a result of posttranslational modifications, such as phosphorylation, acetylation and ubiquitination, which can appreciably
impact the functional state of proteins. Thus, it is not only the levels
of proteins but also their modification status that will have to be
studied in order to gain a deeper understanding of biological systems. A number of proteomic technologies have been developed
that allow researchers to study proteins in a high-throughput fashion. Among these is the protein microarray, which may appear in
The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
C.Mircean et al.
Fig. 1. The layout of the lysate arrays. The approaches proposed in this paper estimate differential protein expression. The samples are placed in patches of 18
spots, 6 dilutions and 3 replicates for each dilution, as shown in the illustration on the left. The sets T_s(k) and N_s(k) represent the measurements corresponding
to two different samples (e.g. tumor and normal), with k denoting the dilution and s denoting the replicate.
Fig. 2. Robust estimation of the protein expression ratios from six dilutions in three replicates. The model that fits the dilutions is expected to be linear in
log-log space. (a and b) Graphical representation of the non-linear and robust least squares approaches, respectively (see text). (c) The distance between the
two fitted lines at each dilution is weighted according to the estimated spot quality at that dilution (see text).
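The idea in panels (b) and (c) can be sketched as below. This is an illustrative sketch, not the paper's implementation: it assumes a straight line in log-log space fitted by iteratively reweighted least squares with Huber weights and a MAD-based scale, and it assumes the quality weights enter as a weighted mean of the gap between the two fitted lines. The function names, the tuning constant 1.345, and the synthetic data are assumptions.

```python
import numpy as np

def huber_line_fit(x, y, tune=1.345, n_iter=50):
    """Fit y ~ a + b*x by iteratively reweighted least squares with Huber weights."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones_like(x), x])
    w = np.ones_like(y)
    coef = np.zeros(2)
    for _ in range(n_iter):
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        r = y - A @ coef
        # Robust scale from the median absolute deviation of the residuals.
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        if scale < 1e-12:
            break
        # Huber weights: 1 inside the tuning band, decaying as 1/|r| outside it.
        w = 1.0 / np.maximum(1.0, np.abs(r) / (tune * scale))
    return coef  # (intercept, slope)

def weighted_log_ratio(coef_t, coef_n, dilutions, quality):
    """Quality-weighted mean gap between two fitted lines (assumed combination rule)."""
    d = np.asarray(dilutions, dtype=float)
    q = np.asarray(quality, dtype=float)
    gap = (coef_t[0] + coef_t[1] * d) - (coef_n[0] + coef_n[1] * d)
    return float(np.sum(q * gap) / np.sum(q))

# Synthetic patch: 6 two-fold dilution steps x 3 replicates, log2 scale.
rng = np.random.default_rng(0)
steps = np.repeat(np.arange(6), 3).astype(float)
log_signal = 10.0 - steps + rng.normal(0.0, 0.05, size=steps.size)
log_signal[steps == 4] -= 4.0   # damage all three replicates of the fifth dilution

intercept_h, slope_h = huber_line_fit(steps, log_signal)
slope_ols, intercept_ols = np.polyfit(steps, log_signal, 1)
print(slope_h, slope_ols)
```

On data like this, the damaged dilution pulls the ordinary least squares slope away from the nominal value of -1, while the reweighted fit gives those spots less influence.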
Fig. 3. A box plot showing the lower quartile, median and upper quartile
values of the logarithms of the BSA spots for each of the six dilutions. The
three figures, from top to bottom, compare the performance with 1, 5 and
10 touches per spot on Palls Vivid slides. The whiskers extend to 1.5 times
the interquartile range and all values beyond the whiskers are deemed to be
outliers. Taking into account the number of outliers and the slope (implicitly
the saturations) of the dilution curve, the best quality was obtained with 5
touches. Each boxplot represents 240 spots (80 samples × 3 replicates).
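The box plot quantities named in the caption can be computed directly. The sketch below assumes a synthetic array of log intensities shaped (80 samples, 6 dilutions, 3 replicates); the function name `box_stats` and the simulated values are illustrative only.

```python
import numpy as np

def box_stats(values, whisker=1.5):
    """Quartiles, whisker limits, and outliers for one box of a box plot."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - whisker * iqr, q3 + whisker * iqr   # whisker limits at 1.5 x IQR
    outliers = v[(v < lo) | (v > hi)]                 # values beyond the whiskers
    return {"n": v.size, "q1": q1, "median": med, "q3": q3, "outliers": outliers}

# Synthetic log intensities: 80 samples x 6 dilutions x 3 replicates.
rng = np.random.default_rng(1)
log_spots = 12.0 - np.arange(6)[None, :, None] + rng.normal(0.0, 0.2, size=(80, 6, 3))

# One box per dilution, each pooling 80 x 3 = 240 spots.
per_dilution = [box_stats(log_spots[:, k, :].ravel()) for k in range(6)]
print([len(b["outliers"]) for b in per_dilution])
```

Comparing the outlier counts and the drop in the medians across dilutions gives the two quality indicators the caption refers to.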
Fig. 4. Comparison of the methods using the BSA array. The main figure shows the histogram of the mean squared error (MSE) between the estimated true
values, computed from 49 training samples, and the estimated values computed from one test sample performed 5776 times with random splits of the data
(limited by computational load). The lower inset shows the histograms of the errors for the proposed methods, defined as the difference between the estimated
true values and the estimated test values. As can be seen, the robust least squares method that uses all 18 spots from a sample produces dramatically smaller
errors than least squares applied to median estimated spot values at each dilution or than standard least squares. Additionally, the robust least squares method
yields Gaussian behavior of cross-validation errors (on the split 30:49:1), as shown in the upper insets, containing the histograms of the errors as well as a
normal probability plot, which is expected to appear linear for a Gaussian distribution. Standard least squares serves as a baseline comparison for either of the
robust methods. For weighting procedures in robust least squares, we tested Andrews, Cauchy, Huber, Logistic, Talwar, Tukey and Welsch with similar results.
Huber produced the smallest standard deviation of the mean error (std. ME). Also, other heuristic methods (not presented here) were tested with inferior results.
Fig. 5. Expression of p53 and p21 in p53+/+ HCT116 cells under drug 1 (B), drug 2 (C) and combination of drug 1 and drug 2 (D), as well as p53-/-
HCT116 cells with no treatment (E), all relative to p53+/+ HCT116 cells with no treatment (A). The no-treatment p53+/+ HCT116 control (A) is represented
as a reference baseline at zero in (b). The plotted bars illustrate the relative estimated values as well as the bootstrap-estimated standard deviations (1000
bootstrap samples). See the Methods section for more details. The values for the bars and standard deviations are shown in the table in (e). As a validation step,
a western blot is shown in (a), with the same labels (A-E) as in (b). The quality-based weights, estimated using the coefficient of variation as described in the
Methods section, are seen to be different for p21 than for p53. (f) shows the quality-based weights for p53 (green) and p21 (red).
expressions as the quality-weighted distance using the bootstrap-randomized selection. Another option is to estimate the errors by
means of bootstrapping the residuals.
We preferred to use the first alternative, which involves recalculation of the parameters, as it makes no assumption about whether or
not the regression model holds. Figure 5b illustrates the expression
ratios, relative to p53+/+ HCT116 cells with no treatment, indicated
as a baseline at zero, along with the bootstrap-estimated standard
deviations. Because the bar graphs in Figure 5b represent distances,
the standard deviations on these bars incorporate the combined variability owing to the treatment (or p53-/-) and the p53+/+ no-drug
reference.
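The first alternative, resampling (dilution, intensity) pairs and refitting each time, can be sketched as follows. This is a simplified illustration: ordinary least squares stands in for the robust fit, the estimator is the fitted log intensity at a mid-range dilution rather than the paper's quality-weighted distance, and the names `bootstrap_pairs_sd` and `fitted_mid_value` as well as the data are hypothetical.

```python
import numpy as np

def bootstrap_pairs_sd(x, y, estimator, n_boot=1000, seed=0):
    """Standard deviation of estimator(x, y) over bootstrap resamples of (x, y) pairs."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # draw pairs with replacement
        stats[b] = estimator(x[idx], y[idx])    # parameters are recalculated on each resample
    return float(stats.std(ddof=1))

def fitted_mid_value(x, y):
    """Fitted value of a least squares line at dilution step 2.5 (illustrative estimator)."""
    return float(np.polyval(np.polyfit(x, y, 1), 2.5))

# Synthetic patch: 6 dilution steps x 3 replicates with small measurement noise.
rng = np.random.default_rng(2)
steps = np.repeat(np.arange(6), 3).astype(float)
log_signal = 10.0 - steps + rng.normal(0.0, 0.1, size=steps.size)

sd = bootstrap_pairs_sd(steps, log_signal, fitted_mid_value, n_boot=1000, seed=0)
print(sd)
```

Because each resample refits the line from scratch, the spread of the estimates does not rely on the regression model holding, which is the reason given in the text for preferring this alternative over bootstrapping residuals.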
As expected, the p53-/- cells (E) have a much lower p53 relative
expression than all other conditions. The same behavior is seen for
p21 expression, but not so dramatically. Further, the combination
of drugs (D) does not increase the expression of p53 and p21 in
HCT116 p53+/+ cells, relative to drug 1 (B) or drug 2 (C) alone. As
a validation step, Figure 5a shows a western blot corresponding to
all tested conditions.
Fig. 6. To study the robustness of the algorithms, the three methods were applied to a BSA array with a crack in the membrane. Sample 59 is presented in
the original slide image (a) and then, before quantification, in a negative image prepared for ArrayVision. As a result, all three replicates of the fifth dilution are
outliers (b). The simple linear regression is most affected by this error, followed by the non-linear approach; robust least squares is able to reject the effect of
outliers. Furthermore, using a bootstrap procedure (1000 times), which combines the pairs of spot intensity and dilution (see Discussion section), the estimated
errors from robust least squares are considerably lower than from the other two cases. The error bars on the fitted lines represent the standard deviations of the
estimated fits for each dilution using the bootstrap procedure. The lower inset shows bootstrap histograms (10 000 bootstrap samples) of distances between two
neighboring normal samples, computed with robust least squares (c), least squares (d) and least squares of the medians (e). In each subplot, the vertical bar
indicates the distance between a cracked sample and its neighboring normal sample, using the same algorithm that is used to generate the histogram.
The HCT116 cell data are fairly free of outliers. In an ideal case,
the two robust methods should return the same results as a standard
regression model. We saw fit to report one case in which the
three methods return significantly different results, a case not used in
our further analyses. The slide in question is spotted with BSA, with
1-touch. We can observe several cracks of the membrane caused by
erroneous handling combined with a faster drying on a Palls Vivid
slide. On the left of Figure 6, we show the entire slide. The upper
detail is the processed image as described in the Methods section,
before applying the segmentation step of ArrayVision. For this
sample, the values of the fifth dilution spots are strongly affected
by the crack and, as we can see, in reality the spots themselves are
not outliers. From the three models, we can observe the robust least
ACKNOWLEDGEMENTS
The work was partially supported by the Tobacco Settlement Fund
to M.D. Anderson Cancer Center (MDACC) as appropriated by the
Texas Legislature, a grant from Kadoorie Foundation to MDACC,
a grant from the Goodwin Fund and the Cancer Center Supporting
Grant from NIH/NCI, and a grant from the Academy of Finland.
REFERENCES
Efron,B. and Tibshirani,R.J. (1993) An Introduction to the Bootstrap. Chapman and
Hall, New York.