High Dimensional Data Analysis in Cancer Research, 1st Edition
High-Dimensional Data
Analysis in Cancer Research
Editors
Xiaochun Li
Harvard Medical School
Dana-Farber Cancer Institute
Dept. Biostatistics
375 Longwood St.
Boston MA 02115
USA
[email protected]
Ronghui Xu
University of California San Diego
Department of Family and Preventive Medicine
and Department of Mathematics
9500 Gilman Dr.
La Jolla CA 92093-0112
USA
[email protected]
springer.com
To our children, Anna, Sofia, and James
Preface
to choose those fields of research that are either relatively mature, but may not have
been well read in applied statistics, such as risk estimation, or those fields that are
fast developing and also have obtained substantial newer results that are reasonably
well understood for practical use, such as variable selection. On the other hand, we
have omitted such an important topic as multiple comparisons, which is currently
undergoing much theoretical development (as reflected in the August 2007 issue of
Annals of Statistics, for example), and we find it difficult at the moment to provide an
accurate, stable, yet up-to-date picture of it. This topic, however, can
be found in several other recently published books that contain its classical results
ready for practical use. All the chapters included in this book contain practical
examples to illustrate the analysis methods. In addition, they also reveal the types of
research that are involved in developing these methods.
The opening chapter provides an overview of the various high-dimensional data
sources, the challenges in analyzing such data, and in particular, strategies in the
design phase, as well as possible future directions. Chapter 2 discusses methodologies
and issues surrounding variable selection and model building, including
post-model-selection inference. These have always been important topics in statistical
research, and even more so in the analysis of high-dimensional data. Chapter 3 is devoted to
the topic of multivariate nonparametric regression. Multivariate problems are
common in oncological research, and often the relationship between the outcome of
interest and its predictors is either nonlinear, or nonadditive, or both. This chapter
focuses on the methods of regression trees and spline models. Chapter 4 discusses
the more fundamental problem of risk estimation. This is the basis of many
procedures and, in particular, model selection. It reviews the two major approaches to risk
estimation, i.e., covariance penalty and resampling, and summarizes empirical
evaluations of these approaches. Chapter 5 focuses on tree-based methods. After a brief
review of classification and regression trees (CART), the chapter presents in more
detail tree-based ensembles, including boosting and random forests. Chapter 6 is on
support vector machines (SVMs), one of the methodologies stemming from the
machine learning field that has gained popularity for classification of high-dimensional
data. The chapter discusses both two-class and multiclass classification problems,
and linear and nonlinear SVMs. For high-dimensional data, a particularly important
aspect is sparse learning, that is, only a relatively small subset of the predictors is
truly involved with the classification boundary. Variable selection is then again a
critical step, and various approaches associated with SVMs are described. The last,
but by no means least, chapter presents Bayesian approaches to the analysis
of microarray gene expression data. The emphasis is on nonparametric Bayesian
methods, which allow flexible modeling of data that might arise from underlying
heterogeneous mechanisms. Computational algorithms are discussed.
It has been an exciting experience editing this volume. We thank all the authors
for their excellent contributions.
Boston, MA Xiaochun Li
La Jolla, CA Ronghui Xu
Contents
4 Risk Estimation
Ronghui Xu and Anthony Gamst
4.1 Risk
4.2 Covariance Penalty
4.2.1 Continuous Outcomes
4.2.2 Binary Outcomes
4.2.3 A Connection with AIC
5 Tree-Based Methods
Adele Cutler, D. Richard Cutler, and John R. Stevens
5.1 Chapter Outline
5.2 Background
5.2.1 Microarray Data
5.2.2 Mass Spectrometry Data
5.2.3 Traditional Approaches to Classification and Regression
5.2.4 Dimension Reduction
5.3 Classification and Regression Trees
5.3.1 Example: Regression Tree for Prostate Cancer Data
5.3.2 Properties of Trees
5.4 Tree-Based Ensembles
5.4.1 Bagged Trees
5.4.2 Random Forests
5.4.3 Boosted Trees
5.5 Example: Prostate Cancer Microarrays
5.6 Software
5.7 Recent Research and Oncology Applications
References
Index
Contributors
Ross L. Prentice
Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview
Avenue North, Seattle, WA, USA
email: [email protected]
1.1 Introduction
1.2.1 Background
The unifying goal of the various types of high-dimensional data being generated in
recent years is the understanding of biological processes, especially processes that
relate to disease occurrence or management. These may involve, for example,
characteristics such as single nucleotide polymorphisms (SNPs) across the genome to be
related to the risk of a disease; gene expression patterns in tumor tissue to be related
to the risk of tumor recurrence; or protein expression patterns in blood to be related
to the presence of an undetected cancer. Cutting across the biological processes
related to carcinogenesis, or other chronic disease processes, are high-dimensional
data related to treatment or intervention effects. These may include, for example,
study of changes in the plasma proteome as a result of an agent having chronic
disease prevention potential; or changes in gene expression in tumor tissue as a result
of exposure to a therapeutic regimen, especially a molecularly targeted regimen. It
is the confluence of novel biomarkers of disease development and treatment with
biomarker changes related to possible interventions that has great potential to
enhance the identification of novel preventative and therapeutic interventions.
Furthermore, biological markers that are useful for early disease detection open the door to
reduced disease mortality, using current or novel therapeutic modalities. The
technology available for these various purposes in human studies depends very much
on the type of specimens available for study, with white blood cells and their DNA
content, tumor tissue and its mRNA content, or blood serum or plasma and its
proteomic and metabolomic (small molecule) content, as important examples. The next
subsections will provide a brief overview of the technology for assessment of certain
key types of high-dimensional biologic data.
The study of genotype in relation to the risk of specific cancers or other chronic
diseases has traditionally relied heavily on family studies. Such studies often
involve families having a strong history of the study disease to increase the
probability of harboring disease-related genes. A study may involve genotyping family
members for a panel of genetic markers and assessing whether one or more
markers co-segregate with disease among family members. This approach uses the fact
that chromosomal segments are inherited intact, so that markers over some distance
from a disease-related gene can be expected to associate with disease risk within
families. Following the identification of a “linkage” signal with a genetic marker,
some form of fine mapping is needed to close in on disease-related loci. There are
many variations in ascertainment schemes and analysis procedures that may differ in
efficiency and robustness (e.g., Ott, 1991; Thomas, 2004) with case–control family
studies having a prominent role in recent years.
Markers that are sufficiently close on the genome tend to be correlated,
depending somewhat on a person’s evolutionary history (e.g., Felsenstein, 2007). The
identification of several million SNPs across the human genome (e.g., Hinds et al., 2005)
and the identification of tag SNP subsets (The International HapMap Consortium,
2003) that convey most genotype information as a result of such correlation (linkage
disequilibrium) have opened the way not only to family-based studies that involve a
very large number of genomic markers, but also to direct disease association studies
among unrelated individuals. For example, the latter type of study may
simultaneously relate 100,000 or more tag SNPs to disease occurrence in a study cohort,
typically using a nested case–control or case-cohort design.
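As a rough illustration of how correlation between nearby markers lets a smaller tag SNP subset stand in for the full set, the following Python sketch computes the squared correlation between genotype vectors and keeps a marker as a tag only when no previously chosen tag is strongly correlated with it. The genotype coding (0/1/2 minor-allele counts), the toy data, the function names, and the 0.8 threshold are all assumptions made for this example. They are not taken from the HapMap project or from this chapter, and genotype-level correlation is only a simple proxy for haplotype-based measures of linkage disequilibrium.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors
    (assumed coding: 0/1/2 copies of the minor allele)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic marker: correlation is undefined, report 0.
        return 0.0
    return sxy * sxy / (sxx * syy)


def greedy_tag_snps(genotypes, threshold=0.8):
    """Scan SNPs in order; keep a SNP as a tag unless an earlier tag
    already has r^2 >= threshold with it (hypothetical threshold)."""
    tags = []
    for name, g in genotypes.items():
        if not any(r_squared(g, genotypes[t]) >= threshold for t in tags):
            tags.append(name)
    return tags


# Toy data for eight individuals; snp2 duplicates snp1, snp3 does not.
genotypes = {
    "snp1": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp2": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp3": [2, 0, 1, 0, 2, 1, 1, 0],
}
tags = greedy_tag_snps(genotypes)
print(tags)
```

In this toy example snp2 is dropped because it is perfectly correlated with snp1, while snp3 is kept because it is only weakly correlated with it. Real tag SNP selection works on much larger panels and typically restricts comparisons to markers within a genomic window.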
However, for this type of association study to be practical, there need to be
reliable, high-throughput genotyping platforms having acceptable costs. Satisfying
this need has been a major technology success story over the past few years, with
commercially available platforms (Affymetrix, Illumina) having 500,000–1,000,000
well-selected tagging SNPs, and genotyping costs reduced to a few hundred dollars
per specimen. These platforms, similar to the gene expression platforms that
preceded them, rely on chemical coupling of DNA from target cells to labeled probes
having a specified sequence affixed to microarrays, and use photolithographic
methods to assess the intensity of the label following hybridization and washing. In
addition to practical cost, these platforms can accommodate the testing of thousands
of cases and controls in a research project in a matter of a few weeks or months.
The results of very high-dimensional SNP studies of this type have only recently
begun to emerge, usually from large cohorts or cohort consortia, in view of the large
sample sizes needed to rule out false positive associations. Novel genotype
associations with disease risks have already been established for breast cancer (e.g., Easton
et al., 2007; Hunter et al., 2007) and prostate cancer (Amundadottir et al., 2006;
Freedman et al., 2006; Yeager et al., 2007), as well as for several other chronic
diseases (e.g., Samani et al., 2007, for coronary heart disease). Although it is early to
try to characterize findings, novel associations for complex common diseases tend
to be weak, and mostly better suited to providing insight into disease processes and
pathways, than to contributing usefully to risk assessment. The prostate cancer
associations cited include well-established SNP associations that are not in proximity
to any known gene, providing the impetus for further study of genomic structure
and characteristics in relation to gene and protein expression.
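To see why false positive associations are such a concern at this scale, a small piece of arithmetic may help. The Python sketch below assumes 500,000 SNPs tested one at a time, none of them truly associated with disease, and a conventional 0.05 significance level. The Bonferroni correction is used only as the simplest familiar illustration; the chapter does not prescribe it, and the function names are hypothetical.

```python
def expected_false_positives(per_test_alpha, n_null_tests):
    """Expected number of null tests declared significant when each
    is tested at level per_test_alpha."""
    return per_test_alpha * n_null_tests


def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance level that keeps the family-wise error
    rate at or below family_alpha (Bonferroni correction)."""
    return family_alpha / n_tests


n_snps = 500_000  # assumed platform size, lower end of the range quoted above

# Testing every SNP at 0.05 would flag roughly 25,000 SNPs by chance alone.
print(expected_false_positives(0.05, n_snps))

# Controlling the family-wise error at 0.05 requires p-values near 1e-7.
print(bonferroni_threshold(0.05, n_snps))
```

Because weak associations only reach such small p-values with many cases and controls, thresholds of this order are one way to understand the reliance on large cohorts and consortia.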
Studies of gene expression patterns in tumor tissue from cancer patients provided
some of the earliest use of microarray technologies in biomedical research, and
constituted the setting that motivated much of the statistical design and analysis
developments to date for high-dimensional data studies. Gene expression can be assessed
by the concentration of mRNA (transcripts) in cells, and many applications to date
have focused on studies of tumors or other tissue, often in a therapeutic context.
mRNA hybridizes with labeled probes on a microarray, with a photolithographic
assessment of transcript abundance through label intensity. A microarray study may,
for example, compare transcript abundance between two groups for 10,000 or more
human genes.
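A two-group comparison of this kind is often carried out one gene at a time. The Python sketch below is a minimal example, assuming log-scale expression values for a handful of samples per group and using Welch's two-sample t statistic as the per-gene summary. The gene names, the numbers, and the choice of statistic are all assumptions for illustration and are not taken from this chapter.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch two-sample t statistic (unequal variances allowed)."""
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical expression values: (group 1 samples, group 2 samples).
expression = {
    "geneA": ([5.1, 4.9, 5.3, 5.0], [7.2, 6.8, 7.1, 7.0]),
    "geneB": ([3.0, 3.2, 2.9, 3.1], [3.1, 2.9, 3.0, 3.2]),
}

# One statistic per gene, then rank genes by absolute size of the statistic.
t_stats = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
print(ranked)
```

With 10,000 or more genes the same loop yields 10,000 or more statistics, so the ranking needs to be interpreted with the number of simultaneous comparisons in mind.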
Studies of the transcription pattern of specific tumors provide a major tool for
assessing recurrence risk, and prognosis more generally, and for classifying patients