
High Dimensional Data Analysis in Cancer Research, 1st Edition

The book 'High Dimensional Data Analysis in Cancer Research' addresses the challenges and methodologies for analyzing high-dimensional data in cancer research, emphasizing the need for advanced statistical tools due to the increasing complexity of biomedical data. It includes seven chapters covering topics such as variable selection, multivariate nonparametric regression, risk estimation, and machine learning methods like support vector machines. The volume serves as a reference for researchers and practitioners in the field, providing practical examples and insights into current analytical approaches.

Xiaochun Li · Ronghui Xu
Editors

High-Dimensional Data
Analysis in Cancer Research

Editors

Xiaochun Li
Harvard Medical School
Dana-Farber Cancer Institute
Dept. Biostatistics
375 Longwood St.
Boston MA 02115
USA
[email protected]

Ronghui Xu
University of California
San Diego
Department of Family and Preventive Medicine
and Department of Mathematics
9500 Gilman Dr.
La Jolla CA 92093-0112
USA
[email protected]

ISBN: 978-0-387-69763-5 e-ISBN: 978-0-387-69765-9


DOI: 10.1007/978-0-387-69765-9

Library of Congress Control Number: 2008940562

© Springer Science+Business Media, LLC 2009


All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.

Printed on acid-free paper

springer.com
To our children, Anna, Sofia, and James
Preface

In an era with a plethora of high-throughput biological technologies, biomedical
researchers are investigating more comprehensive aspects of cancer with ever-finer
resolution. This results not only in a large amount of data but also in data with
hundreds if not thousands of dimensions.
Multivariate analysis is a mainstay of the statistical tools used in the analysis of
biomedical data. It is concerned with relating data matrices of n rows by p columns,
with rows representing samples or patients and columns representing attributes, to
certain response or outcome variables. Classically, the sample size n is much larger
than the number of attributes p. The theoretical properties of statistical models have mostly been
discussed under the assumption of fixed p and infinite n. However, the advance of bi-
ological sciences and technologies has revolutionized the process of investigations
in cancer. The biomedical data collection has become much more automatic and
much more extensive. We are in the era of p as a large fraction of n, or even much
larger than n, which poses challenges to the classical statistical paradigm. Take pro-
teomics as an example. Although proteomic techniques have been researched and
developed for many decades to identify proteins or peptides uniquely associated
with a given disease state, until recently this has mostly been a laborious process,
carried out one protein at a time. The advent of high-throughput proteome-wide
technologies such as liquid chromatography-tandem mass spectrometry makes it possible
to generate proteomic signatures that facilitate rapid development of new strategies
for proteomics-based detection of disease. This poses new challenges and calls for
scalable solutions to the analysis of such high-dimensional data.
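To make the difficulty of p exceeding n concrete, the following minimal sketch may help. It is not taken from the book: the sizes n = 20 and p = 100, the simulated data, and the ridge penalty value are all illustrative assumptions. It shows that the design matrix (and hence the p × p matrix X^T X) has rank at most n, so ordinary least squares has no unique solution, whereas adding a small ridge penalty yields a unique one.

```python
import numpy as np

def design_rank(X):
    """Rank of X, which equals the rank of X^T X and never exceeds min(n, p)."""
    return int(np.linalg.matrix_rank(X))

def ridge_fit(X, y, lam):
    """Ridge estimate solving (X^T X + lam * I) beta = X^T y; lam is an assumed tuning value."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)   # fixed seed; the data are simulated for illustration
n, p = 20, 100                   # hypothetical sizes with p much larger than n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

rank = design_rank(X)
beta = ridge_fit(X, y, lam=1.0)
print(f"n={n}, p={p}, rank of X = {rank}")
print("length of ridge coefficient vector:", beta.shape[0])
```

Ridge regression is used here only as one familiar remedy; Chapter 2 of the volume discusses it alongside other penalties.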
In this volume, we present current analytical approaches as well as systematic
strategies for the analysis of correlated and high-dimensional data.
The volume is intended as a reference book for researchers, statisticians, bioin-
formaticians, graduate students, and data analysts working in the field of cancer re-
search. Our aim is to present methodological topics of important relevance to such
analyses, and in a single volume such as this we do not attempt to exhaust all the
analytical tools that have been developed so far.
This volume contains seven chapters. They do not necessarily cover all topics relevant
to high-dimensional data analysis in cancer research. Instead, we have aimed
to choose those fields of research that are either relatively mature, but may not have
been well read in applied statistics, such as risk estimation, or those fields that are
fast developing and also have obtained substantial newer results that are reasonably
well understood for practical use, such as variable selection. On the other hand, we
have omitted such an important topic as multiple comparisons, which is currently
undergoing much theoretical development (as reflected in the August 2007 issue of
Annals of Statistics, for example), and we find it possibly difficult to provide an
accurate and stable yet up-to-date picture for the moment. This topic, however, can
be found in several other recently published books that contain its classical results
ready for practical use. All the chapters included in this book contain practical ex-
amples to illustrate the analysis methods. In addition, they also reveal the types of
research that are involved in developing these methods.
The opening chapter provides an overview of the various high-dimensional data
sources, the challenges in analyzing such data, and in particular, strategies in the de-
sign phase, as well as possible future directions. Chapter 2 discusses methodologies
and issues surrounding variable selection and model building, including post-model
selection inference. These have always been important topics in statistical research,
and even more so in the analysis of high-dimensional data. Chapter 3 is devoted to
the topic of multivariate nonparametric regression. Multivariate problems are com-
mon in oncological research, and often the relationship between the outcome of
interest and its predictors is either nonlinear, or nonadditive, or both. This chapter
focuses on the methods of regression trees and spline models. Chapter 4 discusses
the more fundamental problem of risk estimation. This is the basis of many proce-
dures and, in particular, model selection. It reviews the two major approaches to risk
estimation, i.e., covariance penalty and resampling, and summarizes empirical eval-
uations of these approaches. Chapter 5 focuses on tree-based methods. After a brief
review of classification and regression trees (CART), the chapter presents in more
detail tree-based ensembles, including boosting and random forests. Chapter 6 is on
support vector machines (SVMs), one of the methodologies stemming from the machine
learning field that has gained popularity for classification of high-dimensional
data. The chapter discusses both two-class and multiclass classification problems,
and linear and nonlinear SVM. For high-dimensional data, a particularly important
aspect is sparse learning, that is, the setting in which only a relatively small subset
of the predictors is truly involved with the classification boundary. Variable selection
is then again a critical step, and various approaches associated with SVM are described.
The last, but by no means least, chapter presents Bayesian approaches to the analyses
of microarray gene expression data. The emphasis is on nonparametric Bayesian
methods, which allow flexible modeling of the data that might arise from underly-
ing heterogeneous mechanisms. Computational algorithms are discussed.
It has been an exciting experience editing this volume. We thank all the authors
for their excellent contributions.

Boston, MA Xiaochun Li
La Jolla, CA Ronghui Xu
Contents

1 On the Role and Potential of High-Dimensional Biologic Data in Cancer Research
Ross L. Prentice
1.1 Introduction
1.2 Potential of High-Dimensional Data in Biomedical Research
1.2.1 Background
1.2.2 High-Dimensional Study of the Genome
1.2.3 High-Dimensional Studies of the Transcriptome
1.2.4 High-Dimensional Studies of the Proteome
1.2.5 Other Sources of High-Dimensional Biologic Data
1.3 Statistical Challenges and Opportunities with High-Dimensional Data
1.3.1 Scope of this Presentation
1.3.2 Two-Group Comparisons
1.3.3 Study Design Considerations
1.3.4 More Complex High-Dimensional Data Analysis Methods
1.4 Needed Future Research
References

2 Variable Selection in Regression – Estimation, Prediction, Sparsity, Inference
Jaroslaw Harezlak, Eric Tchetgen, and Xiaochun Li
2.1 Overview of Model Selection Methods
2.1.1 Data Example
2.1.2 Univariate Screening of Variables
2.1.3 Subset Selection
2.2 Multivariable Modeling: Penalties/Shrinkage
2.2.1 Penalization
2.2.2 Ridge Regression and Nonnegative Garrote
2.2.3 LASSO: Definition, Properties and Some Extensions
2.2.4 Smoothly Clipped Absolute Deviation (SCAD)
2.3 Least Angle Regression
2.4 Dantzig Selector
2.5 Prediction and Persistence
2.6 Difficulties with Post-Model Selection Inference
2.7 Penalized Likelihood for Generalized Linear Models
2.8 Simulation Study
2.9 Application of the Methods to the Prostate Cancer Data Set
2.10 Conclusion
References

3 Multivariate Nonparametric Regression
Charles Kooperberg and Michael LeBlanc
3.1 An Example
3.2 Linear and Additive Models
3.2.1 Example Revisited
3.3 Interactions
3.4 Basis Function Expansions
3.5 Regression Tree Models
3.5.1 Background
3.5.2 Model Building
3.5.3 Backwards Selection (Pruning)
3.5.4 Example Revisited
3.5.5 Issues and Connections
3.6 Spline Models
3.6.1 One Dimensional
3.6.2 Higher-Dimensional Models
3.6.3 Example Revisited
3.7 Logic Regression
3.7.1 Example Revisited
3.8 High-Dimensional Data
3.8.1 Variable Selection and Shrinkage
3.8.2 LASSO and LARS
3.8.3 Dedicated Methods
3.8.4 Example Revisited
3.9 Survival Data
3.9.1 Example Revisited
3.10 Discussion
References

4 Risk Estimation
Ronghui Xu and Anthony Gamst
4.1 Risk
4.2 Covariance Penalty
4.2.1 Continuous Outcomes
4.2.2 Binary Outcomes
4.2.3 A Connection with AIC
4.2.4 Correlated Outcomes
4.2.5 Nuisance Parameters
4.3 Resampling Methods
4.3.1 Empirical Studies
4.4 Applications of Risk Estimation
4.4.1 SURE and Admissibility
4.4.2 Finite Sample Risk and Adaptive Regression Estimates
4.4.3 Model Selection
4.4.4 Gene Ranking
References

5 Tree-Based Methods
Adele Cutler, D. Richard Cutler, and John R. Stevens
5.1 Chapter Outline
5.2 Background
5.2.1 Microarray Data
5.2.2 Mass Spectrometry Data
5.2.3 Traditional Approaches to Classification and Regression
5.2.4 Dimension Reduction
5.3 Classification and Regression Trees
5.3.1 Example: Regression Tree for Prostate Cancer Data
5.3.2 Properties of Trees
5.4 Tree-Based Ensembles
5.4.1 Bagged Trees
5.4.2 Random Forests
5.4.3 Boosted Trees
5.5 Example: Prostate Cancer Microarrays
5.6 Software
5.7 Recent Research and Oncology Applications
References

6 Support Vector Machine Classification for High-Dimensional Microarray Data Analysis, With Applications in Cancer Research
Hao Helen Zhang
6.1 Classification Problems: A Statistical Point of View
6.1.1 Binary Classification Problems
6.1.2 Bayes Rule for Binary Classification
6.1.3 Multiclass Classification Problems
6.1.4 Bayes Rule for Multiclass Classification
6.2 Support Vector Machine for Two-Class Classification
6.2.1 Linear Support Vector Machines
6.2.2 Nonlinear Support Vector Machines
6.2.3 Regularization Framework for SVM
6.3 Support Vector Machines for Multiclass Problems
6.3.1 One-versus-the-Rest and Pairwise Comparison
6.3.2 Multiclass Support Vector Machines (MSVMs)
6.4 Parameter Tuning and Solution Path for SVM
6.4.1 Tuning Methods
6.4.2 Entire Solution Path for SVM
6.5 Sparse Learning with Support Vector Machines
6.5.1 Variable Selection for Binary SVM
6.5.2 Variable Selection for Multiclass SVM
6.6 Cancer Data Analysis Using SVM
6.6.1 Binary Cancer Classification for UNC Breast Cancer Data
6.6.2 Multi-type Cancer Classification for Khan’s Children Cancer Data
References

7 Bayesian Approaches: Nonparametric Bayesian Analysis of Gene Expression Data
Sonia Jain
7.1 Introduction
7.2 Bayesian Analysis of Microarray Data
7.2.1 EBarrays
7.2.2 Probability of Expression
7.3 Nonparametric Bayesian Mixture Model
7.4 Posterior Inference of the Bayesian Model
7.4.1 Gibbs Sampling
7.4.2 Split–Merge Markov Chain Monte Carlo
7.5 Leukemia Gene Expression Example
7.5.1 Leukemia Data
7.5.2 Implementation
7.5.3 Results
7.6 Discussion
References

Index
Contributors

Adele Cutler Department of Mathematics and Statistics, Utah State University,
3900 Old Main Hill, Logan, UT 84322-3900, USA, [email protected]
Richard Cutler Department of Mathematics and Statistics, Utah State University,
3900 Old Main Hill, Logan, UT 84322-3900, USA, [email protected]
Anthony Gamst Division of Biostatistics and Bioinformatics, Department of
Family and Preventive Medicine, University of California, San Diego, 9500 Gilman
Drive, MC0717 La Jolla, CA 92093-0717, USA, [email protected]
Jaroslaw Harezlak Division of Biostatistics, Indiana University School of
Medicine, 410 West 10th Street, Suite 3000 Indianapolis, IN 46202, USA,
[email protected]
Sonia Jain Division of Biostatistics and Bioinformatics, University of California,
San Diego, 9500 Gilman Drive, MC-0717, La Jolla, CA 92093-0717, USA,
[email protected]
Charles Kooperberg Division of Public Health Sciences, Fred Hutchinson Cancer
Research Center, 1100 Fairview Ave N, M3-A410 Seattle, WA 98109-1024, USA,
[email protected]
Michael LeBlanc Division of Public Health Sciences, Fred Hutchinson Cancer
Research Center, 1100 Fairview Ave N, M3-C102 Seattle, WA 98109-1024, USA,
[email protected]
Xiaochun Li Division of Biostatistics, Indiana University School of Medicine, 410
West 10th Street, Suite 3000 Indianapolis, IN 46202, USA, [email protected]
Ross L. Prentice Division of Public Health Sciences, Fred Hutchinson Cancer
Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA,
[email protected]

John R. Stevens Department of Mathematics and Statistics, Utah State University,
3900 Old Main Hill, Logan, UT 84322-3900, USA, [email protected]
Eric Tchetgen Department of Biostatistics, Harvard School of Public Health, 655
Huntington Avenue, Boston, MA 02115, USA, [email protected]
Ronghui Xu Division of Biostatistics and Bioinformatics, Department of Family
and Preventive Medicine and Department of Mathematics, University of California,
San Diego, 9500 Gilman Drive, MC 0112 La Jolla, CA 92093-0112, USA,
[email protected]
Hao Helen Zhang Department of Statistics, North Carolina State University, 2501
Founders Drive Raleigh, NC 27613, USA, [email protected]
Chapter 1
On the Role and Potential of High-Dimensional
Biologic Data in Cancer Research

Ross L. Prentice

1.1 Introduction

I am pleased to provide a brief introduction to this volume of “High-Dimensional
Data Analysis in Cancer Research”. The chapters to follow will focus on data analysis
aspects, particularly related to regression model selection and estimation with
high-dimensional data of various types. The methods described will have a major
emphasis on statistical innovations that are afforded by high-dimensional predictor
variables.
While many of the motivating applications and datasets for these analytic de-
velopments arise from gene expression data in therapeutic research contexts, there
are also important applications, and potential applications, in risk assessment,
early diagnosis and primary disease prevention research, as will be elaborated in
Section 1.2. With this range of applications as background, some preliminary com-
ments are made on related statistical challenges and opportunities (Section 1.3) and
on needed future developments (Section 1.4).

1.2 Potential of High-Dimensional Data in Biomedical Research

1.2.1 Background

The unifying goal of the various types of high-dimensional data being generated in
recent years is the understanding of biological processes, especially processes that
relate to disease occurrence or management. These may involve, for example,
characteristics such as single nucleotide polymorphisms (SNPs) across the genome to be
related to the risk of a disease; gene expression patterns in tumor tissue to be related
to the risk of tumor recurrence; or protein expression patterns in blood to be related
to the presence of an undetected cancer. Cutting across the biological processes
related to carcinogenesis, or other chronic disease processes, are high-dimensional
data related to treatment or intervention effects. These may include, for example,
study of changes in the plasma proteome as a result of an agent having chronic disease
prevention potential; or changes in gene expression in tumor tissue as a result
of exposure to a therapeutic regimen, especially a molecularly targeted regimen. It
is the confluence of novel biomarkers of disease development and treatment with
biomarker changes related to possible interventions that has great potential to
enhance the identification of novel preventative and therapeutic interventions.
Furthermore, biological markers that are useful for early disease detection open the door to
reduced disease mortality, using current or novel therapeutic modalities. The
technology available for these various purposes in human studies depends very much
on the type of specimens available for study, with white blood cells and their DNA
content, tumor tissue and its mRNA content, or blood serum or plasma and its
proteomic and metabolomic (small molecule) content, as important examples. The next
subsections will provide a brief overview of the technology for assessment of certain
key types of high-dimensional biologic data.

R.L. Prentice
Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview
Avenue North, Seattle, WA, USA
email: [email protected]

X. Li, R. Xu (eds.), High-Dimensional Data Analysis in Cancer Research, Applied
Bioinformatics and Biostatistics in Cancer Research, DOI 10.1007/978-0-387-69765-9_1,
© Springer Science+Business Media LLC 2009

1.2.2 High-Dimensional Study of the Genome

The study of genotype in relation to the risk of specific cancers or other chronic
diseases has traditionally relied heavily on family studies. Such studies often in-
volve families having a strong history of the study disease to increase the proba-
bility of harboring disease-related genes. A study may involve genotyping family
members for a panel of genetic markers and assessing whether one or more mark-
ers co-segregate with disease among family members. This approach uses the fact
that chromosomal segments are inherited intact, so that markers over some distance
from a disease-related gene can be expected to associate with disease risk within
families. Following the identification of a “linkage” signal with a genetic marker,
some form of fine mapping is needed to close in on disease-related loci. There are
many variations in ascertainment schemes and analysis procedures that may differ in
efficiency and robustness (e.g., Ott, 1991; Thomas, 2004) with case–control family
studies having a prominent role in recent years.
Markers that are sufficiently close on the genome tend to be correlated, depend-
ing somewhat on a person’s evolutionary history (e.g., Felsenstein, 2007). The iden-
tification of several million SNPs across the human genome (e.g., Hinds et al., 2005)
and the identification of tag SNP subsets (The International HapMap Consortium,
2003) that convey most genotype information as a result of such correlation (linkage
disequilibrium) have opened the way not only to family-based studies that involve a
very large number of genomic markers, but also to direct disease association studies
among unrelated individuals. For example, the latter type of study may simultaneously
relate 100,000 or more tag SNPs to disease occurrence in a study cohort,
typically using a nested case–control or case-cohort design.
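The correlation between nearby markers that makes tag SNPs possible can be illustrated with a small sketch. It is not from the chapter: the genotype vectors are hypothetical, and the squared Pearson correlation of genotypes coded 0/1/2 is used here as a simple stand-in for the haplotype-based r² measure of linkage disequilibrium.

```python
import numpy as np

def ld_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    r = np.corrcoef(g1, g2)[0, 1]
    return r * r

# Hypothetical minor-allele counts for 8 individuals at three SNPs
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 1, 0]   # same pattern as snp_a, so snp_a tags it fully
snp_c = [2, 0, 1, 1, 2, 0, 0, 1]   # a different pattern, only partly correlated

print("r^2(a, b) =", round(ld_r2(snp_a, snp_b), 3))
print("r^2(a, c) =", round(ld_r2(snp_a, snp_c), 3))
```

Under this proxy, a SNP whose r² with a genotyped tag SNP is close to 1 contributes little extra information, which is the idea behind genotyping only a tag subset.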
However, for this type of association study to be practical, there need to be
reliable, high-throughput genotyping platforms having acceptable costs. Satisfying
this need has been a major technology success story over the past few years, with
commercially available platforms (Affymetrix, Illumina) having 500,000–1,000,000
well-selected tagging SNPs, and genotyping costs reduced to a few hundred dollars
per specimen. These platforms, similar to the gene expression platforms that pre-
ceded them, rely on chemical coupling of DNA from target cells to labeled probes
having a specified sequence affixed to microarrays, and use photolithographic meth-
ods to assess the intensity of the label following hybridization and washing. In ad-
dition to practical cost, these platforms can accommodate the testing of thousands
of cases and controls in a research project in a matter of a few weeks or months.
The results of very high-dimensional SNP studies of this type have only recently
begun to emerge, usually from large cohorts or cohort consortia, in view of the large
sample sizes needed to rule out false positive associations. Novel genotype associa-
tions with disease risks have already been established for breast cancer (e.g., Easton
et al., 2007; Hunter et al., 2007) and prostate cancer (Amundadottir et al., 2006;
Freedman et al., 2006; Yeager et al., 2007), as well as for several other chronic dis-
eases (e.g., Samani et al., 2007, for coronary heart disease). Although it is early to
try to characterize findings, novel associations for complex common diseases tend
to be weak, and mostly better suited to providing insight into disease processes and
pathways, than to contributing usefully to risk assessment. The prostate cancer as-
sociations cited include well-established SNP associations that are not in proximity
to any known gene, providing the impetus for further study of genomic structure
and characteristics in relation to gene and protein expression.
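The concern about false positive associations can be made concrete with simple arithmetic. The sketch below is an illustration only: it assumes every SNP is null and tests are judged at a nominal level, and it uses the Bonferroni correction as one familiar (and conservative) choice, not a method prescribed by the chapter.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests with p < alpha when every null hypothesis is true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold keeping the family-wise error rate at or below family_alpha."""
    return family_alpha / n_tests

n_snps = 100_000  # the tag SNP count mentioned in the text
print("expected false positives at alpha = 0.05:", expected_false_positives(n_snps, 0.05))
print("Bonferroni per-SNP threshold:", bonferroni_threshold(n_snps))
```

Because the per-SNP threshold is so small and true associations tend to be weak, large cohorts are needed to reach it, which is consistent with the sample-size remark above.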

1.2.3 High-Dimensional Studies of the Transcriptome

Studies of gene expression patterns in tumor tissue from cancer patients provided
some of the earliest use of microarray technologies in biomedical research, and con-
stituted the setting that motivated much of the statistical design and analysis devel-
opments to date for high-dimensional data studies. Gene expression can be assessed
by the concentration of mRNA (transcripts) in cells, and many applications to date
have focused on studies of tumors or other tissue, often in a therapeutic context.
mRNA hybridizes with labeled probes on a microarray, with a photolithographic as-
sessment of transcript abundance through label intensity. A microarray study may,
for example, compare transcript abundance between two groups for 10,000 or more
human genes.
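A two-group comparison across many genes can be sketched as follows. This is a minimal illustration on simulated data, not the chapter's analysis: the group labels, sample sizes, the 50 shifted genes, and the choice of the Welch t-statistic are all assumptions made for the example.

```python
import numpy as np

def welch_t(group1, group2):
    """Per-gene Welch t-statistics; rows are samples and columns are genes."""
    m1, m2 = group1.mean(axis=0), group2.mean(axis=0)
    v1 = group1.var(axis=0, ddof=1) / group1.shape[0]
    v2 = group2.var(axis=0, ddof=1) / group2.shape[0]
    return (m1 - m2) / np.sqrt(v1 + v2)

rng = np.random.default_rng(1)             # simulated expression values, illustrative only
n_genes = 10_000
tumor = rng.normal(size=(15, n_genes))     # hypothetical group of 15 samples
normal = rng.normal(size=(12, n_genes))    # hypothetical group of 12 samples
tumor[:, :50] += 2.0                       # assume the first 50 genes are truly shifted

t = welch_t(tumor, normal)
top = np.argsort(-np.abs(t))[:10]          # indices of the 10 largest |t|
print("genes with the largest |t|:", top)
```

With 10,000 statistics computed at once, some large values arise by chance alone, so any ranking of this kind would still need an error-rate assessment before genes are declared differentially expressed.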
Studies of the transcription pattern of specific tumors provide a major tool for
assessing recurrence risk, and prognosis more generally, and for classifying patients
