High Dimensional Data Analysis in Cancer Research, 1st Edition

The book 'High Dimensional Data Analysis in Cancer Research' addresses the challenges and methodologies for analyzing high-dimensional data in cancer research, emphasizing the need for advanced statistical tools due to the increasing complexity of biomedical data. It includes seven chapters covering topics such as variable selection, multivariate nonparametric regression, risk estimation, and machine learning methods like support vector machines. The volume serves as a reference for researchers and practitioners in the field, providing practical examples and insights into current analytical approaches.


Xiaochun Li · Ronghui Xu
Editors

High-Dimensional Data
Analysis in Cancer Research

Editors

Xiaochun Li
Harvard Medical School
Dana-Farber Cancer Institute
Dept. Biostatistics
375 Longwood St.
Boston MA 02115
USA
[email protected]

Ronghui Xu
University of California
San Diego
Department of Family
and Preventive Medicine
and Department of Mathematics
9500 Gilman Dr.
La Jolla CA 92093-0112
USA
[email protected]

ISBN: 978-0-387-69763-5 e-ISBN: 978-0-387-69765-9

DOI: 10.1007/978-0-387-69765-9

Library of Congress Control Number: 2008940562

© Springer Science+Business Media, LLC 2009

All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.

Printed on acid-free paper

springer.com

To our children, Anna, Sofia, and James

Preface

In an era with a plethora of high-throughput biological technologies, biomedical
researchers are investigating more comprehensive aspects of cancer with ever-finer
resolution. This results not only in a large amount of data but also in data with
hundreds if not thousands of dimensions.
Multivariate analysis is a mainstay of statistical tools in the analysis of biomedical
data. It is concerned with associating data matrices of n rows by p columns, with
rows representing samples or patients and columns attributes, to certain response
or outcome variables. Classically, the sample size n is much larger than the num-
ber of attributes p. The theoretical properties of statistical models have mostly been
discussed under the assumption of fixed p and infinite n. However, the advance of bi-
ological sciences and technologies has revolutionized the process of investigations
in cancer. The biomedical data collection has become much more automatic and
much more extensive. We are in the era of p as a large fraction of n, or even much
larger than n, which poses challenges to the classical statistical paradigm. Take pro-
teomics as an example. Although proteomic techniques have been researched and
developed for many decades to identify proteins or peptides uniquely associated
with a given disease state, until recently this has mostly been a laborious process,
carried out one protein at a time. The advent of high-throughput proteome-wide
technologies such as liquid chromatography-tandem mass spectrometry makes it possible
to generate proteomic signatures that facilitate rapid development of new strategies
for proteomics-based detection of disease. This poses new challenges and calls for
scalable solutions to the analysis of such high-dimensional data.
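To make the difficulty with p larger than n concrete, here is a minimal sketch in Python; it is not taken from this volume. The toy matrix X, the outcome y, and the helper names are hypothetical. With n = 2 samples and p = 3 attributes, two different coefficient vectors reproduce the outcomes exactly, so least squares alone cannot single out one of them.

```python
# Toy design matrix (hypothetical): n = 2 samples (rows), p = 3 attributes (columns).
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]


def predict(X, beta):
    """Return the fitted values X @ beta using plain Python lists."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


def residual_sum_of_squares(X, y, beta):
    """Sum of squared differences between outcomes and fitted values."""
    return sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, predict(X, beta)))


# Two distinct coefficient vectors, chosen by hand for this illustration.
beta_a = [1.0, 2.0, 0.0]
beta_b = [0.0, 1.0, 1.0]

rss_a = residual_sum_of_squares(X, y, beta_a)
rss_b = residual_sum_of_squares(X, y, beta_b)
print(rss_a, rss_b)
```

Both residual sums of squares are zero even though the coefficient vectors differ, which is one way to see why extra structure, such as the penalties listed under Chapter 2, is needed when p exceeds n.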
In this volume, we present current analytical approaches as well as systematic
strategies to the analysis of correlated and high-dimensional data.
The volume is intended as a reference book for researchers, statisticians, bioin-
formaticians, graduate students, and data analysts working in the field of cancer re-
search. Our aim is to present methodological topics of important relevance to such
analyses, and in a single volume such as this we do not attempt to exhaust all the
analytical tools that have been developed so far.
This volume contains seven chapters. They do not necessarily cover all topics relevant
to high-dimensional data analysis in cancer research. Instead, we have aimed
to choose those fields of research that are either relatively mature, but may not have
been well read in applied statistics, such as risk estimation, or those fields that are
fast developing and also have obtained substantial newer results that are reasonably
well understood for practical use, such as variable selection. On the other hand, we
have omitted such an important topic as multiple comparisons, which is currently
undergoing much theoretical development (as reflected in the August 2007 issue of
Annals of Statistics, for example), and we find it possibly difficult to provide an
accurate stationary yet updated picture for the moment. Such a topic, however, can
be found in several other recently published books that contain its classical results
ready for practical use. All the chapters included in this book contain practical ex-
amples to illustrate the analysis methods. In addition, they also reveal the types of
research that are involved in developing these methods.
The opening chapter provides an overview of the various high-dimensional data
sources, the challenges in analyzing such data, and in particular, strategies in the de-
sign phase, as well as possible future directions. Chapter 2 discusses methodologies
and issues surrounding variable selection and model building, including post-model
selection inference. These have always been important topics in statistical research,
and even more so in the analysis of high-dimensional data. Chapter 3 is devoted to
the topic of multivariate nonparametric regression. Multivariate problems are com-
mon in oncological research, and often the relationship between the outcome of
interest and its predictors is either nonlinear, or nonadditive, or both. This chapter
focuses on the methods of regression trees and spline models. Chapter 4 discusses
the more fundamental problem of risk estimation. This is the basis of many proce-
dures and, in particular, model selection. It reviews the two major approaches to risk
estimation, i.e., covariance penalty and resampling, and summarizes empirical eval-
uations of these approaches. Chapter 5 focuses on tree-based methods. After a brief
review of classification and regression trees (CART), the chapter presents in more
detail tree-based ensembles, including boosting and random forests. Chapter 6 is on
support vector machines (SVMs), one of the methodologies stemming from the machine
learning field that has gained popularity for classification of high-dimensional
data. The chapter discusses both two-class and multiclass classification problems,
and linear and nonlinear SVM. For high-dimensional data, a particularly important
aspect is sparse learning, that is, only a relatively small subset of the predictors are
truly involved with the classification boundary. Variable selection is then again a
critical step, and various approaches associated with SVM are described. The last,
but by no means the least, chapter presents Bayesian approaches to the analyses
of microarray gene expression data. The emphasis is on nonparametric Bayesian
methods, which allow flexible modeling of the data that might arise from underly-
ing heterogeneous mechanisms. Computational algorithms are discussed.
It has been an exciting experience editing this volume. We thank all the authors
for their excellent contributions.

Boston, MA Xiaochun Li
La Jolla, CA Ronghui Xu

Contents

1 On the Role and Potential of High-Dimensional Biologic Data in Cancer Research
Ross L. Prentice
1.1 Introduction
1.2 Potential of High-Dimensional Data in Biomedical Research
1.2.1 Background
1.2.2 High-Dimensional Study of the Genome
1.2.3 High-Dimensional Studies of the Transcriptome
1.2.4 High-Dimensional Studies of the Proteome
1.2.5 Other Sources of High-Dimensional Biologic Data
1.3 Statistical Challenges and Opportunities with High-Dimensional Data
1.3.1 Scope of this Presentation
1.3.2 Two-Group Comparisons
1.3.3 Study Design Considerations
1.3.4 More Complex High-Dimensional Data Analysis Methods
1.4 Needed Future Research
References

2 Variable Selection in Regression – Estimation, Prediction, Sparsity, Inference
Jaroslaw Harezlak, Eric Tchetgen, and Xiaochun Li
2.1 Overview of Model Selection Methods
2.1.1 Data Example
2.1.2 Univariate Screening of Variables
2.1.3 Subset Selection
2.2 Multivariable Modeling: Penalties/Shrinkage
2.2.1 Penalization
2.2.2 Ridge Regression and Nonnegative Garrote
2.2.3 LASSO: Definition, Properties and Some Extensions
2.2.4 Smoothly Clipped Absolute Deviation (SCAD)
2.3 Least Angle Regression
2.4 Dantzig Selector
2.5 Prediction and Persistence
2.6 Difficulties with Post-Model Selection Inference
2.7 Penalized Likelihood for Generalized Linear Models
2.8 Simulation Study
2.9 Application of the Methods to the Prostate Cancer Data Set
2.10 Conclusion
References

3 Multivariate Nonparametric Regression
Charles Kooperberg and Michael LeBlanc
3.1 An Example
3.2 Linear and Additive Models
3.2.1 Example Revisited
3.3 Interactions
3.4 Basis Function Expansions
3.5 Regression Tree Models
3.5.1 Background
3.5.2 Model Building
3.5.3 Backwards Selection (Pruning)
3.5.4 Example Revisited
3.5.5 Issues and Connections
3.6 Spline Models
3.6.1 One Dimensional
3.6.2 Higher-Dimensional Models
3.6.3 Example Revisited
3.7 Logic Regression
3.7.1 Example Revisited
3.8 High-Dimensional Data
3.8.1 Variable Selection and Shrinkage
3.8.2 LASSO and LARS
3.8.3 Dedicated Methods
3.8.4 Example Revisited
3.9 Survival Data
3.9.1 Example Revisited
3.10 Discussion
References

4 Risk Estimation
Ronghui Xu and Anthony Gamst
4.1 Risk
4.2 Covariance Penalty
4.2.1 Continuous Outcomes
4.2.2 Binary Outcomes
4.2.3 A Connection with AIC
4.2.4 Correlated Outcomes
4.2.5 Nuisance Parameters
4.3 Resampling Methods
4.3.1 Empirical Studies
4.4 Applications of Risk Estimation
4.4.1 SURE and Admissibility
4.4.2 Finite Sample Risk and Adaptive Regression Estimates
4.4.3 Model Selection
4.4.4 Gene Ranking
References

5 Tree-Based Methods
Adele Cutler, D. Richard Cutler, and John R. Stevens
5.1 Chapter Outline
5.2 Background
5.2.1 Microarray Data
5.2.2 Mass Spectrometry Data
5.2.3 Traditional Approaches to Classification and Regression
5.2.4 Dimension Reduction
5.3 Classification and Regression Trees
5.3.1 Example: Regression Tree for Prostate Cancer Data
5.3.2 Properties of Trees
5.4 Tree-Based Ensembles
5.4.1 Bagged Trees
5.4.2 Random Forests
5.4.3 Boosted Trees
5.5 Example: Prostate Cancer Microarrays
5.6 Software
5.7 Recent Research and Oncology Applications
References

6 Support Vector Machine Classification for High-Dimensional Microarray Data Analysis, With Applications in Cancer Research
Hao Helen Zhang
6.1 Classification Problems: A Statistical Point of View
6.1.1 Binary Classification Problems
6.1.2 Bayes Rule for Binary Classification
6.1.3 Multiclass Classification Problems
6.1.4 Bayes Rule for Multiclass Classification
6.2 Support Vector Machine for Two-Class Classification
6.2.1 Linear Support Vector Machines
6.2.2 Nonlinear Support Vector Machines
6.2.3 Regularization Framework for SVM
6.3 Support Vector Machines for Multiclass Problems
6.3.1 One-versus-the-Rest and Pairwise Comparison
6.3.2 Multiclass Support Vector Machines (MSVMs)
6.4 Parameter Tuning and Solution Path for SVM
6.4.1 Tuning Methods
6.4.2 Entire Solution Path for SVM
6.5 Sparse Learning with Support Vector Machines
6.5.1 Variable Selection for Binary SVM
6.5.2 Variable Selection for Multiclass SVM
6.6 Cancer Data Analysis Using SVM
6.6.1 Binary Cancer Classification for UNC Breast Cancer Data
6.6.2 Multi-type Cancer Classification for Khan’s Children Cancer Data
References

7 Bayesian Approaches: Nonparametric Bayesian Analysis of Gene Expression Data
Sonia Jain
7.1 Introduction
7.2 Bayesian Analysis of Microarray Data
7.2.1 EBarrays
7.2.2 Probability of Expression
7.3 Nonparametric Bayesian Mixture Model
7.4 Posterior Inference of the Bayesian Model
7.4.1 Gibbs Sampling
7.4.2 Split–Merge Markov Chain Monte Carlo
7.5 Leukemia Gene Expression Example
7.5.1 Leukemia Data
7.5.2 Implementation
7.5.3 Results
7.6 Discussion
References

Index
Contributors

Adele Cutler Department of Mathematics and Statistics, Utah State University,
3900 Old Main Hill Logan, UT 84322-3900, USA, [email protected]
Richard Cutler Department of Mathematics and Statistics, Utah State University,
3900 Old Main Hill Logan, UT 84322-3900, USA, [email protected]
Anthony Gamst Division of Biostatistics and Bioinformatics, Department of
Family and Preventive Medicine, University of California, San Diego, 9500 Gilman
Drive, MC0717 La Jolla, CA 92093-0717, USA, [email protected]
Jaroslaw Harezlak Division of Biostatistics, Indiana University School of
Medicine, 410 West 10th Street, Suite 3000 Indianapolis, IN 46202, USA,
[email protected]
Sonia Jain Division of Biostatistics and Bioinformatics, University of California,
San Diego, 9500 Gilman Drive, MC-0717, La Jolla, CA 92093-0717, USA,
[email protected]
Charles Kooperberg Division of Public Health Sciences, Fred Hutchinson Cancer
Research Center, 1100 Fairview Ave N, M3-A410 Seattle, WA 98109-1024, USA,
[email protected]
Michael LeBlanc Division of Public Health Sciences, Fred Hutchinson Cancer
Research Center, 1100 Fairview Ave N, M3-C102 Seattle, WA 98109-1024, USA,
[email protected]
Xiaochun Li Division of Biostatistics, Indiana University School of Medicine, 410
West 10th Street, Suite 3000 Indianapolis, IN 46202, USA, [email protected]
Ross L. Prentice Division of Public Health Sciences, Fred Hutchinson Cancer
Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA,
[email protected]
John R. Stevens Department of Mathematics and Statistics, Utah State University,
3900 Old Main Hill Logan, UT 84322-3900, USA, [email protected]
Eric Tchetgen Department of Biostatistics, Harvard School of Public Health, 655
Huntington Avenue, Boston, MA 02115, USA, [email protected]
Ronghui Xu Division of Biostatistics and Bioinformatics, Department of Family
and Preventive Medicine and Department of Mathematics, University of California,
San Diego, 9500 Gilman Drive, MC 0112 La Jolla, CA 92093-0112, USA,
[email protected]
Hao Helen Zhang Department of Statistics, North Carolina State University, 2501
Founders Drive Raleigh, NC 27613, USA, [email protected]

Chapter 1
On the Role and Potential of High-Dimensional Biologic Data in Cancer Research

Ross L. Prentice

1.1 Introduction

I am pleased to provide a brief introduction to this volume of “High-Dimensional
Data Analysis in Cancer Research”. The chapters to follow will focus on data analysis
aspects, particularly related to regression model selection and estimation with
high-dimensional data of various types. The methods described will have a major
emphasis on statistical innovations that are afforded by high-dimensional predictor
variables.
While many of the motivating applications and datasets for these analytic de-
velopments arise from gene expression data in therapeutic research contexts, there
are also important applications, and potential applications, in risk assessment,
early diagnosis and primary disease prevention research, as will be elaborated in
Section 1.2. With this range of applications as background, some preliminary com-
ments are made on related statistical challenges and opportunities (Section 1.3) and
on needed future developments (Section 1.4).

1.2 Potential of High-Dimensional Data in Biomedical Research

1.2.1 Background

The unifying goal of the various types of high-dimensional data being generated in
recent years is the understanding of biological processes, especially processes that
relate to disease occurrence or management. These may involve, for example,
characteristics such as single nucleotide polymorphisms (SNPs) across the genome to be
related to the risk of a disease; gene expression patterns in tumor tissue to be related
to the risk of tumor recurrence; or protein expression patterns in blood to be related
to the presence of an undetected cancer. Cutting across the biological processes
related to carcinogenesis, or other chronic disease processes, are high-dimensional
data related to treatment or intervention effects. These may include, for example,
study of changes in the plasma proteome as a result of an agent having chronic
disease prevention potential; or changes in gene expression in tumor tissue as a result
of exposure to a therapeutic regimen, especially a molecularly targeted regimen. It
is the confluence of novel biomarkers of disease development and treatment with
biomarker changes related to possible interventions that has great potential to
enhance the identification of novel preventative and therapeutic interventions.
Furthermore, biological markers that are useful for early disease detection open the
door to reduced disease mortality, using current or novel therapeutic modalities. The
technology available for these various purposes in human studies depends very much
on the type of specimens available for study, with white blood cells and their DNA
content, tumor tissue and its mRNA content, or blood serum or plasma and its
proteomic and metabolomic (small molecule) content, as important examples. The next
subsections will provide a brief overview of the technology for assessment of certain
key types of high-dimensional biologic data.

R.L. Prentice
Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview
Avenue North, Seattle, WA, USA
email: [email protected]

X. Li, R. Xu (eds.), High-Dimensional Data Analysis in Cancer Research, Applied
Bioinformatics and Biostatistics in Cancer Research, DOI 10.1007/978-0-387-69765-9_1,
© Springer Science+Business Media LLC 2009

1.2.2 High-Dimensional Study of the Genome

The study of genotype in relation to the risk of specific cancers or other chronic
diseases has traditionally relied heavily on family studies. Such studies often in-
volve families having a strong history of the study disease to increase the proba-
bility of harboring disease-related genes. A study may involve genotyping family
members for a panel of genetic markers and assessing whether one or more mark-
ers co-segregate with disease among family members. This approach uses the fact
that chromosomal segments are inherited intact, so that markers over some distance
from a disease-related gene can be expected to associate with disease risk within
families. Following the identification of a “linkage” signal with a genetic marker,
some form of fine mapping is needed to close in on disease-related loci. There are
many variations in ascertainment schemes and analysis procedures that may differ in
efficiency and robustness (e.g., Ott, 1991; Thomas, 2004) with case–control family
studies having a prominent role in recent years.
Markers that are sufficiently close on the genome tend to be correlated, depend-
ing somewhat on a person’s evolutionary history (e.g., Felsenstein, 2007). The iden-
tification of several million SNPs across the human genome (e.g., Hinds et al., 2005)
and the identification of tag SNP subsets (The International HapMap Consortium,
2003) that convey most genotype information as a result of such correlation (linkage
disequilibrium) have opened the way not only to family-based studies that involve a
very large number of genomic markers, but also to direct disease association studies
among unrelated individuals. For example, the latter type of study may simulta-
neously relate 100,000 or more tag SNPs to disease occurrence in a study cohort,
typically using a nested case–control or case-cohort design.
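The correlation between nearby markers (linkage disequilibrium) that makes tag SNP subsets possible is commonly summarized by the squared correlation r^2 between two biallelic markers. The following is a minimal sketch in Python, not taken from the chapter. The haplotype counts, the allele labels, and the function name ld_r_squared are hypothetical, and phased haplotypes are assumed to be available.

```python
from collections import Counter


def ld_r_squared(haplotypes):
    """Return r^2 between two biallelic markers from phased haplotypes.

    Each haplotype is a two-character string: the first character is the
    allele at marker 1 ('A' or 'a'), the second the allele at marker 2
    ('B' or 'b'). Both markers must be polymorphic in the sample,
    otherwise the denominator is zero.
    """
    counts = Counter(haplotypes)
    total = sum(counts.values())
    p_ab = counts["AB"] / total                   # frequency of the A-B haplotype
    p_a = (counts["AB"] + counts["Ab"]) / total   # frequency of allele A
    p_b = (counts["AB"] + counts["aB"]) / total   # frequency of allele B
    d = p_ab - p_a * p_b                          # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical sample of 100 haplotypes in which A tends to travel with B.
haplotypes = ["AB"] * 40 + ["ab"] * 40 + ["Ab"] * 10 + ["aB"] * 10
r2 = ld_r_squared(haplotypes)
print(round(r2, 4))
```

An r^2 near 1 means one marker carries nearly all the genotype information of the other, so only one of the pair needs to be typed; an r^2 near 0 means both would have to be typed.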
However, for this type of association study to be practical, there need to be
reliable, high-throughput genotyping platforms having acceptable costs. Satisfying
this need has been a major technology success story over the past few years, with
commercially available platforms (Affymetrix, Illumina) having 500,000–1,000,000
well-selected tagging SNPs, and genotyping costs reduced to a few hundred dollars
per specimen. These platforms, similar to the gene expression platforms that pre-
ceded them, rely on chemical coupling of DNA from target cells to labeled probes
having a specified sequence affixed to microarrays, and use photolithographic meth-
ods to assess the intensity of the label following hybridization and washing. In ad-
dition to practical cost, these platforms can accommodate the testing of thousands
of cases and controls in a research project in a matter of a few weeks or months.
The results of very high-dimensional SNP studies of this type have only recently
begun to emerge, usually from large cohorts or cohort consortia, in view of the large
sample sizes needed to rule out false positive associations. Novel genotype associa-
tions with disease risks have already been established for breast cancer (e.g., Easton
et al., 2007; Hunter et al., 2007) and prostate cancer (Amundadottir et al., 2006;
Freedman et al., 2006; Yeager et al., 2007), as well as for several other chronic dis-
eases (e.g., Samani et al., 2007, for coronary heart disease). Although it is early to
try to characterize findings, novel associations for complex common diseases tend
to be weak, and mostly better suited to providing insight into disease processes and
pathways, than to contributing usefully to risk assessment. The prostate cancer as-
sociations cited include well-established SNP associations that are not in proximity
to any known gene, providing the impetus for further study of genomic structure
and characteristics in relation to gene and protein expression.
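The arithmetic behind the need to rule out false positive associations can be sketched as follows. This is an illustration in Python under stated assumptions, not the chapter's own analysis: every SNP is assumed to be unassociated with disease, the platform size of 500,000 SNPs is the lower figure quoted above, and the Bonferroni correction is used only as a familiar example of a stringent per-test threshold. The function names are hypothetical.

```python
import math


def expected_false_positives(n_tests, alpha):
    """Expected number of tests below alpha when no test has a true effect."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha):
    """Per-test significance level that keeps the family-wise error rate
    at or below family_alpha."""
    return family_alpha / n_tests


n_snps = 500_000   # lower end of the platform sizes quoted in the text
alpha = 0.05       # conventional level, assumed for illustration

naive = expected_false_positives(n_snps, alpha)
per_test = bonferroni_threshold(n_snps, alpha)
print(naive, per_test)
```

Testing each SNP at the 0.05 level would be expected to flag about 25,000 SNPs by chance alone, whereas a per-test level of about 1e-7 is so strict that weak true associations can only be detected with large samples.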

1.2.3 High-Dimensional Studies of the Transcriptome

Studies of gene expression patterns in tumor tissue from cancer patients provided
some of the earliest uses of microarray technologies in biomedical research, and
constituted the setting that motivated much of the statistical design and analysis
development to date for high-dimensional data studies. Gene expression can be
assessed by the concentration of mRNA (transcripts) in cells, and many applications
to date have focused on studies of tumors or other tissue, often in a therapeutic
context. mRNA hybridizes with labeled probes on a microarray, with a
photolithographic assessment of transcript abundance through label intensity. A
microarray study may, for example, compare transcript abundance between two groups
for 10,000 or more human genes.
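
As a rough illustration of the two-group comparison just described, the sketch below computes one test statistic per gene and ranks the genes by its magnitude. It is a minimal example under stated assumptions: the gene names and intensity values are invented, and Welch's t statistic is used as one simple per-gene summary, not as the method prescribed by the text. A real study would repeat this over 10,000 or more genes and then adjust for multiple comparisons.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical log-scale intensities: gene -> (group 1 samples, group 2 samples).
expression = {
    "gene_A": ([7.1, 6.9, 7.0, 7.2], [7.0, 7.1, 6.8, 7.1]),
    "gene_B": ([5.0, 5.2, 4.9, 5.1], [6.4, 6.6, 6.5, 6.3]),
    "gene_C": ([9.3, 9.0, 9.4, 9.1], [9.2, 9.5, 9.0, 9.3]),
}

# One statistic per gene, then rank by absolute value (largest first).
t_stats = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}
ranked = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)
print(ranked)
```

Here gene_B, whose group means differ by well over one unit on the log scale, ranks first. With thousands of genes, some large statistics arise by chance, which is why the ranking alone is not a basis for inference.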
Studies of the transcription pattern of specific tumors provide a major tool for
assessing recurrence risk, and prognosis more generally, and for classifying patients
