Recursive Partitioning in the Health Sciences

Preface

Multiple complex pathways, characterized by interrelated events and
conditions, represent routes to many illnesses, diseases, and ultimately death.
Although there are substantial data and plausibility arguments supporting
many conditions as contributory components of pathways to illness and
disease end points, we have, historically, lacked an effective methodology
for identifying the structure of the full pathways. Regression methods, with
strong linearity assumptions and data-based constraints on the extent and
order of interaction terms, have traditionally been the strategies of choice
for relating outcomes to potentially complex explanatory pathways. How-
ever, nonlinear relationships among candidate explanatory variables are
a generic feature that must be dealt with in any characterization of how
health outcomes come about. Thus, the purpose of this book is to demonstrate
the effectiveness of a relatively recently developed methodology, recursive
partitioning, as a response to this challenge. We also compare and contrast
what is learned via recursive partitioning with results obtained on the same
data sets using more traditional methods. This serves to highlight exactly
where, and for what kinds of questions, recursive partitioning-based
strategies have a decisive advantage over classical regression techniques.
This book is suitable for three broad groups of readers: (1) biomedical
researchers, clinicians, and public health practitioners, including
epidemiologists, health service researchers, and environmental policy
advisers; (2) consulting
statisticians who can use the recursive partitioning technique as a guide
in providing effective and insightful solutions to clients' problems; and (3)
statisticians interested in methodological and theoretical issues. The book
provides an up-to-date summary of the methodological and theoretical
underpinnings of recursive partitioning. More interestingly, it presents a host
of unsolved problems whose solutions would advance the rigorous under-
pinnings of statistics in general.
From the perspective of the first two groups of readers, we demonstrate
with real applications the sequential interplay between automated produc-
tion of multiple well-fitting trees and scientific judgment leading to respec-
ification of variables, more refined trees subject to context-specific con-
straints (on splitting and pruning, for example), and ultimately selection
of the most interpretable and useful tree(s). The sections marked with
asterisks can be skipped by application-oriented readers.
We show a more conventional regression analysis, having the same objective
as the recursive partitioning analysis, side by side with the newer
methodology. In each example, we highlight the scientific insight derived
from the recursive partitioning strategy that is not readily revealed by more
conventional methods. The interfacing of automated output and scientific
judgment is illustrated with both conventional and recursive partitioning
analysis.
Theoretically oriented statisticians will find a substantial listing of chal-
lenging theoretical problems whose solutions would provide much deeper
insight than heretofore about the scope and limits of recursive partitioning
as such and multivariate adaptive splines in particular.
We emphasize the development of narratives to summarize the formal
Boolean statements that define routes down the trees to terminal nodes.
Particularly with trees that are complex by scientific necessity, narrative output
facilitates understanding and interpretation of what has been provided by
automated techniques.
We illustrate the sensitivity of trees to variation in choosing misclassi-
fication cost, where the variation is a consequence of divergent views by
clinicians of the costs associated with differing mistakes in prognosis.
The book of Breiman et al. (1984) is a classical work on the subject of
recursive partitioning. In Chapter 4, we reiterate the key ideas expressed
in that book and expand our discussions in different directions on the is-
sues that arise from applications. Other chapters on survival trees, adaptive
splines, and classification trees for multiple discrete outcomes are new de-
velopments since the work of Breiman et al.
Heping Zhang wishes to thank his colleagues and students, Joan Buen-
consejo, Theodore Holford, James Leckman, Ju Li, Robert Makuch, Kath-
leen Merikangas, Bradley Peterson, Norman Silliker, Daniel Zelterman, and
Hongyu Zhao among others, for their help with reading and commenting
on earlier drafts of this book. He is also grateful to Drs. Michael Bracken,
Dorit Carmelli, and Brian Leaderer for making their data sets available
for this book. This work was supported in part by NIH grant HD30712 to
Heping Zhang.
Contents

Preface

1 Introduction
1.1 Examples Using CART
1.2 The Statistical Problem
1.3 Outline of the Methodology

2 A Practical Guide to Tree Construction
2.1 The Elements of Tree Construction
2.2 Splitting a Node
2.3 Terminal Nodes
2.4 Download and Use of Software

3 Logistic Regression
3.1 Logistic Regression Models
3.2 A Logistic Regression Analysis

4 Classification Trees for a Binary Response
4.1 Node Impurity
4.2 Determination of Terminal Nodes
4.2.1 Misclassification Cost
4.2.2 Cost Complexity
4.2.3 Nested Optimal Subtrees*
4.3 The Standard Error of R^cv*
4.4 Tree-Based Analysis of the Yale Pregnancy Outcome Study
4.5 An Alternative Pruning Approach
4.6 Localized Cross-Validation
4.7 Comparison Between Tree-Based and Logistic Regression Analyses
4.8 Missing Data
4.8.1 Missings Together Approach
4.8.2 Surrogate Splits
4.9 Tree Stability
4.10 Implementation*

5 Risk-Factor Analysis Using Tree-Based Stratification
5.1 Background
5.2 The Analysis

6 Analysis of Censored Data: Examples
6.1 Introduction
6.2 Tree-Based Analysis for the Western Collaborative Group Study Data

7 Analysis of Censored Data: Concepts and Classical Methods
7.1 The Basics of Survival Analysis
7.1.1 Kaplan-Meier Curve
7.1.2 Log-Rank Test
7.2 Parametric Regression for Censored Data
7.2.1 Linear Regression with Censored Data*
7.2.2 Cox Proportional Hazard Regression
7.2.3 Reanalysis of the Western Collaborative Group Study Data

8 Analysis of Censored Data: Survival Trees
8.1 Splitting Criteria
8.1.1 Gordon and Olshen's Rule*
8.1.2 Maximizing the Difference
8.1.3 Use of Likelihood Functions*
8.1.4 A Straightforward Extension
8.2 Pruning a Survival Tree
8.3 Implementation
8.4 Survival Trees for the Western Collaborative Group Study Data

9 Regression Trees and Adaptive Splines for a Continuous Response
9.1 Tree Representation of Spline Model and Analysis of Birth Weight
9.2 Regression Trees
9.3 The Profile of MARS Models
9.4 Modified MARS Forward Procedure
9.5 MARS Backward-Deletion Step
9.6 The Best Knot*
9.7 Restrictions on the Knot*
9.7.1 Minimum Span
9.7.2 Maximal Correlation
9.7.3 Patches to the MARS Forward Algorithm
9.8 Smoothing Adaptive Splines*
9.8.1 Smoothing the Linearly Truncated Basis Functions
9.8.2 Cubic Basis Functions
9.9 Numerical Examples

10 Analysis of Longitudinal Data
10.1 Infant Growth Curves
10.2 The Notation and a General Model
10.3 Mixed-Effects Models
10.4 Semiparametric Models
10.5 Adaptive Spline Models
10.5.1 Known Covariance Structure
10.5.2 Unknown Covariance Structure
10.5.3 A Simulated Example
10.5.4 Reanalyses of Two Published Data Sets
10.5.5 Analysis of Infant Growth Curves
10.5.6 Remarks
10.6 Regression Trees for Longitudinal Data
10.6.1 Example: HIV in San Francisco

11 Analysis of Multiple Discrete Responses
11.1 Parametric Methods for Binary Responses
11.1.1 Log-Linear Models
11.1.2 Marginal Models
11.1.3 Parameter Estimation*
11.1.4 Frailty Models
11.2 Classification Trees for Multiple Binary Responses
11.2.1 Within-Node Homogeneity
11.2.2 Terminal Nodes
11.2.3 Computational Issues*
11.2.4 Parameter Interpretation*
11.3 Application: Analysis of BROCS Data
11.3.1 Background
11.3.2 Tree Construction
11.3.3 Description of Numerical Results
11.3.4 Alternative Approaches
11.3.5 Predictive Performance
11.4 Polytomous and Longitudinal Responses
11.5 Analysis of the BROCS Data via Log-Linear Models

12 Appendix
12.1 The Script for Running RTREE Automatically
12.2 The Script for Running RTREE Manually
12.3 The .inf File

References

Index
1
Introduction

Many scientific problems reduce to modeling the relationship between two
sets of variables. Regression methodology is designed to quantify these
relationships. Recursive partitioning is a statistical technique that forms
the basis for two classes of nonparametric regression methods: Classifica-
tion and Regression Trees (CART) and Multivariate Adaptive Regression
Splines (MARS). Although relatively new, these methods have a growing
number of applications, particularly in the health sciences, as a result of
increasing complexity of study designs and data structures.
Due to their mathematical simplicity, linear regression for continuous
data, logistic regression for binary data, proportional hazard regression for
censored survival data, and mixed-effect regression for longitudinal data
are among the most commonly used statistical methods. These parametric
regression methods, however, may not lead to faithful data descriptions
when the underlying assumptions are not satisfied. Sometimes, model in-
terpretation can be problematic in the presence of higher-order interactions
among potent predictors.
Nonparametric regression has evolved to relax or remove the restrictive
assumptions. In many cases, recursive partitioning is used to explore data
structures and to derive parsimonious models. The theme of this book is
to describe nonparametric regression methods built on recursive partition-
ing. While explaining the methodology in its entirety, we emphasize the
applications of these methods in the health sciences. Moreover, it should
become apparent from these applications that the resulting models have
very natural and useful interpretations, and the computation will be less
and less an issue. Specifically, we will see that the tree representations can
be stated as a string of hierarchical Boolean statements, facilitating
conversion of complex output to narrative form.
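
To make the idea of hierarchical Boolean statements concrete, the following is a minimal sketch, not the representation used by any particular software. The toy tree, its variable names (`age`, `prior_preterm`), its cutoffs, and its risk labels are all hypothetical and chosen only for illustration. Each route from the root to a terminal node is written out as one IF ... AND ... THEN statement.

```python
# Hypothetical toy tree: every variable name, cutoff, and label is invented.
TREE = {
    "test": ("age", ">", 35),
    "yes": {"leaf": "high risk"},
    "no": {
        "test": ("prior_preterm", "==", 1),
        "yes": {"leaf": "high risk"},
        "no": {"leaf": "low risk"},
    },
}

# Operator used to describe the branch where the test fails.
NEGATION = {">": "<=", "<=": ">", "==": "!=", "!=": "=="}


def routes(node, conditions=()):
    """Yield (conditions, label) for every route from the root to a terminal node."""
    if "leaf" in node:
        yield conditions, node["leaf"]
        return
    variable, operator, value = node["test"]
    met = f"{variable} {operator} {value}"
    not_met = f"{variable} {NEGATION[operator]} {value}"
    yield from routes(node["yes"], conditions + (met,))
    yield from routes(node["no"], conditions + (not_met,))


def narrative(conditions, label):
    """Join the conditions along one route into a single Boolean statement."""
    return "IF " + " AND ".join(conditions) + " THEN " + label


statements = [narrative(conds, label) for conds, label in routes(TREE)]
for statement in statements:
    print(statement)
```

Each printed line corresponds to one terminal node, so a tree with three terminal nodes yields three statements that can be reworded directly into narrative form.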
In Section 1.1 we give a number of examples for which recursive partition-
ing has been used to investigate a broad spectrum of scientific problems. In
Section 1.2 we formulate these scientific problems into a general regression
framework and introduce the necessary notation. To conclude this chapter,
we outline the contents of the subsequent chapters in Section 1.3.

1.1 Examples Using CART


Recursive partitioning has been applied to understand many clinical prob-
lems. The examples selected below are not necessarily fully representative,
but they give us some idea about the breadth of applications.
Example 1.1 Goldman et al. (1982, 1996) provided a classic example of
using CART. Their purpose was to build an expert computer system that
could assist physicians in emergency rooms to classify patients with chest
pain into relatively homogeneous groups within a few hours of admission
using the clinical factors available. This classification can help physicians
to plan for appropriate levels of medical care for patients based on their
classified group membership. The authors included 10,682 patients with
acute chest pain in the derivation data set and 4,676 in the validation data
set. The derivation data were used to set up a basic model frame, while the
validation data were utilized to justify the model and to conduct hypothesis
testing.
Example 1.2 Levy et al. (1985) carried out one of the early applications
of CART. To predict the outcome from coma caused by cerebral hypoxia-
ischemia, they studied 210 patients with cerebral hypoxia-ischemia and
considered 13 factors including age, sex, verbal and motor responses, and
eye opening movement. Several guidelines were derived to predict within
the first few days which patients would do well and which would do poorly.
Example 1.3 Mammalian sperm move in distinctive patterns, called
hyperactivated motility, during capacitation. Figure 1.1(a) is a circular
pattern of hyperactivated rabbit spermatozoa, and Figure 1.1(b) displays a
nonhyperactivated track. In general, hyperactivated motility is characterized
by a change from progressive movement to highly vigorous, nonprogressive
random motion. This motility is useful for the investigation of
sperm function and the assessment of fertility. For this reason, we must
establish a quantitative criterion that recognizes hyperactivated sperm in
a mixed population of hyperactivated and nonhyperactivated sperm. After
collecting 322 hyperactivated and 899 nonhyperactivated sperm, Young
and Bod (1994) derived a classification rule based on the wobble parameter
of motility and the curvilinear velocity, using CART. Their rule was shown
to have a lower misclassification rate than the commonly used ones that
were established by linear discriminant analysis.

FIGURE 1.1. Motility patterns for mammalian sperm. (a) Hyperactivated and
(b) nonhyperactivated.

Example 1.4 Important medical decisions are commonly made while substantial
uncertainty remains. Acute unexplained fever in infants is one
such frequently encountered problem. To make a correct diagnosis, it is
critical to utilize information efficiently, including medical history, physi-
cal examination, and laboratory tests. Using a sample of 1,218 childhood
extremity injuries seen in 1987 and 1988 by residents in family medicine
and pediatrics in the Rochester General Hospital Emergency Department,
McConnochie, Roghmann, and Pasternack (1993) demonstrated the value
of the complementary use of logistic regression and CART in developing
clinical guidelines.

Example 1.5 Birth weight and gestational age are strong predictors for
neonatal mortality and morbidity; see, e.g., Bracken (1984). In less devel-
oped countries, however, birth weight may not be measured for the first
time until several days after birth, by which time substantial weight loss
could have occurred. There are also practical problems in those countries in
obtaining gestational age because many illiterate pregnant women cannot
record the dates of their last menstrual period or calculate the duration of
gestation. Given these considerations, Raymond et al. (1994) selected
843 singleton infants born at a referral hospital in Addis Ababa, Ethiopia,
in 1987 and 1988 and applied CART to build a practical screening tool
based on neonatal body measurements that are presumably more stable
than birth weight. Their study suggests that head and chest circumfer-
ences may predict adequately the risk of low birth weight (less than 2,500
grams) and preterm (less than 37 weeks of gestational age) delivery.

Example 1.6 Head injuries cause about a half million patient hospitaliza-
tions in the United States each year. As a result of the injury, victims often
suffer from persistent disabilities. It is of profound clinical importance to
make early prediction of long-term outcome so that the patient, the family,
and the physicians have sufficient time to arrange a suitable rehabilitation
plan. Moreover, this outcome prediction can also provide useful informa-
tion for assessing the treatment effect. Using CART, Choi et al. (1991) and
Temkin et al. (1995) have developed prediction rules for long-term outcome
in patients with head injuries on the basis of 514 patients. Those rules are
simple and accurate enough for clinical practice.

1.2 The Statistical Problem


Examples 1.1-1.6 can be summarized into the same statistical problem as
follows. They all have an outcome variable, Y, and a set of p predictors,
x_1, ..., x_p. The number of predictors, p, varies from example to example.
The x's will be regarded as fixed variables, and Y is a random variable.
In Example 1.3, Y is a dichotomous variable representing either hyperactivated
or nonhyperactivated sperm. The x's include the wobble parameter
of motility and the curvilinear velocity. Obviously, not all predictors appear
in the prediction rule. Likewise, the x's and Y can be easily identified for
the other examples. The statistical problem is to establish a relationship
between Y and the x's so that it is possible to predict Y based on the values
of the x's. Mathematically, we want to estimate the conditional probability
of the random variable Y,

\[ P\{Y = y \mid x_1, \ldots, x_p\} \tag{1.1} \]

or a functional of this probability such as the conditional expectation

\[ E\{Y \mid x_1, \ldots, x_p\}. \tag{1.2} \]

Since Examples 1.1-1.6 involve dichotomous Y (0 or 1), the conditional
expectation in (1.2) coincides with the conditional probability in (1.1) with
y = 1. In such circumstances, logistic regression is commonly used, assuming
that the conditional probability (1.1) is of a specific form,

\[ \frac{\exp\left(\beta_0 + \sum_{i=1}^{p} \beta_i x_i\right)}{1 + \exp\left(\beta_0 + \sum_{i=1}^{p} \beta_i x_i\right)}, \tag{1.3} \]

where the $\beta$'s are parameters to be estimated.
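
As a small illustration of the logistic form (1.3), the sketch below evaluates the conditional probability that Y = 1 for given parameters and predictor values. The function name `logistic_probability` and the numerical coefficients are hypothetical, chosen only for illustration; in practice the parameters are estimated from data. The two branches are algebraically identical to (1.3) and are split only to avoid numerical overflow for large linear predictors.

```python
import math


def logistic_probability(beta0, betas, xs):
    """Evaluate the logistic form (1.3): exp(eta) / (1 + exp(eta)),
    where eta = beta0 + sum_i beta_i * x_i."""
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))
    if eta >= 0:
        # Same quantity as exp(eta) / (1 + exp(eta)), written to avoid overflow.
        return 1.0 / (1.0 + math.exp(-eta))
    return math.exp(eta) / (1.0 + math.exp(eta))


# Hypothetical coefficients for two predictors (not taken from any study).
probability = logistic_probability(-2.0, [0.03, 1.1], [30.0, 1.0])
print(round(probability, 3))
```

Whatever the parameter values, the result always lies strictly between 0 and 1, which is what makes (1.3) a convenient specific form for a conditional probability.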


In the ordinary linear regression, the conditional probability in (1.1) is
assumed to be a normal density function,

\[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right], \tag{1.4} \]

where the mean, $\mu$, equals the conditional expectation in (1.2) and is of a
hypothesized expression

\[ \mu = \beta_0 + \sum_{i=1}^{p} \beta_i x_i. \tag{1.5} \]

The $\sigma^2$ in (1.4) is an unknown variance parameter. We use $N(\mu, \sigma^2)$ to
denote the normal distribution corresponding to the density in (1.4).

TABLE 1.1. Correspondence Between the Uses of Classic Approaches and
Recursive Partitioning Technique in This Book

Type of response    Parametric methods                Recursive partitioning technique
Continuous          Ordinary linear regression        Regression trees and adaptive splines
                                                      in Chapter 9
Binary              Logistic regression               Classification trees
                    in Chapter 3                      in Chapter 4
Censored            Proportional hazard regression    Survival trees
                    in Chapter 7                      in Chapter 8
Longitudinal        Mixed-effects models              Regression trees and adaptive splines
                    in Chapter 10                     in Chapter 10
Multiple discrete   Exponential, marginal,            Classification trees,
                    and frailty models                all in Chapter 11

In contrast to these models, recursive partitioning is a nonparametric
technique that does not require a specified model structure like (1.3) or
(1.5). In the subsequent chapters, the outcome Y may represent a censored
measurement or a correlated set of responses. We will cite more examples
accordingly.
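
To give a first feel for what working without a specified model structure means, here is a minimal sketch of recursive partitioning on one predictor and a 0/1 outcome. It is not the algorithm developed later in this book: the impurity measure (Gini, 2p(1 - p)), the depth limit, the function names, and the toy data are all assumptions made for illustration. The data are split at the cutoff that makes the two resulting groups as homogeneous as possible, and the same search is then repeated within each group.

```python
def gini(ys):
    """Gini impurity 2p(1 - p) of a list of 0/1 outcomes (an assumed criterion)."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys):
    """Return (cutoff, weighted impurity) of the best split 'x <= cutoff',
    or None when all x values are equal and no split is possible."""
    values = sorted(set(xs))
    best = None
    for low, high in zip(values, values[1:]):
        cutoff = (low + high) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= cutoff]
        right = [y for x, y in zip(xs, ys) if x > cutoff]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (cutoff, score)
    return best


def grow(xs, ys, depth=0, max_depth=2):
    """Recursively partition the data; terminal nodes report size and event rate."""
    split = None if gini(ys) == 0.0 or depth == max_depth else best_split(xs, ys)
    if split is None:
        return {"n": len(ys), "rate": sum(ys) / len(ys)}
    cutoff = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x <= cutoff]
    right = [(x, y) for x, y in zip(xs, ys) if x > cutoff]
    return {
        "cutoff": cutoff,
        "left": grow([x for x, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": grow([x for x, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


# Hypothetical toy data: the outcome is 1 only for intermediate values of x,
# a pattern that a single linear term in (1.3) cannot represent.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 0, 0, 1, 1, 1, 0, 0]
tree = grow(toy_x, toy_y)
print(tree)
```

No functional form relating Y to x is written down anywhere in this sketch; the cutoffs are read off the data, which is the sense in which the technique is nonparametric.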

1.3 Outline of the Methodology


In this book, we will describe both classic (mostly parametric) and modern
statistical techniques as complementary tools for the analysis of data in
the health sciences. The five types of response variables listed in Table 1.1
cover the majority of the data that arise from health-related studies. Table
1.1 conforms with the content of this book. Thus, it is not a complete list
of methods that are available in the literature.
Chapter 2 is a practical guide to tree construction, focusing on the statis-
tical ideas and scientific judgment. Technical details are deferred to Chapter
4, where methodological issues involved in classification trees are discussed
in depth. We refer to Breiman et al. (1984) for further elaboration. Section
4.2.3 on Nested Optimal Subtrees is relatively technical and may be difficult
for some readers, but the rest of Chapter 4 is relatively straightforward.
Technical differences between classification trees and regression trees are
minimal. After elucidating classification trees in Chapter 4, we introduce
regression trees briefly, but sufficiently, in Section 9.2, focusing on
the differences. To further demonstrate the use of classification trees, we
report a stratified tree-based risk factor analysis of spontaneous abortion
in Chapter 5.
Chapters 6 to 8 cover the analysis of censored data. Chapter 6 is
a shortcut to the output of survival trees. Chapter 7 presents classical
methods of survival analysis before the exposition of survival trees in
Chapter 8.
Chapter 11 on classification trees for multiple binary responses is nearly
parallel to survival trees from a methodological point of view. Thus, they
can be read separately depending on the different needs of readers.
We start a relatively distinct topic in Chapter 9 that is fundamental to
the understanding of adaptive regression splines and should be read before
Chapter 10, where the use of adaptive splines is further expanded.
Before discussing the trees and splines approaches, we will describe their
parametric counterparts and explain how to use these more standard mod-
els. We view it as important to understand and appreciate the parametric
methods even though the main topic of this book is recursive partitioning.
2
A Practical Guide to Tree
Construction

We introduce the basic ideas associated with recursive partitioning in the
context of a specific scientific question: Which pregnant women are most at
risk of preterm deliveries? Particular emphasis is placed on the interaction
between scientific judgment by investigators and the production of infor-
mative intermediate-stage computer output that facilitates the generation
of the most sensible recursive partitioning trees.
The illustrative database is the Yale Pregnancy Outcome Study, a project
funded by the National Institutes of Health and led by
Dr. Michael B. Bracken at Yale University. The study subjects
were women who made a first prenatal visit to a private obstetrics or mid-
wife practice, health maintenance organization, or hospital clinic in the
greater New Haven, Connecticut, area between May 12, 1980, and March
12, 1982, and who anticipated delivery at the Yale-New Haven Hospital.
For illustration, we take a subset of 3,861 women from this database by
selecting those women whose pregnancies ended in a singleton live birth
and who met the eligibility criteria for inclusion as specified in detail by
Bracken et al. (1986) and Zhang and Bracken (1995).
Preterm delivery will be the outcome variable of interest. Based on the
extant literature, Zhang and Bracken (1995) considered 15 variables a priori
as candidates to be useful in representing routes to preterm delivery. The
variables are listed in Table 2.1.
