Recursive Partitioning and Applications
Second Edition

Advisors: P. Bickel, P. Diggle, S. Fienberg, U. Gather, I. Olkin, S. Zeger

Heping Zhang
Department of Epidemiology and Public Health
Yale University School of Medicine
60 College Street
New Haven, Connecticut 06520-8034, USA
[email protected]

Burton H. Singer
Emerging Pathogens Institute
University of Florida
PO Box 100009
Gainesville, FL 32610, USA

ISSN 0172-7397
ISBN 978-1-4419-6823-4    e-ISBN 978-1-4419-6824-1
DOI 10.1007/978-1-4419-6824-1
Springer New York Dordrecht Heidelberg London
Contents

Preface  vii

1 Introduction  1
  1.1 Examples Using CART  3
  1.2 The Statistical Problem  6
  1.3 Outline of the Methodology  7

3 Logistic Regression  23
  3.1 Logistic Regression Models  23
  3.2 A Logistic Regression Analysis  24

13 Appendix  227
  13.1 The Script for Running RTREE Automatically  227
  13.2 The Script for Running RTREE Manually  229
  13.3 The .inf File  233

References  237
Index  256
1
Introduction
FIGURE 1.1. Motility patterns for mammalian sperm. (a) Hyperactivated and
(b) nonhyperactivated
FIGURE 1.2. Classification tree for colon cancer diagnosis based on gene ex-
pression data. Inside each node are the number of tumor (C) and normal (N)
tissues
assessing the treatment effect. Using CART, Choi et al. (1991) and Temkin
et al. (1995) have developed prediction rules for long-term outcome in pa-
tients with head injuries on the basis of 514 patients. Those rules are simple
and accurate enough for clinical practice.
cation, Alfaro Cortés et al. (2007) used classification trees and an AdaBoost
algorithm to predict corporate failure.
Example 1.9 Chemical Compounds
Recursive partitioning has been employed to aid drug development. To
screen large chemical databases in corporate collections and chemical li-
braries, Chen et al. (1998) used recursive partitioning to develop three-
dimensional pharmacophores that can guide database screening, chemical
library design, and lead optimization. They encoded the three-dimensional
features of chemical compounds into bit strings, which were then selected
to predict the biological activities of the compounds.
Example 1.10 Musical Audio
As a part of a large-scale interdisciplinary MAMI project (Musical Audio
MIning project) conducted at Ghent University, Martens (2002) attempted
to extract the tonal context from a polyphonic musical audio signal and
to convert this information into a meaningful character sequence. First,
a musical signal is decomposed into different sub-bands and represented as
neural patterns by an auditory peripheral module. This process eventually
converts the musical signal into a real vector in a 69-dimensional space, which
is the predictor space. The class label represents one of the 24 keys (the
12 major and 12 minor keys in the so-called Shepard chords). We used 120
synthesized sounds, 5 from each of the following: Shepard sequences; Bass
sequences, sampled from a Yamaha QS300 synthesizer; Piano sequences,
sampled from a Yamaha QS300 synthesizer; Strings sequences, sampled
from a Yamaha QS300 synthesizer; and Dance-lead sequences, sampled
from a Waldorf Micro Q synthesizer. Then, a classification tree is used in
the conversion of 120 synthesized sounds into a character string.
IE{Y | x1 , . . . , xp }. (1.2)
TABLE 1.1. Correspondence Between the Uses of Classic Approaches and Re-
cursive Partitioning Technique in This Book
FIGURE 2.1. An illustrative tree structure. x1 is age and x13 is the amount of
alcohol drinking. Circles and dots represent different outcomes.
The root node contains a sample of subjects from which the tree is grown.
Those subjects constitute the so-called learning sample, which can be the
entire study sample or a subset of it. For our example, the
root node contains all 3,861 pregnant women who were the study subjects of
the Yale Pregnancy Outcome Study. All nodes in the same layer constitute
a partition of the root node. The partition becomes finer and finer as the
layer gets deeper and deeper. Therefore, every node in a tree is merely a
subset of the learning sample.
Figure 2.1(b) illustrates a hypothetical situation. Let a dot denote a
preterm delivery and a circle stand for a term delivery. The two coordinates
represent two covariates, x1 (age) and x13 (the amount of alcohol drinking),
as defined in Table 2.1. We can draw two line segments to separate the
dots from the circles and obtain three disjoint regions: (I) x13 ≤ c2 ; (II)
x13 > c2 and x1 ≤ c1 ; and (III) x13 > c2 and x1 > c1 . Thus, partition I
is not divided by x1 , and partitions I and II are identical in response but
derived differently from x1 and x13 .
In the same figure, panel (a) is a tree representation of this separation.
First, we put both the dots and the circles into the root node. The two
arrows below the root node direct a dot or circle to terminal node I or the
internal node in the second layer, depending on whether or not x13 ≤ c2 .
Those with x13 > c2 are further directed to terminal nodes II and III based
on whether or not x1 ≤ c1 . Hence, the nodes in panel (a) correspond to the
regions in panel (b). When we draw a line to separate a region, it amounts
to partitioning a node in the tree. The precise maps between regions I to
III and terminal nodes I to III, respectively, can be found in Figure 2.1.
The aim of recursive partitioning is to end up with the terminal nodes
that are homogeneous in the sense that they contain either dots or circles.
We accomplished this goal in this artificial example. We should note that
the two internal nodes are heterogeneous because they contain both dots
and circles. Given that the dots and circles represent preterm and term
deliveries, respectively, Figure 2.1 would suggest that all pregnant women
older than a certain age and drinking more than a certain amount of alcohol
daily deliver preterm infants. Consequently, this would demonstrate a hy-
pothetically ideal association of preterm delivery with the age and alcohol
consumption of the pregnant women.
Complete homogeneity of terminal nodes is an ideal that is rarely realized
in real data analysis. Thus, the realistic objective of partitioning is to make
the outcome variables in the terminal nodes as homogeneous as possible.
A quantitative measure of the extent of node homogeneity is the notion of
node impurity. The simplest operationalization of the idea is
whether or not age is more than 35 years (i.e., x1 > 35). In general, for
an ordinal (e.g., times of using marijuana) or a continuous (e.g., caffeine
intake) predictor, xj , the number of allowable splits is one fewer than the
number of its distinctly observed values. For instance, there are 153 different
levels of daily caffeine intake ranging from 0 to 1273 mg in the 3,861 study
subjects. Thus, we can split the root node in 152 different ways based on
the amount of caffeine intake.
What happens to nominal predictors is slightly more complicated. In
Table 2.1, x3 denotes 5 ethnic groups that do not have a particular order.
Table 2.2 lays out 2^(5−1) − 1 = 15 allowable splits from this ethnicity vari-
able. Generally, any nominal variable that has k levels contributes 2^(k−1) − 1
allowable splits.
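As a quick illustration of this counting rule, the sketch below tallies the allowable binary splits contributed by each predictor in a data set. The data frame and the column names are hypothetical stand-ins; only the rule itself follows the text: an ordered predictor with v distinct observed values contributes v − 1 splits, and a nominal predictor with k levels contributes 2^(k−1) − 1 splits.

```python
import pandas as pd

def count_allowable_splits(df: pd.DataFrame, nominal: set) -> dict:
    """Count allowable binary splits for each predictor in df.

    Ordered predictors contribute (number of distinct values - 1) splits;
    nominal predictors with k levels contribute 2**(k - 1) - 1 splits.
    """
    counts = {}
    for col in df.columns:
        k = df[col].nunique(dropna=True)
        if col in nominal:
            counts[col] = 2 ** (k - 1) - 1
        else:
            counts[col] = k - 1
    return counts

# Hypothetical illustration: a caffeine variable with 153 distinct values and
# an ethnicity variable with 5 unordered groups reproduce the counts in the text.
example = pd.DataFrame({
    "caffeine": range(153),                      # 153 distinct levels -> 152 splits
    "ethnicity": ([0, 1, 2, 3, 4] * 31)[:153],   # 5 groups -> 15 splits
})
print(count_allowable_splits(example, nominal={"ethnicity"}))
```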
Adding together the numbers of allowable splits from the 15 predictors
in Table 2.1, we have 347 possible ways to divide the root node into two
subnodes. Depending on the number of the predictors and the nature of
the predictors, the total number of the allowable splits for the root node
varies, though it is usually not small. The basic question to be addressed
now is: How do we select one or several preferred splits from the pool of
allowable splits?
Before selecting the best split, we must define the goodness of a split.
What we want is a split that results in two pure (or homogeneous) daughter
nodes. However, in reality the daughter nodes are usually partially homo-
geneous. Therefore, the goodness of a split must weigh the homogeneities
(or the impurities) in the two daughter nodes. If we take age as a tentative
                           Term    Preterm
Left Node (τL), x1 ≤ c     n11     n12       n1·
Right Node (τR), x1 > c    n21     n22       n2·
                           n·1     n·2
where τ is the parent of τL and τR , and IP {τL } and IP {τR } are respectively
the probabilities that a subject falls into nodes τL and τR . At present,
IP {τL } can be replaced with n1· /(n1· +n2· ) and IP {τR } with n2· /(n1· +n2· ).
The criterion (2.3) measures the degree of reduction in the impurity by
going from the parent node to the daughter nodes.
To appreciate these concepts in more detail, let us go through a concrete
example. If we take c = 35 as the age threshold, we have a 2 × 2 table
                   Term    Preterm
Left Node (τL)     3521      198     3719
Right Node (τR)     135        7      142
                   3656      205     3861
Then, i(τL ) in (2.1) equals
Variable    x1    x2    x3    x4    x5    x6    x7    x8
1000ΔI      1.5   2.8   4.0   0.6   0.6   3.2   0.7   0.6

Variable    x9    x10   x11   x12   x13   x14   x15
1000ΔI      0.7   0.2   1.8   1.1   0.5   0.8   1.2
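The computation behind these goodness-of-split values is easy to reproduce for the worked example above. The sketch below assumes the entropy impurity i(τ) = −p log p − (1 − p) log(1 − p), with p the proportion of preterm deliveries in the node and natural logarithms (this choice reproduces the root-node impurity of 0.2075 used later in Chapter 4), and evaluates ΔI for the tentative age split at c = 35 from the 2 × 2 counts given earlier.

```python
import numpy as np

def entropy_impurity(n_event: int, n_total: int) -> float:
    """Entropy impurity of a node: -p*log(p) - (1-p)*log(1-p)."""
    p = n_event / n_total
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def goodness_of_split(left, right):
    """Impurity reduction Delta I = i(parent) - P(L) i(L) - P(R) i(R).

    `left` and `right` are (n_preterm, n_total) pairs for the daughter nodes.
    """
    n_l, n_r = left[1], right[1]
    n = n_l + n_r
    i_parent = entropy_impurity(left[0] + right[0], n)
    return (i_parent
            - (n_l / n) * entropy_impurity(*left)
            - (n_r / n) * entropy_impurity(*right))

# Tentative split of the root node at age <= 35 versus age > 35.
delta_i = goodness_of_split(left=(198, 3719), right=(7, 142))
print(round(1000 * delta_i, 2))   # a small value; c = 35 is not the best age cut
```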
FIGURE 2.2. Node splitting, recursive partitioning process. Node 1 is split into
nodes 2 and 3 and then node 2 into nodes 4 and 5.
the study sample into teenagers and adults. This is tantamount to selecting
the second-best split by our numerical criterion while using scientific judg-
ment in a decisive manner to overrule the automated procedure. We view
this kind of interactive process as fundamentally important in producing
the most interpretable trees.
This best or preferred age split is used to compete with the best (or re-
spectively preferred) splits from the other 14 predictors. Table 2.4 presents
the greatest numerical goodness of split for all predictors. We see that the
best of the best comes from the race variable with 1000ΔI = 4.0, i.e.,
ΔI = 0.004. This best split divides the root node according to whether
a pregnant woman is Black or not. This partition is illustrated in Figure
2.2(a), where the root node (number 1) is split into nodes 2 (Black) and 3
(non-Black).
After splitting the root node, we continue to divide its two daughter
nodes. The partitioning principle is the same. For example, to further di-
vide node 2 in Figure 2.2(b) into nodes 4 and 5, we repeat the previous
partitioning process with a minor adjustment. That is, the partition uses
only 710 Black women, and the remaining 3,151 non-Black women are put
aside. The pool of allowable splits is nearly intact except that race does not
contribute any more splits, as everyone is now Black. So, the total num-
ber of allowable splits decreases from 347 to at most 332. The decreasing
trend of the number of allowable splits is noteworthy, although it is not
necessary for us to be concerned with the precise counts here. After the
split of node 2, we have three nodes (numbers 3, 4, and 5) ready to be
split. In the same way, we can divide node 3 in Figure 2.2(b) as we did for
node 2. But remember that this time we consider only the 3,151 non-Black
women. Furthermore, there are potentially 2^(4−1) − 1 = 7 race splits because
the category of non-Black women comprises Whites, Hispanics, Asians, and
other ethnic groups. Hence, there can be as many as 339 allowable splits
for node 3. One important message is that an offspring node may use the
same splitting variable as its ancestors. After we finish node 3, we go on
to nodes 4 and 5, and so on. This is the so-called recursive partitioning
process. Because we partition one node into two nodes only, the resulting
tree is called a binary tree.
A further interpretive consideration arises in the splitting process when
the top 2, 3, or even more variables have goodness of split values within
several significant digits of each other. Here are at least two scenarios to
consider:
It can also happen that none of the variables provides genuine improvement
in classification at a given splitting opportunity. However, on substantive
grounds, it may make sense to include one or two of the variables linked to
the parent node in question. It is often useful to force such a variable into
the tree with a judgmentally selected cut point, and continue the splitting
process from the new daughter node. You frequently find a very good split
at the next step. The equivalent of this hand-tailored step is not part of
current automated splitting algorithms, as it would require them to look
two steps ahead at particular splitting opportunities. The central point is
that hand-cultivation of trees, as we are discussing it here, is an important
aspect of recursive partitioning for single trees, as well as for production of
forests.
This book focuses on binary trees. Readers who are interested in
multiway trees are referred to Kass (1980), Quinlan (1993), Kim and Loh
(2001), and the references therein. C4.5, CHAID, and CRUISE are the
names of programs that implement the methods in those articles, respec-
tively. Briefly, C4.5 creates a binary or multiway split according to the type
of the split variable. The split is binary for an ordinal variable. The split
is M -way if the variable is categorical with M levels. For example, when
race is considered to split the root node in Figure 2.2, it would yield five
daughter nodes, one for each of the five racial groups. Obviously, there are
situations where it may not be necessary to have a daughter node for every
level of the categories, and it makes sense to collapse some of the levels or
in effect some of the daughter nodes. This is exactly what CHAID attempts
to accomplish as a revision to C4.5. CRUISE is another program that pro-
duces the same number of daughter nodes as the number of the levels of a
categorical variable, but it tries to control the favoritism toward variables
with more allowable splits. In addition, SPSS has a decision tree procedure
that can grow both binary and multiway trees (http://www.spss.com).

[Figure 2.3: a hypothetical tree with a three-way split of the root node on
extraversion, followed by splits on gender, relationship quality, and education.]
Although the existing methods accommodate multiway splits for cate-
gorical variables only, it may be useful to allow multiway splits for ordinal
variables. For example, Gruenewald et al. (2008) introduced applications
of regression trees to examine diverse pathways to positive and negative
affect in adulthood and later life. They considered a candidate set of nine
sociodemographic (gender, marital status, educational level), personality
(extraversion and neuroticism), and contextual (work stress, relationship
quality, financial control, health status) variables. Based on their under-
standing, they presented a hypothetical tree structure as displayed in Fig-
ure 2.3 that includes a three-way split of the root node (i.e., node 1) based
on the score of extraversion. Motivated by this practical example, it will
be a very useful project to develop trees that allow multiway splits for any
variables. Instead of doing so arbitrarily, it would be wise to set a limit
on the maximum number of allowable ways to split a node for any given
variable. In addition, a penalty factor should be considered when selecting
the final choice of the node split so that multiway splits are not unfairly
favored over binary splits.
In theory, we can convert between binary and multiway trees by further
splitting or merging of nodes, but in practice, due to the fact that different
criteria and different priorities are used in selecting the splits, the end
products are generally different. So far, the distinction between binary and
multiway trees is usually drawn at the conceptual level or in terms of
interpretation, and there is not enough literature to assess the performance
of the two classes of trees. We refer to Kim and Loh (2001) for a relatively
recent discussion.
FIGURE 2.4. The computer-selected tree structure. N: sample size; NPT: number
of preterm cases.
FIGURE 2.5. The final tree structure. N: sample size; RR: relative risk estimated
by cross-validation; CI: 95% confidence interval; NPT: number of preterm cases.
3
Logistic Regression
We have seen from Examples 1.1–1.6 that the status of many health con-
ditions is represented by a binary response. Because of its practical im-
portance, analyzing a binary response has been the subject of countless
works; see, e.g., the books of Cox and Snell (1989), Agresti (1990), and the
references therein. For comparison purposes, we give a brief introduction
to logistic regression.
where
β = (β0 , β1 , . . . , βp )
is the new (p + 1)-vector of parameters to be estimated and (xi1 , . . . , xip )
are the values of the p covariates included in the model for the ith subject
(i = 1, . . . , n).
To estimate β, we make use of the likelihood function
L(\beta; y) = \prod_{i=1}^{n} \left[ \frac{\exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})}{1 + \exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})} \right]^{y_i} \left[ \frac{1}{1 + \exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})} \right]^{1 - y_i}

          = \frac{\prod_{y_i = 1} \exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})}{\prod_{i=1}^{n} \left[ 1 + \exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}) \right]}.
\frac{\theta_i}{1 - \theta_i} = \exp\Big( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \Big).
Consider two individuals i and k for whom xi1 = 1, xk1 = 0, and xij = xkj
for j = 2, . . . , p. Then, the odds ratio for subjects i and k to be abnormal
is
\frac{\theta_i / (1 - \theta_i)}{\theta_k / (1 - \theta_k)} = \exp(\beta_1).
Taking the logarithm of both sides, we see that β1 is the log odds ratio of the
response resulting from two such subjects when their first covariate differs
by one unit and the other covariates are the same. In the health sciences,
exp(β1 ) is referred to as the adjusted odds ratio attributed to x1 while
controlling for x2 , . . . , xp . The remaining β’s have similar interpretations.
This useful interpretation may become invalid, however, in the presence of
interactive effects among covariates.
includes all predictors in Table 2.1 as main effects and use the backward
stepwise procedure to select variables that have significant (at the level
of 0.05) main effects. Recall that preterm delivery is our response variable.
For the selected variables, we then consider their second-order interactions.
In Table 2.1, three predictors, x2 (marital status), x3 (race), and x12
(hormones/DES use), are nominal and have five levels. To include them in
logistic regression models, we need to create four (dichotomous) dummy
variables for each of them. For instance, Table 2.1 indicates that the five
levels for x2 are currently married, divorced, separated, widowed, and never
married. Let
    z1 = 1 if a subject was currently married, 0 otherwise;
    z2 = 1 if a subject was divorced, 0 otherwise;
    z3 = 1 if a subject was separated, 0 otherwise;
    z4 = 1 if a subject was widowed, 0 otherwise.
Likewise, let
    z5 = 1 for a Caucasian, 0 otherwise;
    z6 = 1 for an African-American, 0 otherwise;
    z7 = 1 for a Hispanic, 0 otherwise;
    z8 = 1 for an Asian, 0 otherwise.
and
    z9 = 1 if a subject's mother did not use hormones or DES, 0 otherwise;
    z10 = 1 if a subject's mother used hormones only, 0 otherwise;
    z11 = 1 if a subject's mother used DES only, 0 otherwise;
    z12 = 1 if a subject's mother used both hormones and DES, 0 otherwise.
Note here that the subject refers to a pregnant woman. Thus, z9 through
z12 indicate the history of hormones and DES uses for the mother of a
pregnant woman.
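The same dummy coding can be produced mechanically in code. The sketch below is a minimal illustration with pandas; the data frame, column names, and level labels are hypothetical stand-ins for the marital status, race, and hormones/DES variables above, and one reference level of each nominal variable is dropped so that a k-level variable yields k − 1 indicators.

```python
import pandas as pd

# Hypothetical data frame with three nominal predictors.
df = pd.DataFrame({
    "marital":  ["married", "divorced", "separated", "widowed", "never"],
    "race":     ["White", "Black", "Hispanic", "Asian", "Other"],
    "hormones": ["none", "hormones", "DES", "both", "uncertain"],
})

# One dummy per level, mirroring z1-z12 in the text.
dummies = pd.get_dummies(df, columns=["marital", "race", "hormones"])

# Drop the reference level of each variable explicitly.
dummies = dummies.drop(columns=["marital_never", "race_Other",
                                "hormones_uncertain"])
print(dummies.columns.tolist())
```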
Due to missing information, 1,797 of the 3,861 observations are not used
in the backward deletion step by SAS PROC LOGISTIC. Table 3.1 provides
the key information for the model that is selected by the backward stepwise
procedure. In this table as well as the next two, the first column refers to
the selected predictors, and the second column is the degrees of freedom
(DF). The third column contains the estimated coefficients corresponding
to the selected predictors, followed by the standard errors of the estimated
coefficients. The last column gives the p-value for testing whether or not
each coefficient is zero. We should note that our model selection used each
dummy variable as an individual predictor in the model. As a consequence,
the selected model may depend on how the dummy variables are coded. Al-
ternatively, one may want to include or exclude a chunk of dummy variables
that are created for the same nominal variable.
The high proportion of the removed observations due to the missing in-
formation is an obvious concern. Note that the model selection is based on
the observations with complete information in all predictors even though
fewer predictors are considered in later steps. We examined the distribu-
tion of missing data and removed x7 (employment) and x8 (smoking) from
further consideration because they were not selected in the first place and
they contained most of the missing data. After this strategic adjustment,
only 24 observations are removed due to missing data, and the backward
deletion process produces another set of variables as displayed in Table 3.2.
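The backward deletion itself can also be emulated outside of SAS. The sketch below is a simplified stand-in for PROC LOGISTIC's backward option rather than a reproduction of it: assuming a pandas data frame with a binary outcome column and numeric (or already dummy-coded) predictors, it repeatedly drops the least significant term until every remaining Wald p-value is below 0.05. As in the analysis above, complete cases are determined once from the full initial predictor list.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, outcome: str, alpha: float = 0.05):
    """Backward stepwise logistic regression based on Wald p-values."""
    predictors = [c for c in df.columns if c != outcome]
    data = df.dropna(subset=[outcome] + predictors)  # complete-case analysis
    y = data[outcome]
    while predictors:
        X = sm.add_constant(data[predictors])
        fit = sm.Logit(y, X).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:        # every remaining term is significant
            return fit, predictors
        predictors.remove(worst)        # drop the least significant term
    return None, []
```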
We have considered the main effects, and next we examine possible
(second-order) interactions between the selected variables. For the two se-
lected dummy variables, we include their original variables, race and hor-
mones/DES uses, into the backward stepwise process to open our eyes a
little wider. It turns out that none of the interaction terms are significant
at the level of 0.05. Thus, the final model includes the same four variables
as those in Table 3.2. However, the estimates in Table 3.2 are based on
3,837 (i.e., 3861 − 24) observations with complete information for 13 pre-
dictors. Table 3.3 presents the information for the final model for which
only 3 observations are removed due to missing information in the four
selected variables. The different numbers of used observations explain the
minor numerical discrepancy between Tables 3.2 and 3.3.
From Table 3.3, we see that the odds ratio for a Black woman (z6 ) to
deliver a premature infant is doubled relative to that for a White woman,
because the corresponding odds ratio equals exp(0.699) ≈ 2.013. The use
of DES by the mother of the pregnant woman (z10 ) has a significant and
enormous effect on the preterm delivery. Years of education (x6 ), however,
seems to have a small, but significant, protective effect. Finally, the number
of previous pregnancies (x11 ) has a significant, but low-magnitude negative
effect on the preterm delivery.
We have witnessed in our analysis that missing data may lead to serious
loss of information. As a potential consequence, we may end up with im-
precise or even false conclusions. For example, by reviewing Tables 3.1 and
3.3, we realize that x1 is replaced with x11 in Table 3.3 and the estimated
coefficients for the remaining three predictors are notably different. The
difference could be more dramatic if we had a smaller sample. Therefore,
precaution should be taken in the presence of missing data. In Section 4.8,
we will see that the tree-based method handles the missing data efficiently
by either creating a distinct category for the missing value or using surro-
gate variables. These strategies prevent the tragic consequence of missing
data.
Although it is not frequently practiced, we find it useful and important to
evaluate the predictive performance of the final logistic model. To this end,
we make use of ROC (receiver operating characteristic) curves (see, e.g.,
Hanley, 1989). We know that we cannot always make perfect classifications
or predictions for the outcome of interest. For this reason, we want to
FIGURE 3.1. ROC curve for the final logistic regression model
make as few mistakes as possible. Two kinds of mistakes can occur: we may
predict a normal condition as abnormal or an ill-conditioned outcome as
normal. To distinguish them, statisticians refer to these mistakes as
type I and type II errors, respectively. In medical decision making, they are
called false-positive and false-negative diagnoses, respectively. In reasonable
settings, these errors oppose each other. That is, reducing the rate of one
type of error elevates the rate of the other type of error. ROC curves reflect
both rates and quantify the accuracy of the prediction through a graphical
presentation.
For subject i, we estimate her risk of having preterm delivery by
\hat{\theta}_i = \frac{\exp(-2.344 - 0.076 x_{i6} + 0.699 z_{i6} + 0.115 x_{i,11} + 1.539 z_{i,10})}{1 + \exp(-2.344 - 0.076 x_{i6} + 0.699 z_{i6} + 0.115 x_{i,11} + 1.539 z_{i,10})},   (3.3)
i = 1, . . . , 3861, using the estimates in Table 3.3. For any risk threshold
r (0 ≤ r ≤ 1), we calculate the empirical true and false-positive probabili-
ties respectively as
TPP = \frac{\text{the number of preterm deliveries for which } \hat{\theta}_i > r}{\text{the total number of preterm deliveries}}

and

FPP = \frac{\text{the number of term deliveries for which } \hat{\theta}_i > r}{\text{the total number of term deliveries}}.
As r varies continuously, the trace of (TPP, FPP) constitutes the ROC
curve as shown in Figure 3.1. In the medical literature, the true positive
and negative probabilities are commonly referred to as sensitivity and speci-
ficity.
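The empirical ROC curve can be traced directly from these definitions. The sketch below assumes a vector of fitted risks (for example, the θ̂_i from equation (3.3)) and the observed binary outcomes; it computes (TPP, FPP) over a grid of thresholds, and plotting TPP against FPP then yields a curve like the one in Figure 3.1. The simulated data at the end are purely illustrative.

```python
import numpy as np

def roc_points(risk: np.ndarray, y: np.ndarray, n_grid: int = 101):
    """Empirical (FPP, TPP) pairs over a grid of risk thresholds r.

    TPP = P(risk > r | y = 1), FPP = P(risk > r | y = 0).
    """
    thresholds = np.linspace(1.0, 0.0, n_grid)   # descending, so FPP increases
    tpp = np.array([(risk[y == 1] > r).mean() for r in thresholds])
    fpp = np.array([(risk[y == 0] > r).mean() for r in thresholds])
    return fpp, tpp

# Hypothetical usage with simulated risks and outcomes.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.05, size=3861)
risk = np.clip(0.05 + 0.03 * y + rng.normal(0, 0.02, size=3861), 0, 1)
fpp, tpp = roc_points(risk, y)
print(float(np.trapz(tpp, fpp)))   # area under the empirical ROC curve
```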
Figure 3.1 indicates that the final logistic regression model improves the
predictive precision over a random prediction model. The latter predicts
the risk of 1 and 0 by tossing a fair coin. The ROC curve for this random
prediction is featured by the dotted straight line. It is evident from Figure
3.1 that a great deal of variation is not explained and hence that further
improvement should be sought.
Note also that the ROC curve is drawn from the resubstitution estimate
of the risk, which tends to be optimistic in the sense that the ROC curve
may have an upward-biased area. The reason is as follows. The prediction
in (3.3) was derived to “maximize” the area under the ROC curve based
on the Yale Pregnancy Outcome Study data. If we conduct another simi-
lar, independent study, which we call a validation study, it is almost sure
that we will end up with an optimal prediction that differs from equation
(3.3), although the difference may not be substantial. The other side of the
coin is that if we make predictions for the subjects in the validation study
from equation (3.3), the quality of the prediction is usually downgraded as
compared to the prediction made for the original Yale Pregnancy Outcome
Study. In some applications, validation studies are available, e.g., Goldman
et al. (1982, 1996). In most cases, investigators have only one set of data. To
assess the quality of the prediction, certain sample reuse techniques such
as the cross-validation procedure are warranted (e.g., Efron, 1983). The
cross-validation procedure will be heavily used in this book, specifically in
Chapters 4 and 9–12. The basic idea is that we build our models using part
of the available data and reserve the left-out observations to validate the
selected models. This is a way to create an artificial validation study at
the cost of reducing the sample size for estimating a model. The simplest
strategy is to cut the entire sample into two pieces of equal size. While
one piece is used to build a model, the other piece tests the model. It is
a sample reuse mechanism because we can alternate the roles for the two
pieces of sample.
4
Classification Trees for a Binary
Response
[Figure: impurity functions φ(p), namely the entropy, the minimum error, and
the Gini index, plotted against p on (0, 1).]
where the function φ has the properties (i) φ ≥ 0 and (ii) for any p ∈ (0, 1),
φ(p) = φ(1 − p) and φ(0) = φ(1) < φ(p).
Common choices of φ include
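the minimum (Bayes) error φ(p) = min(p, 1 − p), the entropy φ(p) = −p log p − (1 − p) log(1 − p), and the Gini index φ(p) = p(1 − p) (sometimes scaled by a factor of 2); each satisfies properties (i) and (ii). A minimal sketch of these three functions follows; natural logarithms are assumed for the entropy, matching the numerical values used later in this chapter.

```python
import numpy as np

def phi_min_error(p: float) -> float:
    """Minimum (Bayes) error impurity."""
    return min(p, 1.0 - p)

def phi_entropy(p: float) -> float:
    """Entropy impurity with natural logarithms."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def phi_gini(p: float) -> float:
    """Gini index impurity (unscaled version)."""
    return p * (1.0 - p)

# All three are symmetric about p = 1/2 and vanish at p = 0 and p = 1.
for phi in (phi_min_error, phi_entropy, phi_gini):
    assert abs(phi(0.2) - phi(0.8)) < 1e-12
```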
selected from the population. Then, using Bayes’ theorem, the prevalence
rate within a node τ is
IP\{Y = 1 \mid \tau\} = \frac{IP\{Y = 1, \tau\}}{IP\{\tau\}}
                      = \frac{IP\{Y = 1\}\, IP\{\tau \mid Y = 1\}}{IP\{Y = 1\}\, IP\{\tau \mid Y = 1\} + IP\{Y = 0\}\, IP\{\tau \mid Y = 0\}},
where, marginally, IP {Y = 1} = 1 − IP {Y = 0} = 0.004. The conditional
probabilities IP {τ | Y = 1} and IP {τ | Y = 0} can be estimated from the
data. The former is the conditional probability for a random subject to fall
into node τ given that the subject’s response is 1. The latter conditional
probability has a similar interpretation. Suppose that 30 of the 100 cases
and 50 of the 100 controls fall into node τ. Then, IP {τ | Y = 1} = 30/100 =
0.3 and IP {τ | Y = 0} = 50/100 = 0.5. Putting together these figures, we
obtain
IP\{Y = 1 \mid \tau\} = \frac{0.004 \times 0.3}{0.004 \times 0.3 + 0.996 \times 0.5} = 0.0024.
The criterion first defined in (2.1) and again in (4.3) has another in-
terpretation. This different view is helpful in generalizing the tree-based
methods for various purposes. Suppose that Y in node τL follows a bino-
mial distribution with a frequency of θ, namely,
IP {Y = 1 | τL } = θ.
Then, the log-likelihood function from the n1· observations in node τL is
n11 log(θ) + n12 log(1 − θ).
The maximum of this log-likelihood function is
n_{11} \log\Big( \frac{n_{11}}{n_{1\cdot}} \Big) + n_{12} \log\Big( \frac{n_{12}}{n_{1\cdot}} \Big),
which is proportional to (2.1). In light of this fact, many node-splitting
criteria originate from the maximum of certain likelihood functions. The
importance of this observation will be appreciated in Chapters 9 and 11.
So far, we only made use of an impurity function for node splitting. There
are also alternative approaches. In particular, it is noteworthy to mention
the twoing rule (Breiman et al., 1984, p. 38) that uses a different measure
for the goodness of a split as follows:
\frac{IP\{\tau_L\}\, IP\{\tau_R\}}{4} \Big[ \sum_{j=0,1} \big| IP\{Y = j \mid \tau_L\} - IP\{Y = j \mid \tau_R\} \big| \Big]^2.
For a binary response, this twoing rule coincides with the use of the Gini
index. It has been observed that this rule has an undesirable end-cut prefer-
ence problem (Morgan and Messenger, 1973 and Breiman et al., 1984, Ch.
11): It gives preference to the splits that result in two daughter nodes of
extremely unbalanced sizes. To resolve this problem, a modification, called
the delta splitting rule, has been adopted in both the THAID (Messenger
and Mandell 1972 and Morgan and Messenger 1973) and CART (Breiman
et al. 1984) programs. Other split functions may also suffer from this prob-
lem, but our observations seem to indicate that the Gini index is more
problematic.
predict whether a patient with chest pain suffers from a serious heart disease
based on the information available within a few hours of admission. To this
end, we first classify a node τ to either class 0 (normal) or 1 (abnormal),
and we predict the outcome of an individual based on the membership of
the node to which the individual belongs. Unfortunately, we always make
mistakes in such a classification, because some of the normal subjects will be
predicted as diseased and vice versa. For instance, Figure 3.1 pinpoints the
predictive performance of a logistic regression model to these false-positive
errors and false-negative errors. In any case, to weigh these mistakes, we
need to assign misclassification costs.
Let us take the root node in Figure 2.2(b). In this root node, there
are 205 preterm and 3656 term deliveries. If we assign class 1 for the root
node, 3656 normal subjects are misclassified. In this case, we would wrongly
predict normal subjects to be abnormal, and false-positive errors occur.
On the other hand, we misclassify the 205 abnormal subjects if the root
node is assigned class 0. These are false-negative errors. If what matters
is the count of the false-positive and the false-negative errors, we would
assign class 0 for the root node, because we then make fewer mistakes.
This naive classification, however, fails to take into account the seriousness
of the mistakes. For example, when we classify a term delivery as preterm,
the baby may receive “unnecessary” special care. But if a preterm baby is
thought to be at term, the baby may not get needed care. Sometimes, a
mistake could be fatal, such as a false-negative diagnosis of heart failure.
In most applications, the false-negative errors are more serious than the
false-positive errors. Consequently, we cannot simply count the errors. The
two kinds of mistakes must be weighted.
Let c(i|j) be a unit misclassification cost that a class j subject is classified
as a class i subject. When i = j, we have the correct classification and
the cost should naturally be zero, i.e., c(i|i) = 0. Since i and j take only
the values of 0 or 1, without loss of generality we can set c(1|0) = 1. In
other words, one false-positive error counts as one. The clinicians and the
statisticians need to work together to gauge the relative cost of c(0|1). This
is a subjective and difficult, but important, decision. Later, in Section 4.5
we will introduce an alternative pruning procedure that avoids this decision.
Here, for the purpose of illustration, we take a range of values between
1 and 18 for c(0|1). For the reasons cited above, we usually assume that
c(0|1) ≥ c(1|0). The upper bound 18 is based on the fact that 3656 : 205 =
17.8 : 1. Note that 3656 and 205 are the numbers of term and preterm
deliveries in the root node, respectively. Table 4.1 reports the misclassifica-
tion costs for the five nodes in Figure 2.2(b) when these nodes are assumed
either as class 0 or as class 1.
For example, when c(0|1) = 10, it means that one false-negative error
counts as many as ten false-positive ones. We know that the cost is 3656
if the root node is assigned class 1. It becomes 205 × 10 = 2050 if the root
node is assigned class 0. Therefore, the root node should be assigned class
0, since 2050 < 3656. In other words, the class membership of 0 or 1 for a
node depends on whether or not the cost of the false-positive errors is lower
than that of the false-negative errors. Formally, node τ is assigned class j
if
\sum_{i} c(j \mid i)\, IP\{Y = i \mid \tau\} \;\le\; \sum_{i} c(1 - j \mid i)\, IP\{Y = i \mid \tau\}.   (4.6)
Denote the left-hand side of (4.6) by r(τ ), which is the expected cost result-
ing from any subject within the node. This cost is usually referred to as the
within-node misclassification cost. It appears less confusing, however, to call
it the conditional misclassification cost. Multiplying r(τ ) by IP {τ }, we have
the unconditional misclassification cost of the node, R(τ ) = IP {τ }r(τ ). In
the following discussions, the misclassification cost of a node implies the
unconditional definition, and the within-node misclassification cost means
the conditional one.
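The assignment rule (4.6) and the node costs r(τ) and R(τ) are easy to compute from the node counts. The sketch below (the function name is ours) assumes unit costs c(1|0) = 1 and a user-chosen c(0|1), estimates IP{Y = i | τ} by the within-node proportions, and reproduces the root-node comparison above: with c(0|1) = 10, assigning class 0 costs 2050 units while assigning class 1 costs 3656 units.

```python
def node_class_and_cost(n_term: int, n_preterm: int, n_total_root: int,
                        c01: float, c10: float = 1.0):
    """Assign a node's class by rule (4.6) and return (class, r(tau), R(tau)).

    n_term, n_preterm: counts of Y = 0 and Y = 1 subjects in the node;
    c01 = c(0|1): cost of calling a preterm (Y = 1) delivery term;
    c10 = c(1|0): cost of calling a term (Y = 0) delivery preterm.
    """
    n = n_term + n_preterm
    p0, p1 = n_term / n, n_preterm / n
    cost_if_class0 = c01 * p1            # all Y = 1 subjects are misclassified
    cost_if_class1 = c10 * p0            # all Y = 0 subjects are misclassified
    label = 0 if cost_if_class0 <= cost_if_class1 else 1
    r_tau = min(cost_if_class0, cost_if_class1)    # within-node (conditional) cost
    R_tau = (n / n_total_root) * r_tau             # unconditional cost of the node
    return label, r_tau, R_tau

label, r_tau, R_tau = node_class_and_cost(3656, 205, 3861, c01=10)
print(label, round(r_tau * 3861), round(R_tau * 3861))  # 0, 2050, 2050
```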
Earlier in this section, we mentioned the possibility of using r(τ ) to split
nodes. This proves to be inconvenient in the present case, because it is
usually difficult to assign the cost function before any tree is grown. As
a matter of fact, the assignment can still be challenging even when a tree
profile is given. Moreover, there is abundant empirical evidence that the
use of an impurity function such as the entropy generally leads to useful
trees with reasonable sample sizes. We refer to Breiman et al. (1984) for
some examples.
Having defined the misclassification cost for a node and hence a tree,
we face the issue of estimating it. In this section, we take c(0|1) = 10, for
example. The process is the same with regard to other choices of c(0|1).
According to Table 4.1, we can estimate the misclassification costs for nodes
1 to 5 in Figure 2.2(b). As reported in Table 4.2, these estimates are called
resubstitution estimates of the misclassification cost.
Let Rs (τ ) denote the resubstitution estimate of the misclassification cost
for node τ. Unfortunately, the resubstitution estimates generally underes-
timate the cost in the following sense. If we have an independent data set,
we can assign the new subjects to various nodes of the tree and calculate
the cost based on these new subjects. This cost tends to be higher than
the resubstitution estimate, because the split criteria are somehow related
to the cost, and as a result, the resubstitution estimate of misclassification
cost is usually overoptimistic. In some applications, such an independent
data set, called a test sample or validation set, is available; see, e.g., Gold-
man et al. (1982, 1996). To obtain unbiased cost estimates, sample reuse
procedures such as cross-validation are warranted.
4.2.2 Cost–Complexity
Although the concept of misclassification cost has its own merit, a major
use of it in the tree context is to select a “right-sized” subtree, namely,
to determine the terminal nodes. For example, in Figure 2.2, panel (a)
represents a subtree of the tree in panel (b). Because a tree (or subtree)
gives an integrated picture of nodes, we concentrate here on how to estimate
the misclassification cost for a tree. This motivation leads to a very critical
concept in the tree methodology: tree cost-complexity. It is defined as
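For reference, the cost-complexity of a tree T used by CART (Breiman et al., 1984), which is consistent with the numerical examples below (each terminal node contributes its misclassification cost plus a penalty of α), is

R_\alpha(T) \;=\; R(T) + \alpha\,|\widetilde{T}| \;=\; \sum_{\tau \in \widetilde{T}} R(\tau) + \alpha\,|\widetilde{T}|,

where \widetilde{T} denotes the set of terminal nodes of T, |\widetilde{T}| is its size, and α ≥ 0 is the complexity parameter.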
Theorem 4.1 (Breiman et al. 1984, Section 3.3) For any value of the com-
plexity parameter α, there is a unique smallest subtree of T0 that minimizes
the cost-complexity.
This theorem ensures that we cannot have two subtrees of the smallest
size and of the same cost-complexity. We call this smallest subtree the
optimal subtree with respect to the complexity parameter. For example,
when α = 0, the optimal subtree is T0 itself. Why? Note that T0 has two
additional subtrees. One, denoted by T1 , is plotted in Figure 2.2(a) and its
cost-complexity is 0.166 + 0.350 + 0 ∗ 2 = 0.516. The other subtree, call it
T2 , contains only the root node, and its cost-complexity is 0.531 + 0 ∗ 1 =
0.531. We see that both 0.516 and 0.531 are greater than 0.495. In general,
however, the optimal subtree corresponding to α = 0 may not be the initial
tree.
We can always choose α large enough that the corresponding optimal
subtree is the single-node tree. In fact, when α ≥ 0.018, T2 (the root node
tree) becomes the optimal subtree, because
and
FIGURE 4.2. Construction of nested optimal trees. Inside each node are the node
number (top) and the units of the misclassification cost (bottom). Next to the
node are the number of abnormal (top) and normal (bottom) subjects in the
node.
that the once pruned subtree in Figure 2.4(b) plays the role of an initial
tree. We knew from our previous discussion that α2 = 0.018 and its optimal
subtree is the root-node tree. No more thresholds need to be found from
here, because the root-node tree is the smallest one.
In general, suppose that we end up with m thresholds,
where T_{α_1} ⊃ T_{α_2} means that T_{α_2} is a subtree of T_{α_1}. In particular, T_{α_m}
is the root-node subtree. These are so-called nested optimal subtrees. The
final subtree will be selected from among them.
The construction of the nested optimal subtrees proves the following
useful result:
To pave the road for the final selection, what we need is a good estimate of
R(Tαk ) (k = 0, 1, . . . , m), namely, the misclassification costs of the subtrees.
We will select the one with the smallest misclassification cost.
When a test sample is available, estimating R(T ) for any subtree T is
straightforward, because we only need to apply the subtrees to the test
sample. Difficulty arises when we do not have a test sample. The cross-
validation process is generally used by creating artificial test samples. The
idea will be described shortly.
Before describing the cross-validation process, we may find it helpful to
recall what we have achieved so far. Beginning with a learning sample,
we can construct a large tree by recursively splitting the nodes. From this
large tree, we then compute a sequence of complexity parameters {α_k}, k = 0, 1, . . . , m,
and their corresponding optimal subtrees {T_{α_k}}.
The first step of cross-validation is to divide the entire study sample into
a number of pieces, usually 5, 10, or 25 corresponding to 5-, 10-, or 25-fold
cross-validation, respectively. Here, let us randomly divide the 3861 women
in the Yale Pregnancy Outcome Study into five groups: 1 to 5. Group 1
has 773 women and each of the rest contains 772 women. Let L(−i) be the
sample set including all but those subjects in group i, i = 1, . . . , 5.
Using the 3088 women in L(−1) , we can surely produce another large
tree, say T(−1) , in the same way as we did using all 3861 women. Take each
αk from the sequence of complexity parameters as has already been derived
above and obtain the optimal subtree, T(−1),k , of T(−1) corresponding to
αk . Then, we would have a sequence of the optimal subtrees of T(−1) , i.e.,
ple. Let Ci,k be the misclassification cost incurred for the ith subject while
it was a testing subject and the classification rule was based on the kth
subtree, i = 1, . . . , n, k = 0, 1, . . . , m. Then,
R^{cv}(T_{\alpha_k}) = \sum_{j=0,1} IP\{Y = j\}\, \bar{C}_{k|j},   (4.10)

where \bar{C}_{k|j} is the average of C_{i,k} over the set S_j of the subjects whose
response is j (i.e., Y = j). Namely,

\bar{C}_{k|j} = \frac{1}{n_j} \sum_{i \in S_j} C_{i,k},   (4.11)

and it follows from (4.10) and (4.11) that the heuristic standard error for
R^{cv}(T_{\alpha_k}) is given by

SE_k = \Big\{ \sum_{j=0,1} \Big( \frac{IP\{Y = j\}}{n_j} \Big)^2 \Big( \sum_{i \in S_j} C_{i,k}^2 - n_j \bar{C}_{k|j}^2 \Big) \Big\}^{1/2}.   (4.12)
reported in Figure 4.4. The variation between the estimates from 5- and
10-fold cross-validations seems to suggest that the standard error given in
(4.12) may be slightly underestimated. A more thorough examination may
be done by repeating the cross-validation procedure a number of times and
computing the empirical estimates of standard error.
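Given the per-subject costs C_{i,k}, formulas (4.10)-(4.12) translate directly into code. The sketch below assumes a cost matrix C of shape (n, m + 1), a response vector y, and marginal class probabilities IP{Y = 0} and IP{Y = 1} supplied as a dictionary (estimated from the sample or given externally); it returns the cross-validated cost and its heuristic standard error for each subtree.

```python
import numpy as np

def cv_cost_and_se(C: np.ndarray, y: np.ndarray, prior: dict):
    """Cross-validated misclassification cost (4.10) and its SE (4.12).

    C[i, k] is the cost incurred by subject i when it served as a test
    subject and the k-th subtree was used for classification; prior maps
    each class j in {0, 1} to IP{Y = j}.
    """
    n_subtrees = C.shape[1]
    r_cv = np.zeros(n_subtrees)
    var = np.zeros(n_subtrees)
    for j in (0, 1):
        Cj = C[y == j, :]                  # rows of the subjects with Y = j
        nj = Cj.shape[0]
        cbar = Cj.mean(axis=0)             # (4.11): average cost per subtree
        r_cv += prior[j] * cbar            # (4.10)
        var += (prior[j] / nj) ** 2 * ((Cj ** 2).sum(axis=0) - nj * cbar ** 2)
    return r_cv, np.sqrt(var)              # (4.12)
```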
Figure 4.4 indicates that the 1-SE rule selects the root-node subtree. The
interpretation is that the risk factors considered here may not have enough
predictive power to stand out and pass the cross-validation. This statement
is obviously relative to the selected unit cost C(0|1) = 10. For instance,
when we used C(0|1) = 18 and performed a 5-fold cross-validation, the
final tree had a similar structure to the one presented in Figure 4.2 except
that node 2 should be a terminal node. When the purpose of the analysis
is exploratory, we may prune a tree using alternative approaches. See the
next section for the details.
[Figure 4.5: a classification tree with nodes numbered 1 through 10.]
internal node. For example, we have the following 2 × 2 table for the root
node:
              Term    Preterm
Left Node      640       70      710
Right Node    3016      135     3151
              3656      205     3861
The relative risk of preterm delivery in the left node relative to the right node is
(70/710)/(135/3151) = 2.3.
Then, the logarithm of the relative risk is 0.833. Note also that the standard
error for the log relative risk is approximately
\sqrt{1/70 - 1/710 + 1/135 - 1/3151} = 0.141.
See Agresti (1990, p. 56). Hence, the Studentized log relative risk is
0.833/0.141 = 5.91.
This Studentized log relative risk will be used as the raw statistic for the
root node. Likewise, we can calculate the raw statistics for all internal nodes
as reported in Table 4.4.
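These node statistics are straightforward to reproduce. A minimal sketch for a single internal node, using the root-node counts above (the function name is ours); it returns approximately the relative risk 2.3, the log relative risk 0.833, its standard error 0.141, and the Studentized value 5.91.

```python
import numpy as np

def studentized_log_relative_risk(a, n_left, c, n_right):
    """Relative risk of the left versus right node and its Studentized log.

    a / n_left  : preterm count and total in the left node;
    c / n_right : preterm count and total in the right node.
    """
    rr = (a / n_left) / (c / n_right)
    log_rr = np.log(rr)
    se = np.sqrt(1 / a - 1 / n_left + 1 / c - 1 / n_right)
    return rr, log_rr, se, log_rr / se

print(studentized_log_relative_risk(70, 710, 135, 3151))
# approximately (2.30, 0.833, 0.141, 5.91)
```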
Next, for each internal node we replace the raw statistic with the max-
imum of the raw statistics over its offspring internal nodes if the latter is
greater. For instance, the raw value 1.52 is replaced with 1.94 for node 4;
here, 1.94 is the maximum of 1.47, 1.35, 1.94, 1.60, corresponding to nodes
7, 8, 9, and 10. The reassigned maximum node statistic is displayed in the
third row of Table 4.4. We see that the maximum statistic has seven distinct
values: 1.60, 1.69, 1.94, 2.29, 3.64, 3.72, and 5.91, each of which results in a
subtree. Thus, we have a sequence of eight (7+1) nested subtrees, including
the original tree in Figure 4.5. The seven subtrees are presented in Figure
4.6.
Figure 4.7 plots the sizes of the eight subtrees against their node statis-
tics. If we use 1.96 as the threshold (corresponding to a significance level
of 0.05), tree 3 would be chosen as the final tree. Also, Figure 4.7 seems
to suggest a kink at tree 3 or 4. Interestingly, tree 4 was selected by the
pruning procedure introduced in Section 4.2.3 when c(0|1) = 18. See the
discussion at the end of Section 4.4. Therefore, the alternative approach
described here may be used as a guide to the cost selection of c(0|1).
En route to determining the final tree structure, we must remember that
interpretations are of paramount importance. Note that the tree selection is
based on a resubstitution estimate of relative risk. This estimate is poten-
tially biased upward, because the splits are chosen by the impurity function,
and the impurity relates closely to the relative risk. As a consequence of the
selection bias, we cannot rely on the resubstitution estimate to interpret
the tree results. In Section 4.6 we describe a way to adjust for the bias
based on Zhang et al. (1996).
sample, we can find a split that maximizes the impurity and leads to a 2× 2
table Tl . Then, we apply the selected split to the test sample and derive
another 2 × 2 table Tt . We can use the differences in the frequency counts
between Tl and Tt to estimate the bias, a − a∗ and d − d∗ .
Formally, we randomly divide the population of interest into v subpop-
ulations. For instance, if v = 5, let Li (i = 1, 2, 3, 4, 5) denote each of the
five subpopulations and L(−i) (i = 1, 2, 3, 4, 5) the sample after removing
Li . We use L(−1) to select a split s∗1 over the originally selected covariate;
s∗1 results in two 2 × 2 tables, T1 and T(−1) :
T_(−1):                    Preterm
                        yes        no
          Left Node   a_(−1)    b_(−1)
          Right Node  c_(−1)    d_(−1)

T_1:                       Preterm
                        yes        no
          Left Node    a_1       b_1
          Right Node   c_1       d_1
We can always redefine the nodes for T(−1) in such a way that
\frac{a_{(-1)}\,(c_{(-1)} + d_{(-1)})}{c_{(-1)}\,(a_{(-1)} + b_{(-1)})} > 1,
and adjust T_1 accordingly. Next, we repeat this same process for all i and
estimate the bias in a by the maximum of \frac{1}{4}\sum_{i=1}^{5} a_{(-i)} - \sum_{i=1}^{5} a_i and a - 0.5
to guarantee that the frequency is positive. Similarly, we estimate the bias
in d by the maximum of \frac{1}{4}\sum_{i=1}^{5} d_{(-i)} - \sum_{i=1}^{5} d_i and d - 0.5. We correct the
frequency counts by subtracting the corresponding bias and computing the
relative risk and its standard error using these values.
For example, the adjusted 2 × 2 table for the root node in Figure 4.5 is
              Term    Preterm
Left Node      683       70      753
Right Node    2973      135     3108
              3656      205     3861
FIGURE 4.8. Comparison of ROC curves obtained by the tree structure and
logistic regression model
TABLE 4.5. Definition of Dummy Variables from Terminal Nodes in Figure 2.5
tions, it is a good idea to combine the logistic regression models and the
tree-based models. The first approach is to take the linear equation derived
from the logistic regression as a new predictor. Not surprisingly, this new
predictor is generally more powerful than any individual predictor. In the
present application, the new predictor is defined as
x16 = −2.344 − 0.076x6 + 0.699z6 + 0.115x11 + 1.539z10. (4.13)
See Table 2.1 and equation (3.3) for the variable specification and the pre-
dicted risk equation. Figure 4.9 displays the final tree, which makes use of
both the original and the created predictors. It is interesting to note a few
points from Figure 4.9: (a) Education shows a protective effect, particu-
larly for those with college or higher education. Not only does education
participate in the derivation of x16 defined in (4.13), but itself also ap-
pears on the left-hand side of Figure 4.9. It did not appear, however, in
Figure 2.5. (b) Age has emerged as a risk factor. In the fertility literature,
whether a woman is at least 35 years old is a common standard for preg-
nancy screening. The threshold of 32 in Figure 4.9 is close to this common
sense choice. (c) The risk of delivering preterm babies is not monotonic
with respect to the combined score x16 . In particular, the risk is lower
when −2.837 < x16 ≤ −2.299 than when −2.299 < x16 ≤ −2.062. To the
contrary, monotonicity holds when the risk is predicted with the logistic
equation (3.3). The ROC curve for the new classification tree is shown in
Figure 4.10, and the area under this curve is 0.661. We achieved some, but
not much, improvement in predictive power.
The second approach is to run the logistic regression after a tree is grown.
For example, based on the tree displayed in Figure 2.5, we can create five
dummy variables, each of which corresponds to one of the five terminal
nodes. Table 4.5 specifies these five dummy variables. In particular, the
leftmost terminal node contains 512 unemployed Black women. The dummy
variable, z13 , equals 1 for the 512 unemployed Black women and 0 for the
rest. Next, we include these five dummy variables, z13 to z17 , in addition to
the 15 predictors, x1 to x15 , in Table 2.1 and rebuild a logistic regression
model. The new equation for the predicted risk is
\hat{\theta} = \frac{\exp(-1.341 - 0.071 x_6 - 0.885 z_{15} + 1.016 z_{16})}{1 + \exp(-1.341 - 0.071 x_6 - 0.885 z_{15} + 1.016 z_{16})}.   (4.14)
FIGURE 4.9. The final tree structure making use of the equation from the logistic
regression. N: sample size; NPT: number of preterm cases.
FIGURE 4.10. Comparison of ROC curves obtained by the tree structure (dotted),
logistic regression model (solid), and their hybrid (the first approach, short-dashed;
the second approach, long-dashed)
named the third strategy as the “missings together” (MT) approach, which
was also implemented by Clark and Pregibon (1992). Depending on the
purpose of data analysis, data analysts may choose either approach. Since
the MT approach is relatively straightforward, we describe it first. The
disadvantage, especially with a high proportion of missing observations, is
that the MT may be ineffective when the true values corresponding to the
missings are heterogeneous.
information. Lastly, we can easily trace where the subjects with missing
information are located in a tree structure.
702+134=836 of 3861 subjects are sent to the same node, and hence
836/3861 = 0.217 can be used as an estimate for the coincidence probability
of these two splits. In general, prior information should be incorporated in
estimating the coincidence probability when the subjects are not randomly
drawn from a general population, such as in case–control studies. In such
cases, we estimate the coincidence probability with
It is not unlikely, though, that the predictor that yields the best surro-
gate split may also be missing. Then, we have to look for the second best,
and so on. If our purpose is to build an automatic classification rule (e.g.,
Goldman et al., 1982, 1996), it is not difficult for a computer to keep track
of the list of surrogate splits. However, the same task may not be easy for
humans. Surrogate splits could contain useful information for the analyst
who is trying to extract maximal insight from the data in the course of
determining the final tree. On the other hand, due to the limited space,
surrogate splits are rarely published in the literature, and hence their use-
fulness is hampered by this practical limitation.
There is no guarantee that surrogate splits improve the predictive power
of a particular split as compared to a random split. In such cases, the
surrogate splits should be discarded.
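One simple way to rank candidate surrogate splits is by the coincidence probability discussed earlier, that is, the proportion of subjects sent to the same daughter node by the primary split and the candidate split (836/3861 = 0.217 in the earlier example). A minimal sketch, assuming each split is encoded as a boolean vector of left-node assignments; the function names are hypothetical.

```python
import numpy as np

def coincidence_probability(primary_left: np.ndarray,
                            surrogate_left: np.ndarray) -> float:
    """Proportion of subjects sent to the same daughter node by both splits."""
    return float(np.mean(primary_left == surrogate_left))

def best_surrogate(primary_left: np.ndarray, candidates: dict) -> str:
    """Name of the candidate split with the highest coincidence probability."""
    return max(candidates, key=lambda name:
               coincidence_probability(primary_left, candidates[name]))

# A surrogate whose coincidence probability does not beat that of a random
# assignment should be discarded, as noted in the text.
```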
If surrogate splits are used, the user should take full advantage of them. In
particular, a thorough examination of the best surrogate splits may reveal
other important predictors that are absent from the final tree structure,
and it may also provide alternative tree structures that in principle can
have a lower misclassification cost than the final tree, because the final tree
is selected in a stepwise manner and is not necessarily a local optimizer in
any sense.
metrize it as follows:
D_{KL}(t) = \sum_{y} p_{y,1} \log(p_{y,1}/p_{y,2}) + \sum_{y} p_{y,2} \log(p_{y,2}/p_{y,1}).
4.11 Implementation∗
We have seen that a large number of splits must be searched in order to
grow a tree. Here, we address the computational issue of designing a fast
search algorithm.
Recall that the partitioning process is the same for all internal nodes
including the root node. Therefore, it suffices to explain the process with the
root node. Moreover, we encounter only two types of predictors: ordered
and unordered. For simplicity, we use daily alcohol intake in Table 2.1 as
an example for ordered predictors. This variable, x13 , takes a value from
0, 1, 2, and 3. The race variable, x3 , in the same table will serve as an
example of nominal (not ordered) predictors.
For each of x3 and x13 we need to construct a matrix that holds the
numbers of normal and abnormal subjects at every level of the predictor.
The two corresponding matrices are:
where a bit “1” indicates that a subject who has the corresponding race
group goes to the left daughter node. Hence, the array above implies that
3008 Whites, 109 Hispanics, and 13 others are in the left node, while the
remaining 731 Blacks and Asians are in the right node.
Each of the 15 allowable splits corresponds to a distinct assignment of
bits for the array. In fact, we know that any integer from 1 to 15 can be
expressed in a binary format as displayed in Table 4.6. If we take 7 from
Table 4.6, its binary representation is 00111. This array indicates that 3008
Whites and 710 Blacks should be in the right daughter node. The use of
the binary representation following the original order of 1 to 15 is a little
troublesome, however, for the following reason.
Note that the binary representation for 1 is 00001. Thus, the first allow-
able split is to put the 13 subjects in the “others” racial group in the left
node and the remaining 3848 subjects in the right node. Now, the binary
representation of 2 is 00010. Then, the next split would exchange node
assignments for the 13 others and the 21 Asians. Thus, from the first to
the second splits, two groups of subjects are involved. As a matter of fact,
three groups of subjects must be switched as we move from the third to
the fourth split, because the binary representations of 3 and 4 differ by
three bits. The housekeeping for these movements is not convenient. For-
tunately, there is a simple algorithm that can rearrange the order of the
integers such that the binary representation changes only one bit as we go
from one integer to the next. This rearrangement is also given in Table 4.6.
The impurities for the left and right daughter nodes are respectively 0
and 0.208. Thus, the goodness of the split is
0.2075 - \frac{3848}{3861} \times 0.208 - \frac{13}{3861} \times 0 = 0.0002,
where 0.2075 is the impurity of the root node.
The second binary array under the rearranged order is 00011. Hence, the
21 Asian subjects join the left node in the second split. The record is then
modified to:
            Left                     Right
        Asians   Others    Whites   Blacks   Hispanics
y = 0     20       13       2880      640       103
y = 1      1        0        128       70         6
This split gives rise to a goodness of split of 0.0005. Next, the subjects
whose x13 equals 1 move to the left node, because 1 is adjacent to 0. Then,
the subjects with x13 = 2 are switched to the left node, and so on. There-
fore, whenever we proceed to the next split, one more slice (i.e., column)
of A13 (root) is moved to the left node. The number of moves depends
on the number of the distinctly observed data points for the predictor;
in the worst case, it is in the order of the sample size. Therefore, after
A13 (root) is determined, the number of needed computing steps is at most
a constant proportion of the sample size. The constant is smaller than 10.
Moreover, when we split the subsequent nodes, the number of subjects be-
comes smaller and smaller. In fact, for a given predictor the total number of
computing steps for splitting all nodes in the same layer is usually smaller
than that for splitting the root node.
To conclude, the total number of computing steps needed to construct
a tree of d layers is about cpn log(n) + 10dpn, where p is the number of
predictors, n is the sample size, and the term, cpn log(n), results from
preparing the A matrices. Obviously, the second term generally dominates
the first term.
5
Examples Using Tree-Based Analysis
5.1.1 Background
Spontaneous abortion, one of the most difficult reproductive outcomes to
study using epidemiologic methods, will be the outcome of interest; see,
e.g., Zhang and Bracken (1996). The difficulties in this area of research in-
clude failure to detect a large proportion (perhaps majority) of spontaneous
abortions and the large number of known and suspected confounding risk
factors that must be considered before evaluating the possible role of new
factors. The situation is shared by many diseases such as cancer, AIDS,
and coronary heart disease.
Our illustrative data come from a continuation project of the Yale Preg-
nancy Outcome Study. The study population consists of women receiving
prenatal care at 11 private obstetrical practices and two health mainte-
nance organizations in southern Connecticut during the period 1988 to
1991. There were 2849 women who had initial home interviews during 5
to 16 weeks of pregnancy between April 5, 1988, and December 1, 1991,
[Tree diagram: node 1 (135/2849 = 0.047) is split into node 2 (98/2418 = 0.041) and node 3 (37/431 = 0.086); further splits by race (Hispanic; White and Asian; Black and others) and birth-control use yield terminal nodes 4–8 with rates 0/60, 0.037, 0.096, 0.068, and 0.123.]
FIGURE 5.1. The tree structure for the sample stratification. Reproduced from
Figure 1 of Zhang and Bracken (1996).
at the level of 0.05. This factor has three categories: (a) those who did
not carry loads over 20 lb daily at work (unemployed women included); (b)
those who carried loads over 20 lb less than once per day; and (c) those who
carried loads over 20 lb at least once a day. Seventy-five, 11, and 14 percent
of the subjects, respectively, fall in each of these categories. Although there
is hardly any difference in the rates of spontaneous abortion between the
first two categories, the difference is significant between the first two and
the third. Using the group not carrying loads over 20 lb as the reference,
the relative risk due to carrying loads over 20 lb at least once daily is 1.71,
and its 95% confidence interval is (1.25, 2.32).
As the second step, we make use of the tree-based method to stratify the
study sample into a number of meaningful and homogeneous subgroups,
each of which corresponds to a terminal node in the tree structure. In
this exercise, we use the alternative pruning approach described in Section
4.5 and adopt the missings together strategy of Section 4.8 to handle the
missing data. Similar to the tree constructed in Figure 2.5, we end up with
the tree in Figure 5.1 for the present data.
Figure 5.1 can be read as follows. Node 1 is the entire study popula-
tion of 2849 pregnant women. The overall rate of spontaneous abortion is
4.7%. This sample is first divided into two age groups: 13–35 and >35. The
younger group is called node 2 and the older group node 3. Note that the
TABLE 5.3. Marginal Associations Between Spontaneous Abortion and the Pu-
tative Risk Factors
Factor No. of subjects %‡ RR CI
Currently employed
No 638 5.3 Reference
Yes 2211 4.6 0.86 0.59–1.25
Standing 2+ hours at work daily
No 2559 4.6 Reference
Yes 290 5.9 1.27 0.78–2.08
Walking 2+ hours at work daily
No 1894 4.9 Reference
Yes 955 4.5 0.93 0.65–1.32
Sitting 2+ hours at work daily
No 1230 5.2 Reference
Yes 1619 4.4 0.84 0.61–1.17
Vibration at work
No 2756 4.7 Reference
Yes 93 5.4 1.14 0.48–2.72
Commute to work
No 705 4.8 Reference
Yes 2144 4.7 0.98 0.67–1.43
Reaching over the shoulders on the job
No 1584 4.5 Reference
<1/day 530 4.5 1.01 0.64–1.59
1+/day 735 5.4 1.21 0.83–1.77
Carrying loads over 20 lb on the job
No 2154 4.4 Reference
<1/day 318 3.8 0.85 0.47–1.54
1+/day 386 7.3 1.64 1.09–2.46
‡ Percentage of spontaneous abortions
Based on Table 2 of Zhang and Bracken (1996)
[Fragment of the seven stratum-specific 2 × 2 tables used in the Mantel–Haenszel calculation; the full tables are not reproduced here.]
With the analogy to the Mantel and Haenszel (1959) statistic, our summary relative risk estimate is

r = \frac{\sum_{i=1}^{7} a_i n_{0i}/n_i}{\sum_{i=1}^{7} c_i n_{1i}/n_i}.
r = \frac{0 + 45 \cdot 422/1928 + 18 \cdot 71/293 + 9 \cdot 48/184 + 4 \cdot 31/80 + 13 \cdot 30/166 + 12 \cdot 22/138}{0 + 18 \cdot 1506/1928 + 2 \cdot 222/293 + 4 \cdot 136/184 + 2 \cdot 49/80 + 3 \cdot 136/166 + 5 \cdot 116/138} = 0.85.
The limits of the 100(1 − α)% confidence interval for the relative risk can be obtained from r^{(1 ± z_{α/2}/χ)}, where z_{α/2} is the upper α/2 percentile of the standard normal distribution, and the χ²-statistic, developed by Cochran (1954) and Mantel and Haenszel (1959), can be computed as follows:
\chi^2 = \frac{\left(\left|\sum_{i=1}^{7}(a_i - A_i)\right| - 0.5\right)^2}{\sum_{i=1}^{7} V_i}, \qquad (5.1)
where
A_i = \frac{n_{1i} m_{1i}}{n_i} \quad \text{and} \quad V_i = \frac{n_{1i} n_{0i} m_{1i} m_{0i}}{n_i^2 (n_i - 1)},
i = 1, . . . , 7. For the present seven tables, χ2 = 1.093, giving a 95% confi-
dence interval (0.62, 1.16) for the relative risk.
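A minimal sketch of these calculations in R, assuming tabs is a list holding the seven node-specific 2 × 2 tables with rows (exposed, unexposed) and columns (case, noncase); the object name is illustrative:

mh.rr <- function(tabs, alpha = 0.05) {
  a  <- sapply(tabs, function(tb) tb[1, 1])        # exposed cases a_i
  cc <- sapply(tabs, function(tb) tb[2, 1])        # unexposed cases c_i
  n1 <- sapply(tabs, function(tb) sum(tb[1, ]))    # exposed totals n_1i
  n0 <- sapply(tabs, function(tb) sum(tb[2, ]))    # unexposed totals n_0i
  n  <- n1 + n0                                    # stratum sizes n_i
  m1 <- a + cc                                     # total cases m_1i
  m0 <- n - m1
  r   <- sum(a * n0 / n) / sum(cc * n1 / n)        # summary relative risk
  A   <- n1 * m1 / n
  V   <- n1 * n0 * m1 * m0 / (n^2 * (n - 1))
  chi <- sqrt((abs(sum(a - A)) - 0.5)^2 / sum(V))  # square root of (5.1)
  ci  <- sort(r^(1 + c(-1, 1) * qnorm(1 - alpha / 2) / chi))
  c(RR = r, lower = ci[1], upper = ci[2])
}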
If the stratification were determined a priori without the help of the tree-
based method, this approach of adjustment would be the same as the one
used by Mills et al. (1991) and Giovannucci et al. (1995) among others.
Moreover, if the linear discriminant analysis were used to stratify the data,
it would be the approach proposed by Miettinen (1976). In all cases, the
confounding factors are controlled through the strata. In other words, we
use tree-based methods to reduce the data dimension of the confounding
factors and to construct a filter for the evaluation of new risk factors. So, the
first stage of analysis is the sample stratification based on the confounders,
and the second stage is the calculation of the adjusted relative risk for the
new risk factors.
As reported under the column of RR in Table 5.4, one risk factor showed
significant effects. That is “carry load over 20 lb at least once daily”
[RR=1.71, 95% confidence interval= (1.25, 2.32)]. Nevertheless, a more
modest risk factor is “reaching over the shoulders at least once daily”
[RR=1.35, 95% confidence interval=(1.02, 1.78)]. Table 5.4 presents more
detail on the association of the risk factors to spontaneous abortion. In this
table, the adjusted RR is the Mantel–Haenszel estimate, and the adjusted
odds ratio (OR) is from logistic regression.
For the purpose of comparison, Zhang and Bracken (1996) also reported
analysis based on logistic regression. The model selection is not conven-
tional. Instead, they made use of Figure 5.1. The main and the second-
order interaction effects of the five variables (age, race, years of smoking,
miscarriage, and use of birth control) that stratify the study sample are
included in the initial logistic regression. A forward stepwise procedure se-
lected a logistic model with three significant terms: age (p-value = 0.002),
race (Whites and Asians, p-value = 0.04), and race (Hispanic, p-value =
TABLE 5.4. Adjusted Relative Risk and Odds Ratio of Spontaneous Abortion
Attributed by Individual Putative Risk Factor
Factor RR CI OR CI
Currently employed
No Reference
Yes 0.85 0.62–1.16 0.82 0.55–1.23
Standing 2+ hours at work daily
No Reference
Yes 1.27 0.83–1.94 1.28 0.76–2.17
Walking 2+ hours at work daily
No Reference
Yes 0.97 0.60–1.56 0.95 0.65–1.38
Sitting 2+ hours at work daily
No Reference
Yes 0.81 0.63–1.05 0.80 0.57–1.14
Vibration at work
No Reference
Yes 1.11 0.58–2.13 1.11 0.44–2.80
Commuting to work
No Reference
Yes 0.96 0.59–1.54 0.96 0.65–1.44
Reaching over the shoulders on the job
No Reference
<1/day 0.98 0.58–1.67 1.02 0.63–1.64
1+/day 1.35 1.02–1.78 1.30 0.87–1.95
Carrying loads over 20 lb on the job
No Reference
<1/day 0.91 0.43–1.93 0.87 0.47–1.61
1+/day 1.71 1.25–2.32 1.75 1.13–2.71
Based on Table 2 of Zhang and Bracken (1996)
0.01). Then, they examined eight additional logit models by adding one of the eight putative risk factors, one at a time, to the selected 3-term logit model and then deleting it. The results are reported in the last two
columns of Table 5.4. It is apparent from Tables 5.3 and 5.4 that the three
estimates of risk (crude RR, adjusted RR, and adjusted OR) give very close
answers. Therefore, for the present analysis, the 18 potential confounders
are proven not to be confounders.
In summary, Zhang and Bracken (1996) found that risk of spontaneous
abortion increases as women carry loads over 20 lb at least once a day, or
reach over the shoulders at least once a day, neither of which is recognized
as a risk factor in the extant literature. Hence, these occupational exposures
merit additional study.
                          Observed Outcome
  Classified Outcome       Good      Bad
  Good                       0        5
  Bad                        1        0
and specify this loss matrix in the loss option of rpart(). We use the entropy impurity as the splitting criterion, corresponding to the information option. Thus, we can grow an initial tree with a call to rpart().
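A minimal sketch of such a call, assuming the data sit in a data frame named german with the response y coded so that "good" is its first level; the data frame name, the loss-matrix orientation (rows taken here as the observed class), and the response coding are assumptions to check against the actual data and the rpart documentation:

library(rpart)
L <- matrix(c(0, 1,     # assumed row 1: observed "good", predicted good / bad
              5, 0),    # assumed row 2: observed "bad",  predicted good / bad
            nrow = 2, byrow = TRUE)
gmcrd <- rpart(y ~ ., data = german, method = "class",
               parms = list(split = "information", loss = L))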
The numerical result can be reviewed by using print() and a tree plot can
be produced by using plot() and text().
> print(gmcrd)
n= 1000
Within each node, the sample size (n), the misclassification cost (loss), the
classified membership yval, and the probabilities of the two classes (yprob)
are presented. With the exception of the root node, the splitting variable
and its values are displayed to define the nodes. For example, x1=13,14
means that within node 2, the 457 clients either do not have a checking ac-
count or its balance is at least 200 DM. The translation from the numerical
coding in the data set to the description requires the reader to review the
original data description. Figure 5.2 provides the graphical presentation of
[Tree diagram for Figure 5.2: the root is split by x1 = 13, 14; deeper splits involve x14, x4, x3, x7, x5, and x17; each terminal node shows its class label and its bad/good composition.]
FIGURE 5.2. An initial tree structure for the German credit data. The node split
points to the left daughter. The class membership and sample composition of the
terminal nodes are displayed.
the tree. We should note that the postscript file was slightly edited to be
consistent with the numerical output and for better visualization.
To illustrate the pruning step, we can output the cost complex parame-
ters as in Table 4.3 by using printcp() and plotcp().
> printcp(gmcrd)
CP nsplit rel error xerror xstd
1 0.138571 0 1.00000 1.00000 0.020702
2 0.080000 1 0.86143 0.86714 0.053805
3 0.017619 2 0.78143 0.85000 0.046969
4 0.012143 7 0.68714 0.86286 0.046493
5 0.010000 9 0.66286 0.90857 0.051339
> plotcp(gmcrd)
In the printout, xerror is the relative error estimated by a 10-fold (by
default) cross validation and xstd is the standard error of the relative
error. Figure 5.3 is a graphical presentation of the relative error versus
the cost-complexity parameter. Both the numerical output and the plot
indicate that the minimal error, 0.85, was reached with an s.e. 0.047 when
the tree has three terminal nodes or two splits. Note that the CP in the
[Plot for Figure 5.3: the x-axis shows the cost-complexity parameter cp (with the corresponding tree sizes 1, 2, 3, 8, and 10 on the top axis) and the y-axis the X-val Relative Error, ranging from about 0.7 to 1.1.]
FIGURE 5.3. The cost complexity parameter plot generated by plotcp(). The
vertical lines indicate the range of 1 s.e.
numerical output differs from the values shown in the graph, mainly because any CP value within an interval produces the same subtree. Applying the 1-se rule, we are looking
for the smallest subtree with the error below 0.85 + 0.047 = 0.897, which
leads to the tree with two terminal nodes or one split as displayed in Figure
5.4. This also can be seen from Figure 5.3.
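A minimal sketch of the 1-se rule applied to the cptable of gmcrd:

cp     <- gmcrd$cptable
i.min  <- which.min(cp[, "xerror"])                 # row with the smallest cross-validated error
cutoff <- cp[i.min, "xerror"] + cp[i.min, "xstd"]   # 0.85 + 0.047 = 0.897 here
i.1se  <- min(which(cp[, "xerror"] <= cutoff))      # smallest subtree below the cutoff
pruned <- prune(gmcrd, cp = cp[i.1se, "CP"])        # the one-split tree of Figure 5.4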
This example shows that the cross-validation-based 1-se rule can be very stringent in tree pruning. The one-split tree in Figure 5.4 is obviously not too
helpful in identifying potentially important factors or making good predic-
tions. The misclassification cost went from 700 in the root node to a sum
of 603 in the two terminal nodes.
Although the misclassification cost is not used in node splitting, it does
affect the tree structure. To make the point, we let the misclassification
costs for the two types of errors be the same and reconstruct the tree.
Based on the corresponding cost-complexity output, we can choose 0.02 as the CP to produce the
final tree as displayed in Figure 5.5. The trees in Figures 5.4 and 5.5 are
clearly different. This example underscores the importance and difficulty in
choosing the misclassification costs. For readers who find the choices of the
misclassification costs difficult or even arbitrary, the alternative pruning
approach described in Section 4.5 seems more practical.
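A minimal sketch of the equal-cost refit, under the same illustrative data frame and response coding as before:

gmcrd.eq <- rpart(y ~ ., data = german, method = "class",
                  parms = list(split = "information"))   # default 0-1 loss
printcp(gmcrd.eq)                                        # inspect the cost-complexity table
final.eq <- prune(gmcrd.eq, cp = 0.02)                   # the CP value chosen in the text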
[Tree diagram for Figure 5.4: node #1 (cost 700; 300 bad, 700 good) is split into node #2 (cost 300; 60 bad, 397 good) and node #3 (cost 303; 240 bad, 303 good).]
FIGURE 5.4. The final tree for the German credit data. Inside each node are the
node number (top) and the units of the misclassification cost (bottom). Next to
the node are the number of bad (top) and good (bottom) applicants in the node.
[Tree diagram for Figure 5.5: the root is split by x1 = 13, 14; one daughter is a terminal node of class 0 (397/60), the other is split by x2 < 22.5 and then by x3 = 32, 33, 34 and x6 = 64, 65, giving terminal nodes with compositions 193/85, 7/21, 29/12, and 74/122.]
FIGURE 5.5. The final tree structure for the German credit data using equal mis-
classification cost. The node split points to the left daughter. The class member-
ship and sample composition of the terminal nodes are displayed. x1=13,14 means
that a checking amount has a balance; x2 is duration of credit; x3=32,33,34
means a credit history during which all credits were paid until now or with a
delay or other issues; x6=64,65 means a saving account or bonds with a balance
of 1000 DM or no information on such accounts.
[Tree diagram for Figure 5.6: the root node (300 bad, 700 good) is first split by checking account status (with balance vs. others), then by credit duration (22.5 months), other installment plans, and credit history (paid until now vs. paid sooner or worse).]
FIGURE 5.6. A tree constructed by RTREE for the German credit data. Inside
each node are the number of bad (top) and good (bottom) applicants in the node.
3. Let the recursive partitioning run to the end and generate a tree.
for the spirit of this book, boosting inhibits interpretation. Indeed, the re-
peated sampling in bagging facilitates exposure of subpopulations/groups
with distinctive characteristics.
In forest construction, several practical questions often arise. Here, we
discuss some of those issues. Firstly, how many trees do we need in a forest?
Breiman (2001) chose to run 100 trees in several examples and others have
used much larger numbers. We will discuss in Section 6.2 how large
a random forest needs to be. As Breiman (2001) noted, the accuracy of a
random forest depends on two key factors: the prediction strength of the
individual trees and the correlation of the trees. Thus, we may keep the
size of a random forest to the minimal level if the trees can achieve the
highest strength and have the weakest correlation.
Secondly, does a random forest overfit the data without pruning the
individual trees? Breiman (2001) showed that there is no overfitting issue
by the Strong Law of Large Numbers. The prediction error of a random
forest converges as the size of the forest increases, and the error has an
upper bound that is directly related to the strength and the correlation of
the trees in the forest.
Thirdly, selecting the subset of q variables in node splitting is an important feature of random forests. Commonly used choices are log(p) or √p.
However, there is a caveat with this idea. For example, in genetic stud-
ies, we tend to have a huge number of genetic markers (on the order of a
million) and some environment variables (ranging from one to hundreds).
The environment variables have few chances to be selected in the random forest, not because they are unimportant, but because there are so few of them relative to the markers. Furthermore, even among genetic markers, not all of them
should be treated equally. Thus, in practice, we should be cautious about
the fact that the random forest treats all predictors indiscriminately. In
Section 6.5, we discuss some approaches to overcoming this issue.
Finally, after a forest is formed, how do we understand the information
in the forest, especially if it is too large to examine the individual trees?
suffer from. In other words, does a forest have to be large, or how small can
a forest be? To answer this fundamental question, the key idea is to shrink
the forest with two objectives: (a) to maintain a similar (or even better)
level of prediction accuracy; and (b) to reduce the number of the trees in
the forest to a manageable level.
To shrink the size of a forest while maintaining the prediction accuracy,
we need a criterion to determine the importance of a tree in a forest in terms
of prediction performance. Zhang and Wang (2009) considered three op-
tions and found that the measure “by prediction” outperformed the others.
Specifically, a tree is removed if its removal from the forest has the minimal
impact on the overall prediction accuracy. First, calculate the prediction
accuracy of forest F , denoted by pF . Second, for every tree, denoted by T ,
in forest F , calculate the prediction accuracy of forest F−T that excludes T ,
denoted by pF−T . Let Δ−T be the difference in prediction accuracy between
F and F−T :
Δ−T = pF − pF−T . (6.1)
The tree T^p with the smallest \Delta_{-T} is the least important one and hence subject to removal:
T^p = \arg\min_{T \in F} \Delta_{-T}. \qquad (6.2)
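A minimal sketch of one removal step in R, assuming forest is a list of fitted trees and accuracy() is a hypothetical function returning the prediction accuracy of the ensemble built from a given set of trees:

shrink.once <- function(forest, data) {
  pF    <- accuracy(forest, data)              # p_F for the current forest
  delta <- sapply(seq_along(forest), function(j)
    pF - accuracy(forest[-j], data))           # Delta_{-T} in (6.1) for each tree T
  forest[-which.min(delta)]                    # drop the least important tree, as in (6.2)
}
# Applying shrink.once() repeatedly, from N_f trees down to one, traces out h(i).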
To select the optimal size subforest, Zhang and Wang (2009) track the
performance of the subforests. Let h(i), i = 1, . . . , Nf − 1, denote the per-
formance trajectory of a subforest of i trees, where Nf is the size of the
original random forest. Note that h(i) is specific to the method measuring
the performance, because there are many subforests with the same number
of trees. If there is only one realization of h(i), they select the optimal size
i_{opt} of the subforest by maximizing h(i) over i = 1, \ldots, N_f − 1:
i_{opt} = \arg\max_{1 \le i \le N_f - 1} h(i). \qquad (6.3)
If there are M realizations of h(i), they select the optimal size subforest
by using the 1-se as described by Breiman et al. (1984). That is, they first
compute the average \bar{h}(i) and its standard error \hat{\sigma}(i):
\bar{h}(i) = \frac{1}{M}\sum_{j=1}^{M} h_j(i), \quad i = 1, \ldots, N_f - 1, \qquad (6.4)
As discussed by Breiman et al. (1984), the 1-se rule tends to yield a more
robust and parsimonious model.
Finally, they choose as the optimal subforest size i_{opt} the smallest subforest whose \bar{h} is within one standard error (se) of \bar{h}(i_m).
the correct class in the oob cases with the permuted values of variable k.
The permutation importance is the average, over all trees in the forest, of the difference between the number of votes for the correct class in the original oob data and the number of votes for the correct class in the variable-k-permuted oob data.
The permutation importance index is arguably the most commonly used
choice. There are a few important issues to note. Firstly, the permutation
importance index is not necessarily positive, and does not have an upper
limit. Secondly, both the magnitudes and relative rankings of the permu-
tation importance for predictors can be unstable when the number, p, of
predictors is large relative to the sample size. This is certainly the case for
genomic data. Thirdly, the magnitudes and relative rankings of the per-
mutation importance for predictors vary according to the number of trees
in the forest and the number, q, of variables that are randomly selected to
split a node. As presented by Genuer et al. (2008), the effect of the num-
ber of trees in the forest is relatively minor, although more trees lead to
better stability. However, the magnitude of the importance may increase
dramatically as q increases, although the rankings may remain the same.
To illustrate this, we simulated data based on a microarray data set on
Breast Cancer Prognosis (van de Vijver et al. 2002). That study had 295
samples with 24,496 genes. We randomly selected four genes to generate
a binary (e.g., normal or abnormal) outcome y. Let x1 , x2 , x3 , and x4 be
the expression intensities
of the four selected genes. Then, the response
is derived by y = I( 4i=1 xi > 0); here recall that I(·) is the indicator
function.
Figure 6.1 displays the importance scores of the four selected genes with
a range of q’s. Before the computation, genes with correlation greater than
0.1 with any of the four selected genes (in terms of the expression level)
are removed, to avoid the potential effect of correlation. There are 1000
trees in the forest. Clearly, the importance score tends to increase as q increases. However, the four genes keep the same order of importance.
Without going into detail, we should note that the effect of the forest size
on the importance scores is relatively minor.
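A minimal sketch of this simulation in R, assuming expr is the expression matrix (after removing the genes highly correlated with the four selected ones) with the four selected genes in its first four columns; the object name and the set of q values are illustrative:

library(randomForest)
y   <- factor(as.integer(rowSums(expr[, 1:4]) > 0))     # y = I(x1 + x2 + x3 + x4 > 0)
imp <- sapply(c(10, 50, 100, 500, 1000), function(q) {
  rf <- randomForest(expr, y, ntree = 1000, mtry = q, importance = TRUE)
  importance(rf, type = 1)[1:4]                         # permutation importance of the 4 genes
})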
Finally, there are conflicting numerical reports with regard to the pos-
sibility that the permutation importance overestimates the variable im-
portance of highly correlated variables (see, e.g., Strobl et al. 2008 and
Dı́az-Uriarte and Alvarez de Andrés 2006). Genuer et al. (2008) specifi-
cally addressed this issue with simulation studies and concluded that the
magnitude of the importance for a predictor steadily decreases when more
variables highly correlated with the predictor are included in the data set.
We also performed a simulation to examine this issue. We began with the
four selected genes. Then, we identified the genes whose correlations with
any of the four selected genes are at least 0.4. Those correlated genes are
divided randomly into five sets of about the same size. Finally, we added one,
two, . . . , and five sets of them sequentially together with the four selected
genes as the predictors. Figure 6.2 is consistent with the result of Genuer
et al. (2008). We can see that the rankings of the predictors are preserved.
Furthermore, let us examine the impact of the correlation from a differ-
ent point of view. We again began with the four selected genes and then included genes whose correlation with any of the four selected genes is at least 0.6, 0.4, or 0.2. We see from Figure 6.3 that the magnitude of the impor-
tance for a gene increases as we restrict the correlation to a higher level.
It is reasonable to say that although variable importance is an important
concept in random forests, we need to be cautious in the interpretation. In
practice, the ranking is more relevant than the magnitude.
[Figures 6.2 and 6.3: importance scores of the four selected genes (labeled 1–4), plotted against the number of correlated sets of genes added (1–5) and against the correlation threshold used to include correlated genes.]
FIGURE 6.4. Interaction heat map. The x-axis is the sequence number of the
primary predictor and the y-axis the sequence number of the potential interacting
predictor. The intensity expresses the frequency with which the potential interacting predictor precedes the primary predictor in a forest.
FIGURE 6.5. The dependence of the MCC on the number of correlated predictors.
The x-axis is the number of correlated sets of genes and the y-axis the importance
score. Each curve is labeled with the gene number. The forest size is set at 1000.
q equals the square root of the forest size for the left panel and 8 for the right
panel.
FIGURE 6.6. A schematic diagram to construct a forest for predictors with un-
certainties. Predictors x1 , x2 , . . . , xp are not directly observed, and hence the raw
data are referred to as “unphased data.” The frequencies of the predictors can be
estimated, and these frequencies are used to generate “phased data” in which the
values of the predictors are drawn according to the distribution of the predictors.
One tree is built for each phased data set. Finally, the importance score for each
predictor is computed in the forest.
there are additional predictors with uncertainties, and in fact, this is the
case for haplotype-based genetic analysis. We refer to Chen et al. (2007)
for details. Figure 6.6 illustrates this process, and a computer program
HapForest is available from http://c2s2.yale.edu/software.
A caveat with the tree- and forest-based method is that it is not fea-
sible to perform theoretically based statistical inference such as the com-
putation of statistical significance and confidence interval. For hypothesis
testing, a general, albeit computationally intensive, approach is to generate
data under the null hypothesis and examine the distribution of the critical
statistics using the replicated permutation samples. For example, to assess
the significance of association between a haplotype and a disease, the null
distribution for an importance index can be empirically estimated by ran-
domly permuting the disease status in the raw data and then going through
the process in Figure 6.6 to produce one set of importance indices for all
haplotypes under the null hypothesis. Repeating this process can estimate
empirically the null distribution for all haplotypes.
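A minimal sketch of this permutation scheme in R. The function forest.importance() is hypothetical; it stands for the whole pipeline of Figure 6.6 (estimate frequencies, generate phased data sets, grow the forest, and return one importance index per haplotype), and geno and disease are illustrative names for the genotype data and the disease status:

obs  <- forest.importance(geno, disease)                          # observed importance indices
null <- replicate(1000, forest.importance(geno, sample(disease))) # permute the disease status
# genomewide empirical p-value: how often the largest permuted importance
# anywhere in the genome exceeds the observed importance of a haplotype
p <- sapply(obs, function(v) mean(apply(null, 2, max) >= v))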
Chen et al. (2007) and Wang et al. (2009) applied this method to a
genetic data set on age-related macular degeneration (AMD), which is a
leading cause of vision loss in the elderly. Using a genomewide significance
level of 0.05, they confirmed one well-known haplotype, ACTCCG (on chro-
mosome 1), and revealed several novel haplotypes, TCTGGACGACA (on
chromosome 7), GATAGT (on chromosome 5), and TCTTACGTAGA (on
chromosome 12). The permutation test confirmed that these novel haplotypes were associated with AMD beyond chance at the genomewide 5% significance level.
The haplotype on chromosome 1 is in the gene called complement factor
H (Klein et al. 2005), the one on chromosome 7 is located in the Bardet–
[Tree diagram: node S1 is split on gene-1 expression (lower vs. higher); its daughters S2 and S3 are split on gene-2 and gene-3 expression, respectively, yielding terminal nodes that separate the two classes almost perfectly (45/0, 0/1, 2/0, 0/24).]
7.1 Introduction
Censored survival time is the outcome of numerous studies. We select a
few examples from the medical literature to give a glimpse of the scope of
studies involving censored survival time. Although survival time is usually
the time to death, it can be broadly referred to as the time to the occurrence
of an event of interest. For example, age of onset for breast cancer can be
interpreted as a survival time.
Example 7.2 From 1974 to 1989, 1578 patients were entered in three
Radiation Therapy Oncology Group malignant glioma trials. Curran et
al. (1993) used this sample to examine the associations of survival time
Example 7.3 The determinants of life span are complex and include ge-
netic factors. To explore the effect of three haplotypes (H-2^b, H-2^k, and H-2^d) on the T-cell functions and ultimately on survival, Salazar et al.
(1995) conducted an experimental study using 1537 mice that were born
between April 14 and July 28, 1987. The experiment ended on February
2, 1991. During the experiment period, the survival durations of 130 mice
(in addition to those that were still alive at the end) were censored (not
observed) because of accidental drowning of 5 and sacrifice of 125 for im-
munologic studies. The authors found that males lived longer than females except for H-2^d homozygotes, for which there was no significant difference at the level of 0.05.
[Diagram of follow-up histories: (a) all subjects enter the study at the same time; (b) subjects enter the study at different times. The time lines end in death (x), loss to follow-up, or survival to the end of the study, illustrating the three types of censoring.]
FIGURE 7.1. Three types of censoring
[Diagrams of the two survival trees of Figure 7.2: each root node contains 3154 subjects with 417 events. The left tree splits on age > 50, hostile score > 3.9, cholesterol > 300, and BMI > 27.2; the right tree splits on age > 48, BMI > 24, and SBP > 116. Terminal nodes are labeled with Roman numerals.]
FIGURE 7.2. The survival trees using the log-rank statistic and a straightforward
extension of impurity.
FIGURE 7.3. Kaplan–Meier curves within terminal nodes. The two panels cor-
respond to the two trees in Figure 7.2.
How do we answer the clinical question from the survival trees? A com-
monly used approach is to draw Kaplan–Meier curves within all terminal
nodes and then to compare these curves. Figure 7.3 is prepared following
this common wisdom. Thus, the survival trees are employed as a means of
stratifying the study sample. This is particularly useful when the propor-
tionality assumption is violated in the Cox model introduced in the next
chapter.
Let us first examine the tree on the left of Figure 7.2. As in the propor-
tional hazard model (see Section 8.2.3), age and cholesterol are important
attributes for survival. The hostile score seems to matter, but it requires a
threshold so high (greater than 3.9) that only 8 subjects crossed the line.
Instead of WCR as in the proportional hazard model, BMI, another mea-
sure of obesity, expresses some influence on the survival, but is limited to a
group of 43 subjects. If we remove three survival curves for the three rela-
tively small nodes, the left panel in Figure 7.3 suggests three major, distinct
characteristics of survival, two of which are determined by age (terminal
nodes I and VI). The curve for terminal node II shows that lower cholesterol
levels have a dramatic protective effect against death due to cancer.
The six survival curves on the right of Figure 7.3 display four major
distinct characteristics of survival. Terminal nodes III and IV deserve our
special attention. Let us point out that there are 173 missing values on BMI
in terminal node III, of which 18 died from cancer. This death proportion
is about the same as that among those who had BMI measured. Although
subjects in terminal node I (younger and lower WCR group) had enjoyed
the longest survival time, those in terminal node III had a very close survival
duration. What is surprising is that this is a group with relatively high
WCR and BMI. Based on the survivorship of terminal node II and the
discussion above, when only one of WCR and BMI is high, the risk of death
TABLE 8.1. A Random Sample from the Western Collaborative Group Study
the hazard, λ, is
\hat{\lambda} = \frac{11}{527240} = 2.05/10^5, \qquad (8.5)
which is the number of failures divided by the total observed time; in other words, there were 2.05/10^5 failures per day. When the hazard function is constant, the estimate in (8.5) follows from the definition.
Assumption (8.3) is one possibility and does not necessarily lead to an
adequate fit to the data. Due to censoring, a simple χ2 goodness-of-fit
test is not appropriate. Hollander and Proschan (1979) proposed a formal
test, described below, making use of the Kaplan–Meier curve. In practice,
some graphical approaches are more intuitive and easier to appreciate. Af-
ter presenting the Kaplan–Meier Curve in Section 8.1.1, we can compare a
parametric fit with the nonparametric Kaplan–Meier Curve. Another useful
approach is hazard plotting (Nelson 1972), similar to probability plotting.
It plots the empirical cumulative hazard function against the assumed the-
oretical cumulative hazard function at times when failures occurred. Here,
the cumulative hazard function is defined as
H(t) = \int_0^t h(u)\,du. \qquad (8.6)
Since the hazard is not a density function, the cumulative function may
be greater than one. For the exponential survival function, the cumulative
hazard function is a linear function: λt.
In Table 8.1 there are 11 time points where deaths occurred. It is easy
to obtain the theoretical cumulative hazard function. To calculate the em-
pirical value at time Ti , we first find the number of subjects who survived
up to time Ti , denoted by Ki , and then the number of failures at time Ti ,
denoted by di . The hazard rate at Ti is estimated by di /Ki , i.e., the ratio
of the number of failures to the number of subjects at risk. The cumulative
hazard at Ti is the sum of all hazard rates before and at Ti . Table 8.2 dis-
plays the process of calculating both empirical and theoretical cumulative
hazard functions, where the survival function is assumed to be exponential.
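A minimal sketch of these calculations in R, assuming time and dead are vectors (illustrative names) holding the follow-up times and death indicators of Table 8.1:

lambda.hat <- 11 / 527240                      # the exponential rate from (8.5)
tt <- sort(unique(time[dead == 1]))            # the 11 distinct death times T_i
K  <- sapply(tt, function(t) sum(time >= t))   # subjects at risk at T_i
d  <- sapply(tt, function(t) sum(time == t & dead == 1))
H.emp  <- cumsum(d / K)                        # empirical cumulative hazard
H.theo <- lambda.hat * tt                      # exponential model: H(t) = lambda * t
plot(H.theo, H.emp, xlab = "Theoretical cumulative hazard",
     ylab = "Empirical cumulative hazard")
abline(0, 1, lty = 2)                          # departures from the line signal lack of fit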
The hazard plot in Figure 8.1 implies that the exponential survival is not
appropriate for the data, because the empirical and theoretical cumulative
hazard functions do not match each other. Therefore, we should refit our
survival data by assuming different distributions and then check the good-
ness of fit. We leave it to interested readers to find appropriate parametric
models.
FIGURE 8.2. The Kaplan–Meier (solid) curve and the exponential survival (dot-
ted) curve
Figure 8.3. The short vertical lines along the survival curves in this figure
mark the censoring times. The two curves appear to be different. In par-
ticular, the nonsmokers seem to have survived longer. Note, however, that
Table 8.1 contains a small fraction of the Western Collaborative Group
Study. Hence, the clinical conclusions drawn here are for the purpose of
illustrating the method. A complete analysis will be conducted later.
Although graphical presentations are useful, it is also important to test
the significance of the difference in the survival distributions. Many test
statistics have been developed and studied in depth. Among them is Man-
tel’s log-rank test, generalized from Savage’s (1956) test. The name of ‘log-
rank’ was given by Peto and Peto (1972).
At the distinct failure times, we have a sequence of 2 × 2 tables of the following form:

                 Dead    Alive    Total
  Smoking         a_i               n_i
  Nonsmoking
  Total           d_i               K_i
For the data in Table 8.1, the counts of ai , di , ni , and Ki are calculated
in Table 8.4. The log-rank test statistic is
LR = \frac{\sum_{i=1}^{k}(a_i - E_i)}{\sqrt{\sum_{i=1}^{k} V_i}}, \qquad (8.7)
FIGURE 8.3. The Kaplan–Meier curves for smoking (dotted) and nonsmoking
groups (solid)
where
E_i = \frac{n_i d_i}{K_i} \quad \text{and} \quad V_i = \frac{d_i (K_i - n_i) n_i}{K_i (K_i - 1)}\left(1 - \frac{d_i}{K_i}\right).
Since the log-rank test statistic has an asymptotic standard normal distri-
bution, we test the hypothesis that the two survival functions are the same
by comparing LR with the quantiles of the standard normal distribution.
For our data, LR = 0.87, corresponding to a two-sided p-value of 0.38.
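A minimal sketch of (8.7) in R, assuming the data of Table 8.1 are in vectors time, dead (1 = death), and smoke (1 = smoker); the names are illustrative:

logrank <- function(time, dead, smoke) {
  tt <- sort(unique(time[dead == 1]))                       # distinct failure times
  s  <- sapply(tt, function(t) {
    K <- sum(time >= t);  n <- sum(time >= t & smoke == 1)  # at risk: all and smokers
    d <- sum(time == t & dead == 1)                         # deaths at t
    a <- sum(time == t & dead == 1 & smoke == 1)            # smoker deaths at t
    E <- n * d / K
    V <- d * (K - n) * n / (K * (K - 1)) * (1 - d / K)
    c(a - E, V)
  })
  sum(s[1, ]) / sqrt(sum(s[2, ]))
}
# survival::survdiff(Surv(time, dead) ~ smoke) gives the same test in its chi-squared form.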
Y_i = \alpha + \beta x_i + \epsilon_i,
different idea. For a censored time, we have Ui < Ti . If we knew the differ-
ence Ti − Ui , we could add it to our observed time Yi . After this, we would
not have censored data, and the standard methodology would be applica-
ble. Obviously, we do not know the difference. Their first step is to replace
the difference by the conditional mean difference IE(Ti − Ui |Ti > Ui ). In
other words, the observations become
Yi∗ = Yi δi + IE(Ti |Ti > Yi )(1 − δi ) (i = 1, . . . , n).
It is important to observe that IE(Yi∗ ) = α + βxi when the linear model
holds for the underlying survival time. It follows that
\mathrm{IE}\left[\sum_{i=1}^{n} (x_i - \bar{x})(Y_i^* - \beta x_i)\right] = 0,
When no solution exists, Buckley and James found that the iterations usu-
ally settle down to oscillating between two values. Once a slope β̃ is chosen,
the corresponding estimate α̃ of the intercept is
\tilde{\alpha} = \frac{1}{n}\sum_{i=1}^{n} [Y_i \delta_i + \tilde{Y}_i(\tilde{\beta})(1 - \delta_i)] - \tilde{\beta}\bar{x}.
The estimates of the coefficients, their standard errors, and p-values are
reported in Table 8.5.
Before we finish with Cox’s model, we must assess the proportional haz-
ard assumption. To this end, we use both a graphical approach and a
theoretical approach developed by Grambsch and Therneau (1994).
To use the graphical approach, we dichotomize age, serum cholesterol,
and waist-to-calf ratio at their median levels. Then, the 2882 (= 3154−272)
subjects are divided into 16 cohorts. Within each cohort i, we calculate
the Kaplan–Meier survival estimate Ŝi (t). Next, we plot log(− log(Ŝi (t)))
versus time as shown in Figure 8.4. In each of the four panels, four curves
are displayed.
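A minimal sketch of the two checks in R, assuming a data frame wcgs with the follow-up time, the death indicator, and the (dichotomized) covariates; the data frame and variable names are illustrative:

library(survival)
fit.km <- survfit(Surv(time, dead) ~ agegrp + cholgrp, data = wcgs)
plot(fit.km, fun = "cloglog")            # log(-log S(t)) against log(t); roughly parallel
                                         # curves support the proportionality assumption
fit.cox <- coxph(Surv(time, dead) ~ age + chol + wcr + bmi, data = wcgs)
cox.zph(fit.cox)                         # the Grambsch-Therneau test of proportional hazards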
h(t) = -\frac{d \log(S(t))}{dt},
which is equivalent to
S(t) = \exp\left(-\int_0^t h(z)\,dz\right).
In other words,
\log(-\log[S(t; x)]) = x\beta + \log\left(\int_0^t \lambda_0(z)\,dz\right). \qquad (8.13)
FIGURE 9.2. The L1 Wasserstein distance between two Kaplan–Meier curves
for each daughter node. Then, we calculate the node impurities from (9.3).
A desirable split can be characterized as the one that results in the smallest
weighted impurity. This selection procedure is identical to that discussed
in Section 2.2. Indeed, we use (2.3) again to select a split; namely, the
goodness of a split s is
where λ̂τ is the hazard estimate. They select the split that maximizes
l(τL ) + l(τR ); here τL and τR are two daughter nodes.
The splitting criteria of LeBlanc and Crowley (1992) and Ciampi et
al. (1995) are both based on the assumption that the hazard functions in
two daughter nodes are proportional, but unknown. The difference between
their two approaches is whether the full or partial likelihood function in
the Cox proportional hazard model should be used. Here, we describe only
how to make use of the full likelihood and introduce a splitting rule that is
slightly simpler than that of LeBlanc and Crowley (1992) at a conceptual
level.
We shall see shortly that the use of likelihood generated from the Cox
model as the basis of splitting requires much more in-depth understand-
ing of survival concepts than does that of the log-rank test. On the other
hand, LeBlanc and Crowley (1992) acknowledged that their simulation
studies suggested similar performance between the two approaches. There-
fore, those who are interested basically in the practical use of survival trees
may choose to skip the following discussion. From a methodological point
of view, it is useful to know how parametric ideas can be adopted in the
nonparametric framework.
Assuming the proportional hazard model, all individuals in node τ have
the hazard
λτ (t) = θτ λ0 (t), (9.6)
Then, the full likelihood of the entire learning sample for a tree T can be
expressed as
L(\theta, \lambda_0; T) = \prod_{\tau \in \tilde{T}} L(\theta_\tau, \lambda_0), \qquad (9.9)
which is the product of the full likelihoods contributed by all terminal nodes
of T .
Every time we partition a node into two, we need to maximize the full tree
likelihood (9.9). It is immediately clear that this would be too ambitious for
computation, because maximizing (9.9) is usually impractical. Even worse
is the fact that the cumulative hazard Λ0 is unknown in practice, and it
must be estimated over and over again, since it is shared by all nodes.
Given the potential number of splits we have to go through, it is obviously
computationally prohibitive to pursue the precise solution. Furthermore,
due to the overall role of Λ0 , it is not apparent that we would arrive at the
same tree structure if we split the nodes in different orders. For example,
after the root node is divided, we may split the left daughter node first and
then the right one, and we may reverse the order. It is desirable that this
order has no consequence on the tree structure. As a remedy, LeBlanc and
Crowley propose to use a one-step Breslow’s (1972) estimate:
\hat{\Lambda}_0(t) = \sum_{i: Y_i \le t} \frac{\delta_i}{|R(Y_i)|}, \qquad (9.10)
so that
\hat{\lambda}_0(Y_i) = \frac{\delta_i}{|R(Y_i)|},
Note that λ̂0 (Yi ) can be estimated before the splitting. If we were to
split node τ into nodes τL and τR , we would maximize the sum of the log
likelihoods from the two daughter nodes; that is,
\sum_{i \in \tau_L}\left\{\delta_i \log[\hat{\lambda}_0(Y_i)\hat{\theta}_{\tau_L}] - \hat{\Lambda}_0(Y_i)\hat{\theta}_{\tau_L}\right\} + \sum_{i \in \tau_R}\left\{\delta_i \log[\hat{\lambda}_0(Y_i)\hat{\theta}_{\tau_R}] - \hat{\Lambda}_0(Y_i)\hat{\theta}_{\tau_R}\right\}.
where R(T ) is the sum of the costs over all terminal nodes of T . It is clear
that the remaining steps are identical to those in Section 4.2.3 if we can
define an appropriate node cost R(τ ) for survival trees.
While proposing their splitting criteria, most authors have also suggested
pruning rules that are closely related to the splitting principles. For in-
stance, Gordon and Olshen (1985) suggested using the impurity (9.3) also
as the node cost, R(τ ). Davis and Anderson (1989) take −l(τ ) in (9.5) as
R(τ ). The truth of the matter is that the splitting and pruning rules do
not have to be directly related. In practice, as long as it is appropriate, one
should feel free to match a splitting rule with any of the pruning rules. It
would be a useful project to scrutinize whether there exists a robust match
that results in satisfactory fits to censored data in a variety of settings.
Akin to the cost-complexity, LeBlanc and Crowley (1993) introduced the
notion of split-complexity as a substitute for cost-complexity in pruning a
survival tree. Let LR(τ ) be the value of the log-rank test at node τ . Then
the split-complexity measure is
LR_\alpha(T) = \sum_{\tau \in T - \tilde{T}} LR(\tau) - \alpha(|\tilde{T}| - 1).
Note that the summation above is over the set of internal (nonterminal)
nodes and |T̃ | − 1 is the number of internal nodes. The negative sign in
front of α is a reflection of the fact that LRα is to be maximized, whereas
the cost-complexity Rα is minimized. LeBlanc and Crowley recommend
choosing α between 2 and 4 if the log-rank test is expressed in the χ21 form.
A penalty of 4 corresponds roughly to the 0.05 significance level for a split,
and that of 2 is consistent with the use of AIC (Akaike 1974). As is the
case in the classification of binary outcome (see, e.g., Section 4.6), the log-
rank test statistic is usually overoptimistic for each split, due to the split
selection. LeBlanc and Crowley used bootstrap techniques to deflate the
value of LR.
In addition, Segal (1988) recommended a practical bottom-up procedure.
This procedure was described in the context of classifying a dichotomous
outcome in Section 4.5, except that now the χ2 statistic should be replaced
with the log-rank test statistic (8.7). We will go through this procedure
with real data in Section 9.5.
9.4 Implementation
The implementation of survival trees is more complicated than that of
classification trees. The calculation of the Kaplan–Meier curves, log-rank
statistics, or likelihood functions is not an easy task if it has to be repeated
a large number of times. It is prudent to achieve the greatest computational
efficiency.
As was shown by the data presented in Table 8.4, whether to calculate the
within-node Kaplan–Meier curve or to conduct a two-node log-rank test, we
need to compute four key quantities, Ki , di , ai , and ni , which were defined
in Section 8.1.2. Obviously, we want efficient algorithms for updating these
quantities while searching for the best node split. For instance, let Ki (τ )
be the number of individuals at risk at time ti within node τ. We consider
splitting τ into τL and τR , say, based on BMI. To make the matter simpler,
suppose that BMI takes only three distinct levels in our data: 24, 26, and
28. First, we should obtain Ki at each level of BMI and label them as
Ki24 , Ki26 , and Ki28 . Then, the first allowable split is to let the individuals
with BMI of 24 be contained in τL and the rest in τR . It is clear that
Ki (τL ) = Ki24 and Ki (τR ) = Ki26 + Ki28 . For the next allowable split,
we add Ki26 to Ki (τL ), whereas Ki (τR ) is reduced by Ki26 . This goes on
until we run out of allowable splits. The point here is that we should count
Ki ’s once for every level of the predictor and use them subsequently in the
splitting.
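A minimal sketch of this bookkeeping in R, assuming bmi and time hold the predictor values and observed times for the subjects in node τ, and death.times holds the distinct death times t_i (all names illustrative):

lev <- sort(unique(bmi))                                    # e.g., 24, 26, 28
K.by.level <- sapply(lev, function(b)                       # K_i tabulated once per BMI level
  sapply(death.times, function(t) sum(time[bmi == b] >= t)))
K.left  <- rep(0, length(death.times))                      # left daughter starts empty
K.right <- rowSums(K.by.level)                              # everything starts on the right
for (j in seq_len(length(lev) - 1)) {
  K.left  <- K.left  + K.by.level[, j]                      # move one slice per allowable split
  K.right <- K.right - K.by.level[, j]
  # K.left and K.right now hold K_i(tau_L) and K_i(tau_R) for this split; the same
  # updating applies to d_i, a_i, and n_i before evaluating the log-rank statistic.
}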
The approaches described are implemented in Heping Zhang’s stree,
which is a standalone program freely available at
http://c2s2.yale.edu/software/stree.
Those who are familiar with R can use the R function rsf() in the user-distributed randomSurvivalForest package.
FIGURE 9.3. An initial large tree obtained by the log-rank testing statistic.
The top and bottom numbers under the node are respectively the original and
maximized values of the statistic
niques in an important area, and sometimes the solution does not have to
be complicated.
It is interesting to note that NE was used to split node 1 and, immediately afterward, node 2. This is another example to underscore the usefulness of
considering multiway splits based on variables with ordinal scales, as we
suggested in Section 2.2.
10
Regression Trees and Adaptive Splines
for a Continuous Response
FIGURE 10.1. One-dimensional MARS (the thinner piecewise line) and CART
(the step function) models. The dotted curve is the underlying smooth function
[Diagram: the root node contains the function (10.2); the question "x6 > 25?" sends subjects to a left daughter contributing 0 or a right daughter contributing −414.1(x6 − 25); deeper nodes contribute terms such as −88.6(x15 − 1).]
FIGURE 10.2. A graphical representation of a MARS model. Inside the root node is the initial value of the regression surface, to which the values inside the offspring nodes are recursively added
Figure 10.2 presents a fitted regression model using MARS for our data,
where z5 and z10 are dummy variables defined in Section 3.2, indicating
a White woman and the use of DES by the pregnant woman’s mother,
respectively. An explicit mathematical formula for the MARS model will
be given in (10.41). The representation of the MARS model in Figure 10.2
shows the relationship between adaptive splines and classification trees,
because they both are based on the recursive partitioning of the domain
formed by the covariates.
At the top of Figure 10.2, which we used to call the root node, is the
function
3141.3 + 296.4z5 − 21.7x9 + 111x15 − 276.3z10. (10.2)
It can be used to calculate the initial value of predicting the birth weight for
any newborn. For instance, if the mother is a White nonsmoking woman
(z5 = 1, x9 = 0), her mother did not use DES (z10 = 0), and she was
pregnant once before (x15 = 1), then her newborn is assigned an initial
weight of 3141.3+296.4∗1−21.7∗0+111∗1−276.3∗0 = 3548.7 grams. At the
second layer, there is a zero inside the left daughter node and −414.1(x6 −
25) inside the right daughter node, as separated by the question of “x6 >
25?” This means, for example, that −414.1 ∗ (27 − 25) = −828.2 grams will
be reduced from the newborn’s initially assigned weight if his or her mother
had 27 years of education. However, no change is made if the mother had
no more than 25 years of education. Other nodes in Figure 10.2 can be
interpreted similarly.
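A minimal sketch in R that reproduces this arithmetic for the worked example; only the root-node function (10.2) and the education term read off Figure 10.2 are included, and the remaining nodes are omitted:

pos <- function(u) pmax(u, 0)                      # the truncated basis (.)_+
bw.hat <- function(z5, x9, x15, z10, x6) {
  w <- 3141.3 + 296.4 * z5 - 21.7 * x9 + 111 * x15 - 276.3 * z10  # root node (10.2)
  w - 414.1 * pos(x6 - 25)                         # adjustment from the "x6 > 25?" node
}
bw.hat(z5 = 1, x9 = 0, x15 = 1, z10 = 0, x6 = 12)  # 3548.7 grams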
In summary, on average, White babies are 296.4 grams heavier than the
others, and the use of DES by the pregnant woman’s mother reduces a
newborn’s weight by 276.3 grams. We see negative effects of high levels of
education (more than 25 years, x6 > 25), parity (x15 > 1), and gravidity
(x11 > 5). Two terms involve x9 , the number of cigarettes smoked. One
(−21.7x9 ) appears in the root node, and the other [21.7(x9 − 9)] in the
terminal nodes. The sum of these two terms suggests that the number of
cigarettes smoked has a negative effect when the number is beyond a half
pack of cigarettes per day.
The tree representation in Figure 10.2 is left and right balanced because
we have an additive spline model. That is, each term involves only one pre-
dictor. However, in the presence of product terms of two or more predictors,
the tree representation of a MARS model is not necessarily balanced, which
is similar to those classification trees that we have seen before.
where Ȳ is the average of Yi ’s within node τ. To split a node τ into its two
daughter nodes, \tau_L and \tau_R, we maximize the split function
i(\tau) - i(\tau_L) - i(\tau_R), \qquad (10.4)
where s is an allowable split. Unlike the goodness of split in (2.3), the split
function in (10.4) does not need weights. Furthermore, we can make use of
i(τ ) to define the tree cost as
R(T) = \sum_{\tau \in \tilde{T}} i(\tau) \qquad (10.5)
FIGURE 10.3. The profile of an initial regression tree for birth weight
FIGURE 10.5. A pruned regression tree for birth weight. On the top is the tree
structure with the average birth weights displayed for the terminal nodes and at
the bottom the within-terminal-node histograms of birth weight
[Plots of the linear basis functions: the left panel shows y = x together with y = (x − τ)+, and the right panel shows y = x together with y = (x − τ)−, each with a knot at τ.]
This MARS model provides a more specific relationship between the history
of pregnancy and birth weight.
The computation here is done in SPLUS. A sample code of our com-
putation is given below. The response variable is labeled as btw and all
predictors as allpreds. To reduce the computation burden, we grow the
initial tree, requiring the minimal node size to be 80.
birth.tree <- tree(btw ~ allpreds, minsize=80, mincut=40)  # grow the initial tree
plot(birth.tree, type="u")                                 # display its profile
plot(prune.tree(birth.tree))                               # plot the cost-complexity sequence
final.tree <- prune.tree(birth.tree, k=2500000)            # prune at the chosen complexity
tree.screens()                                             # split the graphics window
plot(final.tree, type="u")
text(final.tree)                                           # label the splits
tile.tree(final.tree)                                      # add within-node histograms
FIGURE 10.7. MARS model: 2.5 + 4(x1 − 0.3)+ − (x1 − 0.3)− + 4(x2 − 0.2)+ −
(x2 − 0.2)− − 4(x2 − 0.8)+
order terms. In other words, the same predictor is not allowed to appear
more than once in a single term. As a consequence, in the one-dimensional
case, model (10.6) becomes
\beta_0 + \sum_{k=1}^{M} \beta_k (x - \tau_k)^{*}, \qquad (10.7)
FIGURE 10.8. MARS model: 2.5 + 5(x1 − 0.3)+ − (x1 − 0.3)− + 4(x2 − 0.2)+ −
(x2 − 0.2)− − 4(x2 − 0.8)+ + 2(x1 − 0.3)+ (x2 − 0.2)+ − 5(x1 − 0.3)+ (x2 − 0.2)−
rectangles. In regression trees, a flat plane is used to fit the data within each
rectangle, and obviously the entire fit is not continuous in the borders of
the rectangles. In contrast, the MARS model is continuous, and within each
rectangle, it may or may not be a simple plane. For instance, the MARS
model in Figure 10.7 consists of six connected planes. However, the MARS
model in Figure 10.8 has both simple planes and “twisted” surfaces with
three pieces of each. The twisted surfaces result from the last two second-
order terms, and they are within the rectangles (i) x1 > 0.3 and x2 < 0.2,
(ii) x1 > 0.3 and 0.2 < x2 < 0.8, and (iii) x1 > 0.3 and x2 > 0.8. Figure
10.9 provides a focused view of the typical shape of a twisted surface.
What are the differences between the MARS model (10.6) and the or-
dinary linear regression model? In the ordinary linear model, we decide a
priori how many and what terms are to be entered into the model. How-
ever, we do not know how many terms to include in a MARS model prior
to the data modeling. In Figure 10.1, the MARS model is represented by
four line segments, but it could be three, five, or any number of segments.
In practice, we do assign a limit, such as 20 or 50, to the maximum number
of terms that can be included in a MARS model, depending on the data
dimension. This limit is much easier to choose than the exact number of terms in a model, and its impact is relatively
small in the model selection. Another key difference resides in that every
term in a linear model is fully determined, while it is partially specified in
\sum_{i=1}^{N} (Y_i - f(x_i; \theta))^2 \qquad (10.8)
over the entire domain of θ. In other words, our parameter estimates are
based on the least squares criterion.
To understand what is involved in step 1, it is helpful to know precisely
what the MARS model looks like at this step. After step 0, the MARS
model includes a constant term, and the best constant based on the least
squares criterion (10.8) is Ȳ , the sample average of Y. In step 1, we consider
adding a pair of
(xi − τ1 )+ and (xi − τ1 )−
into the existing model that has the constant term only. Thus, the candidate MARS model in the present step is of the form
\beta_0 + \beta_1 (x_i - \tau_1)_+ + \beta_2 (x_i - \tau_1)_-. \qquad (10.9)
there are three ways by which the existing model can be expanded. This is
because the new pair can be associated with one of the three existing basis
functions in (10.10). When the new pair of basis functions are multiplied
by the constant basis function, they remain the same in the larger model.
The resulting MARS model is one of
or
β0 + β1 x1 + β2 (x1 − τ1 )+ + β3 (x1 − τ )+ . (10.12)
However, if the new pair of basis functions are multiplied by x1 , a pair of
multiplicative basis functions (not the original basis functions) are attached
to the existing model as follows:
where i ≠ 1. Similarly, the new pair can be merged with (x_1 − \tau_1)_+, and
this leads to the model
0. Begin with the MARS model that contains all, say M, basis functions
generated from the forward algorithm.
1. Delete the existing nonconstant basis function that makes the least
contribution to the model according to the least squares criterion.
2. Repeat step 1 until only the constant basis remains in the model.
After a total of four steps, we should reach the constant basis in the
model. During this process, we have a sequence of five nested models,
fk , k = 1, . . . , 5, which includes both the initial five-basis-function model
and the constant-basis model. These five models are candidates for the final
model. The remaining question is, Which one should we select? The answer
would be obvious if we had a criterion by which to judge them. Friedman
and Silverman (1989), Friedman (1991), and Zhang (1994) use a modified
version of the generalized cross-validation criterion originally proposed by
Craven and Wahba (1979):
GCV(k) = \frac{\sum_{i=1}^{N} (Y_i - \hat{f}_k(x_i))^2}{N[1 - C(k)/N]^2}, \qquad (10.17)
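A minimal sketch of the selection step in R, assuming fits is the list of the five nested models, X and Y the data, and Ck a vector of the cost-complexity values C(k) that accompany (10.17); all names are illustrative:

gcv <- function(fit, Y, X, Ck) {
  N   <- length(Y)
  rss <- sum((Y - predict(fit, X))^2)
  rss / (N * (1 - Ck / N)^2)                        # criterion (10.17)
}
scores <- mapply(gcv, fit = fits, Ck = Ck, MoreArgs = list(Y = Y, X = X))
best   <- which.min(scores)                         # the candidate model with the smallest GCV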
(x(i) − τ1 1)+ ,
Now, we need to find another pair of basis functions that gives the best
fit to the data when the basis functions are merged with one of the existing
basis functions. Under the new notation, this means the following. Suppose
that the basis functions under consideration are xk and (xk − τ )+ (note
their equivalence to (xk − τ )+ and (xk − τ )− ). They generate two basis
vectors, x(k) and (x(k) − τ 1)+ . After merging these two basis vectors with
one of the existing basis vectors, we consider adding the following two basis
vectors,
bl ◦ x(k) and bl ◦ (x(k) − τ 1)+ , (10.18)
l = 0, . . . , K, into the existing model, where ◦ is the operation of multiplying
two vectors componentwise. The pair (10.18) is ruled out automatically if
bl has x(k) as a component. In addition, bl ◦ x(k) will be excluded if this
vector is already in the model. To avoid these details, we assume that both
basis vectors in (10.18) are eligible for inclusion. Let
B = (b0 , . . . , bK , bl ◦ x(k) ),
bK+1 (τ ) = bl ◦ (x(k) − τ 1)+ ,
and
r = (I - PP')Y,
where Y = (Y_1, \ldots, Y_N)', PP' = B(B'B)^{-1}B', and P'P is an identity
and one new (fixed) basis vector are entered. For any given τ, if we also
enter bK+1 (τ ) into the model, the least squares criterion equals
(r bK+1 (τ ))2
r 2
− . (10.19)
bK+1 (τ )(I − P P )bK+1 (τ )
The second term in (10.19) is a function of τ ; but the first one is not and
hence is irrelevant to the search for the best knot. Here comes the key task:
We must find the best τ such that
h(\tau) = \frac{(r' b_{K+1}(\tau))^2}{b_{K+1}(\tau)'(I - PP')\, b_{K+1}(\tau)} \qquad (10.20)
is maximized. Consequently, the residual sum of squares in (10.19) would
be minimized.
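A minimal sketch of this search in R, assuming B is the matrix whose columns are the existing basis vectors together with the fixed new vector b_l ∘ x_(k), bl and xk the corresponding vectors, and Y the response (all names illustrative); it simply evaluates h(τ) at each candidate knot:

best.knot <- function(Y, B, bl, xk) {
  Q <- qr.Q(qr(B))                          # orthonormal basis P for the span of B
  r <- Y - Q %*% crossprod(Q, Y)            # residual vector (I - PP')Y
  h <- function(tau) {                      # criterion (10.20)
    b1 <- bl * pmax(xk - tau, 0)            # b_{K+1}(tau) = b_l o (x_(k) - tau 1)_+
    u  <- b1 - Q %*% crossprod(Q, b1)       # (I - PP') b_{K+1}(tau)
    as.numeric(crossprod(r, b1))^2 / as.numeric(crossprod(b1, u))
  }
  ux   <- sort(unique(xk))
  taus <- ux[-length(ux)]                   # candidate knots at the observed values
  taus[which.max(sapply(taus, h))]
}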
The critical part is that we can express h(τ ) in (10.20) as a more explicit
function of τ if we restrict τ to an interval between two adjacent observed
values of xk . Without loss of generality, suppose that x1k , . . . , xN k are in
increasing order and that they are distinct. For τ ∈ [xjk , xj+1,k ), we have
b_{K+1}(\tau) = (b_l \circ x_{(k)})_{(-j)} - \tau\, b_{l(-j)}, \qquad (10.21)
where v_{(-j)} = (0, \ldots, 0, v_{j+1}, \ldots, v_N)' for any vector v. Then, the numerator of h(\tau) equals the square of
r'(b_l \circ x_{(k)})_{(-j)} - \tau\, r' b_{l(-j)}, \qquad (10.22)
\|(b_l \circ x_{(k)})_{(-j)}\|^2 - \|P'(b_l \circ x_{(k)})_{(-j)}\|^2 + \tau^2\left(\|b_{l(-j)}\|^2 - \|P' b_{l(-j)}\|^2\right) - 2\tau\left(b_{l(-j)}'(b_l \circ x_{(k)})_{(-j)} - (b_l \circ x_{(k)})_{(-j)}' PP' b_{l(-j)}\right). \qquad (10.23)
h(\tau) = \frac{(c_{1j} - c_{2j}\tau)^2}{c_{3j} - 2 c_{4j}\tau + c_{5j}\tau^2},
where the constants c_{1j}, \ldots, c_{5j} are defined in (10.24)–(10.28).
For the second term of c4j , we need to create two temporary (K + 1)-
vectors:
Then
w_{12} = w_{11} - b_{2l} x_{2k}\, p_{2\cdot},
w_{22} = w_{21} - b_{2l}\, p_{2\cdot},
where p_{2\cdot} is the second row vector of P. Therefore,
c_{42} = c_{41} - b_{2l}^2 x_{2k} - b_{2l} w_{11}' p_{2\cdot} - b_{2l} x_{2k} w_{11}' p_{2\cdot} + b_{2l}^2 x_{2k} \|p_{2\cdot}\|^2. \qquad (10.31)
Why do we need the recurrence formulas (10.30) and (10.31)? If we
obtained c11 , it takes two multiplications and one subtraction to derive
c12 . Furthermore, if c41 is already prepared, it takes 5(K + 1) steps to
update the vectors w_{12} and w_{22}; 3(K + 1) operations for computing w_{11}' p_{2\cdot}, x_{2k} w_{11}' p_{2\cdot}, and \|p_{2\cdot}\|^2; and hence 8(K + 1) + 11 operations altogether to reach c_{42}. The importance of this tedious counting lies in the fact that the
number of operations needed to move from c11 to c12 is a constant and that
the number of operations from c41 to c42 is proportional to the number of
existing basis functions. Moreover, the numbers of required operations are
the same as we move from c1j to c1,j+1 and from c4j to c4,j+1 . In fact, it
takes fewer than 18K operations to update all c’s from one interval to the
next. Thus, the best knot associated with xk and bl can be found in a total
of 18KN operations. There are at most Kp combinations of (k, l), implying
that the best pair of basis functions can be found in about 18K²pN steps.
Therefore, if we plan to build a MARS model with no more than M terms,
the total number of operations needed is on the order of M³pN. This detail
is relevant when we implement and further extend the algorithm.
We have explained in detail how to find the candidate knot within each
interval of the observed data points. The general search strategy is to scan
one interval at a time and keep the best knot.
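To make the search concrete, here is a minimal Python sketch of a naive version of this scan for a fixed candidate pair (b_l, x_k). It recomputes the projection for every interval instead of carrying the constants c_{1j}, ..., c_{5j} forward with the cheap updates described above, so it illustrates the criterion h(τ) rather than the fast algorithm; all names are ours and not part of any published code.

```python
import numpy as np

def best_knot(y, B, bl, xk):
    """Naive search for the best knot tau for the candidate pair (b_l, x_k).

    y  : (N,) response vector
    B  : (N, m) matrix whose columns are the basis vectors already entered,
         including b_l o x_k
    bl : (N,) existing basis vector b_l
    xk : (N,) covariate column x_k, assumed sorted in increasing order
         with distinct values

    Returns the knot maximizing h(tau).
    """
    # Residuals after projecting y onto the columns of B.
    Q, _ = np.linalg.qr(B)                 # columns of Q play the role of P
    r = y - Q @ (Q.T @ y)
    best_tau, best_h = None, -np.inf
    for j in range(len(xk) - 1):           # tau ranges over [x_jk, x_{j+1,k})
        tau = xk[j]
        b_new = bl * np.clip(xk - tau, 0.0, None)   # b_l o (x_k - tau)_+
        denom = b_new @ b_new - np.sum((Q.T @ b_new) ** 2)
        if denom > 1e-10:
            h = (r @ b_new) ** 2 / denom
            if h > best_h:
                best_tau, best_h = tau, h
    return best_tau, best_h
```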
Yi = f(xi) + εi (1 ≤ i ≤ N).
[Figure: curves labeled τ2 = 2, τ2 = 3, and τ2 = 4, plotted as y against x over the range 0 to 5.]
Suppose that the current model has two knots τ1 and τ2 (τ1 > τ2), with τ1 in the interior and τ2 near
the boundary; we consider the inclusion of a third knot τ3 . The candidate
τ3 can be chosen closer to τ2 than to τ1 without violating the correlation
threshold, because ρ(τ2 , τ2 + δ) tends to be less than ρ(τ1 , τ1 + δ). By fixing
δ and varying τ, it can be seen graphically that ρ(τ, τ + δ) increases as τ
moves from the left to the right. As a consequence, knots on the left edge
of the interval are presumably allowed to be closer together.
To resolve this potential problem, Zhang (1994) introduced a modified
correlation. Suppose τ1 , . . . , τk are the knots that have already been in-
cluded in the model, and the next knot τ, associated with a predictor xi ,
is to be determined. Let ρ+ be the generalized linear correlation between
(xi −τ 1)+ and the previously adopted bases and let ρ− be similarly defined
using (xi − τ 1)− . Precisely,
In step 1, after the first knot τ1 associated with a predictor xi1 is found,
we define the onset maximal correlation R through the minimum span L,
associated with xi1 , as follows:
R = max{ρ(τ1 − L), ρ(τ1 + L)}   if x_{i1,N} − L ≥ τ1 ≥ x_{i1,1} + L,
R = ρ(τ1 − L)   if τ1 + L > x_{i1,N},
R = ρ(τ1 + L)   otherwise,
where ρ(τ) is the modified correlation coefficient induced by τ and the knot
τ1 . If R > R∗ , set R = R∗ ; that is, R∗ prevents R from being numerically
almost 1.0.
When we add a new knot τk associated with a predictor xik to the set
of knots τ1 , . . . , τk−1 in step 2, the modified correlation coefficient ρ(τk ),
induced by τk and the knots τ1 , . . . , τk−1 , must be less than the current
R. As more knots are inserted into the model, the modified correlation
induced by the new candidate knot and knots already in the model generally
increases. Therefore, R should show an increase as needed, although it is
never allowed to exceed R∗ . A tentative scheme for making R increase is
to calculate a temporary R̃ that is analogous to R in step 1:
R̃ = max{ρ(τk − L), ρ(τk + L)}   if x_{ik,N} − L > τk > x_{ik,1} + L,
R̃ = ρ(τk − L)   if τk + L > x_{ik,N},
R̃ = ρ(τk + L)   otherwise.
With cubic basis functions, a single predictor x with one knot τ contributes terms of the form β1 x + β2 x² + β3 x³ + β4 [(x − τ)_+]³.
Recall from (10.19) and (10.20) that the RSS due to the MARS model
including both the existing and the new basis functions is ‖r‖² − h(τ), with
h(τ) defined in (10.20). Here, r is the residual vector when b0, . . . , bK, x_k,
x_k², and x_k³ are entered into the model. Note, however, that with the use of
cubic basis functions we need to change bK+1 (τ ) in (10.21) to
The critical fact to realize is that we deal with a similar set of constant
c’s as defined in (10.24)–(10.28) when searching for the best knot from one
observed data point to the next. For example, we need to update
r'(b_l ◦ b_l ◦ b_l ◦ x^{(k)} ◦ x^{(k)} ◦ x^{(k)})_{(−j)}.
This is obviously more complicated than c1j , but the principle is the same.
FIGURE 10.12. Motorcycle example: simulated data (the dots), the true function
(the solid curve), the MARS fit (the dashed curve), and the loess fit (the dotted
curve)
First, we take 50 random points from the interval [−0.2, 1.0] and denote
them by xi , i = 1, . . . , 50. Then, we generate a series of 50 random num-
bers from the normal distribution, namely, εi ∼ N [0, max2 (0.05, xi )], i =
1, . . . , 50. Finally, the observation is the sum of the signal, f(x), and
the noise, ε. That is,
Yi = f (xi ) + εi , i = 1, . . . , 50.
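Data of this kind can be reproduced along the following lines in Python; since the true function f of this example is defined earlier in the chapter, a placeholder curve is used here, and reading N[0, max²(0.05, x_i)] as a variance is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Placeholder for the true function of this example; the actual f used
    # in the book is defined earlier in the chapter.
    return np.sin(2 * np.pi * x)

# 50 random design points in [-0.2, 1.0]
x = rng.uniform(-0.2, 1.0, size=50)

# Heteroscedastic noise: eps_i ~ N(0, max(0.05, x_i)^2), so the standard
# deviation of the noise is max(0.05, x_i).
sd = np.maximum(0.05, x)
eps = rng.normal(0.0, sd)

# Observations are signal plus noise.
y = f(x) + eps
```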
This fitted model is also plotted in Figure 10.12 (the dashed curve) along
with a smoothed curve (the dotted one) resulting from the loess() function
in SPLUS.
As shown by Figure 10.12, the MARS fit catches the shape of the un-
derlying function quite well except for the artificial jump at the right end,
FIGURE 10.13. Residuals and absolute errors for the motorcycle data. The dots
and the solid curve are from the loess fit. The plus signs and the dotted curves
come from the MARS fit
Next, we consider an example with two predictors:
Yi = (2/3) sin(1.3x_{i1}) − (9/20) x_{i2}² + εi,
where xi1 , xi2 , and εi were generated from the standard normal distribution
N (0, 1). The theoretical correlation between xi1 and xi2 is 0.4.
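A data set of this form can be generated as follows; the sample size below is an arbitrary choice for illustration, not the one used in the book.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100  # illustrative sample size

# Draw (x1, x2) from a bivariate standard normal with correlation 0.4.
cov = np.array([[1.0, 0.4],
                [0.4, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
x1, x2 = x[:, 0], x[:, 1]

eps = rng.normal(size=n)
y = (2.0 / 3.0) * np.sin(1.3 * x1) - (9.0 / 20.0) * x2 ** 2 + eps
```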
With two predictors, the MARS allows for the second-order product
term. The selected MARS model is
Note that the original model is an additive model, but the MARS model
includes an interaction term: (x1 + 1.36)+ (x2 − 1.97)+ . This most likely is
a result of the collinearity between x1 and x2 , which confuses the MARS
algorithm. How do we deal with this problem? First, let us forbid the use
of the second-order product term. Then, a new MARS model is selected as
follows:
where the intercepts were roughly guessed by splitting 0.45. Figure 10.14
plots the observed data points, and the underlying and the fitted compo-
nents of the model. Although not always, in the present case, the structures
of the underlying components are well preserved in the MARS fit.
as the final choice with a GCV value of 1.44. In contrast, if the model is
allowed to include the second-order interaction terms, we selected
with a GCV value of 1.38. Models (10.39) and (10.40) are similar, but they
differ in the critical interaction term. The latter is slightly favorable in terms
of the GCV value. Also, it is worth noting that model (10.40) was
selected again when the third-order interaction terms were permitted in the
candidate models.
So far, MARS models appear to be capable of capturing the underlying
function structure. We should be aware of the fact that the algorithm can
be fooled easily. Note that the five predictors, x1 to x5 , were generated
independently. What happens if we observe z = x2 + x3 , not x3 ? Theo-
retically, this does not affect the information in the data. If we took the
difference z − x2 as a new predictor prior to the use of the MARS algo-
rithm, we would arrive at the same model. If we use z directly along with
11
Analysis of Longitudinal Data
In health-related studies, researchers often collect data from the same unit
(or subject) repeatedly over time. Measurements may be taken at different
times for different subjects. These are called longitudinal studies. Diggle,
Liang, and Zeger (1994) offer an excellent exposition of the issues related
to the design of such studies and the analysis of longitudinal data. They
also provide many interesting examples of data. We refer to their book
for a thorough treatment of the topic. The purpose of this chapter is to
introduce the methods based on recursive partitioning and to compare the
analyses of longitudinal data using different approaches.
The newborns in this study formed two groups: those whose mothers were regular cocaine users and those whose mothers were clearly
not cocaine users. The group membership was classified from the infants’
log of toxicology screens and their mothers’ obstetric records. In addition,
efforts have been made to match the unexposed newborns with the exposed
ones for date of birth, medical insurance, mother’s parity, age, and timing
of the first prenatal visit. The question of our concern is whether a mother’s
cocaine use has a significant effect on the growth of her infant.
After birth, the infants were brought back to see their pediatricians.
At each visit, body weight, height, and head circumference were recorded.
Figure 11.1 shows the growth curves of body weights for 20 randomly chosen
children.
Figure 11.1 suggests that the variability of weight increases as children
grow. Thus, we need to deal with this accelerating variability while model-
ing the growth curves. In Section 11.5.5 we will explain the actual process
of fitting these data. At the moment, we go directly to the result of analysis
reported in Zhang (1999) and put on the table what the adaptive spline
model can offer in analyzing longitudinal data.
Using mother’s cocaine use, infant’s gender, gestational age, and race
(White or Black) as covariates, Zhang (1999) identified the following model,
where d stands for infant’s age in days and ga for gestational age in weeks.
The variable s is the indicator for gender: 1 for girls and 0 for boys. The
absence of mother’s cocaine use in model (11.1) is a sign against its promi-
nence. Nonetheless, we will reexamine this factor later.
According to model (11.1), the velocity of growth lessens as a child ma-
tures. Beyond this common sense knowledge, model (11.1) defines several
interesting phases among which the velocity varies. Note that the knots for
age are 60, 120, 200, and 490 days, which are about 2, 4, 8, and 16 months.
In other words, as the velocity decreases, its duration doubles. This insight
cannot be readily revealed by traditional methods. Furthermore, girls grow
more slowly soon after birth, but start to catch up after four months. Gestational
age affects birth weight, as ample evidence has shown. It also influences
the growth dynamics. In particular, a more mature newborn tends to grow
faster at first, but later experiences slower growth than a less
mature newborn. Finally, it is appealing that model (11.1) mathematically
characterizes the infant growth pattern even without imposing any prior
knowledge. This characterization can provide an empirical basis for fur-
ther refinement of the growth pattern with expert knowledge as well as
assessment of other factors of interest.
ei ∼ N(0, Ri), i = 1, . . . , n,   (11.4)
and
e_{ij} = Σ_{k=0}^{p} ν_{ki} x_{k,ij} + ε_{ij},
What are the essential steps in applying model (11.3) for the analysis of
longitudinal data? A convenient approach to carry out the computation is
the use of PROC MIXED in the SAS package. One tricky step is the specifica-
tion of random effects. It requires knowledge of the particular study design
and objective. See Kleinbaum et al. (1988, Ch. 17) for general guidelines.
Following this step, it remains for us to specify the classes of covariance
structures Ri in (11.4) and G in (11.5).
In practice, Ri is commonly chosen as a diagonal matrix; for example,
Ri = σ1²I, where I is an identity matrix. The resulting model is referred to
as a conditional independence model. In other words, the measurements
within the same individual are independent after removing the random-
effect components. In applications, a usual choice for G is σ2²I. The dimension
of the identity matrix is omitted; it must conform with that of G, namely
the number of random effects. The subscripts of σ remind us of the stage
in which these covariance matrices take part.
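For readers who do not use SAS, a conditional independence model of this kind (a subject-level random intercept with independent within-subject errors) can be fit with the Python statsmodels package, as sketched below on a toy data set; all variable names and the data themselves are purely illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy longitudinal data: 30 subjects, 5 visits each (purely illustrative).
rng = np.random.default_rng(2)
n_subj, n_visit = 30, 5
subj = np.repeat(np.arange(n_subj), n_visit)
age = np.tile(np.arange(n_visit), n_subj)
u = rng.normal(0, 1.0, n_subj)[subj]               # random intercept per subject
weight = 3.0 + 0.8 * age + u + rng.normal(0, 0.5, n_subj * n_visit)
df = pd.DataFrame({"id": subj, "age": age, "weight": weight})

# Random intercept model: R_i = sigma_1^2 I within subject, scalar G = sigma_2^2.
model = smf.mixedlm("weight ~ age", data=df, groups=df["id"])
result = model.fit()
print(result.summary())
```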
There are many other choices for covariance matrices. A thorough list
of options is available from SAS Online Help. Diggle et al. (1991) devote
Y_{ij} = β0 + Σ_{k=1}^{p} β_k x_{k,ij} + μ(t_{ij}) + e_i(t_{ij}),   (11.7)
have the same multivariate normal distribution, and the correlation be-
tween Y (t) and Y (t + Δ) is ρ(Δ). Examples of ρ(Δ) are
(b) Update the estimate β̂ from the residuals rij = Yij − μ̂(tij ) using
generalized least squares,
estimate of the covariance structure from the residuals, rij = Yij − fˆij ,
i = 1, . . . , n and j = 1, . . . , q. (c) Repeat steps (a) and (b) until convergence.
These three steps are similar to those stated in the previous section.
Crowder and Hand (1990, p. 73) vividly described it as a see-saw algo-
rithm. As a matter of fact, if every step of the estimation is based on
maximizing a certain likelihood function, this three-step algorithm is a
“generalized” version of the method called restricted maximum likelihood
estimation (REML), which was introduced by Patterson and Thompson
(1971) to estimate variance components in a general linear model and has
recently been applied in the longitudinal data setting (e.g., McGilchrist and
Cullis 1991). The merits of the REML estimators in the context of mixed
models have been explored by many authors (e.g., Cressie and Lahiri 1993,
and Richardson and Welsh 1994). It is reasonable to hope that some of
the important properties of REML estimators also hold for the MASAL
estimators.
Section 11.5.1 describes the implementation of step (a), and Section
11.5.2 addresses that of step (b).
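The overall flow of the alternating estimation can be sketched in Python as follows. The mean model here is a plain linear fit standing in for the MASAL forward and backward steps, and a compound-symmetry covariance with a balanced design stands in for the richer covariance structures discussed later; the sketch only illustrates how steps (a) and (b) are iterated.

```python
import numpy as np

def fit_f_gls(X, y, Psi):
    """Weighted least squares fit of the mean model given covariance Psi
    (a stand-in for the MASAL forward/backward steps)."""
    Pinv = np.linalg.inv(Psi)
    return np.linalg.solve(X.T @ Pinv @ X, X.T @ Pinv @ y)

def estimate_cov(resid, q):
    """Moment estimate of a compound-symmetry covariance from residuals
    stacked subject by subject (balanced design: q occasions per subject)."""
    R = resid.reshape(-1, q)
    S = np.cov(R, rowvar=False)
    sigma2 = np.mean(np.diag(S))
    rho = np.mean(S[~np.eye(q, dtype=bool)]) / sigma2
    return sigma2 * ((1 - rho) * np.eye(q) + rho * np.ones((q, q)))

def seesaw(X, y, q, n_iter=5):
    """Alternate between estimating the mean (step a) and the covariance (step b)."""
    Psi_small = np.eye(q)                             # initial within-subject covariance
    for _ in range(n_iter):
        Psi = np.kron(np.eye(len(y) // q), Psi_small)  # block diagonal over subjects
        beta = fit_f_gls(X, y, Psi)                    # step (a)
        resid = y - X @ beta
        Psi_small = estimate_cov(resid, q)             # step (b)
    return beta, Psi_small
```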
where
y = (y1, . . . , yn)'   (11.13)
and
f = (f(t_{11}, x_{1,11}, . . . , x_{p,11}), . . . , f(t_{ij}, x_{1,ij}, . . . , x_{p,ij}), . . .)'.
Consider the transformed response
z = Ψ^{−1/2}y,   (11.14)
where Ψ^{−1/2}Ψ^{−1/2} = Ψ^{−1}.
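In code, the transformation (11.14) can be carried out with a symmetric square root of Ψ, for example as below; a Cholesky factor of Ψ^{−1} would serve equally well for whitening, although it is not the symmetric root used in (11.14).

```python
import numpy as np
from scipy.linalg import sqrtm

def whiten(y, Psi):
    """Return z = Psi^{-1/2} y for a positive-definite covariance matrix Psi."""
    Psi_inv_half = np.linalg.inv(sqrtm(Psi)).real   # symmetric square root, then invert
    return Psi_inv_half @ y
```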
If model (11.11) were a linear model, the transformed data would lead
to a weighted least squares estimate of f similar to (11.10). Unfortunately,
model (11.11) is not linear, and we will see why the nonlinearity deserves
special attention.
Recall the construction of an initial MARS model (10.10). The first and
also critical step in the forward algorithm is to find the best knot τ̂ associ-
ated with a covariate xk such that the WSS in (11.12) is minimized when
f is chosen to be of the form
β0 + β1 x_k + β2 (x_k − τ)_+ .
In the transformed scale, this amounts to fitting
z = β0 Ψ^{−1/2}1 + β1 Ψ^{−1/2}x_{k··} + β2 Ψ^{−1/2}b(τ) + Ψ^{−1/2}e,   (11.16)
where x_{k··} stacks the values of x_k over all subjects and occasions and b(τ) = (x_{k··} − τ1)_+ .
where the three terms represent, respectively, random effects, serial corre-
lation, and measurement error.
f(t, x) = 10t + 10 sin(x1 x4 π) + 20(x2 − 1/2)² + 5x5.   (11.23)
This is one of the functional structures studied by Zhang (1997). A few
points are worth mentioning. First, x3 and x6 are not in f, and hence they
are noise (or nuisance) predictors. Second, the function includes both linear
and nonlinear terms. Lastly, it also has additive and multiplicative terms.
The sample covariance matrices from the signal (or the true function) and
the noise are respectively
⎛ 4.97 0.42 0.29 1.08 1.12 ⎞
⎜ 0.42 4.20 1.15 0.59 1.21 ⎟
⎜ 0.29 1.15 3.83 0.75 1.16 ⎟
⎜ 1.08 0.59 0.75 3.89 1.29 ⎟
⎝ 1.12 1.21 1.16 1.29 4.33 ⎠

and

⎛ 12.9 6.12 6.47 5.46 5.86 ⎞
⎜ 6.12 13.7 7.50 6.42 6.90 ⎟
⎜ 6.47 7.50 16.6 5.62 5.98 ⎟
⎜ 5.46 6.42 5.62 16.3 6.78 ⎟
⎝ 5.86 6.90 5.98 6.78 15.3 ⎠.
These two matrices show that the occasionwise noise-to-signal variance ratio
(the ratio of corresponding diagonal elements) ranges from about 2.6 to 4.3.
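The quoted range can be verified directly from the diagonal entries of the two matrices:

```python
import numpy as np

signal = np.array([4.97, 4.20, 3.83, 3.89, 4.33])   # diagonal of the signal covariance
noise  = np.array([12.9, 13.7, 16.6, 16.3, 15.3])   # diagonal of the noise covariance

print(noise / signal)   # approximately [2.6, 3.3, 4.3, 4.2, 3.5]
```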
During the fitting, the maximum number of terms is 20, and the highest
order of interactions permitted is 2. To examine the change in the itera-
tive model-building process, we report not only the subsequent models in
Table 11.4, but also a measure of difference, dc , between two consecutive
covariance matrices and the log-likelihood, lr , in (11.22).
All information in Table 11.4 (the fitted model, dc , and lr ) reveals that
further continuation of cycling has little effect on the fit. The two nuisance
predictors, x3 and x6 , do not appear in any of the models. The fitted models
after the second iteration capture all four terms in the original structure.
Let us choose the model in the third iteration as our final model. We see
that t and x5 are included as linear effects with roughly the same coefficients
as the true values. The sum −8.57x2 + 19.4(x2 − 0.54)+ corresponds to the
quadratic term 20(x2 − 1/2)² in the true model. The knot 0.54 is close to the
FIGURE 11.2. Comparison of the true (10 sin(x1 x4 π)) and MASAL (11.24)
curves along two diagonal lines: x1 = x4 (left) and x1 = 1 − x4 (right)
underlying center 0.5 of the parabola, and the coefficients match reasonably
with the true values. The only multiplicative effects are for x1 and x4 . The
proxy for their sinusoidal function is 28.6x1 x4 − 15(x1 − 0.6)+ x4 − 70(x1 −
0.4)+ (x4 − 0.53). In Figure 11.2, we compare 10 sin(x1 x4 π) with
28.6x1 x4 − 15(x1 − 0.6)+ x4 − 70(x1 − 0.4)+ (x4 − 0.53) (11.24)
along the two diagonal lines x1 = x4 and x1 = 1 − x4 .
This example demonstrates that the MASAL model is capable of uncov-
ering both the overall and the detailed structure of the underlying model.
On the other hand, it is very easy to make the MASAL model fail by
increasing the noise level. In that case, we may not have any better alter-
native.
For comparison, let us see what happens when the mixed-effects model
(11.3) is employed. We adopted a backward stepwise procedure by initially
including 7 linear terms, t and x1 to x6 , and their second-order interaction
terms. The significance level for including a term in the final model is 0.05.
The following model was derived:
−0.5+10.6t+9.5x1 +4.49x2 +13.8x4 −1.44tx4 −6.4x1 x4 −10.2x2 x4 . (11.25)
Some aspects of model (11.25) are noteworthy. Firstly, it does not have a
quadratic term of x2 because “we did not know” a priori that we should
consider it. Secondly, it excludes the linear term of x5 . Thirdly, it includes
two interaction terms tx4 and x2 x4 that are not in the true model. Lastly,
the interaction term between x1 and x4 proves to be significant, but it is
difficult from a practical standpoint to consider the nonlinear terms of x1
and x4 .
In fairness to the mixed-effects model, we fitted the data again by at-
taching x22 and x5 to model (11.25). Not surprisingly, they turned out to be
significant, but the interaction term x1 x4 was not. Thus, we lost the true in-
teraction terms while retaining the two false ones. The point is that there
is nothing wrong theoretically with the mixed-effects model. We end up
with an imprecise model usually because we do not know where we should
start from. The present example is a simulation study. The real problems
are generally more difficult to deal with. In a word, the importance of the
mixed-effects model is undeniable, but we also should realistically face its
limitations.
The two treatment factors are coded as dummy variables: x1 as the indicator for iron dosing and x2 as that for
infection. In addition, we use x3 = x1 x2 as the interaction between x1 and
x2 .
We will take an analytic strategy different from that of Diggle et al.
(1991, p. 102) in two aspects. First, they took a log-transformation of body
weights to stabilize the variance over time. Figure 11.4 displays the sample
covariance matrices (the dots) against time for the body weights and their
log-transformations. This figure shows that the variance of the transformed
weights varies over time with a roughly quadratic trend, as was also noted
by Zhang (1997). Therefore, it is not particularly evident that the log-
transform indeed stabilized the variance over time. On the other hand, the
variance obtained from the original body weights seems to be described well
by a Gaussian form in (11.27) below. For these reasons, we will analyze the
original body weights, not their log-transformations. As a side note, the
log-transformation has little effect on the trend of the autocorrelation.
Since the time trend is not their primary interest, Diggle et al. had a
clever idea of avoiding fitting the time trend while addressing the major
hypothesis. This was made possible by taking a pointwise average of the
growth curve for the control group (x1 = x2 = 0) and then modeling the
differences between the weights in other groups and the average of the con-
trols. They assumed quadratic time trends for the differences. Generally
speaking, however, it is wise to be aware of the shortcomings of this ap-
proach. Twenty-three parameters are needed to derive the average profile
for the control group, although the actual trend can be fitted well with
far fewer parameters. As a consequence, the differences may involve
a greater variability and eventually could influence the final conclusion.
Since MASAL is designed to fit an arbitrary time trend, the differencing
is no longer necessary, nor is it desirable. Thus, we will fit directly the
body weights based on the two factors and time. This clarifies the second
difference between our strategy and that of Diggle et al.
The number of occasions, q = 23, is large relative to the
number of subjects, n = 26. It does not make much sense to use an unstruc-
tured covariance matrix. Thus, we have to explore the covariance structure
before doing any modeling. When we choose a covariance structure, it is
clearly important to capture the overall time trend, but it could be counter-
productive if we devoted too many degrees of freedom to the time trend.
The top two panels in Figure 11.4 display respectively the autocorrelation
against the time difference Δij = ti − tj (left) and the variance against
time (right) on the original weight scale. The time difference Δ will be
referred to as lag. The autocorrelation seems to decrease linearly in lag,
and the variance behaves as a Gaussian function. Specifically, using the
least squares method, we fitted the autocorrelation as a linear function of
lag and arrived at
FIGURE 11.4. Covariance structures of body weights (top) and their log-transformations (bottom). The dots are the sample estimates, and the solid lines and curves are the least squares fits
In many applications, data are collected for some specific aims. In this
example, for instance, the main interest now is to examine the effect of iron
dosing, infection, and their potential interaction on the growth of cows. In
such a circumstance, we have nothing to lose by entering the three variables
x1 to x3 of clinical interest into the MASAL model before the forward step
starts to cumulate basis functions. Table 11.5 displays the basis functions in
the order they appear during the forward step in the first (initial) iteration.
FIGURE 11.5. Residual plot (left) and the fitted growth curves (right)
two groups. The magnitude of the difference is about the same throughout
the last two-thirds of the study period.
To have a direct look into how well the model (11.30) fits the data, we
plot the prediction curves together with the observations for infected and
uninfected cows, respectively, in Figure 11.6. For the sake of comparison,
the mean curves are also depicted in the figure. For the uninfected cows,
the mean curve has more wiggle than the fitted curve, but otherwise they
are close. For the infected cows, the fitted curve is practically identical to
the mean curve. Therefore, it is evident from Figures 11.5 and 11.6 that
model (11.30) provides a useful fit to the data.
Are the terms in the selected model (11.30) statistically significant? It
is important to realize that MASAL selects models not based on the tra-
ditional standard of significance. Instead, it makes use of GCV. The two
standards are clearly related, but generally they do not lead to the same
model. In fact, because of the adaptive knot allocation and exhaustive
search of basis functions, the GCV criterion is usually more stringent than
the significance level of 0.05, as shown in Table 11.8. Assigning exact sig-
nificance levels to the terms in (11.30) is an open question. Thus, we use a
straightforward, but potentially biased, approach.
First, we hold the eight basis functions in (11.30) as if they were chosen
prior to the model selection. Then, model (11.30) is a linear regression
model. Table 11.8 shows the information related to the significance level of
each term. All p-values are far below a traditional mark of 0.05.
In retrospect, all terms in model (11.30) are highly “significant” if we
have certain faith in the p-values. Could iron dosing and the interaction
between iron dosing and infection play a role that is nevertheless significant
at a less ambitious level? To answer this question, we add x1 and x3, for
example, to model (11.30). It turns out that x1 and x3 do not really affect
the coefficients of the existing terms in model (11.30), and they are not
FIGURE 11.6. Fitted growth curves (solid) and mean curves (dotted) surrounded by the observed data points
significant at all (p-values > 0.5). This analysis confirms that of Diggle
et al. (1991). Interestingly, however, when fitting the log-weights, Zhang
(1997) found that the interaction plays a significant role.
Example 11.3 Blood Glucose Levels
The data for this example are taken from Table 2.3 of Crowder and
Hand (1990). Six students at Surrey University were offered free meals in
exchange for having their blood glucose levels measured. Six test meals were
given for each student at 10 a.m., 2 p.m., 6 a.m., 6 p.m., 2 a.m., and 10 p.m.
Blood glucose levels were recorded ten times relative to the meal time as
follows. The first glucose level was measured 15 minutes before the meal,
followed by a measurement at the meal time. The next four measurements
were taken a half-hour apart, and the last four (for some subjects, five) measurements
one hour apart. Figure 11.8 shows the growth curves of glucose level for
different meal times with the first two records removed.
The primary issue is the time-of-day effect on the glucose variational pat-
tern. Since the hours are periodic, we use x1 to indicate the morning time
and x2 the afternoon time. Furthermore, we use three additional dummy
variables, x3 to x5 , to discriminate two meal times in each of the morning,
afternoon, and night sessions. Because of the primary interest, we enter x1
to x5 into the model up front.
Crowder and Hand (1990, p. 13) conducted some preliminary analysis
for these data, using the concept of the total area under the curve (AUC).
In other words, if we take a subject at a meal time, we have a growth
curve of glucose level. Above a reasonably chosen “basal” level there is an
area under the curve. The information contributed by the curve is then
compressed into a single number—the area. After this data compression,
a simple t-test can be employed. The AUC is obviously an interpretable
feature of the curve, but it contains limited information. Crowder and
Hand also pointed out some fundamental limitations in the use of AUC.
Here, we attempt to fit the glucose pattern as a function of meal time and
measurement time. Furthermore, we will treat the first two measurements
(before and at the meal time) as predictors instead of responses because
they may reflect the up-to-date physical status of the subject.
As a prelude to the use of MASAL, we need to explore the covariance
structure. As an initial attempt, we make use of the sample covariance
matrix from the first hand residuals that are obtained as follows. For every
meal time and every measurement time, we have six glucose levels from the
six students. It is easy to find the group average of these six levels. Then,
the first hand residuals are the glucose levels less their corresponding group
averages. Using these residuals we plot in Figure 11.7 the sample variance
against time (left) and the sample correlation against the lag (right). The
figure also exhibits the estimated curves by the least squares method:
σ̂²(t) = exp{0.535 − 0.0008t − 6.2t²/10⁵ + 1.4t³/10⁷}   (11.31)
FIGURE 11.7. Initial covariance structure. The dots are the sample estimates
and the curves the fitted values
and
ρ̂(Δ) = sin{0.71 − 0.074Δ + 1.4Δ²/10⁵},   (11.32)
where t denotes time in minutes and Δ is the time lag. Note in (11.32) that
we use a sinusoid function to ensure that the correlation is between −1 and
1, although the trend shows a roughly quadratic pattern.
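If one wants to use the fitted functions (11.31) and (11.32) as a working covariance, they can be combined entrywise as in the sketch below; applying ρ̂ only to distinct occasions and taking the diagonal straight from σ̂²(t) is our convention, not something stated in the text, and the constants are simply those printed above.

```python
import numpy as np

def sigma2(t):
    # Fitted variance function (11.31); t is time in minutes.
    return np.exp(0.535 - 0.0008 * t - 6.2 * t**2 / 1e5 + 1.4 * t**3 / 1e7)

def rho(delta):
    # Fitted autocorrelation function (11.32); delta is the lag in minutes.
    return np.sin(0.71 - 0.074 * delta + 1.4 * delta**2 / 1e5)

def working_cov(times):
    """Assemble a working covariance matrix with off-diagonal entries
    sigma(t_j) * sigma(t_k) * rho(|t_j - t_k|) and diagonal sigma^2(t_j)."""
    t = np.asarray(times, dtype=float)
    sd = np.sqrt(sigma2(t))
    lag = np.abs(t[:, None] - t[None, :])
    C = np.outer(sd, sd) * rho(lag)
    np.fill_diagonal(C, sigma2(t))   # rho(0) from (11.32) need not equal exactly 1
    return C
```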
We employ (11.31) and (11.32) to form the initial estimate of the covari-
ance matrix. In the subsequent iterations, the covariance matrix will be
estimated by the maximum likelihood method based on the residuals from
the MASAL models.
We undertook three iterations for this example, and the changes from
iteration to iteration were minor. In fact, the basis functions were nearly
identical in the three iterations. There are changes in the estimates for the
covariance parameters from the initial iteration to the second one, but little
afterwards. The following MASAL model is from the third iteration:
and
ρ̂(Δ) = sin{0.793 − 0.00857Δ + 1.65Δ²/10⁵}.
From (11.33), we see that the glucose levels were higher when the test
meals were given at night, but whether the meals were eaten in the morning
or afternoon did not matter because the coefficients for x1 and x2 are very
close. The glucose level drops linearly for 3 1/2 hours after the meal and then
stays flat.
Figure 11.8 compares the model (11.33) to the original paths of the
glucose levels at 6 different meal times. Looking at the plots at the 10 a.m.
meal time, the fit may not catch the underlying trend well enough. Some
detailed features at the 10 p.m. meal time may be missed by the MASAL
model. Overall, the MASAL model appears to reflect the underlying process
of blood glucose levels.
After performing an AUC analysis, Crowder and Hand concluded that
there was a significant difference of the glucose levels between 10 a.m. and
10 p.m. meals, which, in some aspects, is similar to what we stated above.
Finally, we revert to the primary question: How different are the glucose
levels at different meal times? To answer this, we use the old trick by adding
x3 , x4 , and x5 to model (11.33) at once. Table 11.9 reports the information
for the five dummy variables only. The coefficients corresponding to x3 ,
x4 , and x5 are practically inconsequential and statistically insignificant.
Therefore, what seems to matter is whether the meals were given at night.
FIGURE 11.8. Blood glucose levels and the models. The thinner lines are indi-
vidual paths, and the thicker ones the fits
Consider the body weights observed on day d. This can be viewed as cutting the growth curves displayed in Figure
11.1 vertically on every day and then collecting all intersected points into
zd . Zhang (1999) found that the variance indeed increases with age and
that the pattern can be adequately described by a cubic polynomial.
To fully specify the covariance structure, we also need to gauge the auto-
correlation of weights between any two days. Because of the irregular time
schedule, an efficient way of examining the autocorrelation is the use of the
variogram (Diggle et al. 1991, pp. 50-51). For a stochastic process Y (t), the
variogram is defined as
γ(Δ) = (1/2) E{Y(t) − Y(t − Δ)}²,   Δ ≥ 0.
If Y (t) is stationary with variance σ 2 , the autocorrelation is a simple trans-
formation of the variogram as follows:
ρ(Δ) = 1 − γ(Δ)/σ 2 .
Although we have realized that the variance is not constant over time, we
hope that the variogram is still informative in revealing the autocorrelation
structure.
To obtain the sample variogram as a function of lag, we proceeded in two
steps following a procedure described by Diggle et al. (1991, p. 51). First, we
subtract each observation Yij on day dij from the average over a one-week
period to derive an initial residual rij , j = 1, . . . , Ti , i = 1, . . . , n. Then,
the sample variogram is calculated from pairs of half-squared residuals
v_{ijk} = (1/2)(r_{ij} − r_{ik})²
with a lag of Δijk = dij − dik . At each value of lag Δ, the average of v is
taken as the sample variogram γ̂(Δ). Remarkably, Zhang (1999) discovered
that the autocorrelation follows a linear trend.
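The two-step computation of the sample variogram translates directly into code; the sketch below assumes the initial residuals and the corresponding measurement days are already available as one array per subject.

```python
import numpy as np
from collections import defaultdict

def sample_variogram(residuals, days):
    """Sample variogram from within-subject residuals.

    residuals : list of 1-D arrays, r_i = (r_i1, ..., r_iTi), one per subject
    days      : list of 1-D arrays with the corresponding measurement days

    Returns a dict mapping lag -> average of 0.5 * (r_ij - r_ik)^2.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, d in zip(residuals, days):
        for j in range(len(r)):
            for k in range(j):
                lag = int(abs(d[j] - d[k]))
                sums[lag] += 0.5 * (r[j] - r[k]) ** 2
                counts[lag] += 1
    return {lag: sums[lag] / counts[lag] for lag in sums}
```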
After these explorations, it appears reasonable to postulate the following
covariance structure:
and
ρ(Δ) = φ0 + φ1 Δ, Δ = lag 1, . . . , lag 539. (11.35)
Model (11.1) results from using (11.34) and (11.35) as the backbone of
the covariance matrix and going through the same iterative process as in
the previous examples.
How well does the MASAL model fit the data? We address this ques-
tion graphically. In Figure 11.9, we plot the residuals against the predicted
values. In the original scale, as we expected, the variability is greater for
FIGURE 11.9. Residual plots against predictions. The left panel is in the original
scale and the right one in a standardized scale
a larger predicted value (see the left panel). After transforming the resid-
uals through the covariance matrix, no apparent structure emerges when
the transformed residuals are plotted against the transformed prediction.
Thus, these residual plots are in favor of the selected MASAL model and the
covariance structure (11.34) and (11.35). To further evaluate the MASAL
model, we plot the fitted curves surrounded by the observed points at
gestational ages of 36 and 40 weeks and for boys and girls, respectively,
in Figure 11.10. We chose 36 and 40 weeks because a 40-week delivery is
a typical full-term pregnancy, and 36 weeks is one week short of a term
delivery (37 weeks or later). It is clear that the fitted curves reside well
in the midst of the observations although there remain unexplained varia-
tions. Therefore, it is evident from Figures 11.9 and 11.10 that the selected
MASAL model is adequate and useful.
As we mentioned earlier, we are particularly interested in the effect of
cocaine use by a pregnant woman on her child’s growth. This variable,
denoted by c previously, did not stand out in the MASAL model. This is
clearly an indication of this factor’s limited impact. We should also real-
ize that our model-building and variable-selection procedures are not the
same as the traditional ones. Could cocaine use contribute significantly to
infant growth under a traditional model? To answer this question, we use
model (11.1) as our basis. Precisely, we hold all terms in this model as
fixed and examine the contribution of c by including c as a main effect or
an interaction term with one of the existing terms in addition to all terms
already in model (11.1). Table 11.10 presents the significance of these indi-
vidual terms, where the p-values are based on a two-sided t-test. Given the
number of tests that were undertaken, two terms involving the interactions
between cocaine use and gestational age may be worth pursuing. Overall,
FIGURE 11.10. Observations and predictions for boys and girls born at 36 and
40 weeks. The thicker curves are from the MASAL model (11.1), and the ver-
tical lines indicate the knot locations. Model (11.6) is drawn in thinner curves
separately for the cocaine-use group (solid) and no-use group (dashed). Along
with the number of observations, the unweighted residual sum of squares (RSS)
is given respectively for models (11.1) and (11.6) inside each panel. This figure
is reproduced from Figure 4 of Zhang (1999)
our data do not support the hypothesis that cocaine use by a pregnant
woman influences her infant’s growth significantly.
11.5.6 Remarks
There are a number of research questions that are not explored here. The
most important ones are whether the iterative procedure for estimating
the covariance matrices converges, say, in probability, and how fast the
convergence is. When no covariates but time are involved, our iterative
procedure is analogous to the so-called iterated Cochrane–Orcutt procedure
studied by Altman (1992). In one-dimensional smoothing with correlated
errors, Truong (1991) and Altman (1992) provided some asymptotic and
numerical properties for the covariance estimates after the first iteration.
It would be interesting to extend their theoretical results to our iterative
scheme. Truong (1991) assumed certain structures for the errors. It would
be helpful to consider these structures when we apply MASAL for the
analysis of longitudinal data.
Our examples have repeatedly shown that the MASAL model almost
converges in the second iteration. This does not appear to be accidental,
provided that the initial covariance matrix is constructed with “careful”
thought. Next, we give a heuristic argument that supports the convergence
of the iterative algorithm. This argument will also reveal where the con-
vergence can be destroyed.
The convergence here refers to the gradual increase in the likelihood, lr ,
defined in (11.22) as we move along with iterations. Suppose that we start
with an initial covariance matrix Ψ0 , and f0 is the resulting initial MASAL
model. Then, the covariance matrix is reestimated by maximizing lr while
f0 is held fixed, giving Ψ1 . Clearly,
lr(f0, Ψ1) = max_Ψ lr(f0, Ψ) ≥ lr(f0, Ψ0).   (11.36)
If, in addition, the new model f1 satisfies lr(f1, Ψ1) = max_f lr(f, Ψ1),   (11.37)
then lr(f1, Ψ1) ≥ lr(f0, Ψ0), which shows that lr does not decrease from one
iteration to the next. The relationship in (11.36) is granted if we assume
a parametric covariance structure. Note critically, however, that MASAL
does not really guarantee the expression (11.37) due to its stepwise nature.
Moreover, the MASAL function f is chosen from an infinite-dimensional
set of functions; thus, blindly maximizing lr is not so meaningful, be-
cause larger models always have advantages over smaller ones. Note also
that we have lr (f1 , Ψ1 ) ≥ lr (f0 , Ψ0 ) if lr (f1 , Ψ1 ) ≥ lr (f0 , Ψ1 ). If necessary,
the MASAL algorithm can be modified to ensure the latter inequality. The
key idea is to use f0 as a reference while we build f1 , which is originally
constructed from nothing. It warrants further investigation whether the
step-by-step increase of lr is at the price of missing a better model down
the road.
11.6 Regression Trees for Longitudinal Data
In this section, we first introduce the splitting criteria and then expand our introduction by addressing other more technical issues.
For any node τ, let Ψ(θτ ) be the within-node covariance matrix of the
longitudinal responses and ȳ(τ ) the vector of within-node sample averages
of the responses, where θ τ is a vector of parameters that may depend on
the node. Then, an obvious within-node impurity as measured by the least
squares is
SS(τ) = Σ_{subject i ∈ τ} (y_i − ȳ(τ))' Ψ^{−1}(θ_τ)(y_i − ȳ(τ)).   (11.38)
To split a node τ into its two daughter nodes, τL and τR , we aim at min-
imizing both SS(τL ) and SS(τR ). In other words, we maximize the split
function
φ(s, τ ) = SS(τ ) − SS(τL ) − SS(τR ), (11.39)
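The criteria (11.38) and (11.39) are straightforward to compute; the sketch below uses a single covariance matrix Ψ for every node, which is a simplification of the node-dependent Ψ(θ_τ).

```python
import numpy as np

def node_ss(Y, Psi):
    """Within-node impurity (11.38): sum over subjects of
    (y_i - ybar)' Psi^{-1} (y_i - ybar), with Y an (n_subjects x q) array."""
    ybar = Y.mean(axis=0)
    Pinv = np.linalg.inv(Psi)
    dev = Y - ybar
    return float(np.sum(dev @ Pinv * dev))

def split_value(Y, Psi, go_left):
    """Split function (11.39): SS(tau) - SS(tau_L) - SS(tau_R),
    where go_left is a boolean array assigning subjects to the left node."""
    return (node_ss(Y, Psi)
            - node_ss(Y[go_left], Psi)
            - node_ss(Y[~go_left], Psi))
```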
[FIGURE 11.11. Tree structure: node 1 (N = 95) is split by the number of male partners, with ≤ 28/year going to node 2 (N = 40) and > 28/year going to node 3 (N = 55); node 3 is then split by lifetime history of syphilis, with "never" going to node 4 (N = 25) and "once or more" going to node 5 (N = 30).]
The rise of β2 microglobulin for the subjects in node 3 is faster than that for the subjects in node 2, and likewise for nodes 4 and
5. It is noteworthy, however, that more sexually active individuals might
have been infected earlier. In other words, the number of sex partners may
not be the real cause, but instead it might have acted as a proxy for other
important factors such as different infection times that were not available
in the data.
[Figure: observed β2 microglobulin levels plotted against time in years, one panel per node; the panel titles include "Node 4: Number of Partners > 28, Never Syphilis" and "Node 5: Number of Partners > 28, Ever Syphilis".]
FIGURE 11.13. The average profile of β2 microglobulin within each node of the
tree presented in Figure 11.11. This figure was composed using a postscript file
made by Mark Segal
12
Analysis of Multiple Discrete Responses
Example 12.3 This is an example in which the risks of two distinct, but
presumably correlated, outcomes were studied: respiratory disease and
diarrhea in children with preexisting mild vitamin A deficiency.
Sommer and colleagues (Sommer et al. 1983 and 1984) conducted a
prospective longitudinal study of 4600 children aged up to 6 years at en-
try in rural villages of Indonesia between March 1977 and December 1978.
Their research team examined these children every 3 months for 18 months.
An average of 3135 children were free of respiratory disease and diarrhea at
each examination, at which they recorded interval medical history, weight,
height, general health status, and eye condition. They found
that the risk of respiratory disease and diarrhea were more closely associ-
ated with vitamin A status than with general nutritional status.
Example 12.4 Genes underlie numerous conditions and diseases. A vast
number of genetic epidemiologic studies have been conducted to infer ge-
netic bases of various syndromes. Multiple clustered responses naturally
arise from such studies. For example, Scourfield et al. (1996) examined the
gender difference in disorders of substance abuse, comorbid anxiety, and
sensation seeking, using the database from the Genetic Epidemiology Re-
search Unit, Yale University School of Medicine, New Haven, Connecticut,
FIGURE 12.1. Two pedigrees of different family sizes. Each square or circle rep-
resents a family member. The left pedigree pinpoints the relationship of relatives
to the proband. A sequence of three bits (0 or 1) is displayed within all squares
and circles, marking the status of substance abuse, anxiety, and sensation seeking,
respectively.
where
θ = (θ1, . . . , θq, θ12, . . . , θ_{q−1,q})'.
Based on model (12.2), the canonical parameters have certain interpreta-
tions. Precisely, we have
log [IP{Yj = 1 | Yk = yk, Yl = 0, l ≠ j, k} / IP{Yj = 0 | Yk = yk, Yl = 0, l ≠ j, k}] = θj + θjk yk.
Thus, θj is the log odds for Yj = 1 given that the remaining components
of Y equal zero. In addition, θjk is referred to as an association parame-
ter because it is the conditional log odds ratio describing the association
between Yj and Yk provided that the other components of Y are zero. It
is important to realize that the canonical parameters are the log odds or
odds ratio under certain conditions, but we should be aware of the fact
that these conditions may not always make sense.
Why is model (12.1) called a log-linear model? Let us consider a bivari-
ate case. It follows from model (12.2) that the joint probability for the n
bivariate vectors is
exp[θ1(n21 + n22) + θ2(n12 + n22) + θ12 n22 + nA(θ)],   (12.3)
where n11 = Σ_{i=1}^{n}(1 − y_{i1})(1 − y_{i2}), n12 = Σ_{i=1}^{n}(1 − y_{i1})y_{i2}, n21 =
Σ_{i=1}^{n} y_{i1}(1 − y_{i2}), and n22 = Σ_{i=1}^{n} y_{i1}y_{i2} are the cell counts in the fol-
lowing 2 × 2 table:
            Y2
            0     1
  Y1   0   n11   n12
       1   n21   n22
It is easy to see that the expression in (12.3) equals
[n!/(n11! n12! n21! n22!)] m11^{n11} m12^{n12} m21^{n21} m22^{n22},
where
log(m_{jk}) = μ + λ^{Y1}_j + λ^{Y2}_k + λ^{Y1Y2}_{jk},   (12.4)
with
μ = (θ1 + θ2)/2 + θ12/4 + A(θ),   (12.5)
λ^{Y1}_1 = −θ1/2 − θ12/4 + A(θ),   (12.6)
λ^{Y2}_1 = −θ2/2 − θ12/4 + A(θ),   (12.7)
λ^{Y1Y2}_{11} = θ12/4,   (12.8)
and λ^{Y1}_2 = −λ^{Y1}_1, λ^{Y2}_2 = −λ^{Y2}_1, and λ^{Y1Y2}_{12} = λ^{Y1Y2}_{21} = −λ^{Y1Y2}_{22} = −λ^{Y1Y2}_{11}.
In other words, (n11 , n12 , n21 , n22 ) follows a multinomial distribution with
means specified by the log-linear effects in (12.4). This is usually how the
log-linear models are introduced (e.g., Agresti 1990, Chapter 5). Further,
Equations (12.5)–(12.8) provide another way to interpret the canonical pa-
rameters.
The joint probability can also be written through the Bahadur representation:
IP{Y = y} = ∏_{j=1}^{q} μ_j^{y_j}(1 − μ_j)^{1−y_j}
× (1 + Σ_{j1<j2} ρ_{j1j2} r_{j1}r_{j2} + Σ_{j1<j2<j3} ρ_{j1j2j3} r_{j1}r_{j2}r_{j3} + · · · + ρ_{1···q} r_1 · · · r_q),
where
μ_j = IE{Y_j},
r_j = (y_j − μ_j)/√(μ_j(1 − μ_j)),
ρ_{j1···jl} = IE{R_{j1} · · · R_{jl}},
j = 1, . . . , q.
The Bahadur representation is one step forward in terms of formulating
the log-linear model as a function of the parameters such as means and
correlations that we used to see in the analysis of continuous responses.
This representation is, however, severely handicapped by the fact that the
"hierarchical" correlations entangle the ones at lower orders and the means
and that it is particularly problematic in the presence of covariates. To
address the dilemma between the parameter interpretability and feasibility,
Liang et al. (1992) proposed the use of marginal models parametrized by
the means, the odds ratios, and the contrasts of odds ratios. Specifically,
let
γ_{j1j2} = OR(Y_{j1}, Y_{j2}) = [IP{Y_{j1} = 1, Y_{j2} = 1} IP{Y_{j1} = 0, Y_{j2} = 0}] / [IP{Y_{j1} = 1, Y_{j2} = 0} IP{Y_{j1} = 0, Y_{j2} = 1}],
ζ_{j1j2j3} = log[OR(Y_{j1}, Y_{j2} | Y_{j3} = 1)] − log[OR(Y_{j1}, Y_{j2} | Y_{j3} = 0)],
and generally,
ζ_{j1···jl} = Σ_{y_{j3},...,y_{jl}=0,1} (−1)^{b(y)} log[OR(Y_{j1}, Y_{j2} | y_{j3}, . . . , y_{jl})],
where b(y) = Σ_{k=3}^{l} y_{jk} + l − 2.
It is quite unfortunate that evaluating the full likelihood based on the
new set of parameters, μj , γj1 j2 , and ζj1 ···jl , is generally complicated. To
gain insight into where the complications arise, let us go through the details
for the bivariate case. We need to specify the probability IP {Y1 = y1 , Y2 =
y2}, which we denote by p(y1, y2), for the four possible combinations of (y1, y2). The following
four equations can lead to the unique identification of the four probabilities:
p(1, 1) + p(1, 0) = μ1 ,
p(0, 1) + p(1, 1) = μ2 ,
p(1, 1) + p(1, 0) + p(0, 1) + p(0, 0) = 1,
p(1, 1)p(0, 0) = γ12 p(0, 1)p(1, 0).
From the first three equations, we have p(1, 0) = μ1 − p(1, 1), p(0, 1) =
μ2 − p(1, 1), and p(0, 0) = 1 − μ1 − μ2 + p(1, 1). If we plug them into the
last equation, we have a quadratic equation in p(1, 1) whose admissible root is
p(1, 1) = [√(a² + 4γ12(1 − γ12)μ1μ2) − a] / [2(1 − γ12)]   if γ12 ≠ 1,
p(1, 1) = μ1μ2   if γ12 = 1,
where a = 1 − (1 − γ12)(μ1 + μ2).
When we have more than two responses, the problem could be intractable
if we do not reduce the dimension of the parameters appropriately such as
setting γj1 j2 = γ.
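Numerically, p(1, 1) is easy to recover from the two means and the odds ratio; the following sketch evaluates the solution above and checks the implied odds ratio on a small example.

```python
import numpy as np

def joint_prob_11(mu1, mu2, gamma12):
    """Recover p(1,1) from the marginal means and the odds ratio gamma12
    by solving the quadratic described in the text."""
    if np.isclose(gamma12, 1.0):
        return mu1 * mu2                     # independence case
    a = 1.0 - (1.0 - gamma12) * (mu1 + mu2)
    disc = a * a + 4.0 * gamma12 * (1.0 - gamma12) * mu1 * mu2
    return (np.sqrt(disc) - a) / (2.0 * (1.0 - gamma12))

# Example: mu1 = mu2 = 0.5 and an odds ratio of 4 gives p(1,1) = 1/3.
p11 = joint_prob_11(0.5, 0.5, 4.0)
p10, p01 = 0.5 - p11, 0.5 - p11
p00 = 1 - 0.5 - 0.5 + p11
print(p11, (p11 * p00) / (p10 * p01))        # 0.333..., 4.0
```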
For model (12.9), we assume that there exists a vectorial link function η
that transforms x coupled with a condensed vector of parameters β to θ,
e.g., θ = η(x β). Then, the GEE approach attempts to solve the unbiased
estimating equations (Godambe 1960; Zhao and Prentice 1990)
U(β) = Σ_{i=1}^{n} J V_i^{−1} ((y_i − μ)', (w_i − ω)')' = 0,   (12.10)
evaluated at β̂ (Liang et al., 1992). It also turns out that U(β) resembles the
quasi-score function derived from the quasi-likelihood as introduced in (9.5)
of McCullagh and Nelder (1989).
Likewise, if we are interested in the pairwise odds ratio and use the
marginal models, then we assume a link function between parameters μj
and γjk , and covariates x. The rest of the derivation for GEE is identical
to that above.
for k = 1, 2, 3. A critical assumption is that for the ith family and con-
ditional on all possible Uki ’s, denoted by U i , the health conditions of all
harbors the frailties. The construction of aij is based on assuming the ex-
istence of a major susceptibility locus with alleles A and a, as clarified
below.
The frequency of allele A is θ2, and (U^i_{2,2j−1}, U^i_{2,2j}) indicate the presence
of allele A in the two chromosomes of the jth member of the ith family.
Based on the Mendelian transmission, θ3 = 0.5. The parameter interpre-
tation in model (12.13) is most important. The β parameters measure the
strength of association between the trait and the covariates conditional on
the frailties, while the γ parameters indicate the familial and genetic con-
tributions to the trait. Note that γ = (γ1, γ2, γ3)'. If γ2 = 0 and γ3 ≠ 0,
it suggests a recessive trait because a genetic effect is expressed only in
the presence of two A alleles. On the other hand, if a completely dominant
gene underlies the trait, genotypes Aa and AA give rise to the same effect,
implying that γ2 = 2γ2 + γ3 , i.e., γ2 = −γ3 .
The frailty model (12.13) is closely related to many existing models for
segregation analysis, all of which can be traced back to the classic Elston–
Stewart (1971) model for the genetic analysis of pedigree data. The Elston–
Stewart model was originally designed to identify the mode of inheritance of
a particular trait of interest without considering the presence of covariates.
The frailty model (12.13) is quite similar to the class D logistic regressive
models of Bonney (1986, 1987). The major difference is the method for
modeling familial correlations as a result of residual genetic effects and
environment. The regressive models make use of the parental traits and
assume the conditional independence among siblings on the parental traits.
In contrast, the frailty model assumes the conditional independence among
all family members on the frailty variable. Conceptually, frailty variables
defined here are very similar to that of ousiotype introduced by Cannings
et al. (1978) in pedigree analysis, where a unique ousiotype (essence) for
each individual is assumed to represent unobservable genetic effects. Many
other authors including Bonney (1986, 1987) adopted the ousiotype as the
genotype. Frailty model (12.13) can be viewed as a further clarification of
the ousiotype into a major genotype of focus and residual unobservable
effects.
In terms of computation, when both U and Y are observable, the com-
plete log-likelihood function is easy to derive, and the EM algorithm (Demp-
ster, Laird, and Rubin 1977) can be applied to find the parameter estimates.
A detailed development of the frailty model for segregation analysis is
presented in Zhang and Merikangas (2000).
where nτ , nτL , and nτR are respectively the numbers of subjects in node τ
and its left and right daughter nodes τL and τR .
When we have a single binary response, criterion (12.16) is essentially
the Gini index in (4.4). This is because
|Vτ| = [nτ/(nτ − 1)] pτ(1 − pτ),

where pτ is the proportion of diseased subjects in node τ.
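This identity is easy to verify numerically; the following snippet (illustrative only) compares the unbiased sample variance of a binary vector with nτ/(nτ − 1) pτ(1 − pτ).

import numpy as np

# Numerical check of |V_tau| = n_tau/(n_tau - 1) * p_tau * (1 - p_tau) for one binary response.
y = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1])   # binary outcomes in node tau
n = y.size
p = y.mean()
print(np.var(y, ddof=1))            # determinant of V_tau (a 1x1 matrix here)
print(n / (n - 1) * p * (1 - p))    # matches: n/(n-1) * p(1-p)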
Further, as a direct extension of the criterion (11.38) used in the trees
for continuous longitudinal data, another measure of within-node homo-
geneity that deserves our attention is

h2(τ) = −(1/nτ) Σ_{i ∈ node τ} (yi − ȳ(τ))' V⁻¹ (yi − ȳ(τ)),          (12.17)

where V and ȳ(τ) are estimated from the learning sample only. The tree is
then pruned by minimizing the cost-complexity

Rα(T) = R(T) + α|T̃|.
After Rα (T ) is defined, the rest of the procedure is identical to that in
Section 4.2.3. We should mention, however, that a theoretical derivation of
the standard error for R(T ) seems formidable. As a start, Zhang (1998a)
suggested repeating the cross-validation procedure ten times. This process
results in an empirical estimate of the needed standard error. Although
it was not explicitly stated, this in effect introduced the idea of bagging,
except that it was for the purpose of determining the tree size.
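As an illustration of criterion (12.17), a minimal sketch is given below; it assumes that V has already been estimated from the learning sample, and it is not the RTREE implementation.

import numpy as np

def h2(node_y, V):
    """Within-node homogeneity (12.17): the negative of the average Mahalanobis
    distance of the responses in the node from the node mean, using a fixed V."""
    node_y = np.asarray(node_y, dtype=float)
    resid = node_y - node_y.mean(axis=0)          # y_i - ybar(tau)
    Vinv = np.linalg.inv(V)
    return -np.einsum("ij,jk,ik->", resid, Vinv, resid) / node_y.shape[0]

# Toy usage with two binary responses.
rng = np.random.default_rng(0)
learning = rng.integers(0, 2, size=(200, 2)).astype(float)
V0 = np.cov(learning, rowvar=False)               # V estimated from the learning sample
node = learning[:40]                               # responses falling into a node
print(h2(node, V0))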
point and if the sample size is sufficiently large, E{Y} and V(Y) should
be close to ȳ and V0, respectively. So, the following simplified updating
formula takes the place of the one in (12.20):
∂μ/∂Φ = Cov(Y, Z') and ∂γ/∂Φ = Cov(w, Z').
By the chain rule, we have

∂ρ/∂Φ = (∂ρ/∂γ)(∂γ/∂Φ) + (∂ρ/∂μ)(∂μ/∂Φ)
      = (∂ρ/∂γ) Cov(w, Z') + (∂ρ/∂μ) Cov(Y, Z').

Therefore,

\begin{pmatrix} ∂μ/∂Φ \\ ∂ρ/∂Φ \end{pmatrix} = \begin{pmatrix} I & 0 \\ ∂ρ/∂μ & ∂ρ/∂γ \end{pmatrix} Cov(Z) ≝ J V.
Following Royall (1986), a robust estimate for the covariance matrix of μ̂ and ρ̂ is obtained as follows:
V̂(μ̂, ρ̂) = [nτ I(μ̂, ρ̂)]⁻¹ Σ Ĵ V̂⁻¹ ((yi − μ̂)', (wi − γ̂)')' ((yi − μ̂)', (wi − γ̂)') V̂⁻¹ Ĵ' [nτ I(μ̂, ρ̂)]⁻¹
          = (1/nτ²) Σ Ĵ ((yi − μ̂)', (wi − γ̂)')' ((yi − μ̂)', (wi − γ̂)') Ĵ',
where nτ is the number of subjects in node τ and the summation is over all
subjects in node τ. From the formula above it is numerically straightforward
to compute the standard errors for μ̂ and ρ̂.
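A hedged sketch of the simplified (second) form of this covariance estimate is given below; the matrix Ĵ must be supplied from the transformation of (μ, γ) into (μ, ρ) defined earlier, and the toy data and the identity choice of Ĵ are for illustration only.

import numpy as np

def sandwich_cov(y, w, J_hat):
    """Empirical covariance of the node-level moment estimates, following the
    simplified form above: (1/n^2) * sum_i J_hat (z_i - z_bar)(z_i - z_bar)' J_hat',
    where z_i stacks (y_i, w_i) and z_bar collects (mu_hat, gamma_hat)."""
    z = np.hstack([np.asarray(y, float), np.asarray(w, float)])
    resid = z - z.mean(axis=0)           # (y_i - mu_hat, w_i - gamma_hat)
    n = z.shape[0]
    middle = resid.T @ resid             # sum of outer products of the residuals
    return (J_hat @ middle @ J_hat.T) / n**2

# Toy usage: two binary responses, w_i = y_i1 * y_i2, identity J for illustration only.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=(50, 2)).astype(float)
w = (y[:, 0] * y[:, 1]).reshape(-1, 1)
cov = sandwich_cov(y, w, J_hat=np.eye(3))
print(np.sqrt(np.diag(cov)))             # standard errors for the three moments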
Predictor Questions
x1 What is the type of your working space?
(enclosed office with door, cubicles, stacks, etc.)
x2 How is your working space shared?
(single occupant, shared, etc.)
x3 Do you have a metal desk? (yes or no)
x4 Do you have new equipment at your work area?
(yes or no)
x5 Are you allergic to pollen? (yes or no)
x6 Are you allergic to dust? (yes or no)
x7 Are you allergic to molds? (yes or no)
x8 How old are you? (16 to 70 years old)
x9 Gender (male or female)
x10 Is there too much air movement at your work area?
(never, rarely, sometimes, often, always)
x11 Is there too little air movement at your work area?
(never, rarely, sometimes, often, always)
x12 Is your work area too dry?
(never, rarely, sometimes, often, always)
x13 Is the air too stuffy at your work area?
(never, rarely, sometimes, often, always)
x14 Is your work area too noisy?
(never, rarely, sometimes, often, always)
x15 Is your work area too dusty?
(never, rarely, sometimes, often, always)
x16 Do you experience glare at your workstation?
(no, sometimes, often, always)
x17 How comfortable is your chair? (reasonably,
somewhat, very uncomfortable, no one specific chair)
x18 Is your chair easily adjustable?
(yes, no, not adjustable)
x19 Do you have influence over arranging the furniture?
(very little, little, moderate, much, very much)
x20 Do you have children at home? (yes or no)
x21 Do you have major childcare duties? (yes or no)
x22 What type of job do you have?
(managerial, professional, technical, etc.)
This table is reproduced from Table 1 of Zhang (1998a).
FIGURE 12.2. Cost-complexity for two sequences of nested subtrees. Panels (a)
and (b) come from trees using h(τ) and h2(τ), respectively. The solid line is the
log of the cross-validation (CV) estimate of cost, and the dotted line is the log of
one standard error above the CV estimate of cost
FIGURE 12.3. Tree structure for the risk factors of BROCS based on h(τ). Inside
each node (a circle or a box) are the node number and the numbers of subjects
in the learning (middle) and validation (bottom) samples. The splitting question
is given under the node
FIGURE 12.4. Tree structure for the risk factors of BROCS based on h2(τ).
Inside each node (a circle or a box) are the node number and the numbers of
subjects in the learning (middle) and validation (bottom) samples. The splitting
question is given under the node
discomfort because they had the best air quality. Overall, Figure 12.3 and
Table 12.3 show the importance of air quality around the working area.
Based on a different criterion, h2 (τ ), Figure 12.4 demonstrates again the
importance of air quality. It uses nearly the same splits as Figure 12.3 except
that “experiencing a glare” also emerged as a splitting factor. By comparing
terminal nodes 10 and 11 in Figure 12.4, it appears that “experiencing a
glare” resulted in more discomfort for all clusters of symptoms.
FIGURE 12.5. Comparison of ROC curves for the classification trees in Figures
12.3 and 12.4 among individual clusters. The true positive probability (TPP)
is plotted against the false positive probability (FPP). The solid line indicates
the performance of a random prediction. The dotted and dashed ROC curves
come from Figures 12.3 and 12.4, respectively, and the areas under them are also
reported
different K's. We define K − 1 indicator variables yijk = I(zij > k), for
k = 1, . . . , K − 1, where I(·) is the indicator function, and collect these
indicators for each original response. Then, the observed responses from the
ith unit can be rewritten as a vector yi whose components are all binary,
and hence we can use the same procedure as described in Section 12.2.1.
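The conversion from an ordinal response with K levels to K − 1 binary indicators is mechanical; a small illustrative helper (hypothetical name) is shown below.

import numpy as np

def ordinal_to_binary(z, K):
    """Expand an ordinal response z in {1, ..., K} into K-1 binary indicators
    y_k = I(z > k), k = 1, ..., K-1, as described above."""
    z = np.asarray(z)
    return np.stack([(z > k).astype(int) for k in range(1, K)], axis=-1)

# Example: one ordinal response with K = 4 levels observed on three subjects.
z = np.array([1, 3, 4])
print(ordinal_to_binary(z, K=4))
# [[0 0 0]
#  [1 1 0]
#  [1 1 1]]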
exp[ μ + Σ_{k=1}^{6} λ^{yk}_{ik} + Σ_{k=1}^{4} λ^{zk}_{jk}
   + ( Σ_{k=4,6} λ^{z1 yk}_{j1 ik} + Σ_{k=3,5} λ^{z2 yk}_{j2 ik} + Σ_{k=2,5} λ^{z3 yk}_{j3 ik} + Σ_{k=1}^{4} λ^{z4 yk}_{j4 ik} )
   + ( Σ_{l=1,2} Σ_{k=3}^{6} λ^{yl yk}_{il ik} + Σ_{k=4}^{6} λ^{y3 yk}_{i3 ik} + Σ_{k=5}^{6} λ^{y4 yk}_{i4 ik} + λ^{y5 y6}_{i5 i6} )
   + ( λ^{z2 y2 y4}_{j2 i2 i4} + Σ_{k=4,6} λ^{z3 y3 yk}_{j3 i3 ik} + Σ_{k=1,3} λ^{z4 yk y4}_{j4 ik i4} ) ].          (12.24)
The second PROC CATMOD statement of the SAS program in Table 12.5
performed the computation for model (12.24). The results were organized
in Table 12.6 in five categories based on the grouping of the terms in model
(12.24).
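Outside SAS, a log-linear model of this kind can be fit as a Poisson regression on the cell counts. The sketch below uses statsmodels with a hypothetical, much smaller model than (12.24) and made-up counts, purely to illustrate the mechanics.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical 2x2x2 contingency table: two symptom clusters (y1, y2) and one
# binary predictor (z1); the counts are made up for illustration only.
levels = [(y1, y2, z1) for y1 in (0, 1) for y2 in (0, 1) for z1 in (0, 1)]
counts = [30, 20, 15, 25, 10, 18, 8, 40]
df = pd.DataFrame(levels, columns=["y1", "y2", "z1"])
df["count"] = counts

# Main effects plus two selected two-way interactions, analogous in spirit to the
# selected interaction terms in model (12.24), fit as a Poisson log-linear model.
fit = smf.glm("count ~ C(y1) + C(y2) + C(z1) + C(y1):C(y2) + C(z1):C(y2)",
              data=df, family=sm.families.Poisson()).fit()
print(fit.summary())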
In this chapter we provide some script files that show how to run the
RTREE program and how to read the output resulting from the execution
of the program. The analysis presented in Chapter 2 results from these
files.
[5] E.I. Altman. Bankruptcy, Credit Risk, and High Yield Junk Bonds.
Blackwell Publishers, Malden, Massachusetts, 2002.
[8] S.M. Ansell, B.L. Rapoport, G. Falkson, J.I. Raats, and C.M.
Moeken. Survival determinants in patients with advanced ovarian
cancer. Gynecologic Oncology, 50:215–220, 1993.
[16] G.E. Bonney. Regression logistic models for familial disease and other
binary traits. Biometrics, 42:611–625, 1986.
[17] G.E. Bonney. Logistic regression for dependent binary observations.
Biometrics, 43:951–973, 1987.
[18] G.E. Box, G.M. Jenkins, and G.C. Reinsel. Time Series Analysis.
Wiley, New York, 3rd edition, 1994.
[19] M.B. Bracken. Perinatal Epidemiology. Oxford University Press,
New York, 1984.
[20] M.B. Bracken, K.G. Hellenbrand, T.R. Holford, and C. Bryce-
Buchanan. Low birth weight in pregnancies following induced abor-
tion: No evidence for an association. American Journal of Epidemi-
ology, 123:604–613, 1986.
[21] M.B. Bracken, K. Belanger, K.G. Hellenbrand, et al. Exposure to
electromagnetic fields during pregnancy with emphasis on electrically
heated beds: association with birthweight and intrauterine growth
retardation. Epidemiology, 6:263–270, 1995.
[24] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classifica-
tion and Regression Trees. Wadsworth, California, 1984.
[30] P. Buhlmann and B. Yu. Boosting with the L2 loss: Regression
and classification. Journal of the American Statistical Association,
98:324–339, 2003.
[33] D. Carmelli, H.P. Zhang, and G.E. Swan. Obesity and 33 years of
coronary heart disease and cancer mortality in the Western Collabo-
rative Group Study. Epidemiology, 8:378–383, 1997.
[37] L.-S. Chen and C.-T. Su. Using granular computing model to induce
scheduling knowledge in dynamic manufacturing environments. In-
ternational Journal of Computer Integrated Manufacturing, 21:569–
583, 2008.
[42] S.C. Choi, J.P. Muizelaar, T.Y. Barnes, et al. Prediction tree for
severely head-injured patients. Journal of Neurosurgery, 75:251–255,
1991.
[43] P.A. Chou, T. Lookabaugh, and R.M. Gray. Optimal pruning with
applications to tree-structured source coding and modeling. IEEE
Transactions on Information Theory, 35:299–315, 1989.
[44] A. Ciampi, A. Couturier, and S.L. Li. Prediction trees with soft nodes
for binary outcomes. Statistics in Medicine, 21:1145–1165, 2002.
[48] M.A. Connolly and K.Y. Liang. Conditional logistic regression mod-
els for correlated binary data. Biometrika, 75:501–506, 1988.
[63] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the Royal
Statistical Society-B, 39:1–22, 1977.
[64] D.G.T. Denison, B.K. Mallick, and A.F.M. Smith. A Bayesian CART
algorithm. Biometrika, pages 363–378, 1998.
[65] G.L. Desilva and J.J. Hull. Proper noun detection in document im-
ages. Pattern Recognition, 27:311–320, 1994.
[68] P.J. Diggle, K.Y. Liang, and S.L. Zeger. Analysis of Longitudinal
Data. Oxford Science Publications, Oxford, 1991.
[71] R.C. Elston and J. Stewart. A general model for the genetic analysis
of pedigree data. Human Heredity, 21:523–542, 1971.
[72] B.G. Ferris, F.E. Speizer, J.D. Spengler, D.W. Dockery, Y.M.M.
Bishop, M. Wolfson, and C. Humble. Effects of sulfur oxides and
respirable particles on human health. American Review of Respira-
tory Disease, 120:767–779, 1979.
[75] G.M. Fitzmaurice and N.M. Laird. Regression models for a bivari-
ate discrete and continuous outcome with clustering. Journal of the
American Statistical Association, 90:845–852, 1995.
[76] G.M. Fitzmaurice, N.M. Laird, and A.G. Rotnitzky. Regression mod-
els for discrete longitudinal responses. Statistical Science, 8:284–299,
1993.
[77] T.R. Fleming and D.P. Harrington. Counting Processes and Survival
Analysis. Wiley, New York, 1991.
[78] S.H. Fox, G.F. Whalen, M.M. Sanders, J.A. Burleson, K. Jennings,
S. Kurtzman, and D. Kreutzer. Angiogenesis in normal tissue adja-
cent to colon cancer. Journal of Surgical Oncology, 69:230–234, 1998.
[79] Y. Freund and R.E. Schapire. Game theory, on-line prediction and
boosting. In Proceedings of the Ninth Annual Conference on Com-
putational Learning Theory, pages 325–332. ACM Press, 1996.
[84] H. Frydman, E.I. Altman, and D.-I. Kao. Introducing recursive par-
titioning for financial classification: the case of financial distress. In
Bankruptcy, Credit Risk, and High Yield Junk Bonds, E.I. Altman
ed., pages 37–59, 2002.
[85] A.M. Garber, R.A. Olshen, H.P. Zhang, and E.S. Venkatraman. Pre-
dicting high-risk cholesterol levels. International Statistical Review,
62:203–228, 1994.
[88] A. Gersho and R.M. Gray. Vector Quantization and Signal Compres-
sion. Kluwer, Norwell, Massachusetts, 1992.
[103] J.D. Hart and T.E. Wehrly. Kernel regression estimation using re-
peated measurements data. Journal of the American Statistical As-
sociation, 81:1080–1088, 1986.
[105] T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Chap-
man and Hall, London, 1990.
[106] E.G. Hebertson and M.J. Jenkins. Factors associated with historic
spruce beetle (Coleoptera: Curculionidae) outbreaks in Utah and Col-
orado. Environmental Entomology, 37:281–292, 2008.
[111] X. Huang, S.D. Chen, and S.J. Soong. Piecewise exponential sur-
vival trees with time-dependent covariates. Biometrics, 54:1420–
1433, 1998.
[115] H. Ishwaran, U.B. Kogalur, E.H. Blackstone, and M.S. Lauer. Ran-
dom survival forests. The Annals of Applied Statistics, 2:841–860,
2008.
[116] A. Jamain and D. Hand. Mining supervised classification perfor-
mance studies: A meta-analytic investigation. Journal of Classifica-
tion, 25:87–112, 2008.
[121] H. Kim and W.-Y. Loh. Classification trees with unbiased multiway
splits. Journal of the American Statistical Association, 96:598–604,
2001.
[122] R.J. Klein, C. Zeiss, E.Y. Chew, J.Y. Tsai, R.S. Sackler, C. Haynes,
A.K. Henning, J.P. SanGiovanni, S.M. Mane, S.T. Mayne, M.B.
Bracken, F.L. Ferris, J. Ott, C. Barnstable, and C. Hoh. Comple-
ment factor H polymorphism in age-related macular degeneration.
Science, 308:385–389, 2005.
[123] D.G. Kleinbaum, L.L. Kupper, and K.E. Muller. Applied Regression
Analysis and Other Multivariable Methods. Duxbury Press, Belmont,
California, 1988.
[124] M.R. Kosorok and S. Ma. Marginal asymptotics for the “large p,
small n” paradigm: With applications to microarray data. Annals of
Statistics, 35:1456–1486, 2007.
[125] S. Kullback and R.A. Leibler. On information and sufficiency. The
Annals of Mathematical Statistics, 22:79–86, 1951.
[126] D.A. Kumar and V. Ravi. Predicting credit card customer churn
in banks using data mining. International Journal of Data Analysis
Techniques and Strategies, 1:4–28, 2008.
[127] C.-S. Kuo, T.P. Hong, and C.-L. Chen. Applying genetic program-
ming technique in classification trees. Soft Computing, 11:1165–1172,
2007.
[128] L.W. Kwak, J. Halpern, R.A. Olshen, and S.J. Horning. Prognostic
significance of actual dose intensity in diffuse large-cell lymphoma:
results of a tree-structured survival analysis. Journal of Clinical On-
cology, 8:963–977, 1990.
[129] N.M. Laird and J.H. Ware. Random-effects models for longitudinal
data. Biometrics, 38:963–974, 1982.
[130] M. LeBlanc and J. Crowley. Relative risk trees for censored survival
data. Biometrics, 48:411–425, 1992.
[133] E.T. Lee. Statistical Methods for Survival Data Analysis. Wiley, New
York, 1992.
[135] D.E. Levy, J.J. Caronna, B.H. Singer, et al. Predicting outcome
from hypoxic-ischemic coma. Journal of the American Medical As-
sociation, 253:1420–1426, 1985.
[136] K.Y. Liang and S.L. Zeger. Longitudinal data analysis using gener-
alized linear models. Biometrika, 73:13–22, 1986.
[137] K.Y. Liang, S.L. Zeger, and B. Qaqish. Multivariate regression anal-
yses for categorical data. Journal of the Royal Statistical Society-B,
54:3–24, 1992.
[139] R.J.A. Little and D.B. Rubin. Statistical Analysis with Missing Data.
Wiley, New York, 1987.
[141] W.L. Long, J.L. Griffith, H.P. Selker, and R.B. D’Agostino. A com-
parison of logistic regression to decision tree induction in a medical
domain. Computers and Biomedical Research, 26:74–97, 1993.
[149] C.A. McGilchrist and B.R. Cullis. REML estimation for repeated
measures analysis. Journal of Statistical Computation and Simula-
tion, 38:151–163, 1991.
[156] P.K. Mills, W.L. Beeson, R.L. Phillips, and G.E. Fraser. Bladder
cancer in a low risk population: results from the Adventist Health
Study. American Journal of Epidemiology, 133:230–239, 1991.
[159] J.N. Morgan and R.C. Messenger. THAID: a sequential search pro-
gram for the analysis of nominal scale dependent variables. Institute
for Social Research, University of Michigan, Ann Arbor, 1973.
[160] J.N. Morgan and J.A. Sonquist. Problems in the analysis of survey
data, and a proposal. Journal of the American Statistical Associa-
tion, 58:415–434, 1963.
[167] A.B. Nobel and R.A. Olshen. Termination and continuity of greedy
growing for tree structured vector quantizers. IEEE Transactions on
Information Theory, 42:191–206, 1996.
[168] E.A. Owens, R.E. Griffiths, and K.U. Ratnatunga. Using oblique
decision trees for the morphological classification of galaxies. Monthly
Notices of the Royal Astronomical Society, 281:153–157, 1996.
[171] C. M. Perou, T. Sorlie, M.B. Eisen, M. van de Rijn, S.S. Jeffrey, et al.
Molecular portraits of human breast tumours. Nature, 406:747–752,
2000.
[176] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kauf-
mann, San Mateo, California, 1993.
[177] D.R. Ragland, R.J. Brand, et al. Coronary heart disease mortality
in the Western Collaborative Group Study: Follow-up experience of
22 years. American Journal of Epidemiology, 127:462–475, 1988.
[179] E.G. Raymond, N. Tafari, J.F. Troendle, and J.D. Clemens. Develop-
ment of a practical screening tool to identify preterm, low-birthweight
neonates in Ethiopia. Lancet, 344:520–523, 1994.
[180] J.A. Rice and B.W. Silverman. Estimating the mean and covariance
structure nonparametrically when the data are curves. Journal of the
Royal Statistical Society-B, 53:233–243, 1991.
[181] A.M. Richardson and A.H. Welsh. Asymptotic properties of re-
stricted maximum likelihood (REML) estimates for hierarchical
mixed linear models. The Australian Journal of Statistics, 36:31–43,
1994.
[182] R.M. Royall. Model robust inference using maximum likelihood esti-
mators. International Statistical Review, 54:221–226, 1986.
[183] M. Sandri and P. Zuccolotto. A bias correction algorithm for the
Gini variable importance measure in classification trees. Journal of
Computational and Graphical Statistics, 17:611–628, 2008.
[184] I.R. Savage. Contributions to the theory of rank order statistics—
the two sample case. Annals of Mathematical Statistics, 27:590–615,
1956.
[185] J. Scourfield, D.E. Stevens, and K.R. Merikangas. Substance abuse,
comorbidity, and sensation seeking: gender difference. Comprehensive
Psychiatry, 37:384–392, 1996.
[186] M.R. Segal. Regression trees for censored data. Biometrics, 44:35–
48, 1988.
[187] M.R. Segal. Tree-structured methods for longitudinal data. Journal
of the American Statistical Association, 87:407–418, 1992.
[188] M.R. Segal. Extending the elements of tree-structured regression.
Statistical Methods in Medical Research, 4:219–236, 1995.
[189] M.R. Segal and D.A. Bloch. A comparison of estimated proportional
hazards models and regression trees. Statistics in Medicine, 8:539–
550, 1989.
[190] H.P. Selker, J.L. Griffith, S. Patil, W.L. Long, and R.B. D’Agostino.
A comparison of performance of mathematical predictive methods
for medical diagnosis: Identifying acute cardiac ischemia among
emergency department patients. Journal of Investigative Medicine,
43:468–476, 1995.
[191] I. Shmulevich, O. Yli-Harja, E. Coyle, D.-J. Povel, and K. Lemström.
Perceptual issues in music pattern recognition: complexity of rhythm
and key finding. Computers and the Humanities, pages 23–35, 2001.
[192] B.W. Silverman. Some aspects of the spline smoothing approach to
non-parametric regression curve fitting. Journal of the Royal Statis-
tical Society-B, 47:1–21, 1985.
[193] V.S. Sitaram, C.M. Huang, and P.D. Israelsen. Efficient codebooks
for vector quantization image compression with an adaptive tree-
search algorithm. IEEE Transactions on Communications, 42:3027–
3033, 1994.
[194] P.L. Smith. Curve fitting and modeling with splines using statisti-
cal variable selection techniques. NASA 166034, Langley Research
Center, Hampton, VA, 1982.
[195] A. Sommer, J. Katz, and I. Tarwotjo. Increased risk of respiratory
disease and diarrhea in children with preexisting mild vitamin A defi-
ciency. American Journal of Clinical Nutrition, 40:1090–1095, 1984.
[196] A. Sommer, I. Tarwotjo, G. Hussaini, and D. Susanto. Increased
mortality in children with mild vitamin A deficiency. Lancet, 2:585–
588, 1983.
[197] A. Sommer, J.M. Tielsch, J. Katz, H.A. Quigley, J.D. Gottsch, J.C.
Javitt, J.F. Martone, R.M. Royall, K.A. Witt, and S. Ezrine. Racial
differences in the cause-specific prevalence of blindness in east Balti-
more. New England Journal of Medicine, 325:1412–1417, 1991.
[198] StatSci. S-PLUS: Guide to Statistical and Mathematical Analysis.
MathSoft, Inc., Seattle, 1993.
[199] StatSci. S-PLUS: Guide to Statistical and Mathematical Analysis.
MathSoft, Inc., Seattle, 1995.
[200] D.M. Stier, J.M. Leventhal, A.T. Berg, L. Johnson, and J. Mezger.
Are children born to young mothers at increased risk of maltreat-
ment? Pediatrics, 91:642–648, 1993.
[201] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis.
Conditional variable importance for random forests. BMC Bioinfor-
matics, 9:307, 2008.
[202] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in ran-
dom forest variable importance measures: illustrations, sources and
a solution. BMC Bioinformatics, 8:25, 2007.
[203] N.R. Temkin, R. Holubkov, J.E. Machamer, H.R. Winn, and S.S.
Dikmen. Classification and regression trees (CART) for prediction of
function at 1 year following head trauma. Journal of Neurosurgery,
82:764–771, 1995.
[204] J. Terhune, D. Quin, A. Dell'Apa, M. Mirhaj, J. Plötz, L. Kinder-
mann, and H. Bornemann. Geographic variations in underwater
male Weddell seal trills suggest breeding area fidelity. Polar Biol-
ogy, 31:671–680, 2008.
[207] Y.K. Truong. Nonparametric curve estimation with time series er-
rors. Journal of Statistical Planning and Inference, 28:167–183, 1991.
[211] J.H. Ware, D.W. Dockery, A. Spiro, F.E. Speizer, and B.G. Ferris.
Passive smoking, gas cooking, and respiratory health of children living
in six cities. American Review of Respiratory Disease, 129:366–374,
1984.
[213] J.H. Wasson, H.C. Sox, R.K. Neff, and L. Goldman. Clinical pre-
diction rules: Applications and methodologic standards. The New
England Journal of Medicine, 313:793–799, 1985.
[216] L.C. Yeates and G. Powis. The expression of the molecular chaperone
calnexin is decreased in cancer cells grown as colonies compared to
monolayer. Biochemical and Biophysical Research Communications,
238:66–70, 1997.
[218] S.L. Zeger and P.J. Diggle. Semiparametric models for longitudinal
data with application to CD4 cell numbers in HIV seroconverters.
Biometrics, 50:689–699, 1994.
[219] S.L. Zeger, K.Y. Liang, and P.S. Albert. Models for longitudinal data:
A generalized estimating equation approach. Biometrics, 44:1049–
1060, 1988.
[224] H.P. Zhang. Classification trees for multiple binary responses. Jour-
nal of the American Statistical Association, 93:180–193, 1998a.
[226] H.P. Zhang. Analysis of infant growth curves using MASAL. Bio-
metrics, 55:452–459, 1999.
[228] H.P. Zhang and M.B. Bracken. Tree-based risk factor analysis of
preterm delivery and small-for-gestational-age birth. American Jour-
nal of Epidemiology, 141:70–78, 1995.
[229] H.P. Zhang and M.B. Bracken. Tree-based, two-stage risk factor
analysis for spontaneous abortion. American Journal of Epidemiol-
ogy, 144:989–996, 1996.
[230] H.P. Zhang, J. Crowley, H.C. Sox, and R.A. Olshen. Tree-structured
statistical methods. Encyclopedia of Biostatistics, 6:4561–4573, 1998.
[231] H.P. Zhang, T. Holford, and M.B. Bracken. A tree-based method
of analysis for prospective studies. Statistics in Medicine, 15:37–49,
1996.
[232] H.P. Zhang and K.R. Merikangas. A frailty model of segregation
analysis: understanding the familial transmission of alcoholism. Bio-
metrics, 56:815–823, 2000.
[233] H.P. Zhang and M.H. Wang. Searching for the smallest random for-
est. Statistics and Its Interface, 2, 2009.
[234] H.P. Zhang and Y. Ye. A tree-based method for modeling a mul-
tivariate ordinal response. Statistics and Its Interface, 1:169–178,
2008.
[235] H.P. Zhang and C. Yu. Tree-based analysis of microarray data for
classifying breast cancer. Frontiers in Bioscience, 7:c63–67, 2002.
[236] H.P. Zhang, C.Y. Yu, and B. Singer. Cell and tumor classification
using gene expression data: Construction of forests. Proc. Natl. Acad.
Sci. USA, 100:4168–4172, 2003.
[237] H.P. Zhang, C.Y. Yu, B. Singer, and M.M. Xiong. Recursive parti-
tioning for tumor classification with gene expression microarray data.
Proc. Natl. Acad. Sci. USA, 98:6730–6735, 2001.
[238] H.P. Zhang, C.Y. Yu, H.T. Zhu, and J. Shi. Identification of lin-
ear directions in multivariate adaptive spline models. Journal of the
American Statistical Association, 98:369–376, 2003.
[239] M. Zhang, D. Zhang, and M. Wells. Variable selection for large p
small n regression models with incomplete data: Mapping qtl with
epistases. BMC Bioinformatics, 9:251, 2008.
[240] L.P. Zhao and R.L. Prentice. Correlated binary regression using a
quadratic exponential model. Biometrika, 77:642–648, 1990.
Index