An Introduction To Classification and Regression Tree (CART) Analysis
Roger J. Lewis, M.D., Ph.D.
Harbor-UCLA Medical Center
Presented at the 2000 Annual Meeting of the Society for Academic Emergency Medicine in San
Francisco, California.
Introduction
A common goal of many clinical research studies is the development of a reliable clinical
decision rule, which can be used to classify new patients into clinically-important categories. Examples
of such clinical decision rules include triage rules, whether used in the out-of-hospital setting or in the
emergency department, and rules used to classify patients into various risk categories so that appropriate
decisions can be made regarding treatment or hospitalization.
Traditional statistical methods are cumbersome to use, or of limited utility, in addressing these
types of classification problems. There are a number of reasons for these difficulties. First, there are
generally many possible “predictor” variables, which makes the task of variable selection difficult.
Traditional statistical methods are poorly suited for this sort of multiple comparison. Second, the
predictor variables are rarely nicely distributed. Many clinical variables are not normally distributed and
different groups of patients may have markedly different degrees of variation or variance. Third, complex
interactions or patterns may exist in the data. For example, the value of one variable (e.g., age) may
substantially affect the importance of another variable (e.g., weight). These types of interactions are
generally difficult to model, and virtually impossible to model when the number of interactions and
variables becomes substantial. Fourth, the results of traditional methods may be difficult to use. For
example, a multivariate logistic regression model yields a probability of disease, which can be calculated
using the regression coefficients and the characteristics of the patient, yet such models are rarely utilized
in clinical practice. Clinicians generally do not think in terms of probability but, rather, in terms of
categories, such as “low risk” versus “high risk.”
Regardless of the statistical methodology being used, the creation of a clinical decision rule
requires a relatively large dataset. For each patient in the dataset, one variable (the dependent variable)
records whether or not that patient had the condition which we hope to predict accurately in future
patients. Examples might include significant injury after trauma, myocardial infarction, or subarachnoid
hemorrhage in the setting of headache. In addition, other variables record the values of patient
characteristics which we believe might help us to predict the value of the dependent variable. For
example, if one hopes to predict the presence of subarachnoid hemorrhage, a possible predictor variable
might be whether or not the patient's headache was sudden in onset; another possible predictor would be
whether or not the patient has a history of similar headaches in the past. In many clinically-important
settings, the number of possible predictor variables is quite large.
Within the last 10 years, there has been increasing interest in the use of classification and
regression tree (CART) analysis. CART analysis is a tree-building technique which is unlike traditional
data analysis methods. It is ideally suited to the generation of clinical decision rules. Because CART
analysis is unlike other analysis methods, it has been accepted relatively slowly. Furthermore, the vast
majority of statisticians have little or no experience with the technique. Other factors which have limited
CART's general acceptance are the complexity of the analysis and, until recently, the difficulty of the
software required to perform it. Luckily, it is now possible to perform a CART analysis
without a deep understanding of each of the multiple steps being completed by the software. In a number
of studies, I have found CART to be quite effective for creating clinical decision rules which perform as
well or better than rules developed using more traditional methods. In addition, CART is often able to
uncover complex interactions between predictors which may be difficult or impossible to uncover using
traditional multivariate techniques.
The purpose of this lecture is to provide an overview of CART methodology, emphasizing
practical use rather than the underlying statistical theory.
A classification problem consists of four components. The first component is an outcome or
“dependent” variable, whose value we wish to predict accurately for future patients. The second
component is a set of “predictor” or “independent” variables. These are the characteristics which are
potentially related to the outcome
variable of interest. In general, there are many possible predictor variables. The third component of the
classification problem is the learning dataset. This is a dataset which includes values for both the
outcome and predictor variables, from a group of patients similar to those for whom we would like to be
able to predict outcomes in the future. The fourth component of the classification problem is the test or
future dataset, which consists of patients for whom we would like to be able to make accurate predictions.
This test dataset may or may not exist in practice. While it is commonly believed that a test or validation
dataset is required to validate a classification or decision rule, a separate test dataset is not always
required to determine the performance of a decision rule.
A decision problem includes two components in addition to those found in a classification
problem. These components are a “prior” probability for each outcome, which represents the probability
that a randomly-selected future patient will have a particular outcome, and a decision loss or cost matrix.
The decision cost matrix represents the inherent cost associated with misclassifying a future patient. For
example, it is a much more serious error to classify a patient with an emergent medical condition as non-
urgent, than to misclassify a patient with a non-urgent medical condition as urgent. A sample cost matrix
is shown below, for a triage problem in which patients are classified as emergent, urgent, and non-urgent.
The worst possible error, consisting of classifying a truly emergent patient as non-urgent (undertriage), is
fifteen times as serious as misclassifying an urgent patient as emergent (overtriage).
Example Decision Cost Matrix:

                                       Classified by Tree as
                                 Emergent     Urgent     Non-Urgent
  True Value of    Emergent          0            5           15
  Outcome          Urgent            1            0            5
  Variable         Non-Urgent        3            2            0

As the first example, consider the problem of selecting the best size and type of laryngoscope blade for
pediatric patients undergoing intubation. The outcome variable, the best blade for each patient (as
determined by a consulting pediatric airway specialist), has three possible values: Miller 0, Wis-Hipple
1.5, and Mac 2. The two predictor variables are measurements of neck length and oropharyngeal height.
The learning dataset is shown in the figure below. As can be seen from the figure, the smallest patients
are best intubated with the Miller 0, medium-sized patients with the Wis-Hipple 1.5, and the largest
patients with the Mac 2.

[Figure: “Laryngoscope Blade versus Neck and OP Measurements”: a scatterplot of the learning dataset,
neck length (0 to 8) versus OP height (0 to 3), showing the regions of patients best intubated with the
Miller 0, Wis-Hipple 1.5, and Mac 2 blades.]

One possible approach to analyzing these data would be to use multivariate logistic regression, using
neck length and oropharyngeal height as the two independent variables.
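To make the role of the decision cost matrix concrete, the following is a minimal sketch in Python. The
scoring scheme is generic rather than specific to the CART software, and the patient lists are made up
for illustration.

    # Decision cost matrix from the triage example: COST[true][predicted].
    COST = {
        "emergent":   {"emergent": 0, "urgent": 5, "non-urgent": 15},
        "urgent":     {"emergent": 1, "urgent": 0, "non-urgent": 5},
        "non-urgent": {"emergent": 3, "urgent": 2, "non-urgent": 0},
    }

    def total_cost(true_classes, predicted_classes):
        """Sum the misclassification cost over all patients."""
        return sum(COST[t][p] for t, p in zip(true_classes, predicted_classes))

    # One undertriaged emergent patient (cost 15) outweighs several
    # overtriaged urgent patients (cost 1 each):
    print(total_cost(["emergent", "urgent", "urgent", "urgent"],
                     ["non-urgent", "emergent", "emergent", "emergent"]))  # 18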
In the two figures below, a visual illustration of the CART approach is given. The root node,
which contains all patients, is split in two, analogous to a horizontal line being drawn at neck length =
2.45. All patients below the line, which are those found in the first terminal node, are assigned a
predicted class of Miller 0. The group of patients above the original line are then split by a second line
drawn at oropharyngeal height = 1.75. Those to the left of this line are assigned a class of Wis-Hipple
1.5, while those to the right of the line are assigned the class Mac 2. It is important to note that this
second line applies only to one of the regions, which corresponds to the second parent node (Node 2).
This process of partitioning is easy to visualize in two dimensions (i.e., when there are only two possible
predictor variables) but is difficult or impossible to picture when there are five, ten, or dozens of possible
predictors.
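Written out as code, the resulting tree is simply a set of nested comparisons, which is why tree-based
rules translate so directly into clinical practice. A minimal sketch in Python, using the cutpoints from
the example above:

    def recommend_blade(neck_length, op_height):
        """The two-split laryngoscope tree as an explicit decision rule."""
        if neck_length <= 2.45:      # first split, at the root node
            return "Miller 0"
        if op_height <= 1.75:        # second split, applied only to Node 2
            return "Wis-Hipple 1.5"
        return "Mac 2"

    print(recommend_blade(2.0, 1.5))  # -> Miller 0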
[Figure: two panels titled “Laryngoscope Blade versus Neck and OP Measurements,” plotting neck
length (0 to 8) against OP height (0 to 3). The left panel shows the first split, the horizontal line at
neck length = 2.45 separating the Miller 0 region; the right panel adds the second split, the line at
OP height = 1.75 separating the Wis-Hipple 1.5 and Mac 2 regions.]
The text box below shows the set of commands, contained within a “command” file, which
were used to produce the analysis represented by the CART tree. While there are 16 lines of commands,
half are generic formatting commands or the definition of the misclassification costs. This command file
is included in the CD which has been distributed, as is the dataset “airway.sys.”

    USE 'c:\cart\data\airway.sys'
    LOPTIONS MEANS = NO, PREDICTION = NO,
    PLOTS = YES, TIMING = YES, PRINT
    FORMAT = 3
    MODEL best_bla
    KEEP neck_len op_heigh
    CATEGORY best_bla = 3 [min = 1]
    PRIORS EQUAL
    Misclass UNIT
    Misclasify Cost = 1 Classify 1 as 2
    Misclasify Cost = 1 Classify 1 as 3
    Misclasify Cost = 1 Classify 2 as 3
    Misclasify Cost = 1 Classify 3 as 1
    Misclasify Cost = 1 Classify 3 as 2
    Misclasify Cost = 1 Classify 2 as 1
    LIMIT DEPTH=16
    ERROR CROSS = 20

Advantages and Disadvantages of CART

CART analysis has a number of advantages over other classification methods, including
multivariate logistic regression. First, it is inherently non-parametric. In other words, no assumptions are
made regarding the underlying distribution of values of the predictor variables. Thus, CART can handle
numerical data that are highly skewed or multi-modal, as well as categorical predictors with either
ordinal or non-ordinal structure. This is an important feature, as it eliminates analyst time which would
otherwise be spent determining whether variables are normally distributed, and making transformations
if they are not.
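For comparison, the following is a rough scikit-learn analogue of the command file above. This is a
sketch only: the CART package and scikit-learn differ in their options and defaults (for example,
scikit-learn's trees do not accept a full misclassification cost matrix), and the file “airway.csv” is a
hypothetical export of the distributed dataset.

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # USE / MODEL / KEEP: select the dataset, the outcome, and the predictors.
    df = pd.read_csv("airway.csv")      # hypothetical export of airway.sys
    X, y = df[["neck_len", "op_heigh"]], df["best_bla"]

    # LIMIT DEPTH=16 maps to max_depth; PRIORS EQUAL is approximated by
    # class_weight="balanced" (equal priors regardless of class frequencies).
    tree = DecisionTreeClassifier(max_depth=16, class_weight="balanced")

    # ERROR CROSS = 20 maps to 20-fold cross validation.
    print(cross_val_score(tree, X, y, cv=20).mean())
    tree.fit(X, y)                      # grow the tree on the full dataset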
As discussed below, CART identifies “splitting” variables based on an exhaustive search of all
possibilities. Since efficient algorithms are used, CART is able to search all possible variables as
splitters, even in problems with many hundreds of possible predictors. [While some listeners may
shudder at possible problems with overfitting and data dredging, these issues are dealt with in depth later].
CART also has sophisticated methods for dealing with missing variables. Thus, useful CART
trees can be generated even when important predictor variables are not known for all patients. Patients
with missing predictor variables are not dropped from the analysis but, instead, “surrogate” variables
containing information similar to that contained in the primary splitter are used. When predictions are
made using a CART tree, predictions for patients with missing predictor variables are based on the values
of surrogate variables as well.
Another advantage of CART analysis is that it is a relatively automatic “machine learning”
method. In other words, compared to the complexity of the analysis, relatively little input is required
from the analyst. This is in marked contrast to other multivariate modeling methods, in which extensive
input from the analyst, analysis of interim results, and subsequent modification of the method are
required.
Finally, CART trees are relatively simple for nonstatisticians to interpret. As mentioned above,
clinical decision rules based on trees are more likely to be feasible and practical, since the structure of the
rule and its inherent logic are apparent to the clinician.
Despite its many advantages, there are a number of disadvantages of CART which should be kept
in mind. First, CART analysis is relatively new and somewhat unknown. Thus, there may be some
resistance to accepting CART analysis among traditional statisticians (some of whom consult for prestigious
medical journals). In addition, there is some well-founded skepticism regarding tree methodologies in
general, based on unrealistic claims and poor performance of earlier techniques. Thus, some statisticians
have a generalized distrust of this approach. Because of its relative novelty, it is difficult to find
statisticians with significant expertise in CART. Thus, it may be difficult to find someone to help you use
CART analysis at your own institution. Because CART is not a standard analysis technique, it is not
included in many major statistical software packages (e.g., SAS).
Steps in CART
CART analysis consists of four basic steps. The first step consists of tree building, during which
a tree is built using recursive splitting of nodes. Each resulting node is assigned a predicted class, based
on the distribution of classes in the learning dataset which would occur in that node and the decision cost
matrix. The assignment of a predicted class to each node occurs whether or not that node is subsequently
split into child nodes. The second step consists of stopping the tree building process. At this point a
“maximal” tree has been produced, which probably greatly overfits the information contained within the
learning dataset. The third step consists of tree “pruning,” which results in the creation of a sequence of
simpler and simpler trees, through the cutting off of increasingly important nodes. The fourth step
consists of optimal tree selection, during which the tree which fits the information in the learning dataset,
but does not overfit the information, is selected from among the sequence of pruned trees. Each of these
steps will be discussed in more detail below.
Tree Building
Tree building begins at the root node, which includes all patients in the learning dataset.
Beginning with this node, the CART software finds the best possible variable to split the node into two
child nodes. In order to find the best variable, the software checks all possible splitting variables (called
splitters), as well as all possible values of the variable to be used to split the node. A number of clever
programming tricks are used to reduce the time required to search through all possible splits. In the case
of a categorical variable, the number of possible splits increases quickly with the number of levels of the
categorical variable (a categorical variable with L levels admits 2^(L-1) - 1 distinct binary splits). Thus,
it is useful to tell the software the maximum number of levels for each categorical variable.
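To make the exhaustive search concrete, the following is a minimal sketch in Python of finding the best
threshold for a single numerical splitter. It is not the CART package's actual algorithm and the data are
made up; trying each midpoint between adjacent observed values is one common convention, and each
candidate split is scored here by the weighted Gini impurity of the two child nodes.

    from collections import Counter

    def gini(labels):
        """Gini impurity: 1 minus the sum of squared class proportions."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    def best_split(values, labels):
        """Try every midpoint between adjacent distinct values; return the
        threshold minimizing the weighted impurity of the child nodes."""
        n = len(values)
        best_threshold, best_score = None, float("inf")
        uniq = sorted(set(values))
        for lo, hi in zip(uniq, uniq[1:]):
            threshold = (lo + hi) / 2.0
            left = [lab for v, lab in zip(values, labels) if v <= threshold]
            right = [lab for v, lab in zip(values, labels) if v > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_threshold, best_score = threshold, score
        return best_threshold, best_score

    # Made-up neck lengths and best blades for six hypothetical patients:
    neck_len = [1.8, 2.1, 2.4, 2.5, 3.0, 3.6]
    blade = ["Miller 0", "Miller 0", "Miller 0",
             "Wis-Hipple 1.5", "Wis-Hipple 1.5", "Mac 2"]
    print(best_split(neck_len, blade))  # (2.45, 0.22...): cf. the 2.45 split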
In choosing the best splitter, the program seeks to maximize the average “purity” of the two child
nodes. A number of different measures of purity can be selected, loosely called “splitting criteria” or
“splitting functions.” The most common splitting function is the “Gini”, followed by “Twoing.”
Although the CART software manual recommends experimenting with different splitting criteria, these
two methods will give identical results if the outcome variable is a binary categorical variable.
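For reference, following Breiman et al. (reference 1): the Gini impurity of a node t with class
proportions p(j|t) is i(t) = 1 - Σ_j p(j|t)², and a split s, sending fractions p_L and p_R of the node's
observations to child nodes t_L and t_R, is chosen to maximize the decrease in impurity
Δi(s,t) = i(t) - p_L·i(t_L) - p_R·i(t_R). The twoing criterion instead maximizes
(p_L·p_R/4)·[Σ_j |p(j|t_L) - p(j|t_R)|]², which rewards splits that send the classes to different sides;
with a two-level outcome, both criteria select the same splits.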
As discussed above, each node (even the root node) is assigned a predicted outcome class.
process of node splitting, followed by the assignment of a predicted class to each node, is repeated for
each child node and continued recursively until it is impossible to continue.
Missing Variables
For each node, the “primary splitter” is the variable that best splits the node, maximizing the
purity of the resulting child nodes. When the primary splitting variable is missing for an individual
observation, that observation is not discarded but, instead, a surrogate splitting variable is sought. A
surrogate splitter is a variable whose pattern within the dataset, relative to the outcome variable, is similar
to the primary splitter. Thus, the program uses the best available information in the face of missing
values. In datasets of reasonable quality this allows all observations to be used. This is a significant
advantage of this methodology over more traditional multivariate regression modeling, in which
observations which are missing any of the predictor variables are often discarded.
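As a rough illustration (a sketch only; the CART package's actual surrogate-selection details differ), a
candidate surrogate can be rated by how often its own split sends each observation to the same child
node as the primary split:

    def agreement(primary_goes_left, surrogate_goes_left):
        """Fraction of observations sent to the same child by both splits."""
        matches = sum(p == s for p, s in zip(primary_goes_left, surrogate_goes_left))
        return matches / len(primary_goes_left)

    # Made-up example: the surrogate mimics the primary split for 4 of 5
    # patients, so it can stand in when the primary splitter is missing.
    print(agreement([True, True, False, False, True],
                    [True, True, False, True, True]))  # 0.8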
Tree Pruning
In order to generate a sequence of simpler and simpler trees, each of which is a candidate for the
appropriately-fit final tree, the method of “cost-complexity” pruning is used. This method relies on a
complexity parameter, denoted α, which is gradually increased during the pruning process. Beginning at
the last level (i.e., the terminal nodes) the child nodes are pruned away if the resulting change in the
predicted misclassification cost is less than α times the change in tree complexity. Thus, α is a measure
of how much additional accuracy a split must add to the entire tree to warrant the additional complexity.
As α is increased, more and more nodes (of increasing importance) are pruned away, resulting in simpler
and simpler trees.
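Formally, for a given complexity parameter α the pruned tree minimizes the cost-complexity measure
R_α(T) = R(T) + α·|T|, where R(T) is the tree's misclassification cost and |T| is its number of terminal
nodes (Breiman et al., reference 1). The sketch below shows the same idea using scikit-learn's
implementation of cost-complexity pruning, with made-up data standing in for a learning dataset:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))       # made-up predictor values
    y = rng.integers(0, 2, size=200)    # made-up binary outcomes

    # ccp_alphas is the increasing sequence of alpha values at which pruning
    # removes more nodes; each value yields one tree in the nested sequence.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
             for a in path.ccp_alphas]
    print([t.get_n_leaves() for t in trees])  # simpler and simpler trees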
The figure below shows the relationship between tree complexity, reflected by the number of
terminal nodes, and the decision cost for an independent test dataset and the original learning dataset.
As the number of nodes increases, the decision cost decreases monotonically for the learning data. This
corresponds to the fact that the maximal tree will always give the best fit to the learning dataset. In
contrast, the expected cost for an independent dataset reaches a minimum, and then increases as the
complexity increases. This reflects the fact that an overfitted and overly complex tree will not perform
well on a new set of data.

[Figure: decision cost versus tree complexity (number of terminal nodes, from large α to small α). The
cost for the learning data decreases monotonically, while the cost for the test data passes through a
minimum between the “underfit” and “overfit” regions.]
Cross Validation
Cross validation is a computationally-intensive method for validating a procedure for model
building, which avoids the requirement for a new or independent validation dataset. In cross validation,
the learning dataset is randomly split into N sections, stratified by the outcome variable of interest. This
assures that a similar distribution of outcomes is present in each of the N subsets of data. One of these
subsets of data is reserved for use as an independent test dataset, while the other N-1 subsets are
combined for use as the learning dataset in the model-building procedure. The entire model-building
procedure is repeated N times, with a different subset of the data
reserved for use as the test dataset each time. Thus, N different models are produced, each one of which
can be tested against an independent subset of the data. The amazing fact on which cross validation is
based is that the average performance of these N models is an excellent estimate of the performance of
the original model (produced using the entire learning dataset) on a future independent set of patients.
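A minimal scikit-learn sketch of the procedure described above, with made-up data standing in for the
learning dataset:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))       # made-up predictor values
    y = rng.integers(0, 2, size=100)    # made-up binary outcomes

    N = 10
    scores = []
    # Stratification keeps a similar distribution of outcomes in each section.
    for train_idx, test_idx in StratifiedKFold(n_splits=N, shuffle=True,
                                               random_state=0).split(X, y):
        # N-1 sections build the tree; the held-out section tests it.
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))

    # The average over the N held-out sections estimates the performance of
    # the tree grown on the full learning dataset on future patients.
    print(np.mean(scores))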
Example: HIV Triage

As our next example, consider a dataset involving the triage of self-identified HIV-infected
patients who present to the emergency department (ED) for care. The outcome variable is the “urgency”
of the visit, which has three levels: emergent, urgent, and non-urgent. These urgency levels are based on
a retrospective evaluation of the final diagnosis and clinical course. The predictor variables include the
patient's historical and presenting features.

[Figure: cross-validated (CV) misclassification cost versus tree size for the HIV triage learning data,
used to identify the optimal tree.]
Future Datasets

The purpose of a decision tree is usually to allow the accurate prediction of outcome for future
patients, based on the values of their predictor variables. Similarly, the best way to test a tree using an
independent dataset is to “drop” cases from a new dataset through the tree in order to determine the
observed misclassification rates and costs. The CART software provides a command (the “tree”
command) which allows the decision tree to be saved, so that it can be used with a new set of data in the
future to predict outcome. This allows the testing of the tree on a new independent dataset.
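A scikit-learn analogue of this save-and-reuse workflow (a sketch only; “fitted_tree” is assumed to be a
previously fitted DecisionTreeClassifier and X_new the predictor values of the new patients):

    import joblib

    joblib.dump(fitted_tree, "airway_tree.joblib")   # save the tree to disk

    # ... later, possibly in a different program:
    tree = joblib.load("airway_tree.joblib")
    predictions = tree.predict(X_new)   # drop the new cases through the tree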
Additional Topics

There are a number of additional topics that, although of practical importance to those using
CART analysis, are beyond the scope of this lecture. These include the choice and use of different
splitting rules and purity measures, the choice of alternative prior probability distributions, the
interpretation of information on surrogate and competitive variables, methods used to rate the
importance of different variables, and details of CART's handling of missing variables. Insights into
these topics can best be obtained by actual use of the software, reference to the manual and online
documentation, and comparison of results while varying user options. The dataset and command files
included in the distributed CD should allow the listener to begin using the CART software and gain
experience with its use.
[Figure: the final classification tree for the HIV triage example. The root node (Node 1) splits on
PULSE <= 104.5; deeper splits use CD4 <= 42.0, TEMP <= 100.45, and PULSE <= 99.5 on one branch,
and TEMP <= 101.65 and PULSE <= 137.5 on the other, with each terminal node assigned a predicted
class (0, 1, or 2).]
Conclusions
Classification and Regression Tree (CART) analysis is a powerful technique with significant potential
and clinical utility. A substantial investment in time and effort is required to use the software, select the
correct options, and interpret the results. Nonetheless, the use of CART has been increasing and is likely
to increase in the future, largely because of the substantial number of important problems for which it is
the best available solution.
References
Primary Reference
1. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Chapman &
Hall (Wadsworth, Inc.): New York, 1984.
Examples
2. Steadman HJ, Silver E, Monahan J, Appelbaum PS, Robbins PC, Mulvey EP, Grisso T, Roth LH,
Banks S. A classification tree approach to the development of actuarial violence risk assessment
tools. Law and Human Behavior 2000;24:83-100.
3. Hess KR, Abbruzzese MC, Lenzi R, Raber MN, Abbruzzese JL. Classification and regression tree
analysis of 1000 consecutive patients with unknown primary carcinoma. Clinical Cancer Research
1999;5:3403-3410.
4. Rainer TH, Lam PK, Wong EM, Cocks RA. Derivation of a prediction rule for post-traumatic acute
lung injury. Resuscitation 1999;42:187-196.
5. Dart RG, Kaplan B, Varaklis K. Predictive value of history and physical examination in patients with
suspected ectopic pregnancy. Annals of Emergency Medicine 1999;33:283-290.
6. Tsien CL, Fraser HS, Long WJ, Kennedy RL. Using classification tree and logistic regression
methods to diagnose myocardial infarction. Medinfo 1998;9:493-497.
7. Nelson LM, Bloch DA, Longstreth WT Jr., Shi H. Recursive partitioning for the identification of
disease risk subgroups: a case-control study of subarachnoid hemorrhage. Journal of Clinical
Epidemiology 1998;51:199-209.
8. Germanson TP, Lanzino G, Kongable GL, Torner JC, Kassell NJ. Risk classification after aneurysmal
subarachnoid hemorrhage. Surgical Neurology 1998;49:155-163.
9. Kastrati A, Schomig A, Elezi S, Schuhlen H, Dirschinger J, Hadamitzky M, Wehinger A, Hausleiter
J, Walter H, Neumann FJ. Predictive factors of restenosis after coronary stent placement. Journal of
the American College of Cardiology 1997;30:1428-1436.
10. Crichton NJ, Hinde JP, Marchini J. Models for diagnosing chest pain: Is CART helpful? Statistics
in Medicine 1997;16:717-727.
11. Hadzikadic M, Hakenewerth A, Bohren B, Norton J, Mehta B, Andrews C. Concept formation vs.
logistic regression: Predicting death in trauma patients. Artificial Intelligence in Medicine
1996;8:493-504.
12. Mair J, Smidt J, Lechleitner P, Dienstl F, Puschendorf B. A decision tree for the early diagnosis of
acute myocardial infarction in nontraumatic chest pain patients at hospital admission. Chest
1995;108:1502-1509.
13. Selker HP, Griffith JL, Patil S, Long WJ, D'Agostino RB. A comparison of performance of
mathematical predictive methods for medical diagnosis: Identifying acute cardiac ischemia among
emergency department patients. Journal of Investigative Medicine 1995;43:468-476.
14. Li D, German D, Lulla S, Thomas RG, Wilson SR. Prospective study of hospitalization for asthma.
A preliminary risk factor model. American Journal of Respiratory and Critical Care Medicine
1995;151:647-655.
15. Falconer JA, Naughton BJ, Dunlop DD, Roth EJ, Strasser DC, Sinacore JM. Predicting stroke
inpatient rehabilitation outcome using a classification tree approach. Archives of Physical Medicine
and Rehabilitation 1994;75:619-625.
16. Hasford J, Ansari H, Lehmann K. CART and logistic regression analyses of risk factors for first dose
hypotension by an ACE-inhibitor. Therapie 1993;48:479-482.