
CART: CLASSIFICATION AND REGRESSION TREE
CART: Classification and Regression Tree
Motivation:
• Development of a reliable clinical decision rule, which can
be used to classify new patients into clinically-important
categories or risk categories so that appropriate decisions
can be made regarding patient management.
CART: Classification and Regression Tree
Example 1:
• Molecular abnormalities in the major psychiatric illnesses:
Classification and Regression Tree (CRT) analysis of
post-mortem prefrontal markers.
M. B. Knable et al.
Molecular Psychiatry, 2002, Volume 7, Number 4, Pages 392-404.
CART: Classification and Regression Tree
• Post-mortem specimens from the Stanley Foundation
Neuropathology Consortium, which contains matched
samples from patients with schizophrenia, bipolar
disorder, non-psychotic depression and normal controls (n
= 15 per group), have been distributed to many research
groups around the world. This paper provides a summary
of abnormal markers found in prefrontal cortical areas
from this collection between 1997 and 2001. With
parametric analyses of variance of 102 separate data
sets, 14 markers were abnormal in at least one disease.
CART: Classification and Regression Tree
• The markers pertained to a variety of neural systems and
processes including neuronal plasticity,
neurotransmission, signal transduction, inhibitory
interneuron function and glial cells. The data sets were
also examined using the non-parametric Classification
and Regression Tree (CRT) technique for the four
diagnostic groups and in pair-wise combinations. In
contrast to the results obtained with analyses of variance,
the CRT method identified a smaller set of nine markers
that contributed maximally to the diagnostic
classifications.
CART: Classification and Regression Tree
• Three of the nine markers observed with CRT overlapped
with the ANOVA results. Six of the nine markers observed
with the CRT technique pertained to aspects of
glutamatergic, GABA-ergic, and dopaminergic
neurotransmission.
CART: Classification and Regression Tree
Example 2:
• Sperm morphology, motility, and concentration in fertile
and infertile men.
D. S. Guzick, J. W. Overstreet, P. Factor-Litvak, et al.
New England Journal of Medicine, 2001, Volume 345, Issue 19, Pages 1388-1393.
CART: Classification and Regression Tree
Background
• Although semen analysis is routinely used to evaluate the
male partner in infertile couples, sperm measurements
that discriminate between fertile and infertile men are not
well defined.
CART: Classification and Regression Tree
Methods
• We evaluated two semen specimens from each
of the male partners in 765 infertile couples and
696 fertile couples at nine sites. The female
partners in the infertile couples had normal
results on fertility evaluation. The sperm
concentration and motility were determined at the
sites; semen smears were stained at the sites
and shipped to a central laboratory for an
assessment of morphologic features of sperm
with the use of strict criteria.
CART: Classification and Regression Tree
• We used classification-and-regression-tree analysis to
estimate threshold values for subfertility and fertility with
respect to the sperm concentration, motility, and
morphology. We also used an analysis of receiver-
operating-characteristic curves to assess the relative
value of these sperm measurements in discriminating
between fertile and infertile men.
CART: Classification and Regression Tree
Results
• The subfertile ranges were a sperm concentration of less than 13.5×10⁶ per milliliter, less than 32 percent of sperm with motility, and less than 9 percent with normal morphologic features. The fertile ranges were a concentration of more than 48.0×10⁶ per milliliter, greater than 63 percent motility, and greater than 12 percent normal morphologic features. Values between these ranges indicated indeterminate fertility.
CART: Classification and Regression Tree
• There was extensive overlap between the fertile and the
infertile men within both the subfertile and the fertile
ranges for all three measurements. Although each of the
sperm measurements helped to distinguish between
fertile and infertile men, none was a powerful
discriminator. The percentage of sperm with normal
morphologic features had the greatest discriminatory
power.
CART: Classification and Regression Tree
Conclusions
• Threshold values for sperm concentration, motility, and
morphology can be used to classify men as subfertile, of
indeterminate fertility, or fertile. None of the measures,
however, are diagnostic of infertility.
CART: Classification and Regression Tree
Components of a classification problem:
1. Outcome or "dependent" variable. This can be a continuous, categorical (ordinal or nominal), or time-to-event variable, e.g.:
a) Example 1: diagnostic group (schizophrenia, bipolar disorder, non-psychotic depression, or normal control)
b) Example 2: fertility status of the couple (fertile vs. infertile)
c) Blood pressure, patient survival, need for surgery, presence of myocardial infarction, medication compliance
CART: Classification and Regression Tree
2. Predictor or independent variables, e.g.:
a) Example 1: post-mortem prefrontal cortical markers
b) Example 2: sperm concentration, motility, and morphology
c) Blood pressure, patient survival, need for surgery, presence of myocardial infarction, medication compliance
CART: Classification and Regression Tree
3. Learning data set: this is a dataset which includes
values for both the outcome and predictor variables,
from a group of patients similar to those for whom we
would like to be able to predict outcomes in the future.
CART: Classification and Regression Tree
4. Test data set: it consists of patients for whom we would
like to be able to make accurate predictions. This test
dataset may or may not exist in practice. However, a
separate test dataset is not always required to
determine the performance of a decision rule.
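To make the learning/test split concrete, here is a minimal sketch in Python, assuming scikit-learn is available (its DecisionTreeClassifier implements CART); the bundled breast-cancer data serve only as a stand-in for a clinical dataset, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in clinical data: predictor variables X and a binary outcome y.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of subjects as the test dataset; the remainder is the
# learning dataset used to build the decision rule.
X_learn, X_test, y_learn, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_learn, y_learn)
print("accuracy on the learning dataset:", tree.score(X_learn, y_learn))
print("accuracy on the test dataset:", tree.score(X_test, y_test))
```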
CART: Classification and Regression Tree
• A decision problem can include two other factors to be considered:
1) a "prior" probability for each outcome, which represents the probability that a randomly selected future patient will have that particular outcome; and
2) a decision cost or loss function, which represents the inherent cost associated with each kind of prediction error.
CART: Classification and Regression Tree
• For example, it is a much more serious error to classify a
patient with an emergent medical condition as non-urgent,
than to misclassify a patient with a non-urgent medical
condition as urgent.
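scikit-learn's trees do not accept a full decision-cost matrix, but their class_weight parameter can serve as a rough stand-in: weighting a class more heavily makes errors on that class costlier during splitting. A hedged sketch, with hypothetical labels 0 = non-urgent and 1 = emergent:

```python
from sklearn.tree import DecisionTreeClassifier

# Make misclassifying an emergent patient (class 1) ten times as costly
# as misclassifying a non-urgent patient (class 0). class_weight rescales
# the effective class priors, approximating an asymmetric loss function.
cautious_tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10},
                                       random_state=0)
```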
CART: Classification and Regression Tree
Features of CART:
1. Nodes: parent node, child node, root node, terminal
node.
2. Binary splits: a "node" in a decision tree can only be split into two groups.
CART: Classification and Regression Tree
3. Each split is based on only one variable.
4. Recursive partitioning: the binary partitioning process can be applied over and over again. Each parent node can give rise to two child nodes and, in turn, each of these child nodes may itself be split, forming additional children.
CART: Classification and Regression Tree
To construct a CART:
1. Tree building: a tree is built using recursive splitting of nodes.
2. Building a "maximal" tree based on some "stopping rule": splitting continues until a "maximal" tree is produced, one which probably greatly overfits the information contained within the learning dataset.
3. Optimal tree selection: the tree which fits the information in the learning dataset, but does not overfit it, is selected from among the sequence of pruned trees.
CART: Classification and Regression Tree
TREE BUILDING:
1. Tree building begins at the root node, which includes all patients in the learning dataset. Beginning with this node, the CART software finds the best possible variable to split the node into two child nodes. In order to find the best variable, the software checks all possible splitting variables (called splitters), as well as all possible values of each variable that could be used to split the node.
2. In choosing the best splitter, the program seeks to minimize the total "impurity" of the two child nodes (a small sketch of this search follows the impurity definitions below).
CART: Classification and Regression Tree
3. Measures of the impurity of a node with class proportions p = (p1, p2, p3, …, pk):
a. Information, or entropy: E = -Σ p_i log(p_i), with 0 log 0 = 0
b. Gini index: G = 1 - Σ p_i²
4. We define the impurity of a tree to be the sum, over all terminal nodes, of the impurity of each node multiplied by the proportion of cases that reach that node of the tree.
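Both impurity measures, together with the exhaustive search for the best splitter described above, can be sketched in a few lines of Python; NumPy is assumed, and the function names are illustrative, not part of any library.

```python
import numpy as np

def entropy(labels):
    """Entropy E = -sum(p_k log2 p_k). Absent classes never appear in
    `counts`, so the 0 log 0 = 0 convention is handled implicitly."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini index G = 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y, impurity=gini):
    """Exhaustively check every cut point of one splitter x, returning
    the threshold that minimizes the size-weighted impurity of the two
    child nodes."""
    best_t, best_imp = None, np.inf
    for t in np.unique(x)[:-1]:   # each distinct value except the largest
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        imp = w * impurity(left) + (1 - w) * impurity(right)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp
```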
CART: Classification and Regression Tree
5. The predicted class assigned to each terminal node
depends on three factors:
1) Assumed prior probability of each class within future
datasets;
2) Decision loss or cost matrix; and
3) Fraction of subjects with each outcome in the learning
dataset that end up in each node.
CART: Classification and Regression Tree
Stop Tree Building
The tree-building process goes on until it is impossible to continue. The process is stopped when:
1. All observations within each child node have an identical distribution of predictor variables, making further splitting impossible.
CART: Classification and Regression Tree
2. An external limit on the number of levels in the
maximal tree has been set by the user.
3. An external limit on the size of the terminal nodes in
the maximal tree has been reached.
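Stopping rules 2 and 3 map directly onto hyperparameters of scikit-learn's CART implementation; a minimal sketch (the parameter values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# External limits that stop the growth of the maximal tree:
limited_tree = DecisionTreeClassifier(
    max_depth=5,          # limit on the number of levels (rule 2)
    min_samples_leaf=10,  # limit on the size of terminal nodes (rule 3)
)
```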
CART: Classification and Regression Tree
Tree Pruning
• In order to generate a sequence of simpler and simpler trees, each of which is a candidate for the appropriately-fit final tree, the method of reduced error pruning can be used: each time, we remove the "weakest link", the split that provides the least improvement in misclassification. The misclassification error gradually increases during the pruning process.
• More general pruning is possible: cost-complexity pruning.
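scikit-learn implements this more general cost-complexity pruning; a minimal sketch of generating the sequence of pruned candidate trees, again using a bundled dataset as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# cost_complexity_pruning_path grows the maximal tree internally and
# returns the effective alphas at which successive weakest links go.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# One pruned tree per alpha: simpler and simpler candidate trees.
pruned_trees = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X, y)
    for a in path.ccp_alphas
]
```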
CART: Classification and Regression Tree
Optimal Tree Selection
• The maximal tree will always fit the learning dataset with
higher accuracy than any other tree, because the maximal
tree is constructed to optimize its performance based on
the learning dataset.
CART: Classification and Regression Tree
• The goal in selecting the optimal tree, defined with respect to expected performance on an independent set of data, is to find the correct value of the complexity parameter α so that the information in the learning dataset is fit but not overfit. In general, finding this value of α would require an independent set of data, but this requirement can be avoided using the technique of cross-validation.
CART: Classification and Regression Tree
• Consider the relationship between tree complexity, reflected by the number of terminal nodes, and the decision cost for an independent test dataset and for the original learning dataset.
CART: Classification and Regression Tree
• As the number of nodes increases, the decision
cost decreases monotonically for the learning
data. This corresponds to the fact that the
maximal tree will always give the best fit to the
learning dataset. In contrast, the expected cost
for an independent dataset reaches a minimum,
and then increases as the complexity increases.
This reflects the fact that an overfitted and overly
complex tree will not perform well on a new set of
data.
CART: Classification and Regression Tree
Cross-Validation
• Cross validation is a computationally-intensive
method for validating a procedure for model
building, which avoids the requirement for a new
or independent validation dataset. In cross
validation, the learning dataset is randomly split
into N sections. One of these subsets of data is
reserved for use as an independent test dataset,
while the other N-1 subsets are combined for use
as the learning dataset in the model-building
procedure.
CART: Classification and Regression Tree
• The entire model-building procedure is
repeated N times, with a different subset of the
data reserved for use as the test dataset each
time. Thus, N different models are produced,
each one of which can be tested against an
independent subset of the data. The amazing
fact on which cross validation is based is that
the average performance of these N models is
an excellent estimate of the performance of
the original model (produced using the entire
learning dataset) on a future independent set
of patients.
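Putting the pieces together, here is a hedged sketch of optimal tree selection by N-fold cross-validation (N = 10): each candidate value of the complexity parameter α is scored by its average held-out performance, and the best is kept. The dataset is again a stand-in.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate complexity parameters from the cost-complexity pruning path.
alphas = (DecisionTreeClassifier(random_state=0)
          .cost_complexity_pruning_path(X, y).ccp_alphas)

# For each alpha, rebuild the model on N-1 folds and test on the held-out
# fold; the mean of the N scores estimates future performance.
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                    X, y, cv=10).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]
print("complexity parameter selected by cross-validation:", best_alpha)
```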
CART: Classification and Regression Tree
Advantages:
• No parametric assumptions (in contrast to linear regression, logistic regression, or Cox's proportional hazards model)
• Can cope with any data type (continuous, binary, ordinal, nominal)
• The classification has a simple form that is easy to understand
CART: Classification and Regression Tree
• CART identifies “splitting” variables based on an
exhaustive search of all possibilities. Since efficient
algorithms are used, CART is able to search all possible
variables as splitters, even in problems with many
hundreds of possible predictors.
• It handles complex interactions well. For example, the
value of one variable (e.g., age) may substantially affect
the importance of another variable (e.g., weight).
• It is robust with respect to outliers.
• It provides an estimate of the misclassification rate.
CART: Classification and Regression Tree
Disadvantages:
• CART does not use combinations of variables in each split
• Tree structures may be unstable – a change in the sample may give a different tree
• The tree is optimal at each split – it may not be globally optimal
CART: Classification and Regression Tree
References:
1. Classification and Regression Trees. Leo Breiman, Jerome H. Friedman, Richard Olshen, and Charles J. Stone. Brooks/Cole Publishing, Monterey, 1984.
2. An Introduction to Classification and Regression Tree (CART) Analysis. Roger J. Lewis. Presented at the 2000 Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, California.
3. Modern Applied Statistics with S. W. N. Venables and B. D. Ripley. Springer, 2002.
