Classification and Regression Trees
Revised: 12/10/2018
Summary
Data Input
Analysis Options
Tables and Graphs
Analysis Summary
R Tree Diagram
Tree Diagram
R Tree Structure
Node Probabilities
Predictions and Residuals
2-D Scatterplot
3-D Scatterplot
Classification Table
Observed versus Predicted
Deviance Plot
Save Results
Example 2
References
Summary
The Classification and Regression Trees procedure creates tree-based models of 2 types:
1. Classification models that divide observations into groups based on their observed
characteristics.
2. Regression models that predict the value of a continuous quantitative variable.
The models are constructed by creating a tree, each node of which corresponds to a binary
decision. Given a particular observation, one travels down the branches of the tree until a
terminating leaf is found. Each leaf of the tree is associated with a predicted class or value.
The dependent variable may be either categorical or quantitative, as may the predictor variables.
The calculations are performed by the “tree” package in R. To run the procedure, R must be
installed on your computer together with the tree package. For information on downloading and
installing R, refer to the document titled “R – Installation and Configuration”.
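If the tree package is not already present, it may be installed from CRAN and loaded from within R; a minimal sketch:

install.packages("tree")   # one-time download from CRAN
library(tree)              # load the package in each R session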
Sample Data
The first example uses the classic set of data from Fisher (1936), contained in file iris.sgd. The
data consist of a total of n = 150 irises, 50 from each of g = 3 different species: setosa,
versicolor, and virginica. Measurements were made on p = 4 variables, describing the length and
width of the sepal and petal. The table below shows a partial list of the data in that file:
(Partial listing of iris.sgd: Sepal length, Sepal width, Petal length, Petal width, and Species.)
A classification model is desired that uses the 4 quantitative variables to determine the probable
species of each iris.
Data Input
Example 1
When the Classification and Regression Trees procedure is selected from the Statgraphics
menu, a data input dialog box is displayed. In the first example, 4 quantitative factors are used to
construct a classification model for Species:
Dependent variable: name of the column containing the class or value of the variable to be
predicted. If fitting a classification model, this variable may be either categorical or
quantitative. If fitting a regression model, this variable must be quantitative.
Categorical factors: names of the columns containing the categorical variables (if any) that
will be used to predict the dependent variable.
Quantitative factors: names of the columns containing the continuous quantitative variables
(if any) that will be used to predict the dependent variable.
Weights: optional numeric column containing weights to be applied to each case when
fitting the model. All weights must be positive.
Select: optional Boolean column or expression identifying the cases (rows of the Databook)
to be included in the analysis.
Example 2
In the second example, 2 categorical factors and 3 quantitative factors are used to construct a
regression model for MPG Highway:
Analysis Options
The Analysis Options dialog box sets various options for fitting the model:
Type of Tree: Classification trees are used to predict the value of categorical variables.
Regression trees are used to predict the value of continuous quantitative variables.
Partitioning: controls the initial partitioning of the tree into branches and leaves.
o Smallest allowed node size: the minimum number of observations in a node for it to be
split into 2 smaller nodes.
o Minimum number of observations in each child: the minimum number of observations
allowed in each child node after the split.
o Minimum within-node deviance to split: the smallest deviance allowed for a node to be
split. Smaller values cause more nodes to be created.
Pruning: controls how (or whether) the initial tree is pruned:
o None: do no pruning.
o Specify number of leaves: prune the tree so that it has the specified number of
leaves. The tree returned is a pruning of the initial tree that has the smallest error for
the size specified.
o Crossvalidate: uses crossvalidation to select the best pruning of the size specified.
Training Set: observations to be included in the training set used to fit the tree.
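The partitioning options correspond directly to arguments of the tree package's tree.control() function, as the R commands in the Analysis Summary below show; a minimal sketch with the settings used in Example 1:

# minsize: smallest allowed node size
# mincut:  minimum number of observations in each child
# mindev:  minimum within-node deviance to split, expressed as a
#          fraction of the deviance of the root node
ctl <- tree.control(nobs = 150, mincut = 5, minsize = 10, mindev = 0.01)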
Analysis Summary
The Analysis Summary begins with a list of the R commands that were executed.
treefit=tree(Species~Sepal.length+Sepal.width+Petal.length+Petal.width,
  control=tree.control(nobs=150,mincut=5,minsize=10,mindev=0.01),data=d)
summary(treefit)
##
## Classification tree:
## tree(formula = Species ~ Sepal.length + Sepal.width + Petal.length +
##     Petal.width, data = d, control = tree.control(nobs = 150,
##     mincut = 5, minsize = 10, mindev = 0.01))
## Variables actually used in tree construction:
## [1] "Petal.length" "Petal.width"  "Sepal.length"
## Number of terminal nodes: 6
## Residual mean deviance: 0.1253 = 18.05 / 144
## Misclassification error rate: 0.02667 = 4 / 150
plot(treefit)
text(treefit,pretty=3,cex=0.75)
p<-prune.tree(treefit)
write.table(treefit$frame,file="C:\\Users\\Neil\\AppData\\Local\\Temp\\frame.csv",sep=",")
write.table(treefit$where,file="C:\\Users\\Neil\\AppData\\Local\\Temp\\where.csv",sep=",",row.names=FALSE)
write.table(cbind(p$size,p$dev,p$k),file="C:\\Users\\Neil\\AppData\\Local\\Temp\\prune.csv",sep=",",row.names=FALSE)
1. Variables actually used in tree construction: shows which of the predictor variables
were actually used to construct the tree. In this example, 3 of the 4 predictor variables
were used.
2. Number of terminal nodes: the number of nodes after which no more splits are made
(the leaves). This tree has 6 leaves.
3. Residual mean deviance: a measure of the error remaining in the tree after construction.
For a regression tree, this is related to the mean squared error.
4. Misclassification error rate: the proportion of observations in the training set that were
predicted to fall in a class other than their actual one. In this case, 4 of the 150 irises
were not predicted correctly.
R Tree Diagram
(Tree diagram as drawn by R, using the plot and text commands shown in the Analysis Summary.)
Tree Diagram
To classify an observation, begin at the top of the tree. At each node, move left if the binary
statement is true or move right if it is false. You will eventually reach a terminating node (leaf) at
which the predicted value of the dependent variable is displayed. The size of the text is
controlled by the Analysis Options dialog box.
(Tree diagram for Species: the root splits on Petal.length<2.45, leading to a setosa leaf; later
splits on Petal.width<1.75, Petal.length<4.95, and Sepal.length<5.15 lead to versicolor and
virginica leaves.)
If the tree is hard to read, use your mouse wheel or toolbar controls to zoom in on a section of it.
R Tree Structure
For example, at node #1, observations travel along the branch to the left if Petal length < 2.45
and along the branch to the right if Petal length > 2.45.
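This pane reproduces the text representation that R prints for the fitted tree; a minimal sketch, assuming the treefit object created in the Analysis Summary:

# one line per node: split rule, n, deviance, predicted class, and
# class probabilities; * marks terminal nodes. Node numbering follows
# the convention that the children of node n are nodes 2n and 2n+1.
print(treefit)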
Node Probabilities
This table shows the probability distribution of observations in the training set that reach each
node:
For example, at terminating node #8 the probability that the observation comes from the species
versicolor equals 0.333 while the probability that it comes from the species virginica equals
0.667.
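The same per-node probabilities can be extracted in R from the frame component of the fitted tree; a minimal sketch:

treefit$frame[, c("n", "yval")]   # node sizes and predicted classes
treefit$frame$yprob               # class probabilities for each node
                                  # (row names of the frame are the node numbers)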
Predictions and Residuals
This table shows the following for each observation:
Leaf: the terminating node for the specified observation (displayed only for those
observations contained in the training set used to build the tree).
Predicted: the predicted value for the observation.
Observed: the observed value of the observation.
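A minimal sketch of assembling the same three columns in R, assuming the data frame d used in the Analysis Summary:

# Leaf: the node number reached by each training observation
leaf <- as.integer(rownames(treefit$frame))[treefit$where]
# Predicted and Observed classes for the training set
data.frame(Leaf = leaf,
           Predicted = predict(treefit, type = "class"),
           Observed = d$Species)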
The residual mean deviance is displayed at the bottom of the table. For a regression tree, the
residual mean deviance is calculated by
RMD = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - k}          (1)

where Y_i is the observed value of the dependent variable, \hat{Y}_i is the predicted value, n is the
number of observations in the training set, and k equals the number of terminating nodes (leaves)
in the fitted tree. In such a case, RMD is equivalent to the mean squared error. For a
classification tree, the residual mean deviance is calculated by
RMD = \frac{\sum_{i=1}^{n} -2\log(p_{i,j})}{n - k}          (2)
where p_{i,j} is the estimated probability that observation i would be assigned to class j, where j is
the index of the class predicted by the model. In both cases, smaller values of RMD correspond
to better predicting trees.
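Either version of RMD can be reproduced in R from the fitted tree; a minimal sketch for the classification tree of Example 1, where n = 150 and k = 6:

n <- 150
k <- sum(treefit$frame$var == "<leaf>")   # number of leaves
deviance(treefit) / (n - k)               # 18.05/144 = 0.1253, as in summary(treefit)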
If some cases have been reserved for validation, a separate table of predictions is displayed. In
such cases, the denominator of equations (1) and (2) is set equal to the number of observations in
the validation set. Note that it is possible for the validation RMD to equal infinity if an
observation in the validation set corresponds to a class given 0 probability by the fitted tree.
2-D Scatterplot
This graph plots the observations in the training set with respect to any 2 of the predictor
variables:
(Scatterplot of Sepal width versus Sepal length for the training set, with points colored by
Species: setosa, versicolor, virginica.)
If the fitted tree partitions the observations into sections using only the 2 variables specified, the
partitions are shown. For example, the plot below shows the points on axes defined by the 2
petal dimensions:
(Scatterplot of Petal width versus Petal length, colored by Species, with the fitted partitions
overlaid and terminal nodes 2, 8, 10, and 11 labeled.)
The partition on the left, which corresponds to terminating node #2, is based solely on Petal
length and is uniquely setosa. The partition at the bottom right, which corresponds to terminating
node #8, is based on both Petal length and Petal width and corresponds to both versicolor and
virginica. The open section near the middle bottom of the plot is not partitioned since it involves
variables other than just the 2 displayed on the X and Y axes.
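When a tree involves no more than 2 predictors, the tree package can draw these partitions itself with partition.tree(). A minimal sketch, refitting on just the two petal dimensions (the full tree of Example 1 also uses Sepal length, which cannot be shown in 2 dimensions), assuming the data frame d with Species stored as a factor:

fit2 <- tree(Species ~ Petal.length + Petal.width, data = d)
plot(d$Petal.length, d$Petal.width, col = as.integer(d$Species),
     xlab = "Petal length", ylab = "Petal width")
partition.tree(fit2, add = TRUE)   # overlay the fitted partitions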
3-D Scatterplot
This graph plots the observations in the training set with respect to any 3 of the predictor
variables:
(3-D scatterplot of Petal width against Sepal length and Petal length, with terminal node
numbers overlaid.)
Classification Table
This table shows how well the fitted model performs in classifying the observations:
(Classification table for Species, showing group sizes and predicted counts.)
It shows:
Actual Species: There is a row for each level of the dependent variable.
Group Size: the number of observations in the training set that fall in the specified class.
Predicted: the number of observations in the training set that were predicted to fall in the
specified class.
The percentage of correctly classified observations is displayed for each level of the dependent
variable and for all of the observations combined.
For example, there were 50 irises of the species virginica. All but 1 were correctly predicted to
be virginica; the remaining iris was predicted to be versicolor. Of all 150 irises, 97.33% were
correctly classified.
A separate table is displayed for any observations withheld for validation purposes.
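A minimal sketch of computing the same cross-tabulation in R:

# rows are the actual species, columns the predicted species
tab <- table(Actual = d$Species, Predicted = predict(treefit, type = "class"))
tab
100 * sum(diag(tab)) / sum(tab)   # overall percent correctly classified (97.33)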
Observed versus Predicted
(Mosaic plot of observed species versus predicted species; the bar for setosa shows 100.0%
predicted correctly.)
By default, the mosaic plot contains a row for each level of the dependent variable. Bars are
drawn in each row with length proportional to the number of times observations at that level
were predicted to be of each class. The plot above shows that setosa was predicted correctly all
the time. Versicolor was occasionally predicted to be virginica, while virginica was occasionally
predicted to be versicolor.
Pane Options
Display percentage: whether the plot should display the percentage corresponding to each
bar.
Deviance Plot
The deviance plot shows the magnitude of the error associated with a tree containing various
numbers of leaves (terminating nodes).
(Deviance plot: total deviance versus size, the number of leaves, for trees with 1 through 6
leaves.)
Large reductions due to the inclusion of another node show that the node significantly reduced
the misclassification rate. The plot above shows that a tree with 3 leaves does
almost as well as a tree with 6 leaves. Using Analysis Options to reduce the number of leaves to
3 results in the following partitions:
(Partition plot of Petal width versus Petal length for the pruned 3-leaf tree, with terminal nodes
2, 6, and 7 labeled.)
Such a tree uses only Petal width and Petal length to classify the observations. The pruned tree
misclassifies 6 observations, compared to only 4 for the original tree.
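A minimal sketch of the corresponding R commands: prune.tree() with no size argument returns the deviance sequence plotted above, while the best argument requests a pruned tree of a given size:

p <- prune.tree(treefit)                # sizes, deviances, and cost-complexity values
pruned <- prune.tree(treefit, best = 3) # best pruning with 3 leaves
summary(pruned)                         # misclassifications rise from 4 to 6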
Save Results
Selected output may be saved in a Statgraphics datasheet by pressing the Save Results button on
the analysis toolbar. The following dialog box will be presented:
Autosave: if checked, the results will be saved automatically each time a saved StatFolio
is loaded.
Save comments: if checked, comments for each column will be saved in the second line
of the datasheet header.
Example 2
The second example fits a regression tree designed to predict MPG Highway from the
93cars.sgd data file using 5 predictor variables. The default tree (shown below) uses 4 of the 5
predictors:
(Regression tree diagram for MPG Highway; the first split is on Weight<2512.5.)
There are a few differences between a regression tree and a classification tree:
1. When displaying the report of observed and predicted values, residuals are also
calculated and displayed.
2. The predicted values equal the average of all observations in the training set that arrive at
a given leaf.
3. The graph of Observed versus Predicted values produces a scatterplot rather than a
mosaic chart:
(Scatterplot of observed versus predicted MPG Highway.)
(Plot of residuals versus predicted MPG Highway.)
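A similar regression tree can be fit directly in R. The five predictors used in this example are not all listed above; as an illustrative sketch, the Lock (1993) data are also available as Cars93 in R's MASS package, and the predictors below (2 categorical, 3 quantitative) are chosen for illustration only:

library(MASS)    # Cars93: the Lock (1993) new car data
library(tree)
rfit <- tree(MPG.highway ~ Type + Origin + Weight + Horsepower + EngineSize,
             data = Cars93)
summary(rfit)    # reports the residual mean deviance of equation (1)
plot(rfit)
text(rfit)       # regression tree diagram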
References
Breiman, L., Friedman, J., Olshen, R.A. and Stone, C.J. (1998) Classification and Regression
Trees. Wadsworth.
Fisher, R.A. (1936). “The use of multiple measurements in taxonomic problems.” Ann. Eugenics
7, Pt. II, 179-188.
Lock, R. H. (1993) “1993 New Car Data”. Journal of Statistics Education, vol. 1, no. 1.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press.