1 Theme
Comparison of the implementation of the CART algorithm under Tanagra and R (rpart
package).
CART (Breiman et al., 1984) is a very popular algorithm for learning classification trees (also called decision trees), and rightly so. CART incorporates all the ingredients of well-controlled learning: the post-pruning process enables a trade-off between bias and variance; the cost-complexity mechanism "smooths" the exploration of the space of solutions; we can control the preference for simplicity with the standard error rule (SE-rule); etc. Thus, the data miner can adjust the settings according to the goal of the study and the characteristics of the data.
Breiman's algorithm is provided under different names in free data mining tools. Tanagra uses the name “C-RT”. R, through a specific package 1, provides the “rpart” function.
In this tutorial, we describe these implementations of the CART approach with reference to the original book (Breiman et al., 1984; chapters 3, 10 and 11). The main difference between them is the implementation of the post-pruning process. Tanagra uses a specific sample called the "pruning set" (section 11.4), whereas rpart relies on the cross-validation principle (section 11.5) 2.
2 Dataset
We use the WAVEFORM dataset (section 2.6.2). The target attribute (CLASS) has 3 values. There are 21 continuous predictors (V1 to V21). We want to reproduce the experiment described in Breiman's book (pages 49 and 50). We have 300 instances in the learning sample and 5000 instances in the test sample. Thus, the data file wave5300.xls 3 includes 5300 rows and 23 columns. The last column is a variable which specifies whether each instance belongs to the train sample or to the test sample (Figure 1).
1 http://cran.r-project.org/web/packages/rpart/index.html
2 We can use a pruning sample with “rpart” but it is not really easy to use. See http://www.math.univ-toulouse.fr/~besse/pub/TP/appr_se/tp7_cancer_tree_R.pdf, section 2.1.
3 http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/wave5300.xls
After launching Tanagra, we want to create a diagram and import the data file. To do so, we activate the FILE / NEW menu. We choose the XLS file format and select wave5300.xls 4.
The diagram is created. In the first node, Tanagra shows that the file contains 5300 instances and 23 variables.
4 There are various ways to import an XLS data file. We can use the add-in for Excel (http://data-mining-tutorials.blogspot.com/2010/08/sipina-add-in-for-excel.html, http://data-mining-tutorials.blogspot.com/2010/08/tanagra-add-in-for-office-2007-and.html) or, as we do in this tutorial, import the dataset directly (http://data-mining-tutorials.blogspot.com/2008/10/excel-file-format-direct-importation.html). In the latter case, the dataset must not be open in the spreadsheet application, the values must be in the first sheet, and the first row must contain the names of the variables. Direct importation is faster than using the “tanagra.xla” add-in but, on the other hand, Tanagra can only handle the XLS format here (up to Excel 2003).
We click on the VIEW menu: 300 instances (out of 5300) are now used for the construction of the decision tree.
Before the learning process itself, we must specify the variable types. We add the DEFINE STATUS component using the shortcut in the toolbar. We set CLASS as TARGET and the continuous variables (V1 to V21) as INPUT.
We insert the C-RT component, which implements the CART approach. Let us describe some of the method settings (SUPERVISED PARAMETERS menu).
MIN SIZE OF NODE TO SPLIT is the minimum number of instances required to attempt the splitting of a node. For our dataset, we do not perform a split if there are fewer than 10 instances.
PRUNING SET SIZE is the proportion of the learning set used for the post-pruning process. The default value is 33%, i.e. 67% of the learning set is used for the growing phase (67% of 300 = 201 instances) and 33% for the pruning phase (33% of 300 = 99 instances).
SE RULE controls the selection of the right tree in the post-pruning process. The default value is θ = 1, i.e. we select the smallest tree whose error rate is lower than the "error rate + 1 standard error (of the error rate)" of the optimal tree. If θ = 0, we select the optimal tree, the one which minimizes the pruning error rate (the error rate computed on the pruning sample).
Last, SHOW ALL TREE SEQUENCE, when it is checked, shows all the sequences of trees analyzed during the post-pruning process. It is not necessary to activate this setting here.
We validate our settings by clicking on the OK button. Then, we click on the contextual VIEW menu.
We first obtain the resubstitution confusion matrix, computed on the learning set (300 instances: the growing and pruning samples are merged). The error rate is 19.67%.
Below, we have the sequence of trees table, with the number of leaves and the error rates calculated on the growing set and on the pruning set.
The error rate computed on the growing set decreases as the number of leaves increases. We cannot use this information to select the right model.
We use the pruning sample to select the "best" model. The tree which minimizes the error rate on the pruning set is highlighted in green. It has 14 leaves and its error rate is 28.28%.
With the θ = 1 SE-rule, we select the smallest tree (with the fewest leaves) for which the pruning error rate is lower than
ε_threshold = 0.2828 + 1 × √[ 0.2828 × (1 − 0.2828) / 99 ] = 0.3281
This is tree #5, with 9 leaves. Its pruning error rate is 32.32% (highlighted in red) and its growing error rate is 13.43%.
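As a quick check, the threshold above can be recomputed by hand. Here is a minimal R sketch using only the values reported by Tanagra (best pruning error 0.2828, pruning sample of 99 instances):

# 1-SE threshold computed from the Tanagra output
err.min <- 0.2828
theta   <- 1
seuil   <- err.min + theta * sqrt(err.min * (1 - err.min) / 99)
print(seuil)  # about 0.3281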
The chart depicting the evolution of the error rate computed on the growing and pruning samples is available in the CHART tab of the visualization window. We can visually detect the interesting trees.
We note that the trees with 7 or 6 leaves are equivalent in terms of prediction performance. How to modify the setting θ to obtain one of these trees is described in another tutorial 5.
In the lower part of the report (HTML tab), we have a description of the selected tree.
To obtain a good estimation of the generalization error rate, we use a separate sample, the test set, which was not used during the learning phase. On our dataset, the test set is the rest of the data file, i.e. the 5000 instances set aside in the previous part of the diagram (DISCRETE SELECT EXAMPLES node). Tanagra automatically generates a new column which contains the prediction of the model. The trick is that the prediction is computed both on the selected instances (the learning sample) and on the unselected ones (the test sample).
5 http://data-mining-tutorials.blogspot.com/2010/01/cart-determining-right-size-of-tree.html
We add the DEFINE STATUS component to the diagram. We set CLASS as TARGET and PRED_SPVINSTANCE_1, the new column generated by the supervised learning algorithm, as INPUT.
We add the TEST component (SPV LEARNING ASSESSMENT tab). We click on the VIEW menu. We observe that, by default, the confusion matrix is computed on the unselected instances, i.e. the test sample (Note: the confusion matrix can also be computed on the selected instances, i.e. the learning sample; in this case, we must obtain the same confusion matrix as the one shown in the model visualization window).
We click on the VIEW menu. The test error rate is 28.44%. When we use the tree to predict the class of an unseen instance, the probability of misclassification is 28.44%.
To import the XLS file format, the easiest way is to use the xlsReadWrite package.
Then, we can import the dataset using the read.xls(.) command. We obtain a short description of the data frame (the structure which holds the dataset in memory) with the summary(.) command 6.
6 The full results of the summary(.) command are not shown in our screenshots; they would take too much space. It is nevertheless essential to check the data integrity at every stage.
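A minimal sketch of this import step is shown below. It assumes the file sits in the current working directory and that we call the resulting data frame wave; the column names depend on the header row of the XLS file.

# Import the dataset from the first sheet of wave5300.xls
library(xlsReadWrite)
wave <- read.xls("wave5300.xls")
# Quick integrity check: 5300 rows and 23 columns expected
print(dim(wave))
summary(wave)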
4.2 Partitioning the data file into train and test samples
We use the SAMPLE column to partition the dataset (“learning” and “test”). The SAMPLE column is
then removed.
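A possible sketch of this partitioning step, assuming the SAMPLE column was imported under the name sample with the values "learning" and "test" (adjust the name and the values to the actual content of the file):

# Split according to the SAMPLE column, then drop it (the 23rd column)
wave.train <- wave[wave$sample == "learning", 1:22]
wave.test  <- wave[wave$sample == "test", 1:22]
print(nrow(wave.train))  # expected: 300
print(nrow(wave.test))   # expected: 5000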
First, we must grow the maximal tree on the learning set, i.e. the tree with the maximum number of leaves. It will be pruned in a second phase.
We use the following settings 7: MINSPLIT = 10 means that the process does not split a node with fewer than 10 instances; MINBUCKET = 1 means that the process does not validate a split where one of the leaves contains fewer than 1 instance. The maximal tree obtained should be similar to the one of Tanagra. One of the main differences is that rpart uses all the instances of the learning set (Tanagra uses only the growing sample, a part of the learning set).
7 About the other settings: METHOD = "CLASS" indicates that we are dealing with a classification problem (instead of a regression); SPLIT = GINI indicates the measure used for the selection of the splitting variables.
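The growing phase may look like the sketch below (the target column is assumed to be named class, as in the partition above; the cp parameter keeps its default value of 0.01, which is discussed next).

library(rpart)
# Grow the tree on the 300 learning instances
arbre <- rpart(class ~ ., data = wave.train, method = "class",
               parms = list(split = "gini"),
               control = rpart.control(minsplit = 10, minbucket = 1))
print(arbre)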
We obtain a tree with 18 leaves. This is fewer than with Tanagra. One reason is the CP parameter, whose default value (cp = 0.01) introduces a pre-pruning during the growing phase. We will see below that the same CP parameter is used for the post-pruning procedure.
One of the main drawbacks of rpart is that we must perform the post-pruning procedure ourselves 8. For that, we use the results shown in the CPTABLE obtained with the printcp(.) command 9 (Figure 4).
8 We will see below that, from another point of view, this can also be an advantage.
9 Because rpart partitions the dataset randomly during the cross-validation process, we may not obtain exactly the same results. To obtain the same results at each session, we can set the seed of the random number generator with the set.seed(.) command.
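In code, this step boils down to the call below (set.seed(.) must be issued before the rpart(.) call if reproducible cross-validation errors are wanted).

# Sequence of subtrees: CP, number of splits, relative error,
# cross-validation error (xerror) and its standard error (xstd)
printcp(arbre)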
It describes the sequence of trees by associating the complexity parameter (CP) with: the number of splits (equal to the number of leaves − 1, since the tree is binary), the error calculated on the training sample (normalized so that the error on the root is equal to 1), the error calculated by cross-validation (also normalized), and the standard error of this error rate.
This table is very similar to the one generated by Tanagra (Figure 2). But instead of using a separate part of the learning set (the pruning sample), rpart is based on the cross-validation principle.
On the one hand, this is advantageous when we handle a small learning sample, since we do not need to partition it. On the other hand, the computation time increases dramatically when we handle a large dataset.
Selection of the final tree. We want to use the values provided by the CPTABLE to select the "optimal" tree. Tree #5, with 7 splits (8 leaves), is the one which minimizes the cross-validation error rate (ε = 0.50000). To obtain this tree, we must specify a CP value in the interval 0.023684 < cp ≤ 0.031579.
But this tree is not the best one. Breiman claims that it is more interesting to introduce a preference for simplicity by selecting the tree according to the 1-SE rule (Breiman et al., 1984; section 3.4.3, pages 78 to 81). The threshold error rate is computed from the values provided by the CPTABLE as the minimum of the xerror column plus the corresponding xstd, as sketched below.
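A minimal R sketch of this computation, working directly on the cptable element of the fitted tree:

# 1-SE threshold: xerror of the best subtree plus one standard error
cpt   <- arbre$cptable
imin  <- which.min(cpt[, "xerror"])
seuil <- cpt[imin, "xerror"] + cpt[imin, "xstd"]
print(seuil)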
The smallest tree whose cross-validation error rate is lower than this threshold is tree #4, with 4 splits (5 leaves). To obtain this tree, we set CP in the range 0.031579 < cp ≤ 0.055263, say cp = 0.04 for instance, and we use the prune(.) command.
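A sketch of this pruning step:

# Prune the maximal tree with the cp value chosen above
arbre.prune <- prune(arbre, cp = 0.04)
print(arbre.prune)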
We create the prediction column using the predict(.) command. Of course, the prediction is
computed on the "test" sample. Then, we compute the confusion matrix and the error rate.
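Under the same naming assumptions as before, this evaluation could be written as:

# Predicted classes on the 5000 test instances
pred <- predict(arbre.prune, newdata = wave.test, type = "class")
# Confusion matrix and test error rate
mc  <- table(wave.test$class, pred)
err <- 1 - sum(diag(mc)) / sum(mc)
print(mc)
print(err)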
In this case, the test error rate is 34.48%. This is significantly worse than the result obtained with Tanagra. Perhaps we have pruned the tree too much.
We can try another solution by specifying another value of CP. This is a very attractive feature of rpart. For instance, if we set cp = 0.025, we obtain a tree with 8 leaves 10.
10 Not shown in this tutorial, the plotcp(.) command generates a chart of the cross-validation error rate according to the number of splits.
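Again under the same assumptions, this simply means re-pruning the maximal tree with the new cp value:

# Less aggressive pruning: keeps the 8-leaf tree mentioned in the text
arbre.prune2 <- prune(arbre, cp = 0.025)
print(arbre.prune2)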
rpart provides commands to generate a graphical representation of the decision tree.
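For instance, with the base plotting functions of the package (a minimal sketch, here applied to the last pruned tree; the layout options are a matter of taste):

# Plot the pruned tree with the class counts at the leaves
plot(arbre.prune2, uniform = TRUE, margin = 0.1)
text(arbre.prune2, use.n = TRUE, cex = 0.8)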
We obtain:
5 Conclusion
CART has undeniable qualities compared with other techniques for inducing classification trees. It is able to produce "turnkey" models, with performances at least as good as those of the others.
But, in addition, it can provide various candidate solutions. Tools such as those shown in Figure 3 or Figure 4 allow the user to select the best model according to their problem and the characteristics of the dataset. This is an essential advantage in the practice of Data Mining.
6 References
• L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Chapman & Hall, 1984.
• D. Zighed, R. Rakotomalala, Graphes d’Induction : Apprentissage et Data Mining, Hermès, 2000; chapter 5, “CART”.
• B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996; chapter 7, “Tree-Structured Classifiers”.
• W. Venables, B. Ripley, Modern Applied Statistics with S, Springer, 2002; chapter 9, “Tree-based Methods”.
• About the RPART package: http://cran.r-project.org/web/packages/rpart/index.html