1 Theme
Comparison of the implementation of the CART algorithm under Tanagra and R (rpart
package).
CART (Breiman et al., 1984) is a very popular algorithm for learning classification trees (also called decision trees), and rightly so. CART incorporates all the ingredients of well-controlled learning: the post-pruning process enables a trade-off between bias and variance; the cost-complexity mechanism "smooths" the exploration of the space of solutions; we can control the preference for simplicity with the standard error rule (SE-rule); etc. Thus, the data miner can adjust the settings according to the goal of the study and the characteristics of the data.
Breiman's algorithm is provided under different names in free data mining tools. Tanagra uses the name “C-RT”. R, through a specific package 1, provides the “rpart” function.
In this tutorial, we describe these implementations of the CART approach with reference to the original book (Breiman et al., 1984; chapters 3, 10 and 11). The main difference between them is the implementation of the post-pruning process. Tanagra uses a specific sample called the "pruning set" (section 11.4), whereas rpart relies on the cross-validation principle (section 11.5) 2.
2 Dataset
We use the WAVEFORM dataset (section 2.6.2). The target attribute (CLASS) has 3 values. There are 21 continuous predictors (V1 to V21). We want to reproduce the experiment described in Breiman's book (pages 49 and 50). We have 300 instances in the learning sample and 5000 instances in the test sample. Thus, the data file wave5300.xls 3 includes 5300 rows and 23 columns. The last column is a variable which specifies whether each instance belongs to the train sample or to the test sample (Figure 1).
1 http://cran.r-project.org/web/packages/rpart/index.html
2 We can use a pruning sample with “rpart” but it is not really easy to use. See http://www.math.univ-toulouse.fr/~besse/pub/TP/appr_se/tp7_cancer_tree_R.pdf, section 2.1.
3 http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/wave5300.xls
After launching Tanagra, we want to create a diagram and import the data file. To do so, we activate the FILE / NEW menu. We choose the XLS file format and select wave5300.xls 4.
The diagram is created. In the first node, Tanagra shows that the file contains 5300 instances and 23 variables.
4 There are various ways to import an XLS data file. We can use the add-in for Excel (http://data-mining-tutorials.blogspot.com/2010/08/sipina-add-in-for-excel.html, http://data-mining-tutorials.blogspot.com/2010/08/tanagra-add-in-for-office-2007-and.html) or, as we do in this tutorial, import the dataset directly (http://data-mining-tutorials.blogspot.com/2008/10/excel-file-format-direct-importation.html). In the latter case, the dataset must not be open in the spreadsheet application, the values must be in the first sheet, and the first row must contain the names of the variables. Direct importation is faster than using the “tanagra.xla” add-in but, on the other hand, Tanagra can only handle the XLS format here (up to Excel 2003).
We click on the VIEW menu: 300 instances (out of 5300) are now used for the construction of the decision tree.
Before the learning process itself, we must specify the variable types. We add the DEFINE STATUS component using the shortcut in the toolbar. We set CLASS as TARGET and the continuous variables (V1 to V21) as INPUT.
We insert the C-RT component, which implements the CART approach. Let us describe some of the method settings (SUPERVISED PARAMETERS menu).
MIN SIZE OF NODE TO SPLIT is the minimum number of instances required to attempt the splitting of a node. For our dataset, we do not perform a split if there are fewer than 10 instances.
PRUNING SET SIZE is the proportion of the learning set used for the post-pruning process. The default value is 33%, i.e. 67% of the learning set is used for the growing phase (67% of 300 = 201 instances) and 33% for the pruning phase (33% of 300 = 99 instances).
SE RULE controls the selection of the right tree in the post-pruning process. The default value is θ = 1, i.e. we select the smallest tree whose error rate is lower than the "error rate + 1 standard error (of the error rate)" of the optimal tree. If θ = 0, we select the optimal tree, the one which minimizes the pruning error rate (the error rate computed on the pruning sample).
Last, SHOW ALL TREE SEQUENCE, when it is checked, shows all the sequences of trees analyzed during the post-pruning process. It is not necessary to activate this setting here.
We validate our settings by clicking on the OK button. Then, we click on the contextual VIEW menu.
We first obtain the resubstitution confusion matrix, computed on the learning set (300 instances: the growing and pruning samples are merged). The error rate is 19.67%.
Below, we have the sequence of trees table, with the number of leaves and the error rates calculated on the growing set and on the pruning set.
The error rate computed on the growing set decreases as the number of leaves increases. We cannot use this information to select the right model.
We use the pruning sample to select the "best" model. The tree which minimizes the error rate on the pruning set is highlighted in green. It has 14 leaves and its error rate is 28.28%.
With the θ = 1 SE-rule, we select the smallest tree (with the fewest leaves) for which the pruning error rate is lower than
ε_threshold = 0.2828 + 1 × √[ 0.2828 × (1 − 0.2828) / 99 ] = 0.3281
This is tree #5, with 9 leaves. Its pruning error rate is 32.32% (highlighted in red) and its growing error rate is 13.43%.
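As a quick check, the threshold above can be recomputed by hand. Here is a minimal R sketch using only the values reported by Tanagra (best pruning error 0.2828, pruning sample of 99 instances):

# 1-SE threshold computed from the Tanagra output
err.min <- 0.2828
theta   <- 1
seuil   <- err.min + theta * sqrt(err.min * (1 - err.min) / 99)
print(seuil)  # about 0.3281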
The chart depicting the evolution of the error rate computed on the growing and pruning samples is available in the CHART tab of the visualization window. We can visually detect the interesting trees.
We note that the trees with 7 or 6 leaves are equivalent in terms of prediction performance. How to modify the setting θ to obtain one of these trees is described in another tutorial 5.
In the lower part of the report (HTML tab), we have a description of the selected tree.
To obtain a good estimation of the generalization error rate, we use a separate sample, the test set, which was not used during the learning phase. On our dataset, the test set is the rest of the data file, i.e. the 5000 instances set aside in the previous part of the diagram (DISCRETE SELECT EXAMPLES node). Tanagra automatically generates a new column which contains the prediction of the model. The trick is that the prediction is computed both on the selected instances (the learning sample) and on the unselected ones (the test sample).
5 http://data-mining-tutorials.blogspot.com/2010/01/cart-determining-right-size-of-tree.html
We add the DEFINE STATUS component to the diagram. We set CLASS as TARGET and PRED_SPVINSTANCE_1, the new column generated by the supervised learning algorithm, as INPUT.
We add the TEST component (SPV LEARNING ASSESSMENT tab). We click on the VIEW menu. We observe that, by default, the confusion matrix is computed on the unselected instances, i.e. the test sample (Note: the confusion matrix can also be computed on the selected instances, i.e. the learning sample; in this case, we must obtain the same confusion matrix as the one shown in the model visualization window).
We click on the VIEW menu. The test error rate is 28.44%. When we use the tree to predict the class of an unseen instance, the probability of misclassification is 28.44%.
To import the XLS file format, the easiest way is to use the xlsReadWrite package.
Then, we can import the dataset using the read.xls(.) command. We obtain a short description of the data frame (the structure which holds the dataset in memory) with the summary(.) command 6.
6 The full results of the summary(.) command are not shown in our screenshots; they would take too much space. It is nevertheless essential to check the data integrity at every stage.
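A minimal sketch of this import step is shown below. It assumes the file sits in the current working directory and that we call the resulting data frame wave; the column names depend on the header row of the XLS file.

# Import the dataset from the first sheet of wave5300.xls
library(xlsReadWrite)
wave <- read.xls("wave5300.xls")
# Quick integrity check: 5300 rows and 23 columns expected
print(dim(wave))
summary(wave)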
4.2 Partitioning the data file into train and test samples
We use the SAMPLE column to partition the dataset (“learning” and “test”). The SAMPLE column is
then removed.
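A possible sketch of this partitioning step, assuming the SAMPLE column was imported under the name sample with the values "learning" and "test" (adjust the name and the values to the actual content of the file):

# Split according to the SAMPLE column, then drop it (the 23rd column)
wave.train <- wave[wave$sample == "learning", 1:22]
wave.test  <- wave[wave$sample == "test", 1:22]
print(nrow(wave.train))  # expected: 300
print(nrow(wave.test))   # expected: 5000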
First, we must grow the maximal tree on the learning set, i.e. the tree with the maximum number of leaves. It will be pruned in a second phase.
We use the following settings 7: MINSPLIT = 10 means that the process does not split a node with fewer than 10 instances; MINBUCKET = 1 means that the process does not validate a split where one of the leaves contains fewer than 1 instance. The maximal tree obtained should be similar to the one of Tanagra. One of the main differences is that rpart uses all the instances of the learning set (Tanagra uses only the growing sample, a part of the learning set).
7 About the other settings: METHOD = "CLASS" indicates that we are dealing with a classification problem (instead of a regression); SPLIT = GINI indicates the measure used for the selection of the splitting variables.
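The growing phase may look like the sketch below (the target column is assumed to be named class, as in the partition above; the cp parameter keeps its default value of 0.01, which is discussed next).

library(rpart)
# Grow the tree on the 300 learning instances
arbre <- rpart(class ~ ., data = wave.train, method = "class",
               parms = list(split = "gini"),
               control = rpart.control(minsplit = 10, minbucket = 1))
print(arbre)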
We obtain a tree with 18 leaves. This is fewer than with Tanagra. One reason is the CP parameter, whose default value (cp = 0.01) introduces a pre-pruning during the growing phase. We will see below that the same CP parameter is used for the post-pruning procedure.
One of the main drawbacks of rpart is that we must perform the post-pruning procedure ourselves 8. For that, we use the results shown in the CPTABLE obtained with the printcp(.) command 9 (Figure 4).
8 We will see below that, from another point of view, this can also be an advantage.
9 Because rpart partitions the dataset randomly during the cross-validation process, we may not obtain exactly the same results. To obtain the same results at each session, we can set the seed of the random number generator with the set.seed(.) command.
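In code, this step boils down to the call below (set.seed(.) must be issued before the rpart(.) call if reproducible cross-validation errors are wanted).

# Sequence of subtrees: CP, number of splits, relative error,
# cross-validation error (xerror) and its standard error (xstd)
printcp(arbre)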
It describes the sequence of trees by associating the complexity parameter (CP) with: the number of splits (equal to the number of leaves − 1, since the tree is binary), the error calculated on the training sample (normalized so that the error on the root is equal to 1), the error calculated by cross-validation (also normalized), and the standard error of this error rate.
This table is very similar to the one generated by Tanagra (Figure 2). But instead of using a separate part of the learning set (the pruning sample), rpart is based on the cross-validation principle.
On the one hand, this is advantageous when we handle a small learning sample, since we do not need to partition it. On the other hand, the computation time increases dramatically when we handle a large dataset.
Selection of the final tree. We want to use the values provided by the CPTABLE to select the "optimal" tree. Tree #5, with 7 splits (8 leaves), is the one which minimizes the cross-validation error rate (ε = 0.50000). To obtain this tree, we must specify a CP value in the interval 0.023684 < cp ≤ 0.031579.
But this tree is not the best one. Breiman claims that it is more interesting to introduce a preference for simplicity by selecting the tree according to the 1-SE rule (Breiman et al., 1984; section 3.4.3, pages 78 to 81). The threshold error rate is computed from the values provided by the CPTABLE as the minimum of the xerror column plus the corresponding xstd, as sketched below.
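A minimal R sketch of this computation, working directly on the cptable element of the fitted tree:

# 1-SE threshold: xerror of the best subtree plus one standard error
cpt   <- arbre$cptable
imin  <- which.min(cpt[, "xerror"])
seuil <- cpt[imin, "xerror"] + cpt[imin, "xstd"]
print(seuil)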
The smallest tree whose cross-validation error rate is lower than this threshold is tree #4, with 4 splits (5 leaves). To obtain this tree, we set CP in the range 0.031579 < cp ≤ 0.055263, say cp = 0.04 for instance, and we use the prune(.) command.
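A sketch of this pruning step:

# Prune the maximal tree with the cp value chosen above
arbre.prune <- prune(arbre, cp = 0.04)
print(arbre.prune)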
We create the prediction column using the predict(.) command. Of course, the prediction is
computed on the "test" sample. Then, we compute the confusion matrix and the error rate.
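Under the same naming assumptions as before, this evaluation could be written as:

# Predicted classes on the 5000 test instances
pred <- predict(arbre.prune, newdata = wave.test, type = "class")
# Confusion matrix and test error rate
mc  <- table(wave.test$class, pred)
err <- 1 - sum(diag(mc)) / sum(mc)
print(mc)
print(err)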
In this case, the test error rate is 34.48%. This is significantly worse than the result obtained with Tanagra. Perhaps we have pruned the tree too much.
We can try another solution by specifying another value of CP. This is a very attractive feature of rpart. For instance, if we set cp = 0.025, we obtain a tree with 8 leaves 10.
10 Not shown in this tutorial, the plotcp(.) command generates a chart of the cross-validation error rate according to the number of splits.
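Again under the same assumptions, this simply means re-pruning the maximal tree with the new cp value:

# Less aggressive pruning: keeps the 8-leaf tree mentioned in the text
arbre.prune2 <- prune(arbre, cp = 0.025)
print(arbre.prune2)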
rpart provides commands to generate a graphical representation of the decision tree.
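For instance, with the base plotting functions of the package (a minimal sketch, here applied to the last pruned tree; the layout options are a matter of taste):

# Plot the pruned tree with the class counts at the leaves
plot(arbre.prune2, uniform = TRUE, margin = 0.1)
text(arbre.prune2, use.n = TRUE, cex = 0.8)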
We obtain:
5 Conclusion
CART has undeniable qualities compared with other techniques for inducing classification trees. It is able to produce "turnkey" models, with performances at least as good as those of the others.
But, in addition, it can provide various candidate solutions. Tools such as those shown in Figure 3 or Figure 4 allow the user to select the best model according to their problem and the characteristics of the dataset. This is an essential advantage in the practice of Data Mining.
6 References
• L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Chapman & Hall, 1984.
• D. Zighed, R. Rakotomalala, Graphes d’Induction : Apprentissage et Data Mining, Hermès, 2000; chapter 5, “CART”.
• B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996; chapter 7, “Tree-Structured Classifiers”.
• W. Venables, B. Ripley, Modern Applied Statistics with S, Springer, 2002; chapter 9, “Tree-based Methods”.
• About the RPART package: http://cran.r-project.org/web/packages/rpart/index.html