Didacticiel - Etudes de cas R.R.

Subject
Building a decision tree with TANAGRA, ORANGE and WEKA. Estimating the error rate with
cross-validation.

When we build a decision tree from a dataset, we must follow these steps (not
necessarily in this order):
• Import the dataset into the software;
• Select the class attribute (TARGET) and the descriptors (INPUT);
• Choose the induction algorithm; depending on the implementation, we can obtain slightly
different results;
• Run the learning process and view the decision tree;
• Use cross-validation to obtain an honest error rate estimate.

Dataset
We use the HEART.TXT dataset (UCI IRVINE), from which some attributes have been removed;
it contains 270 examples.

Building a decision tree with TANAGRA

Importing the dataset


We create a new diagram and import the dataset with the FILE/NEW menu.

Defining the role of attributes


We add the DEFINE STATUS component to the diagram (using the button in the toolbar).
We set COEUR as TARGET and the other attributes as INPUT.

23/02/2006 Page 1 sur 13


Selecting the learning algorithm


We want to use the Classification and Regression Tree (Breiman et al.) algorithm. There are two
steps when we want to define a supervised learning process in TANAGRA: (a) we insert a
meta-supervised learning component from the META SPV LEARNING tab…

(b) … and embed the learning algorithm, C-RT, from the SPV learning tab.


Displaying the results


To view the decision tree, we click on the VIEW menu of the last component. We see
the tree (1): the resubstitution error rate is 19.63%; the tree has 4 leaves (4 rules).

(1) TANAGRA uses a textual representation. If you want a graphical representation, you can try the SIPINA
software from the same author (http://eric.univ-lyon2.fr/~ricco/sipina.html).

Cross-validation
We want to compute the error rate with the cross-validation resampling method. We add the
CROSS-VALIDATION component from the SPV LEARNING ASSESSMENT tab. We set the
number of folds to 10 and the number of repetitions to 1.

The estimated error rate is 24.81%.
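
The resampling scheme itself can be sketched in a few lines of Python. This is only an illustration of 10-fold cross-validation under stated assumptions, not TANAGRA's implementation: the learner is a trivial majority-class stand-in rather than C-RT, the labels are synthetic, and pooling the errors over all folds is one common convention that the tutorial does not confirm. The function names are hypothetical, and the printed figure is not comparable to the 24.81% above.

```python
import random
from collections import Counter

def make_folds(n_examples, n_folds=10, seed=0):
    """Shuffle the example indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validation_error(rows, labels, fit, predict, n_folds=10, seed=0):
    """Train on n_folds - 1 folds, test on the remaining one, and pool the errors."""
    errors = 0
    for test_idx in make_folds(len(rows), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        errors += sum(predict(model, rows[i]) != labels[i] for i in test_idx)
    return errors / len(rows)  # each example is tested exactly once

# Stand-in learner (NOT C-RT): always predicts the most frequent training class.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

# Synthetic data: 270 examples with made-up labels and a dummy input column.
labels = ["absence"] * 150 + ["presence"] * 120
rows = [[i] for i in range(len(labels))]
cv_error = cross_validation_error(rows, labels, fit_majority, predict_majority)
print(f"10-fold cross-validation error: {cv_error:.2%}")
```

Because every example is held out exactly once, the model is never evaluated on data it was trained on, which is what makes the estimate more honest than the resubstitution error rate.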


Building a decision tree with ORANGE


When we execute ORANGE, we have the following interface.
[Figure: the ORANGE interface, with its components labeled: the tool palettes (data mining tools) and the workspace.]

Importing the dataset


ORANGE can read text files (tab-separated). When we select the tool, a new
component is inserted in the diagram. We select the file with the OPEN contextual menu.

Learning process
By default, the target attribute is the last column; the others are the input attributes. Our
dataset already follows this layout.
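
The "last column is the target" convention can be sketched in Python as follows. This only illustrates the convention, not how ORANGE parses files internally. The sample rows, the class values, and the column names `age` and `chest_pain` are hypothetical; only the target name COEUR comes from the tutorial.

```python
import csv
import io

# Hypothetical tab-separated sample; the real HEART.TXT has more columns and 270 rows.
SAMPLE = (
    "age\tchest_pain\tcoeur\n"
    "70\tasympt\tpresence\n"
    "67\tnon_anginal\tabsence\n"
    "57\tatyp_angina\tpresence\n"
)

def split_inputs_and_target(text):
    """Read tab-separated text and treat the last column as the target."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)                  # first line holds the attribute names
    rows = [row for row in reader if row]  # skip empty lines
    inputs = [row[:-1] for row in rows]    # every column except the last
    target = [row[-1] for row in rows]     # last column only
    return header[:-1], header[-1], inputs, target

input_names, target_name, inputs, target = split_inputs_and_target(SAMPLE)
print(input_names, target_name, len(inputs))
```

If the class attribute were not in the last position, the columns would have to be reordered, or the role of each attribute set explicitly, before learning.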

We add the classification tree component (CLASSIFY tab) to our diagram and connect
it to the dataset component.

Decision tree visualization


We can display the tree in a text viewer, which is recommended when the tree has many
nodes. There is also a graphical viewer that is more pleasant to read (CLASSIFICATION TREE
VIEWER 2D – CLASSIFY tab). We connect the CLASSIFICATION TREE component to this
viewer and click on the OPEN menu to display the tree. There are 10 rules (leaves)
in our tree.


Cross-validation
The TEST LEARNERS component (EVALUATE tab) computes the cross-validation
error rate estimate. We connect the classification tree to this new component.

This component becomes operational once we have specified the data source and the
training method. Several learning methods can be connected simultaneously, which
makes it very easy to compare their performances. We thus make
the right connections, and then we display the results using the OPEN menu.

The classification accuracy is 75.56%; the error rate is 24.44%. Other statistics are available.
We can also interactively choose another resampling method.

Building a decision tree with WEKA


A dialog box appears when we launch WEKA; we choose the KNOWLEDGE FLOW
paradigm. We used version 3.5.1.

Importing the dataset


The CSV LOADER component reads text files. We select the HEART.TXT dataset with
the CONFIGURE contextual menu.


Learning process
By default, the target attribute is the last column; the others are the input attributes. Our
dataset already follows this layout. However, we must explicitly select the
learning set in the WEKA diagram. We add the TRAINING SET MAKER (EVALUATION
tab) to the diagram; all examples are used to build the decision tree. We choose
the DATASET connection when we connect the LOADER component to the TRAINING SET MAKER.

We add the J48 component (a decision tree algorithm in the C4.5 family, CLASSIFIERS tab). We
connect the TRAINING SET MAKER to J48 (TRAINING SET connection).

Decision tree visualization

WEKA offers two representations: a textual one, suggested when the tree has a lot of
nodes, and a graphical one that is more pleasant to read. We select the latter
(GRAPH VIEWER – VISUALIZATION tab) and use the GRAPH connection.

In order to start the execution, we select the first node of the diagram and click on the START
LOADING contextual menu.

When the computation is finished, we select the last component (GRAPH VIEWER) and
click on the SHOW GRAPH menu.

The decision tree has 18 leaves.

Cross-validation
WEKA offers sophisticated error rate estimation, but we need to create a new
sequence of components to use it.


We need the following components:


• CROSS VALIDATION FOLD MAKER (EVALUATION), which builds the folds (DATASET
connection).
• Decision tree J48 component (CLASSIFIERS); be careful, we have to set the same parameters
as in the previous J48 component. We must connect the CROSS VALIDATION
FOLD MAKER to this component twice, once for the training set and once for the test set.
• CLASSIFIER PERFORMANCE EVALUATOR (EVALUATION), which computes the error rate
in each fold. We use the BATCH CLASSIFIER output of J48.
• Last, the TEXT VIEWER component, which displays the results.

Once again, we select the START LOADING menu of the CSV LOADER component to start
the execution. The SHOW RESULTS menu of TEXT VIEWER displays the following results.


The computed error rate is 26.67%. Other statistics are available.

Let us note a very useful feature of WEKA: it is possible to visualize the 10 decision
trees computed during the cross-validation process. To do so, we connect
a TEXT VIEWER component to the output of the J48 component. We can then see the possible
differences between the trees and judge the stability of the computations.

Conclusion
Cross-validation is a very popular method of error rate estimation, especially when we have
few examples in our dataset. We see in this tutorial that ORANGE, TANAGRA and WEKA
can all handle this process easily.
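
To put the three estimates side by side, the short Python sketch below converts each reported percentage back into a number of misclassified examples. It assumes that each of the 270 examples is tested exactly once and that errors are pooled over the folds; the tutorial does not state how each tool aggregates its results, so the counts are inferred, not reported. The function name `implied_errors` is hypothetical.

```python
N_EXAMPLES = 270  # size of the HEART dataset, as stated in the tutorial

# Cross-validation error rates reported in the tutorial, in percent.
reported = {"TANAGRA": 24.81, "ORANGE": 24.44, "WEKA": 26.67}

def implied_errors(rate_percent, n_examples=N_EXAMPLES):
    """Number of misclassified examples that best matches a reported rate."""
    return round(rate_percent / 100 * n_examples)

for tool, rate in reported.items():
    count = implied_errors(rate)
    recomputed = 100 * count / N_EXAMPLES
    print(f"{tool}: about {count} errors out of {N_EXAMPLES} ({recomputed:.2f}%)")
```

Under that assumption, the three tools differ by only a handful of misclassified examples, which is consistent with the remark that implementations of the induction algorithm give slightly different results.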
