Subject: Importing The Dataset
Subject
Building a decision tree with TANAGRA, ORANGE and WEKA. Error rate estimation using
cross-validation.
When we build a decision tree from a dataset, we must follow these steps (not
necessarily in this order):
• Import the dataset into the software;
• Select the class attribute (TARGET) and the descriptors (INPUT);
• Choose the induction algorithm; depending on the implementation, we may obtain slightly
different results;
• Run the learning process and view the decision tree;
• Use cross-validation in order to obtain an honest error rate estimate.
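The last step in the list above can be illustrated outside of any of the three tools. The following is a minimal sketch of k-fold cross-validation in plain Python, not the procedure implemented by TANAGRA, ORANGE or WEKA. The learner is a hypothetical stand-in (a one-level decision tree, or "stump"), and the function names, the toy data and the class labels are all assumptions made for illustration.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Stand-in learner: one split (feature, threshold) with fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:
        # No usable split: fall back to a constant prediction.
        m = majority(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(model, row):
    j, t, l_lab, r_lab = model
    return l_lab if row[j] <= t else r_lab

def cross_val_error(X, y, k=10, seed=0):
    """Each example is used for testing exactly once, on a model that never saw it."""
    n = len(X)
    errors = 0
    for test_idx in k_fold_indices(n, k, seed):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        model = fit_stump(train_X, train_y)
        errors += sum(predict_stump(model, X[i]) != y[i] for i in test_idx)
    return errors / n

# Toy data (hypothetical values, not taken from HEART.TXT).
toy_X = [[29], [34], [41], [45], [48], [52], [55], [58], [60], [63], [67], [70]]
toy_y = ["absence"] * 5 + ["presence", "absence"] + ["presence"] * 5
print("cross-validation error rate:", cross_val_error(toy_X, toy_y, k=4))
```

The point of the sketch is the structure of the loop: the model is refitted for every fold on the remaining examples, and the errors are counted only on the held-out fold, which is what makes the estimate honest.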
Dataset
We use the HEART.TXT dataset (UCI Irvine); some attributes have been removed. There are
270 examples in the dataset.
(b) … and embed the learning algorithm, C-RT, from the SPV learning tab.
1 TANAGRA uses a textual representation; if you want a graphical representation, you can try the SIPINA
software from the same author (http://eric.univ-lyon2.fr/~ricco/sipina.html).
We can add the classification tree component (CLASSIFY tab) to our diagram. We connect
this component to the dataset component.
Cross-validation
The TEST LEARNERS component (EVALUATE tab) computes the cross-validation
error rate estimate. We connect the classification tree to this new component.
This component becomes operational once we have specified the data source and the
training method -- it is possible to connect several learning methods simultaneously, which
The classification accuracy is 75.56%; the error rate is 24.44%. Other statistics are available.
We can also interactively choose another resampling method.
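The two percentages above are complementary, and they can be tied back to the 270 examples of the dataset. The short check below is a sketch that assumes every example is tested exactly once during the cross-validation, so that the counts of correct and incorrect predictions are whole numbers; the variable names are illustrative.

```python
n_examples = 270   # size of the dataset, as stated in the Dataset section
accuracy = 0.7556  # classification accuracy reported above

# Assumption: each example is tested exactly once, so counts are integers.
n_correct = round(accuracy * n_examples)
n_errors = n_examples - n_correct
error_rate = n_errors / n_examples

print(n_correct, n_errors)          # 204 66
print(f"{100 * error_rate:.2f}%")   # 24.44%
```

Under that assumption, the reported figures correspond to 66 misclassified examples out of 270.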
Learning process
By default, the target attribute is the last column; the others are the input attributes. Our
dataset already has this configuration. On the other hand, we must explicitly select the
learning set in the WEKA diagram. We add the TRAINING SET MAKER (EVALUATION
tab) to the diagram; all examples are used for the construction of the decision tree. We choose
the DATASET connection when we connect the LOADER component to this component.
In order to start the execution, we select the first node of the diagram and click on the START
LOADING contextual menu.
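The "last column is the target" convention described in this section can be made concrete with a few lines of Python. This is only an illustration of the convention, not WEKA's loader: the function name, the tab delimiter, the column names and the sample values are all hypothetical and do not reproduce the actual content of HEART.TXT.

```python
import csv
import io

# Hypothetical tab-delimited sample; names and values are placeholders,
# not the actual columns of HEART.TXT.
SAMPLE = (
    "age\tchest_pain\tdisease\n"
    "63\ttypical\tpresence\n"
    "41\tatypical\tabsence\n"
)

def split_last_column_as_target(text, delimiter="\t"):
    """Return (input_names, target_name, X, y), using the last column as target."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    input_names, target_name = header[:-1], header[-1]
    X = [row[:-1] for row in data]   # input attributes (kept as strings)
    y = [row[-1] for row in data]    # class attribute
    return input_names, target_name, X, y

input_names, target_name, X, y = split_last_column_as_target(SAMPLE)
print("inputs:", input_names)
print("target:", target_name)
```

When the class attribute is not in the last position, the same split would have to be made by name rather than by position.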
Cross-validation
WEKA offers sophisticated error rate estimation, but we need to create a new
sequence of components to use it.
Once again, we select the START LOADING menu of the CSV LOADER component in order to start
the execution. The SHOW RESULTS menu of TEXT VIEWER displays the following results.
Let us note a very useful characteristic of WEKA: it is possible to visualize the 10 decision
trees computed during the cross-validation process. To do so, we connect a TEXT VIEWER
component to the output of the J48 component; we can then see the possible
differences between the trees and judge the stability of the computations.
Conclusion
Cross-validation is a very popular method of error rate estimation, especially when we have
few examples in our dataset. We see in this tutorial that ORANGE, TANAGRA and WEKA
can easily handle this process.