Lab (I)
The Goals
The aim of this exercise is to give you an overview of how to use Weka through the Explorer interface. To this end, you are going to work with the data set breast_cancer.arff, downloadable from the course web page, which contains 286 cancer patient records. You are asked to experiment with building several models that describe when recurrence-events may occur, to assess the performance of the models, and to compare them.
This is the first step in the KDD process and it was discussed during lecture 2. You can find below a suggestion of some points to look at that may help you to better understand the data.

1. For each attribute find the following information.
   (a) The attribute type.
   (b) The percentage of missing values in the data.
   (c) Are there any records that have a value for the attribute that no other record has (i.e. unique values)?
   (d) Study the histogram of the attribute and note how it seems to influence the risk for recurrence-events.
2. Observe whether the data set has an imbalanced class distribution.
3. Switch to the Visualize tab on the upper part of the screen in Weka to visualize 2D scatter plots for each pair of attributes.
   (a) Investigate possible multivariate associations of attributes with the class attribute, i.e. study scatter plots of two attributes X and Y and try to identify possible high (low) recurrence-events areas (if any). For instance, choose X = inv-nodes and Y = breast.
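Weka's Preprocess tab reports these statistics for you. To make the definitions concrete, here is a plain-Python sketch of how a missing-value percentage and the class distribution are computed; the toy records and the helper missing_pct are illustrative only, not part of Weka or of breast_cancer.arff.

```python
from collections import Counter

# Toy records mimicking a few rows of the data set; "?" marks a missing
# value, as it does in ARFF files. Attribute names follow the real data
# set, but the values here are made up for illustration.
records = [
    {"age": "40-49", "node-caps": "?",   "class": "no-recurrence-events"},
    {"age": "50-59", "node-caps": "no",  "class": "recurrence-events"},
    {"age": "40-49", "node-caps": "yes", "class": "no-recurrence-events"},
    {"age": "60-69", "node-caps": "no",  "class": "no-recurrence-events"},
]

def missing_pct(records, attr):
    """Percentage of records with a missing ('?') value for attr."""
    missing = sum(1 for r in records if r[attr] == "?")
    return 100.0 * missing / len(records)

# Class distribution: a large gap between the counts signals imbalance.
class_dist = Counter(r["class"] for r in records)
```

On these four toy records, missing_pct(records, "node-caps") is 25.0 and the class counts are 3 versus 1, i.e. an imbalanced distribution in miniature.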
The second step is to preprocess the data such that the transformed data is in a more suitable form for the mining algorithms. This aspect was discussed in lecture 2 as well. We are going to concentrate our attention on feature reduction by selecting promising subsets of attributes for the classification tasks.
3.1
Attribute Selection
1. Use an attribute ranking evaluator for the following tasks.
   (a) To rank the attributes by the InfoGainAttributeEval measure. Which attributes seem to have the best classification power?
   (b) To rank the attributes by the GainRatioAttributeEval measure. Which attributes seem to have the best classification power?
   (c) Compare the results.
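InfoGainAttributeEval and GainRatioAttributeEval score each attribute by information gain and gain ratio, respectively. The following minimal Python sketch of both measures (function names are mine) may help you interpret the rankings; it works on (attribute value, class label) pairs:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(pairs):
    """Information gain of an attribute: class entropy minus the
    expected class entropy after splitting on the attribute.
    pairs: list of (attribute_value, class_label)."""
    labels = [c for _, c in pairs]
    n = len(pairs)
    by_value = {}
    for v, c in pairs:
        by_value.setdefault(v, []).append(c)
    remainder = sum(len(sub) / n * entropy(sub) for sub in by_value.values())
    return entropy(labels) - remainder

def gain_ratio(pairs):
    """Gain ratio: information gain normalised by the entropy of the
    attribute itself, which penalises many-valued attributes."""
    split_info = entropy([v for v, _ in pairs])
    ig = info_gain(pairs)
    return ig / split_info if split_info > 0 else 0.0
```

For a perfectly predictive two-valued attribute, e.g. pairs [("a", "+"), ("a", "+"), ("b", "-"), ("b", "-")], both measures equal 1.0; comparing the two rankings in Weka shows where the normalisation changes the order.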
2. Use an attribute subset evaluator for the following tasks.
   (a) To build a subset of attributes with CfsSubsetEval. Experiment with the GreedyStepwise and ExhaustiveSearch search strategies. What can you conclude?
   (b) To build a subset of attributes with WrapperSubsetEval. Experiment with the J48 classifier, with a minimum of 15 records per leaf node, and BestFirst search.
   (c) To build a subset of attributes with WrapperSubsetEval. Experiment with the JRip classifier and GreedyStepwise search.
   (d) Explain how WrapperSubsetEval works.
   (e) Compare the results and draw conclusions.
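The idea behind the wrapper approach is that a candidate subset is scored by cross-validating the chosen classifier (e.g. J48 or JRip) on just those attributes, and a search strategy explores the space of subsets. A sketch of forward greedy selection, assuming a stand-in score function in place of the cross-validated accuracy (all names here are mine, not Weka's API):

```python
def greedy_stepwise(attributes, score):
    """Forward greedy search: repeatedly add the attribute that most
    improves score(subset); stop when no addition helps. `score` stands
    in for the cross-validated accuracy of the wrapped classifier on
    that subset, which is what WrapperSubsetEval estimates."""
    selected = []
    best = score(())
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            return selected
        # Score every one-attribute extension of the current subset.
        gains = [(score(tuple(selected + [a])), a) for a in candidates]
        top_score, top_attr = max(gains)
        if top_score <= best:      # no extension improves the estimate
            return selected
        best = top_score
        selected.append(top_attr)
```

Because every candidate subset triggers a full cross-validation of the classifier, the wrapper is much slower than a filter such as CfsSubsetEval, but it is tailored to the classifier that will ultimately be used.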
3.2
Saving the Data Set
You may save the data set to a comma-separated (text) file. Experiment with saving the data set to a file called breast_cancer.csv. This may be useful if you want to apply extra pre-processing techniques not available in Weka, or even load the data into Excel.
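Weka's save dialog does the conversion for you. If you ever need to do it outside Weka, the transformation is simple for a plain data set like this one, since ARFF data lines are already comma-separated; the sketch below handles only simple nominal/numeric ARFF files without quoted attribute names (the function is mine, not a Weka utility):

```python
def arff_to_csv(arff_text):
    """Convert the text of a simple ARFF file to CSV: attribute names
    become the header row and the @data lines are kept as-is. Assumes
    unquoted attribute names and no sparse-format data."""
    header, rows, in_data = [], [], False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            header.append(line.split()[1])     # second token: the name
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line)
    return "\n".join([",".join(header)] + rows)
```

The header row carries the attribute names, so the file loads directly into Excel or any CSV-aware tool.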
We proceed now by building models that will help us to describe the class recurrence-events. Assume that this class is the positive one. Use all attributes to build the following models. Build a model using the OneR classifier and interpret the patterns.
1. Use the training set for estimating classifier performance.
   (a) Note the accuracy, TPR, and F-measure for both classes.
   (b) Interpret the confusion matrix.
2. (a) Use now 10-fold cross-validation for estimating classifier performance.
      i. Note the accuracy, TPR, and F-measure for both classes.
      ii. Compare the results with the ones previously obtained.
   (b) Is the classifier biased toward any of the classes? Which one and why?
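All three figures follow directly from the confusion matrix. As a reference for reading Weka's output, here is how they are computed for the positive class from the four counts; the example counts below are made up for illustration, not results from breast_cancer.arff.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, TPR (recall), and F-measure for the positive class,
    computed from confusion-matrix counts:
        tp: positives predicted positive    fn: positives predicted negative
        fp: negatives predicted positive    tn: negatives predicted negative
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    tpr = tp / (tp + fn)                      # recall of the positive class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return accuracy, tpr, f_measure

# Illustrative counts only (they sum to 286 like the data set, but are
# not actual classifier output):
acc, tpr, f1 = metrics(tp=30, fn=55, fp=20, tn=181)
```

Note how a classifier can reach a respectable accuracy while its TPR on the minority positive class stays low; this is exactly the bias question the exercise asks about.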
4.1
Decision Trees
Use the J48 classifier, i.e. the Weka version of the decision tree classifier C4.5.
1. Estimate the performance of the classifier by using 10-fold cross-validation.
2. Visualize the tree and describe the patterns. How do you interpret the numbers associated with the tree leaves?
3. Is the classifier biased toward any of the classes?
4. Investigate the use of different J48 parameters, such as pruning and the minimum number of records in the leaves.
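As a hint for question 2: a J48 leaf is annotated with a pair such as (25.0/4.0), where the first number is how many training instances reach the leaf and the second how many of those the leaf's class label misclassifies. A small parser (mine, for illustration) that turns the annotation into counts and a per-leaf accuracy:

```python
def leaf_numbers(annotation):
    """Parse a J48 leaf annotation such as '(25.0/4.0)' or '(10.0)'.
    Returns (covered, errors, leaf_accuracy). A missing second number
    means the leaf makes no errors on the training data."""
    inner = annotation.strip("()")
    if "/" in inner:
        covered, errors = (float(x) for x in inner.split("/"))
    else:
        covered, errors = float(inner), 0.0
    return covered, errors, (covered - errors) / covered
```

So (25.0/4.0) describes a leaf covering 25 instances with 4 errors, i.e. 84% purity on the training data; the same reading lets you compare how pruning and the minimum-records parameter change the leaves.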
4.2
Rule-based Classifiers
Use the JRip classifier, i.e. the Weka version of the RIPPER algorithm.
1. Estimate the performance of the classifier by using 10-fold cross-validation.
2. Is the classifier biased toward any of the classes?
3. Describe the patterns. How do you interpret the numbers associated with each rule?
4.3
Association Rules
Use association rule mining (ARM), via the Apriori algorithm, to build high-confidence rules predicting the positive class, i.e. recurrence-events.
1. Describe the patterns. How do you interpret the numbers associated with each rule?
2. Which useful hints for characterizing the positive class does this model give?
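The numbers Apriori attaches to a rule derive from two simple counts: how often the whole rule occurs, and how often its left-hand side occurs. A minimal sketch of support and confidence over itemset-style transactions (the helper and the toy items below are mine, for illustration):

```python
def support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent over
    a list of transactions, each a set of attribute=value items.
    support    = P(antecedent and consequent)
    confidence = P(consequent | antecedent)"""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence
```

A rule with high confidence but tiny support covers too few records to characterize recurrence-events reliably, so both numbers matter when you read the model's output.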