Lab (I)

The document provides instructions for using the Weka data mining tool to analyze a breast cancer dataset and build predictive models. It outlines 4 main steps: 1) exploring the data, 2) preprocessing the data through attribute selection, 3) saving the preprocessed data, and 4) mining the data to build models using classifiers like OneR, J48 decision trees, JRip rule-based classifiers, and association rule mining and evaluating their performance. The goal is to predict cancer recurrence and identify patterns in the data.

Uploaded by

anand_sesham
Copyright
© Attribution Non-Commercial (BY-NC)

TNM033: Data Mining Introducing Weka Part I

The Goals

The aim of this exercise is to give you an overview of how to use Weka through the Explorer interface. To this end, you are going to work with the data set breast_cancer.arff, downloadable from the course web page, which contains 286 cancer patient records. You are asked to experiment with building several models that describe when recurrence-events may occur, assess the performance of the models, and compare them.

Exploring the Data

This is the first step in the KDD process; it was discussed during lecture 2. Below are some suggested points to examine that may help you better understand the data.

1. For each attribute, find the following information.
(a) The attribute type.
(b) The percentage of missing values in the data.
(c) Are there any records with a value for the attribute that no other record has (i.e. unique values)?
(d) Study the histogram of the attribute and note how it seems to influence the risk of recurrence-events.

2. Observe whether the dataset has an imbalanced class distribution.

3. Switch to the Visualize tab on the upper part of the screen in Weka to visualize 2D scatter plots for each pair of attributes.
(a) Investigate possible multivariate associations of attributes with the class attribute, i.e. study scatter plots of two attributes X and Y and try to identify possible high (or low) recurrence-events areas, if any. For instance, choose X = inv-nodes and Y = breast.
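Point 2 above can be checked without the GUI. The sketch below is a minimal, hand-rolled ARFF reader, not part of Weka; it assumes nominal comma-separated data rows with the class as the last attribute (as in breast_cancer.arff), and no quoted values.

```python
from collections import Counter

def class_distribution(arff_path):
    """Count class values (last attribute) in an ARFF file's @data section."""
    counts = Counter()
    in_data = False
    with open(arff_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('%'):
                continue  # skip blank lines and ARFF comments
            if line.lower() == '@data':
                in_data = True
                continue
            if in_data:
                counts[line.split(',')[-1]] += 1
    return counts
```

A distribution heavily skewed toward one class value (e.g. far more no-recurrence-events than recurrence-events) indicates the imbalance the exercise asks about.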

Preprocessing the Data

The second step is to preprocess the data so that the transformed data is in a form more suitable for the mining algorithms. This aspect was also discussed in lecture 2. We are going to concentrate on feature reduction by selecting promising subsets of attributes for the classification tasks.

3.1

Attribute Selection
1. Use a single attribute evaluator for the following tasks.
(a) Rank the attributes with the InfoGainAttributeEval measure. Which attributes seem to have the best classification power?
(b) Rank the attributes with the GainRatioAttributeEval measure. Which attributes seem to have the best classification power?
(c) Compare the results.

2. Use an attribute subset evaluator for the following tasks.
(a) Build a subset of attributes with CfsSubsetEval. Experiment with the GreedyStepwise and ExhaustiveSearch search strategies. What can you conclude?
(b) Build a subset of attributes with WrapperSubsetEval. Experiment with the J48 classifier, with a minimum of 15 records per leaf node, and BestFirst search.
(c) Build a subset of attributes with WrapperSubsetEval. Experiment with the JRip classifier and GreedyStepwise search.
(d) Explain how WrapperSubsetEval works.
(e) Compare the results and draw conclusions.
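To make the two single-attribute measures concrete: InfoGainAttributeEval scores an attribute by how much splitting on it reduces class entropy, and GainRatioAttributeEval divides that gain by the attribute's own entropy, penalizing many-valued attributes. A minimal sketch of both computations (not Weka's implementation, and for nominal attributes only):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of nominal values, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """H(class) minus the expected class entropy after splitting on the attribute."""
    n = len(labels)
    after_split = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        after_split += len(subset) / n * entropy(subset)
    return entropy(labels) - after_split

def gain_ratio(attr_values, labels):
    """Info gain normalized by the split information (the attribute's own entropy)."""
    split_info = entropy(attr_values)
    return info_gain(attr_values, labels) / split_info if split_info > 0 else 0.0
```

An attribute whose values perfectly separate the classes gets info gain equal to the class entropy; an attribute independent of the class gets gain near zero.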

3.2

Saving Data to a File

You may save the data set to a comma-separated (text) file. Try saving the data set to a file called breast_cancer.csv. This may be useful if you want to apply extra preprocessing techniques not available in Weka, or to load the data into Excel.
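Weka's Save button handles this conversion directly; for completeness, the same transformation can be sketched outside Weka. This assumes simple, unquoted attribute names and nominal comma-separated data rows, so it is a sketch rather than a full ARFF parser:

```python
def arff_to_csv(arff_path, csv_path):
    """Write an ARFF file's attribute names as a CSV header, then copy its data rows."""
    header, rows, in_data = [], [], False
    with open(arff_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('%'):
                continue  # skip blank lines and ARFF comments
            low = line.lower()
            if low.startswith('@attribute'):
                header.append(line.split()[1])  # attribute name is the second token
            elif low == '@data':
                in_data = True
            elif in_data:
                rows.append(line)
    with open(csv_path, 'w') as out:
        out.write(','.join(header) + '\n')
        out.write('\n'.join(rows) + '\n')
```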

Mining the Data

We now proceed by building models that will help us describe the class recurrence-events. Assume that this class is the positive one. Use all attributes to build the following models. Build a model using the OneR classifier and interpret the patterns.
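To help interpret OneR's output: the algorithm builds, for every attribute, a rule that maps each attribute value to its majority class, then keeps only the attribute whose rule makes the fewest training errors. A minimal sketch of that idea (not Weka's implementation, which also discretizes numeric attributes):

```python
from collections import Counter, defaultdict

def one_r(rows, class_index=-1):
    """Return (errors, attribute index, value->class rule) for the single
    attribute whose majority-class rule misclassifies the fewest rows (OneR)."""
    best = None
    n_attrs = len(rows[0])
    for i in range(n_attrs):
        if i == class_index or i == n_attrs + class_index:
            continue  # skip the class column itself
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[i]][row[class_index]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, i, rule)
    return best
```

The single attribute OneR picks is often a useful baseline: any more complex model should at least beat its accuracy.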

1. Use the training set for estimating classifier performance.
(a) Note the accuracy, TPR, and F-measure for both classes.
(b) Interpret the confusion matrix.

2. (a) Now use 10-fold cross-validation for estimating classifier performance.
i. Note the accuracy, TPR, and F-measure for both classes.
ii. Compare the results with the ones previously obtained.
(b) Is the classifier biased toward any of the classes? Which one and why?
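The measures Weka reports in its output window come straight from the confusion matrix. A small sketch of those formulas for the positive class, given the four cell counts:

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, TPR (recall), precision, and F-measure for the positive class,
    computed from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return accuracy, tpr, precision, f1
```

Note that on an imbalanced data set, accuracy alone can look good while the TPR of the minority (positive) class stays low; that asymmetry is what the bias question above is probing.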

4.1

Decision Trees

Use the J48 classifier, i.e. the Weka version of the C4.5 decision tree classifier. 1. Estimate the performance of the classifier using 10-fold cross-validation. 2. Visualize the tree and describe the patterns. How do you interpret the numbers associated with the tree leaves? 3. Is the classifier biased toward any of the classes? 4. Investigate the use of different J48 parameters, such as pruning and the minimum number of records in the leaves.

4.2

Rule-based Classiers

Use the JRip classifier, i.e. the Weka version of the RIPPER algorithm. 1. Estimate the performance of the classifier using 10-fold cross-validation. 2. Is the classifier biased toward any of the classes? 3. Describe the patterns. How do you interpret the numbers associated with each rule?

4.3

Association Rule Mining (ARM)

Use association rule mining (ARM), via the Apriori algorithm, to build high-confidence rules predicting the positive class, i.e. recurrence-events. 1. Describe the patterns. How do you interpret the numbers associated with each rule? 2. Which useful hints for characterizing the positive class does this model give?
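The numbers Apriori attaches to each rule are support and confidence. A minimal sketch of the two measures over records represented as attribute->value dicts; the node-caps example values below are illustrative, not computed from the actual data set:

```python
def support(rows, items):
    """Fraction of records containing every (attribute, value) pair in `items`."""
    hits = sum(all(row.get(a) == v for a, v in items) for row in rows)
    return hits / len(rows)

def confidence(rows, antecedent, consequent):
    """Of the records matching the antecedent, the fraction also matching
    the consequent: support(antecedent + consequent) / support(antecedent)."""
    s_ant = support(rows, antecedent)
    return support(rows, antecedent + consequent) / s_ant if s_ant else 0.0
```

A rule like "node-caps=yes => class=recurrence-events" is only a useful characterization of the positive class if both its support (it covers enough records) and its confidence (it is right often enough on those records) are high.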
