Lab (I)
The Goals
The aim of this exercise is to give you an overview of how to use Weka through the Explorer interface. To this end, you are going to work with the data set breast_cancer.arff, downloadable from the course web page, which contains 286 cancer patient records. You are asked to experiment with building several models that describe when recurrence-events may occur, to assess the performance of the models, and to compare them.
This is the first step in the KDD process and it was discussed during lecture 2. You can find below a suggestion of some points to look at that may help you to better understand the data.

1. For each attribute find the following information.
   (a) The attribute type.
   (b) The percentage of missing values in the data.
   (c) Are there any records that have a value for the attribute that no other record has (i.e. unique values)?
   (d) Study the histogram of the attribute and note how it seems to influence the risk for recurrence-events.
2. Observe whether the data set has an imbalanced class distribution.
3. Switch to the Visualize tab on the upper part of the screen in Weka to visualize 2D scatter plots for each pair of attributes.
   (a) Investigate possible multivariate associations of attributes with the class attribute, i.e. study scatter plots of two attributes X and Y and try to identify possible high (low) recurrence-events areas (if any). For instance, choose X = inv-nodes and Y = breast.
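Weka's Preprocess tab reports these statistics for you. To make the definitions concrete, here is a plain-Python sketch of how a missing-value percentage and the class distribution are computed; the toy records and the helper missing_pct are illustrative only, not part of Weka or of breast_cancer.arff.

```python
from collections import Counter

# Toy records mimicking a few rows of the data set; "?" marks a missing
# value, as it does in ARFF files. Attribute names follow the real data
# set, but the values here are made up for illustration.
records = [
    {"age": "40-49", "node-caps": "?",   "class": "no-recurrence-events"},
    {"age": "50-59", "node-caps": "no",  "class": "recurrence-events"},
    {"age": "40-49", "node-caps": "yes", "class": "no-recurrence-events"},
    {"age": "60-69", "node-caps": "no",  "class": "no-recurrence-events"},
]

def missing_pct(records, attr):
    """Percentage of records with a missing ('?') value for attr."""
    missing = sum(1 for r in records if r[attr] == "?")
    return 100.0 * missing / len(records)

# Class distribution: a large gap between the counts signals imbalance.
class_dist = Counter(r["class"] for r in records)
```

On these four toy records, missing_pct(records, "node-caps") is 25.0 and the class counts are 3 versus 1, i.e. an imbalanced distribution in miniature.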
The second step is to preprocess the data such that the transformed data is in a more suitable form for the mining algorithms. This aspect was discussed in lecture 2 as well. We are going to concentrate our attention on feature reduction by selecting promising subsets of attributes for the classification tasks.
3.1
Attribute Selection
1. Use an attribute ranking evaluator for the following tasks.
   (a) To rank the attributes by the InfoGainAttributeEval measure. Which attributes seem to have the best classification power?
   (b) To rank the attributes by the GainRatioAttributeEval measure. Which attributes seem to have the best classification power?
   (c) Compare the results.
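InfoGainAttributeEval and GainRatioAttributeEval score each attribute by information gain and gain ratio, respectively. The following minimal Python sketch of both measures (function names are mine) may help you interpret the rankings; it works on (attribute value, class label) pairs:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(pairs):
    """Information gain of an attribute: class entropy minus the
    expected class entropy after splitting on the attribute.
    pairs: list of (attribute_value, class_label)."""
    labels = [c for _, c in pairs]
    n = len(pairs)
    by_value = {}
    for v, c in pairs:
        by_value.setdefault(v, []).append(c)
    remainder = sum(len(sub) / n * entropy(sub) for sub in by_value.values())
    return entropy(labels) - remainder

def gain_ratio(pairs):
    """Gain ratio: information gain normalised by the entropy of the
    attribute itself, which penalises many-valued attributes."""
    split_info = entropy([v for v, _ in pairs])
    ig = info_gain(pairs)
    return ig / split_info if split_info > 0 else 0.0
```

For a perfectly predictive two-valued attribute, e.g. pairs [("a", "+"), ("a", "+"), ("b", "-"), ("b", "-")], both measures equal 1.0; comparing the two rankings in Weka shows where the normalisation changes the order.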
2. Use an attribute subset evaluator for the following tasks.
   (a) To build a subset of attributes with CfsSubsetEval. Experiment with the GreedyStepwise and ExhaustiveSearch search strategies. What can you conclude?
   (b) To build a subset of attributes with WrapperSubsetEval. Experiment with the J48 classifier, with a minimum of 15 records per leaf node, and BestFirst search.
   (c) To build a subset of attributes with WrapperSubsetEval. Experiment with the JRip classifier and GreedyStepwise search.
   (d) Explain how WrapperSubsetEval works.
   (e) Compare the results and draw conclusions.
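The idea behind the wrapper approach is that a candidate subset is scored by cross-validating the chosen classifier (e.g. J48 or JRip) on just those attributes, and a search strategy explores the space of subsets. A sketch of forward greedy selection, assuming a stand-in score function in place of the cross-validated accuracy (all names here are mine, not Weka's API):

```python
def greedy_stepwise(attributes, score):
    """Forward greedy search: repeatedly add the attribute that most
    improves score(subset); stop when no addition helps. `score` stands
    in for the cross-validated accuracy of the wrapped classifier on
    that subset, which is what WrapperSubsetEval estimates."""
    selected = []
    best = score(())
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            return selected
        # Score every one-attribute extension of the current subset.
        gains = [(score(tuple(selected + [a])), a) for a in candidates]
        top_score, top_attr = max(gains)
        if top_score <= best:      # no extension improves the estimate
            return selected
        best = top_score
        selected.append(top_attr)
```

Because every candidate subset triggers a full cross-validation of the classifier, the wrapper is much slower than a filter such as CfsSubsetEval, but it is tailored to the classifier that will ultimately be used.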
3.2
Saving the Data Set
You may save the data set to a comma-separated (text) file. Experiment with saving the data set to a file called breast_cancer.csv. This may be useful if you want to apply extra pre-processing techniques not available in Weka, or even load the data into Excel.
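Weka's save dialog does the conversion for you. If you ever need to do it outside Weka, the transformation is simple for a plain data set like this one, since ARFF data lines are already comma-separated; the sketch below handles only simple nominal/numeric ARFF files without quoted attribute names (the function is mine, not a Weka utility):

```python
def arff_to_csv(arff_text):
    """Convert the text of a simple ARFF file to CSV: attribute names
    become the header row and the @data lines are kept as-is. Assumes
    unquoted attribute names and no sparse-format data."""
    header, rows, in_data = [], [], False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            header.append(line.split()[1])     # second token: the name
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line)
    return "\n".join([",".join(header)] + rows)
```

The header row carries the attribute names, so the file loads directly into Excel or any CSV-aware tool.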
We proceed now by building models that will help us to describe the class recurrence-events. Assume that this class is the positive one. Use all attributes to build the following models. Build a model using the OneR classifier and interpret the patterns.
1. Use the training set for estimating classifier performance.
   (a) Note the accuracy, TPR, and F-measure for both classes.
   (b) Interpret the confusion matrix.
2. (a) Use now 10-fold cross-validation for estimating classifier performance.
      i. Note the accuracy, TPR, and F-measure for both classes.
      ii. Compare the results with the ones previously obtained.
   (b) Is the classifier biased toward any of the classes? Which one and why?
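All three figures follow directly from the confusion matrix. As a reference for reading Weka's output, here is how they are computed for the positive class from the four counts; the example counts below are made up for illustration, not results from breast_cancer.arff.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, TPR (recall), and F-measure for the positive class,
    computed from confusion-matrix counts:
        tp: positives predicted positive    fn: positives predicted negative
        fp: negatives predicted positive    tn: negatives predicted negative
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    tpr = tp / (tp + fn)                      # recall of the positive class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return accuracy, tpr, f_measure

# Illustrative counts only (they sum to 286 like the data set, but are
# not actual classifier output):
acc, tpr, f1 = metrics(tp=30, fn=55, fp=20, tn=181)
```

Note how a classifier can reach a respectable accuracy while its TPR on the minority positive class stays low; this is exactly the bias question the exercise asks about.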
4.1
Decision Trees
Use the J48 classifier, i.e. the Weka version of the decision tree classifier C4.5.
1. Estimate the performance of the classifier by using 10-fold cross-validation.
2. Visualize the tree and describe the patterns. How do you interpret the numbers associated with the tree leaves?
3. Is the classifier biased toward any of the classes?
4. Investigate the use of different J48 parameters, such as pruning and the minimum number of records in the leaves.
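As a hint for question 2: a J48 leaf is annotated with a pair such as (25.0/4.0), where the first number is how many training instances reach the leaf and the second how many of those the leaf's class label misclassifies. A small parser (mine, for illustration) that turns the annotation into counts and a per-leaf accuracy:

```python
def leaf_numbers(annotation):
    """Parse a J48 leaf annotation such as '(25.0/4.0)' or '(10.0)'.
    Returns (covered, errors, leaf_accuracy). A missing second number
    means the leaf makes no errors on the training data."""
    inner = annotation.strip("()")
    if "/" in inner:
        covered, errors = (float(x) for x in inner.split("/"))
    else:
        covered, errors = float(inner), 0.0
    return covered, errors, (covered - errors) / covered
```

So (25.0/4.0) describes a leaf covering 25 instances with 4 errors, i.e. 84% purity on the training data; the same reading lets you compare how pruning and the minimum-records parameter change the leaves.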
4.2
Rule-based Classifiers
Use the JRip classifier, i.e. the Weka version of the RIPPER algorithm.
1. Estimate the performance of the classifier by using 10-fold cross-validation.
2. Is the classifier biased toward any of the classes?
3. Describe the patterns. How do you interpret the numbers associated with each rule?
4.3
Association Rules
Use association rule mining (ARM), via the Apriori algorithm, to build high-confidence rules predicting the positive class, i.e. recurrence-events.
1. Describe the patterns. How do you interpret the numbers associated with each rule?
2. Which useful hints for characterizing the positive class does this model give?
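The numbers Apriori attaches to a rule derive from two simple counts: how often the whole rule occurs, and how often its left-hand side occurs. A minimal sketch of support and confidence over itemset-style transactions (the helper and the toy items below are mine, for illustration):

```python
def support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent over
    a list of transactions, each a set of attribute=value items.
    support    = P(antecedent and consequent)
    confidence = P(consequent | antecedent)"""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence
```

A rule with high confidence but tiny support covers too few records to characterize recurrence-events reliably, so both numbers matter when you read the model's output.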