Assignment1 COMP723 2019

The document describes tasks to analyze a Parkinson's disease speech therapy dataset using data mining algorithms in R and Weka. The objective is to classify patients based on whether their speech quality improved after therapy. The dataset has class imbalance and many features. The tasks involve feature selection, analyzing algorithm performance with and without feature selection, data balancing, and building a meta-learner from the top algorithms.


Part B – Data Mining in R and Weka

The objective is to mine a real-world dataset and obtain the best possible classification outcome. The
dataset used is LSVT, which contains data on people who have Parkinson's disease.
Parkinson's disease causes loss of muscle control, and one of its symptoms is a decrease in
speech quality. Speech therapy helps such patients, but not all of them respond well to it.
Patients whose speech quality improves are categorized as class 1, and those whose speech does not
improve are labelled as class 2.

The overall objective of mining the data is to identify both categories with the best possible
accuracy so that the effects of therapy can be maximized. The accuracy measure that you need to
use is the weighted F score taken over both classes of patients.

The dataset is challenging for two reasons. Firstly, there are 310 features (apart from the class
feature), and only a small subset of them is relevant to the task of classifying these patients. Thus,
the first challenge is to identify which subset of features gives the best possible F
score. The second challenge is the imbalanced nature of the dataset – there are 42 patients in class
1 while there are 84 patients in class 2. Hence data balancing methods need to be applied to
improve performance.

F_weighted = (F_1*nc1 + F_2*nc2) / (nc1 + nc2), where F_1 and F_2 are the F scores for
classes 1 and 2 respectively, and nc1 and nc2 are the number of instances of class 1 and class 2
respectively in the test dataset (LSVT_test.arff). Refer to the week 3 Lab sheet for the formula used to
calculate the F score for any given class.
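To make the arithmetic concrete, the two formulas can be sketched as follows. This is an illustration in Python rather than the R code the assignment requires, and the class counts used in the example are hypothetical (the true test-set counts come from LSVT_test.arff):

```python
def f_score(precision, recall):
    """Standard F score for a single class: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def f_weighted(f1, f2, nc1, nc2):
    """Weighted F score over both classes, weighted by test-set class counts."""
    return (f1 * nc1 + f2 * nc2) / (nc1 + nc2)

# Hypothetical example: class-1 F of 0.80, class-2 F of 0.90,
# with 14 class-1 and 28 class-2 instances in the test set.
print(round(f_weighted(0.80, 0.90, 14, 28), 4))  # 0.8667
```

Note that the class with more test instances dominates the weighted score, which is why class balance matters here.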

You are required to experiment with four data mining algorithms, namely OneR, J48, Naïve Bayes
and 1NN (nearest neighbour, called IBk in Weka). You are required to perform the following tasks:

Task 1: Feature Selection

Write code in R to identify the best set of features using the Gain Ratio feature selection filter in
Weka. Your R code will need to call the Gain Ratio filter with a given number of features (N) to keep.
The first call will identify the best 305 features to keep, the second call will identify the best 300
features, and so on, until the effect of keeping only the best 5 features has been examined. Essentially,
this means that you will experiment with values of N in the range [5, 305] in steps of 5.

For each value of N, you will keep the best (top N) features in the train dataset and then use this
subset of features to build a model by applying a mining algorithm on your feature reduced train
dataset. You should make use of the code given in week 3 Lab sheet for this task.

Once the model is built on the training dataset, apply it to the test dataset and
determine the F_weighted score. When you iterate over the entire range of N [5, 305], you will be able
to identify the feature set that produced the highest F_weighted score. Note that the value of N that
produces the highest F_weighted score can differ from algorithm to algorithm – do not assume that it
is the same.
Now repeat the entire process for the rest of the algorithms.
(a) Produce the R code to perform Task 1. Note that your entire code snippet MUST be given for
a SINGLE algorithm (say J48). (7 marks)

(b) For the other 3 algorithms, there will be no need to supply entire code snippets – only one
line that calls the classifier algorithm needs to change, so simply supply that single line of
code for the other 3 algorithms. (3 marks)
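The control flow of the Task 1 loop can be sketched as below. This is a language-neutral illustration in Python with scikit-learn, not the required R/RWeka code: synthetic data stands in for LSVT, mutual information stands in for Weka's Gain Ratio ranking (it is not the same measure), and a decision tree stands in for J48.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier   # rough analogue of J48
from sklearn.metrics import f1_score

# Synthetic stand-in for LSVT: 310 features, 2 classes
X, y = make_classification(n_samples=126, n_features=310, n_informative=10,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1)

# Rank features once on the training data (stand-in for the Gain Ratio filter)
scores = mutual_info_classif(X_tr, y_tr, random_state=1)
ranking = np.argsort(scores)[::-1]        # best feature first

best = (-1.0, None)                       # (F_weighted, N)
for N in range(5, 306, 5):                # N = 5, 10, ..., 305
    cols = ranking[:N]                    # keep only the top-N features
    model = DecisionTreeClassifier(random_state=1).fit(X_tr[:, cols], y_tr)
    fw = f1_score(y_te, model.predict(X_te[:, cols]), average="weighted")
    if fw > best[0]:
        best = (fw, N)
print("best F_weighted %.3f at N=%d" % best)
```

Swapping in another classifier changes only the line that constructs the model, which mirrors part (b): in the real R code, only the single line calling the Weka classifier differs between algorithms.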

Task 2: Performance Analysis

In this task, you will analyse the performance of each algorithm.

(a) First, run each of the 4 algorithms with the full set of features (N=310) and note the
F_weighted score for each of them. (4 marks)

(b) Now prepare a 2 by 4 table with algorithms as columns. The first row of the table must
contain the F_weighted score for each algorithm with the full feature set (i.e. all 310 features).
The second row must contain a pair of values for each algorithm. The first value in the pair
should be the highest F_weighted score, while the second value in the pair must be the
value of N that produced that highest F_weighted score. (4 marks)

(c) Explain, for EACH classifier algorithm the effect of applying feature selection. Use your
knowledge of how that algorithm works to explain why feature selection had a positive or
negative effect on the F_weighted score. (9 marks)

(d) Using this 2 by 4 table identify the mining algorithm that produces the highest F_weighted
score after feature selection was performed. (2 marks)

Task 3: Data Distribution

In this task, you will use the Resample filter to balance the dataset and attempt to further improve the
F_weighted score.

For each of the four algorithms, take the version of the training dataset that produced the best
feature set (the one that produced the highest F_weighted score) in your experimentation in
Task 1. Extend the R code developed in Lab 4 to determine the combination of the "biasToUniformClass"
(B) and "sampleSizePercent" (Z) parameters that produces the highest F_weighted score. You need
to experiment with B values in the range [0.3, 1.0] in steps of 0.1 and Z values in the range
[100, 1000] in steps of 100. To find the best combination, keep one parameter
(say B) fixed at a particular value and then step through the entire range of values for Z. In total this
will involve running 80 trials (8 B values x 10 Z values).
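The grid logic can be sketched as follows. This is a Python illustration, not the required R code: the resample() helper below is a simplified imitation of Weka's supervised Resample filter's two parameters (biasToUniformClass and sampleSizePercent), written only to make the 80-trial grid concrete, and the toy data stands in for the feature-reduced LSVT training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(X, y, B, Z):
    """Draw, with replacement, Z% of the original number of instances.
    Class probabilities are interpolated between the empirical class
    distribution (B=0) and the uniform distribution (B=1)."""
    classes, counts = np.unique(y, return_counts=True)
    empirical = counts / counts.sum()
    uniform = np.full(len(classes), 1.0 / len(classes))
    class_p = (1 - B) * empirical + B * uniform
    # probability of drawing an instance = its class's probability / class size
    cls_idx = np.searchsorted(classes, y)
    inst_p = class_p[cls_idx] / counts[cls_idx]
    n = int(round(len(y) * Z / 100.0))
    idx = rng.choice(len(y), size=n, replace=True, p=inst_p)
    return X[idx], y[idx]

# Imbalanced toy data matching the LSVT training proportions (42 vs 84)
X = rng.normal(size=(126, 10))
y = np.array([1] * 42 + [2] * 84)

results = {}
for B in [round(b, 1) for b in np.arange(0.3, 1.01, 0.1)]:  # 8 B values
    for Z in range(100, 1001, 100):                          # 10 Z values
        Xb, yb = resample(X, y, B, Z)
        # In the real experiment, train the classifier on (Xb, yb) here
        # and store its F_weighted score; this sketch stores the sample size.
        results[(B, Z)] = len(yb)
print(len(results))  # 80 trials
```

With B = 1.0 the resampled class distribution is approximately uniform regardless of the original imbalance, which is the mechanism the assignment exploits to help the minority class.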

(a) Produce the R code for the above data balancing operation. (9 marks)

(b) Run the code for each of the four algorithms and produce a 10 by 8 table for each algorithm,
with rows as Z values and columns as B values. Each cell should contain the F_weighted
score for that row and column. There should be 4 such tables, one for each algorithm. From
each of the 4 tables, identify the combination of B and Z that produces the highest
F_weighted score for the given algorithm. (8 marks)
(c) For this part you need to use Weka. From the table produced in part (b) above you should be
able to identify the best performing algorithm (i.e. the one with the highest F_weighted score).

1) Use this algorithm in the Weka GUI with the version of the training dataset that produced
the highest F_weighted score. Generate a model using the "Use training set" option in Weka. Once the
model is created, deploy the model using the "Supplied test set" option and supply
LSVT_test.arff as your test set. Once the result is generated, produce a Precision Recall
Curve (PRC). This can be done by right-clicking in the result pane and selecting the
"Visualize threshold curve" option. Select the "1" option to plot the curve for class 1.
Choose Precision as the Y axis and Recall as the X axis. Paste this curve into your
report. (3 marks)

2) Produce a PRC for the same algorithm using the original training dataset (i.e. with all 310
features and no data balancing). Paste this curve into your report as well. (3 marks)

(d) By comparing the two PRCs produced in part (c) above, explain the effects of feature
selection and data balancing on improving accuracy for class 1. (7 marks)

Task 4: Building a Meta-learner

In this task you need to build a meta-learner using the top 3 algorithms (the algorithms that produced
the 3 highest F_weighted scores) in Task 3 (b) above. Use Weka to build the meta-learner. Take
each of the top 3 algorithms and use the original training dataset (LSVT_train.arff) to generate
models. For each algorithm, generate a model using the "Use training set" option, just as you did in
Task 3.

Now deploy the model using the "Supplied test set" option, supplying LSVT_test.arff as your test
dataset. Before deploying the model, select "More Options" and choose CSV as the "Output
Predictions" option. Once the model is deployed, Weka will output the predicted class value for each
instance, as shown below:

inst#, actual, predicted, error, prediction


1,1:1,1:1,,1
2,1:1,1:1,,1
3,1:1,1:1,,1
4,1:1,1:1,,1
5,1:1,2:2,+,1

Copy this output to the clipboard and extract the 4th number in each line. The 4th number is the
predicted class value for that instance. For example, for instance 1 the predicted class value is 1, and
for instance 5 it is 2.

Store the predicted class column only in a .CSV file. Now repeat the process for the other two
algorithms. You should now have 3 files, each containing 42 rows and 1 column (predicted class
value for that instance).

Create a merged file containing the predicted class values from each of the 3 files. You should now
have a single file containing 42 rows and 3 columns (predicted class for alg1, predicted class for alg2
and predicted class for alg3). Save this as a .csv file and import into Weka.
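The parsing and merging steps above can be sketched as follows. This is a Python illustration, not part of the required workflow (the assignment lets you do this step manually or in any tool); the three prediction snippets are fabricated examples in the Weka output format shown above, not real LSVT results, and the column names alg1/alg2/alg3 are placeholders.

```python
import csv, io

def predicted_classes(weka_output):
    """Extract the predicted class from each Weka prediction line of the form
    'inst#,actual,predicted,error,prediction'. The predicted field looks like
    '2:2', so the value after the colon (the 4th number) is the predicted class."""
    preds = []
    for line in weka_output.strip().splitlines():
        fields = line.split(",")
        preds.append(int(fields[2].split(":")[1]))
    return preds

# Fabricated 3-instance outputs for three base algorithms
alg1 = "1,1:1,1:1,,1\n2,1:1,1:1,,1\n3,1:1,2:2,+,1"
alg2 = "1,1:1,1:1,,1\n2,1:1,2:2,+,1\n3,1:1,2:2,+,1"
alg3 = "1,1:1,1:1,,1\n2,1:1,1:1,,1\n3,1:1,1:1,,1"

# Merge the three predicted-class columns into one CSV for the meta-learner
rows = zip(predicted_classes(alg1), predicted_classes(alg2), predicted_classes(alg3))
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["alg1", "alg2", "alg3"])  # one column per base algorithm
writer.writerows(rows)
print(buf.getvalue())
```

In the real task the merged file would have 42 rows, one per test instance, before being imported into Weka.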
Now use the Multilayer Perceptron to build the meta-learner, generating it with the "Use training set"
option. Then repeat this with Random Forest as the meta-learner.

(a) Assess the impact of meta-learning by comparing the F_weighted score obtained through meta-
learning with the scores obtained by running each of the 4 algorithms on the original training
dataset. Has meta-learning improved accuracy in terms of the F score? (8 marks)

(b) How important was the choice of meta-learner algorithm in the mining process?
(3 marks)
