Assignment1 COMP723 2019

The document describes tasks to analyze a Parkinson's disease speech therapy dataset using data mining algorithms in R and Weka. The objective is to classify patients based on whether their speech quality improved after therapy. The dataset has class imbalance and many features. The tasks involve feature selection, analyzing algorithm performance with and without feature selection, data balancing, and building a meta-learner from the top algorithms.


Part B – Data Mining in R and Weka

The objective is to mine a real-world dataset and obtain the best possible classification outcome. The
dataset used is LSVT, which contains data on people who have Parkinson's disease.
Parkinson's disease causes loss of muscle control, and one of its symptoms is a decrease in
speech quality. Speech therapy helps such patients, but not all of them respond well to it.
Patients whose speech quality improves are categorized as class 1, and those whose speech does not
improve are labelled as class 2.

The overall objective of mining the data is to identify both categories with the best possible
accuracy so that the effects of therapy can be maximized. The accuracy measure that you need to
use is the weighted F score taken over both classes of patients.

The dataset is challenging for two reasons. Firstly, there are 310 features (apart from the class
feature), and only a small subset of them is relevant to the task of classifying these patients. Thus,
the first challenge is to identify which subset of features gives the best possible F
score. The second challenge is the imbalanced nature of the dataset – there are 42 patients in class
1 while there are 84 patients in class 2. Hence data balancing methods need to be applied to
improve performance.

F_weighted = (F_1*nc1 + F_2*nc2) / (nc1 + nc2), where F_1 and F_2 are the F scores for
classes 1 and 2 respectively, and nc1 and nc2 are the number of instances of class 1 and class 2
respectively in the test dataset (LSVT_test.arff). Refer to the week 3 Lab sheet for the formula used to
calculate the F score for any given class.
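To make the arithmetic concrete, the two formulas can be sketched as follows. This is an illustration in Python rather than the R code the assignment requires, and the class counts used in the example are hypothetical (the true test-set counts come from LSVT_test.arff):

```python
def f_score(precision, recall):
    """Standard F score for a single class: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def f_weighted(f1, f2, nc1, nc2):
    """Weighted F score over both classes, weighted by test-set class counts."""
    return (f1 * nc1 + f2 * nc2) / (nc1 + nc2)

# Hypothetical example: class-1 F of 0.80, class-2 F of 0.90,
# with 14 class-1 and 28 class-2 instances in the test set.
print(round(f_weighted(0.80, 0.90, 14, 28), 4))  # 0.8667
```

Note that the class with more test instances dominates the weighted score, which is why class balance matters here.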

You are required to experiment with four data mining algorithms, namely OneR, J48, Naïve Bayes
and 1NN (nearest neighbour, called IBk in Weka). You are required to perform the following tasks:

Task 1: Feature Selection

Write code in R to identify the best set of features using the Gain Ratio feature selection filter in
Weka. Your R code will need to call the Gain Ratio filter with a given number of features (N) to keep.
The first call will identify the best 305 features to keep, the second call will identify the best 300
features, and so on, until the effect of keeping only the best 5 features has been examined. Essentially,
this means that you will experiment with values of N in the range [5, 305] in steps of 5.

For each value of N, you will keep the best (top N) features in the train dataset and then use this
subset of features to build a model by applying a mining algorithm on your feature reduced train
dataset. You should make use of the code given in week 3 Lab sheet for this task.

Once the model is built on the training dataset, apply it to the test dataset and
determine the F_weighted score. When you iterate over the entire range of N [5, 305], you will be able
to identify the feature set that produced the highest F_weighted score. Note that the value of N that
produces the highest F_weighted score can differ from algorithm to algorithm – do not assume that it
is the same.
Now repeat the entire process for the rest of the algorithms.
(a) Produce the R code to perform Task 1. Note that your entire code snippet MUST be given for
a SINGLE algorithm (say J48). (7 marks)

(b) For the other 3 algorithms, there will be no need to supply entire code snippets – only one
line that calls the classifier algorithm needs to change, so simply supply that single line of
code for the other 3 algorithms. (3 marks)
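The control flow of the Task 1 loop can be sketched as below. This is a language-neutral illustration in Python with scikit-learn, not the required R/RWeka code: synthetic data stands in for LSVT, mutual information stands in for Weka's Gain Ratio ranking (it is not the same measure), and a decision tree stands in for J48.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier   # rough analogue of J48
from sklearn.metrics import f1_score

# Synthetic stand-in for LSVT: 310 features, 2 classes
X, y = make_classification(n_samples=126, n_features=310, n_informative=10,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1)

# Rank features once on the training data (stand-in for the Gain Ratio filter)
scores = mutual_info_classif(X_tr, y_tr, random_state=1)
ranking = np.argsort(scores)[::-1]        # best feature first

best = (-1.0, None)                       # (F_weighted, N)
for N in range(5, 306, 5):                # N = 5, 10, ..., 305
    cols = ranking[:N]                    # keep only the top-N features
    model = DecisionTreeClassifier(random_state=1).fit(X_tr[:, cols], y_tr)
    fw = f1_score(y_te, model.predict(X_te[:, cols]), average="weighted")
    if fw > best[0]:
        best = (fw, N)
print("best F_weighted %.3f at N=%d" % best)
```

Swapping in another classifier changes only the line that constructs the model, which mirrors part (b): in the real R code, only the single line calling the Weka classifier differs between algorithms.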

Task 2: Performance Analysis

In this task, you will analyse the performance of each algorithm.

(a) First, run each of the 4 algorithms with the full set of features (N=310) and note the
F_weighted score for each of them. (4 marks)

(b) Now prepare a 2 by 4 table with algorithms as columns. The first row of the table must
contain the F_weighted score for each algorithm with the full feature set (i.e. all 310 features).
The second row must contain a pair of values for each algorithm. The first value in the pair
should be the highest F_weighted score, while the second value in the pair must be the
value of N that produced that highest F_weighted score. (4 marks)

(c) Explain, for EACH classifier algorithm the effect of applying feature selection. Use your
knowledge of how that algorithm works to explain why feature selection had a positive or
negative effect on the F_weighted score. (9 marks)

(d) Using this 2 by 4 table identify the mining algorithm that produces the highest F_weighted
score after feature selection was performed. (2 marks)

Task 3: Data Distribution

In this task, you will use the Resample filter to balance the dataset and attempt to further improve the
F_weighted score.

For each of the four algorithms, take the version of the training dataset that produced the best
feature set (the one that produced the highest F_weighted score) in your experimentation in
Task 1. Extend the R code developed in Lab 4 to determine the combination of the "biasToUniformClass"
(B) and "sampleSizePercent" (Z) parameters that produces the highest F_weighted score. You need
to experiment with B values in the range [0.3, 1.0] in steps of 0.1 and Z values in the range
[100, 1000] in steps of 100. To find the best combination, keep one parameter
(say B) fixed at a particular value and then step through the entire range of values for Z. In total this
will involve running 80 trials (8 B values x 10 Z values).
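The grid logic can be sketched as follows. This is a Python illustration, not the required R code: the resample() helper below is a simplified imitation of Weka's supervised Resample filter's two parameters (biasToUniformClass and sampleSizePercent), written only to make the 80-trial grid concrete, and the toy data stands in for the feature-reduced LSVT training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(X, y, B, Z):
    """Draw, with replacement, Z% of the original number of instances.
    Class probabilities are interpolated between the empirical class
    distribution (B=0) and the uniform distribution (B=1)."""
    classes, counts = np.unique(y, return_counts=True)
    empirical = counts / counts.sum()
    uniform = np.full(len(classes), 1.0 / len(classes))
    class_p = (1 - B) * empirical + B * uniform
    # probability of drawing an instance = its class's probability / class size
    cls_idx = np.searchsorted(classes, y)
    inst_p = class_p[cls_idx] / counts[cls_idx]
    n = int(round(len(y) * Z / 100.0))
    idx = rng.choice(len(y), size=n, replace=True, p=inst_p)
    return X[idx], y[idx]

# Imbalanced toy data matching the LSVT training proportions (42 vs 84)
X = rng.normal(size=(126, 10))
y = np.array([1] * 42 + [2] * 84)

results = {}
for B in [round(b, 1) for b in np.arange(0.3, 1.01, 0.1)]:  # 8 B values
    for Z in range(100, 1001, 100):                          # 10 Z values
        Xb, yb = resample(X, y, B, Z)
        # In the real experiment, train the classifier on (Xb, yb) here
        # and store its F_weighted score; this sketch stores the sample size.
        results[(B, Z)] = len(yb)
print(len(results))  # 80 trials
```

With B = 1.0 the resampled class distribution is approximately uniform regardless of the original imbalance, which is the mechanism the assignment exploits to help the minority class.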

(a) Produce the R code for the above data balancing operation. (9 marks)

(b) Run the code for each of the four algorithms and produce a 10 by 8 table for each algorithm,
with rows as Z values and columns as B values. Each cell should contain the F_weighted
score for that row and column. There should be 4 such tables, one for each algorithm. From
each of the 4 tables, identify the combination of B and Z that produces the highest
F_weighted score for the given algorithm. (8 marks)
(c) For this part you need to use Weka. From the table produced in part (b) above you should be
able to identify the best performing algorithm (i.e. the one with the highest F_weighted score).

1) Use this algorithm in the Weka GUI with the version of the training dataset that produced
the highest F_weighted score. Generate a model using the "Use training set" option in Weka. Once the
model is created, deploy the model using the "Supplied test set" option and supply
LSVT_test.arff as your test set. Once the result is generated, produce a Precision Recall
Curve (PRC). This can be done by right-clicking in the result pane and selecting the
"Visualize threshold curve" option. Select the "1" option to plot the curve for class 1.
Choose Precision as the Y axis and Recall as the X axis. Paste this curve into your
report. (3 marks)

2) Produce a PRC for the same algorithm using the original training dataset (i.e. with all 310
features and no data balancing). Paste this curve into your report as well. (3 marks)

(d) By comparing the two PRCs produced in part (c) above, explain the effects of feature
selection and data balancing on improving accuracy for class 1. (7 marks)

Task 4: Building a Meta-learner

In this task you need to build a meta-learner using the top 3 algorithms (the algorithms that produced
the 3 highest F_weighted scores) in Task 3 (b) above. Use Weka to build the meta-learner. Take
each of the top 3 algorithms and use the original training dataset (LSVT_train.arff) to generate
models. For each algorithm, generate a model using the "Use training set" option, just as you did in
Task 3.

Now deploy the model using the "Supplied test set" option, supplying LSVT_test.arff as your test
dataset. Before deploying the model, select "More Options" and choose CSV as the "Output
Predictions" option. Once the model is deployed, Weka will output the predicted class value for each
instance, as shown below:

inst#, actual, predicted, error, prediction


1,1:1,1:1,,1
2,1:1,1:1,,1
3,1:1,1:1,,1
4,1:1,1:1,,1
5,1:1,2:2,+,1

Copy this output to the clipboard and extract the 4th number in each line. The 4th number is the
predicted class value for that instance. For example, for instance 1 the predicted class value is 1, and
for instance 5 it is 2.

Store the predicted class column only in a .CSV file. Now repeat the process for the other two
algorithms. You should now have 3 files, each containing 42 rows and 1 column (predicted class
value for that instance).

Create a merged file containing the predicted class values from each of the 3 files. You should now
have a single file containing 42 rows and 3 columns (predicted class for alg1, predicted class for alg2
and predicted class for alg3). Save this as a .csv file and import into Weka.
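The parsing and merging steps above can be sketched as follows. This is a Python illustration, not part of the required workflow (the assignment lets you do this step manually or in any tool); the three prediction snippets are fabricated examples in the Weka output format shown above, not real LSVT results, and the column names alg1/alg2/alg3 are placeholders.

```python
import csv, io

def predicted_classes(weka_output):
    """Extract the predicted class from each Weka prediction line of the form
    'inst#,actual,predicted,error,prediction'. The predicted field looks like
    '2:2', so the value after the colon (the 4th number) is the predicted class."""
    preds = []
    for line in weka_output.strip().splitlines():
        fields = line.split(",")
        preds.append(int(fields[2].split(":")[1]))
    return preds

# Fabricated 3-instance outputs for three base algorithms
alg1 = "1,1:1,1:1,,1\n2,1:1,1:1,,1\n3,1:1,2:2,+,1"
alg2 = "1,1:1,1:1,,1\n2,1:1,2:2,+,1\n3,1:1,2:2,+,1"
alg3 = "1,1:1,1:1,,1\n2,1:1,1:1,,1\n3,1:1,1:1,,1"

# Merge the three predicted-class columns into one CSV for the meta-learner
rows = zip(predicted_classes(alg1), predicted_classes(alg2), predicted_classes(alg3))
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["alg1", "alg2", "alg3"])  # one column per base algorithm
writer.writerows(rows)
print(buf.getvalue())
```

In the real task the merged file would have 42 rows, one per test instance, before being imported into Weka.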
Now use the Multilayer Perceptron to build the meta-learner, generating it with the "Use training set"
option. Then repeat this with Random Forest as the meta-learner.

(a) Assess the impact of meta-learning by comparing the F_weighted score obtained through meta-
learning with the scores obtained by running each of the 4 algorithms on the original training
dataset. Has meta-learning improved accuracy in terms of the F score? (8 marks)

(b) How important was the choice of meta-learner algorithm in the mining process?
(3 marks)
