
The Handwritten Solutions To The First Five Questions, and The Report of Last Question

This document provides instructions for Assignment 3 on data pre-processing. It is due on April 24th and is worth 100 marks. The assignment involves answering questions on data smoothing, normalization, binning, and outlier detection. It also involves using the Weka data mining tool to explore pre-processing techniques like discretization, normalization, resampling and attribute selection on sample datasets. A report summarizing the experiments in Weka is required.

Uploaded by

Qä Sïm

Assignment 3: Data Pre-processing Due Date: 24 April, 2020 Marks: 100

Submission Instructions: Submit a single PDF file (or zipped folder) on LMS containing
the handwritten solutions to the first five questions and the report for the last question.

1. Given the following data (in ascending order) for the attribute age: 12, 15, 16, 16,
19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52,
72. [10 Marks]
(a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?
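
As an illustration of the technique named in part (a), the following is a minimal Python sketch of smoothing by bin means. It assumes the data are already sorted and are split into consecutive equal-depth bins; the function name `smooth_by_bin_means` is hypothetical and is not part of the assignment or of Weka.

```python
from statistics import mean

# Age values from question 1, already in ascending order.
ages = [12, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 72]


def smooth_by_bin_means(sorted_values, depth):
    """Replace every value with the mean of its equal-depth bin.

    Hypothetical helper: bins are consecutive slices of `depth` values;
    a shorter final bin is allowed if the length is not a multiple of depth.
    """
    smoothed = []
    for start in range(0, len(sorted_values), depth):
        bin_values = sorted_values[start:start + depth]
        bin_mean = mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


smoothed_ages = smooth_by_bin_means(ages, depth=3)

# Print each bin next to its smoothed value to illustrate the steps.
for start in range(0, len(ages), 3):
    print(ages[start:start + 3], "->", round(smoothed_ages[start], 2))
```

With 27 values and a depth of 3, the sketch produces nine bins, and each value in a bin is replaced by that bin's mean.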

2. What are the ranges of the following normalization methods? [10 Marks]
(a) min-max normalization
(b) z-score normalization
(c) normalization by decimal scaling

3. Use these methods to normalize the following group of data: 200, 300, 400,
600, 1000. [10 Marks]
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) normalization by decimal scaling
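
The three normalization methods listed above can be sketched in Python as follows. This is one possible reading, under stated assumptions: the z-score uses the population standard deviation (a course may expect the sample standard deviation instead), and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. All function names are hypothetical.

```python
from statistics import mean, pstdev

data = [200, 300, 400, 600, 1000]


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values onto [new_min, new_max].

    Assumes the values are not all identical (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Centre on the mean and scale by the standard deviation.

    Assumption: population standard deviation (pstdev).
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max |v'| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max(data))
print([round(z, 4) for z in z_score(data)])
print(decimal_scaling(data))
```

For this data set the largest value is 1000, so under the stated convention the decimal-scaling exponent is j = 4.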

4. Using the data for age given in question 1, answer the following: [10 Marks]
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0,
1.0].
(b) Use z-score normalization to transform the value 35 for age.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving
reasons as to why.

5. Suppose a group of 12 sales price records has been sorted as follows: [10 Marks]
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods:
(a) equal-frequency (equal-depth) partitioning
(b) equal-width partitioning
(c) clustering
(d) Additionally, apply numerosity reduction to these data to obtain a 50% data reduction.
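
Parts (a) and (b) can be illustrated with a short Python sketch. It assumes that equal-frequency bins are consecutive slices of the sorted data, and that equal-width bins split the range from the minimum to the maximum into intervals of identical width, with the maximum value placed in the last bin. The function names are hypothetical.

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]


def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins consecutive bins of equal size.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum would land at index n_bins, so clamp it to the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins


print(equal_frequency_bins(prices, 3))
print(equal_width_bins(prices, 3))
```

With three bins the equal-width interval is (215 - 5) / 3 = 70, so most records fall into the first bin, whereas equal-frequency partitioning always places four records in each bin.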

6. Pre-processing using Weka [50 Marks]
Weka (https://www.cs.waikato.ac.nz/ml/weka/) is a collection of machine
learning algorithms for solving real-world data mining problems. It is written in Java
and runs on almost any platform. The algorithms can either be applied directly to a
dataset or called from your own Java code. You should install Weka (if not already
done) on your machine, experiment with the following parts, and submit a report.

a) Load the Iris dataset (available in the data folder of the Weka installation). Explore
different filters in the Preprocess tab and demonstrate the use of the following:
attribute: Discretize, Normalize, Standardize, MathExpression
instance: Resample, Randomize

b) Apply attribute selection (from the Select attributes tab) and report the resulting
attributes, using the following methods:
CfsSubsetEval
PrincipalComponents
c) Apply dimensionality reduction to the SGPA dataset you created in Assignment 1
for prediction of your Spring 2020 SGPA. Report the original dimensions and the
reduced dimensions obtained.
