Assign 1

This document outlines an assignment on data mining consisting of 5 parts: 1. Analyze and compare student exam results from 2020 and 2021 using statistical analysis and plots. 2. Download a dry bean dataset and report on attribute types, compute summaries for continuous attributes, means, standard deviations, and generate plots. 3. Download and explore the Weka data mining tool using the Iris dataset, reporting basic statistics and scatter plot matrix. 4. Compute dissimilarity matrices using Euclidean and Manhattan distances for 4 points in 3D space and plot the relationship between the measures. 5. Compute a dissimilarity matrix for sample data with different attribute types, and suggest the most similar friend to "Ali"

Uploaded by

Suleman Butt

Data Mining Assignment 1: Data Understanding

Submission: Submit a hard copy of the assignment in the second Data Mining class of the week (23 or 24 Nov. 2023).

1. (20 points)
Apply your basic data mining knowledge to compare students' midterm exam performance in a course across
two years, 2020 and 2021 (result_20_21.xls). Support your comments and comparison with a statistical
description of the data (e.g., mean, median, mode, variance, five-number summary) and with plots
(e.g., boxplot, histogram). (A report of 2 to 3 pages is required.)
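As a starting point for the statistical description, a minimal Python sketch is given here. It uses only the standard library. The function name describe and the two lists of marks are hypothetical and are not taken from result_20_21.xls. The quartiles use the "inclusive" interpolation method, which is one of several conventions.

```python
import statistics

def describe(scores):
    """Return basic descriptive statistics for a list of numeric scores."""
    # Quartiles by linear interpolation, treating the data as the full population
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "mode": statistics.multimode(scores),      # list, since ties are possible
        "variance": statistics.variance(scores),   # sample variance (divides by n - 1)
        "five_number": (min(scores), q1, q2, q3, max(scores)),
    }

# Hypothetical marks for illustration only, NOT the real exam results
marks_2020 = [12, 15, 15, 18, 20, 22, 25]
marks_2021 = [10, 14, 16, 16, 19, 24, 28]

for year, marks in (("2020", marks_2020), ("2021", marks_2021)):
    print(year, describe(marks))
```

Running the same function on both years gives summaries that can be compared side by side and backed up with boxplots or histograms.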

2. (20 points)
Download the DryBean dataset from the UCI Machine Learning Repository. Read the dataset's description and report
the following (use any language or tool of your choice to solve this problem):

a. The types of the attributes (continuous [interval, ratio], categorical [nominal, ordinal]). Also identify which
attribute(s) are input attribute(s) and which are class attribute(s) (if any).
b. Compute the five-number summary for any two continuous attributes. Compute the mode for categorical
attributes.
c. Compute the mean and standard deviation for the two continuous attributes.
d. Generate the quantile (percentile) plots for two attributes of the dataset.
e. Generate the histogram or distribution plot for each of the two attributes selected in (b).
f. Generate the scatter plots for the two attributes selected in (d).
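For parts (b) and (d), the sketch below shows one way to compute the mode of a categorical attribute and the points of a quantile plot. It assumes the common convention that the i-th sorted value is paired with f_i = (i - 0.5) / n. The function names and the sample values are hypothetical and are not read from the DryBean file.

```python
from collections import Counter

def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

def categorical_mode(labels):
    """Return the most frequent label(s) of a categorical attribute, sorted."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(label for label, count in counts.items() if count == top)

# Hypothetical values for illustration only, NOT taken from the dataset
area = [300.0, 250.0, 410.0, 380.0]
bean_class = ["SIRA", "DERMASON", "SIRA", "SEKER"]

print(quantile_plot_points(area))   # (f_i, x_i) pairs to plot as f on x-axis, x on y-axis
print(categorical_mode(bean_class))
```

Plotting the returned pairs with any plotting tool gives the quantile plot for that attribute.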

3. (10 points)
Download and install Weka, a data mining tool, on your system. Explore the tool and the datasets provided
with the installation. Submit a report containing basic statistics and plots (e.g., scatter plot matrix) for the Iris
dataset using the Weka tool. (A report of 2 to 3 pages is required.)

The following links can be useful.

https://sourceforge.net/projects/weka/

https://www.cs.waikato.ac.nz/ml/weka/

https://waikato.github.io/weka-wiki/downloading_weka/

4. (30 points) Handwritten solution is required.

a. Given these four points in a 3-D space, compute and show the dissimilarity matrix. Use
Euclidean distance as the dissimilarity measure. A(4,5,5), B(5,3,3), C(1,1,0), D(4,4,1)
b. Repeat part (a) using Manhattan distance as the dissimilarity measure.
c. Draw a scatter plot for the distances obtained in parts (a) and (b) to identify the relationship
between the two dissimilarity measures.
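A handwritten solution can be cross-checked with a short script. The sketch below uses the four points given in part (a) and the standard definitions of Euclidean distance (square root of the sum of squared coordinate differences) and Manhattan distance (sum of absolute coordinate differences). The function and variable names are illustrative only.

```python
import math

points = {"A": (4, 5, 5), "B": (5, 3, 3), "C": (1, 1, 0), "D": (4, 4, 1)}

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def dissimilarity_matrix(pts, dist):
    """Map every ordered pair of point names to its distance."""
    return {(r, c): dist(pts[r], pts[c]) for r in pts for c in pts}

names = list(points)
euclid = dissimilarity_matrix(points, euclidean)
manhat = dissimilarity_matrix(points, manhattan)

for label, matrix in (("Euclidean", euclid), ("Manhattan", manhat)):
    print(label)
    for i, r in enumerate(names):
        # Lower triangle only, since the matrix is symmetric
        print(r, [round(matrix[(r, c)], 3) for c in names[: i + 1]])

# (Euclidean, Manhattan) pairs, one per distinct pair of points, for part (c)
pairs = [(euclid[(r, c)], manhat[(r, c)])
         for i, r in enumerate(names) for c in names[:i]]
print(pairs)
```

The six pairs in the last list are the coordinates to plot in part (c), with one measure on each axis.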
5. (20 points) Handwritten solution is required.
Name    Fever   Cough   Height   Weight   Profession   City
Ali     N       Y       65       80       Student      Lahore
Bilal   Y       Y       55       65       Student      Karachi
Khan    N       N       70       75       Teacher      Lahore
Ahmed   Y       N       60       55       Doctor       Islamabad
Given the data above, compute the dissimilarity matrix. Fever and Cough are asymmetric binary attributes, Height and
Weight are numeric attributes, and Profession and City are nominal attributes. Who should be suggested as a friend to Ali
based on your computed dissimilarity matrix?
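A handwritten result can be checked against the sketch below, which follows one common convention for mixed attribute types and may differ from the convention taught in class. The assumptions are: each attribute contributes a value in [0, 1]; an asymmetric binary attribute is skipped when both objects are N; numeric differences are divided by the attribute's range over the four rows; nominal attributes contribute 0 on a match and 1 otherwise; and the contributions are averaged over the attributes actually used. All names in the code are illustrative.

```python
records = {
    "Ali":   {"Fever": "N", "Cough": "Y", "Height": 65, "Weight": 80,
              "Profession": "Student", "City": "Lahore"},
    "Bilal": {"Fever": "Y", "Cough": "Y", "Height": 55, "Weight": 65,
              "Profession": "Student", "City": "Karachi"},
    "Khan":  {"Fever": "N", "Cough": "N", "Height": 70, "Weight": 75,
              "Profession": "Teacher", "City": "Lahore"},
    "Ahmed": {"Fever": "Y", "Cough": "N", "Height": 60, "Weight": 55,
              "Profession": "Doctor", "City": "Islamabad"},
}
ASYMMETRIC_BINARY = ("Fever", "Cough")
NUMERIC = ("Height", "Weight")
NOMINAL = ("Profession", "City")

# Range (max - min) of each numeric attribute over all rows
ranges = {f: max(r[f] for r in records.values()) - min(r[f] for r in records.values())
          for f in NUMERIC}

def mixed_dissimilarity(a, b, ranges):
    """Average per-attribute dissimilarity under the assumptions stated above."""
    total, used = 0.0, 0
    for f in ASYMMETRIC_BINARY:
        if a[f] == "N" and b[f] == "N":
            continue  # assumption: a negative match is ignored entirely
        total += 0.0 if a[f] == b[f] else 1.0
        used += 1
    for f in NUMERIC:
        total += abs(a[f] - b[f]) / ranges[f]
        used += 1
    for f in NOMINAL:
        total += 0.0 if a[f] == b[f] else 1.0
        used += 1
    return total / used

for name, row in records.items():
    if name != "Ali":
        print("Ali vs", name, round(mixed_dissimilarity(records["Ali"], row, ranges), 4))
```

Under this convention, the row with the smallest printed value is the one closest to Ali. A different normalization or a different treatment of the binary attributes can change the values.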
