Assignment#2 RT WQ2021

This document outlines an assignment for a data mining course. It includes 4 problems related to data preprocessing tasks like data visualization, normalization, binning, and correlation analysis. Students are asked to perform these operations on datasets relating to hospital patient information, loan approvals, and Spotify music listening data. The assignment is worth 50 total points and is due by midnight on February 11th, 2021. Late submissions within 3 days are allowed but will incur a penalty.

Uploaded by

Manoj Vemuri

DSC 441: Winter 2020-2021 Assignment #2, Page 1 of 2

Assignment #2

Due Date: Thursday, February 11th, 2021, by midnight

Total number of points: 50 points

Problem 1 (10 points): This problem is an example of data preprocessing needed in a data mining process.
Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following
results:

Age    26    26    29    29    40    45    50    55    60
%fat   10.5  30.5  8.8   20.8  32.4  26.9  30.4  30.2  33.2

Age    55    45    60    55    61    62    63    75    66
%fat   36.6  44.5  30.8  35.4  33.2  36.1  37.9  43.2  37.7

a. (2 points) Draw the box-plots for age and %fat. Interpret the distribution of the data.
b. (2 points) Normalize the two attributes based on z-score normalization.
c. (2 points) Regardless of the original ranges of the variables, normalization techniques transform
the data into new ranges so that variables can be compared and used on the same scale. What are the
value ranges of the following normalization methods (for this data set and in general)? Explain
and back up your answer.
i. Min-max normalization
ii. Z-score normalization
iii. Normalization by decimal scaling.
d. (2 points) Draw a scatterplot based on the two variables and interpret the relationship between the
two variables.
e. (2 points) Calculate the correlation matrix. Are these two attributes positively or negatively
correlated? Calculate the covariance matrix. How is the correlation matrix different from the
covariance matrix?
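
As an illustration of the computations behind parts (b) and (e), the following minimal Python sketch uses only the standard library. The helper names (mean, sample_std, z_scores, covariance, correlation) are illustrative and not part of the assignment, and the sample (n - 1) denominator is an assumption; the course may expect a different tool or convention.

```python
from math import sqrt

# Age and %fat values copied from the table in Problem 1.
age = [26, 26, 29, 29, 40, 45, 50, 55, 60, 55, 45, 60, 55, 61, 62, 63, 75, 66]
fat = [10.5, 30.5, 8.8, 20.8, 32.4, 26.9, 30.4, 30.2, 33.2,
       36.6, 44.5, 30.8, 35.4, 33.2, 36.1, 37.9, 43.2, 37.7]

def mean(xs):
    return sum(xs) / len(xs)

def sample_std(xs):
    # Assumption: sample standard deviation with an (n - 1) denominator.
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def z_scores(xs):
    # z = (x - mean) / standard deviation
    m, s = mean(xs), sample_std(xs)
    return [(x - m) / s for x in xs]

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    # Pearson correlation: covariance rescaled by both standard deviations.
    return covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

z_age, z_fat = z_scores(age), z_scores(fat)
r = correlation(age, fat)
print(f"covariance(age, %fat)   = {covariance(age, fat):.3f}")
print(f"correlation(age, %fat)  = {r:.3f}")
print(f"covariance of z-scores  = {covariance(z_age, z_fat):.3f}")
```

Because z-scores have unit variance, the covariance of the z-scored variables equals the correlation of the original variables, which is one way to see how the two matrices relate.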

Problem 2 (10 points): There are two parts to this discussion assignment.

Part 1: Given the following data, bin the values into equal-width bins (choose how many), then smooth the data
by replacing each value with the median of its bin. Show the bins and the resulting list of values after
smoothing.

18, 8, 22, 10, 12, 5, 4, 32, 2, 9, 16, 25, 26, 28
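
One way to carry out equal-width binning with median smoothing is sketched below in Python. The choice of three bins is an arbitrary assumption made for illustration (the problem leaves the number open), and the function name smooth_by_bin_median is hypothetical.

```python
from statistics import median

data = [18, 8, 22, 10, 12, 5, 4, 32, 2, 9, 16, 25, 26, 28]

def smooth_by_bin_median(values, k):
    """Split [min, max] into k equal-width bins and replace each value
    by the median of its bin. Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so clamp it into the last bin.
    index = [min(int((v - lo) / width), k - 1) for v in values]
    bins = [[v for v, i in zip(values, index) if i == b] for b in range(k)]
    medians = [median(b) if b else None for b in bins]
    smoothed = [medians[i] for i in index]
    return bins, smoothed

bins, smoothed = smooth_by_bin_median(data, k=3)  # k=3 is an assumption
for b in bins:
    print(sorted(b))
print(smoothed)
```

With a different number of bins the boundaries and medians change, so the smoothed list depends directly on the chosen k.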

Part 2: Normalize the same data above with three techniques: min-max (to range 10 to 20), standardization, and
decimal scaling. What value gets mapped to 0 in each case? What are the min and max values after normalization
with each?
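
The min-max and decimal-scaling transformations named in Part 2 can be sketched as follows. This is a minimal illustration using only built-in Python; the function names are hypothetical, and decimal scaling is taken here to mean dividing by 10^j for the smallest integer j that brings every absolute value below 1. Standardization is omitted; it subtracts the mean and divides by the standard deviation.

```python
data = [18, 8, 22, 10, 12, 5, 4, 32, 2, 9, 16, 25, 26, 28]

def min_max(values, new_min, new_max):
    """Linearly map [min(values), max(values)] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer with max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled = min_max(data, 10, 20)
decimal, j = decimal_scaling(data)
print(min(scaled), max(scaled))
print(j, min(decimal), max(decimal))
```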

Problem 3 (20 points): For this problem, you will load and perform some cleaning steps on a dataset in the
provided BankData.csv, which is data about loan approvals from a bank in Japan (it has been modified from the
original for our purposes in class, so use the provided version). Specifically, you will use visualization to examine
the variables and normalization, binning and smoothing to change them in particular ways.

a. Visualize the distributions of the variables in this data. You can choose bar graphs or histograms. Make
appropriate choices for each type of variable, and be careful when selecting parameters such as the number of bins
for the histograms. Note there are some numerical variables and some categorical ones. The ones labeled as a ‘bool’
are Boolean variables, meaning they are only true or false and are thus a special type of categorical. Checking all the
distributions with visualization and summary statistics is a typical step when beginning to work with new data.

b. Now apply normalization to some of these numerical distributions. Specifically, choose to apply z-score to one,
min-max to another, and decimal scaling to a third.

c. Visualize the new distributions for the variables that have been normalized. What has changed from the previous
visualization?

d. Choose one of the numerical variables to work with for this problem. Let’s call it v. Create a new variable called
v_bins that is a binned version of that variable. This v_bins will have a new set of values like low, medium, high.
Choose the actual new values (you don’t need to use low, medium, high) and the ranges of v that they represent
based on your understanding of v from your visualizations. You can use equal depth, equal width or custom ranges.
Explain your choices: why did you choose to create that number of values and those particular ranges? (Explore
SPSS visual binning)
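
For readers working outside SPSS, the idea behind v_bins can be sketched in Python. The values in v below are hypothetical (they are not taken from BankData.csv), the labels low/medium/high are only one possible choice, and both function names are illustrative. The sketch derives equal-depth cut points and then assigns a label to each value; custom ranges work the same way by passing hand-picked cut points.

```python
from bisect import bisect_right

def equal_depth_cut_points(values, k):
    """Return k - 1 cut points that split the sorted values into k groups
    of roughly equal size (ties can make the groups uneven)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def bin_by_cut_points(values, cut_points, labels):
    """Label each value by the interval it falls in.
    cut_points must be sorted, with len(labels) == len(cut_points) + 1.
    A value equal to a cut point goes into the higher interval."""
    return [labels[bisect_right(cut_points, v)] for v in values]

v = [1.5, 0.2, 7.0, 3.3, 12.8, 0.9, 5.1, 2.4, 28.0]  # hypothetical values
labels = ["low", "medium", "high"]

cuts = equal_depth_cut_points(v, 3)
v_bins = bin_by_cut_points(v, cuts, labels)
print(cuts)
print(v_bins)
```

Equal-depth bins keep the group sizes balanced even when v is skewed, whereas equal-width bins on the same values would place most of them in the lowest bin; which behavior is preferable depends on what the visualizations show.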

Problem 4 (10 points): Download the Spotify Dataset along with the description from D2L.
a) (5 points) Describe the data in terms of number of attributes, number of cases, class distribution. Is there
any correlation between features? Explain your answer.
b) (5 points) Report the ranges for each numerical variable. Would you recommend normalizing the data? If
yes, which approach would you apply? Justify your answer.
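
A minimal sketch of how the per-variable ranges in part (b) could be collected is shown below. The column names and values are made up as a stand-in for the course dataset, which is not reproduced here, and numeric_ranges is a hypothetical helper; with the real file, the in-memory sample would be replaced by an opened CSV file.

```python
import csv
import io

# Hypothetical stand-in for the dataset; column names and values are invented.
sample = io.StringIO(
    "tempo,loudness,genre\n"
    "120.0,-5.2,pop\n"
    "98.5,-11.0,rock\n"
    "174.0,-3.4,pop\n"
)

def numeric_ranges(rows):
    """Return {column: (min, max)} for columns whose values all parse as floats."""
    columns = {}
    for row in rows:
        for name, text in row.items():
            columns.setdefault(name, []).append(text)
    ranges = {}
    for name, texts in columns.items():
        try:
            numbers = [float(t) for t in texts]
        except ValueError:
            continue  # skip categorical columns
        ranges[name] = (min(numbers), max(numbers))
    return ranges

ranges = numeric_ranges(csv.DictReader(sample))
for name, (lo, hi) in ranges.items():
    print(f"{name}: min={lo}, max={hi}, span={hi - lo}")
```

Comparing the spans side by side shows whether some variables live on much larger scales than others, which is the usual motivation for normalizing.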

Submission Instructions

1. Answer the problems and write your answers in a Word document.
2. Submit your file online at http://d2l.depaul.edu and check your submission.
3. Keep a copy of all your submissions!
4. If you have questions about the homework, email me BEFORE the deadline.
5. Late submissions are allowed with a penalty of 5%, 10%, and 15% for submissions that are one, two, and three
days late, respectively.
6. No late work will be accepted more than three days after the due date.
