Assignment I

This document is an assignment for a Data Mining and Warehousing course at the University of Computer Studies, Yangon. It includes various questions related to data structures, SQL queries, classification metrics, data warehouse concepts, statistical analysis, and similarity measures. The assignment aims to evaluate students' understanding of data mining techniques and their application in real-world scenarios.

Uploaded by

myothuyasan80

Department of Advanced Science and Technology

University of Computer Studies, Yangon


Semester (VII)/ (XI) Assignment I
B.C.Sc.(SE/BIS/KE)
Data Mining and Warehousing (IS-211)
January, 2025
Answer the Questions.
1. Describe a sample data structure, and draw a diagram that shows how the
classroom relation of the University schema would be stored under a
column-oriented storage structure.
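
As a rough illustration of the idea behind question 1, the sketch below stores a small classroom relation column by column instead of row by row. The attribute names (building, room_number, capacity) follow the usual university schema; the sample tuples and the helper names to_columns and get_row are assumptions made for this example only.

```python
# Illustrative sketch: the sample tuples and helper names are assumptions.
classroom_rows = [
    ("Packard", "101", 500),
    ("Painter", "514", 10),
    ("Taylor", "3128", 70),
    ("Watson", "100", 30),
    ("Watson", "120", 50),
]
ATTRIBUTES = ("building", "room_number", "capacity")


def to_columns(rows, attributes):
    """Store each attribute's values contiguously, one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(attributes)}


def get_row(columns, attributes, position):
    """Rebuild one tuple by reading every column at the same offset."""
    return tuple(columns[name][position] for name in attributes)


columns = to_columns(classroom_rows, ATTRIBUTES)
for name in ATTRIBUTES:
    print(f"{name:12} | {columns[name]}")
```

Each column keeps its values in the same tuple order, so the i-th entries of all columns together form the i-th tuple. A query that touches only capacity reads a single list, while rebuilding a full tuple needs one lookup per column.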

2. Consider the takes relation. Write an SQL query that computes a cross-tab
that has a column for each of the years 2017 and 2018, and a column for all, and
one row for each course, as well as a row for all. Each cell in the table should
contain the number of students who took the corresponding course in the
corresponding year, with column all containing the aggregate across all years,
and row all containing the aggregate across all courses.
takes(ID, course_id, sec_id, semester, year, grade)
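
One possible shape for such a query is sketched below, using conditional aggregation and a UNION ALL for the "all" row, run against an in-memory SQLite database. The sample rows, the function name crosstab, and the output column names y2017, y2018, and all_years are assumptions for illustration. The sketch also assumes that "all" means the total over 2017 and 2018 only, and that each row of takes counts as one student enrollment.

```python
import sqlite3

CROSSTAB_SQL = """
SELECT course_id, y2017, y2018, all_years
FROM (
    SELECT course_id,
           SUM(CASE WHEN year = 2017 THEN 1 ELSE 0 END) AS y2017,
           SUM(CASE WHEN year = 2018 THEN 1 ELSE 0 END) AS y2018,
           COUNT(*) AS all_years
    FROM takes
    WHERE year IN (2017, 2018)
    GROUP BY course_id
    UNION ALL
    SELECT 'all',
           SUM(CASE WHEN year = 2017 THEN 1 ELSE 0 END),
           SUM(CASE WHEN year = 2018 THEN 1 ELSE 0 END),
           COUNT(*)
    FROM takes
    WHERE year IN (2017, 2018)
)
ORDER BY course_id = 'all', course_id
"""


def crosstab(rows):
    """Load takes rows into an in-memory database and run the cross-tab query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE takes (ID TEXT, course_id TEXT, sec_id TEXT, "
        "semester TEXT, year INTEGER, grade TEXT)"
    )
    conn.executemany("INSERT INTO takes VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(CROSSTAB_SQL).fetchall()
    conn.close()
    return result


# Hypothetical sample data, not taken from the assignment.
sample = [
    ("1", "CS-101", "1", "Fall", 2017, "A"),
    ("2", "CS-101", "1", "Fall", 2017, "B"),
    ("3", "CS-101", "1", "Fall", 2018, "A"),
    ("1", "CS-190", "1", "Spring", 2018, "A"),
    ("4", "BIO-101", "1", "Summer", 2016, "A"),
]
table = crosstab(sample)
for row in table:
    print(row)
```

Database systems that support ROLLUP, CUBE, or a PIVOT clause can express the same table more compactly; the form above sticks to constructs that most SQL dialects accept.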

3. Consider a classification problem where the classifier predicts whether a
person has a particular disease. Suppose that 95% of the people tested do not
suffer from the disease. Let pos denote the fraction of actual positives (people
who have the disease), which is 5% of the test cases, and let neg denote the
fraction of actual negatives, which is 95% of the test cases. Consider the
following classifiers:
• Classifier C1, which always predicts negative (a rather useless
classifier, of course).
• Classifier C2, which predicts positive in 80% of the cases where
the person actually has the disease but also predicts positive in 5%
of the cases where the person does not have the disease.
• Classifier C3, which predicts positive in 95% of the cases where
the person actually has the disease but also predicts positive in 20%
of the cases where the person does not have the disease.
For each classifier, let t_pos denote the true positive fraction, that
is, the fraction of cases where the classifier prediction was positive and
the person actually had the disease. Let f_pos denote the false positive
fraction, that is, the fraction of cases where the prediction was positive
but the person did not have the disease. Let t_neg denote the true negative
fraction and f_neg the false negative fraction, which are defined similarly
but for the cases where the classifier prediction was negative.

a. Compute the following metrics for each classifier:
i. Accuracy, defined as (t_pos + t_neg)/(pos + neg), that is, the fraction
of the time when the classifier gives the correct classification.
ii. Recall (also known as sensitivity), defined as t_pos/pos, that is, how
many of the actual positive cases are classified as positive.
iii. Precision, defined as t_pos/(t_pos + f_pos), that is, how often the
positive prediction is correct.
iv. Specificity, defined as t_neg/neg.

b. If you intend to use the results of classification to perform further
screening for the disease, how would you choose between the classifiers?

c. On the other hand, if you intend to use the result of classification to
start medication, where the medication could have harmful effects if
given to someone who does not have the disease, how would you choose
between the classifiers?
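
The four metric definitions in question 3 can be turned into a short calculation. The sketch below is one way to do it, assuming that each classifier is described by the fraction of diseased people it flags (true positive rate) and the fraction of healthy people it flags (false positive rate). The function names and the choice to report precision as None when no positive prediction is made are assumptions of this example.

```python
def confusion_fractions(pos, true_positive_rate, false_positive_rate):
    """Return (t_pos, f_pos, t_neg, f_neg) as fractions of all test cases."""
    neg = 1.0 - pos
    t_pos = true_positive_rate * pos
    f_neg = pos - t_pos
    f_pos = false_positive_rate * neg
    t_neg = neg - f_pos
    return t_pos, f_pos, t_neg, f_neg


def metrics(pos, true_positive_rate, false_positive_rate):
    """Compute accuracy, recall, precision, and specificity as defined above."""
    neg = 1.0 - pos
    t_pos, f_pos, t_neg, f_neg = confusion_fractions(
        pos, true_positive_rate, false_positive_rate
    )
    predicted_positive = t_pos + f_pos
    return {
        "accuracy": (t_pos + t_neg) / (pos + neg),
        "recall": t_pos / pos,
        # Precision is undefined (0/0) if the classifier never predicts positive.
        "precision": t_pos / predicted_positive if predicted_positive > 0 else None,
        "specificity": t_neg / neg,
    }


POS = 0.05  # fraction of test cases that actually have the disease
classifiers = {
    "C1": (0.00, 0.00),  # always predicts negative
    "C2": (0.80, 0.05),
    "C3": (0.95, 0.20),
}
results = {name: metrics(POS, tpr, fpr) for name, (tpr, fpr) in classifiers.items()}
for name, values in results.items():
    print(name, values)
```

Comparing the rows side by side makes the trade-off in parts b and c visible: recall matters most when a missed case is costly, while precision and specificity matter most when a false alarm is costly.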

4. How is a data warehouse different from a database?

5. Based on your understanding, describe why concept hierarchies are useful in
data mining.

6. Suppose that the data for analysis includes the attribute age. The age values
for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of
the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile-quantile plot different from a quantile plot?
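
The descriptive statistics asked for in question 6 can be checked with Python's standard statistics module, as in the sketch below. The quartiles use the module's default "exclusive" method; this is an assumption, since the question only asks for rough values and other quartile conventions can give slightly different results.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)
median = statistics.median(ages)
modes = statistics.multimode(ages)          # every value with the highest count
midrange = (min(ages) + max(ages)) / 2

# Quartile cut points; quantiles() defaults to method="exclusive".
q1, q2, q3 = statistics.quantiles(ages, n=4)

five_number_summary = (min(ages), q1, median, q3, max(ages))

print(f"mean={mean:.2f} median={median} modes={modes} midrange={midrange}")
print(f"five-number summary: {five_number_summary}")
```

The five-number summary is also what a boxplot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.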
7. Suppose that the values for a given set of data are grouped into intervals. The
intervals and corresponding frequencies are as follows.
Age      Frequency
0-10     250
11-25    450
26-30    600
31-40    400
41-62    200
63-70    100
Compute an approximate median value for the data.
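
For grouped data, a common approximation interpolates inside the interval that contains the median: median ≈ L1 + ((N/2 - cumulative frequency below the interval) / frequency of the interval) × width, where L1 is the lower boundary of that interval. The sketch below applies this formula. It assumes the integer age labels stand for continuous classes with boundaries half a unit outside each label (so 26-30 covers 25.5 to 30.5); a different boundary convention shifts the result slightly. The function name grouped_median is hypothetical.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data.

    intervals: list of (low, high, frequency), sorted by low, with
    integer-valued labels. Assumption: class boundaries sit 0.5 outside
    each label.
    """
    total = sum(freq for _, _, freq in intervals)
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    cumulative = 0  # frequency of all intervals below the current one
    for low, high, freq in intervals:
        if cumulative + freq >= half:
            lower_boundary = low - 0.5
            width = (high + 0.5) - lower_boundary
            return lower_boundary + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median interval not found")


age_groups = [
    (0, 10, 250),
    (11, 25, 450),
    (26, 30, 600),
    (31, 40, 400),
    (41, 62, 200),
    (63, 70, 100),
]
approx_median = grouped_median(age_groups)
print(approx_median)
```

Here N = 2000, so the median falls in the 26-30 interval (cumulative frequency 700 below it), and only that interval's boundary and width affect the result.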

8. Suppose that a hospital tested the age and body fat data for 18 randomly
selected adults with the following results:

(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
9. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using h = 3.
(d) Compute the supremum distance between the two objects.
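
All four distances in question 9 are members of the Minkowski family: h = 1 gives the Manhattan distance, h = 2 the Euclidean distance, and the limit as h grows gives the supremum distance. The sketch below computes them for the two tuples in the question; the function names are chosen for this example.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length tuples."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

euclidean = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
minkowski_3 = minkowski(x, y, 3)
sup = supremum(x, y)

print(f"Euclidean={euclidean:.4f} Manhattan={manhattan:.0f} "
      f"Minkowski(h=3)={minkowski_3:.4f} supremum={sup}")
```

The per-attribute differences are 2, 1, 6, and 2, so the supremum distance is 6 and the other values move toward it as h increases.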

10. It is important to define or select similarity measures in data analysis.
However, there is no commonly accepted subjective similarity measure. Results
can vary depending on the similarity measures used. Nonetheless, seemingly
different similarity measures may be equivalent after some transformation.
Suppose we have the following two-dimensional data set:
(a) Consider the data as two-dimensional data points. Given a new data point, x
= (1.4, 1.6) as a query, rank the database points based on similarity with the
query using Euclidean distance, Manhattan distance, supremum distance, and
cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use
Euclidean distance on the transformed data to rank the data points.
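
The ranking procedure in question 10 can be sketched as follows. The data set for the question is not reproduced here, so the three points p1, p2, and p3 below are hypothetical placeholders, not the assignment's data; only the query x = (1.4, 1.6) comes from the question. The function names are also assumptions of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(point):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.hypot(*point)
    return tuple(c / norm for c in point)


def rank(points, score, reverse=False):
    """Return point names ordered by score (ascending unless reverse=True)."""
    return sorted(points, key=lambda name: score(points[name]), reverse=reverse)


# Hypothetical placeholder points; substitute the assignment's data set.
points = {"p1": (1.0, 2.0), "p2": (2.0, 1.0), "p3": (1.5, 1.5)}
query = (1.4, 1.6)

by_euclidean = rank(points, lambda p: math.dist(p, query))
by_manhattan = rank(points, lambda p: sum(abs(a - b) for a, b in zip(p, query)))
by_supremum = rank(points, lambda p: max(abs(a - b) for a, b in zip(p, query)))
by_cosine = rank(points, lambda p: cosine_similarity(p, query), reverse=True)

# Part (b): Euclidean distance after scaling every vector to unit norm.
unit_query = normalize(query)
by_normalized_euclidean = rank(
    points, lambda p: math.dist(normalize(p), unit_query)
)

print(by_euclidean, by_manhattan, by_supremum, by_cosine, by_normalized_euclidean)
```

For unit vectors the squared Euclidean distance equals 2 - 2 × cosine similarity, so ranking by Euclidean distance on normalized data matches ranking by cosine similarity. This is one concrete case of two seemingly different measures being equivalent after a transformation.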

End of Assignment_1
***********************************************
