Assignment I

This document is an assignment for a Data Mining and Warehousing course at the University of Computer Studies, Yangon. It includes various questions related to data structures, SQL queries, classification metrics, data warehouse concepts, statistical analysis, and similarity measures. The assignment aims to evaluate students' understanding of data mining techniques and their application in real-world scenarios.

Uploaded by

myothuyasan80

Department of Advanced Science and Technology

University of Computer Studies, Yangon

Semester (VII)/ (XI) Assignment I
B.C.Sc.(SE/BIS/KE)
Data Mining and Warehousing (IS-211)
January, 2025
Answer the Questions.
1. Describe a sample data structure and draw a diagram that shows how the
classroom relation of the University schema would be stored under a
column-oriented storage structure.
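
As a rough illustration of the idea behind question 1, the sketch below contrasts a row-oriented layout with a column-oriented one. It assumes the classroom relation has the attributes (building, room_number, capacity), as in the usual university schema; the sample rows and the helper names to_columns and get_tuple are illustrative assumptions, not part of the assignment.

```python
# Row-oriented layout: one record per tuple (sample rows are assumed).
classroom_rows = [
    ("Packard", "101", 500),
    ("Painter", "514", 10),
    ("Taylor", "3128", 70),
    ("Watson", "100", 30),
    ("Watson", "120", 50),
]
ATTRIBUTES = ("building", "room_number", "capacity")


def to_columns(rows, attributes):
    """Store each attribute's values contiguously, one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(attributes)}


def get_tuple(columns, attributes, i):
    """Rebuild the i-th tuple by reading position i of every column."""
    return tuple(columns[name][i] for name in attributes)


classroom_columns = to_columns(classroom_rows, ATTRIBUTES)

for name, values in classroom_columns.items():
    print(f"{name:12} {values}")
```

In this layout the i-th entry of every column belongs to the same tuple, so a query that touches only capacity reads a single list, while reconstructing a whole tuple requires one lookup per column.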

2. Consider the takes relation. Write an SQL query that computes a cross-tab
that has a column for each of the years 2017 and 2018, and a column for all, and
one row for each course, as well as a row for all. Each cell in the table should
contain the number of students who took the corresponding course in the
corresponding year, with column all containing the aggregate across all years,
and row all containing the aggregate across all courses.
takes (ID, course_id, sec_id, semester, year, grade)
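
One possible shape for such a cross-tab query is sketched below, run against an in-memory SQLite database so that it can be tried directly. The sample rows are invented for illustration, the column names y2017, y2018 and all_years are assumptions, and each row of takes is counted as one student enrollment. Conditional aggregation builds the year columns and a UNION ALL appends the "all" row.

```python
import sqlite3

# Hypothetical sample rows for takes(ID, course_id, sec_id, semester, year, grade).
SAMPLE_ROWS = [
    ("00128", "CS-101", "1", "Fall", 2017, "A"),
    ("12345", "CS-101", "1", "Fall", 2017, "C"),
    ("45678", "CS-101", "1", "Spring", 2018, "B"),
    ("12345", "CS-190", "2", "Spring", 2017, "A"),
    ("54321", "CS-190", "2", "Spring", 2018, "B+"),
    ("76543", "CS-319", "1", "Spring", 2018, "A"),
]

CROSSTAB_SQL = """
SELECT course_id,
       SUM(CASE WHEN year = 2017 THEN 1 ELSE 0 END) AS y2017,
       SUM(CASE WHEN year = 2018 THEN 1 ELSE 0 END) AS y2018,
       COUNT(*)                                     AS all_years
FROM takes
GROUP BY course_id
UNION ALL
SELECT 'all',
       SUM(CASE WHEN year = 2017 THEN 1 ELSE 0 END),
       SUM(CASE WHEN year = 2018 THEN 1 ELSE 0 END),
       COUNT(*)
FROM takes
"""


def build_demo_db(rows):
    """Create an in-memory takes table and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE takes (ID TEXT, course_id TEXT, sec_id TEXT, "
        "semester TEXT, year INTEGER, grade TEXT)"
    )
    conn.executemany("INSERT INTO takes VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


conn = build_demo_db(SAMPLE_ROWS)
crosstab = conn.execute(CROSSTAB_SQL).fetchall()
for row in crosstab:
    print(row)
```

Database systems that support GROUP BY ROLLUP can produce the final "all" row without the UNION ALL; the version above avoids it only because SQLite does not provide ROLLUP.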

3. Consider a classification problem where the classifier predicts whether a
person has a particular disease. Suppose that 95% of the people tested do not
suffer from the disease. Let pos denote the fraction of actual positives (people
who have the disease), which is 5% of the test cases, and let neg denote the
fraction of actual negatives, which is 95% of the test cases. Consider the
following classifiers:
• Classifier C1, which always predicts negative (a rather useless
classifier, of course).
• Classifier C2, which predicts positive in 80% of the cases where
the person actually has the disease but also predicts positive in 5%
of the cases where the person does not have the disease.
• Classifier C3, which predicts positive in 95% of the cases where
the person actually has the disease but also predicts positive in 20%
of the cases where the person does not have the disease.
For each classifier, let t_pos denote the true positive fraction, that
is the fraction of cases where the classifier prediction was positive, and
the person actually had the disease. Let f_pos denote the false positive
fraction, that is the fraction of cases where the prediction was positive,
but the person did not have the disease. Let t_neg denote true negative
and f_neg denote false negative fractions, which are defined similarly, but
for the cases where the classifier prediction was negative.

a. Compute the following metrics for each classifier:
i. Accuracy, defined as (t_pos + t_neg)/(pos + neg), that is, the fraction
of the time when the classifier gives the correct classification.
ii. Recall (also known as sensitivity), defined as t_pos/pos, that is, how
many of the actual positive cases are classified as positive.
iii. Precision, defined as t_pos/(t_pos + f_pos), that is, how often the
positive prediction is correct.
iv. Specificity, defined as t_neg/neg.
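
The four definitions above translate directly into code. The sketch below is one way to organise the arithmetic, assuming each classifier is described by the rate at which it predicts positive for diseased and for healthy people; the function names and the choice to report precision as None when no positive prediction is made are assumptions of this sketch.

```python
POS, NEG = 0.05, 0.95  # fractions of actual positives and actual negatives


def confusion_fractions(tp_rate, fp_rate, pos=POS, neg=NEG):
    """Turn per-class positive-prediction rates into fractions of all test cases."""
    t_pos = tp_rate * pos
    f_pos = fp_rate * neg
    return {"t_pos": t_pos, "f_pos": f_pos,
            "t_neg": neg - f_pos, "f_neg": pos - t_pos}


def metrics(c, pos=POS, neg=NEG):
    """Apply the definitions of accuracy, recall, precision and specificity."""
    predicted_pos = c["t_pos"] + c["f_pos"]
    return {
        "accuracy": (c["t_pos"] + c["t_neg"]) / (pos + neg),
        "recall": c["t_pos"] / pos,
        # Precision is undefined (0/0) when nothing is predicted positive.
        "precision": c["t_pos"] / predicted_pos if predicted_pos > 0 else None,
        "specificity": c["t_neg"] / neg,
    }


# (rate of positive predictions among the diseased, among the healthy)
classifiers = {"C1": (0.0, 0.0), "C2": (0.80, 0.05), "C3": (0.95, 0.20)}
results = {name: metrics(confusion_fractions(*rates))
           for name, rates in classifiers.items()}

for name, m in results.items():
    print(name, m)
```

Note that C1 reaches 95% accuracy despite never detecting the disease, which is why accuracy alone is a poor guide when the classes are this imbalanced.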

b. If you intend to use the results of classification to perform further
screening for the disease, how would you choose between the classifiers?

c. On the other hand, if you intend to use the result of classification to
start medication, where the medication could have harmful effects if
given to someone who does not have the disease, how would you choose
between the classifiers?

4. How is a data warehouse different from a database?

5. Based on your understanding, describe why concept hierarchies are useful in
data mining.

6. Suppose that the data for analysis includes the attribute age. The age values
for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of
the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile-quantile plot different from a quantile plot?
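
The descriptive statistics asked for in parts (a) to (e) can be checked with Python's standard statistics module (Python 3.8 or later). This is only a sketch: quartile conventions differ between textbooks and tools, and the "inclusive" interpolation method used here is one choice among several, so Q1 may differ slightly from a hand-computed value.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)              # arithmetic mean
median = statistics.median(ages)          # middle value of the 27 sorted values
modes = statistics.multimode(ages)        # every value with the highest count
midrange = (min(ages) + max(ages)) / 2    # average of the extremes

# Quartiles by linear interpolation; other conventions give nearby values.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")

five_number = (min(ages), q1, median, q3, max(ages))

print(f"mean={mean:.2f} median={median} modes={modes} midrange={midrange}")
print("five-number summary:", five_number)
```

Two modes are reported, so the data is bimodal. The five-number summary is also what a boxplot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers extend towards the minimum and maximum.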
7. Suppose that the values for a given set of data are grouped into intervals. The
intervals and corresponding frequencies are as follows.
Age      Frequency
0-10     250
11-25    450
26-30    600
31-40    400
41-62    200
63-70    100
Compute an approximate median value for the data.
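
A common interpolation formula for the median of grouped data is median ≈ L1 + ((N/2 − (Σ freq)_l) / freq_median) × width, where L1 is the lower boundary of the median interval, N the total count, (Σ freq)_l the sum of the frequencies of all lower intervals, and width the width of the median interval. The sketch below applies it to the table above. It assumes the printed bounds (for example 26 and 30) are used directly as the interval boundaries; the function name approximate_median is hypothetical.

```python
# (lower bound, upper bound, frequency) for each age interval in the table.
intervals = [
    (0, 10, 250),
    (11, 25, 450),
    (26, 30, 600),
    (31, 40, 400),
    (41, 62, 200),
    (63, 70, 100),
]


def approximate_median(groups):
    """Interpolate the median inside the interval that contains the N/2-th value."""
    total = sum(freq for _, _, freq in groups)
    half = total / 2
    cumulative = 0  # sum of frequencies of all intervals below the current one
    for lower, upper, freq in groups:
        if cumulative + freq >= half:
            width = upper - lower
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no intervals given")


median_estimate = approximate_median(intervals)
print(median_estimate)
```

Here N = 2000, so the median falls in the 26-30 interval, where the cumulative frequency first reaches 1000.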

8. Suppose that a hospital tested the age and body fat data for 18 randomly
selected adults with the following results:

(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
9. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using h = 3.
(d) Compute the supremum distance between the two objects.
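
All four distances in question 9 are members of the Minkowski family: h = 1 gives the Manhattan distance, h = 2 the Euclidean distance, and the limit h → ∞ the supremum distance. A minimal sketch, with function names chosen for illustration:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length tuples."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

euclidean = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
minkowski_3 = minkowski(x, y, 3)
sup = supremum(x, y)

print(f"Euclidean={euclidean:.4f} Manhattan={manhattan:.1f} "
      f"Minkowski(h=3)={minkowski_3:.4f} supremum={sup}")
```

The per-attribute differences are 2, 1, 6 and 2, so the Euclidean distance is the square root of 45, the Minkowski distance with h = 3 is the cube root of 233, and the supremum distance is the largest difference.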

10. It is important to define or select similarity measures in data analysis.
However, there is no commonly accepted subjective similarity measure. Results
can vary depending on the similarity measures used.
Nonetheless, seemingly different similarity measures may be equivalent after
some transformation.
Suppose we have the following two-dimensional data set:
(a) Consider the data as two-dimensional data points. Given a new data point, x
= (1.4, 1.6) as a query, rank the database points based on similarity with the
query using Euclidean distance, Manhattan distance, supremum distance, and
cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use
Euclidean distance on the transformed data to rank the data points.
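
The ranking procedure in parts (a) and (b) can be sketched as follows. The data set for this question is not reproduced in this copy, so the points below are hypothetical placeholders chosen only to make the code runnable; they should be replaced with the actual data. The helper names are likewise assumptions. The sketch requires Python 3.8 or later for math.dist.

```python
import math

# Hypothetical placeholder points, NOT the assignment's data set.
points = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.2), (0.5, 3.0)]
query = (1.4, 1.6)


def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def supremum(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


def cosine(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(p):
    """Scale a point so that its Euclidean norm equals 1."""
    norm = math.hypot(*p)
    return tuple(c / norm for c in p)


def rank(pts, q, score, descending=False):
    """Return point indices ordered from most to least similar to q."""
    return sorted(range(len(pts)), key=lambda i: score(pts[i], q),
                  reverse=descending)


by_euclidean = rank(points, query, math.dist)
by_manhattan = rank(points, query, manhattan)
by_supremum = rank(points, query, supremum)
by_cosine = rank(points, query, cosine, descending=True)  # larger is closer

# Part (b): Euclidean distance after normalising every point and the query.
unit_points = [normalize(p) for p in points]
by_unit_euclidean = rank(unit_points, normalize(query), math.dist)

print(by_euclidean, by_manhattan, by_supremum, by_cosine, by_unit_euclidean)
```

For unit-length vectors the squared Euclidean distance equals 2 − 2·cos θ, so the ranking in part (b) matches the cosine-similarity ranking. This is an instance of two seemingly different measures becoming equivalent after a transformation.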

End of Assignment_1
***********************************************
